# Chapter 8 - Applying Machine Learning To Sentiment Analysis

## Preparing the IMDb movie review data for text processing

### Obtaining the movie review dataset
The IMDb movie review dataset can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/. After downloading the dataset, decompress the files.

In [1]:
import os
import sys
import tarfile
import time
import urllib.request

source = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
target = 'aclImdb_v1.tar.gz'

if os.path.exists(target):
    os.remove(target)

def reporthook(count, block_size, total_size):
    global start_time
    if count == 0:
        start_time = time.time()
        return
    duration = time.time() - start_time
    progress_size = int(count * block_size)
    speed = progress_size / (1024.**2 * duration)
    percent = count * block_size * 100. / total_size

    sys.stdout.write(f'\r{int(percent)}% | {progress_size / (1024.**2):.2f} MB '
                     f'| {speed:.2f} MB/s | {duration:.2f} sec elapsed')
    sys.stdout.flush()


if not os.path.isdir('aclImdb') and not os.path.isfile('aclImdb_v1.tar.gz'):
    urllib.request.urlretrieve(source, target, reporthook)

100% | 80.23 MB | 1.67 MB/s | 47.93 sec elapsed

In [2]:
if not os.path.isdir('aclImdb'):

    with tarfile.open(target, 'r:gz') as tar:
        tar.extractall()

### Preprocessing the movie dataset into a more convenient format
Having successfully extracted the dataset, we will now assemble the individual text documents from
the decompressed download archive into a single CSV file. In the following code section, we will be
reading the movie reviews into a pandas DataFrame object.

To visualize the progress and estimated time until completion, we will use the Python Progress Indicator (PyPrind) package.

In [3]:
!pip install pyprind

Collecting pyprind
  Downloading PyPrind-2.11.3-py2.py3-none-any.whl.metadata (1.1 kB)
Downloading PyPrind-2.11.3-py2.py3-none-any.whl (8.4 kB)
Installing collected packages: pyprind
Successfully installed pyprind-2.11.3


In [4]:
import pyprind
import pandas as pd
import os
import sys
from packaging import version


# change the `basepath` to the directory of the
# unzipped movie dataset

basepath = '/content/aclImdb'

labels = {'pos': 1, 'neg': 0}

# if the progress bar does not show, change stream=sys.stdout to stream=2
pbar = pyprind.ProgBar(50000, stream=2)

df = pd.DataFrame()
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file),
                      'r', encoding='utf-8') as infile:
                txt = infile.read()

            if version.parse(pd.__version__) >= version.parse("1.3.2"):
                x = pd.DataFrame([[txt, labels[l]]], columns=['review', 'sentiment'])
                df = pd.concat([df, x], ignore_index=True)

            else:
                df = df.append([[txt, labels[l]]],
                               ignore_index=True)
            pbar.update()
df.columns = ['review', 'sentiment']

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:01:07


In [5]:
#Shuffling
import numpy as np


if version.parse(pd.__version__) >= version.parse("1.3.2"):
    df = df.sample(frac=1, random_state=0).reset_index(drop=True)

else:
    np.random.seed(0)
    df = df.reindex(np.random.permutation(df.index))

In the preceding code, we first initialized a new progress bar object, pbar, with 50,000 iterations, which
was the number of documents we were going to read in. Using the nested for loops, we iterated over
the train and test subdirectories in the main aclImdb directory and read the individual text files
from the pos and neg subdirectories that we eventually appended to the df pandas DataFrame, together
with an integer class label (1 = positive and 0 = negative).

Since the class labels in the assembled dataset are sorted, we will now shuffle the DataFrame using the
permutation function from the np.random submodule—this will be useful for splitting the dataset into
training and test datasets in later sections, when we will stream the data from our local drive directly.


For our own convenience, we will also store the assembled and shuffled movie review dataset as a
CSV file:

In [6]:
import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

Since we are going to use this dataset later in this chapter, let’s quickly confirm that we have successfully
saved the data in the right format by reading in the CSV and printing an excerpt of the first three
examples:


In [7]:
df = pd.read_csv('movie_data.csv',encoding='utf-8')
df=df.rename(columns={'0' : 'review', '1' : 'sentiment'})
df.head(3)

| | review | sentiment |
|---|---|---|
| 0 | Election is a Chinese mob movie, or triads in ... | 1 |
| 1 | I was just watching a Forensic Files marathon ... | 0 |
| 2 | Police Story is a stunning series of set piece... | 1 |


In [8]:
df.shape

(50000, 2)

## Introducing the bag-of-words model

### Transforming words into feature vectors
CountVectorizer takes an array of text data, which can be documents or sentences, and constructs
the bag-of-words model for us:

In [9]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

By calling the fit_transform method on CountVectorizer, we constructed the vocabulary of the
bag-of-words model and transformed the following three sentences into sparse feature vectors:

• 'The sun is shining'

• 'The weather is sweet'

• 'The sun is shining, the weather is sweet, and one and one is two'

Now, let’s print the contents of the vocabulary to get a better understanding of the underlying concepts:

In [10]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


As you can see from executing the preceding command, the vocabulary is stored in a Python dictionary
that maps the unique words to integer indices. Next, let’s print the feature vectors that we just created:

In [11]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
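To make the link between the vocabulary and the count vectors concrete, the following is a minimal sketch that rebuilds the same matrix without scikit-learn. It assumes a tokenization rule similar to the CountVectorizer default (lowercasing, tokens of at least two word characters) and an alphabetically sorted vocabulary; the helper names `tokenize` and `to_vector` are made up for this illustration.

```python
import re
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']


def tokenize(doc):
    # lowercase the text and keep tokens of two or more word characters
    return re.findall(r'\b\w\w+\b', doc.lower())


# map each unique token to a column index, in alphabetical order
unique_tokens = sorted({token for doc in docs for token in tokenize(doc)})
vocabulary = {word: idx for idx, word in enumerate(unique_tokens)}


def to_vector(doc):
    # count how often each vocabulary word occurs in the document
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]


bag = [to_vector(doc) for doc in docs]
for row in bag:
    print(row)
```

Each column position corresponds to the integer index stored in the vocabulary, which is why the word 'is' (index 1) shows a count of 3 in the third row.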


### Assessing word relevancy via term frequency-inverse document frequency (TF-IDF)

**Term frequency (TF)**

Measures how often a given word occurs in a given document. The assumption is that the more often a word appears in a document, the more important it is to that document.

**Inverse document frequency (IDF)**

Measures how important a word is across the whole collection of documents. Words that appear in many documents are considered less informative (common), while words that appear in fewer documents are considered more informative (rare). The purpose of IDF is to reduce the weight of words that occur too frequently across the collection.

The scikit-learn library implements yet another transformer, the TfidfTransformer class, which takes
the raw term frequencies from the CountVectorizer class as input and transforms them into tf-idfs:

In [12]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=True,
                         norm='l2',
                         smooth_idf=True)

np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs))
      .toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


As you saw in the previous subsection, the word 'is' had the largest term frequency in the third
document, being the most frequently occurring word. However, after transforming the same feature
vector into tf-idfs, the word 'is' is now associated with a relatively small tf-idf (0.45) in the third
document, since it is also present in the first and second document and thus is unlikely to contain
any useful discriminatory information.
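The values printed above can be reproduced by hand. The sketch below assumes the smoothed formula that scikit-learn documents for `smooth_idf=True`, idf = ln((1 + n_docs) / (1 + df)) + 1, followed by L2 normalization of each row; the count matrix is copied from the bag-of-words output, and the helper name `tfidf_row` is made up for this illustration.

```python
import math

# raw term counts from the bag-of-words example (rows = documents)
counts = [[0, 1, 0, 1, 1, 0, 1, 0, 0],
          [0, 1, 0, 0, 0, 1, 1, 0, 1],
          [2, 3, 2, 1, 1, 1, 2, 1, 1]]

n_docs = len(counts)
n_terms = len(counts[0])

# document frequency: in how many documents does each term occur?
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def tfidf_row(row):
    # weight the raw counts by idf, then scale the row to unit L2 length
    raw = [tf * weight for tf, weight in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


tfidf_matrix = [tfidf_row(row) for row in counts]
for row in tfidf_matrix:
    print([round(value, 2) for value in row])
```

The word 'is' occurs in all three documents, so its idf is ln(4/4) + 1 = 1, the smallest possible weight; after normalization its raw count of 3 in the third document shrinks to about 0.45.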

### Cleaning text data

Let’s display the last 50 characters from one document (index 16) in the
reshuffled movie review dataset:

In [13]:
df.loc[16,'review'][-50:]

' a marvel to look at and never stops for a second.'

This particular excerpt contains only punctuation, but the reviews in the dataset also contain HTML
markup (for example, <br /> tags) and other non-letter characters. While HTML markup does not contain
many useful semantics, punctuation marks can represent useful, additional information in certain NLP
contexts. However, for simplicity, we will now remove all punctuation marks except for emoticon
characters, such as :), since those are certainly useful for sentiment analysis.

To accomplish this task, we will use Python’s regular expression (regex) library, re, as shown here:

In [14]:
import re
def preprocessor(text):
  # strip HTML markup
  text = re.sub(r'<[^>]*>', '', text)
  # collect emoticons such as :), ;-( or =D before punctuation is removed
  emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)

  # remove non-word characters, lowercase the text, and append the
  # emoticons (without the '-' nose) at the end
  text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))
  return text


Via the first regex, <[^>]*>, in the preceding code section, we tried to remove all of the HTML markup
from the movie reviews. After we removed the HTML markup, we used a slightly
more complex regex to find emoticons, which we temporarily stored as emoticons. Next, we removed all
non-word characters from the text via the regex [\W]+ and converted the text into lowercase characters.

`(?: ...)` creates a **non-capturing group** in a regular expression.

 **What does that mean?**  
Normally, parentheses `()` in a regular expression create **capturing groups**, which means the matched text can be retrieved later, for example via `re.match().group(1)`. Adding `?:` right after the opening parenthesis makes the group **non-capturing**, so it cannot be referenced later.
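The difference matters for `re.findall`, which returns the captured groups instead of the whole match as soon as the pattern contains capturing groups. The following small sketch compares both variants of the emoticon pattern on a made-up test string:

```python
import re

text = "This :) is :( a test :-)!"

# capturing groups: findall returns one tuple of groups per match
capturing = re.findall(r'(:|;|=)(-)?(\)|\(|D|P)', text)

# non-capturing groups: findall returns the full matched strings
non_capturing = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)

print(capturing)      # [(':', '', ')'), (':', '', '('), (':', '-', ')')]
print(non_capturing)  # [':)', ':(', ':-)']
```

With non-capturing groups, the emoticons come back as ready-to-use strings, which is what the preprocessor needs when it joins them back onto the cleaned text.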



In [15]:
preprocessor(df.loc[16,'review'][-50:])

' a marvel to look at and never stops for a second '

In [16]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

In [17]:
# apply the preprocessor to all movie reviews
df['review'] = df['review'].apply(preprocessor)

### Processing documents into tokens

After successfully preparing the movie review dataset, we now need to think about how to split the
text corpora into individual elements. One way to tokenize documents is to split them into individual
words by splitting the cleaned documents at their whitespace characters:

In [18]:
def tokenizer(text):
  return text.split()
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In the context of tokenization, another useful technique is word stemming, which is the process of
transforming a word into its root form. It allows us to map related words to the same stem. The Natural Language Toolkit for Python implements the Porter stemming algorithm, which we will use in the following
code section.

In [19]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
  return [porter.stem(word) for word in text.split()]

tokenizer_porter('my words is running and it has no sense at all sleeping')

['my',
 'word',
 'is',
 'run',
 'and',
 'it',
 'ha',
 'no',
 'sens',
 'at',
 'all',
 'sleep']

Before we jump into the next section, where we will train a machine learning model using the
bag-of-words model, let’s briefly talk about another useful topic called stop word removal.

Stop words are simply those words that are extremely common in all sorts of texts and probably bear
no (or only a little) useful information that can be used to distinguish between different classes of
documents. Examples of stop words are is, and, has, and like.

Removing stop words can be useful if we are working
with raw or normalized term frequencies rather than tf-idfs, which already downweight the frequently
occurring words.


To remove stop words from the movie reviews, we will use the set of English stop words that is
available from the NLTK library (127 words in the release the original text refers to; the exact count
depends on the NLTK version), which can be obtained by calling the nltk.download function:

In [20]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [21]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

[w for w in tokenizer_porter('a runner likes running and runs a lot')
 if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

## Training a logistic regression model for document classification

In [22]:
# 25,000 for training and 25,000 for testing
# .loc slicing includes the end label, so stop at 24999 to keep
# row 25000 out of the training set
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values
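When splitting a DataFrame with a default integer index, it helps to keep in mind that `.loc` is label-based and includes the end label, whereas `.iloc` is position-based and excludes the end position. The following sketch uses a made-up six-row DataFrame named `toy` to show the difference:

```python
import pandas as pd

toy = pd.DataFrame({'review': list('abcdef')})

# label-based slice: labels 0, 1, 2 and 3 are all included
by_label = toy.loc[:3, 'review'].values

# position-based slice: positions 0, 1 and 2 only
by_position = toy.iloc[:3]['review'].values

# a second label-based slice that starts at label 3
rest = toy.loc[3:, 'review'].values

print(len(by_label), len(by_position), len(rest))  # 4 3 3
```

Here `by_label` and `rest` both contain row 3, so two `.loc` slices that share a boundary label overlap by one row, while `by_position` and `rest` do not.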

Next, we will use a `GridSearchCV` object to find the optimal set of parameters for our logistic regression
model using 5-fold stratified cross-validation:

In [23]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

"""
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]
"""

small_param_grid = [{'vect__ngram_range': [(1, 1)],
                     'vect__stop_words': [None],
                     'vect__tokenizer': [tokenizer, tokenizer_porter],
                     'clf__penalty': ['l2'],
                     'clf__C': [1.0, 10.0]},
                    {'vect__ngram_range': [(1, 1)],
                     'vect__stop_words': [stop, None],
                     'vect__tokenizer': [tokenizer],
                     'vect__use_idf':[False],
                     'vect__norm':[None],
                     'clf__penalty': ['l2'],
                     'clf__C': [1.0, 10.0]},
                    ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, small_param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

In [24]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
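The numbers in this log line follow directly from the parameter grid: each dictionary contributes the Cartesian product of its value lists, and every candidate is fitted once per fold. The sketch below recounts them with `itertools.product`; the tokenizer functions and the stop word list are replaced by placeholder strings, since only the number of options matters here.

```python
from itertools import product

# same structure as small_param_grid, with placeholder strings
# standing in for the tokenizer functions and the stop word list
grid_sketch = [
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': [None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'clf__penalty': ['l2'],
     'clf__C': [1.0, 10.0]},
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer'],
     'vect__use_idf': [False],
     'vect__norm': [None],
     'clf__penalty': ['l2'],
     'clf__C': [1.0, 10.0]},
]

# expand every dictionary into its individual parameter combinations
candidates = [dict(zip(grid, combo))
              for grid in grid_sketch
              for combo in product(*grid.values())]

n_folds = 5
print(len(candidates), len(candidates) * n_folds)  # 8 40
```

Each of the two dictionaries expands to four combinations, which gives the 8 candidates and 40 fits reported above.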




In [25]:
print(f'Best parameter set: {gs_lr_tfidf.best_params_}')

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x7fb42928a200>}


Using the best model from this grid search, let’s print the average 5-fold cross-validation accuracy
scores on the training dataset and the classification accuracy on the test dataset:

In [26]:
print(f'CV Accuracy: {gs_lr_tfidf.best_score_:.3f}')
clf = gs_lr_tfidf.best_estimator_
print(f'Test Accuracy: {clf.score(X_test, y_test)}')

CV Accuracy: 0.899
Test Accuracy: 0.89608


## Working with bigger data - online algorithms and out-of-core learning
Since not everyone has access to supercomputer facilities, we will now apply a technique called
out-of-core learning, which allows us to work with large datasets by fitting the classifier incrementally
on smaller batches of a dataset.

First, we will define a tokenizer function that cleans the unprocessed text data from the
movie_data.csv file that we constructed at the beginning of this chapter and separates it into word
tokens while removing stop words:

In [27]:
import numpy as np
import re
from nltk.corpus import stopwords


# The `stop` is defined as earlier in this chapter
# Added it here for convenience, so that this section
# can be run as standalone without executing prior code
# in the directory
stop = stopwords.words('english')


def tokenizer(text):
    # raw strings avoid invalid-escape warnings for patterns like \) and \W
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized




Next, we will define a generator function, stream_docs, that reads in and returns one document at
a time:

In [28]:
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label
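The slicing in `stream_docs` relies on every line ending with a comma, a single-digit label, and a newline character. The following sketch applies the same slicing to a made-up CSV line to show which characters end up where:

```python
# hypothetical line in the same layout as movie_data.csv
line = '"A great film, truly",1\n'

# line[-1] is the newline, line[-2] the label digit, line[-3] the comma
text, label = line[:-3], int(line[-2])

print(repr(text))  # '"A great film, truly"'
print(label)       # 1
```

Note that the surrounding double quotes written by the CSV export remain part of the text. This is harmless for the pipeline in this chapter, because the tokenizer removes non-word characters anyway.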

In [29]:
next(stream_docs(path='movie_data.csv'))

('"Election is a Chinese mob movie, or triads in this case. Every two years an election is held to decide on a new leader, and at first it seems a toss up between Big D (Tony Leung Ka Fai, or as I know him, ""The Other Tony Leung"") and Lok (Simon Yam, who was Judge in Full Contact!). Though once Lok wins, Big D refuses to accept the choice and goes to whatever lengths he can to secure recognition as the new leader. Unlike any other Asian film I watch featuring gangsters, this one is not an action movie. It has its bloody moments, when necessary, as in Goodfellas, but it\'s basically just a really effective drama. There are a lot of characters, which is really hard to keep track of, but I think that plays into the craziness of it all a bit. A 100-year-old baton, which is the symbol of power I mentioned before, changes hands several times before things settle down. And though it may appear that the film ends at the 65 or 70-minute mark, there are still a couple big surprises waiting. Si

We will now define a function, get_minibatch, that will take a document stream from the stream_docs
function and return a particular number of documents specified by the size parameter:

In [30]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
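To see how `get_minibatch` behaves when the stream runs dry, here is a small sketch that feeds it a made-up generator, `toy_stream`, with only five documents. The function body is repeated so that the example is self-contained.

```python
def toy_stream(n_docs):
    # hypothetical stand-in for stream_docs: yields (text, label) pairs
    for i in range(n_docs):
        yield f'document {i}', i % 2


def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y


stream = toy_stream(5)
first = get_minibatch(stream, size=2)
second = get_minibatch(stream, size=2)
third = get_minibatch(stream, size=2)

print(first)   # (['document 0', 'document 1'], [0, 1])
print(second)  # (['document 2', 'document 3'], [0, 1])
print(third)   # (None, None)
```

The third call finds only one remaining document, so the incomplete batch is discarded and `(None, None)` is returned. With 50,000 reviews and batch sizes that divide this number evenly, no documents are lost in this chapter.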

Unfortunately, we can’t use CountVectorizer for out-of-core learning since it requires holding the
complete vocabulary in memory. Also, TfidfVectorizer needs to keep all the feature vectors of the
training dataset in memory to calculate the inverse document frequencies. However, another useful
vectorizer for text processing implemented in scikit-learn is HashingVectorizer. HashingVectorizer
is data-independent and makes use of the hashing trick.
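The idea behind the hashing trick can be sketched in a few lines: instead of looking a token up in a stored vocabulary, we hash it and use the result modulo the number of features as its column index. The sketch below is only an illustration under simplifying assumptions. It uses MD5 from the standard library as a stand-in hash function (scikit-learn's HashingVectorizer uses a signed MurmurHash3 instead), and the helper names `hashed_index` and `hash_vectorize` are made up.

```python
import hashlib


def hashed_index(token, n_features):
    # deterministic hash of the token, folded into the column range
    digest = hashlib.md5(token.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_features


def hash_vectorize(tokens, n_features):
    # sparse representation: column index -> count
    vector = {}
    for token in tokens:
        idx = hashed_index(token, n_features)
        vector[idx] = vector.get(idx, 0) + 1
    return vector


n_features = 2**21
vector = hash_vectorize(['great', 'movie', 'great'], n_features)
print(vector)
```

Because the column index depends only on the token itself, no vocabulary has to be built or kept in memory, and new documents can be vectorized without a fitting step. The price is that two different tokens can land in the same column (a hash collision), and that the column indices can no longer be mapped back to words.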

In [34]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)

clf = SGDClassifier(loss='log_loss', random_state=1)
doc_stream = stream_docs(path='movie_data.csv')

Using the preceding code, we initialized HashingVectorizer with our tokenizer function and set the
number of features to 2**21. Furthermore, we reinitialized a logistic regression classifier by setting
the loss parameter of SGDClassifier to 'log_loss'. Note that by choosing a large number of features in
HashingVectorizer, we reduce the chance of causing hash collisions, but we also increase the number
of coefficients in our logistic regression model.

Now comes the really interesting part—having set up all the complementary functions, we can start
the out-of-core learning using the following code:

In [35]:
import pyprind
pbar = pyprind.ProgBar(45)
classes = np.array([0,1])
for _ in range(45):
  X_train, y_train = get_minibatch(doc_stream, size=1000)
  if not X_train:
    break
  X_train = vect.transform(X_train)
  clf.partial_fit(X_train, y_train, classes=classes)
  pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:45


Again, we made use of the PyPrind package to estimate the progress of our learning algorithm. We
initialized the progress bar object with 45 iterations and, in the following for loop, we iterated over
45 mini-batches of documents where each mini-batch consists of 1,000 documents. Having completed
the incremental learning process, we will use the last 5,000 documents to evaluate the performance
of our model:

In [36]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print(f'Accuracy: {clf.score(X_test, y_test):.3f}')

Accuracy: 0.866


As you can see, the accuracy of the model is approximately 87 percent, slightly below the accuracy
that we achieved in the previous section using the grid search for hyperparameter tuning. However,
out-of-core learning is very memory efficient, and it took less than a minute to complete.

In other words, it is much faster.

Finally, we can use the last 5,000 documents to update our model:

In [37]:
clf = clf.partial_fit(X_test, y_test)
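The incremental updates behind `partial_fit` can be illustrated with a hand-written sketch. The class below is not scikit-learn's implementation; it is a minimal logistic regression trained by plain stochastic gradient descent, with a made-up class name, a fixed learning rate, and a tiny one-dimensional toy dataset, all chosen purely for illustration.

```python
import math


class OnlineLogisticRegression:
    """Hand-written stand-in for an incrementally trained classifier."""

    def __init__(self, n_features, learning_rate=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # one gradient step per example; the weights carry over between calls
        for x, target in zip(X, y):
            error = self.predict_proba(x) - target
            self.w = [wi - self.lr * error * xi
                      for wi, xi in zip(self.w, x)]
            self.b -= self.lr * error
        return self

    def predict(self, x):
        return int(self.predict_proba(x) >= 0.5)


# toy data: negative inputs belong to class 0, positive inputs to class 1
X = [[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]]
y = [0, 0, 0, 1, 1, 1]

model = OnlineLogisticRegression(n_features=1)
batch_size = 2
for start in range(0, len(X), batch_size):
    # only one mini-batch is needed in memory at a time
    model.partial_fit(X[start:start + batch_size], y[start:start + batch_size])

print(model.predict([2.0]), model.predict([-2.0]))  # 1 0
```

The key property is that each call only touches the current mini-batch and refines the weights learned so far, which is why the full dataset never has to fit in memory.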

## Topic modeling with latent Dirichlet allocation

Latent Dirichlet allocation (LDA) is a popular topic modeling method that automatically discovers hidden topics in a collection of text documents. It is a probabilistic generative model that assumes each document consists of a mixture of topics, and each topic is a set of related words.

### Decomposing text documents with LDA

LDA is a generative probabilistic model that tries to find groups of words that appear frequently together
across different documents. These frequently appearing words represent our topics, assuming
that each document is a mixture of different words. The input to an LDA is the bag-of-words model
that we discussed earlier in this chapter.

Given a bag-of-words matrix as input, LDA decomposes it into two new matrices:

• A document-to-topic matrix

• A word-to-topic matrix

LDA decomposes the bag-of-words matrix in such a way that if we multiply those two matrices together,
we will be able to reproduce the input, the bag-of-words matrix, with the lowest possible error.
In practice, we are interested in those topics that LDA found in the bag-of-words matrix. The only
downside may be that we must define the number of topics beforehand—the number of topics is a
hyperparameter of LDA that has to be specified manually.
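The shapes involved in this decomposition can be checked with a small NumPy sketch. The two matrices below are filled with random Dirichlet samples rather than fitted by LDA, and the sizes (4 documents, 2 topics, 6 words) are arbitrary assumptions; the point is only to show how the two factors combine into a document-by-word matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_topics, n_words = 4, 2, 6

# each row is a document's mixture over topics (rows sum to 1)
doc_topic = rng.dirichlet(np.ones(n_topics), size=n_docs)

# each row is a topic's distribution over words (rows sum to 1)
topic_word = rng.dirichlet(np.ones(n_words), size=n_topics)

# multiplying the factors gives expected word proportions per document
doc_word = doc_topic @ topic_word

print(doc_topic.shape, topic_word.shape, doc_word.shape)
# (4, 2) (2, 6) (4, 6)
```

The inner dimension of the product is the number of topics, which is why it has to be fixed before fitting.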

### LDA with scikit-learn

In this subsection, we will use the LatentDirichletAllocation class implemented in scikit-learn to
decompose the movie review dataset and categorize it into different topics. In the following example,
we will restrict the analysis to 10 different topics.

First, we are going to load the dataset into a pandas DataFrame using the local movie_data.csv file of
the movie reviews that we created at the beginning of this chapter:

In [38]:
import pandas as pd
df = pd.read_csv('movie_data.csv', encoding='utf-8')

df = df.rename(columns={'0' : 'review', '1' : 'sentiment'})
df.head(3)

| | review | sentiment |
|---|---|---|
| 0 | Election is a Chinese mob movie, or triads in ... | 1 |
| 1 | I was just watching a Forensic Files marathon ... | 0 |
| 2 | Police Story is a stunning series of set piece... | 1 |


Next, we are going to use the already familiar CountVectorizer to create the bag-of-words matrix as
input to the LDA.
For convenience, we will use scikit-learn’s built-in English stop word library via stop_words='english':

In [40]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words='english',
                        max_df=.1,
                        max_features=5000)
X = count.fit_transform(df['review'].values)

Notice that we set the maximum document frequency of words to be considered to 10 percent
(max_df=.1) to exclude words that occur too frequently across documents. The rationale behind the removal
of frequently occurring words is that these might be common words appearing across all documents
that are, therefore, less likely to be associated with a specific topic category of a given document.
Also, we limited the number of words to be considered to the most frequently occurring 5,000 words
(max_features=5000), to limit the dimensionality of this dataset to improve the inference performed
by LDA. However, both max_df=.1 and max_features=5000 are hyperparameter values chosen arbitrarily.
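The effect of a maximum document frequency can be sketched without scikit-learn. The example below uses four made-up one-line documents and a threshold of 0.5 (instead of 0.1, because the toy corpus is so small); words that appear in more than that fraction of documents are dropped.

```python
docs = ['the movie was great',
        'the movie was awful',
        'the plot was thin',
        'the acting was great']
max_df = 0.5

# the set of distinct words per document
tokenized = [set(doc.split()) for doc in docs]
n_docs = len(docs)
vocab = sorted(set().union(*tokenized))

# fraction of documents that contain each word
doc_fraction = {word: sum(word in tokens for tokens in tokenized) / n_docs
                for word in vocab}

# keep only words that do not exceed the threshold
kept = [word for word in vocab if doc_fraction[word] <= max_df]
print(kept)  # ['acting', 'awful', 'great', 'movie', 'plot', 'thin']
```

The words 'the' and 'was' occur in every document and are removed, while 'movie' and 'great' sit exactly at the threshold and are kept.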

The following code example demonstrates how to fit a LatentDirichletAllocation estimator to the
bag-of-words matrix and infer the 10 different topics from the documents:

In [41]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch')
X_topics = lda.fit_transform(X)


After fitting the LDA, we now have access to the components_ attribute of the lda instance, which stores
a matrix containing the word importance (here, 5000) for each of the 10 topics in increasing order:

In [42]:
lda.components_.shape

(10, 5000)

To analyze the results, let’s print the five most important words for each of the 10 topics. Note that the
word importance values are ranked in increasing order. Thus, to print the top five words, we need to
sort the topic array in reverse order:

In [43]:
n_top_words = 5
feature_names = count.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic {topic_idx + 1}:")
    # argsort is ascending, so walk backwards to get the top words
    top_indices = topic.argsort()[:-n_top_words - 1:-1]
    print(" ".join([feature_names[i] for i in top_indices]))

Topic 1:
worst minutes awful script stupid
Topic 2:
family mother father children girl
Topic 3:
american war dvd music history
Topic 4:
human audience cinema art feel
Topic 5:
police guy car dead murder
Topic 6:
horror house sex blood gore
Topic 7:
role performance comedy actor performances
Topic 8:
series episode episodes tv season
Topic 9:
book version original effects read
Topic 10:
action fight guy fun guys
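The slice `[:-n_top_words - 1:-1]` used above is easy to misread, so here is a small check on a made-up array of five importance values with `n_top = 2`:

```python
import numpy as np

importance = np.array([0.1, 0.7, 0.3, 0.9, 0.5])
n_top = 2

# indices that would sort the values in increasing order
ascending = importance.argsort()

# walk backwards from the end and stop after n_top entries
top = ascending[:-n_top - 1:-1]

print(ascending)  # [0 2 4 1 3]
print(top)        # [3 1]
```

The result lists the positions of the largest values first (0.9 at index 3, then 0.7 at index 1), which is the order in which the top words are printed for each topic.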


Based on reading the five most important words for each topic, you may guess that the LDA identified
the following topics:

1. Generally bad movies (not really a topic category)
2. Movies about families
3. War movies
4. Art movies
5. Crime movies
6. Horror movies
7. Comedy movie reviews
8. Movies somehow related to TV shows
9. Movies based on books
10. Action movies

To confirm that the categories make sense based on the reviews, let’s print excerpts from three movie
reviews in the horror movie category (horror movies belong to category 6 at index position 5):

In [44]:
horror = X_topics[:, 5].argsort()[::-1]

for iter_idx, movie_idx in enumerate(horror[:3]):
    print(f'\nHorror movie #{(iter_idx + 1)}:')
    print(df['review'][movie_idx][:300], '...')


Horror movie #1:
Emilio Miraglia's first Giallo feature, The Night Evelyn Came Out of the Grave, was a great combination of Giallo and Gothic horror - and this second film is even better! We've got more of the Giallo side of the equation this time around, although Miraglia doesn't lose the Gothic horror stylings tha ...

Horror movie #2:
This film marked the end of the "serious" Universal Monsters era (Abbott and Costello meet up with the monsters later in "Abbott and Costello Meet Frankentstein"). It was a somewhat desparate, yet fun attempt to revive the classic monsters of the Wolf Man, Frankenstein's monster, and Dracula one "la ...

Horror movie #3:
This film marked the end of the "serious" Universal Monsters era (Abbott and Costello meet up with the monsters later in "Abbott and Costello Meet Frankentstein"). It was a somewhat desparate, yet fun attempt to revive the classic monsters of the Wolf Man, Frankenstein's monster, and Dracula one "la ...
