# **Applying Machine Learning to Sentiment Analysis**


- Preparing the IMDb movie review data for text processing
    - Obtaining the IMDb movie review dataset
    - Preprocessing the movie dataset into more convenient format

- Introducing the bag-of-words model
    - Transforming words into feature vectors
    - Assessing word relevancy via term frequency-inverse document frequency
    - Cleaning text data
    - Processing documents into tokens

- Training a logistic regression model for document classification

- Working with bigger data – online algorithms and out-of-core learning

- Topic modeling
    - Decomposing text documents with Latent Dirichlet Allocation
    - Latent Dirichlet Allocation with scikit-learn


**Sentiment Analysis**


### **1. Overview**

**Sentiment Analysis** is a Natural Language Processing (NLP) technique that determines the **emotional tone**, **opinion**, or **attitude** expressed in text.
It classifies text as **positive**, **negative**, or **neutral**, and in more advanced setups, as **multi-class emotions** (e.g., joy, anger, sadness).



### **2. Core Workflow**

| Step                      | Description                                                                              |
| ------------------------- | ---------------------------------------------------------------------------------------- |
| **1. Data Collection**    | Gather labeled datasets (e.g., tweets, product reviews).                                 |
| **2. Text Preprocessing** | Clean and normalize text: remove stop words, punctuation, apply lemmatization/stemming.  |
| **3. Feature Extraction** | Convert text into numerical form — Bag-of-Words, TF-IDF, or embeddings (Word2Vec, BERT). |
| **4. Model Training**     | Train classifiers such as Logistic Regression, Naïve Bayes, LSTM, or Transformer models. |
| **5. Evaluation**         | Use metrics like accuracy, precision, recall, F1-score, and confusion matrix.            |
| **6. Prediction**         | Infer sentiment of unseen data.                                                          |



### **3. Mathematical Foundation**

#### **A. Bag-of-Words (BoW) / TF-IDF**

Text is vectorized into counts or weighted frequencies:

$$TFIDF(w, d) = TF(w, d) \times \log\left(\frac{N}{DF(w)}\right)$$

Where:

* $`TF(w, d)`$ = term frequency of word *w* in document *d*
* $`DF(w)`$ = number of documents containing *w*
* $`N`$ = total number of documents



#### **B. Logistic Regression for Sentiment**

For binary sentiment classification:

$$P(y = 1 | x) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$

The model learns parameters $`w`$ and $`b`$ by minimizing the **binary cross-entropy loss**:

$$L = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)]$$



#### **C. Neural Network (e.g., LSTM / Transformer)**

Text sequences are embedded (using pre-trained vectors or embeddings) and fed through recurrent or transformer layers to capture context and dependencies:

$$h_t = f(W_h x_t + U_h h_{t-1} + b_h)$$

Final hidden states are passed to a softmax layer for sentiment classification.



### **4. Advanced Models**

| Model Type                                   | Description                                                           |
| -------------------------------------------- | --------------------------------------------------------------------- |
| **Naïve Bayes**                              | Assumes feature independence; performs well on simple text data.      |
| **Logistic Regression / SVM**                | Strong baselines using TF-IDF features.                               |
| **LSTM / GRU**                               | Captures sequential dependencies and contextual meaning.              |
| **Transformer-based Models (BERT, RoBERTa)** | Leverage bidirectional context; state-of-the-art for sentiment tasks. |



### **5. Evaluation Metrics**

* **Accuracy:** Overall correctness
* **Precision:** Correct positive predictions
* **Recall:** Coverage of actual positives
* **F1-score:** Balance between precision and recall
* **ROC-AUC:** Quality of classification threshold



### **6. Use Cases**

| Domain                   | Application                                             |
| ------------------------ | ------------------------------------------------------- |
| **Business / Marketing** | Customer feedback analysis, brand reputation monitoring |
| **Finance**              | Sentiment-based stock prediction, market trend analysis |
| **Healthcare**           | Patient feedback or stress analysis                     |
| **Social Media**         | Public opinion mining, political sentiment tracking     |



### **7. Key Insights**

* **Imbalanced data** is common — use oversampling, undersampling, or class weights.
* **Pretrained embeddings (e.g., BERT, GloVe)** significantly boost accuracy.
* **Explainability** (e.g., SHAP, LIME) helps interpret model predictions.
* **Context matters** — the same word may differ by domain ("killer app" vs. "killer bacteria").

---

## **Preparing the IMDb movie review data for text processing**

Sentiment analysis, sometimes also called `opinion mining`, is a popular subdiscipline of the broader field of `NLP`; it is concerned with analyzing the sentiment of documents. A popular task in sentiment analysis is the classification of documents based on the expressed opinions or emotions of the authors with regard to a particular topic.

- We will be working with a large dataset of movie reviews from `IMDb`.
- The movie review dataset consists of `50,000` polar movie reviews that are labeled as either positive or negative; 
  - positive means that a movie was rated with more than six stars on IMDb, and
  - negative means that a movie was rated with fewer than five stars on `IMDb`.


### **Obtaining the movie review dataset**

- A compressed archive of the movie review dataset (84.1 MB) can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/ as a gzip-compressed tarball archive.


### Preprocessing the movie dataset into a more convenient format

- Read the movie reviews into a pandas DataFrame object.
- Use the Python Progress Indicator (PyPrind) package, which was developed several years ago for such purposes.

In [1]:
import pyprind
import pandas as pd
import os
import sys
from packaging import version

basepath = '../data/aclImdb'
labels = {'pos': 1, 'neg': 0}

# if the progress bar does not show, change stream=sys.stdout to stream=2
pbar = pyprind.ProgBar(50000, stream=sys.stdout)

df = pd.DataFrame()
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), 
                      'r', encoding='utf-8') as infile:
                txt = infile.read()
                
            if version.parse(pd.__version__) >= version.parse("1.3.2"):
                x = pd.DataFrame([[txt, labels[l]]], columns=['review', 'sentiment'])
                df = pd.concat([df, x], ignore_index=False)

            else:
                df = df.append([[txt, labels[l]]], 
                               ignore_index=True)
            pbar.update()
df.columns = ['review', 'sentiment']

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:18


In [2]:
df

Unnamed: 0,review,sentiment
0,I went and saw this movie last night after bei...,1
0,Actor turned director Bill Paxton follows up h...,1
0,As a recreational golfer with some knowledge o...,1
0,"I saw this film in a sneak preview, and it is ...",1
0,Bill Paxton has taken the true story of the 19...,1
...,...,...
0,"Towards the end of the movie, I felt it was to...",0
0,This is the kind of movie that my enemies cont...,0
0,I saw 'Descent' last night at the Stockholm Fi...,0
0,Some films that you pick up for a pound turn o...,0


In [3]:
df.shape

(50000, 2)

- In the preceding code, we first initialized a new progress bar object, `pbar`, with `50,000 iterations`, which was the number of documents we were going to read in. Using the nested for loops, we iterated over the `train` and `test` subdirectories in the main `aclImdb` directory and read the individual text files from the `pos` and `neg` subdirectories that we eventually appended to the `df pandas DataFrame`, together with an integer class label `(1 = positive and 0 = negative)`.


- Since the `class labels` in the assembled dataset are sorted, we will now shuffle the DataFrame using the permutation function from the `np.random` submodule—this will be useful for splitting the dataset into `training` and `test` datasets in later sections, when we will stream the data from our local drive directly.

In [4]:
# Store as a CSV file
import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

In [5]:
df.head()

Unnamed: 0,review,sentiment
0,I went and saw this movie last night after bei...,1
0,Actor turned director Bill Paxton follows up h...,1
0,As a recreational golfer with some knowledge o...,1
0,"I saw this film in a sneak preview, and it is ...",1
0,Bill Paxton has taken the true story of the 19...,1


In [6]:
# confirm data is saved as CSV
df = pd.read_csv('movie_data.csv', encoding='utf-8')
# column renaming is necessary on some computers:
df = df.rename(columns={"0": "review", "1": "sentiment"})
df.head(5)

Unnamed: 0,review,sentiment
0,I went and saw this movie last night after bei...,1
1,Actor turned director Bill Paxton follows up h...,1
2,As a recreational golfer with some knowledge o...,1
3,"I saw this film in a sneak preview, and it is ...",1
4,Bill Paxton has taken the true story of the 19...,1


In [7]:
df.shape

(50000, 2)

## **Introducing the bag-of-words model**

In this section, we will introduce the `bag-of-words` model, which allows us to represent text as numerical feature vectors. The idea behind `bag-of-words` is quite simple and can be summarized as follows:

- We create a vocabulary of unique tokens—for example, words—from the entire set of documents.
- We construct a feature vector from each document that contains the counts of how often each word occurs in the particular document.

Since the unique words in each document represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors will mostly consist of zeros, which is why we call them `sparse`.

### **Transforming words into feature vectors**

To construct a `bag-of-words` model based on the word counts in the respective documents, we can use the `CountVectorizer` class implemented in scikit-learn. As you will see in the following code section, `CountVectorizer` takes an array of text data, which can be documents or sentences, and constructs
the bag-of-words model for us:

In [8]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

In [9]:
docs

array(['The sun is shining', 'The weather is sweet',
       'The sun is shining, the weather is sweet, and one and one is two'],
      dtype='<U64')

In [10]:
type(docs)

numpy.ndarray

By calling the `fit_transform` method on `CountVectorizer`, we constructed the vocabulary of the bag-of-words model and transformed the following three sentences into sparse feature vectors:

- 'The sun is shining'
- 'The weather is sweet'
- 'The sun is shining, the weather is sweet, and one and one is two'

In [11]:
# print contents of the vocabulary
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [12]:
bag

<3x9 sparse matrix of type '<class 'numpy.int64'>'
	with 17 stored elements in Compressed Sparse Row format>

- As you can see from executing the preceding command, the vocabulary is stored in a Python dictionary that maps the unique words to integer indices. Next, let’s print the `feature vectors` that we just created:

In [13]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


- Each index position in the feature vectors shown here corresponds to the integer values that are stored as dictionary items in the `CountVectorizer` vocabulary. For example, the first feature at index position 0 resembles the count of the word "and", which only occurs in the last document, and the word "is" at index position 1 (the 2nd feature in the document vectors) occurs in all three sentences. Those values in the feature vectors are also called the raw term frequencies: `tf (t,d)`—the number of times a term `t` occurs in a document `d`.

### **Assessing word relevancy via term frequency-inverse document frequency**

When we are analyzing text data, we often encounter words that occur across multiple documents from both classes. Those frequently occurring words typically don't contain useful or discriminatory information. In this subsection, we will learn about a useful technique called term frequency-inverse document frequency (tf-idf) that can be used to downweigh those frequently occurring words in the feature vectors. The tf-idf can be defined as the product of the term frequency and the inverse document frequency:


$$tf-idf(t, d) = tf(t, d) x idf(t, d)$$

Here the tf(t, d) is the term frequency that we introduced in the previous section, and the inverse document frequency idf(t, d) can be calculated as:

$$idf(t, d) = \log\frac{n_{d}}{1 + df(d, t)}$$


where 
- $n_{d}$ is the total number of documents, and 
- $df(d, t)$ is the number of documents $d$ that contain the term $t$. 
- Note that adding the constant `1` to the denominator is optional and serves the purpose of assigning a non-zero value to terms that occur in all training examples; 
- the `log` is used to ensure that low document frequencies are not given too much weight.

- Scikit-learn implements yet another transformer, the `TfidfTransformer` class, that takes the raw term frequencies from `CountVectorizer` as input and transforms them into `tf-idfs`:

In [14]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, 
                         norm='l2', 
                         smooth_idf=True)

print(tfidf.fit_transform(count.fit_transform(docs))
      .toarray())

[[0.         0.43370786 0.         0.55847784 0.55847784 0.
  0.43370786 0.         0.        ]
 [0.         0.43370786 0.         0.         0.         0.55847784
  0.43370786 0.         0.55847784]
 [0.50238645 0.44507629 0.50238645 0.19103892 0.19103892 0.19103892
  0.29671753 0.25119322 0.19103892]]


In [15]:
print(tfidf.fit_transform(bag).toarray())

[[0.         0.43370786 0.         0.55847784 0.55847784 0.
  0.43370786 0.         0.        ]
 [0.         0.43370786 0.         0.         0.         0.55847784
  0.43370786 0.         0.55847784]
 [0.50238645 0.44507629 0.50238645 0.19103892 0.19103892 0.19103892
  0.29671753 0.25119322 0.19103892]]


In [16]:
tfidf.fit_transform(bag).toarray().shape

(3, 9)

As we saw in the previous subsection, the word "is" had the largest term frequency in the 3rd document, being the most frequently occurring word. However, after transforming the same feature vector into `tf-idfs`, we see that the word "is" is now associated with a relatively small `tf-idf` (0.45) in document 3 since it is also contained in documents 1 and 2 and thus is unlikely to contain any useful, discriminatory information.


However, if we'd manually calculated the `tf-idfs` of the individual terms in our feature vectors, we'd have noticed that the `TfidfTransformer` calculates the tf-idfs slightly differently compared to the standard textbook equations that we defined earlier. The equations for the idf and `tf-idf` that were implemented in scikit-learn are:


$$idf(t, d) = \log\frac{1 + n_{d}}{1 + df(d, t)}$$


The `tf-idf` equation that was implemented in scikit-learn is as follows:

$$tf-idf(t, d) = tf(t, d) x (idf(t, d) + 1)$$


Note that the `"+1"` in the previous `idf` equation is due to setting `smooth_idf=True` in the previous code example, which is helpful for assigning zero weight (that is, `idf(t, d) = log(1) = 0`) to terms that occur in all documents.


While it is also more typical to normalize the raw term frequencies before calculating the `tf-idfs`, the `TfidfTransformer` normalizes the `tf-idfs` directly.

By default `(norm='l2')`, scikit-learn's `TfidfTransformer` applies the `L2-normalization`, which returns a vector of length 1 by dividing an un-normalized feature vector `v` by its `L2-norm`:

$$v_{norm} = \frac{v}{||v||_{2}} = \frac{v}{\sqrt{v_{1}^{2} + v_{2}^{2} + ... + v_{n}^{2}}} = \frac{v}{(\sum _{i=1}^{n} v_{i}^{2})^{1/2}}$$

To make sure that we understand how `TfidfTransformer` works, let us walk through an example and calculate the `tf-idf` of the word "is" in the 3rd document.

The word "is" has a term frequency of 3 (tf = 3) in document 3 (), and the document frequency of this term is 3 since the term "is" occurs in all three documents (df = 3). Thus, we can calculate the idf as follows:

$$idf("is", d_{3}) = \log\frac{1 + 3}{1 + 3} = 0$$


Now in order to calculate the tf-idf, we simply need to add 1 to the inverse document frequency and multiply it by the term frequency:

$$tf-idf("is", d_{3}) = 3 \times (0 + 1) = 3$$


In [17]:
tf_is = 3
n_docs = 3
idf_is = np.log((n_docs+1) / (3+1))
tfidf_is = tf_is * (idf_is + 1)
print(f'tf-idf of term "is" = {tfidf_is:.2f}')

tf-idf of term "is" = 3.00


If we repeated these calculations for all terms in the 3rd document, we'd obtain the following `tf-idf` vectors: `[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0 , 1.69, 1.29]`. However, we notice that the values in this feature vector are different from the values that we obtained from the `TfidfTransformer` that we used previously. The final step that we are missing in this `tf-idf` calculation is the `L2-normalization`, which can be applied as follows:



$$tf-idf(d_{3})_{norm} = \frac{[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]}{\sqrt{3.39^{2} + 3.0^{2} + 3.39^{2} + 1.29^{2} + 1.29^{2} + 1.29^{2} + 2.0^{2} + 1.69^{2} + 1.29{2}}} = [0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]$$


$tf-idf("is", d_{3}) = 0.45$


As we can see, the results match the results returned by scikit-learn's `TfidfTransformer` (below). Since we now understand how `tf-idfs` are calculated, let us proceed to the next sections and apply those concepts to the movie review dataset.

In [18]:
tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf

array([3.38629436, 3.        , 3.38629436, 1.28768207, 1.28768207,
       1.28768207, 2.        , 1.69314718, 1.28768207])

In [19]:
l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))
l2_tfidf

array([0.50238645, 0.44507629, 0.50238645, 0.19103892, 0.19103892,
       0.19103892, 0.29671753, 0.25119322, 0.19103892])

### **Cleaning text data**

- Before we build our bag-of-words model, its ideal to clean the text data by stripping it of all unwanted characters.

In [20]:
df.loc[3, 'review'][-50:]

". <br /><br />This is one I'd recommend to anyone."

- As you can see here, the text contains `HTML` markup as well as punctuation and other non-letter characters. While `HTML` markup does not contain many useful semantics, punctuation marks can represent useful, additional information in certain `NLP` contexts. However, for simplicity, we will now remove all punctuation marks except for emoticon characters, since those are certainly useful for sentiment analysis.

- Using Python's `regular expression (regex)` library, `re`

In [21]:
import re
def preprocessor(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    text = (re.sub('[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

- Via the first regex, `<[^>]*>`, in the preceding code section, we tried to remove all of the HTML markup from the movie reviews. Although many programmers generally advise against the use of regex to parse HTML, this regex should be sufficient to clean this particular dataset. Since we are only interested in removing HTML markup and do not plan to use the HTML markup further, using regex to do the job should be acceptable.

- After we removed the HTML markup, we used a slightly more complex regex to find emoticons, which we temporarily stored as emoticons. Next, we removed all non-word characters from the text via the regex `[\W]+` and converted the text into lowercase characters.

- Although the addition of the `emoticon` characters to the end of the cleaned document strings may not look like the most elegant approach, we must note that the order of the words doesn’t matter in our `bag-of-words` model if our vocabulary consists of only one-word tokens. But before we talk more about the splitting of documents into individual terms, words, or tokens, let’s confirm that our preprocessor function works correctly:

In [22]:
preprocessor(df.loc[3, 'review'][-50:])

' this is one i d recommend to anyone '

In [23]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

- Let's now apply our `preprocessor` function to all the movie reviews in our `DataFrame`:

In [24]:
df['review'] = df['review'].apply(preprocessor)

In [25]:
df.head()

Unnamed: 0,review,sentiment
0,i went and saw this movie last night after bei...,1
1,actor turned director bill paxton follows up h...,1
2,as a recreational golfer with some knowledge o...,1
3,i saw this film in a sneak preview and it is d...,1
4,bill paxton has taken the true story of the 19...,1


### **Processing documents into tokens**

- Split the text corpora into individual elements.
- One way to `tokenize` documents is to split them into individual words by splitting the cleaned documents at their whitespace characters:

In [26]:
import nltk
from nltk.stem.porter import PorterStemmer

def tokenizer(text):
    return text.split()

# Porter stemming algorithm
porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [27]:
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [28]:
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

- Using the `PorterStemmer` from the `nltk` package, we modified our tokenizer function to reduce words to their root form, which was illustrated by the simple preceding example where the word 'running' was stemmed to its root form 'run'.

**Stemming algorithms**

While stemming can create non-real words, such as `'thu' (from 'thus')`, as shown in the previous example, a technique called `lemmatization` aims to obtain the canonical (grammatically correct) forms of individual words—the so-called `lemmas`. However, `lemmatization` is computationally more difficult and expensive compared to `stemming` and, in practice, it has been observed that stemming and `lemmatization` have little impact on the performance of text classification.

**Stop word removal**

Stop words are common words like `"the,"` `"a,"` and `"is"` that are often removed during Natural Language Processing (NLP) to reduce noise and focus on more meaningful content. By removing these high-frequency, low-information words, NLP tasks become more efficient and accurate, improving performance in areas like search engines and topic modeling. However, their removal is not always necessary, and in some contexts, such as sentiment analysis where words like `"not"` are crucial, they should be retained.


**Why stop words are removed**

- `Increase efficiency:` Removing stop words reduces the amount of data that needs to be processed, which can speed up analysis and reduce computational costs. 

- `Improve accuracy:` It helps algorithms focus on the most informative words, leading to more accurate results in tasks like search and topic modeling. 

- `Reduce noise:` By filtering out common grammatical words, NLP models can better identify the core meaning of a text. 

- `Example:` In the sentence, "The quick brown fox jumps over the lazy dog," removing "the" and "over" leaves "quick brown fox jumps lazy dog," which retains the core meaning for analysis. 

In [29]:
import ssl
import nltk

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

- After we download the stop words set, we can load and apply the English stop word set as follows:

In [30]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

## **Training a logistic regression model for document classification**

- Train logistic regression model to classify the movie reviews into `positive` and `negative` reviews based on the bag-of-words model.
- Divide data into `25k` training and `25k` testing documents.

In [31]:
df.head(), df.shape

(                                              review  sentiment
 0  i went and saw this movie last night after bei...          1
 1  actor turned director bill paxton follows up h...          1
 2  as a recreational golfer with some knowledge o...          1
 3  i saw this film in a sneak preview and it is d...          1
 4  bill paxton has taken the true story of the 19...          1,
 (50000, 2))

In [32]:
X_train = df.loc[:25000, 'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

In [33]:
X_train.shape, X_test.shape

((25001,), (25000,))

- `GridSearchCV` object to find the optimal set of parameters for our logistic regression model using `5-fold` stratified cross-validation:

In [34]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None,
                        token_pattern=None)

small_param_grid = [{'vect__ngram_range': [(1, 2), (1, 3)],
                     'vect__stop_words': [None],
                     'vect__tokenizer': [tokenizer, tokenizer_porter],
                     'clf__penalty': ['l2'],
                     'clf__C': [0.01, 0.1, 1.0, 10.0, 100.0],
                     'vect__use_idf': [True],
                     'vect__min_df': [2, 5],
                     'vect__max_df': [0.8, 1.0],
                     'vect__sublinear_tf': [True]},
                    {'vect__ngram_range': [(1, 1), (1, 3)],
                     'vect__stop_words': [stop, None],
                     'vect__tokenizer': [tokenizer],
                     'vect__use_idf':[False],
                     'vect__norm':[None],
                     'clf__penalty': ['l2'],
                  'clf__C': [0.01, 0.1, 1.0, 10.0, 100]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, 
                           small_param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

- Note that for the logistic regression classifier, we are using the `LIBLINEAR` solver as it can perform better than the default choice `('lbfgs')` for relatively large datasets.

**Multiprocessing via the n_jobs parameter**

- Please note that it is highly recommended to use `n_jobs=-1` (instead of `n_jobs=1`) in the previous code example to utilize all available cores on your machine and speed up the grid search.

- In the previous code example, we replaced `CountVectorizer` and `TfidfTransformer` from the previous subsection with `TfidfVectorizer`, which combines `CountVectorizer` with the `TfidfTransformer`.

- Our `param_grid` consisted of two parameter dictionaries. 
  - In the first dictionary, we used `TfidfVectorizer` with its default settings `(use_idf=True, smooth_idf=True, and norm='l2')` to calculate the `tf-idfs`; in the second dictionary, we set those parameters to `use_idf=False`, `smooth_idf=False`, and `norm=None` in order to train a model based on raw term frequencies.
  - Furthermore, for the logistic regression classifier itself, we trained models using `L2` regularization via the penalty parameter and compared different regularization strengths by defining a range of values for the inverse-regularization parameter `C`.

In [None]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


- Print the best parameter set:

In [55]:
print(f'Best parameter set: {gs_lr_tfidf.best_params_}')

Best parameter set: {'clf__C': 1.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x1620294c0>}


- Get the average `5-fold` cross-validation accuracy scores on the training dataset and the classification accuracy on the test dataset:

In [56]:
print(f'CV Accuracy: {gs_lr_tfidf.best_score_:.3f}')

CV Accuracy: 0.873


In [None]:
from sklearn.metrics import classification_report
y_pred = gs_lr_tfidf.best_estimator_.predict(X_test)
print(classification_report(y_test, y_pred))

In [77]:
gs_lr_tfidf.best_estimator_

In [57]:
clf = gs_lr_tfidf.best_estimator_
print(f'Test Accuracy: {clf.score(X_test, y_test):.3f}')

Test Accuracy: 0.881


- The results reveal that our machine learning model can predict whether a movie review is `positive` or `negative` with `88` percent accuracy.

**Make Prediction for a new text input (review) using our trained model:**

In [111]:
df['sentiment'].value_counts()

sentiment
1    25000
0    25000
Name: count, dtype: int64

In [97]:
# Example input review
new_review = ["The movie was absolutely good and entertaining!, I had a pretty good time"]

# Predict sentiment (1 = positive, 0 = negative)
prediction = gs_lr_tfidf.best_estimator_.predict(new_review)

# Display result
print("Predicted Sentiment:", "Positive" if prediction[0] == 1 else "Negative")

Predicted Sentiment: Positive


In [98]:
proba = gs_lr_tfidf.best_estimator_.predict_proba(new_review)
print("Probability (Negative, Positive):", proba[0])

Probability (Negative, Positive): [0.332789 0.667211]


**NOTE**

Please not that `gs_lr_tfidf.best_score_` is the avergae k-fold cross-validation score. i.e., if we have a `GridSearchCV` object with 5-fold `cross-validation` (like the one above), the `best_score_` attribute returns the average score over the 5-folds of the best model. To illustrate this with an example:

In [58]:
from sklearn.linear_model import LogisticRegression
import numpy as np

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

np.random.seed(0)
np.set_printoptions(precision=6)
y = [np.random.randint(3) for i in range(25)]
X = (y + np.random.randn(25)).reshape(-1, 1)

cv5_idx = list(StratifiedKFold(n_splits=5, shuffle=False).split(X, y))
    
lr = LogisticRegression()
cross_val_score(lr, X, y, cv=cv5_idx)

array([0.6, 0.4, 0.6, 0.2, 0.6])

In [67]:
X.shape, len(y)

((25, 1), 25)

- By executing the code above, we created a simple data set of random integers that shall represent our class labels. Next, we fed the indices of `5 cross-validation` folds `(cv3_idx)` to the `cross_val_score` scorer, which returned `5 accuracy scores` -- these are the 5 accuracy values for the 5 test folds.

- Next, let us use the `GridSearchCV` object and feed it the same `5` cross-validation sets (via the pre-generated `cv3_idx` indices):

In [68]:
from sklearn.model_selection import GridSearchCV

lr = LogisticRegression()
gs = GridSearchCV(lr, {}, cv=cv5_idx, verbose=3).fit(X, y) 

Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 1/5] END ..................................., score=0.600 total time=   0.0s
[CV 2/5] END ..................................., score=0.400 total time=   0.0s
[CV 3/5] END ..................................., score=0.600 total time=   0.0s
[CV 4/5] END ..................................., score=0.200 total time=   0.0s
[CV 5/5] END ..................................., score=0.600 total time=   0.0s


- As we can see, the scores for the `5` folds are exactly the same as the ones from `cross_val_score` earlier.

- Now, the `best_score_` attribute of the `GridSearchCV` object, which becomes available after fitting, returns the average accuracy score of the best model:

In [70]:
print(gs.best_score_)

0.48


- As we can see, the result above is consistent with the average score computed with `cross_val_score`.

In [74]:
lr = LogisticRegression()
print(cross_val_score(lr, X, y, cv=cv5_idx).mean())

0.48


**END OF NOTE**

---

## **Working with bigger data - online algorithms and out-of-core learning**

- Out-of-core learning, allows us to work with large datasets by fitting the classifier incrementally on smaller batches of a dataset.

- Here, we will make use of the `partial_fit` function of `SGDClassifier` in scikit-learn to stream the documents directly from our local drive and train a logistic regression model using small mini-batches of documents.

- Define a tokenizer function that cleans the unprocessed text data from the `movies_data.csv` file that we constructed at the beginning, and separates it into word tokens while removing stop words:

In [3]:
import numpy as np
import re 
from nltk.corpus import stopwords

stop = stopwords.words('english')
def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text.lower())
    text = re.sub('[\W]+', ' ', text.lower()) \
                 + ' '.join(emoticons).replace('-','')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

- Next, we dill define a generator function, `stream_docs`, that reads in and returns one document at a time:

In [4]:
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

- Let's read in the first document from the `movie_data.csv` file, which should return a tuple consisting of the review text as well as the corresponding class label:

In [4]:
next(stream_docs(path='movie_data.csv'))

('"I went and saw this movie last night after being coaxed to by a few friends of mine. I\'ll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge."',
 1)

- Define a function `get_minibatch`, that will take a document stream from the `stream_docs` function and return a particular number of documents specified by the size parameter:

In [5]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

- Unfortunately, we can’t use `CountVectorizer` for out-of-core learning since it requires holding the complete vocabulary in memory. Also, `TfidfVectorizer` needs to keep all the feature vectors of the training dataset in memory to calculate the inverse document frequencies. However, another useful vectorizer for text processing implemented in scikit-learn is `HashingVectorizer`.

In [16]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)
clf = SGDClassifier(loss='hinge', random_state=1)
doc_stream = stream_docs(path='movie_data.csv')

- We initialized `HashingVectorizer` with our tokenizer function and set the number of features to `2**21`. Furthermore, we reinitialized a logistic regression classifier by setting the loss parameter of `SGDClassifier` to `log`. 

- Note that by choosing a large number of features in `HashingVectorizer`, we reduce the chance of causing hash collisions, but we also increase the number of coefficients in our logistic regression model.

- Start the out-of-core learning:

In [23]:
import pyprind

pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])

for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

pbar



Title: 
  Started: 10/16/2025 03:03:01
  Finished: 10/16/2025 03:03:01
  Total time elapsed: 00:00:00

- we iterated over `45 mini-batches` of documents where each mini-batch consists of `1,000 documents`. Having completed the incremental learning process, we will use the last `5,000 documents` to evaluate the performance of our model:

In [18]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print(f'Accuracy: {clf.score(X_test, y_test):.3f}')

Accuracy: 1.000


- Finally, we can use the last `5,000` documents to update our model:

In [None]:
clf = clf.partial_fit(X_test, y_test)

1.0


**The word2vec model**

The `word2vec` algorithm is an `unsupervised learning` algorithm based on neural networks that attempts to automatically learn the relationship between words. The idea behind `word2vec` is to put words that have similar meanings into similar clusters, and via clever vector spacing, the model can reproduce certain words using simple vector math, for example, `king – man + woman = queen`.

---

## **Topic modeling with Latent Dirichlet allocation**

`Topic modeling` describes the broad task of assigning topics to unlabeled text documents. For example, a typical application is the categorization of documents in a large text corpus of newspaper articles. In applications of topic modeling, we then aim to assign category labels to those articles, for example, `sports`, `finance`, `world news`, `politics`, and `local news`.

- Topic modeling is considered a clustering task, a subcategory of unsupervised learning.

### **Decomposing text documents with Latent Dirichlet Allocation**

`LDA` is a generative probabilistic model that tries to find groups of words that appear frequently together across different documents. These frequently appearing words represent our topics, assuming that each document is a mixture of different words.

Given a bag-of-words matrix as input, `LDA` decomposes it into two new matrices:

- A document-to-topic matrix
- A word-to-topic matrix

LDA decomposes the `bag-of-words` matrix in such a way that if we multiply those two matrices together, we will be able to reproduce the input, the bag-of-words matrix, with the lowest possible error. In practice, we are interested in those topics that LDA found in the bag-of-words matrix. The only downside may be that we must define the number of topics beforehand—the number of topics is a hyperparameter of LDA that has to be specified manually.

### **Latent Dirichlet Allocation with scikit-learn**

In this subsection, we will use the `LatentDirichletAllocation` class implemented in scikit-learn to decompose the movie review dataset and categorize it into different topics. In the following example, we will restrict the analysis to 10 different topics, but readers are encouraged to experiment with the hyperparameters of the algorithm to further explore the topics that can be found in this dataset.

- Load data into `pandas.DataFrame`.

In [25]:
import pandas as pd 
df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head()

Unnamed: 0,review,sentiment
0,I went and saw this movie last night after bei...,1
1,Actor turned director Bill Paxton follows up h...,1
2,As a recreational golfer with some knowledge o...,1
3,"I saw this film in a sneak preview, and it is ...",1
4,Bill Paxton has taken the true story of the 19...,1


- Next, we are going to use the already familiar `CountVectorizer` to create the `bag-of-words` matrix as input to the `LDA`.

In [29]:
from sklearn.feature_extraction.text import CountVectorizer
count_vec = CountVectorizer(stop_words='english',
                        max_df=.1,
                        max_features=5000)

X = count_vec.fit_transform(df['review'].values)

<50000x5000 sparse matrix of type '<class 'numpy.int64'>'
	with 2780504 stored elements in Compressed Sparse Row format>

- Notice that we set the maximum document frequency of words to be considered to 10 percent `(max_df=.1)` to exclude words that occur too frequently across documents. The rationale behind the removal of frequently occurring words is that these might be common words appearing across all documents that are, therefore, less likely to be associated with a specific topic category of a given document. Also, we limited the number of words to be considered to the most frequently occurring `5,000` words `(max_features=5000)`, to limit the dimensionality of this dataset to improve the inference performed by `LDA`. However, both `max_df=.1` and `max_features=5000` are hyperparameter values chosen arbitrarily, and readers are encouraged to tune them while comparing the results.

- fit a `LatentDirichletAllocation` estimator to the `bag-of-words` matrix and infer the 10 different topics from the documents.

In [31]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch')

X_topics = lda.fit_transform(X)

In [41]:
X_topics

array([[2.50091416e-03, 9.77494779e-01, 2.50051158e-03, ...,
        2.50034654e-03, 2.50053199e-03, 2.50077758e-03],
       [8.33526593e-04, 8.33606132e-04, 1.02795734e-01, ...,
        8.33575277e-04, 8.33607028e-04, 1.44434277e-01],
       [1.28221566e-03, 2.66683358e-01, 2.12125823e-01, ...,
        1.28258459e-03, 1.25237669e-01, 1.28244482e-03],
       ...,
       [4.55003922e-01, 1.04190729e-03, 1.04205009e-03, ...,
        1.04185614e-03, 1.04205003e-03, 1.04195855e-03],
       [4.07023117e-01, 1.81857522e-03, 2.44313216e-01, ...,
        1.81854993e-03, 1.81860815e-03, 9.78587275e-02],
       [9.72720999e-01, 3.03088846e-03, 3.03092197e-03, ...,
        3.03082138e-03, 3.03090304e-03, 3.03089819e-03]])

- By setting `learning_method='batch'`, we let the lda estimator do its estimation based on all available training data (the `bag-of-words` matrix) in one iteration, which is slower than the alternative 'online' learning method, but can lead to more accurate results (setting `learning_method='online'` is analogous to online or mini-batch learning).

- access the `components_` attribute of the `lda` instance, which stores a matrix containing the word importance (here, 5000) for each of the 10 topics in increasing order:

In [32]:
lda.components_.shape

(10, 5000)

- To analyze the results, let's print the five most important words for each of the 10 topics. Note that the word importance values are ranked in increasing order. Thus, to print the top five words, we need to sort the topic array in reverse order:

In [39]:
for topic_idx, topic in enumerate(lda.components_):
    print(topic_idx, topic)

0 [ 91.50846593 103.00131299 352.6877031  ... 341.90195267 219.2361035
  30.20576233]
1 [29.23761791 13.10213299 43.38100074 ...  0.10000513  0.10000308
  2.46842427]
2 [1.78099605e+01 1.62127213e+02 1.29444216e+02 ... 1.00009529e-01
 1.00010992e-01 4.39793174e+00]
3 [ 0.62985605 20.64865459 61.08221094 ...  0.10001072  0.10001132
 10.03641183]
4 [ 56.65850322 217.61222717  36.61177975 ... 796.95703592 533.86776726
  26.8114528 ]
5 [ 1.24340354 22.34807824  8.57614113 ... 37.44094137 70.19605598
  3.80168648]
6 [2.03775633e+00 7.87278607e+00 1.29536844e+02 ... 1.00008379e-01
 1.00008924e-01 1.00417956e-01]
7 [9.78183711e-01 1.32804550e+01 1.00321571e+01 ... 1.00009451e-01
 1.00010088e-01 2.00522899e+02]
8 [ 8.79623014 29.22767486 59.64699724 ...  0.10001286  0.10001387
  0.75251745]
9 [ 0.10002268 30.77946524 98.00095046 ...  0.10001398  0.10001499
  1.90249633]


In [40]:
n_top_words = 5
feature_names = count_vec.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
    print(f'Topic {(topic_idx + 1)}:')
    print(' '.join([feature_names[i]
                    for i in topic.argsort()\
                        [:-n_top_words - 1:-1]]))

Topic 1:
worst minutes script awful stupid
Topic 2:
family mother father girl children
Topic 3:
american war dvd music tv
Topic 4:
human audience cinema art feel
Topic 5:
police guy car murder dead
Topic 6:
horror house gore blood sex
Topic 7:
role performance comedy actor performances
Topic 8:
series episode war episodes tv
Topic 9:
book version original read effects
Topic 10:
action fight guy guys fun


Based on reading the five most important words for each topic, you may guess that the `LDA` identified the following topics:

```
1. Generally bad movies (not really a topic category)
2. Movies about families
3. War movies
4. Art movies
5. Crime movies
6. Horror movies
7. Comedy movie reviews
8. Movies somehow related to TV shows
9. Movies based on books
10. Action movies
```

To confirm that the categories make sense based on the reviews, let's plot 5 movies from the horror movie category (category 6 at index position 5):

In [45]:
horror = X_topics[:, 5].argsort()[::-1]

for iter_idx, movie_idx in enumerate(horror[:5]):
    print(f'\nHorror movie #{(iter_idx + 1)}:')
    print(df['review'][movie_idx][:300], '...')


Horror movie #1:
House of Dracula works from the same basic premise as House of Frankenstein from the year before; namely that Universal's three most famous monsters; Dracula, Frankenstein's Monster and The Wolf Man are appearing in the movie together. Naturally, the film is rather messy therefore, but the fact that ...

Horror movie #2:
"House of the Damned" (also known as "Spectre") is one of your low budget haunted house horror flicks, filled with mediocre performances and cheap effects. It is about a family that inherits an old Irish mansion, and after moving in begin to experience strange phenomenon and ghostly apparitions, inc ...

Horror movie #3:
This film marked the end of the "serious" Universal Monsters era (Abbott and Costello meet up with the monsters later in "Abbott and Costello Meet Frankentstein"). It was a somewhat desparate, yet fun attempt to revive the classic monsters of the Wolf Man, Frankenstein's monster, and Dracula one "la ...

Horror movie #4:
This film mar

- Using the preceding code example, we printed the first 300 characters from the top five horror movies. The reviews—even though we don’t know which exact movie they belong to—sound like reviews of horror movies

---

## **Summary**

In this chapter, you learned how to use machine learning algorithms to classify text documents based on their polarity, which is a basic task in sentiment analysis in the field of NLP. Not only did you learn how to encode a document as a feature vector using the bag-of-words model, but you also learned how to weight the term frequency by relevance using tf-idf.


Working with text data can be computationally quite expensive due to the large feature vectors that are created during this process; in the last section, we covered how to utilize out-of-core or incremental learning to train a machine learning algorithm without loading the whole dataset into a computer’s memory.


Lastly, you were introduced to the concept of topic modeling using LDA to categorize the movie reviews into different categories in an unsupervised fashion.