<h1 style="font-family: Arial; font-weight: 900; color: magenta; text-align: center;">SENTIMENT ANALYSIS</h1><br>

<br/><div style="font-family: Arial; color: darkblue; font-size: 14px; font-weight: 600"><a id="toc"></a>
In this part 8:
    <p style="margin: 0 0 0 2rem; font-size: 16px; font-weight: 800">[1. The bag-of-words model](#1)</p>
        <p style="margin: 0 0 0 4rem">[1a. Transforming words into feature vectors](#1a)</p>
        <p style="margin: 0 0 0 4rem">[1b. The TF-IDF](#1b)</p>
        <p style="margin: 0 0 0 4rem">[1c. Preparing text data](#1c)</p><br/>
    <p style="margin: 0 0 0 2rem; font-size: 16px; font-weight: 800">[2. Logistic regression model for document classification](#2)</p>
        <p style="margin: 0 0 0 4rem">[2a. Training a logistic regression model for our movie reviews dataset](#2a)</p>
        <p style="margin: 0 0 0 4rem">[2b. Working with bigger data - online algorithms and out-of-core learning](#2b)</p><br/>
    <p style="margin: 0 0 0 2rem; font-size: 16px; font-weight: 800">[Appendix - Topic modeling with latent Dirichlet allocation](#Appendix)</p></div>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Datasets used : throughout this part, we only use the "movie_data" dataset; I have downloaded it from the author's
# Github for this chapter (at https://github.com/rasbt/machine-learning-book/tree/main/ch08), and I have placed it
# inside the folder containing the notebook, so it is possible to read it directly;
# besides, although it is called "df" in the book, I prefer to call it "df_movies"
df_movies = pd.read_csv("movie_data.csv")
# then, we rename the columns labeled "0" and "1" to "review" and "sentiment" respectively, providing more descriptive
# names to the columns for better readability and understanding of the data
df_movies = df_movies.rename(columns={"0": "review", "1": "sentiment"})

# Beautify code results using colors
from colorama import Fore, Back, Style
# define some strings for cleaner code
BB = Fore.BLUE + Style.BRIGHT
MB = Fore.MAGENTA + Style.BRIGHT
CB = Fore.CYAN + Style.BRIGHT

# Beautify DataFrames using colors
def beautify(df):
    # df.style creates a Styler object from the DataFrame df, which allows styling to be applied to the
    # dataframe's representation; then 'set_table_styles' applies css styling (eg 'text-align: center')
    # to the table header cells (<th>) and data cells (<td>) of the dataframe
    styled_df = df.style.set_table_styles([
        {'selector': 'th', 'props': [('text-align', 'center'), ('color', 'blue')]}, 
        {'selector': 'td', 'props': [('text-align', 'center'), ('color', 'magenta'), ('font-weight', 'bold')]}])    
    # The Styler object resets the precision of floats to 6 decimal places, whatever their original format; only
    # the display is affected, though, not the original data; here, the floats are float64 numbers and correspond
    # to seconds or percentages, so I reformat their display to 2 decimal places (in the end, I didn't use floats
    # in my tables, only strings, but I leave this reformatting anyway, just in case I'll add some tables one day)
    return styled_df.format({col: '{:.2f}' for col in df.select_dtypes(include=['float64'])})

<a id="1"></a><br><br><br><div style="font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">
    <h4 style="margin: 0 0 .5rem 2rem; font-size: 18px; font-weight: 800"><font color='darkpink'>1. THE BAG-OF-WORDS MODEL</font></h4>
    <p style="margin: 0 0 .5rem 4rem; font-size: 16px; font-weight: 600"><font color='darkpink'>1a. Transforming words into feature vectors</font><a id="1a"></a></p>$\quad$
        The goal in this part is to study a set of documents, denoted by <font color='#1aabff'>$\mathcal{D}$</font>.<br>$\quad$
        Before we can pass categorical data, such as words, to a machine learning algorithm, we need to convert it into a numerical format. The <font color='darkpink'>bag-of-words model</font> allows us to do exactly this. It uses a <font color='darkpink'>vocabulary</font> to assign an integer to each interesting piece of <font color='#1aabff'>$\mathcal{D}$</font>. Let's explain what this means by presenting some definitions.<br>$\quad$
        First, we extract from <font color='#1aabff'>$\mathcal{D}$</font> the units we want to keep; these are usually its words, or blocks of a fixed number of words, or blocks of a fixed number of characters satisfying some conditions of our choice. These units are called <font color='darkpink'>raw tokens</font>. Then, we may have rules under which some tokens are equivalent; for example <font color='#1aabff'>"cat"</font>, <font color='#1aabff'>"Cat"</font>, and <font color='#1aabff'>"cats"</font> may be considered equivalent. The equivalence classes of a set of documents <font color='#1aabff'>$\mathcal{D}$</font> are called its <font color='darkpink'>processed tokens</font>. An <font color='darkpink'>n-gram</font> is any sequence of <font color='#1aabff'>$n$</font> contiguous <font color='darkpink'>processed tokens</font>.<br>$\quad$
        A <font color='#cc33ff'>vocabulary</font> is a dictionary that assigns integers to strings. The strings (keys) may be independent of <font color='#1aabff'>$\mathcal{D}$</font>, or built from <font color='#1aabff'>$\mathcal{D}$</font>; they are the pieces of text we are interested in; they are called the <font color='darkpink'>terms</font> of the dictionary. The integers (values) must all be different, and, if the number of keys is <font color='#1aabff'>$N$</font>, they must go from <font color='#1aabff'>$0$</font> to <font color='#1aabff'>$N\!-\!1$</font>, with no gaps. The entries are not necessarily ordered according to these integers, and they rarely are (eg, see cell 2).<br>$\quad$
        We can build a <font color='#cc33ff'>vocabulary</font> from our set of documents, by specifying some rules under which an <font color='darkpink'>n-gram</font> becomes a term, and by assigning to each qualified <font color='darkpink'>n-gram</font> its rank (starting from $0$) in the alphabetically ordered list of terms; other orders are possible, though less common.<br>$\quad$
        Once we have a <font color='#cc33ff'>vocabulary</font> and a processed set of documents, say <font color='#1aabff'>$\mathcal{D}_p$</font>, the <font color='#cc33ff'>bag-of-words model</font> counts, in each processed document <font color='#1aabff'>$\,D_p \in \mathcal{D}_p\,$</font>, the number of occurrences of each <font color='#cc33ff'>term</font>; it stores the results in a separate (<font color='#cc33ff'>NumPy</font>) array, whose entry at index <font color='#1aabff'>$i$</font> is the number of times the <font color='#cc33ff'>term</font> with value <font color='#1aabff'>$i$</font> appears in <font color='#1aabff'>$D_p$</font>. This array is called the <font color='darkpink'>count vector</font> of <font color='#1aabff'>$D_p$</font>.<br>
    For example if we are interested in the words <font color='#1aabff'>"dog"</font> and <font color='#1aabff'>"cat"</font>, then our <font color='#cc33ff'>vocabulary</font> will be <font color='#1aabff'>{"cat": 0, "dog": 1}</font> (alphabetical order); and if our document is <font color='#1aabff'>"The cat meows"</font>, then our <font color='#cc33ff'>count vector</font> will be <font color='#1aabff'>$[1,0]$</font>.<br>$\quad$
        A <font color='#cc33ff'>count vector</font> is a particular case of a <font color='#cc33ff'>feature vector</font>. A <font color='#cc33ff'>feature vector</font> associated with a processed document <font color='#1aabff'>$D_p$</font> is a 1D array whose entry at index <font color='#1aabff'>$i$</font> holds some data associated with the term of value <font color='#1aabff'>$i$</font>; for example <font color='#1aabff'>$D_p$</font>'s <font color='#cc33ff'>count vector</font> stores at index <font color='#1aabff'>$i$</font> the number of times that the term of value <font color='#1aabff'>$i$</font> appears in <font color='#1aabff'>$D_p$</font>, so it is a <font color='#cc33ff'>feature vector</font>.<br>$\quad$
        When the <font color='#cc33ff'>terms</font> present in a processed document <font color='#1aabff'>$D_p$</font> represent only a small subset of all the <font color='#cc33ff'>terms</font> in the <font color='#cc33ff'>vocabulary</font> in use, any <font color='#cc33ff'>feature vector</font> associated with <font color='#1aabff'>$D_p$</font> will mostly consist of zeros, in which case it is said to be <font color='darkpink'>sparse</font>. Usually, <font color='#cc33ff'>feature vectors</font> are sparse (eg, see cell 2). To save memory, a <font color='#cc33ff'>sparse vector</font> is not stored as an ordinary dense vector, but as a pair of arrays : the first holds the indices of the non-zero entries, and the second holds the corresponding non-zero values (eg $[0, 7, 0, 1, 0, 0]$ is represented by $([1,3], [7,1])$).
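
The definitions above can be illustrated with a small hand-written sketch that does not use scikit-learn. It builds the count vector of "The cat meows" for the vocabulary {"cat": 0, "dog": 1}, then converts a dense vector into the pair of arrays described above. The helper name `to_sparse` and the simple regex-based processing are assumptions made for this illustration only.

```python
import re
from collections import Counter

vocabulary = {"cat": 0, "dog": 1}      # term -> integer value, from 0 to N-1
document = "The cat meows"

# Processing (an assumption for this sketch): lowercase the text, then keep
# every run of 2 or more word characters as a token
tokens = re.findall(r"\b\w\w+\b", document.lower())
counts = Counter(tokens)

# Dense count vector: the entry at index i is the count of the term of value i
count_vector = [0] * len(vocabulary)
for term, index in vocabulary.items():
    count_vector[index] = counts[term]   # a Counter returns 0 for absent terms


def to_sparse(vector):
    """Return (indices of the non-zero entries, the non-zero values)."""
    indices = [i for i, value in enumerate(vector) if value != 0]
    values = [vector[i] for i in indices]
    return indices, values


print(count_vector)                       # [1, 0]
print(to_sparse(count_vector))            # ([0], [1])
print(to_sparse([0, 7, 0, 1, 0, 0]))      # ([1, 3], [7, 1])
```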

<div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        The <font color='#cc33ff'>bag-of-words model</font> is implemented in the <font color='#cc33ff'>feature_extraction.text</font> module of <font color='#cc33ff'>scikit-learn</font> under <font color='#cc33ff'>CountVectorizer(input='content', encoding='utf-8', decode_error='strict', lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None)</font>, where I have selected the following parameters :<ul>
    <li><font color='#cc33ff'>input='content'</font> specifies how the <font color='#cc33ff'>input</font> (ie the set of documents <font color='#1aabff'>$\mathcal{D}\,$</font>) will be passed : if the documents are already strings or bytes, we use the default <font color='#cc33ff'>input=</font><font color='#1aabff'>'content'</font>; if they are stored in files that we want to open ourselves (eg for error handling), we pass the open file objects, which <font color='#cc33ff'>scikit-learn</font> then reads, and we use <font color='#cc33ff'>input=</font><font color='#1aabff'>'file'</font>; if we prefer <font color='#cc33ff'>scikit-learn</font> to open and read the files itself, we pass the file names and use <font color='#cc33ff'>input=</font><font color='#1aabff'>'filename'</font></li>
    <li><font color='#cc33ff'>encoding='utf-8'</font> : when we save a text document, it is saved as a sequence of encoded characters; usually the encoding technique is <font color='#1aabff'>'utf-8'</font> where each character is encoded as a sequence of 1-4 bytes (1 byte = 1 block of 8 bits), but it may also be <font color='#1aabff'>'ascii'</font>, <font color='#1aabff'>'latin1'</font>, <font color='#1aabff'>'utf-16'</font>,... So when we ask <font color='#cc33ff'>scikit-learn</font> to read the files itself, we need to tell it the type of encoding used in the documents, by specifying the value of <font color='#cc33ff'>encoding</font> (since <font color='#cc33ff'>encoding=</font><font color='#1aabff'>'utf-8'</font> is the default we only need to specify <font color='#cc33ff'>encoding</font> when <font color='#cc33ff'>encoding=</font><font color='#1aabff'>'ascii'</font>,...). When we provide the documents directly, with <font color='#cc33ff'>input=</font><font color='#1aabff'>'content'</font>, if they are bytes, we also need to specify the encoding used (in the same way), and if they are strings, we don't need to specify this parameter, <font color='#cc33ff'>scikit-learn</font> won't decode our documents, and will use them directly</li>
    <li><font color='#cc33ff'>decode_error='strict'</font> specifies what to do if a byte sequence contains bytes not of the given encoding. By default, it is <font color='#cc33ff'>decode_error=</font><font color='#1aabff'>'strict'</font>, meaning that a <font color='#5555ff'>UnicodeDecodeError</font> will be raised. Other options are <font color='#cc33ff'>decode_error=</font><font color='#1aabff'>'ignore'</font> (ie any byte that cannot be decoded using the specified encoding is omitted from the document); and <font color='#cc33ff'>decode_error=</font><font color='#1aabff'>'replace'</font> (ie invalid bytes are replaced by the Unicode replacement character <font color='#5555ff'>�</font> (U+FFFD); this is useful when we want to preserve string length or/and show that something was corrupted)</li>
    <li><font color='#cc33ff'>lowercase=True</font> converts all characters to lowercase before tokenizing</li>
    <li><font color='#cc33ff'>preprocessor=None</font> specifies the preprocessor we want to apply to our documents (see "1c. Preparing text data"), if any</li>
    <li><font color='#cc33ff'>tokenizer=None</font> : a <font color='darkpink'>tokenizer</font> is a function that splits documents into tokens; by default, <font color='#cc33ff'>scikit-learn</font> uses <font color='#cc33ff'>CountVectorizer</font>'s built-in <font color='#cc33ff'>tokenizer</font>. This <font color='#cc33ff'>tokenizer</font> scans the document from its beginning; the first token starts at the first place where two characters of the set <font color='#1aabff'>$S$</font> (all letters, lowercase or uppercase, accented or not, all digits, and the underscore) follow one another, and ends just before the first character not in <font color='#1aabff'>$S$</font>; the second token then starts at the next place where two characters of <font color='#1aabff'>$S$</font> follow one another, and so on</li>
    <li><font color='#cc33ff'>stop_words=None</font> : the list of stop words (ie the "non informative" words like "the", "a", "is",... - see "1c. Preparing text data" -) we want to apply, if any</li>
    <li><font color='#cc33ff'>token_pattern='(?u)\\b\\w\\w+\\b'</font> : only used if <font color='#cc33ff'>analyzer=</font><font color='#1aabff'>'word'</font>, in which case the default regexp selects as <font color='#cc33ff'>tokens</font> any sequence of 2 or more word characters, ie letters, digits or underscores ('\b' matches a word boundary, ie the zero-width position between a word character and a non-word character such as a space or a comma; the actual <font color='#cc33ff'>token</font> starts immediately after it, and consists of one word character followed by at least one more word character, up to the next word boundary; '(?u)' ensures Unicode compatibility, so the pattern correctly recognizes non-ASCII letters - like é, ê, or letters from other languages -)</li>
    <li><font color='#cc33ff'>ngram_range=(1, 1)</font> : the length of our <font color='#cc33ff'>n-grams</font>; if we want the <font color='#cc33ff'>terms</font> to be between <font color='#1aabff'>min</font> and <font color='#1aabff'>max</font> words or characters long (both boundaries included), we specify <font color='#cc33ff'>ngram_range=</font><font color='#1aabff'>(min, max)</font></li>
    <li><font color='#cc33ff'>analyzer='word'</font> specifies whether the <font color='#cc33ff'>terms</font> should be made of words ('word', the default) or of characters ('char')</li>
    <li><font color='#cc33ff'>max_df=1.0</font> : if an integer <font color='#1aabff'>$n$</font>, then <font color='#1aabff'>$n$</font> is the maximum number of documents in which a <font color='#cc33ff'>processed token</font> can appear before being excluded from the list of potential <font color='#cc33ff'>terms</font>; if a float <font color='#1aabff'>$f$</font> (a proportion between $0$ and $1$), then a <font color='#cc33ff'>processed token</font> is excluded as soon as it appears in more than <font color='#1aabff'>$\,f*N_{\mathcal{D}}\,$</font> documents, which amounts to <font color='#1aabff'>$\,n=\text{floor}(f*N_{\mathcal{D}})\,$</font>, where <font color='#1aabff'>$N_{\mathcal{D}}$</font> is the total number of documents (so, by default, <font color='#cc33ff'>CountVectorizer</font> keeps all <font color='#cc33ff'>processed tokens</font> as potential <font color='#cc33ff'>terms</font>). This parameter is ignored if <font color='#5555ff'>vocabulary</font> is not <font color='#1aabff'>None</font> (see below)</li>
    <li><font color='#cc33ff'>min_df=1</font> works like <font color='#cc33ff'>max_df</font> but sets a minimum threshold, an integer <font color='#1aabff'>$n$</font> or a proportion <font color='#1aabff'>$f$</font> - in which case a <font color='#cc33ff'>processed token</font> must appear in at least <font color='#1aabff'>$\,f*N_{\mathcal{D}}\,$</font> documents, which amounts to <font color='#1aabff'>$\,n=\text{ceil}(f*N_{\mathcal{D}})\,$</font> -; if a <font color='#cc33ff'>processed token</font> is not present in enough documents, it will be excluded from the list of potential <font color='#cc33ff'>terms</font> (so, by default, <font color='#cc33ff'>CountVectorizer</font> keeps all <font color='#cc33ff'>processed tokens</font> as potential <font color='#cc33ff'>terms</font>)</li>
    <li><font color='#cc33ff'>max_features=None</font> : when <font color='#cc33ff'>max_features</font> is an integer, eg <font color='#cc33ff'>max_features=</font><font color='#1aabff'>1000</font>, <font color='#cc33ff'>scikit-learn</font> composes its dictionary with the <font color='#1aabff'>1000</font> <font color='#cc33ff'>processed tokens</font> that have the highest total count throughout the whole corpus of documents, and discards the others; by default, <font color='#cc33ff'>max_features=</font><font color='#1aabff'>None</font>, which means that <font color='#cc33ff'>scikit-learn</font> composes its dictionary with all the <font color='#cc33ff'>processed tokens</font>.</li>
    <li><font color='#cc33ff'>vocabulary=None</font> : our dictionary; we can either not provide it (<font color='#cc33ff'>vocabulary=</font><font color='#1aabff'>None</font>), or provide it as a list (each <font color='#cc33ff'>term</font> then receives its position in the list as value), or as a dictionary where the value of each <font color='#cc33ff'>term</font> is already specified</li></ul><div style="margin-top: 20px"></div>$\quad$
        I have selected the following methods and attributes available through <font color='#cc33ff'>CountVectorizer</font> :<ul>
    <li><font color='#cc33ff'>.fit(documents)</font> : if we provided a <font color='#cc33ff'>vocabulary</font> as a dictionary, then <font color='#cc33ff'>.fit(documents)</font> simply adopts it; if we provided it as a list, then <font color='#cc33ff'>.fit(documents)</font> assigns to each term its index in the list. If we didn't provide a <font color='#cc33ff'>vocabulary</font> (<font color='#cc33ff'>vocabulary=</font><font color='#1aabff'>None</font>), <font color='#cc33ff'>.fit(documents)</font> builds one : it collects from the provided set of documents the <font color='#cc33ff'>processed tokens</font> that satisfy the conditions specified via the parameters, and places them, in their order of appearance, as keys of a dictionary (a <font color='#cc33ff'>processed token</font> that has already been collected is skipped, so in the end each <font color='#cc33ff'>processed token</font> appears exactly once); then it builds a separate list of these <font color='#cc33ff'>terms</font>, sorts it in alphabetical order, and numbers its elements starting from $0$; finally, it completes the dictionary by assigning to each key its rank in the sorted list, without reordering the keys; this dictionary will be our <font color='#cc33ff'>vocabulary</font>; it is called the <font color='darkpink'>learned vocabulary</font>.</li>
    <li><font color='#cc33ff'>.transform(documents)</font> : uses the vocabulary to build for each document its <font color='#cc33ff'>count vector</font>; the <font color='#cc33ff'>count vectors</font> are <font color='#cc33ff'>sparse vectors</font>; the result is a <font color='darkpink'>sparse matrix</font> called the <font color='darkpink'>count matrix</font>. In <font color='#cc33ff'>scikit-learn</font>, the <font color='#cc33ff'>count matrix</font> is stored as three arrays (SciPy's compressed sparse row format) : the first array contains the indices of the first <font color='#cc33ff'>(sparse) count vector</font> (the one corresponding to the first document), followed by the indices of the second <font color='#cc33ff'>count vector</font>, and so on; the second array contains the values of the first <font color='#cc33ff'>count vector</font>, followed by the values of the second <font color='#cc33ff'>count vector</font>, and so on; and the third array stores the positions in the first two arrays where each <font color='#cc33ff'>count vector</font> begins, ending with the common length of both arrays. For example, if our first <font color='#cc33ff'>count vector</font> is $([1,3], [7,1])$ while the second is $([0, 1, 4], [3, 5, 7])$, then their <font color='#cc33ff'>count matrix</font> is $([1, 3, 0, 1, 4], [7, 1, 3, 5, 7], [0, 2, 5])$. Note that some <font color='#cc33ff'>count vectors</font> may be empty; this shows up as repeated values in the third array (ie the next <font color='#cc33ff'>count vector</font> starts at the same position as the previous one). A <font color='#cc33ff'>sparse matrix</font> is more general than a <font color='#cc33ff'>count matrix</font>, because it can be composed of any collection of <font color='#cc33ff'>feature vectors</font>, not necessarily <font color='#cc33ff'>count vectors</font> (we will see other examples in the next subchapter)</li>
    <li><font color='#cc33ff'>.fit_transform(documents)</font> : equivalent to <font color='#cc33ff'>.fit(documents)</font> followed by <font color='#cc33ff'>.transform(documents)</font>, but more efficiently implemented</li>
    <li><font color='#cc33ff'>.vocabulary_</font> : returns the <font color='#cc33ff'>vocabulary</font> learned with <font color='#cc33ff'>.fit(documents)</font></li></ul><div style="margin-top: 20px"></div>$\quad$
        For example :

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

docs = np.array(["The sun is shining", "The weather is sweet",
                 "The sun is shining, the weather is sweet, and one and one is two"])
countV = CountVectorizer()
bag = countV.fit_transform(docs)
print(BB + "The array of our example documents is :" + MB, docs)
print(BB + "\nIts complete dictionary is :" + MB, countV.vocabulary_)
# '.toarray' transforms the sparse matrix "bag" into a usual (dense) matrix; its i-th row is the count vector
# of docs[i]; its j-th column corresponds to the term of value j in the vocabulary, and the (i,j)-th value is
# the number of times this term appears in docs[i] after processing (e.g., lowercasing, tokenization)
print(BB + "\nWith this dictionary, the count vectors are :\n" + MB, bag.toarray())

The array of our example documents is : ['The sun is shining' 'The weather is sweet'
 'The sun is shining, the weather is sweet, and one and one is two']

Its complete dictionary is : {'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}

With this dictionary, the count vectors are :
 [[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
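
To connect this output with the three-array storage described for `.transform(documents)`, here is a minimal sketch that inspects the arrays of the sparse matrix directly. It assumes that `fit_transform` returns a SciPy matrix in compressed sparse row format, whose arrays are exposed as `indices`, `data` and `indptr` (the first, second and third arrays of the description). The order of the indices inside a row is not relied upon.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = np.array(["The sun is shining", "The weather is sweet",
                 "The sun is shining, the weather is sweet, and one and one is two"])
bag = CountVectorizer().fit_transform(docs)

# Third array: where each count vector starts, ending with the total length
print(bag.indptr)                  # [ 0  4  8 17]

# Rebuild the dense count matrix from the three arrays only
rebuilt = np.zeros(bag.shape, dtype=bag.dtype)
for i in range(bag.shape[0]):
    start, end = bag.indptr[i], bag.indptr[i + 1]
    # indices[start:end] are the non-zero positions of row i,
    # data[start:end] are the corresponding counts
    rebuilt[i, bag.indices[start:end]] = bag.data[start:end]

print(np.array_equal(rebuilt, bag.toarray()))   # True
```

The first two documents contain 4 distinct terms each and the third contains all 9, which is why the row boundaries are 0, 4, 8 and 17.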

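Some of the parameters listed earlier can be tried on the same three documents. The sketch below is only an illustration: the variable names (`cv_ngrams`, `cv_max`, `cv_min`, `cv_fixed`) are chosen here, and the thresholds are picked to suit this tiny corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The sun is shining", "The weather is sweet",
        "The sun is shining, the weather is sweet, and one and one is two"]

# ngram_range=(1, 2): the terms are the single words and the pairs of consecutive words
cv_ngrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print("the sun" in cv_ngrams.vocabulary_)        # True

# max_df=2: discard the terms present in more than 2 documents ("is" and "the")
cv_max = CountVectorizer(max_df=2).fit(docs)
print(sorted(cv_max.vocabulary_))
# ['and', 'one', 'shining', 'sun', 'sweet', 'two', 'weather']

# min_df=2: discard the terms present in fewer than 2 documents ("and", "one", "two")
cv_min = CountVectorizer(min_df=2).fit(docs)
print(sorted(cv_min.vocabulary_))
# ['is', 'shining', 'sun', 'sweet', 'the', 'weather']

# vocabulary given as a list: each term receives its position in the list,
# and no fitting is needed before calling transform
cv_fixed = CountVectorizer(vocabulary=["sun", "weather"])
fixed_counts = cv_fixed.transform(docs).toarray()
print(fixed_counts.tolist())                     # [[1, 0], [0, 1], [1, 1]]
```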

<br><br><div style="font-family: Arial; font-size:14px; color: darkblue">
    <p style="margin: 0 0 .5rem 4rem; font-size: 16px; font-weight: 600"><font color='darkpink'>1b. The TF-IDF</font><a id="1b"></a></p>$\quad$
        If we measure the importance of a <font color='#cc33ff'>term</font> in a document solely by its raw count, or, in a better-scaled manner, by its frequency inside the document, fittingly called <font color='darkpink'>term frequency (TF)</font>, we will find that some uninformative <font color='#cc33ff'>terms</font>, like "the", appear more important than highly informative ones, like "brilliant" or "terrible". The most common way to downscale the importance of a very common but uninformative <font color='#cc33ff'>term</font> <font color='#1aabff'>$t$</font> in a document is to multiply its <font color='#cc33ff'>TF</font> by a factor which decreases when the proportion of documents containing <font color='#1aabff'>$t$</font> increases. This factor is called the <font color='darkpink'>inverse document frequency (IDF)</font>, and the product of the <font color='#cc33ff'>term frequency (TF)</font> of <font color='#1aabff'>$t$</font> by its <font color='#cc33ff'>inverse document frequency (IDF)</font> is called its <font color='darkpink'>term frequency-inverse document frequency (TF-IDF)</font>.<br>$\quad$
        More precisely, the <font color='#cc33ff'>term frequency (TF)</font> of a <font color='#cc33ff'>term</font> <font color='#1aabff'>$t$</font> inside a document <font color='#1aabff'>$D\,$</font> (or rather, since we are talking about <font color='#cc33ff'>terms</font>, in the <font color='#cc33ff'>processed version</font> of <font color='#1aabff'>$D$</font>) is the ratio between the number of times <font color='#1aabff'>$t$</font> appears in <font color='#1aabff'>$D$</font>, say <font color='#1aabff'>$n_D(t)$</font>, and the total number of <font color='#cc33ff'>terms</font> in <font color='#1aabff'>$D$</font>, say <font color='#1aabff'>$N_D$</font> :<br><div style="margin-top: 10px"></div><font color = darkpink>
        $$\text{TF}\,(D,t) \;=\; \frac{n_D(t)}{N_D} \qquad\quad (1)$$</font><br>$\quad$
        Historically, the <font color='#cc33ff'>inverse document frequency (IDF)</font> of a <font color='#cc33ff'>term</font> <font color='#1aabff'>$t$</font> is the natural logarithm of the ratio between the total number of documents <font color='#1aabff'>$\left|\mathcal{D}\,\right|$</font> and the number of documents <font color='#1aabff'>$\left|\mathcal{D}_t\,\right|$</font> which contain <font color='#1aabff'>$t$</font> at least once :<br><div style="margin-top: 10px"></div><font color = darkpink>
        $$\text{IDF}\,(t) \;=\; \ln\left( \frac{\left|\mathcal{D}\,\right|}{\left|\mathcal{D}_t\,\right|} \right) \qquad\quad (2)$$</font><br>$\quad$
    This could lead to a division by $0$ if a <font color='#1aabff'>term</font> <font color='#1aabff'>$t$</font> in our <font color='#cc33ff'>vocabulary</font> is present in none of the documents, which may happen if our <font color='#cc33ff'>vocabulary</font> is defined independently of the documents rather than being extracted from them. To avoid this, modern implementations add $1$ to the denominator. But adding $1$ to the denominator alone (as S. Raschka does in the book) makes the <font color='#cc33ff'>inverse document frequency</font> of a term that appears in every document strictly negative. Adding $1$ to both the numerator and the denominator avoids this oddity, and is more natural (it amounts to adding to the corpus <font color='#1aabff'>$\mathcal{D}$</font> one extra document containing every term of the vocabulary, before applying the historical <font color='#cc33ff'>IDF</font> formula); finally, adding $1$ to the result ensures that every term gets a strictly positive <font color='#cc33ff'>IDF</font>, even when it appears in all the documents. This is the formula that <font color='#cc33ff'>scikit-learn</font> uses by default :<br><div style="margin-top: 10px"></div><font color = darkpink>
        $$\text{IDF}\,(t) \;=\; \ln\left( \frac{1+\left|\mathcal{D}\,\right|}{1+\left|\mathcal{D}_t\,\right|} \right) + 1 \qquad\quad (2')$$</font><br>$\quad$
        The <font color='#cc33ff'>term frequency-inverse document frequency (TF-IDF)</font> of a <font color='#cc33ff'>term</font> <font color='#1aabff'>$t$</font> is the product of these quantities :<br><div style="margin-top: 10px"></div><font color = darkpink>
        $$\text{TF-IDF}\,(D,t) \;=\; \text{TF}\,(D,t) \times \text{IDF}\,(t) \qquad\quad (3)$$</font><div style="margin-top: 10px"><br></div>
        <p style="margin: 0 0 0 2rem; font-size: 15px; font-weight: 600"><font color='purple'>$\quad$Note :</font></p> $\quad$
    The logarithm ensures that terms appearing in very few documents are not assigned excessively high weights.
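
As a worked illustration of formulas $(1)$, $(2)$ and $(2')$, the sketch below computes them by hand for a few terms of the third example document of cell 2. It is a direct transcription of the formulas, not scikit-learn's implementation; the function names `tf` and `idf`, and the regex used to process the documents, are assumptions made for this example.

```python
import math
import re

docs = ["The sun is shining", "The weather is sweet",
        "The sun is shining, the weather is sweet, and one and one is two"]
# Processed documents: lowercase words of 2 or more word characters
processed = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]


def tf(term, tokens):
    """Formula (1): occurrences of the term divided by the number of terms."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus, smooth=True):
    """Formula (2') if smooth is True, else the historical formula (2)."""
    n_docs = len(corpus)
    n_with_term = sum(term in tokens for tokens in corpus)
    if smooth:
        return math.log((1 + n_docs) / (1 + n_with_term)) + 1
    # Formula (2): fails with a division by zero if no document contains the term
    return math.log(n_docs / n_with_term)


third = processed[2]          # 14 terms in total
for term in ("is", "sun", "two"):
    print(term,
          round(tf(term, third), 3),
          round(idf(term, processed, smooth=False), 3),
          round(idf(term, processed), 3))
# is 0.214 0.0 1.0
# sun 0.071 0.405 1.288
# two 0.071 1.099 1.693
```

With formula $(2)$, the term "is", present in all three documents, gets an IDF of $0$ and therefore a TF-IDF of $0$; with formula $(2')$ it keeps a weight of $1$.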

<div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        <font color='#cc33ff'>scikit-learn</font> implements <font color='#cc33ff'>TF</font> and <font color='#cc33ff'>TF-IDF</font> in the <font color='#cc33ff'>feature_extraction.text</font> module, through the class <font color='#cc33ff'>TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True)</font>, which transforms a <font color='#cc33ff'>count matrix</font>, like our <font color='#5555ff'>bag</font> in the previous code snippet (cell $2$), into a <font color='#cc33ff'>sparse matrix</font> containing the non-zero <font color='#cc33ff'>TF</font> (resp. the non-zero <font color='#cc33ff'>TF-IDF</font>), called the <font color='darkpink'>TF matrix</font> (resp. <font color='darkpink'>TF-IDF matrix</font>) : the second array of the <font color='#cc33ff'>TF matrix</font> (resp. <font color='darkpink'>TF-IDF matrix</font>) contains the <font color='#cc33ff'>TF</font> (resp. <font color='#cc33ff'>TF-IDF</font>) of the <font color='#cc33ff'>terms</font> instead of their total counts (note that a <font color='#cc33ff'>term</font> does not appear in a <font color='#cc33ff'>count vector</font> iff it is not present in the corresponding document, iff its <font color='#cc33ff'>TF</font> and thus <font color='#cc33ff'>TF-IDF</font> are $0$ (we suppose that we're using definition $(2')$, or - if we use $(2)$ - that no <font color='#cc33ff'>term</font> appears in every document), so the first arrays in the <font color='#cc33ff'>count matrix</font>, the <font color='#cc33ff'>TF matrix</font> and the <font color='#cc33ff'>TF-IDF matrix</font> (the arrays containing the indices of the non-zero total counts, <font color='#cc33ff'>TF</font> and <font color='#cc33ff'>TF-IDF</font> resp.) coincide; and, consequently, their third arrays are identical too).<br>
    The parameters I have selected here are :<ul>
    <li><font color='#cc33ff'>norm='l2'</font> : by default, each individual <font color='#cc33ff'>feature vector</font> has its values normalized according to the $l^{\,2}$-norm (so that its total $l^{\,2}$-norm is $1$); other possible values are <font color='#cc33ff'>norm=</font><font color='#1aabff'>'l1'</font> (for $l^{\,1}$ normalization), or <font color='#cc33ff'>norm=</font><font color='#1aabff'>None</font> (for no normalization)</li>
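    <li><font color='#cc33ff'>use_idf=True</font> : by default, the <font color='#cc33ff'>TF-IDF matrix</font> is returned; if we select <font color='#cc33ff'>use_idf=</font><font color='#1aabff'>False</font> instead, the <font color='#cc33ff'>TF matrix</font> is returned (<font color='#cc33ff'>scikit-learn</font> starts from the raw counts and leaves the scaling to the normalization step, so the rows match formula $(1)$ exactly when <font color='#cc33ff'>norm=</font><font color='#1aabff'>'l1'</font>)</li>
    <li><font color='#cc33ff'>smooth_idf=True</font> : by default, the definition $(2')$ of the <font color='#cc33ff'>IDF</font> is used to compute the <font color='#cc33ff'>TF-IDF</font>; with <font color='#cc33ff'>smooth_idf=False</font>, <font color='#cc33ff'>scikit-learn</font> uses $\,\ln\left( \left|\mathcal{D}\,\right| / \left|\mathcal{D}_t\,\right| \right) + 1\,$, ie the definition $(2)$ of the <font color='#cc33ff'>IDF</font> plus $1$, so that a <font color='#cc33ff'>term</font> present in every document still gets a non-zero weight</li></ul>$\quad$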
        For example :

In [3]:
from sklearn.feature_extraction.text import TfidfTransformer

# Set the precision for floats in NumPy arrays (here 2 digits after the decimal point)
np.set_printoptions(precision=2)
# first, the default case 'smooth_idf=True'
tfidf = TfidfTransformer(norm='l2', smooth_idf=True)
# Use our 'TfidfTransformer' to compute the TF-IDF from the count matrix "bag" obtained in cell 2,
# then print it
print(BB + "The value for the default TF-IDF when 'smooth_idf=True' :\n" + MB,
      tfidf.fit_transform(bag).toarray())
# second, we do the same in the case 'smooth_idf=False'
tfidf_without_smooth = TfidfTransformer(norm='l2', smooth_idf=False)
print(BB + "\nThe value for the TF-IDF when 'smooth_idf=False' :\n" + MB,
      tfidf_without_smooth.fit_transform(bag).toarray())

The value for the default TF-IDF when 'smooth_idf=True' :
 [[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]

The value for the TF-IDF when 'smooth_idf=False' :
 [[0.   0.41 0.   0.58 0.58 0.   0.41 0.   0.  ]
 [0.   0.41 0.   0.   0.   0.58 0.41 0.   0.58]
 [0.54 0.39 0.54 0.18 0.18 0.18 0.26 0.27 0.18]]
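
These numbers can be reproduced by hand, which makes explicit what the transformer computes. The sketch below assumes the behaviour described in scikit-learn's documentation: the raw counts are multiplied by the IDF ($\ln((1+|\mathcal{D}|)/(1+|\mathcal{D}_t|))+1$ when `smooth_idf=True`, $\ln(|\mathcal{D}|/|\mathcal{D}_t|)+1$ when `smooth_idf=False`), and each row is then divided by its $l^{2}$-norm. The helper `manual_tfidf` is a name introduced for this illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["The sun is shining", "The weather is sweet",
        "The sun is shining, the weather is sweet, and one and one is two"]
bag = CountVectorizer().fit_transform(docs)
counts = bag.toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)       # number of documents containing each term


def manual_tfidf(counts, idf):
    """Weight the raw counts by the IDF, then l2-normalize each row."""
    weighted = counts * idf
    return weighted / np.linalg.norm(weighted, axis=1, keepdims=True)


idf_smooth = np.log((1 + n_docs) / (1 + doc_freq)) + 1    # smooth_idf=True
idf_plain = np.log(n_docs / doc_freq) + 1                 # smooth_idf=False

checks = []
for smooth, idf in ((True, idf_smooth), (False, idf_plain)):
    transformer = TfidfTransformer(norm='l2', smooth_idf=smooth)
    reference = transformer.fit_transform(bag).toarray()
    same_idf = np.allclose(transformer.idf_, idf)
    same_matrix = np.allclose(manual_tfidf(counts, idf), reference)
    checks.append((same_idf, same_matrix))
    print(smooth, same_idf, same_matrix)
# True True True
# False True True
```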


<br><br><div style="font-family: Arial; font-size:14px; color: darkblue">
    <p style="margin: 0 0 .5rem 4rem; font-size: 16px; font-weight: 600"><font color='darkpink'>1c. Preparing text data</font><a id="1c"></a></p>$\quad$
        In order to facilitate understanding, after a general presentation of each sentiment analysis technique, we will apply it to a specific corpus of documents, <font color='#5555ff'>df_movies</font> (a set of movie reviews), that we loaded in cell 1.<br>$\quad$
        As we've seen in "1a. Transforming words into feature vectors", the first step is to obtain the list of its <font color='#cc33ff'>raw tokens</font>. They'll be the words, numbers, and emoticons within <font color='#5555ff'>df_movies</font>. To get them, we first need to clean our corpus, because it contains irrelevant <font color='#cc33ff'>HTML tags</font>. For example, this is the first review and its last $50$ characters - note that there are also <font color='#cc33ff'>HTML tags</font> within the review (<font color='#5555ff'>&lt;br /&gt;</font> is the <font color='#cc33ff'>HTML</font> line-break tag) - :

In [4]:
print(BB + "Our first document is :\n\t" + MB, df_movies['review'].iloc[0])
print(BB + "\nIts last 50 characters are :\n\t" + MB, df_movies['review'].iloc[0][-50:])

Our first document is :
	 In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70's, they discover the criminal and a net of power and money to cover the murder.<br /><br />"Murder in Greenwich" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother 

<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        Cleaning text data is not as straightforward a process as it may seem. For example, we need to keep emoticon characters, eg ':)' or ':(', but we also need to remove HTML tags. The author proposes a simple 'preprocessor', using regular expressions (regex) and Python's regex library, <font color='#cc33ff'>re</font>. The <font color='#5555ff'>preprocessor</font> I propose here is an improvement on his code, both in concept and results, but the logic is the same.<br>$\quad$
        You may have noticed that the <font color='#cc33ff'>bag-of-words model</font> bases its analysis on <font color='#cc33ff'>term</font> counts only; the order of the terms in the documents plays no role, so a sentence like "I don't like checkers, I prefer chess" produces the same results as "I don't like chess, I prefer checkers". This limits how accurately reviews can be interpreted. Better results will be obtained when we use <font color='#cc33ff'>transformers</font> (see part 16). But for now, after removal of the <font color='#cc33ff'>HTML tags</font>, we need to extract <font color='#cc33ff'>raw tokens</font>, in the order we want. Therefore, my code extracts words (excluding numbers), numbers, and emoticons separately (it's easier this way). Let's comment it and test it on the last 50 characters of our first document and on a short sentence :

In [5]:
import re

def preprocessor(text):
    # Step 1: Remove HTML tags
    #  '<[^>]*>' matches any substring that starts with '<', followed by zero or more characters that are not '>',
    # and ends with '>'; in other words, it matches HTML tags. Then 're.sub('<[^>]*>', '', text)' replaces all
    # occurrences of substrings matching the regular expression with an empty string; in other words, it removes
    # them from the 'text' string
    text = re.sub("<[^>]*>", "", text)
    
    # Step 2: Extract words, numbers, and emoticons
    # Extract words (preserving apostrophes while excluding numbers)
    # in an ordinary string literal, Python treats backslashes as escape characters, so a sequence like '\b' is
    # silently read as a backspace, and a sequence like '\d' triggers a SyntaxWarning; the 'r' prefix (raw string)
    # tells Python to keep backslashes as literal characters, so the pattern reaches the regex engine unchanged
    # '.lower()' replaces all uppercase characters by their lowercase version
    words = re.findall(r"\b[a-zA-Z][\w']*\b", text.lower())
    # Extract numbers (keep decimal numbers intact)
    numbers = re.findall(r"\d+\.\d+|\d+", text)    # '|' means "or"
    # Extract emoticons
    # we improve on the author's pattern, "(?::|;|=)(?:-)?(?:\)|\(|D|P)", by also matching the emoticons that end
    # with a lowercase 'd' or 'p'. This seems to be mere cosmetics here : results and speeds in 2a. are identical
    # '?:' indicates a non-capturing group, meaning the parentheses will be used for grouping only; eg '(?::|;|=)'
    # matches ':', or ';', or '='; then '(?:-)?' means an optional '-'; finally, '(?:\)|\(|[DdPp])' means that this
    # must be followed by either ')', '(', or one of the letters 'D', 'd', 'P' or 'p'
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|[DdPp])", text)
    
    # Step 3: Return the list of raw tokens
    # since hyphens don't change the meaning of emoticons, and since we want to count each term with unique meaning,
    # we remove potential hyphens, so if we have 5 ':-)' and 3 ':)', it counts 8 ':)'
    raw_tokens = words + numbers + [e.replace("-", "") for e in emoticons]
    return raw_tokens

print(BB + "Our preprocessor transforms the last 50 characters of our first document into :\n" + MB,
      preprocessor(df_movies.loc[0, 'review'][-50:]))
attempt = "<div>I can't believe it's already 2025 :-), and the version 2.0 still costs 'only' 3.14!</div>"
print(BB + f'Our preprocessor transforms "{attempt}" into :\n' + MB,
     preprocessor(attempt))

Our preprocessor transforms the last 50 characters of our first document into :
 ['is', 'seven', 'title', 'brazil', 'not', 'available']
Our preprocessor transforms "<div>I can't believe it's already 2025 :-), and the version 2.0 still costs 'only' 3.14!</div>" into :
 ['i', "can't", 'believe', "it's", 'already', 'and', 'the', 'version', 'still', 'costs', 'only', '2025', '2.0', '3.14', ':)']

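The order-insensitivity of the bag-of-words model mentioned before cell 5 can be checked in a few lines. This is a minimal sketch : `simple_tokens` is a hypothetical, deliberately crude tokenizer (not our preprocessor), and the two sentences are the ones quoted in the text.

```python
import re
from collections import Counter

def simple_tokens(text):
    """Hypothetical minimal tokenizer: lowercase words, apostrophes kept."""
    return re.findall(r"[a-z][a-z']*", text.lower())

tokens_a = simple_tokens("I don't like checkers, I prefer chess")
tokens_b = simple_tokens("I don't like chess, I prefer checkers")

# A bag of words is a multiset of terms: the positions are discarded
bag_a, bag_b = Counter(tokens_a), Counter(tokens_b)
print(bag_a)
print("Same bag of unigrams :", bag_a == bag_b)

# Counting pairs of adjacent tokens keeps a little of the word order
pairs_a = Counter(zip(tokens_a, tokens_a[1:]))
pairs_b = Counter(zip(tokens_b, tokens_b[1:]))
print("Same bag of bigrams :", pairs_a == pairs_b)
```

The two opposite opinions share the same unigram counts, but not the same counts of adjacent pairs, which is the motivation for adding bigrams to a vocabulary.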

<div style="margin-top: 20px"></div><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        As a comparison, the book's preprocessor returns "is seven title brazil not available" and "i can t believe it s already 2025 and the version 2 0 still costs only 3 14 :)" respectively. Apostrophes and dots have been replaced by spaces, which leaves us with uninformative <font color='#cc33ff'>raw tokens</font> like 't' or 's', and splits each decimal number into two <font color='#cc33ff'>raw tokens</font>, ie two integers that mean little on their own, especially the decimal part. This is not to criticize the book's <font color='#5555ff'>preprocessor</font>, since it is similar to mine; besides, if accuracy were the ultimate goal, it would be better to use deep learning methods (see part 16). The goal of this part is rather to introduce ideas and definitions for that later part.
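
For readers who want to reproduce this comparison, here is a sketch of the book's approach. The function `book_preprocessor` is my reconstruction, based on the emoticon pattern quoted in cell 5 and on the outputs reported above; treat it as an assumption and check the author's repository for the exact code.

```python
import re

def book_preprocessor(text):
    """Assumed reconstruction of the book's preprocessor (returns a string)."""
    text = re.sub(r"<[^>]*>", "", text)                      # remove HTML tags
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    # Replace every run of non-word characters by one space, then append the emoticons
    text = re.sub(r"[\W]+", " ", text.lower()) + " ".join(emoticons).replace("-", "")
    return text

attempt = "<div>I can't believe it's already 2025 :-), and the version 2.0 still costs 'only' 3.14!</div>"
print(book_preprocessor(attempt))
```

Since every run of non-word characters becomes a space, the apostrophe in "can't" and the dot in "3.14" both act as separators, which explains the isolated 't', 's' and the split decimals.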

<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        The second step is to obtain the list of <font color='#cc33ff'>processed tokens</font>, from our list of <font color='#cc33ff'>raw tokens</font>. There are three possible strategies : keep our <font color='#cc33ff'>raw tokens</font> unchanged, or use a <font color='darkpink'>stemmer</font>, or use a <font color='darkpink'>lemmatizer</font>.<br>$\quad$
        <font color='darkpink'>Word stemming</font> is the process of using a <font color='#cc33ff'>stemmer</font>, ie a function to transform a word into its <font color='darkpink'>stem</font>, ie its root form; it allows us to map related words to the same stem, and use a single stem for a family of <font color='#cc33ff'>raw tokens</font>.<br>$\quad$
        The original <font color='#cc33ff'>stemming algorithm</font> was developed by Martin F. Porter in 1979 and is hence known as the <font color='darkpink'>Porter stemmer</font> algorithm. This is the one used in the book, but other <font color='#cc33ff'>word stemming techniques</font> do exist, eg the <font color='darkpink'>Lancaster stemmer</font> and the <font color='darkpink'>Snowball stemmer</font>.<br>$\quad$
        The <font color='#cc33ff'>Snowball stemmer</font> is the newest of the three; it is an improvement over the original <font color='#cc33ff'>Porter stemmer</font> and is often called <font color='darkpink'>Porter 2</font>. The <font color='#cc33ff'>Snowball stemmer</font> is commonly favored for general applications due to its balance of accuracy and consistency, and it is the one we will use here, because it yields better results, and faster.<br>$\quad$
        The <font color='#cc33ff'>Lancaster stemmer</font> is the most aggressive of the three and will produce shorter and more obscure words, according to both S. Raschka and ChatGPT.<br>$\quad$
        These three <font color='#cc33ff'>stemmers</font> are available through the <font color='#cc33ff'>nltk package</font> (<font color='#cc33ff'>nltk</font> stands for "Natural Language Toolkit"), so we need to install this package (run <font color='#5555ff'>pip install nltk</font> in your console, or <font color='#5555ff'>!pip install nltk</font> in a new cell anywhere, that you can then remove).<br>$\quad$
        In the following code, we use a small text in which several words have the same root : "Better runners like running ;-), and thus they ran a lot :-).". We compare the list of <font color='#cc33ff'>raw tokens</font> obtained with our <font color='#5555ff'>preprocessor</font> to the list we obtain when we add either a <font color='#cc33ff'>Porter stemmer</font> or a <font color='#cc33ff'>Snowball stemmer</font>. We can see that both <font color='#cc33ff'>stemmers</font> reduce "running" to the root "run", but they both fail to recognize that "runners" (reduced to "runner") and "ran" also share the same root. Besides, while the <font color='#cc33ff'>Porter stemmer</font> reduced "thus" to the meaningless "thu", the <font color='#cc33ff'>Snowball stemmer</font> kept it as "thus"; this illustrates that <font color='#cc33ff'>stemmers</font> chop off word endings by following simple heuristic rules that may not always produce meaningful base terms.<div style="margin-top: 20px"></div>$\quad$
        A more refined approach is <font color='darkpink'>lemmatization</font>, which relies on a well-formatted dictionary (eg <font color='#cc33ff'>WordNet</font>, also offered by <font color='#cc33ff'>NLTK</font>) to map words to their correct root form, called the <font color='darkpink'>lemma</font> (eg "ran" to "run", and even "better" to "good"). Although rather complete (eg WordNet contains about $155,000$ words, some with multiple meanings), such a dictionary may still miss technical, misspelled, or uncommon words, which are then left unchanged unless handled separately. Unlike <font color='#cc33ff'>stemming</font>, which might trim words down too aggressively, <font color='#cc33ff'>lemmatization</font> returns actual words and is generally more accurate, but it is computationally heavier and requires language-specific dictionaries.<br>$\quad$
        The class to use with <font color='#cc33ff'>WordNet</font> is called <font color='#cc33ff'>WordNetLemmatizer()</font>. It has a method, <font color='#cc33ff'>.lemmatize(word, pos)</font>, where <font color='#cc33ff'>word</font> is the word to lemmatize and <font color='#cc33ff'>pos</font> is the part-of-speech (POS) tag corresponding to the grammatical function of this word, which is needed to determine the correct <font color='darkpink'>lemma</font> (eg "running" as a noun should remain "running", but as a verb it should be reduced to "run").<br>
        However, <font color='#cc33ff'>WordNetLemmatizer()</font> does not assign POS tags itself, so we use the <font color='#cc33ff'>pos_tag</font> function from <font color='#cc33ff'>NLTK</font>, which assigns Penn Treebank POS tags (eg <font color='#5555ff'>NN</font> for a singular noun, <font color='#5555ff'>NNS</font> for a plural noun, <font color='#5555ff'>NNP</font> for a proper noun,..., <font color='#5555ff'>VB</font> for a base form verb, <font color='#5555ff'>VBD</font> for a past tense verb, etc.). Since these tags are not directly compatible with <font color='#cc33ff'>WordNet</font> (which exclusively expects <font color='#5555ff'>wordnet.NOUN</font>, <font color='#5555ff'>wordnet.VERB</font>, <font color='#5555ff'>wordnet.ADJ</font> and <font color='#5555ff'>wordnet.ADV</font>), we define a mapping (<font color='#5555ff'>get_wordnet_pos</font>) to convert the first format into the second.<div style="margin-top: 20px"></div>$\quad$
        In the provided code, we define a few functions to output the <font color='#cc33ff'>processed tokens</font> from any input text. The first is simply our <font color='#5555ff'>preprocessor</font>, where our <font color='#cc33ff'>processed tokens</font> are our <font color='#cc33ff'>raw tokens</font>; the next two add a stemmer, a <font color='#cc33ff'>Porter stemmer</font> for one, and a <font color='#cc33ff'>Snowball stemmer</font> for the other; while the last introduces a <font color='#cc33ff'>WordNet lemmatizer</font>.<br>$\quad$
        In order to run the following code snippets, we need to download a few <font color='#cc33ff'>NLTK</font> files. The simple lemmatizer I use here yields results that are very slightly worse than the best results obtained otherwise, and it runs very slowly, between three and five times slower than the stemmer based models, so if you prefer (eg if your computer is even slower than mine), simply download the <font color='#cc33ff'>stopwords</font> list and don't run the lemmatizer based models (I provide their results, so you can read them anyway)... Open a new cell anywhere, and copy-paste :<br><font color='#5555ff'>
    import nltk<br>
    nltk.download("wordnet")<br>
    nltk.download("averaged_perceptron_tagger_eng")<br>
    nltk.download('punkt')<br>
    nltk.download('punkt_tab')<br>
    nltk.download("stopwords")</font><br>$\quad$
    I took the liberty to make you download <font color='#cc33ff'>NLTK</font>'s <font color='#cc33ff'>stopwords</font> list, that we will use shortly. Once your downloads are complete, you may remove the cell; otherwise, you will get ugly messages informing you that your files have already been downloaded and are up to date.

In [6]:
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

# Tokenize with 'preprocessor' (defined in cell 5) only
def base_tok_0(text):
    # Preprocess 'text', using our 'preprocessor' function defined in cell 5
    return preprocessor(text)

# Tokenize with 'preprocessor' followed by Porter Stemmer
porter = PorterStemmer()
def tok_port_0(text):
    return [porter.stem(w) for w in preprocessor(text)]

# Tokenize with 'preprocessor' followed by Snowball Stemmer
snowball = SnowballStemmer("english")
def tok_snow_0(text):
    return [snowball.stem(w) for w in preprocessor(text)]

# Tokenize with a variant of 'preprocessor' followed by WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()
# Function to map Penn Treebank POS tags to WordNet POS tags
def get_wordnet_pos(tag):
    tag_dict = {"N": wordnet.NOUN, "V": wordnet.VERB, "J": wordnet.ADJ, "R": wordnet.ADV}    # Using uppercase POS first letter
    return tag_dict.get(tag[0].upper(), wordnet.NOUN)    # Extracts first character, defaults to noun
# Tokenize to raw tokens, then lemmatize to processed tokens
def tok_wnet_0(text):
    # Extract words, numbers, and emoticons from 'text'
    raw_words_list = re.findall(r"\b[a-zA-Z][\w']*\b", text.lower())
    raw_numbers_list = re.findall(r"\d+\.\d+|\d+", text)
    raw_emoticons_list = [e.replace("-", "") for e in re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|[DdPp])", text)]
    # The WordNet lemmatizer needs to use its own tokenizer, so we build the string composed of the words in 'raw_words_list'
    raw_words_string = " ".join(raw_words_list)
    wordnet_raw_words = nltk.word_tokenize(raw_words_string)    # Proper tokenization when using lemmatization
    tagged_raw_words = nltk.pos_tag(wordnet_raw_words)
    proc_words_list = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged_raw_words]
    # Return the whole list of processed tokens
    return proc_words_list + raw_numbers_list + raw_emoticons_list

# Test different tokenizers
text_sample = "Better runners like running ;-), and thus they ran a lot :-)."
print(BB + 'The example sentence is :"' + MB, text_sample + '"')
print(BB + "\nIf we only use our preprocessor without a stemmer, the sentence becomes :\n" + MB, base_tok_0(text_sample))
print(BB + "\nIf we add the Porter stemmer, it becomes:\n" + MB, tok_port_0(text_sample))
print(BB + "\nIf we add the Snowball stemmer instead, it becomes:\n" + MB, tok_snow_0(text_sample))
print(BB + "\nIf we lemmatize with WordNet, it becomes:\n" + MB, tok_wnet_0(text_sample))

The example sentence is :" Better runners like running ;-), and thus they ran a lot :-)."

If we only use our preprocessor without a stemmer, the sentence becomes :
 ['better', 'runners', 'like', 'running', 'and', 'thus', 'they', 'ran', 'a', 'lot', ';)', ':)']

If we add the Porter stemmer, it becomes:
 ['better', 'runner', 'like', 'run', 'and', 'thu', 'they', 'ran', 'a', 'lot', ';)', ':)']

If we add the Snowball stemmer instead, it becomes:
 ['better', 'runner', 'like', 'run', 'and', 'thus', 'they', 'ran', 'a', 'lot', ';)', ':)']

If we lemmatize with WordNet, it becomes:
 ['good', 'runner', 'like', 'run', 'and', 'thus', 'they', 'run', 'a', 'lot', ';)', ':)']

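To get a feeling for why rule-based suffix stripping can output something like "thu", here is a toy stemmer. It is NOT the Porter or the Snowball algorithm, only a two-rule sketch of the same kind of heuristics; the function `toy_stem` and its flag `protect_us_ss` are hypothetical names of mine. As far as I understand, an exception of this kind (leaving words ending in "us" or "ss" alone) is what lets the Snowball stemmer keep "thus" intact.

```python
def toy_stem(word, protect_us_ss=False):
    """Toy suffix stripper, NOT the Porter or Snowball algorithm."""
    # Rule 1: drop a final "ing" and undouble a trailing consonant ("running" -> "run")
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    # Rule 2: drop a final "s"; optionally leave words ending in "us" or "ss" alone
    if word.endswith("s"):
        if protect_us_ss and word.endswith(("us", "ss")):
            return word
        return word[:-1]
    return word

words = ["runners", "running", "ran", "thus"]
print("Without the exception :", [toy_stem(w) for w in words])
print("With the exception    :", [toy_stem(w, protect_us_ss=True) for w in words])
```

Neither variant can relate "ran" to "run" : no rule on word endings can, which is where a dictionary-based lemmatizer helps.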

<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        Finally, we may want to remove <font color='darkpink'>stop words</font>. <font color='#cc33ff'>Stop words</font> are simply those words that are extremely common in all sorts of texts and probably bear (almost) no useful information; for example "the", "is", "and", "has", etc. Removing <font color='#cc33ff'>stop words</font> is especially useful if we are working with raw counts or normalized <font color='#cc33ff'>TF</font> rather than <font color='#cc33ff'>TF-IDF</font>, which already tries to downweight the frequently occurring <font color='#cc33ff'>terms</font>; but still, if our documents are movie reviews, informative words like "good" will likely appear in nearly all documents, in which case their <font color='#cc33ff'>inverse document frequency</font> will be close to $1$, and their <font color='#cc33ff'>TF-IDF</font> won't be a better measure of their usefulness than their <font color='#cc33ff'>TF</font>. It is therefore interesting to establish a list of frequent words we believe will only play a marginal role in the classification of our documents (for example as positive or negative), called the <font color='#cc33ff'>stop words list</font>, and to remove from our vocabulary the <font color='#cc33ff'>terms</font> that are in this list.<br>$\quad$
        As an example, we will use <font color='#cc33ff'>NLTK</font>'s <font color='#cc33ff'>stopwords</font> list we've just downloaded. It is an all-around list of common English <font color='#cc33ff'>words</font> (127 in the version you can check here : https://gist.github.com/sebleier/554280; more recent versions of <font color='#cc33ff'>NLTK</font> ship a longer list). It is not ideal for sentiment analysis if we want to build a <font color='#cc33ff'>vocabulary</font> with <font color='#1aabff'>$n$</font><font color='#cc33ff'>-grams</font> for <font color='#1aabff'>$\,n \geq 2\,$</font>; for example it contains "not", which is a relevant word in <font color='#cc33ff'>bigrams</font> like "not good", and these <font color='#cc33ff'>bigrams</font> are meaningful in sentiment analysis. Therefore, it is recommended to build <font color='#cc33ff'>vocabularies</font> with at least <font color='#cc33ff'>unigrams</font> and <font color='#cc33ff'>bigrams</font> and to remove some of the words from the <font color='#cc33ff'>stopwords</font> list. This is what we do here (the words I decided to keep are those that seemed meaningful to me in <font color='#cc33ff'>bigrams</font>; you may decide to keep fewer or more words if you wish) :

In [7]:
from nltk.corpus import stopwords

nltk_stopwords = stopwords.words('english')
kept_words = ["but", "as", "until", "at", "against", "between", "into", "through", "during", "before", "after",
              "above", "below", "up", "down", "over", "under", "again", "once", "here", "there", "why", "how",
              "all", "any", "both", "more", "most", "some", "such", "no", "nor", "not", "only", "so", "than",
              "too", "very", "just", "now"]
new_stopwords = [w for w in nltk_stopwords if w not in kept_words]

# The sentence is stored once, so that the printed message always matches the text that is tokenized
sentence = "Runners like running and thus they run a lot"
print(BB + "If we use the Snowball stemmer and remove the 'nltk' stopwords, \"" + sentence + "\" becomes :\n" + MB,
      [w for w in tok_snow_0(sentence) if w not in new_stopwords])

If we use the Snowball stemmer and remove the 'nltk' stopwords, "Runners like running and thus they run a lot" becomes :
 ['runner', 'like', 'run', 'thus', 'run', 'lot']


<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        We include the <font color='#5555ff'>new_stopwords</font> list in our models like so (I won't use <font color='#cc33ff'>stop words</font> with our <font color='#cc33ff'>lemmatizer</font>, because it takes a lot of time to run, but it works the same) :

In [8]:
# Define the functions for tokenization using the 'new_stopwords' list

def base_tok(text):
    return [w for w in preprocessor(text) if w not in new_stopwords]

porter = PorterStemmer()
def tok_port(text):
    return [porter.stem(w) for w in preprocessor(text) if w not in new_stopwords]

snowball = SnowballStemmer("english")
def tok_snow(text):
    return [snowball.stem(w) for w in preprocessor(text) if w not in new_stopwords]
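
The reason for keeping a word like "not" out of the stop words list can be illustrated on a tiny example. This is a sketch only : the two mini stop lists below are hypothetical and much shorter than `new_stopwords`, and `bigrams` is a helper of mine.

```python
def bigrams(tokens):
    """Return the list of adjacent token pairs, each joined by a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

tokens = ["this", "movie", "is", "not", "good"]

# Hypothetical mini stop lists, for illustration only
full_stop_list = {"this", "is", "not"}
reduced_stop_list = full_stop_list - {"not"}     # we decide to keep "not"

kept_full = [t for t in tokens if t not in full_stop_list]
kept_reduced = [t for t in tokens if t not in reduced_stop_list]

print(kept_full, "->", bigrams(kept_full))
print(kept_reduced, "->", bigrams(kept_reduced))
```

With the full list, the review is reduced to "movie good" and the negation is lost; with the reduced list, the bigram "not good" survives and can become a term of the vocabulary.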

<a id="2"></a><br><br><br><div style="font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">
    <h4 style="margin: 0 0 .5rem 2rem; font-size: 18px; font-weight: 800"><font color='darkpink'>2. LOGISTIC REGRESSION MODEL FOR DOCUMENT CLASSIFICATION</font></h4>
    <p style="margin: 0 0 .5rem 4rem; font-size: 16px; font-weight: 600"><font color='darkpink'>2a. Training a logistic regression model for our movie reviews dataset</font><a id="2a"></a></p>$\quad$
        In the previous subchapter, we constructed a few <font color='#cc33ff'>tokenizers</font>, each able to build a meaningful <font color='#cc33ff'>vocabulary</font> from any corpus of documents. It is now time to train a model to classify our movie reviews as positive or negative.<br>$\quad$
        Each of our <font color='#cc33ff'>tokenizers</font> can cut any string (usually a document in a corpus of documents) into individual pieces (called <font color='#cc33ff'>raw tokens</font>), and process these pieces (eg replace them by their <font color='#cc33ff'>stem</font> or by their <font color='#cc33ff'>lemma</font>, or don't modify them at all) to obtain our <font color='#cc33ff'>processed tokens</font>; we can then screen these <font color='#cc33ff'>processed tokens</font>, first by removing the ones that belong to a list (the <font color='#cc33ff'>stop words</font>), then by keeping only the <font color='#1aabff'>$n$</font> most frequent (or removing the less frequent and/or the most frequent). The <font color='#cc33ff'>processed tokens</font> that we keep are called <font color='#cc33ff'>terms</font>, and the list of <font color='#cc33ff'>terms</font> is our <font color='#cc33ff'>vocabulary</font>.<br>$\quad$
        In order to use the various machine learning techniques we have explored so far, we build our dataset matrix : its rows (called <font color='#cc33ff'>examples</font> or <font color='#cc33ff'>instances</font>) will be our documents; its columns (called <font color='#cc33ff'>features</font>) will be our <font color='#cc33ff'>terms</font>. Our dataset matrix can either hold the number of individual <font color='#cc33ff'>terms</font> that each document contains (ie the <font color='#cc33ff'>count matrix</font>), or the usually more informative <font color='#cc33ff'>TF-IDF</font> of the terms included in each document (ie the <font color='#cc33ff'>TF-IDF matrix</font>). That our dataset matrix is represented as a sparse matrix - because it is much less memory-demanding - is a technical idea that we have seen, but it plays no role in the understanding of this chapter, so we won't mention it; we simply need to remember that the shape of our dataset matrix is not the usual one, but we can still use our usual machine learning tools.<div style="margin-top: 20px"></div>$\quad$
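
The path from processed tokens to the count matrix can be sketched without any library. The toy documents, the helper `ngrams` and the cut-off of 5 terms are hypothetical choices of mine, made so that no tie occurs at the cut-off; scikit-learn's vectorizers do the same job on sparse matrices.

```python
from collections import Counter

def ngrams(tokens, n_max=2):
    """Unigrams plus all n-grams up to n_max, each joined by spaces."""
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

# Already tokenized toy documents (hypothetical, for illustration only)
docs = [["not", "good"], ["not", "good", "film"], ["very", "good", "film"]]

# Count every term over the corpus, then keep the 5 most frequent as our vocabulary
corpus_counts = Counter(g for doc in docs for g in ngrams(doc))
vocabulary = sorted(term for term, _ in corpus_counts.most_common(5))

# Count matrix: one row per document, one column per term of the vocabulary
count_matrix = [[ngrams(doc).count(term) for term in vocabulary] for doc in docs]

print(vocabulary)
for row in count_matrix:
    print(row)
```

The terms "very" and "very good" appear only once in the corpus, so they are screened out, and the third document is described by the remaining terms only.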
        As usual, we split our dataset into a training and a test set. Since the <font color='#5555ff'>df_movies</font> dataset is rather large, we can train our model on the first half (25,000 movie reviews), and keep the other half for testing.<br>$\quad$
        Then, we choose a <font color='#cc33ff'>tokenizer</font>, build the corresponding <font color='#cc33ff'>vocabulary</font> (we will use <font color='#cc33ff'>unigrams</font> and <font color='#cc33ff'>bigrams</font>), and compute the <font color='#cc33ff'>TF-IDF matrix</font> corresponding to our <font color='#cc33ff'>vocabulary</font> and our training set.<br>$\quad$
        As we saw in "1b. The TF-IDF", we can compute the <font color='#cc33ff'>TF-IDF matrix</font> by first computing the <font color='#cc33ff'>count matrix</font>, using <font color='#cc33ff'>CountVectorizer</font>, and then the <font color='#cc33ff'>TF-IDF matrix</font> from the <font color='#cc33ff'>count matrix</font>, using <font color='#cc33ff'>TfidfTransformer</font>. However, <font color='#cc33ff'>scikit-learn</font> offers the <font color='#cc33ff'>TfidfVectorizer</font> class (in its <font color='#cc33ff'>feature_extraction.text</font> module), which is a direct equivalent to <font color='#cc33ff'>CountVectorizer</font> followed by <font color='#cc33ff'>TfidfTransformer</font>. The first part of its parameters coincides with the parameters of <font color='#cc33ff'>CountVectorizer</font>, and the second part coincides with the parameters of <font color='#cc33ff'>TfidfTransformer</font>. Therefore, in accordance with the parameters I have selected and presented above, we can do both steps with <font color='#cc33ff'>TfidfVectorizer(input='content', encoding='utf-8', decode_error='strict', lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, norm='l2', use_idf=True, smooth_idf=True)</font>.<br>$\quad$
        Like S. Raschka, we placed our <font color='#5555ff'>preprocessor</font> (implemented in cell 5), a potential <font color='#cc33ff'>stemmer</font> or <font color='#cc33ff'>lemmatizer</font> (presented in "1c. Preparing text data"), and a potential <font color='#cc33ff'>stop words list</font> (ibid), all within our <font color='#cc33ff'>tokenizers</font> (see cells 6 and 8). Therefore, although we do lowercase the documents, apply a <font color='#cc33ff'>preprocessor</font> based on <font color='#cc33ff'>token patterns</font>, and, for some of our <font color='#cc33ff'>tokenizers</font>, remove a <font color='#cc33ff'>stop words list</font>, all of this happens inside the function passed to the <font color='#cc33ff'>tokenizer parameter</font>, and we will set <font color='#cc33ff'>lowercase=</font><font color='#1aabff'>False</font>, <font color='#cc33ff'>preprocessor=</font><font color='#1aabff'>None</font>, <font color='#cc33ff'>stop_words=</font><font color='#1aabff'>None</font>, and <font color='#cc33ff'>token_pattern=</font><font color='#1aabff'>None</font>.<br>$\quad$
        S. Raschka uses the default values of all other parameters. We do the same, except that, instead of using <font color='#cc33ff'>processed tokens</font> of length <font color='#1aabff'>$1$</font> with <font color='#cc33ff'>ngram_range=</font><font color='#1aabff'>($1$, $1$)</font>, we use <font color='#cc33ff'>ngram_range=</font><font color='#1aabff'>($1$, $2$)</font>, in order to distinguish between "not good" and "very good", for example; this slightly improves the results (by around $1\%$ in the various configurations I tried).<br>$\quad$
        Once we have transformed our training set of $25,000$ reviews into a training <font color='#cc33ff'>TF-IDF matrix</font> of $25,000$ rows containing the <font color='#cc33ff'>TF-IDF</font> of the <font color='#cc33ff'>terms</font> in the <font color='#cc33ff'>vocabulary</font> built by our <font color='#cc33ff'>tokenizer</font>, we will train a <font color='#cc33ff'>logistic regression model</font> (see part 3, "2. The logistic regression", for a general presentation of <font color='#cc33ff'>logistic regression</font>) to classify any document given the <font color='#cc33ff'>TF-IDF vector</font> of the <font color='#cc33ff'>terms</font> it contains. For our <font color='#cc33ff'>logistic regression</font>, we will use <font color='#cc33ff'>scikit-learn</font>'s class <font color='#cc33ff'>LogisticRegression(penalty='l2', C=1.0, random_state=None, solver='lbfgs', max_iter=100, n_jobs=None, l1_ratio=None)</font> (see part 3, "2f. Logistic regression with scikit-learn"). S. Raschka uses <font color='#cc33ff'>solver=</font><font color='#1aabff'>'liblinear'</font>, but I will use <font color='#cc33ff'>solver=</font><font color='#1aabff'>'saga'</font>, which tends to give the same results, only slightly faster.<div style="margin-top: 20px"></div>$\quad$
        To summarize, our models are composed of a <font color='#cc33ff'>TfidfVectorizer</font> followed by a <font color='#cc33ff'>LogisticRegression</font>; S. Raschka groups both classes into a <font color='#cc33ff'>Pipeline</font> (see part 6, "1. Pipelines"), so we can use a <font color='#cc33ff'>GridSearchCV</font> object (see part 6, "4a. Grid search") to determine the best of $8$ models, which differ on three couples of parameters (the other parameters are set to their default values) : the first two correspond to the choice of the <font color='#cc33ff'>tokenizer</font> in <font color='#cc33ff'>TfidfVectorizer</font> (either a simple preprocessor - a weaker version of our <font color='#5555ff'>preprocessor</font> - or this preprocessor followed by a <font color='#cc33ff'>Porter stemmer</font>; both with or without applying <font color='#cc33ff'>NLTK</font>'s <font color='#cc33ff'>stop words list</font>), while the last choice corresponds to the amount of <font color='#cc33ff'>$L^2$-regularization</font> in <font color='#cc33ff'>LogisticRegression</font> (with two values for the <font color='#cc33ff'>inverse regularization parameter</font>, <font color='#cc33ff'>C=</font><font color='#1aabff'>1</font> - the default - and <font color='#cc33ff'>C=</font><font color='#1aabff'>10</font>). Moreover, it uses <font color='#cc33ff'>verbose=</font><font color='#1aabff'>2</font> and <font color='#cc33ff'>n_jobs=</font><font color='#1aabff'>1</font>, so that its <font color='#cc33ff'>GridSearchCV</font> object prints the times needed to train each model on the $5$ fold partitions.<br>$\quad$
        I use a similar code, but, instead of <font color='#cc33ff'>verbose=</font><font color='#1aabff'>2</font>, I use <font color='#cc33ff'>verbose=</font><font color='#1aabff'>3</font>, so that the <font color='#cc33ff'>GridSearchCV</font> object prints the timings and the accuracies obtained on the validation folds. This allows to compare the models with one another during training, rather than merely getting the results of the best model at the end. Instead of the four models from the book, I test our seven models : <font color='#5555ff'>base_tok_0</font> (our simple <font color='#5555ff'>preprocessor</font>), <font color='#5555ff'>tok_port_0</font>, <font color='#5555ff'>tok_snow_0</font>, and <font color='#5555ff'>tok_wnet_0</font>, together with the versions of the first three with the removal of our <font color='#cc33ff'>stop words list</font> (<font color='#5555ff'>new_stopwords</font>) included, denoted by <font color='#5555ff'>base_tok</font>, <font color='#5555ff'>tok_port</font>, and <font color='#5555ff'>tok_snow</font> respectively. I also test the influence of the <font color='#cc33ff'>$L^2$-regularization</font> with <font color='#cc33ff'>C=</font><font color='#1aabff'>$1$</font> and <font color='#cc33ff'>C=</font><font color='#1aabff'>$10$</font>, and I add <font color='#cc33ff'>C=</font><font color='#1aabff'>$20$</font> (I wanted a larger value, but this slowed down the training of several models so much that they failed to reach the default convergence criterion before the allowed number of iterations, and my laptop is too slow - or my patience too thin - to ask for more). Finally, I test the influence of limiting the number of <font color='#cc33ff'>terms</font>, with <font color='#cc33ff'>max_features=</font><font color='#1aabff'>None</font>, <font color='#cc33ff'>max_features=</font><font color='#1aabff'>$10,000$</font>, and <font color='#cc33ff'>max_features=</font><font color='#1aabff'>$5,000$</font>.<br>
    Another difference is that, within the <font color='#cc33ff'>GridSearchCV</font> object, the book uses the default <font color='#cc33ff'>cv=</font><font color='#1aabff'>$5$</font>; since we are dealing with a classification task (ie <font color='#5555ff'>y_train</font> is composed of integers - here only <font color='#1aabff'>$0$</font> and <font color='#1aabff'>$1$</font> -), <font color='#cc33ff'>GridSearchCV</font> will use <font color='#cc33ff'>stratified cross-validation folds</font> (ie folds with the same proportions of instances of each class as in the whole training set; see part 6, "2. $k$-fold cross-validation"); in order to get reproducible results, instead of <font color='#cc33ff'>cv=</font><font color='#1aabff'>$5$</font>, I use <font color='#cc33ff'>cv=StratifiedKFold(n_splits=<font color='#1aabff'>$5$</font>, shuffle=<font color='#1aabff'>True</font>, random_state=<font color='#1aabff'>$1$</font>)</font>.<br>$\quad$
        You will find the tables that present, for each model, the average of the validation accuracies it obtained over the <font color='#1aabff'>$5$</font> fold partitions, together with the total amount of time my laptop needed to get them, below two code snippets. The first code snippet is used for the <font color='#cc33ff'>grid search</font> when we use all the <font color='#cc33ff'>terms</font> in a vocabulary, the second code snippet is used when we limit our terms to the <font color='#1aabff'>$5,000$</font> most frequent. I wrote a few guidelines, so you can easily experiment with the values you want, including the ones I used to build the tables. Because they take a lot of time to complete, especially with a computer as slow as mine, by default I run the snippets with only the two best options in each case, using all available cores (via <font color='#cc33ff'>n_jobs=</font><font color='#1aabff'>$-\!1$</font>); this reduces the time needed to run the notebook, but, unfortunately, the intermediate results won't be printed, and the <font color='#cc33ff'>lemmatizer</font> won't run. Don't worry if you don't see the results on the partitions, and simply follow the guidelines to get more complete results.<br>
    The precision of the timings is $1$ second for all models but <font color='#5555ff'>tok_wnet_0</font> (because <font color='#cc33ff'>GridSearchCV</font> returns $5$ timings - one per partition - with a precision of $0.05$ second when the timing is less than $1$ minute, so if the sum of timings is 1'18"40, I write 1'18, while it could have been up to 1'18"65, which I would have written 1'19...; but each run of <font color='#5555ff'>tok_wnet_0</font> on a single partition took my laptop more than $1$ minute; when a timing is over a minute, <font color='#cc33ff'>GridSearchCV</font> writes, eg "2.6 min", which represents a timing between 2'33 and 2'39, so my total timings for <font color='#5555ff'>tok_wnet_0</font> have a precision of $15$ seconds). Besides, the timings depend heavily on the CPU which makes the computations - mine is rather slow (AMD Ryzen 7 7435HS) - and on its temperature; I took all my measurements at the same temperature, with pauses, so the timings in the tables can be compared with one another, although not too rigidly.<br>$\quad$
        We'll discuss the results below the three tables.
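
To make the idea of stratified folds concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: the function name `stratified_folds` and the toy labels are made up for the illustration. It shuffles the indices of each class with a fixed seed and deals them round-robin over the folds, so that every fold keeps the class proportions of the whole set :

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits=5, seed=1):
    """Return n_splits lists of indices, each with roughly the class proportions of `labels`."""
    rng = random.Random(seed)  # fixed seed, so the partition is the same at every run
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Toy labels (hypothetical): 60 negative reviews (0) and 40 positive reviews (1)
labels = [0] * 60 + [1] * 40
folds = stratified_folds(labels)
for k, fold in enumerate(folds):
    counts = Counter(labels[i] for i in fold)
    print(f"fold {k}: {len(fold)} instances, {counts[0]} negative, {counts[1]} positive")
```

With these toy labels, each of the $5$ folds receives $12$ negatives and $8$ positives. The fixed seed plays the same role as `random_state=1` in `StratifiedKFold` : the same partition is produced at every run.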

In [9]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

X_train = df_movies.iloc[:25000]['review'].values
y_train = df_movies.iloc[:25000]['sentiment'].values
X_test = df_movies.iloc[25000:]['review'].values
y_test = df_movies.iloc[25000:]['sentiment'].values

# Pipeline
# We handle lowercasing and tokenization ourselves, so we set 'lowercase=False' and 'token_pattern=None'
# Use 'ngram_range=(1,1)' for faster but slightly worse results
tfidf = TfidfVectorizer(lowercase=False, token_pattern=None, ngram_range=(1,2))
lr_tfidf = Pipeline([('v', tfidf), ('clf', LogisticRegression(penalty='l2', random_state=1, solver='saga'))])

# Grid search
param_grid = [
    # Uncomment this line, comment the next one, and use 'n_jobs=1' below, if you want to run the lemmatizer
    # 'tokenizer_wordnet'
    # {'v__tokenizer': [base_tok_0, base_tok, tok_port_0, tok_port, tok_snow_0, tok_snow, tok_wnet_0],
    # Use as many tokenizers as you want from the selection of the previous line
    {'v__tokenizer': [base_tok_0, tok_snow_0],
    # Also uncomment this line and comment the next one to run the code I'm talking about in the presentation;
    # feel free to try the values you want
    # 'clf__C': [1., 10., 20.]}]
    'clf__C': [20.]}]
# Control randomness in cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # Ensures consistent data splits
print(BB)    # Print the intermediate results in blue
# if you want the intermediate results, use 'n_jobs=1'
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy', cv=cv, verbose=3, n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)
clf = gs_lr_tfidf.best_estimator_

print(BB + "\nBest hyperparameters tested :" + MB, gs_lr_tfidf.best_params_)
print(BB + "Validation accuracy (5-fold CV) :" + MB, f"{round(100*gs_lr_tfidf.best_score_, 2)}%")
print(BB + "Test accuracy :" + MB, f"{round(100*clf.score(X_test, y_test), 2)}%")

[34m[1m
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[34m[1m
Best hyperparameters tested :[35m[1m {'clf__C': 20.0, 'v__tokenizer': <function tok_snow_0 at 0x0000022BE514BD80>}
[34m[1mValidation accuracy (5-fold CV) :[35m[1m 90.86%
[34m[1mTest accuracy :[35m[1m 91.12%


In [10]:
# Pipeline
# We handle lowercasing and tokenization ourselves, so we set 'lowercase=False' and 'token_pattern=None'
# Use 'ngram_range=(1,1)' for faster but slightly worse results
tfidf = TfidfVectorizer(lowercase=False, stop_words=None, token_pattern=None, ngram_range=(1,2), max_features=5000)
lr_tfidf = Pipeline([('v', tfidf), ('clf', LogisticRegression(penalty='l2', random_state=1, solver='saga'))])

# Grid search
param_grid = [
    # Uncomment this line, comment the next one, and use 'n_jobs=1' below, if you want to run the lemmatizer
    # 'tokenizer_wordnet'
    # {'v__tokenizer': [base_tok_0, base_tok, tok_port_0, tok_port, tok_snow_0, tok_snow, tok_wnet_0],
    # Use as many tokenizers as you want from the selection of the previous line
    {'v__tokenizer': [base_tok, tok_snow],
    # Also uncomment this line and comment the next one to run the code I'm talking about in the presentation;
    # feel free to try the values you want
    # 'clf__C': [1., 10., 20.]}]
    'clf__C': [1.]}]
# Control randomness in cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # Ensures consistent data splits
print(BB)    # Print the intermediate results in blue
# if you want the intermediate results, use 'n_jobs=1'
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy', cv=cv, verbose=3, n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)
clf = gs_lr_tfidf.best_estimator_

print(BB + "\nBest hyperparameters tested :" + MB, gs_lr_tfidf.best_params_)
print(BB + "Validation accuracy (5-fold CV) :" + MB, f"{round(100*gs_lr_tfidf.best_score_, 2)}%")
print(BB + "Test accuracy :" + MB, f"{round(100*clf.score(X_test, y_test), 2)}%")

[34m[1m
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[34m[1m
Best hyperparameters tested :[35m[1m {'clf__C': 1.0, 'v__tokenizer': <function tok_snow at 0x0000022BE5171A80>}
[34m[1mValidation accuracy (5-fold CV) :[35m[1m 89.23%
[34m[1mTest accuracy :[35m[1m 89.66%


In [11]:
print(BB + "\n\tAverage validation accuracy and total time over partitions when using the whole vocabulary :\n")

df = pd.DataFrame([["88.90% (1'18)", "88.80% (1'35)", "89.10% (4'05)", "88.62% (3'37)", "89.12% (2'35)", "88.62% (2'35)", "88.90% (13'06)"],
                   ["90.66% (1'47)", "90.24% (2'00)", "90.54% (4'35)", "90.14% (4'01)", "90.72% (3'01)", "90.12% (2'54)", "90.48% (13'30)"],
                   ["90.78% (2'13)", "90.44% (2'19)", "90.74% (4'58)", "90.26% (4'19)", "90.86% (3'26)", "90.24% (3'13)", "90.60% (14'00)"]])
df.columns = ["base_tok_0", "base_tok", "tok_port_0", "tok_port", "tok_snow_0", "tok_snow", "tok_wnet_0"]
df.index = ["C=1", "C=10", "C=20"]

# Apply our 'beautify' function (from cell 1) to render the dataframe
display(beautify(df))


print(BB + "\n\tAverage validation accuracy and total time over partitions when using " + CB + "max_features = 10000" + BB + " :\n")

df = pd.DataFrame([["89.38% (1'07)", "89.62% (1'29)", "89.56% (3'51)", "89.48% (3'28)", "89.56% (2'27)", "89.60% (2'26)", "89.24% (13'00)"],
                   ["89.60% (1'12)", "89.54% (1'31)", "89.70% (3'59)", "89.46% (3'32)", "89.74% (2'32)", "89.46% (2'31)", "89.44% (13'00)"],
                   ["89.12% (1'17)", "89.10% (1'34)", "89.26% (4'04)", "89.00% (3'35)", "89.36% (2'38)", "89.08% (2'33)", "88.98% (13'00)"]])
df.columns = ["base_tok_0", "base_tok", "tok_port_0", "tok_port", "tok_snow_0", "tok_snow", "tok_wnet_0"]
df.index = ["C=1", "C=10", "C=20"]
display(beautify(df))


print(BB + "\n\tAverage validation accuracy and total time over partitions when using " + CB + "max_features = 5000" + BB + " :\n")

df = pd.DataFrame([["88.82% (1'06)", "89.16% (1'29)", "89.08% (3'53)", "89.22% (3'28)", "89.12% (2'27)", "89.23% (2'26)", "88.76% (13'00)"],
                   ["88.78% (1'10)", "88.62% (1'31)", "88.86% (3'58)", "88.66% (3'32)", "88.90% (2'32)", "88.64% (2'28)", "88.60% (13'00)"],
                   ["88.16% (1'15)", "88.26% (1'33)", "88.32% (4'03)", "88.26% (3'36)", "88.34% (2'36)", "88.10% (2'32)", "88.16% (13'00)"]])
df.columns = ["base_tok_0", "base_tok", "tok_port_0", "tok_port", "tok_snow_0", "tok_snow", "tok_wnet_0"]
df.index = ["C=1", "C=10", "C=20"]
beautify(df)

[34m[1m
	Average validation accuracy and total time over partitions when using the whole vocabulary :



Unnamed: 0,base_tok_0,base_tok,tok_port_0,tok_port,tok_snow_0,tok_snow,tok_wnet_0
C=1,88.90% (1'18),88.80% (1'35),89.10% (4'05),88.62% (3'37),89.12% (2'35),88.62% (2'35),88.90% (13'06)
C=10,90.66% (1'47),90.24% (2'00),90.54% (4'35),90.14% (4'01),90.72% (3'01),90.12% (2'54),90.48% (13'30)
C=20,90.78% (2'13),90.44% (2'19),90.74% (4'58),90.26% (4'19),90.86% (3'26),90.24% (3'13),90.60% (14'00)


[34m[1m
	Average validation accuracy and total time over partitions when using [36m[1mmax_features = 10000[34m[1m :



Unnamed: 0,base_tok_0,base_tok,tok_port_0,tok_port,tok_snow_0,tok_snow,tok_wnet_0
C=1,89.38% (1'07),89.62% (1'29),89.56% (3'51),89.48% (3'28),89.56% (2'27),89.60% (2'26),89.24% (13'00)
C=10,89.60% (1'12),89.54% (1'31),89.70% (3'59),89.46% (3'32),89.74% (2'32),89.46% (2'31),89.44% (13'00)
C=20,89.12% (1'17),89.10% (1'34),89.26% (4'04),89.00% (3'35),89.36% (2'38),89.08% (2'33),88.98% (13'00)


[34m[1m
	Average validation accuracy and total time over partitions when using [36m[1mmax_features = 5000[34m[1m :



Unnamed: 0,base_tok_0,base_tok,tok_port_0,tok_port,tok_snow_0,tok_snow,tok_wnet_0
C=1,88.82% (1'06),89.16% (1'29),89.08% (3'53),89.22% (3'28),89.12% (2'27),89.23% (2'26),88.76% (13'00)
C=10,88.78% (1'10),88.62% (1'31),88.86% (3'58),88.66% (3'32),88.90% (2'32),88.64% (2'28),88.60% (13'00)
C=20,88.16% (1'15),88.26% (1'33),88.32% (4'03),88.26% (3'36),88.34% (2'36),88.10% (2'32),88.16% (13'00)


<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        First, note that each table corresponds to the choice of a number of most frequent <font color='#cc33ff'>terms</font> - <font color='#5555ff'>max_features</font> - and that, within a table, each row corresponds to a value of the <font color='#cc33ff'>inverse regularization parameter C</font>.<br>$\quad$
        When we use all the <font color='#cc33ff'>terms</font> (first table), the best value for the <font color='#cc33ff'>inverse regularization parameter</font> is <font color='#cc33ff'>C=</font><font color='#1aabff'>20</font>; it is <font color='#cc33ff'>C=</font><font color='#1aabff'>10</font> when we use the $10,000$ most frequent <font color='#cc33ff'>terms</font> (second table), and <font color='#cc33ff'>C=</font><font color='#1aabff'>1</font> when we use the $5,000$ most frequent <font color='#cc33ff'>terms</font> (third table). Usually, we observe the opposite : the more features we have, the easier it is to overfit the data, and the more regularization we need. This apparent contradiction comes from the fact that, here, the feature selection is not random : we select the most frequent <font color='#cc33ff'>terms</font> (eg "the", "is", "have",...), and, apparently, the fewer top (most frequent) <font color='#cc33ff'>terms</font> we keep, the smaller the proportion of informative words among them; therefore, with fewer <font color='#cc33ff'>terms</font>, we need a stronger regularization so our models don't fit the features too closely and can generalize better to validation data.<br>$\quad$
        This interpretation is reinforced by the fact that, the fewer the <font color='#cc33ff'>terms</font>, the more useful it is to use our <font color='#cc33ff'>stop words list</font>. More precisely, it is possible to compute the differences between a model's average validation accuracies, when the model uses our <font color='#5555ff'>new_stopwords</font> vs when it doesn't; averaging these differences shows that, when we use the complete vocabularies, using <font color='#5555ff'>new_stopwords</font> lowers the accuracies by $0.44$ percentage points on average; but when we use only the $10,000$ most frequent <font color='#cc33ff'>terms</font>, the accuracies are lowered by $0.10$ percentage points only, and sometimes it was better to use <font color='#5555ff'>new_stopwords</font>; with even fewer <font color='#cc33ff'>terms</font>, the use of <font color='#5555ff'>new_stopwords</font> is no longer detrimental (the accuracies are lowered by $0.03$ percentage points on average when we use it), and the three best models do use it.<br>
    Of course, when we use all our <font color='#cc33ff'>processed tokens</font> as features, removing <font color='#cc33ff'>stop words</font> can only hurt the top model (I'm not talking about Ohm Pawat, here, I need your attention... are you coming to me ?...), since we only lose information. But when we use fewer <font color='#cc33ff'>terms</font>, it apparently helps to try and eliminate the less informative ones, ensuring better quality of features (still not talking about Ohm...).<div style="margin-top: 20px"></div>$\quad$
        Whatever the model, its validation accuracy is better when it uses all the <font color='#cc33ff'>terms</font> (with <font color='#cc33ff'>C=</font><font color='#1aabff'>20</font>), than it is when it uses only the top $10,000$ <font color='#cc33ff'>terms</font> (with <font color='#cc33ff'>C=</font><font color='#1aabff'>10</font>), itself better than when it uses only the top $5,000$ <font color='#cc33ff'>terms</font> (with <font color='#cc33ff'>C=</font><font color='#1aabff'>1</font>). This is to be expected, since the sets of features are included in one another; thus, when we use a larger <font color='#cc33ff'>max_features parameter</font>, the models have more parameters to yield their predictions, and can perform better, provided we endow them with the right amount of <font color='#cc33ff'>regularization</font>.<br>$\quad$
        As for the choice of the <font color='#cc33ff'>tokenizer</font> itself, we see that, whatever the number of <font color='#cc33ff'>terms</font> selected, whatever the value of <font color='#cc33ff'>C</font>, if we don't use our <font color='#5555ff'>new_stopwords</font> list, the best <font color='#cc33ff'>tokenizer</font> is <font color='#5555ff'>tok_snow_0</font>, $9$ times out of $9$, and the worst is <font color='#5555ff'>tok_wnet_0</font>, also $9$ times out of $9$ (ties included in both cases). For example, the best model overall is the one using all the <font color='#cc33ff'>terms</font>, <font color='#5555ff'>tok_snow_0</font>, and <font color='#cc33ff'>C=</font><font color='#1aabff'>20</font>, with $90.86\%$ average validation accuracy, and $91.12\%$ test accuracy. However, the average of the difference obtained at each line between the best average validation accuracy and the worst, across the $9$ runs, is a mere $0.3\%$, and it never goes over $0.5\%$. Moreover, when we use all the <font color='#cc33ff'>terms</font>, the test accuracy obtained by the model using <font color='#5555ff'>tok_wnet_0</font> (the worst model) is $90.02\%$, only $1.1$ points below the test accuracy of the best model. Therefore, in our situation, the choice of the <font color='#cc33ff'>tokenizer</font> should rather be made according to its convergence speed.<br>$\quad$
        Obviously, the fastest models are <font color='#5555ff'>base_tok_0</font> (which is our raw <font color='#5555ff'>preprocessor</font>) and <font color='#5555ff'>base_tok</font> (which adds our <font color='#cc33ff'>stop words list</font> <font color='#5555ff'>new_stopwords</font>). The next fastest models are the ones based on the <font color='#cc33ff'>Snowball stemmer</font>, and, as already mentioned, <font color='#5555ff'>tok_snow_0</font> is the <font color='#cc33ff'>tokenizer</font> which yields the best results overall. Therefore, either we value accuracy, and we should use <font color='#5555ff'>tok_snow_0</font> with all the <font color='#cc33ff'>terms</font>, or we value speed, and we should choose <font color='#5555ff'>base_tok_0</font> or <font color='#5555ff'>base_tok</font>, according to the number of <font color='#cc33ff'>terms</font> we decide to work with. There is no reason here to work with the <font color='#cc33ff'>Porter stemmer</font>, and even less so to use our <font color='#cc33ff'>lemmatizer</font>, which is even slower. By the way, when we want to work with our <font color='#cc33ff'>stop words list</font>, <font color='#5555ff'>base_tok</font> is the best <font color='#cc33ff'>tokenizer</font> $7$ times out of $9$, so it's fast and efficient.<br>$\quad$
        The models using our <font color='#cc33ff'>lemmatizer</font> are much slower than the other models (from $3$ to $12$ times slower), because determining the correct grammatical function of a word takes time, as does looking up the corresponding <font color='#cc33ff'>lemma</font> for each word, one by one; moreover, the <font color='#cc33ff'>WordNet lemmatizer</font> is coded in Python, while the <font color='#cc33ff'>stemmers</font> use C under the hood, and it can use only one computer core. Besides, while <font color='#cc33ff'>stemmers</font> only need the correct <font color='#cc33ff'>stem</font> of a word to work well, <font color='#cc33ff'>lemmatizers</font> need precise syntax and grammar, so they are not ideal to analyze reviews, which often contain typos and other mistakes.<div style="margin-top: 10px"><br></div>
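
The average effect of the stop words list quoted in the discussion can be re-derived from the three tables. The sketch below simply copies the average validation accuracies of the six tokenizers that come in pairs (with and without <font color='#5555ff'>new_stopwords</font>); the dictionary layout and the helper name `mean_stopword_effect` are mine, chosen for the illustration :

```python
# Average validation accuracies (%) copied from the three tables; in each list the rows
# are C=1, C=10, C=20, and each pair is (without new_stopwords, with new_stopwords)
tables = {
    "all terms": {
        "base":     [(88.90, 88.80), (90.66, 90.24), (90.78, 90.44)],
        "porter":   [(89.10, 88.62), (90.54, 90.14), (90.74, 90.26)],
        "snowball": [(89.12, 88.62), (90.72, 90.12), (90.86, 90.24)],
    },
    "max_features=10000": {
        "base":     [(89.38, 89.62), (89.60, 89.54), (89.12, 89.10)],
        "porter":   [(89.56, 89.48), (89.70, 89.46), (89.26, 89.00)],
        "snowball": [(89.56, 89.60), (89.74, 89.46), (89.36, 89.08)],
    },
    "max_features=5000": {
        "base":     [(88.82, 89.16), (88.78, 88.62), (88.16, 88.26)],
        "porter":   [(89.08, 89.22), (88.86, 88.66), (88.32, 88.26)],
        "snowball": [(89.12, 89.23), (88.90, 88.64), (88.34, 88.10)],
    },
}

def mean_stopword_effect(table):
    """Mean of (accuracy with stop words removed - accuracy without removal), in points."""
    diffs = [with_sw - without_sw
             for pairs in table.values()
             for without_sw, with_sw in pairs]
    return sum(diffs) / len(diffs)

effects = {name: mean_stopword_effect(table) for name, table in tables.items()}
for name, effect in effects.items():
    print(f"{name}: {effect:+.2f} percentage points")
```

The three averages come out at about $-0.44$, $-0.10$ and $-0.03$ percentage points respectively, which matches the figures given in the text.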
        <p style="margin: 0 0 0 2rem; font-size: 15px; font-weight: 600"><font color='purple'>$\quad$Note :</font></p> $\quad$
    If we use <font color='#cc33ff'>CountVectorizer</font> instead of <font color='#cc33ff'>TfidfVectorizer</font> on the whole corpus of documents, the features will be the same, but they won't be <font color='#cc33ff'>normalized</font>, and the <font color='#cc33ff'>logistic regression</font> won't converge as efficiently nor generalize as well. Even by normalizing the features with <font color='#cc33ff'>MaxAbsScaler</font> (which divides each value in a column by the max of its absolute values, and thus preserves sparsity), <font color='#cc33ff'>CountVectorizer</font> doesn't help the model to focus on the most important features, and the test accuracy of the model remains slightly inferior ($90.84\%$ vs $91.12\%$). The code takes slightly less time to run because it doesn't compute the <font color='#cc33ff'>TF-IDF matrix</font>, and because <font color='#cc33ff'>$L^2$-normalization</font> (used by <font color='#cc33ff'>TfidfVectorizer</font>) is slower than the simpler <font color='#cc33ff'>MaxAbsScaler</font> :

In [12]:
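import time
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MaxAbsScaler
from sklearn.metrics import accuracy_score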

# Prepare data
X_train = df_movies.iloc[:25000]['review'].values
y_train = df_movies.iloc[:25000]['sentiment'].values
X_test = df_movies.iloc[25000:]['review'].values
y_test = df_movies.iloc[25000:]['sentiment'].values

# Build pipeline manually (without fit)
vectorizer = CountVectorizer(lowercase=False, token_pattern=None, ngram_range=(1, 2), tokenizer=tok_snow_0)
scaler = MaxAbsScaler()
clf = LogisticRegression(penalty='l2', solver='saga', random_state=1, warm_start=True, max_iter=20)

max_iter = 300
step = 20
start_time = time.time()

# Fit vectorizer and scaler on training data
X_train_vec = vectorizer.fit_transform(X_train)
X_train_scaled = scaler.fit_transform(X_train_vec)

# Transform test data
X_test_vec = vectorizer.transform(X_test)
X_test_scaled = scaler.transform(X_test_vec)

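for iter_start in range(0, max_iter, step):
    # Caution: with 'warm_start=True', each call to 'fit' restarts from the previous coefficients and
    # may run up to 'max_iter' further iterations, so the value printed below is the cap used for the
    # latest call, not the cumulative number of iterations
    clf.max_iter = iter_start + step
    clf.fit(X_train_scaled, y_train)  # incremental fit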
    
    y_pred = clf.predict(X_test_scaled)
    acc = accuracy_score(y_test, y_pred)
    
    elapsed = time.time() - start_time
    print(BB + "Iteration" + MB, iter_start + step,
          BB + ": Test accuracy:" + MB, f"{acc * 100:.2f}%", BB + " - Total time elapsed:" + MB, f"{elapsed:.2f} seconds")



[34m[1mIteration[35m[1m 20 [34m[1m: Test accuracy:[35m[1m 90.43% [34m[1m - Total time elapsed:[35m[1m 59.95 seconds




[34m[1mIteration[35m[1m 40 [34m[1m: Test accuracy:[35m[1m 90.79% [34m[1m - Total time elapsed:[35m[1m 69.87 seconds




[34m[1mIteration[35m[1m 60 [34m[1m: Test accuracy:[35m[1m 90.83% [34m[1m - Total time elapsed:[35m[1m 84.46 seconds




[34m[1mIteration[35m[1m 80 [34m[1m: Test accuracy:[35m[1m 90.89% [34m[1m - Total time elapsed:[35m[1m 103.75 seconds




[34m[1mIteration[35m[1m 100 [34m[1m: Test accuracy:[35m[1m 90.87% [34m[1m - Total time elapsed:[35m[1m 129.11 seconds




[34m[1mIteration[35m[1m 120 [34m[1m: Test accuracy:[35m[1m 90.86% [34m[1m - Total time elapsed:[35m[1m 160.90 seconds
[34m[1mIteration[35m[1m 140 [34m[1m: Test accuracy:[35m[1m 90.86% [34m[1m - Total time elapsed:[35m[1m 162.18 seconds
[34m[1mIteration[35m[1m 160 [34m[1m: Test accuracy:[35m[1m 90.85% [34m[1m - Total time elapsed:[35m[1m 163.48 seconds
[34m[1mIteration[35m[1m 180 [34m[1m: Test accuracy:[35m[1m 90.84% [34m[1m - Total time elapsed:[35m[1m 164.79 seconds
[34m[1mIteration[35m[1m 200 [34m[1m: Test accuracy:[35m[1m 90.84% [34m[1m - Total time elapsed:[35m[1m 166.08 seconds
[34m[1mIteration[35m[1m 220 [34m[1m: Test accuracy:[35m[1m 90.84% [34m[1m - Total time elapsed:[35m[1m 167.43 seconds
[34m[1mIteration[35m[1m 240 [34m[1m: Test accuracy:[35m[1m 90.84% [34m[1m - Total time elapsed:[35m[1m 168.74 seconds
[34m[1mIteration[35m[1m 260 [34m[1m: Test accuracy:[35m[1m 90.84% [34m[1m - Tota

<br><br><div style="font-family: Arial; font-size:14px; color: darkblue">
    <p style="margin: 0 0 .5rem 4rem; font-size: 16px; font-weight: 600"><font color='darkpink'>2b. Working with bigger data - online algorithms and out-of-core learning</font><a id="2b"></a></p>$\quad$
        As we have just seen in the previous subchapter, working with a dataset of $50,000$ reviews can be computationally demanding. But in many real-world applications, we need to work with larger datasets, which can even exceed our computer’s memory. In such cases, we can feed the data to the model in small chunks instead of all at once. This paradigm is called <font color='darkpink'>incremental learning</font> (ie learning one <font color='#cc33ff'>mini-batch</font> at a time); it is used in <font color='darkpink'>online learning</font> (in real-world scenarios where the <font color='#cc33ff'>mini-batches</font> arrive one at a time - eg product reviews -; note that this term is often used as a synonym of <font color='#cc33ff'>incremental learning</font>), or in <font color='darkpink'>out-of-core learning</font> (where the whole dataset is too large for the computer's memory). In <font color='#cc33ff'>incremental learning</font>, we update the model parameters after each <font color='#cc33ff'>mini-batch</font>. One algorithm of choice is the <font color='#cc33ff'>mini-batch gradient descent</font> (see part 2, "2e. The mini-batch gradient descent").<br>$\quad$
        We first need to define a function that returns the next <font color='#cc33ff'>mini-batch</font> of the desired number of instances from the corpus of documents :

In [13]:
# Define a generator function to read data from a CSV file.
def stream_docs(path):
    # Open the CSV file in read mode ('r') using UTF-8 encoding to handle special characters
    # Using 'with' ensures proper file closure after execution (a best practice in Python)
    with open(path, 'r', encoding="utf-8") as csv:
        # Skip the first line (header) before processing actual data
        next(csv)
        # Iterate through each line in the file
        for line in csv:
            # Remove leading/trailing spaces and split at the last ',' to extract text and label
            text, label = line.strip().rsplit(',', 1)
            # Convert the label to an integer and yield both values as a tuple
            yield text, int(label)

# Define a function to retrieve a batch of samples from 'doc_stream'
def get_minibatch(doc_stream, size):
    # Initialize two empty lists to store the text data and corresponding labels, respectively
    docs, y = [], []
    # Attempt to retrieve 'size' samples from the doc_stream iterator
    try:
        for _ in range(size):
            # Fetch the next tuple (text, label) from 'doc_stream'
            text, label = next(doc_stream)
            # Add 'text' to the 'docs' list, and 'label' to the 'y' list
            docs.append(text)
            y.append(label)
    # If the iterator is exhausted before 'size' samples were read, return 'None, None' to indicate
    # no more data (a partially filled last mini-batch is therefore discarded)
    except StopIteration:
        return None, None
    # Return the collected mini-batch of 'docs' and corresponding labels 'y'
    return docs, y
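
To see how these two functions behave, here is a small self-contained run on a hypothetical $5$-row file with the same layout as "movie_data.csv" (a header, then one review and one label per line). The two functions are repeated in condensed form only so that the block runs on its own; the toy reviews are invented :

```python
import os
import tempfile

# Condensed copies of the two functions above, so that this block runs on its own
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as f:
        next(f)  # skip the header
        for line in f:
            text, label = line.strip().rsplit(',', 1)
            yield text, int(label)

def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# Hypothetical toy file (header + 5 reviews); the first review contains a comma
rows = ['review,sentiment',
        '"Great film, loved it",1',
        '"Dull and far too long",0',
        '"A pleasant surprise",1',
        '"Not my cup of tea",0',
        '"Simply wonderful",1']
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, encoding='utf-8') as tmp:
    tmp.write('\n'.join(rows) + '\n')
    toy_path = tmp.name

stream = stream_docs(toy_path)
# Ask three times for a mini-batch of 2 reviews: only 5 reviews are available
batches = [get_minibatch(stream, size=2) for _ in range(3)]
os.remove(toy_path)
for docs, y in batches:
    print(docs, y)
```

Three things are worth noticing : splitting at the last comma keeps a comma inside a review harmless; the surrounding quotation marks stay in the text; and the third call returns `(None, None)`, so the fifth review, which cannot fill a complete mini-batch, is dropped.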

<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        Suppose we are about to receive a set of documents, and we need to classify them according to their sentiment (eg positive or negative). The first requirement is to be able to decompose our documents into <font color='#cc33ff'>terms</font>. We can rely on our <font color='#cc33ff'>tokenizers</font> here (eg <font color='#5555ff'>tok_snow_0</font>, <font color='#5555ff'>base_tok</font>,...). Then, it would certainly be great if we were able to tell which terms are helpful within the documents. Earlier, we considered that the most helpful words are those with the highest <font color='#cc33ff'>TF-IDF</font>, and we obtained rather good results; but we cannot compute the <font color='#cc33ff'>TF-IDF</font> of any <font color='#cc33ff'>term</font> <font color='#1aabff'>$t$</font> in <font color='#cc33ff'>online learning</font>, because we cannot know how many documents contain <font color='#1aabff'>$t$</font> (we haven't received all the documents yet). Therefore, we hypothesize that the most helpful <font color='#cc33ff'>terms</font> will be the most frequent ones (eg, if a review contains "good" many times, that should indicate that the review is positive); we have seen in the "Note" of the previous subchapter that the <font color='#cc33ff'>count matrix</font> together with the <font color='#cc33ff'>MaxAbsScaler</font> can yield models almost as accurate as models built with <font color='#cc33ff'>TF-IDF matrices</font>. We may even discard the very frequent and unhelpful <font color='#cc33ff'>terms</font> that we called <font color='#cc33ff'>stop words</font> if needed, via our <font color='#cc33ff'>tokenizers</font> (in this notebook, <font color='#cc33ff'>tokenizers</font> that don't end in "_0" eliminate the <font color='#cc33ff'>terms</font> that are in our <font color='#5555ff'>new_stopwords</font> list).<br>$\quad$
        Thus, we read the list of <font color='#cc33ff'>terms</font> returned by the chosen <font color='#cc33ff'>tokenizer</font>; when we read a term, we check if it has already been registered, and then add $1$ to its count if it has, otherwise we create a variable that we add to our list of terms and assign $1$ to it. Wait! In truth, it doesn't work exactly this way : in fact, in order to retrieve our <font color='#cc33ff'>terms</font> more easily, and to save time when we count the documents' <font color='#cc33ff'>terms</font>, we use a <font color='#cc33ff'>sparse vector</font> (about <font color='#cc33ff'>sparse vectors</font>, see "1a. Transforming words into feature vectors") of the desired length ($2^{20}$ by default) : when we read a word, a <font color='darkpink'>hash function</font> transforms it into an index (in the desired range) and we add $1$ to the current value at this index in the vector (which was $0$ if no <font color='#cc33ff'>term</font> mapped to this index had been read yet, or a positive integer otherwise). The transformation doesn't need a dictionary, it is automatic (think of replacing "a" by "01", "b" by "02", etc., but in more clever ways). In <font color='#cc33ff'>scikit-learn</font>, the hash function is based on MurmurHash3, which is fast and distributes its outputs uniformly to reduce collisions (a collision occurs when two tokens are mapped to the same index, which is bound to happen when there are fewer indices than distinct tokens). When we have finished reading our document, we have built its <font color='#cc33ff'>feature sparse vector</font>. 
This method is implemented in <font color='#cc33ff'>HashingVectorizer(input='content', encoding='utf-8', decode_error='strict', lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', n_features=1048576, norm='l2')</font> (from <font color='#cc33ff'>scikit-learn</font>'s <font color='#cc33ff'>feature_extraction.text</font> module); all its parameters are identical to the parameters of the same name in <font color='#cc33ff'>CountVectorizer</font> (see "1a. Transforming words into feature vectors"), except the last two that I listed, which do not exist in <font color='#cc33ff'>CountVectorizer</font> :<ul>
    <li><font color='#cc33ff'>n_features=1048576</font> specifies the number of features (columns) in the output matrices (note that $1,048,576=2^{20}$; we can therefore write <font color='#5555ff'>2\*\*20</font> instead; <font color='#5555ff'>2\*\*21</font> is often used)</li>
    <li><font color='#cc33ff'>norm='l2'</font> : the norm to use to normalize term vectors; the default, <font color='#cc33ff'>norm=</font><font color='#1aabff'>'l2'</font>, corresponds to $l^{\,2}$-normalization (so the feature vectors have an $l^{\,2}$-norm of $1$); we can also use <font color='#cc33ff'>norm=</font><font color='#1aabff'>'l1'</font>, for $l^1$-normalization, and <font color='#cc33ff'>norm=</font><font color='#1aabff'>None</font>, for no normalization</li></ul><div style="margin-top: 20px"></div>$\quad$
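
The hashing idea described above can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions : `zlib.crc32` is used as a stand-in hash function (it is not MurmurHash3), the function name `hashed_counts` is made up, and the sketch leaves out the normalization and the other refinements of the real class :

```python
import zlib
from collections import Counter

def hashed_counts(tokens, n_features=2**20):
    """Map each token to a column index with a hash function and count occurrences per index."""
    vector = Counter()  # sparse vector: only the non-zero entries are stored
    for token in tokens:
        # No vocabulary dictionary is needed: the index is computed from the token itself
        index = zlib.crc32(token.encode('utf-8')) % n_features
        vector[index] += 1
    return vector

tokens = "a good film with a good cast".split()   # 7 tokens, 5 distinct ones
wide = hashed_counts(tokens)                      # 2**20 available columns
narrow = hashed_counts(tokens, n_features=4)      # only 4 available columns
print(len(set(tokens)), "distinct tokens,", len(wide), "columns used out of 2**20")
print(len(set(tokens)), "distinct tokens,", len(narrow), "columns used out of 4")
```

With only $4$ columns for $5$ distinct tokens, at least two tokens necessarily share a column (a collision), so their counts are added together; with $2^{20}$ columns, collisions become unlikely for such a small vocabulary, at the price of a much longer (but still sparse) vector.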
        Let's try this class on our movie reviews dataset <font color='#5555ff'>df_movies</font>, to train an <font color='#cc33ff'>SGDClassifier</font> with the logistic loss function upon <font color='#cc33ff'>mini-batches</font> of size $1000$ (about <font color='#cc33ff'>SGD</font> and <font color='#cc33ff'>mini-batches</font>, see part 2, "2. Linear and polynomial regression", especially "2e. The mini-batch gradient descent"; about the <font color='#cc33ff'>logistic loss function</font>, see part 3, "2c. The logistic loss function", and part 7, "4a. The Gradient Boosting strategy", "Multiclass logistic loss function and related notations"), as if we fetched them from an external site.<br>$\quad$
        Again, the book uses only <font color='#cc33ff'>unigrams</font>, but we'll use <font color='#cc33ff'>bigrams</font> too, because they yield slightly better results. As we explained when we constructed our list of <font color='#cc33ff'>stop words</font> <font color='#5555ff'>new_stopwords</font> (see "1c. Preparing text data"), when we use bigrams, it is better to slightly adapt our list; if we worked with <font color='#cc33ff'>unigrams</font> only, we'd use <font color='#cc33ff'>NLTK</font>'s <font color='#cc33ff'>stop words list</font>, but we'll use our <font color='#5555ff'>new_stopwords</font> instead.<br>
    Also, according to my own tests, it works better here to use no <font color='#cc33ff'>$L^2$-regularization</font>; we'll therefore set the <font color='#cc33ff'>penalty parameter</font> of <font color='#cc33ff'>SGDClassifier</font> to <font color='#cc33ff'>penalty=</font><font color='#1aabff'>None</font>.<br>
    Finally, the best of our <font color='#cc33ff'>tokenizers</font> is <font color='#5555ff'>tok_snow</font> (still according to my tests), so this is the one we use :

In [14]:
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics import accuracy_score

# Initialize HashingVectorizer with tok_snow tokenizer
vect = HashingVectorizer(decode_error='ignore', n_features=2**21, preprocessor=None, tokenizer=tok_snow, ngram_range=(1, 2))
# Initialize SGDClassifier for online learning
clf = SGDClassifier(loss='log_loss', penalty=None, random_state=1)

# Stream data from CSV
doc_stream = stream_docs(path='movie_data.csv')

# Train the model on 45 successive mini-batches (a single pass over the first 45,000 documents)
for _ in range(45):
    # Fetch a minibatch of training data (X_train, y_train) from the document stream, using our
    # 'get_minibatch' function defined in the previous cell
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    # Check if 'X_train' is empty, indicating the end of the stream, and break the loop if so
    if not X_train: break
    # Otherwise, transform the training data 'X_train' using our HashingVectorizer
    X_train = vect.transform(X_train)
    # Update the model with the minibatch data
    clf.partial_fit(X_train, y_train, classes=np.array([0, 1]))

# Retrieve our test dataset (next and last 5000 samples)
X_test, y_test = get_minibatch(doc_stream, size=5000)
# Transform 'X_test' using our HashingVectorizer
X_test = vect.transform(X_test)
# Generate predictions for the test set
y_pred = clf.predict(X_test)
# Evaluate accuracy (note that the "training" accuracy is measured on the last mini-batch only)
print(BB + "Training accuracy with 'tok_snow_1' :" + MB, f"{round(100*clf.score(X_train, y_train), 2)}%")
print(BB + "Test accuracy with 'tok_snow_1' :" + MB, f"{round(100*accuracy_score(y_test, y_pred), 2)}%")

Training accuracy with 'tok_snow_1' : 89.1%
Test accuracy with 'tok_snow_1' : 88.86%


<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        A test accuracy of $88.86\%$ is not as good as the one we obtained when we trained our classifier on $25,000$ instances, but keep in mind that it was obtained with zero prior knowledge. We didn't build any <font color='#cc33ff'>vocabulary</font>: each <font color='#cc33ff'>term</font> was simply mapped to a column index by a <font color='#cc33ff'>hash function</font>; in particular, we didn't compute any <font color='#cc33ff'>TF-IDF</font>.<br>$\quad$
        This remark hints at a potential way to improve our results : instead of starting from zero, we can use the first <font color='#cc33ff'>mini-batch</font> of $1,000$ documents to build a <font color='#cc33ff'>vocabulary</font>, giving more or less weight to our <font color='#cc33ff'>terms</font> according to the <font color='#cc33ff'>TF-IDF</font> computed upon these first $1,000$ documents. From there, we'll only consider, in any document, the <font color='#cc33ff'>terms</font> that actually are in our <font color='#cc33ff'>vocabulary</font>, with weights derived from the inverse document frequencies computed on that first batch. The rest of the code is identical :

In [15]:
# Initialize classifier
clf = SGDClassifier(loss='log_loss', penalty=None, random_state=1)

# Stream data from CSV
doc_stream = stream_docs(path='movie_data.csv')

# Retrieve initial batch for vocabulary setup (first 1000 samples)
X_init, y_init = get_minibatch(doc_stream, size=1000)

# Initialize TfidfVectorizer using first batch to define vocabulary
vect = TfidfVectorizer(lowercase=False, tokenizer=tok_snow, token_pattern=None, ngram_range=(1, 2))
X_train = vect.fit_transform(X_init)  # Fit vocabulary using first batch
y_train = y_init

# Train on the first batch, then on the 44 remaining mini-batches (a single pass over 45,000 documents)
clf.partial_fit(X_train, y_train, classes=np.array([0, 1]))  # Train on first batch

for _ in range(44):  # Remaining mini-batches
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if X_train is None: break  # Stop if dataset is exhausted
    X_train = vect.transform(X_train)  # Transform using existing TF-IDF vocabulary
    clf.partial_fit(X_train, y_train)  # Incremental learning update

# Retrieve test dataset (next 5000 samples)
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)

# Evaluate accuracy (note that the "training" accuracy is measured on the last mini-batch only)
from sklearn.metrics import accuracy_score
y_pred = clf.predict(X_test)
print(BB + "Training accuracy ('TfidfVectorizer' with 'tok_snow') :" + MB, f"{round(100*clf.score(X_train, y_train), 2)}%")
print(BB + "Test accuracy ('TfidfVectorizer' with 'tok_snow') :" + MB, f"{round(100*accuracy_score(y_test, y_pred), 2)}%")

Training accuracy ('TfidfVectorizer' with 'tok_snow') : 91.1%
Test accuracy ('TfidfVectorizer' with 'tok_snow') : 90.28%


<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        It could be tempting to use more samples to build a better <font color='#cc33ff'>vocabulary</font>, but we would be left with fewer samples to improve our model, and this could lead to a model that is less accurate and less consistent. For example, if we build the vocabulary on $44,000$ samples, the model has a test accuracy of $89.84\%$; with $43,999$ samples, the test accuracy is $90.42\%$; with $43,998$ samples, the test accuracy is $89.78\%$; etc. With our $45,000$ available samples, the best consistent model I found has its <font color='#cc33ff'>vocabulary</font> built upon $5,000$ samples.<br>$\quad$
        This is the code I used to run my tests; it is similar to the previous one, except that the number of samples used to build the vocabulary is <font color='#5555ff'>m_0</font> :

In [16]:
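# m_0 is the number of documents used to build our vocabulary (must be <= 45000, so that at least 5000
# documents are left for testing)
m_0 = 5000
# Number of remaining mini-batches of 1000 documents (each is seen once, so these are iterations rather than epochs)
nb_epoch = (50000 - m_0 - 5000) // 1000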

# Initialize classifier
clf = SGDClassifier(loss='log_loss', penalty=None, random_state=1)

# Stream data from CSV
doc_stream = stream_docs(path='movie_data.csv')

# Retrieve initial batch for vocabulary setup (first m_0 samples)
X_init, y_init = get_minibatch(doc_stream, size=m_0)

# Initialize TfidfVectorizer using first batch to define vocabulary
vect = TfidfVectorizer(lowercase=False, tokenizer=tok_snow, token_pattern=None, ngram_range=(1, 2))
X_train = vect.fit_transform(X_init)  # Fit vocabulary using first batch
y_train = y_init

# Train on the first batch of m_0 documents, then on nb_epoch mini-batches of 1000 documents
clf.partial_fit(X_train, y_train, classes=np.array([0, 1]))  # Train on first batch
for _ in range(nb_epoch):  # Remaining mini-batches
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if X_train is None: break  # Stop if dataset is exhausted
    X_train = vect.transform(X_train)  # Transform using existing TF-IDF vocabulary
    clf.partial_fit(X_train, y_train)  # Incremental learning update

# Retrieve test dataset (next 5000 samples)
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)

# Evaluate accuracy (note that the "training" accuracy is measured on the last mini-batch only)
from sklearn.metrics import accuracy_score
y_pred = clf.predict(X_test)
print(BB + "Training accuracy ('TfidfVectorizer' with 'tok_snow') :" + MB, f"{round(100*clf.score(X_train, y_train), 2)}%")
print(BB + "Test accuracy ('TfidfVectorizer' with 'tok_snow') :" + MB, f"{round(100*accuracy_score(y_test, y_pred), 2)}%")

Training accuracy ('TfidfVectorizer' with 'tok_snow') : 91.2%
Test accuracy ('TfidfVectorizer' with 'tok_snow') : 90.4%


<a id="Appendix"></a><br><br><br><div style="font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">
    <h4 style="margin: 0 0 .5rem 2rem; font-size: 18px; font-weight: 800"><font color='darkpink'>APPENDIX - TOPIC MODELING WITH LATENT DIRICHLET ALLOCATION</font></h4>$\quad$
        <font color='darkpink'>Topic modeling</font> describes the broad task of assigning topics to unlabeled text documents. For example, a typical application is the categorization of documents in a large corpus of newspaper articles. In such applications, we aim to assign category labels to those articles, for example, world news, local news, sports, politics, and so on.<br>$\quad$
        In this section, we will discuss a popular <font color='#cc33ff'>topic modeling</font> technique called <font color='darkpink'>latent Dirichlet allocation (LDA)</font>. Note that, while <font color='#cc33ff'>latent Dirichlet allocation</font> is often abbreviated as LDA, it should not be confused with <font color='#cc33ff'>linear discriminant analysis</font>, also abbreviated as <font color='#cc33ff'>LDA</font>, which is a supervised <font color='#cc33ff'>dimensionality reduction technique</font> (see part 5, "2. Linear discriminant analysis").<br>$\quad$
        We will use the <font color='#cc33ff'>LatentDirichletAllocation</font> class implemented in the <font color='#cc33ff'>decomposition</font> module of <font color='#cc33ff'>scikit-learn</font> to decompose the movie review dataset and categorize it into $10$ different topics. Since we are going to use <font color='#cc33ff'>LatentDirichletAllocation</font> with its default values, and merely scratch the surface of the topic, I only present $4$ of its numerous parameters, <font color='#cc33ff'>LatentDirichletAllocation(n_components=10, learning_method='batch', max_iter=10, batch_size=128)</font> :<ul>
    <li><font color='#cc33ff'>n_components=10</font> specifies the number of topics to separate the corpus into</li>
    <li><font color='#cc33ff'>learning_method='batch'</font> specifies the method used for training. It can be either <font color='#cc33ff'>'batch'</font> (uses all available training data in one iteration) or <font color='#cc33ff'>'online'</font> (uses <font color='#cc33ff'>mini-batches</font> of training data). The <font color='#cc33ff'>'batch'</font> method is slower than the <font color='#cc33ff'>'online'</font> method, but generally leads to better results</li>
    <li><font color='#cc33ff'>max_iter=10</font> specifies the number of passes over the training data (aka epochs)</li>
    <li><font color='#cc33ff'>batch_size=128</font> specifies the number of documents to use in each iteration. Only used in online learning.</li></ul><div style="margin-top: 20px"></div>$\quad$
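
As a quick illustration of these parameters, here is a minimal sketch on a hypothetical six-sentence corpus. The corpus, the choice of `n_components=2` and the use of `CountVectorizer` are my own assumptions for this toy example; it only shows the shapes of what the class returns.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, used only for illustration
toy_docs = [
    "ghost monster blood scream",
    "monster scream night blood",
    "ghost night scream monster",
    "love wedding kiss romance",
    "romance kiss love letter",
    "wedding love letter kiss",
]

# Document-term matrix: one row per document, one column per term
toy_counts = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, learning_method='batch',
                                    max_iter=10, random_state=0)
# One row per document, one column per topic
doc_topics = toy_lda.fit_transform(toy_counts)

print(doc_topics.shape)            # (number of documents, n_components)
print(doc_topics.sum(axis=1))      # each row is a distribution over the topics
print(toy_lda.components_.shape)   # (n_components, number of terms)
```

Each row of `doc_topics` sums to $1$, and `components_` holds one row of term weights per topic; these are the weights we sort to display the top words of a topic.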
        The first step is to build a <font color='#cc33ff'>vocabulary</font> and the corresponding document-term matrix, and then to pass this matrix to the <font color='#cc33ff'>LatentDirichletAllocation</font> class so it can distribute the terms across $10$ categories.<br>$\quad$
        In the following code, we show an implementation and ask for the $5$ most heavily weighted words in each category, to try to understand which categories were discovered. I have tried several values for the hyperparameters, but they almost never yielded $10$ sensible categories, and I could not reproduce the results presented in the book. My "best" result was obtained with the following code and parameter values :

In [17]:
from sklearn.decomposition import LatentDirichletAllocation

# I build the vocabulary using 'max_df=.1' (like in the book), 'max_features=1000' (5000 in the book),
# and 'stop_words=nltk_stopwords'. S. Raschka chose "english", scikit-learn's built-in English stop words
# list, but scikit-learn themselves warn us against it : "There are several known issues with 'english'
# and you should consider an alternative"; so I used 'nltk_stopwords' instead (defined in cell 7). Finally,
# instead of the book's 'CountVectorizer', I use 'TfidfVectorizer'
# Note that I wasn't able to obtain good results when I tried using bigrams
count = TfidfVectorizer(stop_words=nltk_stopwords, max_df=.1, max_features=1000)
# transforms the text data into a matrix X, where each row represents a document (here a movie review) and
# each column represents a unique word in the vocabulary. The values in the matrix are the TF-IDF of each
# word in the corresponding document
X = count.fit_transform(df_movies['review'].values)

lda = LatentDirichletAllocation(n_components=10, learning_method='batch', random_state=1)
# applies the LDA algorithm to X. It returns a new matrix, 'X_topics', where each row represents a document
# (here a movie review), and each column represents the estimated probability that the document belongs to
# a particular topic
X_topics = lda.fit_transform(X)
# specifies the number of top words to display for each topic
n_top_words = 5
# retrieves the list of feature names (terms) from the TfidfVectorizer object, ie 'feature_names' is the
# vocabulary built by the 'TfidfVectorizer'
feature_names = count.get_feature_names_out()
# for each topic, it prints the topic index, followed by the top 'n_top_words' terms with the highest weight
# in that topic. Since 'topic.argsort()' sorts indices in ascending order, we take the last 'n_top_words' and
# reverse them to display the most important terms first
for topic_idx, topic in enumerate(lda.components_):
    print(BB + "Topic", (topic_idx+1), ": " + MB)
    print(' '.join([feature_names[i]
                    for i in topic.argsort()[:-n_top_words - 1:-1]]))

Topic 1 :
documentary art human feel reality
Topic 2 :
father family wife woman mother
Topic 3 :
worst waste stupid money awful
Topic 4 :
kids dvd remember loved watched
Topic 5 :
war american action french rather
Topic 6 :
horror killer gore guy monster
Topic 7 :
book read original version novel
Topic 8 :
effects horror fi sci special
Topic 9 :
series episode episodes season tv
Topic 10 :
performance music role wonderful excellent
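
The slice `topic.argsort()[:-n_top_words - 1:-1]` used in the loop is compact but not very readable. The sketch below unpacks it on a small hypothetical array of weights (the terms and values are made up):

```python
import numpy as np

# Hypothetical terms and weights for a single topic
toy_terms = np.array(["plot", "actor", "music", "scene", "camera", "script"])
toy_weights = np.array([0.2, 3.1, 0.7, 2.4, 0.1, 1.5])
n_top = 3

# 'argsort' returns the indices that sort the weights in ascending order
ascending_idx = toy_weights.argsort()
# Walking backwards from the end and stopping after 'n_top' items gives the largest weights first
top_idx = ascending_idx[:-n_top - 1:-1]
# Equivalent, more explicit form: reverse the whole array, then keep the first 'n_top' indices
top_idx_explicit = ascending_idx[::-1][:n_top]

top_terms = [str(t) for t in toy_terms[top_idx]]
print(top_idx, top_terms)
```

Both forms select the same indices; the explicit one is only easier to read.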


<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        From these results, we can surmise that the code separated the movie reviews into the following categories :<br>
    Topic 1 : Documentaries<br>
    Topic 2 : Family movies<br>
    Topic 3 : Bad movies<br>
    Topic 4 : Old movies remembered fondly<br>
    Topic 5 : War movies<br>
    Topic 6 : Horror movies<br>
    Topic 7 : Movies based on books<br>
    Topic 8 : Science-fiction movies<br>
    Topic 9 : TV series<br>
    Topic 10 : Good movies<br>$\quad$