<h1 style="font-family: Arial; font-weight: 900; color: magenta; text-align: center;">SENTIMENT ANALYSIS</h1><br>

<br/><div style="font-family: Arial; color: darkblue; font-size: 14px; font-weight: 600"><a id="toc"></a>
In this part 8:
    <p style="margin: 0 0 0 2rem; font-size: 16px; font-weight: 800">[1. The bag-of-words model](#1)</p>
        <p style="margin: 0 0 0 4rem">[1a. Transforming words into feature vectors](#1a)</p>
        <p style="margin: 0 0 0 4rem">[1b. The TF-IDF](#1b)</p>
        <p style="margin: 0 0 0 4rem">[1c. Preparing text data](#1c)</p><br/>
    <p style="margin: 0 0 0 2rem; font-size: 16px; font-weight: 800">[2. Logistic regression model for document classification](#2)</p>
        <p style="margin: 0 0 0 4rem">[2a. Training a logistic regression model for our movie reviews dataset](#2a)</p>
        <p style="margin: 0 0 0 4rem">[2b. Working with bigger data - online algorithms and out-of-core learning](#2b)</p><br/>
    <p style="margin: 0 0 0 2rem; font-size: 16px; font-weight: 800">[Appendix - Topic modeling with latent Dirichlet allocation](#Appendix)</p></div>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# The datasets we will use
# all along this part, we will use the "movie_data" dataset; I have downloaded it from the author's
# Github for this chapter (at https://github.com/rasbt/machine-learning-book/tree/main/ch08); also,
# while in the book it is called "df", I prefer to call it "df_movies"
df_movies = pd.read_csv("movie_data.csv")
# renames the columns labeled "0" and "1" to "review" and "sentiment" respectively, providing more
# descriptive names to the columns for better readability and understanding of the data
df_movies = df_movies.rename(columns={"0": "review", "1": "sentiment"})

# Beautify the code results by using colors
from colorama import Fore, Back, Style
# define some strings for cleaner code
BB = Fore.BLUE + Style.BRIGHT
MB = Fore.MAGENTA + Style.BRIGHT

<a id="1"></a><br><br><br><div style="font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">
    <h4 style="margin: 0 0 .5rem 2rem; font-size: 18px; font-weight: 800"><font color='darkpink'>1. THE BAG-OF-WORDS MODEL</font></h4>
    <p style="margin: 0 0 .5rem 4rem; font-size: 16px; font-weight: 600"><font color='darkpink'>1a. Transforming words into feature vectors</font><a id="1a"></a></p>$\quad$
        The goal in this part is to study a set of documents, denoted by <font color='purple'>$\mathcal{D}$</font>.<br>$\quad$
        Before we can pass categorical data, such as words, to a machine learning algorithm, we need to convert it into a numerical format. The <font color='darkpink'>bag-of-words model</font> allows us to do exactly this. It creates a dictionary from $\mathcal{D}$, called a <font color='darkpink'>vocabulary</font>.<br>$\quad$
        The keys of our vocabulary are the words, or succession of a chosen number of words or characters, that we are interested in; it may be all the words in $\mathcal{D}$, or all successions of a certain number of words or of a certain number of characters, or we may specify the list of keys manually. In any case, the keys of our vocabulary are called <font color='darkpink'>tokens</font>. A token of $n$ words (or of $n$ characters if a token is a succession of characters), is called an <font color='darkpink'>n-gram</font> token.<br>
    The keys are ordered, usually alphabetically, and their value is their index. The values start at $0$; they are non-negative integers with no gaps between them.<br>$\quad$
        Then for each document $\,D \in \mathcal{D}\,$ the model counts the number of tokens present in the document. It uses a list to store these numbers : the number at index $i$ is the number of times the token with value $i$ appears in $D$. This list is called the <font color='darkpink'>feature vector</font> of $D$.<br>
    For example if we are interested in the words "dog" and "cat", then our vocabulary will be {cat: 0; dog: 1} (alphabetical order); and if our document is "The cat meows", then our feature vector will be $[1,0]$.<br>$\quad$
        Since the tokens in each document represent only a small subset of all the tokens in our vocabulary, the feature vectors will mostly consist of zeros, in which case they are said to be <font color='darkpink'>sparse</font>.<br><br>$\quad$
        The bag-of-words model is implemented in the <font color='green'>feature_extraction.text</font> module of <font color='green'>scikit-learn</font> under <font color='green'>CountVectorizer(input='content', encoding='utf-8', decode_error='strict', lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None)</font> which is a class with many parameters (see https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). I have selected only a subset of the possible parameters :<br>
    * <font color='green'>input='content'</font> specifies how the input will be passed: if the documents are already strings or bytes, we use the default <font color='green'>input='content'</font>; if they are stored in files, we either pass a list of file paths that <font color='green'>scikit-learn</font> opens and reads for us, with <font color='green'>input='filename'</font>, or we pass file-like objects (anything with a <font color='green'>read</font> method), with <font color='green'>input='file'</font><br>
    * <font color='green'>encoding='utf-8'</font> specifies the encoding used to decode the documents when they are given as bytes or files ('utf-8' by default); it has no effect on documents that are already strings<br>
    * <font color='green'>decode_error='strict'</font> specifies what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is 'strict', meaning that a UnicodeDecodeError will be raised. Other values are 'ignore' and 'replace'<br>
    * <font color='green'>lowercase=True</font> converts all characters to lowercase before tokenizing<br>
    * <font color='green'>preprocessor=None</font> specifies the preprocessor we want to apply to our documents (see "1c. Preparing text data"), if any<br>
    * <font color='green'>tokenizer=None</font> specifies the tokenizer we want to apply, if any<br>
    * <font color='green'>stop_words=None</font> specifies the list of stop words (ie the "non informative" words like "the", "a", "is",... - see "1c. Preparing text data" -) we want to apply, if any<br>
    * <font color='green'>ngram_range=(1, 1)</font> specifies the length of our n-grams (ie the accepted values for n); if we want the tokens to be between $min$ and $max$ words or characters long, we specify <font color='green'>ngram_range=($min$,$max$)</font><br>
    * <font color='green'>analyzer='word'</font> specifies whether the tokens should be made of words ('word', the default) or of characters ('char')<br>
    * <font color='green'>max_df=1.0</font> sets a max threshold above which a token will not be part of the vocabulary (this makes it easy to build a list of corpus-specific stop words). If we use a float here, it represents the proportion of documents in the corpus that contain the token; if we use an integer, it is the absolute number of documents that contain it. This parameter is ignored if vocabulary is not None.<br>
    * <font color='green'>min_df=1</font> works like 'max_df' but sets a min threshold, to not take into account words that are not frequent enough<br>
    * <font color='green'>max_features=None</font> only puts in the vocabulary the most frequent <font color='green'>max_features</font> words. If None, all features are used.<br>
    * <font color='green'>vocabulary=None</font> specifies our dictionary; if not given, it will be determined from the input documents<br><br>$\quad$
        For example :

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

docs = np.array(["The sun is shining","The weather is sweet",
                 "The sun is shining, the weather is sweet, and one and one is two"])
count = CountVectorizer()
bag = count.fit_transform(docs)
print(BB + "Our example document is composed of :" + MB, docs)
print(BB + "\nThe complete dictionary will be :" + MB, count.vocabulary_)
print(BB + "\nWith this dictionary, the feature vectors are :\n" + MB, bag.toarray())

[34m[1mOur example document is composed of :[35m[1m ['The sun is shining' 'The weather is sweet'
 'The sun is shining, the weather is sweet, and one and one is two']
[34m[1m
The complete dictionary will be :[35m[1m {'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}
[34m[1m
With this dictionary, the feature vectors are :
[35m[1m [[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
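
To make the mechanics explicit, here is a minimal pure-Python sketch of the bag-of-words model with word n-grams. The helper names (`word_ngrams`, `build_vocabulary`, `feature_vector`) are hypothetical, and the tokenization is a plain lowercase-and-split, which is simpler than what `CountVectorizer` does by default, so this is an illustration of the idea rather than a reimplementation of scikit-learn.

```python
from collections import Counter

def word_ngrams(text, n_min=1, n_max=1):
    """Lowercase, split on whitespace, return all word n-grams with n_min <= n <= n_max."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(words) - n + 1)]

def build_vocabulary(docs, n_min=1, n_max=1):
    """Map each distinct token to its index in alphabetical order (indices start at 0)."""
    tokens = sorted({t for d in docs for t in word_ngrams(d, n_min, n_max)})
    return {tok: i for i, tok in enumerate(tokens)}

def feature_vector(doc, vocabulary, n_min=1, n_max=1):
    """Count how many times each vocabulary token appears in the document."""
    counts = Counter(word_ngrams(doc, n_min, n_max))
    vec = [0] * len(vocabulary)
    for tok, idx in vocabulary.items():
        vec[idx] = counts[tok]
    return vec

# The "cat"/"dog" example from the text, with a manually specified vocabulary
fixed_vocab = {"cat": 0, "dog": 1}
print(feature_vector("The cat meows", fixed_vocab))  # [1, 0]

# Unigrams and bigrams, similar in spirit to ngram_range=(1, 2)
docs = ["The sun is shining", "The weather is sweet"]
vocab = build_vocabulary(docs, 1, 2)
print(vocab)
print([feature_vector(d, vocab, 1, 2) for d in docs])
```

With `n_max=2` the vocabulary grows from 6 to 12 tokens on these two short sentences, which hints at why larger n-gram ranges make the feature vectors longer and sparser.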


<br><br><div style="font-family: Arial; font-size:14px; color: darkblue">
    <p style="margin: 0 0 .5rem 4rem; font-size: 16px; font-weight: 600"><font color='darkpink'>1b. The TF-IDF</font><a id="1b"></a></p>$\quad$
        If we put in our vocabulary all the words present in one or more documents (as we did in our previous example), our vocabulary will likely become very large when we work with a large corpus of documents. This results in high-dimensional feature vectors in which words that occur in almost every document (like "a", "the", etc.) get large counts although they carry little information. So we need a criterion to weight the words according to their relevance. One such criterion is the <font color='darkpink'>term frequency-inverse document frequency</font>, or <font color='darkpink'>TF-IDF</font> for short. The idea is, when we consider a document, to give a high weight to the terms (ie the processed tokens, eg the term "cat" may correspond to the tokens "cat", "Cat" or "cats") that appear frequently in the document but not too frequently in our corpus of documents, and a low weight to the terms that appear in most documents, thus capturing the unique characteristics of the document.<br>$\quad$
        More precisely, let's call <font color='darkpink'>term frequency of a term $t$ inside a document $D$</font> the ratio between the number of times $t$ appears in $D$, say $n_D(t)$, and the total number of terms in $D$, say $N_D$ :<br><p style="margin: 0 0 .5rem 0"></p><font color = darkpink>
        $$\text{TF}\,(D,t) \;=\; \frac{n_D(t)}{N_D} \qquad (1)$$</font><br>$\quad$
        Historically, the <font color='darkpink'>inverse document frequency of a term $t$</font> is the natural logarithm of the ratio between the total number of documents $\left|\mathcal{D}\,\right|$ and the number of documents $\left|\mathcal{D}_t\,\right|$ which contain $t$ at least once :<br><p style="margin: 0 0 .5rem 0"></p><font color = darkpink>
        $$\text{IDF}\,(t) \;=\; \ln\left( \frac{\left|\mathcal{D}\,\right|}{\left|\mathcal{D}_t\,\right|} \right) \qquad (2)$$</font><br>$\quad$
    This could lead to a division by $0$, if a term $t$ in our vocabulary is present in none of the documents, which may happen if our list of words is not built by extracting the tokens from $\mathcal{D}$. In order to avoid divisions by $0$, modern implementations add a $1$ to the denominator. But simply adding $1$ to the denominator (like S. Raschka in the book) implies that a term present in all the documents will have a strictly negative inverse document frequency. Adding $1$ to both the numerator and the denominator avoids this oddity, and it is more natural (this is similar to first adding our list of terms as an additional document in the corpus $\mathcal{D}$ before computing the historical IDF formula); adding $1$ to the result then ensures that all terms get a strictly positive IDF. This is the formula that <font color='green'>scikit-learn</font> uses by default :<br><p style="margin: 0 0 .5rem 0"></p><font color = darkpink>
        $$\text{IDF}\,(t) \;=\; \ln\left( \frac{1+\left|\mathcal{D}\,\right|}{1+\left|\mathcal{D}_t\,\right|} \right) + 1 \qquad (2)'$$</font><br>$\quad$
        The "term frequency-inverse document frequency" is the product of these quantities :<br><p style="margin: 0 0 .5rem 0"></p><font color = darkpink>
        $$\text{TF-IDF}\,(D,t) \;=\; \text{TF}\,(D,t) \times \text{IDF}\,(t) \qquad (3)$$</font><br>
<p style="margin: 0 0 0 2rem; font-size: 15px; font-weight: 600"><font color='purple'>$\quad$Note :</font></p> $\quad$
        $\ln\,$ is used to ensure that low document frequencies are not given too much weight.
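
As a quick worked check of the three variants discussed above, take the three-sentence corpus of cell 2, so $\left|\mathcal{D}\,\right| = 3$. The term "is" appears in all three documents ($\left|\mathcal{D}_t\,\right| = 3$) and the term "and" in only one ($\left|\mathcal{D}_t\,\right| = 1$). The values below are rounded to three decimals :

$$\begin{aligned}
\text{historical } (2) : \quad & \text{IDF}(\text{is}) = \ln\left(\tfrac{3}{3}\right) = 0, & \text{IDF}(\text{and}) &= \ln\left(\tfrac{3}{1}\right) \approx 1.099 \\
\text{1 added to the denominator only} : \quad & \text{IDF}(\text{is}) = \ln\left(\tfrac{3}{4}\right) \approx -0.288, & \text{IDF}(\text{and}) &= \ln\left(\tfrac{3}{2}\right) \approx 0.405 \\
\text{smoothed } (2)' : \quad & \text{IDF}(\text{is}) = \ln\left(\tfrac{4}{4}\right) + 1 = 1, & \text{IDF}(\text{and}) &= \ln\left(\tfrac{4}{2}\right) + 1 \approx 1.693
\end{aligned}$$

The middle row shows the negative value mentioned above for a term present in every document, while $(2)'$ keeps every IDF at $1$ or more.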

<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        <font color='green'>scikit-learn</font> implements the TF-IDF in the <font color='green'>feature_extraction.text</font> module, through the class <font color='green'>TfidfTransformer(norm='l2', smooth_idf=True)</font>, which transforms a count matrix, like our <font color='magenta'>bag</font> in our previous code (cell $2$) into a normalized TF-IDF representation : each row still corresponds to a single document, and is composed of a sequence of numbers corresponding to the various terms in our vocabulary, but instead of raw counts of each term it contains their TF-IDF. Note that <font color='green'>scikit-learn</font> uses the raw count $n_D(t)$ as the term frequency, rather than the ratio $(1)$, and relies on the row normalization to account for the length of the documents.<br>
    The parameters I have selected here are :<br>
    * <font color='green'>norm='l2'</font> : by default, each row is normalized according to the $l^2$-norm; other possible values are <font color='green'>norm='l1'</font> or <font color='green'>norm=None</font><br>
    * <font color='green'>smooth_idf=True</font> : by default, their definition ($(2)'$) of the IDF is used to compute the TF-IDF; with <font color='green'>smooth_idf=False</font> the IDF is $\ln\left(\left|\mathcal{D}\,\right| / \left|\mathcal{D}_t\,\right|\right) + 1$, ie the definition $(2)$ of the IDF plus $1$<br><br>$\quad$
        For example :

In [3]:
from sklearn.feature_extraction.text import TfidfTransformer

# sets the precision for floats in NumPy arrays (here 2 digits after the decimal point)
np.set_printoptions(precision=2)
# first, the default case 'smooth_idf=True'
tfidf = TfidfTransformer(norm='l2', smooth_idf=True)
# uses our 'TfidfTransformer' to compute the TF-IDF from the count matrix 'bag' obtained in cell 2,
# then prints it
print(BB + "The value for the default TF-IDF when 'smooth_idf=True' :\n" + MB,
      tfidf.fit_transform(bag).toarray())
# second, we do the same in the case 'smooth_idf=False'
tfidf_without_smooth = TfidfTransformer(norm='l2', smooth_idf=False)
print(BB + "\nThe value for the TF-IDF when 'smooth_idf=False' :\n" + MB,
      tfidf_without_smooth.fit_transform(bag).toarray())

[34m[1mThe value for the default TF-IDF when 'smooth_idf=True' :
[35m[1m [[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]
[34m[1m
The value for the TF-IDF when 'smooth_idf=False' :
[35m[1m [[0.   0.41 0.   0.58 0.58 0.   0.41 0.   0.  ]
 [0.   0.41 0.   0.   0.   0.58 0.41 0.   0.58]
 [0.54 0.39 0.54 0.18 0.18 0.18 0.26 0.27 0.18]]
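
These numbers can be reproduced by hand. The sketch below is a plain-Python illustration, not scikit-learn's actual code: it assumes the raw count as term frequency, the IDF $\ln\left(\frac{1+|\mathcal{D}|}{1+|\mathcal{D}_t|}\right)+1$ when `smooth_idf=True` and $\ln\left(\frac{|\mathcal{D}|}{|\mathcal{D}_t|}\right)+1$ otherwise, followed by an $l^2$ normalization of each row. The function name `tfidf_rows` is hypothetical.

```python
import math

# Count matrix from cell 2 (rows = documents, columns = tokens in vocabulary order)
counts = [
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 2, 1, 1, 1, 2, 1, 1],
]

def tfidf_rows(counts, smooth_idf=True):
    n_docs = len(counts)
    n_terms = len(counts[0])
    # document frequency: number of documents in which each term appears at least once
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    if smooth_idf:
        idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    else:
        # would divide by zero for a term absent from every document (not the case here)
        idf = [math.log(n_docs / d) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]      # raw count times IDF
        norm = math.sqrt(sum(v * v for v in weighted))    # l2 norm of the row
        result.append([v / norm for v in weighted])
    return result

for row in tfidf_rows(counts, smooth_idf=True):
    print([round(v, 2) for v in row])
```

Up to rounding, the rows agree with the `TfidfTransformer` output above, for both values of `smooth_idf`.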


<br><br><div style="font-family: Arial; font-size:14px; color: darkblue">
    <p style="margin: 0 0 .5rem 4rem; font-size: 16px; font-weight: 600"><font color='darkpink'>1c. Preparing text data</font><a id="1c"></a></p>$\quad$
        The first thing before working with text data is to clean our data. Indeed, there may be many irrelevant words or characters. For example, these are the last $50$ characters of our first document (in our 'movies' dataset) :

In [4]:
print(BB + "The last 50 characters of our first document are :\n" + MB,
      df_movies.loc[0, 'review'][-50:])

[34m[1mThe last 50 characters of our first document are :
[35m[1m is seven.<br /><br />Title (Brazil): Not Available


<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        Cleaning text data is not as straightforward as it may seem. For example, we want to keep emoticons, eg ':)' or ':(', and capital letters may be relevant (lowercasing can turn some family names into common words), but we still need to remove markup. The author proposes a simple 'preprocessor' based on regular expressions and Python's regex library, <font color='green'>re</font>; note that it does convert the text to lowercase. Let's define it and test it on the last 50 characters of our first document and on another example :

In [5]:
import re

def preprocessor(text):
    #  '<[^>]*>' matches any substring that starts with '<', followed by zero or more characters that
    # are not '>', and ends with '>'; in other words, it matches HTML tags.
    # Then 're.sub('<[^>]*>', '', text)' replaces all occurrences of substrings matching the regular
    # expression with an empty string; in other words, it removes them from the 'text' string
    text = re.sub("<[^>]*>", "", text)
    # I believe I can improve on the author's line by adding the emoticons with 'd' or 'p', so I
    # have replaced his line. I am keeping the original line in case I am wrong. Anyway, this seems
    # to be merely cosmetic here : the results and speeds in 2a. are exactly the same
    # <original line :> emoticons = re.findall("(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    # '?:' indicates a non-capturing group, meaning the parentheses are used for grouping but not for
    # capturing the matched text; eg '(?::|;|=)' matches one and only one of the characters ':', ';',
    # or '='; then '(?:-)?' means that this character must be followed by 0 or 1 occurrence of '-';
    # finally, '(?:\)|\(|[DdPp])' means that this must be followed by either ')', '(', or one of the
    # letters 'D', 'd', 'P' or 'p'
    # moreover, in an ordinary string '\)' and '\(' are invalid escape sequences, for which recent
    # versions of Python (3.12 and later) emit a SyntaxWarning; if we add an "r" prefix to the
    # strings containing regular expressions, we tell Python to treat backslashes in those
    # strings as literal characters, which resolves the issue
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|[DdPp])", text)
    # '[\W]+' matches one or more occurrences of any non-word character (ie any character that is
    # neither a letter, a digit, nor an underscore);
    # 're.sub("[\W]+", " ", text.lower())' : first, 'text' has its uppercase characters replaced by
    # their lowercase counterparts; then all the sequences of non-word characters it contains (eg
    # punctuation marks or symbols like emoticons or '#') are replaced by a single space
    # '" ".join(emoticons)' joins the list of emoticons with spaces between them
    # then '.replace("-", "")' removes any hyphens from the resulting string of emoticons
    text = re.sub(r"[\W]+", " ", text.lower()) + " ".join(emoticons).replace("-", "")
    return text

print(BB + "After using our preprocessor, the last 50 characters of our first document are :\n" + MB,
      preprocessor(df_movies.loc[0, 'review'][-50:]))
print(BB + "Our preprocessor transforms '</a>This :) is :( a test :-)!' into :\n" + MB,
     preprocessor("</a>This :) is :( a test :-)!"))

[34m[1mAfter using our preprocessor, the last 50 characters of our first document are :
[35m[1m is seven title brazil not available
[34m[1mOur preprocessor transforms '</a>This :) is :( a test :-)!' into :
[35m[1m this is a test :) :( :)


<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        Before going to the next step, we clean our data :

In [6]:
df_movies['review'] = df_movies['review'].apply(preprocessor)
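
The difference between the author's emoticon pattern and the extended one used in our 'preprocessor' (cell 5) can be checked on a short sentence. The sample string below is made up for the illustration, and the constant names `ORIGINAL` and `EXTENDED` are hypothetical.

```python
import re

# the author's pattern: only uppercase 'D' and 'P' are accepted as the "mouth"
ORIGINAL = r"(?::|;|=)(?:-)?(?:\)|\(|D|P)"
# the extended pattern used in cell 5: lowercase 'd' and 'p' are accepted as well
EXTENDED = r"(?::|;|=)(?:-)?(?:\)|\(|[DdPp])"

sample = "Great :-D but ;p or =( and :x"

print(re.findall(ORIGINAL, sample))  # [':-D', '=(']
print(re.findall(EXTENDED, sample))  # [':-D', ';p', '=(']
```

Neither pattern matches ':x', since 'x' is not in the list of accepted final characters.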

<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        To finalise the preparation of our data we use a <font color='darkpink'>word stemming technique</font>. Word stemming is the process of transforming a word into its root form; it allows us to map related words to the same stem.<br>$\quad$
        The original stemming algorithm was developed by Martin F. Porter in 1979 and is hence known as the <font color='darkpink'>Porter stemmer</font> algorithm. This is the one used in this class, but other word stemming techniques do exist, eg the <font color='darkpink'>Lancaster stemmer</font> and the <font color='darkpink'>Snowball stemmer</font>. The Snowball stemmer is the newest of the three; it is an improvement over the original Porter stemmer and is often called "Porter 2". The Snowball stemmer offers better performance than its predecessor, especially for languages other than English; it is more aggressive and may produce more accurate stems. It is the option favored by ChatGPT "in general". The Lancaster stemmer is the most aggressive of the three and will produce shorter and more obscure words, according to both ChatGPT and the author.<br>$\quad$
        These three stemmers are available through the "nltk package" ("nltk" stands for "Natural Language Toolkit"), so I have downloaded this package, via 'pip install nltk'.<br>$\quad$
        In the following code we define a function to use the Porter stemmer and test it on a short example. I added similar code for the Snowball stemmer. We can see that the Porter stemmer reduced "thus" to an unknown root "thu", while the Snowball stemmer did a better job and kept "thus"; but they both failed to recognize that "runner" has the same stem as "running" and "run" :

In [7]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
print(BB + 'The Porter stemmer transforms "Runners like running and thus they run a lot" into :\n' + MB,
      tokenizer_porter("Runners like running and thus they run a lot"))

from nltk.stem.snowball import SnowballStemmer
snowball = SnowballStemmer("english")
def tokenizer_snowball(text):
    return [snowball.stem(word) for word in text.split()]
print(BB + '\nThe Snowball stemmer transforms "Runners like running and thus they run a lot" into :\n' + MB,
      tokenizer_snowball("Runners like running and thus they run a lot"))

[34m[1mThe Porter stemmer transforms "Runners like running and thus they run a lot" into :
[35m[1m ['runner', 'like', 'run', 'and', 'thu', 'they', 'run', 'a', 'lot']
[34m[1m
The Snowball stemmer transforms "Runners like running and thus they run a lot" into :
[35m[1m ['runner', 'like', 'run', 'and', 'thus', 'they', 'run', 'a', 'lot']


<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        Finally, we need to remove <font color='darkpink'>stop words</font>. Stop words are simply those words that are extremely common in all sorts of texts and probably bear (almost) no useful information; for example "the", "is", "and", "has", etc. Removing stop words is especially useful if we are working with raw or normalized term frequencies rather than TF-IDF values, which already downweight the frequently occurring words. (Note : in the code in cell 9, we work with TF-IDF, and in my tests the best results were always obtained without removing the stop words)<br>$\quad$
        To remove stop words from the movie reviews, we will use the set of English stop words that is available from the NLTK library (127 words according to the book; the exact number depends on the NLTK version), which can be obtained by calling the <font color='green'>nltk.download</font> function via<br><font color='magenta'>
    import nltk<br>
    nltk.download("stopwords")</font><br>$\quad$
    Once it has been downloaded, it does not need to be downloaded again, but it still prints unimportant lines saying that it has been downloaded to my C drive and is up to date. So, I have removed these lines from the code.

In [8]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
print(BB + "After transformation by the Snowball stemmer and removal of the 'nltk' stopwords,\
 \"Runners like running and thus they run a lot\" becomes :\n" + MB,
      [w for w in tokenizer_snowball("Runners like running and thus they run a lot") if w not in stop])

[34m[1mAfter transformation by the Snowball stemmer and removal of the 'nltk' stopwords, "Runners like running and thus they run a lot" becomes :
[35m[1m ['runner', 'like', 'run', 'thus', 'run', 'lot']


<a id="2"></a><br><br><br><div style="font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">
    <h4 style="margin: 0 0 .5rem 2rem; font-size: 18px; font-weight: 800"><font color='darkpink'>2. LOGISTIC REGRESSION MODEL FOR DOCUMENT CLASSIFICATION</font></h4>
    <p style="margin: 0 0 .5rem 4rem; font-size: 16px; font-weight: 600"><font color='darkpink'>2a. Training a logistic regression model for our movie reviews dataset</font><a id="2a"></a></p>$\quad$
        In this section, we will train a logistic regression model (see part 3, "2. The logistic regression", for a general presentation of logistic regression models) to classify the movie reviews into positive and negative reviews based on the bag-of-words model.<br>$\quad$
        First, we divide our DataFrame of cleaned text documents into 25,000 documents for training and 25,000 documents for testing. Then we use a <font color='green'>GridSearchCV</font> object to find the optimal set of parameters for our logistic regression model using 5-fold stratified cross-validation (see part 6, "1. Pipelines" for pipelines, "2. 𝑘-fold cross-validation" for cross-validation and "4a. Grid search" for '<font color='green'>GridSearchCV</font>'). Finally, we present the accuracy scores on the training and test datasets.<br>$\quad$
    <p style="margin: 0 0 0 2rem; font-size: 15px; font-weight: 600"><font color='purple'>$\quad$Notes :</font></p>$\quad$
        * <font color='green'>TfidfVectorizer</font> is "Equivalent to CountVectorizer followed by TfidfTransformer" according to <font color='green'>scikit-learn</font><br>$\quad$
        * the author's code takes an infinite amount of time to complete with my potato laptop : 30'17". Since the Snowball stemmer is more efficient than the Porter stemmer, I have tried it instead; I gained almost 10 minutes, but it still took 22'46". Then I have tried other solvers than the one chosen by the author; the default 'lbfgs' did not converge on some fits (there are normalization problems when we do not normalize our TF-IDF rows), and took more time when it did converge; 'saga' always converged and was very slightly faster, with 22'13". In all these attempts, the hyperparameter values which got the best accuracy scores were the simplest : no stop words and no word stemmer; it got $89.7\%$ accuracy on our training dataset, and $89.9\%$ accuracy on our test dataset<br>$\quad$
        * since the use of a word stemmer is what took the most time (around 2.5 minutes per fit with the Porter stemmer, 1.6 minutes per fit with the Snowball stemmer, 10 seconds per fit without any stemmer), I have decided to only use the author's basic tokenizer, which splits the text on whitespace without stemming (see the code). Also, according to <font color='green'>scikit-learn</font> : "For small datasets, 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large ones"; our dataset is of medium size, but 'saga' was already what performed best (it gave the same results, but was the fastest), so I used it. And since we use a tokenizer, we must use <font color='green'>token_pattern=None</font>, which I added to avoid the "UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'". In the end my hyperparameter options are rather meager; I only test the inverse of the regularization strength 'C' for the values $1$ and $10$, $10$ being the best with accuracy scores of $89.7\%$ on our training dataset, and $89.9\%$ on our test dataset (while 'C=1' gets $88.9\%$ and $89.4\%$ respectively).<br>$\quad$
        * at least, that was before I decided to use both 1-word and 2-word tokens instead of only 1-word tokens; it takes between $6$ and $7$ times longer, but I think that 2-word tokens might be interesting as well, since distinguishing between "very good" and "not good" might be relevant; so I use 'ngram_range=(1,2)' instead of 'ngram_range=(1,1)', and I got slightly better results : accuracy scores of $90.7\%$ on our training dataset, and $90.9\%$ on our test dataset<br>$\quad$
        * by choosing 'verbose=2' we get the individual times for the fits (as well as the tokens that have been eliminated when we choose to remove stop words)
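
The first note above can be checked on the three toy sentences of cell 2. This is only a sanity check with the default parameters of both routes, not a proof of equivalence for every parameter combination; the variable names `one_step` and `two_steps` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ["The sun is shining", "The weather is sweet",
        "The sun is shining, the weather is sweet, and one and one is two"]

# route 1: raw text -> TF-IDF in a single step
one_step = TfidfVectorizer().fit_transform(docs)

# route 2: raw text -> counts -> TF-IDF
two_steps = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(docs))

# both routes return sparse matrices; compare their dense versions
print(np.allclose(one_step.toarray(), two_steps.toarray()))
```

Using `TfidfVectorizer` directly inside the pipeline therefore saves one step without changing the features.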

In [9]:
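# '.loc' slices by label and includes both endpoints, so '.loc[:25000]' and '.loc[25000:]' would
# both contain row 25000; '.iloc' slices by position and gives two disjoint sets of rows
X_train = df_movies['review'].iloc[:25000].values
y_train = df_movies['sentiment'].iloc[:25000].values
X_test = df_movies['review'].iloc[25000:].values
y_test = df_movies['sentiment'].iloc[25000:].values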

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenizer(text):
    return text.split()

tfidf = TfidfVectorizer(lowercase=False)
param_grid = [
    {'v__ngram_range': [(1,2)], 'v__stop_words': [None], 'v__token_pattern': [None],
     'v__tokenizer': [tokenizer], 'clf__penalty': ['l2'], 'clf__C': [1.0, 10.0]}]
lr_tfidf = Pipeline([('v', tfidf), ('clf', LogisticRegression(solver='saga'))])
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy', cv=5, verbose=2, n_jobs=1)
gs_lr_tfidf.fit(X_train, y_train)

print(BB + "\nAmong the hyperparameters tested, the best combination is :\n" + MB,
      gs_lr_tfidf.best_params_)
print(BB + "On the training dataset with 5-fold cross validation, they yielded an accuracy of : " \
      + MB, round(gs_lr_tfidf.best_score_,3))
clf = gs_lr_tfidf.best_estimator_
print(BB + "And on the test dataset the accuracy is : " + MB, round(clf.score(X_test, y_test),3))

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV] END clf__C=1.0, clf__penalty=l2, v__ngram_range=(1, 2), v__stop_words=None, v__token_pattern=None, v__tokenizer=<function tokenizer at 0x000001A93C4AB740>; total time=  50.7s
[CV] END clf__C=1.0, clf__penalty=l2, v__ngram_range=(1, 2), v__stop_words=None, v__token_pattern=None, v__tokenizer=<function tokenizer at 0x000001A93C4AB740>; total time=  49.8s
[CV] END clf__C=1.0, clf__penalty=l2, v__ngram_range=(1, 2), v__stop_words=None, v__token_pattern=None, v__tokenizer=<function tokenizer at 0x000001A93C4AB740>; total time= 1.0min
[CV] END clf__C=1.0, clf__penalty=l2, v__ngram_range=(1, 2), v__stop_words=None, v__token_pattern=None, v__tokenizer=<function tokenizer at 0x000001A93C4AB740>; total time=  50.1s
[CV] END clf__C=1.0, clf__penalty=l2, v__ngram_range=(1, 2), v__stop_words=None, v__token_pattern=None, v__tokenizer=<function tokenizer at 0x000001A93C4AB740>; total time=  49.4s
[CV] END clf__C=10.0, clf__penalty=l2, v

<br><br><div style="font-family: Arial; font-size:14px; color: darkblue">
    <p style="margin: 0 0 .5rem 4rem; font-size: 16px; font-weight: 600"><font color='darkpink'>2b. Working with bigger data - online algorithms and out-of-core learning</font><a id="2b"></a></p>$\quad$
        As we have just seen in the previous subchapter, working with a dataset of $50{,}000$ reviews can be computationally demanding, especially during a grid search. But in many real-world applications, we need to work with even larger datasets, which may not even fit in our computer's memory. The technique used to handle such datasets is called <font color='darkpink'>out-of-core learning</font>; it is also referred to as <font color='darkpink'>online learning</font> or <font color='darkpink'>incremental learning</font>, since the model is fitted incrementally. In out-of-core learning the algorithm processes the data sequentially, one mini-batch at a time, updating the model parameters after each mini-batch. One algorithm of choice is the mini-batch gradient descent (see part 2, "2e. The mini-batch gradient descent").<br>$\quad$
        The first step still is to prepare our dataset : clean it, split it into words, and remove stop words. This was done with our 'preprocessor' function (defined in cell 5), our 'tokenizer' function to split a text into its words (defined in cell 9), and using our list 'stop' of stop words (defined in cell 8). Let's redefine our 'tokenizer' function so that it does all these steps :

In [10]:
def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    # here again, I believe the author's version can be improved: it lowercases 'text' before
    # looking for emoticons, so the uppercase 'D' and 'P' of its pattern can never match; I
    # therefore search the original text and accept both cases ('[DdPp]')
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|[DdPp])", text)
    text = re.sub(r"[\W]+", " ", text.lower()) + " ".join(emoticons).replace("-", "")
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized
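
To get a feel for what this tokenizer returns, here is a minimal sketch that applies it to a made-up review. The tiny 'stop' set and the sample sentence are assumptions made for this illustration only; the notebook itself uses the full list 'stop' defined in cell 8.

```python
import re

# hypothetical, tiny stop-word set used only for this illustration
stop = {"a", "is", "this", "the", "and"}

def tokenizer(text):
    # removes the HTML tags
    text = re.sub('<[^>]*>', '', text)
    # collects the emoticons from the original (not lowercased) text
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|[DdPp])", text)
    # lowercases, replaces runs of non-word characters with a space, then appends the emoticons
    # (without their "nose" character '-')
    text = re.sub(r"[\W]+", " ", text.lower()) + " ".join(emoticons).replace("-", "")
    # splits on whitespace and drops the stop words
    return [w for w in text.split() if w not in stop]

# made-up review containing an HTML tag and two emoticons
sample = "<br />This movie is GREAT :) and funny ;-)"
tokens = tokenizer(sample)
print(tokens)  # ['movie', 'great', 'funny', ':)', ';)']
```

The HTML tag and the stop words disappear, the words are lowercased, and the emoticons are moved to the end of the token list.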

<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        Now, let's define a function that returns the next mini-batch of the desired number of instances from the corpus of documents :

In [11]:
def stream_docs(path):
    # opens the csv file specified by the path argument in read mode ('r'); the file is opened with
    # utf-8 encoding to handle text data;
    # 'with' is used to ensure that the file is properly closed after the code block indented
    # under it finishes executing (this is a best practice in Python for working with files)
    with open(path, 'r', encoding="utf-8") as csv:
        # reads the first line of the csv file, which typically contains the header (column names)
        next(csv) # skip header
        # iterates over each line in the csv file
        for line in csv:
            # assuming each line ends with ',<label>\n', 'line[:-3]' keeps everything except these
            # last three characters (the comma, the one-digit label and the newline), ie the review
            # text; 'int(line[-2])' converts the second-to-last character, the label, into an
            # integer
            text, label = line[:-3], int(line[-2])
            # generates a tuple containing the text and label and returns it, without putting an end
            # to the function like a 'return' would; 'yield' returns the result and the function's
            # state is preserved until the next time it is called; this is particularly useful for
            # iterating over large sequences of values without needing to store them all in memory
            yield text, label

# 'doc_stream' will be an iterator producing tuples of (text, label), and size will be an integer
# specifying the number of samples to retrieve from 'doc_stream'
def get_minibatch(doc_stream, size):
    # initializes two empty lists to store the text data and corresponding labels, respectively
    docs, y = [], []
    # attempts to retrieve 'size' samples from the doc_stream iterator
    try:
        for _ in range(size):
            # retrieves the next tuple, starting from the first one and proceeding sequentially
            text, label = next(doc_stream)
            # adds 'text' to the 'docs' list, and 'label' to the 'y' list
            docs.append(text)
            y.append(label)
    # If the 'try' block encounters a 'StopIteration' exception, meaning that the doc_stream iterator
    # has been exhausted, the function returns 'None, None'
    except StopIteration:
        return None, None
    # otherwise it returns the collected mini-batch of 'docs' and corresponding labels 'y'
    return docs, y
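
The behaviour of these two functions can be checked on a small file. The sketch below repeats both definitions so that it runs on its own, and writes a hypothetical three-review file (the file name and its content are made up, and only mimic the layout assumed for 'movie_data.csv': a header line, then one review and a one-digit label per line).

```python
import os
import tempfile

def stream_docs(path):
    with open(path, 'r', encoding="utf-8") as csv:
        next(csv)  # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# hypothetical toy file written to a temporary directory
toy_path = os.path.join(tempfile.mkdtemp(), "toy_movie_data.csv")
with open(toy_path, "w", encoding="utf-8", newline="\n") as f:
    f.write("review,sentiment\n")
    f.write('"Great film",1\n')
    f.write('"Awful plot",0\n')
    f.write('"Loved it",1\n')

toy_stream = stream_docs(toy_path)
# the first call consumes the first two reviews
first_batch = get_minibatch(toy_stream, size=2)
# the second call finds only one review left, hits 'StopIteration' and returns (None, None)
second_batch = get_minibatch(toy_stream, size=2)
print(first_batch)   # (['"Great film"', '"Awful plot"'], [1, 0])
print(second_batch)  # (None, None)
```

Note that the surrounding double quotes stay in the text, and that an incomplete final mini-batch is discarded rather than returned.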

<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        We can’t use <font color='green'>CountVectorizer</font> effectively for out-of-core learning since it requires holding the complete vocabulary in memory. Also, <font color='green'>TfidfVectorizer</font> needs to keep all the feature vectors of the training dataset in memory to calculate the inverse document frequencies. However, another vectorizer for text processing implemented in the <font color='green'>feature_extraction.text</font> module of <font color='green'>scikit-learn</font> is useful in our context : <font color='green'>HashingVectorizer(input='content', encoding='utf-8', decode_error='strict', lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, ngram_range=(1, 1), analyzer='word', n_features=1048576, norm='l2')</font>. It is stateless: it maps each token to a column index with a hash function, so it needs neither a stored vocabulary nor a fitting step. I have selected only a subset of the possible parameters :<br>
    * <font color='green'>input='content'</font> specifies how the input will be passed: with the default <font color='green'>input='content'</font>, the input items are strings or bytes; with <font color='green'>input='file'</font>, they are file-like objects (with a 'read' method) whose content is read; with <font color='green'>input='filename'</font>, they are filenames that <font color='green'>scikit-learn</font> opens and reads for us<br>
    * <font color='green'>encoding='utf-8'</font> specifies the encoding used to decode the input when bytes or files are given to analyze ('utf-8' by default); it has no effect on inputs that are already decoded strings<br>
    * <font color='green'>decode_error='strict'</font> specifies what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is 'strict', meaning that a UnicodeDecodeError will be raised. Other values are 'ignore' and 'replace'<br>
    * <font color='green'>lowercase=True</font> converts all characters to lowercase before tokenizing<br>
    * <font color='green'>preprocessor=None</font> specifies the preprocessor we want to apply, if any<br>
    * <font color='green'>tokenizer=None</font> specifies the tokenizer we want to apply, if any; here S. Raschka placed the preprocessing stage inside our tokenizer called <font color='magenta'>tokenizer</font> (see cell 10), so we will have <font color='green'>preprocessor=None</font> and <font color='green'>tokenizer=tokenizer</font><br>
    * <font color='green'>stop_words=None</font> specifies the list of stop words we want to apply, if any; here again, the author placed it in the 'tokenizer' function, so we will use the default <font color='green'>stop_words=None</font><br>
    * <font color='green'>ngram_range=(1, 1)</font> specifies the length of our n-grams (ie the accepted values for n); if we want the tokens to be between $min$ and $max$ words or characters long, we specify <font color='green'>ngram_range=($min$,$max$)</font><br>
    * <font color='green'>analyzer='word'</font> specifies whether the tokens should be made of words ('word', the default) or of characters ('char')<br>
    * <font color='green'>n_features=1048576</font> specifies the number of features (columns) in the output matrices (note that 1,048,576 = 2^20)<br>
    * <font color='green'>norm='l2'</font> specifies the norm to use to normalize term vectors; the default, 'l2', is $l^2$; we can also use 'l1', for $l^1$, and None, for no normalization<br>
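
The fact that this vectorizer needs no fitting can be seen on a toy example. The following is a minimal sketch: the two sentences and the small value 'n_features=2**10' are assumptions chosen for readability, not the values used in this notebook.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# two made-up documents
toy_docs = ["the movie was great", "the movie was awful"]

# no call to 'fit' is needed: each token is hashed directly to a column index
toy_vect = HashingVectorizer(n_features=2**10)
X_toy = toy_vect.transform(toy_docs)
# one row per document, 'n_features' columns
print(X_toy.shape)  # (2, 1024)

# a second, independent vectorizer produces exactly the same matrix, since the mapping
# depends only on the hash function and not on any data seen before
X_again = HashingVectorizer(n_features=2**10).transform(toy_docs)
same = (X_toy != X_again).nnz == 0
print(same)  # True
```

The price to pay is that the mapping cannot be inverted (we cannot recover the words from the column indices), and that different tokens may collide in the same column when 'n_features' is too small.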

<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        Let's try this vectorizer on our movie reviews dataset of the previous subchapter :

In [12]:
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer

vect = HashingVectorizer(decode_error='ignore', n_features=2**21, preprocessor=None,
            tokenizer=tokenizer)
clf = SGDClassifier(loss='log_loss', random_state=1)
# we use the same dataset as in 2a.
doc_stream = stream_docs(path='movie_data.csv')

# initializes the 'classes' variable
classes = np.array([0, 1])

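# trains the model on 45 mini-batches of 1,000 documents each, ie a single pass over the first
# 45,000 reviews (these are 45 iterations over mini-batches, not 45 epochs)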
for _ in range(45):
    # retrieves a minibatch of training data (X_train, y_train) from the document stream, using our
    # 'get_minibatch' function defined in cell 11
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    # checks if 'X_train' is empty, indicating the end of the stream, and breaks the loop if so
    if not X_train: break
    # otherwise, transforms the training data 'X_train' using our HashingVectorizer
    X_train = vect.transform(X_train)
    # updates the model with the minibatch data
    clf.partial_fit(X_train, y_train, classes=classes)

# uses the next 5000 lines of our dataset to define a test dataset
X_test, y_test = get_minibatch(doc_stream, size=5000)
# transforms the test data 'X_test' using our HashingVectorizer
X_test = vect.transform(X_test)
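# prints the accuracy score of the trained model; note that 'X_train' and 'y_train' now hold only
# the last training mini-batch (1,000 documents), so the "training" score is computed on that
# mini-batch, not on the 45,000 training documents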
print(BB + "The accuracy score of our model on the training dataset is :" + MB,
      round(clf.score(X_train, y_train),3))
print(BB + "The accuracy score of our model on the test dataset is :" + MB,
      round(clf.score(X_test, y_test),3))

[34m[1mThe accuracy score of our model on the training dataset is :[35m[1m 0.888
[34m[1mThe accuracy score of our model on the test dataset is :[35m[1m 0.868


<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        "The accuracy of the model is approximately $87\%$, slightly below the accuracy
that we achieved in the previous section using the grid search for hyperparameter tuning. However,
out-of-core learning is very memory efficient, and it took 37 seconds to complete."<br>$\quad$
        Finally, we can use the last 5,000 documents to update our model :

In [13]:
clf = clf.partial_fit(X_test, y_test)

<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
    <p style="margin: 0 0 0 2rem; font-size: 15px; font-weight: 600"><font color='purple'>$\quad$Notes (ChatGPT) :</font></p>$\quad$
        This line continues the training of the classifier 'clf' using the provided test data ('X_test' and 'y_test').<br>
    In the context of stochastic gradient descent (SGD) classifiers, the partial_fit method allows incremental learning, so we can update the model's parameters using new data without retraining it from scratch.<br>
    However, it's important to note that <font color='green'>partial_fit</font> should typically be used for online learning scenarios, where the data arrives in batches over time. Using the test data for training is only reasonable here because the evaluation is already finished: once the model has been updated with these documents, they can no longer provide an unbiased estimate of its performance. Normally, we would use <font color='green'>partial_fit</font> with new batches of training data, not with the test data.

<a id="Appendix"></a><br><br><br><div style="font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">
    <h4 style="margin: 0 0 .5rem 2rem; font-size: 18px; font-weight: 800"><font color='darkpink'>APPENDIX - TOPIC MODELING WITH LATENT DIRICHLET ALLOCATION</font></h4>$\quad$
        Topic modeling describes the broad task of assigning topics to unlabeled text documents. For example, a typical application is the categorization of documents in a large text corpus of newspaper articles. In applications of topic modeling, we then aim to assign category labels to those articles, for example, sports, finance, world news, politics, and local news.<br>$\quad$
        In this section, we will discuss a popular technique for topic modeling called latent Dirichlet allocation (LDA). However, note that while latent Dirichlet allocation is often abbreviated as LDA, it is not to be confused with linear discriminant analysis (also abbreviated as LDA), a supervised dimensionality reduction technique (see part 5, "2. Linear discriminant analysis").<br>$\quad$
        We will use the <font color='green'>LatentDirichletAllocation</font> class implemented in the <font color='green'>decomposition</font> module of <font color='green'>scikit-learn</font> to decompose the movie review dataset and categorize it into $10$ different topics. Since we are going to use <font color='green'>LatentDirichletAllocation</font> with its default values, and only scratch the surface of the topic, I only present $4$ of its numerous parameters, <font color='green'>LatentDirichletAllocation(n_components=10, learning_method='batch', max_iter=10, batch_size=128)</font> :<br>
    * <font color='green'>n_components=10</font> specifies the number of topics to separate the corpus into<br>
    * <font color='green'>learning_method='batch'</font> specifies the method used for the training. It can be either <font color='green'>'batch'</font> (uses all available training data in each iteration) or <font color='green'>'online'</font> (uses mini-batches of training data). The <font color='green'>'batch'</font> method is slower than the <font color='green'>'online'</font> method, but can lead to more accurate results<br>
    * <font color='green'>max_iter=10</font> specifies the maximum number of passes over the training data (aka epochs)<br>
    * <font color='green'>batch_size=128</font> specifies the number of documents to use in each iteration. Only used in online learning.<br>
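
To see what these parameters produce, here is a minimal sketch on a made-up corpus of four very short "reviews" (the corpus, the variable names and the choice of 2 topics are assumptions for the illustration; with so little data the topics themselves are not meaningful).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# made-up corpus: two "horror" documents and two "comedy" documents
toy_corpus = [
    "ghost blood horror scream",
    "horror blood gore scream",
    "laugh joke comedy funny",
    "comedy funny laugh joke",
]
# bag-of-words matrix: one row per document, one column per word (9 distinct words here)
toy_counts = CountVectorizer().fit_transform(toy_corpus)

toy_lda = LatentDirichletAllocation(n_components=2, learning_method='batch',
                                    max_iter=10, random_state=1)
# one row per document, one column per topic; each row sums to 1
toy_topics = toy_lda.fit_transform(toy_counts)
print(toy_topics.shape)           # (4, 2)
# one row per topic, one column per word of the vocabulary
print(toy_lda.components_.shape)  # (2, 9)
```

Each row of the document-topic matrix is a distribution over the topics, and each row of 'components_' contains the word weights of one topic.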

<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        The first step is to turn the reviews into a bag-of-words matrix of word counts, and then pass this matrix to <font color='green'>LatentDirichletAllocation()</font> so it can decompose it into $10$ topics.<br>$\quad$
        In the following code we show an implementation and ask for the $5$ most important words in each topic to try to understand which categories were discovered. I have tried several values for the hyperparameters, but they almost never yielded $10$ sensible categories. I could not reproduce the results in the book. The "best" result I got was obtained with the following code and parameter values :

In [14]:
from sklearn.decomposition import LatentDirichletAllocation

# here the author chose 'stop_words="english"', but scikit-learn warns us against it : "If 'english',
# a built-in stop word list for English is used. There are several known issues with 'english' and
# you should consider an alternative"; so I used 'stop' instead (defined cell 8)
# moreover I only used 'max_features=1000' instead of the author's 5000
count = CountVectorizer(stop_words=stop, max_df=.1, max_features=1000)
# transforms the text data into a document-term matrix X, where each row represents a document (here
# a movie review) and each column represents a unique word in the vocabulary. The values in the matrix
# are the frequencies of each word in the corresponding document
X = count.fit_transform(df_movies['review'].values)

lda = LatentDirichletAllocation(n_components=10, learning_method='batch', random_state=1)
# applies the LDA algorithm to X. It returns a new matrix X_topics where each row represents a
# document (here a movie review), and each column represents the probability of the document
# belonging to a particular topic (here movie category)
X_topics = lda.fit_transform(X)
# specifies the number of top words to display for each topic
n_top_words = 5
# retrieves the list of feature names (words) from the CountVectorizer object, ie 'feature_names' is
# the vocabulary built by the 'CountVectorizer'
feature_names = count.get_feature_names_out()
# for each topic, it prints the topic index, followed by the top n_top_words words associated with
# that topic; 'argsort' ranks the word indices in increasing order of weight, so the code takes the
# last 5 indices in reverse order, thus displaying the highest-weighted words first
for topic_idx, topic in enumerate(lda.components_):
    print(BB + "Topic", (topic_idx+1), ": " + MB)
    print(' '.join([feature_names[i]
                    for i in topic.argsort()[:-n_top_words - 1:-1]]))

[34m[1mTopic 1 : [35m[1m
woman father girl wife mother
[34m[1mTopic 2 : [35m[1m
family kids children documentary feel
[34m[1mTopic 3 : [35m[1m
worst action script awful stupid
[34m[1mTopic 4 : [35m[1m
comedy dvd minutes laugh jokes
[34m[1mTopic 5 : [35m[1m
war american john joe michael
[34m[1mTopic 6 : [35m[1m
series episode tv effects episodes
[34m[1mTopic 7 : [35m[1m
book original version game read
[34m[1mTopic 8 : [35m[1m
horror budget gore blood low
[34m[1mTopic 9 : [35m[1m
role performance performances screen rather
[34m[1mTopic 10 : [35m[1m
music performance role wonderful beautiful
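
The slicing expression 'topic.argsort()[:-n_top_words - 1:-1]' used in the loop above may look cryptic. The small sketch below reproduces it on made-up weights (the arrays 'weights' and 'words' are hypothetical and stand for one row of 'lda.components_' and for 'feature_names').

```python
import numpy as np

# hypothetical word weights for a single topic, and the corresponding words
weights = np.array([0.2, 3.0, 1.5, 0.1, 2.2, 0.7])
words = np.array(["plot", "horror", "gore", "the", "blood", "film"])
n_top = 3

# 'argsort' returns the indices that sort the weights in increasing order
order = weights.argsort()
print(order)  # [3 0 5 2 4 1]

# reading the last 'n_top' indices backwards gives the largest weights first
top_idx = order[:-n_top - 1:-1]
print(top_idx)  # [1 4 2]
top_words = list(words[top_idx])
print(' '.join(top_words))  # horror blood gore
```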


<br><div style="margin: 0; font-family: Arial; font-size: 14px; font-weight: 500; color: darkblue">$\quad$
        From these results, we can surmise that the code separated the movie reviews into the following categories :<br>
    Topic 1 : Heterosexual movies<br>
    Topic 2 : Family movies/documentaries<br>
    Topic 3 : Bad movies<br>
    Topic 4 : Comedy movies<br>
    Topic 5 : War movies<br>
    Topic 6 : TV series<br>
    Topic 7 : Movies based on books or/and games<br>
    Topic 8 : Horror movies<br>
    Topic 9 : Actors' performances<br>
    Topic 10 : Aesthetic movies<br>$\quad$
        Note that some categories are not very clear (e.g. with first names as relevant words for topic $5$); also "performance" is in both topics $9$ and $10$. And this was the best I could obtain...<br>$\quad$
        (ChatGPT, answering my question about why the results are so bad here :) <font color='purple'>while topic modeling algorithms like LDA can be powerful tools for uncovering hidden structures in text data, their effectiveness is influenced by various factors including the size and quality of the dataset, algorithm parameters, and preprocessing techniques. Experimenting with different approaches and fine-tuning the algorithm parameters may help improve the relevance and clarity of the topics extracted from the data.</font>