# Natural Language Processing : Classic to Deep Methods for Sentiment Analysis

## Resources

Bag-of-Words and TF-IDF:

https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/

Recurrent Neural Networks (RNNs):

https://medium.com/towards-data-science/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9

Long Short Term Memory networks (LSTMs):

https://medium.com/towards-data-science/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Word embeddings:

http://jalammar.github.io/illustrated-word2vec/

In [2]:
import os
import numpy as np
import pandas as pd

#TOFILL

Today we are going to tackle sentiment analysis, a *text classification* problem. The idea is simple: we want to automatically predict whether a text expresses a positive or a negative sentiment. To do so we will use the IMDB dataset, which contains 50,000 movie reviews from www.imdb.com together with their sentiment label, positive or negative. It is thus a binary classification problem, where we want to predict a binary target $y \in \{0,1\}$. We will go through different ways of encoding a text as a vector $x \in \mathbb{R}^d$, as well as different classification models, from classic approaches to modern deep learning models.

## Load the dataset

Load the dataset and explore the data a bit:

In [3]:
#Load and print the dataset
imdb_dataset_original=pd.read_csv('../data/IMDB Dataset.csv')
imdb_dataset = imdb_dataset_original.copy()
imdb_dataset.head(10)

Unnamed: 0,review,sentiment
0,I saw this movie last night after waiting ages...,positive
1,"This was, so far, the worst movie I have seen ...",negative
2,"In a time of magic, barbarians and demons abou...",positive
3,I had high expectations of this movie (the tit...,negative
4,This is a film of immense appeal to a relative...,negative
5,An hilariously accurate caricature of trying t...,positive
6,I watch most movies that Nick Mancuso is in be...,positive
7,This is one of those films that's more interes...,negative
8,A wonderful and gritty war film that focuses o...,positive
9,"Man, some of you people have got to chill. Thi...",positive


In [4]:
#Print first review:
print(imdb_dataset["review"][0])

I saw this movie last night after waiting ages and ages for it to be released here in Canada (still only in limited release). It was worth the wait and then some. I am a very avid reader of Margaret Laurence and was excited to see that this novel was being turned into a film. I actually ended up liking the movie better than the novel. I liked that the character of Bram Shipley was a bit less harsh, and that there seemed to be more of a love story between Hagar and Bram, which made the scenes at the end of Bram's life that much more moving. The loss seemed stronger. Hagar was not any more likable on film than in the book, but Ellen Burstyn was a genius in this role. She WAS Hagar through and through. Christine Horne was brilliant and has many more great things ahead I am sure. Her scenes with Cole Hauser were electrifying. I could go on and on, overall a 9 * out of 10. Fantastic and can't wait for it to come out on DVD, a must own for my collection!


In [5]:
#Print the two classes size
imdb_dataset['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

## Text preprocessing

As you can see, the text is quite messy. Before encoding it into features, we are going to go through several preprocessing steps to clean it:
* Removing the HTML tags.
* Removing other special characters: this means all non-alphanumeric characters, including punctuation.
* Lowercasing the text.
* Tokenization: splitting a text into a list of words, now called tokens.
* Stemming: removing the suffixes that come from conjugation, plural, etc., in order to bring a word back to its root form. For example, 'waiting' becomes 'wait'.
* Removing stopwords: words like 'to', 'a', 'the', ... are called stopwords; we remove them because they are too frequent and generally just add noise.

Fill the following functions to perform each of these steps. You are free to use the libraries of your choice to do so. Try to not reinvent the wheel!
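As a rough, standard-library-only sketch of how these steps chain together on a single string: the regex, the tiny stopword set `TOY_STOPWORDS` and the function name `toy_preprocess` are illustrative assumptions, not the functions you are asked to fill in. Stemming is left out because it needs a real stemmer such as NLTK's `PorterStemmer`.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list used only for this illustration
TOY_STOPWORDS = {"a", "the", "to", "it", "was", "i", "this"}

def toy_preprocess(text):
    # 1. Replace HTML tags such as <br /> with a space (crude regex, fine for simple markup)
    text = re.sub(r"<[^>]+>", " ", text)
    # 2. Remove ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. Lowercase
    text = text.lower()
    # 4. Tokenize on whitespace
    tokens = text.split()
    # 5. Drop stopwords
    return [t for t in tokens if t not in TOY_STOPWORDS]

print(toy_preprocess("Wow.<br /><br />The plot was BAD, I hated it!"))
```

The order matters: lowercasing before stopword removal lets "The" match the lowercase entry "the" in the stopword set.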

In [6]:
### Example of a review that still contains HTML tags
print(imdb_dataset["review"][10])

Wow. I don't even really remember that much about this movie, except that it stunk.<br /><br />The plot's basically; a girl's parents neglect her, so this sicko PokeMon pretends to be her dad. Am I the only one disturbed by that? Then, this weirdo PokeMon kidnaps Ash's mom to pretend to be the girl's. I don't care if he was trying to make the girl happy, that's just gross.<br /><br />There was no real plot. The girl was just a whiny brat who wanted things her own way. She played with Unowns, was the "daughter" of Entei and apparently could grow and shrink in age on a whim with the help of her "dad".<br /><br />That's pretty much all I can remember, but I think you can take it as a hint, and not see it. (Or if you do see it, don't expect much.) 1 out of 10.<br /><br />Seriously. If you want a PokeMon movie, rent "PokeMon; the First Movie".


In [7]:
## BeautifulSoup (from the bs4 library) parses HTML so that tags can be removed
from bs4 import BeautifulSoup

## see https://www.geeksforgeeks.org/python/remove-all-style-scripts-and-html-tags-using-beautifulsoup/
def remove_html_tags(text):
    """
    Input: str : A string to clean from html tags
    Output: str : The same string with html tags removed
    """
    if not isinstance(text, str):
        return text  # safety net in case the value is not a string

    # Parse the HTML text
    soup = BeautifulSoup(text, "html.parser")
    for data in soup(['style', 'script']):
        # Remove style and script elements together with their content
        data.decompose()

    # Join the remaining text fragments with single spaces
    return ' '.join(soup.stripped_strings)
    

In [8]:
imdb_dataset['review']=imdb_dataset['review'].apply(remove_html_tags)
imdb_dataset.head(10)

Unnamed: 0,review,sentiment
0,I saw this movie last night after waiting ages...,positive
1,"This was, so far, the worst movie I have seen ...",negative
2,"In a time of magic, barbarians and demons abou...",positive
3,I had high expectations of this movie (the tit...,negative
4,This is a film of immense appeal to a relative...,negative
5,An hilariously accurate caricature of trying t...,positive
6,I watch most movies that Nick Mancuso is in be...,positive
7,This is one of those films that's more interes...,negative
8,A wonderful and gritty war film that focuses o...,positive
9,"Man, some of you people have got to chill. Thi...",positive


In [9]:
imdb_dataset['review'][10]

'Wow. I don\'t even really remember that much about this movie, except that it stunk. The plot\'s basically; a girl\'s parents neglect her, so this sicko PokeMon pretends to be her dad. Am I the only one disturbed by that? Then, this weirdo PokeMon kidnaps Ash\'s mom to pretend to be the girl\'s. I don\'t care if he was trying to make the girl happy, that\'s just gross. There was no real plot. The girl was just a whiny brat who wanted things her own way. She played with Unowns, was the "daughter" of Entei and apparently could grow and shrink in age on a whim with the help of her "dad". That\'s pretty much all I can remember, but I think you can take it as a hint, and not see it. (Or if you do see it, don\'t expect much.) 1 out of 10. Seriously. If you want a PokeMon movie, rent "PokeMon; the First Movie".'

In [10]:
## Use str.translate() with string.punctuation from the standard library (nothing to install)
# https://www.geeksforgeeks.org/python/python-removing-unwanted-characters-from-string/

import string

def remove_special_characters(text):
    """
    Input: str : A string to clean from non alphanumeric characters
    Output: str : The same string without non alphanumeric characters
    """
    # Delete every character listed in string.punctuation (ASCII punctuation only)
    res = text.translate(str.maketrans('', '', string.punctuation))
    return res


In [11]:
imdb_dataset['review']=imdb_dataset['review'].apply(remove_special_characters)
imdb_dataset.head(10)

Unnamed: 0,review,sentiment
0,I saw this movie last night after waiting ages...,positive
1,This was so far the worst movie I have seen in...,negative
2,In a time of magic barbarians and demons aboun...,positive
3,I had high expectations of this movie the titl...,negative
4,This is a film of immense appeal to a relative...,negative
5,An hilariously accurate caricature of trying t...,positive
6,I watch most movies that Nick Mancuso is in be...,positive
7,This is one of those films thats more interest...,negative
8,A wonderful and gritty war film that focuses o...,positive
9,Man some of you people have got to chill This ...,positive


In [12]:
imdb_dataset['review'][10]

'Wow I dont even really remember that much about this movie except that it stunk The plots basically a girls parents neglect her so this sicko PokeMon pretends to be her dad Am I the only one disturbed by that Then this weirdo PokeMon kidnaps Ashs mom to pretend to be the girls I dont care if he was trying to make the girl happy thats just gross There was no real plot The girl was just a whiny brat who wanted things her own way She played with Unowns was the daughter of Entei and apparently could grow and shrink in age on a whim with the help of her dad Thats pretty much all I can remember but I think you can take it as a hint and not see it Or if you do see it dont expect much 1 out of 10 Seriously If you want a PokeMon movie rent PokeMon the First Movie'
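Note that `string.punctuation` only lists ASCII punctuation, so characters such as curly quotes would survive this step. If a strict "alphanumeric only" cleanup is wanted, a regex-based variant could look like the sketch below. The function name `remove_non_alphanumeric` and the sample string are assumptions made for illustration, and this version also drops accented letters.

```python
import re
import string

def remove_non_alphanumeric(text):
    # Keep only ASCII letters, digits and whitespace
    return re.sub(r"[^A-Za-z0-9\s]", "", text)

sample = "It’s “great”… 10/10!"

# str.translate with string.punctuation leaves the non-ASCII symbols in place
ascii_only_cleanup = sample.translate(str.maketrans("", "", string.punctuation))
print(ascii_only_cleanup)

# The regex variant removes them as well
print(remove_non_alphanumeric(sample))
```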

In [13]:
## lowercase with str.lower() https://www.programiz.com/python-programming/methods/string/lower

def lowercase_text(text):
    """
    Input: str : A string to lowercase
    Output: str : The same string lowercased
    """
    res=text.lower()
    
    return res

In [14]:
imdb_dataset['review']=imdb_dataset['review'].apply(lowercase_text)
imdb_dataset.head(10)

Unnamed: 0,review,sentiment
0,i saw this movie last night after waiting ages...,positive
1,this was so far the worst movie i have seen in...,negative
2,in a time of magic barbarians and demons aboun...,positive
3,i had high expectations of this movie the titl...,negative
4,this is a film of immense appeal to a relative...,negative
5,an hilariously accurate caricature of trying t...,positive
6,i watch most movies that nick mancuso is in be...,positive
7,this is one of those films thats more interest...,negative
8,a wonderful and gritty war film that focuses o...,positive
9,man some of you people have got to chill this ...,positive


In [15]:
# Different tokenization methods: https://www.geeksforgeeks.org/nlp/5-simple-ways-to-tokenize-text-in-python/
# str.split() is very efficient on large datasets, but it only splits on whitespace and does not handle punctuation

def tokenize_words(text):
    """
    Input: str : A string to tokenize
    Output: list of str : A list of the tokens split from the input string
    """
    tokens = text.split()
    return tokens

In [16]:
imdb_dataset['review']=imdb_dataset['review'].apply(tokenize_words)
imdb_dataset.head(10)

Unnamed: 0,review,sentiment
0,"[i, saw, this, movie, last, night, after, wait...",positive
1,"[this, was, so, far, the, worst, movie, i, hav...",negative
2,"[in, a, time, of, magic, barbarians, and, demo...",positive
3,"[i, had, high, expectations, of, this, movie, ...",negative
4,"[this, is, a, film, of, immense, appeal, to, a...",negative
5,"[an, hilariously, accurate, caricature, of, tr...",positive
6,"[i, watch, most, movies, that, nick, mancuso, ...",positive
7,"[this, is, one, of, those, films, thats, more,...",negative
8,"[a, wonderful, and, gritty, war, film, that, f...",positive
9,"[man, some, of, you, people, have, got, to, ch...",positive


In [17]:
# stopwords https://vectorize.io/blog/removing-nltk-stopwords-with-python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
print(stopwords.words('english'))

# Build the stopword set once so it is not recomputed for every review
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(token_list):
    """
    Input: list of str : A list of tokens
    Output: list of str : The new list with removed stopwords tokens
    """
    # Keep only the tokens that are not in the stopword set
    filtered_sentence = [w for w in token_list if w not in STOP_WORDS]
    return filtered_sentence

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she



In [18]:
imdb_dataset['review']=imdb_dataset['review'].apply(remove_stopwords)
imdb_dataset.head(10)

Unnamed: 0,review,sentiment
0,"[saw, movie, last, night, waiting, ages, ages,...",positive
1,"[far, worst, movie, seen, entire, life, seen, ...",negative
2,"[time, magic, barbarians, demons, abound, diab...",positive
3,"[high, expectations, movie, title, translated,...",negative
4,"[film, immense, appeal, relatively, welldefine...",negative
5,"[hilariously, accurate, caricature, trying, se...",positive
6,"[watch, movies, nick, mancuso, frankly, love, ...",positive
7,"[one, films, thats, interesting, watch, academ...",negative
8,"[wonderful, gritty, war, film, focuses, inner,...",positive
9,"[man, people, got, chill, movie, artistic, gen...",positive


In [19]:
## https://www.geeksforgeeks.org/machine-learning/python-stemming-words-with-nltk/ => see Thibault's GitHub for a version without reduce

from nltk.stem import PorterStemmer

def stem_words(token_list):
    """
    Input: list of str : A list of tokens to stem
    Output: list of str : The list of stemmed tokens
    """
    ps = PorterStemmer()
    
    stemmed_list = [ps.stem(word) for word in token_list]
    return stemmed_list
    

In [20]:
imdb_dataset['review']=imdb_dataset['review'].apply(stem_words)
imdb_dataset.head(10)

Unnamed: 0,review,sentiment
0,"[saw, movi, last, night, wait, age, age, relea...",positive
1,"[far, worst, movi, seen, entir, life, seen, re...",negative
2,"[time, magic, barbarian, demon, abound, diabol...",positive
3,"[high, expect, movi, titl, translat, get, rid,...",negative
4,"[film, immens, appeal, rel, welldefin, group, ...",negative
5,"[hilari, accur, caricatur, tri, sell, script, ...",positive
6,"[watch, movi, nick, mancuso, frankli, love, gu...",positive
7,"[one, film, that, interest, watch, academ, per...",negative
8,"[wonder, gritti, war, film, focus, inner, torm...",positive
9,"[man, peopl, got, chill, movi, artist, geniu, ...",positive


Let's put all of this together and apply it to our dataset. The following function simply chains all the preprocessing steps you just implemented.

It also adds a `list_output` flag: if False, the preprocessed tokens are joined back into a single string (with spaces between tokens); if True, each review is kept as a list of tokens. Depending on the libraries you use in the next steps, one or the other representation can be more convenient.

In [21]:
def normalize_text_dataset(dataset, text_col_name = 'review', html_tags = True,
                           special_chars = True, lowercase = True , stemming = True , 
                           stopwords = True, list_output = False ):
    """
    Apply the chosen preprocessing steps to a corpus of texts and return the 
    preprocessed corpus. The list_output flag allows to return either a list
    of tokens, or a rejoined string with spaces between the preprocessed tokens.
    """
    def rejoin_text(token_list):
        return ' '.join(token_list)
    
    
    output = dataset.copy()
    
    if html_tags : 
        output[text_col_name] = output[text_col_name].apply(remove_html_tags)
        
    if special_chars :
        output[text_col_name] = output[text_col_name].apply(remove_special_characters)
        
    if lowercase :
        output[text_col_name] = output[text_col_name].apply(lowercase_text)
    
    #Tokenization for next steps:
    output[text_col_name] = output[text_col_name].apply(tokenize_words)
    
    if stopwords :
        output[text_col_name] = output[text_col_name].apply(remove_stopwords)
        
    if stemming :
        output[text_col_name] = output[text_col_name].apply(stem_words)
        
    if not list_output :
        output[text_col_name] = output[text_col_name].apply(rejoin_text)
        
    return output
        

In [22]:
imdb_clean_dataset = normalize_text_dataset(imdb_dataset_original, html_tags = True,
                           special_chars = True, lowercase = True , stemming = True , 
                           stopwords = True, list_output = False )

In [23]:
imdb_clean_dataset.head(10)

Unnamed: 0,review,sentiment
0,saw movi last night wait age age releas canada...,positive
1,far worst movi seen entir life seen realli bad...,negative
2,time magic barbarian demon abound diabol tyran...,positive
3,high expect movi titl translat get rid other c...,negative
4,film immens appeal rel welldefin group part we...,negative
5,hilari accur caricatur tri sell script documen...,positive
6,watch movi nick mancuso frankli love guy even ...,positive
7,one film that interest watch academ perspect e...,negative
8,wonder gritti war film focus inner torment bli...,positive
9,man peopl got chill movi artist geniu instead ...,positive


## Text classification with Bag-Of-Words

Now that we have cleaned the reviews of our dataset, how do we represent them as vectors in order to classify them?
One classic way to achieve that is the bag-of-words (BOW) approach. To encode a text as a bag of words, we first need to know all the different words $w$ that appear in our reviews, called the vocabulary: $w \in \mathcal{V}$. To each word $w$ we assign an index $idx(w) = i$ with $i \in \{0, \dots, |\mathcal{V}|-1\}$, and we represent a review $r$ as a vector of the size of the vocabulary, $x_r \in \mathbb{R}^{|\mathcal{V}|}$. To encode a review, we simply count how many times each word appears and store that count at the corresponding index of the bag-of-words vector: $x_{r,i} = count(w,r)$, where $i = idx(w)$.

This means that we completely disregard word order and represent each review only by the number of times each word appears in it. There are many variations of this concept; TF-IDF (term frequency-inverse document frequency), for example, gives more weight to uncommon words. Read more about BOW and TF-IDF here:

https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
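To make the encoding concrete, here is a minimal pure-Python sketch of the count vector $x_r$ described above, on two toy reviews. The names `vocabulary`, `idx` and `bag_of_words` are illustrative assumptions; in practice a library implementation would be used instead.

```python
from collections import Counter

reviews = [
    "i loved the movie".split(),
    "the movie was not good".split(),
]

# Vocabulary: every distinct word, mapped to an index idx(w)
vocabulary = sorted({w for r in reviews for w in r})
idx = {w: i for i, w in enumerate(vocabulary)}

def bag_of_words(tokens):
    # x[idx(w)] = count(w, r); words outside the vocabulary are ignored
    x = [0] * len(vocabulary)
    for w, c in Counter(tokens).items():
        if w in idx:
            x[idx[w]] = c
    return x

print(vocabulary)
for r in reviews:
    print(bag_of_words(r))
```

Shuffling the tokens of a review gives exactly the same vector, which is what "disregarding word order" means here.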

Let's start with bag-of-words. In general we don't consider the whole vocabulary but only the most frequent words, in order to reduce the dimensionality and avoid noise from rare words. Here we will only consider the 10,000 most frequent words of the training set, meaning that words appearing only in the test set will be ignored. Thus we have: $x_r \in \mathbb{R}^{10000}$.
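The vocabulary restriction can be sketched in a few lines of plain Python. This is only an illustration on toy data with a vocabulary of size 2 instead of 10000; `build_vocabulary` and `encode` are hypothetical helper names.

```python
from collections import Counter

train = ["good movie good plot".split(), "good bad movie".split()]
test = ["good soundtrack".split()]

def build_vocabulary(token_lists, max_size):
    # Count word occurrences over the training set only
    counts = Counter(w for tokens in token_lists for w in tokens)
    # Keep the max_size most frequent words and give each one an index
    return {w: i for i, (w, _) in enumerate(counts.most_common(max_size))}

def encode(tokens, vocab):
    x = [0] * len(vocab)
    for w in tokens:
        if w in vocab:  # words never kept from the training set are dropped
            x[vocab[w]] += 1
    return x

vocab = build_vocabulary(train, max_size=2)
print(vocab)
print(encode(test[0], vocab))
```

Here "soundtrack" only occurs in the test review, so it does not contribute to the encoded vector.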

Encode all the reviews as bag-of-words, and train and evaluate a logistic regression model on the following train test splits. As we have seen previously, if we wanted to investigate this model we should also grid search for hyperparameters by doing a cross-validation with validation sets, etc. However this is not the goal today, so we'll simply go for a train/test split for this experiment. Concerning the evaluation metrics, in this case we care equally about correctly predicting the positives and the negatives, and we have a balanced dataset, thus we can simply use accuracy this time.
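As a small sanity check on why accuracy is a reasonable metric here, the sketch below computes it by hand on balanced toy labels. The label lists are made up for illustration and `accuracy` is a hypothetical helper, not a library function.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions equal to the true label
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [1, 0, 1, 0, 1, 0]           # balanced toy labels
always_positive = [1] * len(y_true)   # trivial baseline that ignores the input
model_output = [1, 0, 1, 1, 1, 0]     # hypothetical predictions with one mistake

print(accuracy(y_true, always_positive))
print(accuracy(y_true, model_output))
```

On a balanced dataset the trivial baseline cannot exceed 50%, so any accuracy well above that reflects what the model has actually learned.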

Once again, don't do everything from scratch: try to find libraries that provide implementations of these concepts!

Bag-of-words => represents a document as a bag of words, with no order and no context.
We count how many times each word appears.

“I loved the movie” → {“i”:1, “loved”:1, “the”:1, “movie”:1}
“The movie was not good” → {“the”:1, “movie”:1, “was”:1, “not”:1, “good”:1}

No nuance:
“good” = positive
“not good” → BOW does not see that “not” is attached to the word

→ hence its limits.

In [24]:
from sklearn.preprocessing import LabelBinarizer

max_vocab_size = 10000 

#Train/test split:

lb=LabelBinarizer()
sentiment_labels=lb.fit_transform(imdb_clean_dataset['sentiment'])


train_reviews = imdb_clean_dataset.review[:25000]
test_reviews = imdb_clean_dataset.review[25000:]

# Labels for the same train/test split
train_sentiments = sentiment_labels[:25000]
test_sentiments = sentiment_labels[25000:]

In [25]:
print("train_reviews :",train_reviews.head(5))
print("test_reviews:",test_reviews.head(5))

print("train_sentiments:",train_sentiments)
print("test_sentiments:",test_sentiments)


train_reviews : 0    saw movi last night wait age age releas canada...
1    far worst movi seen entir life seen realli bad...
2    time magic barbarian demon abound diabol tyran...
3    high expect movi titl translat get rid other c...
4    film immens appeal rel welldefin group part we...
Name: review, dtype: object
test_reviews: 25000    intent purpos teen devian might seem like anot...
25001    black dragon 1942 william nigh bela lugosi joa...
25002    retic see flick read extern review user commen...
25003    anoth exampl womeninprison genr exactli genr k...
25004    wealthi hors rancher bueno air longstand notra...
Name: review, dtype: object
train_sentiments: [[1]
 [0]
 [1]
 ...
 [1]
 [0]
 [0]]
test_sentiments: [[1]
 [0]
 [1]
 ...
 [1]
 [0]
 [0]]


In [26]:
# Bag-of-words with fit_transform() on the train set and transform() on the test set

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=max_vocab_size)

# fit_transform on the TRAIN set only
X_train = vectorizer.fit_transform(train_reviews)

# then transform on the TEST set
X_test = vectorizer.transform(test_reviews)


In [27]:
#Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

clf = LogisticRegression(max_iter=1000)
# LabelBinarizer returns a column vector of shape (n, 1);
# ravel() flattens it to the 1-D array scikit-learn expects
clf.fit(X_train, train_sentiments.ravel())

y_pred = clf.predict(X_test)
acc = accuracy_score(test_sentiments, y_pred)


In [28]:
print(f"Bag-of-words + Logistic Regression accuracy : {acc*100:.2f}%")

Bag-of-words + Logistic Regression accuracy : 85.03%


You should get about 85% accuracy, which is pretty good for such a simple model. Now let's do the same with a TF-IDF encoding:

## TF-IDF (unigrams)

TF-IDF => improves on bag-of-words by weighting words according to their importance.

A word is important if:
- it is frequent in THIS document (TF)
- it is rare in the whole corpus (IDF)

Intuitive example:
“movie” → very frequent everywhere → low weight
“masterpiece”, “boring”, “terrible acting” → more informative → high weight
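The weighting can be illustrated with a minimal pure-Python sketch using the textbook formula $tf \times \log(N / df)$ on a made-up three-document corpus. This is an assumption made for clarity: scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each vector, so its numbers will differ.

```python
import math
from collections import Counter

docs = [
    "movie boring movie".split(),
    "movie masterpiece".split(),
    "movie fine".split(),
]

n_docs = len(docs)
# Document frequency: number of documents containing each word
df = Counter(w for d in docs for w in set(d))

def tfidf(tokens):
    tf = Counter(tokens)
    # Term count times log(N / df): words present in every document get weight 0
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

weights = tfidf(docs[0])
print(weights)
```

"movie" appears in all three documents, so its weight vanishes even though it is the most frequent word of the first document, while "boring" keeps a positive weight.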

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

max_vocab_size = 10000

# Reuse train_reviews / test_reviews defined above

tfidf_vectorizer = TfidfVectorizer(
    max_features=max_vocab_size  # same vocabulary limit as for CountVectorizer
)

# Fit TF-IDF on the TRAIN set
X_train_tfidf = tfidf_vectorizer.fit_transform(train_reviews)

# Apply TF-IDF to the TEST set (same vocabulary)
X_test_tfidf = tfidf_vectorizer.transform(test_reviews)

# Same model: logistic regression
clf_tfidf = LogisticRegression(max_iter=1000)
# ravel() flattens the (n, 1) label array to the 1-D shape scikit-learn expects
clf_tfidf.fit(X_train_tfidf, train_sentiments.ravel())

# Prediction + accuracy
y_pred_tfidf = clf_tfidf.predict(X_test_tfidf)
acc_tfidf = accuracy_score(test_sentiments, y_pred_tfidf)


In [30]:
print(f"TF-IDF + LogisticRegression accuracy: {acc_tfidf*100:.2f}%")

TF-IDF + LogisticRegression accuracy: 88.04%


And you should get about 88% accuracy this time. Other classic but more sophisticated features include N-grams, part-of-speech tagging and syntax trees. You can read more about these here:

https://www.analyticsvidhya.com/blog/2020/07/part-of-speechpos-tagging-dependency-parsing-and-constituency-parsing-in-nlp/

But we will stop there for the classic approaches and move on to deep learning methods...

...unless you are ahead of schedule, in which case learn about bag of N-grams by yourself and try them out:

## TF-IDF with N-grams (unigrams + bigrams)

What is an N-gram?

Take the sentence:

"this movie is very good"

Unigrams (1-grams):
["this", "movie", "is", "very", "good"]

Bigrams (2-grams):
["this movie", "movie is", "is very", "very good"]

Trigrams (3-grams):
["this movie is", "movie is very", "is very good"]

A bag of N-grams works exactly like bag-of-words, except that the "words" can be sequences of 2, 3, ... words.

Benefits:
- unigrams see individual words
- bigrams capture patterns such as:
    "not good" (negative)
    "very good" (positive)
    "no plot", "poor acting", ...

This is what BOW or unigram-only TF-IDF often miss.
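Extracting N-grams from a token list takes only a sliding window. The following is a minimal sketch in plain Python; `ngrams` and `ngram_features` are hypothetical helper names, and the second one mimics what a range of n values such as (1, 2) produces.

```python
def ngrams(tokens, n):
    # Slide a window of n tokens over the sequence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(tokens, n_min=1, n_max=2):
    # Concatenate the n-grams for every n between n_min and n_max
    feats = []
    for n in range(n_min, n_max + 1):
        feats.extend(ngrams(tokens, n))
    return feats

tokens = "this movie is very good".split()
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
print(ngram_features(tokens))
```

A sentence of 5 tokens yields 5 unigrams and 4 bigrams, which is why the feature space grows quickly and a larger vocabulary limit is usually chosen.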

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

max_vocab_size = 20000  # slightly larger because there are many more possible n-grams

tfidf_vectorizer_ngrams = TfidfVectorizer(
    max_features=max_vocab_size,
    ngram_range=(1, 2)  # unigrams and bigrams
)

# Learn the vocabulary + TF-IDF weights on the TRAIN set
X_train_ngrams = tfidf_vectorizer_ngrams.fit_transform(train_reviews)

# Apply TF-IDF to the TEST set with the same vocabulary
X_test_ngrams = tfidf_vectorizer_ngrams.transform(test_reviews)

# Model: logistic regression as before
clf_ngrams = LogisticRegression(max_iter=1000)
# ravel() flattens the (n, 1) label array to the 1-D shape scikit-learn expects
clf_ngrams.fit(X_train_ngrams, train_sentiments.ravel())

# Evaluation
y_pred_ngrams = clf_ngrams.predict(X_test_ngrams)
acc_ngrams = accuracy_score(test_sentiments, y_pred_ngrams)


In [32]:
print(f"TF-IDF (1-2 grams) + Logistic Regression accuracy: {acc_ngrams*100:.2f}%")

TF-IDF (1-2 grams) + Logistic Regression accuracy: 88.77%


In [33]:
import numpy as np

feature_names = np.array(tfidf_vectorizer_ngrams.get_feature_names_out())
coef = clf_ngrams.coef_[0]

# Top 20 most "positive" n-grams (largest coefficients, in ascending order)
top_pos_idx = np.argsort(coef)[-20:]
print("Top positive n-grams:")
print(feature_names[top_pos_idx])

# Top 20 most "negative" n-grams (smallest coefficients, most negative first)
top_neg_idx = np.argsort(coef)[:20]
print("\nTop negative n-grams:")
print(feature_names[top_neg_idx])


Top positive n-grams:
['still' 'highli' 'superb' 'fantast' 'one best' 'brilliant' 'today'
 'definit' '710' 'fun' 'well' 'amaz' 'beauti' 'favorit' 'best' 'love'
 'enjoy' 'perfect' 'excel' 'great']

Top negative n-grams:
['worst' 'bad' 'wast' 'aw' 'bore' 'poor' 'disappoint' 'noth' 'terribl'
 'wors' 'fail' 'dull' 'horribl' 'poorli' 'unfortun' 'stupid' 'lack'
 'suppos' 'ridicul' 'save']


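## Save the data

Preprocessing for the deep learning methods: only HTML tag removal and lowercasing are applied; punctuation, stopwords and full word forms are kept.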

In [34]:
imdb_deep_clean_dataset = normalize_text_dataset(imdb_dataset_original, html_tags = True,
                           special_chars = False, lowercase = True, stemming = False, 
                           stopwords = False, list_output = False )

In [35]:
imdb_deep_clean_dataset.head(10)

Unnamed: 0,review,sentiment
0,i saw this movie last night after waiting ages...,positive
1,"this was, so far, the worst movie i have seen ...",negative
2,"in a time of magic, barbarians and demons abou...",positive
3,i had high expectations of this movie (the tit...,negative
4,this is a film of immense appeal to a relative...,negative
5,an hilariously accurate caricature of trying t...,positive
6,i watch most movies that nick mancuso is in be...,positive
7,this is one of those films that's more interes...,negative
8,a wonderful and gritty war film that focuses o...,positive
9,"man, some of you people have got to chill. thi...",positive


In [36]:
train_deep_clean = imdb_deep_clean_dataset.iloc[:20000]
valid_deep_clean = imdb_deep_clean_dataset.iloc[20000:25000]
test_deep_clean = imdb_deep_clean_dataset.iloc[25000:]

In [37]:
outdir = '../data/imdb_clean/'
# Create the output directory (and any missing parents) if it does not exist yet
os.makedirs(outdir, exist_ok=True)

In [38]:
train_deep_clean.to_csv(outdir + 'train.csv', index = False)
valid_deep_clean.to_csv(outdir + 'valid.csv', index = False)
test_deep_clean.to_csv(outdir + 'test.csv', index = False)