# RITAL - Bag of Words Project

Team members:
- **Ben KABONGO**, 21116436

- **Sofia BORCHANI**, 21212080

# Part II: Sentiment classification data (movies)

**Research questions**

- **BoW variants**
    - TF-IDF
    - Reducing the vocabulary size (min_df, max_df, max_features)
    - Binary BoW
    - Bigrams, trigrams
    - **What performance can we expect? What are the advantages and drawbacks of these variants?**
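
To make the variants listed above concrete, here is a minimal pure-Python sketch of count, binary, and n-gram bag-of-words features on a single toy sentence. The helpers `ngrams` and `bow` are hypothetical names introduced for illustration only; they use plain whitespace tokenization and are not how scikit-learn implements `CountVectorizer`.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=1):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def bow(doc, binary=False, ngram_range=(1, 1)):
    """Map a document to {term: weight}, using raw counts or 0/1 presence."""
    counts = Counter(ngrams(doc.lower().split(), *ngram_range))
    return {term: 1 for term in counts} if binary else dict(counts)


doc = "bad bad movie not good"  # toy review, not taken from the dataset

count_features = bow(doc)                        # repeated words weigh more
binary_features = bow(doc, binary=True)          # only presence is recorded
bigram_features = bow(doc, ngram_range=(1, 2))   # adds pairs such as "not good"

print(count_features)
print(binary_features)
print(sorted(bigram_features))
```

The bigram `not good` shows why n-grams can help sentiment analysis: the unigrams `not` and `good` taken separately lose the negation. The cost is a much larger vocabulary, which is where `max_features`, `min_df`, and `max_df` come in.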

In [1]:
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import sklearn
import warnings

from imblearn.under_sampling import RandomUnderSampler

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


from sklearn import (linear_model, 
                     ensemble,
                     tree,
                     decomposition, 
                     naive_bayes, 
                     neural_network,
                     svm,
                     metrics,
                     preprocessing, 
                     model_selection, 
                     pipeline,)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from xgboost import XGBClassifier

import utils

In [2]:
plt.style.use('seaborn-whitegrid')
warnings.simplefilter("ignore")

### Loading the data

In [3]:
fname = "./datasets/movies/movies1000/"
all_movies_df = utils.load_movies(fname)
all_movies_df

Unnamed: 0,text,label
0,bad . bad . \nbad . \nthat one word seems to p...,0
1,isn't it the ultimate sign of a movie's cinema...,0
2,""" gordy "" is not a movie , it is a 90-minute-...",0
3,disconnect the phone line . \ndon't accept the...,0
4,when robert forster found himself famous again...,0
...,...,...
1995,one of the funniest carry on movies and the th...,1
1996,"i remember making a pact , right after `patch ...",1
1997,barely scrapping by playing at a nyc piano bar...,1
1998,if the current trends of hollywood filmmaking ...,1


## Machine learning methods and BoW variants

In [4]:
y = all_movies_df.label

In [5]:
def mvectorizer(vectorizer):
    """Fit a logistic regression on features from `vectorizer` and print test metrics."""
    # Hold out 20% of the reviews; the fixed seed keeps the split identical across runs.
    X_text_train, X_text_test, y_train, y_test = model_selection.train_test_split(
        all_movies_df['text'], all_movies_df['label'], test_size=0.2, random_state=42)

    # Fit the vocabulary on the training texts only, then reuse it on the test texts.
    X_train = vectorizer.fit_transform(X_text_train)
    X_test = vectorizer.transform(X_text_test)

    clf = linear_model.LogisticRegression(solver='lbfgs')
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)

    # The ROC AUC uses the predicted probability of the positive class (label 1).
    y_prob = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_prob)
    auc = metrics.auc(fpr, tpr)

    f1 = metrics.f1_score(y_test, y_pred)
    acc = metrics.accuracy_score(y_test, y_pred)
    report = metrics.classification_report(y_test, y_pred)
    print(report)

    print()
    print('Accuracy :\t', acc)
    print('F1 score :\t', f1)
    print('AUC :\t\t', auc)

### CountVectorizer

In [6]:
def cmvectorizer(**count_vectorizer_args):
    vectorizer = CountVectorizer(**count_vectorizer_args)
    return mvectorizer(vectorizer)

### TfidfVectorizer

In [7]:
def tmvectorizer(**tfidf_vectorizer_args):
    vectorizer = TfidfVectorizer(**tfidf_vectorizer_args)
    return mvectorizer(vectorizer)
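
As a reminder of what the TF-IDF weighting does, the sketch below recomputes it by hand on a toy corpus. It assumes the formula used by scikit-learn's default settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row). The function `tfidf` and the three toy documents are illustrative assumptions, and the whitespace tokenization is simpler than the library's.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (rows, idf): one {term: weight} dict per document, plus the idf table."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed idf, assumed to match scikit-learn's default (smooth_idf=True).
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = {term: count * idf[term] for term, count in tf.items()}
        # L2-normalize the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows, idf


docs = ["good movie", "bad movie", "good good plot"]  # toy corpus
rows, idf = tfidf(docs)

for doc, row in zip(docs, rows):
    print(doc, {term: round(w, 3) for term, w in row.items()})
```

In the second document, `bad` receives a larger weight than `movie` because it occurs in fewer documents. With `binary=True`, `TfidfVectorizer` replaces the term counts by 0/1 before applying the idf weights, so only the idf part and the normalization remain.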

### Variants and evaluations

#### Non-binary

In [8]:
cmvectorizer()

              precision    recall  f1-score   support

           0       0.84      0.80      0.82       199
           1       0.81      0.85      0.83       201

    accuracy                           0.82       400
   macro avg       0.83      0.82      0.82       400
weighted avg       0.83      0.82      0.82       400


Accuracy :	 0.825
F1 score :	 0.8292682926829269
AUC :		 0.9063476586914673


In [9]:
tmvectorizer()

              precision    recall  f1-score   support

           0       0.85      0.77      0.81       199
           1       0.79      0.86      0.82       201

    accuracy                           0.81       400
   macro avg       0.82      0.81      0.81       400
weighted avg       0.82      0.81      0.81       400


Accuracy :	 0.815
F1 score :	 0.8238095238095237
AUC :		 0.8891222280557014


#### Binary

In [10]:
cmvectorizer(binary=True)

              precision    recall  f1-score   support

           0       0.85      0.85      0.85       199
           1       0.85      0.85      0.85       201

    accuracy                           0.85       400
   macro avg       0.85      0.85      0.85       400
weighted avg       0.85      0.85      0.85       400


Accuracy :	 0.85
F1 score :	 0.8507462686567164
AUC :		 0.9306482662066551


In [11]:
tmvectorizer(binary=True)

              precision    recall  f1-score   support

           0       0.88      0.84      0.86       199
           1       0.85      0.89      0.87       201

    accuracy                           0.86       400
   macro avg       0.87      0.86      0.86       400
weighted avg       0.87      0.86      0.86       400


Accuracy :	 0.865
F1 score :	 0.8682926829268293
AUC :		 0.934598364959124


#### Stop words: English

In [12]:
cmvectorizer(stop_words='english')

              precision    recall  f1-score   support

           0       0.87      0.85      0.86       199
           1       0.86      0.87      0.86       201

    accuracy                           0.86       400
   macro avg       0.86      0.86      0.86       400
weighted avg       0.86      0.86      0.86       400


Accuracy :	 0.8625
F1 score :	 0.8641975308641976
AUC :		 0.9328733218330458


In [13]:
tmvectorizer(stop_words='english')

              precision    recall  f1-score   support

           0       0.86      0.80      0.83       199
           1       0.82      0.87      0.84       201

    accuracy                           0.83       400
   macro avg       0.84      0.83      0.83       400
weighted avg       0.84      0.83      0.83       400


Accuracy :	 0.835
F1 score :	 0.8405797101449274
AUC :		 0.9061726543163579


#### Stop words + binary

In [14]:
cmvectorizer(stop_words='english', binary=True)

              precision    recall  f1-score   support

           0       0.85      0.87      0.86       199
           1       0.87      0.85      0.86       201

    accuracy                           0.86       400
   macro avg       0.86      0.86      0.86       400
weighted avg       0.86      0.86      0.86       400


Accuracy :	 0.86
F1 score :	 0.8585858585858587
AUC :		 0.9364484112102803


In [15]:
tmvectorizer(stop_words='english', binary=True)

              precision    recall  f1-score   support

           0       0.88      0.84      0.86       199
           1       0.85      0.89      0.87       201

    accuracy                           0.86       400
   macro avg       0.87      0.86      0.86       400
weighted avg       0.87      0.86      0.86       400


Accuracy :	 0.865
F1 score :	 0.8682926829268293
AUC :		 0.9313732843321083


#### Lowercase

In [16]:
cmvectorizer(lowercase=True)

              precision    recall  f1-score   support

           0       0.84      0.80      0.82       199
           1       0.81      0.85      0.83       201

    accuracy                           0.82       400
   macro avg       0.83      0.82      0.82       400
weighted avg       0.83      0.82      0.82       400


Accuracy :	 0.825
F1 score :	 0.8292682926829269
AUC :		 0.9063476586914673


In [17]:
tmvectorizer(lowercase=True)

              precision    recall  f1-score   support

           0       0.85      0.77      0.81       199
           1       0.79      0.86      0.82       201

    accuracy                           0.81       400
   macro avg       0.82      0.81      0.81       400
weighted avg       0.82      0.81      0.81       400


Accuracy :	 0.815
F1 score :	 0.8238095238095237
AUC :		 0.8891222280557014


In [18]:
cmvectorizer(lowercase=True, binary=True)

              precision    recall  f1-score   support

           0       0.85      0.85      0.85       199
           1       0.85      0.85      0.85       201

    accuracy                           0.85       400
   macro avg       0.85      0.85      0.85       400
weighted avg       0.85      0.85      0.85       400


Accuracy :	 0.85
F1 score :	 0.8507462686567164
AUC :		 0.9306482662066551


In [19]:
tmvectorizer(lowercase=True, binary=True)

              precision    recall  f1-score   support

           0       0.88      0.84      0.86       199
           1       0.85      0.89      0.87       201

    accuracy                           0.86       400
   macro avg       0.87      0.86      0.86       400
weighted avg       0.87      0.86      0.86       400


Accuracy :	 0.865
F1 score :	 0.8682926829268293
AUC :		 0.934598364959124


#### Vocabulary reduction
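
The sketch below illustrates how `min_df`, `max_df`, and `max_features` shrink the vocabulary. It assumes the scikit-learn conventions: an integer threshold is an absolute document count, a float is a proportion of documents, and `max_features` keeps the terms with the highest total frequency. The function `prune_vocabulary` and the toy corpus are hypothetical, and ties under `max_features` are broken alphabetically here, which may differ from the library.

```python
from collections import Counter


def prune_vocabulary(docs, min_df=1, max_df=1.0, max_features=None):
    """Return the sorted vocabulary kept after document-frequency filtering."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))  # document frequency
    tf = Counter(term for toks in tokenized for term in toks)       # corpus frequency
    # Integers are absolute document counts, floats are proportions of n.
    low = min_df if isinstance(min_df, int) else min_df * n
    high = max_df if isinstance(max_df, int) else max_df * n
    kept = [term for term in df if low <= df[term] <= high]
    if max_features is not None:
        # Keep the most frequent terms; ties are broken alphabetically in this sketch.
        kept = sorted(kept, key=lambda term: (-tf[term], term))[:max_features]
    return sorted(kept)


docs = [  # toy corpus
    "the movie was bad",
    "the movie was good",
    "the plot was thin",
    "the acting was superb",
]

print(prune_vocabulary(docs, max_df=0.75))            # drops "the" and "was"
print(prune_vocabulary(docs, max_df=0.75, min_df=2))  # also drops words seen once
print(prune_vocabulary(docs, max_features=2))         # keeps the two most frequent
```

`max_df` removes near-ubiquitous words that carry little class information, while `min_df` removes rare words that a classifier can easily overfit.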

In [20]:
cmvectorizer(max_df=.75)

              precision    recall  f1-score   support

           0       0.85      0.83      0.84       199
           1       0.84      0.86      0.85       201

    accuracy                           0.84       400
   macro avg       0.85      0.84      0.84       400
weighted avg       0.85      0.84      0.84       400


Accuracy :	 0.845
F1 score :	 0.8472906403940886
AUC :		 0.9197229930748269


In [21]:
tmvectorizer(max_df=.75)

              precision    recall  f1-score   support

           0       0.86      0.80      0.83       199
           1       0.82      0.88      0.85       201

    accuracy                           0.84       400
   macro avg       0.84      0.84      0.84       400
weighted avg       0.84      0.84      0.84       400


Accuracy :	 0.84
F1 score :	 0.8461538461538461
AUC :		 0.9213980349508738


In [22]:
cmvectorizer(max_df=.75, binary=True)

              precision    recall  f1-score   support

           0       0.84      0.85      0.85       199
           1       0.85      0.85      0.85       201

    accuracy                           0.85       400
   macro avg       0.85      0.85      0.85       400
weighted avg       0.85      0.85      0.85       400


Accuracy :	 0.8475
F1 score :	 0.8478802992518704
AUC :		 0.9321483037075926


In [23]:
cmvectorizer(max_df=.75, min_df=len(all_movies_df)//1000)

              precision    recall  f1-score   support

           0       0.85      0.83      0.84       199
           1       0.84      0.86      0.85       201

    accuracy                           0.84       400
   macro avg       0.85      0.84      0.84       400
weighted avg       0.85      0.84      0.84       400


Accuracy :	 0.845
F1 score :	 0.8472906403940886
AUC :		 0.9186729668241705


In [24]:
tmvectorizer(max_df=.75, min_df=len(all_movies_df)//1000, binary=True)

              precision    recall  f1-score   support

           0       0.88      0.85      0.87       199
           1       0.86      0.89      0.87       201

    accuracy                           0.87       400
   macro avg       0.87      0.87      0.87       400
weighted avg       0.87      0.87      0.87       400


Accuracy :	 0.87
F1 score :	 0.8731707317073171
AUC :		 0.9364234105852647


In [25]:
cmvectorizer(lowercase=True, binary=True, ngram_range=(1, 2), max_features=1000)

              precision    recall  f1-score   support

           0       0.74      0.70      0.72       199
           1       0.72      0.76      0.74       201

    accuracy                           0.73       400
   macro avg       0.73      0.73      0.73       400
weighted avg       0.73      0.73      0.73       400


Accuracy :	 0.7275
F1 score :	 0.7360774818401937
AUC :		 0.8304707617690442


In [26]:
tmvectorizer(lowercase=True, binary=True, ngram_range=(1, 2), max_features=1000)

              precision    recall  f1-score   support

           0       0.78      0.74      0.76       199
           1       0.76      0.80      0.78       201

    accuracy                           0.77       400
   macro avg       0.77      0.77      0.77       400
weighted avg       0.77      0.77      0.77       400


Accuracy :	 0.77
F1 score :	 0.7766990291262135
AUC :		 0.8685967149178729


In [27]:
cmvectorizer(lowercase=True, binary=True, ngram_range=(1, 2), max_features=10_000)

              precision    recall  f1-score   support

           0       0.84      0.84      0.84       199
           1       0.84      0.85      0.84       201

    accuracy                           0.84       400
   macro avg       0.84      0.84      0.84       400
weighted avg       0.84      0.84      0.84       400


Accuracy :	 0.8425
F1 score :	 0.8436724565756825
AUC :		 0.9305232630815771


In [28]:
cmvectorizer(lowercase=True, binary=False, ngram_range=(1, 2), max_features=10_000)

              precision    recall  f1-score   support

           0       0.83      0.82      0.83       199
           1       0.82      0.84      0.83       201

    accuracy                           0.83       400
   macro avg       0.83      0.83      0.83       400
weighted avg       0.83      0.83      0.83       400


Accuracy :	 0.8275
F1 score :	 0.8296296296296296
AUC :		 0.9113977849446236


In [29]:
cmvectorizer(lowercase=True, binary=True, ngram_range=(1, 2), max_features=20_000)

              precision    recall  f1-score   support

           0       0.86      0.86      0.86       199
           1       0.86      0.86      0.86       201

    accuracy                           0.86       400
   macro avg       0.86      0.86      0.86       400
weighted avg       0.86      0.86      0.86       400


Accuracy :	 0.86
F1 score :	 0.86
AUC :		 0.9365484137103427


In [30]:
cmvectorizer(lowercase=True, binary=True, ngram_range=(1, 2), max_features=40_000)

              precision    recall  f1-score   support

           0       0.86      0.87      0.86       199
           1       0.87      0.86      0.86       201

    accuracy                           0.86       400
   macro avg       0.87      0.87      0.86       400
weighted avg       0.87      0.86      0.86       400


Accuracy :	 0.865
F1 score :	 0.865
AUC :		 0.940198504962624


In [31]:
cmvectorizer(lowercase=True, binary=True, ngram_range=(1, 3), max_features=40_000)

              precision    recall  f1-score   support

           0       0.88      0.88      0.88       199
           1       0.88      0.88      0.88       201

    accuracy                           0.88       400
   macro avg       0.88      0.88      0.88       400
weighted avg       0.88      0.88      0.88       400


Accuracy :	 0.88
F1 score :	 0.8805970149253731
AUC :		 0.9393984849621241


#### Removals: punctuation, digits, capitals, etc.

In [32]:
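# preprocessing function: removal of tags, uppercase words, digits, punctuation
# note: a custom `preprocessor` replaces the vectorizer's built-in preprocessing step,
# so `lowercase=True` is not applied when a preprocessor is passed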
f = lambda doc: utils.delete_balise( utils.replace_maj_word( utils.delete_digit( utils.delete_punctuation(doc) ) ) )

In [33]:
cmvectorizer(preprocessor=utils.delete_punctuation)

              precision    recall  f1-score   support

           0       0.84      0.81      0.82       199
           1       0.82      0.85      0.83       201

    accuracy                           0.83       400
   macro avg       0.83      0.83      0.83       400
weighted avg       0.83      0.83      0.83       400


Accuracy :	 0.8275
F1 score :	 0.8312958435207823
AUC :		 0.9086477161929049


In [34]:
cmvectorizer(preprocessor=utils.delete_punctuation, lowercase=True, binary=True, ngram_range=(1, 3), 
             max_features=40_000, tokenizer=word_tokenize)

              precision    recall  f1-score   support

           0       0.88      0.87      0.88       199
           1       0.87      0.89      0.88       201

    accuracy                           0.88       400
   macro avg       0.88      0.88      0.88       400
weighted avg       0.88      0.88      0.88       400


Accuracy :	 0.8775
F1 score :	 0.8790123456790123
AUC :		 0.9432485812145303


In [35]:
cmvectorizer(preprocessor=f, lowercase=True, binary=True, ngram_range=(1, 2), 
             max_features=40_000, tokenizer=word_tokenize)

              precision    recall  f1-score   support

           0       0.85      0.86      0.86       199
           1       0.86      0.86      0.86       201

    accuracy                           0.86       400
   macro avg       0.86      0.86      0.86       400
weighted avg       0.86      0.86      0.86       400


Accuracy :	 0.8575
F1 score :	 0.85785536159601
AUC :		 0.9434235855896397


In [36]:
cmvectorizer(preprocessor=f, lowercase=True, binary=True, ngram_range=(1, 2), 
             max_features=40_000)

              precision    recall  f1-score   support

           0       0.86      0.87      0.86       199
           1       0.87      0.86      0.86       201

    accuracy                           0.86       400
   macro avg       0.87      0.87      0.86       400
weighted avg       0.87      0.86      0.86       400


Accuracy :	 0.865
F1 score :	 0.865
AUC :		 0.9413735343383585


In [37]:
tmvectorizer(preprocessor=f, lowercase=True, binary=True, ngram_range=(1, 2), 
             max_features=40_000)

              precision    recall  f1-score   support

           0       0.88      0.86      0.87       199
           1       0.86      0.88      0.87       201

    accuracy                           0.87       400
   macro avg       0.87      0.87      0.87       400
weighted avg       0.87      0.87      0.87       400


Accuracy :	 0.87
F1 score :	 0.8719211822660099
AUC :		 0.9374234355858897


In [38]:
cmvectorizer(preprocessor=utils.replace_maj_word, binary=True, ngram_range=(1, 2), 
             max_features=40_000, tokenizer=word_tokenize)

              precision    recall  f1-score   support

           0       0.86      0.85      0.86       199
           1       0.86      0.87      0.86       201

    accuracy                           0.86       400
   macro avg       0.86      0.86      0.86       400
weighted avg       0.86      0.86      0.86       400


Accuracy :	 0.86
F1 score :	 0.8613861386138614
AUC :		 0.9454486362159054
