# $Overview$

The goal of this notebook is to improve on the performance obtained previously. To do so, we test four vectorization techniques: _Unigram Counts_, _Unigram Tf-Idf_, _Bigram Counts_ and _Bigram Tf-Idf_.

To choose the best technique, for each strategy we split the dataset into training and validation sets, train an SGDClassifier and compute its score. We then tune the resulting model by searching for the best combination of the following parameters: (i) _loss_, _learning rate_ and _initial learning rate_, and (ii) _penalty_ and _alpha_.

In the end, we reach an accuracy of around 90%. This is much better than the previous models and fairly convincing for a simple linear model. More advanced methods nevertheless give better results: the current state of the art on datasets of this kind is 97.42% [1, 2].

N.B. Like the previous ones, this notebook requires running the *text_Normalisation* notebook first, which now ends by saving the preprocessed reviews to a CSV file (encoded_reviews.csv).

# $Initialisation$

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
# --

import re
from os import system, listdir
from os.path import isfile, join
from random import shuffle


%matplotlib inline
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier


from joblib import dump, load # used for saving and loading sklearn objects
from scipy.sparse import save_npz, load_npz # used for saving and loading sparse matrices

from scipy.sparse import csr_matrix
from scipy.stats import uniform

### $Reloading$ $the$ $data$ $and$ $preview$

In [2]:
df = pd.read_csv('encoded_reviews.csv')
print('Dataset Shape :', df.shape)
df['sentiment'] = df['Rating'] > 2.5
df.sentiment = df.sentiment.map({True:1, False:0})
print(df.sentiment.value_counts(normalize=True), '\n')
# Let us add a new column
#df['text'] = df.Review_Text
txt = df.text[0]
print(txt)

Dataset Shape : (13630, 5)
1    0.862656
0    0.137344
Name: sentiment, dtype: float64 

hongkong tokyo far best look forward biggest orlando enough recommend stay resort enjoy fast track save huge amount stay strategize get fast track pass kiosk nearby ride projection fireworks show
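
The binary label built above can also be derived in a single step. The following is a minimal sketch, assuming a toy table (`toy_df` is hypothetical) with the same `Rating` column and the same 2.5 threshold; it is not the notebook's own data.

```python
import pandas as pd

# Toy stand-in for the reviews table (hypothetical data, same column name as above)
toy_df = pd.DataFrame({'Rating': [1, 2, 3, 4, 5]})

# Ratings strictly above 2.5 are labelled positive (1), the rest negative (0)
toy_df['sentiment'] = (toy_df['Rating'] > 2.5).astype(int)

# Share of each class, as printed for the real dataset above
class_shares = toy_df['sentiment'].value_counts(normalize=True)
print(class_shares)
```

Casting the boolean mask with `astype(int)` gives the same 0/1 encoding as the `map` call, without an intermediate boolean column.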


### $Train$ / $test$ $split$

In [3]:
# Features are the preprocessed review texts, labels the binary sentiment
X = df.text
y = df.sentiment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape

(10904,)
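
The split above is random and not stratified. With classes as imbalanced as the ones printed earlier (about 86% positive), one option is to pass `stratify` so that both subsets keep the same class shares. This is a sketch on toy labels (`X_toy` and `y_toy` are hypothetical), offered as an alternative rather than as what this notebook does; applying it here would change the scores reported further down.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with roughly the imbalance reported above: 86% positive, 14% negative
y_toy = np.array([1] * 860 + [0] * 140)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both subsets keep the original share of positive labels
print(y_tr.mean(), y_te.mean())
```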

In [4]:
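from os import makedirs

# Create the output directories; exist_ok=True avoids an error when re-running the cell
makedirs('data_preprocessors', exist_ok=True)
makedirs('vectorized_data', exist_ok=True)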

### $Text$ $vectorization$

In what follows, we test vectorization with four techniques:

- Unigram Counts
- Unigram Tf-Idf
- Bigram Counts
- Bigram Tf-Idf
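
To make the four variants concrete, here is a small sketch on a toy corpus (`toy_corpus` and the `*_demo` names are hypothetical). It shows which features unigram and bigram counting produce, and that `TfidfTransformer` with its default settings rescales each document vector to unit L2 norm.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Three tiny "reviews" (hypothetical data)
toy_corpus = ['fast track pass', 'fast ride', 'huge ride']

# ngram_range=(1, 1): single words only; (1, 2): single words plus adjacent word pairs
unigram_demo = CountVectorizer(ngram_range=(1, 1)).fit(toy_corpus)
bigram_demo = CountVectorizer(ngram_range=(1, 2)).fit(toy_corpus)

unigram_features = sorted(unigram_demo.vocabulary_)
bigram_features = sorted(bigram_demo.vocabulary_)
print(unigram_features)
print(bigram_features)

# Tf-Idf reweights the raw counts; the default norm='l2' normalizes each row
counts = bigram_demo.transform(toy_corpus)
tf_idf = TfidfTransformer().fit_transform(counts)
row_norms = np.sqrt(np.asarray(tf_idf.multiply(tf_idf).sum(axis=1)).ravel())
print(row_norms)
```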

In [5]:
# 1) Unigram Counts
unigram_vectorizer = CountVectorizer(ngram_range=(1, 1))
unigram_vectorizer.fit(X_train.values) 
dump(unigram_vectorizer, 'data_preprocessors/unigram_vectorizer.joblib')
# unigram_vectorizer = load('data_preprocessors/unigram_vectorizer.joblib')

X_train_unigram = unigram_vectorizer.transform(X_train.values)
save_npz('vectorized_data/X_train_unigram.npz', X_train_unigram)
# X_train_unigram = load_npz('vectorized_data/X_train_unigram.npz')

In [6]:
# 2) Unigram Tf-Idf
unigram_tf_idf_transformer = TfidfTransformer()
unigram_tf_idf_transformer.fit(X_train_unigram)

dump(unigram_tf_idf_transformer, 'data_preprocessors/unigram_tf_idf_transformer.joblib')
X_train_unigram_tf_idf = unigram_tf_idf_transformer.transform(X_train_unigram)
save_npz('vectorized_data/X_train_unigram_tf_idf.npz', X_train_unigram_tf_idf)
# X_train_unigram_tf_idf = load_npz('vectorized_data/X_train_unigram_tf_idf.npz')

In [7]:
# 3) Bigram Counts (ngram_range=(1, 2) keeps the unigrams and adds the bigrams)
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_vectorizer.fit(X_train.values)

dump(bigram_vectorizer, 'data_preprocessors/bigram_vectorizer.joblib')
X_train_bigram = bigram_vectorizer.transform(X_train.values)

save_npz('vectorized_data/X_train_bigram.npz', X_train_bigram)
# X_train_bigram = load_npz('vectorized_data/X_train_bigram.npz')

In [8]:
# 4) Bigram Tf-Idf
bigram_tf_idf_transformer = TfidfTransformer()
bigram_tf_idf_transformer.fit(X_train_bigram)
dump(bigram_tf_idf_transformer, 'data_preprocessors/bigram_tf_idf_transformer.joblib')

# bigram_tf_idf_transformer = load('data_preprocessors/bigram_tf_idf_transformer.joblib')
X_train_bigram_tf_idf = bigram_tf_idf_transformer.transform(X_train_bigram)
save_npz('vectorized_data/X_train_bigram_tf_idf.npz', X_train_bigram_tf_idf)
# X_train_bigram_tf_idf = load_npz('vectorized_data/X_train_bigram_tf_idf.npz')

### $Choosing$ $the$ $best$ $technique$

To choose the best technique, for each strategy we split the dataset into training and validation sets, train an SGDClassifier and compute its score.

In [9]:
def train_and_show_scores(X: csr_matrix, y: np.array, title: str) -> None:
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, train_size=0.75, stratify=y
    )

    clf = SGDClassifier()
    clf.fit(X_train, y_train)
    train_score = clf.score(X_train, y_train)
    valid_score = clf.score(X_valid, y_valid)
    print(f'{title}\nTrain score: {round(train_score, 2)};\
            Validation score: {round(valid_score, 2)}\n')

#y_train = df['sentiment'].values # y_train is already calculated
start = time.time()
train_and_show_scores(X_train_unigram, y_train, 'Unigram Counts')
train_and_show_scores(X_train_unigram_tf_idf, y_train, 'Unigram Tf-Idf')
train_and_show_scores(X_train_bigram, y_train, 'Bigram Counts')
train_and_show_scores(X_train_bigram_tf_idf, y_train, 'Bigram Tf-Idf')
end = time.time()
print ('Duration : ', round(end-start, 2)) # 1.14

Unigram Counts
Train score: 0.99;            Validation score: 0.89

Unigram Tf-Idf
Train score: 0.97;            Validation score: 0.9

Bigram Counts
Train score: 1.0;            Validation score: 0.89

Bigram Tf-Idf
Train score: 1.0;            Validation score: 0.9

Duration :  0.84


Across runs, the best result almost always comes from bigrams with Tf-Idf (validation accuracy: 0.9), although in the run shown here unigram Tf-Idf reaches the same rounded score.

We use it in what follows for hyperparameter tuning.
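
Since the comparison relies on a single unseeded split, the scores move slightly from run to run. One way to get a steadier and repeatable comparison is seeded, stratified cross-validation. The sketch below assumes a hypothetical helper `cv_accuracy` and random stand-in data (`X_demo`, `y_demo`); it is not the procedure used to produce the scores above.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def cv_accuracy(X, y, seed=0):
    """Mean and standard deviation of accuracy over 5 stratified folds."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    clf = SGDClassifier(random_state=seed)
    scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')
    return scores.mean(), scores.std()


# Random sparse matrix and labels standing in for a vectorized corpus (assumption)
rng = np.random.default_rng(0)
X_demo = sparse_random(200, 50, density=0.1, format='csr', random_state=0)
y_demo = rng.integers(0, 2, size=200)

# With fixed seeds, two calls give identical results
mean_a, std_a = cv_accuracy(X_demo, y_demo)
mean_b, std_b = cv_accuracy(X_demo, y_demo)
print(mean_a, std_a)
```

In the notebook, the same helper could be called once per vectorized matrix, and the standard deviation helps judge whether a 0.01 difference between techniques is meaningful.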

### $Hyperparameter$ $tuning$

$Phase$ $1:$  We search for the best combination of loss, learning rate and initial learning rate (eta0).

In [10]:
#X_train = X_train_bigram 
X_train = X_train_bigram_tf_idf
#X_train = X_train_unigram_tf_idf


# Phase 1: loss, learning rate and initial learning rate
clf = SGDClassifier()

distributions = dict(
    # 'log_loss' is the name used by scikit-learn >= 1.1; older versions call it 'log'
    loss=['hinge', 'log_loss', 'modified_huber', 'squared_hinge', 'perceptron'],
    learning_rate=['optimal', 'invscaling', 'adaptive'],
    eta0=uniform(loc=1e-7, scale=1e-2)
)

random_search_cv = RandomizedSearchCV(
    estimator=clf,
    param_distributions=distributions,
    cv=5,
    n_iter=50
)

start = time.time()
random_search_cv.fit(X_train, y_train)
print(f'Best params: {random_search_cv.best_params_}')
print(f'Best score: {random_search_cv.best_score_}') #0.90

end = time.time()
print ('Duration : ', round(end-start, 2), 's') # 305.89s 69s  165s 141.22s

Best params: {'eta0': 0.007047183870982173, 'learning_rate': 'optimal', 'loss': 'squared_hinge'}
Best score: 0.9034302924758864
Duration :  141.22 s


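$Phase$ $2:$  We search for the best combination of penalty and alpha. The search starts again from a default SGDClassifier(), so the phase 1 parameters are not carried over into this phase.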

In [11]:
# Phase 2: penalty and alpha
clf = SGDClassifier()

distributions = dict(
    penalty=['l1', 'l2', 'elasticnet'],
    alpha=uniform(loc=1e-6, scale=1e-4)
)

random_search_cv = RandomizedSearchCV(
    estimator=clf,
    param_distributions=distributions,
    cv=5,
    n_iter=50
)
start = time.time()
random_search_cv.fit(X_train, y_train)
print(f'Best params: {random_search_cv.best_params_}') # {'alpha': 2.6334034050575763e-05, 'penalty': 'l2'}
print(f'Best score: {random_search_cv.best_score_}') # 0.91 0.90 0.9
end = time.time()
print ('Duration : ', round(end-start, 2), 's') # 221 38.47s 97.18s 83.64s

Best params: {'alpha': 1.1817931458286057e-05, 'penalty': 'elasticnet'}
Best score: 0.9074653492001395
Duration :  83.64 s
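
If the two phases are meant to build on each other, the phase 1 result can be passed to the estimator used in phase 2. The sketch below illustrates the idea on synthetic data. Everything in it is an assumption made for illustration: `X_demo`, `y_demo` and `phase1_params` are hypothetical (in the notebook, `phase1_params` would be the phase 1 `best_params_`), and sampling alpha with `scipy.stats.loguniform` is a swapped-in choice that covers several orders of magnitude more evenly than a uniform distribution.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the vectorized training data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# Hypothetical phase 1 result, standing in for the best_params_ of the first search
phase1_params = {'loss': 'modified_huber', 'learning_rate': 'optimal'}

search = RandomizedSearchCV(
    # The phase 1 choices are fixed on the estimator; only penalty and alpha vary
    estimator=SGDClassifier(random_state=0, **phase1_params),
    param_distributions={
        'penalty': ['l1', 'l2', 'elasticnet'],
        'alpha': loguniform(1e-6, 1e-3),
    },
    n_iter=10,
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

best = search.best_estimator_
print(search.best_params_, best.loss)
```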


### $Saving$ $the$ $best$ $classifier$

In [12]:
from os import makedirs
makedirs('classifiers', exist_ok=True)  # no error if the directory already exists
sgd_classifier = random_search_cv.best_estimator_
dump(random_search_cv.best_estimator_, 'classifiers/sgd_classifier.joblib')
# sgd_classifier = load('classifiers/sgd_classifier.joblib')

['classifiers/sgd_classifier.joblib']

In [13]:
sgd_classifier = load('classifiers/sgd_classifier.joblib')

X_test = bigram_vectorizer.transform(X_test.values)
X_test = bigram_tf_idf_transformer.transform(X_test)

score = sgd_classifier.score(X_test, y_test)
print(round(score, 2)) #0.9

0.9
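
The test set has to go through the same fitted vectorizer and Tf-Idf transformer as the training set, which is why three objects are saved and reloaded. A scikit-learn `Pipeline` can bundle the three steps into one object that accepts raw text. This is a sketch on toy data, not the notebook's approach; the texts, labels, step names and file name are all hypothetical.

```python
import tempfile
from os.path import join

from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical)
train_texts = ['enjoy fast track', 'best resort stay', 'huge queue waste', 'bad ride waste']
train_labels = [1, 1, 0, 0]

# Vectorizer, Tf-Idf transformer and classifier chained in a single estimator
text_clf = Pipeline([
    ('counts', CountVectorizer(ngram_range=(1, 2))),
    ('tf_idf', TfidfTransformer()),
    ('sgd', SGDClassifier(random_state=0)),
])
text_clf.fit(train_texts, train_labels)

# One file now holds the whole preprocessing and classification chain
with tempfile.TemporaryDirectory() as tmp_dir:
    path = join(tmp_dir, 'text_clf.joblib')
    dump(text_clf, path)
    reloaded = load(path)

# Raw text goes in; the pipeline applies the fitted transforms before predicting
new_texts = ['enjoy best stay', 'waste bad queue']
predictions = reloaded.predict(new_texts)
print(predictions)
```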


### $Testing$ $our$ $model$ $again$

In [14]:
#help(classification_report)
sgd_classifier = load('classifiers/sgd_classifier.joblib')
# Refit the loaded classifier on the full training set before evaluating
sgd_classifier.fit(X_train, y_train)

y_pred = sgd_classifier.predict(X_test)
print('Classification Report : \n\n', classification_report(y_test, y_pred))

# Compute and display the confusion matrix (rows: actual class, columns: predicted class).
# A distinct name avoids shadowing sklearn's confusion_matrix function imported above.
confusion_df = pd.crosstab(y_test, y_pred, rownames=['Classe réelle'], colnames=['Classe prédite'])
confusion_df

Classification Report : 

               precision    recall  f1-score   support

           0       0.68      0.62      0.65       381
           1       0.94      0.95      0.95      2345

    accuracy                           0.91      2726
   macro avg       0.81      0.79      0.80      2726
weighted avg       0.90      0.91      0.90      2726



| Classe réelle (actual) / Classe prédite (predicted) | 0 | 1 |
| --- | --- | --- |
| 0 | 236 | 145 |
| 1 | 111 | 2234 |
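
The figures in the classification report can be recomputed from the four counts of the confusion matrix. The short sketch below does this with NumPy, copying the counts from the output above, and adds the accuracy of a baseline that always predicts the majority class (the variable names are illustrative).

```python
import numpy as np

# Rows: actual class (0, 1); columns: predicted class (0, 1), copied from the output above
cm = np.array([[236, 145],
               [111, 2234]])

accuracy = np.trace(cm) / cm.sum()            # correct predictions / all predictions
recall_neg = cm[0, 0] / cm[0].sum()           # negatives found among actual negatives
precision_neg = cm[0, 0] / cm[:, 0].sum()     # actual negatives among predicted negatives
majority_baseline = cm[1].sum() / cm.sum()    # accuracy of always predicting class 1

print(round(accuracy, 2), round(recall_neg, 2),
      round(precision_neg, 2), round(majority_baseline, 2))
# prints: 0.91 0.62 0.68 0.86
```

The gap between the model's accuracy and the majority baseline is about five points, so the per-class recall is a more informative figure here than accuracy alone.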


In the end, we reach an accuracy of around 90%. This is much better than the previous models and fairly convincing for a simple linear model. The figure should be read against the class imbalance, however: about 86% of the reviews are positive, and recall on the negative class is only 0.62.

More advanced methods nevertheless give better results. The current state of the art on datasets of this kind is 97.42% [1, 2].

# References

- **[1]  :** https://towardsdatascience.com/building-a-sentiment-classifier-using-scikit-learn-54c8e7c5d2f0
- **[2]  :** https://www.aclweb.org/anthology/P19-2057.pdf