# Overview

In this notebook, we test several ML algorithms: _Gradient Boosting_, _Logistic Regression_, _Multinomial NB_ and _Random Forest_. We evaluate them and show that, in the best case, test accuracy stays around 0.90, while recall on the negative class (0) does not exceed 0.56.

Further techniques (notably for vectorization) are used in the next notebook (predictions_02) and significantly improve the results obtained here.


N.B. This notebook requires running the *text_Normalisation* notebook first, which ends by saving the preprocessed reviews to a CSV file (encoded_reviews.csv).

# Initialization

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

## $Data$ $extraction$ $and$ $overview$

In [2]:
df = pd.read_csv('encoded_reviews.csv')
print('Dataset Shape :', df.shape)
print('Dataset overview :\n')
df.head()

Dataset Shape : (13630, 5)
Dataset overview :



Unnamed: 0,Rating,Year_Month,Reviewer_Location,sentiment,text
0,5,2019-3,United Arab Emirates,1,hongkong tokyo far best look forward biggest o...
1,4,2018-6,United Kingdom,1,go april easter weekend say june choose date l...
2,5,2019-4,United Kingdom,1,fantastic queue decent best apparently manage ...
3,4,2019-4,Australia,1,realise school holiday go consequently extreme...
4,5,missing,France,1,make warm fuzzy actual big make fun fill happy...


### $Train$ / $test$ $split$

In [3]:
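# Class balance (not displayed: a notebook cell only renders its last expression)
df.sentiment.value_counts(normalize=True)
X = df.text
y = df.sentiment
# Hold out 20% of the reviews for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape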

(10904,)
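
The classes are unbalanced (the test set reported further down contains 381 negative and 2345 positive reviews). The split above is purely random; as a minimal sketch, and only as a possible variant that this notebook does not use, the `stratify` argument of `train_test_split` keeps the class proportions identical in both subsets. The data below is synthetic and hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, hypothetical labels: 14% negative (0) and 86% positive (1)
y_demo = np.array([0] * 140 + [1] * 860)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo preserves the class ratio in the train and test subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positive share (train):", y_tr.mean())
print("Positive share (test): ", y_te.mean())
```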

### $Preprocessing$ $(Text$ $vectorization)$

In [4]:
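vectorizer = CountVectorizer()
# Bag-of-words counts; toarray() returns a dense ndarray
# (todense() returns an np.matrix, which recent scikit-learn versions reject)
X_train = vectorizer.fit_transform(X_train).toarray()
X_test = vectorizer.transform(X_test).toarray()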
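
The cell above converts the bag-of-words counts to a dense array. As a hedged illustration on a hypothetical toy corpus (not the review data), the sketch below shows what `CountVectorizer` produces and that a scikit-learn estimator such as `MultinomialNB` can also be fitted on the sparse matrix directly, which stores only the non-zero counts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, for illustration only
docs = [
    "great ride great fun",
    "long queue bad food",
    "fun park great show",
    "bad queue long wait",
]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X_sparse = vec.fit_transform(docs)  # scipy sparse matrix: only non-zero counts are stored
X_dense = X_sparse.toarray()        # dense ndarray: every zero is stored explicitly

print("Shape (documents, vocabulary):", X_sparse.shape)
print("Stored non-zero values:", X_sparse.nnz, "out of", X_dense.size, "cells")

# The same model fitted on the sparse and on the dense representation
model_sparse = MultinomialNB().fit(X_sparse, labels)
model_dense = MultinomialNB().fit(X_dense, labels)
same = bool((model_sparse.predict(X_sparse) == model_dense.predict(X_dense)).all())
print("Same predictions:", same)
```

On a real vocabulary with thousands of columns, most cells are zero, so the sparse form is usually much lighter in memory.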

In what follows, we test the following ML algorithms:

- Gradient Boosting
- Logistic Regression
- Multinomial NB
- Random Forest

### $Gradient$ $Boosting$

In [5]:
start = time.time()
clf = GradientBoostingClassifier(n_estimators = 100, learning_rate = 1.0, max_depth = 1, random_state = 0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
end = time.time()
print('Calculation done in ', round(end - start, 2), 's') # ==> 109s 1190s 1112.37s 1209.06s 1114.24s

Calculation done in  1114.24 s


##### $ Evaluation $

In [6]:
from sklearn.metrics import classification_report
#help(classification_report)
print('Classification Report : \n\n', classification_report(y_test, y_pred))

# Compute and display the confusion matrix
conf_matrix = pd.crosstab(y_test, y_pred, rownames=['Classe réelle'], colnames=['Classe prédite'])
conf_matrix

Classification Report : 

               precision    recall  f1-score   support

           0       0.59      0.40      0.48       381
           1       0.91      0.95      0.93      2345

    accuracy                           0.88      2726
   macro avg       0.75      0.68      0.71      2726
weighted avg       0.86      0.88      0.87      2726



Classe prédite,0,1
Classe réelle,Unnamed: 1_level_1,Unnamed: 2_level_1
0,154,227
1,106,2239
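
To make the link between the confusion matrix and the classification report explicit, the short sketch below recomputes the class 0 metrics from the four counts shown above (rows are actual classes, columns are predicted classes). The variable names are chosen for illustration.

```python
# Counts copied from the Gradient Boosting confusion matrix above
actual0_pred0, actual0_pred1 = 154, 227   # actual class 0
actual1_pred0, actual1_pred1 = 106, 2239  # actual class 1

total = actual0_pred0 + actual0_pred1 + actual1_pred0 + actual1_pred1

# Accuracy: share of reviews on the diagonal
accuracy = (actual0_pred0 + actual1_pred1) / total

# Precision for class 0: among reviews predicted 0, share that is really 0
precision_0 = actual0_pred0 / (actual0_pred0 + actual1_pred0)

# Recall for class 0: among reviews that are really 0, share predicted 0
recall_0 = actual0_pred0 / (actual0_pred0 + actual0_pred1)

# F1 for class 0: harmonic mean of precision and recall
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"accuracy={accuracy:.2f} precision_0={precision_0:.2f} "
      f"recall_0={recall_0:.2f} f1_0={f1_0:.2f}")
```

Rounded to two decimals, these values match the first row and the accuracy line of the report.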


### $Trying$ $other$ $methods$

In [7]:
%matplotlib inline
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

### $Logistic$ $Regression$

In [8]:
start = time.time()
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print("Training set score: {:.3f}".format(logreg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg.score(X_test, y_test)))
pred_logreg = logreg.predict(X_test)
end = time.time()
print('Calculation done in ', round(end - start, 2), 's') # ==> 395s/383.6/168s

#help(classification_report)
print('Classification Report : \n\n', classification_report(y_test, pred_logreg))

# Compute and display the confusion matrix
# (named conf_matrix so it does not shadow sklearn.metrics.confusion_matrix)
conf_matrix = pd.crosstab(y_test, pred_logreg, rownames=['Classe réelle'], colnames=['Classe prédite'])
conf_matrix

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Training set score: 0.990
Test set score: 0.893
Calculation done in  168.79 s
Classification Report : 

               precision    recall  f1-score   support

           0       0.64      0.53      0.58       381
           1       0.93      0.95      0.94      2345

    accuracy                           0.89      2726
   macro avg       0.78      0.74      0.76      2726
weighted avg       0.89      0.89      0.89      2726



Classe prédite,0,1
Classe réelle,Unnamed: 1_level_1,Unnamed: 2_level_1
0,201,180
1,113,2232
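
The warning printed above means that the solver stopped at its iteration limit before converging. As the message suggests, one option is to raise `max_iter`. The sketch below illustrates this on synthetic data generated with `make_classification`; the helper `fit_and_report` and the chosen limits are hypothetical and are not the settings used in this notebook.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=500, n_features=50, random_state=0)


def fit_and_report(max_iter):
    """Fit a logistic regression and report (iterations used, limit reached?)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = LogisticRegression(max_iter=max_iter).fit(X_demo, y_demo)
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return int(model.n_iter_[0]), hit_limit


low = fit_and_report(max_iter=2)      # deliberately too small
high = fit_and_report(max_iter=1000)  # generous limit

print("max_iter=2    -> iterations, limit reached:", low)
print("max_iter=1000 -> iterations, limit reached:", high)
```

Whether a higher limit changes the test scores on the review data would have to be checked by rerunning the cell.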


### $Multinomial$ $NB$

In [9]:
start = time.time()
nb = MultinomialNB()
nb.fit(X_train, y_train)
print("Training set score: {:.3f}".format(nb.score(X_train, y_train)))
print("Test set score: {:.3f}".format(nb.score(X_test, y_test)))
pred_nb = nb.predict(X_test)
end = time.time()
print('Calculation done in ', round(end - start, 2), 's') # ==> 125.26s, 85.01s, 145.5s

print('Classification Report : \n\n', classification_report(y_test, pred_nb))
# Compute and display the confusion matrix
conf_matrix = pd.crosstab(y_test, pred_nb, rownames=['Classe réelle'], colnames=['Classe prédite'])
conf_matrix

Training set score: 0.934
Test set score: 0.898
Calculation done in  145.5 s
Classification Report : 

               precision    recall  f1-score   support

           0       0.66      0.56      0.61       381
           1       0.93      0.95      0.94      2345

    accuracy                           0.90      2726
   macro avg       0.79      0.76      0.77      2726
weighted avg       0.89      0.90      0.89      2726



Classe prédite,0,1
Classe réelle,Unnamed: 1_level_1,Unnamed: 2_level_1
0,213,168
1,110,2235


### $Simple$ $Random$ $Forest$

In [10]:
start = time.time()
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
print("Training set score: {:.3f}".format(rf.score(X_train, y_train)))
print("Test set score: {:.3f}".format(rf.score(X_test, y_test)))
pred_rf = rf.predict(X_test)
end = time.time()
print('Calculation done in ', round(end - start, 2), 's') # ==> 233.01s, 180.49s

print('Classification Report : \n\n', classification_report(y_test, pred_rf))
# Compute and display the confusion matrix
conf_matrix = pd.crosstab(y_test, pred_rf, rownames=['Classe réelle'], colnames=['Classe prédite'])
conf_matrix

Training set score: 1.000
Test set score: 0.862
Calculation done in  253.66 s
Classification Report : 

               precision    recall  f1-score   support

           0       0.80      0.02      0.04       381
           1       0.86      1.00      0.93      2345

    accuracy                           0.86      2726
   macro avg       0.83      0.51      0.48      2726
weighted avg       0.85      0.86      0.80      2726



Classe prédite,0,1
Classe réelle,Unnamed: 1_level_1,Unnamed: 2_level_1
0,8,373
1,2,2343
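
Because 2345 of the 2726 test reviews are positive, accuracy alone can be misleading. The sketch below, built only from the four confusion matrices printed in this notebook, compares each model with a hypothetical baseline that always predicts class 1 and recomputes recall on class 0.

```python
# Confusion matrices copied from the outputs above:
# [[actual 0 predicted 0, actual 0 predicted 1],
#  [actual 1 predicted 0, actual 1 predicted 1]]
matrices = {
    "Gradient Boosting": [[154, 227], [106, 2239]],
    "Logistic Regression": [[201, 180], [113, 2232]],
    "Multinomial NB": [[213, 168], [110, 2235]],
    "Random Forest": [[8, 373], [2, 2343]],
}

n_negative, n_positive = 381, 2345

# Accuracy of a baseline that always predicts the majority class (1)
baseline_accuracy = n_positive / (n_negative + n_positive)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")

summary = {}
for name, ((a0p0, a0p1), (a1p0, a1p1)) in matrices.items():
    accuracy = (a0p0 + a1p1) / (a0p0 + a0p1 + a1p0 + a1p1)
    recall_0 = a0p0 / (a0p0 + a0p1)
    summary[name] = (accuracy, recall_0)
    print(f"{name:20s} accuracy={accuracy:.3f} recall_0={recall_0:.2f}")
```

Read this way, the Random Forest barely improves on the baseline, since it assigns almost every review to class 1.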


# Discussion

Despite the text vectorization and the various cleaning operations, the results still leave much to be desired: test accuracy ranges from 0.86 to 0.90, and recall on the negative class (0) never exceeds 0.56. In the next notebook, we try other approaches (_in addition to "Bag of Words"_), together with a few tricks that save a lot of computation time and improve performance.