<a href="https://colab.research.google.com/github/RMoulla/IAO_Juin/blob/main/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment analysis with machine learning
In this lab, we tackle a text classification problem with machine learning techniques, specifically a sentiment analysis task.

In [3]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import re
import io

## Loading the data

The data consist of game reviews in English, with some noise from other languages. Each review is either positive or negative, depending on whether the reviewer recommends the game.
Students are encouraged to read a few reviews to get a feel for the data.

In [8]:
reviews = pd.read_csv('Reviews.csv')
reviews.head()

Unnamed: 0,product_id,page,page_order,recommended,date,text,hours,compensation,user_id,username,products,found_funny,early_access
0,1006510,1,0,True,25 February,Chinese people didn't like it 'cuz this game p...,6.3,,7.65612e+16,Schmitt,106.0,,False
1,1006510,1,1,False,25 February,I don't recommend this game. I don't care abou...,0.9,,7.65612e+16,cherryliji,1.0,2.0,False
2,1006510,1,2,True,25 February,Deep describing of native Taiwan culture of 19...,1.1,,7.65612e+16,acmonkey233,1.0,,False
3,1006510,1,3,True,25 February,Well at the risk of this review getting buried...,2.9,,7.65612e+16,Khorneflakes!,247.0,,False
4,1006510,1,4,True,25 February,It's not a political satire nor a boring propa...,3.4,,,asadelight,846.0,,False


Convert the labels to integers.

In [9]:
reviews['recommended'] = reviews['recommended'].astype(np.int64)
reviews.head()

Unnamed: 0,product_id,page,page_order,recommended,date,text,hours,compensation,user_id,username,products,found_funny,early_access
0,1006510,1,0,1,25 February,Chinese people didn't like it 'cuz this game p...,6.3,,7.65612e+16,Schmitt,106.0,,False
1,1006510,1,1,0,25 February,I don't recommend this game. I don't care abou...,0.9,,7.65612e+16,cherryliji,1.0,2.0,False
2,1006510,1,2,1,25 February,Deep describing of native Taiwan culture of 19...,1.1,,7.65612e+16,acmonkey233,1.0,,False
3,1006510,1,3,1,25 February,Well at the risk of this review getting buried...,2.9,,7.65612e+16,Khorneflakes!,247.0,,False
4,1006510,1,4,1,25 February,It's not a political satire nor a boring propa...,3.4,,,asadelight,846.0,,False


## Data preprocessing


In [10]:
# Lowercase the text and strip special characters to clean the reviews and reduce noise.
def preprocess(text):
    text = text.lower()
    # Replace HTML-like tags with a space.
    text = re.sub(r'</?.*?>', ' ', text)
    # Collapse runs of digits and non-word characters into a single space.
    text = re.sub(r'(\d|\W)+', ' ', text)
    return text

reviews['text'] = reviews['text'].apply(preprocess)
reviews.head(100)

Unnamed: 0,product_id,page,page_order,recommended,date,text,hours,compensation,user_id,username,products,found_funny,early_access
0,1006510,1,0,1,25 February,chinese people didn t like it cuz this game pr...,6.3,,7.656120e+16,Schmitt,106.0,,False
1,1006510,1,1,0,25 February,i don t recommend this game i don t care about...,0.9,,7.656120e+16,cherryliji,1.0,2.0,False
2,1006510,1,2,1,25 February,deep describing of native taiwan culture of s ...,1.1,,7.656120e+16,acmonkey233,1.0,,False
3,1006510,1,3,1,25 February,well at the risk of this review getting buried...,2.9,,7.656120e+16,Khorneflakes!,247.0,,False
4,1006510,1,4,1,25 February,it s not a political satire nor a boring propa...,3.4,,,asadelight,846.0,,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,1006510,10,5,1,2019-02-24,flawless family,2.8,,7.656120e+16,BeanKing,54.0,,False
96,1006510,10,6,1,2019-02-24,amazing art and atmosphere in the game really ...,3.6,,7.656120e+16,r03941007,2.0,,False
97,1006510,10,7,1,2019-02-24,great plot great vibe great soundtrack definit...,2.9,,,何老師,97.0,,False
98,1006510,10,8,1,2019-02-24,it s a decent game the developers deserve so m...,4.1,,,Delta_Frost,102.0,,False
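
To see what this cleaning step does to a single review, here is a minimal standalone sketch. The sample sentence is made up for illustration, and the first pattern assumes the intent is to strip HTML-like tags.

```python
import re

def preprocess(text):
    """Lowercase, drop HTML-like tags, collapse digits/punctuation into single spaces."""
    text = text.lower()
    # Assumption: the first substitution is meant to remove HTML-like tags.
    text = re.sub(r'</?.*?>', ' ', text)
    # Runs of digits or non-word characters become one space.
    text = re.sub(r'(\d|\W)+', ' ', text)
    return text

# Hypothetical review, not taken from the dataset.
sample = "I don't recommend this game.<br>10/10 would NOT play again!"
cleaned = preprocess(sample)
print(repr(cleaned))
```

Note that the apostrophe is treated as a separator, so "don't" becomes the two tokens "don" and "t", which matches the "didn t" visible in the table above.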


## Vectorization

Vectorization turns each review into a vector with one entry per word of the vocabulary: the number of times that word occurs in the review, or 0 if it does not occur.

In [11]:
vectorizer = CountVectorizer()
vectorizer.fit(reviews['text'])
X = vectorizer.transform(reviews['text'])
y = reviews['recommended']

In [14]:
print(vectorizer.get_feature_names_out())

['aa' 'aaa' 'aaaaaaaahahahahaha' ... '香港可以說特首' '驚人' '點醒我們']


In [15]:
# Print the full count vector of the first review, without truncation.
with np.printoptions(threshold=np.inf):
    print(X.toarray()[0])

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 

## Splitting the dataset

As in machine learning projects on tabular data, we split the dataset into a training set and a test set.

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75)
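
The split above is random, so the accuracies reported later will vary from run to run. A possible variant, sketched here on synthetic data, fixes `random_state` for reproducibility and uses `stratify` to keep the class proportions identical in both subsets. The arrays `X_demo` and `y_demo` and the seed value are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the review matrix: 40 samples, 70% positive labels.
X_demo = np.arange(80).reshape(40, 2)
y_demo = np.array([1] * 28 + [0] * 12)

# random_state makes the split repeatable; stratify preserves the label ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=0, stratify=y_demo)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```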

## Logistic regression model

Here we implement a very simple machine learning model, a logistic regression, to establish a baseline. Its regularization hyperparameter `C` (in scikit-learn, the inverse of the regularization strength) is tuned to get the best performance.

In [17]:
for c in [0.01, 0.05, 0.25, 0.5, 1]:

    lr = LogisticRegression(C=c)
    lr.fit(X_train, y_train)
    print('Accuracy for C=%s: %s'
         % (c, accuracy_score(y_test, lr.predict(X_test))))

Accuracy for C=0.01: 0.7080745341614907
Accuracy for C=0.05: 0.7080745341614907
Accuracy for C=0.25: 0.7701863354037267
Accuracy for C=0.5: 0.782608695652174
Accuracy for C=1: 0.7888198757763976
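
The loop above compares values of `C` directly on the test set, which tends to make the final test accuracy slightly optimistic. A common alternative, sketched below on synthetic data, is to select `C` by cross-validation on the training set only and to use the test set once at the end. The synthetic dataset, the `max_iter` value, and the variable names are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the vectorized reviews.
X_syn, y_syn = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, train_size=0.75, random_state=0)

# Mean 5-fold cross-validated accuracy for each candidate C, on training data only.
cv_scores = {}
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    model = LogisticRegression(C=c, max_iter=1000)
    cv_scores[c] = cross_val_score(
        model, X_tr, y_tr, cv=5, scoring='accuracy').mean()
    print('Mean CV accuracy for C=%s: %.3f' % (c, cv_scores[c]))

# Refit with the best C and evaluate once on the held-out test set.
best_c = max(cv_scores, key=cv_scores.get)
final = LogisticRegression(C=best_c, max_iter=1000).fit(X_tr, y_tr)
test_accuracy = final.score(X_te, y_te)
print('Best C:', best_c, 'test accuracy:', test_accuracy)
```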


The best model is obtained for C = ?

In [19]:
final_model = LogisticRegression(C = 1)
final_model.fit(X_train, y_train)
print('Final Model Accuracy: %s' %accuracy_score(y_test, final_model.predict(X_test)))

Final Model Accuracy: 0.7888198757763976


## Analyzing the model's features

In [21]:
feature_to_coef = {
    word: coef for word, coef in zip(
    vectorizer.get_feature_names_out(), final_model.coef_[0])
}

In [22]:
print('Positive Words')
for best_positive in sorted(
    feature_to_coef.items(),
    key=lambda x: x[1],
    reverse=True)[:10]:
    print(best_positive)

Positive Words
('great', 1.173556531392562)
('love', 0.7180638583452227)
('excellent', 0.6991444305516265)
('very', 0.6802052037919195)
('horror', 0.6781179497702858)
('short', 0.6481755470025868)
('awesome', 0.6467057288815571)
('negative', 0.6227602566506757)
('review', 0.5995740329253191)
('ve', 0.5748883181783551)


In [23]:
print('Negative Words')
for best_positive in sorted(
    feature_to_coef.items(),
    key=lambda x: x[1],
    reverse = False)[:10]:
    print(best_positive)

Negative Words
('political', -1.8944814177832634)
('bad', -0.9629809698990577)
('independence', -0.8441067193019685)
('national', -0.8396263529232995)
('disgusting', -0.7706430850771008)
('company', -0.7386042788857038)
('was', -0.7221003439603104)
('politics', -0.7200692235459802)
('nm', -0.6988136343990444)
('lol', -0.6628731689215502)
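
To relate these coefficients to a prediction: in a binary logistic regression, the decision score of a review is the sum of each word count times its coefficient, plus the intercept, and the sigmoid of that score gives the predicted probability of the positive class. The sketch below checks this on a made-up toy corpus; the texts, labels, and variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data for illustration only.
toy_texts = ["great game", "love it great", "bad game", "bad boring game"]
toy_labels = [1, 1, 0, 0]

vec = CountVectorizer()
X_toy = vec.fit_transform(toy_texts)
clf = LogisticRegression(C=1).fit(X_toy, toy_labels)

# Vectorize a new review and rebuild its score by hand.
new_review = vec.transform(["great great game"])
counts = new_review.toarray()[0]
manual_score = float(np.dot(counts, clf.coef_[0]) + clf.intercept_[0])
manual_proba = 1.0 / (1.0 + np.exp(-manual_score))

# Compare with what scikit-learn computes.
sk_score = float(clf.decision_function(new_review)[0])
sk_proba = float(clf.predict_proba(new_review)[0, 1])
print(manual_score, sk_score)
print(manual_proba, sk_proba)
```

This is why a word with a large positive coefficient pushes a review toward the "recommended" class each time it occurs, and a word with a large negative coefficient pushes it the other way.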


## Analyzing misclassified reviews


In [24]:
y_pred = final_model.predict(X_test)
# For each misclassified test review, print its text, the predicted and real
# labels, and the coefficient of every token known to the model.
for idx, pred, real in zip(y_test.index, y_pred, y_test):
    if pred != real:
        text = reviews.loc[idx, 'text']
        print(text)
        print('Predicted:{} '.format(pred), 'Real:{}'.format(real))
        for token in text.split():
            if token in feature_to_coef:
                print(token, feature_to_coef[token])

彳亍口吧
Predicted:1  Real:0
彳亍口吧 0.0
some players i highly doubt if they did finish it or not left their reviews comments to criticise this game because there was a hidden easter egg of xi jinping meme they said a game should be just a game and was not supposed to be political well politics can be everywhere i m always surprised how successful the ccp educates people to defend for them on the internet as some players will probably say it s the so called loyalty i suggest them get over the great fire wall and make some google searches about the person political party which they are loyal to compared to the results they might get the easter egg was way too gentle 
Predicted:0  Real:1
some 0.25394350866616155
players -0.3794013859166926
highly 0.2538752707727691
doubt 0.02719166382702697
if -0.09654623947934125
they -0.07895993542025741
did 0.060103183680625456
finish 0.2918398271783084
it 0.19939768215498835
or 0.3630806149036161
not -0.6399049788507097
left 0.04674315690589045
their -0.120