This notebook explores several machine learning models for sentiment analysis. The text is preprocessed here with a bag of words (BOW).

https://ieeexplore.ieee.org/abstract/document/9142175

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import tweepy as tw
import pandas as pd
import numpy as np
import random
from functions import data_utilities as data_u
from functions import dict_utilities as dic
from functions import nlp_utilities as nlp

The dataset contains finance-oriented texts, each labeled as positive, negative, or neutral.

In [2]:
dataset = pd.read_csv('dataset/dataset_sentiment.csv',sep=',',encoding='latin-1',header=None,names=['sentiment','text'])
dataset

Unnamed: 0,sentiment,text
0,neutral,"According to Gran , the company has no plans t..."
1,neutral,Technopolis plans to develop in stages an area...
2,negative,The international electronic industry company ...
3,positive,With the new production plant the company woul...
4,positive,According to the company 's updated strategy f...
...,...,...
4841,negative,LONDON MarketWatch -- Share prices ended lower...
4842,neutral,Rinkuskiai 's beer sales fell by 6.5 per cent ...
4843,negative,Operating profit fell to EUR 35.4 mn from EUR ...
4844,negative,Net sales of the Paper segment decreased to EU...


In [3]:
## Rerun for another example
ind = random.randint(0,dataset.shape[0]-1)
print(dataset['text'].loc[ind],dataset['sentiment'].loc[ind])

The diesel margin has remained high . positive


In [4]:
from sklearn.preprocessing import OrdinalEncoder

We split the dataset into X and y, then encode neutral as 0, positive as 1, and negative as -1.

In [5]:
enc = OrdinalEncoder()
X = dataset[['text']]
y = dataset[['sentiment']]
y = enc.fit_transform(y)
y = y-1 # neutral = 0, positive = 1, negative = -1
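
The `y-1` shift works because `OrdinalEncoder` assigns codes in sorted category order (negative, neutral, positive). A minimal sketch on toy labels (the `labels` array and the `mapping` name are illustrative, not part of the notebook) makes the mapping explicit:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy labels standing in for dataset[['sentiment']] (illustrative only)
labels = np.array([["neutral"], ["positive"], ["negative"], ["neutral"]])

enc = OrdinalEncoder()
codes = enc.fit_transform(labels) - 1  # shift 0/1/2 to -1/0/1

# categories_ holds the classes in sorted order, so the shifted codes are:
mapping = {str(label): code for code, label in enumerate(enc.categories_[0], start=-1)}
print(mapping)
```

Printing `mapping` is a cheap way to confirm the encoding before training, since a different label spelling in the CSV would silently change the order.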

In [6]:
unique, counts = np.unique(y, return_counts=True)
print(unique,counts)

[-1.  0.  1.] [ 604 2879 1363]


The classes are imbalanced. We therefore keep 600 positive, 600 negative, and 600 neutral texts so that the model is trained on a balanced dataset. Note that the code below takes the first 600 texts of each class in dataset order; the result is shuffled afterwards.

In [7]:
X_train = []
y_train = []
# Take the first 600 texts of each class (-1, 0, 1), in dataset order
for label in (-1, 0, 1):
    k = 0
    i = 0
    while k < 600:
        if y[i] == label:
            X_train.append(X['text'][i])
            y_train.append(y[i])
            k += 1
        i += 1

In [8]:
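# Shuffle texts and labels together so that each pair stays aligned
# (random is already imported in the first cell)
c = list(zip(X_train, y_train))

random.shuffle(c)

X_train, y_train = zip(*c)

X_train = list(X_train)
y_train = list(y_train)

# Flatten the labels, which were collected as one-element arrays
y_train=list(np.array(y_train).reshape(-1))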
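
As an alternative to the manual loops, a random balanced subset can be drawn directly with pandas. This is only a sketch under assumptions: the `balanced_sample` helper, the toy frame, and the fixed seed are hypothetical and not part of the notebook, and `groupby(...).sample` requires pandas 1.1 or later.

```python
import pandas as pd

def balanced_sample(df, label_col, n_per_class, seed=0):
    """Draw n_per_class random rows from each class, then shuffle the result."""
    sampled = df.groupby(label_col).sample(n=n_per_class, random_state=seed)
    return sampled.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy imbalanced frame standing in for the real dataset (illustrative only)
toy = pd.DataFrame({
    "sentiment": ["neutral"] * 5 + ["positive"] * 3 + ["negative"] * 2,
    "text": [f"text {i}" for i in range(10)],
})

balanced = balanced_sample(toy, "sentiment", n_per_class=2)
print(balanced["sentiment"].value_counts())
```

Unlike taking the first rows of each class, this draws at random, so the subset does not depend on how the CSV happens to be ordered.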

We build the bag of words for the training data.

In [9]:
# Read the English stop words, one per line; the with block closes the file
with open("dict_mot/stop_words_english.txt","r") as f:
    stop_words_en=f.read().splitlines()

In [10]:
X_train = {'Text':X_train}
data = pd.DataFrame(X_train)

In [11]:
data

Unnamed: 0,Text
0,`` Those uncertainties cloud the long-term out...
1,The business has sales of about ( Euro ) 35 mi...
2,The group intends to relocate warehouse and of...
3,The pretax profit of the group 's life insuran...
4,"The 718,430 new Series A shares will become su..."
...,...
1795,Master of Mayawas jointly developed by Nokia S...
1796,Operating profit decreased to EUR 16mn from EU...
1797,"Finnish Scanfil , a systems supplier and contr..."
1798,Finnair 's total traffic decreased by 8.7 % in...


In [12]:
bag_of_words_train = nlp.df_to_bow(data,stop_words_en)

In [13]:
bag_of_words_train

Unnamed: 0,able,above,abroad,access,accordance,according,account,accounting,achieve,acknowledged,...,working,world,worth,would,year,yesterday,yet,york,zinc,zone
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.178424,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1795,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0
1796,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0
1797,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0
1798,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0


In [14]:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import OneClassSVM
from sklearn.ensemble import RandomForestClassifier 
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

We now test several models.

In [15]:
gaussian_sentiment = GaussianNB()
scores_gaussian = cross_val_score(gaussian_sentiment,bag_of_words_train,y_train,cv=5)
scores_gaussian

array([0.57777778, 0.6       , 0.58611111, 0.56388889, 0.61388889])

In [16]:
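# Caveat: OneClassSVM is an unsupervised outlier detector that predicts only +1 or -1,
# so the accuracy below is not a meaningful three-class score
svm_sentiment = OneClassSVM(kernel = 'rbf',gamma=0.01,nu=0.01)
scores_svm = cross_val_score(svm_sentiment,bag_of_words_train,y_train,cv=5,scoring='accuracy')
scores_svm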

array([0.36111111, 0.36944444, 0.37222222, 0.38055556, 0.33888889])
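
`OneClassSVM` is designed for outlier detection: it ignores the labels during fitting and predicts only +1 (inlier) or -1 (outlier), which may explain scores close to chance on three classes. For multi-class classification, scikit-learn's `SVC` is the usual choice. The sketch below runs on synthetic toy data (the clusters, sizes, and hyperparameters are assumptions) and makes no claim about the scores `SVC` would reach on the real bag of words.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, OneClassSVM

rng = np.random.default_rng(0)

# Toy stand-in for the BOW matrix: three clusters, one per class (illustrative only)
X_toy = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 5)) for c in (-1, 0, 1)])
y_toy = np.repeat([-1, 0, 1], 30)

# OneClassSVM never predicts 0, whatever the labels are
one_class_predictions = OneClassSVM(kernel="rbf", gamma=0.01, nu=0.01).fit(X_toy).predict(X_toy)
print(np.unique(one_class_predictions))

# SVC handles the three classes directly
svc = SVC(kernel="rbf", gamma="scale")
scores_svc = cross_val_score(svc, X_toy, y_toy, cv=5, scoring="accuracy")
print(scores_svc)
```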

In [17]:
rnd_sentiment = RandomForestClassifier(n_estimators=1000,criterion='entropy')
scores_rnd = cross_val_score(rnd_sentiment,bag_of_words_train,y_train,cv=5,scoring='accuracy')
scores_rnd

array([0.70833333, 0.68055556, 0.68888889, 0.71388889, 0.725     ])

In [18]:
y_train = np.array(y_train)

In [19]:
# Print one confusion matrix per fold to see which classes get confused
kf = KFold(n_splits=5,shuffle=True)
for train_index,test_index in kf.split(y_train):
    X_train_fold, X_test_fold = bag_of_words_train.iloc[train_index], bag_of_words_train.iloc[test_index]
    y_train_fold, y_test_fold = y_train[train_index], y_train[test_index]
    rnd_sentiment = RandomForestClassifier(n_estimators=500,criterion='entropy')
    rnd_sentiment.fit(X_train_fold,y_train_fold)
    y_predict = rnd_sentiment.predict(X_test_fold)
    # Rows are true labels, columns are predicted labels, both ordered -1, 0, 1
    print(confusion_matrix(y_test_fold,y_predict,labels=[-1, 0, 1]))

[[ 88  33   9]
 [  6 110  16]
 [ 14  21  63]]
[[92 29  5]
 [12 91  9]
 [20 34 68]]
[[ 89  27   5]
 [ 10 102   7]
 [ 21  22  77]]
[[67 21 13]
 [10 98 17]
 [13 37 84]]
[[83 26 13]
 [23 74 15]
 [13 25 88]]


Most errors are positive or negative texts classified as neutral. This deserves a closer look, possibly by adding more positive and negative data.
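
This reading of the matrices can be checked by summing the five fold matrices printed above. The sketch copies those printed values by hand; the variable names are illustrative.

```python
import numpy as np

# The five confusion matrices printed above (rows: true -1, 0, 1; columns: predicted)
fold_matrices = np.array([
    [[88, 33, 9], [6, 110, 16], [14, 21, 63]],
    [[92, 29, 5], [12, 91, 9], [20, 34, 68]],
    [[89, 27, 5], [10, 102, 7], [21, 22, 77]],
    [[67, 21, 13], [10, 98, 17], [13, 37, 84]],
    [[83, 26, 13], [23, 74, 15], [13, 25, 88]],
])

# Aggregate over folds and compute per-class recall
total = fold_matrices.sum(axis=0)
recall = total.diagonal() / total.sum(axis=1)

# Keep only the off-diagonal entries, then count errors predicted as neutral
errors = total - np.diag(total.diagonal())
to_neutral = errors[:, 1].sum()
share_to_neutral = to_neutral / errors.sum()

print(total)
print(recall)
print(to_neutral, errors.sum(), share_to_neutral)
```

Of the 526 misclassified texts across the folds, 275 (about 52%) are negative or positive texts predicted as neutral, and neutral has the highest recall of the three classes.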

Let's look at the results on real tweets.

In [20]:
codes = data_u.get_codes('../../codes.txt')
Data = data_u.search_author('CNNBusiness','2020/01/15',20,'en',codes,["Text", "Author", "Date"],remove_URL=True)
Data

Unnamed: 0,Text,Author,Date
0,The sale of the WNBA's Atlanta Dream — co-owne...,CNNBusiness,2021-01-20 07:01:03
1,At least six major news networks have assigned...,CNNBusiness,2021-01-20 06:01:05
2,Netflix — which was founded in 1997 as a renta...,CNNBusiness,2021-01-20 05:01:07
3,Bed Bath &amp; Beyond has stopped selling MyPi...,CNNBusiness,2021-01-20 04:01:11
4,Christian crowdsource funding site used by ext...,CNNBusiness,2021-01-20 02:30:32
5,Today in business news:\n\n➡️ Stocks are havin...,CNNBusiness,2021-01-20 02:04:26
6,The Justice Department has notified Republican...,CNNBusiness,2021-01-20 01:45:49
7,A batch of chocolate milk has been recalled af...,CNNBusiness,2021-01-20 01:23:20
8,MGM Resorts International is walking away from...,CNNBusiness,2021-01-19 22:00:22
9,"The revamped service, called Paramount+, which...",CNNBusiness,2021-01-19 21:45:04


In [21]:
Texte_1 = Data['Text'][0]
Texte_1

"The sale of the WNBA's Atlanta Dream — co-owned by outgoing US Sen. Kelly Loeffler — is close to being finalized, a league spokesperson says "

In [22]:
Texte_2 = Data['Text'][8]
Texte_2

'MGM Resorts International is walking away from an attempt to buy the owner of British gambling brand Ladbrokes after its $11 billion bid was rejected. '

In [23]:
Texte_3 = Data['Text'][10]
Texte_3

"Some livestream platforms are taking steps to crack down on such broadcasts after the assault on the Capitol and in anticipation of potential disruptions at President-elect Joe Biden's inauguration. "

In [24]:
dataframe_to_predict = nlp.df_to_bow_prediction(bag_of_words_train.columns, Data, stop_words_en)

In [25]:
dataframe_to_predict

Unnamed: 0,able,above,abroad,access,accordance,according,account,accounting,achieve,acknowledged,...,working,world,worth,would,year,yesterday,yet,york,zinc,zone
0,0,0,0,0,0,0.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
rnd_sentiment = RandomForestClassifier(n_estimators=1000,criterion='entropy')
rnd_sentiment.fit(bag_of_words_train,y_train)

RandomForestClassifier(criterion='entropy', n_estimators=1000)

In [27]:
rnd_sentiment.predict(dataframe_to_predict)

array([-1., -1., -1., -1.,  0.,  0.,  0.,  0.,  0.,  0.,  0., -1.,  1.,
        0.,  0.,  0., -1., -1.,  0.,  0.])

In [29]:
for i in range(Data.shape[0]):
    print('Tweet', i ,': ', Data['Text'][i])

Tweet 0 :  The sale of the WNBA's Atlanta Dream — co-owned by outgoing US Sen. Kelly Loeffler — is close to being finalized, a league spokesperson says 
Tweet 1 :  At least six major news networks have assigned women to lead White House coverage of the Biden administration, raising the profile of female journalists in an institution long dominated by men 
Tweet 2 :  Netflix — which was founded in 1997 as a rental company that sent you DVDs in the mail — says it now has more than 200 million subscribers globally, after adding 8.5 million subscribers in the fourth quarter of 2020, beating its own expectation 
Tweet 3 :  Bed Bath &amp; Beyond has stopped selling MyPillow products following CEO Mike Lindell's support of the January 6 insurrection and his continued false statements questioning the validity of the US presidential election 
Tweet 4 :  Christian crowdsource funding site used by extremists involved in Capitol riot 
Tweet 5 :  Today in business news:

➡️ Stocks are having a post

The model clearly lacks accuracy on these tweets. It would need a much larger training dataset.