This notebook explores several machine learning models for sentiment analysis. The text is preprocessed into a bag of words (BOW).

https://ieeexplore.ieee.org/abstract/document/9142175

In [1]:
import TweetDataFrame as TDF
import Natural_Language_Processing as NLP
from sklearn.feature_extraction.text import CountVectorizer
import tweepy as tw
import pandas as pd
import numpy as np
import random

The dataset contains finance-oriented texts, each labeled positive, negative, or neutral.

In [2]:
dataset = pd.read_csv('dataset/dataset_sentiment.csv',sep=',',encoding='latin-1',header=None,names=['sentiment','text'])
dataset

Unnamed: 0,sentiment,text
0,neutral,"According to Gran , the company has no plans t..."
1,neutral,Technopolis plans to develop in stages an area...
2,negative,The international electronic industry company ...
3,positive,With the new production plant the company woul...
4,positive,According to the company 's updated strategy f...
...,...,...
4841,negative,LONDON MarketWatch -- Share prices ended lower...
4842,neutral,Rinkuskiai 's beer sales fell by 6.5 per cent ...
4843,negative,Operating profit fell to EUR 35.4 mn from EUR ...
4844,negative,Net sales of the Paper segment decreased to EU...


In [3]:
## Rerun for another example
ind = random.randint(0,dataset.shape[0]-1)
print(dataset['text'].loc[ind],dataset['sentiment'].loc[ind])

Thus the group 's balance sheet will have about EUR25 .8 m of goodwill , the company added . neutral


In [4]:
from sklearn.preprocessing import OrdinalEncoder

We split the dataset into X and y, then encode neutral as 0, positive as 1, and negative as -1.

In [5]:
enc = OrdinalEncoder()
X = dataset[['text']]
y = dataset[['sentiment']]
y = enc.fit_transform(y)
y = y-1 # neutral = 0, positive = 1, negative = -1
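
By default `OrdinalEncoder` sorts the categories alphabetically, so `negative`, `neutral`, and `positive` become 0, 1, and 2, and subtracting 1 gives the -1/0/1 coding described above. A minimal sketch on a toy frame (the three-row `toy` DataFrame and the names `enc_demo`, `codes`, and `mapping` are illustrative, not part of the notebook):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy labels, illustrative only
toy = pd.DataFrame({"sentiment": ["neutral", "positive", "negative"]})

enc_demo = OrdinalEncoder()
codes = enc_demo.fit_transform(toy[["sentiment"]]) - 1  # shift 0/1/2 to -1/0/1

# categories_ holds the alphabetically sorted labels seen during fit
mapping = {label: code for code, label in enumerate(enc_demo.categories_[0], start=-1)}
print(mapping)
```

Inspecting `categories_` like this is a cheap way to confirm which integer each label received before relying on the coding.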

In [6]:
unique, counts = np.unique(y, return_counts=True)
print(unique,counts)

[-1.  0.  1.] [ 604 2879 1363]


The classes are imbalanced. We therefore take 600 positive, 600 negative, and 600 neutral texts so that the model is trained on a balanced dataset. Note that the loops below pick the first 600 texts of each class in dataset order rather than a random sample; the selection is only shuffled afterwards.

In [7]:
X_train =[]
y_train =[]
k=0
i=0
while k<600:
    if y[i]==-1:
        X_train.append(X['text'][i])
        y_train.append(y[i])
        k+=1
    i+=1
k=0
i=0
while k<600:
    if y[i]==0:
        X_train.append(X['text'][i])
        y_train.append(y[i])
        k+=1
    i+=1
k=0
i=0
while k<600:
    if y[i]==1:
        X_train.append(X['text'][i])
        y_train.append(y[i])
        k+=1
    i+=1
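
If a genuinely random balanced subset is wanted, random undersampling can be sketched with pandas. This is an alternative to the loops above, not what the notebook ran; the toy frame, `n_per_class`, and `random_state` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy imbalanced dataset, illustrative only
toy = pd.DataFrame({
    "sentiment": ["neutral"] * 6 + ["positive"] * 4 + ["negative"] * 3,
    "text": [f"text {i}" for i in range(13)],
})

n_per_class = 3  # the notebook uses 600
balanced = (
    toy.groupby("sentiment")
       .sample(n=n_per_class, random_state=0)  # random draw within each class
       .sample(frac=1, random_state=0)         # shuffle the rows
       .reset_index(drop=True)
)
print(balanced["sentiment"].value_counts())
```

Fixing `random_state` keeps the draw reproducible across reruns.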

In [8]:
import random

c = list(zip(X_train, y_train))

random.shuffle(c)

X_train, y_train = zip(*c)

X_train = list(X_train)
y_train = list(y_train)

y_train=list(np.array(y_train).reshape(-1))

We build the bag of words for the training data.

In [9]:
# Read the stop-word list; the with-block closes the file automatically
with open("dict_mot/stop_words_english.txt","r") as f:
    stop_words_en=f.read().splitlines()

In [10]:
X_train = {'Text':X_train}
data = pd.DataFrame(X_train)

In [11]:
data

Unnamed: 0,Text
0,Amer Sports divests an industrial site in Rumi...
1,Donations to universities The Annual General M...
2,Los Angeles-based Pacific Office Properties Tr...
3,"Copper , lead and nickel also dropped ... HBOS..."
4,"However , net sales in 2010 are seen to have g..."
...,...
1795,Ruukki Group calculates that it has lost EUR 4...
1796,Finnair said that the cancellation of flights ...
1797,"The company 's scheduled traffic , measured in..."
1798,"Also , a six-year historic analysis is provide..."


In [13]:
bag_of_words_train = NLP.df_to_bow(data,stop_words = stop_words_en,language = 'en',TFIDF = True)
bag_of_words_train

TypeError: __init__() missing 1 required positional argument: 'language'

In [23]:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import OneClassSVM
from sklearn.ensemble import RandomForestClassifier 
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

We now test several models.

In [24]:
gaussian_sentiment = GaussianNB()
scores_gaussian = cross_val_score(gaussian_sentiment,bag_of_words_train,y_train,cv=5)
scores_gaussian

array([0.61944444, 0.57777778, 0.56388889, 0.55833333, 0.59166667])

In [25]:
svm_sentiment = OneClassSVM(kernel = 'rbf',gamma=0.01,nu=0.01)
scores_svm = cross_val_score(svm_sentiment,bag_of_words_train,y_train,cv=5,scoring='accuracy')
scores_svm

array([0.34722222, 0.37222222, 0.33333333, 0.26111111, 0.34166667])
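
`OneClassSVM` is an unsupervised novelty detector: its `predict` returns only +1 (inlier) or -1 (outlier) and the labels passed to `fit` are ignored, so the near-chance accuracy above says little about SVMs on this task. A multi-class SVM would normally be `sklearn.svm.SVC`. A minimal sketch on synthetic data follows; the generated dataset and the hyperparameters are assumptions, and no score for the notebook's data is implied.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic 3-class problem standing in for the bag-of-words features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_classes=3, random_state=0,
)
y_demo = y_demo - 1  # use the same -1/0/1 labels as the notebook

svc_demo = SVC(kernel="rbf", gamma="scale")
scores_svc_demo = cross_val_score(svc_demo, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores_svc_demo.mean())
```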

In [26]:
rnd_sentiment = RandomForestClassifier(n_estimators=1000,criterion='entropy')
scores_rf = cross_val_score(rnd_sentiment,bag_of_words_train,y_train,cv=5,scoring='accuracy')
scores_rf

array([0.75      , 0.74444444, 0.72777778, 0.75277778, 0.78888889])

In [27]:
y_train = np.array(y_train)

In [28]:
kf = KFold(n_splits=5,shuffle=True)
for train_index,test_index in kf.split(y_train):
    X_train_fold, X_test_fold = bag_of_words_train.iloc[train_index], bag_of_words_train.iloc[test_index]
    y_train_fold, y_test_fold = y_train[train_index], y_train[test_index]
    rnd_sentiment = RandomForestClassifier(n_estimators=500,criterion='entropy')
    rnd_sentiment.fit(X_train_fold,y_train_fold)
    y_predict = rnd_sentiment.predict(X_test_fold)
    print(confusion_matrix(y_test_fold,y_predict,labels=[-1, 0, 1]))

[[ 87  28  12]
 [  1 115  14]
 [  3  29  71]]
[[ 89  19   6]
 [  5 100  10]
 [  5  36  90]]
[[ 81  25  10]
 [  8 101   8]
 [  7  32  88]]
[[86 28 10]
 [ 8 93 11]
 [ 2 34 88]]
[[ 81  26  12]
 [  5 107  14]
 [ 10  22  83]]


Most errors are positive or negative texts classified as neutral. This deserves a closer look, perhaps by adding more positive and negative examples.
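
To quantify this, per-class recall and the share of errors that land in the neutral column can be read off a confusion matrix. The sketch below uses the first fold's matrix printed above (rows are true labels, columns are predictions, in the order -1, 0, 1); the variable names are illustrative.

```python
import numpy as np

# First fold's confusion matrix from the output above
cm = np.array([[87, 28, 12],
               [1, 115, 14],
               [3, 29, 71]])

recall = cm.diagonal() / cm.sum(axis=1)   # correct predictions / true count per class
errors = cm.sum() - cm.trace()            # all off-diagonal entries
to_neutral = cm[0, 1] + cm[2, 1]          # negative or positive predicted as neutral

print(dict(zip(["negative", "neutral", "positive"], recall.round(3))))
print(f"{to_neutral} of {errors} errors are predicted neutral")
```

For this fold, 57 of the 87 errors are negative or positive texts predicted as neutral.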

Let's look at the results on real tweets.

In [30]:
Data = TDF.search_author('CNNBusiness',20,remove_URL=True)
Data

Authentification ok


Unnamed: 0,Text,Author,Date
0,More than 170 prominent business leaders have ...,CNNBusiness,2021-01-05 15:05:03
1,A group of major oil producers including Saudi...,CNNBusiness,2021-01-05 14:29:06
2,Alibaba is shutting down its music streaming a...,CNNBusiness,2021-01-05 14:00:14
3,More than 170 prominent business leaders have ...,CNNBusiness,2021-01-05 00:34:03
4,A second round of stimulus payments are on the...,CNNBusiness,2021-01-04 23:32:03
5,"Fiat Chrysler Automobiles and Groupe PSA, the ...",CNNBusiness,2021-01-04 22:33:02
6,The results of Georgia's Senate runoffs will p...,CNNBusiness,2021-01-04 22:16:03
7,Microsoft knows there's an Xbox shortage. It's...,CNNBusiness,2021-01-04 21:41:47
8,"Haven, the health care company founded in 2018...",CNNBusiness,2021-01-04 20:29:07
9,Chipotle is adding cauliflower rice to its men...,CNNBusiness,2021-01-04 20:00:09


In [44]:
Texte_1 = Data['Text'][0]
Texte_1

'More than 170 prominent business leaders have signed a letter urging Congress to accept the Electoral College results that declared Joe Biden as the next President of the United States '

In [45]:
Texte_2 = Data['Text'][8]
Texte_2

'Haven, the health care company founded in 2018, is shutting down. The joint venture by Amazon, Berkshire Hathaway and JPMorgan Chase struggled to make inroads beyond its three partners. '

In [46]:
Texte_3 = Data['Text'][10]
Texte_3

"Janet Yellen, President-elect Joe Biden's pick for Treasury secretary, made more than $7 million in recent years by giving speeches to Wall Street banks, major corporations and industry groups. "

In [47]:
columns = bag_of_words_train.columns
validation_dataframe = pd.DataFrame(columns=columns)

In [48]:
# Build one row per tweet, restricted to the training vocabulary
for texte in [Texte_1, Texte_2, Texte_3]:
    words,BOW = NLP.bagofwords(texte,'en', stop_words=stop_words_en, remove_non_words=False, stemming=False)
    ind = validation_dataframe.shape[0]  # next free row index
    for i in range(len(words)):
        # keep only words that exist in the training columns
        if words[i] in columns:
            validation_dataframe.loc[ind,words[i]] = BOW[i]

In [49]:
validation_dataframe=validation_dataframe.fillna(0)
validation_dataframe

Unnamed: 0,job,raahe,reduced,steel,total,work,down,eur,group,loss,...,search,tata,perkonoja,considered,eero,gathered,sihvonen,taking,selects,transplace
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
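
The same alignment to the training vocabulary can be written more compactly with `reindex`. This is a sketch under assumptions: `train_columns` stands in for `bag_of_words_train.columns`, the two texts are made up, and the whitespace tokenizer is a simplification of whatever `NLP.bagofwords` does.

```python
from collections import Counter
import pandas as pd

train_columns = ["down", "eur", "group", "loss"]  # stand-in for the training vocabulary
texts = ["the venture is shutting down", "industry group and banking group"]

# Count words per text; whitespace splitting is a simplification
counts = [dict(Counter(t.lower().split())) for t in texts]

aligned = (
    pd.DataFrame(counts)
      .reindex(columns=train_columns)  # keep known words only, in training order
      .fillna(0)
      .astype(int)
)
print(aligned)
```

Words that never appeared in training are dropped and missing training words become zero columns, so the result always has the column layout the fitted model expects.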


In [50]:
rnd_sentiment = RandomForestClassifier(n_estimators=1000,criterion='entropy')
rnd_sentiment.fit(bag_of_words_train,y_train)

RandomForestClassifier(criterion='entropy', n_estimators=1000)

In [51]:
rnd_sentiment.predict(validation_dataframe)

array([ 0., -1.,  0.])