This notebook applies the dictionary (lexicon) method to sentiment analysis. The dictionary used is the Harvard IV-4 dictionary.

In [1]:
from functions import data_utilities as data
from functions import nlp_utilities as nlp
from sklearn.feature_extraction.text import CountVectorizer
import tweepy as tw
import pandas as pd
import numpy as np
import random

The dictionary file has two columns, one containing positive words and the other negative words.

In [2]:
dico = pd.read_excel('dataset/dictionary_sentiment.xlsx',sheet_name=1)
dico

Unnamed: 0,POSITIV_KEYWORD,NEGATIV_KEYWORD
0,ABIDE,ABANDON
1,ABILITY,ABANDONMENT
2,ABLE,ABATE
3,ABOUND,ABDICATE
4,ABSOLVE,ABHOR
...,...,...
2286,,WRONGFUL
2287,,WROUGHT
2288,,YAWN
2289,,YEARN


We create two objects, one holding the positive words and the other the negative words.

In [3]:
pos_words = dico['POSITIV_KEYWORD'].dropna()
pos_words

0             ABIDE
1           ABILITY
2              ABLE
3            ABOUND
4           ABSOLVE
           ...     
1902    WORTH-WHILE
1903     WORTHINESS
1904         WORTHY
1905         ZENITH
1906           ZEST
Name: POSITIV_KEYWORD, Length: 1907, dtype: object

In [4]:
neg_words = dico['NEGATIV_KEYWORD']
neg_words

0           ABANDON
1       ABANDONMENT
2             ABATE
3          ABDICATE
4             ABHOR
           ...     
2286       WRONGFUL
2287        WROUGHT
2288           YAWN
2289          YEARN
2290           YELP
Name: NEGATIV_KEYWORD, Length: 2291, dtype: object

We load the dataset, which we will later use to build the bag of words. The neutral rows have to be removed from the dataset, because our dictionary only contains a list of positive words and a list of negative words, with no neutral words.

In [5]:
dataset = pd.read_csv('dataset/dataset_sentiment.csv',sep=',',encoding='latin-1',header=None,names=['sentiment','text'])
dataset

Unnamed: 0,sentiment,text
0,neutral,"According to Gran , the company has no plans t..."
1,neutral,Technopolis plans to develop in stages an area...
2,negative,The international electronic industry company ...
3,positive,With the new production plant the company woul...
4,positive,According to the company 's updated strategy f...
...,...,...
4841,negative,LONDON MarketWatch -- Share prices ended lower...
4842,neutral,Rinkuskiai 's beer sales fell by 6.5 per cent ...
4843,negative,Operating profit fell to EUR 35.4 mn from EUR ...
4844,negative,Net sales of the Paper segment decreased to EU...


In [6]:
dataset = dataset.drop(dataset.loc[dataset['sentiment']=='neutral'].index)
dataset

Unnamed: 0,sentiment,text
2,negative,The international electronic industry company ...
3,positive,With the new production plant the company woul...
4,positive,According to the company 's updated strategy f...
5,positive,FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is ag...
6,positive,"For the last quarter of 2010 , Componenta 's n..."
...,...,...
4840,negative,HELSINKI Thomson Financial - Shares in Cargote...
4841,negative,LONDON MarketWatch -- Share prices ended lower...
4843,negative,Operating profit fell to EUR 35.4 mn from EUR ...
4844,negative,Net sales of the Paper segment decreased to EU...


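We load the English stop-word list, then extract the texts and their labels. The labels stay as the strings 'positive' and 'negative'; they are not encoded as 1 and 0 in the cells of this section.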

In [7]:
# The context manager closes the file automatically
with open("dict_mot/stop_words_english.txt", "r") as f:
    stop_words_en = f.read().splitlines()

In [8]:
X = dataset['text']
y = dataset['sentiment']

In [9]:
X = list(X)
y = list(y)

In [10]:
y_pred = []
pos_words = list(pos_words)
neg_words = list(neg_words)

In [11]:
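for i in range(len(X)):
    y_pred.append(0)
    # Split on single spaces only; punctuation stays attached to the words
    text = X[i].split(' ')
    for word in text :
        # Exact, case-sensitive match: the dictionary entries are upper-case,
        # so only words written in capitals in the text can match
        if word in pos_words:
            y_pred[i]+=1
        if word in neg_words:
            y_pred[i]+=(-1)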

In [12]:
for i in range(len(y_pred)):
    if y_pred[i]>0:
        y_pred[i]='positiv'
    elif y_pred[i]==0:
        y_pred[i]='neutral'
    else:
        y_pred[i]='negativ'
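
The lookup above compares raw tokens with upper-case dictionary entries. As a minimal sketch of a normalized variant, the block below upper-cases the text and strips punctuation before the lookup. The word lists, the helper names `score_text` and `label`, and the example sentences are hypothetical stand-ins, not the Harvard dictionary or the notebook's data.

```python
import re

# Toy word lists standing in for the two dictionary columns (assumed, upper-case).
POS_WORDS = {"ABLE", "GROWTH", "WORTHY"}
NEG_WORDS = {"ABANDON", "LOSS", "WRONGFUL"}

def score_text(text, pos_words=POS_WORDS, neg_words=NEG_WORDS):
    """Return (positive hits - negative hits) after normalizing the tokens."""
    # Upper-case the text and keep alphabetic runs only,
    # so "growth," and "Growth" both become "GROWTH".
    tokens = re.findall(r"[A-Z]+", text.upper())
    return sum(t in pos_words for t in tokens) - sum(t in neg_words for t in tokens)

def label(score):
    """Map a score to the three labels used in the notebook."""
    if score > 0:
        return "positiv"
    if score < 0:
        return "negativ"
    return "neutral"

examples = [
    "The company reported strong growth this quarter.",
    "Operating loss widened, forcing it to abandon the plan.",
    "The meeting is on Tuesday.",
]
labels = [label(score_text(t)) for t in examples]
print(labels)
```

Storing the word lists as sets also makes each lookup constant-time, whereas `word in some_list` scans the whole list. Note that this tokenization would split hyphenated entries such as WORTH-WHILE into two tokens.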

In [13]:
y_pred

['neutral',
 'neutral',
 'neutral',
 ...
 'positiv',
 'neutral',
 'neutral',
 ...

(Output truncated. Of the first 83 predictions displayed, 82 are 'neutral' and one, the 59th, is 'positiv'.)

Problem: nearly every text is predicted as neutral. Our dictionary is not suited to financial texts, and the lookup above is an exact, case-sensitive match against upper-case dictionary entries, so we could also try preprocessing the text before comparing it to the dictionary. Let us try to build a dictionary from our dataset of financial texts.

In [14]:
dataset = pd.read_csv('dataset/dataset_sentiment.csv',sep=',',encoding='latin-1',header=None,names=['sentiment','text'])

In [15]:
from sklearn.preprocessing import OrdinalEncoder

In [16]:
enc = OrdinalEncoder()
X = dataset[['text']]
y = dataset[['sentiment']]
y = enc.fit_transform(y)
y = y-1
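
With its default settings, `OrdinalEncoder` numbers the categories in sorted order, so subtracting 1 should give negative = -1, neutral = 0 and positive = 1. The sketch below reproduces that mapping in plain Python on a few hypothetical labels; it does not call scikit-learn, and the variable names are illustrative only.

```python
# Hypothetical labels standing in for the 'sentiment' column.
labels = ["neutral", "negative", "positive", "neutral"]

# Sorted unique categories: negative, neutral, positive.
categories = sorted(set(labels))

# Code = position in the sorted list, shifted by -1 as in `y = y - 1`.
mapping = {cat: idx - 1 for idx, cat in enumerate(categories)}
encoded = [mapping[lab] for lab in labels]

print(mapping)
print(encoded)
```

With this coding, the sum of the labels over a set of documents is positive when positive documents dominate and negative when negative ones do, which is what the word-scoring cell further down relies on.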

In [20]:
bag_of_words_total = pd.read_csv('../../bow_dataset_sentiment.csv')

In [21]:
bag_of_words_total

Unnamed: 0,according,although,company,gran,growing,move,plan,production,russia,area,...,notified,pequot,external,hs,keywords,newsroom,overshadowed,surprise,broader,rebound
0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4841,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
4842,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4843,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4844,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
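
The file loaded above is a precomputed bag of words; how it was built is not shown in this notebook. As a rough illustration of the structure only (one row per document, one column per word, cells holding occurrence counts), here is a sketch on two hypothetical documents. It does no lemmatization or stop-word removal, which the real file appears to include.

```python
from collections import Counter

# Two toy documents (hypothetical, already lower-cased).
docs = [
    "company plans production in russia",
    "company profit fell",
]

# One Counter of word counts per document.
counts = [Counter(doc.split()) for doc in docs]

# Vocabulary in order of first appearance; these are the matrix columns.
vocab = list(dict.fromkeys(word for doc in docs for word in doc.split()))

# Rows are documents, columns are words, cells are occurrence counts.
matrix = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```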


In [22]:
unique, counts = np.unique(y, return_counts=True)
print(unique,counts)

[-1.  0.  1.] [ 604 2879 1363]


In [23]:
y = np.array(y)
positiv_words = []
negativ_words = []
neutral_words = []
for word in bag_of_words_total.columns:
    # Total number of occurrences of the word in the corpus
    somme_apparition = bag_of_words_total[word].sum()
    # Rows (documents) in which the word occurs at least once
    index = bag_of_words_total.index[bag_of_words_total[word]>0].tolist()
    # Sum of the sentiment codes (-1, 0, 1) of those documents
    somme_sentiment = np.sum(y[index])
    freq_sentiment = somme_sentiment/somme_apparition
    freq_apparition = somme_apparition/bag_of_words_total.shape[0]

    if freq_sentiment>1/3 :#and freq_apparition>0.001:
        sentiment = 1
        positiv_words.append(word.upper())
    # This `if` is independent of the one above, so the `else` branch also
    # runs for positive words: they are added to neutral_words as well
    if freq_sentiment<-1/3 :#and freq_apparition>0.001:
        sentiment = -1
        negativ_words.append(word.upper())
    else:
        sentiment = 0
        neutral_words.append(word.upper())
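
In the cell above, the `else` is attached to the second `if` only, so a word that passes the positive test also lands in the neutral list. A minimal sketch of a mutually exclusive split, using `elif`, is given below. It keeps the same ratio (sum of document sentiments divided by total occurrences) and the same 1/3 threshold; the function name `classify_words` and the toy data are hypothetical.

```python
def classify_words(bow_rows, vocab, y, threshold=1/3):
    """Split vocab into positive / negative / neutral lists, one list per word.

    bow_rows: per-document lists of counts (one column per word in vocab)
    y: per-document sentiment codes in {-1, 0, 1}
    """
    positive, negative, neutral = [], [], []
    for col, word in enumerate(vocab):
        occurrences = sum(row[col] for row in bow_rows)
        if occurrences == 0:
            continue  # the word never occurs, so no score can be computed
        # Sum of the sentiment codes of the documents containing the word.
        sentiment_sum = sum(y[i] for i, row in enumerate(bow_rows) if row[col] > 0)
        freq_sentiment = sentiment_sum / occurrences
        if freq_sentiment > threshold:
            positive.append(word.upper())
        elif freq_sentiment < -threshold:
            negative.append(word.upper())
        else:
            neutral.append(word.upper())
    return positive, negative, neutral

# Toy data: 4 documents, 3 words.
vocab = ["growth", "loss", "company"]
bow_rows = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]]
y = [1, -1, 1, 0]

pos, neg, neu = classify_words(bow_rows, vocab, y)
print(pos, neg, neu)
```

With this version the three list lengths add up to the number of scored words, which is a quick sanity check on the split.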

In [24]:
len(positiv_words)

2023

In [25]:
len(negativ_words)

536

In [26]:
len(neutral_words)

7509

In [27]:
dico_2 = pd.DataFrame()
negativ_words = pd.DataFrame(negativ_words,columns=['negativ'])
positiv_words = pd.DataFrame(positiv_words,columns=['positiv'])
neutral_words = pd.DataFrame(neutral_words,columns=['neutral'])

In [28]:
dico_2 = pd.concat([negativ_words,neutral_words,positiv_words],ignore_index=False,axis=1)

In [29]:
dico_2

Unnamed: 0,negativ,neutral,positiv
0,CONTRARY,ACCORDING,GRAN
1,LAID,ALTHOUGH,GROWING
2,LAYOFF,COMPANY,DEVELOP
3,POSTIMEES,GRAN,ORDER
4,RANK,GROWING,WORKING
...,...,...,...
7504,,ROP,
7505,,RUIN,
7506,,DEPOSITOR,
7507,,PREFERENCE,


In [30]:
codes = data.get_codes('../../codes.txt')

In [31]:
Data = data.search_author('CNNBusiness','2021/01/10',20,'en',codes,["Text", "Author", "Date"],remove_URL=True)
Data

Unnamed: 0,Text,Author,Date
0,The sale of the WNBA's Atlanta Dream — co-owne...,CNNBusiness,2021-01-20 07:01:03
1,At least six major news networks have assigned...,CNNBusiness,2021-01-20 06:01:05
2,Netflix — which was founded in 1997 as a renta...,CNNBusiness,2021-01-20 05:01:07
3,Bed Bath &amp; Beyond has stopped selling MyPi...,CNNBusiness,2021-01-20 04:01:11
4,Christian crowdsource funding site used by ext...,CNNBusiness,2021-01-20 02:30:32
5,Today in business news:\n\n➡️ Stocks are havin...,CNNBusiness,2021-01-20 02:04:26
6,The Justice Department has notified Republican...,CNNBusiness,2021-01-20 01:45:49
7,A batch of chocolate milk has been recalled af...,CNNBusiness,2021-01-20 01:23:20
8,MGM Resorts International is walking away from...,CNNBusiness,2021-01-19 22:00:22
9,"The revamped service, called Paramount+, which...",CNNBusiness,2021-01-19 21:45:04


In [32]:
Textes = list(Data['Text'])

In [34]:
Textes[4]

'Christian crowdsource funding site used by extremists involved in Capitol riot '

In [35]:
pos_words = dico_2['positiv'].dropna()
neg_words = dico_2['negativ'].dropna()
neu_words = dico_2['neutral'].dropna()

In [36]:
pos_words

0             GRAN
1          GROWING
2          DEVELOP
3            ORDER
4          WORKING
           ...    
2018    RHEUMATOID
2019     TOLERATED
2020         KRONE
2021    REBOUNDING
2022    CENTRICITY
Name: positiv, Length: 2023, dtype: object

In [37]:
# The context manager closes the file automatically
with open("dict_mot/stop_words_english.txt", "r") as f:
    stop_words_en = f.read().splitlines()

In [38]:
lemma = nlp.LemmaTokenizer(stop_words=stop_words_en,remove_non_words=False)
texte = lemma(Textes[5])
texte

TypeError: __init__() got multiple values for argument 'stop_words'

In [250]:
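y_pred = 0
for word in texte:
    # Note: pos_words and neg_words are pandas Series here, and `in` on a
    # Series tests the index labels, not the stored words
    if word.upper() in pos_words :
        print('pos')
        y_pred +=1
    elif word.upper() in neg_words :
        print('neg')
        y_pred += -1
y_pred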

0

In [252]:
'FIRST' in neu_words

False
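
The `in` operator applied to a pandas Series tests the index labels, not the values. This is consistent with the `False` above and may also explain the score of 0 in the previous cell. A small sketch on a hypothetical two-word Series shows the difference and one way to test the values instead.

```python
import pandas as pd

# Hypothetical stand-in for one column of the dictionary.
words = pd.Series(["FIRST", "SECOND"])

in_index = "FIRST" in words           # tests the index labels (0 and 1)
in_values = "FIRST" in words.values   # tests the stored values
label_found = 0 in words              # 0 is an index label

# A set of the values gives fast, value-based lookups.
word_set = set(words)
in_set = "FIRST" in word_set

print(in_index, in_values, label_found, in_set)
```

Converting each dictionary column to a set once, before the scoring loop, keeps the membership tests both correct and cheap.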