## Importing the libraries and datasets needed for our work

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing

# vecteur_emotion_final = pd.read_csv('/home/apprenant/Documents/Brief-Emotion-Analysis-Text/data/03_vectorized/emotion_final_matrix.csv')
emotion_final = pd.read_csv('/home/apprenant/Documents/Brief-Emotion-Analysis-Text/data/02_cleaned/cleaned_emotion_final.csv')

# vecteur_text_emotion = pd.read_csv('/home/apprenant/Documents/Brief-Emotion-Analysis-Text/data/03_vectorized/text_emotion_matrix.csv')
text_emotion = pd.read_csv('/home/apprenant/Documents/Brief-Emotion-Analysis-Text/data/02_cleaned/cleaned_text_emotion.csv')

In [2]:
emotion_final.head()

Unnamed: 0,text,label
0,i didnt feel humiliated,sadness
1,i can go from feeling so hopeless to so damned...,sadness
2,im grabbing a minute to post i feel greedy wrong,anger
3,i am ever feeling nostalgic about the fireplac...,love
4,i am feeling grouchy,anger


In [3]:
emotion_final.shape

(21456, 2)

In [4]:
text_emotion.shape

(40000, 3)

## Step 1: Comparing classification on the first dataset with classification on the second dataset

### First dataset

#### Label encoding of the emotions

In [5]:
# Negative emotions map to 0, positive emotions map to 1.
negative = ["sadness", "fear", "anger"]
positive = ["love", "surprise", "happy"]
conditions = [emotion_final['label'].isin(negative), emotion_final['label'].isin(positive)]

In [6]:
values = [0, 1]

In [7]:
emotion_final['binary_emotion'] = np.select(conditions, values)

In [8]:
emotion_final.head()

Unnamed: 0,text,label,binary_emotion
0,i didnt feel humiliated,sadness,0
1,i can go from feeling so hopeless to so damned...,sadness,0
2,im grabbing a minute to post i feel greedy wrong,anger,0
3,i am ever feeling nostalgic about the fireplac...,love,1
4,i am feeling grouchy,anger,0
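
One caveat with `np.select`: its `default` is 0, so a label that matches neither condition lands in the negative class without any warning. As a minimal sketch of an alternative (the toy frame and the `polarity` dictionary are hypothetical stand-ins, not part of the notebook), `Series.map` with an explicit dictionary leaves unknown labels as `NaN`, which makes them easy to spot:

```python
import pandas as pd

# Hypothetical toy frame standing in for emotion_final.
toy = pd.DataFrame({"label": ["sadness", "love", "anger", "happy", "disgust"]})

# Explicit label -> class mapping (0 = negative, 1 = positive).
polarity = {
    "sadness": 0, "fear": 0, "anger": 0,
    "love": 1, "surprise": 1, "happy": 1,
}

# Series.map returns NaN for labels missing from the dictionary,
# so unexpected values stay visible instead of silently becoming 0.
toy["binary_emotion"] = toy["label"].map(polarity)
unmapped = toy.loc[toy["binary_emotion"].isna(), "label"].unique().tolist()
print(unmapped)
```

If `unmapped` is empty, the column can then be cast to an integer type with `astype(int)`.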


#### Choosing the variables

In [9]:
X = emotion_final['text']
y = emotion_final['binary_emotion']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
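
The split above is drawn at random on every run, so the reported accuracy can change from one execution to the next. A possible variant, sketched on hypothetical toy data (the seed value 42 is arbitrary), fixes `random_state` and uses `stratify` so that both parts keep the same class ratio:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for emotion_final (80% class 0, 20% class 1).
toy = pd.DataFrame({
    "text": [f"sample {i}" for i in range(50)],
    "binary_emotion": [0] * 40 + [1] * 10,
})

X_train, X_test, y_train, y_test = train_test_split(
    toy["text"],
    toy["binary_emotion"],
    test_size=0.2,
    random_state=42,                 # arbitrary seed, chosen only for repeatability
    stratify=toy["binary_emotion"],  # keep the class ratio in both parts
)
print(y_train.mean(), y_test.mean())
```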

#### Setting up the model

In [10]:
vectorizer = CountVectorizer(min_df=0, lowercase=False)
vectorizer.fit(emotion_final['text'])

CountVectorizer(lowercase=False, min_df=0)

In [11]:
text_train = vectorizer.transform(X_train)
text_test = vectorizer.transform(X_test)

In [12]:
classifier = LogisticRegression()
classifier.fit(text_train, y_train)
score = classifier.score(text_test, y_test)
print("Accuracy:", round(score, ndigits=4))

Accuracy: 0.9555
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
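
The warning above indicates that the solver stopped at its iteration limit, and cell `In [10]` builds the vocabulary from the full `text` column, test rows included. One way to address both points, shown here as a sketch on a hypothetical toy corpus (the `max_iter=1000` value is an arbitrary choice, not a tuned setting), is to wrap the vectorizer and the classifier in a pipeline that is fitted on the training texts only:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; 0 = negative, 1 = positive.
train_texts = ["i feel sad", "i feel hopeless", "i feel happy", "i feel loved"]
train_labels = [0, 0, 1, 1]
test_texts = ["so sad today", "happy and loved"]

model = make_pipeline(
    CountVectorizer(lowercase=False),
    LogisticRegression(max_iter=1000),  # arbitrary limit, larger than the default of 100
)

# fit() learns the vocabulary and the weights from the training texts only.
model.fit(train_texts, train_labels)

# Words unseen during training (for example "today") are ignored at prediction time.
predictions = model.predict(test_texts)
print(predictions)
```

With the notebook's variables, the equivalent calls would be `model.fit(X_train, y_train)` followed by `model.score(X_test, y_test)`.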


### Second dataset

#### Label encoding of the emotions

In [13]:
text_emotion['label'].unique()

array(['empty', 'sadness', 'enthusiasm', 'neutral', 'worry', 'surprise',
       'love', 'fun', 'hate', 'happiness', 'boredom', 'relief', 'anger'],
      dtype=object)

In [14]:
# Negative emotions map to 0; positive emotions (and 'neutral') map to 1.
negative = ["empty", "sadness", "worry", "hate", "boredom", "anger"]
positive = ["enthusiasm", "neutral", "surprise", "love", "fun", "happiness", "relief"]
conditions = [text_emotion['label'].isin(negative), text_emotion['label'].isin(positive)]

In [15]:
values = [0, 1]

In [16]:
text_emotion['binary_emotion'] = np.select(conditions, values)

In [17]:
text_emotion

Unnamed: 0,tweet_id,label,text,binary_emotion
0,1956967341,empty,@tiffanylue i know i was listenin to bad habi...,0
1,1956967666,sadness,Layin n bed with a headache ughhhh...waitin o...,0
2,1956967696,sadness,Funeral ceremony...gloomy friday...,0
3,1956967789,enthusiasm,wants to hang out with friends SOON!,1
4,1956968416,neutral,@dannycastillo We want to trade with someone w...,1
...,...,...,...,...
39995,1753918954,neutral,@JohnLloydTaylor,1
39996,1753919001,love,Happy Mothers Day All my love,1
39997,1753919005,love,Happy Mother's Day to all the mommies out ther...,1
39998,1753919043,happiness,@niariley WASSUP BEAUTIFUL!!! FOLLOW ME!! PEE...,1


#### Choosing the variables

In [18]:
X = text_emotion['text']
y = text_emotion['binary_emotion']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#### Setting up the model

In [19]:
vectorizer = CountVectorizer(min_df=0, lowercase=False)
vectorizer.fit(text_emotion['text'])

CountVectorizer(lowercase=False, min_df=0)

In [20]:
text_train = vectorizer.transform(X_train)
text_test = vectorizer.transform(X_test)

In [21]:
classifier = LogisticRegression()
classifier.fit(text_train, y_train)
score = classifier.score(text_test, y_test)
print("Accuracy:", round(score, ndigits=4))

Accuracy: 0.7101
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
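
The accuracy printed above comes from a single random 80/20 split, so part of the gap between two runs or two datasets may be split-to-split noise. Cross-validation gives an idea of that spread. The following is only a sketch on a hypothetical toy corpus, with `cv=4` and `max_iter=1000` picked arbitrarily:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; 0 = negative, 1 = positive.
texts = [
    "i feel sad", "so sad and hopeless", "feeling hopeless", "sad again",
    "i feel happy", "so happy and loved", "feeling loved", "happy again",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

# With an integer cv and a classifier, scikit-learn uses stratified folds;
# the default score for a classifier is accuracy.
scores = cross_val_score(model, texts, labels, cv=4)
print(scores.mean(), scores.std())
```

Reporting the mean together with the standard deviation makes it easier to judge whether a difference between two datasets is larger than the variation between folds.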


The two datasets give clearly different results: an accuracy of 0.9555 on the first and 0.7101 on the second. Dataset size may play a part (21,456 rows for dataset 1 versus 40,000 for dataset 2). In addition, dataset 1 contains 6 distinct emotions whereas dataset 2 contains 13, roughly twice as many, so each binary class in dataset 2 groups a wider range of labels.
These differences could therefore have some influence on the final score. Both fits also stopped at the solver's iteration limit, so neither score comes from a fully converged model.
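
Before comparing the two accuracies, it can help to know what a trivial model would score on each dataset, since the balance between classes 0 and 1 may differ. The helper below is a hypothetical sketch (`majority_baseline` is not part of the notebook, and the label vectors are made up) that returns the accuracy obtained by always predicting the most frequent class:

```python
import pandas as pd

def majority_baseline(labels: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return float(labels.value_counts(normalize=True).max())

# Hypothetical label vectors; the real ones would be the binary_emotion columns.
balanced = pd.Series([0, 1] * 50)
skewed = pd.Series([0] * 30 + [1] * 70)

print(majority_baseline(balanced), majority_baseline(skewed))
```

Applied to `emotion_final['binary_emotion']` and `text_emotion['binary_emotion']`, this would show how far each reported accuracy sits above its own baseline.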