# Sentiment Analysis with Deep Learning using BERT

### Prerequisites

- Intermediate-level knowledge of Python 3 (NumPy and Pandas preferably, but not required)
- Exposure to PyTorch usage
- Basic understanding of Deep Learning and Language Models (BERT specifically)

### Project Outline

**Task 1**: Introduction (explain the difference between BERT/CamemBERT and TF-IDF)

**Task 2**: Exploratory Analysis and Data Preprocessing

**Task 3**: Training/Validation Split

**Task 4**: Loading the Tokenizer and Encoding Our Data

**Task 5**: Training a Model

**Task 6**: Document Classification with Multinomial Logistic Regression

**Task 7**: Evaluation on the Validation Set

**Task 8**: Testing Random Forest, SVM, XGBoost, LightGBM, and Stacking

## Task 1: Introduction (explain the difference between BERT/CamemBERT and TF-IDF)

### What is BERT

BERT is a large-scale transformer-based Language Model that can be finetuned for a variety of tasks.

For more information, the original paper can be found [here](https://arxiv.org/abs/1810.04805). 

[HuggingFace documentation](https://huggingface.co/transformers/model_doc/bert.html)

[Bert documentation](https://characters.fandom.com/wiki/Bert_(Sesame_Street)) ;)

<img src="BERT_diagrams.pdf" width="1000">

## Task 2: Exploratory Analysis and Data Preprocessing

We will use the SMILE Twitter dataset.

_Wang, Bo; Tsakalidis, Adam; Liakata, Maria; Zubiaga, Arkaitz; Procter, Rob; Jensen, Eric (2016): SMILE Twitter Emotion dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.3187909.v2_

In [1]:
import torch
import pandas as pd
from tqdm.notebook import tqdm
import numpy as np

In [2]:
df = pd.read_csv(
    'smile-annotations-final.csv',
    names=['id', 'text', 'category']
)
df.set_index('id', inplace=True)

In [3]:
df.head()

| id | text | category |
|---|---|---|
| 611857364396965889 | @aandraous @britishmuseum @AndrewsAntonio Merc... | nocode |
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | happy |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy |


In [4]:
df['category'].value_counts()

category
nocode               1572
happy                1137
not-relevant          214
angry                  57
surprise               35
sad                    32
happy|surprise         11
happy|sad               9
disgust|angry           7
disgust                 6
sad|disgust             2
sad|angry               2
sad|disgust|angry       1
Name: count, dtype: int64

In [5]:
df['text'].iloc[0]  # Look at the first tweet

'@aandraous @britishmuseum @AndrewsAntonio Merci pour le partage! @openwinemap'

In [6]:
# Drop every row whose category contains the "|" character
# (it marks tweets annotated with more than one emotion)
df = df[~df['category'].str.contains('|', regex=False)]

`.str` exposes string methods that are applied element-wise to a column.

Here, `.contains()` checks whether each value in the `category` column contains the `|` character, which marks tweets annotated with several emotions. Negating the resulting mask with `~` keeps only the rows without it.
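As a minimal sketch of this filter on made-up data (the `demo` frame is hypothetical and not part of the SMILE dataset), passing `regex=False` makes pandas treat `|` as a literal character rather than as the regex alternation operator:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
demo = pd.DataFrame({"category": ["happy", "happy|sad", "angry", "sad|disgust"]})

# True where the label contains a literal "|" (several emotions)
has_pipe = demo["category"].str.contains("|", regex=False)

# "~" negates the boolean mask, keeping only the single-label rows
single_label = demo[~has_pipe]
print(single_label["category"].tolist())
```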

In [7]:
# Drop the rows labelled nocode
df = df[df["category"] != 'nocode']

In [8]:
df['category'].value_counts()

category
happy           1137
not-relevant     214
angry             57
surprise          35
sad               32
disgust            6
Name: count, dtype: int64

In [9]:
possible_labels = df['category'].unique()

In [10]:
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

Here, `.unique()` returns the distinct values of a column.

A loop then assigns an integer label to each category.

An equivalent dict comprehension:

```python
label_dict = {val:idx for idx, val in enumerate(df.category.unique())}
```

In [11]:
label_dict

{'happy': 0,
 'not-relevant': 1,
 'angry': 2,
 'disgust': 3,
 'sad': 4,
 'surprise': 5}

In [12]:
df['label'] = df['category'].map(label_dict)
df.head(10)

| id | text | category | label |
|---|---|---|---|
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy | 0 |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy | 0 |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | happy | 0 |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy | 0 |
| 611570404268883969 | @NationalGallery @ThePoldarkian I have always ... | happy | 0 |
| 614499696015503361 | Lucky @FitzMuseum_UK! Good luck @MirandaStearn... | happy | 0 |
| 613601881441570816 | Yr 9 art students are off to the @britishmuseu... | happy | 0 |
| 613696526297210880 | @RAMMuseum Please vote for us as @sainsbury #s... | not-relevant | 1 |
| 610746718641102848 | #AskTheGallery Have you got plans to privatise... | not-relevant | 1 |
| 612648200588038144 | @BarbyWT @britishmuseum so beautiful | happy | 0 |


`.map()` looks up each value of a column in a dictionary and replaces it with the corresponding entry.
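A small sketch of this behaviour, using a made-up series and dictionary (both hypothetical): values that have no key in the dictionary come out as `NaN`, which is worth checking for after a mapping step.

```python
import pandas as pd

# Hypothetical labels and mapping, for illustration only
demo_labels = pd.Series(["happy", "sad", "happy", "unknown"])
demo_dict = {"happy": 0, "sad": 1}

# Each value is looked up in the dictionary; missing keys become NaN
mapped = demo_labels.map(demo_dict)
print(mapped)
print("unmapped values:", mapped.isna().sum())
```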

## Task 3: Training/Validation Split

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.15,
    random_state=17,  # For reproducible results
    stratify=df.label.values
)

The key argument here is `stratify`: it ensures that the class proportions are preserved in both subsets.
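The effect of `stratify` can be sketched on a hypothetical imbalanced label vector (80 samples of class 0 and 20 of class 1, chosen only for illustration). With a 25% hold-out, each class contributes a quarter of its samples to the validation part:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 of class 0, 20 of class 1
toy_labels = np.array([0] * 80 + [1] * 20)
toy_ids = np.arange(len(toy_labels))

ids_tr, ids_va, lab_tr, lab_va = train_test_split(
    toy_ids, toy_labels, test_size=0.25, random_state=17, stratify=toy_labels
)

# Class counts in each part keep the original 80/20 ratio
print("train:", np.bincount(lab_tr), "val:", np.bincount(lab_va))
```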

In [15]:
# Add a column recording which split (train or val) each row belongs to
df['data_type'] = ['not_set']*df.shape[0]

In [16]:
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

In [17]:
df.groupby(['category', 'label', 'data_type']).count()

| category | label | data_type | text |
|---|---|---|---|
| angry | 2 | train | 48 |
| angry | 2 | val | 9 |
| disgust | 3 | train | 5 |
| disgust | 3 | val | 1 |
| happy | 0 | train | 966 |
| happy | 0 | val | 171 |
| not-relevant | 1 | train | 182 |
| not-relevant | 1 | val | 32 |
| sad | 4 | train | 27 |
| sad | 4 | val | 5 |


This confirms that the class proportions are preserved across the two splits.
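One way to make this check explicit is to turn the counts into shares per split. The sketch below copies the counts shown in the `groupby` output above (only the classes listed there) and divides each column by its total:

```python
import pandas as pd

# Counts copied from the groupby output above (classes listed there only)
counts = pd.DataFrame(
    {"train": [48, 5, 966, 182, 27], "val": [9, 1, 171, 32, 5]},
    index=["angry", "disgust", "happy", "not-relevant", "sad"],
)

# Share of each class within each split; every column sums to 1
shares = counts / counts.sum()
print(shares.round(3))
```

If the split is stratified, the two columns should be nearly identical, up to rounding on the smallest classes.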

## Task 4: Loading the Tokenizer and Encoding Our Data

In [18]:
import torch

In [19]:
import transformers as ppb

camembert, tokenizer, weights = (
    ppb.CamembertModel, ppb.CamembertTokenizer, 'camembert-base')

In [20]:
# Load pretrained model/tokenizer
tokenizer = tokenizer.from_pretrained(weights)
model = camembert.from_pretrained(weights)


In [21]:
df_app = df[df['data_type'] == 'train']
df_test = df[df['data_type'] == 'val']

In [22]:
df_app.head()

| id | text | category | label | data_type |
|---|---|---|---|---|
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy | 0 | train |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy | 0 | train |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | happy | 0 | train |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy | 0 | train |
| 611570404268883969 | @NationalGallery @ThePoldarkian I have always ... | happy | 0 | train |


In [23]:
# see if there are length > 512
max_len_app = 0
for i, sent in enumerate(df_app['text']):
    # Tokenize the text and add the special tokens (`<s>` and `</s>` for CamemBERT).
    input_ids_app = tokenizer.encode(sent, add_special_tokens=True)
    if len(input_ids_app) > 512:
        print("annoying review at", i, "with length",
              len(input_ids_app))
    # Update the maximum sentence length.
    max_len_app = max(max_len_app, len(input_ids_app))

print('Max sentence length: ', max_len_app)

# see if there are length > 512
max_len_test = 0
for i, sent in enumerate(df_test['text']):
    # Tokenize the text and add the special tokens (`<s>` and `</s>` for CamemBERT).
    input_ids_test = tokenizer.encode(sent, add_special_tokens=True)
    if len(input_ids_test) > 512:
        print("annoying review at", i, "with length",
              len(input_ids_test))
    # Update the maximum sentence length.
    max_len_test = max(max_len_test, len(input_ids_test))

print('Max sentence length: ', max_len_test)

Max sentence length:  97
Max sentence length:  73


The code prints nothing but the maximum lengths, which means that every text is shorter than the 512-token limit.

We can therefore continue.

In [24]:
tokenized_app = df_app['text'].apply(
    (lambda x: tokenizer.encode(str(x), add_special_tokens=True)))
max_len_app = 0
for i in tokenized_app.values:
    if len(i) > max_len_app:
        max_len_app = len(i)

padded_app = np.array([i + [0]*(max_len_app-len(i))
                      for i in tokenized_app.values])
np.array(padded_app).shape

(1258, 97)

We then build a padded list of token ids for each text. As expected, `.shape` returns the number of texts and the length of the longest one, here 97.
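The padding step can be sketched as a small helper. The function name `pad_sequences` and the token ids below are hypothetical; `pad_id=0` mirrors the value used in this notebook, although the tokenizer's own padding id (available as `tokenizer.pad_token_id`) may differ from 0.

```python
import numpy as np

def pad_sequences(sequences, pad_id=0):
    """Right-pad lists of token ids to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return np.array([seq + [pad_id] * (max_len - len(seq)) for seq in sequences])

# Hypothetical token ids, not produced by a real tokenizer
toy_ids = [[5, 121, 6], [5, 87, 42, 19, 6], [5, 6]]
padded = pad_sequences(toy_ids)

print(padded)
print(padded.shape)  # (number of texts, length of the longest text)
```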

In [25]:
tokenized_test = df_test['text'].apply(
    (lambda x: tokenizer.encode(str(x), add_special_tokens=True)))
max_len_test = 0
for i in tokenized_test.values:
    if len(i) > max_len_test:
        max_len_test = len(i)

padded_test = np.array([i + [0]*(max_len_test-len(i))
                       for i in tokenized_test.values])
np.array(padded_test).shape

(223, 73)

The same is done for the test set (`df_test`, the rows tagged `val`).

In [26]:
attention_mask_app = np.where(padded_app != 0, 1, 0)
attention_mask_app.shape

(1258, 97)

In [27]:
attention_mask_test = np.where(padded_test != 0, 1, 0)
attention_mask_test.shape

(223, 73)

`np.where()` builds a matrix of 0s and 1s: 1 for real tokens and 0 for padding.
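On a tiny hypothetical padded matrix, the mask construction looks like this (the token ids are made up; 0 plays the role of the padding value, as in the cells above):

```python
import numpy as np

# Hypothetical padded token ids; 0 marks padding positions
toy_padded = np.array([[5, 121, 6, 0, 0],
                       [5, 87, 42, 19, 6]])

# 1 where there is a real token, 0 where the position is padding
toy_mask = np.where(toy_padded != 0, 1, 0)
print(toy_mask)
```

The mask tells the model which positions to ignore in its attention computation.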

In [28]:
# Finally, convert the token ids to tensors so they can be fed to the transformer.
# Only the last layer's output is kept for classification.

input_ids_app = torch.tensor(padded_app)
attention_mask_app = torch.tensor(attention_mask_app)

In [29]:
len(attention_mask_app)

1258

In [30]:
# Finally, convert the token ids to tensors so they can be fed to the transformer.
# Only the last layer's output is kept for classification.

input_ids_test = torch.tensor(padded_test)
attention_mask_test = torch.tensor(attention_mask_test)

In [31]:
len(attention_mask_test)

223

In [32]:
with torch.no_grad():
    last_hidden_states_app = model(
        input_ids_app, attention_mask=attention_mask_app)

In [33]:
with torch.no_grad():
    last_hidden_states_test = model(
        input_ids_test, attention_mask=attention_mask_test)

We then retrieve the encodings from the CamemBERT model.

## Task 5: Training a Model

In [34]:
features_valid = last_hidden_states_test[0][:, 0, :].numpy()
labels_valid = df_test.label
labels_valid

id
613359710343929857    1
611947559444172801    0
612264160311803905    0
611844583224438784    0
615216447787270144    0
                     ..
614815258092421120    0
612216252686299136    0
611554358812090368    0
613813229735804928    0
610829951890120704    0
Name: label, Length: 223, dtype: int64

In [35]:
features = last_hidden_states_app[0][:, 0, :].numpy()
labels = df_app.label
labels

id
614484565059596288    0
614746522043973632    0
614877582664835073    0
611932373039644672    0
611570404268883969    0
                     ..
611258135270060033    1
612214539468279808    0
613678555935973376    0
615246897670922240    0
613016084371914753    1
Name: label, Length: 1258, dtype: int64

In [36]:
train_features, test_features, train_labels, test_labels = train_test_split(
    features,
    labels,
    test_size=0.2,
    # random_state=39444,  # For reproducible results
    stratify=labels
)

This creates a new train/test split from the encodings and the labels.
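The features used in this split come from the slice `[:, 0, :]` applied to the last hidden states, which keeps the vector of the first token of every text. A sketch with a random NumPy array standing in for the model output (the shapes are hypothetical; a real output would have one row per text and a much larger hidden size):

```python
import numpy as np

# Stand-in for the model output: (n_texts, seq_len, hidden_size)
rng = np.random.default_rng(0)
toy_hidden = rng.normal(size=(4, 7, 16))

# Keep only position 0 of every sequence: one vector per text
toy_features = toy_hidden[:, 0, :]
print(toy_features.shape)  # (n_texts, hidden_size)
```

The result is a plain 2D feature matrix, which is why it can be passed directly to scikit-learn estimators.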

## Task 6: Document Classification with Multinomial Logistic Regression

In [37]:
from sklearn.linear_model import LogisticRegression
# Note: recent scikit-learn releases expect penalty=None rather than the string 'none'
model1 = LogisticRegression(random_state=0, multi_class='multinomial',
                            penalty='none', solver='newton-cg').fit(train_features, train_labels)
preds = model1.predict(test_features)

# Print the model's parameters (none were tuned; apart from the arguments passed above, everything is left at its default)
params = model1.get_params()
print(params)



{'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'multinomial', 'n_jobs': None, 'penalty': 'none', 'random_state': 0, 'solver': 'newton-cg', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}


In [38]:
# Model validation
from sklearn.metrics import accuracy_score
print('Accuracy: {:.2f}'.format(accuracy_score(test_labels, preds)))
print('Error rate: {:.2f}'.format(1 - accuracy_score(test_labels, preds)))

Accuracy: 0.84
Error rate: 0.16


## Task 7: Evaluation on the Validation Set

In [39]:
preds_valid = model1.predict(features_valid)

In [40]:
# Final predictions, mapped back to tag names
final_preds = pd.DataFrame(preds_valid)
final_preds = final_preds.rename(columns={0: 'preds_Tag'})

label_dict_inverse = {}
for index, possible_label in enumerate(possible_labels):
    label_dict_inverse[index] = possible_label

label_dict_inverse

{0: 'happy',
 1: 'not-relevant',
 2: 'angry',
 3: 'disgust',
 4: 'sad',
 5: 'surprise'}

We build the inverse of `label_dict` so that predicted values can be mapped back to label names.

Another way to do it:

```python
label_dict_inverse = {v:k for k,v in label_dict.items()}
```

In [41]:
final_preds['preds_Tag'] = final_preds['preds_Tag'].map(label_dict_inverse)

final_preds

|  | preds_Tag |
|---|---|
| 0 | happy |
| 1 | happy |
| 2 | happy |
| 3 | happy |
| 4 | not-relevant |
| ... | ... |
| 218 | happy |
| 219 | happy |
| 220 | happy |
| 221 | happy |


In [42]:
# Model validation
print('Accuracy: {:.2f}'.format(accuracy_score(labels_valid, preds_valid)))
print('Error rate: {:.2f}'.format(
    1 - accuracy_score(labels_valid, preds_valid)))

Accuracy: 0.77
Error rate: 0.23


In [43]:
# Create classification report
from sklearn.metrics import classification_report
class_report = classification_report(labels_valid, preds_valid)
print(class_report)

              precision    recall  f1-score   support

           0       0.87      0.89      0.88       171
           1       0.40      0.44      0.42        32
           2       0.80      0.44      0.57         9
           3       0.00      0.00      0.00         1
           4       0.00      0.00      0.00         5
           5       0.67      0.40      0.50         5

    accuracy                           0.77       223
   macro avg       0.46      0.36      0.40       223
weighted avg       0.77      0.77      0.77       223





In [44]:
# Calculated probabilities
df_results = pd.DataFrame(model1.predict_proba(
    features_valid), columns=model1.classes_)
valid_values = df_test[['text']]
valid_tags = df_test[['category']]
# valid_documents = df_test[['id']]
valid_values.index = pd.RangeIndex(len(valid_values.index))
valid_tags.index = pd.RangeIndex(len(valid_tags.index))
# valid_documents.index = pd.RangeIndex(len(valid_documents.index))
df_results.index = pd.RangeIndex(len(df_results.index))

In [45]:
frames = [valid_values, valid_tags, final_preds, df_results.round(decimals=6)]
result = pd.concat(frames, axis=1)

We build the `result` DataFrame with the predicted labels and the true labels. `df_results` holds the predicted probability of each class.
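The index resets before the concatenation matter because `pd.concat(..., axis=1)` aligns rows on their index. A sketch with two hypothetical frames (the names, ids, and values are made up) shows what happens with and without matching indexes:

```python
import pandas as pd

# Hypothetical frames: one indexed by tweet id, one with a default 0..n-1 index
texts = pd.DataFrame({"text": ["a", "b"]}, index=[611, 614])
probs = pd.DataFrame({"p_happy": [0.9, 0.2]})

# Different indexes: rows are not matched, the result is padded with NaN
misaligned = pd.concat([texts, probs], axis=1)

# Same 0..n-1 index on both sides: rows line up one to one
aligned = pd.concat([texts.reset_index(drop=True), probs], axis=1)

print(misaligned.shape, aligned.shape)
```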

In [46]:
class_report = classification_report(valid_tags, final_preds)
print(class_report)



              precision    recall  f1-score   support

       angry       0.80      0.44      0.57         9
     disgust       0.00      0.00      0.00         1
       happy       0.87      0.89      0.88       171
not-relevant       0.40      0.44      0.42        32
         sad       0.00      0.00      0.00         5
    surprise       0.67      0.40      0.50         5

    accuracy                           0.77       223
   macro avg       0.46      0.36      0.40       223
weighted avg       0.77      0.77      0.77       223





`classification_report()` returns the per-class classification metrics, which is particularly useful for multiclass problems.
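The two summary rows of the report can be recomputed by hand. The sketch below copies the rounded per-class F1 scores and supports from the report above: `macro avg` is the plain mean over classes, while `weighted avg` weights each class by its support. Because the inputs are already rounded, the results match the report only up to rounding.

```python
# Per-class F1 and support, copied from the report above
f1 = {"angry": 0.57, "disgust": 0.00, "happy": 0.88,
      "not-relevant": 0.42, "sad": 0.00, "surprise": 0.50}
support = {"angry": 9, "disgust": 1, "happy": 171,
           "not-relevant": 32, "sad": 5, "surprise": 5}

# Macro average: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(f"macro: {macro_f1:.3f}, weighted: {weighted_f1:.3f}")
```

The gap between the two averages reflects the class imbalance: the weighted score is dominated by the `happy` class.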

In [47]:
result = result.rename(columns=label_dict_inverse)
result

|  | text | category | preds_Tag | happy | not-relevant | angry | disgust | sad | surprise |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Over 100 people signed up for 'What's It Worth... | not-relevant | happy | 0.999246 | 0.000000 | 0.0 | 0.0 | 0.000754 | 0.0 |
| 1 | Wonderful experience, hearing Tim Knox’s #obje... | happy | happy | 1.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 |
| 2 | KETTLE'S YARD: ANTIMUSEUM - meet the Archivist... | happy | happy | 0.999946 | 0.000054 | 0.0 | 0.0 | 0.000000 | 0.0 |
| 3 | Plus excellent prizes from the @britishmuseum ... | happy | happy | 1.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 |
| 4 | Feliz cumpleaños,Rubens! Happy birthday,Rubens... | happy | not-relevant | 0.000467 | 0.999533 | 0.0 | 0.0 | 0.000000 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 218 | Bests museums apps @britishmuseum @uffizidotco... | happy | happy | 1.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 |
| 219 | Well done @britishmuseum - looking forward to ... | happy | happy | 1.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 |
| 220 | @Tate_StIves It was a amazing night , a pleasu... | happy | happy | 1.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 |
| 221 | Enjoyable afternoon @kettlesyard discussing re... | happy | happy | 1.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 |


## Task 8: Testing Random Forest, SVM, XGBoost, LightGBM, and Stacking

In [48]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier
model2 = RandomForestClassifier(random_state=0, n_jobs=-1).fit(
    train_features, train_labels)
preds2 = model2.predict(test_features)

class_report = classification_report(test_labels, preds2)
print(class_report)

              precision    recall  f1-score   support

           0       0.81      0.98      0.89       194
           1       0.81      0.36      0.50        36
           2       1.00      0.10      0.18        10
           3       0.00      0.00      0.00         1
           4       0.00      0.00      0.00         5
           5       0.00      0.00      0.00         6

    accuracy                           0.81       252
   macro avg       0.44      0.24      0.26       252
weighted avg       0.78      0.81      0.76       252





In [49]:
# SVM
from sklearn.svm import SVC
model3 = SVC(random_state=0).fit(train_features, train_labels)
preds3 = model3.predict(test_features)

class_report = classification_report(test_labels, preds3)
print(class_report)

              precision    recall  f1-score   support

           0       0.79      1.00      0.88       194
           1       1.00      0.19      0.33        36
           2       0.00      0.00      0.00        10
           3       0.00      0.00      0.00         1
           4       0.00      0.00      0.00         5
           5       0.00      0.00      0.00         6

    accuracy                           0.80       252
   macro avg       0.30      0.20      0.20       252
weighted avg       0.75      0.80      0.73       252





In [52]:
# Light GBM
import lightgbm as lgb
model4 = lgb.LGBMClassifier(random_state=0, n_jobs=-1, verbose=-1).fit(
    train_features, train_labels)
preds4 = model4.predict(test_features)

class_report = classification_report(test_labels, preds4)
print(class_report)

              precision    recall  f1-score   support

           0       0.83      1.00      0.91       194
           1       1.00      0.39      0.56        36
           2       1.00      0.50      0.67        10
           3       0.00      0.00      0.00         1
           4       0.00      0.00      0.00         5
           5       0.00      0.00      0.00         6

    accuracy                           0.85       252
   macro avg       0.47      0.31      0.36       252
weighted avg       0.82      0.85      0.81       252





In [53]:
# Stacking
from sklearn.ensemble import StackingClassifier
estimators = [
    ('rf', RandomForestClassifier(random_state=0, n_jobs=-1)),
    ('svc', SVC(random_state=0)),
    ('lgbm', lgb.LGBMClassifier(random_state=0, n_jobs=-1, verbose=-1))
]
clf = StackingClassifier(
    estimators=estimators, final_estimator=LogisticRegression(random_state=0)).fit(train_features, train_labels)
preds5 = clf.predict(test_features)

class_report = classification_report(test_labels, preds5)
print(class_report)



              precision    recall  f1-score   support

           0       0.84      0.97      0.90       194
           1       0.70      0.44      0.54        36
           2       1.00      0.50      0.67        10
           3       0.00      0.00      0.00         1
           4       0.00      0.00      0.00         5
           5       0.00      0.00      0.00         6

    accuracy                           0.83       252
   macro avg       0.42      0.32      0.35       252
weighted avg       0.79      0.83      0.80       252





↑ Here the best model is LightGBM, with an accuracy of 0.85.

More generally, the last three classes have too few examples to be predicted well, and the second and third classes have low recall.
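To put the accuracies above in context, a minimal sketch computes the accuracy of a trivial baseline that always predicts the most frequent class. The counts are copied from the `support` column of the Task 8 reports; the baseline itself is an illustration added here, not a model from this notebook.

```python
# Support per label on the held-out split, copied from the Task 8 reports
support = {0: 194, 1: 36, 2: 10, 3: 1, 4: 5, 5: 6}

# Label with the most samples, and the accuracy of always predicting it
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / sum(support.values())

print(f"Always predicting label {majority_label}: {baseline_accuracy:.2f}")
```

Under this reading, the models above gain only a few points of accuracy over the baseline, which is consistent with the weak recall observed on the minority classes.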