# Sentiment analysis using `camemBERT`

`camemBERT` is a version of `roBERTa` pre-trained on French-language data. The objective is to use pre-trained `camemBERT` to predict the polarity (positive or negative) of tweets. We focus only on model evaluation, since we do not have labelled data.
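
To make "predicting the polarity" concrete, here is a minimal sketch of how a two-class classifier's raw scores (logits) can be turned into a label. It uses only the standard library; the logits are made up, and the mapping of index 0 to "negative" and index 1 to "positive" is an assumption about the dataset's label encoding.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def predict_polarity(logits, labels=("negative", "positive")):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    return labels[best], probs[best]

# Made-up logits for a single tweet (assumed order: negative, positive)
label, prob = predict_polarity([-1.2, 2.3])
print(label, round(prob, 3))
```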

## Setup

In [1]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
DRIVE_PATH = "/content/drive/MyDrive/twitter-inflation-perception/"

import os
os.chdir(DRIVE_PATH+"notebooks/")

In [3]:
import sys 
sys.path.append("../")

In [4]:
from lib.sentiment.preprocessing import (
    load_tokenizer, 
    preprocess, 
    train_val_split
)
from lib.sentiment.model import load_model, backup_model 

from lib.sentiment.training import (
    train, 
    init_scheduler, 
    check_convergence
)
from lib.sentiment.validation import evaluate 

from lib.sentiment.utils import results_to_dict, get_avg_training_losses

In [5]:
import json

import time
import datetime

import numpy as np
import pandas as pd
import pickle as pkl

import matplotlib.pyplot as plt
from sklearn import metrics

In [6]:
import torch
from torch.utils.data import (
    TensorDataset, 
    random_split, 
    DataLoader, 
    RandomSampler, 
    SequentialSampler
)

In [7]:
# !pip install transformers==4.25.1

In [8]:
# !pip install sentencepiece

In [9]:
from transformers import AdamW

## Data

In [10]:
file_path = DRIVE_PATH + "backup/tweets/french_tweets.csv"
french_tweets = pd.read_csv(file_path)

In [11]:
french_tweets.head()

| Unnamed: 0 | label | text |
|---|---|---|
| 0 | 0 | - Awww, c'est un bummer. Tu devrais avoir davi... |
| 1 | 0 | Est contrarié qu'il ne puisse pas mettre à jou... |
| 2 | 0 | J'ai plongé plusieurs fois pour la balle. A ré... |
| 3 | 0 | Tout mon corps a des démangeaisons et comme si... |
| 4 | 0 | Non, il ne se comporte pas du tout. je suis en... |


In [12]:
n_tweets, _ = french_tweets.shape
print(f"{n_tweets} tweets in the dataset")

1526724 tweets in the dataset


In [13]:
french_tweets["label"].value_counts() / n_tweets

0    0.505398
1    0.494602
Name: label, dtype: float64

In [14]:
# Extract a 10% random sample to reduce computation time.
# Indices are drawn without replacement so that no tweet appears twice.

prop = .1
size = int(n_tweets * prop)
idxs = np.random.choice(n_tweets, size=size, replace=False).tolist()

tweets_sample = french_tweets.iloc[idxs, :]

print(len(tweets_sample))

152672
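
When drawing such a sample, it matters whether the indices are drawn with or without replacement. With replacement (as `np.random.randint` does), the same tweet can appear several times, and duplicates may later land on both sides of a train/validation split. Without replacement (for example `np.random.choice(..., replace=False)` or `DataFrame.sample`), every sampled row is distinct. The sketch below illustrates the difference with the standard library only; the sizes and the seed are arbitrary.

```python
import random

random.seed(0)  # arbitrary seed, only to make the sketch reproducible
n_rows, size = 1_000, 100

with_replacement = random.choices(range(n_rows), k=size)    # indices may repeat
without_replacement = random.sample(range(n_rows), k=size)  # indices are all distinct

n_unique_with = len(set(with_replacement))
n_unique_without = len(set(without_replacement))
print(n_unique_with, n_unique_without)
```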


In [15]:
tweets_sample["label"].value_counts() / len(tweets_sample)

0    0.505875
1    0.494125
Name: label, dtype: float64

In [16]:
tweets = tweets_sample["text"].values.tolist()
sentiments = tweets_sample["label"].values.tolist()

## Preprocessing

In [17]:
tokenizer = load_tokenizer()

In [18]:
type(tokenizer)

transformers.models.camembert.tokenization_camembert.CamembertTokenizer

In [19]:
tweets_train, tweets_validation, sentiments_train, sentiments_validation = train_val_split(tweets, sentiments, train_prop=.8)
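
The `train_val_split` helper comes from the project's `lib` package and is not shown here. As a rough sketch of what a split with `train_prop=.8` could look like, the hypothetical function below shuffles paired texts and labels and cuts them at 80%. The function name, the `seed` parameter, and the rounding rule are assumptions, not the project's implementation.

```python
import random

def split_train_validation(texts, labels, train_prop=0.8, seed=42):
    """Shuffle paired texts/labels and split them into train and validation parts."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)   # shuffle once, reproducibly
    cut = int(len(indices) * train_prop)   # the first `cut` indices go to training
    train_idx, val_idx = indices[:cut], indices[cut:]
    return (
        [texts[i] for i in train_idx],
        [texts[i] for i in val_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in val_idx],
    )

# Toy data: ten fake tweets with alternating labels
toy_texts = [f"tweet {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
x_train, x_val, y_train, y_val = split_train_validation(toy_texts, toy_labels)
print(len(x_train), len(x_val))
```

Shuffling indices rather than the lists themselves keeps each text aligned with its label.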

In [20]:
input_ids, attention_mask, sentiments_train = preprocess(tweets_train, tokenizer, sentiments=sentiments_train)

train_dataset = TensorDataset(
    input_ids,
    attention_mask,
    sentiments_train)



In [21]:
input_ids, attention_mask, sentiments_validation = preprocess(tweets_validation, tokenizer, sentiments=sentiments_validation)

validation_dataset = TensorDataset(
    input_ids,
    attention_mask,
    sentiments_validation)
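
The `preprocess` helper is not shown, but its return values suggest that it produces token ids and an attention mask of equal shape. The sketch below illustrates the underlying idea on made-up token ids: each sequence is truncated or right-padded to a fixed length, and the mask marks real tokens with 1 and padding with 0. The function name, `max_length`, and the `pad_id` default are illustrative assumptions; with a Hugging Face tokenizer this is usually done by the tokenizer itself.

```python
def pad_and_mask(token_id_lists, max_length, pad_id=1):
    """Truncate/pad each sequence to max_length and build the matching attention mask."""
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        ids = ids[:max_length]                    # truncate sequences that are too long
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)  # right-pad with the padding id
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

toy_ids = [[5, 21, 33, 6], [5, 8, 6]]  # made-up token ids
ids, mask = pad_and_mask(toy_ids, max_length=5)
print(ids)
print(mask)
```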

In [22]:
batch_size = 32

train_dataloader = DataLoader(
            train_dataset,
            sampler = RandomSampler(train_dataset),
            batch_size = batch_size)

validation_dataloader = DataLoader(
            validation_dataset,
            sampler = SequentialSampler(validation_dataset),
            batch_size = batch_size)
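
With the default `drop_last=False`, a `DataLoader` yields one batch per `batch_size` examples plus a final, smaller batch for any remainder. The short sketch below computes the resulting number of steps per epoch. The training-set size assumes the split rounds 80% of the 152,672 sampled tweets down, which may not match what the helper actually does.

```python
import math

def n_batches(n_examples, batch_size, drop_last=False):
    """Number of batches a loader yields per epoch."""
    if drop_last:
        return n_examples // batch_size        # the incomplete last batch is discarded
    return math.ceil(n_examples / batch_size)  # the incomplete last batch is kept

n_train = int(152_672 * 0.8)  # assumption: the split rounds down
steps_per_epoch = n_batches(n_train, batch_size=32)
print(n_train, steps_per_epoch)
```

The same count, multiplied by the number of epochs, gives the total number of optimizer steps that a learning-rate scheduler typically needs.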

## Model

### Load `camemBERT`

In [23]:
model = load_model()

# initialize a variable holding the device used for training ('cpu' or 'cuda')
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"{device=}")
model = model.to(device)

Some weights of the model checkpoint at camembert-base were not used when initializing CamembertForSequenceClassification: ['roberta.pooler.dense.bias', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at camembert-base and are newly initialized: ['classifier.dense.weight'

device=device(type='cuda', index=0)


In [24]:
n_params = sum(p.numel() for p in model.parameters())
print("{:,} parameters in camemBERT".format(n_params) )

110,623,490 parameters in camemBERT


### Training & validation

In [25]:
statistics = []

total_t0 = time.time()
num_epochs = 5

optimizer = AdamW(model.parameters(), lr = 2e-5, eps = 1e-8)
scheduler = init_scheduler(num_epochs, train_dataloader, optimizer)

model_path = "../backup/models/twitter-camembert.pt"

# counts consecutive epochs without improvement in the training loss (used for early stopping)
consecutive_epochs_with_no_improve = 0
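
The counter above is updated by `check_convergence`, which lives in the project's `lib` package and is not shown. The sketch below shows one plausible way such a patience counter can work: it resets when the current loss beats the best previous loss and increments otherwise. The function name, the `min_delta` parameter, and the losses are hypothetical.

```python
def update_patience(previous_losses, current_loss, no_improve_count, min_delta=0.0):
    """Return the updated count of consecutive epochs without improvement."""
    if not previous_losses or current_loss < min(previous_losses) - min_delta:
        return 0                 # new best loss: reset the counter
    return no_improve_count + 1  # no improvement: increment the counter

losses = [0.69, 0.65, 0.66, 0.67, 0.60]  # made-up per-epoch training losses
patience, count, history = 2, 0, []
stopped_at = None
for epoch, loss in enumerate(losses):
    count = update_patience(history, loss, count)
    history.append(loss)
    if count == patience:
        stopped_at = epoch  # stop after `patience` epochs without improvement
        break
print(stopped_at)
```

In this toy run the loop stops before reaching the last loss, which illustrates the usual trade-off of early stopping: a later improvement is never seen.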



In [27]:
for epoch in range(num_epochs):
    
    batch_losses, training_times = train(
        model, 
        train_dataloader, 
        device, 
        optimizer, 
        scheduler, 
        epoch, 
        num_epochs)

    if num_epochs > 3 and epoch > 1: 
        curr_loss =  np.mean(batch_losses)
        avg_train_losses = get_avg_training_losses(statistics)

        consecutive_epochs_with_no_improve = check_convergence(
            model, 
            model_path, 
            avg_train_losses, 
            curr_loss, 
            consecutive_epochs_with_no_improve)
        
        if consecutive_epochs_with_no_improve == 2:
            print("Stop training: the loss has not improved for 2 consecutive epochs!")
            break

    accuracy_scores = evaluate(model, validation_dataloader, device)
    statistics.append(results_to_dict(epoch, batch_losses, training_times, accuracy_scores))

Training Epoch [1/5]: 100%|██████████| 2/2 [00:01<00:00,  1.16it/s, loss_train=0.69, training_time=1.67e+9]
Validation in progress: 100%|██████████| 2/2 [00:00<00:00,  7.98it/s, balanced_accuracy_score=0.52]
Training Epoch [2/5]: 100%|██████████| 2/2 [00:01<00:00,  1.73it/s, loss_train=0.69, training_time=1.67e+9]
Validation in progress: 100%|██████████| 2/2 [00:00<00:00,  8.28it/s, balanced_accuracy_score=0.5]
Training Epoch [3/5]: 100%|██████████| 2/2 [00:01<00:00,  1.74it/s, loss_train=0.68, training_time=1.67e+9]


Model saved at ../backup/models/twitter-camembert.pt


Validation in progress: 100%|██████████| 2/2 [00:00<00:00,  8.37it/s, balanced_accuracy_score=0.5]
Training Epoch [4/5]: 100%|██████████| 2/2 [00:01<00:00,  1.72it/s, loss_train=0.68, training_time=1.67e+9]


Model saved at ../backup/models/twitter-camembert.pt


Validation in progress: 100%|██████████| 2/2 [00:00<00:00,  8.43it/s, balanced_accuracy_score=0.5]
Training Epoch [5/5]: 100%|██████████| 2/2 [00:01<00:00,  1.71it/s, loss_train=0.67, training_time=1.67e+9]


Model saved at ../backup/models/twitter-camembert.pt


Validation in progress: 100%|██████████| 2/2 [00:00<00:00,  8.93it/s, balanced_accuracy_score=0.5]
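
The progress bars report a `balanced_accuracy_score`. Balanced accuracy is the mean of the per-class recalls, so for two classes a value of 0.5 is what a classifier that always predicts the same class would obtain. The sketch below is a plain-Python version of that definition on made-up labels; it is not the project's `evaluate` function.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]  # positions of this class
        correct = sum(1 for i in idx if y_pred[i] == cls)    # correctly recovered ones
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

y_true = [0, 0, 0, 1, 1, 1]
always_zero = [0] * len(y_true)  # a degenerate classifier that only predicts class 0
score_constant = balanced_accuracy(y_true, always_zero)
score_perfect = balanced_accuracy(y_true, y_true)
print(score_constant, score_perfect)
```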


In [31]:
training_stats_path = "../backup/models/training-stats-camembert.json"

with open(training_stats_path, "w") as f:
    json.dump(statistics, f) 

# backup_model(model, model_path)
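
Once saved, the statistics can be reloaded for later inspection. The sketch below writes a small list of per-epoch records to a temporary file, reads it back, and picks the epoch with the lowest training loss. The record keys are hypothetical, since the real ones are defined by `results_to_dict`. Note that `json.dump` only accepts native Python types, so values such as NumPy arrays or `np.float32` scalars need to be converted first.

```python
import json
import os
import tempfile

# Hypothetical per-epoch records; the real keys come from `results_to_dict`.
toy_stats = [
    {"epoch": 0, "avg_train_loss": 0.69, "balanced_accuracy": 0.52},
    {"epoch": 1, "avg_train_loss": 0.68, "balanced_accuracy": 0.50},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "training-stats.json")
    with open(path, "w") as f:
        json.dump(toy_stats, f, indent=2)  # write the records as JSON
    with open(path) as f:
        reloaded = json.load(f)            # read them back as plain Python objects

best = min(reloaded, key=lambda record: record["avg_train_loss"])
print(best["epoch"])
```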

## Evaluation on unseen data