# Sentiment analysis using `camemBERT`

`camemBERT` is a version of `roBERTa` pre-trained on French-language data. The objective is to use the pre-trained `camemBERT` to predict the polarity (positive or negative) of tweets. Because we do not have labelled data, we only run the model in evaluation mode and inspect its predictions; no fine-tuning is performed.

## Setup

In [3]:
import torch 
from transformers import CamembertForSequenceClassification, CamembertTokenizer

In [4]:
import pickle as pkl

In [5]:
TOKENIZER = CamembertTokenizer.from_pretrained("camembert-base", do_lower_case=True)

In [6]:
path = "../backup/data/tweets_preprocessed.pkl"
with open(path, "rb") as f: 
    tweets_preprocessed = pkl.load(f)

## Functions

In [10]:
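from typing import List, Tuple

from torch import Tensor

# torch and CamembertForSequenceClassification are already imported in Setup.


def preprocess(tweets: List[str]) -> Tuple[Tensor, Tensor]:
    """Tokenize tweets and return (input_ids, attention_mask) as PyTorch tensors."""
    # padding=True pads to the longest tweet in the batch; it replaces
    # the deprecated pad_to_max_length argument.
    encoded_batch = TOKENIZER.batch_encode_plus(tweets,
                                                truncation=True,
                                                padding=True,
                                                return_attention_mask=True,
                                                return_tensors="pt")
    return encoded_batch["input_ids"], encoded_batch["attention_mask"]

def predict(tweets: List[str], model: CamembertForSequenceClassification):
    """Run the model in evaluation mode, without gradients, and return its raw outputs."""
    model.eval()
    with torch.no_grad():
        input_ids, attention_mask = preprocess(tweets)
        outputs = model(input_ids, attention_mask=attention_mask)
    return outputs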
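
`predict` encodes and scores every tweet it receives in a single forward pass, which may become memory-hungry if the full tweet list is passed at once. Below is a minimal sketch of one way to split a list into fixed-size batches beforehand. The names `iter_batches` and `batch_size` are hypothetical and not part of the notebook.

```python
from typing import Iterator, List


def iter_batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder strings stand in for real tweets.
example = ["tweet 1", "tweet 2", "tweet 3", "tweet 4", "tweet 5"]
batches = list(iter_batches(example, batch_size=2))
print(batches)
```

Each batch could then be passed to `predict(batch, model)` and the resulting logits concatenated, for example with `torch.cat`.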

## Model

In [43]:
model = CamembertForSequenceClassification.from_pretrained(
    "camembert-base",
    num_labels = 2)

Some weights of the model checkpoint at camembert-base were not used when initializing CamembertForSequenceClassification: ['lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at camembert-base and are newly initialized: ['classifier.dense.weight'

In [16]:
model.__dict__

{'training': False,
 '_parameters': OrderedDict(),
 '_buffers': OrderedDict(),
 '_non_persistent_buffers_set': set(),
 '_backward_hooks': OrderedDict(),
 '_is_full_backward_hook': None,
 '_forward_hooks': OrderedDict(),
 '_forward_pre_hooks': OrderedDict(),
 '_state_dict_hooks': OrderedDict(),
 '_load_state_dict_pre_hooks': OrderedDict(),
 '_load_state_dict_post_hooks': OrderedDict(),
 '_modules': OrderedDict([('roberta',
               CamembertModel(
                 (embeddings): CamembertEmbeddings(
                   (word_embeddings): Embedding(32005, 768, padding_idx=1)
                   (position_embeddings): Embedding(514, 768, padding_idx=1)
                   (token_type_embeddings): Embedding(1, 768)
                   (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                   (dropout): Dropout(p=0.1, inplace=False)
                 )
                 (encoder): CamembertEncoder(
                   (layer): ModuleList(
                     (0): 

In [9]:
type(model)

transformers.models.camembert.modeling_camembert.CamembertForSequenceClassification

## Sentiments

In [69]:
import numpy as np 

tweets = tweets_preprocessed["cleaned_emojis"]
# selected = np.random.choice(tweets, size=10).tolist()

selected = ["J'aime les pâtes", "Il fait pas beau ici", "Je l'aime bien mais il sent pas très bon"]

In [70]:
predictions = predict(selected, model)



In [65]:
from typing import List

from torch import Tensor
from scipy.special import softmax

LABELS = ["Negative", "Positive"]

def transform_logits(logits: Tensor) -> np.ndarray:
    """Transform logits to probabilities using softmax."""

    # Convert explicitly to a NumPy array before handing the values to SciPy.
    scores = softmax(logits.detach().cpu().numpy(), axis=1)
    return scores

def get_sentiment(scores: np.ndarray) -> List[int]:
    """Get the index of the class with the highest probability for each tweet."""

    return np.argmax(scores, axis=1).tolist()
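
To make the two helpers above concrete: for two classes, softmax maps a row of logits (z0, z1) to the probabilities exp(zk) / (exp(z0) + exp(z1)), and the predicted class is the index of the larger one. The sketch below reproduces this with the standard library only, on made-up logits (not model outputs). `softmax_row` and `label_for` are hypothetical helper names, and reading index 0 as "Negative" follows the order of the `LABELS` list.

```python
import math
from typing import List

LABELS = ["Negative", "Positive"]


def softmax_row(logits: List[float]) -> List[float]:
    """Softmax over one row of logits."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_for(scores: List[float]) -> str:
    """Return the label of the highest score (ties go to the first index, like np.argmax)."""
    return LABELS[scores.index(max(scores))]


# Made-up logits for illustration only.
made_up_logits = [[2.0, 0.0], [0.0, 0.0], [-1.0, 1.0]]
rows = [softmax_row(z) for z in made_up_logits]
for z, p in zip(made_up_logits, rows):
    print(z, [round(x, 3) for x in p], label_for(p))
```

Equal logits give exactly 0.5 for each class, so a row of scores close to 0.5 means the model barely separates the two classes for that tweet.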

In [71]:
logits = predictions.logits
scores = transform_logits(logits)
sentiments = get_sentiment(scores)

In [72]:
for i in range(len(selected)): 
    print(f"Tweet: {selected[i]}")
    print(f"Sentiment: {sentiments[i]}")
    print(f"Scores={scores[i]}")
    print("-"*100)

Tweet: J'aime les pâtes
Sentiment: 0
Scores=[0.51511437 0.48488557]
----------------------------------------------------------------------------------------------------
Tweet: Il fait pas beau ici
Sentiment: 0
Scores=[0.5043335  0.49566653]
----------------------------------------------------------------------------------------------------
Tweet: Je l'aime bien mais il sent pas très bon
Sentiment: 0
Scores=[0.5112664 0.4887336]
----------------------------------------------------------------------------------------------------
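
All three score pairs above lie within about 0.02 of 0.5. This is consistent with the loading message in the Model section, which reports that the classifier weights are newly initialized: without fine-tuning, the classification head has not learned to separate the two classes. Below is a minimal sketch of one way to flag such near-ties, using the scores printed above. The helpers `margin` and `is_uncertain` are hypothetical, and the 0.1 threshold is an arbitrary assumption.

```python
from typing import List, Sequence

# Scores copied from the printed output above.
printed_scores = [
    [0.51511437, 0.48488557],
    [0.5043335, 0.49566653],
    [0.5112664, 0.4887336],
]


def margin(scores: Sequence[float]) -> float:
    """Absolute gap between the two class probabilities."""
    return abs(scores[1] - scores[0])


def is_uncertain(scores: Sequence[float], threshold: float = 0.1) -> bool:
    """Flag a prediction whose margin falls below the (assumed) threshold."""
    return margin(scores) < threshold


margins: List[float] = [margin(row) for row in printed_scores]
flags: List[bool] = [is_uncertain(row) for row in printed_scores]
for row, m, flag in zip(printed_scores, margins, flags):
    print(f"scores={row} margin={m:.4f} uncertain={flag}")
```

With this threshold, every one of the three predictions is flagged as uncertain, so the predicted label 0 should not be read as a meaningful "Negative" verdict.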
