# Example solution for tweet sentiment analysis

This is a baseline example to help you with the third challenge. It was originally developed by our Ph.D. student Jonas Wacker.

In [32]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import torch
from tqdm.notebook import tqdm # progress bars

# plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid")

# general NLP preprocessing and basic tools
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# train/test split
from sklearn.model_selection import train_test_split
# basic machine learning models
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# our evaluation metric for sentiment classification
from sklearn.metrics import fbeta_score

In [33]:
# install HuggingFace's transformers library
! pip install transformers


In [34]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/eurecom-aml-2023-challenge-3/sample_submission.csv
/kaggle/input/eurecom-aml-2023-challenge-3/train.csv
/kaggle/input/eurecom-aml-2023-challenge-3/test.csv


## Loading the data

In [35]:
train_df = pd.read_csv('/kaggle/input/eurecom-aml-2023-challenge-3/train.csv')
test_df = pd.read_csv('/kaggle/input/eurecom-aml-2023-challenge-3/test.csv')

## Quick data inspection

In [36]:
len(train_df)+len(test_df)

27480

In [37]:
train_df.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,28ac06f416,good luck with your auction,good luck with your auction,positive
1,92098cf9a7,Hmm..You can`t judge a book by looking at its ...,Hmm..You can`t judge a book by looking at its ...,neutral
2,7858ff28f2,"Hello, yourself. Enjoy London. Watch out for ...",They`re mental.,negative
3,b0c9c67f32,We can`t even call you from belgium sucks,m suck,negative
4,7b36e9e7a5,not so good mood..,not so good mood..,negative


In [38]:
test_df.head()

Unnamed: 0,textID,text,selected_text
0,102f98e5e2,Happy Mother`s Day hahaha,Happy Mother`s Day
1,033b399113,"Sorry for the triple twitter post, was having ...","Sorry for the triple twitter post, was having ..."
2,c125e29be2,thats much better than the flu syndrome!,thats much better
3,b91e2b0679,Aww I have a tummy ache,tummy ache
4,1a46141274,hey chocolate chips is good. i want a snack ...,good.


## Data exploration

In [39]:
# stuff

## Data pre-processing

In [40]:
# we create a validation dataset from the training data
train_df, val_df = train_test_split(train_df, test_size=0.1, random_state=0)

We start off by converting the labels to numbers. This is a requirement for the submission, and numerical labels are generally more compatible with machine learning libraries.

In [41]:
target_conversion = {
    'neutral': 0,
    'positive': 1,
    'negative': 2
}

In [42]:
train_df['target'] = train_df['sentiment'].map(target_conversion)
val_df['target'] = val_df['sentiment'].map(target_conversion)

## Loading Tokenizer and Encoding our Data

In [43]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [44]:
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased',
    do_lower_case=True
)

### Encoding training and validation data

In [45]:
# padding='max_length' replaces the deprecated pad_to_max_length=True;
# truncation=True makes the cut-off at max_length explicit
encoded_data_train = tokenizer.batch_encode_plus(
    train_df.text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    val_df.text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)


input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(train_df.target.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(val_df.target.values)


In [46]:
dataset_train = TensorDataset(input_ids_train, 
                              attention_masks_train,
                              labels_train)

dataset_val = TensorDataset(input_ids_val,
                            attention_masks_val,
                            labels_val)

In [47]:
len(dataset_train)

22258

In [48]:
dataset_val.tensors

(tensor([[ 101, 3232, 2420,  ...,    0,    0,    0],
         [ 101, 2016, 2134,  ...,    0,    0,    0],
         [ 101, 1030, 1035,  ...,    0,    0,    0],
         ...,
         [ 101, 1035, 2857,  ...,    0,    0,    0],
         [ 101, 8840, 2140,  ...,    0,    0,    0],
         [ 101, 2821, 1998,  ...,    0,    0,    0]]),
 tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 tensor([0, 0, 0,  ..., 1, 0, 2]))

## Setting up BERT Pretrained Model

In [49]:
from transformers import BertForSequenceClassification

In [50]:
model = BertForSequenceClassification.from_pretrained(
                                      'bert-base-uncased', 
                                      num_labels = len(target_conversion),
                                      output_attentions = False,
                                      output_hidden_states = False
                                     )

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## Setting up RoBERTa Pretrained model

In [1]:
from transformers import RobertaForSequenceClassification



In [None]:
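# Note: this overwrites the BERT model defined above. RoBERTa uses a different vocabulary,
# so the inputs must then be encoded with the matching 'roberta-base' tokenizer instead of BertTokenizer.
model = RobertaForSequenceClassification.from_pretrained(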
                                      'roberta-base', 
                                      num_labels = len(target_conversion),
                                      output_attentions = False,
                                      output_hidden_states = False
                                     )

## Creating Data Loaders

In [51]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [52]:
batch_size = 4

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

dataloader_val = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),  # no shuffling needed for evaluation
    batch_size=32
)

## Setting Up Optimizer and Scheduler

In [53]:
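# AdamW is taken from PyTorch; the copy that shipped with transformers is deprecated
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

In [54]:
optimizer = AdamW(
    model.parameters(),
    lr = 1e-5,
    eps = 1e-8,
    weight_decay = 0.0  # keeps the default of the deprecated transformers AdamW
)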



In [55]:
epochs = 10

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps = len(dataloader_train)*epochs
)

## Defining our Performance Metrics

In [56]:
import numpy as np
from sklearn.metrics import f1_score

In [57]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average = 'macro')

In [58]:
def accuracy_per_class(preds, labels):
    # target_conversion is the label mapping defined above (label_dict was never defined)
    label_dict_inverse = {v: k for k, v in target_conversion.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy:{len(y_preds[y_preds==label])}/{len(y_true)}\n')

## Creating our Training Loop

In [59]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [60]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


In [61]:
def evaluate(dataloader_val):
    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

### Training loop

In [None]:
for epoch in tqdm(range(1, epochs+1)):
    model.train()
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc='Epoch {:1d}'.format(epoch), 
                        leave=False, 
                        disable=False)
    
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }
        
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total +=loss.item()
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
        # the loss is already averaged over the batch; len(batch) is the number of tensors (3), not the batch size
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
    
    #torch.save(model.state_dict(), f'Models/BERT_ft_Epoch{epoch}.model')
    
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (macro): {val_f1}')

  0%|          | 0/10 [00:00<?, ?it/s]

Epoch 1:   0%|          | 0/5565 [00:00<?, ?it/s]


Epoch 1
Training loss: 0.661934728919874


  0%|          | 0/78 [00:00<?, ?it/s]

Validation loss: 0.6093265222242246
F1 Score (macro): 0.7950058935126392


Epoch 2:   0%|          | 0/5565 [00:00<?, ?it/s]

## Evaluating our Model

In [None]:
accuracy_per_class(predictions, true_vals)

## Training a simple classifier

We are training a naive Bayes classifier on the Bag-of-Words features of the training data:

https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html

It is already built into the sklearn library:

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

Keep in mind that not only storing the features is challenging but also processing them. A simple SVM may be quite slow on such high-dimensional features. Naive Bayes works well with Bag-of-Words.
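
The cells below use `count_vect`, `X_train_counts` and `X_val_counts`, which are not defined anywhere in this notebook. The following is a minimal sketch of how such Bag-of-Words features could be built with sklearn's `CountVectorizer`. The toy texts and targets are made up for illustration and stand in for `train_df['text']`, `val_df['text']` and `train_df['target']`; the vectorizer settings are the sklearn defaults, which is an assumption about the original setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins (assumption) for train_df['text'], train_df['target'] and val_df['text']
train_texts = [
    "good luck with your auction",
    "not so good mood",
    "what a lovely day",
    "this sucks so much",
]
train_targets = [1, 2, 1, 2]  # positive=1, negative=2 as in target_conversion
val_texts = ["lovely auction", "bad mood"]

# Learn the vocabulary on the training texts only, then reuse it for validation
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_texts)  # sparse matrix: documents x vocabulary
X_val_counts = count_vect.transform(val_texts)          # words unseen in training are ignored

clf = MultinomialNB().fit(X_train_counts, train_targets)
val_predictions = clf.predict(X_val_counts)

print(X_train_counts.shape, X_val_counts.shape)
print(val_predictions)
```

In this notebook, the same two vectorizer calls applied to `train_df['text']` and `val_df['text']` would provide the variables that the following cells expect.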



In [None]:
%%time
clf = MultinomialNB().fit(X_train_counts, train_df['target'])

In [None]:
val_predictions_nb = clf.predict(X_val_counts)

In [None]:
accuracy = (val_predictions_nb == val_df['target'].values).mean()
print('The accuracy of our multinomial Naive Bayes classifier is: {:.2f}%'.format(accuracy*100))

In [None]:
fbeta = fbeta_score(val_df['target'].values, val_predictions_nb, average='macro', beta=1.0)
print('The fbeta score is:', fbeta)

In [None]:
# Creating a submission

X_train_counts = count_vect.fit_transform(list(train_df['text'].values) + list(val_df['text'].values))
X_test_counts = count_vect.transform(list(test_df['text'].values))

clf = MultinomialNB().fit(X_train_counts, np.hstack([train_df['target'].values, val_df['target'].values]))
test_predictions_nb = clf.predict(X_test_counts)

submission_df = pd.DataFrame()
submission_df['textID'] = test_df['textID']
submission_df['sentiment'] = test_predictions_nb
submission_df.to_csv('TA_baseline_NB.csv', index=False)

## How good is this score?

Early approaches in NLP used rule-based classifiers for sentiment analysis. A popular baseline is VADER which was published in 2014:

https://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/viewPaper/8109

VADER does not use any machine learning but is purely handcrafted by humans. It uses text preprocessing and lexica to determine the sentiment of a text.
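
To make the idea of a lexicon-based classifier concrete, here is a deliberately tiny sketch. It is not VADER: the word list, the valence values, the two-word negation window and the threshold are all made up for illustration, whereas VADER's real lexicon and rules are far richer.

```python
# Hypothetical mini-lexicon (assumption): word -> valence
TOY_LEXICON = {
    "good": 1.0, "happy": 1.5, "love": 2.0,
    "bad": -1.0, "sucks": -1.5, "ache": -1.0,
}
NEGATIONS = {"not", "no", "never"}
PUNCTUATION = ".,!?"


def toy_lexicon_score(text):
    """Sum word valences; flip the sign if a negation occurs in the two preceding tokens."""
    tokens = [tok.strip(PUNCTUATION) for tok in text.lower().split()]
    score = 0.0
    for i, word in enumerate(tokens):
        valence = TOY_LEXICON.get(word, 0.0)
        if any(prev in NEGATIONS for prev in tokens[max(0, i - 2):i]):
            valence = -valence
        score += valence
    return score


def toy_lexicon_predict(text, threshold=0.5):
    """Map the score to the notebook's labels: neutral=0, positive=1, negative=2."""
    score = toy_lexicon_score(text)
    if score >= threshold:
        return 1
    if score <= -threshold:
        return 2
    return 0


for doc in ["Happy Mother`s Day hahaha", "not so good mood..", "Aww I have a tummy ache"]:
    print(doc, "->", toy_lexicon_predict(doc))
```

No training data is involved at any point; the quality of such a classifier depends entirely on the handcrafted lexicon and rules.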

In [None]:
nltk.download('vader_lexicon')

In [None]:
sid = SentimentIntensityAnalyzer()

In [None]:
# We show a few prediction examples:
for doc in val_df['text'].iloc[:5].values:
    print(doc)
    print(sid.polarity_scores(doc))

In [None]:
def vader_predict(x):
    prediction = sid.polarity_scores(x)
    # labels must follow target_conversion: neutral=0, positive=1, negative=2
    prediction_list = [
        (1, prediction['pos']),
        (2, prediction['neg']),
        (0, prediction['neu'])
    ]
    label = sorted(prediction_list, key=lambda x: x[1], reverse=True)[0][0]
    return label

In [None]:
predictions_vader = val_df['text'].apply(vader_predict)

In [None]:
accuracy = (predictions_vader == val_df['target'].values).mean()
print('The accuracy of VADER is: {:.2f}%'.format(accuracy*100))

In [None]:
fbeta = fbeta_score(val_df['target'].values, predictions_vader, average='macro', beta=1.0)
print('The fbeta score is:', fbeta)

VADER performs worse! That is a good sign that our classifier learned useful generalizations from the training data (better than standard handcrafted rules).

## Where to go from here?

We can improve our Machine Learning pipeline on multiple aspects:

### Data analysis:
How is the data distributed? Can we analyze our data to find patterns associated with the classes? Which kinds of words are useful, which aren't?
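
A first look at these questions could be as simple as counting labels and words per class. The sketch below uses only the standard library and three rows copied from the `train_df.head()` output above; the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter, defaultdict

# Three (text, sentiment) rows taken from the train_df.head() output above
rows = [
    ("good luck with your auction", "positive"),
    ("We can`t even call you from belgium sucks", "negative"),
    ("not so good mood..", "negative"),
]

# How is the data distributed over the classes?
label_counts = Counter(sentiment for _, sentiment in rows)

# Which words occur in which class? (naive whitespace tokenization)
words_per_class = defaultdict(Counter)
for text, sentiment in rows:
    words_per_class[sentiment].update(text.lower().split())

for sentiment, count in label_counts.most_common():
    print(sentiment, count, words_per_class[sentiment].most_common(3))
```

On the full data, `train_df['sentiment'].value_counts()` gives the class distribution directly. Note that "good" appears in both classes here, which hints at why single words alone can be misleading.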

### Feature extraction:
Can we make our Bag-of-Words representation more compact or richer? There are many things you could try to implement. Here are some buzzwords: tokenization, stop-word removal, lemmatization, n-gram extraction, ...
A useful Python library for these tasks is NLTK (https://www.nltk.org/).
The sklearn CountVectorizer we used can be combined with NLTK preprocessing: https://scikit-learn.org/stable/modules/feature_extraction.html#customizing-the-vectorizer-classes
Is there also a dense (as opposed to sparse) representation of documents (tweets in our case)? Buzzwords: word2vec, GloVe
The state of the art is neural network language models, so-called Transformers. Pretrained models are available. If you feel comfortable with neural networks, fine-tuning and GPUs, have a look here: https://huggingface.co/transformers/

In general, we also recommend spaCy as a convenient Python library that covers most of the above features at once and may be a great resource to start with: https://spacy.io/
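
Some of the buzzwords above can be sketched in a few lines of plain Python. The stop-word list and the regular expression below are illustrative assumptions, not what NLTK or spaCy use internally.

```python
import re

# Tiny illustrative stop-word list (assumption); real lists are much longer
STOP_WORDS = {"a", "the", "is", "i", "with", "your", "so"}


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes/backticks."""
    return re.findall(r"[a-z0-9`']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that carry little information on their own."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def ngrams(tokens, n):
    """Return all contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("good luck with your auction")
content_tokens = remove_stop_words(tokens)
bigrams = ngrams(tokens, 2)

print(tokens)
print(content_tokens)
print(bigrams)
```

`CountVectorizer` exposes similar options through its `stop_words` and `ngram_range` parameters, so these steps do not have to be implemented by hand.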

### Model selection:
The model of choice highly depends on the previously extracted features. Depending on whether you obtain a sparse or dense feature representation, you have to choose an appropriate model!

### Model evaluation:
Make sure to select potential model hyperparameters using cross-validation or similar. Our evaluation metric of choice is the F1-score:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html#sklearn.metrics.fbeta_score

We choose beta=1 and average='macro'.
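
To see what this metric computes, here is a from-scratch sketch of the macro-averaged F1-score: the F1-score is computed for each class separately and the results are averaged with equal weight. The function name and the toy labels are made up for illustration; in practice, use the sklearn function linked above.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores (F-beta with beta=1)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 if the class never occurs
        f1_per_class.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_per_class) / len(f1_per_class)


# Toy example with the notebook's label encoding (neutral=0, positive=1, negative=2)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(macro_f1(y_true, y_pred))
```

Because every class contributes equally to the average, a classifier cannot reach a high score by predicting only the most frequent class.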

### Extension idea 1:
Apart from classifying the sentiment of tweets, we can also try to determine which words led the classifier to its decision. Ground-truth labels for these words are contained in our training data (the selected_text column). This evaluation will not take place on the Kaggle platform; you need to do it yourself. Use the Jaccard coefficient to evaluate the overlap between the selected words and the ground truth:

https://scikit-learn.org/stable/modules/model_evaluation.html#jaccard-similarity-coefficient-score
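
A word-level Jaccard coefficient between two strings can be sketched as follows. Lowercasing and whitespace tokenization are assumptions made here for simplicity, and the function name is hypothetical. The sklearn function linked above works on label arrays rather than raw strings, so a helper like this may be more convenient for comparing text spans.

```python
def jaccard_words(predicted, ground_truth):
    """Jaccard coefficient |A & B| / |A | B| over the sets of lowercased words."""
    a = set(predicted.lower().split())
    b = set(ground_truth.lower().split())
    if not a and not b:
        return 1.0  # two empty selections are treated as a perfect match
    return len(a & b) / len(a | b)


# Using the full tweet as a naive "prediction" for the selected words
print(jaccard_words("Happy Mother`s Day hahaha", "Happy Mother`s Day"))
print(jaccard_words("Aww I have a tummy ache", "tummy ache"))
print(jaccard_words("not so good mood..", "not so good mood.."))
```

Predicting the whole tweet as the selection is a natural baseline: it scores perfectly whenever `selected_text` equals `text` and poorly when only a few words were selected.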

In [None]:
# selected_text contains the words from text that support the label stored in sentiment
train_df[['text', 'selected_text', 'sentiment']].iloc[:5]

### Extension idea 2:

You may want to give Kaggle's brand-new Models feature a try!