<a href="https://colab.research.google.com/github/Dimi-G/Capstone_Project/blob/main/Beginners_guide_to_emotion_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project description
As part of our personalized diary assistant (link to follow soon), we need to identify emotions in text entries. We treat this as an NLP multiclass classification task. Our training dataset is [dair-ai/emotion](https://huggingface.co/datasets/dair-ai/emotion). We start with a naive K-Nearest Neighbors model and then apply transfer learning with the [RoBERTa](https://huggingface.co/docs/transformers/v4.41.3/en/model_doc/roberta#transformers.RobertaForSequenceClassification) model.

Special thanks to [bhadresh-savani](https://huggingface.co/bhadresh-savani/roberta-base-emotion), whose notebook was the main guide for this work, and to the many others who have shared their work and helped us better understand this fascinating topic.

## Imports

In [None]:
#setting the gpu as first choice if it is accessible
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
#mounting google drive for saving or loading the models
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#huggingface and pytorch relevant installations
! pip install -U sentence-transformers
! pip install -q datasets
! pip install -U accelerate
! pip install -U transformers


In [None]:
#installing joblib for saving the KNN model
!pip install joblib



In [None]:
import pandas as pd
import numpy as np
import joblib
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, random_split
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from transformers import pipeline
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer


## Naive Model

### Dataset import from Kaggle

The same Emotions dataset for NLP is available on [Kaggle](https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp). It can be downloaded and added to Google Drive, then read from the mounted drive.

In [None]:
train_df = pd.read_csv("drive/MyDrive/NLP_data/train.txt", delimiter=";", names=["text", "label"])
val_df = pd.read_csv("drive/MyDrive/NLP_data/val.txt", delimiter=";", names=["text", "label"])
test_df = pd.read_csv("drive/MyDrive/NLP_data/test.txt", delimiter=";", names=["text", "label"])

In [None]:
# splitting datasets in half to reduce size
RANDOM_SEED =42
train_ds = train_df.sample(frac=0.5, random_state= RANDOM_SEED)
val_ds = val_df.sample(frac=0.5, random_state= RANDOM_SEED)
test_ds = test_df.sample(frac=0.5, random_state= RANDOM_SEED)

### Basic Exploratory Data Analysis

Checking the distribution of the labels

In [None]:
train_ds['label'].value_counts()/train_ds.shape[0]

In [None]:
print(f"Training dataset: \n shape: {train_ds.shape} \n label counts:{train_ds['label'].value_counts()} \n label ratios: {train_ds['label'].value_counts()/train_ds.shape[0]}")
print(f"Test dataset: \n shape: {test_ds.shape} \n label counts:{test_ds['label'].value_counts()} \n label ratios: {test_ds['label'].value_counts()/test_ds.shape[0]}")
print(f"Validation dataset: \n shape: {val_ds.shape} \n label counts:{val_ds['label'].value_counts()} \n label ratios: {val_ds['label'].value_counts()/val_ds.shape[0]}")

Dataset is split. Data is imbalanced but we have the same ratios per split.

In [None]:
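#plotting libraries are not part of the imports cell, so they are imported here
import matplotlib.pyplot as plt
import seaborn as sns
train_ds.groupby('label').size().plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right',]].set_visible(False)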

### Creating embeddings

Text embeddings are vectors (lists) of floating point numbers, designed to capture the semantic meaning and context of the text they represent. Many models can produce embeddings from a given text. In this case we directly use a [RoBERTa embedding Transformer](https://huggingface.co/sentence-transformers/all-roberta-large-v1). A more traditional approach would combine preprocessing steps (normalization, stemming, lemmatization, stop-word removal, POS tagging) with count-based features such as CountVectorizer, TF-IDF or n-grams.
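
To make the idea of "semantic closeness" concrete, here is a minimal, dependency-free sketch of cosine similarity between embedding vectors. The three vectors are hypothetical 3-dimensional stand-ins chosen for illustration; they are not output of the embedding model used in this notebook.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings"; real sentence embeddings have many more dimensions
happy = [0.9, 0.1, 0.0]
joyful = [0.8, 0.2, 0.1]
furious = [0.0, 0.2, 0.9]

print(cosine_similarity(happy, joyful))   # close to 1: similar meaning
print(cosine_similarity(happy, furious))  # close to 0: unrelated meaning
```

Texts with similar meaning should end up with vectors pointing in similar directions, which is what lets a distance-based model such as KNN work on top of the embeddings.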

In [None]:
embedder = SentenceTransformer('sentence-transformers/all-roberta-large-v1')

Input format has to be a list

In [None]:
train_sentences = train_ds['text'].to_list()
test_sentences = test_ds['text'].to_list()
val_sentences = val_ds['text'].to_list()

In [None]:
train_embeddings = embedder.encode(train_sentences)
test_embeddings = embedder.encode(test_sentences)
val_embeddings = embedder.encode(val_sentences)

In [None]:
train_embeddings.shape

Training dataset embeddings have 8000 data points, each represented by a 1024-dimensional vector

### KNN model and Hyperparameter tuning

Initialize KNeighborsClassifier and fit on training data
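
For intuition, the following is a simplified sketch of how a K-Nearest Neighbors vote works. It is not scikit-learn's implementation: the function `knn_predict`, the 2-D points and the labels are all hypothetical, and it only supports Euclidean distance.

```python
import math
from collections import defaultdict

def knn_predict(query, points, labels, k=3, weights="uniform"):
    """Predict a label for `query` from its k nearest neighbours (Euclidean distance)."""
    distances = sorted(
        (math.dist(query, point), label) for point, label in zip(points, labels)
    )
    votes = defaultdict(float)
    for distance, label in distances[:k]:
        # "uniform": every neighbour counts once; "distance": closer neighbours count more
        votes[label] += 1.0 if weights == "uniform" else 1.0 / (distance + 1e-12)
    return max(votes, key=votes.get)

# Hypothetical 2-D points standing in for sentence embeddings
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
labels = ["joy", "sadness", "sadness", "anger"]
query = (0.1, 0.1)

print(knn_predict(query, points, labels, k=3, weights="uniform"))   # sadness
print(knn_predict(query, points, labels, k=3, weights="distance"))  # joy
```

With uniform weights the two more distant "sadness" points outvote the single very close "joy" point; with distance weights the close neighbour dominates. This is why the weighting scheme is worth tuning.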

In [None]:
model_baseline = KNeighborsClassifier()
model_baseline.fit(train_embeddings, train_ds['label'])

The scoring metric of choice is the F1-score, the harmonic mean of precision and recall. We use the macro average (`f1_macro`), which gives every class the same weight regardless of how often it occurs, so it is a suitable measure for classification problems on imbalanced datasets.
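
As a worked example of why macro F1 is more informative than accuracy on imbalanced data, the sketch below computes both by hand. The helper functions and the toy labels are hypothetical; in the notebook itself the score comes from scikit-learn.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1} computed from per-class precision and recall."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[label] = (
            2 * precision * recall / (precision + recall) if precision + recall else 0.0
        )
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)

# Hypothetical imbalanced example: the rare class is always missed
y_true = ["joy"] * 8 + ["surprise"] * 2
y_pred = ["joy"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                  # 0.8
print(macro_f1(y_true, y_pred))  # about 0.44
```

A classifier that ignores the rare class still reaches 80% accuracy here, while its macro F1 is roughly halved because the missed class contributes an F1 of zero.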


In [None]:
cv_scores = cross_val_score(estimator=model_baseline, X=train_embeddings, y=train_ds['label'], scoring="f1_macro", cv=3)

print(
    f"""
      Baseline model CV scores by fold: {cv_scores},
      Mean CV score {cv_scores.mean()}
"""
)

Running a Randomized Search CV on a set of hyperparameters for the KNN
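
To see how much of the search space a randomized search covers, the sketch below enumerates every combination of the hyperparameter values used in this section and then samples 20 of them. It uses only the standard library and its own random generator, so the combinations it picks will differ from the ones scikit-learn draws.

```python
import itertools
import random

# Same search space as the `params` dictionary in this section
params = {
    "n_neighbors": [3, 5, 7, 9, 11, 13, 15],
    "weights": ["distance", "uniform"],
    "metric": ["cosine", "euclidean"],
}

# Every possible combination (what an exhaustive grid search would try)
grid = [dict(zip(params, values)) for values in itertools.product(*params.values())]
print(len(grid))  # 28

# A randomized search evaluates only n_iter of them, drawn without replacement
rng = random.Random(0)
sampled = rng.sample(grid, k=20)
print(len(sampled))  # 20
```

With 7 x 2 x 2 = 28 combinations and `n_iter=20`, the search evaluates most of the grid, each combination with 3-fold cross-validation.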

In [None]:
params = {
    "n_neighbors": [3, 5, 7, 9, 11, 13, 15], # number of neighbors
    "weights": ["distance", "uniform"], # whether the votes from all neighbors should be counted equally or by distance to the prediction point
    "metric": ["cosine", "euclidean" ], # which metric to use for distance calculation
}

# define RandomizedSearchCV
clf = RandomizedSearchCV(
    estimator=KNeighborsClassifier(n_jobs=-1),
    param_distributions=params,
    n_iter=20,
    random_state=0,
    cv=3,
    scoring="f1_macro",
)

In [None]:
random_search = clf.fit(train_embeddings, train_ds['label'])

In [None]:
print(f"""
      Best parameters: {random_search.best_params_}
      F1-score: {random_search.best_score_}
""")

Calling in the model with the best hyperparameters

In [None]:
best_knn_model = random_search.best_estimator_
cv_scores_tuned = cross_val_score(estimator=best_knn_model, X=train_embeddings, y=train_ds['label'], scoring="f1_macro", cv=3)

print(
    f"""
      Best model CV scores by fold: {cv_scores_tuned},
      Mean CV scores: {cv_scores_tuned.mean()}
"""
)

KNN with hyperparameter tuning returns a marginally improved F1-score. We use the best estimator for prediction and print a classification report.

### Performance Evaluation

In [None]:
y_pred = best_knn_model.predict(test_embeddings)

In [None]:
#converting y test to 1d array to match the y_pred for the classification report input
y_test = test_ds['label'].values

In [None]:
print(classification_report(y_test, y_pred ))

### Saving KNN model

In [None]:
#saving the KNN best model
joblib_file = "knn_model.joblib"
joblib.dump(best_knn_model, joblib_file)

# Copy the model file to Google Drive
!cp knn_model.joblib /content/drive/MyDrive/NLP_data/knn_model.joblib

## Zero-shot classification

We test zero-shot classification performance on our labels using the [facebook/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) model through the `zero-shot-classification` pipeline. Zero-shot classification categorizes text without any training on our specific labels: the pipeline takes the text and the candidate labels as input and scores each label based on the model's pre-existing knowledge. It returns a score for every candidate label, so we keep only the highest-scoring one to obtain a single prediction per text.


In [None]:
zero_shot_classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

Bringing data in the necessary format for the pipeline and saving in a list only the prevailing category prediction
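
The pipeline returns one dictionary per input text, with the candidate labels sorted from highest to lowest score, so the prevailing category is the first entry of `labels`. The sketch below shows that selection on hypothetical results; the texts and scores are made up for illustration and are not real model output.

```python
# Hypothetical results shaped like the zero-shot pipeline output (scores are made up)
prediction = [
    {
        "sequence": "i feel great today",
        "labels": ["joy", "love", "surprise", "sadness", "fear", "anger"],
        "scores": [0.71, 0.15, 0.08, 0.03, 0.02, 0.01],
    },
    {
        "sequence": "i am scared of what comes next",
        "labels": ["fear", "sadness", "surprise", "anger", "joy", "love"],
        "scores": [0.64, 0.18, 0.09, 0.05, 0.03, 0.01],
    },
]

# Labels are sorted by descending score, so index 0 is the top prediction
top_labels = [result["labels"][0] for result in prediction]
top_scores = [result["scores"][0] for result in prediction]

print(top_labels)  # ['joy', 'fear']
print(top_scores)  # [0.71, 0.64]
```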

In [None]:
candidate_labels = list(train_ds['label'].unique())

In [None]:
sequences = test_ds['text'].to_list()

In [None]:
prediction = zero_shot_classifier(sequences, candidate_labels)

In [None]:
#Choosing the label with the highest prediction score
pred_list = [result['labels'][0] for result in prediction]

In [None]:
len(pred_list)

### Performance Evaluation

In [None]:
print(classification_report(y_test, pred_list ))

The BART zero-shot classification does not perform better than the baseline KNN model. The performance on the 'surprise' category in particular is so poor that it drags down the macro average.

## RoBERTa: A Robustly Optimized BERT Pretraining Approach

### Import dataset from Hugging Face
Using the same [emotion dataset](https://huggingface.co/datasets/dair-ai/emotion), but this time through the Hugging Face interface.
Labels are sadness (0), joy (1), love (2), anger (3), fear (4), surprise (5).

In [None]:
from datasets import load_dataset
emotions = load_dataset("dair-ai/emotion")

### Checking dataset format, tokenizing, downsizing

In [None]:
emotions

In [None]:
tokenizer = RobertaTokenizer.from_pretrained("FacebookAI/roberta-base")

In [None]:
# tokenize function
def tokenize(batch):
  return tokenizer(batch['text'], padding=True, truncation=True)

In [None]:
emotions_encoded = emotions.map(tokenize, batched =True, batch_size =None)

In [None]:
#making a smaller dataset. since for the baseline models we used 50% of each split, we will use the same size and random seed too
RANDOM_SEED = 42
small_train_ds = emotions_encoded['train'].shuffle(seed=RANDOM_SEED).select(range(8000))
small_val_ds =  emotions_encoded['validation'].shuffle(seed=RANDOM_SEED).select(range(1000))
small_test_ds = emotions_encoded['test'].shuffle(seed=RANDOM_SEED).select(range(1000))

Prepare the datasets for use with PyTorch models

In [None]:
small_train_ds.set_format("torch", columns=["input_ids", "attention_mask", "label"])
small_val_ds.set_format("torch", columns=["input_ids", "attention_mask", "label"])
small_test_ds.set_format("torch", columns=["input_ids", "attention_mask", "label"])

### Create Metrics

In [None]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}

## Train a RoBERTa custom classification head

We use the RobertaForSequenceClassification class, which is a transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output). The head outputs one logit per class. For multiclass classification the cost function is categorical cross-entropy, which applies a softmax to the logits and then computes the cross-entropy loss. The default optimizer of the Trainer is AdamW.
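
As a worked example of the cost function, the sketch below computes softmax probabilities and the cross-entropy loss by hand for a single example. The logits are hypothetical values, and the functions are a plain-Python illustration rather than the PyTorch implementation used during training.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[true_class])

# Hypothetical logits for the six emotion classes
logits = [0.2, 3.1, 0.5, -1.0, -0.3, 0.0]
probs = softmax(logits)

print(probs)
print(cross_entropy(logits, true_class=1))  # small loss: class 1 has the highest logit
print(cross_entropy(logits, true_class=3))  # large loss: class 3 has the lowest logit
```

The loss is small when the model puts most of its probability on the correct class and grows as that probability shrinks, which is the signal the optimizer uses to update the weights.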

In [None]:
model_path = "FacebookAI/roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

We perform class inheritance and customize the classifier head

In [None]:
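from transformers.modeling_outputs import SequenceClassifierOutput

# Subclass the concrete RoBERTa class: subclassing the Auto class would make
# from_pretrained return a plain RobertaForSequenceClassification without the custom head
class CustomModel(RobertaForSequenceClassification):
    def __init__(self, config):
        super().__init__(config)
        # replace the default head with a deeper feed-forward classifier
        self.classifier = nn.Sequential(
            nn.Linear(config.hidden_size, 526),
            nn.Dropout(0.1),
            nn.Linear(526, 258),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(258, config.num_labels)
        )

    def forward(self, input_ids=None, attention_mask=None, labels=None):
        outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs[0]
        # the hidden state of the first token (<s>) represents the whole sequence
        logits = self.classifier(sequence_output[:, 0, :])
        # the Trainer needs the loss to be returned whenever labels are provided
        loss = None
        if labels is not None:
            loss = nn.functional.cross_entropy(logits, labels)
        return SequenceClassifierOutput(loss=loss, logits=logits)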


Instantiate the custom model and define the training hyperparameters
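
The `logging_steps` value in the next cell is chosen so that the training loss is logged once per epoch. The arithmetic can be checked as follows, assuming a single device and no gradient accumulation (both are assumptions, not settings stated in this notebook).

```python
import math

train_size = 8000      # rows in small_train_ds
batch_size = 64
num_train_epochs = 8

# Value passed as logging_steps in the training arguments
logging_steps = train_size // batch_size
# Optimizer steps per epoch, assuming one device and no gradient accumulation
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(logging_steps, steps_per_epoch, total_steps)  # 125 125 1000
```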

In [None]:
cp_model = CustomModel.from_pretrained(model_path, num_labels=6)
cp_model.to(device)

batch_size = 64
logging_steps = len(small_train_ds) // batch_size

training_args = TrainingArguments(
    output_dir="results",
    num_train_epochs=8,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    learning_rate=2e-5,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    run_name="roberta_head_classification",
    disable_tqdm=False,
    logging_steps=logging_steps,
)

Initialize the trainer

In [None]:
trainer = Trainer(
    model=cp_model,
    args=training_args,
    train_dataset=small_train_ds,
    eval_dataset=small_val_ds,
    compute_metrics=compute_metrics,
)

Train the model

In [None]:
trainer.train()

Evaluate the model on the evaluation dataset

In [None]:
results = trainer.evaluate()

Use it for predictions in the test dataset

In [None]:
predictions = trainer.predict(small_test_ds)

In [None]:
predictions.metrics

`predictions.predictions` contains the raw model outputs, here one logit per class for each test example

In [None]:
predictions.predictions

The predicted label is the one with the maximum value along each row:
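
The row-wise maximum can be illustrated without NumPy. The logits below are hypothetical values, and the `argmax` helper is written here only for illustration; the id-to-name mapping follows the label list given for the Hugging Face dataset in this notebook.

```python
# Label ids as listed for the Hugging Face dataset in this notebook
id2label = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}

# Hypothetical logits for three test sentences (one row per sentence)
logits = [
    [0.1, 4.2, 1.0, -0.5, -1.1, 0.3],
    [3.8, -0.2, 0.0, 1.1, 0.9, -0.7],
    [-0.4, 0.2, 0.1, 0.6, 2.9, 1.5],
]

def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)

pred_ids = [argmax(row) for row in logits]
pred_labels = [id2label[i] for i in pred_ids]

print(pred_ids)     # [1, 0, 4]
print(pred_labels)  # ['joy', 'sadness', 'fear']
```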

In [None]:
y_preds = np.argmax(predictions.predictions, axis=1)

In [None]:
print(classification_report(small_test_ds['label'].numpy(),y_preds ))

Save the model

In [None]:
#saving the trained custom model (cp_model), not the untouched base model
cp_model.save_pretrained('./model/custom_head')
tokenizer.save_pretrained('./model/custom_head')

In [None]:
!cp -r './model/custom_head' /content/drive/MyDrive/NLP_data/model/custom_head

## Fine-tune RoBERTa on our dataset

The other approach is to simply fine-tune RobertaForSequenceClassification on our dataset, following the previous steps but without customizing the head.
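
The training arguments for this run set `warmup_steps=500`. Assuming the Trainer's default linear learning-rate schedule, a single device and no gradient accumulation (so 125 steps per epoch and 1250 steps over 10 epochs), the learning rate would evolve roughly as sketched below. The function `linear_schedule_lr` is a hypothetical helper written for illustration, not a library call.

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=1250):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining

# Learning rate at a few points during training
for step in (0, 250, 500, 875, 1250):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the warmup takes 500 of the 1250 steps, so the peak learning rate of 5e-5 is only reached at the end of the fourth epoch.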

In [None]:
model = RobertaForSequenceClassification.from_pretrained("FacebookAI/roberta-base", num_labels = 6).to(device)

In [None]:
batch_size = 64
logging_steps = len(small_train_ds) // batch_size

training_args = TrainingArguments(output_dir="results",
                                  num_train_epochs=10,
                                  learning_rate=5e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  load_best_model_at_end=True,
                                  warmup_steps=500,
                                  metric_for_best_model="f1",
                                  weight_decay=0.03,
                                  eval_strategy="epoch",
                                  save_strategy="epoch",
                                  run_name = "roberta_classification",
                                  disable_tqdm=False)

In [None]:
trainer = Trainer(model=model, args=training_args,
                  compute_metrics=compute_metrics,
                  train_dataset=small_train_ds,
                  eval_dataset=small_val_ds)

In [None]:
trainer.train()

In [None]:
results = trainer.evaluate()
results

In [None]:
predictions = trainer.predict(small_test_ds)

In [None]:
predictions.metrics

In [None]:
y_preds = np.argmax(predictions.predictions, axis=1)

In [None]:
print(classification_report(small_test_ds['label'].numpy(),y_preds ))

Save model

In [None]:
model.save_pretrained('./model')
tokenizer.save_pretrained('./model')

In [None]:
!cp -r './model' /content/drive/MyDrive/NLP_data/model/

### Load model
If needed, you can load the model from the Drive location where you saved it

In [None]:
!cp -r /content/drive/MyDrive/NLP_data/model/ './model'

In [None]:
model_path = "./model"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

### Push model to Hugging Face

Once it is uploaded, add the label mapping to config.json:
```
"id2label": {
  "0": "Sadness",
  "1": "Joy",
  "2": "Love",
  "3": "Anger",
  "4": "Fear",
  "5": "Surprise"
}
```
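
Instead of editing the file by hand, the mapping can also be generated programmatically. The sketch below builds it with the standard library only; the `config` dictionary is a hypothetical stand-in for the contents of config.json, and the reverse `label2id` mapping is an addition that is not part of the snippet above.

```python
import json

labels = ["Sadness", "Joy", "Love", "Anger", "Fear", "Surprise"]

# Hypothetical stand-in for the contents of config.json
config = {"model_type": "roberta"}

# JSON object keys are strings, so the ids are stored as "0", "1", ...
config["id2label"] = {str(i): name for i, name in enumerate(labels)}
# Reverse mapping from label name to id
config["label2id"] = {name: i for i, name in enumerate(labels)}

print(json.dumps(config, indent=2))
```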

In [None]:
!huggingface-cli login

In [None]:
!git config --global credential.helper store

In [None]:
!sudo apt-get install git-lfs

In [None]:
# use your git credentials
!git config --global user.email ""
!git config --global user.name ""
!git config --global user.password ""

In [None]:
model.push_to_hub("roberta-base-emotion")

In [None]:
tokenizer.push_to_hub("roberta-base-emotion")

### Pipeline use demo

In [None]:
classifier = pipeline(model="Dimi-G/roberta-base-emotion")

In [None]:
emotions=classifier('i feel very happy and excited since i learned so many things', top_k=None)

In [None]:
emotions