<a href="https://colab.research.google.com/github/Buggy1004/NLP-Text-Classification-using-BERT-Transformer/blob/main/RoBERTa_Emotion_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Check Hardware & RAM availability:

The following cells report the GPU assigned to this runtime and the amount of RAM it has.

In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Wed Jan 10 07:00:20 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
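
The `!nvidia-smi` line relies on notebook shell magic. Outside a notebook, roughly the same check can be sketched with the standard library. The helper name `gpu_report` is hypothetical, and the sketch only assumes that the `nvidia-smi` tool is on the PATH whenever an NVIDIA driver is installed.

```python
import shutil
import subprocess


def gpu_report():
    """Return the text printed by nvidia-smi, or None if it cannot be run."""
    exe = shutil.which("nvidia-smi")  # None when the tool is not installed
    if exe is None:
        return None
    try:
        result = subprocess.run([exe], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    # A non-zero exit code usually means no usable driver or device
    return result.stdout if result.returncode == 0 else None


report = gpu_report()
print(report if report is not None else "No usable NVIDIA GPU detected.")
```

Returning `None` instead of searching the output for the word "failed" avoids depending on the exact wording of the error message.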
                                                                    

In [2]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

Your runtime has 13.6 gigabytes of available RAM
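
Note that `virtual_memory().total` is the total RAM of the machine, not the amount that is currently free. A minimal sketch that reports both figures, assuming `psutil` is importable as in the cell above; `to_gb` is a hypothetical helper introduced for illustration.

```python
def to_gb(n_bytes):
    """Convert a byte count to decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_bytes / 1e9


try:
    from psutil import virtual_memory

    mem = virtual_memory()
    print("Total RAM:     {:.1f} GB".format(to_gb(mem.total)))
    print("Available RAM: {:.1f} GB".format(to_gb(mem.available)))
except ImportError:
    # psutil is a third-party package and may be missing outside Colab
    print("psutil is not installed; skipping the RAM report.")
```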



### Install Required Libraries for Transformer Models:

* Pre-trained Transformer models are provided by the Hugging Face **transformers** library.
* Similarly, any dataset hosted by Hugging Face can be loaded with the **datasets** library.
* Finally, we will use **ktrain**, a high-level wrapper package, to simplify modelling and prediction.
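
These libraries change quickly, so it can help to record which versions an environment actually has when reproducing results. A minimal sketch using only the standard library; `installed_version` is a hypothetical helper, and the package names are the three listed above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("ktrain", "transformers", "datasets"):
    print(name, installed_version(name) or "not installed")
```

Once a working combination is known, the versions can be pinned in the install commands (for example `ktrain==0.39.0`, the release downloaded in this run).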

In [3]:
!pip install ktrain
!pip install transformers
!pip install datasets

Collecting ktrain
  Downloading ktrain-0.39.0.tar.gz (25.3 MB)
  Preparing metadata (setup.py) ... done
Collecting langdetect (from ktrain)
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Collecting syntok>1.3.3 (from ktrain)
  Downloading syntok-1.4.4-py3-none-any.whl (24 kB)
Collecting tika (from ktrain)
  Downloading tika-2.6.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Collecting sentencepiece (from ktrain)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting k

### Import Libraries:

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import ktrain
from ktrain import text
import tensorflow as tf
from sklearn.model_selection import train_test_split
from datasets import load_dataset  # list_datasets is not used in this notebook
from sklearn.metrics import classification_report, confusion_matrix
import timeit
import warnings

pd.set_option('display.max_columns', None)
warnings.simplefilter(action="ignore")

### Load Emotion Dataset:

In [5]:
emotion_train = load_dataset('emotion', split='train')
emotion_val = load_dataset('emotion', split='validation')
emotion_test = load_dataset('emotion', split='test')
print("Details for Emotion Train Dataset: ", emotion_train.shape)
print("Details for Emotion Validation Dataset: ", emotion_val.shape)
print("Details for Emotion Test Dataset: ", emotion_test.shape)

Details for Emotion Train Dataset:  (16000, 2)
Details for Emotion Validation Dataset:  (2000, 2)
Details for Emotion Test Dataset:  (2000, 2)


In [6]:
print("\nTrain Dataset Features for Emotion: \n", emotion_train.features)
print("\nValidation Dataset Features for Emotion: \n", emotion_val.features)
print("\nTest Dataset Features for Emotion: \n", emotion_test.features)


Train Dataset Features for Emotion: 
 {'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}

Validation Dataset Features for Emotion: 
 {'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}

Test Dataset Features for Emotion: 
 {'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}


### Create DataFrame object:

For modelling, we convert the Hugging Face dataset objects into pandas DataFrames, which makes it easy to select the text and label columns and pass them on as plain lists.

In [7]:
emotion_train_df = pd.DataFrame(data=emotion_train)
emotion_val_df = pd.DataFrame(data=emotion_val)

In [8]:
class_label_names = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
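
The position of each name in this list must match the integer label id used by the dataset (0 is sadness, 1 is joy, and so on), as shown in the feature description printed earlier. A small sketch of using the list to turn integer ids into per-class counts; the helper `label_distribution` and the toy ids are illustrative only and are not taken from the dataset.

```python
from collections import Counter

class_label_names = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']


def label_distribution(label_ids, names):
    """Count how many examples fall into each class, keyed by class name."""
    counts = Counter(label_ids)
    return {name: counts.get(i, 0) for i, name in enumerate(names)}


toy_ids = [0, 1, 1, 3, 0, 1, 5]  # made-up label ids for illustration
print(label_distribution(toy_ids, class_label_names))
```

In this notebook, the same helper could be applied to `emotion_train_df["label"].to_list()` to check how imbalanced the classes are.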

### Instantiating a RoBERTa Instance:

Create a ktrain Transformer instance for RoBERTa by passing the model name, the maximum sequence length in tokens, the class label names, and the batch size.

In [9]:
roberta_transformer = text.Transformer('roberta-base', maxlen=512, classes=class_label_names, batch_size=6)

### Split Train & Validation data:

In [10]:
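# Note: the "test" variables below hold the validation split;
# the actual test split is only used later, after training.
X_train = emotion_train_df["text"]
y_train = emotion_train_df["label"]
X_test = emotion_val_df["text"]
y_test = emotion_val_df["label"]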
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(16000,) (16000,) (2000,) (2000,)


### Perform Data Preprocessing:

In [11]:
roberta_train = roberta_transformer.preprocess_train(X_train.to_list(), y_train.to_list())
roberta_val = roberta_transformer.preprocess_test(X_test.to_list(), y_test.to_list())

preprocessing train...
language: en
train sequence lengths:
	mean : 19
	95percentile : 41
	99percentile : 52


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 19
	95percentile : 40
	99percentile : 52
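
The reported lengths (a 99th percentile of 52) are far below the `maxlen=512` set earlier, so a much smaller `maxlen` would likely still cover nearly all texts and may reduce memory use and training time. A rough sketch for estimating such a cutoff, assuming whitespace-separated word counts as a proxy (a subword tokenizer usually yields more tokens than words, so some headroom is advisable); `length_percentile` is a hypothetical helper and the sample sentences are made up.

```python
import math


def length_percentile(texts, pct):
    """Nearest-rank percentile of whitespace-token counts over the texts."""
    lengths = sorted(len(t.split()) for t in texts)
    rank = max(1, math.ceil(pct / 100 * len(lengths)))  # 1-based rank
    return lengths[rank - 1]


sample = [
    "i feel fine",
    "i am so very happy today",
    "sad",
    "what a lovely surprise this is",
]
print(length_percentile(sample, 50), length_percentile(sample, 99))
```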


### Compile RoBERTa in a ktrain Learner Object:

Since we are using ktrain as a high-level abstraction package, we need to wrap our model in a ktrain Learner object for training and evaluation.

In [12]:
roberta_model = roberta_transformer.get_classifier()

In [13]:
roberta_learner_ins = ktrain.get_learner(model=roberta_model,
                            train_data=roberta_train,
                            val_data=roberta_val,
                            batch_size=6)
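
With 16,000 training examples and a batch size of 6, each epoch consists of a few thousand optimizer steps. The arithmetic is sketched below; it assumes the final partial batch is kept rather than dropped.

```python
import math

num_train_examples = 16000  # size of the train split loaded above
batch_size = 6              # value passed to get_learner
epochs = 3                  # number of epochs used for fine-tuning in this notebook

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```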

### RoBERTa Model Details:

In [14]:
roberta_learner_ins.model.summary()

Model: "tf_roberta_for_sequence_classification_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 roberta (TFRobertaMainLaye  multiple                  124055040 
 r)                                                              
                                                                 
 classifier (TFRobertaClass  multiple                  595206    
 ificationHead)                                                  
                                                                 
Total params: 124650246 (475.50 MB)
Trainable params: 124650246 (475.50 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
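
The figures in the summary can be checked by hand. The sketch below assumes the standard RoBERTa classification head (a 768x768 dense layer followed by a 768x6 output projection, both with biases) and 4-byte float32 parameters.

```python
hidden = 768      # hidden size of roberta-base
num_classes = 6   # emotion labels

dense = hidden * hidden + hidden               # dense layer weights + biases
out_proj = hidden * num_classes + num_classes  # output projection weights + biases
head_params = dense + out_proj

encoder_params = 124055040  # "roberta" row of the summary above
total_params = encoder_params + head_params
size_mib = total_params * 4 / 2**20  # float32 = 4 bytes per parameter

print(head_params, total_params, round(size_mib, 2))
```

The head comes to 595,206 parameters and the total to 124,650,246, matching the summary; the reported 475.50 MB corresponds to dividing the byte count by 2^20.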


### Find Optimal Learning Rate for RoBERTa:

This optional step shows how ktrain's learning-rate finder can be used to estimate a learning rate for any transformer model.
For Transformer models, suitable learning rates have already been established in the original research papers, so this step can be skipped.

In [None]:
rate_finder_start_time = timeit.default_timer()
roberta_learner_ins.lr_find(show_plot=True, max_epochs=3)
rate_finder_stop_time = timeit.default_timer()

print("\nTotal time in minutes on estimating optimal learning rate: \n", (rate_finder_stop_time - rate_finder_start_time)/60)

simulating training for different learning rates... this may take a few moments...
Epoch 1/3
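
A learning-rate finder typically trains briefly while raising the learning rate step by step and records the loss at each step; a reasonable rate lies where the loss is still falling steeply. A generic sketch of such a geometric sweep follows. The function name, the start and end rates, and the step count are illustrative assumptions and are not taken from ktrain's implementation.

```python
def lr_sweep(start_lr, end_lr, num_steps):
    """Return num_steps learning rates spaced geometrically from start to end."""
    ratio = (end_lr / start_lr) ** (1 / (num_steps - 1))
    return [start_lr * ratio ** i for i in range(num_steps)]


rates = lr_sweep(1e-7, 1e-1, 7)
print(["{:.0e}".format(r) for r in rates])
```

Geometric spacing is used because useful learning rates span several orders of magnitude, so equal ratios between steps cover the range more evenly than equal differences.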

### RoBERTa Optimal Learning Rates:

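As per the evaluations reported in the research paper "**RoBERTa: A Robustly Optimized BERT Pretraining Approach**", the following are good choices for fine-tuning the model:

* Batch Sizes => {16, 32}
* Learning Rates => {1e−5, 2e−5, 3e−5}

We will use the largest of these learning rates (3e-5) for fine-tuning and evaluation. Note that the batch size in this notebook is 6, as set when the Transformer instance was created, rather than one of the values listed above.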

### Fine Tuning RoBERTa on Emotion Dataset:

We take the emotion dataset and the RoBERTa model, set the learning rate and the number of epochs, and start fine-tuning.

In [None]:
roberta_fine_tune_start_time = timeit.default_timer()
roberta_learner_ins.fit_onecycle(lr=3e-5, epochs=3)
roberta_fine_tune_stop_time = timeit.default_timer()

print("\nTotal time in minutes for Fine-Tuning RoBERTa on Emotion Dataset: \n", (roberta_fine_tune_stop_time - roberta_fine_tune_start_time)/60)
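
`fit_onecycle` trains with a one-cycle learning-rate policy, in which the rate rises to the chosen maximum and then falls again. A generic triangular sketch of the idea follows; the starting ratio of one tenth, the step count, and the function name are assumptions for illustration, not ktrain's exact schedule.

```python
def one_cycle_lr(step, total_steps, max_lr, start_ratio=0.1):
    """Triangular schedule: linear warm-up to max_lr, then linear decay."""
    min_lr = max_lr * start_ratio
    half = total_steps / 2
    if step <= half:
        frac = step / half                  # 0 -> 1 during warm-up
    else:
        frac = (total_steps - step) / half  # 1 -> 0 during decay
    return min_lr + (max_lr - min_lr) * frac


total = 8000  # hypothetical number of training steps
for s in (0, 2000, 4000, 6000, 8000):
    print(s, one_cycle_lr(s, total, max_lr=3e-5))
```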

### Checking RoBERTa performance metrics:

In [None]:
roberta_learner_ins.validate()

              precision    recall  f1-score   support

           0       0.98      0.97      0.97       550
           1       0.95      0.98      0.96       704
           2       0.95      0.81      0.87       178
           3       0.93      0.93      0.93       275
           4       0.86      0.95      0.90       212
           5       0.90      0.81      0.86        81

    accuracy                           0.94      2000
   macro avg       0.93      0.91      0.92      2000
weighted avg       0.94      0.94      0.94      2000



array([[531,   1,   1,   9,   8,   0],
       [  0, 688,   7,   1,   1,   7],
       [  0,  34, 144,   0,   0,   0],
       [  7,   4,   0, 255,   9,   0],
       [  3,   0,   0,   8, 201,   0],
       [  0,   1,   0,   0,  14,  66]])

In [None]:
roberta_learner_ins.validate(class_names=class_label_names)

              precision    recall  f1-score   support

     sadness       0.98      0.97      0.97       550
         joy       0.95      0.98      0.96       704
        love       0.95      0.81      0.87       178
       anger       0.93      0.93      0.93       275
        fear       0.86      0.95      0.90       212
    surprise       0.90      0.81      0.86        81

    accuracy                           0.94      2000
   macro avg       0.93      0.91      0.92      2000
weighted avg       0.94      0.94      0.94      2000



array([[531,   1,   1,   9,   8,   0],
       [  0, 688,   7,   1,   1,   7],
       [  0,  34, 144,   0,   0,   0],
       [  7,   4,   0, 255,   9,   0],
       [  3,   0,   0,   8, 201,   0],
       [  0,   1,   0,   0,  14,  66]])
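
Each figure in the report can be recomputed from the confusion matrix, whose rows are true classes and whose columns are predicted classes. A short sketch using the matrix printed above; the helper name is illustrative.

```python
class_label_names = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
cm = [
    [531,   1,   1,   9,   8,   0],
    [  0, 688,   7,   1,   1,   7],
    [  0,  34, 144,   0,   0,   0],
    [  7,   4,   0, 255,   9,   0],
    [  3,   0,   0,   8, 201,   0],
    [  0,   1,   0,   0,  14,  66],
]


def per_class_metrics(matrix, names):
    """Return {name: (precision, recall)} from a square confusion matrix."""
    metrics = {}
    for k, name in enumerate(names):
        true_pos = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column sum
        actual = sum(matrix[k])                    # row sum (support)
        metrics[name] = (true_pos / predicted, true_pos / actual)
    return metrics


accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
for name, (p, r) in per_class_metrics(cm, class_label_names).items():
    print("{:>8}  precision={:.2f}  recall={:.2f}".format(name, p, r))
print("accuracy={:.2f}".format(accuracy))
```

For example, love has a recall of 144/178 (about 0.81) because 34 love examples were predicted as joy.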

In [None]:
roberta_learner_ins.view_top_losses(preproc=roberta_transformer)

----------
id:1870 | loss:5.89 | true:joy | pred:love)

----------
id:1124 | loss:5.27 | true:anger | pred:sadness)

----------
id:415 | loss:4.64 | true:love | pred:joy)

----------
id:1836 | loss:4.47 | true:fear | pred:anger)
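
Assuming the reported loss is the per-example categorical cross-entropy, that is, the negative natural logarithm of the probability assigned to the true class, a loss value can be converted back into that probability. The helper name below is hypothetical.

```python
import math


def true_class_probability(loss):
    """Invert cross-entropy: loss = -ln(p_true)  =>  p_true = exp(-loss)."""
    return math.exp(-loss)


for loss in (5.89, 5.27, 4.64, 4.47):  # losses listed above
    print(loss, "{:.4f}".format(true_class_probability(loss)))
```

Under that assumption, a loss of 5.89 corresponds to a probability of under 0.3% for the true label, meaning the model was confidently wrong on that example.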



### Saving RoBERTa Model:

In [None]:
roberta_predictor = ktrain.get_predictor(roberta_learner_ins.model, preproc=roberta_transformer)
roberta_predictor.get_classes()

['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

In [None]:
roberta_predictor.save('/content/roberta-emotion-predictor')

In [None]:
!zip -r /content/roberta-emotion-predictor /content/roberta-emotion-predictor

  adding: content/roberta-emotion-predictor/ (stored 0%)
  adding: content/roberta-emotion-predictor/merges.txt (deflated 53%)
  adding: content/roberta-emotion-predictor/tokenizer.json (deflated 72%)
  adding: content/roberta-emotion-predictor/special_tokens_map.json (deflated 52%)
  adding: content/roberta-emotion-predictor/config.json (deflated 54%)
  adding: content/roberta-emotion-predictor/tf_model.preproc (deflated 47%)
  adding: content/roberta-emotion-predictor/tokenizer_config.json (deflated 76%)
  adding: content/roberta-emotion-predictor/tf_model.h5 (deflated 14%)
  adding: content/roberta-emotion-predictor/vocab.json (deflated 59%)
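
The same kind of archive can be produced from Python instead of the shell, which also works where the `zip` command is unavailable. A minimal sketch using the standard library; it runs against a throwaway temporary directory, and the single file written into it is only a stand-in for the saved predictor files.

```python
import os
import shutil
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
model_dir = os.path.join(workdir, "roberta-emotion-predictor")
os.makedirs(model_dir)
with open(os.path.join(model_dir, "config.json"), "w") as fh:
    fh.write("{}")  # placeholder content, not a real model config

# Creates <workdir>/roberta-emotion-predictor.zip containing the folder.
archive_path = shutil.make_archive(
    base_name=model_dir,  # output path without the .zip extension
    format="zip",
    root_dir=workdir,     # paths inside the archive are relative to this
    base_dir="roberta-emotion-predictor",
)

with zipfile.ZipFile(archive_path) as zf:
    print(zf.namelist())
```

In this notebook the equivalent call would be `shutil.make_archive('/content/roberta-emotion-predictor', 'zip', '/content', 'roberta-emotion-predictor')`.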


### Loading Saved Model for New Predictions:

In [None]:
roberta_predictor_new = ktrain.load_predictor('/content/roberta-emotion-predictor')
roberta_predictor_new.get_classes()

['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

### Load Test data split:

In [None]:
emotion_test_df = pd.DataFrame(data=emotion_test)
print("\nShape of Test Dataset: ", emotion_test_df.shape,"\n\n")
emotion_test_df.head()


Shape of Test Dataset:  (2000, 2) 




|   | text | label |
|---|------|-------|
| 0 | im feeling rather rotten so im not very ambiti... | 0 |
| 1 | im updating my blog because i feel shitty | 0 |
| 2 | i never make her separate from me because i do... | 0 |
| 3 | i left with my bouquet of red and yellow tulip... | 1 |
| 4 | i was feeling a little vain when i did this one | 0 |


In [None]:
emotion_test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    2000 non-null   object
 1   label   2000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 31.4+ KB


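### Modify DataFrame for label mismatch:

The test split stores labels as integer ids, whereas the predictor returns class names, so we map the ids to their names before comparing the two.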

In [None]:
label_dict = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}
emotion_test_df["label"] = emotion_test_df["label"].map(label_dict)
emotion_test_df.head()

|   | text | label |
|---|------|-------|
| 0 | im feeling rather rotten so im not very ambiti... | sadness |
| 1 | im updating my blog because i feel shitty | sadness |
| 2 | i never make her separate from me because i do... | sadness |
| 3 | i left with my bouquet of red and yellow tulip... | joy |
| 4 | i was feeling a little vain when i did this one | sadness |
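
Rather than typing the mapping by hand, it can be derived from the class-name list so that the two cannot drift apart. A small sketch; `name_to_id` is an extra, hypothetical helper mapping that the notebook itself does not use.

```python
class_label_names = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

# enumerate pairs each name with its position, which is its integer label id
label_dict = dict(enumerate(class_label_names))

# reverse lookup, useful for turning predicted names back into ids
name_to_id = {name: idx for idx, name in label_dict.items()}

print(label_dict)
print(name_to_id["fear"])
```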


In [None]:
emotion_test_df[emotion_test_df.columns] = emotion_test_df[emotion_test_df.columns].astype(str)

### Use test data as new prediction data:

In [None]:
X_test_new = emotion_test_df["text"]
y_test_new = emotion_test_df["label"]
print(X_test_new.shape, y_test_new.shape)

(2000,) (2000,)


In [None]:
test_predictions = roberta_predictor_new.predict(X_test_new.to_list())

### View Performance Metrics on new test data:

In [None]:
print(confusion_matrix(y_test_new, test_predictions))

[[253  10   3   0   9   0]
 [  4 211   0   0   7   2]
 [  1   0 677   8   2   7]
 [  1   0  48 110   0   0]
 [ 10   5   3   0 563   0]
 [  0  17   0   0   3  46]]
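
Because the labels are now strings and no explicit order was given, scikit-learn arranges the rows and columns in sorted (alphabetical) label order: anger, fear, joy, love, sadness, surprise. This differs from the sadness, joy, love, anger, fear, surprise order of the validation matrices shown earlier. Passing `labels=class_label_names` to `confusion_matrix` fixes the order at the source; alternatively, the printed matrix can be rearranged afterwards, as in this sketch (the variable names are illustrative).

```python
class_label_names = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

# Matrix as printed above, in alphabetical label order
cm_sorted = [
    [253,  10,   3,   0,   9,   0],  # anger
    [  4, 211,   0,   0,   7,   2],  # fear
    [  1,   0, 677,   8,   2,   7],  # joy
    [  1,   0,  48, 110,   0,   0],  # love
    [ 10,   5,   3,   0, 563,   0],  # sadness
    [  0,  17,   0,   0,   3,  46],  # surprise
]

sorted_names = sorted(class_label_names)
# Position of each dataset-order label within the alphabetical order
idx = [sorted_names.index(name) for name in class_label_names]

# Permute rows and columns into the dataset's label order
cm_dataset_order = [[cm_sorted[i][j] for j in idx] for i in idx]

for name, row in zip(class_label_names, cm_dataset_order):
    print("{:>8} {}".format(name, row))
```

With the rows back in dataset order, the diagonal can be compared directly with the validation matrices from the earlier section.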


In [None]:
print(classification_report(y_test_new, test_predictions))