<a href="https://colab.research.google.com/github/NewCodeLearner/HandsOnLLM-Projects/blob/main/02_BERT_FinetuningModel_For_Movie_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this lecture, we use large language representation models to classify the sentiment of movie reviews.

We start with a pre-trained RoBERTa-based model that has already been fine-tuned for sentiment analysis on tweets.

Model link: https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment

💡 **NOTE**: A GPU is recommended for running the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**
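
Because the pipeline later in this notebook is pinned to a GPU device, it can help to check which hardware is actually available first. The sketch below is not part of the original notebook: `pick_device` is a hypothetical helper that assumes PyTorch is the backend and falls back to the CPU when no GPU (or no PyTorch install) is found.

```python
def pick_device() -> str:
    """Return 'cuda:0' when a CUDA GPU is visible to PyTorch, else 'cpu'."""
    try:
        import torch  # assumed backend for the models in this notebook
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
```

The returned string could then be passed as the `device` argument when building a pipeline, instead of hard-coding a GPU.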

### Install required packages

In [1]:
!pip install transformers datasets sentence-transformers


# Data

In [2]:
# `datasets` is a Hugging Face library for downloading datasets hosted on the Hugging Face Hub.
# Pass the dataset name to load_dataset to fetch all of its splits.
from datasets import load_dataset

# Load our data
data = load_dataset("rotten_tomatoes")
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [None]:
data["train"][0, -1]

{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'things really get weird , though not particularly scary : the movie is all portent and no content .'],
 'label': [1, 0]}
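
Indexing a split with several positions, as above, returns one dictionary of columns rather than a list of rows. If per-example records are easier to work with, the columns can be zipped back together. The following is a minimal sketch on a hand-written batch with placeholder texts; `batch_to_rows` and `LABEL_NAMES` are hypothetical names, and the 0 = negative, 1 = positive mapping follows the labels shown in the output above.

```python
LABEL_NAMES = {0: "negative", 1: "positive"}  # hypothetical helper mapping


def batch_to_rows(batch):
    """Turn a column-oriented batch (dict of lists) into a list of row dicts."""
    keys = list(batch)
    return [dict(zip(keys, values)) for values in zip(*(batch[k] for k in keys))]


# Toy batch in the same layout as the output above (texts are placeholders).
batch = {"text": ["a glowing review", "a dismissive review"], "label": [1, 0]}

rows = batch_to_rows(batch)
for row in rows:
    print(LABEL_NAMES[row["label"]], "->", row["text"])
```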

# Text Classification with Representation Models

## Using a Task-specific Model

In [None]:
from transformers import pipeline

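model_path = 'cardiffnlp/twitter-roberta-base-sentiment'

# Load the model into a text-classification pipeline
pipe = pipeline(
    'text-classification',
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,  # return a score for every label, in label order
    device='cuda:0'          # requires a GPU runtime
)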


In [None]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

# Run inference
y_pred = []

# Calling the pipeline with a KeyDataset returns an iterable PipelineIterator
# that yields one result per example, in dataset order.
results = pipe(KeyDataset(data['test'], 'text'))

# Model labels: 0 -> Negative; 1 -> Neutral; 2 -> Positive
for output in tqdm(results, total=len(data['test'])):
    negative_score = output[0]['score']
    positive_score = output[2]['score']
    # Ignore the neutral score; argmax gives 0 (negative) or 1 (positive),
    # which matches the dataset's label encoding
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)


100%|██████████| 1066/1066 [00:13<00:00, 78.23it/s] 
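
The loop above drops the neutral class and compares only the negative and positive scores, so that the three-way model yields the binary labels used by the dataset. A minimal sketch of that step on hand-written scores follows; the numbers are illustrative, not real model output, and `to_binary_label` is a hypothetical helper. The `LABEL_0` to `LABEL_2` names assume the label order noted in the loop's comments.

```python
import numpy as np


def to_binary_label(scores):
    """Map [negative, neutral, positive] score dicts to 0 (negative) or 1 (positive)."""
    negative_score = scores[0]["score"]
    positive_score = scores[2]["score"]
    # np.argmax returns index 0 on ties, so negative wins when the scores are equal.
    # The neutral score is ignored entirely.
    return int(np.argmax([negative_score, positive_score]))


# Illustrative scores only: neutral is the largest here, yet it plays no role.
example = [
    {"label": "LABEL_0", "score": 0.15},
    {"label": "LABEL_1", "score": 0.60},
    {"label": "LABEL_2", "score": 0.25},
]
print(to_binary_label(example))
```

One consequence of this design is that a review the model considers mostly neutral is still forced into one of the two classes, which may account for some of the errors in the report.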


In [None]:
y_pred[50]

0

### Create and print the classification report

In [3]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report."""
    performance = classification_report(
        y_true, y_pred,
        target_names=['Negative Review', 'Positive Review']
    )
    print(performance)

In [None]:
evaluate_performance(data['test']['label'],y_pred)

                 precision    recall  f1-score   support

Negative Review       0.76      0.83      0.80       533
Positive Review       0.81      0.74      0.78       533

       accuracy                           0.79      1066
      macro avg       0.79      0.79      0.79      1066
   weighted avg       0.79      0.79      0.79      1066
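
For reference, the precision, recall, and F1 columns in the report can be derived from simple counts of correct and incorrect predictions. The sketch below computes them by hand for the positive class on a tiny invented label list (not the Rotten Tomatoes predictions); `prf_for_class` is a hypothetical helper.

```python
def prf_for_class(y_true, y_pred, cls):
    """Precision, recall, and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)  # true positives
    fp = sum(t != cls and p == cls for t, p in pairs)  # false positives
    fn = sum(t == cls and p != cls for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented toy labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

precision, recall, f1 = prf_for_class(y_true, y_pred, cls=1)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision answers "of the reviews predicted positive, how many were positive?", recall answers "of the positive reviews, how many were found?", and F1 is their harmonic mean.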



## Classification Tasks that Leverage Embeddings

### Supervised Classification

Model Link: https://huggingface.co/sentence-transformers/all-mpnet-base-v2

In [6]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to embeddings
train_embeddings = model.encode(data['train']['text'], show_progress_bar=True)
test_embeddings = model.encode(data['test']['text'], show_progress_bar=True)

In [7]:
test_embeddings.shape

(1066, 768)

In [9]:
from sklearn.linear_model import LogisticRegression

# Train a logistic regression classifier on the training embeddings
clf = LogisticRegression()
clf.fit(train_embeddings, data['train']['label'])

In [10]:
# Predict labels for the previously unseen test embeddings

y_pred = clf.predict(test_embeddings)
evaluate_performance(data['test']['label'],y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066



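An F1 score of 0.85 is a strong result: it beats the 0.79 achieved by the task-specific model above, even though the only component trained here is a logistic regression classifier on top of frozen embeddings.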
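
The approach in this section keeps the embedding model frozen and trains only a lightweight classifier on its output vectors. The sketch below reproduces that second stage on synthetic data, so it runs without downloading a model. The random 768-dimensional vectors merely stand in for sentence embeddings, and the cluster offset, sample counts, seed, and `max_iter` value are arbitrary assumptions rather than settings from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 768  # same width as the embeddings above


def fake_embeddings(n_per_class):
    """Two Gaussian clusters standing in for negative/positive review embeddings."""
    negative = rng.normal(loc=-0.2, scale=1.0, size=(n_per_class, dim))
    positive = rng.normal(loc=0.2, scale=1.0, size=(n_per_class, dim))
    X = np.vstack([negative, positive])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y


X_train, y_train = fake_embeddings(200)
X_test, y_test = fake_embeddings(50)

# Only this small classifier is trained; the "embeddings" are fixed inputs.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"accuracy on synthetic test set: {accuracy:.2f}")
```

Because the classifier is so small, this stage trains quickly even on a CPU; the expensive part of the real workflow is computing the embeddings.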