<a href="https://colab.research.google.com/github/SaibalPatraDS/Hands-on-LLM/blob/main/Movie_Review_Sentiment_Analsis_Generative_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Movie Review Sentiment Analysis using `Generative Model - T5`

-- T5 stands for **Text-to-Text Transfer Transformer**

In [None]:
## installing necessary libraries
!pip install transformers sentence-transformers datasets

In [2]:
## loading the dataset
from datasets import load_dataset
review_df = load_dataset(
    "cornell-movie-review-data/rotten_tomatoes"
)

## looking into the data
review_df

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

## Encoder-Decoder Model

In [3]:
## Loading the model
from transformers import pipeline
pipe = pipeline(
    "text2text-generation",
    model = "google/flan-t5-small",
    device = "cuda:0"
)
pipe

<transformers.pipelines.text2text_generation.Text2TextGenerationPipeline at 0x7d4aa0c3a0b0>
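
The cell above hard-codes `device = "cuda:0"`, which only works on a runtime with a GPU attached. A minimal sketch of a fallback, assuming PyTorch is the backend; the helper name `pick_device` is hypothetical and not part of the original notebook:

```python
def pick_device() -> str:
    """Return "cuda:0" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch  # assumed backend; the notebook never imports it directly
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(device)
```

The resulting string could then be passed as `device = device` when building the pipeline, at the cost of slower inference on CPU.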

### Preparing the Data

-- Prefix every review with a `Prompt` that tells the model which task to perform. The result is stored in a new `t5` column.

In [6]:
## prepare the data
prompt = "Is this movie review is Positive or Negative? "
review_df = review_df.map(lambda example : {"t5" : prompt + example["text"]})
review_df

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
})
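
To make the effect of the `map` call concrete, here is a small sketch in plain Python that applies the same prefixing to two hand-written rows. The sample reviews are made up for illustration and are not taken from the dataset, and `add_prompt` is a hypothetical helper that mirrors the lambda above:

```python
prompt = "Is this movie review is Positive or Negative? "  # same string as in the notebook

def add_prompt(example: dict) -> dict:
    """Mirror the lambda passed to `map`: return only the new "t5" column."""
    return {"t5": prompt + example["text"]}

# Made-up rows with the same fields as the dataset ('text', 'label')
rows = [
    {"text": "a warm, funny and moving film .", "label": 1},
    {"text": "dull and lifeless from start to finish .", "label": 0},
]

# `Dataset.map` merges the returned dict into each row; emulate that here
prepared = [{**row, **add_prompt(row)} for row in rows]
for row in prepared:
    print(row["t5"])
```

Each row keeps its `text` and `label` fields and gains a `t5` field, which matches the three features listed in the output above.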

In [13]:
# review_df["train"]["t5"][0]

### Prediction

In [22]:
## Prediction
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

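## Run Inference
y_pred = []
for output in tqdm(pipe(KeyDataset(review_df["test"], "t5")), total = len(review_df["test"])):
  ## each output is a list holding one dict with the generated text
  pred = output[0]["generated_text"]
  ## map the generated text to the dataset labels: 0 = negative, 1 = positive
  y_pred.append(0 if pred == "Negative" else 1)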

100%|██████████| 1066/1066 [00:57<00:00, 18.66it/s]
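
The comparison `pred == "Negative"` sends every other generation to class 1, including differently cased or unrelated text. A sketch of a stricter mapping, assuming one wants to track unparseable outputs separately; `parse_label` and the `-1` sentinel are hypothetical and were not used to produce the results reported in this notebook:

```python
from typing import List

def parse_label(generated_text: str) -> int:
    """Map generated text to 0 (negative), 1 (positive) or -1 (unrecognised)."""
    text = generated_text.strip().lower()
    if text.startswith("negative"):
        return 0
    if text.startswith("positive"):
        return 1
    return -1

# Hand-written sample outputs, for illustration only
samples: List[str] = ["Negative", "positive ", "Positive", "it depends"]
labels = [parse_label(s) for s in samples]
print(labels)
```

Rows mapped to `-1` would need to be counted or dropped before scoring, because the report below expects exactly two classes.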


In [24]:
# np.sum(y_pred)

### Evaluation Metrics

In [30]:
## Evaluation Metrics
from sklearn.metrics import classification_report
def evaluation_metrics(y_true, y_pred):
  """
  Print precision, recall and F1-score for the negative and positive classes
  """
  report = classification_report(
      y_true, y_pred,
      target_names = ["Negative Reviews", "Positive Reviews"]
  )
  print(report)

In [31]:
## Classification Report
evaluation_metrics(y_true = review_df["test"]["label"], y_pred = y_pred)

                  precision    recall  f1-score   support

Negative Reviews       0.82      0.91      0.86       533
Positive Reviews       0.90      0.80      0.85       533

        accuracy                           0.85      1066
       macro avg       0.86      0.85      0.85      1066
    weighted avg       0.86      0.85      0.85      1066
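
As a sanity check, the F1-scores and the accuracy in the report can be recomputed from the printed precision and recall. The sketch below uses only the rounded values shown above, so the results agree with the report up to rounding; the helper `f1` is introduced here for illustration:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the report above
neg_f1 = f1(0.82, 0.91)
pos_f1 = f1(0.90, 0.80)

# Both classes have support 533, so accuracy is the mean of the two recalls
accuracy = (0.91 + 0.80) / 2

print(round(neg_f1, 2), round(pos_f1, 2), round(accuracy, 3))
```

The recalls also show where the errors concentrate: positive reviews are missed more often (recall 0.80) than negative ones (recall 0.91).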



## Conclusion

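1. Without any fine-tuning in this notebook, the `Flan-T5` model (`google/flan-t5-small`) achieves 85% accuracy on the 1,066-review test split of the movie review sentiment analysis task.
2. The `f1-score` values are also high and nearly identical across the two classes: 0.86 for negative reviews and 0.85 for positive reviews.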