# Text Classification NLP with 10,000 Rotten Tomatoes reviews
I'll be using the Rotten Tomatoes movie review dataset from Hugging Face Datasets. It contains roughly 10,000 reviews (10,662 in total), already split into training, validation, and test sets. In this notebook, I use a pretrained sentiment analysis model to classify the movie reviews as either positive or negative.



Libraries added:

!pip install datasets transformers evaluate

from datasets import load_dataset

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

import evaluate

import numpy as np

import torch

from sklearn.metrics import accuracy_score

In [2]:
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate
import numpy as np
import torch
from sklearn.metrics import accuracy_score

dataset = load_dataset("rotten_tomatoes")
print(dataset)
dataset['train'][0]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})


{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'label': 1}
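
As a quick check on the split sizes, the row counts from the printout above can be added up by hand. This is only a small arithmetic sketch; the variable names (`split_sizes`, `split_shares`) are my own and not part of the `datasets` API.

```python
# Row counts copied from the DatasetDict printout above.
split_sizes = {"train": 8530, "validation": 1066, "test": 1066}

total_reviews = sum(split_sizes.values())
split_shares = {name: rows / total_reviews for name, rows in split_sizes.items()}

print(f"Total reviews: {total_reviews}")
for name, share in split_shares.items():
    print(f"{name:<10} {split_sizes[name]:>5} rows  ({share:.1%})")
```

The three splits come out very close to an 80/10/10 division of the data.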

# Model set up and hyperparameter tuning
As my texts don't exceed 512 tokens, I chose to continue with a DistilBERT base model (the code below loads the `distilbert-base-uncased-finetuned-sst-2-english` checkpoint). The alternative was `allenai/longformer-base-4096`, but it ran slower than DistilBERT.

I set up my standard tokenizer and classifier pipeline for the dataset; the pipeline handles preprocessing, passing the inputs through the model, and postprocessing. I ran into issues with the token limit and adjusted the maximum length.

The `preprocess_function` I defined tokenizes the inputs with padding, truncation, and `max_length`. I also mapped it over the dataset in batches so it runs more smoothly than before, as it's running on an older CPU.
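
To make the padding and truncation settings concrete, here is a minimal sketch of what they do to a single list of token IDs. It is a simplification, not the tokenizer's real implementation: `pad_or_truncate` is a hypothetical helper, the token IDs are made up, the pad ID of 0 is an assumption, and a real tokenizer also keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    clipped = token_ids[:max_length]          # effect of truncation=True
    n_pad = max_length - len(clipped)         # effect of padding="max_length"
    input_ids = clipped + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return input_ids, attention_mask

# Illustrative IDs only; they do not come from the real vocabulary lookup.
short_ids, short_mask = pad_or_truncate([101, 2023, 3185, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Short inputs are filled up to `max_length` and long inputs are cut down to it, so every row ends up the same length.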



In [3]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

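# Preprocess the dataset with padding, truncation, and max_length.
# DistilBERT has 512 position embeddings, so max_length must not exceed 512.
# Note: the pipeline calls below receive raw text and tokenize it themselves.
def preprocess_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)  # Capped at the model limit (was 725)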

dataset = dataset.map(preprocess_function, batched=True)

Device set to use cpu


Map:   0%|          | 0/1066 [00:00<?, ? examples/s]
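
For intuition on what `batched=True` changes, the sketch below imitates it in plain Python: the mapped function receives one dictionary of column lists per batch instead of one row at a time, and returns new columns of the same length. `map_batched` and `add_word_count` are hypothetical stand-ins, not the `datasets` implementation, and the second and third reviews are made up.

```python
def map_batched(rows, fn, batch_size=1000):
    """Toy stand-in for Dataset.map(fn, batched=True) on a list of dict rows."""
    mapped = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Batched mode hands fn one dict of columns: {"text": [...], "label": [...]}
        columns = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(columns)  # one new value per row in the batch
        for i, row in enumerate(chunk):
            extra = {key: values[i] for key, values in new_columns.items()}
            mapped.append({**row, **extra})
    return mapped

def add_word_count(batch):
    # Hypothetical per-batch function; the notebook's version calls the tokenizer here.
    return {"n_words": [len(text.split()) for text in batch["text"]]}

rows = [
    {"text": "consistently clever and suspenseful .", "label": 1},
    {"text": "a dull , plodding mess .", "label": 0},
    {"text": "worth a look .", "label": 1},
]
result = map_batched(rows, add_word_count, batch_size=2)
print(result)
```

Working on whole batches lets the tokenizer process many texts per call, which is where the speed-up on a CPU comes from.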

# Testing
To make sure that the classifier is working, I test it on the first 50 rows and display its predictions. Interestingly enough, this model returns its labels as uppercase strings (`NEGATIVE` and `POSITIVE`), so I defined a `label_map` to show them in a friendlier form. The output works just fine without it, but I kept it in there for transparency.
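
One caveat with a plain dictionary lookup is that it raises `KeyError` if a model ever returns a label name that is not in the map. A hedged alternative is sketched below; `format_prediction` is a hypothetical helper of mine, and `LABEL_1` is only an example of an unmapped name.

```python
label_map = {"NEGATIVE": "Negative", "POSITIVE": "Positive"}

def format_prediction(result, label_map):
    """Render one pipeline-style result dict as a display line."""
    raw_label = result["label"]
    # Fall back to the raw label so an unmapped name does not raise KeyError.
    display_label = label_map.get(raw_label, raw_label)
    return f"Label: {display_label}, Score: {result['score']:.4f}"

mapped_line = format_prediction({"label": "POSITIVE", "score": 0.99981}, label_map)
unmapped_line = format_prediction({"label": "LABEL_1", "score": 0.75}, label_map)
print(mapped_line)
print(unmapped_line)
```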

In [4]:
texts = dataset["test"]["text"][:50]
results = classifier(texts)
label_map = {"NEGATIVE": "Negative", "POSITIVE": "Positive"}

for text, result in zip(texts, results):
    print(f"Review: {text}")
    print(f"Label: {label_map[result['label']]}, Score: {result['score']:.4f}")
    print("-" * 80) # This made it easier to read


Review: lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .
Label: Positive, Score: 0.9998
--------------------------------------------------------------------------------
Review: consistently clever and suspenseful .
Label: Positive, Score: 0.9999
--------------------------------------------------------------------------------
Review: it's like a " big chill " reunion of the baader-meinhof gang , only these guys are more harmless pranksters than political activists .
Label: Negative, Score: 0.9910
--------------------------------------------------------------------------------
Review: the story gives ample opportunity for large-scale action and suspense , which director shekhar kapur supplies with tremendous skill .
Label: Positive, Score: 0.9999
--------------------------------------------------------------------------------
Review: red dragon " never cuts corners .
Label: Positive, Score: 0.9996
-------
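
The `Score` values above are probabilities. As far as I understand, for a two-label model like this one the pipeline applies a softmax to the model's raw logits and reports the probability of the winning label. The sketch below shows that step with made-up logits; the label order in `id2label` is an assumption for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # label order assumed for this sketch
logits = [-4.2, 4.3]                       # made-up logits, not real model output

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
prediction = {"label": id2label[best], "score": probs[best]}
print(prediction)
```

A large gap between the two logits is what produces scores very close to 1, like most of those in the output above.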

# Evaluate accuracy on the full dataset
Here I evaluate accuracy on all test reviews using the model's predictions. I convert the predicted labels into integers so they can be compared with the dataset labels. I can't use BLEU here, as that metric is meant for text generation tasks such as translation and summarization. Since this is sentiment classification, I use accuracy.

In [5]:
accuracy = evaluate.load("accuracy")
test_texts = dataset["test"]["text"]
test_labels = dataset["test"]["label"]
predictions = classifier(test_texts)

predicted_labels = [1 if result["label"] == "POSITIVE" else 0 for result in predictions]
accuracy_result = accuracy.compute(predictions=predicted_labels, references=test_labels)
print("Test Set Accuracy:", accuracy_result["accuracy"])


Test Set Accuracy: 0.8968105065666041
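
Accuracy is simply the fraction of predictions that match the reference labels. The sketch below computes it by hand on made-up toy data. `to_int_label` and `accuracy_score_manual` are hypothetical helpers, not the `evaluate` implementation. The last lines show that the reported score is consistent with 956 correct predictions out of the 1066 test reviews.

```python
def to_int_label(result):
    """Map a pipeline-style label string to the dataset's integer label."""
    return 1 if result["label"] == "POSITIVE" else 0

def accuracy_score_manual(predicted, references):
    """Fraction of positions where the prediction equals the reference."""
    if len(predicted) != len(references):
        raise ValueError("predictions and references must have the same length")
    matches = sum(p == r for p, r in zip(predicted, references))
    return matches / len(references)

# Made-up predictions and labels, only to exercise the functions.
toy_results = [{"label": "POSITIVE"}, {"label": "NEGATIVE"},
               {"label": "POSITIVE"}, {"label": "NEGATIVE"}]
toy_labels = [1, 0, 0, 0]
toy_accuracy = accuracy_score_manual([to_int_label(r) for r in toy_results], toy_labels)
print(toy_accuracy)  # 3 of the 4 toy predictions match

# The reported test accuracy corresponds to 956 of 1066 reviews.
reported_accuracy = 956 / 1066
print(reported_accuracy)
```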


# Final Reflection

- **Model Used**: distilbert-base-uncased-finetuned-sst-2-english  
- **Dataset**: Rotten Tomatoes movie review dataset (binary labels)  
- **Process**: Tokenize → Predict → Evaluate  

## Key Learnings
- Useful in this case for evaluating the success or failure of a film in order to set proper ratings.
- Using pretrained models allows for quick and effective sentiment classification.
- The Hugging Face pipeline makes inference simple and interpretable.
- Accuracy on the test data is strong at about 89.7%.
- A different metric would be needed if attempting summarization or translation.

## Limitations
- It doesn't extract any commonly used keywords or phrases.
- The model only supports binary classification.
- It may miss contextual cues not seen during training.

## Future Improvements
- Explore multi-class sentiment classification.
- Add model interpretability.
- Try more advanced models like RoBERTa or XLNet.
- Try other datasets with more data, such as IMDB with 50,000 reviews, as initially attempted.


# Performing the sentiment analysis
As an optional extra step, I applied the classifier to the full training split and got the complete list of predictions below. It took around 11 minutes to complete. The output shows the predicted label and score for each training review.
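
Since a run this long gives no feedback while it works, one option is to feed the texts to the classifier in slices, report progress, and then tally the labels. This is a minimal sketch under my own assumptions: `classify_in_chunks` and `fake_classifier` are hypothetical, and the fake classifier is there only so the example runs without downloading a model. Any callable that takes a list of strings and returns a list of result dictionaries, as the pipeline does above, could be passed in its place.

```python
from collections import Counter

def classify_in_chunks(classifier, texts, chunk_size=256):
    """Call `classifier` on successive slices of `texts` and collect all results."""
    results = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        results.extend(classifier(chunk))
        done = min(start + chunk_size, len(texts))
        print(f"classified {done}/{len(texts)}")
    return results

def fake_classifier(batch):
    # Hypothetical stand-in: labels a text POSITIVE if it contains "good".
    return [{"label": "POSITIVE" if "good" in text else "NEGATIVE", "score": 1.0}
            for text in batch]

sample_texts = ["good fun", "a bore", "so good", "flat", "good cast"]
sample_preds = classify_in_chunks(fake_classifier, sample_texts, chunk_size=2)

# Summarize the predictions instead of printing every row.
label_counts = Counter(pred["label"] for pred in sample_preds)
print(label_counts)
```

A label tally like this is easier to read than the full list when there are thousands of predictions.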

In [6]:
preds = classifier(dataset['train']['text'])
preds

[{'label': 'POSITIVE', 'score': 0.9998360872268677},
 {'label': 'POSITIVE', 'score': 0.9998277425765991},
 {'label': 'NEGATIVE', 'score': 0.9960036873817444},
 {'label': 'POSITIVE', 'score': 0.9998257756233215},
 {'label': 'POSITIVE', 'score': 0.9997782111167908},
 {'label': 'POSITIVE', 'score': 0.9998192191123962},
 {'label': 'POSITIVE', 'score': 0.9998753070831299},
 {'label': 'NEGATIVE', 'score': 0.9845964312553406},
 {'label': 'POSITIVE', 'score': 0.997896671295166},
 {'label': 'POSITIVE', 'score': 0.9998527765274048},
 {'label': 'POSITIVE', 'score': 0.9998834133148193},
 {'label': 'POSITIVE', 'score': 0.9998408555984497},
 {'label': 'POSITIVE', 'score': 0.9985002279281616},
 {'label': 'POSITIVE', 'score': 0.9998641014099121},
 {'label': 'POSITIVE', 'score': 0.9717578291893005},
 {'label': 'POSITIVE', 'score': 0.9998157620429993},
 {'label': 'POSITIVE', 'score': 0.9992558360099792},
 {'label': 'POSITIVE', 'score': 0.9998703002929688},
 {'label': 'POSITIVE', 'score': 0.9996273517608