# Text Classification with Generative Models

This notebook explores different approaches to classify movie reviews as positive or negative using the rotten_tomatoes dataset.

## Part 1: Setup and Data Loading

In [24]:
!uv pip install transformers accelerate sentence-transformers datasets scikit-learn numpy pandas tqdm groq python-dotenv

Audited 10 packages in 68ms


In [25]:
from datasets import load_dataset

data = load_dataset("rotten_tomatoes")
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

The dataset contains three splits: train (8530 samples), validation (1066 samples), and test (1066 samples). Each sample has a `text` field containing the movie review and a `label` field (0 for negative, 1 for positive).

In [26]:
print("Example positive review:")
print(data["train"][0])
print("\nExample negative review:")
print(data["train"][5000])

Example positive review:
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}

Example negative review:
{'text': 'possibly the most irresponsible picture ever released by a major film studio .', 'label': 0}


## Part 2: Task-Specific Model (RoBERTa)

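The simplest approach is to reuse a pre-trained sentiment classifier. We load a RoBERTa model fine-tuned on Twitter sentiment data and apply it to our movie reviews without any further training. The model predicts three classes (negative, neutral, positive), so we compare only the negative and positive scores to get a binary label.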

In [27]:
from transformers import pipeline
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,
    device="mps"
)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps


In [28]:
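y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total=len(data["test"])):
    # The model returns three scores per review: index 0 is negative,
    # index 1 is neutral, index 2 is positive. Neutral is ignored here.
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    # argmax over [negative, positive] gives 0 for negative and 1 for positive
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)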

100%|██████████| 1066/1066 [00:48<00:00, 21.89it/s]


In [29]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

In [30]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066



The RoBERTa model reaches about 80% accuracy, which is decent for a model trained on tweets and applied to movie reviews without fine-tuning. For negative reviews, recall (0.88) is higher than precision (0.76), while the reverse holds for positive reviews (recall 0.72, precision 0.86). In other words, the model leans toward predicting negative.
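
The prediction loop above selects scores by position (indexes 0 and 2), which only works if the pipeline returns the classes in negative, neutral, positive order. As a minimal sketch, the same decision can be made by label name instead. The helper name `binary_from_scores` is hypothetical, and the label strings "negative", "neutral" and "positive" are an assumption about this checkpoint; `pipe.model.config.id2label` shows the actual names.

```python
def binary_from_scores(scores):
    """Collapse three-class sentiment scores into a binary label.

    `scores` is a list of {"label": str, "score": float} dicts, the shape
    produced per review in the loop above. The label names are assumed to be
    "negative", "neutral" and "positive" (case-insensitive).
    """
    by_label = {entry["label"].lower(): entry["score"] for entry in scores}
    # Ignore the neutral score and compare only the two polar classes.
    # Ties go to negative (0), matching np.argmax on [negative, positive].
    return int(by_label["positive"] > by_label["negative"])


# Hand-written stand-in for one pipeline output
example = [
    {"label": "negative", "score": 0.10},
    {"label": "neutral", "score": 0.60},
    {"label": "positive", "score": 0.30},
]
print(binary_from_scores(example))
```

Looking scores up by name keeps the mapping correct even if the order of the returned list changes.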

## Part 3: Classification with Embeddings

When no task-specific model exists, we can use embeddings to represent text as vectors and train a classifier on top.

### Supervised classification with embeddings

In [31]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings = model.encode(data["test"]["text"], show_progress_bar=True)

Batches: 100%|██████████| 267/267 [00:54<00:00,  4.94it/s]
Batches: 100%|██████████| 34/34 [00:07<00:00,  4.50it/s]


In [32]:
train_embeddings.shape

(8530, 768)

In [33]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, data["train"]["label"])

LogisticRegression(random_state=42)


In [34]:
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066



### Unsupervised use case

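What if we do not train a classifier at all? We can average the training embeddings per class and use cosine similarity to predict which class average each test document matches best. Note that the averages are still computed from the labeled training set, so this is a nearest-centroid approach rather than a fully unsupervised one.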

In [35]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

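# Append the label as an extra column (index 768) next to the 768 embedding dimensions
df = pd.DataFrame(np.hstack([train_embeddings, np.array(data["train"]["label"]).reshape(-1, 1)]))
# Average the embeddings of each class; row 0 is the negative centroid, row 1 the positive one
averaged_target_embeddings = df.groupby(768).mean().values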

sim_matrix = cosine_similarity(test_embeddings, averaged_target_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.84      0.84       533
Positive Review       0.84      0.85      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066
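
The class-averaging step used above can also be written with NumPy boolean masks instead of a pandas `groupby`. The following is a minimal sketch on toy data; `class_centroids` is a hypothetical helper, and the two-dimensional vectors stand in for the 768-dimensional `train_embeddings`.

```python
import numpy as np


def class_centroids(embeddings, labels):
    """Return one mean embedding per class, ordered by sorted class id."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)  # sorted unique class ids, e.g. [0, 1]
    # Select the rows of each class with a boolean mask and average them
    return np.vstack([embeddings[labels == c].mean(axis=0) for c in classes])


# Toy stand-in for train_embeddings and the training labels
toy_embeddings = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
toy_labels = [0, 0, 1, 1]
centroids = class_centroids(toy_embeddings, toy_labels)
print(centroids)
```

Because the rows are ordered by class id, the argmax over the similarity matrix maps directly to the label values 0 and 1.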



### Zero-shot classification

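Zero-shot classification means assigning text to categories without any labeled training examples for those categories. Instead of training on labels, we describe each label as a short piece of text, embed those descriptions with the same model as the documents, and assign each document to the label whose embedding is most similar.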

In [36]:
label_embeddings = model.encode(["A negative review", "A positive review"])

In [37]:
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.78      0.77      0.78       533
Positive Review       0.77      0.79      0.78       533

       accuracy                           0.78      1066
      macro avg       0.78      0.78      0.78      1066
   weighted avg       0.78      0.78      0.78      1066
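
To make the similarity step above concrete, here is a minimal sketch of cosine similarity followed by an argmax, written in plain NumPy. `cosine_argmax` is a hypothetical helper and the vectors are toy values, not real sentence embeddings.

```python
import numpy as np


def cosine_argmax(doc_vectors, label_vectors):
    """Assign each document to the label with the highest cosine similarity."""
    docs = np.asarray(doc_vectors, dtype=float)
    labels = np.asarray(label_vectors, dtype=float)
    # Scale every row to unit length so the dot product equals the cosine
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    labels = labels / np.linalg.norm(labels, axis=1, keepdims=True)
    sims = docs @ labels.T  # shape: (n_documents, n_labels)
    return sims.argmax(axis=1), sims


# Toy stand-ins for test_embeddings and label_embeddings
toy_docs = np.array([[0.9, 0.1], [0.2, 0.8]])
toy_label_vectors = np.array([[1.0, 0.0], [0.0, 1.0]])
pred, sims = cosine_argmax(toy_docs, toy_label_vectors)
print(pred)
```

Since the prediction depends only on which label description is closest, the wording of the descriptions ("A negative review", "A positive review") directly affects the result.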



## Part 4: Text Classification with Generative Models

Generative language models like GPT work differently from traditional classifiers. Instead of returning a class label, they take text as input and produce text as output. We therefore have to guide the model with a prompt that states the task and the format of the answer we expect.

In [38]:
import os
from groq import Groq
from dotenv import load_dotenv

load_dotenv()

client = Groq(
    api_key=os.getenv("GROQ_API_KEY"),
)

In [39]:
sample_text = data["test"]["text"][0]
print(f"Review: {sample_text}\n")

Review: lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .



In [40]:
chat_completion = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a sentiment classifier. Respond with only 'positive' or 'negative'."
        },
        {
            "role": "user",
            "content": f"Classify the sentiment of this movie review: {sample_text}"
        }
    ],
    temperature=0,
    max_tokens=10
)
print(chat_completion.choices[0].message.content)

positive.
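
The reply above ends with a period even though the system prompt asked for a single word, so comparing it directly with the string "positive" would fail. A minimal sketch of normalizing the reply before mapping it to a label is shown below; `parse_sentiment_label` is a hypothetical helper, not part of the Groq client.

```python
def parse_sentiment_label(response):
    """Map a free-text model reply to 1 (positive), 0 (negative) or None."""
    # Lowercase, then trim whitespace, quotes and trailing punctuation
    cleaned = response.strip().lower().strip(" .!'\"")
    if cleaned == "positive":
        return 1
    if cleaned == "negative":
        return 0
    return None  # unexpected reply; the caller decides how to handle it


print(parse_sentiment_label("positive."))
```

Returning `None` for unexpected replies keeps formatting failures visible instead of silently counting them as one of the classes.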


We can also ask the model for a score between 0 and 1 if we need more granularity than a binary label.

In [41]:
chat_completion = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a sentiment classifier. Respond with only a number between 0 and 1."
        },
        {
            "role": "user",
            "content": f"Rate the sentiment of this movie review: {sample_text}"
        }
    ],
    temperature=0,
    max_tokens=10
)
print(chat_completion.choices[0].message.content)

0.8


In [42]:
def groq_generation(prompt, model="meta-llama/llama-4-scout-17b-16e-instruct"):
    message = [
        {
            "role": "system",
            "content": "You are a sentiment classifier. Respond with only a number between 0 and 1."
        },
        {
            "role": "user",
            "content": f"Rate the sentiment of this movie review: {prompt}"
        }
    ]
    chat_completion = client.chat.completions.create(
        model=model,
        messages=message,
        temperature=0,
        max_tokens=10
    )
    return chat_completion.choices[0].message.content

In [43]:
groq_generation(sample_text)

'0.8'

In [44]:
#predictions = [groq_generation(doc) for doc in tqdm(data["test"]["text"])]
#y_pred = [int(float(pred) >= 0.5) for pred in predictions]
#evaluate_performance(data["test"]["label"], y_pred)
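
The commented-out cell above converts each reply with `float(pred)`, which raises an error if the model adds any extra text (for example "0.8." or "Score: 0.8"). A more forgiving conversion could look like the following sketch; `score_to_label` and its `threshold` parameter are hypothetical names introduced for illustration.

```python
import re


def score_to_label(response, threshold=0.5):
    """Extract the first number from a model reply and threshold it.

    Returns 1 or 0, or None when no number in [0, 1] can be found.
    """
    match = re.search(r"\d*\.?\d+", response)
    if match is None:
        return None
    score = float(match.group())
    if not 0.0 <= score <= 1.0:
        return None  # the model ignored the requested range
    return int(score >= threshold)


print(score_to_label("0.8"))
```

Replies that yield `None` can be counted and inspected separately before the predictions are passed to `evaluate_performance`.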

## Part 5: Text-to-Text Transfer Transformer (T5)

T5 reframes every NLP task as text-to-text: text goes in and text comes out. This simplifies model design and enables multitask learning. Because answers and labels are generated as text, the same model can be prompted for zero-shot, few-shot, and instruction-based tasks. The Flan-T5 checkpoint used below is an instruction-tuned variant of T5, so it can follow a plain-language prompt without task-specific training.

In [45]:
pipe = pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    device="mps"
)

Device set to use mps


In [46]:
prompt = "Is the following sentence positive or negative? "
data = data.map(lambda example: {"t5": prompt + example['text']})
data

Map: 100%|██████████| 8530/8530 [00:00<00:00, 50799.19 examples/s]
Map: 100%|██████████| 1066/1066 [00:00<00:00, 60414.12 examples/s]
Map: 100%|██████████| 1066/1066 [00:00<00:00, 61091.83 examples/s]


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
})

In [47]:
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "t5")), total=len(data["test"])):
    text = output[0]["generated_text"]
    # Any output other than the exact string "negative" is counted as positive
    y_pred.append(0 if text == "negative" else 1)

100%|██████████| 1066/1066 [03:38<00:00,  4.88it/s]


In [48]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.83      0.85      0.84       533
Positive Review       0.85      0.83      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066
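
The mapping in the prediction loop above treats every generated string other than "negative" as positive, so unexpected outputs would be hidden inside the positive class. One way to check this, sketched below, is to count the distinct generated strings before mapping them. `audit_generations` is a hypothetical helper, and the example list is made up for illustration.

```python
from collections import Counter


def audit_generations(generated_texts, expected=("negative", "positive")):
    """Count distinct model outputs and report those outside `expected`."""
    counts = Counter(text.strip().lower() for text in generated_texts)
    unexpected = {text: n for text, n in counts.items() if text not in expected}
    return counts, unexpected


# Made-up outputs standing in for the generated_text values
counts, unexpected = audit_generations(["negative", "Positive ", "neutral"])
print(unexpected)
```

An empty `unexpected` dictionary indicates that the model only produced the two expected labels.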

