### Text classification with Generative Models

Text classification is a common NLP task: it assigns text to predefined categories or classes based on its content. It underpins applications such as sentiment analysis, spam filtering, and topic classification.

With generative models everywhere, it's tempting to use them for classification tasks, but are they good at it? And how do we measure the success of a classification model? Let's find out.

For this example we will use the rotten_tomatoes dataset, which contains 10,662 movie reviews (8,530 for training, 1,066 for validation, and 1,066 for testing), each labeled with its sentiment (positive or negative).

In [1]:
# Quote the version specifiers so the shell does not treat ">" as an output redirection
!pip install "transformers>=4.41.2" "accelerate>=0.31.0"
!pip install transformers sentence-transformers openai
!pip install -U datasets

Collecting sentence-transformers
  Downloading sentence_transformers-5.1.2-py3-none-any.whl.metadata (16 kB)
Collecting openai
  Downloading openai-2.9.0-py3-none-any.whl.metadata (29 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting jiter<1,>=0.10.0 (from openai)
  Downloading jiter-0.12.0-cp311-cp311-macosx_11_0_arm64.whl.metadata (5.2 kB)
Collecting sniffio (from openai)
  Using cached sniffio-1.3.1-py3-none-any.whl.metadata (3.9 kB)
Downloading sentence_transformers-5.1.2-py3-none-any.whl (488 kB)
Downloading openai-2.9.0-py3-none-any.whl (1.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 14.4 MB/s 0:00:00
Downloading distro-1.9.0-py3-none-any.whl (20 kB)
Downloading jiter-0.12.0-cp311-cp311-macosx_11_0_arm64.whl (320 kB)
Using cached sniffio-1.3.1-py3-none-any.whl (10 kB)
Installing collected packages: sniffio, jiter, distro, open

In [2]:
from datasets import load_dataset

# Load our data
data = load_dataset("rotten_tomatoes")
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

## Using a Task-Specific Model

A task-specific model is the easiest way to solve our problem: we just need to find a model that fits our needs, download it, and run it in a pipeline on our data.

For this example we will use a RoBERTa model fine-tuned for sentiment analysis on tweets.

We will use a pipeline object. If you are not familiar with it, read the [official doc](https://huggingface.co/docs/transformers/pipeline_tutorial).

In [4]:
from transformers import pipeline
import torch

model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# load model into pipeline
pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,
    #device="cuda"
)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


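Now let's run an inference loop to get the predictions for our dataset. The model scores three classes (negative, neutral, positive), but our labels are binary, so we compare only the negative and positive scores.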

In [5]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

# Run inference
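y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total=len(data["test"])):
    # This loop assumes the scores come back in label-id order:
    # index 0 = negative, index 1 = neutral, index 2 = positive
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    # Map the winner to the dataset labels (0 = negative, 1 = positive)
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)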


100%|██████████| 1066/1066 [00:29<00:00, 36.23it/s]
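
The loop above depends on the position of each class in the pipeline output. As a hedged alternative, the sketch below looks the scores up by label name instead. The helper `binary_sentiment` and the toy output are hypothetical, and the label names "negative" and "positive" are an assumption about the model's configuration, so check them against the real pipeline output before relying on this.

```python
def binary_sentiment(class_scores, negative="negative", positive="positive"):
    """Return 0 (negative) or 1 (positive) from a list of {'label', 'score'} dicts.

    The label names are assumptions; adjust them to whatever the model reports.
    """
    by_label = {item["label"].lower(): item["score"] for item in class_scores}
    # The neutral score is ignored; ties go to the negative class
    return int(by_label[positive] > by_label[negative])


# Made-up scores in an arbitrary order, to show that position no longer matters
toy_output = [
    {"label": "positive", "score": 0.15},
    {"label": "neutral", "score": 0.60},
    {"label": "negative", "score": 0.25},
]
toy_assignment = binary_sentiment(toy_output)
print(toy_assignment)
```

Inside the inference loop, `binary_sentiment(output)` would replace the three index-based lines.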


## Evaluation

Next, we define a function that evaluates how well the model performed by comparing its predictions to the actual labels. For this we will use `classification_report` from scikit-learn.

In [6]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

In [7]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066
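
To make the columns of the report concrete, here is a minimal sketch that computes precision, recall, and F1 for one class by hand. The helper `precision_recall_f1` and the toy labels are made up for illustration and are not taken from the dataset.

```python
def precision_recall_f1(y_true, y_pred, target_label):
    """Compute precision, recall and F1 for a single class, treating it as the 'positive' class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target_label and p == target_label)
    fp = sum(1 for t, p in pairs if t != target_label and p == target_label)
    fn = sum(1 for t, p in pairs if t == target_label and p != target_label)
    # Precision: of everything predicted as this class, how much was right?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything that truly is this class, how much did we find?
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels (0 = negative, 1 = positive), invented for this example
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 0, 0]

for label in (0, 1):
    p, r, f = precision_recall_f1(toy_true, toy_pred, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The "macro avg" row of the report is the unweighted mean of these per-class values, while "support" is the number of true examples in each class.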



## Classification Tasks with Embeddings

Now let's see how we can use embeddings to classify our data.

What happens if we cannot find a model that perfectly fits our needs?

We could fine-tune a model on our specific task, but that is slow, difficult, and costly.

So what's the solution? **Use embeddings!**

### Supervised Classification with Embeddings

Instead of using a model pre-trained for our specific task, we will use an embedding model to generate features, then use those features to train a classifier. This method is called **supervised classification with embeddings**: it is supervised because the classifier learns from labeled examples, but the embedding model itself stays frozen, so no fine-tuning is needed.

For this example we will use a sentence-transformers model to generate embeddings for our data. These models are popular and perform well on this kind of task.

In [8]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to embeddings
train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings = model.encode(data["test"]["text"], show_progress_bar=True)

In [9]:
# Quick look at what the embedding model (loaded in the previous cell) returns for two toy sentences
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings)

# Shape of the training embeddings: (number of documents, embedding dimension)
train_embeddings.shape

[[ 0.02250263 -0.07829173 -0.02303073 ... -0.00827931  0.02652684
  -0.00201897]
 [ 0.04170237  0.0010974  -0.01553417 ... -0.02181629 -0.06359361
  -0.00875287]]


(8530, 768)

This shape shows that each of our 8,530 training documents is represented by a 768-dimensional embedding.

Now let's train a very simple logistic regression on our embeddings.

In [10]:
from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression on our train embeddings
clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, data["train"]["label"])

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | 42 |
| solver | 'lbfgs' |
| max_iter | 100 |


In [11]:
# Predict previously unseen instances
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066



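This demonstrates that we can train a lightweight classifier while keeping the embedding model frozen.

We can even skip the classifier: average the training embeddings of each class to get one reference vector per label, then assign each test document to the label whose reference vector is closest by cosine similarity.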

In [12]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity

# Average the embeddings of all documents in each target label
# (the labels are appended as column 768, right after the 768 embedding dimensions)
df = pd.DataFrame(np.hstack([train_embeddings, np.array(data["train"]["label"]).reshape(-1, 1)]))
averaged_target_embeddings = df.groupby(768).mean().values

# Find the best matching embeddings between evaluation documents and target embeddings
sim_matrix = cosine_similarity(test_embeddings, averaged_target_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

# Evaluate the model
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.84      0.84       533
Positive Review       0.84      0.85      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066
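
The averaging step in the cell above can also be written without pandas. The following is a minimal NumPy sketch of the same nearest-average idea on made-up 2-dimensional vectors; `nearest_centroid_predict` and the toy arrays are hypothetical names introduced only for this illustration.

```python
import numpy as np


def nearest_centroid_predict(train_x, train_y, test_x):
    """Assign each test vector to the class whose mean training vector is most similar (cosine)."""
    classes = np.unique(train_y)
    # One averaged embedding ("centroid") per class
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    # Cosine similarity is the dot product of L2-normalised vectors
    test_unit = test_x / np.linalg.norm(test_x, axis=1, keepdims=True)
    centroid_unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    similarities = test_unit @ centroid_unit.T
    return classes[np.argmax(similarities, axis=1)]


# Toy "embeddings": class 0 points mostly along the first axis, class 1 along the second
toy_train_x = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
toy_train_y = np.array([0, 0, 1, 1])
toy_test_x = np.array([[0.8, 0.3], [0.0, 0.5]])

toy_centroid_pred = nearest_centroid_predict(toy_train_x, toy_train_y, toy_test_x)
print(toy_centroid_pred)
```

With the notebook's real arrays, the call would take `train_embeddings`, the training labels as a NumPy array, and `test_embeddings`.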



### Zero Shot Classification
Zero-shot classification is when a model classifies text into categories it was never explicitly trained on, simply by exploiting the semantic relationship between the input text and descriptions of the candidate labels.

Here we pretend we have no labeled data: we will try to predict the labels of the input text even though the model was never trained on them.

To perform zero-shot classification with embeddings, there is a little trick we can use: describe each label by what it should represent. For example, a negative label for movie reviews can be described as "This is a negative movie review." By embedding both the label descriptions and the documents, we get vectors that we can compare.

In [None]:
# Create embeddings for our labels
label_embeddings = model.encode(["A negative review",  "A positive review"])

To assign labels to documents, we can apply cosine similarity to the document-label pairs.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.78      0.77      0.78       533
Positive Review       0.77      0.79      0.78       533

       accuracy                           0.78      1066
      macro avg       0.78      0.78      0.78      1066
   weighted avg       0.78      0.78      0.78      1066



An F1 score of 0.78 is quite impressive considering we did not use any labeled training data! It is a good illustration of why embeddings can be such a useful tool.

### Text Classification with Generative Models
Generative language models like OpenAI's GPT differ fundamentally in their approach to classification compared to traditional methods.

Rather than following conventional classification paradigms, these models function as sequence-to-sequence systems: they receive text input and produce text output.

While these generative models undergo training across diverse tasks, they typically cannot handle specialized use cases immediately. Consider feeding a movie review to such a model without additional guidance: the model would lack direction on how to process it.

To achieve meaningful results, we must provide context and steer the model toward our desired outcomes. This guidance occurs primarily through carefully crafted instructions, known as prompts.

For our demo we will use the Groq API because OpenAI does not offer free API keys.

In [16]:
! pip install groq

Collecting groq
  Downloading groq-0.37.1-py3-none-any.whl.metadata (16 kB)
Downloading groq-0.37.1-py3-none-any.whl (137 kB)
Installing collected packages: groq
Successfully installed groq-0.37.1


In [19]:
sample_text = data["test"]["text"][0]
print(f"Review: {sample_text}\n")


Review: lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .



In [24]:
import os
from groq import Groq
from dotenv import load_dotenv

# Load environment variables (including the API key) from a local .env file
load_dotenv()

client = Groq(
    api_key=os.getenv("GROQ_API_KEY"),
)

chat_completion = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a sentiment classifier. Respond with only 'positive' or 'negative'."
        },
        {
            "role": "user",
            "content": f"Classify the sentiment of this movie review: {sample_text}"
        }
    ],
    temperature=0,
    max_tokens=10

)
print(chat_completion.choices[0].message.content)

positive


We can also ask for a score if we need more granularity:

In [25]:
chat_completion = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a sentiment classifier. Rate the sentiment as a number between 0 (negative) and 1 (positive). Respond with only the number."
        },
        {
            "role": "user",
            "content": f"Rate the sentiment of this movie review: {sample_text}"
        }
    ],
    temperature=0,
    max_tokens=10

)
print(chat_completion.choices[0].message.content)

0.8


To evaluate this classifier with the same classification report, we first wrap the call in a helper function that we can apply to every review.

Keep in mind that this is a very simple prompt. If you need more control over the LLM output, check how to structure model outputs in the OpenAI documentation.

In [26]:
def groq_generation(prompt, model="meta-llama/llama-4-scout-17b-16e-instruct"):
    """Ask the model to rate the sentiment of a review and return its raw text reply."""
    messages = [
        {
            "role": "system",
            "content": "You are a sentiment classifier. Rate the sentiment as a number between 0 (negative) and 1 (positive). Respond with only the number."
        },
        {
            "role": "user",
            "content": f"Rate the sentiment of this movie review: {prompt}"
        }
    ]
    chat_completion = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
        max_tokens=10
    )
    return chat_completion.choices[0].message.content

In [27]:
groq_generation(sample_text)

'0.8'
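
The helper returns a string such as `'0.8'`, whereas `evaluate_performance` expects integer labels. One possible way to bridge the two is sketched below; `score_to_label` and the 0.5 threshold are assumptions made for illustration, not something the notebook itself runs.

```python
def score_to_label(raw_reply, threshold=0.5):
    """Convert a raw reply such as '0.8' into 1 (positive) or 0 (negative).

    Returns None when the reply is not a number between 0 and 1,
    so that unexpected model outputs can be counted instead of silently mislabeled.
    """
    try:
        score = float(raw_reply.strip())
    except (ValueError, AttributeError):
        return None
    if not 0.0 <= score <= 1.0:
        return None
    return int(score >= threshold)


print(score_to_label("0.8"))
print(score_to_label("not sure"))

# Hypothetical usage with the notebook's objects (one API call per review):
# replies = [groq_generation(text) for text in data["test"]["text"]]
# y_pred = [score_to_label(reply) for reply in replies]
```

Replies that map to `None` need a decision (drop them, retry, or assign a default class) before the predictions are passed to `evaluate_performance`.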

### Text-to-Text Transfer Transformers (T5)
Let's explore a final technique: text-to-text transfer transformers, or T5 models. The architecture is similar to the original Transformer, with an encoder and a decoder stacked together.

T5 reframes every common NLP task, such as translation, summarization, classification, and question answering, as input text → output text, which simplifies model design and enables multitask learning.

T5 was trained on the Colossal Clean Crawled Corpus (C4) with a self-supervised objective called span corruption, giving it strong generalization across NLP tasks.

Because T5 generates text tokens for answers and labels, it can handle zero-shot, few-shot, and instruction-based tasks without task-specific heads or architectures. Below we use Flan-T5, a T5 variant that was additionally fine-tuned on instructions.

In [29]:
# Load our model
pipe = pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    device="cpu"
)

Device set to use cpu


Now let's prepare our data by adding a T5-compatible prompt to each text:

In [30]:
# Prepare our data
prompt = "Is the following sentence positive or negative? "
data = data.map(lambda example: {"t5": prompt + example['text']})
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
})

Since this model generates text, we need to map its output to our labels: "negative" becomes 0 and anything else is treated as positive (1). Then we can run our evaluation.

In [31]:
# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "t5")), total=len(data["test"])):
    text = output[0]["generated_text"]
    y_pred.append(0 if text == "negative" else 1)

evaluate_performance(data["test"]["label"], y_pred)

100%|██████████| 1066/1066 [01:39<00:00, 10.72it/s]

                 precision    recall  f1-score   support

Negative Review       0.83      0.85      0.84       533
Positive Review       0.85      0.83      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066
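
The mapping used above treats every output other than the exact string "negative" as positive, so a reply like "Negative" or an off-topic answer would be counted as a positive review. As a hedged sketch, the hypothetical helper below normalizes the generated text first and flags anything it cannot parse; the toy outputs are invented for illustration.

```python
from collections import Counter

# Accepted answers and the dataset label each one maps to
LABELS = {"negative": 0, "positive": 1}


def generated_text_to_label(text):
    """Map generated text to 0/1 after trimming whitespace, case and trailing punctuation.

    Returns None for anything that is not a recognised label.
    """
    cleaned = text.strip().lower().rstrip(".!")
    return LABELS.get(cleaned)


# Invented outputs, including one the helper cannot parse
toy_generations = ["negative", "Positive", " positive.", "neutral"]
toy_mapped = [generated_text_to_label(t) for t in toy_generations]
toy_unparsed = Counter(t for t, label in zip(toy_generations, toy_mapped) if label is None)

print(toy_mapped)
print(toy_unparsed)
```

Counting the unparsed outputs shows how often the model strays from the expected answers, which the exact-match mapping hides.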






### Conclusion
I hope you now have a better understanding of text classification and how to handle it with or without generative models. We have seen that pretrained models classify text well, even without fine-tuning.

We have also seen that embeddings can serve as input features for training classifiers. The key takeaways are:

- Task-specific models (like RoBERTa) achieve ~80% accuracy with minimal setup
- Embeddings + logistic regression reach an F1 score of 0.85 without fine-tuning
- Averaged class embeddings + cosine similarity reach an F1 score of 0.84 without training a classifier
- Zero-shot classification with embeddings achieves an F1 score of 0.78 without any labeled training data
- Generative models (Flan-T5) reach an F1 score of 0.84 with simple prompting
- LLM APIs (like Groq) offer flexible classification but require API calls

Each approach has trade-offs between performance, cost, and complexity. Choose based on your specific needs!