# Text Classification with Generative Models

Text classification is a common task in NLP. It involves categorizing text into predefined categories or classes based on its content. This task is essential in various applications, such as sentiment analysis, spam filtering, and topic classification.

![Text Classification Meme](https://i.imgflip.com/7xqz0v.jpg)

With so many generative models available, it's tempting to use them for classification tasks, but are they good at it? And how can we measure the success of a classification model? Let's find out 🤔

For this example we will use the `rotten_tomatoes` dataset, which contains 10,662 movie reviews (8,530 for training, 1,066 for validation, and 1,066 for testing), each labeled with its sentiment (positive or negative).

In [None]:
# Install dependencies using uv
import sys
!{sys.executable} -m pip install uv
!uv pip install transformers==4.41.2 accelerate==0.31.0 torch datasets sentence-transformers scikit-learn pandas numpy groq python-dotenv --system

In [None]:
# Import required libraries
from datasets import load_dataset
from transformers import pipeline
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load our data
data = load_dataset("rotten_tomatoes")
data

---

## Using a Task-Specific Model

Using a task-specific model is the easiest way to solve our problem: we just need to find a model that fits our needs, download it, and use it in a pipeline to test it on our data.

> For this example we will use a `roberta` model to classify our data.

We will use a `pipeline` object. If you are not familiar with this, read the [official doc](https://huggingface.co/docs/transformers/main/en/pipeline_tutorial)

### 📚 Questions: Using Task-Specific Models

Before running the code, let's understand what we're doing:

#### 1. What is RoBERTa?

**Answer:**

**RoBERTa** (Robustly Optimized BERT Approach) is an improved version of BERT:

- **Architecture**: Same as BERT (bidirectional transformer encoder)
- **Key improvements over BERT**:
  - Trained on **more data** (160GB vs 16GB)
  - Trained **longer** with **larger batches**
  - Removed Next Sentence Prediction (NSP) task
  - Dynamic masking (changes which tokens are masked during training)
  - Uses byte-pair encoding (BPE) instead of WordPiece
- **Result**: Better performance on most NLP benchmarks
- **Use cases**: Text classification, NER, question answering, etc.

#### 2. What does "cardiffnlp/twitter-roberta-base-sentiment-latest" tell us about this model?

**Answer:**

Breaking down the model name:
- **`cardiffnlp`**: Organization/creator (Cardiff NLP research group)
- **`twitter-roberta-base`**: RoBERTa base model trained on Twitter data
- **`sentiment-latest`**: Fine-tuned for sentiment analysis (latest version)

**Key characteristics**:
- Trained on **Twitter text** → good for informal, social media language
- Specialized for **sentiment analysis**
- May transfer reasonably well to movie reviews, although they are a different domain from tweets
- Understands abbreviations, slang, emojis common in social media

#### 3. Why do we use `return_all_scores=True`?

**Answer:**

- **Without this parameter**: Returns only the highest-scoring label
- **With `return_all_scores=True`**: Returns confidence scores for ALL classes (recent `transformers` releases deprecate this argument in favor of `top_k=None`)

**Benefits**:
- See model's confidence for each class (negative, neutral, and positive for this model)
- Understand how certain/uncertain the model is
- Useful for setting custom thresholds
- Better for debugging and analysis

Example:
```python
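# Illustrative scores for a single review (this model has three labels)
# Without return_all_scores: {'label': 'positive', 'score': 0.95}
# With return_all_scores: [{'label': 'negative', 'score': 0.02}, {'label': 'neutral', 'score': 0.03}, {'label': 'positive', 'score': 0.95}]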
```

#### 4. What does `device="cuda"` do? What if you don't have a GPU?

**Answer:**

**`device="cuda"`**:
- Runs the model on **GPU** (Graphics Processing Unit)
- **Much faster** inference (often 10x or more for large models)
- Strongly recommended for processing large datasets efficiently

**If you don't have a GPU**:
- Use `device="cpu"` or omit the parameter (defaults to CPU)
- Or use `device=-1` (also means CPU)
- Inference will be slower but still work
- Consider using smaller models or processing in smaller batches

**Check if you have CUDA available**:
```python
import torch
print(torch.cuda.is_available())  # True if GPU available
```

In [None]:
from transformers import pipeline
import torch

model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Check if CUDA is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load model into pipeline
pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,
    device=device
)

Now let's run an inference loop to get the predictions for our dataset

In [None]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total=len(data["test"])):
    negative_score = output[0]['score']
    positive_score = output[2]['score']  # Note: index 2, not 1 (neutral is at index 1)
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)

---

## Evaluation

Next, we define a function that evaluates how well the model performed by comparing its predictions to the actual labels. For this we use `classification_report` from scikit-learn.

### 📚 Questions: Understanding Evaluation Metrics

#### 1. What is a classification report?

**Answer:**

A **classification report** is a summary of a classifier's performance showing:

- **Precision**: Of all items predicted as a class, how many were correct?
  - Formula: TP / (TP + FP)
  - "When the model says positive, how often is it right?"

- **Recall**: Of all actual items in a class, how many did we find?
  - Formula: TP / (TP + FN)
  - "Of all positive reviews, how many did we catch?"

- **F1-Score**: Harmonic mean of precision and recall
  - Formula: 2 × (Precision × Recall) / (Precision + Recall)
  - Balances both metrics

- **Support**: Number of actual occurrences of each class

- **Accuracy**: Overall correctness, i.e. the share of all predictions that are right: (TP + TN) / (TP + TN + FP + FN)
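
The formulas above can be checked by hand with a few lines of plain Python. This is a minimal sketch: the function name `precision_recall_f1` and the counts are made up for illustration and are not taken from our model's results.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the "positive" class
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In practice `classification_report` does this for every class at once; the sketch only makes the arithmetic visible.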

#### 2. What do TP, TN, FP, FN mean?

**Answer:**

**Confusion Matrix Terms** (for Positive class):

- **TP (True Positive)**: Predicted positive, actually positive ✅
  - Example: Model says "positive review", it IS positive

- **TN (True Negative)**: Predicted negative, actually negative ✅
  - Example: Model says "negative review", it IS negative

- **FP (False Positive)**: Predicted positive, actually negative ❌
  - Example: Model says "positive review", but it's actually negative
  - Also called "Type I Error" or "False Alarm"

- **FN (False Negative)**: Predicted negative, actually positive ❌
  - Example: Model says "negative review", but it's actually positive
  - Also called "Type II Error" or "Miss"

**Visual representation**:
```
                Predicted
              Neg      Pos
Actual  Neg | TN  |  FP |
        Pos | FN  |  TP |
```
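
To make the four cells of the matrix concrete, here is a small sketch that counts them from two label lists. The helper `confusion_counts` and the toy labels are hypothetical; with real data, `sklearn.metrics.confusion_matrix` gives the same information.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN, treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted != positive and actual != positive:
            tn += 1
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative
        else:
            fn += 1  # predicted negative, actually positive
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Toy labels (1 = positive, 0 = negative), made up for illustration
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(y_true_toy, y_pred_toy)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```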

#### 3. When would you prefer high precision vs high recall?

**Answer:**

**High Precision** (minimize False Positives):
- **Spam filtering**: Don't want to mark important emails as spam
- **Medical diagnosis for risky treatment**: Don't want to treat healthy patients
- **Fraud detection with manual review**: Don't overwhelm reviewers with false alarms
- "When a false alarm is costly"

**High Recall** (minimize False Negatives):
- **Cancer screening**: Don't want to miss any cancer cases
- **Security threats**: Don't want to miss potential attacks
- **Customer service**: Don't want to miss unhappy customers
- "When missing something is costly"

**For sentiment analysis**:
- Depends on use case!
- Product reviews: Might prefer **balanced** (F1-score)
- Crisis detection: Prefer **high recall** (don't miss negative sentiment)

#### 4. What does a "macro avg" vs "weighted avg" mean?

**Answer:**

**Macro Average**:
- **Simple average** across all classes
- Treats all classes equally (each class has equal weight)
- Formula: (Metric_Class1 + Metric_Class2) / 2
- Good for: **Balanced datasets** or when **all classes matter equally**

**Weighted Average**:
- **Weighted by support** (number of instances per class)
- Classes with more instances contribute more
- Formula: (Metric_Class1 × Support1 + Metric_Class2 × Support2) / Total
- Good for: **Imbalanced datasets** (reflects overall performance better)

**Example**:
```
Class A: 100 samples, F1=0.9
Class B: 10 samples, F1=0.5

Macro avg: (0.9 + 0.5) / 2 = 0.70
Weighted avg: (0.9×100 + 0.5×10) / 110 = 0.86
```

In our case, support is equal (533 each), so macro = weighted!
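
The arithmetic in the example can be reproduced with a short sketch. The helper names `macro_average` and `weighted_average` are ours, not scikit-learn functions, and the per-class F1 values in the last line are only there to show that equal supports make both averages coincide.

```python
def macro_average(scores):
    """Plain mean: every class counts equally."""
    return sum(scores) / len(scores)


def weighted_average(scores, supports):
    """Mean weighted by the number of true instances per class."""
    total = sum(supports)
    return sum(score * n for score, n in zip(scores, supports)) / total


f1_scores = [0.9, 0.5]   # Class A, Class B
supports = [100, 10]     # samples per class

macro = macro_average(f1_scores)
weighted = weighted_average(f1_scores, supports)
print(f"macro={macro:.2f} weighted={weighted:.2f}")

# With equal supports (as in our test set) both averages are the same
balanced = weighted_average([0.81, 0.78], [533, 533])
```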

In [None]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

In [None]:
# Evaluate the model
evaluate_performance(data["test"]["label"], y_pred)

Overall the model is not bad: it correctly classifies about 4 out of 5 reviews.

> This is pretty good for sentiment analysis 🤙

Here's a detailed analysis of this classification report:

### 📊 Analysis of Results

#### Class-by-Class Analysis

**Negative Review Performance**

- **Precision: 0.76 (76%)**
  - When the model predicts "negative", it's correct 76% of the time
  - **24% of its "negative" predictions are wrong** - these are positive reviews incorrectly labeled as negative

- **Recall: 0.88 (88%)**
  - The model finds 88% of all actual negative reviews
  - **Only 12% are missed** - it rarely labels a negative review as positive

- **F1-score: 0.81** - Good balance, but recall is stronger than precision

**Interpretation**: The model is **sensitive to negativity** - it catches most negative reviews but sometimes over-predicts negativity.

---

**Positive Review Performance**

- **Precision: 0.86 (86%)**
  - When the model predicts "positive", it's correct 86% of the time
  - **14% of its "positive" predictions are wrong** - it rarely mislabels negatives as positive

- **Recall: 0.72 (72%)**
  - The model only finds 72% of actual positive reviews
  - **28% are missed** - over 1 in 4 positive reviews!

- **F1-score: 0.78** - Slightly lower than negative class

**Interpretation**: The model is **conservative with positivity** - when it says positive, it's usually right, but it misses many positive reviews (labeling them as negative instead).

---

#### The Trade-off Pattern

The two classes **mirror each other**, as expected in binary classification: every positive review labeled as negative lowers both the recall of the positive class and the precision of the negative class.

- **Negative**: High recall (0.88), Lower precision (0.76) → *Over-predicts negative*
- **Positive**: High precision (0.86), Lower recall (0.72) → *Under-predicts positive*

This suggests the model has a **negative bias** - it's more likely to classify uncertain reviews as negative.

---

#### 📈 Aggregate Metrics

- **Macro avg (0.81, 0.80, 0.80)**: Simple average across both classes
- **Weighted avg (0.81, 0.80, 0.80)**: Weighted by support (but since support is equal, same as macro)
- **Support: 533 each** - Perfectly balanced dataset, so no class imbalance issues

---

#### Conclusion

**Strengths:**
- ✅ **80% accuracy is solid**
- ✅ Good at detecting negative sentiment (88% recall)
- ✅ When it predicts positive, it's usually right (86% precision)

**Weaknesses:**
- ⚠️ Misses 28% of positive reviews
- ⚠️ Has a slight negative bias
- ⚠️ Could improve positive review detection

**💡 If false negatives on positive reviews are costly** (e.g., missing happy customers), you might want to adjust the classification threshold or retrain the model to be less pessimistic!
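
Adjusting the threshold could look like the sketch below. It assumes we keep the negative and positive scores from the pipeline, renormalise them so they sum to 1, and predict "positive" whenever the positive share exceeds a chosen threshold. The function name, the toy scores, and the threshold of 0.4 are all hypothetical.

```python
def predict_with_threshold(negative_score, positive_score, threshold=0.5):
    """Return 1 (positive) if the renormalised positive score exceeds `threshold`, else 0."""
    total = negative_score + positive_score
    if total == 0:
        return 0
    positive_share = positive_score / total
    return 1 if positive_share > threshold else 0


# Toy (negative_score, positive_score) pairs, made up for illustration
toy_scores = [(0.55, 0.40), (0.30, 0.60), (0.48, 0.45), (0.80, 0.10)]

# threshold=0.5 reproduces the argmax rule used in the inference loop
default_preds = [predict_with_threshold(n, p) for n, p in toy_scores]

# A lower threshold makes the model less pessimistic about borderline reviews
lenient_preds = [predict_with_threshold(n, p, threshold=0.4) for n, p in toy_scores]

print(default_preds)  # [0, 1, 0, 0]
print(lenient_preds)  # [1, 1, 1, 0]
```

Lowering the threshold raises recall on positive reviews at the cost of precision, so the value should be tuned on the validation split rather than on the test set.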

---

## Classification Tasks with Embeddings

Now let's see how we can use embeddings to classify our data.

> What happens if we cannot find a model that perfectly fits our needs?

Then we would need to fine-tune a model for our specific task, which can be slow, difficult, and costly... 😭

> So what's the solution?

**Use embeddings!**

### 📚 Questions: Understanding Embeddings

#### 1. What are embeddings?

**Answer:**

**Embeddings** are numerical representations of text in a high-dimensional vector space:

- **Dense vectors** of real numbers (typically 384, 768, or 1024 dimensions)
- Capture **semantic meaning** - similar texts have similar embeddings
- Generated by pre-trained neural networks

**Example**:
```python
# Illustrative values only, not real model output
# "I love this movie"  → [0.23, -0.45, 0.12, ..., 0.67]    (768 numbers)
# "This film is great" → [0.25, -0.43, 0.15, ..., 0.69]    (similar numbers!)
# "I hate this film"   → [-0.21, 0.42, -0.18, ..., -0.65]  (different!)
```

**Key properties**:
- Semantically similar texts are **close** in vector space
- Can be used for: classification, clustering, search, recommendation

#### 2. Why use embeddings instead of directly fine-tuning a model?

**Answer:**

**Advantages of embeddings approach**:

✅ **Faster**:
- No need to train the entire transformer model
- Just train a small classifier (logistic regression, SVM)
- Minutes instead of hours/days

✅ **Cheaper**:
- Less computational resources (can run on CPU)
- No expensive GPU training needed

✅ **Less data needed**:
- Fine-tuning usually works best with thousands of examples
- Embeddings + classifier can work with a few hundred

✅ **More flexible**:
- Can try different classifiers quickly
- Easy to retrain with new data
- Can combine with other features

✅ **Interpretable**:
- Traditional ML models (logistic regression) are easier to inspect
- Can see which embedding dimensions carry the most weight, although individual dimensions rarely have a human-readable meaning
**When to fine-tune instead**:
- When you have lots of labeled data (10k+ examples)
- When you need the absolute best performance
- When the task is very different from pre-training

#### 3. What is "supervised classification with embeddings"?

**Answer:**

**Supervised classification with embeddings** is a two-step process:

**Step 1: Generate embeddings**
- Use a pre-trained embedding model (frozen, not trained)
- Convert all text into vectors
- Example: SentenceTransformer, BERT embeddings

**Step 2: Train a classifier**
- Use traditional ML classifier (Logistic Regression, SVM, Random Forest)
- Train only the classifier on the embeddings
- This is the "supervised" part (uses labeled data)

**Why it's called "supervised classification with embeddings"**:
- **Supervised**: We have labeled training data (positive/negative)
- **With embeddings**: We use embeddings as features (not raw text or TF-IDF)
- We **do NOT** fine-tune the embedding model, we just use it as a feature extractor
- Only the classifier is trained (🧊 frozen embeddings, 🔥 trained classifier)

**Analogy**:
- Embeddings = Hiring a translator who knows the language
- Classifier = Teaching a simple decision rule using the translations

#### 4. What embedding model are we using and why?

**Answer:**

We're using **`sentence-transformers/all-mpnet-base-v2`**:

**What is it?**
- Based on **MPNet** architecture (Masked and Permuted pre-training)
- Trained specifically for **sentence-level embeddings**
- Part of the Sentence-Transformers library
- Outputs **768-dimensional** vectors

**Why this model?**
- ✅ **Very popular and well-performing** for semantic similarity
- ✅ **General-purpose** - works well across many domains
- ✅ **Good quality/speed trade-off**
- ✅ **Trained on diverse data** - not domain-specific
- ✅ **Proven performance** on semantic textual similarity benchmarks

**Alternatives you could use**:
- `all-MiniLM-L6-v2` - Faster but slightly lower quality
- `all-mpnet-base-v2` - Best balance (what we're using)
- Larger models for even better quality (but slower)

### Supervised Classification with Embeddings

Instead of using a model pre-trained for our specific task, we will use an embedding model to generate features, then train a classifier on those features. This method is called **supervised classification with embeddings**: the embedding model stays frozen, and only the classifier is trained on the labeled data 🏋️

For this example we will use a `sentence-transformers` model to generate embeddings for our data. It's very popular and well-performing for this kind of task.

In [None]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to embeddings
print("Generating embeddings for training data...")
train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)

print("Generating embeddings for test data...")
test_embeddings = model.encode(data["test"]["text"], show_progress_bar=True)

In [None]:
# Check the shape of embeddings
print(f"Training embeddings shape: {train_embeddings.shape}")
print(f"Test embeddings shape: {test_embeddings.shape}")

print(f"\nThis shows that each of our {train_embeddings.shape[0]} input documents has an embedding of dimension {train_embeddings.shape[1]}!")

Now let's train a very simple logistic regression on our embeddings 🤓

### 📚 Questions: Logistic Regression on Embeddings

#### 1. Why use Logistic Regression instead of a neural network?

**Answer:**

**Advantages of Logistic Regression**:

✅ **Fast training**:
- Trains in seconds/minutes on embeddings
- Training a neural network on the same data usually takes much longer

✅ **Little hyperparameter tuning needed**:
- Neural networks: learning rate, layers, dropout, epochs...
- Logistic Regression: mostly just regularization (C parameter)

✅ **Interpretable**:
- Can inspect the learned weights
- Shows which embedding dimensions matter, even if the dimensions themselves are hard to name

✅ **Less prone to overfitting**:
- Simpler model = less risk with small datasets

✅ **Works surprisingly well**:
- Embeddings already contain rich features
- A simple linear classifier is often enough!

**When to use neural networks instead**:
- Very large datasets (100k+ examples)
- Need to capture complex non-linear patterns
- Multi-task learning

#### 2. What does `random_state=42` do?

**Answer:**

`random_state=42` sets the **random seed** for reproducibility:

- **With seed**: Any randomized step gives the same result every time you run the code
- **Without seed**: Randomized steps may give different results on each run

**Why important**:
- Makes experiments reproducible
- Essential for debugging
- Fair comparison between methods
- Expected in scientific work

**What it affects in Logistic Regression**:
- Only the `sag`, `saga`, and `liblinear` solvers, which use it to shuffle the data
- The default `lbfgs` solver is deterministic, so `random_state` has no effect in the code below
- Setting it anyway is a harmless habit that keeps results stable if you switch solvers later

The number 42 is arbitrary (a reference to "The Hitchhiker's Guide to the Galaxy").

#### 3. How does the performance compare to the task-specific RoBERTa model?

**Answer:**

**Comparison** (based on typical results):

**RoBERTa (Task-Specific)**:
- Accuracy: ~0.80 (80%)
- F1-Score: ~0.80
- Training: Used pre-trained model (no training needed)
- Inference: One full transformer forward pass per review

**Embeddings + Logistic Regression**:
- Accuracy: ~0.85 (85%)
- F1-Score: ~0.85
- Training: Fast (minutes)
- Inference: One embedding forward pass per review, then a very cheap linear classifier (a single matrix multiplication)

**🎉 Embeddings approach performs BETTER!**

**Why embeddings work better here**:
1. The embedding model captures general semantic meaning well
2. We train the classifier specifically on our data
3. The RoBERTa model was trained on Twitter data (different domain)
4. Simple is sometimes better!

**Trade-offs**:
- ✅ Embeddings: Better accuracy here, cheap to train and retrain, but needs labeled data
- ✅ RoBERTa: No training needed, works out-of-the-box


In [None]:
from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression on our train embeddings
clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, data["train"]["label"])

print("✅ Logistic Regression trained!")
print(f"Model: {clf}")

In [None]:
# Predict on previously unseen instances
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)

**Congrats!** 🎊 With the classifier trained on embeddings, we achieve a better F1 score than with the task-specific model!

> This demonstrates that we can train a lightweight classifier while keeping the embedding model frozen.

---

## What if We Do Not Have Labeled Data: Unsupervised Use Case

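What would happen if we did not train a classifier at all? Instead, we can average the training embeddings per class and apply cosine similarity to predict which class best matches each document 🔍 Note that computing the class averages still uses the training labels; the fully label-free variant, based on label descriptions, comes in the Zero-Shot Classification section.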

### 📚 Questions: Zero-Shot Classification with Embeddings

#### 1. What is zero-shot classification?

**Answer:**

**Zero-shot classification** is when a model can classify text into categories it has **never been explicitly trained on**, simply by understanding the semantic relationship between the input text and candidate label descriptions.

**Key characteristics**:
- **No labeled training data** needed for the target task
- Model uses **semantic understanding** only
- Works by comparing **meaning** of text vs. label descriptions

**How it works**:
1. Embed the document: "This movie is terrible"
2. Embed label descriptions: "A negative review", "A positive review"
3. Calculate similarity between document and each label
4. Assign the label with highest similarity

**Real-world example**:
- Input: "The product broke after one day"
- Labels: "complaint", "praise", "question"
- Model figures out it's a "complaint" without any examples!

#### 2. How does this approach work without training?

**Answer:**

**The magic is in the embeddings**:

**Step 1: Average embeddings per class**
```python
import numpy as np

labels = np.array(data["train"]["label"])

# Average all negative review embeddings (label 0)
negative_center = train_embeddings[labels == 0].mean(axis=0)

# Average all positive review embeddings (label 1)
positive_center = train_embeddings[labels == 1].mean(axis=0)
```

**Step 2: For new document**
```python
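from sklearn.metrics.pairwise import cosine_similarity

# Get embedding for test document; passing a list returns a 2-D array of shape (1, dim)
test_embedding = model.encode(["This movie was awful"])

# Calculate similarity to both centers (cosine_similarity expects 2-D inputs)
sim_to_negative = cosine_similarity(test_embedding, negative_center.reshape(1, -1))[0, 0]
sim_to_positive = cosine_similarity(test_embedding, positive_center.reshape(1, -1))[0, 0]

# Assign closest label
label = "negative" if sim_to_negative > sim_to_positive else "positive"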
```

**Why it works**:
- The embedding model already understands semantic meaning
- Similar texts have similar embeddings
- We're just finding which "cluster" the new text is closest to
- No classifier is trained: it's a pure geometric comparison (the class centers themselves are still computed from labeled examples)
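
The whole idea fits in a few lines once the embeddings exist. Below is a toy sketch in plain Python with made-up 2-D vectors standing in for real 768-dimensional embeddings; the names `mean_vector`, `cosine`, and `nearest_centroid` are hypothetical helpers, not library functions.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2-D "embeddings" grouped by label (made up for illustration)
toy_train = {
    "negative": [[-1.0, 0.2], [-0.8, -0.1]],
    "positive": [[0.9, 0.1], [1.1, -0.2]],
}

# One center ("prototype") per class
centroids = {label: mean_vector(vectors) for label, vectors in toy_train.items()}


def nearest_centroid(vector):
    """Return the label whose center is most similar to `vector`."""
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


print(nearest_centroid([0.7, 0.3]))   # positive
print(nearest_centroid([-0.5, 0.4]))  # negative
```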

#### 3. What is cosine similarity?

**Answer:**

**Cosine similarity** measures the **angle** between two vectors:

**Formula**:
```
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
```

**Range**: -1 to 1
- **1**: Vectors point in exactly the same direction (very similar meaning)
- **0**: Vectors are perpendicular (unrelated)
- **-1**: Vectors point in opposite directions (rare in practice for sentence embeddings, even for texts with opposite meaning)

**Why use cosine instead of Euclidean distance?**
- ✅ **Ignores magnitude**: Only cares about direction (semantic meaning)
- ✅ **Better for high-dimensional data**: Embeddings are 768-dimensional
- ✅ **Normalized**: Always in [-1, 1] range
- ✅ **Standard in NLP**: Used everywhere for text similarity

**Visual analogy**:
```
        Text A →
         /
        /  small angle = high similarity
       /
      Text B →
```
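
The formula is short enough to write out by hand. The sketch below uses plain Python lists and a hypothetical helper name `cosine_similarity_1d` (to avoid confusion with scikit-learn's `cosine_similarity`, which works on 2-D arrays); it shows the three reference cases from the range above.

```python
import math


def cosine_similarity_1d(a, b):
    """(A · B) / (||A|| × ||B||) for two plain Python vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different magnitude: similarity close to 1
same = cosine_similarity_1d([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors: similarity 0
perpendicular = cosine_similarity_1d([1.0, 0.0], [0.0, 3.0])

# Opposite directions: similarity close to -1
opposite = cosine_similarity_1d([1.0, 2.0], [-1.0, -2.0])

print(round(same, 6), round(perpendicular, 6), round(opposite, 6))
```

The first case is the reason cosine is preferred here: doubling a vector's length does not change the score.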

#### 4. When would zero-shot be better than supervised?

**Answer:**

**Zero-shot is better when**:

✅ **No labeled data available**:
- New use case with no examples
- Labeling is expensive/time-consuming

✅ **Need extreme flexibility**:
- Categories change frequently
- Want to classify into arbitrary categories
- Example: "Is this review about price, quality, or shipping?"

✅ **Many rare categories**:
- Long tail classification
- Not enough examples per category to train

✅ **Quick prototyping**:
- Testing ideas fast
- MVP development

✅ **Cold start problem**:
- Just launched a product/service
- Don't have historical data yet

**Supervised is better when**:
- You have 100+ labeled examples per class
- Need highest possible accuracy
- Categories are fixed
- Performance is critical

**Hybrid approach**:
- Start with zero-shot
- Collect labels from user feedback
- Gradually transition to supervised

#### 5. How do we describe our labels for zero-shot?

**Answer:**

**Label description is critical** for zero-shot performance!

**In our example**:
```python
label_embeddings = model.encode([
    "A negative review",  # Description for label 0
    "A positive review"   # Description for label 1
])
```

**Best practices for label descriptions**:

✅ **Be descriptive, not just the label name**:
- ❌ Bad: "negative", "positive"
- ✅ Good: "A negative movie review", "A positive movie review"
- ✅ Better: "This is a negative review expressing disappointment"

✅ **Match the domain/style of your documents**:
- For movie reviews: "This movie is terrible" vs "This movie is great"
- For product reviews: "The product is defective" vs "The product is excellent"

✅ **Use examples or prototypes**:
- "A negative review like: bad, terrible, awful, disappointing"

✅ **Be specific about what makes something belong to that category**:
- "A negative review that criticizes the movie"
- "A positive review that recommends the movie"

**Pro tip**: The better your label descriptions match the vocabulary and style of your documents, the better zero-shot will work!

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity

# Average the embeddings of all documents in each target label
df = pd.DataFrame(np.hstack([train_embeddings, np.array(data["train"]["label"]).reshape(-1, 1)]))
label_column = train_embeddings.shape[1]  # the labels were appended as the last column
averaged_target_embeddings = df.groupby(label_column).mean().values

print(f"Shape of averaged embeddings per class: {averaged_target_embeddings.shape}")
print("This gives us a 'prototype' or 'center' for each class")

In [None]:
# Find the best matching embeddings between evaluation documents and target embeddings
sim_matrix = cosine_similarity(test_embeddings, averaged_target_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

# Evaluate the model
evaluate_performance(data["test"]["label"], y_pred)

**An F1 score of 0.84 is quite impressive considering we did not train any classifier!** The labels were only used to compute one average embedding per class. This is a good illustration of why embeddings can be such a useful tool.

### Zero-Shot Classification

**Zero-shot classification** is when a model can classify text into categories it has never been explicitly trained on, simply by understanding the semantic relationship between the input text and candidate label descriptions.

In this setting we have no labeled data at all: we will try to predict the label of each input text even though the model was never trained on these labels 🔮

> To perform zero-shot classification with embeddings, there is a little trick that we can use. We can describe our labels based on what they should represent. For example, a negative label for movie reviews can be described as "This is a negative movie review." By describing and embedding the labels and documents, we have data that we can work with.

In [None]:
# Create embeddings for our labels
label_embeddings = model.encode(["A negative review", "A positive review"])

print(f"Label embeddings shape: {label_embeddings.shape}")
print("We now have embeddings for what 'negative' and 'positive' mean!")

To assign labels to documents, we can apply cosine similarity to the document-label pairs.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

print(f"Similarity matrix shape: {sim_matrix.shape}")
print(f"Each test document gets a similarity score to both labels")
print(f"\nExample similarity scores for first test document:")
print(f"  Similarity to 'negative': {sim_matrix[0][0]:.4f}")
print(f"  Similarity to 'positive': {sim_matrix[0][1]:.4f}")
print(f"  Predicted label: {'Negative' if y_pred[0] == 0 else 'Positive'}")

In [None]:
# Evaluate zero-shot performance
evaluate_performance(data["test"]["label"], y_pred)

**An F1 score of 0.78 is quite impressive considering we did not use any labels!** This is a good illustration of why embeddings can be such a useful tool.

### 📊 Comparison of All Approaches

Let's summarize what we learned:

In [None]:
# Summary comparison
comparison_df = pd.DataFrame({
    'Approach': [
        'Task-Specific RoBERTa',
        'Embeddings + Logistic Regression',
        'Averaged Embeddings (Nearest Centroid)',
        'Zero-Shot (Label Descriptions)'
    ],
    'F1-Score': ['~0.80', '~0.85', '~0.84', '~0.78'],
    'Training Time': ['0 (pre-trained)', 'Minutes', 'Seconds (averaging only)', '0'],
    'Labeled Data Needed': ['0', 'Full training set', 'Training set (for class averages)', '0'],
    'Flexibility': ['Low', 'Medium', 'High', 'Very High'],
    'Inference Cost': ['Transformer pass', 'Embedding + linear model', 'Embedding + cosine similarity', 'Embedding + cosine similarity']
})

print("\n" + "="*80)
print("COMPARISON OF ALL APPROACHES")
print("="*80)
print(comparison_df.to_string(index=False))
print("="*80)

---

## Text Classification with Generative Models

Generative language models like OpenAI's GPT differ fundamentally in their approach to classification compared to traditional methods.

Rather than following conventional classification paradigms, these models function as **sequence-to-sequence systems**: in short, **they receive text input and produce text output**.

While these generative models undergo training across diverse tasks, they typically cannot handle specialized use cases immediately. Consider feeding a movie review to such a model without additional guidance: the model would lack direction on how to process it.

To achieve meaningful results, we must **provide context and steer the model toward our desired outcomes**. This guidance occurs primarily through carefully crafted instructions, known as **prompts** 🎯

For our demo we will use the [Groq API](https://groq.com/) because OpenAI does not offer free API keys 😊

### 📚 Questions: Generative Models for Classification

#### 1. How do generative models differ from discriminative models?

**Answer:**

**Discriminative Models** (BERT, RoBERTa, traditional ML):
- Learn the **boundary** between classes
- Answer: "Given input X, what is the probability of class Y?"
- Formula: P(Y|X)
- Example: "This text is 85% likely to be positive"
- **Specialized**: Trained for specific tasks
- **Fast**: Direct classification

**Generative Models** (GPT, LLaMA, Claude):
- Learn to **generate** the next token
- Answer: "Given input X, what text should come next?"
- Formula: P(next token | previous tokens)
- Example: "This text is... positive" (generates the word "positive")
- **General-purpose**: Can do many tasks
- **Flexible**: Just change the prompt

**For classification**:
- Discriminative: Outputs a probability distribution over classes
- Generative: Generates text that represents the class

#### 2. What is prompt engineering?

**Answer:**

**Prompt engineering** is the art and science of crafting instructions to get desired behavior from language models.

**Key elements of a good prompt**:

1. **Role/Context**: "You are a sentiment classifier"
2. **Task Description**: "Rate the sentiment of this movie review"
3. **Format Specification**: "Respond with only 'positive' or 'negative'"
4. **Examples** (few-shot): Show 2-3 examples
5. **The Input**: The actual text to classify

**Example prompt evolution**:

❌ **Bad**: "positive or negative?"
```
Model might respond: "What do you mean?"
```

⚠️ **Better**: "Is this review positive or negative?"
```
Model might respond: "Well, it could be seen as..."
```

✅ **Good**: "Classify the sentiment of this movie review. Respond with only 'positive' or 'negative'."
```
Model responds: "positive"
```

**Advanced techniques**:
- Chain-of-thought: "Let's think step by step"
- Few-shot learning: Provide examples
- Temperature control: Adjust randomness
- System prompts: Set global behavior
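
The key elements listed above can be assembled programmatically. The following is a minimal sketch, assuming a plain-text prompt and a hypothetical helper `build_prompt`; it is independent of any particular API, and the exact wording is only one reasonable choice. Passing worked examples turns the same prompt into a few-shot one.

```python
def build_prompt(review, examples=None):
    """Assemble a prompt: role, task, output format, optional examples, then the input."""
    lines = [
        "You are a sentiment classifier.",                     # role / context
        "Classify the sentiment of the movie review below.",   # task description
        "Respond with only 'positive' or 'negative'.",         # format specification
    ]
    for text, label in examples or []:                         # optional worked examples
        lines.append(f'Review: "{text}"')
        lines.append(f"Sentiment: {label}")
    lines.append(f'Review: "{review}"')                        # the actual input
    lines.append("Sentiment:")
    return "\n".join(lines)


zero_shot_prompt = build_prompt("The movie was terrible")
few_shot_prompt = build_prompt(
    "The movie was terrible",
    examples=[("Amazing film!", "positive"), ("Waste of time", "negative")],
)
print(few_shot_prompt)
```

Ending the prompt with `Sentiment:` nudges the model to answer with the label alone, which keeps the output easy to parse.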

#### 3. What is few-shot learning vs zero-shot?

**Answer:**

**Zero-Shot**:
- **No examples** provided in the prompt
- Model relies purely on instructions
- Example:
```
Classify this review as positive or negative:
"The movie was terrible"
```

**Few-Shot** (1-shot, 2-shot, 5-shot, etc.):
- **Provide examples** in the prompt
- Model learns the pattern from examples
- Example:
```
Classify reviews as positive or negative:

Review: "Amazing film!" → positive
Review: "Waste of time" → negative

Review: "The movie was terrible" → ?
```

**Comparison**:

| Aspect | Zero-Shot | Few-Shot |
|--------|-----------|----------|
| Examples needed | 0 | 1-10 |
| Performance | Good baseline | Often better, especially on ambiguous tasks |
| Prompt length | Short | Longer |
| Token cost | Lower | Higher |
| Setup time | None | Minimal |

**When to use which**:
- **Zero-shot**: Simple, clear tasks (sentiment, spam detection)
- **Few-shot**: Complex or ambiguous tasks, domain-specific categories
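
With a chat API, one common way to go from zero-shot to few-shot is to add the examples as earlier user/assistant turns. The sketch below only builds the message list. `few_shot_messages` is a hypothetical helper, and the system prompt mirrors the one used later in this notebook.

```python
def few_shot_messages(review, examples):
    """Build chat messages where each example becomes a user/assistant turn pair."""
    messages = [{
        "role": "system",
        "content": "You are a sentiment classifier. Respond with only 'positive' or 'negative'.",
    }]
    for text, label in examples:
        messages.append({"role": "user", "content": f"Review: {text}"})
        messages.append({"role": "assistant", "content": label})
    # The review we actually want classified always comes last.
    messages.append({"role": "user", "content": f"Review: {review}"})
    return messages

zero_shot = few_shot_messages("The movie was terrible", examples=[])
two_shot = few_shot_messages(
    "The movie was terrible",
    examples=[("Amazing film!", "positive"), ("Waste of time", "negative")],
)
print(len(zero_shot), len(two_shot))  # prints: 2 6
```

Each example adds two messages, which is where the extra token cost in the table comes from.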

#### 4. Why do we ask for just the label and not an explanation?

**Answer:**

**Asking for just the label** ("Respond with only 'positive'"):

✅ **Faster**:
- Fewer tokens to generate
- Lower latency
- Cheaper (pay per token)

✅ **Easier to parse**:
- Simple string matching
- No need for complex parsing
- More reliable automation

✅ **More consistent**:
- Model can't ramble
- Format is predictable
- Easier to evaluate

✅ **Prevents errors**:
- With free-form answers, the model might say "It's positive because..."
- The label is then harder to extract
- More can go wrong

**When to ask for explanations**:
- User-facing applications (need to show reasoning)
- Debugging (understand model's thinking)
- High-stakes decisions (need justification)
- Active learning (to improve training data)

**Best of both worlds**:
```python
# Ask for structured output ({text} is filled in later with str.format)
prompt = """
Rate this review and explain briefly:
Format: label: [positive/negative], reason: [brief explanation]

Review: {text}
"""
```
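
A reply in that `label: ..., reason: ...` format still has to be parsed. Here is a minimal sketch with a regular expression. `parse_labeled_reply` is a hypothetical helper, and the sample replies are invented.

```python
import re

# Accept the label and reason with or without the square brackets from the format hint.
PATTERN = re.compile(
    r"label:\s*\[?(positive|negative)\]?\s*,\s*reason:\s*\[?(.*?)\]?\s*$",
    re.IGNORECASE | re.DOTALL,
)

def parse_labeled_reply(reply):
    """Return (label, reason), or None when the reply does not follow the format."""
    match = PATTERN.search(reply.strip())
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).strip()

print(parse_labeled_reply("label: positive, reason: the acting is praised"))
print(parse_labeled_reply("Label: [negative], reason: [dull plot]"))
print(parse_labeled_reply("I think it's pretty good overall"))
```

Returning `None` instead of guessing makes format violations visible, so they can be counted or retried.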

#### 5. How do we handle models that output scores/probabilities?

**Answer:**

**Two approaches for getting confidence scores**:

**Approach 1: Ask for a numerical score**
```python
prompt = """
Rate the sentiment of this review on a scale from 0 to 1:
0 = very negative, 1 = very positive

Review: {text}
Score:
"""
```

**Approach 2: Use token log-probabilities** (API-dependent)
```python
# Some APIs can return per-token log-probabilities; check your provider's docs,
# since not every endpoint supports this parameter
response = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[...],
    logprobs=True,   # Ask for log-probabilities of the generated tokens
    temperature=0    # Greedy decoding, so output is as repeatable as the API allows
)

# Extract probability for "positive" vs "negative"
```
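
The last step, turning log-probabilities into class probabilities, could look like the sketch below. It assumes the candidate first tokens and their log-probabilities are already in a plain dict. That shape is hypothetical, since real response formats differ between providers and a label may be split across several tokens.

```python
import math

def class_probabilities(top_logprobs, labels=("positive", "negative")):
    """Renormalize first-token log-probabilities over the labels we care about."""
    # A label missing from the returned candidates is treated as probability 0.
    raw = {
        label: math.exp(top_logprobs[label]) if label in top_logprobs else 0.0
        for label in labels
    }
    total = sum(raw.values())
    if total == 0:
        return None  # neither label appeared among the candidates
    return {label: p / total for label, p in raw.items()}

# Hypothetical log-probabilities for the first generated token.
top_logprobs = {"positive": -0.22, "negative": -1.90, "The": -4.10}
probs = class_probabilities(top_logprobs)
print(probs)
```

Renormalizing over just the two labels discards probability mass the model put on unrelated tokens such as "The".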

**In our example**:
- We ask: "Rate the sentiment... Score:"
- Model outputs: "0.87" (a self-reported score, not a calibrated probability)
- We parse it as a float
- Convert to binary: ≥ 0.5 = positive, < 0.5 = negative

**Why this is useful**:
- Can set custom thresholds (e.g., only show if confidence > 0.8)
- Can identify uncertain predictions
- Better for active learning (label uncertain cases first)
- Gives a score comparable in form to a discriminative model's output
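
A custom threshold with an "uncertain" band can be sketched as follows. `decide` and the `margin` value are hypothetical and would need tuning on real data.

```python
def decide(score, threshold=0.5, margin=0.3):
    """Map a 0-1 score to a label, abstaining when it sits too close to the threshold."""
    if abs(score - threshold) < margin:
        return "uncertain"
    return "positive" if score >= threshold else "negative"

scores = [0.87, 0.55, 0.10, 0.45]  # invented example scores
decisions = [decide(s) for s in scores]
# Uncertain cases are the natural candidates for manual labeling first.
to_review = [s for s, d in zip(scores, decisions) if d == "uncertain"]

print(decisions)
print(to_review)
```

With these settings only scores at least 0.3 away from 0.5 produce a label, and the rest are set aside for review.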

In [None]:
# Install groq
!pip install groq

In [None]:
# Set up your API key
# You need to get a free API key from https://console.groq.com/
import os

# Option 1: Set environment variable (recommended)
# export GROQ_API_KEY="your_api_key_here"

# Option 2: Set it in code (not recommended for production)
# setdefault keeps a key that is already set (Option 1) instead of overwriting it
os.environ.setdefault("GROQ_API_KEY", "YOUR_API_KEY")  # Replace with your actual key

print("⚠️ Remember to replace YOUR_API_KEY with your actual Groq API key!")
print("Get it from: https://console.groq.com/")

In [None]:
# Test with a single example
sample_text = data["test"]["text"][0]
print(f"Sample Review: {sample_text}\n")

In [None]:
import os
from groq import Groq
from dotenv import load_dotenv

load_dotenv()

# Initialize Groq client
client = Groq(
    api_key=os.getenv("GROQ_API_KEY"),
)

# Create a simple prompt
chat_completion = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a sentiment classifier. Respond with only 'positive' or 'negative'."
        },
        {
            "role": "user",
            "content": f"Classify the sentiment of this movie review: {sample_text}"
        }
    ],
    temperature=0,
    max_tokens=10
)

print(f"Model prediction: {chat_completion.choices[0].message.content}")

**Alternatively, we can ask for a score when we need more granularity:**

In [None]:
# Get a sentiment score between 0 and 1 instead of a label
chat_completion = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a sentiment classifier. Rate the sentiment of movie reviews on a scale from 0 to 1, where 0 is very negative and 1 is very positive. Respond with only a number."
        },
        {
            "role": "user",
            "content": f"Rate the sentiment of this movie review: {sample_text}\n\nScore:"
        }
    ],
    temperature=0,
    max_tokens=10
)

score = chat_completion.choices[0].message.content
print(f"Sentiment score: {score}")

Let's evaluate this classifier with the same classification report and see how it performs 🤔

> **Keep in mind this is a very simple prompt.** If you need more control over the LLM output, see how to structure model output in the [OpenAI docs](https://platform.openai.com/docs/guides/structured-outputs).

In [None]:
# Create a function for generation
def groq_generation(prompt, model="meta-llama/llama-4-scout-17b-16e-instruct"):
    """Generate sentiment classification using Groq API"""
    message = [
        {
            "role": "system",
            "content": "You are a sentiment classifier. Rate the sentiment of movie reviews on a scale from 0 to 1, where 0 is very negative and 1 is very positive. Respond with only a number."
        },
        {
            "role": "user",
            "content": f"Rate the sentiment of this movie review: {prompt}\n\nScore:"
        }
    ]
    
    chat_completion = client.chat.completions.create(
        model=model,
        messages=message,
        temperature=0,
        max_tokens=10
    )
    
    return chat_completion.choices[0].message.content

In [None]:
# Test on one example
result = groq_generation(sample_text)
print(f"Score for sample: {result}")

In [None]:
# Run on test set (this will take a few minutes)
from tqdm import tqdm

print("Generating predictions for test set...")
print("This may take a few minutes...\n")

predictions = []
# Note: a head slice is only representative if the split is not ordered by label
for doc in tqdm(data["test"]["text"][:100]):  # Limit to 100 for demo (remove [:100] for full dataset)
    pred = groq_generation(doc)
    predictions.append(pred)

print(f"\nGenerated {len(predictions)} predictions")

In [None]:
# Convert scores to binary predictions
y_pred = []
for pred in predictions:
    try:
        score = float(pred.strip())
        # Convert to binary: 0 if score < 0.5, else 1
        y_pred.append(0 if score < 0.5 else 1)
    except ValueError:
        # If can't parse as float, try to detect keywords
        if "negative" in pred.lower():
            y_pred.append(0)
        else:
            y_pred.append(1)

print(f"Converted {len(y_pred)} predictions to binary labels")
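
The conversion cell above falls back to positive whenever parsing fails, which can hide formatting problems. A stricter variant is sketched below. `parse_score` is a hypothetical helper that pulls the first number out of a reply and reports a failure instead of guessing.

```python
import re

NUMBER = re.compile(r"-?(?:\d+(?:\.\d*)?|\.\d+)")

def parse_score(reply):
    """Return the first number in the reply if it lies in [0, 1], else None."""
    match = NUMBER.search(reply)
    if match is None:
        return None
    value = float(match.group())
    return value if 0.0 <= value <= 1.0 else None

replies = ["0.87", "Score: 0.2", "I'd say 7", "negative"]  # invented model replies
scores = [parse_score(r) for r in replies]
unparsed = sum(s is None for s in scores)

print(scores)
print(f"{unparsed} of {len(replies)} replies could not be parsed")
```

Tracking the unparsed count separately shows how often the model ignores the requested format, which would otherwise distort the classification report.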

In [None]:
# Evaluate performance (only on the subset we tested)
evaluate_performance(data["test"]["label"][:len(y_pred)], y_pred)

---

## Text-to-Text Transfer Transformers (T5)

Let's explore a final technique: **text-to-text transfer transformers**, or T5 models. 🔄 The architecture is similar to the original Transformer, with an encoder and a decoder stacked together.

T5 reframes every common NLP task, such as translation, summarization, classification, and question answering, **as input text → output text, simplifying model design and enabling multitask learning**.

T5 was trained on the [Colossal Clean Crawled Corpus](https://www.tensorflow.org/datasets/catalog/c4) with a self-supervised objective called **span corruption**, giving it strong generalization across NLP tasks.

> Because T5 generates text tokens for answers and labels, it needs no task-specific heads or architectures, and instruction-tuned variants such as Flan-T5 handle zero-shot, few-shot, and instruction-based tasks well 😎
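
To make span corruption concrete, here is a toy sketch of how one training pair is built. The spans are picked by hand (real pre-training samples them randomly), `span_corrupt` is a hypothetical helper, and the `<extra_id_N>` sentinel names follow the Hugging Face T5 tokenizer convention.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; the target lists the dropped spans.

    `spans` must be sorted, non-overlapping, half-open index ranges.
    """
    corrupted, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        corrupted.extend(tokens[cursor:start])  # keep the text before the span
        corrupted.append(sentinel)              # hide the span behind a sentinel
        target.append(sentinel)
        target.extend(tokens[start:end])        # the model must reproduce the span
        cursor = end
    corrupted.extend(tokens[cursor:])
    target.append(f"<extra_id_{len(spans)}>")   # final sentinel closes the target
    return " ".join(corrupted), " ".join(target)

tokens = "Thank you for inviting me to your party last week .".split()
model_input, model_target = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
print(model_input)
print(model_target)
```

The model reads the corrupted sentence and learns to generate only the missing pieces, so pre-training itself is already a text-to-text task.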

### 📚 Questions: T5 Models

#### 1. What is T5 and how does it differ from BERT/GPT?

**Answer:**

**T5 (Text-to-Text Transfer Transformer)**:

**Architecture**:
- **Encoder-Decoder** (like original Transformer)
- Both encoder and decoder use self-attention
- The decoder also cross-attends to the encoder output, so it sees both the input and the previously generated tokens

**Key Innovation**:
- **Everything is text-to-text**
- All tasks reformulated as: input text → output text

**Comparison**:

| Model | Architecture | Training | Best For |
|-------|-------------|----------|----------|
| **BERT** | Encoder only | Masked LM | Classification, NER |
| **GPT** | Decoder only | Next token prediction | Text generation |
| **T5** | Encoder-Decoder | Span corruption | All tasks! |

**T5 advantages**:
- ✅ Single model for all tasks
- ✅ Natural task specification (just describe in text)
- ✅ Flexible output format
- ✅ Good at both understanding and generation

#### 2. What does "text-to-text" mean?

**Answer:**

**Text-to-text** means every task is framed as:
- **Input**: Text string
- **Output**: Text string

**Examples**:

**Translation**:
```
Input: "translate English to French: Hello"
Output: "Bonjour"
```

**Summarization**:
```
Input: "summarize: [long article]"
Output: "[short summary]"
```

**Classification**:
```
Input: "Is the following sentence positive or negative? The movie was great."
Output: "positive"
```

**Question Answering**:
```
Input: "question: What is the capital of France? context: Paris is the capital..."
Output: "Paris"
```

**Benefits**:
- Unified API for all tasks
- Easy to add new tasks (just change the prompt)
- Natural for multitask learning
- Flexible output format
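
The "unified API" idea can be sketched as a table of templates. `TEMPLATES` and `to_text_to_text` are hypothetical names, and the template wording is taken from the examples in this answer.

```python
# One template per task; the model call itself would be identical for all of them.
TEMPLATES = {
    "translate": "translate English to French: {text}",
    "summarize": "summarize: {text}",
    "sentiment": "Is the following sentence positive or negative? {text}",
}

def to_text_to_text(task, text):
    """Render any supported task as a single input string for a text-to-text model."""
    return TEMPLATES[task].format(text=text)

for task, text in [("translate", "Hello"), ("sentiment", "The movie was great.")]:
    print(to_text_to_text(task, text))
```

Adding a new task means adding one template line; nothing about the model or the inference code changes.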

#### 3. What is the "t5" prompt format?

**Answer:**

T5 uses **task prefixes** to specify what to do:

**Format**: `task_name: input_text`

**Common prefixes**:
- `translate English to French:`
- `summarize:`
- `question: ... context: ...`
- `sst2 sentence:` (sentiment classification in the original T5)

**In our example** (a natural-language question rather than a fixed prefix, which suits the instruction-tuned Flan-T5 we load below):
```python
prompt = "Is the following sentence positive or negative? " + review
```

**Why use prefixes?**
- Tells the model which task to perform
- Activates task-specific knowledge from training
- Consistent with T5's training format

**Alternative formats** (accuracy varies with the model and the wording, so compare them on held-out data):
```python
# Prefix style
"sentiment: " + review

# Question format (what we use)
"Is the following sentence positive or negative? " + review

# Instruction format
"Classify this review as positive or negative: " + review
```

#### 4. How do we convert T5 text output to labels?

**Answer:**

**In our example**:

**Step 1**: T5 generates text
```python
output = pipe(...)
text = output[0]["generated_text"]
# text = "negative" or "positive"
```

**Step 2**: Map text to numeric labels
```python
# Check if "negative" appears in output
if "negative" in text.lower():
    label = 0
else:
    label = 1
```

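**More robust mapping**:
```python
label_map = {
    "negative": 0,
    "positive": 1
}

def text_to_label(text, default=1):
    """Return the id of the first label word found in the output, else `default`."""
    for key, value in label_map.items():
        if key in text.lower():
            return value
    return default
```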

**Handling errors**:
- Model might say "It's negative because..."
- Use keyword matching (check for "negative" or "positive")
- Could use regex for more robust parsing
- Default to most common class if uncertain

**Why this works**:
- T5 is trained to generate labels as text
- We just need to parse its natural language output
- More flexible than fixed class indices
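
The regex idea from the error-handling list could be sketched like this. `extract_label` is a hypothetical helper that matches whole words only and refuses to guess when the output mentions both labels or neither.

```python
import re

LABELS = {"negative": 0, "positive": 1}

def extract_label(text, default=None):
    """Return the label id only when exactly one label word appears in the text."""
    lowered = text.lower()
    found = {word for word in LABELS if re.search(rf"\b{word}\b", lowered)}
    if len(found) == 1:
        return LABELS[found.pop()]
    return default  # no label found, or both labels mentioned

for output in ["negative", "It's Positive because...", "positive and negative aspects", "unsure"]:
    print(repr(output), "->", extract_label(output))
```

Passing `default=1` (or whichever class is most common) reproduces the "default to most common class" fallback, while `default=None` lets us count the ambiguous outputs.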

#### 5. What are the trade-offs of T5 vs task-specific models?

**Answer:**

**T5 Advantages**:
- ✅ **Flexibility**: One model for all tasks
- ✅ **Easy to adapt**: Just change the prompt
- ✅ **Natural output**: Generates text labels
- ✅ **Transfer learning**: Benefits from multitask training
- ✅ **Few-shot learning**: Works with minimal examples

**T5 Disadvantages**:
- ❌ **Slower**: Runs an encoder pass, then decodes the answer token by token
- ❌ **Larger**: More parameters than comparable encoder-only models
- ❌ **Less accurate**: Often trails a fine-tuned specialist on a single task (jack-of-all-trades, master of none)
- ❌ **Parsing needed**: Must convert text output to labels
- ❌ **More tokens**: Generates text, not just logits

**Task-Specific Models** (like RoBERTa):
- ✅ **Faster**: Encoder only
- ✅ **More accurate**: Specialized for the task
- ✅ **Direct output**: Probability distribution over classes
- ✅ **Smaller**: Fewer parameters
- ❌ **Limited**: One model per task
- ❌ **Requires fine-tuning**: Need labeled data

**When to use T5**:
- Multiple different tasks
- Need flexibility to change tasks
- Few labeled examples available
- Want natural language output

**When to use task-specific**:
- Single task, need best performance
- Latency-critical applications
- Large labeled dataset available
- Deployment constraints (memory/compute)

In [None]:
# Load T5 model for text-to-text generation
import torch  # needed for the CUDA availability check below
from transformers import pipeline

pipe = pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    device="cuda:0" if torch.cuda.is_available() else "cpu"
)

print("T5 model loaded!")

In [None]:
# Prepare our data with T5 format
prompt = "Is the following sentence positive or negative? "
data = data.map(lambda example: {"t5": prompt + example["text"]})
data

Since this model generates text, we need to map its output to numeric labels: 0 for negative and 1 for positive. Then we can run our evaluation 🤓

In [None]:
# Run inference
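from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset  # streams one dataset column into the pipeline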

y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "t5")), total=len(data["test"])):
    text = output[0]["generated_text"]
    # Check if text contains "negative" to assign 0, else 1
    y_pred.append(0 if "negative" in text.lower() else 1)

In [None]:
# Evaluate performance
evaluate_performance(data["test"]["label"], y_pred)

**GG for the Flan T5 model, 0.84 is a very good first F1 score** 🏆

---

## Conclusion

I hope you now have a better understanding of text classification and how to handle it with and without generative models. We now know that **pre-trained models are very good at classifying text!**

We also know that we can use embeddings as input features to train classifiers. **In the next episode we'll explore the world of text clustering and topic modeling** 📊

### 🎯 Final Summary

**What we learned**:

1. **Task-Specific Models** (RoBERTa):
   - Pre-trained and ready to use
   - Good accuracy (~80%)
   - No training needed

2. **Embeddings + Supervised Learning**:
   - Best accuracy (~85%)
   - Fast training
   - Requires labeled data

3. **Zero-Shot with Embeddings**:
   - No labeled data needed
   - Good performance (~78-84%)
   - Very flexible

4. **Generative Models** (GPT, LLaMA via Groq):
   - Natural language interface
   - Easy to customize with prompts
   - Slower but very flexible

5. **T5 Models**:
   - Unified text-to-text framework
   - Good performance (~84%)
   - Easy to adapt to new tasks

**Key takeaways**:
- Embeddings are powerful and versatile
- Simple classifiers on embeddings can outperform complex models
- Zero-shot is viable for many tasks
- Choose approach based on: data availability, performance needs, flexibility requirements