# Text Classification Lab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HassanAlgoz/B5/blob/main/W5_NLP/M3/labs/01_Text_Classification.ipynb)

## Overview

This notebook explores three different approaches to text classification using pre-trained models:
1. **Task-specific models**: Using models fine-tuned for sentiment analysis
2. **Embedding models + Classifier**: Using general-purpose embeddings with a trained classifier
3. **Embedding models + Cosine Similarity**: Zero-shot classification without labeled data

We'll work with the Rotten Tomatoes movie review dataset to classify reviews as positive or negative.

## Getting Started: Loading the Dataset

Let's start by loading the Rotten Tomatoes dataset. This dataset contains movie reviews labeled as positive or negative.

In [1]:
from datasets import load_dataset

data = load_dataset("rotten_tomatoes")
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

### Investigate: Explore the Dataset

**Exercise**: Before running the code below, predict what you think the structure of the data will be:
- What keys will be in each example?
- What will the labels look like (what values will they have)?
- How many examples are in train vs test?

Now let's examine the data:
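The cell for this step is not shown here. As a minimal sketch, the hypothetical helper `describe_split` below reports the keys and the label counts of any iterable of example dicts. The three sample rows are made up for illustration; in the notebook you would pass a real split such as `data["train"]`.

```python
from collections import Counter

def describe_split(rows):
    """Return the sorted example keys and a count of each label value."""
    rows = list(rows)
    keys = sorted(rows[0].keys()) if rows else []
    label_counts = Counter(row["label"] for row in rows)
    return keys, dict(label_counts)

# Made-up rows that mimic the dataset's 'text' / 'label' structure
sample_rows = [
    {"text": "a charming and often affecting journey", "label": 1},
    {"text": "a dull, plodding mess", "label": 0},
    {"text": "funny and heartfelt", "label": 1},
]

keys, label_counts = describe_split(sample_rows)
print("Keys:", keys)
print("Label counts:", label_counts)
```

Calling `describe_split(data["train"])` and `describe_split(data["test"])` lets you compare the result with your predictions from the exercise.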

---

## Section A: Using a Task-specific Model

### Introduction to Hugging Face Transformers and Pipelines

Before we dive into classification, let's get familiar with **Hugging Face Transformers** - one of the most popular libraries for working with pre-trained language models.

#### What is Hugging Face Transformers?

**Hugging Face Transformers** is a Python library that provides easy access to thousands of pre-trained models for Natural Language Processing (NLP). These models have been trained on massive amounts of text data and can understand language patterns, making them incredibly powerful for various tasks like:
- Text classification (sentiment analysis, spam detection, etc.)
- Question answering
- Text generation
- Translation
- And much more!

#### What is a Pipeline?

A **pipeline** is Hugging Face's high-level API that makes it incredibly easy to use pre-trained models. Think of it as a "one-stop shop" that handles all the complex steps for you:

1. **Loading the model**: Downloads and loads the pre-trained model
2. **Tokenization**: Converts text into numbers the model can understand
3. **Inference**: Runs the model to make predictions
4. **Post-processing**: Formats the output in a readable way
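To make the post-processing step concrete, here is a rough sketch of what it amounts to for a classifier: raw model scores (logits) are turned into probabilities with a softmax and paired with label names. The logits and label names below are made up, and `postprocess` is a hypothetical helper, not the library's own function.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits, id2label):
    """Pair each label name with its probability."""
    probs = softmax(logits)
    return [{"label": id2label[i], "score": p} for i, p in enumerate(probs)]

# Hypothetical logits for a two-class sentiment model
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
scores = postprocess([-2.0, 3.0], id2label)
best = max(scores, key=lambda s: s["score"])
print(best["label"], round(best["score"], 4))
```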

**Why use pipelines?**
- **Simplicity**: You can classify text in just a few lines of code
- **No deep learning knowledge required**: The pipeline handles all the technical details
- **Consistent interface**: Same API for different models and tasks
- **Production-ready**: Optimized for real-world use

#### A Simple Example

Here's what using a pipeline looks like (we'll see this in action soon):

```python
from transformers import pipeline

# Create a sentiment analysis pipeline
classifier = pipeline("sentiment-analysis")

# Use it!
result = classifier("I love this movie!")
print(result)
# Output: [{'label': 'POSITIVE', 'score': 0.9998}]
```

That's it! No model architecture knowledge, no tokenization code, no manual inference - just simple, powerful text classification.

Now let's use this powerful tool to classify our movie reviews!

### Predict Phase

**Before running the code below, think about:**
1. What do you think `pipeline` does? What are its advantages?
2. What does `return_all_scores=True` mean?
3. Why might we specify `device="cuda"`?
4. What will the output format look like?

### Run Phase

Now let's create our pipeline. We'll use a specific model that's been trained on Twitter data for sentiment analysis:

In [2]:
from transformers import pipeline

# Path to our Hugging Face model
# This model was trained on Twitter data for sentiment analysis
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Create a pipeline for sentiment analysis
# - model: specifies which pre-trained model to use
# - tokenizer: converts text to numbers (usually same as model name)
# - return_all_scores: returns scores for all classes, not just the top one
# - device: "cuda" for GPU (faster), "cpu" for CPU (works everywhere)
pipe = pipeline(
    "sentiment-analysis",  # The task we want to perform
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,
    device="cuda"
)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
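Hard-coding `device="cuda"` will typically fail on a machine without a GPU. One hedge is to pick the device at run time. The helper name `pick_device` is our own; it assumes PyTorch is the backend and falls back to the CPU if PyTorch is not installed or no CUDA device is visible.

```python
def pick_device():
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Using device: {device}")
```

You could then pass `device=device` to `pipeline(...)` instead of the literal string.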


Now let's run inference on the entire test set:

In [3]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total=len(data["test"])):
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)


**Investigate**:
- Why do we use `output[0]` and `output[2]`? What is `output[1]`?
- What does `np.argmax` do? Why do we use it here?
- What are the possible values in `y_pred`? How do they map to positive/negative?
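Indexing by position (`output[0]`, `output[2]`) relies on the scores arriving in a fixed order. As a sketch of a more defensive variant, the hypothetical helper below looks scores up by label name instead. It assumes the model reports the labels `negative`, `neutral`, and `positive`, and the sample scores are made up. This matters if you switch from `return_all_scores=True` to `top_k=None`, which newer `transformers` releases recommend and which, as far as we can tell, may return the scores sorted by value rather than by class index.

```python
def to_binary_label(output):
    """Map a list of {'label', 'score'} dicts to 0 (negative) or 1 (positive).

    The neutral score is ignored, mirroring the inference loop in the notebook.
    """
    scores = {item["label"].lower(): item["score"] for item in output}
    return int(scores["positive"] > scores["negative"])

# Made-up scores, deliberately listed out of class order
sample_output = [
    {"label": "neutral", "score": 0.55},
    {"label": "positive", "score": 0.35},
    {"label": "negative", "score": 0.10},
]
print(to_binary_label(sample_output))
```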

In [4]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

Now let's evaluate the performance:

In [6]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066
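The numbers in the report follow from simple counts. As a sketch, the hypothetical helper below computes precision, recall, and F1 for one class from two label lists. The toy labels are made up and are not the notebook's predictions.

```python
def precision_recall_f1(y_true, y_pred, positive_class=1):
    """Compute precision, recall, and F1 for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_class and p == positive_class)
    fp = sum(1 for t, p in pairs if t != positive_class and p == positive_class)
    fn = sum(1 for t, p in pairs if t == positive_class and p != positive_class)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 4 positives and 4 negatives
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true_toy, y_pred_toy)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Read this way, the report says the Twitter model recovers 88% of the negative reviews (recall) but only 72% of the positive ones.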



**Note**: One way to improve on this result is to choose a model trained on data from our own domain (movie reviews in this case), such as DistilBERT base uncased fine-tuned on SST-2. The model used here was trained on tweets, not reviews.

---

## Section B: Using an Embedding Model + Classifier Head

### Introduction to Sentence Transformers

What if we cannot find a model that was pretrained for this specific task? Do we need to fine-tune a representation model ourselves? The answer is no!

Fine-tuning the model yourself is an option when you have sufficient compute available, but not everyone does. This is where general-purpose embedding models come in.

#### What is Sentence Transformers?

**Sentence Transformers** is a Python library built on top of Hugging Face Transformers that specializes in creating **embeddings** - numerical representations of text that capture semantic meaning.

The model `sentence-transformers/all-mpnet-base-v2` we'll use:
- Maps sentences & paragraphs to a **768-dimensional** dense vector space
- Each dimension captures some aspect of the text's meaning
- Can be used for tasks like clustering, semantic search, or (as we'll see) classification

#### The Strategy: Embeddings + Classifier

Instead of using a task-specific model, we'll:
1. **Convert text to embeddings** using Sentence Transformers (frozen, no training needed)
2. **Train a simple classifier** (like Logistic Regression) on top of these embeddings

This approach gives us:
- ✅ Flexibility to adapt to any classification task
- ✅ Fast training (only the classifier needs training, not the embedding model)
- ✅ Good performance with less computational resources
- ✅ Ability to reuse embeddings for multiple tasks

### Run Phase

Let's load a Sentence Transformer model and convert our text to embeddings:

In [7]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained Sentence Transformer model
# This model converts text into 768-dimensional vectors
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

### Investigate Phase

**Exercise**: Try encoding a single sentence and examine its embedding. What do you notice about the values?
```python
# Try this:
single_embedding = model.encode("This is a test sentence")
print(f"Shape: {single_embedding.shape}")
print(f"Sample values: {single_embedding[:10]}")  # a single sentence gives a 1-D vector
print(f"Min: {single_embedding.min()}, Max: {single_embedding.max()}")
```

**Exercise**: Compare embeddings of similar vs different sentences. What patterns do you see?
```python
# Try this:
similar1 = model.encode("I love this movie")
similar2 = model.encode("This film is amazing")
different = model.encode("The weather is nice today")

# Calculate cosine similarity (we'll learn about this in Section C)
from sklearn.metrics.pairwise import cosine_similarity
print("Similar sentences:", cosine_similarity([similar1], [similar2])[0][0])
print("Different sentences:", cosine_similarity([similar1], [different])[0][0])
```

In [8]:
# Convert our text data to embeddings
# Each review becomes a vector of 768 numbers
train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings = model.encode(data["test"]["text"], show_progress_bar=True)

Let's check the shape of our embeddings to understand what we've created:

In [9]:
train_embeddings.shape

(8530, 768)
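A quick back-of-the-envelope check on that shape: the sketch below estimates the memory footprint of the training embeddings and the number of encoding batches. It assumes the embeddings are stored as 32-bit floats and that `encode` ran with a batch size of 32. Both are assumptions about library defaults that you should confirm for your installed version (for example via `train_embeddings.dtype`).

```python
import math

n_train, n_test, dim = 8530, 1066, 768
bytes_per_float = 4   # assumption: float32 embeddings
batch_size = 32       # assumption: default batch size of encode()

train_megabytes = n_train * dim * bytes_per_float / 1e6
train_batches = math.ceil(n_train / batch_size)
test_batches = math.ceil(n_test / batch_size)

print(f"Train embeddings: about {train_megabytes:.1f} MB")
print(f"Batches: {train_batches} (train), {test_batches} (test)")
```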

Now let's train a simple classifier on top of these embeddings. We'll use Logistic Regression - a fast, interpretable classifier that works well with embeddings:

In [10]:
from sklearn.linear_model import LogisticRegression

# Train a classifier on embeddings
clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, data["train"]["label"])
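For intuition about what the trained classifier does at prediction time, here is a sketch of binary logistic regression inference written out by hand: a weighted sum of the embedding's dimensions plus a bias, squashed by a sigmoid. The three-dimensional weights and embedding are made up; with the fitted model, the corresponding values live in `clf.coef_` and `clf.intercept_`.

```python
import math

def predict_proba_positive(embedding, weights, bias):
    """Probability of the positive class under a binary logistic regression."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Made-up parameters and a made-up 3-dimensional "embedding"
weights = [1.5, -2.0, 0.5]
bias = -0.1
embedding = [0.8, 0.1, 0.4]

p = predict_proba_positive(embedding, weights, bias)
label = int(p > 0.5)  # 1 = positive review, 0 = negative review
print(f"P(positive) = {p:.3f} -> predicted label {label}")
```

Only these weights and the bias are learned in this section; the embedding model itself stays frozen.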


Now let's evaluate our classifier on the test set:

In [11]:
# Predict and evaluate
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066



**Investigate**:

- What are the advantages of this approach?

**Result**: By training a classifier on top of our embeddings, we managed to get an F1 score of 0.85! This demonstrates the possibilities of training a lightweight classifier while keeping the underlying embedding model frozen.

## Section C: Using Just the Embedding Model (Headless) + Cosine Similarity

**What If We Do Not Have Labeled Data?**

Getting labeled data is a resource-intensive task that can require significant human labor. Moreover, is it actually worthwhile to collect these labels?

To perform **zero-shot classification** with embeddings, there is a neat trick that we can use. We can describe our labels based on what they should represent. For example, a negative label for movie reviews can be described as “This is a negative movie review.” By describing and embedding the labels and documents, we have data that we can work with. This process, as illustrated in Figure 4-14, allows us to generate our own target labels without the need to actually have any labeled data.

<img src="https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098150952/files/assets/holl_0414.png" alt="Figure 4-14. To embed the labels, we first need to give them a description, such as “a negative movie review.” This can then be embedded through sentence-transformers.">

Figure 4-14. To embed the labels, we first need to give them a description, such as “a negative movie review.” This can then be embedded through sentence-transformers.


In [12]:
# Create embeddings for our labels
label_embeddings = model.encode([
    "A negative review",
    "A positive review"
])

<img src="https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098150952/files/assets/holl_0415.png">

Figure 4-15. The cosine similarity is the cosine of the angle between two vectors or embeddings. In this example, we calculate the similarity between a document and the two possible labels, positive and negative.
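Concretely, for two vectors $a$ and $b$ the cosine similarity is $\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. A minimal plain-Python sketch, using made-up two-dimensional vectors in place of real 768-dimensional embeddings:

```python
import math

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine([1.0, 0.0], [0.0, 3.0]))   # perpendicular
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The value depends only on the direction of the vectors, not on their length, so a long review and a short label description can still be compared fairly.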


<img src="https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098150952/files/assets/holl_0416.png" />

Figure 4-16. After embedding the label descriptions and the documents, we can use cosine similarity for each label document pair.


In [13]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)
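To see the two lines above at small scale, the sketch below repeats the procedure in plain Python: build the document-by-label similarity matrix, then take the index of the largest value in each row. All vectors are made-up two-dimensional stand-ins for real embeddings, and `cosine` and `nearest_label` are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_label(doc_vectors, label_vectors):
    """For each document, return the index of the most similar label."""
    predictions = []
    for doc in doc_vectors:
        row = [cosine(doc, label) for label in label_vectors]  # one row of the similarity matrix
        predictions.append(row.index(max(row)))                # argmax over that row
    return predictions

# Made-up vectors: index 0 = negative label, index 1 = positive label
label_vectors = [[-1.0, 0.2], [1.0, 0.2]]
doc_vectors = [[0.9, 0.5], [-0.7, 0.1], [0.3, 0.9]]

toy_pred = nearest_label(doc_vectors, label_vectors)
print(toy_pred)
```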

And that is it! We only needed to come up with names for our labels to perform our classification tasks. Let’s see how well this method works:

In [14]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.78      0.77      0.78       533
Positive Review       0.77      0.79      0.78       533

       accuracy                           0.78      1066
      macro avg       0.78      0.78      0.78      1066
   weighted avg       0.78      0.78      0.78      1066



#### Improve our label embeddings

Let's try improving our label descriptions by:
1. making them more polar by adding the word **"very"**, and
2. making them more specific by adding the word **"movie"**

In [15]:
# Create embeddings for our labels
label_embeddings = model.encode([
    "A very negative movie review",
    "A very positive movie review"
])

In [16]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

In [17]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.86      0.73      0.79       533
Positive Review       0.76      0.88      0.82       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066



Do you notice the performance increase? Accuracy rose from 0.78 to 0.80, and the macro-average F1 score moved the same way.

> The author [(Jay Alammar)](https://jalammar.github.io/) notes that using NLI-based [zero-shot classification](https://huggingface.co/tasks/zero-shot-classification) **is better than using embedding models**. However, the approach here was chosen to illustrate the **versatility of embedding models**. We will look at **Natural Language Inference (NLI)** in the next notebook, Inshallah.