<a href="https://colab.research.google.com/github/ShubhamW248/LLM-Practice/blob/main/TextClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Text Classification

Text classification is a fundamental task in Natural Language Processing (NLP) where the goal is to assign predefined categories to textual data. In this notebook, we explore **Text Classification with Representation Models** and **Classification with Generative Models**, highlighting supervised, zero-shot, and task-specific approaches.

---

## Text Classification with Representation Models

Representation models use embeddings or specialized architectures to process and classify text. We will study:

1. **Using a Task-specific Model**  
   Task-specific models are trained or fine-tuned for specific classification tasks such as sentiment analysis, spam detection, or topic categorization.

2. **Classification Tasks that Leverage Embeddings**  
   - **Supervised Classification**: Pre-labeled data is used to train the model for tasks like news categorization or emotion detection.  
   - **Zero-shot Classification**: Models generalize to new tasks or labels without explicit training on them, leveraging embeddings or pre-trained representations.

---

## Classification with Generative Models

Generative models offer a versatile approach by generating text or probabilities directly for classification tasks. We will focus on:

1. **Encoder-Decoder Models**  
   Using **Flan-T5 Small**, we explore how a pre-trained encoder-decoder architecture handles classification through prompting alone, without further fine-tuning.

2. **Generative Models for Classification**  
   We prompt a chat-style generative model (in the spirit of ChatGPT; this notebook runs the open TinyLlama chat model) to return a structured label for each review.

---

By the end of this notebook, you'll have a clear understanding of how to apply different models for text classification, whether leveraging embeddings, task-specific architectures, or generative capabilities.


# Loading the Data




In [19]:
#!pip install datasets transformers sentence-transformers openai

In [2]:
from datasets import load_dataset

# Load our data
data = load_dataset("rotten_tomatoes")
data



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [5]:
data["train"][0, -1]


{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'things really get weird , though not particularly scary : the movie is all portent and no content .'],
 'label': [1, 0]}
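
Indexing a split with several positions, as in `data["train"][0, -1]` above, returns one dict of columns rather than a list of rows. A small sketch of turning that columnar shape into per-review records follows. The `batch` literal is a shortened stand-in for the output above, and the `label_names` mapping (0 = negative, 1 = positive) is an assumption taken from the order of `target_names` in the evaluation helper used later in this notebook.

```python
# Shortened stand-in for the columnar output shown above (texts abbreviated)
batch = {
    "text": ["the rock is destined to be ...", "things really get weird ..."],
    "label": [1, 0],
}

# Assumed mapping: 0 = negative, 1 = positive
label_names = {0: "negative", 1: "positive"}

# zip walks the two columns in parallel, yielding one (text, label) pair per review
rows = [
    {"text": text, "label": label_names[label]}
    for text, label in zip(batch["text"], batch["label"])
]

for row in rows:
    print(f'{row["label"]:>8}: {row["text"]}')
```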

# Text Classification with Representation Models


## Using a Task-specific Model


In [3]:
from transformers import pipeline

# Path to our HF model
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Load model into pipeline
pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,
    device="cuda:0"
)




In [4]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

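# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total=len(data["test"])):
    # Scores arrive in label-id order: negative (0), neutral (1), positive (2).
    # The dataset is binary, so skip the neutral score and compare the other two.
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    assignment = np.argmax([negative_score, positive_score])  # 0 = negative, 1 = positive
    y_pred.append(assignment)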


100%|██████████| 1066/1066 [00:18<00:00, 56.78it/s] 
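
The loop above turns a three-way sentiment output into a binary label by ignoring the neutral score. Here is a minimal sketch of that reduction on a hand-written output, assuming the pipeline returns one `{"label", "score"}` dict per class. The helper name `to_binary` and the scores are made up for illustration, and the lookup is by label name rather than by position.

```python
def to_binary(scores):
    """Map a list of {'label', 'score'} dicts to 0 (negative) or 1 (positive).

    The neutral score is ignored; a tie goes to negative, as np.argmax would do.
    """
    by_label = {item["label"]: item["score"] for item in scores}
    return int(by_label["positive"] > by_label["negative"])


# Hypothetical output for one review (illustrative numbers, not model output)
example = [
    {"label": "negative", "score": 0.10},
    {"label": "neutral", "score": 0.55},
    {"label": "positive", "score": 0.35},
]
print(to_binary(example))  # 1
```

In this example the neutral class has the highest score, yet the review is still labelled positive. That is the price of forcing a three-class model onto a two-class dataset.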


In [5]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)


In [6]:
evaluate_performance(data["test"]["label"], y_pred)


                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066
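
To read the report above, it helps to see how one row is computed. This is a small pure-Python sketch of precision, recall and F1 for a single class. The helper name `binary_report` and the toy labels are illustrative, and scikit-learn's `classification_report` remains the tool used in this notebook.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: 2 true positives, 1 false positive, 1 false negative
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 1]
precision, recall, f1 = binary_report(toy_true, toy_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Calling the helper with `positive=0` gives the figures for the negative class. In the report above, recall is higher for negative reviews (0.88) than for positive ones (0.72), so the task-specific model leans toward predicting negative.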



# Classification using Embeddings

## Supervised Classification


In [7]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to embeddings
train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings = model.encode(data["test"]["text"], show_progress_bar=True)



In [8]:
train_embeddings.shape


(8530, 768)

In [9]:
from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression on our train embeddings
clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, data["train"]["label"])


In [10]:
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)


                 precision    recall  f1-score   support

Negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066
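
The classifier trained above is a linear model on top of frozen embeddings. As a rough illustration of what such a model learns, here is a logistic regression fitted by plain batch gradient descent on toy 2-dimensional vectors. This is a sketch only: scikit-learn's `LogisticRegression` uses a different solver and applies L2 regularisation by default, and the data and function names here are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=200):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # error = predicted probability of class 1 minus the true label
            error = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b):
    """Label 1 when the linear score is positive, i.e. when P(y=1) > 0.5."""
    return [int(sum(wj * xj for wj, xj in zip(w, xi)) + b > 0) for xi in X]


# Hypothetical 2-d "embeddings"; the real ones above have 768 dimensions
toy_X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
toy_y = [1, 1, 0, 0]

w, b = fit_logistic(toy_X, toy_y)
print(predict(toy_X, w, b))         # [1, 1, 0, 0]
print(predict([[0.7, 0.3]], w, b))  # [1]
```

The embedding model is never updated here. Only the small set of weights on top is trained, which is why this approach is cheap compared with fine-tuning.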



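We can also classify directly with embeddings and cosine similarity, without training a classifier: average the training embeddings of each class, then assign every test document to the class whose average embedding it is most similar to.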

In [11]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity

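# Average the embeddings of all documents in each target label.
# hstack appends the label as an extra column (index 768) after the 768 embedding dimensions,
# so grouping by column 768 groups the rows by label.
df = pd.DataFrame(np.hstack([train_embeddings, np.array(data["train"]["label"]).reshape(-1, 1)]))
averaged_target_embeddings = df.groupby(768).mean().values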

# Find the best matching embeddings between evaluation documents and target embeddings
sim_matrix = cosine_similarity(test_embeddings, averaged_target_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

# Evaluate the model
evaluate_performance(data["test"]["label"], y_pred)


                 precision    recall  f1-score   support

Negative Review       0.85      0.84      0.84       533
Positive Review       0.84      0.85      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066
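
The `hstack` and `groupby` step above computes one average embedding (a centroid) per class. The same idea is sketched below in plain Python on toy 2-dimensional vectors. The function name `class_centroids` and the data are illustrative only.

```python
from collections import defaultdict


def class_centroids(embeddings, labels):
    """Average the vectors that share a label; returns {label: mean_vector}."""
    grouped = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        grouped[label].append(vector)
    # zip(*vectors) iterates over dimensions, so each column is averaged separately
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in sorted(grouped.items())
    }


toy_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
toy_labels = [0, 0, 1, 1]
centroids = class_centroids(toy_embeddings, toy_labels)
print(centroids)  # {0: [2.0, 1.0], 1: [1.0, 3.0]}
```

Sorting by label mirrors what pandas `groupby` does by default. Row 0 of `averaged_target_embeddings` is therefore the negative centroid and row 1 the positive one, which is why the `argmax` index can be used directly as the predicted label.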



## Zero-shot Classification


In [12]:
# Create embeddings for our labels
label_embeddings = model.encode(["A negative review",  "A positive review"])


In [13]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)


In [14]:
evaluate_performance(data["test"]["label"], y_pred)


                 precision    recall  f1-score   support

Negative Review       0.78      0.77      0.78       533
Positive Review       0.77      0.79      0.78       533

       accuracy                           0.78      1066
      macro avg       0.78      0.78      0.78      1066
   weighted avg       0.78      0.78      0.78      1066
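
Zero-shot classification here needs no training labels at all: each document is given the label whose description embedding points in the most similar direction. Below is a minimal sketch of the cosine-similarity and `argmax` step in plain Python. The 3-dimensional vectors are hypothetical stand-ins for `label_embeddings` and one row of `test_embeddings`.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(document, candidates):
    """Index of the candidate with the highest cosine similarity (first one on ties)."""
    scores = [cosine(document, candidate) for candidate in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical embeddings: index 0 = "A negative review", index 1 = "A positive review"
toy_label_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_document = [0.2, 0.9, 0.1]

print(most_similar(toy_document, toy_label_embeddings))  # 1
```

Cosine similarity depends only on direction and not on vector length, so the result is decided entirely by how the label descriptions are phrased and embedded.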



# Classification with Generative Models

## Encoder-decoder Models


In [15]:
# Load our model
pipe = pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    device="cuda:0"
)




In [16]:
prompt = "Is the following sentence positive or negative? "
data = data.map(lambda example: {"t5": prompt + example['text']})
data



DatasetDict({
    train: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
})

In [17]:
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "t5")), total=len(data["test"])):
    text = output[0]["generated_text"]
    # Any generation other than the exact string "negative" is counted as positive
    y_pred.append(0 if text == "negative" else 1)


100%|██████████| 1066/1066 [00:51<00:00, 20.68it/s]
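
The mapping above counts every generation that is not exactly `"negative"` as positive, so an unexpected answer would be scored silently. A slightly more defensive parser is sketched below. The helper name `parse_label` and the sample strings are hypothetical, not outputs observed from Flan-T5.

```python
def parse_label(generated_text):
    """Return 0 for 'negative', 1 for 'positive', and None for anything else."""
    cleaned = generated_text.strip().lower().rstrip(".!")
    return {"negative": 0, "positive": 1}.get(cleaned)


# Hypothetical generations, including one the model was not asked to produce
outputs = ["negative", " Positive.", "mixed"]
parsed = [parse_label(text) for text in outputs]
print(parsed)  # [0, 1, None]

# Counting the None values shows how often the model strayed from the two labels
unparsed = sum(label is None for label in parsed)
print(unparsed)  # 1
```

Keeping unparseable answers as `None` makes it possible to report them separately before deciding how to score them.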


In [18]:
evaluate_performance(data["test"]["label"], y_pred)


                 precision    recall  f1-score   support

Negative Review       0.83      0.85      0.84       533
Positive Review       0.85      0.83      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066



## Generative Model for Classification


In [34]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from tqdm import tqdm
import torch
from sklearn.metrics import classification_report

def setup_generative_model(model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
    """
    Initialize the generative model pipeline.
    Using TinyLlama as default since it's lightweight but effective.
    You can replace with larger models like 'meta-llama/Llama-2-7b-chat-hf' if you have access.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Create text generation pipeline
    generator = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
    )
    return generator

def generate_sentiment(text, generator):
    """
    Predict sentiment using a generative model approach.
    Returns 1 for positive, 0 for negative.
    """
    prompt = f"""Predict whether the following movie review is positive or negative:

Review: {text}

If it is positive return 1 and if it is negative return 0. Only return the number."""

    # Generate response with minimal tokens since we only need 0 or 1
    response = generator(prompt, max_new_tokens=5, do_sample=False)[0]['generated_text']

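    # Extract the last digit from the response.
    # Caution: by default 'generated_text' holds the prompt followed by the completion,
    # so this scan also sees the "1" and "0" written in the prompt's instructions.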
    for char in reversed(response):
        if char in ['0', '1']:
            return int(char)
    return 0  # Default to 0 if no valid prediction found

def batch_predict_sentiments(texts, generator, batch_size=8):
    """
    Predict sentiments for a list of texts, walking through them in chunks of batch_size.
    Each text is still sent to the generator one at a time; the chunk size only sets
    how often the progress bar advances.
    """
    predictions = []

    for i in tqdm(range(0, len(texts), batch_size)):
        batch = texts[i:i + batch_size]
        batch_predictions = [generate_sentiment(text, generator) for text in batch]
        predictions.extend(batch_predictions)

    return predictions

def evaluate_performance(y_true, y_pred):
    """
    Evaluate model performance using classification report.
    Redefines the earlier helper: four decimals and raw 0/1 class labels instead of names.
    """
    report = classification_report(y_true, y_pred, digits=4)
    print(report)

# Example usage
if __name__ == "__main__":
    # Initialize the generator
    generator = setup_generative_model()

    # Single prediction example
    document = "unpretentious, charming, quirky, original"
    prediction = generate_sentiment(document, generator)
    print(f"Single prediction: {prediction}")

    # Batch prediction example
    # Assuming data["test"]["text"] is your test dataset
    y_pred = batch_predict_sentiments(data["test"]["text"], generator)

    # Evaluate performance using classification report
    # Assuming data["test"]["label"] contains your true labels
    evaluate_performance(data["test"]["label"], y_pred)



Single prediction: 0


100%|██████████| 134/134 [02:07<00:00,  1.05it/s]

              precision    recall  f1-score   support

           0     0.5000    1.0000    0.6667       533
           1     0.0000    0.0000    0.0000       533

    accuracy                         0.5000      1066
   macro avg     0.2500    0.5000    0.3333      1066
weighted avg     0.2500    0.5000    0.3333      1066
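
The report above shows every review assigned to class 0, which is chance-level accuracy on this balanced test set. A likely cause, given the code as written, is that the `text-generation` pipeline returns the prompt followed by the completion, and the prompt itself ends with "return 0". When the completion contains no digit, the backwards scan lands on the prompt's own "0". The sketch below reproduces that effect on hand-written strings and parses only the completion instead. The helper name `parse_completion` and the sample strings are hypothetical.

```python
def parse_completion(full_text, prompt):
    """Return the first 0/1 digit in the completion, or None if there is none."""
    # Drop the echoed prompt so that only newly generated text is inspected
    completion = full_text[len(prompt):] if full_text.startswith(prompt) else full_text
    for char in completion:
        if char in "01":
            return int(char)
    return None


toy_prompt = "If it is positive return 1 and if it is negative return 0. Only return the number."
echo_only = toy_prompt + "\n\nReview:"  # hypothetical: the model produced no digit
with_answer = toy_prompt + "\n1"        # hypothetical: the model answered 1

# Scanning the full text backwards, as generate_sentiment does, finds the prompt's "0"
last_digit = next(int(c) for c in reversed(echo_only) if c in "01")
print(last_digit)                                 # 0
print(parse_completion(echo_only, toy_prompt))    # None
print(parse_completion(with_answer, toy_prompt))  # 1
```

Passing `return_full_text=False` in the generator call is another way to drop the echoed prompt. Chat-tuned models are also usually prompted through their chat template (`tokenizer.apply_chat_template`), which the code above does not use. Neither change has been run here, so their effect on these results is untested.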




