**💡 NOTE:** We will want to use a GPU to run the examples in this notebook. In Google Colab, go to **Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4.**

In [1]:
%%capture
# %%capture is a Jupyter cell magic that hides this cell's output (or stores it for later use); cell magics must be the first line of the cell.
!pip install datasets transformers sentence-transformers openai

In [2]:
from datasets import load_dataset

# Load our data
data = load_dataset("rotten_tomatoes")
data

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [3]:
data["train"][0, -1]

{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'things really get weird , though not particularly scary : the movie is all portent and no content .'],
 'label': [1, 0]}

## Text Classification with Representation Models

### Using a Task-specific Model

In [4]:
# Import the pipeline
# pipeline() from the transformers library wraps model loading, tokenization, and inference
# for tasks such as sentiment analysis and text classification.
from transformers import pipeline

# Hugging Face Hub repository of Cardiff NLP's RoBERTa model fine-tuned for sentiment analysis
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Load the model and tokenizer into a pipeline
pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,
    device="cuda:0"
)

# model, tokenizer: both are loaded from the same Hugging Face repository.
# return_all_scores=True: returns a score for every sentiment label (negative, neutral, positive) instead of only the top one.
# device="cuda:0": loads the model onto the first GPU. If CUDA isn't available, set device=-1 to use the CPU.


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
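
The pipeline above hard-codes `device="cuda:0"`, which fails on a machine without a GPU. A minimal sketch of a fallback, assuming PyTorch is the backend; the helper name `pick_device` is hypothetical and not part of transformers:

```python
def pick_device() -> str:
    """Return "cuda:0" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # assumed backend; preinstalled in Colab
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The result could then be passed to the pipeline as `device=device` instead of the hard-coded string.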


In [5]:
import numpy as np            # np.argmax returns the index of the highest score
from tqdm import tqdm         # progress bar for iterables
from transformers.pipelines.pt_utils import KeyDataset  # exposes one dataset column (here "text") to the pipeline

# Run inference
y_pred = []
# KeyDataset(data["test"], "text") feeds the "text" column of the test split to the pipeline; tqdm shows progress.
# Each output is a list of dictionaries with a label and a score, one per sentiment label.
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total=len(data["test"])):
    # Scores for the "negative" (index 0) and "positive" (index 2) labels
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    # The task is binary, so the "neutral" score (index 1) is ignored:
    # argmax yields 0 for negative and 1 for positive.
    assignment = np.argmax([negative_score, positive_score])
    # Store the predicted label
    y_pred.append(assignment)

100%|██████████| 1066/1066 [00:22<00:00, 47.93it/s]
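
To see what the loop does for a single review, here is a small sketch with made-up scores (the numbers are illustrative, not real model output). It assumes the label order negative, neutral, positive used above, and it is written in plain Python so that it runs without the notebook's dependencies; `to_binary_label` is a hypothetical helper name.

```python
# Hypothetical pipeline output for one review: one dict per label,
# in the order negative (index 0), neutral (index 1), positive (index 2).
output = [
    {"label": "negative", "score": 0.10},
    {"label": "neutral", "score": 0.55},
    {"label": "positive", "score": 0.35},
]


def to_binary_label(output):
    """Compare only the negative and positive scores; neutral is ignored."""
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    # Like np.argmax, a tie resolves to the first entry (0 = negative).
    return 0 if negative_score >= positive_score else 1


assignment = to_binary_label(output)
print(assignment)  # 1, because 0.35 > 0.10 even though neutral scores highest
```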


**The `evaluate_performance` function creates and prints a classification report that summarizes how well a classification model performs.**

It relies on scikit-learn's `classification_report`, which summarizes precision, recall, F1-score, and support (the number of true instances of each class).

`y_true`: Ground-truth (actual) labels for the dataset.

`y_pred`: Labels predicted by the model.

`target_names`: An argument passed to `classification_report` that maps the numeric class labels (0 and 1) to human-readable class names ("Negative Review" and "Positive Review").

The classification report includes:
- Precision: Of the instances predicted as a class, the fraction that truly belong to it.
- Recall: Of the instances that truly belong to a class, the fraction that were predicted as it.
- F1-Score: The harmonic mean of precision and recall.
- Support: The number of occurrences of each label in `y_true`.

The function prints the report to the console rather than returning it.
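
To make the metric definitions above concrete, here is a minimal sketch that computes them by hand for one class on a toy example. The function name `binary_report` and the toy labels are illustrative assumptions; scikit-learn's `classification_report` remains the tool used in this notebook.

```python
def binary_report(y_true, y_pred, positive=1):
    """Precision, recall, F1, and support for one class, computed from counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}


# Toy labels: 3 positive and 3 negative reviews
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
report = binary_report(y_true, y_pred)
print(report)
```

Here two of the three predicted positives are correct (precision 2/3) and two of the three actual positives are found (recall 2/3), so the F1-score is also 2/3.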

In [6]:
# Import the classification report
from sklearn.metrics import classification_report

# Define the function
def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    # y_true and y_pred must have the same length and use the same class ids (0 = negative, 1 = positive)
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    # Print the report
    print(performance)

In [7]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066



### Classification Tasks that Leverage Embeddings

#### Supervised Classification

1. Importing SentenceTransformer

The SentenceTransformer class is part of the sentence-transformers library, which is widely used for creating sentence and text embeddings.

2. Loading a Pre-Trained Model
- 'sentence-transformers/all-mpnet-base-v2': A general-purpose embedding model from the Sentence Transformers library. It is based on the MPNet architecture and is trained to produce embeddings for text inputs.

- The model is downloaded from the Hugging Face Model Hub and then initialized.

3. Generating Embeddings

- model.encode(...): Converts input text into dense vector embeddings.
- Inputs:
   - data["train"]["text"]: The list of review texts in the training split.
   - data["test"]["text"]: The same, for the test split.
- show_progress_bar=True: Displays a progress bar while encoding, which is useful for large datasets.

**Output**
- train_embeddings: A 2D NumPy array in which each row is the dense vector representation of one sentence from the training data.
- test_embeddings: The same, for the test data.
- With this model, each embedding is a 768-dimensional vector; other models may produce a different size.

In [8]:
# Importing SentenceTransformer
from sentence_transformers import SentenceTransformer
# Loading a Pre-Trained Model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to embeddings (Generating Embeddings)
train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings = model.encode(data["test"]["text"], show_progress_bar=True)

In [9]:
train_embeddings.shape

(8530, 768)

In [10]:
from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression on our train embeddings
clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, data["train"]["label"])

In [11]:
# Predict previously unseen instances
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066



### Tip!

What would happen if we did not use a classifier at all? Instead, we can average the embeddings per class and use cosine similarity to predict which class best matches each document:

In [12]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity

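# Average the embeddings of all documents in each target label.
# The label is appended as the last column (index 768) so that groupby can average the 768 embedding columns per label.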
df = pd.DataFrame(np.hstack([train_embeddings, np.array(data["train"]["label"]).reshape(-1, 1)]))
averaged_target_embeddings = df.groupby(768).mean().values

# Find the best matching embeddings between evaluation documents and target embeddings
sim_matrix = cosine_similarity(test_embeddings, averaged_target_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

# Evaluate the model
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.84      0.84       533
Positive Review       0.84      0.85      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066
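
The centroid approach above can be illustrated on toy 2-D vectors. This is a dependency-free sketch under stated assumptions: the vectors are made up, and `cosine`, `class_centroids`, and `predict` are hypothetical helper names that mirror, in plain Python, what the pandas and scikit-learn calls do in the cell above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def class_centroids(embeddings, labels):
    """Average the embeddings of each label; returns {label: mean vector}."""
    grouped = {}
    for vector, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(embedding, centroids):
    """Pick the label whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine(embedding, centroids[label]))


# Toy "embeddings": class 0 points along the first axis, class 1 along the second
train = [[1.0, 0.1], [0.9, 0.3], [0.1, 1.0], [0.2, 0.8]]
train_labels = [0, 0, 1, 1]
centroids = class_centroids(train, train_labels)
predictions = [predict(v, centroids) for v in [[0.8, 0.2], [0.1, 0.9]]]
print(predictions)  # [0, 1]
```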



#### Zero-shot Classification

In [13]:
# Create embeddings for our labels
label_embeddings = model.encode(["A negative review",  "A positive review"])

In [14]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

In [15]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.78      0.77      0.78       533
Positive Review       0.77      0.79      0.78       533

       accuracy                           0.78      1066
      macro avg       0.78      0.78      0.78      1066
   weighted avg       0.78      0.78      0.78      1066



### Tip!

What would happen if you were to use different descriptions? Use "A very negative movie review" and "A very positive movie review" to see what happens!
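
One way to experiment with different descriptions is to make them a parameter. The sketch below assumes nothing about the real model: `toy_encode` is a made-up stand-in for `model.encode` that only counts a few vocabulary words, and `zero_shot` is a hypothetical helper. With the notebook's model, `model.encode` could be passed in as `encode` instead.

```python
import math

VOCAB = ["good", "great", "bad", "boring"]  # toy vocabulary, purely illustrative


def toy_encode(texts):
    """Stand-in for model.encode: counts vocabulary words in each text."""
    return [[float(text.lower().split().count(word)) for word in VOCAB] for text in texts]


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def zero_shot(documents, label_descriptions, encode):
    """Assign each document the index of its most similar label description."""
    doc_vecs = encode(documents)
    label_vecs = encode(label_descriptions)
    return [
        max(range(len(label_vecs)), key=lambda i: cosine(doc, label_vecs[i]))
        for doc in doc_vecs
    ]


labels = ["a bad boring review", "a good great review"]
y_pred = zero_shot(["great cast and a good story", "boring and bad"], labels, toy_encode)
print(y_pred)  # [1, 0]
```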

## Classification with Generative Models

### Encoder-decoder Models

In [16]:
# Load our model
pipe = pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    device="cuda:0"
)

In [17]:
# Prepare our data (user input / Query)
prompt = "Is the following sentence positive or negative? "
data = data.map(lambda example: {"t5": prompt + example['text']})
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
})

In [18]:
# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "t5")), total=len(data["test"])):
    text = output[0]["generated_text"]
    y_pred.append(0 if text == "negative" else 1)

100%|██████████| 1066/1066 [00:43<00:00, 24.47it/s]


In [19]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.83      0.85      0.84       533
Positive Review       0.85      0.83      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066
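
The inference loop above compares the generated text with the exact string "negative" and treats everything else as positive, so an output such as "Negative" or "negative." would be counted as positive. A minimal sketch of a more tolerant mapping, assuming the model answers with some variant of "positive" or "negative"; `parse_sentiment` is a hypothetical helper name:

```python
def parse_sentiment(generated_text, default=1):
    """Map generated text to 0 (negative) or 1 (positive)."""
    cleaned = generated_text.strip().lower()
    if cleaned.startswith("negative"):
        return 0
    if cleaned.startswith("positive"):
        return 1
    # Unrecognized output: fall back to the default, as the original loop does with 1
    return default


print([parse_sentiment(t) for t in ["negative", "Negative.", " positive", "unsure"]])
# [0, 0, 1, 1]
```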



### ChatGPT for Classification

In [20]:
import openai

# Create client
client = openai.OpenAI(api_key="YOUR_KEY_HERE")

In [21]:
def chatgpt_generation(prompt, document, model="gpt-3.5-turbo-0125"):
    """Generate an output based on a prompt and an input document."""
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": prompt.replace("[DOCUMENT]", document)
        }
    ]
    chat_completion = client.chat.completions.create(
        messages=messages,
        model=model,
        temperature=0
    )
    return chat_completion.choices[0].message.content
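
The templating step can be checked without spending any API credits. The sketch below only builds the message list that the function above sends; `build_messages` and the short template are illustrative assumptions, and no request is made.

```python
def build_messages(prompt, document, system="You are a helpful assistant."):
    """Fill the [DOCUMENT] placeholder and return the chat messages."""
    if "[DOCUMENT]" not in prompt:
        raise ValueError("prompt must contain the [DOCUMENT] placeholder")
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt.replace("[DOCUMENT]", document)},
    ]


template = "Is this review positive or negative?\n\n[DOCUMENT]\n"
messages = build_messages(template, "unpretentious , charming , quirky , original")
print(messages[1]["content"])
```

Raising an error when the placeholder is missing is a design choice of this sketch: without it, `str.replace` would silently send the prompt with no document attached.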

In [None]:
# Define a prompt template as a base
prompt = """Predict whether the following document is a positive or negative movie review:

[DOCUMENT]

If it is positive return 1 and if it is negative return 0. Do not give any other answers.
"""

# Predict the target using GPT
document = "unpretentious , charming , quirky , original"
chatgpt_generation(prompt, document)

In [None]:
# You can skip this if you want to save your (free) credits
predictions = [chatgpt_generation(prompt, doc) for doc in tqdm(data["test"]["text"])]

In [None]:
# Extract predictions
y_pred = [int(pred) for pred in predictions]

# Evaluate performance
evaluate_performance(data["test"]["label"], y_pred)
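
Note that `int(pred)` raises a `ValueError` if a reply contains anything besides the number, for example "The answer is 1." A hedged sketch of a more defensive parser follows; `parse_label` is a hypothetical helper, and the replies are made-up examples rather than real API output.

```python
import re


def parse_label(prediction, default=None):
    """Return 0 or 1 if the reply names exactly one of them as a standalone digit, else default."""
    found = re.findall(r"\b[01]\b", prediction)
    return int(found[0]) if len(set(found)) == 1 else default


replies = ["1", " 0\n", "The answer is 1.", "I cannot tell", "0 or 1"]
parsed = [parse_label(reply) for reply in replies]
print(parsed)  # [1, 0, 1, None, None]
```

Replies that come back as `None` would need to be dropped together with their true labels, or given a default, before calling `evaluate_performance`.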