## Data

In [2]:
from datasets import load_dataset

data = load_dataset("rotten_tomatoes")
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [3]:
data["train"][0, -1]

{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'things really get weird , though not particularly scary : the movie is all portent and no content .'],
 'label': [1, 0]}

## Text Classification with Representation Models

### Using a Task-specific Model

In [10]:
import torch
from transformers import pipeline

model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,
    device=device
)

loading configuration file config.json from cache at C:\Users\siddharth.yadav\.cache\huggingface\hub\models--cardiffnlp--twitter-roberta-base-sentiment-latest\snapshots\4ba3d4463bd152c9e4abd892b50844f30c646708\config.json
Model config RobertaConfig {
  "_name_or_path": "cardiffnlp/twitter-roberta-base-sentiment-latest",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "negative",
    "1": "neutral",
    "2": "positive"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "negative": 0,
    "neutral": 1,
    "positive": 2
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "positio

In [11]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

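# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total=len(data["test"])):
    # Indices follow the model's id2label mapping: 0 = negative, 1 = neutral, 2 = positive
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    # Ignore the neutral score and pick the larger of the two remaining classes
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)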

Disabling tokenizer parallelism, we're using DataLoader multithreading already
100%|██████████| 1066/1066 [01:18<00:00, 13.58it/s]
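
The loop above reads the scores by position, which relies on the order given by the model's `id2label` mapping. As a minimal sketch, assuming each pipeline output is a list of `{"label", "score"}` dictionaries, the same decision can be made by label name instead. The helper name `binary_from_scores` and the example scores are hypothetical and only serve as illustration.

```python
def binary_from_scores(scores):
    """Map a three-class sentiment output to 0 (negative) or 1 (positive)."""
    # Look scores up by label name so the result does not depend on list order
    by_label = {entry["label"]: entry["score"] for entry in scores}
    # The neutral score is ignored; ties go to negative, as np.argmax would do
    return 0 if by_label["negative"] >= by_label["positive"] else 1


# Invented scores for illustration only, not real model output
example_output = [
    {"label": "negative", "score": 0.10},
    {"label": "neutral", "score": 0.60},
    {"label": "positive", "score": 0.30},
]
print(binary_from_scores(example_output))
```

Note that a review the model considers mostly neutral is still forced into one of the two classes, because the dataset only has negative and positive labels.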


In [12]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["negative Review", "Positive Review"]
    )
    print(performance)
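
To make the columns of the report easier to read, here is a dependency-free sketch of how precision, recall, and F1 are derived for a single class. The function name `precision_recall_f1` and the toy labels are assumptions made for illustration; `classification_report` remains the tool used in this notebook.

```python
def precision_recall_f1(y_true, y_pred, target):
    """Compute precision, recall, and F1 for one class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)  # true positives
    fp = sum(1 for t, p in pairs if t != target and p == target)  # false positives
    fn = sum(1 for t, p in pairs if t == target and p != target)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels, invented for illustration (0 = negative, 1 = positive)
toy_true = [0, 0, 1, 1]
toy_pred = [0, 1, 1, 1]
for target in (0, 1):
    print(target, precision_recall_f1(toy_true, toy_pred, target))
```

The `support` column of the report is simply the number of true examples of each class.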

## Classification Tasks that Leverage Embeddings

### Supervised Classification

In [13]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings = model.encode(data["test"]["text"], show_progress_bar=True)

In [14]:
train_embeddings.shape

(8530, 768)

In [16]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, data["train"]["label"])

In [17]:
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066



In [18]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Append the label as an extra column (index 768) next to the 768 embedding dimensions
df = pd.DataFrame(np.hstack([train_embeddings, np.array(data["train"]["label"]).reshape(-1, 1)]))
# Average the embeddings of each class; rows are ordered by label (0, then 1)
averaged_target_embeddings = df.groupby(768).mean().values

# Assign each test document to the class whose averaged embedding is most similar
sim_matrix = cosine_similarity(test_embeddings, averaged_target_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

negative Review       0.85      0.84      0.84       533
Positive Review       0.84      0.85      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066
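
The averaging step in the cell above can be hard to follow because the label is stored as column 768 of the DataFrame. As a rough, dependency-free sketch of the same idea, the hypothetical helper `class_centroids` below groups vectors by label and averages each group. The two-dimensional vectors are invented for illustration; real embeddings here have 768 dimensions.

```python
from collections import defaultdict


def class_centroids(vectors, labels):
    """Average the vectors of each class, returning {label: mean_vector}."""
    grouped = defaultdict(list)
    for vector, label in zip(vectors, labels):
        grouped[label].append(vector)
    # zip(*rows) iterates over dimensions, so each column is averaged separately
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in sorted(grouped.items())
    }


# Invented 2-D "embeddings" with binary labels
toy_vectors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 0.5]]
toy_labels = [0, 0, 1, 1]
print(class_centroids(toy_vectors, toy_labels))
```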



## Zero-shot Classification

In [20]:
# Create embeddings for our labels
label_embeddings = model.encode(["A negative review", "A positive review"])

In [21]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)
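
What the two lines above do for each document can be sketched without any libraries: compute the cosine similarity to every label embedding and keep the index of the largest one. The helper names `cosine` and `best_label` and the two-dimensional vectors are hypothetical, and the sketch assumes no vector is all zeros.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_label(doc_vector, label_vectors):
    """Return the index of the label vector most similar to the document."""
    similarities = [cosine(doc_vector, label_vector) for label_vector in label_vectors]
    # index(max(...)) returns the first maximum, like np.argmax
    return similarities.index(max(similarities))


# Invented vectors standing in for the label and document embeddings
toy_label_vectors = [[1.0, 0.0], [0.0, 1.0]]  # index 0 = negative, 1 = positive
print(best_label([0.9, 0.2], toy_label_vectors))
print(best_label([0.1, 0.8], toy_label_vectors))
```

Because cosine similarity only depends on direction, rescaling a document vector does not change which label it is assigned.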

In [22]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

negative Review       0.78      0.77      0.78       533
Positive Review       0.77      0.79      0.78       533

       accuracy                           0.78      1066
      macro avg       0.78      0.78      0.78      1066
   weighted avg       0.78      0.78      0.78      1066



## Classification with Generative Models

### Encoder-Decoder Models

In [23]:
pipe = pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    device=device
)

In [24]:
# Prepare our data
prompt = "Is the following sentence positive or negative? "
data = data.map(lambda example: {"t5": prompt + example["text"]})
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
})

In [26]:
# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "t5")), total=len(data["test"])):
    text = output[0]["generated_text"]
    y_pred.append(0 if text == "negative" else 1)

100%|██████████| 1066/1066 [01:18<00:00, 13.67it/s]
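
The loop above counts every generated text other than the exact string `"negative"` as positive. Whether the model ever produces anything else is not shown here, so the following is only a sketch of a stricter parser under that assumption. The helper name `parse_sentiment` is hypothetical.

```python
def parse_sentiment(generated_text):
    """Return 0 for negative, 1 for positive, or None for anything else."""
    # Normalize whitespace, casing, and a trailing period before comparing
    cleaned = generated_text.strip().lower().rstrip(".")
    if cleaned == "negative":
        return 0
    if cleaned == "positive":
        return 1
    return None  # unexpected output, left for the caller to handle


for text in ["negative", " Positive.", "neutral"]:
    print(repr(text), parse_sentiment(text))
```

Returning `None` makes unexpected outputs visible, so they can be counted or inspected instead of silently becoming positive predictions.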


In [27]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

negative Review       0.83      0.85      0.84       533
Positive Review       0.85      0.83      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066



### ChatGPT for Classification

In [47]:
import os
import openai

openai_api_key = os.environ["OPENAI_API_KEY"]
client = openai.OpenAI(api_key=openai_api_key)

In [48]:
def chatgpt_generation(prompt, document, model="gpt-3.5-turbo"):
    """Generate an output based on a prompt and an input document"""
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": prompt.replace("[DOCUMENT]", document)
        }
    ]
    chat_completion = client.chat.completions.create(
        messages=messages,
        model=model,
        temperature=0
    )
    return chat_completion.choices[0].message.content

In [49]:
# Define a prompt template as a base
prompt = """Predict whether the following document is a positive or negative movie review:

[DOCUMENT]

If it is positive return 1 and if it is negative return 0. Do not give any other answers.
"""

# Predict the target using GPT
document = "unpretentious, charming, quirky, original"
chatgpt_generation(prompt, document)

RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

**Skip the following cells to save (free) credits**

In [50]:
predictions = [chatgpt_generation(prompt, doc) for doc in tqdm(data["test"]["text"])]

  0%|          | 0/1066 [00:00<?, ?it/s]
KeyboardInterrupt



In [None]:
y_pred = [int(pred) for pred in predictions]

evaluate_performance(data["test"]["label"], y_pred)
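
`int(pred)` raises a `ValueError` if a reply is anything other than a bare integer, for example a sentence or an empty string. As a hedged sketch, the hypothetical helper `align_predictions` below keeps only the replies that are exactly `0` or `1` and reports how many were skipped. The example replies are invented.

```python
def align_predictions(labels, replies):
    """Keep (label, prediction) pairs whose reply is a bare '0' or '1'."""
    y_true, y_pred, skipped = [], [], 0
    for label, reply in zip(labels, replies):
        cleaned = reply.strip()
        if cleaned in ("0", "1"):
            y_true.append(label)
            y_pred.append(int(cleaned))
        else:
            skipped += 1  # reply did not follow the requested format
    return y_true, y_pred, skipped


# Invented replies for illustration only
toy_labels = [1, 0, 1]
toy_replies = ["1", "0\n", "Positive"]
print(align_predictions(toy_labels, toy_replies))
```

Dropping replies changes the support in the classification report, so the skipped count should be reported alongside the scores.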