# GPT1 - SST2 Sentiment Analysis

In [None]:
!pip install transformers
!pip install datasets
!pip install evaluate
!pip install accelerate

In [None]:
!pip install spacy ftfy==4.4.3
!python -m spacy download en

In [11]:
import numpy as np
import pandas as pd

import evaluate
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    OpenAIGPTTokenizer,
    DataCollatorWithPadding,
)

from transformers import (
    pipeline,
    Trainer,
    TrainingArguments,
    OpenAIGPTModel,
    OpenAIGPTForSequenceClassification,
)

In [None]:
from huggingface_hub import notebook_login
notebook_login()

# Dataset

## Loading

In [None]:
#sst2 = load_dataset("stanfordnlp/sst2")
sst2 = load_dataset("nyu-mll/glue", "sst2")

## Analyzing

Stanford Sentiment Treebank (SST) is a 5-class sentiment analysis dataset of movie reviews taken from Rotten Tomatoes. It was published as a part ["Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank"](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf) paper in 2013.

Then, its binary version (SST-2) was standardized by the [GLUE benchmark](https://openreview.net/pdf?id=rJ4km2R5t7) where train and test splits are specifically provided. In HuggingFace Glue Benchmark, we have a small sized validation split too. Each data sample is actually the composition of ***sentence*** and ***label*** in a dictionary.

- Positive Review: label 1
- Negative Review: label 0

In [4]:
# train, valid and test dataset size
print("Train size: ", len(sst2["train"]))
print("Valid size: ", len(sst2["validation"]))
print("Test size: ", len(sst2["test"]))
print()

train_set = sst2["train"]
valid_set = sst2["validation"]
test_set = sst2["test"]

# a data sample = <text, label> dict
print("Type of a sample: ", type(train_set[0]))
print("Sentence: ", train_set[0]["sentence"][:100])
print("Label: ", train_set[0]["label"])

Train size:  67349
Valid size:  872
Test size:  1821

Type of a sample:  <class 'dict'>
Sentence:  hide new secretions from the parental units 
Label:  0


## Text Preprocessing - Tokenizer

### 1. Structure of Tokenizer:

GPT-1 tokenizer utilizes byte-pair encoding [\[1\]](https://huggingface.co/learn/nlp-course/en/chapter6/5#implementing-bpe), with a vocabulary of 40478 tokens. While index 0 is exclusively reserved for unknown tokens, and padding token is discarded. The main reason for this lies behind the followed pre-training strategy:

GPT-1 architecture was generative pre-trained on fixed-length blocks of 512 tokens. In other words, the dataset was partitioned into contiguous sequences of exactly 512 tokens. Hence, every sample in a batch has always 512-tokens length; there was no need for padding. However, to be able to finetune it with variable-length sequences in SST2 dataset, we define a padding token and inserted it into the tokenizer.

### 2. Truncation and Padding:

To be able to finetune GPT-1 architecture with batched inputs, we need to pad shorter sequences, and truncate the sequences longer than 512 tokens in the batch. One of the controversial issues is in which direction we need to apply truncation and padding operations.

***A. Left and Right Padding:***

You can encounter in some resources that left padding is recommended for sentence-level prediction tasks like sentiment analysis. Since GPT models process the text autoregressively from left to right, each token representation depends on all preceding tokens, meaning that final token effectively aggregates information from the entire sequence. Because of this, classification heads at the top of GPT models account for last token [\[2\]](https://huggingface.co/docs/transformers/en/model_doc/openai-gpt#transformers.OpenAIGPTForSequenceClassification). In that case, leaving pad tokens at the left side and preserving actual text content to the right helps model to attends to the rightmost content as actual text context. This ensures that last token encapsulates the interpretation of whole text and model predictions are not contaminated by padding tokens.

Besides, GPT models are decoder-only architectures, so while generating an asnwer, they actually iterate over your prompt. GPT-based arcihtectures are not trained to generate text from pad tokens, which interrupts the generation of continuous semantic information. Hence, left-sided padding can be useful [\[3\]](https://huggingface.co/docs/transformers/llm_tutorial#wrong-padding-side).


Nevertheless, GPT adapts absolute position embeddings. If you do left padding, you actually push real tokens to higher position indices than they would normally occupy, which can cause a mismatch between how the model was pre-trained and how you are fine-tuning. *\"So it is usually advised to pad the inputs on the right rather than the left\"* [\[2\]](https://huggingface.co/docs/transformers/en/model_doc/openai-gpt#transformers.OpenAIGPTForSequenceClassification).


***B. Left and Right Truncation:***

In left-truncation, the final tokens are retained to preserve the portion of text where the meaning often accumulates. This method can be particularly advantageous in sentiment analysis, as evaluative cues frequently appear near the end of a passage. Conversely, right truncation keeps the opening tokens, which may be more beneficial when a task relies on the prompt’s initial content, such as the instructions or sample code, since subsequent queries or responses depend on that foundational context.


In [5]:
# definition of tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "openai-community/openai-gpt",
    truncation_side="right",
    padding_side="right")

# getting vocabulary and printing vocab-size
vocab = tokenizer.get_vocab()
print("Vocabulary size before pad token: ", len(vocab))


# adding pad token
# updating vocabulary
# printing vocab-size and pad index
tokenizer.add_special_tokens({"pad_token": "<pad>"})
vocab = tokenizer.get_vocab()
print("Vocabulary size after pad token: ", len(vocab))
print("Pad token id: ", tokenizer.pad_token_id)

# defining inverse vocabulary
# printing first and last tokens
inv_vocab = {str(value): key for key, value in vocab.items()}
print("Index 0 token: ", inv_vocab["0"])
print("Index 40478 token: ", inv_vocab["40478"])

# tokenizing the dataset
def tokenize(examples):
    return tokenizer(examples["sentence"], max_length=512, padding="max_length", truncation=True)

tokenized_sst2 = sst2.map(tokenize, batched=True)

Vocabulary size before pad token:  40478
Vocabulary size after pad token:  40479
Pad token id:  40478
Index 0 token:  <unk>
Index 40478 token:  <pad>


### 2. Batching Tokens

* In general, the sentences or paragraphs to be processed by the language model have different length; so after tokenization each sentence has different number of tokens. This is a problem, because batched inputs need to be fixed-size tensors, where padding enters the picture. It adds a special padding token to guarantee that all sequences have same length as the longest one or maximum length accepted by the model [\[4\]](https://huggingface.co/docs/transformers/en/pad_truncation#).

* Our main purpose is to create a batch of padded samples. To achieve that, [`DataCollatorWithPadding`](https://huggingface.co/docs/transformers/v4.47.1/en/main_classes/data_collator#transformers.DataCollatorWithPadding) can be used. It dynamically pads the sequences to longest length in the batch during collation. As an alternative, all samples in the dataset can be also padded to maximum length, but we do not need this. What we need is that only the samples in the same batch should have same length. At this point, dynamic padding inside the batch becomes more efficient.

In [6]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Evaluate

To be able to evaluate the performance of the model during training and validation steps, we need some metrics. The most used ones for classification is accuracy, precision and recall. We can use HF [evaluate](https://huggingface.co/docs/evaluate/en/index) library.

We implement `compute_metrics()` function, which will be automatically invoked to calculate 3 evaluation metrics for validation set during training. When multiple metrics are calculated in this function, its return statement needs to be a dictionary [\[5, ](https://huggingface.co/docs/transformers/en/main_classes/trainer)[6\]](https://discuss.huggingface.co/t/combine-multiple-metrics-in-compute-metrics-for-validation/90088)

In [7]:
recall = evaluate.load("recall")
precision = evaluate.load("precision")
accuracy = evaluate.load("accuracy")


def compute_metrics(preds_labels) -> dict:
    preds, labels = preds_labels
    preds = np.argmax(preds, axis=1)  # expected shape: (B, 2)

    acc = accuracy.compute(predictions=preds, references=labels)["accuracy"]
    rec = recall.compute(predictions=preds, references=labels)["recall"]
    pre = precision.compute(predictions=preds, references=labels)["precision"]

    return {"accuracy": acc, "recall": rec, "precision": pre}

# Training

GPT-1 classifier is extended version of autoregressive GPT-1 architecture [\[7\]](https://huggingface.co/openai-community/openai-gpt#model-details); it has an additional classification layer (only weights, no bias) built on last token. That is why, the architecture needs to the position of the last token. At this point, specification of pad_token_id is important, it allows the model to recognize which one is padding token and easily finds the last token [\[2\]](https://huggingface.co/docs/transformers/en/model_doc/openai-gpt#transformers.OpenAIGPTForSequenceClassification).


In [8]:
# Dictionaries to map from class ids to corresponding labels or vice versa.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

# instantiating gpt1 classifier
model = OpenAIGPTForSequenceClassification.from_pretrained("openai-gpt", num_labels=2)
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = 40478

print("\nNumber of parameters: ", model.num_parameters())

Some weights of OpenAIGPTForSequenceClassification were not initialized from the model checkpoint at openai-gpt and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`



Number of parameters:  116537088


## Training Arguments and Trainer

To train the model, we need to determine hyperparameters, which is handled by `TrainingArguments` [\[8\]](https://huggingface.co/docs/transformers/en/main_classes/trainer).

- ***output_dir:*** The directory where model predictions and checkpoints will be saved.

- ***per_device_train_batch_size:*** When you train your model with multiple GPUs or TPUs, it controls the batch size for each GPU/TPU.

- ***per_device_eval_batch_size:*** When you evaluate your model with multiple GPUs or TPUs, it controls the batch size for each GPU/TPU.

- ***num_train_epochs:*** How many number of epochs the model will be trained.

- ***eval_strategy:*** It can be "no", "steps", or "epoch".
    - no: No evaluation is done during training
    - steps: Evaluation is done and logged every *\"eval_steps\"*
    - epoch: Evaluation is done at the end of each epoch.


- ***save_strategy:*** It can be "no", "steps", "epoch", "best".
    - no: No ckpt save is done during training
    - steps: Ckpt save is done every *\"eval_steps\"*
    - epoch: Ckpt save is done at the end of each epoch.
    - best: Ckpt save is done when a new *\"best_metric\"* achieved.

- ***load_best_model_at_end:*** It is a boolean value to make sure whether the best model according to `metric_for_best_model` will be saved or not.

In [9]:
training_args = TrainingArguments(
    output_dir="gpt1_sst2_right",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_sst2["train"],
    eval_dataset=tokenized_sst2["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Recall,Precision
1,0.2,0.295824,0.90367,0.864865,0.941176
2,0.1455,0.31718,0.918578,0.95045,0.895966


Epoch,Training Loss,Validation Loss,Accuracy,Recall,Precision
1,0.2,0.295824,0.90367,0.864865,0.941176
2,0.1455,0.31718,0.918578,0.95045,0.895966
3,0.0892,0.363674,0.927752,0.925676,0.931973
4,0.0584,0.421563,0.925459,0.936937,0.918322


TrainOutput(global_step=16840, training_loss=0.1379245318596267, metrics={'train_runtime': 6928.678, 'train_samples_per_second': 38.881, 'train_steps_per_second': 2.43, 'total_flos': 7.0391028252672e+16, 'train_loss': 0.1379245318596267, 'epoch': 4.0})

In [None]:
trainer.push_to_hub()

# Inference

In [15]:
model = pipeline("sentiment-analysis", model="goktug14/gpt1_sst2_right")


preds = {"samples": [], "classes": []}

for i in range(len(sst2["test"])):
    sample = sst2["test"][i]
    pred = pipe(sample["sentence"], top_k=1)[0]

    preds["samples"].append(sample["sentence"])
    preds["classes"].append(pred["label"].split("_")[1])

df = pd.DataFrame(preds)
df.to_csv("./preds.csv", index=False)

Device set to use cuda:0


<a name="ref"></a>

# References

1. [Byte Pair Encoding](https://huggingface.co/learn/nlp-course/en/chapter6/5#implementing-bpe)

2. [GPT-1 Sequence Classifier](https://huggingface.co/docs/transformers/en/model_doc/openai-gpt#transformers.OpenAIGPTForSequenceClassification)

3. [Padding Side](https://huggingface.co/docs/transformers/llm_tutorial#wrong-padding-side)

4. [Padding and Truncation](https://huggingface.co/docs/transformers/en/pad_truncation#)

5. [Compute Metrics and Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer)

6. [Computation of Multiple Metrics Discussion](https://discuss.huggingface.co/t/combine-multiple-metrics-in-compute-metrics-for-validation/90088)

7. [OpenAI - GPT1](https://huggingface.co/openai-community/openai-gpt#model-details)

8. [Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer)