<a href="https://colab.research.google.com/github/Shrivastav-Gaurav/GenAI-ML-Notebook/blob/main/Assignment_Transformer_for_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

## Problem 1: Implement a self-attention layer

Given a text sequence, we expect each token in the sequence to be represented by an embedding vector, giving us an input shape of [1, seq_len, embedding_dim], where 1 is the batch size.

In the self attention layer, we first need to compute the query, key and value vectors for each input token embedding vector. For token i, `dot_product(query_i, key_j)` is the attention score between token i and token j (note: the attention scores should be softmax normalized). The output of token i is the attention weighted sum of value vectors from all tokens in the sequence:
```
score(i, 0) * value_0 + score(i, 1) * value_1 + ... + score(i, i) * value_i + ... + score(i, seqlen - 1) * value_seqlen-1
```

The attention layer should return an output vector for each input token, therefore the shape of the layer output is [batch, seq_len, value_dim].
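The computation described above can be sketched as a plain function. This is only an illustration under stated assumptions: `attention_sketch` is a hypothetical name, and the projection matrices are random stand-ins passed in as arguments, whereas a real layer would hold them as learned parameters (for example `torch.nn.Linear` modules).

```python
import torch

def attention_sketch(inputs, w_q, w_k, w_v):
    """inputs: [batch, seq_len, embedding_dim]
    w_q, w_k: [embedding_dim, kdim]; w_v: [embedding_dim, vdim]
    """
    query = inputs @ w_q   # [batch, seq_len, kdim]
    key = inputs @ w_k     # [batch, seq_len, kdim]
    value = inputs @ w_v   # [batch, seq_len, vdim]
    # scores[b, i, j] = dot_product(query_i, key_j)
    scores = query @ key.transpose(-2, -1)   # [batch, seq_len, seq_len]
    # Softmax over j, so the weights for each token i sum to 1.
    weights = torch.softmax(scores, dim=-1)
    # Output for token i is the weighted sum of all value vectors.
    return weights @ value, weights

torch.manual_seed(0)
batch, seq_len, d_emb, kdim, vdim = 2, 10, 5, 4, 3
x = torch.rand(batch, seq_len, d_emb)
out, weights = attention_sketch(
    x, torch.rand(d_emb, kdim), torch.rand(d_emb, kdim), torch.rand(d_emb, vdim)
)
print(out.shape)  # torch.Size([2, 10, 3])
```

The original Transformer additionally divides the scores by `sqrt(kdim)` before the softmax; the sketch leaves that out to match the description above.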

https://jalammar.github.io/illustrated-transformer/ is a great visual explanation of attention operations.

You can sanity check your implementation with a snippet like this:
```
class SimpleAttention(torch.nn.Module):
  """Your implementation."""

d_emb = 5
kdim = 4
vdim = 3
batch = 2
seq_len = 10
attention = SimpleAttention(d_emb, kdim=kdim, vdim=vdim)
assert attention(torch.rand(batch, seq_len, d_emb)).shape == torch.Size([batch, seq_len, vdim])
```








In [None]:
class SimpleAttention(torch.nn.Module):
  def __init__(self, embedding_dim: int, kdim: int | None = None, vdim: int | None = None):
    super().__init__()

    self.embedding_dim = embedding_dim
    self.kdim = kdim or embedding_dim
    self.vdim = vdim or embedding_dim

    """your code here"""

  def forward(self, inputs):
    """
    inputs: [batch, seq_len, embedding_dim]
    """

## Problem 2: IMDB movie review dataset sentiment analysis

The [IMDB movie review dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) has 50k movie reviews. Each review has a binary sentiment classification label "positive" or "negative". We will use LLMs to perform movie review sentiment classification.

By the end of the assignment, you will gain hands-on experience with
* Zero-shot / few-shot prompting of LLMs
* Using embeddings for a downstream use case
* Preparing data and finetuning an LLM
* Evaluating IMDB movie review sentiment analysis across multiple approaches

In [None]:
import os

# Note: To load files correctly, add the "Module 6 : Deep Dive Into LLMs" folder
# as a shortcut under "MyDrive".
from google.colab import drive
drive.mount('/content/drive')
assets_dir = '/content/drive/MyDrive/Module 6 : Deep Dive Into LLMs/Assignment and MCQs/datasets/'

Mounted at /content/drive


In [None]:
# Parse the csv data file into a data frame. Use the first half as training
# partition, and the second half for eval.
import pandas as pd

df_reviews = pd.read_csv(os.path.join(assets_dir, 'IMDB_Dataset.csv'))
df_reviews_train = df_reviews.iloc[0: len(df_reviews) // 2]
df_reviews_test = df_reviews.iloc[len(df_reviews) // 2:]
df_reviews_test.head(3)

|       | review | sentiment |
|-------|--------|-----------|
| 25000 | This movie was bad from the start. The only pu... | negative |
| 25001 | God, I never felt so insulted in my whole life... | negative |
| 25002 | Not being a fan of the Coen Brothers or George... | positive |


### Use pre-trained models

Pretrained LLMs are quite powerful and often perform well on novel tasks. Here are a few approaches we can try; feel free to brainstorm your own solutions as well.


1.   Zero-shot or few-shot prompting. In zero-shot prompting, we describe the task for the model with no demonstration: "Please classify the sentiment of this movie review as positive or negative. Review: I loved the cinematography of....... Sentiment:". In few-shot prompting, we describe the task and also provide a few demonstrations:
```
Please classify the sentiment of this movie review as positive or negative.
Review: I loved the cinematography of.......
Sentiment: positive
Review: What a waste of time. I wish......
Sentiment: negative
Review: The movie was a big surprise......
Sentiment:
```
Zero/few-shot performance is generally considered an "emergent capability at scale", meaning sufficiently large models can do well while small models often show a performance gap.

  *   Try prompting the small gpt-2 model we used during the live class demo, and if you have access to APIs of stronger models, try sending a subset of the reviews to them as well.
  *   Compute an accuracy metric for zero/few shot approaches.


2.   Embedding and nearest neighbors. Inputs that are semantically similar are often also close to each other in the embedding space (e.g. they have higher cosine similarity). We showed during the live class demo how to extract embeddings of movie reviews from a pretrained BERT model's CLS token, and how to retrieve the top k reviews most similar to a query review. Can we also use this setup for sentiment analysis?

  *    For a test review, find the top k most similar training reviews, and use majority voting of these top k training reviews' sentiment as the query's sentiment prediction.
  *    Compute an accuracy metric for the embedding + KNN approach.
  *    Is there an optimal k value?
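The embedding + KNN idea can be sketched on toy data as follows. This is a minimal illustration, not the assignment solution: `knn_predict` is a hypothetical helper, the 2-D vectors stand in for real review embeddings, and labels are assumed to be encoded as 1 = positive, 0 = negative.

```python
import torch

def knn_predict(query_norm, train_norm, train_labels, k):
    """query_norm: [n_query, dim], train_norm: [n_train, dim], both L2-normalized.
    train_labels: [n_train] with 1 = positive, 0 = negative.
    """
    # For unit vectors the dot product equals cosine similarity.
    sims = query_norm @ train_norm.T            # [n_query, n_train]
    topk_idx = sims.topk(k, dim=-1).indices     # [n_query, k]
    # Majority vote: fraction of positive neighbors above one half.
    votes = train_labels[topk_idx].float().mean(dim=-1)
    return (votes > 0.5).long()

# Toy stand-ins for embeddings: positives near [1, 0], negatives near [0, 1].
train = torch.tensor([[1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
                      [0.1, 1.0], [0.0, 1.0], [0.2, 0.9]])
train_labels = torch.tensor([1, 1, 1, 0, 0, 0])
queries = torch.tensor([[1.0, 0.05], [0.05, 1.0]])

train_norm = torch.nn.functional.normalize(train, p=2, dim=-1)
query_norm = torch.nn.functional.normalize(queries, p=2, dim=-1)
preds = knn_predict(query_norm, train_norm, train_labels, k=3)
print(preds)  # tensor([1, 0])
```

Using an odd `k` avoids ties in a binary vote. For 25k test reviews against 25k training reviews, the similarity matrix may need to be computed in batches of queries to fit in memory.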



In [None]:
# @title Use few shot prompt on gpt-2/gpt-3.5 for IMDB movie review dataset
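
One possible way to assemble the few-shot prompt described earlier is sketched below. `build_few_shot_prompt` and its `max_chars` cutoff are hypothetical: the cutoff is a crude character-level guard against exceeding the model's context length, and a token-level limit would be more precise.

```python
def build_few_shot_prompt(demos, query_review, max_chars=500):
    """demos: list of (review, sentiment) pairs used as demonstrations.
    Returns a prompt that ends with 'Sentiment:' for the model to complete.
    """
    lines = ["Please classify the sentiment of this movie review as positive or negative."]
    for review, sentiment in demos:
        lines.append(f"Review: {review[:max_chars]}")
        lines.append(f"Sentiment: {sentiment}")
    lines.append(f"Review: {query_review[:max_chars]}")
    lines.append("Sentiment:")
    return "\n".join(lines)

demos = [
    ("I loved the cinematography.", "positive"),
    ("What a waste of time.", "negative"),
]
prompt = build_few_shot_prompt(demos, "The movie was a big surprise.")
print(prompt)
```

Passing an empty `demos` list gives the zero-shot variant of the same prompt.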

In [None]:
# @title Use BERT embedding and KNN for IMDB movie review dataset

# You can re-use the code in "Transformer Embedding" live demo, as well as
# loading the pre-computed movie review embeddings. Note that the embeddings are
# computed for `df_reviews` in order and have 50k rows. Additionally, the path
# below assumes you've added a shortcut to "Module 6 : Deep Dive Into LLMs"
# folder under MyDrive.
vector_store = torch.load('/content/drive/MyDrive/Module 6 : Deep Dive Into LLMs/vector_store.pt')
vector_test = vector_store[len(df_reviews) // 2:, :]
vector_test_norm = torch.nn.functional.normalize(vector_test, p=2, dim=-1)
vector_test_norm = vector_test_norm.to(device)

### Finetune a pre-trained gpt-2 model

### Task Framing
To convert the sentiment classification problem to a language generation task, we can structure training inputs as `[review_text]<|endoftext|>positive<|endoftext|>`. The model should learn to predict `<|endoftext|>positive<|endoftext|>` after seeing the `review_text`. For example:
```
Not being a fan of the Coen Brothers or George...<|endoftext|>positive<|endoftext|>
```


`<|endoftext|>` is both the bos and eos token in the gpt-2 model. We surround the sentiment token with `<|endoftext|>` to teach the model when to start generating the sentiment classification (otherwise the model will not know whether to continue the review or to classify its sentiment), and when to stop generating (we don't need any more output following the sentiment classification).

We can obtain the classification result by comparing the model's output, conditioned on a review text followed by `<|endoftext|>`, to "positive" and "negative".
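As a small illustration of this framing, the sketch below builds a training string and maps generated text back to a label. Both helper names are hypothetical, and the parsing rule (take the text before the first `<|endoftext|>`) is one reasonable assumption about how to read the model output.

```python
EOS = "<|endoftext|>"

def format_training_text(review, sentiment):
    """Build `[review_text]<|endoftext|>[sentiment]<|endoftext|>`."""
    return f"{review}{EOS}{sentiment}{EOS}"

def parse_sentiment(generated_text):
    """Map text generated after `review + EOS` to a label, or None if unrecognized."""
    answer = generated_text.split(EOS)[0].strip().lower()
    return answer if answer in ("positive", "negative") else None

text = format_training_text("Not being a fan of the Coen Brothers...", "positive")
print(text)
print(parse_sentiment("positive<|endoftext|>extra tokens"))  # positive
```

Returning `None` for unrecognized output lets the evaluation count such cases as errors instead of silently guessing a label.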



### Model

We use [GPT2LMHeadModel](https://huggingface.co/docs/transformers/en/model_doc/gpt2#transformers.GPT2LMHeadModel), which has about 124M parameters.

To finish training the model, you may need an A100 GPU. To check that training runs correctly, you can use a CPU/T4 runtime with a very small batch size (e.g. 1). Alternatively, the finetuned model weights have been saved to the assignment directory, and a T4 is sufficient to run inference from the loaded weights.

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

lm_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
lm =  GPT2LMHeadModel.from_pretrained('gpt2')
lm = lm.to(device)



In [None]:
# Count number of parameters in gpt-2
sum(p.numel() for p in lm.parameters() if p.requires_grad)

124439808

In [None]:
# View tokenizer output.
results = lm_tokenizer("Hi there", return_tensors="pt")
print(results)

{'input_ids': tensor([[17250,   612]]), 'attention_mask': tensor([[1, 1]])}


### Create a custom Dataset

We will first need to create a custom Dataset class to yield appropriate training input.

The forward pass of `GPT2LMHeadModel` requires these arguments:


*   input_ids: integer token ids
*   attention_mask: 0/1 mask. During training, we usually want to pad/truncate each input sequence to the same length (e.g. max_length of the model). For shorter sequences, we add padding token ids. The attention mask has the same length as the token id sequence, and is 0 for padding tokens and 1 elsewhere. It lets the model ignore padding tokens during training. This [blog](https://gmongaras.medium.com/how-do-self-attention-masks-work-72ed9382510f) explains masked attention in detail.
*   labels: This is the desired generation we want the model to learn. In the IMDB dataset, we want the model to respond with "positive" for positive sentiment reviews, and "negative" otherwise. The GPT model is a language model, and can compute P(positive | review) and P(negative | review). This allows us to compute the cross entropy loss against the ground-truth annotation. For a positive review, the cross entropy is -1.0 * Gpt_logP(positive | review), where 1.0 is the true probability of "positive".
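The cross entropy term for a single position can be checked numerically. The sketch below uses a made-up 3-token vocabulary (index 0 standing in for "positive") and also shows that positions labeled `-100` are skipped, since `-100` is the default `ignore_index` of PyTorch's cross entropy.

```python
import torch
import torch.nn.functional as F

# Toy vocabulary of 3 tokens; pretend index 0 = "positive", 1 = "negative".
logits = torch.tensor([[2.0, 0.5, -1.0]])   # model scores at one position
target = torch.tensor([0])                  # ground truth: "positive"

loss = F.cross_entropy(logits, target)
# Same value by hand: -1.0 * log P(positive)
manual = -torch.log_softmax(logits, dim=-1)[0, 0]

# Two positions, the second labeled -100: only the first contributes.
logits2 = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
labels2 = torch.tensor([0, -100])
masked_loss = F.cross_entropy(logits2, labels2)

print(loss.item(), manual.item(), masked_loss.item())
```

All three printed values agree, because the mean is taken only over positions that are not ignored.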

The devil is in the details.


*    The `GPT2LMHeadModel` implementation expects labels to be of the same length as input_ids ([source code](https://github.com/huggingface/transformers/blob/25245ec26dc29bcf6102e1b4ddd0dfd02e720cf5/src/transformers/models/gpt2/modeling_gpt2.py#L1331)). This makes sense for pre-training: we want to keep predicting the next token over the entire sequence. In our finetuning case, we only want to predict "positive" or "negative" after the model has seen the review, i.e. we want to compute the cross entropy loss only on the sentiment prediction tokens, not on the review tokens. The [documentation](https://huggingface.co/docs/transformers/en/model_doc/gpt2#transformers.GPT2LMHeadModel) of `GPT2LMHeadModel` calls out that label ids set to `-100` are ignored in the cross entropy loss term (this corresponds to `torch.nn.CrossEntropyLoss.ignore_index`). So our dataset object should set all label ids before the sentiment token id to -100 (the model will learn to predict `positive<|endoftext|>` and `negative<|endoftext|>`).
*   Hugging Face tokenizer objects accept a `truncation` parameter to truncate input texts exceeding the model's token length limit. However, we must preserve the ending `<|endoftext|>positive<|endoftext|>` string, so we can only truncate the review text. This means we need to tokenize the review text and the sentiment label string separately, truncate the review text if necessary, and combine the two sets of tokens. Tokenization returns both `input_ids` and `attention_mask`, so we need to make sure the `attention_mask` tensor is properly constructed for review_text_ids_maybe_truncated + sentiment_ids as well.
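The two points above can be sketched on plain lists of token ids, before any tensors are involved. This is an illustration under assumptions, not the expected solution: `build_example` is a hypothetical helper, the review and sentiment ids are made-up stand-ins for tokenizer output, and reusing `<|endoftext|>` (id 50256 in gpt-2) as the padding id is a common convention that the assignment does not prescribe.

```python
EOS_ID = 50256  # id of <|endoftext|> in the gpt-2 vocabulary

def build_example(review_ids, sentiment_ids, max_len):
    """Combine tokenized review and sentiment into fixed-length model inputs."""
    suffix = [EOS_ID] + sentiment_ids + [EOS_ID]
    if max_len < len(suffix):
        raise ValueError("max_len is too small to hold the sentiment suffix")
    # Truncate only the review so the suffix always survives.
    review_ids = review_ids[: max_len - len(suffix)]
    input_ids = review_ids + suffix
    # Loss only on the sentiment token(s) and the final EOS; the review and
    # the separating EOS are masked out with -100.
    labels = [-100] * (len(review_ids) + 1) + sentiment_ids + [EOS_ID]
    pad = max_len - len(input_ids)
    return {
        "input_ids": input_ids + [EOS_ID] * pad,   # assumption: pad with EOS
        "attention_mask": [1] * len(input_ids) + [0] * pad,
        "labels": labels + [-100] * pad,           # never train on padding
    }

# Made-up ids: a 10-token review and a 1-token sentiment.
example = build_example(list(range(100, 110)), [7], max_len=8)
print(example["input_ids"])  # [100, 101, 102, 103, 104, 50256, 7, 50256]
print(example["labels"])     # [-100, -100, -100, -100, -100, -100, 7, 50256]
```

In `__getitem__`, the three lists would then be converted to tensors, e.g. with `torch.tensor(...)`.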








In [None]:
from torch.utils.data import Dataset, DataLoader

class ImdbDataset(Dataset):
  def __init__(self, df, tokenizer, max_len=512, train=True):
    self.df = df
    self.tokenizer = tokenizer
    self.max_len = max_len
    self.train = train

  def __len__(self):
    return len(self.df)

  def __getitem__(self, idx):
    """Your code here"""

### Train and evaluate the model

Compute an accuracy metric in eval.
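A minimal accuracy helper might look like the following. The function name and the convention that an unparseable model output is represented as `None` (and counted as wrong) are assumptions made for this sketch.

```python
def accuracy(predictions, references):
    """Fraction of predictions that equal the reference label.
    Unparseable outputs (None) never match and so count as errors.
    """
    if not references:
        raise ValueError("references must be non-empty")
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

preds = ["positive", "negative", None, "positive"]
refs = ["positive", "negative", "negative", "negative"]
print(accuracy(preds, refs))  # 0.5
```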

In [None]:
# Your training and eval code here

### Alternative approaches
There are many possible modeling choices for the IMDB movie review dataset. To name a few:


*   We can train a classifier on the gpt model's last-layer hidden state at the final token. [GPT2ForSequenceClassification](https://huggingface.co/docs/transformers/en/model_doc/gpt2#transformers.GPT2ForSequenceClassification) uses this approach.
*   We can train a classifier on an embedding model output, e.g. we looked at BERT model, whose CLS token is a sequence level representation.
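The second alternative, a classifier on top of fixed embeddings, can be sketched with a single linear layer. Everything below is a stand-in for illustration: the "embeddings" are synthetic 16-dimensional vectors (real BERT-base CLS embeddings would be 768-dimensional), and the optimizer, learning rate and step count are arbitrary choices.

```python
import torch

torch.manual_seed(0)
dim, n = 16, 200  # stand-in sizes, for illustration only
labels = torch.randint(0, 2, (n,))
# Synthetic "embeddings": a class-dependent mean (+1 or -1) plus noise.
embeddings = torch.randn(n, dim) + (labels.float().unsqueeze(1) * 2 - 1)

classifier = torch.nn.Linear(dim, 2)  # logits for negative / positive
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-2)
loss_fn = torch.nn.CrossEntropyLoss()

losses = []
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(classifier(embeddings), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

with torch.no_grad():
    train_acc = (classifier(embeddings).argmax(dim=-1) == labels).float().mean().item()
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, train acc {train_acc:.2f}")
```

Because the embedding model stays frozen, only the small linear layer is trained, which makes this approach much cheaper than finetuning the full language model.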



### Debugging Tips
*   Remember to always call `model.to(gpu_device)` and `data.to(gpu_device)` to push computation to GPU. The computations required in this assignment will be too slow on CPU.

*   If you see a GPU out-of-memory error, try:
    * Reduce train/inference batch size
    * Use a smaller max_seq_len
*   If you see very poor finetuning performance (e.g. loss not going down), the first thing to check is usually whether the input data is correctly constructed. LLMs are often robust to a wide range of hyperparameters, but training can still fail to converge when the learning rate is too high.
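The first tip can be sketched as below. The tiny linear model and the random batch are stand-ins for the real model and a `DataLoader` batch; the pattern of moving every tensor in a batch dictionary to the same device as the model is the part that carries over.

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = torch.nn.Linear(4, 2).to(device)  # stand-in for the real model
batch = {'x': torch.rand(8, 4), 'y': torch.randint(0, 2, (8,))}
# Move every tensor in the batch to the model's device.
batch = {name: tensor.to(device) for name, tensor in batch.items()}

# For inference, disabling gradient tracking also reduces memory use.
with torch.no_grad():
    logits = model(batch['x'])
print(logits.shape, logits.device)
```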