
<br>
<font>
<div dir=ltr align=center>
<font color=0F5298 size=10>
    Deep Learning - HW4 <br>
<font color=2565AE size=5>
    Electrical Engineering Department <br>
    winter 2024<br>
<font color=3C99D size=5>
    Practical Assignment 3 <br>
<font color=696880 size=4>
    Amirabbas Afzali 

____

# Personal Data

In [2]:
# Set your student number
student_number = '401110437'
Name = 'Parsa'
Last_Name = 'Ghezelbash'

# Rules
- Make sure that all of your cells can be run perfectly. 
- Try to minimize your use of ChatGPT (or any other AI assistant) as much as possible.
- You must create a report for this task in PDF format and explain the main results.

---

## **Introduction**

Large Language Models (LLMs) are a class of deep learning models designed for processing and generating natural language. These models are trained using large amounts of textual data and utilize architectures based on transformers. Some of the applications of these models include text generation, machine translation, text summarization, question answering, and text classification.

### *Encoder-Decoder LLMs*

One of the common architectures in large language models is the Encoder-Decoder architecture. In this architecture, the encoder processes an input sequence and maps it to a latent space. Then, the decoder uses this latent space to generate an output sequence. Models like T5 [1] (Text-to-Text Transfer Transformer) use this architecture to perform various tasks. In T5, all tasks are expressed in a "text-to-text" format, meaning both input and output are text. This model has capabilities such as translation, summarization, and text classification. One of the advantages of the Encoder-Decoder architecture is that it allows the encoder to utilize information from both before and after a word to gain a more comprehensive understanding of the text.

### *Decoder-only LLMs*

Decoder-only models, such as GPT-2, GPT-3, and LLaMA [2], use only the decoder part of the Transformer, unlike the Encoder-Decoder architecture. They are autoregressive: each token is predicted from the tokens that precede it. These models are well suited to text generation and are widely used today.

Advantages of Decoder-only Models

- Efficiency: Decoder-only models have no separate encoder, so they typically need fewer computational and memory resources than a comparable Encoder-Decoder model.
- Simplicity: Due to their autoregressive nature, these models can easily generate sequences in order.
- Scalability: Due to their simpler architecture, these models can be scaled to much larger sizes.


However, one of the drawbacks of these models is that they can only utilize information from tokens before the current token and cannot use tokens that come after for prediction. This limitation is significant in tasks like classification or translation, where a full understanding of the sequence is needed.
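
The limitation described above can be illustrated with a small sketch of the two attention patterns. This is a toy illustration in plain Python, not the model's actual implementation; the helper names are hypothetical, and the convention assumed here is that `True` means "may attend" (note that `torch.nn.MultiheadAttention` uses the opposite convention for boolean masks).

```python
def causal_mask(seq_len):
    """True where query position i may attend to key position j (only j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def bidirectional_mask(seq_len):
    """Every position may attend to every other position."""
    return [[True] * seq_len for _ in range(seq_len)]


def visible_positions(mask, query_pos):
    """Key positions that the given query position can see."""
    return [j for j, allowed in enumerate(mask[query_pos]) if allowed]


causal = causal_mask(4)
full = bidirectional_mask(4)

print(visible_positions(causal, 0))  # the first token sees only itself
print(visible_positions(causal, 3))  # the last token sees the whole prefix
print(visible_positions(full, 0))    # with full attention, the first token sees everything
```

Under the causal pattern, the first position carries no information about later tokens, which is why a classification token placed at the start of the input is a weak summary unless an extra aggregation layer is added on top.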



## **Objective of the Exercise**


In this exercise, the goal is to convert a generative Decoder-only language model into an encoder and evaluate its performance on a binary sentiment classification task. The main aim is to modify the Decoder-only model so that it can function as an encoder and better handle tasks requiring bidirectional understanding.

## **In this exercise, you should:**



1. **Import a Decoder-only model** and load the weights of a pre-trained version of the model.
2. **Generate several outputs from the model**, and include at most 10 sample outputs in your report for different inputs.  
   You should also briefly explain the effects of key configurations in text generation, including:  
   - `Temperature`
   - `top_k`
   - `top_p`
   - `repetition_penalty`
   - `num_beams`
   - `no_repeat_ngram_size`
3. **Load the SST-2 dataset**, which is part of the GLUE benchmark for sentiment classification.  
   - Note that the length of the model’s output depends on the number of input tokens, so sequences of different lengths cannot be batched as they are.
   - Apply the necessary padding to the dataset after loading it so that the model can process examples in parallel.
4. **Remove the model’s final layer**, which projects the hidden states onto the model’s vocabulary.  
   - Use the embedding vector of the first token (CLS token) for classification.
5. As observed in the previous step, sometimes the embedding vector of the first token does not provide a good representation of the entire input text.  
   - **Add a linear layer** with the same input and output dimensions on top of the encoder's output, and use the output of this linear layer (corresponding to the CLS token) for classification.  
   - This step aggregates information of different tokens to get a comprehensive understanding of the input text.
6. **Instead of the linear layer** in the previous section, use a **bidirectional attention layer** with a custom number of heads (preferably 12).
7. **Repeat step 6** using **left-to-right unidirectional attention** and **right-to-left unidirectional attention**.
8. **Load a pre-trained encoder** (preferably BERT-base) and report its **zero-shot performance** (i.e., without training the model) on the test data.
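
To make three of the generation settings from step 2 more concrete, here is a minimal sketch of how `temperature`, `top_k`, and `top_p` reshape a next-token distribution. It is written in plain Python over a hypothetical list of logits and only approximates what a library such as `transformers` does internally; the function names are illustrative, and `repetition_penalty`, `num_beams`, and `no_repeat_ngram_size` are not covered.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical next-token logits for a 4-token vocabulary
probs = softmax(logits)

sharp = softmax(logits, temperature=0.5)
flat = softmax(logits, temperature=2.0)
print(max(sharp) > max(flat))               # lower temperature puts more mass on the top token
print(sorted(top_k_filter(probs, k=2)))     # token indices that survive top-k
print(sorted(top_p_filter(probs, p=0.8)))   # token indices that survive nucleus (top-p) filtering
```

Sampling then draws the next token from the filtered, renormalised distribution, so smaller values of `top_k` or `top_p` and lower temperatures all make the output less diverse.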

## **Evaluation:**

In this exercise, for each of sections 4, 5, 6, 7, and 8, you need to plot the confusion matrix corresponding to the model's performance on the test data. Additionally, you should plot two separate graphs showing the training loss and the accuracy of the trained models, and compare them with each other, providing an appropriate analysis of your results. Also, note that high accuracy is not expected for sections 4 and 5, but the correctness of your code will be checked. However, for sections 6 and 7, higher accuracy (around 90%) is expected.
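
As a reference for the evaluation described above, the following is a minimal sketch of how a binary confusion matrix and the accuracy can be computed from label lists. The function names and the toy labels are illustrative assumptions; in practice a library routine such as scikit-learn's `confusion_matrix` could be used instead, together with a plotting library.

```python
def confusion_matrix(y_true, y_pred, num_classes=2):
    """matrix[t][p] counts the examples with true label t that were predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of examples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Toy labels for illustration only (0 = negative, 1 = positive)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
print(f"accuracy = {accuracy(cm):.3f}")
```

Rows correspond to true labels and columns to predictions, so the off-diagonal cells show which class the model confuses with which.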













## **Let's go:**


Load `gpt2` model:

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Model, GenerationConfig
import torch
from torch import nn


tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Fall back to the CPU when no GPU is available
model = GPT2LMHeadModel.from_pretrained("gpt2").to("cuda" if torch.cuda.is_available() else "cpu")

Load `sst-2` dataset:

In [None]:
from datasets import load_dataset

# Load the SST-2 dataset from Hugging Face 
dataset = load_dataset("glue", "sst2")

go ahead:

## 2)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

In [None]:
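# Register a dedicated classification token
tokenizer.add_special_tokens({"cls_token": "<|CLS|>"})
# GPT-2 ships without a padding token; reuse EOS so that padded batches can be built later
tokenizer.pad_token = tokenizer.eos_token
# Grow the embedding matrix to cover the newly added token
model.resize_token_embeddings(len(tokenizer))
model.to(device).eval()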

In [None]:
prompts = [
    "Once upon a time",
    "Deep learning is all about",
    "In a shocking finding today, scientists discovered",
    "The meaning of life is",
    "The best way to learn is",
    "The most important thing in the world is",
    "The most beautiful thing in the world is",
    "The most dangerous thing in the world is",
    "The most exciting thing in the world is",
    "The most boring thing in the world is",
]

generation_configs = [
    GenerationConfig(
        max_new_tokens=30,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        repetition_penalty=1.2,
        num_beams=1,
        no_repeat_ngram_size=2,
        do_sample=True
    ),
]

for i, prompt in enumerate(prompts[:10]):
    # Tokenize with an attention mask so that generate() does not have to infer one
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    print(f"\nPrompt {i+1}: {prompt}\n{'-'*40}")

    for j, gen_cfg in enumerate(generation_configs):
        output_ids = model.generate(
            **inputs,
            generation_config=gen_cfg,
            pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token of its own
        )
        generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        print(f"  [Gen config {j+1}] --> {generated_text}")


## 3)

In [None]:
from torch.utils.data import DataLoader

def tokenize_function(examples):
    texts = ["<|CLS|> " + t for t in examples["sentence"]]
    return tokenizer(texts, truncation=True, padding="max_length", max_length=32)

dataset = dataset.map(tokenize_function, batched=True)
dataset.set_format(type="torch", columns=["input_ids", "label", "attention_mask"])

train_data = dataset["train"]
val_data   = dataset["validation"]
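# Note: the GLUE SST-2 test split ships with hidden labels (all -1),
# so metrics that need ground truth should be computed on the validation split.
test_data  = dataset["test"]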

print("Train sample:", train_data[0])

train_loader = DataLoader(train_data, batch_size=8, shuffle=True)
val_loader   = DataLoader(val_data, batch_size=8)
test_loader  = DataLoader(test_data, batch_size=8)

## 4)

In [None]:
model_without_lm_head = GPT2Model.from_pretrained("gpt2")
model_without_lm_head.resize_token_embeddings(len(tokenizer))
model_without_lm_head.to(device).eval()

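@torch.no_grad()  # feature extraction only, so gradients are not needed
def extract_cls_embedding(input_ids, attention_mask):
    outputs = model_without_lm_head(
        input_ids=input_ids,
        attention_mask=attention_mask
    )
    hidden_states = outputs.last_hidden_state  # [batch, seq_len, hidden]
    # <|CLS|> was prepended, so it sits at position 0. Under GPT-2's causal mask this
    # position cannot see the tokens after it, which limits how informative it is.
    cls_embed = hidden_states[:, 0, :]
    return cls_embed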

sample = train_data[0]
input_ids = sample["input_ids"].unsqueeze(0).to(device)
attention_mask = sample["attention_mask"].unsqueeze(0).to(device)

cls_vector = extract_cls_embedding(input_ids, attention_mask)
print("CLS vector shape:", cls_vector.shape)

## 5)

In [None]:
class GPT2ClassifierLinear(nn.Module):
    def __init__(self, gpt2_model, hidden_dim=768, num_labels=2):
        super().__init__()
        self.gpt2 = gpt2_model
        self.linear = nn.Linear(hidden_dim, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.gpt2(input_ids, attention_mask=attention_mask)
        hidden_states = outputs.last_hidden_state
        cls_token = hidden_states[:, 0, :]
        logits = self.linear(cls_token)
        return logits

model_cls_linear = GPT2ClassifierLinear(model_without_lm_head).to(device)

#TODO: We'll need a training loop to fine-tune on SST-2, but here's the main architecture.
print("GPT2ClassifierLinear created.")

## 6)

In [None]:
class GPT2ClassifierBiAttn(nn.Module):
    def __init__(self, gpt2_model, hidden_dim=768, num_labels=2, num_heads=12):
        super().__init__()
        self.gpt2 = gpt2_model
        self.attn = nn.MultiheadAttention(embed_dim=hidden_dim, num_heads=num_heads, batch_first=True)
        self.linear = nn.Linear(hidden_dim, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.gpt2(input_ids, attention_mask=attention_mask)
        hidden_states = outputs.last_hidden_state
        
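        cls_token = hidden_states[:, 0:1, :]  # query: the CLS position only, [B, 1, H]

        # No attn_mask is passed, so the CLS query may attend to every position;
        # key_padding_mask (True = ignore) keeps padded positions out of the attention.
        out, _ = self.attn(
            query=cls_token,
            key=hidden_states,
            value=hidden_states,
            key_padding_mask=(attention_mask == 0),
        )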

        logits = self.linear(out.squeeze(1))
        return logits

model_cls_biattn = GPT2ClassifierBiAttn(
    gpt2_model=model_without_lm_head, 
    hidden_dim=model_without_lm_head.config.hidden_size, 
    num_labels=2, 
    num_heads=12
).to(device)

print("GPT2ClassifierBiAttn created (bidirectional attention).")

## 7)

In [None]:
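def generate_causal_mask(seq_len):
    # Boolean mask in nn.MultiheadAttention's convention: True marks a (query, key)
    # pair that is NOT allowed to attend. Row i blocks every key j > i.
    mask = torch.ones(seq_len, seq_len).triu(1).bool()
    return mask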

class GPT2ClassifierCausalAttn(nn.Module):
    def __init__(self, gpt2_model, hidden_dim=768, num_labels=2, num_heads=12, l2r=True):
        super().__init__()
        self.gpt2 = gpt2_model
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.linear = nn.Linear(hidden_dim, num_labels)
        self.l2r = l2r

    def forward(self, input_ids, attention_mask):
        outputs = self.gpt2(input_ids, attention_mask=attention_mask)
        hidden_states = outputs.last_hidden_state
        B, L, H = hidden_states.shape

        cls_token = hidden_states[:, 0:1, :]
        # Build the full [L, L] mask, then keep only the row for the CLS query at position 0
        causal_mask = generate_causal_mask(L).to(hidden_states.device)
        
        if not self.l2r:
            # Flipping both axes turns "block keys to the right" into "block keys to the left"
            causal_mask = torch.flip(causal_mask, dims=(0,1))
        
        # Row 0 of the mask: with l2r the CLS query sees only itself; with r2l it sees all positions
        out, _ = self.attn(
            query=cls_token,  # [B, 1, H]
            key=hidden_states,  # [B, L, H]
            value=hidden_states,
            attn_mask=causal_mask[:1, :],  # shape must be [tgt_len, src_len] => [1, L]
            key_padding_mask=(attention_mask == 0),  # True = padded position, ignored
        )
        logits = self.linear(out.squeeze(1))  # [B, num_labels]
        return logits

model_cls_l2r = GPT2ClassifierCausalAttn(
    model_without_lm_head, 
    hidden_dim=model_without_lm_head.config.hidden_size, 
    num_labels=2,
    num_heads=12,
    l2r=True
).to(device)

model_cls_r2l = GPT2ClassifierCausalAttn(
    model_without_lm_head,
    hidden_dim=model_without_lm_head.config.hidden_size,
    num_labels=2,
    num_heads=12,
    l2r=False
).to(device)

print("GPT2ClassifierCausalAttn created for L2R and R2L unidirectional attention.")


## 8)

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification
 
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2).to(device)
bert_model.eval()

print("Loaded BERT-base for classification (zero-shot).")

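# Evaluate without any fine-tuning. Two caveats:
# - the classification head of BertForSequenceClassification is randomly initialised here,
#   so accuracy close to chance is expected;
# - the GLUE test split has hidden labels (-1), so the labelled validation split is used instead.
# The split is reloaded because set_format() above hides the "sentence" column.
eval_raw = load_dataset("glue", "sst2", split="validation")

correct = 0
total = 0

for example in eval_raw:
    input_ids = bert_tokenizer.encode(example["sentence"], return_tensors="pt", truncation=True, max_length=64).to(device)
    with torch.inference_mode():
        logits = bert_model(input_ids).logits  # shape [1, 2]
    pred_label = torch.argmax(logits, dim=-1).item()
    if pred_label == example["label"]:
        correct += 1
    total += 1

accuracy_bert = correct / total
print(f"Zero-shot accuracy with BERT-base (validation split): {accuracy_bert:.4f}")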


In [None]:
from transformers import BertTokenizer, BertForSequenceClassification

# Load pre-trained BERT (the classification head on top is newly initialised, not trained)
bert_model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2).to(device)
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model.eval()

# Zero-shot evaluation of one batch of raw sentences (batch["sentence"] is a list of strings)
@torch.inference_mode()
def evaluate_bert(batch):
    inputs = bert_tokenizer(batch["sentence"], return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)
    outputs = bert_model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=1)
    return predictions


---------
### References

[1] Raffel, Colin, Noam Shazeer, Adam Roberts, et al. (2020). *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer*. [Link to paper](https://arxiv.org/abs/1910.10683)

[2] Touvron, Hugo, et al. (2023). *LLaMA 2: Open Foundation and Fine-Tuned Chat Models*. [Link to paper](https://arxiv.org/abs/2307.09288)

<span style="color:yellow;">*For further reading on this field of research, you can refer to the following papers:*</span>

[3] BehnamGhader, Adlakha, et al. (2024). *LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders*. [Link to paper](https://arxiv.org/abs/2404.05961)

[4] Gao, Tianyu, et al. (2021). *SimCSE: Simple Contrastive Learning of Sentence Embeddings*. [Link to paper](https://arxiv.org/abs/2104.08821)

[5] Lee, et al. (2024). *NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models*. [Link to paper](https://arxiv.org/abs/2405.17428)




# **Best regards.**