# Homework 1 Part 2

## Course Name: Large Language Models
#### Lecturers: Dr. Soleimani, Dr. Rohban, Dr. Asgari

---

#### Notebooks Supervised By: MohammadAli SadraeiJavaheri
#### Notebooks Prepared By: Zeinab Sadat Taghavi, Hamed Jamshidian, Seyed Mohammad Reza Modarres

**Contact**: Ask your questions in Quera

---

### Instructions:
- Complete all exercises presented in this notebook.
- Ensure you run each cell after you've entered your solution.
- After completing the exercises, save the notebook and <font color='red'>follow the submission guidelines provided in the PDF.</font>


---

**Note**: Replace the placeholders (between `## Your code begins ##` and `## Your code ends ##`) with the appropriate details.


# Introduction

<b> What are soft prompts? </b>
<br>
soft prompts can be described as a concept that involves incorporating vectors into an input sequence and then fine-tuning these vectors while keeping the rest of the pre-trained model's components unchanged. We deonte our input with $X$ and we denote $P$ as the matrix of these soft prompt vectors.
<br>
<div>
<img src="https://drive.google.com/uc?id=1aGI6FgvK3udOmHnWt1dCvC7lh6e9C2Oe" width="50%"/>
</div>

Read More :
<br>[Youtube : PEFT and Soft Prompt](https://www.youtube.com/watch?v=8uy_WII76L0)
<br>[Blog : What are soft prompts?](https://softwaremind.com/blog/how-and-why-soft-promps-are-slowly-replacing-text-prompts/)


### Requirements

In [1]:
%%capture
! pip install datasets transformers

### Imports

In [2]:
from tqdm.notebook import tqdm
from IPython import display

import numpy as np
import pandas as pd

from sklearn.metrics import accuracy_score

import torch
import torch.nn as nn

from datasets import load_dataset
from transformers import T5TokenizerFast, T5ForConditionalGeneration, DataCollatorForSeq2Seq

### Constants

### Base Model Selection
We will use `t5-small` as our base model from Hugging Face ([HF_Link](https://huggingface.co/t5-small)). For our tuning, we intend to utilize `10` soft prompt tokens ([HF_Link](https://huggingface.co/docs/peft/conceptual_guides/prompting), [Paper_Link](https://arxiv.org/abs/2104.08691)).


In [3]:
#####################################
###### DO NOT CHANGE THIS CELL ######
#####################################

BASE_MODEL_NAME = 't5-small'
N_SOFT_PROMPT_TOKENS = 10

BATCH_SIZE = 32
LEARNING_RATE = 0.1
EPOCHS = 10

DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Dataset

### Load dataset

`imdb` dataset is a famouns NLP for binary sentiment dataset. Each row of data is either `negative` or `positive` ([HF_Link](https://huggingface.co/datasets/imdb)).

In [4]:
dataset = load_dataset('imdb')
dataset.pop('unsupervised')
print(dataset)

Using the latest cached version of the module from /home/minuano/.cache/huggingface/modules/datasets_modules/datasets/imdb/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0 (last modified on Tue Nov 21 18:59:23 2023) since it couldn't be found locally at imdb., or remotely on the Hugging Face Hub.


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})


### Define related functions

Because `T5` model is a sequence to sequence model we should map our labels to label_names before training and doing vice versa duing calculating metrics.

The functions `id2label` and `label2id` are defined to do this.

In [5]:
def id2label(ids):
    label_names = ['negative', 'positive']
    return [label_names[id] for id in ids]

def label2id(labels):
    label_names_dict = {
        'negative': 0,
        'positive': 1
    }
    return [
        label_names_dict.get(label, 2)
        for label in labels
    ]

# Tokenizer

### Load tokenizer

In [6]:
tokenizer = T5TokenizerFast.from_pretrained(BASE_MODEL_NAME)

### Process dataset using tokenizer

In this step we will getting our dataset ready for training.

We preprocess tokenize our `text` and `label`.

For easier prompt tuning we put placeholders by prepending multiple `pad_token` to our input. The count of this pad tokens is the same as `n_soft_prompt_tokens`.

<font color='#73FF73'><b>You have to complete</b></font> `preprend_padding_token` <font color='#73FF73'><b>function.</b></font>

Replace `None` with your code.

In [7]:
def preprocess_input(text):
    text = text.lower()
    text = text.replace('<br />', ' ')
    return text

def preprend_padding_token(text):
    n_soft_prompt_tokens = N_SOFT_PROMPT_TOKENS
    pad_token = tokenizer.pad_token

    ######### Your code begins #########
    prefix = str([pad_token] * n_soft_prompt_tokens)
    ######### Your code ends ###########

    return prefix + text

def map_function(row):
    processed_input = [
        preprend_padding_token(preprocess_input(text))
        for text in row['text']
    ]
    input_info = tokenizer(processed_input, truncation=True, max_length=256)
    output_info = tokenizer(id2label(row['label']))
    return {
        **input_info,
        'labels': output_info.input_ids
    }


dataset = dataset.map(map_function, batched=True)
dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

# Model

### Load model

In [8]:
model = T5ForConditionalGeneration.from_pretrained(BASE_MODEL_NAME)

### Define prompt related layers

In this part we will define our prompt layer in `SimplePrompts`. It's a simple layer that only returns it's prompt matrix when called.

`EmbeddingWrapper` is a layer that will replace original embedding layer of model and it functions as our injection into the model architecture.

We use `sharif_llm` in our PEFT module name so we could keep it unfreeze during training.

<font color='#73FF73'><b>You have to complete</b></font> `prompts_joiner` <font color='#73FF73'><b>function.</b></font>

In this function prompts will concatenated to model input embeddings. But in `preprend_padding_token` we allready put some placeholders for prompts. We just need to replace it with real prompts.

At first step you must repeat `prompts` in each batch_size and then remove placeholder embedings from `input_embedding` to calculate `non_place_holders`.

In [9]:
class SimplePrompts(nn.Module):
    def __init__(self, inital_values: torch.Tensor):
        super().__init__()
        self.n_tokens = inital_values.size(0)
        self.emb_dim = inital_values.size(1)
        self.prompt_emb = nn.parameter.Parameter(
            inital_values.detach().clone()
        )

    def forward(self):
        return self.prompt_emb

def prompts_joiner(prompts, input_embedding):
    # prompts.shape         = (n_tokens, emb_dim)
    # input_embedding.shape = (batch_size, n_tokens + seq_len, emb_dim)

    n_tokens = prompts.size(0)
    batch_size = input_embedding.size(0)
    ######### Your code begins #########
    prompts_batched = prompts.expand(batch_size, -1, -1)
    non_place_holders = input_embedding[:,n_tokens:,:]
    ######### Your code ends ###########

    assert prompts_batched.shape == (batch_size, *prompts.shape)
    assert non_place_holders.shape[1] + n_tokens == input_embedding.shape[1]

    return torch.cat([prompts_batched, non_place_holders], dim=1)

class EmbeddingWrapper(nn.Module):
    def __init__(
        self,
        emb_layer: nn.Embedding,
        n_tokens: int,
        **kwargs
    ):
        super().__init__()
        self.emb_layer = emb_layer

        prompt_inital_values = self.emb_layer.weight[:n_tokens]

        self.sharif_llm_soft_prompts = SimplePrompts(inital_values=prompt_inital_values)

    def forward(self, tokens):
        prompts = self.sharif_llm_soft_prompts()
        input_embedding = self.emb_layer(tokens)
        return prompts_joiner(prompts, input_embedding)

### Replace encoder's embedding layer with our layer

<font color='#73FF73'><b>You have to complete</b></font> `mutate_model` <font color='#73FF73'><b>function.</b></font>

In this part we want to replace <b>model encoder embedding layer</b> with our wrapper.

You must use `get_encoder`, `get_input_embeddings` to get model embedding layer and use `EmbeddingWrapper` to create new embedding layer.

In [10]:
def mutate_model(model, n_tokens):
    if hasattr(model, '_mutated'):
        print("Model already contains Soft Prompt layers! \n Try reloading the model.")
        return
    ######### Your code begins #########
    encoder = model.get_encoder()
    embedding_layer = encoder.get_input_embeddings()
    new_embedding_layer = EmbeddingWrapper(embedding_layer, n_tokens)
    ######### Your code ends ###########
    encoder.set_input_embeddings(new_embedding_layer)

    model._mutated = True

mutate_model(model, n_tokens=N_SOFT_PROMPT_TOKENS)

### Freeze all model's weight except our PEFT module

In this part we will freeze entire model except `encoder.embed_tokens.sharif_llm_soft_prompts.prompt_emb`

In [11]:
def freeze_non_pefts(model, peft_key):
    print('Non freezed weights:')
    for param_name, weights in model.named_parameters():
        weights.requires_grad = peft_key in param_name
        if weights.requires_grad:
            print(param_name)

freeze_non_pefts(model, peft_key='sharif_llm')

Non freezed weights:
encoder.embed_tokens.sharif_llm_soft_prompts.prompt_emb


# Train and evaluate

### Define dataloaders

In [12]:
col_fn = DataCollatorForSeq2Seq(
    tokenizer, return_tensors='pt', padding='longest',
)

train_loader = torch.utils.data.DataLoader(
    dataset['train'],
    batch_size=BATCH_SIZE,
    collate_fn=col_fn,
    shuffle=True
)

test_loader = torch.utils.data.DataLoader(
    dataset['test'],
    batch_size=BATCH_SIZE,
    collate_fn=col_fn
)

### Train functions

In [13]:
def train_loop(model, loader, optimizer):
    model.train()

    batch_losses = []

    for row in tqdm(loader, desc='Training:'):
        optimizer.zero_grad()

        out = model(**row.to(model.device))
        loss = out.loss

        batch_loss_value = loss.item()
        loss.backward()
        optimizer.step()

        batch_losses.append(batch_loss_value)

    loss_value = np.mean(batch_losses)
    return {'train_loss': loss_value}

def _predict(model, row):
    return model.generate(
        input_ids=row.input_ids,
        attention_mask=row.attention_mask,
        max_length=5
    )

def tokenizer_ids_to_label(all_input_ids):
    return tokenizer.batch_decode(all_input_ids, skip_special_tokens=True)

def valid_loop(model, loader, compute_metrics):
    model.eval()

    all_true = []
    all_pred = []

    with torch.no_grad():
        for row in tqdm(loader, desc='Validating:'):
            row.to(model.device)
            pred = _predict(model, row)

            all_true += row.labels.detach().cpu().tolist()
            all_pred += pred.detach().cpu().tolist()

    all_true = label2id(tokenizer_ids_to_label(all_true))
    all_pred = label2id(tokenizer_ids_to_label(all_pred))

    return {'valid_acc': compute_metrics(y_true=all_true, y_pred=all_pred)}

### Define our optimizer and metric function

In [14]:
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
compute_metrics = accuracy_score

In [15]:
model.to(DEVICE)

all_results = []
for epoch in range(EPOCHS):
    epoch_results = {'epoch': epoch}

    epoch_results.update(
        train_loop(
            model=model,
            loader=train_loader,
            optimizer=optimizer,
        )
    )

    epoch_results.update(
        valid_loop(
            model=model,
            loader=test_loader,
            compute_metrics=compute_metrics,
        )
    )
    all_results.append(epoch_results)

    display.clear_output()
    display.display(pd.DataFrame(all_results).set_index('epoch'))

Unnamed: 0_level_0,train_loss,valid_acc
epoch,Unnamed: 1_level_1,Unnamed: 2_level_1
0,1.686789,0.8376
1,0.217867,0.84388
2,0.208369,0.8482
3,0.203363,0.83956
4,0.200727,0.84404
5,0.198525,0.84152
6,0.200024,0.85152
7,0.196788,0.84848
8,0.196731,0.83976
9,0.200037,0.85096


### Best Performance and number of parameters

You must report this number in you final report.

In [16]:
best_score = pd.DataFrame(all_results)['valid_acc'].max() * 100
total_params = sum(p.numel() for p in model.parameters())
print(f"Number of parameters: {total_params}")
print('Best model preformance is: %%%.1f' % best_score)

Number of parameters: 60511744
Best model preformance is: %85.2


### Save PEFT file

We expect you to <font color='#FF7373'>upload this file </font> with the rest of your files.

In [17]:
peft_dict = {
    key: val
    for (key, val) in model.state_dict().items()
    if 'sharif_llm' in key
}
torch.save(peft_dict, 'prompts.pt')

# Use external library

In [18]:
! pip install opendelta

Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


Use `OpenDelta` library to do the same thing. [link](https://opendelta.readthedocs.io/en/latest/modules/deltas.html)

For hyperparameters, test with `N_SOFT_PROMPT_TOKENS=1` and `N_SOFT_PROMPT_TOKENS=10` and report them in your report.

In [20]:
######### Your code begins #########
####################################
####################################
from opendelta import SoftPromptModel

N_SOFT_PROMPT_TOKENS_LIST = [1, 10]

for n_soft_prompt_tokens in N_SOFT_PROMPT_TOKENS_LIST:
    model = T5ForConditionalGeneration.from_pretrained(BASE_MODEL_NAME)
    
    print(f"Number of Soft Prompt Tokens being used: {n_soft_prompt_tokens}")

    # Instantiate Delta Model
    delta_model = SoftPromptModel(model, soft_token_num=n_soft_prompt_tokens)
    delta_model.freeze_module(exclude=["deltas"])

    # Fine-tune
    model.to(DEVICE)

    optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
    compute_metrics = accuracy_score

    for epoch in range(EPOCHS):
        epoch_results = {'epoch': epoch}

        train_loss = train_loop(
                model=model,
                loader=train_loader,
                optimizer=optimizer,
            )['train_loss']

        valid_acc =valid_loop(
                model=model,
                loader=test_loader,
                compute_metrics=compute_metrics,
            )['valid_acc']

        print(f"Epoch: {epoch}, train_loss: {train_loss}, valid_acc: {valid_acc}")

    delta_model.save_finetuned(
        f"{BASE_MODEL_NAME}_soft_prompts_FT_n={n_soft_prompt_tokens}"
    )


Number of Soft Prompt Tokens being used: 1


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 0, train_loss: 6.863217959318624, valid_acc: 0.83132


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 1, train_loss: 0.5135027703155032, valid_acc: 0.83192


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 2, train_loss: 0.38230650027847046, valid_acc: 0.83116


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 3, train_loss: 0.3807598795656048, valid_acc: 0.83044


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 4, train_loss: 0.40726433501905185, valid_acc: 0.83184


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 5, train_loss: 0.4276168885095345, valid_acc: 0.83092


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 6, train_loss: 0.4512150761911936, valid_acc: 0.83292


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 7, train_loss: 0.4506600707235848, valid_acc: 0.82948


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 8, train_loss: 0.4526224087761796, valid_acc: 0.82932


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 9, train_loss: 0.4705702768795935, valid_acc: 0.83024


[INFO|(OpenDelta)delta_configs:148]2024-03-20 00:09:02,134 >> Configuration saved in t5-small_soft_prompts_FT_n=1/config.json
[INFO|(OpenDelta)saving_loading_utils:182]2024-03-20 00:09:02,135 >> 
******************************
You delta models has been saved locally to:	/home/minuano/sharif-llm-fall-2023/assignments/assignment-1-PEFT/notebooks/t5-small_soft_prompts_FT_n=1
[INFO|(OpenDelta)saving_loading_utils:211]2024-03-20 00:09:02,135 >> The state dict size is 0.017 MB
[INFO|(OpenDelta)saving_loading_utils:200]2024-03-20 00:09:02,136 >> We encourage users to push their final and public models to delta center to share them with the community!


Number of Soft Prompt Tokens being used: 10


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 0, train_loss: 1.6289572974528803, valid_acc: 0.82784


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 1, train_loss: 0.22154917693732645, valid_acc: 0.84344


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 2, train_loss: 0.2095711538496682, valid_acc: 0.83384


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 3, train_loss: 0.20512747036679016, valid_acc: 0.8438


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 4, train_loss: 0.20179360226520796, valid_acc: 0.847


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 5, train_loss: 0.20250317930717907, valid_acc: 0.84168


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 6, train_loss: 0.20375190942031343, valid_acc: 0.84088


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 7, train_loss: 0.20001341675972695, valid_acc: 0.84836


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 8, train_loss: 0.20321749136461625, valid_acc: 0.84608


Training::   0%|          | 0/782 [00:00<?, ?it/s]

Validating::   0%|          | 0/782 [00:00<?, ?it/s]

Epoch: 9, train_loss: 0.20276426290021377, valid_acc: 0.84684


[INFO|(OpenDelta)delta_configs:148]2024-03-20 00:46:28,393 >> Configuration saved in t5-small_soft_prompts_FT_n=10/config.json
[INFO|(OpenDelta)saving_loading_utils:182]2024-03-20 00:46:28,394 >> 
******************************
You delta models has been saved locally to:	/home/minuano/sharif-llm-fall-2023/assignments/assignment-1-PEFT/notebooks/t5-small_soft_prompts_FT_n=10
[INFO|(OpenDelta)saving_loading_utils:211]2024-03-20 00:46:28,394 >> The state dict size is 0.034 MB
[INFO|(OpenDelta)saving_loading_utils:200]2024-03-20 00:46:28,395 >> We encourage users to push their final and public models to delta center to share them with the community!
