# Homework 1 Part 2

## Course Name: Large Language Models
#### Lecturers: Dr. Soleimani, Dr. Rohban, Dr. Asgari

---

#### Notebooks Supervised By: MohammadAli SadraeiJavaheri
#### Notebooks Prepared By: Zeinab Sadat Taghavi, Hamed Jamshidian, Seyed Mohammad Reza Modarres

**Contact**: Ask your questions in Quera

---

### Instructions:
- Complete all exercises presented in this notebook.
- Ensure you run each cell after you've entered your solution.
- After completing the exercises, save the notebook and <font color='red'>follow the submission guidelines provided in the PDF.</font>


---

**Note**: Replace the placeholders (between `## Your code begins ##` and `## Your code ends ##`) with the appropriate details.


# Introduction

<b> What are soft prompts? </b>
<br>
Soft prompts are trainable vectors that are inserted into the input sequence and fine-tuned while the rest of the pre-trained model's components stay frozen. We denote the input by $X$ and the matrix of soft prompt vectors by $P$.
<br>
<div>
<img src="https://drive.google.com/uc?id=1aGI6FgvK3udOmHnWt1dCvC7lh6e9C2Oe" width="50%"/>
</div>
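
As a compact way to state the idea above, the forward pass can be written as follows. Only $X$ and $P$ come from the text; the symbols $p$ (number of soft prompt tokens), $n$ (input length), $d$ (embedding dimension), $E$ (the frozen embedding layer) and $\theta$ (the frozen model weights) are notation assumed here for illustration.

$$
P \in \mathbb{R}^{p \times d}, \qquad E(X) \in \mathbb{R}^{n \times d}, \qquad [P; E(X)] \in \mathbb{R}^{(p+n) \times d}, \qquad \hat{Y} = f_{\theta}\big([P; E(X)]\big)
$$

During tuning only $P$ receives gradient updates, while $\theta$ and $E$ stay fixed.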

Read More :
<br>[Youtube : PEFT and Soft Prompt](https://www.youtube.com/watch?v=8uy_WII76L0)
<br>[Blog : What are soft prompts?](https://softwaremind.com/blog/how-and-why-soft-promps-are-slowly-replacing-text-prompts/)


### Requirements

In [None]:
%%capture
! pip install datasets transformers

### Imports

In [None]:
from tqdm.notebook import tqdm
from IPython import display

import numpy as np
import pandas as pd

from sklearn.metrics import accuracy_score

import torch
import torch.nn as nn

from datasets import load_dataset
from transformers import T5TokenizerFast, T5ForConditionalGeneration, DataCollatorForSeq2Seq

### Constants

### Base Model Selection
We will use `t5-small` as our base model from Hugging Face ([HF_Link](https://huggingface.co/t5-small)). For our tuning, we intend to utilize `10` soft prompt tokens ([HF_Link](https://huggingface.co/docs/peft/conceptual_guides/prompting), [Paper_Link](https://arxiv.org/abs/2104.08691)).


In [None]:
#####################################
###### DO NOT CHANGE THIS CELL ######
#####################################

BASE_MODEL_NAME = 't5-small'
N_SOFT_PROMPT_TOKENS = 10

BATCH_SIZE = 32
LEARNING_RATE = 0.1
EPOCHS = 10

DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Dataset

### Load dataset

`imdb` is a well-known NLP dataset for binary sentiment classification. Each row is labeled either `negative` or `positive` ([HF_Link](https://huggingface.co/datasets/imdb)).

In [None]:
dataset = load_dataset('imdb')
dataset.pop('unsupervised')
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})


### Define related functions

Because `T5` is a sequence-to-sequence model, we map the integer labels to label names before training and map the generated names back to ids when computing metrics.

The functions `id2label` and `label2id` are defined to do this.

In [None]:
def id2label(ids):
    label_names = ['negative', 'positive']
    return [label_names[id] for id in ids]

def label2id(labels):
    label_names_dict = {
        'negative': 0,
        'positive': 1
    }
    return [
        label_names_dict.get(label, 2)
        for label in labels
    ]
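
A small usage sketch may help show how the two helpers behave together. The functions are repeated in compact form so the snippet runs on its own, and the sample strings `'neutral'` and `''` are made-up stand-ins for text the model might generate that is not a valid label.

```python
# Compact copies of the notebook's helpers, repeated so this sketch is self-contained.
LABEL_NAMES = ['negative', 'positive']

def id2label(ids):
    return [LABEL_NAMES[i] for i in ids]

def label2id(labels):
    # Any string that is not a known label name maps to the extra id 2.
    return [LABEL_NAMES.index(name) if name in LABEL_NAMES else 2 for name in labels]

round_trip = label2id(id2label([0, 1, 1]))
with_unknown = label2id(['positive', 'neutral', ''])  # hypothetical generated strings

print(round_trip)    # [0, 1, 1]
print(with_unknown)  # [1, 2, 2]
```

Since the true labels are only ever 0 or 1, a prediction mapped to 2 can never match and is counted as an error by an accuracy metric.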

# Tokenizer

### Load tokenizer

In [None]:
tokenizer = T5TokenizerFast.from_pretrained(BASE_MODEL_NAME)

### Process dataset using tokenizer

In this step we get our dataset ready for training.

We preprocess and tokenize our `text` and `label`.

For easier prompt tuning, we insert placeholders by prepending multiple `pad_token`s to the input. The number of pad tokens equals `n_soft_prompt_tokens`.

<font color='#73FF73'><b>You have to complete</b></font> `preprend_padding_token` <font color='#73FF73'><b>function.</b></font>

Replace `None` with your code.

In [None]:
def preprocess_input(text):
    text = text.lower()
    text = text.replace('<br />', ' ')
    return text

def preprend_padding_token(text):
    n_soft_prompt_tokens = N_SOFT_PROMPT_TOKENS
    pad_token = tokenizer.pad_token

    ######### Your code begins #########
    prefix = pad_token * n_soft_prompt_tokens
    ######### Your code ends ###########

    return prefix + text

def map_function(row):
    processed_input = [
        preprend_padding_token(preprocess_input(text))
        for text in row['text']
    ]
    input_info = tokenizer(processed_input, truncation=True, max_length=256)
    output_info = tokenizer(id2label(row['label']))
    return {
        **input_info,
        'labels': output_info.input_ids
    }


dataset = dataset.map(map_function, batched=True)
dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
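
To see what the placeholder prefix looks like before tokenization, here is a minimal string-only sketch. It assumes the pad token string is `"<pad>"` (in the notebook it is read from `tokenizer.pad_token`), and `prepend_placeholders` is a hypothetical name used only for this illustration.

```python
# Assumption: the pad token string is "<pad>"; the notebook reads it from tokenizer.pad_token.
PAD_TOKEN = "<pad>"
N_SOFT_PROMPT_TOKENS = 10

def prepend_placeholders(text, n_tokens=N_SOFT_PROMPT_TOKENS, pad_token=PAD_TOKEN):
    # String repetition builds n_tokens consecutive placeholders in front of the text.
    return pad_token * n_tokens + text

example = prepend_placeholders("a great movie", n_tokens=3)
print(example)  # <pad><pad><pad>a great movie
```

Note that the placeholders count toward `max_length=256` in `map_function`, so fewer positions remain for the review text itself.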

# Model

### Load model

In [None]:
model = T5ForConditionalGeneration.from_pretrained(BASE_MODEL_NAME)

### Define prompt related layers

In this part we define our prompt layer, `SimplePrompts`. It is a simple layer that only returns its prompt matrix when called.

`EmbeddingWrapper` is a layer that replaces the model's original embedding layer and serves as our injection point into the model architecture.

We include `sharif_llm` in our PEFT module's name so that we can keep it unfrozen during training.

<font color='#73FF73'><b>You have to complete</b></font> `prompts_joiner` <font color='#73FF73'><b>function.</b></font>

In this function the prompts are concatenated to the model's input embeddings. However, in `preprend_padding_token` we already inserted placeholders for the prompts, so we only need to replace them with the real prompts.

First, repeat `prompts` along the batch dimension (`batch_size` times), then remove the placeholder embeddings from `input_embedding` to obtain `non_place_holders`.

In [None]:
class SimplePrompts(nn.Module):
    def __init__(self, inital_values: torch.Tensor):
        super().__init__()
        self.n_tokens = inital_values.size(0)
        self.emb_dim = inital_values.size(1)
        self.prompt_emb = nn.parameter.Parameter(
            inital_values.detach().clone()
        )

    def forward(self):
        return self.prompt_emb

def prompts_joiner(prompts, input_embedding):
    # prompts.shape         = (n_tokens, emb_dim)
    # input_embedding.shape = (batch_size, n_tokens + seq_len, emb_dim)

    n_tokens = prompts.size(0)
    batch_size = input_embedding.size(0)
    ######### Your code begins #########
    prompts_batched = prompts.unsqueeze(0).repeat(batch_size, 1, 1)
    # prompts_batched = prompts.unsqueeze(0).expand(batch_size, -1, -1)
    non_place_holders = input_embedding[:,n_tokens:,:]
    ######### Your code ends ###########

    assert prompts_batched.shape == (batch_size, *prompts.shape)
    assert non_place_holders.shape[1] + n_tokens == input_embedding.shape[1]

    return torch.cat([prompts_batched, non_place_holders], dim=1)

class EmbeddingWrapper(nn.Module):
    def __init__(
        self,
        emb_layer: nn.Embedding,
        n_tokens: int,
        **kwargs
    ):
        super().__init__()
        self.emb_layer = emb_layer

        prompt_inital_values = self.emb_layer.weight[:n_tokens]

        self.sharif_llm_soft_prompts = SimplePrompts(inital_values=prompt_inital_values)

    def forward(self, tokens):
        prompts = self.sharif_llm_soft_prompts()
        input_embedding = self.emb_layer(tokens)
        return prompts_joiner(prompts, input_embedding)
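
To make the indexing in `prompts_joiner` concrete without needing PyTorch, the sketch below applies the same replace-the-placeholders logic to nested Python lists. `join_prompts` and the toy values are hypothetical; it only illustrates the shapes and does not model gradients.

```python
def join_prompts(prompts, input_embedding):
    # prompts:         n_tokens rows, each a list of emb_dim floats
    # input_embedding: batch of sequences whose first n_tokens rows are placeholders
    n_tokens = len(prompts)
    joined = []
    for sequence in input_embedding:
        real_tokens = sequence[n_tokens:]  # drop the placeholder rows
        # The same prompt rows are placed in front of every example in the batch.
        joined.append([list(row) for row in prompts] + real_tokens)
    return joined

prompts = [[1.0, 1.0], [2.0, 2.0]]            # n_tokens=2, emb_dim=2
batch = [
    [[0.0, 0.0], [0.0, 0.0], [5.0, 5.0]],     # 2 placeholders + 1 real token
    [[0.0, 0.0], [0.0, 0.0], [7.0, 7.0]],
]
out = join_prompts(prompts, batch)
print(out[0])  # [[1.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
```

The sequence length is unchanged, which is why the attention mask built by the tokenizer stays valid after the swap.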

### Replace encoder's embedding layer with our layer

<font color='#73FF73'><b>You have to complete</b></font> `mutate_model` <font color='#73FF73'><b>function.</b></font>

In this part we want to replace the <b>model's encoder embedding layer</b> with our wrapper.

You must use `get_encoder` and `get_input_embeddings` to get the model's embedding layer, and use `EmbeddingWrapper` to create the new embedding layer.

In [None]:
def mutate_model(model, n_tokens):
    if hasattr(model, '_mutated'):
        print("Model already contains Soft Prompt layers! \n Try reloading the model.")
        return
    ######### Your code begins #########
    encoder = model.get_encoder()
    embedding_layer = encoder.get_input_embeddings()
    new_embedding_layer = EmbeddingWrapper(embedding_layer, n_tokens)
    ######### Your code ends ###########
    encoder.set_input_embeddings(new_embedding_layer)

    model._mutated = True

mutate_model(model, n_tokens=N_SOFT_PROMPT_TOKENS)

### Freeze all of the model's weights except our PEFT module

In this part we freeze the entire model except `encoder.embed_tokens.sharif_llm_soft_prompts.prompt_emb`.

In [None]:
def freeze_non_pefts(model, peft_key):
    print('Non freezed weights:')
    for param_name, weights in model.named_parameters():
        weights.requires_grad = peft_key in param_name
        if weights.requires_grad:
            print(param_name)

freeze_non_pefts(model, peft_key='sharif_llm')

Non freezed weights:
encoder.embed_tokens.sharif_llm_soft_prompts.prompt_emb


# Train and evaluate

### Define dataloaders

In [None]:
col_fn = DataCollatorForSeq2Seq(
    tokenizer, return_tensors='pt', padding='longest',
)

train_loader = torch.utils.data.DataLoader(
    dataset['train'],
    batch_size=BATCH_SIZE,
    collate_fn=col_fn,
    shuffle=True
)

test_loader = torch.utils.data.DataLoader(
    dataset['test'],
    batch_size=BATCH_SIZE,
    collate_fn=col_fn
)

### Train functions

In [None]:
def train_loop(model, loader, optimizer):
    model.train()

    batch_losses = []

    for row in tqdm(loader, desc='Training:'):
        optimizer.zero_grad()

        out = model(**row.to(model.device))
        loss = out.loss

        batch_loss_value = loss.item()
        loss.backward()
        optimizer.step()

        batch_losses.append(batch_loss_value)

    loss_value = np.mean(batch_losses)
    return {'train_loss': loss_value}

def _predict(model, row):
    return model.generate(
        input_ids=row.input_ids,
        attention_mask=row.attention_mask,
        max_length=5
    )

def tokenizer_ids_to_label(all_input_ids):
    return tokenizer.batch_decode(all_input_ids, skip_special_tokens=True)

def valid_loop(model, loader, compute_metrics):
    model.eval()

    all_true = []
    all_pred = []

    with torch.no_grad():
        for row in tqdm(loader, desc='Validating:'):
            row.to(model.device)
            pred = _predict(model, row)

            all_true += row.labels.detach().cpu().tolist()
            all_pred += pred.detach().cpu().tolist()

    all_true = label2id(tokenizer_ids_to_label(all_true))
    all_pred = label2id(tokenizer_ids_to_label(all_pred))

    return {'valid_acc': compute_metrics(y_true=all_true, y_pred=all_pred)}

### Define our optimizer and metric function

In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
compute_metrics = accuracy_score

In [None]:
model.to(DEVICE)

all_results = []
for epoch in range(EPOCHS):
    epoch_results = {'epoch': epoch}

    epoch_results.update(
        train_loop(
            model=model,
            loader=train_loader,
            optimizer=optimizer,
        )
    )

    epoch_results.update(
        valid_loop(
            model=model,
            loader=test_loader,
            compute_metrics=compute_metrics,
        )
    )
    all_results.append(epoch_results)

    display.clear_output()
    display.display(pd.DataFrame(all_results).set_index('epoch'))

| epoch | train_loss | valid_acc |
|---|---|---|
| 0 | 1.428666 | 0.84648 |
| 1 | 0.19801 | 0.85504 |
| 2 | 0.18891 | 0.86288 |
| 3 | 0.183999 | 0.86312 |
| 4 | 0.182111 | 0.867 |
| 5 | 0.183934 | 0.8638 |
| 6 | 0.185219 | 0.86752 |
| 7 | 0.182389 | 0.86812 |
| 8 | 0.180623 | 0.86164 |
| 9 | 0.180916 | 0.85956 |


### Best Performance and number of parameters

You must report these numbers in your final report.

In [None]:
best_score = pd.DataFrame(all_results)['valid_acc'].max() * 100
total_params = sum(p.numel() for p in model.parameters())
print(f"Number of parameters: {total_params}")
print('Best model performance is: %.1f%%' % best_score)

Number of parameters: 60511744
Best model performance is: 86.8%


### Save PEFT file

We expect you to <font color='#FF7373'>upload this file </font> with the rest of your files.

In [None]:
peft_dict = {
    key: val
    for (key, val) in model.state_dict().items()
    if 'sharif_llm' in key
}
torch.save(peft_dict, 'prompts.pt')

# Use external library

In [None]:
%%capture
!pip install git+https://github.com/thunlp/OpenDelta.git

Use the `OpenDelta` library to do the same thing ([link](https://opendelta.readthedocs.io/en/latest/modules/deltas.html)).

For hyperparameters, test with `N_SOFT_PROMPT_TOKENS=1` and `N_SOFT_PROMPT_TOKENS=10` and include both results in your report.

In [None]:
model = T5ForConditionalGeneration.from_pretrained(BASE_MODEL_NAME)

######### Your code begins ###################
# Note: `transformers.deepspeed` has to be replaced
# with `transformers.integrations.deepspeed`.
from opendelta import SoftPromptModel

spm_10 = SoftPromptModel(model, soft_token_num=10)
spm_10.freeze_module(exclude=["deltas"], set_state_dict=True)
spm_10.log()

[INFO|(OpenDelta)basemodel:698]2024-12-14 14:56:19,210 >> Trainable Ratio: 5120/60511744=0.008461%
[INFO|(OpenDelta)basemodel:700]2024-12-14 14:56:19,218 >> Delta Parameter Ratio: 5120/60511744=0.008461%
[INFO|(OpenDelta)basemodel:702]2024-12-14 14:56:19,225 >> Static Memory 3.09 GB, Max Memory 3.31 GB
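
The ratio in the log above can be reproduced by hand, and the same arithmetic gives a rough estimate for the `N_SOFT_PROMPT_TOKENS=1` run. This is a sketch under one assumption: each soft token contributes the same number of trainable values, inferred from the logged 5120 parameters for 10 tokens. `prompt_stats` is a hypothetical helper, not part of OpenDelta.

```python
TRAINABLE_10 = 5120      # trainable parameters logged for 10 soft tokens
TOTAL_10 = 60511744      # total parameters from the same log line

emb_dim = TRAINABLE_10 // 10            # values per soft token (512)
base_params = TOTAL_10 - TRAINABLE_10   # parameters of the frozen backbone

def prompt_stats(n_tokens):
    # Assumption: every soft token adds exactly emb_dim trainable values.
    trainable = n_tokens * emb_dim
    total = base_params + trainable
    return trainable, total, 100 * trainable / total

for n in (1, 10):
    trainable, total, pct = prompt_stats(n)
    print(f"{n:>2} tokens: {trainable}/{total} = {pct:.6f}%")
```

For 10 tokens this matches the logged `5120/60511744=0.008461%`; the 1-token figure is only an estimate until confirmed by running `log()` on that configuration.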


In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
compute_metrics = accuracy_score
model.to(DEVICE)

all_results = []
for epoch in range(EPOCHS):
    epoch_results = {'epoch': epoch}

    epoch_results.update(
        train_loop(
            model=model,
            loader=train_loader,
            optimizer=optimizer,
        )
    )

    epoch_results.update(
        valid_loop(
            model=model,
            loader=test_loader,
            compute_metrics=compute_metrics,
        )
    )
    all_results.append(epoch_results)

    display.clear_output()
    display.display(pd.DataFrame(all_results).set_index('epoch'))

| epoch | train_loss | valid_acc |
|---|---|---|
| 0 | 2.144724 | 0.85064 |
| 1 | 0.207812 | 0.85624 |
| 2 | 0.195068 | 0.8576 |
| 3 | 0.189593 | 0.85916 |
| 4 | 0.189837 | 0.85944 |
| 5 | 0.187426 | 0.86676 |
| 6 | 0.19467 | 0.85244 |
| 7 | 0.193339 | 0.8552 |
| 8 | 0.188324 | 0.8598 |
| 9 | 0.187415 | 0.85952 |


In [None]:
best_score = pd.DataFrame(all_results)['valid_acc'].max() * 100
tuned_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Number of tuned parameters: {tuned_params}/{total_params} ({tuned_params/total_params*100:.2f}%)")
print('Best model performance is: %.1f%%' % best_score)

Number of tuned parameters: 5120/60511744 (0.01%)
Best model performance is: 86.7%
