# Fine Tuning

NLP - Spring Semester of 2024 at University of Tehran - CA4

In [None]:
!pip install datasets
!pip install -U accelerate
!pip install -U transformers
!pip install tokenizers
!pip install torch
!pip install evaluate
!pip install peft
!pip install bitsandbytes

In [None]:
import datasets as hf_datasets
import evaluate

import transformers
from transformers import RobertaTokenizer
from transformers import RobertaForSequenceClassification
from transformers import RobertaConfig
from transformers import LlamaTokenizer
from transformers import LlamaForSequenceClassification
from transformers import Trainer, TrainingArguments
from transformers import BitsAndBytesConfig

from peft import get_peft_model
from peft import LoraConfig, PromptEncoderConfig, TaskType
from peft import prepare_model_for_kbit_training

import torch

import random
import time

import numpy as np

## Q1

### Dataset

For this project we'll use the MultiNLI dataset (`nyu-mll/multi_nli`), loaded with the Hugging Face `datasets` library. Here's a summary of this dataset from the [Hugging Face website](https://huggingface.co/datasets/nyu-mll/multi_nli):

>The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation.

In [None]:
DATASET_DUMP_PATH = './dump'

try:
    print("Loading dataset from disk.")
    dataset = hf_datasets.DatasetDict.load_from_disk(DATASET_DUMP_PATH)
except FileNotFoundError:
    print("Dump file not found. Fallback to remote option.")
    dataset = hf_datasets.load_dataset("nyu-mll/multi_nli")
    dataset.save_to_disk(DATASET_DUMP_PATH)

Each split of `multi_nli` has several columns:

- `premise` and `hypothesis`: The two sentences of a pair. The task is to decide how the hypothesis relates to the premise.
- `premise_parse` and `hypothesis_parse`: Each sentence as parsed by the Stanford PCFG Parser 3.5.2.
- `premise_binary_parse` and `hypothesis_binary_parse`: Each sentence parsed in unlabeled binary-branching format.
- `genre`: A string feature taking one of 5 values: {telephone, government, travel, fiction, slate}.
- `label`: A classification label: 0 means entailment, 1 means neutral, and 2 means contradiction.

In [None]:
print(f"A row with entailing sentences: {random.choice(dataset['train'].filter(lambda x: x['label'] == 0))}")

### Part 1. Fine Tuning

#### Explain general fine-tuning method. Explain LoRA briefly.

There are several ways to fine-tune a model. Traditionally, there are two:<br>
The first approach is to freeze most of the model and re-train only a subset of it, typically the final layers. This preserves what the model learned during pre-training, usually from a huge amount of data, while adapting the trainable layers to the specific task we care about. It is cost-effective because training that subset needs far less computation and data.<br>
The second approach is to re-train the whole model on the newer, domain-specific data. This requires more data, more computation, and more time. According to [this article](https://www.turing.com/resources/finetuning-large-language-models), this approach is particularly beneficial when the task-specific dataset is large and significantly different from the pre-training data.

#### How Does LoRA work?

To answer this question we'll refer to [this explanatory article](https://medium.com/@AIBites/lora-low-rank-adaptation-of-llms-paper-explained-5ae866871c8a). LoRA makes fine-tuning faster and more memory-efficient by using rank decomposition. The rank of a matrix is the number of its rows (or columns) that are linearly independent of each other. If two rows are linearly dependent, one of them can be written as the other times a multiplier, so it carries no new information. Rank decomposition exploits this: a low-rank matrix can be written as the product of two much smaller matrices. The LoRA paper builds on earlier evidence that pre-trained models have a low intrinsic dimension, and hypothesizes that the weight updates learned during fine-tuning also have a low intrinsic rank. So instead of updating a full weight matrix, LoRA freezes it and learns the update as the product of two small matrices of rank `r`. Training is more efficient because only these small matrices receive gradients.
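
To make the decomposition concrete, here is a minimal NumPy sketch of a LoRA-style layer. It is not the `peft` implementation: the matrix shape (`d`, `k`) and the random values are illustrative assumptions, while `r` and `alpha` mirror the `LORA_R` and `LORA_ALPHA` values used later in this notebook. Following the LoRA paper, `B` starts at zero so the adapted layer initially behaves exactly like the frozen one.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 768, 768      # shape of one frozen weight matrix (illustrative assumption)
r, alpha = 8, 32     # LoRA rank and scaling factor

W = rng.normal(size=(d, k))               # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, k))   # trainable, random init
B = np.zeros((d, r))                      # trainable, zero init so the update starts at 0

def lora_forward(x):
    """Compute x @ (W + (alpha / r) * B @ A).T without building the full update."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(4, k))
full_params = W.size             # parameters updated by full fine-tuning
lora_params = A.size + B.size    # parameters updated by LoRA
print(full_params, lora_params)  # 589824 12288
```

For this single matrix, LoRA trains about 2% of the parameters that full fine-tuning would update.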

#### Explain prompt tuning and the difference between hard and soft tuning.

Instead of re-training the LLM's weights, we can give it a prompt crafted for our target task to guide the model. This means adding extra input tokens to the original user-given input so that the model can better understand what the task is. There are two ways of doing so: hard and soft.

- **Hard Prompting**: In this approach we manually craft a template and place the inputs in it. The template consists of several tokens (usually written in natural language) that guide the model toward our task. Finding the best tokens by hand can be difficult.

- **Soft Prompting**: With this method, we freeze the model and train only the embeddings of a few newly added virtual tokens. These vectors are prepended to the input and, unlike a hard prompt, need not correspond to any real words. They are optimized so that the resulting prompt guides the model through our task.
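
The soft prompting idea can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not what `peft` does internally: the sizes, the random embedding table, and the helper `build_inputs` are all hypothetical. The point is only that the virtual-token vectors are prepended to the embedded input, and that they are the only tensor that would be trained.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, hidden = 100, 16     # toy sizes, chosen only for illustration
num_virtual_tokens = 4

embedding_table = rng.normal(size=(vocab_size, hidden))      # frozen model embeddings
soft_prompt = rng.normal(size=(num_virtual_tokens, hidden))  # the only trainable tensor

def build_inputs(token_ids):
    """Prepend the learned virtual-token vectors to the embedded input tokens."""
    token_embeddings = embedding_table[token_ids]             # (seq_len, hidden)
    return np.concatenate([soft_prompt, token_embeddings], axis=0)

inputs = build_inputs(np.array([5, 17, 42]))
print(inputs.shape)  # (7, 16)
```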

### Part 2. Training the models

#### Training the Model from Scratch

In [None]:
BATCH_SIZE = 64
LEARNING_RATE = 2e-4
EPOCH_COUNT = 10

MODEL_NAME = 'roberta-base'

For this part we are going to use the RoBERTa model from the Hugging Face `transformers` library. Let's first create our tokenizer.

In [None]:
tokenizer = RobertaTokenizer.from_pretrained(MODEL_NAME)

test_text_to_tokenize = random.choice(dataset['train']['premise'])
print(f"Original text: {test_text_to_tokenize}, \nTokenized: {tokenizer.tokenize(test_text_to_tokenize)}")

For the next step we need a preprocessing function that tokenizes each `premise` and `hypothesis` pair together as a single input sequence.

In [None]:
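def preprocess_dataset(dataset: hf_datasets.Dataset, tokenizer: transformers.PreTrainedTokenizerBase):
    # Passing two text columns makes the tokenizer encode each premise/hypothesis pair as one sequence.
    return tokenizer(dataset['premise'], dataset['hypothesis'], truncation=True)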
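
As a rough illustration of what pair encoding produces, the sketch below lays out the special tokens RoBERTa places around a sentence pair. The helper `roberta_pair_layout` is hypothetical, and the whitespace-split words stand in for the real byte-level BPE tokens the tokenizer would produce.

```python
def roberta_pair_layout(premise_tokens, hypothesis_tokens):
    """Return the special-token layout RoBERTa uses for a sentence pair."""
    return ["<s>"] + premise_tokens + ["</s>", "</s>"] + hypothesis_tokens + ["</s>"]

layout = roberta_pair_layout(["A", "man", "sleeps"], ["He", "is", "awake"])
print(" ".join(layout))
```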

Apply it to the dataset.

In [None]:
encoded_dataset = dataset.map(lambda x: preprocess_dataset(x, tokenizer), batched=True)
print(encoded_dataset)

We will use the `RobertaForSequenceClassification` model and train it from scratch. NLI is a classification task: we give the two sentences to the model and ask it to predict the label. Note that the preprocessing step above encodes the premise and hypothesis together as a single input to the model.

In [None]:
model_config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=514,  # RoBERTa position ids start at padding_idx + 1, so 512 tokens need 514 slots
    num_attention_heads=12,
    num_hidden_layers=6,
    num_labels=3
)

model = RobertaForSequenceClassification(config=model_config)
sum(p.numel() for p in model.parameters() if p.requires_grad)

The model will have 12 attention heads and 6 hidden layers. Note that the original `roberta-base` has 12 hidden layers. We'll use the `Trainer` from Hugging Face, which is configured through `TrainingArguments`.

In [None]:
training_args = TrainingArguments(
    'RoBerta-trained-from-start',
    eval_strategy='epoch',
    save_strategy='no',
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCH_COUNT,
    weight_decay=0.01,
    metric_for_best_model="accuracy"
)

In [None]:
metric = evaluate.load('accuracy')
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
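
To see what `compute_metrics` does with the model output, here is a small NumPy sketch on made-up values. The logits and labels are hypothetical; the steps (argmax over the class axis, then the fraction of matches) are the same ones the function above delegates to `evaluate`.

```python
import numpy as np

# Hypothetical logits for 4 examples and 3 classes.
toy_logits = np.array([
    [2.0, 0.1, -1.0],
    [0.2, 0.3, 1.5],
    [0.0, 1.2, 0.4],
    [1.1, 0.9, 0.2],
])
toy_labels = np.array([0, 2, 1, 1])

toy_predictions = np.argmax(toy_logits, axis=1)               # highest-scoring class per row
toy_accuracy = float((toy_predictions == toy_labels).mean())  # fraction of correct predictions
print(toy_predictions, toy_accuracy)  # [0 2 1 0] 0.75
```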

Before training, due to limited computation resources, we'll reduce the dataset to 10% of its original size.

In [None]:
num_train_samples = int(len(encoded_dataset['train']) * 0.1)
num_val_samples = int(len(encoded_dataset['validation_matched']) * 0.1)
small_train_dataset = encoded_dataset['train'].shuffle(seed=42).select(range(num_train_samples))
small_val_dataset = encoded_dataset['validation_matched'].shuffle(seed=42).select(range(num_val_samples))

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
start_time = time.time()

trainer.train()

print(f'Elapsed time: {time.time() - start_time}')

In [None]:
del model

The training log above reports loss and accuracy at the end of each epoch. As configured by the constants defined earlier, the model is trained for 10 epochs with a learning rate of 2e-4.

#### Using LoRA Fine Tuning to Train the Model

In [None]:
LORA_ALPHA = 32
LORA_R = 8

In [None]:
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    lora_alpha=LORA_ALPHA,
    r=LORA_R
)

In [None]:
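# LoRA adapts a pre-trained model, so load the pre-trained weights instead of a random initialization.
model = RobertaForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)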
model = get_peft_model(model, peft_config)
sum(p.numel() for p in model.parameters() if p.requires_grad)

In [None]:
training_args = TrainingArguments(
    'RoBerta-LoRA-Fine-Tuned',
    eval_strategy='epoch',
    save_strategy='no',
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCH_COUNT,
    weight_decay=0.01,
    metric_for_best_model="accuracy"
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
start_time = time.time()

trainer.train()

print(f'Elapsed time: {time.time() - start_time}')

In [None]:
del model

#### Using Prompt Tuning to Train the Model

In [None]:
peft_config = PromptEncoderConfig(
    task_type=TaskType.SEQ_CLS,
    num_virtual_tokens=30,
    encoder_hidden_size=128
)

In [None]:
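# The frozen backbone must be pre-trained for the learned prompt to be useful.
model = RobertaForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)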
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

In [None]:
training_args = TrainingArguments(
    'RoBerta-P-Tuning-Fine-Tuned',
    eval_strategy='epoch',
    save_strategy='no',
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCH_COUNT,
    weight_decay=0.01,
    metric_for_best_model="accuracy"
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
start_time = time.time()

trainer.train()

print(f'Elapsed time: {time.time() - start_time}')

In [None]:
del model

### Part 3. Why LoRA?

## Q2

### Part 1. In Context Learning

In [None]:
LLAMA_MODEL_NAME = "meta-llama/Meta-Llama-3-8B"

#### Zero Shot Prompting on Llama 3 8B

We'll write a system prompt that tells the model what the task is and how it should respond.

In [None]:
messages = [
    {"role": "system", "content": "You should get two sentences in every prompt. You will then say 'entailment' if two sentences entail each other, 'neutral' if they might or might not entail each other, and 'contradict' if they don't entail each other."}
]

Gather some examples from the dataset and store their labels to evaluate the performance.

In [None]:
examples = dataset['validation_matched'].shuffle(seed=42).select(range(10))
labels = []

for premise, hypothesis, label in zip(examples['premise'], examples['hypothesis'], examples['label']):
    messages.append(
        {"role": "user", "content": premise + " " + hypothesis}
    )
    labels.append(label)

Now let's create our pipeline.

In [None]:
llama_zero_shot_pipeline = transformers.pipeline(
    "text-generation",
    model=LLAMA_MODEL_NAME,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda"
)

The prompt must follow the chat format that Llama 3 expects. We'll use the pipeline's tokenizer to build it. Here is the format specified in the instructions [here](https://huggingface.co/blog/llama3#how-to-prompt-llama-3):

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_msg_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ model_answer_1 }}<|eot_id|>
```

In [None]:
prompt = llama_zero_shot_pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

In [None]:
terminators = [
    llama_zero_shot_pipeline.tokenizer.eos_token_id,
    llama_zero_shot_pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = llama_zero_shot_pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

print(outputs[0]["generated_text"][len(prompt):])
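
The gold `labels` collected earlier can be compared against the generated text once each answer is mapped back to a label id. The sketch below is one possible way to do that, assuming one generated answer per example. `LABEL_NAMES`, `parse_label`, and `accuracy` are hypothetical helpers, the label words follow the system prompt, and the toy generations are made up rather than real model output.

```python
# Label words from the system prompt, mapped to the dataset's label ids.
LABEL_NAMES = {"entailment": 0, "neutral": 1, "contradict": 2}

def parse_label(generated_text):
    """Return the id of the label word that appears first in the text, or None."""
    text = generated_text.lower()
    positions = {name: text.find(name) for name in LABEL_NAMES}
    found = {name: pos for name, pos in positions.items() if pos != -1}
    if not found:
        return None
    return LABEL_NAMES[min(found, key=found.get)]

def accuracy(generations, gold_labels):
    """Fraction of generations whose parsed label matches the gold label."""
    correct = sum(parse_label(g) == y for g, y in zip(generations, gold_labels))
    return correct / len(gold_labels)

toy_generations = ["Entailment", "These sentences contradict each other.", "I am not sure."]
print(accuracy(toy_generations, [0, 2, 1]))
```

An answer that contains none of the label words is counted as wrong, which keeps the score conservative when the model rambles.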

#### Few Shot Prompting on Llama 3 8B

We'll reuse the previous part's pipeline with a small modification. Instead of asking the model about one example per prompt with no demonstrations, we'll include 5 labeled examples in each prompt: 2 from class 0, 1 from class 1, and 2 from class 2. We'll also describe the new template in the system prompt.

In [None]:
messages = [
    {"role": "system", "content": "You should get 5 examples per each input. Each example contains two sentences followed by their label, indicating if the two sentences entail each other, they are neutral, or they contradict each other. After the 5 sentences there will be another example without a label. Predict the last example's label. Also note that 0 means 'entailment', 1 means 'neutral', and 2 means 'contradiction'"}
]

In [None]:
few_shot_0_examples = dataset['train'].filter(lambda x : x['label'] == 0).shuffle(seed=42).select(range(20))
few_shot_1_examples = dataset['train'].filter(lambda x : x['label'] == 1).shuffle(seed=42).select(range(10))
few_shot_2_examples = dataset['train'].filter(lambda x : x['label'] == 2).shuffle(seed=42).select(range(20))
few_shot_eval_examples = dataset['validation_matched'].shuffle(seed=42).select(range(10))

for i in range(10):
    current_prompt = []

    current_prompt.extend(
        [
            f"{few_shot_0_examples['premise'][i*2]} {few_shot_0_examples['hypothesis'][i*2]} label={0}\n",
            f"{few_shot_0_examples['premise'][i*2 + 1]} {few_shot_0_examples['hypothesis'][i*2 + 1]} label={0}\n"
        ]
    )

    current_prompt.append(
        f"{few_shot_1_examples['premise'][i]} {few_shot_1_examples['hypothesis'][i]} label={1}\n",
    )

    current_prompt.extend(
        [
            f"{few_shot_2_examples['premise'][i*2]} {few_shot_2_examples['hypothesis'][i*2]} label={2}\n",
            f"{few_shot_2_examples['premise'][i*2 + 1]} {few_shot_2_examples['hypothesis'][i*2 + 1]} label={2}\n"
        ]
    )

    current_prompt.append(f"{few_shot_eval_examples['premise'][i]} {few_shot_eval_examples['hypothesis'][i]}")

    messages.append(
        {"role": "user", "content": ''.join(current_prompt)}
    )


In [None]:
prompt = llama_zero_shot_pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

In [None]:
outputs = llama_zero_shot_pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

print(outputs[0]["generated_text"][len(prompt):])

### Part 2. Fine Tuning the Model using QLoRA

First we'll load Llama's sequence classification model and fine-tune it with QLoRA.

In [None]:
EPOCH_COUNT = 10
LEARNING_RATE = 2e-4

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
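
A back-of-the-envelope calculation shows why 4-bit loading matters here. This is a rough sketch under stated assumptions: it counts only the storage for the weights, treats the model as having about 8 billion parameters, and ignores quantization constants, activations, LoRA adapters, and optimizer state. The helper `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

num_params = 8e9  # rough parameter count of an 8B model (assumption)
print(f"bf16 : {weight_memory_gib(num_params, 16):.1f} GiB")  # bf16 : 14.9 GiB
print(f"4-bit: {weight_memory_gib(num_params, 4):.1f} GiB")   # 4-bit: 3.7 GiB
```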

In [None]:
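tokenizer = transformers.AutoTokenizer.from_pretrained(LLAMA_MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a padding token
model = LlamaForSequenceClassification.from_pretrained(LLAMA_MODEL_NAME, quantization_config=bnb_config, num_labels=3)
model.config.pad_token_id = tokenizer.pad_token_id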

In [None]:
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    task_type=TaskType.SEQ_CLS
)

model = get_peft_model(model, config)

In [None]:
training_args = TrainingArguments(
    'Llama-QLoRA',
    per_device_train_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCH_COUNT,
    gradient_accumulation_steps=4,
    learning_rate=LEARNING_RATE,
    bf16=True,  # match bnb_4bit_compute_dtype
    logging_steps=1,
    save_strategy="no",
    optim="paged_adamw_8bit"
)

In [None]:
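# The earlier subsets hold RoBERTa token ids, so re-tokenize them with the Llama tokenizer.
llama_train_dataset = small_train_dataset.map(lambda x: preprocess_dataset(x, tokenizer), batched=True)
llama_val_dataset = small_val_dataset.map(lambda x: preprocess_dataset(x, tokenizer), batched=True)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=llama_train_dataset,
    eval_dataset=llama_val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()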

In [None]:
# Evaluate while the trainer still holds the adapter-wrapped model, then merge the adapter.
trainer.evaluate()
model = model.merge_and_unload()