# Assignment 4: Instruction finetuning a Llama-2 7B model - part 2
**Assignment due 19 April 11:59pm**

Welcome to the fourth assignment for 50.055 Machine Learning Operations. These assignments give you a chance to practice the methods and tools you have learned. 

**This assignment is a group assignment.**

- Read the instructions in this notebook carefully
- Add your solution code and answers in the appropriate places. The questions are marked as **QUESTION:**, the places where you need to add your code and text answers are marked as **ADD YOUR SOLUTION HERE**
- The completed notebook, including your added code and generated output will be your submission for the assignment.
- The notebook should execute without errors from start to finish when you select "Restart Kernel and Run All Cells...". Please test this before submission.
- Use the SUTD Education Cluster to solve and test the assignment.

**Rubric for assessment** 

Your submission will be graded using the following criteria. 
1. Code executes: your code should execute without errors. The SUTD Education cluster should be used to ensure the same execution environment.
2. Correctness: the code should produce the correct result or the text answer should state the factually correct answer.
3. Style: your code should be written in a way that is clean and efficient. Your text answers should be relevant, concise and easy to understand.
4. Partial marks will be awarded for partially correct solutions.
5. There is a maximum of 200 (80 + 40 + 80) points for this assignment.

**ChatGPT policy** 

If you use AI tools, such as ChatGPT, to solve the assignment questions, you need to be transparent about their use and mark AI-generated content as such. In particular, you should include the following in addition to your final answer:
- A copy or screenshot of the prompt you used
- The name of the AI model
- The AI generated output
- An explanation of why the answer is correct or what you had to change to arrive at the correct answer

**Assignment Notes:** Please make sure to save the notebook as you go along. Submission Instructions are located at the bottom of the notebook.




### Step 2: finetune model
The second step of the assignment is to finetune an LLM using the synthetic question-answer pairs from step 1. 
We will use LoRA and Huggingface. Getting finetuning to work on small GPUs can be tricky. You can take the code samples from the labs as a starting point. 
There are also many blogs and guides on the internet that you can consult. 

In [None]:
# Installing required packages
!pip install -U -q peft==0.6.2 transformers==4.35.2 datasets==2.15.0 bitsandbytes==0.41.2.post2 trl==0.7.4 accelerate==0.24.1 scipy==1.12.0 wandb==0.16.5 coloredlogs==15.0.1

In [None]:
# Load required packages

from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, pipeline
from datasets import load_dataset, Dataset
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
import torch

import pickle


In [None]:
# load SUTD QA dataset from step 1
with open('sutd_qa_dataset.pkl', 'rb') as f:
    sutd_qa_dataset = pickle.load(f)

In [None]:
# split data into training and test set, 160 instances for train, rest for test
sutd_qa_dataset = sutd_qa_dataset.train_test_split(train_size=160, shuffle=False)
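
With `shuffle=False`, the split keeps the original order: the first 160 instances go to the training set and the remainder to the test set. The sketch below mimics that behaviour on a plain Python list. The helper `split_without_shuffle` and the 200 dummy records are hypothetical and only illustrate the idea; the real dataset size depends on your step 1 output.

```python
# Hypothetical stand-in for the QA dataset: 200 dummy records.
records = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(200)]


def split_without_shuffle(rows, train_size):
    """Return (train, test): the first `train_size` rows and the rest, in order."""
    if not 0 < train_size < len(rows):
        raise ValueError("train_size must be between 1 and len(rows) - 1")
    return rows[:train_size], rows[train_size:]


train_rows, test_rows = split_without_shuffle(records, train_size=160)
print(len(train_rows), len(test_rows))
```

Because nothing is shuffled, any ordering in the generated data (for example, questions grouped by topic) carries over into the two splits.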

In [None]:
# check schema and number of instances
sutd_qa_dataset

In [None]:
# inspect first instance
sutd_qa_dataset["train"][0]

In [None]:
# QUESTION: create a formatting function 'formatting_func' which takes an example from your QA dataset as input and outputs 
# a dictionary with the key "text" and as value a text prompt with the following format:
# ### USER: {question from example goes here}
# ### ASSISTANT: {answer from example goes here}


#--- ADD YOUR SOLUTION HERE (10 points)---


#----------------------------------------


In [None]:
# apply formatting function to data set
formatted_dataset = sutd_qa_dataset.map(formatting_func)

In [None]:
# check formatted prompt
formatted_dataset["train"]["text"][0]

# Note: you should see something like this (not necessarily the same prompt but the same format)
# '### USER: What are some of the best places to eat near the SUTD campus?\n### ASSISTANT: There are several great dining options near the SUTD campus. 
# One popular spot is the Changi Business Park Food Court, ...
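
One way to sanity-check the formatted prompts is to parse them back into their two parts. The sketch below assumes the layout shown in the note above: a single-line `### USER:` part, a newline, then an `### ASSISTANT:` part that may span several lines. The helper name `parse_prompt` and the sample text are made up for illustration.

```python
import re

# Assumed layout: "### USER: <question>\n### ASSISTANT: <answer>".
# The question is expected on one line; the answer may contain newlines.
PROMPT_PATTERN = re.compile(
    r"### USER: (?P<question>[^\n]+)\n### ASSISTANT: (?P<answer>.+)",
    re.DOTALL,
)


def parse_prompt(text):
    """Return the question and answer as a dict, or None if the layout differs."""
    match = PROMPT_PATTERN.fullmatch(text)
    if match is None:
        return None
    return {"question": match.group("question"), "answer": match.group("answer")}


sample = "### USER: Where is SUTD?\n### ASSISTANT: In Singapore."
print(parse_prompt(sample))
```

Running such a check over every training example can catch formatting slips before any GPU time is spent.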


In [None]:
# model id of base model
model_id = "NousResearch/Llama-2-7b-hf"

# model id for our finetuned model
new_model = "llama-7b-qlora-sutd-qa"

# config for model quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,  # nested (double) quantization disabled
)

# Load the entire model on GPU 0
device_map = {"": 0}
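
To see why 4-bit quantization matters on a small GPU, a back-of-the-envelope estimate of the weight memory can help. The sketch below assumes a nominal 7 billion parameters and counts only the raw weights; quantization constants, activations, gradients and optimizer state are ignored, so real usage will be higher. The helper `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the raw weights only, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7_000_000_000  # assumption: nominal size of a "7B" model

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)
print(f"fp16 weights: {fp16_gib:.1f} GiB, 4-bit weights: {nf4_gib:.1f} GiB")
```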


In [None]:
# Load model

# QUESTION: load the base LLM into a variable 'model' using the HF AutoModelForCausalLM class with the given quantization. Load all weights to the GPU

#--- ADD YOUR SOLUTION HERE (10 points)---


#------------------------------------------

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
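
The tokenizer settings above reuse the end-of-sequence token for padding and pad on the right. The toy sketch below shows what right padding does to a batch of token id lists. The ids are invented, `pad_right` is a hypothetical helper, and the pad id of 2 is an assumption about the Llama-2 EOS id; the real tokenizer handles all of this internally.

```python
PAD_ID = 2  # assumption: EOS id reused as the padding id


def pad_right(batch, pad_id):
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks real tokens, 0 marks padding that attention should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return input_ids, attention_mask


ids, mask = pad_right([[1, 10, 11, 12], [1, 20]], pad_id=PAD_ID)
print(ids)
print(mask)
```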



In [None]:
# Apply lora configuration
lora_config = LoraConfig(
    lora_alpha=8,
    r=8,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)
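
To get a feel for how small a rank-8 adapter is, the sketch below counts the trainable parameters LoRA adds to a single square projection matrix and compares them with the full matrix. It assumes a hidden size of 4096 for the 7B model and looks at one matrix only; which modules actually receive adapters depends on the PEFT defaults and is not modelled here. The helper `lora_param_count` is hypothetical.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


HIDDEN = 4096  # assumption: hidden size of the 7B model
R = 8          # rank from lora_config
ALPHA = 8      # lora_alpha from lora_config

full_params = HIDDEN * HIDDEN
adapter_params = lora_param_count(HIDDEN, HIDDEN, R)
scaling = ALPHA / R  # factor applied to the low-rank update

print(f"full matrix: {full_params:,} parameters")
print(f"LoRA adapter: {adapter_params:,} parameters")
print(f"fraction trained: {adapter_params / full_params:.4%}, scaling: {scaling}")
```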

In [None]:

# QUESTION: Now it is time to configure the training parameters. 
# To make it easier for you, the list of parameters is given.
# Find reasonable values for the parameters, at least something that makes the training run without crashing. 
# You can refer to the lab exercises and to open source examples on the internet

# list of parameters, some with pre-set values, others you need to set yourself:
# output_dir = "./results"
# per_device_train_batch_size  
# gradient_accumulation_steps 
# optim
# save_steps = 10
# logging_steps = 10
# learning_rate
# weight_decay
# max_grad_norm
# num_train_epochs
# warmup_ratio
# lr_scheduler_type
# packing
# max_seq_length

#--- ADD YOUR SOLUTION HERE (20 points)---



#------------------------------------------
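
When choosing a batch size and gradient accumulation steps, it helps to estimate how many optimizer steps the run will take, since `save_steps` and `logging_steps` are both set to 10. The sketch below is a rough estimate under stated assumptions: a single GPU, packing disabled, and 160 training examples. The rounding reflects my understanding of the Hugging Face trainer and may differ between versions. The values passed in are placeholders, not recommended answers.

```python
import math


def estimate_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs):
    """Rough count of optimizer updates for a single-GPU run without packing."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * steps_per_epoch)


# Placeholder values for illustration only.
total_steps = estimate_optimizer_steps(
    num_examples=160, per_device_batch=4, grad_accum=4, epochs=3
)
effective_batch = 4 * 4
log_events = total_steps // 10  # logging_steps = 10

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {total_steps}, log events: {log_events}")
```

If the estimate comes out below 10, the run would finish before the first log line or checkpoint, which is worth knowing before starting it.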

In [None]:
# configure trainer 

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=formatted_dataset["train"],
    eval_dataset=formatted_dataset["test"],
    peft_config=lora_config,
    dataset_text_field="text",
    packing=packing,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
)

In [None]:
# now finetune the model!
trainer.train()

In [None]:
# Save trained model
trainer.model.save_pretrained(new_model)

In [None]:
# Evaluate and return the metrics
trainer.evaluate()
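
The metrics returned by the evaluation include an `eval_loss` entry, which is an average cross-entropy. A common way to make it easier to interpret is to convert it to perplexity by exponentiating. The sketch below uses a made-up loss value; `perplexity_from_loss` is a hypothetical helper.

```python
import math


def perplexity_from_loss(cross_entropy_loss):
    """Perplexity is the exponential of the average cross-entropy (natural log)."""
    return math.exp(cross_entropy_loss)


# Hypothetical metrics dictionary; the real value comes from trainer.evaluate().
metrics = {"eval_loss": 1.25}
perplexity = perplexity_from_loss(metrics["eval_loss"])
print(f"perplexity: {perplexity:.2f}")
```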

In [None]:
# Empty VRAM
# Note: this does not unload everything from the GPU; maybe you can find a way to fix this
# As a workaround, you can restart your kernel to clear the GPU, then run the cells below
# https://stackoverflow.com/questions/69357881/how-to-remove-the-model-of-transformers-in-gpu-memory

import gc

del trainer
del model
del tokenizer

gc.collect()
torch.cuda.empty_cache()

In [None]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
from peft import get_peft_model, LoraConfig, PeftModel
import transformers
import torch


# model id of base model (repeat in case of kernel restart)
model_id = "NousResearch/Llama-2-7b-hf"

# model id for our finetuned model (repeat in case of kernel restart)
new_model = "llama-7b-qlora-sutd-qa"

# Load the entire model on GPU 0 (repeat in case of kernel restart)
device_map = {"": 0}


# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map
)
    
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
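
Conceptually, merging folds the low-rank update into the base weights so that no separate adapter is needed at inference time. The sketch below illustrates this with tiny hand-written matrices in plain Python, assuming the usual LoRA formulation W' = W + (alpha / r) * B A. All numbers and helper names are invented for illustration and are not taken from the library's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight matrix."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]  # base weight, d_out x d_in
A = [[0.5, -1.0]]             # r x d_in with r = 1
B = [[1.0], [2.0]]            # d_out x r
merged = merge_lora(W, A, B, alpha=2, r=1)

# The merged matrix should give the same output as base plus scaled adapter.
x = [[1.0], [1.0]]
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
separate = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]
print(merged, matmul(merged, x), separate)
```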

In [None]:
from huggingface_hub import login

# log in to huggingface, you need to put your huggingface access token
# https://huggingface.co/docs/hub/en/security-tokens

hf_access_token = "<YOUR HF WRITE ACCESS TOKEN>"
login(token=hf_access_token)
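
Since the completed notebook is submitted, a token pasted into a cell ends up in the submission. One option is to read it from an environment variable instead. The sketch below assumes the variable is called `HF_TOKEN` and that you set it before starting the notebook; the helper `read_hf_token` is hypothetical.

```python
import os


def read_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the access token from the environment, or raise if it is missing."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return token


# Demonstration with a fake environment so no real secret is needed.
print(read_hf_token({"HF_TOKEN": "hf_example_token"}))
```

The result could then be passed to the login call as `login(token=read_hf_token())`.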

In [None]:
# push finetuned model to huggingface
model.push_to_hub(new_model, use_temp_dir=False)


In [None]:
# push tokenizer to huggingface
tokenizer.push_to_hub(new_model, use_temp_dir=False)

### This concludes the second part of the assignment. Continue with the next part.