<a href="https://www.kaggle.com/code/aisuko/fine-tuning-microsoft-phi2?scriptVersionId=161626208" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Microsoft's Phi-2 is a language model with 2.7 billion parameters. It was trained on the same data sources as Phi-1.5, augmented with a new data source consisting of various synthetic NLP texts and filtered websites. According to the model card, it showed nearly state-of-the-art performance among models with fewer than 13 billion parameters.

Let's fine-tune it in the Kaggle environment.

In [None]:
!pip install transformers==4.36.2
!pip install datasets==2.15.0
!pip install peft==0.7.1
!pip install bitsandbytes==0.41.3
!pip install accelerate==0.25.0
!pip install trl==0.7.7
!pip install tqdm==4.66.1
# flash-attn is not supported in the Kaggle environment; we install it so the notebook is ready for future use.
!pip install flash-attn==2.4.2

In [None]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tune-models"
os.environ["WANDB_NOTES"] = "Fine-tuning causal language models"
os.environ["WANDB_NAME"] = "fine-tuning-Phi2-with-webglm-qa-with-lora"
os.environ["MODEL_NAME"] = "microsoft/phi-2"
os.environ["DATASET_NAME"]="THUDM/webglm-qa"

In [None]:
!accelerate estimate-memory ${MODEL_NAME} --library_name transformers

In [None]:
!nvidia-smi

# Load the dataset

The steps are:
* load the dataset
* tokenize the train/test datasets for fine-tuning

Here we merge the validation and test splits, which together amount to 1,400 rows.

In [None]:
from datasets import load_dataset

train_dataset=load_dataset(os.getenv("DATASET_NAME"), split="train[5000:5500]")

# merge validation/test datasets
test_dataset=load_dataset(os.getenv("DATASET_NAME"), split="validation+test")

# Define the processing function

In [None]:
from transformers import AutoTokenizer

# Setting up the tokenizer for Phi-2
tokenizer=AutoTokenizer.from_pretrained(
    os.getenv("MODEL_NAME"),
    add_eos_token=True, 
    trust_remote_code=True
)

tokenizer.pad_token=tokenizer.eos_token
tokenizer.truncation_side="left"

In [None]:
def collate_and_tokenize(examples):
    question=examples["question"][0].replace('"',r'\"')
    answer=examples["answer"][0].replace('"',r'\"')
    references='\n'.join([f"[{index+1}] {string}" for index, string in enumerate(examples["references"][0])])
    
    # Merging into one prompt for tokenization and training
    prompt=f"""###System:
Read the references provided and answer the corresponding question.
###References:
{references}
###Question:
{question}
###Answer:
{answer}"""
    
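    # Tokenize the prompt. With max_length=None, padding and truncation fall back to
    # the tokenizer's model_max_length, so every example is padded to that length.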
    encoded =tokenizer(
        prompt,
        return_tensors="np",
        padding="max_length",
        truncation=True,
        max_length=None,
    )
    
    encoded["labels"]=encoded["input_ids"]
    return encoded

In [None]:
# Keep only the tokenizer outputs and the labels added in the function above.
columns_to_remove=["question","answer","references"]

# Tokenize the training and test datasets
tokenized_dataset_train=train_dataset.map(
    collate_and_tokenize,
    batched=True,
    batch_size=1,
    remove_columns=columns_to_remove
)

tokenized_dataset_test=test_dataset.map(
    collate_and_tokenize,
    batched=True,
    batch_size=1,
    remove_columns=columns_to_remove
)

# Load the model


We are going to use quantization to reduce the memory footprint.

A 32-bit floating-point weight takes 4 bytes of memory. A 16-bit weight takes 2 bytes, an 8-bit weight takes 1 byte, and a 4-bit weight takes 0.5 bytes.

For Phi-2, with 2.7 billion parameters, loading the model in 32-bit precision requires approximately $2.7 \times 4 = 10.8$ GB. Note that this covers only loading the weights; during training, memory usage grows, often doubling the initial requirement once gradients are stored. With the Adam optimizer, which keeps two extra state values per trainable parameter, it roughly quadruples.
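The arithmetic above can be sketched in a few lines. This is a rough back-of-the-envelope estimate, not a measurement: the function names are illustrative, the parameter count is rounded to 2.7 billion, GB means $10^9$ bytes, and activations and framework overhead are ignored.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16/bf16": 2.0, "int8": 1.0, "4-bit": 0.5}

N_PARAMS = 2.7e9  # Phi-2, rounded


def weights_gb(n_params, bytes_per_param):
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


def full_finetune_gb(n_params, bytes_per_param=4.0):
    """Rough estimate for full fine-tuning with Adam at a single precision.

    weights (1x) + gradients (1x) + two Adam state tensors (2x) = 4x.
    """
    weights = weights_gb(n_params, bytes_per_param)
    gradients = weights
    adam_states = 2 * weights
    return weights + gradients + adam_states


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>10}: {weights_gb(N_PARAMS, nbytes):.2f} GB")

print(f"full fp32 fine-tune with Adam: {full_finetune_gb(N_PARAMS):.1f} GB")
```

The 4-bit row is why quantized loading fits comfortably on a single Kaggle GPU, while the last line shows why full-precision fine-tuning with Adam does not.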

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig

bnb_config=BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True
)

model=AutoModelForCausalLM.from_pretrained(
    os.getenv("MODEL_NAME"),
    device_map='auto',
    quantization_config=bnb_config,
#     attn_implementation="flash_attention_2"
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

def print_trainable_parameters(model):
    trainable_params=0
    all_params=0
    for _, param in model.named_parameters():
        all_params+=param.numel()
        if param.requires_grad:
            trainable_params+=param.numel()
    print(f"trainable params: {trainable_params} || all params: {all_params} || trainable%: {100 * trainable_params/all_params:.2f}")

print_trainable_parameters(model)

In [None]:
model.get_memory_footprint()

In [None]:
model.config.quantization_config

# Training with QLoRA

In [None]:
from peft import prepare_model_for_kbit_training

# Enable gradient checkpointing to save memory
model.gradient_checkpointing_enable()
model.get_memory_footprint()

In [None]:
# Freeze the base model layers and cast layernorm to fp32
prepared_model=prepare_model_for_kbit_training(
    model, use_gradient_checkpointing=True
)
prepared_model.get_memory_footprint()

When we print the model, we can see the names of its linear modules. We are going to use these names as the target_modules of our LoRA adapter below.
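Instead of reading the printed model by eye, the module names can also be collected programmatically. The helper below is a minimal sketch: `linear_leaf_names` is a hypothetical name, it only assumes the object exposes a PyTorch-style `named_modules()` iterator, and it matches layers by class name (for example `Linear` or `Linear4bit`). The toy classes are stand-ins so the sketch runs without loading a real model.

```python
from collections import Counter


def linear_leaf_names(model):
    """Count the leaf names of all modules whose class name starts with 'Linear'."""
    counts = Counter()
    for full_name, module in model.named_modules():
        if type(module).__name__.startswith("Linear"):
            # "model.layers.0.self_attn.q_proj" -> "q_proj"
            counts[full_name.rsplit(".", 1)[-1]] += 1
    return counts


# Hypothetical stand-ins for real layer classes (illustration only).
class Linear4bit:
    pass


class LayerNorm:
    pass


class ToyModel:
    def named_modules(self):
        yield "model.layers.0.self_attn.q_proj", Linear4bit()
        yield "model.layers.0.self_attn.dense", Linear4bit()
        yield "model.layers.0.input_layernorm", LayerNorm()
        yield "model.layers.0.mlp.fc1", Linear4bit()


toy_counts = linear_leaf_names(ToyModel())
print(dict(toy_counts))
```

On the real model, calling `linear_leaf_names(prepared_model)` should list the projection names to choose from. An output head may also show up in the result and is typically left out of `target_modules`.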

In [None]:
print(prepared_model)

In [None]:
# ValueError: FSDP requires PyTorch >= 2.1.0

# from accelerate import FullyShardedDataParallelPlugin, Accelerator
# from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

# fsdp_plugin=FullyShardedDataParallelPlugin(
#     state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
#     optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False)
# )

# accelerator=Accelerator(fsdp_plugin=fsdp_plugin)

In [None]:
from peft import LoraConfig, get_peft_model, TaskType

peft_config=LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        'q_proj',
        'k_proj',
        'v_proj',
        'dense',
        'fc1',
        'fc2',
    ],
    bias="none",
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM
)

lora_model=get_peft_model(prepared_model, peft_config)
lora_model.get_memory_footprint()

# lora_model=accelerator.prepare_model(lora_model)

## Introduction of the parameters

* **per_device_train_batch_size** and **gradient_accumulation_steps**

    Together, these two parameters determine the effective batch size. With them set to 2 and 5, the effective training batch size is 10. The total number of steps for one epoch is therefore $(500/10) \times 1 = 50$, where 500 is the training dataset size, 10 is the batch size, and 1 is the number of epochs.
    
* **max_steps** and **num_train_epochs**

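    These two parameters are alternative ways to set the training length. One epoch is one full pass through the training data, whereas the number of steps is calculated as (dataset_size/batch_size)*(num_epochs). If both are set, a positive max_steps overrides num_train_epochs; the configuration below sets max_steps=100, so training runs for 100 steps.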
    
* **optim**

    Optimizers are responsible for minimizing the error, or loss, of the model by adjusting its parameters (weights). Their goal is to find a set of parameters that lets the model make accurate predictions on new, previously unseen data.
    Regular optimizers like Adam can consume a large amount of GPU memory. That's why we use an 8-bit paged optimizer, which stores the optimizer state in lower precision and enables paging, reducing the load on the GPU.
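The step arithmetic described above can be checked with a short sketch. `training_plan` is a hypothetical helper, not part of any library; it assumes a single device and ignores how the trainer handles a partial final batch.

```python
def training_plan(dataset_size, per_device_batch_size, grad_accum_steps,
                  num_epochs=1, max_steps=-1):
    """Approximate optimizer steps for a single-device run."""
    effective_batch = per_device_batch_size * grad_accum_steps
    steps_per_epoch = max(dataset_size // effective_batch, 1)
    # A positive max_steps takes precedence over the epoch count.
    total_steps = max_steps if max_steps > 0 else steps_per_epoch * num_epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "epochs_covered": total_steps / steps_per_epoch,
    }


# Values used in this notebook: 500 rows, batch size 2, accumulation 5.
by_epochs = training_plan(500, 2, 5, num_epochs=1)
by_steps = training_plan(500, 2, 5, num_epochs=1, max_steps=100)

print(by_epochs)
print(by_steps)
```

With `max_steps=100`, the run covers roughly two passes over the 500 training rows rather than one.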
    

In [None]:
import time
from transformers import TrainingArguments, Trainer

training_args=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    overwrite_output_dir=True,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=5,
    gradient_checkpointing=True,  # Enable gradient checkpointing
    gradient_checkpointing_kwargs={"use_reentrant": False},
    warmup_steps=20,
    max_steps=100, # Total number of training steps; overrides num_train_epochs when positive
    num_train_epochs=1, # Number of training epochs (ignored here because max_steps is set)
    learning_rate=5e-5, # Learning rate
    weight_decay=0.01, # Weight decay
    optim="paged_adamw_8bit", # Paged AdamW with 8-bit optimizer states
#     bf16=True, # Not supported in the Kaggle environment; requires an Ampere or newer GPU
    fp16=True, # Use fp16 mixed-precision training instead of 32-bit training
    logging_dir='./logs',
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2, # Limit the total number of checkpoints
    evaluation_strategy="steps",
    eval_steps=20,
    load_best_model_at_end=True, # Load the best model at the end of training
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME")
)

lora_model.config.use_cache=False

trainer=Trainer(
    model=lora_model,
    train_dataset=tokenized_dataset_train,
    eval_dataset=tokenized_dataset_test,
    args=training_args,
)


start_time=time.time()
trainer.train()
end_time=time.time()

training_time=end_time-start_time

print(f"Training completed in {training_time} seconds.")

In [None]:
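# Note: the first positional argument of Trainer.push_to_hub is the commit message;
# the target repository is derived from output_dir.
trainer.push_to_hub(os.getenv("WANDB_NAME"))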
tokenizer.push_to_hub(os.getenv("WANDB_NAME"))

# Inference

In [None]:
# Set up a prompt that we can use for testing

new_prompt = """###System:
Read the references provided and answer the corresponding question.
###References:
[1] For most people, the act of reading is a reward in itself. However, studies show that reading books also has benefits that range from a longer life to career success. If you’re looking for reasons to pick up a book, read on for seven science-backed reasons why reading is good for your health, relationships and happiness.
[2] As per a study, one of the prime benefits of reading books is slowing down mental disorders such as Alzheimer’s and Dementia  It happens since reading stimulates the brain and keeps it active, which allows it to retain its power and capacity.
[3] Another one of the benefits of reading books is that they can improve our ability to empathize with others. And empathy has many benefits – it can reduce stress, improve our relationships, and inform our moral compasses.
[4] Here are 10 benefits of reading that illustrate the importance of reading books. When you read every day you:
[5] Why is reading good for you? Reading is good for you because it improves your focus, memory, empathy, and communication skills. It can reduce stress, improve your mental health, and help you live longer. Reading also allows you to learn new things to help you succeed in your work and relationships.
###Question:
Why is reading books widely considered to be beneficial?
###Answer:
"""

In [None]:
del lora_model, trainer

In [None]:
import gc

gc.collect()
torch.cuda.empty_cache()

In [None]:
inputs=tokenizer(
    new_prompt, 
    return_tensors="pt", 
    return_attention_mask=False, 
    padding=True, 
    truncation=True)

inputs.to('cuda')
prepared_model.config.use_cache=True

outputs=prepared_model.generate(**inputs, repetition_penalty=1.0, max_length=1000)
result=tokenizer.batch_decode(outputs, skip_special_tokens=True)
result

In [None]:
from peft import PeftConfig, PeftModel

model_name="aisuko/"+os.getenv("WANDB_NAME")
peft_model=PeftModel.from_pretrained(prepared_model, model_name)

In [None]:
outputs=peft_model.generate(**inputs, max_length=1000)
text=tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
text

# Credit

* https://medium.com/@yernenip/optimizing-phi-2-a-deep-dive-into-fine-tuning-small-language-models-9d545ac90a99