<a href="https://www.kaggle.com/code/aisuko/fine-tuning-microsoft-phi2?scriptVersionId=161495903" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Microsoft's Phi-2 is a language model with 2.7 billion parameters. It was trained using the same data sources as Phi-1.5, augmented with a new data source that consists of various NLP synthetic texts and filtered websites. According to the model card, it showed nearly state-of-the-art performance among models with fewer than 13 billion parameters.

Let's fine-tune it in the Kaggle environment.

In [1]:
!pip install transformers==4.36.2
!pip install datasets==2.15.0
!pip install peft==0.7.1
!pip install bitsandbytes==0.41.3
!pip install accelerate==0.25.0
!pip install trl==0.7.7
!pip install tqdm==4.66.1
# flash-attn is not supported in the Kaggle environment yet; it is installed here so the notebook is ready for future use.
!pip install flash-attn==2.4.2

Collecting datasets==2.15.0
  Obtaining dependency information for datasets==2.15.0 from https://files.pythonhosted.org/packages/e2/cf/db41e572d7ed958e8679018f8190438ef700aeb501b62da9e1eed9e4d69a/datasets-2.15.0-py3-none-any.whl.metadata
  Downloading datasets-2.15.0-py3-none-any.whl.metadata (20 kB)
Collecting pyarrow-hotfix (from datasets==2.15.0)
  Obtaining dependency information for pyarrow-hotfix from https://files.pythonhosted.org/packages/e4/f4/9ec2222f5f5f8ea04f66f184caafd991a39c8782e31f5b0266f101cb68ca/pyarrow_hotfix-0.6-py3-none-any.whl.metadata
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl.metadata (3.6 kB)
Collecting fsspec[http]<=2023.10.0,>=2023.1.0 (from datasets==2.15.0)
  Obtaining dependency information for fsspec[http]<=2023.10.0,>=2023.1.0 from https://files.pythonhosted.org/packages/e8/f6/3eccfb530aac90ad1301c582da228e4763f19e719ac8200752a4841b0b2d/fsspec-2023.10.0-py3-none-any.whl.metadata
  Downloading fsspec-2023.10.0-py3-none-any.whl.metadata (6.8 

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tune-models"
os.environ["WANDB_NOTES"] = "Fine-tuning causal language models"
os.environ["WANDB_NAME"] = "fine-tuning-Phi2-with-webglm-qa-with-lora"
os.environ["MODEL_NAME"] = "microsoft/phi-2"
os.environ["DATASET_NAME"]="THUDM/webglm-qa"

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
!accelerate estimate-memory ${MODEL_NAME} --library_name transformers

Loading pretrained config for `microsoft/phi-2` from `transformers`...
config.json: 100%|█████████████████████████████| 863/863 [00:00<00:00, 5.53MB/s]
┌────────────────────────────────────────────────────┐
│     Memory Usage for loading `microsoft/phi-2`     │
├───────┬─────────────┬──────────┬───────────────────┤
│ dtype │Largest Layer│Total Size│Training using Adam│
├───────┼─────────────┼──────────┼───────────────────┤
│float32│   500.2 MB  │ 10.37 GB │      41.48 GB     │
│float16│   250.1 MB  │ 5.19 GB  │      20.74 GB     │
│  int8 │  125.05 MB  │ 2.59 GB  │      10.37 GB     │
│  int4 │   62.52 MB  │  1.3 GB  │      5.19 GB      │
└───────┴─────────────┴──────────┴───────────────────┘


In [4]:
!nvidia-smi


# Load the dataset

This involves two steps:
* load the dataset
* tokenize the train/test datasets for fine-tuning

For training we take a 2,000-row slice of the train split. For evaluation we merge the validation and test splits, which together amount to 1,400 rows.

In [5]:
from datasets import load_dataset

train_dataset=load_dataset(os.getenv("DATASET_NAME"), split="train[5000:7000]")

# merge validation/test datasets
test_dataset=load_dataset(os.getenv("DATASET_NAME"), split="validation+test")

Downloading readme:   0%|          | 0.00/4.27k [00:00<?, ?B/s]



Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/115M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.63M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.06M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

# Define the processing function

In [6]:
from transformers import AutoTokenizer

# Setting up the tokenizer for Phi-2
tokenizer=AutoTokenizer.from_pretrained(
    os.getenv("MODEL_NAME"),
    add_eos_token=True, 
    trust_remote_code=True
)

tokenizer.pad_token=tokenizer.eos_token
tokenizer.truncation_side="left"

tokenizer_config.json:   0%|          | 0.00/7.34k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/1.08k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [7]:
def collate_and_tokenize(examples):
    question=examples["question"][0].replace('"',r'\"')
    answer=examples["answer"][0].replace('"',r'\"')
    references='\n'.join([f"[{index+1}] {string}" for index, string in enumerate(examples["references"][0])])
    
    # Merging into one prompt for tokenization and training
    prompt=f"""###System:
Read the reference provided and answer the corresponding question.
###References:
{references}
###Question:
{question}
###Answer:
{answer}"""
    
    # Tokenize the prompt
    encoded =tokenizer(
        prompt,
        return_tensors="np",
        padding="max_length",
        truncation=True,
        max_length=None,
    )
    
    encoded["labels"]=encoded["input_ids"]
    return encoded
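
To make the prompt layout easier to inspect without loading the tokenizer, the formatting step can be pulled out on its own. The following is a minimal sketch: `build_prompt` is a hypothetical helper name, and the question, answer, and references are made-up toy data, not rows from the dataset.

```python
def build_prompt(question, answer, references):
    """Build one training prompt in the same layout as collate_and_tokenize."""
    # Escape double quotes, as the notebook's function does
    question = question.replace('"', r'\"')
    answer = answer.replace('"', r'\"')
    # Number the references as [1], [2], ...
    numbered = "\n".join(f"[{i + 1}] {ref}" for i, ref in enumerate(references))
    return (
        "###System:\n"
        "Read the reference provided and answer the corresponding question.\n"
        "###References:\n"
        f"{numbered}\n"
        "###Question:\n"
        f"{question}\n"
        "###Answer:\n"
        f"{answer}"
    )


# Toy example (hypothetical data)
example = build_prompt(
    question="Why is the sky blue?",
    answer="Air scatters blue light more strongly than red light.",
    references=[
        "Air molecules scatter short wavelengths more strongly.",
        "Blue light has a shorter wavelength than red light.",
    ],
)
print(example)
```

Printing one prompt like this is a quick way to check that the section markers and reference numbering look as intended before tokenizing the whole dataset.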

In [8]:
# Drop the raw text columns; the tokenized fields and the labels added in the function above are kept.
columns_to_remove=["question","answer","references"]

#tokenize the training and test datasets
tokenized_dataset_train=train_dataset.map(
    collate_and_tokenize,
    batched=True,
    batch_size=1,
    remove_columns=columns_to_remove
)

tokenized_dataset_test=test_dataset.map(
    collate_and_tokenize,
    batched=True,
    batch_size=1,
    remove_columns=columns_to_remove
)

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1400 [00:00<?, ? examples/s]

# Load the model


We are going to load the model with quantization.

A 32-bit floating-point weight takes 4 bytes of memory, a 16-bit weight takes 2 bytes, an 8-bit weight takes 1 byte, and a 4-bit weight takes 0.5 bytes.

For Phi-2, with 2.7 billion parameters, loading the model in 32-bit precision requires approximately $2.7*4=10.8$ GB. Note that this is solely for loading the model; during training, memory usage grows, often doubling the initial requirement, and with the Adam optimizer it roughly quadruples.
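
The arithmetic above can be written as a small helper. This is a rough sketch, not a precise estimator: `estimate_memory_gb` is a hypothetical function, it uses decimal gigabytes and the rounded figure of 2.7 billion parameters, and it applies the 4x rule of thumb for Adam (weights, gradients, and two optimizer moments). Activations and framework overhead are ignored.

```python
# Bytes needed to store one weight at each precision
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_memory_gb(num_params, dtype, training_with_adam=False):
    """Rough memory estimate in GB (1 GB = 1e9 bytes)."""
    size_gb = num_params * BYTES_PER_PARAM[dtype] / 1e9
    # Rule of thumb: weights + gradients + two Adam moments = 4x the weights
    return size_gb * 4 if training_with_adam else size_gb


PHI2_PARAMS = 2.7e9  # rounded parameter count used in the text

for dtype in BYTES_PER_PARAM:
    load = estimate_memory_gb(PHI2_PARAMS, dtype)
    train = estimate_memory_gb(PHI2_PARAMS, dtype, training_with_adam=True)
    print(f"{dtype:>8}: load ~{load:.2f} GB, train with Adam ~{train:.2f} GB")
```

The figures are close to, but not identical with, the `accelerate estimate-memory` table earlier in the notebook, which works from the model's actual configuration rather than a rounded parameter count.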

In [9]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig

bnb_config=BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True
)

model=AutoModelForCausalLM.from_pretrained(
    os.getenv("MODEL_NAME"),
    device_map='auto',
    quantization_config=bnb_config,
#     attn_implementation="flash_attention_2",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

def print_trainable_parameters(model):
    trainable_params=0
    all_params=0
    for _, param in model.named_parameters():
        all_params+=param.numel()
        if param.requires_grad:
            trainable_params+=param.numel()
    print(f"trainable params: {trainable_params} || all params: {all_params} || trainable%: {100 * trainable_params/all_params:.2f}")

print_trainable_parameters(model)

configuration_phi.py:   0%|          | 0.00/9.26k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/phi-2:
- configuration_phi.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_phi.py:   0%|          | 0.00/62.7k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/phi-2:
- modeling_phi.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors.index.json:   0%|          | 0.00/35.7k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/564M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

trainable params: 262364160 || all params: 1521392640 || trainable%: 17.24


In [10]:
model.get_memory_footprint()

1792884736

In [11]:
model.config.quantization_config

BitsAndBytesConfig {
  "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "llm_int8_enable_fp32_cpu_offload": true,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}

# Training with QLoRA

In [12]:
from peft import prepare_model_for_kbit_training

#gradient checkpointing to save memory
model.gradient_checkpointing_enable()
model.get_memory_footprint()

1792884736

In [13]:
# freeze base model layers and cast layernorm to fp32
prepared_model=prepare_model_for_kbit_training(
    model, use_gradient_checkpointing=True
)
prepared_model.get_memory_footprint()

2319087616

When we print the model, we can see the names of its linear layers. We are going to use these as the target_modules of our LoRA adapter below.

In [14]:
print(prepared_model)

PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2560)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x PhiDecoderLayer(
        (self_attn): PhiAttention(
          (q_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (k_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (v_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (dense): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (rotary_emb): PhiRotaryEmbedding()
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear4bit(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear4bit(in_features=10240, out_features=2560, bias=True)
        )
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (final_layern

In [15]:
# ValueError: FSDP requires PyTorch >= 2.1.0

# from accelerate import FullyShardedDataParallelPlugin, Accelerator
# from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

# fsdp_plugin=FullyShardedDataParallelPlugin(
#     state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
#     optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False)
# )

# accelerator=Accelerator(fsdp_plugin=fsdp_plugin)

In [16]:
from peft import LoraConfig, get_peft_model, TaskType

peft_config=LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        'q_proj',
        'k_proj',
        'v_proj',
        'dense',
        'fc1',
        'fc2',
    ],
    bias="none",
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM
)

lora_model=get_peft_model(prepared_model, peft_config)
lora_model.get_memory_footprint()

# lora_model=accelerator.prepare_model(lora_model)

2413459456
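
As a sanity check on the adapter size, the number of LoRA parameters can be worked out by hand from the layer shapes printed above. This sketch assumes that each adapted `Linear(in, out)` layer gains two matrices of shape `(r, in)` and `(out, r)`, that `bias="none"` adds no further parameters, and that the adapter weights are stored in fp32; the helper name `lora_params` is hypothetical.

```python
R = 16           # LoRA rank from peft_config
NUM_LAYERS = 32  # number of PhiDecoderLayer blocks in the printed model

# (in_features, out_features) of each target module, from the printed model
TARGET_SHAPES = {
    "q_proj": (2560, 2560),
    "k_proj": (2560, 2560),
    "v_proj": (2560, 2560),
    "dense": (2560, 2560),
    "fc1": (2560, 10240),
    "fc2": (10240, 2560),
}


def lora_params(in_features, out_features, r):
    """Parameters added by one LoRA pair: A is (r, in), B is (out, r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total_params = per_layer * NUM_LAYERS
adapter_bytes_fp32 = total_params * 4  # assumes fp32 adapter weights

print(f"LoRA params per decoder layer: {per_layer:,}")
print(f"LoRA params in total:          {total_params:,}")
print(f"Adapter size in fp32:          {adapter_bytes_fp32:,} bytes")
```

Under these assumptions the adapter comes to 94,371,840 bytes, which equals the difference between the two memory footprints reported above (2413459456 - 2319087616).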

## Introduction of the parameters

* **per_device_train_batch_size** and **gradient_accumulation_steps**

    Together, these two parameters determine the effective batch size. As we have them set to "2" and "5", our training batch size is 10. That means one full pass over the data takes $(2000/10)*1=200$ steps, where 2000 is the training dataset size, 10 is the batch size, and 1 is the number of epochs.
    
* **max_steps** and **num_train_epochs**

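    These two parameters are mutually exclusive: if both are set, **max_steps** takes precedence over **num_train_epochs**. One epoch is one full pass through the training data, whereas the number of steps is calculated as (dataset_size/batch_size)*(num_epochs). The configuration below sets max_steps to 100, so training stops after 100 steps, half of the 200 steps in a full epoch.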
    
* **optim**

    Optimizers are primarily responsible for minimizing the error, or loss, of the model by adjusting the model's parameters or weights. Their ultimate goal is to find the "optimal" set of parameters that enables the model to make close-to-accurate predictions on new, previously unseen data.
    Regular optimizers like Adam can consume a substantial amount of GPU memory. That's why we are using an 8-bit paged optimizer, which stores the optimizer state in lower precision and enables paging, both of which reduce the load on the GPU.
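
The step arithmetic described in this list can be sketched as follows. This is a simplified model of the schedule, not the `Trainer`'s exact internal computation: `training_schedule` is a hypothetical helper, and `num_devices=1` is an assumption about the hardware.

```python
import math


def training_schedule(dataset_size, per_device_batch_size, grad_accum_steps,
                      num_epochs, max_steps=-1, num_devices=1):
    """Return (effective_batch_size, steps_per_epoch, total_steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(dataset_size / effective_batch)
    # A positive max_steps takes precedence over the epoch count
    total_steps = max_steps if max_steps > 0 else steps_per_epoch * num_epochs
    return effective_batch, steps_per_epoch, total_steps


# Values used in this notebook: 2000 rows, batch size 2, accumulation 5
batch, per_epoch, total = training_schedule(2000, 2, 5, num_epochs=1, max_steps=100)
print(f"effective batch size: {batch}")
print(f"steps per epoch:      {per_epoch}")
print(f"total steps:          {total}")
print(f"fraction of an epoch: {total / per_epoch:.2f}")
```

With `max_steps=100` the run covers about half of the 2,000 training rows; dropping `max_steps` would let `num_train_epochs=1` run the full 200 steps.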
    

In [17]:
import time
from transformers import TrainingArguments, Trainer

training_args=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    overwrite_output_dir=True,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=5,
    gradient_checkpointing=True,  # Enable gradient checkpointing
    gradient_checkpointing_kwargs={"use_reentrant": False},
    warmup_steps=50,
    max_steps=100, # Total number of training steps; overrides num_train_epochs
    num_train_epochs=1, # Number of training epochs (ignored because max_steps is set)
    learning_rate=5e-5, # Learning rate
    weight_decay=0.01, # Weight decay
    optim="paged_adamw_8bit", # Paged AdamW that stores the optimizer state in 8-bit
#     bf16=True, # Not supported in the Kaggle environment; bf16 requires an Ampere or newer GPU
    fp16=True, # Use fp16 (16-bit mixed precision) training instead of 32-bit training
    logging_dir='./logs',
    logging_strategy="steps",
    logging_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2, # Limit the total number of checkpoints
    evaluation_strategy="steps",
    eval_steps=100,
    load_best_model_at_end=True, # Load the best model at the end of training,
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME")
)

trainer=Trainer(
    model=lora_model,
    train_dataset=tokenized_dataset_train,
    eval_dataset=tokenized_dataset_test,
    args=training_args,
)

prepared_model.config.use_cache=False

start_time=time.time()
trainer.train()
end_time=time.time()

training_time=end_time-start_time

print(f"Training completed in {training_time} seconds.")

[34m[1mwandb[0m: Currently logged in as: [33murakiny[0m ([33mcausal_language_trainer[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.16.2
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20240203_045345-ns3jxawu[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mfine-tuning-Phi2-with-webglm-qa-with-lora[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/causal_language_trainer/Fine-tune-models[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/causal_language_trainer/Fine-tune-models/runs/ns3jxawu[0m


Step,Training Loss,Validation Loss
100,2.4092,0.531142


Training completed in 12278.02936887741 seconds.


In [18]:
trainer.push_to_hub(os.getenv("WANDB_NAME"))
tokenizer.push_to_hub(os.getenv("WANDB_NAME"))

adapter_model.safetensors:   0%|          | 0.00/94.4M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/aisuko/fine-tuning-Phi2-with-webglm-qa-with-lora/commit/e3bcde76dcd1527ab894c266d8186bc01b8e80d0', commit_message='Upload tokenizer', commit_description='', oid='e3bcde76dcd1527ab894c266d8186bc01b8e80d0', pr_url=None, pr_revision=None, pr_num=None)

# Inference

In [19]:
#Setup a prompt that we can use for testing

new_prompt = """###System:
Read the references provided and answer the corresponding question.
###References:
[1] For most people, the act of reading is a reward in itself. However, studies show that reading books also has benefits that range from a longer life to career success. If you’re looking for reasons to pick up a book, read on for seven science-backed reasons why reading is good for your health, relationships and happiness.
[2] As per a study, one of the prime benefits of reading books is slowing down mental disorders such as Alzheimer’s and Dementia  It happens since reading stimulates the brain and keeps it active, which allows it to retain its power and capacity.
[3] Another one of the benefits of reading books is that they can improve our ability to empathize with others. And empathy has many benefits – it can reduce stress, improve our relationships, and inform our moral compasses.
[4] Here are 10 benefits of reading that illustrate the importance of reading books. When you read every day you:
[5] Why is reading good for you? Reading is good for you because it improves your focus, memory, empathy, and communication skills. It can reduce stress, improve your mental health, and help you live longer. Reading also allows you to learn new things to help you succeed in your work and relationships.
###Question:
Why is reading books widely considered to be beneficial?
###Answer:
"""

In [20]:
del lora_model, trainer

In [21]:
import gc

gc.collect()
torch.cuda.empty_cache()

In [22]:
inputs=tokenizer(
    new_prompt, 
    return_tensors="pt", 
    return_attention_mask=False, 
    padding=True, 
    truncation=True)

inputs.to('cuda')
prepared_model.config.use_cache=True

outputs=prepared_model.generate(**inputs, repetition_penalty=1.0, max_length=1000)
result=tokenizer.batch_decode(outputs, skip_special_tokens=True)
result

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['###System:\nRead the references provided and answer the corresponding question.\n###References:\n[1] For most people, the act of reading is a reward in itself. However, studies show that reading books also has benefits that range from a longer life to career success. If you’re looking for reasons to pick up a book, read on for seven science-backed reasons why reading is good for your health, relationships and happiness.\n[2] As per a study, one of the prime benefits of reading books is slowing down mental disorders such as Alzheimer’s and Dementia  It happens since reading stimulates the brain and keeps it active, which allows it to retain its power and capacity.\n[3] Another one of the benefits of reading books is that they can improve our ability to empathize with others. And empathy has many benefits – it can reduce stress, improve our relationships, and inform our moral compasses.\n[4] Here are 10 benefits of reading that illustrate the importance of reading books. When you read 

In [23]:
from peft import PeftConfig, PeftModel

model_name="aisuko/"+os.getenv("WANDB_NAME")
peft_model=PeftModel.from_pretrained(prepared_model, model_name)

adapter_config.json:   0%|          | 0.00/617 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/94.4M [00:00<?, ?B/s]

In [24]:
outputs=peft_model.generate(**inputs, max_length=1000)
text=tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
text

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'###System:\nRead the references provided and answer the corresponding question.\n###References:\n[1] For most people, the act of reading is a reward in itself. However, studies show that reading books also has benefits that range from a longer life to career success. If you’re looking for reasons to pick up a book, read on for seven science-backed reasons why reading is good for your health, relationships and happiness.\n[2] As per a study, one of the prime benefits of reading books is slowing down mental disorders such as Alzheimer’s and Dementia  It happens since reading stimulates the brain and keeps it active, which allows it to retain its power and capacity.\n[3] Another one of the benefits of reading books is that they can improve our ability to empathize with others. And empathy has many benefits – it can reduce stress, improve our relationships, and inform our moral compasses.\n[4] Here are 10 benefits of reading that illustrate the importance of reading books. When you read e
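
Because `generate` returns the prompt followed by the continuation, the decoded text starts with the full prompt. One simple way to keep only the model's answer is to split on the `###Answer:` marker. This is a minimal sketch: `extract_answer` is a hypothetical helper, and the sample string is made up for illustration rather than taken from the model output.

```python
def extract_answer(generated_text, marker="###Answer:"):
    """Return the text after the first answer marker, or the whole text if absent."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()


# Made-up sample in the same layout as the prompt used above
sample = (
    "###System:\nRead the references provided and answer the corresponding question.\n"
    "###References:\n[1] Reading improves focus.\n"
    "###Question:\nWhy read?\n"
    "###Answer:\nReading improves focus and memory."
)
print(extract_answer(sample))
```

In the notebook this could be applied to the decoded `text` variable from the cell above.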

# Credit

* https://medium.com/@yernenip/optimizing-phi-2-a-deep-dive-into-fine-tuning-small-language-models-9d545ac90a99