In [1]:
import os
import torch

### Reference

https://huggingface.co/blog/dvgodoy/fine-tuning-llm-hugging-face


# Fine Tune with LoRA
In this notebook, we fine-tune an instruction-tuned pretrained model so that it sounds like Master Yoda from the Star Wars series. We use LoRA together with a quantized base model to reduce resource usage during training.

## Load a Quantized Base Model
We will load a quantized model, which takes up less GPU memory. A quantized model <u>replaces the original weights with approximate values that are represented by fewer bits</u>. This reduces the model's memory footprint.<br>
To load a quantized model, we pass a `BitsAndBytesConfig` instance as the `quantization_config` argument of `from_pretrained()`.
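
To make "approximate values represented by fewer bits" concrete, here is a minimal sketch of a much simpler scheme than the one configured in this notebook: uniform absmax quantization to signed 4-bit codes. It is only an illustration under that assumption. The `nf4` type uses a non-uniform set of levels and block-wise scales, and the helper names `quantize_absmax_4bit` and `dequantize` are made up for this example.

```python
def quantize_absmax_4bit(weights):
    """Map floats to signed integer codes in [-7, 7] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.98, -1.40, 0.07, 0.66]
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest level bounds the error by half a quantization step
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"scale={scale:.3f}, max error={max_error:.3f}")
```

Each weight is now stored as a small integer plus one shared scale, at the cost of a bounded rounding error.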

In [2]:
model_name = 'microsoft/Phi-3-mini-4k-instruct'

In [4]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

In [5]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float32
)

model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda:0", quantization_config=bnb_config)


In [6]:
# check how much space the model takes
print(model.get_memory_footprint()/1e6)

2206.341504


In [7]:
# check model architecture
print(model)

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear4bit(in_features=3072, out_features=9216, bias=False)
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear4bit(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear4bit(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLUActivation()
        )
        (input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (norm): Phi3RMSNorm((3072,), eps=1e-05)
    (rotary_emb): Phi3RotaryEmbedding()
  )
  (lm_head): Linear(in_features=

## Setup LoRA
LoRA adapters are attached to layers of the pretrained model. Here we attach one to every quantized layer. <br>
In our case, the quantized layers are frozen, so we only need to train the LoRA layers to adapt the model to our domain. <u>Using <b>LoRA adapters</b> on a quantized model drastically reduces the total number of trainable parameters.</u>
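
As a rough illustration of what a LoRA layer computes, the sketch below adds a scaled low-rank path `B @ A` next to a frozen weight matrix `W`. It uses plain Python lists and tiny made-up dimensions. The helper names `matvec` and `lora_forward` are hypothetical and are not part of `peft`.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the low-rank update scaled by alpha / r."""
    base = matvec(W, x)               # frozen path, never trained
    update = matvec(B, matvec(A, x))  # trainable path through rank r
    scaling = alpha / r
    return [b + scaling * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # B starts at zero in the LoRA setup

x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]
y = lora_forward(W, A, B, x, alpha, r)

# With B at zero, the adapter does not change the base model's output yet
print(y == matvec(W, x))
```

Only `A` and `B` would receive gradients during training, which is why the number of trainable parameters stays small.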

To set up LoRA, we use the `peft` package. There are three steps:
1. Call `prepare_model_for_kbit_training()` to prepare the quantized model for training
2. Configure an instance of `LoraConfig`
3. Apply the configuration with the `get_peft_model()` function

In [19]:
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training

In [20]:
model = prepare_model_for_kbit_training(model)

In [21]:
lora_config = LoraConfig(
    r=8,
    lora_alpha=16, # scaling factor: the LoRA update is multiplied by lora_alpha / r; often set to 2*r
    bias="none",
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=['o_proj', 'qkv_proj', 'gate_up_proj', 'down_proj'],
)

model = get_peft_model(model, lora_config)

In [22]:
# check how much space the model takes
print(model.get_memory_footprint()/1e6)

2651.074944


About 445 MB was added (2,651 MB versus 2,206 MB before).

In [23]:
# check model architecture
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Phi3ForCausalLM(
      (model): Phi3Model(
        (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
        (layers): ModuleList(
          (0-31): 32 x Phi3DecoderLayer(
            (self_attn): Phi3Attention(
              (o_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=3072, out_features=3072, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3072, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=3072, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (

In [25]:
### reusable function to print out trainable parameters
def print_trainable_parameters(model):
    # get_nb_trainable_parameters() returns parameter counts,
    # so we report them in millions (M), not megabytes
    train_params, total_params = model.get_nb_trainable_parameters()
    print(f"Trainable parameters:      {train_params/1e6:.2f}M")
    print(f'Total parameters:          {total_params/1e6:.2f}M')
    print(f'% of trainable parameters: {100*train_params/total_params:.2f}%')

In [24]:
# check how many parameters are trainable now
print_trainable_parameters(model)

Trainable parameters:      12.58M
Total parameters:          3833.66M
% of trainable parameters: 0.33%


After applying LoRA, we only have to train 0.33% of the model's parameters.
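
The trainable-parameter count can be checked by hand. Each adapted linear layer gets an `A` matrix of shape `r x in_features` and a `B` matrix of shape `out_features x r`, so it adds `r * (in_features + out_features)` parameters. The sketch below plugs in the layer shapes from the printed architecture, `r=8`, and the 32 decoder layers. The dictionary `target_shapes` is just a convenience for this example.

```python
r = 8
num_layers = 32

# (in_features, out_features) of each targeted module, taken from print(model)
target_shapes = {
    "o_proj": (3072, 3072),
    "qkv_proj": (3072, 9216),
    "gate_up_proj": (3072, 16384),
    "down_proj": (8192, 3072),
}

# LoRA adds A (r x in) and B (out x r) to every targeted layer
per_layer = sum(r * (n_in + n_out) for n_in, n_out in target_shapes.values())
trainable = per_layer * num_layers

print(f"{trainable:,} trainable parameters ({trainable / 1e6:.2f}M)")
# prints: 12,582,912 trainable parameters (12.58M)
```

This matches the 12.58 million trainable parameters reported above.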

## Prepare Dataset

### Load the dataset from Hugging Face hub

In [26]:
from datasets import load_dataset

In [27]:
dataset = load_dataset("dvgodoy/yoda_sentences", split="train")
dataset.column_names


['sentence', 'translation', 'translation_extra']

In [28]:
dataset

Dataset({
    features: ['sentence', 'translation', 'translation_extra'],
    num_rows: 720
})

In [29]:
dataset[0]

{'sentence': 'The birch canoe slid on the smooth planks.',
 'translation': 'On the smooth planks, the birch canoe slid.',
 'translation_extra': 'On the smooth planks, the birch canoe slid. Yes, hrrrm.'}

### Formatting the dataset for training
The `Phi` model expects its input in a specific chat format, so we need to transform the dataset accordingly. In this part, we prepare a helper function and apply it to every row with the dataset's `map()` method.<br>
Later, we can call `tokenizer.apply_chat_template()` or inspect `tokenizer.chat_template` to see the exact text format that is fed to the model.

In [30]:
### reusable function to 
### convert dataset from prompt/completion format 
### to conversational format

def format_dataset(data):
    if isinstance(data["prompt"], list):
        outputs = []
        for i in range(len(data["prompt"])):
            converted_data = [
                {"role":"user", "content":data["prompt"][i]},
                {"role":"assistant", "content":data["completion"][i]}
            ]
            outputs.append(converted_data)
        return {'messages': outputs}
    else:
        converted_data = [
            {"role":"user", "content":data["prompt"]},
            {"role":"assistant", "content":data["completion"]}
        ]
        return {'messages': converted_data}

In [31]:
dataset = dataset.rename_column("sentence", "prompt")
dataset = dataset.rename_column("translation_extra", "completion")
dataset = dataset.map(format_dataset)
dataset = dataset.remove_columns(["prompt", "completion", "translation"])

In [32]:
dataset

Dataset({
    features: ['messages'],
    num_rows: 720
})

In [33]:
dataset[0]

{'messages': [{'content': 'The birch canoe slid on the smooth planks.',
   'role': 'user'},
  {'content': 'On the smooth planks, the birch canoe slid. Yes, hrrrm.',
   'role': 'assistant'}]}

### Tokenizing dataset

In [3]:
from transformers import AutoTokenizer

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_name)



Configure the special tokens the model needs. Here, the padding token is set to the unknown token.

In [5]:
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.unk_token_id

Take a look at the chat template.

In [60]:
print(tokenizer.chat_template)

{% for message in messages %}{% if message['role'] == 'system' %}{{'<|system|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'user' %}{{'<|user|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'assistant' %}{{'<|assistant|>
' + message['content'] + '<|end|>
'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>
' }}{% else %}{{ eos_token }}{% endif %}


In [66]:
dataset[0]['messages']

[{'content': 'The birch canoe slid on the smooth planks.', 'role': 'user'},
 {'content': 'On the smooth planks, the birch canoe slid. Yes, hrrrm.',
  'role': 'assistant'}]

In [61]:
print(tokenizer.apply_chat_template(dataset[0]['messages'], tokenize=False))

<|user|>
The birch canoe slid on the smooth planks.<|end|>
<|assistant|>
On the smooth planks, the birch canoe slid. Yes, hrrrm.<|end|>
<|endoftext|>
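
To see what the template does without reading Jinja, here is a minimal hand-written equivalent in plain Python. It is only a sketch of the template printed above. The function name `render_phi3_chat` is hypothetical, and the default `eos_token` is taken from the `<|endoftext|>` that appears in the rendered output.

```python
def render_phi3_chat(messages, add_generation_prompt=False, eos_token="<|endoftext|>"):
    """Mimic the printed chat template: one '<|role|>' block per message."""
    parts = []
    for message in messages:
        # The template only handles these three roles and skips anything else
        if message["role"] in ("system", "user", "assistant"):
            parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    # Either open an assistant turn for generation or close the text with EOS
    parts.append("<|assistant|>\n" if add_generation_prompt else eos_token)
    return "".join(parts)


messages = [
    {"role": "user", "content": "The birch canoe slid on the smooth planks."},
    {"role": "assistant", "content": "On the smooth planks, the birch canoe slid. Yes, hrrrm."},
]
print(render_phi3_chat(messages))
```

For training data the text ends with the EOS token, while `add_generation_prompt=True` leaves an open `<|assistant|>` turn for the model to complete.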


## Fine Tune with SFT
Fine-tuning follows more or less the same procedure as training any deep learning model. Instead of writing our own training loop, we can use `Trainer` from the `transformers` package or `SFTTrainer` from the `trl` package. Here, we use `SFTTrainer`.

In [41]:
from trl import SFTConfig, SFTTrainer

In [44]:
sft_config = SFTConfig(
    # configs for memory usage
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={'use_reentrant': False}, 
    gradient_accumulation_steps=1,  
    per_device_train_batch_size=16, 
    auto_find_batch_size=True,
    # configs for dataset
    max_length=64,
    packing=True,
    packing_strategy='wrapped',
    # configs for training process
    num_train_epochs=10,
    learning_rate=3e-4,
    optim='paged_adamw_8bit',
    # config for logging
    logging_steps=10,
    logging_dir='./logs',
    output_dir='./phi3-mini-yoda-adapter',
    # other configuration
    bf16=torch.cuda.is_bf16_supported(including_emulation=False)
)

`logging_dir` is deprecated and will be removed in v5.2. Please set `TENSORBOARD_LOGGING_DIR` instead.


In [45]:
sft_trainer = SFTTrainer(
    model=model.base_model.model,
    peft_config=lora_config,
    processing_class=tokenizer,
    args=sft_config,
    train_dataset=dataset,
)




Take a look at how `SFTTrainer` processes the dataset.

In [46]:
data_loader = sft_trainer.get_train_dataloader()
batch = next(iter(data_loader))

In [47]:
batch['input_ids'][0]

tensor([ 3974, 29892,  4337,   278,   325,   271, 29892,   366,  1818, 29889,
        32007, 32000, 32010,   450,   289,   935,   310,   278,   282,   457,
         5447,   471,   528,  4901,   322,  6501, 29889, 32007, 32001, 26399,
         1758,  4317, 29889,  1383,  4901,   322,  6501, 29892,   278,   289,
          935,   310,   278,   282,   457,  5447,   471, 29889, 32007, 32000,
        32010,   951,  5989,  2507, 17354,   322, 13328,   297,   278,  6416,
        29889, 32007, 32001,   512], device='cuda:0')

In [48]:
batch['labels'][0]

tensor([ 3974, 29892,  4337,   278,   325,   271, 29892,   366,  1818, 29889,
        32007, 32000, 32010,   450,   289,   935,   310,   278,   282,   457,
         5447,   471,   528,  4901,   322,  6501, 29889, 32007, 32001, 26399,
         1758,  4317, 29889,  1383,  4901,   322,  6501, 29892,   278,   289,
          935,   310,   278,   282,   457,  5447,   471, 29889, 32007, 32000,
        32010,   951,  5989,  2507, 17354,   322, 13328,   297,   278,  6416,
        29889, 32007, 32001,   512], device='cuda:0')
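
In the batch above, `labels` is identical to `input_ids`, and the row begins in the middle of one sample and ends in the middle of another. That is what packing with fixed-length chunks looks like. The sketch below shows the general idea with toy token lists. It is a simplified illustration, not TRL's implementation. The name `pack_wrapped` is made up, and keeping the trailing partial chunk is an assumption that may differ from what TRL does.

```python
def pack_wrapped(sequences, max_length):
    """Concatenate all token lists and cut the stream into fixed-size chunks."""
    flat = [token for seq in sequences for token in seq]
    return [flat[i:i + max_length] for i in range(0, len(flat), max_length)]


# Three "tokenized samples" of different lengths (toy values)
tokenized_samples = [[1, 2, 3, 4, 5], [6, 7, 8], [9, 10, 11, 12, 13, 14]]
packed = pack_wrapped(tokenized_samples, max_length=4)

for row in packed:
    print(row)
```

Because chunk boundaries ignore sample boundaries, no padding is needed and every row except possibly the last is exactly `max_length` tokens long.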

## Training process

In [49]:
sft_trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 0}.


| Step | Training Loss |
|------|---------------|
| 10 | 2.839255 |
| 20 | 1.822768 |
| 30 | 1.604206 |
| 40 | 1.517168 |
| 50 | 1.398046 |
| 60 | 1.296729 |
| 70 | 1.191879 |
| 80 | 0.997083 |
| 90 | 0.897667 |
| 100 | 0.630632 |


TrainOutput(global_step=220, training_loss=0.8327989242293617, metrics={'train_runtime': 971.9506, 'train_samples_per_second': 3.508, 'train_steps_per_second': 0.226, 'total_flos': 4890970340720640.0, 'train_loss': 0.8327989242293617})
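
The 220 optimizer steps can be reconstructed from the reported metrics. The sketch below assumes that `train_samples_per_second` counts packed sequences over all epochs, and that training ran on a single device with `gradient_accumulation_steps=1` as configured. The number of packed sequences is inferred from the metrics and is not printed anywhere in the notebook.

```python
import math

num_train_epochs = 10
per_device_train_batch_size = 16
train_runtime = 971.9506          # seconds, from TrainOutput
train_samples_per_second = 3.508  # from TrainOutput

# Assumption: throughput * runtime = packed sequences seen over all epochs
packed_sequences = round(train_samples_per_second * train_runtime / num_train_epochs)
steps_per_epoch = math.ceil(packed_sequences / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(packed_sequences, steps_per_epoch, total_steps)
```

Under these assumptions the 720 sentences are packed into roughly 341 sequences of 64 tokens, which gives 22 steps per epoch and 220 steps in total.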

In [50]:
### save the model
sft_trainer.save_model('Phi-3-mini-4k-yoda')

# Test with the fine-tuned model
We have trained a model that speaks like Master Yoda. Let's compare the responses generated by the model before and after fine-tuning.

We will prepare two reusable functions:
- `generate_prompt`: converts the user's input into the prompt format required by the model
- `generate_response`: takes a model, a tokenizer, and a prompt, and generates a response

In [78]:
def generate_prompt(tokenizer, sentence):
    converted = [{"role":"user", "content":sentence}]
    prompt = tokenizer.apply_chat_template(converted, tokenize=False, add_generation_prompt=True)
    return prompt

In [79]:
sentence = "May the Force be with you!"
prompt = generate_prompt(tokenizer, sentence)
print(prompt)

<|user|>
May the Force be with you!<|end|>
<|assistant|>



In [82]:
from contextlib import nullcontext

def generate_response(model, tokenizer, prompt, max_new_tokens=64, skip_special_tokens=False):
    # the prompt already contains the chat-template special tokens, so don't add more
    tokenized = tokenizer(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
    model.eval()
    # use autocast only when the model itself runs in half precision
    use_autocast = model.dtype in (torch.float16, torch.bfloat16)
    ctx = torch.autocast(device_type=model.device.type, dtype=model.dtype) if use_autocast else nullcontext()
    with ctx:
        gen_output = model.generate(**tokenized, eos_token_id=tokenizer.eos_token_id, max_new_tokens=max_new_tokens)
    # decode prompt and completion together; keep special tokens unless asked otherwise
    output = tokenizer.batch_decode(gen_output, skip_special_tokens=skip_special_tokens)
    return output[0]

In [83]:
print(generate_response(model, tokenizer, prompt))

<|user|> May the Force be with you!<|end|><|assistant|> With you, may the Force be.<|end|><|endoftext|>


It sounds like Master Yoda now. Let's compare with the model before fine tuning.

In [84]:
raw_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda:0", quantization_config=bnb_config)

In [85]:
print(generate_response(raw_model, tokenizer, prompt))

<|user|> May the Force be with you!<|end|><|assistant|> Always here to assist you. If you have any questions or need support, feel free to ask. May the Force be with you!<|end|><|endoftext|>


What a huge difference!!!

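## Merge LoRA into the base model
We can merge the LoRA weights into the original model so that it becomes a single standalone model without separate adapter layers. With the `peft` package, this is done by calling `merge_and_unload()` on the loaded model. Note that the model is loaded here without a `quantization_config`, so merging by itself does not reduce memory usage.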
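
Conceptually, merging folds the low-rank update into the base weight: `W_merged = W + (alpha / r) * B @ A`. The toy example below checks that the merged matrix gives the same output as the base layer plus adapter. It is a sketch in plain Python with made-up numbers. `matmul`, `matvec`, and `merge_lora` are hypothetical helpers, not `peft` functions.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matvec(M, v):
    """Multiply a matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def merge_lora(W, A, B, alpha, r):
    """Fold the scaled low-rank update B @ A into the base weight W."""
    scaling = alpha / r
    delta = matmul(B, A)  # same shape as W: out_features x in_features
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]  # base weight, 2 x 3
A = [[0.1, 0.2, 0.3]]                    # rank 1 x in_features
B = [[1.0], [-2.0]]                      # out_features x rank 1
alpha, r = 2, 1
x = [1.0, 2.0, 3.0]

# Output with a separate adapter path versus output of the merged weight
adapter_output = [b + (alpha / r) * u
                  for b, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]
merged_output = matvec(merge_lora(W, A, B, alpha, r), x)

print(adapter_output)
print(merged_output)
```

After merging, inference needs only one matrix per layer, so the adapter no longer adds extra computation.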

In [6]:
from peft import AutoPeftModelForCausalLM

merged_model = AutoPeftModelForCausalLM.from_pretrained(
    'Phi-3-mini-4k-yoda',
    low_cpu_mem_usage=True,
    device_map="auto"
).merge_and_unload()

In [8]:
# check how much space the model takes
print(merged_model.get_memory_footprint()/1e6)

7642.159488


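The merged model takes about 7,642 MB (roughly 7.6 GB). This is more than the 4-bit model used for training, because here the weights are loaded without quantization.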
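
A quick back-of-the-envelope check of the reported footprint, assuming the merged weights are stored in a 16-bit type (2 bytes per parameter). That assumption is not confirmed by the notebook output. It is inferred from the fact that the result lands on about 3.8 billion parameters, which is the size the model is usually described as having.

```python
merged_footprint_bytes = 7_642_159_488     # get_memory_footprint() of the merged model
quantized_footprint_bytes = 2_206_341_504  # get_memory_footprint() of the 4-bit base model

bytes_per_param = 2  # assumption: float16 or bfloat16 weights
implied_params = merged_footprint_bytes // bytes_per_param
ratio = merged_footprint_bytes / quantized_footprint_bytes

print(f"implied parameters: {implied_params / 1e9:.2f}B")
print(f"merged model is {ratio:.1f}x larger than the 4-bit base model")
```

To bring the merged model back down in size, it would have to be quantized again when it is loaded.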

Our merged model can also be used with `transformers.pipeline`!

In [7]:
from transformers import pipeline

prompt = """<|user|>
Tell me something about star war!.<|end|>
<|assistant|>
"""

yoda_pipeline = pipeline(task="text-generation", model=merged_model, tokenizer=tokenizer)
print(yoda_pipeline(prompt)[0]["generated_text"])

Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


<|user|>
Tell me something about star war!.<|end|>
<|assistant|>
 About Star War, tell me something, you must.
