# LLM - Fine-Tuning

Derek Lilienthal  
Cristina Stone

# Instruction Fine-tune LLM Tutorial

*Note*

The original Llama model is available under a permissive license
from [Meta](https://llama.meta.com/llama-downloads/). However, I will
be using an open-source reproduction with roughly 3.4 billion parameters
(`open_llama_3b_v2`) by
[openlm-research](https://github.com/openlm-research/open_llama).

## Fine-Tuning OpenLLaMA 3B using LoRA

**Imports**

In [1]:
import random
import subprocess
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import torch
from datasets import Dataset, load_dataset
from peft import (
    LoraConfig,
    PeftModel,
    get_peft_model,
    prepare_model_for_kbit_training,
)
# Import only the NVML functions used by the helper below
from pynvml import (
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
    nvmlInit,
)
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    GenerationConfig,
    TrainingArguments,
)
from trl import SFTTrainer

In [2]:
### Helper functions for printing model size and gpu usage 

def print_gpu_utilization():
    """Prints the GPU memory usage."""
    nvmlInit()
    deviceCount = nvmlDeviceGetCount()
    for i in range(deviceCount):
        handle = nvmlDeviceGetHandleByIndex(i)
        info = nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU memory occupied: {info.used//1024**2} MB.")

def print_summary(result, num_epochs=1):
    """Prints the results of a training run."""
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    # Divide by the number of epochs actually trained (1 in this tutorial)
    print(f"Average Epoch Time: {result.metrics['train_runtime'] / num_epochs:.2f}")
    print_gpu_utilization()

def trainable_parameters(model):
    """Returns a summary string with the trainable and total parameter counts."""
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    return (
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"
    )

def verify_loaded_model_size():
    """Calls nvidia-smi to verify the amount of data loaded into the GPU."""
    cmd = 'nvidia-smi'
    query = '--query-gpu=gpu_name,pci.bus_id,memory.total,memory.used,memory.free'
    returned = subprocess.run([cmd, query, "--format=csv"], stdout=subprocess.PIPE).stdout.decode('utf-8')
    print('################## GPU Memory Usage: ##################')
    print(returned)

GPU used in the tutorial

In [3]:
print(f'Found {torch.cuda.device_count()} GPU(s)')
for i in range(torch.cuda.device_count()):
    print(f'GPU {i+1}:', torch.cuda.get_device_properties(i).name)

Found 1 GPU(s)
GPU 1: NVIDIA A100-PCIE-40GB

Hugging Face model ID

In [4]:
MODEL_NAME = 'openlm-research/open_llama_3b_v2'

## Datasets

The dataset used in this tutorial is the
[SAMSum](https://huggingface.co/datasets/samsum) dataset. It contains
dialogues, each paired with a summary of the conversation. The
fine-tuning task is to train the LLM to generate a summary of a
conversation.

In [5]:
dataset = load_dataset("samsum")  # Loads the data from the Hugging Face Hub using the 'datasets' package
train_dataset = dataset["train"]
test_dataset = dataset["test"]

# Look at the first 5 examples in the training set
pd.DataFrame(train_dataset).head()

Sample dialog

In [6]:
# A random example from the training set
rand_example = random.choice(train_dataset)
print(f'### Dialogue:\n{rand_example["dialogue"]}\n### Summary:\n{rand_example["summary"]}\n')

### Dialogue:
Eva: I need to shop for groceries this week.
Maggie: OK, where do you want to go?
Eva: Well, I don’t want to go to a big supermarket. Will you go there with me?
Maggie: I hate this store, it’s too big. 
Eva: Me too, I never know where stuff I want is. 
Maggie: So maybe we can drive there in the morning ‘cause it gets really busy in the afternoon.
Eva: Yeah, I know, people get out of work then. 
Maggie: :-( and you’ve got to wait in those long lines, people push you... ugh.. :-(
Eva: I hate it. You know what? I like this small store nearby your place. 
Maggie: Yeah, I know it, but the selection there is not so big :-(
Eva: What do you need?
Maggie: I’m looking for Brazilian nuts.
Eva: Guess what, I know they’ve got them there. I just saw them last time I was there.
Maggie: That’s great!
Eva: Yeah, and I think they also have some more exotic stuff there, so it’s the best place we can go.
Maggie: Great! :-) When can we meet?
Eva: when did you want to buy them?
Maggie: I need

To train the LLM on this fine-tuning task, we first need to create the
fine-tuning dataset: the input-output pairs that the LLM will be trained
on. The input is the dialogue and the output is the summary of that
dialogue.

Each data point in the fine-tuning dataset uses the following format:

    ### Dialogue:
    <dialog conversations>
    ### Summary:
    <summary of conversations>

The LLM will be trained to generate the summary given the dialog
conversations.

In [7]:
def combine_dialogue_summary(dialogue, summary):
    """Function to combine the dialogue and summary into a single string."""
    return "### Dialogue:\n"+dialogue+"\n### Summary:\n"+ summary

# Creating the training and testing sets
train_list = [
    combine_dialogue_summary(example['dialogue'], example['summary'])
    for example in train_dataset
]
test_list = [
    combine_dialogue_summary(example['dialogue'], example['summary'])
    for example in test_dataset
]

# Create a new dataset with the combined dialogue and summary
train_dataset = Dataset.from_dict({"text": train_list})
test_dataset = Dataset.from_dict({"text": test_list})
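
As a quick check on the template, the sketch below builds one training string from a made-up dialogue and confirms that the inference prompt (everything up to and including `### Summary:\n`) is a prefix of it. The helper `build_prompt` and the toy dialogue are hypothetical and exist only for illustration; `combine_dialogue_summary` is repeated so the block runs on its own.

```python
def combine_dialogue_summary(dialogue, summary):
    """Training text: the dialogue followed by its reference summary."""
    return "### Dialogue:\n" + dialogue + "\n### Summary:\n" + summary


def build_prompt(dialogue):
    """Hypothetical helper: the same template with the summary left empty."""
    return combine_dialogue_summary(dialogue, "")


# Made-up example, not taken from SAMSum.
toy_dialogue = "Ann: Lunch at noon?\nBob: Sure, see you then."
toy_summary = "Ann and Bob will have lunch at noon."

training_text = combine_dialogue_summary(toy_dialogue, toy_summary)
prompt = build_prompt(toy_dialogue)

# The prompt is exactly the training text with the summary removed.
prompt_is_prefix = training_text.startswith(prompt)
recovered_summary = training_text[len(prompt):]
print(prompt_is_prefix)
print(recovered_summary)
```

This is the same prefix that the `generate` helper later in the tutorial passes to the model, so at inference time the model is asked to continue the text exactly where the summary would begin.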

## Loading Llama

Below we load the Llama model and its tokenizer using the Hugging Face
model name. The cell also contains commented-out options for loading the
model in a quantized format for even more memory efficiency. Note that a
quantized model can be slower to train than the non-quantized model, but
it is a good option if you are limited on memory.
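
To give a feel for the memory trade-off, here is a rough back-of-the-envelope sketch of the space needed for the weights alone in each format. It assumes nominal bytes per parameter and ignores quantization metadata, activations, gradients, optimizer state, and CUDA overhead, so real GPU usage will be higher. The parameter count is the "all params" figure printed later in this tutorial.

```python
# Rough weight-memory estimate. The parameter count is the "all params"
# figure printed later in this tutorial, LoRA adapters included.
NUM_PARAMS = 3_431_804_800

# Nominal bytes per parameter for each storage format (assumption:
# quantization metadata such as scaling factors is ignored).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "nf4": 0.5}


def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


estimates = {
    name: weight_memory_gib(NUM_PARAMS, size)
    for name, size in BYTES_PER_PARAM.items()
}
for name, gib in estimates.items():
    print(f"{name}: {gib:.2f} GiB")
```

Each halving of the bytes per parameter halves the estimate, which is why 4-bit loading is attractive on smaller GPUs.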

In [8]:
### 4 bit quantization
# bnb_config = BitsAndBytesConfig(
#         load_in_4bit=True,
#         bnb_4bit_quant_type="nf4",
#         bnb_4bit_compute_dtype=torch.float16,
#         bnb_4bit_use_double_quant=True,
# )

### 8 bit quantization
# bnb_config = BitsAndBytesConfig(
#         load_in_8bit=True,  # The compute dtype and double quantization options apply to 4-bit loading only
# )

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, 
    torch_dtype=torch.float16,  # Load as FP16
    device_map='auto',
    # quantization_config=bnb_config  ### Uncomment here for 4 or 8 bit quantization
)

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True, legacy=False)
# Create a new token and add it to the tokenizer
tokenizer.add_special_tokens({"pad_token":"<pad>"})
tokenizer.padding_side = 'right'

The Llama architecture is a decoder-only transformer model. The printed
module structure below shows its general architecture.

In [9]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 3200, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=3200, out_features=3200, bias=False)
          (k_proj): Linear(in_features=3200, out_features=3200, bias=False)
          (v_proj): Linear(in_features=3200, out_features=3200, bias=False)
          (o_proj): Linear(in_features=3200, out_features=3200, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=3200, out_features=8640, bias=False)
          (up_proj): Linear(in_features=3200, out_features=8640, bias=False)
          (down_proj): Linear(in_features=8640, out_features=3200, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
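
The shapes in this printout are enough to tally the parameter count by hand. The sketch below does so under two assumptions that the printout does not show directly: each `LlamaRMSNorm` holds one weight vector of the hidden size, and the model ends with an untied output head of the same shape as the embedding.

```python
# Shapes read off the printed architecture above.
VOCAB, HIDDEN, MLP_DIM, NUM_LAYERS = 32000, 3200, 8640, 26

embed = VOCAB * HIDDEN            # embed_tokens
attention = 4 * HIDDEN * HIDDEN   # q_proj, k_proj, v_proj, o_proj (no bias)
mlp = 3 * HIDDEN * MLP_DIM        # gate_proj, up_proj, down_proj (no bias)
norms_per_layer = 2 * HIDDEN      # two RMSNorm weight vectors (assumed size HIDDEN)
per_layer = attention + mlp + norms_per_layer

# Assumption: an untied output head of shape (VOCAB, HIDDEN); the printout
# above stops before showing it.
lm_head = VOCAB * HIDDEN
final_norm = HIDDEN

total = embed + NUM_LAYERS * per_layer + final_norm + lm_head
print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
```

Under these assumptions the tally comes to about 3.43 billion parameters. Adding one extra row to both the embedding and the output head for the `<pad>` token, plus the 5,324,800 LoRA parameters, reproduces the 3,431,804,800 total reported further down.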


## Generating text

In [None]:
def generate(model, instruction):
    """Helper function to generate a summary from a given instruction using the model."""
    prompt = "### Dialogue:\n"+instruction+"\n### Summary:\n"
    inputs = tokenizer(prompt, return_tensors="pt")  # Uses the global tokenizer
    input_ids = inputs["input_ids"].cuda()
    generation_output = model.generate(
            input_ids=input_ids,
            # do_sample is not set, so decoding is greedy and the sampling
            # settings (temperature, top_p, top_k) have no effect
            generation_config=GenerationConfig(temperature=1.0, top_p=1.0, top_k=50, num_beams=1),
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=256  # Generate a max response of 256 tokens
    )
    for seq in generation_output.sequences:
        # Drop special tokens such as <s> and </s> from the decoded text
        output = tokenizer.decode(seq, skip_special_tokens=True)
        print(output.split("### Summary:")[1].strip())

Below are two examples of the model generating text before training. The
model largely restates the dialogue turn by turn instead of condensing
it. This is expected, as the model has not yet been trained to generate
summaries.

In [10]:
test_dialog = """
Derek: I'm going to the store. Do you need anything?
Ashley: Yes, can you get some milk?
Derek: Sure. Anything else?
Ashley: No, that's it. Thanks.
"""
generate(model, test_dialog)

Derek is going to the store to get milk. Ashley asks Derek if he needs anything else. Derek says he will get milk. Ashley says that is all she needs. Derek says he will get milk.

In [11]:
test_dialog = """Derek: I'm going to make a tutorial about large language models. What kind of topics would you like me to cover?
Audience: How about topics on finetunning on a large computing platform.
Derek: Okay! I will do that. What other topics?
Audience: How about you show how to train the model using LoRa.
Derek: Great! I will also show how to load in the finetunned model as well."""
generate(model, test_dialog)

This is a dialogue between Derek and an audience member. Derek is a data scientist and he is going to make a tutorial about large language models. The audience member asks Derek what topics he should cover. Derek responds by asking the audience member what topics they would like to see. The audience member responds by asking Derek how to train the model using LoRa. Derek responds by saying that he will show how to train the model using LoRa.

## LoRA

Here we resize the embeddings to account for the new vocabulary size
(the added `<pad>` token) and then define the LoRA configuration.

*Note* The `target_modules=["q_proj","v_proj", "k_proj", "o_proj"]`
setting matches components shown in the model architecture a few cells
above: the query, key, value, and output projections of each attention
block. By listing these components, we tell LoRA to attach its low-rank
adapter matrices to them. During fine-tuning only the adapter weights
are updated, while the original model weights stay frozen.
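
To make the mechanism concrete, here is a minimal pure-Python sketch of a LoRA-adapted linear layer on toy sizes. It assumes the standard LoRA formulation, `y = W x + (alpha / r) * B (A x)`, with `W` frozen and only `A` and `B` trainable. The names `matvec` and `lora_forward` and the toy numbers are illustrative and are not part of `peft`.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W stays frozen, A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: 3 features, rank 1. The tutorial uses 3200 features, r=8, alpha=32.
W = [[1.0, 0.0, 0.0],
     [0.0, 2.0, 0.0],
     [0.0, 0.0, 3.0]]            # frozen pretrained weight (3 x 3)
A = [[0.5, 0.5, 0.5]]            # down-projection (r x 3)
B_init = [[0.0], [0.0], [0.0]]   # up-projection (3 x r), zero at initialization
x = [1.0, 1.0, 1.0]

# With B = 0 the adapted layer reproduces the frozen layer exactly.
y_init = lora_forward(W, A, B_init, x, alpha=2, r=1)

# After training, B is non-zero and shifts the output.
B_trained = [[1.0], [0.0], [-1.0]]
y_trained = lora_forward(W, A, B_trained, x, alpha=2, r=1)
print(y_init, y_trained)
```

Because `B` starts at zero, attaching the adapter does not change the model's behavior until training begins. Only the small matrices `A` and `B` receive gradient updates.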

In [12]:
# Resize the embeddings
model.resize_token_embeddings(len(tokenizer))
# Configure the pad token in the model
model.config.pad_token_id = tokenizer.pad_token_id
model.config.use_cache = False # Gradient checkpointing is used by default but not compatible with caching

model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
        lora_alpha=32,
        lora_dropout=0.1,
        r=8,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj","v_proj", "k_proj", "o_proj"]
)

model = get_peft_model(model, peft_config)

LoRA adds 5,324,800 trainable parameters to the model, which is only
about 0.16% of the total parameter count. During training we therefore
compute gradient updates for just 0.16% of the parameters, which
significantly reduces computation and memory usage.

In [13]:
trainable_parameters(model)

'trainable params: 5324800 || all params: 3431804800 || trainable%: 0.16'
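
The trainable count can be reproduced by hand from the LoRA configuration and the layer shapes printed earlier. The sketch below assumes each targeted `Linear` layer gains two matrices, `A` of shape `(r, in_features)` and `B` of shape `(out_features, r)`, and nothing else.

```python
# Values taken from the LoraConfig and the printed architecture.
RANK = 8
HIDDEN = 3200          # in_features == out_features for the q/k/v/o projections
NUM_LAYERS = 26
TARGETS_PER_LAYER = 4  # q_proj, k_proj, v_proj, o_proj

# Each adapted Linear layer gains A (RANK x in_features) and B (out_features x RANK).
per_module = RANK * HIDDEN + HIDDEN * RANK
lora_params = per_module * TARGETS_PER_LAYER * NUM_LAYERS

ALL_PARAMS = 3_431_804_800  # "all params" reported by trainable_parameters
trainable_pct = 100 * lora_params / ALL_PARAMS
print(f"LoRA params: {lora_params:,} ({trainable_pct:.2f}% of all params)")
```

The result matches the 5,324,800 trainable parameters and the 0.16% share reported above.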

Specify the hyperparameters used for training.

In [14]:
EPOCHS = 1
BATCH_SIZE = 8
MAX_SEQ_LENGTH = 128
OPTIMIZER = "paged_adamw_8bit"
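
`MAX_SEQ_LENGTH` caps the number of tokens per training example. The sketch below illustrates one consequence, assuming longer examples are truncated on the right and using whitespace splitting as a crude stand-in for the real tokenizer: because the summary sits at the end of each example, a long dialogue can push part or all of it past the limit. The helpers `truncate_words` and `make_example` are hypothetical.

```python
def truncate_words(text, max_len):
    """Keep the first `max_len` whitespace-separated tokens (right truncation)."""
    return " ".join(text.split()[:max_len])


def make_example(num_turns):
    """Build a made-up training example with `num_turns` dialogue lines."""
    dialogue = "\n".join(
        f"Speaker{i % 2}: this is turn number {i}" for i in range(num_turns)
    )
    return "### Dialogue:\n" + dialogue + "\n### Summary:\n" + "Two speakers exchange turns."


MAX_LEN = 128  # same value as MAX_SEQ_LENGTH, but counted in words here

short_kept = truncate_words(make_example(5), MAX_LEN)
long_kept = truncate_words(make_example(40), MAX_LEN)

# The short example keeps its summary; the long one loses it entirely.
print("### Summary:" in short_kept, "### Summary:" in long_kept)
```

If many dialogues exceed the limit under the real tokenizer, raising `MAX_SEQ_LENGTH` is worth considering, at the cost of more memory per batch.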

The model itself occupies ~18 GB of memory on a single A100.

In [15]:
verify_loaded_model_size()

################## GPU Memory Usage: ##################
name, pci.bus_id, memory.total [MiB], memory.used [MiB], memory.free [MiB]
NVIDIA A100-PCIE-40GB, 00000000:A4:00.0, 40960 MiB, 17701 MiB, 22636 MiB

## Training Llama

We use the `SFTTrainer` from Hugging Face's TRL library to train the
model instead of writing our own training loop, because `SFTTrainer`
already handles the details of supervised fine-tuning for large language
models. We use it to train the model on the fine-tuning dataset we
created earlier.

In [16]:
training_arguments = TrainingArguments(
        output_dir="./_model_checkpoints",
        evaluation_strategy="steps",
        do_eval=True,  # Tests loss on eval set
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=1,
        per_device_eval_batch_size=BATCH_SIZE,
        log_level="error",  # Only log errors
        optim=OPTIMIZER, 
        save_strategy="epoch",
        save_steps=500,  # Ignored while save_strategy="epoch"
        logging_steps=100, 
        learning_rate=1e-4,
        eval_steps=500, 
        fp16=True,  # Uses mixed 16-bit precision training
        max_grad_norm=0.3, 
        num_train_epochs=EPOCHS, 
        warmup_ratio=0.03,  # No effect with the "constant" scheduler ("constant_with_warmup" would apply it)
        lr_scheduler_type="constant",
)

trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        dataset_text_field="text",
        max_seq_length=MAX_SEQ_LENGTH,
        tokenizer=tokenizer,
        args=training_arguments,
)

results = trainer.train()
print_summary(results)

Time: 413.75
Samples/second: 35.61
Average Epoch Time: 6.52
GPU memory occupied: 25373 MB.
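
The reported metrics allow a quick sanity check on how much data the run processed. The sketch below copies the figures from the summary above and assumes a single GPU with `gradient_accumulation_steps=1`, so that each optimizer step consumes exactly one batch.

```python
import math

# Figures copied from the training summary above and the hyperparameter cell.
TRAIN_RUNTIME_S = 413.75
SAMPLES_PER_SECOND = 35.61
BATCH_SIZE = 8
EVAL_STEPS = 500

# Throughput times runtime approximates the number of examples processed.
samples_seen = round(TRAIN_RUNTIME_S * SAMPLES_PER_SECOND)

# Assumption: one GPU and gradient_accumulation_steps=1, so each optimizer
# step consumes exactly BATCH_SIZE examples.
optimizer_steps = math.ceil(samples_seen / BATCH_SIZE)

# Steps at which evaluation would run with eval_steps=500.
eval_points = list(range(EVAL_STEPS, optimizer_steps + 1, EVAL_STEPS))
print(samples_seen, optimizer_steps, eval_points)
```

The estimate of roughly 14,700 examples is consistent with a single pass over the SAMSum training split, which matches `EPOCHS = 1`.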

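Saving the model. Calling `save_pretrained` on the PEFT-wrapped model
writes the LoRA adapter to `save_dir`, which is why the base model is
loaded again in the next section before the adapter is applied.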

In [17]:
save_dir = './llama_3b_finetune'
trainer.model.save_pretrained(save_dir)

## Load and evaluate

In [30]:
finetunned_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, 
    torch_dtype=torch.float16,  # Load as FP16
    device_map='auto',
)
finetunned_model.resize_token_embeddings(len(tokenizer))
finetunned_model = PeftModel.from_pretrained(finetunned_model, save_dir)

We can now see that the fine-tuned model gives much more concise
summaries of the dialogues than the base model did.

In [31]:
test_dialog = """
Derek: I'm going to the store. Do you need anything?
Ashley: Yes, can you get some milk?
Derek: Sure. Anything else?
Ashley: No, that's it. Thanks.
"""
generate(finetunned_model, test_dialog)

Derek is going to the store. He will buy milk for Ashley. 

In [32]:
test_dialog = """Derek: I'm going to make a tutorial about large language models. What kind of topics would you like me to cover?
Audience: How about topics on finetunning on a large computing platform.
Derek: Okay! I will do that. What other topics?
Audience: How about you show how to train the model using LoRa.
Derek: Great! I will also show how to load in the finetunned model as well."""
generate(finetunned_model, test_dialog)

Derek will make a tutorial about large language models. He will also show how to train the model using LoRa. 