# Fine-tuning an LLM on a simple task using a single GPU, with fast inference

_Authored by: [Mohammadreza Esmaeiliyan](https://github.com/MrzEsma)_

In this notebook, I fine-tune an LLM in the simplest way I can, without unnecessary complexity, using a parameter count that fits on a consumer-level GPU, and then run inference with vLLM, one of the fastest open-source inference engines. I have tried to explain the concepts and techniques used as far as possible. Because there are many of them, I did two things. First, I gave each explanation an importance level so that the more important concepts and techniques can be studied first. Second, since others have already written these explanations well and in more detail, I link to their blog posts instead of repeating them. As the Iranian saying goes, "In a house of wisdom, a few words suffice" :)
Let's get started.


In [ ]:
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

The `peft` library (parameter-efficient fine-tuning) was created to fine-tune LLMs more efficiently. Unfreezing and training the upper layers of the network in the traditional way, as with other neural networks, would require a lot of computation and a significant amount of VRAM. The library implements methods from recent papers that make fine-tuning LLMs much cheaper. Read more about peft here: [Hugging Face PEFT](https://huggingface.co/blog/peft). Importance level: 1

With the emergence of LLMs came a task known as alignment, in which we try to make an LLM produce outputs that match our preferences. The first stage is simple supervised fine-tuning (SFT). In the second stage, a mechanism for collecting feedback from users is set up, and other techniques use that feedback to align the LLM more closely with our preferences. The `trl` library was created for this task, and this notebook uses it for the first stage, SFT. For further reading on alignment, see [OpenAI Blog on Instruction Following](https://openai.com/research/instruction-following#fn1) and [Hugging Face Blog on RLHF](https://huggingface.co/blog/rlhf). Importance level: 1


## Set parameters

In [ ]:
# General parameters
model_name = "NousResearch/Llama-2-7b-chat-hf"  # The model that you want to train from the Hugging Face hub
dataset_name = "yahma/alpaca-cleaned"  # The instruction dataset to use
new_model = "llama-2-7b-alpaca"  # The name for the fine-tuned LoRA adapter

# LoRA parameters
lora_r = 64
lora_alpha = lora_r * 2
lora_dropout = 0.1
target_modules = ["q_proj", "v_proj", "k_proj"]

# QLoRA parameters
load_in_4bit = True
bnb_4bit_compute_dtype = "float16"
bnb_4bit_quant_type = "nf4"
bnb_4bit_use_double_quant = False

# TrainingArguments parameters
num_train_epochs = 1
fp16 = False
bf16 = False
per_device_train_batch_size = 4
per_device_eval_batch_size = 4
gradient_accumulation_steps = 1
gradient_checkpointing = True
learning_rate = 0.00015
weight_decay = 0.01
optim = "paged_adamw_32bit"
lr_scheduler_type = "cosine"
max_steps = -1
warmup_ratio = 0.03
group_by_length = True
save_steps = 0
logging_steps = 25

# SFT parameters
max_seq_length = None
packing = False
device_map = {"": 0}

# Dataset parameters
use_special_template = True
response_template = ' ### Answer:'
instruction_prompt_template = "### Human:"
use_llama_like_model = True

## Train Code

In [ ]:
# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")
percent_of_train_dataset = 0.95
# Keep 'input' as well: the formatting functions below read example['input']
other_columns = [i for i in dataset.column_names if i not in ['instruction', 'input', 'output', 'text']]
dataset = dataset.remove_columns(other_columns)
split_dataset = dataset.train_test_split(train_size=int(dataset.num_rows * percent_of_train_dataset), seed=19, shuffle=False)
train_dataset = split_dataset["train"]
eval_dataset = split_dataset["test"]
print(f"Size of the train set: {len(train_dataset)}. Size of the validation set: {len(eval_dataset)}")

LoRA and QLoRA are two of the most important PEFT techniques. In brief, LoRA freezes the pretrained weights and adds a pair of small low-rank matrices to selected layers, and only these matrices are trained. The original model weights are therefore left unchanged, training is much cheaper, and the resulting adapter weights are lightweight. Several adapters can be trained for different tasks and used with a single LLM loaded in memory. QLoRA additionally quantizes the frozen base weights to 4 bits, further reducing memory consumption. Read about LoRA [here at Lightning AI](https://lightning.ai/pages/community/tutorial/lora-llm/) and about QLoRA [here at Hugging Face](https://huggingface.co/blog/4bit-transformers-bitsandbytes). For other efficient training methods, see [Hugging Face Docs on Performance Training](https://huggingface.co/docs/transformers/perf_train_gpu_one) and [SFT Trainer Enhancement](https://huggingface.co/docs/trl/main/en/sft_trainer#enhance-models-performances-using-neftune). Importance level: 2
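
To make the low-rank idea concrete, here is a minimal pure-Python sketch of a LoRA-style layer. It is an illustration only, not how `peft` implements it: the helper names (`matvec`, `lora_forward`, `lora_param_count`) are hypothetical, and the 4096 x 4096 projection size is an assumed example shape rather than a value taken from this notebook. The frozen weight `W` is never modified. Only the two small matrices `A` and `B` would be trained, and their product is scaled by `lora_alpha / r`.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x).

    W: frozen weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    base = matvec(W, x)               # output of the frozen pretrained layer
    delta = matvec(B, matvec(A, x))   # low-rank update
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


def lora_param_count(d_in, d_out, r):
    """Number of trainable parameters added by one LoRA pair (A and B)."""
    return r * (d_in + d_out)


# Tiny example with rank r = 1
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 1.0]]
x = [1.0, 1.0]

# If B is all zeros, the layer behaves exactly like the frozen layer.
print(lora_forward(W, A, [[0.0], [0.0]], x, alpha=2, r=1))  # [3.0, 7.0]
print(lora_forward(W, A, [[1.0], [2.0]], x, alpha=2, r=1))  # [7.0, 15.0]

# Assumed example shape: a 4096 x 4096 projection with r = 64
full = 4096 * 4096
lora = lora_param_count(4096, 4096, 64)
print(full, lora, lora / full)  # 16777216 524288 0.03125
```

With these assumed shapes, the adapter trains about 3% as many parameters as the full matrix, which is why the saved adapter is so much smaller than the base model.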


In [ ]:
# Load QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=load_in_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=bnb_4bit_use_double_quant,
)
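
As a rough back-of-the-envelope check on why 4-bit loading matters on a single GPU, the sketch below estimates the memory needed just to hold the weights at different precisions. The helper `weight_memory_gb` is hypothetical, and 7e9 is an assumed round figure for a "7B" model. The estimate ignores quantization constants, activations, gradients, and optimizer state, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # assumption: round parameter count for a 7B model

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(n_params, bits):.1f} GB")
```

Under these assumptions the weights take roughly 14 GB in 16-bit precision and roughly 3.5 GB in 4-bit precision.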

In [ ]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    target_modules=target_modules,  # apply LoRA to the projections chosen above
    bias="none",
    task_type="CAUSAL_LM",
)

In [ ]:
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False

You can read about best practices for fine-tuning LLMs in [Sebastian Raschka's Magazine](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms?r=1h0eu9&utm_campaign=post&utm_medium=web). Importance level: 1


In [ ]:
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=new_model,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    gradient_checkpointing=gradient_checkpointing,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type
)
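
The `lr_scheduler_type = "cosine"` and `warmup_ratio = 0.03` settings combine a short linear warmup with a cosine decay. The sketch below reproduces that shape in plain Python so you can see what the learning rate does over a run. The function name `cosine_lr_with_warmup` and the total of 1000 steps are assumptions for illustration, and rounding the warmup length up with `math.ceil` is also an assumption about how the ratio is turned into a step count.

```python
import math


def cosine_lr_with_warmup(step, total_steps, base_lr=0.00015, warmup_ratio=0.03):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, from 0 to 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical run length
for step in (0, 10, 30, 250, 500, 750, 1000):
    print(step, f"{cosine_lr_with_warmup(step, total_steps):.2e}")
```

The rate starts at zero, reaches `learning_rate` at the end of warmup, and then falls smoothly back toward zero by the final step.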

In [ ]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # Fix weird overflow issue with fp16 training
if not tokenizer.chat_template:
    tokenizer.chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}"

In [ ]:
def special_formatting_prompts(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        text = f"{instruction_prompt_template}{example['instruction'][i]}{example['input'][i]}\n{response_template} {example['output'][i]}"
        output_texts.append(text)
    return output_texts


def normal_formatting_prompts(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        # Message contents must be plain strings (wrapping them in {...} would create a set)
        chat_temp = [{"role": "system", "content": example['instruction'][i]},
                     {"role": "user", "content": example['input'][i]},
                     {"role": "assistant", "content": example['output'][i]}]
        text = tokenizer.apply_chat_template(chat_temp, tokenize=False)
        output_texts.append(text)
    return output_texts


Regarding the chat template: so that the model can follow the structure of a conversation during training, a set of reserved phrases is used to separate the user's messages from the model's responses. This lets the model tell exactly where each message comes from and gives it a sense of the conversational structure. Adhering to a chat template typically helps increase accuracy on the intended task. However, when there is a distribution shift between the fine-tuning dataset and the data the model was trained on, a special template of your own can be more helpful. For further reading, visit [Hugging Face Blog on Chat Templates](https://huggingface.co/blog/chat-templates). Importance level: 3
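
To see what such reserved phrases look like, the sketch below renders a conversation in the same `<|im_start|>` / `<|im_end|>` layout as the fallback template set on the tokenizer earlier. It is a plain-Python stand-in for illustration, not the Jinja rendering that `tokenizer.apply_chat_template` performs. The function name `render_chatml` is hypothetical, and the `add_generation_prompt` branch is an assumption that follows the usual ChatML convention, since the fallback template itself does not define one.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts into one training string."""
    text = ""
    for message in messages:
        text += "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>" + "\n"
    if add_generation_prompt:
        # Assumption: open an assistant turn so the model continues from here
        text += "<|im_start|>assistant\n"
    return text


conversation = [
    {"role": "system", "content": "Give three tips for staying healthy."},
    {"role": "user", "content": ""},
    {"role": "assistant", "content": "Eat well, sleep enough, and exercise."},
]
print(render_chatml(conversation))
```

Each turn is wrapped in the same markers, so the boundaries between roles are unambiguous to the model.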


In [ ]:
if use_special_template:
    formatting_func = special_formatting_prompts
    if use_llama_like_model:
        # Llama-style tokenizers encode the template differently depending on the
        # preceding context, so the leading tokens are dropped before matching
        response_template_ids = tokenizer.encode(response_template, add_special_tokens=False)[2:]
        collator = DataCollatorForCompletionOnlyLM(response_template=response_template_ids, tokenizer=tokenizer)
    else:
        collator = DataCollatorForCompletionOnlyLM(response_template=response_template, tokenizer=tokenizer)
else:
    formatting_func = normal_formatting_prompts
    collator = None  # fall back to the trainer's default collator
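
The completion-only collator restricts the loss to the tokens that come after the response template, so the model is not trained to reproduce the prompt. The sketch below shows the idea on plain lists of token ids. It is a simplified illustration under assumptions, not the `trl` implementation: the helper names `find_sublist` and `mask_prompt_tokens` and the token ids are made up, and it handles only a single response per example.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def find_sublist(seq, pattern):
    """Return the index where `pattern` first occurs in `seq`, or -1."""
    for start in range(len(seq) - len(pattern) + 1):
        if seq[start:start + len(pattern)] == pattern:
            return start
    return -1


def mask_prompt_tokens(input_ids, response_template_ids):
    """Build labels that keep only the tokens after the response template."""
    start = find_sublist(input_ids, response_template_ids)
    if start == -1:
        # Template not found: exclude the whole example from the loss
        return [IGNORE_INDEX] * len(input_ids)
    response_start = start + len(response_template_ids)
    return [IGNORE_INDEX] * response_start + input_ids[response_start:]


# Made-up ids: [prompt tokens] + [template tokens 100, 101] + [answer tokens]
input_ids = [5, 6, 7, 100, 101, 8, 9]
print(mask_prompt_tokens(input_ids, [100, 101]))
# [-100, -100, -100, -100, -100, 8, 9]
```

This also shows why the template ids must match the tokenized text exactly: if they do not, no position is found and the example contributes nothing to training.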

In [ ]:
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    formatting_func=formatting_func,
    data_collator=collator,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing
)

In [ ]:
# Train model
trainer.train()

# Save the fine-tuned LoRA adapter
trainer.model.save_pretrained(new_model)

## Inference Code

In [ ]:
import torch
import gc


def clear_hardwares():
    torch.clear_autocast_cache()
    torch.cuda.ipc_collect()
    torch.cuda.empty_cache()
    gc.collect()

In [ ]:
clear_hardwares()
clear_hardwares()

In [ ]:
def generate(model, prompt: str, kwargs):
    tokenized_prompt = tokenizer(prompt, return_tensors='pt').to(model.device)
    prompt_length = len(tokenized_prompt.get('input_ids')[0])
    with torch.cuda.amp.autocast():
        output_tokens = model.generate(**tokenized_prompt, **kwargs) if kwargs else model.generate(**tokenized_prompt)
        output = tokenizer.decode(output_tokens[0][prompt_length:], skip_special_tokens=True)
    return output

In [ ]:
# Load the original base model, then attach the fine-tuned LoRA adapter saved in `new_model`
base_model = AutoModelForCausalLM.from_pretrained(model_name, return_dict=True, device_map='auto', torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = PeftModel.from_pretrained(base_model, new_model)
del base_model

In [ ]:
sample = eval_dataset[0]
if use_special_template:
    prompt = f"{instruction_prompt_template}{sample['instruction']}{sample['input']}\n{response_template}"
else:
    chat_temp = [{"role": "system", "content": sample['instruction']},
                 {"role": "user", "content": sample['input']}]
    prompt = tokenizer.apply_chat_template(chat_temp, tokenize=False, add_generation_prompt=True)

In [ ]:
gen_kwargs = {"max_new_tokens": 1024}
generated_texts = generate(model=model, prompt=prompt, kwargs=gen_kwargs)
print(generated_texts)

## Merge to base model

In [ ]:
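clear_hardwares()
# Merge the LoRA weights into the base model, then free the adapter-wrapped model
merged_model = model.merge_and_unload()
del model
clear_hardwares()
new_model_name = 'your_hf_account/your_desired_name'
merged_model.push_to_hub(new_model_name)
tokenizer.push_to_hub(new_model_name)  # the inference engine loads the tokenizer from the same repository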

## Fast Inference with [vLLM](https://github.com/vllm-project/vllm)


The `vllm` library is one of the fastest inference engines for LLMs. For a comparative overview of available options, you can use this blog: [7 Frameworks for Serving LLMs](https://medium.com/@gsuresh957/7-frameworks-for-serving-llms-5044b533ee88). Importance level: 3


In [ ]:
from vllm import LLM, SamplingParams

gen_kwargs = {"max_tokens": 1024}

llm = LLM(model=new_model_name, gpu_memory_utilization=0.9, trust_remote_code=True)
sampling_params = SamplingParams(**gen_kwargs)
# Pass the SamplingParams object, not the raw dict
outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)