<a href="https://colab.research.google.com/github/Bhabuk10/FineTuning_LLMs/blob/main/Finetuning_Llama_3_1_8b_using_UnSloth.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Unsloth is an open-source Python toolkit for optimizing the fine-tuning and continued pre-training of large language models (LLMs). By utilizing this library, we can significantly boost training speed (1.5-2x faster) and drastically reduce GPU memory consumption by 50-60%.

##In this walkthrough, we will fine-tune a model with Unsloth and see how it accelerates the process.



#Dependencies
First, we’ll install the necessary libraries, including Unsloth and `Xformers`, a library that implements memory-efficient attention mechanisms.

The installation instructions below are adapted from Unsloth's own [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks), which include examples for fine-tuning popular models.

In [None]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# We have to check which Torch version for Xformers (2.3 -> 0.0.27)
from torch import __version__; from packaging.version import Version as V
xformers = "xformers==0.0.27" if V(__version__) < V("2.4.0") else "xformers"
!pip install --no-deps {xformers} trl peft accelerate bitsandbytes triton


This will install Unsloth and other key libraries, enabling fast fine-tuning and efficient memory management.


#Loading the Model

Now, let's load a pre-trained model using Unsloth's `FastLanguageModel`. We’ll be working with a pre-quantized version of Meta’s Llama 3.1 8B Instruct model: `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit`.

Since this is a 4-bit pre-quantized model, it’s optimized for faster downloads and lower GPU memory usage.
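
To make the memory saving concrete, here is a rough back-of-the-envelope sketch. It assumes a round figure of 8 billion parameters and counts the weights only; `estimate_weight_memory_gb` is a hypothetical helper, not part of Unsloth. Real checkpoints are somewhat larger than this estimate, because some layers stay unquantized and the quantization constants take space too.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in gigabytes (10^9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9

# Assumed parameter count for an "8B" model; the exact figure differs slightly.
NUM_PARAMS = 8.0e9

for bits in (16, 8, 4):
    size_gb = estimate_weight_memory_gb(NUM_PARAMS, bits)
    print(f"{bits:>2}-bit weights: ~{size_gb:.1f} GB")
```

By this estimate, 16-bit weights alone (about 16 GB) would not fit on a GPU with under 15 GB of memory, while the 4-bit version leaves room for activations and adapter gradients.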


In [None]:
from unsloth import FastLanguageModel
import torch

# Load the 4-bit quantized model using Unsloth's faster API
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=1024,  # Adjust this to manage memory usage
    dtype=None,  # Auto-detect based on GPU hardware
    # load_in_4bit=True, ## optional because we're using a pre-quantized model
)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.9.post1: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

🔧 Unsloth Tip:   When loading a model, if it's already pre-quantized to 4-bit, there's no need to manually set `load_in_4bit=True`.

Unsloth also provides RoPE scaling, allowing us to adjust the sequence length flexibly to fit within GPU memory constraints.


#Dataset Preparation
For training, we’ll use a synthetic [dataset](https://huggingface.co/datasets/ai-maker-space/acronyms_and_initialisms_translated) created by **AI Maker Space**. It pairs sentences written with acronyms and initialisms with their expanded English translations, which makes it a good example for teaching a model to convert between the two styles.




We’ll load the dataset from Hugging Face's repository and examine a few data points:

In [None]:
from datasets import load_dataset

# Load a synthetic dataset of acronyms and their translations
dataset = load_dataset("ai-maker-space/acronyms_and_initialisms_translated", split="train")

README.md:   0%|          | 0.00/403 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/173k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1664 [00:00<?, ? examples/s]

In [None]:
# Check the size and some sample data

print(f"Dataset size: {len(dataset)}")
print(dataset[1]["acronym_sentence"])
print(dataset[1]["english_translation"])

Dataset size: 1664
Yo, ? about the meetup deets. Can you fill me in?
Hey, I have a question about the details of the meetup. Can you provide information?



#Prompt Design for Fine-tuning
To fine-tune the model, we need to create a custom prompt format. This template will guide the model by providing example input-output pairs, helping it learn how to generate the desired output.

The prompt should include both the acronym sentence and its corresponding expanded translation, wrapped in an instruction to clarify the task for the model.

In [None]:
def create_prompt_with_template(example, return_response=True):
  prompt_template = "<|begin_of_text|>"
  prompt_template += "<|start_header_id|>system<|end_header_id|>\n\n"
  # Each turn must end with the exact token "<|eot_id|>" (note the closing "|").
  prompt_template += "You are provided an English sentence, and are expected to translate it into a 'text speak' sentence.<|eot_id|>"
  prompt_template += "<|start_header_id|>user<|end_header_id|>\n\n"
  prompt_template += f"Sentence: {example['english_translation']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
  # Append the target only for training; at inference the model fills it in.
  if return_response:
    prompt_template += f"\n{example['acronym_sentence']}<|end_of_text|>"
  return {"text": prompt_template}

In [None]:
# example of the formatted prompt template!

create_prompt_with_template(dataset[1])["text"]

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are provided an English sentence, and are expected to translate it into a 'text speak' sentence.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nSentence: Hey, I have a question about the details of the meetup. Can you provide information?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\nYo, ? about the meetup deets. Can you fill me in?<|end_of_text|>"

In [None]:
# Mapping it across our dataset.

dataset = dataset.map(create_prompt_with_template)

Map:   0%|          | 0/1664 [00:00<?, ? examples/s]
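
Before training on the mapped dataset, it can be worth checking that every special token in the prompts is spelled exactly. A token missing its closing `|` is tokenized as ordinary text rather than as a control token. The following is a minimal sketch of such a check, assuming only the five Llama 3 special tokens used in this template. `find_malformed_special_tokens` is a hypothetical helper, not part of Unsloth or Transformers.

```python
import re

# Special tokens assumed to be the only ones used by the prompt template.
KNOWN_SPECIAL_TOKENS = {
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|eot_id|>",
}

def find_malformed_special_tokens(prompt: str) -> list[str]:
    """Return every '<|name>' or '<|name|>' fragment that is not a known token."""
    # Match "<|", a run of word characters, an optional "|", then ">".
    candidates = re.findall(r"<\|\w+\|?>", prompt)
    return [tok for tok in candidates if tok not in KNOWN_SPECIAL_TOKENS]

# Hypothetical prompts for illustration: one well-formed, one with a typo.
good = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|>"
bad = "<|begin_of_text|>Hi<|eot_id>"

print(find_malformed_special_tokens(good))
print(find_malformed_special_tokens(bad))
```

Running the check over `dataset["text"]` after the `map` call would flag any row whose prompt contains an unexpected token.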

#Creating a Trainable LoRA PEFT Model
Unsloth is fully compatible with Parameter-Efficient Fine-Tuning (PEFT), especially with LoRA (Low-Rank Adaptation) adapters. With Unsloth, you can easily integrate LoRA into your model for efficient fine-tuning.

By using Unsloth's `get_peft_model` method, we can apply LoRA adapters to specific layers of our model.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=[
        "q_proj",  # Query projection layers
        "k_proj",  # Key projection layers
        "v_proj",  # Value projection layers
        "o_proj",  # Output projection layers
        "gate_proj",  # Gating mechanism layers
        "up_proj",  # MLP up projection layers
        "down_proj",  # MLP down projection layers
    ],
    lora_alpha=32,  # Scaling factor for LoRA
    lora_dropout=0,  # No dropout for LoRA
    bias="none",
    use_gradient_checkpointing="unsloth",  # Optimized for memory with gradient checkpointing
    random_state=40
)


Unsloth 2024.9.post1 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
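
To see how small the adapters are compared with the 8B base model, the number of trainable LoRA parameters can be worked out by hand. The sketch below assumes the layer dimensions from the published Llama 3.1 8B configuration: hidden size 4096, MLP width 14336, 8 key/value heads of dimension 128, and 32 decoder layers. These values do not come from this notebook, and `lora_params` is a hypothetical helper. Each adapted linear layer gains two small matrices, `A` (rank × in) and `B` (out × rank).

```python
# Assumed Llama 3.1 8B dimensions (taken from the published model config).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size (MLP width)
KV_DIM = 1024          # 8 key/value heads x 128 head_dim (grouped-query attention)
NUM_LAYERS = 32
RANK = 16              # the r value passed to get_peft_model

# (in_features, out_features) of each adapted linear layer in one decoder block.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 41,943,040
```

The result matches the trainable-parameter count that Unsloth reports when training starts, roughly half a percent of the base model's weights.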


#Model Training Setup
Now that we’ve prepared our model, it’s time to set it up for training. Unsloth works seamlessly with Hugging Face's TRL (Transformer Reinforcement Learning) library, allowing us to use familiar components like `SFTTrainer` for supervised fine-tuning.

We’ll first define the training arguments, which include batch sizes, learning rate, and other important parameters. One dynamic aspect Unsloth introduces is automatic detection of hardware support for `bfloat16` (bf16) precision, which can further optimize performance if available.

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # Accumulate gradients over 4 batches (effective batch size 2 x 4 = 8)
    warmup_steps=5,
    num_train_epochs=2,  # Train for 2 epochs
    learning_rate=2e-4,
    fp16=not is_bfloat16_supported(),  # Use FP16 if bf16 is not supported
    bf16=is_bfloat16_supported(),  # Use bf16 if supported by hardware
    logging_steps=1,  # Log training progress frequently
    optim="adamw_8bit",
    weight_decay=0.01,  # Apply weight decay for regularization
    lr_scheduler_type="linear",
    seed=40,  # Seed for reproducibility
    output_dir="llama3_1_8b_instruct_ft"  # Output directory for saving the model
)


Once the training arguments are ready, we can initialize our `SFTTrainer` from Hugging Face. This allows us to seamlessly fine-tune the model on our acronym dataset.

In [None]:
trainer = SFTTrainer(
    model=model,  # LoRA-adapted model
    train_dataset=dataset,
    dataset_text_field="text",  # Specify the text field in the dataset
    tokenizer=tokenizer,
    args=training_args,
    max_seq_length=1024,  # Maximum sequence length for inputs
    dataset_num_proc=2,  # Parallel processing of the dataset
    packing=True  # Pack several short examples into each max_seq_length sequence
)
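
Packing explains why the trainer reports far fewer training examples than the dataset has rows: many short prompts are concatenated into each fixed-length sequence. The following is a simplified sketch of the idea, not TRL's actual implementation. `pack_sequences` and the toy token IDs are hypothetical.

```python
from typing import Iterable

def pack_sequences(
    tokenized: Iterable[list[int]], block_size: int, eos_id: int
) -> list[list[int]]:
    """Concatenate token lists (each followed by EOS) and cut into full blocks."""
    stream: list[int] = []
    for tokens in tokenized:
        stream.extend(tokens)
        stream.append(eos_id)
    # Keep only complete blocks; the trailing remainder is dropped in this sketch.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Three toy "examples" of different lengths, packed into blocks of 4 tokens.
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_sequences(examples, block_size=4, eos_id=0)
print(packed)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

With short inputs and `max_seq_length=1024`, this avoids spending most of each sequence on padding.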



Now, let’s train the model by calling the `.train()` method.

In [None]:
training_stats = trainer.train()


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 123 | Num Epochs = 2
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 30
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,2.4844
2,2.5025
3,2.289
4,1.9902
5,1.8222
6,1.6322
7,1.498
8,1.3725
9,1.2627
10,1.3024


At this point, the model will begin fine-tuning using the LoRA adapters, and thanks to Unsloth’s optimizations, the process will be faster and more memory-efficient.
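
The totals printed in the training banner can be reproduced by hand. This is a minimal sketch, assuming that a trailing partial accumulation window is not counted as a full optimizer step. That rounding is an assumption about the trainer's behavior, chosen because it reproduces the numbers shown, and `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(
    num_examples: int,
    per_device_batch: int,
    grad_accum: int,
    epochs: int,
    num_gpus: int = 1,
) -> int:
    """Estimate the number of optimizer updates over the whole run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    # Assumption: only complete accumulation windows count as steps.
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs

effective_batch = 2 * 4 * 1  # per-device batch x accumulation steps x GPUs
steps = total_optimizer_steps(
    num_examples=123, per_device_batch=2, grad_accum=4, epochs=2
)
print(effective_batch, steps)  # 8 30
```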

#Testing the Fine-tuned Model
After training, we can use Unsloth's inference mode to test the model. Let’s see how well the fine-tuned model translates an English sentence into text speak.

In [None]:
FastLanguageModel.for_inference(model)

# Create a prompt using the dataset
prompt = create_prompt_with_template(dataset[1], return_response=False)["text"]

inputs = tokenizer(
    [prompt],
    return_tensors="pt"  # Convert input to PyTorch tensors
).to("cuda")  # Move input to GPU

# Generate the model’s output
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,  # Limit the number of new tokens generated
    use_cache=True,  # Enable caching for faster generation
)

# Decode and display the output
print(tokenizer.batch_decode(outputs)[0])  # Only one prompt was passed, so the batch has a single entry


<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are provided an English sentence, and are expected to translate it into a 'text speak' sentence.<|eot_id><|start_header_id|>user<|end_header_id|>

Sentence: Hey, I have a question about the details of the meetup. Can you provide information?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Yo, got a Q about the meetup details. Can you hit me with the 411?<|end_of_text|>
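
The decoded text includes the full prompt and the special tokens. When only the model's answer is needed, it can be cut out with plain string handling. The following is a minimal sketch that assumes the prompt format used in this notebook. `extract_assistant_reply` and the sample string are hypothetical.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_TOKENS = ("<|end_of_text|>", "<|eot_id|>")

def extract_assistant_reply(decoded: str) -> str:
    """Return the text after the last assistant header, without end tokens."""
    reply = decoded.rsplit(ASSISTANT_HEADER, 1)[-1]
    for token in END_TOKENS:
        reply = reply.replace(token, "")
    return reply.strip()

# Hypothetical decoded output, shaped like the generations in this notebook.
sample = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Sentence: See you later.<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\nCU l8r<|end_of_text|>"
)
print(extract_assistant_reply(sample))  # CU l8r
```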


You can also test the model on completely new input that it hasn’t been trained on:

In [None]:
FastLanguageModel.for_inference(model)

example = {

    "english_translation":  "Each friend represents a world in us, a world possibly not born until they arrive, and it is only by this meeting that a new world is born.",
    "acronym_sentence": ""  # Provide an empty input for the model to generate
}

prompt = create_prompt_with_template(example, return_response=False)["text"]

inputs = tokenizer(
    [prompt],
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    use_cache=True
)

print(tokenizer.batch_decode(outputs)[0])


<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are provided an English sentence, and are expected to translate it into a 'text speak' sentence.<|eot_id><|start_header_id|>user<|end_header_id|>

Sentence: Each friend represents a world in us, a world possibly not born until they arrive, and it is only by this meeting that a new world is born.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Each FAM is a new world within us, a world that might not exist until they show up, and it’s only through that meeting that a new world is born.<|end_of_text|>



### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model.

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving
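
To load the saved adapters back for inference, the adapter directory can be passed to `FastLanguageModel.from_pretrained` in place of the base model name, following the pattern used in Unsloth's own notebooks. The sketch below wraps this in a hypothetical helper, `load_finetuned`. The `max_seq_length` default is assumed to match the value used during training, and `load_in_4bit=True` is an assumption that the 4-bit base model should be used again.

```python
def load_finetuned(adapter_dir: str = "lora_model", max_seq_length: int = 1024):
    """Reload the base model plus the saved LoRA adapters for inference."""
    # Imported lazily so the helper can be defined without a GPU runtime.
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=adapter_dir,         # directory written by save_pretrained
        max_seq_length=max_seq_length,  # assumed to match the training value
        dtype=None,                     # auto-detect, as when first loading
        load_in_4bit=True,              # assumption: reuse the 4-bit base model
    )
    FastLanguageModel.for_inference(model)  # switch to inference mode
    return model, tokenizer
```

Calling `load_finetuned()` in a fresh session should return a model and tokenizer that can be used with the same generation code shown earlier.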

### Saving to float16 for vLLM

Unsloth also supports saving merged weights directly. Select `merged_16bit` for float16 or `merged_4bit` for int4; `lora` saves only the adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account!

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

#Exporting the Model to Hugging Face

Finally, once you're satisfied with the fine-tuning results, you can export the model to Hugging Face’s Model Hub.

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
model.push_to_hub_merged("vhab10/Llama-3-1-8B-Instruct-Unsloth-LoRA-4bit", tokenizer, save_method = "merged_4bit_forced")



README.md:   0%|          | 0.00/605 [00:00<?, ?B/s]

  0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.05G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.65G [00:00<?, ?B/s]

README.md:   0%|          | 0.00/611 [00:00<?, ?B/s]