## Unsloth: Optimizing Training and Inference Performance

For many algorithms, performance depends not only on the number and kind of calculations performed. The order in which the calculations run and the size of the chunks in which data is processed also have an enormous influence on speed.
For large language models, the `unsloth` library provides optimized GPU kernels, created by manually deriving all compute-heavy math steps. Using these optimized kernels can yield a significant speed-up.
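
To illustrate why ordering and chunk size matter, here is a minimal sketch in plain Python. It is not Unsloth code; the function names and the tile size are illustrative assumptions. It computes the same matrix product twice, once with the textbook triple loop and once in small tiles. Both versions perform exactly the same multiplications and additions; only the traversal order differs.

```python
def matmul_naive(a, b):
    """Textbook triple loop: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same arithmetic, but processed in blocks of at most tile x tile."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                # Work on one small block at a time
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        for k in range(k0, min(k0 + tile, m)):
                            c[i][j] += a[i][k] * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
result_naive = matmul_naive(a, b)
result_tiled = matmul_tiled(a, b)
print(result_naive == result_tiled)  # both orders give the same product
```

In interpreted Python the tiled version is not faster. On a GPU, however, a blocked order of this kind lets a kernel keep the operands of one tile in fast on-chip memory, which is the sort of effect hand-written kernels exploit.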

### Key Techniques in Unsloth:

1. **Efficient Data Loading**: Optimizing data pipelines to reduce latency and improve throughput during training.
2. **Batching and Padding Strategies**: Dynamically adjusting batch sizes and minimizing padding to optimize memory usage.
3. **Half-Precision and Quantized Inference**: Using mixed precision or quantized models to speed up inference and reduce memory footprint.
4. **Model Pruning and Distillation**: Reducing the size of the model by removing redundant parameters or training smaller models to mimic larger ones.
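
The effect of the padding strategy in item 2 can be sketched with a few lines of plain Python. This is a generic illustration, not Unsloth's implementation; the helper name and the sequence lengths are hypothetical. Each batch is padded to its longest sequence, so grouping sequences of similar length wastes far fewer tokens.

```python
def padded_tokens(lengths, batch_size):
    """Total padding tokens when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


# Hypothetical sequence lengths (in tokens), in arrival order
lengths = [12, 480, 35, 510, 20, 495, 40, 500]

padding_unsorted = padded_tokens(lengths, batch_size=4)
padding_sorted = padded_tokens(sorted(lengths), batch_size=4)
print(padding_unsorted, padding_sorted)  # 1948 108
```

Sorting strictly by length removes randomness from the batch order, so practical implementations usually sort only within larger shuffled groups.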

### Benefits of Unsloth:

- **Reduced Training Time**: Optimizing data loading and model architecture reduces the time required for each epoch.
- **Lower Memory Usage**: Using techniques like mixed precision and quantization reduces the amount of GPU memory required.
- **Faster Inference**: Optimizing the model for deployment can significantly reduce latency during inference.
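
As a rough back-of-the-envelope check of the memory claim, the sketch below estimates the storage needed for the model weights at different precisions. The parameter count of 3.8 billion (roughly Phi-3.5-mini) is an assumption for illustration. Real 4-bit checkpoints also store quantization constants and keep some layers in higher precision, so actual usage is somewhat higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (10**9 bytes), ignoring any overhead."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 3.8e9  # assumed parameter count, roughly Phi-3.5-mini

for name, bits in [("fp32", 32), ("bf16", 16), ("4-bit", 4)]:
    print(f"{name:>5}: {weight_memory_gb(num_params, bits):.1f} GB")
```

Activations, gradients and optimizer state come on top of this during training, which is why the figures above are only a lower bound.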

### Hands-On Example: Efficient Data Loading and Mixed Precision Training

In this example, we take the code from the previous notebook ("PEFT") and adjust it to use `unsloth`.

In [1]:
# Import libraries
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import BitsAndBytesConfig, pipeline, TrainingArguments
from trl import SFTTrainer, SFTConfig
## Instead of:
# from transformers import AutoModelForCausalLM, AutoTokenizer
## use:
from unsloth import FastLanguageModel



🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
[2025-01-22 11:43:09,593] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)


In [2]:
# Choose a model and load tokenizer and model (using 4bit quantization):
model_name = "/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/unsloth--Phi-3.5-mini-instruct-bnb-4bit"
# model_name = "unsloth/Phi-3.5-mini-instruct-bnb-4bit"
# model_name = "microsoft/Phi-3.5-mini-instruct"

## Instead of:
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForCausalLM.from_pretrained(...)
## use: 
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name,
    ## Instead of:
    # quantization_config=BitsAndBytesConfig(...)
    ## use:
    load_in_4bit=True,
    # device_map='cuda:0',
    trust_remote_code=True
)
tokenizer.padding_side = 'right'
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.1.5: Fast Llama patching. Transformers: 4.48.0.
   \\   /|    GPU: NVIDIA A100-SXM-64GB. Max memory: 63.423 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Device set to use cuda:0


In [3]:
# Load the guanaco dataset
guanaco_train = load_dataset('/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/timdettmers--openassistant-guanaco', split='train')
# guanaco_test = load_dataset('/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/timdettmers--openassistant-guanaco', split='test')
# guanaco_train = load_dataset('timdettmers/openassistant-guanaco', split='train')
# guanaco_test = load_dataset('timdettmers/openassistant-guanaco', split='test')

Repo card metadata block was not found. Setting CardData to empty.


In [4]:
guanaco_train = guanaco_train.map(lambda entry: {
    'question1': entry['text'].split('###')[1].removeprefix(' Human: '),
    'answer1': entry['text'].split('###')[2].removeprefix(' Assistant: ')
})
guanaco_train = guanaco_train.map(lambda entry: {'messages': [
    {'role': 'user', 'content': entry['question1']},
    {'role': 'assistant', 'content': entry['answer1']}
]})
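
To make the parsing step above concrete, here is the same logic applied to a single string. The sample text is hypothetical; it only mimics the `### Human: ... ### Assistant: ...` layout that the `split('###')` call relies on.

```python
# Hypothetical entry in the "### Human: ...### Assistant: ..." layout
text = "### Human: What is LoRA?### Assistant: A low-rank fine-tuning method."

parts = text.split('###')
# parts[0] is the empty string in front of the first '###'
question = parts[1].removeprefix(' Human: ')
answer = parts[2].removeprefix(' Assistant: ')

messages = [
    {'role': 'user', 'content': question},
    {'role': 'assistant', 'content': answer},
]
print(messages)
```

Because only `parts[1]` and `parts[2]` are used, entries with several turns keep just their first question and answer. A literal `###` inside a message would also split it early. `str.removeprefix` requires Python 3.9 or newer.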

In [5]:
## Instead of:
# peft_config = LoraConfig(
#     task_type='CAUSAL_LM',
#     r=16,
#     lora_alpha=32,  # rule of thumb: lora_alpha should be 2*r
#     bias='none',
#     target_modules='all-linear',
# )
# model = get_peft_model(model, peft_config)
## use:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,  # rule of thumb: lora_alpha should be 2*r
    lora_dropout=0.05,  # Unsloth supports any value, but only 0 is optimized
    bias='none',  # Unsloth supports any value, but only 'none' is optimized
    # Unsloth does not allow 'all-linear' => manually specify target modules: 
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],
    use_gradient_checkpointing='unsloth',  # True or 'unsloth' for very long context
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.1.5 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
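
The number of parameters that LoRA adds can be checked by hand. Each adapted linear layer with input size `d_in` and output size `d_out` receives two low-rank matrices of shapes `r x d_in` and `d_out x r`, so `r * (d_in + d_out)` parameters in total. The sketch below assumes the dimensions of Phi-3.5-mini (hidden size 3072, MLP size 8192, 32 layers, no grouped-query attention). These values are assumptions taken from the public model configuration, not from this notebook. Under them, the result agrees with the 29,884,416 trainable parameters that the trainer reports when training starts.

```python
r = 16                # LoRA rank, as in the cell above
hidden = 3072         # assumed hidden size of Phi-3.5-mini
intermediate = 8192   # assumed MLP (feed-forward) size
num_layers = 32       # assumed number of transformer layers

# (d_in, d_out) of every adapted projection within one layer
projections = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, hidden),
    "v_proj": (hidden, hidden),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}


def lora_params(d_in, d_out, rank):
    """Parameters of the two LoRA matrices A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in projections.values())
total = per_layer * num_layers
print(f"{total:,}")  # 29,884,416
```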


In [6]:
training_arguments = SFTConfig(
    output_dir='output/unsloth-phi-3.5-mini-instruct-guanaco',
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True, # Gradient checkpointing improves memory efficiency, but slows down training,
        # e.g. Mistral 7B with PEFT using bitsandbytes:
        # - enabled: 11 GB GPU RAM and 8 samples/second
        # - disabled: 40 GB GPU RAM and 12 samples/second
    gradient_checkpointing_kwargs={'use_reentrant': False},  # Use newer implementation that will become the default.
    optim='adamw_torch',
    learning_rate=2e-4,  # QLoRA suggestions: 2e-4 for 7B or 13B, 1e-4 for 33B or 65B
    logging_strategy='steps',  # 'no', 'epoch' or 'steps'
    logging_steps=10,
    save_strategy='no',  # 'no', 'epoch' or 'steps'
    # save_steps=2000,
    # num_train_epochs=5,
    max_steps=100,
    bf16=True,  # mixed precision training
    report_to='none',  # disable wandb
    max_seq_length=1024,
)

In [7]:
def formatting_func(entry):
    return tokenizer.apply_chat_template(entry['messages'], tokenize=False)

In [8]:
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=guanaco_train,
    processing_class=tokenizer,
    formatting_func=formatting_func,
)

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [9]:
train_result = trainer.train()
print("Training result:")
print(train_result)

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 9,846 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 1
\        /    Total batch size = 8 | Total steps = 100
 "-____-"     Number of trainable parameters = 29,884,416


| Step | Training Loss |
|------|---------------|
| 10   | 1.4005        |
| 20   | 1.1711        |
| 30   | 1.2157        |
| 40   | 1.1816        |
| 50   | 1.2066        |
| 60   | 1.289         |
| 70   | 1.1655        |
| 80   | 1.1047        |
| 90   | 1.1283        |
| 100  | 1.1328        |


Training result:
TrainOutput(global_step=100, training_loss=1.1995847034454346, metrics={'train_runtime': 118.8432, 'train_samples_per_second': 6.732, 'train_steps_per_second': 0.841, 'total_flos': 1.230171564294144e+16, 'train_loss': 1.1995847034454346, 'epoch': 0.08123476848090982})
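
The throughput figures in `TrainOutput` follow directly from the settings and the runtime. The short calculation below reproduces them from numbers shown above. The variable names are illustrative, and the assumption is that one optimizer step processes exactly one batch of 8 samples on a single GPU.

```python
import math

num_examples = 9846       # from the training banner
batch_size = 8            # per-device batch size * accumulation steps * GPUs
max_steps = 100           # from the training arguments
train_runtime = 118.8432  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
samples_seen = max_steps * batch_size
epoch_fraction = max_steps / steps_per_epoch
samples_per_second = samples_seen / train_runtime
steps_per_second = max_steps / train_runtime

print(f"steps per epoch:    {steps_per_epoch}")
print(f"epoch fraction:     {epoch_fraction:.4f}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"steps per second:   {steps_per_second:.3f}")
```

The run therefore covers only about 8% of one epoch, so the loss values above describe a short demonstration run rather than a converged fine-tuning.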


In [13]:
# Shut down the kernel to release memory
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)

{'status': 'ok', 'restart': False}