## Unsloth: Optimizing Training and Inference Performance

For many algorithms, performance depends not only on the number and kind of operations performed. The order of the operations and the size of the chunks in which data is processed also have an enormous influence on execution speed.
For large language models, the `unsloth` library provides optimized GPU kernels that were created by manually deriving all compute-heavy math steps. Using these kernels can yield a significant speed-up.
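
To illustrate what "manually deriving" a math step means, the following toy sketch hand-derives the gradient of a softmax cross-entropy loss and checks it against finite differences. It is plain Python written for this illustration only; it is not Unsloth code, and the choice of cross-entropy as the example operation is an assumption.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])

def manual_grad(logits, target):
    """Hand-derived gradient: d(loss)/d(logit_i) = softmax_i - [i == target]."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

def numeric_grad(logits, target, eps=1e-6):
    """Central finite differences, used only to check the derivation."""
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        up[i] += eps
        down = list(logits)
        down[i] -= eps
        grad.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
    return grad

logits, target = [2.0, -1.0, 0.5], 0  # made-up values
analytic = manual_grad(logits, target)
numeric = numeric_grad(logits, target)
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
print("largest deviation between manual and numeric gradient:", max_error)
```

A closed-form gradient like this one needs a single pass over the data, whereas a generic differentiation engine would record and replay every intermediate operation. That difference is the kind of saving a hand-written kernel can exploit.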

### Key Techniques in Unsloth:

1. **Efficient Data Loading**: Optimizing data pipelines to reduce latency and improve throughput during training.
2. **Batching and Padding Strategies**: Dynamically adjusting batch sizes and minimizing padding to optimize memory usage.
3. **Half-Precision and Quantized Inference**: Using mixed precision or quantized models to speed up inference and reduce memory footprint.
4. **Model Pruning and Distillation**: Reducing the size of the model by removing redundant parameters or training smaller models to mimic larger ones.
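
The effect of padding strategies (item 2) can be sketched with a small calculation. The example below counts how many token positions are processed when all sequences are padded to the global maximum, when each batch is padded to its own maximum, and when sequences are additionally sorted by length first. The sequence lengths and the helper `padded_tokens` are made up for illustration and do not come from any library.

```python
def padded_tokens(lengths, batch_size):
    """Token positions processed when each batch is padded to its own longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total

lengths = [12, 480, 35, 510, 20, 495, 40, 18]  # made-up sequence lengths
batch_size = 4

real_tokens = sum(lengths)                                   # tokens that carry information
fixed = max(lengths) * len(lengths)                          # pad everything to the global maximum
dynamic = padded_tokens(lengths, batch_size)                 # pad per batch, original order
sorted_dynamic = padded_tokens(sorted(lengths), batch_size)  # pad per batch, sorted by length

for name, count in [("fixed", fixed), ("dynamic", dynamic), ("sorted + dynamic", sorted_dynamic)]:
    print(f"{name:>17}: {count} positions, {1 - real_tokens / count:.0%} padding")
```

In this toy data set, grouping sequences of similar length removes almost half of the processed positions. Sorting also changes the order in which samples are seen, which is a trade-off to weigh during training.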

### Benefits of Unsloth:

- **Reduced Training Time**: Optimizing data loading and model architecture reduces the time required for each epoch.
- **Lower Memory Usage**: Using techniques like mixed precision and quantization reduces the amount of GPU memory required.
- **Faster Inference**: Optimizing the model for deployment can significantly reduce latency during inference.
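
The memory benefit of lower precision follows from simple arithmetic: weight memory is roughly the number of parameters times the bytes per parameter. The sketch below assumes a model with 3.8 billion parameters (an assumed round figure for illustration) and ignores activations, optimizer state, and the extra constants that quantization schemes store.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 3.8e9  # assumed parameter count, for illustration only

for name, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{name:>8}: {weight_memory_gb(num_params, bits):5.1f} GB")
```

Real checkpoints are somewhat larger than the 4-bit estimate, because some layers are typically kept in higher precision.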

### Hands-On Example: Efficient Data Loading and Mixed Precision Training

In this example, we take the code from the previous notebook ("PEFT") and adjust it to use `unsloth`.

In [1]:
# Import libraries
# Unsloth must be imported before transformers, trl and peft so that its patches apply.
## Instead of:
# from transformers import AutoModelForCausalLM, AutoTokenizer
## use:
from unsloth import FastLanguageModel

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import BitsAndBytesConfig, pipeline, TrainingArguments
from trl import SFTTrainer, SFTConfig


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [2]:
# Choose a model and load tokenizer and model (using 4bit quantization):
# model_name = "unsloth--Phi-3.5-mini-instruct-bnb-4bit"
model_name = "unsloth/Phi-3.5-mini-instruct-bnb-4bit"
# model_name = "microsoft/Phi-3.5-mini-instruct"

## Instead of:
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForCausalLM.from_pretrained(...)
## use: 
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name,
    cache_dir='models',
    ## Instead of:
    # quantization_config=BitsAndBytesConfig(...)
    ## use:
    load_in_4bit=True,
    # device_map='cuda:0',
    trust_remote_code=True
)
tokenizer.padding_side = 'right'
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.6.2: Fast Llama patching. Transformers: 4.52.4.
   \\   /|    NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 23.527 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Device set to use cuda:0
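
The `load_in_4bit=True` flag above loads weights that are stored with 4 bits each. As a rough sketch of the principle, the example below quantizes a handful of made-up weights to signed 4-bit integers with plain absmax scaling and reconstructs them. This is a deliberately simplified stand-in: bitsandbytes uses more elaborate block-wise 4-bit formats, and the helper names here are hypothetical.

```python
def quantize_absmax(values, bits=4):
    """Map floats to signed integers in [-(2^(bits-1) - 1), 2^(bits-1) - 1]."""
    levels = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    largest = max(abs(v) for v in values)
    scale = largest / levels if largest > 0 else 1.0  # avoid division by zero
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Reconstruct approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.7, -0.3, 0.12, 0.0, -0.7]  # made-up weights
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print("codes:   ", codes)
print("restored:", [round(r, 3) for r in restored])
print("max error:", round(max(errors), 3))
```

Each weight is stored as a small integer plus one shared scale, and the rounding error is at most half a quantization step.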


In [3]:
# Load the guanaco dataset
guanaco_train = load_dataset("timdettmers/openassistant-guanaco", cache_dir='data', split='train')
guanaco_test = load_dataset("timdettmers/openassistant-guanaco", cache_dir='data', split='test')
# guanaco_test = load_dataset('/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/timdettmers--openassistant-guanaco', split='test')


Repo card metadata block was not found. Setting CardData to empty.
Repo card metadata block was not found. Setting CardData to empty.


In [4]:
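# Each 'text' entry has the form "### Human: ...### Assistant: ...".
# Keep only the first question/answer pair of every conversation.
guanaco_train = guanaco_train.map(lambda entry: {
    'question1': entry['text'].split('###')[1].removeprefix(' Human: '),
    'answer1': entry['text'].split('###')[2].removeprefix(' Assistant: ')
})
# Convert the pair to the 'messages' format used by chat templates.
guanaco_train = guanaco_train.map(lambda entry: {'messages': [
    {'role': 'user', 'content': entry['question1']},
    {'role': 'assistant', 'content': entry['answer1']}
]})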
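
To see what the two `map` calls do to a single record, the sketch below applies the same string operations to one made-up conversation. The sample text and the helper `first_turn_to_messages` are illustrative only. Note that `str.removeprefix` requires Python 3.9 or newer.

```python
# Made-up record in the "### Human: ...### Assistant: ..." layout
sample_text = (
    "### Human: What is LoRA?"
    "### Assistant: A low-rank adapter method."
    "### Human: Thanks!"
)

def first_turn_to_messages(text):
    """Extract the first question/answer pair and wrap it as chat messages."""
    parts = text.split('###')  # parts[0] is the empty string before the first '###'
    question = parts[1].removeprefix(' Human: ')
    answer = parts[2].removeprefix(' Assistant: ')
    return [
        {'role': 'user', 'content': question},
        {'role': 'assistant', 'content': answer},
    ]

messages = first_turn_to_messages(sample_text)
for message in messages:
    print(message)
```

Any later turns of a conversation (here "Thanks!") are discarded, so the fine-tuning data consists of single-turn exchanges.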

In [5]:
## Instead of:
# peft_config = LoraConfig(
#     task_type='CAUSAL_LM',
#     r=16,
#     lora_alpha=32,  # rule of thumb: lora_alpha should be 2*r
#     bias='none',
#     target_modules='all-linear',
# )
# model = get_peft_model(model, peft_config)
## use:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,  # rule of thumb: lora_alpha should be 2*r
    lora_dropout=0.05,  # Unsloth supports any value, but only 0 is optimized (see the warning in the output)
    bias='none',  # Unsloth supports any value, but only 'none' is optimized
    # Unsloth does not allow 'all-linear' => manually specify target modules:
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],
    use_gradient_checkpointing='unsloth',  # True or 'unsloth' for very long context
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.6.2 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
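
The number of trainable parameters that this LoRA configuration adds can be estimated by hand: an adapter of rank `r` on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The sketch below assumes a hidden size of 3072, an MLP intermediate size of 8192, and square attention projections. These dimensions are assumptions about the model and are not stated in this notebook. The 32 layers come from the patching message above.

```python
def lora_params(d_in, d_out, r):
    """Parameters of one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Assumed model dimensions (not stated in the notebook)
hidden, intermediate = 3072, 8192
num_layers, r = 32, 16

module_shapes = {  # (d_in, d_out) per target module
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, hidden),
    "v_proj": (hidden, hidden),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in module_shapes.values())
total = per_layer * num_layers
print(f"{per_layer:,} per layer, {total:,} in total")
```

Under these assumptions the total is 29,884,416, which matches the trainable-parameter count that the training log reports further down.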


In [6]:
training_arguments = SFTConfig(
    output_dir='output/unsloth-phi-3.5-mini-instruct-guanaco',
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    # Gradient checkpointing reduces memory usage but slows down training,
    # e.g. Mistral 7B with PEFT using bitsandbytes:
    # - enabled: 11 GB GPU RAM and 8 samples/second
    # - disabled: 40 GB GPU RAM and 12 samples/second
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={'use_reentrant': False},  # Use newer implementation that will become the default.
    optim='adamw_torch',
    learning_rate=2e-4,  # QLoRA suggestions: 2e-4 for 7B or 13B, 1e-4 for 33B or 65B
    logging_strategy='steps',  # 'no', 'epoch' or 'steps'
    logging_steps=10,
    save_strategy='no',  # 'no', 'epoch' or 'steps'
    # save_steps=2000,
    # num_train_epochs=5,
    max_steps=100,
    bf16=True,  # mixed precision training
    report_to='none',  # disable wandb
    max_seq_length=1024,
)

In [7]:
def formatting_func(entry):
    return tokenizer.apply_chat_template(entry['messages'], tokenize=False)

In [8]:
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=guanaco_train,
    processing_class=tokenizer,
    formatting_func=formatting_func,
)

In [10]:
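import os
# Make Unsloth return the full logits from the forward pass (by default it
# omits them to save memory). This must be set before training starts.
os.environ['UNSLOTH_RETURN_LOGITS'] = '1'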

train_result = trainer.train()
print("Training result:")
print(train_result)

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 9,846 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 29,884,416/2,039,024,640 (1.47% trained)


| Step | Training Loss |
|------|---------------|
| 10 | 1.4448 |
| 20 | 1.1725 |
| 30 | 1.2164 |
| 40 | 1.1808 |
| 50 | 1.2031 |
| 60 | 1.2853 |
| 70 | 1.1634 |
| 80 | 1.1012 |
| 90 | 1.1268 |
| 100 | 1.1313 |


Training result:
TrainOutput(global_step=100, training_loss=1.2025622177124022, metrics={'train_runtime': 132.2607, 'train_samples_per_second': 6.049, 'train_steps_per_second': 0.756, 'total_flos': 1.230171564294144e+16, 'train_loss': 1.2025622177124022})
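
The throughput figures in `TrainOutput` can be reproduced from the configuration and the runtime. The sketch below uses only numbers printed above: 100 steps, a total batch size of 8, 9,846 training examples and a runtime of 132.2607 seconds.

```python
max_steps = 100
total_batch_size = 8       # 8 per device x 1 accumulation step x 1 GPU
num_examples = 9846
train_runtime = 132.2607   # seconds, from TrainOutput

samples_seen = max_steps * total_batch_size
samples_per_second = samples_seen / train_runtime
steps_per_second = max_steps / train_runtime
epoch_fraction = samples_seen / num_examples

print(f"samples/second: {samples_per_second:.3f}")
print(f"steps/second:   {steps_per_second:.3f}")
print(f"share of one epoch covered: {epoch_fraction:.1%}")
```

With `max_steps=100` the run covers only about 8% of the training set, so the reported loss reflects a short demonstration run and not a complete fine-tuning pass.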


In [13]:
# Shut down the kernel to release memory
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)

{'status': 'ok', 'restart': False}