## Unsloth: Optimizing Training and Inference Performance

For many algorithms, performance depends not only on the number and kind of calculations performed. The order of the operations and the size of the chunks in which the data is processed also have an enormous influence on speed.
For large language models, the `unsloth` library provides optimized GPU kernels that were created by manually deriving all compute-heavy math steps. Using these kernels can yield a significant speed-up.
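
To make the point about order and chunk size concrete, here is a minimal pure-Python sketch (not Unsloth code; the function names are made up for illustration). It computes the same matrix product once row by row and once in small tiles. Both variants perform the same multiplications and give the same result; only the traversal order differs, and that order is what an optimized GPU kernel tunes to fit the memory hierarchy.

```python
def matmul_naive(a, b):
    """Plain triple loop: one output element at a time."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_blocked(a, b, block=2):
    """Same product, but accumulated tile by tile (tile edge = `block`)."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0] * m for _ in range(n)]
    for i0 in range(0, n, block):          # tile rows of the output
        for j0 in range(0, m, block):      # tile columns of the output
            for p0 in range(0, k, block):  # tiles along the inner dimension
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, m)):
                        acc = out[i][j]
                        for p in range(p0, min(p0 + block, k)):
                            acc += a[i][p] * b[p][j]
                        out[i][j] = acc
    return out


# Small integer matrices, so both variants must agree exactly
a = [[i + j for j in range(5)] for i in range(4)]      # 4 x 5
b = [[i * j + 1 for j in range(3)] for i in range(5)]  # 5 x 3
print(matmul_blocked(a, b, block=2) == matmul_naive(a, b))
```

In pure Python the tiled version is not faster; the sketch only shows that the chunk size is a free parameter that does not change the result.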

### Key Techniques in Unsloth:

1. **Efficient Data Loading**: Optimizing data pipelines to reduce latency and improve throughput during training.
2. **Batching and Padding Strategies**: Dynamically adjusting batch sizes and minimizing padding to optimize memory usage.
3. **Half-Precision and Quantized Inference**: Using mixed precision or quantized models to speed up inference and reduce memory footprint.
4. **Model Pruning and Distillation**: Reducing the size of the model by removing redundant parameters or training smaller models to mimic larger ones.
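
As an illustration of the padding point in item 2, the following sketch counts how many pad tokens are needed when each batch is padded to its longest sequence. It shows generic length-grouped batching with made-up sequence lengths; it is not Unsloth's internal implementation, and `padding_tokens` is a hypothetical helper.

```python
def padding_tokens(lengths, batch_size):
    """Count pad tokens when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += len(batch) * max(batch) - sum(batch)
    return total


lengths = [5, 120, 7, 118, 6, 119, 8, 121]  # made-up token counts per sample

in_order = padding_tokens(lengths, batch_size=4)         # batches as given
grouped = padding_tokens(sorted(lengths), batch_size=4)  # similar lengths together
print(in_order, grouped)  # 460 12
```

Grouping samples of similar length cuts the wasted tokens from 460 to 12 in this toy case, which reduces both memory use and compute per step.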

### Benefits of Unsloth:

- **Reduced Training Time**: Optimizing data loading and model architecture reduces the time required for each epoch.
- **Lower Memory Usage**: Using techniques like mixed precision and quantization reduces the amount of GPU memory required.
- **Faster Inference**: Optimizing the model for deployment can significantly reduce latency during inference.
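
To get a feeling for the memory benefit, the sketch below estimates the memory needed just for the model weights at different precisions. The parameter count of about 3.8 billion for Phi-3-mini is an assumption that is not stated in this notebook, and the estimate ignores activations, gradients, optimizer state and quantization metadata.

```python
N_PARAMS = 3.8e9  # assumption: approximate parameter count of Phi-3-mini

BITS_PER_WEIGHT = {"fp32": 32, "bf16": 16, "4-bit": 4}


def weight_gib(n_params, bits):
    """Memory in GiB needed to store n_params weights at the given bit width."""
    return n_params * bits / 8 / 2**30


estimates = {name: weight_gib(N_PARAMS, bits)
             for name, bits in BITS_PER_WEIGHT.items()}
for name, gib in estimates.items():
    print(f"{name:>5}: {gib:5.1f} GiB")
```

Under these assumptions the weights need roughly 14 GiB in fp32, 7 GiB in bf16 and under 2 GiB with 4-bit quantization.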

### Hands-On Example: Efficient Data Loading and Mixed Precision Training

In this example, we take the code from the previous notebook ("PEFT") and adjust it to use `unsloth`.

In [34]:
%%writefile unsloth_demo.py
# Import libraries
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import BitsAndBytesConfig, pipeline, TrainingArguments
from trl import SFTTrainer
from textwrap import dedent  # Remove leading whitespace from multiline strings
## Instead of:
# from transformers import AutoModelForCausalLM, AutoTokenizer
## use:
from unsloth import FastLanguageModel


# Choose a model and load tokenizer and model (using 4bit quantization):
model_name = "microsoft/Phi-3-mini-4k-instruct"
## Instead of:
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForCausalLM.from_pretrained(...)
## use: 
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name,
    ## Instead of:
    # quantization_config=BitsAndBytesConfig(...)
    ## use:
    load_in_4bit=True,
    # device_map='cuda:0',
    trust_remote_code=True
)
tokenizer.padding_side = 'right'
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Load the guanaco dataset
guanaco_train = load_dataset('timdettmers/openassistant-guanaco', split='train')
guanaco_test = load_dataset('timdettmers/openassistant-guanaco', split='test')

## Instead of:
# peft_config = LoraConfig(
#     task_type='CAUSAL_LM',
#     r=16,
#     lora_alpha=32,  # rule of thumb: lora_alpha should be 2*r
#     bias='none',
#     target_modules='all-linear',
# )
# model = get_peft_model(model, peft_config)
## use:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,  # rule of thumb: lora_alpha should be 2*r
    lora_dropout=0,  # Unsloth supports any, but = 0 is optimized
    bias='none',  # Unsloth supports any, but = 'none' is optimized
    # Unsloth does not allow 'all-linear' => manually specify target modules: 
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],
    use_gradient_checkpointing='unsloth',  # True or 'unsloth' for very long context
)


training_arguments = TrainingArguments(
    # When using newer versions of `trl`, use SFTConfig(...) instead of TrainingArguments(...).
    output_dir='phi-3-mini-instruct-guanaco',
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True, # Gradient checkpointing improves memory efficiency, but slows down training,
        # e.g. Mistral 7B with PEFT using bitsandbytes:
        # - enabled: 11 GB GPU RAM and 12 samples/second
        # - disabled: 40 GB GPU RAM and 8 samples/second
    gradient_checkpointing_kwargs={'use_reentrant': False},  # Use newer implementation that will become the default.
    optim='adamw_torch',
    learning_rate=2e-4,  # QLoRA suggestions: 2e-4 for 7B or 13B, 1e-4 for 33B or 65B
    logging_strategy='steps',  # 'no', 'epoch' or 'steps'
    logging_steps=10,
    save_strategy='no',  # 'no', 'epoch' or 'steps'
    # eval_strategy='steps',
    # eval_steps=20,
    # save_steps=2000,
    # num_train_epochs=5,
    max_steps=100,
    bf16=True,  # mixed precision training: faster, but uses more memory
    report_to='none',  # disable wandb
)

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=guanaco_train,
    eval_dataset=guanaco_test,
    tokenizer=tokenizer,
    dataset_text_field='text',
    max_seq_length=1024,
)

eval_result = trainer.evaluate()
print("Evaluation on test dataset before finetuning:")
print(eval_result)

train_result = trainer.train()
print("Training result:")
print(train_result)

eval_result = trainer.evaluate()
print("Evaluation on test dataset after finetuning:")
print(eval_result)

Overwriting unsloth_demo.py


In [35]:
%%writefile unsloth_demo.slurm
#!/bin/bash

#SBATCH --partition=zen3_0512_a100x2
# #SBATCH --qos=zen3_0512_a100x2
#SBATCH --qos=admin
#SBATCH --gres=gpu:1  # Number of GPUs (1 or 2)
#SBATCH --time=0:20:00

# Load conda:
module purge
module load miniconda3
eval "$(conda shell.bash hook)"

conda run -n finetuning --no-capture-output python unsloth_demo.py

Overwriting unsloth_demo.slurm


In [36]:
!sbatch unsloth_demo.slurm

sbatch: Allocating 50.0 % of cpu resources: 64 / 128.
sbatch: Number of tasks adjusted to 64.
Submitted batch job 3982157


In [37]:
!squeue --me

             JOBID            PARTITION     NAME     USER ST       TIME  NODES     NODELIST(REASON)
           3982157     zen3_0512_a100x2 unsloth_ mpfister  R       0:24      1            n3072-015
           3979795     zen3_0512_a100x2 vsc5_jh_ mpfister  R   10:52:24      1            n3071-001


#### Output:

```
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: WARNING `trust_remote_code` is True.
Are you certain you want to do remote code execution?
==((====))==  Unsloth 2024.8: Fast Mistral patching. Transformers = 4.43.4.
   \\   /|    GPU: NVIDIA A100-PCIE-40GB. Max memory: 39.393 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.24. FA2 = True]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Repo card metadata block was not found. Setting CardData to empty.
Repo card metadata block was not found. Setting CardData to empty.
Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
Map: 100%|██████████| 9846/9846 [00:01<00:00, 7331.67 examples/s]
Map: 100%|██████████| 518/518 [00:00<00:00, 7213.22 examples/s]
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
100%|██████████| 65/65 [00:26<00:00,  2.45it/s]
Evaluation on test dataset before finetuning:
{'eval_loss': 1.4085867404937744, 'eval_model_preparation_time': 0.006, 'eval_runtime': 61.1316, 'eval_samples_per_second': 8.474, 'eval_steps_per_second': 1.063}
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 9,846 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 1
\        /    Total batch size = 8 | Total steps = 100
 "-____-"     Number of trainable parameters = 29,884,416
{'loss': 1.2382, 'grad_norm': 0.41076236963272095, 'learning_rate': 0.00018, 'epoch': 0.01}
{'loss': 1.1579, 'grad_norm': 0.3123883306980133, 'learning_rate': 0.00016, 'epoch': 0.02}
{'loss': 1.1584, 'grad_norm': 0.33330079913139343, 'learning_rate': 0.00014, 'epoch': 0.02}
{'loss': 1.1558, 'grad_norm': 0.2736963927745819, 'learning_rate': 0.00012, 'epoch': 0.03}
{'loss': 1.1786, 'grad_norm': 0.30604463815689087, 'learning_rate': 0.0001, 'epoch': 0.04}
{'loss': 1.2531, 'grad_norm': 0.21231390535831451, 'learning_rate': 8e-05, 'epoch': 0.05}
{'loss': 1.2122, 'grad_norm': 0.21684657037258148, 'learning_rate': 6e-05, 'epoch': 0.06}
{'loss': 1.1345, 'grad_norm': 0.17966635525226593, 'learning_rate': 4e-05, 'epoch': 0.06}
{'loss': 1.1084, 'grad_norm': 0.2505042850971222, 'learning_rate': 2e-05, 'epoch': 0.07}
{'loss': 1.1221, 'grad_norm': 0.21714507043361664, 'learning_rate': 0.0, 'epoch': 0.08}
{'train_runtime': 169.8121, 'train_samples_per_second': 4.711, 'train_steps_per_second': 0.589, 'train_loss': 1.1719264698028564, 'epoch': 0.08}
100%|██████████| 100/100 [02:49<00:00,  1.70s/it]
Training result:
TrainOutput(global_step=100, training_loss=1.1719264698028564, metrics={'train_runtime': 169.8121, 'train_samples_per_second': 4.711, 'train_steps_per_second': 0.589, 'total_flos': 1.551664613154816e+16, 'train_loss': 1.1719264698028564, 'epoch': 0.08123476848090982})
100%|██████████| 65/65 [00:27<00:00,  2.38it/s]
Evaluation on test dataset after finetuning:
{'eval_loss': 1.2287541627883911, 'eval_model_preparation_time': 0.006, 'eval_runtime': 27.6746, 'eval_samples_per_second': 18.718, 'eval_steps_per_second': 2.349, 'epoch': 0.08123476848090982}
```
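
Some numbers in this log can be checked by hand. The sketch below recomputes the number of trainable LoRA parameters and the step counts. The layer dimensions (hidden size 3072, MLP size 8192, key/value projections as wide as the query projection) are assumptions about Phi-3-mini that are not stated in this notebook; the layer count, rank, batch size and dataset sizes are taken from the code and log above.

```python
import math

# Assumed Phi-3-mini dimensions (not stated in this notebook)
HIDDEN = 3072        # width of the attention projections
INTERMEDIATE = 8192  # width of the MLP
LAYERS = 32          # matches "patched 32 layers" in the log
R = 16               # LoRA rank used above


def lora_params(d_in, d_out, r=R):
    """LoRA adds A (r x d_in) and B (d_out x r) to a d_in -> d_out linear layer."""
    return r * (d_in + d_out)


per_layer = (
    4 * lora_params(HIDDEN, HIDDEN)          # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_params(HIDDEN, INTERMEDIATE)  # gate_proj, up_proj
    + lora_params(INTERMEDIATE, HIDDEN)      # down_proj
)
trainable = LAYERS * per_layer
print(f"{trainable:,}")  # 29,884,416

# Step counts: 9,846 training and 518 test samples, batch size 8
steps_per_epoch = math.ceil(9846 / 8)  # 1231
epoch_fraction = 100 / steps_per_epoch  # fraction of an epoch after max_steps=100
eval_steps = math.ceil(518 / 8)         # 65
print(steps_per_epoch, round(epoch_fraction, 4), eval_steps)
```

Under these assumptions the result matches the `Number of trainable parameters = 29,884,416` line, the final `epoch` value of about 0.0812 and the 65 evaluation steps shown in the progress bars.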

### Just for comparison, the same code without unsloth:

In [38]:
%%writefile unsloth_demo_nounsloth.py
# Import libraries
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import BitsAndBytesConfig, pipeline, TrainingArguments
from trl import SFTTrainer
from textwrap import dedent  # Remove leading whitespace from multiline strings
from transformers import AutoModelForCausalLM, AutoTokenizer


# Choose a model and load tokenizer and model (using 4bit quantization):
model_name = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = 'right'
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=torch.bfloat16),
    device_map='cuda:0',
    trust_remote_code=True
)
tokenizer.padding_side = 'right'
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Load the guanaco dataset
guanaco_train = load_dataset('timdettmers/openassistant-guanaco', split='train')
guanaco_test = load_dataset('timdettmers/openassistant-guanaco', split='test')

# Configure LoRA using PEFT directly:
peft_config = LoraConfig(
    task_type='CAUSAL_LM',
    r=16,
    lora_alpha=32,  # rule of thumb: lora_alpha should be 2*r
    bias='none',
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",]
)
model = get_peft_model(model, peft_config)

training_arguments = TrainingArguments(
    # When using newer versions of `trl`, use SFTConfig(...) instead of TrainingArguments(...).
    output_dir='phi-3-mini-instruct-guanaco',
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True, # Gradient checkpointing improves memory efficiency, but slows down training,
        # e.g. Mistral 7B with PEFT using bitsandbytes:
        # - enabled: 11 GB GPU RAM and 12 samples/second
        # - disabled: 40 GB GPU RAM and 8 samples/second
    gradient_checkpointing_kwargs={'use_reentrant': False},  # Use newer implementation that will become the default.
    optim='adamw_torch',
    learning_rate=2e-4,  # QLoRA suggestions: 2e-4 for 7B or 13B, 1e-4 for 33B or 65B
    logging_strategy='steps',  # 'no', 'epoch' or 'steps'
    logging_steps=10,
    save_strategy='no',  # 'no', 'epoch' or 'steps'
    # eval_strategy='steps',
    # eval_steps=20,
    # save_steps=2000,
    # num_train_epochs=5,
    max_steps=100,
    bf16=True,  # mixed precision training: faster, but uses more memory
    report_to='none',  # disable wandb
)

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=guanaco_train,
    eval_dataset=guanaco_test,
    tokenizer=tokenizer,
    dataset_text_field='text',
    max_seq_length=1024,
)

eval_result = trainer.evaluate()
print("Evaluation on test dataset before finetuning:")
print(eval_result)

train_result = trainer.train()
print("Training result:")
print(train_result)

eval_result = trainer.evaluate()
print("Evaluation on test dataset after finetuning:")
print(eval_result)

Writing unsloth_demo_nounsloth.py


In [42]:
%%writefile unsloth_demo_nounsloth.slurm
#!/bin/bash

#SBATCH --partition=zen3_0512_a100x2
# #SBATCH --qos=zen3_0512_a100x2
#SBATCH --qos=admin
#SBATCH --gres=gpu:1  # Number of GPUs (1 or 2)
#SBATCH --time=0:20:00

# Load conda:
module purge
module load miniconda3
eval "$(conda shell.bash hook)"

conda run -n finetuning --no-capture-output python unsloth_demo_nounsloth.py

Overwriting unsloth_demo_nounsloth.slurm


In [43]:
!sbatch unsloth_demo_nounsloth.slurm

sbatch: Allocating 50.0 % of cpu resources: 64 / 128.
sbatch: Number of tasks adjusted to 64.
Submitted batch job 3982161


In [44]:
!squeue --me

             JOBID            PARTITION     NAME     USER ST       TIME  NODES     NODELIST(REASON)
           3982161     zen3_0512_a100x2 unsloth_ mpfister  R       0:25      1            n3072-015
           3982160     zen3_0512_a100x2 unsloth_ mpfister  R       3:28      1            n3072-015
           3979795     zen3_0512_a100x2 vsc5_jh_ mpfister  R   10:59:48      1            n3071-001


#### Output:

```
Loading checkpoint shards: 100%|██████████| 2/2 [00:16<00:00,  8.48s/it]
Repo card metadata block was not found. Setting CardData to empty.
Repo card metadata block was not found. Setting CardData to empty.
Map: 100%|██████████| 9846/9846 [00:01<00:00, 7429.40 examples/s]
Map: 100%|██████████| 518/518 [00:00<00:00, 7066.79 examples/s]
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
You are not running the flash-attention implementation, expect numerical differences.
100%|██████████| 65/65 [00:44<00:00,  1.47it/s]
Evaluation on test dataset before finetuning:
{'eval_loss': 1.4072628021240234, 'eval_model_preparation_time': 0.0026, 'eval_runtime': 46.7094, 'eval_samples_per_second': 11.09, 'eval_steps_per_second': 1.392}
  0%|          | 0/100 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 1.2626, 'grad_norm': 0.2466505616903305, 'learning_rate': 0.00018, 'epoch': 0.01}
{'loss': 1.1671, 'grad_norm': 0.22209370136260986, 'learning_rate': 0.00016, 'epoch': 0.02}
{'loss': 1.1649, 'grad_norm': 0.22614338994026184, 'learning_rate': 0.00014, 'epoch': 0.02}
{'loss': 1.1588, 'grad_norm': 0.19410459697246552, 'learning_rate': 0.00012, 'epoch': 0.03}
{'loss': 1.1821, 'grad_norm': 0.22094522416591644, 'learning_rate': 0.0001, 'epoch': 0.04}
{'loss': 1.257, 'grad_norm': 0.16384980082511902, 'learning_rate': 8e-05, 'epoch': 0.05}
{'loss': 1.2183, 'grad_norm': 0.1637776792049408, 'learning_rate': 6e-05, 'epoch': 0.06}
{'loss': 1.1377, 'grad_norm': 0.14242839813232422, 'learning_rate': 4e-05, 'epoch': 0.06}
{'loss': 1.1138, 'grad_norm': 0.19825458526611328, 'learning_rate': 2e-05, 'epoch': 0.07}
{'loss': 1.1256, 'grad_norm': 0.16682812571525574, 'learning_rate': 0.0, 'epoch': 0.08}
{'train_runtime': 237.3404, 'train_samples_per_second': 3.371, 'train_steps_per_second': 0.421, 'train_loss': 1.1787943458557129, 'epoch': 0.08}
100%|██████████| 100/100 [03:57<00:00,  2.37s/it]
Training result:
TrainOutput(global_step=100, training_loss=1.1787943458557129, metrics={'train_runtime': 237.3404, 'train_samples_per_second': 3.371, 'train_steps_per_second': 0.421, 'total_flos': 1.542992772194304e+16, 'train_loss': 1.1787943458557129, 'epoch': 0.08123476848090982})
100%|██████████| 65/65 [00:45<00:00,  1.43it/s]
Evaluation on test dataset after finetuning:
{'eval_loss': 1.2330236434936523, 'eval_model_preparation_time': 0.0026, 'eval_runtime': 46.069, 'eval_samples_per_second': 11.244, 'eval_steps_per_second': 1.411, 'epoch': 0.08123476848090982}
```
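
To compare the two runs, the following sketch computes the ratios from the metrics printed in the two logs above. The numbers are copied verbatim from those logs. Each variant was run only once, so the ratios are indicative rather than a careful benchmark.

```python
# Metrics copied from the two logs above
unsloth = {
    "train_runtime": 169.8121,
    "eval_runtime_before": 61.1316,
    "eval_runtime_after": 27.6746,
    "eval_loss_after": 1.2287541627883911,
}
baseline = {
    "train_runtime": 237.3404,
    "eval_runtime_before": 46.7094,
    "eval_runtime_after": 46.069,
    "eval_loss_after": 1.2330236434936523,
}

# Ratios above 1 mean that the Unsloth run was faster
speedup = {
    key: baseline[key] / unsloth[key]
    for key in ("train_runtime", "eval_runtime_before", "eval_runtime_after")
}
for key, value in speedup.items():
    print(f"{key:>20}: {value:.2f}x")

loss_diff = unsloth["eval_loss_after"] - baseline["eval_loss_after"]
print(f"eval loss difference after finetuning: {loss_diff:+.4f}")
```

In this single run, training with Unsloth was about 1.4 times faster and the evaluation after finetuning about 1.7 times faster, while the evaluation loss is nearly identical. The first evaluation was slower with Unsloth (61.1 s compared to 46.7 s); the log does not show the reason.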