# Unsloth Library: Accelerate LLM Fine-tuning

Speed up your large language model (LLM) fine-tuning with Unsloth, an open-source library!

## Focus:
- Works with Llama 3, Mistral, Phi, and Gemma LLMs.

## Faster Training:
- Reported to achieve 2-5x faster fine-tuning than a standard Hugging Face training setup, depending on the model and hardware.

## Reduced Memory Usage:
- Requires fewer parameter updates, enabling larger batch sizes and up to 30% less VRAM consumption.

## How it Works:

### LoRA Adapters:
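- Significantly reduces the number of parameters needing updates during fine-tuning, often to a few percent of the model or less (about 42 million of roughly 7 billion, or 0.6%, in the demonstration below).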

### Custom Backpropagation Engine:
- Hand-written backward passes and GPU kernels tailored to LLM layers give further efficiency gains.

## Benefits:

### Faster Experimentation:
- Get results quicker with accelerated fine-tuning.

### Open-Source and Accessible:
- Easy to use and contribute to.

## Choosing Unsloth vs. QLoRA:

### Unsloth:
- Ideal for supported LLMs if speed and ease of use are priorities.

### QLoRA:
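- QLoRA is a technique (LoRA adapters on a 4-bit quantized base model) rather than a separate library; the comparison here is with running it through the general Hugging Face PEFT and bitsandbytes stack, which covers more models but might require additional setup depending on your LLM.
- Check compatibility and performance for your specific model and hardware.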

## Additional Considerations:

### Hardware:
- Requires an NVIDIA GPU with CUDA compute capability 7.0 or above (for example, the Tesla T4 used below reports 7.5).
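
To check this requirement before installing anything, one can compare the GPU's compute capability against the 7.0 minimum. The sketch below is illustrative only: `meets_unsloth_minimum` and `current_gpu_capability` are hypothetical helper names, and the lookup assumes PyTorch is installed so that `torch.cuda.get_device_capability()` can report the `(major, minor)` pair.

```python
MIN_CAPABILITY = (7, 0)  # minimum compute capability stated above


def meets_unsloth_minimum(capability):
    """Return True if a (major, minor) compute capability is at least 7.0."""
    return tuple(capability) >= MIN_CAPABILITY


def current_gpu_capability():
    """Return the active GPU's (major, minor) capability, or None if unavailable."""
    try:
        import torch  # imported lazily so the check above works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


capability = current_gpu_capability()
if capability is None:
    print("No CUDA GPU detected (or PyTorch is not installed).")
else:
    supported = meets_unsloth_minimum(capability)
    print(f"Compute capability {capability}: supported = {supported}")
```

On the Colab Tesla T4 used in this notebook the capability is 7.5, so the check passes.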

### Ease of Use:
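- Unsloth is generally simpler to adopt for supported models: loading the model and attaching adapters take two calls in the demonstration below.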

### Documentation/Support:
- Review the documentation and community support for each option to see which fits your workflow better.


#### Practical Demonstration: Fine-tune a 4-bit Mistral 7B LLM on the IMDB dataset for text generation, all within Google Colab using Unsloth.

## Parameter-Efficient Fine Tuning (PEFT)

### Overview:
PEFT methods allow fine-tuning of large pre-trained models with minimal computational resources and memory usage by keeping the original model parameters frozen and introducing a small number of additional trainable parameters, called adapters. These adapters are designed to learn task-specific adjustments, which makes the fine-tuning process more efficient.

### Key Points:

- **Freezing Pretrained Parameters:** During fine-tuning, the pretrained model's parameters are kept unchanged.
- **Adding Trainable Adapters:** A small number of trainable parameters (adapters) are added. These adapters capture the task-specific knowledge required for fine-tuning.
- **Memory and Compute Efficiency:** By reducing the number of trainable parameters, PEFT methods are more memory and compute-efficient than fully fine-tuning the entire model.
- **Comparable Performance:** Despite the reduced number of trainable parameters, PEFT methods often achieve performance levels close to fully fine-tuned models.
- **Smaller Adapter Size:** The adapters are significantly smaller than the full model, making them easier to share, store, and load.
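
The adapter idea can be illustrated with LoRA, the PEFT method used later in this notebook. The following is a minimal pure-Python sketch, not Unsloth's or PEFT's implementation: the frozen weight `W` is left untouched and only the small matrices `A` and `B` would be trained. The helper names (`matvec`, `lora_forward`) and the toy dimensions are assumptions chosen for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x), with W frozen and A, B trainable."""
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # low-rank adapter path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 64, 64, 4, 4

# Frozen pretrained weight (d_out x d_in).
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable adapter matrices: A is r x d_in (small random), B is d_out x r (zeros).
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B initialised to zero, the adapter leaves the pretrained output unchanged.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

full_params = d_in * d_out
adapter_params = r * (d_in + d_out)
print(f"full layer: {full_params} parameters, adapter: {adapter_params} parameters")
```

Because `B` starts at zero, fine-tuning begins from exactly the pretrained behaviour, and only `A` and `B` need to be stored and shared afterwards.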


In [20]:
#!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [21]:
#!pip install "git+https://github.com/huggingface/transformers.git"

In [22]:
#!pip install trl

In [23]:
#!pip install bitsandbytes xformers peft

In [5]:
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [6]:
from datasets import load_dataset

In [7]:
max_seq_length = 2048

In [8]:
dataset = load_dataset("imdb", split='train')

In [9]:
dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [10]:
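model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = 'unsloth/mistral-7b-bnb-4bit',  # pre-quantized 4-bit Mistral 7B
    max_seq_length = max_seq_length,
    dtype = None,         # None auto-detects: float16 on T4/V100, bfloat16 on Ampere or newer
    load_in_4bit = True,  # load the weights with 4-bit quantization to reduce memory use
)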

==((====))==  Unsloth: Fast Mistral patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


## LoRA Parameters and Their Purpose

### model:
The base language model that will be fine-tuned.

### r = 16:
The rank of the low-rank adaptation matrices. A higher rank gives the adapters more capacity at the cost of more trainable parameters.

### target_modules:
A list of the model's modules (layers) where the LoRA adapters will be applied. Here they cover every linear projection in each transformer block, both in the attention mechanism and in the feed-forward (MLP) sub-layer:
- "q_proj": Query projection (attention).
- "k_proj": Key projection (attention).
- "v_proj": Value projection (attention).
- "o_proj": Output projection (attention).
- "gate_proj": Gating projection (MLP).
- "up_proj": Up projection (MLP).
- "down_proj": Down projection (MLP).
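
The rank and the list of target modules together determine how many parameters are trained. As a rough check, the sketch below counts the LoRA parameters for `r = 16` over these seven modules. The layer shapes are assumptions taken from the published Mistral 7B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of 128 dimensions, 32 layers), not from this notebook, and `lora_params` is a hypothetical helper.

```python
# Assumed Mistral 7B shapes (from its published config, not from this notebook).
HIDDEN = 4096          # model width
INTERMEDIATE = 14336   # MLP width
KV_DIM = 1024          # 8 key/value heads x 128 dims (grouped-query attention)
NUM_LAYERS = 32
R = 16                 # LoRA rank used in this notebook

# (in_features, out_features) for each targeted linear layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 41,943,040
```

Under these assumptions the total agrees with the "Number of trainable parameters = 41,943,040" line that the trainer prints later in this notebook.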

### lora_alpha = 16:
The scaling factor for the LoRA adapters. The low-rank update is multiplied by lora_alpha / r before it is added to the output of the frozen weights, so with r = 16 the scale here is 1.

### lora_dropout = 0:
Dropout rate for LoRA adapters. A value of 0 means no dropout; other values are supported, but 0 is the setting Unsloth is optimized for.

### bias = "none":
Specifies which bias terms are trained. "none" means no biases are updated; other values are supported, but "none" is the setting Unsloth is optimized for.

### use_gradient_checkpointing = "unsloth":
This option enables gradient checkpointing, which saves memory by recomputing activations during backpropagation instead of storing them all. The "unsloth" setting uses the Unsloth library's own implementation, which the library reports as using 30% less VRAM and fitting 2x larger batch sizes.

### random_state = 3407:
A seed for random number generation to ensure reproducibility of results.

### max_seq_length = max_seq_length:
The maximum sequence length for the model inputs. It is set to 2048 earlier in the notebook.

### use_rslora = False:
This indicates whether to use rank-stabilized LoRA (rsLoRA), which scales the adapter update by lora_alpha / sqrt(r) instead of lora_alpha / r. It is set to False here, so the standard scaling is used.
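
A quick numeric comparison shows how the two scaling rules behave as the rank changes. This is a sketch of the formulas only, with `lora_scale` as a hypothetical helper; it does not call Unsloth or PEFT.

```python
import math


def lora_scale(lora_alpha, r, use_rslora=False):
    """Multiplier applied to the low-rank update for a given alpha and rank."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized LoRA
    return lora_alpha / r                 # standard LoRA


for rank in (8, 16, 64):
    standard = lora_scale(16, rank)
    stabilized = lora_scale(16, rank, use_rslora=True)
    print(f"r={rank:>2}  alpha/r={standard:.3f}  alpha/sqrt(r)={stabilized:.3f}")
```

With `lora_alpha = 16` and `r = 16` as in this notebook, the standard scale is 1.0, while the rank-stabilized scale would be 4.0. The standard scale shrinks faster as the rank grows, which is the effect rsLoRA is meant to counteract.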

### loftq_config = None:
Configuration for LoftQ (a quantization technique). None indicates that LoftQ is not used.


In [None]:
# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)



## Trainer Parameters and Their Purpose

### model:
The language model that will be trained.

### train_dataset:
The dataset used for training the model.

### dataset_text_field:
Specifies the field in the dataset containing the text data.

### max_seq_length:
Maximum sequence length for the input data.

### tokenizer:
The tokenizer used to preprocess the text data.

### args:
TrainingArguments object containing various training parameters:
- **per_device_train_batch_size = 2**: Batch size per GPU.
- **gradient_accumulation_steps = 4**: Number of batches whose gradients are accumulated before each optimizer update, giving an effective batch size of 2 x 4 = 8.
- **warmup_steps = 10**: Number of warmup steps for the learning rate scheduler.
- **max_steps = 60**: Maximum number of training steps.
- **fp16 = not torch.cuda.is_bf16_supported()**: Uses FP16 mixed-precision training when the GPU does not support BF16 (bfloat16), as is the case for the Tesla T4 used here.
- **bf16 = torch.cuda.is_bf16_supported()**: Uses BF16 (bfloat16) mixed-precision training instead when the GPU supports it, so exactly one of the two flags is enabled.
- **logging_steps = 1**: Number of steps between logging.
- **output_dir = "outputs"**: Directory to save training outputs.
- **optim = "adamw_8bit"**: Optimization algorithm to use, here the 8-bit AdamW optimizer provided by bitsandbytes, which stores optimizer state in 8 bits to save memory.
- **seed = 3407**: Random seed for reproducibility.
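
The batch and warmup settings can be worked through by hand. The sketch below computes the effective batch size and a learning-rate multiplier per step. It assumes a single GPU (as the trainer banner reports) and a linear warmup followed by linear decay, which is the default scheduler in `TrainingArguments` when none is specified; `linear_schedule_multiplier` is a hypothetical helper that mirrors that shape and is not a library function.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1       # assumption: a single GPU, as on a free Colab runtime
max_steps = 60
warmup_steps = 10

# Samples contributing to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)


def linear_schedule_multiplier(step):
    """Learning-rate multiplier: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))


print("effective batch size:", effective_batch_size)  # 8
for step in (0, 5, 10, 35, 60):
    print(step, linear_schedule_multiplier(step))
```

The multiplier rises from 0 to 1 over the first 10 steps and then falls back to 0 at step 60; the actual learning rate is this multiplier times the configured base rate.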


In [13]:

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 25,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|------|---------------|
| 1 | 2.6063 |
| 2 | 2.3436 |
| 3 | 2.3417 |
| 4 | 2.4766 |
| 5 | 2.5381 |
| 6 | 2.6856 |
| 7 | 2.3512 |
| 8 | 2.3699 |
| 9 | 2.221 |
| 10 | 2.5378 |


TrainOutput(global_step=60, training_loss=2.4146055658658345, metrics={'train_runtime': 677.151, 'train_samples_per_second': 0.709, 'train_steps_per_second': 0.089, 'total_flos': 9642624560529408.0, 'train_loss': 2.4146055658658345, 'epoch': 0.0192})
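
The reported metrics can be cross-checked with simple arithmetic. The sketch below uses only numbers copied from the `TrainOutput` and the trainer banner above; the variable names are illustrative.

```python
# Values copied from the TrainOutput and trainer banner above.
global_step = 60
total_batch_size = 8
num_examples = 25_000
train_runtime = 677.151  # seconds

samples_processed = global_step * total_batch_size
epoch_fraction = samples_processed / num_examples
samples_per_second = samples_processed / train_runtime
steps_per_second = global_step / train_runtime

print(f"samples processed:  {samples_processed}")
print(f"epoch fraction:     {epoch_fraction:.4f}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"steps per second:   {steps_per_second:.3f}")
```

The run therefore saw 480 of the 25,000 reviews, about 1.9% of one epoch, which is worth keeping in mind when judging the generated text below.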

In [14]:

inputs = tokenizer(
    [
      "I really liked the movie because it shows emotions and talks humanity."
    ],
    return_tensors="pt",
).to("cuda")

In [15]:

inputs

{'input_ids': tensor([[    1,   315,  1528,  8232,   272,  5994,  1096,   378,  4370, 13855,
           304, 15066, 17676, 28723]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

In [16]:

outputs = model.generate(**inputs, max_new_tokens=128, use_cache=True)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [17]:
outputs

tensor([[    1,   315,  1528,  8232,   272,  5994,  1096,   378,  4370, 13855,
           304, 15066, 17676, 28723,   661,   349,   264,  1215,  1179,  5994,
           304,   315,  6557,   378,   298,  3376, 28723,   661,   349,   264,
          1215,  1179,  5994,   304,   315,  6557,   378,   298,  3376, 28723,
           661,   349,   264,  1215,  1179,  5994,   304,   315,  6557,   378,
           298,  3376, 28723,   661,   349,   264,  1215,  1179,  5994,   304,
           315,  6557,   378,   298,  3376, 28723,   661,   349,   264,  1215,
          1179,  5994,   304,   315,  6557,   378,   298,  3376, 28723,   661,
           349,   264,  1215,  1179,  5994,   304,   315,  6557,   378,   298,
          3376, 28723,   661,   349,   264,  1215,  1179,  5994,   304,   315,
          6557,   378,   298,  3376, 28723,   661,   349,   264,  1215,  1179,
          5994,   304,   315,  6557,   378,   298,  3376, 28723,   661,   349,
           264,  1215,  1179,  5994,   304,   315,  

In [18]:
tokenizer.batch_decode(outputs)

['<s> I really liked the movie because it shows emotions and talks humanity. It is a very good movie and I recommend it to everyone. It is a very good movie and I recommend it to everyone. It is a very good movie and I recommend it to everyone. It is a very good movie and I recommend it to everyone. It is a very good movie and I recommend it to everyone. It is a very good movie and I recommend it to everyone. It is a very good movie and I recommend it to everyone. It is a very good movie and I recommend it to everyone. It is a very good movie and I recommend it to everyone. It is a very good movie and I recommend it to']

In [19]:
# Saves only the LoRA adapter weights, not the full base model
model.save_pretrained("lora_model")

In [None]:
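import locale
# Colab workaround: report UTF-8 as the preferred encoding so that later `!pip` shell commands do not fail on a non-UTF-8 locale
locale.getpreferredencoding = lambda *args, **kwargs: "UTF-8"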


In [None]:
!pip install huggingface_hub

In [None]:

from huggingface_hub import notebook_login

In [None]:
notebook_login()

In [None]:

model.push_to_hub("skuma307/unsloth_4bit_mistral_imdb_model")