### News

This section makes sure the Unsloth library and its dependencies are installed correctly, whether you're running in a Google Colab/Kaggle notebook or on your local machine. The `--no-deps` commands below are meant for Colab and Kaggle only; on a local machine, a plain `pip install unsloth` is enough.
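One way to pick the install path automatically is to check for environment variables that hosted notebooks typically set. The sketch below is only a heuristic: the `COLAB_` and `KAGGLE_` prefixes and the helper names `in_hosted_notebook` and `install_commands` are assumptions made for illustration, not part of Unsloth.

```python
import os


def in_hosted_notebook(env=None):
    """Return True if the environment looks like Colab or Kaggle.

    Assumption: hosted notebooks expose variables whose names start with
    "COLAB_" or "KAGGLE_". This is a heuristic, not an official API.
    """
    env = os.environ if env is None else env
    return any(name.startswith(("COLAB_", "KAGGLE_")) for name in env)


def install_commands(hosted):
    """Return the pip commands from this notebook for the given environment."""
    if hosted:
        return [
            "pip install --no-deps bitsandbytes accelerate xformers==0.0.29 peft trl triton",
            "pip install --no-deps cut_cross_entropy unsloth_zoo",
            "pip install sentencepiece protobuf datasets huggingface_hub hf_transfer",
            "pip install --no-deps unsloth",
        ]
    return ["pip install unsloth"]


# Show which commands would be chosen in the current environment.
for command in install_commands(in_hosted_notebook()):
    print(command)
```

The function only returns the commands as strings; running them is left to the notebook cell, so nothing is installed by accident.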

### Installation

In [None]:

# Do this only in Colab and Kaggle notebooks! Otherwise use pip install unsloth
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29 peft trl triton
!pip install --no-deps cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install --no-deps unsloth

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.2-py3-none-manylinux_2_24_x86_64.whl.metadata (5.8 kB)
Collecting xformers==0.0.29
  Downloading xformers-0.0.29-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting trl
  Downloading trl-0.15.1-py3-none-any.whl.metadata (11 kB)
Downloading xformers-0.0.29-cp311-cp311-manylinux_2_28_x86_64.whl (15.3 MB)
Downloading bitsandbytes-0.45.2-py3-none-manylinux_2_24_x86_64.whl (69.7 MB)
Downloading trl-0.15.1-py3-none-any.whl (318 kB)
Installing collected packages: xformers, trl, bitsandbytes
Successfully installed bitsandbytes-0.45.2 trl-0.15.1 xformers-0.

**from unsloth import FastLanguageModel**: This line imports the FastLanguageModel class from the unsloth library. This class is central to working with the language models provided by Unsloth.

**import torch**: This line imports the torch library, which is a fundamental library for deep learning in Python.


**max_seq_length** = 2048: This sets the maximum sequence length for the model's input and output. It determines how many tokens the model can process at once.

**dtype** = None: This sets the data type for the model's weights. None enables automatic detection based on your hardware. You can also set it manually to torch.float16 (for GPUs such as the Tesla T4 or V100) or torch.bfloat16 (for Ampere or newer GPUs).

**load_in_4bit** = True: This enables 4-bit quantization, a technique to reduce the model's memory footprint, leading to faster loading and potentially allowing you to use larger models on your hardware.

**fourbit_models**: This is a list of pre-quantized models available in the Unsloth library. These models are optimized for faster downloading and reduced memory consumption.

**model, tokenizer** = FastLanguageModel.from_pretrained(...): This line loads the specified pre-trained language model and its associated tokenizer using the from_pretrained method of the FastLanguageModel class.

**model_name** = "unsloth/Llama-3.2-3B-Instruct": This specifies the name of the model to load. You can choose from the models listed in fourbit_models or other compatible models.

**max_seq_length, dtype, load_in_4bit**: These parameters are passed to the from_pretrained method to configure the model loading process.

In summary, this code snippet imports the necessary libraries, sets the parameters for model configuration, provides a list of available pre-quantized models, and finally loads the selected language model along with its tokenizer for use in subsequent tasks.
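To get a feel for why 4-bit loading matters, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. The parameter count of 3 billion is an assumption based on the "3B" in the model name, and the estimate ignores quantization metadata, activations, and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


NUM_PARAMS = 3e9  # assumption: roughly 3 billion weights for a "3B" model

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # 16 bits per weight
int4_gib = weight_memory_gib(NUM_PARAMS, 4)   # 4 bits per weight

print(f"16-bit weights: {fp16_gib:.2f} GiB")
print(f" 4-bit weights: {int4_gib:.2f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the space of the 16-bit weights, which is what leaves room for training on a small GPU.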

### Unsloth

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


This code is focused on efficiently fine-tuning a pre-trained language model using a technique called LoRA **(Low-Rank Adaptation)**. LoRA is a method that allows you to fine-tune large language models with significantly fewer trainable parameters, making the process faster and requiring less memory.

**model** = FastLanguageModel.get_peft_model(...): This line calls the get_peft_model function from the FastLanguageModel class. This function is responsible for adding LoRA adapters to the existing pre-trained model.

**model**: This is the pre-trained language model that you loaded earlier using FastLanguageModel.from_pretrained. It's the base model you want to fine-tune.

**r** = 16: This parameter (r) controls the rank of the LoRA adapters. It's a hyperparameter that you can adjust, and it determines the number of trainable parameters introduced by LoRA. The code suggests some typical values (8, 16, 32, 64, 128). A higher rank gives the adapters more capacity, which can improve results on harder tasks, but it also increases the number of trainable parameters and the memory needed.

**target_modules** = [...]: This list specifies the names of the modules within the base model where you want to apply LoRA adapters. Here they cover the attention projections (q_proj, k_proj, v_proj, o_proj) and the feed-forward (MLP) projections (gate_proj, up_proj, down_proj) of each transformer layer.

**lora_alpha** = 16: This parameter (lora_alpha) is a scaling factor used in the LoRA update. In standard LoRA the adapter output is scaled by lora_alpha / r, so with r = 16 and lora_alpha = 16 the scale is 1.

**lora_dropout** = 0: This sets the dropout rate for the LoRA adapters. Dropout is a regularization technique used to prevent overfitting during training. While it supports any value, 0 is optimized for performance in this case.

**bias** = "none": This parameter controls which bias terms, if any, are trained alongside the LoRA adapters. With "none", no bias parameters are trained, which is the setting Unsloth optimizes for.

**use_gradient_checkpointing** = "unsloth": This option enables gradient checkpointing, a technique to reduce memory consumption during training, especially for models with long context. The "unsloth" setting is a specific implementation optimized for Unsloth's library.

**random_state** = 3407: This is used to seed the random number generator, ensuring reproducibility of your experiments.

**use_rslora** = False: This indicates whether to use rank-stabilized LoRA (RS-LoRA), a variation of LoRA. It is set to False in this example.

**loftq_config** = None: This parameter is related to another quantization technique called LoftQ. It's set to None, meaning it's not used in this case.

In summary, this code snippet adds LoRA adapters to your pre-trained language model to prepare it for efficient fine-tuning. This enables you to adapt the model for specific tasks with minimal changes to the base model's architecture while using fewer resources compared to traditional fine-tuning methods.
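To see how the rank translates into trainable parameters, recall that LoRA adds two small matrices per adapted weight: one of shape (r, d_in) and one of shape (d_out, r), for r × (d_in + d_out) extra parameters. The sketch below counts them for the seven target modules. The layer count and projection sizes are assumptions about the Llama-3.2-3B architecture that this notebook does not state; with them, the total matches the 24,313,856 trainable parameters reported in the training log further down.

```python
R = 16  # LoRA rank used in this notebook

# Assumed architecture values for Llama-3.2-3B (not stated in this notebook).
NUM_LAYERS = 28
HIDDEN = 3072        # model (embedding) dimension
KV_DIM = 1024        # key/value projection width (grouped-query attention)
INTERMEDIATE = 8192  # MLP hidden dimension

# (d_in, d_out) of each adapted linear layer inside one transformer block.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in, d_out, r):
    """Parameters added by one adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Because the count is linear in r, doubling the rank doubles the number of trainable parameters.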

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

<a name="Data"></a>
### Data Prep
We now use the `Llama-3.1` format for conversation-style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style, but we convert it to Hugging Face's normal multi-turn format `("role", "content")` instead of `("from", "value")`. Llama-3 renders multi-turn conversations like below:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```

We use the `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

This code defines a function called formatting_prompts_func. This function's primary purpose is to take a batch of examples from a dataset and format them for the language model to understand.

**def formatting_prompts_func(examples)**: This line defines the function formatting_prompts_func, which takes one argument called examples. examples is expected to be a dictionary-like object (a batch of data) containing the conversations.

**convos** = examples["conversations"]: This line extracts the conversations from the examples dictionary and assigns them to the convos variable. The "conversations" key is assumed to exist within the examples dictionary and hold the actual conversation data.

**texts** = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]: This line is more complex. Let's break it down further:

It uses a list comprehension to iterate through each convo in the convos list. For each convo, it calls the apply_chat_template method of the tokenizer object, which formats the conversation into the specific structure that the language model expects.

**tokenize** = False indicates that the method should return a formatted string rather than token IDs.

**add_generation_prompt** = False means the assistant header that would prompt the model to start generating is not appended, because the training examples already contain the assistant's replies.

Each formatted string becomes one element of the texts list.

**return { "text" : texts, }**: After processing all conversations, the function returns a dictionary where the key is "text" and the value is the texts list containing the formatted conversations.

**pass**: This statement does nothing. In the cell below it sits after the function body, at the top level, where it has no effect and only serves as a visual marker for the end of the function.

In essence, this function takes raw conversation data, applies a specific chat template to format it, and then returns the formatted data ready to be used by the language model.
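The following sketch imitates what the formatting function does, without needing the real tokenizer. `render_llama3` is a hypothetical, hand-written stand-in for `tokenizer.apply_chat_template`; it reproduces the basic Llama-3 layout shown earlier, but the real Llama 3.1 template also inserts a system block with dates, so treat this as an approximation.

```python
def render_llama3(convo):
    """Hypothetical stand-in for tokenizer.apply_chat_template (simplified)."""
    text = "<|begin_of_text|>"
    for message in convo:
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    return text


def formatting_prompts_sketch(examples):
    """Format a batch: one rendered string per conversation."""
    convos = examples["conversations"]
    texts = [render_llama3(convo) for convo in convos]
    return {"text": texts}


# A batch is a dict of columns; here it holds a single conversation.
batch = {
    "conversations": [
        [
            {"role": "user", "content": "Hello!"},
            {"role": "assistant", "content": "Hey there! How are you?"},
            {"role": "user", "content": "I'm great thanks!"},
        ]
    ]
}

formatted = formatting_prompts_sketch(batch)
print(formatted["text"][0])
```

The function receives and returns a dictionary of lists, which is the shape a batched `map` call works with.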

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train")

In [None]:
dataset

Dataset({
    features: ['conversations', 'source', 'score'],
    num_rows: 100000
})

We now use `standardize_sharegpt` to convert ShareGPT-style datasets into Hugging Face's generic format. This changes the dataset from looking like:
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```

This section of the code deals with preparing the dataset for training the language model. It involves two main steps: standardization and formatting.

**from unsloth.chat_templates import standardize_sharegpt**: This line imports the standardize_sharegpt function from the unsloth.chat_templates module. This function is specifically designed to handle datasets in the ShareGPT format.

**dataset** = standardize_sharegpt(dataset): This line applies the standardize_sharegpt function to the dataset variable. It converts the dataset from the ShareGPT format (using "from" and "value" keys) to the Hugging Face format (using "role" and "content" keys).

This standardization step ensures that the dataset is in the role/content format that the tokenizer's chat template expects.

**dataset = dataset.map(...)**: The map function is a common operation in dataset processing. It applies a given function (formatting_prompts_func in this case) to each element of the dataset.

**formatting_prompts_func**: This is a custom function (defined earlier in the code) that takes a batch of conversation data and formats it according to the specific template required by the language model.

**batched** = True: This argument indicates that the formatting_prompts_func should be applied to batches of data rather than individual elements. This can significantly speed up the process.

In short, this part of the code standardizes the dataset to a common format and then applies a specific formatting function to structure the conversations for the language model's understanding. This preparation is crucial for successful fine-tuning.
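The key renaming that standardization performs can be sketched in a few lines of plain Python. This is not Unsloth's implementation: `ROLE_MAP` and the two helper functions are hypothetical and cover only the three speaker names shown in the example above, whereas the real function may handle more cases.

```python
# Hypothetical mapping from ShareGPT speaker names to chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def standardize_turn(turn):
    """Convert one {"from", "value"} turn into a {"role", "content"} turn."""
    return {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}


def standardize_conversation(conversation):
    """Convert every turn of a ShareGPT-style conversation."""
    return [standardize_turn(turn) for turn in conversation]


sharegpt_conversation = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]

standardized = standardize_conversation(sharegpt_conversation)
for turn in standardized:
    print(turn)
```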

In [None]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

We look at how the conversations are structured for item 5:

In [None]:
dataset[5]["conversations"]

[{'content': 'How do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?',
  'role': 'user'},
 {'content': 'Astronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.',
  'role': 'assistant'}]

And we see how the chat template transformed these conversations.

**[Notice]** Llama 3.1 Instruct's default chat template adds `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"`, so do not be alarmed!

In [None]:
dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|

This section sets up the training run using the Hugging Face Transformers and TRL (Transformer Reinforcement Learning) libraries. The training itself starts later, when trainer.train() is called.

**from trl import SFTTrainer**: This line imports the SFTTrainer class from the trl library. SFTTrainer is a specialized trainer for Supervised Fine-Tuning (SFT) of language models. It provides a framework for training language models on a dataset of input-output pairs.

**from transformers import TrainingArguments, DataCollatorForSeq2Seq**: This imports necessary components from the transformers library:

**TrainingArguments**: This class holds all the hyperparameters and configurations for the training process (learning rate, batch size, number of epochs, etc.).

**DataCollatorForSeq2Seq**: This prepares each batch for the model during training. It pads the already tokenized inputs and labels in a batch to a common length; it does not tokenize the text itself.


**from unsloth import is_bfloat16_supported**: This imports a utility function is_bfloat16_supported from the unsloth library. This function checks if your hardware supports the bfloat16 data type, which can be more efficient for training on certain GPUs.

**Trainer Initialization**


**trainer = SFTTrainer(...)**: This line creates an instance of the SFTTrainer class, which will manage the training process. Let's examine the arguments provided:

**model = model**: This passes the pre-trained language model (with LoRA adapters) that you loaded and configured earlier. This is the model that will be fine-tuned.

**tokenizer = tokenizer**: This provides the tokenizer associated with the model. The tokenizer is used to convert text into numerical representations that the model can understand.

**train_dataset = dataset**: This specifies the dataset that will be used for training. This dataset should contain conversations or text examples in the format expected by the model.

**dataset_text_field = "text"**: This tells the trainer where to find the text data within each example in the dataset. In this case, it's assumed that the text is stored under the key "text".


**max_seq_length = max_seq_length**: This sets the maximum sequence length for the input to the model during training. Sequences longer than this will be truncated.


**data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer)**: This provides the data collator, which prepares the data for input to the model. It uses the specified tokenizer.

**dataset_num_proc = 2**: This sets the number of processes to use for data preprocessing. Using multiple processes can speed up data loading.

**packing = False**: This controls whether to pack multiple shorter examples into a single sequence of up to max_seq_length tokens. Packing can improve training speed on short sequences but is disabled here.

**args = TrainingArguments(...)**: This passes an instance of TrainingArguments containing all the hyperparameters for the training process. Let's break down the important ones:

**per_device_train_batch_size = 2**: The number of training examples to process in each batch on each device (GPU).

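**gradient_accumulation_steps = 4**: The number of steps to accumulate gradients before updating the model's weights. This effectively increases the batch size without requiring more GPU memory: with a per-device batch size of 2 and 4 accumulation steps, each weight update uses 2 × 4 = 8 examples.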

**warmup_steps = 5**: The number of initial steps where the learning rate is gradually increased.

**max_steps = 60**: The total number of training steps to perform.

**learning_rate = 2e-4**: The learning rate for the optimizer.

**fp16 = not is_bfloat16_supported()**: Whether to train in 16-bit floating-point (float16) mixed precision. It is enabled only when bfloat16 is not available, as on the Tesla T4 used here.

**bf16 = is_bfloat16_supported()**: Whether to use bfloat16 precision instead, if your hardware supports it.

**logging_steps = 1**: How often to log training progress (every step in this case).

**optim = "adamw_8bit"**: The optimizer to use for training (AdamW with 8-bit optimization).

**weight_decay = 0.01**: A regularization technique to prevent overfitting.

**lr_scheduler_type = "linear"**: The type of learning rate scheduler to use.

**seed = 3407**: A random seed for reproducibility.

**output_dir = "outputs"**: The directory where training outputs will be saved.

**report_to = "none"**: Disables reporting to external platforms like Weights & Biases.
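To make the interplay of warmup_steps, max_steps, learning_rate and the "linear" scheduler concrete, the sketch below computes the learning rate at a given step. It is modeled on the usual linear-warmup, linear-decay shape; the function name `linear_schedule_lr` is hypothetical, and the real scheduler may differ in off-by-one details.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=60):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: ramp linearly from base_lr down to 0 at max_steps.
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


for step in (0, 1, 5, 30, 60):
    print(f"step {step:2d}: lr = {linear_schedule_lr(step):.2e}")
```

With only 60 steps in total, the learning rate spends 5 steps rising and the remaining 55 steps falling back to zero.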

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`. We also support TRL's `DPOTrainer`!
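A quick calculation shows how small the 60-step run is compared with a full epoch. The numbers come from this notebook (100,000 examples, per-device batch size 2, 4 gradient accumulation steps, one GPU); the variable names are only illustrative.

```python
import math

NUM_EXAMPLES = 100_000          # rows in the FineTome-100k training split
PER_DEVICE_BATCH_SIZE = 2
GRADIENT_ACCUMULATION_STEPS = 4
NUM_GPUS = 1
MAX_STEPS = 60

# Examples consumed per optimizer step.
effective_batch_size = PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_GPUS

# Optimizer steps needed to see every example once.
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch_size)

# Examples actually seen in the short demo run.
examples_seen = MAX_STEPS * effective_batch_size
fraction_of_epoch = examples_seen / NUM_EXAMPLES

print(f"Effective batch size: {effective_batch_size}")
print(f"Steps for one epoch:  {steps_per_epoch}")
print(f"Examples seen in {MAX_STEPS} steps: {examples_seen} ({fraction_of_epoch:.2%} of the data)")
```

So the demo run touches well under 1% of the dataset, which is why it finishes in minutes.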

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Converting train dataset to ChatML (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]

Applying chat template to train dataset (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]


This code snippet is designed to modify the training process so that the language model is only trained on the assistant's responses and ignores the loss on the user's inputs. This is a common practice in fine-tuning chatbots, as it helps the model learn to generate better responses rather than trying to predict the user's prompts.

**from unsloth.chat_templates import train_on_responses_only:** This line imports the train_on_responses_only function from the unsloth.chat_templates module. This function is the key to achieving the desired behavior.

**trainer = train_on_responses_only(...):** This line calls the imported train_on_responses_only function and assigns the result back to the trainer variable. This essentially modifies the existing trainer object to incorporate the new training logic.

**trainer:** This is the SFTTrainer object that was created earlier and is responsible for managing the training process.

**instruction_part** = "<|start_header_id|>user<|end_header_id|>\n\n": This argument specifies the pattern that identifies the user's instruction or prompt within the training data. It is the header that marks the beginning of each user turn in the conversation.

**response_part** = "<|start_header_id|>assistant<|end_header_id|>\n\n": Similar to the previous argument, this one specifies the pattern that identifies the assistant's response within the training data. It is the header that marks the beginning of each assistant turn; the tokens that follow it are the ones the model is trained on.

The train_on_responses_only function takes the trainer object, the patterns for identifying user instructions and assistant responses, and then modifies the training process to mask out the user's input during loss calculation. This ensures that the model is only evaluated and updated based on how well it generates the assistant's responses.
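The masking idea can be illustrated on a toy token sequence. The sketch below is a simplified illustration, not Unsloth's implementation: the token IDs, the marker sequences, and the helper names `find_sub` and `mask_non_responses` are all made up for the example. Labels outside assistant responses are set to -100, the value the loss ignores.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def find_sub(seq, sub, start=0):
    """Return the index of the first occurrence of `sub` in `seq`, or -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        marker = find_sub(input_ids, response_ids, pos)
        if marker == -1:
            break
        start = marker + len(response_ids)
        # The response runs until the next instruction marker or the end.
        next_instruction = find_sub(input_ids, instruction_ids, start)
        end = len(input_ids) if next_instruction == -1 else next_instruction
        labels[start:end] = input_ids[start:end]
        pos = end
    return labels


# Toy IDs: 0 = begin-of-text, 9 = end-of-turn, [1, 2] = user header,
# [1, 3] = assistant header, everything else = ordinary text tokens.
USER_HEADER = [1, 2]
ASSISTANT_HEADER = [1, 3]
input_ids = [0, 1, 2, 10, 11, 9, 1, 3, 20, 21, 9, 1, 2, 12, 9, 1, 3, 22, 9]

labels = mask_non_responses(input_ids, USER_HEADER, ASSISTANT_HEADER)
print(labels)
```

Only the assistant's tokens and their end-of-turn markers keep their IDs; the user turns and the headers themselves are masked.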

We also use Unsloth's `train_on_responses_only` function to train only on the assistant outputs and ignore the loss on the user's inputs.

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

This line takes the example at index 5 of your training dataset, gets the token IDs for that text, and uses the tokenizer to decode them back into text you can read.

We verify that masking is actually done:

In [None]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight rel

In [None]:
trainer.train_dataset[5]["input_ids"]

[128000,
 128000,
 128006,
 9125,
 128007,
 271,
 38766,
 1303,
 33025,
 2696,
 25,
 6790,
 220,
 2366,
 18,
 198,
 15724,
 2696,
 25,
 220,
 1627,
 5887,
 220,
 2366,
 19,
 271,
 128009,
 128006,
 882,
 128007,
 271,
 4438,
 656,
 87887,
 8417,
 279,
 4113,
 46406,
 315,
 3177,
 48042,
 555,
 264,
 77480,
 2547,
 520,
 2800,
 11,
 902,
 374,
 5995,
 369,
 30090,
 1202,
 4732,
 1701,
 279,
 97674,
 13206,
 2515,
 30,
 128009,
 128006,
 78191,
 128007,
 271,
 32,
 496,
 14609,
 388,
 1304,
 1005,
 315,
 279,
 5016,
 57077,
 77777,
 315,
 5540,
 1766,
 304,
 9958,
 13,
 4314,
 5540,
 17105,
 323,
 35406,
 3177,
 520,
 3230,
 11,
 3967,
 93959,
 11,
 30164,
 459,
 44225,
 20326,
 13,
 3296,
 42118,
 279,
 3177,
 4036,
 505,
 29827,
 9958,
 323,
 27393,
 433,
 311,
 279,
 27692,
 35073,
 40412,
 63697,
 315,
 1521,
 5540,
 11,
 87887,
 649,
 10765,
 279,
 29735,
 304,
 1521,
 93959,
 4245,
 311,
 279,
 97674,
 13206,
 2515,
 13,
 578,
 13468,
 6541,
 10975,
 1124,
 279,
 13112,
 311,
 902,

These lines are used to verify that the masking process during training is working correctly. It focuses on how the model is trained to predict only the assistant's responses and ignore the user's prompts. Here's a step-by-step explanation:

**space** = tokenizer(" ", add_special_tokens = False).input_ids[0]:

**tokenizer(" ", add_special_tokens = False)**: This part uses the tokenizer to process a single space character (" "). The add_special_tokens = False argument ensures that no special tokens (like beginning-of-sequence or end-of-sequence tokens) are added. The result is a dictionary-like object containing the tokenized representation of the space.

**.input_ids[0]**: This extracts the numerical ID of the space token from the tokenized representation. This ID will be used to replace the masked positions.

**space** = ...: The result (the numerical ID of the space) is assigned to the variable space.

**tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])**:

**trainer.train_dataset[5]["labels"]**: This accesses the "labels" of the example at index 5 in the training dataset. The "labels" represent the target sequence the model should learn to predict.

**[space if x == -100 else x for x in ...]**: This is a list comprehension that iterates through each element (x) in the "labels" list.

If x is equal to -100, this position was masked during data preparation; -100 is the label value that the loss function ignores. In this case, it's replaced with the space token ID so that the sequence can be decoded.

If x is not -100, it means it's part of the actual target sequence, and it's kept as is.

**tokenizer.decode(...)**: This takes the modified list of token IDs and uses the tokenizer to convert them back into human-readable text.
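The same replace-then-decode trick can be tried on a toy example without loading a model. The vocabulary, the `toy_decode` helper, and the label values below are invented for illustration; only the list comprehension mirrors the notebook's code.

```python
IGNORE_INDEX = -100  # marks masked positions in the labels

# Made-up vocabulary standing in for the real tokenizer.
toy_vocab = {0: "<bos>", 1: "Hi", 2: " ", 3: "there", 4: "<eot>"}


def toy_decode(token_ids):
    """Hypothetical stand-in for tokenizer.decode."""
    return "".join(toy_vocab[token_id] for token_id in token_ids)


labels = [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 3, 4]
space = 2  # ID of the space token in the toy vocabulary

# Replace masked positions with spaces so the sequence can be decoded.
visible = [space if x == IGNORE_INDEX else x for x in labels]
decoded = toy_decode(visible)

# Share of positions that do not contribute to the loss.
masked_fraction = labels.count(IGNORE_INDEX) / len(labels)

print(repr(decoded))
print(f"Masked positions: {masked_fraction:.0%}")
```

The leading run of spaces in the decoded string corresponds to the masked prompt tokens, just like the blank prefix in the notebook's output.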

In [None]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                 \n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|eot_id|>'

We can see the System and Instruction prompts are successfully masked!

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
4.51 GB of memory reserved.


This single line of code is where the fine-tuning process actually begins.

**trainer**: This is the SFTTrainer object that you previously configured with all the settings for training (dataset, model, hyperparameters, etc.). It's essentially the manager of the training process.

**.train()**: This is a method of the SFTTrainer class that starts the training loop. The model will begin learning from the data you provided. It iterates over the dataset, calculates the loss, updates the model's weights, and repeats this process for the specified number of steps or epochs.

**trainer_stats** = ...: The trainer.train() method returns information about the training process, such as the training loss, time taken, and other metrics. These stats are assigned to the variable trainer_stats so you can access and analyze them later if needed.

Imagine trainer as a coach, .train() as the command to start the training session, and trainer_stats as the report card after the session. This line initiates the actual learning phase for your model.

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 100,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 24,313,856


| Step | Training Loss |
|------|---------------|
| 1 | 0.7726 |
| 2 | 0.8361 |
| 3 | 1.0706 |
| 4 | 0.8888 |
| 5 | 0.755 |
| 6 | 0.9326 |
| 7 | 0.6174 |
| 8 | 0.9933 |
| 9 | 0.858 |
| 10 | 0.7592 |


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

483.0789 seconds used for training.
8.05 minutes used for training.
Peak reserved memory = 4.51 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 30.595 %.
Peak reserved memory for training % of max memory = 0.0 %.
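
The figures above can be checked by hand. The sketch below repeats the cell's arithmetic with plain numbers instead of `torch.cuda` calls. Two inputs are assumptions: the starting memory of 4.51 GB is inferred from the 0.0 GB difference printed above (it is measured in an earlier cell), and 14.741 GB is the T4 maximum reported by the Unsloth banner elsewhere in this notebook. The helper name `memory_report` is hypothetical.

```python
GIB = 1024 ** 3


def memory_report(peak_bytes, start_gb, max_gb):
    """Mirror the arithmetic of the stats cell using plain numbers."""
    peak_gb = round(peak_bytes / GIB, 3)
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "peak_gb": peak_gb,
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / max_gb * 100, 3),
        "training_pct": round(training_gb / max_gb * 100, 3),
    }


report = memory_report(
    peak_bytes=int(4.51 * GIB),  # illustrative byte count that rounds to 4.51 GB
    start_gb=4.51,               # assumed: inferred from the 0.0 GB result above
    max_gb=14.741,               # assumed: the T4 figure from the Unsloth banner
)
print(report)
```

A training figure of 0.0 GB does not mean training used no memory. It means the peak reserved memory never rose above what was already reserved when the starting measurement was taken.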


**from unsloth.chat_templates import get_chat_template**: This line imports a function called get_chat_template which is used to format the prompts for the model.

**tokenizer** = get_chat_template(...): This applies the "llama-3.1" chat template to your tokenizer. This template defines how the model expects the input to be structured, following the style of the Llama 3.1 model.

**FastLanguageModel.for_inference(model)**: This switches the model into Unsloth's inference mode, which the notebook describes as enabling native 2x faster inference.

**Providing Input to the Model**


**messages** = [...]: This defines the input to the model. In this case, it's a list containing a single message, simulating a user asking the model to continue the Fibonacci sequence.

**inputs** = tokenizer.apply_chat_template(...): This takes the messages, applies the chat template, tokenizes the result (converts it to token IDs), appends the assistant header that signals the model to start its reply, and returns PyTorch tensors. Finally, .to("cuda") moves the tensors to the GPU, so this cell requires a CUDA device.

**Generating Output**


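**outputs** = model.generate(...): This is the core of inference. It uses the model to generate a response based on the provided inputs.

**max_new_tokens** = 64: Limits the generated output to a maximum of 64 tokens.

**use_cache** = True: Reuses the attention key/value cache from earlier tokens, so each new token does not require recomputing the whole sequence.

**temperature** = 1.5, min_p = 0.1: These sampling parameters control how random the generated text is. A temperature above 1 flattens the token probability distribution, making less likely tokens more probable, while min_p = 0.1 discards any token whose probability is below 10% of the most likely token's probability. Both only take effect when sampling is enabled.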

**tokenizer.batch_decode(outputs)**: Finally, this line takes the numerical output (outputs) from the model and decodes it back into human-readable text using the tokenizer. This is the final prediction that the model produces.
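
The interaction between temperature and min_p is easier to see on a toy distribution. The sketch below is a plain-Python illustration, not the transformers implementation. It assumes temperature scaling is applied before the min_p filter, and the function name `min_p_filter` and the example logits are invented for this example.

```python
import math


def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return next-token probabilities after temperature scaling and min_p filtering."""
    # Temperature: dividing logits by a value above 1 flattens the distribution.
    scaled = [logit / temperature for logit in logits]

    # Softmax (subtracting the peak keeps the exponentials numerically stable).
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # min_p: drop tokens whose probability is below min_p times the top probability.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]

    # Renormalize so the surviving probabilities sum to 1.
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


probs = min_p_filter([4.0, 3.0, 0.0, -4.0])
print([round(p, 3) for p in probs])
```

With these logits the two weakest candidates fall below the threshold and are removed, so sampling can only pick between the two strongest tokens. This is why a high temperature combined with min_p stays varied without drifting into very unlikely tokens.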

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nContinue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding numbers. The sequence you provided starts with 1, 1, 2, 3, 5, and 8. Here are the next three numbers in the sequence:\n9, 14, 23<|eot_id|>']
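
The decoded output above shows what the chat template produces around the user's message. As a rough illustration, the sketch below rebuilds that prompt string by hand. It is an approximation covering only the layout visible in this output. The real template ships with the tokenizer and handles more cases, and the function name `render_llama31_prompt` is hypothetical.

```python
def render_llama31_prompt(messages, add_generation_prompt=True):
    """Hand-written approximation of the prompt layout seen in the output above."""
    # Default system block, copied verbatim from the decoded output.
    parts = [
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        "Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|>"
    ]
    # Each message is wrapped in a role header and terminated with <|eot_id|>.
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # The generation prompt is an open assistant header for the model to continue.
    if add_generation_prompt:
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = render_llama31_prompt(
    [{"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"}]
)
print(prompt)
```

This makes the role of `add_generation_prompt = True` visible: without the trailing assistant header, the prompt would end after the user's turn and the model would have no cue to answer as the assistant.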

**Inference with Text Streaming**


This code snippet focuses on using the TextStreamer from the transformers library to see the model's output as it's generated, token by token, rather than waiting for the entire generation to complete.

**Optimizing for Inference**


**FastLanguageModel.for_inference(model)**: This line is specific to the Unsloth library. It switches the model into Unsloth's inference mode, which the notebook describes as enabling native 2x faster inference.

**messages**: This defines the input to the model, structured as a list of dictionaries. Each dictionary represents a turn in the conversation, with "role" indicating who is speaking (user or assistant) and "content" containing the message.

**inputs** = tokenizer.apply_chat_template(...): This line processes the messages using the tokenizer:

**apply_chat_template**: Formats the messages according to the chat template attached to the tokenizer (the "llama-3.1" template set earlier).

**tokenize** = True: Tokenizes the text, converting it into the token IDs the model understands.

**add_generation_prompt** = True: Appends the assistant header to the prompt, which signals the model to start generating its reply.

**return_tensors** = "pt": Returns the output as PyTorch tensors.

**.to("cuda")**: Moves the tensors to the GPU. This call fails if no CUDA device is available.

**from transformers import TextStreamer**: Imports the TextStreamer class.

**text_streamer** = TextStreamer(tokenizer, skip_prompt = True): Creates a TextStreamer instance. With skip_prompt = True, the input prompt is not echoed and only the newly generated text is printed.

**_ = model.generate(...)**: This uses the model to generate text, but instead of waiting for the entire generation, it streams the output:

**input_ids** = inputs: Provides the tokenized input.

**streamer** = text_streamer: Uses the text_streamer to handle the output stream.

**max_new_tokens** = 128: Limits the generated output to a maximum of 128 tokens.

**use_cache** = True: Enables caching for faster generation.

**temperature** = 1.5, min_p = 0.1: The same sampling settings as in the previous example: a high temperature for more varied output, with min_p filtering out very unlikely tokens.

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole response!
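
The streaming idea can be sketched without a model. In the toy version below, a generator loop hands each piece of text to a streamer object as soon as it is produced. `PrintStreamer` and `fake_generate` are hypothetical names for illustration. The real `TextStreamer` receives token IDs from `model.generate` and decodes them with the tokenizer, whereas this sketch passes ready-made strings.

```python
class PrintStreamer:
    """Hypothetical stand-in for a streamer: prints each piece as soon as it arrives."""

    def __init__(self):
        self.pieces = []

    def put(self, piece):
        # Called once per generated piece, before generation has finished.
        self.pieces.append(piece)
        print(piece, end="", flush=True)

    def end(self):
        # Called once when generation is complete.
        print()


def fake_generate(pieces, streamer):
    """Toy generation loop: emits pre-made text pieces one at a time."""
    for piece in pieces:
        streamer.put(piece)
    streamer.end()


streamer = PrintStreamer()
fake_generate(["13", ", ", "21", ", ", "34"], streamer)
```

The design point is that output is handed over piece by piece inside the generation loop, so the caller sees text immediately and can discard the return value, as the notebook does with `_ = model.generate(...)`.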

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

Certainly! Here is the next few terms in the Fibonacci sequence: 13, 21, 34, 55, 89, 144, 233. Let me know if you need anything else!<|eot_id|>


<a name="Save"></a>
### Saving, loading trained/finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model_rahulraj")  # Local saving
tokenizer.save_pretrained("lora_model_rahulraj")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model_rahulraj/tokenizer_config.json',
 'lora_model_rahulraj/special_tokens_map.json',
 'lora_model_rahulraj/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, make sure the condition in the next cell is `True` (change `False` to `True` if needed):

In [None]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model_rahulraj", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Describe a tall tower in the capital of France."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
The Eiffel Tower in Paris, the capital of France, is an iconic tower that stands at over 1,000 feet (300 meters) high. It was built for the 1889 World's Fair to celebrate the 100th anniversary of the French Revolution. The tower is supported by four pillars located at its base, and its latticework structure allows it to flex and resist strong winds.

The Eiffel Tower was the tallest man-made structure in the world for nearly 40 years and has become a symbol of Paris and the French Republic. The tower has been repurposed for various uses, includi

**model.push_to_hub_merged(...)**: This call saves the model and uploads it to the Hugging Face Hub. It is a method that Unsloth adds to the model for this purpose.

**"mynamerahulkumar/model"**: This is the repository name on the Hugging Face Model Hub where your model will be saved. It's in the format "username/model_name". In this case, it's saving to a repository named "model" under the user "mynamerahulkumar".

**tokenizer**: This argument passes the tokenizer associated with your model. The tokenizer is crucial for converting text into numerical representations that the model understands. When you load the model later, you'll also need the tokenizer to use it correctly.

**save_method** = "merged_16bit": This argument specifies how the model should be saved. "merged_16bit" merges the LoRA adapters into the base weights and saves the result in 16-bit floating-point precision. The merged model can be loaded on its own, without a separate adapter, and 16-bit weights take half the space of 32-bit ones.

**token** = "hf_...": This is your Hugging Face user access token, which authenticates and authorizes the upload to your account. You can create one at https://huggingface.co/settings/tokens. Treat it like a password and do not leave a real token in a notebook you share.
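
What "merging" means can be shown with tiny matrices. In LoRA, the fine-tuned weight is the base weight plus a scaled low-rank product, W + (alpha / r) * B * A. The sketch below applies that formula in plain Python. It illustrates the arithmetic only, not Unsloth's merging code, and the helper names and example values are invented here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(base_weight, lora_a, lora_b, lora_alpha, rank):
    """Fold a LoRA update into the base weight: W + (alpha / r) * B @ A."""
    scale = lora_alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base_weight, delta)
    ]


base = [[1.0, 0.0], [0.0, 1.0]]  # 2x2 base weight
lora_b = [[1.0], [2.0]]          # 2x1 adapter matrix B (rank 1)
lora_a = [[0.5, -0.5]]           # 1x2 adapter matrix A
merged = merge_lora(base, lora_a, lora_b, lora_alpha=2, rank=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at load time, which is why a merged checkpoint behaves like an ordinary full model.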

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
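
Rather than pasting a token into the notebook, one option is to read it from an environment variable. The sketch below is a small helper for that; the function name `get_hf_token` is hypothetical, and it assumes you have stored the token under `HF_TOKEN` before running the cell.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face token from the environment instead of hardcoding it."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable before pushing to the Hub."
        )
    return token


# Example use with the call from this notebook (left commented out on purpose):
# model.push_to_hub_merged("your_name/model", tokenizer,
#                          save_method = "merged_16bit", token = get_hf_token())
```

This keeps the secret out of the saved notebook and its outputs, so sharing the file does not expose your account.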

In [None]:
model.push_to_hub_merged("mynamerahulkumar/model", tokenizer, save_method = "merged_16bit", token = "hf_...") # Use your own token; never share a real one

Unsloth: You are pushing to hub, but you passed your HF username = mynamerahulkumar.
We shall truncate mynamerahulkumar/model to model
Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.4G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.63 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:01<00:00, 22.39it/s]


Unsloth: Saving tokenizer...

No files have been modified since last commit. Skipping to prevent empty commit.


 Done.
Unsloth: Saving model/pytorch_model-00001-of-00002.bin...
Unsloth: Saving model/pytorch_model-00002-of-00002.bin...


README.md:   0%|          | 0.00/627 [00:00<?, ?B/s]

  0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/1.46G [00:00<?, ?B/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/mynamerahulkumar/model


The next cell saves locally instead of uploading. save_pretrained_merged merges the LoRA adapters into the trained model and writes the result, together with the tokenizer needed to encode its input and decode its output, to a local folder named "model". With merged_16bit, the merged weights are stored in 16-bit precision.



In [None]:
# Merge to 16bit
model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)


**save_method** = "merged_4bit": This argument specifies how the model should be saved:

**merged_4bit**: The model and its LoRA adapters (the changes learned during fine-tuning) are merged together and saved using 4-bit quantization.

Because the adapters are merged in, the saved model can be loaded and used directly, without attaching adapters separately.

**4-bit Quantization** is a technique to reduce the model's size by storing its weights using fewer bits, making it more efficient for storage and download.
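
To show what storing weights in 4 bits involves, here is a deliberately simplified sketch: each weight is mapped to one of 16 integer codes (-8 to 7) plus a shared scale. This is a basic absmax scheme chosen for clarity, not the quantization format that Unsloth or bitsandbytes actually uses, and the function names are made up for this example.

```python
def quantize_4bit(weights):
    """Map floats to integer codes in [-8, 7] using one shared scale (absmax scheme)."""
    # "or 1.0" avoids dividing by zero when every weight is 0.
    scale = (max(abs(w) for w in weights) / 7) or 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_4bit(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]


weights = [0.7, -0.42, 0.1, 0.0]
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)

print(codes)
print([round(r, 3) for r in restored])
```

Each code fits in 4 bits instead of 16, which is where the size reduction comes from. The cost is rounding error, visible here in the second weight, which no longer matches its original value exactly.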

In [None]:

# Merge to 4bit
model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)


In [None]:
model.push_to_hub_merged("mynamerahulkumar/model", tokenizer, save_method = "merged_4bit", token = "hf_...") # Use your own token; never share a real one


**save_method = "lora"**: This argument specifies how the model should be saved. "lora" indicates that only the LoRA adapters (the changes made during fine-tuning) will be saved, and not the entire base model. This is often preferred because it results in much smaller files than saving the full model.
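
A quick estimate shows how much smaller the adapters are. The sketch below uses the trainable parameter count from the training banner (24,313,856) and the two shard sizes from the merged 16-bit upload log (4.97 GB and 1.46 GB). The bytes-per-parameter values are assumptions, since the precision the adapters are saved in is not shown in this notebook, and the helper name is hypothetical.

```python
MIB = 1024 ** 2


def adapter_size_mib(num_params, bytes_per_param):
    """Approximate size of the adapter weights, ignoring file metadata."""
    return num_params * bytes_per_param / MIB


trainable_params = 24_313_856  # from the training banner

size_16bit = adapter_size_mib(trainable_params, 2)  # assumed 16-bit storage
size_32bit = adapter_size_mib(trainable_params, 4)  # assumed 32-bit storage
merged_gb = 4.97 + 1.46                             # shard sizes from the upload log

print(f"Adapters at 16-bit: {size_16bit} MiB")
print(f"Adapters at 32-bit: {size_32bit} MiB")
print(f"Merged 16-bit model: about {merged_gb:.2f} GB")
```

Either way, the adapters come to well under 100 MiB, compared with more than 6 GB for the merged model. The trade-off is that the base model must be available when the adapters are loaded.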

In [None]:

# Just LoRA adapters
model.save_pretrained_merged("model", tokenizer, save_method = "lora",)


In [None]:
model.push_to_hub_merged("mynamerahulkumar/model", tokenizer, save_method = "lora", token = "hf_...") # Use your own token; never share a real one

In [None]:
# # Merge to 16bit
# if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
# if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "hf_...")

# # Merge to 4bit
# if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
# if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "hf_...")

# # Just LoRA adapters
# if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
# if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "hf_...")