### Installation

In [None]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

### Unsloth

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.1.5: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.1.5 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
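
As a rough check of how small the trainable set is, the LoRA parameter count can be worked out by hand: each adapted linear layer of shape `(d_in, d_out)` gains two low-rank matrices with `r * (d_in + d_out)` parameters in total. The sketch below assumes the published Llama-3.2-3B dimensions (hidden size 3072, MLP size 8192, 8 key/value heads of dimension 128, 28 layers). These values are not stated in this notebook, so treat them as assumptions.

```python
# Assumed Llama-3.2-3B dimensions (not taken from this notebook).
HIDDEN = 3072          # hidden size
INTERMEDIATE = 8192    # MLP (feed-forward) size
KV_DIM = 8 * 128       # key/value heads * head dimension
NUM_LAYERS = 28        # matches the "patched 28 layers" message
R = 16                 # LoRA rank used in get_peft_model

# (d_in, d_out) for every module listed in target_modules.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")
```

Under these assumptions the total is 24,313,856, which agrees with the `Number of trainable parameters` line printed when training starts.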


<a name="Data"></a>
### Data Prep
We now use the `Llama-3.1` format for conversation-style finetunes. We use the `jeongyoun/Fairytale-QAwithSUM` dataset, whose rows pair a story passage (`content`) with a summary (`summarization`), and convert each row into HuggingFace's normal multi-turn format `("role", "content")`. Llama-3 renders multi-turn conversations like below:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
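
To make the layout above concrete, here is a minimal sketch of how a list of `("role", "content")` messages maps onto that string. `render_llama3` is a hypothetical helper written for illustration; it is a simplification and not the template that `get_chat_template` installs, which may add further content such as a default system header.

```python
def render_llama3(messages):
    """Render role/content messages in the simplified Llama-3 layout shown above."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    return text

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
rendered = render_llama3(messages)
print(rendered)
```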

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

from datasets import load_dataset
dataset = load_dataset("jeongyoun/Fairytale-QAwithSUM", split = "train")

README.md:   0%|          | 0.00/923 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/2.99M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/393k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/366k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8548 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1025 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1007 [00:00<?, ? examples/s]

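For ShareGPT style datasets, `standardize_sharegpt` converts them into HuggingFace's generic format. Our dataset is not in ShareGPT style, so the cell below builds the same `("role", "content")` records by hand. The conversion changes a conversation from looking like: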
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
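
The key renaming itself is simple. The sketch below illustrates the mapping in plain Python; `ROLE_MAP` and `sharegpt_to_messages` are hypothetical names used for illustration and this is not how `standardize_sharegpt` is implemented.

```python
# Hypothetical mapping from ShareGPT speaker names to generic role names.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(turns):
    """Convert ShareGPT-style turns ("from"/"value") into "role"/"content" messages."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in turns
    ]

sharegpt_turns = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
converted = sharegpt_to_messages(sharegpt_turns)
print(converted)
```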

In [None]:
import numpy as np
from datasets import Dataset

def convert_to_finetuning_format(dataset):
    """
    Convert a dataset with 'content' and 'summarization' keys
    into a fine-tuning format that includes system, user,
    and assistant roles in a single text field.
    """
    finetuning_data = []

    for entry in dataset:
        # Construct a text string that embeds the roles and content
        conversation_text = (
            "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
            "You are an assistant\n\n"
            "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
            f"{entry['summarization']}\n\n"
            "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
            f"{entry['content']}<|eot_id|>"
        )
        # Add user and assistant roles for each data point
        datas = []
        datas.append({"role": "system", "content": "You are an assistant"})
        datas.append({"role": "user", "content": entry['summarization']})
        datas.append({"role": "assistant", "content": entry['content']})

        finetuning_data.append({
            "conversations" : datas,
            "text": conversation_text
        })

    return Dataset.from_list(finetuning_data)


In [None]:
type(dataset)

In [None]:
dataset_new = convert_to_finetuning_format(dataset)

In [None]:
type(dataset_new)

In [None]:
dataset_new[0]

{'conversations': [{'content': 'You are an assistant', 'role': 'system'},
  {'content': 'A king is wished well by his people. He is described as kind and just. However, there is a peculiar old woman who has a peculiar request. She wants to be allowed to stay outside under the open sky until she is 15 years old. According to her, a mountain troll will take her away.',
   'role': 'user'},
  {'content': 'once upon a time there was a king who went forth into the world and fetched back a beautiful queen . and after they had been married a while god gave them a little daughter . then there was great rejoicing in the city and throughout the country , for the people wished their king all that was good , since he was kind and just . while the child lay in its cradle , a strange - looking old woman entered the room , and no one knew who she was nor whence she came . the old woman spoke a verse over the child , and said that she must not be allowed out under the open sky until she were full fifte

In [None]:
dataset_new[0]

{'conversations': {'content': 'once upon a time there was a king who went forth into the world and fetched back a beautiful queen . and after they had been married a while god gave them a little daughter . then there was great rejoicing in the city and throughout the country , for the people wished their king all that was good , since he was kind and just . while the child lay in its cradle , a strange - looking old woman entered the room , and no one knew who she was nor whence she came . the old woman spoke a verse over the child , and said that she must not be allowed out under the open sky until she were full fifteen years of age , since otherwise the mountain troll would fetch her . when the king heard this he took her words to heart , and posted guards to watch over the little princess so that she would not get out under the open sky .',
  'summarization': 'A king is wished well by his people. He is described as kind and just. However, there is a peculiar old woman who has a pecu

We look at how the conversations are structured for item 5:

In [None]:
dataset_new[5]["conversations"]

[{'content': 'You are an assistant', 'role': 'system'},
 {'content': "The king took measures to ensure the little princess's safety by posting guards to watch over her. The king's love for his children is unparalleled, surpassing all else. The scene is set in the castle, where the little princess is safely confined. Meanwhile, the king's daughter looks out of the castle window, taking in the beauty of the sun shining on the flowers in the garden.",
  'role': 'user'},
 {'content': 'once upon a time there was a king who went forth into the world and fetched back a beautiful queen . and after they had been married a while god gave them a little daughter . then there was great rejoicing in the city and throughout the country , for the people wished their king all that was good , since he was kind and just . while the child lay in its cradle , a strange - looking old woman entered the room , and no one knew who she was nor whence she came . the old woman spoke a verse over the child , and s

In [None]:
dataset_new[5]["text"]

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are an assistant\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nThe king took measures to ensure the little princess's safety by posting guards to watch over her. The king's love for his children is unparalleled, surpassing all else. The scene is set in the castle, where the little princess is safely confined. Meanwhile, the king's daughter looks out of the castle window, taking in the beauty of the sun shining on the flowers in the garden.\n\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nonce upon a time there was a king who went forth into the world and fetched back a beautiful queen . and after they had been married a while god gave them a little daughter . then there was great rejoicing in the city and throughout the country , for the people wished their king all that was good , since he was kind and just . while the child lay in its cradle , a strange - looking old woman entered the room , a

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and remove the step limit with `max_steps=None`. We also support TRL's `DPOTrainer`!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset_new,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Map (num_proc=2):   0%|          | 0/8548 [00:00<?, ? examples/s]
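
To see how much of the dataset a 60-step run actually covers, the arithmetic can be sketched as follows. The numbers are the ones used in this notebook (batch size 2, 4 accumulation steps, 1 GPU, 8,548 training examples). The step count for a full epoch is approximate, since the exact value depends on how the trainer handles the final partial batch.

```python
per_device_batch = 2   # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
num_gpus = 1
max_steps = 60
num_examples = 8548    # size of the train split

# Examples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_gpus
# Examples consumed over the whole short run.
examples_seen = effective_batch * max_steps
# Share of the dataset covered by the short run.
fraction_seen = examples_seen / num_examples
# Approximate optimizer steps needed to cover the dataset once.
steps_for_one_epoch = num_examples / effective_batch

print(effective_batch, examples_seen)
print(f"{fraction_seen:.1%} of the dataset")
print(f"about {steps_for_one_epoch:.0f} steps for one epoch")
```

The short run therefore sees 480 examples, a small fraction of the training split, which is why it is only suitable as a quick smoke test.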

We also use Unsloth's `train_on_responses_only` function to only train on the assistant outputs and ignore the loss on the user's inputs.

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map:   0%|          | 0/8548 [00:00<?, ? examples/s]

We verify masking is actually done:

In [None]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are an assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nThe king took measures to ensure the little princess's safety by posting guards to watch over her. The king's love for his children is unparalleled, surpassing all else. The scene is set in the castle, where the little princess is safely confined. Meanwhile, the king's daughter looks out of the castle window, taking in the beauty of the sun shining on the flowers in the garden.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nonce upon a time there was a king who went forth into the world and fetched back a beautiful queen. and after they had been married a while god gave them a little daughter. then there was great rejoicing in the city and throughout the country, for the people wished their king all that was good, since he was kind and just. while the child lay in its cradl

In [None]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                                                                    \n\nonce upon a time there was a king who went forth into the world and fetched back a beautiful queen. and after they had been married a while god gave them a little daughter. then there was great rejoicing in the city and throughout the country, for the people wished their king all that was good, since he was kind and just. while the child lay in its cradle, a strange - looking old woman entered the room, and no one knew who she was nor whence she came. the old woman spoke a verse over the child, and said that she must not be allowed out under the open sky until she were full fifteen years of age, since otherwise the mountain troll would fetch her. when the king heard this he took her words to heart, and posted guards to watch over the little princess so that she would not get out under the open sky.<|eot_id|>'

We can see the System and Instruction prompts are successfully masked!
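
The idea behind this masking can be sketched on plain token-id lists: every label is set to `-100` (the index ignored by the loss) except for tokens that follow a response marker, up to the next instruction marker. This is a simplified illustration with made-up token ids and a hypothetical `mask_non_response` helper, not the implementation behind `train_on_responses_only`.

```python
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def mask_non_response(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens between a response marker and the next instruction marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    n = len(input_ids)
    i = 0
    while i < n:
        if input_ids[i:i + len(response_ids)] == response_ids:
            # Skip the marker itself, then copy tokens until the next instruction marker.
            j = i + len(response_ids)
            while j < n and input_ids[j:j + len(instruction_ids)] != instruction_ids:
                labels[j] = input_ids[j]
                j += 1
            i = j
        else:
            i += 1
    return labels

# Made-up ids: [10, 11] stands for the user header, [20, 21] for the assistant header.
input_ids = [1, 10, 11, 5, 6, 20, 21, 7, 8, 9, 10, 11, 3, 20, 21, 4]
labels = mask_non_response(input_ids, [10, 11], [20, 21])
print(labels)
```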

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 8,548 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 24,313,856


Step,Training Loss
1,2.8433
2,2.9901
3,2.8611
4,2.8775
5,2.9461
6,2.8552
7,2.8746
8,2.7055
9,2.5398
10,2.653


### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All other methods, such as `q4_k_m`, are also allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
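
The size of a `q8_0` file can be estimated from its block layout: llama.cpp's `Q8_0` format stores each block of 32 weights as 32 one-byte integers plus one 16-bit scale, which is 34 bytes per 32 weights (8.5 bits per weight). The sketch below is a back-of-the-envelope estimate only. It assumes every tensor is quantized and ignores file metadata, and the 6.43 GB input is the F16 file size reported in this notebook's upload log.

```python
F16_BYTES_PER_WEIGHT = 2      # 16-bit floats
Q8_0_BLOCK_WEIGHTS = 32       # weights per Q8_0 block
Q8_0_BLOCK_BYTES = 32 + 2     # 32 int8 values + one 16-bit scale

def estimate_q8_0_bytes(f16_file_bytes):
    """Estimate the Q8_0 file size from the size of the F16 file."""
    num_weights = f16_file_bytes / F16_BYTES_PER_WEIGHT
    return num_weights / Q8_0_BLOCK_WEIGHTS * Q8_0_BLOCK_BYTES

estimate = estimate_q8_0_bytes(6.43e9)
print(f"{estimate / 1e9:.2f} GB")
```

The estimate of roughly 3.42 GB is close to the size shown for `unsloth.Q8_0.gguf` in the upload log.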

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
#tokens are removed.
# Save to 8bit Q8_0
if True: model.push_to_hub_gguf("Serdarbayraktar/llama3.2-1B-Fairytale", tokenizer, token = "")
# Save to 16bit GGUF
if True: model.push_to_hub_gguf("Serdarbayraktar/llama3.2-1B-Fairytale", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if True: model.push_to_hub_gguf("Serdarbayraktar/llama3.2-1B-Fairytale", tokenizer, quantization_method = "q4_k_m", token = "")



Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.92 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:01<00:00, 21.82it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving Serdarbayraktar/llama3.2-1B-Fairytale/pytorch_model-00001-of-00002.bin...
Unsloth: Saving Serdarbayraktar/llama3.2-1B-Fairytale/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at Serdarbayraktar/llama3.2-1B-Fairytale into q8_0 GGUF format.
The output location will be /content/Serdarbayraktar/llama3.2-1B-Fairytale/unsloth.Q8_0.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: llama3.2-1B-Fairytale
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q8_0.gguf:   0%|          | 0.00/3.42G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/Serdarbayraktar/llama3.2-1B-Fairytale


No files have been modified since last commit. Skipping to prevent empty commit.


Saved Ollama Modelfile to https://huggingface.co/Serdarbayraktar/llama3.2-1B-Fairytale
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.97 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:00<00:00, 29.79it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving Serdarbayraktar/llama3.2-1B-Fairytale/pytorch_model-00001-of-00002.bin...
Unsloth: Saving Serdarbayraktar/llama3.2-1B-Fairytale/pytorch_model-00002-of-00002.bin...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['f16'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at Serdarbayraktar/llama3.2-1B-Fairytale into f16 GGUF format.
The output location will be /content/Serdarbayraktar/llama3.2-1B-Fairytale/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: llama3.2-1B-Fairytale
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.F16.gguf:   0%|          | 0.00/6.43G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/Serdarbayraktar/llama3.2-1B-Fairytale


No files have been modified since last commit. Skipping to prevent empty commit.


Saved Ollama Modelfile to https://huggingface.co/Serdarbayraktar/llama3.2-1B-Fairytale
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.92 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:00<00:00, 30.71it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving Serdarbayraktar/llama3.2-1B-Fairytale/pytorch_model-00001-of-00002.bin...
Unsloth: Saving Serdarbayraktar/llama3.2-1B-Fairytale/pytorch_model-00002-of-00002.bin...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at Serdarbayraktar/llama3.2-1B-Fairytale into f16 GGUF format.
The output location will be /content/Serdarbayraktar/llama3.2-1B-Fairytale/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: llama3.2-1B-Fairytale
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
IN

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q4_K_M.gguf:   0%|          | 0.00/2.02G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/Serdarbayraktar/llama3.2-1B-Fairytale


No files have been modified since last commit. Skipping to prevent empty commit.


Saved Ollama Modelfile to https://huggingface.co/Serdarbayraktar/llama3.2-1B-Fairytale
