### Installation

In [None]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

### Unsloth

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2025.1.5: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
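
To see why `load_in_4bit = True` matters on the Tesla T4 reported in the banner above, here is a rough sketch of the memory needed for the weights alone. The parameter count is an assumption (approximately 8.03 billion for an 8B Llama model) and is not taken from this notebook. The estimate ignores quantization constants, any layers kept in 16-bit, activations and optimizer state, so real usage will be higher.

```python
# Back-of-the-envelope weight memory for an ~8B-parameter model.
# ASSUMPTION: NUM_PARAMS is an approximate figure, not read from the notebook.
NUM_PARAMS = 8_030_261_248
GIB = 1024 ** 3


def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.2f} GiB")
print(f" 4-bit weights: {int4_gib:.2f} GiB")
```

Under these assumptions the 16-bit weights alone would not fit in the T4's 14.748 GB, while the 4-bit weights leave room for LoRA adapters and activations.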


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.1.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
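
As a sanity check on the LoRA configuration above, we can count the adapter parameters by hand. Each adapted linear layer gets two small matrices, `A` (`r x in_features`) and `B` (`out_features x r`), so it adds `r * (in_features + out_features)` trainable parameters. The layer shapes below are assumptions about Llama-3.1-8B (hidden size 4096, key/value projection width 1024, MLP width 14336) and are not printed anywhere in this notebook.

```python
# Count LoRA adapter parameters for the target modules listed above.
# ASSUMPTION: these are the Llama-3.1-8B layer shapes; adjust for other models.
R = 16
NUM_LAYERS = 32
HIDDEN, KV_DIM, MLP_DIM = 4096, 1024, 14336

# (in_features, out_features) for each targeted projection in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters in one adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
trainable = per_layer * NUM_LAYERS
print(f"{trainable:,} trainable LoRA parameters")
```

With these shapes the total comes to 41,943,040, which agrees with the trainable parameter count the trainer reports when training starts. Against roughly 8 billion total parameters that is about 0.5%; the count grows linearly with `r`.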


<a name="Data"></a>
### Data Prep
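We now use the `Llama-3.1` format for conversation-style finetunes. This template was written for ShareGPT-style datasets such as [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k); in this notebook we instead load the `Serdarbayraktar/Fairytale` dataset, whose records have `content` and `summarization` fields. Either way, we end up with HuggingFace's normal multi-turn format `("role", "content")` instead of ShareGPT's `("from", "value")`. Llama-3 renders multi-turn conversations like below: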

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
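
To make the layout concrete, here is a minimal sketch that reproduces the rendering above from a list of `role`/`content` messages. The helper `render_llama3` is hypothetical and only for illustration. In practice the tokenizer's chat template does this work, and the `llama-3.1` template also adds a system preamble with date lines, as the decoded training examples later in this notebook show.

```python
def render_llama3(messages):
    """Render role/content messages in the Llama-3 layout shown above.

    Illustrative only: real code should rely on the tokenizer's chat template.
    """
    text = "<|begin_of_text|>"
    for message in messages:
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    return text


messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
rendered = render_llama3(messages)
print(rendered)
```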

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

from datasets import load_dataset
dataset = load_dataset("Serdarbayraktar/Fairytale", split = "train")

Generating train split:   0%|          | 0/232 [00:00<?, ? examples/s]

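Our dataset is not in ShareGPT style, so the next code cell builds the `role`/`content` structure by hand with `convert_to_finetuning_format`. For ShareGPT-style datasets, Unsloth's `standardize_sharegpt` does the equivalent conversion into HuggingFace's generic format. It changes the dataset from looking like: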
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
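
The mapping itself is small. Below is a minimal sketch of the idea in plain Python. It is not Unsloth's `standardize_sharegpt` implementation, and the names `ROLE_MAP` and `sharegpt_to_messages` are hypothetical.

```python
# ASSUMPTION: only the three ShareGPT speaker names shown above are handled.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(turns):
    """Convert a list of {"from", "value"} turns into {"role", "content"} messages."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in turns
    ]


sharegpt_turns = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
converted = sharegpt_to_messages(sharegpt_turns)
for message in converted:
    print(message)
```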

In [None]:
from datasets import Dataset

def convert_to_finetuning_format(dataset):
    """
    Convert a dataset with 'content' and 'summarization' keys
    into a fine-tuning format that includes system, user,
    and assistant roles in a single text field.
    """
    finetuning_data = []

    for entry in dataset:
        # Construct a text string that embeds the roles and content
        conversation_text = (
            "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
            "You are an assistant\n\n"
            "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
            f"{entry['summarization']}\n\n"
            "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
            # Close the assistant turn with the full end-of-turn token
            f"{entry['content']}<|eot_id|>"
        )
        # Add user and assistant roles for each data point
        datas = []
        datas.append({"role": "system", "content": "You are an assistant"})
        datas.append({"role": "user", "content": entry['summarization']})
        datas.append({"role": "assistant", "content": entry['content']})

        finetuning_data.append({
            "conversations" : datas,
            "text": conversation_text
        })

    return Dataset.from_list(finetuning_data)


In [None]:
type(dataset)

In [None]:
dataset_new = convert_to_finetuning_format(dataset)

In [None]:
type(dataset_new)

In [None]:
dataset_new[0]

{'conversations': [{'content': 'You are an assistant', 'role': 'system'},
  {'content': 'A king is wished well by his people. He is described as kind and just. However, there is a peculiar old woman who has a peculiar request. She wants to be allowed to stay outside under the open sky until she is 15 years old. According to her, a mountain troll will take her away.',
   'role': 'user'},
  {'content': 'once upon a time there was a king who went forth into the world and fetched back a beautiful queen . and after they had been married a while god gave them a little daughter . then there was great rejoicing in the city and throughout the country , for the people wished their king all that was good , since he was kind and just . while the child lay in its cradle , a strange - looking old woman entered the room , and no one knew who she was nor whence she came . the old woman spoke a verse over the child , and said that she must not be allowed out under the open sky until she were full fifte

We look at how the conversations are structured for item 5:

In [None]:
dataset_new[5]["conversations"]

[{'content': 'You are an assistant', 'role': 'system'},
 {'content': 'The person is fond of reading and is staying in the mountains. They notice that the sound of a bell is inconsistent, sometimes faint and other times clear. This surprises them, leading them to put down their book and go outside to investigate.',
  'role': 'user'},
 {'content': 'once the schoolmaster of etnedal was staying in the mountains to fish . he was very fond of reading , and so he always carried one book or another along with him , with which he could lie down , and which he read on holidays , or when the weather forced him to stay in the little fishing - hut . one sunday morning , as he was lying there reading , it seemed as though he could hear church bells ; sometimes they sounded faintly , as though from a great distance ; at other times the sound was clear , as though carried by the wind . he listened long and with surprise ; and did not trust his ears -- for he knew that it was impossible to hear the bel

In [None]:
dataset_new[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are an assistant\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nThe person is fond of reading and is staying in the mountains. They notice that the sound of a bell is inconsistent, sometimes faint and other times clear. This surprises them, leading them to put down their book and go outside to investigate.\n\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nonce the schoolmaster of etnedal was staying in the mountains to fish . he was very fond of reading , and so he always carried one book or another along with him , with which he could lie down , and which he read on holidays , or when the weather forced him to stay in the little fishing - hut . one sunday morning , as he was lying there reading , it seemed as though he could hear church bells ; sometimes they sounded faintly , as though from a great distance ; at other times the sound was clear , as though carried by the wind . he listened long a

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`. We also support TRL's `DPOTrainer`!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset_new,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Map (num_proc=2):   0%|          | 0/232 [00:00<?, ? examples/s]
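
It helps to work out what the batch settings above mean for this 232-example dataset. The sketch below is plain arithmetic. It assumes a single GPU and that one optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` examples.

```python
import math

# Values copied from the trainer configuration and the dataset size above.
num_examples = 232
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60
num_gpus = 1  # ASSUMPTION: single-GPU run

# Examples consumed per optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Optimizer steps needed to see every example once
steps_per_epoch = math.ceil(num_examples / effective_batch)
# How many passes over the data 60 steps amounts to
epochs_covered = max_steps / steps_per_epoch

print(f"effective batch size = {effective_batch}")
print(f"steps per epoch      = {steps_per_epoch}")
print(f"epochs covered       = {epochs_covered:.2f}")
```

So 60 steps is a little over two passes over the data, which is consistent with the trainer rounding up to `Num Epochs = 3` in its banner when training starts.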

We also use Unsloth's `train_on_responses_only` function to only train on the assistant outputs and ignore the loss on the user's inputs.

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map:   0%|          | 0/232 [00:00<?, ? examples/s]

We verify masking is actually done:

In [None]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are an assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nThe person is fond of reading and is staying in the mountains. They notice that the sound of a bell is inconsistent, sometimes faint and other times clear. This surprises them, leading them to put down their book and go outside to investigate.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nonce the schoolmaster of etnedal was staying in the mountains to fish. he was very fond of reading, and so he always carried one book or another along with him, with which he could lie down, and which he read on holidays, or when the weather forced him to stay in the little fishing - hut. one sunday morning, as he was lying there reading, it seemed as though he could hear church bells ; sometimes they sounded faintly, as though from a great distance ; at other times the sound was clear, a

In [None]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                                      \n\nonce the schoolmaster of etnedal was staying in the mountains to fish. he was very fond of reading, and so he always carried one book or another along with him, with which he could lie down, and which he read on holidays, or when the weather forced him to stay in the little fishing - hut. one sunday morning, as he was lying there reading, it seemed as though he could hear church bells ; sometimes they sounded faintly, as though from a great distance ; at other times the sound was clear, as though carried by the wind. he listened long and with surprise ; and did not trust his ears -- for he knew that it was impossible to hear the bells of the parish church so far out among the hills -- yet suddenly they sounded quite clearly on his ear. so he laid aside his book, stood up and went out. the sun was shining, the weather was fine, and one group of churchgoers after another passed him in their sunday 

We can see the System and Instruction prompts are successfully masked!
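
The masking idea can be sketched on toy token IDs. This is a simplified illustration, not Unsloth's implementation: the helpers `find_sub` and `mask_to_responses` are hypothetical, and the integer IDs stand in for the tokenized `instruction_part` and `response_part` markers. Every label outside an assistant response is set to `-100`, the value the loss ignores.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def find_sub(seq, sub, start=0):
    """Index of the first occurrence of sub in seq at or after start, else -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_to_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        marker = find_sub(input_ids, response_ids, pos)
        if marker == -1:
            break
        begin = marker + len(response_ids)
        # A response runs until the next instruction marker or the end
        nxt = find_sub(input_ids, instruction_ids, begin)
        end = len(input_ids) if nxt == -1 else nxt
        labels[begin:end] = input_ids[begin:end]
        pos = end
    return labels


# Toy IDs: [1, 2] = user header, [1, 3] = assistant header, 9 = end of turn
instruction_ids, response_ids = [1, 2], [1, 3]
input_ids = [0, 1, 2, 10, 11, 9, 1, 3, 20, 21, 9, 1, 2, 12, 9, 1, 3, 22, 9]
labels = mask_to_responses(input_ids, instruction_ids, response_ids)
print(labels)
```

Only the assistant tokens (and their end-of-turn tokens) keep real labels, which mirrors the decoded output above where everything before the response shows up as blanks.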

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
2.635 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 232 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|------|---------------|
| 1 | 2.5312 |
| 2 | 2.5701 |
| 3 | 2.4593 |
| 4 | 2.4329 |
| 5 | 2.3885 |
| 6 | 2.3284 |
| 7 | 2.2849 |
| 8 | 2.1878 |
| 9 | 2.2983 |
| 10 | 2.0541 |


<a name="Inference"></a>
### Inference
Let's run the model! You can change the user message in `messages` to try your own prompt.

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
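
For intuition, here is a small sketch of what these two settings do to a next-token distribution. It follows the commonly described min-p rule (drop tokens whose probability is below `min_p` times the top token's probability, after temperature scaling). It is a simplified illustration rather than the exact `transformers` code, and the function name `min_p_filter` and the example logits are hypothetical.

```python
import math


def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Apply temperature, then zero out tokens below min_p * (top probability)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # Softmax, shifted by the peak for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    # Renormalize so the surviving tokens sum to 1
    return [p / norm for p in kept]


probs = min_p_filter([2.0, 1.0, 0.0, -3.0])
kept_ids = [i for i, p in enumerate(probs) if p > 0]
print(kept_ids)
print([round(p, 3) for p in probs])
```

The high temperature flattens the distribution for more varied stories, while the min-p cutoff still removes the unlikely tail (the last token in this toy example).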

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "a fairytale that teaches importance of education with crocodiles and king. "},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 2048, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\na fairytale that teaches importance of education with crocodiles and king. <|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nin the middle of a great jungle, near a winding brook, there lived once upon a time a mighty king. he ruled over his subjects with such equity and wisdom, that all loved and revered him. there lived in the neighborhood, in the midst of the jungle, a number of crocodiles. there were thirty in all. they lived for years as they had lived for thousands, each doing as he liked, without troubling himself about anything else. they spent their days in basking on the sunny bank of the brook and in basking in the water, or in watching the fish as they darted past, and now and then in making a snack of them.<|eot_id|>']

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All other quantization methods, such as `q4_k_m`, are allowed too. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to the Hugging Face Hub.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
if True:
    model.push_to_hub_gguf(
        "Serdarbayraktar/llama3.1-8B-Fairytale",
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # token removed
    )

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 4.84 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 50%|█████     | 16/32 [00:01<00:01, 11.93it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [02:36<00:00,  4.90s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving Serdarbayraktar/llama3.1-8B-Fairytale/pytorch_model-00001-of-00004.bin...
Unsloth: Saving Serdarbayraktar/llama3.1-8B-Fairytale/pytorch_model-00002-of-00004.bin...
Unsloth: Saving Serdarbayraktar/llama3.1-8B-Fairytale/pytorch_model-00003-of-00004.bin...
Unsloth: Saving Serdarbayraktar/llama3.1-8B-Fairytale/pytorch_model-00004-of-00004.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m', 'q8_0', 'q5_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at Serdarbayraktar/llama3.1-8B-Fairytale into f16 GGUF format.
The output location will be /content/Serdarbayraktar/llama3.1-8B-Fairytale/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: llama3.1-8B-Fairytale
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorc

No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/Serdarbayraktar/llama3.1-8B-Fairytale
Unsloth: Uploading GGUF to Huggingface Hub...


No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/Serdarbayraktar/llama3.1-8B-Fairytale


No files have been modified since last commit. Skipping to prevent empty commit.


Saved Ollama Modelfile to https://huggingface.co/Serdarbayraktar/llama3.1-8B-Fairytale
