To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

**Read our [Gemma 3 blog](https://unsloth.ai/blog/gemma3) for what's new in Unsloth and our [Reasoning blog](https://unsloth.ai/blog/r1-reasoning) on how to train reasoning models.**

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth

### Unsloth

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",

    "unsloth/Llama-3.3-70B-Instruct-bnb-4bit" # NEW! Llama 3.3 70B!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.18: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.3.18 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
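
To get a feel for how few weights the adapters add, the count can be worked out by hand: a rank-`r` adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below applies that to the seven `target_modules` over the 28 patched layers. The layer dimensions are assumptions taken from the published Llama-3.2-3B configuration (hidden size 3072, MLP size 8192, key/value projection width 1024); check them against `model.config` before relying on the result.

```python
# Assumed Llama-3.2-3B shapes (not printed anywhere in this notebook).
HIDDEN = 3072      # hidden_size
KV_WIDTH = 1024    # num_key_value_heads * head_dim
MLP = 8192         # intermediate_size
NUM_LAYERS = 28    # matches the "patched 28 layers" log line
RANK = 16          # r passed to get_peft_model

# (d_in, d_out) for every module listed in target_modules.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_WIDTH),
    "v_proj": (HIDDEN, KV_WIDTH),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(d_in, d_out, rank):
    """A is (rank x d_in) and B is (d_out x rank), so the adapter adds rank * (d_in + d_out)."""
    return rank * (d_in + d_out)


per_layer = sum(
    lora_param_count(d_in, d_out, RANK) for d_in, d_out in MODULE_SHAPES.values()
)
total_lora_params = per_layer * NUM_LAYERS
print(f"{total_lora_params:,} trainable LoRA weights")
```

Doubling `r` doubles this count, which is why the rank is the main knob for adapter size.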


<a name="Data"></a>
### Data Prep
We now use the `Llama-3.1` format for conversation-style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset, which is in ShareGPT style, but we convert it to HuggingFace's normal multi-turn format `("role", "content")` instead of `("from", "value")`. Llama-3 renders multi-turn conversations like below:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
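
The layout above can be reproduced with a few lines of plain Python. This is only a simplified illustration of the rendering rule. It is not the tokenizer's real Jinja chat template, and `render_llama3` is a hypothetical helper name.

```python
def render_llama3(messages):
    """Render role/content messages in the Llama-3 layout shown above."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    return "".join(parts)


example = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
rendered = render_llama3(example)
print(rendered)
```

In the notebook itself the rendering is done by `tokenizer.apply_chat_template`, which may add more than this sketch does (for example a default system block).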

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

In [16]:
# from unsloth.chat_templates import get_chat_template

# tokenizer = get_chat_template(
#     tokenizer,
#     chat_template = "llama-3.1",
# )

# def formatting_prompts_func(examples):
#     convos = examples["conversations"]
#     texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
#     return { "text" : texts, }
# pass

# from datasets import load_dataset
# dataset = load_dataset("mlabonne/FineTome-100k", split = "train")

# Import get_chat_template from the unsloth.chat_templates module
from unsloth.chat_templates import get_chat_template

# Use get_chat_template to reconfigure the tokenizer for the "llama-3.1" chat template
# Note: the tokenizer variable must already be defined (it is created when the model is loaded)
tokenizer = get_chat_template(
    tokenizer,                      # The existing tokenizer
    chat_template="llama-3.1",      # Use the "llama-3.1" chat template to format conversations
)

# Define formatting_prompts_func, which takes a batch of examples
def formatting_prompts_func(examples):
    # Get the list of conversations from the "conversations" key of the batch
    convos = examples["conversations"]
    # Apply the chat template to each conversation, without tokenizing and without a generation prompt
    # The result is a list of formatted text strings
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    # Return a dictionary whose "text" key holds the list of formatted strings
    return {"text": texts,}

# This pass has no effect because the function already returns; it can be kept or removed
pass

# Import load_dataset from the datasets library to download the dataset
from datasets import load_dataset

# Load the training split (split="train") of the "mlabonne/FineTome-100k" dataset
dataset = load_dataset("mlabonne/FineTome-100k", split="train")


In [17]:
# Print an overview of the dataset
print(dataset)

# Print the first 5 samples in the dataset
for i in range(5):
    print(f"Sample {i}: ", dataset[i])


Dataset({
    features: ['conversations', 'source', 'score'],
    num_rows: 100000
})
Sample 0:  {'conversations': [{'from': 'human', 'value': 'Explain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. \n\nFurthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.\n\nFinally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where 

In [21]:
dataset[5]["conversations"]

[{'from': 'human',
  'value': 'How do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?'},
 {'from': 'gpt',
  'value': 'Astronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.'}]

We now use `standardize_sharegpt` to convert ShareGPT style datasets into HuggingFace's generic format. This changes the dataset from looking like:
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
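
The per-message conversion shown above amounts to renaming two keys and mapping the role names. A minimal sketch, assuming only the three roles from the example; `sharegpt_to_hf` and `ROLE_MAP` are hypothetical names and this is not Unsloth's implementation:

```python
# Role names from the example above; real datasets may use other aliases.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_hf(conversation):
    """Turn a list of {"from", "value"} turns into {"role", "content"} turns."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in conversation
    ]


sharegpt_conversation = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
hf_conversation = sharegpt_to_hf(sharegpt_conversation)
for message in hf_conversation:
    print(message)
```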

In [22]:
# from unsloth.chat_templates import standardize_sharegpt
# dataset = standardize_sharegpt(dataset)
# dataset = dataset.map(formatting_prompts_func, batched = True,)

# Import standardize_sharegpt from unsloth.chat_templates to standardize the data format
from unsloth.chat_templates import standardize_sharegpt

# Convert the ShareGPT-style dataset ("from"/"value") to the generic "role"/"content" format
dataset = standardize_sharegpt(dataset)

# Apply formatting_prompts_func to the dataset with the map method
# batched=True passes many samples to the function at once, which makes the mapping more efficient
dataset = dataset.map(formatting_prompts_func, batched=True,)


Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

Question:
```
This function only processes conversations and leaves source and score alone, right? I now have another dataset that has conversations but no source or score
```
=> Correct. The `formatting_prompts_func` you defined only processes the `conversations` field; fields such as `source` and `score` are not affected. If you have another dataset that contains only the `conversations` field (no `source` or `score`), the function still works as usual, because it only reads the contents of `conversations` to build the formatted prompts. You do not need to change anything if you only want to process the conversations.
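
The behaviour described in this answer can be checked without downloading anything. The sketch below imitates a batched map with plain lists: the function receives each column as a list, and whatever it returns is merged back into the rows. `toy_batched_map` and `toy_formatting_func` are hypothetical stand-ins, and the `role: content` join replaces the real chat template.

```python
def toy_formatting_func(examples):
    """Stand-in for formatting_prompts_func: reads only the "conversations" column."""
    texts = [
        " | ".join(f"{m['role']}: {m['content']}" for m in convo)
        for convo in examples["conversations"]
    ]
    return {"text": texts}


def toy_batched_map(rows, fn):
    """Pass columns as lists to fn, then merge the returned columns into each row."""
    columns = {key: [row[key] for row in rows] for key in rows[0]}
    new_columns = fn(columns)
    return [
        {**row, **{key: values[i] for key, values in new_columns.items()}}
        for i, row in enumerate(rows)
    ]


convo = [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]
with_extras = toy_batched_map(
    [{"conversations": convo, "source": "demo", "score": 4.5}], toy_formatting_func
)
only_convos = toy_batched_map([{"conversations": convo}], toy_formatting_func)

print(sorted(with_extras[0]))  # extra columns survive, "text" is added
print(sorted(only_convos[0]))  # works the same without source/score
```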

Question:
```
So what happens if it is already standard, i.e. my data is already in system - user - assistant form?
```


If your data already has the standard structure (messages with the keys "role" and "content", where "role" is "system", "user" or "assistant"), `standardize_data_formats` still runs its usual checks and mapping. Specifically:

- The function scans some samples from the `conversations` field and collects the keys of each message.  
- It finds that there are exactly 2 keys (say `"role"` and `"content"`).  
- Then, based on the number of unique values, it decides which key is the role and which is the content.  
- Next, it checks whether the values of the role key are in the set of defined aliases. If your data is already standard, these values are already `"system"`, `"user"` or `"assistant"`, so they match the mapped aliases.  
- Finally, each message is rebuilt in the form `{"role": <standard role>, "content": <content>}`. If the data is already standard, the final result is the same as the original data.

In short, if your data is already standard, this function does not change the content; it only "repackages" the data in Hugging Face's standard format. You can run it without worrying about errors, but it will essentially make no difference compared with the original data.
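
The steps listed above can be sketched in plain Python to see why already-standard data passes through unchanged. This is a toy re-implementation of the description only, not Unsloth's code; `guess_keys`, `toy_standardize` and the alias table are hypothetical.

```python
ROLE_ALIASES = {
    "system": "system",
    "user": "user", "human": "user",
    "assistant": "assistant", "gpt": "assistant",
}


def guess_keys(conversations):
    """Return (role_key, content_key); the key with fewer distinct values is the role."""
    keys = {key for convo in conversations for message in convo for key in message}
    if len(keys) != 2:
        raise ValueError(f"expected exactly 2 keys per message, found {sorted(keys)}")
    distinct = {
        key: {message[key] for convo in conversations for message in convo}
        for key in keys
    }
    role_key = min(sorted(keys), key=lambda key: len(distinct[key]))
    content_key = next(key for key in keys if key != role_key)
    return role_key, content_key


def toy_standardize(conversations):
    """Rebuild every message as {"role": <standard role>, "content": <content>}."""
    role_key, content_key = guess_keys(conversations)
    return [
        [{"role": ROLE_ALIASES[m[role_key]], "content": m[content_key]} for m in convo]
        for convo in conversations
    ]


already_standard = [
    [
        {"role": "system", "content": "You are an assistant"},
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "It's 4."},
    ],
    [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello"},
    ],
]
print(guess_keys(already_standard))
print(toy_standardize(already_standard) == already_standard)
```

The unique-value heuristic relies on there being more distinct contents than distinct roles, which holds for any realistically sized dataset.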

We look at how the conversations are structured for item 5:

In [6]:
dataset[5]["conversations"]

[{'content': 'How do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?',
  'role': 'user'},
 {'content': 'Astronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.',
  'role': 'assistant'}]

And we see how the chat template transformed these conversations.

**[Notice]** Llama 3.1 Instruct's default chat template adds `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"` by default, so do not be alarmed!

In [None]:
dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|

### ✅ Summary: standardizing and formatting the data with `standardize_sharegpt` + `formatting_prompts_func`

1. **`standardize_sharegpt(dataset)`**  
   → Converts ShareGPT-style data to Hugging Face's generic chat format:
   - Each input message has the form `{"from": ..., "value": ...}` and becomes `{"role": ..., "content": ...}`.
   - `from = "human"` becomes `"user"` and `"gpt"` becomes `"assistant"`, so the roles are consistent.

2. **`dataset.map(formatting_prompts_func, batched=True)`**  
   → Goes through each conversation and applies the **chat template** to build a formatted string with special tokens:
   - Produces a `"text"` field whose content contains tags such as:  
     `<|begin_of_text|>`, `<|start_header_id|>user<|end_header_id|>`, `<|eot_id|>`, etc.

---

👉 **Goal**:  
Reformat the conversations so the model can easily tell the `system`, `user` and `assistant` roles apart and follow the context of each turn.

📌 **Result**:  
The dataset gains a `"text"` column → used to train the language model in chat style.

## My Dataset

In [27]:
import pandas as pd
import json
from datasets import Dataset
from unsloth.chat_templates import standardize_data_formats  # Or an equivalent function

# Step 1: Read the Excel file
# Assume the file is at "/content/conversations_format.xlsx"
df = pd.read_excel("/content/conversations_format.xlsx")

# Quick look at the first 5 rows
print("Raw data (first 5 rows):")
print(df.head())

# Step 2: Each row of df holds one conversation as a JSON string.
# The `conversations` column has to be converted from a JSON string to list[dict].

all_data = []
for idx, row in df.iterrows():
    # row["conversations"] is a JSON string,
    # e.g. '[{"role": "system", "content": "You are..."}, {"role": "user", "content": "Hello"}, ...]'
    conv_str = row["conversations"]

    # Reject cells that are not strings (for example empty cells, which pandas reads as NaN)
    if not isinstance(conv_str, str):
        raise ValueError(f"Row {idx} is not a valid JSON string: {conv_str}")

    # Decode the JSON string into list[dict]
    conv_list = json.loads(conv_str)

    # Build a dictionary with the key "conversations" for each row
    all_data.append({"conversations": conv_list})

# Step 3: Create a Hugging Face Dataset from the all_data list
my_dataset = Dataset.from_list(all_data)

# Check the structure of the first element
print("\nFirst element of my_dataset before standardization:")
print(my_dataset[0])

# Step 4: Standardize the data (if needed) with standardize_data_formats
# - This function unifies role = system / user / assistant according to the aliases
my_dataset = standardize_data_formats(my_dataset)

# - During formatting, `formatting_prompts_func` (or a similar template function) inserts special tokens and role markers (system/user/assistant).
# - These markers help the model (or the training pipeline) tell where each role starts and ends in the conversation, and support some internal features (such as "end of text", "start header", etc.).
# batched=True passes many samples to the function at once, which makes the mapping more efficient
my_dataset = my_dataset.map(formatting_prompts_func, batched=True,)

# Check the result after standardization
print("\nFirst element after standardization:")
print(my_dataset[0])
print(my_dataset[0]["conversations"])
print(my_dataset[0]["text"])


Raw data (first 5 rows):
                                       conversations
0  [{"role": "system", "content": "You are an int...
1  [{"role": "system", "content": "You are an int...
2  [{"role": "system", "content": "You are an int...
3  [{"role": "system", "content": "You are an int...
4  [{"role": "system", "content": "You are an int...

First element of my_dataset before standardization:
{'conversations': [{'content': 'You are an intelligent **task classification and response generation assistant**. Your job is to analyze user commands and generate structured JSON responses according to the correct task category. \n\nYou will be given: \n- **User Input** – A natural language command from the user. \n\nYour task is to: \n1. **Identify the correct tool** for the request: \n- `"todo"` → For general task management. \n- `"todo_with_calendar"` → For scheduling and calendar-based tasks. \n- `"email"` → For email-related tasks. \n- `"article"` → For 

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/3892 [00:00<?, ? examples/s]

Map:   0%|          | 0/3892 [00:00<?, ? examples/s]


First element after standardization:
{'conversations': [{'content': 'You are an intelligent **task classification and response generation assistant**. Your job is to analyze user commands and generate structured JSON responses according to the correct task category. \n\nYou will be given: \n- **User Input** – A natural language command from the user. \n\nYour task is to: \n1. **Identify the correct tool** for the request: \n- `"todo"` → For general task management. \n- `"todo_with_calendar"` → For scheduling and calendar-based tasks. \n- `"email"` → For email-related tasks. \n- `"article"` → For content creation and article management. \n- `"thutuchanhchinh"` → For legal and administrative procedure queries. \n\n2. **Determine the appropriate action** based on user intent: \n- `"todo"` → `[create, get_all, update, delete, complete]` \n- `"todo_with_calendar"` → `[create, get_upcoming, get_past, get_now, update, delete, invite]` \n- `"email"` → `[compose, ge

The output shows clearly how the data is "packaged" and "formatted" after running the standardization functions from the `unsloth` library. Specifically:

1. **The** `"conversations"` **field stores the list of messages**  
   Each message is a dictionary with:
   - `"role"`: whether the sender is `system`, `user`, or `assistant`.
   - `"content"`: the actual text.

2. **A new** `"text"` **field holds the content with special tokens added**  
   These include tokens such as:
   - `"<|begin_of_text|>"` and `"<|eot_id|>"` to mark the start of the text and the end of each turn.  
   - `"<|start_header_id|>system<|end_header_id|>"`, `"<|start_header_id|>user<|end_header_id|>"`, `"<|start_header_id|>assistant<|end_header_id|>"` to represent the roles.  
   - The lines "Cutting Knowledge Date: …" and "Today Date: …", which are date context or metadata added by the template (if present).  

3. **Structure of the text**  
   The text mixes the actual content (for example "You are an intelligent …," "Tiếp tục học tiếng anh," …) with marker segments (special tokens), so that the model, during training or inference, can tell where each role begins and ends and pick up the other metadata.

4. **General remarks**  
   - Data in this `"conversations"` + `"text"` form is very useful for **training a model** on chat-based scenarios.  
   - The `"text"` part is rendered from the template, which keeps the context, roles, and content **consistent** for the model.  
   - If you do not want the special tokens (such as `"<|begin_of_text|>"`…), you can customize the template.  
   - Conversely, if you need the model to treat the "system" / "user" / "assistant" contexts separately, these tokens are necessary.

So the whole "text" string is simply the **product** of the conversion and formatting process, prepared for chat-style model training.
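
Because each message is expected to be a dictionary with exactly a `"role"` and a `"content"`, it can be worth validating the JSON cells before building the dataset. The following is a minimal sketch using only the standard library; `parse_conversation` and `ALLOWED_ROLES` are hypothetical names, and the set of allowed roles is an assumption based on the three roles used in this notebook.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def parse_conversation(raw):
    """Parse one JSON cell and check it is a list of {"role", "content"} messages."""
    if not isinstance(raw, str):
        raise ValueError(f"expected a JSON string, got {type(raw).__name__}")
    messages = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(messages, list) or not messages:
        raise ValueError("a conversation must be a non-empty list")
    for position, message in enumerate(messages):
        if not isinstance(message, dict) or set(message) != {"role", "content"}:
            raise ValueError(f"message {position} must have exactly the keys role and content")
        if message["role"] not in ALLOWED_ROLES:
            raise ValueError(f"message {position} has unknown role {message['role']!r}")
    return messages


row = '[{"role": "system", "content": "You are..."}, {"role": "user", "content": "Hello"}]'
parsed = parse_conversation(row)
print(len(parsed), parsed[0]["role"])
```

Failing early here gives a clearer error than a template failure later in `formatting_prompts_func`.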

### ✅ Summary: **which part of the data is counted as tokens?**

- **Tokens are counted over the entire content of the `"text"` field**, after the chat template (here `llama-3.1`) has been applied.
  
- That includes everything:
  - `<|begin_of_text|>`: start of the sequence
  - `<|start_header_id|>system<|end_header_id|>` + the system content
  - `<|start_header_id|>user<|end_header_id|>` + the user prompt
  - `<|start_header_id|>assistant<|end_header_id|>` + the answer
  - `<|eot_id|>`: marks the end of each part

- 👉 Because your `"system"` prompt is very long, you should **increase `max_seq_length`** (if the model supports it), for example:
  - 1024 → may get truncated
  - 2048 or 4096 → safer

- ✅ You can use `tokenizer(text)` to count the exact number of tokens.
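
A small helper makes it easy to see how many samples would exceed a given `max_seq_length`. This is a sketch under assumptions: `count_tokens` and `over_limit` are hypothetical helpers, and a whitespace split stands in for the real tokenizer so the example runs on its own.

```python
def count_tokens(texts, tokenize):
    """Return the token count of every text, using any callable that returns a token list."""
    return [len(tokenize(text)) for text in texts]


def over_limit(lengths, max_seq_length):
    """Return the indices of the samples that are longer than max_seq_length."""
    return [index for index, length in enumerate(lengths) if length > max_seq_length]


def whitespace_tokenize(text):
    # Stand-in only: a real subword tokenizer produces different (usually larger) counts.
    return text.split()


texts = ["short example", "a much longer example " * 4]
lengths = count_tokens(texts, whitespace_tokenize)
print(lengths)
print(over_limit(lengths, max_seq_length=8))
```

With the notebook's tokenizer, the callable would presumably be `lambda text: tokenizer(text)["input_ids"]`, applied to the `"text"` column.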


<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`. We also support TRL's `DPOTrainer`!
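
To put "60 steps" in perspective, here is a rough worked calculation using the values from this section's trainer configuration (batch size 2, gradient accumulation 4, one GPU, learning rate 2e-4, 5 warmup steps) and the 100,000-row dataset. The schedule function is an assumption about what `lr_scheduler_type = "linear"` with warmup resolves to: the usual linear warmup followed by linear decay to zero.

```python
def samples_seen(per_device_batch, grad_accum, max_steps, num_devices=1):
    """Each optimizer step consumes per_device_batch * grad_accum * num_devices samples."""
    return per_device_batch * grad_accum * num_devices * max_steps


def linear_lr(step, base_lr, warmup_steps, max_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0 at max_steps."""
    if step < warmup_steps:
        return base_lr * (step / max(1, warmup_steps))
    return base_lr * (max(0, max_steps - step) / max(1, max_steps - warmup_steps))


seen = samples_seen(per_device_batch=2, grad_accum=4, max_steps=60)
fraction = seen / 100_000
print(f"{seen} samples seen, {fraction:.2%} of the dataset")

for step in (0, 5, 30, 60):
    print(step, linear_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=60))
```

So the short demo run touches only a small slice of the data, which is why a full epoch is suggested for real training.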

In [28]:
# from trl import SFTTrainer
# from transformers import TrainingArguments, DataCollatorForSeq2Seq
# from unsloth import is_bfloat16_supported

# trainer = SFTTrainer(
#     model = model,
#     tokenizer = tokenizer,
#     train_dataset = dataset,
#     dataset_text_field = "text",
#     max_seq_length = max_seq_length,
#     data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
#     dataset_num_proc = 2,
#     packing = False, # Can make training 5x faster for short sequences.
#     args = TrainingArguments(
#         per_device_train_batch_size = 2,
#         gradient_accumulation_steps = 4,
#         warmup_steps = 5,
#         # num_train_epochs = 1, # Set this for 1 full training run.
#         max_steps = 60,
#         learning_rate = 2e-4,
#         fp16 = not is_bfloat16_supported(),
#         bf16 = is_bfloat16_supported(),
#         logging_steps = 1,
#         optim = "adamw_8bit",
#         weight_decay = 0.01,
#         lr_scheduler_type = "linear",
#         seed = 3407,
#         output_dir = "outputs",
#         report_to = "none", # Use this for WandB etc
#     ),
# )


# Import the required functions and classes
# - SFTTrainer: trainer class for Supervised Fine-Tuning
# - TrainingArguments: configuration object for the training run (number of steps, lr, etc.)
# - DataCollatorForSeq2Seq: batches the data; designed for seq2seq-style models
# - is_bfloat16_supported: checks whether the machine supports the bfloat16 number type
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

# Create the SFTTrainer object used to train the model
trainer = SFTTrainer(
    model = model,                       # The model (already loaded)
    tokenizer = tokenizer,               # Tokenizer that turns text into numbers and back
    train_dataset = dataset,             # Training data (a Dataset)
    dataset_text_field = "text",         # Name of the column that holds the text strings
    max_seq_length = max_seq_length,     # Maximum length of each sequence after tokenization
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    # DataCollatorForSeq2Seq groups samples into batches and pads them to a common length

    dataset_num_proc = 2,               # Number of processes used to preprocess the data in parallel
    packing = False,                     # Whether to pack short sequences together
    # (training without packing can be slower, but the code is easier to follow)

    # Training hyperparameters are passed through TrainingArguments
    args = TrainingArguments(
        per_device_train_batch_size = 2, # Each device (GPU/CPU) takes 2 samples per batch
        gradient_accumulation_steps = 4, # Accumulate gradients over 4 batches before updating the model
        warmup_steps = 5,               # Number of warmup steps before the main phase
        max_steps = 60,                 # Train for 60 steps in total
        learning_rate = 2e-4,           # Learning rate of 0.0002
        fp16 = not is_bfloat16_supported(), # Use fp16 if the machine does not support bfloat16
        bf16 = is_bfloat16_supported(), # Use bfloat16 if the machine supports it
        logging_steps = 1,              # Log progress every step
        optim = "adamw_8bit",           # AdamW optimizer with 8-bit state to save memory
        weight_decay = 0.01,            # Penalty on parameter magnitude (regularization)
        lr_scheduler_type = "linear",   # How the learning rate changes over time (linearly)
        seed = 3407,                    # Random seed so results are reproducible
        output_dir = "outputs",         # Directory for results and the trained model
        report_to = "none",             # Where to report results (none = no reporting)
    ),
)


Converting train dataset to ChatML (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]

Applying chat template to train dataset (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]

Truncating train dataset (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]

These are the data-preprocessing progress messages printed when training is set up. Each line names one step and shows its progress:

1. **Converting train dataset to ChatML (num_proc=2): 100%**  
   - The trainer is **converting** the training dataset to the **ChatML format** (a structured format that lets the model tell conversation turns apart).  
   - `num_proc=2` means the work is split across 2 parallel worker processes.  
   - A readout of `100000/100000 [00:24<00:00, 1501.55 examples/s]` means all **100,000 samples** were processed in **24 seconds**, at a reported rate of about **1,501 samples/second**.

2. **Applying chat template to train dataset (num_proc=2): 100%**  
   - After the conversion to ChatML, the trainer **applies a "chat template"** (a formatting pattern) to the data.  
   - Each sample is walked through and given special tokens (for example, markers for the user and assistant turns) according to the template you configured.  
   - Likewise, `100000/100000 [00:30<00:00, 4935.73 examples/s]` shows that 100,000 samples were processed in 30 seconds, at a reported rate of nearly 4,935 samples/second.

3. **Tokenizing train dataset (num_proc=2): 100%**  
   - This is the **step that turns text into tokens**: the words are mapped to integer indices that the model can consume.  
   - For 100,000 samples this took about 3 minutes 31 seconds (~503 samples/second).

4. **Truncating train dataset (num_proc=2): 100%**  
   - The **truncating** step caps the length of each sample (usually at `max_seq_length`) so that it does not exceed what the model allows.  
   - It finishes quickly (only 2 seconds) because it only cuts off the excess. The rate is high (~44,089 samples/second).

In short, these log lines show **four consecutive steps** in preparing the training data:  
- Convert to ChatML,  
- Apply the chat template,  
- Tokenize (encode the text),  
- And finally truncate anything longer than the allowed length.  

Every step runs with `num_proc=2`, meaning the work is divided between 2 processes, which is faster than running in a single process.
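
The four steps can be pictured with a toy pipeline. This is only an illustrative sketch, not the code that TRL or Unsloth actually runs: the record fields (`instruction`, `output`), the whitespace "tokenizer", and the tiny `MAX_SEQ_LENGTH` are all assumptions made for the example.

```python
# Toy illustration of the four preprocessing steps; not TRL's or Unsloth's actual code.
MAX_SEQ_LENGTH = 12  # hypothetical limit, tiny on purpose so truncation is visible

def to_chatml(sample):
    """Step 1: normalise a raw record into a list of role/content messages."""
    return [
        {"role": "user", "content": sample["instruction"]},
        {"role": "assistant", "content": sample["output"]},
    ]

def apply_chat_template(messages):
    """Step 2: render the messages into one string with Llama-3-style markers."""
    text = "<|begin_of_text|>"
    for m in messages:
        text += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    return text

def tokenize(text, vocab):
    """Step 3: map text pieces to integer ids (a whitespace split stands in for a real tokenizer)."""
    return [vocab.setdefault(piece, len(vocab)) for piece in text.split()]

def truncate(input_ids, max_len=MAX_SEQ_LENGTH):
    """Step 4: drop everything beyond max_len tokens."""
    return input_ids[:max_len]

vocab = {}
sample = {
    "instruction": "Name three primary colours please",
    "output": "Red , yellow and blue are the three primary colours .",
}
messages = to_chatml(sample)
text = apply_chat_template(messages)
ids = tokenize(text, vocab)
ids_truncated = truncate(ids)
print(len(ids), "->", len(ids_truncated))
```

Because truncation simply slices off the tail, a long system prompt at the front of a sample can push the assistant reply out of the kept window.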

### ✅ Summary: **which text is counted toward the token total?**

- **Tokens are counted over the entire contents of the `"text"` field**, after the chat template has been applied.
  
- That includes all of the following:
  - `<|begin_of_text|>`: start of the sequence
  - `<|start_header_id|>system<|end_header_id|>` + the system content
  - `<|start_header_id|>user<|end_header_id|>` + the user prompt
  - `<|start_header_id|>assistant<|end_header_id|>` + the reply
  - `<|eot_id|>`: marks the end of each turn

- 👉 Because the `"system"` prompt here is very long, consider **raising `max_seq_length`** (if the model supports it), for example:
  - 1024 → samples may get truncated
  - 2048 or 4096 → safer

- ✅ Use `tokenizer(text)` to get an exact token count.

Count (measured with a tool): 608

In [34]:
from transformers import AutoTokenizer

# Assumes `tokenizer` has already been loaded
text = my_dataset[0]["text"]  # or standardized_dataset[0]["text"]
tokens = tokenizer(text, return_tensors="pt")
print(f"Number of tokens: {len(tokens['input_ids'][0])}")


Number of tokens: 542
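
The single-sample count above can be extended to check how many samples a given `max_seq_length` would truncate. The helper below is a minimal sketch; `truncation_report` and the example lengths are hypothetical, and in the notebook the lengths would come from running the tokenizer over every row.

```python
def truncation_report(token_lengths, max_seq_length):
    """Return (samples over the limit, fraction over the limit, longest sample)."""
    too_long = [n for n in token_lengths if n > max_seq_length]
    fraction = len(too_long) / len(token_lengths) if token_lengths else 0.0
    return len(too_long), fraction, max(token_lengths, default=0)

# Hypothetical lengths; in the notebook they could be collected with something like
# [len(tokenizer(row["text"])["input_ids"]) for row in my_dataset].
lengths = [542, 608, 1310, 2200, 480]

for limit in (1024, 2048, 4096):
    n_over, frac, longest = truncation_report(lengths, limit)
    print(f"max_seq_length={limit}: {n_over} of {len(lengths)} samples truncated ({frac:.0%})")
```

Choosing the smallest limit that truncates few or no samples keeps memory use down without losing the end of the assistant replies.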


We also use Unsloth's `train_on_responses_only` method to train only on the assistant outputs and ignore the loss on the user's inputs.

In [29]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]

We verify masking is actually done:

In [30]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight rel

In [31]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                  Astronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|eot_id|>'

We can see the System and Instruction prompts are successfully masked!
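
The idea behind this masking can be sketched in a few lines. This is an illustration of the technique under simplifying assumptions, not Unsloth's implementation: the marker sequences are single made-up token ids, and `mask_non_responses` is a hypothetical helper. Labels set to `-100` are skipped by PyTorch's cross-entropy loss by default.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def find_subsequence(seq, pattern, start=0):
    """Return the first index >= start where pattern occurs in seq, or -1."""
    for i in range(start, len(seq) - len(pattern) + 1):
        if seq[i:i + len(pattern)] == pattern:
            return i
    return -1

def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Build labels that keep only the spans following a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_subsequence(input_ids, response_ids, pos)
        if start == -1:
            break
        begin = start + len(response_ids)      # first token of the assistant reply
        end = find_subsequence(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)               # the reply runs to the end of the sample
        labels[begin:end] = input_ids[begin:end]
        pos = end
    return labels

# Hypothetical token ids: 1 = user marker, 2 = assistant marker, 9 = end of turn.
input_ids = [0, 1, 10, 11, 2, 20, 21, 9, 1, 12, 2, 22, 9]
labels = mask_non_responses(input_ids, instruction_ids=[1], response_ids=[2])
print(labels)
```

Only the positions that keep their original id contribute to the loss, which matches the decoded labels above where everything before the assistant reply shows up as blanks.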

In [33]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
4.131 GB of memory reserved.


In [32]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 100,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856/3,000,000,000 (0.81% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,0.7747
2,0.839
3,1.0758
4,0.8918
5,0.7575
6,0.9373
7,0.6192
8,0.9986
9,0.8596
10,0.7613


### ✅ Summary of the training results from the Unsloth log

---

#### 🚀 **Training configuration:**
- **Total samples:** 100,000  
- **Epochs:** 1  
- **Training steps:** 60  
- **Batch size:**  
  - Per device: 2  
  - `gradient_accumulation_steps = 4`  
  - Total batch size: `2 x 4 x 1 GPU = 8`  
- **Trainable parameters:** 24.3 million / 3 billion (~0.81%) → the run uses **LoRA or another form of partial fine-tuning**
- ✅ **VRAM optimization:** Unsloth offloads gradients to save memory.
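
The trainable-parameter figure can be reproduced with a little arithmetic. The sketch below assumes Llama 3.2 3B-style layer shapes, a LoRA rank of 16, and adapters on all seven projection matrices; none of these values appear in this section, so treat them as assumptions that happen to match the logged count.

```python
# Assumed shapes for a Llama 3.2 3B-style model; these values are not stated in this section.
HIDDEN = 3072        # model width
KV_DIM = 1024        # 8 key/value heads x 128 dims (grouped-query attention)
INTERMEDIATE = 8192  # MLP width
NUM_LAYERS = 28
RANK = 16            # assumed LoRA rank r

# (in_features, out_features) of each adapted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in, d_out, rank):
    """A is rank x d_in and B is d_out x rank, so the pair adds rank * (d_in + d_out) weights."""
    return rank * (d_in + d_out)

per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable parameters")
print(f"{total / 3_000_000_000:.2%} of the 3,000,000,000 shown in the banner")
```

Under these assumptions the total comes to 24,313,856, the same number the Unsloth banner reports.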

---

#### 📉 **Loss over the training steps:**
- Loss fluctuates between roughly **0.45 and 1.3**, averaging about **0.8–0.9**.
- A few steps spike (for example step 53: **1.3178**), but most stay in a stable range.
- Some steps show clearly lower loss (for example step 51: **0.4573**, step 34: **0.5803**) → the model likely fits those batches well.
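
Per-step loss is noisy, so a short moving average makes the trend easier to read. The sketch below uses only the ten losses visible in the log table above (steps 1 to 10); the window size of 5 is an arbitrary choice for illustration.

```python
from statistics import mean

# Training losses for steps 1-10, copied from the log table above.
losses = [0.7747, 0.839, 1.0758, 0.8918, 0.7575, 0.9373, 0.6192, 0.9986, 0.8596, 0.7613]

def moving_average(values, window):
    """Average each run of `window` consecutive values to smooth out batch noise."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]

smoothed = moving_average(losses, window=5)
print(f"min={min(losses)}, max={max(losses)}, mean={mean(losses):.4f}")
print([round(v, 4) for v in smoothed])
```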

---

#### 🔍 **Quick assessment:**
- With only **60 training steps**, this is a **warm-up or quick trial run** and is not representative of a full-length training run.
- The loss is fairly stable and does not climb steadily ⇒ **the model is learning**.
- Next steps:
  - ✅ Use this checkpoint for a **trial inference run**.
  - 🔁 Increase the number of steps (or epochs) for a serious training run.
  - 📉 Also track `eval loss` (if a validation set is available) to check for overfitting.
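
How short a 60-step run is becomes clear from the numbers in the Unsloth banner. The arithmetic below uses only values printed in that log; the variable names are chosen for this example.

```python
num_examples = 100_000
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 60

# One optimizer step consumes this many samples.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
examples_seen = effective_batch_size * max_steps
steps_per_epoch = -(-num_examples // effective_batch_size)  # ceiling division

print(f"effective batch size: {effective_batch_size}")
print(f"examples seen in {max_steps} steps: {examples_seen} ({examples_seen / num_examples:.2%} of the dataset)")
print(f"steps needed for one full epoch: {steps_per_epoch}")
```

So the run touches 480 of the 100,000 samples, and a full pass over the data would take 12,500 steps at this batch size.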


In [35]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

496.5793 seconds used for training.
8.28 minutes used for training.
Peak reserved memory = 4.131 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 28.024 %.
Peak reserved memory for training % of max memory = 0.0 %.

Note: the training-specific figures read 0.0 because the "Show current memory stats" cell (`In [33]`) was executed after `trainer.train()` (`In [32]`), so `start_gpu_memory` had already captured the peak reservation. Run the cells in order to measure the memory added by training.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Unsloth_Studio.ipynb)**

We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
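
As a rough picture of what these two settings do, the sketch below applies a temperature to some made-up logits and then keeps only the tokens whose probability is at least `min_p` times the top token's probability. It is a simplified illustration, not the sampler in `transformers`; the logits are hypothetical, and a given library may apply the two steps in a different order.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_filter(probs, min_p):
    """Zero out tokens below min_p * (top probability), then renormalise."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

logits = [5.0, 4.0, 2.0, -1.0, -3.0]          # hypothetical next-token logits
probs = softmax(logits, temperature=1.5)
filtered = min_p_filter(probs, min_p=0.1)
print([round(p, 3) for p in filtered])
```

A high temperature flattens the distribution, and the min-p cut then removes the unlikely tail so that the extra randomness is spread only over plausible tokens.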

In [36]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nContinue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding numbers. The sequence you provided starts with 1, 1, 2, 3, 5, and 8. Here are the next three numbers in the sequence:\n9, 14, 23<|eot_id|>']

 You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!

In [45]:
from transformers import TextStreamer
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)  # Enable fast inference mode

messages = [
    {
        "role": "system",
        "content": """
You are an intelligent task classification and response generation assistant.
Your job is to analyze user commands and generate structured JSON responses according to the correct task category.

You will be given a User Input – a natural language command from the user.

Your task is to:
1. Identify the correct tool: "todo", "todo_with_calendar", "email", "article", "thutuchanhchinh".
2. Determine the appropriate action based on user intent:
   - "todo" → [create, get_all, update, delete, complete]
   - "todo_with_calendar" → [create, get_upcoming, get_past, get_now, update, delete, invite]
   - "email" → [compose, get_inbox, reply, save_draft, delete, mark_important]
   - "article" → [create, get_all, get_published, update, delete, publish, save_draft]
   - "thutuchanhchinh" → [lookup_inf]

3. Extract any relevant summary_task and event_time.

Respond ONLY with a valid JSON object following this exact template:
{
  "tool": "<tool>",
  "action": "<action>",
  "details": {
    "summary_task": "<summary>",
    "event_time": "<time or null>"
  }
}

DO NOT explain. DO NOT include markdown. DO NOT add extra text.
"""
    },
    {
        "role": "user",
        "content": "Lấy toàn bộ danh sách công việc đã note"  # Vietnamese: "Get the full list of noted tasks"
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt=True)

_ = model.generate(
    input_ids=inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    use_cache=True,
    temperature=1.5,
    min_p=0.1
)


```json
{
  "tool": "todo",
  "action": "get_all",
  "details": {
    "summary_task": "lấy toàn bộ danh sách công việc đã note",
    "event_time": null
  }
}
```<|eot_id|>


A prompt suggested by GPT to stop the model from generating extra text. It works really well (my old prompt produced a lot of extra output).
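
The reply above still arrives wrapped in a markdown code fence and ends with `<|eot_id|>`, so it cannot be passed straight to a JSON parser. The helper below is one possible way to clean it up; `parse_model_json` is a hypothetical name, and the sample string is a shortened stand-in for the real output.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example contains no literal fence
FENCED_JSON = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)

def parse_model_json(raw):
    """Strip end-of-turn markers and an optional markdown fence, then parse the JSON."""
    text = raw.replace("<|eot_id|>", "").strip()
    fenced = FENCED_JSON.match(text)
    if fenced:
        text = fenced.group(1)
    return json.loads(text)

# Shortened stand-in for the streamed reply shown above.
raw_output = (
    FENCE + "json\n"
    '{\n  "tool": "todo",\n  "action": "get_all",\n'
    '  "details": {"summary_task": "example", "event_time": null}\n}\n'
    + FENCE + "<|eot_id|>"
)
result = parse_model_json(raw_output)
print(result["tool"], result["action"])
```

`json.loads` raises `json.JSONDecodeError` when the model adds text outside the JSON object, which gives a simple signal that a reply needs to be retried.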

The two snippets above both use the language model (a Llama model loaded through Unsloth) to **generate text** from user input. Both are correct and fit slightly different situations. Here is a **brief explanation** of the difference and what each one does:

---

### ✅ Snippet 1: **Generate text and keep the result for further processing**
```python
outputs = model.generate(
    input_ids = inputs,
    max_new_tokens = 64,
    use_cache = True,
    temperature = 1.5,
    min_p = 0.1
)

tokenizer.batch_decode(outputs)
```

- ✅ Call `model.generate(...)` so the model produces new tokens.
- ✅ Then call `tokenizer.batch_decode(outputs)` to turn the tokens into text strings.
- 👉 Use this form when you **want the result in hand to save, analyze, evaluate automatically, or display later**.

---

### ✅ Snippet 2: **Stream generated text straight to the screen**
```python
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)

_ = model.generate(
    input_ids = inputs,
    streamer = text_streamer,
    max_new_tokens = 128,
    use_cache = True,
    temperature = 1.5,
    min_p = 0.1
)
```

- ✅ Use `TextStreamer` to **display the output in real time**, like a chatbot that is "typing".
- `skip_prompt=True`: show only what the model generates, without echoing the prompt.
- 👉 Use this form when you want to **see the result immediately** (live demos, chatbots, user interfaces, and so on).

---

### 🔁 What they have in common:
- Both snippets:
  - Use the standard template from `get_chat_template`.
  - Set `add_generation_prompt=True` so the model knows to start generating the reply.
  - Move `inputs` to `cuda` to run on the GPU.

---

### ❓Which snippet should you use?

| Purpose                          | Recommended snippet |
|----------------------------------|---------------------|
| Saving results, batch processing | Snippet 1           |
| Live, chat-style display         | Snippet 2           |
| Running in a notebook            | Either works        |
| Chatbot interface                | Snippet 2           |
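
The difference between the two generation styles compared above can be shown with a toy example. `CollectingStreamer` and `fake_generate` are hypothetical stand-ins: the real `TextStreamer` receives token ids and decodes them, while this sketch passes ready-made text pieces through the same `put()` / `end()` shape.

```python
class CollectingStreamer:
    """Toy streamer: prints each piece as it arrives and also keeps a copy."""

    def __init__(self):
        self.pieces = []
        self.finished = False

    def put(self, piece):
        print(piece, end="", flush=True)   # shown immediately, like live typing
        self.pieces.append(piece)

    def end(self):
        print()                            # finish the line once generation stops
        self.finished = True

def fake_generate(pieces, streamer=None):
    """Stand-in for model.generate: emits pieces one by one and returns them all."""
    produced = []
    for piece in pieces:
        produced.append(piece)
        if streamer is not None:
            streamer.put(piece)
    if streamer is not None:
        streamer.end()
    return produced

pieces = ["The ", "next ", "number ", "is ", "13."]

# Collect-then-decode style: nothing is visible until generation has finished.
batch_result = "".join(fake_generate(pieces))

# Streaming style: each piece is printed the moment it is produced.
streamer = CollectingStreamer()
_ = fake_generate(pieces, streamer=streamer)
```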


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Describe a tall tower in the capital of France."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

The Eiffel Tower, located in the heart of Paris, stands tall among the city's historic and cultural landmarks. This iron structure, standing at an impressive 324 meters high, offers breathtaking views of the City of Light's iconic landscape. The Eiffel Tower was built for the 1889 World's Fair and has since become a symbol of French engineering and culture.<|eot_id|>


You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # I highly do NOT suggest - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
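
Conceptually, a merged save folds the LoRA update into the base weight so that no adapter is needed at inference time: the merged weight is `W + (alpha / r) * B @ A`. The sketch below shows that formula on tiny hand-written matrices; it is an illustration of the idea, not Unsloth's saving code, and all values are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the dense equivalent of base weight plus adapter."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

# Hypothetical 2x3 layer with a rank-1 adapter.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]   # out_features x in_features
A = [[1.0, 2.0, 3.0]]                    # rank x in_features
B = [[0.5], [-1.0]]                      # out_features x rank

merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)
```

After merging, the model has the same shape as the original base model, which is why the merged checkpoint can be loaded by tools that know nothing about LoRA.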

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. Other methods such as `q4_k_m` are also supported. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
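
To give a feel for what an 8-bit block format stores, the sketch below quantizes one block of 32 weights with absolute-max scaling, which is the general idea behind `q8_0`. It is a simplified illustration rather than llama.cpp's code: the weights are made up, and the scale is kept as a Python float here, whereas the size estimate assumes it is stored in 2 bytes.

```python
BLOCK_SIZE = 32  # q8_0 groups weights into blocks of 32 values

def quantize_block(values):
    """Return (scale, int8 codes) for one block using absolute-max scaling."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate floats from the stored scale and codes."""
    return [scale * c for c in codes]

weights = [((i * 37) % 64 - 32) / 100 for i in range(BLOCK_SIZE)]  # hypothetical weights
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# 32 one-byte codes plus one 2-byte scale per block.
bits_per_weight = (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE
print(f"max abs error = {max_error:.5f}, bits per weight = {bits_per_weight}")
```

The rounding error per weight is at most half a quantization step, which is why 8-bit output stays close to the 16-bit weights while taking roughly half the space.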

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
