To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News


Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

[gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) is now supported with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which creates kernels!

Introducing [Vision](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) and [Standby](https://docs.unsloth.ai/basics/memory-efficient-rl) for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


# Key Ideas for this video:
1. What is an LLM with tool calling?
2. How to train an LLM to reason before a tool call?
3. How to update the chat template of an LLM?
4. How to add special tokens to a tokenizer?
5. How to merge and push to Hugging Face? (There is a subtlety because the token embedding weight and the `lm_head` weight are tied.)

# Overview of LLM with function calling capability

![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hugs/function-callin.png)

## Example from OpenAI Function calling

![img](https://cdn.openai.com/API/docs/images/function-calling-diagram-steps.png)

The schema of `messages` to be tokenized by the associated `tokenizer` follows OpenAI's template [here](https://platform.openai.com/docs/guides/function-calling).

 ```python
tools = [{"name": "get_weather", "description": ...}]
messages = [
    {
        'role': 'system',
        'content': ...,
    },
    {
        'role': 'user',
        'content': "What is the weather in Paris?"
    },
    {
        'role': 'assistant',
        'content': '<think>Okay,....</think>',
        'tool_calls': [{"name": "get_weather", "arguments": {"location": "paris"}}]
    },
    {
        'role': 'tool',
        'content': '{"temperature": 14}'
    }
]
token_ids = tokenizer.apply_chat_template(messages, tools, ...)
```
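
To make the loop in the diagram concrete, here is a minimal sketch of the application-side step: it takes an assistant message in the schema above, runs each requested tool, and wraps every result in a `tool` message. The `get_weather` body, the `TOOL_REGISTRY` dict, and the `run_tool_calls` helper are hypothetical names introduced only for illustration; they are not part of any library.

```python
import json

def get_weather(location):
    # Hypothetical stand-in for a real weather API call.
    return {"temperature": 14}

# Map tool names (as the model emits them) to Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_calls(assistant_message):
    """Execute every tool call in an assistant message and return 'tool' messages."""
    tool_messages = []
    for call in assistant_message.get("tool_calls", []):
        func = TOOL_REGISTRY[call["name"]]
        result = func(**call["arguments"])
        # Tool results go back to the model as a JSON string.
        tool_messages.append({"role": "tool", "content": json.dumps(result)})
    return tool_messages

assistant_message = {
    "role": "assistant",
    "content": "<think>The user wants the weather in Paris.</think>",
    "tool_calls": [{"name": "get_weather", "arguments": {"location": "paris"}}],
}
tool_messages = run_tool_calls(assistant_message)
print(tool_messages)
```

The returned messages would be appended to `messages` before the next call to the model, so that it can phrase a final answer from the tool result.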


### Installation

In [None]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2

### Unsloth

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
#fourbit_models = [
#    "unsloth/mistral-7b-bnb-4bit",
#    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
#    "unsloth/llama-2-7b-bnb-4bit",
#    "unsloth/llama-2-13b-bnb-4bit",
#    "unsloth/codellama-34b-bnb-4bit",
#    "unsloth/tinyllama-bnb-4bit",
#    "unsloth/gemma-7b-bnb-4bit", # New Google 6 trillion tokens model 2.5x faster!
#    "unsloth/gemma-2b-bnb-4bit",
#] # More models at https://huggingface.co/unsloth
#
#model, tokenizer = FastLanguageModel.from_pretrained(
#    model_name = "unsloth/tinyllama-bnb-4bit", # "unsloth/tinyllama" for 16bit loading
#    max_seq_length = max_seq_length,
#    dtype = dtype,
#    load_in_4bit = load_in_4bit,
#    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
#)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In general, we have to import `FastLanguageModel` at the top.

However, we first need to prepare our dataset for tool calling. We follow the instructions from this [notebook](https://colab.research.google.com/#scrollTo=37bf938d-08fa-4577-9966-0238339afcdb&fileId=https%3A//huggingface.co/agents-course/notebooks/blob/main/bonus-unit1/bonus-unit1.ipynb).

In [None]:
from datasets import load_dataset
dataset_name = "Jofthomas/hermes-function-calling-thinking-V1"
dataset = load_dataset(dataset_name)
dataset = dataset.rename_column("conversations", "messages")
dataset

DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 3570
    })
})

In [None]:
dataset['train'][0]

{'messages': [{'content': "You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags.You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.Here are the available tools:<tools> [{'type': 'function', 'function': {'name': 'get_stock_price', 'description': 'Get the current stock price of a company', 'parameters': {'type': 'object', 'properties': {'company': {'type': 'string', 'description': 'The name of the company'}}, 'required': ['company']}}}, {'type': 'function', 'function': {'name': 'get_movie_details', 'description': 'Get details about a movie', 'parameters': {'type': 'object', 'properties': {'title': {'type': 'string', 'description': 'The title of the movie'}}, 'required': ['title']}}}] </tools>Use the following pydantic model json schema for each tool call you will make: {'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Ar

In [None]:
import re
import json
import ast

def retrieve_within_tag_name(text, tag_name):
    pattern = r'<' + re.escape(tag_name) + r'>\n(.*?)\n</' + re.escape(tag_name) + r'>'

    # Find all occurrences of the pattern in the string
    return re.findall(pattern, text)

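def retrieve_tools(text):
    tag_name = 'tools'
    pattern = r'<' + re.escape(tag_name) + r'>(.*?)</' + re.escape(tag_name) + r'>'

    # Find all occurrences of the pattern in the string.
    # The system prompt mentions "<tools></tools>" once in its instructions (an empty match)
    # before the real "<tools> [...] </tools>" block, so two matches are expected.
    extracted_texts = re.findall(pattern, text)
    if len(extracted_texts) == 1 and extracted_texts[0] == '':
        raise ValueError("No tools are given!")
    elif len(extracted_texts) == 0:
        return []
    elif len(extracted_texts) > 2:
        raise ValueError("Tools are given more than once!")
    else:
        # The tool list is the last match; it is a Python literal, not strict JSON.
        # Indexing with [-1] also covers a single non-empty match.
        return ast.literal_eval(extracted_texts[-1])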

def retrieve_tool_response(text):
    extracted_texts = retrieve_within_tag_name(text, 'tool_response')

    # Expect exactly one <tool_response> block and re-serialize it as JSON
    if len(extracted_texts) == 1:
        json_text = json.dumps(ast.literal_eval(extracted_texts[0]), ensure_ascii=False)
        return json_text
    else:
        raise ValueError("Expected exactly one <tool_response> block")

def retrieve_tool_calls(text):
    extracted_texts = retrieve_within_tag_name(text, 'tool_call')

    # Parse every <tool_call> block into a Python dict
    if len(extracted_texts) == 0:
        return []
    else:
        to_return = []
        for extracted_text in extracted_texts:
            p_obj = ast.literal_eval(extracted_text)
            # to_return.append(json.dumps(p_obj, ensure_ascii=False))
            to_return.append(p_obj)
        if len(to_return) > 1:
            print(to_return)
        return to_return
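
The helpers above parse with `ast.literal_eval` rather than `json.loads` because the dataset text uses single-quoted Python literals, as the system prompt printed earlier shows. The sketch below illustrates the difference on a hypothetical assistant turn written in that style; the sample text is invented for illustration and is not taken from the dataset.

```python
import ast
import json
import re

# Hypothetical assistant turn in the style of the dataset (single-quoted dict).
text = (
    "<think>I should look up the stock price.</think>\n\n"
    "<tool_call>\n"
    "{'name': 'get_stock_price', 'arguments': {'company': 'Apple'}}\n"
    "</tool_call>"
)

pattern = r"<tool_call>\n(.*?)\n</tool_call>"
raw_calls = re.findall(pattern, text)

# Single quotes are not valid JSON, so json.loads rejects the payload...
try:
    json.loads(raw_calls[0])
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ...while ast.literal_eval safely evaluates it as a Python literal.
tool_calls = [ast.literal_eval(raw) for raw in raw_calls]
print(parsed_as_json, tool_calls)
```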

In [None]:
def remove_tag_block(text, tag_name):
    pattern = r'<' + re.escape(tag_name) + r'>\n(.*?)\n</' + re.escape(tag_name) + r'>'

    # Find all occurrences of the pattern in the string
    extracted_texts = re.search(pattern, text, re.DOTALL)
    if extracted_texts:
        result = re.sub(
            pattern,
            '\n',
            text,
            flags=re.DOTALL
        )
        return result.strip()
    else:
        return text
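
Note that `remove_tag_block` passes `re.DOTALL`, while `retrieve_within_tag_name` does not, so the latter only matches payloads that sit on a single line. The following sketch shows the effect of the flag on a hypothetical two-line payload; the sample string is made up for illustration.

```python
import re

pattern = r"<tool_call>\n(.*?)\n</tool_call>"

# Hypothetical tool call whose payload spans two lines.
text = "<tool_call>\n{'name': 'f',\n 'arguments': {}}\n</tool_call>"

# Without DOTALL, "." cannot cross the newline inside the payload.
without_dotall = re.findall(pattern, text)
# With DOTALL, "." also matches newlines, so the whole payload is captured.
with_dotall = re.findall(pattern, text, flags=re.DOTALL)
print(without_dotall)
print(with_dotall)
```

If your data contains pretty-printed tool calls, adding `re.DOTALL` to the extraction helper as well would be one way to handle them.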

In [None]:
from transformers import AutoTokenizer, set_seed
set_seed(3407)

model_name = "unsloth/tinyllama-chat"
# TinyLlama's template has no `<|tool|>` role, so tool responses are sent under `<|user|>`

tokenizer = AutoTokenizer.from_pretrained(model_name)

print(tokenizer.chat_template)

{% for message in messages %}
{% if message['role'] == 'user' %}
{{ '<|user|>
' + message['content'] + eos_token }}
{% elif message['role'] == 'system' %}
{{ '<|system|>
' + message['content'] + eos_token }}
{% elif message['role'] == 'assistant' %}
{{ '<|assistant|>
'  + message['content'] + eos_token }}
{% endif %}
{% if loop.last and add_generation_prompt %}
{{ '<|assistant|>' }}
{% endif %}
{% endfor %}


In [None]:
chat_template = (
    '{{- bos_token }}\n'
    '{%- if messages[0].role == "system" %}\n'
        '{%- set system_message = messages[0].content|trim %}\n'
        '{%- set messages = messages[1:] %}\n'
    '{%- else %}\n'
        '{%- set system_message = "You are a function calling AI model." %}\n'
    '{%- endif %}\n'
    '{{- "<|system|>\n" }}\n'
    '{{- system_message }}\n'
    '{%- if tools %}\n'
        '{{- "You are provided with function signatures within <tools></tools> XML tags:\n<tools>\n" }}\n'
        '{%- for tool in tools %}\n'
            '{{- tool | tojson }}\n'
            '{{- "\n" }}\n'
        '{%- endfor %}\n'
        '{{- "</tools>\n" }}\n'
        '{{- "Don\'t make assumptions about what values to plug into functions. " }}\n'
        '{{- "For each function call, return a JSON object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\\"name\\": function name, \\"arguments\\": dictionary of argument name and its value}\n</tool_call>\n" }}\n'
        '{{- "Before making a function call, take the time to plan the functions to execute. " }}\n'
        '{{- "Make that thinking process between <think>your thoughts</think>." }}\n'
    '{%- endif %}\n'
    '{{- eos_token }}\n'
    '{%- for message in messages %}\n'
        '{%- if (message.role == "user") %}\n'
            '{{- "<|user|>\n" + message.content + eos_token }}\n'
        '{%- elif message.role == "tool" %}\n'
            '{{- "<|user|>\n<tool_response>\n" }}\n'
            '{{- message.content }}\n'
            '{{- "\n</tool_response>" + eos_token }}\n'
        '{%- elif message.role == "assistant" %}\n'
            '{{- "<|assistant|>\n" + message.content }}\n'
            '{%- if message.tool_calls %}\n'
                '{%- for tool_call in message.tool_calls %}\n'
                    '{%- if (loop.first and message.content) or (not loop.first) %}\n'
                        '{{- "\n" }}\n'
                    '{%- endif %}\n'
                    '{%- if tool_call.function %}\n'
                        '{%- set tool_call = tool_call.function %}\n'
                    '{%- endif %}\n'
                    '{{- "<tool_call>\n{\\"name\\": \\"" }}\n'
                    '{{- tool_call.name }}\n'
                    '{{- "\\", \\"arguments\\": " }}\n'
                    '{%- if tool_call.arguments is string %}\n'
                        '{{- tool_call.arguments }}\n'
                    '{%- else %}\n'
                        '{{- tool_call.arguments | tojson }}\n'
                    '{%- endif %}\n'
                    '{{- "}\n</tool_call>" }}\n'
                '{%- endfor %}\n'
            '{%- endif %}\n'
            '{{- eos_token }}\n'
        '{%- endif %}\n'
        '{%- if loop.last and add_generation_prompt %}\n'
            '{{- "<|assistant|>" }}\n'
        '{%- endif %}'
    '{%- endfor %}\n'
)
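
To see what this template produces for a single turn, here is a plain-Python approximation of its per-message rules. It is a reading aid, not the code that `apply_chat_template` runs: `render_message` is a hypothetical helper, the EOS value `</s>` is an assumption about the tokenizer, and `json.dumps` stands in for the template's `tojson` filter.

```python
import json

EOS = "</s>"  # assumed EOS token; the real value comes from tokenizer.eos_token

def render_message(message, eos_token=EOS):
    """Approximate what the chat template emits for one non-system message."""
    role = message["role"]
    if role == "user":
        return "<|user|>\n" + message["content"] + eos_token
    if role == "tool":
        # Tool results are wrapped in <tool_response> under the user tag.
        return ("<|user|>\n<tool_response>\n" + message["content"]
                + "\n</tool_response>" + eos_token)
    if role == "assistant":
        out = "<|assistant|>\n" + message["content"]
        for i, call in enumerate(message.get("tool_calls", [])):
            call = call.get("function") or call  # accept nested or flat calls
            if i > 0 or message["content"]:
                out += "\n"
            args = call["arguments"]
            if not isinstance(args, str):
                args = json.dumps(args)
            out += ('<tool_call>\n{"name": "' + call["name"]
                    + '", "arguments": ' + args + "}\n</tool_call>")
        return out + eos_token
    return ""

assistant = {
    "role": "assistant",
    "content": "<think>Check the weather.</think>",
    "tool_calls": [{"name": "get_weather", "arguments": {"location": "paris"}}],
}
rendered = render_message(assistant)
print(rendered)
```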

In [None]:
import random
tokenizer.chat_template = chat_template

def preprocess(sample):
    messages = sample["messages"]
    first_message = messages[0]

    # The first message holds the system prompt with the tool list: extract the tools
    # and drop that message, so the chat template can rebuild the system prompt itself
    first_message_content = first_message['content']
    tools = retrieve_tools(first_message_content)
    if len(tools) == 0:
        tools = None
    elif '<tool_call>' in first_message_content and first_message['role'] == 'human':
        raise ValueError("Tool call without system prompt. Need to implement retrieve tools from human message")
    else:
        messages.pop(0)
        if len(tools) > 1:
            random.shuffle(tools)
    for i, message in enumerate(messages):
        if message['role'] == 'human':
            messages[i]['role'] = 'user'
        elif message['role'] == 'model':
            messages[i]['role'] = 'assistant'
            tool_calls = retrieve_tool_calls(message['content'])
            if len(tool_calls) > 0:
                messages[i]['tool_calls'] = tool_calls
                messages[i]['content'] = remove_tag_block(message['content'], 'tool_call')
        elif message['role'] == 'tool':
            messages[i]['content'] = retrieve_tool_response(message['content'])
    return {"text": tokenizer.apply_chat_template(messages, tools, tokenize=False).strip('\n').removeprefix(tokenizer.bos_token)}

In [None]:
print(tokenizer.chat_template)

{{- bos_token }}
{%- if messages[0].role == "system" %}
{%- set system_message = messages[0].content|trim %}
{%- set messages = messages[1:] %}
{%- else %}
{%- set system_message = "You are a function calling AI model." %}
{%- endif %}
{{- "<|system|>
" }}
{{- system_message }}
{%- if tools %}
{{- "You are provided with function signatures within <tools></tools> XML tags:
<tools>
" }}
{%- for tool in tools %}
{{- tool | tojson }}
{{- "
" }}
{%- endfor %}
{{- "</tools>
" }}
{{- "Don't make assumptions about what values to plug into functions. " }}
{{- "For each function call, return a JSON object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{\"name\": function name, \"arguments\": dictionary of argument name and its value}
</tool_call>
" }}
{{- "Before making a function call, take the time to plan the functions to execute. " }}
{{- "Make that thinking process between <think>your thoughts</think>." }}
{%- endif %}
{{- eos_token }}
{%- for message 

In [None]:
dataset = dataset.map(preprocess, remove_columns="messages")
# dataset = dataset['train'].train_test_split(0.1)
print(dataset)

Map:   0%|          | 0/3570 [00:00<?, ? examples/s]

[{'name': 'create_note', 'arguments': {'title': 'Team Meeting', 'content': 'Discuss project updates, assign new tasks, and review deadlines.'}}, {'name': 'create_note', 'arguments': {'title': 'Team Meeting', 'content': 'Discuss project updates, assign new tasks, and review deadlines.'}}]
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 3570
    })
})


In [None]:
print(dataset['train'][0]['text'])

<|system|>
You are a function calling AI model.You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "get_movie_details", "description": "Get details about a movie", "parameters": {"type": "object", "properties": {"title": {"type": "string", "description": "The title of the movie"}}, "required": ["title"]}}}
{"type": "function", "function": {"name": "get_stock_price", "description": "Get the current stock price of a company", "parameters": {"type": "object", "properties": {"company": {"type": "string", "description": "The name of the company"}}, "required": ["company"]}}}
</tools>
Don't make assumptions about what values to plug into functions. For each function call, return a JSON object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": function name, "arguments": dictionary of argument name and its value}
</tool_call>
Before making a function call, take the time to plan

We need to create `ChatmlSpecialTokens` and add them as special tokens to the tokenizer.

In [None]:
from enum import Enum
from functools import partial
import pandas as pd
import torch
import json


class ChatmlSpecialTokens(str, Enum):
    tool_call = "<tool_call>"
    tools = "<tools>"
    think = "<think>"
    tool_response = "<tool_response>"
    eotool_call = "</tool_call>"
    eotools = "</tools>"
    eothink = "</think>"
    eotool_response = "</tool_response>"
    @classmethod
    def list(cls):
        return [c.value for c in cls]
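
The `str` mixin is what lets these enum members be used wherever a plain string is expected, and `list()` returns the raw token strings that the tokenizer needs. A small sketch with a hypothetical two-member `Tags` enum (a shortened stand-in for `ChatmlSpecialTokens`) shows the behaviour:

```python
from enum import Enum

class Tags(str, Enum):  # hypothetical, shortened copy of ChatmlSpecialTokens
    think = "<think>"
    eothink = "</think>"

    @classmethod
    def list(cls):
        # Collect the plain string values in definition order.
        return [c.value for c in cls]

values = Tags.list()
is_plain_string = isinstance(Tags.think, str)
equals_literal = Tags.think == "<think>"
print(values, is_plain_string, equals_literal)
```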

Unlike the tutorial notebook, we use the `tokenizer` loaded by `unsloth`.

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name, # "unsloth/tinyllama" for 16bit loading
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2026.1.2: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [None]:
print(tokenizer.chat_template)

{% for message in messages %}
{% if message['role'] == 'user' %}
{{ '<|user|>
' + message['content'] + eos_token }}
{% elif message['role'] == 'system' %}
{{ '<|system|>
' + message['content'] + eos_token }}
{% elif message['role'] == 'assistant' %}
{{ '<|assistant|>
'  + message['content'] + eos_token }}
{% endif %}
{% if loop.last and add_generation_prompt %}
{{ '<|assistant|>' }}
{% endif %}
{% endfor %}


Now, we change `chat_template` to work with our `dataset`.

In [None]:
tokenizer.chat_template = chat_template
tokenizer.add_special_tokens({'additional_special_tokens': ChatmlSpecialTokens.list()})
print(tokenizer.vocab_size)
print(len(tokenizer))

32000
32008


We have added `8` new tokens in `additional_special_tokens`. We need to update the `embed_tokens` layer of `model` and use `LoRA` on the updated layer.

In [None]:
model.resize_token_embeddings(len(tokenizer))

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Embedding(32008, 2048, padding_idx=0)

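**[NOTE]** TinyLlama's internal maximum sequence length is 2048, which is also the `max_seq_length` used in this notebook. With Unsloth you can choose a larger value such as 4096, and RoPE Scaling is applied internally!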

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

**[NOTE]** We set `use_gradient_checkpointing=False` ONLY for TinyLlama since Unsloth saves tonnes of memory usage. This does NOT work for `llama-2-7b` or `mistral-7b` since the memory usage will still exceed Tesla T4's 15GB. Gradient checkpointing (GC) recomputes the forward pass during the backward pass, saving loads of memory.

**[IF YOU GET OUT OF MEMORY]** set `use_gradient_checkpointing` to `True`.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head"],
    lora_alpha = 64,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    modules_to_save = ["embed_tokens", "lm_head"],
    use_gradient_checkpointing = False, # @@@ IF YOU GET OUT OF MEMORY - set to True @@@
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2026.1.2 patched 22 layers with 22 QKV layers, 22 O layers and 22 MLP layers.


Unsloth: Training embed_tokens in mixed precision to save VRAM
Unsloth: Training lm_head in mixed precision to save VRAM
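
Because `embed_tokens` and `lm_head` are listed in `modules_to_save`, they are trained as full copies, and they dominate the trainable parameter count. The sketch below reproduces that count by hand. The vocabulary size, hidden size, layer count and rank come from this notebook; the intermediate size (5632) and the k/v projection width (256) are assumptions taken from TinyLlama's published configuration, and `lora_params` is a hypothetical helper.

```python
# Values from this notebook: resized vocab, hidden size, decoder layers, LoRA rank.
vocab, hidden, layers, rank = 32008, 2048, 22, 16
# Assumed TinyLlama config values (not shown in the printout above).
intermediate, kv_width = 5632, 256

def lora_params(in_features, out_features, r=rank):
    # LoRA adds A (in x r) and B (r x out) next to each frozen weight.
    return r * (in_features + out_features)

per_layer = (
    2 * lora_params(hidden, hidden)          # q_proj, o_proj
    + 2 * lora_params(hidden, kv_width)      # k_proj, v_proj
    + 2 * lora_params(hidden, intermediate)  # gate_proj, up_proj
    + lora_params(intermediate, hidden)      # down_proj
)
lora_total = layers * per_layer

# modules_to_save keeps full trainable copies of embed_tokens and lm_head.
full_copies = 2 * vocab * hidden

trainable = lora_total + full_copies
print(f"{lora_total:,} LoRA + {full_copies:,} full = {trainable:,}")
```

Under these assumptions the total is 143,720,448, which matches the `Trainable parameters` figure in the training log further down; roughly 91% of it comes from the two full-size embedding matrices rather than from the LoRA adapters.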


In [None]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): ModulesToSaveWrapper(
          (original_module): Embedding(32008, 2048, padding_idx=0)
          (modules_to_save): ModuleDict(
            (default): Embedding(32008, 2048, padding_idx=0)
          )
        )
        (layers): ModuleList(
          (0-21): 22 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=2048, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=2048, bias=False)
                )
                (lora_embedding_A): Para

In [None]:
model.base_model.model.model.embed_tokens.weight = model.base_model.model.lm_head.weight

KeyError: "attribute 'weight' already exists"

<a name="Train"></a>
### Train the model
Now let's train our model. We do 1 full epoch, which took around 25 minutes on a Tesla T4 in the run recorded below!

In [None]:
from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset['train'],
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    packing = False, # Set to True to pack short sequences together and save time!
    args = SFTConfig(
        per_device_train_batch_size = 8,
        gradient_accumulation_steps = 4,
        eval_strategy="no",
        save_strategy="no",
        eval_steps=25,
        warmup_ratio = 0.05,
        num_train_epochs = 1,
        learning_rate = 3e-4,
        logging_steps = 25,
        optim = "adamw_8bit",
        weight_decay = 0.1,
        lr_scheduler_type = "cosine",
        seed = 3407,
        # output_dir = "outputs",
        report_to = "none", # Use TrackIO/WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=4):   0%|          | 0/3570 [00:00<?, ? examples/s]

🦥 Unsloth: Padding-free auto-enabled, enabling faster training.
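
As a quick sanity check on the settings above, the number of optimizer steps in one epoch can be worked out by hand. This is a sketch under the assumption that a final partial batch and a final partial accumulation group each count as one step; that rounding rule is an assumption, although it agrees with the `Total steps = 112` line in the training log of this run.

```python
import math

num_examples = 3570
per_device_batch, grad_accum, num_gpus = 8, 4, 1

# Examples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_gpus
# Mini-batches per epoch, counting the last partial batch.
batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
# Optimizer steps per epoch, counting the last partial accumulation group.
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
print(effective_batch, batches_per_epoch, steps_per_epoch)
```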


In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
2.982 GB of memory reserved.


```javascript
function ConnectButton(){
    console.log("Connect pushed");
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click()
}
setInterval(ConnectButton, 60000);
```

In [None]:
# alpha 64
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 3,570 | Num Epochs = 1 | Total steps = 112
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 4 x 1) = 32
 "-____-"     Trainable parameters = 143,720,448 of 1,243,801,600 (11.55% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
25,0.7638
50,0.3856
75,0.3437
100,0.3338


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1502.2121 seconds used for training.
25.04 minutes used for training.
Peak reserved memory = 6.461 GB.
Peak reserved memory for training = 3.479 GB.
Peak reserved memory % of max memory = 43.83 %.
Peak reserved memory for training % of max memory = 23.601 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the `tools` list and the user message below.

In [None]:
# This prompt is a sub-sample of one of the test set examples.
# In this example, we start the generation after the model generation starts.
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

tools = [{'type': 'function', 'function': {'name': 'convert_currency', 'description': 'Convert from one currency to another', 'parameters': {'type': 'object', 'properties': {'amount': {'type': 'number', 'description': 'The amount to convert'}, 'from_currency': {'type': 'string', 'description': 'The currency to convert from'}, 'to_currency': {'type': 'string', 'description': 'The currency to convert to'}}, 'required': ['amount', 'from_currency', 'to_currency']}}}, {'type': 'function', 'function': {'name': 'calculate_distance', 'description': 'Calculate the distance between two locations', 'parameters': {'type': 'object', 'properties': {'start_location': {'type': 'string', 'description': 'The starting location'}, 'end_location': {'type': 'string', 'description': 'The ending location'}}, 'required': ['start_location', 'end_location']}}}]
messages = [
    {'role': 'user', 'content': "Hi, I need to convert 500 USD to Euros. Can you help me with that?"}
]

prompt = tokenizer.apply_chat_template(messages, tools, add_generation_prompt=True, tokenize=False)
# print(prompt)

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
inputs = {k: v.to("cuda") for k,v in inputs.items()}
outputs = model.generate(**inputs,
                         max_new_tokens=500,# Adapt as necessary
                         do_sample=True,
                         top_p=0.95,
                         temperature=0.01,
                         repetition_penalty=1.0,
                         eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))

<s> <|system|>
You are a function calling AI model.You are provided with function signatures within <tools></tools>  XML tags:
<tools> 
{"type": "function", "function": {"name": "convert_currency", "description": "Convert from one currency to another", "parameters": {"type": "object", "properties": {"amount": {"type": "number", "description": "The amount to convert"}, "from_currency": {"type": "string", "description": "The currency to convert from"}, "to_currency": {"type": "string", "description": "The currency to convert to"}}, "required": ["amount", "from_currency", "to_currency"]}}}
{"type": "function", "function": {"name": "calculate_distance", "description": "Calculate the distance between two locations", "parameters": {"type": "object", "properties": {"start_location": {"type": "string", "description": "The starting location"}, "end_location": {"type": "string", "description": "The ending location"}}, "required": ["start_location", "end_location"]}}}
</tools> 
Don't make assumpt
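
Once the text is decoded, an application still has to pull the reasoning and the tool call out of it. The sketch below shows one way to do that with the standard library. The `decoded` string is a hypothetical generation written in the format the chat template trains for, not the actual output of this run, and `parse_generation` is an illustrative helper. The pattern tolerates extra whitespace around the tags, since decoded special tokens can be followed by spaces, as in the output above.

```python
import json
import re

# Hypothetical decoded generation; the real text depends on the trained model.
decoded = (
    "<|user|>\nHi, I need to convert 500 USD to Euros.</s>"
    "<|assistant|>\n<think>I need convert_currency.</think>\n"
    "<tool_call>\n"
    '{"name": "convert_currency", "arguments": '
    '{"amount": 500, "from_currency": "USD", "to_currency": "EUR"}}\n'
    "</tool_call></s>"
)

def parse_generation(text):
    """Split the last assistant turn into its thoughts and its tool calls."""
    answer = text.rsplit("<|assistant|>", 1)[-1]
    thoughts = re.findall(r"<think>(.*?)</think>", answer, flags=re.DOTALL)
    calls = re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", answer, flags=re.DOTALL)
    # The template writes tool calls as JSON, so json.loads is used here.
    return thoughts, [json.loads(c) for c in calls]

thoughts, tool_calls = parse_generation(decoded)
print(thoughts)
print(tool_calls)
```

A small model may emit malformed JSON, so in practice the `json.loads` call is worth wrapping in a `try`/`except json.JSONDecodeError`.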

 You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!

In [None]:
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1500)

<s> <|system|>
You are a function calling AI model.You are provided with function signatures within <tools></tools>  XML tags:
<tools> 
{"type": "function", "function": {"name": "convert_currency", "description": "Convert from one currency to another", "parameters": {"type": "object", "properties": {"amount": {"type": "number", "description": "The amount to convert"}, "from_currency": {"type": "string", "description": "The currency to convert from"}, "to_currency": {"type": "string", "description": "The currency to convert to"}}, "required": ["amount", "from_currency", "to_currency"]}}}
{"type": "function", "function": {"name": "calculate_distance", "description": "Calculate the distance between two locations", "parameters": {"type": "object", "properties": {"start_location": {"type": "string", "description": "The starting location"}, "end_location": {"type": "string", "description": "The ending location"}}, "required": ["start_location", "end_location"]}}}
</tools> 
Don't make assumpt

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

To merge the adapters, we need 2 steps:
1. Push the Unsloth model with the adapters to a Hugging Face repo.
2. Load the model with the `peft` auto model class and merge the adapters.

After merging, we can push the merged model to a Hugging Face repo for later use.

### Step 1: push the Unsloth model to the Hugging Face Hub

In [None]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
# model.save_pretrained("lora_model")  # Local saving
# tokenizer.save_pretrained("lora_model")
model.push_to_hub("Kimang18/TinyLlama-tool-calling-v2", private=True) # Online saving
tokenizer.push_to_hub("Kimang18/TinyLlama-tool-calling-v2", private=True)

Saved model to https://huggingface.co/Kimang18/TinyLlama-tool-calling-v2


  ...pjlhmerle/tokenizer.model:  75%|#######4  |  374kB /  500kB            

### Step 2: Load the model with PEFT, then merge

In [None]:
!pip install transformers==4.56.2 peft



In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
if True:
    # Plain PEFT loading path, used here only to merge the adapters.
    # For normal loading and inference, prefer Unsloth where possible.
    adapter_model_id = "Kimang18/TinyLlama-tool-calling-v2"
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        adapter_model_id, # Repo holding the LoRA adapters pushed in Step 1
        load_in_4bit = False, # Keep the weights unquantized for merging
    )
    tokenizer = AutoTokenizer.from_pretrained(adapter_model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`



In [None]:
print(len(tokenizer.get_vocab()))
device = "cuda"
model = model.eval().to(device)

tools = [{'type': 'function', 'function': {'name': 'convert_currency', 'description': 'Convert from one currency to another', 'parameters': {'type': 'object', 'properties': {'amount': {'type': 'number', 'description': 'The amount to convert'}, 'from_currency': {'type': 'string', 'description': 'The currency to convert from'}, 'to_currency': {'type': 'string', 'description': 'The currency to convert to'}}, 'required': ['amount', 'from_currency', 'to_currency']}}}, {'type': 'function', 'function': {'name': 'calculate_distance', 'description': 'Calculate the distance between two locations', 'parameters': {'type': 'object', 'properties': {'start_location': {'type': 'string', 'description': 'The starting location'}, 'end_location': {'type': 'string', 'description': 'The ending location'}}, 'required': ['start_location', 'end_location']}}}]
messages = [{'role': 'user', 'content': 'Hi, I need to convert 500 USD to Euros. Can you help me with that?'}]
prompt = tokenizer.apply_chat_template(messages, tools, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
inputs = {k: v.to(device) for k,v in inputs.items()}

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1500)

#TODO: improve chat template with clear new lines

32008
<s> <|system|>
You are a function calling AI model.You are provided with function signatures within <tools></tools>  XML tags:
<tools> 
{"type": "function", "function": {"name": "convert_currency", "description": "Convert from one currency to another", "parameters": {"type": "object", "properties": {"amount": {"type": "number", "description": "The amount to convert"}, "from_currency": {"type": "string", "description": "The currency to convert from"}, "to_currency": {"type": "string", "description": "The currency to convert to"}}, "required": ["amount", "from_currency", "to_currency"]}}}
{"type": "function", "function": {"name": "calculate_distance", "description": "Calculate the distance between two locations", "parameters": {"type": "object", "properties": {"start_location": {"type": "string", "description": "The starting location"}, "end_location": {"type": "string", "description": "The ending location"}}, "required": ["start_location", "end_location"]}}}
</tools> 
Don't make a

In [None]:
model = model.merge_and_unload()

In [None]:
model.push_to_hub('Kimang18/TinyLlama-tool-calling-v2-pt')
tokenizer.push_to_hub('Kimang18/TinyLlama-tool-calling-v2-pt')

  ...kpyrqb8/model.safetensors:   1%|          | 41.8MB / 4.40GB            

  ...pxjl9609q/tokenizer.model: 100%|##########|  500kB /  500kB            

CommitInfo(commit_url='https://huggingface.co/Kimang18/TinyLlama-tool-calling-v2-pt/commit/90219e435333169f88f6eb5755db33eaeeb5202d', commit_message='Upload tokenizer', commit_description='', oid='90219e435333169f88f6eb5755db33eaeeb5202d', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Kimang18/TinyLlama-tool-calling-v2-pt', endpoint='https://huggingface.co', repo_type='model', repo_id='Kimang18/TinyLlama-tool-calling-v2-pt'), pr_revision=None, pr_num=None)

In [None]:
if False:
    # Optional: reload the merged model pushed above with plain transformers.
    # For normal loading and inference, prefer Unsloth where possible.
    merged_model_id = "Kimang18/TinyLlama-tool-calling-v2-pt"
    from transformers import AutoModelForCausalLM, AutoTokenizer
    model = AutoModelForCausalLM.from_pretrained(
        merged_model_id, # Merged model repo, not the adapter repo
        load_in_4bit = False,
    )
    tokenizer = AutoTokenizer.from_pretrained(merged_model_id)

In [None]:
print(len(tokenizer.get_vocab()))
device = "cuda"
model = model.eval().to(device)

# tools = [{'type': 'function', 'function': {'name': 'convert_currency', 'description': 'Convert from one currency to another', 'parameters': {'type': 'object', 'properties': {'amount': {'type': 'number', 'description': 'The amount to convert'}, 'from_currency': {'type': 'string', 'description': 'The currency to convert from'}, 'to_currency': {'type': 'string', 'description': 'The currency to convert to'}}, 'required': ['amount', 'from_currency', 'to_currency']}}}, {'type': 'function', 'function': {'name': 'calculate_distance', 'description': 'Calculate the distance between two locations', 'parameters': {'type': 'object', 'properties': {'start_location': {'type': 'string', 'description': 'The starting location'}, 'end_location': {'type': 'string', 'description': 'The ending location'}}, 'required': ['start_location', 'end_location']}}}]
# messages = [{'role': 'user', 'content': 'Hi, I need to convert 500 USD to Euros. Can you help me with that?'}]
tools = [
    {
        "type": "function",
        "name": "get_horoscope",
        "description": "Get today's horoscope for an astrological sign.",
        "parameters": {
            "type": "object",
            "properties": {
                "sign": {
                    "type": "string",
                    "description": "An astrological sign like Taurus or Aquarius",
                },
            },
            "required": ["sign"],
        },
    },
    {
        "type": "function",
        "name": "get_weather",
        "description": "Retrieves current weather for the given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and country e.g. Bogotá, Colombia"
                },
                "units": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Units the temperature will be returned in."
                }
            },
            "required": ["location", "units"],
            "additionalProperties": False
        },
        "strict": True
    },
]
messages = [{'role': 'user', 'content': 'What is my horoscope? I am an Aquarius.'}]
# messages = [{'role': 'user', 'content': 'What is the weather in Paris?'}]
prompt = tokenizer.apply_chat_template(messages, tools, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
inputs = {k: v.to(device) for k,v in inputs.items()}

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 2048)

32008
<s> <|system|>
You are a function calling AI model.You are provided with function signatures within <tools></tools>  XML tags:
<tools> 
{"type": "function", "name": "get_horoscope", "description": "Get today's horoscope for an astrological sign.", "parameters": {"type": "object", "properties": {"sign": {"type": "string", "description": "An astrological sign like Taurus or Aquarius"}}, "required": ["sign"]}}
{"type": "function", "name": "get_weather", "description": "Retrieves current weather for the given location.", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "City and country e.g. Bogotá, Colombia"}, "units": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "Units the temperature will be returned in."}}, "required": ["location", "units"], "additionalProperties": false}, "strict": true}
</tools> 
Don't make assumptions about what values to plug into functions. For each function call, return a JSON object with
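
The tool definitions in the cell above are flat (`name` and `parameters` sit at the top level), whereas the earlier currency example nests them under a `function` key. If you would like both tests to present the same layout to the chat template, a small helper along these lines could wrap flat entries. `to_nested_tool` is a hypothetical name, not part of `transformers` or Unsloth, and whether the layout changes the model's answers is not something this notebook establishes.

```python
def to_nested_tool(tool):
    """Wrap a flat tool schema under a "function" key; return nested ones unchanged."""
    if "function" in tool:
        return tool
    # Everything except "type" moves under "function" (including "strict", if present).
    body = {key: value for key, value in tool.items() if key != "type"}
    return {"type": tool.get("type", "function"), "function": body}

# Shortened copy of the horoscope tool, for illustration only.
flat_tool = {
    "type": "function",
    "name": "get_horoscope",
    "description": "Get today's horoscope for an astrological sign.",
    "parameters": {
        "type": "object",
        "properties": {"sign": {"type": "string"}},
        "required": ["sign"],
    },
}

nested_tool = to_nested_tool(flat_tool)
print(sorted(nested_tool))              # ['function', 'type']
print(nested_tool["function"]["name"])  # get_horoscope
```

With this in place, `tools = [to_nested_tool(t) for t in tools]` before calling `apply_chat_template` would give every entry the nested layout.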

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can create a personal access token at https://huggingface.co/settings/tokens.
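
As a rough guide to what each precision costs on disk, the sketch below multiplies a parameter count by bytes per parameter. The 1.1 billion figure is an assumption (the nominal TinyLlama size), and `checkpoint_size_gb` is a hypothetical helper; the result ignores file metadata and the few extra embedding rows added for new tokens.

```python
def checkpoint_size_gb(n_params, bytes_per_param):
    """Approximate weight storage in decimal gigabytes (metadata ignored)."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 1.1e9  # assumption: nominal TinyLlama parameter count

sizes = {
    "float32": checkpoint_size_gb(N_PARAMS, 4),
    "float16": checkpoint_size_gb(N_PARAMS, 2),
}
for dtype, size in sizes.items():
    print(f"{dtype}: ~{size:.1f} GB")
```

These estimates line up with the `model.safetensors` uploads reported in this notebook's logs: about 4.40 GB after the PEFT merge, which appears to have kept float32 weights, and about 2.20 GB for the 16-bit merge.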

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("model")
    tokenizer.save_pretrained("model")
if False:
    model.push_to_hub("hf/model", token = "")
    tokenizer.push_to_hub("hf/model", token = "")


  ...olCalling/tokenizer.model: 100%|##########|  500kB /  500kB

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...

Unsloth: Copying 1 files from cache to `Kimang18/tinyllama-chat-ToolCalling`: 100%|██████████| 1/1 [00:18<00:00, 18.33s/it]

Successfully copied all 1 files from cache to `Kimang18/tinyllama-chat-ToolCalling`
Checking cache directory for required files...

Unsloth: Copying 1 files from cache to `Kimang18/tinyllama-chat-ToolCalling`: 100%|██████████| 1/1 [00:00<00:00, 49.48it/s]

Successfully copied all 1 files from cache to `Kimang18/tinyllama-chat-ToolCalling`

Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:00<00:00, 10810.06it/s]
Unsloth: Merging weights into 16bit:   0%|          | 0/1 [00:00<?, ?it/s]

  ...Calling/model.safetensors:   2%|2         | 50.3MB / 2.20GB

Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [01:19<00:00, 79.80s/it]

Unsloth: Merge process complete. Saved to `/content/Kimang18/tinyllama-chat-ToolCalling`


### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively. We clone `llama.cpp` and save to `q8_0` by default; other methods such as `q4_k_m` are also supported. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
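
To get a feel for what `q8_0` means for file size, the sketch below derives bits per weight from the `q8_0` block layout in `llama.cpp` (32 int8 values plus one float16 scale per block). The 1.1 billion parameter count is an assumption (nominal TinyLlama size) and the helper names are hypothetical. Real GGUF files also carry metadata and may store some tensors at a different precision, so treat the numbers as estimates only.

```python
Q8_0_WEIGHTS_PER_BLOCK = 32
Q8_0_BYTES_PER_BLOCK = 32 * 1 + 2  # 32 int8 values + one float16 scale

def bits_per_weight(bytes_per_block, weights_per_block):
    """Average storage cost of one weight in a block-quantized format."""
    return bytes_per_block * 8 / weights_per_block

def gguf_weight_size_gb(n_params, bpw):
    """Approximate weight storage in decimal gigabytes (metadata ignored)."""
    return n_params * bpw / 8 / 1e9

N_PARAMS = 1.1e9  # assumption: nominal TinyLlama parameter count

q8_bpw = bits_per_weight(Q8_0_BYTES_PER_BLOCK, Q8_0_WEIGHTS_PER_BLOCK)
q8_size = gguf_weight_size_gb(N_PARAMS, q8_bpw)
f16_size = gguf_weight_size_gb(N_PARAMS, 16)

print(f"q8_0: {q8_bpw} bits/weight, ~{q8_size:.2f} GB")
print(f"f16 : 16 bits/weight, ~{f16_size:.2f} GB")
```

The k-quant methods such as `q4_k_m` mix block types across tensors, so their effective bits per weight cannot be read off a single block layout in the same way.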

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp.

And we're done! If you have questions about Unsloth, find a bug, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐

  This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
</div>
