# *Fine-tuning for Function Calling - VERSION 3*

Built by [Trelis](https://trelis.com).

This script is commercially licensed to those who have purchased access to the Trelis ADVANCED Fine-Tuning repo OR have purchased this script individually.

Related products:
- [Function calling training dataset v3](https://huggingface.co/datasets/Trelis/function_calling_v3).

Video Demo:
- [Recommended viewing](https://youtu.be/hHn_cV5WUDI?si=JOsec4dPeJ85RFDV)

Notes:
- This notebook has been built for function-calling training, but you can swap in any question-answer style dataset.
- The system prompts have been omitted, but can be re-included if you wish to fine-tune for a certain system message.

Attribution:
- A simpler training notebook, on which this notebook is based, is available [here](https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k?usp=sharing).

---
## Getting Set Up
### Colab Setup
- You can run training on a free Google Colab Notebook for 7B models if you use quantization.
- Save a copy of this notebook: go to File -> Save a copy in Drive. (Optional, but needed if you want to make changes.)
- Go to Runtime -> Change runtime type and select GPU (T4). (If you are using a T4, make sure to comment out Flash Attention when loading the model, as it is only supported on newer GPUs.)
- Then go to Runtime -> Run all.
- It takes about 2-5 mins* for the installation (which all happens in the cloud in this notebook).
- Once all cells have run, you'll find the chat interface at the bottom.
- *Optionally, you can uncomment the code below to mount Google Drive. This saves the model to your Google Drive, bringing the total start time down to about 3 mins.
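As a rough sanity check on the 7B claim above, weight memory can be estimated from the parameter count and the bytes used per parameter. This is only a sketch: `weight_memory_gb` is a hypothetical helper, the 0.5 bytes per parameter figure for 4-bit quantization ignores quantization constants, and activations, gradients, optimizer state and adapter weights are not counted.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights only, in GB (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

SEVEN_B = 7e9  # nominal parameter count for a "7B" model (assumption)

bf16_gb = weight_memory_gb(SEVEN_B, 2.0)  # 16-bit weights: 2 bytes each
int4_gb = weight_memory_gb(SEVEN_B, 0.5)  # 4-bit weights: roughly half a byte each

print(f"16-bit weights: ~{bf16_gb:.1f} GB")
print(f"4-bit weights:  ~{int4_gb:.1f} GB")
```

A T4 has 16 GB of memory, so 16-bit weights alone (about 14 GB) leave very little room for training, while 4-bit weights (about 3.5 GB) leave headroom.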

### Setup on an Ampere or newer GPU (A40, A6000, A100, H100) with CUDA 12.1 and PyTorch 2.2.1 - RECOMMENDED.
Ampere and newer GPUs support Flash Attention, which speeds up training, as well as bf16. On older GPUs, turn off Flash Attention and train with fp16 instead of bf16.

For the best reproducibility, run this script on an A6000 using a one-click template from Runpod ([affiliate link for sign up here](https://runpod.io/?ref=jmfkcdio), supports Trelis' YouTube channel) or VastAI ([affiliate link for sign up here](https://cloud.vast.ai/?ref_id=98762), supports Trelis' YouTube channel):
- Runpod one-click template [here](https://runpod.io/gsc?template=ifyqsvjlzj&ref=jmfkcdio) - easier setup.
- Vast.ai one-click template [here](https://cloud.vast.ai/?ref_id=98762&creator_id=98762&name=Fine-tuning%20Notebook%20by%20Trelis%20-%20Cuda%2012.1) - offers smaller GPUs (which are cheaper to run).
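The precision choice described above can be sketched as a small helper. `pick_precision` is a hypothetical function, not part of this notebook, and it assumes that Ampere GPUs report a major compute capability of 8 and newer architectures report a higher value.

```python
def pick_precision(compute_capability_major: int) -> dict:
    """Suggest a dtype and Flash Attention setting from the GPU's major compute capability."""
    if compute_capability_major >= 8:
        # Ampere or newer (e.g. A40, A6000, A100, H100)
        return {"torch_dtype": "bfloat16", "use_flash_attention": True}
    # Older GPUs such as the T4 (compute capability 7.5)
    return {"torch_dtype": "float16", "use_flash_attention": False}

# In the notebook, the major version could be read with:
#   major, _ = torch.cuda.get_device_capability()
print(pick_precision(8))
print(pick_precision(7))
```

If `use_flash_attention` comes back `False`, the `attn_implementation="flash_attention_2"` argument in the model-loading cell would be commented out.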

# Install

In [1]:
# stable versions
!python -m pip install --upgrade pip
!pip install -U transformers==4.38.1 -q
!pip install -U bitsandbytes==0.42.0 -q
!pip install -U peft==0.8.2 -q
!pip install -U accelerate==0.27.2 -q
!pip install -U datasets==2.17.1 -q
!pip install -U scipy==1.12.0 -q
!pip install -U trl==0.7.11 -q
!pip install -U flash-attn==2.5.5 -q
!pip install -U hf_transfer==0.1.5 -q
!pip install -U huggingface-hub==0.20.3 -q
!pip install -U wandb==0.16.3 -q

# # install latest versions
# !python -m pip install --upgrade pip
# !pip install -U -q transformers
# !pip install -q -U bitsandbytes
# !pip install -q -U peft
# !pip install -q -U accelerate
# !pip install -q datasets
# !pip install -q -U scipy
# !pip install -q -U trl
# !pip install -U flash-attn -q
# !pip install -q -U datasets
# !pip install -U hf_transfer -q

# CONSIDER RESTARTING THE KERNEL AFTER INSTALL to ensure all updates are applied.

Collecting pip
  Downloading pip-24.0-py3-none-any.whl.metadata (3.6 kB)
Downloading pip-24.0-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.3.1
    Uninstalling pip-23.3.1:
      Successfully uninstalled pip-23.3.1
Successfully installed pip-24.0

In [2]:
# # # If using DoRA (this may soon be unnecessary, once DoRA is merged into peft, see [here](https://github.com/huggingface/peft/pull/1474)). DO NOT RUN THIS IF RUNNING LoRA or Unsloth LoRA.
# !pip uninstall peft -y
# !pip install git+https://github.com/BenjaminBossan/peft.git@feat-dora -q

In [2]:
# Required when training models/data that are gated on HuggingFace, and required for pushing models to HuggingFace
from huggingface_hub import notebook_login

notebook_login()


In [3]:
## Optional - hook up weights and biases to track training
import wandb
wandb.login()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········


wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

# Load Model

In [4]:
# base_model = "./Mistral-7B-Instruct-v0.1-function-calling-v2"
# base_model = "meta-llama/Llama-2-7b-hf"
# base_model = "meta-llama/Llama-2-7b-chat-hf"
# base_model = "meta-llama/Llama-2-13b-chat-hf"
# base_model = "codellama/CodeLlama-34b-Instruct-hf"
# base_model = "meta-llama/Llama-2-70b-chat-hf"
# base_model = "mistralai/Mistral-7B-Instruct-v0.1"
# base_model = "deepseek-ai/deepseek-coder-1.3b-instruct"
# base_model = "deepseek-ai/deepseek-coder-6.7b-instruct"
# base_model = "deepseek-ai/deepseek-coder-33b-instruct"
# base_model = "larryvrh/Yi-34B-200K-Llamafied"
# base_model = "./Yi-34B-200K-Llamafied-chat-SFT"
# base_model = "openchat/openchat_3.5"
# base_model = "SUSTech/SUS-Chat-34B"
# base_model = "mistralai/Mixtral-8x7B-Instruct-v0.1"
# base_model = "microsoft/phi-2"
# base_model = "Qwen/Qwen1.5-0.5B-Chat" # Not strong enough for function calling with LoRA and r=8 OR r=128.
# base_model = "Qwen/Qwen1.5-1.8B-Chat" # Gets 4 out of 7 with LoRA and r=8. Gets 5 out of 7 with DoRA and r=8. HOWEVER, not capable of handling responses well...
# base_model = "Qwen/Qwen1.5-4B-Chat" # Gets 6 out of 7 correct with DoRA and r=8. Handles responses well!
base_model = "Qwen/Qwen1.5-14B-Chat" # Gets 7 out of 7 correct with DoRA and r=8. Handles responses well!

cache_dir = '' # initialise cache_dir as an empty string; you can set Google Drive as the cache_dir below if you wish

In [5]:
%env HF_HUB_ENABLE_HF_TRANSFER=True

env: HF_HUB_ENABLE_HF_TRANSFER=True


In [6]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline, AutoConfig
import transformers
import torch
from torch.utils.data import DataLoader, Dataset

In [6]:
## For Google Colab and using Google Drive Only
# from google.colab import drive
# drive.mount('/content/drive')
# import os
# cache_dir = "/content/drive/My Drive/huggingface_cache"
# os.makedirs(cache_dir, exist_ok=True) # Ensure the directory exists

# # https://stackoverflow.com/questions/56081324/why-are-google-colab-shell-commands-not-working
# import locale
# def getpreferredencoding(do_setlocale = True):
#     return "UTF-8"
# locale.getpreferredencoding = getpreferredencoding

In [7]:
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16
# )

# Now you can create the model with the modified configuration
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    # quantization_config=bnb_config, #uncomment to use quantization, comment out to use full-precision
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2", #turn off if not supported by model or your GPU
    cache_dir=cache_dir
)


## Set Up Tokenizer

In [8]:
# # Required for certain tokenizers like Yi
# !pip install sentencepiece -q -U

In [8]:
tokenizer = AutoTokenizer.from_pretrained(base_model, cache_dir=cache_dir, trust_remote_code=True)
# tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", cache_dir=cache_dir)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [9]:
print("EOS token:", tokenizer.eos_token)
print("EOS token id:", tokenizer.eos_token_id)

EOS token: <|im_end|>
EOS token id: 151645


In [10]:
print("Pad token: ", tokenizer.pad_token)
print("Pad token ID: ", tokenizer.pad_token_id)

Pad token:  <|endoftext|>
Pad token ID:  151643


In [11]:
tokenizer.padding_side='right'
print(tokenizer)

Qwen2TokenizerFast(name_or_path='Qwen/Qwen1.5-14B-Chat', vocab_size=151643, model_max_length=32768, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


### Set the pad token...
Some models already have a pad token set. You can see whether they do or don't from the tokenizer print statement above.
- Ideally, the model and tokenizer already have a pad token set, in which case you don't need to do anything further.
- If no pad token is set, the next best thing is to use the `<unk>` token (provided there is one).
- The next (quick) option is to use the EOS token, but it can be better to use a custom pad token instead.
- The last option is to add a pad token. This expands the model's embedding matrix, so its size may no longer be a multiple of 16, which can slow down inference.
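The preference order above can be sketched in plain Python. `choose_pad_token` and `round_up_to_multiple` are hypothetical helpers for illustration, and the vocabulary size of 32001 is a made-up example of a 32000-token vocabulary with one added pad token.

```python
from typing import Optional

def choose_pad_token(pad_token: Optional[str], vocab: set, eos_token: str) -> str:
    """Return the pad token to use, following the preference order listed above."""
    if pad_token is not None:
        return pad_token  # 1. already set: nothing to do
    if "<unk>" in vocab:
        return "<unk>"    # 2. reuse the unk token
    return eos_token      # 3. quick fallback: reuse the EOS token

def round_up_to_multiple(n: int, multiple: int = 16) -> int:
    """Smallest multiple of `multiple` that is greater than or equal to n."""
    return ((n + multiple - 1) // multiple) * multiple

print(choose_pad_token("<|endoftext|>", {"<|im_end|>"}, "<|im_end|>"))
print(choose_pad_token(None, {"<unk>", "</s>"}, "</s>"))
print(round_up_to_multiple(32001))  # 32016
```

If a new pad token does have to be added, rounding the embedding size up to the next multiple of 16 is one way to avoid the slowdown. Recent versions of transformers expose a `pad_to_multiple_of` argument on `resize_token_embeddings` for this, but check that your installed version supports it.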

In [13]:
# ## OPTION A - set the pad token to <unk> if <unk> is in the tokenizer OR set it to the EOS token.
# if '<unk>' in tokenizer.get_vocab():
#     print('unk token is in the tokenizer. Using unk for pad')
#     # Set the pad token
#     tokenizer.pad_token = '<unk>'
# else:
#     print(f'Using EOS token, {tokenizer.eos_token}, for padding')
#     tokenizer.pad_token = tokenizer.eos_token

# ## OPTION B - create a pad token
# # # Check if the pad token is already in the tokenizer vocabulary
# # if '<pad>' not in tokenizer.get_vocab():
# #     print('pad token not in the tokenizer')

# #     # Add the pad token
# #     tokenizer.add_tokens(['<pad>'])

# # # Set the pad token
# # tokenizer.pad_token = '<pad>'

# # # Resize token embeddings
# # model.resize_token_embeddings(len(tokenizer))



In [14]:
# # Update pad token id in model and its config
# model.pad_token_id = tokenizer.pad_token_id
# model.config.pad_token_id = tokenizer.pad_token_id

# # Check if they are equal
# assert model.pad_token_id == tokenizer.pad_token_id, "The model's pad token ID does not match the tokenizer's pad token ID!"

# # Print the pad token ids
# print('Tokenizer pad token ID:', tokenizer.pad_token_id)
# print('Model pad token ID:', model.pad_token_id)
# print('Model config pad token ID:', model.config.pad_token_id)
# print('Number of tokens now in tokenizer:', len(tokenizer))



In [12]:
# Print the model configuration
print(model.config)

Qwen2Config {
  "_name_or_path": "Qwen/Qwen1.5-14B-Chat",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 13696,
  "max_position_embeddings": 32768,
  "max_window_layers": 35,
  "model_type": "qwen2",
  "num_attention_heads": 40,
  "num_hidden_layers": 40,
  "num_key_value_heads": 40,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.38.1",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}



In [13]:
# Sample string
# sample_string = ['hello [/INST]', 'my good friend</s>']
sample_string = ['howdy']

# Tokenize the sample string
encoded_sample = tokenizer(sample_string, truncation=True, padding=True, max_length=1024, return_tensors='pt', add_special_tokens=True)

# Count the number of tokens (len(encoded_sample) would count the dictionary keys, not the tokens)
token_count = encoded_sample['input_ids'].shape[1]

# Fetch BOS and EOS tokens
BOS_token_id = tokenizer.bos_token_id
EOS_token_id = tokenizer.eos_token_id

# Check and print BOS and EOS tokens
if BOS_token_id is not None:
    BOS_token = tokenizer.decode([BOS_token_id])
    print(f"Beginning of the sequence: {sample_string[0]} (BOS token: {BOS_token}, id: {BOS_token_id})")
else:
    print('There is no BOS token')

if EOS_token_id is not None:
    EOS_token = tokenizer.decode([EOS_token_id])
    print(f"End of the sequence: {sample_string[-1]} (EOS token: {EOS_token}, id: {EOS_token_id})")
else:
    print('There is no EOS token')

print(f"The number of tokens in the string is: {token_count}")
print(f"The ids are: {encoded_sample}")

# Decode the input_ids
decoded_sample = tokenizer.decode(encoded_sample['input_ids'][0], skip_special_tokens=False)

# Print the decoded string
print(f"The decoded string is: {decoded_sample}")

# Print the attention mask
print(f"The attention mask is: {encoded_sample['attention_mask']}")


There is no BOS token
End of the sequence: howdy (EOS token: <|im_end|>, id: 151645)
The number of tokens in the string is: 2
The ids are: {'input_ids': tensor([[ 5158, 10258]]), 'attention_mask': tensor([[1, 1]])}
The decoded string is: howdy
The attention mask is: tensor([[1, 1]])


## Set up LoRA

In [14]:
# #only if you want to load the model with standard LoRA (not unsloth) adapters (it can be faster to download a base model and add adapters)
# from peft import PeftModel

# # adapter_model = f'{base_model}' + '-function-calling-adapters-v2'
# adapter_model = "Trelis/Yi-34B-200K-Llamafied-chat-SFT-adapters"

# # load peft model with adapters
# model = PeftModel.from_pretrained(
#     model,
#     adapter_model,
# )

In [15]:
# Include this to reduce VRAM usage. Supported by most models.
model.gradient_checkpointing_enable()

## ONLY INCLUDE THIS IF DOING QLORA
# from peft import prepare_model_for_kbit_training
# model = prepare_model_for_kbit_training(model)

In [18]:
# # if you want to see a list of the model's modules
# print(model.state_dict().keys())

In [20]:
# print(model)

In [16]:
# # # only needed if you are trying to extend the context of a model.
# # def set_added_trainable_params(model):
# #     """
# #     Sets the parameters with names containing "embed" or "norm" as trainable.
# #     """
# #     trainable_params_dict = {}
    
# #     for name, param in model.named_parameters():
# #         if "embed" in name or "norm" in name: #for most models
# #         # if "ln" in name or "embd" in name: #for Phi-2
# #             param.requires_grad_()
# #             trainable_params_dict[name] = param
            
# #     return trainable_params_dict

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
            
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

from peft import LoraConfig, get_peft_model

# Initialize LoRA configuration
config = LoraConfig(
    r=8, #typically use 8 for models 7B or larger and use 128 for smaller models
    lora_alpha=32, #typically use 32 for models 7B or larger and use 512 for smaller models (i.e. 4x r)
    # r=128,
    # lora_alpha=512,
    target_modules=[
        #     "Wqkv", #for Phi-2
        #     "fc1", #for Phi-2
        #     "fc2" #for Phi-2
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        # "self_attn.rotary_emb.inv_freq",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lora_magnitude_vector", #required for DoRA
        # "input_layernorm.weight",
        # "post_attention_layernorm.weight",
        # "model.norm.weight",
        # "lm_head.weight"
    ],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    # use_dora=True
)

# Assuming 'model' is your original model instance
# Apply LoRA to the model
model = get_peft_model(model, config)

# # Set added parameters with names containing "embed" or "norm" as trainable. Recommended if you are extending an LLM's context window.
# set_added_trainable_params(model)

# Print out the number of trainable parameters
print_trainable_parameters(model)

trainable params: 31170560 || all params: 14198461440 || trainable%: 0.21953477235347552
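The trainable parameter count printed above can be reproduced by hand from the dimensions in the model config. This is a sketch: `lora_params` is a hypothetical helper, and it assumes each targeted linear layer gets one rank-`r` pair of matrices A and B with no bias.

```python
# Dimensions taken from the printed Qwen2Config
hidden_size = 5120
intermediate_size = 13696
num_layers = 40
r = 8

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)

# q_proj, k_proj, v_proj and o_proj all map hidden_size -> hidden_size here,
# because num_key_value_heads equals num_attention_heads for this model.
attention = 4 * lora_params(hidden_size, hidden_size, r)

# gate_proj and up_proj map hidden -> intermediate; down_proj maps back.
mlp = 3 * lora_params(hidden_size, intermediate_size, r)

total = num_layers * (attention + mlp)
print(total)  # 31170560
```

Because the count scales linearly with `r`, switching to `r=128` would make the adapters 16 times larger.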


In [17]:
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Qwen2ForCausalLM(
      (model): Qwen2Model(
        (embed_tokens): Embedding(152064, 5120)
        (layers): ModuleList(
          (0-39): 40 x Qwen2DecoderLayer(
            (self_attn): Qwen2FlashAttention2(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=5120, out_features=5120, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=5120, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=5120, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear(
                (base_layer): Linear(in_features=5

# Prepare Data

In [18]:
from datasets import load_dataset

data = load_dataset(
    # "Trelis/function_calling_extended", #this is the v2 dataset
    # revision="functionList" # optionally specify a branch
    "Trelis/function_calling_v3",
    revision="multi-lingual" # optionally specify a branch
    # "Trelis/function_calling_v3_SAMPLE", # use this for testing if you haven't purchased access to the v3 full dataset AND want to run a simple test.
    )


In [19]:
print(data)

DatasetDict({
    train: Dataset({
        features: ['functionList', 'userPrompt', 'assistantResponse'],
        num_rows: 198
    })
    validation: Dataset({
        features: ['functionList', 'userPrompt', 'assistantResponse'],
        num_rows: 57
    })
    test: Dataset({
        features: ['functionList', 'userPrompt', 'assistantResponse'],
        num_rows: 21
    })
})


In [20]:
class TextDataset(Dataset):
    def __init__(self, encodings, response_lengths, input_lengths):
        self.encodings = encodings
        self.response_lengths = response_lengths
        self.input_lengths = input_lengths

    def __getitem__(self, idx):
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}

        # Set labels to be the same as input_ids
        item["labels"] = item["input_ids"].clone()

        # Calculate the start and end positions of the response
        response_start_position = self.input_lengths[idx]
        response_end_position = self.input_lengths[idx] + self.response_lengths[idx]

        # Create a loss mask that covers only the response tokens
        item["loss_mask"] = torch.zeros_like(item["input_ids"])
        item["loss_mask"][response_start_position:response_end_position] = 1

        # Shift the loss mask to the left by one position
        shifted_loss_mask = torch.cat([item["loss_mask"][1:], torch.tensor([0])])
        item["loss_mask"] = shifted_loss_mask

        # Shift the labels to the left by one position
        item["labels"][:-1] = item["input_ids"][1:]

        # The label at the last response token is whatever follows the response; set it to EOS
        item["labels"][response_end_position - 1] = tokenizer.eos_token_id

        # Include that EOS position in the loss mask
        item["loss_mask"][response_end_position - 1] = 1

        return item

    def __len__(self):
        return len(self.encodings["input_ids"])
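To see what `__getitem__` produces, here is the same shifting logic on a toy sequence, using plain lists instead of tensors. The token ids, `EOS_ID` and `PAD_ID` are made up for illustration.

```python
EOS_ID = 99  # hypothetical EOS id for this toy example
PAD_ID = 0   # hypothetical pad id

# 3 prompt tokens, 2 response tokens, 2 pad tokens
input_ids = [10, 11, 12, 20, 21, PAD_ID, PAD_ID]
input_length, response_length = 3, 2

start = input_length
end = input_length + response_length

# Mark the response positions, then shift the mask left by one
loss_mask = [1 if start <= i < end else 0 for i in range(len(input_ids))]
loss_mask = loss_mask[1:] + [0]

# Shift labels left by one, so labels[i] is the token that follows input_ids[i]
labels = input_ids[1:] + [input_ids[-1]]

# The position of the last response token should predict EOS
labels[end - 1] = EOS_ID
loss_mask[end - 1] = 1

print(labels)     # [11, 12, 20, 21, 99, 0, 0]
print(loss_mask)  # [0, 0, 1, 1, 1, 0, 0]
```

The loss is counted at three positions: the last prompt token (which predicts the first response token), the first response token, and the last response token (which predicts EOS).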

In [21]:
# Generate sample prompt from an array of messages and the chat template
messages = [
    {
        "role": "user",
        "content": "How are you?"
    },
    {
        "role": "assistant",
        "content": "Great."
    },
    {
        "role": "user",
        "content": "Where are you from?"
    },
]

formatted_messages = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

print(formatted_messages)

<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
Great.<|im_end|>
<|im_start|>user
Where are you from?<|im_end|>
<|im_start|>assistant



In [22]:
# Define the function start and end strings
B_FUNC, E_FUNC = "You have access to the following functions. Use them if required:\n\n", "\n\n"

# Define the user prompt start and end strings
# B_INST, E_INST = "[INST] ", " [/INST]" #Llama 2 or Mistral style
B_INST, E_INST = "<|im_start|>user\n", "<|im_end|>\n<|im_start|>assistant\n" #Qwen1.5 style
# B_INST, E_INST = "GPT4 Correct User: ", "<|end_of_turn|>GPT4 Correct Assistant:" #OpenChat style
# B_INST, E_INST = "Instruct:", "\nOutput:" #Phi 2
# B_INST, E_INST = "\n### Instruction:\n", "\n### Response:\n" #DeepSeek Coder Style
# B_INST, E_INST = "Human: ", " Assistant:" #Yi Style for function calling, no training space
# B_INST, E_INST = "### Human: ", "\n\n### Assistant: " #SUSChat
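As a quick illustration of how these strings combine, the sketch below assembles one prompt in the Qwen1.5 style. The function list and user prompt are hypothetical stand-ins for a dataset row, and the trailing two newlines mirror what the later cells append after `E_INST`.

```python
B_FUNC, E_FUNC = "You have access to the following functions. Use them if required:\n\n", "\n\n"
B_INST, E_INST = "<|im_start|>user\n", "<|im_end|>\n<|im_start|>assistant\n"  # Qwen1.5 style

# Hypothetical inputs, standing in for one dataset row
function_list = '[{"type": "function", "function": {"name": "get_stock_price"}}]'
user_prompt = "What is the price of Apple stock?"

# Functions come first inside the user turn, followed by the user's question
prompt = f"{B_INST}{B_FUNC}{function_list.strip()}{E_FUNC}{user_prompt.strip()}{E_INST}\n\n"
print(prompt)
```

The function list sits inside the user turn rather than in a system message, which matches the note at the top of this notebook that system prompts are omitted.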

In [23]:
def prepare_dataset(dataset, tokenizer):
    # Create the formatted text with the correct roles for each part of the dialogue
    formatted_dataset = dataset.map(
        lambda x: {
            "input_text": "".join([
                f"{B_INST}{B_FUNC}{x['functionList'].strip()}{E_FUNC}",
                f"{x['userPrompt'].strip()}{E_INST}\n\n", # note that \n\n is added here during training to avoid different tokenizations of the E_INST string with whatever follows.
                f"{x['assistantResponse'].strip()}",  # appending the EOS token in TextData...
            ]),
            "response_text": "".join([
                f"{x['assistantResponse'].strip()}",  # appending the EOS token in TextData...
            ]),
        }
    )

    # Tokenize the datasets
    encodings = tokenizer([dialogue["input_text"] for dialogue in formatted_dataset], truncation=True, padding=True, max_length=1024, return_tensors='pt', add_special_tokens=True)

    # Tokenize the response one by one without padding and special tokens for the purpose of calculating length
    response_lengths = [len(tokenizer.encode(dialogue["response_text"], truncation=True, max_length=1024, padding=False, add_special_tokens=False)) for dialogue in formatted_dataset]

    # Tokenize the input one by one without padding and with the initial special token for the purpose of calculating length
    total_lengths = [len(tokenizer.encode(dialogue["input_text"], truncation=True, max_length=1024, padding=False, add_special_tokens=True)) for dialogue in formatted_dataset]
    input_lengths = [total_length - response_length for total_length, response_length in zip(total_lengths, response_lengths)]

    # Create TextDataset
    text_dataset = TextDataset(encodings, response_lengths, input_lengths)

    return text_dataset

In [24]:
# Apply function to your datasets
train_dataset = prepare_dataset(data['train'], tokenizer)
test_dataset = prepare_dataset(data['test'], tokenizer)
validation_dataset = prepare_dataset(data['validation'], tokenizer)


### Examine the datasets

In [30]:
# # Print the number of items in the dataset
# print(f"Number of samples in the dataset: {len(train_dataset)}")

# # Get a sample item
# sample_item = train_dataset[1]  # replace 1 with the index of the sample you want to examine

# # Print the dimensions of the sample item
# print(f"Dimensions of input_ids: {sample_item['input_ids'].shape}")
# print(f"Dimensions of attention_mask: {sample_item['attention_mask'].shape}")
# print(f"Dimensions of loss_mask: {sample_item['loss_mask'].shape}")
# print(f"Dimensions of labels: {sample_item['labels'].shape}")

# # Print some tokens from the start and end of the sample
# num_tokens_to_print = 336  # replace with the number of tokens you want to print

# print("\nTokens at the start of the sample:")
# print(sample_item['input_ids'][:num_tokens_to_print].tolist())
# print(tokenizer.convert_ids_to_tokens(sample_item['input_ids'][:num_tokens_to_print].tolist()))

# print("\nLabels at the start of the sample:")
# print(sample_item['labels'][:num_tokens_to_print].tolist())
# print(tokenizer.convert_ids_to_tokens(sample_item['labels'][:num_tokens_to_print].tolist()))

# print("Attention mask at the start of the sample:")
# print(sample_item['attention_mask'][:num_tokens_to_print].tolist())

# print("Loss mask at the start of the sample:")
# print(sample_item['loss_mask'][:num_tokens_to_print].tolist())

# print("\nTokens at the end of the sample:")
# print(sample_item['input_ids'][-num_tokens_to_print:].tolist())
# print(tokenizer.convert_ids_to_tokens(sample_item['input_ids'][-num_tokens_to_print:].tolist()))

# print("\nLabels at the end of the sample:")
# print(sample_item['labels'][-num_tokens_to_print:].tolist())
# print(tokenizer.convert_ids_to_tokens(sample_item['labels'][-num_tokens_to_print:].tolist()))

# print("Attention mask at the end of the sample:")
# print(sample_item['attention_mask'][-num_tokens_to_print:].tolist())

# print("Loss mask at the end of the sample:")
# print(sample_item['loss_mask'][-num_tokens_to_print:].tolist())


# Generate a sample

In [25]:
import textwrap
wrapper = textwrap.TextWrapper(width=80)

In [26]:
import re  # import regular expressions module

In [27]:
import gc  # import Python's garbage collection module

def generate(index,data_split="test"):

    functionList = data[data_split][index]['functionList']
    user_prompt = data[data_split][index]['userPrompt']
    correct_answer = data[data_split][index]['assistantResponse']

    # model.config.use_cache = True #unsure this is needed

    # Format your prompt template
    prompt = f"{B_INST}{B_FUNC}{functionList.strip()}{E_FUNC}{user_prompt.strip()}{E_INST}\n\n"

    print(f"Using the {data_split} data split.\n\nPrompt:")
    print(prompt)

    inputs = tokenizer([prompt], return_tensors="pt", add_special_tokens=True).to("cuda")

    if "token_type_ids" in inputs:
        del inputs["token_type_ids"]

    # print(f'model is on: {next(model.parameters()).device}')  # Debug line
    # print(f'input_ids is on: {inputs["input_ids"].device}')  # Debug line

    model.generation_config.top_k = None
    
    output = model.generate(**inputs,
                            max_new_tokens=200,
                            do_sample=False,
                            pad_token_id=tokenizer.pad_token_id,
                            eos_token_id=tokenizer.eos_token_id,
                            temperature=1.0,
                            top_p=1.0,
                           )

    print()

    # Subtract the length of input_ids from output to get only the model's response
    output_text = tokenizer.decode(output[0, len(inputs.input_ids[0]):], skip_special_tokens=False)
    output_text = re.sub('\n+', '\n', output_text)  # remove excessive newline characters

    print("**Generated Assistant Response:**")
    print(output_text)

    print()

    print("**Correct Assistant Response:**")
    print(correct_answer)

    print()

    # Clear GPU cache and run garbage collection
    torch.cuda.empty_cache()  # Clear GPU cache
    gc.collect()  # Run garbage collection

In [28]:
# # Run over the full test split - not usually worth it for function calling, as the model won't one-shot things.
# for index in range(len(test_dataset)):
#     print(f'---Running index {index}---')
#     generate(index, "test")

# OR
print('---Running index 0---')
generate(0, "test")

---Running index 0---
Using the test data split.

Prompt:
<|im_start|>user
You have access to the following functions. Use them if required:

[
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the stock price of an array of stocks",
            "parameters": {
                "type": "object",
                "properties": {
                    "names": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        },
                        "description": "An array of stocks"
                    }
                },
                "required": [
                    "names"
                ]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_big_stocks",
            "description": "Get the names of the largest N stocks by market cap",
            "parameters": {
    

# Training

In [29]:
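import torch.nn as nn
from torch.utils.data import DataLoader  # used by the custom dataloader methods below (harmless if already imported earlier)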

In [30]:
class CustomTrainer(transformers.Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extra keyword arguments that newer transformers versions may pass to compute_loss.
        # Number of tokens, counted from the end of each sequence, to show in the optional debug printout below.
        num_tokens = 25

        labels = inputs.pop("labels")

        # # Get the first 200 label IDs for each sequence in the batch
        # first_label_ids = labels[:, :200]
        # # Convert to tokens
        # first_tokens = [tokenizer.convert_ids_to_tokens(label_ids) for label_ids in first_label_ids]
        # # Print them
        # for batch_idx, tokens in enumerate(first_tokens):
        #     print(f"First 200 decoded tokens for sequence {batch_idx + 1}: {tokens}")
        
        loss_mask = inputs.pop("loss_mask")

        # Forward pass
        outputs = model(**inputs)

        logits = outputs.logits

        # Check for NaN in logits
        if torch.isnan(logits).any():
            print("NaN detected in logits")
            print(logits)

        # Convert logits to probabilities using the softmax function
        probs = nn.functional.softmax(logits, dim=-1)

        # Get the most probable tokens (only used by the optional debug printout below)
        predicted_token_ids = torch.argmax(probs, dim=-1)

        # Compute the loss
        loss_fct = nn.CrossEntropyLoss(reduction='none')
        losses = loss_fct(logits.view(-1, self.model.config.vocab_size), labels.view(-1))

        # Reshaping the losses to have dimensions [batch_size, seq_length]
        losses = losses.view(-1, inputs['input_ids'].size(1))

        # Apply the loss mask
        masked_loss = losses * loss_mask

        # Check for NaN in the per-token losses
        if torch.isnan(losses).any():
            print("NaN detected in losses")
            # print(losses)

        # If the mask selects no tokens, return early with a zero loss that stays
        # attached to the graph, so that the backward pass still works.
        if loss_mask.sum() == 0:
            print("Sum of loss_mask is zero")
            zero_loss = (logits * 0.0).sum()
            return (zero_loss, outputs) if return_outputs else zero_loss

        # Aggregate the masked losses
        loss = masked_loss.sum() / (loss_mask.sum() + 1e-9)  # normalize by the number of unmasked tokens; the epsilon guards against division by zero

        # Print formatted tokens
        batch_size, seq_length = inputs['input_ids'].size()

        # num_tokens = len(inputs['input_ids'][0])

        # # Useful for debugging training - recommend just training a small number of steps
        # print("-" * 120)
        # print(f"Token analysis for last {num_tokens} tokens:")
        # header_format = "{:<10}{:<20}{:<20}{:<20}{:<20}{:<30}{:<30}".format("Index", "Input Token", "Predicted Token", "True Token", "Loss Mask", "Raw Loss", "Masked Loss")

        # for batch_idx in range(batch_size):
        #     input_tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][batch_idx])  # Using batch_idx
        #     predicted_tokens = tokenizer.convert_ids_to_tokens(predicted_token_ids[batch_idx])  # Using batch_idx
        #     true_tokens = tokenizer.convert_ids_to_tokens(labels[batch_idx])  # Using batch_idx

        #     print(f"\nBatch {batch_idx + 1} of {batch_size}:")
        #     print(header_format)
        #     for i in range(-num_tokens, 0, 1):
        #         index = seq_length + i  # Correct index based on sequence length
        #         print("{:<10}{:<20}{:<20}{:<20}{:<20.1f}{:<30.6f}{:<30.6f}".format(index, input_tokens[index], predicted_tokens[index], true_tokens[index], loss_mask[batch_idx, i].item(), losses[batch_idx, i], masked_loss[batch_idx, i]))
        #     print("-" * 120)

        return (loss, outputs) if return_outputs else loss

    def get_train_dataloader(self):
        train_dataset = self.train_dataset
        data_collator = self.data_collator

        dataloader_params = {
            "batch_size": self.args.train_batch_size,
            "collate_fn": data_collator,
            "num_workers": self.args.dataloader_num_workers,
            "pin_memory": self.args.dataloader_pin_memory,
        }

        if not isinstance(train_dataset, torch.utils.data.IterableDataset):
            dataloader_params["sampler"] = self._get_train_sampler()
            dataloader_params["drop_last"] = self.args.dataloader_drop_last

        return DataLoader(train_dataset, **dataloader_params)

    def get_eval_dataloader(self, eval_dataset=None):
        eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
        if eval_dataset is None:
            raise ValueError("Trainer: evaluation requires an eval_dataset.")

        data_collator = self.data_collator

        # Parameters for the DataLoader
        dataloader_params = {
            "batch_size": self.args.eval_batch_size,
            "collate_fn": data_collator,
            "num_workers": self.args.dataloader_num_workers,
            "pin_memory": self.args.dataloader_pin_memory,
        }

        # If your dataset isn't an instance of torch's IterableDataset, you can provide sampler and drop_last
        if not isinstance(eval_dataset, torch.utils.data.IterableDataset):
            dataloader_params["sampler"] = self._get_eval_sampler(eval_dataset)
            dataloader_params["drop_last"] = False  # Typically, you don't drop the last batch for evaluation

        return DataLoader(eval_dataset, **dataloader_params)
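
The masked loss in `compute_loss` can be checked by hand on a tiny example. The sketch below is plain Python rather than PyTorch and uses made-up logits; `token_cross_entropy` and `masked_mean_loss` are illustrative helper names, not part of the notebook. It computes the per-token cross-entropy, zeroes out the positions where `loss_mask` is 0, and divides by the number of unmasked tokens (plus the same small epsilon).

```python
import math

def token_cross_entropy(logits, label):
    """Cross-entropy of one token: -log softmax(logits)[label]."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]

def masked_mean_loss(seq_logits, labels, loss_mask, eps=1e-9):
    """Average the per-token loss over positions where loss_mask is 1."""
    losses = [token_cross_entropy(l, y) for l, y in zip(seq_logits, labels)]
    masked = [loss * m for loss, m in zip(losses, loss_mask)]
    return sum(masked) / (sum(loss_mask) + eps)

# Toy sequence of three positions over a vocabulary of size 3 (made-up numbers).
seq_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]
labels = [0, 1, 2]
loss_mask = [0, 1, 1]  # position 0 belongs to the prompt, so it is excluded

loss = masked_mean_loss(seq_logits, labels, loss_mask)
print(f"masked mean loss: {loss:.4f}")
```

Only the two unmasked positions contribute, so changing the logits at position 0 leaves the result unchanged.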

In [31]:
class CustomDataCollator: # Needed if the EOS token is to be included in training.
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, batch):

        input_ids = torch.stack([item['input_ids'] for item in batch])
        attention_mask = torch.stack([item['attention_mask'] for item in batch])
        labels = torch.stack([item['labels'] for item in batch])
        loss_mask = torch.stack([item['loss_mask'] for item in batch])

        # # Debugging: print details of the first sequence in the batch
        # num_elements_to_view = 20  # Number of last elements to view

        # # Decoding the input_ids
        # decoded_input_tokens = self.tokenizer.convert_ids_to_tokens(input_ids[0].tolist())

        # print("Debugging last", num_elements_to_view, "elements of the first sequence in the batch:")
        # print("{:<20}{:<20}{:<20}{:<20}".format("Token", "Input ID", "Label", "Loss Mask"))
        # for i in range(-num_elements_to_view, 0, 1):
        #   print("{:<20}{:<20}{:<20}{:<20}".format(decoded_input_tokens[i], input_ids[0, i].item(), labels[0, i].item(), loss_mask[0, i].item()))

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels,
            'loss_mask': loss_mask
        }

data_collator = CustomDataCollator(tokenizer)
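
One reason a custom collator with an explicit `loss_mask` helps, to my understanding, is that `DataCollatorForLanguageModeling(mlm=False)` sets the label to -100 wherever the input ID equals the pad token ID. When the pad token shares an ID with EOS, the real EOS at the end of the response is then ignored by the loss. The sketch below imitates that behaviour with plain lists and hypothetical token IDs; both helper functions are illustrative and not part of the notebook.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default

def labels_masking_pad_id(input_ids, pad_token_id):
    """Mimic masking every position whose ID equals the pad token ID."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

def explicit_loss_mask(prompt_len, response_len, total_len):
    """1 for response tokens (including EOS), 0 for prompt and padding."""
    return [1 if prompt_len <= i < prompt_len + response_len else 0
            for i in range(total_len)]

# Hypothetical IDs: EOS doubles as the pad token.
EOS_ID = 2
# 3 prompt tokens, 3 response tokens (the last one is the real EOS), 2 pad tokens
input_ids = [11, 12, 13, 21, 22, EOS_ID, EOS_ID, EOS_ID]

default_labels = labels_masking_pad_id(input_ids, pad_token_id=EOS_ID)
mask = explicit_loss_mask(prompt_len=3, response_len=3, total_len=len(input_ids))

print(default_labels)  # the real EOS at index 5 is masked along with the padding
print(mask)            # the real EOS at index 5 is kept, the padding is not
```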


In [32]:
trainer = CustomTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    args=transformers.TrainingArguments(
        # max_steps=1,
        num_train_epochs=1,  # stronger models typically need only 1 epoch. If the validation loss is still dropping at the end, you can train for longer.
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        gradient_accumulation_steps=1,
        evaluation_strategy="steps",
        max_grad_norm=1,
        warmup_ratio=0.1,  # has no effect with lr_scheduler_type='constant'; switch to 'constant_with_warmup' if you want a warmup.
        eval_steps=0.2,  # values below 1 are read as a fraction of the total training steps
        # fp16=True,  # use instead of bf16 on GPUs older than Ampere (e.g. a T4).
        bf16=True,  # requires an Ampere or newer GPU (e.g. A6000, A100, H100).
        logging_steps=1,
        output_dir="outputs",
        # optim="paged_adamw_8bit",  # for training in 4-bit (quantized)
        optim="adamw_torch",  # for training in full fp16/bf16 precision
        learning_rate=1e-4,
        lr_scheduler_type='constant',
        hub_private_repo=True,
    ),
    data_collator=data_collator,
    # data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

In [33]:
trainer.train()
torch.cuda.empty_cache()

[34m[1mwandb[0m: Currently logged in as: [33mspidy-in[0m. Use [1m`wandb login --relogin`[0m to force relogin


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 40   | 4.7184        | 1.303684        |
| 80   | 2.2221        | 1.184953        |
| 120  | 0.0229        | 1.154052        |
| 160  | 0.8957        | 1.133719        |
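
The validation rows above fall on steps 40, 80, 120 and 160 because `eval_steps=0.2` is below 1 and is therefore read as a fraction of the total number of training steps. The sketch below reproduces that arithmetic, assuming the Trainer rounds the product up (my understanding of the `transformers` behaviour) and assuming a hypothetical total of 199 steps, since the real total is not shown here. `resolve_steps` is an illustrative helper, not a library function.

```python
import math

def resolve_steps(ratio_or_steps, max_steps):
    """Turn a fractional setting (< 1) into an absolute step count."""
    if ratio_or_steps < 1:
        return math.ceil(max_steps * ratio_or_steps)
    return int(ratio_or_steps)

max_steps = 199  # hypothetical; the real total is not shown in this section
eval_every = resolve_steps(0.2, max_steps)
eval_points = [s for s in range(1, max_steps + 1) if s % eval_every == 0]

print(eval_every)
print(eval_points)
```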


# Example After Fine Tuning

In [34]:
model.config.use_cache = True
model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Qwen2ForCausalLM(
      (model): Qwen2Model(
        (embed_tokens): Embedding(152064, 5120)
        (layers): ModuleList(
          (0-39): 40 x Qwen2DecoderLayer(
            (self_attn): Qwen2FlashAttention2(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=5120, out_features=5120, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=5120, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=5120, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear(
                (base_layer): Linear(in_features=5

In [35]:
# Run validation
for index in range(len(test_dataset)):
    print(f'---Running index {index}---')
    generate(index, "test")

---Running index 0---
Using the test data split.

Prompt:
<|im_start|>user
You have access to the following functions. Use them if required:

[
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the stock price of an array of stocks",
            "parameters": {
                "type": "object",
                "properties": {
                    "names": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        },
                        "description": "An array of stocks"
                    }
                },
                "required": [
                    "names"
                ]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_big_stocks",
            "description": "Get the names of the largest N stocks by market cap",
            "parameters": {
    

In [36]:
# STOP SCRIPT HERE BEFORE PUSHING TO HUB. (comment this out if you wish all cells to run).

# Merge Adapters and Save Model to Hub

In [37]:
# Take the part of base_model after the last "/" (the model name without the organisation)
base_model_name = base_model.split("/")[-1]

fine_tuned_slug = "function-calling-v3"

# Define the save and push paths (adjust 'Trelis' to your HuggingFace organisation)
adapter_model = f"Trelis/{base_model_name}-{fine_tuned_slug}-adapters"
new_model = f"Trelis/{base_model_name}-{fine_tuned_slug}"

print(new_model)
print(adapter_model)

Trelis/Qwen1.5-14B-Chat-function-calling-v3
Trelis/Qwen1.5-14B-Chat-function-calling-v3-adapters


In [69]:
# ## Optionally, set up the new repo as well as branches for gguf, awq and gptq

# from huggingface_hub import HfApi, create_branch, create_repo

# # Initialize the HfApi class
# api = HfApi()

# create_repo(new_model, private=True)

# create_branch(new_model, repo_type="model", branch="gguf")

# create_branch(new_model, repo_type="model", branch="awq")

# create_branch(new_model, repo_type="model", branch="gptq")

In [38]:
# Save the adapters locally
model.save_pretrained(f"{adapter_model}-local", use_auth_token=True)

config.json:   0%|          | 0.00/663 [00:00<?, ?B/s]

In [39]:
# Push the adapters to the hub
model.push_to_hub(adapter_model, token=True, private=True)

adapter_model.safetensors:   0%|          | 0.00/62.4M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Trelis/Qwen1.5-14B-Chat-function-calling-v3-adapters/commit/f7823887553abe82f50df3d6ab3451319325dac2', commit_message='Upload model', commit_description='', oid='f7823887553abe82f50df3d6ab3451319325dac2', pr_url=None, pr_revision=None, pr_num=None)

In [72]:
# # ## FOR FINE-TUNING WITH QLORA ONLY
# # ## Reload the base model unquantized. This loads the full original model, so you may need a high-RAM environment (on Colab this may require a Pro subscription).
# # ## device_map='auto' places the model on a GPU when one is available; use 'cpu' if the model does not fit on the GPU.
# # ## If you trained in full precision (not quantized), you may not need to reload the model; you can merge and unload directly.
# # ## For very large models you may need to restart the kernel before reloading the base model, as there may not be enough free GPU memory.

# # from transformers import AutoModelForCausalLM, PretrainedConfig
# # import torch

# # model = AutoModelForCausalLM.from_pretrained(base_model, device_map='auto', trust_remote_code=True, torch_dtype=torch.float16, cache_dir=cache_dir)

# from peft import PeftModel

# # load the PEFT model with the new adapters
# model = PeftModel.from_pretrained(
#     model,
#     './Trelis/Yi-34B-200K-Llamafied-chat-SFT-function-calling-adapters-v2',
# )

In [40]:
model = model.merge_and_unload() # merge adapters with the base model.
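
Conceptually, merging folds each LoRA update into its frozen base weight, W' = W + (lora_alpha / r) · B·A, so that the adapter matrices are no longer needed at inference. The sketch below checks this on toy 2x2 matrices in plain Python. The values, `lora_alpha` and `r` are made up for illustration (the notebook's actual LoRA hyperparameters are set earlier), and dropout is ignored since it is inactive at inference.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b, scale=1.0):
    """Return a + scale * b element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Toy sizes: a 2x2 base weight with a rank-1 adapter (made-up values).
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight, shape (out, in)
B = [[0.5], [1.0]]             # lora_B, shape (out, r)
A = [[2.0, -1.0]]              # lora_A, shape (r, in)
lora_alpha, r = 2, 1           # hypothetical hyperparameters
scaling = lora_alpha / r

# Merged weight: W' = W + scaling * (B @ A)
W_merged = add(W, matmul(B, A), scale=scaling)

# The merged layer matches the base path plus the adapter path for an input x.
x = [[3.0], [4.0]]             # column vector, shape (in, 1)
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)), scale=scaling)
y_merged = matmul(W_merged, x)
print(W_merged)
print(y_merged == y_adapter)
```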

In [41]:
# Optional: save the merged model locally so you can run inference immediately without downloading it
model.save_pretrained(f"{new_model}-local")

In [42]:
model.push_to_hub(new_model, token=True, max_shard_size="10GB", safe_serialization=True, private=True)

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/8.39G [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/9.96G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Trelis/Qwen1.5-14B-Chat-function-calling-v3/commit/9322ea251ad2b8fb022c20a671b7988f388c627a', commit_message='Upload Qwen2ForCausalLM', commit_description='', oid='9322ea251ad2b8fb022c20a671b7988f388c627a', pr_url=None, pr_revision=None, pr_num=None)

### Copy the base model's README.md and tokenizer.model (needed for GGUF and GPTQ)

In [76]:
# import os
# import requests
# from huggingface_hub import HfApi

# def download_file_from_huggingface(model_id, filename, save_path):
#     url = f"https://huggingface.co/{model_id}/resolve/main/{filename}"
#     r = requests.get(url)
#     if r.status_code != 200:
#         print(f"Failed to download {filename}. HTTP Status Code: {r.status_code}")
#         return False
#     with open(os.path.join(save_path, filename), 'wb') as f:
#         f.write(r.content)
#     return True

# def main():
#     # Files to download and upload
#     files_to_process = ["tokenizer.model", "README.md"]
    
#     # Directory to save the downloaded files
#     save_path = "./models"
#     if not os.path.exists(save_path):
#         os.makedirs(save_path)
    
#     # Initialize HfApi class
#     api = HfApi()

#     # Specify the repository where you want to upload the files
#     repo_id = new_model  # Assuming new_model is in the format "username/repo"

#     for filename in files_to_process:
#         # Download the file
#         success = download_file_from_huggingface(base_model, filename, save_path)
#         if success:
#             print(f"Successfully downloaded {filename}")
#         else:
#             print(f"Failed to download {filename}")
#             continue  # Skip uploading if download failed

#         # File path to upload
#         local_file_path = os.path.join(save_path, filename)

#         # Upload the file
#         api.upload_file(
#             path_or_fileobj=local_file_path,
#             path_in_repo=filename,  # Using filename directly, adjust as needed
#             repo_id=repo_id,
#             repo_type="model",  # Assuming it's a model; can be "dataset" or "space" as well
#         )
#         print(f"Uploaded {filename} to {repo_id}")

# if __name__ == "__main__":
#     main()


## Set up Chat Template [ADVANCED]
This is a more advanced step that lets you customize the chat template for function calling.

Typically, you start by copying the `chat_template` from the base model's `tokenizer_config.json` and pasting it into the cell below. You then customize that template to handle the `function_metadata`, `function_response` and `function_call` roles. You can see one example below, but it won't be correct for all models.

In [43]:
print(tokenizer.chat_template)

{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful assistant<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '
'}}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant
' }}{% endif %}


In [44]:
print(tokenizer.bos_token)
print(tokenizer.eos_token)

None
<|im_end|>


In [45]:
import json

In [46]:
function_metadata = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "This function gets the current weather in a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city, e.g., San Francisco"
                    },
                    "format": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit to use."
                    }
                },
                "required": ["city"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_clothes",
            "description": "This function provides a suggestion of clothes to wear based on the current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "temperature": {
                        "type": "string",
                        "description": "The temperature, e.g., 15 C or 59 F"
                    },
                    "condition": {
                        "type": "string",
                        "description": "The weather condition, e.g., 'Cloudy', 'Sunny', 'Rainy'"
                    }
                },
                "required": ["temperature", "condition"]
            }
        }
    }    
]

In [47]:
# Generate sample prompt from an array of messages and the chat template
messages = [
    {
        "role": "user",
        "content": "How are you?"
    },
    {
        "role": "assistant",
        "content": "Great."
    },
    {
        "role": "user",
        "content": "Where are you from?"
    },
]

formatted_messages = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

print(formatted_messages)

<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
Great.<|im_end|>
<|im_start|>user
Where are you from?<|im_end|>
<|im_start|>assistant



In [48]:
# Comment out later messages to test various stages of generation.

sample_messages = [
    # System messages are not supported by default
    # {
    #     "role": "system",
    #     "content": "you are a helpful assistant"
    # },
    {
        "role": "function_metadata",
        "content": "FUNCTION_METADATA"
    },
    {
        "role": "user",
        "content": "What is the current weather in London?"
    },
    {
        "role": "function_call",
        "content": "{\n    \"name\": \"get_current_weather\",\n    \"arguments\": {\n        \"city\": \"London\"\n    }\n}"
    },
    {
        "role": "function_response",
        "content": "{\n    \"temperature\": \"15 C\",\n    \"condition\": \"Cloudy\"\n}"
    },
    # {
    #     "role": "assistant",
    #     "content": "The current weather in London is Cloudy with a temperature of 15 Celsius."
    # },
    # {
    #     "role": "user",
    #     "content": "That's great. Now say hello."
    # },
    # {
    #     "role": "assistant",
    #     "content": "Hello!"
    # }
]

In [49]:
# Iterate through each message in the list
for message in sample_messages:
    if message['role'] == 'function_metadata':
        # Replace the 'FUNCTION_METADATA' placeholder with the JSON-serialized function_metadata list
        message['content'] = message['content'].replace('FUNCTION_METADATA', json.dumps(function_metadata, indent=4))

In [50]:
print(tokenizer.eos_token)

<|im_end|>


In [51]:
# Template for models without system message support (originally written for Llama 2 / Mistral).
# B_INST and E_INST are defined earlier in the script; adjust them there if they are not correct for your model.
B_INST = B_INST.replace('\n', '[NEW_LINE]')
E_INST = E_INST.replace('\n', '[NEW_LINE]')
FUNC_RESPONSE = f"{B_INST}[FUNCTION_RESPONSE] "  # set this to a reasonable value considering the chat template.
FUNC_CALL = f"{E_INST}[FUNCTION_CALL] "  # set this to a reasonable value considering the chat template.
# Use [NEW_LINE] wherever the final template needs a line break.

formatted_template = f"""
{{% for message in messages %}}
    {{% if message['role'] == 'function_metadata' %}}
        {B_INST}You have access to the following functions. Use them if required:[NEW_LINE][NEW_LINE]{{{{ message['content'] }}}}[NEW_LINE][NEW_LINE]
    {{% elif message['role'] == 'user' and loop.index0 == 1 %}}
        {{{{ message['content'] }}}}
    {{% elif message['role'] == 'assistant' %}}
        {E_INST}{{{{ message['content'] }}}}{{{{ eos_token }}}}
    {{% elif message['role'] == 'function_call' %}}
        {FUNC_CALL}{{{{ message['content'] }}}}{{{{ eos_token }}}}
    {{% elif message['role'] == 'function_response' %}}
        {FUNC_RESPONSE}Here is the response to the function call. If helpful, use it to respond to the user's question:{{{{ message['content'] }}}}
    {{% elif message['role'] == 'user' and loop.index0 != 1 %}}
        {B_INST}{{{{ message['content'] }}}}
    {{% endif %}}
{{% endfor %}}
{{% if add_generation_prompt %}}{E_INST}{{% endif %}}
"""

# Using Python's replace to ensure \n is correctly placed in the final single-line template
tokenizer.chat_template = formatted_template.strip().replace('    ', '').replace('\n', '').replace('[NEW_LINE]', '\n')
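
The flattening step on the last line can be tried on a much smaller template. In the sketch below, `flatten_template` is an illustrative wrapper around the same chain of `replace` calls, and the two-branch template is made up. Indentation and layout line breaks are removed first, and only the `[NEW_LINE]` markers survive as real newlines.

```python
def flatten_template(template: str) -> str:
    """Remove layout whitespace, then turn [NEW_LINE] markers into newlines."""
    return (template.strip()
            .replace('    ', '')            # drop 4-space indentation
            .replace('\n', '')              # drop layout line breaks
            .replace('[NEW_LINE]', '\n'))   # restore the intended line breaks

readable = """
{% for message in messages %}
    {% if message['role'] == 'user' %}
        <|im_start|>user[NEW_LINE]{{ message['content'] }}<|im_end|>[NEW_LINE]
    {% endif %}
{% endfor %}
"""

flat = flatten_template(readable)
print(repr(flat))
```

One caveat of this approach: any literal run of four spaces inside the template text itself is removed as well, so such text needs a different placeholder.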

In [52]:
print(tokenizer.chat_template)

{% for message in messages %}{% if message['role'] == 'function_metadata' %}<|im_start|>user
You have access to the following functions. Use them if required:

{{ message['content'] }}

{% elif message['role'] == 'user' and loop.index0 == 1 %}{{ message['content'] }}{% elif message['role'] == 'assistant' %}<|im_end|>
<|im_start|>assistant
{{ message['content'] }}{{ eos_token }}{% elif message['role'] == 'function_call' %}<|im_end|>
<|im_start|>assistant
[FUNCTION_CALL] {{ message['content'] }}{{ eos_token }}{% elif message['role'] == 'function_response' %}<|im_start|>user
[FUNCTION_RESPONSE] Here is the response to the function call. If helpful, use it to respond to the user's question:{{ message['content'] }}{% elif message['role'] == 'user' and loop.index0 != 1 %}<|im_start|>user
{{ message['content'] }}{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_end|>
<|im_start|>assistant
{% endif %}
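
To sanity-check the branches of the printed template without a tokenizer, its logic can be mirrored in plain Python. This is a sketch only: `render` is a hypothetical helper, and the `B_INST`, `E_INST` and `EOS` strings are inferred from the printed template above rather than read from the script.

```python
B_INST = "<|im_start|>user\n"                   # inferred from the printed template
E_INST = "<|im_end|>\n<|im_start|>assistant\n"  # inferred from the printed template
EOS = "<|im_end|>"

def render(messages, add_generation_prompt=True):
    """Plain-Python mirror of the branches in the Jinja template."""
    parts = []
    for index, message in enumerate(messages):
        role, content = message["role"], message["content"]
        if role == "function_metadata":
            parts.append(f"{B_INST}You have access to the following functions. "
                         f"Use them if required:\n\n{content}\n\n")
        elif role == "user" and index == 1:
            parts.append(content)  # continues the turn opened by the metadata
        elif role == "assistant":
            parts.append(f"{E_INST}{content}{EOS}")
        elif role == "function_call":
            parts.append(f"{E_INST}[FUNCTION_CALL] {content}{EOS}")
        elif role == "function_response":
            parts.append(f"{B_INST}[FUNCTION_RESPONSE] Here is the response to the "
                         f"function call. If helpful, use it to respond to the "
                         f"user's question:{content}")
        elif role == "user":
            parts.append(f"{B_INST}{content}")
    if add_generation_prompt:
        parts.append(E_INST)
    return "".join(parts)

demo = [
    {"role": "function_metadata", "content": "[]"},
    {"role": "user", "content": "What is the current weather in London?"},
]
rendered = render(demo)
print(rendered)
```

Note that the first user message only renders inside the metadata turn when it sits at index 1, directly after a `function_metadata` message, which is the message order this template expects.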


In [54]:
# View the template applied without tokenization
prompt = tokenizer.apply_chat_template(sample_messages, tokenize=False, add_generation_prompt=True)
print(prompt)

<|im_start|>user
You have access to the following functions. Use them if required:

[
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "This function gets the current weather in a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city, e.g., San Francisco"
                    },
                    "format": {
                        "type": "string",
                        "enum": [
                            "celsius",
                            "fahrenheit"
                        ],
                        "description": "The temperature unit to use."
                    }
                },
                "required": [
                    "city"
                ]
            }
        }
    },
    {
        "type": "function",
    

In [55]:
## Test generation

inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

if "token_type_ids" in inputs:
    del inputs["token_type_ids"]

# print(f'model is on: {next(model.parameters()).device}')  # Debug line
# print(f'input_ids is on: {inputs["input_ids"].device}')  # Debug line

output = model.generate(**inputs,
                        max_new_tokens=200,
                        do_sample=False,
                        pad_token_id=tokenizer.pad_token_id,
                        eos_token_id=tokenizer.eos_token_id,
                        temperature=1.0,
                        top_p=1.0,
                        repetition_penalty=1.0
                       )

print()

# Slice off the prompt tokens so that only the model's response is decoded
output_text = tokenizer.decode(output[0, len(inputs.input_ids[0]):], skip_special_tokens=False)
print(output_text)


{
    "name": "get_current_weather",
    "arguments": {
        "city": "London"
    }
}<|im_end|>
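
Before acting on a generated call like the one above, an application would usually parse and validate it. The sketch below is one possible way to do that, using only the standard library; `parse_function_call` and `demo_metadata` are hypothetical names, and the metadata is a trimmed copy of the weather function so that the example runs on its own.

```python
import json

def parse_function_call(output_text, function_metadata, eos_token="<|im_end|>"):
    """Return (name, arguments) if the text is a valid call, else raise ValueError."""
    text = output_text.replace(eos_token, "").strip()
    try:
        call = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"Output is not valid JSON: {err}") from err
    if not isinstance(call, dict):
        raise ValueError("Output is not a JSON object")

    # Index the available functions by name.
    specs = {f["function"]["name"]: f["function"] for f in function_metadata}
    name = call.get("name")
    if name not in specs:
        raise ValueError(f"Unknown function: {name!r}")

    # Check that every required argument is present.
    arguments = call.get("arguments", {})
    required = specs[name]["parameters"].get("required", [])
    missing = [arg for arg in required if arg not in arguments]
    if missing:
        raise ValueError(f"Missing required arguments: {missing}")
    return name, arguments

demo_metadata = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "This function gets the current weather in a given city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

output_text = ('{\n    "name": "get_current_weather",\n    "arguments": {\n'
               '        "city": "London"\n    }\n}<|im_end|>')
name, arguments = parse_function_call(output_text, demo_metadata)
print(name, arguments)
```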


## Push Tokenizer

In [56]:
# Optional: save the tokenizer locally, next to the merged model, so you can run inference immediately without downloading it
tokenizer.save_pretrained(f"{new_model}-local")

('Trelis/Qwen1.5-14B-Chat-function-calling-v3-local/tokenizer_config.json',
 'Trelis/Qwen1.5-14B-Chat-function-calling-v3-local/special_tokens_map.json',
 'Trelis/Qwen1.5-14B-Chat-function-calling-v3-local/vocab.json',
 'Trelis/Qwen1.5-14B-Chat-function-calling-v3-local/merges.txt',
 'Trelis/Qwen1.5-14B-Chat-function-calling-v3-local/added_tokens.json',
 'Trelis/Qwen1.5-14B-Chat-function-calling-v3-local/tokenizer.json')

In [57]:
# Push the tokenizer
tokenizer.push_to_hub(new_model, token=True)

## RELOAD IF NEEDED (NOT RECOMMENDED IF tokenizer.chat_template was updated).
# from transformers import AutoTokenizer
# # Reload the tokenizer so that you don't end up with an off-size tokenizer with added pad tokens.
# tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

README.md:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Trelis/Qwen1.5-14B-Chat-function-calling-v3/commit/822c37b2f5d036ae7589cb91bdcce83d44ff1d68', commit_message='Upload tokenizer', commit_description='', oid='822c37b2f5d036ae7589cb91bdcce83d44ff1d68', pr_url=None, pr_revision=None, pr_num=None)