# LoRA Fine-Tuning Tiny LLMs as Expert Agents

Tiny LLMs have never been ideal for agentic workflows. They lack the ability to reliably generate function calls; however, this isn't due to any real limitation on LLM size. Instead, it's due to the LLM providers' lack of focus on training data that provides quality examples of function calling.

Because of that, we can fine-tune expert agents from tiny LLMs such as the 1B parameter Llama 3.2 and get incredible results. In this example, we do just that: we take llama-3.2-1b-instruct, Salesforce's xLAM dataset, and Low-Rank Adaptation (LoRA) fine-tuning via NVIDIA's NeMo Microservices to create our own tiny LLM agent.

In [None]:
%pip install -qU \
    datasets==3.6.0 \
    graphai-lib==0.0.5 \
    openai


## Data Preparation

To train our LLM for function-calling we need a dataset containing function calls. Salesforce released exactly that with the Salesforce/xlam-function-calling-60k dataset. This dataset was used by Salesforce to train their family of Large Action Models (LAMs). These LAMs were designed specifically for function calling, reasoning, and planning — all essential abilities for agents.

We can download the dataset from Hugging Face. To do so we need an account, as we must agree to the T&Cs to use this dataset. After you have agreed to the T&Cs, authenticate by grabbing a read-only token from the hub and entering it below:

In [None]:
from getpass import getpass

token = getpass("Enter your Hugging Face token: ")


In [None]:
# Download the dataset
from datasets import load_dataset

data = load_dataset(
    "Salesforce/xlam-function-calling-60k",
    split="train",
    token=token
)


Each row of this dataset contains a user query, a set of tools that an LLM has access to, and the correct tool call that should be executed.

In [None]:
data[0]

This is not the format we need for training on NeMo. Instead, we need the standardized OpenAI format: a list of messages (with roles of user or assistant) and a tools JSON containing the list of function schemas that define the tools available to our LLM. It looks like this:
```json
{
    "messages": [
        {"role": "user", "content": "<user query>"},
        {"role": "assistant", "content": "", "tool_calls": [
            {
                "id": "call_xyz", "type": "function",
                "function": {
                    "name": "<tool name>", "arguments": {<input args>}
                }
            },
            ... <other calls if running parallel tool calling>
        ]}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "<tool name>",
                "description": "<natural language description of the tool>",
                "parameters": {  # this defines all possible args and their types
                    "type": "object",
                    "properties": {
                        "<param name>": {
                            "type": "<data type, like `string`>",
                            "description": "<field-specific description>",
                            "default": "<optional field, use for default values>"
                        }
                    },
                    "required": []  # list required params, ie params without 'default'
                }
            }
        }
    ]
}
```
Our tools are simply a list of function schemas, i.e. descriptions of functions that we would typically define and that an LLM will be able to call. For example, we could define a multiply function:

In [None]:
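import builtins

def multiply(a: float, b: float, round: bool = False) -> float:
    """
    Multiplies two numbers together. Rounding is optional.
    """
    product = a * b
    # the `round` flag shadows the built-in, so we call it via `builtins`
    return float(builtins.round(product)) if round else product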
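For illustration, the OpenAI-format schema for this `multiply` function might look like the following. This is a hand-written sketch: the parameter descriptions are our own wording (assumptions, not taken from any dataset), and `multiply_schema` is just an illustrative variable name.

```python
import json

# Hand-written OpenAI-style schema for the `multiply` function.
# The parameter descriptions below are illustrative assumptions.
multiply_schema = {
    "type": "function",
    "function": {
        "name": "multiply",
        "description": "Multiplies two numbers together. Rounding is optional.",
        "parameters": {
            "type": "object",
            "properties": {
                "a": {"type": "number", "description": "First number to multiply."},
                "b": {"type": "number", "description": "Second number to multiply."},
                "round": {
                    "type": "boolean",
                    "description": "Whether to round the result.",
                    "default": False,
                },
            },
            # params without a default value are required
            "required": ["a", "b"],
        },
    },
}

print(json.dumps(multiply_schema, indent=2))
```

Note that Python types map onto JSON schema types here: `float` becomes `number` and `bool` becomes `boolean`.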


The xLAM formatted function schemas are not aligned with the OpenAI format. Note the lack of 'type': 'function', different parameters structure, and different types (xLAM uses Python types like str or List[Any], which would become string or array respectively when using the OpenAI format).


In [None]:
import json

json.loads(data[0]["tools"])


To transform the xLAM data format into the OpenAI format we will work through a few transformation steps. First we handle type transformation from Python format to OpenAI format, summarized here:

| Python format | OpenAI format | Explanation |
|---------------|---------------|-------------|
| "str" | "string" | Maps primitive Python type to OpenAI type |
| "int" | "integer" | Same as above |
| "float" | "number" | Same as above |
| "bool" | "boolean" | Same as above |
| "list" or "List" | "array" | Maps list-like types to array |
| "dict" or "Dict" | "object" | Maps dict-like types to object |
| "set" or "Set" | "array" | Sets are arrays in JSON schema |
| "Callable[[int], str]" | "string" | Callables treated as strings |
| "Tuple[int, str]" | "array" | Tuples treated as arrays |
| "List[str], default='foo'" | "array" | Strips default, then normalizes |
| "default='foo'" | "string" | If only default value, assume string |
| "str, optional" | "string" | Removes the ", optional" suffix |
| "UnknownType" | "string" | Defaults unknown types to string |

We apply these transformations with:

In [None]:
def normalize_type(param_type: str) -> str:
    # keep only the text before the first comma; this drops suffixes
    # such as ", optional" or ", default='foo'"
    param_type = param_type.strip().split(",")[0].strip()

    # a bare default value carries no type information, so assume string
    if param_type.startswith("default="):
        return "string"

    # generic containers, e.g. Tuple[int, str], List[str], or Set[int]
    if param_type.startswith(("Tuple", "List[", "Set")):
        return "array"
    # callables have no JSON schema equivalent, so treat them as strings
    if param_type.startswith("Callable"):
        return "string"

    type_mapping = {
        "str": "string",
        "int": "integer",
        "float": "number",
        "bool": "boolean",
        "list": "array",
        "dict": "object",
        "List": "array",
        "Dict": "object",
        "set": "array",
        "Set": "array",
    }

    # unknown types default to string
    return type_mapping.get(param_type, "string")

Now we restructure the tool/function schemas from xLAM to OpenAI format:

In [None]:
from typing import Any
import json

def xlam_tools_to_openai(
    tools: str | list[dict[str, Any]]
) -> list[dict[str, Any]]:
    # if input is string we assume it is json so parse it
    if isinstance(tools, str):
        try:
            tools = json.loads(tools)
        except json.JSONDecodeError:
            # if error, return empty list
            return []

    # check we have a list, if not return empty list
    if not isinstance(tools, list):
        return []

    openai_tools = []
    
    for tool in tools:
        # check tool is dictionary with parameters dict inside
        if not isinstance(tool, dict) or not isinstance(tool.get("parameters"), dict):
            # if not, we don't want it
            continue

        properties = {}

        for name, info in tool["parameters"].items():
            # skip if param info isn't a dict
            if not isinstance(info, dict):
                continue

            # convert from python -> openai types
            param = {
                "description": info.get("description", ""),  # default to empty string
                "type": normalize_type(info.get("type", "")),
            }

            # include default if it's not None or empty string
            default = info.get("default")
            if default not in (None, ""):
                param["default"] = default

            properties[name] = param

        # build new function format
        openai_tools.append({
            "type": "function",
            "function": {
                "name": tool.get("name", ""),
                "description": tool.get("description", ""),
                "parameters": {
                    "type": "object",
                    "properties": properties
                },
            },
        })

    return openai_tools

Using this, our xLAM tool schema:

In [None]:
json.loads(data[0]["tools"])

Is transformed into this OpenAI tool schema:

In [None]:
xlam_tools_to_openai(json.loads(data[0]["tools"]))

That will handle our transformation for the function schemas, but we also need to reformat our dataset into the correct OpenAI messages format. We do that like so:

In [None]:
def xlam_tool_calls_to_openai(tool_calls: list[dict]) -> list[dict] | None:
    """Convert xLAM tool calls to OpenAI tool calls."""
    # not all models support parallel tool calling, so we
    # just look at records with a single tool call
    if len(tool_calls) == 1:
        return [
            {
                "type": "function",
                "function": tool_calls[0]
            }
        ]
    else:
        return None

Here is the original xLAM format:

In [None]:
json.loads(data[1]["answers"])

And the OpenAI version:

In [None]:
xlam_tool_calls_to_openai(json.loads(data[1]["answers"]))

We then build both the user and assistant messages for each record in our training data:

In [None]:
def xlam_messages_to_openai(data: dict) -> list[dict]:
    """Convert xLAM data format to OpenAI format."""
    messages = [
        {"role": "user", "content": data["query"]},
        {
            "role": "assistant", "content": "",
            "tool_calls": xlam_tool_calls_to_openai(json.loads(data["answers"]))
        }
    ]
    return messages

This turns the xLAM query and answers into a messages list containing a user message followed by the assistant message:

In [None]:
data[1]

In [None]:
xlam_messages_to_openai(data[1])

In [None]:
{
    "messages": xlam_messages_to_openai(data[1]),
    "tools": xlam_tools_to_openai(json.loads(data[1]["tools"]))
}

In [None]:
from tqdm.auto import tqdm

openai_data = []

for row in tqdm(data):
    messages = xlam_messages_to_openai(row)
    tools = xlam_tools_to_openai(json.loads(row["tools"]))
    if messages is None or messages[1]["tool_calls"] is None or tools is None:
        # invalid data so we skip
        continue
    else:
        openai_data.append({
            "messages": messages,
            "tools": tools
        })

In [None]:
openai_data[0]
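Before training, it can be worth sanity-checking the converted records. The sketch below assumes the record layout built above (one user message, one assistant message with a single tool call, and a list of tool schemas). `is_valid_record` and the sample record are hypothetical, written for illustration only.

```python
from typing import Any


def is_valid_record(record: dict[str, Any]) -> bool:
    """Check that a record has a user message, an assistant message with
    exactly one tool call, and that the called tool is declared in `tools`."""
    messages = record.get("messages") or []
    tools = record.get("tools") or []
    if len(messages) != 2:
        return False
    user, assistant = messages
    if user.get("role") != "user" or assistant.get("role") != "assistant":
        return False
    tool_calls = assistant.get("tool_calls") or []
    if len(tool_calls) != 1:
        return False
    called = tool_calls[0].get("function", {}).get("name")
    declared = {tool.get("function", {}).get("name") for tool in tools}
    return called is not None and called in declared


# made-up record in the same shape as `openai_data` entries
sample = {
    "messages": [
        {"role": "user", "content": "What is 3 times 4?"},
        {
            "role": "assistant",
            "content": "",
            "tool_calls": [
                {
                    "type": "function",
                    "function": {"name": "multiply", "arguments": {"a": 3, "b": 4}},
                }
            ],
        },
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "multiply",
                "description": "Multiplies two numbers.",
                "parameters": {"type": "object", "properties": {}},
            },
        }
    ],
}

print(is_valid_record(sample))
```

In the notebook, something like `sum(is_valid_record(r) for r in openai_data)` would count how many records pass the check.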

## Train-Validation-Test Split

We will split our dataset into a train-validation-test split, with 70% for training, 15% for validation, and 15% for testing.

In [None]:
import random

random.shuffle(openai_data)

train_split_index = int(len(openai_data) * 0.7)
val_split_index = int(len(openai_data) * 0.85)
# create split datasets
train_data = openai_data[:train_split_index]
val_data = openai_data[train_split_index:val_split_index]
test_data = openai_data[val_split_index:]

print(f"Train data: {len(train_data)}")
print(f"Val data: {len(val_data)}")
print(f"Test data: {len(test_data)}")
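Note that `random.shuffle` without a seed produces a different split on every run. If reproducibility matters, one option is a small helper like the sketch below. `split_dataset` and its default seed are hypothetical choices for illustration, not part of the original pipeline.

```python
import random
from typing import Sequence


def split_dataset(
    records: Sequence,
    train_frac: float = 0.7,
    val_frac: float = 0.15,
    seed: int = 42,
) -> tuple[list, list, list]:
    """Shuffle a copy of `records` with a fixed seed and split it into
    train, validation, and test lists (test gets the remainder)."""
    shuffled = list(records)
    # a dedicated Random instance avoids touching the global random state
    random.Random(seed).shuffle(shuffled)
    train_end = int(len(shuffled) * train_frac)
    val_end = int(len(shuffled) * (train_frac + val_frac))
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


# demo on dummy data; in the notebook this would be `openai_data`
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Because the seed is fixed, calling the helper twice on the same input returns the same three lists.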

We would typically use test_data in evaluation, which requires a slightly different format to the OpenAI format we have created already. For this format we must shift the tool_calls data from the assistant message into a separate tool_calls key:

In [None]:
test_data = [
    {
        "messages": [x["messages"][0]],
        "tools": x["tools"],
        "tool_calls": x["messages"][1]["tool_calls"]
    } for x in test_data
]

test_data[0]

In [None]:
# save training data
with open("training.jsonl", "w") as fp:
    for row in train_data:
        fp.write(json.dumps(row) + "\n")

# save validation data
with open("validation.jsonl", "w") as fp:
    for row in val_data:
        fp.write(json.dumps(row) + "\n")

# save test data
with open("test.jsonl", "w") as fp:
    for row in test_data:
        fp.write(json.dumps(row) + "\n")
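To confirm the files were written correctly, we could read one back and compare it with the in-memory data. The sketch below uses a throwaway file and a made-up record so that it runs on its own; `read_jsonl` is a hypothetical helper.

```python
import json
import os
import tempfile


def read_jsonl(path: str) -> list[dict]:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as fp:
        return [json.loads(line) for line in fp if line.strip()]


# round-trip demo on a temporary file with a made-up record
rows = [{"messages": [{"role": "user", "content": "hi"}], "tools": []}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    with open(path, "w", encoding="utf-8") as fp:
        for row in rows:
            fp.write(json.dumps(row) + "\n")
    loaded = read_jsonl(path)

print(loaded == rows)
```

In the notebook, the equivalent check would be `read_jsonl("training.jsonl") == train_data`.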

## Data and Models Prep

Before we can train a model with the NeMo Customizer service, we need to push our data to the NeMo Data Store and push our base model to the NeMo Entity Store. Before doing either of those things we need to ensure our NeMo microservices are deployed and running. To do that, we list all services running within our demo namespace whose names match the wildcard nemo-*, using kubectl like so:

In [None]:
NAMESPACE = "demo"

In [None]:
!kubectl get service -n {NAMESPACE} | grep '^nemo-'

We'll need to grab a few of the ClusterIP and port values from above. These values do change, so make sure you update them with your own deployment endpoints in the format "http://<cluster-ip>:<port>". For example, if your nemo-data-store fields are:

| NAME | TYPE | CLUSTER-IP | EXTERNAL-IP | PORT(S) | AGE |
|------|------|------------|-------------|---------|-----|
| nemo-data-store | ClusterIP | 10.111.16.88 | `<none>` | 3000/TCP | 2m2s |


You will need to enter:

```python
DATA_STORE = "http://10.111.16.88:3000"
```

In [None]:
CUSTOMIZER = "http://10.107.42.136:8000"            # service/nemo-customizer
DATA_STORE = "http://10.102.137.118:3000"           # service/nemo-data-store
DEPLOYMENT_MANAGER = "http://10.102.234.211:8000"   # service/deployment-management
ENTITY_STORE = "http://10.111.17.85:8000"           # service/nemo-entity-store
NIM_URL = "http://10.102.70.221:8000"               # service/nemo-nim-proxy - 8000 for HTTP and 8001 for gRPC

DATASET_NAME = "xlam-ft-dataset"
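Rather than copying these values by hand, the endpoints could be derived from `kubectl get service -o json`. The following is a rough sketch under a few assumptions: each service exposes its HTTP port first in `spec.ports`, and the helper names `service_urls` and `load_service_list` are hypothetical. The demo runs on a made-up service list and does not call kubectl.

```python
import json
import subprocess


def service_urls(service_list: dict, prefix: str = "nemo-") -> dict[str, str]:
    """Map service name -> "http://<cluster-ip>:<port>" for services whose
    name starts with `prefix`, using the first listed port."""
    urls = {}
    for item in service_list.get("items", []):
        name = item["metadata"]["name"]
        spec = item["spec"]
        if not name.startswith(prefix) or not spec.get("ports"):
            continue
        urls[name] = f"http://{spec['clusterIP']}:{spec['ports'][0]['port']}"
    return urls


def load_service_list(namespace: str) -> dict:
    """Fetch the service list as JSON via kubectl (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "service", "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


# made-up service list in the shape returned by `kubectl ... -o json`
sample = {
    "items": [
        {
            "metadata": {"name": "nemo-data-store"},
            "spec": {"clusterIP": "10.111.16.88", "ports": [{"port": 3000}]},
        },
        {
            "metadata": {"name": "other-service"},
            "spec": {"clusterIP": "10.0.0.1", "ports": [{"port": 80}]},
        },
    ]
}

print(service_urls(sample))
```

With cluster access, `service_urls(load_service_list(NAMESPACE))` would return the real endpoints.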


API docs for all NeMo Microservice APIs can be found in NVIDIA's NeMo Microservices documentation.

We first create a namespace where all of the resources and artifacts created during the tutorial will live.

In [None]:
import requests

# create namespace in entity store
res1 = requests.post(f"{ENTITY_STORE}/v1/namespaces", json={"id": NAMESPACE})
# create namespace in data store
res2 = requests.post(
    f"{DATA_STORE}/v1/datastore/namespaces",
    data={"namespace": NAMESPACE}
)

res1, res2


In [None]:
res2.json()

Now let's get our data uploaded to our microservices. We first create our data store repo. We use the HfApi client to do this, but it's worth noting that we're not using Hugging Face Hub at all here. We're just piggybacking off their SDK.

In [None]:
from huggingface_hub import HfApi

repo_id = f"{NAMESPACE}/{DATASET_NAME}"

hf_api = HfApi(endpoint=f"{DATA_STORE}/v1/hf", token="")


In [None]:
from huggingface_hub.errors import HfHubHTTPError

try:
    # `repo_exists` returns a bool, but some backends raise on a missing repo
    exists = hf_api.repo_exists(repo_id=repo_id, repo_type="dataset")
except HfHubHTTPError:
    exists = False

if not exists:
    # the repo doesn't exist, so we create it
    print(f"Creating `{repo_id}` dataset")
    hf_api.create_repo(repo_id=repo_id, repo_type="dataset")

⚠️ If needed, you can delete dataset repos with hf_api.delete_repo(repo_id=repo_id, repo_type="dataset").

Next we upload our training, validation, and test datasets:

In [None]:
hf_api.upload_file(
    path_or_fileobj="training.jsonl",
    path_in_repo="training.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
)

hf_api.upload_file(
    path_or_fileobj="validation.jsonl",
    path_in_repo="validation.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
)

hf_api.upload_file(
    path_or_fileobj="test.jsonl",
    path_in_repo="test.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
)

Now we register the dataset with NeMo Entity Store.

In [None]:
import requests

res = requests.post(
    url=f"{ENTITY_STORE}/v1/datasets",
    json={
        "name": DATASET_NAME,
        "namespace": NAMESPACE,
        "description": "Tool calling xLAM dataset",
        "project": NAMESPACE,
        "files_url": f"/datasets/{NAMESPACE}/{DATASET_NAME}"
    }
)

res.json()

⚠️ If you need to delete records in the entity store you can use DELETE {ENTITY_STORE}/v1/datasets/{NAMESPACE}/{DATASET_NAME}.

Let's double check the dataset exists:

In [None]:
res = requests.get(f"{ENTITY_STORE}/v1/datasets/{NAMESPACE}/{DATASET_NAME}")

res.json()

Now we're ready to kick off training. First, we choose a model to train. We can see a list of available models by hitting the GET {CUSTOMIZER}/v1/customization/configs endpoint.

In [None]:
res = requests.get(f"{CUSTOMIZER}/v1/customization/configs")

res.json()

We should see any models that were defined in values.yaml inside customizer.customizerConfig.models. In this example, we should see meta/llama-3.2-1b-instruct.

Now we hit POST /v1/customization/jobs to start training. Before doing so, there are a few parameters that we should be aware of, most of which are covered in detail in the customization docs.

Sequence packing can be turned on or off. It is an optimization technique, but it is only compatible with some Meta Llama models.
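To give a feel for what sequence packing does, the sketch below greedily groups short examples into packs that fit a maximum token length, so fewer padded sequences are needed per epoch. This is a generic first-fit-decreasing illustration with made-up lengths; how NeMo implements packing internally may differ, and `pack_sequences` is a hypothetical helper.

```python
def pack_sequences(lengths: list[int], max_len: int) -> list[list[int]]:
    """Group example indices into packs whose total length is <= max_len,
    using first-fit decreasing (longest examples are placed first)."""
    packs: list[dict] = []  # each pack tracks its used length and member indices
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        if lengths[i] > max_len:
            raise ValueError(f"example {i} is longer than max_len")
        for pack in packs:
            if pack["used"] + lengths[i] <= max_len:
                pack["indices"].append(i)
                pack["used"] += lengths[i]
                break
        else:
            # no existing pack has room, so start a new one
            packs.append({"used": lengths[i], "indices": [i]})
    return [pack["indices"] for pack in packs]


# made-up token counts for five training examples
lengths = [900, 300, 700, 100, 50]
packed = pack_sequences(lengths, max_len=1024)
print(packed)  # → [[0, 3], [2, 1], [4]]
```

Here five examples fit into three packed sequences, which is where the training speed-up comes from.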

In [None]:
BASE_MODEL = "meta/llama-3.2-1b-instruct"

# add weights and biases API key for updates during training
WANDB_API_KEY = getpass("Enter your W&B API key: ")
headers = {"wandb-api-key": WANDB_API_KEY} if WANDB_API_KEY else None

training_params = {
    "name": "llama-3.2-1b-xlam-ft",
    "output_model": f"{NAMESPACE}/llama-3.2-1b-xlam-run1",
    "config": BASE_MODEL,
    "dataset": {"name": DATASET_NAME, "namespace": NAMESPACE},
    "hyperparameters": {
        "training_type": "sft",
        "finetuning_type": "lora",
        "epochs": 2,
        "batch_size": 32,
        "learning_rate": 5e-5,
        "lora": {
            "adapter_dim": 32,
            "adapter_dropout": 0.1,
        },
        "sequence_packing_enabled": True,
    }
}

res = requests.post(
    f"{CUSTOMIZER}/v1/customization/jobs",
    json=training_params,
    headers=headers,
)

customization = res.json()

customization 

⏳ If you see a message like {'detail': 'Model <model-name> is downloading to cache, try again later'} the custom model is still being downloaded to the model cache. Assuming everything is running correctly, all you need to do in this scenario is wait. You can check for issues in the deployment, or simply check the current download status, like so:

In [None]:
!kubectl get pod -n {NAMESPACE} | grep '^model-downloader-'

⚠️ If you see a message like {'detail': '<model-name> is not configured for training'} you need to configure available models via the values.yaml file created in our deployment.

If our customization job is running we should see a large response detailing the training parameters and most importantly our customization job ID. We can use this ID to check in on the job status like so:

In [None]:
job_id = customization["id"]

res = requests.get(f"{CUSTOMIZER}/v1/customization/jobs/{job_id}/status")

res.json()

We can check the job is running in our cluster too. We should first see an entity-handler pod, which should complete quickly, and then a training-job pod should appear:

In [None]:
!kubectl get pod -n {NAMESPACE} | awk 'NR==1 || /^cust-/'

We can check logs with:

In [None]:
!kubectl logs cust-xyz-training-job-worker-0 -n {NAMESPACE}

Most usefully, if you set your W&B API key earlier, you can follow the training run in your W&B dashboard. We can continue checking the job status until it completes (for the 1B parameter model on an H100 this can take ~50 minutes).

In [None]:
res = requests.get(f"{CUSTOMIZER}/v1/customization/jobs/{job_id}/status")
res.json()
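Instead of re-running the status cell by hand, we could poll until the job reaches a final state. The sketch below is written against an injected `get_status` callable so it can run without a cluster. The `status` key and the terminal state names (`completed`, `failed`, `cancelled`) are assumptions about the response shape, and `wait_for_job` is a hypothetical helper.

```python
import time
from typing import Callable

# assumed names of final job states; adjust to match your responses
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_job(
    get_status: Callable[[], dict],
    poll_seconds: float = 60,
    timeout_seconds: float = 2 * 60 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call `get_status` repeatedly until the job reaches a terminal state,
    or raise TimeoutError once `timeout_seconds` of waiting has elapsed."""
    waited = 0.0
    while True:
        status = get_status()
        if status.get("status") in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status.get('status')!r} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# demo with fake responses and a no-op sleep, so it finishes instantly
fake_responses = iter([{"status": "running"}, {"status": "running"}, {"status": "completed"}])
final = wait_for_job(lambda: next(fake_responses), poll_seconds=1, sleep=lambda _: None)
print(final["status"])
```

In the notebook, `get_status` would be something like `lambda: requests.get(f"{CUSTOMIZER}/v1/customization/jobs/{job_id}/status").json()`.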

## Running our New Model

Once the training job is complete, the custom model we have built should be available to us via the NeMo Entity Store. To serve it, we deploy the base model as a NIM through the deployment management service:

In [None]:
import requests

res = requests.post(
    f"{DEPLOYMENT_MANAGER}/v1/deployment/model-deployments",
    json={
        "name": "llama-3.2-1b",
        "namespace": "meta",
        "config": {
            "model": "meta/llama-3.2-1b-instruct",
            "nim_deployment": {
                "image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct",  # NGC catalog image URL
                "image_tag": "1.8.5",
                "pvc_size": "25Gi",
                "gpu": 1,
                "additional_envs": {
                    "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
                }
            }
        }
    }
)
res.json()

We can check whether the model-deployment job has completed:

In [None]:
!kubectl get all -n {NAMESPACE}

After the base model has been registered, our NIM endpoint will detect it and automatically load all compatible models based on what we have inside our NeMo Entity Store. We can confirm that has happened with:

In [None]:
res = requests.get(f"{NIM_URL}/v1/models")

res.json()

With training complete and our model usable by our NIM endpoint, we can jump into testing it.

### Using our Model

First, we set up our NIM client using the OpenAI client, swapping the base_url from OpenAI to our NIM proxy server:

In [None]:
from openai import OpenAI

nim = OpenAI(
    base_url=f"{NIM_URL}/v1",
    api_key="None",
)

Now we use our NIM endpoint as we would the typical chat completions endpoint of OpenAI.

In [None]:
test_data[0]["messages"]

In [None]:
out = nim.chat.completions.create(
    model="demo/llama-3.2-1b-xlam-run1@cust-xyz",
    messages=test_data[0]["messages"],
    tools=test_data[0]["tools"],
    tool_choice="auto",
    temperature=0.1,
    top_p=0.7,
    max_tokens=512,
    stream=False,
)

out.choices[0].message.tool_calls
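To judge whether a prediction is correct, we can compare it against the expected call stored under `tool_calls` in the test record. The sketch below works on plain dicts and treats `arguments` as either a dict or a JSON string, since API responses typically return arguments as a string. `tool_call_matches`, `parse_arguments`, and the example calls are hypothetical.

```python
import json
from typing import Any


def parse_arguments(arguments: Any) -> dict:
    """Return tool-call arguments as a dict, decoding JSON strings if needed."""
    if isinstance(arguments, str):
        return json.loads(arguments)
    return arguments or {}


def tool_call_matches(predicted: dict, expected: dict) -> bool:
    """True when both calls name the same function with equal arguments."""
    pred_fn, exp_fn = predicted["function"], expected["function"]
    return (
        pred_fn["name"] == exp_fn["name"]
        and parse_arguments(pred_fn.get("arguments")) == parse_arguments(exp_fn.get("arguments"))
    )


# made-up expected call (dict arguments) and predicted call (JSON string arguments)
expected = {
    "type": "function",
    "function": {"name": "get_weather", "arguments": {"city": "Paris"}},
}
predicted = {
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
}

print(tool_call_matches(predicted, expected))
```

With the OpenAI client, a predicted call can likely be converted to a plain dict with `out.choices[0].message.tool_calls[0].model_dump()` before comparing it to `test_data[0]["tool_calls"][0]`.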

We stream like so:

In [None]:
stream = nim.chat.completions.create(
    model="demo/llama-3.2-1b-xlam-run1@cust-xyz",
    messages=test_data[0]["messages"],
    tools=test_data[0]["tools"],
    tool_choice="auto",
    temperature=0.1,
    top_p=0.7,
    max_tokens=512,
    stream=True,
)

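for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # plain text tokens (usually empty when the model emits a tool call)
    if delta.content:
        print(delta.content, end="", flush=True)
    # tool calls are streamed as fragments: the name first, then the arguments
    for tool_call in delta.tool_calls or []:
        if tool_call.function is None:
            continue
        if tool_call.function.name:
            print(tool_call.function.name, end=" ", flush=True)
        if tool_call.function.arguments:
            print(tool_call.function.arguments, end="", flush=True)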
