# Tool and Function Calling

This guide demonstrates how to use SGLang’s **Tool Calling** functionality.

## OpenAI Compatible API

### Launching the Server

In [1]:
from openai import OpenAI
import json
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)

server_process = execute_shell_command(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --tool-call-parser llama3 --port 30333 --host 0.0.0.0"  # llama3
)
wait_for_server("http://localhost:30333")

[2025-02-12 10:58:45] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30333, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, stream_output=False, random_seed=286015798, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable

[2025-02-12 10:59:05 TP0] Init torch distributed begin.
[2025-02-12 10:59:05 TP0] Load weight begin. avail mem=78.81 GB


[2025-02-12 10:59:06 TP0] Using model weights format ['*.safetensors']


[2025-02-12 10:59:10 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.70 GB
[2025-02-12 10:59:10 TP0] KV Cache is allocated. K size: 27.12 GB, V size: 27.12 GB.
[2025-02-12 10:59:10 TP0] Memory pool end. avail mem=8.45 GB


[2025-02-12 10:59:10 TP0] Capture cuda graph begin. This can take up to several minutes.
[2025-02-12 10:59:15 TP0] Capture cuda graph end. Time elapsed: 5.62 s


[2025-02-12 10:59:16 TP0] max_total_num_tokens=444372, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-02-12 10:59:16] INFO:     Started server process [4073844]
[2025-02-12 10:59:16] INFO:     Waiting for application startup.
[2025-02-12 10:59:16] INFO:     Application startup complete.
[2025-02-12 10:59:16] INFO:     Uvicorn running on http://0.0.0.0:30333 (Press CTRL+C to quit)


[2025-02-12 10:59:16] INFO:     127.0.0.1:60394 - "GET /v1/models HTTP/1.1" 200 OK


[2025-02-12 10:59:17] INFO:     127.0.0.1:60410 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-12 10:59:17 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-02-12 10:59:20] INFO:     127.0.0.1:60418 - "POST /generate HTTP/1.1" 200 OK
[2025-02-12 10:59:20] The server is fired up and ready to roll!


Note that `--tool-call-parser` selects the parser used to extract tool calls from the model's responses. Currently supported parsers include:

- llama3: Llama 3.1 / 3.2 (e.g. meta-llama/Llama-3.1-8B-Instruct, meta-llama/Llama-3.2-1B-Instruct).
- mistral: Mistral (e.g. mistralai/Mistral-7B-Instruct-v0.3, mistralai/Mistral-Nemo-Instruct-2407, mistralai/Mistral-7B-v0.3).
- qwen25: Qwen 2.5 (e.g. Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-7B-Instruct).
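
To make the mapping above concrete, here is a minimal sketch that picks a parser name from a model path and assembles the launch command used earlier. The helpers `pick_tool_call_parser` and `build_launch_command`, and the substring rules they rely on, are hypothetical illustrations and not part of SGLang; only the parser names come from the list above.

```python
import shlex

# Hypothetical substring rules derived from the list above; not an SGLang API.
PARSER_RULES = [
    ("llama-3", "llama3"),
    ("mistral", "mistral"),
    ("qwen2.5", "qwen25"),
]


def pick_tool_call_parser(model_path: str) -> str:
    """Return the parser name whose substring rule matches the model path."""
    lowered = model_path.lower()
    for needle, parser in PARSER_RULES:
        if needle in lowered:
            return parser
    raise ValueError(f"No known tool call parser for {model_path!r}")


def build_launch_command(model_path: str, port: int = 30333) -> str:
    """Assemble the same kind of launch command used earlier in this guide."""
    args = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tool-call-parser", pick_tool_call_parser(model_path),
        "--port", str(port),
        "--host", "0.0.0.0",
    ]
    return shlex.join(args)


print(build_launch_command("meta-llama/Meta-Llama-3.1-8B-Instruct"))
```

Failing loudly on an unrecognized model path is a deliberate choice here: a mismatched parser would most likely leave tool calls undetected rather than raise an error.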

### Define Tools for Function Call
Below is a Python snippet that shows how to define a tool as a dictionary. The dictionary includes the tool name, a description, and the parameters, each defined as a property with its own type and description.

In [2]:
# Define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city to find the weather for, e.g. 'San Francisco'",
                    },
                    "state": {
                        "type": "string",
                        "description": "the two-letter abbreviation for the state that the city is"
                        " in, e.g. 'CA' which would mean 'California'",
                    },
                    "unit": {
                        "type": "string",
                        "description": "The unit to fetch the temperature in",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["city", "state", "unit"],
            },
        },
    }
]
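
Because the `parameters` entry is a JSON-Schema-style object, it can also be used to sanity-check the arguments a model returns before a tool is invoked. The sketch below is a hand-rolled check that covers only `required` and `enum`; it is not a full JSON Schema validator, and `check_arguments` is a hypothetical helper name.

```python
import json

# A trimmed copy of the tool definition above, repeated so the sketch runs alone.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "state": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city", "state", "unit"],
        },
    },
}


def check_arguments(tool: dict, arguments_json: str) -> list:
    """Return a list of problems found; an empty list means the call looks valid."""
    schema = tool["function"]["parameters"]
    args = json.loads(arguments_json)
    problems = []
    # Every required key must be present.
    for key in schema.get("required", []):
        if key not in args:
            problems.append(f"missing required argument: {key}")
    # Every supplied key must be declared and respect its enum, if any.
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected argument: {key}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{key} must be one of {spec['enum']}")
    return problems


print(check_arguments(weather_tool, '{"city": "Boston", "state": "MA", "unit": "kelvin"}'))
```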

### Define Messages

In [3]:
def get_messages():
    return [
        {
            "role": "user",
            "content": "What's the weather like in Boston today? Please respond with the format: Today's weather is :{function call result}",
        }
    ]


messages = get_messages()

### Initialize the Client

In [4]:
# Initialize OpenAI-like client
client = OpenAI(api_key="None", base_url="http://0.0.0.0:30333/v1")
model_name = client.models.list().data[0].id

[2025-02-12 10:59:22] INFO:     127.0.0.1:60532 - "GET /v1/models HTTP/1.1" 200 OK


### Non-Streaming Request

In [5]:
# Non-streaming mode test
response_non_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.8,
    top_p=0.8,
    stream=False,  # Non-streaming
    tools=tools,
)
print_highlight("Non-stream response:")
print(response_non_stream)

[2025-02-12 10:59:22 TP0] Prefill batch. #new-seq: 1, #new-token: 302, #cached-token: 1, cache hit rate: 0.32%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-02-12 10:59:22] INFO:     127.0.0.1:60532 - "POST /v1/chat/completions HTTP/1.1" 200 OK


ChatCompletion(id='3eddebb2fd0a4f199c3d3acf28fc8b7b', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='0', function=Function(arguments='{"unit": "fahrenheit", "city": "Boston", "state": "MA"}', name='get_current_weather'), type='function')]), matched_stop=128008)], created=1739357962, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=32, prompt_tokens=303, total_tokens=335, completion_tokens_details=None, prompt_tokens_details=None))


[2025-02-12 10:59:22 TP0] Decode batch. #running-req: 0, #token: 0, token usage: 0.00, gen throughput (token/s): 6.65, #queue-req: 0


### Streaming Request

In [6]:
# Streaming mode test
print_highlight("Streaming response:")
response_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.8,
    top_p=0.8,
    stream=True,  # Enable streaming
    tools=tools,
)

chunks = []
for chunk in response_stream:
    chunks.append(chunk)
    if chunk.choices[0].delta.tool_calls:
        print(chunk.choices[0].delta.tool_calls[0])

[2025-02-12 10:59:22] INFO:     127.0.0.1:60532 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-02-12 10:59:22 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 302, cache hit rate: 49.43%, token usage: 0.00, #running-req: 0, #queue-req: 0
ChoiceDeltaToolCall(index=None, id='0', function=ChoiceDeltaToolCallFunction(arguments='', name='get_current_weather'), type='function')
ChoiceDeltaToolCall(index=None, id='0', function=ChoiceDeltaToolCallFunction(arguments='{"city": "', name=''), type='function')
ChoiceDeltaToolCall(index=None, id='0', function=ChoiceDeltaToolCallFunction(arguments='Boston"', name=''), type='function')
ChoiceDeltaToolCall(index=None, id='0', function=ChoiceDeltaToolCallFunction(arguments=', "state": "', name=''), type='function')


ChoiceDeltaToolCall(index=None, id='0', function=ChoiceDeltaToolCallFunction(arguments='MA"', name=''), type='function')
ChoiceDeltaToolCall(index=None, id='0', function=ChoiceDeltaToolCallFunction(arguments=', "unit": "', name=''), type='function')
ChoiceDeltaToolCall(index=None, id='0', function=ChoiceDeltaToolCallFunction(arguments='f', name=''), type='function')
ChoiceDeltaToolCall(index=None, id='0', function=ChoiceDeltaToolCallFunction(arguments='ahrenheit"}', name=''), type='function')



### Handle Tool Calls

When the engine determines that it should call a particular tool, the response carries the tool name and its arguments: complete in non-streaming mode, and as partial fragments in streaming mode. You can parse these arguments and then invoke the tool accordingly.

**Non-Streaming Request**

In [7]:
name_non_stream = response_non_stream.choices[0].message.tool_calls[0].function.name
arguments_non_stream = (
    response_non_stream.choices[0].message.tool_calls[0].function.arguments
)

print_highlight(f"Non-streamed function call name: {name_non_stream}")
print_highlight(f"Non-streamed function call arguments: {arguments_non_stream}")

**Streaming Request**

In [8]:
# Parse and combine function call arguments
arguments = []
for chunk in chunks:
    choice = chunk.choices[0]
    delta = choice.delta
    if delta.tool_calls:
        tool_call = delta.tool_calls[0]
        if tool_call.function.name:
            print_highlight(f"Streamed function call name: {tool_call.function.name}")

        if tool_call.function.arguments:
            arguments.append(tool_call.function.arguments)
            print(f"Streamed function call arguments: {tool_call.function.arguments}")

# Combine all fragments into a single JSON string
full_arguments = "".join(arguments)
print_highlight(f"Final streamed function call arguments: {full_arguments}")

Streamed function call arguments: {"city": "
Streamed function call arguments: Boston"
Streamed function call arguments: , "state": "
Streamed function call arguments: MA"
Streamed function call arguments: , "unit": "
Streamed function call arguments: f
Streamed function call arguments: ahrenheit"}
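
The fragment-joining step can be wrapped in a small helper that also decodes the joined JSON. The sketch below uses `SimpleNamespace` stand-ins with the same attributes as the streamed deltas, so it runs without a server. Grouping by `id` assumes that every fragment of a call carries the same `id`, as in the output shown earlier; other servers may identify fragments by `index` instead. `make_delta` and `collect_tool_calls` are hypothetical helper names.

```python
import json
from types import SimpleNamespace


def make_delta(call_id, name, arguments):
    """Build a stand-in with the same attributes as the streamed tool-call deltas."""
    return SimpleNamespace(
        id=call_id, function=SimpleNamespace(name=name, arguments=arguments)
    )


def collect_tool_calls(deltas):
    """Merge streamed fragments into {call_id: {"name": ..., "arguments": {...}}}."""
    merged = {}
    for delta in deltas:
        entry = merged.setdefault(delta.id, {"name": "", "fragments": []})
        if delta.function.name:
            entry["name"] = delta.function.name
        if delta.function.arguments:
            entry["fragments"].append(delta.function.arguments)
    # Decode only after all fragments are joined; partial JSON is not parseable.
    return {
        call_id: {
            "name": entry["name"],
            "arguments": json.loads("".join(entry["fragments"])),
        }
        for call_id, entry in merged.items()
    }


# Fragments copied from the streamed output shown earlier.
deltas = [
    make_delta("0", "get_current_weather", ""),
    make_delta("0", "", '{"city": "'),
    make_delta("0", "", 'Boston"'),
    make_delta("0", "", ', "state": "'),
    make_delta("0", "", 'MA"'),
    make_delta("0", "", ', "unit": "'),
    make_delta("0", "", "f"),
    make_delta("0", "", 'ahrenheit"}'),
]
print(collect_tool_calls(deltas))
```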


### Define a Tool Function

In [9]:
# This is a demonstration; define a real function according to your use case.
def get_current_weather(city: str, state: str, unit: str):
    return (
        f"The weather in {city}, {state} is 85 degrees {unit}. It is "
        "partly cloudy, with highs in the 90's."
    )


available_tools = {"get_current_weather": get_current_weather}
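
A registry like `available_tools` makes dispatch a dictionary lookup, but model output can name a tool that does not exist or supply malformed arguments. The following sketch shows one way to guard the call; `run_tool` is a hypothetical helper, and returning errors as text (so they could be passed back to the model) is a design assumption rather than something this guide prescribes.

```python
import json


def get_current_weather(city: str, state: str, unit: str) -> str:
    # Placeholder implementation, mirroring the demonstration function above.
    return f"The weather in {city}, {state} is 85 degrees {unit}."


available_tools = {"get_current_weather": get_current_weather}


def run_tool(name: str, arguments_json: str) -> str:
    """Look up a tool by name and call it, reporting failures as text."""
    tool = available_tools.get(name)
    if tool is None:
        return f"Error: unknown tool {name!r}"
    try:
        kwargs = json.loads(arguments_json)
        return str(tool(**kwargs))
    except (json.JSONDecodeError, TypeError) as exc:
        # JSONDecodeError: malformed arguments; TypeError: wrong or missing keys.
        return f"Error: could not call {name}: {exc}"


print(run_tool("get_current_weather", '{"city": "Boston", "state": "MA", "unit": "fahrenheit"}'))
```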


### Execute the Tool

In [10]:
call_data = json.loads(full_arguments)

messages.append(
    {
        "role": "user",
        "content": "",
        "tool_calls": {"name": "get_current_weather", "arguments": full_arguments},
    }
)

# Call the corresponding tool function
tool_name = messages[-1]["tool_calls"]["name"]
tool_to_call = available_tools[tool_name]
result = tool_to_call(**call_data)
print_highlight(f"Function call result: {result}")
messages.append({"role": "tool", "content": result, "name": tool_name})

print_highlight(f"Updated message history: {messages}")
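
For comparison, the OpenAI Chat Completions convention records a tool call as an `assistant` message with a `tool_calls` list, followed by a `tool` message that refers back to the call through `tool_call_id`. The sketch below only builds those two dictionaries. Whether the server version and chat template used in this guide accept this shape is an assumption to verify, and `build_tool_turn` is a hypothetical helper name.

```python
import json


def build_tool_turn(call_id: str, name: str, arguments_json: str, result: str) -> list:
    """Return the assistant tool-call message and the matching tool result message."""
    assistant_message = {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": call_id,
                "type": "function",
                "function": {"name": name, "arguments": arguments_json},
            }
        ],
    }
    # The tool message points back at the call it answers via tool_call_id.
    tool_message = {"role": "tool", "tool_call_id": call_id, "content": result}
    return [assistant_message, tool_message]


turn = build_tool_turn(
    "0",
    "get_current_weather",
    '{"city": "Boston", "state": "MA", "unit": "fahrenheit"}',
    "The weather in Boston, MA is 85 degrees fahrenheit.",
)
print(json.dumps(turn, indent=2))
```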

### Send Results Back to Model

In [11]:
final_response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.8,
    top_p=0.8,
    stream=False,
    tools=tools,
)
print_highlight("Non-stream response:")
print(final_response)

[2025-02-12 10:59:22 TP0] Prefill batch. #new-seq: 1, #new-token: 41, #cached-token: 300, cache hit rate: 63.21%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-12 10:59:22 TP0] Decode batch. #running-req: 1, #token: 350, token usage: 0.00, gen throughput (token/s): 112.92, #queue-req: 0


[2025-02-12 10:59:22] INFO:     127.0.0.1:60532 - "POST /v1/chat/completions HTTP/1.1" 200 OK


ChatCompletion(id='689f8b123879424181da8bcabe1935ff', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='0', function=Function(arguments='{"city": "Boston", "state": "MA", "unit": "fahrenheit"}', name='get_current_weather'), type='function')]), matched_stop=128008)], created=1739357962, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=32, prompt_tokens=341, total_tokens=373, completion_tokens_details=None, prompt_tokens_details=None))


## Native API and SGLang Runtime (SRT)

In [12]:
from transformers import AutoTokenizer
import requests

# generate an answer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

messages = get_messages()

# Named `prompt` rather than `input` to avoid shadowing the Python builtin.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    tools=tools,
)

gen_url = "http://localhost:30333/generate"
gen_data = {"text": prompt, "sampling_params": {"skip_special_tokens": False}}
gen_response = requests.post(gen_url, json=gen_data).json()["text"]
print(gen_response)

# parse the response
parse_url = "http://localhost:30333/function_call"

function_call_input = {
    "text": gen_response,
    "tool_call_parser": "llama3",
    "tools": tools,
}

function_call_response = requests.post(parse_url, json=function_call_input)
function_call_response_json = function_call_response.json()
print("function name: ", function_call_response_json["calls"][0]["name"])
print("function arguments: ", function_call_response_json["calls"][0]["parameters"])

[2025-02-12 10:59:31 TP0] Prefill batch. #new-seq: 1, #new-token: 317, #cached-token: 1, cache hit rate: 47.48%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-12 10:59:31 TP0] Decode batch. #running-req: 1, #token: 335, token usage: 0.00, gen throughput (token/s): 4.40, #queue-req: 0


[2025-02-12 10:59:32] INFO:     127.0.0.1:59204 - "POST /generate HTTP/1.1" 200 OK
<|python_tag|>{
    "name": "get_current_weather",
    "parameters": {
        "unit": "fahrenheit",
        "city": "Boston",
        "state": "MA"
    }
}
[2025-02-12 10:59:32] INFO:     127.0.0.1:59218 - "POST /function_call HTTP/1.1" 200 OK
function name:  get_current_weather
function arguments:  {"unit": "fahrenheit", "city": "Boston", "state": "MA"}
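
To see what the parsing step has to do for this output format, the raw text above can also be decoded by hand: strip the `<|python_tag|>` marker and load the remaining JSON. This is a minimal sketch of the idea, not SGLang's parser; it assumes a single call with nothing after the JSON payload, and `parse_llama3_tool_call` is a hypothetical helper name.

```python
import json

PYTHON_TAG = "<|python_tag|>"


def parse_llama3_tool_call(text: str):
    """Return (name, parameters) if the text holds a tagged JSON call, else None."""
    start = text.find(PYTHON_TAG)
    if start == -1:
        return None
    payload = text[start + len(PYTHON_TAG):].strip()
    try:
        call = json.loads(payload)
    except json.JSONDecodeError:
        # Trailing tokens or truncated output land here in this simple sketch.
        return None
    if not isinstance(call, dict) or "name" not in call:
        return None
    return call["name"], call.get("parameters", {})


raw = '<|python_tag|>{"name": "get_current_weather", "parameters": {"unit": "fahrenheit", "city": "Boston", "state": "MA"}}'
print(parse_llama3_tool_call(raw))
```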


In [13]:
terminate_process(server_process)

## Offline Engine API

In [14]:
import sglang as sgl
from sglang.srt.function_call_parser import FunctionCallParser
from sglang.srt.managers.io_struct import Tool, Function

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
tokenizer = llm.tokenizer_manager.tokenizer
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, tools=tools
)

sampling_params = {
    "max_new_tokens": 128,
    "temperature": 0.3,
    "top_p": 0.95,
    "skip_special_tokens": False,
}

# 1) Offline generation
result = llm.generate(input_ids=input_ids, sampling_params=sampling_params)
generated_text = result["text"]  # Assume there is only one prompt

print("=== Offline Engine Output Text ===")
print(generated_text)


# 2) Parse using FunctionCallParser
def convert_dict_to_tool(tool_dict: dict) -> Tool:
    function_dict = tool_dict.get("function", {})
    return Tool(
        type=tool_dict.get("type", "function"),
        function=Function(
            name=function_dict.get("name"),
            description=function_dict.get("description"),
            parameters=function_dict.get("parameters"),
        ),
    )


tools = [convert_dict_to_tool(raw_tool) for raw_tool in tools]

parser = FunctionCallParser(tools=tools, tool_call_parser="llama3")
normal_text, calls = parser.parse_non_stream(generated_text)

print("\n=== Parsing Result ===")
print("Normal text portion:", normal_text)
print("Function call portion:")
for call in calls:
    # call: ToolCallItem
    print(f"  - tool name: {call.name}")
    print(f"    parameters: {call.parameters}")

# 3) If needed, perform additional logic on the parsed functions, such as automatically calling the corresponding function to obtain a return value, etc.

=== Offline Engine Output Text ===
<|python_tag|>{"name": "get_current_weather", "parameters": {"city": "Boston", "state": "MA", "unit": "fahrenheit"}}

=== Parsing Result ===
Normal text portion: <|python_tag|>{"name": "get_current_weather", "parameters": {"city": "Boston", "state": "MA", "unit": "fahrenheit"}}
Function call portion:
  - tool name: get_current_weather
    parameters: {"city": "Boston", "state": "MA", "unit": "fahrenheit"}


In [15]:
llm.shutdown()

## How to support a new model?
1. Update the `TOOLS_TAG_LIST` in `sglang/srt/function_call_parser.py` with the model’s tool tags. Currently supported tags include:
```
TOOLS_TAG_LIST = [
    "<|plugin|>",
    "<function=",
    "<tool_call>",
    "<|python_tag|>",
    "[TOOL_CALLS]",
]
```
2. Create a new detector class in `sglang/srt/function_call_parser.py` that inherits from `BaseFormatDetector`. The detector should handle the model’s specific function call format. For example:
```
class NewModelDetector(BaseFormatDetector):
    ...  # implement the model-specific parsing here
```
3. Add the new detector to the `MultiFormatParser` class, which manages all the format detectors.
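
The steps above can be illustrated with a self-contained toy detector. Everything in this sketch is a stand-in: `BaseDetector`, `ParsedCall`, the `detect_and_parse` method, and the `<new_tool>...</new_tool>` format are all hypothetical and exist only to show the shape of the work. The real `BaseFormatDetector` interface should be taken from `sglang/srt/function_call_parser.py`.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ParsedCall:
    """Stand-in for a parsed tool call; not SGLang's ToolCallItem."""
    name: str
    parameters: str


class BaseDetector:
    """Stand-in base class; SGLang's BaseFormatDetector may differ."""

    def detect_and_parse(self, text: str) -> List[ParsedCall]:
        raise NotImplementedError


class NewModelDetector(BaseDetector):
    """Parses calls written as <new_tool>{...}</new_tool> (a made-up format)."""

    start_tag = "<new_tool>"
    end_tag = "</new_tool>"

    def detect_and_parse(self, text: str) -> List[ParsedCall]:
        calls = []
        cursor = 0
        while True:
            start = text.find(self.start_tag, cursor)
            if start == -1:
                break
            end = text.find(self.end_tag, start)
            if end == -1:
                break  # Incomplete call, e.g. mid-stream; stop here.
            body = json.loads(text[start + len(self.start_tag):end])
            calls.append(ParsedCall(body["name"], json.dumps(body["arguments"])))
            cursor = end + len(self.end_tag)
        return calls


text = 'Sure. <new_tool>{"name": "get_current_weather", "arguments": {"city": "Boston"}}</new_tool>'
for call in NewModelDetector().detect_and_parse(text):
    print(call.name, call.parameters)
```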