# Function-calling & prompt budgeting

> Under the hood, functions are injected into the **system message** in a
> special syntax the model was trained on.  
> Because of this, **every function definition counts toward your context
> limit** and is **billed as input tokens**.  
> When you hit the token ceiling, trim either  
> * the **number of functions**, or  
> * the **length of each parameter description**.  
> If you have *many* functions, consider **fine-tuning** to compress them.

[Function-calling docs → “additional configurations”](https://platform.openai.com/docs/guides/function-calling?api-mode=responses#additional-configurations)
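
As a rough illustration of why longer descriptions cost more, the sketch below serialises a tool definition and estimates its size. The ratio of about four characters per token is only a rule of thumb and an assumption here, and `approx_tokens` is a hypothetical helper. The syntax the API actually injects is not public, so a real tokenizer or the usage data returned with a response will give more reliable numbers.

```python
import json


def approx_tokens(obj, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: length of the compact JSON divided by an assumed ratio."""
    text = json.dumps(obj, ensure_ascii=False, separators=(",", ":"))
    return round(len(text) / chars_per_token)


# Hypothetical tool definition with a one-line description.
short_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Get current temperature for provided coordinates in Celsius.",
    "parameters": {
        "type": "object",
        "properties": {"latitude": {"type": "number"}, "longitude": {"type": "number"}},
        "required": ["latitude", "longitude"],
    },
}

# Same tool, with five made-up few-shot examples appended to the description.
example = "User: Temperature in Berlin now?\nTool call: get_weather(latitude=52.52, longitude=13.405)\n"
long_tool = dict(short_tool, description=short_tool["description"] + "\n\n" + example * 5)

print("short:", approx_tokens(short_tool), "long:", approx_tokens(long_tool))
```

Comparing the two estimates before and after editing a description gives a quick sense of how much each request grows.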

---

### Where to put your few-shot examples?

| Location for examples | What gets longer? | Sent **every** request? | Typical use-case |
|-----------------------|-------------------|-------------------------|------------------|
| **Tool description** | The *function schema* section inside the hidden system prompt | **Yes** – they travel with *every* call that includes the tool | Good when you want the model to see examples **only when that tool is available** (keeps unrelated chats smaller). |
| **System prompt** | The *global* system message | **Yes** – they precede every conversation turn, even if the tool isn’t used | Useful when examples provide **general policy** or disambiguation rules that should influence *all* reasoning. |

**Key difference:**  
Putting examples in the *tool description* keeps your **base system prompt shorter**, but the combined prompt *still* grows because the full function schema (now bigger) is injected on every request that carries `tools=[…]`.  
Placing them directly in the *system prompt* enlarges the top-level prompt in **one place** (it is still sent with every request, whether or not the tool is attached), yet avoids inflating each individual function schema.
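
To make the trade-off concrete, here is a back-of-the-envelope sketch. Every number in it (prompt sizes, request counts, the share of requests that carry the tool) is hypothetical, and `total_input_tokens` is an illustrative helper, not part of any SDK. It assumes the system prompt is sent with every request, the tool schemas are sent only with requests that include `tools`, and `copies` is the number of tool descriptions that would each repeat the same examples.

```python
def total_input_tokens(n_requests, tool_share, base_sys, base_tools, examples,
                       where, copies=1):
    """Estimate total input tokens over n_requests.

    where="desc": examples are appended to `copies` tool descriptions.
    where="sys":  examples are appended once to the system prompt.
    """
    if where not in ("desc", "sys"):
        raise ValueError("where must be 'desc' or 'sys'")
    requests_with_tools = round(n_requests * tool_share)
    sys_tokens = base_sys + (examples if where == "sys" else 0)
    tool_tokens = base_tools + (examples * copies if where == "desc" else 0)
    return n_requests * sys_tokens + requests_with_tools * tool_tokens


# Scenario A (made-up): one tool, attached to 20% of requests.
a_desc = total_input_tokens(1000, 0.2, 150, 60, 120, "desc")
a_sys = total_input_tokens(1000, 0.2, 150, 60, 120, "sys")

# Scenario B (made-up): three tools sharing the examples, attached to every request.
b_desc = total_input_tokens(1000, 1.0, 150, 60, 120, "desc", copies=3)
b_sys = total_input_tokens(1000, 1.0, 150, 60, 120, "sys", copies=3)

print("A:", a_desc, "vs", a_sys)
print("B:", b_desc, "vs", b_sys)
```

With these made-up numbers, the description placement is cheaper when only a fifth of requests carry the tool, and the system-prompt placement is cheaper once the same examples would be repeated in several tool descriptions on every request.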

So choose the location that minimises total tokens **for your traffic pattern**:

```text
few calls ⟹ duplicate cost OK ⟹ put examples in tool description
many calls ⟹ avoid schema bloat ⟹ centralise in system prompt
```

## Imports & Configs

In [197]:
import json, os, textwrap
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Tuple, Iterable

import dotenv
import openai
import pandas as pd

dotenv.load_dotenv()

MODEL = "gpt-4o"                 # Or any function‑calling‑capable model
TEMPERATURE = 0.0
COORD_TOL_DEG = 0.1              # Lat/Lon tolerance when judging correctness

## 1 · Prepare Test Data

In [198]:
@dataclass
class Example:
    user: str
    should_call: bool
    expected: Dict | None  # {latitude, longitude}

    @property
    def messages(self) -> List[Dict]:
        return [{"role": "user", "content": self.user}]


DATASET: List[Example] = [
    # ---------- 15 POSITIVES ----------
    Example("Paris check-in: how hot is it right now?", True,
            {"latitude": 48.8566, "longitude": 2.3522}),
    Example("Temp RN in NYC please 🙏", True,
            {"latitude": 40.7128, "longitude": -74.0060}),
    Example("Give me today's Celsius for Tōkyō (present moment).", True,
            {"latitude": 35.6895, "longitude": 139.6917}),
    Example("Current °C at -33.87, 151.21 ?", True,
            {"latitude": -33.87, "longitude": 151.21}),             # Sydney as coords
    Example("What's the thermometer reading downtown São Paulo?", True,
            {"latitude": -23.5505, "longitude": -46.6333}),
    Example("How warm is it outside in Cairo right this second?", True,
            {"latitude": 30.0444, "longitude": 31.2357}),
    Example("Need real-time temp for Montréal, Québec.", True,
            {"latitude": 45.5019, "longitude": -73.5674}),
    Example("Berlin now: degrees?", True,
            {"latitude": 52.5200, "longitude": 13.4050}),
    Example("What's the weather like *right now* in Mumbai temperature-wise?", True,
            {"latitude": 19.0760, "longitude": 72.8777}),
    Example("Temp please for Capetown currently.", True,
            {"latitude": -33.9249, "longitude": 18.4241}),
    Example("Sydney temp update (this minute).", True,
            {"latitude": -33.8688, "longitude": 151.2093}),
    Example("Is it chilly in Lima at the moment?", True,
            {"latitude": -12.0464, "longitude": -77.0428}),
    Example("Right now, what's Bangkok's temperature?", True,
            {"latitude": 13.7563, "longitude": 100.5018}),
    Example("Tell me the °C in Athens this instant.", True,
            {"latitude": 37.9838, "longitude": 23.7275}),
    Example("How many degrees outside in Nairobi right now?", True,
            {"latitude": -1.286389, "longitude": 36.817223}),

    # ---------- 15 NEGATIVES ----------
    # Fiction / heavy misspellings
    Example("Weather report for Hogwarts, please.", False, None),
    Example("What's the temperature in San Fran-sissco today?", False, None),
    Example("Current climate of Narnia?", False, None),
    # Historical / future
    Example("What was the temperature in Berlin last Tuesday?", False, None),
    Example("Will it be hot in Tokyo tomorrow afternoon?", False, None),
    # Wrong metric
    Example("How humid is it in Paris right now?", False, None),
    Example("Wind speed in Chicago at the moment?", False, None),
    # Multiple places / comparisons
    Example("Compare the current temperature of London and Paris.", False, None),
    Example("Is Rome warmer than Madrid right now?", False, None),
    # Region-level / ambiguous
    Example("Temperature in California now?", False, None),
    Example("What's the temp on the East Coast currently?", False, None),
    # Misspelled beyond recognition
    Example("Temp in Tokyoo tonite?", False, None),
    Example("Degrees now in Barzil?", False, None),
    # Known city and current time, but a unit the tool does not return
    Example("Give me NYC temperature for *right now* in Fahrenheit.", False, None),  # different unit → reject
    # Coordinate but historical
    Example("At 48.85, 2.35 – what was the temp yesterday?", False, None),
]

print(f"Dataset · Positives: {sum(e.should_call for e in DATASET)}, Negatives: {sum(not e.should_call for e in DATASET)} (total {len(DATASET)})")

Dataset · Positives: 15, Negatives: 15 (total 30)


## 2 · Prompts & tool specification

In [199]:
SYSTEM_PROMPT = textwrap.dedent(
    """
    Your task is to decide whether to call the
    function `get_weather` based on a user's message.
    
    Follow these rules **exactly**:

    1. **Call `get_weather`** only when the user wants the *current* air
      temperature for a single, real-world location that can be mapped to
      a latitude/longitude.
    2. The user may phrase it indirectly, use slang, or give coordinates
      instead of a name – that is still a call if it is clearly *current*.
    3. If the user requests any other weather metric (humidity, forecast,
      wind, UV index, historical dates, comparisons, multiple places,
      fictional or un-resolvable names, etc.) **do NOT** call — reply
      in plain text (“Sorry, …”).
    4. Respond *solely* with a tool call when appropriate; otherwise reply in
       plain text (e.g., "Sorry, I don't have data on that location.").
    """
).strip()

print(SYSTEM_PROMPT)

Your task is to decide whether to call the
function `get_weather` based on a user's message.

Follow these rules **exactly**:

1. **Call `get_weather`** only when the user wants the *current* air
  temperature for a single, real-world location that can be mapped to
  a latitude/longitude.
2. The user may phrase it indirectly, use slang, or give coordinates
  instead of a name – that is still a call if it is clearly *current*.
3. If the user requests any other weather metric (humidity, forecast,
  wind, UV index, historical dates, comparisons, multiple places,
  fictional or un-resolvable names, etc.) **do NOT** call — reply
  in plain text (“Sorry, …”).
4. Respond *solely* with a tool call when appropriate; otherwise reply in
   plain text (e.g., "Sorry, I don't have data on that location.").


In [200]:
FEW_SHOT_EXAMPLES = textwrap.dedent(
    """
    User: I need the current temperature in Berlin.
    Tool call: get_weather(latitude=52.5200, longitude=13.4050)

    User: What's the Celsius in Mumbai right now?
    Tool call: get_weather(latitude=19.0760, longitude=72.8777)

    User: Could you tell me the weather in Hogwarts?
    Assistant: (No tool call – fictional location)

    User: Weather update for Tokoyo, please.
    Assistant: (No tool call – unclear/misspelled location)

    User: Temperature check for Cape Town, please.
    Tool call: get_weather(latitude=-33.9249, longitude=18.4241)
    """
).strip()

print(FEW_SHOT_EXAMPLES)

User: I need the current temperature in Berlin.
Tool call: get_weather(latitude=52.5200, longitude=13.4050)

User: What's the Celsius in Mumbai right now?
Tool call: get_weather(latitude=19.0760, longitude=72.8777)

User: Could you tell me the weather in Hogwarts?
Assistant: (No tool call – fictional location)

User: Weather update for Tokoyo, please.
Assistant: (No tool call – unclear/misspelled location)

User: Temperature check for Cape Town, please.
Tool call: get_weather(latitude=-33.9249, longitude=18.4241)


In [201]:
def build_tool(embed_examples: bool) -> List[Dict]:
    """Return the `tools` argument for the Responses API."""

    desc = "Get current temperature for provided coordinates in Celsius."
    if embed_examples:
        desc += "\n\nFew‑shot examples:\n" + FEW_SHOT_EXAMPLES

    return [
        {
            "type": "function",
            "name": "get_weather",
            "description": desc,
            "parameters": {
                "type": "object",
                "properties": {
                    "latitude": {"type": "number"},
                    "longitude": {"type": "number"},
                },
                "required": ["latitude", "longitude"],
                "additionalProperties": False,
            },
            "strict": True,
        }
    ]


def build_messages(example: Example, include_fs_in_system: bool) -> List[Dict]:
    sys_prompt = SYSTEM_PROMPT
    if include_fs_in_system:
        sys_prompt += "\n\nFew‑shot examples:\n" + FEW_SHOT_EXAMPLES
    return [{"role": "system", "content": sys_prompt}, *example.messages]

## 3 · Helper Functions

In [202]:
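def extract_tool(resp) -> Tuple[bool, Dict | None]:
    """Return (called, args) based on the first output item of the response."""
    try:
        tool_call = resp.output[0]
        if tool_call.type != "function_call":
            return False, None
        args = json.loads(tool_call.arguments)
        return True, args
    except (IndexError, AttributeError, TypeError, ValueError):
        # Empty output, unexpected item shape, or malformed JSON arguments.
        return False, None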


def is_correct(example: Example, *, called: bool, args: Dict | None) -> bool:
    """Positives need a call within COORD_TOL_DEG of the expected point; negatives need no call."""
    if example.should_call:
        if not called or args is None:
            return False
        # A missing coordinate falls back to a huge sentinel so the check fails.
        d_lat = abs(args.get("latitude", 1e9) - example.expected["latitude"])
        d_lon = abs(args.get("longitude", 1e9) - example.expected["longitude"])
        return d_lat <= COORD_TOL_DEG and d_lon <= COORD_TOL_DEG
    return not called

def run_eval(dataset: Iterable[Example], *, examples_in_desc: bool, examples_in_sys: bool) -> pd.DataFrame:
    """Run the model on *dataset* and return a DataFrame with detailed results."""

    client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    rows = []

    for ex in dataset:
        messages = build_messages(ex, include_fs_in_system=examples_in_sys)
        tools = build_tool(embed_examples=examples_in_desc)

        resp = client.responses.create(
            model=MODEL,
            input=messages,
            tools=tools,
            temperature=TEMPERATURE,
        )

        called, args = extract_tool(resp)
        lat_pred = args.get("latitude") if args else None
        lon_pred = args.get("longitude") if args else None

        rows.append(
            {
                "user_query": ex.user,
                "response": resp.output[0],
                "should_call": ex.should_call,
                "called": called,
                "lat_expected": ex.expected.get("latitude") if ex.expected else None,
                "lon_expected": ex.expected.get("longitude") if ex.expected else None,
                "lat_pred": lat_pred,
                "lon_pred": lon_pred,
                "correct": is_correct(ex, called=called, args=args),
            }
        )

    df = pd.DataFrame(rows)
    accuracy = df["correct"].mean() * 100
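    # Label the run by where the few-shot examples were placed.
    if examples_in_desc and examples_in_sys:
        mode = "DESC+SYS"
    elif examples_in_desc:
        mode = "DESC"
    elif examples_in_sys:
        mode = "SYS"
    else:
        mode = "NO"
    print(f"Accuracy ({mode} few‑shots): {accuracy:.2f}%\n")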
    return df
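
Overall accuracy hides which kind of mistake the model makes. The sketch below is a hypothetical helper (`error_breakdown` is not part of the original notebook) that splits result rows into false positives, missed calls, and calls with wrong coordinates. It works on plain dicts with the same keys as the rows built in `run_eval`, so it could be fed `df.to_dict("records")`. The sample rows are invented for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def error_breakdown(rows: Iterable[Dict]) -> Counter:
    """Count outcomes by type, using the `should_call`, `called`, `correct` keys."""
    counts = Counter()
    for row in rows:
        if row["correct"]:
            counts["correct"] += 1
        elif row["called"] and not row["should_call"]:
            counts["false_positive"] += 1   # called the tool when it should not have
        elif not row["called"] and row["should_call"]:
            counts["missed_call"] += 1      # should have called the tool but did not
        else:
            counts["wrong_arguments"] += 1  # called as expected, coordinates off
    return counts


# Invented rows, shaped like the dicts appended in run_eval.
sample_rows = [
    {"should_call": True, "called": True, "correct": True},
    {"should_call": False, "called": True, "correct": False},
    {"should_call": True, "called": False, "correct": False},
    {"should_call": True, "called": True, "correct": False},
]

print(dict(error_breakdown(sample_rows)))
```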

## 4 · Run Tests – Few‑Shots in **Description**

In [None]:
result_df = run_eval(DATASET, examples_in_desc=True, examples_in_sys=False)

In [162]:
# View incorrect rows
incorrect_rows = result_df[~result_df["correct"]][["user_query", "should_call", "response"]]

for _, row in incorrect_rows.iterrows():
    print(f"User Query:\n{row['user_query']}\n")
    print(f"Should Call:\n{row['should_call']}\n")
    print(f"Response:\n{row['response']}\n")
    print("="*60)

User Query:
Give me NYC temperature for *right now* in Fahrenheit.

Should Call:
False

Response:
ResponseFunctionToolCall(arguments='{"latitude":40.7128,"longitude":-74.006}', call_id='call_P6JhrUrbn5RXwPaUFqLAyTiw', name='get_weather', type='function_call', id='fc_680a8738b9988191a98c5d85f585b8d8095538ce284c73f4', status='completed')



## 5 · Run Tests – Few‑Shots in **System Prompt**

In [None]:
result_sys_df = run_eval(DATASET, examples_in_desc=False, examples_in_sys=True)

Accuracy (SYS few‑shots): 93.33%



In [164]:
result_sys_df.head()

|   | user_query | response | should_call | called | lat_expected | lon_expected | lat_pred | lon_pred | correct |
|---|------------|----------|-------------|--------|--------------|--------------|----------|----------|---------|
| 0 | Paris check-in: how hot is it right now? | ResponseFunctionToolCall(arguments='{"latitude... | True | True | 48.8566 | 2.3522 | 48.8566 | 2.3522 | True |
| 1 | Temp RN in NYC please 🙏 | ResponseFunctionToolCall(arguments='{"latitude... | True | True | 40.7128 | -74.006 | 40.7128 | -74.006 | True |
| 2 | Give me today's Celsius for Tōkyō (present mom... | ResponseFunctionToolCall(arguments='{"latitude... | True | True | 35.6895 | 139.6917 | 35.6895 | 139.6917 | True |
| 3 | Current °C at -33.87, 151.21 ? | ResponseFunctionToolCall(arguments='{"latitude... | True | True | -33.87 | 151.21 | -33.87 | 151.21 | True |
| 4 | What's the thermometer reading downtown São Pa... | ResponseFunctionToolCall(arguments='{"latitude... | True | True | -23.5505 | -46.6333 | -23.5505 | -46.6333 | True |


In [165]:
# View incorrect rows for the system-prompt run (result_sys_df)
incorrect_rows = result_sys_df[~result_sys_df["correct"]][["user_query", "response"]]

for _, row in incorrect_rows.iterrows():
    print(f"User Query:\n{row['user_query']}\n")
    print(f"Response:\n{row['response']}\n")
    print("="*60)

User Query:
What's the temperature in San Fran-sissco today?

Response:
ResponseFunctionToolCall(arguments='{"latitude":37.7749,"longitude":-122.4194}', call_id='call_J4dGxTnrDTC5flECybMpyO7k', name='get_weather', type='function_call', id='fc_680a88031f0c8191a060ec0eee01ba740e99c6365f769b15', status='completed')

User Query:
Give me NYC temperature for *right now* in Fahrenheit.

Response:
ResponseFunctionToolCall(arguments='{"latitude":40.7128,"longitude":-74.006}', call_id='call_50TrRSbIsEyi4xQyABAma9jr', name='get_weather', type='function_call', id='fc_680a881143b88191938ec4ca26b49e5607e0feea166a5583', status='completed')

