<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="200"/>
        <br>
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://arize.com/docs/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center"> Judge Prompt Comparison: Simple/Complex × Reasoning/Non-Reasoning Models </h1>


This notebook uses **Arize Phoenix `llm_classify`** to evaluate tool-calling predictions on the **Berkeley Function Calling Leaderboard (BFCL)** dataset with:
- a **simple** binary prompt (`Yes`/`No`), and
- a **complex** multi-class prompt (`correct` / `partially_correct` / `incorrect`).

We keep it minimal and focused on classification-style LLM-as-a-judge.


## 1) Install & Imports

In [1]:
%pip -q install --upgrade pandas datasets arize-phoenix-evals openai tiktoken nest_asyncio

Note: you may need to restart the kernel to use updated packages.


In [None]:
import json
import os
import random
import re
import urllib.request
from pathlib import Path

import nest_asyncio
import pandas as pd

from phoenix.evals import llm_classify
from phoenix.evals.models import OpenAIModel

nest_asyncio.apply()

## 2) Configure Judge Model

In [None]:
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
if not OPENAI_API_KEY:
    print("⚠️ Set OPENAI_API_KEY env var to run evals.")

non_reasoning_model = OpenAIModel(
    model="gpt-4o-mini",
    api_key=OPENAI_API_KEY,
    temperature=0,
)
reasoning_model = OpenAIModel(
    model="o3",
    api_key=OPENAI_API_KEY,
    temperature=0,
)

## 3) Load BFCL (V3 Exec Splits)

In [None]:
BFCL_FILES = {
    "exec_simple": "https://huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard/resolve/main/BFCL_v3_exec_simple.json",
    "exec_multiple": "https://huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard/resolve/main/BFCL_v3_exec_multiple.json",
}


def fetch_json(url, out_path):
    out = Path(out_path)
    if not out.exists():
        print(f"Downloading {url} -> {out}")
        urllib.request.urlretrieve(url, out)
    text = Path(out).read_text()
    try:
        return json.loads(text)
    except Exception:
        rows = []
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue
            try:
                rows.append(json.loads(line))
            except Exception:
                pass
        return rows


df_simple = pd.DataFrame(fetch_json(BFCL_FILES["exec_simple"], "BFCL_v3_exec_simple.json"))
df_multi = pd.DataFrame(fetch_json(BFCL_FILES["exec_multiple"], "BFCL_v3_exec_multiple.json"))
print("Simple shape:", df_simple.shape, "Multiple shape:", df_multi.shape)
df = pd.concat([df_simple, df_multi], ignore_index=True)
print("Combined shape:", df.shape)
df.head(2)

Simple shape: (100, 5) Multiple shape: (50, 5)
Combined shape: (150, 5)


Unnamed: 0,id,question,function,execution_result_type,ground_truth
0,exec_simple_0,"[[{'role': 'user', 'content': 'I've been playi...","[{'name': 'calc_binomial_probability', 'descri...",[exact_match],"[calc_binomial_probability(n=20, k=5, p=0.6)]"
1,exec_simple_1,"[[{'role': 'user', 'content': 'During last nig...","[{'name': 'calc_binomial_probability', 'descri...",[exact_match],"[calc_binomial_probability(n=30, k=15, p=0.5)]"


## 4) Prepare DataFrame (instruction, functions, ground truth, predictions)

In [5]:
def extract_gt_call(row):
    gt = row.get("ground_truth")
    if isinstance(gt, list) and gt:
        return gt[0]
    if isinstance(gt, str):
        return gt
    return ""


def extract_functions(row):
    fns = row.get("function", [])
    if isinstance(fns, dict):
        fns = [fns]
    return fns


def extract_instruction(row):
    q = row.get("question", [])
    last_user = ""
    for msg_list in q:
        for m in msg_list:
            if m.get("role") == "user":
                last_user = m.get("content", last_user)
    return last_user


work = []
for _, r in df.iterrows():
    rr = r.to_dict()
    work.append(
        {
            "id": rr.get("id", ""),
            "instruction": extract_instruction(rr),
            "functions_json": json.dumps(
                [
                    {
                        "name": f.get("name"),
                        "parameters": f.get("parameters"),
                        "description": f.get("description", ""),
                    }
                    for f in extract_functions(rr)
                ],
                ensure_ascii=False,
            ),
            "ground_truth": extract_gt_call(rr),
        }
    )
wf = pd.DataFrame(work)

PREDICTIONS_CSV = os.getenv("PREDICTIONS_CSV", "")
if PREDICTIONS_CSV and Path(PREDICTIONS_CSV).exists():
    preds = pd.read_csv(PREDICTIONS_CSV)[["id", "pred_tool_call"]]
    data = wf.merge(preds, on="id", how="left")
else:

    def corrupt_call(s: str) -> str:
        if not s or "(" not in s:
            return s
        tool, args = s.split("(", 1)
        tool = tool.strip()
        args = args.rstrip(")")
        if random.random() < 0.5:
            tool = tool + "_alt"
        else:
            args = re.sub(r"(\d+(?:\.\d+)?)", lambda m: str(float(m.group()) * 1.1), args, count=1)
        return f"{tool} ({args})"

    pred_tool_call = [
        gt if random.random() < 0.7 else corrupt_call(gt) for gt in wf["ground_truth"]
    ]
    data = wf.copy()
    data["pred_tool_call"] = pred_tool_call

SAMPLE = int(os.getenv("EVAL_SAMPLE", "100"))
data = data.sample(min(SAMPLE, len(data)), random_state=7).reset_index(drop=True)

print(data.head(3))

                 id                                        instruction  \
0  exec_multiple_49  I have a set of vertices: [[1,2],[3,4],[1,4],[...   
1    exec_simple_84  I need to identify the straight line that cont...   
2    exec_simple_40  I'm currently working on a detailed city map, ...   

                                      functions_json  \
0  [{"name": "convert_coordinates", "parameters":...   
1  [{"name": "maxPoints", "parameters": {"type": ...   
2  [{"name": "get_distance", "parameters": {"type...   

                                        ground_truth  \
0   polygon_area(vertices=[[1,2],[3,4],[1,4],[3,7]])   
1        maxPoints(points=[[1,1],[2,2],[3,4],[5,5]])   
2  get_distance(pointA=(45.76, 4.85), pointB=(48....   

                                      pred_tool_call  
0   polygon_area(vertices=[[1,2],[3,4],[1,4],[3,7]])  
1     maxPoints (points=[[1.1,1],[2,2],[3,4],[5,5]])  
2  get_distance(pointA=(45.76, 4.85), pointB=(48....  


In [6]:
small_data = data.sample(n=30, random_state=42).reset_index(drop=True)
print(f"Sampled dataset shape: {small_data.shape}")
print("First 3 rows of sampled data:")

Sampled dataset shape: (30, 5)
First 3 rows of sampled data:


## 5) Define your LLM-as-a-Judge Templates & Rails

In [None]:
SIMPLE_TEMPLATE = """You are grading a tool-calling attempt.

Given:
USER INSTRUCTION:
{instruction}

AVAILABLE FUNCTIONS (JSON Schemas):
{functions_json}

MODEL TOOL CALL (string):
{pred_tool_call}

GROUND TRUTH TOOL CALL (string):
{ground_truth}

Question: Did the model invoke the correct tool(s) AND use the correct parameter names and values?
Answer strictly with one token: Yes or No.
"""

SIMPLE_RAILS = ["Yes", "No"]

In [None]:
COMPLEX_TEMPLATE = """You are grading a tool-calling attempt.
Return ONLY one of the following labels:
- correct
- partially_correct
- incorrect

Use these rules:
- Consider types and trivial formatting (e.g., '5' vs 5, whitespace) as equivalent.
- Consider equivalent units only if explicitly clear from context.
- The attempt is "correct" only if the tool and all required parameters match the ground truth.
- It's "partially_correct" if the tool is correct but parameters have minor issues.
- It's "incorrect" otherwise.

Context:
USER INSTRUCTION:
{instruction}

AVAILABLE FUNCTIONS (JSON Schemas):
{functions_json}

MODEL TOOL CALL (string):
{pred_tool_call}

GROUND TRUTH TOOL CALL (string):
{ground_truth}
"""

COMPLEX_RAILS = ["correct", "partially_correct", "incorrect"]

## 6) Run `llm_classify` for the Simple Eval

In [None]:
simple_df = small_data.copy()

non_reasoning_simple_results = llm_classify(
    data=simple_df.assign(template=SIMPLE_TEMPLATE),
    model=non_reasoning_model,
    template="{template}",
    rails=SIMPLE_RAILS,
    provide_explanation=True,
    include_prompt=False,
    include_response=True,
    run_sync=True,
)

non_reasoning_simple_results.head(3)

llm_classify |          | 0/30 (0.0%) | ⏳ 00:00<? | ?it/s

Unnamed: 0,label,explanation,response,exceptions,execution_status,execution_seconds,prompt_tokens,completion_tokens,total_tokens
0,no,The model did not invoke the correct tool or u...,"{""explanation"":""The model did not invoke the c...",[],COMPLETED,1.086622,154,30,184
1,no,The model did not invoke the correct tool or u...,"{""explanation"":""The model did not invoke the c...",[],COMPLETED,0.96951,154,30,184
2,no,The model did not invoke the correct tool or u...,"{""explanation"":""The model did not invoke the c...",[],COMPLETED,1.560811,154,30,184


In [None]:
simple_df = small_data.copy()

reasoning_simple_results = llm_classify(
    data=simple_df.assign(template=SIMPLE_TEMPLATE),
    model=reasoning_model,
    template="{template}",
    rails=SIMPLE_RAILS,
    provide_explanation=True,
    include_prompt=False,
    include_response=True,
    run_sync=True,
)

reasoning_simple_results.head(3)

llm_classify |          | 0/30 (0.0%) | ⏳ 00:00<? | ?it/s

Unnamed: 0,label,explanation,response,exceptions,execution_status,execution_seconds,prompt_tokens,completion_tokens,total_tokens
0,no,The prompt did not provide the actual user ins...,"{""explanation"":""The prompt did not provide the...",[],COMPLETED,7.555468,148,403,551
1,no,Unable to directly inspect the predicted and g...,"{""explanation"":""Unable to directly inspect the...",[],COMPLETED,11.89464,148,576,724
2,no,"The necessary information (instruction, functi...","{""response"":""No"",""explanation"":""The necessary ...",[],COMPLETED,9.004514,148,580,728


## 6.5) Run `llm_classify` for the Complex Eval

In [11]:
complex_df = small_data.copy()

non_reasoning_complex_results = llm_classify(
    data=complex_df.assign(template=COMPLEX_TEMPLATE),
    model=non_reasoning_model,
    template="{template}",
    rails=COMPLEX_RAILS,
    provide_explanation=True,
    include_prompt=False,
    include_response=True,
    run_sync=True,
)
non_reasoning_complex_results.head(3)

llm_classify |          | 0/30 (0.0%) | ⏳ 00:00<? | ?it/s

Unnamed: 0,label,explanation,response,exceptions,execution_status,execution_seconds,prompt_tokens,completion_tokens,total_tokens
0,incorrect,The tool and parameters in the model tool call...,"{""explanation"":""The tool and parameters in the...",[],COMPLETED,1.530556,230,41,271
1,incorrect,The tool and parameters in the model tool call...,"{""explanation"":""The tool and parameters in the...",[],COMPLETED,1.437953,230,47,277
2,incorrect,The tool and parameters in the model tool call...,"{""explanation"":""The tool and parameters in the...",[],COMPLETED,1.714084,230,47,277


In [None]:
complex_df = small_data.copy()

reasoning_complex_results = llm_classify(
    data=complex_df.assign(template=COMPLEX_TEMPLATE),
    model=reasoning_model,
    template="{template}",
    rails=COMPLEX_RAILS,
    provide_explanation=True,
    include_prompt=False,
    include_response=True,
    run_sync=True,
)

reasoning_complex_results.head(3)

llm_classify |          | 0/30 (0.0%) | ⏳ 00:00<? | ?it/s

Unnamed: 0,label,explanation,response,exceptions,execution_status,execution_seconds,prompt_tokens,completion_tokens,total_tokens
0,incorrect,No information about the predicted and ground-...,"{""response"":""incorrect"",""explanation"":""No info...",[],COMPLETED,4.159545,224,181,405
1,incorrect,Cannot compare because the prediction or groun...,"{""response"":""incorrect"",""explanation"":""Cannot ...",[],COMPLETED,5.005243,224,235,459
2,incorrect,Insufficient data to compare predicted tool ca...,"{""response"":""incorrect"",""explanation"":""Insuffi...",[],COMPLETED,3.929234,224,171,395


## 7) Results

In [17]:
print("For Simple Eval: ")
different_labels = (
    non_reasoning_simple_results["label"] != reasoning_simple_results["label"]
).sum()
if different_labels == 0:
    print("Reasoning and non-reasoning models agree on all samples")
else:
    print(f"Reasoning and non-reasoning models disagree on {different_labels} samples")
NR_simple_lokens = non_reasoning_simple_results["total_tokens"].sum()
R_simple_lokens = reasoning_simple_results["total_tokens"].sum()
print(f"Non-reasoning model used {NR_simple_lokens} tokens")
print(f"Reasoning model used {R_simple_lokens} tokens")

For Simple Eval: 
Reasoning and non-reasoning models agree on all samples
Non-reasoning model used 5520 tokens
Reasoning model used 18704 tokens


In [18]:
print("For Complex Eval: ")
different_labels = (
    non_reasoning_complex_results["label"] != reasoning_complex_results["label"]
).sum()
if different_labels == 0:
    print("Reasoning and non-reasoning models agree on all samples")
else:
    print(f"Reasoning and non-reasoning models disagree on {different_labels} samples")
NR_complex_tokens = non_reasoning_complex_results["total_tokens"].sum()
R_complex_tokens = reasoning_complex_results["total_tokens"].sum()
print(f"Non-reasoning model used {NR_complex_tokens} tokens")
print(f"Reasoning model used {R_complex_tokens} tokens")

For Complex Eval: 
Reasoning and non-reasoning models disagree on 7 samples
Non-reasoning model used 8184 tokens
Reasoning model used 21627 tokens


### References
- `llm_classify` API (Phoenix Evals): https://arize-phoenix.readthedocs.io/en/latest/api/evals.classify.html
- Phoenix Evals Overview: https://arize.com/docs/phoenix/evaluation/llm-evals
- Using `llm_classify` (Docs): https://arize.com/docs/phoenix/evaluation/how-to-evals/bring-your-own-evaluator
- BFCL dataset: https://huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard
