# DABstep Benchmark Qwen2.5-Coder-32B Baseline

This notebook guides you through submitting a Qwen2.5-Coder-32B baseline to the DABstep leaderboard.

* Live 🤗 Leaderboard: https://huggingface.co/spaces/adyen/DABstep
* Benchmark 🤗 Dataset: https://huggingface.co/datasets/adyen/DABstep
* LLM Agent Framework by 🤗: https://github.com/huggingface/smolagents/tree/main



## Environment Setup

We need to set up:
* **HuggingFace Token:** To make free API calls to the HuggingFace Inference API you need a HF account; the API verifies this by checking your account's token. This token will not be used for anything else.
* **Benchmark context files:** To solve the benchmark tasks, the agent needs to reference documentation and analyze data spread across multiple files, just as a real data analyst would.

### HuggingFace Token Setup

In [282]:
import time
import os
import json
import re
import datasets
import pandas as pd
from smolagents import CodeAgent
from smolagents.agents import ActionStep
from smolagents.models import OpenAIServerModel
from huggingface_hub import hf_hub_download

# Load OpenRouter API key from secrets
openrouter_key = None
openrouter_path = os.path.abspath(os.path.join(os.getcwd(), "..", "secrets", "openrouter_credentials.txt"))
if os.path.exists(openrouter_path):
    with open(openrouter_path, "r") as f:
        content = f.read()
    # Extract API key - look for pattern like sk-or-v1-...
    key_match = re.search(r"sk-or-v1-[A-Za-z0-9]+", content)
    if key_match:
        openrouter_key = key_match.group(0)
        print("✓ Loaded OpenRouter API key from secrets")
    else:
        print("⚠️ Warning: No valid OpenRouter key found in", openrouter_path)
else:
    print("⚠️ Error: OpenRouter credentials file not found at", openrouter_path)
    raise FileNotFoundError("OpenRouter credentials required to run this notebook")

# Try to import DABstep utilities
try:
    from dabstep_benchmark.utils import evaluate
    DABSTEP_AVAILABLE = True
except ImportError:
    print("Warning: DABstep benchmark utilities not available.")
    DABSTEP_AVAILABLE = False
    def evaluate(*args, **kwargs):
        raise RuntimeError("DABstep utilities not installed.")

notebook_start_time = time.time()


✓ Loaded OpenRouter API key from secrets
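The key-loading cell above can be factored into two small functions, which makes the regex easier to test in isolation. This is a minimal sketch, not the notebook's own code: the helper names `find_openrouter_key` and `load_openrouter_key` and the `OPENROUTER_API_KEY` environment variable are assumptions introduced for illustration, and the pattern is simply the one used in the cell.

```python
import os
import re
from typing import Optional

# Same pattern as the notebook cell: "sk-or-v1-" followed by alphanumerics.
KEY_PATTERN = re.compile(r"sk-or-v1-[A-Za-z0-9]+")


def find_openrouter_key(text: str) -> Optional[str]:
    """Return the first substring that matches the key pattern, else None."""
    match = KEY_PATTERN.search(text)
    return match.group(0) if match else None


def load_openrouter_key(path: str, env_var: str = "OPENROUTER_API_KEY") -> str:
    """Look in the environment variable first (assumed name), then in the file."""
    key = find_openrouter_key(os.environ.get(env_var, ""))
    if key is None and os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            key = find_openrouter_key(f.read())
    if key is None:
        # Fail early so no model is ever constructed with api_key=None.
        raise FileNotFoundError(f"No OpenRouter key found in ${env_var} or {path}")
    return key
```

Unlike the cell above, this sketch raises when the file exists but contains no matching key, rather than continuing with `openrouter_key = None`.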


### Download context files
First we download the context files from the [Benchmark's Dataset](https://huggingface.co/datasets/adyen/DABstep) so that our agent can access them.


In [283]:
CONTEXT_FILENAMES = [
    "data/context/acquirer_countries.csv",
    "data/context/payments-readme.md",
    "data/context/payments.csv",
    "data/context/merchant_category_codes.csv",
    "data/context/fees.json",
    "data/context/merchant_data.json",
    "data/context/manual.md",
]
# Store data locally in the repo, not in /tmp
DATA_DIR = os.path.join(os.path.dirname(os.path.abspath(".")), "data", "context_files")
os.makedirs(DATA_DIR, exist_ok=True)

for filename in CONTEXT_FILENAMES:
    hf_hub_download(
        repo_id="adyen/DABstep",
        repo_type="dataset",
        filename=filename,
        local_dir=DATA_DIR,
        force_download=False
    )

CONTEXT_FILENAMES = [f"{DATA_DIR}/{filename}" for filename in CONTEXT_FILENAMES]

for file in CONTEXT_FILENAMES:
    if os.path.exists(file):
        print(f"{file} exists.")
    else:
        print(f"{file} does not exist.")

/home/mykola/repos/dabstep_test/data/context_files/data/context/acquirer_countries.csv exists.
/home/mykola/repos/dabstep_test/data/context_files/data/context/payments-readme.md exists.
/home/mykola/repos/dabstep_test/data/context_files/data/context/payments.csv exists.
/home/mykola/repos/dabstep_test/data/context_files/data/context/merchant_category_codes.csv exists.
/home/mykola/repos/dabstep_test/data/context_files/data/context/fees.json exists.
/home/mykola/repos/dabstep_test/data/context_files/data/context/merchant_data.json exists.
/home/mykola/repos/dabstep_test/data/context_files/data/context/manual.md exists.
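The existence check above only prints a message, so a missing file would surface later, in the middle of an agent run. A possible stricter variant is sketched below; the helper names `find_missing_files` and `require_files` are hypothetical, and treating an empty file as missing is an assumption rather than something the benchmark requires.

```python
import os
from typing import Iterable, List


def find_missing_files(paths: Iterable[str]) -> List[str]:
    """Return the paths that are not existing, non-empty regular files."""
    missing = []
    for path in paths:
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(path)
    return missing


def require_files(paths: Iterable[str]) -> None:
    """Raise before the agent starts if any context file is unavailable."""
    missing = find_missing_files(paths)
    if missing:
        raise FileNotFoundError("Missing context files: " + ", ".join(missing))
```

In the notebook this could be called as `require_files(CONTEXT_FILENAMES)` right after the download loop.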


## Agent

Here we set up a simple zero-shot prompt for the agent. It has two parts: the general prompt and a short outline of which files are available.

In [284]:
# Available OpenRouter models
MODELS_OPENROUTER = {
    "qwen2.5": "qwen/qwen-2.5-coder-32b-instruct",
    "qwen-coder-480b": "qwen/qwen3-coder",
    "deepseek-v3.1": "deepseek/deepseek-chat-v3.1",
    "deepseek-v3-terminus": "deepseek/deepseek-v3.1-terminus",
    "deepseek-v3.2": "deepseek/deepseek-v3.2-exp",
    "kimi-k2-0905": "moonshotai/kimi-k2-0905",
    "glm-4.5-air": "z-ai/glm-4.5-air",
    "glm-4.5": "z-ai/glm-4.5",
    "gpt-oss-20b": "openai/gpt-oss-20b",
    "gpt-oss-120b": "openai/gpt-oss-120b",
    "gpt-5-mini": "openai/gpt-5-mini",
    "gpt-5-nano": "openai/gpt-5-nano",
    "grok-code-fast1": "x-ai/grok-code-fast-1",
    "grok-4-fast": "x-ai/grok-4-fast",
}

# Select which model to use
MODEL_KEY = "grok-4-fast"
MODEL_ID = MODELS_OPENROUTER[MODEL_KEY]

model = OpenAIServerModel(
    model_id=MODEL_ID,
    api_base="https://openrouter.ai/api/v1",
    api_key=openrouter_key
)
print(f"✓ Using OpenRouter model: {MODEL_ID}")


✓ Using OpenRouter model: x-ai/grok-4-fast


## Testing the Agent

In [285]:
# Use the model instance created in the previous cell
MAX_STEPS = 7

agent = CodeAgent(
    tools=[],
    model=model,  # Use the model instance from previous cell
    additional_authorized_imports=["numpy", "pandas", "json", "csv", "os", "glob", "markdown"],
    max_steps=MAX_STEPS,
    verbosity_level=3,
)

In [None]:
PROMPT = """You are an expert data analyst and you will answer factoid questions by loading and referencing the files/documents listed below.
You have these files available:
{context_files}
Don't forget to reference any documentation in the data dir before answering a question.

Here is the question you need to answer:
{question}

Here are the guidelines you must follow when answering the question above:
{guidelines}
"""
question = "What are the unique set of merchants in the payments data?"
guidelines = "Answer with a comma separated list"

PROMPT = PROMPT.format(
    context_files=CONTEXT_FILENAMES,
    question=question,
    guidelines=guidelines
)

agent_start_time = time.time()
answer = agent.run(PROMPT)
agent_end_time = time.time()
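Because `CONTEXT_FILENAMES` is passed to `str.format` as a list, the prompt contains its Python `repr` (brackets and quotes), as the task text in the trace output shows. The sketch below renders one path per line instead. `build_prompt` is a hypothetical helper, and whether this formatting changes the agent's behavior has not been measured here.

```python
from typing import Sequence

# Same template text as the notebook's PROMPT.
PROMPT_TEMPLATE = """You are an expert data analyst and you will answer factoid questions by loading and referencing the files/documents listed below.
You have these files available:
{context_files}
Don't forget to reference any documentation in the data dir before answering a question.

Here is the question you need to answer:
{question}

Here are the guidelines you must follow when answering the question above:
{guidelines}
"""


def build_prompt(context_files: Sequence[str], question: str, guidelines: str) -> str:
    """Fill the template, listing each context file on its own bulleted line."""
    file_list = "\n".join(f"- {path}" for path in context_files)
    return PROMPT_TEMPLATE.format(
        context_files=file_list,
        question=question,
        guidelines=guidelines,
    )
```

Keeping the template separate from the filled-in prompt also avoids overwriting `PROMPT`, so the cell can be re-run with a different question.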


In [None]:
# You can inspect the steps taken by the agent by doing this
def clean_reasoning_trace(trace: list) -> list:
    for step in trace:
        # Remove memory from logs to make them more compact.
        if hasattr(step, "memory"):
            step.memory = None
        if isinstance(step, ActionStep):
            step.agent_memory = None
    return trace

# Access agent's reasoning trace
if hasattr(agent, 'memory') and hasattr(agent.memory, 'steps'):
    steps = agent.memory.steps
    for step in clean_reasoning_trace(steps):
        print(step)
else:
    print("Agent steps not available")

TaskStep(task="You are an expert data analyst and you will answer factoid questions by loading and referencing the files/documents listed below.\nYou have these files available:\n['/home/mykola/repos/dabstep_test/data/context_files/data/context/acquirer_countries.csv', '/home/mykola/repos/dabstep_test/data/context_files/data/context/payments-readme.md', '/home/mykola/repos/dabstep_test/data/context_files/data/context/payments.csv', '/home/mykola/repos/dabstep_test/data/context_files/data/context/merchant_category_codes.csv', '/home/mykola/repos/dabstep_test/data/context_files/data/context/fees.json', '/home/mykola/repos/dabstep_test/data/context_files/data/context/merchant_data.json', '/home/mykola/repos/dabstep_test/data/context_files/data/context/manual.md']\nDon't forget to reference any documentation in the data dir before answering a question.\n\nHere is the question you need to answer:\nWhat are the unique set of merchants in the payments data?\n\nHere are the guidelines you must

In [None]:
# Debug: Inspect step object structure
if hasattr(agent, 'memory') and hasattr(agent.memory, 'steps'):
    steps_list = agent.memory.steps
    if len(steps_list) > 0:
        print("First step object details:")
        first_step = steps_list[0]
        print(f"Type: {type(first_step)}")
        print(f"All attributes: {[attr for attr in dir(first_step) if not attr.startswith('_')]}")
        print(f"\nStep object repr:\n{first_step}")


First step object details:
Type: <class 'smolagents.memory.TaskStep'>
All attributes: ['dict', 'task', 'task_images', 'to_messages']

Step object repr:
TaskStep(task="You are an expert data analyst and you will answer factoid questions by loading and referencing the files/documents listed below.\nYou have these files available:\n['/home/mykola/repos/dabstep_test/data/context_files/data/context/acquirer_countries.csv', '/home/mykola/repos/dabstep_test/data/context_files/data/context/payments-readme.md', '/home/mykola/repos/dabstep_test/data/context_files/data/context/payments.csv', '/home/mykola/repos/dabstep_test/data/context_files/data/context/merchant_category_codes.csv', '/home/mykola/repos/dabstep_test/data/context_files/data/context/fees.json', '/home/mykola/repos/dabstep_test/data/context_files/data/context/merchant_data.json', '/home/mykola/repos/dabstep_test/data/context_files/data/context/manual.md']\nDon't forget to reference any documentation in the data dir before answering

In [None]:
notebook_end_time = time.time()
print(f"Notebook runtime: {notebook_end_time - notebook_start_time:.2f} seconds")

Notebook runtime: 24.04 seconds
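The notebook measures durations with pairs of `time.time()` calls spread across cells. One possible alternative is a small context manager that records the elapsed time even if the wrapped code raises. The name `timed` and the `timings` dictionary are hypothetical, and `time.perf_counter()` is swapped in for `time.time()` because it is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(timings: Dict[str, float], label: str) -> Iterator[None]:
    """Store the wall-clock duration of the with-block in timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs on success and on error, so failed runs are still timed.
        timings[label] = time.perf_counter() - start


# Hypothetical usage in the agent cell:
#     timings = {}
#     with timed(timings, "agent"):
#         answer = agent.run(PROMPT)
```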


In [None]:
# Calculate total tokens from agent execution
from smolagents.memory import ActionStep

total_input_tokens = 0
total_output_tokens = 0

# Extract token information from ActionStep objects in agent memory
if hasattr(agent, 'memory') and hasattr(agent.memory, 'steps'):
    for step in agent.memory.steps:
        if isinstance(step, ActionStep) and hasattr(step, 'token_usage'):
            token_usage = step.token_usage
            if token_usage is not None:
                # token_usage is typically a dict with 'input_tokens' and 'output_tokens'
                if isinstance(token_usage, dict):
                    total_input_tokens += token_usage.get('input_tokens', 0)
                    total_output_tokens += token_usage.get('output_tokens', 0)
                else:
                    # If it's an object with attributes
                    if hasattr(token_usage, 'input_tokens'):
                        total_input_tokens += token_usage.input_tokens
                    if hasattr(token_usage, 'output_tokens'):
                        total_output_tokens += token_usage.output_tokens

print(f"\nTotal Input Tokens: {total_input_tokens}")
print(f"Total Output Tokens: {total_output_tokens}")


Total Input Tokens: 8937
Total Output Tokens: 2566
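The nested branches in the token-counting cell can be collapsed into one accessor that handles both shapes of `token_usage` (a dict or an object with attributes). This is a sketch under two stated assumptions: the helper names `read_count` and `sum_token_usage` are hypothetical, and it counts any step that has a `token_usage` attribute instead of filtering on `ActionStep`. The example uses `SimpleNamespace` stand-ins, not real smolagents steps.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Tuple


def read_count(usage: Any, name: str) -> int:
    """Read a token count from a dict or an attribute-style object; 0 if absent."""
    if usage is None:
        return 0
    if isinstance(usage, dict):
        value = usage.get(name)
    else:
        value = getattr(usage, name, None)
    return int(value or 0)


def sum_token_usage(steps: Iterable[Any]) -> Tuple[int, int]:
    """Return (total_input_tokens, total_output_tokens) across all steps."""
    total_in = 0
    total_out = 0
    for step in steps:
        usage = getattr(step, "token_usage", None)
        total_in += read_count(usage, "input_tokens")
        total_out += read_count(usage, "output_tokens")
    return total_in, total_out


# Stand-in steps covering the dict, object, None, and missing-attribute cases.
example_steps = [
    SimpleNamespace(token_usage={"input_tokens": 10, "output_tokens": 2}),
    SimpleNamespace(token_usage=SimpleNamespace(input_tokens=5, output_tokens=1)),
    SimpleNamespace(token_usage=None),
    SimpleNamespace(),
]
print(sum_token_usage(example_steps))  # (15, 3)
```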


In [None]:
# Test Summary - Steps, Timing, Answer, and Persistence
import os
import csv
from datetime import datetime

print("\n" + "="*80)
print("TEST SUMMARY")
print("="*80)

# Print model and question info
print(f"\n📊 Model: {MODEL_ID}")
print(f"❓ Question: {question}")

# Calculate metrics
total_steps = 0
total_time = 0
if hasattr(agent, 'memory') and hasattr(agent.memory, 'steps'):
    steps_list = agent.memory.steps
    total_steps = len(steps_list)
    print(f"\n📈 Total Steps: {total_steps}")
    
    # Calculate agent execution time
    try:
        total_time = agent_end_time - agent_start_time
        print(f"\n⏱️  Total Agent Time: {total_time:.2f}s")
        print(f"   Notebook Runtime: {notebook_end_time - notebook_start_time:.2f}s")
    except NameError:
        print("\n⚠️  Agent timing information not available (run the agent execution cell first)")

# Print final answer
print(f"\n💬 Final Answer:")
print(f"{answer}")

# Print token info
print(f"\n🔤 Tokens:")
print(f"   Input Tokens: {total_input_tokens}")
print(f"   Output Tokens: {total_output_tokens}")

# Prepare results directory - save to repo/results/toy_results
# Navigate from notebooks directory to repo root
repo_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
results_dir = os.path.join(repo_root, "results", "toy_results")
os.makedirs(results_dir, exist_ok=True)
results_file = os.path.join(results_dir, "results.csv")

# Prepare result row
result_row = {
    "timestamp": datetime.now().isoformat(),
    "model": MODEL_ID,
    "question": question,
    "total_steps": total_steps,
    "total_agent_time_s": round(total_time, 2),
    "notebook_runtime_s": round(notebook_end_time - notebook_start_time, 2),
    "input_tokens": total_input_tokens,
    "output_tokens": total_output_tokens,
    "answer": str(answer)
}

# Save to CSV
file_exists = os.path.isfile(results_file)
with open(results_file, 'a', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=result_row.keys())
    if not file_exists:
        writer.writeheader()
    writer.writerow(result_row)

print(f"\n✅ Results saved to: {results_file}")

# Display all results as table
print("\n" + "="*80)
print("ALL TEST RESULTS")
print("="*80)

import pandas as pd
results_df = pd.read_csv(results_file)
print("\n")
print(results_df.to_string(index=False))
print("\n" + "="*80)


TEST SUMMARY

📊 Model: x-ai/grok-code-fast-1
❓ Question: What are the unique set of merchants in the payments data?

📈 Total Steps: 4

⏱️  Total Agent Time: 22.75s
   Notebook Runtime: 24.04s

💬 Final Answer:
Belles_cookbook_store, Crossfit_Hanna, Golfclub_Baron_Friso, Martinis_Fine_Steakhouse, Rafa_AI

🔤 Tokens:
   Input Tokens: 8937
   Output Tokens: 2566

✅ Results saved to: /home/mykola/repos/dabstep_test/results/toy_results/results.csv

ALL TEST RESULTS


                 timestamp                            model                                                   question  total_steps  total_agent_time_s  notebook_runtime_s  input_tokens  output_tokens                                                                                         answer
2025-11-11T16:41:28.839850 qwen/qwen-2.5-coder-32b-instruct What are the unique set of merchants in the payments data?            5               21.44               22.53         14403           1786 Crossfit_Hanna, B
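The summary cell appends one row per run and writes the header only when the file does not exist yet. A sketch of the same append-with-header pattern as a reusable function follows. `append_row` is a hypothetical name, and two details differ from the cell above: the column order is passed explicitly, and an existing but empty file also gets a header.

```python
import csv
import os
from typing import Dict, Sequence


def append_row(path: str, row: Dict[str, object], fieldnames: Sequence[str]) -> None:
    """Append one row to a CSV file, writing the header if the file is new or empty."""
    write_header = not os.path.isfile(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        # DictWriter raises ValueError if the row has keys outside fieldnames,
        # which guards against columns drifting between runs.
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

Passing a fixed `fieldnames` sequence keeps every row aligned with the header written on the first run, even if `result_row` later gains or reorders keys.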

In [None]:
results_df

| | timestamp | model | question | total_steps | total_agent_time_s | notebook_runtime_s | input_tokens | output_tokens | answer |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-11-11T16:41:28.839850 | qwen/qwen-2.5-coder-32b-instruct | What are the unique set of merchants in the pa... | 5 | 21.44 | 22.53 | 14403 | 1786 | Crossfit_Hanna, Belles_cookbook_store, Golfclu... |
| 1 | 2025-11-11T16:42:01.965437 | qwen/qwen-2.5-coder-32b-instruct | What are the unique set of merchants in the pa... | 5 | 21.63 | 22.81 | 12948 | 1211 | Crossfit_Hanna,Belles_cookbook_store,Golfclub_... |
| 2 | 2025-11-11T16:42:57.723621 | qwen/qwen3-coder | What are the unique set of merchants in the pa... | 3 | 7.97 | 9.03 | 4803 | 124 | Crossfit_Hanna, Belles_cookbook_store, Golfclu... |
| 3 | 2025-11-11T16:44:02.579388 | deepseek/deepseek-chat-v3.1 | What are the unique set of merchants in the pa... | 5 | 21.57 | 22.6 | 12438 | 527 | Crossfit_Hanna, Belles_cookbook_store, Golfclu... |
| 4 | 2025-11-11T16:45:09.917422 | deepseek/deepseek-v3.1-terminus | What are the unique set of merchants in the pa... | 5 | 32.73 | 33.89 | 12830 | 627 | Crossfit_Hanna, Belles_cookbook_store, Golfclu... |
| 5 | 2025-11-11T16:46:45.138991 | deepseek/deepseek-v3.2-exp | What are the unique set of merchants in the pa... | 5 | 28.67 | 29.66 | 12466 | 490 | Crossfit_Hanna, Belles_cookbook_store, Golfclu... |
| 6 | 2025-11-11T16:47:37.776844 | moonshotai/kimi-k2-0905 | What are the unique set of merchants in the pa... | 6 | 29.07 | 30.11 | 17788 | 724 | Belles_cookbook_store, Crossfit_Hanna, Golfclu... |
| 7 | 2025-11-11T16:48:16.399204 | z-ai/glm-4.5-air | What are the unique set of merchants in the pa... | 3 | 11.63 | 12.72 | 5814 | 333 | Crossfit_Hanna, Belles_cookbook_store, Golfclu... |
| 8 | 2025-11-11T16:49:43.515257 | z-ai/glm-4.5 | What are the unique set of merchants in the pa... | 6 | 49.57 | 50.66 | 17608 | 1026 | Belles_cookbook_store, Crossfit_Hanna, Golfclu... |
| 9 | 2025-11-11T16:50:38.087623 | openai/gpt-oss-20b | What are the unique set of merchants in the pa... | 9 | 21.16 | 22.21 | 23813 | 2576 | |
