# Notebook Test

## ReAct Verification

In [20]:
# test_agent.py

from agent import ToolAgent

# Instantiate the agent with your preferred model (must be running via Ollama)
agent = ToolAgent(model="phi3:instruct")  # or "phi3", etc.

# Define test questions
test_questions = [
    "What is 5 plus 7?",
    "Calculate 3 multiplied by 9.",
    "What is the remainder when 13 is divided by 5?",
    "What is the capital of France? Use wiki_search.",
    "Subtract 100 from 250.",
    "Divide 42 by 6.",
    "Who is Albert Einstein?",
]

# Run tests
for i, question in enumerate(test_questions):
    print(f"\n--- TEST {i+1} ---")
    print(f"Question: {question}")
    answer = agent(question)
    print(f"✅ Agent Answer: {answer}")



--- TEST 1 ---
Question: What is 5 plus 7?

📨 Prompt sent to Ollama (step 1):
You are an intelligent agent with access to tools.

To solve complex questions, you may use tools by writing:
Action: tool_name["arg1", "arg2"]

When you receive the result, respond with:
Observation: [tool_output]

Repeat this process as needed.

Always end with:
FINAL ANSWER: [your answer]

YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings. If you are asked for a number, don't use comma to write your number neither use units such as $ or percent sign unless specified otherwise. If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise. If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string.
Your answer should only start with "FINAL ANSWER: ", then follows w
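The prompt above asks the model to write tool calls as `Action: tool_name["arg1", "arg2"]` and to finish with `FINAL ANSWER: ...`. The following is a minimal sketch of how such output could be parsed. The helper names `parse_action` and `parse_final_answer` are hypothetical; the actual parsing logic in `agent.py` is not shown in this notebook and may differ.

```python
import json
import re

# Matches a line such as:  Action: wiki_search["capital of France"]
ACTION_RE = re.compile(r'Action:\s*(\w+)\[(.*?)\]\s*$', re.MULTILINE)
# Captures the rest of the line after "FINAL ANSWER:"
FINAL_RE = re.compile(r'FINAL ANSWER:\s*(.+)')


def parse_action(text):
    """Return (tool_name, args) for the first Action line, or None (hypothetical helper)."""
    match = ACTION_RE.search(text)
    if match is None:
        return None
    name, raw_args = match.group(1), match.group(2)
    try:
        # The arguments are written like a JSON array without the outer brackets.
        args = json.loads(f"[{raw_args}]")
    except json.JSONDecodeError:
        return None
    return name, args


def parse_final_answer(text):
    """Return the text after 'FINAL ANSWER:', or None if it is absent (hypothetical helper)."""
    match = FINAL_RE.search(text)
    return match.group(1).strip() if match else None
```

Returning `None` for malformed output lets the calling loop decide whether to re-prompt the model or stop, instead of raising in the middle of a run.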

## Questions Data

This is an extract from https://huggingface.co/spaces/baixianger/RobotPai/blob/main/test.ipynb


In [21]:
import json

# Load the question metadata: one JSON object per line, skipping blank lines.
with open('metadata.jsonl', 'r', encoding='utf-8') as jsonl_file:
    json_QA = [json.loads(line) for line in jsonl_file if line.strip()]

# Keep only Level 1 questions (Level may be stored as an int or a string).
json_QA_level1 = [item for item in json_QA if str(item.get("Level", "")) == "1"]


In [22]:
import random
# random.seed(42)
random_samples = random.sample(json_QA, 1)
for sample in random_samples:
    print("=" * 50)
    print(f"Task ID: {sample['task_id']}")
    print(f"Question: {sample['Question']}")
    print(f"Level: {sample['Level']}")
    print(f"Final Answer: {sample['Final answer']}")
    print(f"Annotator Metadata: ")
    print(f"  ├── Steps: ")
    for step in sample['Annotator Metadata']['Steps'].split('\n'):
        print(f"  │      ├── {step}")
    print(f"  ├── Number of steps: {sample['Annotator Metadata']['Number of steps']}")
    print(f"  ├── How long did this take?: {sample['Annotator Metadata']['How long did this take?']}")
    print(f"  ├── Tools:")
    for tool in sample['Annotator Metadata']['Tools'].split('\n'):
        print(f"  │      ├── {tool}")
    print(f"  └── Number of tools: {sample['Annotator Metadata']['Number of tools']}")
print("=" * 50)

Task ID: 42d4198c-5895-4f0a-b0c0-424a66465d83
Question: I'm curious about how much information is available for popular video games before their release. Find the Wikipedia page for the 2019 game that won the British Academy Games Awards. How many revisions did that page have before the month listed as the game's release date on that Wikipedia page (as of the most recent entry from 2022)?
Level: 2
Final Answer: 60
Annotator Metadata: 
  ├── Steps: 
  │      ├── 1. Search the web for British Academy Video Games Award for Best Game 2019
  │      ├── 2. Find the answer, Outer Wilds
  │      ├── 3. Find the Wikipedia page for Outer Wilds
  │      ├── 4. Go to the last revision from 2022.
  │      ├── 5. Note the release date, May 29, 2019
  │      ├── 6. View the page history
  │      ├── 7. Count how many edits were made to the page before May 2019
  │      ├── 8. Arrive at the answer, 60
  ├── Number of steps: 8
  ├── How long did this take?: 30 minutes
  ├── Tools:
  │      ├── 1. Web b

In [23]:
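# Count how often each tool label appears across all samples
from collections import Counter, OrderedDict

tools = []
for sample in json_QA:
    for tool in sample['Annotator Metadata']['Tools'].split('\n'):
        # Drop the leading list number ("1.", "2.", ...). This assumes a single-digit
        # index and also truncates unnumbered entries such as "None"
        # (likely the source of "ne" in the output).
        tool = tool[2:].strip().lower()
        # Drop an 11-character leading marker, presumably "(optional) ".
        if tool.startswith("("):
            tool = tool[11:].strip()
        tools.append(tool)
tools_counter = OrderedDict(Counter(tools))
print("List of tools used in all samples:")
# len(tools_counter) is the number of distinct tool labels, not the total number of uses.
print("Total number of tools used:", len(tools_counter))
for tool, count in tools_counter.items():
    print(f"  ├── {tool}: {count}")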

List of tools used in all samples:
Total number of tools used: 83
  ├── web browser: 107
  ├── image recognition tools (to identify and parse a figure with three axes): 1
  ├── search engine: 101
  ├── calculator: 34
  ├── unlambda compiler (optional): 1
  ├── a web browser.: 2
  ├── a search engine.: 2
  ├── a calculator.: 1
  ├── microsoft excel: 5
  ├── google search: 1
  ├── ne: 9
  ├── pdf access: 7
  ├── file handling: 2
  ├── python: 3
  ├── image recognition tools: 12
  ├── jsonld file access: 1
  ├── video parsing: 1
  ├── python compiler: 1
  ├── video recognition tools: 3
  ├── pdf viewer: 7
  ├── microsoft excel / google sheets: 3
  ├── word document access: 1
  ├── tool to extract text from images: 1
  ├── a word reversal tool / script: 1
  ├── counter: 1
  ├── excel: 3
  ├── image recognition: 5
  ├── color recognition: 3
  ├── excel file access: 3
  ├── xml file access: 1
  ├── access to the internet archive, web.archive.org: 1
  ├── text processing/diff tool: 1
  ├── gi
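The counts above split what looks like the same tool across several labels (for example `web browser` and `a web browser.`), and `ne` is most likely `None` with its first two characters sliced off. The following is a sketch of a more tolerant normalization, assuming the annotator lines are numbered like `1. Web browser`. The helper `normalize_tool` is hypothetical, and the example strings are illustrative, modeled on the labels in the output rather than copied from `metadata.jsonl`.

```python
import re
from collections import Counter


def normalize_tool(raw):
    """Normalize one line of an annotator's tool list (hypothetical helper)."""
    tool = raw.strip().lower()
    tool = re.sub(r'^\d+\.\s*', '', tool)          # drop "1. ", "12. ", ...
    tool = re.sub(r'^\(optional\)\s*', '', tool)   # drop a leading "(optional)"
    tool = re.sub(r'^(a|an)\s+', '', tool)         # drop a leading article
    return tool.rstrip('.').strip()                # drop a trailing period


# Illustrative inputs only (assumed format).
example_lines = ["1. Web browser", "2. A web browser.", "None", "10. Calculator"]
counts = Counter(normalize_tool(line) for line in example_lines)
```

Matching the list number with a regular expression, rather than slicing a fixed two characters, leaves unnumbered entries intact and handles indices of 10 or more.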

## Evaluation on Data


In [24]:
import random
from agent import ToolAgent  # Your local agent


# Set seed for reproducibility
random.seed(42)
evaluation_samples = random.sample(json_QA_level1, 10)  # Adjust the sample size if needed

# Initialize your local agent (make sure Ollama is running)
agent = ToolAgent(model="phi3:instruct")

# Store results
results = []

for sample in evaluation_samples:
    task_id = sample["task_id"]
    question = sample["Question"]
    expected = sample["Final answer"].strip().lower()

    try:
        # Call your agent directly
        print(f"\n🟨 --- TRACE FOR TASK {task_id} ---")
        print(f"🧠 Question: {question}")
        answer = agent(question).strip().lower()
        print(f"✅ Agent Answer: {answer}")
    except Exception as e:
        answer = f"ERROR: {e}"
        print(f"❌ ERROR during agent call: {e}")

    results.append({
        "task_id": task_id,
        "question": question,
        "expected": expected,
        "answer": answer,
        "correct": answer == expected
    })


🟨 --- TRACE FOR TASK 7d4a7d1d-cac6-44a8-96e8-ea9584a70825 ---
🧠 Question: According to Girls Who Code, how long did it take in years for the percentage of computer scientists that were women to change by 13% from a starting point of 37%?

📨 Prompt sent to Ollama (step 1):
You are an intelligent agent with access to tools.

To solve complex questions, you may use tools by writing:
Action: tool_name["arg1", "arg2"]

When you receive the result, respond with:
Observation: [tool_output]

Repeat this process as needed.

Always end with:
FINAL ANSWER: [your answer]

YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings. If you are asked for a number, don't use comma to write your number neither use units such as $ or percent sign unless specified otherwise. If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise. If you are asked fo
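The evaluation loop above counts an answer as correct only when it equals the expected string exactly after `strip().lower()`, so `22.0` versus `22` or `castle` versus `the castle` would be scored as wrong. The following is a sketch of a more lenient comparison. The helpers `normalize_answer` and `answers_match` are hypothetical, and this is not presented as the benchmark's official scoring rule.

```python
import re


def normalize_answer(text):
    """Lowercase, trim, and drop a leading article, stray symbols, and a final period."""
    text = text.strip().lower()
    text = re.sub(r'^(the|a|an)\s+', '', text)
    text = re.sub(r'[^\w\s.,-]', '', text)
    return re.sub(r'\s+', ' ', text).strip().rstrip('.')


def answers_match(answer, expected):
    """Compare numerically when both sides parse as numbers, otherwise as normalized text."""
    try:
        return float(answer.replace(',', '')) == float(expected.replace(',', ''))
    except ValueError:
        return normalize_answer(answer) == normalize_answer(expected)
```

In the loop, `"correct": answers_match(answer, expected)` could replace the exact comparison. Reporting both the strict and the lenient score would make it clear how much of the gap comes from formatting rather than from wrong answers.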

In [25]:
import pandas as pd
# Create a DataFrame for analysis
df_results = pd.DataFrame(results)
df_results["score"] = df_results["correct"].astype(int)

# Display summary
print("\n📊 EVALUATION SUMMARY:")
print(df_results[["task_id", "correct", "expected", "answer"]])
print(f"\n✅ Accuracy: {df_results['correct'].mean() * 100:.2f}%")



📊 EVALUATION SUMMARY:
                                task_id  correct           expected  \
0  7d4a7d1d-cac6-44a8-96e8-ea9584a70825    False                 22   
1  cffe0e32-c9a6-4c52-9877-78ceb4aaa9fb    False               fred   
2  8e867cd7-cff9-4e6c-867a-ff5ddc2550be    False                  3   
3  bda648d7-d618-4883-88f4-3466eabd860e    False   saint petersburg   
4  935e2cff-ae78-4218-b3f5-115589b19dae    False           research   
5  b415aba4-4b68-4fc6-9b89-2c812e55a3e1    False            diamond   
6  42576abe-0deb-4869-8c63-225c2d75a95a    False  maktay mato apple   
7  2d83110e-a098-4ebb-9987-066c06fa42d0    False              right   
8  4b6bb5f7-f634-410e-815d-e673ab7f8632    False         the castle   
9  23dd907f-1261-4488-b21c-e9185af91d5e    False                  2   

                                              answer  
0              insufficient data to answer precisely  
1   unable to identify based on provided information  
2                             