# Notebook Test

## Agent Initialization

The agent is shared by several cells below, so we create it once here rather than re-initializing it in each cell.

In [None]:
from agent import ToolAgent  # Your local agent

agent = ToolAgent(model="phi3:instruct")

## Question Data from the GAIA Dataset

In this section, we import questions from the GAIA dataset and extract information about which tools are used in each question. This allows us to prioritize the implementation of the most relevant tools.


### Import

In [21]:
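import json

# Read the GAIA metadata file: one JSON object per line (JSON Lines)
with open('Data/metadata.jsonl', 'r', encoding='utf-8') as jsonl_file:
    json_list = list(jsonl_file)

# Parse each line into a dict
json_QA = []
for json_str in json_list:
    json_data = json.loads(json_str)
    json_QA.append(json_data)

# Keep only level 1 questions ("Level" may be stored as an int or a str)
json_QA_level1 = [item for item in json_QA if str(item.get("Level", "")) == "1"]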
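
The loading and filtering steps above can also be wrapped in small helpers. This is a minimal sketch, not the notebook's own code: `load_jsonl` and `filter_by_level` are hypothetical names, and the sample lines are invented stand-ins for `Data/metadata.jsonl`. The only behavioral difference is that blank lines are skipped, since `json.loads` would raise on them.

```python
import json
from typing import Any, Iterable


def load_jsonl(lines: Iterable[str]) -> list[dict[str, Any]]:
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def filter_by_level(records: list[dict[str, Any]], level: int) -> list[dict[str, Any]]:
    """Keep records whose "Level" matches, whether stored as int or str."""
    return [r for r in records if str(r.get("Level", "")) == str(level)]


# Invented sample lines (assumption: real records carry many more fields)
sample_lines = [
    '{"task_id": "a", "Level": 1}\n',
    "\n",
    '{"task_id": "b", "Level": "2"}\n',
]

records = load_jsonl(sample_lines)
level1 = filter_by_level(records, 1)
print(len(records), [r["task_id"] for r in level1])
```

With the real file, the same helper could be called as `load_jsonl(open('Data/metadata.jsonl', encoding='utf-8'))`, since a file object iterates line by line.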


### Metadata for one question

In [None]:
import random
# random.seed(42)
random_samples = random.sample(json_QA, 1)
for sample in random_samples:
    print("=" * 50)
    print(f"Task ID: {sample['task_id']}")
    print(f"Question: {sample['Question']}")
    print(f"Level: {sample['Level']}")
    print(f"Final Answer: {sample['Final answer']}")
    print(f"Annotator Metadata: ")
    print(f"  ├── Steps: ")
    for step in sample['Annotator Metadata']['Steps'].split('\n'):
        print(f"  │      ├── {step}")
    print(f"  ├── Number of steps: {sample['Annotator Metadata']['Number of steps']}")
    print(f"  ├── How long did this take?: {sample['Annotator Metadata']['How long did this take?']}")
    print(f"  ├── Tools:")
    for tool in sample['Annotator Metadata']['Tools'].split('\n'):
        print(f"  │      ├── {tool}")
    print(f"  └── Number of tools: {sample['Annotator Metadata']['Number of tools']}")
print("=" * 50)

### Used tools summary

In [None]:
# Count how often each tool appears across all samples
from collections import Counter, OrderedDict

tools = []
for sample in json_QA:
    for tool in sample['Annotator Metadata']['Tools'].split('\n'):
        tool = tool[2:].strip().lower()  # drop the leading list number ("1.")
        if tool.startswith("("):
            # drop a leading parenthesized marker, presumably "(optional) " (11 characters)
            tool = tool[11:].strip()
        tools.append(tool)
tools_counter = OrderedDict(Counter(tools))
print("List of tools used in all samples:")
print("Number of distinct tools:", len(tools_counter))
for tool, count in tools_counter.items():
    print(f"  ├── {tool}: {count}")
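
The fixed-width slices in the cell above assume a one-digit list number and an 11-character marker. A regex-based variant is sketched below as one possible alternative. `normalize_tool` and `count_tools` are hypothetical names, the sample strings are invented, and the `N. name` / `(optional) name` entry format is an assumption inferred from the slicing, not confirmed against the dataset.

```python
import re
from collections import Counter

# Assumed entry formats: "3. Web browser", "12. (Optional) Calculator"
_NUMBERING = re.compile(r"^\s*\d+\.\s*")
_OPTIONAL = re.compile(r"^\(optional\)\s*")


def normalize_tool(raw: str) -> str:
    """Strip list numbering and an '(optional)' marker, then lowercase."""
    tool = _NUMBERING.sub("", raw).strip().lower()
    return _OPTIONAL.sub("", tool).strip()


def count_tools(tool_fields: list[str]) -> Counter:
    """Count normalized tool names over newline-separated 'Tools' fields."""
    counter: Counter = Counter()
    for field in tool_fields:
        for raw in field.split("\n"):
            name = normalize_tool(raw)
            if name:  # ignore empty lines
                counter[name] += 1
    return counter


# Invented examples of the 'Tools' metadata field
fields = [
    "1. Web browser\n2. Search engine",
    "1. (Optional) Web browser\n10. Calculator",
]
counts = count_tools(fields)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

Unlike `tool[2:]`, the numbering pattern also handles two-digit list numbers such as `10.`, and `most_common()` lists the most frequent tools first, which matches the goal of prioritizing tool implementation.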

## Verification of Proper Tool Usage

Before running the agent on the dataset, we first check that the agent and its tools work correctly on simple questions, then move on to more complex ones.

### Tool verification

The following block is intended for directly testing the tools. This ensures that when the Agent invokes a tool, it performs as expected.

In [None]:
import pandas as pd
from tools import ToolExecutor

# List of test cases to run
test_cases = [
    ("add", ["3", "5"]),
    ("multiply", ["7", "6"]),
    ("subtract", ["10", "4"]),
    ("divide", ["20", "5"]),
    ("modulus", ["13", "5"]),
    ("wiki_search", ["Albert Einstein"]),
    ("web_search", ["current president of France"]),
]

# Store the results
results = []

for tool_name, args in test_cases:
    args_str = ', '.join(f'"{arg}"' for arg in args)
    command = f'Action: {tool_name}[{args_str}]'
    print(f"\n🛠️ Testing tool: {tool_name}")
    print(f"➡️ Command: {command}")
    result = ToolExecutor.execute(command)
    print(f"📤 Result: {result}")
    results.append({
        "tool": tool_name,
        "command": command,
        "result": result,
        "success": "Observation:" in result and "error" not in result.lower()
    })

# Final summary
df = pd.DataFrame(results)
print("\n📊 TEST SUMMARY:")
print(df[["tool", "success"]])
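
The success flag above only checks that an observation came back without the word "error". The sketch below shows one way to also compare against an expected value. It rests on assumptions: `build_command` and `is_success` are hypothetical helpers, and `fake_execute` is an invented stand-in for `ToolExecutor.execute`, whose real output format is not shown in this notebook beyond the `Observation:` prefix.

```python
from typing import Optional


def build_command(tool_name: str, args: list[str]) -> str:
    """Format a tool call the same way the test loop above does."""
    args_str = ", ".join(f'"{arg}"' for arg in args)
    return f"Action: {tool_name}[{args_str}]"


def is_success(result: str, expected: Optional[str] = None) -> bool:
    """Same check as the cell above, plus an optional expected substring."""
    ok = "Observation:" in result and "error" not in result.lower()
    if ok and expected is not None:
        ok = expected in result
    return ok


def fake_execute(command: str) -> str:
    """Hypothetical stand-in for ToolExecutor.execute (assumption)."""
    if command == 'Action: add["3", "5"]':
        return "Observation: 8"
    return "Observation: Error: unknown tool"


cmd = build_command("add", ["3", "5"])
print(cmd)
print(is_success(fake_execute(cmd), expected="8"))
print(is_success(fake_execute('Action: nope["1"]')))
```

The substring test is deliberately loose (an expected `"8"` would also match `"18"`), so it suits quick smoke tests rather than strict verification.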


### Call verification

This section is used to test whether the agent correctly selects and uses the appropriate tool when given simple, direct questions.

In [None]:
test_questions = [
    {"id": "q_add", "question": "What is 12 plus 30?","expected": "42"},
    {"id": "q_subtract", "question": "What is 100 minus 33?","expected": "67"},
    {"id": "q_multiply", "question": "What is 8 multiplied by 7?","expected": "56"},
    {"id": "q_divide", "question": "What is 81 divided by 9?","expected": "9"},
    {"id": "q_wiki", "question": "Who developed the theory of evolution?","expected": "Charles Darwin"},
    {"id": "q_web", "question": "Who is the current president of the United States?","expected": "Donald Trump"},
    {"id": "q_extract", "question": "Who founded Wikipedia?","expected":"Jimmy Wales, Larry Sanger"},
    {"id": "q_chain", "question": "What is the sum of 5 and 6, multiplied by 3?","expected":"33"}
]

for test in test_questions:
    print(f"🟨 --- Testing {test['id']} ---")
    question_unique = test["question"]

    # Run with tracing enabled
    logged = agent(question_unique, log=True)
    print("\n📜 Full trace with log:")
    print("✅ Final answer:", logged['final_answer'],"   |   Expected:", test["expected"])
    print("🛠️ Tools used:", logged['used_tools'])
    # print("📜 Trace:\n", logged['trace'])
    print("\n" + "="*80 + "\n")



## Evaluation on GAIA data

In this section, we select random level 1 questions from the GAIA dataset and test our agent to evaluate its ability to answer them correctly.

### Running the evaluation

In [None]:
import random
from agent import ToolAgent  # Your local agent

# Set seed for reproducibility
random.seed(1)
evaluation_samples = random.sample(json_QA_level1, 15)  # Adjust the sample size if needed

# The agent should already exist from the initialization cell; if not, create it here first

# Results are stored here
results = []

for sample in evaluation_samples:
    task_id = sample["task_id"]
    question = sample["Question"]
    expected = sample["Final answer"].strip().lower()

    try:
        # Call the agent in logging mode
        print(f"\n🟨 --- TRACE FOR TASK {task_id} ---")
        print(f"🧠 Question: {question}")
        response = agent(question, log=True)  # log=True also returns the tools used and the trace

        answer = response["final_answer"].strip().lower()
        tools_used = response["used_tools"]
        trace = response["trace"]

        print(f"✅ Agent Answer: {answer}")
        print(f"🛠️ Tools used: {tools_used}")
        # print(f"📜 Trace:\n{trace}") # Uncomment this if you want more details about the reasoning process

    except Exception as e:
        answer = f"ERROR: {e}"
        tools_used = []
        trace = f"ERROR TRACE: {e}"
        print(f"❌ ERROR during agent call: {e}")

    results.append({
        "task_id": task_id,
        "question": question,
        "expected": expected,
        "answer": answer,
        "tools_used": tools_used,
        "correct": answer == expected,
        "trace": trace
    })
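
The `answer == expected` check above is an exact string match after lowercasing, so an answer like `16,000` or `right.` counts as wrong. The following sketch shows a more lenient comparison. It is an illustration, not GAIA's official scoring: `normalize_answer` and `answers_match` are hypothetical helpers, and the rules (trailing punctuation, thousands separators, order-insensitive lists) are assumptions about what should count as equivalent.

```python
import re
from typing import Optional

_THOUSANDS = re.compile(r"^-?\d{1,3}(,\d{3})+(\.\d+)?$")


def normalize_answer(text: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    text = re.sub(r"\s+", " ", text.strip().lower())
    return re.sub(r"[.!?]+$", "", text)


def _to_number(text: str) -> Optional[float]:
    """Parse a number, accepting thousands separators like '16,000'."""
    if _THOUSANDS.match(text):
        text = text.replace(",", "")
    try:
        return float(text)
    except ValueError:
        return None


def answers_match(answer: str, expected: str) -> bool:
    a, e = normalize_answer(answer), normalize_answer(expected)
    if a == e:
        return True
    na, ne = _to_number(a), _to_number(e)
    if na is not None and ne is not None:
        return na == ne  # numeric comparison
    if "," in e:
        # Compare comma-separated lists regardless of order
        def items(s: str) -> list[str]:
            return sorted(p.strip() for p in s.split(",") if p.strip())
        return items(a) == items(e)
    return False


print(answers_match("Right.", "right"))
print(answers_match("16,000", "16000"))
print(answers_match("white, green", "green, white"))
print(answers_match("65", "3"))
```

Ignoring list order is a design choice: some questions ask for a specific ordering, in which case the list branch should be removed so that only exact matches pass.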


### Display Results

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display


# `results` must already be defined by the evaluation loop above

# Build the DataFrame
df_results = pd.DataFrame(results)
df_results["correct"] = df_results["correct"].astype(bool)
df_results["used_tool"] = df_results["tools_used"].apply(lambda tools: bool(tools and len(tools) > 0))

# Overall summary
accuracy = df_results["correct"].mean() * 100
tool_usage = df_results["used_tool"].mean() * 100
average_tool_count = df_results["tools_used"].apply(lambda tools: len(tools) if tools else 0).mean()

# Console output
print(f"\n✅ Accuracy: {accuracy:.2f}%")
print(f"🛠️ Tool usage rate: {tool_usage:.2f}%")
print(f"🛠️ Average tools count: {average_tool_count:.2f}")
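
The global metrics above can be complemented by a per-tool breakdown, which shows whether failures cluster around a particular tool. This is a minimal standard-library sketch: `accuracy_by_tool` is a hypothetical helper, and `demo_results` is invented data that only mimics the `tools_used` and `correct` fields of the `results` list.

```python
from collections import defaultdict


def accuracy_by_tool(results: list[dict]) -> dict[str, float]:
    """Fraction of correct answers among tasks where each tool was used."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for row in results:
        # Count each tool once per task; tasks without tools get their own bucket
        for tool in set(row["tools_used"]) or {"<no tool>"}:
            totals[tool] += 1
            hits[tool] += int(row["correct"])
    return {tool: hits[tool] / totals[tool] for tool in totals}


# Invented rows in the same shape as the evaluation results
demo_results = [
    {"tools_used": ["add"], "correct": True},
    {"tools_used": ["wiki_search", "wiki_search"], "correct": False},
    {"tools_used": ["add", "wiki_search"], "correct": True},
    {"tools_used": [], "correct": False},
]

per_tool = accuracy_by_tool(demo_results)
for tool, acc in sorted(per_tool.items()):
    print(f"{tool}: {acc:.0%}")
```

Calling `accuracy_by_tool(results)` on the real list would work the same way, as long as each row keeps `tools_used` as a list.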



In [22]:
import pandas as pd

# Build the full DataFrame from the results list
df_results = pd.DataFrame(results)

# Add analysis columns (optional but useful)
df_results["used_tool"] = df_results["tools_used"].apply(lambda tools: bool(tools and len(tools) > 0))
df_results["nb_tools"] = df_results["tools_used"].apply(lambda tools: len(tools) if tools else 0)

# Columns to display (all relevant content)
columns_to_display = [
    "task_id",
    "question",
    "expected",
    "answer",
    "tools_used",
    "correct",
    "used_tool",
    "nb_tools",
    "trace"
]
columns_synthetiques = ["task_id", "correct", "used_tool", "nb_tools", "tools_used","answer","expected","question"]

# Build the full table and the summary table
full_summary_df = df_results[columns_to_display]
summary_df = df_results[columns_synthetiques]

full_summary_df.to_csv("Results/resultats_complets.csv", index=False)
summary_df.to_csv("Results/resultats.csv", index=False)

df = pd.read_csv("Results/resultats.csv")

from IPython.display import HTML
HTML(df.to_html(max_rows=100, max_cols=20))

# For just one line
# print(df_results[df_results["task_id"] == 3].iloc[0])



Unnamed: 0,task_id,correct,used_tool,nb_tools,tools_used,answer,expected,question
0,2d83110e-a098-4ebb-9987-066c06fa42d0,False,True,2,"['extract_answer', 'wiki_search']",list including rewsnee elementary school.,right,".rewsna eht sa ""tfel"" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI"
1,e142056d-56ab-4352-b091-b56054bd1359,False,True,1,['multiply'],6800,16000,"Bob was invited to participate in a game show, and he advanced to the final round. The final round offered Bob the chance to win a large sum by playing a game against the host. The host has 30 shiny prop coins, each of which is worth $1,000 if Bob manages to win them by playing the game. The host hides the coins in three different prize boxes and then shuffles their order. The only rule restricting the host's coin placement is that one box must contain at least 2 coins, and one box must contain 6 more coins than another box. In order to play, Bob must submit three guesses, one guess for the number of coins in each box. The box is then opened and the number of coins is revealed. If Bob's guess is a number greater than the number of coins in the box, Bob earns no coins. If Bob guesses a number equal to or less than the number of coins in the box, Bob wins a number of coins equal to his guess.\n\nIf Bob plays uses the optimal strategy, what's the minimum amount of money he can win from the game?"
2,50ec8903-b81f-4257-9450-1085afd2c319,False,True,2,"['wiki_search', 'wiki_search']","blue, orange","green, white","A standard Rubik’s cube has been broken into cubes making up its sides. The cubes are jumbled, and one is removed. There are 6 cubes with one colored face, 12 edge cubes with two colored faces, and 8 corner cubes with three colored faces. All blue cubes have been found. All cubes directly left, right, above, and below the orange center cube have been found, along with the center cube. The green corners have all been found, along with all green that borders yellow. For all orange cubes found, the opposite face’s cubes have been found. The removed cube has two colors on its faces. What are they? Answer using a comma separated list, with the colors ordered alphabetically."
3,a1e91b78-d3d8-4675-bb8d-62741b4b68a6,False,True,1,['web_search'],65,3,"In the video https://www.youtube.com/watch?v=L1vXCYZAYYM, what is the highest number of bird species to be on camera simultaneously?"
4,cca530fc-4052-43b2-b130-b30968d8aa44,False,True,1,['wiki_search'],ra8+,rd5,Review the chess position provided in the image. It is black's turn. Provide the correct next move for black which guarantees a win. Please provide your response in algebraic notation.
5,cffe0e32-c9a6-4c52-9877-78ceb4aaa9fb,False,True,2,"['extract_answer', 'wiki_search']","not identifiable from the available data; further investigation needed using web search or extracting additional context if possible. if unable to identify, it might be john doe assuming he is mentioned in a related document but unspecified herein.",fred,"An office held a Secret Santa gift exchange where each of its twelve employees was assigned one other employee in the group to present with a gift. Each employee filled out a profile including three likes or hobbies. On the day of the gift exchange, only eleven gifts were given, each one specific to one of the recipient's interests. Based on the information in the document, who did not give a gift?"
6,d0633230-7067-47a9-9dbf-ee11e0a2cdd6,False,True,1,['wiki_search'],"knn, logisticregressionbaseclassifier",baselabelpropagation,"In the Scikit-Learn July 2017 changelog, what other predictor base command received a bug fix? Just give the name, not a path."
7,cabe07ed-9eca-40ea-8ead-410ef5e83f91,False,True,1,['wiki_search'],dr. linda peters,louvrier,What is the surname of the equine veterinarian mentioned in 1.E Exercises from the chemistry materials licensed by Marisa Alviar-Agnew & Henry Agnew under the CK-12 license in LibreText's Introductory Chemistry materials as compiled 08/21/2023?
8,99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3,False,True,1,['web_search'],"fresh strawberries, lemons (zest), sugar","cornstarch, freshly squeezed lemon juice, granulated sugar, pure vanilla extract, ripe strawberries","Hi, I'm making a pie but I could use some help with my shopping list. I have everything I need for the crust, but I'm not sure about the filling. I got the recipe from my friend Aditi, but she left it as a voice memo and the speaker on my phone is buzzing so I can't quite make out what she's saying. Could you please listen to the recipe and list all of the ingredients that my friend described? I only want the ingredients for the filling, as I have everything I need to make my favorite pie crust. I've attached the recipe as Strawberry pie.mp3.\n\nIn your response, please only list the ingredients, not any measurements. So if the recipe calls for ""a pinch of salt"" or ""two cups of ripe strawberries"" the ingredients on the list would be ""salt"" and ""ripe strawberries"".\n\nPlease format your response as a comma separated list of ingredients. Also, please alphabetize the ingredients."
9,dc22a632-937f-4e6a-b72f-ba0ff3f5ff97,False,True,1,['web_search'],"the complete title is not explicitly mentioned, but it relates to recommendations by james beard award winners within the context of chris schlesinger and alvin yapa's book ""the food lover's guide to las vegas"" regarding new mexican food.",five hundred things to eat before it's too late: and the very best places to eat them,What was the complete title of the book in which two James Beard Award winners recommended the restaurant where Ali Khan enjoyed a New Mexican staple in his cost-conscious TV show that started in 2015? Write the numbers in plain text if there are some in the title.
