### Test AI Agent 🤖 Tool Calling with DeepEval 🧪

Testing an AI agent includes testing the tools it invokes. Given an input, the agent picks the tools it needs from those bound to it, calls them, and uses their results to build its response. The checks below verify that the agent calls the right tool for a given query.


<img src="./img/AIAGent.png" width="800" height="400" style="display: block; margin: auto;">
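To make "calling the right tool" concrete, here is a minimal sketch of a tool-correctness score: the fraction of expected tool names that show up among the tools the agent called. This is a simplified stand-in for illustration, not DeepEval's implementation of `ToolCorrectnessMetric`; the function name `tool_correctness` is hypothetical.

```python
def tool_correctness(expected_tools, tools_called):
    """Fraction of expected tool names that appear among the called tool names.

    Simplified illustration only; DeepEval's own metric may score differently.
    """
    expected = set(expected_tools)
    if not expected:
        # Nothing was expected: treat it as correct only if nothing was called.
        return 1.0 if not tools_called else 0.0
    called = set(tools_called)
    return len(expected & called) / len(expected)


print(tool_correctness(["add_numbers"], ["add_numbers"]))
print(tool_correctness(["add_numbers"], ["search_tool"]))
print(tool_correctness(["add_numbers", "search_tool"], ["add_numbers"]))
```

A score of 1.0 means every expected tool was called; anything lower points to a missing or wrong tool call.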

In [9]:
#!pip install -qU duckduckgo-search

In [10]:
import os
import deepeval

# Read the key from the environment instead of hardcoding it (assumes CONFIDENT_API_KEY is exported).
deepeval.login_with_confident_api_key(os.environ["CONFIDENT_API_KEY"])

In [11]:
!deepeval set-ollama deepseek-r1:8b

🙌 Congratulations! You're now using a local Ollama model for all evals that 
require an LLM.


In [12]:
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.tracing import observe, update_current_span

In [13]:
from langchain_ollama import ChatOllama

@observe(type='llm', model='qwen2.5:latest')
def local_llms():
    return ChatOllama(
        base_url="http://localhost:11434",
        model="qwen2.5:latest",
        temperature=0.5,
        num_predict=250,  # ChatOllama's option for capping generated tokens
    )

llm = local_llms()

#### AI Agent with Tools

In [14]:
from langchain.tools import tool
from langchain.agents import initialize_agent, AgentType
from langchain_community.tools import DuckDuckGoSearchRun

s_tool = DuckDuckGoSearchRun()

@tool
@observe(type='tool')
def add_numbers(a: int, b: int) -> str:
    "Add two numbers and return the result as a sentence."
    result = int(a) + int(b)
    return f"The sum of {a} and {b} is {result}"

@tool
@observe(type='tool')
def subtract_numbers(a: int, b: int) -> str:
    "Subtract the second number from the first and return the result as a sentence."
    result = int(a) - int(b)
    return f"The difference of {a} and {b} is {result}"

@tool
@observe(type='tool')
def search_tool(query: str) -> str:
    "Search online for the given query and return the results."
    return s_tool.run(query)

tools = [add_numbers, subtract_numbers, search_tool]


@observe(type='agent', available_tools=["add_numbers", "subtract_numbers", "search_tool"], metrics=[ToolCorrectnessMetric()])
def main_ai_agent(query):
    agent = initialize_agent(
        tools= tools,
        llm=llm,
        agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
        verbose=True,
        return_intermediate_steps=True
    )
    
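    response = agent.invoke(query)

    # Record the tools the agent actually called, taken from the (action, observation)
    # pairs that return_intermediate_steps=True adds to the response.
    tools_called = [
        ToolCall(name=action.tool) for action, _ in response["intermediate_steps"]
    ]

    update_current_span(
        test_case=LLMTestCase(
            input=query,
            tools_called=tools_called,
            expected_tools=[ToolCall(name="add_numbers")],
            actual_output=response['output']
        )
    )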
    
    return response
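
Because the agent is created with `return_intermediate_steps=True`, the response dict carries an `intermediate_steps` list of `(action, observation)` pairs alongside `output`. The sketch below flattens those pairs into plain records holding each tool's name, arguments, and output, which is the kind of detail a richer test case could use. `FakeAction` is a hypothetical stand-in for LangChain's `AgentAction` (which exposes `tool` and `tool_input`), and `summarize_steps` is an illustrative helper, not part of either library.

```python
from collections import namedtuple

# Hypothetical stand-in for LangChain's AgentAction, used so the example runs without LangChain.
FakeAction = namedtuple("FakeAction", ["tool", "tool_input"])


def summarize_steps(response):
    """Flatten (action, observation) pairs into dicts of tool name, input, and output."""
    records = []
    for action, observation in response.get("intermediate_steps", []):
        records.append(
            {"name": action.tool, "input": action.tool_input, "output": observation}
        )
    return records


# Response shaped like the one the agent returns for the addition query.
response = {
    "output": "The sum of 20 and 90 is 110",
    "intermediate_steps": [
        (FakeAction("add_numbers", {"a": 20, "b": 90}), "The sum of 20 and 90 is 110"),
    ],
}

for record in summarize_steps(response):
    print(record)
```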


In [15]:
# search_response = main_ai_agent("Who is the current president of USA in 2025, just give the name")
# add_response = main_ai_agent("What is the sum of 20 and 90")
# sub_response = main_ai_agent("What is the subtract of 100 with 50")
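
The three queries above each point at one tool. A small case table makes that expectation explicit and lets the same check run over all of them. This is a sketch under stated assumptions: the expected tool per query is my reading of the queries, `run_cases` is an illustrative helper, and `fake_agent` is a hypothetical keyword-based stand-in for the real agent so the example runs without an LLM.

```python
# Queries copied from the commented-out calls; expected tools are assumed from their wording.
CASES = [
    {"input": "Who is the current president of USA in 2025, just give the name",
     "expected_tools": ["search_tool"]},
    {"input": "What is the sum of 20 and 90",
     "expected_tools": ["add_numbers"]},
    {"input": "What is the subtract of 100 with 50",
     "expected_tools": ["subtract_numbers"]},
]


def run_cases(cases, run_agent):
    """Call run_agent(query) for each case and compare the called tools to the expected ones."""
    rows = []
    for case in cases:
        called = run_agent(case["input"])
        rows.append(
            {
                "input": case["input"],
                "expected": case["expected_tools"],
                "called": called,
                "passed": called == case["expected_tools"],
            }
        )
    return rows


def fake_agent(query):
    # Hypothetical stand-in: picks a tool by keyword instead of asking an LLM.
    if "sum" in query:
        return ["add_numbers"]
    if "subtract" in query:
        return ["subtract_numbers"]
    return ["search_tool"]


for row in run_cases(CASES, fake_agent):
    print(row["passed"], row["called"], "<-", row["input"])
```

In a real run, `run_agent` would wrap the agent call and return the tool names it actually invoked.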

### Evaluating AI Agent with DeepEval for Component Testing

In [16]:
from deepeval.dataset import Golden
from deepeval import evaluate

golden = Golden(input="What is the sum of 20 and 90")
evaluate(goldens=[golden], observed_callback=main_ai_agent)

Evaluating goldens: |          |  0% (0/1) [Time Taken: 00:00, ?it/s]



> Entering new AgentExecutor chain...
Action:
```
{
  "action": "add_numbers",
  "action_input": {
    "a": 20,
    "b": 90
  }
}
```
Observation: The sum of 20 and 90 is 110
Thought:



Action:
```
{
  "action": "Final Answer",
  "action_input": "The sum of 20 and 90 is 110"
}
```

> Finished chain.
Ending trace: [BaseSpan(uuid='0ed8fa1e-a183-4144-a82f-cfa044ccb01c', status=<TraceSpanStatus.SUCCESS: 'SUCCESS'>, children=[AgentSpan(uuid='6d6a1b4f-1ab9-4dd0-922a-e89fb4c3fe8e', status=<TraceSpanStatus.SUCCESS: 'SUCCESS'>, children=[ToolSpan(uuid='7d5d037f-6556-4962-94ed-c615e8ec711d', status=<TraceSpanStatus.SUCCESS: 'SUCCESS'>, children=[], trace_uuid='9b058f62-6321-42b5-8760-5e1be474423a', parent_uuid='6d6a1b4f-1ab9-4dd0-922a-e89fb4c3fe8e', start_time=341291.427179, end_time=341291.427200208, name='add_numbers', metadata=None, input={'a': 20, 'b': 90}, output='The sum of 20 and 90 is 110', error=None, llm_test_case=None, metrics=None, attributes=None, description=None)], trace_uuid='9b058f62-6321-42b5-8760-5e1be474423a', parent_uuid='0ed8fa1e-a183-4144-a82f-cfa044ccb01c', start_time=341287.389004791, end_time=341292.473865208, name='main_ai_age


Evaluating goldens: |██████████|100% (1/1) [Time Taken: 00:05,  5.09s/it]
     ⚡ Invoking traceable callback: |██████████|100% (1/1) [Time Taken: 00:05,  5.09s/it]



Metrics Summary


For test case:

  - input: What is the sum of 20 and 90
  - actual output: None
  - expected output: None
  - context: None
  - retrieval context: None


Overall Metric Pass Rates








EvaluationResult(test_results=[TestResult(name='test_case_1', success=True, metrics_data=[], conversational=False, multimodal=False, input='What is the sum of 20 and 90', actual_output=None, expected_output=None, context=None, retrieval_context=None, additional_metadata=None)], confident_link='https://app.confident-ai.com/project/cmb8sq46q07rf1tfo1k6r68x4/evaluation/test-runs/cmbhe6qre09n9ei18oltm3zw2/compare-test-results')
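
In the result above, `test_case_1` reports `success=True` while `metrics_data` is empty, so the pass does not reflect any metric score. A small helper can flag such cases before the result is trusted. This is a sketch only: `vacuous_passes` is a hypothetical helper, and the `SimpleNamespace` objects stand in for the real result, using only the field names visible in the printed output (`test_results`, `name`, `success`, `metrics_data`).

```python
from types import SimpleNamespace


def vacuous_passes(result):
    """Names of test results marked successful although no metric data was recorded."""
    return [
        test.name
        for test in result.test_results
        if test.success and not test.metrics_data
    ]


# Stand-in mirroring the fields shown in the printed EvaluationResult.
result = SimpleNamespace(
    test_results=[
        SimpleNamespace(name="test_case_1", success=True, metrics_data=[]),
    ]
)

print(vacuous_passes(result))
```

A non-empty list suggests the metric never ran for those test cases, which is worth checking before reading the run as a pass.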