# 🚀 DAY 9: Observability, Debugging & Evaluation for Agents

Today we will learn how real teams debug, monitor, and evaluate agents, not just "hope they work".

**So far your agents can:**

- ✔ plan
- ✔ reason
- ✔ use tools
- ✔ remember
- ✔ access the web

**But now we answer:**

❓ How do I know my agent is correct, reliable, and improving?


**Build observable, debuggable, and evaluable agents by:**

- Logging thoughts, tool calls, and errors

- Tracing agent execution

- Evaluating correctness, reasoning quality, and tool usage

**🔹 1️⃣ CORE TOPICS**

**🔍 Observability**

- Execution traces

- Step-by-step logs

- Tool-call visibility

- Latency & failure points

**🧪 Evaluation**

- Correctness evaluation

- Reasoning quality

- Tool-choice accuracy

- Regression testing for agents

**🛠 Debugging**

- Prompt failures

- Tool misuse

- Infinite loops

- Hallucinations
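
As one illustration of the "infinite loops" item above, a loop can often be spotted by counting how many times the agent repeats the exact same action. The sketch below is a minimal, framework-independent example; the function name `find_repeated_actions`, the `(tool, input)` tuple format, and the sample trace are all assumptions made for illustration, not part of LangChain.

```python
from collections import Counter

def find_repeated_actions(steps, max_repeats=3):
    """Return the (tool, tool_input) pairs that occur more than max_repeats times.

    `steps` is a hypothetical list of (tool_name, tool_input) tuples,
    for example collected from your own step logs.
    """
    counts = Counter(steps)
    return [action for action, n in counts.items() if n > max_repeats]

# Hypothetical trace: the agent keeps issuing the same search.
trace = [("web_search", "RAG")] * 4 + [("add_numbers", "5, 10")]
suspicious = find_repeated_actions(trace)
print(suspicious)
```

A check like this only flags exact repeats. LangChain's AgentExecutor also has a `max_iterations` setting that caps the number of steps, which is usually the first line of defense.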

**🔹 2️⃣ WHY THIS MATTERS**

**Without observability:**

❌ You don't know why the agent failed

❌ You can't improve prompts

❌ You can't deploy safely

**With observability:**

✔ Faster debugging

✔ Measurable improvement

✔ Production readiness

This is what separates hobby agents from real applications.

**✔️ Step 1: Enable Verbose Logging**

🧠 What is initialize_agent()?

Think of initialize_agent() as a factory function.

👉 It builds an agent for you automatically, so you don't have to write:

- prompts

- reasoning loops

- tool-selection logic

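So instead of you manually coding the ReAct loop, LangChain does it internally. (Recent LangChain releases mark initialize_agent() as deprecated in favor of newer agent constructors, but it remains a convenient way to learn the concepts.)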

**🧩 Line-by-Line Explanation**

1️⃣ tools

    # "These are the actions my agent is allowed to take."
    tools = [add_numbers, ai_fact, web_search]

2️⃣ llm

    # "Use this LLM to think, reason, and decide."
    llm = ChatOpenAI(model="gpt-4o-mini")

3️⃣ agent="zero-shot-react-description"

In initialize_agent(), the agent type is passed through the agent parameter.

🔹 Zero-shot

The agent is not given any worked examples.

It figures out:

- how to reason

- how to use tools

purely from the tool descriptions.

🔹 ReAct

Reason + Act.

The agent:

- Thinks

- Decides to use a tool

- Executes the tool

- Thinks again

- Repeats until it can give a final answer
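
To make the Reason + Act cycle concrete, here is a minimal sketch of the loop in plain Python. It is not LangChain's internal implementation: `scripted_model` is a hypothetical stand-in for the LLM that returns hard-coded steps, and `react_loop`, `TOOLS`, and the step dictionary format are names invented for this illustration.

```python
def add_numbers(a, b):
    """Add two numbers."""
    return a + b

# Tools the loop is allowed to call, keyed by name.
TOOLS = {"add_numbers": add_numbers}

def scripted_model(history):
    """Hypothetical stand-in for an LLM: picks the next step from the history."""
    if not any(kind == "observation" for kind, _ in history):
        # No tool result yet, so decide to act.
        return {
            "thought": "I should calculate 5 + 10",
            "action": "add_numbers",
            "action_input": {"a": 5, "b": 10},
        }
    # A tool result exists, so answer with the latest observation.
    return {"thought": "Now I can answer", "final_answer": str(history[-1][1])}

def react_loop(question, model, tools, max_steps=5):
    """Think -> act -> observe, repeated until a final answer or max_steps."""
    history = [("question", question)]
    for _ in range(max_steps):
        step = model(history)                      # Think
        history.append(("thought", step["thought"]))
        if "final_answer" in step:                 # Done
            return step["final_answer"], history
        tool = tools[step["action"]]               # Decide which tool to use
        observation = tool(**step["action_input"]) # Execute the tool
        history.append(("observation", observation))
    raise RuntimeError("No final answer within max_steps")

answer, history = react_loop("Add 5 and 10", scripted_model, TOOLS)
print(answer)  # 15
```

The `max_steps` cap is the piece that keeps a confused model from looping forever.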


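🔹 Description

The agent uses:

- tool docstrings

- function descriptions

to decide which tool to use.

Put together, the setting means: "Create a ReAct agent that can use tools by reading their descriptions, without examples."

Example:

    from langchain.tools import tool

    @tool
    def add_numbers(a: int, b: int) -> int:
        """Add two numbers"""
        # The docstring above is the description the agent reads.
        return a + b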



4️⃣ verbose=True

🔹 What does verbose mean?

verbose=True means:

🔊 "Show me EVERYTHING the agent is thinking and doing."

**🧠 Without verbose**

You see only:

    Final Answer: 15

**🧠 With verbose=True**

You see:

    Thought: I should calculate 5 + 10
    Action: add_numbers
    Action Input: {"a": 5, "b": 10}
    Observation: 15
    Thought: Now I can answer
    Final Answer: 15


- In production → verbose=False

- While learning → always verbose=True


    initialize_agent = "Build me a smart agent"
    tools            = "Here are its hands"
    llm              = "Here is its brain"
    agent            = "How it should think"
    verbose=True     = "Show me its thoughts"


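    from langchain.agents import AgentType, initialize_agent

    # Assumes `tools` and `llm` are already defined as described above.
    # The agent type is passed via the `agent` parameter.
    agent = initialize_agent(
        tools,
        llm,
        agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
        verbose=True,
    )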


**✔️ Step 2: Add Structured Logging**

🔹 What Is This Code Doing (High-Level)?

This code creates a simple logging system that tracks:

- what step the agent is on

- which tool it used

- what result it got

- when it happened

Think of it as a flight recorder for your agent ✈️


**🔹 Line-by-Line Explanation**

1️⃣ import time

Used for:

- ordering steps

- measuring duration

- debugging

      import time
      time.time()  # 1723456789.123

2️⃣ logs = []

- This list stores all agent actions.

- Each action will be added as a dictionary.

      logs = []

3️⃣ def log_step(...)

- step → description of what happened

- tool=None → optional argument (default is None)

- result=None → optional argument (default is None)

👉 You can call this function with just step, or with tool and result as well.


      def log_step(step, tool=None, result=None):


4️⃣ logs.append({ ... })

Each log entry is a dictionary.

    logs.append({

5️⃣ "timestamp": time.time()

This saves when the step happened.

Later you can:

- sort logs

- calculate the duration between steps

Example:

    execution_time = end - start

The corresponding line in the log entry:

    "timestamp": time.time(),


**🔹 Mental Model (Remember This)**

- logs = agent diary

- log_step() = write in diary

- time.time() = when it happened

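    import time

    logs = []  # in-memory list of log entries, one dict per agent step

    def log_step(step, tool=None, result=None):
        """Record one agent step with a timestamp, optional tool name, and result."""
        logs.append({
            "timestamp": time.time(),  # seconds since the epoch
            "step": step,
            "tool": tool,
            "result": result,
        })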
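
The explanation above mentions sorting logs and calculating the duration between steps. A minimal sketch of that follows; it restates `log_step` so the block runs on its own, and the helper name `step_durations` and the sample step names are assumptions made for illustration.

```python
import json
import time

logs = []

def log_step(step, tool=None, result=None):
    """Same shape as the logger above: one dict per step."""
    logs.append({"timestamp": time.time(), "step": step, "tool": tool, "result": result})

def step_durations(entries):
    """Seconds elapsed between consecutive log entries, ordered by timestamp."""
    ordered = sorted(entries, key=lambda e: e["timestamp"])
    return [later["timestamp"] - earlier["timestamp"]
            for earlier, later in zip(ordered, ordered[1:])]

# Hypothetical run with two steps and a short pause between them.
log_step("start")
time.sleep(0.05)
log_step("add numbers", tool="add_numbers", result=15)

durations = step_durations(logs)
print(durations)

# Because every entry is a plain dict, the log can be dumped as JSON for later inspection.
dumped = json.dumps(logs, indent=2)
print(dumped)
```

Keeping entries as plain dictionaries is what makes the JSON dump trivial; results that are not JSON-serializable would need converting to strings first.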


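**✔️ Step 3: Trace Tool Calls**

execute_step() wraps a single agent run: it times the call, logs the step and its result, and prints the latency.

    def execute_step(step):
        """Run one task through the agent, log it, and report latency."""
        start = time.perf_counter()  # monotonic clock, suited to measuring durations
        result = agent.run(step)
        # Logs the agent call as a whole, not each individual tool call.
        log_step(step, tool="agent", result=result)
        print(f"Latency: {time.perf_counter() - start:.2f}s")
        return result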
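
The function above records the agent run as a whole. To see individual tool calls, one option is to wrap each tool function in a tracing decorator. The sketch below uses plain Python functions rather than LangChain tool objects; the names `traced`, `tool_trace`, and `divide` are hypothetical, and how you combine such a wrapper with `@tool` depends on your LangChain version.

```python
import functools
import time

tool_trace = []  # one dict per tool call, in call order

def traced(func):
    """Record name, arguments, result or error, and latency of every call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        entry = {"tool": func.__name__, "args": args, "kwargs": kwargs}
        try:
            entry["result"] = func(*args, **kwargs)
            return entry["result"]
        except Exception as exc:
            entry["error"] = repr(exc)  # keep the failure point visible
            raise
        finally:
            entry["latency_s"] = time.perf_counter() - start
            tool_trace.append(entry)
    return wrapper

@traced
def add_numbers(a, b):
    return a + b

@traced
def divide(a, b):
    return a / b

add_numbers(5, 10)
try:
    divide(1, 0)  # deliberately failing call to show error capture
except ZeroDivisionError:
    pass

for entry in tool_trace:
    print(entry["tool"], entry.get("result"), entry.get("error"))
```

The `finally` clause ensures a failed call is still logged with its latency, which is where failure points usually show up.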


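**🔹 5️⃣ AGENT EVALUATION (VERY IMPORTANT)**

    def evaluate(output, expected_keywords):
        """Return the fraction of expected keywords found in the output (case-insensitive)."""
        if not expected_keywords:
            return 0.0  # avoid dividing by zero when no keywords are given
        score = sum(1 for k in expected_keywords if k.lower() in output.lower())
        return score / len(expected_keywords)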
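
A few worked calls show how this keyword score behaves, including where it can mislead. The block restates `evaluate` (with a guard for an empty keyword list) so it runs on its own; the sample outputs are invented for illustration.

```python
def evaluate(output, expected_keywords):
    """Fraction of expected keywords that appear in the output, ignoring case."""
    if not expected_keywords:
        return 0.0
    hits = sum(1 for k in expected_keywords if k.lower() in output.lower())
    return hits / len(expected_keywords)

# Both keywords present: full score.
full = evaluate("RAG combines retrieval with text generation.", ["retrieval", "generation"])

# "retrieves" does not contain "retrieval", so a reasonable answer scores zero.
missed = evaluate("RAG retrieves documents first.", ["retrieval", "generation"])

# Substring matching also gives false positives: "15" is found inside "150".
false_positive = evaluate("The total is 150", ["15"])

print(full, missed, false_positive)  # 1.0 0.0 1.0
```

Keyword matching is a cheap first check. It measures neither reasoning quality nor tool choice, so treat the score as a smoke test rather than a verdict.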


**✔️ Run Automated Tests**

    # Each test pairs a task with keywords the answer is expected to contain.
    tests = [
        ("Explain RAG", ["retrieval", "generation"]),
        ("Add 5 and 10", ["15"]),
    ]

    for task, expected in tests:
        out = agent.run(task)
        print(task, evaluate(out, expected))
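
The same loop can be turned into a small regression suite that reports pass/fail per task, so a prompt or tool change that breaks an old behavior is noticed. This is a sketch under stated assumptions: `stub_agent` is a hypothetical stand-in with canned answers (swap in `agent.run` for a real run), and `run_suite` and the `threshold` parameter are names invented here.

```python
def evaluate(output, expected_keywords):
    """Fraction of expected keywords that appear in the output, ignoring case."""
    if not expected_keywords:
        return 0.0
    hits = sum(1 for k in expected_keywords if k.lower() in output.lower())
    return hits / len(expected_keywords)

def stub_agent(task):
    """Hypothetical stand-in for agent.run, returning canned answers."""
    canned = {
        "Explain RAG": "RAG pairs retrieval with generation.",
        "Add 5 and 10": "The answer is 15.",
    }
    return canned.get(task, "")

def run_suite(run, tests, threshold=1.0):
    """Score every task and mark it passed when the score meets the threshold."""
    report = []
    for task, expected in tests:
        score = evaluate(run(task), expected)
        report.append({"task": task, "score": score, "passed": score >= threshold})
    return report

tests = [
    ("Explain RAG", ["retrieval", "generation"]),
    ("Add 5 and 10", ["15"]),
]

report = run_suite(stub_agent, tests)
for row in report:
    print(row)
print("all passed:", all(row["passed"] for row in report))
```

Passing the runner in as an argument keeps the suite independent of any one agent, so the same tests can be rerun after every prompt or tool change.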
