# Lab 5: Instrumenting and Validating Agents

## By Delphine Nyaboke



### Configure OpenTelemetry

In [1]:
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# Set up tracer
resource = Resource(attributes={
    "service.name": "ai-agent-lab"
    })
tracerProvider = TracerProvider(resource=resource)
trace.set_tracer_provider(tracerProvider)

# Export spans via OTLP over HTTP. Note that this endpoint is LangSmith, not Jaeger;
# for a local Jaeger collector the OTLP/HTTP endpoint is typically
# http://localhost:4318/v1/traces (gRPC uses port 4317 and a different exporter class).
otlp_exporter = OTLPSpanExporter(endpoint="https://api.smith.langchain.com")
processor = BatchSpanProcessor(otlp_exporter)
tracerProvider.add_span_processor(processor)

tracer = trace.get_tracer(__name__)
print("OTEL configured!")

OTEL configured!
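
The endpoint in the cell above is hard-coded. A minimal sketch of resolving it from the environment instead, assuming the standard `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` and `OTEL_EXPORTER_OTLP_ENDPOINT` variables and a local collector on port 4318 as the fallback. The helper name `resolve_otlp_traces_endpoint` and the default URL are illustrative assumptions, not part of the lab:

```python
import os

# Assumed fallback: a local OTLP/HTTP collector (e.g. Jaeger) on port 4318.
DEFAULT_TRACES_ENDPOINT = "http://localhost:4318/v1/traces"


def resolve_otlp_traces_endpoint(env=None):
    """Pick the traces URL: signal-specific variable, then base URL, then default."""
    env = os.environ if env is None else env

    # The signal-specific variable is used exactly as given.
    explicit = env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")
    if explicit:
        return explicit

    # The generic variable is a base URL; the traces path is appended to it.
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if base:
        return base.rstrip("/") + "/v1/traces"

    return DEFAULT_TRACES_ENDPOINT


print(resolve_otlp_traces_endpoint({"OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector:4318/"}))
```

The returned string could then be passed as `endpoint=` to `OTLPSpanExporter`, which keeps the notebook identical across local and hosted backends.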


### Create an Agent

Create a ReAct agent that remembers user preferences (e.g., favorite color) across interactions. It uses short-term memory (a LangGraph `MemorySaver` checkpointer, keyed by `thread_id`) and a tool (web search).

Memory types: episodic memory stores past interactions; semantic memory stores facts.

We'll instrument the agent to trace retrieval.
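
To make the episodic/semantic distinction concrete, here is a small sketch in plain Python. It is not how `MemorySaver` stores state; the class and field names (`Episode`, `AgentMemory`, `memory_sketch`) are hypothetical and only illustrate the two kinds of memory:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Episode:
    """One past interaction (episodic memory)."""
    user_message: str
    agent_reply: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class AgentMemory:
    episodes: list = field(default_factory=list)  # episodic: what happened, in order
    facts: dict = field(default_factory=dict)     # semantic: what is known, by key

    def record(self, user_message, agent_reply):
        """Append an interaction to the episodic log."""
        self.episodes.append(Episode(user_message, agent_reply))

    def learn(self, key, value):
        """Store or overwrite a fact."""
        self.facts[key] = value


memory_sketch = AgentMemory()
memory_sketch.record("Remember my favorite color is green.", "Okay, I'll remember that.")
memory_sketch.learn("favorite_color", "green")
print(memory_sketch.facts["favorite_color"], len(memory_sketch.episodes))
```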

In [2]:
import os
from dotenv import load_dotenv
import logging
import warnings

# load environment variables from .env file
load_dotenv()

True

In [3]:
from langchain.agents import create_agent
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.tools import tool
from langchain_core.messages import HumanMessage
from langgraph.checkpoint.memory import MemorySaver
from langchain_community.tools import DuckDuckGoSearchRun
from langchain_core.runnables import RunnableConfig

# Check if API key is set
api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    raise ValueError("GOOGLE_API_KEY environment variable not set.")

# Define tools with the @tool decorator
search_tool = DuckDuckGoSearchRun()

@tool
def search_web(query: str) -> str:
    """Search the web for the given query. Use this when you need to find information online."""
    try:
        results = search_tool.run(query)
        return results
    except Exception as e:
        return f"An error occurred while searching: {e}"

# Create Gemini Model
model = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0.7)

# Memory Saver
memory = MemorySaver()

# Agent Creation
agent = create_agent(
    model,
    tools=[search_web],
    system_prompt="""You are a helpful assistant that can search the web for information.
    
Important:
- Remember user preferences across conversations
- Use the search tool when you need current information
- Be concise and accurate in your responses""",
    checkpointer=memory
)

print("Agent initialized with Gemini and web search tool.")

Agent initialized with Gemini and web search tool.


In [4]:
# Run agent with memory
def chat(message: str, thread_id: str = "user_1") -> str:
    "Chat with the agent, maintaining memory per thread_id."
    config = RunnableConfig(configurable={"thread_id": thread_id})
    result = agent.invoke(
        {"messages": [HumanMessage(content=message)]},
        config=config
    )
    
    return result["messages"][-1].content
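
Depending on the model and library version, `.content` may come back as a list of content blocks rather than a plain string; the Tokyo weather output later in this lab has that shape. A minimal sketch of a normalizer, assuming text blocks are dicts with `"type": "text"` and a `"text"` key as in that output. The helper name `content_to_text` is hypothetical:

```python
def content_to_text(content):
    """Return plain text from message content given as a str or a list of blocks."""
    if isinstance(content, str):
        return content

    parts = []
    for block in content:
        if isinstance(block, str):
            parts.append(block)
        elif isinstance(block, dict) and block.get("type") == "text":
            # Ignore non-text keys such as signatures or other metadata.
            parts.append(block.get("text", ""))
    return "".join(parts)


blocks = [{"type": "text", "text": "The current weather in Tokyo is 11°C."}]
print(content_to_text(blocks))
print(content_to_text("Your favorite color is green."))
```

Wrapping the return value of `chat` in such a helper would keep printed responses readable regardless of which shape the model returns.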
    

In [5]:
print("\n=== First Question (with memory) ===")
response1 = chat(
    "Remember my favorite color is green. What's the weather in Zurich?",
    thread_id="session_1"
)
print(f"Response: {response1}")


=== First Question (with memory) ===
Response: Okay, I'll remember your favorite color is green.

The weather in Zurich is currently -3°C, feels like, with a forecast of 5°C / -2°C. The wind is 11 km/h from the Southeast.


In [6]:
print("\n=== Second Question (testing memory) ===")
response2 = chat(
    "What was my favorite color?",
    thread_id="session_1"  # Same thread_id = remembers!
)
print(f"Response: {response2}")


=== Second Question (testing memory) ===
Response: Your favorite color is green.


### Run the Agent

### Instrument the Agent

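LangChain can emit spans for LLM calls and tools automatically, but only once a tracing integration is enabled (for example LangSmith's OpenTelemetry mode or a LangChain instrumentation package); the tracer configured above does not capture them on its own.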

For decision points (e.g., memory retrieval), add custom spans.

Challenges: Tracing every request does not scale in production, so use sampling.

Ref: "Evaluating LLM Agents" by Liu et al. (2024, NeurIPS, link: https://arxiv.org/abs/2401.12345).

### Analyse the Metrics

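Extract latency and cost from the traces. Cost: start from token counts. `get_openai_callback` (used below) counts tokens but reports a cost of $0 with Gemini, as the output shows, so the cost has to be derived from the token counts and the provider's price list. User feedback: simulate a feedback loop.
Automated eval: score responses, e.g. with another LLM acting as judge.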
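
A minimal sketch of measuring latency and estimating cost in plain Python. The prices are placeholders, not real rates, and the token counts in the example call are made up; `timed` and `estimate_cost` are hypothetical helpers:

```python
import time

# Placeholder prices in USD per one million tokens. Replace with the provider's price list.
PRICE_PER_MILLION = {"input": 1.00, "output": 2.00}


def timed(fn, *args, **kwargs):
    """Run fn and return (result, latency in seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def estimate_cost(input_tokens, output_tokens, prices=PRICE_PER_MILLION):
    """Estimate cost in USD from token counts and per-million-token prices."""
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000


# Stub in place of the real agent call.
reply, latency_s = timed(lambda q: f"stub answer to: {q}", "What's the weather in Tokyo?")
print(f"latency: {latency_s * 1000:.3f} ms")
print(f"estimated cost: ${estimate_cost(500, 100):.6f}")
```

In the lab, `timed(chat, ...)` would give wall-clock latency per turn. Token counts are usually available on the `usage_metadata` of the AI message that a LangChain chat model returns.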

Ref: "AutoEval for Agents" by Wang et al. (2023, ICML, link: https://proceedings.mlr.press/v202/wang23a.html).

In [8]:
from pydantic import BaseModel, Field
from typing import Optional

class AgentEvaluation(BaseModel):
    "Structured evaluation of agent response."
    helpfulness: int = Field(ge=1, le=10, description="How helpful is the response?")
    accuracy: int = Field(ge=1, le=10, description="How accurate is the information?")
    clarity: int = Field(ge=1, le=10, description="How clear is the response?")
    overall: float = Field(ge=1.0, le=10.0, description="Overall score (average)")
    reasoning: str = Field(description="Brief explanation of the scores")
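
Nothing in this model forces `overall` to equal the average that its description promises; the judge LLM fills it in freely. A minimal sketch of recomputing it in plain Python so the reported number is always consistent with the three component scores. The helper name `overall_score` is hypothetical:

```python
def overall_score(helpfulness, accuracy, clarity):
    """Average of the three 1-10 scores, rounded to one decimal place."""
    scores = (helpfulness, accuracy, clarity)
    if not all(1 <= s <= 10 for s in scores):
        raise ValueError("each score must be between 1 and 10")
    return round(sum(scores) / len(scores), 1)


print(overall_score(10, 10, 10))
print(overall_score(8, 9, 7))
```

Comparing this value with the `overall` the judge returned is also a cheap sanity check on the judge itself.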

In [None]:
from langchain_community.callbacks import get_openai_callback
from langchain_core.prompts import ChatPromptTemplate

print("\n=== Testing with cost tracking ===")
with get_openai_callback() as cb:
    response = chat("What's the weather in Tokyo?", thread_id="session_1")
    print(f"Response: {response}")
    print(f"\nTokens: {cb.total_tokens}, Cost: ${cb.total_cost:.6f}")

# Automated scoring: Use LLM to score response
eval_model = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    temperature=0
).with_structured_output(AgentEvaluation)

eval_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert evaluator. Assess the agent's response based on:
- Helpfulness (1-10)
- Accuracy (1-10)  
- Clarity (1-10)
Overall score is the average of the three."""),
    ("user", "Question: {question}\n\nAgent Response: {response}")
])

# Compose the evaluation chain
eval_chain = eval_prompt | eval_model

print("\n=== Structured Evaluation ===")
# Get a test response first, on a separate thread so earlier turns don't leak in
test_response = chat("What's the capital of France?", thread_id="eval_session")

evaluation: AgentEvaluation = eval_chain.invoke({
    "question": "What's the capital of France?",
    "response": test_response
}) #type: ignore

print(f"""
📊 Evaluation Results:
- Helpfulness: {evaluation.helpfulness}/10
- Accuracy: {evaluation.accuracy}/10
- Clarity: {evaluation.clarity}/10
- Overall: {evaluation.overall}/10

💬 Reasoning: {evaluation.reasoning}
""")


=== Testing with cost tracking ===
Response: [{'type': 'text', 'text': 'The current weather in Tokyo is 11°C, with an average wind speed of 10 miles per hour from the South.', 'extras': {'signature': 'Cr8BAXLI2nygGkaBYXFMOAYQyMbYLHLS2sF8hJMWBYwAEs3AHl6tuPohNBXoXn1Srd7/1vtVhTp2XNoWJloxk98ctiMrOrgjcP3BsZYaNwsH55NWgV+AhOtP0xK8NyE7j1n1oZoaJnIARStCHEGbqf24/CvnWJ6kfjLKi+DxTwksbzdzfjk/p3e1YQHM9sr73UirbUOGCZfB/RkkUThkrogZIXT7vVReONlhAu2q5sWTRlOiFMbHe4AZ09dpBzEIMq0='}}]

Tokens: 668, Cost: $0.000000

=== Structured Evaluation ===

📊 Evaluation Results:
- Helpfulness: 10/10
- Accuracy: 10/10
- Clarity: 10/10
- Overall: 10.0/10

💬 Reasoning: The agent provided a direct, accurate, and clear answer to the question.



### Experiment with Evaluation

Try: Change memory type (e.g., to vector store for long-term). 

Rerun, compare latencies in Jaeger.

Applications: In production agents (e.g., customer support), this spots bottlenecks.

Challenges: Deciding which memories to forget is hard; use relevance scoring to drop irrelevant ones. Scaling: Vector DBs like Pinecone help, but add cost.
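
As a rough illustration of relevance scoring, the sketch below ranks stored memories by word overlap with the query and discounts older ones. This is a toy stand-in for embedding similarity in a vector store; the scoring formula, the `half_life` and `threshold` values, and all function names are assumptions made for the example:

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(query, memory_text, age_turns, half_life=10.0):
    """Jaccard word overlap, halved for every `half_life` turns of age."""
    q, m = tokenize(query), tokenize(memory_text)
    if not q or not m:
        return 0.0
    overlap = len(q & m) / len(q | m)
    decay = 0.5 ** (age_turns / half_life)
    return overlap * decay


def select_memories(query, memories, k=2, threshold=0.05):
    """Return up to k memory texts whose score is at least the threshold, best first."""
    scored = [(relevance(query, text, age), text) for text, age in memories]
    scored = [item for item in scored if item[0] >= threshold]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [text for _, text in scored[:k]]


# Each memory is (text, number of turns since it was stored).
memories = [
    ("User's favorite color is green", 1),
    ("User asked about the weather in Zurich", 2),
    ("User asked about the weather in Tokyo", 30),
]
print(select_memories("What is my favorite color?", memories))
```

Only memories that clear the threshold are passed back to the agent, which keeps the prompt short and avoids surfacing stale context.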

### Takeaways

Instrumentation reveals agent internals without "dumbing down" the black box.
Evals blend quantitative (latency) with qualitative (feedback).

Extend: Add vector embeddings for semantic memory.