# Agent Evaluation

Now that we have a custom DIY agent with ML tools, we can evaluate how the model calls those tools and inspect its traces in detail.

In [0]:
%pip install uv

In [0]:
%sh uv pip install ../.

In [0]:
from well_agent.utils import get_config_path, DotConfig
config_path = get_config_path()
config = DotConfig('config.yaml')

## Evaluation Dataset

You can edit the requests or expected responses in your evaluation dataset and rerun the evaluation as you iterate on your agent, using MLflow to track the computed quality metrics.

Evaluate your agent with one of our [predefined LLM scorers](https://learn.microsoft.com/azure/databricks/mlflow3/genai/eval-monitor/predefined-judge-scorers), or try adding [custom metrics](https://learn.microsoft.com/azure/databricks/mlflow3/genai/eval-monitor/custom-scorers).


In [0]:
eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": question}
            ]
        }
    }
    for question in config.evaluate.questions
]

input_example = eval_dataset[0]['inputs']

In [0]:
input_example
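
If you want to attach expected responses to the requests, one option is to add them to each record. The sketch below assumes the `expectations` / `expected_response` record layout used by MLflow's GenAI evaluation data; the helper name `build_eval_dataset` and the sample question and answer are hypothetical and not part of this project.

```python
from typing import Optional


def build_eval_dataset(
    questions: list[str],
    expected: Optional[dict[str, str]] = None,
) -> list[dict]:
    """Build evaluation records, adding expectations where an answer is known."""
    expected = expected or {}
    records = []
    for question in questions:
        record = {
            "inputs": {"messages": [{"role": "user", "content": question}]}
        }
        # Only attach expectations for questions that have a reference answer
        if question in expected:
            record["expectations"] = {"expected_response": expected[question]}
        records.append(record)
    return records


# Hypothetical question and reference answer, for illustration only
sample_dataset = build_eval_dataset(
    ["Which tool predicts production?"],
    expected={"Which tool predicts production?": "The production forecasting tool."},
)
```

Records without a reference answer keep the same shape as `eval_dataset` above, so both kinds can be mixed in one list.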

This is the most common pattern for calling an agent from an application: a direct API call, made here through the MLflow deployments client. The same call could also be made with the `requests` package.

In [0]:
from mlflow.deployments import get_deploy_client

response = get_deploy_client("databricks").predict(
    endpoint=f"agents_{config.catalog}-{config.schema}-{config.agentify.agent_name}", 
    inputs=input_example
    )

response
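
For comparison, here is a minimal sketch of the same call made with the `requests` package. It assumes the standard Databricks serving route `/serving-endpoints/<endpoint>/invocations` with a bearer token; the function names, the workspace host, and the token are placeholders you would supply yourself.

```python
def build_invocation_request(
    host: str, token: str, endpoint: str, messages: list[dict]
) -> dict:
    """Assemble the URL, headers, and JSON body for an endpoint invocation."""
    return {
        "url": f"{host.rstrip('/')}/serving-endpoints/{endpoint}/invocations",
        "headers": {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        "json": {"messages": messages},
    }


def invoke_endpoint(
    host: str, token: str, endpoint: str, messages: list[dict], timeout: float = 60
) -> dict:
    """Send the request; requires the `requests` package and network access."""
    import requests  # imported lazily so building a request has no dependency

    req = build_invocation_request(host, token, endpoint, messages)
    resp = requests.post(
        req["url"], headers=req["headers"], json=req["json"], timeout=timeout
    )
    resp.raise_for_status()
    return resp.json()
```

Splitting the request construction from the network call keeps the payload easy to inspect and test without contacting the workspace.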

## Create a Sample Application
This uses our deploy client inside a traced function, which gives us a local copy of the traces.

In [0]:
import mlflow
from mlflow.deployments import get_deploy_client

# Set up MLflow tracking
mlflow.set_tracking_uri("databricks")
mlflow.openai.autolog()

# Wrap the endpoint call in a traced function so each request produces a trace
@mlflow.trace
def sample_app(messages: list[dict[str, str]]):
    response = get_deploy_client("databricks").predict(
        endpoint=f"agents_{config.catalog}-{config.schema}-{config.agentify.agent_name}",
        inputs={'messages': messages}
    )
    return response

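Scorers have gained a lot of traction as a way to grade agent responses. Here we use the predefined `Guidelines` scorer to check each response against a natural-language rule.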

In [0]:
import mlflow
from mlflow.genai.scorers import Guidelines

eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=sample_app,
    scorers=[
        Guidelines(name="calls_tool", guidelines="The response must call a tool")
    ],
)

## Trace Evaluation

One of the most powerful features of MLflow is the ability to trace the execution of an agent. A trace records the exact inputs and outputs of each request, along with the intermediate spans (for example LLM calls and tool calls) and their timings. This is especially important for real tasks where an agent is calling an ML model.

In [0]:
generated_traces = mlflow.search_traces(run_id=eval_results.run_id)
trace = generated_traces.iloc[0]['trace']
trace

## Make Sure We Are Calling Tools!

We saw that the guidelines check fails. Traces and spans give us a more direct measure: we can count the tool calls recorded in the output messages of each trace.

In [0]:
trace.search_spans()[0].to_dict()

In [0]:
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Trace, Feedback

@scorer
def tools_called(trace: Trace) -> Feedback:
    # Take the first span returned, which this notebook treats as the root span
    root_span = trace.search_spans()[0]
    # `tool_calls` can be missing or None on a message, so fall back to an empty list
    num_tool_calls = sum(
        len(m.get('tool_calls') or []) for m in root_span.outputs['messages']
    )
    if num_tool_calls > 0:
        return Feedback(
            value="yes",
            rationale=f"{num_tool_calls} tool call(s) were made"
        )
    else:
        return Feedback(
            value="no",
            rationale="No tools were called"
        )

# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
span_check_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[tools_called]
)
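
The counting logic can be checked in isolation before running it over real traces. The sketch below uses a hand-written message list in the OpenAI-style chat format; the messages and the tool name `predict` are made up for illustration and are not taken from an actual trace.

```python
def count_tool_calls(messages: list[dict]) -> int:
    """Count tool calls across messages, treating missing or None as zero."""
    return sum(len(m.get("tool_calls") or []) for m in messages)


# Hypothetical conversation: one assistant turn requests a single tool call
example_messages = [
    {"role": "user", "content": "Run the model on well A"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {"name": "predict", "arguments": "{}"},
            }
        ],
    },
    {"role": "tool", "content": "42", "tool_call_id": "call_1"},
    {"role": "assistant", "content": "The prediction is 42.", "tool_calls": None},
]

print(count_tool_calls(example_messages))  # prints 1
```

Note the final message sets `tool_calls` to `None`; calling `len` on that value directly would raise a `TypeError`, which is why the helper falls back to an empty list.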

## Guidelines and Custom Scorers
Correctness is great, but we often need more granular information about how a model is performing. This scorer uses trace data to check whether the total execution time of the trace is within an acceptable range.

In [0]:
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Trace, Feedback

@scorer
def llm_response_time_good(trace: Trace) -> Feedback:
    # Take the first span returned, which this notebook treats as the root span,
    # so its duration covers the whole request
    root_span = trace.search_spans()[0]
    # Span timestamps are in nanoseconds; convert the difference to seconds
    response_time = (root_span.end_time_ns - root_span.start_time_ns) / 1e9
    max_duration = 5.0  # seconds
    if response_time <= max_duration:
        return Feedback(
            value="yes",
            rationale=f"LLM response time {response_time:.2f}s is within the {max_duration}s limit."
        )
    else:
        return Feedback(
            value="no",
            rationale=f"LLM response time {response_time:.2f}s exceeds the {max_duration}s limit."
        )

# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
span_check_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[llm_response_time_good]
)
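
The 5-second limit above is hard-coded. If you need different limits for different agents, one option is to build the check from a threshold. This is a plain-Python sketch of that idea only; `make_latency_check` is a hypothetical helper that works on raw nanosecond timestamps rather than on MLflow objects.

```python
from typing import Callable


def make_latency_check(max_seconds: float) -> Callable[[int, int], tuple[str, str]]:
    """Return a function that grades a (start_ns, end_ns) pair against a limit."""

    def check(start_time_ns: int, end_time_ns: int) -> tuple[str, str]:
        elapsed = (end_time_ns - start_time_ns) / 1e9  # nanoseconds to seconds
        if elapsed <= max_seconds:
            return "yes", f"Response time {elapsed:.2f}s is within the {max_seconds}s limit."
        return "no", f"Response time {elapsed:.2f}s exceeds the {max_seconds}s limit."

    return check


check_5s = make_latency_check(5.0)
value, rationale = check_5s(0, 2_500_000_000)  # a 2.5 second span
```

The returned value and rationale map directly onto the `value` and `rationale` fields of the `Feedback` object used in the scorer.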

## Custom APIs for applications

MlflowClient exposes granular, thread-safe APIs to start and end traces, manage spans, and set span fields. It provides complete control of the trace lifecycle and structure. These APIs are useful when the Fluent APIs are insufficient for your requirements, such as multi-threaded applications and callbacks.

These APIs can be called from anywhere and directed back to Databricks or another tracking server.

In [0]:
import mlflow
from mlflow.client import MlflowClient

mlflow.set_tracking_uri("databricks")

mlflow_client = MlflowClient()

# Start a trace; the returned object is the root span
root_span = mlflow_client.start_trace(
    name="simple-rag-agent",
    inputs={
        "query": "Demo",
        "model_name": "DBRX",
        "temperature": 0,
        "max_tokens": 200,
    },
)

# Retrieve documents that are similar to the query
similarity_search_input = dict(query_text="demo", num_results=3)

span_ss = mlflow_client.start_span(
    "search",
    # Specify trace_id and parent_id to create the span at the right position in the trace
    trace_id=root_span.trace_id,
    parent_id=root_span.span_id,
    inputs=similarity_search_input,
)
retrieved = ["Test Result"]  # placeholder standing in for a real retrieval result

# Explicitly end the span, then the trace
mlflow_client.end_span(root_span.trace_id, span_id=span_ss.span_id, outputs=retrieved)
mlflow_client.end_trace(root_span.trace_id, outputs={"output": retrieved})
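
Because spans must be ended explicitly with the client APIs, an exception between `start_span` and `end_span` can leave a span open. One way to guard against that is a small context manager. This is a sketch, not part of MLflow: `managed_span` is a hypothetical helper, and it assumes the client's `end_span` accepts a `status` argument with the values `"OK"` and `"ERROR"`.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def managed_span(
    client: Any, trace_id: str, parent_id: str, name: str, inputs: Any = None
) -> Iterator[dict]:
    """Start a child span and guarantee it is ended, even if the body raises."""
    span = client.start_span(
        name, trace_id=trace_id, parent_id=parent_id, inputs=inputs
    )
    result: dict = {}  # the body stores its outputs under result["outputs"]
    try:
        yield result
    except Exception:
        # Assumption: "ERROR" is an accepted status value for end_span
        client.end_span(trace_id, span_id=span.span_id, status="ERROR")
        raise
    else:
        client.end_span(
            trace_id,
            span_id=span.span_id,
            outputs=result.get("outputs"),
            status="OK",
        )


# Usage with the client and root span from the cell above (not executed here):
# with managed_span(mlflow_client, root_span.trace_id, root_span.span_id,
#                   "search", similarity_search_input) as result:
#     result["outputs"] = ["Test Result"]
```

The helper takes the client as an argument rather than relying on a global, which keeps it usable from callbacks and worker threads.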