# Hill Model Marketing Agent with MLflow Tracking

This notebook demonstrates:
1. **Hill Model Fitting** for single-channel marketing spend
2. **MLflow Experiment Tracking** for parameters, metrics, and artifacts
3. **LangGraph Agent** with MLflow tracing for predictions

**MLflow tracks:**
- Model parameters (α, β, θ)
- Performance metrics (R², MAE, RMSE)
- Training data
- Model artifacts
- Agent prediction traces

## Setup and Imports

In [41]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# MLflow for experiment tracking
import mlflow
import mlflow.pyfunc
from mlflow.models.signature import infer_signature

# LangGraph and LangChain
from typing import Annotated, Literal, Optional
from operator import add
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import AzureChatOpenAI
from langgraph.graph import StateGraph, START, END, MessagesState
from langgraph.types import Command
from pydantic import BaseModel

# Environment variables
import os
from dotenv import load_dotenv

# Load .env file
load_dotenv()

print("✓ Imports loaded")

✓ Imports loaded


## Configure MLflow

In [42]:
# Set MLflow tracking URI (local or remote)
mlflow.set_tracking_uri("./mlruns")  # Local tracking
# For remote: mlflow.set_tracking_uri("databricks") or "http://mlflow-server:5000"

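# Set experiment name
experiment_name = "Hill_Model_Marketing_Agent"
mlflow.set_experiment(experiment_name)

# Enable automatic tracing of LangChain/LangGraph calls so that graph nodes
# (e.g. "supervisor", "Predictor") appear as spans in the agent traces
mlflow.langchain.autolog()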

print(f"✓ MLflow experiment: {experiment_name}")
print(f"✓ Tracking URI: {mlflow.get_tracking_uri()}")

✓ MLflow experiment: Hill_Model_Marketing_Agent
✓ Tracking URI: ./mlruns


## Load Training Data

In [43]:
# Historical marketing data (spend → revenue in $k)
df = pd.DataFrame({
    "Digital_Spend": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
    "Revenue": [95, 150, 205, 250, 285, 305, 318, 325, 330, 334, 336, 337]
})

print("Training Data:")
print(df.head())
print(f"\nShape: {df.shape}")

Training Data:
   Digital_Spend  Revenue
0             10       95
1             20      150
2             30      205
3             40      250
4             50      285

Shape: (12, 2)
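
The revenue column already hints at saturation before any model is fitted. As a quick, model-free check (a sketch that uses only the twelve rows above, with plain Python lists standing in for the DataFrame), the revenue gained per additional $10k of spend shrinks steadily:

```python
# Same values as the training DataFrame above, as plain lists
digital_spend = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]
revenue = [95, 150, 205, 250, 285, 305, 318, 325, 330, 334, 336, 337]

# Revenue gained for each successive $10k step in spend
increments = [later - earlier for earlier, later in zip(revenue, revenue[1:])]

for spend, gain in zip(digital_spend[1:], increments):
    print(f"up to ${spend}k: +${gain}k revenue")
```

Shrinking increments of this kind are the pattern a saturation curve with a plateau is meant to capture.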


## Define Hill Model

In [44]:
def hill(x, alpha, beta, theta):
    """
    Hill transformation for marketing saturation:
    y = α × x^β / (x^β + θ^β)
    
    Parameters:
    - x: spend (in $k)
    - alpha: plateau (max revenue contribution in $k)
    - beta: steepness (curve shape, unitless)
    - theta: half-saturation spend (in $k)
    """
    x = np.asarray(x, dtype=float)
    x = np.clip(x, 1e-9, None)  # guard against zero
    return alpha * (x**beta) / (x**beta + theta**beta)

print("✓ Hill function defined")

✓ Hill function defined
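
Two properties follow directly from the formula in the docstring: at `x = theta` the curve returns exactly half of `alpha`, and for very large spend it approaches `alpha`. The sketch below checks both with a scalar, dependency-free version of the curve. The name `hill_scalar` and the parameter values are illustrative assumptions, not fitted results.

```python
def hill_scalar(x, alpha, beta, theta):
    """Scalar Hill curve: alpha * x^beta / (x^beta + theta^beta)."""
    x = max(float(x), 1e-9)  # same zero guard as the notebook's hill()
    return alpha * x**beta / (x**beta + theta**beta)

# Hypothetical parameter values chosen only for illustration
alpha_demo, beta_demo, theta_demo = 340.0, 2.0, 25.0

half_sat = hill_scalar(theta_demo, alpha_demo, beta_demo, theta_demo)
far_out = hill_scalar(1e6, alpha_demo, beta_demo, theta_demo)

print(f"Revenue at x = theta: {half_sat:.1f} (alpha / 2 = {alpha_demo / 2:.1f})")
print(f"Revenue at very large spend: {far_out:.4f} (alpha = {alpha_demo:.1f})")
```

This is why `theta` is read as the half-saturation spend and `alpha` as the plateau.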


## Train Hill Model with MLflow Tracking

In [None]:
# Start MLflow run
with mlflow.start_run(run_name="hill_model_training") as run:
    
    # Log dataset info
    mlflow.log_param("n_samples", len(df))
    mlflow.log_param("channel", "Digital")
    mlflow.log_param("model_type", "Hill_Saturation")
    
    # Prepare data
    x = df["Digital_Spend"].to_numpy(float)
    y = df["Revenue"].to_numpy(float)
    
    # Scale to [0,1] for stable optimization
    x_max, y_max = x.max(), y.max()
    x_s, y_s = x / x_max, y / y_max
    
    mlflow.log_param("x_max", x_max)
    mlflow.log_param("y_max", y_max)
    
    # Fit Hill curve
    p0 = [1.05, 1.0, 0.5]  # Initial guesses (scaled)
    bounds = ([0.0, 0.0, 0.0], [5.0, 10.0, 5.0])
    
    mlflow.log_param("p0_alpha", p0[0])
    mlflow.log_param("p0_beta", p0[1])
    mlflow.log_param("p0_theta", p0[2])
    
    params_s, _ = curve_fit(hill, x_s, y_s, p0=p0, bounds=bounds, maxfev=50000)
    alpha_s, beta, theta_s = params_s
    
    # Rescale parameters back to original units
    alpha = alpha_s * y_max  # plateau in $k
    theta = theta_s * x_max  # half-saturation in $k
    
    # Log fitted parameters
    mlflow.log_param("alpha_plateau_k", round(alpha, 2))
    mlflow.log_param("beta_steepness", round(beta, 3))
    mlflow.log_param("theta_halfsat_k", round(theta, 2))
    
    # Make predictions
    y_pred = hill(x, alpha, beta, theta)
    
    # Calculate metrics
    r2 = r2_score(y, y_pred)
    mae = mean_absolute_error(y, y_pred)
    mse = mean_squared_error(y, y_pred)
    rmse = np.sqrt(mse)
    
    # Log metrics
    mlflow.log_metric("r2_score", r2)
    mlflow.log_metric("mae", mae)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("mse", mse)
    
    # Log training data as artifact
    train_data_path = "training_data.csv"
    df.to_csv(train_data_path, index=False)
    mlflow.log_artifact(train_data_path)
    
    # Create and log visualization
    plt.figure(figsize=(10, 6))
    x_line = np.linspace(0, x.max() * 1.2, 200)
    y_line = hill(x_line, alpha, beta, theta)
    
    plt.scatter(x, y, color='blue', s=80, label='Actual Data', zorder=3)
    plt.plot(x_line, y_line, color='red', linewidth=2.5, label='Hill Model Fit')
    plt.xlabel("Digital Spend ($k)", fontsize=12)
    plt.ylabel("Revenue ($k)", fontsize=12)
    plt.title(f"Hill Model: R²={r2:.3f}, MAE={mae:.1f}k", fontsize=14)
    plt.legend()
    plt.grid(alpha=0.3, linestyle='--')
    plt.tight_layout()
    
    plot_path = "hill_curve_fit.png"
    plt.savefig(plot_path, dpi=150)
    mlflow.log_artifact(plot_path)
    plt.show()
    
    # Log residuals plot
    residuals = y - y_pred
    plt.figure(figsize=(10, 5))
    plt.subplot(1, 2, 1)
    plt.scatter(x, residuals, alpha=0.7)
    plt.axhline(0, linestyle='--', color='red')
    plt.xlabel("Digital Spend ($k)")
    plt.ylabel("Residual ($k)")
    plt.title("Residuals vs Spend")
    plt.grid(alpha=0.3)
    
    plt.subplot(1, 2, 2)
    plt.hist(residuals, bins=8, alpha=0.7, edgecolor='black')
    plt.xlabel("Residual ($k)")
    plt.ylabel("Frequency")
    plt.title("Residuals Distribution")
    plt.grid(alpha=0.3)
    plt.tight_layout()
    
    residuals_path = "residuals_analysis.png"
    plt.savefig(residuals_path, dpi=150)
    mlflow.log_artifact(residuals_path)
    plt.show()
    
    # Store model parameters in a module-level dict for agent use
    # (this cell already runs at module scope, so no `global` statement is needed)
    MODEL_PARAMS = {
        'alpha': alpha,
        'beta': beta,
        'theta': theta
    }
    
    run_id = run.info.run_id
    
    print("\n" + "="*60)
    print("MLflow Training Summary")
    print("="*60)
    print(f"Run ID: {run_id}")
    print(f"\nFitted Parameters:")
    print(f"  α (plateau):        {alpha:.2f}k")
    print(f"  β (steepness):      {beta:.3f}")
    print(f"  θ (half-sat spend): {theta:.2f}k")
    print(f"\nPerformance Metrics:")
    print(f"  R² Score:  {r2:.4f}")
    print(f"  MAE:       {mae:.2f}k")
    print(f"  RMSE:      {rmse:.2f}k")
    print("="*60)
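
The rescaling step in the training cell works because dividing spend by `x_max` and revenue by `y_max` changes only the units of α and θ, while β is unitless and carries over unchanged. The sketch below checks that identity numerically. The scaled-fit values are hypothetical stand-ins (not the values this run produces), and `hill_scalar` is a scalar re-implementation used to keep the example dependency-free.

```python
import math

def hill_scalar(x, alpha, beta, theta):
    """Scalar Hill curve: alpha * x^beta / (x^beta + theta^beta)."""
    return alpha * x**beta / (x**beta + theta**beta)

x_max, y_max = 120.0, 337.0               # maxima of the training data
alpha_s, beta, theta_s = 1.02, 1.8, 0.25  # hypothetical scaled-fit values

# Rescale exactly as the training cell does
alpha = alpha_s * y_max
theta = theta_s * x_max

for spend in (10.0, 50.0, 120.0):
    via_scaled = y_max * hill_scalar(spend / x_max, alpha_s, beta, theta_s)
    via_original = hill_scalar(spend, alpha, beta, theta)
    # Both routes give the same prediction up to floating-point error
    assert math.isclose(via_scaled, via_original, rel_tol=1e-9)
    print(f"spend={spend:6.1f}  scaled route={via_scaled:.4f}  original route={via_original:.4f}")
```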

## Define Prediction Function

In [None]:
def predict_revenue(spend):
    """
    Predict revenue for a given spend using fitted Hill model.
    
    Args:
        spend: Digital spend in $k
    
    Returns:
        Predicted revenue in $k
    """
    return hill(spend, MODEL_PARAMS['alpha'], MODEL_PARAMS['beta'], MODEL_PARAMS['theta'])

# Test predictions
print("Sample Predictions:")
for s in [30, 60, 100, 140]:
    print(f"  Spend ${s}k → Revenue ${predict_revenue(s):.1f}k")

## Setup Azure OpenAI (Update with your credentials)

In [None]:
azure_openai_api_key = os.getenv("AZURE_OPENAI_API_KEY", "your-api-key-here")
azure_openai_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT", "https://your-resource.openai.azure.com/")

routing_llm = AzureChatOpenAI(
    model="gpt-4o-mini",
    api_key=azure_openai_api_key,
    api_version="2025-01-01-preview",
    azure_endpoint=azure_openai_endpoint,
    temperature=0.2
)

print("✓ Azure OpenAI configured")

## Build LangGraph Agent with MLflow Tracing

In [None]:
# Define agent state
class AgentState(MessagesState):
    next: Optional[str] = None
    routed_intent: Annotated[list[str], add]
    spend_k: Optional[float] = None

# Define routing decision schema
class RouteDecision(BaseModel):
    target: Literal["Predictor", "FINISH"]
    reason: str
    spend_k: Optional[float] = None  # spend in $k

# Routing prompt
routing_prompt = ChatPromptTemplate.from_messages([
    SystemMessage(
        content=(
            "You supervise a marketing assistant.\n"
            "Respond with JSON containing `target`, `reason`, and `spend_k`.\n"
            "`target` must be `Predictor` when the user asks for revenue based on digital spend, otherwise `FINISH`.\n"
            "`spend_k` must be the numeric digital spend in thousands of dollars.\n"
            "Examples:\n"
            "- '10k' -> 10.0\n"
            "- '10,000 dollars' -> 10.0\n"
            "- '0.5 million' -> 500.0\n"
            "If no spend amount is stated, set `spend_k` to null."
        )
    ),
    ("user", "{query}"),
])

routing_chain = routing_prompt | routing_llm.with_structured_output(RouteDecision)

def _select_next(state: AgentState) -> str:
    # Fall back to FINISH when `next` is missing or explicitly None
    return state.get("next") or "FINISH"

# Supervisor node
def supervisor_node(state: AgentState):
    last_human = next(msg for msg in reversed(state["messages"]) if isinstance(msg, HumanMessage))
    
    try:
        decision = routing_chain.invoke({"query": last_human.content})
    except Exception as exc:
        error_msg = AIMessage(content=f"Routing failed: {exc}")
        return Command(
            update={
                "next": "FINISH",
                "messages": [error_msg],
                "routed_intent": ["routing_error"],
                "spend_k": None,
            }
        )
    
    supervisor_note = AIMessage(
        content=f"Supervisor routing to {decision.target}: {decision.reason}"
    )
    
    return Command(
        update={
            "next": decision.target,
            "spend_k": decision.spend_k,
            "routed_intent": [decision.reason],
            "messages": [supervisor_note],
        }
    )

# Predictor node with MLflow logging
def predictor_node(state: AgentState):
    spend_k = state.get("spend_k")
    
    # Start MLflow run for prediction
    with mlflow.start_run(run_name="agent_prediction", nested=True):
        
        mlflow.log_param("input_spend_k", spend_k)
        
        if spend_k is None:
            reply = "I couldn't detect a spend amount. Please specify an approximate digital spend (e.g., 10k)."
            mlflow.log_param("prediction_status", "missing_input")
            mlflow.log_metric("predicted_revenue_k", 0.0)
        else:
            # Make prediction
            est_revenue = predict_revenue(spend_k)
            
            # Log to MLflow
            mlflow.log_metric("predicted_revenue_k", est_revenue)
            mlflow.log_param("prediction_status", "success")
            mlflow.log_param("model_alpha", MODEL_PARAMS['alpha'])
            mlflow.log_param("model_beta", MODEL_PARAMS['beta'])
            mlflow.log_param("model_theta", MODEL_PARAMS['theta'])
            
            # Calculate ROI
            roi = (est_revenue / spend_k - 1) * 100 if spend_k > 0 else 0
            mlflow.log_metric("roi_percent", roi)
            
            reply = (
                f"📊 **Marketing Prediction**\n\n"
                f"💰 **Input:** ${spend_k:.1f}k digital spend\n"
                f"📈 **Predicted Revenue:** ${est_revenue:.1f}k\n"
                f"🎯 **ROI:** {roi:.1f}%\n\n"
                f"_Model: Hill saturation curve (α={MODEL_PARAMS['alpha']:.1f}, β={MODEL_PARAMS['beta']:.2f}, θ={MODEL_PARAMS['theta']:.1f})_"
            )
    
    return Command(
        update={
            "messages": [AIMessage(content=reply)],
            "next": "FINISH",
        }
    )

# Build the graph
graph = StateGraph(AgentState)
graph.add_node("supervisor", supervisor_node)
graph.add_node("Predictor", predictor_node)
graph.add_edge(START, "supervisor")
graph.add_conditional_edges("supervisor", _select_next, {"Predictor": "Predictor", "FINISH": END})
graph.add_edge("Predictor", END)

marketing_agent = graph.compile()

print("✓ Marketing agent with MLflow tracing built")
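
The routing prompt asks the LLM to normalise amounts such as "10k" or "0.5 million" into thousands of dollars. A deterministic parser can serve as a cross-check on the model's `spend_k`. The helper below, `parse_spend_k`, is hypothetical (it is not part of the notebook or of any library) and covers only the simple formats listed in the prompt; a bare number is assumed to mean plain dollars.

```python
import re
from typing import Optional

# Multipliers that convert a unit suffix into thousands of dollars
_UNIT_TO_K = {"k": 1.0, "thousand": 1.0, "m": 1000.0, "million": 1000.0}

_AMOUNT_RE = re.compile(
    r"\$?\s*(\d[\d,]*(?:\.\d+)?)\s*(k|thousand|million|m)?\b",
    flags=re.IGNORECASE,
)

def parse_spend_k(text: str) -> Optional[float]:
    """Hypothetical helper: return the first spend amount in `text`, in $k."""
    match = _AMOUNT_RE.search(text)
    if not match:
        return None
    amount = float(match.group(1).replace(",", ""))
    unit = (match.group(2) or "").lower()
    if unit:
        return amount * _UNIT_TO_K[unit]
    return amount / 1000.0  # assumption: a bare number means plain dollars

for example in ["10k", "10,000 dollars", "0.5 million", "Tell me about your capabilities"]:
    print(f"{example!r} -> {parse_spend_k(example)}")
```

One possible use is to compare this value with `decision.spend_k` in the supervisor node and flag large disagreements, rather than to replace the LLM extraction.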

## Test Agent with MLflow Tracking

In [None]:
# Test queries
test_queries = [
    "What if I spend 50k on digital?",
    "Predict revenue for 80,000 dollars digital spend",
    "How much revenue if we spend 120k?"
]

print("\n" + "="*60)
print("Testing Agent with MLflow Tracking & Metrics")
print("="*60 + "\n")

import time
from datetime import datetime

# Start MLflow run for agent testing
with mlflow.start_run(run_name=f"agent_test_{datetime.now().strftime('%Y%m%d_%H%M%S')}") as test_run:
    
    # Log test metadata
    mlflow.log_param("num_test_queries", len(test_queries))
    mlflow.log_param("test_type", "agent_execution")
    
    # Metrics tracking
    latencies = []
    successes = 0
    failures = 0
    predicted_revenues = []
    predicted_rois = []
    
    for idx, query in enumerate(test_queries, 1):
        print(f"\n🔹 Query {idx}/{len(test_queries)}: {query}")
        print("-" * 60)
        
        start_time = time.time()
        
        try:
            # Invoke agent with completely fresh state for each query
            result = marketing_agent.invoke(
                {"messages": [HumanMessage(content=query)]},
                config={"configurable": {"thread_id": f"test_{hash(query)}"}}
            )
            
            latency = time.time() - start_time
            latencies.append(latency)
            
            # Give MLflow nested run time to complete
            time.sleep(0.5)
            
            # Find the predictor's response
            ai_messages = [msg for msg in result["messages"] if isinstance(msg, AIMessage)]
            prediction_messages = [msg for msg in ai_messages if "Marketing Prediction" in msg.content]
            
            if prediction_messages:
                print(prediction_messages[0].content)
                successes += 1
                
                # Parse predicted revenue and ROI from the formatted reply
                import re
                reply_text = prediction_messages[0].content
                revenue_match = re.search(r'Predicted Revenue:\*\* \$(\d+\.?\d*)k', reply_text)
                roi_match = re.search(r'ROI:\*\* (-?\d+\.?\d*)%', reply_text)
                
                if revenue_match:
                    predicted_revenues.append(float(revenue_match.group(1)))
                if roi_match:
                    predicted_rois.append(float(roi_match.group(1)))
                    
            elif ai_messages:
                print(max(ai_messages, key=lambda m: len(m.content)).content)
                successes += 1
            else:
                print("⚠️  No response generated")
                failures += 1
            
            # Log individual query metrics
            mlflow.log_metric(f"query_{idx}_latency_sec", latency)
            mlflow.log_metric(f"query_{idx}_success", 1 if prediction_messages else 0)
            
            print(f"⏱️  Latency: {latency:.3f}s")
            
        except Exception as e:
            latency = time.time() - start_time
            latencies.append(latency)
            failures += 1
            
            print(f"❌ Error: {e}")
            mlflow.log_metric(f"query_{idx}_success", 0)
            mlflow.log_metric(f"query_{idx}_latency_sec", latency)
        
        print("-" * 60)
    
    # Calculate aggregate metrics
    total_queries = len(test_queries)
    success_rate = (successes / total_queries) * 100
    failure_rate = (failures / total_queries) * 100
    
    avg_latency = np.mean(latencies) if latencies else 0
    min_latency = np.min(latencies) if latencies else 0
    max_latency = np.max(latencies) if latencies else 0
    std_latency = np.std(latencies) if latencies else 0
    
    avg_revenue = np.mean(predicted_revenues) if predicted_revenues else 0
    avg_roi = np.mean(predicted_rois) if predicted_rois else 0
    
    # Log aggregate metrics to MLflow
    mlflow.log_metric("success_rate_pct", success_rate)
    mlflow.log_metric("failure_rate_pct", failure_rate)
    mlflow.log_metric("total_successes", successes)
    mlflow.log_metric("total_failures", failures)
    
    mlflow.log_metric("avg_latency_sec", avg_latency)
    mlflow.log_metric("min_latency_sec", min_latency)
    mlflow.log_metric("max_latency_sec", max_latency)
    mlflow.log_metric("std_latency_sec", std_latency)
    
    if predicted_revenues:
        mlflow.log_metric("avg_predicted_revenue_k", avg_revenue)
    if predicted_rois:
        mlflow.log_metric("avg_predicted_roi_pct", avg_roi)
    
    # Print summary
    print("\n" + "="*60)
    print("Agent Test Metrics Summary")
    print("="*60)
    print(f"📊 Execution Metrics:")
    print(f"   Total Queries:     {total_queries}")
    print(f"   Successes:         {successes} ({success_rate:.1f}%)")
    print(f"   Failures:          {failures} ({failure_rate:.1f}%)")
    print(f"\n⏱️  Latency Metrics:")
    print(f"   Average:           {avg_latency:.3f}s")
    print(f"   Min:               {min_latency:.3f}s")
    print(f"   Max:               {max_latency:.3f}s")
    print(f"   Std Dev:           {std_latency:.3f}s")
    
    if predicted_revenues:
        print(f"\n💰 Prediction Metrics:")
        print(f"   Avg Revenue:       ${avg_revenue:.1f}k")
    if predicted_rois:
        print(f"   Avg ROI:           {avg_roi:.1f}%")
    
    print("="*60)
    print(f"\n✓ Metrics logged to MLflow (Run ID: {test_run.info.run_id})")
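
The test loop recovers revenue and ROI by applying regular expressions to the predictor's Markdown reply. The sketch below isolates that parsing step on a sample reply shaped like the predictor's output. The numbers in the sample are made up, the emoji are omitted for brevity, and `parse_reply` is a hypothetical helper name.

```python
import re

# A reply shaped like the predictor node's output; the numbers are made up
sample_reply = (
    "**Marketing Prediction**\n\n"
    "**Input:** $50.0k digital spend\n"
    "**Predicted Revenue:** $285.3k\n"
    "**ROI:** 470.6%\n"
)

# Same patterns as the test loop above
REVENUE_RE = re.compile(r"Predicted Revenue:\*\* \$(\d+\.?\d*)k")
ROI_RE = re.compile(r"ROI:\*\* (-?\d+\.?\d*)%")

def parse_reply(reply: str) -> dict:
    """Return revenue ($k) and ROI (%) parsed from a reply; None where absent."""
    revenue = REVENUE_RE.search(reply)
    roi = ROI_RE.search(reply)
    return {
        "revenue_k": float(revenue.group(1)) if revenue else None,
        "roi_pct": float(roi.group(1)) if roi else None,
    }

parsed = parse_reply(sample_reply)
print(parsed)
```

Values recovered this way carry only the one-decimal precision of the formatted message, and the patterns break if the reply wording changes.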

## Interactive Chat with MLflow Logging

In [None]:
# def chat_with_agent():
#     """
#     Interactive chat loop with MLflow tracking.
#     Type 'quit' to exit.
#     """
#     print("\n" + "="*60)
#     print("Marketing Agent Chat (with MLflow Tracking)")
#     print("="*60)
#     print("Ask questions like:")
#     print("  - What if I spend 50k on digital?")
#     print("  - Predict revenue for 100k spend")
#     print("\nType 'quit' to exit\n")
    
#     while True:
#         user_input = input("You: ").strip()
        
#         if user_input.lower() in {'quit', 'exit', 'q'}:
#             print("\n👋 Ending chat. Check MLflow UI for tracked experiments!")
#             print(f"   Run: mlflow ui --backend-store-uri {mlflow.get_tracking_uri()}")
#             break
        
#         if not user_input:
#             continue
        
#         # Invoke agent
#         result = marketing_agent.invoke({"messages": [HumanMessage(content=user_input)]})
        
#         # Display response
#         ai_messages = [msg for msg in result["messages"] if isinstance(msg, AIMessage)]
#         if ai_messages:
#             print(f"\nAgent: {ai_messages[-1].content}\n")

# Uncomment to start interactive chat
# chat_with_agent()

## View MLflow Results

In [None]:
# Get the current experiment
experiment = mlflow.get_experiment_by_name(experiment_name)
print(f"\n📊 MLflow Experiment: {experiment.name}")
print(f"📁 Artifact Location: {experiment.artifact_location}")
print(f"🆔 Experiment ID: {experiment.experiment_id}")

# Get recent runs
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id], order_by=["start_time DESC"], max_results=5)

if not runs.empty:
    print("\n📝 Recent Runs:")
    
    # Select columns that exist in the DataFrame
    desired_cols = ['run_id', 'start_time', 'tags.mlflow.runName', 'metrics.r2_score']
    available_cols = [col for col in desired_cols if col in runs.columns]
    
    if available_cols:
        print(runs[available_cols].to_string(index=False))
    else:
        # Fallback: show basic columns
        basic_cols = [col for col in ['run_id', 'start_time'] if col in runs.columns]
        if basic_cols:
            print(runs[basic_cols].to_string(index=False))
        else:
            print(runs.head().to_string())
else:
    print("\nNo runs found yet.")

# print("\n🌐 To view in MLflow UI, run:")
# print(f"   mlflow ui --backend-store-uri {mlflow.get_tracking_uri()}")
# print("   Then open: http://localhost:5000")

## Batch Predictions with MLflow Logging

In [None]:
# Create batch prediction scenarios
batch_scenarios = pd.DataFrame({
    'scenario': ['Conservative', 'Moderate', 'Aggressive', 'Maximum'],
    'spend_k': [40, 70, 100, 130]
})

with mlflow.start_run(run_name="batch_predictions"):
    
    mlflow.log_param("batch_size", len(batch_scenarios))
    
    # Make predictions
    batch_scenarios['predicted_revenue_k'] = batch_scenarios['spend_k'].apply(predict_revenue)
    batch_scenarios['roi_percent'] = ((batch_scenarios['predicted_revenue_k'] / batch_scenarios['spend_k']) - 1) * 100
    
    # Log batch results
    for idx, row in batch_scenarios.iterrows():
        mlflow.log_metric(f"scenario_{row['scenario']}_spend_k", row['spend_k'])
        mlflow.log_metric(f"scenario_{row['scenario']}_revenue_k", row['predicted_revenue_k'])
        mlflow.log_metric(f"scenario_{row['scenario']}_roi_pct", row['roi_percent'])
    
    # Save batch results
    batch_path = "batch_predictions.csv"
    batch_scenarios.to_csv(batch_path, index=False)
    mlflow.log_artifact(batch_path)
    
    # Create visualization
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Revenue vs Spend
    ax1.bar(batch_scenarios['scenario'], batch_scenarios['predicted_revenue_k'], color='steelblue')
    ax1.set_xlabel('Scenario')
    ax1.set_ylabel('Predicted Revenue ($k)')
    ax1.set_title('Revenue by Scenario')
    ax1.grid(alpha=0.3, axis='y')
    
    # ROI comparison
    ax2.bar(batch_scenarios['scenario'], batch_scenarios['roi_percent'], color='coral')
    ax2.set_xlabel('Scenario')
    ax2.set_ylabel('ROI (%)')
    ax2.set_title('ROI by Scenario')
    ax2.grid(alpha=0.3, axis='y')
    
    plt.tight_layout()
    batch_viz_path = "batch_predictions_viz.png"
    plt.savefig(batch_viz_path, dpi=150)
    mlflow.log_artifact(batch_viz_path)
    plt.show()

print("\n📊 Batch Prediction Results:")
print(batch_scenarios.to_string(index=False))
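
The ROI column uses the definition `(revenue / spend - 1) * 100`. To make the arithmetic concrete without relying on fitted parameters, the sketch below applies the same formula to the observed training rows rather than to model predictions. `roi_percent` is a hypothetical helper name.

```python
# Observed training data from the start of the notebook, as plain lists
digital_spend = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]
revenue = [95, 150, 205, 250, 285, 305, 318, 325, 330, 334, 336, 337]

def roi_percent(revenue_k, spend_k):
    """ROI as defined in this notebook: (revenue / spend - 1) * 100."""
    return (revenue_k / spend_k - 1) * 100

observed_roi = [roi_percent(r, s) for s, r in zip(digital_spend, revenue)]

for s, roi in zip(digital_spend, observed_roi):
    print(f"${s}k spend: ROI {roi:.1f}%")
```

Because revenue flattens while spend keeps growing, ROI computed this way falls at every step in the observed data, so higher-spend scenarios should be expected to show lower ROI even when total revenue is higher.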

## MLflow GenAI Agent Evaluation

This section uses MLflow's native `mlflow.genai.evaluate()` framework to evaluate the agent with:
- **Custom scorers** that access agent traces
- **Structured evaluation dataset** with inputs and expectations
- **Automatic logging** to MLflow experiments

### Step 1: Prepare Evaluation Dataset

Create evaluation dataset with inputs and expectations following MLflow's format.

In [None]:
# Evaluation dataset with inputs, expectations, and tags
eval_dataset = [
    {
        "inputs": {"task": "What if I spend 35k on digital?"},
        "expectations": {
            "spend_k": 35,
            "revenue_range": (230, 240),  # Expected revenue range
            "routing_target": "Predictor"
        },
        "tags": {"category": "low_spend"}
    },
    {
        "inputs": {"task": "Predict revenue for 55k spend"},
        "expectations": {
            "spend_k": 55,
            "revenue_range": (290, 300),
            "routing_target": "Predictor"
        },
        "tags": {"category": "medium_spend"}
    },
    {
        "inputs": {"task": "How much revenue if we spend 85k?"},
        "expectations": {
            "spend_k": 85,
            "revenue_range": (323, 333),
            "routing_target": "Predictor"
        },
        "tags": {"category": "high_spend"}
    },
    {
        "inputs": {"task": "What's the expected revenue for 105k digital spend?"},
        "expectations": {
            "spend_k": 105,
            "revenue_range": (330, 340),
            "routing_target": "Predictor"
        },
        "tags": {"category": "very_high_spend"}
    },
    {
        "inputs": {"task": "Tell me about your capabilities"},
        "expectations": {
            "routing_target": "FINISH",
            "should_not_predict": True
        },
        "tags": {"category": "off_topic"}
    }
]

print(f"Created evaluation dataset with {len(eval_dataset)} test cases")

### Step 2: Define Custom Scorers

Create scorers that evaluate agent behavior using traces and outputs.

In [None]:
from mlflow.entities import Trace, SpanType, Feedback
from mlflow.genai import scorer

# Scorer 1: Check if prediction is within expected range
@scorer
def revenue_accuracy(outputs, expectations) -> bool:
    """Check if predicted revenue is within expected range."""
    if "should_not_predict" in expectations and expectations["should_not_predict"]:
        # For off-topic queries, we expect no revenue prediction
        return "Marketing Prediction" not in str(outputs)
    
    # Extract predicted revenue from output
    import re
    output_str = str(outputs)
    revenue_match = re.search(r'Predicted Revenue:\*\* \$(\d+\.?\d*)k', output_str)
    
    if not revenue_match:
        return False
    
    predicted_revenue = float(revenue_match.group(1))
    revenue_range = expectations.get("revenue_range", (0, 0))
    
    return revenue_range[0] <= predicted_revenue <= revenue_range[1]


# Scorer 2: Evaluate routing using trace
@scorer
def correct_routing(trace: Trace, expectations: dict) -> Feedback:
    """Evaluate if agent routed to the correct node."""
    expected_target = expectations.get("routing_target", "Predictor")
    
    # Search for supervisor spans in the trace
    supervisor_spans = trace.search_spans(name="supervisor")
    
    if not supervisor_spans:
        return Feedback(
            value="no",
            rationale="No supervisor span found in trace"
        )
    
    # Check if the supervisor routed to the expected target
    # Look at the span attributes or messages
    supervisor_span = supervisor_spans[0]
    
    # Try to extract routing decision from span events or attributes
    routed_correctly = False
    routing_info = "Unknown routing"
    
    # Check span events for routing information
    if supervisor_span.events:
        for event in supervisor_span.events:
            if expected_target in str(event.name):
                routed_correctly = True
                routing_info = f"Routed to {expected_target}"
                break
    
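    # If we can't determine from events, check whether a Predictor span exists
    if not routed_correctly:
        predictor_called = any(span.name == "Predictor" for span in trace.data.spans)
        if expected_target == "Predictor" and predictor_called:
            routed_correctly = True
            routing_info = "Found Predictor span"
        elif expected_target == "FINISH" and not predictor_called:
            # FINISH is correct only when no Predictor span appears anywhere in the trace
            routed_correctly = True
            routing_info = "Did not call Predictor"
        elif predictor_called:
            routing_info = "Predictor was called unexpectedly"
        else:
            routing_info = "Predictor was not called"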
    
    value = "yes" if routed_correctly else "no"
    rationale = f"Expected routing to {expected_target}. {routing_info}"
    
    return Feedback(value=value, rationale=rationale)


# Scorer 3: Check latency
@scorer
def acceptable_latency(trace: Trace) -> Feedback:
    """Check if agent responded within acceptable time."""
    # Get trace duration in milliseconds
    duration_ms = trace.info.execution_time_ms
    duration_sec = duration_ms / 1000 if duration_ms else 0
    
    # Define acceptable latency threshold (e.g., 5 seconds)
    threshold_sec = 5.0
    
    is_acceptable = duration_sec <= threshold_sec
    value = "yes" if is_acceptable else "no"
    rationale = f"Response time: {duration_sec:.2f}s (threshold: {threshold_sec}s)"
    
    return Feedback(value=value, rationale=rationale)


# Scorer 4: Extract prediction error (if ground truth available)
@scorer
def prediction_error_pct(outputs, expectations) -> float:
    """Calculate prediction error percentage against ground truth."""
    if "revenue_range" not in expectations:
        return 0.0
    
    # Extract predicted revenue
    import re
    output_str = str(outputs)
    revenue_match = re.search(r'Predicted Revenue:\*\* \$(\d+\.?\d*)k', output_str)
    
    if not revenue_match:
        return 100.0  # Max error if prediction not found
    
    predicted_revenue = float(revenue_match.group(1))
    
    # Use middle of expected range as ground truth
    expected_range = expectations["revenue_range"]
    ground_truth = (expected_range[0] + expected_range[1]) / 2
    
    # Calculate percentage error
    error_pct = abs(predicted_revenue - ground_truth) / ground_truth * 100
    
    return round(error_pct, 2)

print("✓ Custom scorers defined:")
print("  - revenue_accuracy: Checks if prediction is within expected range")
print("  - correct_routing: Validates agent routing decision using trace")
print("  - acceptable_latency: Checks that response time is at most 5s")
print("  - prediction_error_pct: Calculates prediction error percentage")
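
The two revenue scorers measure different things: `revenue_accuracy` is a pass/fail range check, while `prediction_error_pct` reports distance from the midpoint of the range. A worked example of the arithmetic, using a hypothetical predicted value and the range from the first evaluation case (`range_midpoint_error_pct` is an illustrative name, not an MLflow function):

```python
def range_midpoint_error_pct(predicted, expected_range):
    """Percentage error against the midpoint of the expected range."""
    low, high = expected_range
    midpoint = (low + high) / 2
    return round(abs(predicted - midpoint) / midpoint * 100, 2)

predicted_revenue = 232.0    # hypothetical agent output, in $k
expected_range = (230, 240)  # range from the first evaluation case

in_range = expected_range[0] <= predicted_revenue <= expected_range[1]
error_pct = range_midpoint_error_pct(predicted_revenue, expected_range)

print(f"within range: {in_range}, error vs midpoint: {error_pct}%")
```

A prediction can therefore pass the range check and still report a non-zero error, since the midpoint is only a proxy for ground truth.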

### Step 3: Define Prediction Function

Wrap the agent in a function that MLflow can call during evaluation.

In [None]:
def predict_fn(task: str) -> str:
    """
    Prediction function for MLflow evaluation.
    Takes a task/query and returns the agent's response.
    """
    result = marketing_agent.invoke(
        {"messages": [HumanMessage(content=task)]},
        config={"configurable": {"thread_id": f"eval_{hash(task)}"}}
    )
    
    # Extract the final response
    ai_messages = [msg for msg in result["messages"] if isinstance(msg, AIMessage)]
    
    if ai_messages:
        # Return the last (most comprehensive) AI message
        return ai_messages[-1].content
    else:
        return "No response generated"

print("✓ Prediction function defined")

### Step 4: Run MLflow GenAI Evaluation

Execute the evaluation using `mlflow.genai.evaluate()` with custom scorers.

In [None]:
print("\n" + "="*70)
print("Running MLflow GenAI Agent Evaluation")
print("="*70 + "\n")

# Run evaluation with MLflow
results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=predict_fn,
    scorers=[
        revenue_accuracy,
        correct_routing,
        acceptable_latency,
        prediction_error_pct
    ]
)

print("\n" + "="*70)
print("Evaluation Complete!")
print("="*70)
print(f"\n✓ Evaluated {len(eval_dataset)} test cases")
print(f"✓ Results logged to MLflow experiment: {experiment_name}")
print(f"\n📊 View detailed results in MLflow UI:")
print(f"   1. Run: mlflow ui --backend-store-uri {mlflow.get_tracking_uri()}")
print(f"   2. Open: http://localhost:5000")
print(f"   3. Navigate to the latest run to see:")
print(f"      - Evaluation metrics and scores")
print(f"      - Agent traces for each test case")
print(f"      - Scorer rationales and feedback")
print("="*70)

### Key Features of MLflow GenAI Evaluation

**What Makes This Different:**

1. **Trace-Based Evaluation** 🔍
   - Access to agent's intermediate steps via `Trace` object
   - Inspect routing decisions, tool calls, and execution flow
   - Debug exactly where agent behavior differs from expectations

2. **Custom Scorers with `@scorer` Decorator** ⚙️
   - Create domain-specific evaluation metrics
   - Return `Feedback` objects with rationale
   - Automatic logging to MLflow

3. **Automatic Experiment Tracking** 📊
   - All predictions, scores, and traces logged automatically
   - Compare evaluation runs over time
   - No manual metric calculation needed

4. **Visual Analysis in MLflow UI** 🎨
   - View traces with detailed span information
   - Compare scorer results across test cases
   - Identify patterns in agent failures

**Next Steps:**
- Launch MLflow UI to explore evaluation results
- Click on individual test cases to view traces
- Refine scorers based on evaluation insights
- Run evaluation again after agent improvements