# MLflow GenAI Demo: LLM Observability & Experiment Tracking

This notebook demonstrates MLflow's GenAI capabilities for tracking, tracing, and evaluating LLM applications.

## Features Demonstrated

### Part 1: Via LiteLLM Proxy
1. **Auto-logging** - LiteLLM → MLflow automatic trace logging
2. **@mlflow.trace** - Decorator for function-level tracing
3. **start_span()** - Custom spans for multi-step pipelines

### Part 2: MLflow Standalone (Direct SDK)
4. **OpenAI Autolog** - `mlflow.openai.autolog()`
5. **Direct Groq SDK** - Native tracing without proxy
6. **Manual Logging** - Full control over what gets tracked

### Part 3: Advanced GenAI Features
7. **Prompt Registry** - Version control for prompts (MLflow UI → Prompts tab)
8. **Model Evaluation** - `mlflow.genai.evaluate()` with scorers
9. **A/B Testing** - Compare models and prompts with metrics

## Prerequisites
```bash
cd deploy/docker/compose
docker compose -f docker-compose.core.yml up -d exp-mlflow exp-postgres-mlflow exp-minio
docker compose -f docker-compose.litellm.yml up -d  # Optional for Part 1
```
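
The containers can take a few seconds to accept connections after `docker compose up -d` returns. A minimal sketch of a readiness poll is shown below, assuming the local MLflow port `15000` and the `/health` endpoint used later in this notebook; `wait_until` and `mlflow_is_up` are hypothetical helper names, not part of MLflow.

```python
import time
import urllib.request
from typing import Callable


def wait_until(probe: Callable[[], bool], timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Call probe() repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if probe():
                return True
        except Exception:
            # Any probe failure (e.g. connection refused) means "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def mlflow_is_up(base_url: str = "http://localhost:15000") -> bool:
    """Hypothetical probe: True if the MLflow /health endpoint answers HTTP 200."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
        return resp.status == 200
```

Calling `wait_until(mlflow_is_up, timeout=120)` before the first cell would block until the tracking server responds or two minutes pass.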

## MLflow UI Tabs
| Tab | What It Shows |
|-----|---------------|
| **Experiments** | All runs with params/metrics |
| **Traces** | LLM call hierarchies and spans |
| **Prompts** | Registered prompts with version history |
| **Evaluation** | Model evaluation results |

---
## 1. Environment Setup

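**IMPORTANT**: Set the S3/MinIO credentials before MLflow first touches the artifact store. Setting them before `import mlflow`, as the first cell does, is the simplest way to guarantee that.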

In [None]:
# ============================================================================
# CELL 1: Set credentials BEFORE importing mlflow
# ============================================================================
import os
import socket

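def detect_environment():
    """Detect if running inside Docker or locally."""
    try:
        # The "exp-mlflow" hostname only resolves inside the compose network.
        # The context manager closes the probe socket instead of leaking it.
        with socket.create_connection(("exp-mlflow", 5000), timeout=1):
            return "docker"
    except OSError:  # covers socket.error, socket.timeout, and DNS failures
        return "local"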

ENV = detect_environment()

# Service URLs based on environment
if ENV == "docker":
    MLFLOW_URI = "http://exp-mlflow:5000"
    LITELLM_URL = "http://exp-litellm:4000"
    MINIO_URL = "http://exp-minio:9000"
else:
    MLFLOW_URI = "http://localhost:15000"
    LITELLM_URL = "http://localhost:4000"
    MINIO_URL = "http://localhost:19000"

# Set S3/MinIO credentials (MUST be before mlflow import)
os.environ["AWS_ACCESS_KEY_ID"] = "admin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "password123"
os.environ["MLFLOW_S3_ENDPOINT_URL"] = MINIO_URL
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

# API Keys for standalone usage - Get from environment or set your own
GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "YOUR_GROQ_API_KEY_HERE")
LITELLM_API_KEY = "sk-local-dev-2025"

print(f"Environment: {ENV}")
print(f"MLflow: {MLFLOW_URI}")
print(f"LiteLLM: {LITELLM_URL}")
print(f"MinIO: {MINIO_URL}")
print(f"Groq API Key: {'✓ Set' if GROQ_API_KEY and GROQ_API_KEY != 'YOUR_GROQ_API_KEY_HERE' else '✗ Missing - set GROQ_API_KEY env var'}")

In [None]:
# ============================================================================
# CELL 2: Import libraries AFTER setting credentials
# ============================================================================
import mlflow
import requests
import json
import time
import pandas as pd
from typing import Dict, Any, List

mlflow.set_tracking_uri(MLFLOW_URI)
print(f"MLflow Version: {mlflow.__version__}")
print(f"Tracking URI: {mlflow.get_tracking_uri()}")

In [None]:
# ============================================================================
# CELL 3: Health Checks
# ============================================================================
def check_services():
    results = {}
    
    # Check MLflow
    try:
        resp = requests.get(f"{MLFLOW_URI}/health", timeout=5)
        results["MLflow"] = "OK" if resp.status_code == 200 else f"Error: {resp.status_code}"
    except Exception as e:
        results["MLflow"] = f"FAILED: {str(e)[:40]}"
    
    # Check LiteLLM (optional)
    try:
        resp = requests.get(f"{LITELLM_URL}/health/liveliness", timeout=5)
        results["LiteLLM"] = "OK" if resp.status_code == 200 else f"Error: {resp.status_code}"
    except Exception:
        results["LiteLLM"] = "Not running (OK for Part 2)"
    
    # Check MinIO
    try:
        import boto3
        from botocore.client import Config
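        # Reuse the credentials set in Cell 1 instead of repeating the literals
        s3 = boto3.client('s3', endpoint_url=MINIO_URL,
                          aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
                          aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
                          config=Config(signature_version='s3v4'))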
        buckets = s3.list_buckets()
        results["MinIO"] = f"OK ({len(buckets['Buckets'])} buckets)"
    except Exception as e:
        results["MinIO"] = f"FAILED: {str(e)[:40]}"
    
    return results

print("Service Health Check:")
print("=" * 50)
for service, status in check_services().items():
    # Use startswith: "Not running (OK for Part 2)" also contains "OK"
    icon = "✓" if status.startswith("OK") else "⚠" if "Not running" in status else "✗"
    print(f"  {icon} {service}: {status}")
print("=" * 50)

In [None]:
# Create experiment
EXPERIMENT_NAME = "genai-demo"
experiment = mlflow.set_experiment(EXPERIMENT_NAME)
print(f"Experiment: {experiment.name} (ID: {experiment.experiment_id})")
print(f"View: http://localhost:15000/#/experiments/{experiment.experiment_id}")

---
# PART 1: Via LiteLLM Proxy

Uses LiteLLM as a unified proxy to access multiple LLM providers.

**Requires**: `docker compose -f docker-compose.litellm.yml up -d`

In [None]:
# ============================================================================
# LiteLLM Helper Function
# ============================================================================
DEFAULT_MODEL = "llama-3.1-8b"  # Groq free tier

def call_llm(prompt: str, model: str = DEFAULT_MODEL, temperature: float = 0.7, max_tokens: int = 500) -> Dict[str, Any]:
    """Call LiteLLM proxy."""
    start = time.time()
    try:
        resp = requests.post(
            f"{LITELLM_URL}/chat/completions",
            headers={"Content-Type": "application/json", "Authorization": f"Bearer {LITELLM_API_KEY}"},
            json={"model": model, "messages": [{"role": "user", "content": prompt}],
                  "temperature": temperature, "max_tokens": max_tokens},
            timeout=60
        )
        latency = time.time() - start
        if resp.status_code != 200:
            return {"content": f"Error: HTTP {resp.status_code}", "model": model, "usage": {}, "latency": latency}
        data = resp.json()
        if "error" in data:
            return {"content": f"Error: {data['error'].get('message', 'Unknown')}", "model": model, "usage": {}, "latency": latency}
        return {
            "content": data["choices"][0]["message"]["content"],
            "model": data.get("model", model),
            "usage": data.get("usage", {}),
            "latency": latency
        }
    except Exception as e:
        return {"content": f"Error: {str(e)[:80]}", "model": model, "usage": {}, "latency": time.time() - start}

# Test
print(f"Testing LiteLLM ({DEFAULT_MODEL})...")
test = call_llm("Say 'hello'", max_tokens=10)
if test["content"].startswith("Error"):
    print(f"⚠ LiteLLM not available: {test['content'][:50]}")
    print("  Skip to Part 2 for standalone MLflow usage")
else:
    print(f"✓ SUCCESS: {test['content']} ({test['latency']:.2f}s)")

In [None]:
# Basic LLM call via LiteLLM
response = call_llm("What is MLflow? Answer in 1 sentence.")
print(f"Response: {response['content']}")
print(f"Model: {response['model']}, Latency: {response['latency']:.2f}s")

In [None]:
# MLflow tracing with decorator
@mlflow.trace
def analyze_text_litellm(text: str) -> Dict[str, Any]:
    prompt = f"Analyze sentiment of: {text}. Answer: positive/negative/neutral"
    return call_llm(prompt, temperature=0.3)

result = analyze_text_litellm("The fraud detection system saved us millions!")
print(f"Analysis: {result['content']}")
print("→ View trace in MLflow UI → Traces tab")

In [None]:
# Multi-step pipeline with spans
@mlflow.trace(name="fraud_pipeline_litellm")
def fraud_pipeline(txn: Dict) -> Dict:
    results = {}
    with mlflow.start_span(name="classify") as span:
        span.set_inputs(txn)
        resp = call_llm(f"Fraud risk for ${txn['amount']} at {txn['merchant']}? LOW/MEDIUM/HIGH", temperature=0.2)
        results["risk"] = resp["content"]
        span.set_outputs({"risk": resp["content"]})
    with mlflow.start_span(name="explain") as span:
        resp = call_llm(f"Why is this {results['risk']}? 1 sentence.", temperature=0.5)
        results["reason"] = resp["content"]
        span.set_outputs({"reason": resp["content"]})
    return results

result = fraud_pipeline({"amount": 5000, "merchant": "Overseas Wire"})
print(f"Risk: {result['risk']}")
print(f"Reason: {result['reason']}")

---
# PART 2: MLflow Standalone (Direct SDK)

Direct integration with Groq/OpenAI SDK - **NO LiteLLM required**.

This demonstrates MLflow's native GenAI capabilities.

In [None]:
# ============================================================================
# Install OpenAI SDK (compatible with Groq)
# ============================================================================
# Quote the requirement so the shell does not treat ">=" as a redirect
# !pip install "openai>=1.0.0"

In [None]:
# ============================================================================
# Setup Groq client (OpenAI-compatible)
# ============================================================================
from openai import OpenAI

# Groq uses OpenAI-compatible API
groq_client = OpenAI(
    api_key=GROQ_API_KEY,
    base_url="https://api.groq.com/openai/v1"
)

# Test direct Groq connection
print("Testing direct Groq SDK connection...")
try:
    response = groq_client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user", "content": "Say 'hello'"}],
        max_tokens=10
    )
    print(f"✓ SUCCESS: {response.choices[0].message.content}")
except Exception as e:
    print(f"✗ FAILED: {e}")

### 2.1 MLflow OpenAI Autolog

Automatically trace all OpenAI-compatible API calls.

In [None]:
# ============================================================================
# Enable MLflow OpenAI Autolog
# ============================================================================
# Note: mlflow.openai.autolog() needs no arguments for basic tracing.
# It instruments the OpenAI SDK, so calls through any OpenAI client
# (including the Groq-pointed client above) are traced automatically.

mlflow.openai.autolog()
print("✓ MLflow OpenAI autolog enabled")
print("  All OpenAI/Groq calls will be automatically traced")

In [None]:
# ============================================================================
# Basic call with autolog (automatically traced)
# ============================================================================
print("Making LLM call with autolog...")

response = groq_client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[
        {"role": "system", "content": "You are a fraud analyst."},
        {"role": "user", "content": "What are the top 3 signs of fraudulent transactions?"}
    ],
    max_tokens=200,
    temperature=0.7
)

print(f"\nResponse:")
print("-" * 50)
print(response.choices[0].message.content)
print("-" * 50)
print(f"Model: {response.model}")
print(f"Tokens: {response.usage.total_tokens}")
print(f"\n→ This call was AUTO-LOGGED to MLflow!")
print(f"→ Check: http://localhost:15000 → Traces tab")

### 2.2 Manual Tracing with @mlflow.trace

In [None]:
# ============================================================================
# Custom traced function (standalone)
# ============================================================================
@mlflow.trace(name="standalone_fraud_analysis")
def analyze_fraud_standalone(transaction: Dict[str, Any]) -> Dict[str, Any]:
    """Fraud analysis using direct Groq SDK with MLflow tracing."""
    
    # Step 1: Risk assessment
    with mlflow.start_span(name="risk_assessment") as span:
        span.set_inputs({"transaction": transaction})
        
        response = groq_client.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=[{
                "role": "user",
                "content": f"""Assess fraud risk:
Amount: ${transaction['amount']}
Merchant: {transaction['merchant']}
Time: {transaction['time']}

Respond with: RISK_LEVEL (LOW/MEDIUM/HIGH) and confidence (0-100)"""
            }],
            max_tokens=50,
            temperature=0.2
        )
        risk = response.choices[0].message.content
        span.set_outputs({"risk": risk, "tokens": response.usage.total_tokens})
    
    # Step 2: Generate explanation
    with mlflow.start_span(name="explanation") as span:
        span.set_inputs({"risk": risk})
        
        response = groq_client.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=[{
                "role": "user",
                "content": f"Explain in 1 sentence why this transaction is {risk}"
            }],
            max_tokens=100,
            temperature=0.5
        )
        explanation = response.choices[0].message.content
        span.set_outputs({"explanation": explanation})
    
    return {"risk": risk, "explanation": explanation}

# Test
result = analyze_fraud_standalone({
    "amount": 3500,
    "merchant": "Online Casino",
    "time": "2:30 AM"
})

print("Standalone Fraud Analysis:")
print("=" * 50)
print(f"Risk: {result['risk']}")
print(f"Explanation: {result['explanation']}")
print("=" * 50)
print("→ View hierarchical trace in MLflow UI")

### 2.3 Manual Logging with MLflow Runs

In [None]:
# ============================================================================
# Full manual control over logging
# ============================================================================
def call_groq_with_logging(prompt: str, model: str = "llama-3.1-8b-instant", run_name: str = None):
    """Call Groq with full manual MLflow logging."""
    
    with mlflow.start_run(run_name=run_name or f"groq_{model}"):
        # Log input parameters
        mlflow.log_param("model", model)
        mlflow.log_param("prompt_length", len(prompt))
        mlflow.log_param("prompt_preview", prompt[:100])
        
        # Make API call
        start_time = time.time()
        response = groq_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=200
        )
        latency = time.time() - start_time
        
        # Log metrics
        mlflow.log_metric("latency_seconds", latency)
        mlflow.log_metric("prompt_tokens", response.usage.prompt_tokens)
        mlflow.log_metric("completion_tokens", response.usage.completion_tokens)
        mlflow.log_metric("total_tokens", response.usage.total_tokens)
        mlflow.log_metric("response_length", len(response.choices[0].message.content))
        
        # Log artifacts
        mlflow.log_text(prompt, "prompt.txt")
        mlflow.log_text(response.choices[0].message.content, "response.txt")
        
        # Log as JSON for structured data
        mlflow.log_dict({
            "model": model,
            "prompt": prompt,
            "response": response.choices[0].message.content,
            "usage": {
                "prompt_tokens": response.usage.prompt_tokens,
                "completion_tokens": response.usage.completion_tokens,
                "total_tokens": response.usage.total_tokens
            },
            "latency": latency
        }, "call_details.json")
        
        return response.choices[0].message.content

# Test manual logging
print("Testing manual MLflow logging...")
result = call_groq_with_logging(
    "What is feature drift and why does it matter for fraud detection?",
    run_name="manual_logging_demo"
)
print(f"\nResponse: {result[:200]}...")
print("\n→ Check MLflow UI → genai-demo experiment → manual_logging_demo run")

### 2.4 Model Comparison (Standalone)

In [None]:
# ============================================================================
# Compare models using direct SDK
# ============================================================================
def compare_models_standalone(prompt: str, models: List[str]) -> pd.DataFrame:
    """Compare multiple Groq models."""
    results = []
    
    for model in models:
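        # Use the full model name: both models start with "llama", so
        # model.split('-')[0] would give every run the same name.
        with mlflow.start_run(run_name=f"standalone_compare_{model}"):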
            mlflow.log_param("model", model)
            mlflow.log_param("method", "direct_groq_sdk")
            
            start = time.time()
            try:
                response = groq_client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=150
                )
                latency = time.time() - start
                content = response.choices[0].message.content
                tokens = response.usage.total_tokens
                
                mlflow.log_metric("latency", latency)
                mlflow.log_metric("tokens", tokens)
                mlflow.log_text(content, "response.txt")
                
                results.append({
                    "model": model,
                    "status": "✓",
                    "latency": f"{latency:.2f}s",
                    "tokens": tokens,
                    "preview": content[:50] + "..."
                })
            except Exception as e:
                results.append({
                    "model": model,
                    "status": "✗",
                    "latency": "-",
                    "tokens": 0,
                    "preview": str(e)[:50]
                })
    
    return pd.DataFrame(results)

# Compare Groq models
print("Comparing Groq models (standalone)...\n")
df = compare_models_standalone(
    "Explain overfitting in 2 sentences.",
    ["llama-3.1-8b-instant", "llama-3.3-70b-versatile"]
)
print(df.to_string(index=False))
print("\n→ Compare runs in MLflow UI")

---
# PART 3: Advanced MLflow GenAI Features

This section demonstrates:
1. **Prompt Registry** - Version and track prompts in MLflow UI
2. **Model Evaluation** - `mlflow.genai.evaluate()` with scorers
3. **A/B Testing** - Compare models with evaluation metrics

**MLflow UI Prompts Tab**: View all registered prompts, compare versions side-by-side with diff highlighting.

### 3.1 Prompt Registry - Version Control for Prompts

The MLflow Prompt Registry provides:
- **Version Control**: Git-inspired commit-based versioning
- **Aliasing**: Flexible deployment (production, staging, etc.)
- **Lineage**: Track which prompts were used in which runs
- **Collaboration**: Share prompts across your organization

**In MLflow UI**: Navigate to **Prompts** tab to see all registered prompts.

In [None]:
# ============================================================================
# 3.1 Check for MLflow GenAI Module
# ============================================================================
# Check if mlflow.genai is available (the APIs used below ship with MLflow 3.x)
try:
    import mlflow.genai
    HAS_GENAI = True
    print(f"✓ mlflow.genai module available (MLflow {mlflow.__version__})")
except ImportError:
    HAS_GENAI = False
    print(f"⚠ mlflow.genai not available (MLflow {mlflow.__version__})")
    print("  Requires MLflow 3.x for the mlflow.genai Prompt Registry API")
    print("  Will use manual prompt tracking instead")

In [None]:
print("\n" + "=" * 70)
print("  MLflow GenAI Demo: Summary of Parts 1-3")
print("=" * 70)
print(f"""
  PART 1: Via LiteLLM Proxy
    - Auto-logging via success_callback
    - @mlflow.trace decorator
    - Custom spans with start_span()
    
  PART 2: MLflow Standalone (Direct SDK)
    - mlflow.openai.autolog() - automatic tracing
    - Direct Groq SDK with manual tracing
    - Full control with log_param/log_metric/log_text/log_dict
    
  PART 3: Advanced GenAI Features (below)
    - Prompt Registry - version control for prompts
    - Model Evaluation - mlflow.genai.evaluate() with scorers
    - A/B Testing - compare models and prompts
    
  MLflow UI: http://localhost:15000
  Experiment: {EXPERIMENT_NAME}
  
  UI Tabs:
  → Experiments: View all runs, compare metrics
  → Traces: View LLM call hierarchies
  → Prompts: View registered prompts, compare versions
  → Evaluation: View evaluation results
""")
print("=" * 70)

In [None]:
# ============================================================================
# Register Fraud Analysis Prompts (Multiple Versions)
# ============================================================================
if HAS_GENAI:
    # Version 1: Simple prompt
    try:
        prompt_v1 = mlflow.genai.register_prompt(
            name="fraud-analysis-prompt",
            template="""Analyze this transaction for fraud:
Amount: {{ amount }}
Merchant: {{ merchant }}

Is this fraudulent? Answer YES or NO with brief reason.""",
            commit_message="v1: Simple fraud detection prompt"
        )
        print(f"✓ Registered prompt v1: {prompt_v1.name}")
    except Exception as e:
        print(f"  Prompt may already exist: {str(e)[:50]}")
    
    # Version 2: Enhanced prompt with more context
    try:
        prompt_v2 = mlflow.genai.register_prompt(
            name="fraud-analysis-prompt",
            template="""You are a fraud detection expert. Analyze this transaction:

Transaction Details:
- Amount: ${{ amount }}
- Merchant: {{ merchant }}
- Time: {{ time }}
- Location: {{ location }}

Consider:
1. Is the amount unusual?
2. Is the merchant suspicious?
3. Is the timing suspicious?

Verdict: [FRAUD/LEGITIMATE]
Confidence: [0-100]%
Reasoning: [Brief explanation]""",
            commit_message="v2: Enhanced prompt with structured output"
        )
        print(f"✓ Registered prompt v2: {prompt_v2.name}")
    except Exception as e:
        print(f"  v2: {str(e)[:50]}")
    
    print("\n→ View prompts in MLflow UI → Prompts tab")
    print("→ Compare versions with diff highlighting")
else:
    print("Skipping prompt registry (mlflow.genai not available)")

In [None]:
# ============================================================================
# Load and Use Registered Prompts
# ============================================================================
if HAS_GENAI:
    # Load a specific version using URI format: prompts:/<name>/<version>
    try:
        # Load version 1 (its template only needs amount and merchant)
        prompt = mlflow.genai.load_prompt("prompts:/fraud-analysis-prompt/1")
        print(f"Loaded prompt: {prompt.name}")
        
        # Format with variables
        formatted = prompt.format(
            amount="5000",
            merchant="Overseas Wire Transfer"
        )
        print(f"\nFormatted prompt:\n{'-'*50}\n{formatted}\n{'-'*50}")
        
        # Use with LLM
        response = groq_client.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=[{"role": "user", "content": formatted}],
            max_tokens=100
        )
        print(f"\nLLM Response: {response.choices[0].message.content}")
        
    except Exception as e:
        print(f"Could not load prompt: {e}")
        print("This may require MLflow 3.x with Prompt Registry enabled")
else:
    # Fallback: Manual prompt tracking
    print("Using manual prompt tracking (no mlflow.genai)")
    
    PROMPTS = {
        "v1": "Analyze this transaction for fraud: Amount: {amount}, Merchant: {merchant}. Fraudulent? YES/NO",
        "v2": "You are a fraud expert. Transaction: ${amount} at {merchant}. Verdict: FRAUD/LEGITIMATE, Confidence: %, Reason:"
    }
    
    for version, template in PROMPTS.items():
        formatted = template.format(amount="5000", merchant="Overseas Wire")
        print(f"\n{version}: {formatted[:80]}...")
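
The two registered versions expect different variables: v1 uses `amount` and `merchant`, while v2 also needs `time` and `location`. The sketch below is a small MLflow-independent check for which `{{ name }}` placeholders a template contains and which ones a given set of values leaves unfilled. `template_variables` and `missing_variables` are hypothetical helper names, and the templates are shortened copies of the ones registered above.

```python
import re
from typing import Dict, Set

# Matches {{ name }} placeholders, the double-brace style used by the templates above.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def template_variables(template: str) -> Set[str]:
    """Return the set of placeholder names found in a template."""
    return set(_PLACEHOLDER.findall(template))


def missing_variables(template: str, values: Dict[str, str]) -> Set[str]:
    """Return placeholder names that `values` does not supply."""
    return template_variables(template) - set(values)


v1 = "Amount: {{ amount }}\nMerchant: {{ merchant }}"
v2 = "Amount: ${{ amount }}\nMerchant: {{ merchant }}\nTime: {{ time }}\nLocation: {{ location }}"
values = {"amount": "5000", "merchant": "Overseas Wire Transfer"}

print(missing_variables(v1, values))          # set()
print(sorted(missing_variables(v2, values)))  # ['location', 'time']
```

Running a check like this before formatting makes it obvious why the values used for version 1 would not be enough for version 2.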

### 3.2 Model Evaluation with Scorers

MLflow GenAI provides `mlflow.genai.evaluate()` for systematic LLM evaluation:

**Built-in Scorers:**
- `Correctness()` - Compares output against expected facts/answers
- `RelevanceToQuery()` - Measures response relevance
- `Guidelines()` - Checks compliance with custom criteria
- `Safety()` - Detects harmful content

**Custom Scorers:** Write your own with the `@scorer` decorator.
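
A minimal sketch of the custom-scorer pattern, assuming `mlflow.genai.scorers.scorer` is importable in the installed MLflow release. The function names here (`is_short`, `names_verdict`, `as_scorers`) are hypothetical, and the fallback branch simply returns the undecorated functions so the sketch still runs without MLflow.

```python
from typing import Callable, List


def is_short(outputs: str) -> bool:
    """Pass when the response has at most 50 words."""
    return len(outputs.split()) <= 50


def names_verdict(outputs: str) -> bool:
    """Pass when the response contains FRAUD or LEGITIMATE (case-insensitive)."""
    upper = outputs.upper()
    return "FRAUD" in upper or "LEGITIMATE" in upper


def as_scorers(functions: List[Callable]) -> List[Callable]:
    """Wrap plain functions with @scorer when mlflow.genai is importable."""
    try:
        from mlflow.genai.scorers import scorer
    except ImportError:
        # Older or missing MLflow: fall back to the undecorated functions.
        return list(functions)
    return [scorer(fn) for fn in functions]


# The plain functions stay directly callable, which keeps them easy to unit test.
print(is_short("Yes, this is fraudulent."))
print(names_verdict("Verdict: LEGITIMATE"))
```

Keeping the scoring logic in plain functions and wrapping them only when they are handed to `mlflow.genai.evaluate()` means the same functions can be reused in a manual evaluation loop.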

In [None]:
# ============================================================================
# 3.2 Model Evaluation - Create Evaluation Dataset
# ============================================================================
# Evaluation dataset: each item pairs the inputs with the expected (ground-truth) response
eval_dataset = [
    {
        "inputs": {"query": "Is a $50 purchase at Starbucks fraudulent?"},
        "expectations": {"expected_response": "No, this is a normal low-value retail transaction."}
    },
    {
        "inputs": {"query": "Is a $5000 wire transfer to Nigeria at 3AM fraudulent?"},
        "expectations": {"expected_response": "Yes, this shows multiple fraud indicators: high amount, international wire, unusual time."}
    },
    {
        "inputs": {"query": "Is a $200 Amazon purchase fraudulent?"},
        "expectations": {"expected_response": "No, this is a typical e-commerce transaction."}
    },
    {
        "inputs": {"query": "Is a $10000 casino transaction with new card fraudulent?"},
        "expectations": {"expected_response": "Yes, high-risk merchant with large amount on new card."}
    },
]

print(f"Evaluation dataset: {len(eval_dataset)} test cases")
for i, item in enumerate(eval_dataset):
    print(f"  {i+1}. {item['inputs']['query'][:50]}...")

In [None]:
# ============================================================================
# Define Predict Function and Custom Scorers
# ============================================================================

# Predict function - must match input key names
def fraud_predict_fn(query: str) -> str:
    """Predict function for evaluation. Parameter name must match 'inputs' keys."""
    response = groq_client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[
            {"role": "system", "content": "You are a fraud analyst. Give brief, direct answers."},
            {"role": "user", "content": query}
        ],
        max_tokens=100,
        temperature=0.2
    )
    return response.choices[0].message.content

# Test predict function
test_output = fraud_predict_fn("Is a $50 Starbucks purchase fraudulent?")
print(f"Test prediction: {test_output[:100]}...")

# Custom scorers (code-based)
def is_concise(inputs: Dict, outputs: str, expectations: Dict) -> bool:
    """Check if response is concise (50 words or fewer)."""
    return len(outputs.split()) <= 50

def has_verdict(inputs: Dict, outputs: str, expectations: Dict) -> bool:
    """Check if response contains a clear verdict."""
    lower = outputs.lower()
    return any(word in lower for word in ["yes", "no", "fraud", "legitimate", "suspicious", "normal"])

def response_length(inputs: Dict, outputs: str, expectations: Dict) -> int:
    """Return response length in words."""
    return len(outputs.split())

print("\n✓ Custom scorers defined: is_concise, has_verdict, response_length")

In [None]:
# ============================================================================
# Run Evaluation with mlflow.genai.evaluate()
# ============================================================================
if HAS_GENAI:
    try:
        # Import built-in scorers and the decorator for custom ones
        from mlflow.genai.scorers import Correctness, RelevanceToQuery, Guidelines, scorer
        
        # Run evaluation
        print("Running MLflow GenAI evaluation...")
        results = mlflow.genai.evaluate(
            data=eval_dataset,
            predict_fn=fraud_predict_fn,
            scorers=[
                Correctness(),            # Compare to expected_response
                RelevanceToQuery(),       # Check relevance to query
                scorer(is_concise),       # Custom: 50 words or fewer
                scorer(has_verdict),      # Custom: contains clear verdict
                scorer(response_length),  # Custom: word count
            ]
        )
        
        print("\n✓ Evaluation complete!")
        print(f"→ View detailed results in MLflow UI → Evaluation tab")
        
        # Display summary
        if hasattr(results, 'tables') and 'eval_results' in results.tables:
            print("\nResults Preview:")
            print(results.tables['eval_results'].head())
            
    except ImportError as e:
        print(f"⚠ Scorers not available: {e}")
        print("  Using manual evaluation instead...")
    except Exception as e:
        print(f"⚠ Evaluation error: {e}")
        print("  This may require MLflow 3.x or specific configuration")
else:
    print("Skipping mlflow.genai.evaluate() - module not available")

In [None]:
# ============================================================================
# Manual Evaluation (Works with any MLflow version)
# ============================================================================
print("Running manual evaluation with MLflow tracking...\n")

eval_results = []

with mlflow.start_run(run_name="fraud_model_evaluation"):
    mlflow.log_param("model", "llama-3.1-8b-instant")
    mlflow.log_param("evaluation_size", len(eval_dataset))
    mlflow.log_param("method", "manual_eval")
    
    for i, item in enumerate(eval_dataset):
        query = item["inputs"]["query"]
        expected = item["expectations"]["expected_response"]
        
        # Get prediction
        output = fraud_predict_fn(query)
        
        # Score with custom metrics
        concise = len(output.split()) <= 50
        # Same keyword list as the has_verdict scorer defined above
        has_clear_verdict = any(w in output.lower() for w in ["yes", "no", "fraud", "legitimate", "suspicious", "normal"])
        word_count = len(output.split())
        
        # Simple correctness check (keyword match)
        expected_lower = expected.lower()
        output_lower = output.lower()
        if "yes" in expected_lower or "fraud" in expected_lower:
            correct = any(w in output_lower for w in ["yes", "fraud", "suspicious", "high risk"])
        else:
            correct = any(w in output_lower for w in ["no", "legitimate", "normal", "typical"])
        
        eval_results.append({
            "query": query[:40] + "...",
            "correct": "✓" if correct else "✗",
            "concise": "✓" if concise else "✗",
            "verdict": "✓" if has_clear_verdict else "✗",
            "words": word_count
        })
        
        print(f"  [{i+1}] {'✓' if correct else '✗'} {query[:50]}...")
    
    # Calculate aggregate metrics
    correct_count = sum(1 for r in eval_results if r["correct"] == "✓")
    concise_count = sum(1 for r in eval_results if r["concise"] == "✓")
    verdict_count = sum(1 for r in eval_results if r["verdict"] == "✓")
    avg_words = sum(r["words"] for r in eval_results) / len(eval_results)
    
    # Log metrics
    mlflow.log_metric("accuracy", correct_count / len(eval_dataset))
    mlflow.log_metric("conciseness_rate", concise_count / len(eval_dataset))
    mlflow.log_metric("verdict_rate", verdict_count / len(eval_dataset))
    mlflow.log_metric("avg_response_words", avg_words)
    
    # Log results as artifact
    mlflow.log_dict({"results": eval_results}, "eval_results.json")
    
    print(f"\n{'='*50}")
    print(f"Evaluation Summary:")
    print(f"  Accuracy:    {correct_count}/{len(eval_dataset)} ({100*correct_count/len(eval_dataset):.0f}%)")
    print(f"  Concise:     {concise_count}/{len(eval_dataset)}")
    print(f"  Has Verdict: {verdict_count}/{len(eval_dataset)}")
    print(f"  Avg Words:   {avg_words:.1f}")
    print(f"{'='*50}")
    print(f"\n→ View run in MLflow UI: fraud_model_evaluation")
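
The keyword checks in the manual evaluation use substring matching, so a short keyword such as "no" also matches inside "not" or "know". A hedged sketch of a whole-word alternative follows; `contains_word` is a hypothetical helper and is not used by the cell above.

```python
import re
from typing import Iterable


def contains_word(text: str, words: Iterable[str]) -> bool:
    """True if any of `words` occurs in `text` as a whole word or phrase (case-insensitive)."""
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


NEGATIVE = ["no", "legitimate", "normal", "typical"]

print("no" in "this is not unusual.")                         # True: substring hit inside "not"
print(contains_word("This is not unusual.", NEGATIVE))        # False
print(contains_word("No, this looks legitimate.", NEGATIVE))  # True
```

Whole-word matching is still only a rough proxy for correctness, but it removes the most common false positives from the accuracy metric.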

### 3.3 A/B Testing - Compare Models with Evaluation Metrics

Compare multiple models on the same evaluation dataset to determine which performs best for your use case.

**What to compare:**
- Different models (llama-3.1-8b vs llama-3.3-70b)
- Different prompts (v1 vs v2)
- Different temperatures
- Different system prompts
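
One way to enumerate such a comparison grid is `itertools.product`, sketched below. The temperature values and the `run_name` helper are assumptions for illustration. Building the run name from the full model identifier keeps names unique even when both models share the `llama` prefix.

```python
from itertools import product

models = ["llama-3.1-8b-instant", "llama-3.3-70b-versatile"]
prompts = ["v1_simple", "v2_detailed"]
temperatures = [0.2, 0.7]  # hypothetical values, not taken from the notebook


def run_name(model: str, prompt: str, temperature: float) -> str:
    """Build a run name that is unique per (model, prompt, temperature) combination."""
    return f"ab_{model}_{prompt}_t{temperature}"


grid = list(product(models, prompts, temperatures))
names = [run_name(*combo) for combo in grid]

print(len(grid))                      # 8
print(len(set(names)) == len(names))  # True
print(names[0])                       # ab_llama-3.1-8b-instant_v1_simple_t0.2
```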

In [None]:
# ============================================================================
# 3.3 A/B Testing - Compare Two Models
# ============================================================================

# Models to compare (Groq free tier)
MODELS_TO_TEST = [
    {"name": "llama-3.1-8b-instant", "alias": "Model A (Fast)"},
    {"name": "llama-3.3-70b-versatile", "alias": "Model B (Large)"},
]

# Prompts to test
PROMPTS_TO_TEST = [
    {
        "name": "v1_simple",
        "system": "You are a fraud analyst.",
        "template": "Is this fraudulent? {query} Answer YES or NO briefly."
    },
    {
        "name": "v2_detailed",
        "system": "You are an expert fraud detection analyst. Be thorough but concise.",
        "template": "Analyze for fraud: {query}\nProvide: Verdict (FRAUD/LEGITIMATE), Confidence (%), Brief reason."
    },
]

print("A/B Test Configuration:")
print(f"  Models: {[m['alias'] for m in MODELS_TO_TEST]}")
print(f"  Prompts: {[p['name'] for p in PROMPTS_TO_TEST]}")
print(f"  Test cases: {len(eval_dataset)}")
print(f"  Total runs: {len(MODELS_TO_TEST) * len(PROMPTS_TO_TEST)}")

In [None]:
# ============================================================================
# Run A/B Test - Evaluate Each Model/Prompt Combination
# ============================================================================
ab_results = []

print("Running A/B Tests...\n")

for model_config in MODELS_TO_TEST:
    for prompt_config in PROMPTS_TO_TEST:
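        # Use the full model name: both models start with "llama", so the first
        # hyphen-separated token alone would give both models the same run name.
        run_name = f"ab_{model_config['name']}_{prompt_config['name']}"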
        
        print(f"Testing: {model_config['alias']} + {prompt_config['name']}...")
        
        with mlflow.start_run(run_name=run_name):
            # Log configuration
            mlflow.log_param("model", model_config["name"])
            mlflow.log_param("model_alias", model_config["alias"])
            mlflow.log_param("prompt_version", prompt_config["name"])
            mlflow.log_param("system_prompt", prompt_config["system"][:50])
            mlflow.log_param("test_type", "ab_test")
            
            # Run evaluation
            correct = 0
            total_latency = 0
            total_tokens = 0
            
            for item in eval_dataset:
                query = item["inputs"]["query"]
                expected = item["expectations"]["expected_response"]
                
                # Format prompt
                formatted_query = prompt_config["template"].format(query=query)
                
                # Call model
                start = time.time()
                try:
                    response = groq_client.chat.completions.create(
                        model=model_config["name"],
                        messages=[
                            {"role": "system", "content": prompt_config["system"]},
                            {"role": "user", "content": formatted_query}
                        ],
                        max_tokens=100,
                        temperature=0.2
                    )
                    latency = time.time() - start
                    output = response.choices[0].message.content
                    tokens = response.usage.total_tokens
                    
                    total_latency += latency
                    total_tokens += tokens
                    
                    # Check correctness
                    expected_lower = expected.lower()
                    output_lower = output.lower()
                    if "yes" in expected_lower or "fraud" in expected_lower:
                        if any(w in output_lower for w in ["yes", "fraud", "suspicious", "high"]):
                            correct += 1
                    else:
                        if any(w in output_lower for w in ["no", "legitimate", "normal", "low"]):
                            correct += 1
                            
                except Exception as e:
                    print(f"    Error: {str(e)[:30]}")
                    latency = time.time() - start
                    total_latency += latency
            
            # Calculate and log metrics
            accuracy = correct / len(eval_dataset)
            avg_latency = total_latency / len(eval_dataset)
            avg_tokens = total_tokens / len(eval_dataset)
            
            mlflow.log_metric("accuracy", accuracy)
            mlflow.log_metric("avg_latency", avg_latency)
            mlflow.log_metric("avg_tokens", avg_tokens)
            mlflow.log_metric("total_correct", correct)
            
            ab_results.append({
                "model": model_config["alias"],
                "prompt": prompt_config["name"],
                "accuracy": f"{100*accuracy:.0f}%",
                "latency": f"{avg_latency:.2f}s",
                "tokens": f"{avg_tokens:.0f}",
                "run_name": run_name
            })
            
            print(f"  → Accuracy: {100*accuracy:.0f}%, Latency: {avg_latency:.2f}s")

print("\n" + "="*60)
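
The correctness check in the loop above uses substring matching, so `"no"` also matches inside `"know"`, and `"fraud"` matches a response such as "No, this is not fraudulent". A stricter alternative is sketched below under stated assumptions: `parse_verdict` and the two token sets are hypothetical and would need tuning to the prompts under test.

```python
import re

# Hypothetical label vocabularies; adjust them to the prompts being compared.
FRAUD_TOKENS = {"yes", "fraud", "fraudulent", "suspicious"}
LEGIT_TOKENS = {"no", "legitimate", "normal"}

def parse_verdict(output):
    """Return 'fraud', 'legitimate', or None from the first whole-word verdict token."""
    tokens = re.findall(r"[a-z]+", output.lower())
    for i, tok in enumerate(tokens):
        if tok in FRAUD_TOKENS:
            verdict = "fraud"
        elif tok in LEGIT_TOKENS:
            verdict = "legitimate"
        else:
            continue
        # A directly preceding "not" flips the verdict ("not fraud" -> legitimate).
        if i > 0 and tokens[i - 1] == "not":
            verdict = "legitimate" if verdict == "fraud" else "fraud"
        return verdict
    return None

def is_correct(expected, output):
    """Compare the parsed verdicts of the expected and actual responses."""
    expected_verdict = parse_verdict(expected)
    return expected_verdict is not None and expected_verdict == parse_verdict(output)
```

Responses with no recognizable verdict return `None` and count as incorrect, which makes unclear answers visible instead of letting them match by accident.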

In [None]:
# ============================================================================
# A/B Test Results Summary
# ============================================================================
print("A/B TEST RESULTS")
print("="*60)

df_ab = pd.DataFrame(ab_results)
print(df_ab.to_string(index=False))

print("\n" + "="*60)
print("WINNER ANALYSIS:")

# Find best by accuracy
best_accuracy = df_ab.loc[df_ab['accuracy'].str.rstrip('%').astype(int).idxmax()]
print(f"  Best Accuracy: {best_accuracy['model']} + {best_accuracy['prompt']} ({best_accuracy['accuracy']})")

# Find fastest
best_latency = df_ab.loc[df_ab['latency'].str.rstrip('s').astype(float).idxmin()]
print(f"  Fastest:       {best_latency['model']} + {best_latency['prompt']} ({best_latency['latency']})")

# Find most efficient (tokens)
best_tokens = df_ab.loc[df_ab['tokens'].astype(int).idxmin()]
print(f"  Most Efficient: {best_tokens['model']} + {best_tokens['prompt']} ({best_tokens['tokens']} tokens)")

print("="*60)
print(f"""
→ Compare runs in MLflow UI:
  1. Go to http://localhost:15000
  2. Select the '{EXPERIMENT_NAME}' experiment
  3. Filter the A/B test runs with: params.test_type = 'ab_test'
  4. Click 'Compare' to see side-by-side metrics
  5. Use Charts to visualize accuracy vs latency tradeoffs
""")

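The winner analysis above parses formatted strings such as `"80%"` back into numbers, and `idxmax` returns the first row when accuracies tie. A minimal sketch of an alternative keeps raw floats and breaks accuracy ties by latency. The helpers `pick_winner` and `pick_fastest` are hypothetical, and the rows are placeholder numbers, not measured results.

```python
# Placeholder numbers for illustration only; they are not measured results.
example_results = [
    {"run_name": "run_a", "accuracy": 0.80, "avg_latency": 0.40, "avg_tokens": 90.0},
    {"run_name": "run_b", "accuracy": 0.80, "avg_latency": 0.25, "avg_tokens": 120.0},
    {"run_name": "run_c", "accuracy": 0.60, "avg_latency": 0.10, "avg_tokens": 70.0},
]

def pick_winner(results):
    """Highest accuracy wins; ties go to the lower average latency."""
    return max(results, key=lambda r: (r["accuracy"], -r["avg_latency"]))

def pick_fastest(results):
    """Lowest average latency, regardless of accuracy."""
    return min(results, key=lambda r: r["avg_latency"])

winner = pick_winner(example_results)
fastest = pick_fastest(example_results)
print(f"Winner: {winner['run_name']}, fastest: {fastest['run_name']}")
```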
---
## Summary & Results

In [None]:
# ============================================================================
# View all runs
# ============================================================================
from mlflow.tracking import MlflowClient

client = MlflowClient()
exp = client.get_experiment_by_name(EXPERIMENT_NAME)

if exp:
    runs = client.search_runs([exp.experiment_id], max_results=15, order_by=["start_time DESC"])
    print(f"Recent Runs in '{EXPERIMENT_NAME}':")
    print("=" * 70)
    for run in runs:
        method = run.data.params.get("method", "litellm")
        metrics = ", ".join([f"{k}={v:.2f}" for k,v in list(run.data.metrics.items())[:2]]) or "no metrics"
        print(f"  [{method[:8]:8}] {run.info.run_name[:25]:25} | {metrics}")
    print(f"\n→ View all: http://localhost:15000/#/experiments/{exp.experiment_id}")

In [None]:
print("\n" + "=" * 70)
print("  MLflow GenAI Demo Complete!")
print("=" * 70)
print(f"""
  PART 1: Via LiteLLM Proxy
    - Auto-logging via success_callback
    - @mlflow.trace decorator
    - Custom spans with start_span()
    
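  PART 2: MLflow Standalone (Direct SDK)
    - mlflow.openai.autolog() - automatic tracing
    - Direct Groq SDK with manual tracing
    - Full control with log_param/log_metric/log_artifact

  PART 3: Advanced GenAI Features
    - Prompt Registry - version control for prompts
    - Model evaluation on a shared dataset
    - A/B testing across models and prompts

  MLflow UI: http://localhost:15000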
  Experiment: {EXPERIMENT_NAME}
  → Traces tab: View LLM call hierarchies
  → Runs: Compare model/prompt performance
""")
print("=" * 70)