# Serving and Deployment with MLflow ResponsesAgent

This notebook covers deployment options for MLflow ResponsesAgent models.

## Table of Contents
1. Deployment Overview
2. Local Model Serving
3. Docker Deployment
4. Databricks Model Serving
5. Testing Deployed Models
6. Production Best Practices

## Setup

In [None]:
import os
import json
import requests
from dotenv import load_dotenv
from typing import Generator

import mlflow
from mlflow.entities.span import SpanType
from mlflow.pyfunc import ResponsesAgent
from mlflow.types.responses import (
    ResponsesAgentRequest,
    ResponsesAgentResponse,
    ResponsesAgentStreamEvent,
)
from openai import OpenAI

# Load environment
load_dotenv()

# Set experiment
mlflow.set_experiment("Deployment_ResponseAgent")

print(f"MLflow version: {mlflow.__version__}")
print("✅ Setup complete!")

## 1. Deployment Overview

### Deployment Options for ResponsesAgent

| Option | Best For | Complexity | Scaling |
|--------|----------|------------|--------|
| **Local Serving** | Development, testing | Low | None |
| **Docker** | Containerized deployments | Medium | Manual |
| **Databricks** | Production, enterprise | Low | Auto |
| **Kubernetes** | Custom infrastructure | High | Auto |

### ResponsesAgent Deployment Flow

```
1. Create Agent → 2. Log to MLflow → 3. Choose Deployment → 4. Serve
```

### Key Concepts

1. **Models from Code**: Agent code is logged as a Python file
2. **Auto Signature**: Input/output schemas are auto-inferred
3. **Task Metadata**: `{"task": "agent/v1/responses"}` marks it as ResponsesAgent
4. **Dependencies**: pip requirements are bundled

## 2. Creating a Deployable Agent

Let's create a production-ready agent for deployment.

In [None]:
%%writefile deployable_agent.py
"""Production-ready ResponsesAgent for deployment."""

import os
from typing import Generator

import mlflow
from mlflow.entities.span import SpanType
from mlflow.pyfunc import ResponsesAgent
from mlflow.types.responses import (
    ResponsesAgentRequest,
    ResponsesAgentResponse,
    ResponsesAgentStreamEvent,
    output_to_responses_items_stream,
    to_chat_completions_input,
)
from openai import OpenAI


class DeployableAgent(ResponsesAgent):
    """
    A production-ready agent demonstrating best practices.
    
    Features:
    - OpenAI integration
    - Streaming support
    - Configurable parameters
    - Health check support
    - Error handling
    """
    
    def __init__(
        self,
        model: str = "gpt-4o-mini",
        system_prompt: str = None,
    ):
        self.model = model
        self.client = OpenAI()
        self.system_prompt = system_prompt or (
            "You are a helpful AI assistant. "
            "Provide clear, accurate, and concise responses."
        )
    
    def _prepare_messages(self, request: ResponsesAgentRequest) -> list:
        """Prepare messages with system prompt."""
        messages = [{"role": "system", "content": self.system_prompt}]
        messages.extend(
            to_chat_completions_input([i.model_dump() for i in request.input])
        )
        return messages
    
    @mlflow.trace(span_type=SpanType.AGENT)
    def predict(self, request: ResponsesAgentRequest) -> ResponsesAgentResponse:
        """Non-streaming prediction."""
        try:
            messages = self._prepare_messages(request)
            
            response = self.client.chat.completions.create(
                model=self.model,
                messages=messages,
            )
            
            return ResponsesAgentResponse(
                output=[
                    self.create_text_output_item(
                        text=response.choices[0].message.content,
                        id="msg_1",
                    )
                ],
                custom_outputs={
                    "model": self.model,
                    "usage": {
                        "prompt_tokens": response.usage.prompt_tokens,
                        "completion_tokens": response.usage.completion_tokens,
                        "total_tokens": response.usage.total_tokens,
                    }
                }
            )
        except Exception as e:
            return ResponsesAgentResponse(
                output=[
                    self.create_text_output_item(
                        text=f"Error: {str(e)}",
                        id="error_1",
                    )
                ]
            )
    
    @mlflow.trace(span_type=SpanType.AGENT)
    def predict_stream(
        self, request: ResponsesAgentRequest
    ) -> Generator[ResponsesAgentStreamEvent, None, None]:
        """Streaming prediction."""
        try:
            messages = self._prepare_messages(request)
            
            stream = self.client.chat.completions.create(
                model=self.model,
                messages=messages,
                stream=True,
            )
            
            yield from output_to_responses_items_stream(
                chunk.to_dict() for chunk in stream
            )
        except Exception as e:
            yield ResponsesAgentStreamEvent(
                type="response.output_item.done",
                item=self.create_text_output_item(
                    text=f"Error: {str(e)}",
                    id="error_1",
                )
            )


# Enable tracing
mlflow.openai.autolog()

# Create and set model
agent = DeployableAgent(
    model="gpt-4o-mini",
    system_prompt="You are a helpful assistant for technical questions.",
)
mlflow.models.set_model(agent)

In [None]:
# Log the model
with mlflow.start_run(run_name="deployable_agent") as run:
    model_info = mlflow.pyfunc.log_model(
        python_model="deployable_agent.py",
        artifact_path="agent",
        pip_requirements=[
            "mlflow>=3.0.0",
            "openai>=1.0.0",
            "pydantic>=2.0.0",
        ],
    )
    
    # Store run ID and model URI for deployment
    run_id = run.info.run_id
    model_uri = model_info.model_uri
    
    print(f"✅ Model logged successfully!")
    print(f"Run ID: {run_id}")
    print(f"Model URI: {model_uri}")
    print(f"\nModel metadata: {model_info.metadata}")

## 3. Local Model Serving

The simplest way to serve a ResponsesAgent locally.

In [None]:
# Display the command to start local serving
print("=" * 70)
print("LOCAL SERVING")
print("=" * 70)
print("\nTo serve your model locally, run this command in a terminal:\n")
print(f"  mlflow models serve -m {model_uri} -p 5001\n")
print("Or with the run ID:")
print(f"  mlflow models serve -m runs:/{run_id}/agent -p 5001\n")
print("Options:")
print("  -p, --port     : Port to serve on (default: 5000)")
print("  --host         : Host to bind to (default: 127.0.0.1)")
print("  --env-manager  : Environment manager (local, conda, virtualenv)")
print("  --no-conda     : Skip conda environment creation")
print("\nExample with all options:")
print(f"  mlflow models serve -m {model_uri} -p 5001 --host 0.0.0.0 --no-conda")
print("\n" + "=" * 70)

In [None]:
# Test local serving (run this after starting the server)
def test_local_server(port: int = 5001):
    """
    Test the locally served model.
    
    Run this after starting: mlflow models serve -m <model_uri> -p 5001
    """
    url = f"http://localhost:{port}/invocations"
    
    payload = {
        "input": [
            {"role": "user", "content": "What is Python?"}
        ]
    }
    
    try:
        response = requests.post(
            url,
            json=payload,
            headers={"Content-Type": "application/json"},
            timeout=30
        )
        
        if response.status_code == 200:
            result = response.json()
            print("✅ Server response:")
            print(json.dumps(result, indent=2))
        else:
            print(f"❌ Error: {response.status_code}")
            print(response.text)
            
    except requests.exceptions.ConnectionError:
        print("❌ Could not connect to server.")
        print(f"   Make sure the server is running on port {port}")
        print(f"   Command: mlflow models serve -m {model_uri} -p {port}")


# Uncomment to test (after starting server)
# test_local_server(5001)

## 4. Docker Deployment

Build a Docker image for containerized deployment.

In [None]:
# Docker deployment commands
print("=" * 70)
print("DOCKER DEPLOYMENT")
print("=" * 70)

print("\n1. Build Docker image:")
print(f"   mlflow models build-docker -m {model_uri} -n my-responses-agent\n")

print("2. Run the container:")
print("   docker run -p 5001:8080 \\")
print("     -e OPENAI_API_KEY=$OPENAI_API_KEY \\")
print("     my-responses-agent\n")

print("3. Test the container:")
print("   curl -X POST http://localhost:5001/invocations \\")
print("     -H 'Content-Type: application/json' \\")
print("     -d '{\"input\": [{\"role\": \"user\", \"content\": \"Hello!\"}]}'\n")

print("=" * 70)

In [None]:
# Generate a Dockerfile (alternative approach)
dockerfile_content = f'''# MLflow ResponsesAgent Docker Image
# Built from: {model_uri}

FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \\
    curl \\
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
RUN pip install --no-cache-dir \\
    mlflow>=3.0.0 \\
    openai>=1.0.0 \\
    pydantic>=2.0.0 \\
    gunicorn

# Copy model artifacts (you would need to export these first)
# COPY model /app/model

# Expose port
EXPOSE 8080

# Environment variables
ENV MLFLOW_MODEL_URI=/app/model

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \\
    CMD curl -f http://localhost:8080/health || exit 1

# Start server
CMD ["mlflow", "models", "serve", "-m", "/app/model", "-h", "0.0.0.0", "-p", "8080"]
'''

print("Example Dockerfile for custom deployments:")
print("-" * 50)
print(dockerfile_content)

## 5. Databricks Model Serving

Deploy to Databricks for managed, scalable serving.

In [None]:
# Databricks deployment guide
print("=" * 70)
print("DATABRICKS MODEL SERVING")
print("=" * 70)

print("""
Prerequisites:
- Databricks workspace with Unity Catalog
- Model registered in Unity Catalog
- Appropriate permissions

Step 1: Register the model in Unity Catalog
""")

databricks_register_code = f'''
import mlflow

# Set the registry URI to Unity Catalog
mlflow.set_registry_uri("databricks-uc")

# Register the model
mlflow.register_model(
    model_uri="{model_uri}",
    name="catalog.schema.my_responses_agent"
)
'''
print(databricks_register_code)

print("\nStep 2: Create serving endpoint")

databricks_serve_code = '''
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

endpoint = client.create_endpoint(
    name="my-responses-agent-endpoint",
    config={
        "served_entities": [
            {
                "name": "agent-entity",
                "entity_name": "catalog.schema.my_responses_agent",
                "entity_version": "1",
                "workload_size": "Small",
                "scale_to_zero_enabled": True,
            }
        ],
        "traffic_config": {
            "routes": [
                {
                    "served_model_name": "agent-entity-1",
                    "traffic_percentage": 100
                }
            ]
        },
    },
)

print(f"Endpoint created: {endpoint}")
'''
print(databricks_serve_code)

print("\nStep 3: Query the endpoint")

databricks_query_code = '''
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

response = client.predict(
    endpoint="my-responses-agent-endpoint",
    inputs={
        "input": [{"role": "user", "content": "Hello!"}]
    }
)

print(response)
'''
print(databricks_query_code)

print("\n" + "=" * 70)

## 6. Testing Deployed Models

Various ways to test your deployed agent.

In [None]:
# Python client for testing
class AgentClient:
    """
    Simple client for testing deployed ResponsesAgent.
    """
    
    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")
    
    def chat(self, message: str, history: list = None) -> dict:
        """
        Send a chat message to the agent.
        
        Args:
            message: User message
            history: Optional conversation history
            
        Returns:
            Agent response
        """
        input_messages = history or []
        input_messages.append({"role": "user", "content": message})
        
        payload = {"input": input_messages}
        
        response = requests.post(
            f"{self.base_url}/invocations",
            json=payload,
            headers={"Content-Type": "application/json"},
            timeout=60
        )
        
        response.raise_for_status()
        return response.json()
    
    def health_check(self) -> bool:
        """
        Check if the server is healthy.
        """
        try:
            response = requests.get(
                f"{self.base_url}/health",
                timeout=5
            )
            return response.status_code == 200
        except:
            return False


# Usage example
print("Agent Client Usage:")
print("-" * 40)
print('''
# Create client
client = AgentClient("http://localhost:5001")

# Check health
if client.health_check():
    print("Server is healthy!")

# Send message
response = client.chat("What is Python?")
print(response)

# Multi-turn conversation
history = [
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a programming language..."},
]
response = client.chat("What about JavaScript?", history=history)
''')

In [None]:
# cURL examples for testing
print("=" * 70)
print("CURL TESTING EXAMPLES")
print("=" * 70)

curl_examples = '''
# Basic request
curl -X POST http://localhost:5001/invocations \\
  -H 'Content-Type: application/json' \\
  -d '{
    "input": [{"role": "user", "content": "What is MLflow?"}]
  }'

# Multi-turn conversation
curl -X POST http://localhost:5001/invocations \\
  -H 'Content-Type: application/json' \\
  -d '{
    "input": [
      {"role": "user", "content": "What is Python?"},
      {"role": "assistant", "content": "Python is a programming language."},
      {"role": "user", "content": "What are its main uses?"}
    ]
  }'

# With context and custom inputs
curl -X POST http://localhost:5001/invocations \\
  -H 'Content-Type: application/json' \\
  -d '{
    "input": [{"role": "user", "content": "Hello!"}],
    "context": {"user_id": "123", "session_id": "abc"},
    "custom_inputs": {"request_type": "greeting"}
  }'

# Health check
curl http://localhost:5001/health
'''

print(curl_examples)

## 7. Production Best Practices

### Configuration and Environment

In [None]:
# Environment variables for production
print("=" * 70)
print("PRODUCTION ENVIRONMENT VARIABLES")
print("=" * 70)

env_vars = '''
# Required
OPENAI_API_KEY=sk-...                    # Your API key

# MLflow Configuration
MLFLOW_TRACKING_URI=...                  # Tracking server URI
MLFLOW_EXPERIMENT_NAME=production        # Experiment name

# Timeout Configuration (for long-running agents)
MLFLOW_DEPLOYMENT_PREDICT_TIMEOUT=120    # Single request timeout (seconds)
MLFLOW_DEPLOYMENT_PREDICT_TOTAL_TIMEOUT=600  # Total retry timeout

# Logging
MLFLOW_ENABLE_TRACING=true               # Enable MLflow tracing
LOG_LEVEL=INFO                           # Application log level

# Performance
GUNICORN_WORKERS=4                       # Number of worker processes
GUNICORN_TIMEOUT=120                     # Worker timeout
'''

print(env_vars)

In [None]:
# Production monitoring
print("=" * 70)
print("MONITORING AND OBSERVABILITY")
print("=" * 70)

monitoring_tips = '''
1. MLflow Tracing
   - Enable: mlflow.openai.autolog() or @mlflow.trace
   - View in MLflow UI -> Traces tab
   - Includes: latency, tokens, errors

2. Health Checks
   - Endpoint: GET /health
   - Use with load balancers and Kubernetes

3. Logging
   - MLflow logs predictions automatically
   - Add custom logging for business metrics

4. Metrics to Monitor
   - Request latency (p50, p95, p99)
   - Token usage per request
   - Error rate
   - Requests per second
   - Cost per request

5. Alerting
   - High error rate (>1%)
   - Latency spikes (>10s p95)
   - Unusual token usage patterns
   - API key expiration
'''

print(monitoring_tips)

In [None]:
# Security best practices
print("=" * 70)
print("SECURITY BEST PRACTICES")
print("=" * 70)

security_tips = '''
1. API Keys
   - Never commit API keys to version control
   - Use environment variables or secrets managers
   - Rotate keys regularly
   - Use separate keys for dev/staging/production

2. Network Security
   - Use HTTPS in production
   - Implement rate limiting
   - Use authentication for endpoints
   - Restrict access by IP if possible

3. Input Validation
   - Validate all user inputs
   - Limit input/output sizes
   - Sanitize conversation history

4. Data Privacy
   - Log only necessary information
   - Implement data retention policies
   - Consider PII handling requirements

5. Model Security
   - Version control all model artifacts
   - Audit model changes
   - Test before deploying to production
'''

print(security_tips)

## 8. Troubleshooting

Common issues and solutions.

In [None]:
print("=" * 70)
print("COMMON ISSUES AND SOLUTIONS")
print("=" * 70)

troubleshooting = '''
Issue: "ModuleNotFoundError" when serving
Solution: Ensure all dependencies are in pip_requirements
  mlflow.pyfunc.log_model(
      ...,
      pip_requirements=["mlflow", "openai", "pydantic>=2.0.0"]
  )

---

Issue: "Timeout" on long-running predictions
Solution: Increase timeout environment variables
  export MLFLOW_DEPLOYMENT_PREDICT_TIMEOUT=900  # 15 minutes
  export MLFLOW_DEPLOYMENT_PREDICT_TOTAL_TIMEOUT=1200

---

Issue: "OPENAI_API_KEY not found"
Solution: Set environment variable before serving
  export OPENAI_API_KEY=sk-...
  mlflow models serve -m <model_uri>

---

Issue: Schema validation errors
Solution: Ensure request matches ResponsesAgentRequest schema
  {
    "input": [{"role": "user", "content": "..."}],  # Required
    "context": {...},  # Optional
    "custom_inputs": {...}  # Optional
  }

---

Issue: Missing traces in MLflow UI
Solution: Enable autologging in your agent code
  mlflow.openai.autolog()
  # or use @mlflow.trace decorator

---

Issue: Docker container won't start
Solution: Check logs and ensure port is exposed
  docker logs <container_id>
  docker run -p 5001:8080 -e OPENAI_API_KEY=... <image>
'''

print(troubleshooting)

## Summary

### Deployment Options:

| Method | Command | Best For |
|--------|---------|----------|
| **Local** | `mlflow models serve -m <uri>` | Development |
| **Docker** | `mlflow models build-docker` | Containerized |
| **Databricks** | `client.create_endpoint()` | Production |

### Key Commands:

```bash
# Local serving
mlflow models serve -m runs:/<run_id>/agent -p 5001

# Build Docker image
mlflow models build-docker -m runs:/<run_id>/agent -n my-agent

# Run Docker container
docker run -p 5001:8080 -e OPENAI_API_KEY=$OPENAI_API_KEY my-agent
```

### Best Practices:

1. ✅ Always test locally before deploying
2. ✅ Use environment variables for secrets
3. ✅ Enable tracing for observability
4. ✅ Implement proper error handling
5. ✅ Monitor key metrics in production
6. ✅ Set appropriate timeouts

### Next Steps:
- Deploy your agent to your preferred platform
- Set up monitoring and alerting
- Implement CI/CD for model updates