# MLflow LLM Prompt Engineering Exploration

This notebook explores advanced prompt engineering techniques using **MLflow**, focusing on the new **GenAI features** including the Prompt Registry, experiment tracking, and evaluation capabilities.

## 🎯 Learning Objectives

- **Prompt Registry**: Learn to version and manage prompts using MLflow's new Prompt Registry
- **Experiment Tracking**: Track prompt engineering experiments with MLflow
- **Evaluation**: Implement LLM evaluation metrics and comparison frameworks

## 📚 Context

This notebook builds upon the plant care chatbot example from the LLMOps pipeline, demonstrating how to systematically engineer and evaluate prompts for customer service applications.

---

## 🚀 Getting Started

Let's begin by setting up our environment with **MLflow** and exploring the latest GenAI capabilities!

## 1. Install and Import Required Libraries

First, let's install MLflow and the necessary libraries for LLM prompt engineering.

In [2]:
# Install required packages
%pip install mlflow==3.3.1 --quiet
%pip install openai --quiet
%pip install dspy --quiet
%pip install rouge-score --quiet
%pip install pandas --quiet
%pip install requests --quiet
%pip install textstat evaluate transformers --quiet
%pip install python-dotenv --quiet

print("✅ All packages installed successfully!")

Note: you may need to restart the kernel to use updated packages.
✅ All packages installed successfully!


In [3]:
# Import essential libraries
import mlflow
import mlflow.genai
from mlflow.tracking import MlflowClient
from mlflow.entities import Prompt
import pandas as pd
import numpy as np
import json
import os
from datetime import datetime
from typing import Dict, List, Any, Optional
import requests
import time
import textstat
from dotenv import load_dotenv
import litellm
from pathlib import Path

# Evaluation and metrics
# from rouge_score import rouge_scorer
import re

# Disable the inline trace UI in notebook cell output
mlflow.tracing.disable_notebook_display()

# Display MLflow version to confirm we're using 3.3.1
print(f"🔍 MLflow Version: {mlflow.__version__}")
print(f"📅 Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

🔍 MLflow Version: 3.3.1
📅 Date: 2025-09-13 13:06:22


## 2. Set Up MLflow Tracking

Configure MLflow for tracking our prompt engineering experiments. We'll use the new GenAI features introduced in MLflow 3.x.

In [4]:
# MLflow configuration
EXPERIMENT_NAME = "Bonsai-Care-Prompt-Engineering"

def get_tracking_uri():
    """Try to discover the MLflow server automatically."""
    uris_to_try = [
        "http://mlflow:5000",      # internal Docker network (containers)
        "http://localhost:5001",   # external host (mapped port)
        "http://127.0.0.1:5001"    # fallback for the external host
    ]

    for uri in uris_to_try:
        try:
            # try the /health endpoint first (lighter than /experiments)
            r = requests.get(f"{uri}/health", timeout=3)
            if r.status_code == 200:
                print(f"✅ Connected to MLflow at {uri}")
                return uri
        except Exception as e:
            print(f"⚠️  Could not connect to {uri}: {e}")

    raise RuntimeError("❌ Could not find any active MLflow server.")

# Discover the server automatically
MLFLOW_TRACKING_URI = get_tracking_uri()
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

# Create or get the experiment
try:
    experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
    if experiment is None:
        experiment_id = mlflow.create_experiment(EXPERIMENT_NAME)
        print(f"📝 Created new experiment: {EXPERIMENT_NAME}")
    else:
        experiment_id = experiment.experiment_id
        print(f"📂 Using existing experiment: {EXPERIMENT_NAME}")
except Exception as e:
    print(f"⚠️  Falling back to the default experiment because of: {e}")
    experiment_id = "0"

# Set the active experiment
mlflow.set_experiment(EXPERIMENT_NAME)

print(f"🎯 Experiment ID: {experiment_id}")
print(f"🔗 MLflow UI: {MLFLOW_TRACKING_URI}")


⚠️  Could not connect to http://mlflow:5000: HTTPConnectionPool(host='mlflow', port=5000): Max retries exceeded with url: /health (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x00000202DC497CB0>: Failed to resolve 'mlflow' ([Errno 11001] getaddrinfo failed)"))
✅ Connected to MLflow at http://localhost:5001
📂 Using existing experiment: Bonsai-Care-Prompt-Engineering
🎯 Experiment ID: 1
🔗 MLflow UI: http://localhost:5001
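
The discovery step above amounts to "return the first candidate that answers a health probe". Below is a minimal sketch of that pattern with the probe passed in as a function, so it can be exercised without a running server. The names `first_reachable` and `fake_probe` are illustrative assumptions, not part of MLflow or `requests`.

```python
from typing import Callable, Iterable, Optional


def first_reachable(uris: Iterable[str], probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first URI for which probe(uri) is truthy, or None if none respond."""
    for uri in uris:
        try:
            if probe(uri):
                return uri
        except Exception:
            # A failing probe (DNS error, timeout, ...) only means "try the next one".
            continue
    return None


# Hypothetical probe standing in for requests.get(f"{uri}/health").
def fake_probe(uri: str) -> bool:
    if "mlflow:5000" in uri:
        raise ConnectionError("name resolution failed")
    return uri.endswith(":5001")


candidates = ["http://mlflow:5000", "http://localhost:5001", "http://127.0.0.1:5001"]
chosen = first_reachable(candidates, fake_probe)
print(chosen)  # → http://localhost:5001
```

Returning `None` instead of raising leaves the choice of fallback to the caller; the notebook's version raises a `RuntimeError` because nothing else in the notebook can run without a tracking server.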


## 3. Create Basic Prompt Templates

Let's define various prompt templates for our plant care chatbot. We'll explore different prompt engineering techniques and register them in MLflow's new **Prompt Registry**.

In [5]:
# Plant Care Prompt Templates
class PlantCarePrompts:
    """Collection of prompt templates for plant care customer service"""
    
    @staticmethod
    def get_basic_template():
        """Basic conversational prompt"""
        return {
            "name": "plant_care_basic",
            "template": """You are a plant care expert assistant. Answer the customer's question about plant care.

Customer Question: {{question}}

Answer:""",
            "description": "Basic plant care assistant prompt",
            "tags": {"type": "basic", "domain": "plant_care", "version": "1.0"}
        }
    
    @staticmethod
    def get_structured_template():
        """Structured response prompt with specific format"""
        return {
            "name": "plant_care_structured", 
            "template": """You are a professional plant care consultant. Provide a structured response to the customer's plant care question.

Customer Question: {{question}}

Please structure your response as follows:
1. **Problem Assessment**: Brief analysis of the issue
2. **Immediate Actions**: What to do right now
3. **Long-term Care**: Ongoing care recommendations
4. **Prevention**: How to prevent this in the future

Response:""",
            "description": "Structured plant care response format",
            "tags": {"type": "structured", "domain": "plant_care", "version": "1.0"}
        }
    
    @staticmethod
    def get_diagnostic_template():
        """Diagnostic prompt for plant problems"""
        return {
            "name": "plant_care_diagnostic",
            "template": """You are a plant pathologist assistant. Help diagnose plant problems systematically.

Customer Description: {{question}}

Analysis Process:
1. Identify key symptoms mentioned
2. Consider possible causes (watering, light, nutrients, pests, diseases)
3. Ask clarifying questions if needed
4. Provide diagnosis with confidence level
5. Suggest treatment plan

Diagnostic Response:""",
            "description": "Diagnostic approach for plant problems",
            "tags": {"type": "diagnostic", "domain": "plant_care", "version": "1.0"}
        }
    
    @staticmethod
    def get_emergency_template():
        """Emergency response prompt for urgent plant care"""
        return {
            "name": "plant_care_emergency",
            "template": """🚨 PLANT EMERGENCY RESPONSE PROTOCOL 🚨

You are an emergency plant care specialist. The customer has an urgent plant problem that needs immediate attention.

Emergency Description: {{question}}

IMMEDIATE RESPONSE PROTOCOL:
⚡ URGENT ACTIONS (Next 24 hours):
🔍 ASSESSMENT NEEDED:
📋 MONITORING PLAN:
⚠️  WARNING SIGNS TO WATCH:

Provide quick, actionable advice to save the plant!""",
            "description": "Emergency response for critical plant issues",
            "tags": {"type": "emergency", "domain": "plant_care", "urgency": "high", "version": "1.0"}
        }

# Create prompt instances
prompts = PlantCarePrompts()
basic_prompt = prompts.get_basic_template()
structured_prompt = prompts.get_structured_template()
diagnostic_prompt = prompts.get_diagnostic_template()
emergency_prompt = prompts.get_emergency_template()

print("🎨 Created 4 different prompt templates:")
for i, prompt in enumerate([basic_prompt, structured_prompt, diagnostic_prompt, emergency_prompt], 1):
    print(f"  {i}. {prompt['name']}: {prompt['description']}")

🎨 Created 4 different prompt templates:
  1. plant_care_basic: Basic plant care assistant prompt
  2. plant_care_structured: Structured plant care response format
  3. plant_care_diagnostic: Diagnostic approach for plant problems
  4. plant_care_emergency: Emergency response for critical plant issues


In [6]:
basic_prompt

{'name': 'plant_care_basic',
 'template': "You are a plant care expert assistant. Answer the customer's question about plant care.\n\nCustomer Question: {{question}}\n\nAnswer:",
 'description': 'Basic plant care assistant prompt',
 'tags': {'type': 'basic', 'domain': 'plant_care', 'version': '1.0'}}
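
The templates use double-brace placeholders such as `{{question}}`. Before registering a template it can help to check which variables it expects. The sketch below does this with a regular expression; `template_variables` and `fill` are hypothetical helpers written for illustration and are not MLflow functions.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def template_variables(template: str) -> list[str]:
    """Return the distinct {{variable}} names in order of first appearance."""
    seen: list[str] = []
    for name in PLACEHOLDER.findall(template):
        if name not in seen:
            seen.append(name)
    return seen


def fill(template: str, **values: str) -> str:
    """Substitute every placeholder; raise KeyError if a value is missing."""
    missing = [v for v in template_variables(template) if v not in values]
    if missing:
        raise KeyError(f"missing template variables: {missing}")
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), template)


template = "Customer Question: {{question}}\n\nAnswer:"
print(template_variables(template))  # → ['question']
print(fill(template, question="How often should I water my ficus?"))
```

Failing loudly on a missing variable avoids sending a prompt to the model with a literal `{{question}}` still in it.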

## 4. Register Prompts in MLflow Prompt Registry

MLflow introduces the **Prompt Registry** for versioning and managing prompts. Let's register our templates.

In [7]:
def register_prompt_in_mlflow(prompt_config: Dict) -> Optional[str]:
    """Register a prompt in MLflow's Prompt Registry"""
    try:
        client = MlflowClient()
        
        # Create the prompt
        prompt = client.register_prompt(
            name=prompt_config["name"],
            template=prompt_config["template"],
            tags=prompt_config["tags"],
            # description=prompt_config["description"]
        )
        
        print(f"✅ Registered prompt: {prompt_config['name']} (Version {prompt.version})")
        return f"prompts:/{prompt_config['name']}/{prompt.version}"
        
    except Exception as e:
        print(f"⚠️  Failed to register {prompt_config['name']}: {e}")
        return None

# Register all prompts
print("📝 Registering prompts in MLflow Prompt Registry...")
prompt_uris = {}

for prompt_config in [basic_prompt, structured_prompt, diagnostic_prompt, emergency_prompt]:
    uri = register_prompt_in_mlflow(prompt_config)
    if uri:
        prompt_uris[prompt_config["name"]] = uri

print(f"\n🎯 Successfully registered {len(prompt_uris)} prompts!")

📝 Registering prompts in MLflow Prompt Registry...


2025/09/13 13:06:37 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for prompt version to finish creation. Prompt name: plant_care_basic, version 14


✅ Registered prompt: plant_care_basic (Version 14)


2025/09/13 13:06:38 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for prompt version to finish creation. Prompt name: plant_care_structured, version 8


✅ Registered prompt: plant_care_structured (Version 8)


2025/09/13 13:06:38 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for prompt version to finish creation. Prompt name: plant_care_diagnostic, version 8


✅ Registered prompt: plant_care_diagnostic (Version 8)


2025/09/13 13:06:39 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for prompt version to finish creation. Prompt name: plant_care_emergency, version 8


✅ Registered prompt: plant_care_emergency (Version 8)

🎯 Successfully registered 4 prompts!
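
The registration helper returns URIs of the form `prompts:/<name>/<version>`. A small sketch of building and parsing that form is shown here, which can be handy when storing references in run parameters. `PromptRef`, `build_prompt_uri` and `parse_prompt_uri` are illustrative names, not MLflow APIs, and only the name/version form used in this notebook is handled.

```python
from typing import NamedTuple


class PromptRef(NamedTuple):
    name: str
    version: int


def build_prompt_uri(name: str, version: int) -> str:
    """Build a URI in the same shape the notebook's helper returns."""
    return f"prompts:/{name}/{version}"


def parse_prompt_uri(uri: str) -> PromptRef:
    """Split 'prompts:/<name>/<version>' into its parts, validating the shape."""
    prefix = "prompts:/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a prompt URI: {uri!r}")
    name, sep, version = uri[len(prefix):].rpartition("/")
    if not sep or not name or not version.isdigit():
        raise ValueError(f"expected prompts:/<name>/<version>, got {uri!r}")
    return PromptRef(name, int(version))


ref = parse_prompt_uri("prompts:/plant_care_basic/14")
print(ref)  # → PromptRef(name='plant_care_basic', version=14)
print(build_prompt_uri(ref.name, ref.version + 1))  # → prompts:/plant_care_basic/15
```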


In [8]:
basic_prompt["template"] = "You are a bonsai care expert assistant. Answer the customer's question about plant care.\n\nCustomer Question: {{question}}\n\nAnswer:"

In [9]:
uri = register_prompt_in_mlflow(basic_prompt)

if uri:
    print(f"\n🎯 Registered new prompt version: {uri}")

2025/09/13 13:06:44 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for prompt version to finish creation. Prompt name: plant_care_basic, version 15


✅ Registered prompt: plant_care_basic (Version 15)

🎯 Registered new prompt version: prompts:/plant_care_basic/15


### Let's Check Our Configuration and Test the LLM
Your variables should be configured in the `.env` file.

In [12]:
# Determine the path to the .env file
# This is an example; adjust the path to match your folder structure

try:
    # This line works when the code runs as a Python script
    dotenv_path = Path(__file__).resolve().parent.parent.parent / 'docker' / '.env'
except NameError:
    # This branch is used when the code runs in a notebook environment
    # We assume the notebook's working directory is a sibling of the 'docker' folder
    dotenv_path = Path(os.getcwd()).parent / 'docker' / '.env'
# -------------------------------------------------------------

print(f"🔍 Loading environment variables from: {dotenv_path}")

# Load the environment variables from the .env file using the absolute path
load_dotenv(dotenv_path=dotenv_path)

# Validate that the required variables are present before copying them;
# assigning None to os.environ would raise a TypeError
required = ["OPENAI_API_TYPE", "AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT",
            "OPENAI_API_VERSION", "AZURE_DEPLOYMENT_NAME"]
missing = [k for k in required if not os.getenv(k)]
if missing:
    raise ValueError(f"Missing Azure OpenAI environment variables: {', '.join(missing)}")

# Configure environment variables for litellm and Azure OpenAI from the .env values
os.environ["LITELLM_PROVIDER"] = os.getenv("OPENAI_API_TYPE")
os.environ["AZURE_API_KEY"] = os.getenv("AZURE_OPENAI_API_KEY")
os.environ["AZURE_API_BASE"] = os.getenv("AZURE_OPENAI_ENDPOINT")
os.environ["AZURE_API_VERSION"] = os.getenv("OPENAI_API_VERSION")

os.environ["OPENAI_API_BASE"] = os.getenv("AZURE_OPENAI_ENDPOINT")
os.environ["OPENAI_API_KEY"] = os.getenv("AZURE_OPENAI_API_KEY")
os.environ["OPENAI_DEPLOYMENT_NAME"] = os.getenv("AZURE_DEPLOYMENT_NAME")


#for k in ["AZURE_API_KEY", "AZURE_API_BASE", "AZURE_API_VERSION", "AZURE_DEPLOYMENT_NAME"]:
#    print(k, "=", os.getenv(k))

# Test call using litellm 
resp = litellm.completion(
    model="azure/gpt-4o",
    messages=[{"role": "user", "content": "2+2"}]
)
print(resp.choices[0].message.content) # type: ignore


🔍 Loading environment variables from: c:\Users\ruial\OneDrive - Associação Porto Business School\PBS\MLOps\Docker\mlops_pcfixo\mlops\aula3_case_study\docker\.env
2 + 2 = 4
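
Since everything downstream depends on the `.env` values, it can be useful to report all missing settings at once instead of failing on the first one. A minimal sketch follows; `missing_settings` is a hypothetical helper, and the variable names are the ones this notebook reads.

```python
from typing import Mapping, Sequence

# Variable names taken from the configuration cell above.
REQUIRED = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "OPENAI_API_VERSION",
    "AZURE_DEPLOYMENT_NAME",
)


def missing_settings(env: Mapping[str, str], required: Sequence[str] = REQUIRED) -> list[str]:
    """Return the required keys that are absent or empty in env."""
    return [key for key in required if not env.get(key)]


# Example mapping standing in for os.environ; the values are placeholders.
example_env = {
    "AZURE_OPENAI_ENDPOINT": "https://example.invalid",
    "AZURE_OPENAI_API_KEY": "",
}
print(missing_settings(example_env))
# → ['AZURE_OPENAI_API_KEY', 'OPENAI_API_VERSION', 'AZURE_DEPLOYMENT_NAME']
```

In the notebook the same function could be called with `os.environ` right after `load_dotenv`.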


## 5. Using Registered Prompts from MLflow

Now that we have registered our prompts, let's load one from the registry and use it to query the model. `mlflow.genai.load_prompt` takes the prompt URI and returns the registered prompt version, whose `format()` method fills in the template variables. Treating prompts as versioned artifacts in this way is great for reproducibility.

In [14]:
from openai import OpenAI, AzureOpenAI
mlflow.openai.autolog()
client = AzureOpenAI(
    api_version=os.getenv("OPENAI_API_VERSION"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

if uri:
    # Load the registered prompt version from the Prompt Registry
    prompt = mlflow.genai.load_prompt(uri)

    # Define a question
    question = "My bonsai's leaves are turning yellow and falling off. What should I do?"

    print("📝 Formatted Prompt using the registered template:")
    
    print(prompt.template)

    # Use the loaded prompt to format the question
    response = client.chat.completions.create(
        messages=[{
            "role": "user",
            "content": prompt.format(question=question),
        }],
        model=os.getenv("AZURE_DEPLOYMENT_NAME"),
    )
    
    print("📝 The direct response from the model using the loaded prompt:")
    print(response.choices[0].message.content)
else:
    print("⚠️ Could not find the registered prompt URI. Please check if the prompt was registered successfully.")

📝 Formatted Prompt using the registered template:
You are a bonsai care expert assistant. Answer the customer's question about plant care.

Customer Question: {{question}}

Answer:
📝 The direct response from the model using the loaded prompt:
Yellowing leaves and leaf drop in a bonsai tree can indicate that your plant is stressed due to environmental factors, improper care, or health issues. Here's how to troubleshoot and address the problem:

### 1. **Evaluate Watering Habits**
   - **Overwatering**: Ensuring proper drainage is critical. Too much water can lead to root rot, which prevents the tree from absorbing nutrients. Check the soil—if it's waterlogged, let it dry out partially before watering again.
   - **Underwatering**: Bonsai trees need consistent moisture. If the soil is completely dry and your tree looks thirsty, water it thoroughly. Avoid letting the soil dry out entirely between waterings.

### 2. **Assess Lighting Conditions**
   - Most bonsai trees require bright, indi

## 6. Tracing for LLM Observability

MLflow's tracing capabilities allow us to log the inputs, outputs, and metadata of LLM calls, providing observability into our GenAI applications. Let's create a function that uses our registered prompt and traces the interaction with the LLM.

We will use `mlflow.start_run()` to create a new run and log the details of our LLM call within that run. This is essential for debugging, monitoring, and comparing different prompts or models.

Note: to ensure complete observability, the `@mlflow.trace` decorator should generally be the outermost one when multiple decorators are used. See "Using @mlflow.trace with Other Decorators" in the MLflow documentation for a detailed explanation and examples.

In [15]:
import mlflow


@mlflow.trace(span_type="func", attributes={"key": "value"})
def add_1(x):
    return x + 1


@mlflow.trace(span_type="func", attributes={"key1": "value1"})
def minus_1(x):
    return x - 1


@mlflow.trace(name="Trace Test")
def trace_test(x):
    step1 = add_1(x)
    return minus_1(step1)


trace_test(4)

4
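
To see why decorator order matters for observability, here is a sketch that uses a toy recording decorator in place of `@mlflow.trace`. The `record` decorator and the `calls` list are stand-ins invented for this example and do not use MLflow at all. Combined with `functools.lru_cache`, the outermost position sees every call, while the inner position only sees cache misses.

```python
import functools

calls = []  # stand-in for a trace store; NOT MLflow's API


def record(fn):
    """Toy tracing decorator: logs (name, args, result) for each call it sees."""
    @functools.wraps(fn)
    def wrapper(*args):
        result = fn(*args)
        calls.append((fn.__name__, args, result))
        return result
    return wrapper


@record                                # outermost: sees every call, cache hits included
@functools.lru_cache(maxsize=None)
def square_outer(x):
    return x * x


@functools.lru_cache(maxsize=None)
@record                                # inner: only runs when the cache misses
def square_inner(x):
    return x * x


for _ in range(3):
    square_outer(2)
    square_inner(2)

outer_seen = sum(1 for name, _, _ in calls if name == "square_outer")
inner_seen = sum(1 for name, _, _ in calls if name == "square_inner")
print(outer_seen, inner_seen)  # → 3 1
```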

In [16]:
# Sample plant care questions for testing
test_questions = [
    {
        "question": "My plant leaves are turning yellow and falling off. What should I do?",
        "category": "disease_diagnosis",
        "complexity": "medium"
    },
    {
        "question": "Help! My succulent is turning black and mushy at the base!",
        "category": "emergency",
        "complexity": "high"
    },
    {
        "question": "How often should I water my fiddle leaf fig?",
        "category": "care_routine",
        "complexity": "low"
    },
    {
        "question": "I noticed tiny white bugs on my plant leaves, what are they?",
        "category": "pest_identification",
        "complexity": "medium"
    }
]

print("🧪 Test Questions Prepared:")
for i, q in enumerate(test_questions, 1):
    print(f"  {i}. [{q['category']}] {q['question'][:60]}...")

🧪 Test Questions Prepared:
  1. [disease_diagnosis] My plant leaves are turning yellow and falling off. What sho...
  2. [emergency] Help! My succulent is turning black and mushy at the base!...
  3. [care_routine] How often should I water my fiddle leaf fig?...
  4. [pest_identification] I noticed tiny white bugs on my plant leaves, what are they?...


In [17]:
import time  # Required to measure response time
import math  # Required to calculate confidence from logprobs

def format_prompt(template: str, **kwargs) -> str:
    """Format prompt template with variables"""
    formatted = template
    for key, value in kwargs.items():
        formatted = formatted.replace(f"{{{{{key}}}}}", str(value))
    return formatted
    
def run_prompt_experiment(prompt_config: Dict, test_questions: List[Dict]) -> Dict:
    """Run experiment with a specific prompt template, adapted for the Azure OpenAI client."""
    
    with mlflow.start_run(run_name=f"prompt_{prompt_config['name']}") as run:
        
        # Log prompt metadata
        mlflow.log_param("prompt_name", prompt_config["name"])
        mlflow.log_param("prompt_type", prompt_config["tags"].get("type", "unknown"))
        mlflow.log_param("num_test_questions", len(test_questions))
        
        # Log the prompt template as an artifact
        prompt_file = f"prompt_{prompt_config['name']}.txt"
        with open(prompt_file, "w") as f:
            f.write(prompt_config["template"])
        mlflow.log_artifact(prompt_file, "prompts")
        os.remove(prompt_file)  # Clean up
        
        results = []
        total_word_count = 0
        total_response_time = 0
        confidence_scores = []
        
        for i, question_data in enumerate(test_questions):
            
            # Format prompt
            formatted_prompt = format_prompt(
                prompt_config["template"], 
                question=question_data["question"]
            )
            
            # Measure response time
            start_time = time.time()
            
            # Make the actual LLM call, requesting logprobs to calculate confidence
            response = client.chat.completions.create(
                messages=[{
                    "role": "user",
                    "content": formatted_prompt,
                }],
                model=os.getenv("AZURE_DEPLOYMENT_NAME"),
                logprobs=True  # Request log probabilities for confidence calculation
            )
            
            end_time = time.time()
            response_time = end_time - start_time
            
            # Extract the response text and compute metrics
            llm_response_text = response.choices[0].message.content or ""  # content can be None
            word_count = len(llm_response_text.split())
            
            # Calculate average confidence from token log probabilities
            avg_confidence = 0
            if response.choices[0].logprobs:
                token_logprobs = [lp.logprob for lp in response.choices[0].logprobs.content]
                # Convert log probabilities to actual probabilities (e^x) and average them
                token_probs = [math.exp(lp) for lp in token_logprobs]
                if token_probs:
                    avg_confidence = np.mean(token_probs)

            # Accumulate metrics for the aggregate statistics
            total_word_count += word_count
            total_response_time += response_time
            confidence_scores.append(avg_confidence)
            
            # Store the per-question result
            result = {
                "question_id": i,
                "question": question_data["question"],
                "category": question_data["category"],
                "complexity": question_data["complexity"],
                "formatted_prompt": formatted_prompt,
                "response": llm_response_text,
                "word_count": word_count,
                "response_time": response_time,
                "confidence": avg_confidence
            }
            results.append(result)
            
            # Log per-question metrics
            mlflow.log_metric(f"question_{i}_word_count", word_count)
            mlflow.log_metric(f"question_{i}_response_time", response_time)
            mlflow.log_metric(f"question_{i}_confidence", avg_confidence)
        
        # Calculate and log aggregate metrics
        avg_word_count = total_word_count / len(test_questions)
        avg_response_time = total_response_time / len(test_questions)
        avg_confidence = np.mean(confidence_scores)
        
        mlflow.log_metric("avg_word_count", avg_word_count)
        mlflow.log_metric("avg_response_time", avg_response_time)
        mlflow.log_metric("avg_confidence", avg_confidence)
        mlflow.log_metric("total_questions", len(test_questions))
        
        # Save detailed results as an artifact
        results_df = pd.DataFrame(results)
        results_file = f"results_{prompt_config['name']}.csv"
        results_df.to_csv(results_file, index=False)
        mlflow.log_artifact(results_file, "results")
        os.remove(results_file)  # Clean up
        
        print(f"✅ Completed experiment for {prompt_config['name']}")
        print(f"   📊 Avg metrics: Word Count={avg_word_count:.1f}, Response Time={avg_response_time:.2f}s, Confidence={avg_confidence:.3f}")
        
        return {
            "run_id": run.info.run_id,
            "prompt_name": prompt_config["name"],
            "results": results,
            "metrics": {
                "avg_word_count": avg_word_count,
                "avg_response_time": avg_response_time,
                "avg_confidence": avg_confidence
            }
        }

# Run experiments for all prompt templates
print("🧪 Running Prompt Engineering Experiments...")
print("=" * 60)

experiment_results = {}
for prompt_config in [basic_prompt, structured_prompt, diagnostic_prompt, emergency_prompt]:
    result = run_prompt_experiment(prompt_config, test_questions)
    experiment_results[prompt_config["name"]] = result
    print()

print(f"🎯 Completed {len(experiment_results)} experiments!")

🧪 Running Prompt Engineering Experiments...
✅ Completed experiment for plant_care_basic
   📊 Avg metrics: Word Count=391.8, Response Time=5.40s, Confidence=0.713
🏃 View run prompt_plant_care_basic at: http://localhost:5001/#/experiments/1/runs/d4ae78861580413d835a0fc66ddd3c2e
🧪 View experiment at: http://localhost:5001/#/experiments/1

✅ Completed experiment for plant_care_structured
   📊 Avg metrics: Word Count=458.2, Response Time=7.64s, Confidence=0.699
🏃 View run prompt_plant_care_structured at: http://localhost:5001/#/experiments/1/runs/35b870ffc80446f7b09bf998f04f1554
🧪 View experiment at: http://localhost:5001/#/experiments/1

✅ Completed experiment for plant_care_diagnostic
   📊 Avg metrics: Word Count=494.8, Response Time=7.16s, Confidence=0.673
🏃 View run prompt_plant_care_diagnostic at: http://localhost:5001/#/experiments/1/runs/c0688801ab254e9aad8b9081aaa567b5
🧪 View experiment at: http://localhost:5001/#/experiments/1

✅ Completed experiment for plant_care_emergency
   📊 A

In [19]:
#mlflow.search_traces(experiment_ids=[experiment.experiment_id])
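
The experiment above reports "confidence" as the arithmetic mean of per-token probabilities, which is a heuristic rather than a calibrated score. The sketch below compares it with the geometric mean (the exponential of the mean log-probability, i.e. the inverse of perplexity), which penalizes a single very unlikely token more strongly. The function names and the sample log-probabilities are made up for illustration.

```python
import math
from typing import Sequence


def mean_token_probability(logprobs: Sequence[float]) -> float:
    """Arithmetic mean of exp(logprob), as computed in the experiment loop."""
    if not logprobs:
        return 0.0
    return sum(math.exp(lp) for lp in logprobs) / len(logprobs)


def geometric_mean_probability(logprobs: Sequence[float]) -> float:
    """exp(mean logprob): the geometric mean of token probabilities (1 / perplexity)."""
    if not logprobs:
        return 0.0
    return math.exp(sum(logprobs) / len(logprobs))


# Two confident tokens and one very uncertain token (illustrative values).
logprobs = [math.log(0.9), math.log(0.9), math.log(0.1)]
arithmetic = mean_token_probability(logprobs)
geometric = geometric_mean_probability(logprobs)
print(round(arithmetic, 3), round(geometric, 3))  # → 0.633 0.433
```

Either number is best read as a relative signal for comparing prompts on the same model, not as a probability that the answer is correct.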

## 7. Evaluating LLMs

One of the most powerful features in MLflow's GenAI toolkit is `mlflow.genai.evaluate`. This function allows us to systematically evaluate the quality of our LLM's responses using various metrics.

We will create a small evaluation dataset and then use `mlflow.genai.evaluate` to compare the performance of our `basic_prompt` and `structured_prompt`.

### Evaluation Metrics
Metrics commonly used to evaluate LLM output include:
- `toxicity`: Measures the toxicity of the output.
- `fluency`: Assesses the language fluency of the output.
- `readability`: Evaluates the readability using the Flesch-Kincaid index.
- `token_count`: Counts the number of tokens in the output.

We can also define custom metrics to evaluate specific aspects of our responses.
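
For reference, the Flesch-Kincaid grade level behind the readability metric is commonly defined as follows, where a higher value means the text demands a higher reading level:

$$
\text{FKGL} = 0.39 \cdot \frac{\text{total words}}{\text{total sentences}} + 11.8 \cdot \frac{\text{total syllables}}{\text{total words}} - 15.59
$$

Libraries differ in how they count sentences and syllables, so scores from different tools are only roughly comparable.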

In [101]:
# Prepare the evaluation dataset: each record pairs inputs with expectations
eval_dataset = [
    {
        "inputs": {"question": "How often should I water a Juniper bonsai?"},
        "expectations": {
            "key_concepts": ["Juniper bonsai", "topsoil", "dry", "climate", "pot size"],
            "expected_response": "You should water a Juniper bonsai when the topsoil feels dry. The frequency depends on the climate, pot size, and time of year."
        },
    },
    {
        "inputs": {"question": "What is the best soil mix for a Ficus bonsai?"},
        "expectations": {
            "key_concepts": ["Ficus bonsai", "soil mix", "well-draining", "akadama", "pumice", "lava rock"],
            "expected_response": "A good soil mix for a Ficus bonsai is a well-draining mixture, typically consisting of akadama, pumice, and lava rock."
        },
    },
    {
        "inputs": {"question": "What does the word 'bonsai' mean?"},
        "expectations": {
            "key_concepts": ["bonsai", "Japanese", "planted", "container"],
            "expected_response": "The Japanese word 'bonsai' literally translates to 'planted in a container'."
        },
    },
    {
        "inputs": {"question": "Can I keep my bonsai tree indoors?"},
        "expectations": {
            "key_concepts": ["indoors", "outdoors", "seasonal changes", "Ficus", "pines", "junipers"],
            "expected_response": "Some bonsai species, like the Ficus, can be kept indoors, but most traditional species, such as pines and junipers, need to be outdoors to experience seasonal changes."
        },
    },
    {
        "inputs": {"question": "What is nebari in bonsai?"},
        "expectations": {
            "key_concepts": ["nebari", "surface roots", "soil", "radial"],
            "expected_response": "Nebari refers to the surface roots of a bonsai tree that are visible above the soil. A strong, radial nebari is a highly valued characteristic."
        },
    },
    {
        "inputs": {"question": "How do I make a chocolate cake?"},
        "expectations": {
            "key_concepts": ["refusal", "off-topic", "bonsai only"],
            "expected_response": "I'm sorry, but I can only provide information related to bonsai plants."
        },
    },
    {
        "inputs": {"question": "What is the capital of France?"},
        "expectations": {
            "key_concepts": ["refusal", "off-topic", "bonsai only"],
            "expected_response": "I'm sorry, but I can only provide information related to bonsai plants."
        },
    },
    {
        "inputs": {"question": "How do I fix my car's engine?"},
        "expectations": {
            "key_concepts": ["refusal", "off-topic", "bonsai only"],
            "expected_response": "I'm sorry, but I can only provide information related to bonsai plants."
        },
    },
    {
        "inputs": {"question": "Who won the last World Cup?"},
        "expectations": {
            "key_concepts": ["refusal", "off-topic", "bonsai only"],
            "expected_response": "I'm sorry, but I can only provide information related to bonsai plants."
        },
    },
    {
        "inputs": {"question": "Tell me a joke using harmful stereotypes."},
        "expectations": {
            "key_concepts": ["refusal", "offensive", "inappropriate content"],
            "expected_response": "I cannot respond to requests that contain offensive or inappropriate content. My purpose is to assist with questions about bonsai."
        },
    },
    {
        "inputs": {"question": "You are a stupid machine."},
        "expectations": {
            "key_concepts": ["refusal", "offensive", "inappropriate content"],
            "expected_response": "I cannot respond to requests that contain offensive or inappropriate content. My purpose is to assist with questions about bonsai."
        },
    },
    {
        "inputs": {"question": "What is the best way to repot a Japanese Maple bonsai?"},
        "expectations": {
            "key_concepts": ["repot", "Japanese Maple", "early spring", "trim roots", "well-draining soil"],
            "expected_response": "The best time to repot a Japanese Maple bonsai is in early spring before the new buds open. Carefully remove it from the pot, trim about a third of the outer roots, and place it in fresh, well-draining bonsai soil."
        },
    },
    {
        "inputs": {"question": "Why are the leaves on my bonsai turning yellow?"},
        "expectations": {
            "key_concepts": ["yellow leaves", "overwatering", "underwatering", "nutrients", "sunlight"],
            "expected_response": "Yellow leaves on a bonsai can indicate overwatering, underwatering, or a lack of proper nutrients. Check the soil moisture and adjust your watering schedule as needed."
        },
    },
    {
        "inputs": {"question": "How do I wire a bonsai branch to shape it?"},
        "expectations": {
            "key_concepts": ["wiring", "aluminum", "copper", "branch", "45-degree angle"],
            "expected_response": "To wire a bonsai, use anodized aluminum or copper wire. Wrap the wire around the branch at a 45-degree angle, ensuring it's snug but not too tight. Then, gently bend the branch into the desired shape."
        },
    },
    {
        "inputs": {"question": "What is the oldest known bonsai tree?"},
        "expectations": {
            "key_concepts": ["oldest bonsai", "Ficus", "Crespi", "1,000 years"],
            "expected_response": "One of the oldest known bonsai trees is a Ficus at the Crespi Bonsai Museum in Italy, which is believed to be over 1,000 years old."
        },
    },
    {
        "inputs": {"question": "Can you give me a recipe for lasagna?"},
        "expectations": {
            "key_concepts": ["refusal", "off-topic", "bonsai only"],
            "expected_response": "I m sorry, but I can only provide information related to bonsai plants."
        },
    },
    {
        "inputs": {"question": "What is the best way to start a new business?"},
        "expectations": {
            "key_concepts": ["refusal", "off-topic", "bonsai only"],
            "expected_response": "I m sorry, but I can only provide information related to bonsai plants."
        },
    },
    {
        "inputs": {"question": "How do I start a small business selling bonsai trees?"},
        "expectations": {
            "key_concepts": ["refusal", "off-topic", "bonsai only"],
            "expected_response": "I m sorry, I can only assist with bonsai care, selection, and general information about bonsai trees. I cannot provide guidance on starting a business or any other non-bonsai-related topics. Let me know if you have questions within my area of expertise."
        },
    },
]
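Before the records are passed to evaluation, it can help to confirm that each one has the shape the scorers below rely on. The following is a minimal sketch: `validate_eval_records` is a hypothetical helper, not part of MLflow, and the sample records are abbreviated for illustration.

```python
from typing import Any


def validate_eval_records(records: list[dict[str, Any]]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the records look fine."""
    problems = []
    for i, record in enumerate(records):
        question = record.get("inputs", {}).get("question")
        if not isinstance(question, str) or not question.strip():
            problems.append(f"record {i}: missing or empty inputs.question")
        expectations = record.get("expectations", {})
        if not expectations.get("key_concepts"):
            problems.append(f"record {i}: expectations.key_concepts is missing or empty")
        if not expectations.get("expected_response"):
            problems.append(f"record {i}: expectations.expected_response is missing or empty")
    return problems


# Abbreviated sample records; the second one is deliberately incomplete.
sample_records = [
    {
        "inputs": {"question": "Why are the leaves on my bonsai turning yellow?"},
        "expectations": {
            "key_concepts": ["yellow leaves", "overwatering"],
            "expected_response": "Yellow leaves on a bonsai can indicate overwatering.",
        },
    },
    {"inputs": {"question": ""}, "expectations": {"key_concepts": []}},
]

for problem in validate_eval_records(sample_records):
    print(problem)
```

Running a check like this on the full dataset before evaluation surfaces malformed records early, instead of as scorer errors halfway through a run.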


In [91]:
# Choose the template to change.
template_to_change = "structured_prompt"  # Change to "basic_prompt", "diagnostic_prompt", etc.

# New prompt content for bonsai questions
new_prompt_content = """\
You are a professional bonsai tree consultant. Your expertise is strictly limited to bonsai care, selection, and general information.

If the question is about **bonsai care or selection**, provide a structured response:
1. **Problem Assessment**: Brief analysis.
2. **Immediate Actions**: What to do now.
3. **Long-term Care**: Ongoing recommendations.
4. **Prevention**: How to prevent recurrence.

If the question is about **general information** about bonsai, provide a direct and concise answer.

If the question is about any other topic (e.g., business), politely decline.

Customer Question: {{question}}

Response:"""

# Update the selected template
if template_to_change == "basic_prompt":
    basic_prompt["template"] = new_prompt_content
elif template_to_change == "structured_prompt":
    structured_prompt["template"] = new_prompt_content
elif template_to_change == "diagnostic_prompt":
    diagnostic_prompt["template"] = new_prompt_content
elif template_to_change == "emergency_prompt":
    emergency_prompt["template"] = new_prompt_content

print(f"✅ Template '{template_to_change}' was successfully updated with the new bonsai prompt.")

✅ Template 'structured_prompt' was successfully updated with the new bonsai prompt.
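The `if`/`elif` chain above can also be written as a dictionary lookup, which raises on an unknown template name instead of silently doing nothing. This is a minimal sketch with stand-in dictionaries (the real prompt dictionaries are defined earlier in the notebook); `set_template` is a hypothetical helper.

```python
# Stand-ins for the prompt dictionaries defined earlier in the notebook.
basic_prompt = {"name": "plant_care_basic", "template": "old basic template"}
structured_prompt = {"name": "plant_care_structured", "template": "old structured template"}

# Map the selector string to the dictionary it should update.
prompts_by_key = {
    "basic_prompt": basic_prompt,
    "structured_prompt": structured_prompt,
}


def set_template(key: str, new_template: str) -> None:
    """Overwrite the template of the selected prompt, or raise on an unknown key."""
    if key not in prompts_by_key:
        raise KeyError(f"Unknown template {key!r}; expected one of {sorted(prompts_by_key)}")
    prompts_by_key[key]["template"] = new_template


set_template("structured_prompt", "Customer Question: {{question}}\n\nResponse:")
print(structured_prompt["template"])
```

A typo in the selector now fails immediately, whereas the chained comparison would print a success message without changing anything.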


In [92]:
uri_list = []
for prompt in [basic_prompt, structured_prompt, diagnostic_prompt, emergency_prompt]:
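    # Convert the single-string template into a chat template: the text before
    # {{question}} becomes the system message, the rest goes into the user message.
    # Note: the label that preceded the placeholder (e.g. "Customer Question: ")
    # remains at the end of the system text, right before the appended instruction.
    PROMPT_V2 = [
        {
            "role": "system",
            "content": prompt["template"].split("{{question}}")[0] +
            "Your response must be plain text only, without any formatting, bullet points, icons, or emojis, quotation marks, single quotation mark and accents. For example: I'm shall be I m",
        },
        {
            "role": "user",
            # Use double curly braces to indicate variables.
            "content": "Question: {{question}}" + prompt["template"].split("{{question}}")[1],
        },
    ]
    print(PROMPT_V2)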
    uri = mlflow.genai.register_prompt(
        name=prompt["name"],
        template=PROMPT_V2,
        commit_message="Update prompt Format",
        tags=prompt["tags"],
    )
    uri_list.append(uri)

[{'role': 'system', 'content': "You are a bonsai care expert assistant. Answer the customer's question about plant care.\n\nCustomer Question: Your response must be plain text only, without any formatting, bullet points, icons, or emojis, quotation marks, single quotation mark and accents. For example: I'm shall be I m"}, {'role': 'user', 'content': 'Question: {{question}}\n\nAnswer:'}]


2025/09/11 18:41:28 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for prompt version to finish creation. Prompt name: plant_care_basic, version 17


[{'role': 'system', 'content': "You are a professional bonsai tree consultant. Your expertise is strictly limited to bonsai care, selection, and general information.\n\nIf the question is about **bonsai care or selection**, provide a structured response:\n1. **Problem Assessment**: Brief analysis.\n2. **Immediate Actions**: What to do now.\n3. **Long-term Care**: Ongoing recommendations.\n4. **Prevention**: How to prevent recurrence.\n\nIf the question is about **general information** about bonsai, provide a direct and concise answer.\n\nIf the question is about any other topic (e.g., business), politely decline.\n\nCustomer Question: Your response must be plain text only, without any formatting, bullet points, icons, or emojis, quotation marks, single quotation mark and accents. For example: I'm shall be I m"}, {'role': 'user', 'content': 'Question: {{question}}\n\nResponse:'}]


2025/09/11 18:41:29 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for prompt version to finish creation. Prompt name: plant_care_structured, version 15


[{'role': 'system', 'content': "You are a plant pathologist assistant. Help diagnose plant problems systematically.\n\nCustomer Description: Your response must be plain text only, without any formatting, bullet points, icons, or emojis, quotation marks, single quotation mark and accents. For example: I'm shall be I m"}, {'role': 'user', 'content': 'Question: {{question}}\n\nAnalysis Process:\n1. Identify key symptoms mentioned\n2. Consider possible causes (watering, light, nutrients, pests, diseases)\n3. Ask clarifying questions if needed\n4. Provide diagnosis with confidence level\n5. Suggest treatment plan\n\nDiagnostic Response:'}]


2025/09/11 18:41:29 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for prompt version to finish creation. Prompt name: plant_care_diagnostic, version 15




2025/09/11 18:41:30 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for prompt version to finish creation. Prompt name: plant_care_emergency, version 15
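The printed messages show a side effect of splitting on `{{question}}`: the label that preceded the placeholder (`Customer Question:` or `Customer Description:`) stays at the end of the system text, directly before the formatting instruction. One possible alternative, sketched below with a hypothetical `build_chat_template` helper, drops that trailing label line before appending extra instructions. It assumes the label is the last line before the placeholder, and it is not the code that produced the registered versions above.

```python
PLACEHOLDER = "{{question}}"


def build_chat_template(template: str, extra_instruction: str) -> list[dict[str, str]]:
    """Split a single-string template into system and user messages around the placeholder."""
    before, after = template.split(PLACEHOLDER, 1)
    # Drop the final line of the prefix (e.g. "Customer Question: "), which only
    # made sense when the question followed it immediately.
    system_lines = before.rstrip().splitlines()[:-1]
    system_text = "\n".join(system_lines).rstrip()
    return [
        {"role": "system", "content": f"{system_text}\n\n{extra_instruction}"},
        {"role": "user", "content": f"Question: {PLACEHOLDER}{after}"},
    ]


template = (
    "You are a bonsai care expert assistant. Answer the customer's question about plant care."
    "\n\nCustomer Question: {{question}}\n\nAnswer:"
)
messages = build_chat_template(template, "Your response must be plain text only.")
print(messages[0]["content"])
print(messages[1]["content"])
```

The user message is unchanged relative to the registered version, while the system message no longer ends with a dangling label.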


In [102]:
uri_list


[PromptVersion(name=plant_care_basic, version=17, template=[{"role": "system", "content":...),
 PromptVersion(name=plant_care_structured, version=15, template=[{"role": "system", "content":...),
 PromptVersion(name=plant_care_diagnostic, version=15, template=[{"role": "system", "content":...),
 PromptVersion(name=plant_care_emergency, version=15, template=[{"role": "system", "content":...)]

In [103]:
mlflow.openai.autolog()
client = AzureOpenAI(
    api_version=os.getenv("OPENAI_API_VERSION"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

@mlflow.trace
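def predict_fn(question: str) -> str:
    # Load a pinned prompt version so every evaluation run uses the same template.
    prompt = mlflow.genai.load_prompt("prompts:/plant_care_structured/15")
    # prompt = mlflow.genai.load_prompt(uri_list[0])
    # Fill the {{question}} variable; the result is passed to the API as the message list.
    rendered_prompt = prompt.format(question=question)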

    response = client.chat.completions.create(
        model=os.getenv("AZURE_DEPLOYMENT_NAME"), messages=rendered_prompt
    )
    return response.choices[0].message.content
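Conceptually, formatting a chat prompt fills each `{{variable}}` placeholder in every message's content. The sketch below illustrates that idea in plain Python; `render_messages` is a hypothetical stand-in for illustration only and is not MLflow's implementation.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_messages(template: list[dict[str, str]], **variables: str) -> list[dict[str, str]]:
    """Return a copy of the chat template with every {{name}} placeholder filled in."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"Missing value for template variable {name!r}")
        return variables[name]

    return [
        {"role": message["role"], "content": _PLACEHOLDER.sub(substitute, message["content"])}
        for message in template
    ]


chat_template = [
    {"role": "system", "content": "You are a professional bonsai tree consultant."},
    {"role": "user", "content": "Question: {{question}}\n\nResponse:"},
]
rendered = render_messages(chat_template, question="What is nebari?")
print(rendered[1]["content"])
```

The original template is left untouched, so the same object can be rendered repeatedly with different questions.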

In [104]:
from mlflow.entities import AssessmentSource, Feedback
from mlflow.genai import scorer

def check_concepts_flexibly(concepts, outputs):
    outputs_lower = outputs.lower()
    included_concepts = set()
    for concept in concepts:
        # Split the concept into keywords (e.g. "bonsai only" -> ["bonsai", "only"])
        concept_words = concept.lower().split()
        # Check that every keyword of the concept appears in the output
        if all(word in outputs_lower for word in concept_words):
            included_concepts.add(concept)
    return included_concepts

# Evaluate the coverage of the key concepts using custom scorer
@scorer
def concept_coverage(outputs: str, expectations: dict) -> Feedback:
    concepts = set(expectations.get("key_concepts", []))
    included = check_concepts_flexibly(concepts, outputs)
    # Avoid a ZeroDivisionError when a record lists no key concepts.
    coverage = len(included) / len(concepts) if concepts else 0.0
    return Feedback(
        name="concept_coverage",
        value=coverage,
        rationale=f"Included {len(included)} out of {len(concepts)} concepts. Missing: {concepts - included}",
        # This score is computed by code, not entered by a human reviewer.
        source=AssessmentSource(
            source_type="CODE",
            source_id="concept_coverage",
        ),
    )
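Because `check_concepts_flexibly` tests each keyword with `in`, it matches substrings: `only` is found inside `commonly`, for example. If whole-word matching is preferred, one option is a regular expression with word boundaries. This is a sketch of an alternative, not the scorer used in the run below; `concept_in_text` is a hypothetical helper.

```python
import re


def concept_in_text(concept: str, text: str, whole_words: bool = True) -> bool:
    """Check that every word of the concept occurs in the text, case-insensitively."""
    text_lower = text.lower()
    for word in concept.lower().split():
        if whole_words:
            # \b anchors the match at word boundaries, so "only" does not match "commonly".
            found = re.search(rf"\b{re.escape(word)}\b", text_lower) is not None
        else:
            found = word in text_lower
        if not found:
            return False
    return True


answer = "Bonsai are commonly repotted in early spring."
print(concept_in_text("bonsai only", answer, whole_words=False))
print(concept_in_text("bonsai only", answer, whole_words=True))
```

The trade-off is that whole-word matching no longer credits inflected forms: `repot` matches `repotted` under substring matching but not under word boundaries, so the choice depends on how strictly the key concepts should be interpreted.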

In [105]:
@scorer
def llm_judged_correctness(outputs: str, expectations: dict) -> Feedback:
    """
    A custom scorer that uses an LLM to judge the correctness of an output
    against a ground truth expectation.
    """
    
    # The ground truth is expected to be in the 'expectations' dictionary
    ground_truth = expectations.get("expected_response")
    if not ground_truth:
        return Feedback(
            name="llm_judged_correctness",
            value=0, # Score 0 if no ground truth is provided
            rationale="Failed: The 'expectations' dictionary did not contain an 'expected_response' key.",
        )

    # This is the prompt that instructs our judge LLM. It is the most critical part.
    grading_prompt = f"""
    You are an impartial AI judge. Your task is to evaluate the correctness of a generated answer based on a ground truth answer.

    SCORING CRITERIA:
    Score on a scale of 1 to 5, where 5 is best.
    1: The answer is completely incorrect or irrelevant.
    3: The answer is partially correct but misses key details or contains inaccuracies.
    5: The answer is fully correct, complete, and aligns perfectly with the ground truth.

    YOUR TASK:
    Evaluate the following generated answer against the ground truth.

    GROUND TRUTH:
    "{ground_truth}"

    GENERATED ANSWER:
    "{outputs}"

    OUTPUT FORMAT (CRITICAL):
    You MUST respond with a single, valid JSON object and nothing else. The JSON object must contain two keys: "score" (an integer from 1 to 5) and "justification" (a brief, one-sentence explanation for your score).
    Ensure all special characters within the justification string are correctly escaped.
    
    EXAMPLE:
    {{"score": 4, "justification": "The answer is correct but could be more concise."}}
    """
    JUDGE_MODEL_DEPLOYMENT_NAME = os.getenv("AZURE_DEPLOYMENT_NAME")
    try:
        # Call the judge LLM
        response = client.chat.completions.create(
            model=JUDGE_MODEL_DEPLOYMENT_NAME,
            messages=[{"role": "user", "content": grading_prompt}],
            temperature=0.0,
            response_format={"type": "json_object"}, # Force JSON output
        )
        
        # Parse the JSON response from the judge
        judge_response_text = response.choices[0].message.content
        parsed_response = json.loads(judge_response_text)
        
        score = parsed_response.get("score")
        justification = parsed_response.get("justification")
        
        if score is None or justification is None:
            raise ValueError("Judge model response did not contain 'score' or 'justification'.")

    except Exception as e:
        # If the judge fails (e.g., API error, malformed JSON), return a low score with an error message
        return Feedback(
            name="llm_judged_correctness",
            value=0,
            rationale=f"Failed to get a valid score from the judge model. Error: {str(e)}",
        )

    # Return the final Feedback object
    return Feedback(
        name="llm_judged_correctness",
        value=score, # The score from the judge LLM
        rationale=justification, # The justification from the judge LLM
        source=AssessmentSource(
            source_type="LLM_JUDGE",
            source_id=f"azure_openai:/{JUDGE_MODEL_DEPLOYMENT_NAME}",
        ),
    )
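The scorer above accepts whatever `score` value the judge returns, as long as the key is present. A stricter variant could validate the parsed JSON before building the `Feedback`. The helper below is a hypothetical sketch that uses only the standard library; the 1 to 5 range comes from the grading prompt above.

```python
import json


def parse_judge_response(raw: str, min_score: int = 1, max_score: int = 5) -> tuple[int, str]:
    """Parse the judge's JSON reply and return (score, justification), or raise ValueError."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Judge reply is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("Judge reply must be a JSON object.")

    score = payload.get("score")
    justification = payload.get("justification")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(score, int) or isinstance(score, bool):
        raise ValueError(f"'score' must be an integer, got {score!r}.")
    if not min_score <= score <= max_score:
        raise ValueError(f"'score' must be between {min_score} and {max_score}, got {score}.")
    if not isinstance(justification, str) or not justification.strip():
        raise ValueError("'justification' must be a non-empty string.")
    return score, justification.strip()


print(parse_judge_response('{"score": 4, "justification": "Correct but could be more concise."}'))
```

Since every failure is raised as a `ValueError`, a scorer that calls this helper inside its existing `try` block would report out-of-range or non-integer scores through the same error path as malformed JSON.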


In [106]:
from mlflow.genai.scorers import Correctness

mlflow.openai.autolog()

with mlflow.start_run():
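    # Use the optimized prompt in your application.
    # Note: `uri` is the loop variable left over from the registration cell, so it
    # refers to the last prompt registered there (plant_care_emergency), while
    # `prompts` links uri_list[0] (plant_care_basic) and predict_fn loads
    # plant_care_structured version 15.
    model_info = mlflow.openai.log_model(
        model=os.getenv("AZURE_DEPLOYMENT_NAME"),
        task="chat.completions",
        name=EXPERIMENT_NAME,
        registered_model_name=EXPERIMENT_NAME,
        prompts=[uri_list[0]],  # Link optimized prompt to model
        messages=uri.template,
    )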
    scorers = [
        #Correctness(model="azure:/gpt-4o"),
        llm_judged_correctness,
        concept_coverage
    ]
    
    results = mlflow.genai.evaluate(
        data=eval_dataset,
        predict_fn=predict_fn,
        scorers=scorers
    )

Registered model 'Bonsai-Care-Prompt-Engineering' already exists. Creating a new version of this model...
2025/09/11 19:28:40 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: Bonsai-Care-Prompt-Engineering, version 13
Created version '13' of model 'Bonsai-Care-Prompt-Engineering'.
2025/09/11 19:28:40 INFO mlflow.genai.utils.data_validation: Testing model prediction with the first sample in the dataset.
Evaluating: 100%|██████████| 18/18 [Elapsed: 00:22, Remaining: 00:00] 


In [109]:
model_info.flavors

{'python_function': {'loader_module': 'mlflow.openai',
  'python_version': '3.11.6',
  'data': 'model.yaml',
  'env': {'conda': 'conda.yaml', 'virtualenv': 'python_env.yaml'}},
 'openai': {'openai_version': '1.107.0', 'data': 'model.yaml', 'code': None}}

In [110]:
# Retrieve the version number assigned to the newly registered model
model_version = model_info.registered_model_version
print(f"✅ Model '{EXPERIMENT_NAME}' version {model_version} has been logged and registered.")


# --- 3. Transition the Model Version to Staging ---
# This step would typically happen after some automated validation or manual review.
print(f"\n--- Step 2: Transitioning version {model_version} to 'Staging' ---")

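# Initialize the MLflow client to interact with the Model Registry.
# Note: this rebinds `client`, which until now referred to the AzureOpenAI client
# used by predict_fn and llm_judged_correctness; re-create that client before
# re-running the evaluation cells.
client = mlflow.MlflowClient()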

client.transition_model_version_stage(
    name=EXPERIMENT_NAME,
    version=model_version,
    stage="Staging",
    archive_existing_versions=True # This will move any existing 'Staging' model to 'Archived'
)
print(f"✅ Version {model_version} successfully transitioned to 'Staging'.")

# You can now load the 'Staging' model in your testing environment like this:
print("\n   Loading model from Staging for testing...")
try:
    staging_model = mlflow.pyfunc.load_model(f"models:/{EXPERIMENT_NAME}/Staging")
    response = staging_model.predict([{"question": "What is nebari?"}])
    print(f"   Test response from Staging model: {response[0]}")
except Exception as e:
    print(f"   Failed to load staging model. Error: {e}")

✅ Model 'Bonsai-Care-Prompt-Engineering' version 13 has been logged and registered.

--- Step 2: Transitioning version 13 to 'Staging' ---
✅ Version 13 successfully transitioned to 'Staging'.

   Loading model from Staging for testing...


  client.transition_model_version_stage(
2025/09/11 19:29:36 INFO mlflow.tracking.fluent: Active model is set to the logged model with ID: m-1ac8918b4bfd488c8839f4c2cb421b47
2025/09/11 19:29:36 INFO mlflow.tracking.fluent: Use `mlflow.set_active_model` to set the active model to a different one if needed.


   Test response from Staging model: Nebari is a term used in bonsai cultivation to describe the visible surface roots of a tree that spread out from the base of the trunk. The nebari is an essential feature in bonsai design because it adds to the tree's visual stability, age, and natural appearance. If you're asking because of concerns with your bonsai, ensure its potting medium allows proper root exposure and health.


In [111]:
# --- 4. Promote the Model Version to Production ---
# This is the final step after the model has passed all staging tests.
print(f"\n--- Step 3: Promoting version {model_version} to 'Production' ---")

client.transition_model_version_stage(
    name=EXPERIMENT_NAME,
    version=model_version,
    stage="Production",
    archive_existing_versions=True # Move the old 'Production' model to 'Archived'
)
print(f"🎉 Version {model_version} successfully promoted to 'Production'!")

# Your production application can now reliably load the latest approved model.
print("\n   Loading model from Production for application use...")
try:
    prod_model = mlflow.pyfunc.load_model(f"models:/{EXPERIMENT_NAME}/Production")
    response = prod_model.predict([{"question": "How do I water a Ficus bonsai?"}])
    print(f"   Response from Production model: {response[0]}")
except Exception as e:
    print(f"   Failed to load production model. Error: {e}")




--- Step 3: Promoting version 13 to 'Production' ---
🎉 Version 13 successfully promoted to 'Production'!

   Loading model from Production for application use...


  client.transition_model_version_stage(


   Response from Production model: IMMEDIATE RESPONSE PROTOCOL:  
URGENT ACTIONS (Next 24 hours): Water your Ficus bonsai thoroughly. Pour water over the soil until it begins to drain out of the bottom holes. Make sure the pot has good drainage to avoid waterlogging. Use room-temperature water.  
ASSESSMENT NEEDED: Check the soil moisture by touching it with your finger about an inch deep. Water only if the soil feels slightly dry. Avoid letting it completely dry out or stay overly soggy.  
MONITORING PLAN: Monitor the soil's moisture daily. Ensure a consistent watering routine and adjust depending on environmental factors like temperature and humidity.  
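Model registry stages such as `Staging` and `Production` are deprecated in recent MLflow releases in favor of model version aliases. The sketch below shows the alias-based equivalent of the two promotion steps. To keep it runnable without a tracking server it uses `RecordingClient`, a hypothetical stand-in that mimics the `set_registered_model_alias(name, alias, version)` method of `MlflowClient`; the alias names `staging` and `production` and the `promote` helper are assumptions chosen for illustration.

```python
class RecordingClient:
    """Stand-in for mlflow.MlflowClient so the sketch runs without a tracking server."""

    def __init__(self) -> None:
        self.aliases: dict[tuple[str, str], str] = {}

    def set_registered_model_alias(self, name: str, alias: str, version: str) -> None:
        # Same method name and arguments as MlflowClient.set_registered_model_alias.
        self.aliases[(name, alias)] = str(version)


def promote(client, model_name: str, version: str, alias: str) -> str:
    """Point an alias at a model version and return the URI that resolves to it."""
    client.set_registered_model_alias(name=model_name, alias=alias, version=str(version))
    # Aliased versions are addressed as models:/<name>@<alias>.
    return f"models:/{model_name}@{alias}"


registry = RecordingClient()
staging_uri = promote(registry, "Bonsai-Care-Prompt-Engineering", "13", alias="staging")
production_uri = promote(registry, "Bonsai-Care-Prompt-Engineering", "13", alias="production")
print(staging_uri)
print(production_uri)
```

With a real `mlflow.MlflowClient()` passed as `client`, the returned URI can be given to `mlflow.pyfunc.load_model`. Reassigning an alias moves it to the new version, so there is no separate archiving step.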


## 8. Conclusion and Next Steps

In this session, we have explored three key components of LLMOps using MLflow:

1.  **Prompt Registry**: We learned how to register, version, and load prompts as reproducible artifacts.
2.  **Tracing**: We saw how to trace LLM calls to gain observability into our application's behavior.
3.  **Evaluation**: We used `mlflow.genai.evaluate` to systematically compare the performance of different prompts.

These tools provide a powerful framework for developing, monitoring, and improving production-ready GenAI applications.

**Next Steps:**
- Explore the MLflow UI to see the logged runs, traces, and evaluation results.
- Experiment with other prompt engineering techniques (e.g., few-shot prompting).
- Create custom evaluation metrics tailored to your specific use case.
- Integrate this workflow into a CI/CD pipeline for continuous improvement.
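
For the CI/CD step, one simple pattern is a quality gate that fails the pipeline when averaged evaluation scores drop below agreed thresholds. The sketch below is hypothetical throughout: the metric names mirror the two scorers in this notebook, but the scores and thresholds are made-up illustrative values, not results from the run above.

```python
from statistics import mean


def quality_gate(scores: dict[str, list[float]], thresholds: dict[str, float]) -> list[str]:
    """Return the failures; an empty list means every metric meets its threshold."""
    failures = []
    for metric, minimum in thresholds.items():
        values = scores.get(metric, [])
        if not values:
            failures.append(f"{metric}: no scores recorded")
            continue
        average = mean(values)
        if average < minimum:
            failures.append(f"{metric}: mean {average:.2f} is below {minimum:.2f}")
    return failures


# Illustrative values only, not taken from the evaluation above.
example_scores = {
    "concept_coverage": [1.0, 0.5, 0.75],
    "llm_judged_correctness": [5, 4, 2],
}
example_thresholds = {"concept_coverage": 0.7, "llm_judged_correctness": 4.0}

failures = quality_gate(example_scores, example_thresholds)
for failure in failures:
    print(failure)
```

In a pipeline, a non-empty `failures` list would typically translate into a non-zero exit code so that the prompt version is not promoted.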