
%md
# Simple chain Evaluation

Now that the LLM chain called "simple_chain" has been created, it is time to evaluate the model.

In this notebook, we explore evaluation metrics and how MLflow helps compute and track them.


In [0]:
%pip install -U --quiet databricks-langchain==0.6.0 mlflow[databricks]==3.4.0  langchain==0.3.27 langchain_core==0.3.74 textstat
dbutils.library.restartPython()

### Load the MLflow config

In [0]:
%run ../_config/config_0

## 0- Call the model endpoint

We call the serving endpoint of the monolithic LLM chain we registered, "simple_chain".



In [0]:
import mlflow
import mlflow.deployments
from typing import List, Dict

# Create a deployment client for Databricks
mlflow_client = mlflow.deployments.get_deploy_client("databricks")
endpoint_name = "simple_chain"
# A few questions about MLflow on Databricks
DOCS_DATA = [
    {"question": "what are mlflow versions supported on databricks ?"},
    {"question": "what is a langchain mlflow flavor?"},
    {"question": "what is a scorer mlflow ?"},
    {"question": "how do I register a model ?"}
]


@mlflow.trace
def generate_docs(question: str) -> dict:
    """Query the simple_chain endpoint and return its answer."""

    response = mlflow_client.predict(
        endpoint=endpoint_name,  # The LLM chain that was created
        inputs={"inputs": {"question": question}},  # The question to answer
    )

    return {"response": response["predictions"][0]}

# Test the application on each sample question
for q in DOCS_DATA:
    print("response : ", generate_docs(q["question"]))

## 1- Creation of an evaluation dataset

We extract data from the traces logged in the MLflow experiment to create an evaluation dataset.

### What are Evaluation Datasets?

Evaluation Datasets in MLflow provide a structured way to organize and manage test data for GenAI applications. They serve as centralized repositories for test inputs, expected outputs (expectations), and evaluation results, enabling systematic quality assessment across your AI development lifecycle.

Evaluation datasets bridge the gap between ad-hoc testing and systematic quality assurance, providing the foundation for reproducible evaluations, regression testing, and continuous improvement of your GenAI applications.
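As a rough illustration of what such a dataset holds, the sketch below builds two hand-written records with an `inputs` dictionary and an optional `expectations` dictionary. Only the `question` key comes from this notebook; the expectation content and the `validate_record` helper are hypothetical and are not part of MLflow.

```python
# Hypothetical hand-written records; only the "question" input key comes from this notebook.
records = [
    {
        "inputs": {"question": "how do I register a model ?"},
        # Illustrative expectation, not taken from the notebook's data.
        "expectations": {"expected_facts": ["mlflow.register_model"]},
    },
    # Expectations are optional: a record can carry inputs only.
    {"inputs": {"question": "what is a scorer mlflow ?"}},
]


def validate_record(record: dict) -> bool:
    """Return True when a record has a non-empty dict of inputs."""
    inputs = record.get("inputs")
    return isinstance(inputs, dict) and len(inputs) > 0


valid_records = [r for r in records if validate_record(r)]
print(f"{len(valid_records)} valid records out of {len(records)}")
```

Checking records before merging them keeps malformed rows out of the evaluation table.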

We will go through the traces from the last 10 minutes to build the dataset.

In [0]:
from mlflow.genai.datasets import (
    create_dataset,
    get_dataset
)
import time
from databricks.connect import DatabricksSession

# 1. Create an evaluation dataset

# Replace with a Unity Catalog schema where you have CREATE TABLE permission
catalog_name = "demo"
schema_name = "demo"
uc_schema = f"{catalog_name}.{schema_name}"
# This table will be created in the above UC schema
evaluation_dataset_table_name = "extraction_infos_doc_mlflow_eval"
eval_dataset = create_dataset(
    uc_table_name=f"{uc_schema}.{evaluation_dataset_table_name}",
)

print(f"Created evaluation dataset: {uc_schema}.{evaluation_dataset_table_name}")

# 2. Search for the traces generated in step 0: keep successful traces from the last 10 minutes.
ten_minutes_ago = int((time.time() - 10 * 60 ) * 1000)

traces = mlflow.search_traces(
    filter_string=f"attributes.timestamp_ms > {ten_minutes_ago} AND "
                 f"attributes.status = 'OK'",
    order_by=["attributes.timestamp_ms DESC"]
)

print(f"Found {len(traces)} successful traces")

# 3. Add the traces to the evaluation dataset
eval_dataset.merge_records(traces)
print(f"Added {len(traces)} records to evaluation dataset")

# Preview the dataset
eval_dataset_df = eval_dataset.to_df()
print("\nDataset preview:")
print(f"Total records: {len(eval_dataset_df)}")
print("\nSample record:")
sample = eval_dataset_df.iloc[0]
print(f"Inputs: {sample['inputs']}") 
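The time window above relies on converting "10 minutes ago" into epoch milliseconds. A minimal sketch of that conversion, isolated in a hypothetical helper (`recent_ok_traces_filter` is not an MLflow function) so the window can be changed or tested without calling MLflow:

```python
import time
from typing import Optional


def recent_ok_traces_filter(minutes: int, now_s: Optional[float] = None) -> str:
    """Build a filter string for successful traces newer than `minutes` ago."""
    # time.time() returns seconds; trace timestamps are compared in milliseconds.
    now_s = time.time() if now_s is None else now_s
    cutoff_ms = int((now_s - minutes * 60) * 1000)
    return f"attributes.timestamp_ms > {cutoff_ms} AND attributes.status = 'OK'"


# A fixed clock value makes the result reproducible.
print(recent_ok_traces_filter(10, now_s=1_700_000_000))
# attributes.timestamp_ms > 1699999400000 AND attributes.status = 'OK'
```

The returned string has the same form as the `filter_string` built in the cell above.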

### Observations
We built the dataset from the traces we just generated.

 

## 2- Scorers

### What are Scorers?

Scorers in MLflow are evaluation functions that assess the quality of your GenAI application outputs.

Scorers transform subjective quality assessments into measurable metrics, enabling you to track performance, compare models, and ensure your applications meet quality standards. 


### Use Cases

- Automated Quality Assessment : Replace manual review processes with automated scoring that can evaluate thousands of outputs consistently and at scale.

- Safety & Compliance Validation : Systematically check for harmful content, bias, PII leakage, and regulatory compliance. This is the role of guardrails.

- A/B Testing & Model Comparison : Build objective criteria to compare different models, or changes to any component, such as the prompt.

- Continuous Quality Monitoring : In production, monitor your app and detect performance loss caused, for example, by drift.


### Types of scorers : 

MLflow provides several types of scorers to address different evaluation needs:

- Code-based Scorers : Deterministic scorers that can be calculated algorithmically, like ROUGE scores, exact match, or custom business logic.
- LLM-as-judges scorers : Use an LLM to evaluate subjective qualities like coherence and style. These scorers can understand context and nuance that rule-based systems miss.
- Human-Aligned Judges : LLM judges fine-tuned with human feedback to match your specific quality standards.
- Agent-as-a-Judge : Autonomous agents that analyze execution traces to evaluate not just outputs, but the entire process. They can assess tool usage, reasoning chains, and error handling. (Not used in this notebook.)
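To make the code-based category concrete, here is a minimal sketch of an exact-match check written as plain Python. The `normalize` and `exact_match` helpers are hypothetical; in this notebook such a function would presumably be wrapped with the `@scorer` decorator, which is left out here so the example runs without MLflow.

```python
def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def exact_match(response: str, expected: str) -> bool:
    """Deterministic check: True when both texts match after normalization."""
    return normalize(response) == normalize(expected)


print(exact_match("  MLflow  Tracking ", "mlflow tracking"))  # True
print(exact_match("MLflow Tracking", "MLflow Models"))        # False
```

Because the result depends only on the two strings, the same inputs always give the same score, which is what separates this category from LLM judges.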


### 1-1 Code-based scorers

Create a code-based scorer.
This scorer evaluates text complexity using the Flesch Reading Ease formula:
Score = 206.835 - 1.015 × (total_words / total_sentences) - 84.6 × (total_syllables / total_words)
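
As a worked example of the formula above, the sketch below applies it to hand-picked counts. The counts are hypothetical; in the scorer itself, textstat derives them from the text, so this only illustrates the arithmetic.

```python
def flesch_reading_ease(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Apply the Flesch Reading Ease formula to precomputed counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical text: 100 words, 5 sentences, 150 syllables.
# 206.835 - 1.015 * 20 - 84.6 * 1.5 = 206.835 - 20.3 - 126.9 = 59.635
score = flesch_reading_ease(100, 5, 150)
print(f"{score:.3f}")  # 59.635
```

Longer sentences and longer words both lower the score, so a lower value means harder text.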


In [0]:
from mlflow.genai.scorers import (
    scorer,
    RelevanceToQuery,
    Safety,
    Guidelines,
)

from mlflow.entities import AssessmentSource, AssessmentSourceType, Feedback

In [0]:
import textstat

@scorer
def reading_level(outputs: dict[str, str]) -> Feedback:  # Code-based scorer
    """
    Evaluate text complexity using Flesch Reading Ease.
    Higher scores indicate easier reading.

    Args:
        outputs (dict): The app output; the generated text is under "response".
    Returns:
        Feedback: The score rescaled to the 0.0 -> 1.0 range, with a rationale.
    """
    score = textstat.flesch_reading_ease(outputs["response"])

    if score >= 60:
        level = "easy"
        rationale = f"Reading ease score of {score:.1f} - accessible to most readers"
    elif score >= 30:
        level = "moderate"
        rationale = f"Reading ease score of {score:.1f} - college level complexity"
    else:
        level = "difficult"
        rationale = f"Reading ease score of {score:.1f} - expert level required"

    # Flesch scores can fall outside 0-100, so clamp before rescaling to 0.0-1.0
    value = min(max(score, 0.0), 100.0) / 100.0
    return Feedback(value=value, rationale=f"{rationale} ({level})")

### 1-2 LLM as judge
Use your own LLM as a judge

Integrate a custom or externally hosted LLM within a scorer. The scorer handles the API call, formats inputs and outputs, and builds a Feedback object from the LLM's response, giving you full control over the judging process.

You can also set the source field of the Feedback object to indicate that the assessment comes from an LLM judge.

In [0]:
import json
from typing import Any, Optional


# This cell reuses `mlflow_client`, the deployments client created in step 0.
# The judge endpoint used below, "chat_gpt_4o_mini", must exist in your workspace.

# Define the prompts for the Judge LLM.
judge_system_prompt = """
You are an impartial AI assistant responsible for evaluating the quality of a response generated by another AI model.
Your evaluation should be based on the original user query and the AI's response.
Provide a quality score as an integer from 1 to 5 (1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent).
Also, provide a brief rationale for your score.

Your output MUST be a single valid JSON object with two keys: "score" (an integer) and "rationale" (a string).
Example:
{"score": 4, "rationale": "The response was mostly accurate and helpful, addressing the user's query directly."}
"""
judge_user_prompt = """
Please evaluate the AI's Response below based on the Original User Query.

Original User Query:
```{user_query}```

AI's Response:
```{llm_response_from_app}```

Provide your evaluation strictly as a JSON object with "score" and "rationale" keys.
"""

@scorer
def answer_quality(inputs: dict[str, Any], outputs: dict[str, Any]) -> Feedback:
    user_query = inputs["question"]

    # Call the judge LLM through the MLflow deployments client.
    predictions = mlflow_client.predict(
        endpoint="chat_gpt_4o_mini",
        inputs={
            "messages": [
                {
                    "role": "system",
                    "content": judge_system_prompt
                },
                {
                    "role": "user",
                    "content": judge_user_prompt.format(
                        user_query=user_query,
                        llm_response_from_app=outputs["response"]
                    )
                }
            ]
        }
    )

    # Parse the Judge LLM's JSON output.
    judge_eval_json = json.loads(predictions["choices"][0]["message"]["content"])
    parsed_score = int(judge_eval_json["score"])
    parsed_rationale = judge_eval_json["rationale"]

    return Feedback(
        value=parsed_score,
        rationale=parsed_rationale,
        # Set the source of the assessment to indicate the LLM judge used to generate the feedback
        source=AssessmentSource(
            source_type=AssessmentSourceType.LLM_JUDGE,
            source_id="chat_gpt_4o_mini",
        )
    )
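
The scorer above assumes the judge returns bare, valid JSON. Models sometimes wrap their answer in a Markdown code fence or add surrounding text, in which case `json.loads` fails. A minimal sketch of a more tolerant parser follows, assuming the judge still emits a single JSON object; `parse_judge_output` is a hypothetical helper, not part of MLflow.

```python
import json
import re


def parse_judge_output(raw: str) -> tuple[int, str]:
    """Extract (score, rationale) from a judge reply, tolerating surrounding text."""
    # Grab everything from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge output")
    data = json.loads(match.group(0))
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score {score} is outside the 1-5 range")
    return score, str(data["rationale"])


# Simulated judge reply wrapped in a Markdown code fence.
fence = "`" * 3
reply = f'{fence}json\n{{"score": 4, "rationale": "Mostly accurate."}}\n{fence}'
print(parse_judge_output(reply))  # (4, 'Mostly accurate.')
```

Raising on an out-of-range score surfaces a misbehaving judge instead of silently logging an invalid value.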




### 1-3 Evaluation


#### Guidelines scorers
More LLM-as-judge scorers can be built quickly.

Guidelines-based judges and scorers use pass/fail natural language criteria to evaluate GenAI outputs. They excel at evaluating:

- Compliance: "Must not include pricing information"
- Style/tone: "Maintain professional, empathetic tone"
- Requirements: "Must include specific disclaimers"
- Accuracy: "Use only facts from provided context"

Benefits

- Business-friendly: Domain experts write criteria without coding
- Flexible: Update criteria without code changes
- Interpretable: Clear pass/fail conditions
- Fast iteration: Rapidly test new criteria

The optional model parameter lets you choose the judge model used for the evaluation.


#### Built-in scorers

Some scorers are built in and can be used as is to test the quality of your workflow:
- RelevanceToQuery : This scorer evaluates if your app's response directly addresses the user's input without deviating into unrelated topics.


In [0]:
# Save the scorers in a variable so we can reuse them when evaluating v2

docs_scorers = [
        Guidelines(
            name="follows_instructions",
            guidelines="The response must follow the instructions given in the user's question.",
        ),
        Guidelines(
            name="concise_communication",
            guidelines="The document MUST be concise and to the point. The document should communicate the key message efficiently without being overly brief or losing important context.",
        ),
        Guidelines(
            name="professional_tone",
            guidelines="The document must be in a professional tone.",
        ),
        RelevanceToQuery(),  # Checks whether the response addresses the user's question
        reading_level,  # Code-based scorer: text complexity
        answer_quality,  # Custom LLM judge
    ]

# Run evaluation with predefined scorers
eval_results_v1 = mlflow.genai.evaluate(
    data=eval_dataset_df,
    predict_fn=generate_docs,
    scorers=docs_scorers,
)


### Observation :
The scores are quite good for this limited dataset.
The most obvious area for improvement is keeping answers concise.


## 2- Exploit traces

### 2-1 search traces

In [0]:
eval_traces = mlflow.search_traces(run_id=eval_results_v1.run_id)

# eval_traces is a Pandas DataFrame that has the evaluated traces.  The column `assessments` includes each scorer's feedback.
eval_traces

In [0]:
@mlflow.trace
def generate_docs_v2(question: str) -> dict:
    """Query the simple_chain endpoint with a prompt that asks for a concise answer."""

    question_v2 = f"""
    In the most concise way possible, can you answer the following question? 
    {question}

    """

    response = mlflow_client.predict(
        endpoint=endpoint_name,  # The LLM chain that was created
        inputs={"inputs": {"question": question_v2}},  # The question wrapped in the new prompt
    )

    return {"response": response["predictions"][0]}


# Run evaluation of the new version with the same scorers as before
# We use start_run to name the evaluation run in the UI
with mlflow.start_run(run_name="v2"):
    eval_results_v2 = mlflow.genai.evaluate(
        data=eval_dataset_df, # same eval dataset
        predict_fn=generate_docs_v2, # new app version
        scorers=docs_scorers, # same scorers as for v1
    )

Next, we compare the scores of v1 and v2.


### 2-2 Compare v1 and v2 performance

In [0]:
import pandas as pd

# Fetch runs separately since mlflow.search_runs doesn't support IN or OR operators
run_v1_df = mlflow.search_runs(
    filter_string=f"run_id = '{eval_results_v1.run_id}'"
)
run_v2_df = mlflow.search_runs(
    filter_string=f"run_id = '{eval_results_v2.run_id}'"
)

# Extract metric columns (they end with /mean, not .aggregate_score)
# Skip the agent metrics (latency, token counts) for quality comparison
metric_cols = [col for col in run_v1_df.columns
               if col.startswith('metrics.') and col.endswith('/mean')
               and 'agent/' not in col]

# Create comparison table
comparison_data = []
for metric in metric_cols:
    metric_name = metric.replace('metrics.', '').replace('/mean', '')
    v1_score = run_v1_df[metric].iloc[0]
    v2_score = run_v2_df[metric].iloc[0]
    improvement = v2_score - v1_score

    comparison_data.append({
        'Metric': metric_name,
        'V1 Score': f"{v1_score:.3f}",
        'V2 Score': f"{v2_score:.3f}",
        'Improvement': f"{improvement:+.3f}",
        'Improved': '✓' if improvement >= 0 else '✗'
    })

comparison_df = pd.DataFrame(comparison_data)
display(comparison_df)

avg_v1 = run_v1_df[metric_cols].mean(axis=1).iloc[0]
avg_v2 = run_v2_df[metric_cols].mean(axis=1).iloc[0]
display(f"Overall average improvement: {(avg_v2 - avg_v1):+.3f} ({((avg_v2/avg_v1 - 1) * 100):+.1f}%)")
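
The comparison above reads every v1 metric column from the v2 run, which fails if a metric is missing from one of the runs. A minimal sketch of the same per-metric delta on plain dictionaries, restricted to the metrics both runs share; the scores below are hypothetical and are not results from this notebook, and `compare_metrics` is an illustrative helper.

```python
def compare_metrics(v1: dict, v2: dict) -> list:
    """Return one row per metric present in both runs, with the v2 - v1 delta."""
    rows = []
    for name in sorted(v1.keys() & v2.keys()):
        delta = v2[name] - v1[name]
        rows.append({
            "metric": name,
            "v1": v1[name],
            "v2": v2[name],
            "delta": round(delta, 3),
            "improved": delta >= 0,
        })
    return rows


# Hypothetical mean scores, for illustration only.
v1_scores = {"concise_communication": 0.25, "reading_level": 0.50, "only_in_v1": 1.0}
v2_scores = {"concise_communication": 0.75, "reading_level": 0.25}

rows = compare_metrics(v1_scores, v2_scores)
for row in rows:
    print(row)
```

Metrics logged by only one run are skipped rather than raising a `KeyError`.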

### Observation : 
Comparing the scores, the model improved conciseness as requested, but at the expense of the clarity of the text.