# MLflow 09: Custom Metrics and Evaluation for Generative Tasks

Welcome to Notebook 9! In [Notebook 6](MLflow_06_Evaluating_and_Benchmarking_LLMs_with_MLflow.ipynb), we explored standard evaluation metrics for LLMs. However, generative AI tasks often require more nuanced assessment beyond what metrics like ROUGE or BERTScore can capture [5, 7]. The 'quality' of generated text can be highly subjective and task-dependent.

This notebook delves into **Custom Metrics for Generative Tasks**. We'll learn how to:
- Understand the limitations of standard metrics for certain generative AI qualities.
- Define and implement **heuristic-based custom metrics** tailored to specific requirements (e.g., length constraints, keyword presence) [2, 3, 6].
- Explore the concept and implementation of **LLM-as-a-Judge custom metrics**, where another LLM is used to score the output based on defined criteria (e.g., helpfulness, coherence, adherence to style) [1, 3, 8].
- Integrate these custom metrics into the `mlflow.evaluate()` workflow.
- Analyze these richer evaluation results in the MLflow UI to gain deeper insights into model performance.

![Custom Evaluation Concept](https://www.comet.com/wp-content/uploads/2023/11/LLM-Eval-Taxonomy.png)

By defining what truly matters for your specific use case, custom metrics empower you to build better, more reliable, and more aligned generative AI applications [4].

---

## Table of Contents

1. [Recap: Limitations of Standard Generative AI Metrics](#recap-standard-metric-limitations)
2. [Setting Up the Custom Evaluation Environment](#setting-up-custom-eval-env)
    - [Installing Libraries](#installing-libraries-custom-eval)
    - [Ollama and LLM Setup (for Judge and Evaluated LLMs)](#ollama-llm-setup-custom-eval)
    - [Configuring MLflow](#configuring-mlflow-custom-eval)
3. [Task, Dataset, and Models Under Evaluation](#task-dataset-model-custom-eval)
    - [Task: Text Summarization (Revisited)](#task-text-summarization-revisited)
    - [Dataset: `openai/summarize_from_feedback` (TLDR subset)](#dataset-summarize-feedback-revisited)
    - [Models to Evaluate: `gemma2:2b` and `phi3:mini` (from Ollama)](#models-to-evaluate)
    - [Judge LLM: `qwen2:1.5b` (from Ollama)](#judge-llm-setup)
4. [Defining Custom Heuristic-Based Metrics](#defining-custom-heuristic-metrics)
    - [Using `mlflow.metrics.make_metric`](#mlflow-make-metric)
    - [Example 1: Summary Length Ratio](#custom-metric-length-ratio)
    - [Example 2: Keyword Presence Check](#custom-metric-keyword-presence)
5. [Defining Custom LLM-as-a-Judge Metrics](#defining-custom-llm-judge-metrics)
    - [Concept: LLM as an Evaluator](#concept-llm-as-judge)
    - [Example: "Summary Helpfulness" Judge (using `qwen2:1.5b`)](#custom-metric-helpfulness-judge)
        - Defining the Judge LLM Prompt and Rating Scale [1]
        - Implementing the `eval_fn` to call the Judge LLM
6. [Evaluating with Custom Metrics using `mlflow.evaluate()`](#evaluating-with-custom-metrics)
    - [Preparing Evaluation Data with `custom_expected` fields [2, 6]](#preparing-eval-data-custom)
    - [Running the Evaluation for `gemma2:2b` and `phi3:mini`](#running-evaluation-custom)
7. [Analyzing Custom Metric Results in MLflow UI](#analyzing-custom-results-mlflow-ui)
8. [Best Practices for Developing Custom Metrics [2, 4, 6]](#best-practices-custom-metrics)
9. [Key Takeaways](#key-takeaways-custom-eval)
10. [Engaging Resources and Further Reading](#resources-further-reading-custom-eval)

---

## 1. Recap: Limitations of Standard Generative AI Metrics

In [Notebook 6](MLflow_06_Evaluating_and_Benchmarking_LLMs_with_MLflow.ipynb), we used metrics like ROUGE and BERTScore. While valuable, they primarily measure surface-level lexical overlap or semantic similarity with reference texts. They might not fully capture [5, 7]:
- **Factual Correctness/Faithfulness:** Does the generation accurately reflect provided context (if any) or known facts?
- **Coherence and Readability:** Is the text well-structured and easy to understand?
- **Adherence to Instructions/Style:** Does the model follow specific formatting, tone, or persona requirements?
- **Helpfulness/Relevance:** Is the output actually useful or relevant to the user's query or task?
- **Creativity and Novelty:** For creative tasks, are the outputs original and engaging?
- **Absence of Undesirable Content:** Metrics for toxicity, bias, PII leakage [5].

**Custom metrics** allow us to define evaluation criteria that are more closely aligned with these nuanced aspects and specific business goals [2, 4].
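To make the faithfulness gap concrete, here is a minimal sketch of a token-overlap score. `unigram_f1` is a hypothetical helper and only a rough stand-in for ROUGE-1, not the official implementation; the sentences are made up for illustration. It shows how a text that contradicts the reference can still outscore a faithful paraphrase.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1: a rough, illustrative stand-in for ROUGE-1."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "the new drug reduced symptoms in most patients"
faithful = "most patients saw symptoms reduced by the new drug"
contradictory = "the new drug never reduced symptoms in most patients"

faithful_score = unigram_f1(faithful, reference)
contradictory_score = unigram_f1(contradictory, reference)
print(f"faithful paraphrase: {faithful_score:.2f}")
print(f"contradictory text:  {contradictory_score:.2f}")
```

The contradictory sentence scores higher because it reuses almost every reference token. An overlap metric cannot detect this kind of failure, whereas a faithfulness-oriented custom metric can target it directly.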

---

## 2. Setting Up the Custom Evaluation Environment

### Installing Libraries

In [None]:
!pip install --quiet mlflow "transformers>=4.30.0" datasets evaluate "langchain>=0.1.0" langchain_community langchain_core langchain_ollama pydantic tiktoken rouge_score bert_score sentencepiece accelerate
# Ensure transformers is compatible with the models, langchain for ChatOllama

import mlflow
import torch
from datasets import load_dataset, Dataset # Ensure Dataset is imported
from transformers import pipeline # Using pipeline for easier model wrapping
from mlflow.metrics import make_metric, MetricValue # For custom heuristic metrics [3]
# from mlflow.models.evaluation.base import EvaluationResult # Not strictly needed for this example
from langchain_ollama.chat_models import ChatOllama # For LLM-as-a-Judge
from langchain_core.messages import HumanMessage, SystemMessage
import pandas as pd
import numpy as np
import os
import shutil
import re # For keyword checking
import json # For LLM-as-Judge response parsing

print(f"MLflow Version: {mlflow.__version__}")
import transformers
print(f"Transformers Version: {transformers.__version__}")

### Ollama and LLM Setup (for Judge and Evaluated LLMs)
Ensure Ollama is installed and running. We'll need to pull `qwen2:1.5b` (for the judge), `gemma2:2b`, and `phi3:mini` (for evaluation).

In your terminal, run:
- `ollama pull qwen2:1.5b`
- `ollama pull gemma2:2b`
- `ollama pull phi3:mini`

In [None]:
ollama_judge_model_name = "qwen2:1.5b"
ollama_eval_model_1_name = "gemma2:2b"
ollama_eval_model_2_name = "phi3:mini"
judge_llm = None

try:
    judge_llm = ChatOllama(
        model=ollama_judge_model_name, 
        temperature=0.1, # Low temperature for consistent judging
        keep_alive="5m",
        format="json" # Request JSON output from judge for easier parsing of scores/reasoning
    )
    response_test = judge_llm.invoke([SystemMessage(content="Output a JSON with a key 'status' and value 'ok'."), HumanMessage(content="Test prompt.")])
    print(f"Judge LLM ({ollama_judge_model_name}) connected. Test response: {response_test.content[:60]}...")
except Exception as e:
    print(f"Error connecting to Judge LLM ({ollama_judge_model_name}): {e}. Ensure Ollama is running and model is pulled.")
    judge_llm = None # Set to None if connection fails, so dependent cells can check

def clear_gpu_cache():
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

### Configuring MLflow

In [None]:
mlflow.set_tracking_uri('mlruns')
experiment_name = "LLM_Custom_Metrics_Summarization_Ollama"
mlflow.set_experiment(experiment_name)
print(f"MLflow Experiment set to: {experiment_name}")

---

## 3. Task, Dataset, and Models Under Evaluation

### Task: Text Summarization (Revisited)
We'll stick to text summarization to demonstrate custom metrics.

### Dataset: `openai/summarize_from_feedback` (TLDR subset)
Reusing the dataset from Notebook 6.

In [None]:
dataset_name = "openai/summarize_from_feedback"
dataset_config_name = "tldr"
num_eval_samples = 20 # Smaller subset for quicker custom metric development

try:
    eval_dataset_full = load_dataset(dataset_name, dataset_config_name, split="validation")
    eval_dataset = eval_dataset_full.select(range(num_eval_samples))
    print(f"Loaded {len(eval_dataset)} samples from '{dataset_name}/{dataset_config_name}'.")
except Exception as e:
    print(f"Error loading dataset: {e}. Using dummy data.")
    dummy_data = {
        'info': [{'post': 'This is a long post about the benefits of MLflow for MLOps. MLflow helps track experiments, package models, and manage the ML lifecycle efficiently, with features for reproducibility.'}] * num_eval_samples,
        'summaries': [[{'text': 'MLflow is great for MLOps providing efficiency.'}]] * num_eval_samples
    }
    eval_dataset = Dataset.from_dict(dummy_data)
    print(f"Using dummy dataset with {len(eval_dataset)} samples.")

### Models to Evaluate: `gemma2:2b` and `phi3:mini` (from Ollama)
We'll evaluate these two pre-trained models available via Ollama.

### Judge LLM: `qwen2:1.5b` (from Ollama)
Our LLM-as-a-Judge metrics will be powered by `qwen2:1.5b`.

In [None]:
# Define a helper function to create a wrapped model for mlflow.evaluate
class OllamaSummarizationWrapper:
    def __init__(self, ollama_model_name, prompt_template="Summarize the following text concisely:\n\n{text}\n\nSummary:", max_new_toks=100):
        self.ollama_model_name = ollama_model_name
        self.prompt_template = prompt_template
        self.max_new_tokens = max_new_toks
        self.llm = ChatOllama(model=ollama_model_name, temperature=0, keep_alive="1m") # Keep alive shorter during eval
        print(f"Initialized Ollama wrapper for {ollama_model_name}")

    def predict(self, X_df):
        if isinstance(X_df, pd.DataFrame):
            texts_to_summarize = X_df['inputs'].tolist()
        elif isinstance(X_df, pd.Series):
            texts_to_summarize = X_df.tolist()
        else:
            raise ValueError("Input to predict should be a Pandas DataFrame or Series with an 'inputs' column.")

        summaries = []
        for text_input in texts_to_summarize:
            prompt = self.prompt_template.format(text=text_input)
            try:
                # ChatOllama can cap generation length via its `num_predict` option; this demo keeps the
                # model's default length and applies a crude word-level truncation below instead.
                response = self.llm.invoke([HumanMessage(content=prompt)])
                summary = response.content.strip()
                # Crude truncation if needed (not ideal, but for consistency in demo)
                summary = " ".join(summary.split()[:self.max_new_tokens]) 
            except Exception as e:
                print(f"Error during prediction with {self.ollama_model_name} for input '{text_input[:50]}...': {e}")
                summary = "Error generating summary."
            summaries.append(summary)
        return pd.Series(summaries)

print(f"Models to evaluate: {ollama_eval_model_1_name}, {ollama_eval_model_2_name}")
print(f"Judge LLM: {ollama_judge_model_name}")

---

## 4. Defining Custom Heuristic-Based Metrics
Heuristic metrics are plain Python functions that score each prediction with deterministic rules. We define two: Summary Length Ratio and Keyword Presence.

### Using `mlflow.metrics.make_metric`
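The `eval_fn` receives `predictions` and `targets` (as pandas Series) and returns an `mlflow.metrics.MetricValue` holding per-row `scores` and `aggregate_results` [3]. Any additional named parameter is resolved by name from the columns of the evaluation data, or through `col_mapping` in `evaluator_config`.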

### Example 1: Summary Length Ratio

In [None]:
def summary_length_ratio_eval_fn(predictions, targets):
    ratios = []
    for pred, target in zip(predictions, targets):
        len_pred = len(str(pred).split()) 
        len_target = len(str(target).split())
        if len_target == 0:
            ratios.append(0.0 if len_pred > 0 else 1.0)
        else:
            ratios.append(len_pred / len_target)
    
    return MetricValue(
        scores=ratios, 
        aggregate_results={
            "mean_length_ratio": np.mean(ratios),
            "std_dev_length_ratio": np.std(ratios)
        }
    )

summary_length_ratio_metric = make_metric(
    eval_fn=summary_length_ratio_eval_fn,
    greater_is_better=False, 
    name="summary_length_ratio"
)
print("Custom metric 'summary_length_ratio' defined.")
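The per-row logic of this metric can be checked in isolation before involving MLflow. The sketch below uses a hypothetical helper, `length_ratio`, that mirrors the branching in `summary_length_ratio_eval_fn`; the example strings are invented for illustration.

```python
def length_ratio(prediction: str, target: str) -> float:
    """Word-count ratio of prediction to target, mirroring the notebook's per-row rule."""
    pred_len = len(prediction.split())
    target_len = len(target.split())
    if target_len == 0:
        # Same convention as the notebook: empty vs. empty counts as a perfect match.
        return 0.0 if pred_len > 0 else 1.0
    return pred_len / target_len


examples = [
    ("MLflow tracks experiments.", "MLflow tracks experiments and packages models."),
    ("A much longer summary than the short reference text.", "Short reference."),
    ("", ""),
]
ratios = [length_ratio(pred, target) for pred, target in examples]
print(ratios)
```

A ratio near 1.0 means the summary is about as long as the reference. Because the "best" value is not simply the smallest one, treat `greater_is_better=False` as a statement that shorter summaries are preferred for this task, not as a universal rule.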

### Example 2: Keyword Presence Check

In [None]:
def keyword_presence_eval_fn(predictions, targets, custom_expected_list):
    scores = []
    details = [] 

    for i, pred_obj in enumerate(predictions):
        pred = str(pred_obj) # Ensure prediction is a string
        current_custom_expected = custom_expected_list[i]
        required_keywords = current_custom_expected.get("required_keywords", [])
        if not required_keywords:
            scores.append(1.0)
            details.append({"found": [], "missing": [], "all_required": []})
            continue

        found_count = 0
        found_kws = []
        missing_kws = []
        pred_lower = pred.lower()
        for kw in required_keywords:
            if re.search(r'\b' + re.escape(kw.lower()) + r'\b', pred_lower):
                found_count += 1
                found_kws.append(kw)
            else:
                missing_kws.append(kw)
        
        scores.append(found_count / len(required_keywords) if required_keywords else 1.0)
        details.append({"found": found_kws, "missing": missing_kws, "all_required": required_keywords})

    all_found_keywords_overall = sum([len(d['found']) for d in details])
    all_required_keywords_overall = sum([len(d['all_required']) for d in details])
    overall_hit_rate = all_found_keywords_overall / all_required_keywords_overall if all_required_keywords_overall > 0 else 1.0

    return MetricValue(
        scores=scores,
        aggregate_results={
            "mean_keyword_hit_rate": np.mean(scores),
            "overall_keyword_hit_rate": overall_hit_rate
        }
    )

keyword_presence_metric = make_metric(
    eval_fn=keyword_presence_eval_fn,
    greater_is_better=True,
    name="keyword_presence_score"
)
print("Custom metric 'keyword_presence_score' defined.")
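One subtlety of the `\b` word-boundary pattern used above is that it matches whole words only, so inflected forms do not count as hits. A minimal sketch, using a hypothetical helper `keyword_hits` and an invented summary:

```python
import re


def keyword_hits(text: str, keywords: list[str]) -> dict[str, bool]:
    """Map each keyword to whether it appears in `text` as a whole word (case-insensitive)."""
    text_lower = text.lower()
    return {
        kw: re.search(r"\b" + re.escape(kw.lower()) + r"\b", text_lower) is not None
        for kw in keywords
    }


summary = "MLflow manages model lifecycles and tracks experiments."
hits = keyword_hits(summary, ["MLflow", "lifecycle", "experiments"])
hit_rate = sum(hits.values()) / len(hits)
print(hits)
print(f"hit rate: {hit_rate:.2f}")
```

Here "lifecycle" is reported as missing because the summary says "lifecycles". If plural or inflected forms should count, one option is to list the variants explicitly in `required_keywords`, or to normalize words before matching.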

---

## 5. Defining Custom LLM-as-a-Judge Metrics
We use `qwen2:1.5b` as our judge.

### Concept: LLM as an Evaluator
The judge LLM (`qwen2:1.5b`) gets the input, generated output, and a prompt instructing it how to score based on criteria like coherence, relevance, or helpfulness [1, 3, 8].

### Example: "Summary Helpfulness" Judge (using `qwen2:1.5b`)
This judge will assess the helpfulness of the generated summaries.

#### Defining the Judge LLM Prompt and Rating Scale [1]

In [None]:
CUSTOM_JUDGE_HELPFULNESS_PROMPT_TEMPLATE = """
You are an AI assistant tasked with evaluating the helpfulness of a generated summary.
Consider the original text and the generated summary provided below.
Rate the helpfulness of the *generated summary* on a scale of 1 to 5, where:
1: Not helpful at all. The summary is irrelevant, nonsensical, or completely misses the main points.
2: Slightly helpful. The summary touches on some aspects but is largely incomplete or unclear.
3: Moderately helpful. The summary captures some main points but could be significantly improved in clarity or conciseness.
4: Very helpful. The summary is clear, concise, and accurately reflects the main essence of the original text.
5: Extremely helpful. The summary is outstanding, perfectly capturing the core message with excellent clarity and conciseness.

Original Text:
---BEGIN ORIGINAL TEXT---
{original_text}
---END ORIGINAL TEXT---

Generated Summary:
---BEGIN GENERATED SUMMARY---
{generated_summary}
---END GENERATED SUMMARY---

Based on the criteria, provide your evaluation ONLY as a JSON object with two keys: "score" (an integer from 1 to 5) and "reasoning" (a brief explanation for your score, max 30 words).
Example JSON: {{"score": 4, "reasoning": "The summary is quite clear and captures the main points well, making it useful."}}
"""

#### Implementing the `eval_fn` to call the Judge LLM (`qwen2:1.5b`)

In [None]:
def summary_helpfulness_judge_eval_fn(predictions, targets, inputs):
    if judge_llm is None:
        print("Judge LLM (qwen2:1.5b) not available, skipping helpfulness metric.")
        return MetricValue(scores=[np.nan]*len(predictions), aggregate_results={"mean_helpfulness_score": np.nan})
    
    scores = []
    # all_reasonings = [] # Could collect these for artifact logging if desired

    for i, generated_summary_obj in enumerate(predictions):
        generated_summary = str(generated_summary_obj) # Ensure string
        original_text = str(inputs[i]) # Ensure string
        
        prompt_for_judge = CUSTOM_JUDGE_HELPFULNESS_PROMPT_TEMPLATE.format(
            original_text=original_text,
            generated_summary=generated_summary
        )
        
        score = 0 # Default score in case of error
        reasoning = "Judge LLM call or parsing failed."
        try:
            # Using SystemMessage for role and HumanMessage for the task prompt
            judge_response_msg = judge_llm.invoke([
                SystemMessage(content="You are an AI assistant that provides evaluations in JSON format according to the user's instructions."),
                HumanMessage(content=prompt_for_judge)
            ])
            judge_response_content = judge_response_msg.content
            
            try:
                json_match = re.search(r'\{.*\}', judge_response_content, re.DOTALL)
                if json_match:
                    parsed_response = json.loads(json_match.group(0))
                    score = int(parsed_response.get("score", 0))
                    reasoning = parsed_response.get("reasoning", "No reasoning provided.")
                else:
                    print(f"Warning: Judge LLM (qwen2:1.5b) did not output valid JSON for item {i}. Response: {judge_response_content}")

            except json.JSONDecodeError:
                print(f"Warning: Judge LLM (qwen2:1.5b) output for item {i} was not valid JSON: {judge_response_content}")
            except Exception as e_parse:
                print(f"Warning: Error parsing judge (qwen2:1.5b) response for item {i}: {e_parse}. Response: {judge_response_content}")

        except Exception as e_judge_call:
            print(f"Error calling Judge LLM (qwen2:1.5b) for item {i}: {e_judge_call}")
            
        scores.append(score)
        # all_reasonings.append({"input": original_text, "prediction": generated_summary, "score": score, "reasoning": reasoning})

    valid_scores = [s for s in scores if isinstance(s, (int, float)) and 0 < s <= 5] # Filter out 0s from errors
    mean_score = np.mean(valid_scores) if valid_scores else np.nan

    return MetricValue(
        scores=scores, 
        aggregate_results={"mean_helpfulness_score": mean_score}
    )

summary_helpfulness_judge_metric = make_metric(
    eval_fn=summary_helpfulness_judge_eval_fn,
    greater_is_better=True,
    name="summary_helpfulness_qwen2_1.5b_judge"
)
print("Custom LLM-as-a-Judge metric 'summary_helpfulness_qwen2_1.5b_judge' defined.")
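The parsing step inside `summary_helpfulness_judge_eval_fn` can be factored out and tested without calling Ollama. The sketch below is one possible shape, not the notebook's implementation: `parse_judge_score` is a hypothetical helper, it returns `None` where the notebook uses the sentinel `0`, and the sample replies are invented.

```python
import json
import re
from typing import Optional


def parse_judge_score(raw: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the integer score from a judge reply, or None if it cannot be trusted."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    score = payload.get("score") if isinstance(payload, dict) else None
    # bool is a subclass of int, so reject it explicitly; also reject strings and floats.
    if isinstance(score, bool) or not isinstance(score, int):
        return None
    if not low <= score <= high:
        return None
    return score


replies = [
    '{"score": 4, "reasoning": "Clear and accurate."}',
    'Sure! Here is my rating: {"score": 5, "reasoning": "Excellent."}',
    '{"score": 9, "reasoning": "Out of range."}',
    "I think it deserves a 4.",
    '{"score": "high"}',
]
parsed = [parse_judge_score(reply) for reply in replies]
print(parsed)
```

Keeping the parser separate makes it easy to see how often a small judge model returns unusable output, which is worth tracking alongside the mean score.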

---

## 6. Evaluating with Custom Metrics using `mlflow.evaluate()`

### Preparing Evaluation Data with `custom_expected` fields [2, 6]
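The `keyword_presence_metric` needs per-row expectations, so we add a `custom_expected` column that holds the required keywords for each sample.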

In [None]:
eval_df_custom = pd.DataFrame({
    "inputs": [entry['info']['post'] for entry in eval_dataset],
    "targets": [entry['summaries'][0]['text'] for entry in eval_dataset],
    "custom_expected": [
        {"required_keywords": ["MLflow", "lifecycle"]} if "mlflow" in entry['info']['post'].lower() 
        else {"required_keywords": ["summary", "text"]} # Default keywords
        for entry in eval_dataset
    ]
})
print("Evaluation DataFrame with 'custom_expected' prepared:")
print(eval_df_custom.head(2))
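Because the keyword metric reads `required_keywords` with a default of an empty list, a misspelled key would silently score 1.0. A small pre-flight check can catch that before a long evaluation run. This is a sketch under that assumption; `validate_custom_expected` is a hypothetical helper, not part of MLflow.

```python
def validate_custom_expected(rows):
    """Return indices of rows whose 'required_keywords' entry is missing or malformed."""
    bad_indices = []
    for i, row in enumerate(rows):
        keywords = row.get("required_keywords") if isinstance(row, dict) else None
        is_valid = isinstance(keywords, list) and all(
            isinstance(kw, str) and kw.strip() for kw in keywords
        )
        if not is_valid:
            bad_indices.append(i)
    return bad_indices


rows = [
    {"required_keywords": ["MLflow", "lifecycle"]},
    {"required_keywords": "MLflow"},  # a bare string instead of a list
    {"keywords": ["summary"]},        # wrong key name
    {"required_keywords": []},        # empty list is allowed
]
bad_rows = validate_custom_expected(rows)
print(bad_rows)
```

In the notebook this could be applied as `validate_custom_expected(eval_df_custom["custom_expected"].tolist())`.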

### Running the Evaluation for `gemma2:2b` and `phi3:mini`
We'll iterate through our selected models and evaluate them.

In [None]:
models_to_evaluate_ollama = {
    ollama_eval_model_1_name: OllamaSummarizationWrapper(ollama_eval_model_1_name),
    ollama_eval_model_2_name: OllamaSummarizationWrapper(ollama_eval_model_2_name)
}

custom_metrics_list = [
    summary_length_ratio_metric,
    keyword_presence_metric,
    summary_helpfulness_judge_metric 
]

for model_key, wrapped_model_instance in models_to_evaluate_ollama.items():
    if wrapped_model_instance.llm is None: # Check if Ollama model initialized in wrapper
        print(f"Skipping evaluation for {model_key} as its LLM failed to initialize.")
        continue

    print(f"\n--- Evaluating model: {model_key} ---")
    with mlflow.start_run(run_name=f"Eval_{model_key.replace(':', '_')}_Summarization_CustomMetrics") as run:
        mlflow.log_param("model_name", model_key)
        mlflow.log_param("evaluation_task", "text-summarization-custom")
        mlflow.log_param("dataset_name", f"{dataset_name}/{dataset_config_name}")
        mlflow.log_param("num_eval_samples", num_eval_samples)
        mlflow.log_param("judge_llm_for_helpfulness", ollama_judge_model_name if judge_llm else "N/A")
        mlflow.set_tag("evaluation_type", "custom_metrics_focused")
        mlflow.set_tag("model_source", "Ollama")

        print(f"Starting mlflow.evaluate for {model_key} with custom metrics...")
        try:
            custom_eval_results = mlflow.evaluate(
                model=wrapped_model_instance.predict, # mlflow.evaluate expects a callable, a pyfunc model, or a model URI
                data=eval_df_custom.copy(), # Pass a copy to be safe
                targets="targets",
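                # No feature_names restriction, so the 'custom_expected' column stays visible to custom metrics.
                # Assumption: col_mapping binds the eval_fn parameter name to the data column name.
                evaluator_config={"col_mapping": {"custom_expected_list": "custom_expected"}},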
                model_type="text-summarization", # This helps mlflow pick some default metrics if available
                extra_metrics=custom_metrics_list,
                # To run only the custom metrics, omit model_type; without it MLflow computes just extra_metrics.
                # Here we keep model_type so built-in summarization metrics (e.g., ROUGE) are added when available.
            )
            print(f"\nCustom evaluation results for {model_key}:")
            for metric_name, value in custom_eval_results.metrics.items():
                print(f"  {metric_name}: {value}")
            
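            artifacts = custom_eval_results.artifacts or {}
            # The table is typically registered as "eval_results_table"; also check the ".json" key to be safe.
            table_artifact = artifacts.get("eval_results_table") or artifacts.get("eval_results_table.json")
            if table_artifact is not None:
                print(f"  Detailed evaluation table artifact path: {table_artifact.uri}")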
        
        except Exception as e_eval:
            print(f"Error during mlflow.evaluate for {model_key}: {e_eval}")
            mlflow.log_text(str(e_eval), f"evaluation_error_{model_key.replace(':', '_')}.txt") # ':' is unsafe in file names
    clear_gpu_cache() # Clear cache between model evaluations if on GPU

# Clean up judge LLM if it was loaded
if judge_llm is not None:
    # Depending on langchain/ollama, direct cleanup might not be exposed.
    # keep_alive="5m" (set when the judge was created) lets Ollama unload it automatically once idle.
    pass 

---

## 7. Analyzing Custom Metric Results in MLflow UI

Launch the MLflow UI (`mlflow ui`) and navigate to the `LLM_Custom_Metrics_Summarization_Ollama` experiment.

- **Compare Runs:** Select the runs for `gemma2:2b` and `phi3:mini` and click "Compare".
- **Metrics Section:** You'll see your custom metrics:
    - `summary_length_ratio/mean_length_ratio`
    - `keyword_presence_score/mean_keyword_hit_rate`
    - `summary_helpfulness_qwen2_1.5b_judge/mean_helpfulness_score`
    - ... and any default metrics MLflow added.
- **Analyze:** Which model performed better on helpfulness according to `qwen2:1.5b`? Which had a better length profile or keyword coverage for this task?
- **Artifacts:** The `eval_results_table.json` for each run will show per-sample scores for all metrics, allowing you to drill down into specific examples where models differed or where the judge provided interesting reasoning (if you were to log the judge's reasoning as a separate artifact).

![MLFlow UI](https://blog.min.io/content/images/2025/03/Screenshot-2025-03-10-at-3.30.33-PM.png)

This comparative view, enriched by custom metrics, offers a much deeper understanding than relying on generic scores alone.
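Once per-sample scores for two runs are available, for example from each run's evaluation table, a quick way to find examples worth reading is to rank rows by how much the two models disagree. The sketch below uses a hypothetical helper, `largest_disagreements`, and placeholder scores that are not real results.

```python
def largest_disagreements(scores_a, scores_b, top_k=3):
    """Return (index, score_a, score_b) for the rows where two models differ most."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Both score lists must cover the same samples.")
    rows = [
        (i, a, b)
        for i, (a, b) in enumerate(zip(scores_a, scores_b))
        if a is not None and b is not None  # skip rows where a judge call failed
    ]
    rows.sort(key=lambda row: abs(row[1] - row[2]), reverse=True)
    return rows[:top_k]


# Placeholder judge scores for two models on the same five samples.
model_a_scores = [4, 2, 5, None, 3]
model_b_scores = [4, 5, 1, 3, 3]

top = largest_disagreements(model_a_scores, model_b_scores, top_k=2)
print(top)
```

Reading the inputs and summaries for the top-ranked rows often reveals whether a gap comes from the model or from an unreliable judge score.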

---

## 8. Best Practices for Developing Custom Metrics [2, 4, 6]

- **Define Clear Criteria:** Your custom metric should measure a well-defined, specific aspect of quality or performance that matters for your application [4].
- **Iterative Development [2, 6]:**
    1.  **Generate an "Answer Sheet":** Run your model on an evaluation dataset once and save its predictions.
    2.  **Develop Metric `eval_fn`:** Write your custom metric function and test it directly on the saved predictions/inputs.
    3.  **Validate with `mlflow.evaluate` on Answer Sheet:** Run `mlflow.evaluate` on the pre-generated answer sheet: omit the `model` argument, pass the answer sheet as `data`, and name its prediction column through the `predictions` argument.
    4.  **Full Evaluation:** Run `mlflow.evaluate` with the actual model.
- **Consider Cost and Latency:** LLM-as-a-Judge metrics add cost (judge LLM tokens) and latency. Heuristic metrics are faster.
- **Judge LLM Reliability & Bias:** The quality of LLM-as-a-Judge metrics depends on the judge LLM's capability and its own potential biases. Use a strong judge model and clear, unbiased prompts. Consider using multiple judge LLMs or averaging scores.
- **Combine with Standard Metrics and Human Review:** Custom metrics provide additional dimensions but shouldn't entirely replace standard metrics or qualitative human assessment.
- **Version Your Metrics:** Metric definitions can evolve. Track their versions.
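The first two steps of the iterative workflow above can be sketched without MLflow or a live model. Everything here is illustrative: `fake_model` stands in for the real (slow) model call, `length_ratio` is a simplified metric, and the rows are invented.

```python
import json
import tempfile
from pathlib import Path


def fake_model(text: str) -> str:
    """Placeholder for the real model call: returns the first sentence."""
    return text.split(".")[0].strip() + "."


def length_ratio(prediction: str, target: str) -> float:
    """Simplified metric under development: word-count ratio of prediction to target."""
    target_len = len(target.split())
    return len(prediction.split()) / target_len if target_len else 0.0


eval_rows = [
    {"inputs": "MLflow tracks experiments. It also packages models.",
     "targets": "MLflow tracks experiments and packages models."},
    {"inputs": "Ollama serves local models. It exposes an HTTP API.",
     "targets": "Ollama serves local models over HTTP."},
]

# Step 1: run the expensive model once and persist the answer sheet.
answer_sheet = [{**row, "predictions": fake_model(row["inputs"])} for row in eval_rows]
sheet_path = Path(tempfile.mkdtemp()) / "answer_sheet.json"
sheet_path.write_text(json.dumps(answer_sheet))

# Step 2: iterate on the metric against the saved predictions, with no model calls.
reloaded = json.loads(sheet_path.read_text())
ratios = [length_ratio(row["predictions"], row["targets"]) for row in reloaded]
print(ratios)
```

For step 3, the same answer sheet can be loaded into a DataFrame and passed to `mlflow.evaluate` with `predictions="predictions"` and `targets="targets"`, assuming an MLflow version that supports the `predictions` argument.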

---

## 9. Key Takeaways

This notebook has equipped you with advanced techniques for evaluating generative AI models:

- **Tailored Evaluation:** You can now define custom metrics that precisely measure what's important for your specific generative AI task.
- **Heuristic Metrics:** Implemented programmatic custom metrics (length ratio, keyword presence) using `mlflow.metrics.make_metric`.
- **LLM-as-a-Judge:** Implemented a custom LLM-as-a-Judge metric (using `qwen2:1.5b`) to assess "helpfulness."
- **Integration with `mlflow.evaluate`:** Seamlessly incorporated these diverse custom metrics into the MLflow evaluation workflow.
- **Deeper Insights:** Custom metrics, when analyzed in MLflow, provide a richer understanding of model performance beyond standard scores.

Mastering custom evaluation is crucial for effectively iterating on and improving your generative AI models.

---

## 10. Engaging Resources and Further Reading

- **MLflow Documentation:**
    - [MLflow LLM Evaluate - Custom Metrics Section](https://mlflow.org/docs/latest/llms/llm-evaluate/index.html#create-custom-heuristic-based-llm-evaluation-metrics) [3]
    - [MLflow `make_metric` API](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#mlflow.metrics.make_metric)
- **Databricks Documentation (Mosaic AI Agent Evaluation):**
    - [Custom Metrics Guide (Conceptual Overlap with MLflow)](https://docs.databricks.com/en/generative-ai/agent-evaluation/custom-metrics.html) [2, 6]
- **Cloud Provider Custom Metrics (for conceptual understanding of LLM-as-a-Judge structure):**
    - [AWS Bedrock - Custom Metrics for GenAI Evaluation](https://aws.amazon.com/blogs/machine-learning/use-custom-metrics-to-evaluate-your-generative-ai-application-with-amazon-bedrock/) [1]
    - [Google Vertex AI - Evaluating Generative AI](https://cloud.google.com/vertex-ai/docs/generative-ai/models/evaluate-models) (see also the LinkedIn article [8] referencing Google SDK)
- **General Best Practices:**
    - [Microsoft Tech Community: Evaluating generative AI: Best practices for developers](https://techcommunity.microsoft.com/blog/azuredevcommunityblog/evaluating-generative-ai-best-practices-for-developers/4271488) [4]
    - [DataRobot Blog: Design and Monitor Custom Metrics for Generative AI](https://www.datarobot.com/blog/design-and-monitor-custom-metrics-for-generative-ai-use-cases-in-datarobot-ai-platform/) [5]

--- 

Excellent work! You've now explored a critical aspect of maturing your generative AI development process by implementing and using custom evaluation metrics.

**Coming Up Next (Notebook 10):** We'll aim to synthesize several concepts by building a more comprehensive End-to-End GenAI Application, potentially combining RAG with function-calling agents, all tracked and managed with MLflow.

![Keep Learning](https://memento.epfl.ch/image/23136/1440x810.jpg)