# MLflow 06: Evaluating and Benchmarking LLMs with MLflow

Welcome to the sixth notebook in our MLflow series! So far, we've covered tracking experiments, HPO, model registry, RAG, and fine-tuning LLMs like Qwen3-0.6B. Now, a crucial question arises: *How good are these models?* And, *how do they compare against each other or against a baseline?*

This notebook dives into **Evaluating and Benchmarking Large Language Models (LLMs)**. We'll explore how to:
- Select appropriate evaluation datasets and metrics for LLM tasks.
- Use Hugging Face's `evaluate` library for calculating standard metrics.
- Leverage **MLflow's `mlflow.evaluate()` API** designed for LLMs to streamline the evaluation process.
- Systematically log evaluation results (metrics, parameters, sample outputs) for different LLMs (base models, fine-tuned models) to MLflow.
- Use the MLflow UI to compare model performance and create a benchmark/leaderboard.

![LLM Evaluation Concept](https://aisera.com/wp-content/uploads/2023/12/LLM-Evaluation.png)

Effective evaluation is key to making informed decisions in your LLM development lifecycle, guiding model selection, fine-tuning efforts, and understanding model capabilities and limitations.

---

## Table of Contents

1. [The Importance and Challenges of LLM Evaluation](#importance-challenges-llm-eval)
2. [Setting Up the Evaluation Environment](#setting-up-eval-env)
    - [Installing Libraries](#installing-libraries-eval)
    - [GPU Check](#gpu-check-eval)
    - [Configuring MLflow](#configuring-mlflow-eval)
3. [Choosing an Evaluation Dataset and Task](#choosing-eval-dataset-task)
    - [Task: Text Summarization](#task-text-summarization)
    - [Dataset: `openai/summarize_from_feedback` (TLDR subset)](#dataset-summarize-feedback)
4. [Selecting LLMs for Evaluation](#selecting-llms-for-eval)
    - [Model 1: Base `Qwen/Qwen3-0.6B`](#model1-base-qwen3)
    - [Model 2: Fine-tuned `Qwen/Qwen3-0.6B` (Recipe Bot)](#model2-ft-qwen3)
    - [Model 3: `google/flan-t5-small` (Baseline Instruction Model)](#model3-flan-t5)
5. [Overview of Evaluation Metrics](#overview-eval-metrics)
    - [ROUGE, BERTScore, Perplexity](#rouge-bertscore-perplexity)
6. [Evaluating LLMs with `mlflow.evaluate()`](#evaluating-with-mlflow-evaluate)
    - [Preparing the Evaluation Data](#preparing-eval-data-mlflow)
    - [Evaluating Model 1: Base `Qwen/Qwen3-0.6B`](#evaluating-model1)
    - [Evaluating Model 2: Fine-tuned `Qwen/Qwen3-0.6B`](#evaluating-model2)
    - [Evaluating Model 3: `google/flan-t5-small`](#evaluating-model3)
7. [Comparing Model Performance in MLflow UI](#comparing-models-mlflow-ui)
8. [Brief: Advanced Evaluation Concepts & Tools](#advanced-eval-concepts)
9. [Key Takeaways](#key-takeaways-eval)
10. [Engaging Resources and Further Reading](#resources-and-further-reading-eval)

---

## 1. The Importance and Challenges of LLM Evaluation

Evaluating LLMs is crucial for:
- **Model Selection:** Choosing the best model (base or fine-tuned) for a specific task.
- **Tracking Progress:** Measuring improvements from fine-tuning or changes in prompting strategies.
- **Understanding Capabilities & Limitations:** Identifying strengths and weaknesses of a model.
- **Ensuring Responsible AI:** Assessing aspects like fairness, bias, toxicity, and factuality (though these often require specialized evaluation setups).

**Challenges in LLM Evaluation:**
- **Open-endedness:** Generated text can be diverse, making it hard for simple metrics to capture true quality.
- **Lack of Ground Truth:** For some generative tasks, a single "correct" answer doesn't exist.
- **Metric Limitations:** Traditional metrics (e.g., BLEU, ROUGE) capture surface-level similarities but may miss semantic meaning or factual correctness.
- **Cost and Effort:** Human evaluation is often the gold standard but is expensive and time-consuming.
- **Task Diversity:** Different tasks (summarization, QA, translation, creative writing) require different evaluation approaches.

Despite these challenges, a combination of automated metrics and qualitative analysis provides valuable insights. MLflow helps organize and compare these varied evaluation results.

---

## 2. Setting Up the Evaluation Environment

### Installing Libraries
We'll need `mlflow`, `transformers`, `datasets`, `evaluate` (from Hugging Face), `peft` (for loading LoRA adapters), `bitsandbytes` (for quantization), `accelerate`, `rouge_score`, `bert_score`, and `sentencepiece`. We also install `nltk` and `textstat`, which MLflow's default text metrics (ROUGE and the readability scores) rely on.

In [None]:
!pip install --quiet mlflow "transformers>=4.51.0" datasets evaluate peft trl bitsandbytes sentencepiece accelerate rouge_score bert_score nltk textstat

import os
import shutil

import evaluate as hf_evaluate_lib # Hugging Face evaluate library, aliased to avoid confusion with mlflow.evaluate
import mlflow
import pandas as pd
import torch
import transformers
from datasets import Dataset, load_dataset # Dataset is needed for the fallback dummy dataset
from peft import PeftModel # For loading LoRA adapters
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    AutoModelForSeq2SeqLM, # For Flan-T5
    pipeline # Used to build the generation pipelines in Section 6
)

print(f"MLflow Version: {mlflow.__version__}")
print(f"PyTorch Version: {torch.__version__}")
print(f"Transformers Version: {transformers.__version__}")
print(f"Hugging Face Evaluate Library Version: {hf_evaluate_lib.__version__}")

### GPU Check
LLM inference, even for smaller models, is faster on GPU.

In [None]:
if torch.cuda.is_available():
    print(f"CUDA is available. Using GPU: {torch.cuda.get_device_name(0)}")
    torch.cuda.set_device(0)
    current_device_name = torch.cuda.get_device_name(0)
else:
    print("CUDA not available. LLM evaluation will run on CPU and might be slow.")
    current_device_name = 'cpu'

def clear_gpu_cache():
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

### Configuring MLflow

In [None]:
mlflow.set_tracking_uri('mlruns')
experiment_name = "LLM_Evaluation_Summarization_Benchmark"
mlflow.set_experiment(experiment_name)
print(f"MLflow Experiment set to: {experiment_name}")

---

## 3. Choosing an Evaluation Dataset and Task

### Task: Text Summarization
We'll evaluate the LLMs on their ability to generate concise summaries of given texts. This is a common and important NLP task.

### Dataset: `openai/summarize_from_feedback` (TLDR subset)
This dataset was released with OpenAI's "Learning to summarize from human feedback" work and pairs posts with candidate summaries and human feedback. We'll use the `comparisons` configuration, which includes Reddit TL;DR posts.
-   **Input:** Reddit post content (the `info.post` field).
-   **Target/Reference:** The first candidate summary (the `summaries[0].text` field). Candidates in this dataset can be human-written or model-generated, so treat it as an approximate reference.

This dataset is suitable because it provides text-summary pairs, allowing us to compute reference-based metrics like ROUGE and BERTScore.

In [None]:
dataset_name = "openai/summarize_from_feedback"
dataset_config_name = "comparisons" # Published configurations are 'comparisons' and 'axis'; 'comparisons' exposes a `summaries` list per post
num_eval_samples = 50 # Number of samples to use for evaluation (adjust as needed for speed vs. thoroughness)

try:
    eval_dataset_full = load_dataset(dataset_name, dataset_config_name, split="validation")
    # Select a subset for faster evaluation in this demo
    eval_dataset = eval_dataset_full.select(range(num_eval_samples))
    print(f"Loaded {len(eval_dataset)} samples from '{dataset_name}/{dataset_config_name}' for evaluation.")
    
    # Inspect a sample
    sample_entry = eval_dataset[0]
    print("\nSample Entry:")
    print(f"  Post (Input): {sample_entry['info']['post'][:300]}...")
    print(f"  Reference Summary (Target): {sample_entry['summaries'][0]['text']}") 
except Exception as e:
    print(f"Error loading dataset: {e}. This might be due to connectivity or dataset access issues.")
    print("Creating a dummy dataset for fallback.")
    dummy_data = {
        'info': [{'post': 'This is a long post about the benefits of MLflow for MLOps. MLflow helps track experiments, package models, and manage the ML lifecycle.'}] * num_eval_samples,
        'summaries': [[{'text': 'MLflow is great for MLOps.'}]] * num_eval_samples
    }
    eval_dataset = Dataset.from_dict(dummy_data)
    print(f"Using dummy dataset with {len(eval_dataset)} samples.")

---

## 4. Selecting LLMs for Evaluation
We will evaluate three different models to see how they perform on the summarization task:

### Model 1: Base `Qwen/Qwen3-0.6B`
This is the pre-trained Qwen3-0.6B model without any further fine-tuning from our side. We'll use quantization for efficient inference.

### Model 2: Fine-tuned `Qwen/Qwen3-0.6B` (Recipe Bot from Notebook 4)
This is the Qwen3-0.6B model that we fine-tuned on recipe generation in Notebook 4. It will be interesting to see how this domain-specific fine-tuning affects its performance on a general summarization task. We'll load the LoRA adapter.

### Model 3: `google/flan-t5-small` (Baseline Instruction Model)
Flan-T5 is a well-known family of instruction-tuned models. The 'small' version is efficient and provides a good baseline for comparison.

We will load these models sequentially to manage VRAM.
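
One way to picture the sequential pattern is the loop below. It is a minimal sketch under stated assumptions: `run_sequentially`, `load_model`, and `evaluate_model` are hypothetical names (not part of any library) standing in for the loading and evaluation code used later in this notebook, and a real GPU run would also call `torch.cuda.empty_cache()` after releasing each model.

```python
import gc


def run_sequentially(model_names, load_model, evaluate_model):
    """Load, evaluate, and release one model at a time.

    `load_model` and `evaluate_model` are hypothetical callables supplied by
    the caller; they stand in for the real loading and evaluation steps.
    """
    results = {}
    for name in model_names:
        model = load_model(name)  # only one model is held in memory here
        try:
            results[name] = evaluate_model(model)
        finally:
            del model       # drop the reference before the next load
            gc.collect()    # on GPU, follow with torch.cuda.empty_cache()
    return results


# Toy stand-ins so the sketch runs without any ML libraries.
def fake_load(name):
    return {"name": name}


def fake_evaluate(model):
    return {"name_chars": len(model["name"])}


demo_results = run_sequentially(["model-a", "model-bb"], fake_load, fake_evaluate)
print(demo_results)
```

Releasing inside `finally` means memory is freed even when one evaluation fails, so a single bad model does not starve the ones that follow.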

---

## 5. Overview of Evaluation Metrics

We'll use a combination of metrics, primarily those supported by `mlflow.evaluate` for summarization:

- **ROUGE (Recall-Oriented Understudy for Gisting Evaluation):** Measures overlap between the generated summary and reference summary (e.g., ROUGE-1, ROUGE-2, ROUGE-L for unigram, bigram, and longest common subsequence overlap).
- **BERTScore:** Computes similarity between generated and reference summaries using contextual embeddings from BERT, often correlating better with human judgment than ROUGE.
- **Perplexity:** A measure of how well a language model predicts a piece of text. Lower perplexity generally indicates more fluent text. It is reference-free: it scores the generated text on its own rather than comparing it to a reference summary. *Note: whether `mlflow.evaluate` reports perplexity by default depends on your MLflow version, so check the built-in metrics documentation for the release you use.*
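
To make the definitions above concrete, here is a small from-scratch sketch of ROUGE-1, ROUGE-L, and perplexity. It assumes whitespace tokenization and no stemming, so its scores will not match the `rouge_score` package exactly; the function names are illustrative and are not any library's API.

```python
import math
from collections import Counter


def _tokens(text):
    # Simplified tokenizer: lowercase and split on whitespace (no stemming).
    return text.lower().split()


def _f1(precision, recall):
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a prediction and a reference."""
    pred, ref = Counter(_tokens(prediction)), Counter(_tokens(reference))
    overlap = sum((pred & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    return _f1(overlap / sum(pred.values()), overlap / sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rougeL_f1(prediction, reference):
    """F1 based on the longest common subsequence of tokens."""
    pred, ref = _tokens(prediction), _tokens(reference)
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    return _f1(lcs / len(pred), lcs / len(ref))


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


pred = "mlflow tracks experiments and models"
ref = "mlflow tracks experiments"
r1 = rouge1_f1(pred, ref)    # precision 3/5, recall 3/3
rl = rougeL_f1(pred, ref)    # LCS length is 3
ppl = perplexity([math.log(0.5)] * 4)  # every token predicted with p = 0.5
print(r1, rl, ppl)
```

Note how the two extra words in the prediction lower precision but not recall, which is why ROUGE is usually reported as precision, recall, and F1 together.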

MLflow's `evaluate` API can also compute other metrics like `exact_match`, `f1`, and potentially model-based metrics if configured (e.g., toxicity, PII detection using an LLM as a judge, though this is more advanced).

---

## 6. Evaluating LLMs with `mlflow.evaluate()`

The `mlflow.evaluate()` API provides a structured way to evaluate models, especially for common NLP tasks like text summarization. Its main arguments are:
- `model`: a model URI, an `mlflow.pyfunc.PyFuncModel`, or a Python function that maps the input data to predictions.
- `data`: a Pandas DataFrame (other dataset types are also supported).
- `targets`: the name of the column holding the reference/ground truth.
- `feature_names`: optionally, the columns to pass to the model as inputs; by default, every column except `targets` is used.
- `model_type`: for example `"text-summarization"`, `"question-answering"`, or `"text"`; it determines the default metrics.
- `extra_metrics`: a list of additional metrics, such as built-ins from `mlflow.metrics` or custom ones.
- `evaluators` and `evaluator_config`: for selecting and configuring evaluators in more advanced setups.
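
Conceptually, such a call feeds the feature columns to the model, scores each prediction against its target, and aggregates the scores. The toy function below mimics that flow in plain Python. `toy_evaluate` and its arguments are hypothetical names chosen for illustration; this is not MLflow's implementation or API.

```python
from statistics import mean


def toy_evaluate(predict_fn, rows, targets, metrics):
    """Run predict_fn on each row's features and aggregate per-sample scores.

    rows:    list of dicts, one per sample
    targets: name of the key holding the reference text
    metrics: dict mapping metric name -> fn(prediction, target) -> float
    """
    # Everything except the target column counts as a model input.
    features = [{k: v for k, v in row.items() if k != targets} for row in rows]
    predictions = predict_fn(features)

    per_sample = []
    for row, pred in zip(rows, predictions):
        scores = {name: fn(pred, row[targets]) for name, fn in metrics.items()}
        per_sample.append({"prediction": pred, "target": row[targets], **scores})

    aggregated = {f"{name}/mean": mean(s[name] for s in per_sample) for name in metrics}
    return aggregated, per_sample


def first_sentence_model(features):
    # Trivial "summarizer": return the first sentence of each text.
    return [f["text"].split(".")[0].strip() + "." for f in features]


def exact_match(prediction, target):
    return float(prediction == target)


rows = [
    {"text": "MLflow tracks runs. It also stores models.", "summary": "MLflow tracks runs."},
    {"text": "Evaluation is hard. Metrics help.", "summary": "Metrics help."},
]
aggregated, table = toy_evaluate(
    first_sentence_model, rows, "summary", {"exact_match": exact_match}
)
print(aggregated)
```

The per-sample `table` plays the role of the evaluation results table that MLflow stores as an artifact, while `aggregated` corresponds to the run-level metrics.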

![MLFlow UI](https://blog.min.io/content/images/2025/03/Screenshot-2025-03-10-at-3.30.33-PM.png)

### Preparing the Evaluation Data
`mlflow.evaluate` often works best with Pandas DataFrames. Let's convert our Hugging Face dataset subset.

In [None]:
eval_df = pd.DataFrame({
    "prompt_text": [entry['info']['post'] for entry in eval_dataset],
    "reference_summary": [entry['summaries'][0]['text'] for entry in eval_dataset]
})

print("Evaluation DataFrame prepared:")
print(eval_df.columns.tolist())
print(eval_df.head(2))

### Evaluating Model 1: Base `Qwen/Qwen3-0.6B`

In [None]:
clear_gpu_cache()
model_name_qwen3_base = "Qwen/Qwen3-0.6B"
run_name_qwen3_base = "Eval_Base_Qwen3_0.6B_Summarization"

try:
    print(f"Loading base model for evaluation: {model_name_qwen3_base}")
    # For mlflow.evaluate, providing a transformers pipeline is often easiest
    # Qwen3 uses AutoModelForCausalLM
    qwen3_base_tokenizer = AutoTokenizer.from_pretrained(model_name_qwen3_base, trust_remote_code=True)
    if qwen3_base_tokenizer.pad_token is None:
        qwen3_base_tokenizer.pad_token = qwen3_base_tokenizer.eos_token
    
    # Define a model object or pipeline for mlflow.evaluate
    # Using a custom model wrapper for more control over generation if needed, or a pipeline
    # For simplicity, we'll create a pipeline. Max_length for summary can be controlled.
    qwen3_base_pipeline = pipeline(
        "text-generation", # Qwen3 is a causal LM, so text-generation is appropriate for summarization prompts
        model=model_name_qwen3_base,
        tokenizer=qwen3_base_tokenizer,
        device_map="auto", # Let accelerate place the quantized model instead of passing `device`; 4-bit bitsandbytes loading generally needs a GPU
        trust_remote_code=True,
        model_kwargs={ # Quantization settings are forwarded to from_pretrained
            "quantization_config": BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float16)
        }
    )
    print("Base Qwen3-0.6B pipeline created.")

    with mlflow.start_run(run_name=run_name_qwen3_base) as run:
        mlflow.log_param("model_name", model_name_qwen3_base)
        mlflow.log_param("evaluation_task", "text-summarization")
        mlflow.log_param("dataset_name", f"{dataset_name}/{dataset_config_name}")
        mlflow.log_param("num_eval_samples", num_eval_samples)
        mlflow.set_tag("model_type", "base_llm")

        print(f"Starting mlflow.evaluate for {model_name_qwen3_base}...")
        # Qwen3 is a causal LM served through a text-generation pipeline, so each input
        # must be phrased as a summarization prompt, and the pipeline output must be
        # reduced to the summary text alone before mlflow.evaluate scores it.

        # Create a new DataFrame with a summarization prompt for text-generation models
        eval_df_for_gen = eval_df.copy()
        eval_df_for_gen["generation_prompt"] = eval_df_for_gen["prompt_text"].apply(lambda x: f"Summarize the following text:\n\n{x}\n\nSummary:")
        
        # The pipeline needs to be wrapped or adapted if its output isn't directly the summary text
        # For text-generation, it outputs a list of dicts with 'generated_text'
        # We'll create a simple wrapper for mlflow.evaluate
        class SummarizationPipelineWrapper:
            def __init__(self, pipeline_obj):
                self.pipeline = pipeline_obj
            def predict(self, X):
                prompts = X["generation_prompt"].tolist()
                # Pipeline expects list of strings
                outputs = self.pipeline(prompts, max_new_tokens=100, num_return_sequences=1, do_sample=False, pad_token_id=self.pipeline.tokenizer.eos_token_id)
                # Extract just the generated summary part AFTER the prompt
                results = []
                for i, out_list in enumerate(outputs):
                    full_text = out_list[0]['generated_text']
                    # Remove original prompt to get just the summary
                    summary = full_text.replace(prompts[i], "").strip()
                    results.append(summary)
                return pd.Series(results)

        summarization_model_qwen3_base = SummarizationPipelineWrapper(qwen3_base_pipeline)

        results = mlflow.evaluate(
            # mlflow.evaluate accepts a model URI, a PyFuncModel, or a plain Python function,
            # so we pass the wrapper's bound predict method rather than the wrapper object.
            model=summarization_model_qwen3_base.predict,
            data=eval_df_for_gen, # DataFrame with 'prompt_text' and 'generation_prompt'
            targets="reference_summary",
            # All non-target columns are passed to predict() as a DataFrame;
            # the wrapper reads only 'generation_prompt'.
            model_type="text-summarization", # Enables the default summarization metrics (e.g. ROUGE)
        )
        print(f"Base Qwen3-0.6B evaluation results:\n{results.metrics}")
        # The per-sample table is registered under the key "eval_results_table"
        if results.artifacts and "eval_results_table" in results.artifacts:
             print(f"Evaluation table artifact path: {results.artifacts['eval_results_table'].uri}")

except Exception as e:
    print(f"Error evaluating base Qwen3-0.6B: {e}")
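finally:
    # Guard each deletion: a bare `del` raises NameError if loading failed before the name was bound
    if 'qwen3_base_pipeline' in locals(): del qwen3_base_pipeline
    if 'summarization_model_qwen3_base' in locals(): del summarization_model_qwen3_base
    clear_gpu_cache()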
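
The wrapper in the cell above removes the prompt with `str.replace`, which deletes every occurrence of the prompt string rather than only the echoed prefix. A slightly more defensive variant is sketched below; `strip_prompt` is a hypothetical helper name, not part of `transformers` or MLflow.

```python
def strip_prompt(full_text, prompt):
    """Return only the continuation when a pipeline echoes the prompt."""
    if full_text.startswith(prompt):
        # Cut exactly the echoed prefix and nothing else.
        return full_text[len(prompt):].strip()
    # Fallback: the prompt was not echoed verbatim (e.g. whitespace was normalized).
    return full_text.strip()


demo_prompt = "Summarize the following text:\n\nA cat sat.\n\nSummary:"
demo_output = demo_prompt + " A cat sat on a mat."
print(strip_prompt(demo_output, demo_prompt))
```

As an alternative, the `text-generation` pipeline accepts `return_full_text=False`, which returns only the newly generated text and should make the stripping step unnecessary.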

### Evaluating Model 2: Fine-tuned `Qwen/Qwen3-0.6B` (Recipe Bot)
We load the LoRA adapter trained in Notebook 4. **Important:** The path to the adapter must be correct.

In [None]:
clear_gpu_cache()
model_name_qwen3_ft = "Qwen/Qwen3-0.6B_FineTuned_RecipeBot"
run_name_qwen3_ft = "Eval_FineTuned_Qwen3_0.6B_Summarization"
# Ensure this path points to where your Qwen3 fine-tuned adapter from Notebook 4 was saved
qwen3_adapter_path = "./qwen3_0.6b_recipe_finetuned_adapters/final_adapter" 

if not os.path.exists(qwen3_adapter_path):
    print(f"Fine-tuned Qwen3 adapter not found at {qwen3_adapter_path}. Skipping evaluation.")
else:
    try:
        print(f"Loading fine-tuned Qwen3-0.6B model for evaluation from adapter: {qwen3_adapter_path}")
        qwen3_ft_base_model = AutoModelForCausalLM.from_pretrained(
            "Qwen/Qwen3-0.6B", 
            quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float16),
            device_map="auto",
            trust_remote_code=True
        )
        qwen3_ft_model = PeftModel.from_pretrained(qwen3_ft_base_model, qwen3_adapter_path, is_trainable=False)
        qwen3_ft_model.eval()
        
        qwen3_ft_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B", trust_remote_code=True)
        if qwen3_ft_tokenizer.pad_token is None:
            qwen3_ft_tokenizer.pad_token = qwen3_ft_tokenizer.eos_token

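        qwen3_ft_pipeline = pipeline(
            "text-generation",
            model=qwen3_ft_model,
            tokenizer=qwen3_ft_tokenizer,
            # No `device` argument here: the model was already placed by device_map="auto",
            # and pipelines can reject or ignore `device` for accelerate-dispatched models.
            trust_remote_code=True # May not be needed if model object is passed
        )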
        print("Fine-tuned Qwen3-0.6B pipeline created.")
        
        # Use the same wrapper and evaluation DataFrame structure as for the base model
        summarization_model_qwen3_ft = SummarizationPipelineWrapper(qwen3_ft_pipeline)
        eval_df_for_gen_ft = eval_df.copy()
        eval_df_for_gen_ft["generation_prompt"] = eval_df_for_gen_ft["prompt_text"].apply(lambda x: f"Summarize the following text:\n\n{x}\n\nSummary:")

        with mlflow.start_run(run_name=run_name_qwen3_ft) as run:
            mlflow.log_param("model_name", model_name_qwen3_ft)
            mlflow.log_param("adapter_path", qwen3_adapter_path)
            mlflow.log_param("evaluation_task", "text-summarization")
            mlflow.log_param("dataset_name", f"{dataset_name}/{dataset_config_name}")
            mlflow.log_param("num_eval_samples", num_eval_samples)
            mlflow.set_tag("model_type", "fine_tuned_llm")

            print(f"Starting mlflow.evaluate for {model_name_qwen3_ft}...")
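            results_ft = mlflow.evaluate(
                # Pass the bound predict method: mlflow.evaluate accepts a model URI,
                # a PyFuncModel, or a plain Python function, not an arbitrary object.
                model=summarization_model_qwen3_ft.predict,
                data=eval_df_for_gen_ft,
                targets="reference_summary",
                model_type="text-summarization", # Enables the default summarization metrics (e.g. ROUGE)
            )
            print(f"Fine-tuned Qwen3-0.6B evaluation results:\n{results_ft.metrics}")
            # The per-sample table is registered under the key "eval_results_table"
            if results_ft.artifacts and "eval_results_table" in results_ft.artifacts:
                print(f"Evaluation table artifact path: {results_ft.artifacts['eval_results_table'].uri}")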

    except Exception as e:
        print(f"Error evaluating fine-tuned Qwen3-0.6B: {e}")
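    finally:
        # Guard each deletion: a bare `del` raises NameError if loading failed before the name was bound
        if 'qwen3_ft_pipeline' in locals(): del qwen3_ft_pipeline
        if 'qwen3_ft_model' in locals(): del qwen3_ft_model
        if 'qwen3_ft_base_model' in locals(): del qwen3_ft_base_model
        if 'summarization_model_qwen3_ft' in locals(): del summarization_model_qwen3_ft
        clear_gpu_cache()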

**Note:** The Qwen3 model fine-tuned on recipes might perform differently (potentially worse) on general summarization compared to the base model. This evaluation helps quantify that effect.

### Evaluating Model 3: `google/flan-t5-small`

In [None]:
clear_gpu_cache()
model_name_flan_t5 = "google/flan-t5-small"
run_name_flan_t5 = "Eval_Flan_T5_Small_Summarization"

try:
    print(f"Loading Flan-T5 model for evaluation: {model_name_flan_t5}")
    # Flan-T5 is a Seq2Seq model, suitable for text2text-generation pipeline
    flan_t5_pipeline = pipeline(
        "text2text-generation", 
        model=model_name_flan_t5,
        tokenizer=model_name_flan_t5, # Tokenizer can be specified by name
        device=0 if torch.cuda.is_available() else -1,
        # No quantization for Flan-T5 small, it's already tiny
    )
    print("Flan-T5-small pipeline created.")

    with mlflow.start_run(run_name=run_name_flan_t5) as run:
        mlflow.log_param("model_name", model_name_flan_t5)
        mlflow.log_param("evaluation_task", "text-summarization")
        mlflow.log_param("dataset_name", f"{dataset_name}/{dataset_config_name}")
        mlflow.log_param("num_eval_samples", num_eval_samples)
        mlflow.set_tag("model_type", "instruction_tuned_baseline")

        # Flan-T5 is good at following instructions, so a direct prompt is fine.
        # The text2text-generation pipeline directly outputs the generated text.
        class FlanT5SummarizationWrapper:
            def __init__(self, pipeline_obj):
                self.pipeline = pipeline_obj
            def predict(self, X):
                prompts = X["prompt_text"].apply(lambda x: f"Summarize: {x}").tolist()
                outputs = self.pipeline(prompts, max_length=100, num_return_sequences=1, do_sample=False)
                # Output is list of dicts [{'generated_text': '...'}]
                return pd.Series([out['generated_text'] for out in outputs])

        summarization_model_flan_t5 = FlanT5SummarizationWrapper(flan_t5_pipeline)
        eval_df_for_flan = eval_df.copy() # Input column will be 'prompt_text' for wrapper

        print(f"Starting mlflow.evaluate for {model_name_flan_t5}...")
        results_flan_t5 = mlflow.evaluate(
            # Pass the bound predict method: mlflow.evaluate accepts a model URI,
            # a PyFuncModel, or a plain Python function, not an arbitrary object.
            model=summarization_model_flan_t5.predict,
            data=eval_df_for_flan,
            targets="reference_summary",
            model_type="text-summarization", # Enables the default summarization metrics (e.g. ROUGE)
        )
        print(f"Flan-T5-small evaluation results:\n{results_flan_t5.metrics}")
        # The per-sample table is registered under the key "eval_results_table"
        if results_flan_t5.artifacts and "eval_results_table" in results_flan_t5.artifacts:
             print(f"Evaluation table artifact path: {results_flan_t5.artifacts['eval_results_table'].uri}")

except Exception as e:
    print(f"Error evaluating Flan-T5-small: {e}")
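finally:
    # Guard each deletion: a bare `del` raises NameError if loading failed before the name was bound
    if 'flan_t5_pipeline' in locals(): del flan_t5_pipeline
    if 'summarization_model_flan_t5' in locals(): del summarization_model_flan_t5
    clear_gpu_cache()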

---

## 7. Comparing Model Performance in MLflow UI

Now, the power of MLflow comes into play! Open the MLflow UI (`mlflow ui` from the directory containing `mlruns`).

1.  Navigate to the `LLM_Evaluation_Summarization_Benchmark` experiment.
2.  You should see three runs (or more, if you re-ran parts):
    - `Eval_Base_Qwen3_0.6B_Summarization`
    - `Eval_FineTuned_Qwen3_0.6B_Summarization`
    - `Eval_Flan_T5_Small_Summarization`
3.  **Select all these runs** by checking the boxes next to them.
4.  Click the **"Compare"** button.

**In the Comparison View:**
- **Parameters:** You can see the `model_name`, `dataset_name`, etc., for each run.
- **Metrics:** This is where it gets interesting! 
    - `mlflow.evaluate` logs the metrics it computes for the chosen `model_type`. For `"text-summarization"`, these are ROUGE variants plus text-quality scores such as toxicity and readability grades, provided their optional dependencies are installed. The exact names depend on the MLflow version and typically look like `rouge1/v1/mean` or `rougeL/v1/mean`.
    - You can sort the table by any metric (e.g., sort by the ROUGE-L column, descending) to see which model performed best on that specific metric.
- **Artifacts:** For each run, `mlflow.evaluate` saves an `eval_results_table.json` (and often a `.html` version) in the artifacts. This table contains the input prompts, generated outputs, reference targets, and per-sample metric scores. This is invaluable for qualitative analysis and error inspection.

![MLFlow UI](https://blog.min.io/content/images/2025/03/Screenshot-2025-03-10-at-3.30.33-PM.png)

This comparison view effectively creates a **leaderboard** for your evaluated models on this specific task and dataset, all managed and visualized by MLflow.
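
The same ranking can be produced programmatically. The sketch below works on plain dictionaries so that it runs anywhere; in practice the run data could be pulled with `mlflow.search_runs`, which returns a DataFrame with one `metrics.<name>` column per metric. `build_leaderboard` is a hypothetical helper, and the scores are placeholders for illustration only, not measured results.

```python
def build_leaderboard(runs, metric, higher_is_better=True):
    """Rank runs by one metric; runs that lack the metric are listed last."""
    scored = [r for r in runs if metric in r["metrics"]]
    missing = [r for r in runs if metric not in r["metrics"]]
    scored.sort(key=lambda r: r["metrics"][metric], reverse=higher_is_better)
    return [(r["run_name"], r["metrics"].get(metric)) for r in scored + missing]


# Placeholder values for illustration only; they are NOT measured results.
example_runs = [
    {"run_name": "run_a", "metrics": {"rougeL": 0.10}},
    {"run_name": "run_b", "metrics": {"rougeL": 0.30}},
    {"run_name": "run_c", "metrics": {}},  # e.g. a run whose evaluation failed
]

leaderboard = build_leaderboard(example_runs, "rougeL")
for rank, (name, score) in enumerate(leaderboard, start=1):
    print(rank, name, score)
```

Keeping runs with missing metrics at the bottom, instead of dropping them, makes failed evaluations visible in the leaderboard.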

---

## 8. Brief: Advanced Evaluation Concepts & Tools

While `mlflow.evaluate` and standard metrics give a good starting point, LLM evaluation is a rapidly evolving field. Some advanced concepts and tools include:

- **Human Evaluation:** Still the gold standard for nuanced aspects like coherence, creativity, helpfulness. Platforms exist to manage human annotation workflows.
- **LLM-as-a-Judge:** Using a powerful LLM (like GPT-4) to evaluate the output of another LLM based on predefined criteria or a rubric. MLflow supports this through the judge-based metrics in `mlflow.metrics.genai` (for example, `answer_similarity`), which are passed to `mlflow.evaluate` via `extra_metrics`.
- **Task-Specific Benchmarks:** For specific tasks, dedicated benchmarks exist (e.g., HumanEval for code generation, MMLU for broad knowledge, BigBench for challenging reasoning tasks).
- **Specialized Evaluation Frameworks:**
    - **Ragas:** For evaluating RAG pipelines (retrieval and generation quality).
    - **TruLens:** Focuses on explainability and tracking quality for LLM apps.
    - **DeepEval:** Offers a suite of metrics for in-depth LLM evaluation.
    - **EleutherAI LM Evaluation Harness:** A comprehensive framework for running many standard academic benchmarks.
- **Ethical AI & Responsible AI Metrics:** Evaluating for bias, fairness, toxicity, robustness against adversarial attacks. These are critical for production systems.

MLflow can often be integrated with these tools to store their results, providing a central dashboard for all your evaluation efforts.
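
As a rough illustration of the LLM-as-a-judge idea from the list above, the sketch below builds a rubric prompt and parses a numeric score from the judge's reply. Everything here is an assumption made for illustration: the rubric text, the `Score: <1-5>` reply format, and the `call_llm` callable (any function that takes a prompt string and returns the judge's reply) are hypothetical and do not reflect a specific library's API.

```python
import re

# Hypothetical rubric; a real one would be tailored to the task.
RUBRIC = "Rate the summary from 1 (poor) to 5 (excellent) for faithfulness to the source."


def build_judge_prompt(source, summary):
    return (
        f"{RUBRIC}\n\nSource:\n{source}\n\nSummary:\n{summary}\n\n"
        "Answer with 'Score: <1-5>'."
    )


def parse_score(judge_reply):
    """Extract an integer score in [1, 5], or None if the reply is malformed."""
    match = re.search(r"Score:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None


def judge(source, summary, call_llm):
    """`call_llm` is a caller-supplied function: prompt string -> reply string."""
    return parse_score(call_llm(build_judge_prompt(source, summary)))


# Stub judge so the sketch runs offline; a real setup would call an LLM API here.
def fake_llm(prompt):
    return "Score: 4. The summary is mostly faithful."


demo_score = judge("A long Reddit post ...", "A short summary.", fake_llm)
print(demo_score)
```

Returning `None` for malformed replies lets the caller count and inspect judge failures separately instead of silently treating them as low scores.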

---

## 9. Key Takeaways

In this notebook, we've learned how to systematically evaluate and benchmark LLMs:

- **Structured Evaluation:** Understood the importance of a consistent process for evaluating LLMs on specific tasks and datasets.
- **`mlflow.evaluate()` for LLMs:** Leveraged MLflow's evaluation API to assess models on text summarization, computing the default metrics for the chosen `model_type`.
- **Metric-Driven Comparison:** Used reference-based metrics such as ROUGE to quantify differences in model performance, and discussed alternatives like BERTScore and perplexity.
- **Comparative Benchmarking:** Compared a base pre-trained model (`Qwen3-0.6B`), a domain-fine-tuned version (our Qwen3 recipe bot), and an instruction-tuned baseline (`Flan-T5-small`).
- **MLflow for Centralized Results:** Used MLflow to log all evaluation parameters, metrics, and qualitative artifacts (like per-sample predictions), enabling easy comparison and leaderboard creation via the UI.
- **Qualitative Insights:** Recognized the value of inspecting generated outputs (available in MLflow artifacts) alongside quantitative metrics.

This systematic evaluation approach is vital for iterating on LLM development, whether you're choosing a base model, assessing fine-tuning impact, or comparing different prompting strategies.

---

## 10. Engaging Resources and Further Reading

To deepen your understanding of LLM evaluation:

- **MLflow Documentation:**
    - [MLflow LLM Evaluate](https://mlflow.org/docs/latest/llms/llm-evaluate/index.html)
    - [Built-in Metrics for LLM Evaluate](https://mlflow.org/docs/latest/llms/llm-evaluate/metrics.html)
- **Hugging Face Evaluate Library:**
    - [Hugging Face `evaluate` Documentation](https://huggingface.co/docs/evaluate/index)
    - [List of Available Metrics](https://huggingface.co/spaces/evaluate-metric)
- **Key Evaluation Papers & Concepts:**
    - [ROUGE Paper (Lin, 2004)](https://aclanthology.org/W04-1013/)
    - [BERTScore Paper (Zhang et al., 2019)](https://arxiv.org/abs/1904.09675)
    - [Judging LLM-as-a-judge with MT-Bench and Chatbot Arena (Zheng et al., 2023 - for LLM-as-judge concepts)](https://arxiv.org/abs/2306.05685)
- **LLM Evaluation Leaderboards & Platforms:**
    - [Hugging Face Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
    - [Chatbot Arena (LMSys)](https://chat.lmsys.org/)

--- 

Fantastic work on completing this comprehensive LLM evaluation notebook! You're now well-equipped to assess and compare different language models using robust methodologies and MLflow.

**Coming Up Next (Notebook 7):** We'll shift gears to building more dynamic and interactive AI systems by exploring Tool-Calling Agents with LangGraph, Ollama, and, of course, tracking it all with MLflow.

![Keep Learning](https://memento.epfl.ch/image/23136/1440x810.jpg)