## Evaluate a 🤗 Hugging Face LLM with mlflow.evaluate()

This guide shows how to load a pre-trained Hugging Face pipeline, log it to MLflow, and use `mlflow.evaluate()` to compute built-in metrics as well as custom LLM-judged metrics for the model.

For detailed information, please read the documentation on [using MLflow evaluate](https://mlflow.org/docs/latest/llms/llm-evaluate/index.html).

### Start MLflow Server

You can either:

- Start a local tracking server by running `mlflow ui` within the same directory that your notebook is in.
- Use a tracking server, as described in [this overview](https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html).
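
Whichever option you pick, the notebook needs to know where the server lives. The sketch below is one way to do that, assuming the local `mlflow ui` server listens on its default address `http://127.0.0.1:5000` and that a remote server, if any, is named by the `MLFLOW_TRACKING_URI` environment variable. The helper names `resolve_tracking_uri` and `configure_tracking` are illustrative and not part of MLflow.

```python
import os

# Assumed default address of a local `mlflow ui` server; adjust if you
# started it with a different --host or --port.
DEFAULT_TRACKING_URI = "http://127.0.0.1:5000"


def resolve_tracking_uri(default=DEFAULT_TRACKING_URI):
    """Return the URI from MLFLOW_TRACKING_URI if set, else the local default."""
    return os.environ.get("MLFLOW_TRACKING_URI", default)


def configure_tracking():
    """Point the MLflow client at the resolved tracking server."""
    import mlflow  # imported lazily so the resolver above works on its own

    uri = resolve_tracking_uri()
    mlflow.set_tracking_uri(uri)
    return uri
```

Calling `configure_tracking()` once, before the first `mlflow.start_run()`, should be enough for the rest of the notebook.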

### Install necessary dependencies

In [2]:
%pip install -q mlflow transformers torch torchvision evaluate datasets openai tiktoken fastapi rouge_score textstat


Note: you may need to restart the kernel to use updated packages.


In [3]:
# Necessary imports

import os

import mlflow
import pandas as pd

from transformers import pipeline
from datasets import load_dataset

from mlflow.metrics.genai import EvaluationExample, make_genai_metric, answer_correctness




In [4]:
import warnings

# Disable FutureWarnings 
warnings.filterwarnings("ignore", category=FutureWarning)

### Load a pretrained Hugging Face pipeline

Here we are loading a text generation pipeline, but you can also use either a text summarization or question answering pipeline.

In [5]:
mpt_pipeline = pipeline("text-generation", model="mosaicml/mpt-7b-chat")


### Log the model to MLflow

In [None]:
mlflow.set_experiment("Evaluate Hugging Face Summarizer")

with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=mpt_pipeline,
        artifact_path="mpt-7b",
        input_example="What are the three primary colors?",
        registered_model_name="mpt-7b-chat",
    )

### Load Evaluation Data

Load a dataset from the Hugging Face Hub to use for evaluation.

In [None]:
dataset = load_dataset("tatsu-lab/alpaca")
eval_df = pd.DataFrame(dataset["train"])
eval_df.head(10)
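
The alpaca dataset has `instruction`, `input`, `output`, and `text` columns, and some rows carry extra context in `input` that the instruction depends on. If only the `instruction` column is sent to the model, it may help to evaluate on rows whose `input` is empty. The sketch below illustrates that filter on a few hand-written rows (not real dataset rows); the helper name `standalone_instructions` is hypothetical.

```python
import pandas as pd

# Hand-written rows that mimic the alpaca column layout (illustrative only).
sample_df = pd.DataFrame(
    [
        {"instruction": "Name a primary color.", "input": "", "output": "Red."},
        {
            "instruction": "Summarize the text.",
            "input": "MLflow tracks experiments.",
            "output": "MLflow is a tracking tool.",
        },
    ]
)


def standalone_instructions(df, n=10):
    """Keep the first n rows whose instruction needs no extra input context."""
    has_no_context = df["input"].fillna("").str.strip() == ""
    return df[has_no_context].head(n).reset_index(drop=True)


subset = standalone_instructions(sample_df)
print(subset[["instruction", "output"]])
```

The same function could be applied to `eval_df` in place of a plain `head(10)`.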

### Define Extra Metrics

Create a custom LLM-judged metric named `answer_quality` using `make_genai_metric()`. We need to provide a metric definition and a grading rubric, as well as some graded examples for the LLM judge to use as references.

In [None]:
answer_quality_definition = """Please evaluate answer quality for the provided output on the following criteria: fluency, clarity, and conciseness. Each of the criteria is defined as follows:
  - Fluency measures how naturally and smoothly the output reads.
  - Clarity measures how understandable the output is.
  - Conciseness measures the brevity and efficiency of the output without compromising meaning.
The more fluent, clear, and concise a text, the higher the score it deserves.
"""

answer_quality_rubric = """Answer quality: Below are the details for different scores:
  - Score 1: The output is entirely incomprehensible and cannot be read.
  - Score 2: The output conveys some meaning, but needs a lot of improvement in fluency, clarity, and conciseness.
  - Score 3: The output is understandable but still needs improvement.
  - Score 4: The output performs well on two of fluency, clarity, and conciseness, but could be improved on one of these criteria.
  - Score 5: The output reads smoothly, is easy to understand, and is concise. There is no clear way to improve the output on these criteria."""

example1 = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform. For managing machine learning workflows, it including experiment tracking model packaging versioning and deployment as well as a platform simplifying for on the ML lifecycle.",
    score=2,
    justification="The output is difficult to understand and demonstrates extremely low clarity. However, it still conveys some meaning so this output deserves a score of 2.",
)

example2 = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine learning workflows, including experiment tracking, model packaging, versioning, and deployment.",
    score=5,
    justification="The output is easily understandable, clear, and concise. It deserves a score of 5.",
)

answer_quality_metric = make_genai_metric(
    name="answer_quality",
    definition=answer_quality_definition,
    grading_prompt=answer_quality_rubric,
    version="v1",
    examples=[example1, example2],
    model="openai:/gpt-4",
    greater_is_better=True,
)
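
To build intuition for why the definition, rubric, and examples are all needed, the sketch below assembles them into a single prompt for a judge model. This is not MLflow's internal prompt template: `RubricExample` and `build_judge_prompt` are hypothetical names, and the layout is only one plausible arrangement.

```python
from dataclasses import dataclass


@dataclass
class RubricExample:
    """Hypothetical stand-in for a graded example (not MLflow's class)."""

    input: str
    output: str
    score: int
    justification: str


def build_judge_prompt(definition, rubric, examples, candidate_input, candidate_output):
    """Concatenate the metric definition, rubric, graded examples, and the
    candidate answer into one text prompt that ends where the score goes."""
    parts = ["Metric definition:", definition.strip(), ""]
    parts += ["Grading rubric:", rubric.strip(), ""]
    for i, ex in enumerate(examples, start=1):
        parts += [
            f"Example {i}:",
            f"Input: {ex.input}",
            f"Output: {ex.output}",
            f"Score: {ex.score}",
            f"Justification: {ex.justification}",
            "",
        ]
    parts += [
        "Now grade the following:",
        f"Input: {candidate_input}",
        f"Output: {candidate_output}",
        "Score:",
    ]
    return "\n".join(parts)


prompt = build_judge_prompt(
    definition="Fluency, clarity, and conciseness of the output.",
    rubric="Score 1 is incomprehensible; score 5 cannot be improved.",
    examples=[
        RubricExample(
            input="What is MLflow?",
            output="MLflow is an open-source platform for managing ML workflows.",
            score=5,
            justification="Clear and concise.",
        )
    ],
    candidate_input="What are the three primary colors?",
    candidate_output="Red, yellow, and blue.",
)
print(prompt)
```

Graded examples at both ends of the scale give the judge concrete anchors, which is why the metric above includes one low-scoring and one high-scoring example.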

We can also load one of the predefined metrics - in this case we are using `answer_correctness` with GPT-4.

In [None]:
answer_correctness_metric = answer_correctness(model="openai:/gpt-4")

### Evaluate

We need to set our OpenAI API key, since we are using GPT-4 for our LLM-judged metrics.

In [None]:
os.environ["OPENAI_API_KEY"] = "redacted"
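
Hardcoding a key in a notebook makes it easy to leak. As an alternative, the sketch below reuses a key that is already present in the environment and only prompts for one when it is missing. The helper name `ensure_api_key` is hypothetical; it relies only on the standard library.

```python
import os
from getpass import getpass


def ensure_api_key(var_name="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key stored in var_name, prompting for it only if it is unset.

    prompt_fn is injectable so the helper can be exercised without a terminal.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        key = prompt_fn(f"Enter {var_name}: ").strip()
        if not key:
            raise ValueError(f"{var_name} is required for the LLM-judged metrics")
        # Store it for the current process only; nothing is written to disk.
        os.environ[var_name] = key
    return key
```

Running `ensure_api_key()` in a cell before the evaluation keeps the key out of the saved notebook.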

Call `mlflow.evaluate()` on the first 10 rows of the data. With `model_type="question-answering"`, MLflow computes exact match, toxicity, and readability scores as built-in metrics. We also pass the two metrics defined above to the `extra_metrics` parameter so that they are evaluated as well.

In [None]:
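with mlflow.start_run():
    results = mlflow.evaluate(
        model_info.model_uri,
        eval_df.head(10),
        model_type="question-answering",
        # The alpaca dataset stores its reference answers in the "output" column
        targets="output",
        extra_metrics=[answer_correctness_metric, answer_quality_metric],
        # Map the metrics' "inputs" field to the dataset's "instruction" column
        evaluator_config={"col_mapping": {"inputs": "instruction"}},
    )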

### View results

`results.metrics` is a dictionary with the aggregate values for all the metrics calculated.

In [None]:
results.metrics
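
Because the aggregate metrics come back as a plain dictionary, they are easy to post-process. The sketch below pulls out only the mean values, assuming keys follow a `name/version/aggregation` pattern such as `answer_quality/v1/mean`. Both the key format and the numbers are illustrative assumptions, not output from an actual run.

```python
# Made-up values standing in for `results.metrics`; not real results.
metrics = {
    "toxicity/v1/mean": 0.01,
    "answer_quality/v1/mean": 3.5,
    "answer_quality/v1/variance": 0.8,
    "answer_correctness/v1/mean": 2.9,
}


def mean_scores(metrics):
    """Map each metric name to its mean, dropping other aggregations."""
    return {
        key.split("/")[0]: value
        for key, value in metrics.items()
        if key.endswith("/mean")
    }


summary = mean_scores(metrics)
for name, value in sorted(summary.items()):
    print(f"{name}: {value:.2f}")
```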

We can also view the `eval_results_table`, which shows us the metrics for each row of data.

In [None]:
results.tables["eval_results_table"]
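
The per-row table is a convenient place to look for weak answers. The sketch below filters a hand-made table for rows whose judged score falls below a threshold. The column names (`answer_quality/v1/score`, `answer_quality/v1/justification`) and all values are assumptions for illustration; check the columns of your own table before reusing it.

```python
import pandas as pd

# Hand-made stand-in for the per-row results table (illustrative values).
table = pd.DataFrame(
    {
        "instruction": [
            "Name a primary color.",
            "What is MLflow?",
            "Give one tip for staying healthy.",
        ],
        "outputs": ["Red.", "platform a is it ML for", "Drink enough water."],
        "answer_quality/v1/score": [5, 2, 4],
        "answer_quality/v1/justification": [
            "Clear and concise.",
            "Hard to read.",
            "Clear, slightly terse.",
        ],
    }
)


def low_scoring_rows(df, score_column, threshold):
    """Return rows scoring below threshold, worst first."""
    return df[df[score_column] < threshold].sort_values(score_column)


flagged = low_scoring_rows(table, "answer_quality/v1/score", threshold=3)
print(flagged[["instruction", "answer_quality/v1/score"]])
```

Reading the justification column for the flagged rows is usually the quickest way to see why the judge marked an answer down.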

Finally, we can view our evaluation results in the MLflow UI under the Evaluation tab. Here, we can choose which columns to group by and a column to compare on.

![](https://i.imgur.com/uDmh4M0.png)