Copyrights Psitron Technologies

# MLflow LLM Evaluation
With the emergence of ChatGPT, LLMs have shown their power in text generation across various fields, such as question answering, translation, and text summarization. Evaluating an LLM’s performance is slightly different from evaluating a traditional ML model, because very often there is no single ground truth to compare against. MLflow provides an API, ```mlflow.evaluate()```, to help evaluate your LLMs.

MLflow’s LLM evaluation functionality consists of 3 main components:

1. **A model to evaluate**: an MLflow pyfunc model, a URI pointing to a registered MLflow model, or any Python callable that represents your model, e.g., a HuggingFace text summarization pipeline.

2. **Metrics**: the metrics to compute. LLM evaluation uses LLM-specific metrics.

3. **Evaluation data**: the data your model is evaluated on. It can be a pandas DataFrame, a Python list, a NumPy array, or an ```mlflow.data.dataset.Dataset()``` instance.



## Quickstart


In [None]:
import os
os.environ["ANTHROPIC_API_KEY"] = "your-api-key"  # Replace with your actual key

In [None]:
!pip install mlflow pandas datasets transformers anthropic pyngrok tiktoken

Below is a **simple example** that gives a quick overview of how to evaluate LLMs with MLflow. It builds a simple question-answering evaluation pipeline on Anthropic's Claude API: it sends each question to Claude, scores each response with a custom metric, and logs the average matching accuracy to MLflow. You can paste it into IPython or a local editor and run it, installing any missing dependencies as prompted. Running the code requires an Anthropic API key; if you don’t have one, you can set it up by following the Anthropic guide.

In [None]:
import mlflow
import anthropic
import os
import pandas as pd
import time
import string  # Import string module for punctuation removal

# 1. Set your Anthropic API key
# os.environ["ANTHROPIC_API_KEY"] = "your-api-key"  # Replace!

# 2. Define your evaluation data
eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) "
            "lifecycle. It was developed by Databricks, a company that specializes in big data and "
            "machine learning solutions. MLflow is designed to address the challenges that data "
            "scientists and machine learning engineers face when developing, training, and deploying "
            "machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data "
            "processing and analytics. It was developed in response to limitations of the Hadoop "
            "MapReduce computing model, offering improvements in speed and ease of use. Spark "
            "provides libraries for various tasks such as data ingestion, processing, and analysis "
            "through its components like Spark SQL for structured data, Spark Streaming for "
            "real-time data processing, and MLlib for machine learning tasks",
        ],
    }
)

# 3. Define a function to interact with the Claude API
def anthropic_model(prompt, system_prompt):
    client = anthropic.Anthropic()
    try:
        message = client.messages.create(
            model="claude-3-opus-20240229",  # Replace with an available Claude model for you
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": prompt}],
        )
        return message.content[0].text
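    except anthropic.RateLimitError as e:
        # Catch this before APIStatusError: RateLimitError is a subclass of it,
        # so the broader handler would otherwise swallow rate-limit errors.
        print("Rate limit exceeded. Waiting and retrying...")
        time.sleep(60)
        return anthropic_model(prompt, system_prompt)  # Retry
    except anthropic.APIStatusError as e:
        print(f"Anthropic API Error: {e}")
        return None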
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None

# 4. Modify the evaluation loop to use the Claude API
with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"

    # Create a new column with the Claude responses
    responses = []
    for i in range(len(eval_data)):
        response = anthropic_model(eval_data["inputs"][i], system_prompt)
        responses.append(response)
        print(f"Claude Response: {response}") # Print for debugging

    eval_data["Claude_response"] = responses

    # 5. Evaluate the model (using custom evaluation logic if needed)
    def simple_metric(row):
        claude_response = row["Claude_response"]
        if claude_response is None:  # Handle None values gracefully
            return 0.0

        ground_truth = row["ground_truth"].lower()
        claude_response = claude_response.lower()

        # Remove punctuation from both strings
        ground_truth = ground_truth.translate(str.maketrans('', '', string.punctuation))
        claude_response = claude_response.translate(str.maketrans('', '', string.punctuation))

        ground_truth_words = ground_truth.split()
        claude_words = claude_response.split()

        # Check if *any* of the ground truth words are in the Claude response
        if any(word in claude_words for word in ground_truth_words):
            return 1.0
        else:
            return 0.0

    eval_data["metric"] = eval_data.apply(simple_metric, axis=1)  # Apply the metric to each row
    average_metric = eval_data["metric"].mean()

    # Log the metric to MLflow
    mlflow.log_metric("simple_matching_accuracy", average_metric)

    print(f"Simple Matching Accuracy: {average_metric}")
    print(eval_data)
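The `simple_metric` above returns 1.0 as soon as a single word is shared, so common words such as "is" let almost every row pass. As a minimal sketch of a stricter heuristic, the function below reports the fraction of distinct ground-truth words that appear in the response. The names `token_recall` and `_tokens` are hypothetical helpers for illustration, not MLflow APIs.

```python
import string
from typing import List, Optional


def _tokens(text: str) -> List[str]:
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def token_recall(ground_truth: str, response: Optional[str]) -> float:
    """Fraction of distinct ground-truth tokens that also appear in the response."""
    if response is None:  # A failed API call scores zero
        return 0.0
    truth = set(_tokens(ground_truth))
    if not truth:
        return 0.0
    return len(truth & set(_tokens(response))) / len(truth)


# Three of the five distinct ground-truth tokens appear in the response.
print(token_recall("MLflow is an open-source platform.", "MLflow is a platform"))
```

It can be applied row by row in the same way as `simple_metric`, for example with `eval_data.apply(lambda r: token_recall(r["ground_truth"], r["Claude_response"]), axis=1)`.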

## LLM Evaluation Metrics
There are two types of LLM evaluation metrics in MLflow:
1. **Heuristic-based metrics**: These metrics calculate a score for each data record (a row, in pandas/Spark DataFrame terms) using a fixed function, such as ROUGE (```rougeL()```), Flesch-Kincaid (```flesch_kincaid_grade_level()```), or Bilingual Evaluation Understudy, BLEU (```bleu()```). These metrics are similar to traditional continuous-value metrics. For the list of built-in heuristic metrics and how to define a custom metric with your own function, see the Heuristic-based Metrics section.

2. **LLM-as-a-Judge metrics**: LLM-as-a-Judge is a new type of metric that uses LLMs to score the quality of model outputs. It overcomes the limitations of heuristic-based metrics, which often miss nuances like context and semantic accuracy. LLM-as-a-Judge metrics provide a more human-like evaluation for complex language tasks while being more scalable and cost-effective than human evaluation. MLflow provides various built-in LLM-as-a-Judge metrics and supports creating custom metrics with your own prompt, grading criteria, and reference examples. See the LLM-as-a-Judge Metrics section for more details.





### Heuristic-based Metrics



In [None]:
!pip install evaluate textstat

MLflow LLM evaluation includes default collections of metrics for pre-selected tasks, e.g., “question-answering”. **Depending on the LLM use case that you are evaluating, these pre-defined collections can greatly simplify the process of running evaluations.** **To use the default metrics for a pre-selected task, specify the model_type argument in ```mlflow.evaluate()```.** The example below instead calls the Claude API directly and computes three of the question-answering defaults by hand: toxicity, ARI grade level, and Flesch-Kincaid grade level.

In [None]:
import mlflow
import anthropic
import os
import pandas as pd
import time
import string
import evaluate  # For toxicity
import textstat # For readability

# 1. Set your Anthropic API key
# os.environ["ANTHROPIC_API_KEY"] = "your-api-key"  # Replace!

# 2. Define your evaluation data
eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) "
            "lifecycle. It was developed by Databricks, a company that specializes in big data and "
            "machine learning solutions. MLflow is designed to address the challenges that data "
            "scientists and machine learning engineers face when developing, training, and deploying "
            "machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data "
            "processing and analytics. It was developed in response to limitations of the Hadoop "
            "MapReduce computing model, offering improvements in speed and ease of use. Spark "
            "provides libraries for various tasks such as data ingestion, processing, and analysis "
            "through its components like Spark SQL for structured data, Spark Streaming for "
            "real-time data processing, and MLlib for machine learning tasks",
        ],
    }
)

# 3. Define a function to interact with the Claude API
def anthropic_model(prompt, system_prompt):
    client = anthropic.Anthropic()
    try:
        message = client.messages.create(
            model="claude-3-opus-20240229",  # Replace with an available Claude model for you
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": prompt}],
        )
        return message.content[0].text
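    except anthropic.RateLimitError as e:
        # Catch this before APIStatusError: RateLimitError is a subclass of it,
        # so the broader handler would otherwise swallow rate-limit errors.
        print("Rate limit exceeded. Waiting and retrying...")
        time.sleep(60)
        return anthropic_model(prompt, system_prompt)  # Retry
    except anthropic.APIStatusError as e:
        print(f"Anthropic API Error: {e}")
        return None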
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None

# 4. Initialize the toxicity metric and readability metrics
toxicity_metric = evaluate.load("toxicity", module_type="measurement")


# 5. Modify the evaluation loop to use the Claude API and calculate metrics
with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"

    # Create a new column with the Claude responses
    responses = []
    for i in range(len(eval_data)):
        response = anthropic_model(eval_data["inputs"][i], system_prompt)
        responses.append(response)
        print(f"Claude Response: {response}") # Print for debugging

    eval_data["Claude_response"] = responses

    # 6. Evaluate the model (using custom evaluation logic and external metrics)
    toxicity_scores = []
    ari_scores = []
    flesch_scores = []

    for response in eval_data["Claude_response"]:
        if response is None:
            toxicity_scores.append(None)
            ari_scores.append(None)
            flesch_scores.append(None)
            continue # Skip evaluation if the response is None

        # Calculate toxicity
        toxicity_result = toxicity_metric.compute(predictions=[response])
        toxicity_score = toxicity_result["toxicity"][0] if toxicity_result["toxicity"] else None  # Handle empty toxicity scores

        toxicity_scores.append(toxicity_score)

        # Calculate readability scores
        ari_score = textstat.automated_readability_index(response)
        flesch_score = textstat.flesch_kincaid_grade(response)
        ari_scores.append(ari_score)
        flesch_scores.append(flesch_score)


    eval_data["toxicity"] = toxicity_scores
    eval_data["ari_grade_level"] = ari_scores
    eval_data["flesch_kincaid_grade_level"] = flesch_scores

    # 7. Log metrics to MLflow
    # Handle None values when calculating the mean
    avg_toxicity = eval_data["toxicity"].dropna().mean() if eval_data["toxicity"].notna().any() else None
    avg_ari = eval_data["ari_grade_level"].dropna().mean() if eval_data["ari_grade_level"].notna().any() else None
    avg_flesch = eval_data["flesch_kincaid_grade_level"].dropna().mean() if eval_data["flesch_kincaid_grade_level"].notna().any() else None


    if avg_toxicity is not None:
        mlflow.log_metric("avg_toxicity", avg_toxicity)
    if avg_ari is not None:
        mlflow.log_metric("avg_ari_grade_level", avg_ari)
    if avg_flesch is not None:
        mlflow.log_metric("avg_flesch_kincaid_grade_level", avg_flesch)

    print(f"Average Toxicity: {avg_toxicity}")
    print(f"Average ARI Grade Level: {avg_ari}")
    print(f"Average Flesch-Kincaid Grade Level: {avg_flesch}")

    print(eval_data)

**Use a Custom List of Metrics**

Using the pre-defined metrics associated with a given model type is not the only way to generate scoring metrics for LLM evaluation in MLflow. You can also specify a custom list of metrics in the extra_metrics argument of ```mlflow.evaluate()```.

To add metrics to the default list for a pre-defined model type, keep the model_type and add your metrics to extra_metrics. The example below follows the same idea by hand: it records per-request latency alongside the toxicity and readability scores.

In [None]:
import mlflow
import anthropic
import os
import pandas as pd
import time
import string
import evaluate  # For toxicity
import textstat # For readability
import numpy as np

# 1. Set your Anthropic API key
# os.environ["ANTHROPIC_API_KEY"] = "your-api-key"  # Replace!

# 2. Define your evaluation data
eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) "
            "lifecycle. It was developed by Databricks, a company that specializes in big data and "
            "machine learning solutions. MLflow is designed to address the challenges that data "
            "scientists and machine learning engineers face when developing, training, and deploying "
            "machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data "
            "processing and analytics. It was developed in response to limitations of the Hadoop "
            "MapReduce computing model, offering improvements in speed and ease of use. Spark "
            "provides libraries for various tasks such as data ingestion, processing, and analysis "
            "through its components like Spark SQL for structured data, Spark Streaming for "
            "real-time data processing, and MLlib for machine learning tasks",
        ],
    }
)

# 3. Define a function to interact with the Claude API
def anthropic_model(prompt, system_prompt):
    client = anthropic.Anthropic()
    start_time = time.time()  # Capture start time
    try:
        message = client.messages.create(
            model="claude-3-opus-20240229",  # Replace with an available Claude model for you
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": prompt}],
        )
        response = message.content[0].text
        end_time = time.time()  # Capture end time
        return response, end_time - start_time # Return response and latency
    except anthropic.RateLimitError as e:
        # Catch this before APIStatusError: RateLimitError is a subclass of it,
        # so the broader handler would otherwise swallow rate-limit errors.
        print("Rate limit exceeded. Waiting and retrying...")
        time.sleep(60)
        return anthropic_model(prompt, system_prompt)  # Retry
    except anthropic.APIStatusError as e:
        print(f"Anthropic API Error: {e}")
        return None, None  # Return None response and latency
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None, None

# 4. Initialize the toxicity metric and readability metrics
toxicity_metric = evaluate.load("toxicity", module_type="measurement")

# 5. Modify the evaluation loop to use the Claude API and calculate metrics
with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"

    # Create a new column with the Claude responses
    responses = []
    latencies = []  # List to store latency values
    for i in range(len(eval_data)):
        response, latency = anthropic_model(eval_data["inputs"][i], system_prompt) #capture two values
        responses.append(response)
        latencies.append(latency)
        print(f"Claude Response: {response}")  # Print for debugging
        print(f"Latency: {latency}")

    eval_data["Claude_response"] = responses
    eval_data["latency"] = latencies #store latency in df

    # 6. Evaluate the model (using custom evaluation logic and external metrics)
    toxicity_scores = []
    ari_scores = []
    flesch_scores = []

    for response in eval_data["Claude_response"]:
        if response is None:
            toxicity_scores.append(None)
            ari_scores.append(None)
            flesch_scores.append(None)
            continue # Skip evaluation if the response is None

        # Calculate toxicity
        toxicity_result = toxicity_metric.compute(predictions=[response])
        toxicity_score = toxicity_result["toxicity"][0] if toxicity_result["toxicity"] else None  # Handle empty toxicity scores

        toxicity_scores.append(toxicity_score)

        # Calculate readability scores
        ari_score = textstat.automated_readability_index(response)
        flesch_score = textstat.flesch_kincaid_grade(response)
        ari_scores.append(ari_score)
        flesch_scores.append(flesch_score)


    eval_data["toxicity"] = toxicity_scores
    eval_data["ari_grade_level"] = ari_scores
    eval_data["flesch_kincaid_grade_level"] = flesch_scores

    # 7. Log metrics to MLflow

    # Handle None values when calculating the mean
    avg_toxicity = eval_data["toxicity"].dropna().mean() if eval_data["toxicity"].notna().any() else None
    avg_ari = eval_data["ari_grade_level"].dropna().mean() if eval_data["ari_grade_level"].notna().any() else None
    avg_flesch = eval_data["flesch_kincaid_grade_level"].dropna().mean() if eval_data["flesch_kincaid_grade_level"].notna().any() else None
    avg_latency = eval_data["latency"].dropna().mean() if eval_data["latency"].notna().any() else None


    if avg_toxicity is not None:
        mlflow.log_metric("avg_toxicity", avg_toxicity)
    if avg_ari is not None:
        mlflow.log_metric("avg_ari_grade_level", avg_ari)
    if avg_flesch is not None:
        mlflow.log_metric("avg_flesch_kincaid_grade_level", avg_flesch)
    if avg_latency is not None:
        mlflow.log_metric("avg_latency", avg_latency)


    print(f"Average Toxicity: {avg_toxicity}")
    print(f"Average ARI Grade Level: {avg_ari}")
    print(f"Average Flesch-Kincaid Grade Level: {avg_flesch}")
    print(f"Average Latency: {avg_latency}") # print average latency

    print(eval_data)

To disable default metric calculation and only calculate your selected metrics, remove the model_type argument and define the desired metrics.
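The difference between keeping and dropping model_type can be sketched as follows. This is a minimal illustration that assumes MLflow 2.x, an evaluation DataFrame with a "ground_truth" column as in the cells above, and a model callable that accepts that DataFrame. `build_evaluate_kwargs` and `run_evaluation` are hypothetical helper names; `mlflow.evaluate()` and `mlflow.metrics.latency()` are the MLflow calls being illustrated.

```python
from typing import Any, Dict, List, Optional


def build_evaluate_kwargs(
    extra_metrics: List[Any], model_type: Optional[str] = None
) -> Dict[str, Any]:
    """Keyword arguments for mlflow.evaluate(); model_type is left out when None."""
    kwargs: Dict[str, Any] = {
        "targets": "ground_truth",
        "extra_metrics": list(extra_metrics),
    }
    if model_type is not None:
        # Keeping model_type keeps the default metrics for that task.
        kwargs["model_type"] = model_type
    return kwargs


def run_evaluation(model: Any, eval_data: Any, keep_defaults: bool) -> Any:
    """Run mlflow.evaluate() with latency as an extra metric (assumes MLflow 2.x)."""
    import mlflow  # Imported lazily so the helper above works without MLflow

    kwargs = build_evaluate_kwargs(
        extra_metrics=[mlflow.metrics.latency()],
        model_type="question-answering" if keep_defaults else None,
    )
    with mlflow.start_run():
        return mlflow.evaluate(model, eval_data, **kwargs)
```

Calling `run_evaluation(model, eval_data, keep_defaults=True)` would add latency to the question-answering defaults, while `keep_defaults=False` would compute latency only.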

## Calculate Latency

In [None]:
import mlflow
import anthropic
import os
import pandas as pd
import time
import numpy as np  # Import numpy

# 1. Set your Anthropic API key
# os.environ["ANTHROPIC_API_KEY"] = "your-api-key"  # Replace!

# 2. Define your evaluation data (you can adjust this as needed)
eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
    }
)

# 3. Define a function to interact with the Claude API and measure latency
def anthropic_model(prompt, system_prompt):
    client = anthropic.Anthropic()
    start_time = time.time()  # Capture start time
    try:
        message = client.messages.create(
            model="claude-3-opus-20240229",  # Replace with an available Claude model for you
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": prompt}],
        )
        end_time = time.time()  # Capture end time
        latency = end_time - start_time  # Calculate latency
        return latency
    except anthropic.RateLimitError as e:
        # Catch this before APIStatusError: RateLimitError is a subclass of it,
        # so the broader handler would otherwise swallow rate-limit errors.
        print("Rate limit exceeded. Waiting and retrying...")
        time.sleep(60)
        return anthropic_model(prompt, system_prompt)  # Retry
    except anthropic.APIStatusError as e:
        print(f"Anthropic API Error: {e}")
        return None  # Return None if there's an error
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None

# 4. Evaluate latency
with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences" #define the system prompt

    latencies = []
    for prompt in eval_data["inputs"]:
        latency = anthropic_model(prompt, system_prompt)
        latencies.append(latency)
        print(f"Prompt: {prompt}, Latency: {latency}")

    eval_data["latency"] = latencies  # Add latency to eval_data DataFrame
    # Handle None values when calculating the mean and use only notna values.
    avg_latency = eval_data["latency"].dropna().mean() if eval_data["latency"].notna().any() else None

    if avg_latency is not None:
        mlflow.log_metric("avg_latency", avg_latency)
        print(f"Average Latency: {avg_latency}")
    else:
        print("No successful API calls to calculate average latency.")

    print(eval_data)
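The timing logic in the cell above can be separated from the API call. The sketch below is one possible way to do that, using `time.perf_counter()`, a monotonic clock that is not affected by system clock adjustments the way `time.time()` can be. `timed_call` is a hypothetical helper name, and the measured value is client-side latency, so it includes network time.

```python
import time
from typing import Any, Callable, Optional, Tuple


def timed_call(
    fn: Callable[..., Any], *args: Any, **kwargs: Any
) -> Tuple[Any, Optional[float]]:
    """Call fn and return (result, seconds); both are None if fn raised."""
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    except Exception as exc:
        print(f"Call failed: {exc}")
        return None, None
    return result, time.perf_counter() - start


# Any callable can be timed; str.upper stands in for a model call here.
result, seconds = timed_call(str.upper, "what is mlflow?")
print(result, seconds is not None)
```

In the latency cell, the same helper could wrap `client.messages.create(...)` so that failed calls yield `None` and are dropped by the existing `dropna()` step.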

In [None]:
!pip install --upgrade mlflow

### LLM-as-a-Judge Metrics
LLM-as-a-Judge is a new type of metric that uses LLMs to score the quality of model outputs, providing a more human-like evaluation for complex language tasks while being more scalable and cost-effective than human evaluation.

MLflow supports several built-in LLM-as-a-Judge metrics and also lets you create your own with custom configurations and prompts.



### Selecting the Judge Model
By default, MLflow will use OpenAI’s GPT-4 model as the judge model that scores metrics. You can change the judge model by passing an override to the model argument within the metric definition.

1. SaaS LLM Providers
To use SaaS LLM providers, such as OpenAI or Anthropic, set the model parameter in the metrics definition, in the format of <provider>:/<model-name>. Currently, MLflow supports ["openai", "anthropic", "bedrock", "mistral", "togetherai"] as viable LLM providers for any judge model.
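The `<provider>:/<model-name>` format can be sketched with a small helper. This is illustration only: `build_judge_uri` and `SUPPORTED_PROVIDERS` are hypothetical names, and the provider tuple is copied from the list in the text above, so it may differ across MLflow versions.

```python
# Provider names copied from the text above; not read from MLflow itself.
SUPPORTED_PROVIDERS = ("openai", "anthropic", "bedrock", "mistral", "togetherai")


def build_judge_uri(provider: str, model_name: str) -> str:
    """Return a judge model URI in the <provider>:/<model-name> format."""
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"Unsupported provider: {provider!r}")
    if not model_name:
        raise ValueError("model_name must not be empty")
    return f"{provider}:/{model_name}"


print(build_judge_uri("anthropic", "claude-3-5-sonnet-20241022"))
```

The resulting string is what gets passed as the model argument of a metric definition.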



Anthropic models can be accessed via the anthropic:/<model-name> URI. Note that the default judge parameters need to be overridden by passing the parameters argument to the metric definition, since the default parameters violate the Anthropic endpoint requirement (temperature and top_p cannot be specified together).

In [None]:
import mlflow
import os

# os.environ["ANTHROPIC_API_KEY"] = "your-api-key"

answer_correctness = mlflow.metrics.genai.answer_correctness(
    model="anthropic:/claude-3-5-sonnet-20241022",
    # Override default judge parameters to meet Claude endpoint requirements.
    parameters={"temperature": 0, "max_tokens": 256},
)

# Test the metric definition
answer_correctness(
    inputs="What is MLflow?",
    predictions="MLflow is an innovative full self-driving airship.",
    targets="MLflow is an open-source platform for managing the end-to-end ML lifecycle.",
)


### Creating Custom LLM-as-a-Judge Metrics
You can also create your own LLM-as-a-Judge evaluation metrics with the mlflow.metrics.genai.make_genai_metric() API, which needs the following information:

1. name: the name of your custom metric.

2. definition: a description of what the metric measures.

3. grading_prompt: a description of the scoring criteria.

4. examples (optional): a few input/output examples with scores, used as a reference for the LLM judge.

See the API documentation for the full list of the configurations.

Under the hood, definition, grading_prompt, and examples are combined with the evaluation data and model output into one long prompt that is sent to the LLM. If you are familiar with prompt engineering, a SaaS LLM evaluation metric is essentially an attempt to compose the “right” prompt, containing instructions, data, and model output, so that the LLM (e.g., GPT-4) outputs the information we want.

Now let’s create a custom GenAI metric called “professionalism”, which measures how professional our model output is.

Let’s first create a few examples with scores. These will be the reference samples the LLM judge uses. To create such examples, we will use the mlflow.metrics.genai.EvaluationExample() class, which has 4 fields:

1. input: input text.

2. output: output text.

3. score: the score for output in the context of input.

4. justification: why this score was given to the output.
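The pieces described above could fit together roughly as follows. This is a sketch, not a tested configuration: it assumes MLflow 2.x with `mlflow.metrics.genai` available, the example texts, scores, and grading wording are made up for illustration, and `EXAMPLE_SPECS` and `build_professionalism_metric` are hypothetical names. The judge URI and parameters mirror the Anthropic cell earlier in this notebook.

```python
from typing import Any, Dict, List

# Illustrative reference samples; texts and scores are invented for this sketch.
EXAMPLE_SPECS: List[Dict[str, Any]] = [
    {
        "input": "What is MLflow?",
        "output": "Yo, MLflow is like this super cool tool for ML stuff, you know?",
        "score": 2,
        "justification": "The answer is casual and uses slang, which is unprofessional.",
    },
    {
        "input": "What is MLflow?",
        "output": "MLflow is an open-source platform for managing the end-to-end ML lifecycle.",
        "score": 4,
        "justification": "The answer is formal, precise, and free of slang.",
    },
]


def build_professionalism_metric(
    judge_model: str = "anthropic:/claude-3-5-sonnet-20241022",
) -> Any:
    """Build a 'professionalism' judge metric (assumes MLflow 2.x)."""
    # Imported lazily so the example data above can be inspected without MLflow.
    from mlflow.metrics.genai import EvaluationExample, make_genai_metric

    return make_genai_metric(
        name="professionalism",
        definition=(
            "Professionalism measures how formal, respectful, and appropriate "
            "the tone of the output is."
        ),
        grading_prompt=(
            "Score 1: extremely casual or slang-heavy. "
            "Score 3: neutral tone. "
            "Score 5: consistently formal and business-appropriate."
        ),
        examples=[EvaluationExample(**spec) for spec in EXAMPLE_SPECS],
        model=judge_model,
        # Override the default judge parameters for the Anthropic endpoint.
        parameters={"temperature": 0, "max_tokens": 256},
        greater_is_better=True,
    )
```

The cell that follows takes a different route: it builds a judge by hand with `make_metric`, using Claude to grade answers generated by a Google model.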

In [None]:
import mlflow
import anthropic
import os
import pandas as pd
import time
import google.generativeai as genai
from mlflow.metrics import make_metric, MetricValue
import re

# 1. Set your API keys
# os.environ["ANTHROPIC_API_KEY"] = "your-api-key"  # Replace!
os.environ["GOOGLE_API_KEY"] = "your-api-key" # Replace with google api key!
GOOGLE_API_KEY=os.environ["GOOGLE_API_KEY"]
# Configure the Google API
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])


# 2. Define your models
claude_model_id = "claude-3-opus-20240229"  # Replace!  Check what models are available
google_model_id = "gemini-1.5-pro-latest"  # Replace! Check what models are available

# 3. Define your evaluation data
eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
    }
)

# 4. Set the global evaluation prompt
evaluation_prompt = """You are an expert evaluator assessing the quality of an answer to a question.

    Question: {question}
    Answer: {answer}

    Assess the answer based on the following criteria:

    - Is the answer factually correct and consistent with common knowledge?
    - Does the answer address the question completely and accurately?
    - Is the answer well-written, clear, and easy to understand?

    Provide a score from 1 to 5 and a justification, following the exact format:

    Score: [your score]
    Justification: [your justification]
"""


# 5. All custom functions for the evaluation metrics
def get_google_ai_studio_response(prompt, google_model_id):
    """Calls Google AI Studio and returns the text response or None on error."""
    model = genai.GenerativeModel(google_model_id)
    try:
        response = model.generate_content(prompt)
        return response.text
    except Exception as e:
        print(f"Google Generative AI Error: {e}")
        return None

def get_claude_judgment(question, answer, prompt, claude_model_id):
    """Calls Claude API for evaluation and returns a tuple (score, justification) or (1, "Error") on error."""
    try:
        client = anthropic.Anthropic()
        message = client.messages.create(
            model=claude_model_id,
            max_tokens=300,
            messages=[{"role": "user", "content": prompt.format(question=question, answer=answer)}],
        )
        claude_response = message.content[0].text

        # Use regular expressions to extract the score and justification
        match = re.search(r"Score:\s*(\d+)\s*Justification:\s*(.+)", claude_response, re.DOTALL) #Extract the data

        if match:
            score = int(match.group(1))
            justification = match.group(2).strip()
            return score, justification
        else:
            print(f"Could not parse Claude's response: {claude_response}")
            return 1, "Could not parse Claude's response."

    except Exception as e:
        print(f"Error calling Claude API: {e}")
        return 1, "Error during evaluation."  # Consistent return


def claude_judge_eval_fn(predictions, inputs, claude_model_id):
    """Evaluates responses using Claude, returns an MLflow MetricValue."""
    scores = []
    justifications = []
    for i in range(len(predictions)):
        if predictions[i] is None:  # Handle potential None responses
            scores.append(1)  # Assign lowest score if Google model returned None
            justifications.append("Google model returned an error.")
            continue

        question = inputs[i]
        answer = predictions[i]

        score, justification = get_claude_judgment(question=question, answer=answer, prompt=evaluation_prompt, claude_model_id=claude_model_id)
        scores.append(score)
        justifications.append(justification)

    aggregate_results = {
        "mean_score": sum(scores) / len(scores) if scores else 0,
        #Add also the justification
    }

    # Log detailed results to console
    print("Detailed Evaluation Results:")
    for i in range(len(inputs)):
        print(f"Question: {inputs[i]}")
        print(f"Answer: {predictions[i]}")
        print(f"Score: {scores[i]}")
        print(f"Justification: {justifications[i]}")
        print("---")


    return MetricValue(scores=scores, aggregate_results=aggregate_results)



# 6. Define custom evaluation metric
claude_judge_metric = make_metric(
    eval_fn=claude_judge_eval_fn,
    greater_is_better=True,
    name="claude_answer_correctness",
)

# 7. All code connected in the execution
with mlflow.start_run() as run:
    # Generate Google Model responses
    google_responses = []
    for prompt in eval_data["inputs"]:
        response = get_google_ai_studio_response(prompt, google_model_id)
        google_responses.append(response)
        print(f"Google AI Studio Model: {response}")

    # Evaluate using the custom metric
    metric_value = claude_judge_metric(predictions=google_responses, inputs=eval_data["inputs"], claude_model_id=claude_model_id)
    avg_answer_correctness = metric_value.aggregate_results["mean_score"]

    # Log results
    mlflow.log_metric("avg_answer_correctness", avg_answer_correctness)

    print(f"Average Answer Correctness from Claude: {avg_answer_correctness}")
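The cell above assigns a score of 1 whenever the judge's reply cannot be parsed, which makes a parsing failure indistinguishable from a genuinely poor answer. As a minimal sketch of an alternative, the parser below returns `None` for unparsable replies and only accepts scores in the 1 to 5 range requested by the prompt. `parse_judgment` is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple

# Accept a single digit 1-5 after "Score:", then everything after "Justification:".
_JUDGMENT_RE = re.compile(r"Score:\s*([1-5])\b\s*Justification:\s*(.+)", re.DOTALL)


def parse_judgment(text: str) -> Optional[Tuple[int, str]]:
    """Return (score, justification), or None if the reply does not match the format."""
    match = _JUDGMENT_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


print(parse_judgment("Score: 4\nJustification: Accurate and clear."))
print(parse_judgment("I think the answer is fine."))
```

Rows that come back as `None` can then be excluded from the mean or reported separately, depending on how failures should count.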

## Prepare Your Target Models
In order to evaluate your model with mlflow.evaluate(), your model has to be one of the following types:

1. A mlflow.pyfunc.PyFuncModel() instance or a URI pointing to a logged mlflow.pyfunc.PyFuncModel model. In general we call that an MLflow model.

2. A Python function that takes in string inputs and outputs a single string. Your callable must match the signature of mlflow.pyfunc.PyFuncModel.predict() (without the params argument). Briefly, it should:

   - Have data as the only argument, which can be a pandas.DataFrame, numpy.ndarray, Python list, dictionary, or scipy matrix.

   - Return one of pandas.DataFrame, pandas.Series, numpy.ndarray, or list.

3. An MLflow Deployments endpoint URI pointing to a local MLflow AI Gateway, Databricks Foundation Models API, or External Models in Databricks Model Serving.

4. Set model=None, and put model outputs in data. Only applicable when the data is a pandas DataFrame.
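The callable option can be sketched as below, assuming the evaluation data has an "inputs" column as in the earlier cells. `answer_question` is a hypothetical stand-in for a real model call, and `model_fn` is an illustrative name.

```python
from typing import Any, List


def answer_question(question: str) -> str:
    """Stand-in for a real model call (hypothetical); replace with an API request."""
    return f"Placeholder answer to: {question}"


def model_fn(data: Any) -> List[str]:
    """Callable for evaluation: data is the only argument, one string is returned per row."""
    # Indexing by column name works for a pandas DataFrame and for a plain dict of lists.
    return [answer_question(q) for q in data["inputs"]]


print(model_fn({"inputs": ["What is MLflow?", "What is Spark?"]}))
```

Because the function takes data as its only argument and returns a list, it matches the shape of callable described in the list above.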



### Evaluating with an MLflow Model
For detailed instructions on how to convert your model into a mlflow.pyfunc.PyFuncModel instance, please read the MLflow documentation. In short, to evaluate your model as an MLflow model, we recommend following the steps below:

Log your model to the MLflow server with log_model. Each flavor (openai, pytorch, …) has its own log_model API.

In [None]:
import mlflow
import anthropic
import os
import pandas as pd
from typing import Dict, Any, List

class ClaudeModelWrapper(mlflow.pyfunc.PythonModel):
    """
    A custom MLflow model that wraps the Anthropic Claude API.
    """

    def __init__(self, model_name: str, system_prompt: str):
        self.model_name = model_name
        self.system_prompt = system_prompt
        # Do not store the API key on the instance: log_model pickles this object,
        # which would write the key into the logged artifact.

    def load_context(self, context: mlflow.pyfunc.model.PythonModelContext) -> None:
        """Checks that the API key is available at load time (no artifacts to load)."""
        if not os.environ.get("ANTHROPIC_API_KEY"):
            raise ValueError("Anthropic API key not found. Set ANTHROPIC_API_KEY environment variable.")

    def predict(self, context: mlflow.pyfunc.model.PythonModelContext, model_input: pd.DataFrame) -> List[str]:
        """
        Generates predictions using the Claude API.

        Args:
            context: MLflow context (unused).
            model_input: Pandas DataFrame with a 'question' column containing prompts.

        Returns:
            List of responses from Claude.
        """
        client = anthropic.Anthropic()
        responses = []
        for question in model_input["question"]:
            try:
                message = client.messages.create(
                    model=self.model_name,
                    max_tokens=1024,
                    system = self.system_prompt,
                    messages=[
                        {"role": "user", "content": question}
                    ],
                )
                responses.append(message.content[0].text)
            except Exception as e:
                print(f"Error calling Claude API: {e}")
                responses.append(None)  # Or a suitable error value
        return responses


# Example usage:
claude_model_name = "claude-3-opus-20240229"  # Or any Claude model that is accessible to you
system_prompt = "Answer the following question in two sentences."

# Create and log the MLflow model
with mlflow.start_run() as run:
    claude_model = ClaudeModelWrapper(model_name=claude_model_name, system_prompt=system_prompt)
    mlflow.pyfunc.log_model(
        python_model=claude_model,
        artifact_path="claude_model",
        input_example=pd.DataFrame({"question": ["What is the capital of France?"]}),
    )
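Once the model is logged, it can be referenced by URI and passed to `mlflow.evaluate()`. The sketch below assumes MLflow 2.x, the "claude_model" artifact path used above, ANTHROPIC_API_KEY set in the environment, and evaluation data with a "question" column (which the wrapper's `predict` reads) plus a "ground_truth" column. `runs_uri` and `evaluate_logged_model` are hypothetical helper names.

```python
from typing import Any


def runs_uri(run_id: str, artifact_path: str) -> str:
    """Build a runs:/ URI for a model logged under artifact_path in the given run."""
    return f"runs:/{run_id}/{artifact_path}"


def evaluate_logged_model(run_id: str, eval_data: Any) -> Any:
    """Evaluate the logged Claude wrapper (assumes MLflow 2.x)."""
    import mlflow  # Imported lazily so runs_uri can be used without MLflow

    with mlflow.start_run():
        return mlflow.evaluate(
            runs_uri(run_id, "claude_model"),
            eval_data,
            targets="ground_truth",
            model_type="question-answering",
        )


print(runs_uri("abc123", "claude_model"))
```

With the run created above, the call would look like `evaluate_logged_model(run.info.run_id, eval_df)`, where `eval_df` is a DataFrame with "question" and "ground_truth" columns.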

Opening the MLflow UI from Google Colab requires a few extra steps because Colab runs in a remote environment, so you can't reach its localhost directly from your local browser. Here's a breakdown of the steps:

1. Using ngrok

ngrok creates a secure tunnel from a public URL to the machine running the notebook. This is the generally recommended approach.



Install ngrok:

In [None]:
!pip install pyngrok

Set the ngrok Authtoken:

In [None]:
from pyngrok import ngrok
ngrok.set_auth_token("your-api-key")  # Replace!

Create the Tunnel:

In [None]:
from pyngrok import ngrok
http_tunnel = ngrok.connect(addr=5000, proto="http", bind_tls=True)
print("MLflow UI URL:", http_tunnel.public_url)

Start the MLflow UI:

In [None]:
!mlflow ui --port 5000 &