

# LLM Evaluation Metrics

https://mlflow.org/docs/latest/llms/llm-evaluate/index.html


There are two types of LLM evaluation metrics in MLflow:

- Heuristic-based metrics: These metrics calculate a score for each data record (a row in a Pandas or Spark DataFrame) using a predefined function, such as ROUGE (`rougeL()`), Flesch-Kincaid (`flesch_kincaid_grade_level()`), or Bilingual Evaluation Understudy (BLEU) (`bleu()`). They are similar to traditional continuous-value metrics. For the list of built-in heuristic metrics and how to define a custom metric with your own function, see the Heuristic-based Metrics section.

- LLM-as-a-Judge metrics: These metrics use an LLM to score the quality of model outputs. They address a limitation of heuristic-based metrics, which often miss nuances such as context and semantic accuracy. LLM-as-a-Judge metrics provide a more human-like evaluation for complex language tasks while being more scalable and cost-effective than human evaluation. MLflow provides various built-in LLM-as-a-Judge metrics and supports creating custom metrics with your own prompt, grading criteria, and reference examples. See the LLM-as-a-Judge Metrics section for more details.
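
To make the "one score per record" idea concrete, here is a minimal sketch of a heuristic metric: a ROUGE-1-style unigram-overlap F1 computed independently for each row. It is a simplified illustration, not MLflow's `rouge1()` implementation; the function name `unigram_f1` and the whitespace tokenization are assumptions made for this example.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """ROUGE-1-style F1: overlap of lowercased, whitespace-split tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# One score per record, as a heuristic metric would produce.
rows = [
    ("the cat sat on the mat", "the cat is on the mat"),
    ("completely unrelated text", "the cat is on the mat"),
]
scores = [unigram_f1(pred, ref) for pred, ref in rows]
print(scores)
```

The first pair shares five of six tokens and scores high; the second shares none and scores zero, even if a human might judge the texts differently. That blind spot is what LLM-as-a-Judge metrics try to cover.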



### MLflow Metrics
The `mlflow.metrics` module helps you quantitatively and qualitatively measure your models.

https://mlflow.org/docs/latest/python_api/mlflow.metrics.html


Create a test case consisting of `inputs`, which are passed to the model, and `ground_truth`, which the model's generated output is compared against.
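
A minimal sketch of such a test case, using plain Python records. The two example texts are made up for illustration; only the field names `inputs` and `ground_truth` come from the description above.

```python
# Hypothetical test case: each record pairs a model input with its expected output.
test_cases = [
    {
        "inputs": "Summarize: MLflow is an open-source platform for managing the ML lifecycle.",
        "ground_truth": "MLflow manages the machine learning lifecycle.",
    },
    {
        "inputs": "Summarize: Textstat computes readability statistics from text.",
        "ground_truth": "Textstat measures text readability.",
    },
]

# Every record needs both fields, otherwise there is nothing to compare against.
for record in test_cases:
    assert set(record) == {"inputs", "ground_truth"}

inputs = [record["inputs"] for record in test_cases]
ground_truth = [record["ground_truth"] for record in test_cases]
print(len(inputs), len(ground_truth))
```

Passing the list to `pandas.DataFrame(test_cases)` would give the tabular form used in the rest of this notebook.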

#### Task: text summarization (`model_type="text-summarization"`)
- ROUGE

- toxicity

- ari_grade_level

- flesch_kincaid_grade_level

#### Descriptions

- https://huggingface.co/spaces/evaluate-measurement/toxicity
- https://en.wikipedia.org/wiki/Automated_readability_index
- https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch%E2%80%93Kincaid_grade_level
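
As a worked example of a grade-level metric, the Automated Readability Index is defined as 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43. The sketch below applies that formula with a simple regex tokenizer. The tokenization is an assumption of this example, so its values will not necessarily match `textstat` or MLflow's `ari_grade_level()`.

```python
import re


def ari_grade(text: str) -> float:
    """Automated Readability Index from character, word, and sentence counts."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not words:
        return 0.0
    # Count sentences by terminal punctuation; text without any counts as one sentence.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    characters = sum(ch.isalnum() for word in words for ch in word)
    return 4.71 * characters / len(words) + 0.5 * len(words) / sentences - 21.43


simple = "The cat sat. The dog ran."
dense = "Quantitative evaluation methodologies necessitate comprehensive standardization."
print(round(ari_grade(simple), 2), round(ari_grade(dense), 2))
```

Short words in short sentences push the score down, while long words in a single long sentence push it up. The Flesch-Kincaid grade level follows the same pattern but uses syllables per word instead of characters per word.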

### Toxicity
https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target

### Textstat
Textstat is an easy-to-use library for calculating statistics from text. It helps determine readability, complexity, and grade level.

https://pypi.org/project/textstat/

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install -q mlflow
!pip install -q evaluate textstat tiktoken
!pip install -q psutil pynvml
!pip install -q bert_score
!pip install -q --disable-pip-version-check py7zr sentencepiece loralib peft trl
!pip install -q bitsandbytes
!pip install -q datasets rouge_score
!pip install -q "transformers[torch]"
!pip install -q -U accelerate
!pip install -q onnxruntime "optimum[onnxruntime]"


In [3]:
import argparse
import os
from functools import partial

import bitsandbytes as bnb
import google.genai as genai
import torch
import torch.nn as nn
import transformers
from datasets import load_dataset
from google.colab import userdata
from peft import (
    AutoPeftModelForCausalLM,
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
)
from torch import bfloat16, cuda
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
    set_seed,
)

In [4]:
from google.colab import output
output.enable_custom_widget_manager()
from transformers.utils import logging

In [5]:
logging.set_verbosity_warning()
os.environ['TRANSFORMERS_VERBOSITY']='warning'

# Load multi_news dataset

In [6]:
from datasets import load_dataset
dataset=load_dataset('multi_news',trust_remote_code=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [7]:
print(f"Train dataset size: {len(dataset['train'])}")
print(f"test dataset size: {len(dataset['test'])}")
print(f"Validation dataset size: {len(dataset['validation'])}")

Train dataset size: 44972
test dataset size: 5622
Validation dataset size: 5622


In [42]:
import transformers
from mlflow.transformers import generate_signature_output
import locale
import mlflow
def getpreferredencoding(do_setlocale=True):
  return 'UTF-8'
locale.getpreferredencoding=getpreferredencoding

In [54]:
model_uri = "runs:/1388429508485503/text_summarizer"
MLFLOW_TRACKING_URI = "databricks"
DATABRICKS_HOST = "https://dbc-78299b7d-4a77.cloud.databricks.com"
# Read the access token from Colab secrets instead of hard-coding it in the notebook
# (assumes a secret named DATABRICKS_TOKEN has been created in this Colab account).
DATABRICKS_TOKEN = userdata.get("DATABRICKS_TOKEN")
os.environ["MLFLOW_TRACKING_URI"] = MLFLOW_TRACKING_URI
os.environ["DATABRICKS_HOST"] = DATABRICKS_HOST


In [43]:
if "MLFLOW_TRACKING_URI" not in os.environ:
    os.environ["MLFLOW_TRACKING_URI"] = MLFLOW_TRACKING_URI
if "DATABRICKS_HOST" not in os.environ:
    os.environ["DATABRICKS_HOST"] = DATABRICKS_HOST
if "DATABRICKS_TOKEN" not in os.environ:
    os.environ["DATABRICKS_TOKEN"] = DATABRICKS_TOKEN

In [45]:
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

mlflow.set_experiment("/Users/sastatimepass123@gmail.com/summarization_evaluation")


<Experiment: artifact_location='dbfs:/databricks/mlflow-tracking/1388429508485503', creation_time=1746342444023, experiment_id='1388429508485503', last_update_time=1746348712204, lifecycle_stage='active', name='/Users/sastatimepass123@gmail.com/summarization_evaluation', tags={'mlflow.experiment.sourceName': '/Users/sastatimepass123@gmail.com/summarization_evaluation',
 'mlflow.experimentType': 'MLFLOW_EXPERIMENT',
 'mlflow.ownerEmail': 'sastatimepass123@gmail.com',
 'mlflow.ownerId': '943802935141691'}>

In [46]:
mlflow.end_run()

In [47]:
import torch
from tqdm.auto import tqdm

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [71]:
df_test=dataset['validation'].to_pandas()

In [72]:
df_test.columns=['input','summary']

In [73]:
df_test.head()

| | input | summary |
|---|---|---|
| 0 | Whether a sign of a good read; or a comment on... | – The Da Vinci Code has sold so many copies—th... |
| 1 | The deaths of three American soldiers in Afgha... | – A major snafu has hit benefit payments to st... |
| 2 | DUBAI Al Qaeda in Yemen has claimed responsibi... | – Yemen-based al-Qaeda in the Arabian Peninsul... |
| 3 | Cambridge Analytica, a data firm that worked f... | – Cambridge Analytica is calling it quits. The... |
| 4 | The N.S.A.’s Evolution: The National Security ... | – A lengthy report in the New York Times, base... |


In [74]:
import gc
import torch
import datetime
torch.cuda.empty_cache()
gc.collect()

9183

# Evaluate MLflow default metrics

In [75]:
now = datetime.datetime.now()

description= f"""Evaluation Fine Tuned T5-Large Model on Multi_News Dataset
model_uri: {model_uri}
"""
with mlflow.start_run(run_name=f"Evaluation_{now.strftime('%Y-%m-%d_%H:%M:%S')}", description=description) as run:

    results = mlflow.evaluate(
        model_uri,
        df_test[:10],
        targets="summary",  # specify which column corresponds to the expected output
        model_type="text-summarization",  # model type indicates which metrics are relevant for this task
        evaluators="default",
    )


Device set to use cuda:0
2025/05/04 09:24:09 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2025/05/04 09:24:50 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`

Device set to use cuda:0

🏃 View run Evaluation_2025-05-04_09:23:37 at: https://dbc-78299b7d-4a77.cloud.databricks.com/ml/experiments/1388429508485503/runs/e2a2ab27b5b04406ae6499b150c972f1
🧪 View experiment at: https://dbc-78299b7d-4a77.cloud.databricks.com/ml/experiments/1388429508485503


In [63]:
from transformers import pipeline
import mlflow.transformers

# 1. Load summarization pipeline (T5 model)
summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base",device=-1)

# 2. Start a new MLflow run and log the model
with mlflow.start_run(run_name="T5_Model_Training") as run:
    mlflow.transformers.log_model(
        transformers_model=summarizer,
        artifact_path="model"
    )
    model_uri = f"runs:/{run.info.run_id}/model"  # Save model URI for evaluation


Device set to use cpu


🏃 View run T5_Model_Training at: https://dbc-78299b7d-4a77.cloud.databricks.com/ml/experiments/1388429508485503/runs/1bd74cc4e3d14af08d0e85e0aeb2199f
🧪 View experiment at: https://dbc-78299b7d-4a77.cloud.databricks.com/ml/experiments/1388429508485503


# Custom Metrics

In [76]:
from mlflow.metrics import latency
from mlflow.metrics.genai import answer_correctness
from mlflow.models import infer_signature,make_metric

In [77]:
mlflow.enable_system_metrics_logging()

In [78]:
mlflow.metrics.__all__

['EvaluationMetric',
 'MetricValue',
 'make_metric',
 'flesch_kincaid_grade_level',
 'ari_grade_level',
 'exact_match',
 'rouge1',
 'rouge2',
 'rougeL',
 'rougeLsum',
 'toxicity',
 'mae',
 'mse',
 'rmse',
 'r2_score',
 'max_error',
 'mape',
 'recall_score',
 'precision_score',
 'f1_score',
 'token_count',
 'latency',
 'genai',
 'bleu']

In [79]:
mlflow.metrics.genai.__all__

['EvaluationExample',
 'make_genai_metric',
 'make_genai_metric_from_prompt',
 'answer_similarity',
 'answer_correctness',
 'faithfulness',
 'answer_relevance',
 'relevance',
 'retrieve_custom_metrics']

In [81]:
from evaluate import load
import pandas as pd
from typing import List
bertscore=load('bertscore')
predictions=['hello there']
references=['hello there']
results=bertscore.compute(predictions=predictions,references=references,lang='en')

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [82]:
results

{'precision': [0.9999999403953552],
 'recall': [0.9999999403953552],
 'f1': [0.9999999403953552],
 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.48.3)'}
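
The three values returned above come from greedy matching between token embeddings: precision averages, over candidate tokens, the best cosine similarity to any reference token; recall does the same from the reference side; F1 is their harmonic mean. The sketch below reproduces that arithmetic on made-up 2-d vectors. Real BERTScore uses contextual embeddings from a model such as `roberta-large`, so the vectors and the function name `greedy_match_scores` are assumptions for illustration only.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def greedy_match_scores(cand_vecs, ref_vecs):
    """Precision, recall, and F1 from greedy max-cosine matching of token vectors."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy 2-d "embeddings", made up for this example.
hello, there, world = [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]

identical = greedy_match_scores([hello, there], [hello, there])
one_changed = greedy_match_scores([hello, world], [hello, there])
print(identical)
print(one_changed)
```

Identical texts score 1 on all three values, which matches the near-1.0 result for `'hello there'` above; replacing one token with a related but different one lowers the scores without dropping them to zero.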

In [83]:
def _mean_bert_score(eval_df, key):
    # bertscore returns one score per row; average them so the metric covers
    # every evaluated row rather than only the first one.
    scores = bertscore.compute(
        predictions=eval_df["prediction"].tolist(),
        references=eval_df["target"].tolist(),
        lang="en",
    )[key]
    return sum(scores) / len(scores)

def calculate_bert_f1(eval_df, _builtin_metrics):
    return _mean_bert_score(eval_df, "f1")

def calculate_bert_recall(eval_df, _builtin_metrics):
    return _mean_bert_score(eval_df, "recall")

def calculate_bert_precision(eval_df, _builtin_metrics):
    return _mean_bert_score(eval_df, "precision")

In [86]:

torch.cuda.empty_cache()
gc.collect()

0

In [87]:
now = datetime.datetime.now()

description= f"""Evaluation Fine Tuned T5-Large Model on Multi_News Dataset
model_uri: {model_uri}

custom metric BertScore and latency
"""
with mlflow.start_run(run_name=f"Evaluation_{now.strftime('%Y-%m-%d_%H:%M:%S')}", description=description) as run:

    results = mlflow.evaluate(
        model_uri,
        df_test[:10],
        targets="summary",  # specify which column corresponds to the expected output
        model_type="text-summarization",  # model type indicates which metrics are relevant for this task
        evaluators="default",
        extra_metrics=[
            latency(),
            make_metric(eval_fn=calculate_bert_f1, greater_is_better=True),
            make_metric(eval_fn=calculate_bert_recall, greater_is_better=True),
            make_metric(eval_fn=calculate_bert_precision, greater_is_better=True),
        ],
    )


2025/05/04 09:33:49 INFO mlflow.system_metrics.system_metrics_monitor: Started monitoring system metrics.


Device set to use cuda:0
2025/05/04 09:34:29 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2025/05/04 09:35:10 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


🏃 View run Evaluation_2025-05-04_09:33:48 at: https://dbc-78299b7d-4a77.cloud.databricks.com/ml/experiments/1388429508485503/runs/5c55d93a75e6449ba02ffab30ae8b4a9
🧪 View experiment at: https://dbc-78299b7d-4a77.cloud.databricks.com/ml/experiments/1388429508485503


2025/05/04 09:35:22 INFO mlflow.system_metrics.system_metrics_monitor: Stopping system metrics monitoring...
2025/05/04 09:35:23 INFO mlflow.system_metrics.system_metrics_monitor: Successfully terminated system metrics monitoring!
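
The `latency()` metric used in this run reports how long each prediction takes. A rough sketch of the idea, assuming each input is predicted one at a time and timed individually; `timed_predictions` and `fake_summarizer` are hypothetical names for this example, not MLflow functions.

```python
import time


def timed_predictions(predict, inputs):
    """Call predict on one input at a time and record the seconds each call takes."""
    outputs, latencies = [], []
    for item in inputs:
        start = time.perf_counter()
        outputs.append(predict(item))
        latencies.append(time.perf_counter() - start)
    return outputs, latencies


# Stand-in for a summarization model (hypothetical): keep the first five words.
def fake_summarizer(text):
    return " ".join(text.split()[:5])


outputs, latencies = timed_predictions(
    fake_summarizer, ["one two three four five six", "short text"]
)
print(outputs, latencies)
```

Timing rows individually gives a per-record latency but rules out batched inference, so evaluation runs that include latency can take longer than runs without it.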


# Evaluate with LLM-as-a-Judge metrics

In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI
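# Read the key from Colab secrets (assumes a secret named GEMINI_API_KEY exists);
# assigning the literal string 'GEMINI_API_KEY' would not authenticate.
os.environ['GOOGLE_API_KEY'] = userdata.get('GEMINI_API_KEY')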
llm=ChatGoogleGenerativeAI(model='gemini-1.5-flash')


import mlflow.deployments

client = mlflow.deployments.get_deploy_client("genai")

client.create_deployment(
    name="gemini-professionalism",
    config={
        "model": "gemini-1.5-pro-preview-0409",
        "provider": "vertex_ai"
    }
)


In [98]:
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

professionalism_metric = make_genai_metric(
    name="professionalism",
    definition=(
        "Professionalism refers to the use of a formal, respectful, and appropriate style of communication that is tailored to the context and audience. It often involves avoiding overly casual language, slang, or colloquialisms, and instead using clear, concise, and respectful language"
    ),
    grading_prompt=(
        "Professionalism: If the answer is written using a professional tone, below "
        "are the details for different scores: "
        "- Score 1: Language is extremely casual, informal, and may include slang or colloquialisms. Not suitable for professional contexts. "
        "- Score 2: Language is casual but generally respectful and avoids strong informality or slang. Acceptable in some informal professional settings. "
        "- Score 3: Language is balanced and avoids extreme informality or formality. Suitable for most professional contexts. "
        "- Score 4: Language is noticeably formal, respectful, and avoids casual elements. Appropriate for business or academic settings. "
        "- Score 5: Language is excessively formal, respectful, and avoids casual elements. Appropriate for the most formal settings such as textbooks. "
    ),
    examples=[
        EvaluationExample(
            input="What is MLflow?",
            output=(
                "MLflow is like your friendly neighborhood toolkit for managing your machine learning projects. It helps you track experiments, package your code and models, and collaborate with your team, making the whole ML workflow smoother. It's like your Swiss Army knife for machine learning!"
            ),
            score=2,
            justification=(
                "The response is written in a casual tone. It uses contractions, filler words such as 'like', and exclamation points, which make it sound less professional. "
            ),
        )
    ],
    version="v1",
    model="vertex_ai/gemini-1.5-pro-preview-0409",
    parameters={"temperature": 0.0},
    grading_context_columns=[],
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)

print(professionalism_metric)

EvaluationMetric(name=professionalism, greater_is_better=True, long_name=professionalism, version=v1, metric_details=
Task:
You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's professionalism based on the rubric
justification: Your reasoning about the model's professionalism score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called professionalism based on the input and output.
A definition of professionalism and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before complet
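
The prompt above asks the judge to answer in two lines, `score:` and `justification:`. A minimal sketch of how such a reply could be turned into structured values; `parse_judge_response` and the sample reply are hypothetical and written for this note, not taken from MLflow's own parsing code.

```python
import re


def parse_judge_response(text: str):
    """Extract (score, justification) from a 'score: ... / justification: ...' reply."""
    score_match = re.search(r"score:\s*(\d+)", text, flags=re.IGNORECASE)
    just_match = re.search(r"justification:\s*(.+)", text, flags=re.IGNORECASE | re.DOTALL)
    if score_match is None:
        return None, None  # the judge did not follow the requested format
    justification = just_match.group(1).strip() if just_match else None
    return int(score_match.group(1)), justification


# Hypothetical judge reply, written to match the two-line format requested above.
reply = "score: 2\njustification: The response uses contractions and casual filler words."
score, justification = parse_judge_response(reply)
print(score, justification)
```

Returning `None` for an unparseable reply keeps a single malformed answer from aborting the whole evaluation; the caller can decide whether to skip or retry that row.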

In [99]:
torch.cuda.empty_cache()
gc.collect()

9041

In [100]:
now = datetime.datetime.now()

evaluator_config = {
    "col_mapping": {
        "inputs": "input",        # Map expected 'inputs' → your 'input'
        "targets": "summary",     # Already correct if named 'summary'
    }
}

description= f"""Evaluation Fine Tuned T5-Large Model on Multi_News Dataset
model_uri: {model_uri}

custom metric BertScore , latency and professionalism
"""
with mlflow.start_run(run_name=f"Evaluation_{now.strftime('%Y-%m-%d_%H:%M:%S')}", description=description) as run:

    results = mlflow.evaluate(
        model_uri,
        df_test[:10],
        targets="summary",  # specify which column corresponds to the expected output
        model_type="text-summarization",  # model type indicates which metrics are relevant for this task
        evaluators="default",
        evaluator_config=evaluator_config,
        extra_metrics=[
            latency(),
            make_metric(eval_fn=calculate_bert_f1, greater_is_better=True),
            make_metric(eval_fn=calculate_bert_recall, greater_is_better=True),
            make_metric(eval_fn=calculate_bert_precision, greater_is_better=True),
            professionalism_metric,
        ],
    )
results.metrics

2025/05/04 09:53:37 INFO mlflow.system_metrics.system_metrics_monitor: Started monitoring system metrics.


Device set to use cuda:0
2025/05/04 09:54:13 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2025/05/04 09:54:54 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


🏃 View run Evaluation_2025-05-04_09:53:37 at: https://dbc-78299b7d-4a77.cloud.databricks.com/ml/experiments/1388429508485503/runs/535d0f6bb1c4465885c26d3ff437e8b7
🧪 View experiment at: https://dbc-78299b7d-4a77.cloud.databricks.com/ml/experiments/1388429508485503


2025/05/04 09:54:55 INFO mlflow.system_metrics.system_metrics_monitor: Stopping system metrics monitoring...
2025/05/04 09:54:55 INFO mlflow.system_metrics.system_metrics_monitor: Successfully terminated system metrics monitoring!


MlflowException: Metric 'professionalism': Error:
Malformed model uri 'vertex_ai/gemini-1.5-pro-preview-0409'
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/mlflow/models/evaluation/default_evaluator.py", line 710, in _test_first_row
    metric_value = metric.evaluate(eval_fn_args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/mlflow/models/evaluation/utils/metric.py", line 59, in evaluate
    metric: MetricValue = self.function(*eval_fn_args)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/mlflow/metrics/genai/genai_metric.py", line 611, in eval_fn
    score, justification = future.result()
                           ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/mlflow/metrics/genai/genai_metric.py", line 110, in _score_model_on_one_payload
    endpoint_type = model_utils.get_endpoint_type(eval_model) or "llm/v1/chat"
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/mlflow/metrics/genai/model_utils.py", line 23, in get_endpoint_type
    schema, path = _parse_model_uri(endpoint_uri)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/mlflow/metrics/genai/model_utils.py", line 83, in _parse_model_uri
    raise MlflowException(
mlflow.exceptions.MlflowException: Malformed model uri 'vertex_ai/gemini-1.5-pro-preview-0409'
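
The traceback shows the failure happening while the `model` argument is parsed. MLflow's documentation writes judge model URIs in the form `<provider>:/<model-name>` (for example `openai:/gpt-4`), whereas the value passed here uses a plain slash with no `:/`. The sketch below approximates that shape check with the standard library. It is an illustration written for this note, not MLflow's `_parse_model_uri`, and whether a given provider such as Vertex AI is supported depends on the installed MLflow version.

```python
from urllib.parse import urlparse


def split_judge_model_uri(model_uri: str):
    """Split '<provider>:/<model-name>' into (provider, model_name) or raise ValueError."""
    parsed = urlparse(model_uri, allow_fragments=False)
    # A usable URI has a scheme (the provider) and a path of the form '/<model-name>'.
    if not parsed.scheme or not parsed.path.startswith("/") or len(parsed.path) <= 1:
        raise ValueError(f"Malformed model uri '{model_uri}'")
    return parsed.scheme, parsed.path.lstrip("/")


print(split_judge_model_uri("openai:/gpt-4"))

try:
    split_judge_model_uri("vertex_ai/gemini-1.5-pro-preview-0409")
except ValueError as err:
    print(err)
```

Running a check like this before calling `make_genai_metric` surfaces a badly shaped URI immediately, instead of after the model has been downloaded and predictions computed.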


In [None]:
torch.cuda.empty_cache()
gc.collect()

# Evaluate ONNX models in Custom PythonModel


In [None]:
model_uri_onnx = "runs:/79c1dcaabd214f0cae2c55797175b16a/t5-summarization-onnx"

In [None]:
loaded_model = mlflow.pyfunc.load_model(model_uri_onnx)