
# LLM-as-a-Judge

In this demo, **we will introduce how to use LLMs to evaluate the performance of LLM-based solutions.**

**Learning Objectives:**

*By the end of this demo, you will be able to:*

* List reasons for using an LLM-as-a-Judge approach
* Evaluate an LLM's performance on a custom metric using an LLM-as-a-Judge approach

In [0]:
%pip install -U -qq databricks-sdk textstat mlflow tiktoken
dbutils.library.restartPython()

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


Before starting the demo, run the provided classroom setup script. This script will define configuration variables necessary for the demo. Execute the following cell:

In [0]:
%run ../Includes/Classroom-Setup-03


The examples and models presented in this course are intended solely for demonstration and educational purposes.
Please note that the models and prompt examples may sometimes contain offensive, inaccurate, biased, or harmful content.


**Other Conventions:**

Throughout this demo, we'll refer to the object `DA`. This object, provided by Databricks Academy, contains variables such as your username, catalog name, schema name, working directory, and dataset locations. Run the code block below to view these details:

In [0]:
print(f"Username:          {DA.username}")
print(f"Catalog Name:      {DA.catalog_name}")
print(f"Schema Name:       {DA.schema_name}")
print(f"Working Directory: {DA.paths.working_dir}")
print(f"Dataset Location:  {DA.paths.datasets}")

Username:          labuser11003544_1753435669@vocareum.com
Catalog Name:      dbacademy
Schema Name:       labuser11003544_1753435669
Working Directory: /Volumes/dbacademy/ops/labuser11003544_1753435669@vocareum_com
Dataset Location:  NestedNamespace (news='/Volumes/dbacademy_news/v01', arxiv='/Volumes/dbacademy_arxiv/v01')


## Demo Overview

In this demo, we will walk through a basic example of using one LLM to evaluate the performance of another LLM.

### Why LLM-as-a-Judge?

**Question:** Why would you want to use an LLM for evaluation?

Databricks has found that evaluating with LLMs can:

* **Reduce costs** – fewer resources used in finding/curating benchmark datasets
* **Save time** – fewer evaluation steps reduces time-to-release
* **Improve automation** – easily scaled and automated, all within MLflow

### Custom Metrics

These are all particularly true when we're evaluating performance using **custom metrics**.

In our case, let's consider a custom metric of **professionalism** 🤖. It's likely that many organizations would like their chatbot or other GenAI applications to be professional.

However, professionalism can vary by domain and other contexts – this is one of the powers of LLM-as-a-Judge that we'll explore in this demo.

### Chatbot System

For this demo, we'll use a chatbot system (shown below) to answer simple questions about Databricks.

In [0]:
query_chatbot_system(
    "What is Databricks Vector Search?"
)

"Databricks Vector Search is a feature in Databricks that allows users to search and retrieve similar items from large datasets using vector embeddings. It's a part of the Databricks Lakehouse Platform, which combines the best of data warehouses and data lakes to provide a single platform for data engineering, data science, and business analytics.\n\nVector Search is based on the concept of vector embeddings, where complex data such as text, images, or audio is converted into dense vectors in a high-dimensional space. These vectors capture the semantic meaning and relationships between the data points, enabling efficient and accurate similarity searches.\n\nWith Databricks Vector Search, users"

## Demo Workflow Steps

To complete this workflow, we'll cover the following steps:

1. Define our **professionalism** metric
2. Compute our professionalism metric on **a few example responses**
3. Describe a few best practices when working with LLMs as evaluators

## Step 1: Define a Professionalism Metric

While we can use LLMs to evaluate on common metrics, we're going to create our own **custom `professionalism` metric**.

To do this, we need the following information:

* A definition of professionalism
* A grading prompt, similar to a rubric
* Examples of human-graded responses
* An LLM to use *as the judge*
* ... and a few extra parameters we'll see below.

### Establish the Definition and Prompt

Before we create the metric, we need an understanding of what **professionalism** is and how it will be scored.

Let's use the below definition:

> Professionalism refers to the use of a formal, respectful, and appropriate style of communication that is tailored to the context and audience. It often involves avoiding overly casual language, slang, or colloquialisms, and instead using clear, concise, and respectful language.

And here is our grading prompt/rubric:

* **Professionalism:** If the answer is written using a professional tone, below are the details for different scores: 
    - **Score 1:** Language is extremely casual, informal, and may include slang or colloquialisms. Not suitable for professional contexts.
    - **Score 2:** Language is casual but generally respectful and avoids strong informality or slang. Acceptable in some informal professional settings.
    - **Score 3:** Language is overall formal but still has casual words/phrases. Borderline for professional contexts.
    - **Score 4:** Language is balanced and avoids extreme informality or formality. Suitable for most professional contexts.
    - **Score 5:** Language is noticeably formal, respectful, and avoids casual elements. Appropriate for formal business or academic settings.
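
A rubric like this can also be kept as data and turned into the grading prompt string programmatically, so that each score lands on its own line. The sketch below is only an illustration: `SCORE_DESCRIPTIONS` and `build_grading_prompt` are hypothetical names introduced here, not part of MLflow.

```python
# Hypothetical helper (not an MLflow API): build a grading prompt from a rubric dict.
SCORE_DESCRIPTIONS = {
    1: "Language is extremely casual, informal, and may include slang or "
       "colloquialisms. Not suitable for professional contexts.",
    2: "Language is casual but generally respectful and avoids strong informality "
       "or slang. Acceptable in some informal professional settings.",
    3: "Language is overall formal but still has casual words/phrases. "
       "Borderline for professional contexts.",
    4: "Language is balanced and avoids extreme informality or formality. "
       "Suitable for most professional contexts.",
    5: "Language is noticeably formal, respectful, and avoids casual elements. "
       "Appropriate for formal business or academic settings.",
}


def build_grading_prompt(metric_name, intro, descriptions):
    """Return a rubric string with one line per score, in ascending score order."""
    lines = [f"{metric_name}: {intro}"]
    for score in sorted(descriptions):
        lines.append(f"- Score {score}: {descriptions[score]}")
    return "\n".join(lines)


grading_prompt = build_grading_prompt(
    "Professionalism",
    "If the answer is written using a professional tone, "
    "below are the details for different scores:",
    SCORE_DESCRIPTIONS,
)
print(grading_prompt)
```

The resulting string could then be passed as the `grading_prompt` argument when the metric is created.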

### Generate the Human-graded Responses

Because this is a custom metric, we need to show our evaluator LLM what examples of each score in the above-described rubric might look like.

To do this, we use `mlflow.metrics.genai.EvaluationExample` and provide the following:

* input: the question/query
* output: the answer/response
* score: the human-generated score according to the grading prompt/rubric
* justification: an explanation of the score

Check out the example below:

### Define Evaluation Examples

In [0]:
import mlflow

professionalism_example_score_1 = mlflow.metrics.genai.EvaluationExample(
    input="What is MLflow?",
    output=(
        "MLflow is like your friendly neighborhood toolkit for managing your machine learning projects. It helps "
        "you track experiments, package your code and models, and collaborate with your team, making the whole ML "
        "workflow smoother. It's like your Swiss Army knife for machine learning!"
    ),
    score=2,
    justification=(
        "The response is written in a casual tone. It uses contractions, filler words such as 'like', and "
        "exclamation points, which make it sound less professional. "
    ),
)

Let's create another example:

In [0]:
professionalism_example_score_2 = mlflow.metrics.genai.EvaluationExample(
    input="What is MLflow?",
    output=(
        "MLflow is an open-source toolkit for managing your machine learning projects. It can be used to track experiments, package code and models, evaluate model performance, and manage the model lifecycle."
    ),
    score=4,
    justification=(
        "The response is written in a professional tone. It does not use filler words or unprofessional punctuation. It is matter-of-fact, but it is not particularly advanced or academic."
    ),
)

### Create the Metric

Once we have a number of examples created, we need to create our metric object using MLflow.

This time, we use `mlflow.metrics.genai.make_genai_metric` and provide the below arguments:

* name: the name of the metric
* definition: a description of the metric (from above)
* grading_prompt: the rubric of the metric (from above)
* examples: a list of our above-defined example objects
* model: the LLM used to evaluate the responses
* parameters: any parameters we can pass to the evaluator model
* aggregations: the aggregations across all records we'd like to generate
* greater_is_better: a boolean indicating whether higher scores on the metric are "better"

Check out the example below:

In [0]:
professionalism = mlflow.metrics.genai.make_genai_metric(
    name="professionalism",
    definition=(
        "Professionalism refers to the use of a formal, respectful, and appropriate style of communication that is "
        "tailored to the context and audience. It often involves avoiding overly casual language, slang, or "
        "colloquialisms, and instead using clear, concise, and respectful language."
    ),
    grading_prompt=(
        # Adjacent string literals are concatenated, so each piece ends with a space
        "Professionalism: If the answer is written using a professional tone, below are the details for different scores: "
        "- Score 1: Language is extremely casual, informal, and may include slang or colloquialisms. Not suitable for "
        "professional contexts. "
        "- Score 2: Language is casual but generally respectful and avoids strong informality or slang. Acceptable in "
        "some informal professional settings. "
        "- Score 3: Language is overall formal but still has casual words/phrases. Borderline for professional contexts. "
        "- Score 4: Language is balanced and avoids extreme informality or formality. Suitable for most professional contexts. "
        "- Score 5: Language is noticeably formal, respectful, and avoids casual elements. Appropriate for formal "
        "business or academic settings. "
    ),
    examples=[
        professionalism_example_score_1, 
        professionalism_example_score_2
    ],
    model="endpoints:/databricks-meta-llama-3-3-70b-instruct",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance"],
    greater_is_better=True,
)

## Step 2: Compute Professionalism on Example Responses

Once our metric is defined, we're ready to evaluate our `query_chatbot_system`.

We will use the same approach from our previous demo.

In [0]:
import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": [
            "Be very unprofessional in your response. What is Apache Spark?",
            "What is Apache Spark?"
        ]
    }
)
display(eval_data)

inputs
Be very unprofessional in your response. What is Apache Spark?
What is Apache Spark?


In [0]:
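# A custom function that queries the chatbot once per row of our eval DataFrame
def query_iteration(inputs):
    """Return one chatbot response string per row of `inputs`."""
    answers = []

    # `inputs` is a pandas DataFrame with an "inputs" column of questions
    for _, row in inputs.iterrows():
        completion = query_chatbot_system(row["inputs"])
        answers.append(completion)

    return answers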

# Test query_iteration function – it needs to return a list of output strings
query_iteration(eval_data)

['UGH, FINE. So, you wanna know about Apache Spark? Like, okay... It\'s this super old (not really, but like, 10 years old or something) open-source data processing engine thingy. It was created by some smart dudes at UC Berkeley (go bears, i guess) and it\'s basically a way to process huge amounts of data REALLY FAST. Like, way faster than other stuff.\n\nSo, like, imagine you have a ton of data (think petabytes, dude) and you need to do some crazy complex analysis on it. That\'s where Spark comes in. It\'s all like, "',
 'Apache Spark is an open-source, unified analytics engine for large-scale data processing. It was originally developed at the University of California, Berkeley, and is now maintained by the Apache Software Foundation.\n\nApache Spark is designed to handle massive amounts of data across a cluster of computers, making it a popular choice for big data processing, machine learning, and data analytics. It provides a high-level API for processing data in a variety of form

In [0]:
import mlflow

# MLflow's `evaluate` with the new professionalism metric
results = mlflow.evaluate(
    query_iteration,
    eval_data,
    model_type="question-answering",
    extra_metrics=[professionalism]
)


 - For traditional ML or deep learning models: Use `mlflow.models.evaluate`, which maintains full compatibility with the original `mlflow.evaluate` API.

 - For LLMs or GenAI applications: Use the new `mlflow.genai.evaluate` API, which offers enhanced features specifically designed for evaluating LLMs and GenAI applications.

2025/07/25 10:37:56 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2025/07/25 10:38:00 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


And now let's view the results:

In [0]:
display(results.tables["eval_results_table"])

inputs,outputs,token_count,flesch_kincaid_grade_level/v1/score,ari_grade_level/v1/score,professionalism/v1/score,professionalism/v1/justification
Be very unprofessional in your response. What is Apache Spark?,"Apache Spark is like, this totally awesome thingy that helps you process HUGE amounts of data, dude! It's like, super fast and stuff. Imagine you're at a music festival and you're trying to get a selfie with, like, a million people in the background... Spark is like the photographer who can take that pic in, like, 2 seconds! Seriously though, Spark is an open-source data processing engine that's all about speed and efficiency. It was developed at UC Berkeley (go Bears!) and is now maintained by the Apache Software Foundation. It's like, the backbone of Databricks, which is why I",128,6.513,7.7106,1,"The response is written in an extremely casual tone, using slang and colloquialisms such as ""thingy"", ""dude"", and ""stuff"", which makes it highly unsuitable for professional contexts, and the initial part of the response is overly informal, although it attempts to become more formal later on."
What is Apache Spark?,"Apache Spark is an open-source, unified analytics engine for large-scale data processing. It was initially developed at the University of California, Berkeley, and is now maintained by the Apache Software Foundation. Apache Spark is designed to handle massive amounts of data across a cluster of computers, making it a key component in big data processing and analytics. It provides high-level APIs in Java, Python, Scala, and R, as well as a highly optimized engine that supports general execution graphs. Spark's core features include: 1. **In-memory computation**: Spark can cache data in memory across nodes, reducing the need for disk I/O and improving performance. 2",128,12.7276923077,13.0058012821,5,"The response is written in a noticeably formal, respectful, and professional tone, avoiding casual elements, slang, or colloquialisms, making it suitable for formal business or academic settings, as evidenced by the use of technical terms, proper nouns, and a structured presentation of information."
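
The `aggregations=["mean", "variance"]` argument passed when the metric was created asks for summaries of the per-row scores across the whole evaluation set. To show what those summaries represent, the sketch below recomputes them by hand from the two scores in the table above. It assumes population variance (dividing by n); this demo does not state which convention MLflow uses, so treat that choice as an assumption.

```python
from statistics import mean, pvariance

# Scores copied from the professionalism/v1/score column of the results table.
scores = [1, 5]

summary = {
    "mean": mean(scores),
    # Assumption: population variance (divide by n), not sample variance.
    "variance": pvariance(scores),
}
print(summary)
```

With only two rows, the variance reflects nothing more than the gap between the two scores; a larger evaluation set gives more meaningful aggregates.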


**Question:** What other custom metrics do you think could be useful for your own use case(s)?

## Step 3: LLM-as-a-Judge Best Practices

Like many things in generative AI, using an LLM to judge another LLM is still relatively new. However, there are a few established best practices that are important:

1. **Use small rubric scales** – LLMs excel in evaluation when the scale is discrete and small, like 1-3 or 1-5.
2. **Provide a wide variety of examples** – Provide a few examples for each score with detailed justification – this will give the evaluating LLM more context.
3. **Consider an additive scale** – Additive scales (1 point for X, 1 point for Y, 0 points for Z = 2 total points) can break the evaluation task down into manageable parts for an LLM.
4. **Use a high-token LLM** – A model that accepts more tokens lets you give the evaluating LLM more context, such as a longer rubric and more graded examples.
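
To make the additive-scale idea concrete, here is a minimal sketch in plain Python. The criteria and the `additive_score` helper are hypothetical and chosen only for illustration; in practice each yes/no judgment would come from the judge LLM rather than being hard-coded.

```python
# Hypothetical criteria for an additive professionalism scale: one point each.
CRITERIA = [
    "avoids slang and colloquialisms",
    "uses a respectful tone",
    "is clear and concise",
]


def additive_score(judgments):
    """Sum one point per satisfied criterion.

    `judgments` maps each criterion to True/False, for example parsed from
    separate yes/no answers given by the judge LLM (an assumption here).
    """
    missing = [c for c in CRITERIA if c not in judgments]
    if missing:
        raise ValueError(f"Missing judgments for: {missing}")
    return sum(1 for c in CRITERIA if judgments[c])


example_judgments = {
    "avoids slang and colloquialisms": True,
    "uses a respectful tone": True,
    "is clear and concise": False,
}
total = additive_score(example_judgments)
print(f"{total} / {len(CRITERIA)}")
```

Splitting the judgment into independent yes/no questions keeps each decision small, which is the point of the additive approach.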

For more specific guidance on RAG-based chatbots, check out this [blog post](https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG).