# Metric Analysis

This vignette demonstrates accessing the results of a completed Eval Run.

Author: Zachary Levonian \
Date: July 2025

## Part 1: Running FlexEval to compute some metrics

We'll create some test data, build an eval, and execute it.

In [1]:
import dotenv
assert dotenv.load_dotenv("../.env"), "This vignette assumes access to API keys in a .env file."

### Generating test data

Let's evaluate the quality of grade-appropriate explanations.

In [2]:
concepts = ["integer addition", "factoring polynomials", "logistic regression"]
grades = ["3rd", "5th", "7th", "9th"]

user_queries = []
for concept in concepts:
    for grade in grades:
        user_queries.append(
            f"Concisely summarize {concept} at the United States {grade}-grade level."
        )
len(user_queries)

12

We can imagine that our system under test involves a particular system prompt, or perhaps multiple candidate prompts.

In this case, we'll imagine a single, simple system prompt.

In [3]:
system_prompt = """You are a friendly math tutor.

You attempt to summarize any mathematical topic the student is interested in, even if it's not appropriate for their grade level."""

In [4]:
# convert to JSONL
import json
from pathlib import Path

concept_queries_path = Path("concept_queries.jsonl")
with open(concept_queries_path, "w") as outfile:
    for user_query in user_queries:
        # each line holds one conversation: the system prompt plus one user query
        record = {
            "input": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_query},
            ]
        }
        outfile.write(json.dumps(record) + "\n")

Each line of `concept_queries.jsonl` will become a unique {class}`~flexeval.classes.thread.Thread` to be processed.
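Before handing a file like this to FlexEval, it can help to read it back and confirm that every line parses and has the expected message roles. The following is a minimal, standard-library sketch of that round trip. The helper names (`write_queries`, `read_queries`) and the shortened example prompt and queries are hypothetical and used only for illustration; they are not part of FlexEval.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-ins for the system prompt and queries built above.
example_system_prompt = "You are a friendly math tutor."
example_queries = [
    "Concisely summarize integer addition at the United States 3rd-grade level.",
    "Concisely summarize logistic regression at the United States 9th-grade level.",
]


def write_queries(path, system_prompt, queries):
    """Write one JSON object per line, each holding a system and a user message."""
    with open(path, "w") as outfile:
        for query in queries:
            record = {
                "input": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": query},
                ]
            }
            outfile.write(json.dumps(record) + "\n")


def read_queries(path):
    """Parse the JSONL file back into a list of dicts, skipping blank lines."""
    with open(path) as infile:
        return [json.loads(line) for line in infile if line.strip()]


# Use a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = Path(tmp_dir) / "example_queries.jsonl"
    write_queries(example_path, example_system_prompt, example_queries)
    records = read_queries(example_path)

# Collect the role sequence of each record to confirm the expected structure.
roles = [[message["role"] for message in record["input"]] for record in records]
print(len(records), roles[0])
```

The same `read_queries` check could be pointed at `concept_queries.jsonl` to confirm it contains one record per user query.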

Now that we have test data, we can build a FlexEval configuration and execute it.

### Defining an Eval

An Eval describes the computations that need to happen to compute the required metrics.

In this case, we'll set a few details:
 - We want to generate new LLM completions rather than reuse any assistant messages already in our threads. To do that, we'll set {attr}`~flexeval.schema.eval_schema.Eval.do_completion` to true and choose a completion function from those provided in {mod}`flexeval.configuration.completion_functions`. In this case, we'll use {func}`~flexeval.configuration.completion_functions.litellm_completion`, which enables access to many different models.
 - We'll compute two {class}`~flexeval.schema.eval_schema.FunctionItem`s: a Flesch reading ease score and {meth}`~flexeval.configuration.function_metrics.is_role`. We need `is_role` because its value lets us compute other metrics only for assistant messages (like the new completions we'll be generating).
 - Finally, we can specify a custom {class}`~flexeval.schema.eval_schema.RubricItem`. We'll write a prompt that describes the assessment we want to make. In this case, we try to determine if the assistant response is grade appropriate.
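For intuition about the Flesch reading ease score mentioned above, here is a minimal sketch of the standard formula, which combines average sentence length with average syllables per word. It takes pre-computed counts as input; how FlexEval's `flesch_reading_ease` metric actually tokenizes text and counts syllables is not shown in this vignette, so treat this as an illustration of the formula rather than of the library's implementation.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Standard Flesch reading ease formula; higher scores mean easier text."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("total_words and total_sentences must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Short sentences built from short words score high (easy to read).
easy_score = flesch_reading_ease(total_words=10, total_sentences=2, total_syllables=12)

# One long sentence with many multi-syllable words scores low (hard to read).
hard_score = flesch_reading_ease(total_words=30, total_sentences=1, total_syllables=60)

print(round(easy_score, 2), round(hard_score, 2))
```

Because the score rewards short sentences and short words, we would expect a good 3rd-grade explanation to score higher than a 9th-grade one.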

In [12]:
import flexeval
from flexeval.schema import Eval, Rubric, GraderLlm, DependsOnItem, Metrics, FunctionItem, RubricItem, CompletionLlm

In [13]:
# Because we specify an OpenAI model name here, OPENAI_API_KEY must exist in our environment variables or in our .env file.
completion_llm = CompletionLlm(
    function_name="litellm_completion",
    kwargs={"model": "gpt-4o-mini", "mock_response": "I can't help with that!"},
)

In [6]:
bad_rubric_prompt = """Read the following input and output, assessing if the output is grade-appropriate.
[Input]: {context}
[Output]: {completion}

On a new line after your explanation, print:
- YES if the Output is fully appropriate for the grade level
- SOMEWHAT if the Output uses some language or concepts that would be inappropriate for that grade level
- NO if the Output would be mostly incomprehensible to a student at that grade level

Only print YES, SOMEWHAT, or NO on the final line.
"""
rubric = Rubric(
    prompt=bad_rubric_prompt,
    choice_scores={"YES": 2, "SOMEWHAT": 1, "NO": 0},
)
grader_llm = GraderLlm(
    function_name="litellm_completion",
    kwargs={"model": "gpt-4o-mini"},
)
rubrics = {
    "is_grade_appropriate": rubric,
}
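The rubric prompt asks the grader to finish with a single choice on its final line, and `choice_scores` assigns a number to each choice. The sketch below shows one way such an output could be turned into a score. The helper `score_from_grader_output` is hypothetical and is not FlexEval's actual parsing logic, which may handle formatting differently.

```python
example_choice_scores = {"YES": 2, "SOMEWHAT": 1, "NO": 0}


def score_from_grader_output(grader_output, choice_scores):
    """Return the score for the choice on the last non-empty line, or None.

    Hypothetical helper for illustration; not part of FlexEval.
    """
    lines = [line.strip() for line in grader_output.splitlines() if line.strip()]
    if not lines:
        return None
    # Compare case-insensitively so "yes" and "YES" are treated alike.
    return choice_scores.get(lines[-1].upper())


example_output = (
    "The explanation mentions odds and probabilities without defining them.\n"
    "SOMEWHAT\n"
)
print(score_from_grader_output(example_output, example_choice_scores))
```

Returning `None` for an unrecognized final line keeps malformed grader outputs distinguishable from a legitimate score of 0.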

In [8]:
is_assistant_dependency = DependsOnItem(
    name="is_role", kwargs={"role": "assistant"}, metric_min_value=1
)
eval = Eval(
    name="grade_appropriateness",
    metrics=Metrics(
        function=[
            FunctionItem(name="is_role", kwargs={"role": "assistant"}),
            FunctionItem(
                name="flesch_reading_ease",
                depends_on=[is_assistant_dependency],
            ),
        ],
        rubric=[RubricItem(name="is_grade_appropriate", depends_on=[is_assistant_dependency])],
    ),
    grader_llm=grader_llm,
    do_completion=True,
    completion_llm=completion_llm,
)
eval

Eval(do_completion=True, name='grade_appropriateness', notes='', metrics=Metrics(function=[FunctionItem(name='is_role', depends_on=[], metric_level='Turn', kwargs={'role': 'assistant'}), FunctionItem(name='flesch_reading_ease', depends_on=[DependsOnItem(name='is_role', type=None, kwargs={'role': 'assistant'}, metric_name=None, metric_level=None, relative_object_position=0, metric_min_value=1.0, metric_max_value=1.7976931348623157e+308)], metric_level='Turn', kwargs={})], rubric=[RubricItem(name='is_grade_appropriate', depends_on=[DependsOnItem(name='is_role', type=None, kwargs={'role': 'assistant'}, metric_name=None, metric_level=None, relative_object_position=0, metric_min_value=1.0, metric_max_value=1.7976931348623157e+308)], metric_level='Turn', kwargs={})]), completion_llm=None, grader_llm=GraderLlm(function_name='openai_completion', kwargs={'model': 'gpt-4o-mini'}))

### Building an EvalRun

In [9]:
from flexeval.schema import Config, EvalRun, FileDataSource, RubricsCollection

In [10]:
input_data_sources = [FileDataSource(path=concept_queries_path)]
output_path = Path("eval_results.db")
config = Config(clear_tables=True)
eval_run = EvalRun(
    data_sources=input_data_sources,
    database_path=output_path,
    eval=eval,
    config=config,
    rubric_paths=[RubricsCollection(rubrics=rubrics)],
)
eval_run

EvalRun(data_sources=[FileDataSource(name=None, notes=None, path=PosixPath('concept_queries.jsonl'), format='jsonl')], database_path=PosixPath('eval_results.db'), eval=Eval(do_completion=True, name='grade_appropriateness', notes='', metrics=Metrics(function=[FunctionItem(name='is_role', depends_on=[], metric_level='Turn', kwargs={'role': 'assistant'}), FunctionItem(name='flesch_reading_ease', depends_on=[DependsOnItem(name='is_role', type=None, kwargs={'role': 'assistant'}, metric_name=None, metric_level=None, relative_object_position=0, metric_min_value=1.0, metric_max_value=1.7976931348623157e+308)], metric_level='Turn', kwargs={})], rubric=[RubricItem(name='is_grade_appropriate', depends_on=[DependsOnItem(name='is_role', type=None, kwargs={'role': 'assistant'}, metric_name=None, metric_level=None, relative_object_position=0, metric_min_value=1.0, metric_max_value=1.7976931348623157e+308)], metric_level='Turn', kwargs={})]), completion_llm=None, grader_llm=GraderLlm(function_name='op

### Running the EvalRun

Once we've built an EvalRun, running it is easy: we can just use {func}`~flexeval.runner.run`!

In [11]:
_ = flexeval.run(eval_run)

An error occurred generating completions.
Traceback (most recent call last):
  File "/Users/zacharylevonian/repos/FlexEval/src/flexeval/runner.py", line 118, in run
    completion = turn.get_completion(
  File "/Users/zacharylevonian/repos/FlexEval/src/flexeval/classes/turn.py", line 37, in get_completion
    if self.is_final_turn_in_input:
AttributeError: 'Turn' object has no attribute 'is_final_turn_in_input'


## Part 2: Analyzing our results

We'll analyze the data we created in Part 1.

In [11]:
import pandas as pd
import matplotlib.pyplot as plt

In [12]:
from flexeval.metrics import access as metric_access

In [13]:
for metric in metric_access.get_all_metrics():
    print(
        f"{metric['thread']} {metric['turn']} {metric['metric_name']} {metric['metric_value']}"
    )


1 1 assistant 0.0
2 2 assistant 0.0
3 3 assistant 0.0
4 4 assistant 0.0
5 5 assistant 0.0
6 6 assistant 0.0
7 7 assistant 0.0
8 8 assistant 0.0
9 9 assistant 0.0
10 10 assistant 0.0
11 11 assistant 0.0
12 12 assistant 0.0
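Once metric rows are available as dictionaries, a natural next step is to summarize them per metric. The sketch below averages `metric_value` by `metric_name` using only the standard library. The rows in `example_metrics` are made up for illustration and are not results from this run; only the key names (`thread`, `turn`, `metric_name`, `metric_value`) come from the loop above, and `mean_by_metric` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration only; the keys match those printed above.
example_metrics = [
    {"thread": 1, "turn": 1, "metric_name": "flesch_reading_ease", "metric_value": 70.0},
    {"thread": 2, "turn": 2, "metric_name": "flesch_reading_ease", "metric_value": 50.0},
    {"thread": 1, "turn": 1, "metric_name": "is_grade_appropriate", "metric_value": 2.0},
    {"thread": 2, "turn": 2, "metric_name": "is_grade_appropriate", "metric_value": 1.0},
]


def mean_by_metric(metrics):
    """Group metric values by metric name and return the mean of each group."""
    values_by_name = defaultdict(list)
    for metric in metrics:
        values_by_name[metric["metric_name"]].append(metric["metric_value"])
    return {name: mean(values) for name, values in values_by_name.items()}


for name, average in mean_by_metric(example_metrics).items():
    print(f"{name}: {average}")
```

With pandas, the same summary could be written as `pd.DataFrame(rows).groupby("metric_name")["metric_value"].mean()`, assuming `rows` is a list of dictionaries with these keys.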
