# User Story for Calculating ELO Scores of QA Configurations for Ranking 

As a user of the Intelligence Layer (IL), I want to evaluate how well different configurations perform on a QA task with the given input data.
A configuration is a combination of a model with a fixed set of parameters.
In the following, we focus on comparing setups which differ only in the chosen model.

We provide multiple inputs, each consisting of a longer text and a question related to that text, as well as the expected answer.
A Llama model is used as a grader to decide which of two answers from different models is better.
The aggregation of all comparisons results in [ELO](https://en.wikipedia.org/wiki/Elo_rating_system) scores and win rates of the models.
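To make the two metrics concrete, here is a minimal sketch of the textbook ELO update and a simple win-rate tally. The starting rating of 1500, the K-factor of 32, the rule that a draw counts as half a win, and the names `model-x` and `model-y` are assumptions for illustration; the aggregation used by the IL may differ in all of these.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both ratings after one match; score_a is 1 (win), 0.5 (draw) or 0 (loss)."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical match list: (player_a, player_b, score of player_a).
matches = [
    ("model-x", "model-y", 1.0),  # model-x wins
    ("model-x", "model-y", 0.5),  # draw
]

ratings = defaultdict(lambda: 1500.0)  # assumed starting rating
points = defaultdict(float)  # accumulated scores per player
played = defaultdict(int)  # number of matches per player

for a, b, score_a in matches:
    ratings[a], ratings[b] = update_ratings(ratings[a], ratings[b], score_a)
    points[a] += score_a
    points[b] += 1.0 - score_a
    played[a] += 1
    played[b] += 1

# Win rate under the assumption that a draw counts as half a win.
win_rates = {player: points[player] / played[player] for player in played}

print(dict(ratings))
print(win_rates)
```

Because every update adds to one rating exactly what it subtracts from the other, the sum of all ratings stays constant in this sketch.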

In this notebook, we go through the following steps: First, we create a set of examples of texts with a relevant question for each (Step 0), after which we use the models to generate answers (Step 1). The given answers are then compared against each other and judged by the Llama model (Step 2), which will result in a final ELO ranking and win rate (Step 3). Lastly, we include a new model in the evaluation without having to re-evaluate the previous models against each other, as is typically done in ELO rankings (Step 4).

## Evaluating QA use-cases

Before we can begin, we need to load the Aleph-Alpha access token from the environment and create the client.

In [1]:
from os import getenv

from aleph_alpha_client import Client
from dotenv import load_dotenv

from intelligence_layer.connectors import LimitedConcurrencyClient
from intelligence_layer.evaluation.evaluation.evaluator.elo_evaluator import Matches

load_dotenv()

aa_client = Client(getenv("AA_TOKEN"))
limited_concurrency_client = LimitedConcurrencyClient(aa_client)

# Step 0 – Data set

During the four steps of determining the ELO scores, we make use of the following four repositories for managing the intermediate data.

First, we create and store an input dataset into a so-called `dataset_repository`.

The IL will read the input dataset and produce outputs for each model, which will be stored in a `run_repository`.

The result from the previous step can now be evaluated, in this case with an ELO evaluator (an `IncrementalEvaluator` using `IncrementalEloQaEvaluationLogic`). The evaluation is stored in the `evaluation_repository`.

Finally, the evaluations are aggregated and stored in the `aggregation_repository`. The aggregation contains the ELO score and winning rate of each model along with additional metadata.

In [2]:
from intelligence_layer.evaluation import (
    InMemoryAggregationRepository,
    InMemoryDatasetRepository,
    InMemoryEvaluationRepository,
    InMemoryRunRepository,
)

dataset_repository = InMemoryDatasetRepository()
run_repository = InMemoryRunRepository()
evaluation_repository = InMemoryEvaluationRepository()
aggregation_repository = InMemoryAggregationRepository()

Here, we fill the `dataset_repository` with two `Example`s. Each `Example` contains a text, a question regarding said text, as well as an expected answer.
The newly created dataset in the repository has a unique id, which is stored in the `dataset_id` variable.

In [3]:
from intelligence_layer.core import Language
from intelligence_layer.evaluation import Example
from intelligence_layer.examples.qa.single_chunk_qa import SingleChunkQaInput

qa_input_text_1 = """Surface micromachining

Surface micromachining builds microstructures by deposition and etching structural layers over a substrate.[1] This is different from Bulk micromachining, in which a silicon substrate wafer is selectively etched to produce structures.

Layers

Generally, polysilicon is used as one of the substrate layers while silicon dioxide is used as a sacrificial layer. The sacrificial layer is removed or etched out to create any necessary void in the thickness direction. Added layers tend to vary in size from 2-5 micrometres. The main advantage of this machining process is the ability to build electronic and mechanical components (functions) on the same substrate. Surface micro-machined components are smaller compared to their bulk micro-machined counterparts.

As the structures are built on top of the substrate and not inside it, the substrate's properties are not as important as in bulk micro-machining. Expensive silicon wafers can be replaced by cheaper substrates, such as glass or plastic. The size of the substrates may be larger than a silicon wafer, and surface micro-machining is used to produce thin-film transistors on large area glass substrates for flat panel displays. This technology can also be used for the manufacture of thin film solar cells, which can be deposited on glass, polyethylene terepthalate substrates or other non-rigid materials.

Fabrication process

Micro-machining starts with a silicon wafer or other substrate upon which new layers are grown. These layers are selectively etched by photo-lithography; either a wet etch involving an acid, or a dry etch involving an ionized gas (or plasma). Dry etching can combine chemical etching with physical etching or ion bombardment. Surface micro-machining involves as many layers as are needed with a different mask (producing a different pattern) on each layer. Modern integrated circuit fabrication uses this technique and can use as many as 100 layers. Micro-machining is a younger technology and usually uses no more than 5 or 6 layers. Surface micro-machining uses developed technology (although sometimes not enough for demanding applications) which is easily repeatable for volume production."""

example_1 = Example(
    input=SingleChunkQaInput(
        chunk=qa_input_text_1,
        question="What is micromachining?",
        language=Language("en"),
    ),
    expected_output="Surface micromachining builds microstructures by deposition and etching structural layers over a substrate. This is different from Bulk micromachining, in which a silicon substrate wafer is selectively etched to produce structures.",
)

qa_input_text_2 = """
Silicon is a chemical element; it has symbol Si and atomic number 14. It is a hard, brittle crystalline solid with a blue-grey metallic luster, and is a non metal and semiconductor. It is a member of group 14 in the periodic table: carbon is above it; and germanium, tin, lead, and flerovium are below it. It is relatively unreactive.

Because of its high chemical affinity for oxygen, it was not until 1823 that Jöns Jakob Berzelius was first able to prepare it and characterize it in pure form. Its oxides form a family of anions known as silicates. Its melting and boiling points of 1414 °C and 3265 °C, respectively, are the second highest among all the metalloids and nonmetals, being surpassed only by boron.[a]

Silicon is the eighth most common element in the universe by mass, but very rarely occurs as the pure element in the Earth's crust. It is widely distributed in space in cosmic dusts, planetoids, and planets as various forms of silicon dioxide (silica) or silicates. More than 90% of the Earth's crust is composed of silicate minerals, making silicon the second most abundant element in the Earth's crust (about 28% by mass), after oxygen. 
"""

example_2 = Example(
    input=SingleChunkQaInput(
        chunk=qa_input_text_2, question="What is silicon?", language=Language("en")
    ),
    expected_output="Silicon is a chemical element.",
)

examples = [example_1, example_2]

In [4]:
dataset_id = dataset_repository.create_dataset(
    examples=examples, dataset_name="My-test-dataset"
).id

In [5]:
# ensure that we got a valid dataset ID
assert len(dataset_id) > 0, "The dataset repository returned an empty dataset ID"

Now that we have stored the examples in the `dataset_repository`, we can retrieve them by the `dataset_id`.

In [6]:
for example in dataset_repository.examples(dataset_id, SingleChunkQaInput, str):
    print(example)

Example ID = 87012cf8-1b98-4c9f-8a5f-815aa0e76edb
Input = chunk="Surface micromachining\n\nSurface micromachining builds microstructures by deposition and etching structural layers over a substrate.[1] This is different from Bulk micromachining, in which a silicon substrate wafer is selectively etched to produce structures.\n\nLayers\n\nGenerally, polysilicon is used as one of the substrate layers while silicon dioxide is used as a sacrificial layer. The sacrificial layer is removed or etched out to create any necessary void in the thickness direction. Added layers tend to vary in size from 2-5 micrometres. The main advantage of this machining process is the ability to build electronic and mechanical components (functions) on the same substrate. Surface micro-machined components are smaller compared to their bulk micro-machined counterparts.\n\nAs the structures are built on top of the substrate and not inside it, the substrate's properties are not as important as in bulk micro-machini

# Step 1 – Run Models

Given a `dataset_repository` with examples, we can now generate the output of the models for all examples.
First, we have to define which models we want to use. In this example, we use the _"luminous-base-control"_ model and the _"luminous-supreme-control"_ model.
 
The previously created client is used to create a `Task` for each model. We use a `SingleChunkQa` task, meaning that in each task a model will give an answer to a question regarding a single chunk of text.
These tasks are executed by a `Runner`, using the input dataset via the previously stored `dataset_id`.

Tasks require a `run_repository` to store the output. The generated output is automatically stored when calling `run_dataset` on the `runners`. The output for each model will have a unique _"run id"_.
In general, the output for each model consists of two parts. One part is a collection of example outputs. Each example output contains the `run_id`, the `example_id`, and a field `output`. In this specific use-case, the `output` field contains the `answer` to the question. The other part is a _"run overview"_ with the run id stored as `id`, the `dataset_id`, and a description of the task, plus other metadata.

In [7]:
from intelligence_layer.core import LuminousControlModel
from intelligence_layer.evaluation.run.runner import Runner
from intelligence_layer.examples.qa.single_chunk_qa import (
    SingleChunkQa,
    SingleChunkQaOutput,
)

models = [
    LuminousControlModel(name="luminous-base-control-20240215", client=aa_client),
    LuminousControlModel(name="luminous-supreme-control-20240215", client=aa_client),
]

for model in models:
    runner = Runner[SingleChunkQaInput, SingleChunkQaOutput](
        task=SingleChunkQa(model=model),
        dataset_repository=dataset_repository,
        run_repository=run_repository,
        description=f"QA with model {model.name}",
    )
    runner.run_dataset(dataset_id)

Running: 0it [00:00, ?it/s]

Running: 2it [00:21, 10.62s/it]
Running: 2it [00:17,  8.68s/it]


In [8]:
# ensure that all examples succeeded
for run_overview in run_repository.run_overviews():
    assert (
        run_overview.failed_example_count == 0
    ), f"There are failed runs for run overview ID {run_overview.id}"

The overviews and outputs can be retrieved via the unique run ids:

In [9]:
print(
    f"Run overview IDs saved in the run repository: {run_repository.run_overview_ids()}\n"
)

for run_overview in run_repository.run_overviews():
    print(run_overview)
    for example_output in run_repository.example_outputs(
        run_overview.id, SingleChunkQaOutput
    ):
        print(example_output)

Run overview IDs saved in the run repository: ['74d288d3-7536-4a40-bfe4-ccc9b9e107b0', 'a63e12f6-ebf5-4c4d-a9b6-d58a45f07257']

Run Overview ID = 74d288d3-7536-4a40-bfe4-ccc9b9e107b0
Dataset ID = d99a3aff-619d-422b-8d69-daf5513b3942
Start time = 2024-05-16 08:30:42.168784+00:00
End time = 2024-05-16 08:31:03.416546+00:00
Failed example count = 0
Successful example count = 2
Description = "QA with model luminous-base-control-20240215"

Example ID=87012cf8-1b98-4c9f-8a5f-815aa0e76edb
Related Run ID=74d288d3-7536-4a40-bfe4-ccc9b9e107b0
Output="answer='Micromachining is a process of building microstructures by deposition and etching structural layers over a substrate.' highlights=[ScoredTextHighlight(start=24, end=131, score=1.0)]"

Example ID=d91ae806-31bf-4e92-a40f-6a560dc68a95
Related Run ID=74d288d3-7536-4a40-bfe4-ccc9b9e107b0
Output="answer='Silicon is a chemical element with symbol Si and atomic number 14. It is a hard, brittle crystalline solid with a blue-grey metallic luster, and 

# Step 2 – Run Evaluation

Now that we have generated the answers of all models for all examples in the `dataset_repository`, the next step is to evaluate those answers.
The evaluation is done by an `Evaluator`. Here we are interested in the ELO score, so we use an `IncrementalEvaluator` together with the `IncrementalEloQaEvaluationLogic`. For each example, this logic takes the answers of two different models and uses Llama to decide which answer is better. You can also implement your own `Evaluator` to exactly match your use case.

In [10]:
# this should demonstrate that there are no stored evaluations yet in our repository
print(f"IDs of stored evaluations: {evaluation_repository.evaluation_overview_ids()}")

IDs of stored evaluations: []


In [11]:
from intelligence_layer.core.model import Llama3InstructModel
from intelligence_layer.evaluation import IncrementalEvaluator
from intelligence_layer.evaluation.evaluation.evaluator.incremental_elo_evaluator import IncrementalEloQaEvaluationLogic

elo_qa_evaluation_logic = IncrementalEloQaEvaluationLogic(
    model=Llama3InstructModel(name="llama-3-8b-instruct")
)

evaluator = IncrementalEvaluator(
    dataset_repository=dataset_repository,
    run_repository=run_repository,
    evaluation_repository=evaluation_repository,
    description="ELO QA evaluation",  # this description will be used later to query for specific evaluations
    incremental_evaluation_logic=elo_qa_evaluation_logic,
)

In [12]:
evaluation_overview = evaluator.evaluate_runs(*run_repository.run_overview_ids())

PAIRS <itertools.combinations object at 0x33ff72890>
PLAYER A:  74d288d3-7536-4a40-bfe4-ccc9b9e107b0
PLAYER B:  a63e12f6-ebf5-4c4d-a9b6-d58a45f07257
example_id:  87012cf8-1b98-4c9f-8a5f-815aa0e76edb
______________________
PAIRS <itertools.combinations object at 0x33ff729d0>
PLAYER A:  74d288d3-7536-4a40-bfe4-ccc9b9e107b0
PLAYER B:  a63e12f6-ebf5-4c4d-a9b6-d58a45f07257
example_id:  d91ae806-31bf-4e92-a40f-6a560dc68a95
______________________


Evaluating: 2it [00:00,  2.98it/s]
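The `PAIRS`, `PLAYER A` and `PLAYER B` lines in the output above suggest that each example is played once per unordered pair of runs. A minimal sketch of that pairing with `itertools.combinations` follows; the variable name `planned_matches` is hypothetical, and the IL's internal scheduling may differ.

```python
from itertools import combinations

# Run IDs and example IDs as they appear in the output above.
run_ids = [
    "74d288d3-7536-4a40-bfe4-ccc9b9e107b0",
    "a63e12f6-ebf5-4c4d-a9b6-d58a45f07257",
]
example_ids = [
    "87012cf8-1b98-4c9f-8a5f-815aa0e76edb",
    "d91ae806-31bf-4e92-a40f-6a560dc68a95",
]

# One match per (example, unordered pair of runs).
planned_matches = [
    (example_id, player_a, player_b)
    for example_id in example_ids
    for player_a, player_b in combinations(run_ids, 2)
]

for example_id, player_a, player_b in planned_matches:
    print(f"example {example_id}: {player_a} vs {player_b}")
```

With two runs there is a single pair, so two examples yield two matches. The number of pairs grows quadratically with the number of runs (n runs give n * (n - 1) / 2 pairs), which is why re-evaluating everything becomes expensive as models are added.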


In [13]:
# ensure that for each example there are evaluated comparisons


for example_evaluation in evaluation_repository.example_evaluations(
    evaluation_overview.id, Matches
):
    assert isinstance(example_evaluation.result, Matches)
    assert (
        len(example_evaluation.result.comparison_evaluations) > 0
    ), f"There are no matches (comparisons) for example ID {example_evaluation.example_id}"

The evaluation results can be retrieved via their unique ids:

In [14]:
for evaluation_overview in evaluation_repository.evaluation_overviews():
    print(evaluation_overview)

Evaluation Overview ID = 12e1b780-3360-4801-990c-512b9fb524a7
Start time = 2024-05-16 08:31:20.812654+00:00
End time = 2024-05-16 08:31:21.501250+00:00
Successful examples = 2
Failed examples = 0
Description = "ELO QA evaluation"
Run Overviews={
Run Overview ID = 74d288d3-7536-4a40-bfe4-ccc9b9e107b0
Dataset ID = d99a3aff-619d-422b-8d69-daf5513b3942
Start time = 2024-05-16 08:30:42.168784+00:00
End time = 2024-05-16 08:31:03.416546+00:00
Failed example count = 0
Successful example count = 2
Description = "QA with model luminous-base-control-20240215"
, Run Overview ID = a63e12f6-ebf5-4c4d-a9b6-d58a45f07257
Dataset ID = d99a3aff-619d-422b-8d69-daf5513b3942
Start time = 2024-05-16 08:31:03.417036+00:00
End time = 2024-05-16 08:31:20.775926+00:00
Failed example count = 0
Successful example count = 2
Description = "QA with model luminous-supreme-control-20240215"
}



# Step 3 – Run Aggregation

Finally, all individual evaluations are aggregated into metrics for each model - here, an ELO score and a win rate.
The aggregation is performed by an `Aggregator` with the `MatchesAggregationLogic`, which consumes the evaluation created in Step 2.

In [15]:
# this should demonstrate that there are no stored aggregated evaluations yet in our repository
print(
    f"IDs of stored aggregated evaluations: {aggregation_repository.aggregation_overview_ids()}"
)

IDs of stored aggregated evaluations: []


In [16]:
from intelligence_layer.evaluation import Aggregator
from intelligence_layer.evaluation.aggregation.elo_aggregation import (
    MatchesAggregationLogic,
)

aggregator = Aggregator(
    evaluation_repository=evaluation_repository,
    aggregation_repository=aggregation_repository,
    description="ELO QA aggregation",
    aggregation_logic=MatchesAggregationLogic(),
)

aggregated_evaluation = aggregator.aggregate_evaluation(evaluation_overview.id)

In [17]:
# ensure that there are no failed (aggregated) evaluations
assert (
    aggregated_evaluation.crashed_during_evaluation_count == 0
), f"There are crashed evaluations for aggregated evaluation ID {aggregated_evaluation.id}"
assert (
    aggregated_evaluation.failed_evaluation_count == 0
), f"There are failed evaluations for aggregated evaluation ID {aggregated_evaluation.id}"
# ensure that the result contains ELO scores
assert hasattr(
    aggregated_evaluation.statistics, "scores"
), f"There are no scores for aggregated evaluation ID {aggregated_evaluation.id}"

We can get an overview of each aggregation from the aggregation repository as follows. Note that we need to provide the type of the aggregation to enable the deserialization. The given `statistics` field of the evaluation result contains only the aggregated metrics for each model. 

In [18]:
from intelligence_layer.evaluation import AggregatedEvaluation

for aggregation_overview in aggregation_repository.aggregation_overviews(
    AggregatedEvaluation
):
    print(aggregation_overview)

Aggregation Overview ID = 8374c433-e1b8-4601-a5db-109b76226da6
Start time = 2024-05-16 08:31:21.530772+00:00
End time = 2024-05-16 08:31:21.531775+00:00
Successful example count = 2
Count of examples crashed during evaluation = 0
Description = "ELO QA aggregation"
IDs of aggregated Evaluation Overviews = ['12e1b780-3360-4801-990c-512b9fb524a7']
IDs of aggregated Run Overviews = ['74d288d3-7536-4a40-bfe4-ccc9b9e107b0', 'a63e12f6-ebf5-4c4d-a9b6-d58a45f07257']
Statistics = {
scores={'74d288d3-7536-4a40-bfe4-ccc9b9e107b0': PlayerScore(elo=1500.034342091151, elo_standard_error=0.05713369977028134, win_rate=0.5, num_matches=2), 'a63e12f6-ebf5-4c4d-a9b6-d58a45f07257': PlayerScore(elo=1499.965657908849, elo_standard_error=0.05713369977028134, win_rate=0.5, num_matches=2)}
}
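The scores are keyed by run ID, which is hard to read. As a small sketch, the values printed above can be joined with the run descriptions from Step 1 and sorted into a ranking. `ScoreSketch` is a hypothetical stand-in that only mirrors some of the printed fields; it is not the library's `PlayerScore` class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreSketch:
    """Hypothetical stand-in mirroring fields of the printed scores."""

    elo: float
    win_rate: float
    num_matches: int


# Values copied from the aggregation output above, keyed by run ID.
scores = {
    "74d288d3-7536-4a40-bfe4-ccc9b9e107b0": ScoreSketch(1500.034342091151, 0.5, 2),
    "a63e12f6-ebf5-4c4d-a9b6-d58a45f07257": ScoreSketch(1499.965657908849, 0.5, 2),
}

# Run descriptions as printed in the run overviews of Step 1.
descriptions = {
    "74d288d3-7536-4a40-bfe4-ccc9b9e107b0": "QA with model luminous-base-control-20240215",
    "a63e12f6-ebf5-4c4d-a9b6-d58a45f07257": "QA with model luminous-supreme-control-20240215",
}

# Sort by ELO score, highest first.
ranking = sorted(scores.items(), key=lambda item: item[1].elo, reverse=True)

for rank, (run_id, score) in enumerate(ranking, start=1):
    print(
        f"{rank}. {descriptions[run_id]}: "
        f"elo={score.elo:.2f}, win_rate={score.win_rate:.2f}, matches={score.num_matches}"
    )
```

With only two matches per model, the two scores differ by less than 0.1 ELO points, so this ranking should not be read as a meaningful difference between the models.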



# Step 4 – Addition of New Models

Now let us consider the case where we want to add new models to our existing evaluation.
Since the comparison of answers is rather time-consuming, we want to avoid recalculating the evaluation for the previous models, and just compare the new models to the old ones.

To do so, we first define the new models _"luminous-base-control-20230501"_ and _"luminous-supreme-control-20230501"_, and generate their answers.

In [19]:
newly_added_models = [
    LuminousControlModel(name="luminous-base-control-20230501", client=aa_client),
    LuminousControlModel(name="luminous-supreme-control-20230501", client=aa_client),
]

for model in newly_added_models:
    runner = Runner[SingleChunkQaInput, SingleChunkQaOutput](
        task=SingleChunkQa(model=model),
        dataset_repository=dataset_repository,
        run_repository=run_repository,
        description=f"New QA with model {model.name}",  # used to query for new runs only later in the code
    )
    runner.run_dataset(dataset_id)

Running: 2it [00:10,  5.06s/it]
Running: 2it [00:17,  8.70s/it]


In [20]:
# ensure that all examples succeeded
for run_overview in run_repository.run_overviews():
    assert (
        run_overview.failed_example_count == 0
    ), f"There are failed runs for run overview ID {run_overview.id}"

In [21]:
for run_overview in run_repository.run_overviews():
    # skip runs done for previous models
    if not run_overview.description.startswith("New"):
        continue
    # print runs for the added models
    print(run_overview)

Run Overview ID = 1fe9f33e-4e07-4e4b-a258-891aae660991
Dataset ID = d99a3aff-619d-422b-8d69-daf5513b3942
Start time = 2024-05-16 08:31:21.542825+00:00
End time = 2024-05-16 08:31:31.664710+00:00
Failed example count = 0
Successful example count = 2
Description = "New QA with model luminous-base-control-20230501"

Run Overview ID = 9470d56c-e6f0-4e9c-af25-cbd381c36a3a
Dataset ID = d99a3aff-619d-422b-8d69-daf5513b3942
Start time = 2024-05-16 08:31:31.664870+00:00
End time = 2024-05-16 08:31:49.058983+00:00
Failed example count = 0
Successful example count = 2
Description = "New QA with model luminous-supreme-control-20230501"



To evaluate the new models against the previous models, we reuse the evaluator from Step 2 and call `evaluate_additional_runs`, which takes the additional parameter `previous_evaluation_ids`.
We pass the run IDs of all models together with the ID of the earlier evaluation, so that only comparisons that have not been evaluated yet are carried out.
This way, the previous models are not compared against each other again.
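The idea of skipping pairs that were already judged can be sketched in plain Python. This is only an illustration of the principle under simple assumptions; the run IDs are hypothetical placeholders, and the IL's own bookkeeping may work differently.

```python
from itertools import combinations

previous_runs = ["old-run-1", "old-run-2"]  # hypothetical IDs of already evaluated runs
new_runs = ["new-run-1", "new-run-2"]  # hypothetical IDs of the newly added runs

# Pairs that were already compared in the earlier evaluation.
already_compared = {frozenset(pair) for pair in combinations(previous_runs, 2)}

# Keep every pair of all runs that has not been compared yet.
pending_pairs = [
    pair
    for pair in combinations(previous_runs + new_runs, 2)
    if frozenset(pair) not in already_compared
]

for player_a, player_b in pending_pairs:
    print(f"{player_a} vs {player_b}")
```

Of the six possible pairs among four runs, one was already evaluated, so five remain. Each remaining pair contains at least one new run.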

In [22]:
run_repository.run_overview_ids()

['1fe9f33e-4e07-4e4b-a258-891aae660991',
 '74d288d3-7536-4a40-bfe4-ccc9b9e107b0',
 '9470d56c-e6f0-4e9c-af25-cbd381c36a3a',
 'a63e12f6-ebf5-4c4d-a9b6-d58a45f07257']

In [23]:
evaluation_overview.id

'12e1b780-3360-4801-990c-512b9fb524a7'

In [24]:



new_evaluation_overview = evaluator.evaluate_additional_runs(
    *run_repository.run_overview_ids(),
    previous_evaluation_ids=[evaluation_overview.id]
)

PAIRS <itertools.combinations object at 0x33ff9aac0>
PLAYER A:  9470d56c-e6f0-4e9c-af25-cbd381c36a3a
PLAYER B:  1fe9f33e-4e07-4e4b-a258-891aae660991
example_id:  d91ae806-31bf-4e92-a40f-6a560dc68a95
______________________
PAIRS <itertools.combinations object at 0x33ff9a340>
PLAYER A:  9470d56c-e6f0-4e9c-af25-cbd381c36a3a
PLAYER B:  1fe9f33e-4e07-4e4b-a258-891aae660991
example_id:  87012cf8-1b98-4c9f-8a5f-815aa0e76edb
______________________


Evaluating: 0it [00:00, ?it/s]

PLAYER A:  9470d56c-e6f0-4e9c-af25-cbd381c36a3a
PLAYER B:  74d288d3-7536-4a40-bfe4-ccc9b9e107b0
example_id:  d91ae806-31bf-4e92-a40f-6a560dc68a95
______________________
PLAYER A:  9470d56c-e6f0-4e9c-af25-cbd381c36a3a
PLAYER B:  a63e12f6-ebf5-4c4d-a9b6-d58a45f07257
example_id:  d91ae806-31bf-4e92-a40f-6a560dc68a95
______________________
PLAYER A:  9470d56c-e6f0-4e9c-af25-cbd381c36a3a
PLAYER B:  74d288d3-7536-4a40-bfe4-ccc9b9e107b0
example_id:  87012cf8-1b98-4c9f-8a5f-815aa0e76edb
______________________
PLAYER A:  1fe9f33e-4e07-4e4b-a258-891aae660991
PLAYER B:  74d288d3-7536-4a40-bfe4-ccc9b9e107b0
example_id:  d91ae806-31bf-4e92-a40f-6a560dc68a95
______________________
PLAYER A:  9470d56c-e6f0-4e9c-af25-cbd381c36a3a
PLAYER B:  a63e12f6-ebf5-4c4d-a9b6-d58a45f07257
example_id:  87012cf8-1b98-4c9f-8a5f-815aa0e76edb
______________________
PLAYER A:  1fe9f33e-4e07-4e4b-a258-891aae660991
PLAYER B:  a63e12f6-ebf5-4c4d-a9b6-d58a45f07257
example_id:  d91ae806-31bf-4e92-a40f-6a560dc68a95
_________

Evaluating: 2it [00:04,  2.44s/it]


In [25]:
print(evaluation_overview)
print('_____________________')
print(new_evaluation_overview)

Evaluation Overview ID = 12e1b780-3360-4801-990c-512b9fb524a7
Start time = 2024-05-16 08:31:20.812654+00:00
End time = 2024-05-16 08:31:21.501250+00:00
Successful examples = 2
Failed examples = 0
Description = "ELO QA evaluation"
Run Overviews={
Run Overview ID = 74d288d3-7536-4a40-bfe4-ccc9b9e107b0
Dataset ID = d99a3aff-619d-422b-8d69-daf5513b3942
Start time = 2024-05-16 08:30:42.168784+00:00
End time = 2024-05-16 08:31:03.416546+00:00
Failed example count = 0
Successful example count = 2
Description = "QA with model luminous-base-control-20240215"
, Run Overview ID = a63e12f6-ebf5-4c4d-a9b6-d58a45f07257
Dataset ID = d99a3aff-619d-422b-8d69-daf5513b3942
Start time = 2024-05-16 08:31:03.417036+00:00
End time = 2024-05-16 08:31:20.775926+00:00
Failed example count = 0
Successful example count = 2
Description = "QA with model luminous-supreme-control-20240215"
}

_____________________
Evaluation Overview ID = 063b3257-9712-4b95-b3ca-68e6dc8034e0
Start time = 2024-05-16 08:31:49.091557+00

In [26]:
# ensure that for each example there are evaluated comparisons
for example_evaluation in evaluation_repository.example_evaluations(
    new_evaluation_overview.id, Matches
):
    assert isinstance(example_evaluation.result, Matches)
    assert (
        len(example_evaluation.result.comparison_evaluations) > 0
    ), f"There are no matches (comparisons) for example ID {example_evaluation.example_id}"


In addition to the previous `evaluation_overview`, we now also have the newly generated `new_evaluation_overview`, which includes our new models.
Finally, the aggregated evaluation of all models is calculated by passing in the evaluation ids of both evaluations into `aggregate_evaluation`. By doing so, the previously calculated ELO scores will be updated with the comparisons to the new models' answers.
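As a sanity check on what the combined aggregation should contain, the number of matches per run can be counted by hand. The sketch below assumes that every pair was judged successfully on both examples; the run names are hypothetical placeholders.

```python
from collections import Counter
from itertools import combinations

num_examples = 2
old_runs = ["old-1", "old-2"]  # hypothetical placeholders for the runs of Step 1
new_runs = ["new-1", "new-2"]  # hypothetical placeholders for the runs of Step 4

# Pairs judged in the first evaluation and in the additional evaluation.
first_pairs = list(combinations(old_runs, 2))
all_pairs = list(combinations(old_runs + new_runs, 2))
second_pairs = [pair for pair in all_pairs if pair not in first_pairs]

# Every pair is played once per example; count matches per run over both evaluations.
matches_per_run = Counter()
for pairs in (first_pairs, second_pairs):
    for player_a, player_b in pairs:
        matches_per_run[player_a] += num_examples
        matches_per_run[player_b] += num_examples

print(matches_per_run)
```

Each run meets three opponents on two examples, giving six matches per run. This agrees with the `num_matches=6` values reported by the final aggregation, and with `num_matches=2` when only the first evaluation is aggregated.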

In [27]:
# get the IDs of all the evaluation overviews which we created for the QA task
evaluation_overview_ids = [
    evaluation_overview.id
    for evaluation_overview in evaluation_repository.evaluation_overviews()
    if "QA" in evaluation_overview.description
]
print(f"Evaluation overviews to aggregate: {evaluation_overview_ids}")

Evaluation overviews to aggregate: ['063b3257-9712-4b95-b3ca-68e6dc8034e0', '12e1b780-3360-4801-990c-512b9fb524a7']


In [28]:
# run the aggregation
aggregated_evaluation_with_new_model = aggregator.aggregate_evaluation(
    *evaluation_overview_ids
)

In [29]:
# ensure that there are no failed (aggregated) evaluations
assert (
    aggregated_evaluation_with_new_model.crashed_during_evaluation_count == 0
), f"There are crashed evaluations for aggregated evaluation ID {aggregated_evaluation_with_new_model.id}"
assert (
    aggregated_evaluation_with_new_model.failed_evaluation_count == 0
), f"There are failed evaluations for aggregated evaluation ID {aggregated_evaluation_with_new_model.id}"
# ensure that the result contains ELO scores
assert hasattr(
    aggregated_evaluation_with_new_model.statistics, "scores"
), f"There are no scores for aggregated evaluation ID {aggregated_evaluation_with_new_model.id}"

A look at the new aggregated evaluation shows that the new models have been added to the evaluation:

In [30]:
print(aggregated_evaluation_with_new_model)

Aggregation Overview ID = 29a2316d-47b3-4aa8-a8c2-5983f1b072de
Start time = 2024-05-16 08:31:54.005689+00:00
End time = 2024-05-16 08:31:54.008595+00:00
Successful example count = 4
Count of examples crashed during evaluation = 0
Description = "ELO QA aggregation"
IDs of aggregated Evaluation Overviews = ['063b3257-9712-4b95-b3ca-68e6dc8034e0', '12e1b780-3360-4801-990c-512b9fb524a7']
IDs of aggregated Run Overviews = ['1fe9f33e-4e07-4e4b-a258-891aae660991', '9470d56c-e6f0-4e9c-af25-cbd381c36a3a', '74d288d3-7536-4a40-bfe4-ccc9b9e107b0', 'a63e12f6-ebf5-4c4d-a9b6-d58a45f07257']
Statistics = {
scores={'9470d56c-e6f0-4e9c-af25-cbd381c36a3a': PlayerScore(elo=1499.9627526168383, elo_standard_error=0.17735549496891068, win_rate=0.5, num_matches=6), '1fe9f33e-4e07-4e4b-a258-891aae660991': PlayerScore(elo=1536.6164246453945, elo_standard_error=0.11725579512845279, win_rate=0.8333333333333334, num_matches=6), '74d288d3-7536-4a40-bfe4-ccc9b9e107b0': PlayerScore(elo=1481.9272029297929, elo_standard