## Tuning of the evaluator

The purpose of this notebook is to measure how closely the LLM's evaluations of LLM-generated solutions agree with human evaluations.

* It creates a tuning set by selecting random tasks from the whole benchmark set.
* It contains code that compares the match scores given to the LLM-generated solutions by a human and by an LLM, using a confusion matrix.

If the difference is too large, the prompt in the evaluation.py file might need adjusting. The notebook also contains code to help analyze the cases where the match scores differ.

When ready, the evaluator code is used in the "Benchmarking" notebook.

##### Imports

In [None]:
# !pip install -r requirements.txt

In [2]:
import os, sys
import json
import zipfile
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
from io import StringIO
import importlib
from collections import defaultdict
import random
from pprint import pprint, pformat, PrettyPrinter
import re
from tqdm import tqdm

In [3]:
from geobenchx.constants import DATA_FOLDER, RESULTS_FOLDER, MODEL_CLAUDE, MODEL_GEMINI_ADV, MODEL_GPT
import geobenchx.dataclasses
importlib.reload(geobenchx.dataclasses)
from geobenchx.dataclasses import TaskSet, Task, Solution

import geobenchx.utils
importlib.reload(geobenchx.utils)
from geobenchx.utils import generate_timestamp_id, get_dataframe_info, get_solution_code


import geobenchx.agent
importlib.reload(geobenchx.agent)
from geobenchx.agent import execute_task

import geobenchx.evaluation
importlib.reload(geobenchx.evaluation)
from geobenchx.evaluation import score_task_solution, generate_eval_stats_evaluator, score_solutions_set

#### Selecting random tasks for annotation

In [4]:
source_tasks = 'tasks_and_reference_solutions.json' # name of file with tasks and reference solutions (ground truth solutions)

In [5]:
tuning_tasks_filename = 'evaluator_tuning_set.json' # name of the file with tasks, each with a reference solution, a candidate solution and a manual match score

In [6]:
# Select tasks from 'source_tasks' for the evaluator tuning set, or read them from an already existing tuning set file

if os.path.exists(os.path.join(DATA_FOLDER, tuning_tasks_filename)): 
    evaluator_tuning_tasks = TaskSet.read_from_file(tuning_tasks_filename, DATA_FOLDER)   
else:
    tasks = TaskSet.read_from_file(source_tasks, DATA_FOLDER)
    evaluator_tuning_tasks = tasks.sample_stratified(40)
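
The idea behind stratified sampling can be sketched as follows. This is a minimal illustration on toy data, not the implementation of `TaskSet.sample_stratified`: the helper below, its `get_label` and `seed` parameters, and the one-label-per-task layout are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def sample_stratified(items, get_label, n, seed=0):
    """Pick about n items so that each label keeps roughly its share of the full set."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[get_label(item)].append(item)
    total = len(items)
    sample = []
    for members in groups.values():
        # Proportional share per label, but never fewer than one item
        k = max(1, round(n * len(members) / total))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# Toy data (hypothetical): 100 tasks with labels A (50), B (30), C (20)
tasks = [{"id": i, "label": "A" if i < 50 else "B" if i < 80 else "C"} for i in range(100)]
picked = sample_stratified(tasks, lambda t: t["label"], 10)
counts = Counter(t["label"] for t in picked)
print(counts)
```

Because of rounding and the one-item minimum, a sampler written this way can return slightly more or fewer items than requested.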

In [None]:
# Checking the size of the tuning set and its composition by types of the selected tasks

print(len(evaluator_tuning_tasks))
evaluator_tuning_tasks.get_labels_counts()

44


{<TaskLabels.MERGE_VISUALIZE: 'Merge, Visualize'>: 6,
 <TaskLabels.TASK_SET_01: 'Task Set 01'>: 6,
 <TaskLabels.SPATIAL_OPERATIONS: 'Spatial operations'>: 15,
 <TaskLabels.TASK_SET_03: 'Task Set 03'>: 15,
 <TaskLabels.VAGUE: 'Vague'>: 4,
 <TaskLabels.HEATMAPS_CONTOUR_LINES: 'Heatmaps, Contour Lines'>: 14,
 <TaskLabels.TASK_SET_04: 'Task Set 04'>: 14,
 <TaskLabels.PROCESS_MERGE_VISUALIZE: 'Process, Merge, Visualize'>: 9,
 <TaskLabels.TASK_SET_02: 'Task Set 02'>: 9,
 <TaskLabels.HARD: 'Hard'>: 1}

In [None]:
# saving the tuning set for evaluation if needed

evaluator_tuning_tasks.save_to_file(tuning_tasks_filename, DATA_FOLDER)

### Attention!

In the evaluator_tuning_set.json file in the repository, the LLM solutions are already generated. 

If a new file is generated using the above part of the notebook:
1. Generate solutions with an LLM of your choice.
2. Score them manually by comparing the reference and candidate solutions, either in the GUI (by running tasks_editor) or directly in the JSON file, entering the match score and match reasoning under the keys "match_score_Human" (required) and "match_reasoning_Human" (optional) in the new file.
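
For annotating directly in the JSON, a minimal sketch is shown below. Only the two `match_*_Human` keys and `task_ID` come from this notebook. The top-level `"tasks"` list, the example task IDs, and storing the score as a plain integer are assumptions, so check the actual file layout before adapting it.

```python
import json

# Hypothetical file content: the real schema is defined by TaskSet
raw = '{"tasks": [{"task_ID": "TASK_EXAMPLE_1"}, {"task_ID": "TASK_EXAMPLE_2"}]}'
data = json.loads(raw)

# Manual annotations keyed by task ID: (score, optional reasoning)
annotations = {
    "TASK_EXAMPLE_1": (2, "Same tools and arguments as the reference."),
    "TASK_EXAMPLE_2": (0, None),
}

for task in data["tasks"]:
    score, reasoning = annotations[task["task_ID"]]
    task["match_score_Human"] = score  # required
    if reasoning is not None:
        task["match_reasoning_Human"] = reasoning  # optional

# Tasks that still lack the required human score
unscored = [t["task_ID"] for t in data["tasks"] if "match_score_Human" not in t]
print(json.dumps(data, indent=2))
print("unscored:", unscored)
```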

### Evaluate solutions in the tuning set

The LLM scores will be saved directly in the tuning tasks file. 

After scoring the set with an LLM, proceed to the next part of the notebook to calculate how close the LLM's evaluations are to the human scores.

Repeat for any LLM you plan to use for evaluations.

In [None]:
# Select the model that will evaluate (score) the solutions

model = MODEL_GEMINI_ADV

# The default temperature for evaluation is 0; to change it, use the line below. For OpenAI's o3-mini, use temperature = None

# temperature = 


In [None]:
# Run evaluation of the whole set by the selected model
# The LLM match scores will be saved directly in the tuning tasks file. 
# Proceed to the next part of the notebook to see how close the LLM's scores are to the human scores

score_solutions_set(tuning_tasks_filename, DATA_FOLDER, model, skip_scored=False)

### Evaluate comparisons

1. Count how many of the human and LLM match scores agree.
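
The counting step can be sketched with the standard library alone. The score lists below are made-up values on a 0 to 2 scale, not data from the tuning set, and the sketch is independent of how `generate_eval_stats_evaluator` is implemented.

```python
from collections import Counter

# Hypothetical match scores for eight tasks (0, 1 or 2)
human = [0, 0, 1, 2, 2, 2, 0, 1]
llm = [0, 1, 1, 2, 0, 2, 0, 1]

scores = [0, 1, 2]
pairs = Counter(zip(human, llm))  # (human score, LLM score) -> number of tasks

# Rows are human scores, columns are LLM scores
matrix = [[pairs[(h, l)] for l in scores] for h in scores]
matches = sum(pairs[(s, s)] for s in scores)  # diagonal of the matrix
agreement = matches / len(human)

for h, row in zip(scores, matrix):
    print(f"Human_{h}", row)
print(agreement)
```

Off-diagonal cells show where the two raters disagree and in which direction.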

In [16]:
# Generate the matrix of LLM scores vs. Human scores, the share of tasks for which the scores are the same, and its CI using the Wilson formula
generate_eval_stats_evaluator(tuning_tasks_filename, DATA_FOLDER)

         LLM_0  LLM_1  LLM_2
Human_0     18      3      1
Human_1      1      3      0
Human_2      3      0     15
0.8181818181818182
(0.6803944911281533, 0.9048719428437227)
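
The agreement rate and interval printed above can be reproduced by hand from the matrix. The sketch below assumes a Wilson score interval with z = 1.96 (about 95% confidence); that assumption is consistent with the printed numbers, but it is not taken from the source of `generate_eval_stats_evaluator`.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Confusion matrix from the output above (rows: Human, columns: LLM)
matrix = [[18, 3, 1], [1, 3, 0], [3, 0, 15]]
matches = sum(matrix[i][i] for i in range(3))  # tasks where both scores agree
total = sum(sum(row) for row in matrix)        # all scored tasks

agreement = matches / total
low, high = wilson_interval(matches, total)
print(matches, total, round(agreement, 4))
print(round(low, 4), round(high, 4))
```

With 36 matching scores out of 44 tasks, this gives an agreement of about 0.818 and an interval of roughly 0.680 to 0.905.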


In [None]:
# Select tasks with particular combination of Human and LLM score
tasks_selected = [task for task in evaluator_tuning_tasks if task.match_score_LLM is not None and (task.match_score_Human.value, task.match_score_LLM.value)==(2, 0)]
tasks_selected

[|                                  | Task details                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
 |:---------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [17]:
# Select one task from the tuning set to see the task details, reference and candidate solutions, and scores

check = [task for task in evaluator_tuning_tasks if task.task_ID == 'TASK_250309_135125_530802']
check

[|                                  | Task details                                                                                                                                                                                                                                                                                                     |
 |:---------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | task_ID                          | TASK_250309_135125_530802                                                                                                                                                                                                                                                       

### When ready, the evaluator code is used in the "Benchmarking" notebook.