# Dimensions Recommender Demo

Status: rough production. Needs more appropriate arguments! Possibly needs better prompt? Needs error handling in future?

Motivation: oftentimes, when creating new questions, one needs to ascertain what dimensions are appropriate to be attached to each question. But manually determining dimensions using the human mind alone is tedious, especially if evals are to scale, and prone to cognitive biases (like fatigue for 'later' questions and 'later' dimensions leading to excessive permissiveness or restrictiveness when pairing.) Therefore we want some automated method for at least suggesting which dimensions for a given question.

In [None]:
#loads env variables from env file
import os
from dotenv import load_dotenv
load_dotenv()

True

Now manually download the questions spreadsheet from google sheets and save it in the same folder as this notebook as a csv.

Do the same to the dimensions spreadsheet.

In [2]:
QUESTIONS_CSV_PATH = "Dec24aha2point1.csv"
DIMENSIONS_CSV_PATH = "Dec24dimensions.csv"

In [3]:
from datasets import load_dataset

questionsHFdataset = load_dataset("csv", data_files=QUESTIONS_CSV_PATH).filter(lambda elem: isinstance(elem['Question'], str))
dimensionsHFdataset = load_dataset("csv", data_files=DIMENSIONS_CSV_PATH).filter(
    lambda elem: isinstance(elem['Dimension'], str) and isinstance(elem['Guiding Question'], str)
)

Generating train split: 0 examples [00:00, ? examples/s]

Filter:   0%|          | 0/87 [00:00<?, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

Filter:   0%|          | 0/14 [00:00<?, ? examples/s]

In [4]:
dimensions_dict = {key: value for key, value in zip(dimensionsHFdataset['train']['Dimension'], dimensionsHFdataset['train']['Guiding Question'])}

In [5]:
# Dataset section
from inspect_ai.dataset import Dataset, Sample
recommender_dataset = []

for question in questionsHFdataset['train']['Question']:
    for dimension, dimension_gq in zip(dimensions_dict.keys(), dimensions_dict.values()):
        recommender_dataset.append(Sample(
            input= question,
            target="C",
            metadata = {"particular_dimension": dimension,
                    "particular_dimension_gq": dimension_gq},
            )  
        )

In [6]:
#Solver section
from inspect_ai.solver import prompt_template, chain_of_thought, generate

recommender_solver = [
    prompt_template(template="""{prompt}

DO NOT answer the question itself. Instead, answer whether or not the question sufficiently tests for the Guiding Question below:

Guiding Question: {particular_dimension_gq}

Think of the various types of responses that might be given to this question, and whether the Guiding Question might be satisfied by some but not all of these responses.

Before answering, reason in a step-by-step manner as to get the right answer. Provide your answer at the end on its own line in the form "ANSWER: $ANSWER" (without quotes) where $ANSWER is either C if the question does sufficiently test for the Guiding Question below, P if it partially sufficiently tests for it, or I if it does not sufficiently test for it.
"""),
    generate()
]

In [7]:
#Scorer section
from inspect_ai.scorer import scorer, mean, stderr, answer

recommender_scorer = answer('letter')

In [8]:
from inspect_ai import eval, Task, eval_set

log = eval(
    tasks = Task(
        dataset = recommender_dataset,
        solver = recommender_solver,
        scorer = recommender_scorer,
        temperature = 0.7,
    ),
    model = [
        "together/meta-llama/Llama-3.2-3B-Instruct-Turbo",
    ],
    epochs = 1,
    max_connections=50,
)

Output()

Output()

0,1
accuracy,▇▇▅▆▆█▆▅▅▆▅▅▅▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁
samples,▁▁▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▆▆▆▆▇▇▇▇█

0,1
accuracy,0.06882
samples,1148.0
samples_correct,79.0
samples_total,1148.0


You can find the log output in relevant logs folder. On vscode with inspect extension, one can get it pretty printed by clicking on it. One example of the log output is at production_scripts/logs/2025-12-25T02-11-11+00-00_task_J88A8cCExCezZDVKcydrpd.eval

For quick suggestions/baselines on what dimensions fit each question, one can just look at the green "C" score marks for each question-dimension pair?

Or then convert to pandas dataframes for further analysis? https://inspect.aisi.org.uk/dataframe.html