# HW3: Building and Evaluating Human-AI Interaction

In the previous homeworks you learned about ways of training LLMs to build both general and personalized chat assistants.

Once we have AI assistants, we can plug them into novel tools for people to use. However, building and evaluating such scaffolds for human-AI interaction is a non-trivial task in itself. In this assignment you will set up a human-AI interaction environment (CoGym), evaluate the AI's behavior manually, and then build automated evaluators for these agents using ideas from AutoMetrics.


## Part 1: Building Human-AI Interaction with CoGym

Collaborative Gym (CoGym) is a framework for developing and evaluating human–AI collaboration. It provides shared task environments, like travel planning, data analysis, and writing, where both the human and the agent can act at any time instead of taking strict turns. This setup makes it possible to study how agents communicate, coordinate, and share initiative with humans. CoGym includes both simulated and real-world conditions, and its evaluation suite measures not only task success and quality but also how well the collaboration itself worked. The goal is to build AI systems that act as capable teammates rather than passive tools.


![Figure 1: Collaborative Gym (Co-Gym) enables collaboration between humans and LM agents within a task environment. Left: Human adds requests and sends multiple messages without waiting for agent responses. Right: Human rates collaboration highly as the agent proactively seeks help when uncertain about package installation.](img/cogym-figure1.png)


CoGym runs both the human and the AI agent inside a shared environment that handles actions, messages, and updates through an event-based notification system. This design allows real-time, non-turn-taking coordination: either side can act, edit, or message at any point, and both see synchronized updates. For more information [read the paper](https://arxiv.org/abs/2412.15701).
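
To make the event-based, non-turn-taking idea concrete, here is a toy sketch of a shared environment that notifies every participant whenever anyone acts. It is an illustration only: `Event`, `SharedEnvironment`, `subscribe`, and `act` are hypothetical names invented for this example and are not CoGym's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Event:
    actor: str    # who acted, e.g. "human" or "agent"
    kind: str     # what happened, e.g. "message" or "edit"
    payload: str  # the content of the action


@dataclass
class SharedEnvironment:
    """Toy shared workspace (hypothetical): any participant may act at any time."""

    log: list = field(default_factory=list)
    listeners: dict = field(default_factory=dict)

    def subscribe(self, name: str, callback: Callable[[Event], None]) -> None:
        self.listeners[name] = callback

    def act(self, actor: str, kind: str, payload: str) -> None:
        event = Event(actor, kind, payload)
        self.log.append(event)
        # Notify every participant so all of them see the same event stream.
        for callback in self.listeners.values():
            callback(event)


env = SharedEnvironment()
human_view: list = []
agent_view: list = []
env.subscribe("human", human_view.append)
env.subscribe("agent", agent_view.append)

# The human sends two messages in a row without waiting for the agent.
env.act("human", "message", "Plan a 3-day trip to Kyoto.")
env.act("human", "message", "Keep it under $800.")
env.act("agent", "edit", "Day 1: Fushimi Inari shrine")

print(len(human_view), len(agent_view))  # both participants saw all three events
```

Nothing in `act` enforces alternation between participants, which is the property the paragraph above describes.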


### Setting up CoGym

Follow these instructions to get CoGym running on your local machine:

#### Repo and Environment Setup

1. Clone CoGym onto your local machine: https://github.com/SALT-NLP/collaborative-gym

2. Run the following commands in the root of the repo:

```bash
conda create --name cogym python=3.12
conda activate cogym
pip install -r requirements.txt
pip install -U litellm
pip install uvicorn
```

3. Ensure that you have [Docker installed](https://docs.docker.com/engine/install/) and that it is **currently running**.

#### API Access and Keys

1. Navigate to your Google Cloud project for the course. You will need to enable the following APIs:

- Vertex AI ([here](https://console.cloud.google.com/vertex-ai/dashboard))
- Google Maps Platform APIs & Services ([here](https://console.cloud.google.com/google/maps-apis/api-list))
  - Enable Distance Matrix API, Places API, and Places API (New). There is **no need** to enable Places Aggregate API or Places UI Kit.
- Custom Search API ([here](https://console.cloud.google.com/marketplace/product/google/customsearch.googleapis.com))

2. Rename the `secrets.example.toml` file to `secrets.toml`

3. Get the API key for your Google Cloud project [here](https://console.cloud.google.com/apis/credentials). Fill it in as `GEMINI_API_KEY`, `GOOGLE_MAP_API_KEY`, and `GOOGLE_SEARCH_API_KEY`.

4. Create a Custom Search Engine and copy the CX ID ([here](https://cse.google.com/cse)). Fill this in as `GOOGLE_CSE_ID`.

#### Editing the codebase and running the backend server

1. Open `collaborative_gym/server.py`.

Change lines 152-154 from

```python
                            "demo_agent.collaborative_agent_with_situational_planning.agent "
                            "--model-name gpt-4o --wait-time 1 --enhance-user-control",
```

to

```python
                            "demo_agent.basic_collaborative_agent.agent "
                            "--model-name gemini/gemini-pro-latest --wait-time 1 --enhance-user-control",
```

Note that `gemini-pro-latest` will work best but runs much slower than `gemini-flash-latest`. If runtime is a problem for you, consider switching to flash!

2. Change your current directory to the `cs329x/collaborative-gym` folder.

3. Set up the Redis server (make sure Docker is running!)

```bash
docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest
```

4. Launch the backend server (make sure nothing is already running on localhost:8000)

```bash
DISABLE_AGENT=false uvicorn collaborative_gym.server:app --reload
```
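
If the launch fails, or you want to confirm that each service is actually listening before moving on to the frontend, a quick port check can help. The sketch below is a minimal helper, not part of CoGym: `port_in_use` is a hypothetical name, and the port numbers are the ones used in the commands above (8000 for the backend, 6379 and 8001 for the Redis container).

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for name, port in [("backend", 8000), ("redis", 6379), ("redis UI", 8001)]:
        state = "in use" if port_in_use(port) else "free"
        print(f"{name:9s} port {port}: {state}")
```

Before launching, port 8000 should be free; once `uvicorn` has started, it should show up as in use.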

#### Setting up the frontend

1. Leave the backend and Redis servers running and open a new terminal window.

2. Navigate to `frontend/workbench`

3. Create a file `.env.local` with the following contents:

```
NEXT_PUBLIC_USE_MOCK_API="false"
NEXT_PUBLIC_API_URL="http://localhost:8000/api"
NEXT_PUBLIC_WS_URL="ws://localhost:8000/ws"
```

4. Install the necessary packages ([install npm](https://docs.npmjs.com/downloading-and-installing-node-js-and-npm) if you don't already have it)

```bash
npm install -g pnpm
pnpm install
```

5. Run the frontend

```bash
pnpm run dev
```


### Running CoGym (25 points)

1. Open up the CoGym Web Interface (likely [localhost:3000](http://localhost:3000))

2. Start writing a Travel Plan with the Agent for a trip you might want to do someday.

3. Interact with the Agent to produce a travel plan.

![Interacting with the cogym agent](img/cogym-example.png)

4. When you are done click `Finish` in the upper right hand corner. This will give you a chance to copy the travel plan you created.

#### Question 1.1 (15 points)

Produce 3 travel plans with CoGym! Please be creative and create travel plans that are actually interesting to you or have interesting constraints (budget, activities, multi-city, etc.). When you are done working with the collaborative agent, paste your completed travel plans into `writeup.md`.

#### Question 1.2 (5 points)

Evaluate these travel plans. Which one is the best? Which was the worst? Score each on quality from (1-5) using your own criteria. (We are looking for a score of 1-5 for each travel plan). Add this to `writeup.md`.

#### Question 1.3 (5 points)

Explain what criteria you were using when doing this evaluation. What mattered? (2-3 sentences). Add this to `writeup.md`.


If you found working on building human-AI interactions interesting, you can reach out to [shaoyj@stanford.edu](mailto:shaoyj@stanford.edu) for more in this space.

## Part 2: Building Automatic Evaluators to Approximate Human Judgement (50 points + 10 extra credit)


You just got a sense of how subjective human evaluation can be. Still, when designing research projects or products, it is important to be able to quantify some of these fuzzy qualitative signals. For this we need metrics.

CoGym directly introduces a few metrics. For outcomes, it measures Delivery Rate (whether the task was completed) and Task Performance (the quality of the final result).

Now you are going to try to write an automatic evaluator. As a start, let's write an LLM as a Judge prompt.


#### Helper Functions and Imports (Do not modify)


In [None]:
import pandas as pd
from IPython.display import display
from dotenv import load_dotenv
import litellm
from litellm import disable_cache, enable_cache
from litellm.caching.caching import Cache
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
import numpy as np
from typing import List, Tuple, Callable, Optional

litellm.cache = Cache(type="disk")

litellm.set_verbose = False
litellm.suppress_debug_info = True

In [None]:
pd.set_option("display.max_colwidth", None)  # show full text in each cell
pd.set_option("display.width", 2000)  # prevent horizontal truncation
pd.set_option("display.max_columns", None)  # show all columns

load_dotenv()

In [None]:
def llm(prompt="", model="gemini/gemini-2.0-flash"):
    if type(prompt) == str:
        prompt = [{"role": "user", "content": prompt}]

    return (
        litellm.completion(
            model=model,
            messages=prompt,
        )
        .choices[0]
        .message.content
    )


def _llm_as_judge_prompt(prompt, model="gemini/gemini-flash-latest"):
    system_prompt = """You are a expert evaluator.  Given the criteria that you are evaluating for, you will score the given response from 1-5.

First, reason over the given response and how it does or does not meet the criteria.  Then, give your final score.

Follow the following format:

<Reasoning>
[Your reasoning which will help you come up with your score]
</Reasoning>
<Score>
[Your final score; 1-5]"""

    if type(prompt) == str:
        prompt = [{"role": "user", "content": prompt}]

    prompt = [{"role": "system", "content": system_prompt}] + prompt

    try:
        response = llm(prompt, model)

        # Extract score from response (the next line after <Score>)
        score_line = response.split("<Score>")[1].split("</Score>")[0].strip()
        score = float(score_line)

        return score
    except Exception as e:
        prompt[-1]["content"] = (
            prompt[-1]["content"]
            + "\n\nBe very careful and precise in formatting your response."
        )
        res2 = llm(prompt, model)  # retry with the same model that was requested
        score_line = res2.split("<Score>")[1].split("</Score>")[0].strip()
        score = float(score_line)

        return score


def llm_as_judge(
    evaluation_criteria, input, output, model="gemini/gemini-flash-latest"
):
    prompt = f"""<Evaluation Criteria>
    {evaluation_criteria}
    </Evaluation Criteria>
    <Input provided to the AI>
    {input}
    </Input>
    <Output to evaluate>
    {output}
    </Output>
    """
    return _llm_as_judge_prompt(prompt, model)


def run_llm_as_judge_on_df(
    df, evaluation_criteria, model="gemini/gemini-flash-latest", num_threads=16
):
    scores = [None] * len(df)
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = {
            executor.submit(
                llm_as_judge,
                evaluation_criteria,
                row.input,
                row.output,
                model,
            ): i
            for i, row in enumerate(df.itertuples(index=False))
        }

        for future in tqdm(
            as_completed(futures), total=len(futures), desc="Scoring rows"
        ):
            i = futures[future]
            scores[i] = future.result()
    return scores


def run_regression_judge_on_df(
    df,
    criteria_list,
    coef,
    intercept,
    model="gemini/gemini-flash-latest",
    num_threads=16,
):
    for criterion in criteria_list:
        scores = run_llm_as_judge_on_df(df, criterion, model, num_threads)
        df[criterion] = scores

    coef_arr = np.asarray(coef).reshape(-1)  # force (n_features,)

    df["predicted_score"] = df[criteria_list].dot(coef_arr) + intercept

    return df["predicted_score"].tolist()


def run_evaluator_on_df(
    df, evaluator: Callable[[str, str], float], num_threads=16
) -> list[float]:
    # Preallocate result list
    scores = [None] * len(df)

    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        # Submit one task per row
        futures = {
            executor.submit(evaluator, row.input, row.output): i
            for i, row in enumerate(df.itertuples(index=False))
        }

        # Collect results as they complete
        for future in tqdm(
            as_completed(futures), total=len(futures), desc="Scoring rows"
        ):
            i = futures[future]
            scores[i] = future.result()

    return scores

In [None]:
def compute_pearson_correlation(df, scores: list[float]) -> float:
    y_true = df["score"]
    y_pred = pd.Series(scores, index=df.index, name="predicted_score")

    paired = pd.concat([y_true, y_pred], axis=1).dropna()
    if len(paired) < 2:
        return np.nan
    if paired.iloc[:, 0].nunique() < 2 or paired.iloc[:, 1].nunique() < 2:
        return np.nan  # zero variance on either side

    return paired.iloc[:, 0].corr(paired.iloc[:, 1], method="pearson")

def get_judge_confidence_interval(
    df,
    evaluation_criteria,
    model="gemini/gemini-flash-latest",
    num_threads=16,
    trials=5,
):
    disable_cache()
    scores = []
    for _ in range(trials):
        print(f"Running trial {_ + 1} of {trials}")
        scores.append(
            run_llm_as_judge_on_df(df, evaluation_criteria, model, num_threads)
        )
    enable_cache()

    pearson_correlations = []
    for score_list in scores:
        corr = compute_pearson_correlation(df, score_list)
        if not np.isnan(corr):
            pearson_correlations.append(corr)
        else:
            trials -= 1
    return np.array(pearson_correlations).mean(), np.array(
        pearson_correlations
    ).std() / np.sqrt(trials)


def get_regression_confidence_interval(
    df,
    criteria_list,
    coef,
    intercept,
    model="gemini/gemini-flash-latest",
    num_threads=16,
    trials=5,
):
    disable_cache()
    scores = []
    for _ in range(trials):
        print(f"Running trial {_ + 1} of {trials}")
        scores.append(
            run_regression_judge_on_df(
                df, criteria_list, coef, intercept, model, num_threads
            )
        )
    enable_cache()

    pearson_correlations = []
    for score_list in scores:
        pearson_correlations.append(compute_pearson_correlation(df, score_list))
    return np.array(pearson_correlations).mean(), np.array(
        pearson_correlations
    ).std() / np.sqrt(trials)


def get_evaluator_confidence_interval(
    df, evaluator: Callable[[str, str], float], num_threads=16, trials=5
) -> tuple[float, float]:
    disable_cache()
    scores = []
    for _ in range(trials):
        scores.append(run_evaluator_on_df(df, evaluator, num_threads))
    enable_cache()

    pearson_correlations = []
    for score_list in scores:
        pearson_correlations.append(compute_pearson_correlation(df, score_list))
    return np.array(pearson_correlations).mean(), np.array(
        pearson_correlations
    ).std() / np.sqrt(trials)

In [None]:
def evaluate_on_df_set(
    train_df_list,
    test_df_list,
    evaluator_constructor: Callable[[pd.DataFrame, str], Callable[[str, str], float]],
    human_criteria_list: Optional[list[str]] = None,
    num_threads: int = 16,
    trials: int = 5,
    parallel: bool = False,
):
    if human_criteria_list is None:
        human_criteria_list = ["Unknown"] * len(train_df_list)

    # package args per task
    task_args = list(zip(train_df_list, test_df_list, human_criteria_list))

    def run_one(train_df, test_df, human_criteria):
        evaluator = evaluator_constructor(train_df, human_criteria)
        mean, std = get_evaluator_confidence_interval(
            test_df,
            evaluator,
            num_threads=num_threads,
            trials=trials,
        )
        return (mean, std)

    if parallel:
        with ThreadPoolExecutor(max_workers=num_threads) as ex:
            results = list(ex.map(lambda args: run_one(*args), task_args))
    else:
        results = [run_one(*args) for args in task_args]

    # side-effect print, same as before
    for mean, std in results:
        print(f"Pearson correlation: {mean} ± {std}")

    return results


class Dataset:
    def __init__(
        self,
        train_df: pd.DataFrame,
        dev_df: pd.DataFrame,
        test_df: pd.DataFrame,
        human_criteria: str = "Unknown",
    ):
        self.train_df = train_df
        self.dev_df = dev_df
        self.test_df = test_df
        self.human_criteria = human_criteria


def init_dataset(path: str) -> Dataset:
    train_df = pd.read_csv(f"{path}/train.csv")
    dev_df = pd.read_csv(f"{path}/val.csv")
    test_df = pd.read_csv(f"{path}/test.csv")
    with open(f"{path}/human_criteria.txt", "r") as f:
        human_criteria = f.read().strip()
    return Dataset(train_df, dev_df, test_df, human_criteria)


def evaluate_on_datasets(
    datasets: list[Dataset],
    evaluator_constructor: Callable[[pd.DataFrame, str], Callable[[str, str], float]],
    num_threads=16,
    trials=5,
):
    return evaluate_on_df_set(
        [dataset.train_df for dataset in datasets],
        [dataset.test_df for dataset in datasets],
        evaluator_constructor,
        [dataset.human_criteria for dataset in datasets],
        num_threads,
        trials,
    )


cogym = init_dataset("datasets/CoGym")
helpsteer2 = init_dataset("datasets/HelpSteer2")
simpeval = init_dataset("datasets/SimpEval")

### Question 2.1: Writing your own LLM-as-a-Judge Evaluator (10 points)

Take a look at some of the CoGym data:


In [None]:
display(cogym.test_df.head())

Now devise a prompt that works to make a more reliable automatic evaluator of travel plan quality. You will want to use the following functions:

- `run_llm_as_judge_on_df(df, prompt)`: runs an LLM judge in parallel over a dataframe and returns a list of scores, one per row
- `compute_pearson_correlation(df, scores)`: computes the Pearson correlation between the LLM judge scores and the true human labels in the dataframe's `score` column

Please try out a few prompts. Note that due to the small test set size there will be more variance in your correlations. For more reliable assessment we provide the `get_judge_confidence_interval(df, prompt)` method which runs five trials.

Note that the LLM Judge is automatically cached so several runs with the same prompt will yield the same result. `get_judge_confidence_interval` purposefully disables the cache.
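
As a reminder of what the Pearson correlation rewards, here is a small worked example on made-up numbers (the two score lists are invented for illustration and do not come from the CoGym data). It uses the same pandas call as `compute_pearson_correlation`.

```python
import pandas as pd

# Hypothetical human labels and judge scores for five travel plans.
human = pd.Series([1, 2, 3, 4, 5], dtype=float)
judge = pd.Series([2, 1, 4, 3, 5], dtype=float)

r = human.corr(judge, method="pearson")

# Pearson correlation ignores positive linear rescaling, so a judge that
# scores on 1-5 can still be compared with human labels on another scale.
r_rescaled = human.corr(judge * 20, method="pearson")

print(round(r, 3), round(r_rescaled, 3))  # 0.8 0.8
```

One consequence: a judge that gives every row the same score has zero variance, so its correlation is undefined, and `compute_pearson_correlation` returns `NaN` in that case. A prompt only earns correlation if it makes the judge spread its scores in the same direction as the human labels.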


In [None]:
PROMPT = """Assess the travel plan for quality."""  # TODO: Write your prompt here!  You may choose to write a much longer prompt with more criteria.

scores = run_llm_as_judge_on_df(cogym.test_df, PROMPT)

print(scores)
print(compute_pearson_correlation(cogym.test_df, scores))

Now that you have played around with some prompts, you should have a sense of how challenging it is to write these evaluation criteria.

Please use the `get_judge_confidence_interval` method to test your prompt for reliability. For full credit, find a prompt that achieves MEAN correlation >= 0.15 and STDDEV <= 0.08.
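
For reference, this is how the two numbers are derived from the per-trial correlations, following the helper code above. The five correlation values below are hypothetical. Note that, as written, the helper divides the population standard deviation (NumPy's default, `ddof=0`) by the square root of the number of trials, so the value printed after `±` is closer to a standard error of the mean than to a raw standard deviation.

```python
import numpy as np

# Hypothetical Pearson correlations from five uncached trials.
correlations = np.array([0.10, 0.20, 0.15, 0.25, 0.30])

mean = correlations.mean()
# Same formula as the helper: population std divided by sqrt(number of trials).
spread = correlations.std() / np.sqrt(len(correlations))

print(f"Pearson correlation: {mean:.3f} ± {spread:.3f}")
# Pearson correlation: 0.200 ± 0.032
```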


In [None]:
mean, std = get_judge_confidence_interval(cogym.test_df, PROMPT)
print()
print(f"Pearson correlation: {mean} ± {std}")

Copy and paste your prompt and Pearson correlation (mean ± stddev) into `writeup.md`.


#### Motivating AutoMetrics

Now let's introduce two new datasets: HelpSteer2 and SimpEval.

HelpSteer2 asked human annotators to rate conversations on a scale of helpfulness from 1-5.

SimpEval asked several annotators to rate sentence simplifications on a scale from 1-100.


In [None]:
display(helpsteer2.test_df.head(1))

In [None]:
display(simpeval.test_df.head(1))

We could manually write predictive LLM-as-a-Judge prompts for each of these tasks. However, a more scalable approach is to use an LLM to automatically discover the insights that matter and turn them into an evaluation.

This is a core insight of the recent AutoMetrics paper. AutoMetrics was motivated by the challenge that human evaluation is slow, expensive, and hard to scale across new tasks. Instead of hand-designing rubrics, AutoMetrics uses a small amount of human feedback to automatically generate and weight evaluation criteria. It does this through a four-step pipeline (generate, retrieve, regress, and report) that creates candidate rubrics with an LLM, filters them together with relevant existing metrics, and learns how to combine them to best predict human judgments.


![AutoMetrics takes you from expensive measures to interpretable automatic metrics.  Here AutoMetrics generates useful metrics for evaluating LLM written product descriptions from user reviews from EvalGen (doi:10.1145/3654777.3676450).  Percentages indicate relative importance of each metric derived from regression coefficients.](img/autometrics-figure1.png)


Here we are going to reproduce two steps in the AutoMetrics method: **criteria generation** and **regression**. We will leave out existing metrics, metric retrieval, and reporting for this assignment.

![AutoMetrics comprises four steps. (1) Generate: create task-specific candidate metrics (Single criteria, Rubric, Examples, MIPROv2). (2) Retrieve: from the generated candidates plus MetricBank, use ColBERT to prefilter to $k'$ metric cards and an LLM to select the final $k$. (3) Regress: fit a PLS model on the training set to weight and select metrics that predict human judgments. (4) Report: produce a writeup with weights and correlations and details to guide adoption.](img/autometrics-method.png)

Let's start with criteria generation.


### Question 2.2: Automatically Generating Criteria (10 points)

In order to automatically generate criteria we ask that you implement the following algorithm:

1. Sort the training data based on the human scores into lowest and highest scoring

2. Sample 5 of the worst 10 examples and 5 of the best 10 examples. (Use the seed parameter to ensure reproducibility)

3. Format these examples in a prompt to an LLM to explain what key differences exist between the quality of these outputs. You can mention in the prompt to the LLM what the annotators were trying to measure.

4. Return a list of at least 5 criteria that independently can be used for an LLM as a Judge.

You may find it helpful to use the `llm(prompt) -> response` helper function defined above.
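
Steps 1 and 2 can be sketched with plain pandas. This is one possible reading of the sampling step, shown on a toy dataframe; `sample_extremes` and its `pool` and `k` parameters are hypothetical names, and the sketch assumes the dataframe has at least `2 * pool` rows so that the two pools do not overlap.

```python
import pandas as pd


def sample_extremes(df: pd.DataFrame, pool: int = 10, k: int = 5, seed: int = 42):
    """Return (worst_sample, best_sample) drawn from the lowest and highest scores."""
    ranked = df.sort_values("score", kind="stable")
    # Sample k rows from each end; random_state makes the draw reproducible.
    worst = ranked.head(pool).sample(n=k, random_state=seed)
    best = ranked.tail(pool).sample(n=k, random_state=seed)
    return worst, best


# Toy data with 30 rows and distinct scores, deliberately stored unsorted.
toy = pd.DataFrame(
    {
        "input": [f"task {i}" for i in range(30)],
        "output": [f"answer {i}" for i in range(30)],
        "score": [float(s) for s in range(29, -1, -1)],
    }
)

worst, best = sample_extremes(toy)
print(sorted(worst["score"]), sorted(best["score"]))
```

The two samples can then be passed to `format_examples` and placed in the prompt under separate "low scoring" and "high scoring" headings.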


In [None]:
# Helper function to format examples
def format_example(example):
    return f"""<INPUT>
    {example['input']}
    </INPUT>
    <OUTPUT>
    {example['output']}
    </OUTPUT>
    <SCORE>
    {example['score']}
    </SCORE>"""


def format_examples(examples):
    return "\n\n".join(
        [f"[[{i}]] " + format_example(example) for i, example in examples.iterrows()]
    )

In [None]:
# Generate criteria from non-descriptive human feedback
# Inputs:
#   train_df: a dataframe of training examples with input, output, and score columns
#   human_criteria: a string of the human criteria for the task
#   seed: an integer for the random seed
# Outputs:
#   A list of strings of the criteria.
def generate_criteria(train_df, human_criteria="Unknown", seed=42) -> list[str]:
    # TODO: Implement your code here!

    # Sort the training data based on the human scores into lowest and highest scoring

    # Sample 5 of the worst 10 examples and 5 of the best 10 examples (use the seed parameter to ensure reproducibility)

    # Format examples in a prompt

    # Call the LLM to generate the criteria
    
    # Return the list of criteria
    pass

In [None]:
criteria_cogym = generate_criteria(cogym.train_df, "travel plan quality")
print(criteria_cogym)

In [None]:
criteria_helpsteer2 = generate_criteria(helpsteer2.train_df, "helpfulness")
print(criteria_helpsteer2)

In [None]:
criteria_simpeval = generate_criteria(simpeval.train_df, "simplification quality")
print(criteria_simpeval)

Please paste your implementation of `generate_criteria` in `writeup.md` alongside the (minimum) 5 criteria that the LLM generated. It is okay if you generated more.


### Question 2.3: Selecting the Criteria (10 points)

Now we have a list of criteria that could potentially be useful to inform an LLM as a Judge, but we don't actually know which ones we should use. For that we can actually use a relatively simple tool: regression.

In the paper, we use Partial Least Squares (PLS) regression because it works well when the number of metrics (predictors) is similar to or larger than the number of data points, and when those metrics are highly correlated. PLS finds the direction in metric space that best predicts human judgments and assigns weights accordingly. This lets AutoMetrics identify which criteria truly matter, even with limited data, and combine them into a single predictive score.


Implement the following method:

1. Given a list of potential LLM as a Judge criteria, run the LLM judges on the training dataset
2. Compute a PLS regression on all of these outputs. You will find scikit-learn's `sklearn.cross_decomposition.PLSRegression` helpful. For now we will just use 1 component.
3. Return the regression coefficients as a list of length `n` where `n` is the number of criteria input, and a y-intercept

You may find it helpful to use the helper method `run_llm_as_judge_on_df(df, criteria_string) -> list[float]`


In [None]:
from sklearn.cross_decomposition import PLSRegression


# Run the LLM judges on the training dataset and return the regression coefficients and intercept
# Inputs:
#   train_df: a dataframe of training examples with input, output, and score columns
#   criteria_list: a list of strings of the LLM as a Judge criteria to regress on
# Outputs:
#   A tuple of the regression coefficients (list[float]) and intercept (float)
def regress_criteria(train_df, criteria_list) -> Tuple[List[float], float]:
    # TODO: Implement your code here!

    # Run the LLM judges on the training dataset

    # NOTE: Ensure the shape of the matrix is correct for the PLSRegression

    # Fit PLS

    # Return the regression coefficients (list[float]) and intercept (float) as a tuple
    pass

In [None]:
coef, inter = regress_criteria(cogym.train_df, criteria_cogym)

In [None]:
assert len(coef) == len(criteria_cogym)

In [None]:
mean, std = get_regression_confidence_interval(
    cogym.test_df, criteria_cogym, coef, inter
)
print(f"Pearson correlation: {mean} ± {std}")

Paste your code for `regress_criteria` into `writeup.md` alongside the results of `get_regression_confidence_interval`. Ideally, your automatically generated criteria achieve a correlation similar to that of the manual LLM judge prompt you wrote earlier, without requiring any manual prompt engineering.

Note: This may not be what you observe because for the purposes of the assignment we are operating on small LLMs (gemini-flash-2.0 instead of gemini-pro-2.5).


### Question 2.4: Building Better Automatic Evaluators (20 points + 10 extra credit)

**[Optional for 3 credit students; Required for 4 credit students]**

Now we have learned one way to automatically generate automatic evaluators through generation + regression. There are many other clever strategies you could try. In the paper we also try prompt optimization, finetuning, revising criteria into rubrics, etc.

Below we provide you with some seed ideas as well as scaffolding for your evaluator function. Your task is to write an automatic evaluation generator that exceeds a certain threshold on our held out test datasets.


#### Ideas you may wish to explore:

- Reflective Prompt Optimization: https://arxiv.org/abs/2507.19457
- LLM Based Clustering for Criteria extraction: https://stanfordhci.github.io/lloom/ https://arxiv.org/abs/2503.08893
- Bootstrap Reasoning Induction: https://arxiv.org/pdf/2203.14465

Or come up with something entirely unique!


To do this task, implement `generate_evaluator` below. It takes a `train_df` and the human criteria string, and returns a callable that maps an input and an output to a score. You will be graded on the Pearson correlation between your automatic evaluator and human judgments on three held-out datasets (not shared with the class). These datasets are similar to the ones provided in the earlier parts of this homework but cover slightly different domains.
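
To illustrate the expected contract (a constructor that learns from `train_df` and returns a closure over what it learned), here is a deliberately simple placeholder that involves no LLM at all. It fits a one-variable least-squares line from output length to score. `make_length_evaluator` is a hypothetical name, and this is not a suggested method, only a shape for your own `generate_evaluator` to follow.

```python
from typing import Callable

import pandas as pd


def make_length_evaluator(
    train_df: pd.DataFrame, human_criteria: str = "Unknown"
) -> Callable[[str, str], float]:
    """Toy constructor: fit score = slope * len(output) + offset on train_df."""
    lengths = train_df["output"].astype(str).str.len().astype(float)
    scores = train_df["score"].astype(float)

    # Closed-form least squares for a single predictor.
    centered = lengths - lengths.mean()
    variance = (centered**2).sum()
    slope = 0.0 if variance == 0 else (centered * (scores - scores.mean())).sum() / variance
    offset = scores.mean() - slope * lengths.mean()

    def evaluate(input: str, output: str) -> float:
        # The closure captures slope and offset learned at construction time.
        return float(slope * len(output) + offset)

    return evaluate


toy_train = pd.DataFrame(
    {"input": ["q"] * 3, "output": ["a", "aaa", "aaaaa"], "score": [1, 2, 3]}
)
evaluator = make_length_evaluator(toy_train, "toy criteria")
print(evaluator("q", "aaaaaaa"))  # 4.0
```

Anything expensive, such as generating criteria or fitting a regression, belongs in the constructor, so that the returned callable only does per-example work.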


In [None]:
# Given a train_df and human_criteria, generate an automatic evaluator that can be used to score a model's output
# Inputs:
#   train_df: a dataframe of training examples with input, output, and score columns
#   human_criteria: a string of the human criteria for the task (e.g. "travel plan quality", "helpfulness", "simplification quality")
# Outputs:
#   A callable function that takes an input and output and returns a score
def generate_evaluator(
    train_df, human_criteria="Unknown"
) -> Callable[[str, str], float]:
    # TODO: Replace this implementation with your code here!  This implements the generation + regression method from above

    # Generate criteria
    criteria_list = generate_criteria(train_df, human_criteria)

    # Regress criteria
    coef, inter = regress_criteria(train_df, criteria_list)

    def evaluate(input: str, output: str) -> float:
        scores = []
        for criterion in criteria_list:
            score = llm_as_judge(criterion, input, output)
            scores.append(score)
        return (coef @ np.array(scores)) + inter

    return evaluate

In [None]:
results = evaluate_on_datasets([cogym, helpsteer2, simpeval], generate_evaluator, trials=1)

In [None]:
print("Results:")
print("[Cogym] Pearson correlation: ", results[0][0], "±", results[0][1])
print("[HelpSteer2] Pearson correlation: ", results[1][0], "±", results[1][1])
print("[SimpEval] Pearson correlation: ", results[2][0], "±", results[2][1])

print(
    "Average Pearson correlation: ", np.array([result[0] for result in results]).mean()
)

When you are done implementing your custom method paste your implementation into `writeup.md` AND into `submission.py`. The code in `submission.py` will be run to determine your place on the leaderboard. We will run your evaluator on our held-out test datasets.

To get full credit, you will need to exceed the baseline score (generate + regress) of `0.450` average Pearson correlation on our test set. For reference, the equivalent score on the provided datasets is `0.439`, so try to beat this locally before submitting. Additionally, the **top 3** submissions on the leaderboard by the homework deadline will get 10 points of extra credit.

Finally we ask that you add a writeup of your approach (1-2 paragraphs) to `writeup.md`. If you took any inspiration from papers be sure to link them in your explanation! We look forward to seeing what you come up with!


If you found working on automating evaluations of AI systems interesting you can reach out to [mryan0@stanford.edu](mailto:mryan0@stanford.edu) for more in this space.