# OpenAI Baselines

This notebook runs the OpenAI model baselines on the Only Connect dataset.

In [3]:
# %pip install -r requirements.txt
# %pip install guidance
import json
import os
import re
import random
from pathlib import Path
from evaluate_only_connect import Evaluate

import guidance
from datasets import load_dataset

## Setup

First, add your OpenAI API key:

In [4]:
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"  # never commit a real key to a shared notebook
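
A key written directly into a cell is easy to leak when the notebook is shared. As a minimal sketch of one alternative, assuming an interactive session, the key can be requested only when it is not already set. `ensure_api_key` is a hypothetical helper for illustration and is not part of this repository.

```python
import os
from getpass import getpass


def ensure_api_key(env=os.environ, name="OPENAI_API_KEY", ask=getpass):
    """Return the key stored under `name`, prompting for it only if it is missing."""
    if not env.get(name):
        # getpass hides the typed value, so the key never shows up in the cell output
        env[name] = ask(f"Enter {name}: ")
    return env[name]
```

Calling `ensure_api_key()` once at the top of the notebook could stand in for the assignment above.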

Next, download a copy of the Only Connect dataset from [here](https://drive.google.com/drive/folders/1118w_ydBSBWUru5cPlyGY9TMrgd993f3?usp=sharing). We expect the three JSON files to exist under `./dataset`.

In [5]:
print(f'Found train set: {Path("./dataset/train.json").exists()}')
print(f'Found validation set: {Path("./dataset/validation.json").exists()}')
print(f'Found test set: {Path("./dataset/test.json").exists()}')

Found train set: True
Found validation set: True
Found test set: True


Then, load the dataset using the [HuggingFace Datasets Library](https://huggingface.co/docs/datasets/index).

In [6]:
dataset = load_dataset(
    "json",
    data_files={
        "train": "dataset/train.json",
        "validation": "dataset/validation.json",
        "test": "dataset/test.json",
    },
    field="dataset",
)

Found cached dataset json (/Users/johngiorgi/.cache/huggingface/datasets/json/default-368e73e10be4fc5a/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|██████████| 3/3 [00:00<00:00, 63.57it/s]
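
The code in this notebook reads only a few fields from each wall. The sketch below shows that shape on a made-up wall. The field names `wall_id`, `words`, `groups`, `gt_words` and `gt_connection` are the ones used in this notebook; the group keys (`group_1`, ...), the wall ID and all clue values are invented for illustration and may not match the real files.

```python
# A made-up wall with the fields this notebook reads; every value here is illustrative.
toy_wall = {
    "wall_id": "toy-0001",
    # The 16 clues, already shuffled
    "words": [
        "Oak", "Mars", "Pink", "Pear", "Venus", "Elm", "Apple", "Blue",
        "Saturn", "Plum", "Ash", "Red", "Cherry", "Jupiter", "Birch", "Green",
    ],
    # Ground-truth solution: four groups of four clues, each with its connection
    "groups": {
        "group_1": {"gt_words": ["Mars", "Venus", "Saturn", "Jupiter"], "gt_connection": "Planets"},
        "group_2": {"gt_words": ["Apple", "Pear", "Plum", "Cherry"], "gt_connection": "Fruits"},
        "group_3": {"gt_words": ["Oak", "Elm", "Ash", "Birch"], "gt_connection": "Trees"},
        "group_4": {"gt_words": ["Red", "Green", "Blue", "Pink"], "gt_connection": "Colours"},
    },
}

# Flatten each group into a comma-separated string, the form used when prompting for task 2
groups = [", ".join(group["gt_words"]) for group in toy_wall["groups"].values()]
print(groups[0])
```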


Finally, define the helper function, which will make the calls to the OpenAI API:

In [21]:
def run_openai(
    dataset,
    task: str = "task1",
    model: str = "gpt-3.5-turbo",
    split: str = "test",
    num_in_context_examples: int = 3,
    dry_run: bool = False,
    seed: int = 42,
    **kwargs,
):
    guidance.llm = guidance.llms.OpenAI(model)

    if task == "task1":
        prompt = guidance(
            """{{#system~}}You are currently competing in Round 3: Connecting Wall on the quiz show Only Connect. Your task: given 16 "clues" (words or phrases), solve the wall by grouping the clues into four groups of four. You will be given the clues as a list. You are also given examples of solved walls, which include the connections. Provide your answer as a list of four groups of four clues; separate groups by newlines and clues by commas. Do not try to guess the connection; only use the clues given and don't make up your own.

Be careful! Connecting Wall is deliberately difficult. The puzzles are designed to include red herrings and to suggest more connections than actually exist. Some clues appear to fit into more than one category. Still, there is only one perfect solution for each wall.
{{~/system}}

{{#user~}}
{{examples}}

Clues: {{#each clues}} {{this}}{{#unless @last}},{{/unless}}{{/each}}
{{~/user}}

Solved wall: 

{{#assistant~}}{{gen 'predicted_groups' temperature=0.0 max_tokens=64}}{{~/assistant}}""",
            **kwargs,
        )
    elif task == "task2":
        prompt = guidance(
            """{{#system~}}You are currently competing in Round 3: Connecting Wall on the quiz show Only Connect. Your task: given 4 groups of 4 "clues" (words or phrases), determine the connection for each group. You will be given the groups as four lists of four. You are also given examples of solved walls, which include the connections. Provide your answer by repeating the four groups and adding it after "Connection:"

Note: Connections might be thematic, linguistic, factual, mathematical and rely on both arcane subject areas and popular culture.                    
{{~/system}}

{{#user~}}
{{examples}}

Groups:
{{#each groups}}{{this}}{{#unless @last}}\n{{/unless}}{{/each}}
{{~/user}}

Solved wall: 

{{#assistant~}}{{gen 'predicted_connections' temperature=0.0 max_tokens=128}}{{~/assistant}}""",
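            **kwargs,
        )
    else:
        # Fail early instead of hitting an undefined `prompt` further down
        raise ValueError(f'Unknown task "{task}"; expected "task1" or "task2".')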

    # Set the RNG here so repeated calls to this function will return the same results every time
    rng = random.Random(seed)

    predictions = []

    # Create the in-context examples
    ic_examples = ""
    random_examples = rng.sample(dataset["train"]["groups"], k=num_in_context_examples)
    for i, example in enumerate(random_examples):
        ic_examples += f"Example {i+1}\n"
        for group in example.values():
            ic_examples += ", ".join(group["gt_words"]) + f". Connection: {group['gt_connection']}\n"
        ic_examples += "\n"
    ic_examples = ic_examples.strip()

    # Run the model on each wall
    for wall in dataset[split]:
        # Clues have already been shuffled, so we can take them as is
        wall_id, clues = (
            wall["wall_id"],
            wall["words"],
        )
        groups = [", ".join(group["gt_words"]) for group in wall["groups"].values()]
        # Try to parse the model response, but if it fails, just use a random guess
        predicted_groups, predicted_connections = None, None
        if task == "task1":
            response = prompt(examples=ic_examples, clues=clues)
            try:
                # Sometimes the model returns more or less than 4 groups, truncate or pad with empty strings
                predicted_groups = response["predicted_groups"].splitlines()[:4]
                predicted_groups += [""] * (4 - len(predicted_groups))
                # Sometimes the model returns more or less than 4 words per group, truncate or pad with empty strings
                predicted_groups = [[word.strip() for word in line.split(",")][:4] for line in predicted_groups]
                predicted_groups = [group + ([""] * (4 - len(group))) for group in predicted_groups]
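            except Exception:
                import warnings  # local import so this cell does not depend on the imports cell

                # Warning(...) only constructs an exception object; warnings.warn actually emits it
                warnings.warn(
                    f"Failed to parse model response:\n\n{response['predicted_groups']}\n\nUsing random guess instead."
                )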
                predicted_groups = [clues[i : i + 4] for i in range(0, len(clues), 4)]
        else:
            groups = [f"{group}. Connection:" for group in groups]
            response = prompt(examples=ic_examples, groups=groups)
            predicted_connections = [
                re.search(r"Connection:\s*(.*)", connection)
                for connection in response["predicted_connections"].splitlines()
            ]
            predicted_connections = [
                connection.group(1).strip() if connection else "" for connection in predicted_connections
            ]
            # Sometimes the model returns more than 4 connections, so we take the first 4
            predicted_connections = predicted_connections[:4]
            # If the model returns fewer than 4 connections, we pad with empty strings
            predicted_connections += [""] * (4 - len(predicted_connections))

        predictions.append(
            {
                "wall_id": wall_id,
                "predicted_groups": predicted_groups,
                "predicted_connections": predicted_connections,
            }
        )
        if dry_run:
            print("--dry-run flag passed. Exiting after one example.")
            break

    return predictions
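
To make the prompt format easier to inspect, the in-context example step can be pulled out and run on its own. The sketch below mirrors the logic inside `run_openai`; `build_ic_examples` and the toy training data are hypothetical names and values introduced only for illustration.

```python
import random


def build_ic_examples(train_groups, k, seed=42):
    """Format k randomly sampled solved walls as in-context examples."""
    # A dedicated RNG keeps the sample identical across repeated calls with the same seed
    rng = random.Random(seed)
    sampled = rng.sample(train_groups, k=k)
    text = ""
    for i, example in enumerate(sampled):
        text += f"Example {i + 1}\n"
        for group in example.values():
            text += ", ".join(group["gt_words"]) + f". Connection: {group['gt_connection']}\n"
        text += "\n"
    return text.strip()


# Toy stand-in for dataset["train"]["groups"]; real walls have four groups of four clues
toy_train_groups = [
    {
        "group_1": {"gt_words": ["Mars", "Venus"], "gt_connection": "Planets"},
        "group_2": {"gt_words": ["Apple", "Pear"], "gt_connection": "Fruits"},
    }
]
print(build_ic_examples(toy_train_groups, k=1))
```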

## Task 1: Solving Walls

To run task 1 (solving the wall), run the following:

In [28]:
# Remove dry_run=True when you are ready to run on the full split
predictions = run_openai(dataset, task="task1", num_in_context_examples=3, split="validation", dry_run=True)
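
The model does not always return exactly four lines of four clues, so the task 1 response is truncated or padded into a fixed 4x4 shape. A minimal sketch of that normalisation, written as a standalone function so it can be tried without calling the API; `parse_groups` is a hypothetical name that does not exist in the repository.

```python
def parse_groups(raw, n_groups=4, group_size=4):
    """Coerce a raw model response into n_groups lists of group_size clues."""
    # Keep at most n_groups lines, then pad with empty lines if there are too few
    lines = raw.splitlines()[:n_groups]
    lines += [""] * (n_groups - len(lines))
    # Split each line on commas, keep at most group_size clues, then pad short groups
    groups = [[word.strip() for word in line.split(",")][:group_size] for line in lines]
    return [group + [""] * (group_size - len(group)) for group in groups]


# One line is too long, one is too short, and the fourth line is missing entirely
print(parse_groups("a, b, c, d, e\nf, g\nh, i, j, k"))
```

Padding with empty strings keeps every prediction the same shape, which lets an evaluation step compare groups position by position without special cases.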

To evaluate the predictions, save them to disk and run the evaluation script:

In [29]:
fp = Path("predictions/task1.json")
fp.parent.mkdir(exist_ok=True)
fp.write_text(json.dumps(predictions, ensure_ascii=False, indent=2));

In [31]:
!python evaluate.py \
    --prediction_file "./predictions/task1.json" \
    --dataset_path "./dataset/" \
    --results_path "./results/task1.json" \
    --split "validation"

Found cached dataset json (/Users/johngiorgi/.cache/huggingface/datasets/json/default-368e73e10be4fc5a/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 137.28it/s]
 29%|████████████▏                             | 18/62 [00:00<00:00, 346.79it/s]
Traceback (most recent call last)
/Users/johngiorgi/Documents/dev/only_connect_nlp/evaluate.py:71 in <module>

  68
  69 if __name__ == '__main__':

## Task 2: Making Connections

To run task 2 (predicting the connections between solved groups), run the following:

In [18]:
predictions = run_openai(dataset, task="task2", split="validation", num_in_context_examples=5, dry_run=True)

--dry-run flag passed. Exiting after one example.
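
For task 2, each line of the response is expected to repeat a group followed by `Connection: ...`. The sketch below isolates the extraction step used in `run_openai` so it can be checked on a hand-written response; `parse_connections` is a hypothetical name introduced for illustration.

```python
import re


def parse_connections(raw, n=4):
    """Extract the text after "Connection:" from each line, returning exactly n entries."""
    matches = [re.search(r"Connection:\s*(.*)", line) for line in raw.splitlines()]
    # Lines without the marker become empty strings so that positions stay aligned
    connections = [match.group(1).strip() if match else "" for match in matches][:n]
    # Pad with empty strings when the model returns fewer than n lines
    return connections + [""] * (n - len(connections))


raw_response = "Mars, Venus. Connection: Planets\nno marker here\nApple, Pear. Connection:  Fruits "
print(parse_connections(raw_response))
```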


To reproduce our results from the paper, run the following:

In [26]:
for num_in_context_examples in [1, 3, 5, 10]:
    predictions = run_openai(
        dataset,
        task="task2",
        model="gpt-3.5-turbo-0301",
        split="test",
        num_in_context_examples=num_in_context_examples,
        caching=True,
    )
    pred_fp = f"predictions/task2/{num_in_context_examples}_examples.json"
    # parents=True also creates "predictions/" if it does not exist yet
    Path(pred_fp).parent.mkdir(parents=True, exist_ok=True)
    Path(pred_fp).write_text(json.dumps(predictions, ensure_ascii=False, indent=2))
    results_fp = f"task2/{num_in_context_examples}_examples.json"
    evaluator = Evaluate(pred_fp, dataset_path="./dataset/", results_path=results_fp, split="test")
    evaluator.task2_evaluation()