I have a model in `./models/run1/merged` that was trained on GPT-4's outputs to classify recipes. Now I need to find out how well it actually does at that task. I'll install dependencies first.

In [3]:
%pip install vllm==0.1.3 pandas==2.0.3

Note: you may need to restart the kernel to use updated packages.


Back in [./prepare.ipynb](./prepare.ipynb) I got a `test.jsonl` file from OpenPipe. It is formatted the same way as our training data but was held out from training, so we can use it to check our model's performance.

In [4]:
import pandas as pd

test_data = pd.read_json("./data/test.jsonl", lines=True)


During the training process, Axolotl transformed our data into an instruction/response format known as the "Alpaca format", named after [the project that introduced it](https://github.com/tatsu-lab/stanford_alpaca). I need to transform my test data into the same format for best results.

In [7]:
from axolotl.prompters import UnpromptedPrompter

prompter = UnpromptedPrompter()


def format_prompt(instruction: str) -> str:
    # build_prompt is a generator; take the first prompt it yields.
    return next(prompter.build_prompt(instruction))


prompts = test_data["instruction"].apply(format_prompt)

print(f"Sample prompt:\n--------------\n{prompts[0]}")


Sample prompt:
--------------
### Instruction:
[{"role":"system","content":"Your goal is to classify a recipe along several dimensions. You should "},{"role":"user","content":"Strawberry Sorbet\n\nIngredients:\n- 2 cups chopped strawberries Safeway 1 lb For $3.99 thru 02/09\n- 1 cup cold water\n- 2 cups boiling water\n- 1 pkg. (4-serving size) JELL-O Strawberry Flavor Gelatin\n- 1/2 cup sugar\n\nDirections:\n- Place strawberries and cold water in blender container; cover.\n- Blend on high speed until smooth.\n- Stir boiling water into combined dry gelatin mix and sugar in medium bowl at least 2 minutes until completely dissolved.\n- Add strawberry mixture; mix well.\n- Pour into 9-inch square pan.\n- Freeze 1 to 1-1/2 hours or until ice crystals form 1 inch around edges of pan.\n- Spoon half of the gelatin mixture into blender container; cover.\n- Blend on high speed about 30 seconds or until smooth; pour into bowl.\n- Repeat with remaining gelatin mixture.\n- Add to blended gelatin mi
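To make the target format concrete, here is a minimal hand-rolled sketch of an Alpaca-style template with no separate input field and no system preamble. The `alpaca_prompt` helper is hypothetical, and the `### Response:` suffix is an assumption on my part (the sample above is cut off before that point), so check it against Axolotl's own prompter before relying on it.

```python
# Hypothetical helper, not part of Axolotl: a hand-written approximation
# of an Alpaca-style prompt without an input field or system preamble.
def alpaca_prompt(instruction: str) -> str:
    # Assumed layout: instruction header, the instruction itself,
    # then an empty response header for the model to complete.
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


example = alpaca_prompt("Classify this recipe.")
print(example)
```

The point of matching the template is that the model only saw prompts in this shape during training, so prompts in any other shape are out of distribution.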

Next up, I'll use [vLLM](https://vllm.readthedocs.io/en/latest/) to efficiently process all the prompts in our test data with our own model.

In [8]:
from vllm import LLM, SamplingParams

llm = LLM(model="./models/run1/merged", max_num_batched_tokens=4096)

sampling_params = SamplingParams(
    # The expected function call JSON is short, so 120 new tokens should be enough.
    max_tokens=120,
    # Classification should be deterministic, so use greedy decoding (temperature=0).
    temperature=0,
)

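# Pass a plain list of prompt strings rather than a pandas Series.
request_outputs = llm.generate(prompts.tolist(), sampling_params=sampling_params)
# Keep only the generated text of the first completion for each prompt.
my_outputs = [o.outputs[0].text for o in request_outputs]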

test_data["my_outputs"] = my_outputs

print(f"Sample output:\n--------------\n{my_outputs[0]}")


INFO 08-24 18:58:03 llm_engine.py:70] Initializing an LLM engine with config: model='./models/run1/merged', tokenizer='./models/run1/merged', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 08-24 18:59:18 llm_engine.py:196] # GPU blocks: 3419, # CPU blocks: 512


Processed prompts: 100%|██████████| 201/201 [00:16<00:00, 12.18it/s]

Sample output:
--------------
{"role":"assistant","content":null,"function_call":{"name":"classify","arguments":"{\n\"has_non_fish_meat\": false,\n\"requires_oven\": false,\n\"requires_stove\": true,\n\"cook_time_over_30_mins\": false,\n\"main_course\": false\n}"}}





OK, we have our outputs! Each recipe is classified along 5 categories, so a natural metric is the fraction of (recipe, category) pairs where our model's output matches GPT-4's. I'll write a quick eval function to check that.

In [16]:
import json


def parse_fn_call_args(response: str) -> dict:
    """Parse the function call arguments from the response"""
    # The arguments are themselves a JSON-encoded string, so decode twice.
    response_dict = json.loads(response)
    args_dict = json.loads(response_dict["function_call"]["arguments"])

    return args_dict


def calculate_accuracy(row):
    """Calculate the fraction of my model's outputs that match the reference outputs"""
    true_outputs = parse_fn_call_args(row["output"])
    my_outputs = parse_fn_call_args(row["my_outputs"])

    num_matching_outputs = 0
    for key, true_value in true_outputs.items():
        # A category missing from my model's output counts as a mismatch.
        if key in my_outputs and my_outputs[key] == true_value:
            num_matching_outputs += 1

    return num_matching_outputs / len(true_outputs)


test_data["accuracy"] = test_data.apply(calculate_accuracy, axis=1)

print(f"Overall accuracy: {test_data['accuracy'].mean():.2f}")


Overall accuracy: 0.91
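
The overall number hides which categories are hardest. A per-category breakdown could be sketched as below. This is a minimal example on made-up data: the `per_category_accuracy` and `make_response` helpers and the two toy rows are hypothetical, and I'm assuming responses have the same shape as the sample output above.

```python
import json
from collections import defaultdict


def parse_args(response: str) -> dict:
    """Decode the doubly JSON-encoded function call arguments."""
    return json.loads(json.loads(response)["function_call"]["arguments"])


def per_category_accuracy(pairs) -> dict:
    """pairs: iterable of (reference_response, model_response) strings."""
    matched = defaultdict(int)
    total = defaultdict(int)
    for reference_str, predicted_str in pairs:
        reference = parse_args(reference_str)
        predicted = parse_args(predicted_str)
        for key, value in reference.items():
            total[key] += 1
            # A category missing from the prediction counts as a mismatch.
            if key in predicted and predicted[key] == value:
                matched[key] += 1
    return {key: matched[key] / total[key] for key in total}


def make_response(**labels) -> str:
    """Build a toy response shaped like the sample output above."""
    call = {"name": "classify", "arguments": json.dumps(labels)}
    return json.dumps({"role": "assistant", "content": None, "function_call": call})


# Made-up rows for illustration only; they are not taken from test.jsonl.
toy_pairs = [
    (make_response(requires_oven=True, main_course=False),
     make_response(requires_oven=True, main_course=True)),
    (make_response(requires_oven=False, main_course=True),
     make_response(requires_oven=False, main_course=True)),
]

category_accuracy = per_category_accuracy(toy_pairs)
print(category_accuracy)
```

On the real data, the same function should work with `zip(test_data["output"], test_data["my_outputs"])` as the input.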


Not bad! Of course, the next obvious step is to look at where our fine-tuned Llama 2 model is "wrong" and evaluate the types of errors it makes. I've exported a Google Sheet where I did exactly that with an earlier version of this model trained on the same dataset. You can see that [here](https://docs.google.com/spreadsheets/d/1vn-nA0CRQwz-BvEYvxUcO1-EP80ZbPhcxDoCTttvsmI/edit?usp=sharing).

The main takeaway: the places where GPT-4 and Llama 2 disagreed were generally ambiguous cases where either answer was acceptable (e.g., a dish that takes about 30 minutes to cook might be classified as over 30 minutes by one model and under 30 minutes by the other).
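
To collect such cases for manual review, one option is to list every (recipe, category) pair where the two models differ. The sketch below is not how the spreadsheet was produced; `find_disagreements` and the toy pair are hypothetical, and responses are again assumed to follow the function call shape shown earlier.

```python
import json


def parse_args(response: str) -> dict:
    """Decode the doubly JSON-encoded function call arguments."""
    return json.loads(json.loads(response)["function_call"]["arguments"])


def find_disagreements(pairs) -> list:
    """Return one record per (row, category) where the two responses differ."""
    disagreements = []
    for index, (reference_str, predicted_str) in enumerate(pairs):
        reference = parse_args(reference_str)
        predicted = parse_args(predicted_str)
        for key, value in reference.items():
            # .get() returns None for a missing category, which never equals a bool.
            if predicted.get(key) != value:
                disagreements.append(
                    {"row": index, "category": key, "gpt4": value, "mine": predicted.get(key)}
                )
    return disagreements


def make_response(**labels) -> str:
    """Build a toy response shaped like the sample output above."""
    call = {"name": "classify", "arguments": json.dumps(labels)}
    return json.dumps({"role": "assistant", "content": None, "function_call": call})


# Made-up pair for illustration: the models differ only on cook time.
toy_pairs = [
    (make_response(cook_time_over_30_mins=True, requires_stove=True),
     make_response(cook_time_over_30_mins=False, requires_stove=True)),
]

disagreements = find_disagreements(toy_pairs)
print(disagreements)
```

Each record keeps the row index, so the original recipe text can be looked up and read alongside both labels.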

Interested in cost/latency benchmarking? You can check out [./benchmarking.ipynb](./benchmarking.ipynb) for an overview of my findings!