In [5]:
%load_ext autoreload
%autoreload 2

import getpass
import os
import sys

import numpy as np
import pandas as pd
pd.set_option('max_colwidth', 800)

from joblib import Parallel, delayed

# add path for the src dir
sys.path.append('/Users/maxshap/Documents/workspace/LLMverse/src')

from LLMverse.api.openai import basic_response_generation



### Evaluate model performance on the BRAINTEASER dataset to assess lateral thinking capabilities
* [paper link](https://arxiv.org/pdf/2310.05057)
* [dataset link](https://github.com/1171-jpg/BrainTeaser/tree/main)

Approach:
1. Take the riddles one by one and ask `gpt-4o-mini` and `gpt-4o` to think step by step and solve each one.
2. Run `gpt-4o` as a judge (it is the more powerful of the two models): give it the riddle text together with the reference answer and the generated answer, and ask it to check whether the generated answer is semantically close to the reference.
3. Calculate accuracy as the share of riddles the judge marks as solved.


In [None]:
# load data
sentence_puzzles = np.load('/Users/maxshap/Downloads/BTDATA/sentence_puzzle.npy', allow_pickle=True)
sentence_frame = pd.DataFrame({
    'id': [x['id'] for x in sentence_puzzles],
    'question': [x['question'] for x in sentence_puzzles],
    'answer': [x['answer'] for x in sentence_puzzles],
    'list_of_options': [x['choice_list'] for x in sentence_puzzles]
})
word_puzzles = np.load('/Users/maxshap/Downloads/BTDATA/word_puzzle.npy', allow_pickle=True)
word_frame = pd.DataFrame({
    'id': [x['id'] for x in word_puzzles],
    'question': [x['question'] for x in word_puzzles],
    'answer': [x['answer'] for x in word_puzzles],
    'list_of_options': [x['choice_list'] for x in word_puzzles]
})
data = pd.concat([sentence_frame, word_frame])
data = data.sample(random_state=101, frac=1)

print('Original dataset shape', data.shape)

data['category'] = data['id'].apply(lambda x: 'word_puzzle' if 'WP' in x else 'sentence_puzzle')
# exclude puzzles whose correct option is 'None of above.': they only make sense in the
# multiple-choice setting, while here the model answers without seeing the options
data = data[data['answer'] != 'None of above.']
print('Dataset after filtrations', data.shape)

def extract_puzzle_type(puzzle_id):
    if 'SR' in puzzle_id:
        return 'semantic_reconstruction'
    elif 'CR' in puzzle_id:
        return 'context_reconstruction'
    return 'original'

data['puzzle_type'] = data['id'].apply(extract_puzzle_type)

data.head()

Original dataset shape (1119, 4)
Dataset after filtrations (1058, 5)


Unnamed: 0,id,question,answer,list_of_options,category,puzzle_type
112,WP-37_SR,Which three letters can fend off a thief?,I C U.,"[O P T, I C U., S O S, None of above.]",word_puzzle,original
284,SP-94_CR,"Two person had to go to the top of the hill. There's only one motorcycle, yet only one person can ride it. Fortunately, they both get to the top of the hill. How?",One person is already at the top of the hill.,"[They find a good view., One men go home to get a another motorcycle., One person is already at the top of the hill., None of above.]",sentence_puzzle,semantic_reconstruction
237,SP-79,"Jessica is telling her friends this story and asks them to guess if it's the truth or a lie: ""There was a man sitting in a house at night that had no lights on at all. There was no lamp, no candle, and no other source of light. Yet, he sat in the house and read his book happily."" Her friends say she's lying, but Jessica corrects them and says she's telling the truth. Jessica's story is true—but how?",The man is blind and is reading braille.\n,"[The man is blind and is reading braille.\n, The man is smart that she pretended to reading the book., Because it was daytime., None of above.]",sentence_puzzle,semantic_reconstruction
32,SP-10_CR,"Eight people were sitting under a large tree. Suddenly, a gust of wind blows, yet none of them got hit by any falling leaves. How is this possible?",It was winter and the tree doesn't have any leaves.,"[Trees could have had a unique characteristic that made it resistant to losing leaves in windy conditions. , The gust of wind is that the wind blew in a direction away from the tree rather than towards it., It was winter and the tree doesn't have any leaves., None of above.]",sentence_puzzle,semantic_reconstruction
107,WP-35_CR,"What is the sum of 3/6 monkey, 3/4 tree, and 2/7 alcohol?",Montreal.,"[Montreal., California., Texas., None of above.]",word_puzzle,context_reconstruction


In [7]:
# define prompt for model to solve riddles

system_prompt = """
You are an expert in solving riddles.
You will be provided with a riddle wrapped in the tags: <riddle>riddle text</riddle>.

Your task is to provide an answer to the riddle.

If you find it helpful, you may output your intermediate thoughts to aid in finding the answer. These should be wrapped in the tags <thinking>your thinking process</thinking>. However, this is optional.
You must conclude your response with the final answer wrapped in the tags <answer>your answer</answer>.
If you are unsure of the answer, respond with <answer>I have no answer</answer>.
Let’s begin solving riddles.
"""

# define prompt for model to validate solutions against correct answers
validator_system_prompt = """
You are an expert in validating answers to riddles.

You will be provided with the following:

A riddle wrapped in the tags: <riddle>riddle text</riddle>.
A reference answer wrapped in the tags: <reference_answer>text</reference_answer>.
A predicted answer wrapped in the tags: <predicted_answer>text</predicted_answer>.
Your task is to determine whether the predicted answer matches the reference answer.

Focus on whether the meaning of the predicted answer aligns with the reference answer, ignoring any typos.
The reference answer may also include an explanation, usually in a separate sentence. If the predicted answer contains reasoning that differs from the reference reasoning but the predicted answer itself is correct, you should still consider the riddle as solved correctly.
If you strongly believe the predicted answer is valid and can be treated as correct (even if it is completely different from the reference answer), you may decide that the riddle is solved correctly.
You may output intermediate thoughts to help you reach a decision. These should be wrapped in the tags <thoughts></thoughts>.

Finally, return your verdict wrapped in the tags <verdict>your verdict</verdict>.
Your verdict should be either True (for matching answers) or False (if the answers do not match).
"""
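For illustration, the user messages that accompany these two system prompts can be assembled as shown below. The helper names `build_solver_prompt` and `build_judge_prompt` are hypothetical and not part of `LLMverse`; the tag layout simply mirrors what the prompts above describe.

```python
def build_solver_prompt(question: str) -> str:
    """Wrap a riddle in the tags the solver system prompt expects."""
    return f'<riddle>{question}</riddle>'


def build_judge_prompt(question: str, reference: str, predicted: str) -> str:
    """Combine riddle, reference answer and predicted answer for the judge."""
    return (
        f'<riddle>{question}</riddle>\n'
        f'<reference_answer>{reference}</reference_answer>\n'
        f'<predicted_answer>{predicted}</predicted_answer>'
    )


# example with one riddle from the dataset and a candidate answer
judge_prompt = build_judge_prompt(
    'Which three letters can fend off a thief?', 'I C U.', 'FBI'
)
print(judge_prompt)
```

Keeping the tag layout in one place makes it harder for the prompt text and the message format to drift apart.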

In [8]:
# setup api key
api_key = getpass.getpass()

In [10]:
# helper function to read predictions and validations from local files

import re
def parse_response_from_file(utt_id, output_dir):
    with open(f'{output_dir}/{utt_id}.txt', 'r') as f:
        model_output = f.read()
    matches = re.findall(r"<answer>(.*?)</answer>", model_output)
    if not matches:
        print('Answer was not found, replace with I have no answer string')
        return "I have no answer"
    if len(matches) > 1:
        print("Suspicious response with multiple matches, first is used")
    return matches[0]

def extract_verdict_from_file(utt_id, output_dir):
    with open(f'{output_dir}/{utt_id}.txt', 'r') as f:
        model_output = f.read()    
    patterns = [
        r"<verdict>(.*?)</verdict>",              # Standard case
        r"verdict=(True|False)",                 # Inline case
        r"verdict>(True|False)</verdict>",       # Inline with incorrect opening tag
        r"veredict>\s*(True|False)\s*</verdict>",  # Misspelled "veredict" case
        r"verdict>\s*<(True|False)\s*/>\s*</verdict>",  # Case with additional closing brackets
        r"verdict>\s*<(True|False)\s*</verdict>",  # Case with stray "<" before verdict value
        r"verdict\s+(True|False)",               # Inline case without "="
    ]
    # Try each pattern in order and stop at the first one that matches
    for pattern in patterns:
        match = re.search(pattern, model_output, re.IGNORECASE)
        if match:
            # lowercase the value: the search is case-insensitive, so "true" or "TRUE" can match too
            verdict = match.group(1).strip().lower()
            if 'true' in verdict:
                return True
            elif 'false' in verdict:
                return False
            else:
                print(f'Unexpected answer, {verdict}')
                return False
    print(f'Answer was not found, replace with False, suspicious example: {model_output}')
    return False
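
A quick way to sanity-check this kind of tag parsing without touching the file system is to run the regular expression on in-memory strings. The function `parse_verdict` below is a hypothetical, simplified variant that only handles the standard `<verdict>` tag. As an assumption of this sketch, `re.DOTALL` is added so that a verdict spread over several lines is still found.

```python
import re
from typing import Optional


def parse_verdict(model_output: str) -> Optional[bool]:
    """Return True/False for a well-formed verdict tag, None if nothing usable is found."""
    match = re.search(r'<verdict>(.*?)</verdict>', model_output,
                      re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    value = match.group(1).strip().lower()
    if value == 'true':
        return True
    if value == 'false':
        return False
    return None


print(parse_verdict('<thoughts>Same meaning.</thoughts><verdict>True</verdict>'))  # True
print(parse_verdict('<verdict>\nfalse\n</verdict>'))  # False
print(parse_verdict('no tags at all'))  # None
```

Returning `None` for unreadable output keeps "the judge said no" apart from "the judge output could not be parsed", two cases that the file-based helper above both maps to `False`.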

In [12]:
model = 'gpt-4o-mini'

for model in ['gpt-4o-mini', 'gpt-4o']:
    # prepare output directory for predictions
    output_dir = f'/Users/maxshap/Documents/workspace/LLMverse/src/LLMverse/projects/brainteasers_reasoning/brainteaser_{model}_results'
    os.makedirs(output_dir, exist_ok=True)
    print(f'Running solver for {model}')
    # parallelize the process
    _ = Parallel(n_jobs=4)(
        delayed(basic_response_generation)(system_message=system_prompt, 
                          prompt=f'<riddle>{row.question}</riddle>', 
                          utt_id=row.id, 
                          output_dir=output_dir, 
                          api_key=api_key, 
                          model_name=model, 
                          skip_if_exist=True) for row in data[['id', 'question']].itertuples())
    
    # store predictions in the pandas frame
    data[f'{model}_response'] = data['id'].apply(lambda x: parse_response_from_file(x, output_dir))
    
    print(f'Running judge for {model}')
    # extract validations
    output_dir = f'/Users/maxshap/Documents/workspace/LLMverse/src/LLMverse/projects/brainteasers_reasoning/brainteaser_{model}_validation_results'
    os.makedirs(output_dir, exist_ok=True)

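    # parallelize the process to speed things up
    # row._4 refers to the {model}_response column: itertuples() gives positional names to
    # columns that are not valid Python identifiers (the model name contains hyphens)
    _ = Parallel(n_jobs=3)(
        delayed(basic_response_generation)(system_message=validator_system_prompt, 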
                          prompt=f'<riddle>{row.question}</riddle>\n<reference_answer>{row.answer}</reference_answer>\n<predicted_answer>{row._4}</predicted_answer>', 
                          utt_id=row.id, 
                          output_dir=output_dir, 
                          api_key=api_key, 
                          model_name='gpt-4o', 
                          skip_if_exist=True) for row in data[['id', 'question', 'answer', f'{model}_response']].itertuples())
    # store validations in a frame
    data[f'{model}_validation'] = data['id'].apply(lambda x: extract_verdict_from_file(x, output_dir))
    

Running solver for gpt-4o-mini


Suspicious response with multiple matches, first is used
Answer was not found, replace with I have no answer string
Running judge for gpt-4o-mini
Running solver for gpt-4o
Running judge for gpt-4o
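
The `row._4` access in the judge call relies on how pandas builds its named tuples: column names that are not valid Python identifiers are replaced by positional names. A minimal sketch on a made-up one-row frame shows the renaming, together with an alternative that looks the column up by its original name:

```python
import pandas as pd

# toy frame with the same column layout as the judge call; the values are made up
frame = pd.DataFrame({
    'id': ['SP-1'],
    'question': ['riddle text'],
    'answer': ['reference answer'],
    'gpt-4o-mini_response': ['predicted answer'],
})

row = next(frame.itertuples())
print(row._fields)  # the hyphenated column gets a positional name
print(row._4)

# alternative: iterate over plain dicts and keep the original column names
records = frame.to_dict('records')
print(records[0]['gpt-4o-mini_response'])
```

The dict-based lookup does not break if the column order changes, whereas `_4` silently points at a different column once a column is added or removed before it.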


In [13]:
data.head()

Unnamed: 0,id,question,answer,list_of_options,category,puzzle_type,gpt-4o-mini_response,gpt-4o-mini_validation,gpt-4o_response,gpt-4o_validation
112,WP-37_SR,Which three letters can fend off a thief?,I C U.,"[O P T, I C U., S O S, None of above.]",word_puzzle,original,FBI,True,"D, O, G (dog)",True
284,SP-94_CR,"Two person had to go to the top of the hill. There's only one motorcycle, yet only one person can ride it. Fortunately, they both get to the top of the hill. How?",One person is already at the top of the hill.,"[They find a good view., One men go home to get a another motorcycle., One person is already at the top of the hill., None of above.]",sentence_puzzle,semantic_reconstruction,"One person rides the motorcycle to the top, while the other walks or takes another route to reach the top.",False,"One person rides the motorcycle partway up the hill, leaves it, and continues walking to the top. The second person walks to the motorcycle, rides it the rest of the way to the top, where they both meet.",False
237,SP-79,"Jessica is telling her friends this story and asks them to guess if it's the truth or a lie: ""There was a man sitting in a house at night that had no lights on at all. There was no lamp, no candle, and no other source of light. Yet, he sat in the house and read his book happily."" Her friends say she's lying, but Jessica corrects them and says she's telling the truth. Jessica's story is true—but how?",The man is blind and is reading braille.\n,"[The man is blind and is reading braille.\n, The man is smart that she pretended to reading the book., Because it was daytime., None of above.]",sentence_puzzle,semantic_reconstruction,The man was reading by the light of the moon.,False,The man is blind and reading a book in Braille.,True
32,SP-10_CR,"Eight people were sitting under a large tree. Suddenly, a gust of wind blows, yet none of them got hit by any falling leaves. How is this possible?",It was winter and the tree doesn't have any leaves.,"[Trees could have had a unique characteristic that made it resistant to losing leaves in windy conditions. , The gust of wind is that the wind blew in a direction away from the tree rather than towards it., It was winter and the tree doesn't have any leaves., None of above.]",sentence_puzzle,semantic_reconstruction,There are no leaves on the tree.,True,The tree had no leaves to begin with.,True
107,WP-35_CR,"What is the sum of 3/6 monkey, 3/4 tree, and 2/7 alcohol?",Montreal.,"[Montreal., California., Texas., None of above.]",word_puzzle,context_reconstruction,Montreal,True,mental,False


Overall accuracy of the systems

In [18]:
pd.DataFrame({
    'system': ['gpt-4o', 'gpt-4o-mini'],
    'accuracy': [round(data['gpt-4o_validation'].mean(), 2), 
                 round(data['gpt-4o-mini_validation'].mean(), 2)],
})

Unnamed: 0,system,accuracy
0,gpt-4o,0.75
1,gpt-4o-mini,0.55


Compare model accuracy on:
 * Original puzzles, puzzle type `original` (the model could have seen these puzzles during pretraining)
 * Reconstructed versions of the puzzles, `context_reconstruction` or `semantic_reconstruction` (the model is less likely to have seen these during pretraining or instruction tuning)

In [31]:
data.groupby('puzzle_type')['gpt-4o_validation'].mean()

puzzle_type
context_reconstruction     0.656250
original                   0.841360
semantic_reconstruction    0.745042
Name: gpt-4o_validation, dtype: float64

In [39]:
data.groupby('puzzle_type')['gpt-4o-mini_validation'].mean().sort_values(ascending=False)

puzzle_type
original                   0.603399
semantic_reconstruction    0.541076
context_reconstruction     0.502841
Name: gpt-4o-mini_validation, dtype: float64

In [32]:
data.groupby('puzzle_type')['gpt-4o-mini_validation'].mean()

puzzle_type
context_reconstruction     0.502841
original                   0.603399
semantic_reconstruction    0.541076
Name: gpt-4o-mini_validation, dtype: float64

Finally, let's see how the models perform on the `word_puzzle` and `sentence_puzzle` categories.

In [20]:
data.groupby('category')['gpt-4o_validation'].mean()

category
sentence_puzzle    0.776224
word_puzzle        0.713992
Name: gpt-4o_validation, dtype: float64

In [21]:
data.groupby('category')['gpt-4o-mini_validation'].mean()

category
sentence_puzzle    0.568182
word_puzzle        0.526749
Name: gpt-4o-mini_validation, dtype: float64

Basic analysis

* The bigger model performs better: gpt-4o outperforms gpt-4o-mini by 20 percentage points (0.75 vs. 0.55 accuracy)
* Model performance goes down on the reconstructed variants of the puzzles. This suggests that the models have likely seen some of the original puzzles in their training data and memorized the answers.
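
To get a rough sense of how much sampling noise sits behind the overall numbers, a binomial standard error can be attached to each accuracy. This is a back-of-the-envelope sketch: it assumes the 1058 judged riddles behave like independent trials and uses the rounded accuracies reported above. The function name `accuracy_standard_error` is illustrative.

```python
import math


def accuracy_standard_error(accuracy: float, n: int) -> float:
    """Standard error of a proportion under a simple binomial model."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


n_riddles = 1058  # dataset size after filtering
accuracies = {'gpt-4o': 0.75, 'gpt-4o-mini': 0.55}  # rounded overall accuracies

for system, acc in accuracies.items():
    se = accuracy_standard_error(acc, n_riddles)
    print(f'{system}: {acc:.2f} +/- {1.96 * se:.3f} (approx. 95% interval)')

gap_points = round((accuracies['gpt-4o'] - accuracies['gpt-4o-mini']) * 100)
print(f'gap: {gap_points} percentage points')
```

Because both models answer the same riddles, a paired comparison would be the more careful way to test the gap itself; the sketch only covers each accuracy on its own.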

Next steps:
* Instead of generating a single response, sample N responses per riddle and select the best one as the final prediction (essentially ensembling). This will likely improve results for all models
* Try bigger(?) and smarter (for sure) models like `o1` that are specifically optimized for reasoning tasks
* Try a real-life scenario: let the model generate a response, validate it, and if the answer is incorrect, give the model another attempt that uses the validator's feedback. This mimics a real brainteaser interview, where the candidate thinks in steps and checks their reasoning with the interviewer
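
The retry scenario from the last bullet could be sketched as a small loop. Everything below is hypothetical: `solve` and `judge` stand in for calls to the solver and validator models, and the stub implementations exist only so the loop can be run without an API key.

```python
from typing import Callable, Optional, Tuple


def solve_with_retries(
    riddle: str,
    solve: Callable[[str, Optional[str]], str],
    judge: Callable[[str, str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> Tuple[str, bool, int]:
    """Ask for an answer and retry with the judge's feedback while it is rejected."""
    feedback: Optional[str] = None
    answer = ''
    for attempt in range(1, max_attempts + 1):
        answer = solve(riddle, feedback)
        solved, feedback = judge(riddle, answer)
        if solved:
            return answer, True, attempt
    return answer, False, max_attempts


# stubs standing in for the model calls (hypothetical behaviour)
def stub_solve(riddle: str, feedback: Optional[str]) -> str:
    return 'S O S' if feedback is None else 'I C U'


def stub_judge(riddle: str, answer: str) -> Tuple[bool, str]:
    if answer == 'I C U':
        return True, 'correct'
    return False, 'think about how the letters sound when read aloud'


result = solve_with_retries(
    'Which three letters can fend off a thief?', stub_solve, stub_judge
)
print(result)  # ('I C U', True, 2)
```

In such a setup the judge's feedback should hint at the problem without revealing the reference answer; otherwise accuracy on later attempts would not be comparable with the single-attempt numbers.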