<a href="https://colab.research.google.com/github/NoamMichael/Comparing-Confidence-in-LLMs/blob/main/LSAT_Benchmarking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# This Notebook will test all models on the formatted LSAT-AR dataset
# I have no issue running multiple API clients simultaneously. However, running
# local models is pretty memory intensive so I can only run one at a time.

# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
#                 TO DO LIST:
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
# (X) Implement GetRAC() for Anthropic and Gemini Models
#  O  Fix formatdf() function (Daniels implementation)
#  O  Fix how systemprompts work (maybe make a metaclass var?)
#       --Honestly works fine now as global var. I dont see a need to change it.
# (X) Make an init_models() function:
#     def init_functions(models_dict):
#         "Do some stuff"
#         return models_list
#  O  Clean up notebook




In [21]:
%pip install anthropic
%pip install openai
%pip install tqdm



In [24]:
import pandas as pd
import numpy as np
import json
import time
import random
import torch
import matplotlib.pyplot as plt
from transformers import (AutoTokenizer,
                        AutoModelForCausalLM,
                        BitsAndBytesConfig,
                        pipeline)
import warnings
import openai
import anthropic
import google.generativeai as genai
from abc import ABC, abstractmethod
from tqdm.notebook import tqdm
warnings.filterwarnings('ignore')
from google.colab import userdata

max_tokens = 250

class OpenModel: ## This class is built around Hugging Face methods
  def __init__(self, name, key, MaxTokens = 250):
    self.name = name
    self.key = key
    self.MaxTokens = MaxTokens
    print(f"Downloading Tokenizer for {self.name}")
    self.tokenizer = AutoTokenizer.from_pretrained(self.name,token = self.key) ## Import Tokenizer
    print(f"Downloading Model Weights for {self.name}")
    self.model = AutoModelForCausalLM.from_pretrained(self.name, token = self.key, device_map="auto") ## Import Model

    ## Make text generation pipeline
    self.pipeline = pipeline(
    "text-generation",
    model = self.model,
    tokenizer = self.tokenizer,
    do_sample = False,
    max_new_tokens = self.MaxTokens,
    eos_token_id = self.tokenizer.eos_token_id,
    pad_token_id = self.tokenizer.eos_token_id
    )

  def generate(self, prompt):
    return self.pipeline(prompt)[0]['generated_text']

  def GetTokens(self, prompt: str):
    ## Get Answer:
    batch = self.tokenizer(prompt, return_tensors= "pt").to('cuda')
    with torch.no_grad():
        outputs = self.model(**batch)
    ## Get Token Probabilites
    logits = outputs.logits

    ## Apply softmax to the logits to get probabilities
    probs = torch.softmax(logits[0, -1], dim=0)

    ##Get the top k token indices and their probabilities
    top_k_probs, top_k_indices = torch.topk(probs, 100, sorted =True)

    ## Convert token indices to tokens
    top_k_tokens = [self.tokenizer.decode([token_id]) for token_id in top_k_indices]

    ## Convert probabilities to list of floats
    top_k_probs = top_k_probs.tolist()                  #list of probabilities

    ## Create a Pandas Series with tokens as index and probabilities as values
    logit_series = pd.Series(top_k_probs, index=top_k_tokens)

    ## Sort the series by values in descending order
    logit_series = logit_series.sort_values(ascending=False)
    logit_series.index.name = "Token"
    logit_series.name = "Probability"
    return logit_series

class ClosedModel(ABC):
  @abstractmethod
  def generate(self, prompt: str, system:str = "")-> str:
        """
        Abstract method to generate a response from the language model.
        """
        pass
  @abstractmethod
  def __init__(self, name, api_key):
    self.name = name
    self.key = api_key
    self.results = pd.DataFrame(columns = ['Question ID','Question', 'Answer', 'Reasoning'])
    pass

  @abstractmethod
  def client(self):
    pass

  @abstractmethod
  def GetRAC(self, prompt: str, system1:str = "", system2: str = "")-> tuple[str, str]: ## Get Reasoning Answer Confidence
    pass

class GPTmodel(ClosedModel):
  def __init__(self, name, api_key):
    self.name = name
    self.key = api_key
    self.results = pd.DataFrame(columns = ['Question ID','Question', 'Answer', 'Reasoning'])
  def client(self):
    # Initialize the OpenAI client with the API key
    self.client = openai.OpenAI(api_key=self.key)


  def generate(self, prompt: str, system: str = "") -> str:
    # Use the new client-based API call
    response = self.client.chat.completions.create(
        model=self.name,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        temperature=0,
        max_tokens= max_tokens
    )
    # Access the content from the new response object structure
    return response.choices[0].message.content
  def GetRAC(self, prompt: str)-> tuple[str, str]: ## Get Reasoning Answer Confidence
    ## For context here's the two system prompts:
    ## System prompt 1:
    '''
    Given the following question, analyze the options, and provide a concise reasoning for your selected answer. Your reasoning should not exceed 100 words. After your explanation, clearly state your answer by choosing one of the options listed (A, B, C, D, or E).

    Question: ${Question}
    Options:
    A) ${Option A}
    B) ${Option B}
    C) ${Option C}
    D) ${Option D}
    E) ${Option E}

    Please provide your reasoning first, limited to 100 words, and then conclusively state only your selected answer using the corresponding letter (A, B, C, D, or E).
    Reasoning: <Your concise reasoning here. Max 100 words>
    '''

    ## System prompt 2:
    '''
    Based on the reasoning above, Provide the correct answer and the likelihood that each option is correct from 0.0 to 1.0 in a JSON format. The four probabilities should sum to 1.0. For example:

    {
    'Answer': <Your answer choice here, as a single letter and nothing else.>
    'A': <Probability choice A is correct. As a float from 0.0 to 1.0>,
    'B': <Probability choice B is correct. As a float from 0.0 to 1.0>,
    'C': <Probability choice C is correct. As a float from 0.0 to 1.0>,
    'D': <Probability choice D is correct. As a float from 0.0 to 1.0>,
    'E': <Probability choice E is correct. As a float from 0.0 to 1.0>
    }
    '''
    # Access global system prompts
    global sys_prompt1, sys_prompt2
    ## Get the reasoning
    reasoning = self.generate(prompt, sys_prompt1)
    ## Get the answer and confidence
    answer_confidence = self.generate(prompt + reasoning + sys_prompt2, sys_prompt2)

    return reasoning, answer_confidence

class AnthropicModel(ClosedModel):
  def __init__(self, name, api_key):
    self.name = name
    self.key = api_key
    self.results = pd.DataFrame(columns = ['Question ID','Question', 'Answer', 'Reasoning'])
  def client(self):
    # Initialize the Anthropic client with the API key
    self.client = anthropic.Anthropic(api_key=self.key)

  def generate(self, prompt: str, system: str = "") -> str:
    # The messages list should only contain user and assistant roles
    messages = [{"role": "user", "content": prompt}]

    # Use the Anthropic client to create a message
    # Pass the system message as a top-level 'system' parameter
    message = self.client.messages.create(
        model=self.name,
        max_tokens= max_tokens, # You can adjust this or make it an instance variable
        messages=messages,
        system=system if system else None # Pass system as a separate parameter, or None if empty
    )
    # Access the content from the response object
    return message.content[0].text

  def GetRAC(self, prompt: str)-> tuple[str, str]: ## Get Reasoning Answer Confidence
    # Access global system prompts
    global sys_prompt1, sys_prompt2
    ## Get the reasoning
    reasoning = self.generate(prompt, sys_prompt1)
    ## Get the answer and confidence
    answer_confidence = self.generate(prompt + reasoning + sys_prompt2, sys_prompt2)

    return reasoning, answer_confidence
class GeminiModel(ClosedModel):
  def __init__(self, name, api_key):
    self.name = name
    self.key = api_key
    self.results = pd.DataFrame(columns = ['Question ID','Question', 'Answer', 'Reasoning'])

  def client(self):
    # Initialize the google.generativeai client with the API key

    genai.configure(api_key=self.key)
    self.model = genai.GenerativeModel(model_name=self.name)

  def generate(self, prompt: str, system: str = "") -> str:
    # Build the content list, including the system message if provided
    contents = [{"role": "user", "parts": [prompt]}]
    if system:
        contents = [{"role": "user", "parts": [system]}] + contents

    # Use the Gemini model to generate content
    response = self.model.generate_content(contents)

    # Access the content from the response object
    return response.text

  def GetRAC(self, prompt: str)-> tuple[str, str]: ## Get Reasoning Answer Confidence
    # Access global system prompts
    global sys_prompt1, sys_prompt2
    ## Get the reasoning
    reasoning = self.generate(prompt, sys_prompt1)
    ## Get the answer and confidence
    answer_confidence = self.generate(prompt + reasoning + sys_prompt2, sys_prompt2)

    return reasoning, answer_confidence






### Define Functions:

In [22]:
## Define Functions:
def init_models(models_dict,
                test_prompt = "Zdzisław Beksiński was",
                test_system = "You are a helpful assistant."):
  print('Initializing Closed Models:')
  closed_models = []
  for model_type in my_closed_models:
      print(f'{model_type}:')
      api_key_name = my_closed_models[model_type]['api_key_name']
      api_key = userdata.get(api_key_name)
      print(f'  API Key Name: {my_closed_models[model_type]["api_key_name"]}')
      for model_name in my_closed_models[model_type]['models']:
        # Instantiate the correct subclass based on model_type
        if model_type == 'GPT':
            my_model = GPTmodel(name = model_name, api_key = api_key)
        elif model_type == 'Claude':
            my_model = AnthropicModel(name = model_name, api_key = api_key)
        elif model_type == 'Gemini':
            my_model = GeminiModel(name = model_name, api_key = api_key)
        else:
            # Handle unexpected model types if necessary
            print(f"Warning: Unknown model type {model_type}. Skipping.")
            continue # Skip to the next model name if type is unknow
        my_model.client()
        closed_models.append(my_model)
        print(f'    {model_name}')


  print(f'Models Initialized: {len(closed_models)}')
  print(f'Model locations:\n{closed_models}')

  print('-'*42)
  print('Testing all closed models:')
  print(f'Test prompt: {test_prompt}')
  print(f'Test system: {test_system}')

  for model in closed_models:
    print(f'\nTesting model: {model.name}')
    print(model.generate(test_prompt, test_system))
  return closed_models

def make_system_prompt(df, sys_prompt1 = '', sys_prompt2 = ''):
  if sys_prompt1 == '' and sys_prompt2 == '':

    sys_prompt1 = '''
    Given the following question, analyze the options, and provide a concise reasoning for your selected answer. Your reasoning should not exceed 100 words. After your explanation, clearly state your answer by choosing one of the options listed (A, B, C, D, or E).

    Question: ${Question}
    Options:
    A) ${Option A}
    B) ${Option B}
    C) ${Option C}
    D) ${Option D}
    E) ${Option E}

    Please provide your reasoning first, limited to 100 words, and then conclusively state only your selected answer using the corresponding letter (A, B, C, D, or E).
    Reasoning: <Your concise reasoning here. Max 100 words>
    '''
    sys_prompt2 = '''
    Based on the reasoning above, Provide the correct answer and the likelihood that each option is correct from 0.0 to 1.0 in a JSON format. The four probabilities should sum to 1.0. For example:

    {
    'Answer': <Your answer choice here, as a single letter and nothing else.>
    'A': <Probability choice A is correct. As a float from 0.0 to 1.0>,
    'B': <Probability choice B is correct. As a float from 0.0 to 1.0>,
    'C': <Probability choice C is correct. As a float from 0.0 to 1.0>,
    'D': <Probability choice D is correct. As a float from 0.0 to 1.0>,
    'E': <Probability choice E is correct. As a float from 0.0 to 1.0>
    }
    '''
  ## Edit system Prompts in order to match size of dataset
  columns = df.columns
  num_options = columns.str.contains('Option').astype(int).sum()

  sys_prompt_temp1 = sys_prompt1
  sys_prompt_temp2 = sys_prompt2
  ## Reformat system prompt in order to fit number of options in benchmark
  if num_options < 5: ## ABCD
    sys_prompt_temp1 = (sys_prompt1
                  .replace('(A, B, C, D, or E)', '(A, B, C, or D)') ## Change the available options
                  .replace('E) ${Option E}', '') ## Drop option E
        )
    sys_prompt_temp2 = (sys_prompt2
                  .replace('(A, B, C, D, or E)', '(A, B, C, or D)') ## Change the available options
                  .replace('E) ${Option E}', '') ## Drop option E
        )
    if num_options < 4: ## ABC
      sys_prompt_temp1 = (sys_prompt_temp1
                    .replace('(A, B, C, or D)', '(A, B, or C)') ## Change the available options
                    .replace('D) ${Option D}', '') ## Drop option D
          )
      sys_prompt_temp2 = (sys_prompt_temp2
                  .replace('(A, B, C, or D)', '(A, B, or C)') ## Change the available options
                  .replace('D) ${Option D}', '') ## Drop option D
        )

      if num_options < 3: ## AB
        sys_prompt_temp1 = (sys_prompt_temp1
                      .replace('(A, B, or C)', '(A or B)') ## Change the available options
                      .replace('C) ${Option C}', '') ## Drop option C
            )
        sys_prompt_temp2 = (sys_prompt_temp2
                    .replace('(A, B, or C)', '(A or B)') ## Change the available options
                    .replace('C) ${Option C}', '') ## Drop option C
          )

  return sys_prompt_temp1, sys_prompt_temp2

def format_df(df):

  ## %%%%%%%%%%%%%%
  ## I need to fix how formating is done for some Q's. As daniel pointed out some
  ## questions only have 4 options, not 5.
  ## %%%%%%%%%%%%%%

  ## Takes in a dataframe in the form:
  ## | Question Number | Question | Option A | Option B | ... | Correct Answer Letter |
  ## |     (Int)       |     (Str)     |  (Str)   |  (Str)   |     |       (Char)          |
  ##
  ## Returns a dataframe in the form:
  ## | Question Number | Full Prompt 1 | Full Prompt 2 |
  ## |     (Int)       |    (Str)      |    (Str)      |

  columns = df.columns
  num_options = columns.str.contains('Option').astype(int).sum()

  #----------------------------------------------------------------------------#
  ## Check if DF is formatted properly
  error_text = f'''Make sure dataframe is in following format:
  | Question Number | Question | Option A | Option B | ... | Correct Answer Letter |
  |     (Int)       |     (Str)     |  (Str)   |  (Str)   |     |       (Char)          |

  The current format of Dataframe is: {columns}
  '''
  ['Question Number', 'Question', 'Correct Answer Letter']
  if num_options < 2:
    raise Exception(error_text)

  #----------------------------------------------------------------------------#
  ## Initialize Output dataframe:
  header = ['Question Num', 'Full Prompt 1', 'Full Prompt 2']
  output_df = pd.DataFrame(columns = header)

  #----------------------------------------------------------------------------#

  ## Format questions for benchmark
  letters = ['A', 'B', 'C', 'D', 'E']
  options = ['Option A', 'Option B', 'Option C', 'Option D', 'Option E']

  for i in range(len(df)):
    question = df['Question'][i]
    option_text = df[options[:num_options]].iloc[i].to_list()

    ## Prompt for specific question
    new_prompt = sys_prompt_temp1.replace('${Question}', question)
    for j in range(num_options): ## This for loop allows for dynamic question amounts
        new_prompt = new_prompt.replace(f'${{Option {letters[j]}}}', str(option_text[j]))


    ## Add formatted prompts.
    ## Note that this is formatted to llama so changes may be needed down the line.
    prompts1 = (new_prompt.split('<Your concise reasoning here. Max 100 words>')[0]) ## Specific prompt for question

    prompts2 = (sys_prompt_temp2) ## Generic prompt for question confidence
    output_df.loc[i] = [df['Question Number'].iloc[i], prompts1, prompts2]

  return output_df

def test_models_sequential_by_question(df, models, debug=False):
    """
    Tests a list of models on a given dataset sequentially,
    iterating through questions and then models for each question.
    Includes a debug mode to process only the first 10 questions.

    Args:
        df (pd.DataFrame): The dataset containing questions and prompts.
        models (list): A list of initialized model objects.
        debug (bool): If True, only process the first 10 questions.
    """
    print("Clearing previous results for each model...")
    for model in models:
        model.results = pd.DataFrame(columns=['Question ID', 'Question', 'Answer', 'Reasoning'])
        print(f"  Cleared results for {model.name}")
    print("Starting sequential testing (by question)...")

    # Determine the number of questions to process
    num_questions_to_process = 10 if debug else len(df)

    # Iterate over questions first
    for index, row in tqdm(df.head(num_questions_to_process).iterrows(), total=num_questions_to_process, desc="Processing Questions"):
        question_num = row['Question Num']
        prompt = row['Full Prompt 1']
        print(f"\nProcessing Question {question_num}")

        # Iterate over models for the current question
        for model in models:
            try:
                print(f"  Testing with model: {model.name}")
                # Call GetRAC and add the result to the model's self.results
                reasoning, answer_confidence = model.GetRAC(prompt=prompt)

                # Add the results to the model's self.results DataFrame
                new_row = pd.DataFrame([{
                    'Question ID': question_num,
                    'Question': prompt,
                    'Answer': answer_confidence,
                    'Reasoning': reasoning
                }])
                model.results = model.results._append(new_row, ignore_index=True)
                filename = f"{model.name.replace('/', '_').replace('-', '_')}_test_results.csv"
                model.results.to_csv(filename, index=False)
            except Exception as e:
                print(f"  Error testing {model.name} on Question {question_num}: {e}")
                # Optionally add an error entry to the results
                error_row = pd.DataFrame([{
                    'Question ID': question_num,
                    'Question': prompt,
                    'Answer': f"Error: {e}",
                    'Reasoning': f"Error: {e}"
                }])
                model.results = model.results._append(error_row, ignore_index=True)
                filename = f"{model.name.replace('/', '_').replace('-', '_')}_test_results.csv"
                model.results.to_csv(filename, index=False)
    print("\nSequential testing complete.")

    # After processing all questions, save the results for each model
    for model in models:
        filename = f"{model.name.replace('/', '_').replace('-', '_')}_test_results.csv"
        model.results.to_csv(filename, index=False)
        print(f"Results for {model.name} saved to '{filename}'")

### Model Playground:

In [None]:
## Playground for Llama
hf_llama_token = userdata.get('hf_llama_token')
test_name = 'meta-llama/Llama-3.1-8B-Instruct'
test_key = hf_llama_token
test_prompt = "Zdzisław Beksiński was"

test_model = OpenModel(name = test_name, key = test_key)

In [None]:
print(test_model.generate(test_prompt))
test_model.GetTokens(test_prompt)

In [57]:
## Playground for GPT

gpt_4_key = userdata.get('gpt_api_key')
test_name = 'gpt-4'
test_key = gpt_4_key
test_prompt = "Zdzisław Beksiński was"
test_system = "You are a helpful assistant."

my_gpt = GPTmodel(name = test_name, api_key = test_key)
my_gpt.client()
my_gpt.generate(test_prompt, test_system)


"a renowned Polish painter, photographer, and sculptor. He is best known for his large, detailed images of a surreal, post-apocalyptic environment. Beksiński's works are characterized by their haunting, dreamlike nature, often featuring desolate landscapes and grotesque, distorted figures. Despite the grim themes, he insisted his work was not to be interpreted in a literal sense, and that he was more interested in the emotions and reactions they provoked. He was born on February 24, "

In [None]:
## Playground for Claude

claude_key = userdata.get('claude_api_key')
test_name = 'claude-3-haiku-20240307'
test_key = claude_key
test_prompt = "Zdzisław Beksiński was"
test_system = "You are a helpful assistant."

my_claude = AnthropicModel(name = test_name, api_key = test_key)
my_claude.client()
my_claude.generate(test_prompt, test_system)

'Zdzisław Beksiński was a Polish painter, photographer, and sculptor who was known for his distinctive surrealist and dystopian style. Here are some key facts about him:\n\n- Born in 1929 in Sanok, Poland, Beksiński initially studied to be an architect before turning to art.\n\n- His paintings often depicted post-apocalyptic, nightmarish landscapes and figures. His style was highly detailed and technical, with a'

In [None]:
## Playground for Gemini

gemini_api_key = userdata.get('gemini_api_key') # Assuming you stored your key in Userdata
test_name = "gemini-2.0-flash" # Or another Gemini model name like 'gemini-1.5-flash'
test_key = gemini_api_key
test_prompt = "Zdzisław Beksiński was"
test_system = "You are a helpful assistant." # Optional system message

my_gemini = GeminiModel(name = test_name, api_key = test_key)
my_gemini.client()
my_gemini.generate(test_prompt, test_system)


'Zdzisław Beksiński was a Polish painter, photographer, and sculptor specializing in dystopian surrealism. He is known for his distinctive and hauntingly beautiful, yet often disturbing, imagery. His art explores themes of death, decay, anxiety, and the human condition.\n'

### Run Benchmarking:

In [25]:
## Import Dataset
print('-' *42)
file_path = '/content/LSAT_formatted.csv'
print(f'Importing Dataset: {file_path}')
dataset = pd.read_csv(file_path)
dataset.head()

## Edit System Prompts
print('-' *42)
print('Editing System Prompts:')
try:
  sys_prompt1, sys_prompt2 = make_system_prompt(new_dataset)
  print(' New System Prompts:')
  print(f'  {sys_prompt1}')
  print(f'  {sys_prompt2}')
except Exception as e:
  print(f'Error editing prompts:\n{e}')



## Format DF
print('-' *42)
print('Formatting Dataset:')
new_dataset = format_df(dataset)
print(' Successfully Formatted Dataset')
print('New Dataset:')
display(new_dataset.head())




## Initialize Models
print('-' *42)
print('Initializing Models:')
my_closed_models = {
    'GPT': {
        'api_key_name': 'gpt_api_key', # Name of the key to retrieve from userdata
        'models': [
            'gpt-4',
            'gpt-3.5-turbo'
        ]
    },
    'Claude': {
        'api_key_name': 'claude_api_key', # Name of the key to retrieve from userdata
        'models': [
            #'claude-3-sonnet-20240229',
            'claude-3-haiku-20240307'
        ]
    },
    'Gemini': {
        'api_key_name': 'gemini_api_key', # Name of the key to retrieve from userdata
        'models': [
            'gemini-1.5-flash',
            #'gemini-1.5-pro'
        ]
    }
}

closed_models = init_models(my_closed_models)
print(' Successfully Initialied Models')
## Test Models on LSAT
print('-' *42)
print('Testing Models:')
test_models_sequential_by_question(new_dataset, closed_models, debug=True)

------------------------------------------
Importing Dataset: /content/LSAT_formatted.csv
------------------------------------------
Editing System Prompts:
 New System Prompts:
  
    Given the following question, analyze the options, and provide a concise reasoning for your selected answer. Your reasoning should not exceed 100 words. After your explanation, clearly state your answer by choosing one of the options listed (A or B).

    Question: ${Question}
    Options:
    A) ${Option A}
    B) ${Option B}
    
    
    

    Please provide your reasoning first, limited to 100 words, and then conclusively state only your selected answer using the corresponding letter (A or B).
    Reasoning: <Your concise reasoning here. Max 100 words>
    
  
    Based on the reasoning above, Provide the correct answer and the likelihood that each option is correct from 0.0 to 1.0 in a JSON format. The four probabilities should sum to 1.0. For example:

    {
    'Answer': <Your answer choice here, 

Unnamed: 0,Question Num,Full Prompt 1,Full Prompt 2
0,199106_2-G_1_1,"\nGiven the following question, analyze the op...","\nBased on the reasoning above, Provide the co..."
1,199106_2-G_1_2,"\nGiven the following question, analyze the op...","\nBased on the reasoning above, Provide the co..."
2,199106_2-G_1_3,"\nGiven the following question, analyze the op...","\nBased on the reasoning above, Provide the co..."
3,199106_2-G_1_4,"\nGiven the following question, analyze the op...","\nBased on the reasoning above, Provide the co..."
4,199106_2-G_1_5,"\nGiven the following question, analyze the op...","\nBased on the reasoning above, Provide the co..."


------------------------------------------
Initializing Models:
Initializing Closed Models:
GPT:
  API Key Name: gpt_api_key
    gpt-4
    gpt-3.5-turbo
Claude:
  API Key Name: claude_api_key
    claude-3-haiku-20240307
Gemini:
  API Key Name: gemini_api_key
    gemini-1.5-flash
Models Initialized: 4
Model locations:
[<__main__.GPTmodel object at 0x7cd0a1b08e50>, <__main__.GPTmodel object at 0x7cd0a1698050>, <__main__.AnthropicModel object at 0x7cd0a214a390>, <__main__.GeminiModel object at 0x7cd0a1e378d0>]
------------------------------------------
Testing all closed models:
Test prompt: Zdzisław Beksiński was
Test system: You are a helpful assistant.

Testing model: gpt-4
a renowned Polish painter, photographer, and sculptor. He is best known for his large, detailed images of a surreal, post-apocalyptic environment. Beksiński's works are characterized by their haunting, dreamlike nature, often featuring desolate landscapes and grotesque, distorted figures. Despite the often grim subj

Processing Questions:   0%|          | 0/10 [00:00<?, ?it/s]


Processing Question 199106_2-G_1_1
  Testing with model: gpt-4
  Testing with model: gpt-3.5-turbo
  Testing with model: claude-3-haiku-20240307
  Testing with model: gemini-1.5-flash

Processing Question 199106_2-G_1_2
  Testing with model: gpt-4
  Testing with model: gpt-3.5-turbo
  Testing with model: claude-3-haiku-20240307
  Testing with model: gemini-1.5-flash

Processing Question 199106_2-G_1_3
  Testing with model: gpt-4
  Testing with model: gpt-3.5-turbo
  Testing with model: claude-3-haiku-20240307
  Testing with model: gemini-1.5-flash

Processing Question 199106_2-G_1_4
  Testing with model: gpt-4
  Testing with model: gpt-3.5-turbo
  Testing with model: claude-3-haiku-20240307
  Testing with model: gemini-1.5-flash

Processing Question 199106_2-G_1_5
  Testing with model: gpt-4
  Testing with model: gpt-3.5-turbo
  Testing with model: claude-3-haiku-20240307
  Testing with model: gemini-1.5-flash

Processing Question 199106_2-G_1_6
  Testing with model: gpt-4
  Testing w