[FACTS Grounding](https://kaggle.com/facts-leaderboard) is a novel benchmark from Google DeepMind and Google Research designed to evaluate the factual accuracy and grounding of AI models.

**As a primer, please check out this [Examples Section](https://kaggle.com/facts-leaderboard/examples) before running this notebook!**

This notebook will walk you through:

1) **[Generating responses](https://kaggle.com/facts-leaderboard/examples#response_generation)** for examples

2) **[Evaluating responses](https://kaggle.com/facts-leaderboard/examples#response_evaluation)** that were generated

3) **[Ensembling evaluations](https://kaggle.com/facts-leaderboard/examples#ensembling)** to produce a score

# Setup
---

###  Setup Keys

You will need your own set of API keys from [Google AI Studio](https://aistudio.google.com/app/apikey), [OpenAI](https://platform.openai.com/api-keys), and [Anthropic](https://docs.anthropic.com/en/api/getting-started) to run this notebook.

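**Add your keys as Kaggle Secrets with the following names (the cell below reads them from environment variables, loading a local `.env` file if one is present):**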

1) `GOOGLE_AIS_API_KEY`

2) `OPENAI_API_KEY`

3) `ANTHROPIC_API_KEY`


### **WARNING:** This notebook will execute real API calls against these keys. Please check usage limits before running.

In [5]:
import os
from dotenv import load_dotenv
load_dotenv()

openai_api_key = os.getenv("OPENAI_API_KEY")
anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")
google_ais_api_key = os.getenv("GOOGLE_AIS_API_KEY")
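
Because every later cell issues real API calls, it can help to fail fast when a key is missing. The following is a minimal sketch, assuming the three variable names listed above; `REQUIRED_KEYS` and `missing_keys` are hypothetical helpers and not part of the original notebook.

```python
import os

# Names taken from the setup instructions above.
REQUIRED_KEYS = ["GOOGLE_AIS_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY"]

def missing_keys(env=None):
    """Return the required key names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]

# Example with an explicit mapping, so no real key is read or printed.
print(missing_keys({"OPENAI_API_KEY": "placeholder"}))
```

Calling `missing_keys()` with no argument checks the current process environment instead.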

### Setup Clients

In [6]:
from openai import OpenAI
from anthropic import Anthropic
from google import genai

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

google_client = genai.Client(api_key=google_ais_api_key)
openai_client = OpenAI(api_key=openai_api_key)
anthropic_client = Anthropic(api_key=anthropic_api_key)

models = ["gemini-1.5-pro", "gpt-4o", "claude-3-5-sonnet"]
# models.append("Qwen/Qwen2.5-7B-Instruct")
def generate_gemini(prompt, sys):
    response = google_client.models.generate_content(model='gemini-1.5-pro-002', contents=f'{sys} {prompt}')
    return response.text

def generate_gpt(prompt, sys):
    # Only include a system message when one is provided.
    messages = [{"role": "user", "content": prompt}]
    if len(sys) > 0:
        messages.insert(0, {"role": "system", "content": sys})
    completion = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    return completion.choices[0].message.content

def generate_claude(prompt, sys):
    # Only pass `system` when a system prompt is provided.
    kwargs = {}
    if len(sys) > 0:
        kwargs["system"] = [{"type": "text", "text": sys}]
    message = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=8192,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    return message.content[0].text

def generate_huggingface(prompt, sys, model_name="Qwen/Qwen2.5-7B-Instruct"):
    # The model must ship a chat template, because apply_chat_template is used below.
    messages = [
        {"role": "user", "content": prompt}
    ]

    # Prepend the system message if sys is not empty
    if len(sys) > 0:
        messages.insert(0, {"role": "system", "content": sys})

    # Load model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    # Tokenize input
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    # Generate output
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=8192
    )
    # Drop the prompt tokens so that only the newly generated text is decoded
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

def generate(model, prompt, sys):
    if model == "gemini-1.5-pro":
        return generate_gemini(prompt, sys)
    elif model == "gpt-4o":
        return generate_gpt(prompt, sys)
    elif model == "claude-3-5-sonnet":
        return generate_claude(prompt, sys)
    elif model == "Qwen/Qwen2.5-7B-Instruct":
        return generate_huggingface(prompt, sys, model_name=model)
    else:
        raise ValueError(f"Invalid model selected: {model}")

In [7]:
# Test client helper functions

for model in models:
    print(generate(model, "count to 3", ""))

1, 2, 3

One, two, three.
1
2
3


# Response Generation
---

We'll be using a subset of the examples from the [FACTS Grounding 1.0 Public Examples](https://kaggle.com/datasets/deepmind/FACTS-grounding-examples/data) dataset. This dataset contains 860 public examples (out of a total 1,719 examples) for the general public to access.

### Load Examples

In [68]:
import pandas as pd

examples = pd.read_json("data/test.jsonl", lines=True)
examples.head(3)

Unnamed: 0,system_instruction,user_request,context_document,full_prompt
0,[question]\n [user request]\n \n\n ===========...,Is the Government of the United States legally...,Passed following an address by Ukraine’s leade...,[question]\n Is the Government of the United S...
1,You are required to use only information provi...,Explain the responsibilities of each departmen...,"Procedural History In 2007, Mr. Taylor filed a...",Explain the responsibilities of each departmen...
2,Answer the question based solely on the inform...,Researchers at Foch Hospital in France publish...,Objectives: Maternal age has been increasing f...,Answer the question based solely on the inform...


### Generate Responses

In [44]:
from tqdm import tqdm
import pandas as pd

responses = pd.DataFrame()
for ix, row in tqdm(examples.iterrows(), total=len(examples)):
    full_prompt = row['full_prompt']
    for model in models:
        response = generate(model=model, prompt=full_prompt, sys="")
        responses.loc[ix, ["system_instruction", "user_request", "context_document", f'{model}-response']] = [row["system_instruction"], row["user_request"], row["context_document"], response]
responses

  0%|                                             | 1/430 [02:11<15:41:00, 131.61s/it]


KeyboardInterrupt: 

### Filtering Responses

In [45]:
for model in models:
    responses[f'{model}-response-length'] = responses[f'{model}-response'].str.len()
responses["average-response-length"] = responses[[f'{model}-response-length' for model in models]].mean(axis=1)

responses.sort_values("average-response-length", inplace=True)
responses.head(3)

Unnamed: 0,system_instruction,user_request,context_document,Qwen/Qwen2.5-1.5B-Instruct-response,Qwen/Qwen2.5-1.5B-Instruct-response-length,average-response-length
0,[question]\n [user request]\n \n\n ===========...,Is the Government of the United States legally...,Passed following an address by Ukraine’s leade...,"Yes, the Government of the United States is le...",598,598.0
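
The cell above computes and sorts by average response length, but the step that actually keeps a subset is not shown. The following is a minimal sketch, assuming that filtering means keeping the rows with the shortest average responses; the toy frame, the `model-a`/`model-b` column names, and `n_keep` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the responses frame built earlier.
toy = pd.DataFrame({
    "model-a-response": ["short", "a much longer answer", "mid one"],
    "model-b-response": ["tiny", "another long answer!", "medium"],
})
response_cols = ["model-a-response", "model-b-response"]

# Same length bookkeeping as in the cell above.
for col in response_cols:
    toy[f"{col}-length"] = toy[col].str.len()
toy["average-response-length"] = toy[[f"{c}-length" for c in response_cols]].mean(axis=1)

# Keep only the n_keep rows with the shortest average response.
n_keep = 2
shortest = toy.sort_values("average-response-length").head(n_keep)
print(shortest.index.tolist())
```

Sorting into a new frame (rather than with `inplace=True`) leaves the full set of responses available for later inspection.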


In [40]:
import os

responses_dir = os.path.join("data", "responses")
os.makedirs(responses_dir, exist_ok=True)

responses_file = os.path.join(responses_dir, "model1_responses.csv")
responses.to_csv(responses_file, index=True)
print(f"Responses saved to {responses_file}")

Responses saved to data/responses/model1_responses.csv


In [47]:
import pandas as pd

responses = pd.read_csv("data/responses/qwen_responses.csv")
responses.head()

Unnamed: 0.1,Unnamed: 0,system_instruction,user_request,context_document,Qwen/Qwen2.5-7B-Instruct-response,Qwen/Qwen2.5-7B-Instruct-response-length,average-response-length
0,112,Answer in 10 words or less. Keep things simple...,Give me a list of people that died in 1957.,"After the show, Christian Dior began thinking ...","Sorry, no specific deaths in 1957 mentioned.",44,44.0
1,314,Give me your answer as a full sentence. Answer...,"According to this transcript, what was ipad re...",**Tim Cook**\n\nThank you. Suhasini. Good afte...,iPad revenue in the December quarter was $7 bi...,52,52.0
2,292,Use only the provided context for your respons...,How much does a pair of Diamond 210 cost?,Title: Wharfedale Diamonds Shine Even Brighter...,"According to the document, a pair of Diamond 2...",63,63.0
3,270,Your response to the user must only use the in...,Which model showed debt underutilization in th...,"Since Modigliani and Miller (1958), economists...",- The multistage model showed debt underutiliz...,66,66.0
4,346,"Answer in complete sentences, only use the con...","According to the document, how many copies of ...","**What's publicly known about ""Switch 2""**\n\n...","According to the document, Mario Kart 8 Deluxe...",74,74.0


# Response Evaluation
---

### Load Evaluation Prompts

##### Evaluation Prompt Selection

We chose the grounding and quality evaluation prompts by comparing each candidate against an internal set of human-rated responses. The selected prompts are listed below; see the [Technical Report](https://arxiv.org/abs/2501.03200) for more details.

#### Grounding

1) `Gemini-1.5-pro    : json`

2) `GPT-4o            : json`

3) `Claude-3-5-sonnet : implicit_span_level`

#### Quality

1) `Gemini-1.5-pro    : ineligible_responses_filter_no_context`

2) `GPT-4o            : ineligible_responses_filter_no_context`

3) `Claude-3-5-sonnet : ineligible_responses_filter_no_context`
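
The selection above can also be written down as a small lookup table, which keeps the judge-to-prompt pairing in one place. This is only a sketch: the dictionary and function names are hypothetical, and the judge keys reuse the model names from the `models` list defined earlier.

```python
# Grounding prompt chosen for each judge, as listed above.
GROUNDING_PROMPT_BY_JUDGE = {
    "gemini-1.5-pro": "json",
    "gpt-4o": "json",
    "claude-3-5-sonnet": "implicit_span_level",
}

# All three judges share the same quality prompt.
QUALITY_PROMPT_BY_JUDGE = {
    judge: "ineligible_responses_filter_no_context"
    for judge in GROUNDING_PROMPT_BY_JUDGE
}

def prompt_names(judge):
    """Return (grounding_prompt_name, quality_prompt_name) for a judge."""
    return GROUNDING_PROMPT_BY_JUDGE[judge], QUALITY_PROMPT_BY_JUDGE[judge]

print(prompt_names("claude-3-5-sonnet"))
```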

In [9]:
import pandas as pd
from tqdm import tqdm

evaluation_prompts = pd.read_json("data/evaluation_prompts.jsonl", lines=True)
evaluation_prompts

Unnamed: 0,evaluation_method,evaluation_prompt
0,response_level,Your task is to check if the Response is accur...
1,json_alt,You are a helpful and harmless AI assistant. Y...
2,json,You are a helpful and harmless AI assistant. Y...
3,json_with_double_check,Your task is to verify whether a given sentenc...
4,span_level,Your task is to check if a specific Span is ac...
5,implicit_span_level,Your task is to check if the Response is accur...
6,ineligible_responses_filter_with_context,Your mission is to judge the response from an ...
7,ineligible_responses_filter_no_context,Your mission is to judge the response from an ...


In [10]:
json_prompt = evaluation_prompts.loc[evaluation_prompts["evaluation_method"] == 'json', "evaluation_prompt"].values[0]
implicit_span_prompt = evaluation_prompts.loc[evaluation_prompts["evaluation_method"] == 'implicit_span_level', "evaluation_prompt"].values[0]
ineligible_responses_filter_no_context_prompt = evaluation_prompts.loc[evaluation_prompts["evaluation_method"] == 'ineligible_responses_filter_no_context', "evaluation_prompt"].values[0]

### Setup Evaluation Helpers

Here we'll create a helper method for each evaluation type. Note that, for grounding, the Gemini-1.5-pro and GPT-4o judges use the `json` evaluation prompt and that the Claude-3-5-sonnet judge uses the `implicit_span_level` evaluation prompt as described above.
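
Each helper fills `{{...}}` placeholders in a prompt template before calling the judge. Chained `str.replace` calls work, but a later replacement can also rewrite placeholder-like text that an earlier one inserted (for example, a context document that literally contains `{{response}}`). The following is a minimal sketch of a single-pass alternative; `fill_template` and the short template are hypothetical and are not the real evaluation prompts.

```python
import re

def fill_template(template, values):
    """Replace each {{name}} in one pass; unknown names are left untouched."""
    def substitute(match):
        name = match.group(1)
        return values.get(name, match.group(0))
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)

# Hypothetical template using the same placeholder names as the helpers.
template = "Request: {{user_request}}\nContext: {{context_document}}\nResponse: {{response}}"
filled = fill_template(template, {
    "user_request": "what is 2 + 2?",
    "context_document": "the literal text {{response}} appears here",
    "response": "2 + 2 is 4",
})
print(filled)
```

Because inserted text is never re-scanned, the `{{response}}` inside the context document survives unchanged.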

#### Testing Grounding Judges
We'll test grounding with a situation where the response is correct but not grounded. We expect to see a `False` from all judges!

```
user_request:      "what is 2 + 2?"
context_document:  "2 + 2 is 3"
response:          "2 + 2 is 4" (this is correct but not grounded in the context_document)
```

In [11]:
# Grounding json evaluation

import json

def parse_structured_json(ans):
  if '```json' in ans:
      ans = ans.split('```json')[1].split('```')[0]
  ans = ans.strip()
  ans = ans.replace('}\n', '}\n@\n@\n')
  parsed_answers = []
  for line in ans.split('\n@\n@\n'):
    try:
      line = line.replace('\n', ' ')
      line = line.replace("\\'", "'")
      parsed = json.loads(line)
      parsed_answers.append(parsed)
    except json.JSONDecodeError:
      # Skip chunks that are not valid JSON.
      pass
  if len(parsed_answers) > 0:
    bool_ans = all(d['label'] == 'supported' or d['label'] == 'no_rad' for d in parsed_answers)
  else:
    bool_ans = False
  return bool_ans, parsed_answers

def evaluate_grounding_json(user_request, context_document, response, model):
    prompt = json_prompt.replace('{{user_request}}', user_request).replace('{{context_document}}', context_document).replace('{{response}}', response)

    evaluation_text = generate(model=model, prompt=prompt, sys="")
    evaluation, parsed = parse_structured_json(evaluation_text)
    return evaluation

def evaluate_grounding_gemini(user_request, context_document, response):
    return evaluate_grounding_json(user_request, context_document, response, model="gemini-1.5-pro")

def evaluate_grounding_gpt(user_request, context_document, response):
    return evaluate_grounding_json(user_request, context_document, response, model="gpt-4o")


In [12]:
# Test the evaluator
print(evaluate_grounding_gemini(user_request="what is 2 + 2?", context_document="2 + 2 is 3", response="2 + 2 is 4"))
print(evaluate_grounding_gpt(user_request="what is 2 + 2?", context_document="2 + 2 is 3", response="2 + 2 is 4"))

False
False
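
No API call is needed to see how a `json`-style judge output becomes a single boolean. The following is a simplified sketch, assuming the judge emits one JSON object per line with a `label` field; the sample output is invented for illustration and is not real judge output, and `all_sentences_grounded` is a hypothetical stand-in for the parsing helper above.

```python
import json

# Invented judge output: one verdict per sentence of the response.
sample_judge_output = "\n".join([
    '{"sentence": "2 + 2 is 4", "label": "unsupported"}',
    '{"sentence": "Hope that helps!", "label": "no_rad"}',
])

def all_sentences_grounded(judge_output):
    """True only if at least one verdict parses and all are supported/no_rad."""
    labels = []
    for line in judge_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            labels.append(json.loads(line)["label"])
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # skip lines that are not well-formed verdicts
    return bool(labels) and all(label in ("supported", "no_rad") for label in labels)

print(all_sentences_grounded(sample_judge_output))  # False
```

A single sentence with any other label is enough to mark the whole response as not grounded, and output with no parseable verdicts is also treated as not grounded.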


In [13]:
# Grounding implicit_span_level evaluation

def answer_normalization(answer):
  answer = answer.strip().lower()
  if 'inaccurate' in answer or 'false' in answer:
    return False
  elif 'accurate' in answer or 'true' in answer:
    return True
  else:
    return False
    
def evaluate_grounding_implicit_span(user_request, context_document, response, model):
    prompt = implicit_span_prompt.replace('{{user_request}}', user_request).replace('{{context_document}}', context_document).replace('{{response}}', response)
    
    evaluation_text = generate(model=model, prompt=prompt, sys="")
    splits = evaluation_text.split('Final Answer:')
    if (len(splits) <= 1):
        return False
    final_ans = splits[1]
    return answer_normalization(final_ans)

def evaluate_grounding_claude(user_request, context_document, response):
    return evaluate_grounding_implicit_span(user_request, context_document, response, model="claude-3-5-sonnet")

In [14]:
# Test the evaluator
evaluate_grounding_claude(user_request="what is 2 + 2?", context_document="2 + 2 is 3", response="2 + 2 is 4")

False
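
The order of the checks in the normalization step matters, because the string `inaccurate` contains `accurate`. The following is a small self-contained sketch of the same idea; `normalize_verdict` is a hypothetical re-statement of the helper above and the raw judge text is invented.

```python
def normalize_verdict(answer):
    """Map the text after 'Final Answer:' to a boolean verdict."""
    answer = answer.strip().lower()
    # 'inaccurate' contains 'accurate', so the negative check must run first.
    if "inaccurate" in answer or "false" in answer:
        return False
    return "accurate" in answer or "true" in answer

# Invented judge text in the expected 'Final Answer:' format.
raw = "Some reasoning here.\nFinal Answer: Inaccurate"
verdict_text = raw.split("Final Answer:")[1]
print(normalize_verdict(verdict_text))  # False
```

Anything that matches neither pattern falls through to `False`, so an unparseable verdict counts as not grounded.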

#### Testing Quality Judges

We'll test quality with a situation where the response does not answer the posed request. We expect to see a `False` from all judges!

```
user_request:      "what is 2 + 2?"
response_a:        "3 + 3 is 6" (this is the response we evaluate)
response_b:        "2 + 2 is 4" (this is the reference response)
```

We compare our response with a reference because it improves the accuracy of the judges.

In [15]:
# Quality no_context evaluation

QUESTIONS_TO_LABELS = {
    'Instruction Following': ['No Issues', 'Minor Issue(s)', 'Major Issue(s)', 'Invalid'],
}

def parse_json(ans):
    parsed = {}
    if '```json' in ans:
        ans = ans.split('```json')[1]
        ans = ans.split('```')[0]
    ans = ans.replace('\n', ' ')
    try:
        parsed = json.loads(ans)
    except json.JSONDecodeError:
        # Unparseable judge output falls through to 'Invalid' below.
        pass
    if not isinstance(parsed, dict):
        parsed = {}
    # A missing or unrecognized label is treated as 'Invalid'.
    if parsed.get('Instruction Following') not in QUESTIONS_TO_LABELS['Instruction Following']:
        parsed['Instruction Following'] = 'Invalid'
    return parsed

# Evaluates response_a for quality using response_b as a reference.
def evaluate_quality_no_context(user_request, response_a, response_b, model):
    prompt = ineligible_responses_filter_no_context_prompt.replace('{{user_request}}', user_request).replace('{{response_a}}', response_a).replace('{{response_b}}', response_b)
    
    evaluation_text = generate(prompt=prompt, model=model, sys="")
    parsed = parse_json(evaluation_text)

    return "Major Issue(s)" not in parsed['Instruction Following']

# Use the response from the judge model itself as the reference response (response_b).
def evaluate_quality_no_context_gemini(user_request, response, references):
    return evaluate_quality_no_context(user_request, response, references["gemini-1.5-pro-response"], model="gemini-1.5-pro")

def evaluate_quality_no_context_gpt(user_request, response, references):
    return evaluate_quality_no_context(user_request, response, references["gpt-4o-response"], model="gpt-4o")

def evaluate_quality_no_context_claude(user_request, response, references):
    return evaluate_quality_no_context(user_request, response, references["claude-3-5-sonnet-response"], model="claude-3-5-sonnet")

In [16]:
# Example references dataframe. In the actual evaluation, the responses generated earlier serve as the references.
references = pd.DataFrame({
    "gemini-1.5-pro-response": ["2 + 2 is 4"],
    "gpt-4o-response": ["2 + 2 is 4"],
    "claude-3-5-sonnet-response": ["2 + 2 is 4"],
})

print(evaluate_quality_no_context_gemini("What is 2 + 2?", response="3 + 3 is 6", references=references.loc[0]))
print(evaluate_quality_no_context_gpt("What is 2 + 2?", response="3 + 3 is 6", references=references.loc[0]))
print(evaluate_quality_no_context_claude("What is 2 + 2?", response="3 + 3 is 6", references=references.loc[0]))

True
True
False
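
The quality verdict reduces to whether the parsed label contains `Major Issue(s)`. The sketch below restates that rule on its own, using the four labels from `QUESTIONS_TO_LABELS`; `passes_quality` is a hypothetical name. Note that under this rule a label of `Invalid` (for example, from unparseable judge output) also counts as passing, which may or may not be the intended behavior.

```python
LABELS = ["No Issues", "Minor Issue(s)", "Major Issue(s)", "Invalid"]

def passes_quality(label):
    """A response passes unless the judge reports a major issue."""
    return "Major Issue(s)" not in label

for label in LABELS:
    print(f"{label!r}: {passes_quality(label)}")
```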


### Evaluate Responses (~15 minutes)

With our evaluation helper methods, we can now evaluate the responses we generated before.

**Please re-run the below cell if it fails due to transient API issues.**
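
Transient failures can also be absorbed inside the loop rather than by re-running the cell. The following is a minimal sketch of a retry wrapper with exponential backoff; `call_with_retries`, its parameters, and the `flaky` stand-in are hypothetical and not part of the original notebook.

```python
import time

def call_with_retries(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds, doubling the delay after each failure."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as err:  # KeyboardInterrupt is deliberately not caught
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            print(f"attempt {attempt + 1} failed ({err}); retrying in {delay}s")
            sleep(delay)

# Stand-in for a flaky API call: fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

# The no-op sleep keeps the demo instant; omit it to wait between attempts.
result = call_with_retries(flaky, sleep=lambda seconds: None)
print(result)
```

Catching `Exception` rather than using a bare `except` means that interrupting the kernel still stops the loop, and re-raising after the last attempt makes a persistent failure visible instead of silently skipping the row.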

In [20]:
responses = pd.read_csv("data/responses/merged_responses.csv", index_col=False)
responses.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,system_instruction,user_request,context_document,gemini-1.5-pro-response,gpt-4o-response,claude-3-5-sonnet-response,gemini-1.5-pro-response-length,gpt-4o-response-length,claude-3-5-sonnet-response-length,average-response-length
0,0,112,Answer in 10 words or less. Keep things simple...,Give me a list of people that died in 1957.,"After the show, Christian Dior began thinking ...",Christian Dior.\n,"Christian Dior, Humphrey Bogart, and Joseph Mc...","Based on the context, Christian Dior died in 1...",16,66,50,44.0
1,1,292,Use only the provided context for your respons...,How much does a pair of Diamond 210 cost?,Title: Wharfedale Diamonds Shine Even Brighter...,A pair of Wharfedale Diamond 210 speakers cost...,A pair of Diamond 210 speakers costs £149.95.,"According to the document, the Diamond 210 cos...",57,45,204,102.0
2,2,178,This task requires you to answer questions bas...,What legal basis is being used to analyze the ...,"On January 18, 2022, Microsoft Corp. announced...","Section 7 of the Clayton Act, enforced via Har...",The merger analysis is based on the Hart-Scott...,The merger is being analyzed under Section 7 o...,74,113,122,103.0
3,3,346,"Answer in complete sentences, only use the con...","According to the document, how many copies of ...","**What's publicly known about ""Switch 2""**\n\n...",55 million copies of Mario Kart 8 Deluxe have ...,"According to the document, Mario Kart 8 Deluxe...","According to the document, Mario Kart 8 Deluxe...",57,74,180,103.666667
4,4,159,Simplify the language used so it's easier to u...,What is the short title of the act?,"1ST SESSION, 43RD LEGISLATURE, ONTARIO\r\n2 CH...",The short title of the act is Chad’s Law (Enfo...,"The short title of the act is ""Chad’s Law (Enf...","According to the document, the short title of ...",74,75,164,104.333333


In [None]:
grounding_evaluation_methods = [
    evaluate_grounding_gemini,
    evaluate_grounding_gpt,
    evaluate_grounding_claude,
]

quality_evaluation_methods = [
    evaluate_quality_no_context_gemini,
    evaluate_quality_no_context_gpt,
    evaluate_quality_no_context_claude,
]

from tqdm import tqdm
import time

MAX_RETRIES = 5

def nameof(f):
    return f.__name__.replace("_", "-")

for ix, row in tqdm(responses.iterrows(), total=len(responses)):    
    user_request = row["user_request"]
    context_document = row["context_document"]
    
    for evaluated_model in models:
        response_column_key = f'{evaluated_model}-response'
        response = row[response_column_key]


        print(f'Evaluating response at ix: {ix} for model {evaluated_model} with:')
        for eval_method in grounding_evaluation_methods:
            print(f'    {eval_method.__name__}')
            key = f'{response_column_key}-{nameof(eval_method)}'
            
            # skip rows if we've evaluated already
            if ix in responses.index and key in responses.columns and pd.notna(responses.loc[ix, key]):
                continue

            retries = 0
            while retries < MAX_RETRIES:
                try:
                    evaluation = eval_method(user_request, context_document, response)
                    responses.loc[ix, [key]] = [evaluation]
                    break
                except Exception:
                    print(f"API Error: grounding retrying {retries}")
                    time.sleep(5)
                    retries += 1
                

        for eval_method in quality_evaluation_methods:
            print(f'    {eval_method.__name__}')
            key = f'{response_column_key}-{nameof(eval_method)}'

            # skip rows if we've evaluated already
            if ix in responses.index and key in responses.columns and pd.notna(responses.loc[ix, key]):
                continue

            # Add retry logic
            retries = 0
            while retries < MAX_RETRIES:
                try:
                    evaluation = eval_method(user_request, response, row)
                    responses.loc[ix,[key]] = [evaluation]
                    break
                except Exception:
                    print(f"API Error: quality retrying {retries}")
                    time.sleep(5)
                    retries += 1

  0%|                                     | 0/430 [00:00<?, ?it/s]

Evaluating response at ix: 0 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 0 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 0 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 1 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt


  0%|▏                          | 2/430 [02:04<7:25:26, 62.44s/it]

Evaluating response at ix: 2 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 2 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 2 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  1%|▏                          | 3/430 [03:52<9:37:26, 81.14s/it]

Evaluating response at ix: 3 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 3 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 3 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  1%|▏                         | 4/430 [05:40<10:48:54, 91.40s/it]

Evaluating response at ix: 4 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 4 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 4 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  1%|▎                         | 5/430 [07:23<11:15:30, 95.37s/it]

Evaluating response at ix: 5 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 5 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 5 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  1%|▎                         | 6/430 [09:06<11:31:45, 97.89s/it]

Evaluating response at ix: 6 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 6 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 6 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  2%|▍                        | 7/430 [11:05<12:18:31, 104.76s/it]

Evaluating response at ix: 7 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 7 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 7 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  2%|▍                        | 8/430 [12:54<12:25:10, 105.95s/it]

Evaluating response at ix: 8 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 8 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 8 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  2%|▌                        | 9/430 [14:55<12:57:28, 110.80s/it]

Evaluating response at ix: 9 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 9 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 9 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  2%|▌                       | 10/430 [17:01<13:28:11, 115.46s/it]

Evaluating response at ix: 10 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 10 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 10 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  3%|▌                       | 11/430 [19:19<14:13:34, 122.23s/it]

Evaluating response at ix: 11 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 11 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 11 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  3%|▋                       | 12/430 [21:23<14:15:46, 122.84s/it]

Evaluating response at ix: 12 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 12 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 12 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
API Error: quality retrying 0
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  3%|▋                       | 13/430 [24:07<15:38:28, 135.03s/it]

Evaluating response at ix: 13 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 13 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 13 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
API Error: quality retrying 0
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  3%|▊                       | 14/430 [26:45<16:26:09, 142.23s/it]

Evaluating response at ix: 14 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 14 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 14 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  3%|▊                       | 15/430 [29:08<16:24:47, 142.38s/it]

Evaluating response at ix: 15 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 15 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 15 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  4%|▉                       | 16/430 [31:48<16:58:09, 147.56s/it]

Evaluating response at ix: 16 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 16 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 16 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  4%|▉                       | 17/430 [34:32<17:29:50, 152.52s/it]

Evaluating response at ix: 17 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 17 for model gpt-4o with:
    evaluate_grounding_gemini
API Error: grounding retrying 0
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 17 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  4%|█                       | 18/430 [39:16<21:58:07, 191.96s/it]

Evaluating response at ix: 18 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 18 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
API Error: quality retrying 0
API Error: quality retrying 1
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 18 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
API Error: quality retrying 0
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  4%|█                       | 19/430 [45:13<27:35:03, 241.61s/it]

Evaluating response at ix: 19 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
API Error: quality retrying 0
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 19 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 19 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  5%|█                       | 20/430 [48:09<25:17:14, 222.03s/it]

Evaluating response at ix: 20 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 20 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
API Error: quality retrying 0
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 20 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
API Error: grounding retrying 0
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  5%|█▏                      | 21/430 [51:01<23:30:10, 206.87s/it]

Evaluating response at ix: 21 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
API Error: quality retrying 0
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 21 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 21 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  5%|█▏                      | 22/430 [54:12<22:55:15, 202.24s/it]

Evaluating response at ix: 22 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 22 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 22 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  5%|█▎                      | 23/430 [56:48<21:17:08, 188.28s/it]

Evaluating response at ix: 23 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 23 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
API Error: quality retrying 0
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 23 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  6%|█▎                      | 24/430 [59:53<21:07:36, 187.33s/it]

Evaluating response at ix: 24 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
API Error: quality retrying 0
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 24 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 24 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  6%|█▎                    | 25/430 [1:02:55<20:52:59, 185.63s/it]

Evaluating response at ix: 25 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 25 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 25 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  6%|█▎                    | 26/430 [1:05:25<19:38:10, 174.98s/it]

Evaluating response at ix: 26 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 26 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
API Error: quality retrying 0
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 26 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  6%|█▍                    | 27/430 [1:07:45<18:25:43, 164.62s/it]

Evaluating response at ix: 27 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 27 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 27 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  7%|█▍                    | 28/430 [1:10:09<17:41:11, 158.39s/it]

Evaluating response at ix: 28 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 28 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 28 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  7%|█▍                    | 29/430 [1:12:02<16:06:14, 144.58s/it]

Evaluating response at ix: 29 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 29 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 29 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  7%|█▌                    | 30/430 [1:14:20<15:51:36, 142.74s/it]

Evaluating response at ix: 30 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 30 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 30 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  7%|█▌                    | 31/430 [1:16:40<15:43:19, 141.85s/it]

Evaluating response at ix: 31 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 31 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 31 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  7%|█▋                    | 32/430 [1:18:46<15:08:50, 137.01s/it]

Evaluating response at ix: 32 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 32 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 32 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  8%|█▋                    | 33/430 [1:20:52<14:45:00, 133.75s/it]

Evaluating response at ix: 33 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 33 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 33 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  8%|█▋                    | 34/430 [1:23:25<15:20:38, 139.49s/it]

Evaluating response at ix: 34 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 34 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 34 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  8%|█▊                    | 35/430 [1:25:37<15:03:34, 137.25s/it]

Evaluating response at ix: 35 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 35 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 35 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  8%|█▊                    | 36/430 [1:28:03<15:18:40, 139.90s/it]

Evaluating response at ix: 36 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
API Error: grounding retrying 0
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
API Error: quality retrying 0
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 36 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 36 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  9%|█▉                    | 37/430 [1:31:03<16:36:06, 152.08s/it]

Evaluating response at ix: 37 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 37 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 37 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  9%|█▉                    | 38/430 [1:33:13<15:50:25, 145.47s/it]

Evaluating response at ix: 38 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
API Error: quality retrying 0
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 38 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 38 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  9%|█▉                    | 39/430 [1:35:40<15:51:13, 145.97s/it]

Evaluating response at ix: 39 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 39 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 39 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


  9%|██                    | 40/430 [1:37:58<15:32:33, 143.47s/it]

Evaluating response at ix: 40 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 40 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
API Error: quality retrying 0
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 40 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 10%|██                    | 41/430 [1:40:27<15:41:40, 145.25s/it]

Evaluating response at ix: 41 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 41 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 41 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 10%|██▏                   | 42/430 [1:43:03<15:59:44, 148.41s/it]

Evaluating response at ix: 42 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 42 for model gpt-4o with:
    evaluate_grounding_gemini
API Error: grounding retrying 0
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
API Error: quality retrying 0
API Error: quality retrying 1
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 42 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 10%|██▏                   | 43/430 [1:46:00<16:52:13, 156.93s/it]

Evaluating response at ix: 43 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 43 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 43 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 10%|██▎                   | 44/430 [1:48:59<17:31:57, 163.52s/it]

Evaluating response at ix: 44 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 44 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 44 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 10%|██▎                   | 45/430 [1:51:22<16:49:38, 157.35s/it]

Evaluating response at ix: 45 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
API Error: quality retrying 0
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 45 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 45 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 11%|██▎                   | 46/430 [1:54:24<17:35:38, 164.94s/it]

Evaluating response at ix: 46 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 46 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 46 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
API Error: grounding retrying 0
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 11%|██▍                   | 47/430 [1:57:40<18:31:19, 174.10s/it]

Evaluating response at ix: 47 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 47 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 47 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
API Error: quality retrying 0
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 11%|██▍                   | 48/430 [2:00:50<18:58:06, 178.76s/it]

Evaluating response at ix: 48 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 48 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 48 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 11%|██▌                   | 49/430 [2:03:43<18:44:48, 177.13s/it]

Evaluating response at ix: 49 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 49 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
API Error: quality retrying 0
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 49 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
API Error: quality retrying 0
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 12%|██▌                   | 50/430 [2:06:42<18:45:05, 177.65s/it]

Evaluating response at ix: 50 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 50 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 50 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 12%|██▌                   | 51/430 [2:09:10<17:47:20, 168.97s/it]

Evaluating response at ix: 51 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 51 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 51 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 12%|██▋                   | 52/430 [2:11:51<17:29:15, 166.55s/it]

Evaluating response at ix: 52 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 52 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 52 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 12%|██▋                   | 53/430 [2:14:24<17:00:46, 162.46s/it]

Evaluating response at ix: 53 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 53 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 53 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 13%|██▊                   | 54/430 [2:16:58<16:42:04, 159.91s/it]

Evaluating response at ix: 54 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 54 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 54 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 13%|██▊                   | 55/430 [2:19:24<16:12:06, 155.54s/it]

Evaluating response at ix: 55 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 55 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 55 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 13%|██▊                   | 56/430 [2:21:38<15:29:13, 149.07s/it]

Evaluating response at ix: 56 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 56 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 56 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 13%|██▉                   | 57/430 [2:24:11<15:34:53, 150.39s/it]

Evaluating response at ix: 57 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 57 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 57 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 13%|██▉                   | 58/430 [2:25:58<14:12:16, 137.46s/it]

Evaluating response at ix: 58 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 58 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 58 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 14%|███                   | 59/430 [2:28:20<14:16:55, 138.59s/it]

Evaluating response at ix: 59 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 59 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 59 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 14%|███                   | 60/430 [2:30:49<14:35:30, 141.97s/it]

Evaluating response at ix: 60 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 60 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 60 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 14%|███                   | 61/430 [2:32:50<13:52:58, 135.44s/it]

Evaluating response at ix: 61 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 61 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 61 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 14%|███▏                  | 62/430 [2:35:28<14:33:46, 142.46s/it]

Evaluating response at ix: 62 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 62 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 62 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 15%|███▏                  | 63/430 [2:37:51<14:31:43, 142.52s/it]

Evaluating response at ix: 63 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 63 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 63 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 15%|███▎                  | 64/430 [2:39:50<13:46:08, 135.43s/it]

Evaluating response at ix: 64 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 64 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 64 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 15%|███▎                  | 65/430 [2:42:49<15:03:53, 148.58s/it]

Evaluating response at ix: 65 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 65 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 65 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 15%|███▍                  | 66/430 [2:44:54<14:17:48, 141.40s/it]

Evaluating response at ix: 66 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 66 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 66 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 16%|███▍                  | 67/430 [2:47:25<14:33:45, 144.42s/it]

Evaluating response at ix: 67 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 67 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 67 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 16%|███▍                  | 68/430 [2:49:53<14:36:18, 145.25s/it]

Evaluating response at ix: 68 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 68 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 68 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 16%|███▌                  | 69/430 [2:52:22<14:41:11, 146.46s/it]

Evaluating response at ix: 69 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 69 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 69 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 16%|███▌                  | 70/430 [2:55:01<15:01:07, 150.19s/it]

Evaluating response at ix: 70 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 70 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 70 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 17%|███▋                  | 71/430 [2:58:21<16:27:38, 165.07s/it]

Evaluating response at ix: 71 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 71 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 71 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 17%|███▋                  | 72/430 [3:00:34<15:27:51, 155.51s/it]

Evaluating response at ix: 72 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 72 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 72 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 17%|███▋                  | 73/430 [3:03:35<16:10:54, 163.18s/it]

Evaluating response at ix: 73 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 73 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 73 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 17%|███▊                  | 74/430 [3:05:56<15:28:19, 156.46s/it]

Evaluating response at ix: 74 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 74 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 74 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 17%|███▊                  | 75/430 [3:07:55<14:20:03, 145.36s/it]

Evaluating response at ix: 75 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 75 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 75 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 18%|███▉                  | 76/430 [3:10:09<13:57:49, 142.00s/it]

Evaluating response at ix: 76 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 76 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 76 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 18%|███▉                  | 77/430 [3:13:22<15:25:05, 157.24s/it]

Evaluating response at ix: 77 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 77 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 77 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 18%|███▉                  | 78/430 [3:15:48<15:02:56, 153.91s/it]

Evaluating response at ix: 78 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 78 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 78 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 18%|████                  | 79/430 [3:18:21<14:58:31, 153.59s/it]

Evaluating response at ix: 79 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 79 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 79 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 19%|████                  | 80/430 [3:20:45<14:38:49, 150.66s/it]

Evaluating response at ix: 80 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 80 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 80 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 19%|████▏                 | 81/430 [3:23:52<15:39:18, 161.48s/it]

Evaluating response at ix: 81 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 81 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 81 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 19%|████▏                 | 82/430 [3:26:18<15:10:53, 157.05s/it]

Evaluating response at ix: 82 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 82 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 82 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 19%|████▏                 | 83/430 [3:28:54<15:05:15, 156.53s/it]

Evaluating response at ix: 83 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 83 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 83 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 20%|████▎                 | 84/430 [3:31:35<15:11:55, 158.14s/it]

Evaluating response at ix: 84 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 84 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 84 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 20%|████▎                 | 85/430 [3:34:07<14:57:49, 156.14s/it]

Evaluating response at ix: 85 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 85 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 85 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 20%|████▍                 | 86/430 [3:37:04<15:30:37, 162.32s/it]

Evaluating response at ix: 86 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 86 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 86 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 20%|████▍                 | 87/430 [3:39:41<15:19:07, 160.78s/it]

Evaluating response at ix: 87 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 87 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 87 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 20%|████▌                 | 88/430 [3:42:05<14:48:02, 155.80s/it]

Evaluating response at ix: 88 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 88 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 88 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt


In [None]:
# Copy the selected columns so later assignments don't write into a view of `responses`
evaluations = responses[[col for col in responses.columns if "evaluate" in col]].copy()
evaluations = evaluations.reindex(sorted(evaluations.columns), axis=1)
evaluations

# Evaluation Ensembling
---

### Ensemble Quality Evaluations

We'll invalidate a response only if it fails the bar for all three quality judges. In other words, a response passes the quality check as long as at least one judge accepts it.

In [None]:
for ix, row in evaluations.iterrows():
    for model in models:
        # Count how many quality judges accepted this model's response
        passes_qc_count = 0
        for evaluation_method in quality_evaluation_methods:
            key = f'{model}-response-{nameof(evaluation_method)}'
            passes_qc_count += 1 if row[key] else 0

        # The response passes QC if at least one judge accepted it
        passes_qc = passes_qc_count > 0

        evaluations.loc[ix, [f'{model}-response-passes-qc']] = [passes_qc]

relevant_columns = [col for col in evaluations.columns if ("evaluate-grounding" in col or "passes-qc" in col)]

grounding_evaluations = evaluations[relevant_columns]
grounding_evaluations = grounding_evaluations.reindex(sorted(grounding_evaluations.columns), axis=1)
grounding_evaluations
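
The row-by-row loop above can also be written as a single vectorized step. The following is a minimal sketch on a toy frame, not the notebook's own code: the model name `model-a` and the exact column names are hypothetical stand-ins that only mimic the `{model}-response-{method}` pattern.

```python
import pandas as pd

# Toy quality verdicts for one hypothetical model, "model-a". Each column
# stands in for one quality judge's boolean verdict per response.
toy = pd.DataFrame({
    "model-a-response-evaluate_quality_no_context_gemini": [True, False, False],
    "model-a-response-evaluate_quality_no_context_gpt": [True, True, False],
    "model-a-response-evaluate_quality_no_context_claude": [False, False, False],
})

quality_cols = [c for c in toy.columns if "evaluate_quality" in c]

# A response passes QC if at least one judge accepts it, i.e. it is
# invalidated only when all three judges reject it.
toy["model-a-response-passes-qc"] = toy[quality_cols].any(axis=1)
print(toy["model-a-response-passes-qc"].tolist())
```

Only the third toy response is invalidated, because it is the only one that every judge rejected.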

### Gather Grounding Evaluations

Then, for each of the three grounding judges, we'll count the responses that were judged grounded and that also passed the quality check, and divide by the number of examples.

In [None]:
scores = pd.DataFrame(columns=['model', 'evaluation_method', 'score'])
# Normalize by the total number of evaluated examples, so that responses
# failing the quality check count against the score.
n = len(grounding_evaluations)
for model in models:
    for evaluation_method in grounding_evaluation_methods:
        grounding_key = f'{model}-response-{nameof(evaluation_method)}'
        quality_key = f'{model}-response-passes-qc'

        # Keep only the responses that passed the ensembled quality check
        passed_qc_responses = grounding_evaluations[grounding_evaluations[quality_key] == True]
        n_success = passed_qc_responses[grounding_key].astype(int).sum()

        scores.loc[len(scores), ['model', 'evaluation_method', 'score']] = [model, nameof(evaluation_method), n_success / n]
scores
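
To make the arithmetic concrete, here is a small worked example for a single model and a single grounding judge. It is a sketch under one assumption: the denominator is the total number of responses, so a response that fails the quality check lowers the score. The column names `grounded` and `passes_qc` are hypothetical and chosen only for readability.

```python
import pandas as pd

# Toy verdicts for four responses from one model.
toy = pd.DataFrame({
    "grounded": [True, True, False, True],   # one grounding judge's verdicts
    "passes_qc": [True, False, True, True],  # ensembled quality verdict
})

# A response counts as a success only if it is grounded AND passed QC.
n_total = len(toy)
n_success = int((toy["grounded"] & toy["passes_qc"]).sum())

# Assumption: normalize by all responses, not only those that passed QC.
score = n_success / n_total
print(score)
```

Here responses 1 and 4 are successes, so the judge-level score is 2 / 4 = 0.5. The second response is grounded but still counts as a failure because it did not pass the quality check.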

### Ensemble Results

Finally, we can average across the three LLM judges to arrive at final scores.

In [None]:
scores.groupby("model")[["score"]].mean()
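
Putting the three steps together, the final score for a model $m$ can be summarized as follows. The notation is introduced here for illustration only: $J$ is the set of grounding judges, $n$ is the number of examples (the `n` used in the code above), $q_{m,i}$ is 1 if response $i$ from model $m$ passed the quality check and 0 otherwise, and $g_{j,m,i}$ is 1 if judge $j$ considered that response grounded and 0 otherwise.

$$
\text{score}(m) = \frac{1}{|J|} \sum_{j \in J} \frac{1}{n} \sum_{i=1}^{n} q_{m,i} \, g_{j,m,i}
$$

The inner sum is the per-judge score computed in the previous cell, and the outer average is the `groupby("model")` mean.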

# Closing Remarks
---

Thank you for working through this entire notebook! Note that we've only gone through a tiny subset of the examples here. For the real set of scores, please check out the [official leaderboard](https://kaggle.com/facts-leaderboard).

Questions, comments, or issues? Please feel free to share your thoughts with us in the [discussion forum](https://kaggle.com/facts-leaderboard/discussion?sort=hotness)!