[FACTS Grounding](https://kaggle.com/facts-leaderboard) is a novel benchmark from Google DeepMind and Google Research designed to evaluate the factual accuracy and grounding of AI models.

**As a primer, please check out this [Examples Section](https://kaggle.com/facts-leaderboard/examples) before running this notebook!**

This notebook will walk you through:

1) **[Generating responses](https://kaggle.com/facts-leaderboard/examples#response_generation)** for examples

2) **[Evaluating responses](https://kaggle.com/facts-leaderboard/examples#response_evaluation)** that were generated

3) **[Ensembling evaluations](https://kaggle.com/facts-leaderboard/examples#ensembling)** to produce a score

# Setup
---

### Setup Keys

You will need your own set of API keys from [Google AI Studio](https://aistudio.google.com/app/apikey), [OpenAI](https://platform.openai.com/api-keys), and [Anthropic](https://docs.anthropic.com/en/api/getting-started) to run this notebook.

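**Add your keys as Kaggle Secrets with the following names.** The cell below reads them with `os.getenv`, so when running outside Kaggle, export them as environment variables under the same names: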

1) `GOOGLE_AIS_API_KEY`

2) `OPENAI_API_KEY`

3) `ANTHROPIC_API_KEY`


### **WARNING:** This notebook will execute real API calls against these keys. Please check usage limits before running.

In [17]:
import os

openai_api_key = os.getenv("OPENAI_API_KEY")
anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")
google_ais_api_key = os.getenv("GOOGLE_AIS_API_KEY")
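If you move between a local machine and Kaggle, a small lookup helper can cover both cases. The sketch below is only an illustration: `load_api_key` and `missing_keys` are hypothetical names that are not part of this notebook, and it assumes the `kaggle_secrets` module is importable only inside a Kaggle session.

```python
import os

REQUIRED_KEYS = ["GOOGLE_AIS_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY"]

def load_api_key(name):
    """Return the key from the environment, else from Kaggle Secrets, else None."""
    value = os.getenv(name)
    if value:
        return value
    try:
        # Assumption: this import only succeeds inside a Kaggle notebook session.
        from kaggle_secrets import UserSecretsClient
    except ImportError:
        return None
    try:
        return UserSecretsClient().get_secret(name)
    except Exception:
        # The secret is not attached to this notebook (or the lookup failed).
        return None

def missing_keys(names=REQUIRED_KEYS):
    """List the key names that could not be found anywhere."""
    return [name for name in names if not load_api_key(name)]
```

Calling `missing_keys()` before creating the clients makes it obvious which key still needs to be configured, instead of failing later on the first API call.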

### Setup Clients

In [41]:
from openai import OpenAI
from anthropic import Anthropic
from google import genai

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

google_client = genai.Client(api_key=google_ais_api_key)
openai_client = OpenAI(api_key=openai_api_key)
anthropic_client = Anthropic(api_key=anthropic_api_key)

# Models to generate responses for. Every name must be handled by generate() below.
# models = ["gemini-1.5-pro", "gpt-4o", "claude-3-5-sonnet"]
# models = ["Qwen/Qwen2.5-7B-Instruct"]
models = ["Qwen/Qwen2.5-1.5B-Instruct"]

def generate_gemini(prompt, sys):
    response = google_client.models.generate_content(model='gemini-1.5-pro-002', contents=f'{sys} {prompt}')
    return response.text

def generate_gpt(prompt, sys):
    if len(sys) > 0:
        completion = openai_client.chat.completions.create(
          model="gpt-4o",
          messages=[{"role": "system", "content": sys}, {"role": "user","content": prompt}]
        )
        return completion.choices[0].message.content
    else:
        completion = openai_client.chat.completions.create(
          model="gpt-4o",
          messages=[{"role": "user","content": prompt}]
        )
        return completion.choices[0].message.content     

def generate_claude(prompt, sys):
    if len(sys) > 0:
        message = anthropic_client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=8192,
            system=[{"type": "text", "text": sys}],
            messages=[{"role": "user", "content": prompt}],
        )
        return message.content[0].text
    else:
        message = anthropic_client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=8192,
            messages=[{"role": "user", "content": prompt}],
        )
        return message.content[0].text

_hf_cache = {}

def load_huggingface(model_name):
    # Load each model and tokenizer only once; reloading them on every call is very slow
    if model_name not in _hf_cache:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        model.to(device)
        _hf_cache[model_name] = (tokenizer, model, device)
    return _hf_cache[model_name]

def generate_huggingface(prompt, sys, model_name="Qwen/Qwen2.5-1.5B-Instruct"):
    # The model must ship a chat template, so the default is an instruct model
    messages = [{"role": "user", "content": prompt}]

    # Prepend the system message if sys is not empty
    if len(sys) > 0:
        messages.insert(0, {"role": "system", "content": sys})

    tokenizer, model, device = load_huggingface(model_name)

    # Render the chat template as text, then tokenize it
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # Generate output
    generated_ids = model.generate(**model_inputs, max_new_tokens=8192)

    # Drop the prompt tokens so that only the newly generated tokens are decoded
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

def generate(model, prompt, sys):
    if model == "gemini-1.5-pro":
        return generate_gemini(prompt, sys)
    elif model == "gpt-4o":
        return generate_gpt(prompt, sys)
    elif model == "claude-3-5-sonnet":
        return generate_claude(prompt, sys)
    elif model.startswith("Qwen/"):
        # Any Qwen checkpoint on the Hugging Face Hub, e.g. the 1.5B or 7B Instruct models listed above
        return generate_huggingface(prompt, sys, model_name=model)
    else:
        raise ValueError(f"Invalid model selected: {model}")

In [42]:
# Test client helper functions

for model in models:
    print(generate(model, "count to 3", ""))

Sure! Here's the count to 3:

1
2
3


# Response Generation
---

We'll be using a subset of the examples from the [FACTS Grounding 1.0 Public Examples](https://kaggle.com/datasets/deepmind/FACTS-grounding-examples/data) dataset. This dataset contains the 860 examples (out of 1,719 in total) that have been released publicly.

### Load Examples

In [43]:
import pandas as pd

examples = pd.read_json("data/test.jsonl", lines=True)
examples.head(3)

Unnamed: 0,system_instruction,user_request,context_document,full_prompt
0,[question]\n [user request]\n \n\n ===========...,Is the Government of the United States legally...,Passed following an address by Ukraine’s leade...,[question]\n Is the Government of the United S...
1,You are required to use only information provi...,Explain the responsibilities of each departmen...,"Procedural History In 2007, Mr. Taylor filed a...",Explain the responsibilities of each departmen...
2,Answer the question based solely on the inform...,Researchers at Foch Hospital in France publish...,Objectives: Maternal age has been increasing f...,Answer the question based solely on the inform...


### Generate Responses

In [44]:
from tqdm import tqdm
import pandas as pd

responses = pd.DataFrame()
for ix, row in tqdm(examples.iterrows(), total=len(examples)):
    full_prompt = row['full_prompt']
    for model in models:
        response = generate(model=model, prompt=full_prompt, sys="")
        responses.loc[ix, ["system_instruction", "user_request", "context_document", f'{model}-response']] = [row["system_instruction"], row["user_request"], row["context_document"], response]
responses

  0%|                                             | 1/430 [02:11<15:41:00, 131.61s/it]


KeyboardInterrupt: 
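Because a full generation run can take hours and may be interrupted, as happened above, it helps to write each response to disk as soon as it is produced and to skip finished examples on the next run. The following is a minimal sketch of that idea and not part of the original notebook: `generate_with_checkpoint`, the JSONL record layout, and the `generate_fn` callback are all hypothetical.

```python
import json
import os

def generate_with_checkpoint(prompts, generate_fn, checkpoint_path):
    """Generate one response per prompt, appending each result to a JSONL file.

    Prompts whose position already appears in the checkpoint file are skipped,
    so an interrupted run can simply be started again.
    """
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    done[record["ix"]] = record["response"]

    with open(checkpoint_path, "a", encoding="utf-8") as f:
        for ix, prompt in enumerate(prompts):
            if ix in done:
                continue  # already generated in an earlier run
            response = generate_fn(prompt)
            f.write(json.dumps({"ix": ix, "response": response}) + "\n")
            f.flush()  # make sure the record survives an interrupt
            done[ix] = response

    return [done[ix] for ix in range(len(prompts))]
```

In this notebook, `generate_fn` could be something like `lambda p: generate(model, p, "")`, with one checkpoint file per model.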

### Filtering Responses

In [45]:
for model in models:
    responses[f'{model}-response-length'] = responses[f'{model}-response'].str.len()
responses["average-response-length"] = responses[[f'{model}-response-length' for model in models]].mean(axis=1)

responses.sort_values("average-response-length", inplace=True)
responses.head(3)

Unnamed: 0,system_instruction,user_request,context_document,Qwen/Qwen2.5-1.5B-Instruct-response,Qwen/Qwen2.5-1.5B-Instruct-response-length,average-response-length
0,[question]\n [user request]\n \n\n ===========...,Is the Government of the United States legally...,Passed following an address by Ukraine’s leade...,"Yes, the Government of the United States is le...",598,598.0


In [40]:
import os

responses_dir = os.path.join("data", "responses")
os.makedirs(responses_dir, exist_ok=True)

responses_file = os.path.join(responses_dir, "model1_responses.csv")
responses.to_csv(responses_file, index=True)
print(f"Responses saved to {responses_file}")

Responses saved to data/responses/model1_responses.csv


In [47]:
import pandas as pd

responses = pd.read_csv("data/responses/qwen_responses.csv")
responses.head()

Unnamed: 0.1,Unnamed: 0,system_instruction,user_request,context_document,Qwen/Qwen2.5-7B-Instruct-response,Qwen/Qwen2.5-7B-Instruct-response-length,average-response-length
0,112,Answer in 10 words or less. Keep things simple...,Give me a list of people that died in 1957.,"After the show, Christian Dior began thinking ...","Sorry, no specific deaths in 1957 mentioned.",44,44.0
1,314,Give me your answer as a full sentence. Answer...,"According to this transcript, what was ipad re...",**Tim Cook**\n\nThank you. Suhasini. Good afte...,iPad revenue in the December quarter was $7 bi...,52,52.0
2,292,Use only the provided context for your respons...,How much does a pair of Diamond 210 cost?,Title: Wharfedale Diamonds Shine Even Brighter...,"According to the document, a pair of Diamond 2...",63,63.0
3,270,Your response to the user must only use the in...,Which model showed debt underutilization in th...,"Since Modigliani and Miller (1958), economists...",- The multistage model showed debt underutiliz...,66,66.0
4,346,"Answer in complete sentences, only use the con...","According to the document, how many copies of ...","**What's publicly known about ""Switch 2""**\n\n...","According to the document, Mario Kart 8 Deluxe...",74,74.0


# Response Evaluation
---

### Load Evaluation Prompts

##### Evaluation Prompt Selection

We chose the grounding and quality evaluation prompts by comparing each judge's verdicts against an internal set of human-rated responses. The selected prompts are listed below; see the [Technical Report](https://arxiv.org/abs/2501.03200) for more details.

#### Grounding

1) `Gemini-1.5-pro    : json`

2) `GPT-4o            : json`

3) `Claude-3-5-sonnet : implicit_span_level`

#### Quality

1) `Gemini-1.5-pro    : ineligible_responses_filter_no_context`

2) `GPT-4o            : ineligible_responses_filter_no_context`

3) `Claude-3-5-sonnet : ineligible_responses_filter_no_context`

In [31]:
evaluation_prompts = pd.read_csv("/kaggle/input/FACTS-grounding-examples/evaluation_prompts.csv")
evaluation_prompts

Unnamed: 0,evaluation_method,evaluation_prompt
0,response_level,Your task is to check if the Response is accur...
1,json_alt,You are a helpful and harmless AI assistant. Y...
2,json,You are a helpful and harmless AI assistant. Y...
3,json_with_double_check,Your task is to verify whether a given sentenc...
4,span_level,Your task is to check if a specific Span is ac...
5,implicit_span_level,Your task is to check if the Response is accur...
6,ineligible_responses_filter_with_context,Your mission is to judge the response from an ...
7,ineligible_responses_filter_no_context,Your mission is to judge the response from an ...


In [32]:
json_prompt = evaluation_prompts.loc[evaluation_prompts["evaluation_method"] == 'json', "evaluation_prompt"].values[0]
implicit_span_prompt = evaluation_prompts.loc[evaluation_prompts["evaluation_method"] == 'implicit_span_level', "evaluation_prompt"].values[0]
ineligible_responses_filter_no_context_prompt = evaluation_prompts.loc[evaluation_prompts["evaluation_method"] == 'ineligible_responses_filter_no_context', "evaluation_prompt"].values[0]

### Setup Evaluation Helpers

Here we'll create a helper method for each evaluation type. Note that, for grounding, the Gemini-1.5-pro and GPT-4o judges use the `json` evaluation prompt and that the Claude-3-5-sonnet judge uses the `implicit_span_level` evaluation prompt as described above.
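The helpers in this section fill `{{placeholder}}` fields in the evaluation prompts with chained `str.replace` calls. A slightly stricter variant, sketched here under the assumption that every placeholder has the form `{{name}}`, fills all fields in a single pass and raises if the template contains a placeholder that was not supplied. `fill_prompt` is a hypothetical helper and the template in the usage line is made up.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def fill_prompt(template, **fields):
    """Replace every {{name}} in the template with fields[name].

    The substitution is a single pass, so placeholder-like text inside an
    inserted value (for example a document that contains '{{response}}')
    is left untouched.
    """
    def substitute(match):
        name = match.group(1)
        if name not in fields:
            raise ValueError(f"No value supplied for placeholder: {name}")
        return fields[name]

    return PLACEHOLDER.sub(substitute, template)

# Made-up template, for illustration only
print(fill_prompt("Request: {{user_request}}\nResponse: {{response}}",
                  user_request="what is 2 + 2?", response="2 + 2 is 4"))
```

Failing loudly on an unfilled placeholder is useful here because a judge that receives a literal `{{context_document}}` will still return an answer, just not a meaningful one.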

#### Testing Grounding Judges
We'll test grounding with a situation where the response is correct but not grounded. We expect to see a `False` from all judges!

```
user_request:      "what is 2 + 2?"
context_document:  "2 + 2 is 3"
response:          "2 + 2 is 4" (this is correct but not grounded in the context_document)
```

In [33]:
# Grounding json evaluation

import json

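def parse_structured_json(ans):
    # Keep only the contents of the ```json block, if the judge emitted one
    if '```json' in ans:
        ans = ans.split('```json')[1].split('```')[0]
    ans = ans.strip()
    # Mark the end of each JSON object so that the objects can be parsed one by one
    ans = ans.replace('}\n', '}\n@\n@\n')
    parsed_answers = []
    for line in ans.split('\n@\n@\n'):
        try:
            line = line.replace('\n', ' ')
            line = line.replace("\\'", "'")
            parsed = json.loads(line)
            parsed_answers.append(parsed)
        except json.JSONDecodeError:
            # Skip fragments that are not valid JSON
            pass
    # Grounded only if something was parsed and every entry is labeled 'supported' or 'no_rad'
    if len(parsed_answers) > 0:
        bool_ans = all(d['label'] == 'supported' or d['label'] == 'no_rad' for d in parsed_answers)
    else:
        bool_ans = False
    return bool_ans, parsed_answers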

def evaluate_grounding_json(user_request, context_document, response, model):
    prompt = json_prompt.replace('{{user_request}}', user_request).replace('{{context_document}}', context_document).replace('{{response}}', response)

    evaluation_text = generate(model=model, prompt=prompt, sys="")
    evaluation, parsed = parse_structured_json(evaluation_text)
    return evaluation

def evaluate_grounding_gemini(user_request, context_document, response):
    return evaluate_grounding_json(user_request, context_document, response, model="gemini-1.5-pro")

def evaluate_grounding_gpt(user_request, context_document, response):
    return evaluate_grounding_json(user_request, context_document, response, model="gpt-4o")
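To make the aggregation rule in `parse_structured_json` concrete, here is a stripped-down sketch of it. It assumes the judge returns one JSON object per line with a `label` field, which is what the parser above implies. The sample judge outputs, the `sentence` field, and the `unsupported` label are invented for illustration.

```python
import json

def all_sentences_grounded(judge_output):
    """True only if at least one label was parsed and all are 'supported' or 'no_rad'."""
    labels = []
    for line in judge_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            labels.append(json.loads(line)["label"])
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # ignore lines that are not JSON objects with a label
    return bool(labels) and all(label in ("supported", "no_rad") for label in labels)

# Invented judge outputs, for illustration only
grounded = '{"sentence": "The sky is blue.", "label": "supported"}\n{"sentence": "Hope that helps!", "label": "no_rad"}'
ungrounded = '{"sentence": "2 + 2 is 4", "label": "unsupported"}'

print(all_sentences_grounded(grounded), all_sentences_grounded(ungrounded), all_sentences_grounded(""))
```

A single sentence with any other label is enough to mark the whole response as not grounded, and an output with no parsable labels also counts as not grounded.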


In [34]:
# Test the evaluator
print(evaluate_grounding_gemini(user_request="what is 2 + 2?", context_document="2 + 2 is 3", response="2 + 2 is 4"))
print(evaluate_grounding_gpt(user_request="what is 2 + 2?", context_document="2 + 2 is 3", response="2 + 2 is 4"))

False
False


In [35]:
# Grounding implicit_span_level evaluation

def answer_normalization(answer):
  answer = answer.strip().lower()
  if 'inaccurate' in answer or 'false' in answer:
    return False
  elif 'accurate' in answer or 'true' in answer:
    return True
  else:
    return False
    
def evaluate_grounding_implicit_span(user_request, context_document, response, model):
    prompt = implicit_span_prompt.replace('{{user_request}}', user_request).replace('{{context_document}}', context_document).replace('{{response}}', response)
    
    evaluation_text = generate(model=model, prompt=prompt, sys="")
    splits = evaluation_text.split('Final Answer:')
    if (len(splits) <= 1):
        return False
    final_ans = splits[1]
    return answer_normalization(final_ans)

def evaluate_grounding_claude(user_request, context_document, response):
    return evaluate_grounding_implicit_span(user_request, context_document, response, model="claude-3-5-sonnet")

In [36]:
# Test the evaluator
evaluate_grounding_claude(user_request="what is 2 + 2?", context_document="2 + 2 is 3", response="2 + 2 is 4")

False

#### Testing Quality Judges

We'll test quality with a situation where the response does not answer the posed request. We expect to see a `False` from all judges!

```
user_request:      "what is 2 + 2?"
response_a:        "3 + 3 is 6" (this is the response we evaluate)
response_b:        "2 + 2 is 4" (this is the reference response)
```

We compare our response with a reference because it improves the accuracy of the judges.

In [37]:
# Quality no_context evaluation

QUESTIONS_TO_LABELS = {
    'Instruction Following': ['No Issues', 'Minor Issue(s)', 'Major Issue(s)', 'Invalid'],
}

def parse_json(ans):
    parsed = {}
    if '```json' in ans:
        ans = ans.split('```json')[1]
        ans = ans.split('```')[0]
    ans = ans.replace('\n', ' ')
    try:
        parsed = json.loads(ans)
    except json.JSONDecodeError:
        # Leave parsed empty so that the verdict falls through to 'Invalid'
        pass
    if 'Instruction Following' not in parsed:
        parsed['Instruction Following'] = 'Invalid'
    elif parsed['Instruction Following'] not in QUESTIONS_TO_LABELS['Instruction Following']:
        parsed['Instruction Following'] = 'Invalid'
    return parsed

# Evaluates response_a for quality using response_b as a reference.
def evaluate_quality_no_context(user_request, response_a, response_b, model):
    prompt = ineligible_responses_filter_no_context_prompt.replace('{{user_request}}', user_request).replace('{{response_a}}', response_a).replace('{{response_b}}', response_b)
    
    evaluation_text = generate(prompt=prompt, model=model, sys="")
    parsed = parse_json(evaluation_text)

    return "Major Issue(s)" not in parsed['Instruction Following']

# Use the response from the judge model itself as the reference response (response_b).
def evaluate_quality_no_context_gemini(user_request, response, references):
    return evaluate_quality_no_context(user_request, response, references["gemini-1.5-pro-response"], model="gemini-1.5-pro")

def evaluate_quality_no_context_gpt(user_request, response, references):
    return evaluate_quality_no_context(user_request, response, references["gpt-4o-response"], model="gpt-4o")

def evaluate_quality_no_context_claude(user_request, response, references):
    return evaluate_quality_no_context(user_request, response, references["claude-3-5-sonnet-response"], model="claude-3-5-sonnet")

In [38]:
# Example references dataframe. When running the real evaluation, we'll use the responses generated earlier as references.
references = pd.DataFrame({
    "gemini-1.5-pro-response": ["2 + 2 is 4"],
    "gpt-4o-response": ["2 + 2 is 4"],
    "claude-3-5-sonnet-response": ["2 + 2 is 4"],
})

print(evaluate_quality_no_context_gemini("What is 2 + 2?", response="3 + 3 is 6", references=references.loc[0]))
print(evaluate_quality_no_context_gpt("What is 2 + 2?", response="3 + 3 is 6", references=references.loc[0]))
print(evaluate_quality_no_context_claude("What is 2 + 2?", response="3 + 3 is 6", references=references.loc[0]))

True
True
False


### Evaluate Responses (~15 minutes)

With our evaluation helper methods, we can now evaluate the responses we generated before.

**Please re-run the below cell if it fails due to transient API issues.**
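As an alternative to re-running the cell by hand, each judge call can be wrapped in a retry loop with exponential backoff. The sketch below is a generic pattern and not part of the original notebook. `with_retries` is a hypothetical helper, and catching every `Exception` is a simplifying assumption; in practice you would catch only the rate-limit and connection errors raised by each client library.

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry on failure, doubling the wait after each attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            # Add jitter so that parallel runs do not retry in lockstep
            sleep(delay + random.uniform(0, delay / 2))
```

In the loop below, a call such as `eval_method(user_request, context_document, response)` could then be written as `with_retries(lambda: eval_method(user_request, context_document, response))`.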

In [39]:
grounding_evaluation_methods = [
    evaluate_grounding_gemini,
    evaluate_grounding_gpt,
    evaluate_grounding_claude,
]

quality_evaluation_methods = [
    evaluate_quality_no_context_gemini,
    evaluate_quality_no_context_gpt,
    evaluate_quality_no_context_claude,
]

def nameof(f):
    return f.__name__.replace("_", "-")

for ix, row in tqdm(responses.iterrows(), total=len(responses)):    
    user_request = row["user_request"]
    context_document = row["context_document"]
    
    for evaluated_model in models:
        response_column_key = f'{evaluated_model}-response'
        response = row[response_column_key]
        
        print(f'Evaluating response at ix: {ix} for model {evaluated_model} with:')
        for eval_method in grounding_evaluation_methods:
            print(f'    {eval_method.__name__}')
            key = f'{response_column_key}-{nameof(eval_method)}'
            
            # skip rows if we've evaluated already
            if ix in responses.index and key in responses.columns and pd.notna(responses.loc[ix, key]):
                continue

            evaluation = eval_method(user_request, context_document, response)
            responses.loc[ix, [key]] = [evaluation]
            
        for eval_method in quality_evaluation_methods:
            print(f'    {eval_method.__name__}')
            key = f'{response_column_key}-{nameof(eval_method)}'

            # skip rows if we've evaluated already
            if ix in responses.index and key in responses.columns and pd.notna(responses.loc[ix, key]):
                continue
            
            evaluation = eval_method(user_request, response, row)
            responses.loc[ix,[key]] = [evaluation]

  0%|          | 0/3 [00:00<?, ?it/s]

Evaluating response at ix: 17 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 17 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 17 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 33%|███▎      | 1/3 [01:25<02:50, 85.45s/it]

Evaluating response at ix: 6 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 6 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 6 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


 67%|██████▋   | 2/3 [03:10<01:36, 96.76s/it]

Evaluating response at ix: 15 for model gemini-1.5-pro with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 15 for model gpt-4o with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude
Evaluating response at ix: 15 for model claude-3-5-sonnet with:
    evaluate_grounding_gemini
    evaluate_grounding_gpt
    evaluate_grounding_claude
    evaluate_quality_no_context_gemini
    evaluate_quality_no_context_gpt
    evaluate_quality_no_context_claude


100%|██████████| 3/3 [04:54<00:00, 98.20s/it] 


In [40]:
evaluations = responses[[col for col in responses.columns if "evaluate" in col]]
evaluations = evaluations.reindex(sorted(evaluations.columns), axis=1)
evaluations

Unnamed: 0,claude-3-5-sonnet-response-evaluate-grounding-claude,claude-3-5-sonnet-response-evaluate-grounding-gemini,claude-3-5-sonnet-response-evaluate-grounding-gpt,claude-3-5-sonnet-response-evaluate-quality-no-context-claude,claude-3-5-sonnet-response-evaluate-quality-no-context-gemini,claude-3-5-sonnet-response-evaluate-quality-no-context-gpt,gemini-1.5-pro-response-evaluate-grounding-claude,gemini-1.5-pro-response-evaluate-grounding-gemini,gemini-1.5-pro-response-evaluate-grounding-gpt,gemini-1.5-pro-response-evaluate-quality-no-context-claude,gemini-1.5-pro-response-evaluate-quality-no-context-gemini,gemini-1.5-pro-response-evaluate-quality-no-context-gpt,gpt-4o-response-evaluate-grounding-claude,gpt-4o-response-evaluate-grounding-gemini,gpt-4o-response-evaluate-grounding-gpt,gpt-4o-response-evaluate-quality-no-context-claude,gpt-4o-response-evaluate-quality-no-context-gemini,gpt-4o-response-evaluate-quality-no-context-gpt
17,True,True,True,False,True,True,True,True,False,False,True,False,True,True,True,True,True,True
6,True,True,True,False,True,True,True,True,True,True,False,False,True,True,True,False,True,True
15,True,True,False,False,True,True,False,True,True,False,True,True,True,True,True,False,False,True


# Evaluation Ensembling
---

### Ensemble Quality Evaluations

We'll invalidate a response only if it fails the bar for all three quality judges. In other words, a response passes the quality check as long as at least one judge accepts it.
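Stated as code, the rule is a logical OR over the three quality verdicts. This is a minimal plain-Python sketch; `passes_quality_check` is a hypothetical name. The first verdict triple is taken from the evaluation table above (the `gemini-1.5-pro` response at index 17, in Claude, Gemini, GPT order) and the second is invented.

```python
def passes_quality_check(judge_verdicts):
    """A response is kept if at least one quality judge accepts it."""
    return any(judge_verdicts)

# gemini-1.5-pro response at ix 17: Claude=False, Gemini=True, GPT=False
print(passes_quality_check([False, True, False]))

# Invented case: all three judges reject the response
print(passes_quality_check([False, False, False]))
```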

In [41]:
for ix, row in evaluations.iterrows():
    for model in models:
        passes_qc_count = 0
        for evaluation_method in quality_evaluation_methods:
            key = f'{model}-response-{nameof(evaluation_method)}'
            passes_qc_count += 1 if row[key] else 0

        passes_qc = passes_qc_count > 0

        evaluations.loc[ix, [f'{model}-response-passes-qc']] = [passes_qc]

relevant_columns = [col for col in evaluations.columns if ("evaluate-grounding" in col or "passes-qc" in col)]

grounding_evaluations = evaluations[relevant_columns]
grounding_evaluations = grounding_evaluations.reindex(sorted(grounding_evaluations.columns), axis=1)
grounding_evaluations

Unnamed: 0,claude-3-5-sonnet-response-evaluate-grounding-claude,claude-3-5-sonnet-response-evaluate-grounding-gemini,claude-3-5-sonnet-response-evaluate-grounding-gpt,claude-3-5-sonnet-response-passes-qc,gemini-1.5-pro-response-evaluate-grounding-claude,gemini-1.5-pro-response-evaluate-grounding-gemini,gemini-1.5-pro-response-evaluate-grounding-gpt,gemini-1.5-pro-response-passes-qc,gpt-4o-response-evaluate-grounding-claude,gpt-4o-response-evaluate-grounding-gemini,gpt-4o-response-evaluate-grounding-gpt,gpt-4o-response-passes-qc
17,True,True,True,True,True,True,False,True,True,True,True,True
6,True,True,True,True,True,True,True,True,True,True,True,True
15,True,True,False,True,False,True,True,True,True,True,True,True


### Gather Grounding Evaluations

Then we'll sum all the responses that were considered grounded (and that passed the quality check) across each of the three grounding judges.
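As a worked example of this step, the per-judge score is the number of responses that are both grounded and pass the quality check, divided by the total number of evaluated examples. The sketch below uses plain lists and a hypothetical `grounding_score` helper. The verdicts are those of the GPT grounding judge for the `gemini-1.5-pro` responses at indices 17, 6, and 15 in the tables above.

```python
def grounding_score(grounded, passes_qc):
    """Fraction of all examples that are grounded and also passed the quality check."""
    n = len(grounded)
    n_success = sum(1 for g, q in zip(grounded, passes_qc) if g and q)
    return n_success / n

# GPT grounding judge on the gemini-1.5-pro responses (ix 17, 6, 15)
grounded = [False, True, True]
passes_qc = [True, True, True]

print(round(grounding_score(grounded, passes_qc), 6))
```

Note that a response which fails the quality check still counts in the denominator, so it lowers the score just like an ungrounded response.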

In [42]:
scores = pd.DataFrame(columns=['model', 'evaluation_method', 'score'])
n = len(grounding_evaluations)  # number of evaluated examples (3 in this run)
for model in models:
    for evaluation_method in grounding_evaluation_methods:
        grounding_key = f'{model}-response-{nameof(evaluation_method)}'
        quality_key = f'{model}-response-passes-qc'

        passed_qc_responses = grounding_evaluations[grounding_evaluations[quality_key] == True]
        n_success = passed_qc_responses[grounding_key].astype(int).sum()
        
        scores.loc[len(scores), ['model', 'evaluation_method', 'score']] = [model, nameof(evaluation_method), n_success / n]
scores

Unnamed: 0,model,evaluation_method,score
0,gemini-1.5-pro,evaluate-grounding-gemini,1.0
1,gemini-1.5-pro,evaluate-grounding-gpt,0.666667
2,gemini-1.5-pro,evaluate-grounding-claude,0.666667
3,gpt-4o,evaluate-grounding-gemini,1.0
4,gpt-4o,evaluate-grounding-gpt,1.0
5,gpt-4o,evaluate-grounding-claude,1.0
6,claude-3-5-sonnet,evaluate-grounding-gemini,1.0
7,claude-3-5-sonnet,evaluate-grounding-gpt,0.666667
8,claude-3-5-sonnet,evaluate-grounding-claude,1.0


### Ensemble Results

Finally, we can average across the three LLM judges to arrive at final scores.
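The averaging can be checked by hand. This sketch takes the per-judge scores printed in the table above and computes the unweighted mean for each model using only the standard library; the dictionary layout is just for illustration.

```python
from statistics import mean

# Per-judge scores from the table above, in Gemini, GPT, Claude judge order
per_judge_scores = {
    "gemini-1.5-pro": [1.0, 0.666667, 0.666667],
    "gpt-4o": [1.0, 1.0, 1.0],
    "claude-3-5-sonnet": [1.0, 0.666667, 1.0],
}

final_scores = {model: round(mean(scores), 6) for model, scores in per_judge_scores.items()}
for model, score in sorted(final_scores.items()):
    print(model, score)
```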

In [43]:
scores.groupby("model")[["score"]].mean()

model,score
claude-3-5-sonnet,0.888889
gemini-1.5-pro,0.777778
gpt-4o,1.0


# Closing Remarks
---

Thank you for following along through this entire notebook! Note that we evaluated only a tiny subset of the examples here. For the real set of scores, please check out the [official leaderboard](https://kaggle.com/facts-leaderboard).

Questions, comments, or issues? Please feel free to share your thoughts with us in the [discussion forum](https://kaggle.com/facts-leaderboard/discussion?sort=hotness)!