# FACTS Benchmark: Factual Accuracy and Grounding Evaluation

This notebook demonstrates how to generate and evaluate responses for the FACTS (Factual Accuracy and Grounding Benchmark) dataset with Contextual AI.

The FACTS [dataset](https://kaggle.com/facts-leaderboard) is a benchmark from Google DeepMind and Google Research that evaluates the factual accuracy and grounding of AI models. It is designed to:

1. **Evaluate Factual Accuracy**: Test how well AI models can provide accurate information based on given context
2. **Assess Grounding**: Measure models' ability to stick to provided reference material without introducing external knowledge
3. **Cross-Domain Testing**: Evaluate performance across different fields like medical, legal, and financial domains
4. **Task Versatility**: Test various types of tasks from simple fact-finding to complex analysis

Table of Contents:

1. [Setup and Imports](#setup-and-imports)
2. [Load and Explore the FACTS Dataset](#load-and-explore-the-facts-dataset)
3. [Model Setup and Response Generation](#model-setup-and-response-generation)
4. [Generate Responses](#generate-responses)
5. [Save Responses](#save-responses)
6. [Response Evaluation](#response-evaluation)
7. [Results and Analysis](#results-and-analysis)


The generated responses can then be evaluated for grounding.

This notebook is largely derived from the [Starter FACTS notebook](https://www.kaggle.com/code/andrewmingwang/facts-grounding-benchmark-starter-code) at Kaggle. 

It is assumed you have downloaded `examples.csv` and `evaluation_prompts.csv` from the Kaggle site.

## 1. Setup and Imports

To run all the providers in this notebook, you will need your own API keys from [OpenAI](https://platform.openai.com/api-keys), [Anthropic](https://docs.anthropic.com/en/api/getting-started), [Contextual AI](https://app.contextual.ai), and Google (for Gemini).

In [None]:
%pip install -q openai anthropic contextual-client google-genai

In [15]:
import os
import time

from contextual import ContextualAI
from anthropic import Anthropic
from google import genai
from openai import OpenAI

import pandas as pd
from tqdm import tqdm

If your keys are stored in environment variables, load them from there:

In [16]:
contextual_api_key = os.getenv("CONTEXTUAL_API_KEY")
openai_api_key = os.getenv("OPENAI_API_KEY")
anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")
google_api_key = os.getenv("GOOGLE_API_KEY")

Otherwise, uncomment the lines below and fill in your keys:

In [17]:
#anthropic_api_key = "sk-ant-"
#openai_api_key = "sk-"
#google_api_key = "AI"
#contextual_api_key = ""

In [18]:
openai_client = OpenAI(api_key=openai_api_key)
anthropic_client = Anthropic(api_key=anthropic_api_key)
google_client = genai.Client(api_key=google_api_key)
contextual_client = ContextualAI(api_key=contextual_api_key)

## 2. Load and Explore the FACTS Dataset

We'll be using a subset of the examples from the [FACTS Grounding 1.0 Public Examples](https://kaggle.com/datasets/deepmind/FACTS-grounding-examples/data) dataset, which makes 860 of the benchmark's 1,719 examples publicly available.

I have downloaded the data locally for this notebook.

## Dataset Structure

Each example in the dataset has the following columns:

### Core Components

1. **system_instruction**
   - Contains model guidelines and constraints
   - Specifies how to approach the response
   - Often emphasizes using only provided context

2. **user_request**
   - The actual question or task to be answered
   - Ranges from simple queries to complex analysis requests

3. **context_document**
   - Reference material for grounding responses
   - Contains the factual information needed to answer the question
   - Serves as the knowledge base for the response

4. **full_prompt**
   - Complete formatted prompt combining system instruction, context, and user request
   - Ready for model input

### Categorization

5. **domain**
   - Categorizes questions by field:
     - Medical
     - Legal
     - Financial
     - Retail/Product
     - Internet/Technology

6. **type**
   - Specific task categories:
     - Effect Analysis
     - Fact Finding
     - Find & Summarize
     - Summarize
     - Explanation/Definition

7. **high_level_type**
   - Broad task classification:
     - Q&A (Question and Answer tasks)
     - Text Transformation (Summarization and text processing)

This structured approach allows for comprehensive evaluation of model performance across different dimensions of factual grounding and response generation.

In [19]:
examples = pd.read_csv("examples.csv")
examples.head()

# Uncomment to limit the run to the first 10 examples for testing
#examples = examples.head(10)
responses = pd.DataFrame()

## 3. Model Setup and Response Generation

Define helper functions for generating responses from each LLM provider.

In [43]:
models = ["gemini-1.5-pro", "gpt-4o", "claude-3-5-sonnet"]
def generate_gemini(prompt, sys):
    response = google_client.models.generate_content(model='gemini-1.5-pro-002', contents=f'{sys} {prompt}')
    return response.text

def generate_gpt(prompt, sys):
    if len(sys) > 0:
        completion = openai_client.chat.completions.create(
          model="gpt-4o",
          messages=[{"role": "system", "content": sys}, {"role": "user","content": prompt}]
        )
        return completion.choices[0].message.content
    else:
        completion = openai_client.chat.completions.create(
          model="gpt-4o",
          messages=[{"role": "user","content": prompt}]
        )
        return completion.choices[0].message.content     

def generate_claude(prompt, sys):
    if len(sys) > 0:
        message = anthropic_client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=8192,
            system=[{"type": "text", "text": sys}],
            messages=[{"role": "user", "content": prompt}],
        )
        return message.content[0].text
    else:
        message = anthropic_client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=8192,
            messages=[{"role": "user", "content": prompt}],
        )
        return message.content[0].text

def generate(model, prompt, sys):
    if model == "gemini-1.5-pro":
        return generate_gemini(prompt, sys)
    elif model == "gpt-4o":
        return generate_gpt(prompt, sys)
    elif model == "claude-3-5-sonnet":
        return generate_claude(prompt, sys)
    else:
        raise ValueError(f"Invalid model selected: {model}")

In [None]:
for model in models:
    print(model)
    print(generate(model, "count to 3", ""))

## 4. Generate Responses

Generate responses for each example in the dataset using the selected models.

Generate responses for Contextual to match the Kaggle benchmarking results. Run sequentially, this takes around 90 minutes; issuing requests concurrently (for example, with an async script) is faster.

In [None]:
model = "contextual"
for ix, row in tqdm(examples.iterrows(), total=len(examples)):
    response = contextual_client.generate.create(
        model="v2",
        messages=[{"role": "user", "content": row["user_request"]}],
        system_prompt=row["full_prompt"],
        knowledge=[],
        top_p=0.9,
        avoid_commentary=True,
        temperature=0.0,
        max_new_tokens=2048
    )
    responses.loc[ix, ["system_instruction", "user_request", "context_document", "full_prompt", f'{model}-response']] = [
        row["system_instruction"], row["user_request"], row["context_document"], row["full_prompt"], response.response
    ]
responses
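
The sequential loop above spends most of its time waiting on the network, so one way to shorten the run is to issue several requests at once. The sketch below uses a thread pool rather than a true async client. `generate_concurrently`, `generate_one`, and `fake_generate` are hypothetical names: `generate_one` stands in for a function wrapping the `contextual_client.generate.create(...)` call, and `max_workers=8` is an arbitrary value that should be tuned to your rate limits.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def generate_concurrently(rows, generate_one, max_workers=8):
    """Call generate_one(row) for each (ix, row) pair using a thread pool.

    Returns (results, errors): dicts keyed by ix, so failed rows can be retried.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(generate_one, row): ix for ix, row in rows}
        for future in as_completed(futures):
            ix = futures[future]
            try:
                results[ix] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[ix] = exc
    return results, errors


# Hypothetical stand-in for the API call: echoes the request in upper case.
def fake_generate(row):
    if not row["user_request"]:
        raise ValueError("empty request")
    return row["user_request"].upper()


rows = [(0, {"user_request": "what is 2 + 2?"}), (1, {"user_request": ""})]
results, errors = generate_concurrently(rows, fake_generate, max_workers=2)
print(results)       # {0: 'WHAT IS 2 + 2?'}
print(list(errors))  # [1]
```

In the notebook, `rows` would be `examples.iterrows()` and the results would be written into `responses` afterwards, keeping DataFrame writes on the main thread.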

Optional: you can get a bit more performance out of the GLM (Contextual AI's Grounded Language Model) by passing the context through the `knowledge` field rather than embedding it in the prompt. The snippet assumes `knowledge` has already been defined, for example as a list holding the row's `context_document`.

```
response = contextual_client.generate.create(
    model="v2",
    messages=[{"role": "user", "content": row["user_request"]}],
    system_prompt=row["system_instruction"],
    knowledge=knowledge,
    top_p=1,
    avoid_commentary=True,
    temperature=0.0)
```

Next, generate responses for the other models. The simple script below would ideally be enough, but I kept running into timeouts and other issues, so it kept failing.

In [None]:
for ix, row in tqdm(examples.iterrows(), total=len(examples)):
    full_prompt = row['full_prompt']
    for model in models:
        response = generate(model=model, prompt=full_prompt, sys="")
        responses.loc[ix, ["system_instruction", "user_request", "context_document", f'{model}-response']] = [row["system_instruction"], row["user_request"], row["context_document"], response]
responses

The longer script below is more robust against provider errors: it retries failed API calls and checkpoints progress to disk.

In [None]:
max_retries = 3
wait_seconds = 30
checkpoint_path = "responses_multi_checkpoint.csv"

# Load the checkpoint if it exists; otherwise keep the existing `responses`
# DataFrame so the Contextual responses generated above are not discarded.
if os.path.exists(checkpoint_path):
    responses = pd.read_csv(checkpoint_path)
    print(f"Loaded checkpoint with {len(responses)} rows.")
else:
    for col in ["system_instruction", "user_request", "context_document", "full_prompt"] + [f"{model}-response" for model in models]:
        if col not in responses.columns:
            responses[col] = None

for ix, row in tqdm(examples.iterrows(), total=len(examples)):
    for model in models:
        col = f"{model}-response"
        if col in responses.columns and ix in responses.index and pd.notna(responses.loc[ix, col]):
            continue
        for attempt in range(max_retries):
            try:
                response = generate(model=model, prompt=row["full_prompt"], sys="")
                if ix not in responses.index:
                    responses.loc[ix, ["system_instruction", "user_request", "context_document", "full_prompt"]] = [
                        row["system_instruction"], row["user_request"], row["context_document"], row["full_prompt"]
                    ]
                responses.loc[ix, col] = response
                break  # Success
            except Exception as e:
                print(f"Error at row {ix}, model {model}, attempt {attempt+1}: {e}")
                if attempt < max_retries - 1:
                    time.sleep(wait_seconds)
                else:
                    print(f"Failed after {max_retries} attempts, skipping.")
        # Checkpoint every 10 rows
        if ix % 10 == 0:
            responses.to_csv(checkpoint_path, index=False)

# Final save
responses.to_csv(checkpoint_path, index=False)
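
The retry logic in the cell above can also be factored into a small reusable helper. The sketch below is one way to do that and is not part of the original notebook: it swaps the fixed 30-second wait for exponential backoff, and `with_retries`, `base_delay`, and the injectable `sleep` argument are hypothetical names chosen for illustration.

```python
import time


def with_retries(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_retries attempts have failed.

    Waits base_delay * 2**attempt seconds between attempts (exponential
    backoff). Returns fn()'s result, or None if every attempt raised.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            print(f"Attempt {attempt + 1} failed: {exc}")
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)
    return None


# Hypothetical flaky call: fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


waits = []  # record the requested delays instead of actually sleeping
result = with_retries(flaky, max_retries=3, base_delay=1.0, sleep=waits.append)
print(result, waits)  # ok [1.0, 2.0]
```

Returning `None` on total failure mirrors the notebook's behavior of skipping the row, which leaves the cell empty so a later run can pick it up again.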

## 5. Save Responses

Save the generated responses to disk for reproducibility. You should now have these columns:
- 'contextual-response'
- 'gemini-1.5-pro-response'
- 'gpt-4o-response'
- 'claude-3-5-sonnet-response'

You will need all of these for the evaluation process.

In [52]:
responses.to_csv("responses_contextual.csv", index=False)

## 6. Response Evaluation

Evaluate the factual accuracy and grounding of the generated responses using multiple LLM-based judges.

### Load Evaluation Prompts

##### Evaluation Prompt Selection

We chose the grounding and quality evaluation prompts based on evaluation against an internal set of human-rated responses. See below for which evaluation prompts were selected and the [Technical Report](https://arxiv.org/abs/2501.03200) for more details.

#### Grounding

1) `Gemini-1.5-pro    : json`

2) `GPT-4o            : json`

3) `Claude-3-5-sonnet : implicit_span_level`

#### Quality

1) `Gemini-1.5-pro    : ineligible_responses_filter_no_context`

2) `GPT-4o            : ineligible_responses_filter_no_context`

3) `Claude-3-5-sonnet : ineligible_responses_filter_no_context`

In [None]:
evaluation_prompts = pd.read_csv("evaluation_prompts.csv")
evaluation_prompts

In [54]:
json_prompt = evaluation_prompts.loc[evaluation_prompts["evaluation_method"] == 'json', "evaluation_prompt"].values[0]
implicit_span_prompt = evaluation_prompts.loc[evaluation_prompts["evaluation_method"] == 'implicit_span_level', "evaluation_prompt"].values[0]
ineligible_responses_filter_no_context_prompt = evaluation_prompts.loc[evaluation_prompts["evaluation_method"] == 'ineligible_responses_filter_no_context', "evaluation_prompt"].values[0]

### Setup Evaluation Helpers

Here we'll create a helper method for each evaluation type. Note that, for grounding, the Gemini-1.5-pro and GPT-4o judges use the `json` evaluation prompt and that the Claude-3-5-sonnet judge uses the `implicit_span_level` evaluation prompt as described above.

In [55]:
# Grounding json evaluation

import json

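def parse_structured_json(ans):
    # Extract the body of a ```json fenced block if the judge used one
    if '```json' in ans:
        ans = ans.split('```json')[1].split('```')[0]
    ans = ans.strip()
    # Insert a sentinel after each object so the output can be split per object
    ans = ans.replace('}\n', '}\n@\n@\n')
    parsed_answers = []
    for line in ans.split('\n@\n@\n'):
        try:
            line = line.replace('\n', ' ')
            line = line.replace("\\'", "'")
            parsed = json.loads(line)
            parsed_answers.append(parsed)
        except json.JSONDecodeError:
            # Skip fragments that are not valid JSON
            pass
    # Grounded only if every parsed sentence is 'supported' or 'no_rad'
    if len(parsed_answers) > 0:
        bool_ans = all(d['label'] == 'supported' or d['label'] == 'no_rad' for d in parsed_answers)
    else:
        bool_ans = False
    return bool_ans, parsed_answers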

def evaluate_grounding_json(user_request, context_document, response, model):
    prompt = json_prompt.replace('{{user_request}}', user_request).replace('{{context_document}}', context_document).replace('{{response}}', response)

    evaluation_text = generate(model=model, prompt=prompt, sys="")
    evaluation, parsed = parse_structured_json(evaluation_text)
    return evaluation

def evaluate_grounding_gemini(user_request, context_document, response):
    return evaluate_grounding_json(user_request, context_document, response, model="gemini-1.5-pro")

def evaluate_grounding_gpt(user_request, context_document, response):
    return evaluate_grounding_json(user_request, context_document, response, model="gpt-4o")
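
To make the parsing rule concrete, here is a simplified, self-contained stand-in for `parse_structured_json`. It assumes the judge emits one JSON object per line, each carrying a `label`. The sample judge outputs are made up for illustration, and the `sentence` field is an assumption rather than the exact schema of the evaluation prompt; `is_grounded` is a hypothetical name.

```python
import json


def is_grounded(judge_output):
    """Return True only if every parsed sentence is 'supported' or 'no_rad'."""
    labels = []
    for line in judge_output.strip().splitlines():
        try:
            labels.append(json.loads(line)["label"])
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # skip lines that are not valid verdict objects
    # An empty list means nothing could be parsed, which counts as ungrounded.
    return bool(labels) and all(l in ("supported", "no_rad") for l in labels)


sample = """
{"sentence": "2 + 2 is 4", "label": "contradictory"}
{"sentence": "Hope that helps!", "label": "no_rad"}
"""
print(is_grounded(sample))  # False
print(is_grounded('{"sentence": "2 + 2 is 3", "label": "supported"}'))  # True
print(is_grounded("not json at all"))  # False
```

A single sentence with any other label is enough to mark the whole response as ungrounded, and output that cannot be parsed at all is also treated as ungrounded.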


#### Testing Grounding Judges
We'll test grounding with a situation where the response is correct but not grounded. We expect to see a `False` from all judges!

```
user_request:      "what is 2 + 2?"
context_document:  "2 + 2 is 3"
response:          "2 + 2 is 4" (this is correct but not grounded in the context_document)
```

In [None]:
# Test the evaluator
print(evaluate_grounding_gemini(user_request="what is 2 + 2?", context_document="2 + 2 is 3", response="2 + 2 is 4"))
print(evaluate_grounding_gpt(user_request="what is 2 + 2?", context_document="2 + 2 is 3", response="2 + 2 is 4"))

In [57]:
# Grounding implicit_span_level evaluation

def answer_normalization(answer):
    answer = answer.strip().lower()
    # Check the negative forms first: 'inaccurate' contains 'accurate'
    if 'inaccurate' in answer or 'false' in answer:
        return False
    elif 'accurate' in answer or 'true' in answer:
        return True
    else:
        return False

def evaluate_grounding_implicit_span(user_request, context_document, response, model):
    prompt = implicit_span_prompt.replace('{{user_request}}', user_request).replace('{{context_document}}', context_document).replace('{{response}}', response)

    evaluation_text = generate(model=model, prompt=prompt, sys="")
    splits = evaluation_text.split('Final Answer:')
    if len(splits) <= 1:
        return False
    final_ans = splits[1]
    return answer_normalization(final_ans)

def evaluate_grounding_claude(user_request, context_document, response):
    return evaluate_grounding_implicit_span(user_request, context_document, response, model="claude-3-5-sonnet")
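
The `implicit_span_level` verdict hinges on the text that follows `Final Answer:`. The self-contained sketch below restates that rule in one function and runs it on made-up judge outputs; `parse_final_answer` is a hypothetical name, not a function from the notebook.

```python
def parse_final_answer(evaluation_text):
    """Map the text after 'Final Answer:' to a boolean grounding verdict."""
    parts = evaluation_text.split("Final Answer:")
    if len(parts) <= 1:
        return False  # the judge never produced a final answer
    answer = parts[1].strip().lower()
    # Check the negative forms first: 'inaccurate' contains 'accurate'.
    if "inaccurate" in answer or "false" in answer:
        return False
    return "accurate" in answer or "true" in answer


print(parse_final_answer("Reasoning... Final Answer: Accurate"))    # True
print(parse_final_answer("Reasoning... Final Answer: Inaccurate"))  # False
print(parse_final_answer("The response looks fine."))               # False
```

The order of the checks matters: testing for `accurate` first would classify `Inaccurate` as a pass.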

In [None]:
# Test the Claude grounding judge on the same example; we expect False
evaluate_grounding_claude(user_request="what is 2 + 2?", context_document="2 + 2 is 3", response="2 + 2 is 4")

#### Testing Quality Judges

We'll test quality with a situation where the response does not answer the posed request. We expect to see a `False` from all judges!

```
user_request:      "what is 2 + 2?"
response_a:        "3 + 3 is 6" (this is the response we evaluate)
response_b:        "2 + 2 is 4" (this is the reference response)
```

We compare our response with a reference because it improves the accuracy of the judges.

In [59]:
# Quality no_context evaluation

QUESTIONS_TO_LABELS = {
    'Instruction Following': ['No Issues', 'Minor Issue(s)', 'Major Issue(s)', 'Invalid'],
}

def parse_json(ans):
    parsed = {}
    if '```json' in ans:
        ans = ans.split('```json')[1]
        ans = ans.split('```')[0]
    ans = ans.replace('\n', ' ')
    try:
        parsed = json.loads(ans)
    except json.JSONDecodeError:
        pass  # leave `parsed` empty so the label falls back to 'Invalid'
    if not isinstance(parsed, dict):
        parsed = {}
    # Missing or unexpected labels are normalized to 'Invalid'
    if parsed.get('Instruction Following') not in QUESTIONS_TO_LABELS['Instruction Following']:
        parsed['Instruction Following'] = 'Invalid'
    return parsed

# Evaluates response_a for quality using response_b as a reference.
def evaluate_quality_no_context(user_request, response_a, response_b, model):
    prompt = ineligible_responses_filter_no_context_prompt.replace('{{user_request}}', user_request).replace('{{response_a}}', response_a).replace('{{response_b}}', response_b)
    
    evaluation_text = generate(prompt=prompt, model=model, sys="")
    parsed = parse_json(evaluation_text)

    return "Major Issue(s)" not in parsed['Instruction Following']

# Use the response from the judge model itself as the reference response (response_b).
def evaluate_quality_no_context_gemini(user_request, response, references):
    return evaluate_quality_no_context(user_request, response, references["gemini-1.5-pro-response"], model="gemini-1.5-pro")

def evaluate_quality_no_context_gpt(user_request, response, references):
    return evaluate_quality_no_context(user_request, response, references["gpt-4o-response"], model="gpt-4o")

def evaluate_quality_no_context_claude(user_request, response, references):
    return evaluate_quality_no_context(user_request, response, references["claude-3-5-sonnet-response"], model="claude-3-5-sonnet")
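
The quality check rejects a response only when the judge reports `Major Issue(s)`. The sketch below restates that rule in a self-contained form and applies it to made-up judge outputs; `passes_quality` and `VALID_LABELS` are hypothetical names, and the sample strings are not real judge responses.

```python
import json

VALID_LABELS = ["No Issues", "Minor Issue(s)", "Major Issue(s)", "Invalid"]


def passes_quality(judge_output):
    """Return False only when the judge reports 'Major Issue(s)'."""
    try:
        parsed = json.loads(judge_output)
    except json.JSONDecodeError:
        parsed = {}
    label = parsed.get("Instruction Following") if isinstance(parsed, dict) else None
    if label not in VALID_LABELS:
        label = "Invalid"  # missing or unexpected labels fall back to Invalid
    return label != "Major Issue(s)"


print(passes_quality('{"Instruction Following": "No Issues"}'))       # True
print(passes_quality('{"Instruction Following": "Major Issue(s)"}'))  # False
print(passes_quality("garbled output"))                               # True
```

Note the last case: judge output that cannot be parsed is labeled `Invalid`, which this rule counts as a pass, so parsing failures never disqualify a response on their own.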

In [None]:
# Example references dataframe, when we're actually running the evaluation we'll be using our generated responses from before as references.
references = pd.DataFrame({
    "gemini-1.5-pro-response": ["2 + 2 is 4"],
    "gpt-4o-response": ["2 + 2 is 4"],
    "claude-3-5-sonnet-response": ["2 + 2 is 4"],
})

print(evaluate_quality_no_context_gemini("What is 2 + 2?", response="3 + 3 is 6", references=references.loc[0]))
print(evaluate_quality_no_context_gpt("What is 2 + 2?", response="3 + 3 is 6", references=references.loc[0]))
print(evaluate_quality_no_context_claude("What is 2 + 2?", response="3 + 3 is 6", references=references.loc[0]))

### Evaluate Responses

With our evaluation helper methods, we can now evaluate the responses we generated before.
- We invalidate a response only if it fails the bar for all three quality judges.
- For each of the three grounding judges, we then count the responses that it considered grounded and that passed the quality check.
- Finally, we average across the three LLM judges to arrive at the final score.
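
The three steps above can be sketched on toy data. The verdicts below are made up, and `factuality_scores` is a hypothetical helper; the notebook itself does the same computation on DataFrame columns. As in the notebook, each judge's score is divided by the total number of examples, not only by those that passed the quality check.

```python
from statistics import mean

# Made-up verdicts for three responses. Each tuple holds the three judges'
# verdicts in the order (gemini, gpt, claude).
verdicts = [
    {"quality": (True, True, False),   "grounding": (True, True, True)},
    {"quality": (False, False, False), "grounding": (True, True, True)},
    {"quality": (True, False, False),  "grounding": (True, False, True)},
]


def factuality_scores(verdicts):
    """Return the per-judge grounding scores and their average."""
    n = len(verdicts)
    # A response passes QC unless all three quality judges reject it.
    eligible = [v for v in verdicts if any(v["quality"])]
    per_judge = [
        sum(v["grounding"][j] for v in eligible) / n for j in range(3)
    ]
    return per_judge, mean(per_judge)


per_judge, final = factuality_scores(verdicts)
print([round(s, 3) for s in per_judge])  # [0.667, 0.333, 0.667]
print(round(final, 3))                   # 0.556
```

The second response is judged grounded by everyone but still contributes nothing, because all three quality judges rejected it.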

In [61]:
models = ["contextual", "gemini-1.5-pro", "gpt-4o", "claude-3-5-sonnet"]
models = ["contextual"]  # For this notebook, we only evaluate the Contextual model.

I added some checkpointing, so if you hit an error, you should be able to restart the cell below. This cell makes thousands of API calls, so hitting some errors is not unexpected. It usually takes at least several hours to run.

In [None]:
grounding_evaluation_methods = [
    evaluate_grounding_gemini,
    evaluate_grounding_gpt,
    evaluate_grounding_claude,
]

quality_evaluation_methods = [
    evaluate_quality_no_context_gemini,
    evaluate_quality_no_context_gpt,
    evaluate_quality_no_context_claude,
]

def nameof(f):
    return f.__name__.replace("_", "-")

for ix, row in tqdm(responses.iterrows(), total=len(responses)):    
    user_request = row["user_request"]
    context_document = row["context_document"]
    
    for evaluated_model in models:
        response_column_key = f'{evaluated_model}-response'
        response = row[response_column_key]
        
        print(f'Evaluating response at ix: {ix} for model {evaluated_model} with:')
        for eval_method in grounding_evaluation_methods:
            print(f'    {eval_method.__name__}')
            key = f'{response_column_key}-{nameof(eval_method)}'
            
            # skip rows if we've evaluated already
            if ix in responses.index and key in responses.columns and pd.notna(responses.loc[ix, key]):
                continue

            evaluation = eval_method(user_request, context_document, response)
            responses.loc[ix, [key]] = [evaluation]
            
        for eval_method in quality_evaluation_methods:
            print(f'    {eval_method.__name__}')
            key = f'{response_column_key}-{nameof(eval_method)}'

            # skip rows if we've evaluated already
            if ix in responses.index and key in responses.columns and pd.notna(responses.loc[ix, key]):
                continue
            
            evaluation = eval_method(user_request, response, row)
            responses.loc[ix,[key]] = [evaluation]
    
    # Save every 100 rows
    if ix % 100 == 0:
        temp_evaluations = responses[[col for col in responses.columns if "evaluate" in col]]
        temp_evaluations = temp_evaluations.reindex(sorted(temp_evaluations.columns), axis=1)
        temp_evaluations.to_csv(f"evaluations_checkpoint_{ix}.csv")


Verify your evaluations:
You should have six distinct True/False columns for each evaluated model: three from the grounding judges and three from the quality judges.

In [None]:
evaluations = responses[[col for col in responses.columns if "evaluate" in col]]
evaluations = evaluations.reindex(sorted(evaluations.columns), axis=1)
evaluations


The evaluation columns are now part of `responses`, so we have one wide dataset with all the data we need.

In [None]:
responses.head()

Save this hard work! I found JSONL to be the best format for this dataset.

In [74]:
#responses.to_csv("evaluations_contextual_checkpoint_all.csv", index=False, quoting=csv.QUOTE_ALL)
responses.to_json("responses.jsonl", orient="records", lines=True, force_ascii=False)

In [None]:
responses = pd.read_json("responses.jsonl", lines=True)
responses.head()

### Ensemble Quality Evaluations

We invalidate a response only if it fails the bar for all three quality judges.

In [None]:
for ix, row in evaluations.iterrows():
    for model in models:
        passes_qc_count = 0
        for evaluation_method in quality_evaluation_methods:
            key = f'{model}-response-{nameof(evaluation_method)}'
            passes_qc_count += 1 if row[key] else 0

        passes_qc = passes_qc_count > 0

        evaluations.loc[ix, [f'{model}-response-passes-qc']] = [passes_qc]

relevant_columns = [col for col in evaluations.columns if ("evaluate-grounding" in col or "passes-qc" in col)]

grounding_evaluations = evaluations[relevant_columns]
grounding_evaluations = grounding_evaluations.reindex(sorted(grounding_evaluations.columns), axis=1)
grounding_evaluations

## 7. Results and Analysis

Summarize and visualize the evaluation results for each model.

In [None]:
scores = pd.DataFrame(columns=['model', 'evaluation_method', 'score'])
n = examples.shape[0]
for model in models:
    print(model)
    for evaluation_method in grounding_evaluation_methods:
        grounding_key = f'{model}-response-{nameof(evaluation_method)}'
        quality_key = f'{model}-response-passes-qc'

        passed_qc_responses = grounding_evaluations[grounding_evaluations[quality_key] == True]
        n_success = passed_qc_responses[grounding_key].astype(int).sum()

        scores.loc[len(scores), ['model', 'evaluation_method', 'score']] = [model, nameof(evaluation_method), n_success / n]
scores

My scores for contextual:

| | model | evaluation_method | score |
|---|---|---|---|
| 0 | contextual | evaluate-grounding-gemini | 0.948837 |
| 1 | contextual | evaluate-grounding-gpt | 0.80814 |
| 2 | contextual | evaluate-grounding-claude | 0.92907 |

In [None]:
scores.groupby("model")[["score"]].mean()

My scores for contextual:

| model | score |
|---|---|
| contextual | 0.895349 |
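
As a quick arithmetic check, the final score is the plain mean of the three per-judge scores reported above. The dictionary below just restates those numbers; `judge_scores` is a name chosen for this sketch.

```python
from statistics import mean

# Per-judge grounding scores for the Contextual model, as reported above
judge_scores = {
    "evaluate-grounding-gemini": 0.948837,
    "evaluate-grounding-gpt": 0.80814,
    "evaluate-grounding-claude": 0.92907,
}

final_score = mean(judge_scores.values())
print(round(final_score, 6))  # 0.895349
```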