# BioResonKGBench: Multi-Model LLM Evaluation

**Evaluating 8 LLMs on Causal Knowledge Graph Question Answering**

## Overview

BioResonKGBench contains questions across four taxonomies:
- **S (Structure)**: Graph navigation and topology
- **R (Risk)**: Quantitative risk assessment
- **C (Causal)**: Causal evidence evaluation
- **M (Mechanism)**: Pathway and semantic understanding

## Models Evaluated
1. claude-3-haiku
2. deepseek-v3
3. gpt-4.1
4. gpt-4.1-mini
5. gpt-4o
6. gpt-4o-mini
7. llama-3.1-8b
8. qwen-2.5-7b

## 1. Setup and Configuration

In [1]:
# Install required packages
!pip install openai anthropic neo4j pandas matplotlib seaborn tqdm pyyaml -q


In [1]:
import json
import os
import sys
import yaml
import time
from datetime import datetime
from pathlib import Path
from typing import List, Dict, Set, Optional
from collections import defaultdict

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

print("="*60)
print("BioResonKGBench Multi-Model Evaluation")
print("="*60)

BioResonKGBench Multi-Model Evaluation


In [2]:
# =============================================================================
# Configure Paths
# =============================================================================
NOTEBOOK_DIR = Path.cwd()

# Detect project structure
if NOTEBOOK_DIR.name == 'tutorials':
    BENCHMARK_DIR = NOTEBOOK_DIR.parent
else:
    BENCHMARK_DIR = NOTEBOOK_DIR

# BioResonKGBench paths
DATA_DIR = BENCHMARK_DIR / 'data'
RESULTS_DIR = BENCHMARK_DIR / 'results'
CONFIG_DIR = BENCHMARK_DIR / 'config'

# BioKGBench src (for KGQA system)
BIOKGBENCH_DIR = BENCHMARK_DIR.parent / '01_BioKGBench'
SRC_DIR = BIOKGBENCH_DIR / 'src'
BIOKGBENCH_CONFIG_DIR = BIOKGBENCH_DIR / 'config'

# Add src to path
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

# Create results directory if needed
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

print(f"Benchmark directory: {BENCHMARK_DIR}")
print(f"Data directory: {DATA_DIR}")
print(f"Results directory: {RESULTS_DIR}")
print(f"KGQA src directory: {SRC_DIR}")

Benchmark directory: /ibex/user/alsaedsb/ROCKET/Data/BioREASONIC/benchmarks/02_BioResonKGBench
Data directory: /ibex/user/alsaedsb/ROCKET/Data/BioREASONIC/benchmarks/02_BioResonKGBench/data
Results directory: /ibex/user/alsaedsb/ROCKET/Data/BioREASONIC/benchmarks/02_BioResonKGBench/results
KGQA src directory: /ibex/user/alsaedsb/ROCKET/Data/BioREASONIC/benchmarks/01_BioKGBench/src


In [3]:
# =============================================================================
# Load API Keys from Config
# =============================================================================
print("="*60)
print("API Key Status")
print("="*60)

# Try BioResonKGBench config first, then BioKGBench
config_path = CONFIG_DIR / 'config.local.yaml'
if not config_path.exists():
    config_path = BIOKGBENCH_CONFIG_DIR / 'config.local.yaml'

if config_path.exists():
    with open(config_path) as f:
        config = yaml.safe_load(f)
    
    llm_config = config.get('llm', {})
    
    api_keys = {
        'OPENAI_API_KEY': llm_config.get('openai', {}).get('api_key'),
        'ANTHROPIC_API_KEY': llm_config.get('claude', {}).get('api_key'),
        'TOGETHER_API_KEY': llm_config.get('together', {}).get('api_key'),
    }
    
    for key_name, key_value in api_keys.items():
        if key_value:
            os.environ[key_name] = key_value
            status = "Configured"  # never echo key material into notebook output
        else:
            status = "Not set"
        print(f"{key_name:<18} {status}")
else:
    print(f"Config not found: {config_path}")

print("="*60)

API Key Status
OPENAI_API_KEY     Configured
ANTHROPIC_API_KEY  Configured
TOGETHER_API_KEY   Configured
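
Printing a long key prefix into saved notebook output can leak credential material. If some hint about which key is loaded is still useful, a masked form is safer. The following is a minimal sketch, assuming a four-character suffix is enough to tell keys apart; `mask_secret` is a hypothetical helper and not part of the benchmark code.

```python
def mask_secret(value, visible=4):
    """Return a display-safe form of a secret: only the last `visible` chars."""
    if not value:
        return "Not set"
    if len(value) <= visible:
        # Too short to reveal any part of it safely.
        return "*" * len(value)
    return "*" * 8 + value[-visible:]


# Dummy value for illustration only; this is not a real key.
print(mask_secret("sk-example-0000abcd"))
print(mask_secret(None))
```

The status line in the loop could then use `mask_secret(key_value)` in place of a raw slice of the key.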


## 2. Load BioResonKGBench Data

In [4]:
def load_benchmark_data(split: str = 'dev') -> List[Dict]:
    """Load benchmark questions from combined JSON file."""
    file_path = DATA_DIR / f'combined_CKGQA_{split}_matched.json'
    with open(file_path, 'r') as f:
        questions = json.load(f)
    print(f"Loaded {len(questions)} questions from {split} set")
    return questions

# Load data
dev_questions = load_benchmark_data('dev')
test_questions = load_benchmark_data('test')

# Show distribution
print("\nDev Set Distribution:")
taxonomy_dist = defaultdict(int)
category_dist = defaultdict(int)
for q in dev_questions:
    taxonomy_dist[q.get('taxonomy', 'unknown')] += 1
    category_dist[q.get('category', 'unknown')] += 1

print(f"  Taxonomy: {dict(taxonomy_dist)}")
print(f"  Category: {dict(category_dist)}")

Loaded 192 questions from dev set
Loaded 1088 questions from test set

Dev Set Distribution:
  Taxonomy: {'M': 54, 'C': 48, 'S': 42, 'R': 48}
  Category: {'knowledge': 114, 'reasoning': 78}
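
The loader and the distribution counts rely on each question record carrying a handful of fields. The sketch below shows the fields this notebook reads (`task_id`, `question`, `taxonomy`, `category`, `type`, `cypher`, `answer_key`) on two made-up records and tallies them with `collections.Counter`. The Cypher placeholders and the category assignments are illustrative assumptions, not values taken from the dataset.

```python
from collections import Counter

# Hypothetical records: the field names match those read in this notebook,
# but the Cypher placeholders and category values are illustrative only.
sample_questions = [
    {
        "task_id": "M-PROTEIN",
        "question": "What protein does gene PHC3 translate into?",
        "taxonomy": "M",
        "category": "knowledge",
        "type": "one-hop",
        "cypher": "<Cypher query returning a protein_name column>",
        "answer_key": "protein_name",
    },
    {
        "task_id": "S-DISEASE-GENES",
        "question": "What genes are risk factors for Coronary Artery Disease?",
        "taxonomy": "S",
        "category": "knowledge",
        "type": "one-hop",
        "cypher": "<Cypher query returning a gene column>",
        "answer_key": "gene",
    },
]


def distribution(questions, field):
    """Count questions per value of `field`, using 'unknown' when it is absent."""
    return Counter(q.get(field, "unknown") for q in questions)


print(dict(distribution(sample_questions, "taxonomy")))
print(dict(distribution(sample_questions, "category")))
```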


In [5]:
# Show sample questions
print("="*60)
print("Sample Questions by Taxonomy")
print("="*60)

for taxonomy in ['S', 'R', 'C', 'M']:
    sample = next((q for q in dev_questions if q.get('taxonomy') == taxonomy), None)
    if sample:
        print(f"\n[{taxonomy}] {sample['task_id']}")
        print(f"Q: {sample['question'][:80]}...")
        print(f"Type: {sample.get('type', 'N/A')}")

Sample Questions by Taxonomy

[S] S-DISEASE-GENES
Q: What genes are risk factors for Coronary Artery Disease?...
Type: one-hop

[R] R-PVALUE
Q: How statistically significant is the association between SNP rs2856690 and Celia...
Type: one-hop

[C] C-EVIDENCE-LEVEL
Q: What is the evidence level for gene CCDC26 affecting Neoplasms?...
Type: classification

[M] M-PROTEIN
Q: What protein does gene PHC3 translate into?...
Type: one-hop


## 3. Test Neo4j Connection

In [6]:
# =============================================================================
# Test Neo4j Connection
# =============================================================================
print("="*60)
print("Testing Neo4j Connection")
print("="*60)

try:
    from neo4j import GraphDatabase
    
    # Load Neo4j config
    kg_config_path = BIOKGBENCH_CONFIG_DIR / 'kg_config.yml'
    if not kg_config_path.exists():
        kg_config_path = CONFIG_DIR / 'kg_config.yml'
    
    with open(kg_config_path) as f:
        neo4j_config = yaml.safe_load(f)
    
    uri = f"bolt://{neo4j_config['db_url']}:{neo4j_config['db_port']}"
    print(f"Connecting to: {uri}")
    
    driver = GraphDatabase.driver(
        uri,
        auth=(neo4j_config['db_user'], neo4j_config['db_password']),
        encrypted=False
    )
    
    with driver.session() as session:
        result = session.run("MATCH (n) RETURN count(n) as count LIMIT 1")
        count = result.single()['count']
        print(f"Connected! Database has {count:,} nodes")
    
    driver.close()
    NEO4J_AVAILABLE = True
    
except Exception as e:
    print(f"Connection failed: {e}")
    NEO4J_AVAILABLE = False

print("="*60)

Testing Neo4j Connection
Connecting to: bolt://10.73.107.108:7687
Connected! Database has 1,006,535 nodes
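
The connection cell assumes that `kg_config.yml` provides `db_url`, `db_port`, `db_user` and `db_password`. If one is missing, the broad `except` reports only the bare key name. A small sketch of checking the keys before building the URI follows; `build_bolt_uri` and the example values are hypothetical.

```python
REQUIRED_KEYS = ("db_url", "db_port", "db_user", "db_password")


def build_bolt_uri(cfg):
    """Validate the Neo4j config dict and return its bolt:// URI."""
    missing = [k for k in REQUIRED_KEYS if k not in cfg]
    if missing:
        raise ValueError(f"kg_config is missing keys: {', '.join(missing)}")
    return f"bolt://{cfg['db_url']}:{cfg['db_port']}"


# Placeholder values for illustration only.
example_cfg = {
    "db_url": "localhost",
    "db_port": 7687,
    "db_user": "neo4j",
    "db_password": "example",
}
print(build_bolt_uri(example_cfg))
```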


## 4. Define Evaluation Functions

In [7]:
# =============================================================================
# FIXED: Gold Answer Extraction - Must run Cypher query!
# =============================================================================

from neo4j import GraphDatabase

def get_neo4j_driver():
    """Get a Neo4j driver connection."""
    kg_config_path = BIOKGBENCH_CONFIG_DIR / 'kg_config.yml'
    if not kg_config_path.exists():
        kg_config_path = CONFIG_DIR / 'kg_config.yml'
    
    with open(kg_config_path) as f:
        neo4j_config = yaml.safe_load(f)
    
    uri = f"bolt://{neo4j_config['db_url']}:{neo4j_config['db_port']}"
    return GraphDatabase.driver(
        uri,
        auth=(neo4j_config['db_user'], neo4j_config['db_password']),
        encrypted=False
    )


def extract_gold_answers_from_cypher(question: Dict, driver) -> Set[str]:
    """
    FIXED: Extract gold answers by RUNNING the Cypher query.
    The answer is NOT stored in the question - it must be computed!
    """
    answers = set()
    
    cypher = question.get('cypher', '')
    answer_key = question.get('answer_key', '')
    
    if not cypher or not answer_key:
        return answers
    
    try:
        with driver.session() as session:
            result = session.run(cypher)
            records = list(result)
            
            for record in records:
                # Get the answer_key column from results
                if answer_key in record.keys():
                    value = record[answer_key]
                    if value is not None:
                        answers.add(normalize_answer(str(value)))
                
                # Also check common answer columns
                for key in ['answer', 'name', 'id', 'gene', 'protein', 'disease', 'snp']:
                    if key in record.keys() and key != answer_key:
                        value = record[key]
                        if value is not None:
                            answers.add(normalize_answer(str(value)))
    
    except Exception as e:
        print(f"Cypher error: {e}")
    
    return answers


def normalize_answer(answer: Optional[str]) -> str:
    """Normalize answer for comparison."""
    if answer is None:
        return ""
    return str(answer).strip().lower()


def compute_metrics(results: List[Dict]) -> Dict:
    """Compute evaluation metrics."""
    total = len(results)
    if total == 0:
        return {'total': 0}
    
    correct = sum(1 for r in results if r.get('correct', False))
    executable = sum(1 for r in results if r.get('success', False))
    answered = sum(1 for r in results if r.get('predicted'))
    
    # By taxonomy
    by_taxonomy = defaultdict(lambda: {'total': 0, 'correct': 0})
    for r in results:
        tax = r.get('taxonomy', 'unknown')
        by_taxonomy[tax]['total'] += 1
        if r.get('correct', False):
            by_taxonomy[tax]['correct'] += 1
    
    return {
        'total': total,
        'correct': correct,
        'accuracy': correct / total,
        'executability': executable / total,
        'coverage': answered / total,
        'by_taxonomy': {k: v['correct']/v['total'] if v['total'] > 0 else 0 
                       for k, v in by_taxonomy.items()}
    }

print("FIXED evaluation functions defined!")
print("Gold answers will be extracted by running Cypher queries against Neo4j.")

FIXED evaluation functions defined!
Gold answers will be extracted by running Cypher queries against Neo4j.
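
To make the metric definitions concrete, the sketch below recomputes them on four hand-written result rows. `summarize` is a compact stand-in that follows the same definitions as `compute_metrics` (accuracy from `correct`, executability from `success`, coverage from a non-empty `predicted` list); the rows themselves are invented.

```python
from collections import defaultdict


def summarize(results):
    """Compute accuracy, executability, coverage and per-taxonomy accuracy."""
    total = len(results)
    if total == 0:
        return {"total": 0}
    per_tax = defaultdict(lambda: [0, 0])  # taxonomy -> [correct, total]
    for r in results:
        per_tax[r["taxonomy"]][1] += 1
        per_tax[r["taxonomy"]][0] += int(r["correct"])
    return {
        "total": total,
        "accuracy": sum(r["correct"] for r in results) / total,
        "executability": sum(r["success"] for r in results) / total,
        "coverage": sum(1 for r in results if r["predicted"]) / total,
        "by_taxonomy": {t: c / n for t, (c, n) in per_tax.items()},
    }


# Invented rows, for illustration only.
toy_results = [
    {"taxonomy": "S", "correct": True, "success": True, "predicted": ["abc1"]},
    {"taxonomy": "S", "correct": False, "success": True, "predicted": ["xyz2"]},
    {"taxonomy": "M", "correct": True, "success": True, "predicted": ["phc3"]},
    {"taxonomy": "R", "correct": False, "success": False, "predicted": []},
]
metrics = summarize(toy_results)
print(metrics)
```

In this toy case a query can execute and return an answer that is still wrong, which is why executability and coverage can exceed accuracy.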


In [None]:
# =============================================================================
# FIXED: evaluate_model - Uses Cypher-based gold answer extraction
# =============================================================================

def evaluate_model(model_name: str, questions: List[Dict], max_questions: Optional[int] = None) -> Dict:
    """
    Evaluate a model on BioResonKGBench questions.
    
    FIXED: Now properly extracts gold answers by running Cypher queries!
    Uses cached gold answers if available (_gold_answers field).
    
    Args:
        model_name: Name of model (e.g., 'gpt-4o-mini')
        questions: List of question dictionaries
        max_questions: Limit number of questions (for testing)
    """
    from kg_qa_system_v2 import KnowledgeGraphQAv2, QAMode
    
    print(f"\nEvaluating: {model_name}")
    
    # Load config
    config_path = BIOKGBENCH_CONFIG_DIR / 'config.local.yaml'
    with open(config_path) as f:
        config = yaml.safe_load(f)
    config['llm']['provider'] = model_name
    
    # Save temp config
    temp_config = f'/tmp/bioresonkg_config_{model_name}.yaml'
    with open(temp_config, 'w') as f:
        yaml.dump(config, f)
    
    # Initialize KGQA system for LLM predictions
    qa = KnowledgeGraphQAv2(config_path=temp_config, mode=QAMode.LLM)
    qa.connect()
    
    # Get Neo4j driver for gold answer extraction (if not cached)
    gold_driver = None
    
    # Process questions
    results = []
    questions_to_eval = questions[:max_questions] if max_questions else questions
    start_time = time.time()
    
    for q in tqdm(questions_to_eval, desc=model_name):
        question_text = q.get('question', '')
        taxonomy = q.get('taxonomy', 'unknown')
        task_id = q.get('task_id', '')
        
        # Use cached gold answers if available, otherwise extract
        if '_gold_answers' in q:
            gold_answers = q['_gold_answers']
        else:
            if gold_driver is None:
                gold_driver = get_neo4j_driver()
            gold_answers = extract_gold_answers_from_cypher(q, gold_driver)
        
        try:
            # Get LLM prediction
            result = qa.answer(question_text, q)
            predicted = []
            
            if result.success and result.answers:
                for ans in result.answers:
                    if ans.name:
                        predicted.append(normalize_answer(ans.name))
                    if ans.id:
                        predicted.append(normalize_answer(ans.id))
                    # Also add extra fields
                    if ans.extra:
                        for v in ans.extra.values():
                            if v is not None:
                                predicted.append(normalize_answer(str(v)))
            
            # Check correctness - flexible matching
            correct = False
            if predicted and gold_answers:
                for p in predicted:
                    if not p:
                        continue
                    if p in gold_answers:
                        correct = True
                        break
                    # Partial match
                    for g in gold_answers:
                        if g and (g in p or p in g):
                            correct = True
                            break
                    if correct:
                        break
            
            results.append({
                'question': question_text,
                'taxonomy': taxonomy,
                'task_id': task_id,
                'gold_answers': list(gold_answers)[:5],  # Limit for readability
                'predicted': predicted[:5],
                'correct': correct,
                'success': result.success,
                'cypher_generated': result.cypher_query,
            })
            
        except Exception as e:
            results.append({
                'question': question_text,
                'taxonomy': taxonomy,
                'task_id': task_id,
                'gold_answers': list(gold_answers)[:5] if gold_answers else [],
                'predicted': [],
                'correct': False,
                'success': False,
                'error': str(e),
            })
    
    qa.close()
    if gold_driver:
        gold_driver.close()
    total_time = time.time() - start_time
    
    metrics = compute_metrics(results)
    metrics['model'] = model_name
    metrics['total_time_sec'] = total_time
    
    # Show some examples
    correct_examples = [r for r in results if r['correct']][:2]
    wrong_examples = [r for r in results if not r['correct'] and r['success']][:2]
    
    print(f"\nResults: Accuracy={metrics['accuracy']*100:.1f}%, Exec={metrics['executability']*100:.1f}%, Time={total_time:.1f}s")
    
    if correct_examples:
        print(f"\nCorrect example:")
        ex = correct_examples[0]
        print(f"  Q: {ex['question'][:50]}...")
        print(f"  Gold: {ex['gold_answers'][:3]}")
        print(f"  Pred: {ex['predicted'][:3]}")
    
    if wrong_examples:
        print(f"\nWrong example:")
        ex = wrong_examples[0]
        print(f"  Q: {ex['question'][:50]}...")
        print(f"  Gold: {ex['gold_answers'][:3]}")
        print(f"  Pred: {ex['predicted'][:3]}")
    
    return {'metrics': metrics, 'results': results}

print("FIXED evaluate_model() function defined!")
print("Now properly compares LLM answers with Cypher-derived gold answers.")

In [9]:
# =============================================================================
# VERIFY: Test Gold Answer Extraction
# =============================================================================
print("="*80)
print("VERIFYING GOLD ANSWER EXTRACTION (via Cypher)")
print("="*80)

# Test on a few sample questions
test_driver = get_neo4j_driver()

print("\nTesting gold answer extraction on sample questions:\n")

for i, q in enumerate(dev_questions[:5]):
    print(f"Q{i+1}: {q['question'][:60]}...")
    print(f"    Task: {q['task_id']}")
    print(f"    Answer Key: {q.get('answer_key', 'N/A')}")
    
    gold = extract_gold_answers_from_cypher(q, test_driver)
    print(f"    Gold Answers: {list(gold)[:3]}{'...' if len(gold) > 3 else ''}")
    print(f"    Count: {len(gold)} answers")
    print()

test_driver.close()

print("="*80)
print("If gold answers are extracted correctly, the evaluation should now work!")
print("="*80)

VERIFYING GOLD ANSWER EXTRACTION (via Cypher)

Testing gold answer extraction on sample questions:

Q1: What protein does gene PHC3 translate into?...
    Task: M-PROTEIN
    Answer Key: protein_name
    Gold Answers: ['phc3']
    Count: 1 answers

Q2: What is the evidence level for gene CCDC26 affecting Neoplas...
    Task: C-EVIDENCE-LEVEL
    Answer Key: evidence_level
    Gold Answers: []
    Count: 0 answers

Q3: Which SNP has the strongest causal effect on Malabsorption S...
    Task: C-TOP-CAUSAL
    Answer Key: snp
    Gold Answers: []
    Count: 0 answers

Q4: What is the evidence level for gene GEMIN7-AS1 affecting Aut...
    Task: C-EVIDENCE-LEVEL
    Answer Key: evidence_level
    Gold Answers: []
    Count: 0 answers

Q5: What genes are risk factors for Coronary Artery Disease?...
    Task: S-DISEASE-GENES
    Answer Key: gene
    Gold Answers: []
    Count: 0 answers

If gold answers are extracted correctly, the evaluation should now work!
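
Four of the five sample questions above return no gold answers. Tallying empty results per `task_id` can help trace such gaps to specific question templates. A sketch, assuming gold answers have already been extracted into `(task_id, gold_answers)` pairs; `empty_gold_by_task` is hypothetical and the example input mirrors the five results printed above.

```python
from collections import Counter


def empty_gold_by_task(gold_by_question):
    """Count questions with an empty gold set, grouped by task_id.

    `gold_by_question` is an iterable of (task_id, gold_answers) pairs.
    """
    return Counter(task for task, gold in gold_by_question if not gold)


# Mirrors the five verification results shown above.
observed = [
    ("M-PROTEIN", {"phc3"}),
    ("C-EVIDENCE-LEVEL", set()),
    ("C-TOP-CAUSAL", set()),
    ("C-EVIDENCE-LEVEL", set()),
    ("S-DISEASE-GENES", set()),
]
for task, n in empty_gold_by_task(observed).most_common():
    print(f"{task}: {n} question(s) with no gold answers")
```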


## 5. Run Multi-Model Evaluation

In [None]:
# =============================================================================
# FIXED: Multi-Model Evaluation with Pre-Filtering
# =============================================================================

print("="*60)
print("Multi-Model Evaluation on BioResonKGBench")
print("="*60)

# Step 1: Pre-filter questions to only those with valid gold answers
print("\nStep 1: Pre-filtering questions with valid gold answers...")

prefilter_driver = get_neo4j_driver()
valid_questions = []
invalid_count = 0

for q in tqdm(dev_questions, desc="Pre-filtering"):
    gold = extract_gold_answers_from_cypher(q, prefilter_driver)
    if gold:  # Only keep questions with extractable gold answers
        q['_gold_answers'] = gold  # Cache the gold answers
        valid_questions.append(q)
    else:
        invalid_count += 1

prefilter_driver.close()

print(f"\nPre-filter Results:")
print(f"  Valid questions (with gold answers): {len(valid_questions)}")
print(f"  Invalid questions (no gold answers): {invalid_count}")
print(f"  Coverage: {len(valid_questions)/len(dev_questions)*100:.1f}%")

# Show distribution of valid questions
valid_taxonomy = {}
for q in valid_questions:
    tax = q.get('taxonomy', 'unknown')
    valid_taxonomy[tax] = valid_taxonomy.get(tax, 0) + 1
print(f"  By Taxonomy: {valid_taxonomy}")

# Define all models to evaluate
MODELS_TO_TEST = [
    'gpt-4o-mini',      # Start with fastest/cheapest
    'gpt-4o',
    'claude-3-haiku',
    'llama-3.1-8b',
    'qwen-2.5-7b',
    'deepseek-v3',
    'gpt-4.1-mini',
    'gpt-4.1',
]

# Configuration
MAX_QUESTIONS = 20  # Set to None for full evaluation on valid questions
DATASET = valid_questions  # Use only valid questions!

print(f"\n{'='*60}")
print(f"EVALUATION CONFIGURATION")
print(f"{'='*60}")
print(f"Models to test: {len(MODELS_TO_TEST)}")
print(f"Valid questions available: {len(DATASET)}")
print(f"Questions per model: {MAX_QUESTIONS if MAX_QUESTIONS else 'All'}")
print("="*60)

if not NEO4J_AVAILABLE:
    print("Neo4j not available. Cannot run evaluation.")
elif len(DATASET) == 0:
    print("No valid questions found. Check KG schema alignment.")
else:
    all_model_results = {}
    
    for model_name in MODELS_TO_TEST:
        print(f"\n{'='*60}")
        print(f"Evaluating: {model_name}")
        print(f"{'='*60}")
        
        try:
            result = evaluate_model(model_name, DATASET, max_questions=MAX_QUESTIONS)
            all_model_results[model_name] = result
            print(f"Completed: {model_name}")
        except Exception as e:
            print(f"Failed: {model_name} - {e}")
            import traceback
            traceback.print_exc()
            all_model_results[model_name] = {'error': str(e)}
    
    # Summary
    print("\n" + "="*80)
    print("EVALUATION SUMMARY")
    print("="*80)
    print(f"{'Model':<16} {'Accuracy':>10} {'Exec':>10} {'Time (s)':>10}")
    print("-"*80)
    
    for model_name in MODELS_TO_TEST:
        if model_name in all_model_results and 'metrics' in all_model_results[model_name]:
            m = all_model_results[model_name]['metrics']
            print(f"{model_name:<16} {m['accuracy']*100:>9.1f}% {m['executability']*100:>9.1f}% {m.get('total_time_sec', 0):>10.1f}")
        else:
            err = all_model_results.get(model_name, {}).get('error', 'Unknown')
            print(f"{model_name:<16} {'FAILED':>10} - {err[:30]}")
    
    print("="*80)

## 6. Save Results and Visualize

In [None]:
# =============================================================================
# Save Results and Create Visualizations
# =============================================================================

if 'all_model_results' in dir() and all_model_results:
    # Create DataFrame
    results_data = []
    for model_name, result in all_model_results.items():
        if 'metrics' in result:
            m = result['metrics']
            row = {
                'Model': model_name,
                'Accuracy (%)': m['accuracy'] * 100,
                'Executability (%)': m['executability'] * 100,
                'Coverage (%)': m['coverage'] * 100,
                'Time (s)': m.get('total_time_sec', 0),
            }
            # Add per-taxonomy accuracy
            for tax, acc in m.get('by_taxonomy', {}).items():
                row[f'{tax} Acc (%)'] = acc * 100
            results_data.append(row)
    
    df_results = pd.DataFrame(results_data)
    df_results = df_results.sort_values('Accuracy (%)', ascending=False).reset_index(drop=True)
    
    print("Results:")
    print(df_results.to_string())
    
    # Save to files
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    
    csv_path = RESULTS_DIR / f'multimodel_eval_{timestamp}.csv'
    df_results.to_csv(csv_path, index=False)
    print(f"\nSaved to: {csv_path}")
    
    # Create visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot 1: Accuracy comparison
    df_plot = df_results.sort_values('Accuracy (%)', ascending=True)
    axes[0].barh(df_plot['Model'], df_plot['Accuracy (%)'], color='steelblue')
    axes[0].set_xlabel('Accuracy (%)')
    axes[0].set_title('Model Accuracy on BioResonKGBench')
    axes[0].set_xlim(0, 100)
    
    # Plot 2: Per-taxonomy heatmap
    tax_cols = [c for c in df_results.columns if 'Acc (%)' in c and c != 'Accuracy (%)']
    if tax_cols:
        heatmap_data = df_results.set_index('Model')[tax_cols]
        sns.heatmap(heatmap_data, annot=True, fmt='.1f', cmap='YlOrRd', ax=axes[1])
        axes[1].set_title('Accuracy by Taxonomy')
    
    plt.tight_layout()
    plot_path = RESULTS_DIR / f'multimodel_comparison_{timestamp}.png'
    plt.savefig(plot_path, dpi=150, bbox_inches='tight')
    plt.show()
    print(f"Plot saved to: {plot_path}")

else:
    print("No results to save.")
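
The cell above saves only the aggregate table. If the per-question results should be kept as well, the nested dictionary can be written to JSON. The sketch below is one way to do that, with a fallback that converts any remaining sets (such as cached gold answers) to sorted lists; `to_jsonable`, `save_results` and the file name are hypothetical, and the example writes to a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def to_jsonable(obj):
    """Fallback for json.dump: turn sets into sorted lists."""
    if isinstance(obj, (set, frozenset)):
        return sorted(obj)
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


def save_results(all_results, path):
    """Write the nested results dictionary to `path` as indented JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(all_results, f, indent=2, default=to_jsonable)
    return path


# Invented example data, shaped like the output of evaluate_model().
example = {
    "model-a": {
        "metrics": {"accuracy": 0.5},
        "results": [
            {"task_id": "M-PROTEIN", "gold_answers": {"phc3"}, "correct": True}
        ],
    }
}
out_path = save_results(example, Path(tempfile.mkdtemp()) / "per_question_results.json")
reloaded = json.loads(out_path.read_text(encoding="utf-8"))
print(reloaded["model-a"]["results"][0]["gold_answers"])
```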

## 7. Detailed Analysis by Taxonomy

In [None]:
# =============================================================================
# Detailed Analysis by Taxonomy
# =============================================================================

if 'all_model_results' in dir() and all_model_results:
    print("="*80)
    print("DETAILED ANALYSIS BY TAXONOMY")
    print("="*80)
    
    taxonomy_names = {'S': 'Structure', 'R': 'Risk', 'C': 'Causal', 'M': 'Mechanism'}
    for taxonomy in ['S', 'R', 'C', 'M']:
        print(f"\n--- {taxonomy} ({taxonomy_names[taxonomy]}) ---")
        
        for model_name, result in all_model_results.items():
            if 'metrics' not in result:
                continue
            
            tax_results = [r for r in result['results'] if r.get('taxonomy') == taxonomy]
            if tax_results:
                correct = sum(1 for r in tax_results if r.get('correct', False))
                total = len(tax_results)
                acc = correct / total * 100 if total > 0 else 0
                print(f"  {model_name:<16}: {acc:5.1f}% ({correct}/{total})")
    
    print("\n" + "="*80)
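
The per-taxonomy loop prints one line per model. The same tallies can also be collected into a model-by-taxonomy table for side-by-side comparison. A minimal sketch using only the standard library; `taxonomy_table`, the model names and the rows are invented for illustration.

```python
from collections import defaultdict


def taxonomy_table(all_results):
    """Return {model: {taxonomy: (correct, total)}} from per-question results."""
    table = {}
    for model, result in all_results.items():
        if "results" not in result:
            continue  # skip models whose evaluation failed
        counts = defaultdict(lambda: [0, 0])  # taxonomy -> [correct, total]
        for r in result["results"]:
            tax = r.get("taxonomy", "unknown")
            counts[tax][1] += 1
            counts[tax][0] += int(bool(r.get("correct")))
        table[model] = {t: tuple(c) for t, c in counts.items()}
    return table


# Invented data shaped like all_model_results.
toy = {
    "model-a": {
        "results": [
            {"taxonomy": "S", "correct": True},
            {"taxonomy": "S", "correct": False},
            {"taxonomy": "M", "correct": True},
        ]
    },
    "model-b": {"error": "timeout"},
}
table = taxonomy_table(toy)
for model, row in table.items():
    cells = ", ".join(f"{t}: {c}/{n}" for t, (c, n) in sorted(row.items()))
    print(f"{model:<10} {cells}")
```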

## Summary

This notebook evaluates eight LLMs on BioResonKGBench:

1. **claude-3-haiku** - Anthropic's fast model
2. **deepseek-v3** - DeepSeek's DeepSeek-V3
3. **gpt-4.1** - OpenAI GPT-4.1
4. **gpt-4.1-mini** - OpenAI GPT-4.1 mini
5. **gpt-4o** - OpenAI GPT-4o
6. **gpt-4o-mini** - OpenAI GPT-4o Mini
7. **llama-3.1-8b** - Meta's Llama 3.1 8B
8. **qwen-2.5-7b** - Alibaba's Qwen 2.5 7B

### Key Metrics
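- **Accuracy**: Percentage of questions for which at least one predicted answer matches a gold answer (exact or substring match after lowercasing)
- **Executability**: Percentage of queries that executed without error
- **Coverage**: Percentage of questions for which the model returned at least one predicted answer
- **Per-Taxonomy**: Accuracy broken down by S (Structure), R (Risk), C (Causal), M (Mechanism)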