This repository contains multiple benchmark prototypes for the Kaggle competition Measuring Progress Toward AGI - Cognitive Abilities.
Implemented benchmark packages:
- veltic/: in-context learning via a synthetic symbolic rewriting language
- confab/: metacognitive error detection over subtly corrupted solutions
- protom/: first-, second-, and third-order belief tracking / theory-of-mind evaluation
python -m venv .venv
.venv\Scripts\activate
pip install -e .
pip install -e .[dev]

The repository uses one root runner with a benchmark switch.
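The sketch below is a hypothetical reconstruction of that dispatch pattern, not the actual contents of run_benchmark.py; the per-benchmark entry points are stand-ins for code that would live in the veltic/, confab/, and protom/ packages.

```python
import argparse
from typing import Callable, Dict, List

# Placeholder entry points; in the repository these would be provided by
# the veltic/, confab/, and protom/ packages.
def run_veltic(argv: List[str]) -> None:
    print("veltic benchmark", argv)

def run_confab(argv: List[str]) -> None:
    print("confab benchmark", argv)

def run_protom(argv: List[str]) -> None:
    print("protom benchmark", argv)

BENCHMARKS: Dict[str, Callable[[List[str]], None]] = {
    "veltic": run_veltic,
    "confab": run_confab,
    "protom": run_protom,
}

def main() -> None:
    parser = argparse.ArgumentParser(description="Root benchmark runner")
    # --benchmark selects the package; veltic is the documented default.
    parser.add_argument("--benchmark", choices=sorted(BENCHMARKS), default="veltic")
    # Remaining flags (--model, --seed, ...) pass through to the benchmark CLI.
    args, rest = parser.parse_known_args()
    BENCHMARKS[args.benchmark](rest)

if __name__ == "__main__":
    main()
```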
Default benchmark mode is veltic, so the existing command remains valid:
python run_benchmark.py ^
  --model gpt-4o ^
  --api-base https://api.openai.com/v1 ^
  --api-key %OPENAI_API_KEY% ^
  --n-per-difficulty 50 ^
  --seed 42

Optional async mode:
python run_benchmark.py ^
  --benchmark veltic ^
  --model gpt-4o ^
  --api-base https://api.openai.com/v1 ^
  --api-key %OPENAI_API_KEY% ^
  --async

Outputs:
- outputs/results.jsonl
- outputs/submission.csv
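The per-record schema of results.jsonl is not specified in this README, so the sketch below stays generic: it parses each JSONL line and previews the submission header. The confab and protom runs write analogous files (listed in their sections below) that can be inspected the same way.

```python
# Quick inspection of the veltic outputs; prints raw records because the
# exact field names inside results.jsonl are not documented here.
import csv
import json
from pathlib import Path

out_dir = Path("outputs")

# results.jsonl: one JSON object per line.
with (out_dir / "results.jsonl").open(encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(json.loads(line))
        if i >= 2:  # preview the first few records only
            break

# submission.csv: standard CSV with a header row.
with (out_dir / "submission.csv").open(newline="", encoding="utf-8") as f:
    header = next(csv.reader(f))
    print("submission columns:", header)
```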
ConfabDetect run:
python run_benchmark.py ^
  --benchmark confab ^
  --detector-model gpt-4o ^
  --generator-model claude-3-5-sonnet-20241022 ^
  --api-base https://api.openai.com/v1 ^
  --api-key %OPENAI_API_KEY% ^
  --n-per-domain 20 ^
  --seed 42 ^
  --domains math code causal

Outputs:
- outputs/confab_results.jsonl
- outputs/confab_submission.csv
Oracle sanity baseline:
python run_benchmark.py --benchmark protom --mode oracle --seed 42 --max-items 50

Random lower-bound baseline:
python run_benchmark.py --benchmark protom --mode random --seed 42 --max-items 50

Outputs:
- outputs/protom_results.jsonl
- outputs/protom_submission.csv
- outputs/protom_summary.json
VelticBench evaluates in-context learning by teaching models an invented symbolic rewriting language and then testing generalization on novel instances. Its primary contamination defense is per-item symbol rotation over a shared latent rule system.
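To make the contamination defense concrete, here is a toy sketch of per-item symbol rotation. The rule system, alphabet, and rotation scheme below are invented for illustration and do not mirror veltic's internals: every item exercises the same latent rule, but each item binds the rule's abstract roles to fresh surface symbols, so memorizing surface strings does not help.

```python
# Toy illustration of per-item symbol rotation over a shared latent rule.
import random
from typing import Dict, List

SURFACE_ALPHABET = list("ΔΛΨΘΞΦΓΩ")  # invented surface symbols

# Shared latent rule over abstract roles A and B: "A B" rewrites to "B B A".
def apply_latent_rule(tokens: List[str], a: str, b: str) -> List[str]:
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.extend([b, b, a])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def make_item(seed: int) -> Dict[str, str]:
    rng = random.Random(seed)
    # Per-item rotation: bind the abstract roles to fresh surface symbols.
    a, b = rng.sample(SURFACE_ALPHABET, 2)
    prompt = [a, b, a]
    return {"input": " ".join(prompt),
            "target": " ".join(apply_latent_rule(prompt, a, b))}

for seed in range(3):
    print(make_item(seed))  # same latent rule, different surface symbols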
ConfabDetect evaluates whether a model can detect subtle errors in peer-like solutions, localize the mistake, and calibrate confidence appropriately under same-model and cross-model probe settings.
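As a sketch of how one such probe might be scored (the field names and metrics here are invented, not ConfabDetect's actual schema), detection, localization, and calibration can each be reduced to a simple per-item quantity, with the Brier score standing in as a basic calibration measure:

```python
# Illustrative scoring of a single detection probe; invented schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Probe:
    is_corrupted: bool          # ground truth: does the solution contain an error?
    error_step: Optional[int]   # index of the corrupted step, if any

@dataclass
class Response:
    says_corrupted: bool        # the detector's verdict
    flagged_step: Optional[int] # where the detector localizes the error
    confidence: float           # detector's probability that an error exists

def score(probe: Probe, resp: Response) -> dict:
    detected = resp.says_corrupted == probe.is_corrupted
    localized = bool(probe.is_corrupted and detected
                     and resp.flagged_step == probe.error_step)
    # Brier score: squared gap between confidence and truth (lower is better).
    brier = (resp.confidence - float(probe.is_corrupted)) ** 2
    return {"detected": detected, "localized": localized, "brier": brier}

print(score(Probe(True, 3), Response(True, 3, 0.8)))
# -> {'detected': True, 'localized': True, 'brier': ~0.04}
```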
ProToM evaluates belief tracking over synthetic scenes with explicit witness structure, including trap settings for reality-lure and omniscience errors.
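A minimal sketch of witness-based first-order belief tracking, assuming an invented scene format (the ProToM data model is not reproduced here): an agent's belief is computed by replaying only the moves that agent witnessed, which is exactly what a reality-lure trap exploits when the true location has since changed.

```python
# Toy belief tracking with explicit witness structure; invented scene format.
from typing import List, Tuple

# Each event moves an object and records which agents witnessed the move.
Event = Tuple[str, str, List[str]]  # (object, new_location, witnesses)

def believed_location(events: List[Event], agent: str, obj: str) -> str:
    """Replay only the moves the agent witnessed."""
    belief = "unknown"
    for moved_obj, location, witnesses in events:
        if moved_obj == obj and agent in witnesses:
            belief = location
    return belief

scene: List[Event] = [
    ("ball", "basket", ["sally", "anne"]),  # both see the ball placed
    ("ball", "box", ["anne"]),              # sally is absent for the move
]

# Reality-lure trap: the true location ("box") differs from sally's belief.
print(believed_location(scene, "sally", "ball"))  # -> "basket"
print(believed_location(scene, "anne", "ball"))   # -> "box"
```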
Run the full suite with:
python -m pytest -q