This repository contains multiple benchmark prototypes for the Kaggle competition Measuring Progress Toward AGI - Cognitive Abilities.
Implemented benchmark packages:
- veltic/: in-context learning via a synthetic symbolic rewriting language
- confab/: metacognitive error detection over subtly corrupted solutions
- protom/: first-, second-, and third-order belief tracking / theory-of-mind evaluation
python -m venv .venv
.venv\Scripts\activate
pip install -e .
pip install -e .[dev]

The repository uses one root runner with a benchmark switch.
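The sketch below is a hypothetical reconstruction of that dispatch pattern, not the actual contents of run_benchmark.py; the per-benchmark entry points are stand-ins for code that would live in the veltic/, confab/, and protom/ packages.

```python
import argparse
from typing import Callable, Dict, List

# Placeholder entry points; in the repository these would be provided by
# the veltic/, confab/, and protom/ packages.
def run_veltic(argv: List[str]) -> None:
    print("veltic benchmark", argv)

def run_confab(argv: List[str]) -> None:
    print("confab benchmark", argv)

def run_protom(argv: List[str]) -> None:
    print("protom benchmark", argv)

BENCHMARKS: Dict[str, Callable[[List[str]], None]] = {
    "veltic": run_veltic,
    "confab": run_confab,
    "protom": run_protom,
}

def main() -> None:
    parser = argparse.ArgumentParser(description="Root benchmark runner")
    # --benchmark selects the package; veltic is the documented default.
    parser.add_argument("--benchmark", choices=sorted(BENCHMARKS), default="veltic")
    # Remaining flags (--model, --seed, ...) pass through to the benchmark CLI.
    args, rest = parser.parse_known_args()
    BENCHMARKS[args.benchmark](rest)

if __name__ == "__main__":
    main()
```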
Default benchmark mode is veltic, so the existing command remains valid:
python run_benchmark.py ^
  --model gpt-4o ^
  --api-base https://api.openai.com/v1 ^
  --api-key %OPENAI_API_KEY% ^
  --n-per-difficulty 50 ^
  --seed 42

Optional async mode:
python run_benchmark.py ^
  --benchmark veltic ^
  --model gpt-4o ^
  --api-base https://api.openai.com/v1 ^
  --api-key %OPENAI_API_KEY% ^
  --async

Outputs:
- outputs/results.jsonl
- outputs/submission.csv
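The per-record schema of results.jsonl is not specified in this README, so the sketch below stays generic: it parses each JSONL line and previews the submission header. The confab and protom runs write analogous files (listed in their sections below) that can be inspected the same way.

```python
# Quick inspection of the veltic outputs; prints raw records because the
# exact field names inside results.jsonl are not documented here.
import csv
import json
from pathlib import Path

out_dir = Path("outputs")

# results.jsonl: one JSON object per line.
with (out_dir / "results.jsonl").open(encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(json.loads(line))
        if i >= 2:  # preview the first few records only
            break

# submission.csv: standard CSV with a header row.
with (out_dir / "submission.csv").open(newline="", encoding="utf-8") as f:
    header = next(csv.reader(f))
    print("submission columns:", header)
```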
ConfabDetect run:
python run_benchmark.py ^
  --benchmark confab ^
  --detector-model gpt-4o ^
  --generator-model claude-3-5-sonnet-20241022 ^
  --api-base https://api.openai.com/v1 ^
  --api-key %OPENAI_API_KEY% ^
  --n-per-domain 20 ^
  --seed 42 ^
  --domains math code causal

Outputs:
- outputs/confab_results.jsonl
- outputs/confab_submission.csv
Oracle sanity baseline:
python run_benchmark.py --benchmark protom --mode oracle --seed 42 --max-items 50

Random lower-bound baseline:
python run_benchmark.py --benchmark protom --mode random --seed 42 --max-items 50

Outputs:
- outputs/protom_results.jsonl
- outputs/protom_submission.csv
- outputs/protom_summary.json
VelticBench evaluates in-context learning by teaching models an invented symbolic rewriting language and then testing generalization on novel instances. Its primary contamination defense is per-item symbol rotation over a shared latent rule system.
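To make the contamination defense concrete, here is a toy sketch of per-item symbol rotation. The rule system, alphabet, and rotation scheme below are invented for illustration and do not mirror veltic's internals: every item exercises the same latent rule, but each item binds the rule's abstract roles to fresh surface symbols, so memorizing surface strings does not help.

```python
# Toy illustration of per-item symbol rotation over a shared latent rule.
import random
from typing import Dict, List

SURFACE_ALPHABET = list("ΔΛΨΘΞΦΓΩ")  # invented surface symbols

# Shared latent rule over abstract roles A and B: "A B" rewrites to "B B A".
def apply_latent_rule(tokens: List[str], a: str, b: str) -> List[str]:
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.extend([b, b, a])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def make_item(seed: int) -> Dict[str, str]:
    rng = random.Random(seed)
    # Per-item rotation: bind the abstract roles to fresh surface symbols.
    a, b = rng.sample(SURFACE_ALPHABET, 2)
    prompt = [a, b, a]
    return {"input": " ".join(prompt),
            "target": " ".join(apply_latent_rule(prompt, a, b))}

for seed in range(3):
    print(make_item(seed))  # same latent rule, different surface symbols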
ConfabDetect evaluates whether a model can detect subtle errors in peer-like solutions, localize the mistake, and calibrate confidence appropriately under same-model and cross-model probe settings.
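As a sketch of how one such probe might be scored (the field names and metrics here are invented, not ConfabDetect's actual schema), detection, localization, and calibration can each be reduced to a simple per-item quantity, with the Brier score standing in as a basic calibration measure:

```python
# Illustrative scoring of a single detection probe; invented schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Probe:
    is_corrupted: bool          # ground truth: does the solution contain an error?
    error_step: Optional[int]   # index of the corrupted step, if any

@dataclass
class Response:
    says_corrupted: bool        # the detector's verdict
    flagged_step: Optional[int] # where the detector localizes the error
    confidence: float           # detector's probability that an error exists

def score(probe: Probe, resp: Response) -> dict:
    detected = resp.says_corrupted == probe.is_corrupted
    localized = bool(probe.is_corrupted and detected
                     and resp.flagged_step == probe.error_step)
    # Brier score: squared gap between confidence and truth (lower is better).
    brier = (resp.confidence - float(probe.is_corrupted)) ** 2
    return {"detected": detected, "localized": localized, "brier": brier}

print(score(Probe(True, 3), Response(True, 3, 0.8)))
# -> {'detected': True, 'localized': True, 'brier': ~0.04}
```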
ProToM evaluates belief tracking over synthetic scenes with explicit witness structure, including trap settings for reality-lure and omniscience errors.
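A minimal sketch of witness-based first-order belief tracking, assuming an invented scene format (the ProToM data model is not reproduced here): an agent's belief is computed by replaying only the moves that agent witnessed, which is exactly what a reality-lure trap exploits when the true location has since changed.

```python
# Toy belief tracking with explicit witness structure; invented scene format.
from typing import List, Tuple

# Each event moves an object and records which agents witnessed the move.
Event = Tuple[str, str, List[str]]  # (object, new_location, witnesses)

def believed_location(events: List[Event], agent: str, obj: str) -> str:
    """Replay only the moves the agent witnessed."""
    belief = "unknown"
    for moved_obj, location, witnesses in events:
        if moved_obj == obj and agent in witnesses:
            belief = location
    return belief

scene: List[Event] = [
    ("ball", "basket", ["sally", "anne"]),  # both see the ball placed
    ("ball", "box", ["anne"]),              # sally is absent for the move
]

# Reality-lure trap: the true location ("box") differs from sally's belief.
print(believed_location(scene, "sally", "ball"))  # -> "basket"
print(believed_location(scene, "anne", "ball"))   # -> "box"
```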
Run the full suite with:
python -m pytest -q