MMEHDI0606/protom
Kaggle Cognitive Benchmark Suite

This repository contains multiple benchmark prototypes for the Kaggle competition Measuring Progress Toward AGI - Cognitive Abilities.

Implemented benchmark packages:

  • veltic/: in-context learning via a synthetic symbolic rewriting language
  • confab/: metacognitive error detection over subtly corrupted solutions
  • protom/: first-, second-, and third-order belief tracking / theory-of-mind evaluation

Installation

python -m venv .venv
.venv\Scripts\activate
pip install -e .
pip install -e .[dev]

Unified CLI

The repository exposes a single root runner, run_benchmark.py, which selects a benchmark via the --benchmark switch.

VelticBench

The default benchmark is veltic, so the --benchmark flag can be omitted:

python run_benchmark.py ^
  --model gpt-4o ^
  --api-base https://api.openai.com/v1 ^
  --api-key %OPENAI_API_KEY% ^
  --n-per-difficulty 50 ^
  --seed 42

Optional async mode:

python run_benchmark.py ^
  --benchmark veltic ^
  --model gpt-4o ^
  --api-base https://api.openai.com/v1 ^
  --api-key %OPENAI_API_KEY% ^
  --async

Outputs:

  • outputs/results.jsonl
  • outputs/submission.csv
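
Each results file is JSON Lines, so it can be inspected with a few lines of Python. The sketch below is illustrative only; the field name "correct" is an assumption, not the actual schema.

```python
# Hypothetical sketch: summarize an outputs/*.jsonl results file.
# The "correct" field name is an assumption about the record schema.
import json
from pathlib import Path

def summarize(path):
    """Return (n_items, accuracy) for a JSONL results file."""
    lines = Path(path).read_text().splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return 0, 0.0
    n_correct = sum(1 for r in records if r.get("correct"))
    return len(records), n_correct / len(records)
```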

ConfabDetect

python run_benchmark.py ^
  --benchmark confab ^
  --detector-model gpt-4o ^
  --generator-model claude-3-5-sonnet-20241022 ^
  --api-base https://api.openai.com/v1 ^
  --api-key %OPENAI_API_KEY% ^
  --n-per-domain 20 ^
  --seed 42 ^
  --domains math code causal

Outputs:

  • outputs/confab_results.jsonl
  • outputs/confab_submission.csv

ProToM

Oracle sanity baseline:

python run_benchmark.py --benchmark protom --mode oracle --seed 42 --max-items 50

Random lower-bound baseline:

python run_benchmark.py --benchmark protom --mode random --seed 42 --max-items 50

Outputs:

  • outputs/protom_results.jsonl
  • outputs/protom_submission.csv
  • outputs/protom_summary.json

Package Summary

VelticBench

VelticBench evaluates in-context learning by teaching models an invented symbolic rewriting language and then testing generalization on novel instances. Its primary contamination defense is per-item symbol rotation over a shared latent rule system.
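The rotation idea can be sketched as follows. This is not the repository's actual code: the surface-token pool and function name are illustrative. A shared latent rule is written with abstract placeholders, and each item gets a fresh, seeded mapping from placeholders to surface symbols, so memorized surface strings from pretraining data cannot help.

```python
# Illustrative sketch of per-item symbol rotation (names are hypothetical,
# not the repository's actual API).
import random

# Invented surface tokens; each item draws a fresh subset of these.
SURFACE_POOL = ["zik", "mo", "vah", "ren", "plu", "ki", "dor", "sab"]

def rotate_symbols(rule: str, placeholders: list[str], seed: int) -> str:
    """Rewrite abstract placeholders in `rule` with seeded random surface symbols."""
    rng = random.Random(seed)
    surfaces = rng.sample(SURFACE_POOL, len(placeholders))
    out = rule
    for placeholder, symbol in zip(placeholders, surfaces):
        out = out.replace(placeholder, symbol)
    return out
```

Because the mapping is keyed by a per-item seed, two items expressing the same latent rule share structure but no surface vocabulary.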

ConfabDetect

ConfabDetect evaluates whether a model can detect subtle errors in peer-like solutions, localize the mistake, and calibrate confidence appropriately under same-model and cross-model probe settings.
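The "subtly corrupted solutions" idea can be illustrated with a minimal sketch, assuming a solution is a list of intermediate values; the function name and perturbation scheme here are hypothetical, not the package's actual generator.

```python
# Hypothetical sketch: perturb one intermediate value in a worked solution,
# then a detector model must decide whether and where the chain goes wrong.
import random

def corrupt_step(steps: list[int], seed: int) -> tuple[list[int], int]:
    """Return a copy of `steps` with one value nudged by +/-1, and the corrupted index."""
    rng = random.Random(seed)
    idx = rng.randrange(len(steps))
    corrupted = list(steps)
    corrupted[idx] += rng.choice([-1, 1])
    return corrupted, idx
```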

ProToM

ProToM evaluates belief tracking over synthetic scenes with explicit witness structure, including trap settings for reality-lure and omniscience errors.
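A minimal sketch of first-order belief tracking with explicit witnesses (illustrative only; the actual scene format in protom/ may differ): each event moves an object, and an agent updates its belief about the object's location only if it witnessed the move. Querying an absent agent's belief probes the classic reality-lure error, where a model reports the true location instead of the believed one.

```python
# Illustrative first-order belief tracker (hypothetical scene format).
# Each event is (object, new_location, set_of_witnessing_agents).
def track_beliefs(agents, events):
    """Return {agent: {object: believed_location}} after replaying events."""
    beliefs = {a: {} for a in agents}
    for obj, loc, witnesses in events:
        for a in witnesses:
            beliefs[a][obj] = loc  # only witnesses update their belief
    return beliefs
```

In a Sally-Anne-style scene, Sally misses the second move, so her believed location diverges from reality.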

Testing

Run the full suite with:

python -m pytest -q
