Releases · OpenInterpretability/mechreward

Mechanistic interpretability as reward signal for RL training of LLMs.

Install

pip install mechreward

Included in this release

FeatureReward — SAE feature activations as trajectory-level reward
CompositeReward — HERO-style stratified normalization combining outcome + feature rewards
HackingDetector + DualVerifier — Wilhelm-2603.04069 style anti-Goodhart framework
AdversarialSuite — 10 canned red-team prompts for reward robustness testing
MechRewardGRPOTrainer — drop-in TRL GRPO wrapper with hidden state capture
Reference catalogs for Gemma-2-9B (reasoning, confidence, retrieval packs — placeholder features, validate before use)
7 reference experiments in experiments/, covering baseline, mech-only, hybrid, SARM reproduction, CRL reproduction, adversarial suite, and capability preservation
Outcome verifiers for GSM8K, MATH, HumanEval-style code, Python exec
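
As a rough illustration of the first two items above, the sketch below shows one way a trajectory-level feature reward and a stratified outcome-plus-feature combination could be computed. It is not mechreward's API: `feature_reward`, `stratified_combine`, the `weights` mapping, and the `alpha` band width are hypothetical names chosen for this example, and the normalization is only one plausible reading of HERO-style stratification (min-max scaling of feature scores within the correct and incorrect groups separately).

```python
from typing import Dict, List, Sequence


def feature_reward(
    activations: Sequence[Sequence[float]],
    weights: Dict[int, float],
) -> float:
    """Hypothetical helper: mean over tokens of weighted SAE feature activations.

    activations[t][i] is the activation of SAE feature i at token t.
    weights maps a feature index to its (signed) contribution to the reward.
    """
    if not activations:
        return 0.0
    per_token = [
        sum(w * token[i] for i, w in weights.items()) for token in activations
    ]
    return sum(per_token) / len(per_token)


def stratified_combine(
    outcomes: Sequence[bool],
    feature_scores: Sequence[float],
    alpha: float = 0.1,
) -> List[float]:
    """Hypothetical helper: combine outcome and feature rewards per stratum.

    Feature scores are min-max normalized separately among correct and
    incorrect samples, then mapped to [base - alpha, base + alpha], where
    base is 1.0 for correct samples and 0.0 for incorrect ones.
    """
    rewards = [0.0] * len(outcomes)
    for label in (True, False):
        idx = [i for i, ok in enumerate(outcomes) if ok == label]
        if not idx:
            continue
        values = [feature_scores[i] for i in idx]
        lo, hi = min(values), max(values)
        span = hi - lo
        base = 1.0 if label else 0.0
        for i in idx:
            # A stratum with identical scores gets the midpoint of its band.
            unit = (feature_scores[i] - lo) / span if span > 0 else 0.5
            rewards[i] = base + alpha * (2.0 * unit - 1.0)
    return rewards


if __name__ == "__main__":
    # Two tokens, two features; feature 1 is penalized (assumed weights).
    print(feature_reward([[1.0, 0.0], [3.0, 2.0]], {0: 1.0, 1: -0.5}))
    # Four sampled completions: two verified correct, two incorrect.
    print(stratified_combine([True, True, False, False], [0.2, 0.8, 0.9, 0.1]))
```

With `alpha` below 0.5, every verified-correct sample scores above every incorrect one in this sketch, so the feature reward only reorders samples within a stratum and cannot outweigh the outcome verifier.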

Alpha release: the API is subject to change. See RESEARCH.md for the scientific context and prior-art audit (SARM, SparseRM, CRL, YaPO, Wilhelm et al.).

35/35 unit tests passing. Ruff clean.

Next steps

Validate placeholder features in the Gemma-2-9B reasoning pack against real data
Run experiment 3 (hybrid) on Gemma-2-9B + GSM8K
Ship adversarial hacking bench in CI