Mechanistic interpretability as a reward signal for RL training of LLMs.
Install
pip install mechreward
Highlights
- FeatureReward: SAE feature activations as trajectory-level reward
- CompositeReward: HERO-style stratified normalization combining outcome + feature rewards
- HackingDetector + DualVerifier: Wilhelm-2603.04069 style anti-Goodhart framework
- AdversarialSuite: 10 canned red-team prompts for reward robustness testing
- MechRewardGRPOTrainer: drop-in TRL GRPO wrapper with hidden state capture
- Reference catalogs for Gemma-2-9B (reasoning, confidence, and retrieval packs; the features are placeholders, validate before use)
- 7 reference experiments in experiments/ covering baseline, mech-only, hybrid, SARM reproduction, CRL reproduction, adversarial suite, capability preservation
- Outcome verifiers for GSM8K, MATH, HumanEval-style code, Python exec
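To make the first two highlights concrete, here is a minimal pure-Python sketch of the underlying idea. It is not mechreward's API: the function names (`sae_feature_activations`, `feature_reward`, `stratified_combine`), the `band` parameter, and the toy SAE weights are all hypothetical. It also assumes one particular reading of "stratified normalization": feature rewards are min-max normalized within each outcome group (correct vs. incorrect) and added as a bounded bonus, so a dense feature reward can rank samples inside a group but never lift an incorrect sample above a correct one.

```python
from statistics import mean


def sae_feature_activations(hidden, w_enc, b_enc):
    """ReLU(W_enc @ hidden + b_enc) for one token (toy SAE encoder, hypothetical)."""
    return [
        max(0.0, sum(h * w for h, w in zip(hidden, row)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def feature_reward(hidden_states, w_enc, b_enc, feature_weights):
    """Trajectory-level reward: mean over tokens of a weighted sum of
    selected SAE feature activations. feature_weights maps index -> weight."""
    per_token = []
    for hidden in hidden_states:
        acts = sae_feature_activations(hidden, w_enc, b_enc)
        per_token.append(sum(wt * acts[i] for i, wt in feature_weights.items()))
    return mean(per_token)


def stratified_combine(outcomes, feature_rewards, band=0.5):
    """Combine 0/1 outcome rewards with feature rewards (assumed scheme).

    Within each outcome stratum the feature rewards are min-max normalized
    to [0, 1] and scaled by band (< 1), so every correct sample still
    outranks every incorrect one.
    """
    combined = [0.0] * len(outcomes)
    for label in set(outcomes):
        idx = [i for i, o in enumerate(outcomes) if o == label]
        vals = [feature_rewards[i] for i in idx]
        lo, span = min(vals), max(vals) - min(vals)
        for i in idx:
            # A stratum with identical rewards gets the midpoint of the band.
            norm = (feature_rewards[i] - lo) / span if span > 0 else 0.5
            combined[i] = float(label) + band * norm
    return combined


# Toy example: 2 tokens, hidden size 2, 3 SAE features.
hidden_states = [[1.0, 0.0], [0.0, 2.0]]
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
traj_reward = feature_reward(hidden_states, w_enc, b_enc, {0: 1.0, 1: 0.5})

# Four sampled completions: outcome verifier result and feature reward.
combined = stratified_combine([1, 0, 1, 0], [0.2, 0.9, 0.6, 0.1])
print(traj_reward, combined)
```

In this sketch the incorrect completion with the highest feature reward (0.9) still ends up below both correct completions, which is the property the stratification is meant to protect. The library's actual normalization and its defaults may differ.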
Status
Alpha. API subject to change. See RESEARCH.md for the scientific context and prior-art audit (SARM, SparseRM, CRL, YaPO, Wilhelm et al.).
Tests
35/35 unit tests passing. Ruff clean.
What's next
- Validate placeholder features in the Gemma-2-9B reasoning pack against real data
- Run experiment 3 (hybrid) on Gemma-2-9B + GSM8K
- Ship adversarial hacking bench in CI