Releases: OpenInterpretability/mechreward

v0.1.0 — Initial alpha release

15 Apr 00:57

Mechanistic interpretability as a reward signal for RL training of LLMs.

Install

pip install mechreward
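
After installing, it can be useful to confirm which version actually landed in the environment, since the API is still in alpha. The sketch below uses only the standard library and assumes the distribution name matches the `pip install` name above; the helper `installed_version` is a hypothetical name for illustration, not part of mechreward.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="mechreward"):
    """Return the installed version string for a distribution, or None if absent.

    Assumption: the distribution name is the same as the name used with pip.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found is not None else "mechreward is not installed")
```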

Highlights

  • FeatureReward — SAE feature activations as trajectory-level reward
  • CompositeReward — HERO-style stratified normalization combining outcome + feature rewards
  • HackingDetector + DualVerifier — Wilhelm-2603.04069 style anti-Goodhart framework
  • AdversarialSuite — 10 canned red-team prompts for reward robustness testing
  • MechRewardGRPOTrainer — drop-in TRL GRPO wrapper with hidden state capture
  • Reference catalogs for Gemma-2-9B (reasoning, confidence, retrieval packs — placeholder features, validate before use)
  • 7 reference experiments in experiments/ covering baseline, mech-only, hybrid, SARM reproduction, CRL reproduction, adversarial suite, capability preservation
  • Outcome verifiers for GSM8K, MATH, HumanEval-style code, Python exec
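
To make the first two highlights concrete, here is a minimal, library-independent sketch of the underlying ideas. It does not use the mechreward API. The function names, argument shapes, and the `spread` parameter are all hypothetical, and the normalization scheme is one plausible reading of "stratified normalization" (feature scores rescaled within each outcome group), not a confirmed description of how `CompositeReward` works.

```python
from statistics import mean


def feature_reward(activations, feature_weights):
    """Trajectory-level reward from per-token SAE feature activations.

    activations: list of per-token dicts {feature_id: activation}
    feature_weights: dict {feature_id: weight} (hypothetical format)
    Returns the mean over tokens of the weighted activation sum.
    """
    if not activations:
        return 0.0
    per_token = [
        sum(w * tok.get(fid, 0.0) for fid, w in feature_weights.items())
        for tok in activations
    ]
    return mean(per_token)


def stratified_combine(outcomes, feature_scores, spread=0.2):
    """Combine outcome labels with feature scores, one stratum per outcome.

    Within each outcome group the feature scores are min-max scaled to
    [-spread, +spread] and added to the outcome value. With spread < 0.5
    and 0/1 outcomes, a correct sample always outranks an incorrect one.
    """
    combined = [0.0] * len(outcomes)
    for label in set(outcomes):
        idx = [i for i, o in enumerate(outcomes) if o == label]
        lo = min(feature_scores[i] for i in idx)
        hi = max(feature_scores[i] for i in idx)
        for i in idx:
            # A group with identical scores gets no feature bonus.
            z = 0.0 if hi == lo else 2 * (feature_scores[i] - lo) / (hi - lo) - 1
            combined[i] = float(label) + spread * z
    return combined


if __name__ == "__main__":
    # Two correct and two incorrect completions with made-up feature scores.
    print(stratified_combine([1, 1, 0, 0], [0.9, 0.1, 0.8, 0.2]))
```

The design intent of the stratified form is that the feature signal only reorders samples within an outcome group, so a high feature score cannot by itself lift an incorrect answer above a correct one.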

Status

Alpha. API subject to change. See RESEARCH.md for the scientific context and prior-art audit (SARM, SparseRM, CRL, YaPO, Wilhelm et al.).

Tests

35/35 unit tests passing. Ruff clean.

What's next

  • Validate placeholder features in the Gemma-2-9B reasoning pack against real data
  • Run experiment 3 (hybrid) on Gemma-2-9B + GSM8K
  • Ship adversarial hacking bench in CI