RFC-0001: Mechanistic Fact Editing Commands (crown, edit, memit) #2

Merged
mikeumus merged 1 commit into main from feat/mechanistic-edit-rfc on Apr 17, 2026

Conversation

@mikeumus

Summary

Proposes three new LarQL subcommands that turn LarQL into the first mechanistic-interpretability-native fact-editing CLI:

  • larql crown — per-edit crown-layer discovery via module ablation (Chapter 17 Phase 125c)
  • larql edit — single-fact rank-1 edit with auto-scale calibration (Chapter 20 Phase 140 + Chapter 18 Phase 130)
  • larql memit — batch fact editing via joint least-squares, grouped by crown (Chapter 21 Phase 141c + Chapter 23 Phase 143b)

Plus a new patch file format (~55KB per Gemma 4 4B single edit) and a non-destructive larql apply-patch command.

Why

Nine chapters of experiments on Gemma 4 4B and 26B in April 2026 established the mechanism and proved editing works:

  • Entity zone at L20-29 (~67-97% depth) — Chapter 15
  • L27 MLP is the load-bearing country→capital writer on Gemma 4 4B — Chapter 17 (ablating it breaks "France → Paris" to "France → France")
  • Single rank-1 edit: 11/11 specificity at 0.9% relative weight perturbation — Chapter 20
  • MEMIT handles 2-3 concurrent edits; multi-layer + per-edit crown extends to 3/5 — Chapters 21-23
  • No single-neuron "Paris" (polysemantic superposition) forces the rank-1 approach — Chapter 19

Building this in Rust on top of the existing larql-inference forward-pass + capture infrastructure beats shipping it as a Python library because:

  • GGUF/ollama compatibility — edit quantized models, not just HF safetensors
  • Static + dynamic triangulation — LarQL's existing vindex analysis can predict editability cheaply
  • Self-calibrating — auto-scale and per-edit crown discovery don't require ML expertise to operate

What's in this PR

Just the design doc (docs/rfcs/0001-mechanistic-fact-editing.md). Phased implementation plan:

  • Phase A: larql crown — smallest new code, builds on existing capture hooks
  • Phase B: larql edit + patch format + apply-patch
  • Phase C: larql memit joint multi-fact
  • Phase D: larql-python binding extensions

Test plan

  • Design review + merge this RFC
  • Implementation PRs follow phased plan

References

🤖 Generated with Claude Code

Proposes extending LarQL from weight-analysis into analysis+editing via
three new subcommands that implement ROME/MEMIT-family algorithms on top
of the existing larql-inference forward pass and capture hooks.

Based on 9 chapters of experimentation on Gemma 4 (4B and 26B) documented
in Divinci-AI/server notebooks/CHAPTER_15 through CHAPTER_23:

- larql crown: per-edit crown-layer discovery via module ablation
- larql edit: single-fact rank-1 edit with auto-scale calibration
- larql memit: batch fact editing via joint least-squares, grouped by crown

Also defines a patch file format (~55KB per Gemma 4 4B single edit) and
a non-destructive larql apply-patch command. Phased 4-step rollout plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikeumus added a commit that referenced this pull request Apr 17, 2026
Implements Phase A of RFC-0001 (#2): per-layer MLP ablation scan to find
the layer whose last-position MLP output is load-bearing for a given
(prompt, expected-token) pair.

Changes:
- crates/larql-inference/src/ffn/ablating.rs — new LastPositionAblatingFfn
  that wraps any FfnBackend and zeroes its output at the last-token row for
  one target layer. Thin wrapper, no math changes.
- crates/larql-cli/src/commands/extraction/crown_cmd.rs — new `larql crown`
  subcommand. Tokenises the prompt, runs a baseline forward pass, then
  iterates layers in [start..=end] running predict_with_ffn against the
  ablating backend, reports per-layer Δ in expected-token probability and
  picks the layer whose ablation causes the top prediction to flip with the
  largest suppression magnitude.
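The selection rule in the scan above can be sketched in Python (a hypothetical distillation of the logic, not the Rust implementation in crown_cmd.rs; names and the toy numbers are illustrative):

```python
def pick_crown_layer(baseline_prob, baseline_top, per_layer):
    """Among layers whose ablation flips the top-1 prediction, pick the
    one with the largest suppression of the expected-token probability."""
    best_layer, best_drop = None, 0.0
    for layer, (ablated_prob, ablated_top) in per_layer.items():
        drop = baseline_prob - ablated_prob
        if ablated_top != baseline_top and drop > best_drop:
            best_layer, best_drop = layer, drop
    return best_layer

# Toy scan: only ablating layer 27 flips " Paris" away, with the biggest drop
scan = {26: (0.80, " Paris"), 27: (0.05, "France"), 28: (0.60, " Paris")}
print(pick_crown_layer(0.85, " Paris", scan))  # 27
```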

Methodology matches Phase 125c of Divinci-AI/server
notebooks/CHAPTER_17_CORONATION.md — on Gemma 4 4B, ablating L27 MLP on
"Capital of France? A:" makes the top prediction flip from " Paris" to
"France" (the country token). The command outputs JSON (optional --json)
so downstream commands (edit, memit) can consume the crown_layer field.

Compile-checked with `cargo check --package larql-cli`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mikeumus
Author

RFC follow-up roadmap

Phase A (larql crown) has been opened as #3. Tracking Phases B/C/D here since the repo has issues disabled:

Phase B — larql edit + patch format + apply-patch

Blocked by: #3

  • CLI: larql edit <model> --src-prompt "..." --old " Paris" --new " Tokyo" --auto-scale --output france_to_tokyo.patch
  • Algorithm: Chapter 20 Phase 140 (rank-1 outer product) + Chapter 18 Phase 130 (binary-search auto-scale)
  • Patch file format: <name>.meta.json + <name>.d.bin + <name>.k.bin (~55KB Gemma 4 4B)
  • Validated result: France→Tokyo flip with 11/11 other-capitals preserved at 0.9% relative weight perturbation
  • Acceptance: round-trip edit → apply-patch → predict yields the new token; 5-capital specificity spot-check
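The 0.9% figure is a relative weight perturbation, and for a rank-1 update ΔW = scale · d ⊗ k the Frobenius norm factors exactly as ‖ΔW‖_F = |scale| · ‖d‖₂ · ‖k‖₂, so the magnitude can be checked without materializing ΔW. A minimal sketch with toy numbers (the vectors and norms are made up for illustration):

```python
import math

def rel_perturbation(scale, d, k_norm, w_frobenius):
    # ||scale * (d outer k)||_F == |scale| * ||d||_2 * ||k||_2, so the
    # relative perturbation needs no materialized delta-W matrix
    delta_f = abs(scale) * math.hypot(*d) * math.hypot(*k_norm)
    return delta_f / w_frobenius

# toy numbers chosen so the ratio lands near the reported 0.9%
print(round(rel_perturbation(1.0, [0.03, 0.04], [0.6, 0.8], 5.56), 3))  # 0.009
```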

Phase C — larql memit + specificity validation

Blocked by: Phase B (reuses patch format)

  • CLI: larql memit <model> --edits edits.json --output patches/ --validate-specificity 50
  • Algorithm: existing run_memit at larql-inference/forward/memit.rs + Chapter 23 Phase 143b per-edit crown grouping
  • Specificity validation: probe N held-out facts, report preserve-rate
  • Known ceiling: ~3/5 target flips at 2× scale with correlated keys (Chapter 22)
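The specificity validation step reduces to a preserve-rate over held-out facts; a hypothetical sketch (the `predict_top1` callable stands in for a real forward pass):

```python
def preserve_rate(predict_top1, holdout):
    # holdout: (prompt, expected_token) pairs the batch edit must not touch
    kept = sum(1 for prompt, tok in holdout if predict_top1(prompt) == tok)
    return kept / len(holdout)

facts = [("Capital of Spain? A:", " Madrid"), ("Capital of Italy? A:", " Rome")]
print(preserve_rate(lambda p: dict(facts)[p], facts))  # 1.0
```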

Phase D — larql-python bindings for Colab-style scripting

Independent (depends on A/B/C only conceptually)

  • Expose crown, edit, memit as Python functions via PyO3
  • Python experiments from Divinci-AI/server Chapters 15-23 become one-liner Rust invocations
  • Unblocks publication-grade reproducibility — researchers can use LarQL from Jupyter without Rust toolchain

Each of B/C/D will be its own focused PR, keeping scope reviewable.

mikeumus added a commit that referenced this pull request Apr 17, 2026
… RFC-0001)

Implements Phase B of RFC-0001 (#2): single-fact rank-1 editor with
portable patch file format. Builds on Phase A's LastPositionAblatingFfn
(#3) and adds the symmetric LastPositionInjectingFfn for scale search.

### New library module: `larql-inference/src/edit.rs`
- `EditPatch` struct (serializable via serde)
- `compute_rank1(k, d, scale, layer, provenance) -> EditPatch`
- `write_patch(path, &patch)` / `read_patch(path) -> EditPatch` with a
  simple binary format: LQPATCH magic + JSON meta + little-endian f32
  vectors for d and k_norm. ~55 KB for Gemma 4 4B.
- `apply_patch(&mut ModelWeights, &EditPatch)`: installs the rank-1
  outer product into `down_proj.weight` in place, handling both
  `[hidden, intermediate]` and `[intermediate, hidden]` layouts.
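The container can be pictured with a short Python sketch. The commit only fixes three parts (LQPATCH magic, JSON meta, little-endian f32 vectors); the length-prefix framing below is an assumption for illustration, not the actual on-disk layout defined in edit.rs:

```python
import io, json, struct

def write_patch(buf, meta, d, k_norm):
    # LQPATCH magic, length-prefixed JSON meta, then LE f32 vectors.
    # NOTE: the u32 length prefixes are a guess, not the real format.
    blob = json.dumps(meta).encode("utf-8")
    buf.write(b"LQPATCH")
    buf.write(struct.pack("<I", len(blob)))
    buf.write(blob)
    for vec in (d, k_norm):
        buf.write(struct.pack("<I", len(vec)))
        buf.write(struct.pack(f"<{len(vec)}f", *vec))

def read_patch(buf):
    assert buf.read(7) == b"LQPATCH", "bad magic"
    (meta_len,) = struct.unpack("<I", buf.read(4))
    meta = json.loads(buf.read(meta_len))
    vecs = []
    for _ in range(2):
        (n,) = struct.unpack("<I", buf.read(4))
        vecs.append(list(struct.unpack(f"<{n}f", buf.read(4 * n))))
    return meta, vecs[0], vecs[1]

buf = io.BytesIO()
write_patch(buf, {"layer": 27, "scale": 1.5}, [0.5, -0.25], [1.0, 0.0])
buf.seek(0)
meta, d, k = read_patch(buf)
print(meta["layer"], d, k)  # 27 [0.5, -0.25] [1.0, 0.0]
```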

### New FFN wrapper: `larql-inference/src/ffn/injecting.rs`
- `LastPositionInjectingFfn` — adds a fixed delta vector to the inner
  backend's last-row output at one target layer. Symmetric to the
  ablating wrapper from PR #3. Used for auto-scale search.
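The injecting wrapper's behaviour, reduced to its essence (a sketch of the described semantics, not the Rust wrapper's API):

```python
def inject_last_position(ffn_out, delta, layer, target_layer):
    # ffn_out: [seq_len][hidden] MLP output at one layer; add the delta
    # vector to the last-token row only at the target layer (the mirror
    # image of zeroing that row for ablation)
    if layer == target_layer:
        ffn_out[-1] = [x + dx for x, dx in zip(ffn_out[-1], delta)]
    return ffn_out

out = inject_last_position([[0.0, 0.0], [1.0, 1.0]], [0.5, -0.5], 27, 27)
print(out)  # [[0.0, 0.0], [1.5, 0.5]]
```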

### New CLI commands
- `larql edit <model> --src "..." --tgt "..." --new-token " Tokyo" --output f2t.lqpatch`
  Runs Phase A crown discovery (or accepts `--layer`), captures k at the
  crown layer for both prompts, computes d = W_down @ (k_tgt - k_src),
  linearly searches [0.5, 1, 1.5, 2, 2.5, 3, 4] for the minimum scale
  that flips the source's top-1 to --new-token, emits the patch.
- `larql apply-patch <model> --patch f2t.lqpatch --prompt "..."`
  Non-destructively installs one or more patches into the loaded
  weights, optionally runs a test prediction. Supports `--reverse`
  to subtract a patch (verifies reversibility).
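The linear scale search above amounts to walking a fixed ladder and returning the first scale that flips the top-1 token; a hypothetical sketch where `top1_at` stands in for an injected forward pass:

```python
SCALES = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0]

def auto_scale(top1_at, new_token):
    # top1_at(s): top-1 token when the delta is injected at scale s;
    # return the smallest tested scale that flips the prediction
    for s in SCALES:
        if top1_at(s) == new_token:
            return s
    return None  # no scale in the ladder flips it

# toy stand-in: the prediction flips once the injection reaches 1.5x
print(auto_scale(lambda s: " Tokyo" if s >= 1.5 else " Paris", " Tokyo"))  # 1.5
```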

### Supporting change
- Added `InferenceModel::weights_mut()` accessor so apply-patch can
  mutate the in-memory weight map without reloading.

Methodology validated in Python across Divinci-AI/server
notebooks/CHAPTER_20_HONEY.md (Phase 140c: France→Tokyo with 11/11
specificity at 0.9% weight perturbation) and CHAPTER_18_THE_EDIT.md
(Phase 130 scale search). The Rust port preserves the same math.

Compile-checked with `cargo check --package larql-cli`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikeumus added a commit that referenced this pull request Apr 17, 2026
Wraps the existing covariance-MEMIT solver (larql_inference::forward::memit::
run_memit) with a CLI, an edits.json file format, and automatic crown-layer
discovery for each edit. Groups edits by crown layer, invokes the joint
least-squares solve, emits one dense `.lqpatch` per affected layer plus a
manifest.json. Phase C of RFC-0001 (#2), stacked on Phase B (#4).

### Extended patch file format (still backward compatible)
- Bumped patch version 1 → 2 with a `kind` field (defaults to "rank_one")
- New `kind = "dense"` variant carries a flat row-major ΔW matrix, needed
  because MEMIT's covariance-projected solve isn't natively a rank-1 outer
  product. Larger on disk (~72 MB per Gemma 4 4B layer) but semantically
  exact — no SVD approximation step.
- `write_patch`, `read_patch`, `apply_patch` all dispatch on kind. Phase B
  rank-1 patches continue to round-trip unchanged.
- New `compute_dense()` helper builds a Dense patch from an Array2<f32>.

### New CLI: `larql memit`
- Reads edits.json (list of {label, src, new_token, layer?} records).
- For each edit: tokenises src, resolves target_token_id, resolves crown
  layer (explicit or auto-scan).
- Calls `run_memit` with Vec<MemitFact>, receives one `MemitResult` per
  affected layer.
- Serialises each layer's ΔW as a Dense patch into the output directory,
  writes a manifest.json enumerating them.
- Prints the apply-patch command to install the batch.
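The grouping step can be sketched as follows (illustrative Python, not the CLI's code; `crown_of` stands in for the per-edit ablation scan when no explicit `layer` override is present):

```python
from collections import defaultdict

def group_by_crown(edits, crown_of):
    # edits: parsed edits.json records ({label, src, new_token, layer?});
    # each group gets one joint least-squares MEMIT solve
    groups = defaultdict(list)
    for e in edits:
        layer = e["layer"] if e.get("layer") is not None else crown_of(e["src"])
        groups[layer].append(e["label"])
    return dict(groups)

edits = [
    {"label": "france-to-tokyo", "src": "Capital of France? A:", "layer": 27},
    {"label": "germany-to-rome", "src": "Capital of Germany? A:"},
]
print(group_by_crown(edits, lambda src: 27))
```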

### Usage

    cat > edits.json <<EOF
    [
      {"label":"france-to-tokyo","src":"Capital of France? A:",
       "new_token":" Tokyo","layer":27},
      {"label":"germany-to-rome","src":"Capital of Germany? A:",
       "new_token":" Rome","layer":27}
    ]
    EOF

    larql memit /path/to/gemma4 --edits edits.json --output patches/
    larql apply-patch /path/to/gemma4 \
        -p patches/memit_L27.lqpatch \
        --prompt "Capital of France? A:"

### Known ceiling
Chapter 22 established that single-layer MEMIT with correlated keys (~60%
cosine) lands ~3/5 concurrent targets. For 5+ correlated edits, users can
now distribute across multiple crown layers via `layer` overrides in
edits.json — MEMIT runs once per layer group.

Compile-checked with `cargo check --package larql-cli`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mikeumus mikeumus merged commit 074d512 into main Apr 17, 2026
@mikeumus mikeumus deleted the feat/mechanistic-edit-rfc branch April 17, 2026 23:59