feat: MoE gate topology + expert clustering + scaffold cross-reference#58

Merged: AdaWorldAPI merged 1 commit into master from claude/gate-topology on Mar 30, 2026
Conversation

@AdaWorldAPI (Owner)

What

Extends causal_diff.rs with MoE gate topology analysis — the other half of the reasoning reverse-engineering pipeline.

New functions

extract_gate_topology(bgz7_path)

Scans a bgz7 file for ffn_gate_inp tensors. Each row is one expert's activation fingerprint, encoded as Base17. For Maverick that means 128 rows per MoE block.
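The extraction shape can be sketched as follows. `GateTensor`, `extract_fingerprints`, and the concrete row representation are illustrative assumptions; only the idea that each router-tensor row is one expert's fingerprint comes from the PR.

```rust
// Hypothetical sketch: one fingerprint row per expert in a router gate tensor.
struct GateTensor {
    block: usize,
    rows: Vec<Vec<u8>>, // one Base17-digit row per expert (assumed layout)
}

// Flatten a gate tensor into (block, expert index, fingerprint) triples.
fn extract_fingerprints(gate: &GateTensor) -> Vec<(usize, usize, Vec<u8>)> {
    gate.rows
        .iter()
        .enumerate()
        .map(|(expert, row)| (gate.block, expert, row.clone()))
        .collect()
}

fn main() {
    // Maverick-style block: 128 experts, so 128 rows.
    let gate = GateTensor { block: 0, rows: vec![vec![1, 2, 3]; 128] };
    let fps = extract_fingerprints(&gate);
    assert_eq!(fps.len(), 128); // one fingerprint per expert
}
```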

cluster_experts(fingerprints, threshold)

Computes pairwise L1 distance between all experts within each block, then uses connected-component grouping to find structurally interchangeable expert groups. Given the 123,000× compression on expert weights, we expect >90% of pairs to be redundant.
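A minimal sketch of the clustering step, using union-find for the connected-component grouping. The fingerprint representation (`Vec<i64>`) and the toy threshold are assumptions; the pairwise-L1-then-components structure is what the PR describes.

```rust
// L1 distance between two expert fingerprints.
fn l1(a: &[i64], b: &[i64]) -> i64 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

// Group experts into connected components: any pair within `threshold`
// L1 distance is linked, and components are the interchangeable groups.
fn cluster_experts(fps: &[Vec<i64>], threshold: i64) -> Vec<usize> {
    fn find(parent: &mut Vec<usize>, x: usize) -> usize {
        if parent[x] != x {
            let p = parent[x];
            let r = find(parent, p);
            parent[x] = r; // path compression
        }
        parent[x]
    }
    let n = fps.len();
    let mut parent: Vec<usize> = (0..n).collect();
    for i in 0..n {
        for j in (i + 1)..n {
            if l1(&fps[i], &fps[j]) <= threshold {
                let ri = find(&mut parent, i);
                let rj = find(&mut parent, j);
                if ri != rj {
                    parent[ri] = rj;
                }
            }
        }
    }
    (0..n).map(|i| find(&mut parent, i)).collect()
}

fn main() {
    let fps = vec![vec![0, 0], vec![1, 0], vec![100, 100]];
    let labels = cluster_experts(&fps, 5);
    assert_eq!(labels[0], labels[1]); // near-identical experts merge
    assert_ne!(labels[0], labels[2]); // distant expert stays separate
}
```

The O(n²) pair loop is affordable at 128 experts per block (8,128 pairs), which is why restricting the tensor match to true router gates matters.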

cross_reference_gate_scaffold(clusters, scaffold_blocks)

The key insight connector:

  • Attention scaffold (from Qwen3.5 diff): blocks where Q+O shifted → reasoning circuit
  • Gate redundancy (from Maverick topology): blocks where experts are interchangeable → routing dominates
  • Cross-reference: scaffold blocks WITH high redundancy → reasoning changes work THROUGH the router, not the experts
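The connector above amounts to a set intersection, sketched below. The block indices, the per-block redundancy fraction, and the 0.9 cutoff are illustrative assumptions, not values from the PR.

```rust
use std::collections::HashSet;

// Blocks that are both in the attention scaffold (Q+O shifted) and
// highly redundant in their expert clusters: routing-dominated blocks.
fn routing_dominated(scaffold: &HashSet<usize>, redundancy: &[(usize, f64)]) -> Vec<usize> {
    redundancy
        .iter()
        .filter(|(blk, r)| scaffold.contains(blk) && *r > 0.9)
        .map(|(blk, _)| *blk)
        .collect()
}

fn main() {
    // Scaffold blocks from a (hypothetical) attention diff.
    let scaffold: HashSet<usize> = [3, 7, 12].into_iter().collect();
    // (block index, fraction of expert pairs within the L1 threshold)
    let redundancy = vec![(3, 0.95), (7, 0.40), (12, 0.92), (20, 0.99)];
    let hits = routing_dominated(&scaffold, &redundancy);
    assert_eq!(hits, vec![3, 12]);
}
```

Block 20 is redundant but not in the scaffold, and block 7 is in the scaffold but not redundant; only the intersection counts as evidence that reasoning changes work through the router.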

Tests

  • test_maverick_gate_topology — loads all 18 Maverick bgz7 shards, extracts gates, clusters
  • test_cross_reference_gate_scaffold — full pipeline: Qwen3.5 diff → scaffold blocks → Maverick gates → routing dominance check

The loop that closes

Maverick 123,000× → experts are commodity (gate topology)
Qwen3.5 Q+O shift → reasoning is routing (attention diff)
Cross-reference   → reasoning = routing at both scales
NARS truth        → first observed evidence for the stack

Commit message

extract_gate_topology() — pulls ffn_gate_inp Base17 rows from bgz7,
one row per expert. Each row IS the expert's structural identity.

cluster_experts() — pairwise L1 between experts within each block,
connected-component grouping of structurally interchangeable experts.
At threshold=500, Maverick's 123,000× compression predicts >90% redundancy.

cross_reference_gate_scaffold() — links attention scaffold blocks
(Q+O shifted from Qwen3.5 diff) with gate redundancy per block.
Routing-dominated blocks = reasoning changes work through the router,
not through the expert weights.

Tests:
- test_maverick_gate_topology: load all 18 Maverick bgz7 shards
- test_cross_reference_gate_scaffold: full pipeline connecting
  Qwen3.5 attention diff with Maverick gate structure
@AdaWorldAPI AdaWorldAPI merged commit e900ad0 into master Mar 30, 2026
5 of 14 checks passed
@AdaWorldAPI AdaWorldAPI deleted the claude/gate-topology branch March 30, 2026 08:24

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54ed7eedf3


Comment thread: src/hpc/causal_diff.rs, lines +374 to +375

    if !t.name.contains("gate_inp") && !t.name.contains("gate.weight") {
        continue;
The reason will be displayed to describe this comment to others. Learn more.

P1: Restrict gate tensor matching to router-only names

The narrowing logic here is too broad: matching "gate.weight" also pulls in dense FFN gate tensors (e.g., blk.{i}.ffn_gate.weight is a SiLU MLP gate, not a router gate), so extract_gate_topology will treat thousands of dense FFN rows as experts and feed them into cluster_experts. That corrupts the redundancy conclusions and can make the O(n²) adjacency allocation and computation explode for normal dense blocks, especially when running the Maverick shard pipeline.
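The review's point can be illustrated with the two predicates side by side; the tensor names are illustrative, and the suggested fix (keying only on "gate_inp") is the reviewer's proposal, not merged code.

```rust
// Current filter (inverted here): keeps anything matching either substring,
// which also admits the dense SiLU gate "ffn_gate.weight".
fn too_broad(name: &str) -> bool {
    name.contains("gate_inp") || name.contains("gate.weight")
}

// Reviewer's suggestion: match only router gate tensors.
fn router_only(name: &str) -> bool {
    name.contains("gate_inp")
}

fn main() {
    let dense = "blk.5.ffn_gate.weight"; // SiLU MLP gate, not a router
    let router = "blk.5.ffn_gate_inp.weight"; // MoE router gate
    assert!(too_broad(dense)); // current filter wrongly includes it
    assert!(!router_only(dense)); // restricted filter excludes it
    assert!(router_only(router)); // router tensors still pass
}
```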

