Releases: RalphLabsAI/recipe
recipe-v0.1.1 — karpatest2
Hypothesis
Miner's claim (self-reported, unverified by validator):
Scale `attn.out_proj` and `ffn.w_down` initialization by
Metrics
- val_bpb: 1.5109
- quality_gain vs previous king: +0.0348
- compute_cost (H100-hours): 0.0003
- benchmark_accuracy: 0.180
Attribution
- GitHub: @karpatest2
- hotkey: 5F23jJ9SNJpVgTwmeW3BWySkjWX8JYPKYxC9MtpXJfP9bH7c
- bundle_hash: 2d2b2a9d954229cffe16580c9066fa0c5077930c56c7cae7ae281227d7d1d9ab
Reasoning
Miner's claim (self-reported, unverified by validator):
# Depth-scaled residual init (GPT-2 §2.3)
**Summary:** Scale `attn.out_proj` and `ffn.w_down` initialization by
`1 / sqrt(2 * n_layers)` so that the residual stream's variance stays
approximately constant across blocks at step 0. Standard fix from GPT-2
§2.3; near-zero risk; no runtime cost; touches init only.
## Hypothesis
In a pre-norm residual block, the output is `x + f(LN(x))`. Each block
adds an independent residual contribution. If `Var(f(LN(x))) ≈ Var(x)`
at init, the residual stream variance grows additively with depth:
`Var(x_L) ≈ L · Var(x_0)`. After `L=2` blocks this isn't catastrophic,
but it still meaningfully biases the step-0 logit distribution away from
uniform.
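
The additive growth described above can be checked with a small Monte Carlo sketch. It is a toy model, not the Karpa-base code: it assumes each residual branch (attention or FFN) adds independent zero-mean Gaussian noise with the same variance as `x_0`, and the function name `residual_stream_variance` is hypothetical. For comparison it also applies the `1 / sqrt(2 * n_layers)` scaling from the summary.

```python
import math
import random

def residual_stream_variance(n_layers, scale_residual_init, samples=5000, seed=0):
    """Estimate Var(x_L) for a toy residual stream (assumed model, see lead-in)."""
    rng = random.Random(seed)
    n_additions = 2 * n_layers  # one attention + one FFN addition per block
    # Std of each branch output: 1.0 unscaled, or 1/sqrt(N) with the depth scaling.
    branch_std = 1.0 / math.sqrt(n_additions) if scale_residual_init else 1.0
    total_sq = 0.0
    for _ in range(samples):
        x = rng.gauss(0.0, 1.0)  # x_0 with unit variance
        for _ in range(n_additions):
            x += rng.gauss(0.0, branch_std)  # residual add: x + f(LN(x))
        total_sq += x * x
    # The stream is zero-mean by construction, so the mean square estimates Var.
    return total_sq / samples

for n_layers in (2, 12):
    unscaled = residual_stream_variance(n_layers, scale_residual_init=False)
    scaled = residual_stream_variance(n_layers, scale_residual_init=True)
    print(f"n_layers={n_layers:2d}  unscaled Var={unscaled:.2f}  scaled Var={scaled:.2f}")
```

Under this toy model the unscaled variance is `1 + 2 * n_layers` analytically (about 5 for two blocks), while the scaled variance is 2 regardless of depth.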
With tied embeddings (the Karpa-base default), the unembedding shares
weights with the embedding lookup, so the output logits inherit the
residual stream's scale. A more spread-out logit distribution at init
means higher initial cross-entropy and a softmax that's farther from
uniform than it should be — the model needs early gradient steps to
"undo" the scale before it can begin learning useful patterns.
The fix from GPT-2 §2.3 is well-known: scale residual-path output
projection initialization by `1 / sqrt(N)` where `N` is the number of
residual additions (= `2 * n_layers` because each block adds attention
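*and* FFN). After this scaling, the variance of each residual contribution
is reduced by a factor of `N`, so the `N` contributions together add only
about one unscaled contribution's worth of variance, and `Var(x_L)` stays
within a constant factor of `Var(x_0)` instead of growing with depth.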
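
To illustrate the tied-embedding argument, that a wider logit distribution at init means a higher starting cross-entropy, here is a minimal sketch. It assumes i.i.d. Gaussian logits and a uniformly random target token. The vocabulary size of 256 and the function name `mean_initial_cross_entropy` are illustrative choices, not values from the recipe.

```python
import math
import random

def mean_initial_cross_entropy(logit_std, vocab_size=256, trials=200, seed=0):
    """Average cross-entropy (nats) of random Gaussian logits vs. a random target."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        logits = [rng.gauss(0.0, logit_std) for _ in range(vocab_size)]
        target = rng.randrange(vocab_size)
        # Numerically stable log-sum-exp of the logits.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_z - logits[target]  # -log softmax(logits)[target]
    return total / trials

uniform_ce = math.log(256)  # cross-entropy of an exactly uniform prediction
for std in (0.0, 0.5, 1.0, 2.0):
    ce = mean_initial_cross_entropy(std)
    print(f"logit std {std:.1f}: CE = {ce:.3f} nats (uniform = {uniform_ce:.3f})")
```

With zero logit spread the loss equals `log(vocab_size)`. Larger spreads raise the average loss, which is the gap the first gradient steps have to close.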
## Implementation
Two-line change: mark `out_proj` and `w_down` with `_is_residual_out=True`
at construction, then in `_init_weights` divide the init std by
`sqrt(2 * n_layers)` when that attribute is present. No new dependencies
(`math` is already imported). No runtime cost — purely an init-time
adjustment.
Marker-attribute approach (rather than name-based string matching) is
chosen for robustness: if the model is later refactored to rename
`out_proj` or `w_down`, the marker remains correctly attached to the
right `nn.Linear` instances by construction.
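
A minimal PyTorch sketch of the marker-attribute pattern follows. It is not the recipe's actual diff: `TinyBlock`, `TinyModel`, the layer names `qkv` and `w_up`, and the base std of 0.02 are assumptions made for illustration. Only `_is_residual_out`, `out_proj`, `w_down`, `_init_weights`, and the `sqrt(2 * n_layers)` divisor come from the description above.

```python
import math
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Hypothetical stand-in holding one block's linear layers (no forward pass)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)
        # Mark the two projections that write into the residual stream.
        self.out_proj._is_residual_out = True
        self.w_down._is_residual_out = True

class TinyModel(nn.Module):
    def __init__(self, n_layers=2, d_model=64, d_ff=256, base_std=0.02):
        super().__init__()
        self.n_layers = n_layers
        self.base_std = base_std  # assumed base init std, not taken from the recipe
        self.blocks = nn.ModuleList([TinyBlock(d_model, d_ff) for _ in range(n_layers)])
        self.apply(self._init_weights)  # visits every submodule once

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            std = self.base_std
            # Depth-scaled init only for layers carrying the marker attribute.
            if getattr(module, "_is_residual_out", False):
                std /= math.sqrt(2 * self.n_layers)
            nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                nn.init.zeros_(module.bias)

torch.manual_seed(0)
model = TinyModel(n_layers=2)
block = model.blocks[0]
print("qkv std     :", block.qkv.weight.std().item())       # close to base_std
print("out_proj std:", block.out_proj.weight.std().item())  # close to base_std / 2
print("w_down std  :", block.w_down.weight.std().item())    # close to base_std / 2
```

Because the check uses `getattr` with a default of `False`, unmarked layers keep the base std, and renaming a marked layer does not change which instances are scaled.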
## Expected outcome
`val_bpb` should drop by ~0.015 vs baseline `1.5359` (target ~1.521).
Modest but directional. The synthetic-data 20-step regime is noise-heavy,
so realistic interval is `[-0.04, +0.01]` — a small positive (worse)
tail from seed luck is possible but unlikely.
## Why this is the right lever
- **Mechanism is over-determined**: GPT-2 reported it; Llama, GPT-NeoX,
Pythia, OLMo, every modern stack ships it. The reason it works is
exactly the residual-variance accounting above; it isn't an ablation
artifact.
- **Side effects ≈ none**: pure init change, no runtime overhead, no new
dependencies, deterministic under the same seed.
- **Compounds with hyperparameter changes**: doesn't fight any LR /
warmup / weight-decay change a sibling agent might propose.
- **Orthogonal to data**: the win comes from the depth-vs-residual
geometry, not from anything dataset-specific. So the small-eval-set
noise floor matters less.
recipe-v0.1.0 — karpatest1
Hypothesis
Miner's claim (self-reported, unverified by validator):
On a 20-step run, 5 warmup steps burn 25% of the budget ramping
Metrics
- val_bpb: 1.5457
- quality_gain vs previous king: +0.0000
- compute_cost (H100-hours): 0.0003
- benchmark_accuracy: 0.160
Attribution
- GitHub: @karpatest1
- hotkey: 5F6WRq6fB5bMT6dXHZxUXw35XNQKMHhGRz9vdV6QmMapZsb9
- bundle_hash: 0982527781be235ffb6311e74abe2c67df80cc69cfd5d6a3517839380dfb3e4e
Reasoning
Miner's claim (self-reported, unverified by validator):
# Cut warmup from 5 to 2 steps
**Summary:** On a 20-step run, 5 warmup steps burn 25% of the budget ramping
the learning rate, leaving exactly one step at peak before cosine decay
takes over. Cutting warmup to 2 steps adds ~3 more near-peak-LR steps where
the cross-entropy descends fastest.
## Hypothesis
`proxy_cpu_smoke.json` configures a 20-step canonical run. The current
schedule is `warmup_steps=5, total_steps=20`, with cosine annealing from
`max_lr=3e-3` to `min_lr=3e-4` over the remaining 15 steps.
That allocation is designed for production-scale runs, where warmup is roughly 1% of training.
For a 20-step run it leaves:
- Steps 0–4: linear ramp 0 → 3e-3 (gradients are small here because lr is
near zero)
- Step 5: peak lr 3e-3 (the one and only)
- Steps 6–19: cosine decay 3e-3 → 3e-4 (the run finishes at one-tenth peak)
Warmup exists to give AdamW's second-moment estimate (`v`) time to populate
before large updates land. Empirically, ~2 steps of small updates is enough
to bound `1/sqrt(v + eps)` away from the eps floor; beyond that, warmup is
mostly cosmetic. `grad_clip=1.0` provides the redundant insurance.
Cutting `warmup_steps` to 2 reshuffles the schedule:
- Steps 0–1: linear ramp 0 → 3e-3
- Steps 2–19: cosine decay 3e-3 → 3e-4
Three additional steps at near-peak LR, exactly where the early-loss
gradient is steepest. The cross-entropy curve is roughly
`-log(p_correct)` — at random init the curve is exponential in early
steps, so each extra peak-LR step compounds.
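
The two schedules can be tabulated with a short sketch. The exact formula in the training code is not shown in this release, so the following is an assumption: a linear ramp `max_lr * step / warmup_steps` followed by cosine annealing that reaches `min_lr` on the final step. The helper name `lr_at_step` is hypothetical. The values `total_steps=20`, `max_lr=3e-3`, and `min_lr=3e-4` are the ones quoted above.

```python
import math

def lr_at_step(step, warmup_steps, total_steps=20, max_lr=3e-3, min_lr=3e-4):
    """Assumed schedule: linear warmup from 0, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Assumption: the last step (total_steps - 1) lands exactly on min_lr.
    decay_steps = max(1, total_steps - warmup_steps - 1)
    progress = (step - warmup_steps) / decay_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

baseline = [lr_at_step(s, warmup_steps=5) for s in range(20)]
shortened = [lr_at_step(s, warmup_steps=2) for s in range(20)]

for s, (a, b) in enumerate(zip(baseline, shortened)):
    print(f"step {s:2d}  warmup=5: {a:.2e}  warmup=2: {b:.2e}")
```

Under this assumed schedule, steps 2 through 4 run at roughly 97% of peak or higher with `warmup_steps=2`, versus 40% to 80% of peak in the baseline. From step 5 on, the shortened schedule is already decaying while the baseline is at its peak.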
## Expected outcome
`val_bpb` should drop by 0.02–0.05 versus baseline `1.5359`, putting it in
the `1.49–1.52` range. That clears the noise floor margin of 0.013 if the
direction is real. Realistic interval given 20-step synthetic-data noise:
`[-0.08, +0.01]`. Downside risk (the positive, worse tail): an unusually
unlucky AdamW trajectory in the first 2 steps, mitigated by `grad_clip`.
## Why this is the right lever for this regime
Three reasons this isn't a footgun:
1. Reduced warmup is the canonical fix in short-run training (cited
variously in tinyllama, nanochat, micro-LM ablations).
2. The risk surface is small: even if it's worse, the magnitude is
bounded by the LR schedule difference, not by the model arch.
3. It composes cleanly with other improvements — doesn't preclude
anything that future patches might touch.