Skip to content

Releases: RalphLabsAI/recipe

recipe-v0.1.1 — karpatest2

29 May 14:52
b5171f5

Choose a tag to compare

Hypothesis

Miner's claim (self-reported, unverified by validator):

Scale attn.out_proj and ffn.w_down initialization by

Metrics

  • val_bpb: 1.5109
  • quality_gain vs previous king: +0.0348
  • compute_cost (H100-hours): 0.0003
  • benchmark_accuracy: 0.180

Attribution

  • GitHub: @karpatest2
  • hotkey: 5F23jJ9SNJpVgTwmeW3BWySkjWX8JYPKYxC9MtpXJfP9bH7c
  • bundle_hash: 2d2b2a9d954229cffe16580c9066fa0c5077930c56c7cae7ae281227d7d1d9ab

Reasoning

Miner's claim (self-reported, unverified by validator):

# Depth-scaled residual init (GPT-2 §2.3)

**Summary:** Scale `attn.out_proj` and `ffn.w_down` initialization by
`1 / sqrt(2 * n_layers)` so that the residual stream's variance stays
approximately constant across blocks at step 0. Standard fix from GPT-2
§2.3; near-zero risk; no runtime cost; touches init only.

## Hypothesis

In a pre-norm residual block, the output is `x + f(LN(x))`. Each block
adds an independent residual contribution. If `Var(f(LN(x))) ≈ Var(x)`
at init, the residual stream variance grows additively with depth:
`Var(x_L) ≈ L · Var(x_0)`. After `L=2` blocks this isn't catastrophic,
but it still meaningfully biases the step-0 logit distribution away from
uniform.

With tied embeddings (the Karpa-base default), the unembedding shares
weights with the embedding lookup, so the output logits inherit the
residual stream's scale. A more spread-out logit distribution at init
means higher initial cross-entropy and a softmax that's farther from
uniform than it should be — the model needs early gradient steps to
"undo" the scale before it can begin learning useful patterns.

The fix from GPT-2 §2.3 is well-known: scale residual-path output
projection initialization by `1 / sqrt(N)` where `N` is the number of
residual additions (= `2 * n_layers` because each block adds attention
*and* FFN). After this scaling, `Var(f(LN(x)))` per block is reduced
by `1/N`, so the accumulated variance after `L` blocks satisfies
`Var(x_L) ≈ Var(x_0)`.

## Implementation

Two-line change: mark `out_proj` and `w_down` with `_is_residual_out=True`
at construction, then in `_init_weights` divide the init std by
`sqrt(2 * n_layers)` when that attribute is present. No new dependencies
(`math` is already imported). No runtime cost — purely an init-time
adjustment.

Marker-attribute approach (rather than name-based string matching) is
chosen for robustness: if the model is later refactored to rename
`out_proj` or `w_down`, the marker remains correctly attached to the
right `nn.Linear` instances by construction.

## Expected outcome

`val_bpb` should drop by ~0.015 vs baseline `1.5359` (target ~1.521).
Modest but directional. The synthetic-data 20-step regime is noise-heavy,
so realistic interval is `[-0.04, +0.01]` — a small positive (worse)
tail from seed luck is possible but unlikely.

## Why this is the right lever

- **Mechanism is over-determined**: GPT-2 reported it; Llama, GPT-NeoX,
  Pythia, OLMo, every modern stack ships it. The reason it works is
  exactly the residual-variance accounting above; it isn't an ablation
  artifact.
- **Side effects ≈ none**: pure init change, no runtime overhead, no new
  dependencies, deterministic under the same seed.
- **Compounds with hyperparameter changes**: doesn't fight any LR /
  warmup / weight-decay change a sibling agent might propose.
- **Orthogonal to data**: the win comes from the depth-vs-residual
  geometry, not from anything dataset-specific. So the small-eval-set
  noise floor matters less.

Links

recipe-v0.1.0 — karpatest1

29 May 14:52
5e09f29

Choose a tag to compare

Hypothesis

Miner's claim (self-reported, unverified by validator):

On a 20-step run, 5 warmup steps burn 25% of the budget ramping

Metrics

  • val_bpb: 1.5457
  • quality_gain vs previous king: +0.0000
  • compute_cost (H100-hours): 0.0003
  • benchmark_accuracy: 0.160

Attribution

  • GitHub: @karpatest1
  • hotkey: 5F6WRq6fB5bMT6dXHZxUXw35XNQKMHhGRz9vdV6QmMapZsb9
  • bundle_hash: 0982527781be235ffb6311e74abe2c67df80cc69cfd5d6a3517839380dfb3e4e

Reasoning

Miner's claim (self-reported, unverified by validator):

# Cut warmup from 5 to 2 steps

**Summary:** On a 20-step run, 5 warmup steps burn 25% of the budget ramping
the learning rate, leaving exactly one step at peak before cosine decay
takes over. Cutting warmup to 2 steps adds ~3 more near-peak-LR steps where
the cross-entropy descends fastest.

## Hypothesis

`proxy_cpu_smoke.json` configures a 20-step canonical run. The current
schedule is `warmup_steps=5, total_steps=20`, with cosine annealing from
`max_lr=3e-3` to `min_lr=3e-4` over the remaining 15 steps.

That allocation is built for production-scale (warmup ~= 1% of training).
For a 20-step run it leaves:

- Steps 0–4: linear ramp 0 → 3e-3 (gradients are small here because lr is
  near zero)
- Step 5: peak lr 3e-3 (the one and only)
- Steps 6–19: cosine decay 3e-3 → 3e-4 (the run finishes at one-tenth peak)

Warmup exists to give AdamW's second-moment estimate (`v`) time to populate
before large updates land. Empirically, ~2 steps of small updates is enough
to bound `1/sqrt(v + eps)` away from the eps floor; beyond that, warmup is
mostly cosmetic. `grad_clip=1.0` provides the redundant insurance.

Cutting `warmup_steps` to 2 reshuffles the schedule:

- Steps 0–1: linear ramp 0 → 3e-3
- Steps 2–19: cosine decay 3e-3 → 3e-4

Three additional steps at near-peak LR, exactly where the early-loss
gradient is steepest. The cross-entropy curve is roughly
`-log(p_correct)` — at random init the curve is exponential in early
steps, so each extra peak-LR step compounds.

## Expected outcome

`val_bpb` should drop by 0.02–0.05 versus baseline `1.5359`, putting it in
the `1.49–1.52` range. That clears the noise floor margin of 0.013 if the
direction is real. Realistic interval given 20-step synthetic-data noise:
`[-0.08, +0.01]`. Negative-tail risk: an unusually unlucky AdamW
trajectory in the first 2 steps; mitigated by grad_clip.

## Why this is the right lever for this regime

Three reasons this isn't a footgun:
1. Reduced warmup is the canonical fix in short-run training (cited
   variously in tinyllama, nanochat, micro-LM ablations).
2. The risk surface is small: even if it's worse, the magnitude is
   bounded by the LR schedule difference, not by the model arch.
3. It composes cleanly with other improvements — doesn't preclude
   anything that future patches might touch.

Links