Rebase Scout shards to BF16-direct + F64x8 SIMD pipeline #54

Merged
AdaWorldAPI merged 1 commit into master from claude/bf16-direct-rebase
Mar 30, 2026
@AdaWorldAPI
Owner

Summary

Rebases run_llama4_shard() from the legacy f32 path (stream_index_gguf) to the BF16-direct pipeline (stream_index_gguf_bf16).

Both Scout and Maverick now use the same optimized pipeline:

  • BF16-direct: no Vec<f32> intermediate (saves 283 MB per tensor)
  • F64x8 SIMD: 8 rows projected in parallel per zmm register
  • Strided octave (stride=16): 97% fewer BF16→f64 conversions
  • Halftone drop: 9 of 17 golden positions, odd bins interpolated
  • Exact shard sizes: SCOUT_SHARD_SIZES const replaces 44 GB estimate
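
As a rough illustration of the BF16-direct and strided-conversion bullets, here is a minimal sketch, not the code from this PR. It relies on the fact that a BF16 value is the upper 16 bits of an IEEE-754 f32, so widening is a shift followed by a lossless f32 to f64 cast. The names `bf16_to_f64` and `strided_bf16_to_f64` are hypothetical, and the real pipeline's stride handling and SIMD layout may differ.

```rust
/// Widen one raw BF16 value to f64.
/// BF16 is the top 16 bits of an f32, so shifting left by 16 rebuilds
/// the f32 bit pattern exactly; the f32 -> f64 cast is also exact.
fn bf16_to_f64(bits: u16) -> f64 {
    f32::from_bits((bits as u32) << 16) as f64
}

/// Hypothetical helper: convert only every `stride`-th element of a
/// BF16 row, reading the u16 data directly with no Vec<f32> in between.
/// `stride` must be non-zero (`step_by` panics on 0).
fn strided_bf16_to_f64(row: &[u16], stride: usize) -> Vec<f64> {
    row.iter()
        .step_by(stride)
        .map(|&bits| bf16_to_f64(bits))
        .collect()
}
```

With `stride = 16`, a row of 32 BF16 values yields two converted samples (indices 0 and 16); the remaining 30 are never widened.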

Why golden step matters at this ratio

At 4,735× compression, every Base17 bin feeds the palette clustering.
Golden step (11 mod 17) visits all 17 residues → full-rank palette centroids.
Fibonacci had 4 dead bins → 23.5% of the CAM distance matrix was comparing noise floor to noise floor.
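
The residue-coverage claim can be checked with a short sketch. The PR does not spell out the earlier "Fibonacci" scheme, so the code assumes it means positions taken from the Fibonacci numbers reduced mod 17; all function names are hypothetical. Under that assumption, the bins never visited are 6, 7, 10 and 11, which matches the count of 4 (4/17 ≈ 23.5%) quoted above, while a step of 11 reaches all 17 residues because 11 and 17 are coprime.

```rust
use std::collections::BTreeSet;

const BINS: usize = 17;

/// Residues visited by walking k * step (mod BINS) for k = 0..BINS.
fn stepped_residues(step: usize) -> BTreeSet<usize> {
    (0..BINS).map(|k| (k * step) % BINS).collect()
}

/// Residues hit by the Fibonacci numbers reduced mod BINS.
/// Assumed interpretation of the earlier "Fibonacci" positions.
fn fibonacci_residues() -> BTreeSet<usize> {
    let (mut a, mut b) = (0usize, 1usize);
    let mut seen = BTreeSet::new();
    // The pair (a, b) has at most BINS * BINS states, so this many
    // iterations is guaranteed to cover a full cycle of the sequence.
    for _ in 0..BINS * BINS {
        seen.insert(a);
        let next = (a + b) % BINS;
        a = b;
        b = next;
    }
    seen
}

/// Bins in 0..BINS that a visiting scheme never reaches.
fn dead_bins(visited: &BTreeSet<usize>) -> Vec<usize> {
    (0..BINS).filter(|r| !visited.contains(r)).collect()
}
```

Calling `dead_bins(&stepped_residues(11))` returns an empty list, whereas `dead_bins(&fibonacci_residues())` returns the four unreachable bins.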

What stays on the f32 path

  • test_stream_index_llama4_scout_from_hf (IQ1_S format, not BF16)
  • test_stream_index_openchat_q8 (Q8_0 format)
  • test_stream_index_synthetic_gguf (synthetic F32)

These correctly use the old path because their dtypes need actual dequantization.

run_llama4_shard() now uses stream_index_gguf_bf16() instead of
stream_index_gguf(). Changes:

- BF16-direct: no f32 intermediate allocation (saves 283 MB/tensor)
- F64x8 SIMD: 8 rows projected in parallel per zmm register
- Strided octave (stride=16): 97% fewer BF16→f64 conversions
- Halftone drop: 9 of 17 golden positions, odd bins interpolated
- Exact shard sizes: SCOUT_SHARD_SIZES const replaces 44 GB estimate
- Reusable u16 buffer inside indexer (no per-tensor alloc)

Both Scout shard tests and Maverick test now use the same
BF16-direct pipeline. The old f32 path remains for non-BF16
formats (IQ1_S, Q8_0, etc).
@AdaWorldAPI AdaWorldAPI merged commit a8631a2 into master Mar 30, 2026
5 of 14 checks passed
@AdaWorldAPI AdaWorldAPI deleted the claude/bf16-direct-rebase branch March 30, 2026 07:33