
test: add speculative decoding E2E test to CI pipeline #27

Merged
solderzzc merged 12 commits into main from feature/speculative-decoding-ci on Apr 12, 2026


Conversation

@solderzzc solderzzc (Member) commented Apr 12, 2026

Closes #24

Summary

This PR finalizes the integration of the SSD expert streaming and speculative decoding pipeline into the SwiftLM production codebase, and adds continuous integration testing for dual-model verification.

Key Changes

  1. Dependency Sync: Updated Package.swift to point the mlx-swift-lm dependency at the main branch of SharpAI/mlx-swift-lm, tracking the integrated PR "Feature/gemma4 benchmark" (#9), which contains the 10x SSD streaming rewrite.
  2. README Documentation:
    • Added a new SSD Expert Streaming: 10x MoE Speedup section detailing the methodology (cross-projection batching, concurrent pread, asyncEval pipeline, persistent Metal buffers, runtime top-k).
    • Included full benchmark results showing 0.58 → 6.52 tok/s improvement for 122B+ MoE models on M1 Ultra 64GB.
    • Updated CLI options table to include --draft-model and --num-draft-tokens.
    • Credited Eric Lake in the Acknowledgments for the speculative decoding infrastructure and SSD rewrite.
  3. Speculative Decoding CI:
    • Added tests/test-speculative.sh: A new dual-model E2E integration test verifying speculative decoding path activation, sequential stability, memory limits, and streaming behavior.
    • Updated .github/workflows/ci.yml: Created a new speculative-decoding job leveraging the macos-15-xlarge (14 GB) runner, using Qwen3.5-0.8B-MLX-4bit as a lightweight draft model to accelerate a Qwen3.5-9B-4bit main model.
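
For reference, an invocation using the two new CLI flags might look like the following. This is purely illustrative: the binary name (`swiftlm`), subcommand, `--prompt` flag, model paths, and the draft-token count are assumptions, not taken from this PR; only `--draft-model` and `--num-draft-tokens` are documented above.

```shell
# Illustrative only: binary name, subcommand, model paths, and values
# are assumptions; the real CLI surface may differ.
swiftlm generate \
  --model Qwen3.5-9B-4bit \
  --draft-model Qwen3.5-0.8B-MLX-4bit \
  --num-draft-tokens 4 \
  --prompt "Explain speculative decoding in one sentence."
```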
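
A minimal sketch of the style of check an E2E script like tests/test-speculative.sh can perform. The log text "Using speculative decoding" is taken from this PR's description; the harness around it (the simulated log file and PASS/FAIL messages) is an illustrative assumption, since a real run would capture live CLI output instead.

```shell
#!/bin/sh
# Sketch: verify that the speculative-decoding marker appears in a run log.
# Here we simulate the log; the real test would capture actual CLI output.
LOG="$(mktemp)"
printf 'Loading draft model...\nUsing speculative decoding\nDone.\n' > "$LOG"

if grep -q "Using speculative decoding" "$LOG"; then
  echo "PASS: speculative decoding path is active"
else
  echo "FAIL: draft model not engaged" >&2
  rm -f "$LOG"
  exit 1
fi
rm -f "$LOG"
```

Grepping for a known activation log line is a cheap but effective E2E signal: it confirms the draft model was actually wired into the generation path rather than silently falling back to single-model decoding.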

Notes for Reviewers

  • The new CI job is constrained to macos-15-xlarge because loading both the 0.8B and 9B models requires ~6 GB of resident RAM, which leaves too little headroom on the standard 7 GB macos-15 runner and risks OOM terminations.
  • tests/test-speculative.sh intentionally verifies that the "Using speculative decoding" log line is present, to ensure the draft model is actively engaged in the generation pipeline.
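
The memory constraint above could also be guarded in the script itself. The sketch below is an assumption, not part of this PR: it skips the dual-model test when the machine has less than 7 GB of physical RAM, using the macOS `hw.memsize` sysctl (the messages and threshold are illustrative).

```shell
#!/bin/sh
# Sketch (not from this PR): skip the dual-model test on small runners.
# hw.memsize is the macOS sysctl for total physical memory in bytes.
REQUIRED_BYTES=$((7 * 1024 * 1024 * 1024))   # ~7 GB floor for headroom
TOTAL_BYTES=$(sysctl -n hw.memsize 2>/dev/null || echo 0)

if [ "$TOTAL_BYTES" -lt "$REQUIRED_BYTES" ]; then
  echo "SKIP: dual-model test needs >= 7 GB physical RAM"
  exit 0
fi
echo "OK: sufficient memory for draft + main model"
```

Guarding in-script keeps the test safe to run locally as well as in CI, instead of relying solely on the runner label to provide enough memory.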

@solderzzc solderzzc merged commit 3990199 into main Apr 12, 2026
3 checks passed
@solderzzc solderzzc deleted the feature/speculative-decoding-ci branch April 12, 2026 06:53


Development

Successfully merging this pull request may close these issues.

--stream-experts only achieves ~5% of available SSD bandwidth on M1 Ultra (300 MB/s observed vs 5-7 GB/s capable)
