test: add speculative decoding E2E test to CI pipeline #27
Merged
added 12 commits on April 11, 2026 22:06
Closes #24
Summary
This PR finalizes the integration of the SSD expert streaming and speculative decoding pipeline into the SwiftLM production codebase, and adds continuous integration testing for dual-model verification.
Key Changes
- Updated `Package.swift` to point the `mlx-swift-lm` dependency to `SharpAI/mlx-swift-lm` branch `main`, tracking the integrated PR Feature/gemma4 benchmark #9, which contains the SSD streaming 10x rewrite.
- Added an "SSD Expert Streaming: 10x MoE Speedup" documentation section detailing the methodology (cross-projection batching, concurrent `pread`, `asyncEval` pipelining, persistent Metal buffers, runtime top-k).
- New CLI flags: `--draft-model` and `--num-draft-tokens`.
- `tests/test-speculative.sh`: a new dual-model E2E integration test verifying speculative decoding path activation, sequential stability, memory limits, and streaming behavior.
- `.github/workflows/ci.yml`: created a new `speculative-decoding` job on the `macos-15-xlarge` (14 GB) runner, using `Qwen3.5-0.8B-MLX-4bit` as a lightweight draft model to accelerate a `Qwen3.5-9B-4bit` main model.

Notes for Reviewers
- We use `macos-15-xlarge` because loading both the 0.8B and 9B models requires ~6 GB of resident RAM, which is too tight for the standard 7 GB `macos-15` runner without risking OOM terminations.
- `test-speculative.sh` intentionally verifies the presence of the `Using speculative decoding` log line to ensure the draft model is actively engaged in the generation pipeline.
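For reviewers unfamiliar with the test's structure, the core log-based check can be sketched roughly as below. The captured log text, message wording, and variable names here are placeholder assumptions for illustration; the actual assertions live in `tests/test-speculative.sh`.

```shell
#!/usr/bin/env bash
# Minimal sketch of the log-based verification the E2E test performs.
# NOTE: the LOG contents below are a hand-written placeholder standing in
# for real generation output captured from the dual-model run.
set -euo pipefail

LOG="Loading main model Qwen3.5-9B-4bit
Using speculative decoding (draft model: Qwen3.5-0.8B-MLX-4bit)
Generation complete"

# Assert the speculative path actually activated, rather than silently
# falling back to plain single-model decoding.
if grep -q "Using speculative decoding" <<< "$LOG"; then
  STATUS="speculative-active"
else
  STATUS="speculative-missing"
fi
echo "$STATUS"
```

Checking for the activation log line (instead of only checking exit status) is what distinguishes "the run succeeded" from "the draft model was actually engaged".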