[plugin][CI/CD] Add CI/CD for OOT and enable OOT docker release #320
zejunchen-zejun wants to merge 15 commits into main
Pull request overview
This PR adds CI/CD coverage for ATOM’s vLLM OOT (out-of-tree) plugin workflow and extends the Docker release pipeline to optionally build/push an OOT vLLM image, while consolidating OOT image build logic into the main multi-stage docker/Dockerfile.
Changes:
- Add new OOT CI workflows (per-PR/scheduled “OOT Test” + manual “Full Validation”) and a shared OOT test script for launching vLLM + running GSM8K accuracy checks.
- Convert docker/Dockerfile into a multi-stage build with a dedicated oot_image stage and update the nightly docker release workflow to optionally publish OOT images.
- Add plugin-mode unit tests for framework selection, env-flag behavior, and vLLM→ATOM config translation.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `tests/plugin/test_plugin_mode_status.py` | Adds unit tests for plugin framework backbone selection and mode helpers. |
| `tests/plugin/test_plugin_env_flags.py` | Adds unit test verifying ATOM_DISABLE_VLLM_PLUGIN disables platform/registration behavior. |
| `tests/plugin/test_plugin_config_translation.py` | Adds unit tests for translating vLLM config to ATOM config in plugin mode. |
| `.github/scripts/atom_oot_test.sh` | New helper script to launch vLLM and run GSM8K accuracy + threshold gating. |
| `.github/workflows/atom-vllm-oot-test.yaml` | New per-PR/push/scheduled OOT workflow building OOT image and running plugin UT + GSM8K accuracy. |
| `.github/workflows/atom-vllm-oot-full-test.yaml` | New workflow-dispatch full validation across multiple models/runners. |
| `.github/workflows/atom-test.yaml` | Fixes ATOM_BASE_NIGTHLY_IMAGE typo to ATOM_BASE_NIGHTLY_IMAGE. |
| `.github/workflows/docker-release.yaml` | Adds optional inputs/steps to build and push OOT vLLM Docker images; builds atom_image stage explicitly. |
| `docker/Dockerfile` | Introduces oot_image stage (vLLM build/install + deps) and renames original flow to atom_image stage. |
| `docker/plugin/Dockerfile_OOT_vLLM` | Removes old dedicated OOT vLLM Dockerfile in favor of the consolidated multi-stage Dockerfile. |
| `docker/plugin/build_OOT_vLLM.sh` | Removes old local OOT image build script (replaced by consolidated build paths). |
| `oot_ut_changes.patch` | New committed patch file capturing diffs (appears redundant with PR content). |
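The threshold gating attributed to `.github/scripts/atom_oot_test.sh` can be pictured with a small sketch. This is not the script from the PR: the function name `accuracy_meets_threshold` and the sample accuracy value are hypothetical, and the sketch only assumes that the measured GSM8K accuracy and the threshold are both available as decimal strings.

```shell
# Hypothetical helper (not taken from the PR): succeed only when the measured
# accuracy reaches the threshold. The shell has no floating-point arithmetic,
# so the numeric comparison is delegated to awk.
accuracy_meets_threshold() {
    local accuracy="$1" threshold="$2"
    awk -v a="$accuracy" -v t="$threshold" 'BEGIN { exit !(a + 0 >= t + 0) }'
}

# Illustrative values only: 0.93 is a made-up measurement, 0.90 a sample threshold.
if accuracy_meets_threshold "0.93" "0.90"; then
    echo "accuracy gate: PASS"
else
    echo "accuracy gate: FAIL"
fi
```

Returning a plain exit status lets a CI step fail simply by ending on this check, without parsing any further output.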
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Hi @valarLip @gyohuangxin @wuhuikx, could you help review this PR? The OOT CI is now in place, and Kimi-K2 passes it. @gyohuangxin, I am currently testing the nightly OOT CI release. Thank you.
    jobs:
      build-oot-image:
        name: Build OOT validation image
        runs-on: linux-atom-mi355-1
Please use the build-only-atom runner to build images.
    - model_name: "Kimi-K2-Thinking-MXFP4"
      model_path: "amd/Kimi-K2-Thinking-MXFP4"
      accuracy_test_threshold: 0.90
      runner: atom-mi355-8gpu.predownload
Does the Kimi-K2-Thinking model need 8 GPUs to run?
Will use runner: linux-atom-mi355-4, which has 4 GPUs.
    - name: Build ATOM base image
      run: |
        cat <<EOF > Dockerfile.mod
        FROM ${{ env.ATOM_BASE_NIGHTLY_IMAGE }}
        RUN pip install -U lm-eval[api]
        RUN pip show lm-eval || true
        RUN pip install hf_transfer
        RUN pip show hf_transfer || true
        RUN echo "=== Aiter version BEFORE uninstall ===" && pip show amd-aiter || true
        RUN pip uninstall -y amd-aiter
        RUN pip install --upgrade "pybind11>=3.0.1"
        RUN pip show pybind11
        RUN rm -rf /app/aiter-test
        RUN git clone --depth 1 https://github.com/ROCm/aiter.git /app/aiter-test && \\
            cd /app/aiter-test && \\
            git checkout HEAD && \\
            git submodule sync && git submodule update --init --recursive && \\
            MAX_JOBS=64 PREBUILD_KERNELS=0 GPU_ARCHS=gfx950 python3 setup.py develop
        RUN echo "=== Aiter version AFTER installation ===" && pip show amd-aiter || true
        RUN echo "=== ATOM version BEFORE uninstall ===" && pip show atom || true
        RUN pip uninstall -y atom
        RUN rm -rf /app/ATOM
        RUN git clone ${{ env.GITHUB_REPO_URL }} /app/ATOM && \\
            cd /app/ATOM && \\
            git checkout ${{ env.GITHUB_COMMIT_SHA }} && \\
            pip install -e .
        RUN echo "=== ATOM version AFTER installation ===" && pip show atom || true
        EOF
Do we still need to build an ATOM image before an OOT image?
Yes. In most cases we need to build the ATOM image, so I moved the OOT build after the ATOM build.
For the image release, however, I introduced a SKIP_ATOM_NATIVE_BUILD option to skip the ATOM image build. For a nightly release we first build the ATOM image and push it to the hub; the OOT image can then reuse that ATOM image and do an incremental build for vLLM.
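The build order described in this reply can be sketched as a dry-run planner that only prints the commands it would run. The function name `plan_release_builds` and the image tags are hypothetical, and the sketch assumes SKIP_ATOM_NATIVE_BUILD is consumed as a plain true/false flag; how the PR actually wires that option is not shown here. The stage names and the BASE_IMAGE build argument are the ones mentioned in the PR.

```shell
# Hypothetical dry-run planner: prints the docker commands a release job might
# run, without executing them. Image tags are placeholders, not real tags.
plan_release_builds() {
    local skip_atom_build="$1" atom_image="$2" oot_image="$3"
    if [ "$skip_atom_build" != "true" ]; then
        # Native ATOM image: built (and pushed) first in the nightly flow.
        echo "docker build --target atom_image -t ${atom_image} -f docker/Dockerfile ."
    fi
    # The OOT stage layers vLLM on top of an existing ATOM image via BASE_IMAGE.
    echo "docker build --target oot_image --build-arg BASE_IMAGE=${atom_image} -t ${oot_image} -f docker/Dockerfile ."
}

# Release case: the ATOM image already exists in the hub, so only OOT is planned.
plan_release_builds "true" "example/atom:nightly" "example/atom-vllm-oot:nightly"
```

Passing "false" as the first argument plans both builds in order, which corresponds to the per-PR case where no prebuilt ATOM image for the commit exists yet.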
docker/Dockerfile
Outdated
    FROM $BASE_IMAGE
    # --------------------------------------------------------------------
    # OOT image stage: extends an ATOM base image with vLLM + OOT deps.
    # Build with: docker build --target oot_image --build-arg BASE_IMAGE=...
Use a build argument to control whether the ATOM OOT image build proceeds.
enable OOT docker release Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
6fc865f to 37e19c7
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
    atom-vllm-oot:
      needs: [pre-checks]
      if: ${{ needs.pre-checks.result == 'success' && (!github.event.pull_request || github.event.pull_request.draft == false) }}
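To make the job condition easier to reason about, the sketch below mirrors it as a shell predicate. The function `should_run_oot_job` and its three string arguments are hypothetical stand-ins for `needs.pre-checks.result`, the presence of `github.event.pull_request`, and `github.event.pull_request.draft`; it is a truth-table aid, not something GitHub Actions evaluates.

```shell
# Hypothetical mirror of the workflow `if:` expression.
# Arguments: pre-checks result, "true" if the event is a pull request,
# and "true" if that pull request is a draft.
should_run_oot_job() {
    local pre_checks_result="$1" is_pull_request="$2" is_draft="$3"
    [ "$pre_checks_result" = "success" ] || return 1
    # Non-PR events (push, schedule) carry no pull_request payload, so they pass.
    [ "$is_pull_request" != "true" ] && return 0
    [ "$is_draft" = "false" ]
}

describe() {
    if should_run_oot_job "$@"; then echo "$* -> run"; else echo "$* -> skip"; fi
}

describe success false false   # push or scheduled run
describe success true false    # pull request that is ready for review
describe success true true     # draft pull request
describe failure true false    # pre-checks did not succeed
```

Under these assumptions only the first two cases run the job; draft pull requests and failed pre-checks are skipped.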
    - name: Clean up containers and workspace
      run: |
        echo "=== Cleaning up containers on $(hostname) ==="
        containers=$(docker ps -q)
        if [ -n "$containers" ]; then
          docker kill $containers || true
        fi
        docker rm -f "$CONTAINER_NAME" 2>/dev/null || true
        docker run --rm -v "${GITHUB_WORKSPACE:-$PWD}":/workspace -w /workspace --privileged rocm/pytorch:latest bash -lc "ls -la /workspace/ && rm -rf /workspace/*" || true
    RUN if [ "${BUILD_OOT_IMAGE}" != "true" ]; then exit 0; fi && \
        echo "========== [OOT 1/7] Prepare build tools ==========" && \
        apt-get update && \
        apt --fix-broken install -y && \
        apt-get install -y --no-install-recommends ca-certificates ninja-build vim && \
        mkdir -p /usr/local/bin && \
        ln -sf "$(command -v ninja)" /usr/local/bin/ninja && \
        /usr/local/bin/ninja --version && \
        rm -rf /var/lib/apt/lists/*

    RUN if [ "${BUILD_OOT_IMAGE}" != "true" ]; then exit 0; fi && \
        echo "========== [OOT 2/7] Verify base packages (atom/aiter/mori) ==========" && \
        "${VENV_PYTHON}" -m pip show atom || true && \
        "${VENV_PYTHON}" -m pip show amd-aiter || true && \
        "${VENV_PYTHON}" -m pip show mori || true

    RUN if [ "${BUILD_OOT_IMAGE}" != "true" ]; then exit 0; fi && \
        echo "========== [OOT 3/7] Clone vLLM ==========" && \
        rm -rf /app/vllm && \
        git clone "${VLLM_REPO}" /app/vllm && \
        cd /app/vllm && \
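The guard that opens each RUN step above can be tried outside Docker. The sketch below is a hypothetical stand-in: `run_guarded_step` is not part of the PR, and the step body only echoes a message instead of installing anything. It relies on the fact that a shell-form RUN is executed through `/bin/sh -c` on Linux images, so `exit 0` ends the whole step successfully before the `&&` chain is reached.

```shell
# Hypothetical stand-in for one guarded RUN step. When BUILD_OOT_IMAGE is not
# "true", `exit 0` terminates the shell with success and nothing after it runs.
run_guarded_step() {
    BUILD_OOT_IMAGE="$1" sh -c '
        if [ "${BUILD_OOT_IMAGE}" != "true" ]; then exit 0; fi && \
        echo "step executed"
    '
}

run_guarded_step "false"; echo "exit status: $?"
run_guarded_step "true";  echo "exit status: $?"
```

Both calls return status 0, but only the second prints "step executed". This is why the same Dockerfile can be built with or without the OOT steps: a skipped step still produces a layer, just an empty one.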
    jobs:
      build-oot-image:
        name: Build OOT validation image
        runs-on: build-only-atom
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.


Design RFC: #255