[plugin][CI/CD] Add CI/CD for OOT and enable OOT docker release#320

Open
zejunchen-zejun wants to merge 15 commits into main from zejun/establish_oot_ci_cd_docker_release

Conversation

zejunchen-zejun (Contributor) commented Mar 12, 2026

Design RFC: #255

Contributor

Copilot AI left a comment

Pull request overview

This PR adds CI/CD coverage for ATOM’s vLLM OOT (out-of-tree) plugin workflow and extends the Docker release pipeline to optionally build/push an OOT vLLM image, while consolidating OOT image build logic into the main multi-stage docker/Dockerfile.

Changes:

  • Add new OOT CI workflows (per-PR/scheduled “OOT Test” + manual “Full Validation”) and a shared OOT test script for launching vLLM + running GSM8K accuracy checks.
  • Convert docker/Dockerfile into a multi-stage build with a dedicated oot_image stage and update the nightly docker release workflow to optionally publish OOT images.
  • Add plugin-mode unit tests for framework selection, env-flag behavior, and vLLM→ATOM config translation.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/plugin/test_plugin_mode_status.py | Adds unit tests for plugin framework backbone selection and mode helpers. |
| tests/plugin/test_plugin_env_flags.py | Adds unit test verifying ATOM_DISABLE_VLLM_PLUGIN disables platform/registration behavior. |
| tests/plugin/test_plugin_config_translation.py | Adds unit tests for translating vLLM config to ATOM config in plugin mode. |
| .github/scripts/atom_oot_test.sh | New helper script to launch vLLM and run GSM8K accuracy + threshold gating. |
| .github/workflows/atom-vllm-oot-test.yaml | New per-PR/push/scheduled OOT workflow building OOT image and running plugin UT + GSM8K accuracy. |
| .github/workflows/atom-vllm-oot-full-test.yaml | New workflow-dispatch full validation across multiple models/runners. |
| .github/workflows/atom-test.yaml | Fixes ATOM_BASE_NIGTHLY_IMAGE typo to ATOM_BASE_NIGHTLY_IMAGE. |
| .github/workflows/docker-release.yaml | Adds optional inputs/steps to build and push OOT vLLM Docker images; builds atom_image stage explicitly. |
| docker/Dockerfile | Introduces oot_image stage (vLLM build/install + deps) and renames original flow to atom_image stage. |
| docker/plugin/Dockerfile_OOT_vLLM | Removes old dedicated OOT vLLM Dockerfile in favor of the consolidated multi-stage Dockerfile. |
| docker/plugin/build_OOT_vLLM.sh | Removes old local OOT image build script (replaced by consolidated build paths). |
| oot_ut_changes.patch | New committed patch file capturing diffs (appears redundant with PR content). |
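
The threshold gating mentioned for .github/scripts/atom_oot_test.sh can be pictured with a small shell sketch. This is not the script from the PR: the function name accuracy_meets_threshold and the way the score is passed in are assumptions made for illustration. It only shows one possible way to compare a floating-point GSM8K score against a value such as accuracy_test_threshold.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from atom_oot_test.sh.
# Returns 0 when the measured accuracy is at or above the threshold, 1 otherwise.
accuracy_meets_threshold() {
    local accuracy="$1"
    local threshold="$2"
    # [ ] only compares integers, so awk handles the floating-point check.
    awk -v a="$accuracy" -v t="$threshold" 'BEGIN { exit !(a + 0 >= t + 0) }'
}

# Example gate with a made-up score of 0.93 against a threshold of 0.90.
if accuracy_meets_threshold 0.93 0.90; then
    echo "accuracy gate: PASS"
else
    echo "accuracy gate: FAIL"
fi
```

In a CI step the failing branch would typically exit non-zero so that the job is marked as failed.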

Copilot AI review requested due to automatic review settings March 13, 2026 01:22
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.


Copilot AI review requested due to automatic review settings March 13, 2026 03:57
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.


Copilot AI review requested due to automatic review settings March 13, 2026 06:12
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.


@zejunchen-zejun
Contributor Author

Hi @valarLip @gyohuangxin @wuhuikx,

Could you help review this PR? The OOT CI is now in place, and Kimi-K2 passes it.
[image]

@gyohuangxin I am currently testing the nightly OOT CI release.
[image]

Thank you.

jobs:
  build-oot-image:
    name: Build OOT validation image
    runs-on: linux-atom-mi355-1
Member

Please use the build-only-atom runner to build images.

Contributor Author

ok sure!

- model_name: "Kimi-K2-Thinking-MXFP4"
  model_path: "amd/Kimi-K2-Thinking-MXFP4"
  accuracy_test_threshold: 0.90
  runner: atom-mi355-8gpu.predownload
Member

Does the Kimi-K2-Thinking model need 8 GPUs to run?

Contributor Author

I will use runner: linux-atom-mi355-4 to run it on 4 GPUs.

Comment on lines +70 to +98
      - name: Build ATOM base image
        run: |
          cat <<EOF > Dockerfile.mod
          FROM ${{ env.ATOM_BASE_NIGHTLY_IMAGE }}
          RUN pip install -U lm-eval[api]
          RUN pip show lm-eval || true
          RUN pip install hf_transfer
          RUN pip show hf_transfer || true
          RUN echo "=== Aiter version BEFORE uninstall ===" && pip show amd-aiter || true
          RUN pip uninstall -y amd-aiter
          RUN pip install --upgrade "pybind11>=3.0.1"
          RUN pip show pybind11
          RUN rm -rf /app/aiter-test
          RUN git clone --depth 1 https://github.com/ROCm/aiter.git /app/aiter-test && \\
              cd /app/aiter-test && \\
              git checkout HEAD && \\
              git submodule sync && git submodule update --init --recursive && \\
              MAX_JOBS=64 PREBUILD_KERNELS=0 GPU_ARCHS=gfx950 python3 setup.py develop
          RUN echo "=== Aiter version AFTER installation ===" && pip show amd-aiter || true
          RUN echo "=== ATOM version BEFORE uninstall ===" && pip show atom || true
          RUN pip uninstall -y atom
          RUN rm -rf /app/ATOM
          RUN git clone ${{ env.GITHUB_REPO_URL }} /app/ATOM && \\
              cd /app/ATOM && \\
              git checkout ${{ env.GITHUB_COMMIT_SHA }} && \\
              pip install -e .
          RUN echo "=== ATOM version AFTER installation ===" && pip show atom || true
          EOF

Member

Do we still need to build an ATOM image before the OOT image?

Contributor Author

zejunchen-zejun, Mar 13, 2026

Yes, in most cases we need to build the ATOM image, so I moved the OOT build after the ATOM build.
For the image release, however, I introduced a SKIP_ATOM_NATIVE_BUILD option to skip the ATOM image build. For a nightly release we first build the ATOM image and push it to the hub; the OOT image can then reuse that ATOM image and do an incremental build for vLLM.
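
The release order described in this reply could be sketched roughly as follows. The stage names atom_image and oot_image, the BASE_IMAGE build argument, and docker/Dockerfile come from this PR; the function name, the image tags, and the exact meaning given to SKIP_ATOM_NATIVE_BUILD here are assumptions. The sketch only prints the docker build commands, so it can be run without Docker.

```shell
#!/usr/bin/env bash
# Hypothetical planner: prints the docker build commands for a release run.
# Assumption: SKIP_ATOM_NATIVE_BUILD=true means the ATOM image was already
# built and pushed, so only the incremental OOT build is needed.
plan_release_builds() {
    local atom_tag="$1"
    local oot_tag="$2"
    if [ "${SKIP_ATOM_NATIVE_BUILD:-false}" != "true" ]; then
        echo "docker build --target atom_image -t ${atom_tag} -f docker/Dockerfile ."
    fi
    # The OOT stage reuses the ATOM image as its base.
    echo "docker build --target oot_image --build-arg BASE_IMAGE=${atom_tag} -t ${oot_tag} -f docker/Dockerfile ."
}

# Example with placeholder tags: only the OOT build command is printed.
SKIP_ATOM_NATIVE_BUILD=true plan_release_builds example/atom:nightly example/atom:nightly-oot
```

Printing the commands instead of running them keeps the ordering logic easy to check before wiring it into a workflow step.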

FROM $BASE_IMAGE
# --------------------------------------------------------------------
# OOT image stage: extends an ATOM base image with vLLM + OOT deps.
# Build with: docker build --target oot_image --build-arg BASE_IMAGE=...
Contributor Author

Use an argument to control whether the build proceeds for the ATOM OOT image.
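
An argument-controlled build of this kind can be pictured as a guard at the top of each OOT layer, in the spirit of the `if [ "${BUILD_OOT_IMAGE}" != "true" ]; then exit 0; fi` lines quoted in the later Dockerfile excerpts. The sketch below imitates that behaviour in plain shell, with a subshell standing in for a Dockerfile RUN layer; the function name run_oot_layer and the echoed message are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for one guarded RUN layer of the oot_image stage.
run_oot_layer() {
    # The subshell plays the role of the shell that a RUN instruction spawns,
    # so "exit 0" ends only this layer, not the caller.
    (
        if [ "${BUILD_OOT_IMAGE:-false}" != "true" ]; then exit 0; fi
        echo "OOT layer executed"
    )
}

BUILD_OOT_IMAGE=false run_oot_layer   # prints nothing, still succeeds
BUILD_OOT_IMAGE=true run_oot_layer    # prints "OOT layer executed"
```

Because the skipped layer exits with status 0, the image build continues and the layer simply leaves the filesystem unchanged.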

Copilot AI review requested due to automatic review settings March 13, 2026 10:57
enable OOT docker release

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.


Copilot AI review requested due to automatic review settings March 14, 2026 07:24
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.


  atom-vllm-oot:
    needs: [pre-checks]
    if: ${{ needs.pre-checks.result == 'success' && (!github.event.pull_request || github.event.pull_request.draft == false) }}

Comment on lines +58 to +67
      - name: Clean up containers and workspace
        run: |
          echo "=== Cleaning up containers on $(hostname) ==="
          containers=$(docker ps -q)
          if [ -n "$containers" ]; then
            docker kill $containers || true
          fi
          docker rm -f "$CONTAINER_NAME" 2>/dev/null || true
          docker run --rm -v "${GITHUB_WORKSPACE:-$PWD}":/workspace -w /workspace --privileged rocm/pytorch:latest bash -lc "ls -la /workspace/ && rm -rf /workspace/*" || true

Comment on lines +148 to +168
RUN if [ "${BUILD_OOT_IMAGE}" != "true" ]; then exit 0; fi && \
    echo "========== [OOT 1/7] Prepare build tools ==========" && \
    apt-get update && \
    apt --fix-broken install -y && \
    apt-get install -y --no-install-recommends ca-certificates ninja-build vim && \
    mkdir -p /usr/local/bin && \
    ln -sf "$(command -v ninja)" /usr/local/bin/ninja && \
    /usr/local/bin/ninja --version && \
    rm -rf /var/lib/apt/lists/*

RUN if [ "${BUILD_OOT_IMAGE}" != "true" ]; then exit 0; fi && \
    echo "========== [OOT 2/7] Verify base packages (atom/aiter/mori) ==========" && \
    "${VENV_PYTHON}" -m pip show atom || true && \
    "${VENV_PYTHON}" -m pip show amd-aiter || true && \
    "${VENV_PYTHON}" -m pip show mori || true

RUN if [ "${BUILD_OOT_IMAGE}" != "true" ]; then exit 0; fi && \
    echo "========== [OOT 3/7] Clone vLLM ==========" && \
    rm -rf /app/vllm && \
    git clone "${VLLM_REPO}" /app/vllm && \
    cd /app/vllm && \

jobs:
  build-oot-image:
    name: Build OOT validation image
    runs-on: build-only-atom
Copilot AI review requested due to automatic review settings March 14, 2026 09:19
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.


Copilot AI review requested due to automatic review settings March 14, 2026 10:43
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.



3 participants