
[None][infra] Add apt cache mounts to devel stage and use existing pip cache #13510

Merged
eopXD merged 1 commit into NVIDIA:main from eopXD:infra/devel-apt-cache-mount on May 5, 2026

Conversation

@eopXD
Collaborator

@eopXD eopXD commented Apr 27, 2026

Description

The devel stage of docker/Dockerfile.multi already mounts a BuildKit pip cache, but two issues prevent the cache from helping incremental rebuilds:

  1. apt-get install calls in install_base.sh run without an apt cache mount, so every rebuild re-downloads every .deb. The install.sh layer takes ~260s on a typical compute node, dominated by apt fetch + extract.

  2. pip3 install --no-cache-dir is used in two places that run inside RUN layers already mounting /root/.cache/pip, defeating the existing pip cache mount:

    • docker/Dockerfile.multi line 57 (constraints.txt install).
    • docker/common/install_nixl.sh line 21 (meson, ninja, pybind11, setuptools).

    Wheels are downloaded fresh every rebuild instead of being served from the persistent cache.
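
The conflicting pattern can be sketched as follows. This is an illustrative paraphrase, not the verbatim Dockerfile; the requirements file name is indicative only:

```dockerfile
# Sketch of the pre-change pattern (illustrative, not the exact file).
# The RUN layer mounts a persistent BuildKit pip cache, but
# --no-cache-dir tells pip to neither read nor write /root/.cache/pip,
# so the mount is never consulted and the wheels are re-downloaded on
# every rebuild.
RUN --mount=type=cache,target=/root/.cache/pip \
    pip3 install --no-cache-dir -r constraints.txt
```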

What this PR changes

  • docker/Dockerfile.multi: add --mount=type=cache,target=/var/cache/apt,sharing=locked and --mount=type=cache,target=/var/lib/apt,sharing=locked to the two devel-stage RUN layers that invoke apt (the install.sh layer and the UCX/NIXL/etcd layer).
  • docker/Dockerfile.multi: drop --no-cache-dir from the constraints.txt pip install (the surrounding RUN already mounts the pip cache).
  • docker/common/install_nixl.sh: drop --no-cache-dir from the meson/ninja/pybind11/setuptools install for the same reason.
  • docker/common/install_base.sh: update cleanup() to skip apt-get clean, rm -rf /var/lib/apt/lists/*, and pip3 cache purge. Under cache-mount RUNs these commands operate on the persistent mount rather than the image layer, so running them wipes the cache between builds. The cache mounts themselves ensure the layer never persists these paths, so image size is unchanged.
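
The resulting devel-stage pattern can be sketched as below. This is a paraphrase of the change, not the exact diff; the script path is illustrative:

```dockerfile
# Sketch of the post-change pattern (illustrative, not the exact diff).
# /var/cache/apt keeps downloaded .deb files and /var/lib/apt keeps
# package lists across builds; sharing=locked serializes concurrent
# builds that share the same cache. The pip cache mount was already
# present; pip now actually uses it because --no-cache-dir is gone.
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    --mount=type=cache,target=/root/.cache/pip \
    bash ./docker/common/install.sh
```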

The release stage (docker/Dockerfile.multi lines 116, 122) already follows this same pattern; this PR extends the pattern to devel.
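
The cleanup() adjustment can be sketched as the following shell fragment. This is a hypothetical reconstruction rather than the actual install_base.sh; the function body and echo message are illustrative only:

```shell
# Hypothetical sketch of the cleanup() change (not the actual
# install_base.sh contents). Under cache-mount RUNs the purge commands
# would act on the persistent mount rather than the image layer, so
# they are dropped; the mounts themselves keep these paths out of the
# final image.
cleanup() {
  # Previously executed here, now intentionally skipped:
  #   apt-get clean
  #   rm -rf /var/lib/apt/lists/*
  #   pip3 cache purge
  echo "skipping apt/pip cache purge (BuildKit cache mounts in use)"
}

cleanup
```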

Expected impact

  • Incremental rebuilds (Dockerfile content changes but install scripts and constraints don't): the install.sh layer drops from ~260s to a few seconds — apt fetches resolve from the cache and pip wheels are served from the persistent cache.
  • From-scratch builds: unchanged.
  • Resulting image size: unchanged (cache mounts are not baked into the image).

Out of scope / potential follow-ups

  • Adding the same apt cache mount to the tritondevel RUN (Dockerfile.multi lines 91-98) — same pattern, separate change to keep this PR focused.
  • Caching the 20 MB etcd tarball that install_etcd.sh downloads on every build.
  • ccache mount for the UCX and NIXL source builds.

Test Coverage

This is a build-system-only change with no runtime code path, so it relies on the existing image-build CI for validation:

  • The image-build CI stages exercise make -C docker build end-to-end on a cold runner. A successful build proves the new --mount=type=cache lines parse correctly under BuildKit and that removing --no-cache-dir plus the cleanup changes still produce a functional tensorrt_llm/devel:latest.
  • BuildKit cache mounts are a pass-through: when no cache layer is present (cold runner), they behave like normal directories, so the from-scratch build path is unchanged.
  • The cache-hit path (incremental rebuild) was verified locally: after this change, a no-op rebuild after a small Dockerfile edit drops the install.sh layer from ~260s to seconds. This requires a warm BuildKit cache, which CI runners typically don't have, but the warm path is exercised by anyone running make -C docker build repeatedly on the same machine.
  • No new runtime tests required — the resulting image is functionally equivalent to baseline; only build-time behavior changes.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@eopXD eopXD requested review from a team as code owners April 27, 2026 13:27
@eopXD eopXD requested review from mlefeb01 and yiqingy0 April 27, 2026 13:27
@coderabbitai
Contributor

coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough


Changes optimize Docker build caching by introducing BuildKit cache mounts for apt and pip packages and removing cache-disabling flags to preserve persistent caches across incremental builds.

Changes

  • Docker Build Cache Configuration (docker/Dockerfile.multi): adds BuildKit cache mounts for /var/cache/apt and /var/lib/apt with locked sharing, and removes --no-cache-dir from the pip install in the constraints step to enable pip cache reuse.
  • Cache Preservation in Install Scripts (docker/common/install_base.sh, docker/common/install_nixl.sh): modifies the cleanup function to skip apt-get clean, /var/lib/apt/lists deletion, and pip3 cache purge so the BuildKit cache mounts are preserved; removes the --no-cache-dir flag from pip installs to allow cache retention.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check (✅ Passed): the title accurately summarizes the main change: adding apt cache mounts to the devel stage and enabling pip cache reuse.
  • Linked Issues check (✅ Passed): skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): skipped because no linked issues were found for this pull request.
  • Description check (✅ Passed): the PR description is thorough and follows the template structure, with clear sections explaining the problem, changes, expected impact, and test coverage.


Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
docker/common/install_base.sh (1)

47-56: Note: Rocky Linux/dnf cleanup remains unmodified.

The cleanup for Rocky Linux systems (dnf clean all and rm -rf /var/cache/dnf) is still executed since there are no BuildKit cache mounts for /var/cache/dnf in the Dockerfile. This means Rocky-based builds won't benefit from the same incremental build caching improvements as Ubuntu-based builds.

This appears intentional based on the PR scope (focusing on apt cache mounts for the devel stage), but worth noting for future optimization if Rocky Linux builds need similar caching benefits.



📥 Commits

Reviewing files that changed from the base of the PR and between 734a146 and 10d6d7d.

📒 Files selected for processing (3)
  • docker/Dockerfile.multi
  • docker/common/install_base.sh
  • docker/common/install_nixl.sh

@eopXD
Collaborator Author

eopXD commented Apr 28, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45881 [ run ] triggered by Bot. Commit: 10d6d7d

@tensorrt-cicd
Collaborator

PR_Github #45881 [ run ] completed with state ABORTED. Commit: 10d6d7d


@eopXD
Collaborator Author

eopXD commented Apr 30, 2026

/bot run --disable-fail-fast

@eopXD eopXD force-pushed the infra/devel-apt-cache-mount branch from 10d6d7d to 0010c19 Compare April 30, 2026 08:10
@eopXD
Collaborator Author

eopXD commented Apr 30, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46342 [ run ] triggered by Bot. Commit: 0010c19

@tensorrt-cicd
Collaborator

PR_Github #46342 [ run ] completed with state SUCCESS. Commit: 0010c19
/LLM/main/L0_MergeRequest_PR pipeline #36434 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis


Collaborator

@juney-nvidia juney-nvidia left a comment


Approved from OSS compliance perspective.

…p cache

The devel stage in docker/Dockerfile.multi already mounts a BuildKit pip
cache, but two issues prevent the cache from helping incremental rebuilds:

1. The apt installs in install_base.sh run without a cache mount, so every
   rebuild re-downloads every .deb package (the install.sh layer takes
   ~260s on a typical compute node, dominated by apt fetch + extract).

2. `pip3 install --no-cache-dir` in two places defeats the existing pip
   cache mount: docker/Dockerfile.multi line 57 (constraints.txt install)
   and docker/common/install_nixl.sh line 21 (meson/ninja/pybind11/
   setuptools). Wheels are downloaded fresh each rebuild instead of being
   served from the persistent cache.

This patch:
- Adds `--mount=type=cache,target=/var/cache/apt,sharing=locked` and
  `--mount=type=cache,target=/var/lib/apt,sharing=locked` to the two
  devel-stage RUN layers that invoke apt (install.sh and the UCX/NIXL/
  etcd RUN).
- Drops `--no-cache-dir` from the two pip3 install invocations that run
  inside RUN layers with `--mount=type=cache,target=/root/.cache/pip`.
- Updates the cleanup() function in install_base.sh to skip
  `apt-get clean` and `rm -rf /var/lib/apt/lists/*` and to skip
  `pip3 cache purge`. Under cache-mount RUNs these commands operate on
  the persistent mount, not the image layer, so running them wipes the
  cache between builds. The mount itself ensures the layer never persists
  these paths, so image size is unchanged.

The release stage (lines 116, 122) already follows this pattern — this
change extends it to devel.

No functional changes for builds without BuildKit cache support: the
image is byte-equivalent for a from-scratch build.

Signed-off-by: Yueh-Ting Chen <yueh.ting.chen@gmail.com>
@eopXD eopXD force-pushed the infra/devel-apt-cache-mount branch from 0010c19 to 4657e7a Compare May 4, 2026 07:14
@eopXD
Collaborator Author

eopXD commented May 4, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46626 [ run ] triggered by Bot. Commit: 4657e7a

@tensorrt-cicd
Collaborator

PR_Github #46626 [ run ] completed with state SUCCESS. Commit: 4657e7a
/LLM/main/L0_MergeRequest_PR pipeline #36671 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis


@eopXD
Collaborator Author

eopXD commented May 5, 2026

Failed Pipelines & Test Cases

  DGX_B200-4_GPUs-PyTorch (3 failures — all Executor worker died during initialization)
  - test_llm_api_pytorch.TestQwen3_5_397B_A17B.test_nvfp4[adp4_trtllm]
  - test_llm_api_pytorch.TestQwen3_5_397B_A17B.test_nvfp4[tep4_block_reuse]
  - test_llm_api_pytorch.TestQwen3_5_397B_A17B.test_nvfp4[adp4_cutedsl]

  DGX_B200-8_GPUs-PyTorch (1 — Test terminated unexpectedly)
  - test_disaggregated_serving.TestQwen3_8B.test_auto_dtype_with_helix[fifo_v2-cudagraph:with_padding-pp1tp1cp4]

  DGX_B200-PyTorch (2 — Test terminated unexpectedly)
  - test_unittests_v2[unittest/_torch/visual_gen/test_flux_pipeline.py]
  - test_unittests_v2[unittest/_torch/visual_gen/test_flux_pipeline.py::TestFluxE2E::test_flux2_e2e_vs_hf]

  DGX_H100-PyTorch (3 — kv_cache prefix-aware scheduling)
  - test_multi_round_qa_shared_prefix[swa-chunked] — benchmark failed at QPS=8 (rc=-1)
  - test_multi_round_qa_shared_prefix[max-util-chunked] — setup timeout installing LMBenchmark from GitHub
  - test_multi_round_qa_shared_prefix[python-scheduler] — setup timeout installing LMBenchmark from GitHub

@eopXD
Collaborator Author

eopXD commented May 5, 2026

/bot skip --comment "Ran a devel container build on my computelab leased machine node. The container build and run was successful. The existing CI failure is not related to the scope of change in this merge request. My judgement is that the risk is low on merging the MR. Let us skip and merge the MR."

@eopXD eopXD enabled auto-merge (squash) May 5, 2026 03:24
@tensorrt-cicd
Collaborator

PR_Github #46736 [ skip ] triggered by Bot. Commit: 4657e7a

@tensorrt-cicd
Collaborator

PR_Github #46736 [ skip ] completed with state SUCCESS. Commit: 4657e7a
Skipping testing for commit 4657e7a


@eopXD eopXD merged commit f0e9b51 into NVIDIA:main May 5, 2026
7 checks passed
@eopXD eopXD deleted the infra/devel-apt-cache-mount branch May 7, 2026 06:06