
[None][infra] Add apt cache mounts to devel stage and use existing pip cache #13510

Merged
eopXD merged 1 commit into NVIDIA:main from eopXD:infra/devel-apt-cache-mount on May 5, 2026

Conversation

@eopXD
Collaborator

@eopXD eopXD commented Apr 27, 2026

Description

The devel stage of docker/Dockerfile.multi already mounts a BuildKit pip cache, but two issues prevent the cache from helping incremental rebuilds:

  1. apt-get install calls in install_base.sh run without an apt cache mount, so every rebuild re-downloads every .deb. The install.sh layer takes ~260s on a typical compute node, dominated by apt fetch + extract.

  2. pip3 install --no-cache-dir is used in two places that run inside RUN layers already mounting /root/.cache/pip, defeating the existing pip cache mount:

    • docker/Dockerfile.multi line 57 (constraints.txt install).
    • docker/common/install_nixl.sh line 21 (meson, ninja, pybind11, setuptools).

    Wheels are downloaded fresh every rebuild instead of being served from the persistent cache.
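
The conflicting pattern can be sketched as follows. This is an illustrative paraphrase, not the verbatim Dockerfile; the requirements file name is indicative only:

```dockerfile
# Sketch of the pre-change pattern (illustrative, not the exact file).
# The RUN layer mounts a persistent BuildKit pip cache, but
# --no-cache-dir tells pip to neither read nor write /root/.cache/pip,
# so the mount is never consulted and the wheels are re-downloaded on
# every rebuild.
RUN --mount=type=cache,target=/root/.cache/pip \
    pip3 install --no-cache-dir -r constraints.txt
```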

What this PR changes

  • docker/Dockerfile.multi: add --mount=type=cache,target=/var/cache/apt,sharing=locked and --mount=type=cache,target=/var/lib/apt,sharing=locked to the two devel-stage RUN layers that invoke apt (the install.sh layer and the UCX/NIXL/etcd layer).
  • docker/Dockerfile.multi: drop --no-cache-dir from the constraints.txt pip install (the surrounding RUN already mounts the pip cache).
  • docker/common/install_nixl.sh: drop --no-cache-dir from the meson/ninja/pybind11/setuptools install for the same reason.
  • docker/common/install_base.sh: update cleanup() to skip apt-get clean, rm -rf /var/lib/apt/lists/*, and pip3 cache purge. Under cache-mount RUNs these commands operate on the persistent mount rather than the image layer, so running them wipes the cache between builds. The cache mounts themselves ensure the layer never persists these paths, so image size is unchanged.
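
The resulting devel-stage pattern can be sketched as below. This is a paraphrase of the change, not the exact diff; the script path is illustrative:

```dockerfile
# Sketch of the post-change pattern (illustrative, not the exact diff).
# /var/cache/apt keeps downloaded .deb files and /var/lib/apt keeps
# package lists across builds; sharing=locked serializes concurrent
# builds that share the same cache. The pip cache mount was already
# present; pip now actually uses it because --no-cache-dir is gone.
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    --mount=type=cache,target=/root/.cache/pip \
    bash ./docker/common/install.sh
```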

The release stage (docker/Dockerfile.multi lines 116, 122) already follows this same pattern; this PR extends the pattern to devel.
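
The cleanup() adjustment can be sketched as the following shell fragment. This is a hypothetical reconstruction rather than the actual install_base.sh; the function body and echo message are illustrative only:

```shell
# Hypothetical sketch of the cleanup() change (not the actual
# install_base.sh contents). Under cache-mount RUNs the purge commands
# would act on the persistent mount rather than the image layer, so
# they are dropped; the mounts themselves keep these paths out of the
# final image.
cleanup() {
  # Previously executed here, now intentionally skipped:
  #   apt-get clean
  #   rm -rf /var/lib/apt/lists/*
  #   pip3 cache purge
  echo "skipping apt/pip cache purge (BuildKit cache mounts in use)"
}

cleanup
```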

Expected impact

  • Incremental rebuilds (Dockerfile content changes but install scripts and constraints don't): the install.sh layer drops from ~260s to a few seconds — apt fetches resolve from the cache and pip wheels are served from the persistent cache.
  • From-scratch builds: unchanged.
  • Resulting image size: unchanged (cache mounts are not baked into the image).

Out of scope / potential follow-ups

  • Adding the same apt cache mount to the tritondevel RUN (Dockerfile.multi lines 91-98) — same pattern, separate change to keep this PR focused.
  • Caching the 20 MB etcd tarball that install_etcd.sh downloads on every build.
  • ccache mount for the UCX and NIXL source builds.

Test Coverage

This is a build-system-only change with no runtime code path, so it relies on the existing image-build CI for validation:

  • The image-build CI stages exercise make -C docker build end-to-end on a cold runner. A successful build proves the new --mount=type=cache lines parse correctly under BuildKit and that removing --no-cache-dir plus the cleanup changes still produce a functional tensorrt_llm/devel:latest.
  • BuildKit cache mounts are a pass-through: when no cache layer is present (cold runner), they behave like normal directories, so the from-scratch build path is unchanged.
  • The cache-hit path (incremental rebuild) was verified locally: after this change, a no-op rebuild after a small Dockerfile edit drops the install.sh layer from ~260s to seconds. This requires a warm BuildKit cache, which CI runners typically don't have, but the warm path is exercised by anyone running make -C docker build repeatedly on the same machine.
  • No new runtime tests required — the resulting image is functionally equivalent to baseline; only build-time behavior changes.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@eopXD eopXD requested review from a team as code owners April 27, 2026 13:27
@eopXD eopXD requested review from mlefeb01 and yiqingy0 April 27, 2026 13:27
@coderabbitai
Contributor

coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough


Changes optimize Docker build caching by introducing BuildKit cache mounts for apt and pip packages and removing cache-disabling flags to preserve persistent caches across incremental builds.

Changes

  • Docker Build Cache Configuration (docker/Dockerfile.multi): adds BuildKit cache mounts for /var/cache/apt and /var/lib/apt with locked sharing, and removes --no-cache-dir from the pip install in the constraints step to enable pip cache reuse.
  • Cache Preservation in Install Scripts (docker/common/install_base.sh, docker/common/install_nixl.sh): modifies the cleanup function to skip apt-get clean, /var/lib/apt/lists deletion, and pip3 cache purge so the BuildKit cache mounts are preserved; removes the --no-cache-dir flag from pip installs to allow cache retention.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check (✅ Passed): the title accurately summarizes the main change: adding apt cache mounts to the devel stage and enabling pip cache reuse.
  • Linked Issues check (✅ Passed): skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): skipped because no linked issues were found for this pull request.
  • Description check (✅ Passed): the PR description is thorough and follows the template structure, with clear sections explaining the problem, changes, expected impact, and test coverage.


Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
docker/common/install_base.sh (1)

47-56: Note: Rocky Linux/dnf cleanup remains unmodified.

The cleanup for Rocky Linux systems (dnf clean all and rm -rf /var/cache/dnf) is still executed since there are no BuildKit cache mounts for /var/cache/dnf in the Dockerfile. This means Rocky-based builds won't benefit from the same incremental build caching improvements as Ubuntu-based builds.

This appears intentional based on the PR scope (focusing on apt cache mounts for the devel stage), but worth noting for future optimization if Rocky Linux builds need similar caching benefits.



📥 Commits

Reviewing files that changed from the base of the PR and between 734a146 and 10d6d7d.

📒 Files selected for processing (3)
  • docker/Dockerfile.multi
  • docker/common/install_base.sh
  • docker/common/install_nixl.sh

@eopXD
Collaborator Author

eopXD commented Apr 28, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45881 [ run ] triggered by Bot. Commit: 10d6d7d

@tensorrt-cicd
Collaborator

PR_Github #45881 [ run ] completed with state ABORTED. Commit: 10d6d7d


@eopXD
Collaborator Author

eopXD commented Apr 30, 2026

/bot run --disable-fail-fast

@eopXD eopXD force-pushed the infra/devel-apt-cache-mount branch from 10d6d7d to 0010c19 Compare April 30, 2026 08:10
@eopXD
Collaborator Author

eopXD commented Apr 30, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46342 [ run ] triggered by Bot. Commit: 0010c19

@tensorrt-cicd
Collaborator

PR_Github #46342 [ run ] completed with state SUCCESS. Commit: 0010c19
/LLM/main/L0_MergeRequest_PR pipeline #36434 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis


Collaborator

@juney-nvidia juney-nvidia left a comment


Approved from OSS compliance perspective.

…p cache

The devel stage in docker/Dockerfile.multi already mounts a BuildKit pip
cache, but two issues prevent the cache from helping incremental rebuilds:

1. The apt installs in install_base.sh run without a cache mount, so every
   rebuild re-downloads every .deb package (the install.sh layer takes
   ~260s on a typical compute node, dominated by apt fetch + extract).

2. `pip3 install --no-cache-dir` in two places defeats the existing pip
   cache mount: docker/Dockerfile.multi line 57 (constraints.txt install)
   and docker/common/install_nixl.sh line 21 (meson/ninja/pybind11/
   setuptools). Wheels are downloaded fresh each rebuild instead of being
   served from the persistent cache.

This patch:
- Adds `--mount=type=cache,target=/var/cache/apt,sharing=locked` and
  `--mount=type=cache,target=/var/lib/apt,sharing=locked` to the two
  devel-stage RUN layers that invoke apt (install.sh and the UCX/NIXL/
  etcd RUN).
- Drops `--no-cache-dir` from the two pip3 install invocations that run
  inside RUN layers with `--mount=type=cache,target=/root/.cache/pip`.
- Updates the cleanup() function in install_base.sh to skip
  `apt-get clean` and `rm -rf /var/lib/apt/lists/*` and to skip
  `pip3 cache purge`. Under cache-mount RUNs these commands operate on
  the persistent mount, not the image layer, so running them wipes the
  cache between builds. The mount itself ensures the layer never persists
  these paths, so image size is unchanged.

The release stage (lines 116, 122) already follows this pattern — this
change extends it to devel.

No functional changes for builds without BuildKit cache support: the
image is byte-equivalent for a from-scratch build.

Signed-off-by: Yueh-Ting Chen <yueh.ting.chen@gmail.com>
@eopXD eopXD force-pushed the infra/devel-apt-cache-mount branch from 0010c19 to 4657e7a Compare May 4, 2026 07:14
@eopXD
Collaborator Author

eopXD commented May 4, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46626 [ run ] triggered by Bot. Commit: 4657e7a

@tensorrt-cicd
Collaborator

PR_Github #46626 [ run ] completed with state SUCCESS. Commit: 4657e7a
/LLM/main/L0_MergeRequest_PR pipeline #36671 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis


@eopXD
Collaborator Author

eopXD commented May 5, 2026

Failed Pipelines & Test Cases

  DGX_B200-4_GPUs-PyTorch (3 failures — all Executor worker died during initialization)
  - test_llm_api_pytorch.TestQwen3_5_397B_A17B.test_nvfp4[adp4_trtllm]
  - test_llm_api_pytorch.TestQwen3_5_397B_A17B.test_nvfp4[tep4_block_reuse]
  - test_llm_api_pytorch.TestQwen3_5_397B_A17B.test_nvfp4[adp4_cutedsl]

  DGX_B200-8_GPUs-PyTorch (1 — Test terminated unexpectedly)
  - test_disaggregated_serving.TestQwen3_8B.test_auto_dtype_with_helix[fifo_v2-cudagraph:with_padding-pp1tp1cp4]

  DGX_B200-PyTorch (2 — Test terminated unexpectedly)
  - test_unittests_v2[unittest/_torch/visual_gen/test_flux_pipeline.py]
  - test_unittests_v2[unittest/_torch/visual_gen/test_flux_pipeline.py::TestFluxE2E::test_flux2_e2e_vs_hf]

  DGX_H100-PyTorch (3 — kv_cache prefix-aware scheduling)
  - test_multi_round_qa_shared_prefix[swa-chunked] — benchmark failed at QPS=8 (rc=-1)
  - test_multi_round_qa_shared_prefix[max-util-chunked] — setup timeout installing LMBenchmark from GitHub
  - test_multi_round_qa_shared_prefix[python-scheduler] — setup timeout installing LMBenchmark from GitHub

@eopXD
Collaborator Author

eopXD commented May 5, 2026

/bot skip --comment "Ran a devel container build on my computelab leased machine node. The container build and run was successful. The existing CI failure is not related to the scope of change in this merge request. My judgement is that the risk is low on merging the MR. Let us skip and merge the MR."

@eopXD eopXD enabled auto-merge (squash) May 5, 2026 03:24
@tensorrt-cicd
Collaborator

PR_Github #46736 [ skip ] triggered by Bot. Commit: 4657e7a

@tensorrt-cicd
Collaborator

PR_Github #46736 [ skip ] completed with state SUCCESS. Commit: 4657e7a
Skipping testing for commit 4657e7a


@eopXD eopXD merged commit f0e9b51 into NVIDIA:main May 5, 2026
7 checks passed
@eopXD eopXD deleted the infra/devel-apt-cache-mount branch May 7, 2026 06:06