[None][infra] Add apt cache mounts to devel stage and use existing pip cache #13510

eopXD merged 1 commit into NVIDIA:main from
Conversation
📝 Walkthrough

Changes optimize Docker build caching by introducing BuildKit cache mounts for apt and pip packages and removing cache-disabling flags, preserving persistent caches across incremental builds.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ❌ Failed checks (1 warning) | ✅ Passed checks (4 passed)
🧹 Nitpick comments (1)

docker/common/install_base.sh (1)

Lines 47-56: Note: Rocky Linux/dnf cleanup remains unmodified. The cleanup for Rocky Linux systems (dnf clean all and rm -rf /var/cache/dnf) is still executed, since there are no BuildKit cache mounts for /var/cache/dnf in the Dockerfile. This means Rocky-based builds won't benefit from the same incremental build caching improvements as Ubuntu-based builds. This appears intentional given the PR scope (apt cache mounts for the devel stage), but is worth noting as a future optimization if Rocky Linux builds need similar caching benefits.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/common/install_base.sh` around lines 47-56: the Rocky Linux cleanup still always runs dnf clean all and rm -rf /var/cache/dnf in the conditional block handling /etc/redhat-release. Update the devel-stage Dockerfile to add a BuildKit cache mount for /var/cache/dnf (or make the script conditional) so Rocky builds can reuse the dnf cache: either add a --mount=type=cache,target=/var/cache/dnf entry to the Dockerfile RUN that invokes install_base.sh, or change the install_base.sh block (the elif branch that echoes "Removing python3-pygments from Rocky Linux..." and runs dnf clean all / rm -rf /var/cache/dnf) to skip the cache purge when an environment variable or marker file (e.g., DNF_CACHE_MOUNTED) indicates a cache mount is present.
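A minimal sketch of the suggested follow-up (hypothetical: the actual RUN line in Dockerfile.multi that invokes install_base.sh, and the DNF_CACHE_MOUNTED marker, are illustrative, not existing code):

```dockerfile
# Hypothetical devel-stage layer: mount a persistent dnf cache so Rocky
# builds reuse downloaded packages across builds, and signal the script
# (via an illustrative DNF_CACHE_MOUNTED variable) to skip its purge.
RUN --mount=type=cache,target=/var/cache/dnf,sharing=locked \
    DNF_CACHE_MOUNTED=1 bash ./install_base.sh
```

For this to help, dnf's `keepcache` behavior must also retain downloaded packages; otherwise the mount persists an empty cache.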
📒 Files selected for processing (3)

- docker/Dockerfile.multi
- docker/common/install_base.sh
- docker/common/install_nixl.sh
/bot run --disable-fail-fast

PR_Github #45881 [ run ] triggered by Bot. Commit:

PR_Github #45881 [ run ] completed with state

/bot run --disable-fail-fast

10d6d7d to 0010c19 (Compare)

/bot run --disable-fail-fast

PR_Github #46342 [ run ] triggered by Bot. Commit:

PR_Github #46342 [ run ] completed with state
juney-nvidia left a comment:

Approved from OSS compliance perspective.
…p cache

The devel stage in docker/Dockerfile.multi already mounts a BuildKit pip cache, but two issues prevent the cache from helping incremental rebuilds:

1. The apt installs in install_base.sh run without a cache mount, so every rebuild re-downloads every .deb package (the install.sh layer takes ~260s on a typical compute node, dominated by apt fetch + extract).
2. `pip3 install --no-cache-dir` in two places defeats the existing pip cache mount: docker/Dockerfile.multi line 57 (constraints.txt install) and docker/common/install_nixl.sh line 21 (meson/ninja/pybind11/setuptools). Wheels are downloaded fresh each rebuild instead of being served from the persistent cache.

This patch:

- Adds `--mount=type=cache,target=/var/cache/apt,sharing=locked` and `--mount=type=cache,target=/var/lib/apt,sharing=locked` to the two devel-stage RUN layers that invoke apt (install.sh and the UCX/NIXL/etcd RUN).
- Drops `--no-cache-dir` from the two pip3 install invocations that run inside RUN layers with `--mount=type=cache,target=/root/.cache/pip`.
- Updates the cleanup() function in install_base.sh to skip `apt-get clean` and `rm -rf /var/lib/apt/lists/*` and to skip `pip3 cache purge`. Under cache-mount RUNs these commands operate on the persistent mount, not the image layer, so running them wipes the cache between builds. The mount itself ensures the layer never persists these paths, so image size is unchanged.

The release stage (lines 116, 122) already follows this pattern — this change extends it to devel.

No functional changes for builds without BuildKit cache support: the image is byte-equivalent for a from-scratch build.

Signed-off-by: Yueh-Ting Chen <yueh.ting.chen@gmail.com>
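The cleanup() change described above can be sketched as follows. This is a hypothetical stand-in, not the actual install_base.sh; the message text is illustrative:

```shell
# Hypothetical sketch of the updated cleanup() in install_base.sh.
# Under RUN --mount=type=cache, /var/cache/apt, /var/lib/apt/lists and
# /root/.cache/pip live on the persistent cache mount, not the image
# layer, so purging them would wipe the cache between builds. The mount
# already keeps those paths out of the final layer, so skipping the
# purge does not grow the image.
set -eu

cleanup() {
  # Before this change the function ran:
  #   apt-get clean
  #   rm -rf /var/lib/apt/lists/*
  #   pip3 cache purge
  # With cache mounts in place, those commands are skipped:
  echo "cleanup: skipping apt/pip cache purge (handled by BuildKit cache mounts)"
}

cleanup
```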
0010c19 to 4657e7a (Compare)
/bot run --disable-fail-fast

PR_Github #46626 [ run ] triggered by Bot. Commit:

PR_Github #46626 [ run ] completed with state

Failed Pipelines & Test Cases

/bot skip --comment "Ran a

PR_Github #46736 [ skip ] triggered by Bot. Commit:

PR_Github #46736 [ skip ] completed with state
Description

The devel stage of docker/Dockerfile.multi already mounts a BuildKit pip cache, but two issues prevent the cache from helping incremental rebuilds:

1. apt-get install calls in install_base.sh run without an apt cache mount, so every rebuild re-downloads every .deb. The install.sh layer takes ~260s on a typical compute node, dominated by apt fetch + extract.
2. pip3 install --no-cache-dir is used in two places that run inside RUN layers already mounting /root/.cache/pip, defeating the existing pip cache mount:
   - docker/Dockerfile.multi line 57 (constraints.txt install).
   - docker/common/install_nixl.sh line 21 (meson, ninja, pybind11, setuptools).

   Wheels are downloaded fresh every rebuild instead of being served from the persistent cache.
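The second issue can be illustrated with a hypothetical pair of layers (a sketch only; the actual lines in docker/Dockerfile.multi differ). When a RUN mounts the pip cache but still passes --no-cache-dir, pip never writes wheels into the mount, so nothing is available for reuse on the next build:

```dockerfile
# Cache-defeating: the mount is present, but --no-cache-dir bypasses it,
# so the persistent cache stays empty and every rebuild re-downloads.
RUN --mount=type=cache,target=/root/.cache/pip \
    pip3 install --no-cache-dir -r constraints.txt

# Fixed: without --no-cache-dir, downloaded wheels persist in the cache
# mount across builds, while the mount itself never lands in the image.
RUN --mount=type=cache,target=/root/.cache/pip \
    pip3 install -r constraints.txt
```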
What this PR changes

- docker/Dockerfile.multi: add --mount=type=cache,target=/var/cache/apt,sharing=locked and --mount=type=cache,target=/var/lib/apt,sharing=locked to the two devel-stage RUN layers that invoke apt (the install.sh layer and the UCX/NIXL/etcd layer).
- docker/Dockerfile.multi: drop --no-cache-dir from the constraints.txt pip install (the surrounding RUN already mounts the pip cache).
- docker/common/install_nixl.sh: drop --no-cache-dir from the meson/ninja/pybind11/setuptools install for the same reason.
- docker/common/install_base.sh: update cleanup() to skip apt-get clean, rm -rf /var/lib/apt/lists/*, and pip3 cache purge. Under cache-mount RUNs these commands operate on the persistent mount rather than the image layer, so running them wipes the cache between builds. The cache mounts themselves ensure the layer never persists these paths, so image size is unchanged.

The release stage (docker/Dockerfile.multi lines 116, 122) already follows this same pattern; this PR extends the pattern to devel.

Expected impact
Out of scope / potential follow-ups

- The tritondevel RUN (Dockerfile.multi lines 91-98) — same pattern, separate change to keep this PR focused.
- install_etcd.sh downloads on every build.

Test Coverage
This is a build-system-only change with no runtime code path, so it relies on the existing image-build CI for validation:

- make -C docker build end-to-end on a cold runner. A successful build proves the new --mount=type=cache lines parse correctly under BuildKit and that removing --no-cache-dir plus the cleanup changes still produce a functional tensorrt_llm/devel:latest.
- make -C docker build repeatedly on the same machine.

PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.