strip _core.so explicitly to fight 3 GB GPU wheel bloat #263
Merged
jameslehoux merged 1 commit into master on Apr 26, 2026
Conversation
CPU wheel: 30 MB. GPU wheel: 3 GB. The 100x ratio doesn't match anything
even remotely reasonable (pytorch-cuda is 750 MB max). Diagnosis:
1. auditwheel --strip only touches the *vendored* .so files in the
wheel's .libs/ folder. It does NOT strip the main extension module
(_core.so), which is where ~99% of the bloat lives: every templated
AMReX device-kernel instantiation, with debug symbols, profile name
strings, and architecture fat-binary slices, all linked together by
RDC + CUDA_RESOLVE_DEVICE_SYMBOLS across 27 translation units. (A
verification sketch follows this list.)
2. CUDA architectures: dropped from 5 SASS + 1 PTX (70/75/80/89/90)
to 2 SASS + 1 PTX (75/80 + 90-virtual). T4 + A100 ship as native
SASS (the actual GPUs Colab/HPC use day-to-day); V100/L4/H100/RTX
JIT-compile from PTX on first kernel launch (~2-5 s startup, then
cached). This is the same model pytorch-cuda uses. (Configure flags
sketched below the list.)
3. Strip the static deps before linking: strip --strip-debug on
libamrex_3d.a, libHYPRE.a, libhdf5*.a, libtiff.a kills any debug
symbols that snuck through Release before the device linker has a
chance to embed them in _core.so.
4. New post-auditwheel pass; it echoes pre/post wheel size to the
build log as a sanity check:

```bash
for whl in /tmp/repaired/*.whl; do
    du -h "$whl"                                   # pre-strip size
    wheel unpack "$whl" -d /tmp/unpacked           # unpack dir illustrative
    find /tmp/unpacked -name '*.so*' -exec strip --strip-all {} +
    wheel pack /tmp/unpacked/*/ -d /tmp/repaired   # regenerates RECORD hashes
    rm -rf /tmp/unpacked
    du -h "$whl"                                   # post-strip size
done
```
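For item 1, a quick way to confirm the bloat really lives in _core.so
rather than in the vendored .libs/. This is a sketch, not part of the
build; the paths reuse the repaired-wheel directory from step 4 and
assume only `wheel`, `find`, `du`, and `sort` are available:

```bash
# Unpack one wheel and rank every shared object by size;
# _core.so should dwarf everything under .libs/ before the strip pass.
whl=$(ls /tmp/repaired/*.whl | head -1)
wheel unpack "$whl" -d /tmp/inspect
find /tmp/inspect -name '*.so*' -exec du -h {} + | sort -h
```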
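And item 2's arch trim as it would look at configure time, assuming
the build goes through CMake's CMAKE_CUDA_ARCHITECTURES variable
(where a `-real` suffix emits SASS only and `-virtual` emits PTX
only); the source and build paths here are placeholders:

```bash
# Two native SASS targets plus one forward-compatible PTX target.
cmake -B build -S . \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES="75-real;80-real;90-virtual"
```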
Cache key bumped to v8 with arch+stripped tags so the next build can't
restore the 2.4 GB deps tarball.
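A sketch of the key shape; the exact string is hypothetical, the
point is only that the arch list and a stripped marker are baked into
the key so a tarball cached under the old five-arch, unstripped build
can never match:

```bash
# Hypothetical cache key: version bump + arch tag + stripped tag.
CACHE_KEY="deps-v8-sm75.80.90v-stripped"
```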
If this is still too big, the next levers, in order of impact-to-risk
(a sketch follows the list):
- Drop AMReX_TINY_PROFILE=ON (wheel-only; native build keeps it)
- Set AMReX_GPU_RDC=OFF (risk: breaks AMReX features that need RDC)
- Drop arch 80, ship 75-real + 90-virtual only
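The first two levers map onto AMReX's own CMake options
(AMReX_TINY_PROFILE, AMReX_GPU_RDC). A hedged sketch of a wheel-only
override; the WHEEL_BUILD guard and the flags variable are
hypothetical, not existing build-script names:

```bash
# Wheel builds drop the tiny profiler; native builds keep the default ON.
if [[ "${WHEEL_BUILD:-0}" == "1" ]]; then
  CMAKE_EXTRA_FLAGS="-DAMReX_TINY_PROFILE=OFF"
  # Second, riskier lever, only if size is still unacceptable:
  # CMAKE_EXTRA_FLAGS+=" -DAMReX_GPU_RDC=OFF"
fi
cmake -B build -S . ${CMAKE_EXTRA_FLAGS}
```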
https://claude.ai/code/session_011dJ5Bwq4Tnr8wxH597XJFf