strip _core.so explicitly to fight 3 GB GPU wheel bloat #263
Merged
jameslehoux merged 1 commit into master on Apr 26, 2026
Conversation
CPU wheel: 30 MB. GPU wheel: 3 GB. The 100x ratio doesn't match anything
even remotely reasonable (pytorch-cuda is 750 MB max). Diagnosis:
1. auditwheel --strip only touches the *vendored* .so files in the
wheel's .libs/ folder. It does NOT strip the main extension module
(_core.so), which is where ~99% of the bloat lives: every templated
AMReX device-kernel instantiation, with debug symbols, profile name
strings, and architecture fat-binary slices, all linked together by
RDC + CUDA_RESOLVE_DEVICE_SYMBOLS across 27 translation units. (A
verification sketch follows this list.)
2. CUDA architectures: dropped from 5 SASS + 1 PTX (70/75/80/89/90)
to 2 SASS + 1 PTX (75/80 + 90-virtual). T4 + A100 ship as native
SASS (the actual GPUs Colab/HPC use day-to-day); V100/L4/H100/RTX
JIT-compile from PTX on first kernel launch (~2-5 s startup, then
cached). This is the same model pytorch-cuda uses. (Configure flags
sketched below the list.)
3. Strip the static deps before linking: strip --strip-debug on
libamrex_3d.a, libHYPRE.a, libhdf5*.a, libtiff.a kills any debug
symbols that snuck through Release before the device linker has a
chance to embed them in _core.so.
4. New post-auditwheel pass; it echoes pre/post wheel size to the
build log as a sanity check:

```bash
for whl in /tmp/repaired/*.whl; do
    du -h "$whl"                                   # pre-strip size
    wheel unpack "$whl" -d /tmp/unpacked           # unpack dir illustrative
    find /tmp/unpacked -name '*.so*' -exec strip --strip-all {} +
    wheel pack /tmp/unpacked/*/ -d /tmp/repaired   # regenerates RECORD hashes
    rm -rf /tmp/unpacked
    du -h "$whl"                                   # post-strip size
done
```
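For item 1, a quick way to confirm the bloat really lives in _core.so
rather than in the vendored .libs/. This is a sketch, not part of the
build; the paths reuse the repaired-wheel directory from step 4 and
assume only `wheel`, `find`, `du`, and `sort` are available:

```bash
# Unpack one wheel and rank every shared object by size;
# _core.so should dwarf everything under .libs/ before the strip pass.
whl=$(ls /tmp/repaired/*.whl | head -1)
wheel unpack "$whl" -d /tmp/inspect
find /tmp/inspect -name '*.so*' -exec du -h {} + | sort -h
```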
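And item 2's arch trim as it would look at configure time, assuming
the build goes through CMake's CMAKE_CUDA_ARCHITECTURES variable
(where a `-real` suffix emits SASS only and `-virtual` emits PTX
only); the source and build paths here are placeholders:

```bash
# Two native SASS targets plus one forward-compatible PTX target.
cmake -B build -S . \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES="75-real;80-real;90-virtual"
```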
Cache key bumped to v8 with arch+stripped tags so the next build can't
restore the 2.4 GB deps tarball.
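A sketch of the key shape; the exact string is hypothetical, the
point is only that the arch list and a stripped marker are baked into
the key so a tarball cached under the old five-arch, unstripped build
can never match:

```bash
# Hypothetical cache key: version bump + arch tag + stripped tag.
CACHE_KEY="deps-v8-sm75.80.90v-stripped"
```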
If this is still too big, the next levers, in order of impact-to-risk
(a sketch follows the list):
- Drop AMReX_TINY_PROFILE=ON (wheel-only; native build keeps it)
- Set AMReX_GPU_RDC=OFF (risk: breaks AMReX features that need RDC)
- Drop arch 80, ship 75-real + 90-virtual only
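The first two levers map onto AMReX's own CMake options
(AMReX_TINY_PROFILE, AMReX_GPU_RDC). A hedged sketch of a wheel-only
override; the WHEEL_BUILD guard and the flags variable are
hypothetical, not existing build-script names:

```bash
# Wheel builds drop the tiny profiler; native builds keep the default ON.
if [[ "${WHEEL_BUILD:-0}" == "1" ]]; then
  CMAKE_EXTRA_FLAGS="-DAMReX_TINY_PROFILE=OFF"
  # Second, riskier lever, only if size is still unacceptable:
  # CMAKE_EXTRA_FLAGS+=" -DAMReX_GPU_RDC=OFF"
fi
cmake -B build -S . ${CMAKE_EXTRA_FLAGS}
```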
https://claude.ai/code/session_011dJ5Bwq4Tnr8wxH597XJFf