strip _core.so explicitly to fight 3 GB GPU wheel bloat #263

Merged
jameslehoux merged 1 commit into master from claude/upbeat-mccarthy-f1mNN on Apr 26, 2026

@jameslehoux

CPU wheel: 30 MB. GPU wheel: 3 GB. The 100x ratio doesn't match anything
even remotely reasonable (pytorch-cuda is 750 MB max). Diagnosis:

1. auditwheel --strip only touches the *vendored* .so files in the
   wheel's .libs/ folder. It does NOT strip the main extension module
   (_core.so), which is where ~99% of the bloat lives — every templated
   AMReX device-kernel instantiation, with debug symbols, profile name
   strings, and architecture fat-binary slices, all linked together by
   RDC + CUDA_RESOLVE_DEVICE_SYMBOLS across 27 translation units.

2. CUDA architectures: dropped from 5 SASS + 1 PTX (70/75/80/89/90)
   to 2 SASS + 1 PTX (75/80 + 90-virtual). T4 + A100 ship as native
   SASS (the actual GPUs Colab/HPC use day-to-day). L4 and other 8.x
   RTX cards run the sm_80 SASS (binary-compatible within 8.x), and
   H100 JIT-compiles from the compute_90 PTX on first kernel launch
   (~2-5s startup, then cached). PTX only JITs forward, so V100
   (sm_70) is not covered by this list. This is the same model
   pytorch-cuda uses.

3. Strip the static deps before linking: strip --strip-debug on
   libamrex_3d.a, libHYPRE.a, libhdf5*.a, libtiff.a kills any debug
   symbols that snuck through Release before the device linker has a
   chance to embed them in _core.so.

4. New post-auditwheel pass:
     for whl in /tmp/repaired/*.whl; do
       wheel unpack
       find ... -name '*.so*' -exec strip --strip-all {} +
       wheel pack  # regenerates RECORD hashes
     done
   The pass echoes pre/post wheel size to the build log for sanity.
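
Steps 3 and 4 above can be sketched roughly as follows. This is not the CI script from the PR, which is not shown here: the function names, the deps prefix, and the output directory are all hypothetical, and the elided `find ...` arguments are replaced by a plain search of the unpacked tree. It assumes `strip` (binutils) and the `wheel` CLI are on the PATH.

```shell
#!/bin/sh
# Sketch only: function names and all paths are assumptions, not the PR's code.

# Step 3: strip debug info from the static deps before _core.so is linked.
strip_static_debug() {
    prefix=$1
    find "$prefix" -type f \( -name 'libamrex_3d.a' -o -name 'libHYPRE.a' \
        -o -name 'libhdf5*.a' -o -name 'libtiff.a' \) | while IFS= read -r lib; do
        before=$(wc -c < "$lib")
        strip --strip-debug "$lib"
        echo "$lib: $before -> $(wc -c < "$lib") bytes"
    done
}

# Step 4: unpack a repaired wheel, strip every shared object, repack it.
strip_wheel() {
    whl=$1
    dest=$2
    work=$(mktemp -d) || return 1
    mkdir -p "$work/src" "$work/out" "$dest"
    echo "before: $(wc -c < "$whl") bytes  $(basename "$whl")"
    wheel unpack -d "$work/src" "$whl" || return 1
    # The pattern matches the main extension module as well as vendored libs.
    find "$work/src" -type f -name '*.so*' -exec strip --strip-all {} +
    # wheel pack rewrites RECORD, so the hashes match the stripped files.
    wheel pack -d "$work/out" "$work/src"/* || return 1
    for new in "$work/out"/*.whl; do
        echo "after:  $(wc -c < "$new") bytes  $(basename "$new")"
        mv "$new" "$dest/"
    done
    rm -rf "$work"
}

# Example (placeholder paths):
#   strip_static_debug /opt/deps
#   for whl in /tmp/repaired/*.whl; do strip_wheel "$whl" /tmp/stripped; done
```

Packing into a scratch directory first means a failed `wheel pack` never leaves a half-written wheel next to the repaired ones.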

Cache key bumped to v8 with arch+stripped tags so the next build can't
restore the 2.4 GB deps tarball.
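
To illustrate why the tags matter: if the key is derived from everything that changes the deps tarball, an older cache entry can never match. A minimal sketch follows, in which the key layout and the function name are made up for illustration; only `v8` and the idea of arch+stripped tags come from this PR.

```shell
#!/bin/sh
# Hypothetical key layout; only "v8" and the arch+stripped tags come from the PR.
deps_cache_key() {
    version=$1    # bumped by hand when the build recipe changes, e.g. v8
    archs=$2      # CUDA arch list, e.g. "75;80;90"
    strip_tag=$3  # marks that the static deps were stripped
    arch_tag=$(printf '%s' "$archs" | tr ';' '-')
    echo "gpu-deps-${version}-sm${arch_tag}-${strip_tag}"
}

deps_cache_key v8 "75;80;90" stripped
# prints: gpu-deps-v8-sm75-80-90-stripped
```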

If this is still too big, next levers (in order of impact-to-risk):
  - Drop AMReX_TINY_PROFILE=ON (wheel-only; native build keeps it)
  - Set AMReX_GPU_RDC=OFF (risk: breaks AMReX features that need RDC)
  - Drop arch 80, ship 75-real + 90-virtual only
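
Before dropping an arch, it may help to check which cards an arch list still reaches. The sketch below assumes CUDA's usual compatibility rules: a SASS slice runs only on devices of the same major version with an equal or higher minor, and PTX JIT-compiles only for devices at or above its virtual arch. The helper name is hypothetical and the arch lists are passed in as plain numbers.

```shell
#!/bin/sh
# gpu_code_path CC "REAL ARCHS" "PTX ARCHS"
# Prints sass:NN, ptx:NN or unsupported for a device of compute capability CC.
# Arch lists must be in ascending order. The function name is hypothetical.
gpu_code_path() {
    cc=$1
    best=""
    for a in $2; do
        # SASS needs the same major version and a minor <= the device's.
        if [ $((a / 10)) -eq $((cc / 10)) ] && [ "$a" -le "$cc" ]; then
            best=$a
        fi
    done
    if [ -n "$best" ]; then
        echo "sass:$best"
        return 0
    fi
    for a in $3; do
        # PTX JIT-compiles only for devices at or above its virtual arch.
        if [ "$a" -le "$cc" ]; then
            best=$a
        fi
    done
    if [ -n "$best" ]; then echo "ptx:$best"; else echo "unsupported"; fi
}

# Compare the current list (75/80 + 90-virtual) with the "drop arch 80" lever.
for cc in 70 75 80 89 90; do
    echo "cc $cc: now=$(gpu_code_path "$cc" "75 80" "90")" \
         "drop80=$(gpu_code_path "$cc" "75" "90")"
done
```

If those rules hold, dropping arch 80 would leave A100 and other 8.x cards with no usable slice, so that lever is worth testing on real hardware before shipping.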

https://claude.ai/code/session_011dJ5Bwq4Tnr8wxH597XJFf
@jameslehoux jameslehoux merged commit eba3e11 into master Apr 26, 2026
5 checks passed
@github-actions

Code Coverage Report

------------------------------------------------------------------------------
                           GCC Code Coverage Report
Directory: .
------------------------------------------------------------------------------
File                                       Lines     Exec  Cover   Missing
------------------------------------------------------------------------------
src/io/CathodeWrite.cpp                       95       83    87%   40-41,97-100,115-116,182-185
src/io/CathodeWrite.H                          1        1   100%
src/io/DatReader.cpp                         135      105    77%   26-27,30,35,92-93,99-100,107-109,135-137,141,144-148,152-155,162,164,208-209,242,245
src/io/DatReader.H                             1        1   100%
src/io/HDF5Reader.cpp                        344       84    24%   40-41,43-44,46-49,52,54-56,58-59,62,64-66,68-74,92-93,126-128,144-145,154-157,174-180,182-187,204,213-215,217,219-228,230-233,236-238,240-251,253-258,266,266,266,266,266,266,266,270,270,270,270,270,270,270,274,276,278,280,282,288,290,297,297,297,297,297,297,297,301,301,301,301,301,301,301,305,305,305,305,305,305,305-306,306,306,306,306,306,306,309,309,309,309,309,309,309-310,310,310,310,310,310,310-311,311,311,311,311,311,311,313,313,313,313,313,313,313-314,314,314,314,314,314,314-315,315,315,315,315,315,315,319,319,319,319,319,319,319,324,324,324,324,324,324,324-325,325,325,325,325,325,325-326,326,326,326,326,326,326-327,327,327,327,327,327,327,332,332,332,332,332,332,332,337,337,337,337,337,337,337-338,338,338,338,338,338,338,343,343,343,343,343,343,343,350,350,350,350,350,350,350,357-358,432-435,437-440
src/io/HDF5Reader.H                            3        3   100%
src/io/ImageLoader.cpp                        61       42    68%   25,38,48,60-62,64-70,72,77,89-90,92,94
src/io/RawReader.cpp                         266      135    50%   49-50,89-90,111-112,115-117,120-121,140-142,155-157,166-168,174-177,185-186,192-196,200-204,209-212,219-224,231-237,271,273-274,276,283-284,301,312,314,318,325,327,331-334,338,346-347,353-355,361-363,365-366,369,372,374,377-380,382-384,386,388-389,391,393-394,396,398-399,401,403-404,406,410-411,413,417-418,420,425,465,471-472,521-524,538,540-542,544,546-548,558,562-564,566,588
src/io/RawReader.H                             1        1   100%
src/io/TiffReader.cpp                        384      130    33%   59-65,67-69,71-73,75-77,79-80,82-84,86-88,90-92,94-96,98-99,101-103,106-108,111-112,114-117,119,122,124-127,143-144,148-150,152-158,160,186,210,217,226,228-231,240,242-245,248,255,288-293,306,309-317,319-320,323-327,331-335,338-342,344-348,351-357,359-363,367,369,375-377,379-393,396,398-402,404-409,413-418,420-425,428-429,432-434,555-575,577-578,581-588,590,593-609,612-614,670,673-674,677-683,685,689-700,702-703
src/io/TiffReader.H                            5        5   100%
src/props/BoundaryCondition.H                131       74    56%   63,68,70,216,224-229,233-236,238-244,247-249,252-253,255,258-261,264-265,271-272,274-279,285-287,290-296,299,303,365-366,371,373
src/props/ConnectedComponents.cpp             69       67    97%   94-95
src/props/ConnectedComponents.H                4        4   100%
src/props/DeffTensor.cpp                      62       59    95%   122,128-129
src/props/Diffusion.cpp                      510      378    74%   93-94,97-98,103-104,106-116,118,123-132,134-141,144-150,153-157,159-163,165,168-173,175-177,179,182-184,186-187,190-191,193,195-198,200,202-203,288-289,297-298,300,349,359-360,368-371,373-375,404-413,415,453,461,465-467,526-527,533,535,539,547,581,610,638,646,735-736,739-740,757-760,771-772,774,824
src/props/EffDiffFillMtx.H                   120      106    88%   58,216-217,221-225,229,231-235
src/props/EffectiveDiffusivityHypre.cpp      389      347    89%   189-191,193-197,305,367-370,479,612-615,617-619,621-624,633-636,643,672,684-687,689-691,693,705,716,718
src/props/EffectiveDiffusivityHypre.H          7        7   100%
src/props/FloodFill.cpp                       84       81    96%   94-95,203
src/props/HypreStructSolver.cpp              343      210    61%   87-88,121,133-134,145,299,309,311,314,346,356,358,361,367-370,372-376,378-379,381-385,388-389,391-392,394,397-398,401-402,404-407,409-413,415-416,418-422,425-426,428-429,431,434-435,438-439,441-443,445-451,453-457,460-461,463-464,466,469-470,473,475-477,479-485,487-491,494-495,497-498,500,503-504,507,509-511,513-516,518-522,525-526,528-529,531,534-535,538,541-542,555
src/props/HypreStructSolver.H                  6        6   100%
src/props/MacroGeometry.H                     17       17   100%
src/props/ParticleSizeDistribution.cpp        11       11   100%
src/props/ParticleSizeDistribution.H           6        6   100%
src/props/PercolationCheck.cpp                53       46    86%   32-33,49-51,68,73
src/props/PercolationCheck.H                   4        4   100%
src/props/PhysicsConfig.H                     90       89    98%   150
src/props/ResultsJSON.H                      225      222    98%   242,395,416
src/props/REVStudy.cpp                       151      128    84%   72,83-91,159,170-173,175,183-186,188-190
src/props/SolverConfig.H                      32       20    62%   30,32,37-44,75-76
src/props/SpecificSurfaceArea.cpp             56       55    98%   59
src/props/SpecificSurfaceArea.H                6        6   100%
src/props/ThroughThicknessProfile.cpp         38       38   100%
src/props/ThroughThicknessProfile.H            5        5   100%
src/props/Tortuosity.H                         2        2   100%
src/props/TortuosityDirect.cpp               219      191    87%   81-83,86,100-106,113-114,125,134,140,202-209,226,394,424,433
src/props/TortuosityDirect.H                   5        5   100%
src/props/TortuosityHypre.cpp                784      563    71%   149-150,155-156,240-243,246-248,311,335-337,340-341,343,353-355,358-360,390-393,573,597,601,622,639-640,642-644,646-655,657,660-664,668-670,673-680,682-686,690-692,694-696,698-707,709-713,715-726,728-731,733,743,749-752,754-756,765-768,770-772,788,791-792,815-820,831-834,836,873,878-881,884-886,890-893,895,897-900,902,907-909,911,960,969,974,977-982,998-1001,1015-1019,1024-1029,1039-1043,1048-1053,1058-1062,1065-1068,1075-1078,1089,1098,1100,1104,1106,1128,1159-1160,1246-1248,1374-1377
src/props/TortuosityHypre.H                   15       15   100%
src/props/TortuosityHypreFill.H              127       98    77%   85,203,205-212,237-239,241-245,247-248,250,252,255-256,258-262
src/props/TortuosityKernels.H                 97       53    54%   52,56-60,62-65,69-74,76-80,84-85,90,129,143,157,243,245-248,250-253,257-260,262-265
src/props/TortuosityMLMG.cpp                  99       91    91%   160,181-183,185-186,193,206
src/props/TortuosityMLMG.H                     1        1   100%
src/props/TortuositySolverBase.cpp           301      237    78%   70-72,74-75,94-101,104,106,142-145,200,203,205,255,280,298,327,391,394-396,398,406-409,411-417,422,427-429,435-436,438-440,454,460,464-465,467,478,492,496-498,500,502,506
src/props/TortuositySolverBase.H              13       13   100%
src/props/VolumeFraction.cpp                  25       25   100%
src/props/VolumeFraction.H                     4        4   100%
------------------------------------------------------------------------------
TOTAL                                       5407     3874    71%
------------------------------------------------------------------------------


Generated by CI — coverage data from gcovr

@codecov
codecov Bot commented Apr 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
