Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rocmPackages.composable_kernel: compress output #299589

Merged
merged 2 commits into from
Apr 14, 2024

Conversation

mschwaig
Copy link
Member

@mschwaig mschwaig commented Mar 27, 2024

Description of changes

I made a mistake in my ROCm 6.0 PR #287846, not realizing that rocmPackages.composable_kernel would be larger than Hydra's size limit. See the comments here for details: #287846 (comment)

This PR compresses the huge lib/libdevice_operations.a file that's in the output of that derivation, and then decompresses it again in another derivation based on runCommandLocal.

Is this kind of fix acceptable?
Is how I implemented this configurable enough and the derivations carry the correct metadata?
How else could this issue be addressed?
Not having that derivation cached is a huge pain, since the build takes a few hours at least.

I am running nixpkgs-review on this right now.

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@mschwaig
Copy link
Member Author

Result of nixpkgs-review pr 299589 run on x86_64-linux 1

1 package marked as broken and skipped:
  • rocmPackages.migraphx
2 packages built:
  • rocmPackages.composable_kernel
  • rocmPackages.miopen

@mschwaig mschwaig marked this pull request as ready for review March 28, 2024 10:30
@ulrikstrid
Copy link
Member

I'm going to invoke @wegank and @SuperSandro2000 to give their view of this

@SuperSandro2000
Copy link
Member

lib/libdevice_operations.a

Do we even need the static library? Is it stripped?

@GZGavinZhao
Copy link
Contributor

Do we even need the static library?

It is intended to be consumed as a static library. Building it as a shared library needs patching. I've tried that in the past, and every time MIOpen would fail to find composable_kernel properly if I do that. I will try this again and see if the situation now improves.

@GZGavinZhao
Copy link
Contributor

GZGavinZhao commented Apr 3, 2024

An option would be to cherry-pick llvm/llvm-project@7e28234 to ROCm's LLVM.

Or maybe backport ROCm/composable_kernel#1044 and ROCm/MIOpen#2526. I've tried this a few months ago (GZGavinZhao/composable_kernel@c8d9ac8) and I don't recall getting too much space improvement, but I could've mis-remembered since my original intention wasn't to save space.

@mschwaig
Copy link
Member Author

mschwaig commented Apr 3, 2024

lib/libdevice_operations.a

Do we even need the static library? Is it stripped?

Yes, we need it. It is a fat binary file with a a lot of of GPU-specific code for each target GPU, which is used by other parts of ROCm.

I checked the log and yes, it is stripped:

stripping (with command strip and flags -S -p) in  /nix/store/sfybgqj64vhgp386fd3xw0fkjx0wvzwf-composable_kernel-6.0.2/lib

I would prefer if we first found an acceptable solution to get caching to work for users of ROCm, because especially composable_kernel is a real pain to build in terms of hardware requirements and how long the build takes. We could then always switch to a better solution when we find one that we can implement with a reasonable amount of effort.

Another alternative to this PR would probably be setting a long enough meta.timeout for the few consumers of composable_kernel (migraphx and miopen) so it can be rebuilt on the fly while building those.
Though I barely know anything about Hydra, so I don't know if it would behave how I would imagine.

In case the bandwidth and storage costs are actually proportional to compressed size I think it would make sense to tie Hydra's output size limit to compressed size to avoid these kinds of workarounds as much as possible.

@mschwaig
Copy link
Member Author

mschwaig commented Apr 3, 2024

An option would be to cherry-pick llvm/llvm-project@7e28234 to ROCm's LLVM.

Thanks for linking this, I was aware of the -compress option for clang-offload-bundler and that it is not present yet in version of LLVM we use in ROCm, but I was not aware of the associated back and forth. I might find the time to see if this works out of the box, but currently I cannot invest a lot of time into getting it to work.

An option would be to cherry-pick llvm/llvm-project@7e28234 to ROCm's LLVM.

Or maybe backport ROCm/composable_kernel#1044 and ROCm/MIOpen#2526. I've tried this a few months ago (GZGavinZhao/composable_kernel@c8d9ac8) and I don't recall getting too much space improvement, but I could've mis-remembered since my original intention wasn't to save space.

To me it looks like this just splits up that huge file in a few smaller ones that add up to roughly the same total size again, so it does not help us with the self-imposed limit we have here. Going by that first PR it also looks like it would need corresponding changes in other places. I cannot invest the time required to try this.

@mschwaig
Copy link
Member Author

mschwaig commented Apr 3, 2024

We could also try dropping the gfx940 and gfx941 targets as suggested here #298388 (comment) to see if that gets us below the size limit.

@mschwaig
Copy link
Member Author

mschwaig commented Apr 3, 2024

Here is a bit of an overview of what's in that file:

$ bloaty /nix/store/sfybgqj64vhgp386fd3xw0fkjx0wvzwf-composable_kernel-6.0.2/lib/libdevice_operations.a
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  75.4%  2.78Gi  93.7%  2.78Gi    .hip_fatbin
   8.1%   307Mi   0.0%       0    .strtab
   5.9%   221Mi   0.0%       0    [AR Symbol Table]
   5.8%   219Mi   4.7%   142Mi    [954581 Others]
   2.4%  90.7Mi   0.0%       0    [ELF Section Headers]
   0.8%  29.9Mi   1.0%  29.9Mi    .rodata.str1.1
   0.6%  23.6Mi   0.0%       0    .symtab
   0.4%  13.6Mi   0.4%  13.6Mi    .eh_frame
   0.2%  8.49Mi   0.0%       0    .rela.eh_frame
   0.1%  5.64Mi   0.0%       0    .group
   0.1%  3.86Mi   0.1%  3.86Mi    .text
   0.1%  2.97Mi   0.0%       0    .rela.text
   0.0%   877Ki   0.0%       0    [Unmapped]
   0.0%   688Ki   0.0%       0    .rela.text._ZN2ck15get_device_nameB5cxx11Ev
   0.0%   522Ki   0.0%   522Ki    .text._ZN2ck15get_device_nameB5cxx11Ev
   0.0%   361Ki   0.0%       0    .llvm_addrsig
   0.0%   310Ki   0.0%       0    .rela.text._Z15hip_check_error10hipError_t
   0.0%   165Ki   0.0%       0    .rela.text._ZNK2ck16tensor_operation6device12BaseOperator17GetTypeIdHashCodeB5cxx11Ev
   0.0%   160Ki   0.0%   160Ki    .text._Z15hip_check_error10hipError_t
   0.0%   153Ki   0.0%   153Ki    .text._ZNK2ck16tensor_operation6device12BaseOperator17GetTypeIdHashCodeB5cxx11Ev
   0.0%   105Ki   0.0%       0    .rela.data.rel.ro
 100.0%  3.69Gi 100.0%  2.97Gi    TOTAL

.hip_fatbin contains the GPU specific code and this is what clang-offload-bundler will be able to compress in the future.

@Flakebi
Copy link
Member

Flakebi commented Apr 3, 2024

The workaround here looks good to me.
(Much more appealing than rebuilding for its dependencies.)

Thanks for putting so much time and effort into this.

@mschwaig
Copy link
Member Author

mschwaig commented Apr 3, 2024

Another alternative to this PR would probably be setting a long enough meta.timeout for the few consumers of composable_kernel (migraphx and miopen) so it can be rebuilt on the fly while building those.
Though I barely know anything about Hydra, so I don't know if it would behave how I would imagine.

Turns out this would not even work because miopen is practically the same size as composable_kernel so it will not get cached either. In practice this PR still reduces the problem by orders of magnitude, since building miopen will only take ROCm users about 10 minutes. Packages further downstream like torchWithRocm should be cached, since they are a lot smaller.

@mschwaig
Copy link
Member Author

mschwaig commented Apr 3, 2024

We could also try dropping the gfx940 and gfx941 targets as suggested here #298388 (comment) to see if that gets us below the size limit.

I tried this as well now and it does not work either. It only gets us down to 3.1 GB.

@mschwaig
Copy link
Member Author

mschwaig commented Apr 4, 2024

An option would be to cherry-pick llvm/llvm-project@7e28234 to ROCm's LLVM.

I have backported this commit here mschwaig/llvm-project@527b6ea and it looks like that will work. The size of composable_kernel and miopen goes down from 3.7 GB to 1.3 GB when I do this.

What I'm still having trouble with:

  1. Applying this commit as a patch in Nix. For some reason none of the changes, besides the ones creating new files, apply at all because the files in question are not found. I suspect that this has something to do with sourceRoot, but I could not figure out how to get past that issue.
  2. Right now I'm just wrapping clang-offload-bundler to add the -compress argument instead of only modifying the relevant invocations, which I was not able to identify yet. This might be fine, but I am not sure.
  3. I have not tested that applications keep working.

You can find what I have here: mschwaig@fe795d4

To conserve reviewer's time I am moving this back into the draft stage.

@mschwaig mschwaig marked this pull request as draft April 4, 2024 11:52
@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/trouble-patching-llvm-source-code-for-rocm/42772/1

@mschwaig mschwaig force-pushed the compress-composable-kernel branch 3 times, most recently from 7fd6313 to f7b817c Compare April 7, 2024 14:55
This patches the clang-offload-bundler tool to add a compression option
from a more recent version of clang.
This compression option reduces the size of ROCm's fat binaries.
Those binaries contain .hip_fatbin sections with GPU-specific code,
for each target.
Compression is automatically turned on for all produced outputs
via a wrapper, because it's difficult to identify all the places
where the -compression argument would be needed.

Once upsteam introduces handeling for this argument,
we should drop the wrapper again.
This transistion will create inconsistsency, but I do not think that
it will impact any actual users and it's what's practical to implement.
@mschwaig
Copy link
Member Author

mschwaig commented Apr 7, 2024

I have a working version with the -compress option for clang-offload-bundler now, so this is ready for review again.

Result of nixpkgs-review pr 299589 run on x86_64-linux 1

11 packages marked as broken and skipped:
  • rocmPackages.llvm.flang
  • rocmPackages.llvm.flang.doc
  • rocmPackages.llvm.flang.info
  • rocmPackages.llvm.flang.man
  • rocmPackages.llvm.libclc
  • rocmPackages.migraphx
  • rocmPackages.mivisionx
  • rocmPackages.mivisionx-cpu
  • rocmPackages.mivisionx-hip
  • rocmPackages.rdc
  • rocmPackages.rdc.doc
88 packages built:
  • blender-hip
  • rocmPackages.clang-ocl
  • rocmPackages.clr
  • rocmPackages.clr.icd
  • rocmPackages.composable_kernel
  • rocmPackages.half
  • rocmPackages.hip-common
  • rocmPackages.hipblas
  • rocmPackages.hipcc
  • rocmPackages.hipcub
  • rocmPackages.hipfft
  • rocmPackages.hipfort
  • rocmPackages.hipify
  • rocmPackages.hiprand
  • rocmPackages.hipsolver
  • rocmPackages.hipsparse
  • rocmPackages.hsa-amd-aqlprofile-bin
  • rocmPackages.llvm.bintools
  • rocmPackages.llvm.clang
  • rocmPackages.llvm.clang-tools-extra
  • rocmPackages.llvm.clang-tools-extra.doc
  • rocmPackages.llvm.clang-tools-extra.info
  • rocmPackages.llvm.clang-tools-extra.man
  • rocmPackages.llvm.clang-unwrapped
  • rocmPackages.llvm.clang-unwrapped.doc
  • rocmPackages.llvm.clang-unwrapped.info
  • rocmPackages.llvm.clang-unwrapped.man
  • rocmPackages.llvm.compiler-rt
  • rocmPackages.llvm.libc
  • rocmPackages.llvm.libc.doc
  • rocmPackages.llvm.libcxx
  • rocmPackages.llvm.libcxx.doc
  • rocmPackages.llvm.libcxxabi
  • rocmPackages.llvm.libunwind
  • rocmPackages.llvm.libunwind.doc
  • rocmPackages.llvm.lld
  • rocmPackages.llvm.lld.doc
  • rocmPackages.llvm.lldb
  • rocmPackages.llvm.lldb.doc
  • rocmPackages.llvm.lldb.info
  • rocmPackages.llvm.lldb.man
  • rocmPackages.llvm.llvm
  • rocmPackages.llvm.llvm.doc
  • rocmPackages.llvm.llvm.info
  • rocmPackages.llvm.llvm.man
  • rocmPackages.llvm.mlir
  • rocmPackages.llvm.openmp
  • rocmPackages.llvm.openmp.doc
  • rocmPackages.llvm.openmp.info
  • rocmPackages.llvm.openmp.man
  • rocmPackages.llvm.polly
  • rocmPackages.llvm.polly.doc
  • rocmPackages.llvm.polly.info
  • rocmPackages.llvm.polly.man
  • rocmPackages.llvm.pstl
  • rocmPackages.llvm.rocmClangStdenv
  • rocmPackages.miopen
  • rocmPackages.rccl
  • rocmPackages.rocalution
  • rocmPackages.rocblas
  • rocmPackages.rocdbgapi
  • rocmPackages.rocdbgapi.doc
  • rocmPackages.rocfft
  • rocmPackages.rocgdb
  • rocmPackages.rocm-cmake
  • rocmPackages.rocm-comgr
  • rocmPackages.rocm-core
  • rocmPackages.rocm-device-libs
  • rocmPackages.rocm-runtime
  • rocmPackages.rocm-smi
  • rocmPackages.rocm-thunk
  • rocmPackages.rocminfo
  • rocmPackages.rocmlir
  • rocmPackages.rocmlir-rock
  • rocmPackages.rocmlir.external
  • rocmPackages.rocprim
  • rocmPackages.rocprofiler
  • rocmPackages.rocr-debug-agent
  • rocmPackages.rocsolver
  • rocmPackages.rocsparse
  • rocmPackages.rocthrust
  • rocmPackages.roctracer
  • rocmPackages.rocwmma
  • rocmPackages.rpp (rocmPackages.rpp-hip)
  • rocmPackages.rpp-cpu
  • rocmPackages.rpp-opencl
  • rocmPackages.tensile
  • rocmPackages.tensile.dist

@mschwaig mschwaig marked this pull request as ready for review April 7, 2024 15:29
Copy link
Member

@ulrikstrid ulrikstrid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good now. Landing this would be huge since it takes quite some time to build composable_kernel

@mschwaig
Copy link
Member Author

I think this looks good now. Landing this would be huge since it takes quite some time to build composable_kernel

As the author of this PR I think its ready too. 👍 I ran nixpkgs-review on it and manually tested if torchWithRocm still works.

(This PR is also a prerequisite for merging #298388 as is, because that PR simplifies the rocblas build in a way that would otherwise exceed the cache size limit as well.)

@ulrikstrid
Copy link
Member

#298388 (comment)

This comment is interesting, maybe we can drop those versions that are mentioned? This can be done in a separate PR however so that we can merge this without anything that could be considered controversial sneaking in.

@mschwaig
Copy link
Member Author

mschwaig commented Apr 14, 2024

I agree that dropping those ISAs is worth looking into.
I have opened issue #303994 for it, since I do not have the time to open a PR for it in the coming week.

@wegank wegank merged commit e8d27c9 into NixOS:master Apr 14, 2024
23 checks passed
@mschwaig
Copy link
Member Author

Unfortunately I made a mistake when testing this PR and this segfaults when properly tested with llama.cpp:

#0  0x00007fffec3bb452 in amd::Elf::getElfSize(void const*) () from /nix/store/md9m0qlr5qzv1yswqqhx4md4kj8njawf-clr-6.0.2/lib/libamdhip64.so.6
#1  0x00007fffec035025 in hip::CodeObject::ElfSize(void const*) () from /nix/store/md9m0qlr5qzv1yswqqhx4md4kj8njawf-clr-6.0.2/lib/libamdhip64.so.6
#2  0x00007fffec0b043e in hip::FatBinaryInfo::ExtractFatBinaryUsingCOMGR(std::vector<hip::Device*, std::allocator<hip::Device*> > const&) ()
   from /nix/store/md9m0qlr5qzv1yswqqhx4md4kj8njawf-clr-6.0.2/lib/libamdhip64.so.6
#3  0x00007fffec0b1b6e in hip::FatBinaryInfo::ExtractFatBinary(std::vector<hip::Device*, std::allocator<hip::Device*> > const&) () from /nix/store/md9m0qlr5qzv1yswqqhx4md4kj8njawf-clr-6.0.2/lib/libamdhip64.so.6
#4  0x00007fffec038add in hip::StatCO::digestFatBinary(void const*, hip::FatBinaryInfo*&) () from /nix/store/md9m0qlr5qzv1yswqqhx4md4kj8njawf-clr-6.0.2/lib/libamdhip64.so.6
#5  0x00007fffec276a2c in PlatformState::digestFatBinary(void const*, hip::FatBinaryInfo*&) () from /nix/store/md9m0qlr5qzv1yswqqhx4md4kj8njawf-clr-6.0.2/lib/libamdhip64.so.6
#6  0x00007fffec276430 in PlatformState::init() () from /nix/store/md9m0qlr5qzv1yswqqhx4md4kj8njawf-clr-6.0.2/lib/libamdhip64.so.6
#7  0x00007fffec04d0f6 in hip::init(bool*) () from /nix/store/md9m0qlr5qzv1yswqqhx4md4kj8njawf-clr-6.0.2/lib/libamdhip64.so.6
#8  0x00007fffec066dcd in void std::__invoke_impl<void, void (&)(bool*), bool*>(std::__invoke_other, void (&)(bool*), bool*&&) () from /nix/store/md9m0qlr5qzv1yswqqhx4md4kj8njawf-clr-6.0.2/lib/libamdhip64.so.6
#9  0x00007fffec066d9d in std::__invoke_result<void (&)(bool*), bool*>::type std::__invoke<void (&)(bool*), bool*>(void (&)(bool*), bool*&&) ()
   from /nix/store/md9m0qlr5qzv1yswqqhx4md4kj8njawf-clr-6.0.2/lib/libamdhip64.so.6
#10 0x00007fffec066d6c in std::call_once<void (&)(bool*), bool*>(std::once_flag&, void (&)(bool*), bool*&&)::{lambda()#1}::operator()() const ()
   from /nix/store/md9m0qlr5qzv1yswqqhx4md4kj8njawf-clr-6.0.2/lib/libamdhip64.so.6
#11 0x00007fffec066d44 in std::once_flag::_Prepare_execution::_Prepare_execution<std::call_once<void (&)(bool*), bool*>(std::once_flag&, void (&)(bool*), bool*&&)::{lambda()#1}>(void (&)(bool*))::{lambda()#1}::operator()() const () from /nix/store/md9m0qlr5qzv1yswqqhx4md4kj8njawf-clr-6.0.2/lib/libamdhip64.so.6
#12 0x00007fffec066d11 in std::once_flag::_Prepare_execution::_Prepare_execution<std::call_once<void (&)(bool*), bool*>(std::once_flag&, void (&)(bool*), bool*&&)::{lambda()#1}>(void (&)(bool*))::{lambda()#1}::__invoke() () from /nix/store/md9m0qlr5qzv1yswqqhx4md4kj8njawf-clr-6.0.2/lib/libamdhip64.so.6
#13 0x00007fffebbf50df in __pthread_once_slow () from /nix/store/ddwyrxif62r8n6xclvskjyy6szdhvj60-glibc-2.39-5/lib/libc.so.6
#14 0x00007fffec05e5eb in __gthread_once(int*, void (*)()) () from /nix/store/md9m0qlr5qzv1yswqqhx4md4kj8njawf-clr-6.0.2/lib/libamdhip64.so.6
#15 0x00007fffec061fe1 in void std::call_once<void (&)(bool*), bool*>(std::once_flag&, void (&)(bool*), bool*&&) () from /nix/store/md9m0qlr5qzv1yswqqhx4md4kj8njawf-clr-6.0.2/lib/libamdhip64.so.6
#16 0x00007fffec077144 in hipGetDeviceCount () from /nix/store/md9m0qlr5qzv1yswqqhx4md4kj8njawf-clr-6.0.2/lib/libamdhip64.so.6
#17 0x00007ffff6515442 in (anonymous namespace)::TensileHost::TensileHost() () from /nix/store/rydjpmr5awlzznyp4a1r3fhwvyb4gnaj-rocblas-6.0.2/lib/librocblas.so.4
#18 0x00007ffff65098de in (anonymous namespace)::get_library_and_adapter(std::shared_ptr<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution> >*, std::shared_ptr<hipDeviceProp_tR0600>*, int) () from /nix/store/rydjpmr5awlzznyp4a1r3fhwvyb4gnaj-rocblas-6.0.2/lib/librocblas.so.4
#19 0x00007ffff7e32d56 in ggml_cuda_init() () from /nix/store/9sq0q30g65xyd02yzvixfl39y6wdw5df-llama-cpp-rocm-0.0.0/lib/libllama.so
#20 0x00007ffff7e33399 in ggml_backend_cuda_get_device_count () from /nix/store/9sq0q30g65xyd02yzvixfl39y6wdw5df-llama-cpp-rocm-0.0.0/lib/libllama.so
#21 0x000055555559c31f in _GLOBAL__sub_I_llama_bench.cpp ()
#22 0x00007fffebb8c23e in __libc_start_main_impl () from /nix/store/ddwyrxif62r8n6xclvskjyy6szdhvj60-glibc-2.39-5/lib/libc.so.6
#23 0x0000555555572275 in _start ()

@mschwaig
Copy link
Member Author

mschwaig commented Apr 16, 2024

clang-offload-bundler in rocmPackages.llvm.openmp also needs rocmPackages.llvm.llvm on it's path to even get there.

@GZGavinZhao do you happen to have a working version of those the patch for llvm/clang?
I saw in llvm/llvm-project#67162 that you have GZGavinZhao/rocm-llvm-project@c859f20

@ulrikstrid
Copy link
Member

ulrikstrid commented Apr 16, 2024

Should we roll back this @mschwaig ? Or does someone have time to fix it before 24.05?

@GZGavinZhao
Copy link
Contributor

GZGavinZhao commented Apr 16, 2024

@GZGavinZhao do you happen to have a working version of those the patch for llvm/clang?

I couldn't get LLVM to build with my patches. IIRC it would fail during the test phase, which was sort of expected since I removed the test file changes in the patches to prevent conflicts, but I didn't have time to dig deep into it.

From the stack trace, it seems like the problem is not with LLVM's side, but that the HIP clr runtime and/or ROCm-CompilerSupport doesn't support compressed device binaries yet.

I was able to reproduce this segmentation fault just with the vectoradd_hip example file (hipcc vectoradd_hip.cpp -o vectoradd_hip). However, I'd note that upon inspecting the verbose compilation output, it seems like clang-offload-bundler compress the device binary regardless of whether --offload-compress was passed. You can verify this by adding the --verbose flag when compiling and you would see this

 "/nix/store/y44kwxkk8pc4wdj0gqhnbdzjm4c7w82w-rocm-llvm-clang-6.0.2/bin/clang-offload-bundler" -type=o -bundle-align=4096 -targets=host-x86_64-unknown-linux,hipv4-amdgcn-amd-amdhsa--gfx1032,hipv4-amdgcn-amd-amdhsa--gfx90c -input=/dev/null -input=/tmp/vectoradd_hip-gfx1032-406a80.out -input=/tmp/vectoradd_hip-gfx90c-2193ac.out -output=/tmp/vectoradd_hip-71eb0a.hipfb -verbose
Compressed bundle format version: 1
Compression method used: zlib
Binary size before compression: 67184 bytes
Binary size after compression: 27668 bytes
Truncated MD5 hash: 0x5296b68059d8fbbe
===-------------------------------------------------------------------------===
                     Timer group for clang offload bundler
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0051 seconds (0.0051 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0049 ( 99.4%)   0.0000 (  0.0%)   0.0049 ( 96.8%)   0.0050 ( 96.8%)  Compression time
   0.0000 (  0.6%)   0.0001 (100.0%)   0.0002 (  3.2%)   0.0002 (  3.2%)  Hash calculation time
   0.0050 (100.0%)   0.0001 (100.0%)   0.0051 (100.0%)   0.0051 (100.0%)  Total

This is weird because according to llvm/llvm-project#67162, device binary compression should be off by default, but it seems like clang-offload-bundler would compress the device binary even if it doesn't receive the -compress flag.

@mschwaig
Copy link
Member Author

Thanks for the minimal test case and all of the additional info.

Compression always being on is due to the wrapper I added here:

extraPostInstall = ''
mv bin/clang-tblgen $out/bin
# add wrapper to compress embedded accelerator-specific code
# this makes the output of composable_kernel significantly smaller right now
# TODO: remove this once ROCm does it out of the box
mv $out/bin/clang-offload-bundler $out/bin/clang-offload-bundler-unwrapped
makeWrapper $out/bin/clang-offload-bundler-unwrapped $out/bin/clang-offload-bundler \
--add-flags '-compress'
'';

@mschwaig
Copy link
Member Author

Should we roll back this @mschwaig ? Or does someone have time to fix it before 24.05?

I don't quite understand the timelines involved, as far as I know this would have to be fixed until tomorrow, to get into the release.

Until tomorrow (or within the next week even) I don't think this change can be salvaged.
I could spend more time to look into it starting in a few days, and maybe find and easy fix but the much more likely outcome is that this is not possible yet, because not all required components support it and we have to look at my previous proposal for working around the size limit again (f199cf1).

We could

  • revert the changes in this PR, or
  • replace them with what I proposed previously.

I would assume that getting around the Hydra size limit with either solution is something we could backport into the release later on as well (maybe after we have something working on unstable for some time)?
In that case I think we should just revert this PR.

@GZGavinZhao
Copy link
Contributor

I personally suggest a revert as well. I just dug into clr and comgr code to check and found that even AMD themselves currently have no support for compressed device binaries. This means that we would essentially have to implement support for compressed device binaries in comgr by ourselves. It shouldn't be hard, but I don't think it's something doable in one or two days. A week is probably enough, but if the deadline for 24.05 is tomorrow, then I suggest we should just revert the PR.

@wegank
Copy link
Member

wegank commented Apr 17, 2024

The deadline is certainly not tomorrow, as it only applies to breaking changes to Release Critical Packages.

EDIT: reverted in #304672.

@surfaceflinger
Copy link
Member

Not sure how ROCm works, but if we're not getting ROCm in Hydra in 24.05 then would it be possible instead to make an override to build a single target? (if that's even possible)
Could be done as a module that would have effect on everything that uses ROCm. Tried building ollama yesterday on OC'd 5600x and it was like 2-3 hours before failing :(

@GZGavinZhao
Copy link
Contributor

Thanks to a very new PR (4 days ago) llvm/llvm-project#88827, I managed to make compressed device binaries work for vectoradd_hip.cpp. Next week I will start cleaning up my work locally, try backporting the commit, and run tests on composable_kernel.

@ulrikstrid
Copy link
Member

@GZGavinZhao please ping me for testing etc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants