
[Backends] wrap allocs in try catch #1027

Merged
tdavidcl merged 5 commits into Shamrock-code:main from tdavidcl:internal-alloc-2dot0
Jun 8, 2025


@tdavidcl
Member

@tdavidcl tdavidcl commented Jun 2, 2025

No description provided.

@tdavidcl
Member Author

tdavidcl commented Jun 7, 2025

On the DGX, with 60M particles on a single GPU:

---------------- t = 0, dt = 0 ---------------- 
Info: summary :                                                                                    [LoadBalance][rank=0] 
Info:  - strategy "psweep" : max = 60519200 min = 60519200                                         [LoadBalance][rank=0] 
Info:  - strategy "round robin" : max = 60519200 min = 60519200                                    [LoadBalance][rank=0] 
Info: Loadbalance stats :                                                                          [LoadBalance][rank=0]
    npatch = 64
    min = 60519200
    max = 60519200
    avg = 60519200
    efficiency = 100.00%  
Info: Scheduler step timings :                                                                       [Scheduler][rank=0]
   metadata sync     : 20.39 us   (3.5%)
   patch tree reduce : 9.77 us    (1.7%)
   gen split merge   : 1042.00 ns (0.2%)
   split / merge op  : 0/0
   apply split merge : 1042.00 ns (0.2%)
   LB compute        : 526.52 us  (89.9%)
   LB move op cnt    : 0
   LB apply          : 1823.00 ns (0.3%)  
Info: Scheduler step timings :                                                                       [Scheduler][rank=0]
   metadata sync     : 5.12 us    (37.6%)  
Warning: smoothing length is not converged, rerunning the iterator ...                         [Smoothinglength][rank=0]
     largest h = 0.0018919262923627275 unconverged cnt = 60519200   
Warning: smoothing length is not converged, rerunning the iterator ...                         [Smoothinglength][rank=0]
     largest h = 0.0020811189215990005 unconverged cnt = 60519200   
Warning: smoothing length is not converged, rerunning the iterator ...                         [Smoothinglength][rank=0]
     largest h = 0.0022892308137589007 unconverged cnt = 60519200   
Warning: smoothing length is not converged, rerunning the iterator ...                         [Smoothinglength][rank=0]
     largest h = 0.002518153895134791 unconverged cnt = 60519200   
Warning: smoothing length is not converged, rerunning the iterator ...                         [Smoothinglength][rank=0]
     largest h = 0.0027699692846482704 unconverged cnt = 60519200   
Warning: smoothing length is not converged, rerunning the iterator ...                         [Smoothinglength][rank=0]
     largest h = 0.003046966213113098 unconverged cnt = 60519200   
Warning: smoothing length is not converged, rerunning the iterator ...                         [Smoothinglength][rank=0]
     largest h = 0.003351662834424408 unconverged cnt = 60519200   
<CUDA>[ERROR]: 
UR CUDA ERROR:
	Value:           2
	Name:            CUDA_ERROR_OUT_OF_MEMORY
	Description:     out of memory
	Function:        USMDeviceAllocImpl
	Source Location: /local/tdavidcl/Shamrock_tdavidcl/build_intel/.env/intelllvm-git/build/_deps/unified-runtime-src/source/adapters/cuda/usm.cpp:139

Error: Exception created :                                                                           [Exception][rank=0]
USM allocation failed, details : sz=7645440, target=device, alignment=8, alloc result = 0x0
    World infos :
        World size = 1
        World rank = 0
    Device infos :
        Device name = NVIDIA A100-SXM4-40GB
    Allocs :
        max_allocated_byte_host = 36.25 MB
        max_allocated_byte_device = 33.73 GB
        max_allocated_byte_shared = 0.00 B
        allocated_byte_host = 0.00 B
        allocated_byte_device = 33.73 GB
        allocated_byte_shared = 0.00 B
        
---- Source Location ----
/local/tdavidcl/Shamrock_tdavidcl/src/shambackends/src/details/internal_alloc.cpp:273:13
call = void *sham::details::internal_alloc(size_t, std::shared_ptr<DeviceScheduler>, std::optional<size_t>) [target = sham::device]
stacktrace :
  0 : int main(int, char **) (/local/tdavidcl/Shamrock_tdavidcl/src/main.cpp:61:16)
  1 : int main(int, char **) (/local/tdavidcl/Shamrock_tdavidcl/src/main.cpp:204:28)
  2 : shammodels::sph::TimestepLog shammodels::sph::Solver<sycl::vec<double, 3>, shammath::M4>::evolve_once() [Tvec = sycl::vec<double, 3>, SPHKernel = shammath::M4] (/local/tdavidcl/Shamrock_tdavidcl/src/shammodels/sph/src/Solver.cpp:949:16)
  3 : void shammodels::sph::Solver<sycl::vec<double, 3>, shammath::M4>::sph_prestep(Tscal, Tscal) [Tvec = sycl::vec<double, 3>, SPHKernel = shammath::M4] (/local/tdavidcl/Shamrock_tdavidcl/src/shammodels/sph/src/Solver.cpp:396:16)
  4 : ComputeField<T> shamrock::SchedulerUtility::make_compute_field(std::string, u32, T) [T = double] (/local/tdavidcl/Shamrock_tdavidcl/src/shamrock/include/shamrock/scheduler/SchedulerUtility.hpp:282:24)
  5 : USMPtrHolder<target> sham::details::create_usm_ptr(size_t, std::shared_ptr<DeviceScheduler>, std::optional<size_t>) [target = sham::device] (/local/tdavidcl/Shamrock_tdavidcl/src/shambackends/src/details/memoryHandle.cpp:33:20)
  6 : void *sham::details::internal_alloc(size_t, std::shared_ptr<DeviceScheduler>, std::optional<size_t>) [target = sham::device] (/local/tdavidcl/Shamrock_tdavidcl/src/shambackends/src/details/internal_alloc.cpp:188:20)
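
The report above comes from a failed device allocation that is turned into a single exception carrying the request (`sz`, `alignment`) and the allocator statistics. As a rough illustration of that pattern only, here is a minimal sketch in plain C++. It is not the Shamrock implementation: `RawAlloc`, `AllocStats`, `describe_failure` and `checked_alloc` are hypothetical names, and a `std::function` stands in for the real backend allocation call, which is assumed to either throw or return a null pointer on failure.

```cpp
#include <cstddef>
#include <exception>
#include <functional>
#include <new>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the backend allocator (for example a USM device
// allocation). Assumption: on failure it either throws or returns nullptr.
using RawAlloc = std::function<void *(std::size_t size, std::size_t alignment)>;

// Hypothetical bookkeeping, loosely mirroring the "Allocs" section of the report.
struct AllocStats {
    std::size_t allocated_byte_device     = 0;
    std::size_t max_allocated_byte_device = 0;
};

// Build one message holding both the request and the allocator state.
inline std::string describe_failure(
    std::size_t size, std::size_t alignment, const AllocStats &stats,
    const std::string &cause) {
    std::ostringstream out;
    out << "allocation failed, details : sz=" << size
        << ", alignment=" << alignment << ", cause=" << cause << "\n"
        << "    allocated_byte_device = " << stats.allocated_byte_device << " B\n"
        << "    max_allocated_byte_device = " << stats.max_allocated_byte_device
        << " B";
    return out.str();
}

// Wrap the raw allocation in try/catch so that both failure modes
// (exception or null pointer) surface as one descriptive exception.
inline void *checked_alloc(
    const RawAlloc &raw_alloc, std::size_t size, std::size_t alignment,
    AllocStats &stats) {
    void *ptr = nullptr;
    try {
        ptr = raw_alloc(size, alignment);
    } catch (const std::exception &e) {
        // Re-throw with the request and allocator state attached.
        throw std::runtime_error(describe_failure(size, alignment, stats, e.what()));
    }
    if (ptr == nullptr) {
        throw std::runtime_error(
            describe_failure(size, alignment, stats, "null pointer returned"));
    }
    // Only successful allocations update the statistics.
    stats.allocated_byte_device += size;
    if (stats.allocated_byte_device > stats.max_allocated_byte_device) {
        stats.max_allocated_byte_device = stats.allocated_byte_device;
    }
    return ptr;
}
```

Calling `checked_alloc` with an allocator that throws or returns `nullptr` raises a `std::runtime_error` whose message contains the requested size, so the caller sees what was requested and how much was already allocated, rather than only the low-level runtime error.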

tdavidcl and others added 3 commits June 7, 2025 14:41
**Dusty collision test B from Huang & Bai (2022) with exponential drag solver**


![dusty_collision_test_exp_B](https://github.com/user-attachments/assets/51f7333c-ecec-4fc5-b5b6-481c41ab767c)

**Dusty wave 2 fluids**

![dusty_wave_test_2fluids](https://github.com/user-attachments/assets/05d1ce16-0e43-4705-91e2-afb1cf45a03b)

**Dusty wave 5 fluids**

![dusty_wave_test_5fluids_new](https://github.com/user-attachments/assets/ae9c3d62-fbe2-4d16-bb70-8cf1814b51bc)

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: David--Cléris Timothée <timothee.davidcleris@proton.me>
@github-actions
Contributor

github-actions bot commented Jun 7, 2025

Workflow report

Workflow report corresponding to commit 079d888
Committer email is timothee.davidcleris@proton.me
GitHub page artifact URL: GitHub page artifact link (can expire)

Pre-commit check report

Pre-commit check: ✅

trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check for merge conflicts................................................Passed
check that executables have shebangs.....................................Passed
check that scripts with shebangs are executable..........................Passed
check for added large files..............................................Passed
check for case conflicts.................................................Passed
check for broken symlinks............................(no files to check)Skipped
check yaml...............................................................Passed
detect private key.......................................................Passed
No-tabs checker..........................................................Passed
Tabs remover.............................................................Passed
Validate GitHub Workflows................................................Passed
clang-format.............................................................Passed
black....................................................................Passed
ruff check...............................................................Passed
Check doxygen headers....................................................Passed
Check license headers....................................................Passed
Check #pragma once.......................................................Passed
Check SYCL #include......................................................Passed
No ssh in git submodules remote..........................................Passed

Test pipeline can run.

Clang-tidy diff report

No relevant changes found.
Well done!

You should now go back to your normal life and enjoy a hopefully sunny day while waiting for the review.

Doxygen diff with main

Removed warnings : 1
New warnings : 1
Warnings count : 6275 → 6275 (0.0%)

Detailed changes :
- src/shambackends/src/details/internal_alloc.cpp:123: warning: Member log_mem_perf_info(const std::shared_ptr< DeviceScheduler > &dev_sched) (function) of namespace sham::details is not documented.
+ src/shambackends/src/details/internal_alloc.cpp:124: warning: Member log_mem_perf_info(const std::shared_ptr< DeviceScheduler > &dev_sched) (function) of namespace sham::details is not documented.

@tdavidcl tdavidcl merged commit f9227a7 into Shamrock-code:main Jun 8, 2025
40 checks passed
@tdavidcl tdavidcl deleted the internal-alloc-2dot0 branch June 8, 2025 09:10

2 participants