look into allocations #691

Closed
wants to merge 8 commits into from

@juliasloan25 juliasloan25 commented Mar 13, 2024

Purpose

closes #683

Notes

  1. ClimaCoupler gpu_dyamond_target_nodiags shortrun (P100): uses 0.25GB before atmos init and 15.83GB after atmos init, then exceeds the 15.895GB limit during bucket init (see build)
  2. ClimaAtmos gpu_aquaplanet_dyamond run (P100): uses ~10GB (see build)
  3. ClimaCoupler gpu_dyamond_target longrun without atmos diagnostics (H100): uses 0.43GB before atmos init and 62GB after atmos init; see details below (see build)
  4. ClimaCoupler gpu_dyamond_target longrun with atmos or coupler diagnostics (H100): uses 0.43GB before atmos init, 62GB after atmos init, 66GB right before the coupler loop, and 67GB right after the coupler loop (see build)
  5. ClimaCoupler atmos-only run (P100): exceeds the memory limit during atmos init (see build)

** I would expect runs 2 and 5 to show the same results, but the atmos run launched from the coupler allocates 15GB, while the same run launched from ClimaAtmos allocates 10GB. Maybe I should run 5 on the H100 to see how high the allocations go in that case (but clima is offline this morning).
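To make the per-stage growth in the notes easier to compare, here is a small sketch in Python (chosen only for offline arithmetic; it is not part of ClimaCoupler). It uses only the readings quoted in notes 1 and 4; the dictionary keys and the helper name `stage_deltas` are illustrative, not names from the codebase.

```python
# Memory readings in GB, copied from notes 1 and 4 above.
# The run labels and stage names are shorthand introduced for this sketch.
readings = {
    "coupler longrun with diagnostics (H100)": [
        ("before atmos init", 0.43),
        ("after atmos init", 62.0),
        ("before coupler loop", 66.0),
        ("after coupler loop", 67.0),
    ],
    "coupler shortrun nodiags (P100)": [
        ("before atmos init", 0.25),
        ("after atmos init", 15.83),
    ],
}
P100_LIMIT_GB = 15.895  # limit quoted in note 1


def stage_deltas(stages):
    """Return (stage name, increase in GB over the previous reading) pairs."""
    return [
        (name, round(curr - prev, 3))
        for (_, prev), (name, curr) in zip(stages, stages[1:])
    ]


for run, stages in readings.items():
    print(run)
    for name, delta in stage_deltas(stages):
        print(f"  +{delta:g} GB by '{name}'")

# Headroom left on the P100 once atmos init has finished.
p100_after_atmos = readings["coupler shortrun nodiags (P100)"][-1][1]
headroom = round(P100_LIMIT_GB - p100_after_atmos, 3)
print(f"P100 headroom after atmos init: {headroom:g} GB")
```

Read this way, nearly all of the growth in both runs happens during atmos init, and the P100 run has well under 0.1GB left for bucket init.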

Atmos allocations when running gpu_dyamond_target without diagnostics (the same for the last two setups mentioned above):

[ Info: Allocating cache (p): 41.072 s (122486389 allocations: 7.49 GiB)
[ Info: Using ODE config: ClimaTimeSteppers.ARS343
[ Info: ode_configuration: 613.655 ms (1280120 allocations: 79.57 MiB)
[ Info: Progress logging enabled.
[ Info: get_callbacks: 182.676 ms (126426 allocations: 8.61 MiB)
[ Info: initializing diagnostics: 2.692 s (6111627 allocations: 440.36 MiB)
[ Info: HDF5Writer: Any[]
[ Info: NetCDFWriter: Any[]
[ Info: Prepared diagnostic callbacks: 21.383 ms (64402 allocations: 4.46 MiB)
[ Info: Prepared SciMLBase.CallbackSet callbacks: 13.758 ms (411 allocations: 29.77 KiB)
[ Info: n_steps_per_cycle_per_cb (non diagnostics): [1, 1, 864, 216, 72, 1]
[ Info: n_steps_per_cycle_per_cb_diagnostic: Any[]
[ Info: n_steps_per_cycle (non diagnostics): 864
[ Info: Define ode function: 6.759 s (49510523 allocations: 1.98 GiB)
[ Info: dt_save_to_sol: 43200.0, length(saveat): 2
[ Info: Saving state to HDF5 file on day 0 second 0
[ Info: init integrator: 42.470 s (75399874 allocations: 5.00 GiB)
[ Info: Init diagnostics: 8.460 μs (0 allocations: 0 bytes)
Effective GPU memory usage: 78.49% (62.126 GiB/79.150 GiB)
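For comparing runs, the timed lines in a log like the one above can be pulled into a table. The following is a minimal Python sketch for offline analysis, not part of ClimaCoupler or ClimaAtmos. The regular expressions are inferred from the lines shown here and may not cover other log variants, and the function names are hypothetical. It does not sum the entries, because the log does not say whether any of the timed sections are nested. The timed figures also look like allocation counts from a timing macro rather than GPU memory readings (an assumption), so they are reported separately from the "Effective GPU memory usage" line.

```python
import re

# Sample lines copied from the log above.
LOG = """\
[ Info: Allocating cache (p): 41.072 s (122486389 allocations: 7.49 GiB)
[ Info: Using ODE config: ClimaTimeSteppers.ARS343
[ Info: ode_configuration: 613.655 ms (1280120 allocations: 79.57 MiB)
[ Info: initializing diagnostics: 2.692 s (6111627 allocations: 440.36 MiB)
[ Info: Define ode function: 6.759 s (49510523 allocations: 1.98 GiB)
[ Info: init integrator: 42.470 s (75399874 allocations: 5.00 GiB)
[ Info: Init diagnostics: 8.460 μs (0 allocations: 0 bytes)
Effective GPU memory usage: 78.49% (62.126 GiB/79.150 GiB)
"""

# Matches "[ Info: <name>: <time> <unit> (<count> allocations: <size> <unit>)".
TIMED = re.compile(
    r"\[ Info: (?P<name>.+?): (?P<time>[\d.]+) (?P<tunit>\S+) "
    r"\((?P<count>\d+) allocations?: (?P<size>[\d.]+) "
    r"(?P<unit>bytes|KiB|MiB|GiB)\)"
)
GPU = re.compile(
    r"Effective GPU memory usage: [\d.]+% "
    r"\((?P<used>[\d.]+) GiB/(?P<total>[\d.]+) GiB\)"
)
TO_GIB = {"bytes": 1 / 2**30, "KiB": 1 / 2**20, "MiB": 1 / 2**10, "GiB": 1.0}


def parse_timed(log):
    """Return (name, allocation count, size in GiB) for each timed line."""
    rows = []
    for line in log.splitlines():
        m = TIMED.search(line)
        if m:
            size_gib = float(m["size"]) * TO_GIB[m["unit"]]
            rows.append((m["name"], int(m["count"]), size_gib))
    return rows


def parse_gpu(log):
    """Return (used GiB, total GiB) from the GPU usage line, or None."""
    m = GPU.search(log)
    return (float(m["used"]), float(m["total"])) if m else None


rows = parse_timed(LOG)
# Largest entries first, so the dominant setup steps stand out.
for name, count, gib in sorted(rows, key=lambda r: r[2], reverse=True):
    print(f"{name:28s} {count:>11d} allocations {gib:8.3f} GiB")

used, total = parse_gpu(LOG)
print(f"GPU: {used} / {total} GiB = {100 * used / total:.2f}%")
```

Lines without a timing suffix, such as the ODE config line, are skipped by the pattern, so the same function can be pointed at a full build log.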

@juliasloan25 juliasloan25 force-pushed the js/allocs branch 2 times, most recently from fb3205d to bad4ef7 Compare March 18, 2024 16:03
@juliasloan25 juliasloan25 mentioned this pull request Mar 20, 2024
14 tasks
@juliasloan25
Member Author

included in #706

Development

Successfully merging this pull request may close these issues.

investigate allocations