investigate GPU conservation #607

juliasloan25 · 2024-02-09T01:09:27Z

Our GPU runs have larger water and energy conservation errors than CPU runs with identical setups. The CPU error tends to be around 1e-5, while the GPU error is around 1e-3 or 1e-4. We should look into why conservation is worse on GPU, and try to improve it.

For now (as of #589), we've set the GPU runs to soft fail if the conservation error is larger than 1e-3, but we want them to be able to pass this threshold (and even a smaller one, ideally).
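
As a rough illustration of the soft-fail rule above, here is a minimal Python sketch. The function names and the relative-error definition are hypothetical and are not taken from the coupler's code; only the 1e-3 threshold comes from the description above.

```python
def relative_conservation_error(initial_total: float, final_total: float) -> float:
    """Relative drift of a conserved total (assumed definition, for illustration only)."""
    if initial_total == 0.0:
        raise ValueError("initial_total must be nonzero")
    return abs(final_total - initial_total) / abs(initial_total)


def check_conservation(error: float, threshold: float = 1e-3) -> str:
    """Soft fail when the error is larger than the threshold, otherwise pass."""
    return "soft_fail" if error > threshold else "pass"


# Typical magnitudes quoted above: CPU around 1e-5, GPU around 1e-4 to 1e-3.
for label, err in [("CPU", 1e-5), ("GPU", 1e-4), ("GPU, worse case", 2e-3)]:
    print(label, check_conservation(err))
```

Tightening the default `threshold` is how the smaller target mentioned above would be expressed in this sketch.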

part of #390
PR #614

Partially done in #735

Approaches to try

- add debug plots, as was done in this branch
  - requires ability to convert fields from GPU to CPU; see "add function to convert fields, spaces, grids between CPU and GPU" (ClimaCore.jl#1619)
- check float type stability
  - Atmos cache always initialized with NaN
  - Land state always initialized with Float64 in coupler
  - might not be relevant if our runs already use Float64, but could affect Float32 runs

Bucket info

ClimaLand has standalone global bucket runs on CPU and GPU that we compare. For all 3 albedo options, comparing the mean values of the states from CPU and GPU runs gives a difference on the order of 1e-15. For the temporal map albedo case, which runs for 50 days, we get:

mean(cpu_state .- gpu_state) = 1.433428363408908e-15
eps(Float64) = 2.220446049250313e-16

The functional and static map albedo cases show similar discrepancies, and run for 7 days each. These differences are much smaller than what we see in coupled runs, so the difference is probably not coming from the bucket model.
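
To put the bucket discrepancy in context, the mean difference can be expressed in units of machine epsilon. This is a small Python sketch using only the two numbers quoted above; the variable names are illustrative.

```python
import sys

# Values quoted in the issue text.
mean_diff = 1.433428363408908e-15  # mean(cpu_state .- gpu_state)
eps64 = 2.220446049250313e-16      # eps(Float64)

# Python floats are IEEE 754 doubles, so this should match eps(Float64).
python_eps = sys.float_info.epsilon

# Express the CPU-GPU mean difference as a multiple of machine epsilon.
diff_in_eps = mean_diff / eps64
print(f"mean CPU-GPU difference = {diff_in_eps:.2f} x eps(Float64)")
```

The difference comes out at only a few machine epsilons, which is consistent with treating the standalone bucket runs as matching at roundoff level.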

example RSE values seen in #589 (for comparison between CPU/GPU runs)

functional albedo

CPU
rse[end] = 1.6532404532423364e-5
rse[end] = 0.0005462468417393023
GPU
rse[end] = 7.841693994284924e-5
rse[end] = 0.0005103673696554503

static map

CPU
rse[end] = 1.5598104032642762e-5
rse[end] = 7.682827509990757e-5
GPU
rse[end] = 0.00012549802174549964
rse[end] = 0.0015490000882394841

temporal map

CPU
rse[end] = 1.7424794137292602e-5
rse[end] = 0.00033606441246576436
GPU
rse[end] = 0.00012708219201050294
rse[end] = 0.000943450817881417
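
For a quick side-by-side reading of the values above, this Python sketch computes the GPU/CPU ratio for each pair. The numbers are copied from the listing; the issue does not label the two values per device, so the sketch keeps them only in listed order, and the dictionary layout is purely illustrative.

```python
# rse[end] values copied from the listing above, two per device, in listed order.
rse = {
    "functional albedo": {
        "CPU": (1.6532404532423364e-5, 0.0005462468417393023),
        "GPU": (7.841693994284924e-5, 0.0005103673696554503),
    },
    "static map": {
        "CPU": (1.5598104032642762e-5, 7.682827509990757e-5),
        "GPU": (0.00012549802174549964, 0.0015490000882394841),
    },
    "temporal map": {
        "CPU": (1.7424794137292602e-5, 0.00033606441246576436),
        "GPU": (0.00012708219201050294, 0.000943450817881417),
    },
}


def gpu_to_cpu_ratios(values):
    """Element-wise GPU/CPU ratio for one albedo case."""
    return tuple(g / c for g, c in zip(values["GPU"], values["CPU"]))


ratios = {case: gpu_to_cpu_ratios(values) for case, values in rse.items()}
for case, (first, second) in ratios.items():
    print(f"{case}: GPU/CPU = {first:.2f} (first value), {second:.2f} (second value)")
```

In these numbers the GPU value is larger in every pair except the second functional albedo value, where the GPU value is slightly smaller.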


akshaysridhar · 2024-02-15T19:00:04Z

These builds might be relevant (ClimaAtmos standalone test cases that compare CPU and GPU runs):
https://buildkite.com/clima/climaatmos-ci/builds/16728#018da910-0554-4f88-ab65-3f79079dcb23 (HS)
https://buildkite.com/clima/climaatmos-ci/builds/16728#018da910-054b-4a34-aed1-0807b98db143 (BW)

LenkaNovak · 2024-04-05T02:11:07Z

Standalone moist Held-Suarez atmos runs differ too: CliMA/ClimaAtmos.jl#2876

juliasloan25 · 2024-04-15T18:05:19Z

Fields.bycolumn is effectively a no-op on GPU and could be contributing to the differences between CPU and GPU. See #736

LenkaNovak · 2024-04-16T15:29:44Z

Thanks to #737, we can confirm that our GPU longruns run as well as CPU longruns (build). The conservation logging issue (20% error) was addressed in #735. The GPU runs are systematically less conservative than the CPU runs (discussion in #735), but the difference is quite small (~1%). I suggest we close this issue and revisit it in the future (as part of #594), once we can track energy sinks and sources to a precision of sqrt(eps). That requires work in ClimaAtmos (see CliMA/ClimaAtmos.jl#2658 and CliMA/ClimaAtmos.jl#2568). @juliasloan25, would you agree?

LenkaNovak · 2024-04-18T17:32:30Z

completed

juliasloan25 added enhancement New feature or request 💰 Grab Bag GPU labels Feb 9, 2024

juliasloan25 mentioned this issue Feb 9, 2024

add GPU slabplanet file read albedo runs #589

Merged

1 task

juliasloan25 added this to the O1.2.5 (coupler) Atmos-land simulations on GPU milestone Feb 9, 2024

juliasloan25 mentioned this issue Feb 9, 2024

O1.2.5 Atmos-land simulations on GPU at 1 SYPD on 4 A100s #390

Closed

2 tasks

juliasloan25 added 🔥 Urgent 🍃 leaf Issue coupled to a PR and removed 💰 Grab Bag enhancement New feature or request labels Feb 9, 2024

juliasloan25 self-assigned this Feb 12, 2024

juliasloan25 mentioned this issue Feb 12, 2024

debug GPU conservation #614

Closed

1 task

LenkaNovak mentioned this issue Mar 14, 2024

Improve slack report #693

Closed

12 tasks

LenkaNovak closed this as completed Apr 18, 2024
