investigate GPU conservation #607

Closed
Tracked by #390 ...
juliasloan25 opened this issue Feb 9, 2024 · 5 comments
Labels: 🔥 Urgent GPU 🍃 leaf Issue coupled to a PR

Comments

@juliasloan25
Member

juliasloan25 commented Feb 9, 2024

Our GPU runs have larger water and energy conservation errors than CPU runs with identical setups. The CPU error tends to be around 1e-5, while the GPU error is around 1e-3 or 1e-4. We should look into why conservation is worse on GPU, and try to improve it.

For now (as of #589), we've set the GPU runs to soft fail if the conservation error is larger than 1e-3, but we want them to be able to pass this threshold (and even a smaller one, ideally).
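A minimal sketch of what such a threshold check could look like. The helper names `relative_error` and `check_conservation` are hypothetical and are not taken from ClimaCoupler; the error metric (drift of a conserved total relative to its initial value) and the modelling of "soft fail" as a logged warning are both assumptions made for illustration.

```julia
# Hypothetical helper: relative drift of a conserved total from its initial value.
relative_error(totals::AbstractVector) =
    abs.(totals .- totals[1]) ./ abs(totals[1])

# Hypothetical check: returns true if the final error is within `threshold`.
# With `soft_fail = true`, a violation is logged as a warning instead of throwing.
function check_conservation(totals; threshold = 1e-3, soft_fail = true)
    err = relative_error(totals)[end]
    if err <= threshold
        return true
    end
    msg = "conservation error $err exceeds threshold $threshold"
    if soft_fail
        @warn msg
    else
        error(msg)
    end
    return false
end

# Synthetic totals, for illustration only.
ok_totals = [1.0e6, 1.0e6 + 10.0, 1.0e6 + 50.0]   # final relative error 5e-5
bad_totals = [1.0e6, 1.0e6 + 5000.0]              # final relative error 5e-3
ok_result = check_conservation(ok_totals)
bad_result = check_conservation(bad_totals)       # warns, returns false
```

Tightening `threshold` later (for example to the CPU-level 1e-5) would only require changing the keyword argument.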

part of #390
PR #614

Partially done in #735

Approaches to try

Bucket info

ClimaLand has standalone global bucket runs on CPU and GPU that we compare. For all 3 albedo options, comparing the mean values of the states from CPU and GPU runs gives a difference on the order of 1e-15. For the temporal map albedo case, which runs for 50 days, we get:

mean(cpu_state .- gpu_state) = 1.433428363408908e-15
eps(Float64) = 2.220446049250313e-16

The functional and static map albedo cases show similar discrepancies, and run for 7 days each. These differences are much smaller than what we see in coupled runs, so the difference is probably not coming from the bucket model.
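To put the numbers above in context, the quoted mean difference can be expressed in units of machine epsilon, which comes out to roughly 6.5. The sketch below does that arithmetic and also shows one way to summarize a CPU/GPU state comparison. `compare_states` is a hypothetical helper, not part of ClimaLand, and the arrays are synthetic; it assumes both states have already been copied to host memory.

```julia
using Statistics

# Values quoted in the issue for the temporal map albedo case.
mean_diff = 1.433428363408908e-15
diff_in_eps = mean_diff / eps(Float64)   # difference in units of machine epsilon

# Hypothetical helper: summarize the discrepancy between two state arrays.
function compare_states(cpu_state::AbstractArray, gpu_state::AbstractArray)
    diff = cpu_state .- gpu_state
    return (
        mean_diff = mean(diff),              # signed mean; errors can cancel
        max_abs_diff = maximum(abs, diff),   # worst-case pointwise difference
        mean_in_eps = abs(mean(diff)) / eps(eltype(diff)),
    )
end

# Synthetic example data, for illustration only.
cpu_state = [280.0, 290.0, 300.0]
gpu_state = cpu_state .+ [1e-13, -1e-13, 2e-13]
stats = compare_states(cpu_state, gpu_state)
```

Because a signed mean lets positive and negative differences cancel, reporting the maximum absolute difference alongside it may give a more conservative picture.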

example RSE values seen in #589 (for comparison between CPU/GPU runs)

functional albedo

CPU
rse[end] = 1.6532404532423364e-5
rse[end] = 0.0005462468417393023
GPU
rse[end] = 7.841693994284924e-5
rse[end] = 0.0005103673696554503

[Figure: CPU analytic function]

static map

CPU
rse[end] = 1.5598104032642762e-5
rse[end] = 7.682827509990757e-5
GPU
rse[end] = 0.00012549802174549964
rse[end] = 0.0015490000882394841

[Figures: CPU static map, GPU static map]

temporal map

CPU
rse[end] = 1.7424794137292602e-5
rse[end] = 0.00033606441246576436
GPU
rse[end] = 0.00012708219201050294
rse[end] = 0.000943450817881417

[Figures: CPU temporal map, GPU temporal map]
@akshaysridhar
Member

akshaysridhar commented Feb 15, 2024

These might be relevant builds (ClimaAtmos - standalone test cases which compare CPU v GPU runs)
https://buildkite.com/clima/climaatmos-ci/builds/16728#018da910-0554-4f88-ab65-3f79079dcb23 (HS)
https://buildkite.com/clima/climaatmos-ci/builds/16728#018da910-054b-4a34-aed1-0807b98db143 (BW)

LenkaNovak mentioned this issue Mar 14, 2024
@LenkaNovak
Collaborator

LenkaNovak commented Apr 5, 2024

Standalone moist Held-Suarez atmos runs differ too: CliMA/ClimaAtmos.jl#2876

@juliasloan25
Member Author

Fields.bycolumn is effectively a no-op on GPU and could be contributing to the differences between CPU and GPU. See #736

@LenkaNovak
Collaborator

Thanks to #737, we can confirm that our GPU longruns run as well as CPU longruns (build). The conservation logging issue (20% error) was addressed in #735. The GPU runs are systematically less conservative than the CPU runs (discussion in #735), but the difference is quite small (~1%). I suggest we close this issue and revisit it in the future (as part of #594), once we can track energy sinks and sources to the precision of sqrt(eps); this requires work in ClimaAtmos (see CliMA/ClimaAtmos.jl#2658 and CliMA/ClimaAtmos.jl#2568). @juliasloan25, would you agree?
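For reference, a short sketch of what a sqrt(eps) target means numerically for the two common floating-point types. Which float type the runs in this issue use is not stated here, so both are shown; `within_sqrt_eps` is a hypothetical helper for illustration.

```julia
# sqrt(eps) for the two common floating-point types.
tol64 = sqrt(eps(Float64))   # 2^-26, about 1.49e-8
tol32 = sqrt(eps(Float32))   # 2^-11.5, about 3.45e-4

# Hypothetical check: is a relative conservation error within sqrt(eps)
# for the float type FT?
within_sqrt_eps(err, FT) = abs(err) <= sqrt(eps(FT))

# An error of 1e-5 meets the Float32 target but not the Float64 one.
meets_f64 = within_sqrt_eps(1e-5, Float64)
meets_f32 = within_sqrt_eps(1e-5, Float32)
```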

@LenkaNovak
Collaborator

completed


3 participants