Use 4xA100 to achieve > 1SYPD for target EDMF AMIP #740

Closed
4 tasks done
Tracked by #390
LenkaNovak opened this issue Apr 17, 2024 · 8 comments · Fixed by #744
Assignees
Labels
🔥 Urgent GPU 🍃 leaf Issue coupled to a PR

Comments

@LenkaNovak
Collaborator

LenkaNovak commented Apr 17, 2024

Since gpu_amip_topo_target_diagedmf is our current target, we want to run it on the faster nodes: either clima's A100 or new-central's V100 / H100.

Results

Running on Clima should be sufficient.

clima A100:

  • current driver, 4 GPUs build SYPD ~ 1.51
  • current driver w/o coupler diagnostics (still with atmos diags), 4 GPUs build SYPD ~ 1.63
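
As a rough reading of the two A100 figures above, the relative cost of the coupler diagnostics can be worked out directly. This is a minimal sketch in Python that uses only the two SYPD values quoted in this comment; the function name `relative_slowdown` is illustrative and not part of ClimaCoupler, and single-run figures like these may not be representative.

```python
def relative_slowdown(sypd_with: float, sypd_without: float) -> float:
    """Fraction of throughput lost when the extra work is enabled."""
    if sypd_without <= 0:
        raise ValueError("sypd_without must be positive")
    return (sypd_without - sypd_with) / sypd_without


# SYPD values quoted above for the 4-GPU A100 builds.
with_coupler_diags = 1.51
without_coupler_diags = 1.63

slowdown = relative_slowdown(with_coupler_diags, without_coupler_diags)
print(f"coupler diagnostics cost about {slowdown:.1%} of throughput")
```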

new-central P100:

  • current driver, 1 GPU build SYPD ~ 0.23
  • current driver, 4 GPUs build - memory timeout

Note

  • currently this run crashes (saturation adjustment) after ~90-140 days (non-deterministic).

Components in PR

  • move gpu_amip_topo_target_diagedmf to clima
  • increase t_end to 90d
  • split clima pipeline init from individual runs, so we can run multiple runs on clima
  • remove concurrency block, which was slowing down the clima init, even though the depots are in different locations
@Sbozzolo
Member

What's the issue with the P100?

@LenkaNovak
Collaborator Author

Apparently they're too slow and we're using this run as an SYPD benchmark.

@LenkaNovak LenkaNovak self-assigned this Apr 18, 2024
@LenkaNovak LenkaNovak changed the title from "Avoid running target EDMF longrun on P100" to "Use 4xA100 to achieve > 1SYPD for target EDMF AMIP" Apr 18, 2024
@Sbozzolo
Member

If you are interested, this build https://buildkite.com/clima/climacoupler-longruns/builds/628 can be fixed by requesting more memory per CPU.

@LenkaNovak
Collaborator Author

That's useful to know, thanks! We'll still run the benchmark on the A100 (as requested by the OKR), but this will be useful for the scaling tables (Cc'ing @juliasloan25 ).

@juliasloan25 juliasloan25 added 🍃 leaf Issue coupled to a PR GPU labels Apr 19, 2024
@LenkaNovak
Collaborator Author

Update:

The above results are for 200km resolution.

For 100km resolution, SYPD on 4xA100 is between 0.8 and 1.5 (see builds). This partly depends on whether we use coupler/atmos diagnostics, but removing diagnostics didn't always lead to better SYPD. We also see large variability between runs of the same config, and even within one simulation. A more thorough investigation is being performed as part of CliMA/ClimaAtmos.jl#2914, and we will present a like-for-like comparison and scaling as part of #663.

Notes:

  • when the server was busy, we got these memory errors; some nodes kept running
  • the 100km EDMF run runs to 90d without breaking (build)

@LenkaNovak
Collaborator Author

@Sbozzolo
Member

Sbozzolo commented May 1, 2024

Now achieving 1.082 SYPD: https://buildkite.com/clima/climacoupler-longruns/builds/668#018f325b-a6d2-4c2d-b67f-1416484fda11

While it should be reasonably accurate, I would encourage you not to look at the SYPD printed by the progress log. That is an estimate and does not reflect the actual SYPD in some cases (e.g., during the first iterations, or when callbacks/diagnostics are called).
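
To illustrate the distinction between a short-window estimate and the end-to-end figure, here is a minimal sketch in Python. It is not how the progress log computes its value; the checkpoint numbers are hypothetical, the 365-day year is an assumption, and `sypd` is an illustrative helper rather than a ClimaCoupler function.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365  # assumption: 365-day model year


def sypd(simulated_days: float, wall_seconds: float) -> float:
    """Simulated years per wall-clock day."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    simulated_years = simulated_days / DAYS_PER_YEAR
    wall_days = wall_seconds / SECONDS_PER_DAY
    return simulated_years / wall_days


# Hypothetical checkpoints: (simulated days completed, wall seconds elapsed).
# The first interval is deliberately slow to mimic startup overhead.
checkpoints = [(0, 0), (10, 3_000), (50, 11_000), (90, 19_000)]

# Estimate over each interval between consecutive checkpoints.
interval_sypd = [
    sypd(d1 - d0, t1 - t0)
    for (d0, t0), (d1, t1) in zip(checkpoints, checkpoints[1:])
]

# End-to-end figure over the whole run.
total_days, total_wall = checkpoints[-1]
overall_sypd = sypd(total_days, total_wall)

print("per-interval:", [round(s, 2) for s in interval_sypd])
print("end-to-end:  ", round(overall_sypd, 2))
```

With these made-up numbers the slow first interval reads well below the later ones, and the end-to-end value falls between the extremes, which is why a single printed estimate can mislead.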

@LenkaNovak
Collaborator Author

Very true, but for the avoidance of doubt, this run shows we can achieve at least 380 simulated days in 1 day of walltime.
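
The relationship between the two figures in this exchange can be checked with a short conversion. This is a sketch in Python assuming a 365-day year (the thread does not state the calendar); the helper names are illustrative.

```python
DAYS_PER_YEAR = 365  # assumption: 365-day model year


def simulated_days_per_wall_day(sypd: float) -> float:
    """Convert simulated years per day (SYPD) to simulated days per wall day."""
    return sypd * DAYS_PER_YEAR


def sypd_from_days(simulated_days_per_day: float) -> float:
    """Inverse conversion: simulated days per wall day to SYPD."""
    return simulated_days_per_day / DAYS_PER_YEAR


reported_sypd = 1.082  # value quoted from the linked build
days = simulated_days_per_wall_day(reported_sypd)
floor_sypd = sypd_from_days(380)  # the "at least 380 days" lower bound

print(f"{reported_sypd} SYPD is about {days:.0f} simulated days per wall day")
print(f"380 simulated days per wall day is about {floor_sypd:.2f} SYPD")
```

Under that assumption, both numbers sit above the 1 SYPD target in the issue title.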

3 participants