ARCTICGRIS PE-layout is very slow... #1098

Closed
ekluzek opened this issue Aug 6, 2020 · 7 comments · Fixed by #1111
ekluzek (Collaborator) commented Aug 6, 2020

The ARCTICGRIS PE layout uses only 8 nodes and runs at about half a simulated year per wallclock day on cheyenne.
Compared with f09, it has 5X as many points, a time step one-eighth the size, and runs on one-sixth as many processors.
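
As a rough back-of-the-envelope sketch of that comparison, the three factors can be combined into one expected slowdown relative to f09. This assumes ideal scaling (wallclock time proportional to points times time steps, divided by processors), which real runs will not match exactly; the function name is hypothetical and not part of any model tooling.

```python
def relative_slowdown(point_ratio, timestep_ratio, processor_ratio):
    """Estimate wallclock slowdown relative to a reference grid.

    Assumes ideal scaling: cost grows with the number of points,
    shrinks with the time-step size, and shrinks with processor count.
    All arguments are ratios of (this grid) / (reference grid).
    """
    return point_ratio / timestep_ratio / processor_ratio


# Ratios quoted in the comment: 5X the points, 1/8 the time step,
# 1/6 the processors, all relative to f09.
slowdown = relative_slowdown(5.0, 1.0 / 8.0, 1.0 / 6.0)
print(f"Expected slowdown vs f09 under ideal scaling: ~{slowdown:.0f}x")
```

Under these assumptions the layout would be expected to run a couple of hundred times slower than f09, so a very low throughput on 8 nodes is not surprising.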

Note, CONUS and ARCTIC grids are also only using 8 nodes (for any machine). They don't have particular setups for cheyenne.

Furthermore, the fv3 grids have setups that seem to be set up for cheyenne but are labeled as applying to any machine. They should probably have general setups that are separate from the cheyenne-specific ones.

ekluzek added the enhancement label Aug 6, 2020
ekluzek self-assigned this Aug 6, 2020
adamrher (Contributor) commented Aug 7, 2020

Erik, can you dig out the core hours / syr for the ARCTICGRIS and ARCTIC grids? I am trying to compare these numbers with my runs, and I tend to run with more than 8 nodes.

ekluzek (Collaborator, Author) commented Aug 7, 2020

For a short test run this is what I have (which wouldn't be very accurate, but maybe ballpark):

ARCTICGRIS:
total pes active : 288
mpi tasks per node : 36
pe count for cost estimate : 288

ARCTIC:

total pes active : 288
mpi tasks per node : 36
pe count for cost estimate : 288

Overall Metrics:
Model Cost: 654.98 pe-hrs/simulated_year
Model Throughput: 10.55 simulated_years/day

Overall Metrics:
Model Cost: 25263.68 pe-hrs/simulated_year
Model Throughput: 0.27 simulated_years/day
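
A quick sanity check on these numbers, as a sketch: assuming model cost is simply the PE count times the wallclock hours needed per simulated year, cost and throughput are tied together by cost = pes * 24 / throughput. The function name is hypothetical; the inputs are the values printed above, and small mismatches are expected because the reported throughput is rounded to two decimals.

```python
def cost_from_throughput(pe_count, simulated_years_per_day):
    """Return pe-hrs per simulated year implied by a throughput.

    Assumes cost = PE count * wallclock hours per simulated year,
    with 24 wallclock hours per day.
    """
    return pe_count * 24.0 / simulated_years_per_day


# (pe count, reported throughput, reported cost) from the output above.
runs = [
    (288, 10.55, 654.98),
    (288, 0.27, 25263.68),
]

for pes, throughput, reported_cost in runs:
    implied = cost_from_throughput(pes, throughput)
    rel_diff = abs(implied - reported_cost) / reported_cost
    print(f"implied {implied:.1f} vs reported {reported_cost:.2f} "
          f"pe-hrs/syr (relative difference {rel_diff:.1%})")
```

Both pairs agree to within a couple of percent, so the two metrics are consistent with each other even though the short run length makes both unreliable.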

Can you post what PE layouts you used for your simulations?

adamrher (Contributor) commented Aug 7, 2020

For an ARCTIC I-compset run I have:
total pes active : 1800
mpi tasks per node : 36
pe count for cost estimate : 1800

Overall Metrics:
Model Cost: 656.34 pe-hrs/simulated_year
Model Throughput: 65.82 simulated_years/day

[Attached screenshot: Screen Shot 2020-08-07 at 3 48 33 PM]

I don't have any ARCTICGRIS I-compset runs. But I can estimate it from a F-compset run:
total pes active : 7680
mpi tasks per node : 36
pe count for cost estimate : 7704

Overall Metrics:
Model Cost: 55893.74 pe-hrs/simulated_year
Model Throughput: 3.31 simulated_years/day

Estimating CTSM cost = pe-hrs/syr * (LND Run Time / TOT Run Time) = 55893.74 * (403.304 s / 13166.622 s) = 1712.0692 pe-hrs/syr. This seems right to me, as the cost of ARCTICGRIS is about 2-3X that of the ARCTIC grid.
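
The estimate above can be reproduced with a few lines, as a sketch. It assumes, as the comment does, that the land component's share of the total cost equals its share of the total run time; the function and variable names are hypothetical, and the inputs are the numbers quoted in this thread.

```python
def component_cost(total_cost, component_seconds, total_seconds):
    """Scale a total model cost by one component's share of run time.

    Assumes the component's cost share equals its run-time share.
    """
    return total_cost * component_seconds / total_seconds


# F-compset ARCTICGRIS run: total cost and LND / TOT run times.
lnd_cost = component_cost(55893.74, 403.304, 13166.622)

# Compare with the ARCTIC I-compset cost reported in the same comment.
arctic_cost = 656.34
ratio = lnd_cost / arctic_cost

print(f"Estimated CTSM cost on ARCTICGRIS: {lnd_cost:.2f} pe-hrs/syr")
print(f"Ratio to ARCTIC I-compset cost: {ratio:.2f}")
```

The ratio lands between 2 and 3, which matches the expectation stated in the comment.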

How long was your ARCTICGRIS run? Are those numbers inflated due to a long initialization?

Adam

ekluzek (Collaborator, Author) commented Aug 7, 2020

Mine are just from really short 9-step test runs. The numbers should take initialization into account, but they aren't going to be at all accurate for such a short test. I don't think they had DEBUG on, but the ARCTICGRIS simulation might have been running out of memory and been dog-slow because of that.

adamrher (Contributor) commented Aug 7, 2020

Just for clarity, how confident are you that those numbers take into account initialization? I thought @PeterHjortLauritzen conveyed to me that they don't account for initialization.

ekluzek (Collaborator, Author) commented Aug 7, 2020

I'm not very confident. But it does report the initialization time, so there's no reason it couldn't take it into account. I could also look into the code to check for sure. But I also know not to trust this really short test with too few nodes.

adamrher (Contributor) commented Aug 7, 2020

ok. not a high priority, so don't feel obliged to do so.

ekluzek added the next label (this should get some attention in the next week or two, normally at each Thursday SE meeting) Aug 11, 2020
billsacks removed the next label Aug 13, 2020