Fix cesm link step in builds with high pcols values #22

gdicker1 · 2024-07-25T20:36:23Z

In GPU runs of CAM, it is desirable to increase the number of physics columns each MPI rank is responsible for by increasing the pcols variable. When pcols grows somewhere beyond 2048, builds fail in the link step.

Adding -mcmodel=medium to FFLAGS and LDFLAGS allows builds to proceed. Since this can have performance implications, it is applied only to the NVHPC compilers (the main compilers for GPU applications right now).
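
For context, here is a rough sketch of why a large pcols could break the link step. It assumes the failure comes from statically allocated data outgrowing the x86-64 small code model, which expects code and static data to fit within 2 GiB. The array count and shape below are hypothetical and are not taken from CAM; only the 2 GiB limit is a property of the code model.

```python
# Rough illustration only: the x86-64 "small" code model assumes that
# statically linked code and data fit within 2 GiB.
SMALL_MODEL_LIMIT_BYTES = 2**31  # 2 GiB


def static_array_bytes(pcols, other_dims, bytes_per_elem=8):
    """Bytes used by one static array dimensioned (pcols, *other_dims)."""
    n_elems = pcols
    for dim in other_dims:
        n_elems *= dim
    return n_elems * bytes_per_elem


def fits_small_model(total_static_bytes):
    """True if the given amount of static data stays under the 2 GiB limit."""
    return total_static_bytes < SMALL_MODEL_LIMIT_BYTES


# Hypothetical workload: 100 static arrays of shape (pcols, 93, 8) holding
# 8-byte reals. These numbers are made up for illustration.
N_ARRAYS = 100
for pcols in (16, 2048, 4096):
    total = N_ARRAYS * static_array_bytes(pcols, (93, 8))
    print(pcols, total, fits_small_model(total))
```

Under these made-up numbers the static data crosses the limit somewhere between pcols = 2048 and pcols = 4096. The real threshold depends on how many arrays in the build are actually dimensioned by pcols.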

gdicker1 · 2024-07-25T20:38:51Z

@sjsprecious, apologies if you aren't the right person to direct questions or this PR to.

I added you as a reviewer for your opinion on this change. Should this be made even more GPU specific? Or, should I apply this to intel and gnu compilers too? Any other thoughts?

gdicker1 · 2024-07-25T20:44:54Z

Once incorporated, this will address EarthWorksOrg/EarthWorks Issue #56

sjsprecious · 2024-07-25T21:22:50Z

Hi @gdicker1 , my understanding is that the -mcmodel=medium flag is only needed when you want to set a large PCOLS value. According to ChatGPT:

Using the -mcmodel=medium flag can hurt performance compared to the default -mcmodel=small because it requires the compiler to generate larger and potentially less efficient instructions to access global and static variables. This can lead to increased instruction cache pressure and slower memory accesses.

Therefore, I prefer not to set this flag by default and let a user add it manually when needed. Even for a GPU run, we may not have a large PCOLS value depending on the compset, resolution, number of CPU cores / GPUs per node, etc.

On the other hand, using a large PCOLS value will unfortunately hurt the CPU performance. Thus using the default PCOLS value makes sense for any CPU simulation and we should not need to add this flag for GNU/Intel compiler.

This is just my two cents.

gdicker1 · 2024-07-25T21:49:51Z

Thanks for the perspective @sjsprecious! I'll consider this a bit more...

You're correct that we only need to consider adding this flag for higher pcols values. Right now -mcmodel=medium might be something we want for GPU runs, based on Supreeth's work with 1 MPI rank per GPU (if I remember correctly). It seems like a cumbersome step to expect users to add it themselves (but GPU runs already have a few "special" steps).

gdicker1 · 2024-07-26T22:19:41Z

@supreethms1809 could you offer some thoughts or review?

I just updated this so -mcmodel=medium is only used for OpenACC offloads, and the changes are now at the bottom of the file below an "EarthWorks specific: ..." comment.
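
As a minimal sketch of the gating described above (not the actual config change), the logic amounts to something like the following. The function name and the `compiler` / `gpu_offload` string values are hypothetical stand-ins for whatever the build configuration exposes.

```python
def extra_code_model_flags(compiler, gpu_offload):
    """Return flags to append to both FFLAGS and LDFLAGS.

    Hypothetical helper: mirrors the rule that -mcmodel=medium is added
    only for NVHPC builds that use OpenACC offload.
    """
    if compiler == "nvhpc" and gpu_offload == "openacc":
        return ["-mcmodel=medium"]
    # CPU builds and other offload modes keep the default (small) code model.
    return []


print(extra_code_model_flags("nvhpc", "openacc"))
print(extra_code_model_flags("gnu", "none"))
```

Keeping the check on both the compiler and the offload mode means CPU-only builds keep the default code model.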

supreethms1809 · 2024-07-26T22:31:12Z

From my limited tests on CPUs, I haven't seen the flag hurt performance. But it is needed for GPU builds.

@areanddee Do we expect to run with bigger PCOLS on CPUs as well?

areanddee · 2024-07-29T18:52:55Z

> Hi @gdicker1 , my understanding is that the -mcmodel=medium flag is only needed when you want to set a large PCOLS value. According to ChatGPT:
>
> Using the -mcmodel=medium flag can hurt performance compared to the default -mcmodel=small because it requires the compiler to generate larger and potentially less efficient instructions to access global and static variables. This can lead to increased instruction cache pressure and slower memory accesses.
>
> Therefore, I prefer not to set this flag by default and let a user add it manually when needed. Even for a GPU run, we may not have a large PCOLS value depending on the compset, resolution, number of CPU cores / GPUs per node, etc.
>
> On the other hand, using a large PCOLS value will unfortunately hurt the CPU performance. Thus using the default PCOLS value makes sense for any CPU simulation and we should not need to add this flag for GNU/Intel compiler.
>
> This is just my two cents.

My 2 cents:
I would recommend we set this by default ON for GPUs and OFF for CPUs. We already have flags that are specifically for GPUs so I don't think this sets any sort of bad precedent.

gdicker1 · 2024-07-31T00:38:16Z

Thanks for your thoughts @areanddee. Since this will only activate for GPU builds (and more specifically OpenACC builds), I think this is good to move forward.

gdicker1 · 2024-07-31T00:39:58Z

The force push from c3bc0bd to b8a71e2 rebased this branch from ESMCI/ccs_config_cesm tag ccs_config_cesm0.0.99 onto ew-develop (currently ccs_config-ew2.2.000).

gdicker1 added the bug label Jul 25, 2024

gdicker1 requested review from sjsprecious and supreethms1809 July 25, 2024 20:36

gdicker1 self-assigned this Jul 25, 2024

gdicker1 mentioned this pull request Jul 25, 2024

Update ccs_configs to fix link step failures for high pcols builds EarthWorksOrg/EarthWorks#59

Merged

gdicker1 added 2 commits July 30, 2024 18:36

Fix cesm link step in builds with high pcols values

8bf2289

When CAM is compiled with pcols set higher than 2048, builds will fail during the link step. Adding `-mcmodel=medium` allows builds to succeed.

Restrict mcmodel=medium to only OpenACC offload builds

b8a71e2

Also move this to the bottom of the file and add a comment that indicates this change is only for EarthWorks.

gdicker1 force-pushed the fix/mcmodel_highpcols_linkstep branch from c3bc0bd to b8a71e2 Compare July 31, 2024 00:37

gdicker1 merged commit cb6c09c into EarthWorksOrg:ew-develop Jul 31, 2024
