Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Turn off FMA option for the intel and nvhpc compiler #121

Merged
merged 3 commits into from
Sep 15, 2023

Conversation

sjsprecious
Copy link
Collaborator

@sjsprecious sjsprecious commented Sep 15, 2023

This PR turns off the FMA option by default to the intel and nvhpc compilers. Based on the ensemble consistency test (ECT), turning on FMA is likely to generate a statistically different climatology.

This makes sure that the ECT is passed when using the intel/2023 compiler on Derecho for the test simulations and comparing the results to the baseline generated on Cheyenne with intel/19.1.1. See more discussions here: ESCOMP/CAM#883.

@sjsprecious sjsprecious added the bug Something isn't working label Sep 15, 2023
Copy link
Collaborator

@jedwards4b jedwards4b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have this in the machine specific file instead of the generic intel.cmake file?

@sjsprecious
Copy link
Collaborator Author

I guess the options like "--host=cray" and "-march=core-avx2" are specific to Derecho (Cray machine and AMD CPU)?

@jedwards4b
Copy link
Collaborator

Can I push that change back to your PR? Move no-fma flag to generic intel.cmake file.

@sjsprecious
Copy link
Collaborator Author

Oh, sorry that I misunderstand your point. You mean we should turn off FMA whenever we use the intel compiler on any machine? That is fine with me but I am not sure if other people want to leave it on for their machines (e.g., people do no use ECT for verification).

@sjsprecious
Copy link
Collaborator Author

Machines like constance and izumi all turn on FMA by default. Moving -no-fma to the intel.cmake file will break the regression test on those machines and it may not be desirable.

@jedwards4b
Copy link
Collaborator

Are we running ect on those systems? Seems like the ect test confirms that we should not be using fma on any system.

@sjsprecious
Copy link
Collaborator Author

sjsprecious commented Sep 15, 2023

I agreed that based on ECT, we should not use FMA for any system (intel, nvhpc, etc). I do not have access to those systems but I assume that they will run some regressions tests like aux_cam for BFB. If we add FMA to the generic file, their tests will fail and they may not want that if they do not trust ECT like we do. This is my concern.

@briandobbins
Copy link
Collaborator

briandobbins commented Sep 15, 2023 via email

@jedwards4b
Copy link
Collaborator

We will mark the PR as answer changing so that people understand that baselines may fail.

@sjsprecious
Copy link
Collaborator Author

Thanks @jedwards4b and @briandobbins for your comments. That sounds good to me.

My last comment is how we could make sure that -no-fma is added as the last option? Each system applies some specific flags and I do not want the -no-fma flag to be overwritten by other optimization flags.

@sjsprecious
Copy link
Collaborator Author

A quick test on Derecho by moving -no-fma from the machine specific file to the generic intel file passes the ECT. Thus I will move on to do what @jedwards4b and @briandobbins suggested here.

If anyone finds out an issue about the changes here, they can always modify the flags specifically for their own system.

endif()
if (DEBUG)
string(APPEND CFLAGS " -O0 -g")
string(APPEND CFLAGS " -O0 -g -no-fma")
endif()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that you shouldn't do this seperately for DEBUG=TRUE vs DEBUG=FALSE. You can add it once at line 1 or if you want to make sure it's after the other flags add another line after 10
string(APPEND CFLAGS "-no-fma")

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Jim. I just updated those files.

@jedwards4b jedwards4b merged commit 1ca702d into ESMCI:main Sep 15, 2023
@jedwards4b jedwards4b changed the title Turn off FMA option for the intel compiler on Derecho Turn off FMA option for the intel compiler Sep 15, 2023
@sjsprecious sjsprecious changed the title Turn off FMA option for the intel compiler Turn off FMA option for the intel and nvhpc compiler Sep 15, 2023
@sjsprecious sjsprecious deleted the fix_derecho_bug branch September 15, 2023 19:47
@fischer-ncar
Copy link
Collaborator

fischer-ncar commented Sep 15, 2023

It looks like the FMA option was also causing my ERP/PEM/ERC tests to fail. I still need to do more ERP/PEM/ERC testing.

@jedwards4b
Copy link
Collaborator

That totally makes sense since changing task count with fma enabled will change answers.

@sjsprecious
Copy link
Collaborator Author

@jedwards4b can you explain more about why changing task count with FMA enabled will change answers? I thought changing task count just affected the domain decomposition.

@jedwards4b
Copy link
Collaborator

I misspoke there - I was thinking about vector math and how using different pelayouts changes the length of the vectors - but that doesn't have anything to do with FMA.

@sjsprecious
Copy link
Collaborator Author

Thanks Jim for your clarification. I assumed Chris was doing the ERP test on Derecho and my understanding was that its failure should not be caused by the FMA option. But it seems that turning off FMA on Derecho passes the ERP test somehow?

@jedwards4b
Copy link
Collaborator

Maybe we should confirm that with another test?

@sjsprecious
Copy link
Collaborator Author

Agreed and I think Chris is already working on it? I would like to let Chris update more details here in case I have a misunderstanding.

@fischer-ncar
Copy link
Collaborator

I reran the prealpha tests over the weekend. These following tests were all failing before FMA was turned off.

PASS ERP_Ld3_Vnuopc.f09_f09_mg17.FCfireHIST.derecho_intel.cam-outfrq1d COMPARE_base_rest
PASS ERP_Ln9_Vnuopc.f09_f09_mg17.F1850.derecho_intel.cam-outfrq9s COMPARE_base_rest
PASS ERP_Ln9_Vnuopc.f09_f09_mg17.F2000climo.derecho_intel.cam-outfrq9s COMPARE_base_rest
PASS ERP_Ln9_Vnuopc.f09_f09_mg17.F2000dev.derecho_intel.cam-outfrq9s_mg3 COMPARE_base_rest
PASS ERP_Ln9_Vnuopc.f09_f09_mg17.F2010climo.derecho_intel.cam-outfrq9s COMPARE_base_rest
PASS ERP_Ln9_Vnuopc.f09_f09_mg17.FHIST_BGC.derecho_intel.cam-outfrq9s COMPARE_base_rest
PASS ERP_Ln9_Vnuopc.f09_f09_mg17.FHIST.derecho_intel.cam-outfrq9s COMPARE_base_rest
PASS ERP_Ln9_Vnuopc.f09_f09_mg17.FTJ16.derecho_intel.cam-outfrq9s COMPARE_base_rest

@sjsprecious
Copy link
Collaborator Author

Thanks Chris for posting these new results. So turning off FMA does help pass the ERP tests on Derecho somehow.

Are there any other type of tests besides ERP that also pass due to the disablement of FMA?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants