Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove CNL specific CMake macros #5745

Merged
merged 11 commits into from
Jul 17, 2023
Merged

Conversation

sarats
Copy link
Member

@sarats sarats commented Jun 5, 2023

Removing CMake macro file used on old Cray supercomputers using Catamount OS.
Updated OS entry for OLCF machines.

If any other machines are still reliant on any flags in the deleted macro file, suggest moving those to specific compiler_machine.cmake macro file.

[BFB]

@sarats sarats requested a review from jgfouca June 5, 2023 21:27
@jgfouca jgfouca self-assigned this Jun 5, 2023
@jgfouca jgfouca added the BFB PR leaves answers BFB label Jun 5, 2023
jgfouca added a commit that referenced this pull request Jun 5, 2023
Remove CNL specific CMake macros

Removing CMake macro file used on old Cray supercomputers using
Catamount OS.  Updated OS entry for OLCF machines.

If any other machines are still reliant on any flags in the deleted
macro file, suggest moving those to specific compiler_machine.cmake
macro file.

[BFB]
@dqwu
Copy link
Contributor

dqwu commented Jun 5, 2023

@sarats
The removed CNL.cmake has the following settings to use Cray wrappers:

set(MPICC "cc")
set(MPIFC "ftn")
set(MPICXX "CC")

For Perlmutter (CNL.cmake is not used), gnu_pm-cpu.cmake has:

set(MPICC "cc")
set(MPICXX "CC")
set(MPIFC "ftn")

Please check if Crusher or Frontier also needs similar settings in compiler_machine.cmake macro files after CNL.cmake is removed.

@sarats
Copy link
Member Author

sarats commented Jun 6, 2023

Will check. Cray compilers have it.

@dqwu
Copy link
Contributor

dqwu commented Jun 6, 2023

@jgfouca @sarats
I tested latest next branch (with this PR merged) on Frontier and there are SCORPIO build errors.

Before this PR, Cray MPI wrappers are used (set in CNL.cmake) to configure SCORPIO (see spio.bldlog):
CC=cc CXX=CC FC=ftn ...

With CNL.cmake removed by this PR, non-Cray MPI wrappers are used instead (see spio.bldlog):
CC=mpicc CXX=mpicxx FC=mpif90 ...

In addition to the build errors caused by non-Cray wrappers, for non-Cray MPI wrappers, CMake 3.22 or higher has confirmed hanging issues when configuring SCORPIO (Frontier uses CMake 3.21 so far), see E3SM-Project/scorpio#517

If CNL.cmake (set Cray wrappers) is no longer being used for Frontier (or Crusher), please consider updating xxxx_frontier.cmake files to use Cray wrappers (similar to gnu_pm-cpu.cmake for Perlmutter) to avoid build errors and potential CMake 3.22+ hanging issues.

PS, with non-Cray wrappers to configure and build SCORPIO, the build errors are from linking ADIOS libs (built with Cray MPI wrappers):

/usr/bin/ld: /lustre/orion/cli133/world-shared/3rdparty/adios2/2.8.3.patch/cray-mpich-8.1.17/gcc-11.2.0/lib64/libadios2_evpath.a(cm.c.o): undefined reference to symbol 'pthread_join@@GLIBC_2.2.5'
/usr/bin/ld: /lib64/libpthread.so.0: error adding symbols: DSO missing from command line

Comment on lines -4 to -7
string(APPEND CPPDEFS " -DLINUX")
if (COMP_NAME STREQUAL gptl)
string(APPEND CPPDEFS " -DHAVE_NANOTIME -DBIT64 -DHAVE_VPRINTF -DHAVE_BACKTRACE -DHAVE_SLASHPROC -DHAVE_COMM_F2C -DHAVE_TIMES -DHAVE_GETTIMEOFDAY")
endif()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: Will check if Frontier/Crusher configurations need these.
Esp. w.r.t GPTL if they are already present or redundant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amametjanov We usually have -DHAVE_SLASHPROC. What else do we really need?

-DLINUX is checked in shr_sys_mod conditionals.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grepping through share, it appears that only -DHAVE_VPRINTF -DHAVE_BACKTRACE can be removed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I wonder why don't we use rest of the options on other machines too? If we expect almost all machines to have some of these, why don't we move them to defaults in buildlib?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some machines might not have low-level x86 intrinsics like rdtsc used by HAVE_NANOTIME.
/proc was not available on all machines: maybe it is now.
Putting all these configure options in the top-level OS-specific XML/CMake file (instead of a build script) might have been appropriate at the time. Don't know.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth opening an issue on CIME to poll if OS-specific cmake files are useful anywhere.

@sarats
Copy link
Member Author

sarats commented Jun 6, 2023

@dqwu Your tests are with GNU, right? Those don't have the Cray wrappers set.

@dqwu
Copy link
Contributor

dqwu commented Jun 6, 2023

@dqwu Your tests are with GNU, right? Those don't have the Cray wrappers set.

Right, the default compiler is GNU: <COMPILERS>gnu,crayclang,amdclang,gnugpu,crayclanggpu,amdclanggpu</COMPILERS>

And GNU compiler can use Cray wrappers, see gnu_pm-cpu.cmake for Perlmutter.

@jgfouca
Copy link
Member

jgfouca commented Jun 6, 2023

@sarats , it looks like this likely broke the builds for most tests on crusher / e3sm_integration_next_gnu.  

@sarats
Copy link
Member Author

sarats commented Jun 6, 2023

Yes, I have some local changes for GNU wrappers that I will push through.

@sarats
Copy link
Member Author

sarats commented Jun 6, 2023

default compiler is GNU

To avoid any confusion, Cray compilers are preferred and I will set them as default on this machine.
GNU is the fallback option.

@jgfouca Does CIME consider the first compiler entry as the default?

AMD compiler have known issues and not ready for use again #5692

@jgfouca
Copy link
Member

jgfouca commented Jun 6, 2023

@sarats , yes, first compiler is the default.

@jgfouca
Copy link
Member

jgfouca commented Jun 29, 2023

@sarats , any progress on this?

@rljacob
Copy link
Member

rljacob commented Jul 6, 2023

ping @sarats please see above.

@sarats
Copy link
Member Author

sarats commented Jul 6, 2023

On vacation, will look next week.

@rljacob
Copy link
Member

rljacob commented Jul 12, 2023

@sarats please take care of this.

@sarats
Copy link
Member Author

sarats commented Jul 12, 2023

For future reference:

GPTL build options

-DHAVE_SLASHPROC is needed in GPTLget_memusage to obtain data directly from the /proc filesystem (present on most current systems). That file has logic for BGP and BGQ that can probably be removed in a later PR.

-DHAVE_NANOTIME and -DBIT64 are used in combination in gptl.c to invoke rdtsc (inline assembly) and store in timers. Code snippet from PAPI and also see: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

-DHAVE_VPRINTF - GPTL assumes this is present by default, no corresponding #ifdef block. There is a NO_VPRINTF option which just prints an error message.

-DHAVE_BACKTRACE - Not used

-DDHAVE_COMM_F2C : Seems to be default in timing/private.h. No harm in retaining.

-DHAVE_TIMES : Enables CPU stats and needed

-DHAVE_GETTIMEOFDAY: Needed and present on all machines.

@sarats
Copy link
Member Author

sarats commented Jul 12, 2023

This turned out to require a lot more updates to all compiler files for Frontier/Crusher but it is better to get rid of the OS specific cmake file anyway for clarity.

@grnydawn You can apply the new commits to next (I presume older commits are on next already which would explain the Dashboard). Let me know when you do it and we can trigger new test runs on OLCF Gitlab.

@sarats
Copy link
Member Author

sarats commented Jul 12, 2023

Note: This also fixed inconsistencies in ADIOS support on Frontier.

@grnydawn
Copy link
Contributor

@sarats Ok. I will apply the new commits to next and will test them on Crusher(or Frontier).

@sarats
Copy link
Member Author

sarats commented Jul 12, 2023

@jgfouca I haven't touched the Scream compiler files. Currently, I removed the CNL reference in config_machines.xml. So, check how you want to handle any changes that are required before doing a upstream merge for Scream.

We should really converge to a single machine entry soon to avoid redundant work. I don't know what outstanding issues are a hurdle to that (we resolved the Kokkos config discrepancy). Other than some Depends file where optimization is turned off for all CICE in SCREAM.

@grnydawn
Copy link
Contributor

@sarats @jgfouca , I locally merged the commits into next. Currently it is being tested on Frontier. After looking at the test results so far, there are issues in "/eam/src/physics/cam/zm_conv_intr.F90" All test cases are failed at the file.

It seems that there exists a bug in the file of using "real(r8) :: jctop(pcols)" variable. The real variable is used in the place of DO loop variable or as a scalar integer index like below:

do k = jctop(i),pver
bottom = state%pint(i,pver+1) - state%pmid(i,jctop(i))

It apparently is not related to Sarat's commits, and if there is no other concerns, I think I will push this updated next to upstream.

@sarats
Copy link
Member Author

sarats commented Jul 13, 2023

Fixes for the unrelated ZM issue are in #5805

grnydawn added a commit that referenced this pull request Jul 13, 2023
* Removing CMake macro file used on old Cray supercomputers using Catamount OS.
* Updated OS entry for OLCF machines.

[BFB]
@jgfouca
Copy link
Member

jgfouca commented Jul 13, 2023

@sarats , you should not have to modify anything in scream/eamxx since they manage their own flags. We have discussed scream using the CIME flags like the other components so it's in the works. Thanks for the summary of what these CPPDEFs do.

@grnydawn , please feel free to push to this branch and next.

@grnydawn
Copy link
Contributor

@jgfouca , This branch has been merged into next.

@dqwu
Copy link
Contributor

dqwu commented Jul 13, 2023

@sarats "FI_CXI_RX_MATCH_MODE=software" mentioned in issue #5750 is not included in this PR, right?
FYI, that ENV variable has already been set for scream (in scream feature branch machines/frontier).

@sarats
Copy link
Member Author

sarats commented Jul 13, 2023

We discussed the software tag matching during our regular meeting with Cray. I don't see a reason to set it as a machine-wide default at this point. As I mentioned during our conversation at all-hands, I want to be careful and deliberate about machine-wide default options and assess performance impact.

You should join the bi-weekly calls with Cray for more context/details.

@jgfouca
Copy link
Member

jgfouca commented Jul 14, 2023

@grnydawn , @sarats , can I merge to master now?

@sarats
Copy link
Member Author

sarats commented Jul 14, 2023

I kicked off the Gitlab CI testing at OLCF this afternoon, they are still running.

grnydawn added a commit that referenced this pull request Jul 17, 2023
Frontier/Crusher: AMD compiler add NETCDF paths

[BFB]
@grnydawn
Copy link
Contributor

@sarats I have pushed the recent update about adding NETCDF paths for amd compiler to next branch.

@jgfouca, while waiting for Sarat's opinion, I think it is ok to merge this PR. From CDash test results, this PR fixed the issue from removing CNL.cmake. There are still 16 test failures after merging this PR but those failures seem not related to this issue.

@jgfouca jgfouca merged commit 25f1726 into master Jul 17, 2023
@jgfouca jgfouca deleted the sarats/machines/remove-cnl-cmake branch July 17, 2023 14:57
@sarats
Copy link
Member Author

sarats commented Jul 17, 2023

FYI, https://my.cdash.org/viewTest.php?buildid=2373368 (AMD Clang has 41 pass and 77 fails)
Most of those fails are known issues.

#5692

@sarats
Copy link
Member Author

sarats commented Jul 17, 2023

Since it has known fails without workarounds (OLCF/AMD not providing fixes), I'm going to disable regular testing with AMD.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BFB PR leaves answers BFB Machine Files
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants