Remove CNL specific CMake macros #5745
Removing the CMake macro file used on old Cray supercomputers running the Catamount OS. Updated the OS entry for OLCF machines. If any other machines still rely on flags in the deleted macro file, we suggest moving those to the specific compiler_machine.cmake macro file. [BFB]
@sarats For Perlmutter (where CNL.cmake is not used), gnu_pm-cpu.cmake sets the Cray wrappers explicitly.
Please check whether Crusher or Frontier also need similar settings in their compiler_machine.cmake macro files once CNL.cmake is removed.
Will check. Cray compilers have it.
@jgfouca @sarats Before this PR, Cray MPI wrappers (set in CNL.cmake) were used to configure SCORPIO (see spio.bldlog).

With CNL.cmake removed by this PR, non-Cray MPI wrappers are used instead (see spio.bldlog). Besides causing build errors, non-Cray MPI wrappers have a confirmed hang issue with CMake 3.22 or higher when configuring SCORPIO (Frontier uses CMake 3.21 so far); see E3SM-Project/scorpio#517.

If CNL.cmake (which sets the Cray wrappers) is no longer used for Frontier (or Crusher), please consider updating the xxxx_frontier.cmake files to use Cray wrappers (similar to gnu_pm-cpu.cmake for Perlmutter) to avoid build errors and potential CMake 3.22+ hangs.

PS: when SCORPIO is configured and built with non-Cray wrappers, the build errors come from linking the ADIOS libraries (which were built with Cray MPI wrappers).
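A minimal sketch of what such a Frontier compiler macro file could pin, assuming the MPICC/MPICXX/MPIFC and SCC/SCXX/SFC variable names used by CIME's per-compiler CMake macro files. The file name gnu_frontier.cmake is illustrative, and this is not claimed to be the actual content of gnu_pm-cpu.cmake or of this PR's commits:

```cmake
# Sketch only (assumed file: gnu_frontier.cmake; variable names assumed from
# CIME's compiler/machine macro conventions).
# Use the Cray Programming Environment compiler drivers as MPI wrappers, so
# SCORPIO is configured with the same MPI that ADIOS was built against.
set(MPICC "cc")      # Cray C driver; links Cray MPICH when cray-mpich is loaded
set(MPICXX "CC")     # Cray C++ driver
set(MPIFC "ftn")     # Cray Fortran driver

# Serial (non-MPI) compilers stay the plain GNU tools.
set(SCC "gcc")
set(SCXX "g++")
set(SFC "gfortran")
```

Pinning the wrappers per compiler/machine file, rather than per OS, keeps the choice explicit for each compiler on each machine, which is the direction this PR takes.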
string(APPEND CPPDEFS " -DLINUX")
if (COMP_NAME STREQUAL gptl)
  string(APPEND CPPDEFS " -DHAVE_NANOTIME -DBIT64 -DHAVE_VPRINTF -DHAVE_BACKTRACE -DHAVE_SLASHPROC -DHAVE_COMM_F2C -DHAVE_TIMES -DHAVE_GETTIMEOFDAY")
endif()
Note to self: Will check whether the Frontier/Crusher configurations need these, especially the GPTL flags, and whether they are already present or redundant.
@amametjanov We usually have -DHAVE_SLASHPROC. What else do we really need? -DLINUX is checked in shr_sys_mod conditionals.
Grepping through share, it appears that only -DHAVE_VPRINTF and -DHAVE_BACKTRACE can be removed.
So I wonder why we don't use the rest of these options on other machines too. If we expect almost all machines to have some of them, why not move them to the defaults in buildlib?
Some machines might not have the low-level x86 intrinsics, like rdtsc, used by HAVE_NANOTIME. /proc was not available on all machines; maybe it is now.
Putting all these configure options in the top-level OS-specific XML/CMake file (instead of a build script) might have been appropriate at the time. I don't know.
It might be worth opening an issue on CIME to poll whether OS-specific cmake files are useful anywhere.
@dqwu Your tests are with GNU, right? Those don't have the Cray wrappers set.
Right, the default compiler is GNU. And the GNU compiler can use Cray wrappers; see gnu_pm-cpu.cmake for Perlmutter.
@sarats, it looks like this broke the builds for most tests on crusher / e3sm_integration_next_gnu.
Yes, I have some local changes for the GNU wrappers that I will push through.
@sarats, yes, the first compiler is the default.
@sarats, any progress on this?
Ping @sarats, please see above.
On vacation; will look next week.
@sarats, please take care of this.
For future reference, the GPTL build options:
* -DHAVE_SLASHPROC: needed in GPTLget_memusage to obtain data directly from the /proc filesystem (present on most current systems). That file has logic for BGP and BGQ that can probably be removed in a later PR.
* -DHAVE_NANOTIME and -DBIT64: used in combination in gptl.c to invoke …
* -DHAVE_VPRINTF: GPTL assumes this is present by default; there is no corresponding #ifdef block. There is a NO_VPRINTF option, which just prints an error message.
* -DHAVE_BACKTRACE: not used.
* -DHAVE_COMM_F2C: seems to be the default in timing/private.h. No harm in retaining it.
* -DHAVE_TIMES: enables CPU stats; needed.
* -DHAVE_GETTIMEOFDAY: needed, and present on all machines.
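As a sketch only, the GPTL options summarized above could be carried from CNL.cmake into a machine-specific macro file (hypothetical name gnu_frontier.cmake), dropping -DHAVE_VPRINTF and -DHAVE_BACKTRACE as suggested in the review. This is an illustration under those assumptions, not necessarily what the PR's commits contain:

```cmake
# Sketch only (hypothetical file: gnu_frontier.cmake).
string(APPEND CPPDEFS " -DLINUX")   # checked in shr_sys_mod conditionals

if (COMP_NAME STREQUAL "gptl")
  # HAVE_NANOTIME + BIT64: used together in gptl.c for the rdtsc-based timer (x86 only)
  # HAVE_SLASHPROC:     GPTLget_memusage reads memory usage from /proc
  # HAVE_COMM_F2C:      retained; reportedly the default in timing/private.h
  # HAVE_TIMES:         enables CPU statistics
  # HAVE_GETTIMEOFDAY:  wall-clock timer, available on all machines
  # HAVE_VPRINTF and HAVE_BACKTRACE are intentionally omitted.
  string(APPEND CPPDEFS " -DHAVE_NANOTIME -DBIT64 -DHAVE_SLASHPROC -DHAVE_COMM_F2C -DHAVE_TIMES -DHAVE_GETTIMEOFDAY")
endif()
```

Because HAVE_NANOTIME relies on x86 rdtsc, a file for a non-x86 machine would leave out -DHAVE_NANOTIME and -DBIT64 while keeping the rest.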
Both Crusher and Frontier
Both Frontier and Crusher.
This turned out to require many more updates to all the compiler files for Frontier/Crusher, but it is better to get rid of the OS-specific cmake file anyway for clarity. @grnydawn You can apply the new commits to next (I presume the older commits are already on next, which would explain the Dashboard). Let me know when you do, and we can trigger new test runs on OLCF GitLab.
Note: This also fixed inconsistencies in ADIOS support on Frontier.
@sarats OK. I will apply the new commits to next and test them on Crusher (or Frontier).
@jgfouca I haven't touched the Scream compiler files; I have only removed the CNL reference in config_machines.xml. So check how you want to handle any required changes before doing an upstream merge for Scream. We should really converge to a single machine entry soon to avoid redundant work. I don't know what outstanding issues stand in the way (we resolved the Kokkos config discrepancy), other than a Depends file where optimization is turned off for all of CICE in SCREAM.
@sarats @jgfouca, I merged the commits into next locally, and it is currently being tested on Frontier. Judging from the test results so far, there are issues in "/eam/src/physics/cam/zm_conv_intr.F90": all test cases fail in that file. There appears to be a bug in how the "real(r8) :: jctop(pcols)" variable is used: this real variable serves as a DO loop bound or as a scalar integer index, as in: do k = jctop(i),pver. This is apparently unrelated to Sarat's commits, so if there are no other concerns, I will push this updated next to upstream.
Fixes for the unrelated ZM issue are in #5805.
* Removing CMake macro file used on old Cray supercomputers using Catamount OS.
* Updated OS entry for OLCF machines.
[BFB]
@sarats, you should not have to modify anything in scream/eamxx, since they manage their own flags. We have discussed having scream use the CIME flags like the other components, so that is in the works. Thanks for the summary of what these CPPDEFs do. @grnydawn, please feel free to push to this branch and next.
@jgfouca, this branch has been merged into next.
We discussed the software tag matching during our regular meeting with Cray. I don't see a reason to set it as a machine-wide default at this point. As I mentioned during our conversation at all-hands, I want to be careful and deliberate about machine-wide default options and assess performance impact. You should join the bi-weekly calls with Cray for more context/details. |
I kicked off the GitLab CI testing at OLCF this afternoon; the tests are still running.
Frontier/Crusher: AMD compiler add NETCDF paths [BFB]
@sarats I have pushed the recent update adding NETCDF paths for the AMD compiler to the next branch. @jgfouca, while we wait for Sarat's opinion, I think it is OK to merge this PR. According to the CDash test results, this PR fixed the issue caused by removing CNL.cmake. There are still 16 test failures after merging this PR, but those failures seem unrelated to this issue.
FYI, https://my.cdash.org/viewTest.php?buildid=2373368 (AMD Clang: 41 passed, 77 failed).
Since AMD has known failures without workarounds (OLCF/AMD are not providing fixes), I'm going to disable regular testing with AMD.