Fixes CUDA sanitizer issues #343

Merged: mark-petersen merged 3 commits into E3SM-Project:develop from grnydawn:ykim/omega/memleak-pr on Mar 18, 2026

@grnydawn commented Feb 16, 2026

Add more initialization, finalization and Kokkos fences to eliminate CUDA meminit and memleak check issues.

  • Tests passed on Perlmutter-GPU and Perlmutter-CPU.
  • Initialized default static Omega objects, such as DefaultDecomp = nullptr;.
  • Added MPI_Group_free calls to prevent memory leaks.
  • Created Clock ModelClockObj; on the stack instead of the heap.
  • Added destroy functions, such as Eos::destroyInstance();, in the Omega finalize routine.
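
The stack-versus-heap point in the list above can be illustrated with a minimal sketch. `ToyClock`, `makeHeapClock`, `liveInsideStackScope`, and `demoClockLifetimes` are hypothetical names used only for illustration; they are not Omega's Clock API. The sketch counts live instances to show that a stack object is destroyed automatically at scope exit, while a heap object stays alive until someone deletes it.

```cpp
#include <cassert>

// Hypothetical stand-in for a clock class (NOT Omega's Clock).
// It counts how many instances are currently alive.
struct ToyClock {
    static int Live;
    ToyClock() { ++Live; }
    ToyClock(const ToyClock &) = delete; // keep the live count simple
    ~ToyClock() { --Live; }
};
int ToyClock::Live = 0;

// Heap version: the caller owns the pointer and must delete it,
// otherwise a leak checker will report the allocation.
ToyClock *makeHeapClock() { return new ToyClock; }

// Stack version: the destructor runs automatically at scope exit.
int liveInsideStackScope() {
    ToyClock ModelClockObj;
    return ToyClock::Live; // 1 while ModelClockObj is in scope
}

// Walks through both lifetimes and reports whether they behaved as expected.
bool demoClockLifetimes() {
    assert(liveInsideStackScope() == 1);
    bool StackCleanedUp = (ToyClock::Live == 0); // no manual cleanup needed
    ToyClock *HeapClock = makeHeapClock();
    bool HeapStillAlive = (ToyClock::Live == 1); // alive until deleted
    delete HeapClock;
    return StackCleanedUp && HeapStillAlive && ToyClock::Live == 0;
}
```

If the `delete HeapClock;` line were omitted, the live count would stay at one, which is the kind of allocation a leak check would be expected to flag.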

Copilot AI left a comment

Pull request overview

This PR addresses CUDA sanitizer issues by adding comprehensive memory initialization, proper synchronization with Kokkos fences, and cleanup of resource leaks. The changes ensure that uninitialized memory is not read during halo exchanges and that GPU state is properly synchronized before MPI operations.

Changes:

  • Initialize auxiliary variable arrays and ocean state arrays to zero in constructors to prevent uninitialized memory reads during halo exchanges
  • Add Kokkos::fence() calls before halo exchange operations to ensure GPU operations complete before MPI accesses device memory
  • Fix memory leaks by adding MPI_Group_free() calls, replacing heap-allocated Clock objects with stack allocation, and adding singleton destruction calls
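
The singleton destruction mentioned in the last item can be sketched roughly as follows. `ToyEos` is a hypothetical class, loosely modeled on the `Eos::destroyInstance();` call named in the PR description; the real Omega classes may be structured differently.

```cpp
#include <cassert>

// Hypothetical singleton (NOT Omega's Eos class).
class ToyEos {
  public:
    // Creates the instance on first use.
    static ToyEos *getInstance() {
        if (!Instance)
            Instance = new ToyEos();
        return Instance;
    }
    // Frees the instance and resets the pointer so that a later
    // getInstance() or destroyInstance() call stays well defined.
    static void destroyInstance() {
        delete Instance; // deleting a null pointer is a no-op
        Instance = nullptr;
    }
    static bool exists() { return Instance != nullptr; }

  private:
    ToyEos() = default;
    ~ToyEos() = default;
    static ToyEos *Instance;
};
ToyEos *ToyEos::Instance = nullptr;

// Creates, destroys, and destroys again to show the call is idempotent.
bool demoSingletonCleanup() {
    ToyEos::getInstance();
    assert(ToyEos::exists());
    ToyEos::destroyInstance();
    ToyEos::destroyInstance(); // second call is harmless
    return !ToyEos::exists();
}
```

Calling the destroy function from a finalize routine releases the allocation before the process exits, so a leak check has nothing left to report for that object.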

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| RungeKutta4Stepper.cpp | Initialize ProvisTracers array to zero |
| WindForcingAuxVars.cpp | Initialize wind stress arrays to zero |
| VorticityAuxVars.cpp | Initialize vorticity arrays to zero and add OmegaKokkos.h include |
| VelocityDel2AuxVars.cpp | Initialize Del2 arrays to zero and add OmegaKokkos.h include |
| TracerAuxVars.cpp | Initialize tracer auxiliary arrays to zero and add OmegaKokkos.h include |
| LayerThicknessAuxVars.cpp | Initialize layer thickness arrays to zero and add OmegaKokkos.h include |
| KineticAuxVars.cpp | Initialize kinetic energy arrays to zero and add OmegaKokkos.h include |
| VertCoord.cpp | Initialize min/max layer arrays to zero for edges and vertices |
| Tracers.cpp | Initialize tracer arrays to zero and add fence before halo exchange |
| OceanState.cpp | Initialize ocean state arrays to zero and add fence before halo exchange |
| OceanFinal.cpp | Add IOStream finalization with clock and singleton destruction for Eos and VertMix |
| AuxiliaryState.cpp | Add fence before halo exchange |
| IOStream.cpp | Replace heap-allocated Clock with stack-allocated Clock to fix memory leaks |
| MachEnv.cpp | Add fences before MPI operations, free MPI_Group objects, and set DefaultEnv to nullptr |
| Halo.cpp | Set DefaultHalo to nullptr after clearing |
| Decomp.cpp | Set DefaultDecomp to nullptr after clearing |

@philipwjones left a comment

I'm fine with the changes, with the caveat that @mwarusz's comments should be addressed. In particular, if we always need the fences before halos, I think they should be pushed into the halo routines (maybe we just need to remove the `if` check for device buffers?). I will approve while others make the case either way.
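
The suggestion above, moving the fence into the halo routine rather than repeating it at every call site, can be sketched with a toy model. Everything here is hypothetical: `ToyDevice` only imitates an asynchronous device queue using the standard library, and `exchangeHalo` is not Omega's Halo API. In real Kokkos code the synchronization call would be `Kokkos::fence()`.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Toy stand-in for an asynchronous device queue. This is NOT Kokkos:
// launch() only records work, and fence() runs everything recorded so far.
struct ToyDevice {
    std::vector<std::function<void()>> Pending;
    void launch(std::function<void()> Work) {
        Pending.push_back(std::move(Work));
    }
    void fence() {
        for (auto &Work : Pending)
            Work();
        Pending.clear();
    }
};

// Hypothetical halo routine that owns its fence, so callers cannot forget
// it. Returns the boundary value that would be sent to a neighbor.
int exchangeHalo(ToyDevice &Dev, const std::vector<int> &Field) {
    Dev.fence(); // in real code this would be Kokkos::fence()
    assert(Dev.Pending.empty());
    return Field.back();
}

// Demo: the update to the boundary cell is still pending when the
// exchange is requested, yet the exchanged value is the updated one.
int demoHalo() {
    ToyDevice Dev;
    std::vector<int> Field(4, 0);
    Dev.launch([&Field] { Field[3] = 7; });
    return exchangeHalo(Dev, Field);
}
```

The trade-off is that a fence inside the routine is paid on every exchange, including calls where the data is already synchronized, which is one reason reviewers might argue either way.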

@grnydawn grnydawn force-pushed the ykim/omega/memleak-pr branch from 01915ee to 75a1973 Compare February 19, 2026 16:30
@grnydawn (Author)

@mwarusz, @philipwjones: I removed the unnecessary Kokkos View initializations. I also removed the Kokkos fences, which may be reconsidered later if an actual issue arises.

@mwarusz (Member) left a comment

Passes CTests on aurora-cpu and aurora-gpu. I have some minor comments, but it looks almost ready.

@grnydawn grnydawn force-pushed the ykim/omega/memleak-pr branch from b11b385 to a5623ea Compare March 2, 2026 15:19

AllDecomps.clear(); // removes all decomps from the list (map) and in
                    // the process, calls the destructors for each
DefaultDecomp = nullptr; // prevent dangling pointer

Collaborator

Should the default pointer for other classes also be set to nullptr in the clear method, i.e. HorzMesh, Tracers, State, etc?

Author

I agree with this. I only looked at the code that the memory check tool pointed to, and I suspect the tool limits how much it reports. I will review other parts of the code with a similar structure and nullify their default pointers in the same way as for Decomp and Halo. Thanks.
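
The pattern discussed in this thread can be sketched as follows. `ToyDecomp` is a hypothetical class; the use of a `std::map` of `std::unique_ptr` for the instance list is an assumption made for illustration and may not match how Omega stores its instances. The point is that clearing the container destroys every instance, so a separately held default pointer must be reset as well.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical class mirroring the DefaultDecomp / AllDecomps layout
// (NOT Omega's Decomp class).
class ToyDecomp {
  public:
    // Creates a named instance; the first one created becomes the default.
    static ToyDecomp *create(const std::string &Name) {
        auto Result = AllDecomps.emplace(Name, std::make_unique<ToyDecomp>());
        ToyDecomp *New = Result.first->second.get();
        if (!DefaultDecomp)
            DefaultDecomp = New;
        return New;
    }
    static ToyDecomp *getDefault() { return DefaultDecomp; }
    // Destroys all instances and resets the default pointer.
    static void clear() {
        AllDecomps.clear();      // runs each instance's destructor
        DefaultDecomp = nullptr; // otherwise it would dangle
    }

  private:
    static ToyDecomp *DefaultDecomp;
    static std::map<std::string, std::unique_ptr<ToyDecomp>> AllDecomps;
};
ToyDecomp *ToyDecomp::DefaultDecomp = nullptr;
std::map<std::string, std::unique_ptr<ToyDecomp>> ToyDecomp::AllDecomps;

// Creates a default instance, clears, and checks the pointer was reset.
bool demoClearResetsDefault() {
    ToyDecomp::create("Default");
    assert(ToyDecomp::getDefault() != nullptr);
    ToyDecomp::clear();
    return ToyDecomp::getDefault() == nullptr;
}
```

With the reset in place, code that runs after `clear()` can test the default pointer against `nullptr` instead of dereferencing freed memory.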

@sbrus89 (Collaborator) left a comment

Thanks for adding those additional nullifications @grnydawn. Approved by inspection.

@mark-petersen mark-petersen removed the request for review from brian-oneill March 18, 2026 16:21
@mark-petersen (Collaborator)

Thanks everyone for your feedback. I removed Brian, as there were sufficient reviews and he is reviewing other PRs.

@mark-petersen mark-petersen merged commit 47c455d into E3SM-Project:develop Mar 18, 2026
1 check passed
xylar added a commit to xylar/polaris that referenced this pull request Mar 27, 2026
This merge updates the e3sm_submodules/Omega submodule from [d0b3482](https://github.com/E3SM-Project/Omega/tree/d0b3482) to [74611d548d](https://github.com/E3SM-Project/Omega/tree/74611d548d).

This update includes the following MPAS-Ocean and MPAS-Frameworks PRs (check mark indicates bit-for-bit with previous PR in the list):
- [ ]  (ocn) E3SM-Project/Omega#343
- [ ]  (ocn) E3SM-Project/Omega#344
- [ ]  (ocn) E3SM-Project/Omega#226
- [ ]  (ocn) E3SM-Project/Omega#369
- [ ]  (ocn) E3SM-Project/Omega#379
6 participants