
Fix segfault, remove nonsymmetric ginkgo solver #548

Merged · 6 commits into develop · Nov 10, 2022
Conversation

@fritzgoebel (Contributor) commented Sep 22, 2022

This PR should fix issues #540 and #542. It also removes the nonsymmetric ginkgo solver as previously suggested by @pelesh.

The problems were a missing host-side copy of the rhs vector for #540, and directly accessing matrix and vector values located in GPU memory in update_matrix for #542. The latter is fine as long as the values are in unified memory, which they apparently are when built in Debug mode. Instead, I added a host-side copy of the matrix that gets updated and then moved back to the GPU.

@fritzgoebel fritzgoebel self-assigned this Sep 22, 2022
@pelesh (Collaborator) commented Sep 22, 2022

I think this is relevant for other GPU solvers in HiOp.

@kswirydo @nychiang

@fritzgoebel (Contributor, Author)

> I think this is relevant for other GPU solvers in HiOp.
>
> @kswirydo @nychiang

Not sure it actually is; the GPU memory was coming from inside a Ginkgo matrix.

@cameronrutherford (Collaborator)

PNNL pipelines seem to be giving a false negative here. Tests failed initially but passed on a retry. I would suggest adding a new commit with https://github.com/LLNL/hiop/blob/develop/.gitlab-ci.yml#L159 changed to not include `-E NlpSparse...`, so we can see whether the previously failing tests now pass.

Thankfully this doesn't require a rebuild, so this should be a quick check.
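For context, `ctest -E <regex>` excludes tests whose names match the regex, so removing the filter re-enables them. A hypothetical before/after sketch of such a CI line (the real `.gitlab-ci.yml` contents are not reproduced here):

```yaml
# before: sparse NLP tests excluded from the CI test stage
script:
  - ctest -E "NlpSparse" --output-on-failure

# after: run the full suite, including the NlpSparse tests
script:
  - ctest --output-on-failure
```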

@nkoukpaizan (Collaborator)

Tests are passing on Ascent and Summit with HiOp built in Release mode. I think this fixes #542.

Building with cuda_arch 60,70,75 and 80 on all Marianas GPUs
@cameronrutherford (Collaborator) commented Nov 8, 2022

It seems CUDA_ARCHITECTURES cannot readily be passed from the CLI in the platform variables script into the CMake config. As a workaround, I have specified all CUDA architectures for Marianas in gcc-cuda.cmake. The only downside is that all CI platforms and developers using this config will now build with all of these architectures by default, rather than only on Marianas as would be desirable.

@cameronrutherford (Collaborator) commented Nov 8, 2022

I was able to get tests passing when running manually on the previously failing platform.

I am not able to reproduce the old issue with the updated module set. I assume that since Ginkgo is built with the full array of CUDA architectures, the CUDA runtime is smart enough to pick the right spec, despite HiOp being built with the "wrong" CUDA arch.

Unless people have qualms with how I hacked this together, I think this should be good to merge.

As a side note, ExaGO now has updated modules ready to go with a fresh build of the latest hiop@develop.

@fritzgoebel (Contributor, Author)

Thanks @cameronrutherford!

@cameronrutherford (Collaborator) left a review comment:

Approving, just pointing out points of confusion

Comment on lines +28 to +29:

```cmake
set(CMAKE_CUDA_ARCHITECTURES 60 70 75 80 CACHE STRING "")
message(STATUS "Setting default cuda architecture to ${CMAKE_CUDA_ARCHITECTURES}")
```
Collaborator: This doesn't actually seem necessary to resolve the issue. It seems to be sufficient just to have the updated Ginkgo module. Not sure if we want to keep this.

Collaborator: I don't think there is anything wrong with this. It will just build a binary that can run on different NVIDIA cards.

Comment on lines +44 to +45:

```shell
# ginkgo@1.5.0.glu_experimental%gcc@10.2.0+cuda~develtools~full_optimizations~hwloc~ipo~oneapi+openmp~rocm+shared build_type=Release cuda_arch=60,70,75,80 arch=linux-centos7-zen2
module load ginkgo-1.5.0.glu_experimental-gcc-10.2.0-x73b7k3
```
Collaborator: Example of just the updated Ginkgo module.

@pelesh (Collaborator) commented Nov 8, 2022

> The only downside is that all CI platforms and developers using this config will now build with all of these architectures by default, rather than only on Marianas as would be desirable.

This is not a big issue. If you want to distribute HiOp binary, for example, this is exactly how you would build it.

@cameronrutherford (Collaborator)

All tests passing in CI.

@cameronrutherford (Collaborator)

Passing Spack builds in GitHub Actions!

@cnpetra cnpetra merged commit 8025c3a into develop Nov 10, 2022

6 participants