
Fix segfault, remove nonsymmetric ginkgo solver #548

Merged · 6 commits into develop · Nov 10, 2022
Conversation

@fritzgoebel (Contributor) commented Sep 22, 2022

This PR should fix issues #540 and #542. It also removes the nonsymmetric ginkgo solver as previously suggested by @pelesh.

The problems were a missing host-side copy of the rhs vector for #540, and directly accessing matrix and vector values located in GPU memory in update_matrix for #542. The latter is fine as long as the values are in unified memory, which they apparently are when built in Debug mode. Instead, I added a host-side copy of the matrix that gets updated and then moved back to the GPU.

@fritzgoebel fritzgoebel self-assigned this Sep 22, 2022
@pelesh (Collaborator) commented Sep 22, 2022

I think this is relevant for other GPU solvers in HiOp.

@kswirydo @nychiang

@fritzgoebel (Contributor, Author)

> I think this is relevant for other GPU solvers in HiOp.
>
> @kswirydo @nychiang

Not sure it actually is; the GPU memory was coming from inside a Ginkgo matrix.

@cameronrutherford (Collaborator)

PNNL pipelines seem to be giving a false negative here. Tests failed initially but passed on a retry. I would suggest adding a new commit with https://github.com/LLNL/hiop/blob/develop/.gitlab-ci.yml#L159 changed to not include `-E NlpSparse...`, so we can see whether the previously failing tests now pass.

Thankfully this doesn't require a rebuild, so this should be a quick check.
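For context, `ctest -E <regex>` excludes tests whose names match the regex, so removing the filter re-enables them. A hypothetical before/after sketch of such a CI line (the real `.gitlab-ci.yml` contents are not reproduced here):

```yaml
# before: sparse NLP tests excluded from the CI test stage
script:
  - ctest -E "NlpSparse" --output-on-failure

# after: run the full suite, including the NlpSparse tests
script:
  - ctest --output-on-failure
```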

@nkoukpaizan (Collaborator)

Tests are passing on Ascent and Summit with HiOp built in Release mode. I think this fixes #542.

Building with cuda_arch 60,70,75 and 80 on all Marianas GPUs
@cameronrutherford (Collaborator) commented Nov 8, 2022

It seems CUDA_ARCHITECTURES cannot readily be passed from the CLI in the platform variables script into the CMake config. As a workaround, I have specified all CUDA architectures for Marianas in gcc-cuda.cmake. The only downside is that all CI platforms and developers using this config will now build with all of these architectures by default, rather than only on Marianas as would be desirable.

@cameronrutherford (Collaborator) commented Nov 8, 2022

I was able to get tests passing when running manually on the previously failing platform.

I am not able to reproduce the old issue with the updated module set. I assume that since Ginkgo is built with the full array of CUDA architectures, the CUDA runtime is smart enough to pick the right spec, despite HiOp being built with the "wrong" CUDA arch.

Unless people have qualms with how I hacked this together, I think this should be good to merge.

As a side note, ExaGO now has updated modules ready to go with a fresh build of the latest hiop@develop.

@fritzgoebel (Contributor, Author)

Thanks @cameronrutherford!

@cameronrutherford (Collaborator) left a review comment:

Approving, just pointing out points of confusion

Comment on lines +28 to +29:

```cmake
set(CMAKE_CUDA_ARCHITECTURES 60 70 75 80 CACHE STRING "")
message(STATUS "Setting default cuda architecture to ${CMAKE_CUDA_ARCHITECTURES}")
```
Collaborator: This doesn't actually seem necessary to resolve the issue. It seems to be sufficient just to have the updated Ginkgo module. Not sure if we want to keep this.

Collaborator: I don't think there is anything wrong with this. It will just build a binary that can run on different NVIDIA cards.

Comment on lines +44 to +45:

```shell
# ginkgo@1.5.0.glu_experimental%gcc@10.2.0+cuda~develtools~full_optimizations~hwloc~ipo~oneapi+openmp~rocm+shared build_type=Release cuda_arch=60,70,75,80 arch=linux-centos7-zen2
module load ginkgo-1.5.0.glu_experimental-gcc-10.2.0-x73b7k3
```
Collaborator: Example of just the updated Ginkgo module.

@pelesh (Collaborator) commented Nov 8, 2022

> The only downside is that all CI platforms and developers using this config will now build with all of these architectures by default, rather than only on Marianas as would be desirable.

This is not a big issue. If you want to distribute HiOp binary, for example, this is exactly how you would build it.

@cameronrutherford (Collaborator)

All tests passing in CI.

@cameronrutherford (Collaborator)

Passing Spack builds in GitHub Actions!

@cnpetra cnpetra merged commit 8025c3a into develop Nov 10, 2022

6 participants