Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UVM dependency removal #937

Merged
merged 11 commits into from
Jun 22, 2022
Merged

UVM dependency removal #937

merged 11 commits into from
Jun 22, 2022

Conversation

ldh4
Copy link
Contributor

@ldh4 ldh4 commented Feb 23, 2022

[WIP] Modifications to make Nalu-wind buildable without Cuda UVM

Pre-requisite work for converting Nalu-wind to be compatible with HIP.
This is to make Nalu wind to build without Cuda UVM

@tasmith4 @alanw0 @jhux2 @ddement

@tasmith4
Copy link
Contributor

After further investigation, @ldh4 and I learned from @alanw0 that tests without NGP in the name are not expected to pass in any build. After running the unit tests with that filter, they all pass on ascicgpu both with and without UVM.

@alanw0
Copy link
Contributor

alanw0 commented Feb 24, 2022

After further investigation, @ldh4 and I learned from @alanw0 that tests without NGP in the name are not expected to pass in any build. After running the unit tests with that filter, they all pass on ascicgpu both with and without UVM.

@tasmith4 @ldh4 More precisely, tests without NGP in the name are not expected to pass in a Cuda build. All unit tests are expected to pass in a CPU build. However, I am surprised to hear that all of the NGP unit-tests pass on Cuda without UVM. For example, the *.NGP* filter would include at least 3 unit tests that have tpetra usage (MixtureFractionKernelHex8Mesh.NGP_adv_diff_edge_tpetra*). I guess it's possible those tests don't use the graph assembly code, but I would be surprised. (And the one thing I think we're pretty certain about is that the graph assembly code is not using the correct types for at least some of its Views, and fixing those would then require one or more deep_copy calls.)

In any case, I'll work with you and Dong Hun tomorrow to help establish where we're at with all this.

@tasmith4
Copy link
Contributor

You're right @alanw0, I should have been more precise, thanks for clarifying. I was also very surprised all the NGP tests pass without UVM (and am willing to be skeptical of the result, although I double-checked my build), we'll have to discuss this more today as you say.

@alanw0
Copy link
Contributor

alanw0 commented Feb 24, 2022

@tasmith4 In particular we're passing those views (that this commit changes to host views) into the setAllIndices method on the tpetra crsgraph. I guess the explanation could be that we're getting those host views from a dual view, and inside setAllIndices tpetra is doing the sync for us and then using the device views. If that's what is happening then great!

@tasmith4
Copy link
Contributor

I don't think that's what's happening, so I'm now even more confused by why these tests pass.

@alanw0
Copy link
Contributor

alanw0 commented Feb 24, 2022

I guess I didn't look closely... Now that I look at this commit again, it doesn't change the types of the views we're passing into setAllIndices. It is changing the types of other stuff. So we'll just have to work through this today and see what is going on. We'll probably want to put in a (temporary) check in the code, in TpetraLinearSystem.C, for whether uvm is actually on or not.

@@ -113,6 +113,8 @@ FixPressureAtNodeAlgorithm::execute()
}
});
});

eqSystem_->linsys_->free_coeff_applier(deviceCoeffApplier);
Copy link
Contributor

@PaulMullowney PaulMullowney Jun 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary? Are we leaking memory?

Copy link
Contributor Author

@ldh4 ldh4 Jun 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the TpetraLinearSystem class, we had to change it so that it is no longer holding onto the host/device coeff appliers as its member variables. So, a new coeff applier is constructed on each request. Because of this, we had to introduce a way to properly deallocate coeff appliers when no longer needed. This is one of the changes that we will likely to revisit to find a better way to handle this.
Just as a note, free_coeff_applier is an empty function in LinearSystem.h, so this line will not affect HypreLinearSystem.

src/SolverAlgorithm.C Outdated Show resolved Hide resolved
@@ -40,6 +40,8 @@ void AssembleOversetDecoupledAlgorithm::execute()
fringeNodes.size(), KOKKOS_LAMBDA(const size_t& i) {
coeffApplier->resetRows(1, &fringeNodes(i), 0, numDof, 1.0, 0.0);
});

eqSystem_->linsys_->free_coeff_applier(coeffApplier);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're doing this again. We must have been leaking memory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reason as the above.

@ldh4 ldh4 changed the title [WIP] Start of UVM dependency removal [WIP] UVM dependency removal Jun 6, 2022
@ldh4 ldh4 changed the title [WIP] UVM dependency removal UVM dependency removal Jun 21, 2022
@ldh4 ldh4 marked this pull request as ready for review June 21, 2022 22:30
@ldh4
Copy link
Contributor Author

ldh4 commented Jun 21, 2022

A couple notes on this commit:

  • This builds and runs with and without unified memory support (tested only with cuda backend for now).
  • Building without unified memory currently shows deficiencies in both memory usage and runtime performance. We are still investigating the source of this and will have follow-on tasks to resolve this.
  • Building with unified memory does not have this problem and maintains the pervious performance. When tested with nrel5mw_refined_rcb problem (~700M nodes) on 90 nodes of summit, the build with this patch ran to its completion where as the build from Nalu-wind master timed out.

@ldh4
Copy link
Contributor Author

ldh4 commented Jun 21, 2022

I am actually running nrel5mw_refined_rcb problem again with a smaller termination count, just to see how comparable the runtimes between the two builds are.

@ldh4 ldh4 requested review from alanw0 and ddement June 21, 2022 22:56
@ldh4 ldh4 requested review from jhux2 and tasmith4 June 21, 2022 22:56
@ldh4
Copy link
Contributor Author

ldh4 commented Jun 22, 2022

Checking numbers from nrel5mw_refined_rcb runs again with its termination count reduced by half showed that there isn't any significant difference in runtime and memory usage between this patched version (uvm turned on) and the master, as expected. Runtime had ~5% speedup.

@ldh4 ldh4 merged commit 673de80 into Exawind:master Jun 22, 2022
@ldh4 ldh4 mentioned this pull request Jun 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants