ECP 4: Deploy Nalu/Kokkos algorithmic infrastructure with performance benchmarking #4
I'm interested in doing this.
@alanw0 remember that we have VTune connectors for Kokkos kernels that greatly simplify how information is presented up into VTune. In general, VTune without these is not a great experience, since the OpenMP regions used by Kokkos can't be separated out cleanly. We can talk more about this if you like. The connectors also work with …
@nmhamster ok, thanks Si, that will help. NGP profiling is difficult. We've had some success with VTune, and recently Christian helped us start looking at nvprof for the STK project. The more help the better.
@nmhamster I am interested in the VTune Kokkos connectors if someone could share those with me. I started looking at things with Allinea Map but was curious about trying out VTune. @spdomin Also, any thoughts on what tests we should be using for performance benchmarks? I have been looking at some of the regression tests and some of the performance tests as well. Do the performance tests capture what we want from this exercise?
@marchdf - can you please check out https://github.com/kokkos/kokkos-tools.git? If you go into the …
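For reference, the usual kokkos-tools workflow is to build a connector and point the Kokkos runtime at the resulting shared library via an environment variable, with no recompilation of the application. A minimal sketch (the exact tool directory, library name, and input file are illustrative and vary by connector/branch):

```shell
# Sketch of typical kokkos-tools usage; directory and .so names are
# illustrative -- check the repo's README for the connector you want.
git clone https://github.com/kokkos/kokkos-tools.git
cd kokkos-tools/profiling/simple-kernel-timer && make

# Point the Kokkos runtime at the tool, then run the application as usual.
export KOKKOS_PROFILE_LIBRARY=$PWD/kp_kernel_timer.so
./naluX -i inputfile.i   # hypothetical Nalu invocation
```

The same mechanism is how the VTune connectors mentioned above are hooked in: swap the exported library for the VTune-specific one.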
@NaluCFD/ngp, how about the following intermediate step towards Kokkos views? The concept now is that the EquationSystem retains the max set of nodesPerElement and integration points. Then, rather than resizing per bucket, I set the std::vector once per execute. The next step will be to swap out std::vector for Kokkos views. After that, well, you know. I think I like this better than the previous code change that had one algorithm per element type.
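The "size once per execute" idea can be sketched as follows; the struct and member names here are hypothetical stand-ins for the workspaces the EquationSystem would retain, not the actual Nalu code:

```cpp
#include <vector>

// Hypothetical sketch of the "size once per execute" pattern: the driver
// retains the max nodesPerElement across its registered topologies and
// resizes workspaces once, instead of once per STK bucket.
struct ElemWorkset {
  std::vector<double> ws_coordinates;
  std::vector<double> ws_scalar;

  // called once at the top of execute(), not per bucket
  void size_for_execute(int maxNodesPerElement, int nDim) {
    ws_coordinates.resize(maxNodesPerElement * nDim);
    ws_scalar.resize(maxNodesPerElement);
  }
};
```

For a hybrid hex/tet mesh this would size everything for the hex (8 nodes), and the tet kernels would simply use a leading subset of each workspace.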
Btw: on most platforms, profiling support in Kokkos should be enabled by default (probably not on Cray, though). That is, with one of the more recent Trilinos versions (from the last three months or so) you don't need -DKokkos_ENABLE_Profiling:BOOL=ON.
@spdomin I think this looks good; incremental changes are always better than big changes when feasible. I'm looking forward to getting back on this!
@crtrott, any reason to believe that this change would have slowed down the code? My tests are still running; however, they show that some cases are slower. In most simulations, I have a heterogeneous mesh. One concern I had before with this approach is that for hybrid meshes, one topology might blow cache whereas the others are fine. With the MAX approach, all topologies might be blowing cache now, which again suggests a per-bucket resize is fine. For mixed higher/low-order simulations, this can be bad. We talked about the STL resize today, and it seems like it is a no-op if the size is not changing.
I agree resize should be free (or at least negligible) if the size is not changing. |
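The no-op claim is easy to check directly: resizing a std::vector to its current size appends/erases no elements and does not reallocate, so the data pointer stays put. A minimal self-contained check:

```cpp
#include <vector>
#include <cstddef>

// Demonstrates that std::vector::resize to the *same* size neither
// reallocates nor moves the elements -- which is why the per-bucket
// resize is effectively free when the bucket topology doesn't change.
bool resize_same_size_is_noop(std::size_t n) {
  std::vector<double> v(n, 1.0);
  const double* before = v.data();
  v.resize(n);  // size unchanged: no elements added or removed
  return v.data() == before && v.size() == n && v.front() == 1.0;
}
```

Note the cost is only free when the size genuinely does not change; growing past capacity still triggers a reallocation and copy.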
I talked with a fellow team member, and I have four designs on the table. I am not happy with my latest one above; I really do not like this notion of a "max" floating around. I think this design will punish us when we have mixed orders, as the low-order kernels will take as large a memory footprint as the higher-order ones. I need @crtrott to educate me on cache so that I can feel better about this. The best design might be something that looks like my first pull request; however, instead of storing the ws_variables in the constructor, I would only save master element information. Then, in the execute, I would size things once like what is above, but with the saved-off integration info.
@spdomin Maybe this is a silly question: are you thinking of multidimensional Kokkos views and switching layouts? From the way the vectors in the geometry portion above are accessed, it looks like you could easily run into array-of-structures vs. structure-of-arrays issues on different architectures. That said, I don't know much about how STK or the other portions can handle switching layouts well.
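For concreteness, the two layouts in question can be contrasted in plain C++ (with Kokkos, the equivalent switch is choosing LayoutRight vs. LayoutLeft on a `View<double**>` without touching kernel code). The types and accessors below are hypothetical illustrations, not Nalu code:

```cpp
#include <cstddef>

// Array-of-structures: x,y,z interleaved per node (stride-3 per component).
struct NodeAoS { double coords[3]; };

// Structure-of-arrays: each component contiguous across nodes
// (unit-stride, vectorization-friendly on CPUs; also the coalesced
// access pattern GPUs prefer).
struct NodesSoA {
  const double* comp[3];  // comp[j] points at an array over all nodes
};

inline double aos_coord(const NodeAoS* nodes, std::size_t i, int j) {
  return nodes[i].coords[j];   // AoS: jump 3 doubles between nodes
}

inline double soa_coord(const NodesSoA& n, std::size_t i, int j) {
  return n.comp[j][i];         // SoA: adjacent nodes are adjacent in memory
}
```

Both return the same values; only the memory traversal order differs, which is exactly what a layout-templated view lets you switch per architecture.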
@srajama1, yes, the plan is to transition from the dynamic std::vector resize to something smarter. Once that is done, I will swap out directly to Kokkos views and use the MDArray path. After that, it is changing the algorithms to use teams, etc., i.e., going fully to Kokkos.
@NaluCFD/ngp, see the below discussion points.

Greetings,

Yesterday, @rcknaus and I met with our TF team to discuss the manner in which Fuego is pursuing the NGP transition. In Fuego, I originally wrote algorithm interfaces to the Sierra Frameworks. In short, these "workset algorithms" were "homogeneous". This design pattern persists today and essentially follows the "alt" design that I capture in our working notes/slides. Fuego also has a very similar algorithm design, with each dof operating on a set of STK element collections. In Fuego, structs were created for each master element (for the interior) and all face:element pairs for the boundaries. Algorithms are templated on this struct. The estimate on the Fuego side for code bloat was 30-40%, while some performance gain from unrolling the nDim loop was noted. Templating on the struct of AlgorithmTraits provides essential information such as nodesPerElement_, numIntgPoints_ and nDim_. The role of the virtual table for GPUs is still a work in progress.

The most interesting path forward defined for our project seemed to be as follows: If we move forward with (2), then we need some improvement on the SupplementalAlgorithm design (see the below PS if you are interested). Robert is prototyping a Kokkos/STK-SIMD Laplace assembly to demonstrate a design pattern prototyped within the ASC NW project space. I plan on prototyping 2 and 3 above. If anyone else has prototypes or ideas, socialize them when they are ready for show. Key attributes in question are 1) executable size, 2) clean design and 3) performance. Based on discussions, I think that a template-based design on the full AssembleMomentum/Continuity/Scalar, etc. approach is not viable by way of code-bloat concerns. There is a possibility that if we restrict such high-level templated algorithms to a handful of algorithms, we might be safe - not sure.
I think that if we can execute the above design points in the context of the generic AssembleElemSolver approach (same for non-solvers, bcs, non-conformal, etc.), we can end up with something very special by way of clean design, optimal readability and good performance. @rcknaus and I will report as we make progress. If anyone has high-level comments, feel free to use this email list as an avenue for communication or call a meeting. I will capture the above design in the working notes document as soon as I have something that looks good. Best,

PS. The supplemental algorithm design point is to activate individual physics/algorithm "expressions", e.g.,
The algorithms are plugged into the AssembleElemSolverAlgorithm class. This is activated by the following Nalu line command (at SolutionOption scope):
Consider the momentum source terms above… Some are volume contributions (scv) while others are surface (scs). In both cases, we may have duplicated fields that need to be gathered or computed. For example, we may need scs_area_vector or scv_volume. Moreover, advection_diffusion and NSO_4TH_ALT will both need ws_dndx. However, since the supplemental algorithms are autonomous, they duplicate such work (the same goes for gathers). Also, in the current design, a "resize" is processed at the bucket level; this would be changed to follow the homogeneous approach of algorithm creation. The core challenge here is to elevate all common "gathers", "scatters", "assembly" and "geometry_evaluations" within a lightweight interface. I have some ideas that I will pursue.
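One lightweight way to elevate the common work, sketched here with hypothetical names (not the Nalu interface): each supplemental algorithm declares which shared quantities it needs, the driver takes the union, and each entry (e.g. ws_dndx) is gathered or computed once per element rather than once per algorithm:

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical sketch of elevating common gathers: algorithms declare
// requirements; the driver computes each required quantity exactly once.
struct SuppAlgSketch {
  std::set<std::string> requiredData;  // e.g. {"ws_dndx", "scs_area_vector"}
};

inline std::set<std::string>
union_of_requirements(const std::vector<SuppAlgSketch>& algs) {
  std::set<std::string> needed;
  for (const auto& alg : algs)
    needed.insert(alg.requiredData.begin(), alg.requiredData.end());
  return needed;  // the element loop gathers each entry once, then
                  // hands the shared workspaces to every algorithm
}
```

With this shape, advection_diffusion and NSO_4TH_ALT both requesting ws_dndx costs one dndx evaluation per element instead of two.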
I am about to push the new consolidated ElemSolverAlgorithm design. At present, I have both FEM and CVFEM scalar diffusion working (verified to be second order) using a 3D MMS (also a supplemental alg design). I will start folding in the work Alan is prototyping on how to elevate gathers/scatters/etc. so that the SuppAlgs are not autonomous in duplication of work. Next, I will add the templated AlgTraits approach while I work to reduce polymorphism and increase the Kokkos work.
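The AlgTraits idea mentioned above can be sketched as per-topology compile-time constants that kernels are templated on, so fixed-size workspaces can be sized at compile time. This is an illustrative sketch, not the Nalu definitions; the numScsIp_ values follow the CVFEM sub-control-surface (edge) counts, 12 for a hex and 6 for a tet:

```cpp
// Hypothetical per-topology trait structs carrying the essentials
// named in the discussion: nDim_, nodesPerElement_, integration counts.
struct AlgTraitsHex8 {
  static constexpr int nDim_ = 3;
  static constexpr int nodesPerElement_ = 8;
  static constexpr int numScsIp_ = 12;  // one scs ip per hex edge
};

struct AlgTraitsTet4 {
  static constexpr int nDim_ = 3;
  static constexpr int nodesPerElement_ = 4;
  static constexpr int numScsIp_ = 6;   // one scs ip per tet edge
};

// A kernel templated on the traits sizes its LHS workspace statically,
// avoiding both the per-bucket resize and the "max" footprint problem.
template <typename AlgTraits>
struct DiffKernelSketch {
  double lhs_[AlgTraits::nodesPerElement_ * AlgTraits::nodesPerElement_];
  static constexpr int lhsSize() {
    return AlgTraits::nodesPerElement_ * AlgTraits::nodesPerElement_;
  }
};
```

Each topology instantiation (HEX8, TET4, PYR5, WED6) gets exactly the footprint it needs, which is the mixed-order concern raised earlier; the trade-off is the code-bloat cost of the extra instantiations discussed in the Fuego comparison.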
The latest formulation tested creates a homogeneous collection of SupplementalAlgorithms that are both templated and share data across all registered SupplementalAlgorithms. This homogeneous approach allows us to perform all resizes at construction and not at run time. The SuppAlgs are now templated on HEX8, TET4, PYR5 and WED6, and are tested on a hybrid test case (500k elements or so).
Standard AssembleScalarDiffElemSolverAlg:
Consolidated AssembleElemSolverAlg + CVFEM_DIFF + MMS:
Consolidated AssembleElemSolverAlg + templated CVFEM_DIFF + MMS:
Consolidated AssembleElemSolverAlg + @alanw0 sharing + templated CVFEM_DIFF + MMS:
That's great (I mean the fact that we are not killing performance yet ;-) )
Right... And the new ElemSolverAlgorithm already has the team structure embedded within it. Also, I failed to note that the last case above used full Kokkos MD views for everything other than the RHS/LHS. On that side, we plan on proceeding with a 2D array that we index into for higher dimensions. This approach feels like the correct balance.
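The RHS/LHS plan (a flat 2D array indexed into for the higher dimensions) can be sketched as a dense row-major matrix with a manual index map; the class and names here are hypothetical:

```cpp
#include <vector>

// Hypothetical sketch of the flat LHS storage: one contiguous buffer,
// row-major indexing, so a (node, dof) pair maps to a single row index
// and higher-dimensional structure is recovered arithmetically.
struct DenseLhs {
  int nRows, nCols;
  std::vector<double> data;

  DenseLhs(int r, int c) : nRows(r), nCols(c), data(r * c, 0.0) {}

  // row-major: element (i, j) lives at flat offset i*nCols + j
  double& operator()(int i, int j) { return data[i * nCols + j]; }
};
```

Contiguous storage keeps the assembly scatter into the linear system simple, while the kernels index into it as if it were multidimensional.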
Transition to Jira. |