
ECP 4: Deploy Nalu/Kokkos algorithmic infrastructure with performance benchmarking #4

Closed
spdomin opened this issue Oct 7, 2016 · 20 comments


spdomin commented Oct 7, 2016

Activities:

  1. Algorithm infrastructure modification to operate on a homogeneous topology set; step 1 is to elevate std::vector resize operations to the algorithm constructor.
  2. Transition of matrix assemble algorithms to use Kokkos-based matrix apply_coefficient().
  3. Usage of Kokkos views in solver-based algorithms.
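
A minimal sketch of step 1 (class and member names are hypothetical, not Nalu's actual API): workset vectors get sized once in the algorithm constructor instead of being resized for every bucket.

```cpp
#include <vector>

// Hypothetical sketch only: the idea from activity 1 is to size the
// workset std::vectors in the constructor rather than calling resize()
// inside the per-bucket execute loop.
class AssembleDiffElemAlg {
public:
  AssembleDiffElemAlg(int nodesPerElement, int nDim)
    : lhs_(nodesPerElement * nodesPerElement),
      rhs_(nodesPerElement),
      ws_coordinates_(nodesPerElement * nDim) {}

  void execute() {
    // bucket loop would go here; workset vectors are already sized,
    // so no per-bucket resize() calls are needed
  }

  std::vector<double> lhs_, rhs_, ws_coordinates_;
};
```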
@spdomin spdomin added this to the FY17Q4 milestone Oct 7, 2016
@spdomin spdomin assigned spdomin, crtrott and nmhamster and unassigned spdomin, nmhamster and crtrott Oct 7, 2016

alanw0 commented Oct 10, 2016

I'm interested in doing this.
I have an Intel build completed for use with VTune profiling to guide step 1, and I'll be happy to work with Christian on testing, integrating, and deploying the Kokkos work he has prototyped, and then move on with further Kokkos integration as appropriate.

@nmhamster

@alanw0 remember that we have VTune connectors for Kokkos kernels that greatly simplify how information is presented in VTune. In general, VTune without these is not a great experience, since the OpenMP regions used by Kokkos can't be separated out cleanly. We can talk more about this if you like. The connectors also work with nvprof, and we are looking at some other tools as well.


alanw0 commented Oct 19, 2016

@nmhamster ok, thanks Si, that will help. NGP profiling is difficult. We've had some success with VTune, and recently Christian helped us start looking at nvprof for the STK project. The more help the better.


marchdf commented Oct 20, 2016

@nmhamster I am interested in the VTune Kokkos connectors if someone could share those with me. I started looking at things with Allinea Map but was curious about trying out VTune.

@spdomin Also, any thoughts on what tests we should be using for performance benchmarks? I have been looking at some of the regression tests and some of the performance tests as well. Do the performance tests capture what we want from this exercise?


nmhamster commented Oct 20, 2016

@marchdf - can you please check out https://github.com/kokkos/kokkos-tools.git? Go into the src/tools/vtune-connector directory, edit the Makefile to point to the VTune installation, and then run make. It builds with GNU compilers. For your Nalu and Trilinos builds you need to enable the CMake option -DKokkos_ENABLE_Profiling:BOOL=ON, and rebuild both with it. Finally, when you are running, add an environment variable KOKKOS_PROFILE_LIBRARY=<path to your checkout/src/tools/vtune-connector.so>. Run with the amplxe-cl VTune profiler as usual. I can talk you through this as you need it.
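
Condensing the steps above into a build recipe (paths and the input-file name are placeholders for your local setup; `naluX` is the Nalu executable):

```shell
# Build the VTune connector from kokkos-tools
git clone https://github.com/kokkos/kokkos-tools.git
cd kokkos-tools/src/tools/vtune-connector
# edit the Makefile so it points at your local VTune installation, then:
make

# Rebuild Trilinos and Nalu with the profiling hook enabled:
#   -DKokkos_ENABLE_Profiling:BOOL=ON

# Point Kokkos at the connector and profile as usual
export KOKKOS_PROFILE_LIBRARY=/path/to/kokkos-tools/src/tools/vtune-connector.so
amplxe-cl -collect hotspots ./naluX -i input.i
```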


spdomin commented Nov 28, 2016

@NaluCFD/ngp, how about the following intermediate step towards Kokkos views?

The concept now is that the EquationSystem retains the max set of nodesPerElement and integration points. Then, rather than resizing per bucket, I size the std::vector once per execute. The next step will be to swap out std::vector for Kokkos views. After that, well, you know..

I think I like this better than the previous code change that had one algorithm per element type.

//======================================================================
// extract max integration point values and size all workset variables
//======================================================================
const int maxNodesPerElement = realm_.get_max_nodes_per_element();
const int maxNumScsIp = realm_.get_max_num_scs_ip();

// space for LHS/RHS; nodesPerElem*nodesPerElem and nodesPerElem
const int lhsSize = maxNodesPerElement*maxNodesPerElement;
const int rhsSize = maxNodesPerElement;
std::vector<double> lhs(lhsSize);
std::vector<double> rhs(rhsSize);
std::vector<int> scratchIds(rhsSize);
std::vector<double> scratchVals(rhsSize);
std::vector<stk::mesh::Entity> connected_nodes(maxNodesPerElement);

// nodal fields to gather
std::vector<double> ws_vrtm(maxNodesPerElement*nDim);
std::vector<double> ws_Gpdx(maxNodesPerElement*nDim);
std::vector<double> ws_coordinates(maxNodesPerElement*nDim);
std::vector<double> ws_pressure(maxNodesPerElement);
std::vector<double> ws_density(maxNodesPerElement);

// geometry related to populate
std::vector<double> ws_scs_areav(maxNumScsIp*nDim);
std::vector<double> ws_dndx(nDim*maxNumScsIp*maxNodesPerElement);
std::vector<double> ws_dndx_lhs(nDim*maxNumScsIp*maxNodesPerElement);
std::vector<double> ws_deriv(nDim*maxNumScsIp*maxNodesPerElement);
std::vector<double> ws_det_j(maxNumScsIp);
std::vector<double> ws_shape_function(maxNumScsIp*maxNodesPerElement);

// pointers
double *p_lhs = &lhs[0];
double *p_rhs = &rhs[0];
double *p_vrtm = &ws_vrtm[0];
double *p_Gpdx = &ws_Gpdx[0];
double *p_coordinates = &ws_coordinates[0];
double *p_pressure = &ws_pressure[0];
double *p_density = &ws_density[0];
double *p_scs_areav = &ws_scs_areav[0];
double *p_dndx = &ws_dndx[0];
double *p_dndx_lhs = shiftPoisson_ ? &ws_dndx[0] : reducedSensitivities_ ? &ws_dndx_lhs[0] : &ws_dndx[0];
double *p_shape_function = &ws_shape_function[0];

// define some common selectors
stk::mesh::Selector s_locally_owned_union = meta_data.locally_owned_part()
  & stk::mesh::selectUnion(partVec_) 
  & !(realm_.get_inactive_selector());

stk::mesh::BucketVector const& elem_buckets =
  realm_.get_buckets( stk::topology::ELEMENT_RANK, s_locally_owned_union );
for ( stk::mesh::BucketVector::const_iterator ib = elem_buckets.begin();
      ib != elem_buckets.end() ; ++ib ) {
  stk::mesh::Bucket & b = **ib ;
  const stk::mesh::Bucket::size_type length   = b.size();

  // etc., etc.


crtrott commented Nov 28, 2016

Btw: on most platforms, profiling support in Kokkos should be enabled by default (probably not on Cray, though). That is, with one of the more recent Trilinos versions (from the last three months or so) you don't need -DKokkos_ENABLE_Profiling:BOOL=ON.


alanw0 commented Nov 28, 2016

@spdomin I think this looks good, incremental changes are always better than big changes when feasible. I'm looking forward to getting back on this!


spdomin commented Dec 1, 2016

@crtrott, any reason to believe that this change would have slowed down the code? My tests are still running; however, they show that some cases are slower. Then again, in most simulations I have a heterogeneous mesh.

One concern I had before with this approach is that for hybrid meshes, one topology might blow cache while the others are fine. With the MAX approach, all topologies might be blowing cache now - which again suggests a resize is fine. For mixed higher/lower-order simulations, this can be bad.

We talked about the STL resize today, and it seems like it is a no-op if the size is not changing.


alanw0 commented Dec 1, 2016

I agree resize should be free (or at least negligible) if the size is not changing.
How much slower are the tests? We could take a look with vtune and see more detail about where the slowdown is.
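
A quick sanity check of that claim (plain C++, independent of Nalu): resizing a vector to its current size neither reallocates nor touches the elements, so the buffer address stays fixed.

```cpp
#include <cstddef>
#include <vector>

// Returns true if repeatedly resizing to the same size leaves the
// buffer address unchanged, i.e. the resize is effectively a no-op.
bool resize_same_size_is_noop(std::size_t n, int repeats) {
  std::vector<double> ws(n);
  const double* before = ws.data();
  for (int i = 0; i < repeats; ++i)
    ws.resize(n);  // size unchanged: no reallocation, no element writes
  return ws.data() == before;
}
```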


spdomin commented Dec 2, 2016

I talked with a fellow team member, and I now have four designs on the table. I am not happy with my latest one above; I really do not like this notion of a "max" floating around. I think this design will punish us when we have mixed orders, as the low-order kernels will take as large a memory footprint as the higher-order ones. I need @crtrott to educate me on cache so that I can feel better about this.

The best design might be something that looks like my first pull request; however, instead of storing the ws_ variables in the constructor, I would only save master element information. Then, in the execute, I size things once, like what is above, but with the saved-off integration info.


srajama1 commented Dec 2, 2016

@spdomin Maybe this is a silly question: are you thinking of multidimensional Kokkos views and switching layouts? From how the vectors in the geometry portion above are accessed, it looks like you could easily run into array-of-structures vs. structure-of-arrays issues on different architectures. That said, I don't know how well STK or other portions can handle switching layouts.
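
To make the layout concern concrete (a generic sketch, not Nalu code): the same flattened derivative workspace can be indexed with the nDim components adjacent (AoS-like, matching the ws_dndx indexing convention suggested by the sizes above) or with each dimension's values contiguous (SoA-like, what a multidimensional Kokkos view with a different layout would give you). Which is cache- and vector-friendly depends on the inner loop and the architecture.

```cpp
// Two ways to flatten a (numIp x nodes x nDim) derivative workspace.

// "AoS-like": the nDim components of one (ip, node) pair sit next to
// each other in memory.
int idx_aos(int ip, int node, int dim, int nodes, int nDim) {
  return ip * nodes * nDim + node * nDim + dim;
}

// "SoA-like": all values for one dimension are contiguous, which can
// vectorize better when the inner loop runs over nodes or ips.
int idx_soa(int ip, int node, int dim, int numIp, int nodes) {
  return dim * numIp * nodes + ip * nodes + node;
}
```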


spdomin commented Dec 2, 2016

@srajama1, yes, the plan is to transition from the dynamic std::vector resize to something smarter. Once that is done, I will swap directly to Kokkos views and use the MDArray path. After that, it is changing the algorithms to use teams, etc., i.e., going fully into Kokkos.


spdomin commented Jan 27, 2017

@NaluCFD/ngp, see the discussion points below.

Greetings,

Yesterday, @rcknaus and I met with our TF team to discuss the manner in which Fuego is pursuing the NGP transition. In Fuego, I originally wrote algorithm interfaces to the Sierra Frameworks. In short, these "workset algorithms" were “homogeneous”. This design pattern persists today and essentially follows the “alt” design that I capture in our working notes/slides. Fuego also has a very similar algorithm design, with each dof operating on a set of STK element collections. In Fuego, structs were created for each master element (for the interior) and for all face:element pairs for the boundaries; algorithms are templated on this struct. The estimate on the Fuego side for code bloat was 30-40%, while some performance gain from unrolling the nDim loop was noted. Templating on the struct of AlgorithmTraits includes essential information such as nodesPerElement_, numIntgPoints_ and nDim_. The role of the virtual table for GPUs is still a work in progress.
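
A minimal sketch of that traits idea (the member names follow those mentioned above; the struct and algorithm names, and the HEX8 integration-point count, are illustrative assumptions): with element extents as compile-time constants, workset sizes are fixed at compile time and the nDim loop can be unrolled.

```cpp
// Hypothetical traits struct; not Nalu's actual types.
struct AlgTraitsHex8 {
  static constexpr int nodesPerElement_ = 8;
  static constexpr int numIntgPoints_ = 12;  // sub-control-surface ips assumed
  static constexpr int nDim_ = 3;
};

// Algorithms templated on the traits get compile-time workset extents,
// so no runtime "max" sizing (and no per-bucket resize) is needed.
template <typename AlgTraits>
struct ElemAlgorithm {
  static constexpr int lhsSize_ =
      AlgTraits::nodesPerElement_ * AlgTraits::nodesPerElement_;
  static constexpr int rhsSize_ = AlgTraits::nodesPerElement_;
};
```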

The most interesting path forward defined for our project seemed to be as follows:

  1. Drive towards homogeneous algorithms over topology that naturally allow for precise resize operations. No need to hack a MAX_nodesPerElement_, etc.
  2. Drive forward the prototyped AssembleElemSolverAlgorithm.C with supplemental algorithms. The bulk of the STK/Kokkos boilerplate code will then be implemented only once, with these small SupplementalAlgorithms plugged in. Since they are small, they can possibly be templated. I think this may be a good compromise for code bloat, since the supplemental algorithms are tiny compared to a full AssembleSolverAlg.

If we move forward with (2), then we need some improvement on the SupplementalAlgorithm design (see the PS below if you are interested).

Robert is prototyping a kokkos/STK-SIMD Laplace assembly to demonstrate a design pattern prototyped within the ASC NW project space. I plan on prototyping 2 and 3 above.

If anyone else has prototypes or ideas, socialize them when they are ready to show. The key attributes in question are 1) executable size, 2) clean design and 3) performance. Based on discussions, I think that a template-based design on the full AssembleMomentum/Continuity/Scalar, etc., approach is not viable, due to code bloat concerns. There is a possibility that if we restrict such high-level templated designs to a handful of algorithms, we might be safe - not sure.

I think that if we can execute the above design points in the context of the generic AssembleElemSolver approach (same for non-solvers, bcs, non-conformal, etc.), we can end up with something very special by way of clean design, optimal readability and good performance.

@rcknaus and I will report as we make progress. If anyone has high level comments, feel free to use this email list as an avenue for communication or call a meeting. I will capture the above design in the working notes document as soon as I have something that looks good.

Best,

@spdomin

PS. The supplemental algorithm design point is to activate individual physics/algorithm “expressions", e.g.,

    - element_source_terms:
        momentum: [advection_diffusion, buoyancy, momentum_time_derivative, NSO_4TH_ALT]
        continuity: [advection, density_time_derivative]
        mixture_fraction: [mixture_fraction_time_derivative, NSO_4TH]

The algorithms are plugged into the AssembleElemSolverAlgorithm class. This is activated by the following Nalu line command (at SolutionOption scope):

  use_consolidated_solver_algorithm: yes

Consider the momentum source terms above… Some are volume contributions (scv) while others are surface (scs). In both cases, we may have duplicated fields that need to be gathered or computed. For example, we may need scs_area_vector or scv_volume. Moreover, advection_diffusion and NSO_4TH_ALT will both need ws_dndx. However, since the supplemental algorithms are autonomous, they duplicate such work (same with gathers). Also, in the current design, a “resize” is processed at the bucket level. This would be changed to follow the homogeneous approach of algorithm creation.

The core challenge here is to elevate all common “gathers”, “scatters”, “assembly” and “geometry_evaluations” within a lightweight interface.
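
One possible shape for that lightweight interface (all names here are hypothetical): gather the shared fields and geometry once per element into a scratch-data struct, then hand every registered supplemental algorithm a reference to it, so the work is done once rather than once per algorithm.

```cpp
#include <vector>

// Hypothetical sketch: shared per-element scratch data, gathered once
// and then reused by every registered supplemental algorithm.
struct SharedElemData {
  std::vector<double> ws_dndx;       // shape-function derivatives
  std::vector<double> ws_scs_areav;  // sub-control-surface area vectors
  int gatherCount = 0;               // instrumentation for this sketch

  void gather_and_compute() { ++gatherCount; /* gathers + geometry here */ }
};

struct SuppAlg {
  // each algorithm reads from the shared data instead of re-gathering
  void execute(const SharedElemData& data, std::vector<double>& rhs) {
    (void)data;
    rhs.assign(rhs.size(), 0.0);  // placeholder assembly
  }
};

// Driver: one gather per element, N supplemental algorithms reuse it.
// Returns how many gathers actually ran.
int run_element(std::vector<SuppAlg>& algs) {
  SharedElemData data;
  std::vector<double> rhs(8, 1.0);
  data.gather_and_compute();  // once, not once per algorithm
  for (auto& alg : algs) alg.execute(data, rhs);
  return data.gatherCount;
}
```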

I have some ideas that I will pursue.


spdomin commented Jan 30, 2017

I am about to push the new consolidated ElemSolverAlgorithm design. At present, I have both FEM and CVFEM scalar diffusion working (verified to be second order) using a 3D MMS (also a supplemental alg design).

I will start folding in the work Alan is prototyping on how to elevate gathers/scatters/etc. so that the SuppAlgs do not autonomously duplicate work. Next, I will add the templated AlgTraits approach while I work to reduce polymorphism and increase the Kokkos work.

c1fb12c


spdomin commented Feb 3, 2017

Adding templating: 90e1699. @alanw0 is to merge this with his approach for sharing SuppAlg data. I am now running the V27 mesh under heat conduction with MMS.


spdomin commented Feb 6, 2017

The latest formulation tested creates a homogeneous collection of SupplementalAlgorithms that are both templated and share data across all registered SupplementalAlgorithms. This homogeneous approach allows us to perform all resizes in the construction and not at run time. The SuppAlgs are now templated on HEX8, TET4, PYR5 and WED6, and this is tested on a hybrid test case (500k elements or so).

Standard AssembleScalarDiffElemSolverAlg:
assemble -- avg: 770.934 min: 761.327 max: 776.17

Consolidated AssembleElemSolverAlg + CVFEM_DIFF + MMS:
assemble -- avg: 780.216 min: 772.599 max: 784.485

Consolidated AssembleElemSolverAlg + templated CVFEM_DIFF + MMS:
assemble -- avg: 719.187 min: 708.683 max: 725.461

Consolidated AssembleElemSolverAlg + @alanw0 sharing + templated CVFEM_DIFF + MMS:
assemble -- avg: 653.366 min: 646.362 max: 656.972


crtrott commented Feb 6, 2017

That's great (I mean the fact that we are not killing performance yet ;-) )


spdomin commented Feb 6, 2017

Right... And the new ElemSolverAlgorithm already has the team structure embedded within it. Also, I failed to note that the last case above used full Kokkos MD views for everything other than RHS/LHS. On that side, we plan on proceeding with a 2D array that we index into for higher dimensions. This approach feels like the correct balance.


spdomin commented Feb 23, 2017

Transition to Jira.

@spdomin spdomin closed this as completed Feb 23, 2017