Profiling: add mark_kernel_static_info #6844

Open
wants to merge 4 commits into
base: develop

cwpearson
Contributor

Adds a mark_kernel_static_info interface to Kokkos Profiling. This interface takes a kernel ID returned from e.g. begin_parallel_for and associates compile-time static information about the kernel with that parallel region. 512 bytes are reserved for static information; only one field, functor_size, is currently implemented.

Kokkos::parallel_for, parallel_reduce, and parallel_scan call this function when profiling is enabled. It is called before scratch allocation profiling in parallel_reduce.
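
To make the description concrete, here is a minimal sketch of what a fixed-size static-info record could look like. Only the 512-byte size and the functor_size field come from the description above; the `sketch` namespace, the padding member, and the `make_static_info` helper are assumptions for illustration and are not taken from the actual Kokkos headers.

```cpp
#include <cstddef>
#include <cstdint>

namespace sketch {  // hypothetical namespace, not Kokkos::Tools

// Total record size stated in the PR description.
constexpr std::size_t kStaticInfoBytes = 512;

// Assumed layout: one implemented field followed by reserved padding, so that
// fields can be added later without changing the size that tools see.
struct KernelStaticInfo {
  std::uint64_t functor_size = 0;  // the only field implemented so far
  unsigned char padding[kStaticInfoBytes - sizeof(std::uint64_t)] = {};
};

static_assert(sizeof(KernelStaticInfo) == kStaticInfoBytes,
              "record must stay exactly 512 bytes");

// Hypothetical helper: fill the record from a functor type at compile time.
template <typename Functor>
constexpr KernelStaticInfo make_static_info() {
  KernelStaticInfo info{};
  info.functor_size = sizeof(Functor);
  return info;
}

}  // namespace sketch
```

Reserving the space up front is one way to keep the record's size stable as fields are added, which is presumably why the description mentions a fixed byte count.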

@cwpearson
Contributor Author

Corresponding tools PR: kokkos/kokkos-tools#242

@cwpearson
Contributor Author

Is it okay to leave #define KOKKOSP_INTERFACE_VERSION 20211015 or does it need to be incremented?

@dalg24
Member

dalg24 commented Feb 29, 2024

Are you aware of kokkos/kokkos-tools#238?

@vlkale

vlkale commented Feb 29, 2024

Are you aware of kokkos/kokkos-tools#238?

Thanks @dalg24, I was going to point this out as I went through this. I think it is related.

@vlkale

vlkale commented Feb 29, 2024

Is it okay to leave #define KOKKOSP_INTERFACE_VERSION 20211015 or does it need to be incremented?

Looking through this and the other code files here and in the Kokkos Tools repo, I think leaving the #define at its current value should be fine.

Also, all the CI tests here and in the Kokkos Tools PR have passed, so at least this hasn't caused a problem there.

@cwpearson
Contributor Author

The downside to this approach is that all static information that Core wants to pass to tools has to be produced at the same time (or the info struct has to be passed around a bit), and after the kernel launch.

Should I refactor this so that there is a single function templated on the Functor type that serves as one place where all information for the static profiler is generated?

That would basically replace these three lines in impl/Kokkos_Tools_Generic, but it would be a single point to extend with any future static information:

    Kokkos::Tools::KernelStaticInfo info;
    info.functor_size = sizeof(FunctorType);
    Kokkos::Tools::markKernelStaticInfo(kpID, info);

@cwpearson
Contributor Author

I've tested this out with Sparta, Parthenon, and Mini-EM and it works fine with all of them.

Contributor

@masterleinad masterleinad left a comment

The implementation looks reasonable to me but I don't fully understand why we need to forward the functor size to tools. Can you provide some examples?

core/unit_test/tools/TestToolsInitialization.cpp (outdated, resolved)
Comment on lines +145 to +166
    /**
     * Convenience wrapper around kokkosp_mark_kernel_static_info
     *
     * Consider using markKernelStaticInfo<Functor>(kernelID) instead
     */
    void markKernelStaticInfo(uint64_t kernelID, const KernelStaticInfo& info);

    /**
     * Take a kernelID produced by e.g. beginParallelFor
     * and associate compile-time information about Functor with it
     *
     * Arguments:
     *
     * kernelID: An ID for a parallel loop registered with e.g. beginParallelFor
     */
    template <typename Functor>
    void markKernelStaticInfo(uint64_t kernelID) {
      Kokkos::Tools::KernelStaticInfo info;
      info.functor_size = sizeof(Functor);
      markKernelStaticInfo(kernelID, info);
    }

Contributor

Do we really need both overloads? Can't we just inline the first one into the second one?

Contributor

ping

Contributor Author

The first one does what a lot of the other profiling code does: it calls invoke_kokkosp_callback, a function template defined in Kokkos_Profiling.cpp whose declaration is not available in this header. I could move the entire implementation of that function into this header and do it the way you're suggesting if you prefer.
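
The split described in this reply can be sketched in a single file as follows. This is an illustration under assumptions, not the Kokkos implementation: the `sketch` and `detail` namespaces, `registered_callback`, and `invoke_callback` are hypothetical stand-ins (the last one for invoke_kokkosp_callback, whose real signature is not shown in this thread), and the one-field KernelStaticInfo is simplified.

```cpp
#include <cstdint>
#include <utility>

namespace sketch {  // hypothetical namespace, not Kokkos::Tools

// ---- Part that would live in the public header ----

struct KernelStaticInfo {
  std::uint64_t functor_size = 0;  // simplified: reserved space omitted
};

// Declaration only. The body depends on dispatch machinery that the header
// cannot see, so it is defined further down ("in the .cpp file").
void markKernelStaticInfo(std::uint64_t kernelID, const KernelStaticInfo& info);

// The template has to be visible to callers, so it stays in the header and
// forwards to the non-template overload.
template <typename Functor>
void markKernelStaticInfo(std::uint64_t kernelID) {
  KernelStaticInfo info;
  info.functor_size = sizeof(Functor);
  markKernelStaticInfo(kernelID, info);
}

// ---- Part that would live in the .cpp file ----

namespace detail {

using StaticInfoCallback = void (*)(std::uint64_t, const KernelStaticInfo*);

// Hypothetical: would be set when a tool library is loaded.
StaticInfoCallback registered_callback = nullptr;

// Hypothetical stand-in for the dispatch helper: call the tool only if one
// is registered.
template <typename Fn, typename... Args>
void invoke_callback(Fn callback, Args&&... args) {
  if (callback != nullptr) callback(std::forward<Args>(args)...);
}

}  // namespace detail

void markKernelStaticInfo(std::uint64_t kernelID, const KernelStaticInfo& info) {
  detail::invoke_callback(detail::registered_callback, kernelID, &info);
}

}  // namespace sketch
```

With this arrangement the header exposes only a declaration and a thin template, while the dispatch helper stays private to the source file. Inlining the first overload into the second would require moving that helper into the header as well.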

@vlkale

vlkale commented Mar 14, 2024

The implementation looks reasonable to me but I don't fully understand why we need to forward the functor size to tools. Can you provide some examples?

@masterleinad

I think my comment here can help answer this: kokkos/kokkos-tools#242 (comment).

If you want to know why this PR forwards functor size specifically, rather than any other (static or dynamic) information about a Kokkos kernel during application execution, ask @cwpearson. We can add other potentially useful information, but I was thinking of using functor size as a starting point, since it is useful to @cwpearson and he has experimented with it and used it in his application.

@cwpearson
Contributor Author

The implementation looks reasonable to me but I don't fully understand why we need to forward the functor size to tools. Can you provide some examples?

The only use case I currently have is so Core developers can gather information about how large the functors in applications actually are to guide efforts in designing or optimizing kernel launch mechanisms.

Here's an example of the gathered information from the Kokkos-enabled open-source LANL ATS Benchmarks (github repo)

Parthenon

| Functor Size (bytes) | Execution Count | Name |
| ---: | ---: | :--- |
| 48 | 18832 | `refinement_package.cpp::98::FirstDerivative` |
| 112 | 126 | `pr_loops.hpp::127::ProlongationRestrictionLoop` |
| 24 | 89 | `boundary_communication.cpp::263::SetBounds` |
| 80 | 49 | `burgers_package.cpp::155::CalculateDerived` |
| 48 | 49 | `boundary_communication.cpp::93::SendBoundBufs` |
| 136 | 40 | `burgers_package.cpp::309::CalculateFluxes` |
| 136 | 40 | `burgers_package.cpp::238::CalculateFluxes` |
| 32 | 40 | `flux_correction.cpp::70::LoadAndSendFluxCorrections` |
| 24 | 40 | `flux_correction.cpp::166::SetFluxCorrections` |
| 56 | 25 | `burgers::EstimateTimestep` |
| 104 | 16 | `MassHistory` |

Sparta

| Functor Size (bytes) | Execution Count | Name |
| ---: | ---: | :--- |
| 72 | 400 | `N9SPARTA_NS8ExclScanIN6Kokkos6OpenMPEEE` |
| 1856 | 213 | `N9SPARTA_NS14ParticleKokkosE/N9SPARTA_NS15TagParticleSortILi1ELi0EEE` |
| 39224 | 200 | `N9SPARTA_NS12UpdateKokkosE/N9SPARTA_NS13TagUpdateMoveILi2ELi1ELi0ELin1EEE` |
| 5536 | 200 | `N9SPARTA_NS16CollideVSSKokkosE/N9SPARTA_NS23TagCollideCollisionsOneILi1ELin1EEE` |
| 3600 | 200 | `N9SPARTA_NS17FixEmitFaceKokkosE/N9SPARTA_NS22TagFixEmitFace_ninsertE` |
| 3600 | 200 | `N9SPARTA_NS17FixEmitFaceKokkosE/N9SPARTA_NS27TagFixEmitFace_perform_taskE` |
| 1856 | 200 | `N9SPARTA_NS14ParticleKokkosE/N9SPARTA_NS28TagParticleCompressReactionsE` |
| 464 | 200 | `ZN9SPARTA_NS17FixEmitFaceKokkos12perform_taskEvEUliE_` |
| 48 | 119 | `Kokkos::ViewCopy-1D` |
| 80 | 42 | `Kokkos::ViewCopy-2D` |
| 32 | 12 | `Kokkos::ViewFill-1D` |
| 16 | 8 | `Kokkos::Impl::host_space_deepcopy_double` |
| 112 | 6 | `Kokkos::ViewCopy-3D` |
| 5536 | 5 | `N9SPARTA_NS16CollideVSSKokkosE/N9SPARTA_NS21TagCollideResetVremaxE` |
| ... | ... | ... |

mini-em

| Functor Size (bytes) | Execution Count | Name |
| ---: | ---: | :--- |
| 224 | 10753152 | `N9Intrepid24Impl22Basis_HGRAD_TET_C1_FEM7FunctorIN6Kokkos11DynRankViewIdJNS3_12LayoutStrideENS3_6OpenMPEEEES7_LNS_9EOperatorE1EEE` |
| 48 | 3816344 | `Kokkos::ViewCopy-1D` |
| 224 | 1538305 | `N9Intrepid24Impl22Basis_HGRAD_TET_C1_FEM7FunctorIN6Kokkos11DynRankViewIdJNS3_12LayoutStrideENS3_6OpenMPEEEES7_LNS_9EOperatorE0EEE` |
| 112 | 311040 | `N6panzer13GlobalIndexer19CopyCellLIDsFunctorIN6Kokkos4ViewIPPiJNS2_11LayoutRightENS2_6OpenMPEEEEEE` |
| 152 | 230400 | `ZN6panzer28GatherSolution_BlockedTpetraINS_6Traits8ResidualES1_dixN6Tpetra12KokkosCompat23KokkosDeviceWrapperNodeIN6Kokkos6OpenMPENS6_9HostSpaceEEEE14evaluateFieldsERKNS_7WorksetEEUlRKiE_` |
| 664 | 115200 | `Panzer_Integrator_BasisTimesVector<0>` |
| 144 | 115200 | `N6panzer17V_MultiplyFunctorILi2EdEE` |
| 1184 | 76800 | `N6panzer9SumStaticINS_6Traits8ResidualES1_NS_4CellENS_5BASISEvEE/N6panzer9SumStaticINS_6Traits8ResidualES1_NS_4CellENS_5BASISEvE12NoScalarsTagE` |
| 288 | 76800 | `ZN6panzer10DotProductINS_6Traits8ResidualES1_E14evaluateFieldsERKNS_7WorksetEEUliE_` |
| 264 | 76800 | `IntegratorScalar` |
| 144 | 76800 | `DOF: B_face (panzer::Traits::Residual)` |
| 144 | 76800 | `DOF: E_edge (panzer::Traits::Residual)` |
| 144 | 76800 | `ZN6panzer29ScatterResidual_BlockedTpetraINS_6Traits8ResidualES1_ixN6Tpetra12KokkosCompat23KokkosDeviceWrapperNodeIN6Kokkos6OpenMPENS6_9HostSpaceEEEE14evaluateFieldsERKNS_7WorksetEEUlRKiE_` |
| 32 | 65266 | `Kokkos::ViewFill-1D` |
| ... | ... | ... |
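
For readers wondering how a tool might turn these callbacks into tables like the ones above, here is a rough sketch of the bookkeeping. It is not the code from kokkos/kokkos-tools#242: `FunctorSizeRecorder`, `Row`, and the three member functions are hypothetical stand-ins for a tool's begin, static-info, and end callbacks, and the real callback signatures may differ.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

namespace sketch {  // hypothetical namespace, not part of Kokkos Tools

// One output row: functor size in bytes, number of launches, kernel name.
struct Row {
  std::uint64_t functor_size = 0;
  std::uint64_t execution_count = 0;
  std::string name;
};

class FunctorSizeRecorder {
 public:
  // Stand-in for a begin_parallel_* callback: hand out an ID and remember
  // which kernel name it belongs to.
  std::uint64_t begin_kernel(const std::string& name) {
    const std::uint64_t id = next_id_++;
    active_[id] = name;
    return id;
  }

  // Stand-in for the static-info callback: look up the name for this ID and
  // update that kernel's row. Unknown IDs are ignored.
  void mark_static_info(std::uint64_t kernel_id, std::uint64_t functor_size) {
    const auto it = active_.find(kernel_id);
    if (it == active_.end()) return;
    Row& row = rows_[it->second];
    row.name = it->second;
    row.functor_size = functor_size;
    ++row.execution_count;
  }

  // Stand-in for an end_parallel_* callback: the ID is no longer active.
  void end_kernel(std::uint64_t kernel_id) { active_.erase(kernel_id); }

  // Rows sorted by execution count, highest first; ties keep name order.
  std::vector<Row> report() const {
    std::vector<Row> out;
    out.reserve(rows_.size());
    for (const auto& entry : rows_) out.push_back(entry.second);
    std::stable_sort(out.begin(), out.end(), [](const Row& a, const Row& b) {
      return a.execution_count > b.execution_count;
    });
    return out;
  }

 private:
  std::uint64_t next_id_ = 0;
  std::unordered_map<std::uint64_t, std::string> active_;
  std::map<std::string, Row> rows_;
};

}  // namespace sketch
```

Keying the rows by kernel name means repeated launches of the same kernel collapse into one row with a count, which matches the shape of the tables above. It also assumes that a given name always maps to the same functor type.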

Contributor

@masterleinad masterleinad left a comment

I'm generally fine with the direction of this pull request as long as we consider this feature experimental, so that we can change what we store. The problem is that the callback can't define what is captured; we have to decide that internally, which limits flexibility.

@cwpearson
Contributor Author

Would you prefer a separate profiling interface function for each separate piece of information we might want to capture? That's a relatively easy change (and how I envisioned it originally).
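
A rough sketch of what the per-field alternative could look like, as opposed to passing one struct. Everything here is hypothetical (the `StaticInfoCallbacks` table, its `functor_size` slot, and `report_static_info`); it only illustrates the idea that a tool registers a callback for each piece of information it wants and Core skips the rest.

```cpp
#include <cstdint>

namespace sketch {  // hypothetical namespace, not Kokkos::Tools

// One optional callback per piece of static information. A tool fills in
// only the slots it cares about; unset slots stay null.
struct StaticInfoCallbacks {
  void (*functor_size)(std::uint64_t kernel_id, std::uint64_t bytes) = nullptr;
  // Further slots would be added here, one per new kind of information.
};

// Single point in Core, templated on the functor type, that reports every
// piece of static information a tool has asked for.
template <typename Functor>
void report_static_info(const StaticInfoCallbacks& callbacks,
                        std::uint64_t kernel_id) {
  if (callbacks.functor_size != nullptr) {
    callbacks.functor_size(kernel_id, sizeof(Functor));
  }
}

}  // namespace sketch
```

The trade-off is more interface functions in exchange for letting each tool choose what it receives, which speaks to the flexibility concern raised above.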

@vlkale

vlkale commented Mar 26, 2024

Would you prefer a separate profiling interface function for each separate piece of information we might want to capture? That's a relatively easy change (and how I envisioned it originally).

I think a separate profiling interface function is OK. Yes, the information you are gathering requires Kokkos Tools to hook into Kokkos Core.

I think you would add it as a function in the Kokkos_Tools_ToolsProgrammingInterface struct. Note that the only other function there is the tool-invoked fence function. There is plenty of space for other functions (about 63 slots), so we should think carefully about which other functions belong there. I don't know if that is what you had in mind when you mentioned the tool programming interface, but that is how I would approach this.
