Investigate performance of threaded matrix multiply kernel #248

tkoskela opened this issue Aug 24, 2023 · 4 comments · Fixed by #266

tkoskela commented Aug 24, 2023

Once we have closed #195 and #244, we can look into the performance of these threading improvements together with the previously threaded matrix multiply kernels.

The multiply kernel can be selected with the MULT_KERN option in the Makefile. The best place to start is ompGemm, but worth looking at the other options too.
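For reference, a minimal sketch of the kernel selection (the exact Makefile and variable wiring may differ between versions; ompGemm_m and ompDoik are other kernels discussed later in this issue):

    # Select the threaded matrix multiply kernel at build time
    MULT_KERN = ompGemm      # alternatives tested below include ompGemm_m and ompDoik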

A good test case is:

tkoskela commented Aug 29, 2023

Things to look out for

tkoskela commented
An initial profiling result with the test case described above, using the current develop branch with MULT_KERN = ompGemm:

[Profiling screenshot: breakdown of run time dominated by m_kern_min / m_kern_max and OpenMP barrier overhead (__kmp_fork_barrier, __kmpc_barrier)]

  • Almost all of the time is spent in threaded code (m_kern_min and m_kern_max) 😃
  • There is a lot (more than 50% of run time) of OpenMP overhead (__kmp_fork_barrier, __kmpc_barrier) ☹️

My first approach to reduce the inefficiency would be to move the threading out to the main loop:

do kpart = 1,a_b_c%ahalo%np_in_halo ! Main loop

And wrap the MPI communications in !$omp master regions (or !$omp critical, if we initialise MPI with MPI_THREAD_SERIALIZED).
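
A rough sketch of that structure (illustrative only: fetch_remote_partition is a hypothetical placeholder for the MPI exchange in multiply_module, not an actual Conquest routine):

    !$omp parallel default(shared) private(kpart)
    do kpart = 1, a_b_c%ahalo%np_in_halo      ! Main loop
       !$omp master
       ! Only the master thread performs the MPI exchange for this partition
       call fetch_remote_partition(kpart)     ! hypothetical stand-in for the MPI calls
       !$omp end master
       ! ... threaded multiply kernel (m_kern_min / m_kern_max) on this partition ...
    end do
    !$omp end parallel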

tkoskela commented Sep 20, 2023

It should be possible to declare the parallel region around the main loop in multiply_module and keep the !$omp do worksharing constructs as orphaned constructs where they are in the multiply_kernel.

https://stackoverflow.com/questions/35347944/fortran-openmp-with-subroutines-and-functions/35361665#35361665

We've tried to implement this in tk-optimise-multiply
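
For illustration, a self-contained sketch of the orphaned-construct pattern described above (names and the loop body are made up; only the directive structure mirrors what is proposed for multiply_module and the multiply kernel):

    subroutine kernel_work(n, c)
      implicit none
      integer, intent(in)    :: n
      real(8), intent(inout) :: c(n)
      integer :: i
      ! Orphaned worksharing construct: no enclosing parallel directive here;
      ! it binds to the parallel region that is active in the caller.
      !$omp do schedule(dynamic)
      do i = 1, n
         c(i) = c(i) + 1.0d0
      end do
      !$omp end do
    end subroutine kernel_work

    program orphaned_demo
      implicit none
      integer, parameter :: n = 1000
      real(8) :: c(n)
      integer :: kpart
      c = 0.0d0
      ! Parallel region opened once, outside the main loop
      !$omp parallel default(shared) private(kpart)
      do kpart = 1, 10            ! stands in for the np_in_halo main loop
         call kernel_work(n, c)   ! the !$omp do inside shares out the work
      end do
      !$omp end parallel
    end program orphaned_demo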

tkoskela commented Oct 4, 2023

Conclusions

Performance of multiply kernels

  • Tested all multiply kernels using the matrix_multiply benchmark on 8 ranks / 4 threads. Best performance with ompGemm_m and ompDoik: roughly 2x speedup with 4 threads compared to the serial version.

Reducing OMP overhead

  • In tk-optimise-multiply (PR #266, Optimise threading in ompGemm multiply kernel) we moved the creation of the OMP parallel region out of the multiply kernel to outside the main loop in multiply_module, and wrapped the MPI communications in !$omp master. To do that, we had to introduce barriers around the MPI communication to ensure data has arrived before distributing work to the compute threads (see the sketch below). This was previously guaranteed because the communication was done outside the parallel region.
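
Schematically, the synchronisation around the master-thread communication looks like the following (illustrative; fetch_remote_partition again stands in for the actual MPI calls):

    !$omp barrier                            ! compute threads are done with the previous buffer
    !$omp master
    call fetch_remote_partition(kpart)       ! hypothetical stand-in for the MPI communication
    !$omp end master
    ! end master has no implied barrier, so an explicit one is needed
    !$omp barrier                            ! received data is in place before work is distributed
    ! ... threaded multiply kernel on the received partition ...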

Longer matrix range

  • Tested increasing DM.L_range from 16 to 20 in the matrix_multiply benchmark, using the ompGemm kernel with the previous develop branch and with the tk-optimise-multiply branch.
  • Total runtime is about 2% longer with tk-optimise-multiply. However, the overhead from forking threads is reduced by ~30%. Unfortunately, this saving is replaced by time spent in the barriers we had to introduce to avoid race conditions.

Next steps

Next we need to get rid of the OMP barriers by overlapping communication with computation. This is addressed in #265.
