Investigate performance of threaded matrix multiply kernel #248

tkoskela opened this issue Aug 24, 2023 · 4 comments · Fixed by #266

tkoskela commented Aug 24, 2023

Once we have closed #195 and #244, we can look into the performance of these threading improvements together with the previously threaded matrix multiply kernels.

The multiply kernel can be selected with the MULT_KERN option in the Makefile. The best place to start is ompGemm, but worth looking at the other options too.
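For reference, a minimal sketch of the kernel selection (the exact Makefile and variable wiring may differ between versions; ompGemm_m and ompDoik are other kernels discussed later in this issue):

    # Select the threaded matrix multiply kernel at build time
    MULT_KERN = ompGemm      # alternatives tested below include ompGemm_m and ompDoik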

A good test case is:

tkoskela commented Aug 29, 2023

Things to look out for

tkoskela commented
An initial profiling result with the test case described above, using the current develop branch with MULT_KERN = ompGemm:

[Profiling screenshot: breakdown of run time dominated by m_kern_min / m_kern_max and OpenMP barrier overhead (__kmp_fork_barrier, __kmpc_barrier)]

  • Almost all of the time is spent in threaded code (m_kern_min and m_kern_max) 😃
  • There is a lot (more than 50% of run time) of OpenMP overhead (__kmp_fork_barrier, __kmpc_barrier) ☹️

My first approach to reduce the inefficiency would be to move the threading out to the main loop:

do kpart = 1,a_b_c%ahalo%np_in_halo ! Main loop

And wrap the MPI communications in !$omp master regions (or !$omp critical, if we initialise MPI with MPI_THREAD_SERIALIZED).
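
A rough sketch of that structure (illustrative only: fetch_remote_partition is a hypothetical placeholder for the MPI exchange in multiply_module, not an actual Conquest routine):

    !$omp parallel default(shared) private(kpart)
    do kpart = 1, a_b_c%ahalo%np_in_halo      ! Main loop
       !$omp master
       ! Only the master thread performs the MPI exchange for this partition
       call fetch_remote_partition(kpart)     ! hypothetical stand-in for the MPI calls
       !$omp end master
       ! ... threaded multiply kernel (m_kern_min / m_kern_max) on this partition ...
    end do
    !$omp end parallel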

tkoskela commented Sep 20, 2023

It should be possible to declare the parallel region around the main loop in multiply_module and keep the !$omp do worksharing constructs as orphaned constructs where they are in the multiply_kernel.

https://stackoverflow.com/questions/35347944/fortran-openmp-with-subroutines-and-functions/35361665#35361665

We've tried to implement this in tk-optimise-multiply
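
For illustration, a self-contained sketch of the orphaned-construct pattern described above (names and the loop body are made up; only the directive structure mirrors what is proposed for multiply_module and the multiply kernel):

    subroutine kernel_work(n, c)
      implicit none
      integer, intent(in)    :: n
      real(8), intent(inout) :: c(n)
      integer :: i
      ! Orphaned worksharing construct: no enclosing parallel directive here;
      ! it binds to the parallel region that is active in the caller.
      !$omp do schedule(dynamic)
      do i = 1, n
         c(i) = c(i) + 1.0d0
      end do
      !$omp end do
    end subroutine kernel_work

    program orphaned_demo
      implicit none
      integer, parameter :: n = 1000
      real(8) :: c(n)
      integer :: kpart
      c = 0.0d0
      ! Parallel region opened once, outside the main loop
      !$omp parallel default(shared) private(kpart)
      do kpart = 1, 10            ! stands in for the np_in_halo main loop
         call kernel_work(n, c)   ! the !$omp do inside shares out the work
      end do
      !$omp end parallel
    end program orphaned_demo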

tkoskela commented Oct 4, 2023

Conclusions

Performance of multiply kernels

  • Tested all multiply kernels using the matrix_multiply benchmark on 8 ranks / 4 threads. Best performance with ompGemm_m and ompDoik: roughly 2x speedup with 4 threads compared to the serial version.

Reducing OMP overhead

  • In tk-optimise-multiply (PR #266, Optimise threading in ompGemm multiply kernel) we moved the creation of the OMP parallel region out of the multiply kernel to outside the main loop in multiply_module, and wrapped the MPI communications in !$omp master. To do that, we had to introduce barriers around the MPI communication to ensure data has arrived before distributing work to the compute threads (see the sketch below). This was previously guaranteed because the communication was done outside the parallel region.
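
Schematically, the synchronisation around the master-thread communication looks like the following (illustrative; fetch_remote_partition again stands in for the actual MPI calls):

    !$omp barrier                            ! compute threads are done with the previous buffer
    !$omp master
    call fetch_remote_partition(kpart)       ! hypothetical stand-in for the MPI communication
    !$omp end master
    ! end master has no implied barrier, so an explicit one is needed
    !$omp barrier                            ! received data is in place before work is distributed
    ! ... threaded multiply kernel on the received partition ...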

Longer matrix range

  • Tested increasing DM.L_range from 16 to 20 in the matrix_multiply benchmark, using the ompGemm kernel with the previous develop branch and with the tk-optimise-multiply branch.
  • Total runtime is about 2% longer with tk-optimise-multiply. However, the overhead from forking threads is reduced by ~30%. Unfortunately, this saving is replaced by time spent in the barriers we had to introduce to avoid race conditions.

Next steps

Next we need to get rid of the OMP barriers by overlapping communication with computation. This is addressed in #265.
