ACC init/finalize issues #422

hfp · 2021-01-21T08:51:50Z

There are two potential issues with DBCSR's init/fini flow wrt ACC interface and LIBSMM:

DBCSR calls acc_init(), which eventually calls libsmm_acc_init() internally (this is a potential issue since the backend (CUDA/HIP, or OpenCL, etc.) then depends on LIBSMM). The latter routine (libsmm_acc_init) was only recently introduced (everything here applies similarly to the respective finalize routine also). However, the expected flow (IMHO) is that DBCSR calls acc_init() and libsmm_acc_init() since both respective interfaces are used (and DBCSR does not judge potential internal depedency between ACC and LIBSMM implementation). Then, libsmm_acc_init() may call acc_init() internally if the ACC interface is used to implement LIBSMM (note, this dependency is somewhat expected). In turn, all init/finalize routines should be safe against multiple/over-initialization.
DBCSR calls dbcsr_multiply_lib_init() inside of a parallel region, and subsequent (sub-)initialization may be called only by the master thread. For instance, acc_init() is called by the master thread only (OMP MASTER construct). However, acc_finalize() is apparently called by all threads of a parallel region (workshare). The latter is inconsistent with respect to a "tandem flow" of init/finalize. The expect behavior (IMHO) is to protect acc_init/finalize and libsmm_acc_init/finalize against multiple threads, i.e., ACC and LIBSMM implementations are only expected to be thread-safe after initialization and before finalization.

The text was updated successfully, but these errors were encountered:

dev-zero · 2021-01-21T08:59:19Z

ACK. Possibly related to #421?

hfp · 2021-01-21T09:10:25Z

ACK. Possibly related to #421?

Could be, at least related in the sense of being an OpenMP/multicore inconsistency.

For OpenCL backend, the first issue from above (1) was unveiled by linker errors and weird dependency/use of libraries/objects in CMake, for the second issue (2) the expectation is/was made explicit (https://github.com/hfp/dbcsr/blob/openclacc/src/acc/opencl/acc_opencl.c#L133).

…ng a workaround for cp2k#422 (CUDA).

haampie · 2021-01-28T10:25:19Z

@hfp I don't know if you've seen this, but I prefixed the full C api with c_dbcsr_* here: #419 to avoid this issue:

The C api is now prefixed with c_dbsr_* because otherwise the Fortran code ended up calling acc_init in libgomp.so instead of the C-part of dbcsr.

hfp · 2021-01-28T11:08:15Z

@haampie Thank you! I was not aware about this. Also, good to know if end up merging our work together.

One unrelated question (in case you have an AMD graphics card at hand): perhaps you can try/play with the OpenCL backend on AMD? Should work out-of-the-box (USE_ACCEL=opencl, etc.). Though, kernels are not sophisticated (maybe it looks like this because of the ifdef switches in the kernel-code; however DBCSR's Cuda kernels are much more advanced at the moment with more parameters). I cam curious if the OpenCL based LIBSMM is still performing reasonable.

For the trial, you may want to have a look at the OpenCL backend and OpenCL-based LIBSMM documentation. In particular, the auto-tuning guide should help to increase performance further.

haampie · 2021-01-29T08:58:06Z

Hi @hfp, I'll give it a shot, but currently the cluster with AMD GPUs is unavailable, hopefully it's back next week.

hfp · 2021-01-29T09:27:42Z

@haampie Thank you!

* Completed implementation with passing regtests. Included validation internal to OpenCL backend (disabled by default); useful for debugging failing tests, etc. * Introduced USE_ACCEL replacing USE_CUDA, USE_HIP, and USE_OPENCL. Some more changes as suggested by code review. * Introduced USE_ACCEL replacing USE_CUDA, USE_HIP, and USE_OPENCL. Attempt to update cmake.in (only guessing). Changes suggested by code review. * Collected acc_opencl_synchronous_memops into global acc_opencl_options structure; included svm_interop variable into structure. Configure svm_interop depending on OpenCL standard level (only coarse grained SVM is planned/needed). Adjusted acc_host_mem_allocate/acc_host_mem_deallocate and cc_dev_mem_allocate/acc_dev_mem_deallocate to incorporate SVM. Renamed some backend/helper functions. Tested and fixed acc_opencl_stristr. * Respect compile-time setting (ACC_OPENCL_SVM). * Removed support for ACC_OPENCL_STREAM_OOOEXEC as usage depends on in-order behavior. Removed OpenCL private test (nothing to test left after code cleanup). Introduced environment variables to control acc_opencl_options. Fixed acc_opencl_stristr. * Fixed calling clGetMemObjectInfo accidentally with wrong object. Runtime-select implementations of atomic_add_global (new form almost doubles perf. on Nvidia OpenCL). * Attempt to fix linker errors with additional test case (HIP/ROCm). * Fixed warnings about explicitly deprecated CUDA/HIP functions. * Another attempt to fix cross-dependency in CUDA/HIP backend. * One more attempt to fix cross-dependencies. * Disabled dbcsr_acc_test for HIP (linker error due to cross-dependency). * Revert "Fixed warnings about explicitly deprecated CUDA/HIP functions." This reverts commit 7fc9407. * Prettify. * Improved creating resource/kernel file. Introduced CONSTANT and related runtime-check; adjusted kernel-buffer kinds accordingly (kernel code). Disabled SVM support (compile-time). Code cleanup. * Renamed CONSTANT to GLOBAL and expand GLOBAL to either "constant" or "global". Fixes for BSD/macOS (acc_opencl.sh). * Removed superfluous barrier. * Allow to disable (pre-)transposing B-matrices (to only run the SMM-kernel). Disabled comparison against EPSILON and thereby avoid EXIT_FAILURE when missing the tolerance (error margin is printed). * Prepared for tuned kernel (introduced parameters; WIP) * Introduced OPENCL_LIBSMM_SMM_BLOCK_M/N. Code cleanup. * Adjusted script generating resource file (kernel header). * Introduced compile-time WG-size (transpose kernel). * Implemented blocking SMMs into tiles. Introduced (mini-)batchsize (only BS=1 is implemented yet). * BE: Attempt to iteratively limit the WG-size prior to building the kernel (based on device's maximum supported WG-size). * BE: Adjusted definition of acc_opencl_wgsize; implemented device-specific path (in addition to kernel-specific path). * BE: Adjusted acc_opencl_wgsize to take device as an argument (rather than querying the active device internally). * LIBSMM: Introduced/implemented OPENCL_LIBSMM_SMM_BLOCK_M/N. * LIBSMM: Reworked kernel with more compile-time knowledge. * Keep macro definitions (acc_opencl.sh; kernel header). * Implemented intra-kernel (mini-)batch accumulation (disabled by default; BS=1). Normalized initial matrix values in benchmark driver. * Fixed SMM-kernel for (mini-)batches (1 < BS). Rely 2d-arrays for clarity (cleanup). * Adjusted and fixed work split. Print additional norm (debug). Fixed compiler warning. * Removed barrier (mini-batch). * Fixed array initializer. * Reintroduced barrier. * Removed dead code (as suggested). * Initial auto-tuning script (based in OpenTuner; documentation and requirements.txt to follow). Reduced benchmark runtime to accelerate auto-tuning. * acc_bench_smm: Introduced compile-time (VALIDATE) and runtime option (CHECK environment variable) to allow omitting validation of results. * acc_bench_smm: Reduced number of repetitions (normally the warm-up makes timing stable enough). * acc_bench_smm: Sanitize command line arguments. * Adjusted filename of finally written result. * Prettified Python script. * Fixed file header/banner. * Improved performance of SMM-kernel; adjusted tune_multiply.py accordingly. * tune_multiply.py: Return an non-competitive/bad result in case of an error/invalid experiment (auto-tuning). * tune_multiply.py: Avoid UnboundLocalError in Python code (local variable 'match' referenced before assignment). * tune_multiply.py: Improved handling errors and error messages. * tune_multiply.py: Adjusted defaults/seed. * Adjusted filename (max.gflops found), and added newline (final result file). * Extend result/file for easier reuse (JSON), and merge JSONs into CSV file. * Implemented console output/information about merged/ignored JSON files. * Allow custom separator (CSV file). * Code cleanup (tune_multiply.py). * Implemented loading tuned parameters embedded into binary or from file. * Ensure initialization/finalization is outside of parallel region (BE and LIBSMM). * OPENCL_LIBSMM_PARAMS_DELIMS are used to tokenize parameters (CSV file). * Print type-id in addition to name of element type (benchmark drivers). * Regex to match console output; optional CSV file (tune_multiply.py). * JSON/CSV: store type-id rather than typename (smaller). * Introduced (optional) parameter file to Make and CMake. * Improved help/usage message, and handling errors (acc_opencl.sh). * Support CSV parameter file (acc_opencl.sh). * License banner (acc_opencl.sh). * Fixed issues pointed out by Shellcheck. * Fixed/worked around initialize/finalize issue. * Correct initialization/finalization flow (benchmark drivers); including a workaround for #422 (CUDA). * Missed workaround for CUDA (#422). * Added requirements (OpenTuner). Added wrapper script to tune multiple triplets in several sessions. * Improved console output. * Updated various documentation pieces (WIP). * Allow empty/no choice with respect to USE_ACCEL. * Attempt to CI-test OpenCL backend and LIBSMM. * Adjusted CI/build setup: build LIBXSMM and help CMake to find OpenCL. * Extend PKG_CONFIG_PATH rather than overriding it. * Further adjusted build/run scripts (Daint-CI). * One more attempt to get CI up and running. * Disabled Daint-CI runtime tests (temporarily). Prepared revised transpose kernel. * Replaced OPENCL_LIBSMM_TRANS_WGSIZE in favor of OPENCL_LIBSMM_TRANS_BLOCK_M. * Sanitize command line arguments similar to acc_bench_smm. * Folded inplace-transpose into general transpose.cl. * Improved finding OpenCL bits (e.g., on Daint). * Fixed nasty typo. Adjusted default GPU to P100 (to better adhere to DBCSR default). * Improved build messages/help. * Adjusted installation instructions for clarity. * Adjusted existing documentation to better accommodate/distinct the OpenCL backend as well as the OpenCL based LIBSMM. Added documentation for both the OpenCL backend and the OpenCL based LIBSMM. * Documented auto-tuning. * Improved console output (tune_multiply.sh). * Note about opentuner.db directory. Some additional details and rephrase. * Adjusted separator (tune_multiply.sh). * Improved documentation with some sample output (auto-tuning).

alazzaro · 2021-02-02T20:43:27Z

Related to #261

alazzaro · 2021-02-04T14:41:58Z

@hfp Can we close this issue (and the corresponding #261)?

hfp · 2021-02-04T14:47:14Z

Yes

alazzaro · 2021-02-08T08:54:31Z

@hfp I went again to read this issue... My understanding is that the first point is fixed (acc_init). Note that there is issue #13 somehow related to this topic... But I now read the second bullet (init and finalize with different workflow). I assume this is still there, correct? If so, we should reopen this issue...

hfp · 2021-02-08T09:16:19Z

There was also a fix necessary on my side, which was unveiled by @haampie 's changes. My CMake stuff applied -fopenmp (or equivalent) only to the LIBSMM source code but not the backend sources. It seems to me the respective init/fini functions are both called by the master thread from inside of parallel regions (which is symmetric in contrast to above claim). However, I have not checked this in DBCSR's sources.

There is still the question why we call init/fini functions in parallel regions (if we only want the master thread). Could be just DBCSR's convention (and hence fine). Although, this means the OpenCL BE cannot setup stuff for all threads like this ACC_OPENCL_THREADLOCAL_CONTEXT feature which replicates the context into each thread (it has not proven necessary or useful at this point). It could be a need later if revisiting wrt to multiple devices.

alazzaro · 2021-02-08T09:19:15Z

Then let's reopen... I like the idea to have symmetry between init and finalize...

hfp mentioned this issue Jan 21, 2021

OpenCL based ACC-backend and SMM library #406

Merged

hfp added a commit to hfp/dbcsr that referenced this issue Jan 21, 2021

Correct initialization/finalization flow (benchmark drivers); includi…

e6693bd

…ng a workaround for cp2k#422 (CUDA).

hfp added a commit to hfp/dbcsr that referenced this issue Jan 21, 2021

Missed workaround for CUDA (cp2k#422).

11ae7d5

alazzaro mentioned this issue Feb 2, 2021

Simplify the CMake ROCm detection #419

Merged

hfp closed this as completed Feb 4, 2021

alazzaro reopened this Feb 8, 2021

alazzaro mentioned this issue Mar 25, 2022

ACC Setdevice reordering #595

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ACC init/finalize issues #422

ACC init/finalize issues #422

hfp commented Jan 21, 2021

dev-zero commented Jan 21, 2021

hfp commented Jan 21, 2021

haampie commented Jan 28, 2021 •

edited

hfp commented Jan 28, 2021

haampie commented Jan 29, 2021

hfp commented Jan 29, 2021

alazzaro commented Feb 2, 2021

alazzaro commented Feb 4, 2021

hfp commented Feb 4, 2021

alazzaro commented Feb 8, 2021

hfp commented Feb 8, 2021 •

edited

alazzaro commented Feb 8, 2021

ACC init/finalize issues #422

ACC init/finalize issues #422

Comments

hfp commented Jan 21, 2021

dev-zero commented Jan 21, 2021

hfp commented Jan 21, 2021

haampie commented Jan 28, 2021 • edited

hfp commented Jan 28, 2021

haampie commented Jan 29, 2021

hfp commented Jan 29, 2021

alazzaro commented Feb 2, 2021

alazzaro commented Feb 4, 2021

hfp commented Feb 4, 2021

alazzaro commented Feb 8, 2021

hfp commented Feb 8, 2021 • edited

alazzaro commented Feb 8, 2021

haampie commented Jan 28, 2021 •

edited

hfp commented Feb 8, 2021 •

edited