Inconsistent function name in RECURSIVE #464

Open
QingleiCao opened this issue Dec 12, 2022 · 6 comments
Labels: bug (Something isn't working)

Comments

QingleiCao (Contributor) commented Dec 12, 2022

Describe the bug

The first issue: DPLASMA_WITH_RECURSIVE and PARSEC_HAVE_RECURSIVE are independent, which means PARSEC_HAVE_RECURSIVE remains OFF by default even when DPLASMA_WITH_RECURSIVE is ON.

The second issue: if PARSEC_HAVE_RECURSIVE is enabled manually, DPLASMA fails to compile with the following error:

/home/qcao3/dplasma/build/src/dpotrf_L.c:3960:32: error: 'hook_of_dpotrf_L_potrf_dgemm_recursive' undeclared here (not in a function); did you mean 'hook_of_dpotrf_L_potrf_dgemm_RECURSIVE'?
 3960 |      .hook = (parsec_hook_t *) hook_of_dpotrf_L_potrf_dgemm_recursive},
      |                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                hook_of_dpotrf_L_potrf_dgemm_RECURSIVE

I suspect this was introduced by this commit.

To Reproduce

Steps to reproduce the behavior:

  1. Manually enable PARSEC_HAVE_RECURSIVE
  2. Compile DPLASMA

Expected behavior

DPLASMA compiles successfully with recursive kernels enabled.

Environment

Saturn ICL


QingleiCao added the bug label on Dec 12, 2022
QingleiCao (Contributor, Author) commented Feb 14, 2023

I made a simple fix for this (it may not be the cleanest approach).

DPLASMA:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 908d3f0..69f8386 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -92,6 +92,10 @@ option(DPLASMA_SCALAPACK_WRAPPER
 
 option(DPLASMA_WITH_RECURSIVE
   "Enable recursive kernels to be called when available" ON)
+if(DPLASMA_WITH_RECURSIVE)
+  add_definitions(-DPARSEC_HAVE_RECURSIVE)
+  message(STATUS "PARSEC_HAVE_RECURSIVE is defined.")
+endif(DPLASMA_WITH_RECURSIVE)
 
 option(DPLASMA_TRACE_KERNELS
   "Enable tracing dplasma kernel calls for debugging (slow)" OFF)

PaRSEC:

diff --git a/parsec/interfaces/ptg/ptg-compiler/jdf2c.c b/parsec/interfaces/ptg/ptg-compiler/jdf2c.c
index 1cf9494..651ca68 100644
--- a/parsec/interfaces/ptg/ptg-compiler/jdf2c.c
+++ b/parsec/interfaces/ptg/ptg-compiler/jdf2c.c
@@ -3941,7 +3941,11 @@ jdf_generate_function_incarnation_list( const jdf_t *jdf,
                 string_arena_add_string(sa, "      .dyld     = \"%s\",\n", dyld_property->expr->jdf_var);
             }
             string_arena_add_string(sa, "      .evaluate = %s,\n", "NULL");
-            string_arena_add_string(sa, "      .hook     = (parsec_hook_t*)hook_of_%s_%s },\n", base_name, dev_lower);
+            if (!strcmp(dev_upper, "RECURSIVE")) {
+                string_arena_add_string(sa, "      .hook     = (parsec_hook_t*)hook_of_%s_%s },\n", base_name, dev_upper);
+            } else {
+                string_arena_add_string(sa, "      .hook     = (parsec_hook_t*)hook_of_%s_%s },\n", base_name, dev_lower);
+            }
             string_arena_add_string(sa, "#endif  /* defined(PARSEC_HAVE_%s) */\n", dev_upper);
             free(dev_upper); free(dev_lower);
         }

This fixes the compilation issue. However, there are still problems at runtime when recursive is enabled. Here is the backtrace from gdb:

#0  0x00007ffff5aa9387 in raise () from /lib64/libc.so.6
#1  0x00007ffff5aaaa78 in abort () from /lib64/libc.so.6
#2  0x00007ffff5aa21a6 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007ffff5aa2252 in __assert_fail () from /lib64/libc.so.6
#4  0x00007ffff6c78f60 in parsec_taskpool_unregister (tp=0x7fff40005ce0) at /home/qcao3/dplasma/parsec/parsec/parsec.c:2165
#5  0x00007ffff736a1ee in __parsec_dpotrf_U_internal_destructor (__parsec_tp=0x7fff40005ce0) at /home/qcao3/dplasma/build/src/dpotrf_U.c:9797
#6  0x00007ffff6c744c9 in parsec_obj_run_destructors (object=0x7fff40005ce0) at /home/qcao3/dplasma/parsec/parsec/class/parsec_object.h:446
#7  0x00007ffff6c78ff8 in parsec_taskpool_free (tp=0x7fff40005ce0) at /home/qcao3/dplasma/parsec/parsec/parsec.c:2173
#8  0x00007ffff709c770 in dplasma_dpotrf_Destruct (tp=0x7fff40005ce0) at /home/qcao3/dplasma/build/src/dpotrf_wrapper.c:263
#9  0x00007ffff736adf1 in dpotrf_U_update_INFO (_tp=0x7fff40005ce0, data=0x7fff40008d20) at /home/qcao3/dplasma/build/src/dpotrf_U.c:10120
#10 0x00007ffff7356949 in parsec_recursivecall_callback (tp=0x7fff40005ce0, cb_data=0x7fff40008d20) at /home/qcao3/dplasma/parsec/parsec/recursive.h:32
#11 0x00007ffff6c8afe1 in parsec_taskpool_termination_detected (tp=0x7fff40005ce0) at /home/qcao3/dplasma/parsec/parsec/scheduling.c:229
#12 0x00007ffff6cba26f in parsec_termdet_local_termination_detected (tp=0x7fff40005ce0) at /home/qcao3/dplasma/parsec/parsec/mca/termdet/local/termdet_local_module.c:120
#13 0x00007ffff6cba65c in parsec_termdet_local_taskpool_addto_nb_tasks (tp=0x7fff40005ce0, v=-1) at /home/qcao3/dplasma/parsec/parsec/mca/termdet/local/termdet_local_module.c:206
#14 0x00007ffff6c960e1 in parsec_release_task_to_mempool_update_nbtasks (es=0x7fff54000950, this_task=0x7fff5400aa20) at /home/qcao3/dplasma/parsec/parsec/interfaces/interface.c:37
#15 0x00007ffff7369067 in release_task_of_dpotrf_U_potrf_dpotrf (es=0x7fff54000950, this_task=0x7fff5400aa20) at /home/qcao3/dplasma/build/src/dpotrf_U.c:9421
#16 0x00007ffff6c8b68e in __parsec_complete_execution (es=0x7fff54000950, task=0x7fff5400aa20) at /home/qcao3/dplasma/parsec/parsec/scheduling.c:470
#17 0x00007ffff6c8b7b1 in __parsec_task_progress (es=0x7fff54000950, task=0x7fff5400aa20, distance=1) at /home/qcao3/dplasma/parsec/parsec/scheduling.c:497
#18 0x00007ffff6c8c25d in __parsec_context_wait (es=0x7fff54000950) at /home/qcao3/dplasma/parsec/parsec/scheduling.c:776
#19 0x00007ffff6c750c0 in __parsec_thread_init (startup=0xaec450) at /home/qcao3/dplasma/parsec/parsec/parsec.c:337
#20 0x00007fffaf454ea5 in start_thread () from /lib64/libpthread.so.0
#21 0x00007ffff5b719fd in clone () from /lib64/libc.so.6
>>> f 4
#4  0x00007ffff6c78f60 in parsec_taskpool_unregister (tp=0x7fff40005ce0) at /home/qcao3/dplasma/parsec/parsec/parsec.c:2165
2165	    assert( PARSEC_TERM_TP_TERMINATED == tp->tdm.module->taskpool_state(tp) );
>>> p tp->tdm.module->taskpool_state(tp)
$1 = PARSEC_TERM_TP_BUSY

Testing command in DPLASMA:

./testing_dpotrf -N 1000 -t 100 -z 50

QingleiCao (Contributor, Author) commented Feb 14, 2023

@therault suggests this patch:

diff --git a/parsec/mca/termdet/local/termdet_local_module.c b/parsec/mca/termdet/local/termdet_local_module.c
index d599cd2ea..5e37910bc 100644
--- a/parsec/mca/termdet/local/termdet_local_module.c
+++ b/parsec/mca/termdet/local/termdet_local_module.c
@@ -88,7 +88,7 @@ static void parsec_termdet_local_monitor_taskpool(parsec_taskpool_t *tp,
 static void parsec_termdet_local_unmonitor_taskpool(parsec_taskpool_t *tp)
 {
     assert(&parsec_termdet_local_module.module == tp->tdm.module);
-    assert(tp->tdm.monitor == PARSEC_TERMDET_LOCAL_TERMINATED);
+    assert(tp->tdm.monitor == PARSEC_TERMDET_LOCAL_TERMINATED || tp->tdm.monitor == PARSEC_TERMDET_LOCAL_TERMINATING);
     tp->tdm.module   = NULL;
     tp->tdm.callback = NULL;
 }
@@ -98,9 +98,9 @@ static parsec_termdet_taskpool_state_t parsec_termdet_local_taskpool_state(parse
     if( tp->tdm.module == NULL )
         return PARSEC_TERM_TP_NOT_MONITORED;
     assert(tp->tdm.module == &parsec_termdet_local_module.module);
-    if( tp->tdm.monitor == PARSEC_TERMDET_LOCAL_TERMINATED )
+    if( tp->tdm.monitor == PARSEC_TERMDET_LOCAL_TERMINATED || tp->tdm.monitor == PARSEC_TERMDET_LOCAL_TERMINATING)
         return PARSEC_TERM_TP_TERMINATED;
-    if( tp->tdm.monitor == PARSEC_TERMDET_LOCAL_BUSY || tp->tdm.monitor == PARSEC_TERMDET_LOCAL_TERMINATING )
+    if( tp->tdm.monitor == PARSEC_TERMDET_LOCAL_BUSY ) 
         return PARSEC_TERM_TP_BUSY;
     if( tp->tdm.monitor == PARSEC_TERMDET_LOCAL_NOT_READY )
         return PARSEC_TERM_TP_NOT_READY;

Even with this patch, there is still an issue:

#0  0x00007ffff6c8a7f7 in parsec_atomic_fetch_add_int32 (l=0xb, v=-1) at /home/qcao3/dplasma/parsec/parsec/include/parsec/sys/atomic-c11.h:163
#1  0x00007ffff6c8a893 in parsec_atomic_fetch_dec_int32 (l=0xb) at /home/qcao3/dplasma/parsec/parsec/include/parsec/sys/atomic.h:160
#2  0x00007ffff6c8aff5 in parsec_taskpool_termination_detected (tp=0x7fff30004740) at /home/qcao3/dplasma/parsec/parsec/scheduling.c:231
#3  0x00007ffff6cba27d in parsec_termdet_local_termination_detected (tp=0x7fff30004740) at /home/qcao3/dplasma/parsec/parsec/mca/termdet/local/termdet_local_module.c:120
#4  0x00007ffff6cba71c in parsec_termdet_local_taskpool_addto_runtime_actions (tp=0x7fff30004740, v=-1) at /home/qcao3/dplasma/parsec/parsec/mca/termdet/local/termdet_local_module.c:223
#5  0x00007ffff6c8af83 in parsec_taskpool_update_runtime_nbtask (tp=0x7fff30004740, nb_tasks=-1) at /home/qcao3/dplasma/parsec/parsec/scheduling.c:217
#6  0x00007ffff6c8545f in remote_dep_dec_flying_messages (handle=0x7fff30004740) at /home/qcao3/dplasma/parsec/parsec/remote_dep.h:364
#7  0x00007ffff6c89c89 in remote_dep_mpi_new_taskpool (es=0xaced70, dep_cmd_item=0x7fff30024ab0) at /home/qcao3/dplasma/parsec/parsec/remote_dep_mpi.c:1934
#8  0x00007ffff6c881ef in remote_dep_dequeue_nothread_progress (es=0xaced70, cycles=0) at /home/qcao3/dplasma/parsec/parsec/remote_dep_mpi.c:1170
#9  0x00007ffff6c807ee in parsec_remote_dep_progress (es=0xaced70) at /home/qcao3/dplasma/parsec/parsec/remote_dep.c:327
#10 0x00007ffff6c8c1e9 in __parsec_context_wait (es=0xaced70) at /home/qcao3/dplasma/parsec/parsec/scheduling.c:760
#11 0x00007ffff6c8c836 in parsec_context_wait (context=0xab26c0) at /home/qcao3/dplasma/parsec/parsec/scheduling.c:977
#12 0x0000000000402ae8 in main (argc=7, argv=0x7fffffff6c18) at /home/qcao3/dplasma/build/tests/testing_dpotrf.c:62

QingleiCao (Contributor, Author) commented:

The PR that appears to have introduced this new issue is: Introduce parsec_taskpool_wait and parsec_taskpool_test, by @devreal.

therault (Contributor) commented:

The root of the problem is that, in the recursive code, we try to parsec_taskpool_destroy(tp) inside the callback of the termination detector for tp. To allow this behavior, we would have to terminate (and mark as terminated) the taskpool before calling the callback.

However, when we use the termination detector for DTD (or for other DSLs like TTG), marking the taskpool terminated before calling the callback creates a race condition: another thread that is polling the status of the taskpool can observe it as terminated before the callback is called, and move on to the next phase, which re-initializes the taskpool to be progressed again; this leads to other problems when the callback is finally invoked. (This is why we introduced the TERMINATING status for a taskpool: everyone agrees termination has been reached, but the taskpool still exists.)

The same callback cannot be used for both behaviors.

One possible solution: instead of calling parsec_taskpool_destroy(tp) in the callback, the termination detector would PARSEC_OBJ_RETAIN(tp) when it starts monitoring the taskpool, the recursive code would call PARSEC_OBJ_RELEASE(tp) in its callback, and the termination detector would also PARSEC_OBJ_RELEASE(tp) after the callback returns, triggering the destruction. @bosilca, do you see a better way?
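
A minimal sketch of that retain/release scheme, using a generic reference-counted stand-in for the taskpool rather than PaRSEC's real types; the names tp_retain, tp_release, termdet_monitor, termdet_termination_detected, and recursive_callback are hypothetical and only illustrate which party owns which reference:

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* Generic stand-in for a taskpool; in PaRSEC the object would be a
 * parsec_taskpool_t and retain/release would be PARSEC_OBJ_RETAIN /
 * PARSEC_OBJ_RELEASE. */
typedef struct taskpool {
    atomic_int refcount;
} taskpool_t;

static void tp_retain(taskpool_t *tp) { atomic_fetch_add(&tp->refcount, 1); }

static void tp_release(taskpool_t *tp)
{
    if (atomic_fetch_sub(&tp->refcount, 1) == 1) {
        printf("last reference dropped: destroying taskpool\n");
        free(tp); /* here the real code would parsec_taskpool_destroy(tp) */
    }
}

/* The termination detector takes its own reference when monitoring starts... */
static void termdet_monitor(taskpool_t *tp) { tp_retain(tp); }

/* ...and drops it only after the user callback has returned, so the callback
 * may release its reference without racing the destruction. */
static void termdet_termination_detected(taskpool_t *tp, void (*cb)(taskpool_t *))
{
    cb(tp);         /* e.g. the recursive completion callback */
    tp_release(tp); /* termdet's reference; may trigger destruction */
}

/* The recursive code releases its reference instead of destroying directly. */
static void recursive_callback(taskpool_t *tp) { tp_release(tp); }

int main(void)
{
    taskpool_t *tp = malloc(sizeof(*tp));
    atomic_init(&tp->refcount, 1); /* reference owned by the recursive code */
    termdet_monitor(tp);           /* refcount is now 2 */
    /* ... tasks run, termination is eventually detected ... */
    termdet_termination_detected(tp, recursive_callback);
    return 0;
}

With this split, neither the callback nor the detector needs to know which of them drops the last reference, so the destroy-in-callback pattern becomes safe.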

bosilca (Contributor) commented Feb 15, 2023

We were careful not to touch the taskpool on the way out of the callback, but now we would need to go back into the termdet, which might still hold a pointer to it. Possible, but tedious.

If we mark the taskpool before calling the callback but remember that we did so, then the caller (either the callback itself or the main thread) can handle the taskpool as they want or need. Of course they could do bad things, such as the main thread freeing the taskpool while we are still calling the callback, but that is of little concern and easily avoided with correct programming.

bosilca (Contributor) commented Feb 16, 2023

We might need to add a dual mechanism like the one we use in OMPI for handling requests: they can be internally completed or externally completed, which gives us a window between the two to handle object-related things (like calling the completion callback).
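
A rough sketch of such a two-stage completion, with made-up names (the OMPI request machinery is only the inspiration here, and object_t, mark_complete, and test_complete are not PaRSEC API): the object is first marked internally complete, the callback runs inside that window, and only the final externally-complete mark lets a polling thread reclaim it:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical two-stage completion: INTERNAL means "termination agreed,
 * callback may still be running"; EXTERNAL means "fully complete, safe
 * for the owner to reuse or free the object". */
enum { COMPLETE_NONE, COMPLETE_INTERNAL, COMPLETE_EXTERNAL };

typedef struct {
    atomic_int completion;
    void (*callback)(void *);
    void *cb_data;
} object_t;

/* Runs on the thread that detects termination. */
static void mark_complete(object_t *obj)
{
    atomic_store(&obj->completion, COMPLETE_INTERNAL);
    if (obj->callback != NULL)      /* window: object still owned by us */
        obj->callback(obj->cb_data);
    atomic_store(&obj->completion, COMPLETE_EXTERNAL);
}

/* Runs on a polling thread (think of a taskpool_test-like call): it only
 * reports completion once the callback window has closed, so it cannot
 * re-initialize or free the object while the callback is in flight. */
static bool test_complete(object_t *obj)
{
    return atomic_load(&obj->completion) == COMPLETE_EXTERNAL;
}

static void on_done(void *data) { printf("callback: %s\n", (const char *)data); }

int main(void)
{
    object_t obj = { .callback = on_done, .cb_data = "termination detected" };
    atomic_init(&obj.completion, COMPLETE_NONE);
    printf("before: complete=%d\n", (int)test_complete(&obj));
    mark_complete(&obj);
    printf("after: complete=%d\n", (int)test_complete(&obj));
    return 0;
}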
