comm: set parsec_tls_execution_stream in comm thread #422

Merged
bosilca merged 1 commit into ICLDisco:master from uiuc-hpc:fix/comm_es_tls
Oct 31, 2022

Conversation

@omor1
Contributor

@omor1 omor1 commented Aug 17, 2022

PR #413 fixed issues around recursive tasks by retrieving the current
execution stream via `parsec_my_execution_stream`. However, under some
circumstances, it appears that the `on_complete` callback can be called
from the communication thread. The comm thread never sets the TLS
variable needed for `parsec_my_execution_stream`, so you end up with a
NULL execution stream and everything breaks.

Ensure that `parsec_tls_execution_stream` is set by the communication
thread as well. This introduces a new internal API to do so,
`parsec_set_my_execution_stream`, since `parsec_tls_execution_stream` is
a static variable in parsec.c.

Signed-off-by: Omri Mor <omrimor2@illinois.edu>

@omor1 omor1 requested a review from a team as a code owner August 17, 2022 00:40
@omor1
Contributor Author

omor1 commented Aug 17, 2022

Feel free to bikeshed `parsec_set_my_execution_stream`. See the discussion in #412 for why this is necessary. I haven't figured out exactly when `on_complete` is called from the comm thread, but anecdotally it occurs more frequently with LCI, so there is apparently some difference due to communication timing or other implementation details.

@devreal
Contributor

devreal commented Aug 17, 2022

I feel a little uneasy about this... I believe there are places where we rely on `parsec_tls_execution_stream` being used by one thread, and one thread only. I do, for example, in #324, but I understand that the scheduling explicitly relies on sharing resources... I think the notion of sharing thread-local data is somewhat counter-intuitive and may lead to issues down the road...

@omor1
Contributor Author

omor1 commented Aug 17, 2022

> I feel a little uneasy about this... I believe there are places where we rely on `parsec_tls_execution_stream` being used by one thread, and one thread only. I do, for example, in #324, but I understand that the scheduling explicitly relies on sharing resources... I think the notion of sharing thread-local data is somewhat counter-intuitive and may lead to issues down the road...

`parsec_tls_execution_stream` is explicitly a thread-local variable; its value is private to each thread. However, there is the problem that `parsec_comm_es` is cloned from `context->virtual_processes[0]->execution_streams[0]`, so if there are shared data structures inside `parsec_execution_stream_t` that can be modified during execution of these callbacks, then the only safe solution is to give the comm thread a "real" execution context. I asked about this several days ago on Slack, but got no response.

@bosilca
Contributor

bosilca commented Aug 17, 2022

In general, all parsec threads should have their TLS set. I still don't understand why you are linking this to the profiling issue, because I still don't understand how recursive tasks (as they exist today in parsec) can be completed by the communication thread. Can you provide a backtrace when this happens?

@omor1
Contributor Author

omor1 commented Aug 17, 2022

> In general, all parsec threads should have their TLS set. I still don't understand why you are linking this to the profiling issue, because I still don't understand how recursive tasks (as they exist today in parsec) can be completed by the communication thread. Can you provide a backtrace when this happens?

```
#0  0x00001555553f1702 in __parsec_complete_execution ()
   from /home/omor1/spack/opt/spack/linux-centos8-zen2/aocc-3.0.0/parsec-local-2mzvkr4bmfsj2h7fyk3cp4dll5gbn5hu/lib64/libparsec.so.3
#1  0x0000000000238616 in parsec_recursivecall_callback ()
#2  0x000015555541577f in remote_dep_dequeue_nothread_progress ()
   from /home/omor1/spack/opt/spack/linux-centos8-zen2/aocc-3.0.0/parsec-local-2mzvkr4bmfsj2h7fyk3cp4dll5gbn5hu/lib64/libparsec.so.3
#3  0x0000155555414834 in remote_dep_dequeue_main () from /home/omor1/spack/opt/spack/linux-centos8-zen2/aocc-3.0.0/parsec-local-2mzvkr4bmfsj2h7fyk3cp4dll5gbn5hu/lib64/libparsec.so.3
#4  0x000015554df1a17a in start_thread () from /lib64/libpthread.so.0
#5  0x000015554dc49df3 in clone () from /lib64/libc.so.6
```

@omor1
Contributor Author

omor1 commented Aug 17, 2022

@bosilca I have a coredump, but it no longer matches the binary. From the disassembly I still have on my console from when I was looking at the coredump, I've traced this to calls to `remote_dep_dec_flying_messages`.

```c
static inline void remote_dep_dec_flying_messages(parsec_taskpool_t *tp) {
    parsec_taskpool_update_runtime_nbtask(tp, -1);
}

int parsec_taskpool_update_runtime_nbtask(parsec_taskpool_t *tp, int32_t nb_tasks) {
    int remaining = tp->update_nb_runtime_task(tp, nb_tasks); /* points to parsec_add_fetch_runtime_task */
    return parsec_check_complete_cb(tp, tp->context, remaining);
}

int32_t parsec_add_fetch_runtime_task(parsec_taskpool_t *tp, int32_t nb_tasks) {
    return parsec_atomic_fetch_add_int32(&tp->nb_pending_actions, nb_tasks) + nb_tasks;
}

int parsec_check_complete_cb(parsec_taskpool_t *tp, parsec_context_t *context, int remaining) {
    if (0 == remaining) {
        if (NULL != tp->on_complete) {
            tp->on_complete(tp, tp->on_complete_data); /* points to parsec_recursivecall_callback */
        }
        parsec_atomic_fetch_dec_int32(&context->active_taskpools);
        return 1;
    }
    return 0;
}
```

I'm fairly certain that the (inlined) call to `remote_dep_dec_flying_messages` comes from `remote_dep_mpi_new_taskpool`, which has been inlined into `remote_dep_dequeue_nothread_progress`. The recursive taskpool ends up in the communication thread, despite being local, because `parsec_taskpool_enable` is called with `distributed != 0`. I think this is because jdf2c uses `(1 <= nb_tasks)` as the argument. There should probably be a way to propagate that we know a taskpool is local because it is recursive.

The reason LCI seemed to trigger this more often than MPI (which I don't think did in my couple of tests) is probably differences in timing: maybe LCI was busy processing communications at the time and so decremented the counter later, after the actual tasks were complete.

As a side note, `remote_dep_mpi_new_taskpool` only seems to do anything in the case where a process gets an activation for a taskpool that doesn't yet exist locally; it seems to exist to process those delayed activations during taskpool startup.

@therault
Contributor

My only criticism of this PR is that the other threads should also call `parsec_set_my_execution_stream` to set their TLS.

Today, the comm thread has an es (yes, it's not a nice clean es, but it has one, and we use it in some places, mostly related to profiling if I remember well), but that es is not set in the TLS. The only thing this PR does is correct that problem, so I think it's improving the code.

@bosilca
Contributor

bosilca commented Aug 27, 2022

@omor1 I still don't understand how a communication can release the `on_complete` callback of a totally local taskpool. Yes, communications can complete tasks (when all their data transfers for remote processes are completed), so the execution path you are highlighting makes sense for distributed taskpools. However, the recursive taskpools are entirely local: they do not create communications and therefore should have no outside-visible effects (and should not be completed by a communication). We need to dig a little deeper to understand what is really going on.

@omor1
Contributor Author

omor1 commented Aug 29, 2022

> My only criticism of this PR is that the other threads should also call `parsec_set_my_execution_stream` to set their TLS.

Can do.

@omor1
Contributor Author

omor1 commented Aug 29, 2022

> @omor1 I still don't understand how a communication can release the `on_complete` callback of a totally local taskpool. Yes, communications can complete tasks (when all their data transfers for remote processes are completed), so the execution path you are highlighting makes sense for distributed taskpools. However, the recursive taskpools are entirely local: they do not create communications and therefore should have no outside-visible effects (and should not be completed by a communication). We need to dig a little deeper to understand what is really going on.

I think I sort of explained this above:

> The recursive taskpool ends up in the communication thread, despite being local, because `parsec_taskpool_enable` is called with `distributed != 0`. I think this is because jdf2c uses `(1 <= nb_tasks)` as the argument. There should probably be a way to propagate that we know a taskpool is local because it is recursive.

Essentially, when the recursive taskpool is created, PaRSEC seemingly has no way to know that all the data it will need is local and that it has no remote dependents, so it necessarily treats it like a distributed taskpool. The heuristic `1 <= nb_tasks` works because a taskpool with only one task is necessarily local, but for taskpools with more than one task it just assumes that the taskpool is distributed.

@bosilca
Contributor

bosilca commented Aug 29, 2022

> Essentially, when the recursive taskpool is created, PaRSEC seemingly has no way to know that all the data it will need is local and that it has no remote dependents, so it necessarily treats it like a distributed taskpool. The heuristic `1 <= nb_tasks` works because a taskpool with only one task is necessarily local, but for taskpools with more than one task it just assumes that the taskpool is distributed.

OK, this is weird but plausible. The runtime does not know the taskpool is local, so it needs to register it with the communication engine. It does so by increasing the number of flying messages (a way to refcount the communication engine) and by adding a command to the comm engine queue. If the local tasks all complete before the comm engine has time to handle the registration of the new taskpool (via `remote_dep_mpi_new_taskpool`), then indeed, when the communication engine decrements the flying messages at the end of `remote_dep_mpi_new_taskpool`, it will trigger the taskpool completion event. Complicated, and definitely a behavior outside the runtime design. So you're right, we need to set the execution stream for the communication thread.

@bosilca
Contributor

bosilca commented Aug 29, 2022

> My only criticism of this PR is that the other threads should also call `parsec_set_my_execution_stream` to set their TLS.
>
> Can do.

This is already done: all threads set their `parsec_tls_execution_stream` when they are created and associated with an es. The communication thread is different, in the sense that it is created outside the normal thread-creation mechanism (in parsec.c), and only if needed (via an MCA param).

I really don't like that we need to create a special accessor just for this case, but I see that the TLS variables are declared statically (which forces us to have an accessor in parsec.c). I don't think they need to be static; instead, the creator could decide whether they are static or extern. With this approach we can leave all other TLS variables as they are today (aka static) and make `parsec_tls_execution_stream` visible across the entire code base. This way we will not need an additional accessor, and the communication thread will be able to use `PARSEC_TLS_SET_SPECIFIC`.

@omor1
Contributor Author

omor1 commented Aug 29, 2022

> OK, this is weird but plausible. [...] Complicated, and definitely a behavior outside the runtime design. So you're right, we need to set the execution stream for the communication thread.

Right. Somehow with LCI we trigger this more frequently. Maybe since we're faster at sending/receiving messages, the comm thread spends more time continuing to do that (it only stops processing communications after an iteration where nothing progressed), so things like processing the new taskpool get delayed.

> I really don't like that we need to create a special accessor just for this case, but I see that the TLS variables are declared statically (which forces us to have an accessor in parsec.c). I don't think they need to be static; instead, the creator could decide whether they are static or extern. With this approach we can leave all other TLS variables as they are today (aka static) and make `parsec_tls_execution_stream` visible across the entire code base. This way we will not need an additional accessor, and the communication thread will be able to use `PARSEC_TLS_SET_SPECIFIC`.

Generally I agree with all of this. I guess we'd need a new macro, tentatively named `PARSEC_TLS_DECLARE_NONSTATIC`, perhaps paired with `PARSEC_TLS_DECLARE_EXTERN` for use in headers. I'm not a fan of the name; I'd prefer to rename the current macro (and its uses) to `PARSEC_TLS_DECLARE_STATIC` and reuse the current name for non-static declarations, but that's a more invasive patch.

@bosilca
Contributor

bosilca commented Aug 30, 2022

Why not have `static PARSEC_TLS_DECLARE my_var` or `extern PARSEC_TLS_DECLARE my_var`?

@omor1
Contributor Author

omor1 commented Aug 30, 2022

> Why not have `static PARSEC_TLS_DECLARE my_var` or `extern PARSEC_TLS_DECLARE my_var`?

I was a bit concerned about breaking the API if someone later adds a new TLS var and assumes it's static by default. That probably works if that's not a concern; I need to double-check what the macro does on platforms where it uses, e.g., the pthreads TLS API.

@bosilca
Contributor

bosilca commented Aug 30, 2022

Don't worry, we'll make sure it does not happen.

@abouteiller abouteiller added this to the v4.0 milestone Oct 12, 2022
@bosilca bosilca self-assigned this Oct 14, 2022
@bosilca bosilca merged commit 97309b1 into ICLDisco:master Oct 31, 2022
@omor1
Contributor Author

omor1 commented Oct 31, 2022

I never did make the changes re: `PARSEC_TLS_DECLARE`.

@bosilca
Contributor

bosilca commented Nov 1, 2022

It is coming in #437.

@therault therault mentioned this pull request Nov 2, 2022