Calculation termination: cancel job dispatch (not yet exposed via C API)#1361
Signed-off-by: Martijn Govers <Martijn.Govers@Alliander.com>
Signed-off-by: Martijn Govers <Martijn.Govers@Alliander.com> Co-authored-by: Martijn Govers <martygovers@hotmail.com> Signed-off-by: Martijn Govers <martygovers@hotmail.com>
@mgovers can you elaborate how this works (preferably in some in-code documentation in proper place)? I see you pass
  for (Idx const thread_number : IdxRange{n_thread}) {
      // compute each single thread job with stride
-     threads.emplace_back(single_thread_job, thread_number, n_thread, n_scenarios);
+     threads.emplace_back(single_thread_job, stop_token, thread_number, n_thread, n_scenarios);
@mgovers So after reading the cppreference page on std::stop_token, I understand that the stop_token should be the first argument of your functor? Then, when creating a jthread, it will automatically pass a stop_token to the functor.
But here you seem to pass a stop_token explicitly. Why deviate from the standard usage?
The reason is that I am using only the getter functionality of stop_token. Instead, I am using a stop_source (from which you can obtain a stop token). That way, one stop source can trigger multiple stops in multiple threads.
In #1363, I show how this can work: each handle has a stop source. Expensive operations (like calculations) can be offloaded to a separate thread that is awaited by the main thread. A SIGINT (e.g. from Ctrl+C) triggers a KeyboardInterrupt or a related exception. That exception is caught in the main thread, which is awaiting the calculation thread. A stop is then requested (atomic, not necessarily lock-free, but that is fine in this case, see below) and the exception is re-raised. Then, the thread is awaited again as usual (TBD).
Cfr. the Python docs on signal handlers (specifically https://docs.python.org/3/library/signal.html#note-on-signal-handlers-and-exceptions ), this is fine because we only need to exit on a termination command (if we're in a Jupyter notebook, the notebook takes care of the complicated handling logic).
@mgovers does this mean that even in a single-threaded calculation, a new thread is created for the calculation instead of calculating in the main thread?
Unfortunately, yes. Hence the experiment in #1363 so that we can discuss the implications.
Alternatively, we can use async instead of threads.
The main issue here is that Python handles everything regarding signal handling in the main thread at pseudo-random intervals. If there is no Python code, then there will also never be a call to the signal handler. Cfr. their own documentation in https://docs.python.org/3/library/signal.html#execution-of-python-signal-handlers :
A long-running calculation implemented purely in C (such as regular expression matching on a large body of text) may run uninterrupted for an arbitrary amount of time, regardless of any signals received. The Python signal handlers will be called when the calculation finishes.
I think creating a separate thread for a single-threaded calculation is a red flag for us.
Then we need to think about whether this is a desired feature at all from a business value perspective. Why do we do this in the first place? In a production environment, a calculation is never controlled or interrupted at this level; rather, the whole container just gets killed from above (k8s). So I am not sure about the business value of maintaining such complicated interruption logic in PGM.
This is indeed mostly useful for research and experiment environments. In those settings, a typo is very easily made (especially with the cartesian product batch scenarios). I myself have encountered cases where I had to kill calculations after a simple typo that could have left my PC running for hours if I had let it continue. If I have encountered this more than once, then I'm sure other people will as well. It is not reasonable to ask people to kill and restart their Jupyter notebook session as a consequence of a simple typo. That's what made me investigate this feature.
Maybe we can come up with a way that we can make it opt-in?
In research and experiment environments, asking people to terminate the whole Jupyter notebook because of a typo is very reasonable. Research environments are meant to be unstable and can be killed at any time.
Note that PGM core is mainly a production low-level HPC library. When in doubt whether some feature (like supporting interruptions) belongs to the responsibility of PGM or not, we can always refer to OpenBLAS and LAPACK. Do they support such a feature/check?
OpenBLAS has an issue from 2014 regarding this: https://github.com/OpenMathLib/OpenBLAS/issues/378
figueroa1395
left a comment
Some input/questions for later discussions.
  if (stop_token.stop_requested()) {
      break;
  }
Why not stop_if_requested(stop_token); here instead?
  if (n_thread == 1) {
      // run all in sequential
-     single_thread_job(0, 1, n_scenarios);
+     single_thread_job(stop_token, 0, 1, n_scenarios);
Can you elaborate on how this would create an additional thread when running in single thread mode?
Cfr. off-line discussion: Let's close this PR until we get more clarity on the user value.



Part of #1249
Uses the jthread (introduced in #1358) stop_token to cancel calculations. Stop requests are acknowledged in several stages: