Skip to content

Calculation termination: cancel job dispatch (not yet exposed via C API)#1361

Closed
mgovers wants to merge 12 commits intomainfrom
pgm/feature/terminate-calculations
Closed

Calculation termination: cancel job dispatch (not yet exposed via C API)#1361
mgovers wants to merge 12 commits intomainfrom
pgm/feature/terminate-calculations

Conversation

@mgovers
Copy link
Copy Markdown
Member

@mgovers mgovers commented Apr 9, 2026

Part of #1249
Uses the jthread introduced in #1358

  • Use stop_token to cancel calculations. Stop requests are acknowledged in several stages:
    • before prepare calculate
    • before cache calculations
    • before job dispatch
    • before scenario setup
    • before scenario calculation
    • before scenario winddown
  • After every scenario setup is started, a winddown is guaranteed to be called.
    • If the scenario winddown is successful, the main model is guaranteed to be in a good state; ready to be picked up again.
    • This is even true in single scenario calculations (without copy from base model)
    • This implies that (in the future), a keyboard interrupt followed by a rerun will result in successful calculations without having to recreate the main model

mgovers added 3 commits April 9, 2026 09:33
Signed-off-by: Martijn Govers <Martijn.Govers@Alliander.com>
Signed-off-by: Martijn Govers <Martijn.Govers@Alliander.com>
Signed-off-by: Martijn Govers <Martijn.Govers@Alliander.com>
@mgovers mgovers self-assigned this Apr 9, 2026
@mgovers mgovers added the feature New feature or request label Apr 9, 2026
mgovers and others added 5 commits April 9, 2026 16:08
Signed-off-by: Martijn Govers <Martijn.Govers@Alliander.com>
Signed-off-by: Martijn Govers <Martijn.Govers@Alliander.com>
Signed-off-by: Martijn Govers <Martijn.Govers@Alliander.com>
Signed-off-by: Martijn Govers <Martijn.Govers@Alliander.com>
Comment thread tests/cpp_unit_tests/test_job_dispatch.cpp Outdated
Comment thread tests/cpp_unit_tests/test_job_dispatch.cpp Outdated
Comment thread tests/cpp_unit_tests/test_job_dispatch.cpp Outdated
Comment thread tests/cpp_unit_tests/test_job_dispatch.cpp Outdated
Signed-off-by: Martijn Govers <Martijn.Govers@Alliander.com>

Co-authored-by: Martijn Govers <martygovers@hotmail.com>
Signed-off-by: Martijn Govers <martygovers@hotmail.com>
Comment thread tests/cpp_unit_tests/test_job_dispatch.cpp Outdated
Comment thread tests/cpp_unit_tests/test_job_dispatch.cpp Outdated
Signed-off-by: Martijn Govers <Martijn.Govers@Alliander.com>

Co-authored-by: Martijn Govers <martygovers@hotmail.com>
Signed-off-by: Martijn Govers <martygovers@hotmail.com>
@mgovers mgovers changed the title Calculation termination: cancel job dispatch (no yet exposed via C API) Calculation termination: cancel job dispatch (not yet exposed via C API) Apr 10, 2026
Signed-off-by: Martijn Govers <Martijn.Govers@Alliander.com>
@mgovers mgovers marked this pull request as ready for review April 10, 2026 06:07
@TonyXiang8787
Copy link
Copy Markdown
Member

@mgovers can you elaborate how this works (preferably in some in-code documentation in proper place)?

I see you pass std::stop_token around, but I don't see how this object is created.

for (Idx const thread_number : IdxRange{n_thread}) {
// compute each single thread job with stride
threads.emplace_back(single_thread_job, thread_number, n_thread, n_scenarios);
threads.emplace_back(single_thread_job, stop_token, thread_number, n_thread, n_scenarios);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgovers So after reading the cppreference on std::stop_token, I understand that the stop_token should be the first argument of your functor? Then when creating a jthread, it will automatically generate a stop_token to the functor.

But here you seem to explicitely pass a stop_token, why deviating from standard usage?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is that I am using only the getter functionality of stop_token. Instead, I am using a stop_source (from which you can obtain a stop token). That way, one stop source can trigger multiple stops in multiple threads.

In #1363, I show how this can work: Each handle has a stop source. Expensive operations (like calculations) can be offloaded to a separate thread that is awaited by the main thread. A SIGTERM (e.g.: Ctrl+C command) triggers a KeyboardInterrupt or related exceptions. Those are caught in the main thread that awaits the calculation thread. A stop is then requested (atomic, not necessarily lock-free but that is fine in this case, see below) and the exception is reraised. Then, the thread is awaited again as per usual (TBD)

Cfr. the Python docs on signal handlers (specifically https://docs.python.org/3/library/signal.html#note-on-signal-handlers-and-exceptions ) this is fine because we only need to exit on termination command (if we're in a Jupyter notebook, the notebook will take care of the complicated logic handling).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgovers does it mean even in single thread calculation, a new thread will also be created for calculation, instead of calculating in main thread?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, yes. Hence the experiment in #1363 so that we can discuss the implications.

Alternatively, we can use async instead of threads.

The main issue here is that Python handles everything regarding signal handling in the main thread at pseudo-random intervals. If there is no Python code, then there will also never be a call to the signal handler. Cfr. their own documentation in https://docs.python.org/3/library/signal.html#execution-of-python-signal-handlers :

A long-running calculation implemented purely in C (such as regular expression matching on a large body of text) may run uninterrupted for an arbitrary amount of time, regardless of any signals received. The Python signal handlers will be called when the calculation finishes.

Copy link
Copy Markdown
Member

@TonyXiang8787 TonyXiang8787 Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think creating separate thread for single-thread calculation is red flag for us.

Then we need to think about if this is a desired feature at all from business value perspective. Why do we do this in the first place. For production environment, it is never controlled/interrupted at this level, rather the whole container just gets killed from above (k8s). So I am not sure about the business value of maintaining such a complicated interruption logic in PGM.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed mostly useful for research and experiment environments. In those settings, a typo is very easily made (especially with the cartesian product batch scenarios). I myself have encountered cases where I had to kill calculations after a simple typo that could've left my PC running for hours if I would've let it continue. If I have encountered this more than once, then I'm sure other people will as well. It is not reasonable to ask people to kill and restart their Jupyter notebook session as a consequence of a simple typo. That's what made me investigate this feature.

Maybe we can come up with a way that we can make it opt-in?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In research and experiment environments, asking people to terminate the whole jupyter notebook for typos is very reasonable. Research environments are meant to be unstable, and can be killed at any time.

Note PGM core is mainly a production low-level HPC library. When in doubt if some feature (like supporting interuptions) belongs to the responsibility of PGM or not, we can always try to refer to openblas and lapack. Do they support such a feature/check?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

openblas has an issue from 2014 regarding this: https://github.com/OpenMathLib/OpenBLAS/issues/378

Signed-off-by: Martijn Govers <Martijn.Govers@Alliander.com>
@sonarqubecloud
Copy link
Copy Markdown

@mgovers mgovers marked this pull request as draft April 13, 2026 09:06
@mgovers mgovers added the do-not-merge This should not be merged label Apr 13, 2026
Copy link
Copy Markdown
Member

@figueroa1395 figueroa1395 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some input/questions for later discussions.

Comment on lines +138 to +140
if (stop_token.stop_requested()) {
break;
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not stop_if_requested(stop_token); here instead?

if (n_thread == 1) {
// run all in sequential
single_thread_job(0, 1, n_scenarios);
single_thread_job(stop_token, 0, 1, n_scenarios);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on how this would create an additional thread when running in single thread mode?

@mgovers
Copy link
Copy Markdown
Member Author

mgovers commented Apr 16, 2026

Cfr. off-line discussion: Let's close this PR until we get more clarity on the user value

@mgovers mgovers closed this Apr 16, 2026
@mgovers mgovers deleted the pgm/feature/terminate-calculations branch April 16, 2026 11:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge This should not be merged feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants