Skip to content

Conversation

@michael-johnston
Copy link
Member

@michael-johnston michael-johnston commented Dec 3, 2025

Operation cleanup logic was mixed together (a single function for cleaning up everything was required/was different for different scenarios) and also mixed with signal handling (a shutdown could be successful operation or a signal).

This made it impossible to correctly handle operation cleanup when there were nested operations (see #200)

In this PR

  • cleanup code required by different operations/functions is separated
  • A single signal handler exists that is generic
    • operations register their cleanup requirements with the handler so they are cleaned-up
  • shutdown global now only indicates if a signal was received NOT if an operation finished

New behaviour:

  • When an explore operation finishes it only cleans up the ray actors it has created. It does not
    - shutdown ray
    - remove global resource like the cleaner
    - set shutdown global (which is gone)
  • When orchestrate() (called via ado create operation)) exits it cleans up ResourceCleaner and ray
  • When general operation finishes there is no set cleanup as there are no ray resources created. It does not:
    - shutdown ray
    - remove global resource like the cleaner
    - set shutdown global (which is gone)
  • If operations are nested there should no longer be issues with shutdown being set or resources deleted
  • If SIGTERM is received all active operations will be cleaned up
  • If SIGINT is received during an operation an InterruptedOperationError is raised with the operation id
    - previously nothing was raised and the operation exited via the normal route

Note: If an operation is called directly that creates Ray actors then the caller is responsible for cleaning up the ResourceCleaner and shutting down ray i.e. the same responsibilities orchestrate has.

Operation cleanup logic was mixed together (a single function for cleaning up everything was required/was different for different scenarios) and also mixed with signal handling (a shutdown could be successful operation or a signal).

In this commit
- cleanup code required by different operations/functions is separated
- A single signal handler exists that is generic
   - operations register their cleanup requirements with the handler so they are cleaned-up
- shutdown global now only indicates if a signal was received NOT if an operation finished
@DRL-NextGen
Copy link
Member

DRL-NextGen commented Dec 3, 2025

Checks Summary

Last run: 2025-12-05T13:40:35.163Z

Code Risk Analyzer vulnerability scan found 2 vulnerabilities:

Severity Identifier Package Details Fix
🔷 Medium CVE-2025-50181 urllib3
urllib3 redirects are not disabled when retries are disabled on PoolManager instantiationGHSA-pq67-6m6q-mj2v

urllib3:2.3.0->kubernetes:34.1.0
2.5.0
🔷 Medium CVE-2025-50182 urllib3
urllib3 does not control redirects in browsers and Node.jsGHSA-48p4-8xcf-vxj5

urllib3:2.3.0->kubernetes:34.1.0
2.5.0

michael-johnston and others added 2 commits December 5, 2025 10:32
Signed-off-by: Michael Johnston <66301584+michael-johnston@users.noreply.github.com>
Copy link
Member

@AlessandroPomponio AlessandroPomponio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just naming suggestions

…ation

It is raised as a new InterruptedOperationError which contains the operation identifier and is subclass of KeyboardInterrupt

Previously nothing was raised and the operation exited normally after interruption. However, this pattern is not easy to maintain with multiple nested operations and each operation would have to check if the inner operation exited due to KeyboardInterrupt.

This way operators do not have to handle KeyboardInterrupt. Each outer operation will catch the inner interrupt and raise a new exception with its own id.

The outermost handler (in operation/create.py) now catches InterruptedOperationError and prints the id of the outermost (parent) interrupted operation.
@michael-johnston michael-johnston marked this pull request as ready for review December 5, 2025 12:35
@michael-johnston
Copy link
Member Author

@danielelotito can you try Trim with this branch?

The main things to check are

  1. That when Trim finishes "Successfully" both it and the child randomwalk have status SUCCESS (in ado get operations) and that ado show related for the Trim operation shows the randomwalk and vice versa
  2. That if you run Trim and interrupt the randomwalk (CTRL-C) that everything shutdowns gracefully - both operations are marked FAILED but both still are shown as related

Co-authored-by: Alessandro Pomponio <10339005+AlessandroPomponio@users.noreply.github.com>
Signed-off-by: Michael Johnston <66301584+michael-johnston@users.noreply.github.com>
@danielelotito
Copy link
Collaborator

Will do!

@michael-johnston
Copy link
Member Author

Also need to check with vllm_performance that resource cleanup is working.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: nested operations cause the outer operation to think a signal has been sent

5 participants