refactor(core): operation cleanup #281
Operation cleanup logic was mixed together (a single function handled all cleanup, even though the cleanup required differed between scenarios) and was also mixed with signal handling (a shutdown could be a successful operation or a signal).

In this commit
- cleanup code required by different operations/functions is separated
- a single, generic signal handler exists
- operations register their cleanup requirements with the handler so they are cleaned up
- the shutdown global now only indicates whether a signal was received, NOT whether an operation finished
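For readers following the thread, the registration pattern described in the commit message could be sketched roughly as below. This is only an illustration: `register_cleanup`, `handle_signal`, `signal_received` and `install_handler` are hypothetical names, not the ones used in this PR, and running cleanups in reverse registration order is an assumption.

```python
import signal
import threading
from typing import Callable, List

# Hypothetical names throughout; none are taken from the PR's code.
_cleanup_callbacks: List[Callable[[], None]] = []
_signal_received = threading.Event()  # set only when a signal arrives


def register_cleanup(callback: Callable[[], None]) -> None:
    """Called by an operation to declare what must be cleaned up on a signal."""
    _cleanup_callbacks.append(callback)


def signal_received() -> bool:
    """True only if a signal was received, not if an operation finished."""
    return _signal_received.is_set()


def handle_signal(signum, frame) -> None:
    """Generic handler: record the signal, then run every registered cleanup."""
    _signal_received.set()
    # Assumption: most recently registered (innermost) cleanups run first.
    while _cleanup_callbacks:
        callback = _cleanup_callbacks.pop()
        try:
            callback()
        except Exception as error:
            # One failing cleanup should not prevent the remaining ones.
            print(f"cleanup failed: {error!r}")


def install_handler() -> None:
    """Install the same generic handler for SIGINT and SIGTERM."""
    signal.signal(signal.SIGINT, handle_signal)
    signal.signal(signal.SIGTERM, handle_signal)
```

In this sketch the handler knows nothing about individual operations; each operation only has to call `register_cleanup` with whatever it needs torn down.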
Checks Summary Last run: 2025-12-05T13:40:35.163Z Code Risk Analyzer vulnerability scan found 2 vulnerabilities:
Signed-off-by: Michael Johnston <66301584+michael-johnston@users.noreply.github.com>
AlessandroPomponio left a comment
Just naming suggestions
…ation

It is raised as a new InterruptedOperationError, which contains the operation identifier and is a subclass of KeyboardInterrupt.

Previously nothing was raised and the operation exited normally after interruption. However, that pattern is hard to maintain with multiple nested operations, because each operation would have to check whether the inner operation exited due to a KeyboardInterrupt.

This way operators do not have to handle KeyboardInterrupt. Each outer operation catches the inner interrupt and raises a new exception with its own id. The outermost handler (in operation/create.py) now catches InterruptedOperationError and prints the id of the outermost (parent) interrupted operation.
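As a rough illustration of the nesting behaviour described in this commit message, a minimal sketch might look like the following. Only the exception name and its KeyboardInterrupt base class come from the commit message; the constructor signature, the `operation_id` attribute and the `run_operation` wrapper are assumptions made for the example.

```python
class InterruptedOperationError(KeyboardInterrupt):
    """Raised when an operation is interrupted; carries the operation identifier."""

    def __init__(self, operation_id: str):
        # Assumed signature and attribute name, for illustration only.
        super().__init__(f"operation {operation_id} was interrupted")
        self.operation_id = operation_id


def run_operation(operation_id, body):
    """Hypothetical wrapper: run `body` and re-raise any interrupt with this id."""
    try:
        return body()
    except KeyboardInterrupt as error:
        # Catches a raw KeyboardInterrupt as well as an inner
        # InterruptedOperationError, since the latter is a subclass.
        raise InterruptedOperationError(operation_id) from error


def inner():
    # Stand-in for an interrupt arriving while the inner operation runs.
    raise KeyboardInterrupt


def outer():
    return run_operation("inner-op", inner)


def main():
    try:
        run_operation("outer-op", outer)
    except InterruptedOperationError as error:
        # The outermost handler sees the id of the outermost operation.
        return error.operation_id
```

Here `main()` returns `"outer-op"`, and the inner operation's id stays reachable through the exception's `__cause__` chain if it is needed for debugging.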
@danielelotito can you try Trim with this branch? The main things to check are
Co-authored-by: Alessandro Pomponio <10339005+AlessandroPomponio@users.noreply.github.com> Signed-off-by: Michael Johnston <66301584+michael-johnston@users.noreply.github.com>
Will do!
Also need to check with vllm_performance that resource cleanup is working.
Operation cleanup logic was mixed together (a single function handled all cleanup, even though the cleanup required differed between scenarios) and was also mixed with signal handling (a shutdown could be a successful operation or a signal).
This made it impossible to correctly handle operation cleanup when there were nested operations (see #200).
In this PR
New behaviour:
- shutdown ray
- remove global resource like the cleaner
- set shutdown global (which is gone)
- shutdown ray
- remove global resource like the cleaner
- set shutdown global (which is gone)
- previously nothing was raised and the operation exited via the normal route
Note: If an operation that creates Ray actors is called directly, the caller is responsible for cleaning up the ResourceCleaner and shutting down Ray, i.e. the same responsibilities orchestrate has.
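One way a direct caller might meet those responsibilities is with a try/finally wrapper, sketched below. The function and parameter names are hypothetical, the ResourceCleaner's real API is not shown in this thread so its cleanup is passed in as a plain callable, and the ordering (cleaner first, then Ray) is an assumption.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_operation_directly(
    operation: Callable[[], T],
    cleanup_resource_cleaner: Callable[[], None],
    shutdown_ray: Callable[[], None],
) -> T:
    """Run an operation and always perform the caller-side teardown afterwards."""
    try:
        return operation()
    finally:
        try:
            # Assumption: the cleaner is torn down before Ray itself.
            cleanup_resource_cleaner()
        finally:
            # Runs even if the operation or the cleaner cleanup raised.
            shutdown_ray()
```

With Ray installed, `shutdown_ray` could simply be `ray.shutdown`; what to pass for `cleanup_resource_cleaner` depends on how the ResourceCleaner is exposed in the codebase.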