-
-
Notifications
You must be signed in to change notification settings - Fork 704
fix: docker container cleanup to prevent accumulation and system slowdown #975
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>
…nctions Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>
Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>
reuse all the finally code in gpu check |
… review feedback Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>
I've refactored the container cleanup code to eliminate duplication. Created reusable helper methods ( |
_cleanup_container in rdagent/app/utils/health_check.py can call _cleanup_container in rdagent/utils/env.py |
…ore remove Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>
I've refactored the container cleanup code to use a shared function as requested. Created a |
please fix the CI: |
Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>
Fixed all the mypy errors mentioned in the CI:
All the specific mypy errors listed in the comment are now resolved. Commit 19d782c. |
_cleanup_container in DockerEnv to simply run cleanup_container one line is stupid, why not directly call cleanup_container instead of self._cleanup_container |
Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>
Removed the unnecessary |
Problem
Docker containers were not being properly cleaned up after running experiments, causing containers to accumulate over time and making the system progressively slower. This issue occurred in three locations where containers were only cleaned up in success paths but not when exceptions occurred.
Root Cause
The original code used manual container cleanup (
container.stop()
andcontainer.remove()
) in the success path, but when exceptions occurred, the cleanup code was never reached, leaving containers running or in an exited state.Solution
Implemented proper container cleanup using try-finally blocks to ensure containers are always cleaned up regardless of execution outcome:
1. Main experiment execution (
rdagent/utils/env.py:_run_ret_code()
)2. Docker health check (
rdagent/app/utils/health_check.py:check_docker()
)Applied the same pattern to ensure test containers are cleaned up even when health checks fail.
3. GPU availability test (
rdagent/utils/env.py:_gpu_kwargs()
)Fixed nvidia-smi test containers that were never being cleaned up at all:
Testing
Created focused unit tests to verify the cleanup logic works correctly in all scenarios:
Impact
This fix prevents Docker container accumulation that was causing the system to become slower over time during long-running experiment sessions.
Fixes #971.
💬 Share your feedback on Copilot coding agent for the chance to win a $200 gift card! Click here to start the survey.