fix: docker container cleanup to prevent accumulation and system slowdown #975

Copilot · 2025-06-19T05:22:42Z

Problem

Docker containers were not being properly cleaned up after running experiments, causing containers to accumulate over time and making the system progressively slower. This issue occurred in three locations where containers were only cleaned up in success paths but not when exceptions occurred.

Root Cause

The original code used manual container cleanup (container.stop() and container.remove()) in the success path, but when exceptions occurred, the cleanup code was never reached, leaving containers running or in an exited state.

Solution

Implemented proper container cleanup using try-finally blocks to ensure containers are always cleaned up regardless of execution outcome:

1. Main experiment execution (`rdagent/utils/env.py:_run_ret_code()`)

# Before: cleanup only on success
try:
    container = client.containers.run(...)
    # ... process logs ...
    container.stop()
    container.remove()  # Only reached on success
except Exception:
    raise  # Container left uncleaned

# After: cleanup always happens
container = None
try:
    container = client.containers.run(...)
    # ... process logs ...
finally:
    if container is not None:
        try:
            container.stop()
            container.remove()
        except Exception as cleanup_error:
            logger.warning(f"Failed to cleanup container: {cleanup_error}")

2. Docker health check (`rdagent/app/utils/health_check.py:check_docker()`)

Applied the same pattern to ensure test containers are cleaned up even when health checks fail.

3. GPU availability test (`rdagent/utils/env.py:_gpu_kwargs()`)

Fixed nvidia-smi test containers that were never being cleaned up at all:

# Before: no cleanup
client.containers.run(image, "nvidia-smi", **kwargs)

# After: proper cleanup
container = client.containers.run(image, "nvidia-smi", detach=True, **kwargs)
container.wait()
# ... cleanup in finally block
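
As a rough illustration of how the detached run, wait, and cleanup steps fit together, here is a minimal sketch. The helper name `gpu_test_passes` and the image name are hypothetical, and a `MagicMock` stands in for the Docker client so the snippet runs without Docker. It assumes the Docker SDK convention that `container.wait()` returns a dict containing `StatusCode`.

```python
import logging
from unittest.mock import MagicMock

logger = logging.getLogger(__name__)


def gpu_test_passes(client, image, **kwargs) -> bool:
    """Hypothetical helper: True if `nvidia-smi` exits with status 0."""
    container = None
    try:
        # detach=True returns a container object instead of the log output,
        # which gives us a handle to clean up afterwards.
        container = client.containers.run(image, "nvidia-smi", detach=True, **kwargs)
        result = container.wait()  # blocks until the container exits
        return result.get("StatusCode") == 0
    finally:
        # Runs on success and on failure, so the test container never lingers.
        if container is not None:
            try:
                container.stop()
                container.remove()
            except Exception as cleanup_error:
                logger.warning("Failed to cleanup container: %s", cleanup_error)


# Usage with a mock client (no real Docker daemon involved).
mock_client = MagicMock()
mock_client.containers.run.return_value.wait.return_value = {"StatusCode": 0}
gpu_ok = gpu_test_passes(mock_client, "some-cuda-image")
```

The same shape applies whether the test succeeds or raises: the `finally` block is the only place that touches cleanup.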

Testing

Created focused unit tests to verify the cleanup logic works correctly in all scenarios:

✅ Successful execution with cleanup
✅ Exception during execution with cleanup
✅ Cleanup errors don't mask original exceptions
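
The test files themselves are not shown in this PR, so the following is only a sketch of what the three scenarios could look like. It assumes a hypothetical helper `run_with_cleanup` that wraps the try-finally pattern from the solution, and it uses `unittest.mock.MagicMock` in place of a real Docker client.

```python
import logging
from unittest.mock import MagicMock

logger = logging.getLogger(__name__)


def run_with_cleanup(client, work):
    """Hypothetical helper: run `work(container)` and always clean up."""
    container = None
    try:
        container = client.containers.run("some-image", detach=True)
        return work(container)
    finally:
        if container is not None:
            try:
                container.stop()
                container.remove()
            except Exception as cleanup_error:
                logger.warning("Failed to cleanup container: %s", cleanup_error)


def failing_work(_container):
    raise ValueError("experiment failed")


def test_success_cleans_up():
    client = MagicMock()
    container = client.containers.run.return_value
    assert run_with_cleanup(client, lambda c: "ok") == "ok"
    container.stop.assert_called_once()
    container.remove.assert_called_once()


def test_exception_still_cleans_up():
    client = MagicMock()
    container = client.containers.run.return_value
    try:
        run_with_cleanup(client, failing_work)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError to propagate")
    container.stop.assert_called_once()
    container.remove.assert_called_once()


def test_cleanup_error_does_not_mask_original():
    client = MagicMock()
    # Make cleanup itself fail; the original ValueError should still surface.
    client.containers.run.return_value.stop.side_effect = RuntimeError("stop failed")
    try:
        run_with_cleanup(client, failing_work)
    except ValueError as err:
        assert str(err) == "experiment failed"
    else:
        raise AssertionError("expected ValueError to propagate")
```

Because the cleanup error is caught and logged inside the `finally` block, the exception raised by the experiment is the one the caller sees.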

Impact

This fix prevents Docker container accumulation that was causing the system to become slower over time during long-running experiment sessions.

Fixes #971.

Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>

peteryang1 · 2025-06-19T09:43:59Z

Please reuse the same finally-block cleanup code in the GPU check.

Copilot · 2025-06-19T09:50:22Z

I've refactored the container cleanup code to eliminate duplication. Created reusable helper methods (_cleanup_container) that are now used consistently across all three locations: main experiment execution, GPU test, and health check functions. This reduces code duplication and ensures consistent cleanup behavior throughout the codebase. Commit d342d88.

peteryang1 · 2025-06-19T09:53:45Z

_cleanup_container in rdagent/app/utils/health_check.py can call _cleanup_container in rdagent/utils/env.py.
Don't worry about the difference in stop behavior: you can always stop the container before removing it, even if it has already stopped. (Make sure stop doesn't raise an error if the container is already stopped.)

Copilot · 2025-06-19T10:01:45Z

I've refactored the container cleanup code to use a shared function as requested. Created a cleanup_container function in rdagent/utils/env.py that always stops containers before removing them, and updated health_check.py to import and use this shared function instead of its own duplicate implementation. This eliminates code duplication and ensures consistent cleanup behavior throughout the codebase. Commit c79d716.
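
For readers who want a concrete picture, here is a minimal sketch of what such a shared function might look like. Only the name `cleanup_container` and its stop-then-remove behavior come from the discussion above; the signature, the `context` parameter, and the choice to guard `stop` and `remove` separately are assumptions made for illustration. A `MagicMock` stands in for a Docker container.

```python
import logging
from typing import Any, Optional
from unittest.mock import MagicMock

logger = logging.getLogger(__name__)


def cleanup_container(container: Optional[Any], context: str = "") -> None:
    """Stop, then remove, a container; log cleanup failures instead of raising.

    Sketch only: `context` is a hypothetical label used in log messages.
    """
    if container is None:
        return
    suffix = f" ({context})" if context else ""
    # Each step is guarded on its own so a failing stop() (for example on a
    # container that is already gone) does not prevent remove() from running.
    for step in ("stop", "remove"):
        try:
            getattr(container, step)()
        except Exception as cleanup_error:
            logger.warning("Failed to %s container%s: %s", step, suffix, cleanup_error)


# Usage: stop() fails, yet remove() is still attempted and nothing is raised.
fake_container = MagicMock()
fake_container.stop.side_effect = RuntimeError("stop failed")
cleanup_container(fake_container, context="health check")
cleanup_container(None)  # a container that was never created is a no-op
```

With a function like this living in rdagent/utils/env.py, health_check.py only needs to import and call it from its own `finally` block.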

peteryang1 · 2025-06-19T10:04:30Z

please fix the CI:
python -m mypy rdagent/core
rdagent/utils/env.py:47: error: Function is missing a type annotation for one or more arguments [no-untyped-def]
rdagent/utils/env.py:727: error: Argument 2 to "_cleanup_container" becomes "Any | None" due to an unfollowed import [no-any-unimported]
rdagent/utils/env.py:887: error: Name "container" already defined on line 884 [no-redef]
rdagent/utils/env.py:887: note: Error code "no-redef" not covered by "type: ignore" comment
rdagent/utils/env.py:901: error: "None" has no attribute "logs" [attr-defined]
rdagent/utils/env.py:907: error: "None" has no attribute "id" [attr-defined]
rdagent/utils/env.py:908: error: "None" has no attribute "name" [attr-defined]
rdagent/utils/env.py:918: error: "None" has no attribute "wait" [attr-defined]

Copilot · 2025-06-19T10:13:41Z

Fixed all the mypy errors mentioned in the CI:

Added proper type annotations for the cleanup_container function
Fixed variable redefinition issue with container
Resolved "None has no attribute" errors by adding assert after container creation
Added appropriate # type: ignore[no-any-unimported] comments for Docker import issues

All the specific mypy errors listed in the comment are now resolved. Commit 19d782c.
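
The pattern behind these fixes can be sketched as follows. This is an assumption-laden illustration, not the code from the commit: `ContainerLike`, `ClientLike`, `run_and_collect`, and the fake classes are hypothetical stand-ins so the snippet runs without the Docker SDK. The idea is to annotate the variable once as optional, then assert it is set before using its attributes.

```python
from typing import Optional, Protocol


class ContainerLike(Protocol):
    """Hypothetical stand-in for the Docker SDK container type."""

    id: str

    def logs(self) -> bytes: ...
    def stop(self) -> None: ...
    def remove(self) -> None: ...


class ClientLike(Protocol):
    def run(self, image: str) -> ContainerLike: ...


def run_and_collect(client: ClientLike, image: str) -> bytes:
    # Annotate exactly once; annotating the same name again later is the kind
    # of thing that triggers a [no-redef] error.
    container: Optional[ContainerLike] = None
    try:
        container = client.run(image)
        # Tells the type checker (and the reader) that container is set here,
        # so attribute access below is not flagged as access on None.
        assert container is not None
        return container.logs()
    finally:
        if container is not None:
            container.stop()
            container.remove()


class FakeContainer:
    """Test double used only to make the sketch runnable."""

    id = "abc123"

    def __init__(self) -> None:
        self.removed = False

    def logs(self) -> bytes:
        return b"done\n"

    def stop(self) -> None:
        pass

    def remove(self) -> None:
        self.removed = True


class FakeClient:
    def __init__(self) -> None:
        self.last: Optional[FakeContainer] = None

    def run(self, image: str) -> FakeContainer:
        self.last = FakeContainer()
        return self.last
```

Initializing with a bare `container = None` and no annotation is what typically leads a checker to infer the type as `None`, which matches the "None has no attribute" errors in the CI log.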

peteryang1 · 2025-06-19T10:17:39Z

Having _cleanup_container in DockerEnv just to call cleanup_container in one line is stupid. Why not call cleanup_container directly instead of self._cleanup_container?

Copilot · 2025-06-19T10:24:06Z

Removed the unnecessary _cleanup_container wrapper method from the DockerEnv class and replaced all calls with direct calls to the shared cleanup_container function. This eliminates the redundant one-line wrapper as requested. Commit 758a2e0.

Initial plan for issue

4c9da3b

Copilot AI assigned Copilot and peteryang1 Jun 19, 2025

Copilot started work on behalf of peteryang1 June 19, 2025 05:22 View session

Copilot AI and others added 3 commits June 19, 2025 05:28

Fix Docker container cleanup issue by using try-finally block

d2e6ecc

Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>

Fix additional Docker container leaks in health_check and GPU test functions

9612929

Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>

Remove temporary test files and finalize Docker container cleanup fix

b0d3008

Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Docker Container is not removed after running experiments.~~ Fix Docker container cleanup to prevent accumulation and system slowdown Jun 19, 2025

Copilot AI requested a review from peteryang1 June 19, 2025 05:32

Copilot finished work on behalf of peteryang1 June 19, 2025 05:32

peteryang1 marked this pull request as ready for review June 19, 2025 07:56

peteryang1 changed the title ~~Fix Docker container cleanup to prevent accumulation and system slowdown~~ fix: Docker container cleanup to prevent accumulation and system slowdown Jun 19, 2025

peteryang1 changed the title ~~fix: Docker container cleanup to prevent accumulation and system slowdown~~ fix: docker container cleanup to prevent accumulation and system slowdown Jun 19, 2025

peteryang1 approved these changes Jun 19, 2025

Copilot started work on behalf of peteryang1 June 19, 2025 09:44 View session

Refactor container cleanup code to reduce duplication as requested in review feedback

d342d88

Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>

Copilot AI requested a review from peteryang1 June 19, 2025 09:50

Copilot finished work on behalf of peteryang1 June 19, 2025 09:50

Copilot started work on behalf of peteryang1 June 19, 2025 09:53 View session

Refactor container cleanup to use shared function and always stop before remove

c79d716

Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>

Copilot finished work on behalf of peteryang1 June 19, 2025 10:02

fix CI

14f6d40

Copilot started work on behalf of peteryang1 June 19, 2025 10:04 View session

Fix mypy type checking errors for Docker container cleanup

19d782c

Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>

Copilot finished work on behalf of peteryang1 June 19, 2025 10:14

fix CI

a484caa

Copilot started work on behalf of peteryang1 June 19, 2025 10:17 View session

Remove unnecessary _cleanup_container wrapper method in DockerEnv class

758a2e0

Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>

Copilot finished work on behalf of peteryang1 June 19, 2025 10:24

peteryang1 merged commit 05cf094 into main Jun 19, 2025
9 checks passed

peteryang1 deleted the copilot/fix-971 branch June 19, 2025 10:32

you-n-g mentioned this pull request Jun 19, 2025

chore(main): release 0.6.0 #976

Merged

This was referenced Jun 30, 2025

Feature implementation from commits 70dc62d..9e60c32 codeOwlAI/RD-Agent#1

Open

Feature implementation from commits 70ba930..25caa3d codeOwlAI/RD-Agent#2

Open
