Skip to content

fix: docker container cleanup to prevent accumulation and system slowdown #975

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jun 19, 2025

Conversation

Copilot
Copy link
Contributor

@Copilot Copilot AI commented Jun 19, 2025

Problem

Docker containers were not being properly cleaned up after running experiments, causing containers to accumulate over time and making the system progressively slower. This issue occurred in three locations where containers were only cleaned up in success paths but not when exceptions occurred.

Root Cause

The original code used manual container cleanup (container.stop() and container.remove()) in the success path, but when exceptions occurred, the cleanup code was never reached, leaving containers running or in an exited state.

Solution

Implemented proper container cleanup using try-finally blocks to ensure containers are always cleaned up regardless of execution outcome:

1. Main experiment execution (rdagent/utils/env.py:_run_ret_code())

# Before: cleanup only on success
try:
    container = client.containers.run(...)
    # ... process logs ...
    container.stop()
    container.remove()  # Only reached on success
except Exception:
    raise  # Container left uncleaned

# After: cleanup always happens
container = None
try:
    container = client.containers.run(...)
    # ... process logs ...
finally:
    if container is not None:
        try:
            container.stop()
            container.remove()
        except Exception as cleanup_error:
            logger.warning(f"Failed to cleanup container: {cleanup_error}")

2. Docker health check (rdagent/app/utils/health_check.py:check_docker())

Applied the same pattern to ensure test containers are cleaned up even when health checks fail.

3. GPU availability test (rdagent/utils/env.py:_gpu_kwargs())

Fixed nvidia-smi test containers that were never being cleaned up at all:

# Before: no cleanup
client.containers.run(image, "nvidia-smi", **kwargs)

# After: proper cleanup
container = client.containers.run(image, "nvidia-smi", detach=True, **kwargs)
container.wait()
# ... cleanup in finally block

Testing

Created focused unit tests to verify the cleanup logic works correctly in all scenarios:

  • ✅ Successful execution with cleanup
  • ✅ Exception during execution with cleanup
  • ✅ Cleanup errors don't mask original exceptions

Impact

This fix prevents Docker container accumulation that was causing the system to become slower over time during long-running experiment sessions.

Fixes #971.


💬 Share your feedback on Copilot coding agent for the chance to win a $200 gift card! Click here to start the survey.

Copilot AI and others added 3 commits June 19, 2025 05:28
Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>
…nctions

Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>
Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>
@Copilot Copilot AI changed the title [WIP] Docker Container is not removed after running experiments. Fix Docker container cleanup to prevent accumulation and system slowdown Jun 19, 2025
@Copilot Copilot AI requested a review from peteryang1 June 19, 2025 05:32
Copilot finished work on behalf of peteryang1 June 19, 2025 05:32
@peteryang1 peteryang1 marked this pull request as ready for review June 19, 2025 07:56
@peteryang1 peteryang1 changed the title Fix Docker container cleanup to prevent accumulation and system slowdown fix: Docker container cleanup to prevent accumulation and system slowdown Jun 19, 2025
@peteryang1 peteryang1 changed the title fix: Docker container cleanup to prevent accumulation and system slowdown fix: docker container cleanup to prevent accumulation and system slowdown Jun 19, 2025
@peteryang1
Copy link
Collaborator

reuse all the finally code in gpu check

… review feedback

Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Jun 19, 2025

reuse all the finally code in gpu check

I've refactored the container cleanup code to eliminate duplication. Created reusable helper methods (_cleanup_container) that are now used consistently across all three locations: main experiment execution, GPU test, and health check functions. This reduces code duplication and ensures consistent cleanup behavior throughout the codebase. Commit d342d88.

@Copilot Copilot AI requested a review from peteryang1 June 19, 2025 09:50
Copilot finished work on behalf of peteryang1 June 19, 2025 09:50
@peteryang1
Copy link
Collaborator

_cleanup_container in rdagent/app/utils/health_check.py can call _cleanup_container in rdagent/utils/env.py
Don't bother the stop difference, you can always stop the container before remove even it already stops. (Make sure stop doesn't raise error if container already stopped)

…ore remove

Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Jun 19, 2025

_cleanup_container in rdagent/app/utils/health_check.py can call _cleanup_container in rdagent/utils/env.py
Don't bother the stop difference, you can always stop the container before remove even it a...

I've refactored the container cleanup code to use a shared function as requested. Created a cleanup_container function in rdagent/utils/env.py that always stops containers before removing them, and updated health_check.py to import and use this shared function instead of its own duplicate implementation. This eliminates code duplication and ensures consistent cleanup behavior throughout the codebase. Commit c79d716.

Copilot finished work on behalf of peteryang1 June 19, 2025 10:02
@peteryang1
Copy link
Collaborator

please fix the CI:
python -m mypy rdagent/core
rdagent/utils/env.py:47: error: Function is missing a type annotation for one or more arguments [no-untyped-def]
rdagent/utils/env.py:727: error: Argument 2 to "_cleanup_container" becomes "Any | None" due to an unfollowed import [no-any-unimported]
rdagent/utils/env.py:887: error: Name "container" already defined on line 884 [no-redef]
rdagent/utils/env.py:887: note: Error code "no-redef" not covered by "type: ignore" comment
rdagent/utils/env.py:901: error: "None" has no attribute "logs" [attr-defined]
rdagent/utils/env.py:907: error: "None" has no attribute "id" [attr-defined]
rdagent/utils/env.py:908: error: "None" has no attribute "name" [attr-defined]
rdagent/utils/env.py:918: error: "None" has no attribute "wait" [attr-defined]

Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Jun 19, 2025

please fix the CI:
python -m mypy rdagent/core
rdagent/utils/env.py:47: error: Function is missing a type annotation for one or more arguments [no-untyped-def]
...

Fixed all the mypy errors mentioned in the CI:

  • Added proper type annotations for the cleanup_container function
  • Fixed variable redefinition issue with container
  • Resolved "None has no attribute" errors by adding assert after container creation
  • Added appropriate # type: ignore[no-any-unimported] comments for Docker import issues

All the specific mypy errors listed in the comment are now resolved. Commit 19d782c.

Copilot finished work on behalf of peteryang1 June 19, 2025 10:14
@peteryang1
Copy link
Collaborator

_cleanup_container in DockerEnv to simply run cleanup_container one line is stupid, why not directly call cleanup_container instead of self._cleanup_container

Co-authored-by: peteryang1 <25981102+peteryang1@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Jun 19, 2025

_cleanup_container in DockerEnv to simply run cleanup_container one line is stupid, why not directly call cleanup_container instead of self._cleanup_container

Removed the unnecessary _cleanup_container wrapper method from the DockerEnv class and replaced all calls with direct calls to the shared cleanup_container function. This eliminates the redundant one-line wrapper as requested. Commit 758a2e0.

Copilot finished work on behalf of peteryang1 June 19, 2025 10:24
@peteryang1 peteryang1 merged commit 05cf094 into main Jun 19, 2025
9 checks passed
@peteryang1 peteryang1 deleted the copilot/fix-971 branch June 19, 2025 10:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Error loading sessions

Retrying..

Successfully merging this pull request may close these issues.

Docker Container is not removed after running experiments.
3 participants