Conversation


@xingyaoww commented Nov 13, 2025

Summary

This PR improves the multiprocessing evaluation system with better logging, output capture, and cleanup handling.

Key Changes

1. Instance-Specific Logging

  • Replaced PID-based logs with instance-ID logs (instance_{id}.log instead of pid_{pid}.log)
  • Added a reset_logger_for_multiprocessing() function (sketched after this list) that:
    • Creates a separate log file per instance
    • Shows one informative console message per instance with a tail -f hint
    • Only logs WARNING+ to the console after the initial message
    • Logs INFO+ to the instance-specific file
  • Removed redundant _child_init() function
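
A minimal sketch of the approach described above; the real implementation lives in this PR, and the handler details here are assumptions:

```python
import logging
import os


def reset_logger_for_multiprocessing(log_dir: str, instance_id: str) -> None:
    """Route this worker's logging to a per-instance file (sketch only)."""
    os.makedirs(log_dir, exist_ok=True)
    log_file = os.path.join(log_dir, f"instance_{instance_id}.log")

    # Start from a clean slate in this worker process.
    root = logging.getLogger()
    for handler in root.handlers[:]:
        root.removeHandler(handler)
    root.setLevel(logging.INFO)

    # One informative console message per instance, then quiet down.
    print(
        f"Instance {instance_id} - starting evaluation.\n"
        f'Hint: run "tail -f {log_file}" to see live logs in a separate shell'
    )

    console = logging.StreamHandler()
    console.setLevel(logging.WARNING)  # WARNING+ only on the console from here on
    root.addHandler(console)

    file_handler = logging.FileHandler(log_file, encoding="utf-8")
    file_handler.setLevel(logging.INFO)  # INFO+ goes to the instance file
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    )
    root.addHandler(file_handler)
```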

2. Auto-Save Metadata

  • Added model_post_init() to the Evaluation class (see the sketch after this list)
  • Automatically saves metadata.json on initialization
  • Ensures metadata is available even if the evaluation is interrupted early
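
A sketch of the idea, assuming Evaluation is a Pydantic v2 model (the field names here are illustrative, not the actual schema):

```python
import json
from pathlib import Path
from typing import Any

from pydantic import BaseModel


class Evaluation(BaseModel):
    eval_output_dir: str  # illustrative field
    metadata: dict[str, Any] = {}  # illustrative field

    def model_post_init(self, __context: Any) -> None:
        # Save metadata.json as soon as the object is constructed, so it
        # exists even if the run is interrupted before any instance finishes.
        out_dir = Path(self.eval_output_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / "metadata.json").write_text(
            json.dumps(self.metadata, indent=2), encoding="utf-8"
        )
```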

3. Complete Output Capture

  • Added redirect_stdout_stderr() context manager
  • Captures ALL output to instance log files:
    • SDK visualizations and progress bars
    • Print statements from agent code
    • Any console output from subprocesses
  • Modified _process_one_mp() to wrap execution with redirection (see the sketch after this list)
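
Roughly how the worker entry point ties the two helpers together (the signature and run_instance are hypothetical; the .output.log name follows the review discussion below):

```python
import os


def _process_one_mp(instance, log_dir):  # hypothetical signature
    instance_id = instance.id
    reset_logger_for_multiprocessing(log_dir, instance_id)
    output_log = os.path.join(log_dir, f"instance_{instance_id}.output.log")
    # Everything the instance emits -- SDK visualizations, progress bars,
    # stray print() calls -- lands in the instance's output log.
    with redirect_stdout_stderr(output_log):
        return run_instance(instance)  # hypothetical per-instance runner
```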

4. Proper KeyboardInterrupt (Ctrl+C) Handling

  • Explicitly terminates worker processes on interrupt
  • Accesses ProcessPoolExecutor._processes to call terminate() on each worker
  • Prevents zombie processes from lingering
  • Clean shutdown on both KeyboardInterrupt and general exceptions

5. Comprehensive Test Suite

  • Added tests/test_keyboard_interrupt.py with two test cases:
    1. test_keyboard_interrupt_cleanup: Validates all workers terminate after SIGINT during normal operation
    2. test_keyboard_interrupt_immediate: Validates cleanup works when interrupt happens early
  • Tests use psutil to verify no orphaned processes remain (see the sketch after this list)
  • All tests passing ✅
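
In outline, such a test can use psutil.wait_procs() to assert that every worker spawned before the interrupt is gone (the entry point and timings below are assumptions, not the actual test):

```python
import signal
import subprocess
import time

import psutil


def test_keyboard_interrupt_cleanup():
    # Hypothetical entry point; the real tests drive the evaluation directly.
    proc = subprocess.Popen(["python", "-m", "benchmarks.run_eval"])
    time.sleep(5)  # give the pool time to spawn workers

    workers = psutil.Process(proc.pid).children(recursive=True)
    proc.send_signal(signal.SIGINT)  # simulate Ctrl+C
    proc.wait(timeout=30)

    # Every worker that existed before the interrupt must have exited.
    _, alive = psutil.wait_procs(workers, timeout=10)
    assert not alive, f"orphaned workers: {[p.pid for p in alive]}"
```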

Benefits

  • Better debugging: Each instance has its own complete log file with ALL output
  • Cleaner console: No interleaved output from multiple workers
  • No zombies: Ctrl+C properly terminates all child processes
  • Better UX: Clear hints on how to monitor individual instances
  • More reliable: Metadata always saved, even on early interruption

Testing

# Run the KeyboardInterrupt cleanup tests
uv run pytest tests/test_keyboard_interrupt.py -v

# Both tests pass:
# ✅ test_keyboard_interrupt_cleanup
# ✅ test_keyboard_interrupt_immediate

Example Output

When running evaluation, each instance now shows:

Instance test_instance_1 - 2025-01-13 20:00:00 - INFO - Starting evaluation for instance test_instance_1.
Hint: run "tail -f /path/to/logs/instance_test_instance_1.log" to see live logs in a separate shell

The instance log file contains all output including SDK visualizations:

$ tail -f logs/instance_test_instance_1.log
2025-01-13 20:00:01 - INFO - Running with runtime: modal-0.1.0
2025-01-13 20:00:02 - INFO - Running action: MessageAction(content='Let me check the repository structure', ...)
[Progress bars, visualizations, and all output captured here]

openhands-agent and others added 30 commits October 27, 2025 21:39
…hing

- Create workflow that can be manually triggered via workflow_dispatch
- Integrate Blacksmith caching for faster Docker builds
- Configure workflow to push images to ghcr.io/openhands/eval-agent-server
- Make --critic parameter optional in build_images.py for build-only usage
- Fix .gitignore patterns for eval_outputs and builds directories

This workflow follows Blacksmith documentation for Docker builds and allows
building SWE-Bench evaluation images with configurable parameters like dataset,
split, target, platforms, and concurrent workers.

Closes #37
…caching

Following the pattern from OpenHands/software-agent-sdk#990
and Blacksmith's official documentation (https://docs.blacksmith.sh/blacksmith-caching/docker-builds),
this change replaces the standard docker/setup-buildx-action with useblacksmith/setup-docker-builder@v1.

Key improvements:
- Replaces docker/setup-buildx-action@v3 with useblacksmith/setup-docker-builder@v1
- Removes manual cache configuration (useblacksmith/cache@v6)
- Blacksmith's Docker builder automatically manages Docker layer caching via NVMe-backed sticky disks
- Provides 2x to 40x build-time improvements, according to Blacksmith's customer reports
- Since we only build amd64 images, we don't need the complex multi-platform matrix strategy

This approach is recommended for workflows that use Docker commands directly
(as opposed to using docker/build-push-action).

Co-authored-by: openhands <openhands@all-hands.dev>
The GitHub Actions workflow was failing because uv was trying to build
pyarrow from source, which requires the Arrow C++ library and CMake.
This change adds the --no-build-package pyarrow flag to force uv to use
the pre-built binary wheel instead of attempting to build from source.

Co-authored-by: openhands <openhands@all-hands.dev>
The root cause of the build failure was that uv was installing Python 3.14.0,
which doesn't have binary wheels for pyarrow 21.0.0 yet. This caused uv to
attempt building from source, which failed due to missing Arrow C++ libraries.

Solution: Added .python-version file to pin Python to 3.12, which matches
the project's target-version in pyproject.toml and has full binary wheel
support for all dependencies.

Co-authored-by: openhands <openhands@all-hands.dev>
Use github.run_id instead of dataset/split names which contain slashes
that are invalid in artifact names. Also added if-no-files-found: warn
to provide better feedback if logs are missing.

Co-authored-by: openhands <openhands@all-hands.dev>
…cters

GitHub Actions artifact upload doesn't allow colons in filenames, but our
log paths contain colons from Docker image tags (e.g., 'django-11999:latest').
Archive the entire builds directory into a tar.gz before upload to work
around this restriction.

Co-authored-by: openhands <openhands@all-hands.dev>
Docker image tags have a maximum length of 128 characters. When building
SWE-Bench images with long base image names (e.g., scikit-learn), the
generated cache tags exceed this limit and cause build failures with:
'ERROR: failed to configure registry cache exporter: invalid reference format'

Solution: Apply a patch to vendor/software-agent-sdk that hashes the
base_image_slug when it would cause the final tag to exceed 128 characters.
Uses SHA256 hash (first 12 chars) to create a shorter unique identifier
while maintaining cache efficiency.
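
In rough outline (the function name and threshold handling here are illustrative, not the actual patch):

```python
import hashlib

MAX_TAG_LENGTH = 128  # Docker's limit on image tag length


def safe_slug(base_image_slug: str, suffix: str) -> str:
    """Hash the slug when the final tag would exceed Docker's limit (sketch)."""
    if len(base_image_slug) + len(suffix) <= MAX_TAG_LENGTH:
        return base_image_slug
    # First 12 hex chars of SHA256: short but still effectively unique.
    return hashlib.sha256(base_image_slug.encode()).hexdigest()[:12]
```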

The patch is applied during the workflow setup before installing dependencies.

Co-authored-by: openhands <openhands@all-hands.dev>
Updated the patch to match the formatting requirements from ruff and
other pre-commit checks. This ensures the patch applies cleanly and
passes all linting/formatting checks.

Co-authored-by: openhands <openhands@all-hands.dev>
The build workflow was experiencing log file corruption and I/O errors due to
concurrent builds writing to the wrong log files. This was caused by using
ThreadPoolExecutor with contextlib.redirect_stdout/stderr, which only provides
thread-local redirection of Python-level writes.

The SDK's build() function spawns subprocesses and uses logger.info()/warning()
to output build logs. Logger handlers write to process-wide file descriptors,
not thread-local redirected streams, causing output from concurrent threads to:
- Write to the wrong log files
- Attempt writing to closed file handles
- Result in ValueError('I/O operation on closed file.')

Solution: Replace ThreadPoolExecutor with ProcessPoolExecutor to provide
complete process-level isolation with separate stdout/stderr/logging per
build. The additional overhead is negligible compared to Docker build time.
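
A minimal demonstration of the failure mode: contextlib.redirect_stdout() only rebinds the Python-level sys.stdout object, so anything writing to the underlying file descriptor (such as a subprocess) escapes it:

```python
import io
import subprocess
from contextlib import redirect_stdout

buf = io.StringIO()
with redirect_stdout(buf):
    print("python-level write")                 # captured: goes through sys.stdout
    subprocess.run(["echo", "fd-level write"])  # not captured: child inherits fd 1

assert "python-level write" in buf.getvalue()
assert "fd-level write" not in buf.getvalue()
```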

Changes:
- Import ProcessPoolExecutor instead of ThreadPoolExecutor
- Move build_one_fn to module level (_build_with_logging) for pickle support
- Update executor initialization to use ProcessPoolExecutor
- Add explanatory comments about isolation requirements

Co-authored-by: openhands <openhands@all-hands.dev>
This commit improves the tagging system for SWE-Bench Docker images to enable
better reproducibility and clarity.

## Changes

### 1. Benchmarks Build System

**benchmarks/swe_bench/build_images.py:**
- Added `get_sdk_commit_hash()`: Extracts 7-char SDK submodule commit hash
- Added `extract_instance_id()`: Parses SWE-Bench base images to extract instance IDs
- Modified `main()`: Sets SDK_VERSION_OVERRIDE env var with SDK commit hash
- Modified `build_one()`:
  - Generates custom tags: `swebench-{instance_id}`
  - Disables versioned tags via `include_versioned_tag=False`

### 2. SDK Submodule Update

**vendor/software-agent-sdk:**
Updated to commit 77d50e61 which includes:
- `SDK_VERSION_OVERRIDE` environment variable support
- `include_versioned_tag` option in BuildOptions
- Target-based tag suffixes (replaces `-dev` suffix)
- See: OpenHands/software-agent-sdk#1088

### 3. Documentation

**TAGGING_CHANGES.md:**
Comprehensive documentation explaining:
- Why these changes are needed (submodule git context issues)
- Tag format comparison (before/after)
- Benefits (reproducibility, usability, maintainability)
- Implementation details and examples

## Tag Format

### Before
```
v1.0.0_docker.io_s_swebench_s_sweb.eval.x86_64.django_1776_django-12155_tag_latest_source-minimal-dev
```
- 137 characters
- Package version (non-reproducible)
- Unclear `-dev` suffix

### After
```
a612c0a-swebench-django-12155-source-minimal
main-swebench-django-12155-source-minimal
```
- 84 characters (39% shorter)
- Exact commit hash (reproducible)
- Clear target indication

## Benefits

1. **Reproducibility**: Git commit hash ensures exact SDK version tracking
2. **Clarity**: Instance ID and target clearly visible in tag
3. **Consistency**: All builds use same suffix pattern
4. **Backward Compatible**: SDK changes only apply when explicitly enabled

## Related

- SDK PR: OpenHands/software-agent-sdk#1088
- Issue: Improve SWE-Bench image build workflow

Co-authored-by: openhands <openhands@all-hands.dev>
Updated SDK submodule to bc25aa0d which omits the target suffix for binary
builds since it's the default/common case. This keeps tags cleaner.

Tag examples:
- Binary: a612c0a-swebench-django-12155 (no suffix)
- Source: a612c0a-swebench-django-12155-source
- Source-minimal: a612c0a-swebench-django-12155-source-minimal

Updated TAGGING_CHANGES.md to reflect this behavior with updated examples
showing both binary and source-minimal formats.

Co-authored-by: openhands <openhands@all-hands.dev>
Updates SDK submodule to 27f37dc0 which fixes an issue where SHORT_SHA
was using git info from the benchmarks repo instead of the SDK repo.

Now tags correctly use the SDK commit hash when SDK_VERSION_OVERRIDE
is set, ensuring proper versioning in vendored/submodule contexts.

Co-authored-by: openhands <openhands@all-hands.dev>
SDK now automatically detects its own commit hash, so we don't need
to manually extract and override it. This simplifies the build script
significantly:

- Removed get_sdk_commit_hash() function
- Removed SDK_VERSION_OVERRIDE env var setting
- Removed unused imports (subprocess, os)
- Updated documentation to reflect simpler approach

The SDK's _sdk_version() now automatically finds the SDK repo root
and gets the commit hash directly, regardless of whether it's used
as a submodule or vendored dependency.

Co-authored-by: openhands <openhands@all-hands.dev>
Update SDK submodule to include automatic SDK_VERSION detection.
SDK now auto-detects its own commit hash without requiring external
override, making the tagging system fully automatic.

Co-authored-by: openhands <openhands@all-hands.dev>
xingyaoww and others added 15 commits November 11, 2025 23:04
- Replace PID-based worker logs with instance-specific logs
- Add reset_logger_for_multiprocessing() for cleaner per-instance logging
  - Logs saved to <eval_output_dir>/logs/instance_{instance_id}.log
  - Single INFO message to console with tail hint, then WARNING+ only
- Remove redundant _child_init() function
- Auto-save metadata.json to eval_output_dir on initialization

Co-authored-by: openhands <openhands@all-hands.dev>
@xingyaoww marked this pull request as draft November 13, 2025 19:54
xingyaoww and others added 3 commits November 13, 2025 19:54
- Add redirect_stdout_stderr() context manager to capture all output
- Redirect stdout/stderr in _process_one_mp to instance log files
- Captures SDK visualizations, print statements, and all console output
- All output now goes to instance_{instance_id}.log files

Co-authored-by: openhands <openhands@all-hands.dev>
- Explicitly terminate worker processes on KeyboardInterrupt
- Access ProcessPoolExecutor._processes to call terminate() on each worker
- Add comprehensive test suite to validate cleanup behavior
- Test both normal interrupt and immediate interrupt scenarios

Tests verify:
- All worker processes are terminated after Ctrl+C
- No zombie processes remain
- Cleanup works even with early interrupts

Co-authored-by: openhands <openhands@all-hands.dev>
@xingyaoww changed the title Misc logging improvements when running sessions in parallel Improve multiprocessing: instance-specific logging, stdout/stderr capture, and proper Ctrl+C cleanup Nov 13, 2025
xingyaoww and others added 2 commits November 13, 2025 20:08
Extract repetitive pool cleanup logic into _cleanup_pool() helper method.

Before:
- 54 lines of try-except-except-else-finally with duplicated cleanup code
- Inline repetition of cancel/terminate/shutdown logic in each handler

After:
- 32 lines with clean try-except-except structure
- Single _cleanup_pool() method handles all cleanup logic
- More maintainable and easier to read

Tests still pass ✅

Co-authored-by: openhands <openhands@all-hands.dev>
@xingyaoww marked this pull request as ready for review November 14, 2025 15:01
pass

# Shutdown the pool
pool.shutdown(wait=wait, cancel_futures=True)

@OpenHands /codereview-roasted help me review this PR


I'm on it! xingyaoww can track my progress at all-hands.dev


🔴 Code Review: PR #93 - Multiprocessing Improvements

Taste Rating: 🟡 Acceptable - Solves real problems but with some compromises on elegance


[CRITICAL ISSUES] (Must address or document)

[benchmarks/utils/evaluation.py, Line 341-342] Breaking Encapsulation

if not wait and hasattr(pool, "_processes") and pool._processes:
    for process in pool._processes.values():

Problem: You're reaching into ProcessPoolExecutor._processes, a private attribute. This is fragile and could break in any Python version update.

Why this matters: The underscore prefix means "this is an implementation detail, don't touch it." You're making your code dependent on CPython internals. Tomorrow Python 3.13 could change this and your code breaks.

The fundamental issue: You're using the wrong abstraction. If you need to forcefully terminate worker processes, ProcessPoolExecutor isn't giving you what you need through its public API.

Options:

  1. Document the risk: Add a big fat comment explaining this is intentionally breaking encapsulation because ProcessPoolExecutor doesn't expose process termination, and include the Python version you tested against
  2. Use multiprocessing.Pool instead: it gives you explicit control over worker processes with .terminate() as a public method (see the sketch after this list)
  3. Accept that cleanup might not be immediate: Use pool.shutdown(wait=False, cancel_futures=True) and let the OS clean up eventually
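
For reference, option 2 would look something like this (process_one, instances, and num_workers are assumed from the surrounding code):

```python
from multiprocessing import Pool

# Pool.__exit__ calls terminate(), so an interrupt inside the block kills
# workers via the public API -- no private attributes involved.
with Pool(processes=num_workers) as pool:
    results = pool.map_async(process_one, instances)
    results.get()  # a Ctrl+C propagates from here as KeyboardInterrupt
```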

Recommendation: At minimum, add defensive code and a comment:

# HACK: ProcessPoolExecutor doesn't expose worker termination in its public API.
# We need to forcefully terminate workers on Ctrl+C to prevent zombie processes.
# This accesses private _processes attribute. Tested on Python 3.9-3.11.
if not wait:
    try:
        # Attempt to access private _processes (may break in future Python versions)
        if hasattr(pool, "_processes") and pool._processes:
            for process in pool._processes.values():
                try:
                    if process.is_alive():
                        process.terminate()
                except (AttributeError, OSError):
                    pass  # Process may have already exited
    except Exception as e:
        logger.warning(f"Could not terminate workers directly: {e}")

[benchmarks/utils/evaluation.py, Lines 430-480] Logger State Mutation

def reset_logger_for_multiprocessing(log_dir: str, instance_id: str) -> None:
    # Get root logger and remove all existing handlers
    root_logger = logging.getLogger()
    for handler in root_logger.handlers[:]:
        root_logger.removeHandler(handler)

Problem: You're mutating global state (the root logger) from worker processes. This is dangerous - if anything else in the process tries to log, their handlers are gone.

Better approach: Create a new logger instance instead of hijacking the root logger:

def get_instance_logger(log_dir: str, instance_id: str) -> logging.Logger:
    """Get a logger configured for this specific instance."""
    logger = logging.getLogger(f"openhands.instance.{instance_id}")
    logger.handlers.clear()
    logger.propagate = False  # Don't propagate to root logger
    # ... rest of setup
    return logger

Why this matters: Mucking with the root logger is like mucking with global variables. It creates hidden dependencies and makes debugging harder.


[IMPROVEMENT OPPORTUNITIES] (Should fix - violates good taste)

[benchmarks/utils/evaluation.py, Lines 362-368] Redundant Log Files

log_file = os.path.join(log_dir, f"instance_{instance_id}.output.log")
# Redirect stdout/stderr to capture all output (SDK visualizations, etc.)
with redirect_stdout_stderr(log_file):

But then in reset_logger_for_multiprocessing:

log_file = os.path.join(log_dir, f"instance_{instance_id}.log")

Problem: You're creating TWO log files per instance:

  • instance_{id}.log - for logger output
  • instance_{id}.output.log - for stdout/stderr

Better: Use ONE log file for both. Redirect stdout/stderr to the same file handler the logger uses. This eliminates the special case and makes debugging simpler.
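
That could look like the following sketch (the path is illustrative, and handler setup is abbreviated):

```python
import logging
import sys

log_path = "logs/instance_example.log"  # illustrative path
log_stream = open(log_path, "a", buffering=1, encoding="utf-8")

# Logger output and raw stdout/stderr all flow into the same stream,
# so one file per instance holds everything.
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
logging.getLogger().addHandler(handler)

sys.stdout = log_stream
sys.stderr = log_stream
```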


[benchmarks/utils/evaluation.py, Lines 263-296] Context Manager Pattern Violation

pool = ProcessPoolExecutor(max_workers=self.num_workers)
futures = []
try:
    # ... work ...
    pool.shutdown(wait=True)
except KeyboardInterrupt:
    self._cleanup_pool(pool, futures, wait=False)
    raise
except Exception:
    self._cleanup_pool(pool, futures, wait=False)
    raise

Problem: You abandoned the with statement for manual try/finally. This adds a special case (the normal completion path needs explicit pool.shutdown(wait=True)).

Good taste solution: Eliminate the special case:

pool = ProcessPoolExecutor(max_workers=self.num_workers)
try:
    futures = [pool.submit(...) for inst in instances_to_process]
    # ... process results ...
except (KeyboardInterrupt, Exception):
    self._cleanup_pool(pool, futures, wait=False)
    raise
finally:
    if not hasattr(self, '_cleanup_called'):
        pool.shutdown(wait=True)

Or better yet, create a custom context manager that handles cleanup properly:

from concurrent.futures import ProcessPoolExecutor
from contextlib import contextmanager


@contextmanager
def interruptible_pool_executor(num_workers):
    pool = ProcessPoolExecutor(max_workers=num_workers)
    futures = []
    try:
        yield pool, futures
        pool.shutdown(wait=True)
    except (KeyboardInterrupt, Exception):
        cleanup_pool(pool, futures, wait=False)  # i.e., the _cleanup_pool logic
        raise

[benchmarks/utils/evaluation.py, Lines 483-515] File Handle Leaks Possible

try:
    log_file = open(log_file_path, "a", buffering=1, encoding="utf-8")
    sys.stdout = log_file
    sys.stderr = log_file
    yield
finally:
    sys.stdout = original_stdout
    sys.stderr = original_stderr
    if log_file is not None and not log_file.closed:
        log_file.close()

Problem: If something still holds a reference to sys.stdout (like a thread or a cached formatter), closing the file can cause issues. Also, if open() itself raises, the finally block restores stdout/stderr even though they were never replaced, and references log_file before it has been assigned.

Better pattern:

original_stdout, original_stderr = sys.stdout, sys.stderr
log_file = None
try:
    log_file = open(log_file_path, "a", buffering=1, encoding="utf-8")
    sys.stdout = log_file
    sys.stderr = log_file
    try:
        yield
    finally:
        sys.stdout = original_stdout
        sys.stderr = original_stderr
finally:
    if log_file is not None:
        try:
            log_file.close()
        except Exception:
            pass  # Already closing, don't mask original exception

[STYLE NOTES] (Minor)

[benchmarks/swe_bench/eval_infer.py, Line 215] Inconsistent Naming

Changed default from "OpenHands" to "openhands". This is fine but make sure all downstream consumers expect lowercase. Case inconsistencies cause bugs.


VERDICT: ✅ Worth merging with documentation

Core logic is sound - You're solving real problems (zombie processes, output interleaving, debugging difficulty). The implementation works and has tests.

BUT: You're building on shaky foundation (accessing private attributes). At minimum, document the risks and add defensive code.


KEY INSIGHT:

You're fighting ProcessPoolExecutor's abstraction instead of working with it. When you need to hasattr(pool, "_processes"), that's the abstraction telling you "I don't want to give you this control."

The pragmatic solution is what you did (reach in and grab _processes), but the "good taste" solution would be to either:

  1. Use multiprocessing.Pool which exposes .terminate() publicly
  2. Wrap worker functions to handle signals properly and allow graceful shutdown
  3. Accept that immediate termination isn't available and document the limitation

What Linus would say: "If you're reaching into private attributes, either the API is wrong or you're using the wrong API. In this case, it's both - ProcessPoolExecutor doesn't expose what you need, and you should probably use multiprocessing.Pool instead."

But he'd also say: "Ship it with a big comment explaining why you broke the rules. Real bugs trump theoretical purity."


@xingyaoww force-pushed the xw/misc-improvements branch 2 times, most recently from 376cd94 to 8bf1038 on November 14, 2025 19:24
@xingyaoww enabled auto-merge (squash) November 14, 2025 19:24
@xingyaoww merged commit 51efb23 into main Nov 14, 2025
4 checks passed