Issue Description
There's a significant discrepancy between the initial evaluation results and local re-evaluation results for the same SWE-Bench run. The same execution produces different resolved instance counts depending on how the patch evaluation is performed.
Evaluation Details
Evaluation Name: sdk-main-19866551464-claude-son
Model: litellm_proxy/claude-sonnet-4-5-20250929
Dataset: princeton-nlp/SWE-bench_Verified (test)
Commit: 40edd910333a97a7a32977eafbd4570ba8bfd690
Timestamp: 2025-12-02 18:31:40 UTC
Reference: OpenHands/software-agent-sdk#1288 (comment)
Results Comparison
Initial Evaluation Results (from evaluation run)
- Total instances: 500
- Submitted instances: 50
- Resolved instances: 30
- Unresolved instances: 18
- Empty patch instances: 0
- Error instances: 2
- Success rate: 30/50 (60.0%)
Re-evaluation Results (local patch evaluation)
- Total instances: 500
- Submitted instances: 50
- Completed instances: 50
- Incomplete instances: 450
- Resolved instances: 39/50 (78.0%)
- Unresolved instances: 11
- Empty patch instances: 0
- Error instances: 0
Reference: OpenHands/software-agent-sdk#1288 (comment)
Observations
- No empty patches but 2 errors in initial run: The initial results show 0 empty patch instances but 2 error instances, which is unusual now that patches are collected inside the evaluate_instance method; if the agent run finished, it should not end with an exception without producing a patch. The re-evaluation shows 0 errors.
- Significant discrepancy: The 9-instance difference (30 vs 39 resolved) is a 30% relative improvement and is substantial, suggesting a systematic issue rather than random variance (a report-diff sketch for pinning down the flipped instances follows this list).
- Unresolved count also differs: The initial run shows 18 unresolved instances, while the re-evaluation shows only 11. Together with the 2 error instances in the initial run, this accounts for the full gap (18 + 2 = 20 non-resolved initially vs 11 non-resolved on re-evaluation).
- Possible causes:
  - Modal infrastructure issues (as suggested in the conversation: "maybe it is modal's fault")
  - Differences in the patch evaluation harness between the initial run and the local re-evaluation
  - Timing or environment differences during evaluation
  - Issues with how patches are collected/evaluated in the remote environment
  - Test execution environment differences (network, filesystem, timing)
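To pin down which instances flipped, one quick check is to diff the resolved-instance sets of the two runs' report files. Below is a minimal sketch, assuming both runs produced a SWE-bench-style report JSON with a resolved_ids list; the key name and file layout are assumptions and may need adjusting to the SDK's actual output format.

```python
"""Diff the resolved-instance sets of two evaluation reports (sketch)."""
import json
import sys


def resolved_ids(report_path: str) -> set[str]:
    """Load a report JSON and return its set of resolved instance ids."""
    with open(report_path) as f:
        report = json.load(f)
    # "resolved_ids" follows the SWE-bench harness report convention;
    # adjust if the SDK writes a different key.
    return set(report.get("resolved_ids", []))


def main() -> None:
    initial_report, reeval_report = sys.argv[1], sys.argv[2]
    a, b = resolved_ids(initial_report), resolved_ids(reeval_report)
    print("resolved only in initial run  :", sorted(a - b))
    print("resolved only in re-evaluation:", sorted(b - a))
    print("resolved in both runs         :", len(a & b))


if __name__ == "__main__":
    main()
```

Whether the flipped instances cluster around the 2 error instances or are spread across the run should already narrow down the likely cause.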
Expected Results
The evaluation should be deterministic and reproducible. Running patch evaluation on the same execution data should produce the same results regardless of where or when it's performed.
Context
The re-evaluated result (39/50) is in line with, and slightly above, the previous baseline of 35/50 from OpenHands/software-agent-sdk#419 (comment), which is the expected performance level; the initial 30/50 result falls below it.
Reproducibility
The evaluation was performed with:
- The first 50 instances (deterministic sort and select; see the selection sketch below)
- The same model and dataset
- The same commit and configuration
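A minimal sketch of the selection step, assuming the Hugging Face datasets library and sorting by instance_id (the exact sort key used by the eval script is an assumption here):

```python
"""Deterministically select the first 50 SWE-bench_Verified instances (sketch)."""
from datasets import load_dataset

dataset = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

# Sort by a stable key so repeated runs always pick the same 50 instances.
instance_ids = sorted(row["instance_id"] for row in dataset)[:50]

print(f"selected {len(instance_ids)} instances, e.g. {instance_ids[:3]}")
```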
Execution logs for both runs will be uploaded to this issue for further debugging.
30-out-50-resolved-swebench-logs.tar.gz
39-out-50-resolved-swebench-logs.tar.gz
Next Steps
- Upload SWE-Bench execution logs for both runs (initial 30/50 and re-evaluated 39/50)
- Compare patch files between initial evaluation and re-evaluation (see the comparison sketch after this list)
- Investigate differences in evaluation harness configuration between Modal and local environments
- Determine if this is a Modal-specific issue or a broader infrastructure problem
- Verify patch collection mechanism in remote environments
- Check if test execution environment (network, filesystem, timing) affects results
- Establish best practices for ensuring reproducible evaluations
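For the patch-file comparison step, here is a minimal sketch that checks whether the patches fed to both evaluations are byte-identical, assuming each run wrote a standard SWE-bench predictions JSONL with instance_id and model_patch fields (the field names are assumptions; adjust to the SDK's output files). If every patch matches, the discrepancy must come from the harness or environment rather than from patch collection.

```python
"""Check whether two prediction files contain identical patches (sketch)."""
import json
import sys


def load_patches(path: str) -> dict[str, str]:
    """Map instance_id -> model_patch for one predictions JSONL file."""
    patches = {}
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            patches[row["instance_id"]] = row.get("model_patch", "")
    return patches


def main() -> None:
    initial_preds, reeval_preds = sys.argv[1], sys.argv[2]
    a, b = load_patches(initial_preds), load_patches(reeval_preds)
    differing = sorted(iid for iid in a.keys() & b.keys() if a[iid] != b[iid])
    print("only in initial run   :", sorted(a.keys() - b.keys()))
    print("only in re-evaluation :", sorted(b.keys() - a.keys()))
    print("patches that differ   :", differing)


if __name__ == "__main__":
    main()
```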