Issue Description
There's a significant discrepancy between the initial evaluation results and local re-evaluation results for the same SWE-Bench run. The same execution produces different resolved instance counts depending on how the patch evaluation is performed.
Evaluation Details
Evaluation Name: sdk-main-19866551464-claude-son
Model: litellm_proxy/claude-sonnet-4-5-20250929
Dataset: princeton-nlp/SWE-bench_Verified (test)
Commit: 40edd910333a97a7a32977eafbd4570ba8bfd690
Timestamp: 2025-12-02 18:31:40 UTC
Reference: OpenHands/software-agent-sdk#1288 (comment)
Results Comparison
Initial Evaluation Results (from evaluation run)
- Total instances: 500
- Submitted instances: 50
- Resolved instances: 30
- Unresolved instances: 18
- Empty patch instances: 0
- Error instances: 2
- Success rate: 30/50 (60.0%)
Re-evaluation Results (local patch evaluation)
- Total instances: 500
- Submitted instances: 50
- Completed instances: 50
- Incomplete instances: 450
- Resolved instances: 39/50 (78.0%)
- Unresolved instances: 11
- Empty patch instances: 0
- Error instances: 0
Reference: OpenHands/software-agent-sdk#1288 (comment)
Observations
- No empty patches but 2 errors in initial run: The initial results show 0 empty patch instances but 2 error instances, which is unusual now that patches are collected inside the evaluate_instance method; if the agent run finished, it should not end with an exception without producing a patch. The re-evaluation shows 0 errors.
- Significant discrepancy: The 9-instance difference (30 vs 39 resolved) is a 30% relative improvement and is substantial, suggesting a systematic issue rather than random variance (a report-diff sketch for pinning down the flipped instances follows this list).
- Unresolved count also differs: The initial run shows 18 unresolved instances, while the re-evaluation shows only 11. Together with the 2 error instances in the initial run, this accounts for the full gap (18 + 2 = 20 non-resolved initially vs 11 non-resolved on re-evaluation).
- Possible causes:
  - Modal infrastructure issues (as suggested in the conversation: "maybe it is modal's fault")
  - Differences in the patch evaluation harness between the initial run and the local re-evaluation
  - Timing or environment differences during evaluation
  - Issues with how patches are collected/evaluated in the remote environment
  - Test execution environment differences (network, filesystem, timing)
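To pin down which instances flipped, one quick check is to diff the resolved-instance sets of the two runs' report files. Below is a minimal sketch, assuming both runs produced a SWE-bench-style report JSON with a resolved_ids list; the key name and file layout are assumptions and may need adjusting to the SDK's actual output format.

```python
"""Diff the resolved-instance sets of two evaluation reports (sketch)."""
import json
import sys


def resolved_ids(report_path: str) -> set[str]:
    """Load a report JSON and return its set of resolved instance ids."""
    with open(report_path) as f:
        report = json.load(f)
    # "resolved_ids" follows the SWE-bench harness report convention;
    # adjust if the SDK writes a different key.
    return set(report.get("resolved_ids", []))


def main() -> None:
    initial_report, reeval_report = sys.argv[1], sys.argv[2]
    a, b = resolved_ids(initial_report), resolved_ids(reeval_report)
    print("resolved only in initial run  :", sorted(a - b))
    print("resolved only in re-evaluation:", sorted(b - a))
    print("resolved in both runs         :", len(a & b))


if __name__ == "__main__":
    main()
```

Whether the flipped instances cluster around the 2 error instances or are spread across the run should already narrow down the likely cause.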
Expected Results
The evaluation should be deterministic and reproducible. Running patch evaluation on the same execution data should produce the same results regardless of where or when it's performed.
Context
The re-evaluated result (39/50) is in line with, and slightly above, the previous baseline of 35/50 from OpenHands/software-agent-sdk#419 (comment), which is the expected performance level; the initial 30/50 result falls below it.
Reproducibility
The evaluation was performed with:
- The first 50 instances (deterministic sort and select; see the selection sketch below)
- The same model and dataset
- The same commit and configuration
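A minimal sketch of the selection step, assuming the Hugging Face datasets library and sorting by instance_id (the exact sort key used by the eval script is an assumption here):

```python
"""Deterministically select the first 50 SWE-bench_Verified instances (sketch)."""
from datasets import load_dataset

dataset = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

# Sort by a stable key so repeated runs always pick the same 50 instances.
instance_ids = sorted(row["instance_id"] for row in dataset)[:50]

print(f"selected {len(instance_ids)} instances, e.g. {instance_ids[:3]}")
```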
Execution logs for both runs will be uploaded to this issue for further debugging.
30-out-50-resolved-swebench-logs.tar.gz
39-out-50-resolved-swebench-logs.tar.gz
Next Steps
- Upload SWE-Bench execution logs for both runs (initial 30/50 and re-evaluated 39/50)
- Compare patch files between initial evaluation and re-evaluation (see the comparison sketch after this list)
- Investigate differences in evaluation harness configuration between Modal and local environments
- Determine if this is a Modal-specific issue or a broader infrastructure problem
- Verify patch collection mechanism in remote environments
- Check if test execution environment (network, filesystem, timing) affects results
- Establish best practices for ensuring reproducible evaluations
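For the patch-file comparison step, here is a minimal sketch that checks whether the patches fed to both evaluations are byte-identical, assuming each run wrote a standard SWE-bench predictions JSONL with instance_id and model_patch fields (the field names are assumptions; adjust to the SDK's output files). If every patch matches, the discrepancy must come from the harness or environment rather than from patch collection.

```python
"""Check whether two prediction files contain identical patches (sketch)."""
import json
import sys


def load_patches(path: str) -> dict[str, str]:
    """Map instance_id -> model_patch for one predictions JSONL file."""
    patches = {}
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            patches[row["instance_id"]] = row.get("model_patch", "")
    return patches


def main() -> None:
    initial_preds, reeval_preds = sys.argv[1], sys.argv[2]
    a, b = load_patches(initial_preds), load_patches(reeval_preds)
    differing = sorted(iid for iid in a.keys() & b.keys() if a[iid] != b[iid])
    print("only in initial run   :", sorted(a.keys() - b.keys()))
    print("only in re-evaluation :", sorted(b.keys() - a.keys()))
    print("patches that differ   :", differing)


if __name__ == "__main__":
    main()
```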