Reproducing swebench-lite results #25

Thanks for releasing the repo, as well as the trajectories for swebench-lite! I am trying to reproduce the results with gpt-4o, but I am seeing a fix rate of 59/300 (about 19.67%), as opposed to the reported 27.33%.

1. Other than the `--plausible` flag in rerank, is there anything else that could cause this gap?
2. Did you notice a large amount of variance between runs?
3. I changed the prompts slightly, adding a sentence before `# Examples` to clarify that what follows are output examples. Could this lead to a large change in the resolve rate?
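
For context on question 2, here is a minimal sketch of the arithmetic behind the gap. It only uses the two numbers above (59/300 and 27.33%); the helper names are hypothetical, and the Wilson interval assumes the 300 instances behave like independent trials, which is a simplification and not something the repo claims.

```python
import math

TOTAL = 300  # swebench-lite instances, per the 59/300 figure above


def resolve_rate(resolved: int, total: int) -> float:
    """Fraction of instances resolved."""
    return resolved / total


def wilson_interval(resolved: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion.

    Assumption: each instance is an independent trial with the same
    success probability. This is only a rough model of run-to-run noise.
    """
    p = resolved / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


observed = resolve_rate(59, TOTAL)
# 27.33% of 300 instances corresponds to this many resolved instances.
reported_count = round(0.2733 * TOTAL)
gap = reported_count - 59
low, high = wilson_interval(59, TOTAL)

print(f"observed rate: {observed:.2%}")
print(f"reported count implied by 27.33%: {reported_count}")
print(f"gap in resolved instances: {gap}")
print(f"rough 95% interval for observed run: {low:.2%} to {high:.2%}")
```

This is only a back-of-the-envelope check on how far apart the two numbers are; it says nothing about which part of the pipeline is responsible.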
