Thanks for releasing the repo, as well as the trajectories for SWE-bench Lite! I am trying to reproduce the results with gpt-4o, but I am seeing a fix rate of 59/300 (19.67%), as opposed to the 27.33% reported.
- Other than the `--plausible` flag in rerank, are there any other possible causes for this?
- Did you notice a large amount of variance between runs?
- I changed the prompts slightly, adding a sentence before `# Examples` to clarify that we are giving output examples. Could this lead to large changes in resolution?
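For a rough sense of scale on the variance question, here is a minimal sketch that compares the two rates. It assumes each of the 300 instances is an independent pass/fail trial, which is a simplification and not how run-to-run variance of an LLM pipeline actually arises. The function names are illustrative and not part of the repo.

```python
import math


def fix_rate(resolved: int, total: int) -> float:
    """Fraction of instances resolved."""
    return resolved / total


def binomial_std_error(rate: float, total: int) -> float:
    """Standard error of a proportion under a simple binomial model (an assumption)."""
    return math.sqrt(rate * (1.0 - rate) / total)


TOTAL = 300                       # number of instances in the run
observed = fix_rate(59, TOTAL)    # rate from my reproduction
reported = 0.2733                 # rate reported for the original run

# Sampling noise expected around the reported rate if the binomial assumption held.
se = binomial_std_error(reported, TOTAL)
gap = reported - observed

print(f"observed rate: {observed:.2%}")
print(f"reported rate: {reported:.2%} (about {round(reported * TOTAL)} of {TOTAL})")
print(f"gap: {gap:.2%}, about {gap / se:.1f} standard errors")
```

Under this simplification the gap comes out to roughly three standard errors, so sampling noise alone seems unlikely to explain it, though the real variance between runs may well be larger than this model suggests.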