Replies: 1 comment
-
Thank you for the links. Right now, we're evaluating SWE-multimodal (which also has a closed test set). After that, I believe we can try out https://github.com/livebench/liveswebench.
-
Since people are now accusing multiple AI tools of "test-hacking", it might be good to start testing on LiveSWEBench: https://github.com/livebench/liveswebench https://www.kprize.ai/
The same logic applies to LiveBench/LiveCodeBench vs. BigCodeBench/EvalPlus: https://livebench.ai/ https://livecodebench.github.io/leaderboard.html