Hello! I have recently been attempting to reproduce the results shown in Table 2 of the LongBenchV2 paper.

After cloning your repo and performing the setup, I evaluated a few OpenAI models and `command-r-plus-08-2024` via their respective APIs. While testing, I noticed sizeable deltas between the scores reported in the paper and those produced by a live run of the evaluation pipeline. The table below summarizes the deltas.
**Consolidated Model Performance Comparison**

| Model | Score Type | Overall | Easy | Hard | Short | Medium | Long |
| ------- | ------------ | --------- | ------ | ------ | ------- | --------- | ------ |
| GPT-4o-mini-2024-07-18 | Paper | 29.3 | 31.1 | 28.2 | 31.8 | 28.6 | 26.2 |
| GPT-4o-mini-2024-07-18 | Local | 28.7 | 31.2 | 27.1 | 30.6 | 29.0 | 25.0 |
| GPT-4o-mini-2024-07-18 | Delta | -0.6 | +0.1 | -1.1 | -1.2 | +0.4 | -1.2 |
| c4ai-command-r-plus-08-2024 | Paper | 27.8 | 30.2 | 26.4 | 36.7 | 23.7 | 21.3 |
| c4ai-command-r-plus-08-2024 | Local | 27.7 | 27.1 | 28.1 | 33.9 | 23.4 | 25.9 |
| c4ai-command-r-plus-08-2024 | Delta | -0.1 | -3.1 | +1.7 | -2.8 | -0.3 | +4.6 |
| GPT-4o-2024-11-20 | Paper | 46.0 | 50.8 | 43.0 | 47.5 | 47.9 | 39.8 |
| GPT-4o-2024-11-20 | Local | 47.2 | 51.0 | 44.8 | 47.8 | 49.5 | 41.7 |
| GPT-4o-2024-11-20 | Delta | +1.2 | +0.2 | +1.8 | +0.3 | +1.6 | +1.9 |
| GPT-4o-2024-08-06 | Paper | 50.1 | 57.4 | 45.6 | 53.3 | 52.4 | 40.2 |
| GPT-4o-2024-08-06 | Local | 49.0 | 56.8 | 44.2 | 54.4 | 49.1 | 39.8 |
| GPT-4o-2024-08-06 | Delta | -1.1 | -0.6 | -1.4 | +1.1 | -3.3 | -0.4 |
Note: Delta values are calculated as (Local - Paper), so positive values indicate higher local scores, and negative values indicate higher paper scores.
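For clarity, here is a minimal Python sketch of how the Delta rows were derived; the dict layout is my own illustration and not part of the repo's pipeline.

```python
# Illustrative only: recompute the Delta row for GPT-4o-mini-2024-07-18
# from the Paper and Local rows above. The dict layout is my own, not the repo's.
splits = ["Overall", "Easy", "Hard", "Short", "Medium", "Long"]

paper = {"Overall": 29.3, "Easy": 31.1, "Hard": 28.2,
         "Short": 31.8, "Medium": 28.6, "Long": 26.2}
local = {"Overall": 28.7, "Easy": 31.2, "Hard": 27.1,
         "Short": 30.6, "Medium": 29.0, "Long": 25.0}

# Delta = Local - Paper, so a positive value means the local run scored higher.
delta = {s: round(local[s] - paper[s], 1) for s in splits}
print(delta)  # {'Overall': -0.6, 'Easy': 0.1, 'Hard': -1.1, ...}
```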
Also worth noting: the `c4ai-command-r-plus-08-2024` run was done by calling `command-r-plus-08-2024` on the Cohere API, not by running the model locally.

Are these deltas known to the LongBenchV2 team? I was unable to find any mention of this kind of reproducibility gap in the other issues or in the paper, so it may be isolated to my setup. I would appreciate any help with rectifying this, thank you!