Hello! I have recently been attempting to reproduce the results shown in Table 2 of the LongBenchV2 paper.

After cloning your repo and performing the setup, I evaluated a few OpenAI models and `command-r-plus-08-2024` via their respective APIs. While testing, I noticed sizeable deltas between the scores reported in the paper and those produced by a live run of the evaluation pipeline. The table below summarizes the deltas.
**Consolidated Model Performance Comparison**

| Model | Score Type | Overall | Easy | Hard | Short | Medium | Long |
| ------- | ------------ | --------- | ------ | ------ | ------- | --------- | ------ |
| GPT-4o-mini-2024-07-18 | Paper | 29.3 | 31.1 | 28.2 | 31.8 | 28.6 | 26.2 |
| GPT-4o-mini-2024-07-18 | Local | 28.7 | 31.2 | 27.1 | 30.6 | 29.0 | 25.0 |
| GPT-4o-mini-2024-07-18 | Delta | -0.6 | +0.1 | -1.1 | -1.2 | +0.4 | -1.2 |
| c4ai-command-r-plus-08-2024 | Paper | 27.8 | 30.2 | 26.4 | 36.7 | 23.7 | 21.3 |
| c4ai-command-r-plus-08-2024 | Local | 27.7 | 27.1 | 28.1 | 33.9 | 23.4 | 25.9 |
| c4ai-command-r-plus-08-2024 | Delta | -0.1 | -3.1 | +1.7 | -2.8 | -0.3 | +4.6 |
| GPT-4o-2024-11-20 | Paper | 46.0 | 50.8 | 43.0 | 47.5 | 47.9 | 39.8 |
| GPT-4o-2024-11-20 | Local | 47.2 | 51.0 | 44.8 | 47.8 | 49.5 | 41.7 |
| GPT-4o-2024-11-20 | Delta | +1.2 | +0.2 | +1.8 | +0.3 | +1.6 | +1.9 |
| GPT-4o-2024-08-06 | Paper | 50.1 | 57.4 | 45.6 | 53.3 | 52.4 | 40.2 |
| GPT-4o-2024-08-06 | Local | 49.0 | 56.8 | 44.2 | 54.4 | 49.1 | 39.8 |
| GPT-4o-2024-08-06 | Delta | -1.1 | -0.6 | -1.4 | +1.1 | -3.3 | -0.4 |
Note: Delta values are calculated as (Local - Paper), so positive values indicate higher local scores, and negative values indicate higher paper scores.
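For clarity, here is a minimal Python sketch of how the Delta rows were derived; the dict layout is my own illustration and not part of the repo's pipeline.

```python
# Illustrative only: recompute the Delta row for GPT-4o-mini-2024-07-18
# from the Paper and Local rows above. The dict layout is my own, not the repo's.
splits = ["Overall", "Easy", "Hard", "Short", "Medium", "Long"]

paper = {"Overall": 29.3, "Easy": 31.1, "Hard": 28.2,
         "Short": 31.8, "Medium": 28.6, "Long": 26.2}
local = {"Overall": 28.7, "Easy": 31.2, "Hard": 27.1,
         "Short": 30.6, "Medium": 29.0, "Long": 25.0}

# Delta = Local - Paper, so a positive value means the local run scored higher.
delta = {s: round(local[s] - paper[s], 1) for s in splits}
print(delta)  # {'Overall': -0.6, 'Easy': 0.1, 'Hard': -1.1, ...}
```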
Also worth noting: the `c4ai-command-r-plus-08-2024` run was done by calling `command-r-plus-08-2024` on the Cohere API, not by running the model locally.

Are these deltas known to the LongBenchV2 team? I was unable to find any mention of this kind of reproducibility gap in the other issues or in the paper, so it may be isolated to my setup. I would appreciate any help with rectifying this, thank you!