Reproducibility Issues #111

@Hisham-Cohere

Hello! I have recently been attempting to reproduce the results shown in Table 2 of the LongBenchV2 paper.

After cloning your repo and completing the setup, I evaluated a few OpenAI models and command-r-plus-08-2024 via their respective APIs. While testing, I noticed very large deltas between the scores reported in the paper and those obtained by running the evaluation pipeline live. The table below shows the deltas.

Consolidated Model Performance Comparison

| Model | Score Type | Overall | Easy | Hard | Short | Medium | Long |
| ------- | ------------ | --------- | ------ | ------ | ------- | --------- | ------ |
| GPT-4o-mini-2024-07-18 | Paper | 29.3 | 31.1 | 28.2 | 31.8 | 28.6 | 26.2 |
| GPT-4o-mini-2024-07-18 | Local | 28.7 | 31.2 | 27.1 | 30.6 | 29.0 | 25.0 |
| GPT-4o-mini-2024-07-18 | Delta | -0.6 | +0.1 | -1.1 | -1.2 | +0.4 | -1.2 |
| c4ai-command-r-plus-08-2024 | Paper | 27.8 | 30.2 | 26.4 | 36.7 | 23.7 | 21.3 |
| c4ai-command-r-plus-08-2024 | Local | 27.7 | 27.1 | 28.1 | 33.9 | 23.4 | 25.9 |
| c4ai-command-r-plus-08-2024 | Delta | -0.1 | -3.1 | +1.7 | -2.8 | -0.3 | +4.6 |
| GPT-4o-2024-11-20 | Paper | 46.0 | 50.8 | 43.0 | 47.5 | 47.9 | 39.8 |
| GPT-4o-2024-11-20 | Local | 47.2 | 51.0 | 44.8 | 47.8 | 49.5 | 41.7 |
| GPT-4o-2024-11-20 | Delta | +1.2 | +0.2 | +1.8 | +0.3 | +1.6 | +1.9 |
| GPT-4o-2024-08-06 | Paper | 50.1 | 57.4 | 45.6 | 53.3 | 52.4 | 40.2 |
| GPT-4o-2024-08-06 | Local | 49.0 | 56.8 | 44.2 | 54.4 | 49.1 | 39.8 |
| GPT-4o-2024-08-06 | Delta | -1.1 | -0.6 | -1.4 | +1.1 | -3.3 | -0.4 |

Note: Delta values are calculated as (Local - Paper), so positive values indicate higher local scores, and negative values indicate higher paper scores.
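
For reference, the Delta rows can be recomputed with a short script. This is only a minimal sketch and is not part of the LongBenchV2 repo: the names `SCORES`, `compute_deltas`, and `format_delta` are hypothetical, and the numbers are copied by hand from the Paper and Local rows of the table.

```python
# Hypothetical helper, not part of the LongBenchV2 evaluation pipeline.
# Scores are copied by hand from the Paper and Local rows of the table.
COLUMNS = ("Overall", "Easy", "Hard", "Short", "Medium", "Long")

SCORES = {
    "GPT-4o-mini-2024-07-18": {
        "paper": (29.3, 31.1, 28.2, 31.8, 28.6, 26.2),
        "local": (28.7, 31.2, 27.1, 30.6, 29.0, 25.0),
    },
    "c4ai-command-r-plus-08-2024": {
        "paper": (27.8, 30.2, 26.4, 36.7, 23.7, 21.3),
        "local": (27.7, 27.1, 28.1, 33.9, 23.4, 25.9),
    },
    "GPT-4o-2024-11-20": {
        "paper": (46.0, 50.8, 43.0, 47.5, 47.9, 39.8),
        "local": (47.2, 51.0, 44.8, 47.8, 49.5, 41.7),
    },
    "GPT-4o-2024-08-06": {
        "paper": (50.1, 57.4, 45.6, 53.3, 52.4, 40.2),
        "local": (49.0, 56.8, 44.2, 54.4, 49.1, 39.8),
    },
}


def compute_deltas(paper, local):
    """Return (Local - Paper) per column, rounded to one decimal place."""
    return tuple(round(l - p, 1) for p, l in zip(paper, local))


def format_delta(delta):
    """Format a delta with an explicit sign, matching the table style."""
    return f"{delta:+.1f}"


if __name__ == "__main__":
    for model, scores in SCORES.items():
        deltas = compute_deltas(scores["paper"], scores["local"])
        cells = ", ".join(
            f"{col} {format_delta(d)}" for col, d in zip(COLUMNS, deltas)
        )
        print(f"{model}: {cells}")
```

The largest absolute differences in this data are in the c4ai-command-r-plus-08-2024 row (Long +4.6, Easy -3.1) and the GPT-4o-2024-08-06 row (Medium -3.3).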

Also note that the c4ai-command-r-plus-08-2024 results were obtained by calling command-r-plus-08-2024 through the Cohere API, not by running the model locally.

Are these deltas known to the LongBenchV2 team? I could not find any mention of this kind of reproducibility issue in the other issues or in the paper, so it may be isolated to my setup. I would appreciate any help with resolving this. Thank you!
