Difference in models results

Please see the results below:

| Benchmark | Private endpoint | Public endpoint |
| --- | --- | --- |
| Issue resolution | 71.40% | 74.60% |
| Testing | 67.00% | 70.40% |
| Information gathering | 63.60% | 74.50% |
| Greenfield development | 50.00% | 25.00% |

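Both endpoints serve the same model, Kimi-k2.6, but the private endpoint returns lower results on three of the four benchmarks (issue resolution, testing, and information gathering). Greenfield development is the exception: there the private endpoint scores higher.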
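To make the size of the gap explicit, here is a minimal sketch that computes the private-minus-public difference per benchmark, in percentage points. The numbers are copied from the table above; the `scores` dictionary and the `deltas` helper are illustrative names, not part of any evaluation tooling.

```python
# Scores copied from the table above, in percent: (private, public).
scores = {
    "Issue resolution": (71.40, 74.60),
    "Testing": (67.00, 70.40),
    "Information Gathering": (63.60, 74.50),
    "Greenfield development": (50.00, 25.00),
}


def deltas(scores):
    """Return private minus public, in percentage points, per benchmark."""
    return {
        name: round(private - public, 2)
        for name, (private, public) in scores.items()
    }


if __name__ == "__main__":
    for name, delta in deltas(scores).items():
        # A negative value means the private endpoint scored lower.
        print(f"{name}: {delta:+.2f} pp")
```

A negative value means the private endpoint scored lower than the public one on that benchmark.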

Let's compare the conversations and logs from each run to determine why:

Private issue resolution (full archive): https://results.eval.all-hands.dev/swebench/litellm_proxy-accounts-graham-openhands-deployments-mghcd1dc/25200072711/results.tar.gz
Public issue resolution (full archive): https://results.eval.all-hands.dev/swebench/litellm_proxy-moonshot-kimi-k2-6/25007210109/results.tar.gz

Private testing (full archive): https://results.eval.all-hands.dev/swtbench/litellm_proxy-accounts-graham-openhands-deployments-mghcd1dc/25328867381/results.tar.gz
Public testing (full archive): https://results.eval.all-hands.dev/swtbench/litellm_proxy-moonshot-kimi-k2-6/24901879531/results.tar.gz

Private information gathering (full archive): https://results.eval.all-hands.dev/gaia/litellm_proxy-accounts-graham-openhands-deployments-mghcd1dc/25223608404/results.tar.gz
Public information gathering (full archive): https://results.eval.all-hands.dev/gaia/litellm_proxy-moonshot-kimi-k2-6/25710749383/results.tar.gz

Private greenfield (full archive): https://results.eval.all-hands.dev/commit0/litellm_proxy-accounts-graham-openhands-deployments-mghcd1dc/25294380732/results.tar.gz
Public greenfield (full archive): https://results.eval.all-hands.dev/commit0/litellm_proxy-moonshot-kimi-k2-6/25710683155/results.tar.gz
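As a possible starting point for the comparison, the sketch below downloads a pair of archives and reports which file names appear in only one of them. It uses only the Python standard library. The internal layout of `results.tar.gz` is not described in this issue, so the code makes no assumption about it beyond it being a gzip-compressed tarball; the helper names (`download`, `list_members`, `compare_listings`) and the local file names are hypothetical.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path


def download(url, dest):
    """Stream the archive at `url` to the local path `dest`."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest


def list_members(archive_path):
    """Return the sorted names of regular files in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())


def compare_listings(private_archive, public_archive):
    """Split file names into private-only, public-only, and common."""
    private = set(list_members(private_archive))
    public = set(list_members(public_archive))
    return {
        "only_private": sorted(private - public),
        "only_public": sorted(public - private),
        "common": sorted(private & public),
    }


if __name__ == "__main__":
    # Issue resolution URLs taken from the list above.
    base = "https://results.eval.all-hands.dev/swebench"
    private = download(
        f"{base}/litellm_proxy-accounts-graham-openhands-deployments-mghcd1dc"
        "/25200072711/results.tar.gz",
        "archives/private_swebench.tar.gz",  # hypothetical local path
    )
    public = download(
        f"{base}/litellm_proxy-moonshot-kimi-k2-6/25007210109/results.tar.gz",
        "archives/public_swebench.tar.gz",  # hypothetical local path
    )
    for key, names in compare_listings(private, public).items():
        print(key, len(names))
```

If the two archives wrap their contents in differently named top-level directories, the listings will not overlap until that prefix is stripped; files present in both can then be diffed to look for differences in configuration, prompts, or error logs.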

Difference in models results #707

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Difference in models results #707

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions