[R] Datasets — pin standard eval datasets #679

@AmitoVrito

Description

Goal

Pin community-standard evaluation datasets so that reported results are credible and comparable with published work.

Sub-tasks

  • BEIR for retrieval
  • HotpotQA for multi-hop QA; NQ (Natural Questions) for open-domain QA
  • AgentBench / WebArena for agents
  • MT-Bench / Arena-Hard for LLMs
  • License audit

Definition of done

  • Dataset registry with versions + checksums
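
A minimal sketch of what one registry entry and its checksum verification could look like, assuming a plain Python module and SHA-256 over a single downloaded archive. The names `DatasetEntry`, `REGISTRY`, `sha256_of_file`, and `verify` are hypothetical, and the version, checksum, and license values are placeholders to be filled in during pinning and the license audit, not real data.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DatasetEntry:
    """One pinned dataset (hypothetical schema for illustration)."""
    name: str
    version: str   # upstream release tag or commit, pinned explicitly
    sha256: str    # hex digest of the downloaded archive
    license: str   # outcome of the license audit


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(entry: DatasetEntry, path: Path) -> bool:
    """Return True if the file on disk matches the pinned checksum."""
    return sha256_of_file(path) == entry.sha256.lower()


# Placeholder entry: every value below is an assumption, not a real pin.
REGISTRY = {
    "beir": DatasetEntry(
        name="beir",
        version="PIN-ME",
        sha256="0" * 64,
        license="UNVERIFIED",
    ),
}
```

An eval run could call `verify(REGISTRY["beir"], archive_path)` before loading data and refuse to proceed on a mismatch, which keeps results tied to the exact pinned files.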

Metadata


    Labels

    paper (Academic paper / preprint); post-v2.0 (Scheduled for after v2.0 release); research (Research, methodology, ablations)
