## Goal

Use community-standard datasets for credibility.

## Sub-tasks

- [ ] BEIR for retrieval
- [ ] HotpotQA / NQ for multi-hop
- [ ] AgentBench / WebArena for agents
- [ ] MT-Bench / Arena-Hard for LLMs
- [ ] License audit

## Definition of done

- [ ] Dataset registry with versions + checksums
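
One possible shape for the registry entry and its checksum check is sketched below. The `DatasetEntry` fields, the `sha256_of` and `verify` helpers, and the choice of SHA-256 are assumptions made for illustration; the issue itself only asks for versions and checksums.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DatasetEntry:
    """One registry row (hypothetical schema, not prescribed by the issue)."""
    name: str      # e.g. "BEIR"
    version: str   # pinned release or commit identifier
    license: str   # filled in by the license audit
    sha256: str    # hex digest of the downloaded archive


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(entry: DatasetEntry, path: Path) -> bool:
    """Return True when the file on disk matches the pinned checksum."""
    return sha256_of(path) == entry.sha256
```

Calling `verify(entry, path)` before each benchmark run would catch a silently changed or corrupted download; the real version strings and digests would have to come from the datasets actually fetched.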