[R] Datasets — pin standard eval datasets

## Goal
Evaluate on community-standard datasets, pinned to fixed versions, so results are credible and comparable with published numbers.

## Sub-tasks
- [ ] BEIR for retrieval
- [ ] HotpotQA for multi-hop QA; NQ (Natural Questions) for single-hop open-domain QA
- [ ] AgentBench / WebArena for agents
- [ ] MT-Bench / Arena-Hard for LLMs
- [ ] License audit for every dataset listed above

## Definition of done
- [ ] Dataset registry with versions + checksums
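
A possible shape for the registry, as a minimal sketch rather than a settled design: each entry pins a name, version, license, and SHA-256, and a helper checks a local file against the pinned checksum. `DatasetEntry`, `sha256_of`, and `verify` are hypothetical names introduced here for illustration; real entries would carry the actual versions and checksums once they are chosen.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DatasetEntry:
    """One pinned dataset (hypothetical schema): name, version, license, SHA-256."""
    name: str
    version: str
    license: str
    sha256: str


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(entry: DatasetEntry, path: Path) -> bool:
    """Return True only if the local file matches the pinned checksum."""
    return sha256_of(path) == entry.sha256
```

Running `verify` before every eval run would catch a silently updated or corrupted download; whether the registry lives in code, YAML, or a lockfile is left open here.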