Description
Add support for P50, P95, and P99 latency metrics in the benchmarking pipeline. These percentile latencies will provide a clearer understanding of how LLM models perform under varying loads and request patterns. Since LLM response times can vary significantly depending on prompt length, model size, and hardware, percentile metrics are essential for accurate and fair evaluation.
Why This Matters
- P50 represents typical generation latency for common workloads.
- P95 captures slower responses due to edge cases (long prompts, caching misses, cold starts).
- P99 exposes tail latency, which is critical for production use-cases where worst-case performance impacts user experience.
Scope
- Collect raw latency data per model inference/generation call.
- Compute P50, P95, P99 percentiles using a stable statistical approach.
- Expose metrics in the benchmark results object and output reports (JSON/CSV/Markdown).
- Add documentation explaining these metrics and how to interpret them.
- Update the CLI output to display percentile latencies.
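A minimal sketch of the percentile computation step, assuming raw latencies are collected in milliseconds per call (the function name `latency_percentiles` and the result keys are illustrative, not part of the existing pipeline). NumPy's `percentile` with its default linear interpolation between order statistics is one stable, deterministic choice:

```python
import numpy as np

def latency_percentiles(latencies_ms):
    """Compute P50/P95/P99 from raw per-call latencies (milliseconds).

    Linear interpolation between order statistics (numpy's default)
    gives deterministic results for repeated runs on the same data.
    """
    arr = np.asarray(list(latencies_ms), dtype=float)
    p50, p95, p99 = np.percentile(arr, [50, 95, 99])
    return {"p50_ms": p50, "p95_ms": p95, "p99_ms": p99}

# Example: 100 synthetic latencies of 1..100 ms
stats = latency_percentiles(range(1, 101))
```

The returned dict could be serialized directly into the JSON/CSV/Markdown reports mentioned above.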