Description
Add support for P50, P95, and P99 latency metrics in the benchmarking pipeline. These percentile latencies will provide a clearer understanding of how LLM models perform under varying loads and request patterns. Since LLM response times can vary significantly depending on prompt length, model size, and hardware, percentile metrics are essential for accurate and fair evaluation.
Why This Matters
- P50 represents typical generation latency for common workloads.
- P95 captures slower responses due to edge cases (long prompts, caching misses, cold starts).
- P99 exposes tail latency, which is critical for production use-cases where worst-case performance impacts user experience.
Scope
- Collect raw latency data per model inference/generation call.
- Compute P50, P95, P99 percentiles using a stable statistical approach.
- Expose metrics in the benchmark results object and output reports (JSON/CSV/Markdown).
- Add documentation explaining these metrics and how to interpret them.
- Update the CLI output to display percentile latencies.
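A minimal sketch of the percentile computation step, assuming raw latencies are collected in milliseconds per call (the function name `latency_percentiles` and the result keys are illustrative, not part of the existing pipeline). NumPy's `percentile` with its default linear interpolation between order statistics is one stable, deterministic choice:

```python
import numpy as np

def latency_percentiles(latencies_ms):
    """Compute P50/P95/P99 from raw per-call latencies (milliseconds).

    Linear interpolation between order statistics (numpy's default)
    gives deterministic results for repeated runs on the same data.
    """
    arr = np.asarray(list(latencies_ms), dtype=float)
    p50, p95, p99 = np.percentile(arr, [50, 95, 99])
    return {"p50_ms": p50, "p95_ms": p95, "p99_ms": p99}

# Example: 100 synthetic latencies of 1..100 ms
stats = latency_percentiles(range(1, 101))
```

The returned dict could be serialized directly into the JSON/CSV/Markdown reports mentioned above.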