benchmark_serving: fail run when request failure rate exceeds 5%#1379
Conversation
Gate the benchmark after results are written so the artifact still uploads, then exit non-zero if (num_prompts - completed) / num_prompts > 0.05. Surfaces partial-failure runs that currently get reported as successful jobs.
| failure_rate = 1 - completed / args.num_prompts | ||
| if failure_rate > max_failure_rate: |
There was a problem hiding this comment.
🟡 Minor: Due to IEEE-754 floating-point, 1 - 950/1000 evaluates to 0.050000000000000044, which is strictly > 0.05, so the gate fires at exactly 5% failure rate (50/1000, 5/100, 25/500, 1/20, etc.) and emits the self-contradictory message "FAIL: request failure rate 5.0% exceeds 5% threshold". Compare integer counts instead — e.g. if args.num_prompts - completed > max_failure_rate * args.num_prompts: — to match the documented exceeds semantics.
Extended reasoning...
What goes wrong
At lines 905-906, the gate computes the failure rate as a float and compares strictly:
failure_rate = 1 - completed / args.num_prompts
if failure_rate > max_failure_rate:
raise SystemExit(f"FAIL: request failure rate {failure_rate:.1%} exceeds {max_failure_rate:.0%} threshold ...")0.95 is not exactly representable in float64 (stored as 0.9499999999999999555...). Subtracting from 1.0 produces 0.050000000000000044, which is strictly greater than the float64 representation of 0.05 (0.05000000000000000277...). So the gate fires at the boundary, despite both the PR description ("exceeds 5%") and the user-facing message ("exceeds 5% threshold") using strict exceeds semantics.
Step-by-step proof (Python)
>>> completed, num_prompts = 950, 1000
>>> failure_rate = 1 - completed / num_prompts
>>> failure_rate
0.050000000000000044
>>> failure_rate > 0.05
True
>>> f"{failure_rate:.1%}"
"5.0%"
The job fails with: FAIL: request failure rate 5.0% exceeds 5% threshold (950/1000 completed) — literally claiming 5.0% > 5%.
The same artifact triggers at every common num_prompts where 5% lands on an integer failure count: (95, 100), (475, 500), (19, 20), (9500, 10000). For --num-prompts 1000 (the default), the off-by-one boundary catches runs with exactly 50 failures.
Addressing the counter-argument
One could argue the 5% threshold is an arbitrary heuristic and firing on exactly 5% is defensible policy. That is fair — the practical impact is small (one-failure off-by-one at a boundary that is itself a judgement call). What is not defensible is the self-contradictory error message, which will confuse anyone debugging a failed run. If the intent is >=, the message should say at or exceeds; if the intent is >, the comparison should match.
Suggested fix
The simplest correct form compares integer counts, avoiding the float-precision artifact entirely:
if args.num_prompts - completed > max_failure_rate * args.num_prompts:
...For (950, 1000): 50 > 50.0 is False — gate does not fire. For (949, 1000): 51 > 50.0 is True — gate fires. This matches the documented exceeds semantics. Alternatively, switch to >= and update the message to at or exceeds. Either is a one-line change.
Severity
Marking as nit: real artifact, but practical impact is one-failure off-by-one at an inherently arbitrary boundary. The contradictory message is the most user-visible piece.
Summary
utils/bench_serving/benchmark_serving.py. After the benchmark finishes and the result JSON is written, the script computesfailure_rate = 1 - completed / num_promptsandraise SystemExit(...)if it exceeds 5%.save_to_pytorch_benchmark_formatso the artifact still uploads when the gate fires — failed runs remain debuggable.Why
Today the only failure signal that survives to the run level is
calc_success_rate.py, which rolls up GitHub Actions job conclusions. A benchmark that quietly fails a chunk of requests still reports SUCCESS becausebenchmark_serving.pyonly warns when 100% of requests fail. This gate forces the job to fail when partial failures cross 5%.Test plan