Skip to content

benchmark_serving: fail run when request failure rate exceeds 5%#1379

Merged
Oseltamivir merged 1 commit into
mainfrom
fail-on-request-failure-rate
May 14, 2026
Merged

benchmark_serving: fail run when request failure rate exceeds 5%#1379
Oseltamivir merged 1 commit into
mainfrom
fail-on-request-failure-rate

Conversation

@Oseltamivir
Copy link
Copy Markdown
Collaborator

Summary

  • Add a hardcoded 5% per-request failure-rate gate in utils/bench_serving/benchmark_serving.py. After the benchmark finishes and the result JSON is written, the script computes failure_rate = 1 - completed / num_prompts and raise SystemExit(...) if it exceeds 5%.
  • Placed after the result-file write and save_to_pytorch_benchmark_format so the artifact still uploads when the gate fires — failed runs remain debuggable.
  • No new CLI argument; the threshold is fixed at 5%.

Why

Today the only failure signal that survives to the run level is calc_success_rate.py, which rolls up GitHub Actions job conclusions. A benchmark that quietly fails a chunk of requests still reports SUCCESS because benchmark_serving.py only warns when 100% of requests fail. This gate forces the job to fail when partial failures cross 5%.

Test plan

  • Trigger a sweep that produces a healthy run; verify job still succeeds and result JSON is uploaded as before.
  • Trigger (or simulate) a run with >5% request failures; verify the job fails after the artifact is uploaded and the SystemExit message reports the rate.

Gate the benchmark after results are written so the artifact still
uploads, then exit non-zero if (num_prompts - completed) / num_prompts
> 0.05. Surfaces partial-failure runs that currently get reported as
successful jobs.
Comment on lines +905 to +906
failure_rate = 1 - completed / args.num_prompts
if failure_rate > max_failure_rate:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor: Due to IEEE-754 floating-point, 1 - 950/1000 evaluates to 0.050000000000000044, which is strictly > 0.05, so the gate fires at exactly 5% failure rate (50/1000, 5/100, 25/500, 1/20, etc.) and emits the self-contradictory message "FAIL: request failure rate 5.0% exceeds 5% threshold". Compare integer counts instead — e.g. if args.num_prompts - completed > max_failure_rate * args.num_prompts: — to match the documented exceeds semantics.

Extended reasoning...

What goes wrong

At lines 905-906, the gate computes the failure rate as a float and compares strictly:

failure_rate = 1 - completed / args.num_prompts
if failure_rate > max_failure_rate:
    raise SystemExit(f"FAIL: request failure rate {failure_rate:.1%} exceeds {max_failure_rate:.0%} threshold ...")

0.95 is not exactly representable in float64 (stored as 0.9499999999999999555...). Subtracting from 1.0 produces 0.050000000000000044, which is strictly greater than the float64 representation of 0.05 (0.05000000000000000277...). So the gate fires at the boundary, despite both the PR description ("exceeds 5%") and the user-facing message ("exceeds 5% threshold") using strict exceeds semantics.

Step-by-step proof (Python)

>>> completed, num_prompts = 950, 1000
>>> failure_rate = 1 - completed / num_prompts
>>> failure_rate
0.050000000000000044
>>> failure_rate > 0.05
True
>>> f"{failure_rate:.1%}"
"5.0%"

The job fails with: FAIL: request failure rate 5.0% exceeds 5% threshold (950/1000 completed) — literally claiming 5.0% > 5%.

The same artifact triggers at every common num_prompts where 5% lands on an integer failure count: (95, 100), (475, 500), (19, 20), (9500, 10000). For --num-prompts 1000 (the default), the off-by-one boundary catches runs with exactly 50 failures.

Addressing the counter-argument

One could argue the 5% threshold is an arbitrary heuristic and firing on exactly 5% is defensible policy. That is fair — the practical impact is small (one-failure off-by-one at a boundary that is itself a judgement call). What is not defensible is the self-contradictory error message, which will confuse anyone debugging a failed run. If the intent is >=, the message should say at or exceeds; if the intent is >, the comparison should match.

Suggested fix

The simplest correct form compares integer counts, avoiding the float-precision artifact entirely:

if args.num_prompts - completed > max_failure_rate * args.num_prompts:
    ...

For (950, 1000): 50 > 50.0 is False — gate does not fire. For (949, 1000): 51 > 50.0 is True — gate fires. This matches the documented exceeds semantics. Alternatively, switch to >= and update the message to at or exceeds. Either is a one-line change.

Severity

Marking as nit: real artifact, but practical impact is one-failure off-by-one at an inherently arbitrary boundary. The contradictory message is the most user-visible piece.

@Oseltamivir Oseltamivir merged commit 29ac412 into main May 14, 2026
6 checks passed
@Oseltamivir Oseltamivir deleted the fail-on-request-failure-rate branch May 14, 2026 21:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

1 participant