benchmark_serving: fail run when request failure rate exceeds 5% by Oseltamivir · Pull Request #1379 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-05-14T20:41:52Z

Summary

Add a hardcoded 5% per-request failure-rate gate in utils/bench_serving/benchmark_serving.py. After the benchmark finishes and the result JSON is written, the script computes failure_rate = 1 - completed / num_prompts and raises SystemExit(...) if it exceeds 5%.
Placed after the result-file write and save_to_pytorch_benchmark_format so the artifact still uploads when the gate fires — failed runs remain debuggable.
No new CLI argument; the threshold is fixed at 5%.
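The ordering described above (write the artifact first, gate second) can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `write_then_gate` and `simulate` are hypothetical names, the result dict is a placeholder, and the real script also calls save_to_pytorch_benchmark_format before the gate.

```python
import json
import os
import tempfile

MAX_FAILURE_RATE = 0.05  # fixed threshold; the PR adds no CLI flag for it


def write_then_gate(result: dict, path: str, completed: int, num_prompts: int) -> None:
    """Persist the result first, then fail the process if too many requests failed."""
    # Step 1: write the artifact so it exists even when the gate fires.
    with open(path, "w") as f:
        json.dump(result, f)

    # Step 2: apply the gate using the formula from the PR description.
    failure_rate = 1 - completed / num_prompts
    if failure_rate > MAX_FAILURE_RATE:
        raise SystemExit(
            f"FAIL: request failure rate {failure_rate:.1%} exceeds "
            f"{MAX_FAILURE_RATE:.0%} threshold ({completed}/{num_prompts} completed)"
        )


def simulate(completed: int, num_prompts: int):
    """Hypothetical helper: run the gate against a temp file and report what happened."""
    path = os.path.join(tempfile.mkdtemp(), "result.json")
    try:
        write_then_gate({"completed": completed}, path, completed, num_prompts)
        message = None
    except SystemExit as exc:
        message = str(exc)
    # The file exists in both cases because the write happens before the gate.
    return message, os.path.exists(path)
```

Because the write happens before the check, `simulate(900, 1000)` returns a failure message while the result file is still on disk, which is the property that keeps failed runs debuggable.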

Why

Today the only failure signal that survives to the run level is calc_success_rate.py, which rolls up GitHub Actions job conclusions. A benchmark that quietly fails a chunk of requests still reports SUCCESS because benchmark_serving.py only warns when 100% of requests fail. This gate forces the job to fail when partial failures cross 5%.

Test plan

Trigger a sweep that produces a healthy run; verify job still succeeds and result JSON is uploaded as before.
Trigger (or simulate) a run with >5% request failures; verify the job fails after the artifact is uploaded and the SystemExit message reports the rate.
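For the "simulate" option in the second step, one possible local check is to run a tiny stand-in script in a subprocess and inspect its exit status. The stand-in below is hypothetical and reproduces only the gate, not the benchmark; it relies on the standard Python behaviour that `raise SystemExit("message")` prints the message to stderr and exits with status 1.

```python
import subprocess
import sys

# Hypothetical stand-in for the benchmark script: it only reproduces the gate.
FAKE_BENCHMARK = """
import sys
completed, num_prompts = int(sys.argv[1]), int(sys.argv[2])
failure_rate = 1 - completed / num_prompts
if failure_rate > 0.05:
    raise SystemExit(f"FAIL: request failure rate {failure_rate:.1%} exceeds 5% threshold")
"""


def run_fake_benchmark(completed: int, num_prompts: int) -> subprocess.CompletedProcess:
    """Run the stand-in in a child interpreter and capture its exit status and stderr."""
    return subprocess.run(
        [sys.executable, "-c", FAKE_BENCHMARK, str(completed), str(num_prompts)],
        capture_output=True,
        text=True,
    )


healthy = run_fake_benchmark(1000, 1000)  # no failures: exit status 0
failing = run_fake_benchmark(900, 1000)   # 10% failures: non-zero exit status
print(healthy.returncode, failing.returncode, failing.stderr.strip())
```

A CI step that runs the script directly would see the same non-zero status and mark the job as failed.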

Gate the benchmark after results are written so the artifact still uploads, then exit non-zero if (num_prompts - completed) / num_prompts > 0.05. Surfaces partial-failure runs that currently get reported as successful jobs.

claude · 2026-05-14T20:47:04Z

+    failure_rate = 1 - completed / args.num_prompts
+    if failure_rate > max_failure_rate:


🟡 Minor: Due to IEEE-754 floating-point, 1 - 950/1000 evaluates to 0.050000000000000044, which is strictly > 0.05, so the gate fires at exactly 5% failure rate (50/1000, 5/100, 25/500, 1/20, etc.) and emits the self-contradictory message "FAIL: request failure rate 5.0% exceeds 5% threshold". Compare integer counts instead — e.g. if args.num_prompts - completed > max_failure_rate * args.num_prompts: — to match the documented exceeds semantics.

Extended reasoning...

What goes wrong

At lines 905-906, the gate computes the failure rate as a float and compares strictly:

    failure_rate = 1 - completed / args.num_prompts
    if failure_rate > max_failure_rate:
        raise SystemExit(f"FAIL: request failure rate {failure_rate:.1%} exceeds {max_failure_rate:.0%} threshold ...")

0.95 is not exactly representable in float64 (stored as 0.9499999999999999555...). Subtracting from 1.0 produces 0.050000000000000044, which is strictly greater than the float64 representation of 0.05 (0.05000000000000000277...). So the gate fires at the boundary, despite both the PR description ("exceeds 5%") and the user-facing message ("exceeds 5% threshold") using strict exceeds semantics.
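The stored values can be inspected directly with the standard-library `decimal` module, since `Decimal(float)` converts a float to its exact binary value without rounding. A small sketch (variable names are illustrative):

```python
from decimal import Decimal

stored_095 = Decimal(0.95)      # exact value of the float64 nearest to 0.95
stored_005 = Decimal(0.05)      # exact value of the float64 nearest to 0.05
difference = Decimal(1 - 0.95)  # exact value of the float result of 1 - 0.95

print(stored_095)  # slightly below 0.95
print(stored_005)  # slightly above 0.05
print(difference)  # larger than stored_005, so the strict comparison is True

# The float comparison used by the gate, at the exact 5% boundary.
print((1 - 0.95) > 0.05)
```

The three printed decimals make the ordering visible: the subtraction result sits above the stored threshold even though both display as 5% when formatted.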

Step-by-step proof (Python)

    >>> completed, num_prompts = 950, 1000
    >>> failure_rate = 1 - completed / num_prompts
    >>> failure_rate
    0.050000000000000044
    >>> failure_rate > 0.05
    True
    >>> f"{failure_rate:.1%}"
    '5.0%'

The job fails with: FAIL: request failure rate 5.0% exceeds 5% threshold (950/1000 completed) — literally claiming 5.0% > 5%.

The same artifact triggers at every common num_prompts where 5% lands on an integer failure count: (95, 100), (475, 500), (19, 20), (9500, 10000). For --num-prompts 1000 (the default), the off-by-one boundary catches runs with exactly 50 failures.
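A short loop confirms that each of these pairs trips the float comparison. This is a sketch of the check as described in the review, with `gate_fires` as an illustrative name rather than a function in the script:

```python
MAX_FAILURE_RATE = 0.05

# (completed, num_prompts) pairs where exactly 5% of requests failed.
boundary_cases = [(95, 100), (475, 500), (19, 20), (950, 1000), (9500, 10000)]


def gate_fires(completed: int, num_prompts: int) -> bool:
    """Float comparison as written in the PR diff."""
    return 1 - completed / num_prompts > MAX_FAILURE_RATE


results = {case: gate_fires(*case) for case in boundary_cases}
for (completed, num_prompts), fired in results.items():
    print(f"{completed}/{num_prompts}: gate fires = {fired}")
```

Every pair divides to the same float (the one nearest 0.95), so they all behave identically at the boundary.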

Addressing the counter-argument

One could argue the 5% threshold is an arbitrary heuristic and firing on exactly 5% is defensible policy. That is fair — the practical impact is small (one-failure off-by-one at a boundary that is itself a judgement call). What is not defensible is the self-contradictory error message, which will confuse anyone debugging a failed run. If the intent is >=, the message should say at or exceeds; if the intent is >, the comparison should match.

Suggested fix

The simplest correct form compares integer counts, avoiding the float-precision artifact entirely:

    if args.num_prompts - completed > max_failure_rate * args.num_prompts:
        ...

For (950, 1000): 50 > 50.0 is False — gate does not fire. For (949, 1000): 51 > 50.0 is True — gate fires. This matches the documented exceeds semantics. Alternatively, switch to >= and update the message to at or exceeds. Either is a one-line change.
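The count-based comparison can be checked in isolation. The sketch below wraps it in a hypothetical helper, `gate_fires_counts`, and adds a second variant, `gate_fires_exact`, that is not part of the review's suggestion: it expresses the threshold as an integer percentage so no floating-point value is involved at all.

```python
MAX_FAILURE_RATE = 0.05


def gate_fires_counts(completed: int, num_prompts: int,
                      max_failure_rate: float = MAX_FAILURE_RATE) -> bool:
    """Suggested fix: compare the failure count against the allowed count."""
    return num_prompts - completed > max_failure_rate * num_prompts


def gate_fires_exact(completed: int, num_prompts: int, max_failure_pct: int = 5) -> bool:
    """Assumed alternative: integer-only form of failures / num_prompts > pct / 100."""
    return (num_prompts - completed) * 100 > max_failure_pct * num_prompts


for completed in (950, 949):
    print(completed, gate_fires_counts(completed, 1000), gate_fires_exact(completed, 1000))
```

Both variants leave the gate closed at exactly 50 failures out of 1000 and open it at 51. The integer-only form would require the threshold to be stored as a whole percentage, which is a design change beyond the one-line fix proposed here.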

Severity

Marking as nit: real artifact, but practical impact is one-failure off-by-one at an inherently arbitrary boundary. The contradictory message is the most user-visible piece.

Oseltamivir requested a review from a team May 14, 2026 20:41

github-project-automation Bot added this to InferenceMAX Board May 14, 2026

claude Bot reviewed May 14, 2026

Oseltamivir merged commit 29ac412 into main May 14, 2026
6 checks passed

Oseltamivir deleted the fail-on-request-failure-rate branch May 14, 2026 21:55

github-project-automation Bot moved this to Done in InferenceMAX Board May 14, 2026
