Export NaNs in logits to scheduler_stats if output is corrupted #18777

Merged
merged 1 commit into vllm-project:main on Jun 20, 2025

Conversation

vladmihailescu
Contributor

@vladmihailescu vladmihailescu commented May 27, 2025

Signed-off-by: Vlad Mihailescu vladmihailescu@meta.com

Summary:
Report NaNs in logits in scheduler_stats. This can later be exported as a Prometheus counter, but for now it is required so we can export it in our internal counter infra.

This counter is used to identify bad hosts or bad GPUs which cause NaNs in logits.

It's a common metric we expose.

Reviewed By: Adolfo-Karim

Differential Revision: D75423285
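For readers skimming the description, here is a minimal, hypothetical sketch of how a per-step NaN count could flow into scheduler stats and later be bumped as a Prometheus counter; the field and metric names below are illustrative assumptions, not necessarily what this PR adds.

```python
# Hedged sketch only: field and metric names are assumptions, not the PR's.
from dataclasses import dataclass

import torch
from prometheus_client import Counter

@dataclass
class SchedulerStats:
    # ... existing scheduler stats fields would live here ...
    num_corrupted_reqs: int = 0  # requests whose logits contained NaNs

corrupted_reqs_counter = Counter(
    "vllm_corrupted_requests_total",
    "Requests whose logits contained NaN values")

def record_nans(logits: torch.Tensor, stats: SchedulerStats) -> None:
    # A request (one row of logits) counts as corrupted if any logit is NaN.
    corrupted = int(torch.isnan(logits).any(dim=-1).sum().item())
    stats.num_corrupted_reqs += corrupted
    corrupted_reqs_counter.inc(corrupted)
```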

NO ENV VAR (NaN counting off):
I0618 22:24:27.611000 180881 reporter.py:195] Part 1: High-level performance metrics
Ran 404/404 requests in 56.56s
Success rate:        100.00%
QPS:                 7.14
Avg latency:         11.361s
Avg TTFT (client):   8245.31ms
P50 TTFT (client):   8262.13ms
P99 TTFT (client):   8283.14ms
Avg TTIT (client):   311.60ms
P50 TTIT (client):   323.78ms
P99 TTIT (client):   334.87ms
Avg TTFT (server):   6748.17ms
Avg TTIT (server):   357.83ms
Avg prefill len:     2587.54 tokens
P50 prefill len:     2587.00 tokens
P99 prefill len:     2596.00 tokens
Avg decode len:      10.00 tokens
P50 decode len:      10.00 tokens
P99 decode len:      10.00 tokens
I0618 22:22:13.811000 141492 reporter.py:195] Part 1: High-level performance metrics
Ran 404/404 requests in 56.30s
Success rate:        100.00%
QPS:                 7.18
Avg latency:         11.312s
Avg TTFT (client):   8209.60ms
P50 TTFT (client):   8214.03ms
P99 TTFT (client):   8273.95ms
Avg TTIT (client):   310.22ms
P50 TTIT (client):   322.10ms
P99 TTIT (client):   325.09ms
Avg TTFT (server):   6639.55ms
Avg TTIT (server):   355.36ms
Avg prefill len:     2587.49 tokens
P50 prefill len:     2587.00 tokens
P99 prefill len:     2595.00 tokens
Avg decode len:      10.00 tokens
P50 decode len:      10.00 tokens
P99 decode len:      10.00 tokens
WITH ENV VAR (NaN counting on):
I0618 22:44:09.103000 508486 reporter.py:195] Part 1: High-level performance metrics
Ran 404/404 requests in 56.18s
Success rate:        100.00%
QPS:                 7.19
Avg latency:         11.291s
Avg TTFT (client):   8193.95ms
P50 TTFT (client):   8208.52ms
P99 TTFT (client):   8232.08ms
Avg TTIT (client):   309.75ms
P50 TTIT (client):   321.92ms
P99 TTIT (client):   333.03ms
Avg TTFT (server):   8574.45ms
Avg TTIT (server):   355.91ms
Avg prefill len:     2587.90 tokens
P50 prefill len:     2588.00 tokens
P99 prefill len:     2596.00 tokens
Avg decode len:      10.00 tokens
P50 decode len:      10.00 tokens
P99 decode len:      10.00 tokens
I0618 22:47:21.294000 566871 reporter.py:195] Part 1: High-level performance metrics
Ran 404/404 requests in 57.23s
Success rate:        100.00%
QPS:                 7.06
Avg latency:         11.532s
Avg TTFT (client):   8365.60ms
P50 TTFT (client):   8229.34ms
P99 TTFT (client):   9156.80ms
Avg TTIT (client):   316.59ms
P50 TTIT (client):   322.71ms
P99 TTIT (client):   414.20ms
Avg TTFT (server):   6794.25ms
Avg TTIT (server):   358.08ms
Avg prefill len:     2587.78 tokens
P50 prefill len:     2587.00 tokens
P99 prefill len:     2597.00 tokens
Avg decode len:      10.00 tokens
P50 decode len:      10.00 tokens
P99 decode len:      10.00 tokens

And a unit test:

pytest -s -v tests/v1/worker/test_gpu_model_runner.py
tests/v1/worker/test_gpu_model_runner.py::test_get_nans_in_logits INFO 06-19 02:06:26 [config.py:1444] Using max model len 2048
WARNING 06-19 02:06:26 [config.py:4758] Current vLLM config is not set.
WARNING 06-19 02:06:26 [config.py:4758] Current vLLM config is not set.
WARNING 06-19 02:06:26 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
PASSED
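For illustration only, here is a self-contained sketch of the kind of assertion such a test makes; the real test in tests/v1/worker/test_gpu_model_runner.py exercises the model runner directly.

```python
# Standalone sketch (not the actual vLLM test): count NaNs per logits row.
import torch

def test_nan_counting_per_request():
    # Two requests; the second one has a single NaN in its logits row.
    logits = torch.zeros(2, 4)
    logits[1, 2] = float("nan")
    nans_per_row = logits.isnan().sum(dim=-1)
    counts = {f"req-{i}": int(n) for i, n in enumerate(nans_per_row)}
    assert counts == {"req-0": 0, "req-1": 1}
```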


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which starts only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D75423285

@mergify mergify bot added the v1 and tpu (Related to Google TPUs) labels May 27, 2025
@simon-mo
Collaborator

Keep in mind that the logits need to be serialized from the model runner back to the scheduler via RPC/collectives, so doing the counting in the model runner will be better.
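A minimal sketch of that suggestion (the batch bookkeeping names are assumptions): reduce the NaN information to one int per request inside the model runner, so only small ints, not the logits tensor, cross the RPC/collective boundary.

```python
# Hedged sketch: do the reduction where the logits already live.
import torch

def count_nans_per_request(logits: torch.Tensor,
                           req_ids: list[str]) -> dict[str, int]:
    # One device-side reduction, then a single small host transfer.
    nans_per_row = logits.isnan().sum(dim=-1).cpu()
    return {req_id: int(nans_per_row[i]) for i, req_id in enumerate(req_ids)}
```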

@markmc
Member

markmc commented May 27, 2025

It doesn't seem like we actually need an exact count of NaNs - we just want a signal that corruption is spiking?

What happens to the request when this occurs? Is the request aborted, or ...?

Could we do something similar to the vllm:request_success_total metric - maybe add another FinishReason like CORRUPTED?

@markmc
Member

markmc commented May 28, 2025

Some overlap with #18765 ... except I'm not sure this NaN case results in the request explicitly failing?

@vladmihailescu
Contributor Author

@markmc with NaNs the requests won't fail by default. With the existing behaviour the output of the requests will just be gibberish, which is why I'm not sure we should add a new finish reason: NaNs will not forcefully fail the request.

Internally we observe 2 behaviours in which NaNs appear:

  1. Bad GPU - this causes all requests landing on that GPU to have NaNs. We fail warmup if this happens during warmup (so the container restarts and retries warmup), but if it happens while the task is healthy, the container won't stop, because there can be random flaky NaNs (see point 2).
  2. Flaky - it very rarely happens that a request has NaNs in its logits even though it was processed on a healthy GPU.

I am now trying to add per-request NaN counts, and maybe a per-request bool indicating whether it is corrupted, but I'm not sure we should make it a finish reason.
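A tiny sketch of that per-request idea (names are illustrative): derive a boolean corrupted flag from the per-request NaN counts rather than introducing a new finish reason.

```python
# Hedged sketch: any NaN in a request's logits marks the request as corrupted.
def corrupted_flags(num_nans_in_logits: dict[str, int]) -> dict[str, bool]:
    return {req_id: n > 0 for req_id, n in num_nans_in_logits.items()}
```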

@yeqcharlotte
Collaborator

yeqcharlotte commented Jun 19, 2025

LGTM. @vladmihailescu could you update the test plan? Any data to show this has low overhead under load? Possible to add a unit test to ensure it works?

cc: @simon-mo @WoosukKwon @njhill @houseroad could you help to take a look?

Collaborator

@houseroad houseroad left a comment

Can we have an environment variable to control this? I am a bit concerned that it may cause some overhead. Alternatively, we could run a benchmark to confirm the overhead.
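A minimal sketch of that gating (the environment variable name here is an assumption): keep NaN counting off by default so the extra isnan()/sum() work is only paid when explicitly enabled.

```python
# Hedged sketch: env-var gate, off by default. Variable name is illustrative.
import os

import torch

COMPUTE_NANS_IN_LOGITS = os.getenv("VLLM_COMPUTE_NANS_IN_LOGITS", "0") == "1"

def maybe_count_nans(logits: torch.Tensor,
                     req_ids: list[str]) -> dict[str, int]:
    if not COMPUTE_NANS_IN_LOGITS:
        # Counting disabled: report zeros without touching the logits.
        return {req_id: 0 for req_id in req_ids}
    nans_per_row = logits.isnan().sum(dim=-1).cpu()
    return {req_id: int(nans_per_row[i]) for i, req_id in enumerate(req_ids)}
```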

vladmihailescu added a commit to vladmihailescu/vllm that referenced this pull request Jun 19, 2025
@vladmihailescu
Contributor Author

@yeqcharlotte @houseroad I added a unit test and gated the computation behind an env var that is off by default. I ran a few benchmarks with loadgen and there is no significant throughput/latency difference at high concurrency, but I didn't look into traces. Are you suggesting looking at GPU traces to check the overhead, or something else? If the load test is enough, I think we can also do a shadow run internally after this lands.

Collaborator

@houseroad houseroad left a comment

Looks good to me. Thanks for adding this metric.

@houseroad houseroad added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 19, 2025
vladmihailescu added a commit to vladmihailescu/vllm that referenced this pull request Jun 20, 2025
@vladmihailescu
Contributor Author

V1 test timed out. Rebasing after the fix (#19872)

```
@@ -1826,6 +1831,25 @@ def _get_prompt_logprobs_dict(

        return prompt_logprobs_dict

    def _get_nans_in_logits(
        self,
        logits: torch.Tensor,
```
Collaborator

is logits optionally None here or not?

Comment on lines 1839 to 1850
```python
        num_nans_in_logits = {}
        num_nans_for_index = None
        if logits is not None:
            num_nans_for_index = logits.isnan().sum(dim=-1).cpu().numpy()
        for req_id in self.input_batch.req_ids:
            req_index = self.input_batch.req_id_to_index[req_id]
            num_nans_in_logits[req_id] = (
                int(num_nans_for_index[req_index])
                if logits is not None and num_nans_for_index is not None
                and req_index < logits.shape[0] else 0)
        return num_nans_in_logits
```
Collaborator

@yeqcharlotte yeqcharlotte Jun 20, 2025

nit: you've got two `if logits is not None` checks here, one of them inside the loop. For better readability it might make sense to do the following:

```python
if logits is None:
    return {req_id: 0 for ...}
num_nans_for_index = logits.isnan().sum(dim=-1).cpu().numpy()
# count nans
```
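For reference, a hedged sketch of the whole helper with that early return applied; it mirrors the snippet under review above and is not necessarily the exact code that landed.

```python
# Sketch: a GPU model runner method, shown out of its class for brevity.
from typing import Optional

import torch

def _get_nans_in_logits(
    self,
    logits: Optional[torch.Tensor],
) -> dict[str, int]:
    if logits is None:
        # No logits this step: report zero NaNs for every request.
        return {req_id: 0 for req_id in self.input_batch.req_ids}
    num_nans_for_index = logits.isnan().sum(dim=-1).cpu().numpy()
    num_nans_in_logits = {}
    for req_id in self.input_batch.req_ids:
        req_index = self.input_batch.req_id_to_index[req_id]
        num_nans_in_logits[req_id] = (
            int(num_nans_for_index[req_index])
            if req_index < logits.shape[0] else 0)
    return num_nans_in_logits
```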

Collaborator

@yeqcharlotte yeqcharlotte left a comment

Thanks for adding this! Leaving some small nits only.

vladmihailescu added a commit to vladmihailescu/vllm that referenced this pull request Jun 20, 2025
@vladmihailescu
Contributor Author

Addressed nits

@houseroad houseroad merged commit 2e3e3c8 into vllm-project:main Jun 20, 2025
67 checks passed
chris-relational pushed a commit to chris-relational/vllm that referenced this pull request Jun 20, 2025
yeqcharlotte pushed a commit to yeqcharlotte/vllm that referenced this pull request Jun 22, 2025
juncheoll pushed a commit to juncheoll/vllm that referenced this pull request Jun 23, 2025
fhl2000 pushed a commit to fhl2000/vllm that referenced this pull request Jun 25, 2025
gmarinho2 pushed a commit to gmarinho2/vllm that referenced this pull request Jun 26, 2025
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
wwl2755-google pushed a commit to wwl2755-google/vllm that referenced this pull request Jul 1, 2025