Export NaNs in logits to scheduler_stats if output is corrupted #18777

Merged
merged 1 commit into vllm-project:main on Jun 20, 2025

Conversation

vladmihailescu
Contributor

@vladmihailescu vladmihailescu commented May 27, 2025

Signed-off-by: Vlad Mihailescu vladmihailescu@meta.com

Summary:
Report NaNs in logits in scheduler_stats. This can later be exported as a Prometheus counter, but for now it is required so we can export it in our internal counter infra.

This counter is used to identify bad hosts or bad GPUs which cause NaNs in logits.

It's a common metric we expose.

Reviewed By: Adolfo-Karim

Differential Revision: D75423285
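For readers skimming the description, here is a minimal, hypothetical sketch of how a per-step NaN count could flow into scheduler stats and later be bumped as a Prometheus counter; the field and metric names below are illustrative assumptions, not necessarily what this PR adds.

```python
# Hedged sketch only: field and metric names are assumptions, not the PR's.
from dataclasses import dataclass

import torch
from prometheus_client import Counter

@dataclass
class SchedulerStats:
    # ... existing scheduler stats fields would live here ...
    num_corrupted_reqs: int = 0  # requests whose logits contained NaNs

corrupted_reqs_counter = Counter(
    "vllm_corrupted_requests_total",
    "Requests whose logits contained NaN values")

def record_nans(logits: torch.Tensor, stats: SchedulerStats) -> None:
    # A request (one row of logits) counts as corrupted if any logit is NaN.
    corrupted = int(torch.isnan(logits).any(dim=-1).sum().item())
    stats.num_corrupted_reqs += corrupted
    corrupted_reqs_counter.inc(corrupted)
```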

NO ENV VAR (NaN counting off):
I0618 22:24:27.611000 180881 reporter.py:195] Part 1: High-level performance metrics
Ran 404/404 requests in 56.56s
Success rate:        100.00%
QPS:                 7.14
Avg latency:         11.361s
Avg TTFT (client):   8245.31ms
P50 TTFT (client):   8262.13ms
P99 TTFT (client):   8283.14ms
Avg TTIT (client):   311.60ms
P50 TTIT (client):   323.78ms
P99 TTIT (client):   334.87ms
Avg TTFT (server):   6748.17ms
Avg TTIT (server):   357.83ms
Avg prefill len:     2587.54 tokens
P50 prefill len:     2587.00 tokens
P99 prefill len:     2596.00 tokens
Avg decode len:      10.00 tokens
P50 decode len:      10.00 tokens
P99 decode len:      10.00 tokens
I0618 22:22:13.811000 141492 reporter.py:195] Part 1: High-level performance metrics
Ran 404/404 requests in 56.30s
Success rate:        100.00%
QPS:                 7.18
Avg latency:         11.312s
Avg TTFT (client):   8209.60ms
P50 TTFT (client):   8214.03ms
P99 TTFT (client):   8273.95ms
Avg TTIT (client):   310.22ms
P50 TTIT (client):   322.10ms
P99 TTIT (client):   325.09ms
Avg TTFT (server):   6639.55ms
Avg TTIT (server):   355.36ms
Avg prefill len:     2587.49 tokens
P50 prefill len:     2587.00 tokens
P99 prefill len:     2595.00 tokens
Avg decode len:      10.00 tokens
P50 decode len:      10.00 tokens
P99 decode len:      10.00 tokens
WITH ENV VAR (NaN counting on):
I0618 22:44:09.103000 508486 reporter.py:195] Part 1: High-level performance metrics
Ran 404/404 requests in 56.18s
Success rate:        100.00%
QPS:                 7.19
Avg latency:         11.291s
Avg TTFT (client):   8193.95ms
P50 TTFT (client):   8208.52ms
P99 TTFT (client):   8232.08ms
Avg TTIT (client):   309.75ms
P50 TTIT (client):   321.92ms
P99 TTIT (client):   333.03ms
Avg TTFT (server):   8574.45ms
Avg TTIT (server):   355.91ms
Avg prefill len:     2587.90 tokens
P50 prefill len:     2588.00 tokens
P99 prefill len:     2596.00 tokens
Avg decode len:      10.00 tokens
P50 decode len:      10.00 tokens
P99 decode len:      10.00 tokens
I0618 22:47:21.294000 566871 reporter.py:195] Part 1: High-level performance metrics
Ran 404/404 requests in 57.23s
Success rate:        100.00%
QPS:                 7.06
Avg latency:         11.532s
Avg TTFT (client):   8365.60ms
P50 TTFT (client):   8229.34ms
P99 TTFT (client):   9156.80ms
Avg TTIT (client):   316.59ms
P50 TTIT (client):   322.71ms
P99 TTIT (client):   414.20ms
Avg TTFT (server):   6794.25ms
Avg TTIT (server):   358.08ms
Avg prefill len:     2587.78 tokens
P50 prefill len:     2587.00 tokens
P99 prefill len:     2597.00 tokens
Avg decode len:      10.00 tokens
P50 decode len:      10.00 tokens
P99 decode len:      10.00 tokens

And a unit test:

pytest -s -v tests/v1/worker/test_gpu_model_runner.py
tests/v1/worker/test_gpu_model_runner.py::test_get_nans_in_logits INFO 06-19 02:06:26 [config.py:1444] Using max model len 2048
WARNING 06-19 02:06:26 [config.py:4758] Current vLLM config is not set.
WARNING 06-19 02:06:26 [config.py:4758] Current vLLM config is not set.
WARNING 06-19 02:06:26 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
PASSED
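For illustration only, here is a self-contained sketch of the kind of assertion such a test makes; the real test in tests/v1/worker/test_gpu_model_runner.py exercises the model runner directly.

```python
# Standalone sketch (not the actual vLLM test): count NaNs per logits row.
import torch

def test_nan_counting_per_request():
    # Two requests; the second one has a single NaN in its logits row.
    logits = torch.zeros(2, 4)
    logits[1, 2] = float("nan")
    nans_per_row = logits.isnan().sum(dim=-1)
    counts = {f"req-{i}": int(n) for i, n in enumerate(nans_per_row)}
    assert counts == {"req-0": 0, "req-1": 1}
```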


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which starts only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D75423285

@mergify mergify bot added the v1 and tpu (Related to Google TPUs) labels May 27, 2025
@simon-mo
Collaborator

Keep in mind that the logits need to be serialized from the model runner back to the scheduler via RPC/collectives, so doing the counting in the model runner will be better.
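A minimal sketch of that suggestion (the batch bookkeeping names are assumptions): reduce the NaN information to one int per request inside the model runner, so only small ints, not the logits tensor, cross the RPC/collective boundary.

```python
# Hedged sketch: do the reduction where the logits already live.
import torch

def count_nans_per_request(logits: torch.Tensor,
                           req_ids: list[str]) -> dict[str, int]:
    # One device-side reduction, then a single small host transfer.
    nans_per_row = logits.isnan().sum(dim=-1).cpu()
    return {req_id: int(nans_per_row[i]) for i, req_id in enumerate(req_ids)}
```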

@markmc
Member

markmc commented May 27, 2025

It doesn't seem like we actually need an exact count of NaNs - we just want a signal that corruption is spiking?

What happens to the request when this occurs? Is the request aborted, or ...?

Could we do something similar to the vllm:request_success_total metric - maybe add another FinishReason like CORRUPTED?

@markmc
Member

markmc commented May 28, 2025

Some overlap with #18765 ... except I'm not sure this NaN case results in the request explicitly failing?

@vladmihailescu
Contributor Author

@markmc with NaNs the requests won't fail by default. With the existing behaviour the output of the requests will just be gibberish, which is why I'm not sure we should add a new finish reason: NaNs will not forcefully fail the request.

Internally we observe 2 behaviours in which NaNs appear:

  1. Bad GPU - this causes all requests landing on that GPU to have NaNs. We fail warmup if this happens during warmup (so the container restarts and retries warmup), but if it happens while the task is healthy, the container won't stop, because there can be random flaky NaNs (see point 2).
  2. Flaky - it very rarely happens that a request has NaNs in its logits even though it was processed on a healthy GPU.

I am now trying to add per-request NaN counts, and maybe a per-request bool indicating whether it is corrupted, but I'm not sure we should make it a finish reason.
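A tiny sketch of that per-request idea (names are illustrative): derive a boolean corrupted flag from the per-request NaN counts rather than introducing a new finish reason.

```python
# Hedged sketch: any NaN in a request's logits marks the request as corrupted.
def corrupted_flags(num_nans_in_logits: dict[str, int]) -> dict[str, bool]:
    return {req_id: n > 0 for req_id, n in num_nans_in_logits.items()}
```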

@yeqcharlotte
Collaborator

yeqcharlotte commented Jun 19, 2025

LGTM. @vladmihailescu could you update the test plan? Any data to show this has low overhead under load? Possible to add a unit test to ensure it works?

cc: @simon-mo @WoosukKwon @njhill @houseroad could you help to take a look?

Collaborator

@houseroad houseroad left a comment

Can we have an environment variable to control this? I am a bit concerned that it may cause some overhead. Alternatively, we could run a benchmark to confirm the overhead.
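A minimal sketch of that gating (the environment variable name here is an assumption): keep NaN counting off by default so the extra isnan()/sum() work is only paid when explicitly enabled.

```python
# Hedged sketch: env-var gate, off by default. Variable name is illustrative.
import os

import torch

COMPUTE_NANS_IN_LOGITS = os.getenv("VLLM_COMPUTE_NANS_IN_LOGITS", "0") == "1"

def maybe_count_nans(logits: torch.Tensor,
                     req_ids: list[str]) -> dict[str, int]:
    if not COMPUTE_NANS_IN_LOGITS:
        # Counting disabled: report zeros without touching the logits.
        return {req_id: 0 for req_id in req_ids}
    nans_per_row = logits.isnan().sum(dim=-1).cpu()
    return {req_id: int(nans_per_row[i]) for i, req_id in enumerate(req_ids)}
```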

vladmihailescu added a commit to vladmihailescu/vllm that referenced this pull request Jun 19, 2025
@vladmihailescu
Contributor Author

@yeqcharlotte @houseroad I added a unit test and gated the computation behind an env var that is off by default. I ran a few benchmarks with loadgen and there is no significant throughput/latency difference at high concurrency, but I didn't look into traces. Are you suggesting looking at GPU traces to check the overhead, or something else? If the load test is enough, I think we can also do a shadow run internally after this lands.

Collaborator

@houseroad houseroad left a comment

Looks good to me. Thanks for adding this metric.

@houseroad houseroad added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 19, 2025
vladmihailescu added a commit to vladmihailescu/vllm that referenced this pull request Jun 20, 2025
@vladmihailescu
Contributor Author

V1 test timed out. Rebasing after the fix (#19872)

```
@@ -1826,6 +1831,25 @@ def _get_prompt_logprobs_dict(

        return prompt_logprobs_dict

    def _get_nans_in_logits(
        self,
        logits: torch.Tensor,
```
Collaborator

is logits optionally None here or not?

Comment on lines 1839 to 1850
```python
        num_nans_in_logits = {}
        num_nans_for_index = None
        if logits is not None:
            num_nans_for_index = logits.isnan().sum(dim=-1).cpu().numpy()
        for req_id in self.input_batch.req_ids:
            req_index = self.input_batch.req_id_to_index[req_id]
            num_nans_in_logits[req_id] = (
                int(num_nans_for_index[req_index])
                if logits is not None and num_nans_for_index is not None
                and req_index < logits.shape[0] else 0)
        return num_nans_in_logits
```
Collaborator

@yeqcharlotte yeqcharlotte Jun 20, 2025

nit: you've got two `if logits is not None` checks here, one of them inside the loop. For better readability it might make sense to do the following:

```python
if logits is None:
    return {req_id: 0 for ...}
num_nans_for_index = logits.isnan().sum(dim=-1).cpu().numpy()
# count nans
```
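For reference, a hedged sketch of the whole helper with that early return applied; it mirrors the snippet under review above and is not necessarily the exact code that landed.

```python
# Sketch: a GPU model runner method, shown out of its class for brevity.
from typing import Optional

import torch

def _get_nans_in_logits(
    self,
    logits: Optional[torch.Tensor],
) -> dict[str, int]:
    if logits is None:
        # No logits this step: report zero NaNs for every request.
        return {req_id: 0 for req_id in self.input_batch.req_ids}
    num_nans_for_index = logits.isnan().sum(dim=-1).cpu().numpy()
    num_nans_in_logits = {}
    for req_id in self.input_batch.req_ids:
        req_index = self.input_batch.req_id_to_index[req_id]
        num_nans_in_logits[req_id] = (
            int(num_nans_for_index[req_index])
            if req_index < logits.shape[0] else 0)
    return num_nans_in_logits
```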

Collaborator

@yeqcharlotte yeqcharlotte left a comment

Thanks for adding this! Leaving some small nits only.

vladmihailescu added a commit to vladmihailescu/vllm that referenced this pull request Jun 20, 2025
@vladmihailescu
Contributor Author

Addressed nits

@houseroad houseroad merged commit 2e3e3c8 into vllm-project:main Jun 20, 2025
67 checks passed
chris-relational pushed a commit to chris-relational/vllm that referenced this pull request Jun 20, 2025
yeqcharlotte pushed a commit to yeqcharlotte/vllm that referenced this pull request Jun 22, 2025
juncheoll pushed a commit to juncheoll/vllm that referenced this pull request Jun 23, 2025
fhl2000 pushed a commit to fhl2000/vllm that referenced this pull request Jun 25, 2025
gmarinho2 pushed a commit to gmarinho2/vllm that referenced this pull request Jun 26, 2025
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
wwl2755-google pushed a commit to wwl2755-google/vllm that referenced this pull request Jul 1, 2025