-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
Export NaNs in logits to scheduler_stats if output is corrupted #18777
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
This pull request was exported from Phabricator. Differential Revision: D75423285 |
ec80866
to
f1f3d22
Compare
Keep in mind that the logits needs to be serialized from model runner back to the scheduler via RPC/collectives. So doing the counting in model runner will be better. |
This pull request was exported from Phabricator. Differential Revision: D75423285 |
f1f3d22
to
b985bba
Compare
b985bba
to
15b58d4
Compare
This pull request was exported from Phabricator. Differential Revision: D75423285 |
15b58d4
to
fc7d538
Compare
fc7d538
to
957944f
Compare
This pull request was exported from Phabricator. Differential Revision: D75423285 |
957944f
to
7282ec4
Compare
It doesn't seem like we actually need an exact count of nans - we just want a signal that corruption is spiking? What happens the request when this happens? Is the request aborted, or ..? Could we do something similar to the |
Some overlap with #18765 ... except I'm not sure this NaN case results in the request explicitly failing? |
@markmc with NaNs the requests won't fail by default. With the existing behaviour the output of the requests will just be gibberish and that's why I'm not sure if we should add a new finish reason because NaNs will not forcefully fail the request. Internally we observe 2 behaviours in which NaNs appear:
I am trying to add now the per request NaNs and maybe have a bool per request telling if it's corrupted or not, but not sure if we should make it a finish reason. |
7282ec4
to
eed0d23
Compare
55eed71
to
04839fa
Compare
lgtm @vladmihailescu could you update the test plan? any data to show this is of low load overhead? possible to add a unit test to ensure it works? cc: @simon-mo @WoosukKwon @njhill @houseroad could you help to take a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a environment var to control this? I am a bit concerned that it may cause some overhead. Or we can do some benchmark to confirm the overhead.
04839fa
to
7b3c0b1
Compare
…-project#18777) Summary: Pull Request resolved: vllm-project#18777 Signed-off-by: Vlad Mihailescu <vladmihailescu@meta.com> Report nan in logits in scheduler_stats. This can be used later to bump Phrometeus counter but for now this is required so we can export it in our internal counter infra. This counter is used to identify bad hosts or bad GPUs which cause NaNs in logits during model forward passes. It's a common metric we expose internally. Reviewed By: Adolfo-Karim Differential Revision: D75423285 Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com>
@yeqcharlotte @houseroad I added an unit test and gated the computation behind an env var which is off by default. I ran few benchmarks with loadgen and there's no significant throughput/latency difference at high concurrency but I didn't look into traces. Are you suggesting looking at gpu traces to check the overhead or something else? If the loadtest is enough I think we can also do shadow internally after this lands. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thanks for adding this metrics.
…-project#18777) Summary: Pull Request resolved: vllm-project#18777 Signed-off-by: Vlad Mihailescu <vladmihailescu@meta.com> Report nan in logits in scheduler_stats. This can be used later to bump Phrometeus counter but for now this is required so we can export it in our internal counter infra. This counter is used to identify bad hosts or bad GPUs which cause NaNs in logits during model forward passes. It's a common metric we expose internally. Reviewed By: Adolfo-Karim Differential Revision: D75423285 Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com>
7b3c0b1
to
8c7244c
Compare
V1 test timed out. Rebasing after the fix (#19872) |
vllm/v1/worker/gpu_model_runner.py
Outdated
@@ -1826,6 +1831,25 @@ def _get_prompt_logprobs_dict( | |||
|
|||
return prompt_logprobs_dict | |||
|
|||
def _get_nans_in_logits( | |||
self, | |||
logits: torch.Tensor, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is logits
optionally None here or not?
vllm/v1/worker/gpu_model_runner.py
Outdated
num_nans_in_logits = {} | ||
num_nans_for_index = None | ||
if logits is not None: | ||
num_nans_for_index = logits.isnan().sum(dim=-1).cpu().numpy() | ||
for req_id in self.input_batch.req_ids: | ||
req_index = self.input_batch.req_id_to_index[req_id] | ||
num_nans_in_logits[req_id] = ( | ||
int(num_nans_for_index[req_index]) | ||
if logits is not None and num_nans_for_index is not None | ||
and req_index < logits.shape[0] else 0) | ||
return num_nans_in_logits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you got 2 if logits is not none
check here and one in the loop. for better readability it might make sense to do the following:
if logits is None:
return {req_id: 0 for ...}
num_nans_for_index = logits.isnan().sum(dim=-1).cpu().numpy()
# count nans
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for adding this! leaving some small nits only.
…-project#18777) Summary: Pull Request resolved: vllm-project#18777 Signed-off-by: Vlad Mihailescu <vladmihailescu@meta.com> Report nan in logits in scheduler_stats. This can be used later to bump Phrometeus counter but for now this is required so we can export it in our internal counter infra. This counter is used to identify bad hosts or bad GPUs which cause NaNs in logits during model forward passes. It's a common metric we expose internally. Reviewed By: Adolfo-Karim Differential Revision: D75423285 Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com>
8c7244c
to
451048e
Compare
Addressed nits |
…-project#18777) Summary: Pull Request resolved: vllm-project#18777 Signed-off-by: Vlad Mihailescu <vladmihailescu@meta.com> Report nan in logits in scheduler_stats. This can be used later to bump Phrometeus counter but for now this is required so we can export it in our internal counter infra. This counter is used to identify bad hosts or bad GPUs which cause NaNs in logits during model forward passes. It's a common metric we expose internally. Reviewed By: Adolfo-Karim Differential Revision: D75423285 Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com>
451048e
to
3bb49c3
Compare
Signed-off-by: nie3e <adrcwiek@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> added notebooks to playground updates remoted verbatim HF secrets from all files updates [custom_op][vllm-plugin] update custom_op class to use op_registry (vllm-project#19164) Signed-off-by: Chendi.Xue <chendi.xue@intel.com> Export NaNs in logits to scheduler_stats if output is corrupted (vllm-project#18777) Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com> [CPU][CI] Fallback sliding window to v0 and fix CPU pooling model tests (vllm-project#19901) Signed-off-by: jiang1.li <jiang1.li@intel.com> [Kernel] mark TorchSDPABackend swap_blocks NotImplementedError (vllm-project#19749)
…-project#18777) Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com>
…-project#18777) Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com> Signed-off-by: juncheoll <th6re8e@naver.com>
…-project#18777) Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com> Signed-off-by: fhl <2410591650@qq.com>
…-project#18777) Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com>
…-project#18777) Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com>
…-project#18777) Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com> Signed-off-by: Will Eaton <weaton@redhat.com>
…-project#18777) Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com>
…-project#18777) Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com>
Signed-off-by: Vlad Mihailescu vladmihailescu@meta.com
Summary:
Report nan in logits in scheduler_stats. This can be used later exported as Phrometeus counter but for now this is required so we can export it in our internal counter infra.
This counter is used to identify bad hosts or bad GPUs which cause NaNs in logits.
It's a common metric we expose.
Reviewed By: Adolfo-Karim
Differential Revision: D75423285
And unit test