Your current environment
The output of python collect_env.py
INFO 06-22 19:10:27 [__init__.py:239] Automatically detected platform cuda.
Collecting environment information...
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.35
Python version: 3.12.10 (main, Apr 9 2025, 08:55:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.134-14.zncgsl6.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L2
Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
Model name: Intel(R) Xeon(R) Gold 6330N CPU @ 2.20GHz
BIOS Model name: Intel(R) Xeon(R) Gold 6330N CPU @ 2.20GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 28
Socket(s): 1
Stepping: 6
BogoMIPS: 4400.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 1.3 MiB (28 instances)
L1i cache: 896 KiB (28 instances)
L2 cache: 35 MiB (28 instances)
L3 cache: 42 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-55
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Vulnerable
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] flashinfer-python==0.2.1.post2+cu124torch2.6
[pip3] numpy==2.2.5
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.4.0
[pip3] torch==2.6.0
[pip3] torchaudio==2.6.0
[pip3] torchvision==0.21.0
[pip3] transformers==4.51.3
[pip3] triton==3.2.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.5
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-55 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536
CUDA_VERSION=12.4.0
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NCCL_VERSION=2.20.5-1
NVIDIA_PRODUCT_NAME=CUDA
VLLM_USAGE_SOURCE=production-docker-image
VLLM_HOST_IP=156.176.0.154
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
The ray command is executed using the vllm/vllm-openai:v0.8.5 image. The command is simplified from run_cluster.sh for debugging:
ctr -n k8s.io run --rm --net-host --privileged --runc-binary /usr/local/nvidia/toolkit/nvidia-container-runtime --env NVIDIA_VISIBLE_DEVICES=all --env VLLM_HOST_IP=156.176.0.154 --mount type=bind,src=/datadisk0/hobin,dst=/root/.cache/huggingface,options=rbind:rw docker.io/vllm/vllm-openai:v0.8.5 hobin bash -c "ray start --block --verbose --address=156.176.0.153:6379"
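For reference, the head node at 156.176.0.153 is assumed to have been started the same way, following run_cluster.sh, roughly like the sketch below (the container name and the head-node env values are illustrative, not taken from the actual setup):
# Hypothetical head-node start, mirroring the worker command above
ctr -n k8s.io run --rm --net-host --privileged --runc-binary /usr/local/nvidia/toolkit/nvidia-container-runtime --env NVIDIA_VISIBLE_DEVICES=all --env VLLM_HOST_IP=156.176.0.153 --mount type=bind,src=/datadisk0/hobin,dst=/root/.cache/huggingface,options=rbind:rw docker.io/vllm/vllm-openai:v0.8.5 hobin-head bash -c "ray start --block --verbose --head --port=6379"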
The worker node joins successfully for a few seconds and then exits for no apparent reason.
On the master node, the console output looks like this:
root@paas-controller-1:/vllm-workspace# ray status
======== Autoscaler status: 2025-06-22 18:45:44.434933 ========
Node status
---------------------------------------------------------------
Active:
1 node_43ceaa6cfd5d34f5d52f7baa5edfe7cb57b2c9afc8c09116acfe6882
1 node_484ca76fc03f637b1e986210cb0b69ac31d12f33457145930ba98f64
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/112.0 CPU
0.0/2.0 GPU
0B/161.08GiB memory
0B/18.63GiB object_store_memory
Demands:
(no resource demands)
On the worker node, the console output looks like this:
[root@minion-1:~]$ ctr -n k8s.io run --rm --net-host --privileged --runc-binary /usr/local/nvidia/toolkit/nvidia-container-runtime --env NVIDIA_VISIBLE_DEVICES=all --env VLLM_HOST_IP=156.176.0.154 --mount type=bind,src=/datadisk0/hobin,dst=/root/.cache/huggingface,options=rbind:rw docker.io/vllm/vllm-openai:v0.8.5 hobin bash -c "ray start --block --verbose --address=156.176.0.153:6379"
2025-06-22 19:13:41,378 WARNING services.py:2072 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
[2025-06-22 19:13:41,393 W 1 1] global_state_accessor.cc:429: Retrying to get node with node ID 07108a68812f92482d474e63d2c6904d0b4e381f9e0c23f005f40e89
[2025-06-22 19:13:42,393 W 1 1] global_state_accessor.cc:429: Retrying to get node with node ID 07108a68812f92482d474e63d2c6904d0b4e381f9e0c23f005f40e89
2025-06-22 19:13:41,308 INFO scripts.py:1047 -- Local node IP: 156.176.0.154
2025-06-22 19:13:43,395 SUCC scripts.py:1063 -- --------------------
2025-06-22 19:13:43,396 SUCC scripts.py:1064 -- Ray runtime started.
2025-06-22 19:13:43,396 SUCC scripts.py:1065 -- --------------------
2025-06-22 19:13:43,396 INFO scripts.py:1067 -- To terminate the Ray runtime, run
2025-06-22 19:13:43,396 INFO scripts.py:1068 -- ray stop
2025-06-22 19:13:43,396 INFO scripts.py:1076 -- --block
2025-06-22 19:13:43,396 INFO scripts.py:1077 -- This command will now block forever until terminated by a signal.
2025-06-22 19:13:43,396 INFO scripts.py:1080 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
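Note the /dev/shm warning at the start of the worker output: the container only gets the default 64 MiB, so Ray's object store falls back to /tmp. A minimal sketch of enlarging it, mirroring what run_cluster.sh does with docker run (the --shm-size flag belongs to docker run, as the warning itself suggests; the ctr variant at the end is an unverified assumption):
# Docker equivalent with a larger /dev/shm (values illustrative; the warning
# recommends more than 30% of available RAM)
docker run --entrypoint /bin/bash --network host --gpus all --shm-size 10.24g \
  -e VLLM_HOST_IP=156.176.0.154 \
  -v /datadisk0/hobin:/root/.cache/huggingface \
  docker.io/vllm/vllm-openai:v0.8.5 \
  -c "ray start --block --verbose --address=156.176.0.153:6379"
# Possible ctr equivalent: add an OCI tmpfs mount on /dev/shm to the existing command
# (syntax is an assumption, check ctr/containerd docs)
#   --mount type=tmpfs,src=tmpfs,dst=/dev/shm,options=nosuid:nodev:size=10g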
The Ray-related logs are given below.
# On the master node, /tmp/ray/session_2025-06-23_01-33-26_134828_1/logs/events/event_GCS.log
{"custom_fields":{"ip":"156.176.0.154","node_id":"ada3a3771fac751bd7c6b674e3b5e2df045b93550e6d31a410cad0be"},"event_id":"1820787c7179bb878444735ecb267c1bea04","host_name":"paas-controller-1","label":"RAY_NODE_REMOVED","message":"The node with node id: ada3a3771fac751bd7c6b674e3b5e2df045b93550e6d31a410cad0be and address: 156.176.0.154 and node name: 156.176.0.154 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a \t(1) raylet crashes unexpectedly (OOM, etc.) \\n\t(2) raylet has lagging heartbeats due to slow network or busy workload.","pid":"65","severity":"ERROR","source_type":"GCS","timestamp":1750669108}
# On the worker node, this log shows that it was mistakenly marked dead
{"custom_fields":{"node_id":"a418db09539dd8e5b9800eb27416ecf0f8f62df5a1c720f0484d96ca"},"event_id":"03cdd9ea75bc4aceb7d23e840be2f8a4612a","host_name":"minion-1","label":"RAYLET_MARKED_DEAD","message":"[Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS failed to check the health of this node for 5 times. This is likely because the machine or raylet has become overloaded.","pid":"89","severity":"FATAL","source_type":"RAYLET","timestamp":1750683150}
{"custom_fields":{"node_id":"a418db09539dd8e5b9800eb27416ecf0f8f62df5a1c720f0484d96ca"},"event_id":"f9b18badad7ebe7db07c0edc11ce4463d962","host_name":"minion-1","label":"RAY_FATAL_CHECK_FAILED","message":"src/ray/raylet/node_manager.cc:1014 (PID: 89, TID: 89, errno: 2 (No such file or directory)):[Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS failed to check the health of this node for 5 times. This is likely because the machine or raylet has become overloaded.\\n*** StackTrace Information ***\\n/usr/local/lib/python3.12/dist-packages/ray/core/src/ray/raylet/raylet(+0xd9c86a) [0x55fafff4a86a] ray::operator<<()\\n/usr/local/lib/python3.12/dist-packages/ray/core/src/ray/raylet/raylet(+0xd9ece0) [0x55fafff4cce0] ray::RayLog::~RayLog()\\n/usr/local/lib/python3.12/dist-packages/ray/core/src/ray/raylet/raylet(+0x331a62) [0x55faff4dfa62] ray::raylet::NodeManager::NodeRemoved()\\n/usr/local/lib/python3.12/dist-packages/ray/core/src/ray/raylet/raylet(+0x560335) [0x55faff70e335] ray::gcs::NodeInfoAccessor::HandleNotification()\\n/usr/local/lib/python3.12/dist-packages/ray/core/src/ray/raylet/raylet(+0x6de878) [0x55faff88c878] EventTracker::RecordExecution()\\n/usr/local/lib/python3.12/dist-packages/ray/core/src/ray/raylet/raylet(+0x6d986e) [0x55faff88786e] std::_Function_handler<>::_M_invoke()\\n/usr/local/lib/python3.12/dist-packages/ray/core/src/ray/raylet/raylet(+0x6d9ce6) [0x55faff887ce6] boost::asio::detail::completion_handler<>::do_complete()\\n/usr/local/lib/python3.12/dist-packages/ray/core/src/ray/raylet/raylet(+0xda995b) [0x55fafff5795b] boost::asio::detail::scheduler::do_run_one()\\n/usr/local/lib/python3.12/dist-packages/ray/core/src/ray/raylet/raylet(+0xdabee9) [0x55fafff59ee9] boost::asio::detail::scheduler::run()\\n/usr/local/lib/python3.12/dist-packages/ray/core/src/ray/raylet/raylet(+0xdac402) [0x55fafff5a402] boost::asio::io_context::run()\\n/usr/local/lib/python3.12/dist-packages/ray/core/src/ray/raylet/raylet(+0x1efecf) [0x55faff39decf] main\\n/usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f204b447d90]\\n/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f204b447e40] __libc_start_main\\n/usr/local/lib/python3.12/dist-packages/ray/core/src/ray/raylet/raylet(+0x24a387) [0x55faff3f8387]\\n","pid":"89","severity":"FATAL","source_type":"RAYLET","timestamp":1750683150}
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.