Description
Type: Bug
The program hangs after using the debugger
After setting a breakpoint in a Jupyter notebook cell and then running the cell under the debugger, the program hangs in a subprocess.
Steps to Reproduce:
- The Kaggle GPU Docker image v160 is used (I also tested v158 and v159).
- Install the packages:
  !pip install -U ipykernel
  !pip install vllm==0.9.1  # tested on 0.7.2, 0.8.5.post1, and 0.9.1
- Create a Jupyter notebook with the code shown below. Runtime environment: Python 3.11.11.
- Set a breakpoint on the last line of the cell, then run it in debug mode (requires an NVIDIA GPU, RTX 30xx series or newer).
VS Code version: Code 1.101.0 (dfaf44141ea9deb3b4096f7cd6d24e00c147a4b1, 2025-06-11T15:00:50.123Z)
OS version: Windows_NT x64 10.0.26100
Modes:
Remote OS version: Linux x64 6.8.0-60-generic
Remote OS version: Linux x64 6.8.0-60-generic
System Info
Item | Value |
---|---|
CPUs | Intel(R) Core(TM) Ultra 5 125H (18 x 2995) |
GPU Status | 2d_canvas: enabled canvas_oop_rasterization: enabled_on direct_rendering_display_compositor: disabled_off_ok gpu_compositing: enabled multiple_raster_threads: enabled_on opengl: enabled_on rasterization: enabled raw_draw: disabled_off_ok skia_graphite: disabled_off video_decode: enabled video_encode: enabled vulkan: disabled_off webgl: enabled webgl2: enabled webgpu: enabled webnn: disabled_off |
Load (avg) | undefined |
Memory (System) | 23.47GB (12.40GB free) |
Process Argv | --crash-reporter-id dc65549e-5a3d-4f34-a3ab-d2ec4803df70 |
Screen Reader | no |
VM | 0% |
Item | Value |
---|---|
Remote | SSH: Zoo |
OS | Linux x64 6.8.0-60-generic |
CPUs | 13th Gen Intel(R) Core(TM) i5-13600K (20 x 1006) |
Memory (System) | 125.54GB (119.80GB free) |
VM | 0% |
Item | Value |
---|---|
Remote | Container gcr.io/kaggle-gpu-images/python:v160-cu125-pytorch-2.7.0_vllm_091 (kgl_gpu_160_1) @ Zoo |
OS | Linux x64 6.8.0-60-generic |
CPUs | 13th Gen Intel(R) Core(TM) i5-13600K (20 x 998) |
Memory (System) | 125.54GB (119.80GB free) |
VM | 0% |
Extensions (23)
Extension | Author (truncated) | Version |
---|---|---|
jupyter-keymap | ms- | 1.1.2 |
remote-containers | ms- | 0.417.0 |
remote-ssh | ms- | 0.120.0 |
remote-ssh-edit | ms- | 0.87.0 |
vscode-remote-extensionpack | ms- | 0.26.0 |
remote-explorer | ms- | 0.5.0 |
remote-server | ms- | 1.5.2 |
docker | doc | 0.10.0 |
vscode-containers | ms- | 2.0.3 |
vscode-docker | ms- | 2.0.0 |
debugpy | ms- | 2025.8.0 |
python | ms- | 2025.6.1 |
vscode-pylance | ms- | 2025.5.1 |
datawrangler | ms- | 1.22.0 |
jupyter | ms- | 2025.4.1 |
jupyter-keymap | ms- | 1.1.2 |
jupyter-renderers | ms- | 1.1.0 |
vscode-jupyter-cell-tags | ms- | 0.1.9 |
vscode-jupyter-powertoys | ms- | 0.1.1 |
vscode-jupyter-slideshow | ms- | 0.1.6 |
markdown-preview-enhanced | shd | 0.8.18 |
intellicode-api-usage-examples | Vis | 0.2.9 |
vscodeintellicode | Vis | 1.3.2 |
A/B Experiments
vsliv368:30146709
vspor879:30202332
vspor708:30202333
vspor363:30204092
vscod805:30301674
binariesv615:30325510
c4g48928:30535728
azure-dev_surveyone:30548225
962ge761:30959799
h48ei257:31000450
pythontbext0:30879054
cppperfnew:31000557
dwnewjupytercf:31046870
pythonrstrctxt:31112756
nativeloc1:31192215
5fd0e150:31155592
dwcopilot:31170013
6074i472:31201624
dwoutputs:31242946
customenabled:31248079
hdaa2157:31222309
copilot_t_ci:31222730
e5gg6876:31282496
pythoneinst12:31285622
bgtreat:31268568
4gafe986:31271826
c7cif404:31314491
996jf627:31283433
pythonrdcb7:31303018
usemplatestapi:31297334
0aa6g176:31307128
7bj51361:31289155
747dc170:31275177
pylancecolor:31314202
aj953862:31281341
generatesymbolt:31295002
convertfstringf:31295003
gendocf:31295004
pylancequickfixf:31319675
0g0a1943:31327026
import os
import transformers; print('Transformers version:', transformers.__version__)
import torch; print('Torch version:', torch.__version__)
import vllm; print('vLLM version:', vllm.__version__)
from vllm import LLM

container_date = os.environ.get('BUILD_DATE', '').split('-')[0]
print(f"Kaggle Docker BUILD_DATE={container_date}")

!cat /etc/os-release | grep -oP "PRETTY_NAME=\"\K([^\"]*)" && uname -r
!free -h
!nv_version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)" && echo "My NVIDIA driver version is '${nv_version}'."
!ls -l /usr/local | grep cuda

is_debug = True

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
os.environ["TRITON_PTXAS_PATH"] = "/usr/local/cuda/bin/ptxas"

if is_debug:
    os.environ['OMP_NUM_THREADS'] = '1'
    os.environ['MKL_NUM_THREADS'] = '1'
    os.environ['POLARS_ALLOW_FORKING_THREAD'] = '1'
    os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
    os.environ['TORCH_USE_CUDA_DSA'] = "1"
    os.environ['OPENBLAS_NUM_THREADS'] = '1'
    os.environ["NUM_INTER_THREADS"] = "1"
    os.environ["NUM_INTRA_THREADS"] = "1"
    os.environ["XLA_FLAGS"] = ("--xla_cpu_multi_thread_eigen=false "
                               "intra_op_parallelism_threads=1")
    os.environ["TOKENIZERS_PARALLELISM"] = "false"

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = 'expandable_segments:True'

llm = LLM(
    model="Qwen/Qwen3-0.6B",
    max_num_seqs=1,
    max_model_len=1024,
    trust_remote_code=True,
    enable_prefix_caching=True,
    dtype=torch.half,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.96,
    enforce_eager=True,
    seed=2024,
)
tokenizer = llm.get_tokenizer()
print("LLM Started")
- When running under the debugger, the following message appears in the log (it does not appear when running without the debugger):
77.40s - pydevd: Sending message related to process being replaced timed-out after 5 seconds
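For context, pydevd appears to hook the exec family of calls so it can follow child processes; "process being replaced" refers to a process image being swapped out via exec, and the message means pydevd's notification to the IDE went unanswered for 5 seconds. An illustrative sketch of the kind of call being intercepted (guarded by a made-up DEMO_EXEC variable so it does nothing when pasted):

# Illustration only: exec replaces the current process image. Under the
# debugger, pydevd notifies the IDE before letting the exec proceed; the
# 5-second timeout above is that notification going unanswered.
import os
import sys

if os.environ.get("DEMO_EXEC") == "1":   # hypothetical guard, off by default
    os.execv(sys.executable, [sys.executable, "-c", "print('replaced')"])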
- When you press Ctrl+C, the following dump appears; it is clear that the process is hanging in a subprocess:
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
/tmp/ipykernel_2264/2697110242.py in <cell line: 0>()
34 os.environ["PYTORCH_CUDA_ALLOC_CONF"]='expandable_segments:True'
35
---> 36 llm = LLM(
37 model="Qwen/Qwen3-0.6B",
38 max_num_seqs=1,
/usr/local/lib/python3.11/dist-packages/vllm/entrypoints/llm.py in __init__(self, model, task, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, allowed_local_media_path, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, hf_token, hf_overrides, mm_processor_kwargs, override_pooler_config, compilation_config, **kwargs)
241
242 # Create the Engine (autoselects V0 vs V1)
--> 243 self.llm_engine = LLMEngine.from_engine_args(
244 engine_args=engine_args, usage_context=UsageContext.LLM_CLASS)
245 self.engine_class = type(self.llm_engine)
/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py in from_engine_args(cls, engine_args, usage_context, stat_loggers)
492 """Creates an LLM engine from the engine arguments."""
493 # Create the engine configs.
--> 494 vllm_config = engine_args.create_engine_config(usage_context)
495
496 engine_cls = cls
/usr/local/lib/python3.11/dist-packages/vllm/engine/arg_utils.py in create_engine_config(self, usage_context)
1016
1017 device_config = DeviceConfig(device=current_platform.device_type)
-> 1018 model_config = self.create_model_config()
1019
1020 # * If VLLM_USE_V1 is unset, we enable V1 for "supported features"
/usr/local/lib/python3.11/dist-packages/vllm/engine/arg_utils.py in create_model_config(self)
908 self.load_format = LoadFormat.RUNAI_STREAMER
909
--> 910 return ModelConfig(
911 model=self.model,
912 hf_config_path=self.hf_config_path,
[... skipping hidden 1 frame]
/usr/local/lib/python3.11/dist-packages/vllm/config.py in __post_init__(self)
546 self.model, hf_token=self.hf_token, revision=self.revision)
547
--> 548 supported_tasks, task = self._resolve_task(self.task)
549 self.supported_tasks = supported_tasks
550 self.task = task
/usr/local/lib/python3.11/dist-packages/vllm/config.py in _resolve_task(self, task_option)
796 # NOTE: Listed from highest to lowest priority,
797 # in case the model supports multiple of them
--> 798 "transcription": registry.is_transcription_model(architectures),
799 "generate": registry.is_text_generation_model(architectures),
800 "pooling": registry.is_pooling_model(architectures),
/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/registry.py in is_transcription_model(self, architectures)
556 architectures: Union[str, list[str]],
557 ) -> bool:
--> 558 model_cls, _ = self.inspect_model_cls(architectures)
559 return model_cls.supports_transcription
560
/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/registry.py in inspect_model_cls(self, architectures)
470
471 for arch in architectures:
--> 472 model_info = self._try_inspect_model_cls(arch)
473 if model_info is not None:
474 return (model_info, arch)
/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/registry.py in _try_inspect_model_cls(self, model_arch)
443 return None
444
--> 445 return _try_inspect_model_cls(model_arch, self.models[model_arch])
446
447 def _normalize_archs(
/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/registry.py in _try_inspect_model_cls(model_arch, model)
363 ) -> Optional[_ModelInfo]:
364 try:
--> 365 return model.inspect_model_cls()
366 except Exception:
367 logger.exception("Error in inspecting model architecture '%s'",
/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/registry.py in inspect_model_cls(self)
334 # Performed in another process to avoid initializing CUDA
335 def inspect_model_cls(self) -> _ModelInfo:
--> 336 return _run_in_subprocess(
337 lambda: _ModelInfo.from_model_cls(self.load_model_cls()))
338
/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/registry.py in _run_in_subprocess(fn)
590 # cannot use `sys.executable __file__` here because the script
591 # contains relative imports
--> 592 returned = subprocess.run(_SUBPROCESS_COMMAND,
593 input=input_bytes,
594 capture_output=True)
/usr/lib/python3.11/subprocess.py in run(input, capture_output, timeout, check, *popenargs, **kwargs)
548 with Popen(*popenargs, **kwargs) as process:
549 try:
--> 550 stdout, stderr = process.communicate(input, timeout=timeout)
551 except TimeoutExpired as exc:
552 process.kill()
/usr/lib/python3.11/subprocess.py in communicate(self, input, timeout)
1207
1208 try:
-> 1209 stdout, stderr = self._communicate(input, endtime, timeout)
1210 except KeyboardInterrupt:
1211 # https://bugs.python.org/issue25942
/usr/lib/python3.11/subprocess.py in _communicate(self, input, endtime, orig_timeout)
2113 'failed to raise TimeoutExpired.')
2114
-> 2115 ready = selector.select(timeout)
2116 self._check_timeout(endtime, orig_timeout, stdout, stderr)
2117
/usr/lib/python3.11/selectors.py in select(self, timeout)
413 ready = []
414 try:
--> 415 fd_event_list = self._selector.poll(timeout)
416 except InterruptedError:
417 return ready
KeyboardInterrupt:
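The frames above localize the hang: vLLM inspects the model architecture in a child interpreter (the registry comment notes this is done in another process to avoid initializing CUDA), and the parent then blocks in process.communicate() / selector.poll() waiting for the child's pipes to close. A simplified sketch of that pattern (the real _SUBPROCESS_COMMAND differs, as the registry cannot use -c because of relative imports; note there is no timeout on the call, which is why only Ctrl+C gets out):

# Simplified sketch of vllm registry._run_in_subprocess as seen in the
# traceback. The parent blocks here with no timeout, so if the debugger
# stalls the child, the parent waits indefinitely.
import subprocess
import sys

def run_in_subprocess(code: str) -> bytes:
    completed = subprocess.run(
        [sys.executable, "-c", code],  # fresh interpreter, no CUDA init
        input=b"",
        capture_output=True,           # parent polls the pipes until EOF
    )
    return completed.stdout

print(run_in_subprocess("print('model inspected')").decode())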
- Having previously tried every working combination of vLLM and Kaggle Docker image versions, I came to the unambiguous conclusion that the problem is in VS Code, not in vLLM or the container.
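Until this is fixed, one possible workaround (untested, just a sketch) is to keep the debugger detached while the engine starts and attach only afterwards, so pydevd never has to follow vLLM's model-inspection subprocess; debugpy's attach API supports this:

# Possible workaround sketch (untested): construct the engine with no
# debugger attached, then open a debug port and attach from VS Code.
import debugpy
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-0.6B", max_model_len=1024)  # runs undebugged

debugpy.listen(5678)        # now expose a debug adapter port
debugpy.wait_for_client()   # attach with VS Code's "Remote Attach"
debugpy.breakpoint()        # execution stops here once attached
print("LLM Started")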
Log
Transformers version: 4.51.3
Torch version: 2.7.0+cu126
INFO 06-12 21:02:13 [__init__.py:244] Automatically detected platform cuda.
2025-06-12 21:02:13.189772: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-06-12 21:02:13.197188: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1749762133.205913 4385 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1749762133.208566 4385 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-06-12 21:02:13.217907: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
vLLM version: 0.9.1
Kaggle Docker BUILD_DATE=20250508
Ubuntu 22.04.4 LTS
6.8.0-60-generic
total used free shared buff/cache available
Mem: 125Gi 4.3Gi 17Gi 9.0Mi 103Gi 119Gi
Swap: 8.0Gi 0.0Ki 8.0Gi
My NVIDIA driver version is '560.35.03'.
lrwxrwxrwx 1 root root 22 Jul 10 2024 cuda -> /etc/alternatives/cuda
lrwxrwxrwx 1 root root 25 Jul 10 2024 cuda-12 -> /etc/alternatives/cuda-12
drwxr-xr-x 1 root root 4096 Jul 10 2024 cuda-12.5
INFO 06-12 21:02:23 [config.py:823] This model supports multiple tasks: {'reward', 'generate', 'classify', 'score', 'embed'}. Defaulting to 'generate'.
WARNING 06-12 21:02:23 [config.py:3271] Casting torch.bfloat16 to torch.float16.
INFO 06-12 21:02:23 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 06-12 21:02:23 [config.py:2232] max_num_batched_tokens (8192) exceeds max_num_seqs* max_model_len (1024). This may lead to unexpected behavior.
WARNING 06-12 21:02:23 [cuda.py:91] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 06-12 21:02:25 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 06-12 21:02:27 [__init__.py:244] Automatically detected platform cuda.
2025-06-12 21:02:27.403391: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1749762147.413643 48564 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1749762147.416761 48564 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 06-12 21:02:30 [core.py:455] Waiting for init message from front-end.
INFO 06-12 21:02:30 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='Qwen/Qwen3-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=2024, served_model_name=Qwen/Qwen3-0.6B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
WARNING 06-12 21:02:30 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x710235449c10>
INFO 06-12 21:02:31 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 06-12 21:02:31 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 06-12 21:02:31 [gpu_model_runner.py:1595] Starting to load model Qwen/Qwen3-0.6B...
INFO 06-12 21:02:31 [gpu_model_runner.py:1600] Loading model from scratch...
INFO 06-12 21:02:31 [logger.py:59] Using Flash Attention backend on V1 engine.
INFO 06-12 21:02:32 [weight_utils.py:292] Using model weights format ['*.safetensors']
INFO 06-12 21:02:32 [weight_utils.py:345] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.71it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.71it/s]
INFO 06-12 21:02:33 [default_loader.py:272] Loading weights took 0.60 seconds
INFO 06-12 21:02:33 [gpu_model_runner.py:1624] Model loading took 1.1201 GiB and 1.735851 seconds
INFO 06-12 21:02:34 [gpu_worker.py:227] Available KV cache memory: 21.20 GiB
INFO 06-12 21:02:34 [kv_cache_utils.py:715] GPU KV cache size: 198,448 tokens
INFO 06-12 21:02:34 [kv_cache_utils.py:719] Maximum concurrency for 1,024 tokens per request: 193.80x
INFO 06-12 21:02:34 [core.py:171] init engine (profile, create kv cache, warmup model) took 1.06 seconds
LLM Started