
The program hangs after using the debugger #16743

Open
@kgboyko

Description

Type: Bug

After setting a breakpoint in a Jupyter notebook cell and running the cell in debug mode, the program hangs inside a subprocess call.

Steps to Reproduce:

  1. Use the Kaggle Docker GPU image v160 (also reproduced on v158 and v159).
  2. ! pip install -U ipykernel
     ! pip install vllm==0.9.1  # (tested on 0.7.2, 0.8.5.post1, 0.9.1)
  3. Create a Jupyter notebook with the code below. Runtime environment: Python 3.11.11.
  4. Set a breakpoint at the last line, then run the cell in debug mode (requires an NVIDIA GPU, 3xxx series or newer).
    VS Code version: Code 1.101.0 (dfaf44141ea9deb3b4096f7cd6d24e00c147a4b1, 2025-06-11T15:00:50.123Z)
    OS version: Windows_NT x64 10.0.26100
    Remote OS version: Linux x64 6.8.0-60-generic
System Info
CPUs: Intel(R) Core(TM) Ultra 5 125H (18 x 2995)
GPU Status:
  2d_canvas: enabled
  canvas_oop_rasterization: enabled_on
  direct_rendering_display_compositor: disabled_off_ok
  gpu_compositing: enabled
  multiple_raster_threads: enabled_on
  opengl: enabled_on
  rasterization: enabled
  raw_draw: disabled_off_ok
  skia_graphite: disabled_off
  video_decode: enabled
  video_encode: enabled
  vulkan: disabled_off
  webgl: enabled
  webgl2: enabled
  webgpu: enabled
  webnn: disabled_off
Load (avg): undefined
Memory (System): 23.47GB (12.40GB free)
Process Argv: --crash-reporter-id dc65549e-5a3d-4f34-a3ab-d2ec4803df70
Screen Reader: no
VM: 0%

Remote SSH: Zoo
OS: Linux x64 6.8.0-60-generic
CPUs: 13th Gen Intel(R) Core(TM) i5-13600K (20 x 1006)
Memory (System): 125.54GB (119.80GB free)
VM: 0%

Remote Container: gcr.io/kaggle-gpu-images/python:v160-cu125-pytorch-2.7.0_vllm_091 (kgl_gpu_160_1) @ Zoo
OS: Linux x64 6.8.0-60-generic
CPUs: 13th Gen Intel(R) Core(TM) i5-13600K (20 x 998)
Memory (System): 125.54GB (119.80GB free)
VM: 0%

Extensions (23)
Extension Author (truncated) Version
jupyter-keymap ms- 1.1.2
remote-containers ms- 0.417.0
remote-ssh ms- 0.120.0
remote-ssh-edit ms- 0.87.0
vscode-remote-extensionpack ms- 0.26.0
remote-explorer ms- 0.5.0
remote-server ms- 1.5.2
docker doc 0.10.0
vscode-containers ms- 2.0.3
vscode-docker ms- 2.0.0
debugpy ms- 2025.8.0
python ms- 2025.6.1
vscode-pylance ms- 2025.5.1
datawrangler ms- 1.22.0
jupyter ms- 2025.4.1
jupyter-keymap ms- 1.1.2
jupyter-renderers ms- 1.1.0
vscode-jupyter-cell-tags ms- 0.1.9
vscode-jupyter-powertoys ms- 0.1.1
vscode-jupyter-slideshow ms- 0.1.6
markdown-preview-enhanced shd 0.8.18
intellicode-api-usage-examples Vis 0.2.9
vscodeintellicode Vis 1.3.2
A/B Experiments
vsliv368:30146709
vspor879:30202332
vspor708:30202333
vspor363:30204092
vscod805:30301674
binariesv615:30325510
c4g48928:30535728
azure-dev_surveyone:30548225
962ge761:30959799
h48ei257:31000450
pythontbext0:30879054
cppperfnew:31000557
dwnewjupytercf:31046870
pythonrstrctxt:31112756
nativeloc1:31192215
5fd0e150:31155592
dwcopilot:31170013
6074i472:31201624
dwoutputs:31242946
customenabled:31248079
hdaa2157:31222309
copilot_t_ci:31222730
e5gg6876:31282496
pythoneinst12:31285622
bgtreat:31268568
4gafe986:31271826
c7cif404:31314491
996jf627:31283433
pythonrdcb7:31303018
usemplatestapi:31297334
0aa6g176:31307128
7bj51361:31289155
747dc170:31275177
pylancecolor:31314202
aj953862:31281341
generatesymbolt:31295002
convertfstringf:31295003
gendocf:31295004
pylancequickfixf:31319675
0g0a1943:31327026

import os
import transformers; print('Transformers version:', transformers.__version__)
import torch; print('Torch version:', torch.__version__)
import vllm; print('vLLM version:', vllm.__version__)
from vllm import LLM

container_date = os.environ.get('BUILD_DATE', '').split('-')[0]
print(f"Kaggle Docker BUILD_DATE={container_date}")

!cat /etc/os-release | grep -oP "PRETTY_NAME=\"\K([^\"]*)" && uname -r
!free -h
!nv_version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)" && echo "My NVIDIA driver version is '${nv_version}'."
!ls -l /usr/local | grep cuda

is_debug = True

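# Spawn (rather than fork) vLLM worker processes and point Triton at the
# CUDA toolkit's ptxas binary.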
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
os.environ["TRITON_PTXAS_PATH"] = "/usr/local/cuda/bin/ptxas"

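# Debug-only settings: run every threading backend single-threaded and make
# CUDA launches synchronous so failures surface at the offending call.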
if is_debug:
    os.environ['OMP_NUM_THREADS'] = '1'
    os.environ['MKL_NUM_THREADS'] = '1'
    os.environ['POLARS_ALLOW_FORKING_THREAD'] = '1'
    os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
    os.environ['TORCH_USE_CUDA_DSA'] = "1"
    os.environ['OPENBLAS_NUM_THREADS'] = '1'
    os.environ["NUM_INTER_THREADS"] = "1"
    os.environ["NUM_INTRA_THREADS"] = "1"
    os.environ["XLA_FLAGS"] = ("--xla_cpu_multi_thread_eigen=false "
                               "intra_op_parallelism_threads=1")

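# Regardless of debug mode: disable tokenizer parallelism, use only GPU 0,
# and enable the expandable-segments CUDA allocator.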
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["PYTORCH_CUDA_ALLOC_CONF"]='expandable_segments:True'

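# Constructing the engine is the call that hangs under the debugger.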
llm = LLM(
    model="Qwen/Qwen3-0.6B",
    max_num_seqs=1,
    max_model_len=1024,
    trust_remote_code=True,
    enable_prefix_caching=True,
    dtype=torch.half,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.96,
    enforce_eager=True,
    seed=2024,
)
tokenizer = llm.get_tokenizer()
print("LLM Started")
  5. When running under the debugger, the following message appears in the log (it does not appear when running without the debugger):
77.40s - pydevd: Sending message related to process being replaced timed-out after 5 seconds
  6. Pressing Ctrl+C at this point produces the traceback below; it shows the program is stuck waiting on a subprocess.
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
/tmp/ipykernel_2264/2697110242.py in <cell line: 0>()
     34 os.environ["PYTORCH_CUDA_ALLOC_CONF"]='expandable_segments:True'
     35 
---> 36 llm = LLM(
     37     model="Qwen/Qwen3-0.6B",
     38     max_num_seqs=1,

/usr/local/lib/python3.11/dist-packages/vllm/entrypoints/llm.py in __init__(self, model, task, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, allowed_local_media_path, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, hf_token, hf_overrides, mm_processor_kwargs, override_pooler_config, compilation_config, **kwargs)
    241 
    242         # Create the Engine (autoselects V0 vs V1)
--> 243         self.llm_engine = LLMEngine.from_engine_args(
    244             engine_args=engine_args, usage_context=UsageContext.LLM_CLASS)
    245         self.engine_class = type(self.llm_engine)

/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py in from_engine_args(cls, engine_args, usage_context, stat_loggers)
    492         """Creates an LLM engine from the engine arguments."""
    493         # Create the engine configs.
--> 494         vllm_config = engine_args.create_engine_config(usage_context)
    495 
    496         engine_cls = cls

/usr/local/lib/python3.11/dist-packages/vllm/engine/arg_utils.py in create_engine_config(self, usage_context)
   1016 
   1017         device_config = DeviceConfig(device=current_platform.device_type)
-> 1018         model_config = self.create_model_config()
   1019 
   1020         # * If VLLM_USE_V1 is unset, we enable V1 for "supported features"

/usr/local/lib/python3.11/dist-packages/vllm/engine/arg_utils.py in create_model_config(self)
    908             self.load_format = LoadFormat.RUNAI_STREAMER
    909 
--> 910         return ModelConfig(
    911             model=self.model,
    912             hf_config_path=self.hf_config_path,

    [... skipping hidden 1 frame]

/usr/local/lib/python3.11/dist-packages/vllm/config.py in __post_init__(self)
    546             self.model, hf_token=self.hf_token, revision=self.revision)
    547 
--> 548         supported_tasks, task = self._resolve_task(self.task)
    549         self.supported_tasks = supported_tasks
    550         self.task = task

/usr/local/lib/python3.11/dist-packages/vllm/config.py in _resolve_task(self, task_option)
    796             # NOTE: Listed from highest to lowest priority,
    797             # in case the model supports multiple of them
--> 798             "transcription": registry.is_transcription_model(architectures),
    799             "generate": registry.is_text_generation_model(architectures),
    800             "pooling": registry.is_pooling_model(architectures),

/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/registry.py in is_transcription_model(self, architectures)
    556         architectures: Union[str, list[str]],
    557     ) -> bool:
--> 558         model_cls, _ = self.inspect_model_cls(architectures)
    559         return model_cls.supports_transcription
    560 

/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/registry.py in inspect_model_cls(self, architectures)
    470 
    471         for arch in architectures:
--> 472             model_info = self._try_inspect_model_cls(arch)
    473             if model_info is not None:
    474                 return (model_info, arch)

/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/registry.py in _try_inspect_model_cls(self, model_arch)
    443             return None
    444 
--> 445         return _try_inspect_model_cls(model_arch, self.models[model_arch])
    446 
    447     def _normalize_archs(

/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/registry.py in _try_inspect_model_cls(model_arch, model)
    363 ) -> Optional[_ModelInfo]:
    364     try:
--> 365         return model.inspect_model_cls()
    366     except Exception:
    367         logger.exception("Error in inspecting model architecture '%s'",

/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/registry.py in inspect_model_cls(self)
    334     # Performed in another process to avoid initializing CUDA
    335     def inspect_model_cls(self) -> _ModelInfo:
--> 336         return _run_in_subprocess(
    337             lambda: _ModelInfo.from_model_cls(self.load_model_cls()))
    338 

/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/registry.py in _run_in_subprocess(fn)
    590         # cannot use `sys.executable __file__` here because the script
    591         # contains relative imports
--> 592         returned = subprocess.run(_SUBPROCESS_COMMAND,
    593                                   input=input_bytes,
    594                                   capture_output=True)

/usr/lib/python3.11/subprocess.py in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    548     with Popen(*popenargs, **kwargs) as process:
    549         try:
--> 550             stdout, stderr = process.communicate(input, timeout=timeout)
    551         except TimeoutExpired as exc:
    552             process.kill()

/usr/lib/python3.11/subprocess.py in communicate(self, input, timeout)
   1207 
   1208             try:
-> 1209                 stdout, stderr = self._communicate(input, endtime, timeout)
   1210             except KeyboardInterrupt:
   1211                 # https://bugs.python.org/issue25942

/usr/lib/python3.11/subprocess.py in _communicate(self, input, endtime, orig_timeout)
   2113                             'failed to raise TimeoutExpired.')
   2114 
-> 2115                     ready = selector.select(timeout)
   2116                     self._check_timeout(endtime, orig_timeout, stdout, stderr)
   2117 

/usr/lib/python3.11/selectors.py in select(self, timeout)
    413         ready = []
    414         try:
--> 415             fd_event_list = self._selector.poll(timeout)
    416         except InterruptedError:
    417             return ready

KeyboardInterrupt:
  7. Having previously tried all the working vLLM and Docker image versions, I came to the unambiguous conclusion that the problem is in VS Code (a minimal sketch of the suspected mechanism is at the end of this report).

  8. Log from a successful run, for reference:

Transformers version: 4.51.3
Torch version: 2.7.0+cu126
INFO 06-12 21:02:13 [__init__.py:244] Automatically detected platform cuda.
2025-06-12 21:02:13.189772: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-06-12 21:02:13.197188: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1749762133.205913    4385 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1749762133.208566    4385 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-06-12 21:02:13.217907: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
vLLM version: 0.9.1
Kaggle Docker BUILD_DATE=20250508
Ubuntu 22.04.4 LTS
6.8.0-60-generic
               total        used        free      shared  buff/cache   available
Mem:           125Gi       4.3Gi        17Gi       9.0Mi       103Gi       119Gi
Swap:          8.0Gi       0.0Ki       8.0Gi
My NVIDIA driver version is '560.35.03'.
lrwxrwxrwx 1 root root   22 Jul 10  2024 cuda -> /etc/alternatives/cuda
lrwxrwxrwx 1 root root   25 Jul 10  2024 cuda-12 -> /etc/alternatives/cuda-12
drwxr-xr-x 1 root root 4096 Jul 10  2024 cuda-12.5
INFO 06-12 21:02:23 [config.py:823] This model supports multiple tasks: {'reward', 'generate', 'classify', 'score', 'embed'}. Defaulting to 'generate'.
WARNING 06-12 21:02:23 [config.py:3271] Casting torch.bfloat16 to torch.float16.
INFO 06-12 21:02:23 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 06-12 21:02:23 [config.py:2232] max_num_batched_tokens (8192) exceeds max_num_seqs* max_model_len (1024). This may lead to unexpected behavior.
WARNING 06-12 21:02:23 [cuda.py:91] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 06-12 21:02:25 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 06-12 21:02:27 [__init__.py:244] Automatically detected platform cuda.
2025-06-12 21:02:27.403391: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1749762147.413643   48564 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1749762147.416761   48564 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 06-12 21:02:30 [core.py:455] Waiting for init message from front-end.
INFO 06-12 21:02:30 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='Qwen/Qwen3-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=2024, served_model_name=Qwen/Qwen3-0.6B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
WARNING 06-12 21:02:30 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x710235449c10>
INFO 06-12 21:02:31 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 06-12 21:02:31 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 06-12 21:02:31 [gpu_model_runner.py:1595] Starting to load model Qwen/Qwen3-0.6B...
INFO 06-12 21:02:31 [gpu_model_runner.py:1600] Loading model from scratch...
INFO 06-12 21:02:31 [logger.py:59] Using Flash Attention backend on V1 engine.
INFO 06-12 21:02:32 [weight_utils.py:292] Using model weights format ['*.safetensors']
INFO 06-12 21:02:32 [weight_utils.py:345] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.71it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.71it/s]

INFO 06-12 21:02:33 [default_loader.py:272] Loading weights took 0.60 seconds
INFO 06-12 21:02:33 [gpu_model_runner.py:1624] Model loading took 1.1201 GiB and 1.735851 seconds
INFO 06-12 21:02:34 [gpu_worker.py:227] Available KV cache memory: 21.20 GiB
INFO 06-12 21:02:34 [kv_cache_utils.py:715] GPU KV cache size: 198,448 tokens
INFO 06-12 21:02:34 [kv_cache_utils.py:719] Maximum concurrency for 1,024 tokens per request: 193.80x
INFO 06-12 21:02:34 [core.py:171] init engine (profile, create kv cache, warmup model) took 1.06 seconds
LLM Started
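
For reference, a minimal sketch of what I suspect is the hang mechanism (my assumption, not a confirmed diagnosis): the traceback shows that vLLM's model registry inspects the model class in a child Python process via subprocess.run(), and the pydevd timeout message suggests the debugger's subprocess handling is what stalls the parent. Running something like the cell below under the debugger, with a breakpoint on the last line and no vLLM involved, should show whether plain subprocess.run() is affected:

import subprocess
import sys

# Same pattern vLLM's registry uses in _run_in_subprocess(): start a child
# Python interpreter and block until its output has been collected.
result = subprocess.run(
    [sys.executable, "-c", "print('child finished')"],
    capture_output=True,
)
print(result.stdout.decode())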
