Description
Your current environment
I use Docker only to host the OpenAI-compatible server.
🐛 Describe the bug
I am trying to use the vLLM Docker image as my backend to run Gemma 3 models with concurrency and dynamic batching.
I use Ubuntu 22.04 with NVIDIA driver 570.153.02 (CUDA version 12.8). I know I should update to CUDA 12.9 and Ubuntu 24.04, but so far everything has worked: I have run other Docker images, such as faster-whisper and llama servers, without problems.
I followed these issues:
#17587
#14766
#14452
The only thing that seemed to work for me was the solution from #14452 by hongbo-miao.
Server:
docker run --gpus=all \
  --volume="$HOME/.cache/huggingface:/root/.cache/huggingface" \
  --publish=8000:8000 \
  nvcr.io/nvidia/tritonserver:25.05-vllm-python-py3 \
  python3 -m vllm.entrypoints.openai.api_server \
    --model=Qwen/Qwen2.5-0.5B-Instruct \
    --port=8000 \
    --gpu-memory-utilization=0.75 \
    --max_model_len=8192 \
    --tensor-parallel-size=1 \
    --max_num_seqs=128 \
    --enforce-eager
Client:
curl http://localhost:8000/v1/chat/completions \
  --header "Content-Type: application/json" \
  --data '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me a joke."}
    ]
  }'
which returns
{
"id": "chat-260127bb79b74e3786b810ffa6f592ed",
"object": "chat.completion",
"created": 1749898826,
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Sure! Here's one for you: Why did the tomato turn red? Because it saw the salad dressing!",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 24,
"total_tokens": 47,
"completion_tokens": 23
},
"prompt_logprobs": null
}
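For anyone who wants to reproduce the client side from Python and exercise the server with several requests at once, here is a minimal sketch using only the standard library. It mirrors the curl call above; the helper names (`build_request`, `http_send`, `extract_reply`, `ask_many`) and the thread-pool fan-out are my own illustration, not part of vLLM, and the batching itself happens on the server.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000"  # same host/port as the docker run above
MODEL = "Qwen/Qwen2.5-0.5B-Instruct"


def build_request(prompt, model=MODEL, base_url=BASE_URL):
    """Build the same POST request as the curl command (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def http_send(request):
    """Send the request and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_reply(body):
    """Return (assistant text, usage dict) from a chat.completion response."""
    return body["choices"][0]["message"]["content"], body["usage"]


def ask_many(prompts, send=http_send, workers=8):
    """Issue the prompts concurrently so the server has requests to batch."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        bodies = pool.map(lambda p: send(build_request(p)), prompts)
        return [extract_reply(body) for body in bodies]
```

With the server running, `ask_many(["Tell me a joke."] * 16)` would send 16 requests through 8 worker threads. The `send` parameter exists only so the transport can be swapped out for testing.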
The Qwen2.5 setup above works on my PC, but when I try models like
MISHANM/google-gemma-3-12b-it-fp8
JamAndTeaStudios/gemma-3-12b-it-FP8-Dynamic
RedHatAI/gemma-3-12b-it-FP8-dynamic
for example by running
docker run --gpus=all -v "$HOME/.cache/huggingface:/root/.cache/huggingface" -p 8000:8000 --name gemma3_12b_fp8 tritonserver-bnb python3 -m vllm.entrypoints.openai.api_server --model=RedHatAI/gemma-3-12b-it-FP8-dynamic --port=8000 --gpu-memory-utilization=0.90 --max_model_len=8192 --tensor-parallel-size=1 --max_num_seqs=128 --enforce-eager
I get this error:
=============================
== Triton Inference Server ==
NVIDIA Release 25.05 (build 172940304)
Triton Server Version 2.58.0
Copyright (c) 2018-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/)
and the Product-Specific Terms for NVIDIA AI Products
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).
WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 570.153.02 which has support for CUDA 12.8. This container
was built with CUDA 12.9 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
INFO 06-19 13:40:04 [__init__.py:239] Automatically detected platform cuda.
INFO 06-19 13:40:05 [api_server.py:1034] vLLM API server version 0.8.4+dc1a3e10.nv25.05
INFO 06-19 13:40:05 [api_server.py:1035] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='RedHatAI/gemma-3-12b-it-FP8-dynamic', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=8192, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=128, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, 
enforce_eager=True, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 06-19 13:40:10 [config.py:689] This model supports multiple tasks: {'score', 'reward', 'classify', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 06-19 13:40:10 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 06-19 13:40:10 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 06-19 13:40:13 [__init__.py:239] Automatically detected platform cuda.
INFO 06-19 13:40:15 [core.py:61] Initializing a V1 LLM engine (v0.8.4+dc1a3e10.nv25.05) with config: model='RedHatAI/gemma-3-12b-it-FP8-dynamic', speculative_config=None, tokenizer='RedHatAI/gemma-3-12b-it-FP8-dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=RedHatAI/gemma-3-12b-it-FP8-dynamic, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
2025-06-19 13:40:15,592 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
WARNING 06-19 13:40:15 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7a9b659bf4d0>
[W619 13:40:16.689999241 ProcessGroupNCCL.cpp:959] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 06-19 13:40:16 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 06-19 13:40:16 [cuda.py:221] Using Flash Attention backend on V1 engine.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
INFO 06-19 13:40:20 [gpu_model_runner.py:1276] Starting to load model RedHatAI/gemma-3-12b-it-FP8-dynamic...
INFO 06-19 13:40:20 [config.py:3466] cudagraph sizes specified by model runner [] is overridden by config []
INFO 06-19 13:40:20 [topk_topp_sampler.py:44] Currently, FlashInfer top-p & top-k sampling sampler is disabled because FlashInfer>=v0.2.3 is not backward compatible. Falling back to the PyTorch-native implementation of top-p & top-k sampling.
INFO 06-19 13:40:21 [weight_utils.py:265] Using model weights format ['.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:02<00:04, 2.12s/it]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:04<00:02, 2.14s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00, 1.88s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00, 1.95s/it]
INFO 06-19 13:40:27 [loader.py:458] Loading weights took 5.92 seconds
INFO 06-19 13:40:27 [gpu_model_runner.py:1291] Model loading took 13.2955 GiB and 6.692338 seconds
INFO 06-19 13:40:27 [gpu_model_runner.py:1560] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 8 image items of the maximum feature size.
ERROR 06-19 13:40:30 [core.py:387] EngineCore hit an exception: Traceback (most recent call last):
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
ERROR 06-19 13:40:30 [core.py:387] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 320, in __init__
ERROR 06-19 13:40:30 [core.py:387] super().__init__(vllm_config, executor_class, log_stats)
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 06-19 13:40:30 [core.py:387] self._initialize_kv_caches(vllm_config)
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 133, in _initialize_kv_caches
ERROR 06-19 13:40:30 [core.py:387] available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
ERROR 06-19 13:40:30 [core.py:387] output = self.collective_rpc("determine_available_memory")
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 06-19 13:40:30 [core.py:387] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2378, in run_method
ERROR 06-19 13:40:30 [core.py:387] return func(*args, **kwargs)
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-19 13:40:30 [core.py:387] return func(*args, **kwargs)
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 157, in determine_available_memory
ERROR 06-19 13:40:30 [core.py:387] self.model_runner.profile_run()
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1591, in profile_run
ERROR 06-19 13:40:30 [core.py:387] hidden_states = self._dummy_run(self.max_num_tokens)
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-19 13:40:30 [core.py:387] return func(*args, **kwargs)
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1441, in _dummy_run
ERROR 06-19 13:40:30 [core.py:387] hidden_states = model(
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-19 13:40:30 [core.py:387] return self._call_impl(*args, **kwargs)
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-19 13:40:30 [core.py:387] return forward_call(*args, **kwargs)
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py", line 630, in forward
ERROR 06-19 13:40:30 [core.py:387] hidden_states = self.language_model.model(input_ids,
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 172, in __call__
ERROR 06-19 13:40:30 [core.py:387] return self.forward(*args, **kwargs)
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 400, in forward
ERROR 06-19 13:40:30 [core.py:387] hidden_states, residual = layer(
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-19 13:40:30 [core.py:387] return self._call_impl(*args, **kwargs)
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-19 13:40:30 [core.py:387] return forward_call(*args, **kwargs)
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 329, in forward
ERROR 06-19 13:40:30 [core.py:387] hidden_states = self.self_attn(
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-19 13:40:30 [core.py:387] return self._call_impl(*args, **kwargs)
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-19 13:40:30 [core.py:387] return forward_call(*args, **kwargs)
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 188, in forward
ERROR 06-19 13:40:30 [core.py:387] qkv, _ = self.qkv_proj(hidden_states)
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-19 13:40:30 [core.py:387] return self._call_impl(*args, **kwargs)
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-19 13:40:30 [core.py:387] return forward_call(*args, **kwargs)
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 474, in forward
ERROR 06-19 13:40:30 [core.py:387] output_parallel = self.quant_method.apply(self, input, bias)
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 580, in apply
ERROR 06-19 13:40:30 [core.py:387] return scheme.apply_weights(layer, x, bias=bias)
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py", line 144, in apply_weights
ERROR 06-19 13:40:30 [core.py:387] return self.fp8_linear.apply(input=x,
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/w8a8_utils.py", line 200, in apply
ERROR 06-19 13:40:30 [core.py:387] output = ops.cutlass_scaled_mm(qinput,
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 568, in cutlass_scaled_mm
ERROR 06-19 13:40:30 [core.py:387] torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)
ERROR 06-19 13:40:30 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1158, in __call__
ERROR 06-19 13:40:30 [core.py:387] return self._op(*args, **(kwargs or {}))
ERROR 06-19 13:40:30 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:40:30 [core.py:387] RuntimeError: Error Internal
ERROR 06-19 13:40:30 [core.py:387]
CRITICAL 06-19 13:40:30 [core_client.py:359] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1121, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 136, in from_vllm_config
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 102, in __init__
self.engine_core = EngineCoreClient.make_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 71, in make_client
return AsyncMPClient(vllm_config, executor_class, log_stats)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 604, in __init__
super().__init__(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 404, in __init__
self._wait_for_engine_startup()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 426, in _wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.
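Since the second traceback only says "See root cause above", it can help to pull the deepest frame and the exception out of the prefixed `ERROR` lines automatically. Below is a small sketch that does this for logs shaped like the one above; the function name `root_cause` and the prefix regex are my own assumptions based on this log, not a vLLM utility.

```python
import re

# Matches the prefix on each engine-core error line in the log above, e.g.
# "ERROR 06-19 13:40:30 [core.py:387] "
PREFIX = re.compile(r"^ERROR \d{2}-\d{2} \d{2}:\d{2}:\d{2} \[[^\]]+\]\s?")

# Matches a line that starts with an exception class name such as
# "RuntimeError: ..." or "torch.SomeException: ..."
EXCEPTION = re.compile(r"^[A-Za-z_][\w.]*(Error|Exception)\b")


def root_cause(log_text):
    """Return (last traceback frame, exception line) from prefixed ERROR lines.

    Either element is None if nothing matching was found.
    """
    last_frame, exception = None, None
    for raw in log_text.splitlines():
        if not PREFIX.match(raw):
            continue  # skip INFO/WARNING lines and the unprefixed traceback
        line = PREFIX.sub("", raw).strip()
        if line.startswith("File "):
            last_frame = line  # keep overwriting: the last frame is the deepest
        elif EXCEPTION.match(line):
            exception = line
    return last_frame, exception
```

Applied to the log above, this should point at the `cutlass_scaled_mm` frame in `torch/_ops.py` / `vllm/_custom_ops.py` and the `RuntimeError: Error Internal` line, which is where the FP8 GEMM kernel fails during the profiling run.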
When I run RedHatAI/gemma-3-12b-it-quantized.w8a8 instead, I get:
RuntimeError: Currently, only fp8 gemm is implemented for Blackwell
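The message above suggests the GPU is being detected as Blackwell. The GPU model is not stated in this report, so as a sanity check, here is a small sketch that maps a CUDA compute capability to an architecture family. The helper names are hypothetical and the table is a simplified mapping by major version; the capability itself would come from something like `torch.cuda.get_device_capability(0)` inside the container.

```python
def gpu_architecture(major, minor):
    """Map a CUDA compute capability to an NVIDIA architecture family name.

    Simplified lookup (an assumption for illustration): most families are
    identified by the major version, with Turing (7.5) and Ada Lovelace (8.9)
    as the exceptions handled explicitly.
    """
    if (major, minor) == (7, 5):
        return "Turing"
    if (major, minor) == (8, 9):
        return "Ada Lovelace"
    families = {
        7: "Volta",
        8: "Ampere",
        9: "Hopper",
        10: "Blackwell",  # data-center Blackwell
        12: "Blackwell",  # consumer / workstation Blackwell
    }
    return families.get(major, "unknown")


def describe(major, minor):
    """Format a capability like (12, 0) as 'sm_120 (Blackwell)'."""
    return f"sm_{major}{minor} ({gpu_architecture(major, minor)})"


# With PyTorch available (as in the vLLM container), the input would be:
#   major, minor = torch.cuda.get_device_capability(0)
#   print(describe(major, minor))
```

If this reports Blackwell, the w8a8 (INT8) error is expected from the message itself, and the FP8 failure is more likely a kernel or build issue for that architecture than a problem with the model files.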
For RedHatAI/gemma-3-12b-it-FP8-dynamic I get:
=============================
== Triton Inference Server ==
NVIDIA Release 25.05 (build 172940304)
Triton Server Version 2.58.0
Copyright (c) 2018-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/)
and the Product-Specific Terms for NVIDIA AI Products
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).
WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 570.153.02 which has support for CUDA 12.8. This container
was built with CUDA 12.9 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
INFO 06-19 13:46:11 [__init__.py:239] Automatically detected platform cuda.
INFO 06-19 13:46:11 [api_server.py:1034] vLLM API server version 0.8.4+dc1a3e10.nv25.05
INFO 06-19 13:46:11 [api_server.py:1035] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='RedHatAI/gemma-3-12b-it-FP8-dynamic', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=8192, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=128, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, 
enforce_eager=True, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 06-19 13:46:16 [config.py:689] This model supports multiple tasks: {'score', 'generate', 'embed', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 06-19 13:46:17 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 06-19 13:46:17 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 06-19 13:46:20 [__init__.py:239] Automatically detected platform cuda.
INFO 06-19 13:46:21 [core.py:61] Initializing a V1 LLM engine (v0.8.4+dc1a3e10.nv25.05) with config: model='RedHatAI/gemma-3-12b-it-FP8-dynamic', speculative_config=None, tokenizer='RedHatAI/gemma-3-12b-it-FP8-dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=RedHatAI/gemma-3-12b-it-FP8-dynamic, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
2025-06-19 13:46:21,834 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
WARNING 06-19 13:46:22 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x773ce02bdee0>
[W619 13:46:22.908698114 ProcessGroupNCCL.cpp:959] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 06-19 13:46:22 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 06-19 13:46:22 [cuda.py:221] Using Flash Attention backend on V1 engine.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
INFO 06-19 13:46:26 [gpu_model_runner.py:1276] Starting to load model RedHatAI/gemma-3-12b-it-FP8-dynamic...
INFO 06-19 13:46:26 [config.py:3466] cudagraph sizes specified by model runner [] is overridden by config []
INFO 06-19 13:46:26 [topk_topp_sampler.py:44] Currently, FlashInfer top-p & top-k sampling sampler is disabled because FlashInfer>=v0.2.3 is not backward compatible. Falling back to the PyTorch-native implementation of top-p & top-k sampling.
INFO 06-19 13:46:26 [weight_utils.py:265] Using model weights format ['.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:00<00:01, 1.71it/s]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:01<00:00, 1.52it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00, 1.68it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00, 1.65it/s]
INFO 06-19 13:46:29 [loader.py:458] Loading weights took 1.89 seconds
INFO 06-19 13:46:29 [gpu_model_runner.py:1291] Model loading took 13.2955 GiB and 2.631252 seconds
INFO 06-19 13:46:29 [gpu_model_runner.py:1560] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 8 image items of the maximum feature size.
ERROR 06-19 13:46:31 [core.py:387] EngineCore hit an exception: Traceback (most recent call last):
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
ERROR 06-19 13:46:31 [core.py:387] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 320, in __init__
ERROR 06-19 13:46:31 [core.py:387] super().__init__(vllm_config, executor_class, log_stats)
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 06-19 13:46:31 [core.py:387] self._initialize_kv_caches(vllm_config)
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 133, in _initialize_kv_caches
ERROR 06-19 13:46:31 [core.py:387] available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
ERROR 06-19 13:46:31 [core.py:387] output = self.collective_rpc("determine_available_memory")
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 06-19 13:46:31 [core.py:387] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2378, in run_method
ERROR 06-19 13:46:31 [core.py:387] return func(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-19 13:46:31 [core.py:387] return func(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 157, in determine_available_memory
ERROR 06-19 13:46:31 [core.py:387] self.model_runner.profile_run()
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1591, in profile_run
ERROR 06-19 13:46:31 [core.py:387] hidden_states = self._dummy_run(self.max_num_tokens)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-19 13:46:31 [core.py:387] return func(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1441, in _dummy_run
ERROR 06-19 13:46:31 [core.py:387] hidden_states = model(
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-19 13:46:31 [core.py:387] return self._call_impl(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-19 13:46:31 [core.py:387] return forward_call(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py", line 630, in forward
ERROR 06-19 13:46:31 [core.py:387] hidden_states = self.language_model.model(input_ids,
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 172, in __call__
ERROR 06-19 13:46:31 [core.py:387] return self.forward(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 400, in forward
ERROR 06-19 13:46:31 [core.py:387] hidden_states, residual = layer(
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-19 13:46:31 [core.py:387] return self._call_impl(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-19 13:46:31 [core.py:387] return forward_call(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 329, in forward
ERROR 06-19 13:46:31 [core.py:387] hidden_states = self.self_attn(
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-19 13:46:31 [core.py:387] return self._call_impl(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-19 13:46:31 [core.py:387] return forward_call(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 188, in forward
ERROR 06-19 13:46:31 [core.py:387] qkv, _ = self.qkv_proj(hidden_states)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-19 13:46:31 [core.py:387] return self._call_impl(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-19 13:46:31 [core.py:387] return forward_call(*args, **kwargs)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 474, in forward
ERROR 06-19 13:46:31 [core.py:387] output_parallel = self.quant_method.apply(self, input, bias)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 580, in apply
ERROR 06-19 13:46:31 [core.py:387] return scheme.apply_weights(layer, x, bias=bias)
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py", line 144, in apply_weights
ERROR 06-19 13:46:31 [core.py:387] return self.fp8_linear.apply(input=x,
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/w8a8_utils.py", line 200, in apply
ERROR 06-19 13:46:31 [core.py:387] output = ops.cutlass_scaled_mm(qinput,
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 568, in cutlass_scaled_mm
ERROR 06-19 13:46:31 [core.py:387] torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1158, in __call__
ERROR 06-19 13:46:31 [core.py:387] return self._op(*args, **(kwargs or {}))
ERROR 06-19 13:46:31 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 13:46:31 [core.py:387] RuntimeError: Error Internal
ERROR 06-19 13:46:31 [core.py:387]
CRITICAL 06-19 13:46:31 [core_client.py:359] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1121, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 136, in from_vllm_config
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 102, in __init__
self.engine_core = EngineCoreClient.make_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 71, in make_client
return AsyncMPClient(vllm_config, executor_class, log_stats)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 604, in __init__
super().__init__(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 404, in __init__
self._wait_for_engine_startup()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 426, in _wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.
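A note for anyone triaging this log: the second traceback only reports that the engine core failed to start. The root cause is the last exception line logged at ERROR level by the engine core process. Below is a minimal sketch for pulling that line out of saved container output. It assumes the log line format shown above; the helper name `root_cause` and the regular expressions are my own and not part of vLLM.

```python
import re
from typing import Optional

# Assumed format of an engine-core log line, taken from the output above:
#   ERROR 06-19 13:46:31 [core.py:387] <message>
LOG_LINE = re.compile(r"^ERROR \d{2}-\d{2} \d{2}:\d{2}:\d{2} \[[^\]]+\] ?(.*)$")
# The final line of a Python traceback looks like "SomeError: message".
EXC_LINE = re.compile(r"^[A-Za-z_][\w.]*(Error|Exception): ")


def root_cause(log_text: str) -> Optional[str]:
    """Return the last exception line logged at ERROR level, or None."""
    found = None
    for line in log_text.splitlines():
        m = LOG_LINE.match(line.strip())
        if m and EXC_LINE.match(m.group(1).strip()):
            found = m.group(1).strip()
    return found


sample = """\
ERROR 06-19 13:46:31 [core.py:387] File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 568, in cutlass_scaled_mm
ERROR 06-19 13:46:31 [core.py:387] torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)
ERROR 06-19 13:46:31 [core.py:387] RuntimeError: Error Internal
ERROR 06-19 13:46:31 [core.py:387]
RuntimeError: Engine core initialization failed. See root cause above.
"""
print(root_cause(sample))
```

On this log the helper picks the `RuntimeError: Error Internal` raised inside `cutlass_scaled_mm`, not the generic startup failure that follows it.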
Is there anything I can do? I am trying to update the image as suggested in the issues linked above, but the build fails with errors, and each build attempt takes a long time.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.