[Bug]: 5090 gemma-3-12b-it using FP8/INT8/FP16 quantization for concurrent requests (Docker)

### Your current environment

I only use Docker, to host an OpenAI-compatible server.


### 🐛 Describe the bug

I am trying to use the vLLM Docker image as my backend to run Gemma 3 models with concurrent requests and dynamic batching.  
I am on Ubuntu 22.04 with NVIDIA driver 570.153.02 (CUDA Version: 12.8). I know I should update to CUDA 12.9 and Ubuntu 24.04, but so far everything has worked on this setup, including other Docker images such as faster-whisper, llama servers, etc.

I followed these issues:
https://github.com/vllm-project/vllm/issues/17587 
https://github.com/vllm-project/vllm/pull/14766
https://github.com/vllm-project/vllm/issues/14452

The only thing that seems to work for me is the solution from #14452 by [hongbo-miao](https://github.com/hongbo-miao):
Server:

docker run --gpus=all \
    --volume="$HOME/.cache/huggingface:/root/.cache/huggingface" \
    --publish=8000:8000 \
    nvcr.io/nvidia/tritonserver:25.05-vllm-python-py3 \
        python3 -m vllm.entrypoints.openai.api_server \
            --model=Qwen/Qwen2.5-0.5B-Instruct \
            --port=8000 \
            --gpu-memory-utilization=0.75 \
            --max_model_len=8192 \
            --tensor-parallel-size=1 \
            --max_num_seqs=128 \
            --enforce-eager

Client:

curl http://localhost:8000/v1/chat/completions \
    --header "Content-Type: application/json" \
    --data '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me a joke."}
        ]
    }'

which returns

{
  "id": "chat-260127bb79b74e3786b810ffa6f592ed",
  "object": "chat.completion",
  "created": 1749898826,
  "model": "Qwen/Qwen2.5-0.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Sure! Here's one for you: Why did the tomato turn red? Because it saw the salad dressing!",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "total_tokens": 47,
    "completion_tokens": 23
  },
  "prompt_logprobs": null
}
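
Since my goal is concurrent requests, here is a minimal Python sketch of how I would send several of the same chat requests in parallel. It only uses the standard library and assumes the server from the command above is reachable at `http://localhost:8000`. The helper names (`build_payload`, `send_chat`, `send_concurrently`) and the worker count are my own choices for illustration, not part of vLLM.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed values taken from the commands above; adjust to your deployment.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "Qwen/Qwen2.5-0.5B-Instruct"


def build_payload(prompt: str, model: str = MODEL) -> dict:
    """Build the same chat-completions body as the curl example."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    }


def send_chat(prompt: str, url: str = BASE_URL, timeout: float = 60.0) -> str:
    """POST one request and return the assistant's reply text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


def send_concurrently(prompts: list[str], workers: int = 8) -> list[str]:
    """Send all prompts from parallel threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send_chat, prompts))
```

Calling `send_concurrently(["Tell me a joke."] * 8)` would issue eight requests at once, which is roughly the load pattern I want the server to batch.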

The Qwen example above works on my PC, but not when I try models like
MISHANM/google-gemma-3-12b-it-fp8
JamAndTeaStudios/gemma-3-12b-it-FP8-Dynamic
RedHatAI/gemma-3-12b-it-FP8-dynamic

for example, when running
`docker run --gpus=all   -v "$HOME/.cache/huggingface:/root/.cache/huggingface"   -p 8000:8000   --name gemma3_12b_fp8   tritonserver-bnb   python3 -m vllm.entrypoints.openai.api_server     --model=RedHatAI/gemma-3-12b-it-FP8-dynamic     --port=8000     --gpu-memory-utilization=0.90     --max_model_len=8192     --tensor-parallel-size=1     --max_num_seqs=128     --enforce-eager`

I get this error:

> =============================
> == Triton Inference Server ==
> =============================
> 
> NVIDIA Release 25.05 (build 172940304)
> Triton Server Version 2.58.0
> 
> Copyright (c) 2018-2025, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
> 
> Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
> 
> GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement
> (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/)
> and the Product-Specific Terms for NVIDIA AI Products
> (found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).
> 
> WARNING: CUDA Minor Version Compatibility mode ENABLED.
>   Using driver version 570.153.02 which has support for CUDA 12.8.  This container
>   was built with CUDA 12.9 and will be run in Minor Version Compatibility mode.
>   CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
>   with this container but was unavailable:
>   [[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
>   See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
> 
> INFO 06-19 13:40:04 [__init__.py:239] Automatically detected platform cuda.
> INFO 06-19 13:40:05 [api_server.py:1034] vLLM API server version 0.8.4+dc1a3e10.nv25.05
> INFO 06-19 13:40:05 [api_server.py:1035] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='RedHatAI/gemma-3-12b-it-FP8-dynamic', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=8192, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=128, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, 
enforce_eager=True, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
> INFO 06-19 13:40:10 [config.py:689] This model supports multiple tasks: {'score', 'reward', 'classify', 'generate', 'embed'}. Defaulting to 'generate'.
> INFO 06-19 13:40:10 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=2048.
> WARNING 06-19 13:40:10 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
> INFO 06-19 13:40:13 [__init__.py:239] Automatically detected platform cuda.
> INFO 06-19 13:40:15 [core.py:61] Initializing a V1 LLM engine (v0.8.4+dc1a3e10.nv25.05) with config: model='RedHatAI/gemma-3-12b-it-FP8-dynamic', speculative_config=None, tokenizer='RedHatAI/gemma-3-12b-it-FP8-dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=RedHatAI/gemma-3-12b-it-FP8-dynamic, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
> 2025-06-19 13:40:15,592 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
> WARNING 06-19 13:40:15 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7a9b659bf4d0>
> [W619 13:40:16.689999241 ProcessGroupNCCL.cpp:959] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
> [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
> [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
> [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
> [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
> INFO 06-19 13:40:16 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
> INFO 06-19 13:40:16 [cuda.py:221] Using Flash Attention backend on V1 engine.
> Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
> INFO 06-19 13:40:20 [gpu_model_runner.py:1276] Starting to load model RedHatAI/gemma-3-12b-it-FP8-dynamic...
> INFO 06-19 13:40:20 [config.py:3466] cudagraph sizes specified by model runner [] is overridden by config []
> INFO 06-19 13:40:20 [topk_topp_sampler.py:44] Currently, FlashInfer top-p & top-k sampling sampler is disabled because FlashInfer>=v0.2.3 is not backward compatible. Falling back to the PyTorch-native implementation of top-p & top-k sampling.
> INFO 06-19 13:40:21 [weight_utils.py:265] Using model weights format ['*.safetensors']
> Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
> Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:02<00:04,  2.12s/it]
> Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:04<00:02,  2.14s/it]
> Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00,  1.88s/it]
> Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00,  1.95s/it]
> 
> INFO 06-19 13:40:27 [loader.py:458] Loading weights took 5.92 seconds
> INFO 06-19 13:40:27 [gpu_model_runner.py:1291] Model loading took 13.2955 GiB and 6.692338 seconds
> INFO 06-19 13:40:27 [gpu_model_runner.py:1560] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 8 image items of the maximum feature size.
> ERROR 06-19 13:40:30 [core.py:387] EngineCore hit an exception: Traceback (most recent call last):
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
> ERROR 06-19 13:40:30 [core.py:387]     engine_core = EngineCoreProc(*args, **kwargs)
> ERROR 06-19 13:40:30 [core.py:387]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 320, in __init__
> ERROR 06-19 13:40:30 [core.py:387]     super().__init__(vllm_config, executor_class, log_stats)
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 71, in __init__
> ERROR 06-19 13:40:30 [core.py:387]     self._initialize_kv_caches(vllm_config)
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 133, in _initialize_kv_caches
> ERROR 06-19 13:40:30 [core.py:387]     available_gpu_memory = self.model_executor.determine_available_memory()
> ERROR 06-19 13:40:30 [core.py:387]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
> ERROR 06-19 13:40:30 [core.py:387]     output = self.collective_rpc("determine_available_memory")
> ERROR 06-19 13:40:30 [core.py:387]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
> ERROR 06-19 13:40:30 [core.py:387]     answer = run_method(self.driver_worker, method, args, kwargs)
> ERROR 06-19 13:40:30 [core.py:387]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2378, in run_method
> ERROR 06-19 13:40:30 [core.py:387]     return func(*args, **kwargs)
> ERROR 06-19 13:40:30 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
> ERROR 06-19 13:40:30 [core.py:387]     return func(*args, **kwargs)
> ERROR 06-19 13:40:30 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 157, in determine_available_memory
> ERROR 06-19 13:40:30 [core.py:387]     self.model_runner.profile_run()
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1591, in profile_run
> ERROR 06-19 13:40:30 [core.py:387]     hidden_states = self._dummy_run(self.max_num_tokens)
> ERROR 06-19 13:40:30 [core.py:387]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
> ERROR 06-19 13:40:30 [core.py:387]     return func(*args, **kwargs)
> ERROR 06-19 13:40:30 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1441, in _dummy_run
> ERROR 06-19 13:40:30 [core.py:387]     hidden_states = model(
> ERROR 06-19 13:40:30 [core.py:387]                     ^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
> ERROR 06-19 13:40:30 [core.py:387]     return self._call_impl(*args, **kwargs)
> ERROR 06-19 13:40:30 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
> ERROR 06-19 13:40:30 [core.py:387]     return forward_call(*args, **kwargs)
> ERROR 06-19 13:40:30 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py", line 630, in forward
> ERROR 06-19 13:40:30 [core.py:387]     hidden_states = self.language_model.model(input_ids,
> ERROR 06-19 13:40:30 [core.py:387]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 172, in __call__
> ERROR 06-19 13:40:30 [core.py:387]     return self.forward(*args, **kwargs)
> ERROR 06-19 13:40:30 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 400, in forward
> ERROR 06-19 13:40:30 [core.py:387]     hidden_states, residual = layer(
> ERROR 06-19 13:40:30 [core.py:387]                               ^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
> ERROR 06-19 13:40:30 [core.py:387]     return self._call_impl(*args, **kwargs)
> ERROR 06-19 13:40:30 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
> ERROR 06-19 13:40:30 [core.py:387]     return forward_call(*args, **kwargs)
> ERROR 06-19 13:40:30 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 329, in forward
> ERROR 06-19 13:40:30 [core.py:387]     hidden_states = self.self_attn(
> ERROR 06-19 13:40:30 [core.py:387]                     ^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
> ERROR 06-19 13:40:30 [core.py:387]     return self._call_impl(*args, **kwargs)
> ERROR 06-19 13:40:30 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
> ERROR 06-19 13:40:30 [core.py:387]     return forward_call(*args, **kwargs)
> ERROR 06-19 13:40:30 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 188, in forward
> ERROR 06-19 13:40:30 [core.py:387]     qkv, _ = self.qkv_proj(hidden_states)
> ERROR 06-19 13:40:30 [core.py:387]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
> ERROR 06-19 13:40:30 [core.py:387]     return self._call_impl(*args, **kwargs)
> ERROR 06-19 13:40:30 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
> ERROR 06-19 13:40:30 [core.py:387]     return forward_call(*args, **kwargs)
> ERROR 06-19 13:40:30 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 474, in forward
> ERROR 06-19 13:40:30 [core.py:387]     output_parallel = self.quant_method.apply(self, input_, bias)
> ERROR 06-19 13:40:30 [core.py:387]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 580, in apply
> ERROR 06-19 13:40:30 [core.py:387]     return scheme.apply_weights(layer, x, bias=bias)
> ERROR 06-19 13:40:30 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py", line 144, in apply_weights
> ERROR 06-19 13:40:30 [core.py:387]     return self.fp8_linear.apply(input=x,
> ERROR 06-19 13:40:30 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/w8a8_utils.py", line 200, in apply
> ERROR 06-19 13:40:30 [core.py:387]     output = ops.cutlass_scaled_mm(qinput,
> ERROR 06-19 13:40:30 [core.py:387]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 568, in cutlass_scaled_mm
> ERROR 06-19 13:40:30 [core.py:387]     torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)
> ERROR 06-19 13:40:30 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1158, in __call__
> ERROR 06-19 13:40:30 [core.py:387]     return self._op(*args, **(kwargs or {}))
> ERROR 06-19 13:40:30 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:40:30 [core.py:387] RuntimeError: Error Internal
> ERROR 06-19 13:40:30 [core.py:387] 
> CRITICAL 06-19 13:40:30 [core_client.py:359] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
> Traceback (most recent call last):
>   File "<frozen runpy>", line 198, in _run_module_as_main
>   File "<frozen runpy>", line 88, in _run_code
>   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1121, in <module>
>     uvloop.run(run_server(args))
>   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
>     return __asyncio.run(
>            ^^^^^^^^^^^^^^
>   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
>     return runner.run(main)
>            ^^^^^^^^^^^^^^^^
>   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
>     return self._loop.run_until_complete(task)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
>   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
>     return await main
>            ^^^^^^^^^^
>   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
>     async with build_async_engine_client(args) as engine_client:
>   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
>     return await anext(self.gen)
>            ^^^^^^^^^^^^^^^^^^^^^
>   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
>     async with build_async_engine_client_from_engine_args(
>   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
>     return await anext(self.gen)
>            ^^^^^^^^^^^^^^^^^^^^^
>   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
>     async_llm = AsyncLLM.from_vllm_config(
>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 136, in from_vllm_config
>     return cls(
>            ^^^^
>   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 102, in __init__
>     self.engine_core = EngineCoreClient.make_client(
>                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 71, in make_client
>     return AsyncMPClient(vllm_config, executor_class, log_stats)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 604, in __init__
>     super().__init__(
>   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 404, in __init__
>     self._wait_for_engine_startup()
>   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 426, in _wait_for_engine_startup
>     raise RuntimeError("Engine core initialization failed. "
> RuntimeError: Engine core initialization failed. See root cause above.
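
In the log above, the underlying failure is the `RuntimeError: Error Internal` line raised from `cutlass_scaled_mm`; the final `Engine core initialization failed` error only wraps it. Below is a small Python sketch for pulling that line out of a saved log. It assumes the `ERROR <date> <time> [file:line]` prefix format shown above, and the `root_cause` helper and its regex are my own illustration, not something vLLM provides.

```python
import re
from typing import Optional

# Matches the log prefix seen above, e.g.
# "ERROR 06-19 13:40:30 [core.py:387] " (optionally behind a "> " quote marker).
LOG_PREFIX = re.compile(
    r"^(?:>\s*)?ERROR \d{2}-\d{2} \d{2}:\d{2}:\d{2} \[[^\]]+\] ?"
)

# An un-indented name ending in Error/Exception, e.g. "RuntimeError: ...".
EXCEPTION_LINE = re.compile(r"^[A-Za-z_][\w.]*(Error|Exception)\b")


def root_cause(log_text: str) -> Optional[str]:
    """Return the first exception line found inside ERROR-prefixed output."""
    for line in log_text.splitlines():
        if not LOG_PREFIX.match(line):
            continue  # skip INFO/WARNING lines and the outer traceback
        stripped = LOG_PREFIX.sub("", line)
        # Traceback frames stay indented after the prefix; the exception does not.
        if EXCEPTION_LINE.match(stripped):
            return stripped
    return None
```

Run against the log pasted above, this should return the `RuntimeError: Error Internal` line rather than the wrapper error at the end.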

When I run `RedHatAI/gemma-3-12b-it-quantized.w8a8`, I get:

> RuntimeError: Currently, only fp8 gemm is implemented for Blackwell

For `RedHatAI/gemma-3-12b-it-FP8-dynamic`, I get:

> 
> 
> =============================
> == Triton Inference Server ==
> =============================
> 
> NVIDIA Release 25.05 (build 172940304)
> Triton Server Version 2.58.0
> 
> Copyright (c) 2018-2025, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
> 
> Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
> 
> GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement
> (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/)
> and the Product-Specific Terms for NVIDIA AI Products
> (found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).
> 
> WARNING: CUDA Minor Version Compatibility mode ENABLED.
>   Using driver version 570.153.02 which has support for CUDA 12.8.  This container
>   was built with CUDA 12.9 and will be run in Minor Version Compatibility mode.
>   CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
>   with this container but was unavailable:
>   [[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
>   See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
> 
> INFO 06-19 13:46:11 [__init__.py:239] Automatically detected platform cuda.
> INFO 06-19 13:46:11 [api_server.py:1034] vLLM API server version 0.8.4+dc1a3e10.nv25.05
> INFO 06-19 13:46:11 [api_server.py:1035] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='RedHatAI/gemma-3-12b-it-FP8-dynamic', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=8192, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=128, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, 
enforce_eager=True, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
> INFO 06-19 13:46:16 [config.py:689] This model supports multiple tasks: {'score', 'generate', 'embed', 'classify', 'reward'}. Defaulting to 'generate'.
> INFO 06-19 13:46:17 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=2048.
> WARNING 06-19 13:46:17 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
> INFO 06-19 13:46:20 [__init__.py:239] Automatically detected platform cuda.
> INFO 06-19 13:46:21 [core.py:61] Initializing a V1 LLM engine (v0.8.4+dc1a3e10.nv25.05) with config: model='RedHatAI/gemma-3-12b-it-FP8-dynamic', speculative_config=None, tokenizer='RedHatAI/gemma-3-12b-it-FP8-dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=RedHatAI/gemma-3-12b-it-FP8-dynamic, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
> 2025-06-19 13:46:21,834 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
> WARNING 06-19 13:46:22 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x773ce02bdee0>
> [W619 13:46:22.908698114 ProcessGroupNCCL.cpp:959] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
> [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
> [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
> [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
> [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
> INFO 06-19 13:46:22 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
> INFO 06-19 13:46:22 [cuda.py:221] Using Flash Attention backend on V1 engine.
> Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
> INFO 06-19 13:46:26 [gpu_model_runner.py:1276] Starting to load model RedHatAI/gemma-3-12b-it-FP8-dynamic...
> INFO 06-19 13:46:26 [config.py:3466] cudagraph sizes specified by model runner [] is overridden by config []
> INFO 06-19 13:46:26 [topk_topp_sampler.py:44] Currently, FlashInfer top-p & top-k sampling sampler is disabled because FlashInfer>=v0.2.3 is not backward compatible. Falling back to the PyTorch-native implementation of top-p & top-k sampling.
> INFO 06-19 13:46:26 [weight_utils.py:265] Using model weights format ['*.safetensors']
> Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
> Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:00<00:01,  1.71it/s]
> Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:01<00:00,  1.52it/s]
> Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  1.68it/s]
> Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  1.65it/s]
> 
> INFO 06-19 13:46:29 [loader.py:458] Loading weights took 1.89 seconds
> INFO 06-19 13:46:29 [gpu_model_runner.py:1291] Model loading took 13.2955 GiB and 2.631252 seconds
> INFO 06-19 13:46:29 [gpu_model_runner.py:1560] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 8 image items of the maximum feature size.
> ERROR 06-19 13:46:31 [core.py:387] EngineCore hit an exception: Traceback (most recent call last):
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
> ERROR 06-19 13:46:31 [core.py:387]     engine_core = EngineCoreProc(*args, **kwargs)
> ERROR 06-19 13:46:31 [core.py:387]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 320, in __init__
> ERROR 06-19 13:46:31 [core.py:387]     super().__init__(vllm_config, executor_class, log_stats)
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 71, in __init__
> ERROR 06-19 13:46:31 [core.py:387]     self._initialize_kv_caches(vllm_config)
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 133, in _initialize_kv_caches
> ERROR 06-19 13:46:31 [core.py:387]     available_gpu_memory = self.model_executor.determine_available_memory()
> ERROR 06-19 13:46:31 [core.py:387]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
> ERROR 06-19 13:46:31 [core.py:387]     output = self.collective_rpc("determine_available_memory")
> ERROR 06-19 13:46:31 [core.py:387]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
> ERROR 06-19 13:46:31 [core.py:387]     answer = run_method(self.driver_worker, method, args, kwargs)
> ERROR 06-19 13:46:31 [core.py:387]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2378, in run_method
> ERROR 06-19 13:46:31 [core.py:387]     return func(*args, **kwargs)
> ERROR 06-19 13:46:31 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
> ERROR 06-19 13:46:31 [core.py:387]     return func(*args, **kwargs)
> ERROR 06-19 13:46:31 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 157, in determine_available_memory
> ERROR 06-19 13:46:31 [core.py:387]     self.model_runner.profile_run()
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1591, in profile_run
> ERROR 06-19 13:46:31 [core.py:387]     hidden_states = self._dummy_run(self.max_num_tokens)
> ERROR 06-19 13:46:31 [core.py:387]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
> ERROR 06-19 13:46:31 [core.py:387]     return func(*args, **kwargs)
> ERROR 06-19 13:46:31 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1441, in _dummy_run
> ERROR 06-19 13:46:31 [core.py:387]     hidden_states = model(
> ERROR 06-19 13:46:31 [core.py:387]                     ^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
> ERROR 06-19 13:46:31 [core.py:387]     return self._call_impl(*args, **kwargs)
> ERROR 06-19 13:46:31 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
> ERROR 06-19 13:46:31 [core.py:387]     return forward_call(*args, **kwargs)
> ERROR 06-19 13:46:31 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py", line 630, in forward
> ERROR 06-19 13:46:31 [core.py:387]     hidden_states = self.language_model.model(input_ids,
> ERROR 06-19 13:46:31 [core.py:387]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 172, in __call__
> ERROR 06-19 13:46:31 [core.py:387]     return self.forward(*args, **kwargs)
> ERROR 06-19 13:46:31 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 400, in forward
> ERROR 06-19 13:46:31 [core.py:387]     hidden_states, residual = layer(
> ERROR 06-19 13:46:31 [core.py:387]                               ^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
> ERROR 06-19 13:46:31 [core.py:387]     return self._call_impl(*args, **kwargs)
> ERROR 06-19 13:46:31 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
> ERROR 06-19 13:46:31 [core.py:387]     return forward_call(*args, **kwargs)
> ERROR 06-19 13:46:31 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 329, in forward
> ERROR 06-19 13:46:31 [core.py:387]     hidden_states = self.self_attn(
> ERROR 06-19 13:46:31 [core.py:387]                     ^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
> ERROR 06-19 13:46:31 [core.py:387]     return self._call_impl(*args, **kwargs)
> ERROR 06-19 13:46:31 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
> ERROR 06-19 13:46:31 [core.py:387]     return forward_call(*args, **kwargs)
> ERROR 06-19 13:46:31 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 188, in forward
> ERROR 06-19 13:46:31 [core.py:387]     qkv, _ = self.qkv_proj(hidden_states)
> ERROR 06-19 13:46:31 [core.py:387]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
> ERROR 06-19 13:46:31 [core.py:387]     return self._call_impl(*args, **kwargs)
> ERROR 06-19 13:46:31 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
> ERROR 06-19 13:46:31 [core.py:387]     return forward_call(*args, **kwargs)
> ERROR 06-19 13:46:31 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 474, in forward
> ERROR 06-19 13:46:31 [core.py:387]     output_parallel = self.quant_method.apply(self, input_, bias)
> ERROR 06-19 13:46:31 [core.py:387]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 580, in apply
> ERROR 06-19 13:46:31 [core.py:387]     return scheme.apply_weights(layer, x, bias=bias)
> ERROR 06-19 13:46:31 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py", line 144, in apply_weights
> ERROR 06-19 13:46:31 [core.py:387]     return self.fp8_linear.apply(input=x,
> ERROR 06-19 13:46:31 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/w8a8_utils.py", line 200, in apply
> ERROR 06-19 13:46:31 [core.py:387]     output = ops.cutlass_scaled_mm(qinput,
> ERROR 06-19 13:46:31 [core.py:387]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 568, in cutlass_scaled_mm
> ERROR 06-19 13:46:31 [core.py:387]     torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)
> ERROR 06-19 13:46:31 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1158, in __call__
> ERROR 06-19 13:46:31 [core.py:387]     return self._op(*args, **(kwargs or {}))
> ERROR 06-19 13:46:31 [core.py:387]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ERROR 06-19 13:46:31 [core.py:387] RuntimeError: Error Internal
> ERROR 06-19 13:46:31 [core.py:387] 
> CRITICAL 06-19 13:46:31 [core_client.py:359] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
> Traceback (most recent call last):
>   File "<frozen runpy>", line 198, in _run_module_as_main
>   File "<frozen runpy>", line 88, in _run_code
>   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1121, in <module>
>     uvloop.run(run_server(args))
>   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
>     return __asyncio.run(
>            ^^^^^^^^^^^^^^
>   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
>     return runner.run(main)
>            ^^^^^^^^^^^^^^^^
>   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
>     return self._loop.run_until_complete(task)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
>   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
>     return await main
>            ^^^^^^^^^^
>   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
>     async with build_async_engine_client(args) as engine_client:
>   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
>     return await anext(self.gen)
>            ^^^^^^^^^^^^^^^^^^^^^
>   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
>     async with build_async_engine_client_from_engine_args(
>   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
>     return await anext(self.gen)
>            ^^^^^^^^^^^^^^^^^^^^^
>   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
>     async_llm = AsyncLLM.from_vllm_config(
>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 136, in from_vllm_config
>     return cls(
>            ^^^^
>   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 102, in __init__
>     self.engine_core = EngineCoreClient.make_client(
>                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 71, in make_client
>     return AsyncMPClient(vllm_config, executor_class, log_stats)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 604, in __init__
>     super().__init__(
>   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 404, in __init__
>     self._wait_for_engine_startup()
>   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 426, in _wait_for_engine_startup
>     raise RuntimeError("Engine core initialization failed. "
> RuntimeError: Engine core initialization failed. See root cause above.

Is there anything I can do? I am trying to build an updated image based on the issues linked above, but the builds fail with errors and each attempt takes a long time.

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
