[CI Failure]: Quantization Test - quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model #19964

@mgoin

Description

Name of failing test

quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight]

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

pytest -s -v "quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight]"
INFO 06-23 04:48:10 [__init__.py:244] Automatically detected platform cuda.
/home/mgoin/venvs/vllm/lib/python3.12/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
============================================================================================ test session starts =============================================================================================
platform linux -- Python 3.12.4, pytest-8.3.3, pluggy-1.5.0 -- /home/mgoin/venvs/vllm/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/home/mgoin/code/vllm/tests/.hypothesis/examples'))
rootdir: /home/mgoin/code/vllm
configfile: pyproject.toml
plugins: forked-1.6.0, subtests-0.14.1, asyncio-0.24.0, shard-0.1.2, buildkite-test-collector-0.1.9, timeout-2.3.1, schemathesis-3.39.15, anyio-4.6.2.post1, mock-3.14.0, hypothesis-6.131.0, rerunfailures-14.0
asyncio: mode=Mode.STRICT, default_loop_scope=None
collected 1 item                                                                                                                                                                                             
Running 1 items in this shard: tests/quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight]

quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight] Fork a new process to run a test 1747225
Fork a new process to run a test 0
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.80s/it]
INFO 06-23 04:48:29 [config.py:588] Found sentence-transformers tokenize configuration.
INFO 06-23 04:48:35 [config.py:484] Found sentence-transformers modules configuration.
INFO 06-23 04:48:35 [config.py:504] Found pooling configuration.
INFO 06-23 04:48:35 [config.py:1444] Using max model len 1024
WARNING 06-23 04:48:35 [config.py:939] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 06-23 04:48:35 [arg_utils.py:1568] (Enabling) prefix caching by default
WARNING 06-23 04:48:36 [utils.py:2613] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reason: CUDA is initialized
INFO 06-23 04:48:39 [__init__.py:244] Automatically detected platform cuda.
INFO 06-23 04:48:42 [core.py:459] Waiting for init message from front-end.
INFO 06-23 04:48:42 [core.py:69] Initializing a V1 LLM engine (v0.9.1.dev287+g89b1388d8) with config: model='intfloat/e5-mistral-7b-instruct', speculative_config=None, tokenizer='intfloat/e5-mistral-7b-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=intfloat/e5-mistral-7b-instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type='LAST', normalize=True, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 06-23 04:48:42 [utils.py:2753] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7ab4c475f5f0>
INFO 06-23 04:48:42 [parallel_state.py:1072] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 06-23 04:48:42 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 06-23 04:48:42 [gpu_model_runner.py:1696] Starting to load model intfloat/e5-mistral-7b-instruct...
INFO 06-23 04:48:43 [gpu_model_runner.py:1701] Loading model from scratch...
INFO 06-23 04:48:43 [cuda.py:270] Using Flash Attention backend on V1 engine.
INFO 06-23 04:48:43 [bitsandbytes_loader.py:454] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 06-23 04:48:43 [weight_utils.py:292] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.37s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.00s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.06s/it]

INFO 06-23 04:48:46 [gpu_model_runner.py:1725] Model loading took 3.9099 GiB and 2.797580 seconds
INFO 06-23 04:48:51 [backends.py:508] Using cache directory: /home/mgoin/.cache/vllm/torch_compile_cache/dee6dd784e/rank_0_0/backbone for vLLM's torch.compile
INFO 06-23 04:48:51 [backends.py:519] Dynamo bytecode transform time: 4.86 s
INFO 06-23 04:48:54 [backends.py:155] Directly load the compiled graph(s) for shape None from the cache, took 3.547 s
INFO 06-23 04:48:55 [monitor.py:34] torch.compile takes 4.86 s in total
INFO 06-23 04:48:56 [gpu_worker.py:232] Available KV cache memory: 66.87 GiB
INFO 06-23 04:48:56 [kv_cache_utils.py:716] GPU KV cache size: 547,776 tokens
INFO 06-23 04:48:56 [kv_cache_utils.py:720] Maximum concurrency for 1,024 tokens per request: 526.71x
WARNING 06-23 04:48:56 [utils.py:101] Unable to detect current VLLM config. Defaulting to NHD kv cache layout.
Capturing CUDA graphs:   0%|                                                                                                                                                           | 0/67 [00:00<?, ?it/s]
ERROR 06-23 04:48:56 [core.py:519] EngineCore failed to start.
ERROR 06-23 04:48:56 [core.py:519] Traceback (most recent call last):
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 510, in run_engine_core
ERROR 06-23 04:48:56 [core.py:519]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 394, in __init__
ERROR 06-23 04:48:56 [core.py:519]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 82, in __init__
ERROR 06-23 04:48:56 [core.py:519]     self._initialize_kv_caches(vllm_config)
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 169, in _initialize_kv_caches
ERROR 06-23 04:48:56 [core.py:519]     self.model_executor.initialize_from_config(kv_cache_configs)
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/code/vllm/vllm/v1/executor/abstract.py", line 66, in initialize_from_config
ERROR 06-23 04:48:56 [core.py:519]     self.collective_rpc("compile_or_warm_up_model")
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/code/vllm/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 06-23 04:48:56 [core.py:519]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-23 04:48:56 [core.py:519]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/code/vllm/vllm/utils.py", line 2687, in run_method
ERROR 06-23 04:48:56 [core.py:519]     return func(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_worker.py", line 266, in compile_or_warm_up_model
ERROR 06-23 04:48:56 [core.py:519]     self.model_runner.capture_model()
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 2213, in capture_model
ERROR 06-23 04:48:56 [core.py:519]     self._dummy_run(num_tokens, capture_attn_cudagraph=full_cg)
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-23 04:48:56 [core.py:519]     return func(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 1969, in _dummy_run
ERROR 06-23 04:48:56 [core.py:519]     outputs = model(
ERROR 06-23 04:48:56 [core.py:519]               ^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-23 04:48:56 [core.py:519]     return self._call_impl(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-23 04:48:56 [core.py:519]     return forward_call(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 581, in forward
ERROR 06-23 04:48:56 [core.py:519]     model_output = self.model(input_ids, positions, intermediate_tensors,
ERROR 06-23 04:48:56 [core.py:519]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/code/vllm/vllm/compilation/decorators.py", line 246, in __call__
ERROR 06-23 04:48:56 [core.py:519]     model_output = self.forward(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 368, in forward
ERROR 06-23 04:48:56 [core.py:519]     def forward(
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-23 04:48:56 [core.py:519]     return self._call_impl(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-23 04:48:56 [core.py:519]     return forward_call(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
ERROR 06-23 04:48:56 [core.py:519]     return fn(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519]            ^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 830, in call_wrapped
ERROR 06-23 04:48:56 [core.py:519]     return self._wrapped_call(self, *args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 406, in __call__
ERROR 06-23 04:48:56 [core.py:519]     raise e
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 393, in __call__
ERROR 06-23 04:48:56 [core.py:519]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
ERROR 06-23 04:48:56 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-23 04:48:56 [core.py:519]     return self._call_impl(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-23 04:48:56 [core.py:519]     return forward_call(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "<eval_with_key>.66", line 337, in forward
ERROR 06-23 04:48:56 [core.py:519]     submod_2 = self.submod_2(getitem_3, s0, l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_bnb_shard_offsets, getitem_4, l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_bnb_shard_offsets, l_self_modules_layers_modules_0_modules_mlp_modules_down_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_down_proj_parameters_weight_bnb_shard_offsets, l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_bnb_shard_offsets, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_);  getitem_3 = l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_bnb_shard_offsets = getitem_4 = l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_bnb_shard_offsets = l_self_modules_layers_modules_0_modules_mlp_modules_down_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_down_proj_parameters_weight_bnb_shard_offsets = l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_bnb_shard_offsets = None
ERROR 06-23 04:48:56 [core.py:519]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/code/vllm/vllm/compilation/cuda_piecewise_backend.py", line 156, in __call__
ERROR 06-23 04:48:56 [core.py:519]     return entry.runnable(*args)
ERROR 06-23 04:48:56 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/code/vllm/vllm/compilation/compiler_interface.py", line 510, in compiled_graph
ERROR 06-23 04:48:56 [core.py:519]     graph_output = inductor_compiled_graph(list_args)
ERROR 06-23 04:48:56 [core.py:519]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 460, in __call__
ERROR 06-23 04:48:56 [core.py:519]     return self.current_callable(inputs)
ERROR 06-23 04:48:56 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/.cache/vllm/torch_compile_cache/dee6dd784e/rank_0_0/inductor_cache/3k/c3kedtjuicpyiyo55z5hejbsjwtnfepkyplcvf6hyoj4zxhhu3pa.py", line 589, in call
ERROR 06-23 04:48:56 [core.py:519]     torch.ops.vllm.apply_bnb_4bit.default(buf6, arg6_1, arg7_1, buf5)
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/_ops.py", line 756, in __call__
ERROR 06-23 04:48:56 [core.py:519]     return self._op(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/code/vllm/vllm/model_executor/layers/quantization/bitsandbytes.py", line 372, in _apply_bnb_4bit
ERROR 06-23 04:48:56 [core.py:519]     out[:, current_index:current_index + output_size] = matmul_4bit(
ERROR 06-23 04:48:56 [core.py:519]                                                         ^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/bitsandbytes/autograd/_functions.py", line 533, in matmul_4bit
ERROR 06-23 04:48:56 [core.py:519]     return MatMul4Bit.apply(A, B, out, bias, quant_state)
ERROR 06-23 04:48:56 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/autograd/function.py", line 575, in apply
ERROR 06-23 04:48:56 [core.py:519]     return super().apply(*args, **kwargs)  # type: ignore[misc]
ERROR 06-23 04:48:56 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/bitsandbytes/autograd/_functions.py", line 462, in forward
ERROR 06-23 04:48:56 [core.py:519]     output = torch.nn.functional.linear(A, F.dequantize_4bit(B, quant_state).to(A.dtype).t(), bias)
ERROR 06-23 04:48:56 [core.py:519]                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 79.19 GiB of which 77.44 MiB is free. Process 1747225 has 7.29 GiB memory in use. Including non-PyTorch memory, this process has 71.81 GiB memory in use. Of the allocated memory 71.01 GiB is allocated by PyTorch, and 73.96 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Process EngineCore_0:
Traceback (most recent call last):
  File "/home/mgoin/.local/share/uv/python/cpython-3.12.4-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/mgoin/.local/share/uv/python/cpython-3.12.4-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 523, in run_engine_core
    raise e
  File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 510, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 394, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 82, in __init__
    self._initialize_kv_caches(vllm_config)
  File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 169, in _initialize_kv_caches
    self.model_executor.initialize_from_config(kv_cache_configs)
  File "/home/mgoin/code/vllm/vllm/v1/executor/abstract.py", line 66, in initialize_from_config
    self.collective_rpc("compile_or_warm_up_model")
  File "/home/mgoin/code/vllm/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/utils.py", line 2687, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_worker.py", line 266, in compile_or_warm_up_model
    self.model_runner.capture_model()
  File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 2213, in capture_model
    self._dummy_run(num_tokens, capture_attn_cudagraph=full_cg)
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 1969, in _dummy_run
    outputs = model(
              ^^^^^^
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 581, in forward
    model_output = self.model(input_ids, positions, intermediate_tensors,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/compilation/decorators.py", line 246, in __call__
    model_output = self.forward(*args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 368, in forward
    def forward(
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 830, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 406, in __call__
    raise e
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 393, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<eval_with_key>.66", line 337, in forward
    submod_2 = self.submod_2(getitem_3, s0, l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_bnb_shard_offsets, getitem_4, l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_bnb_shard_offsets, l_self_modules_layers_modules_0_modules_mlp_modules_down_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_down_proj_parameters_weight_bnb_shard_offsets, l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_bnb_shard_offsets, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_);  getitem_3 = l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_bnb_shard_offsets = getitem_4 = l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_bnb_shard_offsets = l_self_modules_layers_modules_0_modules_mlp_modules_down_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_down_proj_parameters_weight_bnb_shard_offsets = l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_bnb_shard_offsets = None
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/compilation/cuda_piecewise_backend.py", line 156, in __call__
    return entry.runnable(*args)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/compilation/compiler_interface.py", line 510, in compiled_graph
    graph_output = inductor_compiled_graph(list_args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 460, in __call__
    return self.current_callable(inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/.cache/vllm/torch_compile_cache/dee6dd784e/rank_0_0/inductor_cache/3k/c3kedtjuicpyiyo55z5hejbsjwtnfepkyplcvf6hyoj4zxhhu3pa.py", line 589, in call
    torch.ops.vllm.apply_bnb_4bit.default(buf6, arg6_1, arg7_1, buf5)
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/_ops.py", line 756, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/model_executor/layers/quantization/bitsandbytes.py", line 372, in _apply_bnb_4bit
    out[:, current_index:current_index + output_size] = matmul_4bit(
                                                        ^^^^^^^^^^^^
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/bitsandbytes/autograd/_functions.py", line 533, in matmul_4bit
    return MatMul4Bit.apply(A, B, out, bias, quant_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/bitsandbytes/autograd/_functions.py", line 462, in forward
    output = torch.nn.functional.linear(A, F.dequantize_4bit(B, quant_state).to(A.dtype).t(), bias)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 79.19 GiB of which 77.44 MiB is free. Process 1747225 has 7.29 GiB memory in use. Including non-PyTorch memory, this process has 71.81 GiB memory in use. Of the allocated memory 71.01 GiB is allocated by PyTorch, and 73.96 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W623 04:48:57.557535085 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "/home/mgoin/code/vllm/tests/utils.py", line 741, in wrapper
    f(*args, **kwargs)
  File "/home/mgoin/code/vllm/tests/quantization/test_bitsandbytes.py", line 159, in test_4bit_bnb_embedding_model
    with vllm_runner(model_name,
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/tests/conftest.py", line 787, in __init__
    self.model = LLM(
                 ^^^^
  File "/home/mgoin/code/vllm/vllm/entrypoints/llm.py", line 263, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/engine/llm_engine.py", line 501, in from_engine_args
    return engine_cls.from_vllm_config(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/v1/engine/llm_engine.py", line 124, in from_vllm_config
    return cls(vllm_config=vllm_config,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/v1/engine/llm_engine.py", line 101, in __init__
    self.engine_core = EngineCoreClient.make_client(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/v1/engine/core_client.py", line 75, in make_client
    return SyncMPClient(vllm_config, executor_class, log_stats)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/v1/engine/core_client.py", line 558, in __init__
    super().__init__(
  File "/home/mgoin/code/vllm/vllm/v1/engine/core_client.py", line 422, in __init__
    self._init_engines_direct(vllm_config, local_only,
  File "/home/mgoin/code/vllm/vllm/v1/engine/core_client.py", line 491, in _init_engines_direct
    self._wait_for_engine_startup(handshake_socket, input_address,
  File "/home/mgoin/code/vllm/vllm/v1/engine/core_client.py", line 511, in _wait_for_engine_startup
    wait_for_engine_startup(
  File "/home/mgoin/code/vllm/vllm/v1/utils.py", line 494, in wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
FAILED

================================================================================================== FAILURES ==================================================================================================
___________________________________________________ test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight] ____________________________________________________

args = ()
kwargs = {'description': 'quantize embedding model inflight', 'dtype': 'half', 'example_prompts': ['vLLM is a high-throughput a...n global economic structures and future business models.\n', ...], 'hf_runner': <class 'tests.conftest.HfRunner'>, ...}
Skipped = <class 'Skipped'>, pid = 1747225, pgid = 1747030, _pid = 1747225, _exitcode = 256, old_signal_handler = <Handlers.SIG_DFL: 0>

    @functools.wraps(f)
    def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> None:
        # Make the process the leader of its own process group
        # to avoid sending SIGTERM to the parent process
        os.setpgrp()
        from _pytest.outcomes import Skipped
        pid = os.fork()
        print(f"Fork a new process to run a test {pid}")
        if pid == 0:
            try:
                f(*args, **kwargs)
            except Skipped as e:
                # convert Skipped to exit code 0
                print(str(e))
                os._exit(0)
            except Exception:
                import traceback
                traceback.print_exc()
                os._exit(1)
            else:
                os._exit(0)
        else:
            pgid = os.getpgid(pid)
            _pid, _exitcode = os.waitpid(pid, 0)
            # ignore SIGTERM signal itself
            old_signal_handler = signal.signal(signal.SIGTERM, signal.SIG_IGN)
            # kill all child processes
            os.killpg(pgid, signal.SIGTERM)
            # restore the signal handler
            signal.signal(signal.SIGTERM, old_signal_handler)
>           assert _exitcode == 0, (f"function {f} failed when called with"
                                    f" args {args} and kwargs {kwargs}")
E           AssertionError: function <function test_4bit_bnb_embedding_model at 0x75cfc981f9c0> failed when called with args () and kwargs {'model_name': 'intfloat/e5-mistral-7b-instruct', 'description': 'quantize embedding model inflight', 'hf_runner': <class 'tests.conftest.HfRunner'>, 'vllm_runner': <class 'tests.conftest.VllmRunner'>, 'example_prompts': ['vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\n', 'Briefly describe the major milestones in the development of artificial intelligence from 1950 to 2020.\n', 'Compare and contrast artificial intelligence with human intelligence in terms of processing information.\n', 'Describe the basic components of a neural network and how it can be trained.\n', 'Write a short story about a robot that dreams for the first time.\n', 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\n', 'Explain the cultural significance of the Mona Lisa painting, and how its perception might vary in Western versus Eastern societies.\n', "Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.'\n"], 'dtype': 'half'}

utils.py:761: AssertionError
============================================================================================== warnings summary ==============================================================================================
../../../venvs/vllm/lib/python3.12/site-packages/schemathesis/generation/coverage.py:305
  /home/mgoin/venvs/vllm/lib/python3.12/site-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable.
    ref_error: type[Exception] = jsonschema.RefResolutionError,

tests/quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight]
  /home/mgoin/code/vllm/tests/utils.py:737: DeprecationWarning: This process (pid=1747030) is multi-threaded, use of fork() may lead to deadlocks in the child.
    pid = os.fork()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================================== short test summary info ===========================================================================================
FAILED quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight] - AssertionError: function <function test_4bit_bnb_embedding_model at 0x75cfc981f9c0> failed when called with args () and kwargs {'model_name': 'intfloat/e5-mistral-7b-instruct', 'description': 'quantize...
======================================================================================= 1 failed, 2 warnings in 47.09s =======================================================================================
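
What the log shows: model load succeeds, the engine reserves 66.87 GiB for KV cache, and the OOM is hit on the very first CUDA graph capture step, inside bitsandbytes' dequantize_4bit. At that point the forked test process (pid 1747225, presumably still holding the HF reference model loaded earlier in the test) has 7.29 GiB in use and the spawned EngineCore process has 71.81 GiB, leaving only ~77 MiB free out of the GPU's 79.19 GiB.

For local triage, the allocator setting suggested by the OOM message itself can be prepended to the repro command above (this is only the hint printed by PyTorch to reduce fragmentation, not a confirmed fix):

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  pytest -s -v "quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight]"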

📝 History of failing test

https://buildkite.com/vllm/ci/builds/22498/summary/annotations?jid=01979a3a-fc0d-4b8e-96a1-fe70e2d781b8

CC List.

@jeejeelee

Metadata

Labels: ci-failure (Issue about an unexpected test failure in CI)

Status: Done