Closed
Labels
ci-failure (Issue about an unexpected test failure in CI)
Description
Name of failing test
quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight]
Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in transformers)
🧪 Describe the failing test
pytest -s -v "quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight]"
INFO 06-23 04:48:10 [__init__.py:244] Automatically detected platform cuda.
/home/mgoin/venvs/vllm/lib/python3.12/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
============================================================================================ test session starts =============================================================================================
platform linux -- Python 3.12.4, pytest-8.3.3, pluggy-1.5.0 -- /home/mgoin/venvs/vllm/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/home/mgoin/code/vllm/tests/.hypothesis/examples'))
rootdir: /home/mgoin/code/vllm
configfile: pyproject.toml
plugins: forked-1.6.0, subtests-0.14.1, asyncio-0.24.0, shard-0.1.2, buildkite-test-collector-0.1.9, timeout-2.3.1, schemathesis-3.39.15, anyio-4.6.2.post1, mock-3.14.0, hypothesis-6.131.0, rerunfailures-14.0
asyncio: mode=Mode.STRICT, default_loop_scope=None
collected 1 item
Running 1 items in this shard: tests/quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight]
quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight] Fork a new process to run a test 1747225
Fork a new process to run a test 0
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00, 4.80s/it]
INFO 06-23 04:48:29 [config.py:588] Found sentence-transformers tokenize configuration.
INFO 06-23 04:48:35 [config.py:484] Found sentence-transformers modules configuration.
INFO 06-23 04:48:35 [config.py:504] Found pooling configuration.
INFO 06-23 04:48:35 [config.py:1444] Using max model len 1024
WARNING 06-23 04:48:35 [config.py:939] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 06-23 04:48:35 [arg_utils.py:1568] (Enabling) prefix caching by default
WARNING 06-23 04:48:36 [utils.py:2613] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reason: CUDA is initialized
INFO 06-23 04:48:39 [__init__.py:244] Automatically detected platform cuda.
INFO 06-23 04:48:42 [core.py:459] Waiting for init message from front-end.
INFO 06-23 04:48:42 [core.py:69] Initializing a V1 LLM engine (v0.9.1.dev287+g89b1388d8) with config: model='intfloat/e5-mistral-7b-instruct', speculative_config=None, tokenizer='intfloat/e5-mistral-7b-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=intfloat/e5-mistral-7b-instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type='LAST', normalize=True, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 06-23 04:48:42 [utils.py:2753] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7ab4c475f5f0>
INFO 06-23 04:48:42 [parallel_state.py:1072] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 06-23 04:48:42 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 06-23 04:48:42 [gpu_model_runner.py:1696] Starting to load model intfloat/e5-mistral-7b-instruct...
INFO 06-23 04:48:43 [gpu_model_runner.py:1701] Loading model from scratch...
INFO 06-23 04:48:43 [cuda.py:270] Using Flash Attention backend on V1 engine.
INFO 06-23 04:48:43 [bitsandbytes_loader.py:454] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 06-23 04:48:43 [weight_utils.py:292] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:01<00:01, 1.37s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00, 1.00s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00, 1.06s/it]
INFO 06-23 04:48:46 [gpu_model_runner.py:1725] Model loading took 3.9099 GiB and 2.797580 seconds
INFO 06-23 04:48:51 [backends.py:508] Using cache directory: /home/mgoin/.cache/vllm/torch_compile_cache/dee6dd784e/rank_0_0/backbone for vLLM's torch.compile
INFO 06-23 04:48:51 [backends.py:519] Dynamo bytecode transform time: 4.86 s
INFO 06-23 04:48:54 [backends.py:155] Directly load the compiled graph(s) for shape None from the cache, took 3.547 s
INFO 06-23 04:48:55 [monitor.py:34] torch.compile takes 4.86 s in total
INFO 06-23 04:48:56 [gpu_worker.py:232] Available KV cache memory: 66.87 GiB
INFO 06-23 04:48:56 [kv_cache_utils.py:716] GPU KV cache size: 547,776 tokens
INFO 06-23 04:48:56 [kv_cache_utils.py:720] Maximum concurrency for 1,024 tokens per request: 526.71x
WARNING 06-23 04:48:56 [utils.py:101] Unable to detect current VLLM config. Defaulting to NHD kv cache layout.
Capturing CUDA graphs: 0%| | 0/67 [00:00<?, ?it/s]
ERROR 06-23 04:48:56 [core.py:519] EngineCore failed to start.
ERROR 06-23 04:48:56 [core.py:519] Traceback (most recent call last):
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 510, in run_engine_core
ERROR 06-23 04:48:56 [core.py:519] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 394, in __init__
ERROR 06-23 04:48:56 [core.py:519] super().__init__(vllm_config, executor_class, log_stats,
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 82, in __init__
ERROR 06-23 04:48:56 [core.py:519] self._initialize_kv_caches(vllm_config)
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 169, in _initialize_kv_caches
ERROR 06-23 04:48:56 [core.py:519] self.model_executor.initialize_from_config(kv_cache_configs)
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/code/vllm/vllm/v1/executor/abstract.py", line 66, in initialize_from_config
ERROR 06-23 04:48:56 [core.py:519] self.collective_rpc("compile_or_warm_up_model")
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/code/vllm/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 06-23 04:48:56 [core.py:519] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/code/vllm/vllm/utils.py", line 2687, in run_method
ERROR 06-23 04:48:56 [core.py:519] return func(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_worker.py", line 266, in compile_or_warm_up_model
ERROR 06-23 04:48:56 [core.py:519] self.model_runner.capture_model()
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 2213, in capture_model
ERROR 06-23 04:48:56 [core.py:519] self._dummy_run(num_tokens, capture_attn_cudagraph=full_cg)
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-23 04:48:56 [core.py:519] return func(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 1969, in _dummy_run
ERROR 06-23 04:48:56 [core.py:519] outputs = model(
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-23 04:48:56 [core.py:519] return self._call_impl(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-23 04:48:56 [core.py:519] return forward_call(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 581, in forward
ERROR 06-23 04:48:56 [core.py:519] model_output = self.model(input_ids, positions, intermediate_tensors,
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/code/vllm/vllm/compilation/decorators.py", line 246, in __call__
ERROR 06-23 04:48:56 [core.py:519] model_output = self.forward(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 368, in forward
ERROR 06-23 04:48:56 [core.py:519] def forward(
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-23 04:48:56 [core.py:519] return self._call_impl(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-23 04:48:56 [core.py:519] return forward_call(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
ERROR 06-23 04:48:56 [core.py:519] return fn(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 830, in call_wrapped
ERROR 06-23 04:48:56 [core.py:519] return self._wrapped_call(self, *args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 406, in __call__
ERROR 06-23 04:48:56 [core.py:519] raise e
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 393, in __call__
ERROR 06-23 04:48:56 [core.py:519] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-23 04:48:56 [core.py:519] return self._call_impl(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-23 04:48:56 [core.py:519] return forward_call(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "<eval_with_key>.66", line 337, in forward
ERROR 06-23 04:48:56 [core.py:519] submod_2 = self.submod_2(getitem_3, s0, l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_bnb_shard_offsets, getitem_4, l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_bnb_shard_offsets, l_self_modules_layers_modules_0_modules_mlp_modules_down_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_down_proj_parameters_weight_bnb_shard_offsets, l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_bnb_shard_offsets, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_); getitem_3 = l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_bnb_shard_offsets = getitem_4 = l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_bnb_shard_offsets = l_self_modules_layers_modules_0_modules_mlp_modules_down_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_down_proj_parameters_weight_bnb_shard_offsets = l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_bnb_shard_offsets = None
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/code/vllm/vllm/compilation/cuda_piecewise_backend.py", line 156, in __call__
ERROR 06-23 04:48:56 [core.py:519] return entry.runnable(*args)
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/code/vllm/vllm/compilation/compiler_interface.py", line 510, in compiled_graph
ERROR 06-23 04:48:56 [core.py:519] graph_output = inductor_compiled_graph(list_args)
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 460, in __call__
ERROR 06-23 04:48:56 [core.py:519] return self.current_callable(inputs)
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/.cache/vllm/torch_compile_cache/dee6dd784e/rank_0_0/inductor_cache/3k/c3kedtjuicpyiyo55z5hejbsjwtnfepkyplcvf6hyoj4zxhhu3pa.py", line 589, in call
ERROR 06-23 04:48:56 [core.py:519] torch.ops.vllm.apply_bnb_4bit.default(buf6, arg6_1, arg7_1, buf5)
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/_ops.py", line 756, in __call__
ERROR 06-23 04:48:56 [core.py:519] return self._op(*args, **kwargs)
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/code/vllm/vllm/model_executor/layers/quantization/bitsandbytes.py", line 372, in _apply_bnb_4bit
ERROR 06-23 04:48:56 [core.py:519] out[:, current_index:current_index + output_size] = matmul_4bit(
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/bitsandbytes/autograd/_functions.py", line 533, in matmul_4bit
ERROR 06-23 04:48:56 [core.py:519] return MatMul4Bit.apply(A, B, out, bias, quant_state)
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/autograd/function.py", line 575, in apply
ERROR 06-23 04:48:56 [core.py:519] return super().apply(*args, **kwargs) # type: ignore[misc]
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/bitsandbytes/autograd/_functions.py", line 462, in forward
ERROR 06-23 04:48:56 [core.py:519] output = torch.nn.functional.linear(A, F.dequantize_4bit(B, quant_state).to(A.dtype).t(), bias)
ERROR 06-23 04:48:56 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-23 04:48:56 [core.py:519] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 79.19 GiB of which 77.44 MiB is free. Process 1747225 has 7.29 GiB memory in use. Including non-PyTorch memory, this process has 71.81 GiB memory in use. Of the allocated memory 71.01 GiB is allocated by PyTorch, and 73.96 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Process EngineCore_0:
Traceback (most recent call last):
File "/home/mgoin/.local/share/uv/python/cpython-3.12.4-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/mgoin/.local/share/uv/python/cpython-3.12.4-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 523, in run_engine_core
raise e
File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 510, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 394, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 82, in __init__
self._initialize_kv_caches(vllm_config)
File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 169, in _initialize_kv_caches
self.model_executor.initialize_from_config(kv_cache_configs)
File "/home/mgoin/code/vllm/vllm/v1/executor/abstract.py", line 66, in initialize_from_config
self.collective_rpc("compile_or_warm_up_model")
File "/home/mgoin/code/vllm/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/code/vllm/vllm/utils.py", line 2687, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_worker.py", line 266, in compile_or_warm_up_model
self.model_runner.capture_model()
File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 2213, in capture_model
self._dummy_run(num_tokens, capture_attn_cudagraph=full_cg)
File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 1969, in _dummy_run
outputs = model(
^^^^^^
File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 581, in forward
model_output = self.model(input_ids, positions, intermediate_tensors,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/code/vllm/vllm/compilation/decorators.py", line 246, in __call__
model_output = self.forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 368, in forward
def forward(
File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 830, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 406, in __call__
raise e
File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 393, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<eval_with_key>.66", line 337, in forward
submod_2 = self.submod_2(getitem_3, s0, l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_bnb_shard_offsets, getitem_4, l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_bnb_shard_offsets, l_self_modules_layers_modules_0_modules_mlp_modules_down_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_down_proj_parameters_weight_bnb_shard_offsets, l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_bnb_shard_offsets, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_); getitem_3 = l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_bnb_shard_offsets = getitem_4 = l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_bnb_shard_offsets = l_self_modules_layers_modules_0_modules_mlp_modules_down_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_down_proj_parameters_weight_bnb_shard_offsets = l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_bnb_shard_offsets = None
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/code/vllm/vllm/compilation/cuda_piecewise_backend.py", line 156, in __call__
return entry.runnable(*args)
^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/code/vllm/vllm/compilation/compiler_interface.py", line 510, in compiled_graph
graph_output = inductor_compiled_graph(list_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 460, in __call__
return self.current_callable(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/.cache/vllm/torch_compile_cache/dee6dd784e/rank_0_0/inductor_cache/3k/c3kedtjuicpyiyo55z5hejbsjwtnfepkyplcvf6hyoj4zxhhu3pa.py", line 589, in call
torch.ops.vllm.apply_bnb_4bit.default(buf6, arg6_1, arg7_1, buf5)
File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/_ops.py", line 756, in __call__
return self._op(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/code/vllm/vllm/model_executor/layers/quantization/bitsandbytes.py", line 372, in _apply_bnb_4bit
out[:, current_index:current_index + output_size] = matmul_4bit(
^^^^^^^^^^^^
File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/bitsandbytes/autograd/_functions.py", line 533, in matmul_4bit
return MatMul4Bit.apply(A, B, out, bias, quant_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/autograd/function.py", line 575, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/bitsandbytes/autograd/_functions.py", line 462, in forward
output = torch.nn.functional.linear(A, F.dequantize_4bit(B, quant_state).to(A.dtype).t(), bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 79.19 GiB of which 77.44 MiB is free. Process 1747225 has 7.29 GiB memory in use. Including non-PyTorch memory, this process has 71.81 GiB memory in use. Of the allocated memory 71.01 GiB is allocated by PyTorch, and 73.96 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W623 04:48:57.557535085 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
File "/home/mgoin/code/vllm/tests/utils.py", line 741, in wrapper
f(*args, **kwargs)
File "/home/mgoin/code/vllm/tests/quantization/test_bitsandbytes.py", line 159, in test_4bit_bnb_embedding_model
with vllm_runner(model_name,
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/code/vllm/tests/conftest.py", line 787, in __init__
self.model = LLM(
^^^^
File "/home/mgoin/code/vllm/vllm/entrypoints/llm.py", line 263, in __init__
self.llm_engine = LLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/code/vllm/vllm/engine/llm_engine.py", line 501, in from_engine_args
return engine_cls.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/code/vllm/vllm/v1/engine/llm_engine.py", line 124, in from_vllm_config
return cls(vllm_config=vllm_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/code/vllm/vllm/v1/engine/llm_engine.py", line 101, in __init__
self.engine_core = EngineCoreClient.make_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/code/vllm/vllm/v1/engine/core_client.py", line 75, in make_client
return SyncMPClient(vllm_config, executor_class, log_stats)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/code/vllm/vllm/v1/engine/core_client.py", line 558, in __init__
super().__init__(
File "/home/mgoin/code/vllm/vllm/v1/engine/core_client.py", line 422, in __init__
self._init_engines_direct(vllm_config, local_only,
File "/home/mgoin/code/vllm/vllm/v1/engine/core_client.py", line 491, in _init_engines_direct
self._wait_for_engine_startup(handshake_socket, input_address,
File "/home/mgoin/code/vllm/vllm/v1/engine/core_client.py", line 511, in _wait_for_engine_startup
wait_for_engine_startup(
File "/home/mgoin/code/vllm/vllm/v1/utils.py", line 494, in wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
FAILED
================================================================================================== FAILURES ==================================================================================================
___________________________________________________ test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight] ____________________________________________________
args = ()
kwargs = {'description': 'quantize embedding model inflight', 'dtype': 'half', 'example_prompts': ['vLLM is a high-throughput a...n global economic structures and future business models.\n', ...], 'hf_runner': <class 'tests.conftest.HfRunner'>, ...}
Skipped = <class 'Skipped'>, pid = 1747225, pgid = 1747030, _pid = 1747225, _exitcode = 256, old_signal_handler = <Handlers.SIG_DFL: 0>
@functools.wraps(f)
def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> None:
# Make the process the leader of its own process group
# to avoid sending SIGTERM to the parent process
os.setpgrp()
from _pytest.outcomes import Skipped
pid = os.fork()
print(f"Fork a new process to run a test {pid}")
if pid == 0:
try:
f(*args, **kwargs)
except Skipped as e:
# convert Skipped to exit code 0
print(str(e))
os._exit(0)
except Exception:
import traceback
traceback.print_exc()
os._exit(1)
else:
os._exit(0)
else:
pgid = os.getpgid(pid)
_pid, _exitcode = os.waitpid(pid, 0)
# ignore SIGTERM signal itself
old_signal_handler = signal.signal(signal.SIGTERM, signal.SIG_IGN)
# kill all child processes
os.killpg(pgid, signal.SIGTERM)
# restore the signal handler
signal.signal(signal.SIGTERM, old_signal_handler)
> assert _exitcode == 0, (f"function {f} failed when called with"
f" args {args} and kwargs {kwargs}")
E AssertionError: function <function test_4bit_bnb_embedding_model at 0x75cfc981f9c0> failed when called with args () and kwargs {'model_name': 'intfloat/e5-mistral-7b-instruct', 'description': 'quantize embedding model inflight', 'hf_runner': <class 'tests.conftest.HfRunner'>, 'vllm_runner': <class 'tests.conftest.VllmRunner'>, 'example_prompts': ['vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\n', 'Briefly describe the major milestones in the development of artificial intelligence from 1950 to 2020.\n', 'Compare and contrast artificial intelligence with human intelligence in terms of processing information.\n', 'Describe the basic components of a neural network and how it can be trained.\n', 'Write a short story about a robot that dreams for the first time.\n', 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\n', 'Explain the cultural significance of the Mona Lisa painting, and how its perception might vary in Western versus Eastern societies.\n', "Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.'\n"], 'dtype': 'half'}
utils.py:761: AssertionError
============================================================================================== warnings summary ==============================================================================================
../../../venvs/vllm/lib/python3.12/site-packages/schemathesis/generation/coverage.py:305
/home/mgoin/venvs/vllm/lib/python3.12/site-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable.
ref_error: type[Exception] = jsonschema.RefResolutionError,
tests/quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight]
/home/mgoin/code/vllm/tests/utils.py:737: DeprecationWarning: This process (pid=1747030) is multi-threaded, use of fork() may lead to deadlocks in the child.
pid = os.fork()
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================================== short test summary info ===========================================================================================
FAILED quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight] - AssertionError: function <function test_4bit_bnb_embedding_model at 0x75cfc981f9c0> failed when called with args () and kwargs {'model_name': 'intfloat/e5-mistral-7b-instruct', 'description': 'quantize...
======================================================================================= 1 failed, 2 warnings in 47.09s =======================================================================================
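Note: the OOM is raised while capturing CUDA graphs, after the KV cache has already reserved 66.87 GiB on the device. As a minimal local-reproduction sketch (not a confirmed fix), the allocator setting that the OOM message itself suggests can be applied when re-running the failing test to check whether fragmentation is a factor:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  pytest -s -v "quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight]"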
📝 History of failing test
CC List.