
[Bug]: can't pickle model config error on V0 engine for deepseek-r1 #14966

Open
1 task done
cjackal opened this issue Mar 17, 2025 · 6 comments
Labels
bug Something isn't working

Comments

@cjackal
Contributor

cjackal commented Mar 17, 2025

Your current environment

AFAICT it happens regardless of the GPU architecture.

The particular version I have tested is:

vLLM API server version 0.8.0rc2.dev9+g6eaf1e5c (for CUDA)

🐛 Describe the bug

When VLLM_USE_V1=0, launching the vLLM server with --trust-remote-code fails with the following error (a sketch of the underlying pickling behaviour follows the traceback below): Can't pickle <class 'transformers_modules.configuration_deepseek.DeepseekV3Config'>: it's not the same object as transformers_modules.configuration_deepseek.DeepseekV3Config

traceback
INFO 03-17 21:48:11 [__init__.py:256] Automatically detected platform cuda.
WARNING 03-17 21:48:13 [api_server.py:750] LoRA dynamic loading & unloading is enabled in the API server. This should ONLY be used for local development!
INFO 03-17 21:48:13 [api_server.py:972] vLLM API server version 0.8.0rc2.dev9+g6eaf1e5c
INFO 03-17 21:48:13 [api_server.py:973] args: Namespace(subparser='serve', model_tag='/app/model/deepseek-r1-gguf-q2-k-xl/DeepSeek-R1-UD-Q2_K_XL.gguf', config='', host='0.0.0.0', port=8080, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/app/model/deepseek-r1-gguf-q2-k-xl/DeepSeek-R1-UD-Q2_K_XL.gguf', task='auto', tokenizer='/app/model/deepseek-r1-gguf-q2-k-xl/', hf_config_path='/app/model/deepseek-r1-gguf-q2-k-xl/', skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=16384, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=4, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=4096, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=True, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['deepseek-ai/deepseek-r1'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', 
override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=True, reasoning_parser='deepseek_r1', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7fa13df06950>)
INFO 03-17 21:48:13 [config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 03-17 21:48:19 [config.py:583] This model supports multiple tasks: {'score', 'reward', 'classify', 'embed', 'generate'}. Defaulting to 'generate'.
WARNING 03-17 21:48:19 [config.py:662] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 03-17 21:48:19 [config.py:1499] Defaulting to use mp for distributed inference
INFO 03-17 21:48:19 [config.py:1677] Chunked prefill is enabled with max_num_batched_tokens=4096.
INFO 03-17 21:48:20 [cuda.py:159] Forcing kv cache block size to 64 for FlashMLA backend.
INFO 03-17 21:48:20 [api_server.py:236] Started engine process with PID 353
INFO 03-17 21:48:23 [__init__.py:256] Automatically detected platform cuda.
WARNING 03-17 21:48:24 [api_server.py:750] LoRA dynamic loading & unloading is enabled in the API server. This should ONLY be used for local development!
INFO 03-17 21:48:25 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.0rc2.dev9+g6eaf1e5c) with config: model='/app/model/deepseek-r1-gguf-q2-k-xl/DeepSeek-R1-UD-Q2_K_XL.gguf', speculative_config=None, tokenizer='/app/model/deepseek-r1-gguf-q2-k-xl/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=gguf, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend='deepseek_r1'), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=deepseek-ai/deepseek-r1, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
WARNING 03-17 21:48:25 [multiproc_worker_utils.py:310] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 03-17 21:48:25 [custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
ERROR 03-17 21:48:25 [engine.py:443] Can't pickle <class 'transformers_modules.configuration_deepseek.DeepseekV3Config'>: it's not the same object as transformers_modules.configuration_deepseek.DeepseekV3Config
ERROR 03-17 21:48:25 [engine.py:443] Traceback (most recent call last):
ERROR 03-17 21:48:25 [engine.py:443]   File "/app/.venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 431, in run_mp_engine
ERROR 03-17 21:48:25 [engine.py:443]     engine = MQLLMEngine.from_vllm_config(
ERROR 03-17 21:48:25 [engine.py:443]   File "/app/.venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 126, in from_vllm_config
ERROR 03-17 21:48:25 [engine.py:443]     return cls(
ERROR 03-17 21:48:25 [engine.py:443]   File "/app/.venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 80, in __init__
ERROR 03-17 21:48:25 [engine.py:443]     self.engine = LLMEngine(*args, **kwargs)
ERROR 03-17 21:48:25 [engine.py:443]   File "/app/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 280, in __init__
ERROR 03-17 21:48:25 [engine.py:443]     self.model_executor = executor_class(vllm_config=vllm_config, )
ERROR 03-17 21:48:25 [engine.py:443]   File "/app/.venv/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 271, in __init__
ERROR 03-17 21:48:25 [engine.py:443]     super().__init__(*args, **kwargs)
ERROR 03-17 21:48:25 [engine.py:443]   File "/app/.venv/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 03-17 21:48:25 [engine.py:443]     self._init_executor()
ERROR 03-17 21:48:25 [engine.py:443]   File "/app/.venv/lib/python3.10/site-packages/vllm/executor/mp_distributed_executor.py", line 90, in _init_executor
ERROR 03-17 21:48:25 [engine.py:443]     worker = ProcessWorkerWrapper(result_handler,
ERROR 03-17 21:48:25 [engine.py:443]   File "/app/.venv/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 171, in __init__
ERROR 03-17 21:48:25 [engine.py:443]     self.process.start()
ERROR 03-17 21:48:25 [engine.py:443]   File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
ERROR 03-17 21:48:25 [engine.py:443]     self._popen = self._Popen(self)
ERROR 03-17 21:48:25 [engine.py:443]   File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
ERROR 03-17 21:48:25 [engine.py:443]     return Popen(process_obj)
ERROR 03-17 21:48:25 [engine.py:443]   File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
ERROR 03-17 21:48:25 [engine.py:443]     super().__init__(process_obj)
ERROR 03-17 21:48:25 [engine.py:443]   File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
ERROR 03-17 21:48:25 [engine.py:443]     self._launch(process_obj)
ERROR 03-17 21:48:25 [engine.py:443]   File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
ERROR 03-17 21:48:25 [engine.py:443]     reduction.dump(process_obj, fp)
ERROR 03-17 21:48:25 [engine.py:443]   File "/usr/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
ERROR 03-17 21:48:25 [engine.py:443]     ForkingPickler(file, protocol).dump(obj)
ERROR 03-17 21:48:25 [engine.py:443] _pickle.PicklingError: Can't pickle <class 'transformers_modules.configuration_deepseek.DeepseekV3Config'>: it's not the same object as transformers_modules.configuration_deepseek.DeepseekV3Config
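
For context, here is a minimal, hypothetical Python sketch (not vLLM code; the module and class names are made up) of why a spawn-based worker launch can hit this exact PicklingError. pickle serializes classes by reference, so if the class object held in memory is not the same object that importing transformers_modules.<module>.<Class> resolves to - which can happen with dynamically generated trust-remote-code modules - the dump fails:

import pickle
import sys
import types

# Fake a dynamically generated "transformers_modules" package, loosely analogous
# to what trust_remote_code does for custom config classes (all names here are
# illustrative, not the actual transformers/vLLM internals).
pkg = types.ModuleType("transformers_modules")
sub = types.ModuleType("transformers_modules.configuration_demo")
sys.modules["transformers_modules"] = pkg
sys.modules["transformers_modules.configuration_demo"] = sub
pkg.configuration_demo = sub

# A config class that "lives" in the dynamic module.
DemoConfig = type("DemoConfig", (), {"__module__": sub.__name__})
sub.DemoConfig = DemoConfig
pickle.dumps(DemoConfig)  # fine: the import lookup resolves to the same object

# If the dynamic module is rebuilt so that the importable class is a *different*
# object, pickling the original reference fails just like the traceback above.
sub.DemoConfig = type("DemoConfig", (), {"__module__": sub.__name__})
try:
    pickle.dumps(DemoConfig)
except pickle.PicklingError as exc:
    print(exc)  # Can't pickle <class '...DemoConfig'>: it's not the same object as ...

In the traceback above, the failure surfaces when multiprocessing's spawn path (popen_spawn_posix -> reduction.dump) pickles the worker process object that carries the model config.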

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@cjackal cjackal added the bug Something isn't working label Mar 17, 2025
@cjackal
Contributor Author

cjackal commented Mar 17, 2025

The same model runs fine with vllm==0.7.4.dev410; I guess something went wrong after the V1 engine was made the default?

@DarkLight1337
Member

See #14925

@afeldman-nm
Contributor

@cjackal can you please share the whole CLI command which reproduces this error, including (but not limited to) the specific deepseek r1 model you used? Thanks

@afeldman-nm
Contributor

For example I was able to successfully run this command

VLLM_USE_V1=0 vllm serve --trust-remote-code deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --port 8091

so I am wondering if this is not the model you used

@DarkLight1337
Member

DarkLight1337 commented Mar 17, 2025

Based on the other issue, perhaps the error only occurs when directly referencing a local model via filepath?

@cjackal
Contributor Author

cjackal commented Mar 17, 2025

For example I was able to successfully run this command

VLLM_USE_V1=0 vllm serve --trust-remote-code deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --port 8091

so I am wondering if this is not the model you used

The DeepSeek distill Qwen models use the Qwen2 architecture; to reproduce this you may want to try a model with the DeepSeek-V2 architecture instead. I think DeepSeek-V2-Lite is the smallest (a hypothetical repro sketch follows below).
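
An untested, hypothetical sketch of a smaller-scale reproduction attempt along those lines (the model choice comes from the suggestion above; the tensor parallel size and worker start method are assumptions on my part):

import os

# Force the V0 engine, as in the bug report, and match the spawn-based worker
# start method seen in the traceback (this env var choice is an assumption).
os.environ["VLLM_USE_V1"] = "0"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM

# tensor_parallel_size > 1 so the multiprocessing executor launches worker
# processes, which is where the model config would get pickled.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",
    trust_remote_code=True,
    tensor_parallel_size=2,
)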

What I'm using is https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-Q2_K_XL, but since split GGUF models must first be joined, you need to first download the model binaries to a local filepath and follow this (a rough download sketch is shown after the command below). The specific launch command should look like:

VLLM_USE_V1=0 vllm serve /app/model/deepseek-r1-gguf-q2-k-xl/DeepSeek-R1-UD-Q2_K_XL.gguf --served-model-name deepseek-ai/deepseek-r1 --tokenizer /app/model/deepseek-r1-gguf-q2-k-xl/ --hf-config-path /app/model/deepseek-r1-gguf-q2-k-xl/ --trust-remote-code ... 
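
For reference, a rough sketch of the local download step with huggingface_hub (the local directory and the file pattern are illustrative assumptions, not something stated in this thread):

from huggingface_hub import snapshot_download

# Fetch only the UD-Q2_K_XL shards of the GGUF repo to a local directory.
snapshot_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",
    allow_patterns=["DeepSeek-R1-UD-Q2_K_XL/*"],
    local_dir="/app/model/deepseek-r1-gguf-q2-k-xl",
)
# The downloaded shards still need to be merged into a single .gguf file
# (e.g. with llama.cpp's gguf-split tool) before it is passed to vllm serve.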
