Issue description:
Launching a server for a 7B model succeeded, but serving a 72B model failed. The launcher took about half an hour to initialize and then reported EOFError: connection closed by peer.
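(Not part of the original report: a simple generate request like the one below is one way to confirm a launched server is actually serving. The endpoint and payload shape follow LightLLM's documented HTTP API; the prompt and parameters are placeholders.)
curl http://localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is AI?", "parameters": {"max_new_tokens": 17}}'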
Steps to reproduce:
- Run the container ghcr.io/modeltc/lightllm:main (a possible docker run invocation is sketched after this list).
- Start the server:
python -m lightllm.server.api_server --model_dir ~/resources/huggingface/models/Qwen/Qwen1.5-72B-chat/ \
python -m lightllm.server.api_server --model_dir ~/resources/huggingface/models/Qwen/Qwen1.5-72B-chat/ \
--host 0.0.0.0 \
--port 8080 \
--tp 8 \
--eos_id 151645 \
--trust_remote_code \
--max_total_token_num 120000
- Wait about half an hour and see the error.
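For reference, a docker run invocation along these lines would reproduce the setup; the mount path, shared-memory size, and interactive shell below are assumptions rather than the exact command used:
docker run -it --gpus all --shm-size 32g -p 8080:8080 \
    -v ~/resources/huggingface/models:/root/resources/huggingface/models \
    ghcr.io/modeltc/lightllm:main /bin/bash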
Expected behavior:
The 72B server should finish initializing and start serving requests, just as the 7B server did.
Error logging:
python -m lightllm.server.api_server --model_dir ~/resources/huggingface/models/Qwen/Qwen1.5-72B-chat/ \
--host 0.0.0.0 \
--port 8080 \
--tp 8 \
--eos_id 151645 \
--trust_remote_code \
--max_total_token_num 120000
INFO 03-09 16:38:17 [tokenizer.py:79] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 03-09 16:38:21 [tokenizer.py:79] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 03-09 17:07:54 [mem_utils.py:9] mode setting params: []
INFO 03-09 17:07:54 [mem_utils.py:18] Model kv cache using mode normal
INFO 03-09 17:07:56 [mem_utils.py:9] mode setting params: []
INFO 03-09 17:07:56 [mem_utils.py:18] Model kv cache using mode normal
INFO 03-09 17:07:56 [mem_utils.py:9] mode setting params: []
INFO 03-09 17:07:56 [mem_utils.py:18] Model kv cache using mode normal
INFO 03-09 17:07:56 [mem_utils.py:9] mode setting params: []
INFO 03-09 17:07:56 [mem_utils.py:18] Model kv cache using mode normal
INFO 03-09 17:07:58 [mem_utils.py:9] mode setting params: []
INFO 03-09 17:07:58 [mem_utils.py:18] Model kv cache using mode normal
INFO 03-09 17:07:58 [mem_utils.py:9] mode setting params: []
INFO 03-09 17:07:58 [mem_utils.py:18] Model kv cache using mode normal
ERROR 03-09 17:07:58 [start_utils.py:24] init func start_router_process : Traceback (most recent call last):
ERROR 03-09 17:07:58 [start_utils.py:24]
ERROR 03-09 17:07:58 [start_utils.py:24] File "/lightllm/lightllm/server/router/manager.py", line 379, in start_router_process
ERROR 03-09 17:07:58 [start_utils.py:24] asyncio.run(router.wait_to_model_ready())
ERROR 03-09 17:07:58 [start_utils.py:24]
ERROR 03-09 17:07:58 [start_utils.py:24] File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
ERROR 03-09 17:07:58 [start_utils.py:24] return loop.run_until_complete(main)
ERROR 03-09 17:07:58 [start_utils.py:24]
ERROR 03-09 17:07:58 [start_utils.py:24] File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
ERROR 03-09 17:07:58 [start_utils.py:24]
ERROR 03-09 17:07:58 [start_utils.py:24] File "/lightllm/lightllm/server/router/manager.py", line 83, in wait_to_model_ready
ERROR 03-09 17:07:58 [start_utils.py:24] await asyncio.gather(*init_model_ret)
ERROR 03-09 17:07:58 [start_utils.py:24]
ERROR 03-09 17:07:58 [start_utils.py:24] File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 455, in init_model
ERROR 03-09 17:07:58 [start_utils.py:24] await ans
ERROR 03-09 17:07:58 [start_utils.py:24]
ERROR 03-09 17:07:58 [start_utils.py:24] File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 427, in _func
ERROR 03-09 17:07:58 [start_utils.py:24] await asyncio.to_thread(ans.wait)
ERROR 03-09 17:07:58 [start_utils.py:24]
ERROR 03-09 17:07:58 [start_utils.py:24] File "/opt/conda/lib/python3.9/asyncio/threads.py", line 25, in to_thread
ERROR 03-09 17:07:58 [start_utils.py:24] return await loop.run_in_executor(None, func_call)
ERROR 03-09 17:07:58 [start_utils.py:24]
ERROR 03-09 17:07:58 [start_utils.py:24] File "/opt/conda/lib/python3.9/concurrent/futures/thread.py", line 58, in run
ERROR 03-09 17:07:58 [start_utils.py:24] result = self.fn(*self.args, **self.kwargs)
ERROR 03-09 17:07:58 [start_utils.py:24]
ERROR 03-09 17:07:58 [start_utils.py:24] File "/opt/conda/lib/python3.9/site-packages/rpyc/core/async_.py", line 51, in wait
ERROR 03-09 17:07:58 [start_utils.py:24] self._conn.serve(self._ttl)
ERROR 03-09 17:07:58 [start_utils.py:24]
ERROR 03-09 17:07:58 [start_utils.py:24] File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 438, in serve
ERROR 03-09 17:07:58 [start_utils.py:24] data = self._channel.poll(timeout) and self._channel.recv()
ERROR 03-09 17:07:58 [start_utils.py:24]
ERROR 03-09 17:07:58 [start_utils.py:24] File "/opt/conda/lib/python3.9/site-packages/rpyc/core/channel.py", line 55, in recv
ERROR 03-09 17:07:58 [start_utils.py:24] header = self.stream.read(self.FRAME_HEADER.size)
ERROR 03-09 17:07:58 [start_utils.py:24]
ERROR 03-09 17:07:58 [start_utils.py:24] File "/opt/conda/lib/python3.9/site-packages/rpyc/core/stream.py", line 280, in read
ERROR 03-09 17:07:58 [start_utils.py:24] raise EOFError("connection closed by peer")
ERROR 03-09 17:07:58 [start_utils.py:24]
ERROR 03-09 17:07:58 [start_utils.py:24] EOFError: connection closed by peer
ERROR 03-09 17:07:58 [start_utils.py:24]
Environment:
- Using container
- OS: Ubuntu 20.04.6
- GPU info:
  - nvidia-smi: NVIDIA-SMI 535.54.03, Driver Version: 535.54.03, CUDA Version: 12.2
  - Graphics cards: H800-80G x 8
- Python: Python 3.9.18
- LightLLM: 486f647
- openai-triton:
pip show triton
Name: triton
Version: 2.1.0
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/openai/triton/
Author: Philippe Tillet
Author-email: phil@openai.com
License:
Location: /opt/conda/lib/python3.9/site-packages
Requires: filelock
Required-by: lightllm