diff --git a/_static/images/sglang.png b/_static/images/sglang.png
new file mode 100644
index 0000000..2a8bc25
Binary files /dev/null and b/_static/images/sglang.png differ
diff --git a/index.rst b/index.rst
index c2f7bbd..1de34b5 100644
--- a/index.rst
+++ b/index.rst
@@ -39,6 +39,7 @@
    sources/lm_deploy/index.rst
    sources/torchchat/index.rst
    sources/torchtitan/index.rst
+   sources/sglang/index.rst
 
 Choose your preference and follow the installation instructions in :doc:`快速安装昇腾环境`.
 
@@ -392,6 +393,24 @@
 | Quick Start
-
+
+.. SGLang landing-page card (HTML markup lost in extraction): image _static/images/sglang.png, title "SGLang", caption "A fast serving framework for LLMs and VLMs".
+
diff --git a/sources/sglang/index.rst b/sources/sglang/index.rst
new file mode 100755
index 0000000..48f8b09
--- /dev/null
+++ b/sources/sglang/index.rst
@@ -0,0 +1,8 @@
+SGLang
+============
+
+.. toctree::
+   :maxdepth: 2
+
+   install.rst
+   quick_start.rst
diff --git a/sources/sglang/install.rst b/sources/sglang/install.rst
new file mode 100755
index 0000000..c23b420
--- /dev/null
+++ b/sources/sglang/install.rst
@@ -0,0 +1,193 @@
+Installation Guide
+==================
+
+This tutorial is intended for developers who use SGLang on Ascend hardware, and walks through installing SGLang in an Ascend environment. As of September 2025, the components involved are under active development; use the latest versions and pay attention to version and device compatibility.
+
+Installing the Ascend environment
+---------------------------------
+
+Install the Ascend environment by following the :doc:`Ascend quick-install guide <../ascend/quick_install>` that matches your Ascend product model and CPU architecture.
+
+.. warning::
+   CANN 8.2.RC1 or later is recommended. When installing CANN, also install the Kernel operator package and the NNAL acceleration library package for the ARM platform.
+
+
+Installing SGLang
+-----------------
+
+Method 1: Install SGLang from source
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+
+Create a Python environment
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: shell
+   :linenos:
+
+   # Create a new conda environment; only Python 3.11 is supported
+   conda create --name sglang_npu python=3.11
+   # Activate the virtual environment
+   conda activate sglang_npu
+
+Install Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: shell
+   :linenos:
+
+   pip install attrs==24.2.0 numpy==1.26.4 scipy==1.13.1 decorator==5.1.1 psutil==6.0.0 pytest==8.3.2 pytest-xdist==3.6.1 pyyaml
+
+
+Install the MemFabric Adaptor
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The MemFabric Adaptor is the replacement for the Mooncake Transfer Engine when transferring KV cache on Ascend NPU clusters.
+
+Currently the MemFabric Adaptor supports only aarch64 devices, so check your machine's architecture before installing:
+
+.. code-block:: shell
+   :linenos:
+
+   MF_WHL_NAME="mf_adapter-1.0.0-cp311-cp311-linux_aarch64.whl"
+   MEMFABRIC_URL="https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/sglang/${MF_WHL_NAME}"
+   wget -O "${MF_WHL_NAME}" "${MEMFABRIC_URL}" && pip install "./${MF_WHL_NAME}"
+
+
+Install torch-npu
+^^^^^^^^^^^^^^^^^
+
+Install torch-npu following the :doc:`torch-npu installation guide <../pytorch/install>`. Because of limitations in NPUGraph and Triton-Ascend, this project currently supports only torch and torch_npu 2.6.0; a more general versioning scheme will be provided later.
+
+.. code-block:: shell
+   :linenos:
+
+   # Install CPU-only torch 2.6.0 and torchvision 0.21.0
+   PYTORCH_VERSION=2.6.0
+   TORCHVISION_VERSION=0.21.0
+   pip install torch==$PYTORCH_VERSION torchvision==$TORCHVISION_VERSION --index-url https://download.pytorch.org/whl/cpu
+
+   # Install torch_npu 2.6.0 (or simply: pip install torch_npu==2.6.0)
+   PTA_VERSION="v7.1.0.2-pytorch2.6.0"
+   PTA_NAME="torch_npu-2.6.0.post2-cp311-cp311-manylinux_2_28_aarch64.whl"
+   PTA_URL="https://gitcode.com/ascend/pytorch/releases/download/${PTA_VERSION}/${PTA_NAME}"
+   wget -O "${PTA_NAME}" "${PTA_URL}" && pip install "./${PTA_NAME}"
+
+After installation, verify that torch_npu works with the following code:
+
+.. code-block:: python
+   :linenos:
+
+   import torch
+   # import torch_npu  # With torch 2.6.0 there is no need to import torch_npu explicitly
+
+   x = torch.randn(2, 2).npu()
+   y = torch.randn(2, 2).npu()
+   z = x.mm(y)
+
+   print(z)
+
+If the program prints the value of matrix z, the installation succeeded.
+
+Install vLLM
+^^^^^^^^^^^^
+
+vLLM is still a hard prerequisite on Ascend NPU for now. To match torch==2.6.0, vLLM v0.8.5 must be built and installed from source.
+
+.. code-block:: shell
+   :linenos:
+
+   VLLM_TAG=v0.8.5
+   git clone --depth 1 https://github.com/vllm-project/vllm.git --branch $VLLM_TAG
+   cd vllm
+   VLLM_TARGET_DEVICE="empty" pip install -v -e .
+   cd ..
+
+Install Triton-Ascend
+^^^^^^^^^^^^^^^^^^^^^
+
+Triton-Ascend is still updated frequently. To get the latest features, build and install it from source; see the Triton-Ascend installation guide for the detailed steps.
+
+Alternatively, install the Triton-Ascend nightly package:
+
+.. code-block:: shell
+   :linenos:
+
+   pip install -i https://test.pypi.org/simple/ "triton-ascend<3.2.0rc" --pre --no-cache-dir
+
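+As a quick sanity check before building the NPU kernels, you can confirm that the prerequisites installed above import cleanly. A minimal sketch, assuming the steps above completed and that Triton-Ascend installs the standard ``triton`` module (an assumption, not something this guide states):
+
+.. code-block:: python
+   :linenos:
+
+   # check_prereqs.py -- hypothetical helper script, not part of any of the packages above
+   import torch
+   import vllm    # built from source with VLLM_TARGET_DEVICE="empty"
+   import triton  # assumption: Triton-Ascend exposes the standard `triton` module
+
+   print("torch:", torch.__version__)    # expect 2.6.0
+   print("vllm:", vllm.__version__)      # expect 0.8.5
+   print("triton:", triton.__version__)
+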
+Install Deep-EP and sgl-kernel-npu
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: shell
+   :linenos:
+
+   pip install wheel==0.45.1
+   git clone https://github.com/sgl-project/sgl-kernel-npu.git
+
+   # Add environment variables
+   export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/runtime/lib64/stub:$LD_LIBRARY_PATH
+   source /usr/local/Ascend/ascend-toolkit/set_env.sh
+   cd sgl-kernel-npu
+
+   # Compile and install deep-ep and sgl-kernel-npu
+   bash build.sh
+   pip install output/deep_ep*.whl output/sgl_kernel_npu*.whl --no-cache-dir
+   cd ..
+   rm -rf sgl-kernel-npu
+
+   # Symlink the deep_ep_cpp.*.so file into the package root
+   cd "$(pip show deep-ep | grep -E '^Location:' | awk '{print $2}')" && ln -s deep_ep/deep_ep_cpp*.so
+
+
+Install SGLang from source
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: shell
+   :linenos:
+
+   # Use the latest release branch
+   git clone -b v0.5.3rc0 https://github.com/sgl-project/sglang.git
+   cd sglang
+
+   pip install --upgrade pip
+   # Install SGLang with NPU support (the quotes keep the extras spec from being globbed by the shell)
+   pip install -e "python[srt_npu]"
+   cd ..
+
+
+
+Method 2: Install SGLang with a Docker image
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. note::
+   ``--privileged`` and ``--network=host`` are required for RDMA, and RDMA is usually also a prerequisite for Ascend NPU clusters.
+
+The Docker commands below target the Atlas 800I A3. On an Atlas 800I A2, make sure to map only davinci0 through davinci7 into the container.
+
+.. code-block:: shell
+   :linenos:
+
+   # Clone the SGLang repository
+   git clone https://github.com/sgl-project/sglang.git
+   cd sglang/docker
+
+   # Build the docker image ("sglang-npu" is an example tag; pick any name)
+   docker build -t sglang-npu -f Dockerfile.npu .
+
+   alias drun='docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
+       --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
+       --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
+       --device=/dev/davinci8 --device=/dev/davinci9 --device=/dev/davinci10 --device=/dev/davinci11 \
+       --device=/dev/davinci12 --device=/dev/davinci13 --device=/dev/davinci14 --device=/dev/davinci15 \
+       --device=/dev/davinci_manager --device=/dev/hisi_hdc \
+       --volume /usr/local/sbin:/usr/local/sbin --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
+       --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
+       --volume /etc/ascend_install.info:/etc/ascend_install.info \
+       --volume /var/queue_schedule:/var/queue_schedule --volume ~/.cache/:/root/.cache/'
+
+   # Run the container and start the SGLang server (fill in your Hugging Face token)
+   drun --env "HF_TOKEN=<your_hf_token>" \
+       sglang-npu \
+       python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --attention-backend ascend --host 0.0.0.0 --port 30000
+
diff --git a/sources/sglang/quick_start.rst b/sources/sglang/quick_start.rst
new file mode 100644
index 0000000..c32576a
--- /dev/null
+++ b/sources/sglang/quick_start.rst
@@ -0,0 +1,104 @@
+Quick Start
+==================
+
+.. note::
+
+   Before reading this guide, make sure the Ascend environment and SGLang are set up as described in the :doc:`installation guide <./install>`!
+
+   This tutorial introduces quick development with SGLang and helps you get up to speed with it.
+
+This document helps Ascend developers quickly serve LLM inference with SGLang on Ascend NPUs. See the official SGLang documentation for more information.
+
+Overview
+------------------------
+
+SGLang is a fast serving framework for LLMs and VLMs. By co-designing the backend runtime and the frontend language, it makes interaction with models faster and more controllable.
+
+Launching a service with SGLang
+-------------------------------
+
+The following example shows how to launch a simple chat-generation service with SGLang.
+
+Start a server:
+
+.. code-block:: shell
+   :linenos:
+
+   # Launch the SGLang server on NPU
+   python -m sglang.launch_server --model Qwen/Qwen2.5-0.5B-Instruct \
+       --device npu --port 8000 --attention-backend ascend \
+       --host 0.0.0.0 --trust-remote-code
+
+Once the server is up, you will see log output similar to the following:
+
+.. code-block:: shell
+   :linenos:
+
+   INFO:     Started server process [89394]
+   INFO:     Waiting for application startup.
+   INFO:     Application startup complete.
+   INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
+   INFO:     127.0.0.1:40106 - "GET /get_model_info HTTP/1.1" 200 OK
+   Prefill batch. #new-seq: 1, #new-token: 128, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
+   INFO:     127.0.0.1:40108 - "POST /generate HTTP/1.1" 200 OK
+   The server is fired up and ready to roll!
+
+Test it with curl:
+
+.. code-block:: shell
+   :linenos:
+
+   curl -s http://localhost:8000/v1/chat/completions \
+     -H "Content-Type: application/json" \
+     -d '{
+       "model": "qwen/qwen2.5-0.5b-instruct",
+       "messages": [
+         {
+           "role": "user",
+           "content": "What is the capital of France?"
+         }
+       ]
+     }'
+
+You should see a response similar to:
+
+.. code-block:: shell
+   :linenos:
+
+   {"id":"3f2f1aa779b544c19f01c08b803bf4ef","object":"chat.completion","created":1759136880,"model":"qwen/qwen2.5-0.5b-instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The capital of France is Paris.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":36,"total_tokens":44,"completion_tokens":8,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
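+
+Because the server exposes an OpenAI-compatible API, the same request can also be sent from Python. A minimal sketch, assuming the ``openai`` client package is installed (``pip install openai``); the file name is hypothetical and the model name matches the one served above:
+
+.. code-block:: python
+   :linenos:
+
+   # openai_client.py -- minimal sketch; assumes `pip install openai`
+   from openai import OpenAI
+
+   # Point the client at the local SGLang server; an API key is required but unused.
+   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+   response = client.chat.completions.create(
+       model="qwen/qwen2.5-0.5b-instruct",
+       messages=[{"role": "user", "content": "What is the capital of France?"}],
+   )
+   print(response.choices[0].message.content)
+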
+
+Verifying inference with SGLang
+-------------------------------
+
+The following code shows how to verify offline inference with SGLang:
+
+.. code-block:: python
+   :linenos:
+
+   # example.py
+   import sglang as sgl
+
+   def main():
+       prompts = [
+           "Hello, my name is",
+           "The Independence Day of the United States is",
+           "The capital of Germany is",
+           "The full form of AI is",
+       ]
+
+       # Adjust model_path to where your model checkpoint actually lives
+       llm = sgl.Engine(model_path="/Qwen2.5/Qwen2.5-0.5B-Instruct", device="npu", attention_backend="ascend")
+
+       sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 100}
+
+       outputs = llm.generate(prompts, sampling_params)
+       for prompt, output in zip(prompts, outputs):
+           print("===============================")
+           print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
+
+   if __name__ == '__main__':
+       main()
+
+Run example.py and check that it prints generated text for each prompt; if it does, SGLang is installed and working.
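+For reference, the run command (assuming the environment prepared in the installation guide is active):
+
+.. code-block:: shell
+   :linenos:
+
+   python example.py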