GitHub - Rayrtfr/llama.cpp: Port of Facebook's LLaMA model in C/C++

Quantized deployment with llama.cpp

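This guide uses llama.cpp as an example to walk through quantizing a model and deploying it locally. On Windows you may need to install build tools such as cmake. For a quick local deployment, the instruction-tuned Atom-7B-Chat model is recommended; if your hardware allows, use a 6-bit or 8-bit quantized model for better output quality. Before you start, make sure that: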

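The system has the make build tool (normally available on macOS/Linux) or cmake (must be installed manually on Windows)
Python 3.10 or later is used to build and run the tool (recommended)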
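
The two prerequisites above can be checked with a short script before building. This is a minimal sketch, not part of the repository: the function name check_prerequisites is hypothetical, and it only looks for make and cmake on PATH and compares the running interpreter against Python 3.10.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Report whether make/cmake are on PATH and Python is recent enough.

    Hypothetical helper for illustration; it does not verify compiler
    toolchains or GPU libraries.
    """
    return {
        "make": shutil.which("make") is not None,
        "cmake": shutil.which("cmake") is not None,
        "python_ok": sys.version_info >= min_python,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

Only one of make or cmake is needed, depending on the platform, so a "missing" entry for the other is not an error.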

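Step 1: Clone and build llama.cpp

(Optional) If you already have an older checkout of the repository, run git pull to fetch the latest code and make clean to remove old build artifacts
Clone the latest llama.cpp repository that has been adapted for the Atom model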

$ git clone https://github.com/Rayrtfr/llama.cpp
$ cd llama.cpp

Build the llama.cpp project to produce the ./main (inference) and ./quantize (quantization) binaries.

$ make

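Windows/Linux users who want GPU inference should build with BLAS (or cuBLAS if an NVIDIA GPU is available), which speeds up prompt processing. The command below builds with cuBLAS and applies to NVIDIA GPUs. See: llama.cpp#blas-build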

$ make LLAMA_CUBLAS=1

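macOS users need no extra steps: llama.cpp is already optimized for ARM NEON, and BLAS is enabled automatically. On M-series chips, Metal is recommended for GPU inference and noticeably improves speed. Just change the build command to LLAMA_METAL=1 make; see llama.cpp#metal-build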

$ LLAMA_METAL=1 make

Step 2: Generate the quantized model

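llama.cpp can now convert .safetensors files and Hugging Face-format .bin files to GGUF in FP16. The first command below converts the model directory to a GGUF file; the second quantizes that file to 4-bit q4_0.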

$ python convert.py --outfile ./atom-7B-cpp.gguf  /path/Atom-7B-Chat

$ ./quantize ./atom-7B-cpp.gguf ./ggml-atom-7B-q4_0.gguf q4_0
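
The quantization type passed to ./quantize (q4_0 above) largely determines the size of the output file, which is why 6-bit or 8-bit variants need more memory. The sketch below gives a rough estimate under stated assumptions: the parameter count of 7 billion is a nominal figure rather than a measured one for Atom-7B-Chat, the function name estimate_size_gib is hypothetical, and the bits-per-weight values follow from the ggml block layouts (for example, q4_0 stores 32 weights in 18 bytes). Real files differ somewhat because not every tensor is stored in the chosen type.

```python
# Approximate bits per weight for a few ggml storage types.
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,     # 34 bytes per block of 32 weights
    "q6_k": 6.5625,  # 210 bytes per block of 256 weights
    "q4_0": 4.5,     # 18 bytes per block of 32 weights
}


def estimate_size_gib(n_params, quant_type):
    """Rough model size in GiB if every weight used quant_type."""
    bits = BITS_PER_WEIGHT[quant_type]
    return n_params * bits / 8 / 2**30


if __name__ == "__main__":
    n_params = 7_000_000_000  # assumption: nominal "7B" parameter count
    for quant_type in BITS_PER_WEIGHT:
        size = estimate_size_gib(n_params, quant_type)
        print(f"{quant_type}: ~{size:.1f} GiB")
```

The estimate is only meant to help choose between quantization levels for the available RAM or VRAM; it ignores the KV cache and runtime buffers.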

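Step 3: Load and run the model

To run inference on the GPU with a cuBLAS or Metal build, specify the number of layers to offload when invoking ./main; for example, -ngl 40 offloads 40 layers of model parameters to the GPU.

Start a chat with the following command.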

text="<s>Human: 介绍一下北京\n</s><s>Assistant:"
./main -m \
./ggml-atom-7B-q4_0.gguf \
-p "${text}"  \
--logdir ./logtxt
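
When the command above is launched from a script, it can help to assemble the argument list in one place so that the optional -ngl offload is easy to toggle. This is a minimal sketch: build_main_command is a hypothetical helper, and it uses only the flags that appear in this guide (-m, -p, -ngl, --logdir). The prompt keeps a literal backslash followed by n, matching what the shell string above passes to ./main.

```python
import shlex


def build_main_command(model_path, prompt, n_gpu_layers=None,
                       logdir=None, binary="./main"):
    """Return the argument list for ./main (hypothetical helper)."""
    cmd = [binary, "-m", model_path, "-p", prompt]
    if n_gpu_layers is not None:
        # Only meaningful for cuBLAS or Metal builds.
        cmd += ["-ngl", str(n_gpu_layers)]
    if logdir is not None:
        cmd += ["--logdir", logdir]
    return cmd


if __name__ == "__main__":
    # "\\n" is a literal backslash + n, as in the shell variable above.
    text = "<s>Human: 介绍一下北京\\n</s><s>Assistant:"
    cmd = build_main_command("./ggml-atom-7B-q4_0.gguf", text,
                             n_gpu_layers=40, logdir="./logtxt")
    print(shlex.join(cmd))
    # To actually run it where ./main and the model file exist:
    #   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the angle brackets and non-ASCII text in the prompt.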

To carry the conversation context, change the text variable used in the chat command to something like this:

text="<s>Human: 介绍一下北京\n</s><s>Assistant:北京是一个美丽的城市</s>\n<s>Human: 再介绍一下合肥\n</s><s>Assistant:"
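
The two text values shown in this step follow a regular pattern, so the prompt can be built from a list of earlier turns. The sketch below reproduces that pattern exactly as written in this guide; build_prompt is a hypothetical helper, and the template is inferred from the two examples rather than taken from an official specification.

```python
NL = "\\n"  # literal backslash + n, as in the shell strings in this guide


def build_prompt(history, question):
    """Build an Atom chat prompt (hypothetical helper).

    history:  list of (human_text, assistant_text) pairs for earlier turns
    question: the new human message awaiting an answer
    """
    parts = []
    for human, assistant in history:
        parts.append(f"<s>Human: {human}{NL}</s>")
        parts.append(f"<s>Assistant:{assistant}</s>{NL}")
    # The prompt ends with an open Assistant tag for the model to complete.
    parts.append(f"<s>Human: {question}{NL}</s><s>Assistant:")
    return "".join(parts)


if __name__ == "__main__":
    history = [("介绍一下北京", "北京是一个美丽的城市")]
    print(build_prompt(history, "再介绍一下合肥"))
```

With an empty history the function yields the single-turn prompt used in the chat command, and with one earlier turn it yields the multi-turn example.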

For more detailed official documentation, see: https://github.com/Rayrtfr/llama.cpp/tree/master/examples/main
