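# Convert and Quantize Chinese LLaMA/Alpaca Models
Note: converting even the smallest 7B model needs more than 13 GB of free RAM, so **the conversion cannot be completed without a Colab Pro or higher subscription**. You can still follow the whole workflow here as a reference when running it on another machine.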

Before running, select "Runtime" -> "Change runtime type" -> "High-RAM".
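
As a rough sanity check of the 13 GB figure, the sketch below estimates the FP16 footprint of a nominal 7-billion-parameter model and compares it with the machine's physical RAM. The parameter count (`7e9`), the 2-bytes-per-weight assumption, and the helper names are illustrative assumptions rather than values taken from the conversion scripts. The `os.sysconf` lookup works on Linux (including Colab) but is not available on every platform.

```python
import os

GIB = 1024 ** 3

def fp16_weight_gib(n_params):
    """Approximate size of the weights alone at 2 bytes per parameter."""
    return n_params * 2 / GIB

def total_ram_gib():
    """Physical RAM in GiB via sysconf; returns None where unsupported."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GIB
    except (ValueError, OSError, AttributeError):
        return None

def enough_ram(required_gib, available_gib):
    """True only when the available amount is known and large enough."""
    return available_gib is not None and available_gib >= required_gib

need = fp16_weight_gib(7e9)   # nominal 7B parameters (assumption)
have = total_ram_gib()
print(f"weights need about {need:.1f} GiB, machine has {have} GiB")
print("enough RAM:", enough_ram(need, have))
```

The estimate covers the weights only; the merge step may need additional working memory on top of it.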

## Install dependencies

In [1]:
!pip install git+https://github.com/huggingface/transformers.git
!pip install peft
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-p0m59mzk
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-p0m59mzk
  Resolved https://github.com/huggingface/transformers.git to commit c612628045822f909020f7eb6784c79700813eda
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.6/7.6 MB 49.9 MB/s eta 0:00:00
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.

## Clone the repositories and code

In [2]:
!git clone https://github.com/ymcui/Chinese-LLaMA-Alpaca
!git clone https://github.com/ggerganov/llama.cpp

Cloning into 'Chinese-LLaMA-Alpaca'...
remote: Enumerating objects: 242, done.
remote: Counting objects:  23% (22/92)

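## Merge the model (Alpaca-7B as an example)
Note that this step uses a base model hosted on Hugging Face (already in HF format), not the official LLaMA model from Facebook, so the step of converting the original LLaMA weights to HF format is skipped here.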

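Go straight to step two: merge the LoRA weights to produce the full model weights. You can pass a 🤗 Model Hub ID directly (a local path also works).
- Base model: `decapoda-research/llama-7b-hf`
- LoRA model: `ziqingyang/chinese-alpaca-lora-7b`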

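This step is fairly slow and takes several minutes, so please be patient.
The merged model is saved in the `7B-combined` directory.
If you do not need a quantized model, you can stop after this step.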
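
The internals of the merge script are not shown in this notebook. As an illustration only, the sketch below applies the general LoRA merge rule, in which a low-rank update `B @ A` scaled by `alpha / rank` is added to the original weight matrix. The function names, the toy 2x2 layer, and the `alpha` and `rank` values are hypothetical and chosen to keep the arithmetic easy to follow.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy 2x2 layer with a rank-1 adapter (all values are made up).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]      # shape (rank, in_features)
lora_b = [[0.5], [1.0]]    # shape (out_features, rank)

merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)  # [[2.0, 2.0], [2.0, 5.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is why the output of this step is a full set of weights.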

In [6]:
!python ./Chinese-LLaMA-Alpaca/scripts/merge_llama_with_chinese_lora.py \
    --base_model 'decapoda-research/llama-7b-hf' \
    --lora_model 'ziqingyang/chinese-alpaca-lora-7b' \
    --output_dir 7B-combined

Downloading tokenizer.model: 100% 758k/758k [00:00<00:00, 6.00MB/s]
Downloading (…)cial_tokens_map.json: 100% 96.0/96.0 [00:00<00:00, 15.2kB/s]
Downloading (…)okenizer_config.json: 100% 166/166 [00:00<00:00, 62.5kB/s]
Downloading (…)lve/main/config.json: 100% 427/427 [00:00<00:00, 57.2kB/s]
Downloading (…)model.bin.index.json: 100% 25.5k/25.5k [00:00<00:00, 1.37MB/s]
Downloading shards:   0% 0/33 [00:00<?, ?it/s]
Downloading (…)l-00001-of-00033.bin:   0% 0.00/405M [00:00<?, ?B/s]
Downloading (…)l-00001-of-00033.bin:   3% 10.5M/405M [00:00<00:33, 11.7MB/s]
Downloading (…)l-00001-of-00033.bin:   5% 21.0M/405M [00:01<00:19, 19.8MB/s]
Downloading (…)l-00001-of-00033.bin:   8% 31.5M/405M [00:01<00:12, 28.9MB/s]
Downloading (…)l-00001-of-00033.bin:  10% 41.9M/405M [00:01<00:09, 39.1MB/s]
Downloading (…)l-00001-of-00033.bin:  16% 62.9M/405M [00:01<00:05, 60.3MB/s]
Downloading (…)l-00001-of-00033.bin:  21% 83.9M/405M [00:01<00:04, 79.1MB/s]
Downloading (…)l-00001-of-00033.

## Quantize the model
Next, we use the [llama.cpp](https://github.com/ggerganov/llama.cpp) tools to convert the full weights produced in the previous step into a 4-bit quantized model.
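
To make "4-bit quantization" concrete, here is a minimal sketch of block-wise symmetric quantization: each block of weights shares one floating-point scale, and every weight is stored as a small signed integer. This is a simplified illustrative scheme with hypothetical function names, not llama.cpp's exact storage format or rounding rule.

```python
def quantize_block(values, levels=7):
    """Map floats to integers in [-levels, levels] that share one scale.

    With levels=7 every integer fits in 4 bits.
    """
    amax = max(abs(v) for v in values)
    scale = amax / levels if amax > 0 else 1.0
    quants = [max(-levels, min(levels, round(v / scale))) for v in values]
    return scale, quants

def dequantize_block(scale, quants):
    """Recover approximate floats from the shared scale and the integers."""
    return [scale * q for q in quants]

block = [7.0, -3.0, 0.4, 0.0]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
print(scale, quants, restored)
```

Small values such as `0.4` collapse to zero here, which shows where the precision loss of a low-bit format comes from.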

First, build the llama.cpp tools.

In [7]:
!cd llama.cpp && make

I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread
I LDFLAGS:  
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -march=native -mtune=native   -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wca

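Next, we convert the model to ggml format (FP16) and then further into a 4-bit quantized model.
- Before that, the `7B-combined` directory must be moved and arranged so that it meets the conversion script's requirements.
- The tokenizer file must sit in the parent directory of the model files (note: use the tokenizer that ships with the LoRA weights, not the one produced by the conversion).
- Here we download the Chinese Alpaca-7B `tokenizer.model` file directly from https://huggingface.co/ziqingyang/chinese-alpaca-lora-7b/resolve/main/tokenizer.model.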
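
The layout requirement above can be checked before starting the conversion. The helper below is a hypothetical convenience function, not part of llama.cpp: it only verifies that the model directory exists and that `tokenizer.model` sits in its parent directory, as described in the list.

```python
from pathlib import Path

def missing_for_conversion(model_dir):
    """Return a list of layout problems that would block the conversion step."""
    model_dir = Path(model_dir)
    problems = []
    if not model_dir.is_dir():
        problems.append(f"model directory not found: {model_dir}")
    # The tokenizer is expected one level above the model files.
    tokenizer = model_dir.parent / "tokenizer.model"
    if not tokenizer.is_file():
        problems.append(f"tokenizer not found: {tokenizer}")
    return problems

# Path used in this notebook; an empty list means the layout looks right.
print(missing_for_conversion("llama.cpp/zh-models/7B"))
```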

In [16]:
!cd llama.cpp && mkdir zh-models && mv ../7B-combined zh-models/7B

In [21]:
!cd llama.cpp/zh-models && wget https://huggingface.co/ziqingyang/chinese-alpaca-lora-7b/resolve/main/tokenizer.model

--2023-04-03 04:09:48--  https://huggingface.co/ziqingyang/chinese-alpaca-lora-7b/resolve/main/tokenizer.model
Resolving huggingface.co (huggingface.co)... 54.82.45.103, 52.22.128.237, 34.206.0.154, ...
Connecting to huggingface.co (huggingface.co)|54.82.45.103|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/0f/01/0f01544c04c27e0a0357540e7be5763000a215cedb3be4a0356b56983f2fd5e3/2d967e855b1213a439df6c8ce2791f869c84b4f3b6cfacf22b86440b8192a2f8?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27tokenizer.model%3B+filename%3D%22tokenizer.model%22%3B&Expires=1680754188&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZG4tbGZzLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzBmLzAxLzBmMDE1NDRjMDRjMjdlMGEwMzU3NTQwZTdiZTU3NjMwMDBhMjE1Y2VkYjNiZTRhMDM1NmI1Njk4M2YyZmQ1ZTMvMmQ5NjdlODU1YjEyMTNhNDM5ZGY2YzhjZTI3OTFmODY5Yzg0YjRmM2I2Y2ZhY2YyMmI4NjQ0MGI4MTkyYTJmOD9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSoiLCJDb25kaXRpb24iOnsiRGF0ZUxlc3N

In [22]:
!cd llama.cpp && python convert-pth-to-ggml.py zh-models/7B/ 1

{'dim': 4096, 'multiple_of': 256, 'n_heads': 32, 'n_layers': 32, 'norm_eps': 1e-06, 'vocab_size': -1}
Namespace(dir_model='zh-models/7B/', ftype=1, vocab_only=0)
n_parts = 1

Processing part 1 of 1

Processing variable: tok_embeddings.weight with shape: (49954, 4096) and type: torch.float16
Processing variable: layers.0.attention.wq.weight with shape: (4096, 4096) and type: torch.float16
Processing variable: layers.0.attention.wk.weight with shape: (4096, 4096) and type: torch.float16
Processing variable: layers.0.attention.wv.weight with shape: (4096, 4096) and type: torch.float16
Processing variable: layers.0.attention.wo.weight with shape: (4096, 4096) and type: torch.float16
Processing variable: layers.0.feed_forward.w1.weight with shape: (11008, 4096) and type: torch.float16
Processing variable: layers.0.feed_forward.w2.weight with shape: (4096, 11008) and type: torch.float16
Processing variable: layers.0.feed_forward.w3.weight with shape: (11008, 4096) and type: torch.float16
Pro

In [23]:
!cd llama.cpp && ./quantize ./zh-models/7B/ggml-model-f16.bin ./zh-models/7B/ggml-model-q4_0.bin 2

llama_model_quantize_internal: loading model from './zh-models/7B/ggml-model-f16.bin'
llama_model_quantize_internal: n_vocab = 49954
llama_model_quantize_internal: n_ctx   = 512
llama_model_quantize_internal: n_embd  = 4096
llama_model_quantize_internal: n_mult  = 256
llama_model_quantize_internal: n_head  = 32
llama_model_quantize_internal: n_layer = 32
llama_model_quantize_internal: f16     = 1
                           tok_embeddings.weight - [ 4096, 49954], type =    f16 quantizing .. size =   780.53 MB ->   121.96 MB | hist: 0.000 0.022 0.019 0.033 0.053 0.078 0.104 0.125 0.133 0.125 0.104 0.078 0.053 0.033 0.019 0.022 
                    layers.0.attention.wq.weight - [ 4096,  4096], type =    f16 quantizing .. size =    64.00 MB ->    10.00 MB | hist: 0.000 0.021 0.016 0.028 0.046 0.071 0.103 0.137 0.158 0.137 0.103 0.071 0.046 0.028 0.016 0.021 
                    layers.0.attention.wk.weight - [ 4096,  4096], type =    f16 quantizing .. size =    64.00 MB ->    10.00 MB | h
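
The size figures in the quantization log can be reproduced with a little arithmetic. The sketch below assumes that the "before" size is reported at 4 bytes per weight and that each quantized block holds 32 weights in 20 bytes (16 bytes of 4-bit values plus one 4-byte scale). Both assumptions are inferred from the `64.00 MB -> 10.00 MB` figures above rather than taken from llama.cpp's source, and other llama.cpp versions may use a different block layout.

```python
MIB = 1024 ** 2

def tensor_sizes_mib(rows, cols, bytes_per_weight=4, block_weights=32, block_bytes=20):
    """Return (size before, size after) in MiB under the assumed block layout."""
    n = rows * cols
    before = n * bytes_per_weight / MIB
    after = n // block_weights * block_bytes / MIB
    return round(before, 2), round(after, 2)

print(tensor_sizes_mib(4096, 4096))    # attention matrix: (64.0, 10.0)
print(tensor_sizes_mib(49954, 4096))   # token embeddings: (780.53, 121.96)
```

Under these assumptions each weight costs 5 bits on average (20 bytes for 32 weights), slightly more than the nominal 4 bits because of the per-block scale.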

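All conversion steps are now complete.
Let's run one command to check that the model loads correctly and can hold a conversation.

The FP16 and Q4 quantized files are stored under `./llama.cpp/zh-models/7B` and can be downloaded as needed.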
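
For scripted use outside the notebook, the test command for this step can be assembled in Python. The helper name and its defaults are hypothetical; it uses only the `./main` flags that appear in this notebook (`-m`, `--color`, `-f`, `-p`, `-n`) and is meant to be run from inside the `llama.cpp` directory.

```python
import shlex

def build_main_command(model_path, prompt, prompt_file=None, n_predict=512, color=True):
    """Assemble the argument list for llama.cpp's ./main binary."""
    cmd = ["./main", "-m", model_path]
    if color:
        cmd.append("--color")
    if prompt_file:
        cmd += ["-f", prompt_file]
    cmd += ["-p", prompt, "-n", str(n_predict)]
    return cmd

cmd = build_main_command(
    "./zh-models/7B/ggml-model-q4_0.bin",
    "介绍一下北京的名胜古迹",
    prompt_file="./prompts/alpaca.txt",
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, cwd="llama.cpp")` avoids shell quoting problems with non-ASCII prompts.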

In [25]:
!cd llama.cpp && ./main -m ./zh-models/7B/ggml-model-q4_0.bin --color -f ./prompts/alpaca.txt -p "介绍一下北京的名胜古迹" -n 512

main: seed = 1680495616
llama_model_load: loading model from './zh-models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 49954
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4105.59 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5897.67 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from './zh-models/7B/ggml-model-q4_0.bin'
llama_model_load: model size =  4104.93 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: t