# Converting and Quantizing Chinese LLaMA and Alpaca Models

Project repository: https://github.com/ymcui/Chinese-LLaMA-Alpaca

⚠️ Memory requirements (make sure the runtime you are assigned has more RAM than listed below):
- 7B model: 15 GB+
- 13B model: 18 GB+
- 33B model: 22 GB+

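💡 Tips and tricks:
- Free-tier users get only about 12 GB of RAM by default, which is not enough to convert a model. **In our tests, selecting a TPU runtime sometimes yields a machine with 35 GB of RAM**, so it is worth retrying a few times.
- Pro(+) users: select "Runtime" -> "Change runtime type" -> "High-RAM".
- If the program crashes or disconnects for no apparent reason, the runtime has run out of memory.
- If "High-RAM" still does not provide enough memory, try the options below; they sometimes allocate a machine with much more memory. Good luck 😄!
    - Also select a GPU or TPU (even though it will not be used).
    - When selecting a GPU, Pro(+) users can choose the "A100" GPU type.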

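*Reminder: disconnect the runtime when you are done, and pick the lowest configuration that meets the requirements, to avoid spending compute units unnecessarily (Pro provides only 100 compute units).*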
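
Before starting a long conversion, it can help to check whether the assigned runtime meets the memory requirements listed above. The following is a minimal sketch, not part of the project: the helper names `total_ram_gb` and `has_enough_ram` are hypothetical, the thresholds are copied from the list above, and the RAM query assumes a POSIX system such as the Linux machines Colab provides.

```python
import os

# Minimum RAM in GB per model size, copied from the requirements list above.
REQUIRED_RAM_GB = {"7B": 15, "13B": 18, "33B": 22}


def total_ram_gb() -> float:
    """Return the total physical memory in GB (POSIX only, via sysconf)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    page_count = os.sysconf("SC_PHYS_PAGES")
    return page_size * page_count / 1024**3


def has_enough_ram(model_size: str, available_gb: float) -> bool:
    """Check available RAM against the listed minimum for a model size."""
    return available_gb >= REQUIRED_RAM_GB[model_size]


if __name__ == "__main__":
    ram = total_ram_gb()
    for size in REQUIRED_RAM_GB:
        verdict = "OK" if has_enough_ram(size, ram) else "not enough RAM"
        print(f"{size}: need {REQUIRED_RAM_GB[size]} GB+, have {ram:.1f} GB -> {verdict}")
```

If the check fails, reconnecting with a different runtime type, as described in the tips above, may yield a larger machine.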

## Install dependencies

In [None]:
!pip install torch==1.13.1
!pip install transformers==4.30.2
!pip install peft==0.3.0
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting peft==0.3.0
  Downloading peft-0.3.0-py3-none-any.whl (56 kB)
     56.8/56.8 kB 2.5 MB/s eta 0:00:00
Collecting accelerate (from peft==0.3.0)
  Downloading accelerate-0.20.3-py3-none-any.whl (227 kB)
     227.6/227.6 kB 10.5 MB/s eta 0:00:00
Installing collected packages: accelerate, peft
Successfully installed accelerate-0.20.3 peft-0.3.0
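
Because the versions above are pinned, it may be worth confirming that the runtime really ended up with them. The sketch below is one possible check and not part of the project: `check_pins` is a hypothetical helper, and it ignores any local build suffix (for example a `+cu117` tag on a torch build) when comparing versions.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned by the install cell above.
PINNED = {"torch": "1.13.1", "transformers": "4.30.2", "peft": "0.3.0"}


def check_pins(pinned, get_version=version):
    """Return {package: installed version or None} for every mismatch."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            # Drop a local build suffix such as "+cu117" before comparing.
            found = get_version(name).split("+")[0]
        except PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = found
    return mismatches


if __name__ == "__main__":
    problems = check_pins(PINNED)
    if problems:
        print("Version mismatches:", problems)
    else:
        print("All pinned packages match.")
```

An empty result means every pinned package is installed at the expected version.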


## Clone the repositories

In [None]:
!git clone https://github.com/ymcui/Chinese-LLaMA-Alpaca
!git clone https://github.com/ggerganov/llama.cpp

Cloning into 'Chinese-LLaMA-Alpaca'...
remote: Enumerating objects: 1407, done.
remote: Counting objects: 100% (599/599), done.
remote: Compressing objects: 100% (257/257), done.
remote: Total 1407 (delta 369), reused 494 (delta 338), pack-reused 808
Receiving objects: 100% (1407/1407), 22.61 MiB | 27.14 MiB/s, done.
Resolving deltas: 100% (831/831), done.
Cloning into 'llama.cpp'...
remote: Enumerating objects: 3618, done.
remote: Counting objects: 100% (1155/1155), done.
remote: Compressing objects: 100% (124/124), done.
remote: Total 3618 (delta 1076), reused 1036 (delta 1031), pack-reused 2463
Receiving objects: 100% (3618/3618), 3.28 MiB | 21.36 MiB/s, done.
Resolving deltas: 100% (2424/2424), done.


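## Merge the model (Alpaca-7B as an example)

This example uses a base model from the 🤗 Model Hub that is already in HF format, rather than the official LLaMA weights from Facebook, so the step of converting the original LLaMA to HF format is skipped.
**We go straight to step two, merging the LoRA weights**, which produces the full model weights. Both the base model and the LoRA model can be given as a 🤗 Model Hub ID or as a local path.
- Base model: `elinas/llama-7b-hf-transformers-4.29` *(use at your own risk; we verified that its SHA256 matches the official weights, but you must make sure you are entitled to use the model)*
- LoRA model: `ziqingyang/chinese-alpaca-lora-7b`
   - For Alpaca-Plus models, remember to pass both the LLaMA and the Alpaca LoRA; see the tutorial [here](https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/手动模型合并与转换#多lora权重合并适用于chinese-alpaca-plus)
- Output format: `pth` or `huggingface`; we choose `pth` here because llama.cpp is used for quantization later.

The models have to be downloaded, so this step takes a while, especially for the 33B model.
The merged model is written to the `alpaca-combined` directory.
If you do not need a quantized model, you can stop here and download the result or copy it to Google Drive.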

In [None]:
!python ./Chinese-LLaMA-Alpaca/scripts/merge_llama_with_chinese_lora_low_mem.py \
    --base_model 'elinas/llama-7b-hf-transformers-4.29' \
    --lora_model 'ziqingyang/chinese-alpaca-lora-7b' \
    --output_type pth \
    --output_dir alpaca-combined

Base model: elinas/llama-7b-hf-transformers-4.29
LoRA model(s) ['ziqingyang/chinese-alpaca-lora-7b']:
Loading ziqingyang/chinese-alpaca-lora-7b
Cannot find lora model on the disk. Downloading lora model from hub...
Fetching 7 files:   0% 0/7 [00:00<?, ?it/s]
Downloading (…)c39d6ac454/README.md: 100% 316/316 [00:00<00:00, 1.93MB/s]
Downloading (…)/adapter_config.json: 100% 472/472 [00:00<00:00, 3.48MB/s]
Downloading (…)cial_tokens_map.json: 100% 96.0/96.0 [00:00<00:00, 661kB/s]
Downloading (…)ac454/.gitattributes: 100% 1.48k/1.48k [00:00<00:00, 7.92MB/s]
Fetching 7 files:  14% 1/7 [00:00<00:00,  6.42it/s]
Downloading (…)okenizer_config.json: 100% 166/166 [00:00<00:00, 804kB/s]
Downloading tokenizer.model:   0% 0.00/758k [00:00<?, ?B/s]
Downloading tokenizer.model: 100% 758k/758k [00:00<00:00, 15.6MB/s]
Downloading adapter_model.bin:   1% 10.5M/858M [00:00<00:12, 66.0MB/s]
Downloading adapter_model.bin:   2% 21.0M/858M [00:00<00:11, 75.2MB/s]

Downloading adapter_m
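
To give a feel for what "merging LoRA weights" means, here is a toy sketch of the standard LoRA update, in which a low-rank product is scaled and added to a frozen weight matrix: `W' = W + (alpha / r) * B @ A`. It is an illustration under stated assumptions, not the project's merge script: the function names are hypothetical, the matrices are tiny plain-Python lists, and everything else the real script does is ignored.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Multiply two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix,
               alpha: float, rank: int) -> Matrix:
    """Fold a LoRA update into a base weight: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape: d_out x d_in, same as the base weight
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy example: a 2x2 base weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]   # base weight, d_out=2 x d_in=2
A = [[1.0, 2.0]]               # LoRA A, rank=1 x d_in=2
B = [[0.5], [0.25]]            # LoRA B, d_out=2 x rank=1
merged = merge_lora(W, A, B, alpha=2.0, rank=1)
print(merged)
```

After merging, the adapter is no longer needed at inference time, which is why the output of this step is a single full set of weights.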

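## Verify the SHA256

Full list: https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/SHA256.md

The reference SHA256 for the Alpaca-7B model produced in this example:
- fbfccc91183169842aac8d093379f0a449b5a26c5ee7a298baf0d556f1499b90

Running the command below shows that the two values match, so the merge is correct.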

In [None]:
!sha256sum alpaca-combined/consolidated.*.pth

fbfccc91183169842aac8d093379f0a449b5a26c5ee7a298baf0d556f1499b90  alpaca-combined/consolidated.00.pth
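
The same comparison can be done programmatically instead of by eye. The sketch below uses Python's standard `hashlib` and reads the file in chunks so that a multi-gigabyte checkpoint does not have to fit in memory; `sha256_of_file` and `matches` are hypothetical helper names, and the expected value is the one quoted above.

```python
import hashlib

# Reference value for Alpaca-7B, as quoted in the section above.
EXPECTED_SHA256 = "fbfccc91183169842aac8d093379f0a449b5a26c5ee7a298baf0d556f1499b90"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected: str) -> bool:
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of_file(path) == expected.strip().lower()
```

For this example, `matches("alpaca-combined/consolidated.00.pth", EXPECTED_SHA256)` should return `True` if the merge produced the reference weights.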


## Quantize the model
Next, we use [llama.cpp](https://github.com/ggerganov/llama.cpp) to convert the full-weight model produced in the previous step into a 4-bit quantized model.

### Build the tools

First, compile llama.cpp.

In [None]:
!cd llama.cpp && make

I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I LDFLAGS:  
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS   -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native 

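### Convert the model to ggml format (FP16)

In this step we convert the model to ggml format (FP16).
- Before converting, move the `alpaca-combined` directory: put the model files under `llama.cpp/zh-models/7B` and put `tokenizer.model` under `llama.cpp/zh-models`.
- Where is the tokenizer?
    - It is in the `alpaca-combined` directory.
    - Alternatively, download it from: https://huggingface.co/ziqingyang/chinese-alpaca-lora-7b/resolve/main/tokenizer.model (note that the Alpaca and LLaMA `tokenizer.model` files are not interchangeable!)

💡 Notes for converting 13B/33B models:
- The 7B tokenizer can be used directly; the 13B/33B tokenizer is identical to the 7B one.
- The Alpaca and LLaMA `tokenizer.model` files are not interchangeable!
- Wherever 7B appears below, it is only a directory name and has no effect on the conversion, so you may rename it or leave it.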

In [None]:
!cd llama.cpp && mkdir zh-models && mv ../alpaca-combined zh-models/7B
!mv llama.cpp/zh-models/7B/tokenizer.model llama.cpp/zh-models/
!ls llama.cpp/zh-models/

7B  tokenizer.model
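
The directory shuffle above can also be written in Python, which makes the target layout explicit. This is a sketch only: `arrange_for_llama_cpp` is a hypothetical helper, and it assumes the merged directory contains `tokenizer.model` alongside the weights, as described above.

```python
import shutil
from pathlib import Path


def arrange_for_llama_cpp(combined_dir, llama_cpp_dir, size_dir="7B"):
    """Move merged weights to <llama.cpp>/zh-models/<size_dir> and lift
    tokenizer.model one level up, next to the size directory."""
    zh_models = Path(llama_cpp_dir) / "zh-models"
    target = zh_models / size_dir
    if target.exists():
        # Refuse to nest the model inside an existing directory by accident.
        raise FileExistsError(f"{target} already exists")
    zh_models.mkdir(parents=True, exist_ok=True)
    shutil.move(str(combined_dir), str(target))
    shutil.move(str(target / "tokenizer.model"), str(zh_models / "tokenizer.model"))
    return zh_models


if __name__ == "__main__":
    layout = arrange_for_llama_cpp("alpaca-combined", "llama.cpp")
    print(sorted(p.name for p in layout.iterdir()))
```

As noted above, `size_dir` is only a folder name, so the same call works for 13B or 33B weights.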


In [None]:
!cd llama.cpp && python convert.py zh-models/7B/

Loading model file zh-models/7B/consolidated.00.pth
Loading vocab file zh-models/tokenizer.model
Writing vocab...
[  1/291] Writing tensor tok_embeddings.weight                  | size  49954 x   4096  | type UnquantizedDataType(name='F16')
[  2/291] Writing tensor norm.weight                            | size   4096           | type UnquantizedDataType(name='F32')
[  3/291] Writing tensor output.weight                          | size  49954 x   4096  | type UnquantizedDataType(name='F16')
[  4/291] Writing tensor layers.0.attention.wq.weight           | size   4096 x   4096  | type UnquantizedDataType(name='F16')
[  5/291] Writing tensor layers.0.attention.wk.weight           | size   4096 x   4096  | type UnquantizedDataType(name='F16')
[  6/291] Writing tensor layers.0.attention.wv.weight           | size   4096 x   4096  | type UnquantizedDataType(name='F16')
[  7/291] Writing tensor layers.0.attention.wo.weight           | size   4096 x   4096  | type UnquantizedDataType(name='F16

### Quantize the FP16 model to 4-bit

We then convert the FP16 model into a 4-bit quantized model, using the newer Q4_K method.

In [None]:
!cd llama.cpp && ./quantize ./zh-models/7B/ggml-model-f16.bin ./zh-models/7B/ggml-model-q4_K.bin q4_K

main: build = 670 (254a7a7)
main: quantizing './zh-models/7B/ggml-model-f16.bin' to './zh-models/7B/ggml-model-q4_K.bin' as Q4_K
llama.cpp: loading model from ./zh-models/7B/ggml-model-f16.bin
llama.cpp: saving model to ./zh-models/7B/ggml-model-q4_K.bin
[   1/ 291]                tok_embeddings.weight -     4096 x 49954, type =    f16, quantizing .. size =   390.27 MB ->   109.76 MB | hist: 
[   2/ 291]                          norm.weight -             4096, type =    f32, size =    0.016 MB
[   3/ 291]                        output.weight -     4096 x 49954, type =    f16, quantizing .. size =   390.27 MB ->   160.07 MB | hist: 
[   4/ 291]         layers.0.attention.wq.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 
[   5/ 291]         layers.0.attention.wk.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 
[   6/ 291]         layers.0.attention.wv.weight -     4096 x  4096, type =   
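
The sizes in the log above can be reproduced with simple arithmetic, which helps when estimating how large a quantized file will be. The sketch below is an illustration, not llama.cpp code: `tensor_size_mb` is a hypothetical helper, "MB" is taken to mean 2^20 bytes, and the figures of 4.5 and 6.5625 bits per weight are inferred from the logged size ratios rather than read from the quantizer's source.

```python
def tensor_size_mb(rows: int, cols: int, bits_per_weight: float) -> float:
    """Size of a rows x cols tensor in MB (2**20 bytes) at a given bit width."""
    return rows * cols * bits_per_weight / 8 / 2**20


# FP16 stores 16 bits per weight.
print(round(tensor_size_mb(4096, 4096, 16), 2))         # attention matrix, FP16
print(round(tensor_size_mb(4096, 49954, 16), 2))        # embedding matrix, FP16

# Effective bit widths inferred from the log: 9.00 / 32.00 * 16 = 4.5 bits.
print(round(tensor_size_mb(4096, 4096, 4.5), 2))        # attention matrix, quantized
print(round(tensor_size_mb(4096, 49954, 4.5), 2))       # tok_embeddings, quantized

# output.weight shrinks less: 160.07 / 390.27 * 16 is about 6.56 bits.
print(round(tensor_size_mb(4096, 49954, 6.5625), 2))    # output.weight, quantized
```

The log therefore suggests that most tensors end up at about 4.5 bits per weight, while `output.weight` is kept at a higher precision.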

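### (Optional) Test decoding with the quantized model
All conversion steps are now complete.
We run one command to check that the model loads and can generate a response.

The FP16 and Q4 quantized files are stored under `./llama.cpp/zh-models/7B` and can be downloaded as needed.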

In [None]:
!cd llama.cpp && ./main -m ./zh-models/7B/ggml-model-q4_K.bin --color -p "详细介绍一下北京的名胜古迹：" -n 128

main: build = 670 (254a7a7)
main: seed  = 1686819449
llama.cpp: loading model from ./zh-models/7B/ggml-model-q4_K.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 49954
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5780.29 MB (+ 1026.00 MB per state)
................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 4 | AVX = 
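
If you want to run the same test from Python rather than from a shell cell, the command can be assembled as an argument list, which avoids shell-quoting problems with non-ASCII prompts. This is a sketch under assumptions: `build_main_command` is a hypothetical helper, and it uses only the flags that appear in the command above (`-m`, `--color`, `-p`, `-n`).

```python
import subprocess


def build_main_command(model_path: str, prompt: str, n_predict: int = 128):
    """Build the argv list for llama.cpp's ./main using the flags shown above."""
    return ["./main", "-m", model_path, "--color", "-p", prompt, "-n", str(n_predict)]


if __name__ == "__main__":
    cmd = build_main_command(
        "./zh-models/7B/ggml-model-q4_K.bin",
        "详细介绍一下北京的名胜古迹：",
    )
    # Run from inside the llama.cpp directory, as the shell cell does with `cd`.
    subprocess.run(cmd, cwd="llama.cpp", check=True)
```

Passing `check=True` makes a failed load raise an exception instead of passing silently.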