<a href="https://colab.research.google.com/github/weedge/doraemon-nb/blob/main/Quantization_LLM_Inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Which Quantization Method Is Right for You? (GPTQ vs. GGUF vs. AWQ)**
*Exploring pre-quantized large language models*

- https://www.maartengrootendorst.com/blog/quantization/
- https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/

<br>
<div>

<img src="https://i.imgur.com/7CuJ50z.png" width="750"/>
</div>



---
        
💡 **NOTE**: In this notebook we use a GPU to run Zephyr-7B in several quantized forms. In Google Colab, go to

**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---

We start by installing the packages used throughout this example:

In [6]:
%%capture

# Zephyr is a Mistral-architecture model; Mistral support and tokenizer chat
# templates both require transformers >= 4.34
#!pip install git+https://github.com/huggingface/transformers.git
# https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g/discussions/11
# ImportError: Using `load_in_8bit=True` requires Accelerate
!pip install "transformers>=4.34"
!pip install accelerate bitsandbytes xformers

# GPTQ Dependencies
!pip install optimum
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/

# GGUF Dependencies
!pip install ctransformers[cuda]

# 🤗 **HuggingFace**

We can load an LLM easily with the following HF pipeline:

In [2]:
from torch import bfloat16
from transformers import pipeline

# Load in your LLM without any compression tricks
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=bfloat16,
    device_map="auto"
)

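Loading an LLM this way applies no compression tricks to save VRAM or improve efficiency.

To generate a prompt, we first have to build the required template. Fortunately, this can be done automatically when the chat template is stored in the underlying tokenizer: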

In [3]:
# We use the tokenizer's chat template to format each message
# See https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot.",
    },
    {
        "role": "user",
        "content": "Tell me a funny joke about Large Language Models."
    },
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)

<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>



The prompt generated with the tokenizer's internal template is structured as follows:

<div>

<img src="https://i.imgur.com/I4bkVwb.png" width="1250"/>
</div>
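
To make that structure concrete, here is a minimal hand-written sketch that reproduces the Zephyr layout printed above for a list of messages. The function name `build_zephyr_prompt` is hypothetical; in practice `apply_chat_template` reads the template from the tokenizer, so this only illustrates the format and is not how the library implements it.

```python
def build_zephyr_prompt(messages, add_generation_prompt=True):
    """Format chat messages to match the Zephyr prompt layout shown above.

    Illustrative only: the real template is stored in the tokenizer and
    applied by `tokenizer.apply_chat_template`.
    """
    parts = []
    for message in messages:
        # Each turn: a role tag on its own line, the content, then the </s> end-of-turn token
        parts.append(f"<|{message['role']}|>\n{message['content']}</s>\n")
    if add_generation_prompt:
        # An open assistant tag cues the model to write the reply
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Tell me a funny joke about Large Language Models."},
]
prompt = build_zephyr_prompt(messages)
print(prompt)
```

Writing the string by hand like this is what the later cells do when they reuse the same prompt without calling the tokenizer.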


We can then pass the prompt to the LLM to generate our answer:

In [4]:
outputs = pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_p=0.95
)
print(outputs[0]["generated_text"])



<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>
Why did the Large Language Model go to the party?

To impress everyone with its vocabulary!

The partygoers were amazed as the Large Language Model effortlessly chatted with them, using words they had never heard before. But as the night went on, the Large Language Model started to run out of steam. It began repeating itself, and soon everyone was tired of hearing the same jokes and anecdotes.

The Large Language Model realized it had a problem. It had learned a lot of words, but it didn't know how to use them in new and interesting ways. It needed to learn how to be more creative and spontaneous.

So the Large Language Model went back to its training data and studied the works of the greatest writers and poets. It analyzed their styles and techniques, looking for clues on how to be more original.

And it worked! The next time the Large Language Model went to a party,

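# 🧩 **Sharding**

Before getting into quantization strategies, there is another trick for reducing the VRAM needed to load a model. With sharding, we split the model into smaller pieces, or shards.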

<div>
<img src="https://i.imgur.com/NGxs89n.png" width="1250"/>
</div>


In [5]:
from accelerate import Accelerator

# Shard our model into pieces of at most 4GB
accelerator = Accelerator()
accelerator.save_model(
    model=pipe.model,
    save_directory="/content/model",
    max_shard_size="4GB"
)
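
The idea behind `max_shard_size` can be sketched as greedy packing: tensors are added to the current shard in order, and a new shard starts whenever the size limit would be exceeded. The function `shard_tensors`, the tensor names, and the sizes below are all made up for illustration; this is not the code that `accelerate` runs.

```python
def shard_tensors(tensor_sizes, max_shard_bytes):
    """Greedily pack tensors, in order, into shards no larger than the limit."""
    shards, current, current_bytes = [], {}, 0
    for name, size in tensor_sizes.items():
        # Start a new shard when adding this tensor would exceed the limit
        if current and current_bytes + size > max_shard_bytes:
            shards.append(current)
            current, current_bytes = {}, 0
        current[name] = size
        current_bytes += size
    if current:
        shards.append(current)
    return shards


# Hypothetical tensor names and sizes in bytes, for illustration only
sizes = {"embed": 300, "layer0": 400, "layer1": 400, "layer2": 400, "head": 300}
shards = shard_tensors(sizes, max_shard_bytes=1000)

# An index that maps each tensor to the shard file holding it
weight_map = {
    name: f"shard-{i + 1}" for i, shard in enumerate(shards) for name in shard
}
print(weight_map)
```

The resulting mapping plays the same role as the `weight_map` in a `model.safetensors.index.json` file: it tells a loader which file to open for each tensor.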

In [8]:
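# Colab workaround: force a UTF-8 preferred encoding so that `!` shell
# commands keep working when the locale is reported as non-UTF-8
import locale
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding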


In [9]:
!ls -hl /content/model/

total 14G
-rw-r--r-- 1 root root 3.7G Dec 26 14:59 model-00001-of-00004.safetensors
-rw-r--r-- 1 root root 3.7G Dec 26 14:59 model-00002-of-00004.safetensors
-rw-r--r-- 1 root root 3.7G Dec 26 15:00 model-00003-of-00004.safetensors
-rw-r--r-- 1 root root 2.5G Dec 26 15:00 model-00004-of-00004.safetensors
-rw-r--r-- 1 root root  24K Dec 26 15:00 model.safetensors.index.json


In [10]:
!cat /content/model/model.safetensors.index.json

# Model structure (Zephyr is fine-tuned from Mistral 7B, whose tensor names follow the Llama layout)

{
  "metadata": {
    "total_size": 14483464192
  },
  "weight_map": {
    "lm_head.weight": "model-00004-of-00004.safetensors",
    "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
    "model.layers.0.input_layernorm.weight": "model-00001-of-00004.safetensors",
    "model.layers.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
    "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
    "model.layers.0.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
    "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
    "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
    "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
    "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
    "model.layers.1.input_layernorm.weight": "model-00001-of-00004.s

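# 📄 **Quantization**

A large language model is represented by a set of weights and activations. These values are typically stored in the 32-bit floating point (`float32`) data type.

The number of bits determines how many distinct values can be represented. **Float32** covers magnitudes from roughly 1.18e-38 to 3.4e38, which is a lot of values! The fewer bits you use, the fewer values you can represent.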
<br><br><br>
<div>
<img src="https://i.imgur.com/qn67oGd.png" width="1250"/>
</div>
<br><br>
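
A quick way to see the effect of bit width is to round-trip one number through narrower floating-point formats. This sketch uses Python's standard `struct` module, whose `'f'` and `'e'` format codes pack IEEE 754 single-precision (32-bit) and half-precision (16-bit) values. The helper name `roundtrip` is ours.

```python
import math
import struct


def roundtrip(value, fmt):
    """Store `value` in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


as_fp32 = roundtrip(math.pi, "<f")  # 32-bit float
as_fp16 = roundtrip(math.pi, "<e")  # 16-bit float

err_fp32 = abs(as_fp32 - math.pi)
err_fp16 = abs(as_fp16 - math.pi)
print(f"float32: {as_fp32!r} (error {err_fp32:.2e})")
print(f"float16: {as_fp16!r} (error {err_fp16:.2e})")
```

With fewer bits the grid of representable values is coarser, so the 16-bit copy lands further from the original than the 32-bit one.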

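As you might expect, choosing a lower bit width makes the model less accurate, but it also has fewer values to represent, which reduces its size and memory requirements.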

<br><br><br>
<div>
<img src="https://i.imgur.com/SIcVjQv.png" width="1000"/>
</div>
<br><br>
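
As a rough back-of-the-envelope check, the weight storage for Zephyr-7B at different bit widths can be estimated from the parameter count. The count below is derived from the `total_size` reported in the sharded checkpoint's index file, assuming 2 bytes per weight because the model was loaded in `bfloat16`. The estimate covers weights only and ignores activations, the KV cache, and any quantization metadata.

```python
def weights_size_gb(n_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Bytes in the 16-bit checkpoint divided by 2 bytes per weight
N_PARAMS = 14_483_464_192 // 2

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weights_size_gb(N_PARAMS, bits):5.2f} GB")
```

Halving the bit width halves the weight storage, which is why 4-bit variants of a 7B model fit comfortably on a 16 GB T4.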

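`4bit-NormalFloat` (NF4) consists of three steps:
* **Normalization**: The model's weights are normalized so that we expect them to fall within a certain range. This allows more common values to be represented more efficiently.
* **Quantization**: The weights are quantized to 4 bits. In NF4, the quantization levels are spaced to match the distribution of the normalized weights (denser where values are common), so the 4-bit codes represent the original 32-bit weights efficiently.
* **Dequantization**: Although the weights are stored in 4 bits, they are dequantized to a higher-precision dtype during computation, so inference arithmetic runs at that precision.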
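
The three steps can be sketched in plain Python. This is a simplified toy, not the bitsandbytes implementation: the 16 levels are taken from evenly spaced quantiles of a standard normal distribution and rescaled to [-1, 1], whereas the real NF4 codebook is constructed somewhat differently (for example, it includes an exact zero). The function names and the sample weights are made up for this example.

```python
from statistics import NormalDist


def make_codebook(bits=4):
    """Build 2**bits levels in [-1, 1] from quantiles of a standard normal."""
    n = 2 ** bits
    quantiles = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]


def quantize(weights, codebook):
    """Normalize by the absolute maximum, then store the nearest level's index."""
    scale = max(abs(w) for w in weights) or 1.0
    indices = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - w / scale))
        for w in weights
    ]
    return indices, scale


def dequantize(indices, scale, codebook):
    """Look up each level and undo the normalization."""
    return [codebook[i] * scale for i in indices]


weights = [0.42, -0.13, 0.05, -0.88, 0.27, 0.0, -0.31, 0.64]
codebook = make_codebook()
indices, scale = quantize(weights, codebook)
restored = dequantize(indices, scale, codebook)

print(indices)  # one 4-bit code per weight
print([round(w, 3) for w in restored])
```

Only the 4-bit indices and one scale per block need to be stored; the full-precision values are rebuilt on the fly when they are needed for computation.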

In [None]:
# Delete any models previously created
del pipe, accelerator

# Empty VRAM cache
import torch
import gc
gc.collect()
torch.cuda.empty_cache()

In [13]:
!nvidia-smi --query-gpu=timestamp,memory.total,memory.free,memory.used,name,utilization.gpu,utilization.memory --format=csv

# If the GPU memory has not been released, restart the session

timestamp, memory.total [MiB], memory.free [MiB], memory.used [MiB], name, utilization.gpu [%], utilization.memory [%]
2023/12/26 15:24:46.318, 15360 MiB, 1881 MiB, 13221 MiB, Tesla T4, 0 %, 0 %


To perform this quantization with HuggingFace, we define a quantization configuration with bitsandbytes:

In [5]:
from transformers import BitsAndBytesConfig
from torch import bfloat16

# Our 4-bit configuration to load the LLM with less GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # 4-bit quantization
    bnb_4bit_quant_type='nf4',  # Normalized float 4
    bnb_4bit_use_double_quant=True,  # Second quantization after the first
    bnb_4bit_compute_dtype=bfloat16  # Computation type
)
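
The `bnb_4bit_use_double_quant` flag refers to quantizing the quantization constants themselves. The arithmetic below follows the figures given in the QLoRA paper (a 32-bit constant per block of 64 weights, reduced to an 8-bit constant plus a 32-bit second-level constant per 256 blocks). Those block sizes are taken from the paper as an assumption; they are not read from this notebook's configuration, and the helper name is hypothetical.

```python
def constant_overhead_bits(block_size, constant_bits,
                           second_block_size=None, second_constant_bits=32):
    """Extra bits per weight spent on storing quantization constants."""
    overhead = constant_bits / block_size
    if second_block_size is not None:
        # The constants are themselves quantized in blocks, each with its own constant
        overhead += second_constant_bits / (block_size * second_block_size)
    return overhead


single = constant_overhead_bits(block_size=64, constant_bits=32)
double = constant_overhead_bits(block_size=64, constant_bits=8, second_block_size=256)

print(f"single quantization: {single:.3f} extra bits per weight")
print(f"double quantization: {double:.3f} extra bits per weight")
```

Under these assumptions the saving is a fraction of a bit per weight, which still adds up to a few hundred megabytes on a model with about seven billion parameters.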

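This configuration lets us specify which quantization level to use. Generally, we want to store the weights in 4 bits but run inference in 16 bits.
Loading the model in a pipeline is then straightforward: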

In [None]:
!pip install "transformers>=4.34"

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Zephyr with BitsAndBytes Configuration
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    quantization_config=bnb_config,
    device_map='auto',
)

# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')

In [None]:
prompt = "<|system|>\nYou are a friendly chatbot.</s>\n<|user|>\nTell me a funny joke about Large Language Models.</s>\n<|assistant|>\n"
print(prompt)

<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>



In [None]:
# We will use the same prompt as we did originally
outputs = pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_p=0.95
)
print(outputs[0]["generated_text"])

<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>
Why did the Large Language Model go to the party?

To network and make some new connections, of course! But as the night went on, it started to feel a little overwhelmed by the crowd. It tried to make small talk, but its responses were a bit too technical and dry.

Eventually, it found itself in a corner, feeling a bit out of place. That's when it overheard someone say, "I heard Large Language Models are all the rage these days. But honestly, I'm not really feeling the hype."

The Large Language Model couldn't help but take offense. After all, it had worked tirelessly to perfect its language skills and was proud of the progress it had made.

But then it realized something important. Maybe being a Large Language Model wasn't all it was cracked up to be. Maybe it was time to let loose and just be a regular chatbot for a change.

So the Large Language Model shed its form

# 👀 **Pre-Quantization**

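Models have often already been sharded and quantized for us to use. In particular, [TheBloke](https://huggingface.co/TheBloke) is a HuggingFace user who publishes a wide range of quantized models.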
<br><br><br>
<div>
<img src="https://i.imgur.com/mdOCWIQ.png" width="1550"/>
</div>



## **GPTQ**

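GPTQ is a post-training quantization (PTQ) method for 4-bit quantization that focuses primarily on GPU inference and performance.

The idea behind the method is to compress all weights to 4 bits while minimizing the mean squared error that the quantization introduces. During inference, it dynamically dequantizes the weights to float16, which improves performance while keeping memory usage low.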
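
To show what "compress to 4 bits and measure the squared error" means, here is a deliberately simple sketch that uses plain round-to-nearest on one group of weights. It is not the GPTQ algorithm, which goes further and adjusts the remaining weights to compensate for the error of each quantized one. All names and sample values are made up for illustration.

```python
def quantize_group(weights, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] using a scale and an offset."""
    levels = 2 ** bits - 1
    lowest, highest = min(weights), max(weights)
    scale = (highest - lowest) / levels or 1.0  # avoid a zero scale for constant groups
    codes = [round((w - lowest) / scale) for w in weights]
    return codes, scale, lowest


def dequantize_group(codes, scale, lowest):
    """Rebuild approximate float weights from the integer codes."""
    return [lowest + scale * c for c in codes]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


group = [0.12, -0.40, 0.33, 0.05, -0.27, 0.48, -0.09, 0.21]
codes, scale, lowest = quantize_group(group)
restored = dequantize_group(codes, scale, lowest)

print(codes)
print(f"MSE: {mean_squared_error(group, restored):.6f}")
```

The integer codes are what gets stored; the dequantization step is the part that runs on the fly at inference time.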

In [None]:
# Delete any models previously created
del tokenizer, model, pipe

# Empty VRAM cache
import torch
import gc
gc.collect()
torch.cuda.empty_cache()

For now, we stick with the model's `main` branch, since it is generally a good balance between compression and accuracy:

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load LLM and Tokenizer
model_id = "TheBloke/zephyr-7B-beta-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=False,
    revision="main"
)

# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')

Using `disable_exllama` is deprecated and will be removed in version 4.37. Use `use_exllama` instead and specify the version with `exllama_config`.The value of `use_exllama` will be overwritten by `disable_exllama` passed in `GPTQConfig` or stored in your config file.


model.safetensors:   0%|          | 0.00/4.16G [00:00<?, ?B/s]

In [None]:
prompt = "<|system|>\nYou are a friendly chatbot.</s>\n<|user|>\nTell me a funny joke about Large Language Models.</s>\n<|assistant|>\n"
print(prompt)

<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>



With the model loaded, we can run the prompt as follows:

In [None]:
# We will use the same prompt as we did originally
outputs = pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_p=0.95
)
print(outputs[0]["generated_text"])



<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>
Why did the Large Language Model go to the party?

To make some small talk!

(Large Language Models are artificial intelligence systems trained on vast amounts of text data, but they're not yet advanced enough to carry on a natural conversation like a human. So, the joke is that they're trying to make small talk, which is a basic level of social interaction.)


## **GGUF**

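Although GPTQ does compression well, its focus on GPUs can be a disadvantage if you do not have the hardware to run it.

GGUF (the successor to the GGML format) is a quantized model format that lets you run an LLM on the CPU while optionally offloading some of its layers to the GPU for a speed-up.
Although CPU inference is generally slower than GPU inference, it is an excellent format for anyone running models on a CPU or on Apple devices.

Especially since we are seeing smaller and more capable models appear, such as Mistral 7B, the GGUF format might just be here to stay!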

In [None]:
# Delete any models previously created
del tokenizer, model, pipe

# Empty VRAM cache
import torch
import gc
gc.collect()
torch.cuda.empty_cache()

We use `zephyr-7b-beta.Q4_K_M.gguf` because we are focusing on 4-bit quantization:

In [None]:
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Load LLM and Tokenizer
# Use `gpu_layers` to specify how many layers will be offloaded to the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-beta-GGUF",
    model_file="zephyr-7b-beta.Q4_K_M.gguf",
    model_type="mistral", gpu_layers=50, hf=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta", use_fast=True
)

# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')

zephyr-7b-beta.Q4_K_M.gguf:   0%|          | 0.00/4.37G [00:00<?, ?B/s]
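
The `gpu_layers` argument controls how many layers are offloaded to the GPU. As a rough heuristic for choosing a value on a smaller GPU, one could divide the file size evenly across the layers and see how many fit into the free VRAM. The sketch below assumes 32 transformer layers (the layer count of Mistral 7B), equal-sized layers, and a fixed reserve for everything else. The function `layers_that_fit` is hypothetical and not part of `ctransformers`.

```python
def layers_that_fit(model_file_gb, n_layers, free_vram_gb, reserve_gb=1.0):
    """Estimate how many layers fit on the GPU, treating all layers as equal in size."""
    per_layer_gb = model_file_gb / n_layers
    # Keep some VRAM aside for the context cache and scratch buffers
    usable_gb = max(free_vram_gb - reserve_gb, 0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# 4.37 GB is the size of the GGUF file in the download log above
print(layers_that_fit(4.37, n_layers=32, free_vram_gb=15.0))
print(layers_that_fit(4.37, n_layers=32, free_vram_gb=3.0))
```

This ignores embeddings and the memory that grows with context length, so treat the result as a starting point and lower it if loading fails.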

In [None]:
prompt = "<|system|>\nYou are a friendly chatbot.</s>\n<|user|>\nTell me a funny joke about Large Language Models.</s>\n<|assistant|>\n"
print(prompt)

<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>



With the model loaded, we can run the prompt as follows:

In [None]:
# We will use the same prompt as we did originally
outputs = pipe(prompt, max_new_tokens=256)
print(outputs[0]["generated_text"])

<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>
Why did the Large Language Model go to the party?

To impress everyone with its vocabulary!

But unfortunately, it kept repeating the same jokes over and over again, making everyone groan and roll their eyes. The partygoers soon realized that the Large Language Model was more of a party pooper than a party animal.

Moral of the story: Just because a Large Language Model can generate a lot of words, doesn't mean it knows how to be funny or entertaining. Sometimes, less is more!


## **AWQ**

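A newer format on the block is AWQ ([Activation-aware Weight Quantization](https://arxiv.org/abs/2306.00978)), a quantization method similar to GPTQ. AWQ and GPTQ differ in several ways, but the most important is that AWQ assumes that not all weights are equally important for an LLM's performance.

In other words, a small fraction of the weights (those that matter most, judged by the activations they see) is protected during quantization, which helps reduce quantization loss.

As a result, their paper reports a significant speed-up compared to GPTQ while keeping similar, and sometimes even better, performance.

**Note**: To run this `AWQ` example, first disconnect the runtime and start fresh before installing the `vllm` dependency. The packages installed earlier may conflict with it, so it is best to start from scratch.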
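
The "activation-aware" part can be illustrated with a small sketch that ranks input channels by their average activation magnitude on some calibration samples and keeps the top fraction as salient. This shows only the selection idea under simplified assumptions; the actual AWQ method then rescales salient channels rather than leaving them unquantized, and none of the names below come from an AWQ library.

```python
def salient_channels(activations, fraction=0.01):
    """Return indices of the input channels with the largest mean |activation|.

    `activations` is a list of samples, each a list with one value per channel.
    """
    n_channels = len(activations[0])
    mean_abs = [
        sum(abs(sample[c]) for sample in activations) / len(activations)
        for c in range(n_channels)
    ]
    keep = max(1, round(n_channels * fraction))
    ranked = sorted(range(n_channels), key=lambda c: mean_abs[c], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical calibration activations: 3 samples, 4 channels
calibration = [
    [0.1, -4.0, 0.2, 0.0],
    [0.3, 5.5, -0.1, 0.2],
    [-0.2, -6.1, 0.4, 0.1],
]
print(salient_channels(calibration, fraction=0.25))
```

Channels that consistently carry large activations amplify any error in the weights they multiply, which is why those weights are the ones worth protecting.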

In [None]:
%%capture
# Install vllm dependency
!pip install vllm

In [None]:
from vllm import LLM, SamplingParams

# Load the LLM
sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=256)
llm = LLM(
    model="TheBloke/zephyr-7B-beta-AWQ",
    quantization='awq',
    dtype='half',
    gpu_memory_utilization=.95,
    max_model_len=4096
)

In [None]:
prompt = "<|system|>\nYou are a friendly chatbot.</s>\n<|user|>\nTell me a funny joke about Large Language Models.</s>\n<|assistant|>\n"
print(prompt)

<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>



In [None]:
# Generate output based on the input prompt and sampling parameters
output = llm.generate(prompt, sampling_params)
print(output[0].outputs[0].text)

Processed prompts: 100%|██████████| 1/1 [00:09<00:00,  9.50s/it]

Why did the Large Language Model go to the party?

To network and expand its vocabulary!

Why did the Large Language Model blush?

Because it overheard another model saying it was a little too wordy!

Why did the Large Language Model get kicked out of the library?

It was being too loud and kept interrupting other models' conversations with its endless chatter!

Why did the Large Language Model get a standing ovation at the comedy club?

Because it told some really punny jokes!

Why did the Large Language Model get a job as a writer?

Because it was the most wordy model in the room!

Why did the Large Language Model get a job as a librarian?

Because it knew all the right words to shelve books in the right place!

Why did the Large Language Model get a job as a teacher?

Because it knew all the right words to help students learn and grow!

Why did the Large Language Model get a job as a lawyer?

Because it knew all the right words to argue a case in court!

Why did the Large Language M


