<a href="https://www.kaggle.com/code/aisuko/quantization-methods?scriptVersionId=163042451" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

**Note: The images are from the articles in the Credit section**

We distinguish two main families of weight quantization techniques in the literature:
* Post-Training Quantization (PTQ)
* Quantization-Aware Training (QAT)

For more detail, see [Introduction to weight quantization](https://www.kaggle.com/code/aisuko/introduction-to-weight-quantization).

In this notebook, we will look at several quantization methods, such as NF4, GPTQ, GGUF and AWQ.

In [1]:
%%capture --no-stderr
!pip install transformers==4.37.2
!pip install bitsandbytes==0.42.0
!pip install peft==0.8.2
!pip install accelerate==0.27.2
!pip install auto-gptq==0.6.0
# optimum should be updated to latest version to fix the c4 issue https://github.com/huggingface/optimum/pull/1646
!pip install optimum==1.16.2

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tune-models"
os.environ["WANDB_NOTES"] = "Fine tune model distilbert base uncased"
os.environ["MODEL_NAME"] = "HuggingFaceH4/zephyr-7b-beta"

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
!accelerate estimate-memory ${MODEL_NAME} --library_name transformers

Loading pretrained config for `HuggingFaceH4/zephyr-7b-beta` from `transformers`...
config.json: 100%|█████████████████████████████| 638/638 [00:00<00:00, 3.63MB/s]
┌──────────────────────────────────────────────────────┐
│Memory Usage for loading `HuggingFaceH4/zephyr-7b-beta`│
├───────┬─────────────┬──────────┬─────────────────────┤
│ dtype │Largest Layer│Total Size│ Training using Adam │
├───────┼─────────────┼──────────┼─────────────────────┤
│float32│  864.03 MB  │ 27.49 GB │      109.96 GB      │
│float16│  432.02 MB  │ 13.74 GB │       54.98 GB      │
│  int8 │  216.01 MB  │ 6.87 GB  │       27.49 GB      │
│  int4 │   108.0 MB  │ 3.44 GB  │       13.74 GB      │
└───────┴─────────────┴──────────┴─────────────────────┘
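The rows of this table are related by simple arithmetic: each total scales with the number of bits per weight, and the Adam training column is four times the loading size. The sketch below reproduces that relationship from the float32 total, to within rounding. The 4x factor is read off the table itself, and `scale_footprint` is a hypothetical helper, not part of Accelerate.

```python
FLOAT32_TOTAL_GB = 27.49   # float32 "Total Size" from the table above
ADAM_FACTOR = 4            # assumption: ratio observed in the table (109.96 / 27.49)

def scale_footprint(float32_gb, bits):
    """Scale a float32 footprint to a dtype that uses `bits` bits per weight."""
    return float32_gb * bits / 32

for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    total = scale_footprint(FLOAT32_TOTAL_GB, bits)
    print(f"{name:>7}: {total:6.2f} GB to load, ~{ADAM_FACTOR * total:7.2f} GB to train with Adam")
```

Going from float16 to int4 divides the loading footprint by four, which is the headline saving that the 4-bit methods later in this notebook aim for.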


# Pure Inference

We are going to use [Zephyr 7B](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), a fine-tuned variant of Mistral 7B that was trained with Direct Preference Optimization (DPO). Earlier notebooks discuss and use DPO: see [Fine tuning mistral 7b with DPO](https://www.kaggle.com/code/aisuko/fine-tuning-mistral-7b-with-dpo) and [Supervised Fine Tune Llama2 with DPO](https://www.kaggle.com/code/aisuko/supervised-fine-tuned-llama2-with-dpo).


Loading an LLM the way shown below does not apply any compression tricks to save VRAM or increase efficiency; the weights are simply loaded in float16.

In [4]:
from transformers import pipeline
import torch

pipe=pipeline(
    "text-generation",
    model=os.getenv("MODEL_NAME"),
    torch_dtype=torch.float16,
    device_map='auto'
)



model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/8 [00:00<?, ?it/s]

model-00001-of-00008.safetensors:   0%|          | 0.00/1.89G [00:00<?, ?B/s]

model-00002-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00003-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00004-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00005-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00006-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00007-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00008-of-00008.safetensors:   0%|          | 0.00/816M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.43k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/168 [00:00<?, ?B/s]

## Generating prompt

The prompt, generated with the tokenizer's built-in chat template, is constructed like this:

<div style="text-align: center"><img src="https://files.mastodon.social/media_attachments/files/111/797/830/779/160/176/original/25bb3317b94d3367.webp" width="60%" heigh="60%" alt="Prompt constructed with the chat template"></div>



In [5]:
messages=[
    {
        "role":"system",
        "content": "You are a friendly chatbot.",
    },
    {
        "role":"user",
        "content":"Tell me a funny joke about Large Language Models."
    },
]

prompt=pipe.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

outputs=pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_p=0.95
)

outputs[0]["generated_text"]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


"<|system|>\nYou are a friendly chatbot.</s>\n<|user|>\nTell me a funny joke about Large Language Models.</s>\n<|assistant|>\nWhy did the Large Language Model go to the party?\n\nTo impress everyone with its vocabulary!\n\nBut unfortunately, it kept repeating the same jokes over and over again, making everyone groan and roll their eyes. The punchline was getting old, and the partygoers couldn't help but wonder if the Large Language Model had a sense of humor or if it was just programmed to be funny.\n\nIn the end, the Large Language Model left the party feeling a little embarrassed and a lot disappointed. It realized that being able to spout off a lot of words didn't necessarily make it a hit at the party. Maybe next time, it should focus on being more original and less repetitive.\n\nBut hey, at least it had a good time trying, right? After all, that's what Large Language Models are here for - to make us laugh, learn, and maybe even groan a little bit."

The LLMs are all about expanding their vocabulary and networking with other models to improve their language skills. So, this joke is a perfect fit for them!
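The prompt at the start of the generated text above follows a simple pattern: one `<|role|>` header per message, the content, an `</s>` terminator, and a trailing `<|assistant|>` header that asks the model to answer. As a rough illustration, the sketch below rebuilds that string by hand. The format is inferred from the output printed in this notebook, and `build_zephyr_prompt` is a hypothetical helper; the real template ships with the tokenizer and should be preferred.

```python
def build_zephyr_prompt(messages, add_generation_prompt=True):
    """Hand-rolled approximation of the Zephyr chat format (illustration only)."""
    # Each message becomes "<|role|>\ncontent</s>\n".
    parts = [f"<|{m['role']}|>\n{m['content']}</s>\n" for m in messages]
    if add_generation_prompt:
        # The open assistant header tells the model it is its turn to speak.
        parts.append("<|assistant|>\n")
    return "".join(parts)

example_messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Tell me a funny joke about Large Language Models."},
]
print(build_zephyr_prompt(example_messages))
```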

# Sharding

> Sharding an LLM is nothing more than breaking it up into pieces. Each individual piece is much easier to handle and might prevent memory issues.

Before we go into quantization strategies, there is another trick that we can employ to reduce the VRAM needed to load our model. With `sharding`, we are essentially splitting our model up into small pieces, or `shards`.

<div style="text-align: center"><img src="https://files.mastodon.social/media_attachments/files/111/797/767/089/086/156/original/d52c60af9991cb5b.webp" width="60%" heigh="60%" alt="Sharding an LLM into pieces"></div>

Each shard contains a smaller part of the model and aims to work around GPU memory limitations by distributing the model weights across different devices. The Zephyr-7B model is already sharded, [see here](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta/tree/main).

Sharding is quite straightforward using the Accelerate package:

In [6]:
from accelerate import Accelerator

# Shard the model into pieces of at most 4GB
accelerator=Accelerator()
accelerator.save_model(
    model=pipe.model,
    save_directory='sharding/model',
    max_shard_size='4GB'
)

In [7]:
!ls sharding/model

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


model-00001-of-00004.safetensors  model-00004-of-00004.safetensors
model-00002-of-00004.safetensors  model.safetensors.index.json
model-00003-of-00004.safetensors
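To get an intuition for what `max_shard_size` does, the sketch below greedily groups tensors into shards that stay under a size limit and builds the kind of name-to-file mapping that `model.safetensors.index.json` stores. This is a simplified assumption about the behaviour, not Accelerate's actual implementation; `plan_shards` and the tensor names are hypothetical.

```python
def plan_shards(tensor_sizes, max_shard_bytes):
    """Greedily assign tensors (name -> size in bytes) to shards under a size limit."""
    shards, current, current_size = [], [], 0
    for name, size in tensor_sizes.items():
        # Start a new shard when adding this tensor would exceed the limit.
        if current and current_size + size > max_shard_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)
    total = len(shards)
    # Map every tensor to the shard file that holds it.
    return {
        name: f"model-{index:05d}-of-{total:05d}.safetensors"
        for index, shard in enumerate(shards, start=1)
        for name in shard
    }

toy_sizes = {"layer0.weight": 3, "layer1.weight": 3, "layer2.weight": 3, "norm.weight": 1}
for tensor_name, shard_file in plan_shards(toy_sizes, max_shard_bytes=6).items():
    print(tensor_name, "->", shard_file)
```

A loader can then read the index, open only the shard files it needs, and never hold more than one shard's worth of extra data in memory at a time.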


# Quantization with 4-bit NormalFloat (NF4)

NF4 is a 4-bit data type that uses a few tricks to represent higher-precision weights efficiently. The steps are:

1. Normalization: The weights of the model are normalized so that they fall within a known range. This allows the more common values to be represented more efficiently.
2. Quantization: The weights are quantized to 4 bits. In NF4, the quantization levels are spaced according to the quantiles of a normal distribution rather than evenly in value, so the 16 levels are concentrated where the normalized weights are most dense.
3. Dequantization: Although the weights are stored in 4-bit, they are dequantized to a higher-precision compute dtype during computation, which keeps memory low while the arithmetic itself does not run in 4-bit.
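The three steps above can be sketched with the standard library alone. This is a simplified illustration, not the bitsandbytes implementation: the levels here are plain quantiles of a standard normal rescaled to [-1, 1], whereas the real NF4 table is built slightly differently (for example, it contains an exact zero). All function names are hypothetical.

```python
from statistics import NormalDist

def normal_quantile_levels(bits=4):
    """Return 2**bits levels in [-1, 1] placed at quantiles of N(0, 1)."""
    n = 2 ** bits
    # Evenly spaced probabilities, kept away from 0 and 1 where the quantile is infinite.
    probs = [(i + 0.5) / n for i in range(n)]
    quantiles = [NormalDist().inv_cdf(p) for p in probs]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]

def quantize_block(weights, levels):
    """Step 1 and 2: normalize by the block's absolute maximum, then pick the nearest level."""
    absmax = max(abs(w) for w in weights) or 1.0
    codes = []
    for w in weights:
        normalized = w / absmax
        codes.append(min(range(len(levels)), key=lambda i: abs(levels[i] - normalized)))
    return codes, absmax

def dequantize_block(codes, absmax, levels):
    """Step 3: map the 4-bit codes back to approximate weights."""
    return [levels[c] * absmax for c in codes]

levels = normal_quantile_levels()
block = [0.02, -0.15, 0.40, -0.03, 0.07, -0.31]
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)
print(codes)
print([round(r, 3) for r in restored])
```

Because the levels are denser near zero, small weights (the most common ones) are reconstructed more precisely than weights near the edge of the range.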

In [8]:
pipe.model.get_memory_footprint()

15020343296

In [9]:
import gc

del pipe, outputs
gc.collect()
torch.cuda.empty_cache()

In [10]:
from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM

bnb_config=BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer= AutoTokenizer.from_pretrained(os.getenv("MODEL_NAME"))
model=AutoModelForCausalLM.from_pretrained(
    os.getenv("MODEL_NAME"),
    quantization_config=bnb_config,
    device_map='auto'
)
model.get_memory_footprint()

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

4551360512

In [11]:
pipe=pipeline(
    model=model,
    tokenizer=tokenizer,
    task='text-generation'
)

outputs=pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

outputs[0]["generated_text"]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


"<|system|>\nYou are a friendly chatbot.</s>\n<|user|>\nTell me a funny joke about Large Language Models.</s>\n<|assistant|>\nWhy did the Large Language Model go to the party?\n\nTo mingle with the crowd and drop some knowledge bombs, of course! But the party was so loud that it couldn't hear itself think, and the music was so repetitive that it started to repeat itself... Literally! The Large Language Model ended up getting stuck in a loop, repeating the same joke over and over again until the partygoers grew tired of its antics and sent it home. Moral of the story: Large Language Models are great at generating humor, but they still need some human input to break out of their programming and truly shine at a party!"

In [12]:
del tokenizer, model, pipe, outputs
gc.collect()
torch.cuda.empty_cache()

# GPTQ: Post-Training Quantization for GPT Models

GPTQ is a Post-Training Quantization (PTQ) method for 4-bit quantization that focuses primarily on GPU inference and performance.

The idea behind the method is to compress all weights to 4 bits while minimizing the mean squared error that the quantization introduces. During inference, it dynamically dequantizes the weights to float16, which improves performance while keeping memory low. Compared with float16, this can reduce memory usage by roughly 4x, because the int4 weights are dequantized inside a fused kernel rather than in the GPU's global memory. You can also expect a speedup in inference, because a lower bit width takes less time to communicate.
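To make the storage side of this concrete, the sketch below quantizes a few weights to 4-bit codes, packs two codes per byte, and dequantizes them again on the fly. It uses plain round-to-nearest quantization with one scale and offset per group, which is deliberately not GPTQ's error-compensating procedure; it only illustrates how int4 storage and on-the-fly dequantization fit together. All function names are hypothetical.

```python
def quantize_rtn(weights, bits=4):
    """Round-to-nearest quantization to unsigned integer codes (not the GPTQ solver)."""
    qmax = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0   # fall back to 1.0 when all weights are equal
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo

def pack_nibbles(codes):
    """Store two 4-bit codes in every byte, padding with a zero code if needed."""
    padded = codes + [0] * (len(codes) % 2)
    return bytes((padded[i] << 4) | padded[i + 1] for i in range(0, len(padded), 2))

def unpack_and_dequantize(packed, count, scale, lo):
    """Unpack the nibbles and map them back to approximate float weights."""
    codes = []
    for byte in packed:
        codes.extend((byte >> 4, byte & 0x0F))
    return [lo + c * scale for c in codes[:count]]

weights = [-0.8, -0.1, 0.0, 0.3, 0.75]
codes, scale, lo = quantize_rtn(weights)
packed = pack_nibbles(codes)
restored = unpack_and_dequantize(packed, len(weights), scale, lo)
print(len(packed), "bytes for", len(weights), "weights")
print([round(r, 3) for r in restored])
```

In float16 the same five weights would take 10 bytes; packed, they take 3 bytes plus one scale and one offset for the group, which is where the roughly 4x saving comes from as groups get larger.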


In [13]:
from transformers import AutoModelForCausalLM, GPTQConfig

os.environ["OPT_MODEL"]="facebook/opt-125m"

In [14]:
!accelerate estimate-memory ${OPT_MODEL} --library_name transformers

Loading pretrained config for `facebook/opt-125m` from `transformers`...
config.json: 100%|█████████████████████████████| 651/651 [00:00<00:00, 3.35MB/s]
┌────────────────────────────────────────────────────┐
│    Memory Usage for loading `facebook/opt-125m`    │
├───────┬─────────────┬──────────┬───────────────────┤
│ dtype │Largest Layer│Total Size│Training using Adam│
├───────┼─────────────┼──────────┼───────────────────┤
│float32│  147.28 MB  │477.75 MB │      1.87 GB      │
│float16│   73.64 MB  │238.88 MB │      955.5 MB     │
│  int8 │   36.82 MB  │119.44 MB │     477.75 MB     │
│  int4 │   18.41 MB  │ 59.72 MB │     238.88 MB     │
└───────┴─────────────┴──────────┴───────────────────┘


## Quantizing a model with GPTQ

To quantize a model, we need to create a `GPTQConfig` instance and set the number of bits to quantize to, a dataset to calibrate the weights for quantization, and a tokenizer to prepare the dataset.

You can pass your own dataset as a list of strings, as in the snippet below, but it is highly recommended to use the same dataset as the GPTQ paper.

```python
dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."]
gptq_config = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer)
```

In [15]:
tokenizer=AutoTokenizer.from_pretrained(os.getenv("OPT_MODEL"))
gptq_config=GPTQConfig(
    bits=4,
    #https://huggingface.co/docs/transformers/v4.33.0/en/main_classes/quantization#transformers.GPTQConfig.dataset
    dataset="c4-new",
    tokenizer=tokenizer
)

quantized_model=AutoModelForCausalLM.from_pretrained(
    os.getenv("OPT_MODEL"),
    # device_map='auto' automatically offloads the model to the CPU to help fit it in memory,
    # and allows the model modules to be moved between the CPU and GPU for quantization.
    device_map='auto',
    quantization_config=gptq_config
)

tokenizer_config.json:   0%|          | 0.00/685 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/441 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/251M [00:00<?, ?B/s]

  return self.fget.__get__(instance, owner)()


generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

Downloading and preparing dataset json/allenai--c4 to /root/.cache/huggingface/datasets/json/allenai--c4-ec45c889631c3c39/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/319M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/allenai--c4-ec45c889631c3c39/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.


Quantizing model.decoder.layers blocks :   0%|          | 0/12 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/6 [00:00<?, ?it/s]

If you are running out of memory because the dataset is too large, note that disk offloading is not supported. In that case, try passing the `max_memory` parameter to set how much memory to use on each device (GPU and CPU):

```python
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", max_memory={0: "30GiB", 1: "46GiB", "cpu": "30GiB"}, quantization_config=gptq_config)
```

How long quantization takes depends on the hardware. Before quantizing a model yourself, it is a good idea to check whether a GPTQ-quantized version of the model already exists.

In [16]:
os.environ["QUANTIZED_MODEL_NAME"]="opt-125m-gptq"

quantized_model.push_to_hub(os.getenv("QUANTIZED_MODEL_NAME"))
tokenizer.push_to_hub(os.getenv("QUANTIZED_MODEL_NAME"))

README.md:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/aisuko/opt-125m-gptq/commit/3729cec301e0c890ea320375aeb8dc4347882d1f', commit_message='Upload tokenizer', commit_description='', oid='3729cec301e0c890ea320375aeb8dc4347882d1f', pr_url=None, pr_revision=None, pr_num=None)

Reload a quantized model with `device_map="auto"` to automatically distribute the model across all available GPUs, which loads the model faster without using more memory than needed.

```python
import os
from transformers import AutoModelForCausalLM

model=AutoModelForCausalLM.from_pretrained("aisuko/"+os.getenv("QUANTIZED_MODEL_NAME"), device_map="auto")
```

In [17]:
del quantized_model, tokenizer
gc.collect()
torch.cuda.empty_cache()

## Using a pre-quantized GPTQ model

In [18]:
os.environ["GPTQ_MODEL"]="TheBloke/zephyr-7B-beta-GPTQ"

tokenizer=AutoTokenizer.from_pretrained(os.getenv("GPTQ_MODEL"), use_fast=True)
model=AutoModelForCausalLM.from_pretrained(
    os.getenv("GPTQ_MODEL"),
    device_map="auto",
    trust_remote_code=False,
    revision="main"
)

pipe=pipeline(model=model,tokenizer=tokenizer, task="text-generation")

tokenizer_config.json:   0%|          | 0.00/1.43k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/168 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.31k [00:00<?, ?B/s]

Using `disable_exllama` is deprecated and will be removed in version 4.37. Use `use_exllama` instead and specify the version with `exllama_config`.The value of `use_exllama` will be overwritten by `disable_exllama` passed in `GPTQConfig` or stored in your config file.


model.safetensors:   0%|          | 0.00/4.16G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

In [19]:
outputs=pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_p=0.95
)

outputs[0]["generated_text"]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'<|system|>\nYou are a friendly chatbot.</s>\n<|user|>\nTell me a funny joke about Large Language Models.</s>\n<|assistant|>\nWhy did the Large Language Model go to the party?\n\nTo make some small talk!\n\n(Large Language Models are artificial intelligence models trained on vast amounts of text data to generate human-like responses. They are not capable of small talk in the traditional sense, but this joke plays on the idea that they can generate human-like responses.)'

# GGUF: GPT-Generated Unified Format

GGUF (previously GGML) is a quantization method that allows users to run an LLM on the CPU while also offloading some of its layers to the GPU for a speedup.

# AWQ: Activation-aware Weight Quantization

See notebook [AWQ-Transformers](https://www.kaggle.com/code/aisuko/awq-transformers)

# Credit

* https://maartengrootendorst.substack.com/p/which-quantization-method-is-right?utm_source=profile&utm_medium=reader2
* https://huggingface.co/docs/transformers/main/quantization?fuse=supported+architectures#awq
