<a href="https://www.kaggle.com/code/aisuko/fine-tuning-llama2-with-gptq?scriptVersionId=164058388" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

This notebook describes how to quantize Llama 2 7B using GPTQ quantization. For more detail on the concept of GPTQ, see the notebook [Quantization with GPTQ](https://www.kaggle.com/code/aisuko/quantization-with-gptq).
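
To make the `bits` and `group_size` settings used later in this notebook more concrete, here is a minimal pure-Python sketch of group-wise round-to-nearest quantization. It is not GPTQ itself: GPTQ additionally adjusts the remaining weights to compensate for the error of each rounding step, using calibration data. The function names (`quantize_group`, `dequantize_group`, `quantize_groupwise`) and the random test weights are illustrative assumptions, not part of any library.

```python
import random


def quantize_group(values, bits=4):
    """Round-to-nearest quantization of one group to unsigned `bits`-bit codes."""
    qmax = 2 ** bits - 1
    lo, hi = min(values), max(values)
    # One scale and one offset are stored per group; a constant group gets scale 1.0.
    scale = (hi - lo) / qmax or 1.0
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + lo for c in codes]


def quantize_groupwise(weights, bits=4, group_size=128):
    """Split `weights` into consecutive groups and quantize each one independently."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_group(group, bits) for group in groups]


# Hypothetical weights: 512 values drawn from a normal distribution.
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(512)]

quantized = quantize_groupwise(weights, bits=4, group_size=128)
restored = [w for codes, scale, lo in quantized for w in dequantize_group(codes, scale, lo)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups: {len(quantized)}, max reconstruction error: {max_error:.4f}")
```

Each group stores its own scale and offset, so a smaller `group_size` tracks the local weight range more closely at the cost of more metadata per weight.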

In [1]:
%%capture --no-stderr
!pip install transformers==4.36.2

# Not needed for the GPTQ quantization below; kept for experimenting with accelerate on multiple GPUs.
# Note: installing datasets==2.15.0 causes an issue; see https://huggingface.co/datasets/allenai/c4/discussions/7
# !pip install accelerate==0.25.0
# !pip install peft==0.7.1
# !pip install bitsandbytes==0.41.3

!pip install auto-gptq==0.6.0
!pip install optimum==1.16.2
# flash-attn is not supported in the Kaggle environment; the line is kept for future use.
# !pip install flash-attn==2.4.2

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Quantized-models"
os.environ["WANDB_NOTES"] = "Quantized models by using Post-training quantization methods"
os.environ["MODEL_NAME"] = "meta-llama/Llama-2-7b-hf"
os.environ["WANDB_NAME"] = "quantized-Llama-2-7b0hf-with-c4-gptq"

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
import torch

tokenizer=AutoTokenizer.from_pretrained(
    os.getenv("MODEL_NAME"),
    use_fast=True
)

quantization_config=GPTQConfig(
    bits=4,
    group_size=128,
    dataset="c4",
    # Whether to quantize columns in order of decreasing activation size (also known as act-order).
    # Setting it to False can significantly speed up inference, but perplexity may become slightly worse.
    desc_act=False,
    tokenizer=tokenizer
)


quant_model=AutoModelForCausalLM.from_pretrained(
    os.getenv("MODEL_NAME"),
    quantization_config=quantization_config,
    # Maximize GPU usage, offloading to CPU where needed
    device_map="auto"
)

tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]



model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

Downloading and preparing dataset json/allenai--c4 to /root/.cache/huggingface/datasets/json/allenai--c4-ec45c889631c3c39/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/319M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/allenai--c4-ec45c889631c3c39/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.


Quantizing model.layers blocks :   0%|          | 0/32 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 14.75 GiB total capacity; 14.51 GiB already allocated; 51.06 MiB free; 14.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
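
The error above reports 14.57 GiB reserved against 14.51 GiB allocated, so fragmentation is unlikely to be the cause; the GPU is simply full. One possible mitigation, not verified in this notebook, is to cap per-device memory with the `max_memory` argument that `from_pretrained` accepts alongside `device_map="auto"`, so that some layers are placed on the CPU and the GPU keeps headroom for the quantization step. The sketch below only builds such a mapping; the helper name `build_max_memory`, the 4 GiB headroom, and the 24 GiB CPU budget are assumptions chosen for illustration, and the shard sizes are read as decimal gigabytes.

```python
def build_max_memory(gpu_total_gib, num_gpus=1, headroom_gib=4.0, cpu_gib=24.0):
    """Return a max_memory mapping that leaves `headroom_gib` free on each GPU."""
    gpu_budget = max(int(gpu_total_gib - headroom_gib), 1)
    max_memory = {i: f"{gpu_budget}GiB" for i in range(num_gpus)}
    max_memory["cpu"] = f"{int(cpu_gib)}GiB"
    return max_memory


# Size of the two fp16 checkpoint shards from the download log, converted to GiB.
checkpoint_gib = (9.98 + 3.50) * 1e9 / 2**30

# Total GPU capacity as reported in the error message.
gpu_total_gib = 14.75

max_memory = build_max_memory(gpu_total_gib)
print(f"fp16 weights: {checkpoint_gib:.2f} GiB of {gpu_total_gib} GiB")
print(max_memory)  # {0: '10GiB', 'cpu': '24GiB'}
```

The mapping could then be passed as `max_memory=max_memory` in the `AutoModelForCausalLM.from_pretrained` call. Offloading makes quantization slower, and the right headroom for this model would need to be found by experiment.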

In [None]:
quant_model.push_to_hub(os.getenv("WANDB_NAME"))
tokenizer.push_to_hub(os.getenv("WANDB_NAME"))

# Inference

In [None]:
import gc

del quant_model, tokenizer
gc.collect()
torch.cuda.empty_cache()

In [None]:
model_name="aisuko/"+os.getenv("WANDB_NAME")

tokenizer=AutoTokenizer.from_pretrained(model_name)
model=AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
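
Once the quantized model is loaded, it can be used for text generation like any other causal language model. The helper below is a minimal sketch; the name `generate_text`, the default of 50 new tokens, and the prompt in the usage comment are assumptions for illustration.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=50):
    """Tokenize `prompt`, generate a continuation, and decode it back to text."""
    # Move the input tensors to the device that holds the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # generate() returns a batch; decode the first (and only) sequence.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage, assuming `model` and `tokenizer` were loaded as in the cell above:
# print(generate_text(model, tokenizer, "The capital of France is"))
```

The decoded string contains the prompt followed by the generated continuation.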

# Credit

* https://pub.towardsai.net/gptq-quantization-on-a-llama-2-7b-fine-tuned-model-with-huggingface-a7b291fbb871