# **QUANTIZATION**
- Quantization is a technique that **reduces the size of large language models** (LLMs) by lowering the precision of their **weights and activations**. It maps **continuous (float)** values to **discrete (int)** values, which reduces the number of bits needed to represent each value.
- This notebook was run in Colab with a T4 GPU enabled.
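
As a rough illustration of the float-to-int mapping described above, here is a minimal sketch of symmetric (absmax) 8-bit quantization in plain Python. The names `quantize_absmax` and `dequantize` are made up for this example, and the weights are arbitrary; the quantizers in llama.cpp use more elaborate block-wise schemes.

```python
def quantize_absmax(values, bits=8):
    """Map floats to signed integers using one shared scale (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]   # arbitrary example values
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each value is stored as a small integer plus one shared float scale, and the rounding error per value is at most half the scale.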

 **Quantization type**

  ![plot](./Image/quantization_type.png)

  **Quantization Size Reduction**

  ![plot](./Image/quantization_size_reduce.png)

  **Quantization Benefits**

  ![plot](./Image/quantization_Benefits.png)


## Different Quantization Techniques in LLMs
- **GPTQ** (Generalized Post-Training Quantization)
- **GGML** (a tensor library named after its author, Georgi Gerganov, and the older model file format that came with it)
- **GGUF** (GPT-Generated Unified Format, the file format that replaced GGML in llama.cpp)
- **AWQ** (Activation-aware Weight Quantization)
- **PTQ** (Post-Training Quantization)
- **QAT** (Quantization-Aware Training)

**Here we used the GGUF format**
- A good comparison of quantization methods: https://rentry.co/quants
- GGUF Python package in llama.cpp: https://github.com/ggerganov/llama.cpp/tree/master/gguf-py

## **STEPS FOLLOWED**
- Use llama.cpp for quantization
  - Clone the llama.cpp GitHub repo and install the libraries required for conversion

- Download the Hugging Face base model to be quantized. Here we used **Qwen/Qwen1.5-1.8B**
  - It is a 1.8B-parameter model. https://huggingface.co/Qwen/Qwen1.5-1.8B
- Create a local folder for the quantized model
- Convert the HF base model to GGUF format (FP16). No quantization happens at this step
- Apply **quantization to the GGUF model to produce a 4-bit GGUF** model
  - This step creates the quantized Q4_K_M.gguf model
  - It converts the 16-bit GGUF model to 4-bit using the 'q4_k_m' variant

- Inference: run the quantized model and chat with it
- Push the quantized model to HF
  - Upload the quantized model from the local folder to the HF repo



# **Use LLAMA CPP for QUANTIZATION**

## **Clone quantization supporting github repo**

In [1]:
! git clone https://github.com/ggerganov/llama.cpp

Cloning into 'llama.cpp'...
remote: Enumerating objects: 36794, done.
remote: Counting objects: 100% (8674/8674), done.
remote: Compressing objects: 100% (502/502), done.
remote: Total 36794 (delta 8416), reused 8265 (delta 8171), pack-reused 28120 (from 1)
Receiving objects: 100% (36794/36794), 60.07 MiB | 8.67 MiB/s, done.
Resolving deltas: 100% (26842/26842), done.


## Install the libraries required for conversion from the GitHub link below
- These are the dependencies of the script that converts a Hugging Face model to a GGUF model
- https://github.com/ggerganov/llama.cpp/blob/master/requirements/requirements-convert_hf_to_gguf.txt

In [3]:
!cd llama.cpp && LLAMA_CUBLAS=1 make && pip install -r requirements/requirements-convert-hf-to-gguf.txt  -q
# This fails with the error shown below (LLAMA_CUBLAS was replaced by GGML_CUDA); because of `&&`, the pip install never runs.

Makefile:78: *** LLAMA_CUBLAS is removed. Use GGML_CUDA instead..  Stop.


In [10]:
!cd llama.cpp

In [14]:
!pwd
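# Prints /content, not /content/llama.cpp: each `!` command runs in its own subshell, so the `!cd` above does not persist (use `%cd` for that)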

/content
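
The output above shows that the directory change did not stick. A minimal sketch of why, using only the Python standard library: a `cd` issued through a shell command ends with that child process, while `os.chdir` changes the directory of the Python process itself. In a notebook, the `%cd` IPython magic has the same persistent effect. The temporary directory here is just a stand-in for `llama.cpp`.

```python
import os
import subprocess
import tempfile

start_dir = os.getcwd()
target_dir = tempfile.mkdtemp()       # stand-in for the llama.cpp folder

# A shell command runs in a child process; its `cd` ends with that process.
subprocess.run(f'cd "{target_dir}"', shell=True, check=True)
after_shell_cd = os.getcwd()          # still the starting directory

# os.chdir changes the working directory of the Python process itself.
os.chdir(target_dir)
after_chdir = os.getcwd()

os.chdir(start_dir)                   # restore the original directory
print(after_shell_cd == start_dir)
print(os.path.realpath(after_chdir) == os.path.realpath(target_dir))
```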


### The make command above failed, so install the Python requirements separately

In [15]:
! pip install -r /content/llama.cpp/requirements/requirements-convert_hf_to_gguf.txt -q
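
The pip install above only covers the Python conversion script; the native tools still need to be built. The Makefile error earlier names `GGML_CUDA` as the replacement option, so one way to build might look like the sketch below. It assumes a llama.cpp checkout recent enough to build with CMake and a `build` output folder; the helper names `cmake_build_commands` and `run_build` are made up for this example, so check the repository's build documentation for the options that apply to your version.

```python
import subprocess


def cmake_build_commands(repo_dir="llama.cpp", use_cuda=True):
    """Return the CMake commands as argument lists (nothing is executed here)."""
    configure = ["cmake", "-S", repo_dir, "-B", f"{repo_dir}/build"]
    if use_cuda:
        # GGML_CUDA is the option named in the Makefile error message above.
        configure.append("-DGGML_CUDA=ON")
    build = ["cmake", "--build", f"{repo_dir}/build", "--config", "Release"]
    return [configure, build]


def run_build(repo_dir="llama.cpp", use_cuda=True):
    """Run the build; raises CalledProcessError if a step fails."""
    for command in cmake_build_commands(repo_dir, use_cuda):
        subprocess.run(command, check=True)


# Only print the commands; call run_build() to actually execute them.
for command in cmake_build_commands():
    print(" ".join(command))
```

Using `check=True` means a failed step raises an exception instead of passing silently.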

# **Download the Hugging Face base model to be quantized**
 - **Snapshot download** in Hugging Face
  - snapshot_download() downloads an **entire repository** at a given revision.
  - It uses hf_hub_download() internally, so all downloaded files are also cached on your local disk.
  - https://huggingface.co/docs/huggingface_hub/en/guides/download

 - **Qwen/Qwen1.5-1.8B** is the model used here. It is a 1.8B-parameter model
  - https://huggingface.co/Qwen/Qwen1.5-1.8B
  - Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data.

In [16]:
from huggingface_hub import snapshot_download

In [17]:
# Hugging Face base model to be quantized
model_name="Qwen/Qwen1.5-1.8B"

# Local Colab folders for the original and quantized models
base_model="./original_model"
quantized_path = "./quantized_model"

In [18]:
# Download the Qwen/Qwen1.5-1.8B model into the local Colab folder original_model
snapshot_download(repo_id=model_name,local_dir=base_model,local_dir_use_symlinks=False)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.


Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/138 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/2.79k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

LICENSE:   0%|          | 0.00/7.28k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.29k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.67G [00:00<?, ?B/s]

'/content/original_model'

## Create Quantized model folder in local and convert HF base model to GGUF format

In [19]:
quantized_model_path=quantized_path+'/FP16.gguf'
print(quantized_model_path)

./quantized_model/FP16.gguf


In [20]:
!mkdir quantized_model  #./quantized_model

## **Convert base model to gguf format**
- llama.cpp ships gguf-py, a Python package for writing binary files in the GGUF (GGML Universal File) format.
- No quantization happens at this step.

- The one-line command below runs **convert_hf_to_gguf.py**, which converts the base model to a GGUF model; the quantization step that follows needs this file. The output is saved as **FP16.gguf** in the quantized_model folder.
- llama.cpp is not the Llama model itself; it is a separate C/C++ project for running LLMs that also provides the conversion and quantization tools used here.
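
Once the conversion has run, a quick sanity check is to read the start of the output file. The sketch below assumes the GGUF v2+ header layout (4-byte magic `GGUF`, then a little-endian uint32 version, uint64 tensor count, and uint64 metadata key/value count). The function name `read_gguf_header` is made up for this example, and the demo runs on a tiny hand-made header with arbitrary counts rather than a real model file.

```python
import struct
import tempfile

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Read magic, version, tensor count and metadata count from a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        # Assumed layout (GGUF v2+): uint32 version, uint64 tensor count,
        # uint64 metadata key/value count, all little-endian.
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensors": tensor_count, "metadata_kv": kv_count}


# Demo on a hand-made header with arbitrary counts, not a real model file.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 20))
    demo_path = tmp.name

header = read_gguf_header(demo_path)
print(header)
```

On the converted model this would be called as `read_gguf_header("./quantized_model/FP16.gguf")`.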


In [21]:
!python llama.cpp/convert_hf_to_gguf.py ./original_model/ --outtype f16 --outfile ./quantized_model/FP16.gguf

INFO:hf-to-gguf:Loading model: original_model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:output.weight,             torch.bfloat16 --> F16, shape = {2048, 151936}
INFO:hf-to-gguf:token_embd.weight,         torch.bfloat16 --> F16, shape = {2048, 151936}
INFO:hf-to-gguf:blk.0.attn_norm.weight,    torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.0.ffn_down.weight,     torch.bfloat16 --> F16, shape = {5504, 2048}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,     torch.bfloat16 --> F16, shape = {2048, 5504}
INFO:hf-to-gguf:blk.0.ffn_up.weight,       torch.bfloat16 --> F16, shape = {2048, 5504}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,     torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.0.attn_k.bias,         torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.0.attn_k.weight,       torch.bfloat16 --> F16, shape = {2048, 2048}
INFO:hf-to-

  **gguf_format_model_in_colab_local**
  ![plot](./Image/gguf_format_model_in_colab_local.png)

# **Apply quantization: convert the GGUF model to a 4-bit GGUF quantized model**
- The quantization steps are all implemented in the **llama.cpp/quantize** tool.
- When it finishes, it creates the quantized **Q4_K_M.gguf** model.

- It converts the 16-bit GGUF model to 4-bit using the **'q4_k_m' variant**.

- **Quantize the model to 4 bits (using the Q4_K_M method)**. The general form of the command is:
  `./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M`
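
To give a feel for what 4-bit quantization does to the weights, here is a deliberately simplified sketch of block-wise symmetric 4-bit quantization, where each small block of weights gets its own scale. This is not the actual Q4_K_M layout, which is more involved; the function names, block size, and example weights are assumptions chosen for illustration.

```python
def quantize_blockwise_4bit(values, block_size=32):
    """Symmetric 4-bit quantization with one scale per block (simplified)."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        scale = absmax / 7 if absmax > 0 else 1.0   # integers stay in [-7, 7]
        ints = [max(-7, min(7, round(v / scale))) for v in block]
        blocks.append((scale, ints))
    return blocks


def dequantize_blockwise(blocks):
    """Rebuild approximate floats from (scale, ints) pairs."""
    return [q * scale for scale, ints in blocks for q in ints]


# Two blocks of 4 values; the second block contains an outlier (-4.0).
weights = [0.01, -0.02, 0.03, 0.015, 0.5, -4.0, 0.25, 1.0]
blocks = quantize_blockwise_4bit(weights, block_size=4)
restored = dequantize_blockwise(blocks)
errors = [abs(a - b) for a, b in zip(weights, restored)]
print(blocks)
print(max(errors[:4]), max(errors[4:]))
```

Because each block has its own scale, the outlier in the second block does not degrade the precision of the small values in the first block.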

In [22]:
# Quantization method
methods=["q4_k_m"]  # We can list multiple quantization variants here
quantized_path = "./quantized_model"

In [23]:
import os

# New quantized model name: Q4_K_M.gguf
for method in methods:
  qtype=f"{quantized_path}/{method.upper()}.gguf"
  print(qtype)
  print("\nFP16.gguf to Q4_K_M.gguf conversion statement:","./llama.cpp/quantize "+ "./quantized_model"+"/FP16.gguf "+ qtype + " " + method)

  # Convert the 16-bit 'FP16.gguf' model to 4-bit ('q4_k_m' variant) using the './llama.cpp/quantize' tool
  # os.system runs a shell command from Python; it returns the exit status and does not raise if the command fails
  os.system("./llama.cpp/quantize "+ "./quantized_model"+"/FP16.gguf "+ qtype + " " + method)

./quantized_model/Q4_K_M.gguf

FP16.gguf to Q4_K_M.gguf conversion statement: ./llama.cpp/quantize ./quantized_model/FP16.gguf ./quantized_model/Q4_K_M.gguf q4_k_m
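
The output above contains only the printed statement, because `os.system` returns an exit status rather than raising when the tool is missing. A more defensive variant might look like the sketch below. The candidate binary locations are assumptions (they depend on the llama.cpp version and how it was built), and the helper names are made up for this example.

```python
import subprocess
from pathlib import Path

# Candidate locations are assumptions; adjust them to your checkout and build.
CANDIDATE_BINARIES = [
    "llama.cpp/build/bin/llama-quantize",
    "llama.cpp/llama-quantize",
    "llama.cpp/quantize",
]


def find_binary(candidates):
    """Return the first candidate path that exists, or None."""
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None


def quantize_command(binary, source, target, method):
    """Build the argument list: <binary> <input.gguf> <output.gguf> <type>."""
    return [binary, source, target, method]


def run_quantize(source, target, method="q4_k_m"):
    """Run the quantize tool and fail loudly if it is missing or errors out."""
    binary = find_binary(CANDIDATE_BINARIES)
    if binary is None:
        raise FileNotFoundError("no quantize binary found; build llama.cpp first")
    subprocess.run(quantize_command(binary, source, target, method), check=True)


print(quantize_command("llama-quantize", "./quantized_model/FP16.gguf",
                       "./quantized_model/Q4_K_M.gguf", "q4_k_m"))
```

Calling `run_quantize("./quantized_model/FP16.gguf", "./quantized_model/Q4_K_M.gguf")` would then either produce the file or stop with a clear error.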


**When this command succeeds, it creates the new 4-bit quantized model './quantized_model/Q4_K_M.gguf' locally. The llama.cpp repo has since been updated: the quantize tool is now named llama-quantize, so the './llama.cpp/quantize' path no longer exists and no quantized file was created in this run.**
- The original model is **3.67 GB**; the 4-bit quantized model is around **1.22 GB**, a reduction of about **2.45 GB**.
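
The size figures above can be checked with a little arithmetic. The sketch below computes the saving and a rough effective bits-per-weight figure, assuming the original file averages about 16 bits per weight (the conversion log shows a few tensors stored as F32, so this is approximate). The function name `size_summary` is made up for this example.

```python
def size_summary(original_gb, quantized_gb, original_bits=16):
    """Summarise the size change between two model files."""
    saved_gb = original_gb - quantized_gb
    ratio = quantized_gb / original_gb
    return {
        "saved_gb": round(saved_gb, 2),
        "saved_percent": round(100 * (1 - ratio), 1),
        # Rough estimate: assumes the original averages `original_bits` per weight.
        "approx_bits_per_weight": round(original_bits * ratio, 2),
    }


summary = size_summary(3.67, 1.22)
print(summary)
```

With these sizes the file shrinks by roughly two thirds. An average somewhat above 4 bits per weight is plausible for a mixed scheme that also has to store per-block scales.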

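# **Inference: the command below runs the quantized model so we can chat with it**
- During the run it shows a "User:" prompt where we can ask a question and get a response.
- Here `-r "User:"` sets the reverse prompt (generation pauses and control returns to us whenever the model prints "User:"), and `-f llama.cpp/prompts/chat-with-bob.txt` loads the initial prompt from that file.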
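
The reverse-prompt idea can be shown with a toy loop: keep appending generated tokens until the text ends with the reverse prompt, then hand control back to the user. This is only a conceptual sketch, not how llama.cpp implements it; the token list is a hypothetical stand-in for model output.

```python
def generate_until_reverse_prompt(token_stream, reverse_prompt="User:"):
    """Collect tokens until the text ends with the reverse prompt."""
    text = ""
    for token in token_stream:
        text += token
        if text.endswith(reverse_prompt):
            break                      # hand control back to the user
    return text


# Hypothetical token stream standing in for model output.
fake_tokens = ["Bob:", " Hello", "!", "\n", "User", ":", " (never reached)"]
output = generate_until_reverse_prompt(fake_tokens)
print(repr(output))
```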


In [24]:
! /content/llama.cpp/main -m ./quantized_model/Q4_K_M.gguf -n 90 --repeat_penalty 1.0 --color -i -r "User:" -f llama.cpp/prompts/chat-with-bob.txt


/bin/bash: line 1: /content/llama.cpp/main: No such file or directory


**A successful run produces a log like the one below and behaves like a chatbot. The llama.cpp repo has since been updated: the main binary is now named llama-cli, so 'llama.cpp/main' is no longer available.**


 **Quantization_model_inference**

  ![plot](./Image/quantization_model_inference.png)

# **Push quantized model to HF**
- Create a repo in HF: **qwen1.5-llm-quantized**
- Create a Hugging Face API object
- Upload the quantized model from the local folder to the HF repo

In [25]:
from huggingface_hub import notebook_login

In [26]:
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [28]:
from huggingface_hub import HfApi, HfFolder, create_repo, upload_file

In [31]:
# The Q4_K_M.gguf model was not created because of the llama.cpp update. The original line is kept for reference
# model_path="/content/quantized_model/Q4_K_M.gguf"  # Quantized model created locally
model_path="/content/quantized_model/FP16.gguf"  # Q4_K_M model not available, so pushing the FP16 GGUF model instead
repo_name="qwen1.5-llm-quantized"  # Repo name in my HF account
repo_url=create_repo(repo_name,private=False)  # Creates the repo in HF and returns its URL

In [32]:
#Create object of Hugging face API
api=HfApi()

#Upload this to HF Repo
api.upload_file(
    path_or_fileobj=model_path,
    # path_in_repo="Q4_K_M.gguf",
    path_in_repo="FP16.gguf",    # Q4_K_M model not available, so pushing the FP16 GGUF model instead
    repo_id="PrabhaB/qwen1.5-llm-quantized",
    repo_type="model",

)

FP16.gguf:   0%|          | 0.00/3.68G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/PrabhaB/qwen1.5-llm-quantized/commit/be0858088fc7980860fd06ec8464890a7744f636', commit_message='Upload FP16.gguf with huggingface_hub', commit_description='', oid='be0858088fc7980860fd06ec8464890a7744f636', pr_url=None, pr_revision=None, pr_num=None)
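
The upload above hard-codes the FP16 file because the 4-bit file was missing. A small sketch of how the choice could be made automatically is shown below: it prefers the quantized file and falls back to the FP16 one. The helper names `pick_upload_file` and `push_model` are made up for this example; `exist_ok=True` is passed to `create_repo` so that re-running the cell does not fail when the repo already exists.

```python
from pathlib import Path


def pick_upload_file(candidates):
    """Return (local_path, name_in_repo) for the first candidate that exists."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return str(path), path.name
    raise FileNotFoundError(f"none of these files exist: {candidates}")


def push_model(repo_id, candidates):
    """Create the repo if needed and upload the first available file."""
    # Imported here so the helper above can be used without huggingface_hub.
    from huggingface_hub import HfApi, create_repo

    local_path, name_in_repo = pick_upload_file(candidates)
    create_repo(repo_id, private=False, exist_ok=True)  # no error on re-runs
    HfApi().upload_file(
        path_or_fileobj=local_path,
        path_in_repo=name_in_repo,
        repo_id=repo_id,
        repo_type="model",
    )


# Preference order: the 4-bit file first, the FP16 file as fallback.
CANDIDATES = [
    "/content/quantized_model/Q4_K_M.gguf",
    "/content/quantized_model/FP16.gguf",
]
```

After `notebook_login()`, this could be called as `push_model("PrabhaB/qwen1.5-llm-quantized", CANDIDATES)`.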

  **GGUF-formatted model in HF**
  ![plot](./Image/quantization_model_in_hf.png)
  
  
  **Actual quantized model in HF**
  
  ![plot](./Image/quantization_model_in_hf_actual.png)

# **Assignment**

## Download llama.cpp locally and use this quantized model there

### Take a Llama or Mistral model and convert it to GGUF or GGML format

### Write up the differences between GGUF, GGML, GPTQ, and AWQ

### Give at least 5 differences, with evidence

### PTQ (Post-Training Quantization)
### QAT (Quantization-Aware Training)