🤖 **Quantization in AI**

Quantization in AI is a technique that **reduces the precision or bit-width of the numerical data** in a neural network model, typically its weights and activations. The goal is to make the model run well on devices with **limited computational resources**, such as mobile phones or embedded systems.

During quantization, the model's **floating-point values** are mapped to **fixed-point or integer values** with a **reduced number of bits**. Storing and computing with fewer bits **saves memory and computational power**, which makes the model more efficient on hardware built for low-precision arithmetic.
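The float-to-integer mapping described above can be sketched in a few lines of NumPy. This is a minimal illustration of symmetric quantization with a single shared scale, not the scheme used by any particular library; the function names and the sample values are made up for the example.

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, bits: int = 8):
    """Map float values to signed integers that share one scale (bits <= 8)."""
    qmax = 2 ** (bits - 1) - 1                     # 127 for 8 bits
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from the integers and the scale."""
    return q.astype(np.float32) * scale

# Illustrative weights only
weights = np.array([0.5, -1.0, 0.25, 0.75], dtype=np.float32)
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)

print("integers:", q)
print("restored:", restored)
print("max error:", float(np.max(np.abs(restored - weights))))
```

Each value now occupies one byte instead of four, and the round trip is off by at most half a quantization step.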

However, quantization trades **model accuracy for efficiency**. Reducing precision discards some information, which can degrade the model's performance. To limit this, practitioners use **optimization and calibration techniques**, such as choosing the quantization range from representative data, to keep accuracy at an acceptable level.
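One simple form of calibration is picking the clipping range before rounding. The toy sketch below compares a range taken from the absolute maximum with one taken from the 99.9th percentile, on synthetic weights that contain a single outlier. The data, the percentile, and the helper name `roundtrip_mse` are assumptions for illustration; real calibration methods are considerably more involved.

```python
import numpy as np

def roundtrip_mse(x: np.ndarray, clip_value: float, bits: int = 4) -> float:
    """Quantize to signed integers within [-clip_value, clip_value], return the MSE."""
    qmax = 2 ** (bits - 1) - 1
    scale = clip_value / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return float(np.mean((x - q * scale) ** 2))

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=10_000)   # synthetic weights
weights[0] = 1.0                               # one synthetic outlier

# Range set by the largest value: the outlier stretches the grid
mse_max_abs = roundtrip_mse(weights, float(np.max(np.abs(weights))))
# Range set by a high percentile: the outlier is clipped, the rest is finer
mse_percentile = roundtrip_mse(weights, float(np.percentile(np.abs(weights), 99.9)))

print(f"max-abs range:    MSE = {mse_max_abs:.3e}")
print(f"percentile range: MSE = {mse_percentile:.3e}")
```

On data like this the percentile range gives a lower overall error, because the integer grid is spent on the bulk of the values rather than on one extreme weight.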

Overall, quantization makes it possible to deploy neural network models on **resource-constrained devices**. With a good balance between efficiency and accuracy, the available hardware can be used well while performance stays satisfactory.


## Why is AI Quantized? 🤔

AI models are often quantized to achieve various benefits, such as:

1. **Model Size Reduction:** Quantization reduces the size of AI models by representing weights and activations with lower-precision data types. 📉 This is particularly useful when deploying models on resource-constrained devices with limited storage capacity. 💾

2. **Inference Speed Improvement:** Quantized models can run faster because they need less memory bandwidth and can use hardware instructions optimized for low-precision operations. ⚡️ This enables real-time or near-real-time inference, making AI applications more responsive. 🚀

3. **Energy Efficiency:** Lower-precision arithmetic and reduced memory traffic mean less work per inference, which lowers power consumption. 🔋 This is especially important for battery-powered devices or scenarios where energy efficiency is a priority. 💡

4. **Deployment Flexibility:** Quantized models can be deployed on a wide range of platforms, including edge devices, embedded systems, and IoT devices. 🌐 The smaller model size and improved performance make it easier to integrate AI capabilities into various applications. 📱
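The size benefit in the list above is simple arithmetic: parameters times bits per weight. Here is a rough sketch, assuming a hypothetical model with one billion parameters and ignoring file metadata and any per-block scale factors.

```python
def estimate_size_gb(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

num_params = 1_000_000_000  # hypothetical 1B-parameter model

for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: {estimate_size_gb(num_params, bits):.2f} GB")
```

Under these assumptions the weights shrink from 4 GB at fp32 to 0.5 GB at 4 bits.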

Quantization involves a trade-off between model performance and resource efficiency. ⚖️ While quantized models offer benefits in size and speed, they are usually somewhat less accurate than their full-precision counterparts, and the gap widens as the bit-width drops. 🔍 However, advances in quantization techniques have narrowed this gap considerably, making quantization a valuable optimization strategy for AI models.

By quantizing AI models, we can run them efficiently on diverse hardware and deploy AI applications widely across different domains. 🌟

**GGUF quantization parameters:**

In these names, `_0` and `_1` are the older block formats (one scale per block, or a scale plus an offset), `_k` marks the newer k-quant formats, and `_s`, `_m`, `_l` stand for small, medium, and large mixes, where larger mixes keep more tensors at higher precision.

| Method | Quantization | Advantages | Trade-offs |
|---|---|---|---|
| **q2_k** | 2-bit k-quant | Smallest files | Most severe accuracy loss of the listed methods |
| **q3_k_l** | 3-bit k-quant, large mix | Small files; the most accurate 3-bit variant | Substantial accuracy loss |
| **q3_k_m** | 3-bit k-quant, medium mix | Very small files | High accuracy loss |
| **q3_k_s** | 3-bit k-quant, small mix | Smallest 3-bit variant | Highest accuracy loss among the 3-bit variants |
| **q4_0** | 4-bit legacy, one scale per block | Small files; simple format | Generally less accurate than 4-bit k-quants of similar size |
| **q4_1** | 4-bit legacy, scale and offset per block | More accurate than q4_0 | Slightly larger than q4_0 |
| **q4_k_m** | 4-bit k-quant, medium mix | Good balance of size and accuracy; a common default | Larger than q4_k_s |
| **q4_k_s** | 4-bit k-quant, small mix | Smaller than q4_k_m | Lower accuracy than q4_k_m |
| **q5_0** | 5-bit legacy, one scale per block | Medium-sized files; more accurate than the 4-bit legacy formats | Generally less accurate than 5-bit k-quants of similar size |
| **q5_1** | 5-bit legacy, scale and offset per block | More accurate than q5_0 | Slightly larger than q5_0 |
| **q5_k_m** | 5-bit k-quant, medium mix | Very low accuracy loss | Larger files than the 4-bit variants |
| **q5_k_s** | 5-bit k-quant, small mix | Low accuracy loss; smaller than q5_k_m | Slightly less accurate than q5_k_m |
| **q6_k** | 6-bit k-quant | Accuracy very close to fp16 | Large files |
| **q8_0** | 8-bit, one scale per block | Almost no accuracy loss | Largest files of the listed methods; modest size savings |
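The formats in the table quantize weights in small blocks, each with its own scale, rather than with one scale for the whole tensor. The sketch below shows the idea for an 8-bit format with 32-value blocks and one fp16 scale per block, which is my understanding of how `q8_0` is laid out. It is a simplified NumPy illustration, not llama.cpp's implementation, and the k-quant formats use a more elaborate structure that is not shown.

```python
import numpy as np

BLOCK_SIZE = 32  # assumed block length for this sketch

def quantize_blocks_int8(x: np.ndarray):
    """Quantize a 1-D float array block by block, one fp16 scale per block."""
    # The array length must be a multiple of BLOCK_SIZE
    blocks = x.astype(np.float32).reshape(-1, BLOCK_SIZE)
    scales = (np.max(np.abs(blocks), axis=1, keepdims=True) / 127.0).astype(np.float16)
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales.astype(np.float32)), -127, 127).astype(np.int8)
    return q, scales

def dequantize_blocks(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Rebuild approximate floats from the integers and per-block scales."""
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

def bits_per_weight(value_bits: int, scale_bits: int = 16,
                    block_size: int = BLOCK_SIZE) -> float:
    """Effective storage cost once the per-block scale is included."""
    return (block_size * value_bits + scale_bits) / block_size

rng = np.random.default_rng(0)
x = rng.normal(size=1024).astype(np.float32)   # synthetic weights
q, scales = quantize_blocks_int8(x)
restored = dequantize_blocks(q, scales)

print("max error:", float(np.max(np.abs(restored - x))))
print("8-bit blocks:", bits_per_weight(8), "bits per weight")
print("4-bit blocks:", bits_per_weight(4), "bits per weight")
```

The per-block scale is why an "8-bit" file costs slightly more than 8 bits per weight: with these assumptions, 8.5 bits for 8-bit values and 4.5 bits for 4-bit values.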


In [1]:
# @title ⚡ AutoQuantize-GGUF
# @markdown ---
# @markdown AutoQuantize automatically quantizes language models hosted on Hugging Face,
# @markdown making them more efficient and practical for use in resource-constrained environments.
# @markdown ---
# @markdown ❤️ Created by [OEvortex](https://youtube.com/@OEvortex).
# @markdown Please subscribe to my channel.
# @markdown ---
# @markdown ### ⚡ Quantization parameters
from google.colab import userdata
# Model details
MODEL_ID = "MysteriousAI/Mia-1B"  # @param {type:"string"}
QUANTIZATION_METHODS = "q2_k"  # @param {type:"string"}
QUANTIZATION_METHODS = QUANTIZATION_METHODS.replace(" ", "").split(",")

# @markdown ---

# @markdown ### 🤗 Hugging Face Hub
# @markdown  It requires a `HF_TOKEN` secret in Colab with the value of your [Hugging Face access token](https://huggingface.co/settings/tokens).
username = "MysteriousAI"  # @param {type:"string"}
HF_Token = userdata.get('HF_TOKEN')

MODEL_NAME = MODEL_ID.split('/')[-1]

# Install llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -qqq -r llama.cpp/requirements.txt --no-progress-bar

# Download model
!git lfs install
!git clone https://huggingface.co/{MODEL_ID}

# Convert to fp16
fp16 = f"{MODEL_NAME}/{MODEL_NAME.lower()}.fp16.bin"
!python llama.cpp/convert.py {MODEL_NAME} --outtype f16 --outfile {fp16}

# Quantize the model for each method in the QUANTIZATION_METHODS list
for method in QUANTIZATION_METHODS:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
    !./llama.cpp/quantize {fp16} {qtype} {method}

# Install required packages
!pip install -q huggingface_hub
from huggingface_hub import create_repo, HfApi, ModelCard


# Defined in the secrets tab in Google Colab
token = HF_Token
api = HfApi()

# Create empty repo
create_repo(
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    repo_type="model",
    exist_ok=True,
    token=token
)

# Upload gguf files
api.upload_folder(
    folder_path=MODEL_NAME,
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    allow_patterns=["*.gguf", "*.md"],
    token=token
)



Cloning into 'llama.cpp'...
remote: Enumerating objects: 21798, done.
remote: Counting objects: 100% (9551/9551), done.
remote: Compressing objects: 100% (491/491), done.
remote: Total 21798 (delta 9314), reused 9103 (delta 9060), pack-reused 12247
Receiving objects: 100% (21798/21798), 26.25 MiB | 19.00 MiB/s, done.
Resolving deltas: 100% (15430/15430), done.
Already up to date.
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissin

mia-1b.Q2_K.gguf:   0%|          | 0.00/432M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/MysteriousAI/Mia-1B-GGUF/commit/e6a08e34c3fc7afdf335e137126382e698f824e4', commit_message='Upload folder using huggingface_hub', commit_description='', oid='e6a08e34c3fc7afdf335e137126382e698f824e4', pr_url=None, pr_revision=None, pr_num=None)
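The GGUF cell above accepts a comma-separated list of methods and passes each one straight to the `quantize` tool, so a typo only surfaces after the download and conversion have finished. A small check up front can catch that earlier. The helper below is a hypothetical addition, not part of the original notebook, and its `KNOWN_METHODS` set simply mirrors the methods listed in the table.

```python
# Methods taken from the table in this notebook (an assumption, not an exhaustive list)
KNOWN_METHODS = {
    "q2_k", "q3_k_l", "q3_k_m", "q3_k_s",
    "q4_0", "q4_1", "q4_k_m", "q4_k_s",
    "q5_0", "q5_1", "q5_k_m", "q5_k_s",
    "q6_k", "q8_0",
}

def parse_methods(raw: str) -> list[str]:
    """Split a comma-separated string and reject names not in KNOWN_METHODS."""
    methods = [m.strip().lower() for m in raw.split(",") if m.strip()]
    unknown = [m for m in methods if m not in KNOWN_METHODS]
    if unknown:
        raise ValueError(f"Unknown quantization method(s): {', '.join(unknown)}")
    return methods

print(parse_methods("q4_k_m, Q5_K_M"))
```

In the notebook, this could replace the `replace(" ", "").split(",")` line so that several methods can be requested in one run with early feedback on mistakes.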

In [None]:
# @title ⚡ AutoQuantize-GPTQ
# @markdown ### 🤗 Hugging Face Hub
# @markdown  It requires a `HF_TOKEN` secret in Colab with the value of your [Hugging Face access token](https://huggingface.co/settings/tokens).
MODEL_ID = "MysteriousAI/Mia-1B"  # @param {type:"string"}

username = "Abhaykoul"  # @param {type:"string"}

# GPTQ support in transformers relies on optimum and auto-gptq
!pip install -q huggingface_hub optimum auto-gptq accelerate

from google.colab import userdata
from huggingface_hub import create_repo, HfApi
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Defined in the secrets tab in Google Colab
token = userdata.get('HF_TOKEN')
api = HfApi()

MODEL_NAME = MODEL_ID.split('/')[-1]
gptq = f"{MODEL_NAME}-GPTQ"  # local output folder

# Quantize while loading (needs a GPU runtime).
# bits=4 and the "c4" calibration dataset are assumptions; adjust as needed.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    quantization_config=quantization_config,
)

# Save the quantized model and tokenizer
model.save_pretrained(gptq, use_safetensors=True)
tokenizer.save_pretrained(gptq)

# Create empty repo
create_repo(
    repo_id=f"{username}/{MODEL_NAME}-GPTQ",
    repo_type="model",
    exist_ok=True,
    token=token
)

# Upload the quantized folder
api.upload_folder(
    folder_path=gptq,
    repo_id=f"{username}/{MODEL_NAME}-GPTQ",
    token=token
)
