# 8-bit Integer Quantization (INT8) for Transformer Models

## What is 8-bit (INT8) Quantization?

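INT8 quantization stores model **weights** as 8-bit integers instead of 16- or 32-bit floats. Schemes such as LLM.int8(), which bitsandbytes implements, also quantize **activations** on the fly during matrix multiplication.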

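This results in:
- Smaller memory footprint (about 4x smaller than FP32, 2x smaller than FP16)
- Potentially faster inference on hardware with native INT8 support (with bitsandbytes the main gain is memory, and throughput can be lower than FP16)
- Better energy efficiency for edge or mobile devices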

### Trade-off
Slight accuracy degradation may occur, particularly in sensitive layers. Techniques like **outlier thresholding**, which keeps large-magnitude activations in higher precision, help limit that loss.
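
To make the source of that accuracy loss concrete, here is a minimal sketch of symmetric absmax quantization in plain Python. The helper names `quantize_absmax` and `dequantize` are illustrative only and are not part of bitsandbytes or Transformers; real libraries apply the same idea to whole tensors with vectorized kernels.

```python
def quantize_absmax(values):
    """Map floats to int8 codes in [-127, 127] using one shared scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from int8 codes."""
    return [c * scale for c in codes]


# Toy weights, chosen only for illustration.
weights = [0.12, -0.50, 0.33, 0.02, -0.07]
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)

# Each value is rounded to the nearest multiple of `scale`,
# so the round-trip error is at most about scale / 2.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"scale={scale:.5f}, max round-trip error={max_error:.5f}")
```

Because the scale is set by the largest magnitude in the group, a single very large value stretches the scale and costs every other value precision, which is why outliers get special treatment.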


## Why Use INT8 Quantization?

- **Edge/Mobile Optimization**: Well suited to devices with limited RAM or compute.
- **Compression**: Reduces model size to roughly a quarter of FP32.
- **Latency**: Can speed up inference on hardware with INT8 support.
- **Simple Integration**: Enabled in a few lines via Hugging Face Transformers and bitsandbytes.


### Install Required Packages

In [None]:
!pip install transformers bitsandbytes accelerate

### Configure INT8 Quantization Parameters

In [None]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Set 8-bit quantization parameters
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,             # Load linear-layer weights in 8-bit
    llm_int8_threshold=6.0         # Outlier threshold (default 6.0): activations above it are handled in FP16; higher = more values quantized
)
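
As a rough illustration of what the threshold controls, the sketch below splits a toy activation vector into regular values (round-tripped through INT8) and outliers (kept at full precision) before taking a dot product. This is a simplified scalar model under stated assumptions, not the bitsandbytes kernel: the function names are hypothetical, and the real implementation works on whole feature dimensions of a matrix with per-row and per-column scales.

```python
def int8_roundtrip(values):
    """Quantize to int8 with an absmax scale, then dequantize."""
    absmax = max((abs(v) for v in values), default=0.0)
    if absmax == 0:
        return list(values)
    scale = absmax / 127
    return [round(v / scale) * scale for v in values]


def split_by_threshold(activations, threshold=6.0):
    """Return (regular, outlier) index lists based on magnitude."""
    regular = [i for i, a in enumerate(activations) if abs(a) <= threshold]
    outliers = [i for i, a in enumerate(activations) if abs(a) > threshold]
    return regular, outliers


def mixed_precision_dot(activations, weights, threshold=6.0):
    """Dot product with outlier positions kept in full precision."""
    regular, outliers = split_by_threshold(activations, threshold)
    a_q = int8_roundtrip([activations[i] for i in regular])
    w_q = int8_roundtrip([weights[i] for i in regular])
    low = sum(a * w for a, w in zip(a_q, w_q))
    high = sum(activations[i] * weights[i] for i in outliers)
    return low + high


# Toy data with one large outlier at index 3 (illustrative values only).
activations = [0.4, -1.2, 0.7, 48.0, 0.1, -0.9]
weights = [0.5, 0.25, -0.3, 0.1, 0.8, -0.6]

exact = sum(a * w for a, w in zip(activations, weights))
naive = sum(a * w for a, w in zip(int8_roundtrip(activations),
                                  int8_roundtrip(weights)))
mixed = mixed_precision_dot(activations, weights, threshold=6.0)

print(f"naive INT8 error:      {abs(naive - exact):.4f}")
print(f"mixed-precision error: {abs(mixed - exact):.4f}")
```

In this toy case, removing the outlier from the quantized group shrinks the scale for the remaining values, so the mixed result lands closer to the exact dot product.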


### Load Quantized Model

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-coder-6.7b-instruct",
    quantization_config=bnb_config,  # Apply 8-bit config
    device_map="auto"                # Automatically allocate across available GPUs/CPUs
)
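
A quick way to confirm the quantized model works is to run a short generation. The helper below is a hypothetical convenience wrapper (the name `generate_text` is not a Transformers API). It assumes a tokenizer for the same checkpoint has been loaded separately, for example with `AutoTokenizer.from_pretrained`.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=64):
    """Tokenize a prompt, generate a continuation, and decode it."""
    # Move the input tensors to the device that holds the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage, assuming `model` and `tokenizer` are already loaded:
# print(generate_text(model, tokenizer, "Write a function that reverses a string."))
```

Keeping `max_new_tokens` small makes this a cheap smoke test before running a fuller evaluation of output quality.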


### Memory Footprint Comparison

In [None]:
# Check memory usage of the loaded model (in GB)
model_size_gb = model.get_memory_footprint() / 1e9
print(f"Estimated 8-bit model size: {model_size_gb:.2f} GB")
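
For comparison, the weight-only footprint at other precisions can be estimated with simple arithmetic. This sketch assumes roughly 6.7 billion parameters (inferred from the model name, not measured), and `estimate_weight_gb` is an illustrative helper rather than a library function.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def estimate_weight_gb(num_params, dtype):
    """Back-of-envelope weight size in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 6.7e9  # assumption based on the "6.7b" in the model name

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{estimate_weight_gb(num_params, dtype):.1f} GB")
```

The value reported by `get_memory_footprint()` will usually be somewhat higher than the INT8 estimate, since some modules typically remain in higher precision.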

## Summary: INT8 Quantization Results

- **4x Smaller**: Cuts weight memory to about a quarter of FP32 (half of FP16)
- **Fast Inference**: Possible on hardware with native INT8 support, though speedups depend on the backend
- **Outlier Robust**: `llm_int8_threshold` keeps large-magnitude activations in FP16 to preserve accuracy
- **Simple to Enable**: Just use `BitsAndBytesConfig` with Hugging Face

### Recommended Use Cases:
- Deployment on edge/mobile devices
- Latency-critical inference services
- GPU-constrained production environments
