Here’s an enhanced end-to-end recipe for 4-bit (and hybrid 8-bit/4-bit) quantization with BitsAndBytes—including layer-wise overrides, outlier handling, offloading, and post-quant metrics—plus a checklist of what you need to know **before** and **after** quantization.

---

## Summary

We’ll extend your base PTQ snippet to cover:

1. **Advanced BitsAndBytesConfig options**

   * Layer-wise skipping & fp32 offload
   * Outlier threshold tuning (`llm_int8_threshold`) ([Hugging Face][1])
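   * Mixed-precision hybrids (quantized linear layers plus selected modules kept in higher precision; `load_in_8bit` and `load_in_4bit` cannot both be set on one config) ([ApX Machine Learning][2])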
2. **Pre-quantization checklist**

   * Model dtype, supported ops, memory footprint, baseline accuracy.
3. **Post-quantization evaluation**

   * Memory & VRAM use (`get_memory_footprint`) ([Hugging Face][3])
   * Inference latency via warmed-up timing loops
   * Accuracy/perplexity drop on a validation set
4. **Additional recipes**

   * GPTQ / SmoothQuant pointers ([Kaggle][4])

---

## 1. Enhanced BitsAndBytes PTQ Snippet

```python
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = "facebook/opt-1.3b"

# ─── 1) Advanced BitsAndBytesConfig ────────────────────
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # 4-bit weights
    bnb_4bit_quant_type="nf4",                # NF4 gives better weight fidelity than plain FP4
    bnb_4bit_compute_dtype=torch.bfloat16,    # bf16 compute reduces overflow risk compared with fp16
    bnb_4bit_use_double_quant=True,           # nested quantization to claw back ~0.4 bits/param
    # Layer-wise overrides:
    llm_int8_threshold=6.0,                   # outlier threshold; only takes effect with load_in_8bit=True
    llm_int8_skip_modules=["lm_head"],        # leave lm_head unquantized if it is unstable
    llm_int8_enable_fp32_cpu_offload=True,    # let modules that device_map places on CPU stay in fp32
)

# ─── 2) Load quantized model (no fine-tuning) ─────────────
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",        # auto sharding across GPU/CPU
    torch_dtype="auto",       # use the dtype stored in the checkpoint for non-quantized modules
)
model.eval()

# ─── 3) Tokenizer ────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
if tokenizer.pad_token is None:  # only fall back to EOS when no pad token is defined
    tokenizer.pad_token = tokenizer.eos_token

# ─── 4) Quick inference test & timing ────────────────────
prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warm-up
_ = model.generate(**inputs, max_new_tokens=10)

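# Time benchmark (synchronize so queued GPU work is included in the measurement)
if torch.cuda.is_available():
    torch.cuda.synchronize()
start = time.perf_counter()
_ = model.generate(**inputs, max_new_tokens=50)
if torch.cuda.is_available():
    torch.cuda.synchronize()
latency = time.perf_counter() - start
print(f"Latency for up to 50 new tokens: {latency:.3f}s")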

# ─── 5) Save quantized model ─────────────────────────────
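# Note: saving 4-bit weights needs transformers/bitsandbytes releases with 4-bit serialization support
model.save_pretrained("opt-1.3b-4bit-nf4")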
tokenizer.save_pretrained("opt-1.3b-4bit-nf4")
```
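
To make the `llm_int8_threshold` setting in the config above more concrete: in 8-bit (LLM.int8()) mode, hidden-state feature dimensions that contain a value whose magnitude exceeds the threshold are treated as outliers and handled in 16-bit, while the remaining dimensions go through INT8. The sketch below only illustrates that split on a toy activation matrix in pure Python. It is not the bitsandbytes kernel, and `split_outlier_columns` is a hypothetical helper name introduced here.

```python
def split_outlier_columns(hidden_states, threshold=6.0):
    """Return (outlier_cols, regular_cols) for a 2-D list of activations.

    A feature column counts as an outlier if any entry in it has a
    magnitude above `threshold`. (Illustrative helper, not a library API.)
    """
    n_cols = len(hidden_states[0])
    outlier_cols = [
        j for j in range(n_cols)
        if any(abs(row[j]) > threshold for row in hidden_states)
    ]
    outlier_set = set(outlier_cols)
    regular_cols = [j for j in range(n_cols) if j not in outlier_set]
    return outlier_cols, regular_cols


# Toy activations: 2 tokens x 4 features; only feature 2 holds a large value.
acts = [
    [0.5, -1.2, 7.3, 0.1],
    [0.3, 2.0, -0.4, -0.9],
]
outliers, regular = split_outlier_columns(acts, threshold=6.0)
print("16-bit path columns:", outliers)
print("INT8 path columns:  ", regular)
```

Lowering the threshold sends more columns down the 16-bit path, which tends to trade speed and memory for stability.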

---

## 2. Pre-Quantization Checklist

Before you quantize, ensure you’ve gathered:

* **Baseline metrics**

  * **Full-precision (FP32 or FP16) memory footprint** via `model.get_memory_footprint()` ([Hugging Face][3])
  * **Inference latency** on representative prompts
  * **Accuracy / Perplexity** on your validation set
* **Model compatibility**

  * Relies mainly on `nn.Linear` layers, which are the modules bitsandbytes replaces. Everything else (embeddings, layer norms, skipped modules) stays in its loaded dtype ([Kaggle][4])
* **Hardware constraints**

  * GPU architecture (Ampere or newer for bf16)
  * Available VRAM vs. model size
* **Calibration data** (only if you switch to a calibration-based method such as GPTQ or SmoothQuant; bitsandbytes itself needs none)

  * A small, diverse dataset (100–500 samples) for activation range capture
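
For the "available VRAM vs. model size" check, a back-of-the-envelope estimate is often enough before loading anything. The sketch below assumes blockwise 4-bit storage with one fp32 scaling constant per 64 weights, and, for double quantization, 8-bit constants with one fp32 constant per 256 of them (the block sizes described in the QLoRA paper; treat them as assumptions for your installed version). The function names are hypothetical, and the result covers weights only, not activations, the KV cache, or modules left unquantized.

```python
def bits_per_param_4bit(block_size=64, double_quant=False, dq_block_size=256):
    """Effective bits per weight for blockwise 4-bit storage.

    Without double quantization each block keeps one fp32 scaling constant.
    With it, constants are stored in 8 bits plus one fp32 constant per
    `dq_block_size` first-level constants.
    """
    if double_quant:
        overhead = 8 / block_size + 32 / (block_size * dq_block_size)
    else:
        overhead = 32 / block_size
    return 4 + overhead


def weight_gb(n_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 1.3e9  # nominal parameter count for a 1.3B model (assumption)
for label, bits in [
    ("fp32", 32),
    ("fp16/bf16", 16),
    ("int8", 8),
    ("4-bit", bits_per_param_4bit()),
    ("4-bit + double quant", bits_per_param_4bit(double_quant=True)),
]:
    print(f"{label:>22}: {bits:6.3f} bits/param, ~{weight_gb(n_params, bits):.2f} GB")
```

Under these assumptions the difference between the two 4-bit variants is about 0.37 bits per parameter, which is where the "~0.4 bits/param" figure for nested quantization comes from.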

---

## 3. Post-Quantization Evaluation

After quantization, measure:

```python
# Memory footprint (bytes)
footprint_bytes = model.get_memory_footprint()
print(f"Quantized model footprint: {footprint_bytes/1e9:.2f} GB")

# Inference throughput (tokens/sec)
tokens = 50  # assumes generation produced all 50 new tokens (no early EOS)
throughput = tokens / latency
print(f"Throughput: {throughput:.1f} tokens/s")

# Accuracy / Perplexity drop
# (computed directly from the model's token-level cross-entropy loss)
eval_data = ["Hello world!", "Quantization is cool."]  # your validation split
total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for txt in eval_data:
        inpt = tokenizer(txt, return_tensors="pt").to(model.device)
        out = model(**inpt, labels=inpt["input_ids"])
        n_pred = inpt["input_ids"].shape[1] - 1  # loss is averaged over the shifted targets
        total_nll += out.loss.item() * n_pred
        total_tokens += n_pred
# Perplexity = exp(mean negative log-likelihood per predicted token)
print("Avg perplexity:", float(torch.exp(torch.tensor(total_nll / total_tokens))))
```
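
A single timed `generate` call, as used earlier in this recipe, can be noisy. One possible refinement is a small helper that warms up, repeats the call, and reports the median. This is a sketch rather than a rigorous benchmark harness; `benchmark` and its `sync` parameter are hypothetical names introduced here, and the stand-in workload exists only so the block runs without a GPU or a model.

```python
import statistics
import time


def benchmark(fn, warmup=2, iters=5, sync=None):
    """Time `fn()` over several runs and return (median_seconds, timings).

    `sync` is an optional zero-argument callable that blocks until queued
    device work has finished, so asynchronous GPU kernels are counted.
    """
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(iters):
        if sync is not None:
            sync()  # make sure earlier work is not billed to this run
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for this run's work to finish before stopping the clock
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), timings


# Stand-in workload; replace with a call into your model.
median_s, runs = benchmark(lambda: sum(range(10_000)), warmup=1, iters=3)
print(f"median: {median_s * 1e3:.3f} ms over {len(runs)} runs")
```

With the quantized model loaded, the same helper could be called as `benchmark(lambda: model.generate(**inputs, max_new_tokens=50), sync=torch.cuda.synchronize)` on a CUDA device.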

Compare these against your full-precision baseline to quantify trade-offs.
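
One simple way to make that comparison explicit is to express each metric as a relative change against the baseline. The sketch below assumes you have already collected both sets of numbers into dictionaries; `compare_metrics` is a hypothetical helper, and the values shown are placeholders for illustration only, not measurements.

```python
def compare_metrics(baseline, quantized):
    """Return the relative change (%) of each shared metric vs. the baseline."""
    return {
        name: 100.0 * (quantized[name] - baseline[name]) / baseline[name]
        for name in baseline
        if name in quantized and baseline[name] != 0
    }


# Placeholder numbers; substitute your own measurements.
baseline = {"memory_gb": 2.0, "latency_s": 1.0, "perplexity": 10.0}
quantized = {"memory_gb": 1.0, "latency_s": 1.5, "perplexity": 10.5}

report = compare_metrics(baseline, quantized)
for name, pct in report.items():
    print(f"{name:>12}: {pct:+.1f}% vs. baseline")
```

Negative values are improvements for memory, latency, and perplexity alike, so a quick scan of the signs shows where the quantized model wins and where it pays.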

---

## 4. Additional Recipes & Next Steps

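* **8-bit quantization (LLM.int8())**:

  ```python
  bnb_config = BitsAndBytesConfig(load_in_8bit=True)
  ```

  Reported to reach about 90% of FP32 accuracy with a 2× memory reduction ([ApX Machine Learning][2]); verify both numbers on your own model. This is the mode in which `llm_int8_threshold` applies.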

* **GPTQ** (per-group quant) and **SmoothQuant** wrappers can further reduce error without retraining ([Kaggle][4]).

* **Quantization-Aware Training (QAT)** if you need >99% accuracy recovery: insert fake-quant modules and fine-tune for a few epochs.
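
To illustrate what a "fake-quant" module does in the QAT bullet above, here is a minimal forward-pass sketch: values are rounded onto a low-bit integer grid and immediately mapped back to floats, so the rest of the network sees the quantization error during training. It assumes symmetric absmax scaling over the whole list; real QAT frameworks typically work per channel or per group and pass gradients through the rounding step with a straight-through estimator. `fake_quantize` is a hypothetical helper name.

```python
def fake_quantize(values, num_bits=4):
    """Symmetric absmax quantize-dequantize round trip for a list of floats."""
    qmax = 2 ** (num_bits - 1) - 1  # largest integer level, e.g. 7 for 4 bits
    absmax = max(abs(v) for v in values)
    if absmax == 0:
        return list(values)  # nothing to scale
    scale = absmax / qmax
    return [
        # round to the nearest integer level, clamp, then map back to float
        max(-qmax, min(qmax, round(v / scale))) * scale
        for v in values
    ]


weights = [0.8, -0.3, 0.12, 0.0]
dequantized = fake_quantize(weights, num_bits=4)
errors = [abs(w, ) if False else abs(w - d) for w, d in zip(weights, dequantized)]
print("dequantized:", [round(d, 4) for d in dequantized])
print("max abs error:", round(max(errors), 4))
```

The rounding error per value is bounded by half the step size (`scale / 2`), which is the quantity fine-tuning learns to tolerate.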

---

By following these patterns, you’ll have a reproducible pipeline to quantize **and** validate your large-scale models—maximizing efficiency while keeping accuracy loss in check.

[1]: https://huggingface.co/docs/transformers/en/quantization/bitsandbytes "bitsandbytes - Hugging Face"
[2]: https://apxml.com/courses/quantized-llm-deployment/chapter-2-implementing-llm-quantization-toolkits/quantization-hf-transformers-accelerate "Quantization with Hugging Face Transformers and Accelerate"
[3]: https://huggingface.co/docs/transformers/main/en//quantization "Quantization - Hugging Face"
[4]: https://www.kaggle.com/code/enesbeinci/how-to-work-with-mistral7b "how-to-work-with-mistral7b - Kaggle"
