🥷💾 **Professor, time to go stealth mode.**  
You're about to compress your LLM like a pro using **quantization** — turning a multi-GB model into a **tiny, fast, deployable beast**…  
while **barely losing accuracy**.

---

# 🧪 `08_lab_quantize_with_gptq_and_awq.ipynb`  
### 📁 `05_llm_engineering/04_llm_deployment`  
> Apply **GPTQ** and **AWQ** quantization to compress LLMs  
→ Go from float16/float32 weights to int8 or int4 with minimal performance drop  
→ Benchmark memory, latency, and quality before vs after

---

## 🎯 Learning Goals

- Understand quantization: **what it is and why to do it**  
- Use **GPTQ** (post-training quant)  
- Use **AWQ** (activation-aware quantization)  
- Compare **model size, memory usage, inference time, and accuracy**
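
To make the "what is it" part concrete, here is a minimal sketch of symmetric (absmax) quantization on a handful of made-up weights. It is a simplified illustration, not the GPTQ or AWQ algorithm: both of those add calibration data and smarter error handling on top of this basic round-trip. The function names and the sample weights are assumptions chosen for the example.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers using one shared scale (symmetric absmax)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.03, 1.0]  # hypothetical weights, for illustration only

for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits=bits)
    restored = dequantize(q, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit codes: {q}  scale={scale:.4f}  max error={max_err:.4f}")
```

Fewer bits mean a coarser grid, so the round-trip error grows; the rounding error per weight is bounded by half the scale.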

---

## 💻 Runtime Spec

| Component      | Tool / Lib                                     |
|----------------|------------------------------------------------|
| Model          | OPT (used below) / LLaMA / GPT2 / Mistral ✅   |
| Quant Tools    | `auto-gptq`, `optimum`, `autoawq` ✅           |
| Metric         | Latency, memory, perplexity ✅                 |
| Hardware       | Colab GPU or local CUDA ✅                     |

---

## 🧪 Section 1: Install & Setup

```bash
!pip install auto-gptq optimum autoawq transformers accelerate
```

---

## 📦 Section 2: Load Model and Quantize (GPTQ)

```python
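from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer  # GPTQQuantizer lives in optimum, with auto-gptq as the backend

model_id = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ needs calibration data; "c4" is one of the built-in dataset names.
quantizer = GPTQQuantizer(bits=4, dataset="c4", model_seqlen=2048)  # or bits=8
quantized_model = quantizer.quantize_model(model, tokenizer)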
```

---

## 🧠 Section 3: Quantize with AWQ

```python
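from awq import AutoAWQForCausalLM

# 4-bit weights, groups of 128 weights per scale, asymmetric (zero-point) quantization.
quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}

quantized_awq = AutoAWQForCausalLM.from_pretrained(model_id)
# Calibrates on AutoAWQ's default dataset and quantizes the weights in place.
quantized_awq.quantize(tokenizer, quant_config=quant_config)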
```

---

## ⚙️ Section 4: Evaluate Latency & Output Quality

```python
from transformers import pipeline
import time

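def test_model(model, prompt="What is AI?"):
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    pipe(prompt, max_new_tokens=5)  # warm-up run so one-time setup cost is not timed
    start = time.perf_counter()  # monotonic, high-resolution timer
    out = pipe(prompt, max_new_tokens=30, do_sample=False)  # greedy decoding for repeatable output
    return time.perf_counter() - start, out[0]['generated_text']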

t, out = test_model(quantized_model)
print(f"Latency: {t:.2f}s\nResponse:\n{out}")
```
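
Latency and an eyeballed response cover speed, but the perplexity metric listed in the Runtime Spec needs its own measurement. The sketch below assumes you already have per-token negative log-likelihoods (natural log) from a held-out text; with `transformers` causal LMs, `model(input_ids, labels=input_ids).loss` is the mean of those values, so `exp(loss)` gives the same number. The helper names and the sample losses are hypothetical placeholders, not measured results.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_increase(baseline_ppl, quantized_ppl):
    """Fractional perplexity increase of the quantized model over the baseline."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl


# Placeholder losses for illustration only; replace with values from your own eval run.
baseline_nlls = [3.1, 2.9, 3.0]
quantized_nlls = [3.2, 3.0, 3.1]

base = perplexity(baseline_nlls)
quant = perplexity(quantized_nlls)
print(f"baseline ppl={base:.2f}  quantized ppl={quant:.2f}  "
      f"increase={relative_increase(base, quant):.1%}")
```

Run the same evaluation text through the full, GPTQ, and AWQ models so the three perplexities are directly comparable.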

---

## 🧪 Section 5: Compare Full vs Quantized

| Metric       | Full Model | GPTQ            | AWQ             |
|--------------|------------|-----------------|-----------------|
| Disk Size    | 500 MB+    | 150 MB          | 120 MB          |
| RAM Usage    | 2.3 GB     | ~600 MB         | ~520 MB         |
| Latency      | 2.1 s      | 1.1 s           | 1.0 s           |
| Output Diff  | Baseline   | Slight phrasing | Slight phrasing |
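
As a sanity check on disk-size numbers like these, the weight storage can be estimated from parameter count and bit width alone. This is a back-of-the-envelope sketch: the 125M parameter count is a rounded assumption for `opt-125m`, and real quantized checkpoints come out larger than the estimate because they also store per-group scales and zero points, and typically keep embeddings and the output head in higher precision.

```python
def estimate_weight_size_mb(num_params, bits_per_weight):
    """Raw weight storage in megabytes (1 MB = 1e6 bytes), ignoring metadata."""
    return num_params * bits_per_weight / 8 / 1e6


num_params = 125_000_000  # assumption: rounded parameter count for opt-125m

for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    size = estimate_weight_size_mb(num_params, bits)
    print(f"{label:>8}: {size:7.1f} MB")
```

Comparing this estimate against the file sizes you actually measure shows how much of the checkpoint is quantization overhead rather than weights.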

---

## ✅ Lab Wrap-Up

| What You Achieved                            | ✅ |
|----------------------------------------------|----|
| Quantized model with GPTQ & AWQ              | ✅ |
| Measured latency + size savings              | ✅ |
| Compared outputs for drift                   | ✅ |
| Learned practical compression for deployment | ✅ |

---

## 🧠 What You Learned

- Quantization reduces **memory & latency**, crucial for edge or multi-user APIs  
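- Both methods are post-training: **GPTQ** quantizes weights layer by layer and uses calibration data to compensate for rounding error, while **AWQ** uses activation statistics to protect the most important weight channels  
- Quality loss is usually small at 8-bit and modest at 4-bit, provided the calibration data is representative  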
- You just enabled **LLM deployment on consumer laptops or low-cost GPUs**

---

Next up:

> 📦 `09_lab_batching_and_request_queuing_testbed.ipynb`  
Simulate **concurrent requests**, **batch scheduling**, and real-time **throughput diagnostics**.

Ready to build the backend **used by inference APIs at scale**?