⚙️🧠 **Professor, it's time to go operational.**  
You’ve built LLMs, finetuned them, added retrieval —  
Now we ask the real-world question:

> “Which serving engine delivers faster, lighter, more scalable inference?”

---

# 🧪 `07_lab_vllm_vs_tgi_latency_comparison.ipynb`  
### 📁 `05_llm_engineering/04_llm_deployment`  
> Benchmark and compare **vLLM vs TGI (Text Generation Inference)**  
→ Same model, same prompts  
→ Measure **latency, memory, throughput**  
→ Decide which to deploy in production

---

## 🎯 Learning Goals

- Understand differences in **LLM inference engines**  
- Benchmark latency with **multiple concurrent requests**  
- Measure memory + throughput tradeoffs  
- Choose the best engine for **your deployment budget or user demand**

---

## 💻 Runtime Spec

| Component     | Spec                                                      |
|---------------|-----------------------------------------------------------|
| vLLM          | PagedAttention (paged KV cache) + FlashAttention kernels ✅ |
| TGI (HF)      | Tensor parallelism + continuous batching ✅                |
| Model         | LLaMA / GPT2 / Mistral (small) ✅                          |
| Platform      | Colab Pro or local GPU ✅                                  |

---

## 🧪 Section 1: Install + Download Models

```bash
!pip install vllm transformers accelerate
!pip install text-generation  # Python client for a running TGI server; the server itself runs in Docker
```

Choose a small model that both engines support (check each engine's supported-model list first):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

---

## 🚀 Section 2: Benchmark TGI

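TGI runs as a separate server, so start it with Docker on a local machine, outside the notebook. If Docker is not available (for example on Colab), mock the latency instead.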

```bash
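# Docker (run outside the notebook; needs an NVIDIA GPU and the NVIDIA Container Toolkit)
# $MODEL_DIR is a host directory used to cache downloaded weights
docker run --gpus all --shm-size 1g -p 8080:80 -v $MODEL_DIR:/data \
    ghcr.io/huggingface/text-generation-inference \
    --model-id EleutherAI/gpt-neo-125M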
```

```python
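import time
import requests

prompt = "Once upon a time"
# Match the vLLM settings in Section 3 (32 new tokens, temperature 0.8)
tgi_params = {"max_new_tokens": 32, "do_sample": True, "temperature": 0.8}

start = time.perf_counter()
response = requests.post("http://localhost:8080/generate",
                         json={"inputs": prompt, "parameters": tgi_params}, timeout=60)
latency = time.perf_counter() - start
response.raise_for_status()  # fail loudly on an HTTP error
print(f"TGI latency: {latency:.3f} s\nResponse:", response.json())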
```
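
A single request says little about behavior under load, so here is a minimal sketch of a concurrent benchmark. It assumes the TGI server from the Docker command is listening on `localhost:8080`. The helper names (`tgi_send`, `mock_send`, `run_concurrent`, `summarize`) are hypothetical and not part of any library. `mock_send` is a stand-in for notebooks where no server is running: it only sleeps, so its numbers are meaningless beyond checking the harness.

```python
import math
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

TGI_URL = "http://localhost:8080/generate"  # assumption: port mapping from the Docker command


def tgi_send(prompt, max_new_tokens=32):
    """POST one prompt to a running TGI server (needs the `requests` package)."""
    import requests  # imported lazily so the mock path has no extra dependency
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    r = requests.post(TGI_URL, json=body, timeout=60)
    r.raise_for_status()
    return r.json()["generated_text"]


def mock_send(prompt, delay_s=0.05):
    """Stand-in when no server is available: sleeps instead of generating."""
    time.sleep(delay_s)
    return prompt + " ..."


def timed_call(send, prompt):
    """Return the wall-clock seconds one call to `send` takes."""
    start = time.perf_counter()
    send(prompt)
    return time.perf_counter() - start


def run_concurrent(send, prompts, workers=10):
    """Send all prompts through a thread pool; return (latencies, total wall time)."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(lambda p: timed_call(send, p), prompts))
    return latencies, time.perf_counter() - wall_start


def summarize(latencies, wall_s):
    """Median, nearest-rank p95, max latency, and requests per second."""
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "p50_s": statistics.median(ordered),
        "p95_s": ordered[p95_index],
        "max_s": ordered[-1],
        "req_per_s": len(ordered) / wall_s,
    }


if __name__ == "__main__":
    lat, wall = run_concurrent(mock_send, ["Once upon a time"] * 10, workers=10)
    print(summarize(lat, wall))
```

Swap `mock_send` for `tgi_send` once the server is up. Threads are enough here because each worker spends its time waiting on the network, not computing.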

---

## ⚡ Section 3: Benchmark vLLM (Python API or CLI)

```python
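import time
from vllm import LLM, SamplingParams

# Model loading happens here and is deliberately kept outside the timed region
llm = LLM(model=model_id)
params = SamplingParams(temperature=0.8, max_tokens=32)

# `prompt` is reused from Section 2 so both engines see the same input
start = time.perf_counter()
outputs = llm.generate(prompt, sampling_params=params)
latency = time.perf_counter() - start
print(f"vLLM latency: {latency:.3f} s")
print("Output:", outputs[0].outputs[0].text)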
```
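
Latency for one prompt is only half the picture; throughput is usually reported as generated tokens per second over a batch. The sketch below is one way to compute it. `tokens_per_second` and `measure_batch` are hypothetical helper names, and the `generate` argument is any callable that returns one token count per prompt, so the same code can wrap either engine.

```python
import time


def tokens_per_second(token_counts, elapsed_s):
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return sum(token_counts) / elapsed_s


def measure_batch(generate, prompts):
    """Time one batched call; `generate` must return a token count per prompt."""
    start = time.perf_counter()
    counts = generate(prompts)
    elapsed = time.perf_counter() - start
    return {
        "elapsed_s": elapsed,
        "tokens": sum(counts),
        "tok_per_s": tokens_per_second(counts, elapsed),
    }


# Usage with vLLM, assuming `llm`, `params`, and `prompt` exist in the notebook:
#
# def vllm_generate(prompts):
#     outs = llm.generate(prompts, sampling_params=params)
#     return [len(o.outputs[0].token_ids) for o in outs]
#
# print(measure_batch(vllm_generate, [prompt] * 10))
```

Counting tokens instead of requests keeps the comparison fair when the two engines stop generating at different lengths.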

---

## 📈 Section 4: Compare Results

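The numbers below are illustrative, not measured; replace them with your own results from Sections 2 and 3.

| Metric               | TGI                          | vLLM     |
|----------------------|------------------------------|----------|
| Latency (1 req)      | ~700 ms                      | ~350 ms  |
| Latency (10 req)     | spikes                       | stable   |
| Memory (125M model)  | ~2.2 GB                      | ~1.8 GB  |
| Batch Support        | ✅                           | ✅       |
| Streaming Output     | ✅                           | ✅       |
| FlashAttention       | ✅ (supported architectures) | ✅       |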
---

## ✅ Lab Wrap-Up

| What You Did                      | Done |
|-----------------------------------|------|
| Installed and tested both engines | ✅   |
| Measured latency and throughput   | ✅   |
| Compared resource usage           | ✅   |
| Built a serving benchmark table   | ✅   |

---

## 🧠 What You Learned

- **vLLM = PagedAttention + continuous batching, often ahead on throughput under concurrent load**  
- **TGI = stable, integrates easily with the Hugging Face ecosystem**  
- Both support batching, streaming, and production deployment  
- Benchmarks like these are how teams weigh **cost against performance** before they deploy

---

Next up:

> 🧠 `08_lab_quantize_with_gptq_and_awq.ipynb`  
Shrink your LLMs by up to roughly **75%** with 4-bit quantization,  
and keep almost the **same accuracy** with **GPTQ + AWQ**.

Ready to put your models on a diet, Professor?