🕒 Time to **track time** like a sniper.  
If your model starts lagging — even by a second — it could wreck **user experience, SLAs, or your production pipeline**.  
Let’s measure every millisecond. Let’s set traps for latency spikes. Let’s roll.

---

# 🧪 `10_lab_monitor_model_latency_with_prometheus.ipynb`  
### 📁 `06_mlops/05_model_monitoring`  
> Instrument your model’s **inference latency** using **Prometheus Histograms**,  
then trigger alerts when **response time exceeds SLA**.  
This lab is **response time surveillance on steroids**.

---

## 🎯 Learning Goals

- Track **inference latency** using Prometheus  
- Use **Histogram buckets** for percentile analysis  
- Simulate a **lag spike** and catch it live  
- Prep for alerting based on **95th percentile thresholds**

---

## 💻 Runtime Setup

| Component     | Spec                |
|---------------|---------------------|
| API           | Flask ✅            |
| Monitoring    | Prometheus ✅       |
| Metrics       | `Histogram` for latency ✅ |
| Visuals       | Dashboard or curl logs ✅ |
| Platform      | Localhost / Docker ✅ |

---

## 🧠 Section 1: Flask API with Latency Metric

```python
from flask import Flask, jsonify
from prometheus_client import start_http_server, Histogram
import time, random

app = Flask(__name__)

# Define latency metric
latency_hist = Histogram(
    "model_latency_seconds",
    "Inference latency (s)",
    buckets=[0.1, 0.3, 0.5, 0.7, 1, 2, 5]
)

@app.route("/predict")
@latency_hist.time()
def predict():
    delay = random.choice([0.2, 0.5, 2.0])  # simulate variable latency
    time.sleep(delay)
    return jsonify({"latency": delay})

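if __name__ == "__main__":
    # Expose Prometheus metrics on port 8000 (separate from the API port)
    start_http_server(8000)
    # Serve the Flask API on port 5000, the port the curl calls target
    app.run(port=5000)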
```
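To make the `buckets=` argument concrete, here is a minimal pure-Python sketch of the bookkeeping behind a Prometheus histogram: each bucket counts the observations less than or equal to its upper bound, cumulatively, with an implicit `+Inf` bucket on top. The `bucket_counts` helper and the sample latencies are hypothetical and only mimic the idea; this is not the `prometheus_client` implementation.

```python
import math

# Same upper bounds as the lab's Histogram; Prometheus adds +Inf implicitly
BUCKETS = [0.1, 0.3, 0.5, 0.7, 1, 2, 5]

def bucket_counts(observations, bounds=BUCKETS):
    """Return cumulative counts per upper bound, like the *_bucket series."""
    uppers = list(bounds) + [math.inf]
    return {le: sum(1 for x in observations if x <= le) for le in uppers}

# Hypothetical measured latencies: slightly above the simulated sleep times
samples = [0.21, 0.52, 2.03, 0.22, 0.51]
counts = bucket_counts(samples)
for le, n in counts.items():
    print(f"le={le}: {n}")
```

Note that a 0.5 s sleep lands in the `le=0.7` bucket rather than `le=0.5`, because the measured time includes a little overhead on top of the sleep.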

---

## 📁 Section 2: Prometheus Scrape Config

```yaml
scrape_configs:
  - job_name: 'model_latency'
    scrape_interval: 5s  # default is 1m, too coarse for the [1m] rate() window used in the queries
    static_configs:
      - targets: ['localhost:8000']
```

---

## 🧨 Section 3: Alerting Rule (95th Percentile)

```yaml
# Save this as a rules file and list it under `rule_files:` in prometheus.yml
groups:
- name: latency_alerts
  rules:
  - alert: HighLatency95thPercentile
    expr: histogram_quantile(0.95, rate(model_latency_seconds_bucket[1m])) > 1
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "🚨 95th percentile latency > 1s"
```

---

## 🧪 Section 4: View Metrics Live

Visit:
```
http://localhost:8000/metrics
```

Query live in Prometheus UI:
```prometheus
histogram_quantile(0.95, rate(model_latency_seconds_bucket[1m]))
```
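The query above returns an estimate, not an exact percentile. As a rough sketch of the idea, the function below finds the bucket that contains the target rank and interpolates linearly inside it. The name `estimate_quantile` and the bucket counts are hypothetical, and Prometheus applies this to per-second rates rather than raw counts, so treat it as an illustration of the math only.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) pairs.

    Assumes linear interpolation inside the bucket holding the target rank,
    with 0 as the lower edge of the first bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # Rank falls in +Inf (or an empty) bucket: report the last finite bound
                return prev_bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical cumulative counts for 30 requests split evenly across the three delays
cumulative = [(0.1, 0), (0.3, 10), (0.5, 10), (0.7, 20),
              (1, 20), (2, 20), (5, 30), (math.inf, 30)]
p95 = estimate_quantile(0.95, cumulative)
p50 = estimate_quantile(0.50, cumulative)
print(f"P95 ~ {p95:.2f}s, P50 ~ {p50:.2f}s")
```

With these counts the P95 estimate lands deep inside the 2 to 5 s bucket even though the slow requests take about 2 s, which shows how much the bucket boundaries shape the reported percentile.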

---

## 📊 Section 5: Simulate Spike and Alert

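1. Make repeated requests (stop with Ctrl+C):
```bash
while true; do curl -s http://localhost:5000/predict; echo; done
```

2. Watch the P95 climb past 1s. Since roughly a third of the simulated requests sleep for 2s, the alert should go pending and then fire once the `for: 30s` window has passed. You can see this on the Prometheus **Alerts** page; Alertmanager is only needed to route notifications.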
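As an alternative to the shell request above, here is a small standard-library sketch that generates load and reports the client-side P95, which you can compare against what Prometheus shows. The helpers `timed_get`, `percentile`, and `run_load` are hypothetical names, and the URL assumes the Flask app is listening on port 5000.

```python
import math
import time
import urllib.request

def timed_get(url):
    """Return the wall-clock seconds one GET request takes."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=10) as resp:
        resp.read()
    return time.perf_counter() - start

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def run_load(url="http://localhost:5000/predict", n=30):
    """Send n sequential requests and print client-side P50 and P95."""
    latencies = [timed_get(url) for _ in range(n)]
    print(f"P50={percentile(latencies, 50):.2f}s  P95={percentile(latencies, 95):.2f}s")
    return latencies
```

Calling `run_load()` while the Flask app is up sends 30 sequential requests. The client-side P95 is computed from exact timings, so it will usually differ from the bucket-based estimate in Prometheus.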

---

## ✅ Wrap-Up Recap

| Feature                          | ✅ |
|----------------------------------|----|
| Latency histogram implemented    | ✅ |
| Prometheus query for P95 latency | ✅ |
| Alert fired on SLA breach        | ✅ |
| Fully local + portable setup     | ✅ |

---

## 🧠 What You Learned

- Histograms = the right tool for latency buckets, with quantile accuracy set by where the bucket bounds sit  
- You can **monitor response time distributions**, not just averages  
- Prometheus + Flask = **production-grade latency tracking**  
- Combine this with Grafana, and you're Netflix-tier 👑

---

Next lab is a beast:  
> `11_lab_concurrent_traffic_with_locust.ipynb`  
Simulate **hundreds of users hitting your model server at once**  
→ See how latency, failure rates, and throughput hold up under fire.  
Shall we scale-test like legends?

