# Part IX — Deployment and Production Operations  
## 40. Operations: Backups, Rollbacks, Incident Response (Runbooks Professionals Actually Use)

This chapter is about *operating* a Django system after deployment—where real
reliability comes from. Many projects fail not because the code is bad, but because
operations are immature:

- no backups (or backups that can’t be restored)
- unclear rollback process
- migrations that block deploys and no plan to recover
- incidents handled ad-hoc without a runbook
- no ownership or communication plan

You’ll produce concrete operational documents and implement basic operational
capabilities:

- database backups and restore drills
- media/export file backup strategy
- rollback plan (code vs schema)
- incident response checklist and runbooks
- “game day” practice mindset

---

## 40.0 Learning Outcomes

By the end, you should be able to:

1. Design a backup strategy for:
   - PostgreSQL database
   - media uploads (S3/disk)
   - export artifacts
2. Perform a restore drill and verify integrity.
3. Explain rollback strategies:
   - code rollback
   - schema rollback (rare, risky)
   - forward-fix strategy
4. Create incident runbooks:
   - DB down
   - Redis down
   - worker down
   - 5xx spike
   - migration stuck/locking
5. Create and use health checks (`/healthz`, `/readyz`) as part of incident response.
6. Write a “postmortem” template and apply it.

---

# 40.1 Backups: What You Must Back Up (and What You Must Not)

### 40.1.1 Must back up
- **Database** (PostgreSQL): source of truth for most business data.
- **Media uploads** if stored on disk or in object storage:
  - cover images
  - attachments
- **Export files** if they must be retained (or if they’re the only copy).

### 40.1.2 Typically do NOT back up
- staticfiles output (`STATIC_ROOT`) if it can be regenerated from code.
- caches (Redis cache) unless used as primary data store (shouldn’t be).
- Celery broker queues (message queues are usually not the data source of truth).
  - If you lose them, you lose in-flight tasks; design idempotency.

### 40.1.3 The “source of truth” rule
Only back up things that cannot be reconstructed reliably:
- DB and user uploads are usually irrecoverable otherwise.
- Everything else should be reproducible from code + config.

---

# 40.2 Database Backups (PostgreSQL) — Practical Industry Patterns

There are two common types of backups:

### 40.2.1 Logical backups (`pg_dump`)
- creates a SQL or custom-format dump of schema + data
- portable across PostgreSQL major versions (restore into the same or a newer version)
- good for:
  - small/medium DBs
  - daily backups
  - restore into staging for testing

Command example:

```bash
pg_dump \
  --format=custom \
  --file=backup.dump \
  --dbname="postgres://user:pass@host:5432/dbname"
```

Restore:

```bash
pg_restore \
  --clean \
  --if-exists \
  --no-owner \
  --dbname="postgres://user:pass@host:5432/dbname" \
  backup.dump
```
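
When the dump runs from a scheduled job, it helps to build the command in code so the file name is timestamped and the password stays out of the process list. The following is a minimal sketch, not a definitive implementation: the helper names `build_pg_dump_command` and `run_pg_dump` are hypothetical, and it assumes the password is supplied through the standard `PGPASSWORD` environment variable.

```python
import os
import subprocess
from datetime import datetime, timezone


def build_pg_dump_command(host, port, user, dbname, out_dir, now=None):
    """Return (argv, output_path) for a custom-format pg_dump.

    The password is deliberately not part of argv; see run_pg_dump below.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    output_path = os.path.join(out_dir, f"{dbname}-{stamp}.dump")
    argv = [
        "pg_dump",
        "--format=custom",
        f"--file={output_path}",
        f"--host={host}",
        f"--port={port}",
        f"--username={user}",
        "--no-password",  # fail instead of prompting inside a cron job
        dbname,
    ]
    return argv, output_path


def run_pg_dump(argv, password):
    """Run pg_dump with the password passed via the PGPASSWORD env var."""
    env = dict(os.environ, PGPASSWORD=password)
    # check=True raises CalledProcessError on a non-zero exit status,
    # so a failed backup cannot pass silently.
    subprocess.run(argv, env=env, check=True)
```

A scheduler would call `build_pg_dump_command(...)`, then `run_pg_dump(argv, password)`, and finally upload `output_path` to wherever backups are stored.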

### 40.2.2 Physical backups / snapshots
- filesystem-level snapshots (or managed service snapshots)
- usually the fastest way to restore a large database
- can support point-in-time recovery (PITR) with WAL archiving

Often used by:
- managed DB services (RDS, Cloud SQL)
- larger databases

**Industry baseline:**  
For most Django apps on managed DB services:
- enable automated snapshots
- enable PITR if available
- still do periodic logical dumps for portability/testing

---

### 40.2.3 A sane backup schedule (baseline)
- Daily full backup
- Retain for 7–30 days (depending on compliance)
- Weekly/monthly longer retention if needed
- Test restore monthly (at minimum)
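
A retention rule like this is easier to trust once it is written down as code. The sketch below is one possible interpretation, assuming one backup per calendar day: keep every backup from the last `daily_days` days, plus the newest backup of each ISO week for `weekly_weeks` weeks. The function name and both defaults are hypothetical.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, today, daily_days=14, weekly_weeks=8):
    """Return the set of backup dates to retain.

    Keeps every backup newer than `daily_days` days, plus the newest
    backup of each ISO week within the last `weekly_weeks` weeks.
    """
    keep = set()
    daily_cutoff = today - timedelta(days=daily_days)
    weekly_cutoff = today - timedelta(weeks=weekly_weeks)
    newest_per_week = {}
    for d in backup_dates:
        if d > daily_cutoff:
            keep.add(d)
        if d > weekly_cutoff:
            week = tuple(d.isocalendar())[:2]  # (ISO year, ISO week number)
            if week not in newest_per_week or d > newest_per_week[week]:
                newest_per_week[week] = d
    keep.update(newest_per_week.values())
    return keep
```

Anything not in the returned set is a candidate for deletion. Computing the keep-set first and deleting second makes the policy easy to dry-run before it touches real backups.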

---

# 40.3 Restore Drills (The “Backup is worthless until restored” Rule)

### 40.3.1 Restore drill checklist (staging/local)
1. Create a new empty database.
2. Restore the latest backup.
3. Run application migrations if the code is newer than the backup; if both come from the same release, `python manage.py migrate` should report nothing to apply.
4. Run a smoke test:
   - admin login
   - list pages load
   - `/readyz` passes
5. Validate key data (counts, a few records).

### 40.3.2 Data integrity checks (practical)
- row counts for critical tables:
  - Articles
  - Tasks
  - Memberships
- spot-check latest objects
- check constraints:
  - unique membership
  - published article constraints
- check migrations table: `django_migrations` should match expected state
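
The row-count check can be automated by recording counts at backup time and comparing them after the restore. This is a minimal sketch under stated assumptions: `table_counts` and `diff_counts` are hypothetical helpers, the table name `articles_article` is illustrative, and an in-memory SQLite database stands in for the restored PostgreSQL database so the example runs anywhere. Any DB-API 2.0 connection should work the same way.

```python
import sqlite3


def table_counts(conn, tables):
    """Return {table: row_count} using a DB-API 2.0 connection."""
    counts = {}
    cur = conn.cursor()
    for table in tables:
        # Table names cannot be bound as query parameters, so accept
        # only plain identifiers taken from a hard-coded list.
        if not table.isidentifier():
            raise ValueError(f"unsafe table name: {table!r}")
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        counts[table] = cur.fetchone()[0]
    return counts


def diff_counts(expected, actual):
    """Return {table: (expected, actual)} for every mismatch."""
    return {
        table: (count, actual.get(table))
        for table, count in expected.items()
        if actual.get(table) != count
    }


# Demo: SQLite stands in for the restored database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles_article (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO articles_article (id) VALUES (?)", [(1,), (2,), (3,)])

manifest = {"articles_article": 3}  # counts recorded when the backup was taken
mismatches = diff_counts(manifest, table_counts(conn, manifest))
print(mismatches)  # an empty dict means the counts match
```

Counts only prove that rows exist, not that they are correct, so keep the spot-checks of recent objects as well.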

---

# 40.4 Media and File Backups (Object Storage vs Server Disk)

### 40.4.1 If you use S3-like storage
- your “backup strategy” is often:
  - bucket versioning
  - lifecycle rules
  - cross-region replication (if needed)
  - periodic snapshot exports (optional)

### 40.4.2 If you store media on server disk
You must back up:
- `MEDIA_ROOT` directory
- export files directory if stored there

Common approach:
- nightly rsync to backup server
- snapshots of disk volume
- tarball backups stored in object storage
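
The tarball option can be sketched with the standard library alone. The helper names `archive_media` and `verify_archive` are hypothetical, and uploading the archive to object storage is left out; treat this as a starting point rather than a complete backup job.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def archive_media(media_root, dest_dir, now=None):
    """Write MEDIA_ROOT to a timestamped .tar.gz and return its path."""
    media_root = Path(media_root)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    now = now or datetime.now(timezone.utc)
    archive = dest_dir / f"media-{now:%Y%m%dT%H%M%SZ}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths inside the archive relative ("media/...")
        tar.add(media_root, arcname="media")
    return archive


def verify_archive(archive, expected_members):
    """Check that the archive opens and contains the expected file names."""
    with tarfile.open(archive, "r:gz") as tar:
        names = set(tar.getnames())
    return set(expected_members) <= names
```

Reopening the archive right after writing it is a cheap first check; it does not replace a real restore drill.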

**Industry advice:** Prefer object storage for media to reduce ops burden.

---

# 40.5 Rollbacks: Code vs Schema (Professional Reality)

### 40.5.1 Code rollback (common, safe)
Rolling back code is usually:
- deploy previous artifact/container
- restart web and worker
- verify readiness

This is why:
- you keep previous builds available
- you tag releases

### 40.5.2 Schema rollback (rare, risky)
Rolling back schema is hard because:
- destructive migrations lose data
- data migrations aren’t always reversible
- concurrent writes during the rollback window can cause inconsistency

**Industry approach:**
- use expand/contract pattern so code rollbacks still work with expanded schema.
- prefer forward-fixes for DB issues.

### 40.5.3 “Rollback matrix” concept
Before deploying:
- confirm rollback path for code
- confirm schema changes are backward compatible with old code for at least one
  release window

This avoids “we can’t roll back because the schema changed.”
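
One way to make the rollback matrix concrete is to classify the migration operations in a release before deploying it. The sketch below works on Django migration operation class names as plain strings. The grouping into expand, contract, and review is an assumption to adapt per project, not a Django feature, and both function names are hypothetical.

```python
# Django migration operation class names, grouped by how they affect the
# PREVIOUS release's code. This grouping is an assumption, not a Django API.
EXPAND = {"CreateModel", "AddIndex"}
CONTRACT = {"DeleteModel", "RemoveField", "RenameField", "RenameModel"}


def rollback_report(operation_names):
    """Group a release's migration operations for the rollback matrix."""
    report = {"expand": [], "contract": [], "review": []}
    for name in operation_names:
        if name in EXPAND:
            report["expand"].append(name)
        elif name in CONTRACT:
            report["contract"].append(name)
        else:
            # e.g. AddField (safe only if nullable or defaulted), AlterField,
            # AddConstraint, RunPython, RunSQL: a human has to decide.
            report["review"].append(name)
    return report


def code_rollback_is_safe(report):
    """Old code keeps working only if nothing was contracted or left unreviewed."""
    return not report["contract"] and not report["review"]
```

A release whose report contains only `expand` entries can have its code rolled back without touching the schema. Anything under `contract` belongs in a later release, after the old code is gone.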

---

# 40.6 Incident Response: What to Do When Things Break

### 40.6.1 The goals in an incident
1. Restore service (mitigate impact)
2. Understand root cause
3. Prevent recurrence

During an incident, **service restoration** is the priority.

### 40.6.2 Severity levels (simple baseline)
- SEV-1: system down / major data leak / payment broken
- SEV-2: major feature degraded
- SEV-3: partial degradation / minor bug
- SEV-4: small issue, no user impact

Define who gets paged and how.

---

# 40.7 Incident Runbooks (Step-by-Step)

These are “copy/paste checklists” for on-call.

## 40.7.1 Runbook: DB Down (readyz failing)
Symptoms:
- `/readyz` returns 503
- web requests error (500) or time out
- Celery tasks failing with DB connection errors

Immediate steps:
1. Confirm DB status (provider dashboard / `psql` connect).
2. Check DB connection settings:
   - env vars present
   - network security groups/firewall
3. If using managed DB:
   - check maintenance/outage
   - failover status
4. Mitigation options:
   - restart app processes only if config changed (don’t restart blindly)
   - enable read-only mode if available (feature flag or maintenance mode)
5. Communicate status to stakeholders.

Post-incident:
- write postmortem
- add alert on DB connection errors
- consider connection pooling and retry strategies
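
For the retry strategy mentioned above, a common shape is exponential backoff with jitter around the operation that opens the connection. This is a generic sketch, not tied to Django or any driver: `retry_with_backoff` is a hypothetical helper, and the caller decides which exception types count as retryable.

```python
import random
import time


def retry_with_backoff(operation, retry_on, attempts=5, base_delay=0.5,
                       max_delay=8.0, sleep=time.sleep):
    """Call operation(); on a retryable error, wait and try again.

    The delay ceiling doubles on each attempt (capped at max_delay) and the
    actual wait is drawn at random below it, so many processes do not
    reconnect at the same instant after an outage.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

Keep the attempt count small for web requests. A request that retries for 30 seconds holds a worker the whole time, which can turn a short DB blip into a full outage.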

## 40.7.2 Runbook: Redis Down (Celery/Channels failures)
Symptoms:
- Celery tasks stuck (broker unreachable)
- WebSockets disconnected
- cache misses or timeouts

Steps:
1. Identify which Redis role is down:
   - broker?
   - channel layer?
   - cache?
2. If broker down:
   - web can still serve, but background tasks won’t run
   - disable features that depend on tasks if needed
3. If channels down:
   - realtime disabled; core app can still function
4. Restart Redis or fail over (managed Redis).
5. Check connection URLs and auth.

Post-incident:
- separate Redis instances for cache/broker/channels (blast radius reduction)
- add monitoring for Redis latency/errors
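
Step 1 (which role is down?) can be answered quickly with a TCP probe against each configured URL. The sketch below uses only the standard library. `probe_tcp` and `probe_roles` are hypothetical names, and the probe only confirms that the port accepts connections; it does not check authentication or send a Redis command.

```python
import socket
from urllib.parse import urlsplit


def probe_tcp(url, timeout=1.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(url)
    host = parts.hostname or "localhost"
    port = parts.port or 6379  # default Redis port
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def probe_roles(role_urls, timeout=1.0):
    """Map each role name (broker, channel layer, cache) to reachability."""
    return {role: probe_tcp(url, timeout) for role, url in role_urls.items()}
```

Called with something like `probe_roles({"broker": broker_url, "channels": channels_url, "cache": cache_url})`, the result shows at a glance whether one instance or all of them are unreachable.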

## 40.7.3 Runbook: Celery Worker Down
Symptoms:
- export jobs stuck in pending
- emails not sent
- webhook processing delayed

Steps:
1. Check worker service status:
   - systemd: `systemctl status myapp-celery-worker`
   - logs: `journalctl -u myapp-celery-worker -f`
2. Check broker connectivity.
3. Restart worker.
4. Confirm tasks are being consumed.

Post-incident:
- add alerts for queue depth or “no heartbeat”
- ensure tasks are idempotent so retries are safe
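
Idempotency usually comes down to recording a key for each unit of work and skipping work whose key is already recorded. The sketch below shows the idea with an in-memory set. `run_once` is a hypothetical helper, and in a real system the set would typically be a database table with a unique constraint, because this check-then-act version is not safe across multiple workers.

```python
def run_once(key, processed, action):
    """Run action() only if `key` has not been recorded as done.

    Returns True if the action ran, False if it was skipped.
    """
    if key in processed:
        return False
    action()
    # Record the key only after success, so a failed action is retried.
    processed.add(key)
    return True
```

With this shape, a task retried after a worker restart does not send the same email twice, because the second call returns `False` without running the action.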

## 40.7.4 Runbook: 5xx Spike / Latency Spike
Symptoms:
- error rate rises
- response times increase
- user reports timeouts

Steps:
1. Check dashboards/logs (request_id correlation).
2. Identify top failing endpoint/path.
3. Check DB slow queries (`pg_stat_statements` or logs).
4. Check if a deploy just happened:
   - if yes, consider rollback to previous release quickly.
5. Check resource exhaustion:
   - CPU/memory
   - DB connections
   - thread pool saturation
6. Mitigation:
   - rollback
   - disable expensive features via feature flags
   - scale web workers (temporary)
   - add caching / fix N+1 / add index (later)
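
Step 2 (find the top failing endpoint) is a small aggregation once request logs are structured. The sketch assumes each log record is a dict with `path` and `status` keys; that record shape and the function name are assumptions, so adapt them to your log format.

```python
from collections import Counter


def top_failing_paths(records, limit=5):
    """Return [(path, count)] for 5xx responses, most frequent first."""
    failures = Counter(
        record["path"]
        for record in records
        if 500 <= record["status"] <= 599
    )
    return failures.most_common(limit)
```

If one path dominates the result, the problem is probably in that view or its queries. If failures are spread evenly, look at shared dependencies (DB, Redis, the last deploy) first.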

## 40.7.5 Runbook: Migration Stuck / Locking
Symptoms:
- deploy step hangs on migrate
- app timeouts due to locked tables

Steps:
1. Identify the blocking query using `pg_stat_activity` and `pg_locks`, or `pg_blocking_pids()` (requires DB access).
2. Decide:
   - cancel migration
   - wait (if acceptable)
3. If safe:
   - cancel the blocking query with `pg_cancel_backend(pid)`; use `pg_terminate_backend(pid)` only if cancelling does not release the lock
4. In the future:
   - build indexes with `CREATE INDEX CONCURRENTLY` (in Django, `AddIndexConcurrently` in a migration with `atomic = False`)
   - expand/contract
   - schedule heavy migrations off-peak
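
For step 1, the query below pairs each blocked session with the sessions blocking it, using PostgreSQL's `pg_stat_activity` view and `pg_blocking_pids()` function. The Python wrapper is a sketch: `find_blockers` is a hypothetical helper that accepts any DB-API cursor connected to the affected database (for example, one from psycopg).

```python
BLOCKERS_SQL = """
SELECT blocked.pid               AS blocked_pid,
       blocking.pid              AS blocking_pid,
       blocking.state            AS blocking_state,
       left(blocking.query, 200) AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
ORDER BY blocked.pid
"""


def find_blockers(cursor):
    """Return one dict per (blocked session, blocking session) pair."""
    cursor.execute(BLOCKERS_SQL)
    columns = [column[0] for column in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]
```

A blocker in state `idle in transaction` is often a forgotten shell session or a stuck request holding a lock. That is the case where ending the session, rather than waiting, is usually the right call.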

---

# 40.8 Create a “Maintenance Mode” (Optional but Very Useful)

Many production apps include a toggle:
- return a friendly maintenance page for non-admin users
- allow admin access for verification

Implement via middleware that checks an env var:

```python
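import os

from django.http import HttpResponse

# Paths that stay reachable during maintenance: the admin (for verification)
# and the health checks (so the load balancer keeps the instance in rotation).
# The health-check exemption is an assumption; adjust to your own routes.
EXEMPT_PREFIXES = ("/admin/", "/healthz", "/readyz")


class MaintenanceModeMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # The env var is read on every request, but changing a process's
        # environment normally still requires a restart or redeploy.
        if os.environ.get("MAINTENANCE_MODE") == "true":
            if request.path.startswith(EXEMPT_PREFIXES):
                return self.get_response(request)
            return HttpResponse("Maintenance", status=503)
        return self.get_response(request)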
```

In an incident, this can reduce load while you recover.

---

# 40.9 Postmortems (How Professionals Improve)

A good postmortem is:
- blameless
- concrete
- action-oriented

Template:

1. Summary (what happened)
2. Impact (who was affected, for how long)
3. Timeline (key events)
4. Root cause (technical)
5. Contributing factors
6. What went well
7. What went poorly
8. Action items (owner + due date)

Even small teams benefit from this discipline.

---

# 40.10 Operations Docs You Should Produce Now (Deliverables)

Create `docs/ops/`:

```text
docs/ops/
  backups.md
  restore-drill.md
  rollback.md
  incidents.md
  runbook-db-down.md
  runbook-redis-down.md
  runbook-worker-down.md
  runbook-5xx-spike.md
  runbook-migration-lock.md
```

Each runbook should include:
- symptoms
- diagnosis steps
- mitigation steps
- escalation contacts
- post-incident actions
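
To avoid starting from blank files, the runbooks can be scaffolded with the five sections listed above. This is a small convenience sketch; `scaffold_runbooks` is a hypothetical helper, and it skips files that already exist so it never overwrites real content.

```python
from pathlib import Path

RUNBOOKS = [
    "runbook-db-down.md",
    "runbook-redis-down.md",
    "runbook-worker-down.md",
    "runbook-5xx-spike.md",
    "runbook-migration-lock.md",
]
SECTIONS = [
    "Symptoms",
    "Diagnosis steps",
    "Mitigation steps",
    "Escalation contacts",
    "Post-incident actions",
]


def scaffold_runbooks(ops_dir):
    """Create missing runbook files with empty sections; return created paths."""
    ops_dir = Path(ops_dir)
    ops_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for name in RUNBOOKS:
        path = ops_dir / name
        if path.exists():
            continue  # never overwrite a runbook someone has already written
        body = f"# {path.stem}\n\n" + "".join(f"## {s}\n\nTODO\n\n" for s in SECTIONS)
        path.write_text(body, encoding="utf-8")
        created.append(path)
    return created
```

Running `scaffold_runbooks("docs/ops")` once from the repository root gives every runbook the same skeleton, which makes gaps (for example, missing escalation contacts) easy to spot in review.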

---

## 40.11 Hands-On Lab: Run a Restore Drill (Even Locally)

1. Dump your DB (local PostgreSQL):
```bash
pg_dump --format=custom --file=backup.dump postgres://django:django@127.0.0.1:5432/django_app
```

2. Create a new DB:
```bash
createdb -h 127.0.0.1 -U django django_app_restore
```

3. Restore:
```bash
pg_restore --clean --if-exists --no-owner --dbname=postgres://django:django@127.0.0.1:5432/django_app_restore backup.dump
```

4. Point Django to restored DB (env var change) and run:
- `python manage.py migrate`
- `python manage.py check --deploy`
- request `/readyz` and confirm it passes

Document the steps in `docs/ops/restore-drill.md`.

---

## 40.12 Exercises (Do These Before Proceeding)

1. Write your backup policy:
   - frequency
   - retention
   - storage location
   - encryption at rest
2. Write your restore drill plan and schedule.
3. Write a rollback plan that explicitly states:
   - what you can roll back (code)
   - what you avoid rolling back (schema)
   - when you do forward-fix
4. Add a maintenance mode middleware and verify behavior with tests (optional).
5. Simulate an incident locally:
   - stop Postgres
   - confirm `/readyz` returns 503
   - confirm logs show DB down error
   - restart and verify recovery
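
The behaviour exercised in the incident simulation (a 503 from `/readyz` while a dependency is down) can be expressed as a small, framework-free function. This is a sketch of the aggregation logic only. `readiness` is a hypothetical helper, and in a Django view each check would be a callable such as one that runs `SELECT 1` through `django.db.connection`.

```python
def readiness(checks):
    """Run named dependency checks and return (http_status, results).

    `checks` maps a name to a zero-argument callable that raises on failure.
    """
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            # A failing dependency must not crash the probe itself; report
            # only the error type so no connection details leak in the body.
            results[name] = f"error: {type(exc).__name__}"
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results
```

Returning the per-check results alongside the status code means the same endpoint that the load balancer polls also tells the on-call engineer which dependency failed.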

---

## 40.13 Chapter Summary

- Operations is what keeps production alive: backups, restores, rollbacks, and
  incident response.
- Backups must be tested via restore drills.
- Rollbacks are mostly code; schema rollbacks are risky—use expand/contract.
- Runbooks turn chaos into repeatable procedure.
- Postmortems turn incidents into improvements.

---

Next chapter: **Part X — 41. Multi‑Tenancy Patterns (SaaS Architectures)**  
We’ll take your org-scoped tasks system and evolve it into robust multi-tenant
patterns, discuss tenant isolation choices (shared DB vs schema-per-tenant vs
DB-per-tenant), and implement safe tenant scoping across ORM, middleware, and APIs.

