From ada709801bacc4c9410ee91a77693c0d200c3d1f Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 18 Apr 2026 14:50:51 +0000 Subject: [PATCH] docs: migrate all wiki pages to mkdocs - Add docs/architecture.md (radix router, compiled DI, middleware chain, zero-copy JSON, uvloop internals) - Add docs/benchmark-methodology.md (benchmark types, hardware specs, direct ASGI protocol, CI workflow) - Add docs/concepts/sub-interpreters.md (PEP 684/734, concurrency API, SubInterpreterPool internals, examples) - Enhance docs/migration-from-fastapi.md with field aliases (msgspec rename) and migration checklist - Update mkdocs.yml nav to include all four new/updated pages Wiki pages are now redundant as all content lives in the mkdocs site. https://claude.ai/code/session_01SF7ikYNfez1eVEK9d473Ef --- docs/architecture.md | 316 +++++++++++++++++++++++++++++ docs/benchmark-methodology.md | 217 ++++++++++++++++++++ docs/concepts/sub-interpreters.md | 323 ++++++++++++++++++++++++++++++ docs/migration-from-fastapi.md | 29 +++ mkdocs.yml | 3 + 5 files changed, 888 insertions(+) create mode 100644 docs/architecture.md create mode 100644 docs/benchmark-methodology.md create mode 100644 docs/concepts/sub-interpreters.md diff --git a/docs/architecture.md b/docs/architecture.md new file mode 100644 index 0000000..e4df88a --- /dev/null +++ b/docs/architecture.md @@ -0,0 +1,316 @@ +# Architecture + +This page explains FasterAPI's internal architecture. Read this before contributing — it covers why each component exists and how they interact. + +--- + +## Request Lifecycle + +``` +Incoming ASGI Request + │ + ▼ +┌──────────────────────────┐ +│ Middleware Chain │ Built once at first request, cached. +│ (CORS → GZip → ...) │ Each middleware wraps the next as an ASGI app. +└────────────┬─────────────┘ + │ + ▼ +┌──────────────────────────┐ +│ Faster.__call__ │ ASGI entry point. Routes to HTTP, WebSocket, +│ │ or Lifespan handler based on scope["type"]. 
+└────────────┬─────────────┘ + │ (HTTP) + ▼ +┌──────────────────────────┐ +│ RadixRouter.resolve() │ O(k) path lookup (k = path segments). +│ │ Returns (handler, path_params, metadata). +└────────────┬─────────────┘ + │ + ▼ +┌──────────────────────────┐ +│ _resolve_handler() │ Iterates pre-compiled _ParamSpec tuples. +│ │ Injects dependencies, parses params. +│ │ Zero per-request introspection. +└────────────┬─────────────┘ + │ + ▼ +┌──────────────────────────┐ +│ Handler executes │ async def → event loop (uvloop) +│ │ plain def → process pool (CPU auto-detect) +└────────────┬─────────────┘ + │ + ▼ +┌──────────────────────────┐ +│ _send_response() │ dict/Struct → msgspec.json.encode → bytes +│ │ Zero-copy: Rust encodes directly to bytes. +│ │ Pre-encoded headers avoid repeated .encode(). +└────────────┬─────────────┘ + │ + ▼ + Response sent + │ + ▼ (if any) + BackgroundTasks.run() +``` + +--- + +## 1. Radix Tree Router + +**File:** `FasterAPI/router.py` + +### Why not regex? + +FastAPI (via Starlette) compiles each route path into a regex pattern and checks them sequentially on every request — O(n) where n = total routes. This is fine for 10 routes, but at 100+ routes the linear scan becomes measurable overhead. 
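To make the linear-scan cost concrete, here is an illustrative sketch of sequential regex dispatch — not Starlette's actual matcher, and the route names are made up:

```python
import re

# Illustrative only: 100 made-up routes, each compiled to a regex,
# checked one by one — O(n) in the route count.
routes = [re.compile(rf"^/resource{i}/(?P<id>[^/]+)$") for i in range(100)]

def resolve(path: str):
    for index, pattern in enumerate(routes):  # linear scan
        match = pattern.match(path)
        if match:
            return index, match.groupdict()
    return None

# Matching the last registered route touches all 100 patterns first.
print(resolve("/resource99/abc"))  # → (99, {'id': 'abc'})
```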
+ +### How the radix tree works + +Routes are decomposed into segments and inserted into a tree at startup: + +``` +Registered routes: + GET /users + GET /users/{id} + GET /users/{id}/posts + GET /health + GET /orgs/{org_id}/teams/{team_id} + +Tree structure: + + root + / \ + users health → handler + | + [leaf] → handler (GET /users) + | + {id} → handler (GET /users/{id}) + | + posts → handler (GET /users/{id}/posts) + + orgs + | + {org_id} + | + teams + | + {team_id} → handler +``` + +### Resolution algorithm + +The `_walk` method uses **iterative traversal** (not recursion) for the common case: + +```python +while idx < n: + seg = segments[idx] + child = node.children.get(seg) # Try static match first + if child is not None: + node = child; idx += 1; continue + param_child = node.children.get("*") # Then try param wildcard + if param_child is not None: + params[param_child.param_name] = seg + node = param_child; idx += 1; continue + return None # No match +``` + +**Key design choices:** + +- Static children are checked before param wildcards (most routes are static segments) +- `__slots__` on `RadixNode` eliminates per-instance `__dict__` — less memory, faster attribute access +- Path splitting uses a list comprehension (`[s for s in path.split("/") if s]`) — this hits CPython's fast C path + +### Complexity + +| Operation | Radix Tree | Regex (Starlette) | +|---|---|---| +| Lookup | O(k) where k = path segments | O(n) where n = total routes | +| 100 routes | ~3 segment checks | ~50 regex evaluations (avg) | + +--- + +## 2. Compiled Dependency Injection + +**File:** `FasterAPI/dependencies.py` + +### The problem + +FastAPI calls `inspect.signature()` and `typing.get_type_hints()` on **every request** to figure out what a handler needs. These are expensive reflection operations. 
+ +### The solution: compile once, resolve many + +At route registration time, `compile_handler(func)` introspects the handler once and produces a tuple of `_ParamSpec` objects: + +``` +Route registration (startup): + @app.get("/users/{id}") + async def get_user(id: str = Path(), q: str = Query(None)): + ... + + compile_handler(get_user) is called immediately. + Returns: ( + _ParamSpec(name="id", kind=_KIND_PATH, ...), + _ParamSpec(name="q", kind=_KIND_QUERY, ...), + ), is_async=True +``` + +At request time, `_resolve_from_specs` iterates the pre-compiled tuple with integer kind comparisons — no reflection, no isinstance chains: + +``` +Request time (hot path): + for spec in specs: + if spec.kind == _KIND_PATH: kwargs[spec.name] = path_params[spec.name] + elif spec.kind == _KIND_QUERY: kwargs[spec.name] = request.query_params.get(...) + ... +``` + +### _ParamSpec design + +```python +class _ParamSpec: + __slots__ = ("name", "kind", "annotation", "default", "marker") +``` + +- `kind` is an integer constant (0–11), not an enum — integer comparison is faster than `isinstance` +- `__slots__` avoids `__dict__` overhead +- `@lru_cache(maxsize=512)` on `compile_handler` means the same function is never introspected twice +- Dependencies (`Depends(...)`) are compiled recursively — the entire dependency tree is pre-resolved + +--- + +## 3. Request Object — Lazy Parsing + +**File:** `FasterAPI/request.py` + +Most handlers only need 1-2 request attributes (e.g., path params and body). Parsing all headers, query params, and cookies on every request wastes time. 
+ +FasterAPI's `Request` uses lazy properties: + +```python +@property +def headers(self) -> dict[str, str]: + h = self._headers # Check cache + if h is None: # First access → parse + raw = self._scope.get("headers", []) + h = {k.decode("latin-1").lower(): v.decode("latin-1") for k, v in raw} + self._headers = h # Cache for subsequent access + return h +``` + +The same pattern applies to `query_params`, `cookies`, and `body`. If a handler never accesses `request.cookies`, they're never parsed. + +--- + +## 4. Middleware Chain + +**File:** `FasterAPI/middleware.py` + +### How the chain is built + +Middleware is registered via `app.add_middleware(CORSMiddleware, allow_origins=["*"])` and stored as `(class, kwargs)` pairs. + +On the first request, the chain is built **once** by wrapping the core app in reverse order: + +``` +Registration order: [CORS, GZip, TrustedHost] +Build order (reversed): TrustedHost(GZip(CORS(app))) + +Request flow: + → TrustedHost.__call__ (checks Host header) + → GZip.__call__ (buffers response for compression) + → CORS.__call__ (injects CORS headers) + → app._asgi_app (route dispatch) +``` + +The built chain is cached in `self._middleware_app`. Adding middleware after the first request invalidates the cache (sets it to `None`). + +### ASGI middleware pattern + +Each middleware is a valid ASGI app that wraps another ASGI app: + +```python +class CORSMiddleware(BaseHTTPMiddleware): + def __init__(self, app, **kwargs): + self.app = app # The next app in the chain + + async def __call__(self, scope, receive, send): + if scope["type"] != "http": + await self.app(scope, receive, send) # Pass through non-HTTP + return + await self.dispatch(scope, receive, send) +``` + +--- + +## 5. Response Path — Zero-Copy JSON + +**File:** `FasterAPI/app.py` (module-level `_send_response`) + +When a handler returns a dict or msgspec Struct, the response path is: + +``` +dict → msgspec.json.encode(dict) → bytes → ASGI send + +One allocation. 
msgspec's Rust core converts Python objects directly +to JSON bytes without an intermediate string step. +``` + +Compare to the standard approach: + +``` +dict → json.dumps(dict) → str → str.encode("utf-8") → bytes → send + +Three allocations. Each creates a new Python object the GC must track. +``` + +Additionally, common header values are pre-encoded as module-level bytes constants: + +```python +_CT_JSON = b"application/json" +_HEADER_CT = b"content-type" +``` + +This avoids calling `.encode()` on every response. + +--- + +## 6. Event Loop — uvloop + +**File:** `FasterAPI/concurrency.py` + +uvloop replaces Python's default asyncio event loop with one backed by libuv (the same C library that powers Node.js). It handles I/O polling, callback scheduling, and timer management in C instead of Python. + +```python +def install_event_loop() -> str: + try: + import uvloop + if _PY312_PLUS: + asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) + else: + uvloop.install() + return "uvloop" + except ImportError: + return "asyncio" +``` + +This is called at module import time (`_event_loop = install_event_loop()`) so it's set before any async code runs. 
+ +--- + +## File Map + +| File | Responsibility | +|---|---| +| `app.py` | ASGI entry point, route registration, HTTP/WS/lifespan dispatch | +| `router.py` | Radix tree router + FasterRouter (sub-router/blueprint) | +| `dependencies.py` | Compiled DI, `Depends()`, param resolution | +| `request.py` | Lazy-parsed Request object | +| `response.py` | Response classes (JSON, HTML, Streaming, File) | +| `middleware.py` | CORS, GZip, TrustedHost, HTTPS redirect | +| `concurrency.py` | uvloop, sub-interpreters, thread/process pools | +| `exceptions.py` | HTTPException, validation errors, default handlers | +| `params.py` | Path, Query, Body, Header, Cookie, File, Form descriptors | +| `background.py` | BackgroundTasks (post-response execution) | +| `websocket.py` | WebSocket connection handler | +| `datastructures.py` | UploadFile, FormData | +| `openapi/` | Auto-generated OpenAPI 3.0 schema + Swagger/ReDoc UI | diff --git a/docs/benchmark-methodology.md b/docs/benchmark-methodology.md new file mode 100644 index 0000000..c26a52a --- /dev/null +++ b/docs/benchmark-methodology.md @@ -0,0 +1,217 @@ +# Benchmark Methodology + +This page explains how FasterAPI's performance claims are measured, what hardware was used, and how to reproduce the results yourself. + +--- + +## Benchmark Types + +We run three categories of benchmarks: + +| Category | What it measures | How | +|---|---|---| +| **Component** | Individual operations (routing, JSON encode/decode) | Tight loop, `time.perf_counter()` | +| **Framework (Direct ASGI)** | Full request cycle without network overhead | Synthetic ASGI scope → `app(scope, receive, send)` | +| **End-to-End HTTP** | Real HTTP performance including server + network | `httpx.AsyncClient` against a live uvicorn server | + +The README numbers come from **Framework (Direct ASGI)** benchmarks — these isolate the framework's actual performance without conflating uvicorn's overhead. 
+ +--- + +## Hardware & Environment + +### README Baseline (Python 3.13.7) + +``` +Machine: Apple Silicon (M-series) +OS: macOS +Python: 3.13.7 +uvloop: 0.21.x +msgspec: 0.19.x +FastAPI: 0.115.x (comparison target) +Pydantic: 2.10.x +``` + +### CI Benchmark Runner + +``` +Machine: GitHub Actions ubuntu-latest (2-core x86_64) +OS: Ubuntu 22.04 +Python: 3.13 +``` + +!!! note + CI runners are significantly slower than local Apple Silicon. The CI benchmark workflow compares **speedup ratios** (FasterAPI/FastAPI), not raw req/s, to account for hardware differences. + +--- + +## Direct ASGI Benchmark (Primary) + +This is the main benchmark used for the README results. It bypasses the network layer entirely. + +### What it does + +1. Creates both a FasterAPI and FastAPI app with identical routes +2. Constructs synthetic ASGI `scope`, `receive`, `send` functions +3. Calls `await app(scope, receive, send)` in a tight loop +4. Measures throughput in requests/second + +### Routes tested + +| Endpoint | Method | Purpose | +|---|---|---| +| `/health` | GET | Minimal handler — measures framework dispatch overhead | +| `/users/{id}` | GET | Path parameter extraction + JSON response | +| `/users` | POST | JSON body parsing + validation + response | + +### Protocol + +``` +1. Warm-up phase: 500 requests (not timed) +2. Measured phase: 50,000 requests +3. Timing: time.perf_counter() around the measured phase +4. Result: requests / elapsed_seconds +``` + +Both frameworks are benchmarked in the same process, same event loop, same Python version. This eliminates environmental variance. + +### Code + +The benchmark is in `benchmarks/compare.py`. 
Run with `--direct` flag: + +```bash +python benchmarks/compare.py --direct +``` + +--- + +## Component Benchmarks + +### Routing (Radix Tree vs Regex) + +``` +Setup: + - 100 routes registered (50 static, 30 single-param, 20 multi-param) + - 3 representative lookup paths tested + - 500,000 iterations × 3 paths = 1,500,000 total lookups + +Measured: ops/second for each router implementation +``` + +### JSON Encoding (msgspec vs json.dumps) + +``` +Setup: + - Dict payload: {"id": 42, "name": "test", "email": "t@t.com", "scores": [1,2,3]} + - 1,000,000 iterations + +Measured: encode ops/second +``` + +### JSON Decode + Validate (msgspec vs Pydantic v2) + +``` +Setup: + - Same payload as encoding, as raw bytes + - Decoded into a typed Struct/BaseModel + - 1,000,000 iterations + +Measured: decode+validate ops/second +``` + +--- + +## How to Reproduce + +### Prerequisites + +```bash +cd FasterAPI +python -m venv .venv +source .venv/bin/activate +pip install -e ".[dev,benchmark]" +``` + +### Run all benchmarks + +```bash +python benchmarks/compare.py +``` + +### Run only the direct ASGI benchmark (fastest, most accurate) + +```bash +python benchmarks/compare.py --direct +``` + +### Run individual component benchmarks + +```python +import time +import msgspec +import json + +data = {"id": 42, "name": "test", "scores": [1, 2, 3]} +N = 1_000_000 + +# msgspec +start = time.perf_counter() +for _ in range(N): + msgspec.json.encode(data) +msgspec_rps = N / (time.perf_counter() - start) + +# stdlib json +start = time.perf_counter() +for _ in range(N): + json.dumps(data).encode() +json_rps = N / (time.perf_counter() - start) + +print(f"msgspec: {msgspec_rps:,.0f} ops/s") +print(f"json: {json_rps:,.0f} ops/s") +print(f"speedup: {msgspec_rps/json_rps:.1f}x") +``` + +--- + +## CI Benchmark Workflow + +Every PR to `stage` or `master` triggers an automated benchmark that: + +1. Runs the direct ASGI benchmark (50,000 requests per endpoint) +2. 
Runs the routing benchmark (1.5M lookups) +3. Posts a comment on the PR with results + +### How to read the PR comment + +``` +| Endpoint | FasterAPI | FastAPI | Speedup | vs Baseline | +|------------------|----------------|---------------|---------|-------------| +| GET /health | 150,000/s | 22,000/s | 6.82x | ⚪ -0.4% | +| GET /users/{id} | 128,000/s | 15,000/s | 8.53x | ⚪ -2.3% | +| POST /users | 95,000/s | 13,000/s | 7.31x | 🟢 +2.2% | +``` + +- **Speedup** = FasterAPI req/s ÷ FastAPI req/s +- **vs Baseline** compares the speedup ratio against the README baseline +- 🟢 = speedup improved by >2% +- ⚪ = within noise (±5%) +- 🔴 = speedup regressed by >5% — needs investigation + +!!! note + Raw req/s on CI runners will be 2-3x lower than local Apple Silicon. This is expected. The **speedup ratio** is hardware-independent and is what matters. + +--- + +## Fairness & Methodology Notes + +1. **Same process, same loop** — Both frameworks run in the same Python process and event loop. No one gets a "warm" advantage. + +2. **Warm-up phase** — 500 requests are run before timing starts to ensure JIT-like optimizations (e.g., `__pycache__`, `lru_cache` warming) are accounted for. + +3. **Identical routes** — Both apps define the exact same endpoints with equivalent handler logic. The FasterAPI handler uses `msgspec.Struct`, FastAPI uses `pydantic.BaseModel`. + +4. **No GC interference** — The benchmark runs long enough (50K requests) that GC pauses are amortized and don't skew results. + +5. **Deterministic input** — The same request payload is used for every iteration. No randomness that could cause branch prediction differences. + +6. **Open source** — All benchmark code is in `benchmarks/compare.py`. Run it yourself. 
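The Speedup and vs-Baseline columns in the PR comment reduce to simple ratios. A sketch of the arithmetic — the numbers below are hypothetical, not actual benchmark results:

```python
def speedup(faster_rps: float, fastapi_rps: float) -> float:
    return faster_rps / fastapi_rps

def vs_baseline(current: float, baseline: float) -> float:
    # Percent change of the speedup *ratio* — hardware-independent,
    # unlike raw req/s.
    return (current - baseline) / baseline * 100

current = speedup(150_000, 22_000)   # hypothetical CI result: ~6.82x
delta = vs_baseline(current, 6.85)   # hypothetical README baseline ratio
print(f"{current:.2f}x, {delta:+.1f}% vs baseline")
```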
diff --git a/docs/concepts/sub-interpreters.md b/docs/concepts/sub-interpreters.md new file mode 100644 index 0000000..e41f5e9 --- /dev/null +++ b/docs/concepts/sub-interpreters.md @@ -0,0 +1,323 @@ +# Sub-Interpreters Guide + +This page is a deep dive into Python's sub-interpreter support (PEP 684 / PEP 734) and how FasterAPI uses it for CPU-bound parallelism. + +--- + +## The GIL Problem + +Python's Global Interpreter Lock (GIL) prevents true parallel execution of Python bytecode across threads in the same process. For I/O-bound work this doesn't matter (threads release the GIL during I/O), but for CPU-bound work it's a hard ceiling: + +``` +Thread 1: [compute]---[wait for GIL]---[compute]---[wait for GIL] +Thread 2: [wait for GIL]---[compute]---[wait for GIL]---[compute] + ↑ + Only one thread runs Python at a time +``` + +Before Python 3.13, the only way to achieve CPU parallelism in Python was `multiprocessing` / `ProcessPoolExecutor`: + +``` +Process 1 (own GIL): [compute]──────────────────────── +Process 2 (own GIL): [compute]──────────────────────── + ↑ + True parallel, but expensive: + - Fork/spawn overhead (~100ms) + - Memory duplication + - Arguments must be picklable + - No shared state +``` + +--- + +## PEP 684 & PEP 734: Per-Interpreter GIL + +### PEP 684 (Python 3.12) — The Foundation + +Added per-interpreter GIL support at the C level. Each sub-interpreter can optionally have its own GIL, meaning Python bytecode in different interpreters runs truly in parallel. + +**Python 3.12 status:** Available via the C API only. No Python-level access. + +### PEP 734 (Python 3.13) — The Python API + +Exposed sub-interpreters via the `interpreters` stdlib module, making them accessible from pure Python. + +```python +import interpreters + +interp = interpreters.create() +interp.call(some_function, arg1, arg2) +interp.close() +``` + +**Python 3.13 status:** Experimental. The `interpreters` module may not be available in all builds. 
FasterAPI detects this at import time and falls back gracefully. + +--- + +## How Sub-Interpreters Compare + +``` + ┌──────────────┬─────────────┬──────────────────┐ + │ Threads │ Processes │ Sub-Interpreters │ +├───────────────────────┼──────────────┼─────────────┼──────────────────┤ +│ True CPU parallelism │ No │ Yes │ Yes │ +│ Startup cost │ ~0.1ms │ ~100ms │ ~1ms │ +│ Memory overhead │ Low │ High │ Medium │ +│ Shared state │ Yes │ No │ No │ +│ Argument passing │ Direct │ Pickle │ Shareable* │ +│ GIL │ Shared │ Per-proc │ Per-interp │ +│ Best for │ I/O-bound │ CPU-bound │ CPU-bound │ +└───────────────────────┴──────────────┴─────────────┴──────────────────┘ + +* "Shareable" means types that implement the buffer protocol (bytes, + memoryview, some numeric types). Complex objects need serialization. +``` + +Sub-interpreters are ~100x lighter than processes. They're the closest Python analog to Go goroutines: lightweight, parallel, share-nothing by default. + +--- + +## FasterAPI's Concurrency API + +FasterAPI provides three concurrency primitives. 
Use the right one for your workload: + +### `run_in_subinterpreter(func, *args)` — CPU-bound work + +```python +from FasterAPI.concurrency import run_in_subinterpreter + +async def compute_hash(data: bytes) -> str: + return await run_in_subinterpreter(hashlib.sha256, data) +``` + +**What happens under the hood:** + +| Python Version | Backend | Behavior | +|---|---|---| +| 3.13+ (with `interpreters`) | Sub-interpreter pool | Own GIL, true parallelism, no pickling | +| 3.13+ (without `interpreters`) | ProcessPoolExecutor | Separate process, pickle-based | +| 3.10–3.12 | ProcessPoolExecutor | Separate process, pickle-based | + +**When to use:** + +- CPU-intensive computation (hashing, image processing, data crunching) +- Work that would block the event loop for >1ms +- Compute tasks that don't need shared mutable state + +**When NOT to use:** + +- I/O-bound work (database queries, HTTP calls) — use `async`/`await` instead +- Tasks that need access to the main interpreter's global state +- Very short computations (<0.1ms) — the dispatch overhead isn't worth it + +### `run_in_threadpool(func, *args)` — Blocking I/O + +```python +from FasterAPI.concurrency import run_in_threadpool + +async def read_legacy_file(path: str) -> bytes: + return await run_in_threadpool(open(path, "rb").read) +``` + +**When to use:** + +- Calling synchronous libraries that do I/O (file reads, legacy database drivers) +- Wrapping blocking SDK calls that don't have async versions +- Any blocking operation that would freeze the event loop + +**When NOT to use:** + +- CPU-bound work — threads share the GIL, so you get zero parallelism +- Operations that already have async versions — just `await` them directly + +### `run_in_executor(func, *args)` — Process pool + +```python +from FasterAPI.concurrency import run_in_executor + +async def heavy_computation(n: int) -> int: + return await run_in_executor(sum, range(n)) +``` + +**When to use:** + +- CPU-bound work on Python < 3.13 +- Functions with 
arguments that are picklable +- When you explicitly want process isolation + +**When NOT to use:** + +- Arguments that can't be pickled (open file handles, database connections, lambdas) +- On Python 3.13+ — prefer `run_in_subinterpreter` instead + +--- + +## Decision Flowchart + +``` +Is the work CPU-bound or I/O-bound? +│ +├── I/O-bound +│ │ +│ ├── Has async API? → Just use await +│ └── No async API? → run_in_threadpool() +│ +└── CPU-bound + │ + ├── Python 3.13+ with interpreters module? + │ └── run_in_subinterpreter() ← best option + │ + ├── Arguments picklable? + │ └── run_in_subinterpreter() ← falls back to ProcessPool + │ + └── Arguments not picklable? + └── Restructure to pass serializable data, + or run_in_threadpool() if parallelism isn't critical +``` + +--- + +## SubInterpreterPool Internals + +### Pool initialization + +```python +pool = SubInterpreterPool(max_workers=4) +``` + +On first `.run()` call (lazy init): + +- Creates `max_workers` sub-interpreters via `interpreters.create()` +- Creates a `ThreadPoolExecutor` with the same number of workers +- Creates an `asyncio.Semaphore(max_workers)` to limit concurrency + +### How a task executes + +``` +1. await pool.run(func, *args) +2. Acquire semaphore (limits concurrent sub-interpreter tasks) +3. Select interpreter via round-robin: id(current_task) % pool_size +4. Submit to ThreadPoolExecutor: interp.call(func, *args) +5. Thread runs func in the sub-interpreter (with its own GIL) +6. 
Result returned to the awaiting coroutine +``` + +``` +Main event loop thread Thread pool threads +───────────────────── ───────────────────── +await pool.run(f, x) + │ + ├─ acquire semaphore + ├─ submit to executor ──────→ Thread 1: interp_0.call(f, x) + │ (runs with interp_0's GIL) + │ (event loop continues │ + │ serving other requests) │ + │ ▼ + ├─ result ready ◄──────────── return value + └─ return result +``` + +### Fallback pool (no `interpreters` module) + +When the `interpreters` module isn't available, `SubInterpreterPool` is replaced with a drop-in that uses `ProcessPoolExecutor`: + +```python +class SubInterpreterPool: # fallback + def __init__(self, max_workers=None): + self._executor = ProcessPoolExecutor(max_workers=max_workers or cpu_count) + + async def run(self, func, *args): + loop = asyncio.get_running_loop() + return await loop.run_in_executor(self._executor, partial(func, *args)) +``` + +Same API, different backend. Your code doesn't change. + +--- + +## Examples + +### Image Thumbnail Generation + +```python +from PIL import Image +import io +from FasterAPI.concurrency import run_in_subinterpreter + +def generate_thumbnail(image_bytes: bytes, size: tuple[int, int]) -> bytes: + img = Image.open(io.BytesIO(image_bytes)) + img.thumbnail(size) + buf = io.BytesIO() + img.save(buf, format="JPEG") + return buf.getvalue() + +@app.post("/upload") +async def upload_image(file: UploadFile): + data = await file.read() + thumb = await run_in_subinterpreter(generate_thumbnail, data, (128, 128)) + return Response(content=thumb, media_type="image/jpeg") +``` + +### CPU-Bound Data Processing + +```python +import hashlib +from FasterAPI.concurrency import run_in_subinterpreter + +def compute_proof_of_work(data: bytes, difficulty: int) -> tuple[int, str]: + nonce = 0 + target = "0" * difficulty + while True: + attempt = hashlib.sha256(data + nonce.to_bytes(8, "big")).hexdigest() + if attempt.startswith(target): + return nonce, attempt + nonce += 1 + 
+@app.post("/mine") +async def mine_block(data: bytes): + nonce, hash_val = await run_in_subinterpreter(compute_proof_of_work, data, 4) + return {"nonce": nonce, "hash": hash_val} +``` + +### Mixed I/O and CPU + +```python +@app.get("/report/{id}") +async def generate_report(id: str): + # I/O-bound: fetch from database (runs on event loop) + raw_data = await db.fetch(f"SELECT * FROM reports WHERE id = $1", id) + + # CPU-bound: crunch the numbers (runs in sub-interpreter) + summary = await run_in_subinterpreter(analyze_data, raw_data) + + # I/O-bound: cache the result (runs on event loop) + await cache.set(f"report:{id}", summary, ttl=3600) + + return summary +``` + +--- + +## Limitations & Caveats + +1. **Module state is NOT shared.** Each sub-interpreter has its own import state. Global variables in one interpreter are invisible to others. + +2. **Not all types are shareable.** Complex objects (class instances, open connections, file handles) can't be passed directly. Pass bytes, numbers, or strings. + +3. **C extensions must be compatible.** Some C extensions aren't safe to use in sub-interpreters. If a function crashes in a sub-interpreter, try `run_in_executor` instead. + +4. **The `interpreters` module is experimental** in Python 3.13. It may change or be absent in some builds. FasterAPI always has a working fallback. + +5. **Pool sizing matters.** Default is `os.cpu_count()`. For mixed workloads, you may want fewer sub-interpreter workers to leave cores free for the event loop. 
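Limitation 2 in practice means serializing at the boundary yourself — pass bytes in, get bytes out. A sketch using stdlib `json` and a hypothetical `analyze` function:

```python
import json

def analyze(payload: bytes) -> bytes:
    # Runs inside the sub-interpreter: decode, compute, re-encode.
    rows = json.loads(payload)
    summary = {"count": len(rows), "total": sum(r["value"] for r in rows)}
    return json.dumps(summary).encode()

# In the main interpreter: only bytes cross the boundary.
payload = json.dumps([{"value": 1}, {"value": 2}]).encode()
print(json.loads(analyze(payload)))  # → {'count': 2, 'total': 3}
```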
+ +--- + +## Version Compatibility Matrix + +| Python | Sub-interpreters | uvloop | Best CPU strategy | +|---|---|---|---| +| 3.10 | No | Yes | ProcessPoolExecutor | +| 3.11 | No | Yes | ProcessPoolExecutor | +| 3.12 | C API only | Yes (via policy) | ProcessPoolExecutor | +| 3.13 | Experimental | Yes (via policy) | SubInterpreterPool (if available) | +| 3.14+ | Expected stable | TBD | SubInterpreterPool | diff --git a/docs/migration-from-fastapi.md b/docs/migration-from-fastapi.md index b0381f3..5f8628a 100644 --- a/docs/migration-from-fastapi.md +++ b/docs/migration-from-fastapi.md @@ -74,6 +74,21 @@ class Product(msgspec.Struct): price: Price ``` +### Field aliases + +```python +# Before (Pydantic) +class Item(BaseModel): + item_name: str = Field(alias="itemName") + +# After (msgspec) — use the rename class option +class Item(msgspec.Struct, rename="camel"): + item_name: str + # Automatically serializes/deserializes as "itemName" +``` + +msgspec supports these rename strategies: `"camel"`, `"pascal"`, `"kebab"`, `"lower"`, `"upper"`. + ## 2. Routing and decorators `@app.get`, `@app.post`, `@app.put`, `@app.delete`, `@app.patch`, `@app.websocket`, @@ -217,6 +232,20 @@ async def get_user(id: int) -> UserPublic: Any code that imported directly from `starlette.*` needs updating to equivalent FasterAPI imports or plain Python/httpx alternatives. +## Migration checklist + +- [ ] Replace `fastapi` imports with `FasterAPI` imports +- [ ] Replace `FastAPI()` with `Faster()` +- [ ] Replace `APIRouter` with `FasterRouter` +- [ ] Convert `BaseModel` subclasses to `msgspec.Struct` +- [ ] Convert Pydantic validators to `__post_init__` +- [ ] Replace `model_dump()` with `msgspec.structs.asdict()` +- [ ] Replace `model_dump_json()` with `msgspec.json.encode()` +- [ ] Update custom middleware to ASGI-level dispatch +- [ ] Convert yield-based dependencies to return-based +- [ ] Test all endpoints +- [ ] Run benchmarks to verify speedup + ## Suggested migration order 1. 
Add **FasterAPI** beside FastAPI in a branch; swap the app factory. diff --git a/mkdocs.yml b/mkdocs.yml index 574efbf..d2c6934 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -88,10 +88,13 @@ nav: - Async / Await: concepts/async-await.md - Concurrency & Parallelism: concepts/concurrency.md - Python Type Hints: concepts/types-intro.md + - Sub-Interpreters Guide: concepts/sub-interpreters.md - How-To Recipes: how-to/index.md - Reference: - API Reference: api-reference.md + - Architecture: architecture.md - Benchmarks: benchmarks.md + - Benchmark Methodology: benchmark-methodology.md - Migrating from FastAPI: migration-from-fastapi.md - Python 3.13 & Compatibility: python-313.md - Changelog: changelog.md