A FastAPI app that benchmarks the overhead and output of three Python profilers: cProfile (stdlib), py-spy (sampling), and scalene (line-level CPU + memory).
Built with Python 3.13 and managed with uv.
```
├── main.py               # FastAPI app with CPU / memory / I/O endpoints
├── workloads.py          # Pure workload functions (fibonacci, primes, matrix, memory, I/O)
├── benchmark.py          # Orchestrator: measures latency overhead per profiler
├── profile_workloads.py  # Runs workloads directly (no server) — used for clean profiling
├── profiler.py           # Reusable @cprofile_fn and @scalene_fn decorators
├── pyproject.toml        # Dependencies
├── uv.lock               # Locked dependency versions
├── .python-version       # Pins Python 3.13
└── profiles/             # Generated output (gitignored)
    ├── baseline.json
    ├── cprofile_stats.txt
    ├── py-spy.json / pyspy.svg
    └── scalene.json
```
Install dependencies with:

```
uv sync
```

| Endpoint | Workload |
|---|---|
| `GET /health` | Health check |
| `GET /cpu/fibonacci?n=28` | Recursive Fibonacci — stresses function-call dispatch |
| `GET /cpu/primes?limit=50000` | Sieve of Eratosthenes — stresses integer loops |
| `GET /cpu/matrix?size=40` | Naive O(n³) matrix multiply — stresses nested loops |
| `GET /memory?num_lists=6&list_size=10000` | Large list alloc + sort + reduce |
| `GET /io?count=40` | Repeated file write + read |
| `GET /mixed` | All of the above combined |
| `GET /profiler/stats` | Live cProfile dump (only with `--with-cprofile`) |
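The CPU endpoints wrap deliberately naive implementations. For instance, the fibonacci workload is the classic doubly recursive version (a sketch; `workloads.py` is authoritative):

```python
def fibonacci(n: int) -> int:
    # Doubly recursive on purpose: fibonacci(n) makes ~2*F(n+1)-1 calls,
    # which is exactly what stresses the interpreter's call dispatch
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
```

At n=28 that is 1,028,457 calls, which is why cProfile's per-call hook hurts most on this endpoint.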
The benchmark starts the server as a subprocess, hits every endpoint, and measures
latency. Results are saved to profiles/ as JSON for cross-session comparison.
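The measurement loop might look roughly like this (a sketch, not `benchmark.py` verbatim): warmup calls are discarded, then each timed call is measured with `time.perf_counter`.

```python
import statistics
import time

def time_calls(fn, warmup: int = 5, requests: int = 20) -> dict:
    """Time repeated calls to `fn` and return latency stats in ms.

    In the real benchmark, `fn` would be a closure that GETs one endpoint.
    """
    for _ in range(warmup):
        fn()  # warmup calls are not recorded
    samples = []
    for _ in range(requests):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000)  # ms
    return {"p50_ms": statistics.median(samples),
            "mean_ms": statistics.fmean(samples)}

# Stand-in for an HTTP request while sketching
result = time_calls(lambda: sum(range(10_000)), warmup=2, requests=5)
```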
```
uv run python benchmark.py --mode baseline
# saves profiles/baseline.json

uv run python benchmark.py --mode cprofile
# starts server with --with-cprofile, compares against saved baseline
```

```
# Terminal A — server (no profiler)
uv run python main.py

# Terminal B — attach py-spy while traffic runs
py-spy record -o profiles/pyspy.svg --pid $(pgrep -f 'main.py') --duration 90
# macOS may need: sudo py-spy record …

# Terminal C — benchmark and compare
uv run python benchmark.py --mode external
```

```
# Terminal A — server under scalene
uv run python -m scalene run -o profiles/scalene.json -- main.py

# Terminal B — benchmark and compare
uv run python benchmark.py --mode external
```

Usage:

```
uv run python benchmark.py
  --mode      baseline | cprofile | external | all   (default: all)
  --requests  N   timed requests per endpoint        (default: 20)
  --warmup    N   warmup requests per endpoint       (default: 5)
```
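That flag surface could be declared with `argparse` roughly as follows (a sketch; `benchmark.py`'s actual parser may differ):

```python
import argparse

parser = argparse.ArgumentParser(description="Profiler overhead benchmark")
parser.add_argument("--mode", choices=["baseline", "cprofile", "external", "all"],
                    default="all", help="which benchmark pass to run")
parser.add_argument("--requests", type=int, default=20,
                    help="timed requests per endpoint")
parser.add_argument("--warmup", type=int, default=5,
                    help="warmup requests per endpoint")

args = parser.parse_args([])  # no CLI args: falls back to the defaults
```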
```
# Generate scalene report on workloads directly
uv run python -m scalene run -o profiles/scalene_workloads.json profile_workloads.py

# View in browser
uv run python -m scalene view profiles/scalene_workloads.json

# View as standalone HTML file
uv run python -m scalene view profiles/scalene_workloads.json --standalone

# View in terminal (only active lines)
uv run python -m scalene view profiles/scalene_workloads.json --cli --reduced
```

```
# cProfile in one command
uv run python -m cProfile -s cumulative profile_workloads.py | head -30
```

Copy profiler.py to your project and decorate the functions you care about:
```python
from profiler import cprofile_fn, scalene_fn

# cProfile: call counts + cumulative time per function
# → saves report to profiles/compute_nodes.txt automatically
@cprofile_fn(top=15, save_to="profiles/compute_nodes.txt")
def compute_nodes(graph):
    ...

# scalene: line-level CPU% + memory — active only under `scalene run`
@scalene_fn
def build_edge_matrix(nodes):
    ...
```

Then:
```
# cProfile (works immediately)
python your_script.py

# scalene (line-level detail)
python -m scalene run -o profiles/out.json your_script.py
python -m scalene view profiles/out.json --cli --reduced
```

Sample cProfile output:

```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1028457/1    0.099    0.000    0.099    0.099 workloads.py:17(fibonacci)
```
| Column | Meaning |
|---|---|
| `ncalls` | Total calls / non-recursive calls (slash = recursive) |
| `tottime` | Time in this function only (excludes callees) |
| `cumtime` | Time including everything this function called |
| `percall` | Per-call cost (`tottime` or `cumtime` ÷ `ncalls`) |
Rule: high `cumtime` but low `tottime` → the bottleneck is inside something this function calls.
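That rule can be checked mechanically through the `pstats` API: profile a thin wrapper whose callee does all the work, then compare the wrapper's two times (`busy` and `wrapper` are hypothetical names for illustration):

```python
import cProfile
import pstats

def busy():
    # The callee does all the real work
    return sum(i * i for i in range(200_000))

def wrapper():
    # Thin caller: its own code is trivial, so its tottime stays low
    return busy()

prof = cProfile.Profile()
prof.enable()
wrapper()
prof.disable()

# pstats maps (file, line, name) -> (callcount, ncalls, tottime, cumtime, callers)
stats = pstats.Stats(prof).stats
tottime, cumtime = next(
    (v[2], v[3]) for k, v in stats.items() if k[2] == "wrapper"
)
# wrapper: low tottime, high cumtime → the cost lives in busy()
```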
Sample scalene output:

```
line 19:  return fibonacci(n-1) + fibonacci(n-2)
Python: 94%   Native: 0%   Memory: 0 MB
```
| Column | Meaning | Action |
|---|---|---|
| Python% | Time Python's interpreter spent on this line | High → rewrite or vectorize |
| Native% | Time in C/native libraries called from this line | High → already fast, fix the algorithm |
| Memory (MB) | Allocations on this line | High → reduce copies or intermediate objects |
Key insight: high Native% on a numpy/pandas line means the C library is doing the work — the problem is usually calling it too many times (e.g. inside a loop), not the line itself.
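The "too many calls" pattern is easy to reproduce without numpy: the builtin `sum` is C-implemented, and invoking it thousands of times on tiny chunks loses to one call over the whole list, purely on per-call (and slicing) overhead. A sketch:

```python
import time

data = list(range(100_000))

# Anti-pattern: many small calls — per-call dispatch and slicing dominate
t0 = time.perf_counter()
chunked = sum(sum(data[i:i + 10]) for i in range(0, len(data), 10))
many_calls_s = time.perf_counter() - t0

# Same result from a single call over the whole list
t0 = time.perf_counter()
single = sum(data)
one_call_s = time.perf_counter() - t0
```

Both versions compute the same total; only the call pattern differs.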
| Profiler | Mechanism | fibonacci overhead | memory overhead |
|---|---|---|---|
| py-spy | OS-level sampling, external | ~0% | ~0% |
| scalene | Line-level CPU + allocation tracking | ~0% | +90% |
| cProfile | Hook on every function call | +261% | +19% |
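The overhead columns are relative latency increases over the unprofiled baseline; the exact formula in `benchmark.py` is an assumption, but the arithmetic is presumably:

```python
def overhead_pct(baseline_ms: float, profiled_ms: float) -> float:
    """Relative latency overhead versus the unprofiled baseline."""
    return (profiled_ms - baseline_ms) / baseline_ms * 100

# e.g. a run at 3.61x the baseline latency shows up as +261%
```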
Each profiler has a different Achilles' heel:
- cProfile is slow on recursion-heavy / call-heavy code
- scalene is slow on allocation-heavy code (by design — it's tracking every allocation)
- py-spy has near-zero overhead on everything
1. scalene → find the hot LINE and whether it's CPU-bound or memory-bound
2. cProfile → confirm call counts dropped after your fix
3. py-spy → verify the fix holds under real production load