BertitSabir/python-profiler-benchmark

Python Profiler Benchmark

A FastAPI app that benchmarks the overhead and output of three Python profilers: cProfile (stdlib), py-spy (sampling), and scalene (line-level CPU + memory).

Built with Python 3.13 and managed with uv.


Project structure

├── main.py                # FastAPI app with CPU / memory / I/O endpoints
├── workloads.py           # Pure workload functions (fibonacci, primes, matrix, memory, I/O)
├── benchmark.py           # Orchestrator: measures latency overhead per profiler
├── profile_workloads.py   # Runs workloads directly (no server) — used for clean profiling
├── profiler.py            # Reusable @cprofile_fn and @scalene_fn decorators
├── pyproject.toml         # Dependencies
├── uv.lock                # Locked dependency versions
├── .python-version        # Pins Python 3.13
└── profiles/              # Generated output (gitignored)
    ├── baseline.json
    ├── cprofile_stats.txt
    ├── py-spy.json / pyspy.svg
    └── scalene.json

Setup

uv sync

Endpoints

| Endpoint | Workload |
| --- | --- |
| `GET /health` | Health check |
| `GET /cpu/fibonacci?n=28` | Recursive Fibonacci — stresses function-call dispatch |
| `GET /cpu/primes?limit=50000` | Sieve of Eratosthenes — stresses integer loops |
| `GET /cpu/matrix?size=40` | Naive O(n³) matrix multiply — stresses nested loops |
| `GET /memory?num_lists=6&list_size=10000` | Large list alloc + sort + reduce |
| `GET /io?count=40` | Repeated file write + read |
| `GET /mixed` | All of the above combined |
| `GET /profiler/stats` | Live cProfile dump (only with `--with-cprofile`) |
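To illustrate what the CPU endpoints exercise, here is a hedged sketch of the two simplest workloads (function names mirror the endpoint paths; the actual workloads.py may differ):

```python
# Sketch of the kind of pure workload functions the endpoints wrap.
# Names are assumptions based on the endpoint paths, not the real file.

def fibonacci(n: int) -> int:
    """Deliberately naive recursion: stresses function-call dispatch."""
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

def primes(limit: int) -> list[int]:
    """Sieve of Eratosthenes: stresses tight integer loops."""
    sieve = [True] * (limit + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            # mark every multiple of i starting at i*i as composite
            sieve[i * i :: i] = [False] * len(sieve[i * i :: i])
    return [i for i, is_prime in enumerate(sieve) if is_prime]
```

Both are intentionally naive: the point is to give each profiler a distinct hot spot, not to compute efficiently.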

Benchmark: measuring profiler overhead

The benchmark starts the server as a subprocess, hits every endpoint, and measures latency. Results are saved to profiles/ as JSON for cross-session comparison.
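The measurement logic can be sketched as follows (function names and defaults are illustrative, not the actual benchmark.py API):

```python
# Minimal sketch of the per-endpoint measurement: warm up, time N requests,
# report the median. `time_endpoint` and `overhead_pct` are hypothetical names.
import statistics
import time

def time_endpoint(call, warmup: int = 5, requests: int = 20) -> float:
    """Return the median latency in ms for a zero-arg callable `call`."""
    for _ in range(warmup):      # warmup requests are not timed
        call()
    samples = []
    for _ in range(requests):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

def overhead_pct(baseline_ms: float, profiled_ms: float) -> float:
    """Profiler overhead relative to the saved baseline, as a percentage."""
    return (profiled_ms - baseline_ms) / baseline_ms * 100.0
```

The median (rather than the mean) keeps a single GC pause or network hiccup from skewing the comparison.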

Step 1 — establish baseline

uv run python benchmark.py --mode baseline
# saves profiles/baseline.json

Step 2 — cProfile overhead (automatic)

uv run python benchmark.py --mode cprofile
# starts server with --with-cprofile, compares against saved baseline

Step 3 — py-spy overhead (external sampling)

# Terminal A — server (no profiler)
uv run python main.py

# Terminal B — attach py-spy while traffic runs
py-spy record -o profiles/pyspy.svg --pid $(pgrep -f 'main.py') --duration 90
# macOS may need: sudo py-spy record …

# Terminal C — benchmark and compare
uv run python benchmark.py --mode external

Step 4 — scalene overhead (external)

# Terminal A — server under scalene
uv run python -m scalene run -o profiles/scalene.json -- main.py

# Terminal B — benchmark and compare
uv run python benchmark.py --mode external

Full auto run (baseline + cProfile, plus printed instructions for the external profilers)

uv run python benchmark.py

Options

--mode      baseline | cprofile | external | all  (default: all)
--requests  N timed requests per endpoint          (default: 20)
--warmup    N warmup requests per endpoint         (default: 5)

Profiling your own code

Cleanest way: profile without FastAPI noise

# Generate scalene report on workloads directly
uv run python -m scalene run -o profiles/scalene_workloads.json profile_workloads.py

# View in browser
uv run python -m scalene view profiles/scalene_workloads.json

# View as standalone HTML file
uv run python -m scalene view profiles/scalene_workloads.json --standalone

# View in terminal (only active lines)
uv run python -m scalene view profiles/scalene_workloads.json --cli --reduced

# cProfile in one command
uv run python -m cProfile -s cumulative profile_workloads.py | head -30

Using the profiling decorators

Copy profiler.py to your project and decorate the functions you care about:

from profiler import cprofile_fn, scalene_fn

# cProfile: call counts + cumulative time per function
# → saves report to profiles/compute_nodes.txt automatically
@cprofile_fn(top=15, save_to="profiles/compute_nodes.txt")
def compute_nodes(graph):
    ...

# scalene: line-level CPU% + memory — active only under `scalene run`
@scalene_fn
def build_edge_matrix(nodes):
    ...

Then:

# cProfile (works immediately)
python your_script.py

# scalene (line-level detail)
python -m scalene run -o profiles/out.json your_script.py
python -m scalene view profiles/out.json --cli --reduced

How to read the outputs

cProfile

ncalls    tottime  percall  cumtime  percall  filename:lineno(function)
1028457/1   0.099    0.000    0.099    0.099  workloads.py:17(fibonacci)
| Column | Meaning |
| --- | --- |
| `ncalls` | Total calls / non-recursive calls (slash = recursive) |
| `tottime` | Time in this function only (excludes callees) |
| `cumtime` | Time including everything this function called |
| `percall` | Per-call cost (tottime or cumtime ÷ ncalls) |

Rule: high cumtime but low tottime → bottleneck is inside something this function calls.
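The rule is easy to demonstrate with the stdlib alone: profile a thin wrapper around a slow callee and compare the two rows (function names here are illustrative):

```python
# Demonstrates the cumtime-vs-tottime rule: `outer` does almost no work
# itself (near-zero tottime), but its cumtime includes everything `inner` does.
import cProfile
import io
import pstats

def inner(n: int) -> int:
    # the actual work: inner's tottime and cumtime are both high
    return sum(i * i for i in range(n))

def outer(n: int) -> int:
    # thin wrapper: the bottleneck is inside what it calls
    return inner(n)

profiler = cProfile.Profile()
result = profiler.runcall(outer, 200_000)

buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats("inner|outer")
print(buf.getvalue())
```

In the printed rows, `outer` shows high cumtime with near-zero tottime, which is exactly the signal to look one level down the call chain.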

scalene

line 19:  return fibonacci(n-1) + fibonacci(n-2)
          Python: 94%   Native: 0%   Memory: 0 MB
| Column | Meaning | Action |
| --- | --- | --- |
| Python% | Time Python's interpreter spent on this line | High → rewrite or vectorize |
| Native% | Time in C/native libraries called from this line | High → already fast, fix the algorithm |
| Memory (MB) | Allocations on this line | High → reduce copies or intermediate objects |

Key insight: high Native% on a numpy/pandas line means the C library is doing the work — the problem is usually calling it too many times (e.g. inside a loop), not the line itself.
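The same pattern shows up without numpy: invoking a C-implemented builtin once per element pays per-call overhead that a single batched call avoids. A library-free illustration (absolute timings will vary by machine):

```python
# Calling a fast native routine many times vs. once: the per-call overhead,
# not the routine itself, dominates. Uses only builtins; numbers illustrative.
import time

data = list(range(100_000))

start = time.perf_counter()
total_loop = 0
for x in data:
    total_loop += sum((x,))      # native `sum`, but invoked 100k times
loop_s = time.perf_counter() - start

start = time.perf_counter()
total_batch = sum(data)          # one native call over the whole batch
batch_s = time.perf_counter() - start

assert total_loop == total_batch
print(f"per-element: {loop_s:.4f}s  batched: {batch_s:.4f}s")
```

A scalene profile of the loop version would show high Native% on the `sum((x,))` line; the fix is moving the call out of the loop, not optimizing `sum` itself.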


Benchmark results summary

| Profiler | Mechanism | fibonacci overhead | memory overhead |
| --- | --- | --- | --- |
| py-spy | OS-level sampling, external | ~0% | ~0% |
| scalene | Line-level CPU + allocation tracking | ~0% | +90% |
| cProfile | Hook on every function call | +261% | +19% |

Each profiler has a different Achilles' heel:

  • cProfile is slow on recursion-heavy / call-heavy code
  • scalene is slow on allocation-heavy code (by design — it's tracking every allocation)
  • py-spy has near-zero overhead on everything
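The cProfile penalty on call-heavy code is easy to reproduce: time a recursive workload with and without the profiler attached (the exact ratio depends on the machine and Python version):

```python
# Reproduces cProfile's Achilles' heel: fib(24) makes ~150k function calls,
# and every one of them pays the profiler's per-call hook.
import cProfile
import time

def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def timed(fn, *args) -> float:
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

plain = timed(fib, 24)

profiler = cProfile.Profile()
start = time.perf_counter()
profiler.runcall(fib, 24)
profiled = time.perf_counter() - start

print(f"overhead: {profiled / plain:.1f}x slower under cProfile")
```

A sampling profiler like py-spy interrupts the process at a fixed rate instead of hooking every call, which is why its overhead stays flat regardless of call count.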

Recommended workflow

1. scalene  →  find the hot LINE and whether it's CPU-bound or memory-bound
2. cProfile →  confirm call counts dropped after your fix
3. py-spy   →  verify the fix holds under real production load
