Demonstrates how to:
- Use `cProfile` to identify a CPU-bound bottleneck in naive recursive Fibonacci computations.
- Accelerate independent heavy computations with `ProcessPoolExecutor`, leveraging multiple processes (bypassing the GIL) to achieve a speedup.
The sample intentionally keeps the code simple and self-contained while illustrating a diagnostic-to-mitigation workflow:
Profile the serial baseline -> discover the hot function (`fib`) -> mitigate with parallel process execution -> measure the speedup.
Python's Global Interpreter Lock (GIL) prevents true CPU-bound parallelism with threads. When you have multiple independent heavy computations, distributing them across separate processes allows concurrent execution on multiple cores.
cProfile tells us where time is spent; ProcessPoolExecutor changes how we schedule work to reduce total wall-clock time.
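The hot function at the center of this workflow is a naive recursive Fibonacci. A minimal sketch of its shape (the sample's actual implementation may differ in details):

```python
def fib(n: int) -> int:
    # Intentionally naive: exponential time, so each call is CPU-heavy and easy to spot in a profile.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)
```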
| Route | Purpose |
|---|---|
| `/api/serial_profile_trigger` | Serial execution of several Fibonacci tasks under cProfile; returns top cumulative-time stats. |
| `/api/serial` | Serial execution only (no profiling). Baseline duration. |
| `/api/parallel` | Parallel execution via ProcessPoolExecutor. Measures duration including process spawn overhead. |
| `/api/compare` | Runs both serial and parallel paths in one call; returns the speedup ratio. |
Default Fibonacci inputs: `[34, 35, 36, 35]` (tuned so each call is noticeably expensive but still finishes quickly on a modern machine). Override with the query parameter `?nums=33,34,35`, etc.
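A minimal sketch of how the `nums` override could be parsed inside an HTTP trigger (the helper name and fallback behavior here are illustrative, not necessarily the sample's exact code):

```python
import azure.functions as func

DEFAULT_NUMS = [34, 35, 36, 35]

def parse_nums(req: func.HttpRequest) -> list[int]:
    """Parse the optional ?nums=33,34,35 query parameter, falling back to the defaults."""
    raw = req.params.get("nums")  # query parameters are exposed via HttpRequest.params
    if not raw:
        return DEFAULT_NUMS
    try:
        return [int(part) for part in raw.split(",") if part.strip()]
    except ValueError:
        return DEFAULT_NUMS
```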
`serial_profile_trigger` wraps the entire serial loop in a `cProfile.Profile()` context:

```python
profiler = cProfile.Profile()
profiler.enable()
# serial work
profiler.disable()
```

We then sort by cumulative time to surface the deepest hotspots:

```python
pstats.Stats(profiler).sort_stats("cumtime").print_stats(15)
```

You should observe the majority of cumulative time in `fib` due to its naive recursion.
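To return the stats in the HTTP response (the `profiling_top` field shown later), the report can be redirected to an in-memory stream rather than stdout. A self-contained sketch, assuming that approach:

```python
import cProfile
import io
import pstats

def fib(n: int) -> int:
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

nums = [34, 35, 36, 35]

profiler = cProfile.Profile()
profiler.enable()
results = [fib(n) for n in nums]  # the serial work being profiled
profiler.disable()

# Redirect the pstats report into a string (instead of stdout) so it can be embedded in a JSON response.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumtime").print_stats(15)
profiling_top = buffer.getvalue().splitlines()
```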
cProfile is less helpful for:
- Ultra-short (sub-millisecond) functions: results are dominated by profiler overhead.
- Highly IO-bound code: use `asyncio` or specialized tracing tools instead.
After profiling reveals repeated independent Fibonacci calls, we parallelize them (a minimal timing sketch follows the lists below):

```python
with ProcessPoolExecutor() as executor:
    values = list(executor.map(fib, nums))
```

Each process runs its own Python interpreter; the GIL is not shared, so true parallel CPU execution occurs. Good when:
- Tasks are CPU-bound and independent.
- Result serialization cost (pickling) is small compared to compute time.
Not ideal when:
- Tasks are extremely fast (overhead outweighs gain).
- Shared mutable state required (design for message passing instead).
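Putting the two paths together, a minimal, self-contained sketch of a serial-vs-parallel comparison (simplified relative to the actual `/api/compare` endpoint; `fib` is repeated so the snippet runs standalone):

```python
import time
from concurrent.futures import ProcessPoolExecutor

def fib(n: int) -> int:
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

def compare(nums: list[int]) -> dict:
    # Serial baseline: each fib call runs one after another on a single core.
    start = time.perf_counter()
    serial_results = [fib(n) for n in nums]
    serial_sec = time.perf_counter() - start

    # Parallel: each fib call runs in its own process, so the GIL is not a constraint.
    start = time.perf_counter()
    with ProcessPoolExecutor() as executor:
        parallel_results = list(executor.map(fib, nums))
    parallel_sec = time.perf_counter() - start

    return {
        "serial_sec": round(serial_sec, 2),
        "parallel_sec": round(parallel_sec, 2),
        "speedup": round(serial_sec / parallel_sec, 2),
    }

if __name__ == "__main__":  # guard is required because ProcessPoolExecutor spawns new processes
    print(compare([34, 35, 36, 35]))
```

Run it as a script; the `__main__` guard matters on platforms (Windows, macOS) that spawn rather than fork worker processes.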
Prerequisites:
- Python (3.10 or 3.11 recommended)
- Azure Functions Core Tools v4

Create `local.settings.json`:

```json
{
  "IsEncrypted": false,
  "Values": {
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "AzureWebJobsStorage": "UseDevelopmentStorage=true"
  }
}
```
Install dependencies:

```bash
pip install -r requirements.txt
```

Start the Functions host:

```bash
func start
```

Invoke endpoints (PowerShell examples):
```powershell
# Serial profile (profiling + results)
Invoke-WebRequest http://localhost:7071/api/serial_profile_trigger | Select-Object -ExpandProperty Content

# Serial only
Invoke-WebRequest http://localhost:7071/api/serial | Select-Object -ExpandProperty Content

# Parallel only
Invoke-WebRequest http://localhost:7071/api/parallel | Select-Object -ExpandProperty Content

# Compare serial vs parallel (includes speedup)
Invoke-WebRequest http://localhost:7071/api/compare | Select-Object -ExpandProperty Content
```

With curl:
```bash
curl http://localhost:7071/api/serial_profile_trigger
curl http://localhost:7071/api/compare
```

Example response from `/api/serial_profile_trigger`:

```json
{
"mode": "serial_profile",
"nums": [
34,
35,
36,
35
],
"results": [
{
"n": 34,
"fib": 5702887
},
{
"n": 35,
"fib": 9227465
},
{
"n": 36,
"fib": 14930352
},
{
"n": 35,
"fib": 9227465
}
],
"started_utc": "2025-10-29T00:14:52.060666Z",
"ended_utc": "2025-10-29T00:15:24.062489Z",
"duration_seconds": 32.0018,
"profiling_top": [
" 126494861 function calls (2126 primitive calls) in 32.002 seconds",
"",
" Ordered by: cumulative time",
" List reduced from 46 to 15 due to restriction <15>",
"",
" ncalls tottime percall cumtime percall filename:lineno(function)",
" 256 0.002 0.000 0.013 0.000 C:\\Users\\tsushi\\AppData\\Local\\Programs\\Python\\Python313\\Lib\\threading.py:327(wait)",
" 255 0.001 0.000 0.001 0.000 C:\\Program Files\\Microsoft\\Azure Functions Core Tools\\workers\\python\\3.13\\WINDOWS\\X64\\grpc\\_channel.py:954(_response_ready)",
" 256 0.001 0.000 0.001 0.000 {method '_acquire_restore' of '_thread.RLock' objects}",
" 256 0.000 0.000 0.000 0.000 {method '_release_save' of '_thread.RLock' objects}",
" 256 0.000 0.000 0.000 0.000 {method 'remove' of 'collections.deque' objects}",
" 257 0.000 0.000 0.000 0.000 {built-in method _thread.allocate_lock}",
" 1 0.000 0.000 0.000 0.000 C:\\Program Files\\Microsoft\\Azure Functions Core Tools\\workers\\python\\3.13\\WINDOWS\\X64\\grpc\\_common.py:121(wait)",
" 256/2 0.001 0.000 0.000 0.000 C:\\Program Files\\Microsoft\\Azure Functions Core Tools\\workers\\python\\3.13\\WINDOWS\\X64\\grpc\\_common.py:111(_wait_once)",
" 258 0.000 0.000 0.000 0.000 {method '_is_owned' of '_thread.RLock' objects}",
" 257 0.000 0.000 0.000 0.000 {method 'append' of 'collections.deque' objects}",
" 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}",
" 1 0.000 0.000 0.000 0.000 C:\\Program Files\\Microsoft\\Azure Functions Core Tools\\workers\\python\\3.13\\WINDOWS\\X64\\grpc\\_channel.py:238(handle_event)",
" 1 0.000 0.000 0.000 0.000 C:\\Users\\tsushi\\AppData\\Local\\Programs\\Python\\Python313\\Lib\\asyncio\\futures.py:406(wrap_future)",
" 2 0.000 0.000 0.000 0.000 C:\\Users\\tsushi\\AppData\\Local\\Programs\\Python\\Python313\\Lib\\threading.py:428(notify_all)",
" 2 0.000 0.000 0.000 0.000 C:\\Users\\tsushi\\AppData\\Local\\Programs\\Python\\Python313\\Lib\\threading.py:398(notify)",
"",
""
]
}
```

Example response from `/api/compare`:

```json
{
"input_nums": [34, 35, 36, 35],
"serial": {"duration_sec": 5.42, "results": [{"n":34,"fib":5702887}, ...]},
"parallel": {"duration_sec": 2.11, "results": [{"n":34,"fib":5702887}, ...]},
"speedup": 2.57,
"note": "First parallel call may include process spawn overhead."
}
```

(Times will vary by CPU and current system load.)
| Adjustment | Effect |
|---|---|
| Increase numbers (e.g. 37,38) | More CPU time; clearer parallel benefit but risk longer cold start/timeouts. |
| Fewer numbers | Less aggregate work; parallel overhead may dominate. |
| Reorder numbers | No practical effect (all independent). |
| Memoize fib | Eliminates cost after the first call; reduces the usefulness of the demo (see the memoization sketch below). |
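For reference, a hedged sketch of the memoization variant mentioned in the table and in the future improvements below (not part of the current sample):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib_cached(n: int) -> int:
    """Memoized Fibonacci: roughly linear work once the cache warms, versus exponential for the naive version."""
    if n < 2:
        return n
    return fib_cached(n - 1) + fib_cached(n - 2)
```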
The first parallel call includes process creation overhead. Run `/api/parallel` once before benchmarking for a steadier comparison.
- Recursive Fibonacci is pedagogical only. Replace it with a computation you actually need, or optimize it (iterative, memoization, fast doubling).
- Process pools are created per request here for clarity. In production, consider a reusable process pool (long-lived) to amortize spawn cost.
- Avoid very large inputs that may exceed Function time limits or memory constraints.
- Profiling overhead is non-trivial; use profiling endpoints only in lower environments or gated scenarios.
- Add an endpoint using `functools.lru_cache` to show algorithmic speedup vs parallelism.
- Add simple tracing of per-task durations.
- Implement a reusable, module-level process pool with lazy initialization (a sketch follows this list).
- Provide an endpoint to dump raw profiling stats as downloadable text.
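A hedged sketch of what a lazily initialized, module-level pool could look like (names are illustrative and not part of the current sample):

```python
from concurrent.futures import ProcessPoolExecutor
from threading import Lock

_pool: ProcessPoolExecutor | None = None
_pool_lock = Lock()

def get_pool() -> ProcessPoolExecutor:
    """Create the process pool on first use and reuse it across invocations to amortize spawn cost."""
    global _pool
    if _pool is None:
        with _pool_lock:  # guard against concurrent first calls on the same worker
            if _pool is None:
                _pool = ProcessPoolExecutor()
    return _pool
```

Reusing the pool amortizes process spawn cost across requests, though each Functions worker process still maintains its own pool.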
| Symptom | Possible Cause | Fix |
|---|---|---|
| No logs appear | Host log level filters INFO | Add a logging section to host.json or raise log calls to WARNING temporarily. |
| Parallel slower than serial | Too few tasks / small n | Increase n values moderately (34–36). |
| Timeout errors | Inputs too large | Reduce the largest n or split calls. |
| Import error for azure.functions | Missing package | `pip install azure-functions` |
This sample focuses on clarity over micro-optimizations. It illustrates the journey from measurement (profiling) to improvement (parallelization) for CPU-bound Python workloads in Azure Functions.
Enjoy experimenting and adapting it to your own workload!