
# HPC Profiling with **Scalene** (Python, NumPy/Numba, GPU, and MPI)

**Goals**
- Understand what Scalene measures (Python vs. native vs. GPU time, and memory).
- Profile mixes of Python, NumPy/Numba, and GPU code paths.
- Generate and read HTML reports.
- Apply profiling to I/O-heavy workloads (optional Zarr demo).
- Profile `mpi4py` programs rank-by-rank.


## What Scalene Measures (and why it matters)

Scalene is a sampling profiler that attributes cost, line by line, to:
- **Python time** (interpreter, GIL-bound)
- **Native time** (NumPy/Numba/C-extensions)
- **GPU time** (CuPy / Numba CUDA)
- **Memory** (alloc/free) per line (where supported)

This lets you decide whether to:
- Vectorize / JIT (if Python time is high), or
- Look at I/O / memory / algorithmic choices (if native/GPU time dominates).


## Profiler Types: Sampling Profilers vs. Instrumenting Profilers

Understanding how profilers collect their data is essential for interpreting results correctly — especially in HPC or hybrid Python + native code environments.

---

## Two Main Types of Profilers

| Type | How It Works | Example Tools |
|------|---------------|----------------|
| **Sampling profiler** | Periodically interrupts a running program and records *where* it is (stack trace). | **Scalene**, `perf`, Intel VTune, Pyinstrument |
| **Instrumenting profiler** | Modifies or wraps code to record *every function entry/exit* and measure exact times. | `cProfile`, `line_profiler`, TAU, mpiP (for MPI calls) |

---

## Sampling Profilers

**How they work:**  
A timer fires at regular intervals (e.g., every 1 ms). Each time:
1. The profiler pauses the program briefly.
2. Records the current function/line being executed.
3. Resumes execution immediately.

After many samples, the profiler estimates where most time is spent based on how often each line appears in the samples.
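
The loop described above can be sketched in a few lines of standard-library Python. This is a toy illustration, not how Scalene is implemented: it samples from a background thread via `sys._current_frames()`, and the names `sample_lines` and `busy` are made up for this example.

```python
import collections
import sys
import threading
import time


def sample_lines(target, interval=0.001):
    """Run target() and periodically record which line the calling thread is on."""
    counts = collections.Counter()
    caller_id = threading.get_ident()
    done = threading.Event()

    def sampler():
        while not done.is_set():
            # sys._current_frames() maps thread id -> that thread's current frame.
            frame = sys._current_frames().get(caller_id)
            if frame is not None:
                counts[(frame.f_code.co_name, frame.f_lineno)] += 1
            time.sleep(interval)

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        target()
    finally:
        done.set()
        thread.join()
    return counts


def busy():
    """Hypothetical workload: a pure-Python loop that keeps the interpreter busy."""
    total = 0
    for i in range(3_000_000):
        total += i * i
    return total


if __name__ == "__main__":
    counts = sample_lines(busy)
    n_samples = sum(counts.values())
    for (func, line), hits in counts.most_common(3):
        print(f"{func}:{line}  {100 * hits / n_samples:.0f}% of {n_samples} samples")
```

Lines that appear in many samples are where the program spends most of its time; the exact counts will differ from run to run.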

**Example (Scalene):**
| Line | Python % | Native % | GPU % |
|------|-----------|----------|-------|
| 12 | 45 | 10 | 0 |
| 22 | 2 | 50 | 0 |
| 35 | 0 | 0 | 50 |

### Advantages
- **Low overhead**, which makes them suitable for long HPC runs.  
- Works with **JIT** or **native** code (Numba, C/C++, CUDA).  
- Can attribute **Python**, **native**, and **GPU** time separately.  
- Distorts performance very little, because no code is rewritten.

### Limitations
- **Statistical**, not exact — tiny or rare functions may be missed.  
- **Resolution** limited by sampling frequency.  
- Results may vary slightly between runs.
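
To get a feel for how large this statistical uncertainty is, here is a rough sketch that treats each sample as an independent draw (a simplifying assumption; real samples are not perfectly independent). The function names and the numbers in the example are illustrative only.

```python
import math


def sampling_std_error(fraction, n_samples):
    """Standard error of a sampled time fraction (binomial approximation)."""
    return math.sqrt(fraction * (1.0 - fraction) / n_samples)


def chance_of_missing(fraction, n_samples):
    """Probability that a line taking `fraction` of the runtime gets zero samples."""
    return (1.0 - fraction) ** n_samples


if __name__ == "__main__":
    # Illustrative numbers: a 10 s run sampled every 10 ms gives about 1000 samples.
    n = 1000
    for frac in (0.50, 0.05, 0.001):
        err = sampling_std_error(frac, n)
        miss = chance_of_missing(frac, n)
        print(f"true share {frac:6.1%}: +/- {err:.2%}, P(no samples) = {miss:.3f}")
```

Under these assumptions, large hotspots are estimated reliably, while a line that accounts for a tiny share of the runtime has a real chance of never being sampled at all.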

---

## Instrumenting Profilers

**How they work:**  
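The profiler inserts code around every function call to record precise start and end times. Conceptually, each call is wrapped like this:

```python
import time

start = time.perf_counter()             # recorded on function entry
foo()                                   # the profiled call
elapsed = time.perf_counter() - start   # recorded on function exit
```

Because every call is measured, the timings are exact rather than statistical, but the overhead grows with the number of calls and can distort results for code that makes many small calls.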
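
The same wrapping can be applied automatically with a decorator. The sketch below is a hand-rolled illustration, not the mechanism of any particular tool; `instrument`, `stats`, `tiny`, and `heavy` are hypothetical names introduced here.

```python
import functools
import time
from collections import defaultdict

# Per-function call counts and cumulative wall-clock time.
stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


def instrument(func):
    """Wrap func so that every call is counted and timed."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            entry = stats[func.__name__]
            entry["calls"] += 1
            entry["seconds"] += time.perf_counter() - start
    return wrapper


@instrument
def tiny(x):
    return x + 1


@instrument
def heavy(n):
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    for i in range(10_000):
        tiny(i)
    heavy(200_000)
    for name, entry in stats.items():
        print(f"{name}: {entry['calls']} calls, {entry['seconds']:.4f} s")
```

Every call to `tiny` pays for two clock reads and a dictionary update, which is why instrumentation tends to inflate the apparent cost of small, frequently called functions.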


## Pure Python vs. NumPy (native) hotspot demo

We'll create two versions of the same computation:
- A **pure Python** double loop (intentionally slow).
- A **NumPy** vectorized version (the element-wise loop runs in compiled C code).

We'll run Scalene **as a CLI** so it generates an HTML report.


In [18]:

%%writefile py_vs_numpy.py
import numpy as np

 
def slow_python(N):
    X = np.arange(N**2).reshape(N,N)
    for i in range(N):
        for j in range(N):
            X[i,j] = X[i,j]**2
    return X

def fast_numpy(N):
    X = np.arange(N**2).reshape(N,N)
    X = X**2
    return X

if __name__ == "__main__":
    slow_python(2000)
    fast_numpy(2000)


Overwriting py_vs_numpy.py
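
Before profiling, it can help to confirm that both versions compute the same result and to get a rough wall-clock comparison. The sketch below re-declares the two functions so it is self-contained; the `timed` helper, the explicit `int64` dtype, and the small size `n = 300` are choices made for this example and are not part of the script above.

```python
import time

import numpy as np


def slow_python(n):
    x = np.arange(n**2, dtype=np.int64).reshape(n, n)
    for i in range(n):
        for j in range(n):
            x[i, j] = x[i, j] ** 2
    return x


def fast_numpy(n):
    x = np.arange(n**2, dtype=np.int64).reshape(n, n)
    return x**2


def timed(func, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    n = 300  # small enough for a quick check
    slow, t_slow = timed(slow_python, n)
    fast, t_fast = timed(fast_numpy, n)
    assert np.array_equal(slow, fast)
    print(f"python loop: {t_slow:.4f} s, numpy: {t_fast:.4f} s")
```

A single timing like this says which version is faster, but not where the time goes; that is what the Scalene report adds.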


In [23]:

# Run Scalene to create an HTML report.
# --html selects HTML output; --outfile names the report file ('scalene_py_vs_numpy.html').
!scalene --html --outfile scalene_py_vs_numpy.html py_vs_numpy.py


NOTE: The GPU is currently running in a mode that can reduce Scalene's accuracy when reporting GPU utilization.
If you have sudo privileges, you can run this command (Linux only) to enable per-process GPU accounting:
  python3 -m scalene.set_nvidia_gpu_modes


### Interpretation


| **Column** | **Meaning** | **How It’s Measured** | **Interpretation / What to Look For** |
|-------------|-------------|------------------------|---------------------------------------|
| **% of Time** | Total fraction of wall-clock runtime spent on this line (sum of Python + Native + GPU). | Statistical sampling of stack traces every few ms. | High → major hotspot. Focus optimization here. |
| **% Python** | Time spent executing interpreted Python bytecode. | Timer signals that arrive on schedule (main thread) are attributed to Python code. | High → interpreter-bound → vectorize or JIT with Numba. |
| **% Native** | Time spent in compiled/native code (NumPy, C/C++, Fortran, Numba CPU). | Python defers signal delivery while native code runs; the extra delay is attributed to native time. | High → already running compiled code; algorithmic or I/O limits dominate. |
| **% System** | Time spent waiting on the OS (I/O, sleep, etc.). | Wall-clock time elapsed minus CPU time consumed. | High → I/O-bound or waiting on the OS. |
| **% GPU** | GPU utilization observed while this line was executing. | GPU utilization is sampled through the NVIDIA driver at each CPU sample rather than traced per kernel. | High → GPU compute-bound; check data transfers & kernel efficiency. |
| **CPU Mem Avg (MB/s)** | Average rate of host-side allocations/frees while this line was active. | Hooks `malloc`/`free` and Python allocator; computes bytes / sec. | High → heavy allocation throughput. |
| **CPU Mem Peak (MB)** | Maximum increase in total process memory after this line executed. | Tracks deltas in process RSS / heap size. | High → transient spikes or potential memory growth/leak. |
| **GPU Mem Avg / Peak** | (Experimental) Device memory allocation rate / peak usage. | CUDA driver hooks — partial support (mainly CuPy). | Often blank for Numba CUDA — normal. |
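
The difference between busy time and waiting time can be seen with two standard-library clocks: `time.perf_counter()` measures wall-clock time, while `time.process_time()` counts only CPU time used by the process. The helper names below (`measure`, `waits`, `computes`) are illustrative, and the snippet only demonstrates the idea, not Scalene's own bookkeeping.

```python
import time


def measure(func):
    """Return (wall_seconds, cpu_seconds) for one call of func()."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    func()
    return time.perf_counter() - wall0, time.process_time() - cpu0


def waits():
    time.sleep(0.2)  # blocked in the OS; uses almost no CPU


def computes():
    sum(i * i for i in range(1_000_000))  # keeps the CPU busy


if __name__ == "__main__":
    for func in (waits, computes):
        wall, cpu = measure(func)
        waiting = max(wall - cpu, 0.0)
        print(f"{func.__name__:9s} wall={wall:.3f}s cpu={cpu:.3f}s waiting={waiting:.3f}s")
```

A line whose wall-clock time is much larger than its CPU time is mostly waiting, which is the situation a high system percentage points to.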

---

### Reading Tips
- **High Python %** → optimize in Python space (Numba/vectorize).  
- **High Native %** → library code is doing work; tune algorithms.  
- **High GPU %** → GPU dominates; focus on data movement.  
- **High CPU Mem Avg/Peak** → frequent or large host allocations.  
- **High System %** → I/O or synchronization bottleneck.  
- Blank GPU/Memory columns → no allocations or unsupported (normal for Numba CUDA).
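
To see what per-line memory attribution looks like in isolation, the standard library's `tracemalloc` can serve as a stand-in. It tracks only allocations made through Python's allocator and is a different mechanism from Scalene's; `build_lists` and `top_allocating_lines` are hypothetical helpers for this sketch.

```python
import tracemalloc


def build_lists(n):
    small = [0] * 10        # tiny allocation
    big = list(range(n))    # dominant allocation
    return small, big


def top_allocating_lines(func, *args, limit=3):
    """Run func(*args) under tracemalloc; return (filename, lineno, bytes) rows."""
    tracemalloc.start()
    try:
        kept = func(*args)  # keep the result alive until the snapshot is taken
        snapshot = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # statistics("lineno") groups live allocations by source line, largest first.
    stats = snapshot.statistics("lineno")[:limit]
    return [(s.traceback[0].filename, s.traceback[0].lineno, s.size) for s in stats]


if __name__ == "__main__":
    for filename, lineno, size in top_allocating_lines(build_lists, 500_000):
        print(f"{filename}:{lineno}  {size / 1e6:.2f} MB")
```

The line that builds `big` should dominate the listing, much as a large allocation stands out in the memory columns of a profile.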




## Numba (CPU JIT) demo

Numba JIT-compiles Python to native code, which Scalene will attribute as **native time**.


In [10]:

%%writefile numba_cpu.py
import numpy as np
from numba import njit

@njit(cache=True, fastmath=True)
def matmul_numba(a, b):
    return a @ b

def main(n):
    
    # Warm-up: the first call triggers JIT compilation, so call once on small inputs
    a = np.random.rand(10, 10)
    b = np.random.rand(10, 10)
    c = matmul_numba(a, b)
    
    # Test
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    c = matmul_numba(a, b)


if __name__ == "__main__":
    main(1500)


Overwriting numba_cpu.py


In [11]:

!scalene --html --outfile scalene_numba_cpu.html numba_cpu.py

NOTE: The GPU is currently running in a mode that can reduce Scalene's accuracy when reporting GPU utilization.
If you have sudo privileges, you can run this command (Linux only) to enable per-process GPU accounting:
  python3 -m scalene.set_nvidia_gpu_modes



## GPU with Numba CUDA

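A custom kernel (`out = a * x + y`) that keeps host-to-device transfers, the kernel launch, and device compute on separate lines so they can be told apart in the report.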


In [47]:

%%writefile numba_cuda.py
import numpy as np
from numba import cuda

@cuda.jit
def f(a, x, y, out):
    i = cuda.grid(1)
    if i < x.size:
        out[i] = a * x[i] + y[i]

def main(n):
    x = np.random.rand(n).astype(np.float32)
    y = np.random.rand(n).astype(np.float32)
    out = np.empty_like(x)

    d_x = cuda.to_device(x)
    d_y = cuda.to_device(y)
    d_out = cuda.device_array_like(x)

    threads = 256
    blocks = (n + threads - 1) // threads
    f[blocks, threads](2.0, d_x, d_y, d_out)
    cuda.synchronize()
    print("sum:", float(d_out.copy_to_host().sum()))

if __name__ == "__main__":
    main(50000000)


Overwriting numba_cuda.py


In [22]:

!scalene --html --outfile scalene_numba_cuda.html numba_cuda.py

sum: 75009888.0
NOTE: The GPU is currently running in a mode that can reduce Scalene's accuracy when reporting GPU utilization.
If you have sudo privileges, you can run this command (Linux only) to enable per-process GPU accounting:
  python3 -m scalene.set_nvidia_gpu_modes



## Profiling `mpi4py` programs (rank-by-rank)

Scalene does not aggregate results across MPI ranks, so the common pattern with `mpi4py` is to **produce one report per rank**.

One approach is to have your MPI launcher start each rank under Scalene, with each rank writing its own report. Example:


In [44]:

%%writefile mpi_demo.py
#!/usr/bin/env python
"""
Example MPI program for profiling with Scalene
This demonstrates typical MPI operations that you'd want to profile
"""
from mpi4py import MPI
import numpy as np
import time

def expensive_computation(size):
    """Simulate some CPU-intensive work"""
    data = np.random.rand(size, size)
    result = np.linalg.inv(data @ data.T + np.eye(size))
    return result

def memory_intensive_operation(size):
    """Simulate memory allocation"""
    arrays = []
    for i in range(10):
        arrays.append(np.random.rand(size, size))
    return np.sum([arr.sum() for arr in arrays])

def main():
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()
    
    print(f"Rank {rank}/{size} starting...")
    
    # Different work for different ranks
    if rank == 0:
        # Root does some preparation
        data = expensive_computation(1000)
        total = memory_intensive_operation(500)
        print(f"Rank 0: Prepared data, total = {total:.2f}")
    else:
        # Workers do their own computation
        result = expensive_computation(100)
        local_sum = memory_intensive_operation(1000)
        print(f"Rank {rank}: Computed result, sum = {local_sum:.2f}")
    
    # Synchronize
    comm.Barrier()
    
    # Gather operation
    local_value = rank * 10.0
    all_values = comm.gather(local_value, root=0)
    
    if rank == 0:
        print(f"Gathered values: {all_values}")
    
    # Broadcast operation
    if rank == 0:
        broadcast_data = np.random.rand(1000, 1000)
    else:
        broadcast_data = None
    
    broadcast_data = comm.bcast(broadcast_data, root=0)
    
    # Everyone does some work with broadcast data
    local_result = np.sum(broadcast_data) * rank
    
    # Reduce operation
    total_result = comm.reduce(local_result, op=MPI.SUM, root=0)
    
    if rank == 0:
        print(f"Total result from reduce: {total_result:.2f}")
    
    print(f"Rank {rank} completed successfully")

if __name__ == "__main__":
    main()


Overwriting mpi_demo.py


In [46]:

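# Profile each rank separately. OMPI_COMM_WORLD_RANK is set by Open MPI;
# other launchers expose the rank under a different variable (e.g. PMI_RANK).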
!mpiexec --allow-run-as-root -n 4 bash -c 'scalene --html --outfile profile-rank-${OMPI_COMM_WORLD_RANK}.html mpi_demo.py'

Rank 3/4 starting...
Rank 2/4 starting...
Rank 1/4 starting...
Rank 0/4 starting...
Rank 3: Computed result, sum = 5000400.23
Rank 2: Computed result, sum = 5000206.24
Rank 1: Computed result, sum = 4999641.77
Rank 0: Prepared data, total = 1250162.33
Gathered values: [0.0, 10.0, 20.0, 30.0]
Rank 2 completed successfully
Rank 3 completed successfully
Rank 1 completed successfully
Total result from reduce: 3001602.16
Rank 0 completed successfully
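
The command above relies on an Open MPI environment variable. If the same workflow has to run under different launchers, a small helper can look the rank up from whichever variable is present. This is a sketch under assumptions: `detect_rank` and `report_name` are hypothetical helpers, and the variable list covers only Open MPI (`OMPI_COMM_WORLD_RANK`), MPICH-style launchers (`PMI_RANK`), and Slurm's `srun` (`SLURM_PROCID`).

```python
import os

# Environment variables that common launchers set to the process rank.
RANK_ENV_VARS = ("OMPI_COMM_WORLD_RANK", "PMI_RANK", "SLURM_PROCID")


def detect_rank(environ=None, default=0):
    """Return the rank found in the environment, or `default` if none is set."""
    env = os.environ if environ is None else environ
    for name in RANK_ENV_VARS:
        value = env.get(name)
        if value is not None and value.isdigit():
            return int(value)
    return default


def report_name(prefix="profile-rank", environ=None):
    """Build a per-rank report filename such as 'profile-rank-3.html'."""
    return f"{prefix}-{detect_rank(environ)}.html"


if __name__ == "__main__":
    print(report_name())
```

Outside an MPI launcher none of these variables is set, so the helper falls back to rank 0.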


While Scalene can profile individual MPI ranks, it is probably better to use manual timing with `MPI.Wtime()` or an MPI-specific profiler such as mpiP.
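
Manual timing usually means bracketing regions of interest with clock reads and accumulating the differences. The sketch below uses `time.perf_counter` so it runs without MPI; in an `mpi4py` program one could pass `clock=MPI.Wtime` instead. The `region` context manager and `region_seconds` table are hypothetical names for this example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per named region.
region_seconds = defaultdict(float)


@contextmanager
def region(name, clock=time.perf_counter):
    """Accumulate the wall-clock time spent inside a named region."""
    start = clock()
    try:
        yield
    finally:
        region_seconds[name] += clock() - start


if __name__ == "__main__":
    with region("compute"):
        sum(i * i for i in range(500_000))
    with region("wait"):
        time.sleep(0.05)  # stands in for a blocking call such as comm.Barrier()
    for name, seconds in region_seconds.items():
        print(f"{name:8s} {seconds:.4f} s")
```

Each rank would collect its own table; comparing the tables across ranks shows which ranks spend their time computing and which spend it waiting.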