# Day 3 - Parallel and High-Performance Python: Threads, AsyncIO, NumPy, Numba, GPUs

Welcome to Day 3 of the advanced Python course. Today we focus on **performance** and **parallelism**.

We will connect Python language features with how modern CPUs and GPUs execute code. The examples are oriented around physics and mechanical **size/length measurement** data (surface roughness, thickness, diameter measurements, repeated measurements, etc.).

## What we will cover today

- CPU-bound vs I/O-bound work and a high-level mental model of performance
- The Global Interpreter Lock (GIL) and why it matters
- Threads, processes, and when to use which
- AsyncIO for I/O-bound tasks (simulated sensor queries)
- NumPy vectorized computations for numerical work
- Numba JIT compilation for accelerating pure Python loops
- A short overview of GPU tools: **cuDF**, **CuPy** and where they fit
- A look at Python 3.13 and the future: experimental JIT and optional GIL
- A complex end-to-end example: processing multiple measurement files in parallel

## Daily agenda and course flow

**09:00 - 10:30 (1h 30m)**  
- CPU-bound vs I/O-bound tasks, performance model
- GIL recap and impact on threading
- Threads for I/O-bound workloads

**10:30 - 10:45 (15m)**  
- Short break

**10:45 - 12:00 (1h 15m)**  
- Processes for CPU-bound workloads
- Intro to AsyncIO and simulated asynchronous measurements

**12:00 - 13:00 (1h)**  
- Lunch break

**13:00 - 14:45 (1h 45m)**  
- NumPy recap and vectorized computations for measurement data
- Practical NumPy vectorization patterns and exercises (physics themed)

**14:45 - 15:00 (15m)**  
- Short break

**15:00 - 16:30 (1h 30m)**  
- Numba JIT compilation for numerical loops
- Overview of GPU tools (cuDF, CuPy) and limitations
- Python 3.13: experimental JIT and optional GIL, and what it may mean
- Complex example: parallel processing of multiple measurement data sets

Throughout the day, we will mark good points to pause, ask questions, or take a sip of water. Try to roughly follow the timing so that we finish comfortably.

## Topic 1 - Performance mental model: CPU vs I/O, GIL recap

In real scientific and engineering work, **performance** usually means one of:

- How fast we can process a large dataset (throughput)
- How fast we get a single answer (latency)

Python performance is strongly influenced by:

- The speed of the underlying CPU or GPU
- Whether our code is **CPU-bound** or **I/O-bound**
- How much work we keep in **C-accelerated libraries** like NumPy
- The **Global Interpreter Lock (GIL)** in CPython

### CPU-bound vs I/O-bound

- **CPU-bound**: most time is spent doing computations on the CPU.
  - Example: computing statistics on millions of surface height samples.
- **I/O-bound**: most time is spent waiting for input/output.
  - Example: reading files from disk or waiting for a measurement device over the network.

### GIL recap

The CPython interpreter uses a **Global Interpreter Lock (GIL)**. Only one thread at a time can execute Python bytecode. This means:

- Multiple threads do **not** speed up CPU-bound pure Python code.
- Threads can still help with I/O-bound tasks, because while one thread waits for I/O, another can run.
- Libraries that release the GIL internally (NumPy, some C extensions) can still run in parallel in C.
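
The first point can be checked with a small timing sketch. It assumes a standard (GIL-enabled) CPython build; `run_sequential` and `run_threaded` are hypothetical helper names introduced only for this illustration. On such a build the threaded run usually takes about as long as the sequential one, although exact numbers depend on the machine and Python version.

```python
import threading
import time


def cpu_bound(n: int) -> int:
    """Sum of squares up to n, computed in a pure Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_sequential(n: int, repeats: int) -> float:
    """Call cpu_bound `repeats` times one after another; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        cpu_bound(n)
    return time.perf_counter() - start


def run_threaded(n: int, repeats: int) -> float:
    """Run the same calls in `repeats` threads; return elapsed seconds."""
    threads = [threading.Thread(target=cpu_bound, args=(n,)) for _ in range(repeats)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


seq_time = run_sequential(200_000, 4)
thr_time = run_threaded(200_000, 4)
print(f"Sequential: {seq_time:.3f} s")
print(f"Threaded:   {thr_time:.3f} s")
```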

For a deeper dive, see:

- CPython GIL overview: https://wiki.python.org/moin/GlobalInterpreterLock
- CPython implementation notes: https://docs.python.org/3/c-api/init.html


In [1]:
import time

def cpu_bound(n: int) -> int:
    """Fake CPU work: sum of squares up to n."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def io_bound(delay: float) -> None:
    """Fake I/O: sleep to simulate waiting for a device or disk."""
    time.sleep(delay)

start = time.perf_counter()
cpu_bound(2_000_000)
cpu_time = time.perf_counter() - start

start = time.perf_counter()
io_bound(0.2)
io_time = time.perf_counter() - start

print(f"CPU-bound function took ~{cpu_time:.3f} s")
print(f"I/O-bound function took ~{io_time:.3f} s")

CPU-bound function took ~0.131 s
I/O-bound function took ~0.202 s


### 💪 Exercise (advanced): Rough timing experiment

In this exercise you will run a tiny timing experiment to get a feel for the numbers.

1. Use the provided `cpu_bound` function.
2. Call it with different values, for example `50_000`, `200_000`, `500_000`, and measure the time for each.
3. For timing, use `time.perf_counter()` like in the example.
4. Print the `n` value and the time for each run.

Do this in a simple, sequential loop. The goal is to get a feeling for how the run time of a pure Python loop grows with `n` (roughly in proportion to `n`).

In [None]:
import time

# Advanced exercise starter
# TODO: run cpu_bound with several n values and measure the time.

def cpu_bound(n: int) -> int:
    total = 0
    for i in range(n):
        total += i * i
    return total

ns = [50_000, 200_000, 500_000]

# for n in ns:
#     start = ...  # time.perf_counter()
#     result = cpu_bound(n)
#     elapsed = ...
#     print(f"n={n}, elapsed={elapsed:.4f} s")

In [3]:
# Solution

import time

def cpu_bound(n: int) -> int:
    total = 0
    for i in range(n):
        total += i * i
    return total

ns = [50_000, 200_000, 500_000]
for n in ns:
    start = time.perf_counter()
    _ = cpu_bound(n)
    elapsed = time.perf_counter() - start
    print(f"n={n}, elapsed={elapsed:.4f} s")

n=50000, elapsed=0.0027 s
n=200000, elapsed=0.0131 s
n=500000, elapsed=0.0326 s


## Topic 2 - Threads for I/O-bound tasks

Because of the GIL, Python threads are usually **not** a good solution for CPU-bound acceleration. But for I/O-bound tasks (waiting for files, network, measurement devices), threads can help hide latency.

Typical pattern in lab / measurement setups:

- You need to query multiple instruments or devices.
- Each device responds in, say, 100 ms.
- With sequential code, talking to 5 devices takes roughly 5 x 100 ms.
- With threads, you can send requests in parallel and total time can be close to 100 ms.

Python module: [`threading`](https://docs.python.org/3/library/threading.html)

Note that threads do not bypass the GIL for CPU work, but they are very useful for overlapping multiple slow I/O operations.
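
The device pattern above can also be sketched with `concurrent.futures.ThreadPoolExecutor` from the standard library, which manages the threads and collects return values. This is an alternative to the manual `threading.Thread` approach used in this course, not a replacement for it; `query_device` and the 50 ms delays are assumptions made for the illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from random import uniform


def query_device(device_id: int, delay: float) -> tuple[int, float]:
    """Simulate one slow device query and return (device_id, value)."""
    time.sleep(delay)  # stands in for waiting on the network or a serial port
    return device_id, uniform(0.9, 1.1)


delays = [0.05, 0.05, 0.05, 0.05, 0.05]  # five hypothetical devices, 50 ms each

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(delays)) as executor:
    # executor.map yields results in the order of the inputs
    readings = list(executor.map(query_device, range(1, len(delays) + 1), delays))
elapsed = time.perf_counter() - start

print(readings)
print(f"Elapsed: {elapsed:.3f} s (sequential would be about {sum(delays):.2f} s)")
```

Because the function returns its value instead of printing it, no lock or shared list is needed here.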

In [2]:
import threading
import time
from random import uniform

def measure_sensor(sensor_id: int, delay: float) -> None:
    """Simulate a measurement: wait 'delay' seconds, then print a value."""
    time.sleep(delay)
    value = uniform(0.9, 1.1)  # normalized size measurement
    print(f"Sensor {sensor_id}: value={value:.4f}, delay={delay:.2f}s")

delays = [0.4, 0.3, 0.6, 0.2]

threads = []
start = time.perf_counter()
for i, d in enumerate(delays, start=1):
    t = threading.Thread(target=measure_sensor, args=(i, d))
    threads.append(t)
    t.start()

for t in threads:
    t.join()

elapsed = time.perf_counter() - start
print(f"Total elapsed (threaded): {elapsed:.3f} s")

Sensor 4: value=1.0400, delay=0.20s
Sensor 2: value=1.0798, delay=0.30s
Sensor 1: value=0.9523, delay=0.40s
Sensor 3: value=0.9374, delay=0.60s
Total elapsed (threaded): 0.622 s


### ✏ Exercise (easy): Compare sequential vs threaded measurements

Use the `measure_sensor` function idea to compare sequential and threaded execution.

1. Re-implement a simple `measure_sensor` that sleeps and prints a message.
2. First, call it sequentially in a loop for all delays.
3. Then, call it using `threading.Thread` like in the example.
4. Measure and print the elapsed time for both cases.

You can reuse the list `delays = [0.4, 0.3, 0.6, 0.2]`.

In [None]:
import threading
import time
from random import uniform

# TODO: implement sequential and threaded measurement and compare their times.

def measure_sensor(sensor_id: int, delay: float) -> None:
    time.sleep(delay)
    value = uniform(0.9, 1.1)
    print(f"Sensor {sensor_id}: value={value:.4f}, delay={delay:.2f}s")

delays = [0.4, 0.3, 0.6, 0.2]

# 1) Sequential run
# start = ...
# for i, d in enumerate(delays, start=1):
#     measure_sensor(i, d)
# seq_elapsed = ...
# print(f"Sequential elapsed: {seq_elapsed:.3f} s")

# 2) Threaded run
# threads = []
# start = ...
# for i, d in enumerate(delays, start=1):
#     ... create and start threads ...
# for t in threads:
#     t.join()
# thr_elapsed = ...
# print(f"Threaded elapsed: {thr_elapsed:.3f} s")

In [5]:
# Solution

import threading
import time
from random import uniform

def measure_sensor(sensor_id: int, delay: float) -> None:
    time.sleep(delay)
    value = uniform(0.9, 1.1)
    print(f"Sensor {sensor_id}: value={value:.4f}, delay={delay:.2f}s")

delays = [0.4, 0.3, 0.6, 0.2]

# Sequential
start = time.perf_counter()
for i, d in enumerate(delays, start=1):
    measure_sensor(i, d)
seq_elapsed = time.perf_counter() - start
print(f"Sequential elapsed: {seq_elapsed:.3f} s")

# Threaded
threads = []
start = time.perf_counter()
for i, d in enumerate(delays, start=1):
    t = threading.Thread(target=measure_sensor, args=(i, d))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
thr_elapsed = time.perf_counter() - start
print(f"Threaded elapsed: {thr_elapsed:.3f} s")

Sensor 1: value=1.0886, delay=0.40s
Sensor 2: value=0.9940, delay=0.30s
Sensor 3: value=0.9065, delay=0.60s
Sensor 4: value=1.0301, delay=0.20s
Sequential elapsed: 1.505 s
Sensor 4: value=1.0565, delay=0.20s
Sensor 2: value=0.9084, delay=0.30s
Sensor 1: value=0.9360, delay=0.40s
Sensor 3: value=1.0763, delay=0.60s
Threaded elapsed: 0.608 s


### 💪 Exercise (advanced): Collect results into a shared list

Right now, `measure_sensor` just prints values. In real life, we want to **store** results.

1. Modify `measure_sensor` so that it appends `(sensor_id, value)` to a shared list.
2. Use a `threading.Lock` to protect the shared list.
3. After all threads finish, print the collected list of results.

This pattern mimics collecting measurement results from multiple devices into a shared in-memory structure.

In [None]:
import threading
import time
from random import uniform

results = []
lock = threading.Lock()

def measure_and_store(sensor_id: int, delay: float) -> None:
    """TODO: simulate measurement and safely append to results."""
    time.sleep(delay)
    value = uniform(0.9, 1.1)
    # with lock:
    #     results.append(...)

delays = [0.4, 0.3, 0.6, 0.2]
threads = []
# TODO: start threads using measure_and_store and join them, then print results.
# for i, d in enumerate(delays, start=1):
#     t = threading.Thread(target=measure_and_store, args=(i, d))
#     threads.append(t)
#     t.start()

# for t in threads:
#     t.join()

# print(results)

In [6]:
# Solution

import threading
import time
from random import uniform

results = []
lock = threading.Lock()

def measure_and_store(sensor_id: int, delay: float) -> None:
    time.sleep(delay)
    value = uniform(0.9, 1.1)
    with lock:
        results.append((sensor_id, value))

delays = [0.4, 0.3, 0.6, 0.2]
threads = []
for i, d in enumerate(delays, start=1):
    t = threading.Thread(target=measure_and_store, args=(i, d))
    threads.append(t)
    t.start()

for t in threads:
    t.join()

print("Collected results:", results)

Collected results: [(4, 1.0242552744792301), (2, 0.9700888704122314), (1, 1.0925002232553185), (3, 0.9773592353806989)]


---
# Short break (10:30 - 10:45)
---

## Topic 3 - Processes for CPU-bound workloads

Because of the GIL, threads do not speed up CPU-bound pure Python loops. For heavy numerical work in pure Python, we can use **processes** instead of threads.

Python module: [`multiprocessing`](https://docs.python.org/3/library/multiprocessing.html)

A **process** has its own Python interpreter and its own GIL, so multiple processes can truly run in parallel on multiple CPU cores.

Typical pattern in scientific computing:

- Split a large dataset into chunks (e.g. surface height maps for different samples).
- Spawn a pool of worker processes.
- Each process computes statistics for one chunk.
- Collect results in the main process.
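
The split / compute / collect steps can be sketched without any processes at all, which keeps the logic easy to test. The helper names `split_into_chunks` and `sum_of_squares` are hypothetical. The built-in `map` below runs sequentially; the assumption is that `pool.map` from a `multiprocessing.Pool` could take its place to distribute the same calls over worker processes.

```python
import math
import random


def sum_of_squares(chunk: list[float]) -> float:
    """Partial result for one chunk: the sum of squared heights."""
    return sum(x * x for x in chunk)


def split_into_chunks(values: list[float], n_chunks: int) -> list[list[float]]:
    """Split values into contiguous pieces of nearly equal size."""
    size = math.ceil(len(values) / n_chunks)
    return [values[i:i + size] for i in range(0, len(values), size)]


heights = [random.uniform(-1e-6, 1e-6) for _ in range(10_000)]
chunks = split_into_chunks(heights, 4)

# Sequential stand-in for pool.map(sum_of_squares, chunks)
partial_sums = list(map(sum_of_squares, chunks))

# Combine the partial results in the main process
rms = math.sqrt(sum(partial_sums) / len(heights))
print(f"RMS roughness: {rms:.3e} m")
```

Returning a small partial result (one float per chunk) rather than a transformed copy of the data keeps the amount of data sent back to the main process low.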

Downside: processes have more overhead than threads (especially for sending large arrays between processes), but they allow true CPU-core scaling for pure Python loops.
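
A rough way to get a feeling for this overhead: `multiprocessing` serializes arguments and results with `pickle`, so timing a pickle round trip of one sample gives a lower bound on the transfer cost. This is only a sketch; the sample size mirrors the example data in this topic, and real costs also include process startup and inter-process communication.

```python
import pickle
import random
import time

# One "sample" of 100_000 height values, similar to one pool.map argument
sample = [random.uniform(-1e-6, 1e-6) for _ in range(100_000)]

start = time.perf_counter()
payload = pickle.dumps(sample)    # serialized form sent to a worker
restored = pickle.loads(payload)  # what the worker has to rebuild
elapsed = time.perf_counter() - start

print(f"Pickled size: {len(payload) / 1e6:.2f} MB")
print(f"Round trip:   {elapsed * 1e3:.1f} ms")
```

If the per-sample computation is much shorter than this round trip, a process pool is unlikely to pay off.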

In [None]:
# RUN FROM .py INSTEAD OF NOTEBOOK

from multiprocessing import Pool, cpu_count
import math
import random
import time

def compute_rms(values):
    """Compute RMS roughness of a list of heights."""
    s = 0.0
    for x in values:
        s += x * x
    return math.sqrt(s / len(values))

def main():
    # Create fake measurement data: 4 samples with 100_000 points each
    samples = [[random.uniform(-1e-6, 1e-6) for _ in range(100_000)] for _ in range(4)]

    print(f"Using up to {cpu_count()} CPU cores")
    start = time.perf_counter()
    with Pool() as pool:
        rms_values = pool.map(compute_rms, samples)
    elapsed = time.perf_counter() - start
    print(f"Total elapsed (process pool): {elapsed:.3f} s")
    
    print("RMS roughness per sample:")
    for i, rms in enumerate(rms_values, start=1):
        print(f"  Sample {i}: {rms:.3e} m")

if __name__ == "__main__":
    main()


Using up to 8 CPU cores


### ✏ Exercise (easy): Parallel average diameter per batch

Imagine you have several batches of diameter measurements for different parts. Each batch is a list of floats.

1. Implement a function `average(values)` that computes the mean of the list.
2. Create a list of batches (for example 3-5 lists with random diameters in millimeters).
3. Use `multiprocessing.Pool.map` to compute the average diameter for each batch in parallel.
4. Print the result for each batch.

Keep the batch sizes small enough that the code finishes quickly.

In [None]:
# RUN FROM .py INSTEAD OF NOTEBOOK

from multiprocessing import Pool
import random

# TODO: compute average diameter per batch in parallel.

def average(values):
    # return ...
    pass  # placeholder so the file parses; replace with your implementation

# Create example batches of diameters in mm
batches = [
    [random.uniform(9.95, 10.05) for _ in range(50)],
    [random.uniform(4.95, 5.05) for _ in range(80)],
    [random.uniform(19.9, 20.1) for _ in range(100)],
]

def main():
    # with Pool() as pool:
    #     averages = ...
    # for i, avg in enumerate(averages, start=1):
    #     print(f"Batch {i}: average diameter = {avg:.3f} mm")
    pass  # placeholder so the file parses; replace with your implementation

if __name__ == "__main__":
    main()


In [None]:
# Solution RUN FROM .py INSTEAD OF NOTEBOOK

from multiprocessing import Pool
import random

def average(values):
    total = 0.0
    for v in values:
        total += v
    return total / len(values)

batches = [
    [random.uniform(9.95, 10.05) for _ in range(50)],
    [random.uniform(4.95, 5.05) for _ in range(80)],
    [random.uniform(19.9, 20.1) for _ in range(100)],
]

def main():
    with Pool() as pool:
        averages = pool.map(average, batches)
    
    for i, avg in enumerate(averages, start=1):
        print(f"Batch {i}: average diameter = {avg:.3f} mm")

if __name__ == "__main__":
    main()


### 💪 Exercise (advanced): Parallel min, max, mean

Extend the previous idea to compute **three** statistics per batch: `(min, max, mean)`.

1. Write a function `stats(values)` that returns a tuple `(min_value, max_value, mean_value)`.
2. Reuse or create batches of diameter or thickness measurements.
3. Use a process pool to compute stats for each batch in parallel.
4. Print the three statistics for each batch in a readable way.

This is close to what you would actually do with per-sample measurement datasets.

In [None]:
# RUN FROM .py INSTEAD OF NOTEBOOK

from multiprocessing import Pool
import random

def stats(values):
    """TODO: return (min, max, mean) for the list."""
    # m = min(values)
    # M = max(values)
    # mean = ...
    # return m, M, mean
    raise NotImplementedError  # remove once implemented

batches = [
    [random.uniform(9.95, 10.05) for _ in range(50)],
    [random.uniform(4.95, 5.05) for _ in range(80)],
    [random.uniform(19.9, 20.1) for _ in range(100)],
]

def main():
    # TODO: use Pool to compute stats in parallel and print them.
    pass  # placeholder so the file parses; replace with your implementation

if __name__ == "__main__":
    main()


In [None]:
# Solution RUN FROM .py INSTEAD OF NOTEBOOK

from multiprocessing import Pool
import random

def stats(values):
    m = min(values)
    M = max(values)
    total = 0.0
    for v in values:
        total += v
    mean = total / len(values)
    return m, M, mean

batches = [
    [random.uniform(9.95, 10.05) for _ in range(50)],
    [random.uniform(4.95, 5.05) for _ in range(80)],
    [random.uniform(19.9, 20.1) for _ in range(100)],
]

def main():
    with Pool() as pool:
        results = pool.map(stats, batches)
    
    for i, (m, M, mean) in enumerate(results, start=1):
        print(f"Batch {i}: min={m:.3f}, max={M:.3f}, mean={mean:.3f}")

if __name__ == "__main__":
    main()


## Topic 4 - AsyncIO for concurrent I/O

Python's [`asyncio`](https://docs.python.org/3/library/asyncio.html) module lets you write **concurrent** programs using the `async` / `await` syntax.

Unlike `threading` or `multiprocessing`, AsyncIO usually runs **in a single OS thread**. It uses an **event loop** that rapidly switches between many "tasks" (coroutines) whenever they are waiting for I/O.

### Key concepts

- **Coroutine**: a function defined with `async def`. Calling it does **not** run it, it returns a coroutine object (a bit like creating a `Task` in LabVIEW or a job in a scheduler).
  ```python
  async def measure():
      ...
  c = measure()  # coroutine object, not a result yet
  ```
- **Event loop**: a scheduler that runs coroutines. In scripts you usually start it with `asyncio.run(main())`. In Jupyter, the notebook already runs an event loop for you, so you can use `await` directly at the top level.
- **`await`**: suspends the current coroutine until the awaited operation is finished, and lets the event loop run other tasks in the meantime.
  - You can only use `await` **inside** `async def` functions (or at the top level in special environments like Jupyter).
  - This is like saying "I am waiting for the ADC / network / disk - while I wait, please do something else".
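
A runnable version of these concepts for a plain script (not a notebook cell) might look like the sketch below. The `measure` coroutine and its return value are placeholders. In Jupyter, `asyncio.run` cannot be used because a loop is already running; there you would write `await measure()` instead.

```python
import asyncio
import inspect


async def measure() -> float:
    """Placeholder measurement: wait briefly, then return a fixed value."""
    await asyncio.sleep(0.01)
    return 1.0


coro = measure()                      # nothing has run yet
is_coro = inspect.iscoroutine(coro)   # True: we only hold a coroutine object
result = asyncio.run(coro)            # start an event loop and run it to completion

print("Is coroutine object:", is_coro)
print("Result:", result)
```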

### When does AsyncIO help?

AsyncIO is ideal when your program is **I/O-bound** and spends most of its time waiting:

- Many concurrent HTTP requests (web scraping, microservices).
- Talking to many instruments over TCP/serial/USB at once.
- File or database operations that frequently wait on the OS.

It does **not** make pure Python CPU loops faster. The Global Interpreter Lock (GIL) still limits CPU-bound code. For heavy numeric work you usually use:

- **NumPy / vectorization** (covered this afternoon),
- **C / C++ extensions**, or
- **multiprocessing** to use multiple cores.

### How does control flow work?

Think of the event loop as a conductor for an orchestra of coroutines:

1. You define `async def` coroutines.
2. You create tasks from them (for example with `asyncio.create_task`).
3. The event loop picks a task, runs it until it hits an `await` on something that is not yet ready (e.g. `await asyncio.sleep(0.5)` or an async network call).
4. At that `await`, the coroutine "pauses", the loop switches to another ready task.
5. When the awaited I/O operation completes, the loop resumes the paused coroutine right after the `await`.

Because tasks yield control at `await`, many of them can **make progress in overlapping time** - even though there is a single underlying OS thread.
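
The switching described in the steps above can be made visible by logging when each coroutine starts and ends. This is a minimal sketch for a script; the `worker` coroutine, the task names and the delays are chosen only for illustration.

```python
import asyncio

log: list[str] = []


async def worker(name: str, delay: float) -> None:
    log.append(f"{name} start")
    await asyncio.sleep(delay)  # pause here; the loop can run other tasks
    log.append(f"{name} end")


async def main() -> None:
    # A is scheduled first but waits longer than B
    await asyncio.gather(worker("A", 0.05), worker("B", 0.02))


asyncio.run(main())  # in a notebook, use `await main()` instead
print(log)  # ['A start', 'B start', 'B end', 'A end']
```

Both workers start before either one finishes, which shows that `A` gave up control at its `await` rather than blocking the thread.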

### Frequently asked questions

**Q: Why can I not `await` in a normal function?**  
Because the Python grammar only allows `await` inside `async def` (or in special interactive environments). A normal `def` function does not know how to suspend and resume itself. If you want to use `await`, you must either:
- make the function `async def`, or
- call an async function from the **outside** using `asyncio.run(my_async_main())` in a script.

**Q: When is the event loop actually running?**  
- In a **script**, when you call `asyncio.run(main())` (which internally creates and runs an event loop).
- In **Jupyter**, there is already a loop running; the notebook integrates with it so `await` works at the top level. That is why you can simply do `await main()` in a notebook cell.

**Q: How do I actually get real benefits?**  
1. Identify I/O operations that can run in parallel (waiting for TCP sockets, serial ports, HTTP, `asyncio.sleep`, etc.).
2. Wrap them in `async def` coroutines that `await` the I/O.
3. Start many tasks with `asyncio.create_task` or `asyncio.gather`.
4. Let the event loop interleave their waiting periods.

Below we compare sequential vs async measurements to see the effect in practice.

**Further reading:**
- Official docs: https://docs.python.org/3/library/asyncio.html
- "Async IO in Python: A Complete Walkthrough" (Real Python): https://realpython.com/async-io-python/


In [1]:
import asyncio
import time
from random import uniform

# Synchronous version: measure each sensor one after the other

def measure_sensor_sync(sensor_id: int, delay: float) -> float:
    """Simulate a blocking sensor measurement using time.sleep."""
    time.sleep(delay)
    value = uniform(0.9, 1.1)
    print(f"[sync ] Sensor {sensor_id}: value={value:.4f}, delay={delay:.2f}s")
    return value

# Async version: non-blocking wait with asyncio.sleep

async def measure_sensor_async(sensor_id: int, delay: float) -> float:
    """Async measurement: uses await asyncio.sleep instead of time.sleep."""
    await asyncio.sleep(delay)
    value = uniform(0.9, 1.1)
    print(f"[async] Sensor {sensor_id}: value={value:.4f}, delay={delay:.2f}s")
    return value

async def main_compare_sync_vs_async() -> None:
    delays = [0.4, 0.3, 0.6, 0.2]

    print("\n--- Sequential sync measurements ---")
    start_sync = time.perf_counter()
    sync_values = [measure_sensor_sync(i, d) for i, d in enumerate(delays, start=1)]
    elapsed_sync = time.perf_counter() - start_sync
    print(f"Sync total elapsed: {elapsed_sync:.3f} s")

    print("\n--- Concurrent async measurements ---")
    start_async = time.perf_counter()
    # Create tasks for all sensors
    tasks = [asyncio.create_task(measure_sensor_async(i, d))
             for i, d in enumerate(delays, start=1)]
    # Wait for all tasks to finish
    async_values = await asyncio.gather(*tasks)
    elapsed_async = time.perf_counter() - start_async
    print(f"Async total elapsed: {elapsed_async:.3f} s")

    print("\nSync values: ", [f"{v:.4f}" for v in sync_values])
    print("Async values:", [f"{v:.4f}" for v in async_values])
    print("Max individual delay:", max(delays), "s")

# In a Jupyter notebook you can run the async main with top-level await:
await main_compare_sync_vs_async()


--- Sequential sync measurements ---
[sync ] Sensor 1: value=1.0224, delay=0.40s
[sync ] Sensor 2: value=1.0988, delay=0.30s
[sync ] Sensor 3: value=0.9386, delay=0.60s
[sync ] Sensor 4: value=1.0355, delay=0.20s
Sync total elapsed: 1.504 s

--- Concurrent async measurements ---
[async] Sensor 4: value=0.9681, delay=0.20s
[async] Sensor 2: value=1.0904, delay=0.30s
[async] Sensor 1: value=0.9783, delay=0.40s
[async] Sensor 3: value=1.0900, delay=0.60s
Async total elapsed: 0.609 s

Sync values:  ['1.0224', '1.0988', '0.9386', '1.0355']
Async values: ['0.9783', '1.0904', '1.0900', '0.9681']
Max individual delay: 0.6 s


### ✏ Exercise (easy): Measure async speedup factor

Using the example above as a starting point:

1. Copy the idea of `measure_sensor_async` and `asyncio.gather` into a new async function, for example `async_measure_many_sensors()`.
2. Use a list of delays like `[0.1, 0.5, 0.2, 0.8, 0.3]`.
3. Measure
   - how long it would take to run all measurements **sequentially** (with `time.sleep`), and
   - how long it takes to run them **concurrently** with AsyncIO.
4. Print the **speedup factor** as `sync_time / async_time`.

You already saw everything you need:
- `time.perf_counter()` for timing,
- `asyncio.create_task` + `asyncio.gather` for running tasks concurrently,
- `await` inside `async def`.

Think about why the async time is close to the **maximum** delay instead of the **sum** of delays.

In [None]:
import asyncio
import time
from random import uniform

# You can reuse or adapt these building blocks

def measure_sensor_sync(sensor_id: int, delay: float) -> float:
    time.sleep(delay)
    value = uniform(0.9, 1.1)
    print(f"[sync ] Sensor {sensor_id}: value={value:.4f}, delay={delay:.2f}s")
    return value

async def measure_sensor_async(sensor_id: int, delay: float) -> float:
    await asyncio.sleep(delay)
    value = uniform(0.9, 1.1)
    print(f"[async] Sensor {sensor_id}: value={value:.4f}, delay={delay:.2f}s")
    return value

async def async_measure_many_sensors() -> None:
    delays = [0.1, 0.5, 0.2, 0.8, 0.3]

    # TODO:
    # 1) Measure sequential sync time using measure_sensor_sync.
    # 2) Measure async time by creating tasks and awaiting asyncio.gather.
    # 3) Print both times and the speedup factor (sync_time / async_time).
    pass

# TODO: run async_measure_many_sensors from this cell using await.
# Example:
# await async_measure_many_sensors()

In [3]:
# Example solution for the easy async speedup exercise

import asyncio
import time
from random import uniform


def measure_sensor_sync(sensor_id: int, delay: float) -> float:
    time.sleep(delay)
    value = uniform(0.9, 1.1)
    return value

async def measure_sensor_async(sensor_id: int, delay: float) -> float:
    await asyncio.sleep(delay)
    value = uniform(0.9, 1.1)
    return value

async def async_measure_many_sensors() -> None:
    delays = [0.1, 0.5, 0.2, 0.8, 0.3]

    # Sequential measurements
    start_sync = time.perf_counter()
    sync_values = [measure_sensor_sync(i, d) for i, d in enumerate(delays, start=1)]
    elapsed_sync = time.perf_counter() - start_sync

    # Async concurrent measurements
    start_async = time.perf_counter()
    tasks = [asyncio.create_task(measure_sensor_async(i, d))
             for i, d in enumerate(delays, start=1)]
    async_values = await asyncio.gather(*tasks)
    elapsed_async = time.perf_counter() - start_async

    speedup = elapsed_sync / elapsed_async if elapsed_async > 0 else float("inf")

    print(f"Sync elapsed:  {elapsed_sync:.3f} s")
    print(f"Async elapsed: {elapsed_async:.3f} s")
    print(f"Speedup:       {speedup:.2f}x")
    print("Sync values: ", [f"{v:.4f}" for v in sync_values])
    print("Async values:", [f"{v:.4f}" for v in async_values])

await async_measure_many_sensors()

Sync elapsed:  1.904 s
Async elapsed: 0.807 s
Speedup:       2.36x
Sync values:  ['1.0394', '1.0554', '0.9323', '0.9256', '0.9803']
Async values: ['0.9262', '1.0007', '1.0895', '0.9597', '0.9325']
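
The measured speedup can be compared with a simple estimate: sequential waits add up, while concurrent waits overlap, so the longest one dominates. The sketch below ignores scheduling overhead and is only an idealized model.

```python
delays = [0.1, 0.5, 0.2, 0.8, 0.3]

sequential_estimate = sum(delays)  # each wait starts after the previous one ends
concurrent_estimate = max(delays)  # all waits overlap; the longest one dominates
ideal_speedup = sequential_estimate / concurrent_estimate

print(f"Sequential estimate: {sequential_estimate:.1f} s")
print(f"Concurrent estimate: {concurrent_estimate:.1f} s")
print(f"Ideal speedup:       {ideal_speedup:.3f}x")
```

The ideal value of about 2.375 is close to the 2.36x measured above; the small gap comes from timer and event loop overhead.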


---
# Lunch break (12:00 - 13:00)
---

## Topic 5 - NumPy recap and basic vectorization

[NumPy](https://numpy.org/) is the standard array library for numerical computing in Python.

Key ideas:

- A `numpy.ndarray` is an efficient, typed, homogeneous n-dimensional array.
- Many operations are implemented in optimized C code.
- Operations like `a + b`, `a * b` on arrays are **vectorized**: they run in fast loops in C rather than Python.

In measurement and physics workflows, NumPy is ideal for:

- Handling long arrays of measured values (heights, diameters, thicknesses, voltages).
- Computing statistics and transformations (offset correction, unit conversion, normalization).
- Applying elementwise functions (e.g. calibration curves, non-linear corrections).
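
To illustrate the difference between a Python loop and a vectorized expression, the sketch below performs the same unit conversion both ways and times each. The data is randomly generated for the illustration, and the timings depend on the machine, so no specific speedup is claimed here.

```python
import time

import numpy as np

rng = np.random.default_rng(seed=0)
heights_um = rng.normal(loc=100.0, scale=0.2, size=200_000)  # synthetic data

# Pure Python loop over the values
start = time.perf_counter()
loop_mm = [h / 1000.0 for h in heights_um.tolist()]
loop_time = time.perf_counter() - start

# Vectorized: one array expression, the loop runs in C
start = time.perf_counter()
vec_mm = heights_um / 1000.0
vec_time = time.perf_counter() - start

print(f"Loop:       {loop_time * 1e3:.2f} ms")
print(f"Vectorized: {vec_time * 1e3:.2f} ms")
print("Same result:", np.allclose(loop_mm, vec_mm))
```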

Make sure you have NumPy installed. In this course environment it should already be available.

In [4]:
import numpy as np

# Thickness measurements in micrometers for 10 samples
thickness_um = np.array([100.2, 99.8, 100.5, 100.1, 99.9, 100.3, 100.0, 99.7, 100.4, 100.1])
print("Raw thickness (um):", thickness_um)
print("Shape:", thickness_um.shape, "dtype:", thickness_um.dtype)

# Convert to millimeters
thickness_mm = thickness_um / 1000.0
print("Thickness (mm):", thickness_mm)

# Basic statistics
print("Mean (um):", thickness_um.mean())
print("Std (um):", thickness_um.std())

Raw thickness (um): [100.2  99.8 100.5 100.1  99.9 100.3 100.   99.7 100.4 100.1]
Shape: (10,) dtype: float64
Thickness (mm): [0.1002 0.0998 0.1005 0.1001 0.0999 0.1003 0.1    0.0997 0.1004 0.1001]
Mean (um): 100.1
Std (um): 0.24494897427831783
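
One detail worth knowing: `ndarray.std()` uses `ddof=0` by default, so it divides by N (population standard deviation). For a small set of repeated measurements, the sample standard deviation (dividing by N - 1) is often what is wanted; it is available via `ddof=1`. A short sketch with the same thickness data:

```python
import numpy as np

thickness_um = np.array([100.2, 99.8, 100.5, 100.1, 99.9, 100.3, 100.0, 99.7, 100.4, 100.1])

std_population = thickness_um.std()    # ddof=0: divide by N
std_sample = thickness_um.std(ddof=1)  # ddof=1: divide by N - 1

print("Population std (um):", std_population)
print("Sample std (um):    ", std_sample)
```

For large N the two values converge, so the choice matters mainly for short measurement series.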


### ✏ Exercise (easy): Diameter conversion and simple statistics

1. Create a NumPy array of diameters in millimeters for at least 8 parts.
2. Convert them to micrometers (multiply by 1000).
3. Compute and print the mean and standard deviation in micrometers.

Use the same patterns as in the example above.

In [None]:
import numpy as np

# TODO: diameter conversion and basic statistics with NumPy.

# diam_mm = np.array([...])
# diam_um = ...
# mean_um = ...
# std_um = ...
# print("Diameters (um):", diam_um)
# print("Mean (um):", mean_um)
# print("Std (um):", std_um)

In [5]:
# Solution

import numpy as np

diam_mm = np.array([10.01, 9.99, 10.02, 10.00, 10.03, 9.98, 10.01, 9.97])
diam_um = diam_mm * 1000.0
mean_um = diam_um.mean()
std_um = diam_um.std()
print("Diameters (um):", diam_um)
print("Mean (um):", mean_um)
print("Std (um):", std_um)

Diameters (um): [10010.  9990. 10020. 10000. 10030.  9980. 10010.  9970.]
Mean (um): 10001.25
Std (um): 18.99835519196333


### 💪 Exercise (advanced): Offset correction and normalized values

Imagine your thickness sensor has a calibration offset error of +0.3 micrometers.

1. Create an array of measured thicknesses in micrometers.
2. Subtract the offset from all values to get corrected thicknesses.
3. Compute the mean and standard deviation of the corrected values.
4. Compute a normalized array where you subtract the mean and divide by the standard deviation.

This pattern (offset correction + normalization) is common in data preprocessing.

In [None]:
import numpy as np

# TODO: offset correction and normalization.
# thickness_um = np.array([...])
# offset = 0.3
# corrected = ...  # subtract offset
# mean_corr = ...
# std_corr = ...
# normalized = ...  # (corrected - mean_corr) / std_corr
# print("Corrected thickness (um):", corrected)
# print("Normalized values:", normalized)

In [6]:
# Solution

import numpy as np

thickness_um = np.array([100.2, 100.0, 99.9, 100.4, 100.1, 99.8])
offset = 0.3
corrected = thickness_um - offset
mean_corr = corrected.mean()
std_corr = corrected.std()
normalized = (corrected - mean_corr) / std_corr
print("Corrected thickness (um):", corrected)
print("Normalized values:", normalized)

Corrected thickness (um): [ 99.9  99.7  99.6 100.1  99.8  99.5]
Normalized values: [ 0.6761234  -0.3380617  -0.84515425  1.69030851  0.16903085 -1.35224681]


## Topic 6 - NumPy vectorized computations in practice

Now we go deeper into NumPy vectorization. We will use:

- Elementwise operations on arrays
- Boolean masks and filtering
- Aggregations along axes (2D arrays)
- Simple physics-themed computations

NumPy allows you to write **array expressions** instead of Python `for` loops. These expressions are executed in optimized C loops under the hood.
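
To make the contrast concrete, the sketch below counts in-spec parts twice: once with an explicit loop and once with an array expression. The diameters are made up for the illustration and deliberately avoid values exactly on the tolerance boundary, where floating-point rounding can decide the outcome.

```python
import numpy as np

diam_mm = np.array([10.01, 9.96, 10.05, 10.02, 9.94, 10.00, 9.99])
target = 10.00
tolerance = 0.03

# Loop version: one comparison per Python iteration
count_loop = 0
for d in diam_mm:
    if abs(d - target) <= tolerance:
        count_loop += 1

# Array-expression version: the comparison runs over the whole array at once
in_spec = np.abs(diam_mm - target) <= tolerance
count_vec = int(in_spec.sum())    # True counts as 1
fraction = float(in_spec.mean())  # share of parts in spec

print("In spec (loop):      ", count_loop)
print("In spec (vectorized):", count_vec)
print(f"Fraction in spec:     {fraction:.2f}")
```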

Useful links:

- NumPy user guide: https://numpy.org/doc/stable/user/index.html
- NumPy quickstart: https://numpy.org/doc/stable/user/quickstart.html


In [7]:
import numpy as np

# Example: filter diameters by tolerance
diam_mm = np.array([10.01, 9.97, 10.05, 10.02, 9.94, 10.00, 9.99])
target = 10.00
tolerance = 0.03

deviation = diam_mm - target
mask_in_spec = np.abs(deviation) <= tolerance
print("Diameters:", diam_mm)
print("Deviation:", deviation)
print("In spec mask:", mask_in_spec)
print("In spec diameters:", diam_mm[mask_in_spec])

Diameters: [10.01  9.97 10.05 10.02  9.94 10.    9.99]
Deviation: [ 0.01 -0.03  0.05  0.02 -0.06  0.   -0.01]
In spec mask: [ True  True False  True False  True  True]
In spec diameters: [10.01  9.97 10.02 10.    9.99]


### ✏ Exercise (easy): Filter out-of-spec samples

1. Create a NumPy array of length measurements in millimeters.
2. Define a target and tolerance.
3. Build a mask of samples that are **out of spec** (absolute deviation larger than tolerance).
4. Print the array of out-of-spec values and their count.

Use boolean masks like in the example above.

In [None]:
import numpy as np

# TODO: filter out-of-spec length values using a boolean mask.

# lengths_mm = np.array([...])
# target = ...
# tolerance = ...
# deviation = ...
# mask_out = ...  # abs(deviation) > tolerance
# print("Out-of-spec values:", lengths_mm[mask_out])
# print("Count:", mask_out.sum())

In [8]:
# Solution

import numpy as np

lengths_mm = np.array([49.98, 50.02, 49.95, 50.04, 50.01, 49.92])
target = 50.00
tolerance = 0.03
deviation = lengths_mm - target
mask_out = np.abs(deviation) > tolerance
print("Out-of-spec values:", lengths_mm[mask_out])
print("Count:", int(mask_out.sum()))

Out-of-spec values: [49.95 50.04 49.92]
Count: 3


### 💪 Exercise (advanced): Repeated measurements per part (2D arrays)

Suppose you measure the same part multiple times to estimate uncertainty.

1. Create a 2D NumPy array `data` of shape `(n_parts, n_repeats)` containing diameters in mm.
2. Compute the mean diameter **per part** (axis 1).
3. Compute the standard deviation per part (axis 1).
4. Compute the overall mean of all measurements.
5. Print the per-part mean and standard deviation.

You should use `data.mean(axis=1)` and `data.std(axis=1)`.

In [None]:
import numpy as np

# TODO: repeated measurement statistics with 2D arrays.

# Example: 4 parts, 5 repeated measurements each
# data = np.array([
#     [...],
#     [...],
# ])

# mean_per_part = ...  # data.mean(axis=1)
# std_per_part = ...   # data.std(axis=1)
# overall_mean = ...   # data.mean()

# print("Mean per part:", mean_per_part)
# print("Std per part:", std_per_part)
# print("Overall mean:", overall_mean)

In [9]:
# Solution

import numpy as np

data = np.array([
    [10.01, 9.99, 10.00, 10.02, 9.98],
    [4.99, 5.01, 5.00, 5.02, 4.98],
    [19.98, 20.00, 20.01, 19.99, 20.02],
    [29.99, 30.01, 30.00, 30.02, 29.98],
])

mean_per_part = data.mean(axis=1)
std_per_part = data.std(axis=1)
overall_mean = data.mean()

print("Mean per part:", mean_per_part)
print("Std per part:", std_per_part)
print("Overall mean:", overall_mean)

Mean per part: [10.  5. 20. 30.]
Std per part: [0.01414214 0.01414214 0.01414214 0.01414214]
Overall mean: 16.25
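
Note that `data.std(axis=1)` divides by N (population standard deviation, `ddof=0`). When a handful of repeats is used to estimate measurement uncertainty, the sample standard deviation (`ddof=1`, dividing by N - 1) is a common alternative. The sketch below reuses the repeats of the first part from the solution; the standard error line is an added illustration and not part of the exercise.

```python
import numpy as np

# Five repeated measurements of one part (mm), taken from the solution above
repeats = np.array([10.01, 9.99, 10.00, 10.02, 9.98])

std_population = repeats.std()        # ddof=0: divide by N
std_sample = repeats.std(ddof=1)      # ddof=1: divide by N - 1
# Standard error of the mean (illustrative extra, not part of the exercise)
std_error = std_sample / np.sqrt(repeats.size)

print(f"Population std (ddof=0): {std_population:.5f} mm")
print(f"Sample std (ddof=1):     {std_sample:.5f} mm")
print(f"Standard error of mean:  {std_error:.5f} mm")
```

With only five repeats the two estimates differ noticeably, so it is worth stating which convention a reported uncertainty uses.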


### Vectorized physics-style computation example

As a small example, imagine you measured heights `h` at given positions `x` along a line, and you want to approximate the area under the curve using the trapezoidal rule.

The trapezoidal rule for arrays `x` and `h` can be written as:

`area ≈ sum( (h[i] + h[i+1]) / 2 * (x[i+1] - x[i]) )`

We can implement this with pure NumPy, using slicing, without explicit Python loops.

In [10]:
import numpy as np

# Simulate positions in mm and heights in micrometers
x = np.linspace(0.0, 10.0, 1001)  # 0..10 mm, 1001 points
h_um = 2.0 * np.sin(2 * np.pi * x / 10.0) + 10.0  # some periodic height pattern in um

# Trapezoidal rule using vectorized slices
dx = x[1:] - x[:-1]
h_avg = (h_um[1:] + h_um[:-1]) / 2.0
area_um_mm = np.sum(h_avg * dx)  # units: um * mm

print(f"Approximate area (um*mm): {area_um_mm:.3f}")

Approximate area (um*mm): 100.000
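
As a cross-check, NumPy also ships a built-in trapezoidal integrator. The sketch below compares it with the sliced version on the same simulated profile. It assumes the function is called `np.trapezoid` on NumPy 2.x and `np.trapz` on older releases, and picks whichever is available.

```python
import numpy as np

# Same simulated profile as in the example above
x = np.linspace(0.0, 10.0, 1001)
h_um = 2.0 * np.sin(2 * np.pi * x / 10.0) + 10.0

# Manual trapezoidal rule with slices
area_manual = np.sum((h_um[1:] + h_um[:-1]) / 2.0 * (x[1:] - x[:-1]))

# Built-in integrator: np.trapezoid in NumPy 2.x, np.trapz in older releases
trapezoid = getattr(np, "trapezoid", None) or np.trapz
area_builtin = trapezoid(h_um, x)

print(f"Manual:   {area_manual:.6f} um*mm")
print(f"Built-in: {area_builtin:.6f} um*mm")
```

Both values should agree to within floating-point rounding, because the sine term integrates to zero over a full period and leaves 10 um * 10 mm.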


### 💪 Exercise (advanced - optional): Chain of vectorized operations

Combine several vectorized operations into a small pipeline:

1. Simulate an array of thicknesses in micrometers with `np.random.normal`.
2. Apply an offset correction.
3. Clip the values to a realistic range using `np.clip`.
4. Convert to millimeters.
5. Compute and print the mean and standard deviation in both units.

Do not use explicit Python loops. Use NumPy array operations only.

In [None]:
import numpy as np

# TODO: implement the vectorized processing pipeline.

# n = 1000
# thickness_um = np.random.normal(loc=100.0, scale=0.5, size=n)
# offset = 0.2
# corrected = ...
# clipped = ...  # np.clip
# thickness_mm = ...
# print statistics

In [11]:
# Solution

import numpy as np

n = 1000
thickness_um = np.random.normal(loc=100.0, scale=0.5, size=n)
offset = 0.2
corrected = thickness_um - offset
clipped = np.clip(corrected, 98.0, 102.0)
thickness_mm = clipped / 1000.0

print("Mean (um):", clipped.mean())
print("Std (um):", clipped.std())
print("Mean (mm):", thickness_mm.mean())
print("Std (mm):", thickness_mm.std())

Mean (um): 99.80521127498606
Std (um): 0.5115783054404365
Mean (mm): 0.09980521127498607
Std (mm): 0.0005115783054404363
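
The numbers above change on every run because `np.random.normal` draws from NumPy's global, unseeded random state. If you want reproducible results (for example in a test or a report), one option is a seeded `Generator`. This is a minimal sketch of the same pipeline; the seed value `42` is arbitrary.

```python
import numpy as np

# A seeded Generator makes the simulated data reproducible (seed is arbitrary)
rng = np.random.default_rng(42)

thickness_um = rng.normal(loc=100.0, scale=0.5, size=1000)
clipped = np.clip(thickness_um - 0.2, 98.0, 102.0)
thickness_mm = clipped / 1000.0

print("Mean (um):", clipped.mean())
print("Std (mm):", thickness_mm.std())

# Re-creating the generator with the same seed reproduces the same values
rng_again = np.random.default_rng(42)
same = np.array_equal(thickness_um, rng_again.normal(loc=100.0, scale=0.5, size=1000))
print("Reproducible:", same)
```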


---
# Short break (14:45 - 15:00)

Final stretch: Numba, GPUs, Python 3.13, and a complex example.
---

## Topic 7 - Numba JIT compilation

[Numba](https://numba.pydata.org/) is a Just-In-Time (JIT) compiler for Python functions that operate mainly on NumPy arrays and numbers.

Basic usage:

```python
from numba import njit

@njit
def f(x):
    # numerical code
    ...
```

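When you first call `f`, Numba compiles it to machine code (using LLVM) for the argument types it receives. Subsequent calls with the same types reuse the compiled code and typically run much faster than the interpreted loop, often close to C speed.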

### Important: how Numba and NumPy interact

- NumPy operations like `a + b` or `a.mean()` are already implemented in C.
- However, **Python loops around those operations** still run in the Python interpreter.
- Numba helps most when you have **custom loops and logic** that cannot be expressed as simple NumPy expressions.

**Answering the question:** "If Numba works with NumPy + Python code, and NumPy is already implemented in C, how does Numba JIT help?"

- NumPy is fast for each individual operation, but if you write Python like:
  - `for i in range(n): result[i] = complex_expression(a[i], b[i])`
  - this loop runs in Python and pays Python overhead per iteration.
- Numba compiles the **whole loop** (and the operations inside it) into one optimized machine code function.
- This removes Python overhead and can fuse several operations into one pass.

In practice, combine them:

- Use NumPy vectorization where it is natural.
- Use Numba for custom numeric kernels that are hard to write as a single NumPy expression.

In [4]:
import numpy as np
import math
import time

try:
    from numba import njit
except ImportError:
    njit = None
    print("Numba is not installed in this environment. The examples will fall back to pure Python.")

def rms_python(arr: np.ndarray) -> float:
    s = 0.0
    n = arr.size
    for i in range(n):
        x = float(arr[i])
        s += x * x
    return math.sqrt(s / n)

if njit is not None:
    @njit
    def rms_numba(arr):
        s = 0.0
        n = arr.size
        for i in range(n):
            x = arr[i]
            s += x * x
        return math.sqrt(s / n)
else:
    rms_numba = None

# Test on a large array
arr = np.random.normal(loc=0.0, scale=1.0, size=100_000)

# Python version
start = time.perf_counter()
r1 = rms_python(arr)
t_python = time.perf_counter() - start
print(f"Python RMS: {r1:.6f}, time={t_python:.4f} s")

if rms_numba is not None:
    # First call includes compilation time
    start = time.perf_counter()
    r2 = rms_numba(arr)
    t_first = time.perf_counter() - start
    # Second call is fast
    start = time.perf_counter()
    r3 = rms_numba(arr)
    t_numba = time.perf_counter() - start
    print(f"Numba RMS first call: {r2:.6f}, time={t_first:.4f} s (includes compile)")
    print(f"Numba RMS second call: {r3:.6f}, time={t_numba:.4f} s")

Python RMS: 0.999263, time=0.0114 s
Numba RMS first call: 0.999263, time=1.6283 s (includes compile)
Numba RMS second call: 0.999263, time=0.0001 s
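
For a plain RMS, a single NumPy expression is also available, which follows the advice to use vectorization where it is natural. The sketch below compares a Python loop with the vectorized form. The function names are hypothetical, the seed is arbitrary, and the timings depend on your machine, so no expected timings are shown.

```python
import math
import time
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
arr = rng.normal(loc=0.0, scale=1.0, size=100_000)

def rms_loop(values: np.ndarray) -> float:
    """Reference implementation with an explicit Python loop."""
    s = 0.0
    for v in values:
        s += float(v) * float(v)
    return math.sqrt(s / values.size)

def rms_vectorized(values: np.ndarray) -> float:
    """Same quantity written as one NumPy expression."""
    return float(np.sqrt(np.mean(values * values)))

start = time.perf_counter()
r_loop = rms_loop(arr)
t_loop = time.perf_counter() - start

start = time.perf_counter()
r_vec = rms_vectorized(arr)
t_vec = time.perf_counter() - start

print(f"Loop RMS:       {r_loop:.6f}, time={t_loop:.4f} s")
print(f"Vectorized RMS: {r_vec:.6f}, time={t_vec:.4f} s")
```

Numba becomes more attractive once the per-element logic no longer fits a short array expression, as in the moving-window exercise later in this topic.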


### ✏ Exercise (easy): Numba-accelerated difference of squares

1. Implement a function `diff_squares_python(a, b)` that for each element computes `a[i]**2 - b[i]**2`.
2. Time it on large NumPy arrays.
3. If Numba is available, implement `diff_squares_numba` with `@njit`.
4. Compare the timings.

Make sure you do not create new Python lists inside the function. Work directly with NumPy arrays.

In [None]:
import numpy as np
import time

try:
    from numba import njit
except ImportError:
    njit = None
    print("Numba not available - you can still implement the pure Python version.")

def diff_squares_python(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    n = a.size
    out = np.empty_like(a)
    # for i ...
    return out

if njit is not None:
    @njit
    def diff_squares_numba(a, b):
        n = a.size
        out = np.empty_like(a)
        for i in range(n):
            pass  # TODO: same logic as above
        return out

# a = np.random.normal(size=200_000)
# b = np.random.normal(size=200_000)
# start = time.perf_counter()
# out_py = diff_squares_python(a, b)
# t_py = time.perf_counter() - start
# print(f"Python time: {t_py:.4f} s")

# if njit is not None:
#     start = time.perf_counter()
#     out_nb1 = diff_squares_numba(a, b)
#     t_nb1 = time.perf_counter() - start
#     start = time.perf_counter()
#     out_nb2 = diff_squares_numba(a, b)
#     t_nb2 = time.perf_counter() - start
#     print(f"Numba first call: {t_nb1:.4f} s, second call: {t_nb2:.4f} s")

In [5]:
# Solution

import numpy as np
import time

try:
    from numba import njit
except ImportError:
    njit = None
    print("Numba not available - only Python version will run.")

def diff_squares_python(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    n = a.size
    out = np.empty_like(a)
    for i in range(n):
        out[i] = a[i] * a[i] - b[i] * b[i]
    return out

if njit is not None:
    @njit
    def diff_squares_numba(a, b):
        n = a.size
        out = np.empty_like(a)
        for i in range(n):
            out[i] = a[i] * a[i] - b[i] * b[i]
        return out

a = np.random.normal(size=200_000)
b = np.random.normal(size=200_000)

start = time.perf_counter()
out_py = diff_squares_python(a, b)
t_py = time.perf_counter() - start
print(f"Python time: {t_py:.4f} s")

if njit is not None:
    start = time.perf_counter()
    out_nb1 = diff_squares_numba(a, b)
    t_nb1 = time.perf_counter() - start
    start = time.perf_counter()
    out_nb2 = diff_squares_numba(a, b)
    t_nb2 = time.perf_counter() - start
    print(f"Numba first call: {t_nb1:.4f} s, second call: {t_nb2:.4f} s")

Python time: 0.1173 s
Numba first call: 0.5256 s, second call: 0.0005 s


### 💪 Exercise (advanced): Custom statistic with Numba

Define a custom function that is harder to express with pure NumPy:

1. Implement `moving_rms_python(arr, window)` that computes RMS roughness over a sliding window.
   - For each position `i`, compute RMS of `arr[i : i+window]`.
2. If Numba is available, implement `moving_rms_numba` using `@njit`.
3. Compare performance on a large array.

This is a common pattern when analyzing profiles from surface measurement devices.

In [None]:
import numpy as np
import math
import time

try:
    from numba import njit
except ImportError:
    njit = None
    print("Numba not available - you can still implement the Python version.")

def moving_rms_python(arr: np.ndarray, window: int) -> np.ndarray:
    """TODO: pure Python moving RMS."""
    n = arr.size
    out = np.empty(n - window + 1, dtype=float)
    for i in range(n - window + 1):
        sum_squares = 0.0

        # For each position i, compute RMS of arr[i : i+window].
        for j in range(window):
            pass  # TODO: add the square of arr[i + j] to sum_squares
        out[i] = math.sqrt(sum_squares / window)
    return out

if njit is not None:
    @njit
    def moving_rms_numba(arr, window):
        n = arr.size
        out = np.empty(n - window + 1, dtype=np.float64)
        for i in range(n - window + 1):
            sum_squares = 0.0

            # For each position i, compute RMS of arr[i : i+window].
            for j in range(window):
                pass  # TODO: same logic as above
            out[i] = math.sqrt(sum_squares / window)
        return out

# arr = np.random.normal(size=200_000)
# window = 50
# start = time.perf_counter()
# r_py = moving_rms_python(arr, window)
# t_py = time.perf_counter() - start
# print(f"Python moving RMS time: {t_py:.4f} s")

# if njit is not None:
#     start = time.perf_counter()
#     r_nb1 = moving_rms_numba(arr, window)
#     t_nb1 = time.perf_counter() - start
#     start = time.perf_counter()
#     r_nb2 = moving_rms_numba(arr, window)
#     t_nb2 = time.perf_counter() - start
#     print(f"Numba first call: {t_nb1:.4f} s, second call: {t_nb2:.4f} s")

In [6]:
# Solution

import numpy as np
import math
import time

try:
    from numba import njit
except ImportError:
    njit = None
    print("Numba not available - only Python version will run.")

def moving_rms_python(arr: np.ndarray, window: int) -> np.ndarray:
    n = arr.size
    out = np.empty(n - window + 1, dtype=float)
    for i in range(n - window + 1):
        s = 0.0
        for j in range(window):
            x = float(arr[i + j])
            s += x * x
        out[i] = math.sqrt(s / window)
    return out

if njit is not None:
    @njit
    def moving_rms_numba(arr, window):
        n = arr.size
        out = np.empty(n - window + 1, dtype=np.float64)
        for i in range(n - window + 1):
            s = 0.0
            for j in range(window):
                x = arr[i + j]
                s += x * x
            out[i] = math.sqrt(s / window)
        return out

arr = np.random.normal(size=200_000)
window = 50

start = time.perf_counter()
r_py = moving_rms_python(arr, window)
t_py = time.perf_counter() - start
print(f"Python moving RMS time: {t_py:.4f} s")

if njit is not None:
    start = time.perf_counter()
    r_nb1 = moving_rms_numba(arr, window)
    t_nb1 = time.perf_counter() - start
    start = time.perf_counter()
    r_nb2 = moving_rms_numba(arr, window)
    t_nb2 = time.perf_counter() - start
    print(f"Numba first call: {t_nb1:.4f} s, second call: {t_nb2:.4f} s")

Python moving RMS time: 1.4391 s
Numba first call: 0.2612 s, second call: 0.0080 s
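
For comparison, this particular statistic can also be written without Python loops by using a cumulative sum of squares. The sketch below is an alternative formulation, not the exercise solution. The function name `moving_rms_cumsum` is hypothetical, and the cross-check relies on `sliding_window_view`, which assumes NumPy 1.20 or newer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def moving_rms_cumsum(arr: np.ndarray, window: int) -> np.ndarray:
    """Moving RMS via a cumulative sum of squares (no Python loop)."""
    squares = arr * arr
    # c[k] holds the sum of the first k squares, with c[0] = 0
    c = np.concatenate(([0.0], np.cumsum(squares)))
    window_sums = c[window:] - c[:-window]
    # Guard against tiny negative values caused by rounding
    return np.sqrt(np.maximum(window_sums, 0.0) / window)

rng = np.random.default_rng(1)  # arbitrary seed
arr = rng.normal(size=10_000)
window = 50

r_cumsum = moving_rms_cumsum(arr, window)
# Cross-check against an explicit windowed view of the same data
r_view = np.sqrt((sliding_window_view(arr, window) ** 2).mean(axis=1))

print("Output length:", r_cumsum.size)
print("Max abs difference:", np.max(np.abs(r_cumsum - r_view)))
```

The cumulative-sum trick works because RMS only needs window sums. Statistics with branching or state carried between iterations are usually where a Numba loop is the simpler choice.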


## Topic 8 - GPU acceleration overview: cuDF and CuPy

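For very large datasets and heavy numerical work, GPUs can be useful: they apply the same arithmetic to many array elements in parallel, provided the workload is large enough to outweigh the cost of moving data to and from the GPU.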

Two popular libraries in the Python ecosystem:

- [CuPy](https://cupy.dev/):
  - NumPy-like interface for arrays stored on a CUDA GPU.
  - Many functions mirror the NumPy API (`cupy.array`, `cupy.mean`, etc.).
- [cuDF](https://docs.rapids.ai/api/cudf/stable/):
  - Part of the RAPIDS ecosystem: https://rapids.ai/
  - Pandas-like DataFrame library running on the GPU.

In many cases, the workflow is:

- Move large arrays or tables to the GPU once.
- Perform many operations there.
- Move reduced results (e.g. aggregates) back to the CPU.
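
The three steps above can be sketched as follows. This is a minimal illustration, not a benchmark. It assumes CuPy's `cupy.asarray` for the host-to-device copy and falls back to NumPy when CuPy or a usable CUDA device is missing, so it also runs on a CPU-only machine. The alias `xp` and the simulated data are illustrative choices.

```python
import numpy as np

# Use CuPy when it is installed and a CUDA device is usable; otherwise fall
# back to NumPy so that the sketch still runs on a CPU-only machine.
try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
    xp = cp
except Exception:
    xp = np

heights_um = np.random.default_rng(0).normal(loc=10.0, scale=0.5, size=1_000_000)

# 1) Move the data to the device once (with NumPy this stays on the host)
d_heights = xp.asarray(heights_um)

# 2) Perform several operations on the device
d_centered = d_heights - d_heights.mean()
d_rms = xp.sqrt((d_centered * d_centered).mean())
d_ptv = d_centered.max() - d_centered.min()

# 3) Move only the small results back to the host
rms = float(d_rms)
ptv = float(d_ptv)
print(f"Backend: {xp.__name__}, RMS={rms:.4f} um, PtV={ptv:.4f} um")
```

Because only two scalars cross back to the host, the transfer cost is dominated by the single upload in step 1.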

In this course environment, GPUs might not be available, so the example below is illustrative only. On a machine without CuPy, the code catches the `ImportError` and prints a short message instead of running the GPU part.

In [7]:
import time
import numpy as np

N = 100_000_000  # size of the array

print(f"Array size: {N:,}")

# ---------------- CPU: NumPy ----------------
t0 = time.perf_counter()
a_cpu = np.random.normal(size=N)
mean_cpu = a_cpu.mean()
std_cpu = a_cpu.std()
t1 = time.perf_counter()
cpu_time = t1 - t0

print(f"[NumPy / CPU]  mean={mean_cpu:.5f}, std={std_cpu:.5f}, time={cpu_time:.4f} s")

# ---------------- GPU: CuPy (if available) ----------------
try:
    import cupy as cp
    print("CuPy version:", cp.__version__)

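    t0 = time.perf_counter()
    a_gpu = cp.random.normal(size=N)
    mean_gpu = a_gpu.mean()
    std_gpu = a_gpu.std()
    # GPU work is launched asynchronously; wait for it before stopping the timer
    cp.cuda.Stream.null.synchronize()
    t1 = time.perf_counter()
    gpu_time = t1 - t0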

    print(f"[CuPy / GPU]  mean={float(mean_gpu):.5f}, std={float(std_gpu):.5f}, time={gpu_time:.4f} s")
    print(f"Speedup (CPU time / GPU time): {cpu_time / gpu_time:.2f}x")

except ImportError:
    print("CuPy is not installed. This example is for illustration only.")


CuPy is not installed. This example is for illustration only.


## Topic 9 - Python 3.13 and the future: experimental JIT and optional GIL

Recent and upcoming CPython releases are adding major performance features:

- **Python 3.13** (released in 2024) includes:
  - An experimental **free-threaded build** (optional no-GIL mode). See PEP 703.
  - An experimental **JIT compiler** (PEP 744) that can speed up some workloads.
  - These features are **off by default** and require special builds / flags.
- Future versions (3.14 and beyond) are expected to improve JIT performance and evolve the no-GIL story.

What this means for you in the medium term:

- Well-written numeric Python might become faster without you changing code.
- True multi-threaded CPU-bound Python code may become possible without having to use `multiprocessing`.
- Libraries like Numba, cuDF, CuPy, and others will likely evolve to take advantage of new capabilities.

Official resources:

- What's new in Python 3.13: https://docs.python.org/3/whatsnew/3.13.html
- PEP 703 (optional GIL): https://peps.python.org/pep-0703/
- PEP 744 (JIT): https://peps.python.org/pep-0744/

For now, you should still learn threads, processes, AsyncIO, NumPy, and Numba - these skills remain valuable regardless of how the interpreter evolves.

In [8]:
import sys

print("Running Python version:", sys.version)
print("Executable:", sys.executable)
print("Note: in standard CPython 3.13+, JIT and no-GIL builds are optional and may need explicit enabling.")

Running Python version: 3.13.7 (main, Sep  2 2025, 14:16:00) [MSC v.1944 64 bit (AMD64)]
Executable: C:\Users\gregk\Desktop\winpython\WPy64-31700\python\python.exe
Note: in standard CPython 3.13+, JIT and no-GIL builds are optional and may need explicit enabling.
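
To see whether the interpreter you are running is a free-threaded build, and whether the GIL is currently active, a small check like the following can help. It is a sketch that assumes the `Py_GIL_DISABLED` build variable and the `sys._is_gil_enabled()` function introduced with Python 3.13; on older versions it simply reports that the GIL is enabled.

```python
import sys
import sysconfig

# Py_GIL_DISABLED is 1 for free-threaded builds; it may be 0 or None otherwise
free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# sys._is_gil_enabled() was added in Python 3.13; on older versions the GIL
# is always enabled, so fall back to a function that returns True.
is_gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)
gil_enabled = is_gil_enabled()

print("Free-threaded build:", free_threaded_build)
print("GIL currently enabled:", gil_enabled)
```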


## Topic 10 - Complex example: parallel processing of measurement datasets

In this final example we combine multiple ideas from today:

- NumPy arrays and vectorized computations
- Numba JIT for a custom numeric kernel (optional)
- `multiprocessing` to process multiple independent datasets in parallel

### Scenario

You have several measurement files from a surface profiler. Each file contains a 1D height profile (in micrometers). For each profile, you want to:

1. Load the data (in this notebook we will just simulate it).
2. Apply an offset correction (subtract mean).
3. Compute RMS roughness and peak-to-valley height (max - min).
4. Return a small summary dictionary.

Then, for many profiles (e.g. 8 or 16), you want to process them in parallel using multiple CPU cores.

We will build:

- A pure NumPy summary function.
- Optionally a Numba-accelerated variant.
- A small wrapper that can be used with `multiprocessing.Pool.map`.

Your task is to fill in the missing parts.

In [None]:
# RUN FROM .py INSTEAD OF NOTEBOOK

import numpy as np
import math
from multiprocessing import Pool

try:
    from numba import njit
except ImportError:
    njit = None

def simulate_profile(n_points: int = 50_000) -> np.ndarray:
    """Simulate a 1D surface profile in micrometers."""
    x = np.linspace(0.0, 10.0, n_points)
    base = 5.0 * np.sin(2 * np.pi * x / 5.0)
    noise = np.random.normal(loc=0.0, scale=0.5, size=n_points)
    return base + noise

def summarize_profile_numpy(profile: np.ndarray) -> dict:
    """TODO: center profile and compute RMS and peak-to-valley using NumPy only."""
    # mean = ...
    # centered = ...
    # rms = ...
    # ptv = ...
    # return {"mean": float(mean), "rms": float(rms), "ptv": float(ptv)}
    mean = profile.mean()
    centered = profile - mean
    rms = math.sqrt((centered * centered).mean())
    ptv = float(centered.max() - centered.min())
    return {"mean": float(mean), "rms": float(rms), "ptv": ptv}

if njit is not None:
    @njit
    def summarize_profile_numba_kernel(profile):
        n = profile.size
        # Compute mean
        s = 0.0
        for i in range(n):
            s += profile[i]
        mean = s / n
        # Compute RMS and min, max of centered profile
        s2 = 0.0
        min_c = 1e30
        max_c = -1e30
        for i in range(n):
            c = profile[i] - mean
            s2 += c * c
            if c < min_c:
                min_c = c
            if c > max_c:
                max_c = c
        rms = math.sqrt(s2 / n)
        ptv = max_c - min_c
        return mean, rms, ptv

def summarize_profile_numba(profile: np.ndarray) -> dict:
    if njit is None:
        return summarize_profile_numpy(profile)
    mean, rms, ptv = summarize_profile_numba_kernel(profile)
    return {"mean": float(mean), "rms": float(rms), "ptv": float(ptv)}

def process_one_profile(args):
    """Wrapper for Pool.map: args could be (index, use_numba)."""
    index, use_numba = args
    profile = simulate_profile()
    if use_numba:
        summary = summarize_profile_numba(profile)
    else:
        summary = summarize_profile_numpy(profile)
    summary["index"] = index
    return summary

def main():
    # TODO:
    # 1) Create a list of indices, e.g. range(8)
    # 2) Use Pool to process them in parallel
    # 3) Print the summaries sorted by index
    
    # indices = ...
    # args_list = [(i, True) for i in indices]
    # with Pool() as pool:
    #     results = pool.map(process_one_profile, args_list)
    # results_sorted = sorted(results, key=lambda d: d["index"])
    # for r in results_sorted:
    #     print(r)
    pass  # placeholder so the template runs; remove once main() is implemented

if __name__ == "__main__":
    main()


In [None]:
# Solution RUN FROM .py INSTEAD OF NOTEBOOK

import numpy as np
import math
from multiprocessing import Pool

try:
    from numba import njit
except ImportError:
    njit = None

def simulate_profile(n_points: int = 50_000) -> np.ndarray:
    x = np.linspace(0.0, 10.0, n_points)
    base = 5.0 * np.sin(2 * np.pi * x / 5.0)
    noise = np.random.normal(loc=0.0, scale=0.5, size=n_points)
    return base + noise

def summarize_profile_numpy(profile: np.ndarray) -> dict:
    mean = profile.mean()
    centered = profile - mean
    rms = math.sqrt((centered * centered).mean())
    ptv = float(centered.max() - centered.min())
    return {"mean": float(mean), "rms": float(rms), "ptv": ptv}

if njit is not None:
    @njit
    def summarize_profile_numba_kernel(profile):
        n = profile.size
        s = 0.0
        for i in range(n):
            s += profile[i]
        mean = s / n
        s2 = 0.0
        min_c = 1e30
        max_c = -1e30
        for i in range(n):
            c = profile[i] - mean
            s2 += c * c
            if c < min_c:
                min_c = c
            if c > max_c:
                max_c = c
        rms = math.sqrt(s2 / n)
        ptv = max_c - min_c
        return mean, rms, ptv

def summarize_profile_numba(profile: np.ndarray) -> dict:
    if njit is None:
        return summarize_profile_numpy(profile)
    mean, rms, ptv = summarize_profile_numba_kernel(profile)
    return {"mean": float(mean), "rms": float(rms), "ptv": float(ptv)}

def process_one_profile(args):
    index, use_numba = args
    profile = simulate_profile()
    if use_numba:
        summary = summarize_profile_numba(profile)
    else:
        summary = summarize_profile_numpy(profile)
    summary["index"] = index
    return summary

if __name__ == "__main__":
    indices = list(range(8))
    args_list = [(i, True) for i in indices]
    with Pool() as pool:
        results = pool.map(process_one_profile, args_list)
    results_sorted = sorted(results, key=lambda d: d["index"])
    for r in results_sorted:
        print(r)
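
One caveat with simulated data in worker processes: `simulate_profile` uses NumPy's global random state. When workers are created by forking, they can inherit the same state and produce identical "random" profiles. A possible remedy, sketched below, is to give each task its own seed via `SeedSequence.spawn`. The names `make_task_seeds` and `simulate_profile_seeded` and the root seed are hypothetical, and the demo runs sequentially to stay short.

```python
import numpy as np

def make_task_seeds(n_tasks: int, root_seed: int = 12345) -> list:
    """Create one independent child seed per task (root_seed is arbitrary)."""
    return np.random.SeedSequence(root_seed).spawn(n_tasks)

def simulate_profile_seeded(seed, n_points: int = 50_000) -> np.ndarray:
    """Variant of simulate_profile that draws noise from its own generator."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 10.0, n_points)
    base = 5.0 * np.sin(2 * np.pi * x / 5.0)
    return base + rng.normal(loc=0.0, scale=0.5, size=n_points)

seeds = make_task_seeds(4)
# Sequential here for brevity; the same seed objects could be passed as part
# of the argument tuples given to pool.map.
profiles = [simulate_profile_seeded(s) for s in seeds]

print("Profiles differ:", not np.allclose(profiles[0], profiles[1]))
```

Passing the seed explicitly also makes each worker's result reproducible, independent of how the pool schedules the tasks.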

## Day 3 summary

Today you:

- Built a mental model of **CPU-bound** vs **I/O-bound** workloads.
- Reviewed the impact of the **GIL** on threads and why processes are used for CPU-bound speedups.
- Used **threads** for concurrent I/O-style tasks, collecting results safely with locks.
- Used **multiprocessing** to process independent batches of measurement data in parallel.
- Learned the basics of **AsyncIO** and coordinated multiple async measurement coroutines.
- Revisited **NumPy**, created arrays, applied vectorized transformations, and computed statistics for physics/measurement data.
- Practiced boolean masks, axis-wise aggregations, and small vectorized physics-style calculations.
- Used **Numba** to JIT-compile custom numeric kernels and understood how it complements NumPy.
- Saw an overview of **GPU tools** like [CuPy](https://cupy.dev/) and [cuDF](https://docs.rapids.ai/api/cudf/stable/) for GPU acceleration.
- Discussed **Python 3.13** and its experimental JIT and optional no-GIL builds, and how future CPython versions may affect performance.
- Combined multiple concepts in a complex example: parallel processing of simulated surface profiles with NumPy, Numba, and multiprocessing.

These tools and concepts form a practical toolbox for high-performance numerical work in Python, especially in physics and engineering contexts. In the coming days you can build on this knowledge for more advanced machine learning and deep learning workloads.