<p style='text-align: right;'> Birkan Emrem </p>
<p style='text-align: right;'> 16.10.2025 </p>
<p style='text-align: right;'> AI Training Series: Python Refresher: Session V </p>

## Introduction to Parallel Computing
#### Parallel Execution with Processes

In [None]:
import concurrent.futures as cf
import os

In [None]:
# Function to square
def sr(n):
    return (os.getpid(), n, n * n)

In [None]:
# Use multiple processes to execute the function
if __name__ == "__main__":
    with cf.ProcessPoolExecutor() as exc:
        re = list(exc.map(sr, range(5)))

    for p, n, s in re:
        print(f"Process {p} handled {n}, square: {s}")

    print("List:", [sq for _, _, sq in re])

### Key Points:
- Uses multiple CPU cores
- Each worker is a separate process
- Efficient for CPU-bound tasks

#### Sequential Baseline for Comparison

In [None]:
import time

In [None]:
# Function to square with delay
def sr(n):
    print("Process", os.getpid(), "handling", n)
    time.sleep(1)
    return n*n

In [None]:
results = []
# Execute sequentially in a single process
for i in [1, 2, 3, 4, 5]:
    results.append(sr(i))

print("Squared:", results)

### Key Points:
- Takes ~5 seconds
- Single process handles all work
- No parallelism or concurrency
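
To make the contrast with the process pool concrete, the following sketch times both variants on the same task. It assumes a shorter delay (0.2 s instead of 1 s) to keep the run brief, and the names `slow_square`, `run_sequential`, and `run_parallel` are illustrative, not part of the session code.

```python
import concurrent.futures as cf
import time

DELAY = 0.2  # assumed delay per task, shortened from the 1 s used above


def slow_square(n):
    """Square n after a simulated delay."""
    time.sleep(DELAY)
    return n * n


def run_sequential(numbers):
    """Run all tasks one after another; return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [slow_square(n) for n in numbers]
    return results, time.perf_counter() - start


def run_parallel(numbers):
    """Run the tasks in a process pool; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with cf.ProcessPoolExecutor() as exc:
        results = list(exc.map(slow_square, numbers))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    nums = [1, 2, 3, 4, 5]
    seq, t_seq = run_sequential(nums)
    par, t_par = run_parallel(nums)
    print(f"Sequential: {seq} in {t_seq:.2f} s")
    print(f"Parallel:   {par} in {t_par:.2f} s")
```

The parallel time depends on the number of available cores and includes the cost of starting the worker processes, so the gain is usually smaller than the ideal factor.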

<hr style="border:1.3px solid gray">

## CPU-bound vs I/O-bound Tasks
#### CPU-bound

In [None]:
import time

In [None]:
# Function for heavy computation
def compute():
    total = 0
    for i in range(100_000_000):
        total += i*i
    return total

In [None]:
start = time.perf_counter()
result = compute()
end = time.perf_counter()

In [None]:
print("Result:", result)
print("Time:", round(end-start, 2), "s")

### Key Points:
- Keeps CPU fully busy
- Work is continuous, no waiting
- Use processes for speed-up
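
One possible way to apply the last point is to split the loop range into chunks and sum each chunk in its own process. This is a minimal sketch under assumptions: the helpers `partial_sum` and `make_chunks` are hypothetical, four chunks are chosen arbitrarily, and the range is reduced to keep the run short.

```python
import concurrent.futures as cf


def partial_sum(bounds):
    """Sum i*i over the half-open range [start, stop)."""
    start, stop = bounds
    total = 0
    for i in range(start, stop):
        total += i * i
    return total


def make_chunks(n, parts):
    """Split range(n) into contiguous (start, stop) pairs; assumes n > 0."""
    step = -(-n // parts)  # ceiling division
    return [(lo, min(lo + step, n)) for lo in range(0, n, step)]


if __name__ == "__main__":
    n = 10_000_000  # smaller than the loop above, to keep the run short
    with cf.ProcessPoolExecutor() as exc:
        # Each chunk is summed in a worker process; the parent adds the parts
        total = sum(exc.map(partial_sum, make_chunks(n, 4)))
    print("Result:", total)
```

Because the partial sums are independent, the chunks need no communication until the final addition in the parent process.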

#### I/O-bound

In [None]:
# Function to simulate I/O delay
def fetch_data():
    print("Fetching data ...")
    time.sleep(2)
    return "Done"

In [None]:
start = time.perf_counter()
result = fetch_data()
end = time.perf_counter()

print("Result:", result)
print("Time:", round(end-start, 2), "s")

### Key Points:
- CPU mostly idle while waiting
- Common in file, network, or DB operations
- Suited for threading, not processes
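
As a rough illustration of why threads help here, the sketch below runs several simulated fetches in a thread pool so that their waiting times overlap. The function `fetch_all`, the task count, and the 0.5 s delay are assumptions chosen for the example.

```python
import concurrent.futures as cf
import time


def fetch_data(delay):
    """Simulate an I/O wait of `delay` seconds."""
    time.sleep(delay)
    return "Done"


def fetch_all(n_tasks, delay):
    """Run n_tasks simulated fetches in threads; return (results, elapsed)."""
    start = time.perf_counter()
    with cf.ThreadPoolExecutor(max_workers=n_tasks) as exc:
        results = list(exc.map(fetch_data, [delay] * n_tasks))
    return results, time.perf_counter() - start


results, elapsed = fetch_all(n_tasks=4, delay=0.5)
print("Results:", results)
print("Time:", round(elapsed, 2), "s")
```

Run one after another, the four fetches would need about 2 seconds; with threads the total stays close to a single delay because the CPU is idle during each wait.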

<hr style="border:1.3px solid gray">

## Threading Basics
#### Creating and Starting a Thread

In [None]:
import threading as th
import time

In [None]:
def greet(name):
    print("Hello, ", name)
    time.sleep(1)
    print("Goodbye, ", name)

In [None]:
# Create a thread that runs the function
t = th.Thread(target=greet, args=("Bob", ))
t.start()
t.join()

print("Main thread finished")

### Key Points:
- Use `Thread` to run a function concurrently
- `start()` begins the thread's execution
- `join()` blocks until it finishes
- Useful for I/O-bound tasks

#### Running Multiple Threads

In [None]:
def worker(i):
    print("Thread", i, "started")
    time.sleep(1)
    print("Thread", i, "ended")

In [None]:
threads = []
# Create and start 3 threads
for i in range(3):
    t = th.Thread(target=worker, args=(i, ))
    t.start()
    threads.append(t)

for t in threads:
    t.join()

### Key Points:
- Start many threads in a loop
- All threads share memory
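
Shared memory means that threads can write into the same object directly. A minimal sketch, assuming a shared list named `results` and a `Lock` to keep the updates orderly (both names are illustrative):

```python
import threading as th

results = []       # shared by all threads
lock = th.Lock()   # protects access to `results`


def worker(i):
    square = i * i
    with lock:     # only one thread appends at a time
        results.append(square)


threads = [th.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("Collected:", sorted(results))
```

The order in which the threads append is not fixed, which is why the list is sorted before printing.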

<hr style="border:1.3px solid gray">

## Global Interpreter Lock (GIL)
#### Threading May Fail for CPU Tasks

In [None]:
import threading as th

x = 0
def increment():  # Function that increments the global x
    global x
    for _ in range(100_000):
        x += 1  # Not atomic: read, add, and write can interleave despite the GIL

In [None]:
# Start 2 threads that run increment concurrently
t1 = th.Thread(target=increment)
t2 = th.Thread(target=increment)

In [None]:
t1.start(); t2.start()
t1.join(); t2.join()
print("Final x: ", x)

### Key Points:
- Threads can't run Python bytecode in parallel
- The GIL prevents true parallelism for CPU-bound threads
- `x += 1` is not atomic, so the final result may be wrong without a lock
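
One common remedy for the race on the shared counter is a `threading.Lock` around the update. The sketch below wraps the counter in a small, hypothetical `Counter` class instead of a global variable; that choice is an assumption made for the example, not the session's method.

```python
import threading as th


class Counter:
    """A counter whose increments are protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = th.Lock()

    def increment(self, times):
        for _ in range(times):
            with self._lock:  # read-modify-write happens as one unit
                self.value += 1


counter = Counter()
t1 = th.Thread(target=counter.increment, args=(100_000,))
t2 = th.Thread(target=counter.increment, args=(100_000,))
t1.start(); t2.start()
t1.join(); t2.join()
print("Final value:", counter.value)
```

The lock makes the result correct but does not make the loop faster; for CPU-bound speed-up, processes remain the better tool.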

#### Multiprocessing Avoids the GIL

In [None]:
import multiprocessing as mp

In [None]:
def compute():
    total = 0
    for i in range(10**6):
        total += i
    print("Done:", total)

In [None]:
p1 = mp.Process(target=compute)
p2 = mp.Process(target=compute)

In [None]:
p1.start(); p2.start()
p1.join(); p2.join()

### Key Points:
- True CPU parallelism across cores
- Sidesteps the GIL, since each process has its own
- Each process runs in its own Python interpreter

<hr style="border:1.3px solid gray">

## Multiprocessing Basics
#### Creating a Single Process

In [None]:
import multiprocessing as mp

In [None]:
def say_hello():
    print("Hello from a separate process!")

In [None]:
if __name__ == "__main__":
    p = mp.Process(target=say_hello)
    p.start()
    p.join()

    print("Main process finished")

### Key Points:
- `Process` runs a function in a new process
- Each process has its own memory space
- Start with `start()`, wait with `join()`
- Good for CPU-bound tasks

#### Running Multiple Processes

In [None]:
import os

In [None]:
def wr(n):
    print(f"Worker {n} PID {os.getpid()}")

In [None]:
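if __name__ == "__main__":
    procs = []
    for i in range(3):
        p = mp.Process(target=wr, args=(i,))
        p.start()
        procs.append(p)
    # Join only after all processes have started, so they can overlap
    for p in procs:
        p.join()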

### Key Points:
- Create multiple processes in a loop
- Processes can run in parallel when enough cores are available
- Ideal for dividing CPU-heavy work

<hr style="border:1.3px solid gray">

## Using `concurrent.futures`
#### `ThreadPoolExecutor` (I/O-bound)

In [None]:
import concurrent.futures as cf
import time

In [None]:
def fetch(i):
    print("Start fetching", i)
    time.sleep(1)
    print("Done with", i)
    return i*10

In [None]:
with cf.ThreadPoolExecutor() as exc:
    results = list(exc.map(fetch, range(3)))

print("Results:", results)

### Key Points:
- Threads run concurrently
- Easy API with ThreadPoolExecutor
- Threads share memory
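
Besides `map`, the executor also offers `submit`, which returns a `Future` per task, and `as_completed`, which yields futures as they finish. The following sketch assumes shortened, staggered delays so that completion order can differ from submission order.

```python
import concurrent.futures as cf
import time


def fetch(i):
    time.sleep(0.1 * (3 - i))  # assumed delays: later tasks finish sooner
    return i * 10


with cf.ThreadPoolExecutor(max_workers=3) as exc:
    # Map each future back to the argument it was submitted with
    futures = {exc.submit(fetch, i): i for i in range(3)}
    completed = []
    for fut in cf.as_completed(futures):
        completed.append((futures[fut], fut.result()))

print("Completion order:", completed)
```

`map` returns results in input order, while `as_completed` lets the program react to each result as soon as it is ready.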

#### `ProcessPoolExecutor` (CPU-bound)

In [None]:
def sr(n):
    print(f"Squaring {n}\n")
    return n*n

In [None]:
if __name__ == "__main__":
    with cf.ProcessPoolExecutor() as exc:
        results = list(exc.map(sr, range(4)))

    print("Squares:", results)

### Key Points:
- Uses multiple processes, not threads
- Ideal for CPU-intensive tasks
- Each function call runs in parallel

<hr style="border:1.3px solid gray">

## Shared Memory in Multiprocessing
#### Using `Value` (Shared Scalar)

In [None]:
import multiprocessing as mp

In [None]:
def add(val):  # Function to increment a shared value
    for i in range(10000):
        with val.get_lock():  # `+=` alone is not atomic across processes
            val.value += 1

In [None]:
if __name__ == "__main__":
    ctr = mp.Value("i", 0) # Shared integer
    # Start 2 processes that run add() using the
    # shared counter
    p1 = mp.Process(target=add, args=(ctr,))
    p2 = mp.Process(target=add, args=(ctr,))
    p1.start(); p2.start()
    p1.join(); p2.join()

    print("Final count:", ctr.value)

### Key Points:
- `Value` stores a single shared value
- `"i"` = type code for a C signed int
- Shared between processes; guard updates with `get_lock()`

#### Using `Array` (Shared List)

In [None]:
# Function to square each element in a shared array
def sr(arr):     
    for i in range(len(arr)):
        arr[i] = arr[i] * arr[i]

In [None]:
if __name__ == "__main__":
    numbers = mp.Array("i", [1, 2, 3, 4])
    p = mp.Process(target=sr, args=(numbers,))
    p.start()
    p.join()

    print("Squared Array:", list(numbers))

### Key Points:
- `Array` shares a fixed-size list
- Elements are updated in-place
- Changes are visible across processes

<hr style="border:1.3px solid gray">

## Accelerating with Numba (JIT Basics)
#### Using `@jit` for Instant Speed-up

In [None]:
from numba import jit
import time

In [None]:
@jit
def compute():
    total = 0  # Numba infers a 64-bit integer here
    for i in range(100_000_000):
        total += i*i  # The full sum exceeds the int64 range and wraps around
    return total

In [None]:
start = time.perf_counter()
result = compute()
print("Result:", result)
print("Time:", time.perf_counter() - start)

### Key Points:
- `@jit` compiles the function at runtime, on the first call
- Large speed-ups for numeric loops; the first call also includes compile time
- Supports a subset of Python and NumPy, with fixed-width integers
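
Since compilation happens on the first call, timing a single call mixes compile time with run time. The sketch below times two consecutive calls to separate the two. It assumes a hypothetical function `sum_squares` with a reduced range so the sum fits into 64 bits, and it falls back to plain Python if Numba is not installed (an assumption made only so the example runs anywhere).

```python
import time

try:
    from numba import njit
except ImportError:  # assumption: run uncompiled if Numba is missing
    def njit(func):
        return func


@njit
def sum_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed(n):
    """Call sum_squares(n); return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = sum_squares(n)
    return result, time.perf_counter() - start


first_result, first_time = timed(1_000_000)    # includes compilation
second_result, second_time = timed(1_000_000)  # reuses the compiled code
print(f"First call:  {first_time:.4f} s")
print(f"Second call: {second_time:.4f} s")
```

With Numba installed, the second call is typically much faster than the first, which is the number to compare against the pure Python version.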

#### Using `@njit`

In [None]:
from numba import njit

In [None]:
@njit
def multiply():
    result = 1
    for i in range(1, 1_000_000):
        result *= 1.00001
    return result

In [None]:
start = time.perf_counter()
output = multiply()
print("Output:", output)
print("Time:", time.perf_counter() - start)

### Key Points:
- `@njit` = `@jit(nopython=True)`: no Python interpreter fallback
- Pure machine-level speed
- Best for tight numeric loops

<hr style="border:1.3px solid gray">

## Accelerating with Numba
#### Parallel sum with `prange`

In [None]:
from numba import njit
import numba
import time

In [None]:
@njit(parallel=True)
def parallel_sum():
    total = 0
    for i in numba.prange(1_000_000):
        total += i
    return total

In [None]:
start = time.perf_counter()
result = parallel_sum()
print("Result:", result)
print("Time:", time.perf_counter() - start)

#### Parallel Element-wise Operation

In [None]:
import numpy as np

In [None]:
@njit(parallel=True)
def scale_array(arr):
    for i in numba.prange(len(arr)):
        arr[i] = arr[i] * 2

In [None]:
data = np.arange(1_000_000, dtype="int64")
scale_array(data)
print("First 5:", data[:5])

### Key Points:
- Operates directly on NumPy array
- Auto-parallelized loop with `prange`
- Numba = fast without leaving Python

<hr style="border:1.3px solid gray">

## Accelerating with Cython (Basics)
#### Writing a Cython Function

In [None]:
%%writefile cy_add.pyx
def add(int a, int b):
    cdef int result
    result = a + b
    return result

In [None]:
from Cython.Build import cythonize
cythonize("cy_add.pyx", language_level="3")

### Key Points:
- Cython compiles Python to C
- Static types boost performance
- `cdef` declares C-level variables

#### Using Cython from Python code

In [None]:
import pyximport
pyximport.install()

In [None]:
import cy_add

In [None]:
print("3 + 4 =", cy_add.add(3, 4))
print("10 + 20 =", cy_add.add(10, 20))

```bash
# No need to manage shared libraries
# Python imports compiled Cython module
```

### Key Points:
- Use `pyximport` for easy development
- Imports like a regular Python module
- Code runs at compiled C speed

<hr style="border:1.3px solid gray">

## Accelerating with Cython (Integration)
#### Creating a `setup.py` for Cython

In [None]:
%%writefile setup.py
import setuptools as st
import Cython.Build as cb
 
st.setup(
  name="cy_add",
  ext_modules=cb.cythonize("cy_add.pyx"),
  zip_safe=False
)

In [None]:
! python setup.py build_ext --inplace

### Key Points:
- Use `setuptools` and `cythonize` for building
- Compiles `.pyx` to fast C extensions
- Produces `.so` or `.pyd` file in-place

#### Using the Compiled Module

In [None]:
%%writefile main.py
import cy_add

print("Fast add:", cy_add.add(7, 8))
print("Another one:", cy_add.add(100, 30))

In [None]:
! python main.py

### Key Points:
- Import compiled module directly
- Runs at native C speed
- Compatible with any Python code

<hr style="border:1.3px solid gray">

## Best Practices and Common Pitfalls
#### Best Practices for Parallel Code

In [None]:
import concurrent.futures as cf 

In [None]:
def sr(n):
    return n*n

if __name__ == "__main__":
    with cf.ProcessPoolExecutor() as pl:
        results = list(pl.map(sr, range(5)))

        print("Result:", results)

### Key Points:
- Always guard parallel code with `if __name__ == "__main__"`
- Use `with` blocks to manage executors cleanly
- Processes for CPU, threads for I/O
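
A variation on the same pattern keeps all parallel work inside a `main()` function and sets the pool size explicitly. This is a sketch under assumptions: the cap of four workers is arbitrary, and `square` and `main` are illustrative names.

```python
import concurrent.futures as cf
import os


def square(n):
    return n * n


def main():
    # Cap the pool size explicitly; os.cpu_count() may return None
    workers = min(4, os.cpu_count() or 1)
    with cf.ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(square, range(5)))
    print("Result:", results)
    return results


if __name__ == "__main__":
    main()
```

Keeping the entry point in one function means that importing the module, as worker processes may do, never starts a pool by accident.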

#### Common Pitfalls to Avoid

In [None]:
import multiprocessing as mp

In [None]:
def run():
    print("Running task")

In [None]:
# Missing __main__ guard: may crash or hang
p = mp.Process(target=run)
p.start()
p.join()

### Key Points:
- Missing `__main__` guard breaks multiprocessing on Windows/macOS, where new processes are spawned
- Overusing threads causes context-switching overhead
- Don't parallelize tiny or fast tasks
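
The last point can be checked with a quick measurement. The sketch below compares a plain loop with a thread pool on a trivial task; the function `tiny` and the task count of 10,000 are assumptions chosen for illustration.

```python
import concurrent.futures as cf
import time


def tiny(n):
    return n + 1


numbers = range(10_000)

# Plain loop: no scheduling overhead
start = time.perf_counter()
plain = [tiny(n) for n in numbers]
t_plain = time.perf_counter() - start

# Thread pool: every call is wrapped in a future and handed to a worker
start = time.perf_counter()
with cf.ThreadPoolExecutor() as exc:
    pooled = list(exc.map(tiny, numbers))
t_pool = time.perf_counter() - start

print(f"Plain loop:  {t_plain:.4f} s")
print(f"Thread pool: {t_pool:.4f} s")
```

Both variants produce the same list, but the pool usually takes longer here because the bookkeeping per task costs more than the task itself.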

<hr style="border:1.3px solid gray">