# Lecture 1 & 2 (Introduction)

# Real-World Scenario: Training a Neural Network
## The Challenge:
1. Training a Large Model

Imagine you need to train a modern language model with 175 billion parameters on a dataset of 100 million text samples (the timings below are illustrative, not measured benchmarks):

- Single CPU (laptop): 45 days of continuous processing
- Distributed cluster (64 GPUs): 3-4 hours

2. Processing Customer Data

- Single-threaded pandas: processing 50GB of customer transaction data takes 8+ hours
- Parallel processing with Dask: the same task takes about 25 minutes

## The Core Problem
Moore's Law is slowing, and CPU clock speeds have plateaued at around 3-4 GHz since about 2005. Instead of faster cores, we now have:

- More cores per CPU (8, 16, 32+ cores)
- Specialized hardware (GPUs with thousands of cores)
- Distributed systems (thousands of machines working together)  

Key Insight: To handle growing data volumes and model complexity, we must embrace parallelism.

## Why Parallel Computing Matters in Data Science & ML
Modern Challenges in DS/ML
1. Massive Dataset Sizes

Traditional: MB to GB datasets

Today: TB to PB datasets

Examples:

- Netflix: 200+ billion events per day
- Facebook: 4 petabytes of new data daily
- Genomics: Single human genome = 200GB+ when fully analyzed



2. Complex Model Architectures

- 2012 AlexNet: 60M parameters
- 2019 GPT-2: 1.5B parameters
- 2023 GPT-4: Estimated 1.7T+ parameters

Training cost grows steeply with model size, because compute scales with both the parameter count and the amount of training data.

3. Real-Time Requirements

Recommendation systems: < 100ms response time

High-frequency trading: < 1ms decisions

Autonomous vehicles: < 10ms for safety-critical decisions

## Benefits of Parallelism
1. Speed Gains

# Sequential processing
for batch in dataset:
    model.train(batch)  # Takes 1 second per batch

Total: 1000 batches × 1 sec = 1000 seconds

# Parallel processing (8 workers)
parallel_workers = 8

Total: 1000 batches ÷ 8 workers × 1 sec = 125 seconds

Ideal speedup: 8x (real speedups are lower because of communication and synchronization overhead)
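The arithmetic above can be sketched as a small helper. This is an idealized model, assuming independent batches of equal cost and no communication overhead; the function name `ideal_parallel_time` is hypothetical and not part of any library.

```python
import math

def ideal_parallel_time(num_batches: int, workers: int, sec_per_batch: float) -> float:
    """Best-case wall-clock time when batches are spread evenly over workers."""
    # The busiest worker handles ceil(num_batches / workers) batches.
    batches_per_worker = math.ceil(num_batches / workers)
    return batches_per_worker * sec_per_batch

sequential = ideal_parallel_time(1000, workers=1, sec_per_batch=1.0)
parallel = ideal_parallel_time(1000, workers=8, sec_per_batch=1.0)

print(f"Sequential: {sequential:.0f} s")
print(f"8 workers:  {parallel:.0f} s")
print(f"Ideal speedup: {sequential / parallel:.1f}x")
```

When the batch count does not divide evenly (for example 10 batches on 3 workers), the busiest worker sets the finish time, which is why the helper rounds up.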

2. Scalability

- Handle datasets larger than single machine memory
- Train models too large for single GPU
- Process streams of real-time data

3. Hardware Utilization

- CPU: Use all cores instead of just one
- GPU: Leverage thousands of CUDA cores
- Memory: Distribute data across multiple machines
- Network: Pipeline data processing and transfer


## Types of Parallelism
1. Data Parallelism
Concept: Same model/algorithm processes different portions of data simultaneously.

How it works:
Original Dataset: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Split across 3 workers:
- Worker 1: [1, 2, 3, 4]    → processes subset
- Worker 2: [5, 6, 7]       → processes subset  
- Worker 3: [8, 9, 10]      → processes subset

Combine results: [result1, result2, result3]
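The split-process-combine steps above can be sketched in a few lines. This is a minimal illustration, not a production pattern: `split_evenly` and `process_subset` are hypothetical helpers, and the per-worker computation (a sum of squares) is only a placeholder. A thread pool keeps the sketch short and portable; CPU-heavy work in CPython would normally use processes instead.

```python
from concurrent.futures import ThreadPoolExecutor

def split_evenly(data, num_workers):
    """Split data into contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(data), num_workers)
    chunks, start = [], 0
    for worker in range(num_workers):
        size = base + (1 if worker < extra else 0)  # first `extra` chunks get one more item
        chunks.append(data[start:start + size])
        start += size
    return chunks

def process_subset(chunk):
    """Placeholder computation applied identically to every chunk."""
    return sum(x * x for x in chunk)

dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
chunks = split_evenly(dataset, 3)

# Every worker runs the same function on its own chunk.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(process_subset, chunks))

total = sum(partial_results)  # combine step
print("Chunks:", chunks)
print("Partial results:", partial_results, "Combined:", total)
```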

### Real Example - Neural Network Training:

Each GPU processes a different slice of the same global batch, using an identical copy of the model:
- GPU 0: processes batch [0:32]    using same model
- GPU 1: processes batch [32:64]   using same model
- GPU 2: processes batch [64:96]   using same model
- GPU 3: processes batch [96:128]  using same model

After the backward pass, gradients are averaged across all GPUs (an all-reduce), so every replica applies the same update and the model parameters stay synchronized.
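To see why averaging works, here is a toy sketch in plain Python. It assumes a one-parameter model `y = w * x` with mean-squared-error loss and four equally sized shards standing in for the four GPUs; all names (`mse_gradient`, `num_gpus`, `learning_rate`) are illustrative. With equal shard sizes, the average of the per-shard gradients matches the gradient over the full batch.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Toy global batch of 128 samples generated from y = 3x
xs = [i / 128 for i in range(128)]
ys = [3.0 * x for x in xs]

w = 0.5                 # every "GPU" starts from the same parameter value
num_gpus = 4
shard = len(xs) // num_gpus  # 32 samples per GPU

# Each GPU computes a gradient on its own slice of the batch.
local_grads = [
    mse_gradient(w, xs[g * shard:(g + 1) * shard], ys[g * shard:(g + 1) * shard])
    for g in range(num_gpus)
]

avg_grad = sum(local_grads) / num_gpus   # what an all-reduce average would produce
full_grad = mse_gradient(w, xs, ys)      # single-device reference

learning_rate = 0.1
w_new = w - learning_rate * avg_grad     # identical update applied on every replica
print("Averaged gradient:  ", avg_grad)
print("Full-batch gradient:", full_grad)
print("Updated w:", w_new)
```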

### Use Cases:

- Training neural networks
- Processing large datasets with same algorithm
- Monte Carlo simulations
- Image/video processing pipelines

2. Task Parallelism

Concept: Different tasks/functions run simultaneously on the same or different data.

Pipeline Example:

Data Flow: Raw Data → Preprocess → Feature Extract → Model Train → Evaluate

Sequential:
[Load] → [Clean] → [Extract] → [Train] → [Eval]

Total time: 5 + 3 + 4 + 10 + 2 = 24 minutes

### Parallel Pipeline:
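Each stage handles one batch at a time and starts as soon as its input is ready:
- Time 0-5:   [Load Batch 1]
- Time 5-10:  [Load Batch 2] + [Clean Batch 1] → [Extract Batch 1]
- Time 10-15: [Load Batch 3] + [Clean Batch 2] → [Extract Batch 2] + [Train Batch 1]
- Time 15-22: [Clean Batch 3] → [Extract Batch 3] + [Train Batch 1]

...

Steady state: one batch completes every ~10 minutes (the slowest stage, Train, sets the pace) instead of every 24 minutes, roughly a 2.4x throughput gain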
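A pipeline's timing can be worked out from the stage durations. The sketch below assumes one worker per stage, so each stage handles one batch at a time and starts a batch only when both the stage and its input are ready. Other assumptions (for example several workers per stage) give different totals. The function name `pipeline_finish_times` is hypothetical.

```python
def pipeline_finish_times(stage_minutes, num_batches):
    """Finish time of each batch when every stage processes one batch at a time."""
    stage_free_at = [0.0] * len(stage_minutes)  # when each stage next becomes idle
    finish_times = []
    for _ in range(num_batches):
        ready_at = 0.0  # when this batch is available to the next stage
        for s, duration in enumerate(stage_minutes):
            start = max(ready_at, stage_free_at[s])  # wait for input and for the stage
            ready_at = start + duration
            stage_free_at[s] = ready_at
        finish_times.append(ready_at)
    return finish_times

stages = [5, 3, 4, 10, 2]  # Load, Clean, Extract, Train, Eval (minutes)
num_batches = 3

finish = pipeline_finish_times(stages, num_batches)
sequential_total = sum(stages) * num_batches

print("Batch finish times (min):", finish)
print("Pipelined total:", finish[-1], "min; sequential total:", sequential_total, "min")
```

Under these assumptions the gap between consecutive finish times equals the duration of the slowest stage, which is why balancing stage durations matters more than speeding up an already fast stage.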

### Use Cases:

- ETL pipelines
- Real-time data streaming
- Multi-model ensemble training
- Independent experiments running simultaneously

3. Model Parallelism

### Concept:
Split a single large model across multiple devices.
### Layer-wise Splitting:
Large Neural Network:

Insert image here
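Layer-wise splitting can be sketched without any GPU. In the toy below, the `Device` class is a hypothetical stand-in for an accelerator, and each "layer" is reduced to a scalar weight and bias followed by ReLU. The point is only that each device stores part of the model and passes its activation to the next device.

```python
class Device:
    """Hypothetical stand-in for one accelerator that stores only its own layers."""

    def __init__(self, name, layers):
        self.name = name
        self.layers = layers  # list of (weight, bias) pairs held on this device

    def forward(self, activation):
        for weight, bias in self.layers:
            activation = max(0.0, weight * activation + bias)  # scalar layer + ReLU
        return activation

# A toy 4-layer network
all_layers = [(2.0, 1.0), (0.5, -1.0), (3.0, 0.0), (1.0, 2.0)]

device0 = Device("gpu:0", all_layers[:2])  # first half of the model
device1 = Device("gpu:1", all_layers[2:])  # second half of the model

def model_parallel_forward(x):
    hidden = device0.forward(x)     # computed on device 0
    return device1.forward(hidden)  # activation is handed over, device 1 continues

single_device = Device("gpu:all", all_layers)  # reference: whole model on one device

x = 4.0
print("Split across two devices:", model_parallel_forward(x))
print("Single device:           ", single_device.forward(x))
```

In a real framework the hand-over between devices is a data transfer, and while device 1 works, device 0 sits idle unless further batches are pipelined through it.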

### When to Use Model Parallelism:

- Model too large for single GPU memory
- Training very large language models (GPT-4, PaLM)
- Memory-bound rather than compute-bound scenarios

| Type            | Best For                              | Example                    | Complexity |
|-----------------|---------------------------------------|----------------------------|------------|
| **Data Parallel**  | Same computation on different data     | Training on multiple batches | Low        |
| **Task Parallel**  | Different operations simultaneously   | ETL pipeline stages          | Medium     |
| **Model Parallel** | Very large models                     | GPT-4 training               | High       |




# Python's Role in Parallel and Distributed Computing
## Python's Strengths for Parallel Computing
1. Rich Ecosystem


In [5]:
# Scientific computing
import numpy as np      # Optimized linear algebra
import pandas as pd     # Data manipulation
import scipy as sp      # Scientific algorithms

# Machine learning
import sklearn as sk    # scikit-learn (traditional ML)
import torch            # PyTorch (deep learning)
import tensorflow as tf # TensorFlow (Google's ML framework)

# Parallel / distributed computing
import multiprocessing as mp  # Built-in parallelism
import dask                    # Dask (Pandas/NumPy scaling)
import ray                     # Ray (distributed computing)



2. Easy Integration with High-Performance Computing

In [None]:
# NumPy operations run in C
import numpy as np
large_array = np.random.random((10000, 10000))
result1 = np.dot(large_array, large_array.T)  # C-speed matrix multiplication

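# CUDA operations through PyTorch
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present
gpu_tensor = torch.randn(10000, 10000, device=device)
result2 = torch.mm(gpu_tensor, gpu_tensor.T)  # GPU-accelerated when device == "cuda"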

## Process Memory Model
Each process has its own independent memory space, which includes:

- Virtual address space: Completely isolated from other processes
- Code segment: Program instructions
- Data segment: Global and static variables
- Heap: Dynamically allocated memory
- Stack: Local variables and function calls

Processes cannot directly access each other's memory. Communication between processes requires special mechanisms like pipes, shared memory segments, or message passing.
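A minimal sketch of this isolation, using the standard `multiprocessing` module: the child process changes a global variable and reports its value through a pipe, while the parent's copy stays untouched. The names `counter`, `child`, and `run_isolation_demo` are illustrative. Run it as a script rather than in a notebook cell, since some platforms need to re-import the module in the child.

```python
import multiprocessing as mp

counter = 0  # each process ends up with its own copy of this variable

def child(conn):
    """Runs in the child process: modify the local copy and report it."""
    global counter
    counter += 100
    conn.send(counter)  # explicit communication through a pipe
    conn.close()

def run_isolation_demo():
    parent_conn, child_conn = mp.Pipe()
    proc = mp.Process(target=child, args=(child_conn,))
    proc.start()
    child_value = parent_conn.recv()  # value as seen inside the child
    proc.join()
    return child_value, counter       # parent's counter was never modified

if __name__ == "__main__":
    child_value, parent_value = run_isolation_demo()
    print("Counter in child: ", child_value)
    print("Counter in parent:", parent_value)
```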
## Thread Memory Model
Threads within the same process share most memory, but maintain some private areas:

### Shared Memory:

- Code segment (program instructions)
- Data segment (global variables)
- Heap (dynamically allocated memory)
- Open file descriptors
- Signal handlers

### Private Memory:

- Stack: Each thread has its own stack for local variables and function calls
- Registers: CPU register states are private to each thread
- Program counter: Each thread tracks its own execution position
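The shared/private split can be sketched with the standard `threading` module. In this illustration, `shared_results` is one object visible to every thread, while `local_total` is a local variable private to each thread's own call. The names are hypothetical.

```python
import threading

shared_results = []              # one list object, visible to every thread
results_lock = threading.Lock()  # guards the shared list

def worker(worker_id):
    local_total = 0              # local variable: private to this thread's call frame
    for _ in range(1000):
        local_total += worker_id
    with results_lock:           # shared data is touched only under the lock
        shared_results.append((worker_id, local_total))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Arrival order depends on scheduling, so sort before printing.
print(sorted(shared_results))
```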

## Key Memory Implications
Isolation vs. Sharing:

- Processes provide strong isolation, but creating them and switching between them costs more
- Threads enable efficient data sharing but require careful synchronization

Memory Overhead:

- Creating a new process gives it a separate memory space (on Unix, `fork` copies pages lazily via copy-on-write)
- Creating a new thread only adds a new stack (typically 1-8MB)

Synchronization:

- Processes rarely need synchronization for memory access, unless they explicitly use shared memory
- Threads require locks, mutexes, or other synchronization primitives to prevent race conditions when accessing shared memory
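The last point can be sketched with `threading.Lock`. The `SafeCounter` class below is a hypothetical example. The read-modify-write in `increment` is wrapped in a lock so that concurrent updates cannot be lost.

```python
import threading

class SafeCounter:
    """Counter whose increments are protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:      # only one thread at a time may update the value
            self.value += 1

def run_counter_demo(num_threads=4, increments=10_000):
    counter = SafeCounter()

    def work():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value

total = run_counter_demo()
print("Final count:", total)
```

Without the lock, `self.value += 1` is a separate read and write, so two threads can read the same old value and one update can be lost.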

# GIL Challenge

## What is the Global Interpreter Lock (GIL)?
The GIL is a mutex in CPython (the reference Python implementation) that protects access to Python objects, allowing only one thread to execute Python bytecode at a time.

In [None]:
import time
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

def io_task(task_id, duration=1.0):
    """Simulated I/O task with logging."""
    thread_id = threading.get_ident()
    start = time.strftime("%H:%M:%S")
    print(f"[{start}] Thread-{thread_id} START task {task_id}")

    time.sleep(duration)  # Simulate blocking I/O (releases the GIL)

    end = time.strftime("%H:%M:%S")
    print(f"[{end}] Thread-{thread_id} END task {task_id}")
    return task_id

def run_io_simulation(num_tasks=8, workers=4, duration=1.0):
    print(f"\nRunning {num_tasks} I/O tasks with {workers} threads...\n")
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(io_task, i, duration) for i in range(num_tasks)]
        for fut in as_completed(futures):
            _ = fut.result()
    print("\nSimulation complete.")

# Run the demo
run_io_simulation(num_tasks=8, workers=4, duration=2)



Running 8 I/O tasks with 4 threads...

[02:27:58] Thread-132373323032128 START task 0
[02:27:58] Thread-132373257639488 START task 1
[02:27:58] Thread-132373274424896 START task 2
[02:27:58] Thread-132373266032192 START task 3
[02:28:00] Thread-132373323032128 END task 0[02:28:00] Thread-132373257639488 END task 1
[02:28:00] Thread-132373257639488 START task 4

[02:28:00] Thread-132373323032128 START task 5
[02:28:00] Thread-132373274424896 END task 2
[02:28:00] Thread-132373274424896 START task 6
[02:28:00] Thread-132373266032192 END task 3
[02:28:00] Thread-132373266032192 START task 7
[02:28:02] Thread-132373257639488 END task 4
[02:28:02] Thread-132373266032192 END task 7
[02:28:02] Thread-132373274424896 END task 6
[02:28:02] Thread-132373323032128 END task 5

Simulation complete.


This output shows how the GIL behaves in an I/O-bound context:

At 02:27:58, 4 threads (equal to the worker pool size) start tasks 0–3 simultaneously.

Each one goes to sleep (time.sleep simulates blocking I/O), and importantly:

time.sleep releases the GIL, so the interpreter can let other threads run freely.

At 02:28:00, all 4 threads wake up almost together, finish tasks 0–3, and immediately pick up the next 4 tasks (4–7). Two END messages share one line because print calls from different threads are not synchronized.

At 02:28:02, all finish nearly together again, so 8 tasks of 2 seconds each complete in about 4 seconds instead of 16.

This shows:

Threads are highly effective for I/O-bound workloads, since the GIL is released while threads are sleeping or waiting for I/O.

If this were CPU-bound work (tight Python loops), only one thread would execute Python bytecode at a time (the GIL would serialize them).

In [None]:
import time
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
import math

def cpu_task(task_id, n=5_000_000):
    """Simulated CPU-bound task with logging."""
    thread_id = threading.get_ident()
    start = time.strftime("%H:%M:%S")
    print(f"[{start}] Thread-{thread_id} START task {task_id}")

    # CPU-intensive loop (stays under the GIL)
    s = 0.0
    for i in range(1, n):
        s += math.sqrt(i) * math.sin(i)  # heavy but pure Python
    result = s

    end = time.strftime("%H:%M:%S")
    print(f"[{end}] Thread-{thread_id} END task {task_id}")
    return task_id, result

def run_cpu_simulation(num_tasks=4, workers=4, n=5_000_000):
    print(f"\nRunning {num_tasks} CPU-bound tasks with {workers} threads...\n")
    t0 = time.perf_counter()

    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(cpu_task, i, n) for i in range(num_tasks)]
        for fut in as_completed(futures):
            _ = fut.result()

    t1 = time.perf_counter()
    print(f"\nSimulation complete in {t1 - t0:.2f} seconds.")

# Run the demo
run_cpu_simulation(num_tasks=4, workers=4, n=500_000)  # adjust n upward for slower/clearer demo



Running 4 CPU-bound tasks with 4 threads...

[02:32:31] Thread-132373323032128 START task 0
[02:32:31] Thread-132373266032192 START task 1
[02:32:31] Thread-132373274424896 START task 2
[02:32:31] Thread-132373257639488 START task 3
[02:32:32] Thread-132373266032192 END task 1
[02:32:32] Thread-132373323032128 END task 0
[02:32:32] Thread-132373257639488 END task 3
[02:32:32] Thread-132373274424896 END task 2

Simulation complete in 0.26 seconds.
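The run above finishes quickly but does not show, on its own, whether the threads helped. One way to make the GIL's effect visible is to time the same work run sequentially and in a thread pool. This is a minimal sketch: the helper names `cpu_work` and `timed` and the problem size are assumptions. On a standard CPython build with the GIL, the two timings tend to be similar; free-threaded builds may behave differently.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_work(n):
    """Pure-Python loop that holds the GIL while it runs."""
    s = 0.0
    for i in range(1, n):
        s += math.sqrt(i) * math.sin(i)
    return s

def timed(label, fn):
    """Run fn(), print its wall-clock time, and return (result, seconds)."""
    t0 = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - t0
    print(f"{label}: {elapsed:.2f} s")
    return result, elapsed

N, TASKS = 200_000, 4

def run_sequential():
    return [cpu_work(N) for _ in range(TASKS)]

def run_threaded():
    with ThreadPoolExecutor(max_workers=TASKS) as pool:
        return list(pool.map(cpu_work, [N] * TASKS))

seq_results, seq_time = timed("Sequential", run_sequential)
thr_results, thr_time = timed("4 threads ", run_threaded)
print(f"Threaded / sequential time ratio: {thr_time / seq_time:.2f}")
```

A ratio close to 1 means the four threads took turns instead of running in parallel, whereas the I/O-bound demo earlier overlapped its waiting time.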


In [None]:
import time
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
import math

def cpu_task(task_id, n=2_000_000, log_every=500_000):
    """CPU-bound task with per-iteration logging to show which thread is active."""
    thread_id = threading.get_ident()
    start = time.strftime("%H:%M:%S")
    print(f"[{start}] Thread-{thread_id} START task {task_id}")

    s = 0.0
    for i in range(1, n + 1):
        s += math.sqrt(i) * math.sin(i)

        # Log occasionally to avoid flooding the output
        if i % log_every == 0:
            ts = time.strftime("%H:%M:%S")
            print(f"[{ts}] Thread-{thread_id} working on task {task_id}, iteration {i}")

    end = time.strftime("%H:%M:%S")
    print(f"[{end}] Thread-{thread_id} END task {task_id}")
    return task_id, s

def run_cpu_simulation(num_tasks=4, workers=4, n=2_000_000):
    print(f"\nRunning {num_tasks} CPU-bound tasks with {workers} threads...\n")
    t0 = time.perf_counter()

    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(cpu_task, i, n) for i in range(num_tasks)]
        for fut in as_completed(futures):
            _ = fut.result()

    t1 = time.perf_counter()
    print(f"\nSimulation complete in {t1 - t0:.2f} seconds.")

# Run the demo
run_cpu_simulation(num_tasks=4, workers=4, n=2_000_000)



Running 4 CPU-bound tasks with 4 threads...

[02:21:41] Thread-132373257639488 START task 0
[02:21:41] Thread-132373323032128 START task 1
[02:21:41] Thread-132373266032192 START task 2
[02:21:41] Thread-132373274424896 START task 3
[02:21:41] Thread-132373257639488 working on task 0, iteration 500000
[02:21:41] Thread-132373323032128 working on task 1, iteration 500000
[02:21:42] Thread-132373266032192 working on task 2, iteration 500000
[02:21:42] Thread-132373323032128 working on task 1, iteration 1000000
[02:21:42] Thread-132373266032192 working on task 2, iteration 1000000
[02:21:42] Thread-132373323032128 working on task 1, iteration 1500000
[02:21:42] Thread-132373266032192 working on task 2, iteration 1500000
[02:21:42] Thread-132373274424896 working on task 3, iteration 500000
[02:21:42] Thread-132373266032192 working on task 2, iteration 2000000
[02:21:42] Thread-132373266032192 END task 2
[02:21:42] Thread-132373257639488 working on task 0, iteration 1000000
[02:21:42] Thre