# Parallel Programming: Multiprocessing, Multithreading, Asynchronous

## Course Material: Parallel Programming in Python

### Overview
Parallel programming is a technique to run multiple operations concurrently, thereby reducing the time to complete a task. Python offers several ways to achieve parallelism, including multiprocessing, multithreading, and asynchronous programming. In addition, libraries like CuPy, Numba, and cuDF can leverage GPU capabilities to further enhance performance.

### Learning Objectives
- Understand the concepts of parallel programming.
- Learn the differences between multiprocessing, multithreading, and asynchronous programming.
- Utilize Python's standard library for parallelism.
- Explore advanced libraries such as CuPy, Numba, and cuDF for GPU-based parallelism.

### 1. Multithreading

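**Concept:** Multithreading runs tasks concurrently in several threads of the same process. Python's `threading` module allows you to create and manage threads. In CPython, the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time, so threads are best suited to I/O-bound tasks, where most of the time is spent waiting rather than computing.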

**Notes on Parallelism in Python**

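 - On POSIX systems other than macOS, `multiprocessing` has historically used `fork()` to create new processes, which can deadlock when the parent process holds locks or runs several threads at the moment of the fork. macOS has defaulted to `spawn` since Python 3.8, and Windows supports only `spawn`. Starting with Python 3.14, the default on the remaining POSIX platforms changes from `fork` to `forkserver` [[1](https://pythonspeed.com/articles/python-multiprocessing/)][[2](https://github.com/python/cpython/issues/84559)].
 - Work is under way to make the GIL optional in CPython: PEP 703 has been accepted, and Python 3.13 ships an experimental free-threaded build [[1](https://developer.vonage.com/en/blog/removing-pythons-gil-its-happening)][[2](https://peps.python.org/pep-0703/)].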
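
To check which start method a given interpreter actually uses, the `multiprocessing` module can be queried directly. This is a minimal sketch; the printed values depend on the platform and Python version, so no output is shown.

```python
import multiprocessing
import sys

# Start methods supported on this platform (some of 'fork', 'spawn', 'forkserver').
available = multiprocessing.get_all_start_methods()

# The method used when no explicit context is requested.
default = multiprocessing.get_start_method()

print(f"Python {sys.version_info.major}.{sys.version_info.minor} on {sys.platform}")
print(f"Available start methods: {available}")
print(f"Default start method:    {default}")
```

A specific method can be requested for one piece of code with `multiprocessing.get_context("spawn")`, which avoids changing the global default.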

**Example:**

In [1]:
import threading
import time

def worker(num):
    """Thread worker function"""
    print(f'Worker: {num}')
    time.sleep(2)
    print(f'Worker {num} done')

threads = []
for i in range(5):
    t = threading.Thread(target=worker, args=(i,))
    threads.append(t)
    t.start()

for t in threads:
    t.join()

Worker: 0Worker: 1

Worker: 2
Worker: 3
Worker: 4
Worker 2 doneWorker 3 done
Worker 1 done

Worker 0 done
Worker 4 done


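**Explanation:** This script starts five threads that execute the `worker` function concurrently. Each thread sleeps for 2 seconds before completing, and because the sleeps overlap, the whole script finishes in roughly 2 seconds rather than 10. Some output lines run together (for example `Worker: 0Worker: 1`) because `print` calls from different threads are not synchronized.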
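
The overlap can be measured directly. The sketch below is an illustration rather than part of the course files: it uses a hypothetical `io_task` function and a shorter delay of 0.2 seconds so that it runs quickly, then compares the measured time with the sequential estimate.

```python
import threading
import time

def io_task(delay):
    """Stand-in for an I/O-bound call; sleeping releases the GIL."""
    time.sleep(delay)

delay = 0.2      # seconds each task waits (hypothetical value)
num_tasks = 5

start = time.perf_counter()
threads = [threading.Thread(target=io_task, args=(delay,)) for _ in range(num_tasks)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(f"Sequential estimate: {delay * num_tasks:.2f} s")
print(f"Threaded elapsed:    {elapsed:.2f} s")
```

The threaded time should be close to a single delay, since all five threads wait at the same time.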

A thread pool allows you to manage a pool of threads and submit tasks to it without manually creating and managing individual threads.

Let's enhance the previous example by using a thread pool from Python's `concurrent.futures` module.

In [2]:
from concurrent.futures import ThreadPoolExecutor
import time

def worker(num):
    """Thread worker function"""
    print(f'Worker: {num}')
    time.sleep(2)
    print(f'Worker {num} done')

# Number of threads in the pool
num_threads = 3

# Create a ThreadPoolExecutor
with ThreadPoolExecutor(max_workers=num_threads) as executor:
    # Submit tasks to the executor
    futures = [executor.submit(worker, i) for i in range(5)]

    # Wait for all futures to complete
    for future in futures:
        # You can call future.result() here if you need to catch exceptions
        # or get return values from the worker function
        future.result()

print("All workers have completed.")

Worker: 0
Worker: 1
Worker: 2
Worker 0 done
Worker: 3
Worker 1 done
Worker: 4
Worker 2 done
Worker 3 done
Worker 4 done
All workers have completed.


### Explanation:
1. **ThreadPoolExecutor**: We use `ThreadPoolExecutor` from the `concurrent.futures` module to manage a pool of threads. The `max_workers` parameter specifies the maximum number of threads that can be active at the same time. 

2. **Submitting Tasks**: We use `executor.submit(worker, i)` to submit tasks to the thread pool. This returns a `Future` object representing the execution of the function. 

3. **Waiting for Completion**: `future.result()` is called to block until the individual task completes. If an exception was raised during task execution, it will be re-raised when calling `result()`, allowing you to handle it.

4. **Context Manager**: The `with` statement ensures that the `ThreadPoolExecutor` is properly cleaned up and closed after use.

This example uses a pool of 3 threads to execute 5 tasks, so workers 3 and 4 start only after an earlier worker has finished, as the output shows. The thread pool handles scheduling for you, making your code more concise and easier to manage.
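
The points about return values and re-raised exceptions can be sketched as follows. The `risky_worker` function and its failure for input 3 are hypothetical, chosen only to show both outcomes; `as_completed` is the standard `concurrent.futures` helper that yields futures as they finish.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def risky_worker(num):
    """Hypothetical worker: returns a value, but fails for num == 3."""
    if num == 3:
        raise ValueError(f"Worker {num} failed")
    return num * 10

results = {}
errors = {}

with ThreadPoolExecutor(max_workers=3) as executor:
    # Map each Future back to the input that produced it.
    future_to_num = {executor.submit(risky_worker, i): i for i in range(5)}

    # as_completed yields futures in the order they finish, not submission order.
    for future in as_completed(future_to_num):
        num = future_to_num[future]
        try:
            results[num] = future.result()   # re-raises the worker's exception
        except ValueError as exc:
            errors[num] = str(exc)

print("Results:", dict(sorted(results.items())))
print("Errors: ", errors)
```

Storing the futures in a dictionary keeps the link between each result and its input, which is useful because completion order is not predictable.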

### 2. Multiprocessing

**Concept:** Multiprocessing runs tasks in separate processes, each with its own Python interpreter and its own GIL, so CPU-bound work can use multiple CPU cores at the same time. Python's `multiprocessing` module allows you to create processes, share data between them, and manage process pools.

**Example:**

In [4]:
import multiprocessing
import os

def worker(num):
    """Process worker function"""
    print(f'Worker: {num}, PID: {os.getpid()}')

if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=worker, args=(i,))
        jobs.append(p)
        p.start()

    # Wait for all processes to finish
    for p in jobs:
        p.join()

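When you run the multiprocessing code above in a JupyterLab notebook, it may look as if nothing happens. Child processes started with the `spawn` method (the only option on Windows) must be able to import the `__main__` module to find `worker`, and a function defined in a notebook cell is not importable that way. In addition, output printed by child processes does not necessarily appear in the notebook cell. Running the code as a regular script avoids both problems.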
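
Process pools, mentioned in the concept above, follow the same script-based pattern. The sketch below is an illustration and not the contents of `03_parallel.py`: the `count_primes` function and the `limits` values are hypothetical, and the code is meant to be saved to a file and run from a terminal rather than from a notebook cell.

```python
import multiprocessing
import os

def count_primes(limit):
    """CPU-bound work: count the primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count, os.getpid()

if __name__ == '__main__':
    limits = [10_000, 20_000, 30_000, 40_000]  # hypothetical workloads

    # The pool distributes the inputs across two worker processes.
    with multiprocessing.Pool(processes=2) as pool:
        results = pool.map(count_primes, limits)

    for limit, (count, pid) in zip(limits, results):
        print(f'{count} primes below {limit} (computed in PID {pid})')
```

The `if __name__ == '__main__':` guard matters here: with the `spawn` and `forkserver` start methods, each child imports the script, and the guard keeps the children from creating pools of their own.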

---

Switch to VSCode
---

1. Open the `03_parallel` directory in VSCode.
2. Open the `03_parallel.py` file in the editor.
3. Inspect the code.
4. Open the `03_parallel` directory in the terminal.
5. Activate the virtual environment in the terminal: `C:\Users\Administrator\python-advanced\venv\Scripts\activate`
6. Run the `03_parallel.py` file, using the following command: `python 03_parallel.py`

---

### 3. Asynchronous Programming

**Concept:** Asynchronous programming interleaves tasks within a single thread using coroutines, which are special functions that can pause at an `await` and resume later. While one coroutine waits (for example on network I/O), the event loop runs another. The `asyncio` module facilitates writing this kind of single-threaded concurrent code.

**Example:**

In [6]:
import asyncio

async def worker(num):
    """Async worker function"""
    print(f'Worker: {num}')
    await asyncio.sleep(2)
    print(f'Worker {num} done')

async def main():
    tasks = [worker(i) for i in range(5)]
    await asyncio.gather(*tasks)

# In a regular script, use: asyncio.run(main())
# In a notebook the event loop is already running, so await directly:
await main()

Worker: 0
Worker: 1
Worker: 2
Worker: 3
Worker: 4
Worker 0 done
Worker 2 done
Worker 4 done
Worker 1 done
Worker 3 done


**Explanation:** This script runs five coroutines concurrently using `asyncio`. Each coroutine sleeps for 2 seconds before completing.

**Note:** In JupyterLab, you use `await main()` instead of `asyncio.run(main())` because JupyterLab's event loop is already running. Using `await` allows you to work within this existing event loop, while `asyncio.run()` tries to start a new event loop, which can cause conflicts or errors.
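
Coroutines can also return values, which `asyncio.gather` collects. The following sketch uses a hypothetical `fetch_value` coroutine whose delays are chosen so that later workers finish first. It is written as a script, so it calls `asyncio.run`.

```python
import asyncio

async def fetch_value(num):
    """Hypothetical coroutine: waits briefly, then returns a result."""
    await asyncio.sleep(0.1 * (5 - num))  # later workers finish sooner
    return num * num

async def main():
    # gather() returns results in the order the awaitables were passed,
    # regardless of which coroutine finished first.
    return await asyncio.gather(*(fetch_value(i) for i in range(5)))

# In a script; inside a notebook cell use `results = await main()` instead.
results = asyncio.run(main())
print(results)  # [0, 1, 4, 9, 16]
```

Even though worker 4 completes first, its result stays in the last position, so results can be matched to inputs by index.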

### **Asyncio vs Threads**

**Asyncio:**
- **Definition**: Asyncio is a library used to write concurrent code using the `async`/`await` syntax. The `asyncio` module has been part of the Python standard library since Python 3.4, and the `async`/`await` syntax was added in Python 3.5.
- **Event Loop**: Central to asyncio, the event loop runs asynchronous tasks and callbacks, performs network IO operations, and runs subprocesses.
- **Concurrency Model**: Cooperative multitasking, where tasks yield control back to the event loop at await points, allowing other tasks to run.

**Threads:**
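- **Definition**: Threads are smaller units of execution within a process that share its memory. Running several threads achieves concurrency, although in CPython the GIL prevents them from executing Python bytecode truly in parallel. Python's `threading` module provides high-level support for threads.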
- **Concurrency Model**: Pre-emptive multitasking, where the operating system decides when a thread is interrupted and another is run.
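
Cooperative multitasking can be made visible with a small log. In this sketch (the `step` coroutine and the `log` list are illustrative names), each coroutine runs without interruption until it reaches an `await`, and only then does the event loop switch to the other one.

```python
import asyncio

log = []

async def step(name):
    log.append(f"{name}: part 1")
    await asyncio.sleep(0)   # await point: yield control to the event loop
    log.append(f"{name}: part 2")

async def main():
    await asyncio.gather(step("A"), step("B"))

# In a script; inside a notebook cell use `await main()` instead.
asyncio.run(main())
print(log)
```

With the default event loop, both "part 1" entries are logged before either "part 2" entry. Threads offer no such guarantee, because the operating system may switch between them at any point.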

### **Programming Model**

**Asyncio:**
- **Code Style**: Uses `async def` to declare asynchronous functions and `await` to yield control.
- **Complexity**: Requires understanding of coroutines and the event-driven programming model.
- **Examples**:
    ```python
    import asyncio

    async def main():
        await asyncio.sleep(1)
        print("Hello, world!")

    asyncio.run(main())
    ```

**Threads:**
- **Code Style**: Uses the `threading` module, creating threads by subclassing `Thread` or directly instantiating `Thread` objects.
- **Complexity**: Easier to grasp for those familiar with traditional multi-threading but requires careful handling of shared resources to avoid race conditions.
- **Examples**:
    ```python
    import threading, time

    def worker():
        time.sleep(1)
        print("Hello, world!")

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    ```
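
The careful handling of shared resources mentioned above usually means protecting shared state with a lock. A minimal sketch, assuming a shared `counter` that four threads update (all names here are illustrative):

```python
import threading

counter = 0
counter_lock = threading.Lock()

def increment(times):
    """Add 1 to the shared counter `times` times, one locked update at a time."""
    global counter
    for _ in range(times):
        # Without the lock, the read-modify-write below could interleave
        # with another thread's update and lose increments.
        with counter_lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```

The lock serializes the updates, so the final value is always 4 × 10,000. The cost is that threads briefly wait for each other, which is why locks are best held for as short a time as possible.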

### **Integration and Ecosystem**

**Asyncio:**
- **Integration**: Well-integrated with modern Python libraries and frameworks (e.g., `aiohttp`, `FastAPI`).
- **Ecosystem**: Growing ecosystem, especially for network-related and async-compatible libraries.

**Threads:**
- **Integration**: Supported by virtually all Python libraries, as threading is a fundamental part of many applications.
- **Ecosystem**: Mature ecosystem with extensive support in the standard library and third-party modules.


---

Switch to [Google Colab](https://colab.research.google.com/drive/18KgT9cFa4MCBk2tyhb_9-YGux3aWVWsd)
---

---

### 4. GPU-based Parallelism

#### 4.1. CuPy

**Concept:** CuPy is a GPU array library that leverages NVIDIA CUDA to accelerate computations.

**Example:**

```python
import cupy as cp

# Create a random array on the GPU
x = cp.random.rand(1000000)

# Perform elementwise operations
y = cp.sin(x)

# Transfer data back to the host (CPU)
y_host = cp.asnumpy(y)

print(y_host[:10])
```

**Explanation:** This script generates a large random array on the GPU, computes the sine of each element, and transfers the result back to the CPU.

#### 4.2. Numba

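**Concept:** Numba is a JIT compiler that translates a subset of Python and NumPy code into fast machine code. The example below runs on the CPU; Numba can also target NVIDIA GPUs through its `numba.cuda` module.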

**Example:**

```python
from numba import jit
import numpy as np

@jit(nopython=True)
def sum_2d_array(arr):
    m, n = arr.shape
    result = 0.0
    for i in range(m):
        for j in range(n):
            result += arr[i, j]
    return result

arr = np.random.rand(1000, 1000)
result = sum_2d_array(arr)
print(result)
```

**Explanation:** This script uses Numba to compile a function that sums a 2D array, significantly speeding up the computation compared to pure Python. The first call includes the compilation time; subsequent calls reuse the compiled machine code.

#### 4.3. cuDF

**Concept:** cuDF is a GPU DataFrame library that provides a pandas-like API for manipulating large datasets with GPU acceleration.

**Example:**

```python
import cudf
import pandas as pd

# Create a DataFrame on the CPU
pdf = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Transfer it to the GPU
gdf = cudf.DataFrame.from_pandas(pdf)

# Perform operations on the GPU
gdf['c'] = gdf['a'] + gdf['b']

print(gdf)
```

**Explanation:** This script creates a pandas DataFrame, transfers it to the GPU, performs an elementwise addition, and prints the resulting cuDF DataFrame.

---

### Summary

- **Multiprocessing:** Suitable for CPU-bound tasks. Utilizes multiple CPU cores.
- **Multithreading:** Suitable for I/O-bound tasks. Utilizes multiple threads.
- **Asynchronous:** Suitable for I/O-bound tasks. Uses coroutines for concurrency.
- **CuPy:** GPU array library for fast numerical computations.
- **Numba:** JIT compiler for fast machine code generation.
- **cuDF:** GPU DataFrame library for handling large datasets.

### Exercises

1. **Multiprocessing:** Modify the multiprocessing example to calculate the sum of squares for a list of numbers using multiple processes.
2. **Multithreading:** Modify the multithreading example to read multiple files concurrently and print their content.
3. **Asynchronous:** Write an asynchronous program to fetch data from multiple URLs concurrently.
4. **CuPy:** Implement matrix multiplication using CuPy and compare the performance with NumPy.
5. **Numba:** Use Numba to optimize a function that computes the Mandelbrot set.
6. **cuDF:** Load a large CSV file into a cuDF DataFrame and perform some basic data analysis operations.

### Sources

 - ["Why your multiprocessing Pool is stuck (it’s full of sharks!)" - Python fork deadlock](https://pythonspeed.com/articles/python-multiprocessing/)
 - [multiprocessing's default posix start method of 'fork' is broken: change to 'spawn' #84559](https://github.com/python/cpython/issues/84559)
 - [Removing Python's GIL: It's Happening!](https://developer.vonage.com/en/blog/removing-pythons-gil-its-happening)
 - [PEP 703 – Making the Global Interpreter Lock Optional in CPython](https://peps.python.org/pep-0703/)

### Further Reading

- Python [concurrent.futures.ProcessPoolExecutor](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor).
- Python [multiprocessing](https://docs.python.org/3/library/multiprocessing.html) documentation.
- Python [threading](https://docs.python.org/3/library/threading.html) documentation.
- Python [asyncio](https://docs.python.org/3/library/asyncio.html) documentation.
- [CuPy](https://docs.cupy.dev/en/stable/) documentation.
- [Numba](https://numba.pydata.org/) documentation.
- [cuDF](https://docs.rapids.ai/api/cudf/stable/) documentation.

# Thread vs Process

A comparison is presented in `99_parallel.py`. Run this file from VSCode to inspect the results.