# CleanCodeAndDirtyTricks 🚀
## Session 02: Optimize that snake! 🐍

<div style="display: flex; justify-content: space-between;">

<div style="width: 15%;">
<img src="imgs/logo_rot.png" alt="Drawing" style="width: 50px;"/>
</div>

<div style="width: 80%;">
<img src="imgs/speed_up.png" alt="Drawing" style="width: 450px;"/>
</div>

</div>

@#DKOP#TTK#WJSZ#PHYSICS#ComputerScience

## 🧵 Topics We'll Cover

<div style="display: flex; justify-content: space-between;">

<div style="width: 48%;">

### 🚀 Just-In-Time (JIT) Compiling

- **Numba** – Accelerate loops and math-heavy code
- **JAX** – Auto-differentiation + JIT + GPU/TPU-ready
- **PyPy** – Drop-in faster Python interpreter

---

### 🔄 Parallelism & Concurrency

- **Multiprocessing** – True parallelism using processes
- **Multithreading** – Limited by GIL, but useful for I/O
- **GIL (Global Interpreter Lock)** – What it is and why it matters
- **CuPy and PyTorch** – GPU-accelerated arrays and tensors
</div>

<div style="width: 48%;">


### 🧪 Other Performance Boosters

- **Choosing the Right Interpreter**  
  - CPython, PyPy, and beyond
- **Benchmarking & Profiling**  
  - `cProfile`, `line_profiler`, `memory_profiler`

---

## 🧭 Goal of This Session

- Learn **how Python works under the hood**
- Discover **what tools to use when**
- Write code that’s not just clean — but **fast and scalable**

</div>
</div>

> _“Python isn’t slow; your loops are.”_

## Disclaimer

Almost every figure is borrowed from the articles listed in the **References**, or from Google. I do not claim authorship of any of them; they are used solely for educational purposes.

# 1. Why is Python Slow? - 🔐 Understanding the GIL  
_The Global Interpreter Lock in Python_

<img src="imgs/What-is-the-Python-Global-Interpreter-Lock-GIL_Watermarked.0695d8c16efe.avif" alt="Drawing" style="width: 600px;"/>

## 🧠 What Is the GIL?

- The GIL is a **mutex (lock)** that ensures **only one thread executes Python bytecode at a time**, even on multi-core systems.
- Originally introduced in **CPython** to protect memory during **reference counting**.

## 🔍 Why Does It Exist?

- Prevents **race conditions** during memory management (ref counting).
- Avoids **deadlocks** by locking the entire interpreter instead of individual objects.
- Simplifies integration with **C extensions**, many of which are not thread-safe.

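Example: reference counting in Python:

```python
import sys

a = []
b = a  # a second name bound to the same list
# getrefcount reports one extra reference for its own argument
print(sys.getrefcount(a))  # typically 3 on CPython
```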

## 🧨 The Drawback

- In **CPU-bound programs**, multithreading is ineffective:
  - Threads fight for the GIL — one runs, others wait.
  - Only one CPU core is used effectively.

### Not ideal for:
- Heavy numerical computations
- Data processing with many threads


## ✅ When It's Okay

- **I/O-bound programs** (file, network, DB wait time):
  - Threads can release the GIL while waiting
  - Useful for concurrent I/O


## 🛠 How to Work Around It

- Use **multiprocessing**: Each process has its own Python interpreter & GIL
- Use **Numba**: compiled functions can release the GIL (`nogil=True`) or run on several cores (`parallel=True`). Note that **PyPy** speeds up single-threaded code but still has a GIL
- Offload to **GPU** (e.g., with **JAX**, **CuPy**) or C libraries

---

> _“The GIL makes Python simple and fast for single-threaded tasks — but a bottleneck for CPU-bound parallelism.”_


## Other notable mentions

- **Garbage collection:** A form of automatic memory management in which the language runtime reclaims memory that the program no longer uses.
- **Dynamic typing:** Python is dynamically typed, so you don't declare the type of a variable when you initialize it. Types are determined at run time, i.e. **the interpreter needs to do type-checking every time it executes a piece of code**.
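
Both points can be seen with the standard library alone. The sketch below is illustrative only: `Node`, `cycle_is_collected`, and `add` are made-up names, and the behaviour shown is that of CPython.

```python
import gc
import weakref


class Node:
    """Illustrative class whose instances can point at each other."""

    def __init__(self):
        self.other = None


def cycle_is_collected():
    # Build a reference cycle: a -> b -> a
    a, b = Node(), Node()
    a.other, b.other = b, a
    probe = weakref.ref(a)  # observes `a` without keeping it alive
    del a, b                # ref counts stay above zero because of the cycle
    gc.collect()            # the cyclic garbage collector reclaims both objects
    return probe() is None


def add(x, y):
    # No declared types: the interpreter resolves `+` at run time
    return x + y


print(cycle_is_collected())
print(add(1, 2), add("a", "b"))
```

Reference counting alone cannot free the cycle, which is why CPython also runs a separate cyclic collector.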

# 2. ⚡ Speeding Up Python with Numba
<img src="imgs/python_fast.png" alt="Drawing" style="width: 500px;"/>

## 🚀 What is Numba?

- **Numba** is a **Just-In-Time (JIT) compiler** for numerical Python.
- Translates a subset of Python + NumPy into **fast machine code** using LLVM.
- No need to rewrite your code in C or C++!



## 🧪 How to Use It

### Simple Example:
```python
from numba import jit

@jit
def sum_numbers(n):
    total = 0
    for i in range(n):
        total += i
    return total
```

- First run compiles it — next runs are 🔥 fast!


## ✅ Key Features

- **@jit** — main decorator for JIT compiling
- **nopython mode** — best performance (no Python objects allowed)
- **parallel=True** — multi-core execution
- **@vectorize** and **@guvectorize** — write your own fast NumPy ufuncs
- **@jitclass** — compile custom classes
- Works with **NumPy**, supports **CUDA for GPUs**


## ⚠️ When to Use Numba
<div style="display: flex; justify-content: space-between;">

<div style="width: 48%;">

### Ideal for:
- Loops, number-crunching, array processing
- Tight, performance-critical functions
</div>

<div style="width: 48%;">

### Avoid for:
- Code with lots of objects or dynamic typing
- Heavy use of Python standard library
- I/O-bound code or logic-heavy app code
</div>
</div>


## 🔧 Numba: njit = Fastest Mode
## 🧠 What’s `@njit`?

- Short for **no-Python JIT** — alias for `@jit(nopython=True)`
- Forces Numba to compile in the fastest mode:
  - No use of Python objects
  - Fully converted to machine code
- Errors if it can’t compile without Python runtime


<div style="display: flex; justify-content: space-between;">

<div style="width: 48%;">

### Instead of:
```python
@jit(nopython=True)
def fast_func(x): ...
```
</div>

<div style="width: 48%;">

### ✅ Cleaner:
```python
from numba import njit

@njit
def fast_func(x): ...
```
</div>
</div>

> 💡 Best performance comes from sticking to NumPy arrays and numerical types.


## 🧵 Numba: Parallel Execution with `parallel=True`
## 🚀 What It Does

- Tells Numba to **auto-parallelize** your code across CPU cores
- Use in `@jit` or `@njit`:

```python
from numba import njit, prange

@njit(parallel=True)
def sum_parallel(arr):
    total = 0.0
    for i in prange(len(arr)):
        total += arr[i]
    return total
```

- `prange` = parallel version of `range`


## ⚠️ Notes

- Only works in **nopython mode**
- Performance gains depend on problem size and CPU cores
- Can be combined with vectorized operations too

## 🔍 Check if Parallel Worked

Run with:
```bash
NUMBA_DEBUG_ARRAY_OPT_STATS=1 python myscript.py
```

> _“Parallel=True is like turbo mode — but use it only when you have real work to spread.”_

## 📏 Performance Tip

- Always test with `nopython=True` and `parallel=True`
- Profile with `%timeit`, `cProfile`, or `line_profiler`
- Use Numba's built-in diagnostics (`numba --annotate-html`)
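
For profiling outside a notebook, `cProfile` can be driven from code. This is a small sketch using only the standard library; `slow_sum` and `profile_call` are hypothetical names chosen for the example.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)  # at most five entries
    return result, buffer.getvalue()


result, report = profile_call(slow_sum, 100_000)
print(report)
```

Profile first, then decide which function deserves a `@njit`.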

> _“With one decorator, you can get C-like speed in Python.”_  
— [Numba Docs](https://numba.pydata.org/)

# 3. ⚙️ JAX: High-Performance Numerical Computing for Python
<img src="imgs/jax_logo.jpeg" alt="Drawing" style="width: 500px;"/>


## 🚀 What Is JAX?

- A **NumPy-compatible** library with superpowers:
  - Automatic **differentiation** (like TensorFlow or PyTorch)
  - Fast execution via **XLA (Accelerated Linear Algebra)**
  - Seamless support for **GPU and TPU** execution
- Developed by Google; ideal for ML, simulations, scientific computing

## 🔍 Why JAX > NumPy?

| Feature             | NumPy       | JAX                      |
|---------------------|-------------|--------------------------|
| GPU/TPU Support     | ❌ No        | ✅ Built-in              |
| Auto-Differentiation| ❌ No        | ✅ With `grad()`         |
| JIT Compilation     | ❌ No        | ✅ With `@jit`           |
| Parallelism         | Limited     | ✅ `pmap` (multi-device), `vmap` (vectorization) |

## 🧠 Core Features

- `jit(f)` → Just-in-time compile any function  
- `grad(f)` → Auto compute gradients  
- `vmap(f)` → Vectorize functions easily  
- `pmap(f)` → Parallelize across multiple devices


## ✨ Example

```python
import jax.numpy as jnp
from jax import grad, jit

def square(x):
    return x ** 2

fast_square = jit(square)
deriv = grad(square)

print(fast_square(3.0))  # 9.0
print(deriv(3.0))        # 6.0
```
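
The transformations also compose. The sketch below, assuming JAX is installed, differentiates a made-up `loss` function, maps the derivative over a batch with `vmap`, and compiles the result with `jit`.

```python
import jax.numpy as jnp
from jax import grad, jit, vmap


def loss(w, x):
    # Scalar function of a scalar weight and a scalar input
    return (w * x - 1.0) ** 2


dloss_dw = grad(loss)  # derivative with respect to the first argument, w
# Map over the entries of x while sharing the same w, then compile
batched = jit(vmap(dloss_dw, in_axes=(None, 0)))

xs = jnp.array([1.0, 2.0, 3.0])
print(batched(2.0, xs))  # [ 2. 12. 30.]
```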


## ⚠️ Notes

- Arrays are immutable
- Debugging can be tricky (compiled under the hood)
- Still maturing, especially for general-purpose computing
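
Immutability is the note that surprises NumPy users most. A minimal sketch, assuming JAX is installed, of what fails and what to write instead:

```python
import jax.numpy as jnp

x = jnp.zeros(3)

try:
    x[0] = 1.0  # in-place assignment is rejected
except TypeError as err:
    print("TypeError:", str(err).splitlines()[0])

y = x.at[0].set(1.0)  # functional update: returns a new array
print(x, y)
```

`x` is unchanged after the update; only `y` carries the new value.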


> _“JAX is NumPy on steroids, built for modern ML and hardware acceleration.”_

# 4. ⚡ CuPy: GPU-Accelerated NumPy
<img src="imgs/cupy.png" alt="Drawing" style="width: 500px;"/>

## 🚀 What is CuPy?

- A **NumPy-compatible** array library that runs on **GPU**
- Built with CUDA — designed for high-performance numerical computing
- Drop-in replacement: just `import cupy as cp` instead of `numpy`

## 🧠 Key Features

- Uses NVIDIA CUDA libraries under the hood (cuBLAS, cuDNN, cuFFT, etc.)
- Supports broadcasting, indexing, ufuncs, slicing — just like NumPy
- 100x speed-up possible on some operations
- Custom CUDA kernels via `ElementwiseKernel` or `RawKernel`


## 🧪 Example Usage

```python
import cupy as cp

x = cp.arange(6).reshape(2, 3).astype('f')
print(x.sum(axis=1))  # Fast and GPU-accelerated
```

> Works just like NumPy, but much faster on large data

## ⚙️ Installation (pick your CUDA version)

```bash
pip install cupy-cuda11x  # For CUDA 11.x
pip install cupy-cuda12x  # For CUDA 12.x
```

## ⚠️ Considerations

- Requires **NVIDIA GPU** with CUDA support
- Does not support every SciPy function (yet)
- Performance gain only on **large arrays** and **numeric operations**
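
Because a GPU is not always available, one common pattern is to write array code against a module variable and pick the backend at import time. The sketch below is an assumption-laden illustration: `xp` and `row_norms` are made-up names, and the code falls back to NumPy whenever CuPy or a CUDA device is missing.

```python
import numpy as np

try:
    import cupy as cp

    if cp.cuda.runtime.getDeviceCount() == 0:
        raise RuntimeError("no CUDA device found")
    xp = cp
except Exception:  # CuPy missing or no usable GPU: fall back to NumPy
    cp = None
    xp = np


def row_norms(matrix):
    """Euclidean norm of each row, computed on the selected backend."""
    return xp.sqrt((matrix * matrix).sum(axis=1))


host_data = np.arange(6, dtype=np.float32).reshape(2, 3)
device_data = xp.asarray(host_data)  # host -> device copy when xp is CuPy
norms = row_norms(device_data)
# Bring the result back to host memory if it lives on the GPU
norms_host = cp.asnumpy(norms) if cp is not None else norms
print(norms_host)
```

The two explicit copies are the reason small arrays rarely benefit: the transfer can cost more than the computation.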


> _“If you know NumPy, you already know CuPy — just faster, on GPU.”_

# 5. 🔥 PyTorch: Flexible Deep Learning Framework
<img src="imgs/misc-pytorch-course-launch-cover-white-text-black-background.jpg" alt="Drawing" style="width: 500px;"/>

## 🧠 What Is PyTorch?

- An open-source deep learning library originally developed by Meta AI, now governed by the PyTorch Foundation.
- Combines a Pythonic interface with efficient GPU acceleration.
- Widely adopted in both research and industry for its flexibility and ease of use.


## ⚙️ Key Features

- **Dynamic Computational Graphs**: Build and modify models on-the-fly, facilitating intuitive debugging and model customization.
- **GPU Acceleration**: Seamless integration with CUDA for high-performance computations.
- **Autograd Module**: Automatic differentiation for efficient gradient computation.
- **TorchScript**: Convert models into a form that can be run independently from Python, enabling deployment in production environments.
- **TorchServe**: A tool for serving PyTorch models at scale, supporting features like multi-model serving and RESTful endpoints.
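
The autograd feature can be seen in a few lines. This sketch assumes PyTorch is installed and mirrors the `square` example from the JAX section.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2    # the graph is recorded while the operation runs
y.backward()  # backpropagate to get dy/dx
print(x.grad)  # tensor(6.)
```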



## 🧪 Example: Simple Neural Network

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

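class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)  # e.g. a flattened 28x28 image -> 128 hidden units
        self.fc2 = nn.Linear(128, 10)   # hidden units -> 10 output scores

    def forward(self, x):
        x = F.relu(self.fc1(x))  # hidden layer with ReLU activation
        return self.fc2(x)       # raw scores, no softmax applied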
```

## 📦 Installation

```bash
pip install torch torchvision
```


> _“PyTorch offers a seamless path from research prototyping to production deployment.”_


# 6. 🤹 Multithreading vs Multiprocessing in Python
<img src="imgs/python_pool.webp" alt="Drawing" style="width: 350px;"/>

## 🤔 Common Misconceptions

> ⛔️ **"They're basically the same thing"**

- **Multithreading** uses threads (within the same process).
- **Multiprocessing** uses separate processes (each with its own Python interpreter).
- In Python, threads do not run *in true parallel* due to the GIL.

## 🧪 Experiment #1: CPU-Heavy Tasks

### Code:
```python
def cpu_heavy(x):
    for i in range(10**8):
        pass
```

- `multithreading(cpu_heavy, range(4), 4)` → ~20 sec  
- `multiprocessing(cpu_heavy, range(4), 4)` → ~5 sec  

✅ Multiprocessing is **much faster** for CPU-heavy tasks!
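
The `multithreading` and `multiprocessing` helpers used above are not defined on this slide. One possible reconstruction, assuming they are thin wrappers around `concurrent.futures`, is sketched below; the loop size is reduced so that it runs quickly.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_heavy(x, n=10**5):
    # Smaller loop than on the slide; returns the count so results can be checked
    count = 0
    for _ in range(n):
        count += 1
    return count


def multithreading(func, args, workers):
    """Run func over args on a pool of threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(func, args))


def multiprocessing(func, args, workers):
    """Same interface, but each worker is a separate process with its own GIL."""
    with ProcessPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(func, args))


print(multithreading(cpu_heavy, range(4), 4))
```

When trying the process version, call it from under an `if __name__ == "__main__":` guard, since some platforms start workers by re-importing the main module.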

## ❗ Why Threads Are Slower on CPU Tasks

> **"Threads run in parallel"**

- Actually, threads **take turns**: they run concurrently, not in parallel.
- Only one thread runs at a time due to the **Global Interpreter Lock (GIL)**.
- Switching threads adds **context-switch overhead**.

<img src="imgs/thread_proc_runtime.webp" alt="Drawing" style="width: 750px;"/>

Strictly speaking, threads run neither in parallel nor in sequence: they run concurrently. One job executes for a short while, then another takes over.


## 🧪 Experiment #2: Serial vs Threads

- 4 CPU-bound jobs on 4 threads:
```python
for i in range(4): cpu_heavy(i)  # serial
multithreading(cpu_heavy, range(4), 4)
```

Results:
- **Serial:** ~1659 sec  
- **Multithreading:** ~1669 sec  
🛑 Threads were slightly **slower** due to thread-switching overhead.


## ✅ When Is Multithreading Actually Useful?

> **"Multithreading is useless"**

🔁 **False** — it's just not for CPU-bound tasks.

🧵 **Multithreading shines with I/O-bound operations** like:
- Downloading files
- Reading from disk
- Waiting on web responses

### Example:
```python
import urllib.request
with urllib.request.urlopen(url) as conn:  # `url` is any reachable address
    data = conn.read()                     # the thread waits here on the network
```

- Serial I/O: ~7.8 sec  
- Threads (4): ~2.6 sec  
- Threads (8): ~1.5 sec  

✅ Threads help hide I/O **latency** by switching tasks while waiting!
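
The effect can be reproduced without a network connection. In this sketch `fake_download` is a stand-in that simply sleeps, and the URLs are placeholders that are never contacted; actual timings depend on the machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url, latency=0.1):
    # Stand-in for a network request: sleeping releases the GIL, like a real I/O wait
    time.sleep(latency)
    return f"payload from {url}"


urls = [f"https://example.com/file{i}" for i in range(8)]  # placeholder URLs

start = time.perf_counter()
serial = [fake_download(u) for u in urls]
serial_time = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    threaded = list(pool.map(fake_download, urls))
threaded_time = time.perf_counter() - start

print(f"serial:  {serial_time:.2f} s")
print(f"threads: {threaded_time:.2f} s")
```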

## Packages for Multiprocessing/Multithreading

- `concurrent.futures`
- `multiprocessing`
- `threading`



## 📚 Summary

| Task Type       | Use                        |
|------------------|-----------------------------|
| CPU-heavy        | ✅ Multiprocessing          |
| I/O-bound        | ✅ Multithreading           |
| Mixed/General    | Depends — profile it!       |

---

> _“Concurrency ≠ Parallelism — understand the difference, and pick the right tool.”_

# 7. 🧩 Beyond CPython: Other Python Interpreters
<img src="imgs/interpreter.png" alt="Drawing" style="width: 500px;"/>


## 🐍 What Is CPython?

- The **default** Python interpreter
- Written in C — what you run when you type `python`
- Prioritizes **compatibility and stability**
- BUT... has limitations like the **GIL** and **slower performance**

## 🚀 Why Consider Other Interpreters?

- 🔥 **Speed**: JIT compilers (PyPy) make some Python code much faster
- 🧵 **Concurrency**: Some interpreters don’t have a GIL (e.g. Jython, IronPython)
- 🤝 **Interoperability**: Want to use Java or .NET libraries? Use Jython/IronPython.
- 🧪 **Experimentation**: For exploring new platforms or specialized environments

## 🧠 Popular Alternatives

| Interpreter     | Highlights                           |
|----------------|--------------------------------------|
| **PyPy**        | JIT compiled, much faster for many tasks |
| **Jython**      | Python on the Java Virtual Machine   |
| **IronPython**  | Python for .NET/Mono environments    |
| **GraalPython** | Python via GraalVM, polyglot ready   |
| **MicroPython** | Python for microcontrollers          |

## ⚠️ Keep in Mind

- Not all packages work on all interpreters (e.g., C extensions on PyPy or Jython)
- Trade-offs: speed vs ecosystem vs compatibility
- Great for specific use cases — not always a full replacement
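
Before benchmarking, it helps to confirm which interpreter is actually running. A small standard-library sketch follows; the GIL check relies on `sys._is_gil_enabled()`, which only exists on CPython 3.13 and later, so the code treats it as optional.

```python
import platform
import sys

implementation = platform.python_implementation()  # e.g. "CPython" or "PyPy"
print(implementation, platform.python_version())
print(sys.implementation.name)  # lowercase name, e.g. "cpython"

# Optional check: absent on older CPython versions and on other interpreters
gil_check = getattr(sys, "_is_gil_enabled", None)
gil_status = gil_check() if gil_check else "not reported by this interpreter"
print("GIL enabled:", gil_status)
```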

> _“Different interpreters unlock different powers — choose based on your problem, not just habit.”_


# References

<div style="display: flex; justify-content: space-between;">

<div style="width: 48%;">

### ⚙️ Python Performance & Optimization
- [Why Python is slower than C](https://medium.com/thedeephub/but-why-python-is-so-slow-da1a4fb9be92)
- [Numba from Scratch](https://pythonspeed.com/articles/numba-faster-python/)
- [PyPy 101 – Real Python](https://realpython.com/pypy-faster-python/)
- [JIT Compilation (Numba Tutorial)](https://medium.com/data-science/make-python-run-as-fast-as-c-9fdccdb501d4)
- [Speed Up with Numba – Kaggle](https://www.kaggle.com/code/rudrasing/speed-up-python-code-up-to-100x-using-numba)
- [Make python fast with numba](https://thedatafrog.com/en/articles/make-python-fast-numba/)
- [PyPy Documentation](https://doc.pypy.org/en/latest/)
- [JAX vs NumPy](https://medium.com/@harshavardhangv/jax-vs-numpy-key-differences-and-benefits-72e442bbf67f)
- [Introduction to Jax](https://www.kaggle.com/code/goktugguvercin/introduction-to-jax)
</div>

<div style="width: 48%;">

### 🧵 Multiprocessing & Parallelism

- [Intro to Multiprocessing – Analytics Vidhya](https://www.analyticsvidhya.com/blog/2021/04/a-beginners-guide-to-multi-processing-in-python/)
- [Optimal Multiprocessing – Medium](https://medium.com/@sampsa.riikonen/doing-python-multiprocessing-the-right-way-a54c1880e300)
- [Threads vs Processes – DataCamp](https://www.datacamp.com/tutorial/python-multiprocessing-tutorial)
- [Threads vs Processes – Contentsquare](https://engineering.contentsquare.com/2018/multithreading-vs-multiprocessing-in-python/)
- [ProcessPoolExecutor Guide – Medium](https://medium.com/@superfastpython/python-processpoolexecutor-7-day-crash-course-71cf062409d2)
- [SuperfastPython – ProcessPoolExecutor](https://superfastpython.com/processpoolexecutor-in-python/)
- [RealPython on the GIL](https://realpython.com/python-gil/)
- [High-Speed Execution Tips – Analytics Vidhya](https://www.analyticsvidhya.com/blog/2024/01/optimize-python-code-for-high-speed-execution/)
- [PyTorch mastery course](https://github.com/mrdbourke/pytorch-deep-learning?tab=readme-ov-file)
- [multithreading-vs-multiprocessing](https://github.com/baatout/multithreading-vs-multiprocessing/tree/master)
</div>
</div>
