

# 🔧 Loop Unrolling and JIT Compilation  
## A Deep Dive into Accelerating CPU-bound Python Code with Numba and Numexpr



---

# 🔁 Loop Unrolling in Python

Loop unrolling is an optimization technique that reduces loop overhead by processing multiple elements per iteration.

### ✅ Why Use It?
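- Reduces loop-control overhead (fewer iterations, so fewer counter updates and branch checks)
- In compiled code, exposes more instruction-level parallelism to the CPU
- Can speed up small arithmetic loops, although in pure Python the gain is usually modest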

### ⚠️ When to Avoid It?
- For large or complex loop bodies
- If you're already using vectorized libraries like NumPy
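For the element-wise addition used throughout this notebook, the vectorized NumPy form is a single call, which is why manual unrolling rarely pays off once NumPy is available. A minimal sketch (the array size and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1_000)
y = rng.random(1_000)

# Explicit Python loop: one trip through the interpreter per element.
out_loop = np.empty_like(x)
for i in range(len(x)):
    out_loop[i] = x[i] + y[i]

# Vectorized form: the loop runs inside NumPy's compiled code.
out_vec = np.add(x, y)

print(np.array_equal(out_loop, out_vec))
```

Both versions produce the same array; only the place where the loop executes differs.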

---

## 🧪 Example: Normal vs. Unrolled Loop

```python
import numpy as np
import time

def normal_loop(x, y, out):
    for i in range(len(x)):
        out[i] = x[i] + y[i]

def unrolled_loop(x, y, out):
    # Process 4 items at a time
    for i in range(0, len(x) - 3, 4):
        out[i]     = x[i]     + y[i]
        out[i + 1] = x[i + 1] + y[i + 1]
        out[i + 2] = x[i + 2] + y[i + 2]
        out[i + 3] = x[i + 3] + y[i + 3]
    
    # Handle remaining elements
    for i in range(len(x) - (len(x) % 4), len(x)):
        out[i] = x[i] + y[i]
```
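Manual unrolling makes off-by-one mistakes in the remainder loop easy to introduce, so a quick equivalence check is worth having. The sketch below repeats the two functions so it runs on its own and compares them on lengths that cover every remainder modulo 4; the test lengths and the NaN sentinel are illustrative choices, not part of the original notebook.

```python
import numpy as np

def normal_loop(x, y, out):
    for i in range(len(x)):
        out[i] = x[i] + y[i]

def unrolled_loop(x, y, out):
    for i in range(0, len(x) - 3, 4):
        out[i]     = x[i]     + y[i]
        out[i + 1] = x[i + 1] + y[i + 1]
        out[i + 2] = x[i + 2] + y[i + 2]
        out[i + 3] = x[i + 3] + y[i + 3]
    for i in range(len(x) - (len(x) % 4), len(x)):
        out[i] = x[i] + y[i]

# Lengths chosen to hit every remainder (0..3), plus the empty case.
for n in (0, 1, 2, 3, 4, 5, 7, 8, 1001):
    x = np.arange(n, dtype=np.float64)
    y = np.ones(n)
    expected = np.full(n, np.nan)
    actual = np.full(n, np.nan)  # NaN marks any element left unwritten
    normal_loop(x, y, expected)
    unrolled_loop(x, y, actual)
    assert np.array_equal(expected, actual), f"mismatch at n={n}"

print("unrolled_loop matches normal_loop for all tested lengths")
```

Because NaN never compares equal to itself, any element that either loop skips makes the assertion fail.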

### 📊 Benchmarking Loop Performance

```python
size = 10_000_000
a = np.random.rand(size)
b = np.random.rand(size)
result_normal = np.zeros_like(a)
result_unrolled = np.zeros_like(a)

# Time normal loop
start = time.time()
normal_loop(a, b, result_normal)
normal_time = time.time() - start
print(f"Normal loop time: {normal_time:.4f} seconds")

# Time unrolled loop
start = time.time()
unrolled_loop(a, b, result_unrolled)
unrolled_time = time.time() - start
print(f"Unrolled loop time: {unrolled_time:.4f} seconds")
print(f"Speedup: {normal_time / unrolled_time:.2f}x")
```

> 💡 Sample output (from the run recorded in this notebook; absolute timings vary by machine):
```
Normal loop time: 4.8954 seconds
Unrolled loop time: 3.9353 seconds
Speedup: 1.24x
```
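A single `time.time()` measurement is noisy, so small differences like the one above are easier to trust when each function is timed several times and the fastest run is kept. The helper below is a sketch of that idea using the standard-library `timeit` module; the name `best_time` and its defaults are hypothetical, not part of the original notebook.

```python
import timeit
import numpy as np

def best_time(func, *args, repeat=5, number=1):
    """Return the fastest per-call wall time in seconds over `repeat` runs."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number

a = np.random.rand(100_000)
b = np.random.rand(100_000)

t = best_time(np.add, a, b)
print(f"np.add, best of 5: {t * 1e6:.1f} microseconds")
```

Taking the minimum rather than the mean filters out runs disturbed by other processes. The same helper can wrap `normal_loop` and `unrolled_loop` by passing the output array as an extra argument.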

---

# 🚀 Just-In-Time (JIT) Compilation with Numba

Numba compiles Python functions to machine code at runtime using LLVM — often achieving near-C speeds.

## 🧩 How Numba Works:
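1. Decorate the function with `@jit` or `@njit`
2. On the first call, Numba infers the argument types and compiles optimized machine code
3. The compiled code is kept in memory and reused for later calls with the same argument types (pass `cache=True` to also persist it on disk)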

---

## 🧪 Example: JIT Speedup Using Numba

```python
from numba import jit, njit, prange, set_num_threads

@njit  # Fastest mode: nopython=True
def numba_loop(x, y):
    result = np.empty_like(x)
    for i in range(len(x)):
        result[i] = x[i] + y[i]
    return result

@njit(parallel=True)
def numba_parallel_loop(x, y):
    result = np.empty_like(x)
    for i in prange(len(x)):  # Parallelized loop
        result[i] = x[i] + y[i]
    return result
```

### 📈 Benchmarking Numba

```python
size = 10_000_000
a = np.random.rand(size)
b = np.random.rand(size)

# Baseline: Pure Python loop
start = time.time()
normal_loop(a, b, result_normal)
python_time = time.time() - start
print(f"Python loop time: {python_time:.4f} seconds")

# Numba nopython (warm-up call first, so JIT compilation is not timed)
numba_loop(a, b)
start = time.time()
res_numba = numba_loop(a, b)
numba_time = time.time() - start
print(f"Numba loop time: {numba_time:.4f} seconds")
print(f"Numba speedup: {python_time / numba_time:.2f}x")

# Numba parallel (warm-up call first, for the same reason)
numba_parallel_loop(a, b)
start = time.time()
res_numba_par = numba_parallel_loop(a, b)
numba_par_time = time.time() - start
print(f"Numba parallel time: {numba_par_time:.4f} seconds")
print(f"Numba parallel speedup: {python_time / numba_par_time:.2f}x")
```

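> 📌 The first call to a jitted function includes compilation time. Timing a later call, you'll typically see:
- **Numba**: roughly 10–100x speedups over a pure Python loop
- **Parallel Numba**: additional gains on multi-core CPUs, provided each iteration does enough work to offset threading overhead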

---

# 🧮 Vectorized Math Expression Optimization with Numexpr

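Numexpr evaluates array expressions efficiently by combining:
- A small virtual machine that executes the compiled expression
- Multi-threading across CPU cores
- Chunked, memory-efficient execution that avoids full-size temporary arrays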

## 🧪 Example: Evaluate Complex Formula

```python
import numexpr as ne

def compute_with_numpy(x, y, z):
    return np.exp(-(x - y)**2) / (1 + (x + y)**2)

def compute_with_numexpr(x, y, z):
    return ne.evaluate("exp(-(x - y)**2) / (1 + (x + y)**2)")
```
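By default `ne.evaluate` resolves the names in the expression string by looking at the variables of the calling function, which is why `compute_with_numexpr` works without passing `x` and `y` explicitly. When that implicit lookup is undesirable, the variables can be supplied through the `local_dict` argument. A minimal sketch (the function name `compute_explicit` is hypothetical):

```python
import numpy as np
import numexpr as ne

EXPR = "exp(-(x - y)**2) / (1 + (x + y)**2)"

def compute_explicit(x, y):
    # Names in the expression are looked up in local_dict,
    # not in the caller's local variables.
    return ne.evaluate(EXPR, local_dict={"x": x, "y": y})

x = np.random.rand(1_000)
y = np.random.rand(1_000)
expected = np.exp(-(x - y)**2) / (1 + (x + y)**2)

print(np.allclose(compute_explicit(x, y), expected))
```

The explicit form makes it obvious which arrays the expression depends on, at the cost of a slightly longer call.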

### 📊 Benchmarking Numexpr vs NumPy

```python
size = 10_000_000
x = np.random.rand(size)
y = np.random.rand(size)
z = np.random.rand(size)

# With NumPy
start = time.time()
result_numpy = compute_with_numpy(x, y, z)
numpy_time = time.time() - start
print(f"NumPy time: {numpy_time:.4f} seconds")

# With Numexpr
start = time.time()
result_numexpr = compute_with_numexpr(x, y, z)
numexpr_time = time.time() - start
print(f"Numexpr time: {numexpr_time:.4f} seconds")
print(f"Numexpr speedup: {numpy_time / numexpr_time:.2f}x")
```

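> 🚀 Numexpr can be roughly **2–5x faster** than NumPy for memory-bound expressions on large arrays; the gain depends on the expression and the number of cores (the run recorded in this notebook measured 1.70x).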

---

# 🧠 Summary Table: Techniques Compared

| Technique | Best For | Typical Speedup | Notes |
|----------|-----------|---------------------|-------|
| Loop Unrolling | Small tight loops | ~1.2–2x over a plain Python loop | Manual, limited benefit |
| Numba (@njit) | Numerical loops | ~10–100x over a plain Python loop | Easy to use, great for arrays |
| Numba (parallel) | Multi-core CPU work | ~2x over single-core Numba | Loop iterations must be independent |
| Numexpr | Array expressions | ~2–5x over NumPy | Good for large data; expressions are written as strings |

---

# 🛠 Challenge: Optimize This Computation

Given this formula:
$$
\text{result}[i,j] = \frac{\exp(- (x[i,j] - y[i,j])^2)}{1 + (x[i,j] + y[i,j])^2}
$$

Compare two implementations:

### ✅ Solution 1: Using Numexpr

```python
def optimized_computation_numexpr(x, y):
    return ne.evaluate("exp(-(x - y)**2) / (1 + (x + y)**2)")
```

### ✅ Solution 2: Using Numba

```python
@njit
def optimized_computation_numba(x, y):
    result = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            diff = x[i, j] - y[i, j]
            total = x[i, j] + y[i, j]
            result[i, j] = np.exp(-diff**2) / (1 + total**2)
    return result
```

### 📊 Benchmark

```python
shape = (2000, 2000)
x = np.random.rand(*shape)
y = np.random.rand(*shape)

# Numexpr version
start = time.time()
result_ne = optimized_computation_numexpr(x, y)
ne_time = time.time() - start
print(f"Numexpr time: {ne_time:.4f} sec")

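# Numba version (warm-up call first, so JIT compilation is not timed)
optimized_computation_numba(x, y)
start = time.time()
result_nb = optimized_computation_numba(x, y)
nb_time = time.time() - start
print(f"Numba time: {nb_time:.4f} sec")
print(f"Numexpr speedup over Numba: {nb_time / ne_time:.2f}x")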
```

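> 📌 Result:
- Numexpr often wins here against single-threaded `@njit` because it is multi-threaded by default; a Numba version that uses `parallel=True` and `prange` can narrow or close the gap.
- Numba is more flexible and handles conditionals and custom logic that do not fit a single array expression.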

---

# ✅ When to Use What?

| Tool | Use Case | Benefit |
|------|----------|---------|
| Loop Unrolling | Very small loops | Minor gain; rarely worth effort |
| Numba | CPU-bound numerical loops | Large speedups over interpreted loops; more with `parallel=True` |
| Numexpr | Large array expressions | Fewer temporaries and automatic multi-threading |
| NumPy | General-purpose vectorization | Already fast, but not always optimal |

---

# 📌 Takeaways

### 🔹 Use Numba for:
- Custom loops and computations
- Functions with conditionals or custom logic
- Code that needs fine-grained control with near-C performance

### 🔹 Use Numexpr for:
- Heavy array expressions
- Memory-bound operations
- Cases where a one-line expression is clearer than a hand-written loop

### 🔹 Don’t Use These For:
- I/O-bound tasks
- String manipulation
- One-off scripts where compilation time matters

---



In [None]:
import numpy as np
import time

def normal_loop(x, y, out):
    for i in range(len(x)):
        out[i] = x[i] + y[i]

def unrolled_loop(x, y, out):
    # Process 4 items at a time
    for i in range(0, len(x) - 3, 4):
        out[i]     = x[i]     + y[i]
        out[i + 1] = x[i + 1] + y[i + 1]
        out[i + 2] = x[i + 2] + y[i + 2]
        out[i + 3] = x[i + 3] + y[i + 3]

    # Handle remaining elements
    for i in range(len(x) - (len(x) % 4), len(x)):
        out[i] = x[i] + y[i]

In [None]:
size = 10_000_000
a = np.random.rand(size)
b = np.random.rand(size)
result_normal = np.zeros_like(a)
result_unrolled = np.zeros_like(a)

# Time normal loop
start = time.time()
normal_loop(a, b, result_normal)
normal_time = time.time() - start
print(f"Normal loop time: {normal_time:.4f} seconds")

# Time unrolled loop
start = time.time()
unrolled_loop(a, b, result_unrolled)
unrolled_time = time.time() - start
print(f"Unrolled loop time: {unrolled_time:.4f} seconds")
print(f"Speedup: {normal_time / unrolled_time:.2f}x")

Normal loop time: 4.8954 seconds
Unrolled loop time: 3.9353 seconds
Speedup: 1.24x


In [None]:
from numba import jit, njit, prange, set_num_threads

@njit  # Fastest mode: nopython=True
def numba_loop(x, y):
    result = np.empty_like(x)
    for i in range(len(x)):
        result[i] = x[i] + y[i]
    return result

@njit(parallel=True)
def numba_parallel_loop(x, y):
    result = np.empty_like(x)
    for i in prange(len(x)):  # Parallelized loop
        result[i] = x[i] + y[i]
    return result

In [None]:
size = 10_000_000
a = np.random.rand(size)
b = np.random.rand(size)

# Baseline: Pure Python loop
start = time.time()
normal_loop(a, b, result_normal)
python_time = time.time() - start
print(f"Python loop time: {python_time:.4f} seconds")

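# Numba nopython (first call, so this timing includes JIT compilation)
start = time.time()
res_numba = numba_loop(a, b)
numba_time = time.time() - start
print(f"Numba loop time: {numba_time:.4f} seconds")
print(f"Numba speedup: {python_time / numba_time:.2f}x")

# Numba parallel (first call, so this timing also includes JIT compilation)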
start = time.time()
res_numba_par = numba_parallel_loop(a, b)
numba_par_time = time.time() - start
print(f"Numba parallel time: {numba_par_time:.4f} seconds")
print(f"Numba parallel speedup: {python_time / numba_par_time:.2f}x")

Python loop time: 4.0395 seconds
Numba loop time: 1.6296 seconds
Numba speedup: 2.48x
Numba parallel time: 0.5697 seconds
Numba parallel speedup: 7.09x


In [None]:
import numexpr as ne
import dis

def compute_with_numpy(x, y, z):
    return np.exp(-(x - y)**2) / (1 + (x + y)**2)

def compute_with_numexpr(x, y, z):
    return ne.evaluate("exp(-(x - y)**2) / (1 + (x + y)**2)")
dis.dis(compute_with_numpy)
dis.dis(compute_with_numexpr)

  4           0 RESUME                   0

  5           2 LOAD_GLOBAL              0 (np)
             14 LOAD_METHOD              1 (exp)
             36 LOAD_FAST                0 (x)
             38 LOAD_FAST                1 (y)
             40 BINARY_OP               10 (-)
             44 LOAD_CONST               1 (2)
             46 BINARY_OP                8 (**)
             50 UNARY_NEGATIVE
             52 PRECALL                  1
             56 CALL                     1
             66 LOAD_CONST               2 (1)
             68 LOAD_FAST                0 (x)
             70 LOAD_FAST                1 (y)
             72 BINARY_OP                0 (+)
             76 LOAD_CONST               1 (2)
             78 BINARY_OP                8 (**)
             82 BINARY_OP                0 (+)
             86 BINARY_OP               11 (/)
             90 RETURN_VALUE
  7           0 RESUME                   0

  8           2 LOAD_GLOBAL              0 (ne)
        

In [None]:
size = 10_000_000
x = np.random.rand(size)
y = np.random.rand(size)
z = np.random.rand(size)

# With NumPy
start = time.time()
result_numpy = compute_with_numpy(x, y, z)
numpy_time = time.time() - start
print(f"NumPy time: {numpy_time:.4f} seconds")

# With Numexpr
start = time.time()
result_numexpr = compute_with_numexpr(x, y, z)
numexpr_time = time.time() - start
print(f"Numexpr time: {numexpr_time:.4f} seconds")
print(f"Numexpr speedup: {numpy_time / numexpr_time:.2f}x")

NumPy time: 0.2521 seconds
Numexpr time: 0.1479 seconds
Numexpr speedup: 1.70x


# 🔥 Why **Numba** and **Numexpr** Are Faster Than Pure Python

When working with numerical data in Python, pure-Python loops are slow because of dynamic typing and interpreter overhead, and even NumPy-only code can lose time to the temporary arrays it allocates for each intermediate result.

This document explains **why Numba and Numexpr are faster**, how they work under the hood, and when you should use them.

---

## 🚀 Why Is **Numba** Fast?

### ✅ Summary:
> **Numba compiles Python functions to machine code at runtime**, bypassing the Python interpreter. For numerical code this often yields **near-C performance**.

---

### 🧠 How Does It Work?

1. **JIT Compilation (Just-In-Time)**  
   - When a function is decorated with `@njit` or `@jit`, Numba translates it into optimized machine code using **LLVM**.
   - Compilation happens once per combination of argument types; subsequent calls with the same types reuse the compiled version.

2. **Type Inference**
   - Numba infers types of variables from actual inputs during the first call.
   - No need for explicit type declarations like in C/C++.

3. **Vectorization & CPU Instructions**
   - LLVM can auto-vectorize simple loops with **SIMD (Single Instruction Multiple Data)** instructions
   - Works on typed, contiguous array data that the CPU can keep in registers and cache

4. **Optional GIL (Global Interpreter Lock) Release**
   - Compiled nopython code does not need the interpreter, but the GIL is released only if you pass `nogil=True`
   - `parallel=True` with `prange()` distributes loop iterations across Numba's own worker threads

5. **Zero Overhead Abstractions**
   - Avoids Python object boxing/unboxing
   - Directly operates on raw memory buffers

---

### 🧪 Example: Loop Optimization with Numba

```python
from numba import njit
import numpy as np

@njit
def sum_loop(arr):
    total = 0.0  # float accumulator, matching the float64 input
    for x in arr:
        total += x
    return total

arr = np.random.rand(1_000_000)
sum_loop(arr)  # First run compiles
sum_loop(arr)  # Subsequent runs are super fast!
```

| Version | Time (approx.) |
|--------|----------------|
| Python loop | ~80–100 ms |
| NumPy `.sum()` | ~1–2 ms |
| Numba JIT | ~0.1 ms |

---

### ⚙️ Under the Hood: What Makes Numba Fast?

| Feature | Benefit |
|--------|---------|
| Bypasses Python VM | Runs directly as machine code |
| Type specialization | Optimized for each input type |
| SIMD vectorization | Processes multiple elements per instruction |
| Parallel execution | Use `prange` and `parallel=True` |
| Memory access | Direct pointer access to NumPy arrays |

---

### ✅ When to Use Numba?

- You're writing custom numerical algorithms (not covered by built-in NumPy)
- You have tight loops over large arrays
- You want to write Python but get near-C performance
- You want to enable multi-core computation with minimal effort

---

## 🧮 Why Is **Numexpr** Fast?

### ✅ Summary:
> **Numexpr evaluates array expressions efficiently** using an internal virtual machine that minimizes memory usage and maximizes CPU cache efficiency.

---

### 🧠 How Does It Work?

#### Consider this expression:

```python
result = (a + b * c) / np.sqrt(d)
```

In standard NumPy:
1. Each operation creates a **temporary array** (`b*c`, then `a + temp`, etc.)
2. Memory bandwidth becomes the bottleneck

With **Numexpr**:
```python
import numexpr as ne
result = ne.evaluate("(a + b * c) / sqrt(d)")
```
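- Compiles the string expression into bytecode for its internal virtual machine
- Evaluates the expression chunk by chunk, without allocating full-size temporary arrays
- Keeps each chunk's intermediate values small enough to stay in CPU cache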
- Uses **multi-threaded evaluation**
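The chunking idea can be imitated in plain NumPy to see why it helps. The sketch below evaluates the same expression block by block with a small reusable scratch buffer. It is only an illustration of the principle, not how Numexpr is implemented internally; the function name `chunked_eval` and the chunk size of 4096 are assumptions made for the example.

```python
import numpy as np

def chunked_eval(a, b, c, d, chunk=4096):
    """Compute (a + b * c) / sqrt(d) one block at a time."""
    out = np.empty_like(a)
    tmp = np.empty(chunk, dtype=a.dtype)  # small scratch buffer, reused
    for start in range(0, len(a), chunk):
        stop = min(start + chunk, len(a))
        t = tmp[: stop - start]
        o = out[start:stop]
        np.multiply(b[start:stop], c[start:stop], out=t)  # t = b * c
        np.add(a[start:stop], t, out=t)                   # t = a + b * c
        np.sqrt(d[start:stop], out=o)                     # o = sqrt(d)
        np.divide(t, o, out=o)                            # o = t / sqrt(d)
    return out

rng = np.random.default_rng(0)
a, b, c = rng.random((3, 10_000))
d = rng.random(10_000) + 1.0  # keep the sqrt argument away from zero

print(np.allclose(chunked_eval(a, b, c, d), (a + b * c) / np.sqrt(d)))
```

The only full-size allocation is the output array; every intermediate lives in the small `tmp` buffer, which is the memory behavior the bullets above describe.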

---

### 📈 Performance Advantages of Numexpr

| Advantage | Description |
|----------|-------------|
| **Memory Efficiency** | Avoids temporary arrays → reduces memory usage |
| **CPU Cache Friendly** | Operates on chunks that fit in L1/L2 cache |
| **Multi-threaded** | Scales across CPU cores automatically |
| **Vectorized Execution** | Benefits from SIMD instructions, for example through Intel VML when Numexpr is built with it |
| **Expression Fusion** | Combines multiple operations into one pass |

---

### 🧪 Benchmark: NumPy vs Numexpr

```python
import numpy as np
import numexpr as ne
import time

x = np.random.rand(10_000_000)

# NumPy version
start = time.time()
y_numpy = np.log(np.sin(x)**2 + np.cos(x)**2)
numpy_time = time.time() - start

# Numexpr version
start = time.time()
y_ne = ne.evaluate("log(sin(x)**2 + cos(x)**2)")
ne_time = time.time() - start

print(f"NumPy: {numpy_time:.4f} sec")
print(f"Numexpr: {ne_time:.4f} sec")
print(f"Speedup: {numpy_time / ne_time:.2f}x")
```

| Library | Time (approx.) | Speedup |
|--------|----------------|---------|
| NumPy | ~0.5 s | 1x |
| Numexpr | ~0.15 s | ~3.3x |

---

### ✅ When to Use Numexpr?

- You're evaluating **complex array expressions**
- Your computations are **memory-bound** (limited by RAM speed)
- You're working with **very large arrays**
- You want **automatic threading** without writing parallel code

---

## 🧊 Key Differences Between Numba and Numexpr

| Feature | Numba | Numexpr |
|--------|-------|---------|
| Input | Python functions | String expressions |
| Best For | Custom loops, logic, control flow | Array expressions, math-heavy formulas |
| Parallelism | Explicit via `prange()` | Automatic |
| Compilation | On first run (cached) | Once per expression |
| Flexibility | Very high | Limited to supported ops |
| Syntax | Native Python | Must write expressions as strings |
| Supported Types | Most NumPy dtypes | Floats, ints, bools, strings (limited) |

---

## 🧠 Final Takeaways

### ✅ Use **Numba** if:
- You're writing **custom numerical loops**
- You want full **control over algorithm behavior**
- You're doing more than just arithmetic (e.g., conditionals, indexing, state)

### ✅ Use **Numexpr** if:
- You're doing **math-heavy array expressions**
- You want expression evaluation **without full-size temporary arrays**
- You want **simple, fast, automatic multi-threading**

### ❌ Don't Use Them If:
- You're doing **I/O-bound** tasks (files, sockets, databases)
- You're manipulating **strings or objects**
- The overhead of compilation outweighs gains (e.g., tiny arrays)

---

## 📌 Rough Comparison Table (Illustrative Timings)

| Task | Pure Python | NumPy | Numba | Numexpr |
|------|-------------|--------|--------|----------|
| Sum of 1M floats | ~80ms | ~1ms | ~0.1ms | – |
| Complex math on 1M floats | ~200ms | ~5ms | ~0.3ms | ~0.15ms |
| Vectorized conditional filter | ~150ms | ~3ms | ~0.2ms | ~0.1ms |
| Broadcasted matrix op (1000×1000) | – | ~10ms | ~6ms | ~4ms |

---

## 📦 Bonus: Numba + Numexpr Together?

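Yes, within one pipeline, though not inside one function: `ne.evaluate` cannot be called from an `@njit`-compiled function, so call each library from regular Python code.
- Use **Numexpr** for the large array expressions
- Use **Numba** for the control flow and custom loops that consume or produce those arrays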

---

## 🧾 Summary: Pick the Right Tool

| Situation | Recommended Tool |
|-----------|------------------|
| Small loops, reusable logic | ✅ Numba |
| Large array expressions | ✅ Numexpr |
| Simple vectorization | ✅ NumPy |
| GPU acceleration needed | ✅ CuPy, PyTorch, TensorFlow |
| Sparse data | ✅ SciPy.sparse |
| Distributed computing | ✅ Dask, Spark |

---

