# Memory Spaces & Power Iteration

## Table of Contents
1. [Introduction to Memory Spaces](#1-introduction-to-memory-spaces)
2. [The CPU Baseline (NumPy)](#2-the-cpu-baseline-numpy)
3. [The GPU Port (CuPy)](#3-the-gpu-port-cupy)
4. [Optimizing Data Generation](#4-optimizing-data-generation)
5. [Verification and Benchmarking](#5-verification-and-benchmarking)
6. [Extra Credit](#extra-credit)

---

## 1. Introduction to Memory Spaces

Before we implement algorithms on the GPU, we must understand the hardware architecture. A heterogeneous system (like the one you are using) consists of two distinct memory spaces:

1.  **Host Memory (CPU):** System RAM. Accessible by the CPU.
2.  **Device Memory (GPU):** Dedicated memory attached to the GPU (HBM on data-center GPUs, GDDR on most consumer cards). Accessible by the GPU.

The CPU cannot directly operate on data stored on the GPU, and the GPU cannot directly operate on data stored in system RAM. To perform work on the GPU, you must explicitly manage data movement.

* **Host $\to$ Device:** Move data to the GPU to compute.
    * Syntax: `x_device = cp.asarray(x_host)`
* **Device $\to$ Host:** Move results back to the CPU to save to disk, plot with Matplotlib, or print.
    * Syntax: `y_host = cp.asnumpy(y_device)`
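
As a minimal sketch of the round trip described above, the snippet below copies a small array to the device, doubles it there, and copies the result back. The `try`/`except` fallback to NumPy is an assumption added here so the snippet also runs on a machine without CuPy or a CUDA device; it is not part of this tutorial's workflow.

```python
import numpy as np

# Assumption for illustration: fall back to NumPy when CuPy or a GPU is unavailable.
try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()   # raises if no usable CUDA device
    on_gpu = True
except Exception:
    cp = None
    on_gpu = False

x_host = np.arange(6, dtype=np.float64).reshape(2, 3)

if on_gpu:
    x_device = cp.asarray(x_host)      # Host -> Device copy
    y_device = x_device * 2.0          # computed on the GPU
    y_host = cp.asnumpy(y_device)      # Device -> Host copy
else:
    y_host = x_host * 2.0              # CPU-only stand-in

print(type(y_host).__name__, y_host.shape)
```

Either way, `y_host` ends up as a plain NumPy array in host memory, which is what Matplotlib, file I/O, and `print` expect.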

### Implicit Transfers and Synchronization

It is crucial to understand when CuPy interacts with the CPU implicitly. These interactions can kill performance because they force the GPU to pause (synchronize) while data moves.

CuPy silently transfers and synchronizes when you:
1.  **Print** a GPU array (`print(gpu_array)`).
2.  **Convert** to a Python scalar (`float(gpu_array)` or `.item()`).
3.  **Evaluate** a GPU scalar in a boolean context (`if gpu_scalar > 0:`).
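
One way to limit these implicit synchronizations, sketched below, is to keep intermediate scalars on the device as 0-d arrays and convert them in a single batch at the end. The function name `residual_history`, the toy matrix, and the NumPy fallback are illustrative assumptions, not part of CuPy or of this tutorial's code.

```python
import numpy as np

try:
    import cupy as xp
    xp.cuda.runtime.getDeviceCount()   # raises if no usable CUDA device
except Exception:
    import numpy as xp                 # assumption: CPU fallback for illustration

def residual_history(A, steps=5):
    """Run a few power-iteration steps, storing every residual in a device array."""
    x = xp.ones(A.shape[0], dtype=A.dtype)
    residuals = xp.empty(steps, dtype=A.dtype)
    for k in range(steps):
        y = A @ x
        lam = (x @ y) / (x @ x)
        # Stored as an array element; never converted to a Python float here
        residuals[k] = xp.linalg.norm(y - lam * x)
        x = y / xp.linalg.norm(y)
    # One explicit Device -> Host transfer at the end (no-op for NumPy arrays)
    return residuals.get() if hasattr(residuals, "get") else residuals

A_toy = xp.asarray([[2.0, 0.0],
                    [0.0, 1.0]])
hist = residual_history(A_toy)
print(hist)
```

The loop itself never prints, compares, or calls `float()` on a device value, so the only transfer is the explicit one in the `return` statement.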

### The Task
To understand the implications of these concepts, let's experiment with estimating the dominant eigenvalue of a matrix using the **Power Iteration** algorithm.

Before we dive into the code, let's understand the math behind the algorithm we are implementing.

**Power Iteration** is a classic iterative method used to find the dominant eigenvalue (the eigenvalue with the largest absolute value) and its corresponding eigenvector of a square matrix $A$.

#### How It Works

The core idea is simple: if you repeatedly multiply a vector by a matrix $A$, the vector will eventually converge towards the dominant eigenvector of $A$, regardless of the initial vector you started with (provided the initial vector has some component in the direction of the dominant eigenvector).
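
To see why this works, assume $A$ is diagonalizable with eigenpairs $(\lambda_i, v_i)$ ordered so that $|\lambda_1| > |\lambda_2| \ge \dots \ge |\lambda_n|$ (the ordering and the coefficients $c_i$ are notation introduced here for illustration). Writing the initial vector as $x_0 = \sum_i c_i v_i$ with $c_1 \neq 0$ gives:

$$A^k x_0 = \sum_{i=1}^{n} c_i \lambda_i^k v_i = \lambda_1^k \left( c_1 v_1 + \sum_{i=2}^{n} c_i \left( \frac{\lambda_i}{\lambda_1} \right)^k v_i \right)$$

Every term in the inner sum shrinks like $|\lambda_i / \lambda_1|^k$, so after normalization the direction approaches $v_1$. The slowest-decaying term, $|\lambda_2 / \lambda_1|^k$, sets the convergence rate.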

#### The Mathematical Steps

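Given a square matrix $A$ and an initial vector $x_0$ (random, or a fixed choice such as all ones, as long as it has a nonzero component along the dominant eigenvector), the algorithm proceeds as follows for each step $k$: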

**1. Matrix-Vector Multiplication:**

We calculate the next approximation of the vector:

$$y = A x_k$$

**2. Eigenvalue Estimation (Rayleigh Quotient):**

We estimate the eigenvalue $\lambda$ using the current vector. This is essentially projecting $y$ onto $x$:

$$\lambda_k = \frac{x_k^T y}{x_k^T x_k} = \frac{x_k^T A x_k}{x_k^T x_k}$$

**3. Residual Calculation (Error Check):**

We check how close we are to the true definition of an eigenvector ($Ax = \lambda x$) by calculating the "residual" (error):

$$r = ||y - \lambda_k x_k||$$

If $r$ is close to 0, we have converged.

**4. Normalization:**

To prevent the numbers from exploding (overflow) or vanishing (underflow), we normalize the vector for the next iteration:

$$x_{k+1} = \frac{y}{||y||}$$
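
The four steps above can be sketched on a toy problem before scaling up. The helper name `power_iteration_step` and the 2x2 matrix are chosen here purely for illustration; the matrix has eigenvalues 3 and 1, so the estimate should approach 3.

```python
import numpy as np

def power_iteration_step(A, x):
    """One iteration: returns (eigenvalue estimate, residual, next vector)."""
    y = A @ x                            # 1. matrix-vector multiplication
    lam = (x @ y) / (x @ x)              # 2. Rayleigh quotient
    res = np.linalg.norm(y - lam * x)    # 3. residual ||y - lam * x||
    x_next = y / np.linalg.norm(y)       # 4. normalization
    return lam, res, x_next

# Toy symmetric matrix with eigenvalues 3 and 1
A_toy = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
x = np.array([1.0, 0.0])

for k in range(30):
    lam, res, x = power_iteration_step(A_toy, x)

print(f"lambda ~ {lam:.6f}, residual = {res:.2e}")
```

Here the ratio $|\lambda_2 / \lambda_1|$ is $1/3$, so a few dozen steps are plenty; the matrices used later in this notebook have a much smaller gap and need far more iterations.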

We will start with a standard CPU implementation, port it to the GPU using CuPy, and analyze the performance impact of memory transfers.


In [None]:
import numpy as np
import cupy as cp
import time
from dataclasses import dataclass

# Configuration for the algorithm
@dataclass
class PowerIterationConfig:
    dim: int = 4096                    # Matrix size (dim x dim)
    dominance: float = 0.1             # Gap between the top eigenvalue (1.0) and the upper bound on the rest (controls convergence speed)
    max_steps: int = 400               # Maximum iterations
    check_frequency: int = 10          # Check for convergence every N steps
    progress: bool = True              # Print progress logs
    residual_threshold: float = 1e-10  # Stop if error is below this

## 2. The CPU Baseline (NumPy)

We generate a random dense matrix that is diagonalizable. This data is generated on the **Host (CPU)** and resides in **Host Memory**.


In [None]:
def generate_host(cfg=PowerIterationConfig()):
    """Generates a random diagonalizable matrix on the CPU."""
    np.random.seed(42)

    # Create eigenvalues: One large one (1.0), the rest smaller
    weak_lam = np.random.random(cfg.dim - 1) * (1.0 - cfg.dominance)
    lam = np.random.permutation(np.concatenate(([1.0], weak_lam)))

    # Construct matrix A = P * D * P^-1
    P = np.random.random((cfg.dim, cfg.dim))  # Random invertible matrix
    D = np.diag(np.random.permutation(lam))   # Diagonal matrix of eigenvalues
    A = ((P @ D) @ np.linalg.inv(P))          # The final matrix
    return A

# Generate the data on Host
print("Generating Host Data...")
A_host = generate_host()
print(f"Host Matrix Shape: {A_host.shape}")
print(f"Data Type: {A_host.dtype}")

### Implementing Power Iteration (CPU)

As described above, the Power Iteration algorithm repeatedly multiplies a vector $x$ by matrix $A$ ($y = Ax$) and normalizes the result. We initialize this algorithm with a vector of 1s ($x_0$) as our initial guess.

In [None]:
def estimate_host(A, cfg=PowerIterationConfig()):
    """
    Performs power iteration using purely NumPy (CPU).
    """
    # Initialize vector of ones on Host
    x = np.ones(A.shape[0], dtype=np.float64)

    for i in range(0, cfg.max_steps, cfg.check_frequency):
        # Matrix-Vector multiplication
        y = A @ x
        
        # Rayleigh quotient: (x . y) / (x . x)
        lam = (x @ y) / (x @ x)
        
        # Calculate residual (error)
        res = np.linalg.norm(y - lam * x)
        
        # Normalize vector for next step
        x = y / np.linalg.norm(y)

        if cfg.progress:
            print(f"Step {i}: residual = {res:.3e}")

        # Convergence check
        if res < cfg.residual_threshold:
            break

        # Run intermediate steps without checking residual to save compute
        for _ in range(cfg.check_frequency - 1):
            y = A @ x
            x = y / np.linalg.norm(y)

    return (x.T @ (A @ x)) / (x.T @ x)

# Run CPU Baseline
print("\nRunning CPU Estimate...")
start_time = time.time()
lam_est_host = estimate_host(A_host)
end_time = time.time()

print(f"\nEstimated Eigenvalue (CPU): {lam_est_host}")
print(f"Time taken: {end_time - start_time:.4f}s")

## 3. The GPU Port (CuPy)

### Exercise: Port the CPU Implementation to GPU

Now it's your turn! Your task is to convert the `estimate_host` function to run on the GPU using CuPy.

**Remember the rules of Memory Spaces:**
1.  **Transfer:** Move `A_host` from CPU to GPU using `cp.asarray()`.
2.  **Compute:** Perform math using `cp` functions on the GPU.
3.  **Retrieve:** Move result back to CPU using `cp.asnumpy()` or `.item()` if we need to print it or use it in standard Python.

**Hint:** CuPy tries to replicate the NumPy API. In many cases, you can simply change `np.` to `cp.`. However, CuPy operations *must* run on data present in Device Memory.
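
Because the two APIs mirror each other, a function can be written once and dispatched on the memory space of its input. The sketch below uses `cp.get_array_module`, which returns the `cupy` module for device arrays and `numpy` otherwise; the helper name `normalize` and the NumPy-only fallback are assumptions made here for illustration.

```python
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()   # raises if no usable CUDA device
except Exception:
    cp = None                          # assumption: no CuPy/GPU, run on NumPy only

def normalize(x):
    """Return x / ||x||, computed in whichever memory space x lives in."""
    xp = cp.get_array_module(x) if cp is not None else np
    return x / xp.linalg.norm(x)

v = normalize(np.array([3.0, 4.0]))    # NumPy in, NumPy out
print(v)
```

Passing a CuPy array instead would keep the whole computation on the device, with no transfer in either direction.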

**The skeleton code below is shown with its `TODO` sections already filled in (marked `SOLUTION`):**


In [None]:
def estimate_device_exercise(A, cfg=PowerIterationConfig()):
    """
    Port the power iteration algorithm to the GPU using CuPy.
    
    Steps to complete:
    1. Transfer the input matrix A to the GPU (if it's a numpy array)
    2. Initialize the vector x on the GPU
    3. Replace np operations with cp operations
    4. Return the result as a Python scalar
    """
    # ---------------------------------------------------------
    # SOLUTION: MEMORY TRANSFER (Host -> Device)
    # Check if A is a numpy array. If so, move it to GPU using cp.asarray()
    # Otherwise, assume it's already on the device.
    # ---------------------------------------------------------
    if isinstance(A, np.ndarray):
        A_gpu = cp.asarray(A)  # SOLUTION: Transfer to GPU
    else:
        A_gpu = A
    
    # ---------------------------------------------------------
    # SOLUTION: Initialize vector of ones ON THE GPU
    # ---------------------------------------------------------
    x = cp.ones(A_gpu.shape[0], dtype=cp.float64)  # SOLUTION: Create vector of ones on GPU
    
    for i in range(0, cfg.max_steps, cfg.check_frequency):
        # ---------------------------------------------------------
        # SOLUTION: Perform GPU computations using CuPy
        # ---------------------------------------------------------
        
        # Matrix-Vector multiplication (this works the same with CuPy!)
        y = A_gpu @ x
        
        # Rayleigh quotient
        lam = (x @ y) / (x @ x)
        
        # SOLUTION: Calculate residual using cp.linalg.norm
        res = cp.linalg.norm(y - lam * x)
        
        # SOLUTION: Normalize x using cp.linalg.norm
        x = y / cp.linalg.norm(y)
        
        if cfg.progress:
            print(f"Step {i}: residual = {res:.3e}")
        
        if res < cfg.residual_threshold:
            break
        
        for _ in range(cfg.check_frequency - 1):
            y = A_gpu @ x
            x = y / cp.linalg.norm(y)
    
    # ---------------------------------------------------------
    # SOLUTION: MEMORY TRANSFER (Device -> Host)
    # Return the eigenvalue as a Python scalar using .item()
    # ---------------------------------------------------------
    result = (x.T @ (A_gpu @ x)) / (x.T @ x)
    return result.item()  # SOLUTION: Convert GPU scalar to Python scalar

# Run the GPU implementation
print("\nRunning GPU Estimate (Input is Host Array)...")
start_time = time.time()
lam_est_device = estimate_device_exercise(A_host)
cp.cuda.Stream.null.synchronize()
end_time = time.time()

print(f"\nEstimated Eigenvalue (GPU): {lam_est_device}")
print(f"Time taken: {end_time - start_time:.4f}s")

## 4. Optimizing Data Generation

In the previous step, we generated data on the CPU and copied it to the GPU. For large datasets, the transfer time (`Host -> Device`) can be a bottleneck. 

It is usually faster to **generate** the data directly on the GPU when possible, since that avoids the transfer entirely.
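
To get a feel for how much data such a transfer moves, the short calculation below computes the size of a dense `dim x dim` matrix. The helper name `matrix_nbytes` is hypothetical; the arithmetic is just `dim * dim * itemsize`.

```python
import numpy as np

def matrix_nbytes(dim, dtype=np.float64):
    """Bytes occupied by a dense dim x dim matrix of the given dtype."""
    return dim * dim * np.dtype(dtype).itemsize

for dim in (1024, 2048, 4096, 8192):
    mib = matrix_nbytes(dim) / 2**20
    print(f"dim={dim:5d}: {mib:7.1f} MiB to copy Host -> Device")
```

With the default `dim=4096` and `float64`, the matrix occupies 128 MiB, and the size quadruples each time `dim` doubles.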

### Exercise: Generate Data Directly on the GPU

Your task is to convert the `generate_host` function to generate the matrix directly on the GPU using CuPy's random functions.

**Hints:**
- Use `cp.random.seed()` instead of `np.random.seed()`
- Use `cp.random.random()` instead of `np.random.random()`
- Use `cp.random.permutation()` instead of `np.random.permutation()`
- Use `cp.concatenate()`, `cp.array()`, `cp.diag()`, and `cp.linalg.inv()`

**The skeleton code below is shown with its `TODO` sections already filled in (marked `SOLUTION`):**


In [None]:
def generate_device_exercise(cfg=PowerIterationConfig()):
    """
    Generate a random diagonalizable matrix directly on the GPU.
    
    This should mirror the generate_host function but use CuPy instead of NumPy.
    The key benefit: no Host->Device transfer needed!
    """
    # ---------------------------------------------------------
    # SOLUTION: Set the random seed on the GPU
    # ---------------------------------------------------------
    cp.random.seed(42)
    
    # ---------------------------------------------------------
    # SOLUTION: Create eigenvalues on the GPU
    # Generate (dim-1) random values, scale them, then combine with 1.0
    # ---------------------------------------------------------
    # SOLUTION: Generate weak eigenvalues using cp.random.random()
    weak_lam = cp.random.random(cfg.dim - 1) * (1.0 - cfg.dominance)
    
    # SOLUTION: Concatenate [1.0] with weak_lam using cp.concatenate and cp.array
    # Then permute them using cp.random.permutation
    lam = cp.random.permutation(cp.concatenate((cp.array([1.0]), weak_lam)))
    
    # ---------------------------------------------------------
    # SOLUTION: Construct the matrix A = P * D * P^-1 on the GPU
    # ---------------------------------------------------------
    # SOLUTION: Generate random matrix P using cp.random.random()
    P = cp.random.random((cfg.dim, cfg.dim))
    
    # SOLUTION: Create diagonal matrix D using cp.diag()
    D = cp.diag(cp.random.permutation(lam))
    
    # SOLUTION: Compute A = P @ D @ P^-1 using cp.linalg.inv()
    A = ((P @ D) @ cp.linalg.inv(P))
    
    return A

print("\nGenerating Data directly on GPU...")
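start_time = time.time()
A_device = generate_device_exercise()
cp.cuda.Stream.null.synchronize()  # GPU work is asynchronous; wait for it before stopping the timer
end_time = time.time()
print(f"Generation time: {end_time - start_time:.4f}s")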

print("Running GPU Estimate (Input is Device Array)...")
start_time = time.time()
# No transfer overhead here because A_device is already on GPU
lam_est_device_gen = estimate_device_exercise(A_device)
cp.cuda.Stream.null.synchronize()
end_time = time.time()
print(f"Compute time: {end_time - start_time:.4f}s")

### Think About It

Both functions use `seed(42)`. Are `A_host` and `A_device` identical? Try comparing them:

In [None]:
print("NumPy:", A_host[0, :3])
print("CuPy:", A_device[0, :3].get())

**SOLUTION:**

This reveals that `np.random` and `cp.random` use **different random number generator (RNG) implementations**, even with the same seed.

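- NumPy's legacy `np.random.seed()` / `np.random.random()` interface, used here, is backed by the Mersenne Twister (MT19937) on the CPU. The newer `np.random.default_rng()` generator defaults to PCG64 instead.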
- CuPy uses a GPU-optimized RNG (typically XORWOW from cuRAND) that runs efficiently in parallel on thousands of GPU threads.

Even with the same seed value (`42`), these different algorithms produce completely different sequences of "random" numbers. This is why `A_host` and `A_device` contain different values.

**Key Takeaway:** If you need *identical* data on both CPU and GPU for verification purposes, you should:
1. Generate the data on one device (e.g., CPU with NumPy)
2. Transfer it to the other device (e.g., `cp.asarray(A_host)`)

This guarantees bit-for-bit identical data, which is essential for debugging and validation.

## 5. Verification and Benchmarking

Finally, let's verify our accuracy against a reference implementation (`numpy.linalg.eigvals`) and benchmark the speedup.

**Note on CuPy Limitations:** You might wonder why we use `np.linalg.eigvals` on the CPU instead of a CuPy equivalent. The reason is that, at the time of writing, CuPy does not implement a general `eigvals`. It does provide `cp.linalg.eigvalsh`, but that applies only to symmetric (Hermitian) matrices, and our `A` is not symmetric. While CuPy covers a large portion of the NumPy API, it does not support every function, and coverage changes between releases. Always check the [CuPy documentation](https://docs.cupy.dev/en/stable/reference/comparison.html) to verify which functions are available before assuming a direct NumPy-to-CuPy conversion will work.


In [None]:
print("Calculating Reference Eigenvalue (numpy.linalg)...")
# Note: calculating all eigenvalues is computationally expensive
lam_ref = np.linalg.eigvals(A_host).real.max()

print(f"\n--- Results ---")
print(f"Reference: {lam_ref}")
print(f"CPU Est:   {lam_est_host}")
print(f"GPU Est:   {lam_est_device_gen}")

# Assert correctness
np.testing.assert_allclose(lam_est_host, lam_ref, rtol=1e-4)
np.testing.assert_allclose(lam_est_device_gen, lam_ref, rtol=1e-4)
print("\nAccuracy verification passed!")

### A Note on Verification

As mentioned before, `A_host` and `A_device` are **different matrices** (NumPy and CuPy use different RNG implementations). Yet the verification passes. Why?

Both matrices are *constructed* with one eigenvalue explicitly set to **1.0**. The verification confirms that power iteration correctly finds this dominant eigenvalue, not that the matrices are identical.

**Key takeaway:** If you need to verify GPU computation against CPU on the *exact same data*, generate on one device and transfer to the other.
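
The construction $A = P D P^{-1}$ fixes the eigenvalues regardless of which random $P$ is drawn, which is why two different matrices can share the same dominant eigenvalue. The small NumPy sketch below illustrates this; the size, seed, and the diagonal shift added to `P` (to keep it safely invertible) are choices made here for illustration and differ from `generate_host`.

```python
import numpy as np

rng = np.random.default_rng(0)             # seed chosen arbitrarily
n = 6
lam = np.concatenate(([1.0], rng.random(n - 1) * 0.9))   # one eigenvalue fixed at 1.0

# Adding n on the diagonal makes P strictly diagonally dominant, hence invertible
P = rng.random((n, n)) + n * np.eye(n)
A_small = P @ np.diag(lam) @ np.linalg.inv(P)

# The spectrum of A_small is lam by construction, whatever P turned out to be
top = np.linalg.eigvals(A_small).real.max()
print(f"largest eigenvalue: {top:.12f}")
```

Changing the seed produces a different `A_small`, but the largest eigenvalue stays at 1.0 up to floating-point error.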

### Benchmarking with `cupyx.profiler.benchmark`

We use CuPy's built-in benchmarking utility for accurate GPU timing. This handles warmup and synchronization automatically.
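
To make "warmup and synchronization" concrete, here is a hand-rolled sketch of the same idea using wall-clock time. The helper `time_fn` and its parameters are hypothetical and are not how `cupyx.profiler.benchmark` is implemented; when timing CuPy code, you would pass `sync=cp.cuda.Stream.null.synchronize`.

```python
import time
import numpy as np

def time_fn(fn, *args, n_warmup=2, n_repeat=5, sync=None):
    """Return per-call wall-clock times in seconds, measured after warmup runs.

    `sync` is an optional zero-argument callable that blocks until
    previously launched device work has finished.
    """
    for _ in range(n_warmup):              # warmup: first calls may include one-time setup costs
        fn(*args)
    times = []
    for _ in range(n_repeat):
        if sync is not None:
            sync()                         # drain any work queued earlier
        start = time.perf_counter()
        fn(*args)
        if sync is not None:
            sync()                         # wait for the work just launched
        times.append(time.perf_counter() - start)
    return np.array(times)

A_demo = np.random.default_rng(0).random((256, 256))
x_demo = np.ones(256)
times = time_fn(lambda: A_demo @ x_demo)
print(f"mean: {times.mean() * 1e3:.3f} ms over {times.size} runs")
```

Without the second `sync()` call, a GPU function could return as soon as its kernels are queued, and the timer would measure launch overhead rather than compute time.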


In [None]:
from cupyx.profiler import benchmark

cfg = PowerIterationConfig(progress=False)

# 1. CPU
print("Timing CPU...")
result_cpu = benchmark(estimate_host, args=(A_host, cfg), n_repeat=10)
t_cpu_ms = result_cpu.cpu_times.mean() * 1000

# 2. GPU (with transfer overhead)
print("Timing GPU (Host Input)...")
result_transfer = benchmark(estimate_device_exercise, args=(A_host, cfg), n_repeat=10)
t_gpu_transfer_ms = result_transfer.gpu_times.mean() * 1000

# 3. GPU (pure device)
print("Timing GPU (Device Input)...")
result_pure = benchmark(estimate_device_exercise, args=(A_device, cfg), n_repeat=10)
t_gpu_pure_ms = result_pure.gpu_times.mean() * 1000

print(f"\n--- Average Compute Times ---")
print(f"CPU:                 {t_cpu_ms:.2f} ms")
print(f"GPU (with transfer): {t_gpu_transfer_ms:.2f} ms")
print(f"GPU (pure):          {t_gpu_pure_ms:.2f} ms")

speedup = t_cpu_ms / t_gpu_pure_ms
print(f"\nSpeedup: {speedup:.1f}x")

---

## Extra Credit

**Explore the impact of changing the following parameters:**

1. **Problem Size (`dim`):** How does the GPU speedup change as you increase or decrease the matrix dimensions? Try values like 1024, 2048, 4096, 8192.

2. **Compute Workload (`max_steps` and `dominance`):** The `dominance` parameter controls how quickly the algorithm converges. A smaller dominance means eigenvalues are closer together, requiring more iterations. How does this affect the CPU vs GPU comparison?

3. **Check Frequency (`check_frequency`):** This controls how often we check for convergence. Each check triggers an implicit CPU synchronization through the `if res < cfg.residual_threshold` comparison, and through the `print` statement when `progress=True`. What happens to GPU performance when you check every step (`check_frequency=1`) vs. less frequently (`check_frequency=50`)?
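
Before experimenting with `dominance`, a back-of-the-envelope estimate can suggest how many steps to budget. The sketch below assumes the error shrinks by a factor of at most `1 - dominance` per step (the largest possible ratio of the other eigenvalues to the top one in this notebook's construction) and ignores constant factors, so treat it as a rough heuristic. The helper name `estimated_steps` is hypothetical.

```python
import math

def estimated_steps(dominance, tol=1e-10):
    """Rough step count for (1 - dominance)**k to fall below tol."""
    ratio = 1.0 - dominance                # upper bound on |lambda_2 / lambda_1| here
    return math.ceil(math.log(tol) / math.log(ratio))

for dominance in (0.1, 0.05, 0.01):
    print(f"dominance={dominance}: about {estimated_steps(dominance)} steps")
```

For the default `dominance=0.1` the estimate is below the default `max_steps=400`, but for `dominance=0.01` it is well above it, so you would likely need to raise `max_steps` to reach the residual threshold.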

**Experiment below:**

In [None]:
# Try different configurations here!
# Example:
# cfg_large = PowerIterationConfig(dim=8192, progress=False)
# cfg_slow_converge = PowerIterationConfig(dominance=0.01, progress=False)
# cfg_frequent_check = PowerIterationConfig(check_frequency=1, progress=True)

# Your experiments: