# Exercise - Memory Spaces - Power Iteration - SOLUTION

Let's learn about memory spaces and transfers! In this exercise, we'll learn:

- How to explicitly transfer data between host and device.
  - `d = cupy.asarray(h)` to copy from a host array to a device array.
  - `h = cupy.asnumpy(d)` to copy from a device array to a host array.
- What happens if we mix NumPy and CuPy code.
- Some ways in which NumPy and CuPy produce different results.
- Some of the limitations of CuPy.
- How problem size and compute workload impact performance.
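
As a minimal sketch of the two transfer calls listed above, the round trip below copies a small host array to the device, doubles it there, and copies the result back. The names `h`, `d`, `h_back`, and `gpu_available` are illustrative, and the CPU fallback branch is an assumption added so the snippet also runs on a machine without CuPy or a GPU.

```python
import numpy as np

try:
    import cupy as cp
    # Assumption: treat CuPy as usable only if at least one CUDA device is visible.
    gpu_available = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    cp = None
    gpu_available = False

h = np.arange(6, dtype=np.float64).reshape(2, 3)  # Host (NumPy) array.

if gpu_available:
    d = cp.asarray(h)           # Host -> device copy.
    h_back = cp.asnumpy(d * 2)  # Compute on the device, then device -> host copy.
else:
    h_back = h * 2              # CPU-only fallback so the sketch still runs.

print(type(h_back).__name__, h_back.tolist())
```

Either way, `h_back` ends up as a NumPy array on the host, which is what functions such as `np.savetxt` expect.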

We're going to estimate the dominant eigenvalue of a matrix with the [power iteration algorithm](https://en.wikipedia.org/wiki/Power_iteration).
First, we'll randomly generate a dense diagonalizable square matrix.
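
To see why this construction gives a matrix with known eigenvalues, here is a small sketch with a hand-picked 3x3 example. The values in `eigenvalues` and `P` are arbitrary choices for illustration, not the ones used in this exercise.

```python
import numpy as np

eigenvalues = np.array([1.0, 0.5, 0.2])  # Chosen by hand; 1.0 is dominant.
P = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])          # Any invertible matrix works here.
D = np.diag(eigenvalues)
A = P @ D @ np.linalg.inv(P)             # Similar to D, so it has D's eigenvalues.

# `eigvals` should recover the values we placed on the diagonal of D.
recovered = np.sort(np.linalg.eigvals(A).real)
print(recovered)
```

Because `A` is similar to `D`, its eigenvalues are exactly the diagonal entries of `D`, even though `A` itself is dense.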

In [None]:
import numpy as np
import cupy as cp # Don't forget to add the CuPy import!

In [None]:
from dataclasses import dataclass

@dataclass
class PowerIterationConfig:
  dim: int = 4096 # Number of rows and columns in the square matrix.

  # Value from 0 to 1 that controls how much greater the dominant eigenvalue is
  # than the rest of the eigenvalues. A higher value means quicker convergence.
  dominance: float = 0.1

  # Maximum number of steps to perform.
  max_steps: int = 400

  # Every `check_frequency` steps we save a checkpoint and compute the residual.
  check_frequency: int = 10

  # Whether the residual should be printed every `check_frequency` steps.
  progress: bool = True

  # If the residual is below `residual_threshold`, terminate early.
  residual_threshold: float = 1e-10

In [None]:
def generate_host(cfg=PowerIterationConfig()):
  np.random.seed(42)

  # Vector with a single 1 & `cfg.dim - 1` values from 0 to `1 - cfg.dominance`.
  weak_lam = np.random.random(cfg.dim - 1) * (1.0 - cfg.dominance)
  lam = np.random.permutation(np.concatenate(([1.0], weak_lam)))

  P = np.random.random((cfg.dim, cfg.dim)) # Random invertible matrix.
  D = np.diag(np.random.permutation(lam))  # Diagonal matrix w/ random eigenvalues.
  A = ((P @ D) @ np.linalg.inv(P))         # Diagonalizable matrix.
  return A

A_host = generate_host()

with np.printoptions(precision=4):
  print(A_host)

[[-0.4937 -0.519  -0.2935 ... -0.1628  0.8361  0.531 ]
 [-1.0859  0.0087 -0.0661 ... -0.1706  1.0955  0.7075]
 [-0.4291 -0.3628  0.3393 ...  0.1813  0.2238 -0.2124]
 ...
 [-1.1089 -0.4564 -0.3024 ...  0.2075  1.2864  0.9066]
 [-0.8714 -0.5109 -0.1201 ... -0.072   1.3048  0.4372]
 [-1.6421 -0.6629 -0.2001 ... -0.2997  1.8579  1.5576]]


Next, we perform the power iteration with NumPy, using a vector of 1s as our initial guess.

We'll perform at most `cfg.max_steps` steps. Every `cfg.check_frequency` steps, we'll save a checkpoint, compute the absolute residual, and check whether it's below `cfg.residual_threshold`. If it is, we'll stop early.
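
In symbols, one checked step of this procedure can be written as follows. The subscripted names are notation introduced here for illustration; they correspond to `x`, `y`, `lam`, and `res` in `estimate_host`.

$$
y_k = A x_k, \qquad
\lambda_k = \frac{x_k^\top y_k}{x_k^\top x_k}, \qquad
r_k = \lVert y_k - \lambda_k x_k \rVert_2, \qquad
x_{k+1} = \frac{y_k}{\lVert y_k \rVert_2}
$$

Here $\lambda_k$ is the Rayleigh quotient and $r_k$ is the absolute residual that is compared against `cfg.residual_threshold`.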

In [None]:
def estimate_host(A, cfg=PowerIterationConfig()):
  x = np.ones(A.shape[0], dtype=np.float64)

  for i in range(0, cfg.max_steps, cfg.check_frequency):
    y = A @ x
    lam = (x @ y) / (x @ x)            # Rayleigh quotient.
    res = np.linalg.norm(y - lam * x)
    x = y / np.linalg.norm(y)          # Normalize for next step.

    if cfg.progress:
      print(f"step {i}: residual = {res:.3e}")

    np.savetxt(f"host_{i}.txt", x) # Save a checkpoint.

    if res < cfg.residual_threshold:
      break

    for _ in range(cfg.check_frequency - 1):
      y = A @ x
      x = y / np.linalg.norm(y) # Normalize for next step.

  return (x.T @ (A @ x)) / (x.T @ x)

lam_est_host = estimate_host(A_host).item()

print()
print(lam_est_host)

step 0: residual = 7.594e+00
step 10: residual = 1.699e-02
step 20: residual = 2.148e-02
step 30: residual = 1.295e-02
step 40: residual = 4.494e-03
step 50: residual = 1.366e-03
step 60: residual = 4.129e-04
step 70: residual = 1.274e-04
step 80: residual = 4.013e-05
step 90: residual = 1.286e-05
step 100: residual = 4.181e-06
step 110: residual = 1.374e-06
step 120: residual = 4.550e-07
step 130: residual = 1.517e-07
step 140: residual = 5.085e-08
step 150: residual = 1.712e-08
step 160: residual = 5.784e-09
step 170: residual = 1.961e-09
step 180: residual = 6.655e-10
step 190: residual = 2.269e-10
step 200: residual = 7.746e-11

0.9999999999604734
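
The steady decay of the residual is consistent with the standard convergence behavior of power iteration: asymptotically, the error shrinks by roughly a factor of $|\lambda_2 / \lambda_1|$ per step. With the default `dominance` of 0.1, the non-dominant eigenvalues lie below 0.9, so over 10 steps we would expect a reduction of at least about $0.9^{10}$. The sketch below checks that estimate against a few residuals copied from the printed output above; it is a rough sanity check, not a proof.

```python
dominance = 0.1       # Default from PowerIterationConfig.
check_frequency = 10  # Default from PowerIterationConfig.

# Upper bound on the contraction between checkpoints: |lambda_2 / lambda_1| < 0.9.
bound = (1.0 - dominance) ** check_frequency

# Residuals copied from steps 130 to 160 of the printed host output.
residuals = [1.517e-07, 5.085e-08, 1.712e-08, 5.784e-09]
ratios = [later / earlier for earlier, later in zip(residuals, residuals[1:])]

print(f"bound  = {bound:.4f}")
print("ratios =", [f"{r:.4f}" for r in ratios])
```

The observed ratios sit just below the bound, which also explains why a larger `dominance` leads to quicker convergence.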


Here's our power iteration function ported to CuPy.

Did you forget to port any calls and get an error message? These messages can be tricky to decipher. AI coding assistants are a great tool for identifying the underlying issue. On Google Colab, when a cell raises an error, there will be a helpful AI-powered "explain error" button.

In [None]:
def estimate_device(A, cfg=PowerIterationConfig()):
  A_gpu = cp.asarray(A) # If `A` is on the host, copy from host to device.
                        # Otherwise, does nothing.

  x = cp.ones(A_gpu.shape[0], dtype=np.float64)

  for i in range(0, cfg.max_steps, cfg.check_frequency):
    y = A_gpu @ x # If we used `A` here instead of `A_gpu`, we would get an error.
                  # CuPy doesn't allow implicit conversions between NumPy <-> CuPy.

    lam = (x @ y) / (x @ x)            # Rayleigh quotient.
    res = cp.linalg.norm(y - lam * x)
    x = y / cp.linalg.norm(y)          # Normalize for next step.

    if cfg.progress:
      print(f"step {i}: residual = {cp.asnumpy(res):.3e}")

    np.savetxt(f"device_{i}.txt", cp.asnumpy(x)) # Copy from device to host
                                                 # and save a checkpoint.

    if res < cfg.residual_threshold:
      break

    for _ in range(cfg.check_frequency - 1):
      y = A_gpu @ x # We have to use `A_gpu` here as well.
      x = y / cp.linalg.norm(y) # Normalize for next step.

  return cp.asnumpy((x.T @ (A_gpu @ x)) / (x.T @ x)) # Copy from device to host.

lam_est_device = estimate_device(A_host).item()

print()
print(lam_est_device)

step 0: residual = 7.594e+00
step 10: residual = 1.699e-02
step 20: residual = 2.148e-02
step 30: residual = 1.295e-02
step 40: residual = 4.494e-03
step 50: residual = 1.366e-03
step 60: residual = 4.129e-04
step 70: residual = 1.274e-04
step 80: residual = 4.013e-05
step 90: residual = 1.286e-05
step 100: residual = 4.181e-06
step 110: residual = 1.374e-06
step 120: residual = 4.550e-07
step 130: residual = 1.517e-07
step 140: residual = 5.085e-08
step 150: residual = 1.712e-08
step 160: residual = 5.784e-09
step 170: residual = 1.960e-09
step 180: residual = 6.659e-10
step 190: residual = 2.269e-10
step 200: residual = 7.736e-11

0.9999999999593268


We can also port the matrix generation function to CuPy. However, if you tried this, you may have noticed that the residuals and the number of steps differ from the NumPy run. That's because CuPy actually generates a different matrix than NumPy does!

CuPy uses a different pseudorandom number generator algorithm than NumPy because it needs one that can be parallelized. So even with the same code and the same seed, NumPy and CuPy produce different results.
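
The same effect can be seen without a GPU. As a stand-in illustration (it does not run CuPy), the sketch below seeds two different NumPy generators with the same value: the legacy `RandomState`, which uses the Mersenne Twister, and `default_rng`, which uses PCG64 by default. The variable names are illustrative.

```python
import numpy as np

seed = 42

legacy = np.random.RandomState(seed)  # Mersenne Twister (MT19937).
modern = np.random.default_rng(seed)  # PCG64 by default.

a = legacy.random_sample(3)
b = modern.random(3)

print("legacy:", a)
print("modern:", b)
print("same values?", np.allclose(a, b))
```

Each generator is reproducible on its own, but the same seed fed to a different algorithm yields a different stream of numbers.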

There are a number of operations in CuPy that have the same functionality as their NumPy counterparts but are not guaranteed to be bit-for-bit equivalent.
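
One common reason for such differences is that floating-point addition is not associative, so a parallel reduction that sums terms in a different order can round differently. This may be why, for example, the host and device residuals above agree only to a few digits at later steps. A minimal pure-Python sketch of the effect, with values chosen to make it obvious:

```python
big, small = 1e16, 1.0

# Adding `small` to `big` first loses it to rounding, since float64
# cannot represent 1e16 + 1 exactly.
left_to_right = (big + small) - big

# The same three terms in a different order keep `small` intact.
reordered = (big - big) + small

print(left_to_right, reordered)
```

Real workloads rarely differ this dramatically, but small order-dependent rounding differences accumulate in the same way.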

For the purposes of our benchmarking, we'll use the NumPy-generated matrix.

In [None]:
def generate_device(cfg=PowerIterationConfig()):
  cp.random.seed(42)

  # Vector with a single 1 & `cfg.dim - 1` values from 0 to `1 - cfg.dominance`.
  weak_lam = cp.random.random(cfg.dim - 1) * (1.0 - cfg.dominance)
  lam = cp.random.permutation(cp.concatenate((cp.asarray([1.0]), weak_lam)))
  # In NumPy, `concatenate` takes anything array-like, so a list is fine, but
  # in CuPy, all args must be device arrays, as there are no implicit transfers!

  P = cp.random.random((cfg.dim, cfg.dim)) # Random invertible matrix.
  D = cp.diag(cp.random.permutation(lam))  # Diagonal matrix with random eigenvalues.
  A = ((P @ D) @ cp.linalg.inv(P))         # Diagonalizable matrix.
  return A

A_device = generate_device()

with np.printoptions(precision=4):
  print("A_host:")
  print(A_host)
  print()
  print("A_device:")
  print(A_device)
  print()

lam_est_device_generation = estimate_device(A_device).item()

print()
print(lam_est_device_generation)

A_host:
[[-0.4937 -0.519  -0.2935 ... -0.1628  0.8361  0.531 ]
 [-1.0859  0.0087 -0.0661 ... -0.1706  1.0955  0.7075]
 [-0.4291 -0.3628  0.3393 ...  0.1813  0.2238 -0.2124]
 ...
 [-1.1089 -0.4564 -0.3024 ...  0.2075  1.2864  0.9066]
 [-0.8714 -0.5109 -0.1201 ... -0.072   1.3048  0.4372]
 [-1.6421 -0.6629 -0.2001 ... -0.2997  1.8579  1.5576]]

A_device:
[[-0.3175 -0.227   0.0765 ... -0.2411  0.9861 -0.6663]
 [-1.8135 -0.6378 -0.4466 ... -0.4503  3.2291 -1.3058]
 [-0.6443  0.3033  0.9719 ...  0.1302  0.4531 -0.3723]
 ...
 [-1.8697 -0.7714 -0.2048 ...  0.0244  2.859  -1.1147]
 [-1.1404 -0.9935 -0.4077 ... -0.2789  2.8553 -0.7879]
 [-0.5979 -0.6596  0.1047 ... -0.1826  1.4195 -0.0636]]

step 0: residual = 2.346e+01
step 10: residual = 2.859e-02
step 20: residual = 2.984e-02
step 30: residual = 1.433e-02
step 40: residual = 5.035e-03
step 50: residual = 1.643e-03
step 60: residual = 5.309e-04
step 70: residual = 1.726e-04
step 80: residual = 5.663e-05
step 90: residual = 1.875e-05
step 100:

Next, let's compute the eigenvalues of the matrix with `numpy.linalg.eigvals`. This may take a little while.

Did you try changing this to use CuPy?
Unfortunately we can't, because CuPy doesn't have `eigvals` yet. It's important to remember that while CuPy supports much of NumPy, it has limitations!
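
When a routine is missing on the device, one option is to copy the data to the host and call the NumPy version there. The sketch below shows that pattern; `to_host` and `dominant_eigenvalue_reference` are hypothetical helpers written for this example, not part of CuPy or NumPy, and the `.get()` check is a simplifying assumption about how to recognize a device array.

```python
import numpy as np

def to_host(a):
    """Return `a` as a NumPy array (hypothetical helper, not a CuPy API)."""
    # Assumption: anything with a `.get()` method is a CuPy array, whose
    # `.get()` copies the data from device to host.
    if hasattr(a, "get"):
        return a.get()
    return np.asarray(a)

def dominant_eigenvalue_reference(a):
    # `eigvals` runs on the CPU, so the data must be on the host first.
    return np.linalg.eigvals(to_host(a)).real.max()

A_small = np.array([[2.0, 0.0],
                    [0.0, 0.5]])
print(dominant_eigenvalue_reference(A_small))
```

The transfer is explicit, so the cost of moving the matrix back to the host stays visible in the code.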

In [None]:
lam_ref = np.linalg.eigvals(A_host).real.max()

Now we can check whether our power iteration estimation is correct.

In [None]:
print("Solution")
print()
print(f"Power iteration (host)   = {lam_est_host:.6e}")
print(f"Power iteration (device) = {lam_est_device:.6e}")
print(f"`eigvals` reference      = {lam_ref:.6e}")

rel_err_host   = abs(lam_est_host - lam_ref) / abs(lam_ref)
rel_err_device = abs(lam_est_device - lam_ref) / abs(lam_ref)
print()
print(f"Relative error (host)    = {rel_err_host:.3e}")
print(f"Relative error (device)  = {rel_err_device:.3e}")

np.testing.assert_allclose(lam_est_host, lam_ref, rtol=1e-4)
np.testing.assert_allclose(lam_est_device, lam_ref, rtol=1e-4)

Solution

Power iteration (host)   = 1.000000e+00
Power iteration (device) = 1.000000e+00
`eigvals` reference      = 1.000000e+00

Relative error (host)    = 3.938e-11
Relative error (device)  = 4.053e-11


Finally, let's benchmark all three solutions.

In [None]:
print("Execution Time")
print()

time_host = %timeit -q -o estimate_host(A_host, PowerIterationConfig(progress=False)).item()
print(f"Power iteration (host)   = {time_host}")

# We intentionally use `A_host`, not `A_device`, because they're not
# the same matrices due to differences in NumPy and CuPy's random facilities.
time_device = %timeit -q -o estimate_device(A_host, PowerIterationConfig(progress=False)).item()
print(f"Power iteration (device) = {time_device}")

time_ref = %timeit -q -o -r 1 -n 1 np.linalg.eigvals(A_host).real.max()
print(f"`eigvals` reference      = {time_ref}")

Execution Time

Power iteration (host)   = 2.38 s ± 382 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Power iteration (device) = 324 ms ± 7.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
`eigvals` reference      = 49.1 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
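
`%timeit` is an IPython magic, so it only works inside a notebook or IPython session. To explore how problem size affects run time in a plain Python script, a rough sketch with the standard `timeit` module could look like the following. The helper `time_matvec` and the chosen sizes are illustrative assumptions, and the sketch times only the NumPy matrix-vector product at the heart of each power iteration step.

```python
import timeit
import numpy as np

def time_matvec(dim, repeats=3):
    """Best-of-`repeats` wall time in seconds for one `A @ x` of size `dim`."""
    rng = np.random.default_rng(0)
    A = rng.random((dim, dim))
    x = np.ones(dim)
    return min(timeit.repeat(lambda: A @ x, number=1, repeat=repeats))

timings = {dim: time_matvec(dim) for dim in (64, 256, 1024)}

for dim, seconds in timings.items():
    print(f"dim = {dim:5d}: {seconds:.3e} s per matrix-vector product")
```

On a GPU, small problems tend to be dominated by transfer and launch overhead, so the device version is most likely to pay off at larger sizes such as the `dim = 4096` default used here.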
