# Exercise - NumPy to CuPy - SVD Reconstruction

Let's try another NumPy to CuPy porting exercise, this time with the SVD reconstruction code from before.

**TODO: Port this code to CuPy. Here's what you'll have to do:**

- **Change `import numpy as xp` to `import cupy as xp`.**
- **NumPy arrays are converted to CuPy arrays using `xp.asarray()`.  You'll see errors like `only supports cupy.ndarray` when you have this problem.**
- **CuPy arrays are converted back to NumPy arrays (for Matplotlib) using `xp.asnumpy()`.**
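
One way to keep the two conversions straight is a small helper that always hands Matplotlib a NumPy array. The sketch below is only an illustration and not part of the exercise code: `to_host` is a hypothetical helper name, and the block falls back to plain NumPy when CuPy is not installed.

```python
import numpy as np

try:
    import cupy as cp  # Optional: only needed when running on a GPU.
except ImportError:
    cp = None

def to_host(array):
    """Return `array` as a NumPy array, copying it off the GPU if needed."""
    if cp is not None and isinstance(array, cp.ndarray):
        return cp.asnumpy(array)  # Device -> host copy.
    return np.asarray(array)      # Already on the host.

host = to_host(np.arange(6).reshape(2, 3))
print(type(host).__name__, host.shape)
```

With a helper like this, a call such as `plt.imshow(to_host(reconstructed), cmap="gray")` would work whether `reconstructed` lives on the CPU or the GPU.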

First, we need to import the computer vision and plotting stack we're using:

In [None]:
import matplotlib.pyplot as plt
import cv2
import numpy as xp

Next let's download an image of Bryce's dog:

In [None]:
import urllib.request
urllib.request.urlretrieve(
  "https://drive.usercontent.google.com/download?id=1ClKrHt4-SIHaeBJdF0K3MG64jyVnt62L&export=download",
  "loonie.jpg")

Then we read the image in grayscale mode:

In [None]:
image = cv2.imread("loonie.jpg", cv2.IMREAD_GRAYSCALE)

print(f"nbytes: {image.nbytes}")
print(f"shape: {image.shape}")
print(image)

Here we can see the image is 1600x1200 pixels, and each pixel is an unsigned 8-bit value (0-255).  Let's plot it with matplotlib to verify it looks correct:

In [None]:
plt.imshow(image, cmap="gray")
plt.title("Bryce's Dog")
plt.show()

Yes, we can confirm that is a dog (and a very cute one at that).  Now let's start doing some linear algebra!

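NumPy provides an [implementation of SVD](https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html).  It always returns the singular values, `S`, as a 1D vector rather than a 2D diagonal matrix.  By selecting `full_matrices=False`, we get the reduced decomposition, in which `U` and `Vt` are trimmed to the `min(M, N)` columns and rows that actually contribute to the product.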

In [None]:
U, S, Vt = xp.linalg.svd(image, full_matrices=False)
U.shape, S.shape, Vt.shape

Since the image is not square and we've set `full_matrices=False`, NumPy returns `U` as a non-square matrix, `S` as a 1D vector whose length is the smaller of the two image dimensions, and `Vt` as a square matrix.
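
The effect of `full_matrices` on the output shapes is easiest to see on a small matrix. The following sketch uses a random 8x5 test matrix (an assumption for illustration, not the image) and compares the reduced and full decompositions:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((8, 5))  # M = 8 rows, N = 5 columns, so min(M, N) = 5.

# Reduced decomposition: U is (M, K) and Vt is (K, N), with K = min(M, N).
u_r, s_r, vt_r = np.linalg.svd(a, full_matrices=False)

# Full decomposition: U is (M, M) and Vt is (N, N).
u_f, s_f, vt_f = np.linalg.svd(a, full_matrices=True)

print("reduced:", u_r.shape, s_r.shape, vt_r.shape)
print("full:   ", u_f.shape, s_f.shape, vt_f.shape)
```

In both cases the singular values come back as a 1D vector of length 5; only the shape of `U` changes here, because the test matrix has more rows than columns.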

The singular values are returned in descending order, which we can see if we look at the first 10 elements of `S`:

In [None]:
S[:10]

In fact, if we look at the size of the singular values, we see that the first few contribute a lot to the matrix, and then fall off very rapidly:

In [None]:
plt.semilogy(S)

That suggests we can get a pretty good approximation of the original image with a relatively small number of terms.  We can reconstruct the image matrix by slicing the `U`, `S`, and `Vt` matrices and remultiplying them.  We will need to convert `S` back into a 2D matrix for the multiplication as well.  Note that we are using the `@` operator to perform matrix multiplication, because `*` does element-wise multiplication.
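
Before applying this to the image, it can help to check the idea on a matrix small enough to inspect. The sketch below uses a random test matrix and a hypothetical helper named `reconstruct` (neither is part of the exercise code). It verifies that keeping every term recovers the matrix, and that the Frobenius-norm error of a truncated reconstruction equals the root sum of squares of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((8, 5))
u, s, vt = np.linalg.svd(a, full_matrices=False)

def reconstruct(u, s, vt, k):
    """Rank-k approximation built from the first k singular triplets."""
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# Keeping every term recovers the original matrix (up to rounding).
assert np.allclose(reconstruct(u, s, vt, len(s)), a)

# Scaling the columns of U by S gives the same product without building diag(S).
assert np.allclose((u[:, :2] * s[:2]) @ vt[:2, :], reconstruct(u, s, vt, 2))

errors, tails = [], []
for k in range(1, len(s) + 1):
    errors.append(np.linalg.norm(a - reconstruct(u, s, vt, k)))  # Frobenius norm.
    tails.append(np.sqrt(np.sum(s[k:] ** 2)))                    # Discarded terms.
    print(f"k = {k}: error = {errors[-1]:.4f}, discarded = {tails[-1]:.4f}")
```

The two printed columns agree, which is why a rapid fall-off in the singular values translates directly into a small reconstruction error.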

In [None]:
# First 3 terms.
nterms = 3
reconstructed = U[:, :nterms] @ xp.diag(S[:nterms]) @ Vt[:nterms, :]
plt.imshow(reconstructed, cmap="gray")
plt.title("n = 3")
plt.show()

That's still pretty fuzzy, so let's check out the image with more terms included:

In [None]:
plt.figure(figsize=(16, 4))

start, end, step = 10, 50, 10
for i in range(start, end, step):
  plt.subplot(1, (end - start) // step + 1, (i - start) // step + 1)
  reconstructed = U[:, :i] @ xp.diag(S[:i]) @ Vt[:i, :]
  plt.imshow(reconstructed, cmap="gray")
  plt.title(f"n = {i}")

plt.tight_layout()
plt.show()

**EXTRA CREDIT: After you port this for loop to CuPy, consider the flow of compute and I/O. Are there any problems with this pattern? How could it be improved?**

Now we'll print the compression ratio for the values of `n` used above.  This is the number of bytes of the reduced arrays added together and divided by the size of the original grayscale image array.  It seems we can get significant storage savings with this technique.

In [None]:
for i in range(start, end, step):
  compress_ratio = (U[:, :i].nbytes + S[:i].nbytes + Vt[:i, :].nbytes) / image.nbytes
  print(f"n = {i}: compression = {compress_ratio:.1%}")
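
The ratio can also be worked out by hand. For an `M x N` image truncated to `k` terms, the stored factors hold `k * (M + N + 1)` elements. The sketch below assumes the factors are 8-byte floats (NumPy's SVD promotes the `uint8` image to `float64`) while the image uses 1 byte per pixel; `compression_ratio` is a hypothetical helper name, and the 1600x1200 size is the one quoted earlier.

```python
import numpy as np

def compression_ratio(m, n, k, factor_itemsize=8, image_itemsize=1):
    """Bytes in the truncated factors divided by bytes in the original image."""
    stored = k * (m + n + 1) * factor_itemsize  # U[:, :k], S[:k], Vt[:k, :].
    original = m * n * image_itemsize
    return stored / original

# Cross-check the formula against real array sizes on a tiny uint8 "image".
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(16, 12), dtype=np.uint8)
u, s, vt = np.linalg.svd(img, full_matrices=False)
k = 3
measured = (u[:, :k].nbytes + s[:k].nbytes + vt[:k, :].nbytes) / img.nbytes
assert np.isclose(measured, compression_ratio(16, 12, k))

# Estimate for the image size quoted earlier, assuming float64 factors.
print(f"n = 10: compression = {compression_ratio(1600, 1200, 10):.1%}")
```

Note that for the tiny 16x12 test matrix the "compressed" form is actually larger than the original, because each stored element is eight times the size of a pixel. The savings only appear when `k` is small relative to the image dimensions.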

Next, we compute the difference between the original image and its `n = 10` reconstruction, and display it using `cmap="coolwarm"`.

In [None]:
delta = image - (U[:,:10] @ xp.diag(S[:10]) @ Vt[:10,:])
plt.imshow(delta, cmap="coolwarm")

Now that you have gotten SVD to work on CuPy, let's benchmark it!  To make things clearer, let's reimport NumPy and CuPy with their usual abbreviations:

In [None]:
import numpy as np
import cupy as cp
import cupyx as cpx # For `cupyx.profiler.benchmark`.

We have to be very careful when benchmarking GPU code.  GPU programming is inherently asynchronous, so it can be tricky to make sure we're measuring the right thing.

Imagine you're measuring how long it takes to ship a package to someone, but you only time how long it takes for you to drop it off at the post office, not how long it takes for them to receive it and send you a thank you.

Common Pythonic benchmarking tools like `%timeit` are not GPU aware, so it's easy to measure incorrectly with them.  We can only use them when we know the code we're benchmarking will perform the proper synchronization.  It's better to use something like [`cupyx.profiler.benchmark`](https://docs.cupy.dev/en/stable/reference/generated/cupyx.profiler.benchmark.html#cupyx.profiler.benchmark).
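
To make the synchronization point concrete, here is a minimal sketch of a hand-rolled timer that waits for outstanding work before stopping the clock. It is not how `cupyx.profiler.benchmark` is implemented; `time_call` and its `sync` parameter are hypothetical names, and the block itself only exercises the CPU path so that it runs without a GPU.

```python
import time
import numpy as np

def time_call(func, sync=None, n_repeat=5, n_warmup=1):
    """Time `func`, calling `sync` (if given) before starting and stopping the clock."""
    times = []
    for i in range(n_warmup + n_repeat):
        if sync is not None:
            sync()  # Drain previously queued work before starting the clock.
        start = time.perf_counter()
        func()
        if sync is not None:
            sync()  # Wait until the launched work has actually finished.
        elapsed = time.perf_counter() - start
        if i >= n_warmup:  # Discard warmup iterations.
            times.append(elapsed)
    return np.array(times)

a = np.random.default_rng(0).random((200, 200))
cpu_times = time_call(lambda: np.linalg.svd(a, full_matrices=False))
print(f"SVD (Host) = {cpu_times.mean():.3g} s (mean of {cpu_times.size} runs)")

# On a GPU, one might pass CuPy's stream synchronization as `sync`, e.g.:
#   time_call(lambda: cp.linalg.svd(gpu_a, full_matrices=False),
#             sync=cp.cuda.Stream.null.synchronize)
```

Without the second `sync()` call, the timer would be measuring the trip to the post office rather than the delivery.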

First, we need a NumPy (CPU) and CuPy (GPU) copy of our image:

In [None]:
cpu_image = cv2.imread('loonie.jpg', cv2.IMREAD_GRAYSCALE)
gpu_image = cp.asarray(cpu_image)

Next let's benchmark both CPU and GPU execution:

In [None]:
repeat = 10
warmup = 1
D_np = cpx.profiler.benchmark(n_repeat=repeat, n_warmup=warmup, func=lambda:
  np.linalg.svd(cpu_image, full_matrices=False)
).cpu_times
D_cp = cpx.profiler.benchmark(n_repeat=repeat, n_warmup=warmup, func=lambda:
  cp.linalg.svd(gpu_image, full_matrices=False)
).cpu_times

print(f"SVD (Host)   = {D_np.mean():.3g} s ± {(D_np.std() / D_np.mean()):.2%} (mean ± relative stdev of {D_np.size} runs)")
print(f"SVD (Device) = {D_cp.mean():.3g} s ± {(D_cp.std() / D_cp.mean()):.2%} (mean ± relative stdev of {D_cp.size} runs)")

Depending on your hardware, the CPU and GPU might be close to the same speed, or the GPU might even be slower!  This is because the image is not big enough to fully utilize the GPU.  We can simulate a larger image by tiling the image using `np.tile`.  This duplicates the image both along axis 0 and axis 1:

In [None]:
cpu_image_tile = np.tile(cpu_image, (2, 2))
gpu_image_tile = cp.asarray(cpu_image_tile)
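
As a quick illustration of what the `(2, 2)` argument does, the sketch below tiles a tiny 2x3 array (an assumed stand-in for the image) and shows that both dimensions double, so the array holds four times as many bytes:

```python
import numpy as np

small = np.arange(6, dtype=np.uint8).reshape(2, 3)
tiled = np.tile(small, (2, 2))  # Repeat twice along axis 0 and twice along axis 1.

print(small.shape, "->", tiled.shape)
print("size factor:", tiled.nbytes // small.nbytes)
print(tiled)
```

Passing a different tuple, such as `(4, 4)`, scales the row and column counts by those factors instead.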

Now we can benchmark again (this will take longer because the matrices are much bigger):

In [None]:
repeat = 5
warmup = 1
D_np = cpx.profiler.benchmark(n_repeat=repeat, n_warmup=warmup, func=lambda:
  np.linalg.svd(cpu_image_tile, full_matrices=False)
).cpu_times
D_cp = cpx.profiler.benchmark(n_repeat=repeat, n_warmup=warmup, func=lambda:
  cp.linalg.svd(gpu_image_tile, full_matrices=False)
).cpu_times

print(f"SVD (Host)   = {D_np.mean():.3g} s ± {(D_np.std() / D_np.mean()):.2%} (mean ± relative stdev of {D_np.size} runs)")
print(f"SVD (Device) = {D_cp.mean():.3g} s ± {(D_cp.std() / D_cp.mean()):.2%} (mean ± relative stdev of {D_cp.size} runs)")

**TODO: Experiment with different image sizes by changing the `np.tile` arguments.  When is the GPU faster?**