# Exercise - Asynchrony - Power Iteration - SOLUTION

GPU programming is inherently asynchronous - in this exercise, we'll learn the implications that has when using CuPy, and how we can understand and analyze the flow of execution in our code.

We'll revisit our power iteration example from earlier for this exercise.

First, we need to make sure the Nsight Systems profiler, Nsightful, and NVTX are available in our notebook:

In [1]:
import os

if os.getenv("COLAB_RELEASE_TAG"): # If running in Google Colab:
  !curl -s -L -O https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_3/NsightSystems-linux-cli-public-2025.3.1.90-3582212.deb
  !sudo dpkg -i NsightSystems-linux-cli-public-2025.3.1.90-3582212.deb > /dev/null
  !pip install "nvtx" "nsightful[notebook] @ git+https://github.com/brycelelbach/nsightful.git" > /dev/null 2>&1

All GPU work is launched asynchronously on a stream. The work items in a stream are executed in order. If you launch `f` on a stream and later launch `g` on that same stream, then `f` will be executed before `g`. But if `f` and `g` are launched on different streams, then their execution might overlap.
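The in-order rule can be illustrated without a GPU. The sketch below is only a host-side analogy, not CUDA: it assumes a single-worker thread pool from the Python standard library is a fair stand-in for one stream, because tasks submitted to it return immediately and then run one at a time in submission order. The names `f`, `g`, and `order` are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

order = []

def f():
    order.append("f")

def g():
    order.append("g")

# One worker thread plays the role of one stream: tasks submitted to the
# same executor run one at a time, in the order they were submitted.
with ThreadPoolExecutor(max_workers=1) as stream:
    stream.submit(f)  # Returns immediately, like an asynchronous launch.
    stream.submit(g)  # Queued behind `f` on the same "stream".
# Leaving the `with` block waits for all queued tasks to finish.

print(order)  # ['f', 'g']
```

If `f` and `g` were submitted to two different executors instead, nothing would constrain their relative order, which mirrors the "might overlap" case for work on different streams.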

With CuPy, much of this is hidden from us. Unless you specify otherwise, CuPy launches work on the default CUDA stream. That means that by default, all CuPy work is executed sequentially on the device, but with respect to the host, it's all happening asynchronously.

CuPy supports explicitly synchronizing, creating, and manipulating streams. A few operations will implicitly synchronize, such as memory transfers and host-side usage of scalars (results from reductions, etc.).

Let's run our code through the Nsight Systems profiler, which will help us visualize what's going on.

**NOTE: The next cell won't actually run any code, it will just write its contents to a file. This is necessary because we have to run the code with the Nsight Systems profiler.**

In [2]:
%%writefile power_iteration__baseline.py

import numpy as np
import cupy as cp
import cupyx as cpx
import nvtx
from dataclasses import dataclass

@dataclass
class PowerIterationConfig:
  dim: int = 8192
  dominance: float = 0.05
  max_steps: int = 1000
  check_frequency: int = 10
  progress: bool = True
  residual_threshold: float = 1e-10

def generate_device(cfg=PowerIterationConfig()):
  cp.random.seed(42)
  weak_lam = cp.random.random(cfg.dim - 1) * (1.0 - cfg.dominance)
  lam = cp.random.permutation(cp.concatenate((cp.asarray([1.0]), weak_lam)))
  P = cp.random.random((cfg.dim, cfg.dim))
  D = cp.diag(cp.random.permutation(lam))
  A = ((P @ D) @ cp.linalg.inv(P))
  return A

def estimate_device(A, cfg=PowerIterationConfig()):
  with nvtx.annotate("Setup"):
    A_gpu = cp.asarray(A) # If `A` is on the host, copy from host to device.
                          # Otherwise, does nothing.

    x = cp.ones(A_gpu.shape[0], dtype=np.float64)

  with nvtx.annotate("Loop"):
    for i in range(0, cfg.max_steps, cfg.check_frequency):
      with nvtx.annotate(f"Step {i} to {i + cfg.check_frequency}"):
        with nvtx.annotate(f"Compute & Residual {i}"):
          y = A_gpu @ x
          lam = (x @ y) / (x @ x)            # Rayleigh quotient.
          res = cp.linalg.norm(y - lam * x)
          x = y / cp.linalg.norm(y)          # Normalize for next step.

        with nvtx.annotate(f"Copy {i}"):
          res_host = cp.asnumpy(res)
          x_host = cp.asnumpy(x)

        with nvtx.annotate(f"I/O {i}"):
          if cfg.progress:
            print(f"step {i}: residual = {res_host:.3e}")

          np.savetxt(f"device_{i}.txt", x_host) # Save a checkpoint.

          if res_host < cfg.residual_threshold:
            break

        with nvtx.annotate(f"Compute {i}"):
          for _ in range(cfg.check_frequency - 1):
            y = A_gpu @ x # We have to use `A_gpu` here as well.
            x = y / cp.linalg.norm(y) # Normalize for next step.

  return cp.asnumpy((x.T @ (A_gpu @ x)) / (x.T @ x)) # Copy from device to host.

A_device = generate_device()

# Warmup to ensure modules are loaded and code is JIT compiled before timing.
estimate_device(A_device, cfg=PowerIterationConfig(progress=False))

with cpx.profiler.profile():
  start = cp.cuda.get_current_stream().record()
  lam_est_device = estimate_device(A_device).item()
  stop = cp.cuda.get_current_stream().record()

duration = cp.cuda.get_elapsed_time(start, stop) / 1e3

print()
print(f"GPU Execution Time: {duration:.3f} s")

Writing power_iteration__baseline.py


Now let's profile our code by running it under the Nsight Systems `nsys` tool. The syntax for this is `nsys profile <nsys flags> <your program> <your program args>`. `nsys` runs your program while collecting a bird's-eye view of everything going on in it.
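The command in the next cell can be read as three parts: the `nsys profile` subcommand, the `nsys` flags, and the program to run. The sketch below assembles it from those parts so the grouping is visible. `nsys_profile_command` is a hypothetical helper written for this illustration; it uses only the flags that appear in this notebook and does not run anything.

```python
import shlex

def nsys_profile_command(script, report_name):
    """Build the `nsys profile` invocation used in this notebook (illustrative helper)."""
    nsys_flags = [
        "--capture-range=cudaProfilerApi",
        "--capture-range-end=stop",
        "--force-overwrite", "true",
        "-o", report_name,
    ]
    program = ["python", script]  # The program to profile, followed by its arguments.
    return ["nsys", "profile", *nsys_flags, *program]

command = nsys_profile_command("power_iteration__baseline.py", "power_iteration__baseline")
print(shlex.join(command))
```

Everything after the flags is treated as the program and its arguments, which is why `python power_iteration__baseline.py` comes last.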

In [3]:
!nsys profile --capture-range=cudaProfilerApi --capture-range-end=stop --force-overwrite true -o power_iteration__baseline python power_iteration__baseline.py

Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-aed2.qdstrm'
[1/1] [0%                          ] power_iteration__baseline.nsys-repProcessing events...
Generated:
	/content/power_iteration__baseline.nsys-rep
step 0: residual = 1.101e+01
step 10: residual = 2.170e-02
step 20: residual = 1.114e-02
step 30: residual = 7.595e-03
step 40: residual = 6.893e-03
step 50: residual = 8.331e-03
step 60: residual = 1.052e-02
step 70: residual = 1.135e-02
step 80: residual = 9.876e-03
step 90: residual = 7.137e-03
step 100: residual = 4.559e-03
step 110: residual = 2.717e-03
step 120: residual = 1.563e-03
step 130: residual = 8.848e-04
step 140: residual = 4.983e-04
step 150: residual = 2.807e-04
step 160: residual = 1.585e-04
step 170: residual = 8.991e-05
step 180: residual = 5.121e-05
step 190: residual = 2.929e-05
step 200: residual = 1.683e-05
step 210: residual = 9.701e-06
step 220: residual = 5.612e-06
step 230: residual = 3.256

Now let's view our report and explore what's going on in our program.

Remember, you can see even more information in the [Nsight Systems GUI](https://developer.nvidia.com/nsight-systems).

In [4]:
import nsightful

!nsys export --type sqlite --quiet true --force-overwrite true power_iteration__baseline.nsys-rep
nsightful.display_nsys_sqlite_file_in_notebook("power_iteration__baseline.sqlite", title="Power Iteration - Baseline")

Nsight Systems shows us a lot of information - sometimes too much, and not all of it is relevant.

We're using two techniques to filter and annotate what we see in Nsight Systems, which makes our profile easier to understand.

The first is to limit when we start and stop profiling in the program. In Python, we can do this with `cupyx.profiler.profile()`, which gives us a Python context manager. Any CUDA work launched within its scope will be included in the profile.

```
not_in_the_profile()
with cpx.profiler.profile():
  in_the_profile()
not_in_the_profile()
```

For this to work, we have to pass `--capture-range=cudaProfilerApi --capture-range-end=stop` as flags to `nsys`.
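To make the capture-range idea concrete, here is a toy model of it in plain Python. This is not how CuPy or Nsight Systems are implemented; `ToyProfiler` and its methods are hypothetical names, and "launching" work just means appending a name to a list. The only point is that work is recorded solely while the context manager is active.

```python
from contextlib import contextmanager

class ToyProfiler:
    """Toy stand-in for a profiler with a capture range (illustrative only)."""

    def __init__(self):
        self.capturing = False
        self.captured = []

    @contextmanager
    def profile(self):
        self.capturing = True       # Capture range starts.
        try:
            yield
        finally:
            self.capturing = False  # Capture range ends, even if the body raises.

    def launch(self, name):
        # Work is only recorded while a capture range is active.
        if self.capturing:
            self.captured.append(name)

profiler = ToyProfiler()
profiler.launch("not_in_the_profile")
with profiler.profile():
    profiler.launch("in_the_profile")
profiler.launch("not_in_the_profile")

print(profiler.captured)  # ['in_the_profile']
```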

We can also annotate specific regions of our code, which will show up in the profiler. We can even add categories, domains, and colors to these regions, and they can be nested. To add these annotations, we use `nvtx.annotate()`, another Python context manager, this time from a library called [NVTX](http://nvtx.readthedocs.io/en/latest/reference.html).

```
import nvtx

with nvtx.annotate("Loop"):
  for i in range(20):
    with nvtx.annotate(f"Step {i}"):
      pass
```

From our profile trace, we can see that both our CPU and GPU are idly waiting for each other! Device code is idle during every I/O step when we print the residual and write the checkpoint, and host code spends a long time synchronizing on `cudaMemcpyAsync`.

Here's what happens at the start of each I/O step:

- We copy from device to host, which synchronizes with any outstanding work on the device. This blocks the host for a while.
- After that synchronous transfer has completed, we begin the I/O (printing and writing the checkpoint). During this time, the device is idle.
- After the I/O has completed on the host, we start launching the next set of iterations.

This is inefficient; we can do better by overlapping compute and I/O:

- First, host code asynchronously initiates our device to host copies.
- Then, host code asynchronously launches the next set of compute steps on the device.
- Next, host code synchronizes with the asynchronous copies we started.
- Finally, the host performs the I/O while the device performs the next set of compute steps.
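
To see why overlapping helps, here is a back-of-the-envelope timing model of the two schedules. All durations are made-up placeholders, not measurements from the profile, and the model assumes that launches and copies cost nothing and that every block of steps behaves identically. The function and parameter names are illustrative.

```python
def baseline_ms(blocks, first_step_ms, remaining_steps_ms, io_ms):
    """Serialized schedule: the device idles during I/O and the host idles during compute."""
    return blocks * (first_step_ms + remaining_steps_ms + io_ms)

def overlapped_ms(blocks, first_step_ms, remaining_steps_ms, io_ms):
    """Overlapped schedule: host I/O runs while the device works through the
    remaining steps, so each block costs whichever of the two takes longer."""
    return blocks * (first_step_ms + max(remaining_steps_ms, io_ms))

# Hypothetical numbers chosen only for illustration.
blocks = 24              # Number of check intervals.
first_step_ms = 5        # The step that also computes the residual.
remaining_steps_ms = 45  # The other steps in the interval.
io_ms = 20               # Printing and writing the checkpoint.

serial = baseline_ms(blocks, first_step_ms, remaining_steps_ms, io_ms)
overlap = overlapped_ms(blocks, first_step_ms, remaining_steps_ms, io_ms)

print(f"baseline:   {serial} ms")              # baseline:   1680 ms
print(f"overlapped: {overlap} ms")             # overlapped: 1200 ms
print(f"speedup:    {serial / overlap:.2f}x")  # speedup:    1.40x
```

In this model the I/O is hidden completely only when it is shorter than the remaining compute. If the I/O took longer, the device would still idle, just for less time than before.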

Everything is still going to run on one stream, but we want to be able to synchronize with just the copies, which are launched on the stream before the next set of compute work. We'll use a CUDA event, which we will record on the stream right after the copies. Then, we can synchronize with the event later, waiting for the copies but not the compute!
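
The record-then-synchronize pattern can be tried out on the host alone. The sketch below is an analogy, not CUDA: `ToyStream` is a hypothetical class in which a single worker thread stands in for a stream and a `threading.Event` stands in for a CUDA event. The `gate` exists only to hold the "compute" task back so that the printed order is deterministic.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class ToyStream:
    """Single worker thread: tasks run in submission order (illustrative only)."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)

    def launch(self, fn):
        self._pool.submit(fn)  # Returns immediately, like an asynchronous launch.

    def record(self):
        event = threading.Event()
        self._pool.submit(event.set)  # Fires once all earlier tasks have finished.
        return event

    def synchronize(self):
        self._pool.submit(lambda: None).result()  # Wait for everything queued so far.

    def close(self):
        self._pool.shutdown(wait=True)

log = []
gate = threading.Event()

def copy():
    log.append("copy")

def compute():
    gate.wait()  # Keep the "compute" pending until the host has done its I/O.
    log.append("compute")

stream = ToyStream()
stream.launch(copy)
copy_done = stream.record()  # Recorded right after the copy.
stream.launch(compute)       # Queued behind the event.

copy_done.wait()             # Waits for the copy only, not for the compute.
log.append("io")             # Host-side I/O happens while compute is still pending.

gate.set()
stream.synchronize()
stream.close()

print(log)  # ['copy', 'io', 'compute']
```

Calling `stream.synchronize()` in place of `copy_done.wait()` would also wait for the compute task, which is exactly the stall the event lets us avoid.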

In [5]:
%%writefile power_iteration__async.py

import numpy as np
import cupy as cp
import cupyx as cpx
import nvtx
from dataclasses import dataclass

@dataclass
class PowerIterationConfig:
  dim: int = 8192
  dominance: float = 0.05
  max_steps: int = 1000
  check_frequency: int = 10
  progress: bool = True
  residual_threshold: float = 1e-10

def generate_device(cfg=PowerIterationConfig()):
  cp.random.seed(42)
  weak_lam = cp.random.random(cfg.dim - 1) * (1.0 - cfg.dominance)
  lam = cp.random.permutation(cp.concatenate((cp.asarray([1.0]), weak_lam)))
  P = cp.random.random((cfg.dim, cfg.dim))
  D = cp.diag(cp.random.permutation(lam))
  A = ((P @ D) @ cp.linalg.inv(P))
  return A

def estimate_device(A, cfg=PowerIterationConfig()):
  with nvtx.annotate("Setup"):
    A_gpu = cp.asarray(A) # If `A` is on the host, copy from host to device.
                          # Otherwise, does nothing.

    x = cp.ones(A_gpu.shape[0], dtype=np.float64)

  with nvtx.annotate("Loop"):
    for i in range(0, cfg.max_steps, cfg.check_frequency):
      with nvtx.annotate(f"Step {i} to {i + cfg.check_frequency}"):
        with nvtx.annotate(f"Compute & Residual {i}"):
          y = A_gpu @ x
          lam = (x @ y) / (x @ x)            # Rayleigh quotient.
          res = cp.linalg.norm(y - lam * x)
          x = y / cp.linalg.norm(y)          # Normalize for next step.

        with nvtx.annotate(f"Copy {i}"):
          res_host = cp.asnumpy(res, blocking=False)
          x_host = cp.asnumpy(x, blocking=False)
          copy_event = cp.cuda.get_current_stream().record()

        with nvtx.annotate(f"Compute {i}"):
          for _ in range(cfg.check_frequency - 1):
            y = A_gpu @ x # We have to use `A_gpu` here as well.
            x = y / cp.linalg.norm(y) # Normalize for next step.

        with nvtx.annotate(f"I/O {i}", payload=i):
          copy_event.synchronize() # Wait for the copies to complete.

          if cfg.progress:
            print(f"step {i}: residual = {res_host:.3e}")

          np.savetxt(f"device_{i}.txt", x_host) # Save a checkpoint.

          if res_host < cfg.residual_threshold:
            break

  return cp.asnumpy((x.T @ (A_gpu @ x)) / (x.T @ x)) # Copy from device to host.

A_device = generate_device()

# Warmup to ensure modules are loaded and code is JIT compiled before timing.
estimate_device(A_device, cfg=PowerIterationConfig(progress=False))

with cpx.profiler.profile():
  start = cp.cuda.get_current_stream().record()
  lam_est_device = estimate_device(A_device).item()
  stop = cp.cuda.get_current_stream().record()

duration = cp.cuda.get_elapsed_time(start, stop) / 1e3

print()
print(f"GPU Execution Time: {duration:.3f} s")

Writing power_iteration__async.py


Before we profile the improved code, let's compare the execution times of both.

In [6]:
power_iteration_baseline_output   = !python power_iteration__baseline.py
power_iteration_baseline_duration = float(power_iteration_baseline_output[-1].split()[-2])
power_iteration_async_output      = !python power_iteration__async.py
power_iteration_async_duration    = float(power_iteration_async_output[-1].split()[-2])
speedup = power_iteration_baseline_duration / power_iteration_async_duration

print(f"GPU Execution Time")
print()
print(f"power_iteration_baseline: {power_iteration_baseline_duration:.3f} s")
print(f"power_iteration_async:    {power_iteration_async_duration:.3f} s")
print(f"power_iteration_async speedup over power_iteration_baseline: {speedup:.2f}")

GPU Execution Time

power_iteration_baseline: 1.817 s
power_iteration_async:    1.372 s
power_iteration_async speedup over power_iteration_baseline: 1.32
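
The parsing in the cell above depends on the last line of each script's output having the shape `GPU Execution Time: <number> s`. As a small worked example, here is that same extraction applied to the two timings printed above. `parse_duration` is an illustrative helper name, not part of the notebook.

```python
def parse_duration(last_line):
    """Pull the number out of a line shaped like 'GPU Execution Time: 1.817 s'."""
    # split() gives ['GPU', 'Execution', 'Time:', '1.817', 's'],
    # so index -2 is the number just before the unit.
    return float(last_line.split()[-2])

baseline = parse_duration("GPU Execution Time: 1.817 s")
overlapped = parse_duration("GPU Execution Time: 1.372 s")
speedup = baseline / overlapped

print(f"{speedup:.2f}")  # 1.32
```

If either script printed anything after the timing line, `[-1]` would no longer pick the right line and the parse would fail or return the wrong value.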


Next, let's capture a profile report of our improved code.

In [7]:
!nsys profile --capture-range=cudaProfilerApi --capture-range-end=stop --force-overwrite true -o power_iteration__async python power_iteration__async.py

Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-a99f.qdstrm'
[1/1] [0%                          ] power_iteration__async.nsys-repProcessing events...
Generated:
	/content/power_iteration__async.nsys-rep
step 0: residual = 1.101e+01
step 10: residual = 2.170e-02
step 20: residual = 1.114e-02
step 30: residual = 7.595e-03
step 40: residual = 6.893e-03
step 50: residual = 8.331e-03
step 60: residual = 1.052e-02
step 70: residual = 1.135e-02
step 80: residual = 9.876e-03
step 90: residual = 7.137e-03
step 100: residual = 4.559e-03
step 110: residual = 2.717e-03
step 120: residual = 1.563e-03
step 130: residual = 8.848e-04
step 140: residual = 4.983e-04
step 150: residual = 2.807e-04
step 160: residual = 1.585e-04
step 170: residual = 8.991e-05
step 180: residual = 5.121e-05
step 190: residual = 2.929e-05
step 200: residual = 1.683e-05
step 210: residual = 9.701e-06
step 220: residual = 5.612e-06
step 230: residual = 3.256e-06

Finally, let's look at the profile in Perfetto and confirm we've gotten rid of the idling.

In [8]:
!nsys export --type sqlite --quiet true --force-overwrite true power_iteration__async.nsys-rep
nsightful.display_nsys_sqlite_file_in_notebook("power_iteration__async.sqlite", title="Power Iteration - Async")