# Exercise - Asynchrony - Power Iteration

GPU programming is inherently asynchronous. In this exercise, we'll learn what that implies when using CuPy, and how to understand and analyze the flow of execution in our code.

We'll revisit our power iteration example from earlier for this exercise.

First, we need to make sure the Nsight Systems profiler, Nsightful, and NVTX are available in our notebook:

In [None]:
import os

if os.getenv("COLAB_RELEASE_TAG"): # If running in Google Colab
  !curl -s -L -O https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_3/NsightSystems-linux-cli-public-2025.3.1.90-3582212.deb
  !sudo dpkg -i NsightSystems-linux-cli-public-2025.3.1.90-3582212.deb > /dev/null
  !pip install "nvtx" "nsightful[notebook] @ git+https://github.com/brycelelbach/nsightful.git" > /dev/null 2>&1

All GPU work is launched asynchronously on a stream. The work items in a stream are executed in order. If you launch `f` on a stream and later launch `g` on that same stream, then `f` will be executed before `g`. But if `f` and `g` are launched on different streams, then their execution might overlap.

With CuPy, much of this is hidden from us. Unless you specify otherwise, CuPy launches work on the default CUDA stream. That means that by default, all CuPy work is executed sequentially on the device, but with respect to the host, it's all happening asynchronously.
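
The ordering rules above can be pictured with a CPU-only analogy. This is a sketch, not CUDA: it assumes a single-worker `ThreadPoolExecutor` can stand in for a stream, `submit()` for an asynchronous launch, and `result()` for a synchronization. The `launch` helper and the `f`/`g` labels are hypothetical names used only for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def launch(stream, name, log, seconds=0.05):
    """Enqueue a fake work item on `stream` and return immediately."""
    def work():
        log.append(f"{name} start")
        time.sleep(seconds)  # Stand-in for device work.
        log.append(f"{name} end")
    return stream.submit(work)

# One single-worker executor plays the role of one stream.
same_stream_log = []
with ThreadPoolExecutor(max_workers=1) as stream:
    f = launch(stream, "f", same_stream_log)  # Returns before f has run.
    g = launch(stream, "g", same_stream_log)  # Queued behind f.
    g.result()  # "Synchronize": block the host until g (and so f) is done.

# Two executors play the role of two streams: f and g may overlap.
two_stream_log = []
with ThreadPoolExecutor(max_workers=1) as s1, ThreadPoolExecutor(max_workers=1) as s2:
    f = launch(s1, "f", two_stream_log)
    g = launch(s2, "g", two_stream_log)
    f.result()
    g.result()

print(same_stream_log)  # f always ends before g starts.
print(two_stream_log)   # The interleaving of f and g is not guaranteed.
```

In the first log, `f` always finishes before `g` begins, because both were queued on the same "stream". In the second, the entries can interleave, because nothing orders work across the two.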

CuPy supports explicitly synchronizing, creating, and manipulating streams.

**TODO: However, there are also a few common operations in CuPy that will implicitly synchronize with the device. What operations do you think these are?**

Let's run our code through the Nsight Systems profiler, which will help us visualize what's going on.

**NOTE: The next cell won't actually run any code, it will just write its contents to a file. This is necessary because we have to run the code with the Nsight Systems profiler.**

In [None]:
%%writefile power_iteration__baseline.py

import numpy as np
import cupy as cp
import cupyx as cpx
import nvtx
from dataclasses import dataclass

@dataclass
class PowerIterationConfig:
  dim: int = 8192
  dominance: float = 0.05
  max_steps: int = 1000
  check_frequency: int = 10
  progress: bool = True
  residual_threshold: float = 1e-10

def generate_device(cfg=PowerIterationConfig()):
  cp.random.seed(42)
  weak_lam = cp.random.random(cfg.dim - 1) * (1.0 - cfg.dominance)
  lam = cp.random.permutation(cp.concatenate((cp.asarray([1.0]), weak_lam)))
  P = cp.random.random((cfg.dim, cfg.dim))
  D = cp.diag(cp.random.permutation(lam))
  A = ((P @ D) @ cp.linalg.inv(P))
  return A

def estimate_device(A, cfg=PowerIterationConfig()):
  A_gpu = cp.asarray(A) # If `A` is on the host, copy from host to device.
                        # Otherwise, does nothing.

  x = cp.ones(A_gpu.shape[0], dtype=np.float64)

  for i in range(0, cfg.max_steps, cfg.check_frequency):
    y = A_gpu @ x
    lam = (x @ y) / (x @ x)            # Rayleigh quotient.
    res = cp.linalg.norm(y - lam * x)
    x = y / cp.linalg.norm(y)          # Normalize for next step.

    if cfg.progress:
      print(f"step {i}: residual = {res:.3e}")

    np.savetxt(f"device_{i}.txt", cp.asnumpy(x)) # Copy from device to host
                                                  # and save a checkpoint.

    if res < cfg.residual_threshold:
      break

    for _ in range(cfg.check_frequency - 1):
      y = A_gpu @ x # We have to use `A_gpu` here as well.
      x = y / cp.linalg.norm(y) # Normalize for next step.

  return cp.asnumpy((x.T @ (A_gpu @ x)) / (x.T @ x)) # Copy from device to host.

A_device = generate_device()

# Warmup to ensure modules are loaded and code is JIT compiled before timing.
estimate_device(A_device, cfg=PowerIterationConfig(progress=False))

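start = cp.cuda.get_current_stream().record()
lam_est_device = estimate_device(A_device).item()
stop = cp.cuda.get_current_stream().record()
stop.synchronize() # Make sure `stop` has completed before reading the time.

duration = cp.cuda.get_elapsed_time(start, stop) / 1e3 # Milliseconds to seconds.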

print()
print(f"GPU Execution Time: {duration:.3f} s")

Now let's profile our code by running it under the Nsight Systems `nsys` tool. The syntax for this is `nsys <nsys flags> <your program> <your program args>`. It runs your program while collecting a bird's-eye view of everything going on in it.

In [None]:
!nsys profile --cuda-event-trace=false --force-overwrite true -o power_iteration__baseline python power_iteration__baseline.py

Now let's view our report and explore what's going on in our program.

**TODO: Run the next cell, which will generate the report and create a button that, when clicked, opens it in [Perfetto](https://ui.perfetto.dev/), a web-based no-install visual profiler.**

**EXTRA CREDIT: Download the [Nsight Systems GUI](https://developer.nvidia.com/nsight-systems) and open the report in it to see even more information.**

In [None]:
import nsightful

!nsys export --type sqlite --quiet true --force-overwrite true power_iteration__baseline.nsys-rep
nsightful.display_nsys_sqlite_file_in_notebook("power_iteration__baseline.sqlite", title="Power Iteration - Baseline")

Nsight Systems shows us a lot of information. Sometimes it's too much, and not all of it is relevant.

There are two ways that we can filter and annotate what we see in Nsight Systems.

The first is to limit when we start and stop profiling in the program. In Python, we can do this with `cupyx.profiler.profile()`, which gives us a Python context manager. Any CUDA work launched within its scope will be included in the profile.

```
not_in_the_profile()
with cpx.profiler.profile():
  in_the_profile()
not_in_the_profile()
```

For this to work, we have to pass `--capture-range=cudaProfilerApi --capture-range-end=stop` as flags to `nsys`.
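
To keep the two variants of the command straight, here is a small sketch that assembles the `nsys` invocation as an argument list. `build_nsys_command` is a hypothetical helper, not part of Nsight Systems, and it only uses flags that already appear in this notebook.

```python
import shlex

def build_nsys_command(script, output, capture_range=False):
    """Return the argv list for profiling `script` with `nsys profile`.

    Hypothetical helper: it only assembles flags already used in this notebook.
    """
    cmd = ["nsys", "profile", "--cuda-event-trace=false"]
    if capture_range:
        # Only record between the CUDA profiler start and stop calls, which
        # `cupyx.profiler.profile()` issues when its scope is entered and exited.
        cmd += ["--capture-range=cudaProfilerApi", "--capture-range-end=stop"]
    cmd += ["--force-overwrite", "true", "-o", output, "python", script]
    return cmd

cmd = build_nsys_command("power_iteration__baseline.py",
                         "power_iteration__baseline",
                         capture_range=True)
print(shlex.join(cmd))  # A string you could paste after `!` in a notebook cell.
```

The same list can be passed to `subprocess.run(cmd, check=True)` outside of a notebook.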

We can also annotate specific regions of our code, which will show up in the profiler. We can even add categories, domains, and colors to these regions, and they can be nested. To add these annotations, we use `nvtx.annotate()`, another Python context manager, this time from a library called [NVTX](http://nvtx.readthedocs.io/en/latest/reference.html).

```
with nvtx.annotate("Loop"):
  for i in range(20):
    with nvtx.annotate(f"Step {i}"):
      pass
```
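
As a slightly fuller sketch of the same idea, the example below adds a color, a domain, and a category to nested regions. The region names, the `power_iteration` domain, and the `solve` function are hypothetical. The no-op fallback is an assumption made only so the sketch still runs where the `nvtx` package isn't installed; in this notebook, `nvtx` is installed by the setup cell.

```python
import contextlib

try:
    import nvtx
    annotate = nvtx.annotate
except ImportError:
    # Assumption for this sketch: fall back to a no-op context manager so the
    # code still runs where the `nvtx` package is not installed.
    def annotate(message=None, **kwargs):
        return contextlib.nullcontext()

def solve(steps=3):
    total = 0
    # Outer region, with a color and a domain.
    with annotate("Solve", color="blue", domain="power_iteration"):
        for i in range(steps):
            # Nested region, additionally grouped under a category.
            with annotate(f"Step {i}", color="green", category="step",
                          domain="power_iteration"):
                total += i  # Stand-in for real work.
    return total

print(solve())
```

The annotations have no effect on the result. They only add named ranges that show up in the profiler timeline when the program runs under `nsys`.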

**TODO: Go back to the earlier cells and improve the profile results by adding:**
- **`nvtx.annotate()` regions. Remember, you can nest them.**
- **A `cpx.profiler.profile()` around the `start =`/`stop =` lines that run the solver.**
- **`--capture-range=cudaProfilerApi --capture-range-end=stop` to the `nsys` flags.**

**Then, capture another profile and see if you can identify how we can improve the code. Specifically, think about how we could add more asynchrony.**

Remember what we've learned about streams and how to use them with CuPy:

- By default, all CuPy operations within a single thread run on the same stream. You can access this stream with `cp.cuda.get_current_stream()`. 
- You can create a new stream with `cp.cuda.Stream(non_blocking=True)`. Use a `with` statement to make it the current stream for all CuPy operations within the block.
- You can record an event on a stream by calling `.record()` on it.
- You can synchronize on an event (or an entire stream) by calling `.synchronize()` on it.
- Memory transfers will block by default. You can launch them asynchronously with `cp.asarray(..., blocking=False)` (for host to device transfers) and `cp.asnumpy(..., blocking=False)` (for device to host transfers).

**TODO: Copy the kernel from the earlier cell with your NVTX and CuPy profiler regions into the cell below. Then, try to improve performance by adding asynchrony. Make sure that you don't copy and paste the `%%writefile` directive.**

In [None]:
%%writefile power_iteration__async.py

import numpy as np
import cupy as cp
import cupyx as cpx
import nvtx
from dataclasses import dataclass

@dataclass
class PowerIterationConfig:
  dim: int = 8192
  dominance: float = 0.05
  max_steps: int = 1000
  check_frequency: int = 10
  progress: bool = True
  residual_threshold: float = 1e-10

def generate_device(cfg=PowerIterationConfig()):
  cp.random.seed(42)
  weak_lam = cp.random.random(cfg.dim - 1) * (1.0 - cfg.dominance)
  lam = cp.random.permutation(cp.concatenate((cp.asarray([1.0]), weak_lam)))
  P = cp.random.random((cfg.dim, cfg.dim))
  D = cp.diag(cp.random.permutation(lam))
  A = ((P @ D) @ cp.linalg.inv(P))
  return A

def estimate_device(A, cfg=PowerIterationConfig()):
  raise NotImplementedError("TODO: You need to implement this kernel!")

A_device = generate_device()

# Warmup to ensure modules are loaded and code is JIT compiled before timing.
estimate_device(A_device, cfg=PowerIterationConfig(progress=False))

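start = cp.cuda.get_current_stream().record()
lam_est_device = estimate_device(A_device).item()
stop = cp.cuda.get_current_stream().record()
stop.synchronize() # Make sure `stop` has completed before reading the time.

duration = cp.cuda.get_elapsed_time(start, stop) / 1e3 # Milliseconds to seconds.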

print()
print(f"GPU Execution Time: {duration:.3f} s")

Now let's make sure it works:

In [None]:
!python power_iteration__async.py

Before we profile the improved code, let's compare the execution times of both.

In [None]:
power_iteration_baseline_output   = !python power_iteration__baseline.py
power_iteration_baseline_duration = float(power_iteration_baseline_output[-1].split()[-2])
power_iteration_async_output      = !python power_iteration__async.py
power_iteration_async_duration    = float(power_iteration_async_output[-1].split()[-2])
speedup = power_iteration_baseline_duration / power_iteration_async_duration

print(f"GPU Execution Time")
print()
print(f"power_iteration_baseline: {power_iteration_baseline_duration:.3f} s")
print(f"power_iteration_async:    {power_iteration_async_duration:.3f} s")
print(f"power_iteration_async speedup over power_iteration_baseline: {speedup:.2f}")
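
The comparison above reads the duration from the last line of each program's output, so it breaks if anything is printed after that line. Here is a slightly more defensive sketch, assuming the `GPU Execution Time: <seconds> s` format printed by the scripts. `parse_gpu_time` is a hypothetical helper, and the sample lines are made up for illustration.

```python
import re

def parse_gpu_time(lines):
    """Return the seconds from the last 'GPU Execution Time: <x> s' line."""
    pattern = re.compile(r"GPU Execution Time:\s*([0-9.eE+-]+)\s*s")
    for line in reversed(lines):
        match = pattern.search(line)
        if match:
            return float(match.group(1))
    raise ValueError("no 'GPU Execution Time' line found in program output")

# Made-up output lines, for illustration only.
sample = ["step 0: residual = 1.234e+00", "", "GPU Execution Time: 2.500 s"]
print(parse_gpu_time(sample))
```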

Next, let's capture a profile report of our improved code.

In [None]:
!nsys profile --cuda-event-trace=false --capture-range=cudaProfilerApi --capture-range-end=stop --force-overwrite true -o power_iteration__async python power_iteration__async.py

Finally, let's look at the profile in Perfetto and confirm we've gotten rid of the idling.

In [None]:
!nsys export --type sqlite --quiet true --force-overwrite true power_iteration__async.nsys-rep
nsightful.display_nsys_sqlite_file_in_notebook("power_iteration__async.sqlite", title="Power Iteration - Async Event")