# Asynchrony and Power Iteration - SOLUTION

## Table of Contents
1. [Introduction and Setup](#1-Introduction-and-Setup)
   - [1.1 Environment Setup](#11-Environment-Setup)
2. [Theory: Streams and Synchronization](#2-Theory:-Streams-and-Synchronization)
3. [The Baseline Implementation](#3-The-Baseline-Implementation)
4. [Profiling the Baseline](#4-Profiling-the-Baseline)
5. [Better Visibility with NVTX](#5-Better-Visibility-with-NVTX)
6. [Implementing Asynchrony](#6-Implementing-Asynchrony)
7. [Performance Analysis](#7-Performance-Analysis)

## 1. Introduction and Setup

GPU programming is inherently asynchronous. In this exercise, we will explore the implications of this behavior when using CuPy and learn how to analyze the flow of execution using profiling tools.

We will revisit the Power Iteration algorithm. Our goal is to take a standard implementation, profile it to identify bottlenecks caused by implicit synchronization, and then optimize it using CUDA streams and asynchronous memory transfers.
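
As a refresher, here is a minimal host-only sketch of the iteration using NumPy. It mirrors the Rayleigh quotient, residual, and normalization steps used in the device code later in this notebook, but the names `power_iteration` and `A_small` are illustrative and not part of the exercise, and this sketch checks the residual on every step rather than every `check_frequency` steps.

```python
import numpy as np

def power_iteration(A, max_steps=1000, residual_threshold=1e-10):
  """Estimate the dominant eigenvalue of `A` on the host (illustrative helper)."""
  x = np.ones(A.shape[0], dtype=np.float64)
  lam = 0.0
  for _ in range(max_steps):
    y = A @ x
    lam = (x @ y) / (x @ x)            # Rayleigh quotient.
    res = np.linalg.norm(y - lam * x)  # Residual of the current estimate.
    x = y / np.linalg.norm(y)          # Normalize for next step.
    if res < residual_threshold:
      break
  return lam

# Hypothetical 2x2 symmetric matrix with eigenvalues (5 +/- sqrt(5)) / 2.
A_small = np.array([[2.0, 1.0], [1.0, 3.0]])
lam_est = power_iteration(A_small)
print(f"estimated dominant eigenvalue: {lam_est:.6f}")
```

On a matrix this small everything runs instantly on the CPU; the rest of the exercise is about what changes when the same steps are launched on a GPU.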

### 1.1 Environment Setup

First, we need to ensure the Nsight Systems profiler (nsys), Nsightful, and NVTX are installed and available.

In [None]:
import os

# Install necessary tools if running in Google Colab
if os.getenv("COLAB_RELEASE_TAG"):
  !curl -s -L -O https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_3/NsightSystems-linux-cli-public-2025.3.1.90-3582212.deb
  !sudo dpkg -i NsightSystems-linux-cli-public-2025.3.1.90-3582212.deb > /dev/null
  !pip install "nvtx" "nsightful[notebook] @ git+https://github.com/brycelelbach/nsightful.git" > /dev/null 2>&1

print("Environment setup complete.")

## 2. Theory: Streams and Synchronization

All GPU work is launched asynchronously on a stream. The work items in a stream are executed in order. If you launch `f` on a stream and later launch `g` on that same stream, then `f` will be executed before `g`. But if `f` and `g` are launched on different streams, then their execution might overlap.
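
The ordering rule can be illustrated with a host-only analogy. The sketch below is not CUDA: it assumes a single-worker `ThreadPoolExecutor` as a stand-in for a stream, since such an executor also runs submitted tasks one at a time in submission order. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_one_stream():
  """Submit `f` then `g` to the same single-worker executor (our toy stream)."""
  log = []
  with ThreadPoolExecutor(max_workers=1) as stream:
    stream.submit(log.append, "f")  # Returns immediately, like a kernel launch.
    stream.submit(log.append, "g")  # Queued behind "f" on the same stream.
  # Leaving the `with` block waits for all queued work to finish.
  return log

def run_on_two_streams():
  """Submit `f` and `g` to different toy streams; their order is not guaranteed."""
  log = []
  with ThreadPoolExecutor(max_workers=1) as s1, ThreadPoolExecutor(max_workers=1) as s2:
    s1.submit(log.append, "f")
    s2.submit(log.append, "g")
  return log

order = run_on_one_stream()
print(order)  # ['f', 'g']
```

With two toy streams, both entries still appear in the log, but either may come first, which is the analogue of work on different CUDA streams being free to overlap.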

**How CuPy handles this:**

- **Default Stream:** Unless specified, CuPy launches work on the default CUDA stream.

- **Sequential Device Execution:** By default, CuPy work executes sequentially on the GPU.

- **Asynchronous Host Execution:** From the Python (Host) perspective, the code often returns immediately after launching the GPU kernel, before the work is actually finished.

**SOLUTION:** Certain operations force the CPU to wait for the GPU to finish (implicit synchronization):
- Accessing element values from device arrays (e.g., `x[0]`, `.item()`)
- Printing device array values
- Device-to-host memory transfers with `cp.asnumpy()` (by default)
- Explicit synchronization calls

## 3. The Baseline Implementation

We will start with a baseline implementation of the Power Iteration algorithm.

**Note:** The cell below writes the code to a file named `power_iteration__baseline.py`. We do this because we must run the code through the Nsight Systems profiler via the command line.

**SOLUTION:** The baseline code below already includes NVTX annotations and `cpx.profiler.profile()` to demonstrate the proper profiling setup.

In [None]:
%%writefile power_iteration__baseline.py

import numpy as np
import cupy as cp
import cupyx as cpx
import nvtx
from dataclasses import dataclass

@dataclass
class PowerIterationConfig:
  dim: int = 8192
  dominance: float = 0.05
  max_steps: int = 1000
  check_frequency: int = 10
  progress: bool = True
  residual_threshold: float = 1e-10

def generate_device(cfg=PowerIterationConfig()):
  cp.random.seed(42)
  weak_lam = cp.random.random(cfg.dim - 1) * (1.0 - cfg.dominance)
  lam = cp.random.permutation(cp.concatenate((cp.asarray([1.0]), weak_lam)))
  P = cp.random.random((cfg.dim, cfg.dim))
  D = cp.diag(cp.random.permutation(lam))
  A = ((P @ D) @ cp.linalg.inv(P))
  return A

def estimate_device(A, cfg=PowerIterationConfig()):
  with nvtx.annotate("Setup"):
    A_gpu = cp.asarray(A) # If `A` is on the host, copy from host to device.
                          # Otherwise, does nothing.

    x = cp.ones(A_gpu.shape[0], dtype=np.float64)

  with nvtx.annotate("Loop"):
    for i in range(0, cfg.max_steps, cfg.check_frequency):
      with nvtx.annotate(f"Step {i} to {i + cfg.check_frequency}"):
        with nvtx.annotate(f"Compute & Residual {i}"):
          y = A_gpu @ x
          lam = (x @ y) / (x @ x)            # Rayleigh quotient.
          res = cp.linalg.norm(y - lam * x)
          x = y / cp.linalg.norm(y)          # Normalize for next step.

        with nvtx.annotate(f"Copy {i}"):
          res_host = cp.asnumpy(res)
          x_host = cp.asnumpy(x)

        with nvtx.annotate(f"I/O {i}"):
          if cfg.progress:
            print(f"step {i}: residual = {res_host:.3e}")

          np.savetxt(f"device_{i}.txt", x_host) # Save a checkpoint.

          if res_host < cfg.residual_threshold:
            break

        with nvtx.annotate(f"Compute {i}"):
          for _ in range(cfg.check_frequency - 1):
            y = A_gpu @ x # We have to use `A_gpu` here as well.
            x = y / cp.linalg.norm(y) # Normalize for next step.

  return cp.asnumpy((x.T @ (A_gpu @ x)) / (x.T @ x)) # Copy from device to host.

A_device = generate_device()

# Warmup to ensure modules are loaded and code is JIT compiled before timing.
estimate_device(A_device, cfg=PowerIterationConfig(progress=False))

with cpx.profiler.profile():
  start = cp.cuda.get_current_stream().record()
  lam_est_device = estimate_device(A_device).item()
  stop = cp.cuda.get_current_stream().record()

duration = cp.cuda.get_elapsed_time(start, stop) / 1e3

print()
print(f"GPU Execution Time: {duration:.3f} s")

## 4. Profiling the Baseline

Now let's profile our code by running it under the Nsight Systems `nsys` tool. The syntax is `nsys <nsys flags> <your program> <your program args>`. `nsys` runs your program while collecting a bird's-eye view of everything that happens in it.

In [None]:
!sudo nsys profile --cuda-event-trace=false --capture-range=cudaProfilerApi --capture-range-end=stop --force-overwrite true -o power_iteration__baseline python power_iteration__baseline.py

Now let's view our report and explore what's going on in our program.

Run the next cell, which exports the report and creates a button that, when clicked, opens it in Perfetto, a web-based, no-install visual profiler.

**EXTRA CREDIT:** Download the Nsight Systems GUI and open the report in it to see even more information.

In [None]:
import nsightful

!nsys export --type sqlite --quiet true --force-overwrite true power_iteration__baseline.nsys-rep
nsightful.display_nsys_sqlite_file_in_notebook("power_iteration__baseline.sqlite", title="Power Iteration - Baseline")

## 5. Better Visibility with NVTX

Nsight Systems shows us a lot of information; sometimes it's too much, and not all of it is relevant.

There are two ways that we can filter and annotate what we see in Nsight Systems.

The first is to limit when we start and stop profiling in the program. In Python, we can do this with `cupyx.profiler.profile()`, which gives us a Python context manager. Any CUDA work launched within its scope will be included in the profile.

```
not_in_the_profile()
with cpx.profiler.profile():
  in_the_profile()
not_in_the_profile()
```

For this to work, we have to pass `--capture-range=cudaProfilerApi --capture-range-end=stop` as flags to `nsys`.

We can also annotate specific regions of our code, which will show up in the profiler. We can even add categories, domains, and colors to these regions, and they can be nested. To add these annotations, we use `nvtx.annotate()`, another Python context manager, this time from a library called NVTX.

```
with nvtx.annotate("Loop"):
  for i in range(20):
    with nvtx.annotate(f"Step {i}"):
      pass
```

**SOLUTION:** The baseline code above already includes:

- `nvtx.annotate()` regions for "Setup", "Loop", "Step", "Compute & Residual", "Copy", "I/O", and "Compute" phases.

- A `cpx.profiler.profile()` context manager around the timed section.

- The `--capture-range=cudaProfilerApi --capture-range-end=stop` flags, which are passed to the `nsys` command rather than written in the code.

From our profile trace, we can see that the CPU and GPU each spend time idle, waiting on the other. The device is idle during every I/O step, while the host prints the residual and writes the checkpoint, and host code spends a long time synchronizing on `cudaMemcpyAsync`.

Here's what happens at the start of each I/O step:

- We copy from device to host, which synchronizes with any outstanding work on the device. This blocks the host for a while.
- After that synchronous transfer has completed, we begin the I/O (printing and writing the checkpoint). During this time, the device is idle.
- After the I/O has completed on the host, we start launching the next set of iterations.

This is inefficient; we can do better by overlapping compute and I/O:

- First, host code asynchronously initiates the device-to-host copies.
- Then, host code asynchronously launches the next set of compute steps on the device.
- Next, host code synchronizes with the asynchronous copies we started.
- Finally, the host performs the I/O while the device performs the next set of compute steps.
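
To see why this ordering helps, here is a deliberately simplified cost model of one checkpoint step. It assumes the host can only overlap the I/O with the remaining compute iterations, ignores launch overhead, and uses made-up durations purely for illustration; the function names are hypothetical and the numbers are not measurements.

```python
def step_time_baseline(first_compute, copy, io, rest_compute):
  """Everything runs back to back: compute, copy, host I/O, then more compute."""
  return first_compute + copy + io + rest_compute

def step_time_overlapped(first_compute, copy, io, rest_compute):
  """Host I/O and the remaining device compute run at the same time."""
  return first_compute + copy + max(io, rest_compute)

# Hypothetical durations in milliseconds (illustrative only).
durations = dict(first_compute=1.0, copy=0.5, io=4.0, rest_compute=9.0)

baseline_ms = step_time_baseline(**durations)
overlapped_ms = step_time_overlapped(**durations)
print(baseline_ms, overlapped_ms)  # 14.5 10.5
```

In this model the benefit is capped by the shorter of the two overlapped phases: once the I/O is fully hidden behind compute (or the other way around), further overlap cannot shorten the step.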

Everything is still going to run on one stream, but we want to be able to synchronize with just the copies, which are enqueued on the stream before the next set of compute work. We'll use a CUDA event, which we will record on the stream right after the copies. Then, we can synchronize with the event later, waiting for the copies but not the compute that was launched after them!
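
The idea of waiting on one point in a stream, rather than on the whole stream, can be sketched without a GPU. The analogy below is an assumption-laden toy, not CUDA: a single-worker `ThreadPoolExecutor` plays the stream, and the `Future` returned for the copy plays the recorded event. The compute task is artificially held open until the host I/O finishes so that the overlap is deterministic in this demo.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def overlapped_step():
  timeline = []
  io_finished = threading.Event()

  def copy():
    timeline.append("copy done")

  def compute():
    # Stand-in for a long-running batch of kernels: keep "running" until the
    # host has finished its I/O (with a timeout so the demo cannot hang).
    io_finished.wait(timeout=5.0)
    timeline.append("compute done")

  with ThreadPoolExecutor(max_workers=1) as stream:
    copy_event = stream.submit(copy)       # Like the async copy plus event record.
    compute_done = stream.submit(compute)  # Queued behind the copy on the stream.

    copy_event.result()                    # Wait for the copy only, not the compute.
    timeline.append("host I/O done")       # Host I/O overlaps with the compute.
    io_finished.set()

    compute_done.result()                  # Finally wait for the compute.
  return timeline

timeline = overlapped_step()
print(timeline)  # ['copy done', 'host I/O done', 'compute done']
```

Waiting on `compute_done` instead of `copy_event` before the I/O would correspond to synchronizing the entire stream, which is exactly the stall we are trying to remove.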

## 6. Implementing Asynchrony

Remember what we've learned about streams and how to use them with CuPy:

- By default, all CuPy operations within a single thread run on the same stream. You can access this stream with `cp.cuda.get_current_stream()`.

- You can create a new stream with `cp.cuda.Stream(non_blocking=True)`. Use a `with` statement to make the stream current for all CuPy operations within the block.

- You can record an event on a stream by calling `.record()` on it.

- You can synchronize on an event (or an entire stream) by calling `.synchronize()` on it.

- Memory transfers will block by default. You can launch them asynchronously with `cp.asarray(..., blocking=False)` (for host to device transfers) and `cp.asnumpy(..., blocking=False)` (for device to host transfers).

**SOLUTION:** The implementation below uses asynchronous memory transfers and CUDA events to overlap compute and I/O operations.

In [None]:
%%writefile power_iteration__async.py

import numpy as np
import cupy as cp
import cupyx as cpx
import nvtx
from dataclasses import dataclass

@dataclass
class PowerIterationConfig:
  dim: int = 8192
  dominance: float = 0.05
  max_steps: int = 1000
  check_frequency: int = 10
  progress: bool = True
  residual_threshold: float = 1e-10

def generate_device(cfg=PowerIterationConfig()):
  cp.random.seed(42)
  weak_lam = cp.random.random(cfg.dim - 1) * (1.0 - cfg.dominance)
  lam = cp.random.permutation(cp.concatenate((cp.asarray([1.0]), weak_lam)))
  P = cp.random.random((cfg.dim, cfg.dim))
  D = cp.diag(cp.random.permutation(lam))
  A = ((P @ D) @ cp.linalg.inv(P))
  return A

def estimate_device(A, cfg=PowerIterationConfig()):
  with nvtx.annotate("Setup"):
    A_gpu = cp.asarray(A) # If `A` is on the host, copy from host to device.
                          # Otherwise, does nothing.

    x = cp.ones(A_gpu.shape[0], dtype=np.float64)

  with nvtx.annotate("Loop"):
    for i in range(0, cfg.max_steps, cfg.check_frequency):
      with nvtx.annotate(f"Step {i} to {i + cfg.check_frequency}"):
        with nvtx.annotate(f"Compute & Residual {i}"):
          y = A_gpu @ x
          lam = (x @ y) / (x @ x)            # Rayleigh quotient.
          res = cp.linalg.norm(y - lam * x)
          x = y / cp.linalg.norm(y)          # Normalize for next step.

        with nvtx.annotate(f"Copy {i}"):
          res_host = cp.asnumpy(res, blocking=False)
          x_host = cp.asnumpy(x, blocking=False)
          copy_event = cp.cuda.get_current_stream().record()

        with nvtx.annotate(f"Compute {i}"):
          for _ in range(cfg.check_frequency - 1):
            y = A_gpu @ x # We have to use `A_gpu` here as well.
            x = y / cp.linalg.norm(y) # Normalize for next step.

        with nvtx.annotate(f"I/O {i}", payload=i):
          copy_event.synchronize() # Wait for the copies to complete.

          if cfg.progress:
            print(f"step {i}: residual = {res_host:.3e}")

          np.savetxt(f"device_{i}.txt", x_host) # Save a checkpoint.

          if res_host < cfg.residual_threshold:
            break

  return cp.asnumpy((x.T @ (A_gpu @ x)) / (x.T @ x)) # Copy from device to host.

A_device = generate_device()

# Warmup to ensure modules are loaded and code is JIT compiled before timing.
estimate_device(A_device, cfg=PowerIterationConfig(progress=False))

with cpx.profiler.profile():
  start = cp.cuda.get_current_stream().record()
  lam_est_device = estimate_device(A_device).item()
  stop = cp.cuda.get_current_stream().record()

duration = cp.cuda.get_elapsed_time(start, stop) / 1e3

print()
print(f"GPU Execution Time: {duration:.3f} s")

Now let's make sure it works:

In [None]:
!python power_iteration__async.py

## 7. Performance Analysis

Before we profile the improved code, let's compare the execution times of both.

In [None]:
power_iteration_baseline_output   = !python power_iteration__baseline.py
power_iteration_baseline_duration = float(power_iteration_baseline_output[-1].split()[-2])
power_iteration_async_output      = !python power_iteration__async.py
power_iteration_async_duration    = float(power_iteration_async_output[-1].split()[-2])
speedup = power_iteration_baseline_duration / power_iteration_async_duration

print("GPU Execution Time")
print()
print(f"power_iteration_baseline: {power_iteration_baseline_duration:.3f} s")
print(f"power_iteration_async:    {power_iteration_async_duration:.3f} s")
print(f"power_iteration_async speedup over power_iteration_baseline: {speedup:.2f}")
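
The parsing above relies on the last line of each script's output having the form `GPU Execution Time: <seconds> s`, so that `.split()[-2]` picks out the number. If the output format might change, a slightly more defensive variant could look like the following sketch; `parse_duration_seconds` is a hypothetical helper, not part of the exercise.

```python
import re

def parse_duration_seconds(line):
  """Extract the duration from a line like 'GPU Execution Time: 1.234 s'."""
  match = re.search(r"GPU Execution Time:\s*([0-9]+(?:\.[0-9]+)?)\s*s\b", line)
  if match is None:
    raise ValueError(f"no execution time found in: {line!r}")
  return float(match.group(1))

example = parse_duration_seconds("GPU Execution Time: 1.234 s")
print(example)  # 1.234
```

Raising an error on an unexpected line makes a failed run visible immediately, instead of silently computing a speedup from the wrong token.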

Next, let's capture a profile report of our improved code.

In [None]:
!sudo nsys profile --cuda-event-trace=false --capture-range=cudaProfilerApi --capture-range-end=stop --force-overwrite true -o power_iteration__async python power_iteration__async.py

Finally, let's look at the profile in Perfetto and confirm we've gotten rid of the idling.

In [None]:
!nsys export --type sqlite --quiet true --force-overwrite true power_iteration__async.nsys-rep
nsightful.display_nsys_sqlite_file_in_notebook("power_iteration__async.sqlite", title="Power Iteration - Async")