In [None]:
# SPDX-License-Identifier: Apache-2.0 AND CC-BY-NC-4.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<img src="./images/nvmath_head_panel@0.5x.png" alt="nvmath-python" />

# Getting Started with nvmath-python: Memory and Execution Spaces

## Overview

This notebook explores the concepts of *memory spaces* and *execution spaces* in **nvmath-python**. Understanding these concepts is crucial for optimizing performance and avoiding unnecessary data transfer overhead between CPU and GPU.

**Learning Objectives:**
* Understand the difference between memory space and execution space
* Identify when **nvmath-python** performs automatic data transfers between memory spaces
* Use **nvmath-python**'s logging mechanism to diagnose implicit data transfers
* Benchmark the performance impact of data residing in different memory spaces
* Apply memory space concepts to generic APIs like FFT
* Estimate and minimize CPU-GPU data transfer overhead

---
## Introduction

In heterogeneous computing environments with both CPUs and GPUs, understanding where data resides and where computation occurs is essential for achieving optimal performance. The *memory space* refers to where data is stored (CPU RAM or GPU memory), while the *execution space* refers to where computations are performed (CPU or GPU). These two spaces are not necessarily the same, and mismatches between them can lead to implicit data transfers that significantly impact performance.

**nvmath-python** is backed by both GPU libraries (such as [cuBLAS](https://developer.nvidia.com/cublas) and [cuFFT](https://developer.nvidia.com/cufft)) and CPU libraries (NVPL for NVIDIA Grace and Arm v8 CPUs, and Intel MKL for x86 hosts). This makes it easy to migrate code between CPU and GPU and to implement hybrid workflows that combine CPU and GPU execution.

This notebook demonstrates how **nvmath-python** handles different memory and execution space configurations, and provides tools to diagnose and optimize data transfer patterns.

**Prerequisites:** To use this notebook, you will need:
- A computer equipped with an NVIDIA GPU
- An environment with properly installed Python libraries as described in [01_kernel_fusion.ipynb](./01_kernel_fusion.ipynb)
- Completion of the previous notebook on kernel fusion (recommended)
- Understanding of basic GPU computing concepts

---
## Setup

This notebook requires the same libraries as the previous notebook [01_kernel_fusion.ipynb](./01_kernel_fusion.ipynb). If you already completed it, no further installation is required for this notebook.

For detailed installation instructions, please refer to the [nvmath-python documentation](https://docs.nvidia.com/cuda/nvmath-python/latest/installation.html#install-nvmath-python).

---
## Memory and Execution Spaces

The *memory space* is the memory dedicated to storing input data and results. It is tied to a specific device (or host) and is allocated and released through the respective device/host API calls. The *execution space* is where the data is actually processed.

**Memory and execution spaces are not necessarily the same.** This is important to remember because data transfer between memory spaces often incurs non-negligible costs. These costs may be high not only in the case of data movement between a host CPU and a GPU device, but also between two GPU devices.
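As a rough mental model (not **nvmath-python**'s implementation), a copy is needed whenever the operand's memory space differs from the execution space, including the case of two different GPUs. The sketch below uses a hypothetical `Space` label and a hypothetical `needs_transfer` helper purely for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Space:
    """Hypothetical label for a memory or execution space (illustration only)."""

    kind: str  # "cpu" or "cuda"
    device_id: int = 0  # only meaningful when kind == "cuda"


def needs_transfer(memory_space: Space, execution_space: Space) -> bool:
    """Return True if operands must be copied before execution can start."""
    if memory_space.kind != execution_space.kind:
        return True  # host <-> device copy
    # Same kind: two different GPUs still require a device-to-device copy
    return memory_space.kind == "cuda" and memory_space.device_id != execution_space.device_id


cpu = Space("cpu")
gpu0 = Space("cuda", 0)
gpu1 = Space("cuda", 1)

print(needs_transfer(gpu0, gpu0))  # False: data already lives where it is processed
print(needs_transfer(cpu, gpu0))  # True: host-to-device copy
print(needs_transfer(gpu0, gpu1))  # True: device-to-device copy
```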

Let's examine an example to understand how **nvmath-python** handles different memory and execution space combinations:

In [None]:
import cupy as cp
import numpy as np
import nvmath

m, n, k = 8000, 2000, 4000
a_cpu = np.random.randn(m, k).astype(np.float32)
b_cpu = np.random.randn(k, n).astype(np.float32)

a_gpu = cp.random.randn(m, k, dtype=cp.float32)
b_gpu = cp.random.randn(k, n, dtype=cp.float32)

d_cpu = nvmath.linalg.advanced.matmul(a_cpu, b_cpu)
d_gpu = nvmath.linalg.advanced.matmul(a_gpu, b_gpu)
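# Use print() so that both types are shown (a notebook cell only displays its last expression)
print(type(d_cpu))  # numpy.ndarray: the result is returned in the inputs' memory space (CPU)
print(type(d_gpu))  # cupy.ndarray: the result stays in GPU memory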

### Benchmarking Helper Function

We will use the same benchmarking helper function from the previous notebook:

In [None]:
import numpy as np
import cupyx as cpx


# Helper function to benchmark two implementations F and (optionally) F_alternative
# When F_alternative is provided, in addition to raw performance numbers (seconds)
# speedup of F relative to F_alternative is reported
def benchmark(
    F, F_name="Implementation", F_alternative=None, F_alternative_name="Alternative implementation", n_repeat=10, n_warmup=1
):
    timing = cpx.profiler.benchmark(F, n_repeat=n_repeat, n_warmup=n_warmup)  # warm-up + repeated runs
    perf = np.min(timing.gpu_times)  # best time from repeated runs
    print(f"{F_name} performance = {perf:0.4f} sec")

    if F_alternative is not None:
        timing_alt = cpx.profiler.benchmark(F_alternative, n_repeat=n_repeat, n_warmup=n_warmup)
        perf_alt = np.min(timing_alt.gpu_times)
        print(f"{F_alternative_name} performance = {perf_alt:0.4f} sec")
        print(f"Speedup = {perf_alt / perf:0.4f}x")
    else:
        perf_alt = None

    return perf, perf_alt

### Performance Comparison: GPU vs. CPU Memory Spaces

Now let's benchmark the performance difference when inputs are in GPU memory versus CPU memory:


In [None]:
benchmark(
    lambda: nvmath.linalg.advanced.matmul(a_gpu, b_gpu),
    F_name="Matmul with GPU inputs",
    F_alternative=lambda: nvmath.linalg.advanced.matmul(a_cpu, b_cpu),
    F_alternative_name="Matmul with CPU inputs",
)

### Understanding the Performance Difference

The difference is noticeable, but where does the cost come from? The answer lies in understanding *specialized APIs* versus *generic APIs*.

`nvmath.linalg.advanced.matmul` belongs to the category of *specialized APIs*. In contrast to *generic APIs* such as `nvmath.fft.fft`, specialized APIs serve very specific needs at the cost of generality. Specifically, `nvmath.linalg.advanced.matmul` supports **GPU execution space only**.

When `nvmath.linalg.advanced.matmul` receives CPU tensor inputs, it:
1. Copies the inputs from CPU memory to GPU memory (the execution space)
2. Performs the operation on the GPU
3. Copies the result back from GPU memory to CPU memory (the original memory space)
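The amount of data moved in steps 1 and 3 follows directly from the operand shapes used in this notebook. The sketch below computes it and converts it into an idealized transfer time. The bandwidth value `ASSUMED_BANDWIDTH_GBPS` is a placeholder assumption, not a measurement, and `transfer_seconds` is a hypothetical helper; substitute a figure appropriate for your system.

```python
# Shapes used by the matmul example in this notebook
m, n, k = 8000, 2000, 4000
itemsize = 4  # bytes per float32 element

# Step 1: both inputs travel host -> device
h2d_bytes = (m * k + k * n) * itemsize
# Step 3: the result travels device -> host
d2h_bytes = (m * n) * itemsize
total_bytes = h2d_bytes + d2h_bytes

# ASSUMPTION: placeholder effective host<->device bandwidth in GB/s (not measured)
ASSUMED_BANDWIDTH_GBPS = 10.0


def transfer_seconds(num_bytes: int, bandwidth_gbps: float) -> float:
    """Idealized transfer time: bytes / bandwidth, ignoring latency and allocation."""
    return num_bytes / (bandwidth_gbps * 1e9)


print(f"Host -> device: {h2d_bytes / 1e6:.0f} MB")
print(f"Device -> host: {d2h_bytes / 1e6:.0f} MB")
estimate_ms = transfer_seconds(total_bytes, ASSUMED_BANDWIDTH_GBPS) * 1e3
print(f"Idealized transfer time at {ASSUMED_BANDWIDTH_GBPS} GB/s: {estimate_ms:.1f} ms")
```

Such an estimate is only a lower bound on the real overhead, since it ignores latency, memory allocation, and whether the host memory is pinned.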

The next example illustrates what is happening under the hood of **nvmath-python** through the library's logging mechanism.

In [None]:
import logging

# Configure root logger to show info messages from nvmath and its internals
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)-8s %(message)s", force=True)
logging.disable(logging.NOTSET)  # ensure logging is enabled

# Run matmul with GPU inputs (execution space == GPU)
logging.info("******************************************************")
logging.info("********************* GPU INPUTS *********************")
logging.info("******************************************************")
d_gpu = nvmath.linalg.advanced.matmul(a_gpu, b_gpu)
print("d_gpu type:", type(d_gpu))

# Run matmul with CPU inputs (this will cause nvmath to copy to execution space internally)
logging.info("******************************************************")
logging.info("********************* CPU INPUTS *********************")
logging.info("******************************************************")
d_cpu = nvmath.linalg.advanced.matmul(a_cpu, b_cpu)
print("d_cpu type:", type(d_cpu))

### Interpreting the Logging Output

In the case of GPU inputs, the `= SPECIFICATION PHASE =` section reports:

```
The input operands' memory space is cuda, and the execution space is on device 0.
```

In the case of CPU inputs, the report is different:

```
The input operands' memory space is cpu, and the execution space is on device 0.
```

This mismatch between memory space (CPU) and execution space (GPU) causes the implicit data transfers we observed. Such significant overhead cannot be ignored. The **nvmath-python** logging mechanism is a great tool to understand potential costs and refactor the code accordingly to minimize the impact.
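When the log is long, it can help to collect the records programmatically and keep only the lines that report the memory and execution spaces. The following is a minimal sketch using only the standard `logging` module. `ListHandler`, `space_reports`, and the logger name `space_demo` are hypothetical, and the two messages emitted at the end are the ones quoted above, replayed by hand so that the snippet runs without a GPU.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical helper: stores the text of each log record in a list."""

    def __init__(self):
        super().__init__(level=logging.INFO)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def space_reports(messages):
    """Keep only the messages that mention both memory and execution space."""
    return [msg for msg in messages if "memory space" in msg and "execution space" in msg]


handler = ListHandler()
demo_logger = logging.getLogger("space_demo")  # hypothetical logger name
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(handler)

# Replay the two messages quoted above (in the notebook they come from nvmath-python)
demo_logger.info("The input operands' memory space is cuda, and the execution space is on device 0.")
demo_logger.info("The input operands' memory space is cpu, and the execution space is on device 0.")
demo_logger.info("An unrelated message")

for line in space_reports(handler.messages):
    print(line)

demo_logger.removeHandler(handler)
```

To capture real output, the handler could be attached to the root logger with `logging.getLogger().addHandler(handler)` before calling the **nvmath-python** function, assuming the library logs through the root logger as configured earlier in this notebook.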

## Exercise: Estimate CPU-GPU Data Transfer Overhead

The example above shows a non-negligible performance difference between data residing in the CPU memory space and data residing in the GPU memory space. Given that the execution space is always the GPU, estimate the data transfer cost. Then implement a dedicated benchmark as a cross-check.

**Hint**: You can use CuPy to explicitly time the data transfer between CPU and GPU using `cp.asarray()` and `cp.asnumpy()` operations.

In [None]:
## Exercise: Estimate CPU-GPU data transfer overhead

import nvmath
import numpy as np
import cupy as cp
from cupyx.profiler import benchmark  # note: rebinds the `benchmark` helper defined earlier in this notebook

# Define matmul parameters
m, n, k = 8000, 2000, 4000
a_cpu = np.random.randn(m, k).astype(np.float32)
b_cpu = np.random.randn(k, n).astype(np.float32)

a_gpu = cp.random.randn(m, k, dtype=cp.float32)
b_gpu = cp.random.randn(k, n, dtype=cp.float32)

# Benchmark 1: Measure time for CPU -> GPU -> matmul -> CPU
# TODO: Implement this benchmark and print the result

# Benchmark 2: Measure time for GPU -> matmul -> GPU
# TODO: Implement this benchmark and print the result

# Based on above results print estimated time for CPU -> GPU -> CPU transfer
# TODO: Compute the estimate (Benchmark 1 - Benchmark 2) and print the result

# Benchmark 3: Measure time for CPU -> GPU transfer
# TODO: Implement this benchmark and print the result

# Benchmark 4: Measure time for GPU -> CPU transfer
# TODO: Implement this benchmark and print the result

# Based on above results print estimated time for CPU -> GPU -> CPU transfer
# TODO: Compute the estimate (Benchmark 3 + Benchmark 4) and print the result

# Benchmark 5: Measure time for the combined CPU -> GPU -> CPU round trip
# TODO: Implement this benchmark and print the result

# Perform cross-checks for different ways to estimate the transfer costs.
# Are the results consistent?
# Note how much time is spent in data transfers vs. actual computation.
# Can data transfer overheads be ignored?

---
## Generic APIs: Fast Fourier Transform (FFT)

Next, let us illustrate the data flow in the case of the library's *Fast Fourier Transform* (FFT), which is a *generic API*:

In [None]:
N = 10000
e_cpu = (np.random.randn(N) + 1j * np.random.randn(N)).astype(np.complex64)
e_gpu = cp.array(e_cpu)  # move NumPy data to GPU as CuPy array (complex64)

# Compute the FFT with nvmath; as a generic API, it executes on the CPU for CPU inputs
logging.info("******************************************************")
logging.info("********************* CPU INPUTS *********************")
logging.info("******************************************************")
r_cpu = nvmath.fft.fft(e_cpu)
print("r_cpu type:", type(r_cpu), getattr(r_cpu, "dtype", None), getattr(r_cpu, "shape", None))

logging.info("******************************************************")
logging.info("********************* GPU INPUTS *********************")
logging.info("******************************************************")
r_gpu = nvmath.fft.fft(e_gpu)
print("r_gpu type:", type(r_gpu), getattr(r_gpu, "dtype", None), getattr(r_gpu, "shape", None))

### Generic APIs Adapt to Input Location

Note that when the input operand resides in CPU memory, the library chooses the CPU as the execution space. This is possible because FFT is a *generic API* that behaves consistently on CPU and GPU. For GPU inputs, the library selects the GPU as the execution space.

This is a key difference from specialized APIs: **generic APIs automatically match the execution space to the memory space of the inputs**, avoiding unnecessary data transfers.
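The rule can be summarized in a few lines. This is a simplified model of the behavior described in this notebook, not the library's actual dispatch code. `memory_space_of` and `execution_space_for` are hypothetical names, the stand-in classes replace real NumPy and CuPy arrays so that the sketch runs without a GPU, and GPU data is recognized here by the `__cuda_array_interface__` attribute that CuPy arrays expose.

```python
def memory_space_of(array) -> str:
    """Classify an operand as living in 'cuda' or 'cpu' memory (simplified)."""
    # ASSUMPTION: anything exposing the CUDA Array Interface is treated as GPU data
    return "cuda" if hasattr(array, "__cuda_array_interface__") else "cpu"


def execution_space_for(array, api_kind: str) -> str:
    """Model: generic APIs follow the input; GPU-only APIs always run on the GPU."""
    if api_kind == "generic":
        return memory_space_of(array)
    if api_kind == "gpu_only":
        return "cuda"
    raise ValueError(f"unknown api_kind: {api_kind!r}")


class FakeCpuArray:
    """Stand-in for a host array such as numpy.ndarray."""


class FakeGpuArray:
    """Stand-in for a device array such as cupy.ndarray."""

    __cuda_array_interface__ = {"version": 3}


for name, operand in [("CPU input", FakeCpuArray()), ("GPU input", FakeGpuArray())]:
    for kind in ("generic", "gpu_only"):
        mem = memory_space_of(operand)
        exe = execution_space_for(operand, kind)
        print(f"{name}, {kind} API: memory={mem}, execution={exe}, transfer={mem != exe}")
```

In this model, the only combination that triggers a transfer is a CPU input passed to a GPU-only API, which matches the matmul case benchmarked earlier.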

**Key Takeaways:**

- Memory space (where data is stored) and execution space (where computation happens) may differ, leading to data transfer costs.
- Some specialized APIs, such as `nvmath.linalg.advanced.matmul`, only support GPU execution, automatically transferring CPU data to GPU with associated overhead.
- Generic APIs like `nvmath.fft.fft` adapt to input location: CPU inputs execute on CPU, GPU inputs execute on GPU.
- Use **nvmath-python**'s logging mechanism to understand internal operations and identify potential bottlenecks.

---
## Conclusion

In this notebook, we explored the critical concepts of memory spaces and execution spaces in **nvmath-python**. Understanding the distinction between where data is stored and where it is processed is essential for optimizing heterogeneous computing workflows.

**Key Takeaways:**
- Memory space and execution space are distinct concepts that may not always align, potentially causing implicit data transfers.
- Specialized APIs (like `nvmath.linalg.advanced.matmul`) only support GPU execution, requiring data transfers when CPU inputs are provided.
- Generic APIs (like `nvmath.fft.fft`) adapt their execution space to match the input memory space, avoiding unnecessary transfers.
- The **nvmath-python** logging mechanism provides valuable insights into internal operations and helps identify performance bottlenecks.
- Data transfer between CPU and GPU can significantly impact performance and should be minimized in production code.

**Next Steps:**
- Learn about stateful APIs and autotuning in the next notebook: [03_stateful_api.ipynb](03_stateful_api.ipynb)
- Explore FFT callbacks: [04_callbacks.ipynb](04_callbacks.ipynb)
- Discover device APIs: [05_device_api.ipynb](05_device_api.ipynb)

---
## References

- NVIDIA nvmath-python documentation, "API Reference," https://docs.nvidia.com/cuda/nvmath-python/, Accessed: October 23, 2025.
- NVIDIA, "cuBLAS Library User Guide," https://docs.nvidia.com/cuda/cublas/, Accessed: October 23, 2025.
- NVIDIA, "cuFFT Library User Guide," https://docs.nvidia.com/cuda/cufft/, Accessed: October 23, 2025.
- NVIDIA, "CUDA C++ Programming Guide - Heterogeneous Programming," https://docs.nvidia.com/cuda/cuda-c-programming-guide/, Accessed: October 23, 2025.
- Williams, Samuel, et al., "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Communications of the ACM, 52(4), 65-76, 2009.