In [None]:
# SPDX-License-Identifier: Apache-2.0 AND CC-BY-NC-4.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<img src="./images/nvmath_head_panel@0.5x.png" alt="nvmath-python" />

# Getting Started with nvmath-python: FFT Callbacks

## Overview

This notebook introduces FFT callbacks in **nvmath-python**, which allow custom Python functions to be just-in-time (JIT) compiled and fused with FFT operations. This advanced feature enables significant performance improvements by combining multiple operations into a single kernel.

**Learning Objectives:**
* Understand what FFT callbacks are and how they work
* Write custom epilog functions for FFT operations
* Use `nvmath.fft.compile_epilog` to JIT-compile callback functions
* Apply FFT callbacks to a real-world image processing problem (Gaussian filtering)
* Benchmark performance improvements from kernel fusion with callbacks
* Use stateful FFT APIs to amortize JIT compilation costs across multiple executions
* Analyze the cost breakdown of compilation, planning, and execution phases

---
## Introduction

FFT callbacks are a powerful feature of **nvmath-python** that enables kernel fusion for FFT operations. Callbacks are custom Python functions that are *just-in-time (JIT)* compiled into *intermediate representation (LTO-IR)* and provided as *prolog* or *epilog* arguments to FFT functions. This allows you to fuse custom operations with FFT kernels, dramatically improving performance by:

1. **Eliminating intermediate memory allocations**: Data doesn't need to be written to memory between FFT and the custom operation
2. **Reducing kernel launch overhead**: Multiple operations are combined into a single kernel
3. **Improving arithmetic intensity**: The fused kernel can better utilize GPU resources
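
A rough, back-of-the-envelope way to see the first benefit is to count global-memory traffic for the spectrum. The sketch below assumes a hypothetical 1024×1024 `float32` image, a `complex64` R2C output, an out-of-place multiply in the unfused case, and no caching effects. The numbers are illustrative estimates under those assumptions, not measurements.

```python
# Hypothetical image size, chosen only for illustration.
height, width = 1024, 1024
bytes_per_complex64 = 8

# An R2C FFT of a (height, width) real image yields (height, width // 2 + 1) complex values.
spectrum_bytes = height * (width // 2 + 1) * bytes_per_complex64

# Unfused: the FFT writes the spectrum, then a separate multiply kernel
# reads the spectrum and the filter and writes the product.
unfused_traffic = spectrum_bytes + (2 * spectrum_bytes + spectrum_bytes)

# Fused epilog (under this simplified model): the filter is read and the
# product is written once; the raw spectrum is never written out and read back.
fused_traffic = spectrum_bytes + spectrum_bytes

saved = unfused_traffic - fused_traffic
print(f"Spectrum size:   {spectrum_bytes / 2**20:.2f} MiB")
print(f"Unfused traffic: {unfused_traffic / 2**20:.2f} MiB")
print(f"Fused traffic:   {fused_traffic / 2**20:.2f} MiB")
print(f"Saved:           {saved / 2**20:.2f} MiB per image")
```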

This notebook demonstrates FFT callbacks through a practical image processing example: implementing a Gaussian blur filter using FFT convolution.

**Prerequisites:** To use this notebook, you will need:
- A computer equipped with an NVIDIA GPU
- An environment with properly installed Python libraries
- Completion of previous notebooks (recommended)
- Understanding of FFT concepts and image processing basics

For detailed installation instructions, please refer to the [nvmath-python documentation](https://docs.nvidia.com/cuda/nvmath-python/latest/installation.html#install-nvmath-python).

---

## Setup

This notebook uses the same benchmarking helper function from previous notebooks:

In [None]:
import numpy as np
import cupyx as cpx


# Helper function to benchmark two implementations F and (optionally) F_alternative
# When F_alternative is provided, in addition to raw performance numbers (seconds)
# speedup of F relative to F_alternative is reported
def benchmark(
    F, F_name="Implementation", F_alternative=None, F_alternative_name="Alternative implementation", n_repeat=10, n_warmup=1
):
    timing = cpx.profiler.benchmark(F, n_repeat=n_repeat, n_warmup=n_warmup)  # warm-up + repeated runs
    perf = np.min(timing.gpu_times)  # best time from repeated runs
    print(f"{F_name} performance = {perf:0.4f} sec")

    if F_alternative is not None:
        timing_alt = cpx.profiler.benchmark(F_alternative, n_repeat=n_repeat, n_warmup=n_warmup)
        perf_alt = np.min(timing_alt.gpu_times)
        print(f"{F_alternative_name} performance = {perf_alt:0.4f} sec")
        print(f"Speedup = {perf_alt / perf:0.4f}x")
    else:
        perf_alt = None

    return perf, perf_alt

It also relies on a few additional Python libraries that must be installed separately:

```bash
pip install scipy
pip install matplotlib
```

---
## Image Processing Example: Gaussian Filtering

In this notebook, we explore a typical image processing problem: applying a Gaussian filter to an image. The Gaussian filter is widely used for image blurring and noise reduction.

### Baseline Implementation with SciPy

The following code implements a Gaussian filter using the [`scipy.ndimage`](https://docs.scipy.org/doc/scipy/reference/ndimage.html) library, which provides a convenient CPU-based implementation:

In [None]:
import matplotlib.pyplot as plt
from PIL import Image
from scipy.ndimage import gaussian_filter

asset_path = "./images/"
img = Image.open(asset_path + "dog.jpg").convert("L")  # Load the image and convert it to grayscale
original_image = np.array(img, dtype=np.float32) / 255.0  # Convert the image to a float array and normalize it to [0, 1]

sigma_value = 20.0  # Filter size

filtered_image_scipy = gaussian_filter(original_image, sigma=sigma_value)

# Plotting the results
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.imshow(original_image, cmap="gray")
plt.title("Original")
plt.axis("off")

plt.subplot(1, 2, 2)
plt.imshow(filtered_image_scipy, cmap="gray")
plt.title("Filtered (SciPy)")
plt.axis("off")

plt.tight_layout()
plt.show()

### GPU Implementation with CuPy

Next, we implement the Gaussian filter using CuPy's 2D forward and inverse FFT, applying the filter (frequency response) in the frequency domain.

The frequency response `h` is computed directly from the frequency grids `fx` and `fy`. It relies on the mathematical property that the Fourier transform of a Gaussian in the *spatial domain* is another Gaussian in the *frequency domain* (and vice versa).
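
This property can be checked numerically in one dimension. The sketch below is CPU-only NumPy, and the values of `n` and `sigma` are arbitrary illustration values rather than the ones used in this notebook. It samples a unit-sum Gaussian with standard deviation `sigma`, takes its DFT, and compares the result with the closed form `exp(-2 * pi**2 * sigma**2 * f**2)`, which has the same shape as the expression used for `h`.

```python
import numpy as np

n = 256      # number of samples (illustrative)
sigma = 3.0  # standard deviation in samples (illustrative)

# Integer sample positions in FFT order: 0, 1, ..., n/2 - 1, -n/2, ..., -1
x = np.fft.fftfreq(n, d=1.0 / n)

# Unit-sum Gaussian kernel centered at sample 0 (wrapped around the array ends)
kernel = np.exp(-0.5 * (x / sigma) ** 2)
kernel /= kernel.sum()

# DFT of the sampled kernel; it is real because the kernel is symmetric
response_from_fft = np.fft.fft(kernel).real

# Closed-form frequency response, with f in cycles per sample
f = np.fft.fftfreq(n)
response_analytic = np.exp(-2.0 * np.pi**2 * sigma**2 * f**2)

max_abs_error = np.max(np.abs(response_from_fft - response_analytic))
print(f"max |FFT - analytic| = {max_abs_error:.2e}")
```

The agreement relies on `sigma` being large enough that aliasing is negligible and small enough, relative to `n`, that the kernel decays before it wraps around.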

**Implementation Details:**
- The original image is loaded as a [NumPy](https://numpy.org/) array in CPU memory
- We convert the image to a [CuPy](https://cupy.dev/) array to perform computations on the GPU
- Forward FFT transforms the image to the frequency domain
- We apply the Gaussian filter by element-wise multiplication
- Inverse FFT transforms back to the spatial domain
- The result is converted back to CPU memory for visualization with [`matplotlib`](https://matplotlib.org/)
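
The same sequence of steps can be sketched on the CPU with NumPy before moving to the GPU. The random image, its size, and `sigma` below are stand-ins chosen for illustration. The checks rely only on properties of the filter itself: the zero-frequency gain is 1, so the mean brightness is preserved, and every other gain is below 1, so the contrast drops.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 48))  # stand-in for the grayscale image
sigma = 2.0                   # illustrative value

# Frequency grids: full range along axis 0, non-negative half along axis 1 (R2C layout)
fy = np.fft.fftfreq(image.shape[0])[:, None]
fx = np.fft.rfftfreq(image.shape[1])[None, :]
h = np.exp(-2.0 * np.pi**2 * sigma**2 * (fx * fx + fy * fy))

spectrum = np.fft.rfft2(image)                         # forward R2C FFT
filtered = np.fft.irfft2(spectrum * h, s=image.shape)  # multiply, then inverse C2R FFT

print("spectrum shape:", spectrum.shape)
print("mean before/after:", image.mean(), filtered.mean())
print("std before/after: ", image.std(), filtered.std())
```

Note that filtering through the FFT is a circular convolution, so the blur wraps around the image borders. `scipy.ndimage.gaussian_filter` uses `mode="reflect"` by default, so small differences from the SciPy result near the edges are expected.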

In [None]:
import nvmath  # Preload CTK libraries installed from wheels for CuPy
import cupy as cp


image_gpu = cp.asarray(original_image, dtype=cp.float32) 

fy = cp.fft.fftfreq(image_gpu.shape[0])[:, None].astype(cp.float32)
fx = cp.fft.rfftfreq(image_gpu.shape[1])[None, :].astype(cp.float32)
h = cp.exp(-2.0 * cp.pi * cp.pi * sigma_value * sigma_value * (fx * fx + fy * fy)).astype(cp.complex64)


def gaussian_filter_cupy(image, clear_cache=True):
    """
    Apply Gaussian filter using CuPy R2C/C2R FFT.
    """
    if clear_cache:
        cp.fft.config.clear_plan_cache()  # Clear CuPy FFT cache to ensure clean FFT benchmarking
    image_fft = cp.fft.rfft2(image)  # Real to complex FFT
    filtered = cp.fft.irfft2(image_fft * h)  # Complex to real FFT
    return filtered

filtered_image_cupy = gaussian_filter_cupy(image_gpu)

# Plotting the results
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.imshow(original_image, cmap="gray")
plt.title("Original")
plt.axis("off")

plt.subplot(1, 2, 2)
plt.imshow(cp.asnumpy(filtered_image_cupy), cmap="gray")
plt.title("Filtered (CuPy)")
plt.axis("off")

plt.tight_layout()
plt.show()

### GPU Implementation with nvmath-python and FFT Callbacks

Now we implement the Gaussian filter using **nvmath-python**'s forward and inverse FFTs with a fused epilog callback. 

**Key Difference from CuPy Implementation:**
- The Gaussian kernel multiplication (previously a separate operation) is now compiled as an `epilog_impl` function into intermediate representation (LTO-IR)
- This epilog is fused with the forward FFT operation
- The fused kernel improves *arithmetic intensity* by eliminating intermediate memory operations
- There is a one-time JIT compilation cost that can be amortized over multiple executions

The epilog function multiplies the FFT output with the Gaussian filter and performs normalization:

In [None]:
def gaussian_filter_nvmath(image):
    wh = image.shape[0] * image.shape[1]  # nvmath-python FFTs are unnormalized, so we normalize explicitly

    # Define epilog function for gaussian kernel multiplication
    def epilog_impl(data_out, offset, data, filter_data, unused):
        data_out[offset] = data * filter_data[offset] / wh  # Normalize by the image area

    # Compile the epilog to LTO-IR
    epilog = nvmath.fft.compile_epilog(epilog_impl, "complex64", "complex64")

    # Compute R2C FFT using nvmath with epilog to apply gaussian kernel multiplication
    image_fft = nvmath.fft.rfft(image, epilog={"ltoir": epilog, "data": h.data.ptr})

    # Inverse C2R FFT using nvmath
    filtered = nvmath.fft.irfft(image_fft)

    return filtered



image_gpu = cp.asarray(original_image, dtype=cp.float32)
filtered_image_nvmath = gaussian_filter_nvmath(image_gpu)

# Plotting the results
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.imshow(original_image, cmap="gray")
plt.title("Original")
plt.axis("off")

plt.subplot(1, 2, 2)
plt.imshow(cp.asnumpy(filtered_image_nvmath), cmap="gray")
plt.title("Filtered (nvmath-python)")
plt.axis("off")

plt.tight_layout()
plt.show()
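
To see what the epilog computes, we can emulate it on the CPU. The sketch below is a NumPy emulation, not how nvmath-python executes the callback: it calls an `epilog_impl` with the same signature once per element of the flattened R2C output, then applies an unnormalized inverse transform (`norm="forward"` in NumPy). The small random image and `sigma` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 6))  # tiny stand-in image
sigma = 1.5                 # illustrative value

fy = np.fft.fftfreq(image.shape[0])[:, None]
fx = np.fft.rfftfreq(image.shape[1])[None, :]
h = np.exp(-2.0 * np.pi**2 * sigma**2 * (fx * fx + fy * fy)).astype(np.complex128)

wh = image.shape[0] * image.shape[1]  # normalization factor (image area)


def epilog_impl(data_out, offset, data, filter_data, unused):
    # Same body as the notebook's epilog: scale one output element by the filter
    data_out[offset] = data * filter_data[offset] / wh


# Emulate the fused store: one callback invocation per output element
spectrum = np.fft.rfft2(image)
fused_out = np.empty(spectrum.size, dtype=spectrum.dtype)
filter_flat = h.ravel()
for offset, value in enumerate(spectrum.ravel()):
    epilog_impl(fused_out, offset, value, filter_flat, None)
fused_out = fused_out.reshape(spectrum.shape)

# Unnormalized inverse transform; the 1/wh factor was already applied in the epilog
filtered = np.fft.irfft2(fused_out, s=image.shape, norm="forward")

# Reference: NumPy's default (normalized) inverse of the filtered spectrum
reference = np.fft.irfft2(spectrum * h, s=image.shape)
print("max difference:", np.max(np.abs(filtered - reference)))
```

Folding the `1 / wh` factor into the epilog is a design choice: it avoids a separate scaling pass over the result after the inverse FFT.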

In [None]:
benchmark(
    lambda: gaussian_filter_nvmath(image_gpu),
    "nvmath-python API",
    lambda: gaussian_filter_cupy(image_gpu),
    "CuPy API",
)

---
## Stateful FFT APIs for Batch Processing

The **nvmath-python** library provides both *stateless* and *stateful* APIs for FFT. The stateful API lets you separate the *planning* phase (including the expensive JIT compilation of the epilog) from the *execution* phase, so the planning cost is amortized over multiple executions.

To illustrate this, let us consider the problem of applying the Gaussian filter to multiple images in a batch:

In [None]:
batch_size = 16
images_gpu = [image_gpu] * batch_size

# Display the array of identical images
plt.figure(figsize=(30, 3))

for i in range(batch_size):
    plt.subplot(1, batch_size, i + 1)
    plt.imshow(cp.asnumpy(images_gpu[i]), cmap="gray")
    plt.title(f"Dog {i + 1}")
    plt.axis("off")

plt.tight_layout()
plt.show()

### CuPy Batch Implementation

First, we create a reference implementation of the batched Gaussian filter using pure CuPy:

In [None]:
def process_batch_cupy(images_gpu):
    # Clear CuPy FFT cache to ensure clean FFT benchmarking during multiple repetitions (n_repeat > 1)
    # For fair comparison, we do not want the planning cost of the first call to be ignored
    cp.fft.config.clear_plan_cache()
    filtered_images = []
    for i in range(len(images_gpu)):
        filtered_images.append(gaussian_filter_cupy(images_gpu[i], clear_cache=False))
    return filtered_images


filtered_images = process_batch_cupy(images_gpu)

plt.figure(figsize=(30, 3))

for i in range(batch_size):
    plt.subplot(1, batch_size, i + 1)
    plt.imshow(cp.asnumpy(filtered_images[i]), cmap="gray")
    plt.title(f"Filtered {i + 1}")
    plt.axis("off")

plt.tight_layout()
plt.show()

### nvmath-python Batch Implementation with Stateful APIs

Next, we implement the same logic using **nvmath-python**'s stateful API:

**Implementation Strategy:**
- Create two FFT objects: one for the forward FFT plan (which includes a compiled epilog) and another for the inverse FFT plan
- We need two separate objects because a single object can only have one plan
- Apply the filter to each image in the batch through a series of forward and inverse FFTs
- Use the `reset_operand()` method to update operands for chained FFT operations

**Key Benefits:**
- The epilog compilation happens once
- FFT plans are created once and reused for all images
- Only execution is repeated for each image in the batch 
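
The plan-once, execute-many pattern can be illustrated with a CPU analogy. `PlannedGaussianFilter` below is a hypothetical helper written for this sketch, not an nvmath-python class: its constructor does the work that depends only on the image shape and `sigma`, and `execute()` does the per-image work.

```python
import numpy as np


class PlannedGaussianFilter:
    """Hypothetical CPU stand-in for a planned FFT object (not an nvmath-python class)."""

    setup_calls = 0  # counts how many times the one-time setup ran

    def __init__(self, shape, sigma):
        # "Planning": everything that depends only on the shape and sigma
        fy = np.fft.fftfreq(shape[0])[:, None]
        fx = np.fft.rfftfreq(shape[1])[None, :]
        self.shape = shape
        self.h = np.exp(-2.0 * np.pi**2 * sigma**2 * (fx * fx + fy * fy))
        PlannedGaussianFilter.setup_calls += 1

    def execute(self, image):
        # "Execution": the only work repeated for each image
        return np.fft.irfft2(np.fft.rfft2(image) * self.h, s=self.shape)


rng = np.random.default_rng(0)
batch = [rng.random((32, 32)) for _ in range(4)]  # illustrative batch

plan = PlannedGaussianFilter(batch[0].shape, sigma=2.0)
results = [plan.execute(image) for image in batch]
print(f"setup ran {PlannedGaussianFilter.setup_calls} time(s) for {len(results)} images")
```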

In [None]:
def process_batch_nvmath(images_gpu):
    wh = images_gpu[0].shape[0] * images_gpu[0].shape[1] # Normalization factor


    # Define epilog function for gaussian kernel multiplication
    def epilog_impl(data_out, offset, data, filter_data, unused):
        data_out[offset] = data * filter_data[offset] / wh


    # Compile the epilog to LTO-IR once
    epilog = nvmath.fft.compile_epilog(epilog_impl, "complex64", "complex64")


    def convolve_image_gpu(fft, ifft, image_gpu):
        fft.reset_operand(image_gpu)
        image_fft = fft.execute(direction=nvmath.fft.FFTDirection.FORWARD)
        ifft.reset_operand(image_fft)
        image_ifft = ifft.execute(direction=nvmath.fft.FFTDirection.INVERSE)
        return image_ifft


    image_gpu = images_gpu[0]  # Real input for R2C FFT
    image_fft = cp.empty((image_gpu.shape[0], image_gpu.shape[1] // 2 + 1), dtype=cp.complex64)

    with (
        nvmath.fft.FFT(image_gpu) as fft,
        nvmath.fft.FFT(image_fft, options={"fft_type": "C2R"}) as ifft,
    ):
        # Two plans are created, one for the forward R2C FFT with an epilog
        # and another for the inverse C2R FFT
        fft.plan(epilog={"ltoir": epilog, "data": h.data.ptr})
        ifft.plan()

        # Process each image in the batch
        filtered_images = []
        for i in range(len(images_gpu)):
            filtered_images.append(convolve_image_gpu(fft, ifft, images_gpu[i]))
    return filtered_images


# Process the batch using nvmath stateful API
filtered_images_nvmath = process_batch_nvmath(images_gpu)

plt.figure(figsize=(30, 3))

for i in range(batch_size):
    plt.subplot(1, batch_size, i + 1)
    plt.imshow(cp.asnumpy(filtered_images_nvmath[i]), cmap="gray")
    plt.title(f"Filtered {i + 1}")
    plt.axis("off")

plt.tight_layout()
plt.show()

### Performance Comparison

Now let's benchmark the performance. 

**Important Notes:**
- CuPy inherently performs FFT plan *caching*, so subsequent calls with images of the same shape and dtype avoid re-planning overhead
- With **nvmath-python**, we avoid re-planning explicitly by using *stateful* APIs
- Both approaches benefit from amortizing planning costs, but the fused epilog in **nvmath-python** provides additional performance improvements

In [None]:
# Clear CuPy FFT cache to ensure clean FFT operations
cp.fft.config.clear_plan_cache()
benchmark(
    lambda: process_batch_nvmath(images_gpu),
    "nvmath-python stateful API",
    lambda: process_batch_cupy(images_gpu),
    "CuPy API",
)

---
## Performance Cost Breakdown

Let us drill down to understand how much each phase costs in each library. For CuPy, the cost of the very first call (where FFT planning and plan caching is performed) is very different from subsequent calls (where the cached plan is reused).

### CuPy Cost Analysis

In [None]:
def process_cupy_first_call(image_gpu):
    # To emulate the cost of the first call we need to clear the cache
    cp.fft.config.clear_plan_cache()
    gaussian_filter_cupy(image_gpu, clear_cache=False)

def process_cupy_subsequent_call(image_gpu):
    # The warm-up run populates the plan cache, so the timed runs reuse the cached plan
    gaussian_filter_cupy(image_gpu, clear_cache=False)

perf_cupy_subsequent_call, perf_cupy_first_call = benchmark(
    lambda: process_cupy_subsequent_call(image_gpu),
    "CuPy subsequent calls",
    lambda: process_cupy_first_call(image_gpu),
    "CuPy first call",
)

perf_cupy_planning = perf_cupy_first_call - perf_cupy_subsequent_call
print(f"Estimated CuPy planning cost = {perf_cupy_planning:0.4f} sec")


### nvmath-python Cost Analysis

Similarly, let's break down the costs within the **nvmath-python** implementation:

In [None]:
wh = image_gpu.shape[0] * image_gpu.shape[1] # Normalization factor
# Define epilog function for gaussian kernel multiplication
def epilog_impl(data_out, offset, data, filter_data, unused):
    data_out[offset] = data * filter_data[offset] / wh  # Normalize by the image area

# Compile the epilog to LTO-IR once
def compile_epilog():
    return nvmath.fft.compile_epilog(epilog_impl, "complex64", "complex64")

def forward_fft_execute(fft, image_gpu):
    fft.reset_operand(image_gpu)
    return fft.execute(direction=nvmath.fft.FFTDirection.FORWARD)

def inverse_fft_execute(ifft, image_gpu):
    ifft.reset_operand(image_gpu)
    return ifft.execute(direction=nvmath.fft.FFTDirection.INVERSE)

def forward_fft_plan(image_gpu, epilog):
    fft = nvmath.fft.FFT(image_gpu)
    fft.plan(epilog={"ltoir": epilog, "data": h.data.ptr})
    return fft

def inverse_fft_plan(c2r_output):
    ifft = nvmath.fft.FFT(c2r_output, options={"fft_type": "C2R"})
    ifft.plan()
    return ifft

epilog = compile_epilog()
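# Despite its name, `c2r_output` is a placeholder complex operand with the R2C output shape; it is used only to plan the C2R FFT
c2r_output = cp.empty((image_gpu.shape[0], image_gpu.shape[1] // 2 + 1), dtype=cp.complex64)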
fft = forward_fft_plan(image_gpu, epilog)
ifft = inverse_fft_plan(c2r_output)
fft_image = forward_fft_execute(fft, image_gpu)
filtered_image = inverse_fft_execute(ifft, fft_image)

perf_compile_epilog = cpx.profiler.benchmark(lambda: compile_epilog(), n_repeat=5, n_warmup=1).gpu_times.min()
perf_forward_fft_plan = cpx.profiler.benchmark(lambda: forward_fft_plan(image_gpu, epilog), n_repeat=5, n_warmup=1).gpu_times.min()
perf_inverse_fft_plan = cpx.profiler.benchmark(lambda: inverse_fft_plan(c2r_output), n_repeat=5, n_warmup=1).gpu_times.min()
perf_forward_fft_execute = cpx.profiler.benchmark(lambda: forward_fft_execute(fft, image_gpu), n_repeat=5, n_warmup=1).gpu_times.min()
perf_inverse_fft_execute = cpx.profiler.benchmark(lambda: inverse_fft_execute(ifft, fft_image), n_repeat=5, n_warmup=1).gpu_times.min()

print(f"Compilation cost = {perf_compile_epilog:0.4f} sec")
print(f"Forward FFT plan cost = {perf_forward_fft_plan:0.4f} sec")
print(f"Inverse FFT plan cost = {perf_inverse_fft_plan:0.4f} sec")
print(f"Forward FFT execute cost = {perf_forward_fft_execute:0.4f} sec")
print(f"Inverse FFT execute cost = {perf_inverse_fft_execute:0.4f} sec")


In [None]:
### Cost per Image Analysis

# Plot CuPy and nvmath-python cost per image for batch sizes 1 to 16
batch_sizes = np.arange(1, 17)

# CuPy: First image includes planning + execution, subsequent images are execution only
cupy_costs_per_image = [(perf_cupy_first_call + (n - 1) * perf_cupy_subsequent_call) / n for n in batch_sizes]

# nvmath-python: First image includes compilation + planning + execution, subsequent images are execution only
nvmath_costs_per_image = [(perf_compile_epilog + perf_forward_fft_plan + perf_inverse_fft_plan) / n + perf_forward_fft_execute + perf_inverse_fft_execute for n in batch_sizes]

plt.figure(figsize=(10, 6))
plt.plot(batch_sizes, cupy_costs_per_image, marker='o', linewidth=2, markersize=8, color='darkgreen', label='CuPy')
plt.plot(batch_sizes, nvmath_costs_per_image, marker='s', linewidth=2, markersize=8, color='#76b900', label='nvmath-python')
plt.xlabel('Number of Images', fontsize=12)
plt.ylabel('Cost per Image (seconds)', fontsize=12)
plt.title('Cost per Image Comparison: CuPy vs nvmath-python\n(First image includes planning/compilation + execution, subsequent images execution only)', fontsize=13)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.xticks(batch_sizes)
plt.tight_layout()
plt.show()

# Print some statistics
print("CuPy Cost Breakdown:")
print(f"  Planning cost: {perf_cupy_planning:.4f} sec")
print(f"  Execution cost (per image): {perf_cupy_subsequent_call:.4f} sec")
print(f"  First image total: {perf_cupy_first_call:.4f} sec")

print("\nnvmath-python Cost Breakdown:")
print(f"  Compilation cost: {perf_compile_epilog:.4f} sec")
print(f"  Forward FFT plan cost: {perf_forward_fft_plan:.4f} sec")
print(f"  Inverse FFT plan cost: {perf_inverse_fft_plan:.4f} sec")
print(f"  Execution cost (per image): {perf_forward_fft_execute + perf_inverse_fft_execute:.4f} sec")
print(f"  First image total: {perf_compile_epilog + perf_forward_fft_plan + perf_inverse_fft_plan + perf_forward_fft_execute + perf_inverse_fft_execute:.4f} sec")

print("\nCost per image comparison:")
for n in [1, 4, 8, 16]:
    cupy_per_img = cupy_costs_per_image[n-1]
    nvmath_per_img = nvmath_costs_per_image[n-1]
    speedup = cupy_per_img / nvmath_per_img
    print(f"  {n:2d} images: CuPy={cupy_per_img:.4f} sec/img, nvmath={nvmath_per_img:.4f} sec/img, speedup={speedup:.2f}x")
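
Both curves in the plot follow the same model: cost per image = setup / n + execution. The sketch below works through that model with hypothetical timings (in microseconds, chosen for illustration and not measured) to find the batch size at which a higher setup cost is paid back by faster execution.

```python
import math


def cost_per_image(setup, execute, n):
    """Amortized cost per image: one-time setup spread over n images plus per-image execution."""
    return setup / n + execute


# Hypothetical timings in microseconds; replace them with your own measurements
cached_setup, cached_exec = 2_000, 1_000  # e.g. cheap planning, slower unfused execution
fused_setup, fused_exec = 50_000, 600     # e.g. JIT compilation + planning, faster fused execution

# Setting setup_a / n + exec_a = setup_b / n + exec_b and solving for n
break_even = math.ceil((fused_setup - cached_setup) / (cached_exec - fused_exec))
print(f"Break-even batch size: {break_even}")

for n in (1, break_even - 1, break_even + 1, 10 * break_even):
    a = cost_per_image(cached_setup, cached_exec, n)
    b = cost_per_image(fused_setup, fused_exec, n)
    print(f"n={n:5d}: cached={a:9.1f} us/img, fused={b:9.1f} us/img")
```

The formula only has a positive solution when the variant with the higher setup cost also has the lower execution cost; otherwise one variant is cheaper at every batch size.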


## Exercise: Applying Sepia and Gaussian blur filters to an image

In this exercise you will perform the following steps:
1. Load the color image using `PIL`.
2. Create and compile the *Sepia* filter as the R2C FFT prolog.
3. Create and compile the *Gaussian Blur* filter as the C2R FFT prolog.
4. Perform the forward R2C FFT with the compiled Sepia filter prolog.
5. Perform the inverse C2R FFT with the compiled Gaussian Blur filter prolog.
6. Display the processed image using `matplotlib`.

The following NumPy implementation serves as reference code to follow while implementing the nvmath-python variant with R2C and C2R FFT prologs:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

# 1. Load the color image using PIL
# Load the color image and convert it to a NumPy array normalized to [0, 1]
asset_path = "./images/"
original_image = np.array(Image.open(asset_path + "dog.jpg")) / 255.0


# Display the original image
plt.figure(figsize=(10, 5))

plt.imshow(original_image)
plt.title("Original Image")
plt.axis("off")

plt.tight_layout()
plt.show()

original_image.shape

Note the shape of the array: spatial data goes in dimensions 0 and 1, followed by color channels data.

The following code is a reference implementation of the Sepia filter: a color-matrix transformation applied to each pixel, with the result capped at 1.0.

In [None]:
SEPIA_FILTER_MATRIX = np.array([
    [0.393, 0.769, 0.189],
    [0.349, 0.686, 0.168],
    [0.272, 0.534, 0.131]
]).T


def sepia_filter(image):
    """
    Apply sepia filter to the image.
    Image shape: (height, width, 3)
    Returns: (height, width, 3)
    """
    # Apply filter: (height, width, 3) @ (3, 3) = (height, width, 3), broadcast over pixels
    return np.minimum(1.0, image @ SEPIA_FILTER_MATRIX)


sepia_image = sepia_filter(original_image)

# Display the sepia image
plt.figure(figsize=(10, 5))

plt.imshow(sepia_image)
plt.title("Sepia Image")
plt.axis("off")

plt.tight_layout()
plt.show()
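
The transpose in `SEPIA_FILTER_MATRIX` can look surprising at first. The sketch below uses a small random image as a stand-in and checks that the batched product `image @ M.T` gives the same result as applying the untransposed matrix `M` to each pixel's `(r, g, b)` vector, which is the per-element view a prolog callback works with.

```python
import numpy as np

# Untransposed sepia matrix: row i holds the weights that produce output channel i
SEPIA = np.array([
    [0.393, 0.769, 0.189],
    [0.349, 0.686, 0.168],
    [0.272, 0.534, 0.131],
])

rng = np.random.default_rng(0)
image = rng.random((4, 5, 3))  # tiny stand-in image, shape (height, width, 3)

# Batched form used in the notebook: matmul broadcasts over the first two axes
batched = np.minimum(1.0, image @ SEPIA.T)

# Per-pixel form: one 3x3 matrix-vector product per pixel
per_pixel = np.empty_like(image)
for y in range(image.shape[0]):
    for x in range(image.shape[1]):
        per_pixel[y, x] = np.minimum(1.0, SEPIA @ image[y, x])

print("same result:", np.allclose(batched, per_pixel))
```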

Finally, we apply the Gaussian blur filter, implemented through R2C and C2R FFTs:

In [None]:
sigma = 20.0
fy = np.fft.fftfreq(sepia_image.shape[0])[:, None]  # column vector
fx = np.fft.rfftfreq(sepia_image.shape[1])[None, :]  # row vector for R2C (only positive frequencies)
h = np.exp(-2.0 * np.pi * np.pi * sigma * sigma * (fx * fx + fy * fy)).astype(np.complex64)
print(h.shape)


def gaussian_filter_numpy(image):
    # Apply FFT on spatial dimensions (axes 0 and 1), leaving color channel intact
    image_fft = np.fft.rfft2(image, axes=(0, 1))  # Real to complex FFT
    blurred = np.fft.irfft2(image_fft * h[..., None], axes=(0, 1))  # Complex to real FFT
    return blurred

blurred_image = gaussian_filter_numpy(sepia_image)

# Display the blurred image
plt.figure(figsize=(10, 5))

plt.imshow(blurred_image)
plt.title("Sepia & Blurred Image")
plt.axis("off")

plt.tight_layout()
plt.show()

Note the shape of `h`: for an R2C FFT we need only the non-negative frequencies, hence the size in the `fx` dimension is `image.shape[1] // 2 + 1`.
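
A quick NumPy check of this shape rule, using two small illustrative arrays (one with an even width and one with an odd width):

```python
import numpy as np

spectrum_shapes = {}
round_trip_ok = {}

for shape in [(6, 8), (6, 7)]:
    image = np.arange(shape[0] * shape[1], dtype=np.float64).reshape(shape)

    # The R2C transform keeps only width // 2 + 1 frequencies along the last axis
    spectrum = np.fft.rfft2(image)
    spectrum_shapes[shape] = spectrum.shape

    # Passing `s` tells the C2R transform the original size, which matters for odd widths
    restored = np.fft.irfft2(spectrum, s=image.shape)
    round_trip_ok[shape] = np.allclose(restored, image)

    print(f"image {shape} -> spectrum {spectrum.shape}, round trip ok: {round_trip_ok[shape]}")
```

Without `s`, NumPy's `irfft2` assumes an even last dimension, so an odd-width image would not be recovered at its original size.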

Now we port the NumPy implementation to the GPU with CuPy:

In [None]:
import nvmath # Workaround for CuPy: CTK shared objects preload from wheels
import cupy as cp

image_gpu = cp.asarray(original_image, dtype=cp.float32)
SEPIA_FILTER_MATRIX_GPU = cp.asarray(SEPIA_FILTER_MATRIX).astype(cp.float32)


def sepia_filter(image):
    return cp.minimum(1.0, image @ SEPIA_FILTER_MATRIX_GPU)


sepia_image = sepia_filter(image_gpu)

# Display the sepia image
plt.figure(figsize=(10, 5))

plt.imshow(sepia_image.get())
plt.title("Sepia Image")
plt.axis("off")

plt.tight_layout()
plt.show()


In [None]:
sigma = 20.0
fy = cp.fft.fftfreq(sepia_image.shape[0])[:, None].astype(cp.float32)  # column vector
fx = cp.fft.rfftfreq(sepia_image.shape[1])[None, :].astype(cp.float32)  # row vector for R2C (only positive frequencies)
h = cp.exp(-2.0 * np.pi * np.pi * sigma * sigma * (fx * fx + fy * fy)).astype(cp.complex64)


def gaussian_filter_cupy(image):
    # Apply FFT on spatial dimensions (axes 0 and 1), leaving color channel intact
    image_fft = cp.fft.rfft2(image, axes=(0, 1))  # Real to complex FFT
    blurred = cp.fft.irfft2(image_fft * h[..., None], axes=(0, 1))  # Complex to real FFT
    return blurred

blurred_image = gaussian_filter_cupy(sepia_image)

# Display the blurred image
plt.figure(figsize=(10, 5))

plt.imshow(blurred_image.get())
plt.title("Sepia & Blurred Image")
plt.axis("off")

plt.tight_layout()
plt.show()

Finally, we implement the above code using **nvmath-python** with compiled prolog functions:

In [None]:
# REMEMBER: Dimensions over which the FFT is performed must be contiguous in memory.
# Hint: Rearrange the dimensions so that spatial data is contiguous

# 1. Copy original image to GPU and rearrange dimensions so that spatial data is contiguous
# TODO: Implement this step

# 2. Implement the sepia filter prolog
def sepia_prolog_impl(data_in, offset, unused_user_info, unused):
    # TODO: Implement this step
    # 1. Get the channel index and pixel index from the offset
    # 2. Access the (r,g,b) pixel data from data_in array
    # 3. Based on the channel index, apply the sepia filter to the respective channel
    # 4. Return the filtered pixel data as a function result
    pass

# 3. Implement the Gaussian blur prolog
def blur_prolog_impl(data_in, offset, filter_data, unused):
    # Offset is a flattened index of the three (R, G, B) FFTs. Each FFT has size h.size 
    pass

# 4. Compile the prologs
# TODO: Implement this step
# The R2C operand is real, so the element type seen by this prolog is float32
sepia_prolog = nvmath.fft.compile_prolog(sepia_prolog_impl, "float32", "complex64")
blur_prolog = nvmath.fft.compile_prolog(blur_prolog_impl, "complex64", "complex64")

# 5. Perform forward R2C FFT with the compiled Sepia filter prolog
# TODO: Implement this step

# 6. Perform backward C2R FFT with the compiled Gaussian Blur prolog
# TODO: Implement this step

# 7. Display the processed image
plt.figure(figsize=(10, 5))

plt.imshow(filtered_image.transpose(1, 2, 0).get())
plt.title("Sepia & Blurred Image")
plt.axis("off")

plt.tight_layout()
plt.show()
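
The index arithmetic mentioned in the hints can be checked on the CPU first. The sketch below assumes the operand is a C-contiguous, channels-first array of shape `(3, height, width)` and that the callback `offset` is the flat element index into it; both assumptions are worth confirming against the nvmath-python documentation. The helper `split_offset` and the tiny array are hypothetical, introduced only for this illustration.

```python
import numpy as np

height, width, channels = 4, 5, 3

# Channels-last image, as loaded from PIL: shape (height, width, 3)
image_hwc = np.arange(height * width * channels, dtype=np.float32).reshape(height, width, channels)

# Channels-first copy so that each channel's spatial data is contiguous
image_chw = np.ascontiguousarray(image_hwc.transpose(2, 0, 1))
flat = image_chw.ravel()
pixels_per_channel = height * width


def split_offset(offset):
    """Hypothetical helper: split a flat offset into (channel index, pixel index)."""
    return offset // pixels_per_channel, offset % pixels_per_channel


offset = 37  # arbitrary flat index used for the demonstration
channel, pixel = split_offset(offset)
y, x = divmod(pixel, width)

# All three color values of the same pixel, gathered from the flat array
rgb = np.array([flat[c * pixels_per_channel + pixel] for c in range(channels)])

print(f"offset {offset} -> channel {channel}, pixel {pixel} (y={y}, x={x})")
print("rgb at that pixel:", rgb)
```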

**Key Takeaways:**

- FFT callbacks allow custom Python functions to be JIT-compiled and fused with FFT kernels as prolog or epilog operations.
- JIT compilation overhead is a one-time cost that can be amortized across multiple executions.
- CuPy has a built-in plan caching mechanism that amortizes the initial planning cost across executions.
- The **nvmath-python** stateful API amortizes compilation and planning costs and avoids plan cache retrieval overhead in subsequent executions.
- For batch processing, the cost per image decreases as batch size increases due to amortization of setup costs.

---
## Conclusion

In this notebook, we explored FFT callbacks in **nvmath-python**, a powerful feature that enables kernel fusion for FFT operations. Through a practical image processing example (Gaussian filtering), we demonstrated how callbacks can significantly improve performance.

**Key Takeaways:**
- FFT callbacks allow custom operations to be fused with FFT kernels by JIT-compiling Python functions to LTO-IR
- Kernel fusion eliminates intermediate memory operations and reduces kernel launch overhead
- Stateful FFT APIs enable separation of compilation/planning from execution, allowing cost amortization
- For the Gaussian filtering example, **nvmath-python** with callbacks provides better performance than CuPy for batch processing
- The cost per image decreases as batch size increases due to amortization of compilation and planning costs
- CuPy's plan caching and **nvmath-python**'s stateful APIs both address the challenge of setup cost amortization, but with different trade-offs

**Next Steps:**
- Explore device APIs in the next notebook: [05_device_api.ipynb](05_device_api.ipynb)
- Return to previous notebooks to review kernel fusion, memory spaces, and stateful APIs

---
## References

- NVIDIA nvmath-python documentation, "FFT API Reference," https://docs.nvidia.com/cuda/nvmath-python/, Accessed: October 23, 2025.
- NVIDIA, "cuFFT Library User Guide," https://docs.nvidia.com/cuda/cufft/, Accessed: October 23, 2025.
- NVIDIA, "cuFFT Callbacks," https://docs.nvidia.com/cuda/cufft/index.html#callback-routines, Accessed: October 23, 2025.
- Williams, Samuel, et al., "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Communications of the ACM, 52(4), 65-76, 2009.