# 9.4. Scaling `nvmath-python` to many GPUs

This example illustrates the function-form distributed FFT APIs with NumPy ndarrays, using the default cuFFTMp Slab distributions. The NumPy ndarrays reside in CPU memory and are copied transparently to GPU symmetric memory for processing by cuFFTMp.

Both the input and the result of the FFT operations are NumPy ndarrays, so nvmath-python interoperates with NumPy without any explicit conversion.

We start with a few lines of required initialization code:

In [None]:
!source /global/common/software/trn018/init-training-pyhpc-2025.sh

In [None]:
!unset SLURM_NTASKS_PER_NODE

In [None]:
%%writefile scaling-nvmath-python.py

import numpy as np
import cuda.core.experimental
from mpi4py import MPI

import nvmath.distributed

# Initialize nvmath.distributed.
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nranks = comm.Get_size()
device_id = rank % cuda.core.experimental.system.num_devices
nvmath.distributed.initialize(device_id, comm)
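
The expression `rank % num_devices` above assigns GPUs to ranks in round-robin order. The following minimal sketch shows the resulting mapping in plain Python, with no GPU or MPI required. The helper name `device_for_rank` and the job size (8 ranks, 4 GPUs per node) are hypothetical and chosen only for illustration; the mapping gives one rank per GPU only if ranks are placed on nodes in consecutive blocks.

```python
def device_for_rank(rank: int, num_devices: int) -> int:
    """Round-robin mapping of a global MPI rank to a local GPU index.

    Hypothetical helper mirroring `rank % num_devices` from the script.
    """
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    return rank % num_devices


# Assumed job layout: 8 ranks, 4 visible GPUs on every node.
mapping = {rank: device_for_rank(rank, 4) for rank in range(8)}

for rank, device in mapping.items():
    print(f"rank {rank} -> device {device}")
```

If a node runs more ranks than it has visible GPUs, this mapping places several ranks on the same device.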

We use a 3D FFT problem with a global shape of (64, 256, 128). The input data is distributed across processes according to the cuFFTMp Slab distribution on the Y axis (second dimension), so each process holds only its own block of Y:

In [None]:
%%writefile -a scaling-nvmath-python.py

# Local (per-rank) shape: the global Y extent of 256 is split evenly across ranks.
shape = 64, 256 // nranks, 128

# NumPy ndarray, on the CPU.
a = np.random.rand(*shape) + 1j * np.random.rand(*shape)
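
To make the Slab-on-Y layout concrete, here is a small single-process sketch that computes which block of the global Y axis each rank would own, and how large its local operand is. It assumes 4 ranks (matching the `srun -n 4` launch used later) and a Y extent that divides evenly; it runs with NumPy alone and does not call nvmath-python.

```python
import numpy as np

global_shape = (64, 256, 128)  # full problem size implied by the example
nranks = 4                     # assumed number of ranks for this sketch

# Slab on Y: every rank keeps all of X and Z, and a contiguous block of Y.
assert global_shape[1] % nranks == 0, "Y extent must divide evenly in this sketch"
local_ny = global_shape[1] // nranks
local_shape = (global_shape[0], local_ny, global_shape[2])

for rank in range(nranks):
    y0, y1 = rank * local_ny, (rank + 1) * local_ny
    print(f"rank {rank}: global Y range [{y0}, {y1}), local shape {local_shape}")

# Size of the local complex128 operand in MiB.
local_mib = int(np.prod(local_shape)) * np.dtype(np.complex128).itemsize / 2**20
print(f"local operand size: {local_mib:.1f} MiB")
```

Because the global shape is fixed and only the local Y extent shrinks as `nranks` grows, adding ranks reduces the amount of data each GPU has to hold and transform.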

Here we perform the forward FFT. By default, the `reshape` option is `True`, which means that the output of the distributed FFT will be re-distributed to retain the same distribution as the input (in this case `Slab.Y`).

In [None]:
%%writefile -a scaling-nvmath-python.py

b = nvmath.distributed.fft.fft(a, nvmath.distributed.fft.Slab.Y)

Note that `a` and `b` have the same shape on every rank, since both use the same distribution:

In [None]:
%%writefile -a scaling-nvmath-python.py

if rank == 0:
    print(f"Shape of a on rank {rank} is {a.shape}")
    print(f"Shape of b on rank {rank} is {b.shape}")

    print(f"Input type = {type(a)}, FFT output type = {type(b)}")
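
To see why the output can keep the input's shape, it may help to emulate a slab-decomposed 3D FFT in a single process with NumPy. The sketch below is a conceptual illustration only, not a description of how cuFFTMp is implemented: the "virtual ranks" are list entries, the all-to-all exchanges are replaced by `np.concatenate`/`np.split`, and a small toy shape is assumed so that it runs quickly.

```python
import numpy as np

nranks = 4                    # assumed number of virtual ranks
global_shape = (16, 32, 8)    # toy global shape, much smaller than the example's
rng = np.random.default_rng(0)
g = rng.random(global_shape) + 1j * rng.random(global_shape)

# Slab-on-Y input: each virtual rank owns a contiguous block of Y.
y_slabs = np.split(g, nranks, axis=1)

# Step 1: transform the axes that are fully local to each rank (X and Z).
partial = [np.fft.fftn(s, axes=(0, 2)) for s in y_slabs]

# Step 2: global transpose to X slabs, so that Y becomes fully local.
# In a real multi-GPU run this is an all-to-all exchange between ranks.
x_slabs = np.split(np.concatenate(partial, axis=1), nranks, axis=0)

# Step 3: transform the remaining axis (Y).
x_slabs = [np.fft.fft(s, axis=1) for s in x_slabs]

# Step 4: redistribute back to Y slabs, analogous to what reshape=True
# is described as doing for the distributed FFT.
b_slabs = np.split(np.concatenate(x_slabs, axis=0), nranks, axis=1)

# The reassembled result should agree with a plain single-process 3D FFT.
matches = np.allclose(np.concatenate(b_slabs, axis=1), np.fft.fftn(g))
print(f"local input shape  = {y_slabs[0].shape}")
print(f"local output shape = {b_slabs[0].shape}")
print(f"matches np.fft.fftn: {matches}")
```

Without the final redistribution (step 4), each rank would be left holding an X slab, whose local shape differs from that of the input.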

List available GPUs:

In [None]:
!srun -l nvidia-smi -L

Launch the computation:

In [None]:
!srun -n 4 --gres-flags=allow-task-sharing -l python scaling-nvmath-python.py

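Note that the code is very similar to the single-GPU FFT examples we considered earlier: apart from the initialization of `nvmath.distributed` and the choice of distribution, the FFT call itself is nearly unchanged.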