# SA3D MultiGPU Development

This notebook is intended to facilitate development and testing of a multi-GPU implementation of the `aind-large-scale-prediction` library.

## Example Dataload from Zarr

This section largely follows the code in `scripts/example_create_data_loader.py` and is intended only to illustrate data loading, without multi-GPU processing.

In [1]:
import logging
import multiprocessing
import os
import time
from datetime import datetime

import torch
import matplotlib.pyplot as plt
import numpy as np
import zarr
import s3fs

from aind_large_scale_prediction.generator.dataset import create_data_loader
from aind_large_scale_prediction.generator.utils import (
    concatenate_lazy_data,
    estimate_output_volume,
    get_suggested_cpu_count,
    recover_global_position,
    unpad_global_coords,
)

In [9]:
DATASET_PATH = "s3://aind-open-data/SmartSPIM_709392_2024-01-29_18-33-39_stitched_2024-02-04_12-45-58/image_tile_fusing/OMEZarr/Ex_639_Em_667.zarr"

In [10]:
# Create the S3 filesystem once, using the default AWS credential chain
s3 = s3fs.S3FileSystem()

# Try opening the Zarr dataset directly
store = zarr.open_group(s3fs.S3Map(DATASET_PATH, s3=s3), mode="r")
print(store.tree())  # This should print the Zarr directory structure

/
 ├── 0 (1, 1, 3990, 10232, 7437) uint16
 ├── 1 (1, 1, 1995, 5116, 3718) uint16
 ├── 2 (1, 1, 997, 2558, 1859) uint16
 └── 3 (1, 1, 498, 1279, 929) uint16
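The four levels printed above are consistent with each of the last three axes being halved, rounding down, from one level to the next. A minimal sketch of that arithmetic follows, assuming 2x floor downsampling per level; `pyramid_shapes` is a hypothetical helper, not part of `zarr` or `aind-large-scale-prediction`.

```python
def pyramid_shapes(base_shape, n_levels, factor=2):
    """Return shapes for successive levels, assuming floor division by `factor`."""
    shapes = [tuple(base_shape)]
    for _ in range(n_levels - 1):
        shapes.append(tuple(dim // factor for dim in shapes[-1]))
    return shapes


# Last three axes of level "0", copied from the tree printed above.
for level, shape in enumerate(pyramid_shapes((3990, 10232, 7437), n_levels=4)):
    print(level, shape)
```

Level `"3"`, used in the cells below, is the smallest of these and is convenient for quick iteration.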


In [11]:
multiscale = "3"
target_size_mb = 1024
n_workers = 0
batch_size = 1
prediction_chunksize = (128, 128, 128)
overlap_prediction_chunksize = (30, 30, 30)
super_chunksize = (512, 512, 512)

lazy_data = concatenate_lazy_data(
    dataset_paths=[DATASET_PATH],
    multiscales=[multiscale],
    concat_axis=-4,
)

In [12]:
lazy_data

| | Array | Chunk |
|---|---|---|
| Bytes | 1.10 GiB | 4.00 MiB |
| Shape | (1, 1, 498, 1279, 929) | (1, 1, 128, 128, 128) |
| Dask graph | 320 chunks in 2 graph layers | |
| Data type | uint16 numpy.ndarray | |
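The numbers in this summary can be checked by hand from the array shape, the chunk shape, and the 2-byte size of `uint16`. The sketch below does that arithmetic with the standard library only; `chunk_stats` is a hypothetical helper written for this check, not a Dask function.

```python
import math


def chunk_stats(shape, chunks, itemsize):
    """Return (number of chunks, bytes per full chunk, total bytes)."""
    # Partial chunks at the array edge still count as one chunk each.
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    chunk_bytes = math.prod(chunks) * itemsize
    total_bytes = math.prod(shape) * itemsize
    return n_chunks, chunk_bytes, total_bytes


n_chunks, chunk_bytes, total_bytes = chunk_stats(
    shape=(1, 1, 498, 1279, 929),
    chunks=(1, 1, 128, 128, 128),
    itemsize=2,  # uint16
)
print(n_chunks)                       # chunk count
print(chunk_bytes / 2**20, "MiB")     # size of one full chunk
print(round(total_bytes / 2**30, 2), "GiB")  # size of the whole array
```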


In [13]:
suggested_cpus = get_suggested_cpu_count()

print(f"Suggested number of CPUs: {suggested_cpus} - Provided n count workers {n_workers}")

Suggested number of CPUs: 48 - Provided n count workers 0


In [14]:
zarr_data_loader, zarr_dataset = create_data_loader(
    lazy_data=lazy_data,
    target_size_mb=target_size_mb,
    prediction_chunksize=prediction_chunksize,
    overlap_prediction_chunksize=overlap_prediction_chunksize,
    n_workers=n_workers,
    batch_size=batch_size,
    dtype=np.float32,  # Allowed data type to process with pytorch cuda
    super_chunksize=super_chunksize,
    lazy_callback_fn=None,  # partial_lazy_deskewing,
    logger=None,
    device=None,
    pin_memory=False,
    drop_last=False,
    override_suggested_cpus=False,
    locked_array=True,
)

Adding overlap area to super chunk size: (512, 512, 512) - (572, 572, 572)


In [15]:
output_volume_shape = estimate_output_volume(
    image_shape=zarr_dataset.lazy_data.shape,
    chunk_shape=prediction_chunksize,
    overlap_per_axis=overlap_prediction_chunksize,
)
print(output_volume_shape)

(940, 2068, 1504)
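The printed shape is larger than the input volume because it accounts for overlap padding. One formula that reproduces these three values is: pad each axis by the overlap on both sides, count how many prediction chunks are needed to cover it, and multiply by the padded chunk size. Whether `estimate_output_volume` computes it this way is an assumption; `padded_output_shape` below is a hypothetical reconstruction, not the library's code.

```python
def padded_output_shape(image_shape, chunk_shape, overlap):
    """Hypothetical reconstruction of the output shape; not taken from the library."""
    out = []
    for dim, chunk, ov in zip(image_shape, chunk_shape, overlap):
        n_blocks = -(-(dim + 2 * ov) // chunk)  # ceiling division
        out.append(n_blocks * (chunk + 2 * ov))
    return tuple(out)


# Last three axes of the lazy array, with the chunk and overlap sizes set earlier.
print(padded_output_shape((498, 1279, 929), (128, 128, 128), (30, 30, 30)))
# prints (940, 2068, 1504)
```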


In [16]:
prediction_chunksize_overlap = np.array(prediction_chunksize) + (
    np.array(overlap_prediction_chunksize) * 2
)
print(prediction_chunksize_overlap)

[188 188 188]


In [4]:
# Pull only the first batch from the loader for inspection
sample = next(iter(zarr_data_loader))

In [18]:
sample.batch_tensor[0, ...].numpy().shape

(158, 158, 158)

In [19]:
sample.batch_super_chunk

((slice(0, 542, None), slice(0, 542, None), slice(0, 542, None)),)
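The first batch is 158 voxels per axis rather than 188, and the first super chunk spans 542 rather than 572. Both are consistent with a block at the origin being padded on its far side only, since there is nothing before index 0. The sketch below illustrates that reading; it is an assumption, and `padded_slice` and `padded_shape` are hypothetical helpers, not library functions. The upper end is left unclamped because the super-chunk slice above extends to 542 on an axis of length 498.

```python
def padded_slice(start, stop, overlap):
    """Grow [start, stop) by `overlap` on both sides, clamping only at 0."""
    return slice(max(0, start - overlap), stop + overlap)


def padded_shape(starts, chunk, overlap):
    """Shape of a chunk beginning at `starts` after padding each axis."""
    shape = []
    for st, size, ov in zip(starts, chunk, overlap):
        s = padded_slice(st, st + size, ov)
        shape.append(s.stop - s.start)
    return tuple(shape)


overlap = (30, 30, 30)
print(padded_shape((0, 0, 0), (128, 128, 128), overlap))        # chunk at the origin
print(padded_shape((128, 128, 128), (128, 128, 128), overlap))  # interior chunk
print(padded_slice(0, 512, 30))                                 # first super chunk, one axis
```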

## Visualization

In [2]:
from vedo import Volume, show

In [3]:
sample

# Torchrun Distributed

This section is intended to replicate some of the above processing in a multi-GPU environment.

See `aind-large-scale-prediction/tests/multigpu_aind.py` for an example run.
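As a rough illustration of how work might be divided under `torchrun`, the sketch below assigns items to processes round-robin using the `RANK` and `WORLD_SIZE` environment variables that `torchrun` sets. It is not the contents of `multigpu_aind.py`; `shard_for_rank` and the placeholder work list are assumptions made for this example.

```python
import os


def shard_for_rank(items, rank, world_size):
    """Round-robin split: rank r takes items r, r + world_size, r + 2 * world_size, ..."""
    return list(items)[rank::world_size]


# torchrun exports RANK and WORLD_SIZE; the defaults let this run as a plain script.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

work_items = range(10)  # placeholder work list, not real super-chunk indices
print(f"rank {rank} of {world_size}:", shard_for_rank(work_items, rank, world_size))
```

Round-robin assignment keeps the shards disjoint and within one item of each other in size, which is usually enough when every item costs about the same to process.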