# Tutorial 5: GPU Acceleration with HPXPy

This tutorial covers GPU acceleration in HPXPy using HPX's executor infrastructure:

1. Checking GPU availability (CUDA and SYCL)
2. Transparent device selection (recommended)
3. Data transfers between CPU and GPU
4. SYCL support for cross-platform GPUs
5. The explicit CUDA GPU API
6. Async GPU operations with HPX futures
7. Writing portable code
8. Error handling
9. Best practices

HPXPy supports multiple GPU backends through HPX's executor infrastructure:
- **CUDA**: For NVIDIA GPUs via `hpx::cuda::experimental::cuda_executor`
- **SYCL**: For Intel, AMD, and Apple Silicon GPUs via `hpx::sycl::experimental::sycl_executor`

Both backends use HPX for:
- Async operations with proper future integration
- GIL release during GPU operations
- Consistent API across platforms

HPXPy provides two ways to work with GPUs:
- **Transparent API**: Use the `device` parameter on regular functions (recommended)
- **Explicit API**: Use `hpx.gpu` (CUDA) or `hpx.sycl` modules directly

The transparent API is preferred because it makes your code portable.

In [None]:
import numpy as np
import hpxpy as hpx

# Initialize HPX runtime
hpx.init()

## 1. Checking GPU Availability

Before using GPU features, check what's available on your system.

In [None]:
# Check what GPU backends are available
print("=== GPU Backend Availability ===")
print(f"CUDA available: {hpx.gpu.is_available()}")
print(f"SYCL available: {hpx.sycl.is_available()}")

if hpx.gpu.is_available():
    print(f"  CUDA devices: {hpx.gpu.device_count()}")
    
if hpx.sycl.is_available():
    print(f"  SYCL devices: {hpx.sycl.device_count()}")

In [None]:
# List all available CUDA devices
cuda_devices = hpx.gpu.get_devices()

if cuda_devices:
    print("Available CUDA GPUs:")
    for dev in cuda_devices:
        print(f"  [{dev.id}] {dev.name}")
        print(f"      Memory: {dev.total_memory_gb():.1f} GB")
        print(f"      Compute Capability: {dev.compute_capability()}")
        print(f"      Multiprocessors: {dev.multiprocessor_count}")
else:
    print("No CUDA GPUs available")

In [None]:
# List all available SYCL devices
sycl_devices = hpx.sycl.get_devices()

if sycl_devices:
    print("Available SYCL GPUs:")
    for dev in sycl_devices:
        print(f"  [{dev.id}] {dev.name}")
        print(f"      Vendor: {dev.vendor}")
        print(f"      Backend: {dev.backend}")
        print(f"      Memory: {dev.global_mem_size_gb():.1f} GB")
        print(f"      Compute Units: {dev.max_compute_units}")
else:
    print("No SYCL GPUs available")
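
The availability checks above follow the same shape for both backends: ask `is_available()` first, then query `device_count()`. A minimal sketch of folding that into one summary is shown below. `backend_report` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for `hpx.gpu` and `hpx.sycl` so the snippet runs without HPXPy installed.

```python
from types import SimpleNamespace


def backend_report(backends):
    """Return {name: device_count}, using 0 for unavailable backends."""
    report = {}
    for name, module in backends.items():
        # Only query the device count once availability is confirmed
        report[name] = module.device_count() if module.is_available() else 0
    return report


# Hypothetical stand-ins; with HPXPy you would pass hpx.gpu and hpx.sycl
fake_cuda = SimpleNamespace(is_available=lambda: False, device_count=lambda: 0)
fake_sycl = SimpleNamespace(is_available=lambda: True, device_count=lambda: 2)

print(backend_report({"cuda": fake_cuda, "sycl": fake_sycl}))
```

In a real session the call would presumably look like `backend_report({"cuda": hpx.gpu, "sycl": hpx.sycl})`, relying only on the two functions used in the cells above.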

## 2. Transparent Device Selection (Recommended)

The recommended way to use GPUs in HPXPy is through the `device` parameter on array creation functions. This makes your code portable and easy to read.

### Device Options

| Value | Behavior |
|-------|----------|
| `None` or `'cpu'` | Create array on CPU (default) |
| `'gpu'` or `'cuda'` | Create array on CUDA GPU (error if unavailable) |
| `'sycl'` | Create array on SYCL GPU (Intel/AMD/Apple; error if unavailable) |
| `'auto'` | Use best available: CUDA > SYCL > CPU |
| `0`, `1`, etc. | Use specific GPU device ID |

The `'auto'` option is especially useful because it selects the best available backend, in priority order:
- On systems with a usable CUDA backend (NVIDIA GPUs): uses CUDA
- Otherwise, on systems with a SYCL device: uses SYCL
- Otherwise: falls back to CPU
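
The selection order described above can be sketched in plain Python. This is an illustration of the documented behavior only, not HPXPy's actual implementation: `resolve_device` is a hypothetical helper, the two boolean flags stand in for `hpx.gpu.is_available()` and `hpx.sycl.is_available()`, and integer device IDs are deliberately left out because the table does not say which backend they refer to.

```python
def resolve_device(requested, cuda_available, sycl_available):
    """Hypothetical model of the device table (not HPXPy's real code).

    Integer device IDs are not modeled here.
    """
    if requested is None or requested == "cpu":
        return "cpu"
    if requested == "auto":
        # Priority order from the table: CUDA > SYCL > CPU
        if cuda_available:
            return "cuda"
        if sycl_available:
            return "sycl"
        return "cpu"
    if requested in ("gpu", "cuda"):
        if not cuda_available:
            raise RuntimeError("CUDA requested but not available")
        return "cuda"
    if requested == "sycl":
        if not sycl_available:
            raise RuntimeError("SYCL requested but not available")
        return "sycl"
    raise ValueError(f"Unknown device: {requested!r}")


print(resolve_device("auto", cuda_available=True, sycl_available=True))
print(resolve_device("auto", cuda_available=False, sycl_available=True))
print(resolve_device("auto", cuda_available=False, sycl_available=False))
```

Note that only `'auto'` falls back silently; an explicit backend request fails loudly when that backend is missing.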

In [None]:
# Create arrays on CPU (default behavior)
cpu_arr = hpx.zeros(1000)
print(f"CPU array shape: {cpu_arr.shape}")

# Explicit CPU
cpu_arr2 = hpx.zeros(1000, device='cpu')
print(f"Explicit CPU array shape: {cpu_arr2.shape}")

In [None]:
# Use 'auto' for portable code - best GPU if available, CPU otherwise
# This is the recommended approach for most use cases

arr = hpx.zeros(10000, device='auto')
print("Array created with device='auto'")
print(f"  Shape: {arr.shape}")

# Check where the array lives
if hasattr(arr, 'device'):
    print(f"  Location: GPU device {arr.device}")
else:
    print("  Location: CPU")

In [None]:
# All array creation functions support the device parameter

# zeros, ones, empty
z = hpx.zeros((100, 100), device='auto')
o = hpx.ones((100, 100), device='auto')
e = hpx.empty(1000, device='auto')

# full - create array filled with a value
f = hpx.full(1000, 3.14159, device='auto')

# arange - evenly spaced values
a = hpx.arange(10000, device='auto')

# linspace - evenly spaced over interval
lin = hpx.linspace(0, 1, 100, device='auto')

print("Created arrays with device='auto':")
print(f"  zeros: {z.shape}")
print(f"  ones: {o.shape}")
print(f"  empty: {e.shape}")
print(f"  full: {f.shape}")
print(f"  arange: {a.shape}")
print(f"  linspace: {lin.shape}")

In [None]:
# Transfer numpy arrays to the preferred device
np_data = np.random.randn(1000).astype(np.float64)

# Using from_numpy
arr1 = hpx.from_numpy(np_data, device='auto')

# Using array
arr2 = hpx.array([1.0, 2.0, 3.0, 4.0, 5.0], device='auto')

print(f"from_numpy result: {arr1.shape}")
print(f"array result: {arr2.shape}")

## 3. Data Transfers

All HPXPy arrays (CPU and GPU) support the `to_numpy()` method to transfer data back to a NumPy array on the CPU.

In [None]:
# Create an array (on GPU if available)
arr = hpx.arange(10, device='auto')

# Transfer back to CPU as NumPy array
np_result = arr.to_numpy()

print(f"Original HPXPy array: {arr}")
print(f"NumPy result: {np_result}")
print(f"NumPy type: {type(np_result)}")

In [None]:
# Round-trip: NumPy -> HPXPy (auto device) -> NumPy
original = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"Original NumPy: {original}")

# Transfer to best available device
hpx_arr = hpx.from_numpy(original, device='auto')
print("HPXPy array created")

# Transfer back
result = hpx_arr.to_numpy()
print(f"Back to NumPy: {result}")

# Verify data integrity
np.testing.assert_array_equal(original, result)
print("Data integrity verified!")

## 4. SYCL Support (Cross-Platform GPUs)

HPXPy supports SYCL via `hpx::sycl::experimental::sycl_executor`. SYCL provides cross-platform GPU support for:

| Backend | Platform | SYCL Implementation |
|---------|----------|---------------------|
| Level-Zero | Intel GPUs | Intel oneAPI |
| HIP | AMD GPUs | AdaptiveCpp |
| CUDA | NVIDIA GPUs | AdaptiveCpp, oneAPI |
| Metal | Apple Silicon | AdaptiveCpp (experimental) |
| OpenCL | Various | Intel oneAPI, AdaptiveCpp |

### Using SYCL Explicitly

In [None]:
if hpx.sycl.is_available():
    print(f"SYCL available with {hpx.sycl.device_count()} device(s)")
    
    # Get device info
    dev = hpx.sycl.get_device(0)
    print(f"Device: {dev.name}")
    print(f"Backend: {dev.backend}")
    print(f"Memory: {dev.global_mem_size_gb():.1f} GB")
    
    # Request the SYCL backend through the transparent API
    sycl_arr = hpx.zeros(1000, device='sycl')
    print(f"\nCreated SYCL array: {sycl_arr.shape}, device={sycl_arr.device}")
    
    # The explicit hpx.sycl module offers the same creation functions
    z = hpx.sycl.zeros([100, 100])
    o = hpx.sycl.ones([1000])
    f = hpx.sycl.full([100], 3.14)
    
    print("\nCreated various SYCL arrays:")
    print(f"  zeros: {z.shape}")
    print(f"  ones: {o.shape}")
    print(f"  full: {f.shape}")
    
    # Transfer data
    np_data = np.random.randn(100)
    sycl_from_np = hpx.sycl.from_numpy(np_data)
    back_to_np = sycl_from_np.to_numpy()
    np.testing.assert_array_almost_equal(np_data, back_to_np)
    print("\nNumPy <-> SYCL round-trip verified!")
    
    # SYCL operations
    ones = hpx.sycl.ones([100])
    total = hpx.sycl.sum(ones)
    print(f"Sum of 100 ones: {total}")
else:
    print("SYCL not available")
    print("Build HPX with HPX_WITH_SYCL=ON and HPXPy with HPXPY_WITH_SYCL=ON")

## 5. Explicit CUDA GPU API

For more control over NVIDIA GPUs, you can use the `hpx.gpu` module directly. This uses `hpx::cuda::experimental::cuda_executor` for proper HPX integration.

In [None]:
if hpx.gpu.is_available():
    # Get current GPU
    current = hpx.gpu.current_device()
    print(f"Current GPU: {current}")
    
    # Get memory info
    free, total = hpx.gpu.memory_info(current)
    print(f"GPU Memory: {free / 1e9:.2f} GB free / {total / 1e9:.2f} GB total")
    
    # Create array explicitly on GPU
    gpu_arr = hpx.gpu.zeros([1000, 1000])
    print(f"Created GPU array: {gpu_arr.shape} on device {gpu_arr.device}")
    
    # Synchronize GPU (wait for all operations to complete)
    hpx.gpu.synchronize()
    print("GPU synchronized")
else:
    print("CUDA not available - skipping explicit GPU examples")

In [None]:
if hpx.gpu.is_available():
    # Explicit GPU array operations
    
    # Create arrays
    zeros = hpx.gpu.zeros([100])
    ones = hpx.gpu.ones([100])
    full = hpx.gpu.full([100], 42.0)
    
    # Fill with value
    zeros.fill(3.14)
    
    # Transfer from numpy
    np_data = np.random.randn(100)
    from_np = hpx.gpu.from_numpy(np_data)
    
    # GPU sum
    total = hpx.gpu.sum(ones)
    print(f"Sum of ones: {total}")
    
    # Transfer back to CPU
    result = full.to_numpy()
    print(f"Full array (first 5): {result[:5]}")
else:
    print("CUDA not available - skipping explicit GPU examples")

## 6. Async GPU Operations

HPXPy supports async GPU operations that return HPX futures. This allows overlapping GPU transfers with other computation.

In [None]:
if hpx.gpu.is_available():
    # Enable async operations (starts HPX CUDA polling)
    hpx.gpu.enable_async()
    
    # Create GPU array
    arr = hpx.gpu.zeros([1000000])
    data = np.random.rand(1000000)
    
    # Async copy - returns immediately
    future = arr.async_from_numpy(data)
    print(f"Async copy started, future.is_ready(): {future.is_ready()}")
    
    # Do other work while transfer happens...
    print("Doing other work while transfer in progress...")
    
    # Wait for completion
    future.get()
    print("Transfer complete!")
    
    # Verify data
    result = arr.to_numpy()
    np.testing.assert_array_almost_equal(data, result)
    print("Data verified!")
    
    # Disable async when done
    hpx.gpu.disable_async()
else:
    print("CUDA not available - skipping async examples")
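
The start, overlap, wait sequence in the cell above is the general pattern for working with futures. As an analogy only, the sketch below reproduces it with Python's standard `concurrent.futures` instead of HPX futures, so it runs without a GPU. `slow_copy` is a hypothetical stand-in for a host-to-device transfer, and `future.result()` plays the role of `future.get()`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_copy(data):
    """Hypothetical stand-in for a host-to-device transfer."""
    time.sleep(0.05)  # pretend the copy takes a while
    return list(data)


with ThreadPoolExecutor(max_workers=1) as pool:
    # Start the "transfer"; submit() returns a future immediately
    future = pool.submit(slow_copy, range(5))

    # Overlap: do unrelated CPU work while the copy is in flight
    cpu_work = sum(i * i for i in range(1000))

    # Block until the transfer has finished, like future.get() above
    copied = future.result()

print(f"CPU work result: {cpu_work}")
print(f"Copied data: {copied}")
```

The benefit comes from the work placed between starting the transfer and waiting on it; calling `get()` immediately after starting the copy gives no overlap.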

In [None]:
# Using the AsyncContext context manager
if hpx.gpu.is_available():
    arr1 = hpx.gpu.zeros([100000])
    arr2 = hpx.gpu.zeros([100000])
    data1 = np.ones(100000)
    data2 = np.ones(100000) * 2
    
    with hpx.gpu.AsyncContext():
        # Multiple async transfers
        f1 = arr1.async_from_numpy(data1)
        f2 = arr2.async_from_numpy(data2)
        
        print("Both transfers started")
        
        # Wait for both
        f1.get()
        f2.get()
    
    # Async automatically disabled when exiting context
    print("Both transfers complete!")
    print(f"arr1 sum: {hpx.gpu.sum(arr1)}")
    print(f"arr2 sum: {hpx.gpu.sum(arr2)}")
else:
    print("CUDA not available - skipping async context example")
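
How `AsyncContext` is implemented is not shown in this tutorial. The sketch below assumes it behaves like a conventional context manager that enables async on entry and disables it on exit, and shows why that is safer than calling the two functions by hand. `enable_async` and `disable_async` here are hypothetical stand-ins that only record that they were called.

```python
from contextlib import contextmanager

events = []  # records calls so the ordering is visible


def enable_async():
    """Hypothetical stand-in for hpx.gpu.enable_async()."""
    events.append("enable")


def disable_async():
    """Hypothetical stand-in for hpx.gpu.disable_async()."""
    events.append("disable")


@contextmanager
def async_context():
    """Enable async on entry and always disable it on exit."""
    enable_async()
    try:
        yield
    finally:
        # Runs even if the body raises, so async never stays enabled
        disable_async()


try:
    with async_context():
        events.append("work")
        raise ValueError("simulated failure inside the block")
except ValueError:
    pass

print(events)
```

Even though the body raises, the disable step still runs, which is the property that makes the `with` form preferable to a manual enable/disable pair.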

## 7. Writing Portable Code

The `device='auto'` pattern makes it easy to write code that uses the GPU when one is available and falls back to the CPU otherwise.

In [None]:
def compute_on_best_device(size):
    """
    Example function that automatically uses the best available device.
    
    This code works identically on systems with or without GPUs.
    """
    # Create data on best device
    data = hpx.arange(size, device='auto')
    
    # Get the result back as NumPy
    result = data.to_numpy()
    
    return result.sum()

# This works on any system
result = compute_on_best_device(10000)
print(f"Sum of 0 to 9999: {result}")
print(f"Expected: {sum(range(10000))}")

In [None]:
# Pattern: Check device at runtime for logging/debugging
def create_with_info(shape, device='auto'):
    """Create array and report where it was created."""
    arr = hpx.zeros(shape, device=device)
    
    if hasattr(arr, 'device'):
        location = f"GPU {arr.device}"
    else:
        location = "CPU"
    
    print(f"Created {shape} array on {location}")
    return arr

arr = create_with_info((1000, 1000))

## 8. Error Handling

Understanding how HPXPy handles GPU-related errors helps write robust code.

In [None]:
# device='auto' never raises an error - it falls back to CPU
arr = hpx.zeros(100, device='auto')  # Always works
print("device='auto' always succeeds")

# device='gpu' raises RuntimeError if CUDA not available
if not hpx.gpu.is_available():
    try:
        arr = hpx.zeros(100, device='gpu')
    except RuntimeError as e:
        print(f"Expected error: {e}")

# device='sycl' raises RuntimeError if SYCL not available
if not hpx.sycl.is_available():
    try:
        arr = hpx.zeros(100, device='sycl')
    except RuntimeError as e:
        print(f"Expected error: {e}")

# Invalid device specification raises ValueError
try:
    arr = hpx.zeros(100, device='invalid')
except ValueError as e:
    print(f"Expected error: {e}")
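
Because an unavailable backend is reported as a `RuntimeError`, a custom preference order can be built on top of that behavior. The following is a minimal sketch under that assumption: `create_with_fallback` is a hypothetical helper, and `fake_create` is a stand-in for an HPXPy creation call so the example runs without HPXPy installed.

```python
def create_with_fallback(create, preferences=("gpu", "sycl", "cpu")):
    """Try each device in order; return (array, device) for the first that works.

    `create` is any callable that takes a device name and returns an array.
    """
    last_error = None
    for device in preferences:
        try:
            return create(device), device
        except RuntimeError as exc:
            # Backend unavailable: remember why and try the next one
            last_error = exc
    raise RuntimeError(f"No usable device among {preferences}") from last_error


def fake_create(device):
    """Hypothetical stand-in that pretends only SYCL and CPU exist."""
    if device == "gpu":
        raise RuntimeError("CUDA not available")
    return [0.0] * 4


arr, used = create_with_fallback(fake_create)
print(f"Created array on: {used}")
```

With HPXPy, `create` could be `lambda d: hpx.zeros(100, device=d)`. For most code `device='auto'` already covers this case, so a helper like this is mainly useful when you want a different order than CUDA > SYCL > CPU.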

## 9. Best Practices

1. **Use `device='auto'` for portable code** - Your code will run on any system (CUDA, SYCL, or CPU)

2. **Use `device='gpu'` or `device='sycl'` when a specific backend is required** - Get a clear error if unavailable

3. **Check availability for conditional logic** - Use `hpx.gpu.is_available()` or `hpx.sycl.is_available()`

4. **Minimize data transfers** - GPU-CPU transfers have overhead; keep data on device when possible

5. **Use larger arrays for GPU benefit** - Small arrays may be faster on CPU due to overhead

6. **Use async operations for overlap** - `async_from_numpy()` allows overlapping transfers with computation
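
Points 4 and 5 can be made concrete with a back-of-the-envelope cost model: a copy costs a fixed latency plus the payload size divided by the link bandwidth. The sketch below is only that model; `transfer_seconds` is a hypothetical helper, and the default bandwidth and latency are illustrative placeholders rather than measurements of any real system.

```python
def transfer_seconds(n_elements, itemsize=8, bandwidth=10e9, latency=10e-6):
    """Rough host<->device copy time: fixed latency plus bytes / bandwidth.

    bandwidth (bytes/s) and latency (s) are illustrative placeholders,
    not measurements of any real system.
    """
    return latency + (n_elements * itemsize) / bandwidth


for n in (1_000, 1_000_000, 100_000_000):
    t = transfer_seconds(n)
    print(f"{n:>11,} float64 elements: ~{t * 1e3:.3f} ms per one-way copy")
```

For small arrays the fixed latency term dominates, which is why moving them to the GPU rarely pays off; for large arrays the per-byte term dominates, which is why avoiding repeated round trips matters. Substitute numbers measured on your own hardware before drawing conclusions.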

## Summary

### Device Selection Options

| Device | Description |
|--------|-------------|
| `None` / `'cpu'` | CPU (NumPy) |
| `'gpu'` / `'cuda'` | NVIDIA CUDA GPU |
| `'sycl'` | SYCL GPU (Intel/AMD/Apple Silicon) |
| `'auto'` | Best available (CUDA > SYCL > CPU) |
| `0`, `1`, ... | Specific GPU device ID |

### Backend Comparison

| Feature | CUDA | SYCL |
|---------|------|------|
| HPX Executor | `cuda_executor` | `sycl_executor` |
| Platforms | NVIDIA only | Intel, AMD, Apple, NVIDIA |
| Async Support | Yes | Yes |
| Future Integration | HPX futures | HPX futures |

### API Comparison

| Feature | Transparent API | CUDA Explicit | SYCL Explicit |
|---------|----------------|---------------|---------------|
| Create zeros | `hpx.zeros(shape, device='auto')` | `hpx.gpu.zeros(shape)` | `hpx.sycl.zeros(shape)` |
| Create ones | `hpx.ones(shape, device='auto')` | `hpx.gpu.ones(shape)` | `hpx.sycl.ones(shape)` |
| Create full | `hpx.full(shape, val, device='auto')` | `hpx.gpu.full(shape, val)` | `hpx.sycl.full(shape, val)` |
| From numpy | `hpx.from_numpy(arr, device='auto')` | `hpx.gpu.from_numpy(arr)` | `hpx.sycl.from_numpy(arr)` |
| To numpy | `arr.to_numpy()` | `arr.to_numpy()` | `arr.to_numpy()` |
| Check available | - | `hpx.gpu.is_available()` | `hpx.sycl.is_available()` |
| Portable? | Yes | No | No |

**Recommendation**: Use the transparent API with `device='auto'` for portable code that runs on any system.

In [None]:
# Example: Portable scientific computation

def portable_computation(n):
    """
    A computation that runs on GPU if available, CPU otherwise.
    """
    # Create data on best device
    x = hpx.linspace(0, 2 * np.pi, n, device='auto')
    
    # Get back to numpy for computation
    # (In future versions, operations will run directly on GPU)
    x_np = x.to_numpy()
    result = np.sin(x_np)
    
    return result

result = portable_computation(1000)
print("Computed sin(x) for 1000 points")
print(f"Result range: [{result.min():.4f}, {result.max():.4f}]")

In [None]:
# Clean up
hpx.finalize()
print("HPX runtime finalized")