# Lab 1.3.4: CuPy Integration

**Module:** 1.3 - CUDA Python & GPU Programming  
**Time:** 2 hours  
**Difficulty:** ⭐⭐ (Beginner-Intermediate)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Use CuPy as a drop-in replacement for NumPy
- [ ] Port a data preprocessing pipeline to GPU
- [ ] Write custom CUDA kernels using CuPy's RawKernel
- [ ] Transfer data seamlessly between CuPy and PyTorch (zero-copy!)
- [ ] Achieve **10x+ speedup** on data preprocessing

---

## 📚 Prerequisites

- Completed: Labs 1.3.1-1.3.3 (helpful but not required)
- Knowledge of: NumPy basics

---

## 🌍 Real-World Context

**The data preprocessing bottleneck is real.**

In many ML pipelines, data preprocessing can take longer than model training:
- **Image preprocessing:** Resize, normalize, augment (rotation, flip, crop)
- **Text preprocessing:** Tokenization, padding, attention masks
- **Tabular data:** Feature engineering, normalization, encoding
- **Time series:** Windowing, FFT, filtering

**The problem:** NumPy runs only on the CPU, so your $50,000 GPU sits idle during preprocessing!

**The solution:** CuPy - a drop-in NumPy replacement that runs on GPU.

```python
# NumPy (CPU)
import numpy as np
data = np.random.randn(10000, 10000)
result = np.linalg.svd(data)  # CPU, slow

# CuPy (GPU) - just change the import!
import cupy as cp
data = cp.random.randn(10000, 10000)
result = cp.linalg.svd(data)  # GPU, fast!
```

---

## 🧒 ELI5: What is CuPy?

> **Imagine you have a recipe book** (your NumPy code) with instructions for one cook (CPU).
>
> **CuPy** is like a translator that takes the same recipe and adapts it for a massive kitchen with 6,144 cooks (CUDA cores) who can all work simultaneously.
>
> The best part? You don't need to rewrite the recipe! Just change "NumPy" to "CuPy" and the translator handles everything.
>
> **Example:**
> - "Chop 1000 carrots" → CPU: one knife, 1000 cuts
> - "Chop 1000 carrots" → GPU: 1000 knives, all cutting at once!

### CuPy vs NumPy: API Comparison

| NumPy | CuPy | Notes |
|-------|------|-------|
| `np.array([1,2,3])` | `cp.array([1,2,3])` | Arrays live on GPU |
| `np.zeros((100,100))` | `cp.zeros((100,100))` | Allocated on GPU memory |
| `np.dot(a, b)` | `cp.dot(a, b)` | Uses cuBLAS internally |
| `np.fft.fft(x)` | `cp.fft.fft(x)` | Uses cuFFT internally |
| `np.linalg.svd(A)` | `cp.linalg.svd(A)` | Uses cuSOLVER internally |
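
Because the function names in the table match, code can often be written once against a module alias and run on either library. The sketch below is one way to do that, not a pattern the lab requires: `get_array_backend`, `xp`, and `column_zscore` are names made up for this illustration, and falling back to NumPy when no CUDA device is visible is an assumption about how you want the code to degrade.

```python
import numpy as np


def get_array_backend():
    """Return CuPy if it imports and a CUDA device is visible, otherwise NumPy."""
    try:
        import cupy as cp
        if cp.cuda.runtime.getDeviceCount() > 0:
            return cp
    except Exception:
        # CuPy is not installed, or the CUDA runtime reported no usable device.
        pass
    return np


xp = get_array_backend()


def column_zscore(x):
    """Standardize each column using only calls that exist in both libraries."""
    return (x - xp.mean(x, axis=0)) / (xp.std(x, axis=0) + 1e-8)


data = xp.arange(12, dtype=xp.float32).reshape(4, 3)
scaled = column_zscore(data)
print(f"backend: {xp.__name__}, result shape: {scaled.shape}")
```

The same `column_zscore` body runs unchanged on CPU or GPU; only the module bound to `xp` differs.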

---

## Part 0: Environment Setup

In [None]:
import numpy as np
import time
from typing import Tuple, Callable
import warnings
warnings.filterwarnings('ignore')

# Try to import CuPy
try:
    import cupy as cp
    HAS_CUPY = True
    print(f"✅ CuPy {cp.__version__} available")
    print(f"   CUDA version: {cp.cuda.runtime.runtimeGetVersion()}")
    
    # Get GPU info
    device = cp.cuda.Device(0)
    mem_info = device.mem_info
    print(f"   Device: {cp.cuda.runtime.getDeviceProperties(0)['name'].decode()}")
    print(f"   Free memory: {mem_info[0] / 1024**3:.1f} GB / {mem_info[1] / 1024**3:.1f} GB")
except ImportError:
    HAS_CUPY = False
    print("❌ CuPy not available")
    print("   Install with: pip install cupy-cuda12x")
    print("   Or in NGC container: already included!")

# PyTorch for interop demo
try:
    import torch
    HAS_TORCH = True
    print(f"\n✅ PyTorch {torch.__version__} available")
except ImportError:
    HAS_TORCH = False

---

## Part 1: CuPy Basics - Drop-in NumPy Replacement

Let's start with simple operations to see how CuPy mirrors NumPy.

In [None]:
if HAS_CUPY:
    print("🔧 Basic CuPy Operations")
    print("="*50)
    
    # Creating arrays
    a_np = np.array([1, 2, 3, 4, 5], dtype=np.float32)
    a_cp = cp.array([1, 2, 3, 4, 5], dtype=cp.float32)  # Same syntax!
    
    print(f"NumPy array: {a_np}, type: {type(a_np)}")
    print(f"CuPy array:  {a_cp}, type: {type(a_cp)}")
    
    # Array operations
    print(f"\n📊 Element-wise operations:")
    print(f"np.sqrt(a): {np.sqrt(a_np)}")
    print(f"cp.sqrt(a): {cp.sqrt(a_cp)}")
    
    # Reductions
    print(f"\n📊 Reductions:")
    print(f"np.sum(a): {np.sum(a_np)}")
    print(f"cp.sum(a): {cp.sum(a_cp)}")
    
    # Matrix operations
    print(f"\n📊 Matrix operations:")
    A_np = np.random.randn(3, 3).astype(np.float32)
    A_cp = cp.asarray(A_np)  # Convert NumPy to CuPy
    
    print(f"NumPy matrix multiply: {np.dot(A_np, A_np)[0]}")
    print(f"CuPy matrix multiply:  {cp.dot(A_cp, A_cp)[0]}")

### Converting Between NumPy and CuPy

In [None]:
if HAS_CUPY:
    print("🔄 Data Transfer Between CPU and GPU")
    print("="*50)
    
    # NumPy → CuPy (CPU → GPU)
    cpu_array = np.random.randn(1000, 1000).astype(np.float32)
    
    start = time.perf_counter()
    gpu_array = cp.asarray(cpu_array)  # Copy to GPU
    cp.cuda.Stream.null.synchronize()
    time_to_gpu = time.perf_counter() - start
    
    print(f"CPU → GPU transfer ({cpu_array.nbytes / 1e6:.1f} MB): {time_to_gpu*1000:.2f} ms")
    print(f"Effective bandwidth: {cpu_array.nbytes / time_to_gpu / 1e9:.1f} GB/s")
    
    # CuPy → NumPy (GPU → CPU)
    start = time.perf_counter()
    cpu_array_back = cp.asnumpy(gpu_array)  # Copy to CPU
    time_to_cpu = time.perf_counter() - start
    
    print(f"GPU → CPU transfer: {time_to_cpu*1000:.2f} ms")
    
    # Alternative: .get() method
    cpu_array_back2 = gpu_array.get()  # Same as cp.asnumpy()
    
    print(f"\n💡 On DGX Spark with unified memory, these transfers are faster")
    print(f"   than on discrete GPUs because CPU and GPU share the same memory!")

---

## Part 2: Benchmarking CuPy vs NumPy

Let's see the real speedups on common operations.

In [None]:
def benchmark(name: str, numpy_func: Callable, cupy_func: Callable, 
              np_args: tuple, cp_args: tuple, iterations: int = 10):
    """Benchmark NumPy vs CuPy function."""
    # Warm up
    _ = numpy_func(*np_args)
    _ = cupy_func(*cp_args)
    if HAS_CUPY:
        cp.cuda.Stream.null.synchronize()
    
    # NumPy timing
    start = time.perf_counter()
    for _ in range(iterations):
        _ = numpy_func(*np_args)
    time_np = (time.perf_counter() - start) / iterations
    
    # CuPy timing
    start = time.perf_counter()
    for _ in range(iterations):
        _ = cupy_func(*cp_args)
    if HAS_CUPY:
        cp.cuda.Stream.null.synchronize()
    time_cp = (time.perf_counter() - start) / iterations
    
    speedup = time_np / time_cp
    print(f"{name:<30} NumPy: {time_np*1000:>8.2f}ms  CuPy: {time_cp*1000:>8.2f}ms  Speedup: {speedup:>6.1f}x")
    return speedup


if HAS_CUPY:
    print("📊 CuPy vs NumPy Benchmark")
    print("="*80)
    
    # Test arrays
    N = 5000
    M = 5000
    
    A_np = np.random.randn(N, M).astype(np.float32)
    B_np = np.random.randn(N, M).astype(np.float32)
    x_np = np.random.randn(M).astype(np.float32)
    
    A_cp = cp.asarray(A_np)
    B_cp = cp.asarray(B_np)
    x_cp = cp.asarray(x_np)
    
    print(f"\nArray size: {N}×{M} ({N*M*4/1e6:.0f} MB per array)\n")
    
    # Element-wise operations
    benchmark("Element-wise multiply", 
              lambda a, b: a * b, lambda a, b: a * b,
              (A_np, B_np), (A_cp, B_cp))
    
    benchmark("Element-wise exp",
              lambda a: np.exp(a), lambda a: cp.exp(a),
              (A_np,), (A_cp,))
    
    benchmark("Element-wise sin",
              lambda a: np.sin(a), lambda a: cp.sin(a),
              (A_np,), (A_cp,))
    
    # Reductions
    benchmark("Sum (all elements)",
              lambda a: np.sum(a), lambda a: cp.sum(a),
              (A_np,), (A_cp,))
    
    benchmark("Sum (along axis)",
              lambda a: np.sum(a, axis=1), lambda a: cp.sum(a, axis=1),
              (A_np,), (A_cp,))
    
    benchmark("Mean",
              lambda a: np.mean(a), lambda a: cp.mean(a),
              (A_np,), (A_cp,))
    
    benchmark("Standard deviation",
              lambda a: np.std(a), lambda a: cp.std(a),
              (A_np,), (A_cp,))
    
    # Linear algebra
    benchmark("Matrix multiply",
              lambda a, b: np.dot(a, b.T), lambda a, b: cp.dot(a, b.T),
              (A_np, B_np), (A_cp, B_cp))
    
    benchmark("Matrix-vector multiply",
              lambda a, x: np.dot(a, x), lambda a, x: cp.dot(a, x),
              (A_np, x_np), (A_cp, x_cp))
    
    # Smaller array for expensive operations
    A_small_np = A_np[:1000, :1000].copy()
    A_small_cp = cp.asarray(A_small_np)
    
    benchmark("SVD (1000×1000)",
              lambda a: np.linalg.svd(a), lambda a: cp.linalg.svd(a),
              (A_small_np,), (A_small_cp,), iterations=3)
    
    benchmark("QR decomposition (1000×1000)",
              lambda a: np.linalg.qr(a), lambda a: cp.linalg.qr(a),
              (A_small_np,), (A_small_cp,), iterations=5)
    
    # FFT
    benchmark("2D FFT",
              lambda a: np.fft.fft2(a), lambda a: cp.fft.fft2(a),
              (A_np,), (A_cp,))
    
    # Sorting
    benchmark("Sort (flatten)",
              lambda a: np.sort(a.flatten()), lambda a: cp.sort(a.flatten()),
              (A_np,), (A_cp,))

### 🔍 Understanding the Results

**Large speedups (10-100x):**
- Matrix multiplication: Highly parallel, uses cuBLAS
- FFT: Uses cuFFT, highly optimized
- Element-wise operations: Embarrassingly parallel

**Moderate speedups (2-10x):**
- Reductions: Need to aggregate across threads
- Sorting: Complex algorithms, but GPU helps

**Similar or slower:**
- Very small arrays: Kernel-launch and transfer overhead dominates
- Operations that don't parallelize well

**Rule of thumb:** If your array has > 10,000 elements and the operation is parallelizable, CuPy will likely be faster!
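
The crossover point depends on your hardware, so it is worth measuring rather than assuming. The sketch below times `exp` at a few array sizes and synchronizes the GPU before reading the clock. `time_op`, `sizes`, and `timings` are illustrative names, the chosen sizes are arbitrary, and the GPU column is skipped when CuPy or a CUDA device is unavailable (an assumption made so the snippet still runs on a CPU-only machine).

```python
import time
import numpy as np

try:
    import cupy as cp
    HAS_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy missing, or no usable CUDA device
    cp = None
    HAS_GPU = False


def time_op(func, arg, repeats=5):
    """Average wall-clock seconds per call, waiting for the GPU if one is used."""
    func(arg)  # warm-up: the first CuPy call may compile a kernel
    if HAS_GPU:
        cp.cuda.Stream.null.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        func(arg)
    if HAS_GPU:
        cp.cuda.Stream.null.synchronize()  # wait for queued kernels to finish
    return (time.perf_counter() - start) / repeats


sizes = [1_000, 100_000, 1_000_000]
timings = {}
for n in sizes:
    x_np = np.random.randn(n).astype(np.float32)
    t_np = time_op(np.exp, x_np)
    t_cp = time_op(cp.exp, cp.asarray(x_np)) if HAS_GPU else None
    timings[n] = (t_np, t_cp)
    gpu_text = f"{t_cp * 1e6:9.1f} us" if t_cp is not None else "   (no GPU)"
    print(f"n={n:>9,}  NumPy: {t_np * 1e6:9.1f} us  CuPy: {gpu_text}")
```

On a GPU machine, the size at which the CuPy column drops below the NumPy column is your practical threshold for this operation.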

---

## Part 3: Real-World Pipeline - Data Preprocessing

Let's port a realistic data preprocessing pipeline to GPU.

In [None]:
# Generate synthetic tabular dataset (like you might have in ML)
np.random.seed(42)

N_SAMPLES = 1_000_000  # 1 million rows
N_FEATURES = 100       # 100 features

print(f"📊 Creating synthetic dataset: {N_SAMPLES:,} samples × {N_FEATURES} features")
print(f"   Total size: {N_SAMPLES * N_FEATURES * 4 / 1e9:.2f} GB")

# Raw data with some realistic properties
data_np = np.random.randn(N_SAMPLES, N_FEATURES).astype(np.float32)

# Add some missing values (NaN)
mask = np.random.random((N_SAMPLES, N_FEATURES)) < 0.05  # 5% missing
data_np[mask] = np.nan

# Add some outliers
outlier_mask = np.random.random((N_SAMPLES, N_FEATURES)) < 0.01  # 1% outliers
data_np[outlier_mask] = data_np[outlier_mask] * 10

print(f"   Missing values: {np.isnan(data_np).sum():,} ({100*np.isnan(data_np).mean():.1f}%)")
print(f"   Memory: {data_np.nbytes / 1e9:.2f} GB")

In [None]:
def preprocess_numpy(data: np.ndarray) -> np.ndarray:
    """
    NumPy preprocessing pipeline.
    
    Steps:
    1. Fill NaN with column median
    2. Clip outliers (outside 3 std devs)
    3. Standardize (z-score normalization)
    4. Apply tanh transformation (bounded output)
    """
    result = data.copy()
    
    # Step 1: Fill NaN with column median
    for col in range(result.shape[1]):
        col_data = result[:, col]
        median = np.nanmedian(col_data)
        col_data[np.isnan(col_data)] = median
    
    # Step 2: Clip outliers
    mean = np.mean(result, axis=0, keepdims=True)
    std = np.std(result, axis=0, keepdims=True)
    lower = mean - 3 * std
    upper = mean + 3 * std
    result = np.clip(result, lower, upper)
    
    # Step 3: Standardize (recompute after clipping)
    mean = np.mean(result, axis=0, keepdims=True)
    std = np.std(result, axis=0, keepdims=True)
    result = (result - mean) / (std + 1e-8)
    
    # Step 4: Apply tanh
    result = np.tanh(result)
    
    return result


# Time NumPy version
print("\n⏱️  NumPy Preprocessing Pipeline")
start = time.perf_counter()
result_np = preprocess_numpy(data_np)
time_numpy = time.perf_counter() - start
print(f"   Time: {time_numpy:.2f} seconds")
print(f"   Throughput: {N_SAMPLES * N_FEATURES / time_numpy / 1e6:.1f} million elements/sec")

In [None]:
if HAS_CUPY:
    def preprocess_cupy(data: cp.ndarray) -> cp.ndarray:
        """
        CuPy preprocessing pipeline - same logic, different library!
        """
        result = data.copy()
        
        # Step 1: Fill NaN with column median
        # CuPy's nanmedian works the same way
        for col in range(result.shape[1]):
            col_data = result[:, col]
            median = cp.nanmedian(col_data)
            col_data[cp.isnan(col_data)] = median
        
        # Step 2: Clip outliers
        mean = cp.mean(result, axis=0, keepdims=True)
        std = cp.std(result, axis=0, keepdims=True)
        lower = mean - 3 * std
        upper = mean + 3 * std
        result = cp.clip(result, lower, upper)
        
        # Step 3: Standardize
        mean = cp.mean(result, axis=0, keepdims=True)
        std = cp.std(result, axis=0, keepdims=True)
        result = (result - mean) / (std + 1e-8)
        
        # Step 4: Apply tanh
        result = cp.tanh(result)
        
        return result
    
    
    # Transfer to GPU and time
    print("\n⏱️  CuPy Preprocessing Pipeline")
    
    # Transfer
    start = time.perf_counter()
    data_cp = cp.asarray(data_np)
    cp.cuda.Stream.null.synchronize()
    time_transfer = time.perf_counter() - start
    print(f"   CPU→GPU transfer: {time_transfer:.3f} seconds")
    
    # Warm up
    _ = preprocess_cupy(data_cp[:1000])
    cp.cuda.Stream.null.synchronize()
    
    # Time CuPy version
    start = time.perf_counter()
    result_cp = preprocess_cupy(data_cp)
    cp.cuda.Stream.null.synchronize()
    time_cupy = time.perf_counter() - start
    
    print(f"   Processing time: {time_cupy:.3f} seconds")
    print(f"   Throughput: {N_SAMPLES * N_FEATURES / time_cupy / 1e6:.1f} million elements/sec")
    
    # Transfer back
    start = time.perf_counter()
    result_cp_np = cp.asnumpy(result_cp)
    time_back = time.perf_counter() - start
    print(f"   GPU→CPU transfer: {time_back:.3f} seconds")
    
    # Verify correctness
    print(f"\n✅ Results match: {np.allclose(result_np, result_cp_np, rtol=1e-4, equal_nan=True)}")
    
    # Summary
    total_cupy = time_transfer + time_cupy + time_back
    print(f"\n📊 Summary:")
    print(f"   NumPy:      {time_numpy:.2f} seconds (total)")
    print(f"   CuPy:       {time_cupy:.3f} seconds (processing only)")
    print(f"   CuPy total: {total_cupy:.3f} seconds (with transfers)")
    print(f"\n   🚀 Speedup (processing): {time_numpy/time_cupy:.1f}x")
    print(f"   🚀 Speedup (total):      {time_numpy/total_cupy:.1f}x")

### 🔍 Key Observations

1. **Same code structure** - Just changed `np` to `cp`
2. **Transfer overhead matters** - If you're only processing once, transfers add up
3. **Keep data on GPU** - In real pipelines, data stays on GPU for multiple operations
4. **The loop is slow** - Even with CuPy, Python loops are bottlenecks
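
Observations 2 and 3 can be made concrete with a little arithmetic: transfers are paid once, while the processing saving is earned on every pass over the data. The helper below is a sketch of that calculation. `end_to_end_speedup` is a made-up name and the timings passed to it are hypothetical placeholders, not measurements from this lab.

```python
def end_to_end_speedup(t_cpu, t_gpu, t_to_gpu, t_to_cpu, n_passes=1):
    """Speedup when data is uploaded once, processed n_passes times, downloaded once.

    All arguments are wall-clock seconds for a single pass or a single transfer.
    """
    cpu_total = t_cpu * n_passes
    gpu_total = t_to_gpu + t_gpu * n_passes + t_to_cpu
    return cpu_total / gpu_total


# Hypothetical timings (seconds), chosen only to illustrate the effect.
single_pass = end_to_end_speedup(t_cpu=4.0, t_gpu=0.2, t_to_gpu=0.1, t_to_cpu=0.1)
ten_passes = end_to_end_speedup(t_cpu=4.0, t_gpu=0.2, t_to_gpu=0.1, t_to_cpu=0.1,
                                n_passes=10)

print(f"processing-only speedup: {4.0 / 0.2:.1f}x")   # 20.0x
print(f"one pass with transfers: {single_pass:.1f}x")  # about 10.0x
print(f"ten passes, one upload:  {ten_passes:.1f}x")   # about 18.2x
```

With these placeholder numbers, transfers halve the single-pass speedup, and keeping the data resident for ten passes recovers most of it.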

---

## Part 4: Optimizing the Pipeline (Vectorization)

The loop in our preprocessing is slow. Let's vectorize it!
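
Before the full pipeline, here is the idea on a 3×2 toy array. The cell that follows fills NaNs through index arrays. An equivalent formulation, sketched here, lets `where` broadcast the per-column medians across rows. The `xp` alias and the NumPy fallback are assumptions added so the snippet also runs without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    xp = cp if cp.cuda.runtime.getDeviceCount() > 0 else np
except Exception:  # CuPy missing, or no usable CUDA device
    xp = np

x = xp.asarray([[1.0, np.nan],
                [3.0, 4.0],
                [np.nan, 6.0]], dtype=xp.float32)

# One call computes every column's median while ignoring NaNs: shape (2,)
medians = xp.nanmedian(x, axis=0)

# medians (shape (2,)) broadcasts against x (shape (3, 2)), so each NaN is
# replaced by the median of its own column without a Python loop.
filled = xp.where(xp.isnan(x), medians, x)

# Bring the result to host memory for printing, whichever backend produced it.
filled_host = filled.get() if hasattr(filled, "get") else filled
print(filled_host)
```

Column 0 has the values 1 and 3 (median 2) and column 1 has 4 and 6 (median 5), so the two NaNs become 2 and 5.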

In [None]:
if HAS_CUPY:
    def preprocess_cupy_optimized(data: cp.ndarray) -> cp.ndarray:
        """
        Optimized CuPy preprocessing - no Python loops!
        """
        result = data.copy()
        
        # Step 1: Fill NaN with column median - VECTORIZED
        # Compute medians for all columns at once
        medians = cp.nanmedian(result, axis=0)  # Shape: (N_FEATURES,)
        # Create mask of NaN positions
        nan_mask = cp.isnan(result)
        # Look up the (row, column) position of every NaN, then fill each one
        # with the median of its column via integer-array indexing
        nan_rows, nan_cols = cp.where(nan_mask)
        result[nan_rows, nan_cols] = medians[nan_cols]
        
        # Steps 2-4: Already vectorized, same as before
        mean = cp.mean(result, axis=0, keepdims=True)
        std = cp.std(result, axis=0, keepdims=True)
        lower = mean - 3 * std
        upper = mean + 3 * std
        result = cp.clip(result, lower, upper)
        
        mean = cp.mean(result, axis=0, keepdims=True)
        std = cp.std(result, axis=0, keepdims=True)
        result = (result - mean) / (std + 1e-8)
        
        result = cp.tanh(result)
        
        return result
    
    
    # Benchmark optimized version
    print("\n⏱️  Optimized CuPy Pipeline (no Python loops)")
    
    # Warm up
    _ = preprocess_cupy_optimized(data_cp[:1000])
    cp.cuda.Stream.null.synchronize()
    
    start = time.perf_counter()
    result_opt = preprocess_cupy_optimized(data_cp)
    cp.cuda.Stream.null.synchronize()
    time_optimized = time.perf_counter() - start
    
    print(f"   Processing time: {time_optimized:.3f} seconds")
    print(f"   Throughput: {N_SAMPLES * N_FEATURES / time_optimized / 1e6:.1f} million elements/sec")
    
    # Verify
    result_opt_np = cp.asnumpy(result_opt)
    print(f"\n✅ Results match original: {np.allclose(result_np, result_opt_np, rtol=1e-4, equal_nan=True)}")
    
    print(f"\n📊 Comparison:")
    print(f"   NumPy:            {time_numpy:.2f} seconds")
    print(f"   CuPy (with loop): {time_cupy:.3f} seconds")
    print(f"   CuPy (optimized): {time_optimized:.3f} seconds")
    print(f"\n   🚀 Total speedup: {time_numpy/time_optimized:.1f}x faster than NumPy!")

---

## Part 5: CuPy Custom Kernels (RawKernel)

Sometimes CuPy's built-in functions aren't enough. You can write raw CUDA C!

In [None]:
if HAS_CUPY:
    # Custom CUDA kernel using CuPy's RawKernel
    fused_normalize_tanh_kernel = cp.RawKernel(r'''
    extern "C" __global__
    void fused_normalize_tanh(
        const float* input,
        const float* mean,
        const float* std,
        float* output,
        int n_samples,
        int n_features
    ) {
        // Each thread handles one element
        int idx = blockDim.x * blockIdx.x + threadIdx.x;
        int total = n_samples * n_features;
        
        if (idx < total) {
            int col = idx % n_features;  // Which feature
            
            // Normalize
            float normalized = (input[idx] - mean[col]) / (std[col] + 1e-8f);
            
            // Tanh
            output[idx] = tanhf(normalized);
        }
    }
    ''', 'fused_normalize_tanh')
    
    
    def preprocess_cupy_custom(data: cp.ndarray) -> cp.ndarray:
        """
        Preprocessing with custom fused kernel for normalize+tanh.
        """
        result = data.copy()
        n_samples, n_features = result.shape
        
        # Step 1: Fill NaN (same as before)
        medians = cp.nanmedian(result, axis=0)
        nan_mask = cp.isnan(result)
        nan_rows, nan_cols = cp.where(nan_mask)
        result[nan_rows, nan_cols] = medians[nan_cols]
        
        # Step 2: Clip (same as before)
        mean = cp.mean(result, axis=0)
        std = cp.std(result, axis=0)
        lower = mean - 3 * std
        upper = mean + 3 * std
        result = cp.clip(result, lower.reshape(1, -1), upper.reshape(1, -1))
        
        # Steps 3+4: Fused normalize + tanh with custom kernel
        mean = cp.mean(result, axis=0).astype(cp.float32)
        std = cp.std(result, axis=0).astype(cp.float32)
        
        # Make result contiguous for kernel
        result = cp.ascontiguousarray(result.astype(cp.float32))
        output = cp.empty_like(result)
        
        # Launch kernel
        total_elements = n_samples * n_features
        threads_per_block = 256
        blocks = (total_elements + threads_per_block - 1) // threads_per_block
        
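        # Python ints are passed to a RawKernel as 64-bit values, so cast the
        # sizes to match the kernel's 32-bit `int` parameters.
        fused_normalize_tanh_kernel(
            (blocks,), (threads_per_block,),
            (result, mean, std, output, np.int32(n_samples), np.int32(n_features))
        )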
        
        return output
    
    
    # Benchmark custom kernel version
    print("\n⏱️  CuPy with Custom Fused Kernel")
    
    # Warm up
    _ = preprocess_cupy_custom(data_cp[:1000])
    cp.cuda.Stream.null.synchronize()
    
    start = time.perf_counter()
    result_custom = preprocess_cupy_custom(data_cp)
    cp.cuda.Stream.null.synchronize()
    time_custom = time.perf_counter() - start
    
    print(f"   Processing time: {time_custom:.3f} seconds")
    
    # Verify
    result_custom_np = cp.asnumpy(result_custom)
    print(f"\n✅ Results match: {np.allclose(result_np, result_custom_np, rtol=1e-3)}")
    
    print(f"\n📊 All versions compared:")
    print(f"   NumPy:              {time_numpy:.2f}s")
    print(f"   CuPy (basic):       {time_cupy:.3f}s  ({time_numpy/time_cupy:.0f}x)")
    print(f"   CuPy (optimized):   {time_optimized:.3f}s  ({time_numpy/time_optimized:.0f}x)")
    print(f"   CuPy (custom):      {time_custom:.3f}s  ({time_numpy/time_custom:.0f}x)")
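
The kernel's index arithmetic can be checked on the CPU. The sketch below emulates the launch in plain Python and NumPy: the grid size comes from ceiling division, each "thread" computes a flat index, the bounds guard skips the surplus threads in the last block, and `idx % n_features` recovers the column. `launch_config` and the tiny 5×3 input are illustrative choices, and the nested loops stand in for threads that the GPU would run in parallel.

```python
import numpy as np


def launch_config(total_elements, threads_per_block=256):
    """Ceiling division: enough blocks to give every element its own thread."""
    blocks = (total_elements + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block


n_samples, n_features = 5, 3
x = np.arange(n_samples * n_features, dtype=np.float32).reshape(n_samples, n_features)
mean = x.mean(axis=0)
std = x.std(axis=0)

# 15 elements with 4 threads per block -> 4 blocks (16 threads, one left idle).
blocks, threads = launch_config(x.size, threads_per_block=4)

flat_in = x.ravel()                  # row-major, like the contiguous GPU buffer
flat_out = np.empty_like(flat_in)
for block_idx in range(blocks):      # blockIdx.x
    for thread_idx in range(threads):  # threadIdx.x
        idx = block_idx * threads + thread_idx
        if idx < flat_in.size:       # bounds guard for the last, partly used block
            col = idx % n_features   # column (feature) of this flat index
            z = (flat_in[idx] - mean[col]) / (std[col] + 1e-8)
            flat_out[idx] = np.tanh(z)

emulated = flat_out.reshape(x.shape)
reference = np.tanh((x - mean) / (std + 1e-8))
print("blocks:", blocks, "matches vectorized result:", np.allclose(emulated, reference))
```

For the lab's dataset of 1,000,000 × 100 elements and 256 threads per block, the same formula gives 390,625 blocks.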

---

## Part 6: CuPy ↔ PyTorch Interoperability

One of CuPy's superpowers: zero-copy sharing with PyTorch!
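
Zero-copy sharing works through DLPack, a library-neutral protocol: the producer exposes its buffer via `__dlpack__`, and the consumer's `from_dlpack` wraps that buffer instead of copying it. The mechanism can be tried on the CPU with NumPy alone, as sketched below. This assumes NumPy 1.22 or newer, where `np.from_dlpack` was introduced. The CuPy and PyTorch calls in the next cell follow the same shape on GPU memory.

```python
import numpy as np

# Producer: any array object that implements __dlpack__.
source = np.arange(6, dtype=np.float32)

# Consumer: from_dlpack wraps the producer's buffer rather than copying it.
view = np.from_dlpack(source)

# Write through the original handle; the consumer sees the change because
# both objects refer to the same memory.
source[0] = 999.0
shares = np.shares_memory(source, view)
print("shares memory:", shares, "| view[0] =", float(view[0]))
```

Because no copy is made, the two handles are aliases: an in-place change through one is visible through the other, which is exactly what the GPU demo verifies.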

In [None]:
if HAS_CUPY and HAS_TORCH:
    print("🔄 CuPy ↔ PyTorch Zero-Copy Interop")
    print("="*50)
    
    # Create CuPy array
    cp_array = cp.random.randn(1000, 1000).astype(cp.float32)
    print(f"CuPy array: {cp_array.shape}, device: GPU")
    
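    # Method 1: CuPy → PyTorch via DLPack (recommended, zero-copy)
    # Recent CuPy and PyTorch releases exchange arrays directly through the
    # __dlpack__ protocol; older releases need cp_array.toDlpack() here.
    torch_tensor = torch.from_dlpack(cp_array)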
    print(f"PyTorch tensor: {torch_tensor.shape}, device: {torch_tensor.device}")
    
    # Verify they share memory (modifying one affects the other)
    cp_array[0, 0] = 999.0
    print(f"\nAfter setting cp_array[0,0] = 999:")
    print(f"  torch_tensor[0,0] = {torch_tensor[0,0].item()}")
    print(f"  ✅ Zero-copy: same memory!")
    
    # Method 2: PyTorch → CuPy
    torch_tensor2 = torch.randn(500, 500, device='cuda')
    cp_array2 = cp.from_dlpack(torch_tensor2)
    print(f"\nPyTorch → CuPy: {cp_array2.shape}")
    
    # Use case: Preprocess with CuPy, train with PyTorch
    print("\n💡 Typical workflow:")
    print("   1. Load data with CuPy/NumPy")
    print("   2. Preprocess with CuPy (fast!)")
    print("   3. Zero-copy convert to PyTorch")
    print("   4. Train model with PyTorch")
    print("   5. No memory copies = maximum performance!")

In [None]:
if HAS_CUPY and HAS_TORCH:
    # Practical example: Preprocess → Train pipeline
    print("\n📊 End-to-End Pipeline Demo")
    print("="*50)
    
    # Simulate raw data arriving
    raw_data = np.random.randn(10000, 100).astype(np.float32)
    
    # Transfer to GPU with CuPy
    cp_data = cp.asarray(raw_data)
    
    # Preprocess with CuPy
    mean = cp.mean(cp_data, axis=0, keepdims=True)
    std = cp.std(cp_data, axis=0, keepdims=True)
    preprocessed = (cp_data - mean) / (std + 1e-8)
    
    # Zero-copy to PyTorch
    torch_data = torch.from_dlpack(preprocessed)  # older releases: preprocessed.toDlpack()
    
    # Now use in PyTorch model
    model = torch.nn.Sequential(
        torch.nn.Linear(100, 50),
        torch.nn.ReLU(),
        torch.nn.Linear(50, 10)
    ).cuda()
    
    output = model(torch_data)
    print(f"Input shape: {torch_data.shape}")
    print(f"Output shape: {output.shape}")
    print(f"\n✅ Seamless CuPy → PyTorch pipeline!")

---

## ⚠️ Common Mistakes

### Mistake 1: Forgetting to Synchronize for Timing

In [None]:
if HAS_CUPY:
    # ❌ WRONG: Not synchronizing
    # start = time.perf_counter()
    # result = cp.sum(large_array)  # Returns as soon as the kernel is queued
    # elapsed = time.perf_counter() - start  # Measures launch time, not execution!
    
    # ✅ CORRECT: Synchronize before stopping the timer
    # start = time.perf_counter()
    # result = cp.sum(large_array)
    # cp.cuda.Stream.null.synchronize()  # Wait for GPU!
    # elapsed = time.perf_counter() - start
    
    print("💡 CuPy operations are asynchronous!")
    print("   Always call cp.cuda.Stream.null.synchronize() before reading the timer.")

### Mistake 2: Mixing NumPy and CuPy Arrays

In [None]:
if HAS_CUPY:
    # ❌ WRONG: Mixing array types
    # a_np = np.array([1, 2, 3])
    # b_cp = cp.array([4, 5, 6])
    # c = a_np + b_cp  # TypeError: CuPy does not transfer implicitly
    
    # ✅ CORRECT: Ensure same array type
    # a_cp = cp.array([1, 2, 3])
    # b_cp = cp.array([4, 5, 6])
    # c = a_cp + b_cp  # Both on GPU
    
    print("💡 Never mix NumPy and CuPy arrays in operations!")
    print("   Convert explicitly: cp.asarray(np_array) or cp.asnumpy(cp_array)")

### Mistake 3: Not Managing Memory

In [None]:
if HAS_CUPY:
    print("💡 Memory management tips:")
    print()
    print("   1. Clear unused arrays:")
    print("      del large_array")
    print("      cp.get_default_memory_pool().free_all_blocks()")
    print()
    print("   2. Check memory usage:")
    print("      mempool = cp.get_default_memory_pool()")
    print("      print(f'Used: {mempool.used_bytes() / 1e9:.2f} GB')")
    print()
    print("   3. Set a memory limit (here 8 GiB):")
    print("      cp.get_default_memory_pool().set_limit(size=8 * 1024**3)")
    print()
    print("   4. Use managed (unified) memory instead of the default pool:")
    print("      cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)")

---

## ✋ Try It Yourself: Image Preprocessing Pipeline

**Challenge:** Port this image preprocessing pipeline to CuPy.

In [None]:
def preprocess_images_numpy(images: np.ndarray) -> np.ndarray:
    """
    Preprocess batch of images.
    
    Input: (batch, height, width, channels) uint8 [0-255]
    Output: (batch, channels, height, width) float32, ImageNet-normalized (roughly [-2.1, 2.6])
    
    Steps:
    1. Convert to float32 and normalize to [0, 1]
    2. Apply per-channel normalization (ImageNet mean/std)
    3. Transpose to (batch, channels, height, width)
    4. Random horizontal flip (50% chance per image)
    """
    # ImageNet normalization constants
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    
    # Step 1: Normalize to [0, 1]
    result = images.astype(np.float32) / 255.0
    
    # Step 2: Per-channel normalization
    result = (result - mean) / std
    
    # Step 3: Transpose NHWC -> NCHW
    result = np.transpose(result, (0, 3, 1, 2))
    
    # Step 4: Random horizontal flip
    flip_mask = np.random.random(result.shape[0]) < 0.5
    result[flip_mask] = result[flip_mask, :, :, ::-1]
    
    return result


# TODO: Implement CuPy version
def preprocess_images_cupy(images):
    """
    CuPy version of image preprocessing.
    
    Hint: The code is almost identical! Just change np → cp.
    """
    # YOUR CODE HERE
    pass


# Test data
batch_size = 64
height, width = 224, 224
channels = 3

images = np.random.randint(0, 256, (batch_size, height, width, channels), dtype=np.uint8)

# Test NumPy version
np.random.seed(42)
result_np = preprocess_images_numpy(images)
print(f"NumPy result shape: {result_np.shape}")

# When implemented:
# np.random.seed(42)
# result_cp = preprocess_images_cupy(cp.asarray(images))
# print(f"CuPy result shape: {result_cp.shape}")
# print(f"Results match: {np.allclose(result_np, cp.asnumpy(result_cp))}")

<details>
<summary>💡 Solution</summary>

```python
def preprocess_images_cupy(images):
    mean = cp.array([0.485, 0.456, 0.406], dtype=cp.float32)
    std = cp.array([0.229, 0.224, 0.225], dtype=cp.float32)
    
    result = images.astype(cp.float32) / 255.0
    result = (result - mean) / std
    result = cp.transpose(result, (0, 3, 1, 2))
    
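    # Draw the mask with NumPy's RNG so that np.random.seed(42) reproduces the
    # same flips as the NumPy version; cp.random uses a separate generator.
    flip_mask = cp.asarray(np.random.random(result.shape[0]) < 0.5)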
    result[flip_mask] = result[flip_mask, :, :, ::-1]
    
    return result
```
</details>

---

## 🎉 Checkpoint

Congratulations! You've learned:

- ✅ **CuPy basics** - Drop-in NumPy replacement
- ✅ **Performance characteristics** - When CuPy shines
- ✅ **Vectorization** - Avoiding Python loops
- ✅ **Custom kernels** - RawKernel for specialized ops
- ✅ **PyTorch interop** - Zero-copy data sharing

You achieved **10x+ speedup** on a realistic preprocessing pipeline!

---

## 📖 Further Reading

- [CuPy Documentation](https://docs.cupy.dev/en/stable/)
- [CuPy User Guide](https://docs.cupy.dev/en/stable/user_guide/index.html)
- [CuPy User-Defined Kernels](https://docs.cupy.dev/en/stable/user_guide/kernel.html)
- [RAPIDS cuDF](https://github.com/rapidsai/cudf) - GPU-accelerated pandas

---

## 🧹 Cleanup

In [None]:
import gc

# Clean up
if HAS_CUPY:
    # Clear CuPy memory pool
    mempool = cp.get_default_memory_pool()
    pinned_mempool = cp.get_default_pinned_memory_pool()
    
    mempool.free_all_blocks()
    pinned_mempool.free_all_blocks()

if HAS_TORCH:
    torch.cuda.empty_cache()

gc.collect()

print("✅ GPU memory cleared!")
print("\n➡️ Ready for Lab 1.3.5: Profiling Workshop")