# CUDA Parallel Tutorial - High-Level GPU Programming
## Table of Contents

1. Introduction to cuda.cccl.parallel
2. Setting Up Your Environment
3. Understanding Parallel Algorithms
4. Your First Reduction
5. Working with Iterators
6. Scan Operations (Prefix Sums)
7. Sorting Algorithms
8. Transform Operations
9. Custom Data Types
10. Lab Exercises

## 1. Introduction to cuda.cccl.parallel
The `cuda.cccl.parallel` module provides a high-level, Pythonic interface to GPU programming. Unlike `cuda.core`, it abstracts away many low-level details while still providing excellent performance.

Think of `cuda.cccl.parallel` as a toolkit of pre-built, highly optimized parallel algorithms that you can use without writing any CUDA code yourself.

### What Makes It Special?

**High-Level Abstractions**: Instead of writing complex GPU kernels, you call simple Python functions like `reduce_into()`, `sort()`, or `scan()`.

**Performance**: These algorithms are written and tuned by NVIDIA engineers, and they aim to match the performance of hand-optimized CUDA kernels on supported GPU architectures.

**Pythonic**: Works seamlessly with NumPy arrays and CuPy arrays, using familiar Python syntax.

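**Minimal Memory Management**: You allocate the input and output arrays (for example with CuPy, as the examples in this tutorial do), and the functions used here take care of any temporary working storage.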

### When to Use cuda.cccl.parallel
**Best used for:**

* Data science and scientific computing
* When you need fast parallel algorithms (reduce, scan, sort, transform)
* Prototyping GPU-accelerated applications
* Learning parallel programming concepts
* When you want GPU performance without CUDA complexity

**Comparison with cuda.core:**

| Feature             | cuda.core              | cuda.cccl.parallel     |
|---------------------|------------------------|------------------------|
| Memory Management   | Manual                 | Mostly automatic       |
| Kernel Definition   | CUDA C/C++ strings     | Plain Python functions |
| Learning Curve      | Steep                  | Gentle                 |
| Performance Control | Maximum                | Good                   |
| Development Speed   | Slow                   | Fast                   |

## 2. Setting Up Your Environment
### Prerequisites

* NVIDIA GPU with CUDA capability
* CUDA driver version 12.2 or higher
* Python 3.8+

### Installation

In [None]:
# Install CUDA Python packages
pip install cuda-python

# Install cuda-cccl
pip install cuda-cccl

# Install numerical computing libraries
pip install cupy-cuda12x numpy

### Quick Verification

In [None]:
import numpy as np
import cupy as cp
import cuda.cccl.parallel.experimental as parallel

def verify_installation():
    """Test that cuda.cccl.parallel is working correctly"""
    
    print("=== CUDA Parallel Installation Test ===")
    
    # Test 1: Check if we can create GPU arrays
    try:
        test_array = cp.array([1, 2, 3, 4, 5], dtype=np.int32)
        print("✓ CuPy GPU arrays working")
    except Exception as e:
        print(f"✗ CuPy error: {e}")
        return False
    
    # Test 2: Simple reduction operation
    try:
        def add_op(a, b):
            return a + b
        
        # Input data
        d_input = cp.array([1, 2, 3, 4, 5], dtype=np.int32)
        d_output = cp.empty(1, dtype=np.int32)
        h_init = np.array([0], dtype=np.int32)
        
        # Perform reduction
        parallel.reduce_into(d_input, d_output, add_op, len(d_input), h_init)
        
        result = d_output.get()[0]  # Copy result back to CPU
        expected = 15  # 1+2+3+4+5
        
        if result == expected:
            print(f"✓ Parallel reduction working: {result}")
        else:
            print(f"✗ Reduction failed: got {result}, expected {expected}")
            return False
            
    except Exception as e:
        print(f"✗ Parallel operation error: {e}")
        return False
    
    print("✓ All tests passed! cuda.cccl.parallel is ready to use.")
    return True

# Run verification
verify_installation()

**What this test does**:
1. Creates GPU arrays: Verifies CuPy can allocate GPU memory
2. Tests parallel reduction: Confirms the parallel library can sum numbers on GPU
3. Checks results: Ensures the computation is correct

## 3. Understanding Parallel Algorithms
### What Are Parallel Algorithms?
**Sequential algorithm** (like a for-loop): Does one operation at a time

In [None]:
# Sequential sum - does one addition at a time
total = 0
for num in [1, 2, 3, 4, 5]:
    total += num  # One operation per step

**Parallel algorithm**: Does many operations simultaneously

In [None]:
# Parallel sum - combines pairs simultaneously
# Step 1: [1,2,3,4,5] → [3, 7, 5] (1+2=3, 3+4=7, 5 remains)
# Step 2: [3, 7, 5] → [10, 5] (3+7=10, 5 remains) 
# Step 3: [10, 5] → [15] (10+5=15)

### The Power of Parallelism
**Why parallel algorithms matter:**
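* Speed: Work spread across thousands of GPU threads can finish far sooner than the same work done one element at a time
* Scalability: The GPU's advantage grows with data size, because the fixed launch overhead is spread over more elements
* Efficiency: Better use of modern many-core hardware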

**Illustrative example:**

In [None]:
# Processing 1 million numbers (illustrative figures, not measurements)
# CPU sequential: ~100ms
# GPU parallel: ~1ms (about 100x faster)

### Core Algorithm Types
Let's understand the main types of parallel algorithms:

**1. Reduction: Combine all elements into one result**

In [None]:
# Examples: sum, max, min, average
# [1, 2, 3, 4, 5] → 15 (sum)
# [1, 2, 3, 4, 5] → 5 (max)

**2. Scan (Prefix Sum): Running total of elements**

In [None]:
# Inclusive scan (include current element)
# [1, 2, 3, 4, 5] → [1, 3, 6, 10, 15]

# Exclusive scan (exclude current element)
# [1, 2, 3, 4, 5] → [0, 1, 3, 6, 10]

**3. Sort: Arrange elements in order**

In [None]:
# [5, 2, 8, 1, 9] → [1, 2, 5, 8, 9]

**4. Transform: Apply function to each element**

In [None]:
# Square each element
# [1, 2, 3, 4, 5] → [1, 4, 9, 16, 25]
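
As a CPU reference for the four patterns above, here is a minimal sketch that uses only the Python standard library. It is not GPU code and does not call `cuda.cccl.parallel`; the variable names are chosen for illustration. Each statement computes sequentially what the corresponding parallel algorithm would produce.

```python
from functools import reduce
from itertools import accumulate
import operator

data = [1, 2, 3, 4, 5]

# Reduction: fold all elements into one value
total = reduce(operator.add, data, 0)
largest = reduce(max, data)

# Scan: running totals
inclusive = list(accumulate(data, operator.add))
exclusive = [0] + inclusive[:-1]  # shift right and start from the initial value 0

# Sort: arrange elements in ascending order
ordered = sorted([5, 2, 8, 1, 9])

# Transform: apply a function to every element
squares = [x * x for x in data]

print(total, largest)
print(inclusive, exclusive)
print(ordered)
print(squares)
```

Having a sequential reference like this is also handy for checking GPU results, which is how the later examples in this tutorial verify their output.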

#### Understanding Algorithm Complexity
**Why parallel algorithms are different:**

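**Sequential complexity**: O(n) steps, because elements are processed one after another

**Parallel complexity**: O(log n) steps when enough processors are available. The total work is still O(n), but it is spread across many threads.

**Example with 1 million elements:**
* Sequential: 1,000,000 steps, one per element
* Parallel: ~20 steps (log₂(1,000,000) ≈ 20), each combining many pairs at once

A real GPU has a finite number of threads and is limited by memory bandwidth, so measured speedups are smaller than this ratio suggests. Still, this is why GPU algorithms can be much faster on large inputs.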
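
To make the step counts concrete, the sketch below simulates a pairwise (tree) reduction on the CPU in plain Python. `tree_reduce` is a hypothetical helper written for this illustration, not part of `cuda.cccl.parallel`, and the library's actual kernels are organized differently. It counts both the number of rounds, which is what a machine with enough processors would wait for, and the total number of combine calls, which stays close to n.

```python
import operator

def tree_reduce(values, op):
    """Pairwise reduction of a non-empty sequence.

    Returns (result, rounds, combine_calls).
    """
    level = list(values)
    rounds = 0
    combines = 0
    while len(level) > 1:
        nxt = []
        # All pairs in one round are independent, so they could run in parallel
        for i in range(0, len(level) - 1, 2):
            nxt.append(op(level[i], level[i + 1]))
            combines += 1
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd element carries over unchanged
        level = nxt
        rounds += 1
    return level[0], rounds, combines

small = tree_reduce([1, 2, 3, 4, 5], operator.add)
large = tree_reduce(range(1, 1_000_001), operator.add)
print(small)   # (15, 3, 4)
print(large)   # (500000500000, 20, 999999)
```

The million-element run needs 20 rounds but still performs 999,999 additions: parallelism shortens the critical path, it does not reduce the amount of work.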

## 4. Your First Reduction
**What is a Reduction?**

A **reduction** takes many values and combines them into a single result using a binary operation (a function that takes two inputs).

**Common reductions:**
* Sum: Add all numbers together
* Maximum: Find the largest number
* Minimum: Find the smallest number
* Product: Multiply all numbers together

**Basic Sum Reduction**

Let's start with the simplest example, adding numbers:

In [None]:
import numpy as np
import cupy as cp
import cuda.cccl.parallel.experimental as parallel

def basic_sum_example():
    """Learn reduction by summing numbers on GPU"""
    
    print("=== Basic Sum Reduction ===")
    
    # Step 1: Define our operation (how to combine two numbers)
    def add_op(a, b):
        """Add two numbers together"""
        return a + b
    
    # Step 2: Create input data on GPU
    input_data = [1, 2, 3, 4, 5]
    print(f"Input data: {input_data}")
    
    d_input = cp.array(input_data, dtype=np.int32)  # Move to GPU
    print(f"Created GPU array with {len(d_input)} elements")
    
    # Step 3: Prepare output storage
    d_output = cp.empty(1, dtype=np.int32)  # Space for 1 result
    
    # Step 4: Set initial value (what to start the sum with)
    h_init = np.array([0], dtype=np.int32)  # Start from 0
    
    # Step 5: Perform the reduction
    parallel.reduce_into(
        d_input,        # Input array on GPU
        d_output,       # Output array on GPU  
        add_op,         # Function to combine elements
        len(d_input),   # Number of elements to process
        h_init          # Initial value
    )
    
    # Step 6: Get result back to CPU
    result = d_output.get()[0]
    expected = sum(input_data)
    
    print(f"GPU result: {result}")
    print(f"CPU verification: {expected}")
    print(f"Correct: {result == expected}")
    
    return result

# Run the example
basic_sum_example()

**Explanation**:
1. Define operation: `add_op(a, b)` tells the GPU how to combine two numbers
2. Create GPU data: `cp.array()` moves our data to GPU memory
3. Prepare output: `cp.empty(1)` allocates space for the single result
4. Set initial value: Start the sum from 0
5. Run reduction: The GPU combines all elements in parallel
6. Get result: `.get()` copies the result back to CPU memory

**Understanding the parallel execution:**

* Input: [1, 2, 3, 4, 5]
* Step 1: Pairs combine → [3, 7, 5] (1+2=3, 3+4=7, 5 carries over)
* Step 2: Continue → [10, 5] (3+7=10, 5 carries over)
* Step 3: Final → [15] (10+5=15)
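
One consequence of this pairing is that the combining operation should be associative, because a parallel reduction is free to regroup the operands. The plain-Python sketch below shows this on the CPU. `pairwise_reduce` is a hypothetical helper for illustration only, not a library function. Addition gives the same answer under either grouping; subtraction does not.

```python
from functools import reduce
import operator

def pairwise_reduce(values, op):
    """Combine neighbours round by round, like the steps shown above."""
    level = list(values)
    while len(level) > 1:
        paired = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            paired.append(level[-1])  # odd element carries over
        level = paired
    return level[0]

data = [1, 2, 3, 4, 5]

# Associative operation: left-to-right and pairwise grouping agree
seq_add = reduce(operator.add, data)
par_add = pairwise_reduce(data, operator.add)

# Non-associative operation: the grouping changes the answer
seq_sub = reduce(operator.sub, data)
par_sub = pairwise_reduce(data, operator.sub)

print(seq_add, par_add)  # 15 15
print(seq_sub, par_sub)  # -13 -5
```

Sum, max, min, and product are all associative on integers, which is why they are safe choices for `reduce_into`.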

**Finding Maximum Value**
Let's try a different reduction, finding the largest number:

In [None]:
def maximum_reduction_example():
    """Find the maximum value in an array"""
    
    print("\n=== Maximum Value Reduction ===")
    
    # Step 1: Define how to find maximum of two numbers
    def max_op(a, b):
        """Return the larger of two numbers"""
        return a if a > b else b
    
    # Step 2: Create test data with some large and small numbers
    input_data = [23, 7, 91, 15, 4, 88, 12, 77, 3, 99]
    print(f"Input data: {input_data}")
    
    d_input = cp.array(input_data, dtype=np.int32)
    d_output = cp.empty(1, dtype=np.int32)
    
    # Step 3: Set initial value (start with very small number)
    h_init = np.array([-999999], dtype=np.int32)  # Very small starting value
    
    # Step 4: Find maximum
    parallel.reduce_into(d_input, d_output, max_op, len(d_input), h_init)
    
    # Step 5: Compare with CPU result
    gpu_max = d_output.get()[0]
    cpu_max = max(input_data)
    
    print(f"GPU maximum: {gpu_max}")
    print(f"CPU maximum: {cpu_max}")
    print(f"Correct: {gpu_max == cpu_max}")
    
    return gpu_max

# Run the example
maximum_reduction_example()

**Why we use a very small initial value:**
* The reduction combines the initial value with our data
* If we started with 0, and all our numbers were negative, we'd get 0 (wrong!)
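* Starting with -999999 works here because every input is larger. For arbitrary `int32` data, the safe choice is the smallest representable value, `np.iinfo(np.int32).min`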
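
The effect of the initial value can be checked on the CPU with a small sketch. It uses `functools.reduce` as a sequential stand-in for the GPU reduction; `INT32_MIN` is a name introduced here for illustration.

```python
from functools import reduce

INT32_MIN = -2**31  # same value as np.iinfo(np.int32).min

def max_op(a, b):
    """Return the larger of two numbers"""
    return a if a > b else b

negatives = [-5, -2, -9]
wrong = reduce(max_op, negatives, 0)           # 0 is not in the data
right = reduce(max_op, negatives, INT32_MIN)   # true maximum

# A "very small" constant still fails once the data goes below it
very_small = [-2_000_000, -1_500_000]
still_wrong = reduce(max_op, very_small, -999999)

print(wrong, right, still_wrong)  # 0 -2 -999999
```

In general, the initial value should be the identity of the operation: 0 for a sum, 1 for a product, and the smallest (or largest) representable value for a max (or min).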

### Custom Reduction: Average
Let's create a more complex reduction to calculate the average:

In [None]:
def average_reduction_example():
    """Calculate average using reduction (more advanced)"""
    
    print("\n=== Average Calculation ===")
    
    # For average, we need both sum and count
    # We'll use a custom data structure
    
    @parallel.gpu_struct
    class SumCount:
        """Custom type to hold sum and count together"""
        total: np.float64
        count: np.int32
    
    def combine_sum_count(a, b):
        """Combine two SumCount structures"""
        return SumCount(a.total + b.total, a.count + b.count)
    
    def value_to_sum_count(value):
        """Convert a single value to SumCount"""
        return SumCount(float(value), 1)
    
    # Create input data
    input_data = [10, 20, 30, 40, 50]
    print(f"Input data: {input_data}")
    
    d_input = cp.array(input_data, dtype=np.float64)
    
    # Transform each value to SumCount structure
    transform_it = parallel.TransformIterator(d_input, value_to_sum_count)
    
    # Prepare output
    d_output = cp.empty(1, dtype=SumCount.dtype)
    h_init = SumCount(0.0, 0)
    
    # Perform reduction
    parallel.reduce_into(transform_it, d_output, combine_sum_count, len(d_input), h_init)
    
    # Calculate average
    result = d_output.get()[0]
    gpu_average = result['total'] / result['count']
    cpu_average = sum(input_data) / len(input_data)
    
    print(f"GPU average: {gpu_average}")
    print(f"CPU average: {cpu_average}")
    print(f"Correct: {abs(gpu_average - cpu_average) < 1e-10}")
    
    return gpu_average

# Run the example
average_reduction_example()

**Explanation:**
1. Custom type: We create `SumCount` to hold both sum and count
2. Transform: Convert each number to a `SumCount(value, 1)`
3. Combine: Add sums and counts separately
4. Final calculation: Divide total by count to get average

This shows how parallel algorithms can handle complex operations!
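
The same transform-then-combine logic can be traced on the CPU. The sketch below is a sequential stand-in that uses plain tuples instead of the `SumCount` GPU struct; it only illustrates the data flow, not how the library executes it.

```python
from functools import reduce

def value_to_sum_count(value):
    """Convert a single value to a (total, count) pair"""
    return (float(value), 1)

def combine_sum_count(a, b):
    """Add totals and counts separately"""
    return (a[0] + b[0], a[1] + b[1])

data = [10, 20, 30, 40, 50]

# map() plays the role of the TransformIterator: pairs are produced lazily
pairs = map(value_to_sum_count, data)

# (0.0, 0) is the identity for combine_sum_count, like h_init above
total, count = reduce(combine_sum_count, pairs, (0.0, 0))
average = total / count

print(total, count, average)  # 150.0 5 30.0
```

Note that the average itself is not a valid reduction operator, since averaging averages gives the wrong answer for groups of different sizes. Carrying the sum and count together avoids that problem.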

### Performance Comparison
Let's compare how long a reduction takes on the GPU and on the CPU:

In [None]:
import time

def performance_comparison():
    """Compare GPU vs CPU performance for large arrays"""
    
    print("\n=== Performance Comparison ===")
    
    # Test with different array sizes
    sizes = [1000, 10000, 100000, 1000000]
    
    def add_op(a, b):
        return a + b
    
    for size in sizes:
        print(f"\nTesting with {size:,} elements:")
        
        # Create large random array
        np.random.seed(42)  # For reproducible results
        data = np.random.randint(0, 100, size, dtype=np.int32)
        
        # Test CPU performance
        start_time = time.time()
        cpu_result = np.sum(data)  # NumPy sum
        cpu_time = time.time() - start_time
        
        # Test GPU performance
        d_input = cp.array(data)
        d_output = cp.empty(1, dtype=np.int32)
        h_init = np.array([0], dtype=np.int32)
        
        # Warm up GPU (first run is always slower)
        parallel.reduce_into(d_input, d_output, add_op, len(d_input), h_init)
        
        # Time the actual operation
        start_time = time.time()
        parallel.reduce_into(d_input, d_output, add_op, len(d_input), h_init)
        gpu_result = d_output.get()[0]
        gpu_time = time.time() - start_time
        
        # Calculate speedup
        speedup = cpu_time / gpu_time if gpu_time > 0 else float('inf')
        
        print(f"  CPU time: {cpu_time*1000:.2f} ms")
        print(f"  GPU time: {gpu_time*1000:.2f} ms") 
        print(f"  Speedup: {speedup:.1f}x")
        print(f"  Results match: {cpu_result == gpu_result}")

# Run performance comparison
performance_comparison()

**Expected results:**
* Small arrays (1,000 elements): GPU is often slower, because launch and transfer overhead dominates
* Large arrays (1,000,000+ elements): GPU is more likely to come out ahead; the exact speedup depends on your CPU, GPU, and data type
* The speedup generally increases with array size

**Why this happens:**
* GPU has a fixed setup overhead per call but massive parallel processing power
* CPU has very low per-call overhead, so it wins on small inputs
* As data grows, the fixed overhead is amortized and the GPU's parallel advantage becomes dominant
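
A single timed run, as in the comparison above, can be noisy. The sketch below shows one common way to get steadier numbers: warm up first, repeat the measurement, and keep the fastest run. `best_of` is a hypothetical helper written for this tutorial, and it is demonstrated on a CPU-only workload so it runs anywhere. When timing GPU code with it, the timed function should end with something that waits for the GPU, such as the `.get()` call used above.

```python
import time

def best_of(fn, repeats=5, warmup=1):
    """Return the fastest wall-clock time of fn() in seconds."""
    for _ in range(warmup):
        fn()  # untimed runs absorb one-time costs such as compilation
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

data = list(range(100_000))
elapsed = best_of(lambda: sum(data))
print(f"best of 5: {elapsed * 1000:.3f} ms")
```

Taking the minimum rather than the mean filters out runs that were disturbed by other activity on the machine.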

## 5. Working with Iterators
### What Are Iterators?

Iterators provide a way to represent sequences of data without needing to allocate memory for them. Think of them as "virtual arrays" that generate values on-demand.

**Benefits of iterators:**
* Memory efficient: No need to store all values in memory
* Composable: Can combine multiple iterators together
* Flexible: Generate sequences, transform data, reverse arrays
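
If you know Python's lazy iterators, the idea is similar. The sketch below is a rough CPU analogy built from `itertools`; it is not how the library's iterators are implemented, since those are evaluated inside GPU kernels rather than one element at a time. The standard-library functions stand in for a counting sequence, a constant sequence, and an on-the-fly transform.

```python
from itertools import count, islice, repeat

# Counting sequence: 10, 11, 12, ... (never stored as a list)
counting = count(10)
first_five = list(islice(counting, 5))

# Constant sequence: 7, 7, 7, ... summed over 1000 items
sevens_total = sum(islice(repeat(7), 1000))

# Transform applied lazily on top of a counting sequence
squares = map(lambda x: x * x, count(1))
sum_of_squares = sum(islice(squares, 5))  # 1 + 4 + 9 + 16 + 25

print(first_five)       # [10, 11, 12, 13, 14]
print(sevens_total)     # 7000
print(sum_of_squares)   # 55
```

In each case the consumer decides how many items to take, just as `reduce_into` receives the number of elements as a separate argument.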

### CountingIterator: Generate Number Sequences
The most basic iterator generates a sequence of consecutive numbers:

In [None]:
def counting_iterator_example():
    """Learn iterators by generating number sequences"""
    
    print("=== CountingIterator Example ===")
    
    def add_op(a, b):
        return a + b
    
    # Instead of creating an array [10, 11, 12, 13, 14], we use an iterator
    first_number = 10
    how_many = 5  # Generate 5 numbers: 10, 11, 12, 13, 14
    
    print(f"Generating sequence starting from {first_number}, {how_many} numbers")
    print(f"Virtual sequence: {list(range(first_number, first_number + how_many))}")
    
    # Create the counting iterator
    counting_it = parallel.CountingIterator(np.int32(first_number))
    
    # Prepare reduction
    d_output = cp.empty(1, dtype=np.int32)
    h_init = np.array([0], dtype=np.int32)
    
    # Sum the sequence without storing it in memory!
    parallel.reduce_into(counting_it, d_output, add_op, how_many, h_init)
    
    # Verify result
    gpu_result = d_output.get()[0]
    cpu_result = sum(range(first_number, first_number + how_many))
    
    print(f"GPU sum: {gpu_result}")
    print(f"CPU sum: {cpu_result}")
    print(f"Correct: {gpu_result == cpu_result}")
    
    # Show memory efficiency
    print(f"\nMemory efficiency:")
    print(f"  Array approach: would need {how_many * 4} bytes")
    print(f"  Iterator approach: needs ~0 bytes (generated on-demand)")

counting_iterator_example()

**Why this is powerful:**
* No memory allocation for the sequence
* Can generate huge sequences without running out of memory
* Perfect for mathematical sequences and patterns

### ConstantIterator: Repeat the Same Value
Sometimes you need a sequence of identical values:

In [None]:
def constant_iterator_example():
    """Use ConstantIterator to create sequences of repeated values"""
    
    print("\n=== ConstantIterator Example ===")
    
    def add_op(a, b):
        return a + b
    
    # Create a virtual array of 1000 repeated 7s: [7, 7, 7, ...]
    repeated_value = 7
    how_many = 1000
    
    print(f"Creating {how_many} copies of {repeated_value}")
    
    # Create constant iterator
    constant_it = parallel.ConstantIterator(np.int32(repeated_value))
    
    # Sum all the repeated values
    d_output = cp.empty(1, dtype=np.int32)
    h_init = np.array([0], dtype=np.int32)
    
    parallel.reduce_into(constant_it, d_output, add_op, how_many, h_init)
    
    gpu_result = d_output.get()[0]
    expected = repeated_value * how_many  # 7 * 1000 = 7000
    
    print(f"GPU result: {gpu_result}")
    print(f"Expected: {expected}")
    print(f"Correct: {gpu_result == expected}")
    
    # Real-world use case
    print(f"\nReal-world use: Initialize arrays, default values, padding")

constant_iterator_example()

**Use cases for ConstantIterator:**
* Initializing arrays with default values
* Creating padding for algorithms
* Mathematical operations with constants

### TransformIterator: Apply Functions On-The-Fly
TransformIterator provides a way to compose operations by applying a function to each element as it's accessed:

In [None]:
def transform_iterator_example():
    """Use TransformIterator to apply functions without storing intermediate results"""
    
    print("\n=== TransformIterator Example ===")
    
    def add_op(a, b):
        return a + b
    
    def square_op(x):
        """Square a number"""
        return x * x
    
    # We want to: 
    # 1. Generate numbers [1, 2, 3, 4, 5]
    # 2. Square each one [1, 4, 9, 16, 25]  
    # 3. Sum the squares = 55
    # All without storing intermediate arrays!
    
    first_number = 1
    how_many = 5
    
    print(f"Computing sum of squares from {first_number} to {first_number + how_many - 1}")
    
    # Step 1: Create counting iterator for [1, 2, 3, 4, 5]
    counting_it = parallel.CountingIterator(np.int32(first_number))
    
    # Step 2: Transform each number by squaring it
    transform_it = parallel.TransformIterator(counting_it, square_op)
    
    # Step 3: Sum the transformed values
    d_output = cp.empty(1, dtype=np.int32)
    h_init = np.array([0], dtype=np.int32)
    
    parallel.reduce_into(transform_it, d_output, add_op, how_many, h_init)
    
    # Verify
    gpu_result = d_output.get()[0]
    numbers = list(range(first_number, first_number + how_many))
    cpu_result = sum(x * x for x in numbers)
    
    print(f"Original numbers: {numbers}")
    print(f"Squared numbers: {[x*x for x in numbers]}")
    print(f"GPU sum of squares: {gpu_result}")
    print(f"CPU sum of squares: {cpu_result}")
    print(f"Correct: {gpu_result == cpu_result}")
    
    # Show the power of composition
    print(f"\nPower of composition:")
    print(f"  No intermediate arrays stored")
    print(f"  Operations fused together for efficiency")

transform_iterator_example()

**The magic of TransformIterator:**
1. Memory efficient: No intermediate arrays stored
2. Composable: Chain multiple transformations together
3. Efficient: Operations are "fused" together on the GPU

### Complex Iterator Composition
Let's combine multiple iterators to solve a real problem:

In [None]:
def complex_iterator_example():
    """Combine multiple iterators to solve: sum of squares of even numbers"""
    
    print("\n=== Complex Iterator Composition ===")
    
    def add_op(a, b):
        return a + b
    
    def square_if_even(x):
        """Square the number if it's even, otherwise return 0"""
        return (x * x) if (x % 2 == 0) else 0
    
    # Goal: From numbers 1-10, square the even ones and sum
    # Even numbers: 2, 4, 6, 8, 10
    # Squares: 4, 16, 36, 64, 100  
    # Sum: 220
    
    first_number = 1
    how_many = 10
    
    print(f"Finding sum of squares of even numbers from {first_number} to {first_number + how_many - 1}")
    
    # Chain operations together
    counting_it = parallel.CountingIterator(np.int32(first_number))
    transform_it = parallel.TransformIterator(counting_it, square_if_even)
    
    # Perform reduction
    d_output = cp.empty(1, dtype=np.int32)
    h_init = np.array([0], dtype=np.int32)
    
    parallel.reduce_into(transform_it, d_output, add_op, how_many, h_init)
    
    # Verify result
    gpu_result = d_output.get()[0]
    
    # CPU verification
    numbers = list(range(first_number, first_number + how_many))
    even_numbers = [x for x in numbers if x % 2 == 0]
    squares = [x * x for x in even_numbers]
    cpu_result = sum(squares)
    
    print(f"All numbers: {numbers}")
    print(f"Even numbers: {even_numbers}")
    print(f"Squares of evens: {squares}")
    print(f"GPU result: {gpu_result}")
    print(f"CPU result: {cpu_result}")
    print(f"Correct: {gpu_result == cpu_result}")

complex_iterator_example()

### Iterator Performance Benefits
Let's compare memory usage between arrays and iterators:

In [None]:
def iterator_performance_comparison():
    """Compare memory usage: arrays vs iterators"""
    
    print("\n=== Iterator Performance Comparison ===")
    
    def add_op(a, b):
        return a + b
    
    def square_op(x):
        return x * x
    
    # Test with large sequences
    sequence_size = 1000000  # 1 million numbers
    
    print(f"Processing {sequence_size:,} numbers")
    
    # Method 1: Using arrays (memory intensive)
    print("\nMethod 1: Using Arrays")
    try:
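        # This creates actual arrays in memory.
        # int64 is required: squares up to 10**12 would overflow int32.
        start_time = time.time()
        d_input = cp.arange(1, sequence_size + 1, dtype=np.int64)
        d_squared = cp.square(d_input)  # Another array
        # int() copies the result to the host, so the GPU work finishes before timing stops
        result_array = int(cp.sum(d_squared))
        array_time = time.time() - start_time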
        
        memory_used = d_input.nbytes + d_squared.nbytes
        print(f"  Time: {array_time*1000:.2f} ms")
        print(f"  Memory used: {memory_used / (1024**2):.1f} MB")
        print(f"  Result: {result_array}")
        
    except Exception as e:
        print(f"  Failed: {e}")
        array_time = float('inf')
        memory_used = float('inf')
    
    # Method 2: Using iterators (memory efficient)  
    print("\nMethod 2: Using Iterators")
    start_time = time.time()
    
    # Count in int64 so the squares (up to 10**12) match the int64 accumulator and cannot overflow
    counting_it = parallel.CountingIterator(np.int64(1))
    transform_it = parallel.TransformIterator(counting_it, square_op)
    
    d_output = cp.empty(1, dtype=np.int64)  # Only need space for result
    h_init = np.array([0], dtype=np.int64)
    
    parallel.reduce_into(transform_it, d_output, add_op, sequence_size, h_init)
    result_iterator = d_output.get()[0]
    iterator_time = time.time() - start_time
    
    print(f"  Time: {iterator_time*1000:.2f} ms")
    print(f"  Memory used: {d_output.nbytes} bytes (~0 MB)")
    print(f"  Result: {result_iterator}")
    
    # Compare efficiency
    if array_time != float('inf'):
        speedup = array_time / iterator_time
        memory_savings = memory_used / d_output.nbytes
        print(f"\nComparison:")
        print(f"  Iterator is {speedup:.1f}x faster")
        print(f"  Iterator uses {memory_savings:.0f}x less memory")
    else:
        print(f"\nArrays failed due to memory constraints, iterators succeeded!")

iterator_performance_comparison()

**Key takeaways about iterators:**
1. Memory efficient: Generate data on-demand, no storage needed
2. Composable: Chain operations together easily
3. Performance: Often faster due to better memory usage
4. Scalable: Work with sequences too large to fit in memory

## 6. Scan Operations (Prefix Sums)
#### What is a Scan Operation?
A **scan** (also called prefix sum) computes a running total of elements. For each position, it shows the cumulative result up to that point.

**Two types of scans:**
* Inclusive scan: Includes the current element in the sum
* Exclusive scan: Excludes the current element (shifts results)

**Visual example:**

* Input: [3, 1, 4, 1, 5]
* Inclusive: [3, 4, 8, 9, 14] (3, 3+1, 3+1+4, 3+1+4+1, 3+1+4+1+5)
* Exclusive: [0, 3, 4, 8, 9] (0, 3, 3+1, 3+1+4, 3+1+4+1)
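
A running total looks inherently sequential, so it helps to see one way it can be parallelized. The sketch below simulates the classic Hillis-Steele scheme on the CPU in plain Python. It is a textbook technique shown for intuition, not a description of how `cuda.cccl.parallel` implements its scans, and `hillis_steele_scan` is a hypothetical helper name. In each round, every position combines its value with the value `offset` places to its left, and the offset doubles, so n elements need about log₂(n) rounds.

```python
import operator

def hillis_steele_scan(values, op):
    """Inclusive scan; returns (result, number_of_rounds)."""
    result = list(values)
    n = len(result)
    offset = 1
    rounds = 0
    while offset < n:
        prev = result
        # Within a round every position is independent, so all could run in parallel
        result = [
            prev[i] if i < offset else op(prev[i - offset], prev[i])
            for i in range(n)
        ]
        offset *= 2
        rounds += 1
    return result, rounds

inclusive, rounds = hillis_steele_scan([3, 1, 4, 1, 5], operator.add)
# With the identity (0 for addition) as initial value, exclusive is a right shift
exclusive = [0] + inclusive[:-1]
running_max, _ = hillis_steele_scan([3, 7, 2, 9, 1, 8, 4, 6], max)

print(inclusive, rounds)  # [3, 4, 8, 9, 14] 3
print(exclusive)          # [0, 3, 4, 8, 9]
print(running_max)        # [3, 7, 7, 9, 9, 9, 9, 9]
```

The same function works for any associative operation, which is why the running maximum needs no separate algorithm.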

**Why Scans Are Useful**

Scans are building blocks for many parallel algorithms:
* Parallel selection: Filter arrays in parallel
* Stream compaction: Remove unwanted elements
* Histogram computation: Count occurrences efficiently
* Load balancing: Distribute work evenly

**Basic Inclusive Scan**

Let's start with an inclusive scan (running sum):

In [None]:
def inclusive_scan_example():
    """Learn inclusive scan with running sum"""
    
    print("=== Inclusive Scan Example ===")
    
    def add_op(a, b):
        return a + b
    
    # Input data
    input_data = [3, 1, 4, 1, 5, 9, 2, 6]
    print(f"Input data: {input_data}")
    
    # Setup GPU arrays
    d_input = cp.array(input_data, dtype=np.int32)
    d_output = cp.empty_like(d_input)  # Same size as input
    
    # Initial value (what to start the scan with)
    h_init = np.array([0], dtype=np.int32)
    
    # Perform inclusive scan
    parallel.inclusive_scan(
        d_input,        # Input array
        d_output,       # Output array (same size)
        add_op,         # Operation to apply
        h_init,         # Initial value
        len(d_input)    # Number of elements
    )
    
    # Get results
    gpu_result = d_output.get()
    
    # CPU verification
    cpu_result = np.cumsum(input_data)  # NumPy cumulative sum
    
    print(f"GPU inclusive scan: {gpu_result}")
    print(f"CPU cumulative sum: {cpu_result}")
    print(f"Results match: {np.array_equal(gpu_result, cpu_result)}")
    
    # Show step-by-step breakdown
    print(f"\nStep-by-step breakdown:")
    for i, (inp, out) in enumerate(zip(input_data, gpu_result)):
        running_sum = sum(input_data[:i+1])
        print(f"  Position {i}: input={inp}, running_sum={running_sum}, output={out}")

inclusive_scan_example()

**Understanding inclusive scan:**
* Each output position contains the sum from start up to (and including) that position
* Position 0: sum of elements 0 to 0 = 3
* Position 1: sum of elements 0 to 1 = 3+1 = 4
* Position 2: sum of elements 0 to 2 = 3+1+4 = 8
* And so on...

#### Exclusive Scan
Exclusive scan shifts the results by one position and starts with the initial value:

In [None]:
def exclusive_scan_example():
    """Learn exclusive scan with shifted running sum"""
    
    print("\n=== Exclusive Scan Example ===")
    
    def add_op(a, b):
        return a + b
    
    # Same input data
    input_data = [3, 1, 4, 1, 5, 9, 2, 6]
    print(f"Input data: {input_data}")
    
    # Setup GPU arrays
    d_input = cp.array(input_data, dtype=np.int32)
    d_output = cp.empty_like(d_input)
    
    # Initial value
    h_init = np.array([0], dtype=np.int32)
    
    # Perform exclusive scan
    parallel.exclusive_scan(
        d_input,
        d_output,
        add_op,
        h_init,
        len(d_input)
    )
    
    # Get results
    gpu_result = d_output.get()
    
    # CPU verification (exclusive scan)
    cpu_result = np.concatenate([[0], np.cumsum(input_data)[:-1]])
    
    print(f"GPU exclusive scan: {gpu_result}")
    print(f"CPU exclusive scan: {cpu_result}")
    print(f"Results match: {np.array_equal(gpu_result, cpu_result)}")
    
    # Show step-by-step breakdown
    print(f"\nStep-by-step breakdown:")
    for i, (inp, out) in enumerate(zip(input_data, gpu_result)):
        if i == 0:
            running_sum = 0  # Initial value
        else:
            running_sum = sum(input_data[:i])
        print(f"  Position {i}: input={inp}, sum_before={running_sum}, output={out}")

exclusive_scan_example()

**Understanding exclusive scan:**
* Each output position contains the sum from start up to (but excluding) that position
* Position 0: sum before position 0 = 0 (initial value)
* Position 1: sum before position 1 = 3
* Position 2: sum before position 2 = 3+1 = 4
* And so on...

### Maximum Scan
Scans aren't just for addition. Let's compute a running maximum:

In [None]:
def maximum_scan_example():
    """Find running maximum using scan"""
    
    print("\n=== Maximum Running Scan ===")
    
    def max_op(a, b):
        return a if a > b else b
    
    # Input data with various values
    input_data = [3, 7, 2, 9, 1, 8, 4, 6]
    print(f"Input data: {input_data}")
    
    # Setup
    d_input = cp.array(input_data, dtype=np.int32)
    d_output = cp.empty_like(d_input)
    
    # Start with very small value
    h_init = np.array([-999999], dtype=np.int32)
    
    # Perform inclusive scan with max operation
    parallel.inclusive_scan(d_input, d_output, max_op, h_init, len(d_input))
    
    # Get results  
    gpu_result = d_output.get()
    
    # CPU verification
    cpu_result = np.maximum.accumulate(input_data)
    
    print(f"GPU running max: {gpu_result}")
    print(f"CPU running max: {cpu_result}")
    print(f"Results match: {np.array_equal(gpu_result, cpu_result)}")
    
    # Show interpretation
    print(f"\nInterpretation:")
    for i, (inp, out) in enumerate(zip(input_data, gpu_result)):
        print(f"  Position {i}: current={inp}, max_so_far={out}")

maximum_scan_example()

#### Practical Application: Parallel Selection
Here is a real-world use case: selecting the elements that meet a condition.

In [None]:
def parallel_selection_example():
    """Use scan to select elements in parallel (stream compaction)"""
    
    print("\n=== Parallel Selection with Scan ===")
    
    def add_op(a, b):
        return a + b
    
    # Problem: Select all even numbers from an array
    input_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(f"Input data: {input_data}")
    
    # Step 1: Create selection mask (1 for even, 0 for odd)
    def is_even(x):
        return 1 if x % 2 == 0 else 0
    
    d_input = cp.array(input_data, dtype=np.int32)
    
    # Create the mask lazily with a transform iterator:
    # is_even is applied to each element of d_input as it is read on the GPU.
    # (Device code cannot index the host-side Python list input_data.)
    mask_it = parallel.TransformIterator(d_input, is_even)
    
    # Step 2: Compute exclusive scan of mask (gives output positions)
    d_positions = cp.empty(len(input_data), dtype=np.int32)
    h_init = np.array([0], dtype=np.int32)
    
    parallel.exclusive_scan(mask_it, d_positions, add_op, h_init, len(input_data))
    
    # Get results
    positions = d_positions.get()
    mask = [is_even(x) for x in input_data]
    
    print(f"Selection mask: {mask}")
    print(f"Output positions: {positions}")
    
    # Step 3: Extract selected elements
    selected_elements = []
    for i, (value, selected, pos) in enumerate(zip(input_data, mask, positions)):
        if selected:
            print(f"  Element {value} at input[{i}] goes to output[{pos}]")
            selected_elements.append(value)
    
    print(f"Selected even numbers: {selected_elements}")
    
    # Verify with simple CPU method
    cpu_selected = [x for x in input_data if x % 2 == 0]
    print(f"CPU selected: {cpu_selected}")
    print(f"Results match: {selected_elements == cpu_selected}")

parallel_selection_example()

**How parallel selection works:**
1. Create mask: Mark elements to keep (1) or discard (0)
2. Exclusive scan: Convert mask to output positions
3. Scatter: Place selected elements at computed positions

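On large arrays, all three steps can run in parallel on the GPU, which is what makes this approach faster than sequential filtering. In the example above, step 3 runs on the CPU only to keep the code short.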
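
The three steps can be traced end to end with a small CPU sketch in plain Python. `compact` is a hypothetical helper written for this illustration; on a GPU, each of the three loops would be a parallel operation, because no iteration depends on another once the positions are known.

```python
from itertools import accumulate

def compact(values, predicate):
    """Stream compaction: returns (mask, positions, output)."""
    # Step 1: mask marks elements to keep (1) or discard (0)
    mask = [1 if predicate(v) else 0 for v in values]

    # Step 2: exclusive scan of the mask gives each kept element its output index
    positions = ([0] + list(accumulate(mask)))[:-1]

    # Total kept = last position + last mask bit
    out_size = (positions[-1] + mask[-1]) if mask else 0
    output = [None] * out_size

    # Step 3: scatter each kept element to its computed position
    for value, keep, pos in zip(values, mask, positions):
        if keep:
            output[pos] = value
    return mask, positions, output

mask, positions, output = compact(list(range(1, 11)), lambda x: x % 2 == 0)
print(mask)       # [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
print(positions)  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
print(output)     # [2, 4, 6, 8, 10]
```

The positions of discarded elements are simply ignored; only entries where the mask is 1 are used as write indices.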

### Performance Comparison: Scan vs Sequential

In [None]:
def scan_performance_comparison():
    """Compare scan performance with sequential operations"""
    
    print("\n=== Scan Performance Comparison ===")
    
    def add_op(a, b):
        return a + b
    
    # Test with different sizes
    sizes = [1000, 10000, 100000, 1000000]
    
    for size in sizes:
        print(f"\nTesting with {size:,} elements:")
        
        # Create test data
        np.random.seed(42)
        data = np.random.randint(1, 10, size, dtype=np.int32)
        
        # CPU sequential cumulative sum
        start_time = time.time()
        cpu_result = np.cumsum(data)
        cpu_time = time.time() - start_time
        
        # GPU parallel scan
        d_input = cp.array(data)
        d_output = cp.empty_like(d_input)
        h_init = np.array([0], dtype=np.int32)
        
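        # Warm up (the first call may include one-time setup), then wait for the GPU
        parallel.inclusive_scan(d_input, d_output, add_op, h_init, len(d_input))
        cp.cuda.Stream.null.synchronize()

        # Time the operation; .get() blocks until the result is on the host,
        # so the measurement includes the device-to-host copy
        start_time = time.time()
        parallel.inclusive_scan(d_input, d_output, add_op, h_init, len(d_input))
        gpu_result = d_output.get()
        gpu_time = time.time() - start_time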
        
        # Compare
        speedup = cpu_time / gpu_time if gpu_time > 0 else float('inf')
        
        print(f"  CPU time: {cpu_time*1000:.2f} ms")
        print(f"  GPU time: {gpu_time*1000:.2f} ms")
        print(f"  Speedup: {speedup:.1f}x")
        print(f"  Results match: {np.allclose(cpu_result, gpu_result)}")

scan_performance_comparison()

**Scan performance insights:**
* Small arrays: CPU might be faster due to GPU overhead
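* Large arrays: GPU can be substantially faster; the exact factor depends on your hardware and on whether host-device transfers are counted in the timing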
* Memory bound: Performance limited by memory bandwidth, not compute
* Scalability: GPU advantage increases with array size
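
Single-shot timings like the ones above are noisy, especially for small arrays. The following is a minimal sketch of a more robust timing helper. The function `benchmark` and its `sync` parameter are hypothetical names introduced here for illustration; when timing GPU code you would pass a blocking synchronize call (for example CuPy's `cp.cuda.Stream.null.synchronize`) as `sync`.

```python
import statistics
import time


def benchmark(fn, repeats=10, warmup=1, sync=None):
    """Return the median wall-clock time of fn() in seconds.

    sync is an optional zero-argument callable that blocks until queued
    work has finished (hypothetical hook for GPU code).
    """
    # Warm-up runs absorb one-time costs such as compilation or caching.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # make sure the work is finished before stopping the clock
        samples.append(time.perf_counter() - start)
    # The median is less sensitive to outliers than the mean.
    return statistics.median(samples)


data = list(range(100_000))
elapsed = benchmark(lambda: sum(data))
print(f"median: {elapsed * 1000:.3f} ms")
```

Reporting the median of several runs, after a warm-up, tends to give more repeatable numbers than a single `time.time()` difference.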

## 7. Sorting Algorithms
### Why GPU Sorting Matters
Sorting is fundamental to many algorithms:
* Data preprocessing: Organize data before analysis
* Search optimization: Binary search requires sorted data
* Grouping operations: Group similar items together
* Statistical analysis: Find medians, percentiles, ranges

**GPU sorting advantages:**
* Parallel comparison: Compare many pairs simultaneously
* High throughput: Process millions of elements per second
* Stable sorting: Preserve order of equal elements

### Basic Radix Sort
Let's start with basic array sorting using radix sort:

In [None]:
def basic_radix_sort_example():
    """Learn GPU sorting with radix sort"""
    
    print("=== Basic Radix Sort Example ===")
    
    # Create unsorted data
    input_data = [64, 34, 25, 12, 22, 11, 90, 5, 77, 30]
    print(f"Input data: {input_data}")
    
    # Setup GPU arrays
    d_input = cp.array(input_data, dtype=np.int32)
    d_output = cp.empty_like(d_input)
    
    # Perform radix sort (ascending order)
    parallel.radix_sort(
        d_input,                           # Input keys
        d_output,                          # Output keys
        None,                              # Input values (none)
        None,                              # Output values (none)
        parallel.SortOrder.ASCENDING,     # Sort order
        len(d_input)                       # Number of elements
    )
    
    # Get results
    gpu_result = d_output.get()
    cpu_result = sorted(input_data)
    
    print(f"GPU sorted: {gpu_result}")
    print(f"CPU sorted: {cpu_result}")
    print(f"Results match: {np.array_equal(gpu_result, cpu_result)}")
    
    # Show sorting verification
    is_sorted = all(gpu_result[i] <= gpu_result[i+1] for i in range(len(gpu_result)-1))
    print(f"Array is properly sorted: {is_sorted}")

basic_radix_sort_example()

**Understanding radix sort:**
* Radix sort: Sorts by processing digits/bits from least to most significant
* Non-comparative: Doesn't compare elements directly
* Stable: Equal elements maintain their relative order
* Fast: O(k*n) complexity where k is number of digits
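
The properties above can be illustrated with a small CPU-only sketch of least-significant-digit radix sort. This is not how the GPU implementation is written; the function name `lsd_radix_sort` and its parameters are assumptions made for illustration, and the sketch only handles non-negative integers.

```python
def lsd_radix_sort(keys, bits_per_pass=4, key_bits=32):
    """Sort non-negative integers below 2**key_bits, least significant digit first."""
    radix = 1 << bits_per_pass      # number of buckets per pass
    digit_mask = radix - 1          # extracts one digit after shifting

    # k = key_bits / bits_per_pass passes, each touching all n keys: O(k*n).
    for shift in range(0, key_bits, bits_per_pass):
        buckets = [[] for _ in range(radix)]
        for key in keys:
            # Keys are never compared with each other; each key is placed
            # by the value of its current digit only.
            buckets[(key >> shift) & digit_mask].append(key)
        # Appending in input order keeps each pass stable, which is what
        # lets later (more significant) passes preserve earlier ordering.
        keys = [key for bucket in buckets for key in bucket]
    return keys


print(lsd_radix_sort([64, 34, 25, 12, 22, 11, 90, 5, 77, 30]))
```

Stability of each pass is essential here: without it, sorting by a more significant digit would scramble the order established by the less significant ones.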

#### Descending Sort
Sorting in reverse order:

In [None]:
def descending_sort_example():
    """Sort in descending (largest to smallest) order"""
    
    print("\n=== Descending Sort Example ===")
    
    # Test data with duplicates
    input_data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
    print(f"Input data: {input_data}")
    
    # Setup
    d_input = cp.array(input_data, dtype=np.int32)
    d_output = cp.empty_like(d_input)
    
    # Sort in descending order
    parallel.radix_sort(
        d_input,
        d_output,
        None,
        None,
        parallel.SortOrder.DESCENDING,  # Reverse order
        len(d_input)
    )
    
    # Results
    gpu_result = d_output.get()
    cpu_result = sorted(input_data, reverse=True)
    
    print(f"GPU descending: {gpu_result}")
    print(f"CPU descending: {cpu_result}")
    print(f"Results match: {np.array_equal(gpu_result, cpu_result)}")
    
    # Verify descending order
    is_descending = all(gpu_result[i] >= gpu_result[i+1] for i in range(len(gpu_result)-1))
    print(f"Array is properly sorted (descending): {is_descending}")

descending_sort_example()

#### Key-Value Sorting
Often you want to sort one array (keys) while rearranging another array (values) to match:

In [None]:
def key_value_sort_example():
    """Sort keys while keeping values aligned"""
    
    print("\n=== Key-Value Sort Example ===")
    
    # Example: Sort students by grade, keep names aligned
    grades = [85, 92, 78, 96, 88, 71, 94]
    names = ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace"]
    
    print("Unsorted data:")
    for grade, name in zip(grades, names):
        print(f"  {name}: {grade}")
    
    # Setup GPU arrays
    d_keys = cp.array(grades, dtype=np.int32)
    d_values = cp.arange(len(names), dtype=np.int32)  # Use indices instead of strings
    
    d_keys_out = cp.empty_like(d_keys)
    d_values_out = cp.empty_like(d_values)
    
    # Sort by grades (keys) while rearranging indices (values)
    parallel.radix_sort(
        d_keys,                           # Input grades
        d_keys_out,                       # Output grades
        d_values,                         # Input indices
        d_values_out,                     # Output indices
        parallel.SortOrder.DESCENDING,   # Highest grade first
        len(d_keys)
    )
    
    # Get results
    sorted_grades = d_keys_out.get()
    sorted_indices = d_values_out.get()
    
    print("\nSorted data (by grade, highest first):")
    for grade, idx in zip(sorted_grades, sorted_indices):
        print(f"  {names[idx]}: {grade}")
    
    # Verify sorting
    cpu_pairs = list(zip(grades, range(len(names))))
    cpu_sorted = sorted(cpu_pairs, key=lambda x: x[0], reverse=True)
    cpu_grades = [pair[0] for pair in cpu_sorted]
    cpu_indices = [pair[1] for pair in cpu_sorted]
    
    print(f"\nVerification:")
    print(f"Grades match: {np.array_equal(sorted_grades, cpu_grades)}")
    print(f"Indices match: {np.array_equal(sorted_indices, cpu_indices)}")

key_value_sort_example()

**Key-value sorting applications:**
* Database operations: Sort records by one field
* Index arrays: Create sorted indices for data access
* Paired data: Keep related arrays synchronized
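
The index-array idea can be shown without a GPU. The sketch below is a plain-Python reference, reusing the grades and names from the example above: the values being sorted are row indices, and those indices are then used to gather any other column that shares the same rows.

```python
grades = [85, 92, 78, 96, 88, 71, 94]
names = ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace"]

# The "values" are row indices; sorting by key reorders them with the keys.
order = sorted(range(len(grades)), key=lambda i: grades[i], reverse=True)

# Gather: the sorted indices reorder every column that shares the rows,
# including data (such as strings) that never has to be sorted itself.
sorted_grades = [grades[i] for i in order]
sorted_names = [names[i] for i in order]

for name, grade in zip(sorted_names, sorted_grades):
    print(f"{name}: {grade}")
```

Sorting indices instead of the payload keeps the sorted values small and fixed-size, and the same index array can be reused for any number of additional columns.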

#### Merge Sort for Custom Comparisons
For more complex sorting criteria, use merge sort with custom comparison functions:

In [None]:
def merge_sort_example():
    """Use merge sort with custom comparison function"""
    
    print("\n=== Merge Sort with Custom Comparison ===")
    
    # Example: Sort by absolute value
    input_data = [-15, 3, -7, 22, -1, 8, -12, 5]
    print(f"Input data: {input_data}")
    
    def compare_absolute(a, b):
        """Compare by absolute value"""
        return np.uint8(abs(a) < abs(b))
    
    # Setup
    d_input = cp.array(input_data, dtype=np.int32)
    d_output = cp.empty_like(d_input)
    
    # Merge sort with custom comparison
    parallel.merge_sort(
        d_input,         # Input keys
        None,            # Input values (none)
        d_output,        # Output keys
        None,            # Output values (none)
        compare_absolute, # Custom comparison function
        len(d_input)     # Number of elements
    )
    
    # Results
    gpu_result = d_output.get()
    cpu_result = sorted(input_data, key=abs)
    
    print(f"GPU sort (by absolute value): {gpu_result}")
    print(f"CPU sort (by absolute value): {cpu_result}")
    print(f"Results match: {np.array_equal(gpu_result, cpu_result)}")
    
    # Show absolute values for clarity
    print(f"\nAbsolute values: {[abs(x) for x in gpu_result]}")
    
    # Verify proper ordering by absolute value
    abs_values = [abs(x) for x in gpu_result]
    is_sorted_by_abs = all(abs_values[i] <= abs_values[i+1] for i in range(len(abs_values)-1))
    print(f"Sorted by absolute value: {is_sorted_by_abs}")

merge_sort_example()
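
The comparison function in the example above is a "less than" predicate: it returns a true value when `a` should come before `b`. As a CPU-side sketch of the same convention, the hypothetical helper `less_to_key` below adapts such a predicate for Python's built-in `sorted`, which can be handy for checking a custom ordering before running it on the GPU.

```python
from functools import cmp_to_key


def less_to_key(less):
    """Adapt a less-than predicate for use as a key with sorted()."""
    def compare(a, b):
        if less(a, b):
            return -1   # a comes first
        if less(b, a):
            return 1    # b comes first
        return 0        # equivalent under this ordering
    return cmp_to_key(compare)


def abs_less(a, b):
    """Order by absolute value, like compare_absolute above."""
    return abs(a) < abs(b)


data = [-15, 3, -7, 22, -1, 8, -12, 5]
print(sorted(data, key=less_to_key(abs_less)))  # [-1, 3, 5, -7, 8, -12, -15, 22]
```

A valid predicate must be a strict ordering: `less(a, a)` has to be false, otherwise the result of a comparison-based sort is not well defined.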

### Sorting Performance Comparison
Let's compare different sorting approaches:

In [None]:
def sorting_performance_comparison():
    """Compare GPU sorting vs CPU sorting performance"""
    
    print("\n=== Sorting Performance Comparison ===")
    
    # Test different array sizes
    sizes = [1000, 10000, 100000, 1000000]
    
    for size in sizes:
        print(f"\nTesting with {size:,} elements:")
        
        # Create random data
        np.random.seed(42)
        data = np.random.randint(0, size, size, dtype=np.int32)
        
        # CPU sorting (NumPy)
        start_time = time.time()
        cpu_result = np.sort(data)
        cpu_time = time.time() - start_time
        
        # GPU radix sort
        d_input = cp.array(data)
        d_output = cp.empty_like(d_input)
        
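        # Warm up GPU (the first call may include one-time setup), then wait for it
        parallel.radix_sort(d_input, d_output, None, None, parallel.SortOrder.ASCENDING, len(d_input))
        cp.cuda.Stream.null.synchronize()

        # Time GPU sort; .get() blocks until the result is on the host,
        # so the measurement includes the device-to-host copy
        start_time = time.time()
        parallel.radix_sort(d_input, d_output, None, None, parallel.SortOrder.ASCENDING, len(d_input))
        gpu_result = d_output.get()
        gpu_time = time.time() - start_time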
        
        # Compare
        speedup = cpu_time / gpu_time if gpu_time > 0 else float('inf')
        
        print(f"  CPU time: {cpu_time*1000:.2f} ms")
        print(f"  GPU time: {gpu_time*1000:.2f} ms")
        print(f"  Speedup: {speedup:.1f}x")
        print(f"  Results match: {np.array_equal(cpu_result, gpu_result)}")
        
        # Check sorting correctness
        is_cpu_sorted = np.all(cpu_result[:-1] <= cpu_result[1:])
        is_gpu_sorted = np.all(gpu_result[:-1] <= gpu_result[1:])
        print(f"  CPU sorted correctly: {is_cpu_sorted}")
        print(f"  GPU sorted correctly: {is_gpu_sorted}")

sorting_performance_comparison()

**Sorting performance insights:**
* GPU advantage: Most apparent with large arrays (100k+ elements)
* Memory bandwidth: GPU sorting is often memory-bound
* Algorithm choice: Radix sort excels for integers, merge sort for custom comparisons
* Stability: Both algorithms preserve order of equal elements

### Practical Sorting Applications
Here's a real-world example: organizing sensor data for analysis.

In [None]:
def practical_sorting_application():
    """Real-world example: Organize sensor data by timestamp"""
    
    print("\n=== Practical Application: Sensor Data Organization ===")
    
    # Simulate sensor data (timestamp, temperature, humidity, sensor_id)
    np.random.seed(42)
    num_readings = 20
    
    # Random timestamps (simulate out-of-order arrival)
    timestamps = np.random.randint(1000, 2000, num_readings)
    temperatures = np.random.uniform(18.0, 35.0, num_readings)
    humidity = np.random.uniform(30.0, 90.0, num_readings)
    sensor_ids = np.random.randint(1, 6, num_readings)
    
    print("Unsorted sensor data (first 10 readings):")
    print("Timestamp | Temp | Humidity | Sensor")
    print("-" * 40)
    for i in range(min(10, num_readings)):
        print(f"{timestamps[i]:>9} | {temperatures[i]:4.1f} | {humidity[i]:6.1f}% | {sensor_ids[i]:>6}")
    
    # Sort by timestamp using key-value sort
    d_keys = cp.array(timestamps, dtype=np.int32)
    d_values = cp.arange(num_readings, dtype=np.int32)  # Original indices
    
    d_keys_out = cp.empty_like(d_keys)
    d_values_out = cp.empty_like(d_values)
    
    # Sort timestamps, keep track of original indices
    parallel.radix_sort(
        d_keys, d_keys_out,
        d_values, d_values_out,
        parallel.SortOrder.ASCENDING,
        num_readings
    )
    
    # Get sorted results
    sorted_timestamps = d_keys_out.get()
    sorted_indices = d_values_out.get()
    
    print(f"\nSorted sensor data (by timestamp):")
    print("Timestamp | Temp | Humidity | Sensor")
    print("-" * 40)
    for i in range(min(10, num_readings)):
        orig_idx = sorted_indices[i]
        print(f"{sorted_timestamps[i]:>9} | {temperatures[orig_idx]:4.1f} | {humidity[orig_idx]:6.1f}% | {sensor_ids[orig_idx]:>6}")
    
    # Verify chronological order
    is_chronological = all(sorted_timestamps[i] <= sorted_timestamps[i+1] 
                          for i in range(len(sorted_timestamps)-1))
    print(f"\nData is now in chronological order: {is_chronological}")
    
    # Show benefits
    print(f"\nBenefits of sorted data:")
    print(f"  - Time-series analysis becomes efficient")
    print(f"  - Binary search for specific timestamps")
    print(f"  - Easy to find data ranges")
    print(f"  - Temporal patterns become visible")

practical_sorting_application()

This example shows how GPU sorting enables efficient data organization for real-world applications like sensor monitoring, financial data analysis, and scientific computing.

## 8. Transform Operations
### What Are Transform Operations?
Transform operations apply a function to each element of an array (or iterator) to create a new array. Think of it as a parallel "map" operation from functional programming.

**Key characteristics:**
* Element-wise: Each input element produces exactly one output element
* Independent: Each transformation is independent of others
* Parallel: Because the elements are independent, the GPU can process many of them concurrently
* Memory efficient: Can be combined with other operations

**Common use cases:**
* Mathematical operations (square, sqrt, trigonometric functions)
* Unit conversions (Celsius to Fahrenheit, meters to feet)
* Data normalization and scaling
* Feature engineering in machine learning

### Basic Unary Transform
Let's start with simple element-wise transformations:

In [None]:
def basic_unary_transform_example():
    """Learn basic transform operations"""
    
    print("=== Basic Unary Transform Example ===")
    
    # Example: Convert temperatures from Celsius to Fahrenheit
    celsius_temps = [0, 10, 20, 25, 30, 37.5, 100]
    print(f"Temperatures in Celsius: {celsius_temps}")
    
    def celsius_to_fahrenheit(c):
        """Convert Celsius to Fahrenheit: F = C * 9/5 + 32"""
        return c * 9.0 / 5.0 + 32.0
    
    # Setup GPU arrays
    d_input = cp.array(celsius_temps, dtype=np.float32)
    d_output = cp.empty_like(d_input)
    
    # Perform transform
    parallel.unary_transform(
        d_input,              # Input array
        d_output,             # Output array
        celsius_to_fahrenheit, # Transform function
        len(d_input)          # Number of elements
    )
    
    # Get results
    gpu_result = d_output.get()
    cpu_result = [celsius_to_fahrenheit(c) for c in celsius_temps]
    
    print(f"GPU Fahrenheit temps: {gpu_result}")
    print(f"CPU Fahrenheit temps: {cpu_result}")
    print(f"Results match: {np.allclose(gpu_result, cpu_result)}")
    
    # Show conversion table
    print(f"\nTemperature Conversion Table:")
    print(f"{'Celsius':<8} | {'Fahrenheit':<10}")
    print("-" * 20)
    for c, f in zip(celsius_temps, gpu_result):
        print(f"{c:<8.1f} | {f:<10.1f}")

basic_unary_transform_example()

**Understanding unary transform:**
* Input: Array of Celsius temperatures
* Function: Conversion formula applied to each element
* Output: Array of Fahrenheit temperatures (same size)
* Parallel: All conversions happen simultaneously
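
As a reference for what the call computes, here is a sequential CPU sketch with the same argument order as the `unary_transform` call above. The name `unary_transform_reference` is hypothetical; it only states the contract `d_out[i] = op(d_in[i])` for the first `num_items` elements.

```python
def unary_transform_reference(d_in, d_out, op, num_items):
    """Sequential stand-in for a unary transform: d_out[i] = op(d_in[i])."""
    for i in range(num_items):
        # Each iteration reads one input and writes one output, and no
        # iteration depends on another, so they could all run in parallel.
        d_out[i] = op(d_in[i])


def celsius_to_fahrenheit(c):
    return c * 9.0 / 5.0 + 32.0


celsius = [0, 10, 20, 25, 30, 37.5, 100]
fahrenheit = [0.0] * len(celsius)
unary_transform_reference(celsius, fahrenheit, celsius_to_fahrenheit, len(celsius))
print(fahrenheit)  # [32.0, 50.0, 68.0, 77.0, 86.0, 99.5, 212.0]
```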

### Mathematical Transform Functions
Next, let's look at more complex mathematical transformations:

In [None]:
def mathematical_transform_example():
    """Apply complex mathematical functions"""
    
    print("\n=== Mathematical Transform Example ===")
    
    # Input: angles in degrees
    angles_degrees = [0, 30, 45, 60, 90, 120, 180, 270, 360]
    print(f"Angles in degrees: {angles_degrees}")
    
    import math  # imported here rather than inside the function compiled for the GPU

    def degrees_to_radians_and_sin(degrees):
        """Convert degrees to radians, then compute sine"""
        radians = degrees * math.pi / 180.0
        return math.sin(radians)
    
    # Setup
    d_input = cp.array(angles_degrees, dtype=np.float32)
    d_output = cp.empty_like(d_input)
    
    # Transform: degrees → radians → sine
    parallel.unary_transform(d_input, d_output, degrees_to_radians_and_sin, len(d_input))
    
    # Results
    gpu_result = d_output.get()
    cpu_result = [degrees_to_radians_and_sin(angle) for angle in angles_degrees]
    
    print(f"GPU sine values: {gpu_result}")
    print(f"CPU sine values: {cpu_result}")
    print(f"Results match: {np.allclose(gpu_result, cpu_result, atol=1e-6)}")

    # Show trigonometric table
    print(f"\nTrigonometric Table:")
    print(f"{'Degrees':<8} | {'Sine':<10}")
    print("-" * 20)
    for deg, sin_val in zip(angles_degrees, gpu_result):
        print(f"{deg:<8} | {sin_val:<10.6f}")

    # Verify known values
    print(f"\nVerification of known values:")
    print(f"  sin(0°) = {gpu_result[0]:.6f} (should be ~0)")
    print(f"  sin(30°) = {gpu_result[1]:.6f} (should be ~0.5)")
    print(f"  sin(90°) = {gpu_result[4]:.6f} (should be ~1)")
    
mathematical_transform_example()

### Binary Transform Operations

Binary transforms combine two input arrays element-wise:

In [None]:
def binary_transform_example():
    """Combine two arrays with binary transform"""
    
    print("\n=== Binary Transform Example ===")
    
    # Example: Calculate area of rectangles given width and height
    widths = [2.5, 4.0, 1.5, 6.2, 3.8]
    heights = [3.0, 2.5, 4.5, 1.8, 5.2]
    
    print(f"Rectangle widths: {widths}")
    print(f"Rectangle heights: {heights}")
    
    def calculate_area(width, height):
        """Calculate rectangular area"""
        return width * height
    
    def calculate_perimeter(width, height):
        """Calculate rectangular perimeter"""
        return 2 * (width + height)
    
    # Setup GPU arrays
    d_widths = cp.array(widths, dtype=np.float32)
    d_heights = cp.array(heights, dtype=np.float32)
    d_areas = cp.empty_like(d_widths)
    d_perimeters = cp.empty_like(d_widths)
    
    # Calculate areas
    parallel.binary_transform(
        d_widths,      # First input array
        d_heights,     # Second input array
        d_areas,       # Output array
        calculate_area, # Binary function
        len(d_widths)  # Number of elements
    )
    
    # Calculate perimeters
    parallel.binary_transform(
        d_widths, d_heights, d_perimeters, calculate_perimeter, len(d_widths)
    )
    
    # Get results
    gpu_areas = d_areas.get()
    gpu_perimeters = d_perimeters.get()
    
    # CPU verification
    cpu_areas = [w * h for w, h in zip(widths, heights)]
    cpu_perimeters = [2 * (w + h) for w, h in zip(widths, heights)]
    
    print(f"\nRectangle Properties:")
    print(f"{'Width':<6} | {'Height':<6} | {'Area':<8} | {'Perimeter':<10}")
    print("-" * 35)
    for w, h, a, p in zip(widths, heights, gpu_areas, gpu_perimeters):
        print(f"{w:<6.1f} | {h:<6.1f} | {a:<8.2f} | {p:<10.2f}")
    
    print(f"\nVerification:")
    print(f"Areas match: {np.allclose(gpu_areas, cpu_areas)}")
    print(f"Perimeters match: {np.allclose(gpu_perimeters, cpu_perimeters)}")

binary_transform_example()

### Transform with Iterators
Combining transforms with iterators for memory-efficient processing:

In [None]:
def transform_with_iterators_example():
    """Use transforms with iterators for memory efficiency"""
    
    print("\n=== Transform with Iterators Example ===")
    
    # Problem: Calculate sum of squares from 1 to 1000 without storing arrays
    start_num = 1
    count = 1000
    
    print(f"Calculating sum of squares from {start_num} to {start_num + count - 1}")
    
    def square_function(x):
        """Square a number"""
        return x * x
    
    def add_op(a, b):
        """Addition for reduction"""
        return a + b
    
    # Method 1: Using iterator composition (memory efficient)
    print(f"\nMethod 1: Iterator Composition")
    
    # Create counting iterator for numbers 1, 2, 3, ..., 1000
    counting_it = parallel.CountingIterator(np.int64(start_num))
    
    # Transform each number by squaring it
    squares_it = parallel.TransformIterator(counting_it, square_function)
    
    # Sum all the squares
    d_output = cp.empty(1, dtype=np.int64)
    h_init = np.array([0], dtype=np.int64)
    
    start_time = time.time()
    parallel.reduce_into(squares_it, d_output, add_op, count, h_init)
    iterator_result = d_output.get()[0]  # .get() waits for the GPU, as in Method 2
    iterator_time = time.time() - start_time
    
    # Method 2: Using arrays (memory intensive)
    print(f"Method 2: Array-based")
    
    start_time = time.time()
    d_numbers = cp.arange(start_num, start_num + count, dtype=np.int64)
    d_squares = cp.empty_like(d_numbers)
    
    parallel.unary_transform(d_numbers, d_squares, square_function, count)
    
    d_sum = cp.empty(1, dtype=np.int64)
    parallel.reduce_into(d_squares, d_sum, add_op, count, h_init)
    array_result = d_sum.get()[0]
    array_time = time.time() - start_time
    
    # CPU verification using formula: sum of squares = n(n+1)(2n+1)/6
    n = count
    formula_result = n * (n + 1) * (2 * n + 1) // 6
    
    print(f"\nResults:")
    print(f"Iterator method: {iterator_result} ({iterator_time*1000:.2f} ms)")
    print(f"Array method: {array_result} ({array_time*1000:.2f} ms)")
    print(f"Mathematical formula: {formula_result}")
    
    print(f"\nVerification:")
    print(f"Iterator correct: {iterator_result == formula_result}")
    print(f"Array correct: {array_result == formula_result}")
    print(f"Methods match: {iterator_result == array_result}")
    
    # Memory usage comparison
    iterator_memory = d_output.nbytes  # Just the result
    array_memory = d_numbers.nbytes + d_squares.nbytes + d_sum.nbytes
    
    print(f"\nMemory Usage:")
    print(f"Iterator method: {iterator_memory} bytes")
    print(f"Array method: {array_memory:,} bytes")
    print(f"Memory savings: {array_memory // iterator_memory}x")

transform_with_iterators_example()
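
If the iterator composition feels abstract, Python's own lazy iterators offer a close analogy. The sketch below is a CPU-only illustration, not the library's implementation; `sum_of_squares` is a hypothetical helper. No list of numbers or squares is ever materialized; each value is produced on demand and consumed by the sum.

```python
from itertools import count, islice


def sum_of_squares(start, num_items):
    """Sum start**2 + (start+1)**2 + ... without building any list."""
    numbers = count(start)               # like CountingIterator: start, start+1, ...
    squares = (x * x for x in numbers)   # like TransformIterator: applied on demand
    # islice limits the infinite stream to num_items values, playing the
    # role of the element count passed to the reduction.
    return sum(islice(squares, num_items))


total = sum_of_squares(1, 1000)
print(total)  # 333833500, matching n(n+1)(2n+1)/6 for n = 1000
```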

### Data Normalization Example
A practical machine learning preprocessing example:

In [None]:
def data_normalization_example():
    """Normalize data using transforms (common ML preprocessing)"""
    
    print("\n=== Data Normalization Example ===")
    
    # Simulate feature data (e.g., house prices, areas, ages)
    np.random.seed(42)
    house_prices = np.random.normal(300000, 100000, 100).astype(np.float32)
    house_areas = np.random.normal(2000, 500, 100).astype(np.float32)
    house_ages = np.random.uniform(0, 50, 100).astype(np.float32)
    
    print(f"Raw data statistics:")
    print(f"Prices: mean={np.mean(house_prices):.0f}, std={np.std(house_prices):.0f}")
    print(f"Areas: mean={np.mean(house_areas):.0f}, std={np.std(house_areas):.0f}")
    print(f"Ages: mean={np.mean(house_ages):.1f}, std={np.std(house_ages):.1f}")
    
    # Z-score normalization: (x - mean) / std
    def normalize_z_score(mean_val, std_val):
        """Create normalization function with captured mean and std"""
        def normalize_func(x):
            return (x - mean_val) / std_val
        return normalize_func

    # Calculate statistics
    price_mean, price_std = np.mean(house_prices), np.std(house_prices)
    area_mean, area_std = np.mean(house_areas), np.std(house_areas)
    age_mean, age_std = np.mean(house_ages), np.std(house_ages)

    # Create normalization functions (only the statistics are captured)
    price_normalizer = normalize_z_score(price_mean, price_std)
    area_normalizer = normalize_z_score(area_mean, area_std)
    age_normalizer = normalize_z_score(age_mean, age_std)
    
    # Setup GPU arrays
    d_prices = cp.array(house_prices)
    d_areas = cp.array(house_areas)
    d_ages = cp.array(house_ages)
    
    d_norm_prices = cp.empty_like(d_prices)
    d_norm_areas = cp.empty_like(d_areas)
    d_norm_ages = cp.empty_like(d_ages)
    
    # Normalize on GPU
    parallel.unary_transform(d_prices, d_norm_prices, price_normalizer, len(house_prices))
    parallel.unary_transform(d_areas, d_norm_areas, area_normalizer, len(house_areas))
    parallel.unary_transform(d_ages, d_norm_ages, age_normalizer, len(house_ages))
    
    # Get normalized results
    norm_prices = d_norm_prices.get()
    norm_areas = d_norm_areas.get()
    norm_ages = d_norm_ages.get()
    
    print(f"\nNormalized data statistics:")
    print(f"Prices: mean={np.mean(norm_prices):.6f}, std={np.std(norm_prices):.6f}")
    print(f"Areas: mean={np.mean(norm_areas):.6f}, std={np.std(norm_areas):.6f}")
    print(f"Ages: mean={np.mean(norm_ages):.6f}, std={np.std(norm_ages):.6f}")
    
    # Show sample of normalized data
    print(f"\nSample of normalized data (first 5 houses):")
    print(f"{'Price':<8} | {'Area':<8} | {'Age':<8}")
    print("-" * 28)
    for i in range(5):
        print(f"{norm_prices[i]:<8.3f} | {norm_areas[i]:<8.3f} | {norm_ages[i]:<8.3f}")
    
    # Verify normalization properties (tolerance sized for float32 arithmetic)
    tol = 1e-4
    all_close_to_zero = (abs(np.mean(norm_prices)) < tol and
                        abs(np.mean(norm_areas)) < tol and
                        abs(np.mean(norm_ages)) < tol)

    all_close_to_one = (abs(np.std(norm_prices) - 1.0) < tol and
                       abs(np.std(norm_areas) - 1.0) < tol and
                       abs(np.std(norm_ages) - 1.0) < tol)
    
    print(f"\nNormalization verification:")
    print(f"All means ≈ 0: {all_close_to_zero}")
    print(f"All std devs ≈ 1: {all_close_to_one}")
    print(f"Normalization successful: {all_close_to_zero and all_close_to_one}")

data_normalization_example()
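
For a quick sanity check of the formula itself, here is a standard-library sketch of z-score normalization. The helper `z_score` and the sample values are illustrative assumptions. It uses the population standard deviation, which matches the default behavior of `np.std`.

```python
import statistics


def z_score(values):
    """Return (x - mean) / std for each x, using the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std (ddof=0), like np.std
    # A constant input has std == 0 and cannot be normalized this way.
    if std == 0:
        raise ValueError("cannot normalize values with zero variance")
    return [(v - mean) / std for v in values]


ages = [5.0, 12.0, 20.0, 33.0, 47.0]
normalized = z_score(ages)
print(f"mean={statistics.fmean(normalized):.6f}, std={statistics.pstdev(normalized):.6f}")
```

After normalization the mean is approximately 0 and the standard deviation approximately 1, up to floating-point rounding.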

### Transform Performance Analysis
Let's analyze the performance characteristics of transform operations:

In [None]:
def transform_performance_analysis():
    """Analyze transform performance across different scenarios"""
    
    print("\n=== Transform Performance Analysis ===")
    
    # Test different scenarios
    scenarios = [
        ("Simple multiplication", lambda x: x * 2.0),
        ("Square root", lambda x: x**0.5),
        ("Trigonometric", lambda x: np.sin(x) + np.cos(x)),
        ("Complex expression", lambda x: x**3 - 2*x**2 + x + 1)
    ]
    
    sizes = [10000, 100000, 1000000]
    
    for size in sizes:
        print(f"\nTesting with {size:,} elements:")
        print(f"{'Operation':<20} | {'GPU Time':<10} | {'CPU Time':<10} | {'Speedup':<8}")
        print("-" * 55)
        
        # Create test data
        np.random.seed(42)
        data = np.random.uniform(0.1, 10.0, size).astype(np.float32)
        
        for name, func in scenarios:
            # GPU transform
            d_input = cp.array(data)
            d_output = cp.empty_like(d_input)
            
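            # Warm up (the first call may include one-time setup), then wait for the GPU
            parallel.unary_transform(d_input, d_output, func, size)
            cp.cuda.Stream.null.synchronize()

            # Time GPU; .get() blocks until the result is on the host,
            # so the measurement includes the device-to-host copy
            start_time = time.time()
            parallel.unary_transform(d_input, d_output, func, size)
            gpu_result = d_output.get()
            gpu_time = time.time() - start_time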
            
            # Time CPU
            start_time = time.time()
            cpu_result = np.array([func(x) for x in data])
            cpu_time = time.time() - start_time
            
            # Calculate speedup
            speedup = cpu_time / gpu_time if gpu_time > 0 else float('inf')
            
            # Verify correctness
            correct = np.allclose(gpu_result, cpu_result, rtol=1e-5)
            
            print(f"{name:<20} | {gpu_time*1000:<10.2f} | {cpu_time*1000:<10.2f} | {speedup:<8.1f}")
            
            if not correct:
                print(f"  WARNING: Results don't match!")

transform_performance_analysis()

**Transform performance insights:**
1. Simple operations: GPU overhead might make CPU faster for small arrays
2. Complex operations: GPU shines with mathematical functions
3. Memory bound: Simple operations limited by memory bandwidth
4. Compute bound: Complex operations benefit from GPU's parallel compute units
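5. Scale matters: GPU advantage increases with array size
6. Baseline matters: The CPU timing above runs a Python loop over the elements rather than vectorized NumPy, so the reported speedups are larger than a comparison against optimized CPU code would show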
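
To see why simple transforms are memory bound, it helps to estimate how long it takes just to move the data. The sketch below is a back-of-the-envelope calculation; `min_transform_time` is a hypothetical helper, and the bandwidth figure of 500 GB/s is purely illustrative, not a measurement of any particular device.

```python
def min_transform_time(num_items, bytes_in, bytes_out, bandwidth_bytes_per_s):
    """Lower bound on transform time if memory traffic is the only cost."""
    # Every element is read once and written once.
    total_bytes = num_items * (bytes_in + bytes_out)
    return total_bytes / bandwidth_bytes_per_s


# One million float32 values in, one million float32 values out,
# at an assumed (illustrative) memory bandwidth of 500 GB/s.
t = min_transform_time(1_000_000, 4, 4, 500e9)
print(f"{t * 1e6:.1f} microseconds")  # 16.0 microseconds
```

A transform such as `x * 2.0` performs one arithmetic operation for every 8 bytes moved, so its runtime is set by this bound rather than by the arithmetic. Heavier functions do more work per byte and move toward being compute bound.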

## 9. Custom Data Types
### Why Custom Data Types Matter
Sometimes you need to work with structured data that doesn't fit into simple arrays. Custom data types allow you to:
* Group related data: Store multiple fields together (like RGB pixels, 3D coordinates)
* Maintain relationships: Keep associated data synchronized during operations
* Create domain-specific types: Model real-world entities in your code
* Optimize memory layout: Control how data is arranged in memory

### Creating Your First GPU Struct
Let's create a simple 2D point structure:

In [None]:
def basic_gpu_struct_example():
    """Learn to create and use custom GPU data types"""
    
    print("=== Basic GPU Struct Example ===")
    
    # Define a 2D point structure
    @parallel.gpu_struct
    class Point2D:
        x: np.float32
        y: np.float32
    
    print(f"Created Point2D struct with fields: x (float32), y (float32)")
    
    # Create some 2D points
    points_data = [
        (1.0, 2.0), (3.0, 4.0), (5.0, 6.0), 
        (7.0, 8.0), (9.0, 10.0)
    ]
    
    print(f"Creating {len(points_data)} points:")
    for i, (x, y) in enumerate(points_data):
        print(f"  Point {i}: ({x}, {y})")
    
    # Convert to GPU struct array
    # Method 1: Create structured array
    point_array = np.zeros(len(points_data), dtype=Point2D.dtype)
    for i, (x, y) in enumerate(points_data):
        point_array[i] = Point2D(x, y)
    
    d_points = cp.array(point_array)
    
    # Define operation: calculate distance from origin
    import math  # imported here rather than inside the function compiled for the GPU

    def distance_from_origin(point):
        """Calculate distance from (0,0) to point"""
        return math.sqrt(point.x * point.x + point.y * point.y)
    
    # Transform points to distances
    d_distances = cp.empty(len(points_data), dtype=np.float32)
    
    parallel.unary_transform(d_points, d_distances, distance_from_origin, len(points_data))
    
    # Get results
    distances = d_distances.get()
    
    # Verify with CPU calculation
    cpu_distances = [np.sqrt(x*x + y*y) for x, y in points_data]
    
    print(f"\nDistance calculations:")
    print(f"{'Point':<12} | {'GPU Distance':<12} | {'CPU Distance':<12}")
    print("-" * 40)
    for i, ((x, y), gpu_dist, cpu_dist) in enumerate(zip(points_data, distances, cpu_distances)):
        print(f"({x:3.1f}, {y:3.1f})   | {gpu_dist:<12.3f} | {cpu_dist:<12.3f}")
    
    print(f"\nResults match: {np.allclose(distances, cpu_distances)}")

basic_gpu_struct_example()

**Understanding @parallel.gpu_struct:**
* Decorator: Marks a Python class as a GPU-compatible structure
* Type annotations: Specify the exact data types for each field
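* Memory layout: The fields of each element are stored together as one record, so an array of structs keeps related values adjacent in memory
* Compatibility: Struct arrays can be passed to parallel algorithms and iterators, as the examples in this section show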
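
To picture the memory layout point, here is a standard-library sketch that packs 2D points into a contiguous byte buffer. It assumes a packed layout of two float32 fields per point; the exact layout that `@parallel.gpu_struct` produces is not shown in this tutorial, so treat this as an illustration of the array-of-structs idea rather than a description of the library's internals.

```python
import math
import struct

# Assumed layout: two little-endian float32 fields (x, y), no padding.
POINT2D = struct.Struct("<ff")

points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]

# Array of structs: each point's x and y sit next to each other in memory.
buffer = b"".join(POINT2D.pack(x, y) for x, y in points)

print(POINT2D.size)  # 8 bytes per point
print(len(buffer))   # 24 bytes for three points

# Read the records back and apply a per-element operation.
distances = [math.hypot(x, y) for x, y in POINT2D.iter_unpack(buffer)]
print(distances[1])  # 5.0 for the point (3.0, 4.0)
```

Keeping a point's fields adjacent means a function that needs both `x` and `y` can fetch them with a single contiguous read.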

### RGB Pixel Processing
A practical example processing image pixels:

In [None]:
def rgb_pixel_example():
    """Process RGB pixels using custom struct"""
    
    print("\n=== RGB Pixel Processing Example ===")
    
    # Define RGB pixel structure
    @parallel.gpu_struct
    class RGBPixel:
        r: np.uint8  # Red component (0-255)
        g: np.uint8  # Green component (0-255)  
        b: np.uint8  # Blue component (0-255)
    
    # Create sample image data (small 3x3 image)
    image_data = [
        (255, 0, 0),   (0, 255, 0),   (0, 0, 255),    # Red, Green, Blue
        (255, 255, 0), (255, 0, 255), (0, 255, 255),  # Yellow, Magenta, Cyan
        (128, 128, 128), (64, 64, 64), (192, 192, 192) # Grays
    ]
    
    print(f"Processing 3x3 RGB image:")
    for i, (r, g, b) in enumerate(image_data):
        if i % 3 == 0:
            print()
        print(f"({r:3d},{g:3d},{b:3d})", end=" ")
    print()
    
    # Create GPU pixel array
    pixel_array = np.zeros(len(image_data), dtype=RGBPixel.dtype)
    for i, (r, g, b) in enumerate(image_data):
        pixel_array[i] = RGBPixel(r, g, b)
    
    d_pixels = cp.array(pixel_array)
    
    # Operation 1: Convert to grayscale using luminance formula
    def rgb_to_grayscale(pixel):
        """Convert RGB to grayscale using standard luminance weights"""
        # Standard luminance formula: 0.299*R + 0.587*G + 0.114*B
        gray = int(0.299 * pixel.r + 0.587 * pixel.g + 0.114 * pixel.b)
        return min(255, max(0, gray))  # Clamp to valid range
    
    d_grayscale = cp.empty(len(image_data), dtype=np.uint8)
    parallel.unary_transform(d_pixels, d_grayscale, rgb_to_grayscale, len(image_data))
    
    # Operation 2: Find the maximum green value across all pixels
    def get_green_value(pixel):
        """Extract green component"""
        return pixel.g
    
    def max_op(a, b):
        """Return maximum value"""
        return a if a > b else b
    
    green_it = parallel.TransformIterator(d_pixels, get_green_value)
    d_max_green = cp.empty(1, dtype=np.uint8)
    h_init = np.array([0], dtype=np.uint8)
    
    parallel.reduce_into(green_it, d_max_green, max_op, len(image_data), h_init)
    
    # Get results
    grayscale = d_grayscale.get()
    max_green = d_max_green.get()[0]
    
    print(f"\nGrayscale conversion:")
    for i, gray in enumerate(grayscale):
        if i % 3 == 0:
            print()
        print(f"{gray:3d}", end="     ")
    print()
    
    print(f"\nMaximum green value in image: {max_green}")
    
    # Verify grayscale conversion and maximum green value against the CPU
    cpu_grayscale = [int(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in image_data]
    cpu_max_green = max(g for r, g, b in image_data)
    
    print(f"\nVerification:")
    print(f"Grayscale conversion correct: {np.array_equal(grayscale, cpu_grayscale)}")
    print(f"Max green value correct: {max_green == cpu_max_green}")

rgb_pixel_example()
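
Conceptually, pairing `TransformIterator` with `reduce_into` applies the transform to each element as it is read and folds the results with the binary operator, so no intermediate array of green values is needed. A CPU-only sketch of that behavior follows; the generator and `functools.reduce` are stand-ins chosen for illustration and are not part of `cuda.cccl.parallel`.

```python
from functools import reduce

image_data = [
    (255, 0, 0), (0, 255, 0), (0, 0, 255),
    (255, 255, 0), (255, 0, 255), (0, 255, 255),
    (128, 128, 128), (64, 64, 64), (192, 192, 192),
]

def get_green_value(pixel):
    """Extract the green component from an (r, g, b) tuple."""
    return pixel[1]

def max_op(a, b):
    """Return the larger of two values."""
    return a if a > b else b

# The generator plays the role of the lazy transform iterator:
# values are produced one at a time as the reduction consumes them.
green_values = (get_green_value(p) for p in image_data)

# The third argument plays the role of h_init.
max_green = reduce(max_op, green_values, 0)
print(max_green)  # 255
```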

### Complex Struct with Multiple Operations
Let's create a more complex example with 3D vectors:

In [None]:
def vector3d_example():
    """Work with 3D vectors using custom struct"""
    
    print("\n=== 3D Vector Operations Example ===")
    
    # Define 3D vector structure
    @parallel.gpu_struct
    class Vector3D:
        x: np.float32
        y: np.float32
        z: np.float32
    
    # Create 3D vectors (e.g., particle positions)
    vectors_data = [
        (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),  # Unit vectors
        (1.0, 1.0, 1.0), (2.0, 3.0, 4.0), (-1.0, 2.0, -3.0), # Various vectors
        (0.5, -0.5, 0.8), (3.2, 1.1, -2.4)                   # More vectors
    ]
    
    print(f"Working with {len(vectors_data)} 3D vectors:")
    for i, (x, y, z) in enumerate(vectors_data):
        print(f"  Vector {i}: ({x:5.1f}, {y:5.1f}, {z:5.1f})")
    
    # Create GPU vector array
    vector_array = np.zeros(len(vectors_data), dtype=Vector3D.dtype)
    for i, (x, y, z) in enumerate(vectors_data):
        vector_array[i] = Vector3D(x, y, z)
    
    d_vectors = cp.array(vector_array)
    
    # Operation 1: Calculate magnitude (length) of each vector
    def vector_magnitude(vec):
        """Calculate vector magnitude: sqrt(x² + y² + z²)"""
        import math
        return math.sqrt(vec.x * vec.x + vec.y * vec.y + vec.z * vec.z)
    
    d_magnitudes = cp.empty(len(vectors_data), dtype=np.float32)
    parallel.unary_transform(d_vectors, d_magnitudes, vector_magnitude, len(vectors_data))
    
    # Operation 2: Normalize vectors (make them unit length)
    def normalize_vector(vec):
        """Normalize vector to unit length"""
        import math
        mag = math.sqrt(vec.x * vec.x + vec.y * vec.y + vec.z * vec.z)
        if mag > 1e-10:  # Avoid division by zero
            return Vector3D(vec.x / mag, vec.y / mag, vec.z / mag)
        else:
            return Vector3D(0.0, 0.0, 0.0)
    
    d_normalized = cp.empty(len(vectors_data), dtype=Vector3D.dtype)
    parallel.unary_transform(d_vectors, d_normalized, normalize_vector, len(vectors_data))
    
    # Operation 3: Find the maximum magnitude across all vectors
    def max_magnitude_op(a, b):
        """Return the larger magnitude"""
        return a if a > b else b
    
    magnitude_it = parallel.TransformIterator(d_vectors, vector_magnitude)
    d_max_magnitude = cp.empty(1, dtype=np.float32)
    h_init = np.array([0.0], dtype=np.float32)
    
    parallel.reduce_into(magnitude_it, d_max_magnitude, max_magnitude_op, len(vectors_data), h_init)
    
    # Get results
    magnitudes = d_magnitudes.get()
    normalized_vectors = d_normalized.get()
    max_magnitude = d_max_magnitude.get()[0]
    
    # Display results
    print(f"\nVector Analysis:")
    print(f"{'Vector':<15} | {'Magnitude':<10} | {'Normalized':<20}")
    print("-" * 50)
    
    for i, ((x, y, z), mag) in enumerate(zip(vectors_data, magnitudes)):
        norm_vec = normalized_vectors[i]
        norm_str = f"({norm_vec['x']:.3f}, {norm_vec['y']:.3f}, {norm_vec['z']:.3f})"
        print(f"({x:4.1f},{y:4.1f},{z:4.1f}) | {mag:<10.3f} | {norm_str:<20}")
    
    print(f"\nMaximum vector magnitude: {max_magnitude:.3f}")
    
    # Verify normalized vectors have unit length
    normalized_magnitudes = [vector_magnitude(Vector3D(v['x'], v['y'], v['z'])) 
                           for v in normalized_vectors]
    
    # float32 carries about 7 significant digits, so allow a tolerance of 1e-5
    all_unit_length = all(abs(mag - 1.0) < 1e-5 for mag in normalized_magnitudes 
                         if mag > 1e-6)  # Skip zero vectors
    
    print(f"All normalized vectors have unit length: {all_unit_length}")

vector3d_example()
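
The example above checks only the normalized lengths. A host-side cross-check for the magnitudes and the maximum can be sketched with plain NumPy; this is an independent reference under the assumption that the GPU results are `float32`, not a part of the library.

```python
import numpy as np

vectors = np.array([
    (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
    (1.0, 1.0, 1.0), (2.0, 3.0, 4.0), (-1.0, 2.0, -3.0),
    (0.5, -0.5, 0.8), (3.2, 1.1, -2.4),
], dtype=np.float32)

# Magnitude of each row: sqrt(x^2 + y^2 + z^2)
magnitudes = np.linalg.norm(vectors, axis=1)

# Normalize, mapping (near-)zero vectors to zero as normalize_vector does
nonzero = magnitudes > 1e-10
safe = np.where(nonzero, magnitudes, 1.0)
normalized = np.where(nonzero[:, None], vectors / safe[:, None], 0.0)

max_magnitude = magnitudes.max()
print(f"Maximum magnitude: {max_magnitude:.3f}")
```

The arrays `magnitudes` and `normalized` can then be compared with the GPU output using `np.allclose` and a tolerance suited to `float32`.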

### Custom Reduction with Structs
Let's create a reduction that works with custom structs:

In [None]:
def custom_struct_reduction_example():
    """Perform reduction operations on custom structs"""
    
    print("\n=== Custom Struct Reduction Example ===")
    
    # Define a statistics struct to track min, max, count, and sum
    @parallel.gpu_struct
    class Statistics:
        min_val: np.float32
        max_val: np.float32
        count: np.int32
        sum_val: np.float32
    
    # Sample data: temperature readings
    temperature_data = [23.5, 18.2, 31.8, 27.1, 19.9, 35.2, 22.8, 29.4, 16.7, 33.1]
    print(f"Temperature readings: {temperature_data}")
    
    # Convert each temperature to a Statistics struct
    def temp_to_stats(temp):
        """Convert single temperature to Statistics"""
        return Statistics(temp, temp, 1, temp)
    
    # Create temperature array and transform to stats
    d_temperatures = cp.array(temperature_data, dtype=np.float32)
    stats_it = parallel.TransformIterator(d_temperatures, temp_to_stats)
    
    # Define reduction operation to combine Statistics
    def combine_stats(stats1, stats2):
        """Combine two Statistics structs"""
        return Statistics(
            min(stats1.min_val, stats2.min_val),    # Track minimum
            max(stats1.max_val, stats2.max_val),    # Track maximum  
            stats1.count + stats2.count,            # Sum counts
            stats1.sum_val + stats2.sum_val         # Sum values
        )
    
    # Perform reduction
    d_result = cp.empty(1, dtype=Statistics.dtype)
    h_init = Statistics(float('inf'), float('-inf'), 0, 0.0)
    
    parallel.reduce_into(stats_it, d_result, combine_stats, len(temperature_data), h_init)
    
    # Get final statistics
    final_stats = d_result.get()[0]
    gpu_min = final_stats['min_val']
    gpu_max = final_stats['max_val']
    gpu_count = final_stats['count']
    gpu_sum = final_stats['sum_val']
    gpu_average = gpu_sum / gpu_count if gpu_count > 0 else 0
    
    # CPU verification
    cpu_min = min(temperature_data)
    cpu_max = max(temperature_data)
    cpu_count = len(temperature_data)
    cpu_sum = sum(temperature_data)
    cpu_average = cpu_sum / cpu_count
    
    print(f"\nTemperature Statistics:")
    print(f"{'Statistic':<12} | {'GPU Result':<12} | {'CPU Result':<12} | {'Match':<8}")
    print("-" * 50)
    # GPU values are float32 and CPU values are float64, so compare with a
    # relative tolerance (np.isclose) rather than exact equality or 1e-6
    print(f"{'Minimum':<12} | {gpu_min:<12.2f} | {cpu_min:<12.2f} | {np.isclose(gpu_min, cpu_min)}")
    print(f"{'Maximum':<12} | {gpu_max:<12.2f} | {cpu_max:<12.2f} | {np.isclose(gpu_max, cpu_max)}")
    print(f"{'Count':<12} | {gpu_count:<12} | {cpu_count:<12} | {gpu_count == cpu_count}")
    print(f"{'Sum':<12} | {gpu_sum:<12.2f} | {cpu_sum:<12.2f} | {np.isclose(gpu_sum, cpu_sum)}")
    print(f"{'Average':<12} | {gpu_average:<12.2f} | {cpu_average:<12.2f} | {np.isclose(gpu_average, cpu_average)}")
    print(f"\nAll statistics were computed in a single GPU reduction.")

custom_struct_reduction_example()
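
A parallel reduction may combine elements in any grouping, so the binary operator should be associative, and choosing the initial value as the operator's identity keeps it from affecting the result. The host-only sketch below checks both properties for a min/max/count/sum combiner. `Stats` and `combine` are illustrative stand-ins built on `NamedTuple`, not the `gpu_struct` version.

```python
import math
from functools import reduce
from typing import NamedTuple

class Stats(NamedTuple):
    """Host-only stand-in for the Statistics struct."""
    min_val: float
    max_val: float
    count: int
    sum_val: float

def combine(a, b):
    """Merge two partial results field by field."""
    return Stats(min(a.min_val, b.min_val), max(a.max_val, b.max_val),
                 a.count + b.count, a.sum_val + b.sum_val)

identity = Stats(float("inf"), float("-inf"), 0, 0.0)
leaves = [Stats(t, t, 1, t) for t in [23.5, 18.2, 31.8, 27.1]]

# Two groupings that a parallel reduction might use
left_to_right = reduce(combine, leaves, identity)
pairwise = combine(combine(leaves[0], leaves[1]),
                   combine(leaves[2], leaves[3]))

print(left_to_right)
print(pairwise)
```

Min, max, and count agree exactly across groupings. Floating-point addition is only approximately associative, so the sums can differ in the last bits, which is one reason to compare GPU and CPU sums with a tolerance.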

## 10. Lab Exercises

Now it's time to practice. Here are some hands-on exercises to reinforce your learning:

### Exercise 1: Basic Operations

In [None]:
def exercise_1_basic_operations():
    """Exercise 1: Implement basic parallel operations"""
    
    print("=== Exercise 1: Basic Operations ===")
    print("Implement the following operations using cuda.cccl.parallel:")
    
    # TODO: Implement these functions
    
    def exercise_1a_sum_of_cubes():
        """Calculate sum of cubes from 1 to 1000"""
        # Your code here
        # Hint: Use CountingIterator + TransformIterator + reduce_into
        pass
    
    def exercise_1b_alternating_sum():
        """Calculate alternating sum: 1 - 2 + 3 - 4 + 5 - 6 + ..."""
        # Your code here  
        # Hint: Transform based on index (odd/even)
        pass
    
    def exercise_1c_temperature_conversion():
        """Convert array of Celsius temperatures to Fahrenheit and find hottest day"""
        celsius_temps = [22.5, 18.3, 31.7, 15.9, 28.4, 35.1, 19.8]
        # Your code here
        # Hint: Use unary_transform for conversion, reduce_into for maximum
        pass
    
    print("Implement the functions above and test them!")

exercise_1_basic_operations()
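
To check your GPU solutions, a host-side reference can help. The sketch below uses plain NumPy rather than `cuda.cccl.parallel`. The upper bound `n = 1000` for the alternating sum is an assumption, since the exercise leaves it open.

```python
import numpy as np

# 1a: sum of cubes from 1 to 1000.
# The total exceeds the int32 range, so accumulate in int64.
k = np.arange(1, 1001, dtype=np.int64)
sum_of_cubes = int((k ** 3).sum())

# 1b: alternating sum 1 - 2 + 3 - 4 + ... over n terms (n is an assumption).
n = 1000
i = np.arange(1, n + 1, dtype=np.int64)
alternating_sum = int(np.where(i % 2 == 1, i, -i).sum())

# 1c: Celsius to Fahrenheit, then the hottest day.
celsius_temps = np.array([22.5, 18.3, 31.7, 15.9, 28.4, 35.1, 19.8])
fahrenheit = celsius_temps * 9.0 / 5.0 + 32.0
hottest_f = float(fahrenheit.max())

print(sum_of_cubes)         # 250500250000
print(alternating_sum)      # -500
print(round(hottest_f, 2))  # 95.18
```

Because the sum of cubes does not fit in 32 bits, the GPU version likely needs a 64-bit output array and initial value as well.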

### Exercise 2: Data Processing Pipeline

In [None]:
def exercise_2_data_pipeline():
    """Exercise 2: Build a data processing pipeline"""
    
    print("\n=== Exercise 2: Data Processing Pipeline ===")
    print("Build a pipeline to process student grade data:")
    
    # Sample data: student grades across multiple subjects
    student_data = {
        'math_scores': [85, 92, 78, 96, 88, 71, 94, 82, 90, 87],
        'science_scores': [88, 89, 82, 94, 85, 75, 91, 84, 92, 89],
        'english_scores': [92, 87, 85, 89, 91, 78, 88, 86, 85, 90]
    }
    
    print("Student scores:")
    for subject, scores in student_data.items():
        print(f"  {subject}: {scores}")
    
    def exercise_2a_calculate_averages():
        """Calculate average score for each student across all subjects"""
        # Your code here
        # Hint: Use binary_transform to combine scores, then transform to get average
        pass
    
    def exercise_2b_grade_distribution():
        """Count how many students fall into each grade category"""
        # Grade categories: A (>= 90), B (80 to < 90), C (70 to < 80), D (60 to < 70), F (< 60)
        # Your code here
        # Hint: Use transform + reduce for each category
        pass
    
    def exercise_2c_top_performers():
        """Find students with average > 90 and count them"""
        # Your code here
        # Hint: Chain transforms to calculate average, then filter and count
        pass
    
    print("Implement the functions above!")

exercise_2_data_pipeline()
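
A NumPy reference for this pipeline gives values to test your GPU results against. The sketch assumes that grade categories are assigned by each student's average and, because averages are fractional, treats each category as a half-open interval (for example, B is 80 up to but not including 90).

```python
import numpy as np

scores = np.array([
    [85, 92, 78, 96, 88, 71, 94, 82, 90, 87],  # math
    [88, 89, 82, 94, 85, 75, 91, 84, 92, 89],  # science
    [92, 87, 85, 89, 91, 78, 88, 86, 85, 90],  # english
], dtype=np.float64)

# 2a: per-student average across the three subjects
averages = scores.mean(axis=0)

# 2b: bucket averages; digitize returns 0 for < 60 up to 4 for >= 90
category = np.digitize(averages, [60, 70, 80, 90])
counts = np.bincount(category, minlength=5)
grade_counts = dict(zip("FDCBA", counts.tolist()))

# 2c: number of students with an average above 90
top_performers = int((averages > 90).sum())

print(np.round(averages, 2))
print(grade_counts)    # {'F': 0, 'D': 0, 'C': 1, 'B': 7, 'A': 2}
print(top_performers)  # 2
```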

### Exercise 3: Custom Data Types

In [None]:
def exercise_3_custom_types():
    """Exercise 3: Work with custom data structures"""
    
    print("\n=== Exercise 3: Custom Data Types ===")
    print("Create and process custom data structures:")
    
    def exercise_3a_point_operations():
        """Create a Point2D struct and calculate distances"""
        
        # TODO: Define Point2D struct with x, y fields
        # TODO: Create array of points
        # TODO: Calculate distance from origin for each point
        # TODO: Find the farthest point from origin
        
        points_data = [(1, 2), (3, 4), (0, 5), (-2, 1), (4, -3)]
        print(f"Points: {points_data}")
        
        # Your code here
        pass
    
    def exercise_3b_rgb_operations():
        """Process RGB color data"""
        
        # TODO: Define RGB struct with r, g, b fields (uint8)
        # TODO: Convert RGB to grayscale using formula: 0.299*R + 0.587*G + 0.114*B  
        # TODO: Find the brightest pixel (highest grayscale value)
        # TODO: Count pixels that are "mostly red" (R > G and R > B)
        
        rgb_data = [
            (255, 0, 0), (0, 255, 0), (0, 0, 255),    # Pure colors
            (255, 255, 255), (0, 0, 0), (128, 128, 128),  # Grayscale
            (255, 128, 64), (64, 255, 128), (128, 64, 255)   # Mixed colors
        ]
        print(f"RGB data: {rgb_data}")
        
        # Your code here
        pass
    
    print("Implement the functions above!")

exercise_3_custom_types()
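
For checking Exercise 3, the same data can be processed on the host with NumPy structured arrays. The field names and types below mirror what the TODOs ask for, with `float32` for the point fields as an assumption. This is a reference sketch, not the `gpu_struct` solution.

```python
import numpy as np

# 3a: distances from the origin
points = np.array(
    [(1, 2), (3, 4), (0, 5), (-2, 1), (4, -3)],
    dtype=[("x", np.float32), ("y", np.float32)],
)
distances = np.hypot(points["x"], points["y"])
farthest = float(distances.max())  # several points tie for the maximum

# 3b: luminance, brightest pixel, and "mostly red" count
rgb = np.array(
    [(255, 0, 0), (0, 255, 0), (0, 0, 255),
     (255, 255, 255), (0, 0, 0), (128, 128, 128),
     (255, 128, 64), (64, 255, 128), (128, 64, 255)],
    dtype=[("r", np.uint8), ("g", np.uint8), ("b", np.uint8)],
)
luma = 0.299 * rgb["r"] + 0.587 * rgb["g"] + 0.114 * rgb["b"]
brightest_index = int(np.argmax(luma))
mostly_red = int(((rgb["r"] > rgb["g"]) & (rgb["r"] > rgb["b"])).sum())

print(farthest)         # 5.0
print(brightest_index)  # 3 (the white pixel)
print(mostly_red)       # 2
```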

## Resources
API Reference: https://nvidia.github.io/cccl/python/parallel_api.html#module-cuda.cccl.parallel.experimental.algorithms