## Exercise - CCCL - Customizing Algorithms

### What is `cuda-cccl`?

The [CUDA Core Compute Libraries (CCCL)](https://nvidia.github.io/cccl/python/) provide high-quality, high-performance abstractions for CUDA development in Python. The `cuda-cccl` Python package is composed of two independent subpackages:

* `cuda.compute` is a **parallel algorithms library** containing algorithms like `reduce`, `transform`, `scan` and `sort`. These can be combined to implement more complex algorithms while delivering the performance of hand-optimized CUDA kernels, portable across different GPU architectures. They are general-purpose and **designed to be used with CuPy, PyTorch and other array/tensor frameworks**.

* `cuda.coop` is a lower-level library containing **cooperative algorithms meant to be used within (Numba) CUDA kernels**. Examples include _block-wide reduction_ and _warp-wide scan_, providing Numba CUDA kernel developers with building blocks to create speed-of-light, custom kernels.

### When to use it?

`cuda-cccl` provides a level of abstraction in between tensor libraries and raw CUDA kernels.

- If you want to implement custom functionality that cannot be expressed easily and efficiently using PyTorch/CuPy operations, you can reach for `cuda.compute` before resorting to writing CUDA kernels.
- If you _do_ need to write a kernel, you can often use the block-level and warp-level primitives offered by `cuda.coop` to write it much more efficiently and concisely.

<img src="images/cccl-spectrum.png" width="1000">

## Installation

The command below installs `cuda-cccl` along with pieces of the CUDA toolkit it needs.

```bash
pip install "cuda-cccl[test-cu12]" matplotlib
```

The `[test-cu12]` extra installs CuPy, which we will use in our examples. CuPy is not strictly a dependency of `cuda-cccl`: you can use any array-like object (like PyTorch tensors) as well.

In [None]:
import numpy as np
import cupy  as cp
import cuda.compute as comp

## Hello `cccl`: Simple Reductions

A **reduction** takes many values and combines them into a single result using a binary operation.

As a simple example, consider a sequence of values like $[2, 3, 5, 1, 7, 6, 8, 4]$. The *sum* of the values of that sequence is a reduction using _addition_ as the binary operation: $(2 + 3 + 5 + 1 + 7 + 6 + 8 + 4) = 36$. Similarly, the *maximum value* can be obtained by performing a reduction using `max(a, b)` as the binary operation.

A reduction can be computed in parallel. Typically this is done using a "tree" reduction where elements are combined in pairs across multiple levels, resembling the structure of a binary tree. At each level, the number of elements is halved as partial results are computed in parallel. This continues until a single final result is obtained at the root of the tree.

<img src="https://upload.wikimedia.org/wikipedia/commons/e/ee/Binomial_tree.gif" width="600">
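The tree reduction described above can be sketched on the CPU in a few lines of pure Python. This is only an illustration of the access pattern, not how `cuda.compute` is implemented; the helper name `tree_reduce` is made up for this sketch.

```python
from operator import add


def tree_reduce(values, op):
    """Combine values pairwise, level by level, until one result remains."""
    level = list(values)
    if not level:
        raise ValueError("tree_reduce() needs at least one value")
    while len(level) > 1:
        # The pairs within one level are independent of each other,
        # so a GPU could process a whole level in parallel.
        next_level = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            next_level.append(level[-1])  # an unpaired element carries over unchanged
        level = next_level
    return level[0]


values = [2, 3, 5, 1, 7, 6, 8, 4]
total = tree_reduce(values, add)    # 36
largest = tree_reduce(values, max)  # 8
print(total, largest)
```

With 8 inputs the loop runs 3 times (8 → 4 → 2 → 1 values), which is where the logarithmic depth of a parallel reduction comes from.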


If you know some CUDA, you can quite easily write a kernel to implement this kind of parallel reduction. However, optimizing it for the specific CUDA architecture of your device, and generalizing for different data types and sizes can be difficult.

This is where `cuda.compute` comes in. It provides optimized implementations of algorithms like reduction, so you get tuned performance across GPU architectures, data types and input sizes without writing the kernel yourself.

### Using `reduce_into()` to compute the sum of a sequence

`cuda.compute` provides a `reduce_into()` function to compute general reductions:

In [None]:
"""
Using `reduce_into()` to compute the sum of a sequence
"""

# Prepare the inputs and outputs.
d_input = cp.array([2, 3, 5, 1, 7, 6, 8, 4], dtype=np.int32)  # input sequence, a CuPy (device) array
d_output = cp.empty(1, dtype=np.int32)  # array which will hold the result, a CuPy (device) array of size 1
h_init = np.array([0], dtype=np.int32)  # initial value of the reduction, a NumPy (host) array of size 1

# Perform the reduction.
comp.reduce_into(d_input, d_output, comp.OpKind.PLUS, len(d_input), h_init)

print(d_input)
# Verify the result.
expected_output = 36
assert (d_output == expected_output).all()
result = d_output[0]
print(f"Sum reduction result: {result}")

### Exercise: computing the minimum value

`reduce_into()` can be used to compute other reductions as well.

Similar to the examples above, below is an incomplete code snippet for computing the minimum value of a sequence. Complete the section between the comments `begin TODO` and `end TODO` to use `reduce_into()` to compute the minimum.

In [None]:
"""
Using `reduce_into()` to compute the minimum value of a sequence
"""

d_input = cp.array([-2, 3, 5, 1, 7, -6, 8, -4], dtype=np.int32)
d_output = cp.empty(1, dtype=np.int32)

# begin TODO


# end TODO

expected_output = -6
assert (d_output == expected_output).all()
result = d_output[0]
print(f"Min reduction result: {result}")

## Custom Reductions

### Example: sum of even values

At this point, you might be thinking:

> **_Umm, can't I just use CuPy or PyTorch to compute sum or max?_**

Of course, given a CuPy array, it's trivial to do simple reductions like `sum`, `min` or `max`:

In [None]:
d_input = cp.array([-2, 3, 5, 1, 7, -6, 8, -4], dtype=np.int32)

print(f"Sum using cp.sum: {cp.sum(d_input)}")
print(f"Max value using cp.max: {cp.max(d_input)}")
print(f"Min value using cp.min: {cp.min(d_input)}")

The benefit of `cuda-cccl` is more apparent when you want to do custom operations. For example, rather than just computing a straightforward `sum`, let's say we wanted to compute the sum of **only even values** in a sequence. Naively, here's how to do that with CuPy:

In [None]:
d_input = cp.array([2, 3, 5, 1, 7, 6, 8, 4], dtype=np.int32)
result = (d_input[d_input % 2 == 0]).sum()
print(f"Sum of even values with CuPy: {result}")

Now, let's do the same thing with `cuda.compute`:

In [None]:
"""
Using `reduce_into()` with a custom binary operation
"""

# Define a custom binary operation for the reduction.
def sum_even_op(a, b):
    return (a if a % 2 == 0 else 0) + (b if b % 2 == 0 else 0)

d_input = cp.array([2, 3, 5, 1, 7, 6, 8, 4], dtype=np.int32)
d_output = cp.empty(1, dtype=np.int32)
h_init = np.array([0], dtype=np.int32)

# Call `reduce_into()` passing the function above for the binary operation:
comp.reduce_into(d_input, d_output, sum_even_op, len(d_input), h_init)
result = d_output.get()[0]
print(f"Sum of even values with `cuda.compute`: {result}")
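A parallel reduction may group elements differently from a left-to-right loop, so a custom operation generally has to give the same answer under any grouping. The pure-Python check below is a CPU-side sketch (it does not call `cuda.compute`) that compares a sequential grouping with a tree-style grouping of the same `sum_even_op`.

```python
from functools import reduce


def sum_even_op(a, b):
    # Same operation as in the cell above: odd operands contribute 0.
    return (a if a % 2 == 0 else 0) + (b if b % 2 == 0 else 0)


values = [2, 3, 5, 1, 7, 6, 8, 4]

# Left-to-right grouping, starting from the initial value 0.
sequential = reduce(sum_even_op, values, 0)

# Tree-style grouping: reduce each half separately, then combine the halves.
left = reduce(sum_even_op, values[:4], 0)
right = reduce(sum_even_op, values[4:], 0)
tree_style = sum_even_op(left, right)

print(sequential, tree_style)  # both are 20
```

The groupings agree because every partial result is a sum of even numbers and is therefore even itself, so it passes through the parity check unchanged when it is combined again.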

We got the same result using `cuda.compute`, but we had to write significantly more code. Is it worth it? Below is a small benchmarking script comparing timings for a range of input sizes:

### Comparing custom reduction performance with naive CuPy implementation

In [None]:
"""
Compare the performance of the `cuda.compute` implementation with a naive CuPy implementation
"""

import timeit

def evens_sum_cupy(d_input, d_output, h_init):
    # ignore h_init
    cp.sum(d_input[d_input % 2 == 0], out=d_output[0])

def evens_sum_cccl(d_input, d_output, h_init):
    # note, using `sum_even_op` as the binary operation, rather than `OpKind.PLUS`:
    comp.reduce_into(d_input, d_output, sum_even_op, len(d_input), h_init)

def time_gpu_func(f, *args, **kwargs):
    cp.cuda.Device().synchronize()
    t1 = timeit.default_timer()
    n = 1_000
    for i in range(n):
        f(*args, **kwargs)
        cp.cuda.Device().synchronize()
    t2 = timeit.default_timer()
    return t2 - t1

sizes = [10_000, 100_000, 1_000_000, 10_000_000, 100_000_000]
cccl_times = []
cp_times = []

for n in sizes:
    d_input = cp.random.randint(low=0, high=10, size=n, dtype=np.int32)
    d_out = cp.empty(1, dtype=np.int32)
    h_init = np.array([0], dtype=np.int32)

    cccl_times.append(time_gpu_func(evens_sum_cccl, d_input, d_out, h_init))
    cp_times.append(time_gpu_func(evens_sum_cupy, d_input, d_out, h_init))

import matplotlib.pyplot as plt

# Plotting
fig = plt.figure(figsize=(10, 5))
plt.loglog(sizes, cccl_times, marker='o', label='cuda.compute')
plt.loglog(sizes, cp_times, marker='s', label='CuPy')

# Annotate each cuda.compute point with speedup vs CuPy
for x, t_cccl, t_cp in zip(sizes, cccl_times, cp_times):
    speedup = t_cp / t_cccl
    label = f"{speedup:.1f}x faster"
    plt.annotate(label,
                 (x, t_cccl),
                 textcoords="offset points",
                 xytext=(5, -10),  # offset position
                 ha='left',
                 fontsize=9,
                 color='green')

# Labels and title
plt.xlabel('Input Size')
plt.ylabel('Time (seconds)')
plt.title('Timing Comparison for evens_sum.')
plt.legend()
plt.grid(True)
plt.tight_layout()


We see that using `cuda.compute` is much faster than our naive CuPy approach. This is because:

* Operator fusion: the CuPy expression `(x[x % 2 == 0]).sum()` is actually 4 separate operations (modulo, comparison, boolean indexing and sum), and at least 4 separate CUDA kernel invocations. With `cuda.compute`, a single call to `reduce_into()` does all the computation.
* No intermediate memory allocations: the CuPy version materializes the remainder array, the boolean mask and the filtered array before summing.
* Less Python overhead: `cuda.compute` is a lower-level library. You don't have to jump through multiple layers of Python before invoking device code.
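To make the fusion point concrete, here is a pure-Python analogy in which plain lists stand in for device arrays. The step breakdown is an illustration of the idea, not a trace of the kernels CuPy actually launches.

```python
values = [2, 3, 5, 1, 7, 6, 8, 4]

# Unfused: each step builds a full intermediate result before the next one starts.
remainders = [v % 2 for v in values]                      # step 1: x % 2
mask = [r == 0 for r in remainders]                       # step 2: ... == 0
selected = [v for v, keep in zip(values, mask) if keep]   # step 3: x[mask]
unfused_total = sum(selected)                             # step 4: .sum()

# Fused: a single pass over the input with no intermediate sequences.
fused_total = 0
for v in values:
    if v % 2 == 0:
        fused_total += v

print(unfused_total, fused_total)  # both are 20
```

The unfused version creates three temporary sequences the size of the input (or close to it); the fused version only keeps one running total.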

## Scan Operations

### What is a Scan Operation?

A **scan** (also called prefix sum) computes a running total of elements. For each position, it shows the cumulative result up to that point.

**Two types of scans:**
* **Inclusive scan**: Includes the current element in the sum
* **Exclusive scan**: Excludes the current element (shifts results)

**Visual example:**

```
Input:     [3, 1, 4, 1, 5]
Inclusive: [3, 4, 8, 9, 14]  (3, 3+1, 3+1+4, 3+1+4+1, 3+1+4+1+5)
Exclusive: [0, 3, 4, 8, 9]   (0, 3, 3+1, 3+1+4, 3+1+4+1)
```
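As a CPU reference for the two scan flavours, the sketch below uses `itertools.accumulate` from the standard library. The helper names `inclusive_scan_ref` and `exclusive_scan_ref` are made up for this illustration and are not part of `cuda.compute`.

```python
from itertools import accumulate
from operator import add


def inclusive_scan_ref(values, op, init):
    # Fold `init` into the first element, then keep every running result.
    return list(accumulate(values, op, initial=init))[1:]


def exclusive_scan_ref(values, op, init):
    # Start from `init` and drop the final running result.
    return list(accumulate(values, op, initial=init))[:-1]


data = [3, 1, 4, 1, 5]
inclusive = inclusive_scan_ref(data, add, 0)  # [3, 4, 8, 9, 14]
exclusive = exclusive_scan_ref(data, add, 0)  # [0, 3, 4, 8, 9]
print(inclusive, exclusive)
```

Both helpers accept any binary operation, so passing `max` instead of `add` gives a running maximum.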

In [None]:
d_input = cp.array([3, 1, 4, 1, 5, 9, 2, 6], dtype=np.int32)
d_inclusive = cp.empty_like(d_input)
d_exclusive = cp.empty_like(d_input)
h_init = np.array([0], dtype=np.int32)

def add_op(a, b):
    return a + b

comp.inclusive_scan(d_input, d_inclusive, add_op, h_init, len(d_input))
comp.exclusive_scan(d_input, d_exclusive, add_op, h_init, len(d_input))

print(f"Input:           {d_input.get()}")
print(f"Inclusive scan:  {d_inclusive.get()}")
print(f"Exclusive scan:  {d_exclusive.get()}")

# Verify with NumPy
np_inclusive = np.cumsum(d_input.get())
np_exclusive = np.concatenate([[0], np_inclusive[:-1]])
np.testing.assert_allclose(d_inclusive.get(), np_inclusive)
np.testing.assert_allclose(d_exclusive.get(), np_exclusive)
print(f"NumPy inclusive:    {np_inclusive}")
print(f"NumPy exclusive:    {np_exclusive}")

### Maximum Scan Example

Scans aren't limited to addition. Here's an example that uses the maximum operation to compute a running maximum.


In [None]:
# Running maximum example
d_input = cp.array([3, 7, 2, 9, 1, 8, 4, 6], dtype=np.int32)
d_output = cp.empty_like(d_input)

def max_op(a, b):
    return a if a > b else b

# Start from the smallest int32 value so that any input element replaces it
h_init = np.array([np.iinfo(np.int32).min], dtype=np.int32)

# Perform inclusive scan with max operation
comp.inclusive_scan(d_input, d_output, max_op, h_init, len(d_input))

print(f"Input:       {d_input.get()}")
print(f"Running max: {d_output.get()}")

# Verify with NumPy
np_running_max = np.maximum.accumulate(d_input.get())
print(f"NumPy max:   {np_running_max}")
print(f"Match:       {np.array_equal(d_output.get(), np_running_max)}")


## Sorting

### Merge Sort

The `merge_sort` function can be used to perform key-value sorting.

In [None]:
# Prepare the input arrays.
d_in_keys = cp.asarray([-5, 0, 2, -3, 2, -3, 0, -3, -5, 2], dtype="int32")
d_in_values = cp.asarray(
    [-3.2, 2.2, 1.9, 4.0, -3.9, 2.7, 0, 8.3 - 1, 2.9, 5.4], dtype="float32"
)

# Perform the merge sort.
comp.merge_sort(
    d_in_keys,
    d_in_values,
    d_in_keys,  # reuse input array to store output
    d_in_values,  # reuse input array to store output
    comp.OpKind.LESS,
    d_in_keys.size,
)

print(f"Sorted keys: {d_in_keys.get()}")
print(f"Sorted values: {d_in_values.get()}")

If you only want to sort keys (with no corresponding values), pass `None` for both the input and output values arrays:

In [None]:
# Prepare the input and output arrays.
d_in_keys = cp.asarray([-5, 0, 2, -3, 2, -3, 0, -3, -5, 2], dtype="int32")

print(d_in_keys)

# Perform the merge sort.
comp.merge_sort(
    d_in_keys,
    None,  # don't specify a values array
    d_in_keys,  # reuse input array to store output
    None,  # don't specify a values array
    comp.OpKind.LESS,
    d_in_keys.size,
)

print(f"Sorted keys: {d_in_keys.get()}")

#### Exercise - sort by the last digit

In this exercise, you'll use `merge_sort` with a custom comparator function to sort elements by the last digit.
For example, $[29, 9, 136, 1001, 72, 24, 32, 1] \rightarrow [1001, 1, 72, 32, 24, 136, 29, 9]$.

In [None]:
# Prepare the input and output arrays.
d_in_keys = cp.asarray([29, 9, 136, 1001, 72, 24, 32, 1], dtype="int32")

# define the custom comparator.
def comparison_op(lhs, rhs):
    # begin TODO

    # end TODO

# Perform the merge sort.
comp.merge_sort(
    # begin TODO

    # end TODO
)

print(f"Result: {d_in_keys}")
expected = np.asarray([1001, 1, 72, 32, 24, 136, 29, 9], dtype=np.int32)
assert (d_in_keys.get() == expected).all()

### Radix Sort

The `radix_sort` function provides fast sorting for numeric types using the radix sort algorithm. Unlike merge sort, radix sort doesn't use comparisons but instead processes the bits/digits of numbers.
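To show what "processing the bits of numbers" means, here is a minimal least-significant-digit radix sort in pure Python. It is a teaching sketch under simplifying assumptions (non-negative integers only, 4 bits per pass) and is not the GPU implementation behind `radix_sort`; the name `radix_sort_lsd` is made up for this example.

```python
def radix_sort_lsd(values, bits_per_pass=4):
    """Sort non-negative integers by bucketing on groups of bits, lowest bits first."""
    if any(v < 0 for v in values):
        raise ValueError("this sketch only handles non-negative integers")
    result = list(values)
    radix = 1 << bits_per_pass   # number of buckets per pass
    mask = radix - 1
    largest = max(result, default=0)
    shift = 0
    while (largest >> shift) > 0:
        buckets = [[] for _ in range(radix)]
        for v in result:
            # Appending keeps the order from the previous pass (stable),
            # which is what makes the digit-by-digit approach correct.
            buckets[(v >> shift) & mask].append(v)
        result = [v for bucket in buckets for v in bucket]
        shift += bits_per_pass
    return result


data = [64, 34, 25, 12, 22, 11, 90, 5, 77, 30]
print(radix_sort_lsd(data))
```

No two elements are ever compared with each other; each pass only looks at a few bits of every key, which is why radix sort is restricted to types with a bitwise ordering.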

In [None]:
# Basic radix sort example (ascending order)
d_input = cp.array([64, 34, 25, 12, 22, 11, 90, 5, 77, 30], dtype=np.int32)
d_output = cp.empty_like(d_input)

print(f"Input:  {d_input.get()}")

# Sort in ascending order
comp.radix_sort(
    d_input,                           # Input keys
    d_output,                          # Output keys
    None,                              # Input values (none for keys-only sort)
    None,                              # Output values (none)
    comp.SortOrder.ASCENDING,          # Sort order
    len(d_input)                       # Number of elements
)

print(f"Sorted: {d_output.get()}")

# Verify sorting
is_sorted = all(d_output.get()[i] <= d_output.get()[i+1] for i in range(len(d_output.get())-1))
print(f"Properly sorted: {is_sorted}")


In [None]:
# Descending order sort
d_input = cp.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3], dtype=np.int32)
d_output = cp.empty_like(d_input)

comp.radix_sort(
    d_input, d_output, None, None,
    comp.SortOrder.DESCENDING,    # Sort in reverse order
    len(d_input)
)

print(f"Input:            {d_input.get()}")
print(f"Descending sort:  {d_output.get()}")

# Verify descending order
is_descending = all(d_output.get()[i] >= d_output.get()[i+1] for i in range(len(d_output.get())-1))
print(f"Properly descending: {is_descending}")


In [None]:
# Key-value sorting: sort scores while keeping student IDs aligned
scores = [85, 92, 78, 96, 88, 71, 94]
student_ids = [101, 102, 103, 104, 105, 106, 107]

d_keys = cp.array(scores, dtype=np.int32)
d_values = cp.array(student_ids, dtype=np.int32)
d_keys_out = cp.empty_like(d_keys)
d_values_out = cp.empty_like(d_values)

print("Before sorting:")
for score, student_id in zip(scores, student_ids):
    print(f"  Student {student_id}: {score}")

# Sort by scores (highest first), keep student IDs aligned
comp.radix_sort(
    d_keys, d_keys_out,                # Input/output keys (scores)
    d_values, d_values_out,            # Input/output values (student IDs)
    comp.SortOrder.DESCENDING,    # Highest scores first
    len(d_keys)
)

sorted_scores = d_keys_out.get()
sorted_ids = d_values_out.get()

print("\nAfter sorting (by score, highest first):")
for score, student_id in zip(sorted_scores, sorted_ids):
    print(f"  Student {student_id}: {score}")


## Transforming


#### Unary transform

The `unary_transform` function applies a user-provided unary operation to each element of the input.

In [None]:
# Prepare the input and output arrays.
d_in = cp.asarray([1, 2, 3, 4, 5], dtype=np.int32)
d_out = cp.empty_like(d_in)

def double_op(a):
    return a * 2

# Perform the unary transform.
comp.unary_transform(d_in, d_out, double_op, len(d_in))
print(f"Result of unary transform: {d_out.get()}")

#### Binary transform

The `binary_transform` function applies a user-provided binary operation to pairs of elements from two inputs.

In [None]:
# Prepare the input and output arrays.
d_in1 = cp.asarray([2, 8, 9, 6, 3], dtype=np.int32)
d_in2 = cp.asarray([7, 2, 1, 0, -1], dtype=np.int32)
d_out = cp.empty_like(d_in1)

# Perform the binary transform.
comp.binary_transform(d_in1, d_in2, d_out, comp.OpKind.PLUS, len(d_in1))
print(f"Result of binary transform: {d_out.get()}")

#### Data Normalization with Transform

Transform operations are commonly used in machine learning for data preprocessing, such as normalizing features to have zero mean and unit variance.
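As a CPU reference for what z-score normalization computes, the sketch below uses only the standard library. It assumes the population standard deviation, which is also what `np.std` computes by default.

```python
from statistics import fmean, pstdev

prices = [250000, 180000, 320000, 420000, 150000, 380000, 220000, 295000]

mean = fmean(prices)
std = pstdev(prices)  # population standard deviation (divides by N)

# z-score: how many standard deviations each price lies from the mean
z_scores = [(p - mean) / std for p in prices]

print([round(z, 3) for z in z_scores])
print(f"mean of z-scores: {fmean(z_scores):.6f}")
print(f"std of z-scores:  {pstdev(z_scores):.6f}")
```

After normalization the values have mean 0 and standard deviation 1 up to floating-point rounding, which is the property the GPU version below also checks.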

In [None]:
# Example: Normalize house prices for machine learning
house_prices = np.array([250000, 180000, 320000, 420000, 150000, 380000, 220000, 295000], dtype=np.float32)
d_prices = cp.array(house_prices)
d_normalized = cp.empty_like(d_prices)

# Calculate statistics for normalization
price_mean = float(np.mean(house_prices))
price_std = float(np.std(house_prices))

print(f"Original prices: {house_prices}")
print(f"Mean: ${price_mean:,.0f}, Std: ${price_std:,.0f}")

def z_score_normalize(price):
    """Z-score normalization: (x - mean) / std"""
    return (price - price_mean) / price_std

# Apply normalization transformation
comp.unary_transform(d_prices, d_normalized, z_score_normalize, len(house_prices))

normalized_result = d_normalized.get()
print(f"Normalized prices: {normalized_result}")
print(f"Normalized mean: {np.mean(normalized_result):.6f}")
print(f"Normalized std: {np.std(normalized_result):.6f}")


#### Transform with Iterators for Memory Efficiency

Combining transforms with iterators allows complex computations without storing intermediate arrays.


In [None]:
# Calculate sum of squares from 1 to 1000 without storing intermediate arrays
def square_func(x):
    return x * x

def add_op(a, b):
    return a + b

# Method 1: Using iterators (memory efficient)
counting_it = comp.CountingIterator(np.int64(1))  # 1, 2, 3, ...
squares_it = comp.TransformIterator(counting_it, square_func)  # 1², 2², 3², ...

d_result = cp.empty(1, dtype=np.int64)
h_init = np.array([0], dtype=np.int64)

# Sum the squares directly without storing them
comp.reduce_into(squares_it, d_result, add_op, 1000, h_init)
iterator_result = d_result.get()[0]

# Mathematical verification: sum of squares = n(n+1)(2n+1)/6
n = 1000
formula_result = n * (n + 1) * (2 * n + 1) // 6

print(f"Sum of squares from 1 to {n}:")
print(f"Iterator result: {iterator_result:,}")
print(f"Formula result:  {formula_result:,}")
print(f"Correct: {iterator_result == formula_result}")
print(f"Memory used: Only space for final result (~8 bytes)")


## Custom (Struct) Types


So far, we've seen how to use `cuda.compute` with input arrays composed of numeric values (ints and floats). A powerful feature of `cuda.compute` is that it can also work with "struct" values, i.e., values that are in turn composed of more than one value.

For example, consider a sequence of RGB values, like those used in graphics applications. Each RGB value represents a pixel's color and consists of three components: <font color="red">**red**</font>, <font color="green">**green**</font>, and <font color="blue">**blue**</font> intensity levels.

The code below shows how you can use `cuda.compute` to find the pixel with the highest <font color="green">**green**</font> intensity level.

In [None]:
# use `@gpu_struct` to define the data type of each value:
@comp.gpu_struct
class Pixel:
    r: np.int32
    g: np.int32
    b: np.int32

# Define a reduction operation that operates on two `Pixel` objects:
def max_g_value(x, y):
    return x if x.g > y.g else y

# Prepare the input and output arrays. These are just CuPy arrays:
dtype = np.dtype([("r", np.int32), ("g", np.int32), ("b", np.int32)], align=True)  # alternately, use `Pixel.dtype`
d_rgb = cp.random.randint(0, 256, (10, 3), dtype=np.int32).view(dtype)
d_out = cp.empty(1, dtype)

# Define the initial value for the reduction. This must be a `Pixel` object:
h_init = Pixel(0, 0, 0)

# Perform the reduction.
comp.reduce_into(d_rgb, d_out, max_g_value, d_rgb.size, h_init)

# Verify the result.
print(f"Input RGB values: \n {d_rgb.get()}")
result = d_out.get()
print(f"Pixel with greatest 'g' intensity: {result}")

## Working with Iterators

Now you have a taste for how to use `cuda.compute` with **custom ops** and **custom data types**. _Iterators_ are another powerful tool in your toolbox for solving more complex problems.

Iterators represent streams of data that are computed "on-the-fly". Unlike arrays, iterators do not require any memory allocation, and thus can represent huge sequences without consuming valuable GPU memory. Iterators can be used as inputs (and sometimes outputs) to algorithms in place of arrays.

Note that "iterators" in the context of the `cuda.compute` library are distinct from the concept of [iterators](https://docs.python.org/3/glossary.html#term-iterator) in the Python language.
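One way to picture the difference is an object that computes element `i` on request instead of storing it. The two classes below are hypothetical stand-ins written for this sketch; they are not the library's classes and only mimic the idea on the CPU.

```python
class CountingSketch:
    """Hypothetical stand-in for a counting iterator: element i is start + i."""

    def __init__(self, start):
        self.start = start

    def __getitem__(self, i):
        return self.start + i  # computed on demand, nothing is stored


class TransformSketch:
    """Hypothetical stand-in for a transform iterator: applies `func` on access."""

    def __init__(self, source, func):
        self.source = source
        self.func = func

    def __getitem__(self, i):
        return self.func(self.source[i])


squares = TransformSketch(CountingSketch(1), lambda x: x * x)

# Only the 5 requested elements are ever computed: 1 + 4 + 9 + 16 + 25
total = sum(squares[i] for i in range(5))
print(total)  # 55
```

Because any index can be evaluated independently, many threads can each read "their" element without a shared cursor, which a Python-style `next()` iterator could not offer.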

### `CountingIterator` and `ConstantIterator`

A `CountingIterator` represents the sequence `a, a + 1, a + 2, a + 3, ...`. In the following example, we use a `CountingIterator` as the input to `reduce_into` to compute the sum $1 + 2 + 3 + 4 + 5 = 15$.

In [None]:
# Prepare the inputs and outputs:
it_input = comp.CountingIterator(np.int32(1))  # represents the sequence 1, 2, 3, ....
d_output = cp.empty(1, dtype=np.int32)
h_init = np.array([0], dtype=np.int32)

# Perform the reduction.
comp.reduce_into(it_input, d_output, comp.OpKind.PLUS, 5, h_init)  # compute the reduction for `5` input items

print(f"Sum: {d_output.get()}")

A `ConstantIterator` represents the sequence `a, a, a, ...`. In the following example, we use a `ConstantIterator` as one of the inputs to `binary_transform`.

In [None]:
# Prepare the input and output arrays.
d_in1 = cp.asarray([2, 8, 9, 6, 3], dtype=np.int32)
it_in2 = comp.ConstantIterator(np.int32(1))
d_out = cp.empty_like(d_in1)

# Perform the binary transform.
comp.binary_transform(d_in1, it_in2, d_out, comp.OpKind.PLUS, len(d_in1))
print(f"Result of binary transform: {d_out.get()}")

### `TransformIterator`

`TransformIterator` provides a way to compose operations by applying a function to each element as it's accessed. The following code is similar to the `CountingIterator` example above, but it wraps the iterator with a `TransformIterator` to compute the sum $1^2 + 2^2 + 3^2 + 4^2 + 5^2 = 55$.

In [None]:
# Define the transform operation.
def square(a):
    return a**2

# prepare the inputs and output.
it_count = comp.CountingIterator(np.int32(1))  # represents the sequence 1, 2, 3, ....
it_input = comp.TransformIterator(it_count, square)  # represents the sequence 1**2, 2**2, 3**2, ...
d_output = cp.empty(1, dtype=np.int32)
h_init = np.array([0], dtype=np.int32)

# Perform the reduction.
comp.reduce_into(it_input, d_output, comp.OpKind.PLUS, 5, h_init)  # compute the reduction for `5` input items

print(f"Sum: {d_output.get()}")

You can also wrap an array with a `TransformIterator`:

In [None]:
d_arr = cp.asarray([2, 3, 5, 1, 6, 7, 8, 4], dtype=np.int32)
it_input = comp.TransformIterator(d_arr, square)  # represents the sequence [2**2, 3**2, ... 4**2]
d_output = cp.empty(1, dtype=np.int32)
h_init = np.array([0], dtype=np.int32)

# Perform the reduction.
comp.reduce_into(it_input, d_output, comp.OpKind.PLUS, len(d_arr), h_init)

print(f"Sum: {d_output.get()}")

Finally, you can use `TransformOutputIterator` as the output of an algorithm, to apply a function to the result as it's being written.

⚠️ Note that when using `TransformOutputIterator`, you must currently provide explicit type annotations for the transform function.

In [None]:
d_arr = cp.asarray([2, 3, 5, 1, 6, 7, 8, 4], dtype=np.float32)
it_input = comp.TransformIterator(d_arr, square)  # represents the sequence [2**2, 3**2, ... 4**2]
d_out = cp.empty(1, dtype=np.float32)

# provide type annotations when using `TransformOutputIterator`
def sqrt(a: np.float32) -> np.float32:
    return a**0.5

it_output = comp.TransformOutputIterator(d_out, sqrt)

h_init = np.array([0], dtype=np.float32)

# Perform the reduction.
comp.reduce_into(it_input, it_output, comp.OpKind.PLUS, len(d_arr), h_init)  # sums the squares; `sqrt` is applied as the result is written

print(f"Square root of the sum of squares: {d_out.get()}")

### `ZipIterator`

A `ZipIterator` combines multiple iterators (or arrays) into a single iterator. To access the individual components of any element of a `ZipIterator`, use numeric indexing:

In [None]:
d_in1 = cp.asarray([2, 3, 5, 1, 6, 7, 8, 4], dtype=np.int32)
d_in2 = cp.asarray([7, 7, 9, 3, 1, 2, 6, 0], dtype=np.int32)
it_in3 = comp.CountingIterator(np.int32(0))
it_input = comp.ZipIterator(d_in1, d_in2, it_in3)

def op(x):
    return x[0] + x[1] + x[2]

d_output = cp.empty_like(d_in1)
comp.unary_transform(it_input, d_output, op, len(d_in1))

print(f"Result: {d_output.get()}")

In the example below, we compute the `min` and `max` of a sequence within a single call to `reduce_into`, using `ZipIterator`. Note the need to define `MinMax` to specify the output type of `minmax_op`.

In [None]:
@comp.gpu_struct
class MinMax:
    min_value: np.int32
    max_value: np.int32

def minmax_op(x, y):
    return MinMax(min(x[0], y[0]), max(x[1], y[1]))

d_in = cp.asarray([2, 3, 5, 1, 6, 7, 8, 4], dtype=np.int32)

it_input = comp.ZipIterator(d_in, d_in)
d_output = cp.empty(2, dtype=np.int32).view(MinMax.dtype)

SMALLEST_INT = np.iinfo(np.int32).min
LARGEST_INT = np.iinfo(np.int32).max
h_init = MinMax(LARGEST_INT, SMALLEST_INT)

comp.reduce_into(it_input, d_output, minmax_op, len(d_in), h_init)

print(f"Min value: {d_output.get()[0]['min_value']}")
print(f"Max value: {d_output.get()[0]['max_value']}")
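The same min/max pattern can be traced on the CPU with `functools.reduce`. In this sketch, tuples stand in for the `MinMax` struct and `zip` for the `ZipIterator`; it only mirrors the logic and does not use `cuda.compute`.

```python
from functools import reduce


def minmax_op(x, y):
    # x and y are (min_so_far, max_so_far) pairs.
    return (min(x[0], y[0]), max(x[1], y[1]))


values = [2, 3, 5, 1, 6, 7, 8, 4]

# Zipping the sequence with itself turns each element v into the pair (v, v):
# a single value is both its own minimum and its own maximum.
pairs = zip(values, values)

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
# The initial pair must never win: any value is <= INT32_MAX and >= INT32_MIN.
init = (INT32_MAX, INT32_MIN)

lo, hi = reduce(minmax_op, pairs, init)
print(lo, hi)  # 1 8
```

This is why the initial value above is built as `MinMax(LARGEST_INT, SMALLEST_INT)` rather than zeros: zeros would incorrectly win the `min` for all-positive inputs.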

#### Iterator Composition

You can chain multiple iterator types together to create sophisticated data processing pipelines without intermediate storage.

In [None]:
# Example: Sum of squares of even numbers from 1 to 20
def square_if_even(x):
    """Square the number if it's even, otherwise return 0"""
    return (x * x) if (x % 2 == 0) else 0

def add_op(a, b):
    return a + b

# Chain operations: generate numbers → filter/square evens → sum
counting_it = comp.CountingIterator(np.int32(1))  # 1, 2, 3, ..., 20
transform_it = comp.TransformIterator(counting_it, square_if_even)  # 0, 4, 0, 16, 0, 36, ...

d_result = cp.empty(1, dtype=np.int32)
h_init = np.array([0], dtype=np.int32)

comp.reduce_into(transform_it, d_result, add_op, 20, h_init)

# Verify: even numbers 2,4,6,8,10,12,14,16,18,20 -> squares 4,16,36,64,100,144,196,256,324,400
evens = [x for x in range(1, 21) if x % 2 == 0]
expected = sum(x * x for x in evens)

print(f"Numbers 1-20: even squares sum")
print(f"Even numbers: {evens}")
print(f"Their squares: {[x*x for x in evens]}")
print(f"Iterator result: {d_result.get()[0]}")
print(f"Expected result: {expected}")
print(f"Correct: {d_result.get()[0] == expected}")


### Exercise 3: implementing running average


In this exercise, you'll implement the running average of a sequence using a single call to the [inclusive_scan](https://nvidia.github.io/cccl/python/parallel_api.html#cuda.compute.algorithms.inclusive_scan) API. To do this, you'll have to piece together many of the concepts we've covered so far.

In [None]:
@comp.gpu_struct
class SumAndCount:
    # begin TODO

    # end TODO

def reduce_op(x, y) -> SumAndCount:
    # begin TODO

    # end TODO

def compute_running_average(x: SumAndCount) -> np.float32:
    # begin TODO

    # end TODO

d_input = cp.array([2, 3, 5, 1, 7, 6, 8, 4], dtype=np.float32)
d_output = cp.empty(len(d_input), dtype=np.float32)
h_init = SumAndCount(0, 0)

it_input = comp.ZipIterator(d_input, comp.ConstantIterator(np.int32(1)))
it_output = comp.TransformOutputIterator(d_output, compute_running_average)

# Perform the scan.
comp.inclusive_scan(it_input, it_output, reduce_op, h_init, len(d_input))

print(d_input)

h_input = d_input.get()
expected = h_input.cumsum() / np.arange(1, len(h_input) + 1)

print(f"Running average result: {d_output}")
np.testing.assert_allclose(d_output.get(), expected)

### Resources

* `cuda-cccl` Documentation: https://nvidia.github.io/cccl/python/
* `cuda.compute` API Reference: https://nvidia.github.io/cccl/python/parallel_api.html#cuda-cccl-parallel-api-reference