## Exercise: Segmented Sum Optimization

Below is an example of the `transform` iterator API:

```c++
int constant = 2;
auto transform_it = thrust::make_transform_iterator(
    // iterator to the beginning of the input sequence
    vector.begin(), 
    // capture constant in the lambda by value with `[name]`
    [constant]__host__ __device__(float value_from_input_sequence) { 
      // transformation of each element
      return value_from_input_sequence * constant; 
    });
```

Below is an example of the `counting` iterator API:

```c++
// start counting from 0
auto count_it = thrust::make_counting_iterator(0);
```
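Both iterators are "fancy" iterators: they compute a value when dereferenced instead of reading it from memory. The host-only sketch below illustrates that idea in plain C++. It is not how Thrust implements these iterators; `CountingIt`, `TransformIt`, `make_transform_it`, and `doubled_at` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for a counting iterator: it stores only the current
// value, so the sequence start, start+1, start+2, ... occupies no array.
struct CountingIt {
    int value;
    int operator*() const { return value; }
    CountingIt operator+(int n) const { return CountingIt{value + n}; }
};

// Hypothetical stand-in for a transform iterator: it applies `f` to the
// underlying element every time it is dereferenced.
template <class It, class F>
struct TransformIt {
    It base;
    F f;
    auto operator*() const { return f(*base); }
    TransformIt operator+(int n) const { return TransformIt{base + n, f}; }
};

template <class It, class F>
TransformIt<It, F> make_transform_it(It base, F f) {
    return TransformIt<It, F>{base, f};
}

// Reads the element `offset` positions into the sequence
// 2*start, 2*(start+1), ... without storing that sequence anywhere.
int doubled_at(int start, int offset) {
    assert(offset >= 0);
    auto it = make_transform_it(CountingIt{start}, [](int v) { return v * 2; });
    return *(it + offset);
}
```

For example, `doubled_at(3, 4)` reads position 4 of a counting sequence that starts at 3 (the value 7) and doubles it on the fly.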

Rewrite the segmented sum code below so that the row-ID keys are computed on the fly instead of being materialized in memory.

<details>
    <summary>Copy of the original code in case you need to refer back to it.</summary>

```c++
%%writefile Sources/segmented-sum-optimization.cpp
#include "ach.h"

thrust::universal_vector<float> row_temperatures(
    int height, int width,
    thrust::universal_vector<int>& row_ids,
    thrust::universal_vector<float>& temp)
{
    thrust::universal_vector<float> sums(height);

    // Modify the line below to use counting and transform iterators to
    // generate row indices `id / width` instead
    auto row_ids_begin = row_ids.begin(); 
    auto row_ids_end = row_ids_begin + temp.size();

    thrust::reduce_by_key(thrust::device, 
                          row_ids_begin, row_ids_end, 
                          temp.begin(), 
                          thrust::make_discard_iterator(), 
                          sums.begin());

    return sums;
}
```  

</details>

In [None]:
import os

if os.getenv("COLAB_RELEASE_TAG"): # If running in Google Colab:
  !mkdir -p Sources
  !wget https://raw.githubusercontent.com/NVIDIA/accelerated-computing-hub/refs/heads/main/cuda-cpp-tutorial/notebooks/01.05-Serial-vs-Parallel/Sources/ach.h -nv -O Sources/ach.h

In [None]:
%%writefile Sources/segmented-sum-optimization.cpp
#include "ach.h"

thrust::universal_vector<float> row_temperatures(
    int height, int width,
    thrust::universal_vector<int>& row_ids,
    thrust::universal_vector<float>& temp)
{
    thrust::universal_vector<float> sums(height);

    // TODO: Modify the line below to use counting and transform iterators to
    // generate row indices `id / width` instead
    auto row_ids_begin = row_ids.begin();
    auto row_ids_end = row_ids_begin + temp.size();

    thrust::reduce_by_key(thrust::device,
                          row_ids_begin, row_ids_end,
                          temp.begin(),
                          thrust::make_discard_iterator(),
                          sums.begin());

    return sums;
}

In [None]:
!nvcc --extended-lambda -o /tmp/a.out --run Sources/segmented-sum-optimization.cpp -x cu -arch=native

The output of your program should end with:

```
row 0: { 90, 90, ..., 90 } = 1.50995e+09
row 1: { 15, 15, ..., 15 } = 2.51658e+08
row 2: { 15, 15, ..., 15 } = 2.51658e+08
```

If you're unsure how to proceed, expand the hint below, but only after giving the problem a genuine attempt.

<details>
  <summary>Hints</summary>
  
  - Combine `transform` and `counting` iterators to generate row indices
</details>

Open this section only after you’ve made a serious attempt at solving the problem. Once you’ve completed your solution, compare it with the reference provided here to evaluate your approach and identify any potential improvements.

<details>
  <summary>Solution</summary>

  Key points:

  - `thrust::make_counting_iterator(0)` creates an integer sequence of cell indices
  - `thrust::make_transform_iterator` converts cell indices to row indices by dividing by `width`

  Solution:
  ```c++
  auto row_ids_begin = thrust::make_transform_iterator(
      thrust::make_counting_iterator(0),
      [=] __host__ __device__(int i) { return i / width; });
  ```

  You can find the full solution [here](Solutions/segmented-sum-optimization.cpp).
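
To sanity-check a solution on small inputs, it can help to compare against a plain CPU loop that uses the same key rule. The sketch below is a host-only reference under the assumption that `temp` is stored row-major with `height * width` elements. `row_sums_reference` is a hypothetical helper, not part of `ach.h` or Thrust.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only reference: sums each row of a flat, row-major `temp` array.
// The row key of cell `i` is computed as `i / width` at the moment it is
// needed, so no array of keys is ever stored.
std::vector<float> row_sums_reference(int height, int width,
                                      const std::vector<float>& temp) {
    assert(height >= 0 && width > 0);
    const std::size_t w = static_cast<std::size_t>(width);
    assert(temp.size() == static_cast<std::size_t>(height) * w);

    std::vector<float> sums(static_cast<std::size_t>(height), 0.0f);
    for (std::size_t i = 0; i < temp.size(); ++i) {
        sums[i / w] += temp[i];  // same key rule as the transform iterator
    }
    return sums;
}
```

With `height = 2`, `width = 3`, and `temp = {1, 2, 3, 4, 5, 6}`, cells 0 to 2 map to row 0 and cells 3 to 5 map to row 1, so the sums are 6 and 15.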
</details>

Proceed to [the next exercise](01.05.03-Exercise-Segmented-Mean.ipynb).