## Exercise: Segmented Mean

The raw total temperature was hard to interpret.
What we're actually interested in is the mean temperature of each row rather than its total.
So far, we've used fancy input iterators to extend algorithms.
But fancy iterators are not limited to inputs: they can also transform values as an algorithm writes them to its output.

Here's an example of how to use a transform output iterator:

```c++
struct functor {
  __host__ __device__ 
  float operator()(float value_about_to_be_stored_in_output_sequence) const 
  { 
    // will store value / 2 in the output sequence instead of the original value
    return value_about_to_be_stored_in_output_sequence / 2; 
  }
};

auto transform_output_it = 
  thrust::make_transform_output_iterator(
    // iterator to the beginning of the output sequence
    vector.begin(), 
    // functor to apply to value before it's written to the `vector`
    functor{});
```
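To build intuition for what happens on each write, here is a minimal sketch of the idea in plain standard C++ (no Thrust or CUDA). The names `transform_output_proxy`, `toy_transform_output_iterator`, `halve`, and `write_halved` are hypothetical and exist only for this illustration; Thrust's real iterator is more general and may differ in its details.

```cpp
#include <vector>

// Hypothetical proxy returned by dereferencing the toy iterator:
// assigning to it applies `fn` first, then stores the result.
template <class Iter, class Fn>
struct transform_output_proxy {
  Iter it;
  Fn fn;

  template <class T>
  transform_output_proxy& operator=(const T& value) {
    *it = fn(value);  // transform happens at write time
    return *this;
  }
};

// Hypothetical, stripped-down output iterator (illustration only).
template <class Iter, class Fn>
struct toy_transform_output_iterator {
  Iter it;
  Fn fn;

  transform_output_proxy<Iter, Fn> operator*() const { return {it, fn}; }

  toy_transform_output_iterator& operator++() {
    ++it;
    return *this;
  }
};

struct halve {
  float operator()(float value) const { return value / 2; }
};

// Writes every input value through the toy iterator, the way an
// algorithm would write its results to an output sequence.
std::vector<float> write_halved(const std::vector<float>& input) {
  std::vector<float> output(input.size());
  toy_transform_output_iterator<std::vector<float>::iterator, halve> out{
      output.begin(), halve{}};
  for (float value : input) {
    *out = value;  // stores value / 2 in `output`
    ++out;
  }
  return output;
}
```

Calling `write_halved({2.0f, 4.0f, 9.0f})` returns a vector holding 1, 2, and 4.5. The writing loop never mentions the division; it is applied by the iterator, which is why an algorithm's results can be post-processed without a separate pass.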

In this exercise, you'll modify `row_temperatures` so that it computes the segmented mean.
Use a `transform_output_iterator` to turn each row's total temperature into its mean as it is written, and remove the now-redundant `thrust::transform` call.

<details>
    <summary>Original code in case you need to refer back to it</summary>
    
```c++
%%writefile Sources/segmented-mean.cpp
#include "ach.h"

struct mean_functor {
    int width;
    __host__ __device__ float operator()(float x) const {
        return x / width;
    }
};

thrust::universal_vector<float> row_temperatures(
    int height, int width,
    thrust::universal_vector<int>& row_ids,
    thrust::universal_vector<float>& temp)
{
    thrust::universal_vector<float> means(height);

    // use `transform_output_iterator` instead of `means.begin()`
    auto means_output = means.begin(); 

    auto row_ids_begin = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0), 
        [=]__host__ __device__(int i) {
            return i / width;
        });
    auto row_ids_end = row_ids_begin + temp.size();

    thrust::reduce_by_key(thrust::device, 
                          row_ids_begin, 
                          row_ids_end, 
                          temp.begin(), 
                          thrust::make_discard_iterator(), 
                          means_output);

    auto transform_op = mean_functor{width};

    // remove this `transform` call
    thrust::transform(thrust::device, 
                      means.begin(), 
                      means.end(), 
                      means.begin(), 
                      transform_op);

    return means;
}   
```

</details>

In [None]:
import os

if os.getenv("COLAB_RELEASE_TAG"): # If running in Google Colab:
  !mkdir -p Sources
  !wget https://raw.githubusercontent.com/NVIDIA/accelerated-computing-hub/refs/heads/main/gpu-cpp-tutorial/notebooks/01.05-Serial-vs-Parallel/Sources/ach.h -nv -O Sources/ach.h

In [None]:
%%writefile Sources/segmented-mean.cpp
#include "ach.h"

struct mean_functor {
    int width;
    __host__ __device__ float operator()(float x) const {
        return x / width;
    }
};

thrust::universal_vector<float> row_temperatures(
    int height, int width,
    thrust::universal_vector<int>& row_ids,
    thrust::universal_vector<float>& temp)
{
    thrust::universal_vector<float> means(height);

    // TODO: Replace `means.begin()` with a `transform_output_iterator` that
    // applies the provided `mean_functor`
    auto means_output = means.begin();

    auto row_ids_begin = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [=]__host__ __device__(int i) {
            return i / width;
        });
    auto row_ids_end = row_ids_begin + temp.size();

    thrust::reduce_by_key(thrust::device,
                          row_ids_begin,
                          row_ids_end,
                          temp.begin(),
                          thrust::make_discard_iterator(),
                          means_output);

    auto transform_op = mean_functor{width};

    // TODO: remove this `transform` call after adding the
    // `transform_output_iterator`
    thrust::transform(thrust::device,
                      means.begin(),
                      means.end(),
                      means.begin(),
                      transform_op);

    return means;
}

In [None]:
!nvcc --extended-lambda -o /tmp/a.out Sources/segmented-mean.cpp -x cu -arch=native # build executable
!/tmp/a.out # run executable

The output of your program should end with:

```
row 0: { 90, 90, ..., 90 } = 90
row 1: { 15, 15, ..., 15 } = 15
row 2: { 15, 15, ..., 15 } = 15
```
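If you want an independent check of the expected values, a serial reference in plain standard C++ can help. The sketch below assumes a row-major grid with exactly `height * width` elements; `reference_row_means` is a hypothetical helper and is not part of `ach.h`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical serial reference: mean of each row of a row-major
// `height` x `width` grid. Assumes temp.size() == height * width
// and width > 0.
std::vector<float> reference_row_means(int height, int width,
                                       const std::vector<float>& temp) {
  const std::size_t w = static_cast<std::size_t>(width);
  std::vector<float> means(static_cast<std::size_t>(height), 0.0f);

  // Accumulate each row's total; element i belongs to row i / width,
  // the same mapping the exercise uses to generate row ids.
  for (std::size_t i = 0; i < temp.size(); ++i) {
    means[i / w] += temp[i];
  }

  // Divide each total by the segment length to get the mean.
  for (float& mean : means) {
    mean /= static_cast<float>(w);
  }
  return means;
}
```

For a grid whose first row is all 90 and whose other two rows are all 15, this returns 90, 15, and 15, matching the values after the `=` signs in the expected output.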

If you’re unsure how to proceed, consider expanding this section for guidance. Use the hint only after giving the problem a genuine attempt.

<details>
  <summary>Hints</summary>
  
  - `thrust::make_transform_output_iterator` takes the same arguments as `thrust::make_transform_iterator`: the underlying iterator first, then the functor
</details>

Open this section only after you’ve made a serious attempt at solving the problem. Once you’ve completed your solution, compare it with the reference provided here to evaluate your approach and identify any potential improvements.

<details>
  <summary>Solution</summary>

  Key points:

  - To get the mean, we divide each row's sum by the number of elements in the segment, which is `width`
  - A `transform_output_iterator` applies this division as `reduce_by_key` writes each sum, so the separate `thrust::transform` pass is no longer needed

  Solution:
  ```c++
  auto means_output =
      thrust::make_transform_output_iterator(
        means.begin(), 
        mean_functor{width});
  ```

  You can find the full solution [here](Solutions/segmented-mean.cu).
</details>

---
Congratulations!

Now that you understand the difference between serial and parallel execution, 
proceed to [the next section](../01.06-Memory-Spaces/01.06.01-Memory-Spaces.ipynb).