# Solution: Mean Force By Region

This notebook walks through the solution to the Mean Force By Region exercise, explaining key concepts and implementation details.

## Overview
The goal was to optimize the calculation of average forces in different regions of space by using `transform_output_iterator` to compute means in a single pass.

## Key Components

### 1. Force Accumulator Structure
```cpp
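struct ForceAccumulator {
    float force_sum;  // running sum of the force values in a region
    int count;        // number of particles contributing to the sum

    __host__ __device__
    ForceAccumulator() : force_sum(0.0f), count(0) {}

    // A user-provided constructor rules out aggregate initialization,
    // so the braced return in operator+ needs this two-argument constructor.
    __host__ __device__
    ForceAccumulator(float sum, int n) : force_sum(sum), count(n) {}

    __host__ __device__
    ForceAccumulator operator+(const ForceAccumulator& other) const {
        return {force_sum + other.force_sum, count + other.count};
    }
};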
```
This structure tracks both the sum of forces and count of particles in each region, enabling mean calculation.

### 2. Mean Force Functor
```cpp
struct MeanForceFunctor {
    __host__ __device__
    float operator()(const ForceAccumulator& acc) const {
        return acc.count > 0 ? acc.force_sum / acc.count : 0.0f;
    }
};
```
This functor computes the mean force from an accumulator, handling the case of empty regions.
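
To make the interplay between the accumulator and the functor concrete, here is a minimal host-side sketch. The `mean_of_region` helper is hypothetical, and the two-argument constructor is included so the sketch compiles on its own; the macro guard is only there so a plain host compiler accepts the `__host__ __device__` annotations.

```cpp
#include <vector>

// Let a plain host compiler build this sketch as well as nvcc.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

struct ForceAccumulator {
    float force_sum;
    int count;

    __host__ __device__
    ForceAccumulator() : force_sum(0.0f), count(0) {}

    // Two-argument constructor, added for this sketch.
    __host__ __device__
    ForceAccumulator(float sum, int n) : force_sum(sum), count(n) {}

    __host__ __device__
    ForceAccumulator operator+(const ForceAccumulator& other) const {
        return ForceAccumulator(force_sum + other.force_sum, count + other.count);
    }
};

struct MeanForceFunctor {
    __host__ __device__
    float operator()(const ForceAccumulator& acc) const {
        return acc.count > 0 ? acc.force_sum / acc.count : 0.0f;
    }
};

// Hypothetical helper: fold one region's forces into an accumulator,
// then convert the accumulator to a mean.
inline float mean_of_region(const std::vector<float>& forces) {
    ForceAccumulator acc;  // starts at sum = 0, count = 0
    for (float f : forces) {
        acc = acc + ForceAccumulator(f, 1);  // each particle contributes one sample
    }
    return MeanForceFunctor{}(acc);
}
```

With forces `{1, 2, 6}` the accumulator ends at a sum of 9 and a count of 3, so the functor returns 3; an empty region keeps a count of 0 and yields `0.0f` instead of dividing by zero.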

### 3. Region Mapping
```cpp
struct RegionMapper {
    int grid_size;
    __host__ __device__
    int operator()(const thrust::tuple<float, float>& pos) const {
        float x = thrust::get<0>(pos);
        float y = thrust::get<1>(pos);
        // Map position to grid index...
        return grid_y * grid_size + grid_x;
    }
};
```
Maps each particle's `(x, y)` position to a row-major grid region index. The computation of `grid_x` and `grid_y` is elided in the snippet above.
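
One possible way to compute the cell coordinates is sketched below. It assumes positions lie in the unit square `[0, 1) x [0, 1)` and that the grid is uniform, neither of which is stated in the exercise; `region_index` is a hypothetical helper that `RegionMapper::operator()` could call with the unpacked tuple elements.

```cpp
// Let a plain host compiler build this sketch as well as nvcc.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper. Assumes x and y are in [0, 1) and the domain is
// split into grid_size x grid_size equal cells.
__host__ __device__
inline int region_index(float x, float y, int grid_size) {
    int grid_x = static_cast<int>(x * grid_size);
    int grid_y = static_cast<int>(y * grid_size);

    // Clamp so that out-of-range positions fall into an edge cell
    // instead of producing an invalid index.
    grid_x = grid_x < 0 ? 0 : (grid_x >= grid_size ? grid_size - 1 : grid_x);
    grid_y = grid_y < 0 ? 0 : (grid_y >= grid_size ? grid_size - 1 : grid_y);

    // Row-major layout: one full row of grid_size cells per step in y.
    return grid_y * grid_size + grid_x;
}
```

On a 4 x 4 grid, the position `(0.9, 0.1)` lands in column 3 of row 0, giving index 3, while `(0.1, 0.9)` lands in column 0 of row 3, giving index 12.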

## Key Optimization
The main optimization comes from using `transform_output_iterator`:

```cpp
auto mean_output = thrust::make_transform_output_iterator(
    region_means.begin(),
    MeanForceFunctor{}
);
```

This allows us to:
1. Eliminate the separate transform pass
2. Compute each mean as the reduction writes its result
3. Avoid temporary storage for the intermediate accumulators
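
The snippet above only builds the output iterator. A sketch of how it might be wired into `thrust::reduce_by_key` follows. Several things here are assumptions rather than part of the exercise: the function name `mean_force_by_region`, the `ToAccumulator` and `AddAccumulators` functors, the plain-aggregate form of the accumulator, region ids that are already sorted so equal ids are adjacent, and output vectors pre-sized to hold one entry per distinct region. It also relies on the CUDA backend accumulating in the input value type (`ForceAccumulator`) rather than in the output type.

```cpp
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/transform_output_iterator.h>
#include <thrust/reduce.h>

// Plain aggregate version of the accumulator, used only in this sketch.
struct ForceAccumulator {
    float force_sum;
    int count;
};

// Hypothetical functor: wrap one force value as a single-sample accumulator.
struct ToAccumulator {
    __host__ __device__
    ForceAccumulator operator()(float f) const { return ForceAccumulator{f, 1}; }
};

// Hypothetical functor: combine two partial accumulators.
struct AddAccumulators {
    __host__ __device__
    ForceAccumulator operator()(const ForceAccumulator& a,
                                const ForceAccumulator& b) const {
        return ForceAccumulator{a.force_sum + b.force_sum, a.count + b.count};
    }
};

struct MeanForceFunctor {
    __host__ __device__
    float operator()(const ForceAccumulator& acc) const {
        return acc.count > 0 ? acc.force_sum / acc.count : 0.0f;
    }
};

// Writes one (region id, mean force) pair per run of equal ids and
// returns the number of pairs written.
int mean_force_by_region(const thrust::device_vector<int>& sorted_region_ids,
                         const thrust::device_vector<float>& forces,
                         thrust::device_vector<int>& region_keys,
                         thrust::device_vector<float>& region_means) {
    // Input side: forces are converted to accumulators on the fly.
    auto acc_begin = thrust::make_transform_iterator(forces.begin(), ToAccumulator{});

    // Output side: each reduced accumulator is converted to a mean as it is stored.
    auto mean_output = thrust::make_transform_output_iterator(
        region_means.begin(), MeanForceFunctor{});

    auto ends = thrust::reduce_by_key(
        sorted_region_ids.begin(), sorted_region_ids.end(),
        acc_begin,
        region_keys.begin(),
        mean_output,
        thrust::equal_to<int>(),
        AddAccumulators{});

    return static_cast<int>(ends.first - region_keys.begin());
}
```

Because `reduce_by_key` emits one result per run of equal keys, a region with no particles produces no output entry at all. The `region_keys` output records which region each mean belongs to.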

## Performance Impact
The optimized version typically shows 1.5-2x speedup over the two-step approach due to:
- Fewer memory operations: the intermediate accumulators are never written to global memory and read back
- No separate kernel launch for the transform step
- Better cache utilization

## Common Pitfalls
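1. Not handling empty regions properly, which leads to a division by a zero count
2. Incorrect grid index calculation, for example mixing up `grid_x` and `grid_y` in the row-major index
3. Applying `MeanForceFunctor` in a separate pass instead of through `transform_output_iterator`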
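
A simple way to catch the first two pitfalls is to compare the GPU results against a straightforward CPU reference. The sketch below is one such reference; the function name `reference_means` and its signature are hypothetical, and it assumes the region index of each particle has already been computed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: region_ids[i] is the grid cell of particle i.
// Returns one mean per region, with 0 for regions that hold no particles.
inline std::vector<float> reference_means(const std::vector<int>& region_ids,
                                          const std::vector<float>& forces,
                                          int num_regions) {
    std::vector<float> sums(num_regions, 0.0f);
    std::vector<int> counts(num_regions, 0);

    for (std::size_t i = 0; i < forces.size(); ++i) {
        int r = region_ids[i];
        if (r < 0 || r >= num_regions) continue;  // ignore invalid indices
        sums[r] += forces[i];
        counts[r] += 1;
    }

    std::vector<float> means(num_regions, 0.0f);
    for (int r = 0; r < num_regions; ++r) {
        if (counts[r] > 0) means[r] = sums[r] / counts[r];  // empty regions stay 0
    }
    return means;
}
```

For region ids `{0, 0, 2}` and forces `{1, 3, 5}` over three regions, the reference yields `{2, 0, 5}`: region 1 is empty and keeps its default of 0.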

## Extension Solutions

### 1. Weighted Averaging
```cpp
struct WeightedForceAccumulator {
    float force_sum;
    float weight_sum;
    
    __host__ __device__
    float get_mean() const {
        return weight_sum > 0 ? force_sum / weight_sum : 0.0f;
    }
};
```
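
The struct above shows only the final division, so the sketch below adds a possible combine step and a small host-side driver. It assumes `force_sum` holds the sum of `weight * force`, which the division in `get_mean` implies but the exercise does not state; the free `operator+` and the `weighted_mean` helper are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Let a plain host compiler build this sketch as well as nvcc.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

struct WeightedForceAccumulator {
    float force_sum;   // assumed to hold the sum of weight * force
    float weight_sum;  // sum of the weights

    __host__ __device__
    float get_mean() const {
        return weight_sum > 0 ? force_sum / weight_sum : 0.0f;
    }
};

// Hypothetical combine step, analogous to ForceAccumulator::operator+.
__host__ __device__
inline WeightedForceAccumulator operator+(const WeightedForceAccumulator& a,
                                          const WeightedForceAccumulator& b) {
    return WeightedForceAccumulator{a.force_sum + b.force_sum,
                                    a.weight_sum + b.weight_sum};
}

// Hypothetical helper: weighted mean of one region's particles.
inline float weighted_mean(const std::vector<float>& forces,
                           const std::vector<float>& weights) {
    WeightedForceAccumulator acc{0.0f, 0.0f};
    for (std::size_t i = 0; i < forces.size(); ++i) {
        // Each particle contributes weight * force to the sum and its weight to the total.
        acc = acc + WeightedForceAccumulator{forces[i] * weights[i], weights[i]};
    }
    return acc.get_mean();
}
```

For forces `{2, 4}` with weights `{1, 3}`, the weighted sum is 14 and the total weight is 4, so the mean is 3.5 rather than the unweighted 3.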

### 2. Mean and Variance
```cpp
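struct StatsAccumulator {
    float sum;     // sum of the values
    float sum_sq;  // sum of the squared values
    int count;

    __host__ __device__
    thrust::tuple<float, float> get_stats() const {
        if (count == 0) return thrust::make_tuple(0.0f, 0.0f);
        float mean = sum / count;
        // Sample variance is undefined for a single value, so report 0.
        if (count < 2) return thrust::make_tuple(mean, 0.0f);
        float variance = (sum_sq - sum * mean) / (count - 1);
        return thrust::make_tuple(mean, variance);
    }
};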
```
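
The variance expression in `get_stats` comes from expanding the sample variance. A short derivation, writing $S_1$ for `sum`, $S_2$ for `sum_sq`, and $n$ for `count` (symbols introduced here for illustration only):

$$
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
    = \frac{1}{n-1}\left(S_2 - 2\bar{x}S_1 + n\bar{x}^2\right)
    = \frac{S_2 - S_1\bar{x}}{n-1},
\qquad \bar{x} = \frac{S_1}{n}
$$

The last step uses $n\bar{x}^2 = \bar{x}S_1$. As a check, the values $1, 2, 3, 4$ give $S_1 = 10$, $S_2 = 30$, $\bar{x} = 2.5$, and $s^2 = (30 - 25)/3 = 5/3 \approx 1.667$. In single precision this form can lose accuracy when the mean is large relative to the spread, because $S_2$ and $S_1\bar{x}$ are then nearly equal.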

### 3. Adaptive Grid
```cpp
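struct AdaptiveGrid {
    // Target average number of particles per grid cell.
    static constexpr int MIN_PARTICLES = 10;

    __host__ __device__
    int get_grid_size(int particle_count) const {
        // Integer division gives the target cell count; its square root is the
        // side length of the grid. Never go below a 2 x 2 grid.
        return max(2, static_cast<int>(sqrt(static_cast<double>(particle_count / MIN_PARTICLES))));
    }
};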
```
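
The sizing rule aims for roughly `MIN_PARTICLES` particles per cell on average, since a grid of side `grid_size` has `grid_size * grid_size` cells. A host-only sketch of the same arithmetic, using a hypothetical free function `adaptive_grid_size` with standard-library calls in place of the CUDA `max` and `sqrt`:

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical host-side mirror of AdaptiveGrid::get_grid_size.
inline int adaptive_grid_size(int particle_count, int min_particles = 10) {
    // Integer division: how many cells are needed at min_particles per cell.
    int target_cells = particle_count / min_particles;
    // Side length of a square grid with about that many cells, rounded down.
    int side = static_cast<int>(std::sqrt(static_cast<double>(target_cells)));
    // Never use fewer than 2 cells per side.
    return std::max(2, side);
}
```

With 1000 particles the target is 100 cells, so the grid is 10 x 10 with about 10 particles per cell. With only 5 particles the target cell count is 0 and the lower bound of 2 applies.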