# CUDA Particle Filter Performance Analysis Guide

This notebook provides step-by-step instructions for analyzing the performance of the CUDA particle filter implementation.

## Features Added

- **Detailed Kernel Timing**: Per-kernel execution times for all particle filter phases
- **Memory Transfer Profiling**: Host-device memory transfer timing
- **NVTX Profiling Markers**: For use with nvprof and Nsight
- **Performance Metrics**: Bandwidth, throughput, and efficiency calculations
- **Comprehensive Reporting**: Detailed breakdowns and recommendations

## Setup

First, confirm that the CUDA toolkit and driver are installed and that a GPU is visible:

In [None]:
# Check CUDA installation
!nvcc --version
!nvidia-smi
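
The later sections also call `nvprof` and `ncu`, which are not covered by the two commands above. A minimal sketch for checking which tools are on the `PATH`, using only the standard library (the tool list mirrors the commands used in this notebook; `find_tools` is a helper name chosen for illustration):

```python
import shutil

# Command-line tools invoked somewhere in this notebook.
PROFILING_TOOLS = ["nvcc", "nvidia-smi", "nvprof", "ncu"]

def find_tools(names=PROFILING_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

for tool, path in find_tools().items():
    print(f"{tool:12s} {path or 'NOT FOUND'}")
```

A missing `nvprof` or `ncu` only affects the sections that use that tool; the build and the built-in timing need just `nvcc` and a working driver.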

## Build the Project

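Compile the particle filter and link it against cuRAND and NVTX (`-lnvToolsExt`). Profiling itself is switched on through the defines in `particle_filter_config.h`, not through a compiler flag. The `-arch=sm_70` option targets Volta; change it to match your GPU: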

In [None]:
# Clone or upload the project files
# Make sure all .cu and .h files are in the current directory

# Build with profiling enabled
!nvcc -O3 -arch=sm_70 \
  particle_filter_main.cu particle_filter_kernels.cu utils.cu scan_kernels.cu reduce_kernels.cu \
  -o particle_filter_prof \
  -lcudart -lcurand -lnvToolsExt

## Run Basic Profiling

Execute the particle filter with built-in timing:

In [None]:
# Run with internal profiling (ENABLE_PROFILING=1 in config)
!./particle_filter_prof

## Advanced Profiling with nvprof

### Kernel Timeline Analysis

In [None]:
# Generate detailed kernel execution timeline
!nvprof --print-gpu-trace ./particle_filter_prof > nvprof_timeline.log 2>&1

# Display key timing results
!grep -E "(predict_kernel|update_weights_kernel|normalize_weights_kernel|resample_optimized_kernel)" nvprof_timeline.log | tail -10

### GPU Metrics Analysis

In [None]:
# Profile key GPU metrics
!nvprof --metrics achieved_occupancy,sm_efficiency,warp_execution_efficiency,gld_efficiency,gst_efficiency \
  --print-gpu-summary ./particle_filter_prof > nvprof_metrics.log 2>&1

# Display metrics summary
!tail -20 nvprof_metrics.log

### Memory Bandwidth Analysis

In [None]:
# Analyze memory bandwidth utilization
!nvprof --metrics dram_read_throughput,dram_write_throughput,gld_requested_throughput,gst_requested_throughput \
  ./particle_filter_prof > memory_analysis.log 2>&1

# Show memory throughput
!grep "throughput" memory_analysis.log
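
To judge the reported throughput, it helps to have the device's theoretical peak next to it. The sketch below uses the standard formula (memory clock × bus width in bytes × transfers per clock). The V100 figures are an example chosen to match the `sm_70` build target, and the read/write values are placeholders to be replaced with the numbers from `memory_analysis.log`.

```python
def theoretical_bandwidth_gbs(memory_clock_mhz, bus_width_bits, data_rate=2):
    """Peak DRAM bandwidth in GB/s: clock * bus width in bytes * transfers per clock."""
    bytes_per_second = memory_clock_mhz * 1e6 * (bus_width_bits / 8) * data_rate
    return bytes_per_second / 1e9

def bandwidth_utilization(read_gbs, write_gbs, peak_gbs):
    """Fraction of peak bandwidth used by combined DRAM reads and writes."""
    return (read_gbs + write_gbs) / peak_gbs

# Example device: Tesla V100 (sm_70) with 877 MHz HBM2 on a 4096-bit bus.
peak = theoretical_bandwidth_gbs(877, 4096)
print(f"Theoretical peak: {peak:.1f} GB/s")

# Placeholder throughput values, not measured results.
util = bandwidth_utilization(read_gbs=300.0, write_gbs=150.0, peak_gbs=peak)
print(f"Utilization: {util:.1%}")
```

The memory clock and bus width for other GPUs can be read from the vendor specification or from `cudaGetDeviceProperties`.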

## Nsight Compute (Advanced Analysis)

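Nsight Compute (`ncu`) takes over from nvprof on newer GPUs: nvprof cannot collect metrics on devices with compute capability 7.5 or higher and does not support compute capability 8.0 or higher at all. If `ncu` is available, run detailed kernel analysis: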

In [None]:
# Check if ncu is available
!which ncu

# Run detailed kernel analysis
!ncu --target-processes all \
  --metrics sm__warps_active.avg.pct_of_peak_sustained_active,gpu__time_duration.sum \
  --print-summary per-kernel \
  ./particle_filter_prof > ncu_analysis.log 2>&1

## Performance Analysis and Recommendations

### Key Metrics to Monitor

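1. **Kernel Execution Times**: No single phase should dominate the step time; the slowest kernel is the first optimization target
2. **Achieved Occupancy**: Target > 50% for good GPU utilization
3. **Memory Efficiency**: Global load/store efficiency (`gld_efficiency`, `gst_efficiency`) should be > 80%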
4. **Warp Execution Efficiency**: Should be > 90%
5. **Memory Bandwidth**: Compare achieved vs theoretical bandwidth
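
The thresholds in this list can be applied mechanically once the metric values have been copied out of `nvprof_metrics.log`. A minimal sketch, assuming the values are first normalized to fractions between 0 and 1 (profilers may print some of them as percentages); `check_metrics` and the sample numbers are illustrative only:

```python
# Thresholds from the list above, expressed as fractions (0.0 to 1.0).
TARGETS = {
    "achieved_occupancy": 0.50,
    "gld_efficiency": 0.80,
    "gst_efficiency": 0.80,
    "warp_execution_efficiency": 0.90,
}

def check_metrics(measured, targets=TARGETS):
    """Return the names of metrics that are missing or do not exceed their target."""
    return [name for name, target in targets.items()
            if measured.get(name, 0.0) <= target]

# Placeholder measurements for one kernel, not real profiler output.
sample = {
    "achieved_occupancy": 0.62,
    "gld_efficiency": 0.71,
    "gst_efficiency": 0.95,
    "warp_execution_efficiency": 0.88,
}
for name in check_metrics(sample):
    print(f"below target: {name} = {sample[name]:.0%} (target > {TARGETS[name]:.0%})")
```

Running the check per kernel rather than on whole-program averages keeps one well-behaved kernel from hiding a poorly behaved one.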

### Common Optimization Opportunities

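- **Low Occupancy**: Tune the block size, launch enough blocks to fill every SM, and reduce per-thread register or shared-memory use so more warps fit on each SM
- **Memory Inefficiency**: Improve coalescing (e.g., a structure-of-arrays layout for particle state) or stage reused data in shared memory
- **Warp Divergence**: Restructure branches and data-dependent loops so that threads in the same warp follow the same path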
- **Stream Imbalance**: Balance work across CUDA streams
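
For the occupancy item, a rough estimate shows how block size and register use interact. This is a simplified sketch, not the CUDA occupancy calculator: it uses the per-SM limits for compute capability 7.0 and ignores shared memory and register allocation granularity. The registers-per-thread figure would come from compiling with `nvcc -Xptxas -v`.

```python
# Per-SM limits for compute capability 7.0 (the -arch=sm_70 build target).
MAX_WARPS_PER_SM = 64
MAX_BLOCKS_PER_SM = 32
REGISTERS_PER_SM = 65536
WARP_SIZE = 32

def estimate_occupancy(block_size, regs_per_thread):
    """Rough theoretical occupancy; ignores shared memory and allocation granularity."""
    warps_per_block = -(-block_size // WARP_SIZE)  # ceiling division
    by_warps = MAX_WARPS_PER_SM // warps_per_block
    by_regs = REGISTERS_PER_SM // (regs_per_thread * warps_per_block * WARP_SIZE)
    blocks = min(by_warps, by_regs, MAX_BLOCKS_PER_SM)
    return blocks * warps_per_block / MAX_WARPS_PER_SM

for block_size in (32, 64, 128, 256, 512, 1024):
    for regs in (32, 64):
        occ = estimate_occupancy(block_size, regs)
        print(f"block={block_size:4d} regs/thread={regs:3d} -> {occ:.0%}")
```

In this model a block size of 32 is capped by the blocks-per-SM limit, while 64 registers per thread caps occupancy regardless of block size. For real kernels, `cudaOccupancyMaxActiveBlocksPerMultiprocessor` gives the authoritative figure.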

### Scaling Analysis

Test with different particle counts:

In [None]:
# Test scaling with different particle counts.
# Each iteration rewrites N_PARTICLES in particle_filter_config.h and rebuilds
# with the same nvcc command used in "Build the Project".

particle_counts = [10000, 50000, 100000, 500000, 1000000]

for n_particles in particle_counts:
    print(f"\n=== Testing with {n_particles} particles ===")
    # IPython substitutes {n_particles} before the shell command runs
    !sed -i 's/#define N_PARTICLES.*/#define N_PARTICLES {n_particles}/' particle_filter_config.h
    !nvcc -O3 -arch=sm_70 particle_filter_main.cu particle_filter_kernels.cu utils.cu scan_kernels.cu reduce_kernels.cu -o particle_filter_prof -lcudart -lcurand -lnvToolsExt
    !./particle_filter_prof | grep "Average time per step"
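
Once the "Average time per step" values have been collected, throughput per particle count shows where the GPU saturates. A minimal sketch with placeholder timings (`scaling_table` is an illustrative helper, and the numbers are not measurements):

```python
def scaling_table(ms_per_step):
    """Return (particles, particles per second, throughput relative to smallest run)."""
    counts = sorted(ms_per_step)
    base_throughput = counts[0] / ms_per_step[counts[0]]
    rows = []
    for n in counts:
        throughput = n / ms_per_step[n]  # particles per millisecond
        rows.append((n, throughput * 1e3, throughput / base_throughput))
    return rows

# Placeholder timings in milliseconds per step, keyed by particle count.
timings = {10_000: 0.20, 100_000: 0.50, 1_000_000: 4.00}
for n, per_second, relative in scaling_table(timings):
    print(f"{n:>9,d} particles: {per_second:,.0f} particles/s "
          f"({relative:.1f}x baseline throughput)")
```

Throughput that keeps rising with the particle count suggests the smaller runs do not fill the GPU; a plateau suggests the device is saturated and per-kernel optimization is the next step.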

## Profiling Configuration

The profiling system can be controlled via defines in `particle_filter_config.h`:

- `ENABLE_PROFILING`: Set to 1 to enable detailed timing, 0 to disable it
- `ENABLE_NVTX`: Set to 1 to emit NVTX markers for nvprof and Nsight, 0 to disable them
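
These defines can also be toggled from the notebook instead of by hand. The sketch below edits header text in memory, assuming each option is a plain `#define NAME value` line. The header contents shown are hypothetical, and `set_define` is a helper written for this example.

```python
import re

def set_define(header_text, name, value):
    """Replace the value of `#define NAME ...` in header text; raise if NAME is absent."""
    pattern = re.compile(
        rf"^([ \t]*#[ \t]*define[ \t]+{re.escape(name)})\b.*$", re.MULTILINE
    )
    new_text, count = pattern.subn(lambda m: f"{m.group(1)} {value}", header_text)
    if count == 0:
        raise KeyError(f"{name} is not defined in the header")
    return new_text

# Hypothetical header contents; the real particle_filter_config.h may differ.
header = "#define N_PARTICLES 100000\n#define ENABLE_PROFILING 0\n#define ENABLE_NVTX 0\n"
header = set_define(header, "ENABLE_PROFILING", 1)
header = set_define(header, "ENABLE_NVTX", 1)
print(header)
```

To apply it to the real file, read the header with `open(...).read()`, pass the text through `set_define`, write it back, and rebuild.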

## Output Files

- `results_gpu_optimized.csv`: Simulation results
- `nvprof_timeline.log`: Kernel execution timeline
- `nvprof_metrics.log`: GPU performance metrics
- `memory_analysis.log`: Memory bandwidth analysis
- `ncu_analysis.log`: Nsight Compute detailed analysis

## Performance Report Generation

The built-in profiling automatically generates a detailed performance report showing:

- Per-kernel execution times
- Time breakdowns by particle filter phase
- Performance metrics (throughput, bandwidth, etc.)
- Resampling frequency analysis

Use this information to identify bottlenecks and optimization opportunities in your CUDA particle filter implementation.
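
As an illustration of reading the phase breakdown, the sketch below ranks kernels by their share of total time. The kernel names are the ones matched by the `grep` pattern in the timeline section, the timings are placeholders, and `phase_breakdown` is a helper name chosen for this example.

```python
def phase_breakdown(kernel_ms):
    """Return (kernel, milliseconds, share of total) tuples sorted by descending time."""
    total = sum(kernel_ms.values())
    ranked = sorted(kernel_ms.items(), key=lambda item: item[1], reverse=True)
    return [(name, ms, ms / total) for name, ms in ranked]

# Placeholder per-kernel totals in milliseconds, not measured results.
totals = {
    "predict_kernel": 1.2,
    "update_weights_kernel": 2.4,
    "normalize_weights_kernel": 0.4,
    "resample_optimized_kernel": 4.0,
}
for name, ms, share in phase_breakdown(totals):
    print(f"{name:28s} {ms:6.2f} ms {share:6.1%}")
```

The kernel at the top of the ranking is the natural first candidate for the occupancy, memory, and divergence checks described earlier.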