# Serial vs Parallel Computation


<img src="./images/serial.svg" width="800" height="600">
Image credit: claude.ai

## Content
* [Serial Particle Updates](#Serial-Particle-Updates)
* [Parallel Particle Updates](#Parallel-Particle-Updates) 
* [Exercise: Particle Update Optimization](#Exercise-Particle-Update-Optimization)
* [Exercise: Particle Forces](#Exercise-Particle-Forces)

---

So far, we have discussed how parallel algorithms provide an on-ramp to GPU programming.
We've also covered techniques that help you extend these algorithms to fit your specific use cases.
As you apply parallel algorithms to more problems, you may find that performance does not match your expectations.
To avoid such surprises, you need a firm understanding of the difference between serial and parallel execution.
Let's explore this difference through a particle simulation.

## Serial Particle Updates

In a particle simulation system, we need to update the positions of many particles based on their velocities.
Let's first look at a serial implementation where we process each particle one at a time:


In [3]:
# Add the directory containing nvcc (the NVIDIA CUDA compiler) to PATH so that the notebook's shell cells can find it.
import os
os.environ['PATH'] = "/packages/apps/spack/21/opt/spack/linux-rocky8-zen3/gcc-12.1.0/cuda-12.6.1-cf4xlcbcfpwchqwo5bktxyhjagryzcx6/bin:" + os.environ['PATH']

In [4]:
%%writefile codes/serial_particles.cu

#include <thrust/universal_vector.h>
#include <thrust/execution_policy.h>
#include <cstdio>
#include <chrono>

void update_particles_serial(int num_particles, float dt,
                           const thrust::universal_vector<float> &in,
                                 thrust::universal_vector<float> &out) {
    const float *in_ptr = thrust::raw_pointer_cast(in.data());
    float *out_ptr = thrust::raw_pointer_cast(out.data());
    
    // Process each particle sequentially
    for(int i = 0; i < num_particles; i++) {
        // Update x position
        out_ptr[i*4 + 0] = in_ptr[i*4 + 0] + dt * in_ptr[i*4 + 2];
        // Update y position
        out_ptr[i*4 + 1] = in_ptr[i*4 + 1] + dt * in_ptr[i*4 + 3];
        // Keep velocities unchanged
        out_ptr[i*4 + 2] = in_ptr[i*4 + 2];
        out_ptr[i*4 + 3] = in_ptr[i*4 + 3];
    }
}

thrust::universal_vector<float> init_particles(int num_particles) {
    thrust::universal_vector<float> particles(num_particles * 4);
    for(int i = 0; i < num_particles; i++) {
        particles[i*4 + 0] = i * 0.1f;     // x position
        particles[i*4 + 1] = i * -0.1f;    // y position
        particles[i*4 + 2] = 1.0f;         // x velocity
        particles[i*4 + 3] = -0.5f;        // y velocity
    }
    return particles;
}

int main() {
    int num_particles = 1000000;  // 1 million particles
    float dt = 0.1f;
    
    // Initialize particles
    thrust::universal_vector<float> particles = init_particles(num_particles);
    thrust::universal_vector<float> output(particles.size());
    
    // Measure performance
    auto begin = std::chrono::high_resolution_clock::now();
    update_particles_serial(num_particles, dt, particles, output);
    auto end = std::chrono::high_resolution_clock::now();
    
    const double seconds = std::chrono::duration<double>(end - begin).count();
    const double gigabytes = static_cast<double>(particles.size() * sizeof(float)) / 1024 / 1024 / 1024;
    const double throughput = gigabytes / seconds;

    std::printf("Serial computation:\n");
    std::printf("Time: %g seconds\n", seconds);
    std::printf("Throughput: %g GB/s\n", throughput);
    
    return 0;
}

Overwriting codes/serial_particles.cu


Let's compile and run it to measure the performance of this approach:

In [5]:
%%bash
nvcc -o codes/serial_particles --extended-lambda codes/serial_particles.cu
./codes/serial_particles

Serial computation:
Time: 0.0118792 seconds
Throughput: 1.25439 GB/s
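
To see where the reported throughput comes from, the arithmetic in `main` can be reproduced on its own. The sketch below is plain C++; the helper names `particle_gibibytes` and `throughput_gib_per_s` are hypothetical and do not appear in the listing. It assumes, as the listing does, four `float` values per particle and counts only the size of the input array.

```cpp
#include <cassert>
#include <cstddef>

// Size in GiB of an interleaved [x, y, vx, vy] float array.
// Hypothetical helper: mirrors the `gigabytes` expression in main().
double particle_gibibytes(std::size_t num_particles) {
    const std::size_t floats_per_particle = 4;  // x, y, vx, vy
    const std::size_t bytes = num_particles * floats_per_particle * sizeof(float);
    return static_cast<double>(bytes) / 1024 / 1024 / 1024;
}

// Throughput as the listing computes it: input-array size divided by elapsed time.
double throughput_gib_per_s(std::size_t num_particles, double seconds) {
    assert(seconds > 0.0);  // a zero or negative duration would be meaningless
    return particle_gibibytes(num_particles) / seconds;
}
```

For 1,000,000 particles the array holds 16,000,000 bytes, about 0.0149 GiB, and dividing by the measured 0.0118792 seconds gives roughly 1.254, which matches the printed value. Two details are worth keeping in mind when reading that number: the divisions by 1024 make the unit GiB/s even though the program prints "GB/s", and the loop also writes an output array of the same size, so the actual memory traffic is about twice what the figure counts.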


## Parallel Particle Updates

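GPUs are massively parallel processors, but the serial implementation runs a single loop on the CPU and handles one particle per iteration, so it never uses the GPU's parallel processing capabilities. Let's transform this into a parallel implementation using `thrust::transform`:


This parallel implementation:
1. Uses `thrust::transform` with the `thrust::device` execution policy, so each output element can be computed by a separate GPU thread
2. Iterates over a flat index covering all `num_particles * 4` floats and derives the particle index (`idx / 4`) and the component (`idx % 4`) from it
3. Updates the two position components in parallel and copies the two velocity components through unchanged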

In [14]:
%%writefile codes/parallel_particles.cu
#include <thrust/universal_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <cuda/std/mdspan>
#include <cstdio>
#include <chrono>

// Serial implementation
void update_particles_serial(int num_particles, float dt,
                           const thrust::universal_vector<float> &in,
                                 thrust::universal_vector<float> &out) {
    const float *in_ptr = thrust::raw_pointer_cast(in.data());
    float *out_ptr = thrust::raw_pointer_cast(out.data());
    
    // Process each particle sequentially
    for(int i = 0; i < num_particles; i++) {
        // Update x position
        out_ptr[i*4 + 0] = in_ptr[i*4 + 0] + dt * in_ptr[i*4 + 2];
        // Update y position
        out_ptr[i*4 + 1] = in_ptr[i*4 + 1] + dt * in_ptr[i*4 + 3];
        // Keep velocities unchanged
        out_ptr[i*4 + 2] = in_ptr[i*4 + 2];
        out_ptr[i*4 + 3] = in_ptr[i*4 + 3];
    }
}

// Parallel implementation
void update_particles_parallel(int num_particles, float dt,
                             const thrust::universal_vector<float> &in,
                                   thrust::universal_vector<float> &out) {
    const float *in_ptr = thrust::raw_pointer_cast(in.data());
    
    thrust::transform(
        thrust::device,
        thrust::counting_iterator<int>(0),
        thrust::counting_iterator<int>(num_particles * 4),
        out.begin(),
        [=] __device__ (int idx) {
            int particle_idx = idx / 4;
            int component = idx % 4;
            
            if (component < 2) {  // Position components
                return in_ptr[idx] + dt * in_ptr[particle_idx * 4 + component + 2];
            }
            return in_ptr[idx];  // Velocity components unchanged
        }
    );
}

thrust::universal_vector<float> init_particles(int num_particles) {
    thrust::universal_vector<float> particles(num_particles * 4);
    for(int i = 0; i < num_particles; i++) {
        particles[i*4 + 0] = i * 0.1f;     // x position
        particles[i*4 + 1] = i * -0.1f;    // y position
        particles[i*4 + 2] = 1.0f;         // x velocity
        particles[i*4 + 3] = -0.5f;        // y velocity
    }
    return particles;
}

int main() {
    int num_particles = 1000000;  // 1 million particles
    float dt = 0.1f;
    
    // Initialize particles
    thrust::universal_vector<float> particles = init_particles(num_particles);
    thrust::universal_vector<float> output(particles.size());
    
    // Measure serial performance
    auto begin = std::chrono::high_resolution_clock::now();
    update_particles_serial(num_particles, dt, particles, output);
    auto end = std::chrono::high_resolution_clock::now();
    
    double seconds = std::chrono::duration<double>(end - begin).count();
    double gigabytes = static_cast<double>(particles.size() * sizeof(float)) / 1024 / 1024 / 1024;
    double throughput = gigabytes / seconds;

    std::printf("Serial computation:\n");
    std::printf("Time: %g seconds\n", seconds);
    std::printf("Throughput: %g GB/s\n\n", throughput);
    
    // Measure parallel performance
    begin = std::chrono::high_resolution_clock::now();
    update_particles_parallel(num_particles, dt, particles, output);
    end = std::chrono::high_resolution_clock::now();
    
    seconds = std::chrono::duration<double>(end - begin).count();
    throughput = gigabytes / seconds;

    std::printf("Parallel computation:\n");
    std::printf("Time: %g seconds\n", seconds);
    std::printf("Throughput: %g GB/s\n", throughput);
    
    return 0;
}

Overwriting codes/parallel_particles.cu
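
The index arithmetic inside the device lambda can be checked on the host before involving the GPU. The following is a minimal sketch in plain C++ that uses `std::transform` over a vector of indices as a sequential stand-in for `thrust::transform` over a `counting_iterator`. The names `update_element` and `update_all` are hypothetical and are not part of the listing above.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Per-element update: the same arithmetic as the device lambda, as a host function.
float update_element(const float *in, float dt, int idx) {
    const int particle_idx = idx / 4;  // which particle this float belongs to
    const int component = idx % 4;     // 0 = x, 1 = y, 2 = vx, 3 = vy
    if (component < 2) {
        // Position: add dt times the matching velocity, stored two slots later
        return in[idx] + dt * in[particle_idx * 4 + component + 2];
    }
    return in[idx];  // Velocity: copied through unchanged
}

// Apply the per-element update to every float of the interleaved array.
std::vector<float> update_all(const std::vector<float> &in, float dt) {
    assert(in.size() % 4 == 0);  // layout is [x, y, vx, vy] per particle
    std::vector<int> indices(in.size());
    std::iota(indices.begin(), indices.end(), 0);  // 0, 1, 2, ... like counting_iterator
    std::vector<float> out(in.size());
    std::transform(indices.begin(), indices.end(), out.begin(),
                   [&](int idx) { return update_element(in.data(), dt, idx); });
    return out;
}
```

Calling `update_all` on two particles `{0, 0, 1, -0.5, 2, 4, 1, -0.5}` with `dt = 0.5f` yields positions `(0.5, -0.25)` and `(2.5, 3.75)` with the velocities unchanged. Because each output element depends only on the input array and its own index, the elements can be computed in any order, which is what makes the flat-index formulation suitable for a parallel `transform`.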


In [15]:
%%bash
nvcc -o codes/parallel_particles --extended-lambda codes/parallel_particles.cu
./codes/parallel_particles

Serial computation:
Time: 0.00984519 seconds
Throughput: 1.51355 GB/s

Parallel computation:
Time: 0.00980984 seconds
Throughput: 1.519 GB/s
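
Putting the two measured times side by side makes the comparison concrete. Using only the numbers printed above:

$$
\text{speedup} = \frac{t_{\text{serial}}}{t_{\text{parallel}}} = \frac{0.00984519\ \text{s}}{0.00980984\ \text{s}} \approx 1.0036
$$

In this run, the `thrust::transform` version is less than 1% faster than the serial loop, which is the kind of unexpected result mentioned in the introduction.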
