# Understanding CUDA Execution Spaces with Particle Simulation

This notebook guides you through understanding CUDA execution spaces by implementing a simple particle simulation. We'll start with CPU code and gradually transition to GPU acceleration.

<img src="./images/image2.jpg" width="1000" height="800">

### What Are Execution Spaces?

In CUDA programming, "execution spaces" refer to where your code runs:

**Host**: This is the CPU

**Device**: This is the GPU

One of the most important concepts in CUDA is that you must **explicitly** specify which code runs where. Let's explore this concept step by step.

## 1. Basic CPU Implementation
First, let's create a simple particle simulation using CPU only:

In [1]:
# Add the directory containing nvcc to PATH so the notebook's shell cells can find it. nvcc is the NVIDIA CUDA compiler driver, which we use to build CUDA programs.
import os
os.environ['PATH'] = "/packages/apps/spack/21/opt/spack/linux-rocky8-zen3/gcc-12.1.0/cuda-12.6.1-cf4xlcbcfpwchqwo5bktxyhjagryzcx6/bin:" + os.environ['PATH']

In [2]:
%%writefile codes/particles_cpu.cpp
#include <cstdio>
#include <vector>
#include <cmath>

struct Particle {
    float x, y;    // position
    float vx, vy;  // velocity
};

int main() {
    // Simulation parameters
    float dt = 0.1f;  // time step
    
    // Create some particles with initial positions and velocities
    std::vector<Particle> particles = {
        {0.0f, 0.0f, 1.0f, 0.5f},
        {1.0f, 2.0f, -0.5f, 0.2f},
        {-1.0f, -1.0f, 0.3f, 0.7f}
    };
    
    // Print initial state
    printf("Step 0:\n");
    for (int i = 0; i < particles.size(); i++) {
        printf("Particle %d: pos=(%.2f, %.2f) vel=(%.2f, %.2f)\n", 
                i, particles[i].x, particles[i].y, particles[i].vx, particles[i].vy);
    }
    
    // Run simulation for 3 steps
    for (int step = 1; step <= 3; step++) {
        // Update each particle position based on its velocity
        for (int i = 0; i < particles.size(); i++) {
            particles[i].x += particles[i].vx * dt;
            particles[i].y += particles[i].vy * dt;
        }
        
        // Print results
        printf("\nStep %d:\n", step);
        for (int i = 0; i < particles.size(); i++) {
            printf("Particle %d: pos=(%.2f, %.2f) vel=(%.2f, %.2f)\n", 
                   i, particles[i].x, particles[i].y, particles[i].vx, particles[i].vy);
        }
    }
    
    return 0;
}

Overwriting codes/particles_cpu.cpp


In [3]:
%%bash
g++ codes/particles_cpu.cpp -o codes/particles_cpu
./codes/particles_cpu

Step 0:
Particle 0: pos=(0.00, 0.00) vel=(1.00, 0.50)
Particle 1: pos=(1.00, 2.00) vel=(-0.50, 0.20)
Particle 2: pos=(-1.00, -1.00) vel=(0.30, 0.70)

Step 1:
Particle 0: pos=(0.10, 0.05) vel=(1.00, 0.50)
Particle 1: pos=(0.95, 2.02) vel=(-0.50, 0.20)
Particle 2: pos=(-0.97, -0.93) vel=(0.30, 0.70)

Step 2:
Particle 0: pos=(0.20, 0.10) vel=(1.00, 0.50)
Particle 1: pos=(0.90, 2.04) vel=(-0.50, 0.20)
Particle 2: pos=(-0.94, -0.86) vel=(0.30, 0.70)

Step 3:
Particle 0: pos=(0.30, 0.15) vel=(1.00, 0.50)
Particle 1: pos=(0.85, 2.06) vel=(-0.50, 0.20)
Particle 2: pos=(-0.91, -0.79) vel=(0.30, 0.70)


## 2. Refactoring with Algorithm Approach
Next, let's refactor our code to use the C++ standard library's **std::transform** algorithm, which will make it easier to port to CUDA:

In [4]:
%%writefile codes/particles_algo.cpp
#include <cstdio>
#include <vector>
#include <algorithm>

struct Particle {
    float x, y;    // position
    float vx, vy;  // velocity
};

int main() {
    // Simulation parameters
    float dt = 0.1f;  // time step
    
    // Create some particles with initial positions and velocities
    std::vector<Particle> particles = {
        {0.0f, 0.0f, 1.0f, 0.5f},
        {1.0f, 2.0f, -0.5f, 0.2f},
        {-1.0f, -1.0f, 0.3f, 0.7f}
    };
    
    // Define a transformation function
    auto update_position = [dt](Particle p) {
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        return p;
    };
    
    // Print initial state
    printf("Step 0:\n");
    for (int i = 0; i < particles.size(); i++) {
        printf("Particle %d: pos=(%.2f, %.2f) vel=(%.2f, %.2f)\n", 
                i, particles[i].x, particles[i].y, particles[i].vx, particles[i].vy);
    }
    
    // Run simulation for 3 steps
    for (int step = 1; step <= 3; step++) {
        // Transform each particle using the algorithm
        std::transform(particles.begin(), particles.end(), particles.begin(), update_position);
        
        // Print results
        printf("\nStep %d:\n", step);
        for (int i = 0; i < particles.size(); i++) {
            printf("Particle %d: pos=(%.2f, %.2f) vel=(%.2f, %.2f)\n", 
                   i, particles[i].x, particles[i].y, particles[i].vx, particles[i].vy);
        }
    }
    
    return 0;
}

Overwriting codes/particles_algo.cpp


In [5]:
%%bash
g++ codes/particles_algo.cpp -o codes/particles_algo
./codes/particles_algo

Step 0:
Particle 0: pos=(0.00, 0.00) vel=(1.00, 0.50)
Particle 1: pos=(1.00, 2.00) vel=(-0.50, 0.20)
Particle 2: pos=(-1.00, -1.00) vel=(0.30, 0.70)

Step 1:
Particle 0: pos=(0.10, 0.05) vel=(1.00, 0.50)
Particle 1: pos=(0.95, 2.02) vel=(-0.50, 0.20)
Particle 2: pos=(-0.97, -0.93) vel=(0.30, 0.70)

Step 2:
Particle 0: pos=(0.20, 0.10) vel=(1.00, 0.50)
Particle 1: pos=(0.90, 2.04) vel=(-0.50, 0.20)
Particle 2: pos=(-0.94, -0.86) vel=(0.30, 0.70)

Step 3:
Particle 0: pos=(0.30, 0.15) vel=(1.00, 0.50)
Particle 1: pos=(0.85, 2.06) vel=(-0.50, 0.20)
Particle 2: pos=(-0.91, -0.79) vel=(0.30, 0.70)


## 3. Compiling with NVCC
Now let's see what happens if we simply compile our code with NVCC:

In [6]:
%%writefile codes/particles_nvcc.cu
// Same content as particles_algo.cpp, just with a different file extension
#include <cstdio>
#include <vector>
#include <algorithm>

struct Particle {
    float x, y;    // position
    float vx, vy;  // velocity
};

int main() {
    // Simulation parameters
    float dt = 0.1f;  // time step
    
    // Create some particles with initial positions and velocities
    std::vector<Particle> particles = {
        {0.0f, 0.0f, 1.0f, 0.5f},
        {1.0f, 2.0f, -0.5f, 0.2f},
        {-1.0f, -1.0f, 0.3f, 0.7f}
    };
    
    // Define a transformation function
    auto update_position = [dt](Particle p) {
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        return p;
    };
    
    // Print initial state
    printf("Step 0:\n");
    for (int i = 0; i < particles.size(); i++) {
        printf("Particle %d: pos=(%.2f, %.2f) vel=(%.2f, %.2f)\n", 
                i, particles[i].x, particles[i].y, particles[i].vx, particles[i].vy);
    }
    
    // Run simulation for 3 steps
    for (int step = 1; step <= 3; step++) {
        // Transform each particle using the algorithm
        std::transform(particles.begin(), particles.end(), particles.begin(), update_position);
        
        // Print results
        printf("\nStep %d:\n", step);
        for (int i = 0; i < particles.size(); i++) {
            printf("Particle %d: pos=(%.2f, %.2f) vel=(%.2f, %.2f)\n", 
                   i, particles[i].x, particles[i].y, particles[i].vx, particles[i].vy);
        }
    }
    
    return 0;
}

Overwriting codes/particles_nvcc.cu


In [7]:
%%bash
nvcc -x cu codes/particles_nvcc.cu -o codes/particles_nvcc
./codes/particles_nvcc

Step 0:
Particle 0: pos=(0.00, 0.00) vel=(1.00, 0.50)
Particle 1: pos=(1.00, 2.00) vel=(-0.50, 0.20)
Particle 2: pos=(-1.00, -1.00) vel=(0.30, 0.70)

Step 1:
Particle 0: pos=(0.10, 0.05) vel=(1.00, 0.50)
Particle 1: pos=(0.95, 2.02) vel=(-0.50, 0.20)
Particle 2: pos=(-0.97, -0.93) vel=(0.30, 0.70)

Step 2:
Particle 0: pos=(0.20, 0.10) vel=(1.00, 0.50)
Particle 1: pos=(0.90, 2.04) vel=(-0.50, 0.20)
Particle 2: pos=(-0.94, -0.86) vel=(0.30, 0.70)

Step 3:
Particle 0: pos=(0.30, 0.15) vel=(1.00, 0.50)
Particle 1: pos=(0.85, 2.06) vel=(-0.50, 0.20)
Particle 2: pos=(-0.91, -0.79) vel=(0.30, 0.70)


**Key insight**: Although we compiled with NVCC, all our code is still **running on the CPU!** This demonstrates an important point about CUDA: just using the NVCC compiler doesn't automatically make your code run on the GPU. You need to explicitly specify which parts should run on the device.

## 4. Using Thrust to Move Computation to GPU
Now let's actually move our computation to the GPU using Thrust, the C++ parallel algorithms library that ships with the CUDA Toolkit:

In [44]:
%%writefile codes/particles_thrust.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <thrust/execution_policy.h>
#include <cstdio>

struct Particle {
    float x, y;    // position
    float vx, vy;  // velocity
};

// Define a functor for updating positions
struct UpdatePosition {
    const float dt;
    
    UpdatePosition(float _dt) : dt(_dt) {}
    
    __host__ __device__
    Particle operator()(const Particle& p) const {
        Particle updated = p;
        updated.x += p.vx * dt;
        updated.y += p.vy * dt;
        return updated;
    }
};

int main() {
    // Simulation parameters
    float dt = 0.1f;  // time step
    
    // Create particles on the host (CPU)
    thrust::host_vector<Particle> h_particles = {
        {0.0f, 0.0f, 1.0f, 0.5f},
        {1.0f, 2.0f, -0.5f, 0.2f},
        {-1.0f, -1.0f, 0.3f, 0.7f}
    };
    
    // Copy particles to the device (GPU)
    thrust::device_vector<Particle> d_particles = h_particles;
    
    // Create our transformation functor
    UpdatePosition updater(dt);
    
    // Print initial state
    printf("Step 0:\n");
    for (int i = 0; i < h_particles.size(); i++) {
        printf("Particle %d: pos=(%.2f, %.2f) vel=(%.2f, %.2f)\n", 
                i, h_particles[i].x, h_particles[i].y, h_particles[i].vx, h_particles[i].vy);
    }
    
    // Run simulation for 3 steps
    for (int step = 1; step <= 3; step++) {
        // Update positions on the GPU
        thrust::transform(thrust::device, 
                         d_particles.begin(), d_particles.end(), 
                         d_particles.begin(), 
                         updater);
        
        // Copy results back to the host
        thrust::copy(d_particles.begin(), d_particles.end(), h_particles.begin());
        
        // Print results
        printf("\nStep %d:\n", step);
        for (int i = 0; i < h_particles.size(); i++) {
            printf("Particle %d: pos=(%.2f, %.2f) vel=(%.2f, %.2f)\n", 
                   i, h_particles[i].x, h_particles[i].y, h_particles[i].vx, h_particles[i].vy);
        }
    }
    
    return 0;
}

Overwriting codes/particles_thrust.cu


In [45]:
%%bash
nvcc codes/particles_thrust.cu -o codes/particles_thrust
./codes/particles_thrust

Step 0:
Particle 0: pos=(0.00, 0.00) vel=(1.00, 0.50)
Particle 1: pos=(1.00, 2.00) vel=(-0.50, 0.20)
Particle 2: pos=(-1.00, -1.00) vel=(0.30, 0.70)

Step 1:
Particle 0: pos=(0.10, 0.05) vel=(1.00, 0.50)
Particle 1: pos=(0.95, 2.02) vel=(-0.50, 0.20)
Particle 2: pos=(-0.97, -0.93) vel=(0.30, 0.70)

Step 2:
Particle 0: pos=(0.20, 0.10) vel=(1.00, 0.50)
Particle 1: pos=(0.90, 2.04) vel=(-0.50, 0.20)
Particle 2: pos=(-0.94, -0.86) vel=(0.30, 0.70)

Step 3:
Particle 0: pos=(0.30, 0.15) vel=(1.00, 0.50)
Particle 1: pos=(0.85, 2.06) vel=(-0.50, 0.20)
Particle 2: pos=(-0.91, -0.79) vel=(0.30, 0.70)


### Key changes:

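1. We use **thrust::host_vector** and **thrust::device_vector** to manage memory on the CPU and GPU.

2. We create a functor whose call operator is marked with the `__host__ __device__` execution space specifiers, indicating it can run on both the CPU and the GPU.

3. We use **thrust::transform** with the **thrust::device** execution policy to perform the computation on the GPU.

4. We explicitly copy data between host and device: assigning `h_particles` to `d_particles` copies the particles to the GPU, and **thrust::copy** brings the results back after each step.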
