# Memory Spaces in GPU Programming

## Content
* [Host and Device Memory Basics](#Host-and-Device-Memory-Basics)
* [Universal Vector vs Explicit Memory Management](#Universal-Vector-vs-Explicit-Memory-Management)
* [Performance Impact of Memory Transfers](#Performance-Impact-of-Memory-Transfers)

Let's explore how memory spaces work in GPU programming using our particle simulation system. First, let's understand why memory spaces matter for performance.

GPUs achieve their massive parallelism partly through specialized high-bandwidth memory. While CPUs prioritize low latency memory access, GPUs focus on high throughput to support thousands of concurrent threads. This is why GPUs typically have their own dedicated memory rather than just using system RAM.

Let's see how this affects our particle simulation code.

## Host and Device Memory Basics

Here's a simple example demonstrating the different memory spaces:


In [10]:
#Specifying path to where nvcc exists so that the jupyter notebook reads from it. nvcc is the nvidia cuda compiler for executing cuda. 
import os
os.environ['PATH'] = "/packages/apps/spack/21/opt/spack/linux-rocky8-zen3/gcc-12.1.0/cuda-12.6.1-cf4xlcbcfpwchqwo5bktxyhjagryzcx6/bin:" + os.environ['PATH']

In [4]:
%%writefile codes/example.cu 

// Host memory - accessible by CPU
thrust::host_vector<float> h_positions{0.1f, 0.2f, 0.3f, 0.4f};  // x,y positions

// Device memory - accessible by GPU
thrust::device_vector<float> d_positions(4);

// Copy data from host to device
thrust::copy(h_positions.begin(), h_positions.end(), d_positions.begin());


Overwriting codes/example.cu


In this code:
- `host_vector` allocates memory in CPU/host memory space
- `device_vector` allocates memory in GPU/device memory space 
- We must explicitly copy data between spaces
## Universal Vector vs Explicit Memory Management

Let's look at two versions of our particle simulation code - one using universal_vector and one with explicit memory management:


In [5]:
%%writefile codes/example.cu 

// Version 1: Using universal_vector (automatic but potentially inefficient)
void simulate_particles_v1() {
    thrust::universal_vector<float> positions = init_particles(1000000);
    thrust::universal_vector<float> forces(positions.size());
    
    for(int step = 0; step < 100; step++) {
        // GPU computation
        compute_forces(positions, forces);
        
        // CPU visualization every 10 steps
        if(step % 10 == 0) {
            visualize_particles(positions);  // Implicit transfer to host!
        }
        
        update_positions(positions, forces);
    }
}

// Version 2: Explicit memory management (more control, better performance)
void simulate_particles_v2() {
    // Host vectors for CPU operations
    thrust::host_vector<float> h_positions = init_particles(1000000);
    thrust::host_vector<float> h_forces(h_positions.size());
    
    // Device vectors for GPU operations
    thrust::device_vector<float> d_positions = h_positions;
    thrust::device_vector<float> d_forces(d_positions.size());
    
    for(int step = 0; step < 100; step++) {
        // GPU computation using device memory
        compute_forces(d_positions, d_forces);
        
        // Only copy to host when needed for visualization
        if(step % 10 == 0) {
            thrust::copy(d_positions.begin(), d_positions.end(), 
                        h_positions.begin());
            visualize_particles(h_positions);
        }
        
        update_positions(d_positions, d_forces);
    }
}

Overwriting codes/example.cu


Let's measure the impact of memory transfers:

In [16]:
%%writefile codes/example.cu 

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/universal_vector.h>
#include <thrust/copy.h>
#include <thrust/execution_policy.h>
#include <chrono>
#include <cstdio>
#include <cmath>

namespace {

// Simple particle physics computation on GPU
__global__ void compute_forces_kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    
    // Access particle data
    float x = data[idx * 4 + 0];
    float y = data[idx * 4 + 1];
    
    // Simple force calculation (distance from origin)
    float r = sqrtf(x*x + y*y);
    float force = 1.0f / (r + 0.1f);  // Avoid division by zero
    
    // Store forces
    data[idx * 4 + 2] = -force * x/r;  // fx
    data[idx * 4 + 3] = -force * y/r;  // fy
}

// Helper function to run kernel
template<typename VectorType>
void compute_forces(VectorType& data) {
    int n = data.size() / 4;
    float* raw_ptr = thrust::raw_pointer_cast(data.data());
    
    int block_size = 256;
    int num_blocks = (n + block_size - 1) / block_size;
    compute_forces_kernel<<<num_blocks, block_size>>>(raw_ptr, n);
    cudaDeviceSynchronize();
}

// Simple benchmark comparing memory transfer approaches
void measure_transfer_overhead() {
    const int N = 1000000;  // 1 million particles
    
    // Allocate data
    thrust::host_vector<float> h_data(N * 4);
    thrust::device_vector<float> d_data = h_data;
    
    // Fill with some data
    for(int i = 0; i < N * 4; i++) {
        h_data[i] = static_cast<float>(i);
    }
    
    // Test 1: Compute only
    auto t1 = std::chrono::high_resolution_clock::now();
    compute_forces(d_data);
    auto t2 = std::chrono::high_resolution_clock::now();
    
    // Test 2: Compute + transfers 
    auto t3 = std::chrono::high_resolution_clock::now();
    thrust::copy(d_data.begin(), d_data.end(), h_data.begin());
    compute_forces(d_data);
    thrust::copy(d_data.begin(), d_data.end(), h_data.begin());
    auto t4 = std::chrono::high_resolution_clock::now();
    
    double compute_time = std::chrono::duration<double>(t2-t1).count();
    double total_time = std::chrono::duration<double>(t4-t3).count();
    
    printf("\nMemory Transfer Analysis:\n");
    printf("Compute only: %g seconds\n", compute_time);
    printf("With transfers: %g seconds\n", total_time);
    printf("Transfer overhead: %g seconds\n", total_time - compute_time);
}

// Demonstrate universal vector vs explicit memory management
void compare_memory_approaches() {
    const int N = 100000;  // 100k particles
    printf("\nComparing memory management approaches with %d particles...\n", N);
    
    // Test 1: Universal vector (automatic transfers)
    {
        thrust::universal_vector<float> data(N * 4, 1.0f);
        
        auto t1 = std::chrono::high_resolution_clock::now();
        
        // Do some work with automatic memory management
        for(int i = 0; i < 10; i++) {
            compute_forces(data);
            // Simulate CPU work every few iterations
            if(i % 3 == 0) {
                float sum = thrust::reduce(thrust::host, data.begin(), data.end());
                if(sum < -1e10) printf("Unlikely sum: %f\n", sum);
            }
        }
        
        auto t2 = std::chrono::high_resolution_clock::now();
        double time = std::chrono::duration<double>(t2-t1).count();
        printf("Universal vector time: %g seconds\n", time);
    }
    
    // Test 2: Explicit memory management
    {
        thrust::host_vector<float> h_data(N * 4, 1.0f);
        thrust::device_vector<float> d_data = h_data;
        
        auto t1 = std::chrono::high_resolution_clock::now();
        
        // Do same work with explicit memory management
        for(int i = 0; i < 10; i++) {
            compute_forces(d_data);
            // Only transfer when needed
            if(i % 3 == 0) {
                thrust::copy(d_data.begin(), d_data.end(), h_data.begin());
                float sum = thrust::reduce(thrust::host, h_data.begin(), h_data.end());
                if(sum < -1e10) printf("Unlikely sum: %f\n", sum);
            }
        }
        
        auto t2 = std::chrono::high_resolution_clock::now();
        double time = std::chrono::duration<double>(t2-t1).count();
        printf("Explicit memory time: %g seconds\n", time);
    }
}

} // anonymous namespace

int main() {
    // Show overhead of memory transfers
    measure_transfer_overhead();
    
    // Compare different memory management approaches
    compare_memory_approaches();
    
    return 0;
}

Overwriting codes/example.cu


In [17]:
%%bash
nvcc -o codes/example --extended-lambda codes/example.cu 
./codes/example


Memory Transfer Analysis:
Compute only: 0.348636 seconds
With transfers: 0.00996045 seconds
Transfer overhead: -0.338675 seconds

Comparing memory management approaches with 100000 particles...
Universal vector time: 0.159094 seconds
Explicit memory time: 0.0881803 seconds


Key takeaways:
1. GPU has its own memory space optimized for parallel access
2. Data must be explicitly moved between CPU and GPU memory
3. `universal_vector` provides convenience but may cause hidden transfers
4. Memory transfers can significantly impact performance
5. Best practice: Keep data in GPU memory when doing repeated GPU operations

For optimal performance:
- Minimize host-device transfers
- Batch operations to avoid frequent transfers
- Only transfer data when absolutely necessary
- Keep compute-intensive operations on the GPU
- Use explicit memory management for better control

Next up we'll do an exercise to practice these concepts!