# Shared Memory

## Content

* [Cache Memory](#Cache-Memory)
* [Exercise: Optimize Histogram](03.05.02-Exercise-Optimize-Histogram.ipynb)

---
With our previous optimizations, the kernel now performs significantly better. 
However, some inefficiencies remain. 
Currently, each block’s histogram is stored in global GPU memory, even though it’s never used outside the kernel. 
This approach not only consumes unnecessary bandwidth but also increases the overall memory footprint.

## Cache Memory

![L2](Images/L2.png "L2")

As shown in the figure above, there’s a much closer memory resource: each Streaming Multiprocessor (SM) has its own L1 cache. 
Ideally, we want to keep each block’s histogram in that on-chip memory. 
Fortunately, CUDA makes this possible through software-controlled shared memory, which resides in the same on-chip storage as the L1 cache. 
By allocating the block histogram in shared memory, we keep its updates on the SM and avoid unnecessary global memory traffic.
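
The same pattern can be illustrated on a simpler problem than the histogram. What follows is a minimal sketch, not the kernel from this tutorial: each block accumulates a sum in a single `__shared__` variable and issues only one write to global memory at the end. The names `block_sum` and `run_block_sum`, and the input of eight ones, are hypothetical and chosen only for illustration. The `__shared__` keyword and `__syncthreads()` are explained in the walkthrough later in this section.

```c++
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each block sums its slice of `in` in shared memory
// and writes a single total to global memory.
__global__ void block_sum(const int* in, int* block_totals, int n)
{
  __shared__ int total;              // one accumulator per block, on-chip

  if (threadIdx.x == 0) total = 0;   // shared memory starts uninitialized
  __syncthreads();                   // make the zero visible to the whole block

  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) atomicAdd(&total, in[i]); // accumulate in shared memory

  __syncthreads();                   // wait until every thread has contributed
  if (threadIdx.x == 0) block_totals[blockIdx.x] = total; // one global write
}

// Hypothetical host helper: copies the input to the GPU, runs the kernel,
// and returns one total per block. Assumes a non-empty input.
std::vector<int> run_block_sum(const std::vector<int>& host_in, int threads_per_block)
{
  int n = static_cast<int>(host_in.size());
  int blocks = (n + threads_per_block - 1) / threads_per_block;

  int* in = nullptr;
  int* totals = nullptr;
  cudaMalloc(&in, n * sizeof(int));
  cudaMalloc(&totals, blocks * sizeof(int));
  cudaMemcpy(in, host_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

  block_sum<<<blocks, threads_per_block>>>(in, totals, n);

  std::vector<int> host_totals(blocks);
  cudaMemcpy(host_totals.data(), totals, blocks * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(in);
  cudaFree(totals);
  return host_totals;
}

int main()
{
  std::vector<int> data(8, 1); // eight ones, split across two blocks of four
  std::vector<int> totals = run_block_sum(data, 4);
  for (size_t b = 0; b < totals.size(); b++) {
    std::printf("block %zu total = %d\n", b, totals[b]);
  }
  return 0;
}
```

With eight ones and four threads per block, each of the two blocks reports a total of 4. Only the final per-block result touches global memory; the intermediate updates stay in shared memory.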

In [None]:
import os

if os.getenv("COLAB_RELEASE_TAG"): # If running in Google Colab:
  !mkdir -p Sources
  !wget https://raw.githubusercontent.com/NVIDIA/accelerated-computing-hub/refs/heads/main/tutorials/cuda-cpp/notebooks/03.05-Shared-Memory/Sources/ach.cuh -nv -O Sources/ach.cuh

In [None]:
%%writefile Sources/simple-shmem.cpp
#include <cstdio>

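__global__ void kernel()
{
  // One array per thread block, placed in on-chip shared memory.
  __shared__ int shared[4];

  // Shared memory is uninitialized: each thread fills in its own slot.
  shared[threadIdx.x] = threadIdx.x;

  // Wait until every thread in the block has written its slot.
  __syncthreads();

  // A single thread prints the values written by the whole block.
  if (threadIdx.x == 0)
  {
    for (int i = 0; i < 4; i++) {
      std::printf("shared[%d] = %d\n", i, shared[i]);
    }
  }
}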

int main() {
  kernel<<<1, 4>>>();
  cudaDeviceSynchronize();
  return 0;
}

In [None]:
!nvcc -o /tmp/a.out Sources/simple-shmem.cpp -x cu -arch=native && /tmp/a.out

To allocate shared memory, simply annotate a variable with the `__shared__` keyword.
This places the variable in on-chip shared memory, which resides alongside the L1 cache.
Since shared memory isn't automatically initialized, 
we begin our kernel by having each thread write its own index into a corresponding shared memory location:

```c++
shared[threadIdx.x] = threadIdx.x;
__syncthreads();
```

The `__syncthreads()` call ensures that all threads have finished writing to the shared array before any thread reads from it. 
Afterwards, the first thread prints out the contents of the shared memory:

![Shared Memory](Images/simply-shared.png "Shared Memory")

As you can see, each thread successfully stored its index in the shared array, and the first thread can read back those values.
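
The barrier matters most when a thread reads a slot that a different thread wrote. As a minimal sketch under the same four-thread setup (the names `reverse_in_block` and `run_reverse` are hypothetical, introduced only for illustration), each thread below reads the slot written by the thread at the mirrored index, so the read is only well-defined after `__syncthreads()`:

```c++
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int block_size = 4; // assumed; matches the four-thread example above

// Hypothetical kernel: every thread writes its own slot, then reads the
// slot written by the thread at the mirrored index.
__global__ void reverse_in_block(int* out)
{
  __shared__ int shared[block_size];

  shared[threadIdx.x] = threadIdx.x; // each thread writes its own slot
  __syncthreads();                   // all writes finish before any read

  // Read a slot that another thread wrote.
  out[threadIdx.x] = shared[block_size - 1 - threadIdx.x];
}

// Hypothetical host helper: launches one block and copies the result back.
std::vector<int> run_reverse()
{
  int* out = nullptr;
  cudaMalloc(&out, block_size * sizeof(int));

  reverse_in_block<<<1, block_size>>>(out);

  std::vector<int> host_out(block_size);
  cudaMemcpy(host_out.data(), out, block_size * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(out);
  return host_out;
}

int main()
{
  std::vector<int> result = run_reverse();
  for (int i = 0; i < block_size; i++) {
    std::printf("out[%d] = %d\n", i, result[i]);
  }
  return 0;
}
```

This prints the indices in reverse order (3, 2, 1, 0). Without the barrier, a thread could read its mirrored slot before the owning thread has written it, and the result would not be guaranteed.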

---
Now let's go ahead and try an [exercise](03.05.02-Exercise-Optimize-Histogram.ipynb).