# Optimizing Online 5G Machine-Learning with Nsight Compute

## 06 Store vectors in shared memory

Note that the algorithm's input data vector stays the same throughout the whole Gaussian loop, and that the basis vectors are the same for every sample in the block. This means `we can share the basis (dictionary) vectors across different samples in the same block`. We assume a maximum vector size and cache the vectors in CUDA's low-latency `shared memory`.
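The caching idea can be sketched as a small standalone program. This is a minimal illustration, not the lab's kernel: the kernel name `gaussian_sum_shmem`, the bounds `MAX_LEN` and `MAX_BASIS`, the one-thread-per-sample mapping, and the unit-width Gaussian are all assumptions made for this example.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumed upper bounds (hypothetical values, not taken from the lab code).
constexpr int MAX_LEN   = 32;  // maximum vector length
constexpr int MAX_BASIS = 64;  // maximum number of basis vectors

// Each thread handles one sample; all threads of a block reuse the same
// basis vectors, so the block loads them into shared memory only once.
__global__ void gaussian_sum_shmem(const float* basis, int numBasis, int len,
                                   const float* samples, int numSamples,
                                   float* out)
{
    __shared__ float sBasis[MAX_BASIS * MAX_LEN];  // 64 * 32 * 4 B = 8 KB

    // Cooperative, strided load: every thread copies a share of the basis.
    for (int i = threadIdx.x; i < numBasis * len; i += blockDim.x)
        sBasis[i] = basis[i];
    __syncthreads();  // the basis must be complete before any thread reads it

    const int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= numSamples) return;

    // The input vector is constant over the Gaussian loop: read it once
    // into a thread-local copy.
    float x[MAX_LEN];
    for (int k = 0; k < len; ++k) x[k] = samples[s * len + k];

    float acc = 0.0f;
    for (int b = 0; b < numBasis; ++b) {
        float dist2 = 0.0f;
        for (int k = 0; k < len; ++k) {
            const float d = x[k] - sBasis[b * len + k];
            dist2 += d * d;
        }
        acc += expf(-dist2);  // unit-width Gaussian kernel (assumption)
    }
    out[s] = acc;
}

int main()
{
    const int numBasis = 64, len = 32, numSamples = 1000;  // within the bounds
    std::vector<float> hBasis(numBasis * len), hSamples(numSamples * len);
    for (size_t i = 0; i < hBasis.size(); ++i)   hBasis[i]   = 0.01f * (i % 17);
    for (size_t i = 0; i < hSamples.size(); ++i) hSamples[i] = 0.01f * (i % 13);

    float *dBasis = nullptr, *dSamples = nullptr, *dOut = nullptr;
    cudaMalloc(&dBasis, hBasis.size() * sizeof(float));
    cudaMalloc(&dSamples, hSamples.size() * sizeof(float));
    cudaMalloc(&dOut, numSamples * sizeof(float));
    cudaMemcpy(dBasis, hBasis.data(), hBasis.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dSamples, hSamples.data(), hSamples.size() * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks  = (numSamples + threads - 1) / threads;
    gaussian_sum_shmem<<<blocks, threads>>>(dBasis, numBasis, len, dSamples, numSamples, dOut);

    std::vector<float> hOut(numSamples);
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, numSamples * sizeof(float), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // CPU reference to check the GPU result.
    float maxErr = 0.0f;
    for (int s = 0; s < numSamples; ++s) {
        float acc = 0.0f;
        for (int b = 0; b < numBasis; ++b) {
            float dist2 = 0.0f;
            for (int k = 0; k < len; ++k) {
                const float d = hSamples[s * len + k] - hBasis[b * len + k];
                dist2 += d * d;
            }
            acc += std::exp(-dist2);
        }
        maxErr = std::fmax(maxErr, std::fabs(acc - hOut[s]));
    }
    std::printf("max abs error vs. CPU reference: %g\n", maxErr);

    cudaFree(dBasis); cudaFree(dSamples); cudaFree(dOut);
    return maxErr < 1e-3f ? 0 : 1;
}
```

Sizing the shared array by a compile-time maximum keeps the allocation static, at the cost of reserving the full buffer even when the actual vectors are smaller.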

The new version using shared memory is already available in the code. To enable the new version, we just need to set the `APSM_DETECT_VERSION` flag in line 73 of [apsm_versions.h](apsm/cpp/lib/apsm/apsm_versions.h).

Open [apsm_detect.cu](apsm/cpp/lib/apsm/apsm_detect.cu) at line 615 to inspect the differences. You might notice that there is another intermediate step, `APSM_DETECT_SPLIT`, here. It is useful for analyzing the linear and Gaussian sections of this kernel separately and for better understanding where most of the time is spent. It's not required to progress in this lab, but you are free to come back here at the end and inspect it if you are interested.

After setting the `APSM_DETECT_VERSION` define to `apsm_version::APSM_DETECT_SHMEM`, re-compile the code with the following commands and then collect a profiler report for this new version:

```
%cd /dli/task/ncu/apsm/cpp/build
!make -j
!ncu -k kernel_apsm_detect --set full --import-source yes -f -o /dli/task/ncu/report_shmem \
    bin/APSM_tool -m QAM16 -s ../data/offline/rx/time/rxData_QAM16_alltx_converted.bin -r ../data/offline/tx/NOMA_signals_qam16_complex.bin
```

### Steps without instructor in `...`

After collecting the results, switch to the Ubuntu instance with Nsight Compute and open the newly created report file `/root/Desktop/reports/ncu/report_shmem.ncu-rep`. For easier comparison, `add the SPB step as a new baseline`. You can rename each baseline by hovering over its name and typing over it. (Since CG and SPB have very similar performance, you could also remove the older CG baseline to keep the comparison easier to read. Simply click on the colored box next to the respective baseline name.)

Before inspecting the actual performance, we can verify in the `Launch Statistics` section that the kernel is now in fact using shared memory. It shows that ~10 KB of static shared memory is configured per CUDA block.
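The same figure can also be queried programmatically with the CUDA runtime call `cudaFuncGetAttributes`. The sketch below uses a hypothetical `demo_kernel` with an assumed 2560-element `float` buffer (2560 × 4 bytes = 10240 bytes, roughly the size reported above); it is not the lab's kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with a statically sized shared buffer:
// 2560 floats * 4 bytes = 10240 bytes (about 10 KB).
__global__ void demo_kernel(float* out)
{
    __shared__ float cache[2560];
    cache[threadIdx.x] = static_cast<float>(threadIdx.x);
    __syncthreads();
    // Read a value written by another thread so the buffer is really used.
    out[threadIdx.x] = cache[blockDim.x - 1 - threadIdx.x];
}

int main()
{
    cudaFuncAttributes attr{};
    cudaError_t err = cudaFuncGetAttributes(&attr, demo_kernel);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // sharedSizeBytes is the statically allocated shared memory per block.
    std::printf("static shared memory per block: %zu bytes\n", attr.sharedSizeBytes);
    std::printf("registers per thread: %d\n", attr.numRegs);
    return 0;
}
```

Only statically declared shared memory is counted here; dynamically sized shared memory passed at launch would show up separately in the profiler.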

<img src="images/ncu_report03_03.png">

Now, let's check what impact our changes had on the kernel runtime and metrics. Scroll back up on the Details page for the high-level comparison.

<img src="images/ncu_report03_01.png">

As you can see, the new version is about `60% faster`.

In the `Memory Workload Analysis` section, the `Memory Throughput` from DRAM has increased significantly. Given that the overall memory subsystem utilization (also reported as `Memory Throughput`) hasn't really changed across the three versions of the kernel, this indicates that we are now using memory considerably more efficiently.

A primary contributor to this is the `much reduced LG Throttle` stalls, as shown in the `Warp State Statistics` section. Recall that this was the primary concern with the previous SPB implementation in step 5. The change reduced the pressure on the memory subsystem.

<img src="images/ncu_report03_02.png">

While the performance has improved and the LG Throttle stalls are significantly reduced, we still see one stall reason that causes relevant latency: `Stall Barrier`. Let's look at its description in the [documentation](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#statistical-sampler):

*Barrier*: Warp was stalled waiting for sibling warps at a CTA barrier. A high number of warps waiting at a barrier is commonly caused by diverging code paths before a barrier. This causes some warps to wait a long time until other warps reach the synchronization point. Whenever possible, try to divide up the work into blocks of uniform workloads. Also, try to identify which barrier instruction causes the most stalls, and optimize the code executed before that synchronization point first.
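To make the description above concrete, here is a small, hypothetical sketch of uneven versus uniform work before a barrier. Neither kernel comes from the lab code; the names `fill_unbalanced` and `fill_balanced` and the tile size are assumptions. In the first kernel only one warp fills the shared tile while all other warps wait at `__syncthreads()`; in the second, every thread takes an equal share of the load.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 1024;  // assumed tile size for this illustration

// Unbalanced: only the first warp fills the tile, so the remaining warps
// reach the barrier early and stall there until warp 0 is done.
__global__ void fill_unbalanced(const float* in, float* out)
{
    __shared__ float tile[TILE];
    if (threadIdx.x < 32)
        for (int i = threadIdx.x; i < TILE; i += 32)
            tile[i] = in[i];
    __syncthreads();
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        out[i] = 2.0f * tile[i];
}

// Balanced: all threads share the load, so warps arrive at the barrier
// at roughly the same time.
__global__ void fill_balanced(const float* in, float* out)
{
    __shared__ float tile[TILE];
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        tile[i] = in[i];
    __syncthreads();
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        out[i] = 2.0f * tile[i];
}

int main()
{
    std::vector<float> hIn(TILE), hA(TILE), hB(TILE);
    for (int i = 0; i < TILE; ++i) hIn[i] = 0.5f * i;

    float *dIn = nullptr, *dA = nullptr, *dB = nullptr;
    cudaMalloc(&dIn, TILE * sizeof(float));
    cudaMalloc(&dA, TILE * sizeof(float));
    cudaMalloc(&dB, TILE * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), TILE * sizeof(float), cudaMemcpyHostToDevice);

    fill_unbalanced<<<1, 256>>>(dIn, dA);
    fill_balanced<<<1, 256>>>(dIn, dB);

    cudaMemcpy(hA.data(), dA, TILE * sizeof(float), cudaMemcpyDeviceToHost);
    cudaError_t err = cudaMemcpy(hB.data(), dB, TILE * sizeof(float), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Both kernels compute the same result; only the work distribution differs.
    bool same = true;
    for (int i = 0; i < TILE; ++i)
        same = same && (hA[i] == hB[i]) && (std::fabs(hA[i] - 2.0f * hIn[i]) < 1e-6f);
    std::printf("results match: %s\n", same ? "yes" : "no");

    cudaFree(dIn); cudaFree(dA); cudaFree(dB);
    return same ? 0 : 1;
}
```

Profiling the two kernels separately with Nsight Compute would be one way to compare their barrier stalls; the actual difference depends on the GPU and the launch configuration.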

As we've seen in the Speed Of Light section at the top, Nsight Compute still suggests that the kernel is latency bound and that reducing these stalls could benefit kernel performance.

If you are short on time, you can move directly to the [summary](08_summary.ipynb).

If you are interested in another step, continue the optimizations in [step 07](07_balanced.ipynb).