# Implementing New Algorithms with CUDA Kernels

Previous labs showed how standard parallel algorithms can provide convenient, speed-of-light GPU acceleration.
However, sometimes your unique use cases are not covered by accelerated libraries. 
In this lab, you’ll learn the CUDA SIMT (Single Instruction, Multiple Threads) programming model and use it to program the GPU directly with CUDA kernels.
In addition, this lab will cover utilities provided by the CUDA ecosystem to facilitate development of custom CUDA kernels.

---
## Prerequisites

To get the most out of this lab, you should already be able to:

- Distinguish between synchronous and asynchronous algorithms
- Declare variables, write loops, and use if / else statements in C++
- Control the execution space of your C++ code and run it on CPU or GPU
- Control the memory space to place your data in CPU or GPU memory


---
## Objectives

By the time you complete this lab, you will be able to:

- Write custom CUDA kernels
- Control the thread hierarchy
- Leverage shared memory
- Use cooperative algorithms
- Advanced: leverage efficient memory loading techniques
- Advanced: use atomic operations

---

## Content

* [3.1 Introduction](../03.01-Introduction/03.01-Introduction.ipynb)
* [3.2 CUDA Kernels](../03.02-Kernels/03.02.01-Kernels.ipynb)
* [3.3 Atomics](../03.03-Atomics/03.03.01-Histogram.ipynb)
* [3.4 Synchronization](../03.04-Synchronization/03.04.01-Sync.ipynb)
* [3.5 Shared Memory](../03.05-Shared-Memory/03.05.01-Shared.ipynb)
* [3.6 Cooperative Algorithms](../03.06-Cooperative-Algorithms/03.06.01-Cooperative.ipynb)
