CUDA does not handle conditional if/else logic well. For short branches, the compiler may generate code for both paths and use predicated instructions to select the correct result; for longer branches, threads in a warp that take different paths execute them one after the other. Either way, this can waste a lot of computation if the branches are long or the condition is rarely met. It is generally a good idea to avoid conditional logic in your kernels where possible.
If it is unavoidable, you can dig down to the PTX assembly code (nvcc -ptx kernel.cu -o kernel.ptx) and see how the compiler is handling the branch. Then you can look at the compute metrics of the instructions used and optimize from there.
A single thread working down a long nested if/else chain serializes execution: the other threads in the warp sit idle waiting for the next common instruction while that one thread finishes its branch. This is called warp divergence, and it is a common issue in CUDA programming specifically among threads within the same warp.
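As a minimal sketch (the kernel name and the arithmetic are made up for illustration), the inner branch below splits every warp: even lanes take one path and odd lanes the other, so the hardware runs both paths back to back with half the lanes masked off each time:

```cuda
// Hypothetical kernel illustrating intra-warp divergence: even- and
// odd-numbered lanes of the same warp take different branches, so the
// warp executes both paths serially while the inactive half waits.
__global__ void divergent(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (threadIdx.x % 2 == 0) {
            out[i] = in[i] * 2.0f;   // executed with odd lanes masked off
        } else {
            out[i] = in[i] + 1.0f;   // executed with even lanes masked off
        }
    }
}
```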
Vector addition is fast because divergence isn't possible: there is no data-dependent branch, so there is only one possible instruction path for every thread to follow.
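For contrast, here is a plain vector-addition kernel. All 32 threads of a warp always execute the same instruction; only the very last warp can partially mask on the bounds check:

```cuda
// Branch-free vector addition: every thread in a warp runs the same
// instruction stream. The bounds check is uniform for all full warps,
// so at most the final (partial) warp masks off a few lanes.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
```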
Pros and Cons of Unified Memory
Unified Memory is a feature in CUDA that allows you to allocate memory that is accessible from both the CPU (system DRAM) and the GPU. This can simplify memory management in your code, as you don't have to worry about copying data back and forth between host DRAM and the GPU's VRAM.
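A minimal sketch of Unified Memory in action, assuming a single-GPU setup: one cudaMallocManaged allocation is touched by both the host loop and the kernel, with no explicit cudaMemcpy anywhere:

```cuda
#include <cstdio>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 20;
    int *data;
    // One allocation visible to both host and device; the driver migrates
    // pages on demand instead of requiring explicit cudaMemcpy calls.
    cudaMallocManaged(&data, n * sizeof(int));
    for (int i = 0; i < n; i++) data[i] = i;  // host writes directly

    increment<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();  // required before the host touches the data again

    printf("data[0] = %d\n", data[0]);  // prints 1
    cudaFree(data);
    return 0;
}
```

The convenience has a cost: page migration happens on demand, so performance can be worse than explicit copies unless you hint with cudaMemPrefetchAsync.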
Streams allow for overlapping data transfer (prefetching) with computation.
While one stream is executing a kernel, another stream can be transferring data for the next computation.
This technique is often called "double buffering" or "multi-buffering" when extended to more buffers.
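Here is a rough sketch of double buffering with two streams. The names process and pipeline are hypothetical, and h_in is assumed to be pinned host memory (allocated with cudaMallocHost), which cudaMemcpyAsync needs in order to actually overlap copies with kernels:

```cuda
#include <cuda_runtime.h>

__global__ void process(float *buf, int n) {
    // Stand-in kernel; in practice this would be your real computation.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

// Process numChunks chunks of pinned host data with two streams, so the
// copy of chunk c+1 overlaps the kernel still working on chunk c.
void pipeline(float *h_in, int numChunks, int chunk) {
    float *d_buf[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; s++) {
        cudaMalloc(&d_buf[s], chunk * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }
    for (int c = 0; c < numChunks; c++) {
        int s = c % 2;  // alternate between the two buffers/streams
        cudaMemcpyAsync(d_buf[s], h_in + (size_t)c * chunk,
                        chunk * sizeof(float), cudaMemcpyHostToDevice,
                        stream[s]);
        process<<<(chunk + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], chunk);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < 2; s++) {
        cudaFree(d_buf[s]);
        cudaStreamDestroy(stream[s]);
    }
}
```

Work queued in the same stream runs in order, so each kernel launch safely waits for its own copy, while the two streams' copies and kernels are free to overlap each other.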
Memory Architectures
DRAM/VRAM cells are the smallest unit of memory in a computer. Each cell is made up of one capacitor and one transistor (a "1T1C" design): the capacitor stores a bit as an electrical charge, and the transistor controls the flow of electricity to read and write it.
SRAM (the technology behind shared memory) is a type of memory that is faster and more expensive than DRAM. It is used for caches and on-chip memory in CPUs and GPUs because it can be accessed much more quickly than DRAM.
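In CUDA, this on-chip SRAM is exposed directly as __shared__ memory. A small sketch (assuming a block size of exactly 256 threads): each block stages a tile of global-memory (DRAM) data into SRAM once, then reduces it there at much lower latency:

```cuda
// Staging data in on-chip SRAM via __shared__ memory. Assumes the kernel
// is launched with exactly 256 threads per block.
__global__ void tileSum(const float *in, float *out, int n) {
    __shared__ float tile[256];          // lives in on-chip SRAM
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // one DRAM read per thread
    __syncthreads();                     // wait until the whole tile is loaded

    // Tree reduction within the block, entirely out of shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];  // one result per block
}
```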
Modern NVIDIA GPUs likely use 6T (six-transistor) or 8T SRAM cells for most on-chip memory.
6T cells are compact and offer good performance, while 8T cells can provide better stability and lower power consumption at the cost of a larger area.
Which of the two NVIDIA actually uses across different architectures and compute capabilities isn't publicly disclosed in detail; NVIDIA, like most semiconductor companies, keeps these low-level design choices proprietary.