With the addition of a CPU fallback for the tile API in 83a5845, we can get into situations where modules are compiled on the GPU using a nonsense block_dim of 1, leading to unnecessary module recompilation when a GPU kernel is launched with the intended number of threads per block.
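To make the recompilation concrete, here is a toy model in plain Python. It is not the actual implementation: `ModuleCache`, its fields, the device string, and the `block_dim` of 256 are all hypothetical, and the sketch assumes compiled modules are keyed on the block dimension they were built with.

```python
# Toy model only: ModuleCache and everything in it is hypothetical,
# not the library's real module-loading code.
class ModuleCache:
    def __init__(self):
        self.compiled = {}      # (device, block_dim) -> compiled artifact
        self.compile_count = 0  # number of (expensive) builds performed

    def get(self, device, block_dim):
        key = (device, block_dim)
        if key not in self.compiled:
            # Cache miss: a different block_dim forces a fresh build.
            self.compile_count += 1
            self.compiled[key] = f"binary[{device}, block_dim={block_dim}]"
        return self.compiled[key]


cache = ModuleCache()
cache.get("cuda:0", 1)    # GPU build requested with block_dim of 1
cache.get("cuda:0", 256)  # real launch uses another block_dim: second build
cache.get("cuda:0", 256)  # same key again: served from the cache
```

Under that assumption, the first build is wasted work: nothing on the GPU ever launches with a block_dim of 1, so the module is compiled a second time as soon as a kernel runs with its intended thread count.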