PTX Backend #18
I know this is an utter pain, but for FP32/FP64 can you confirm correctness for all relevant PyFR matrices at a suite of N values, for all instances where a kernel is expected to work on A100/H100/B100?
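A minimal sketch of how such a correctness sweep could be organised, comparing a kernel's output against NumPy's `a @ b` for each N. The names `check_correctness` and `run_kernel`, and the tolerance values, are hypothetical and not part of GiMMiK; `run_kernel` stands in for whatever launches the generated PTX kernel.

```python
import numpy as np

# Hypothetical tolerances per precision; tune for the matrices in question.
TOLS = {'float32': 1e-5, 'float64': 1e-12}


def check_correctness(run_kernel, a, n_values, dtype, seed=0):
    """Return the list of N values for which run_kernel(a, b) disagrees
    with the NumPy reference a @ b."""
    dtype = np.dtype(dtype)
    tol = TOLS[dtype.name]
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=dtype)

    failures = []
    for n in n_values:
        # Random dense operand with n columns
        b = rng.random((a.shape[1], n)).astype(dtype)
        expected = a @ b
        got = run_kernel(a, b)
        if not np.allclose(got, expected, rtol=tol, atol=tol):
            failures.append(n)

    return failures
```

Running this once per operator matrix, per dtype, and per device would give a table of failing N values to attach to the PR.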
.param .u64 _c)
{
% endif
.reg .u32 n, id, tid_x, tid_y;
Ensure we throw higher up if n is too big.
We don't handle n being too large in any of the other backends.
We do in the embedded case (https://github.com/PyFR/GiMMiK/blob/master/gimmik/kernels/cuda/cstream.mako#L20); the argument case doesn't, but that is not currently used for CUDA.
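As a rough illustration of throwing higher up, a host-side guard could reject an oversized `n` before any kernel source is rendered. This is only a sketch: `validate_n` and `U32_MAX` are hypothetical names, and the limit assumes "too big" means exceeding the range of the `.u32` register used for `n`; the real bound may be tighter.

```python
# Assumed limit: n is held in a .u32 register in the kernel.
U32_MAX = 2**32 - 1


def validate_n(n, max_n=U32_MAX):
    """Raise before code generation if n cannot be handled by the kernel."""
    if isinstance(n, bool) or not isinstance(n, int):
        raise TypeError(f'n must be an integer, got {type(n).__name__}')
    if n <= 0:
        raise ValueError(f'n must be positive, got {n}')
    if n > max_n:
        raise ValueError(f'n = {n} exceeds the supported maximum {max_n}')
    return n
```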
nnz = np.count_nonzero(arr)
nuq = len(np.unique(np.abs(arr)))
density = nnz / arr.size
return (nuq <= 28) or (density <= 0.15)
Check if these could do with tuning
I think that would be a separate PR.
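For reference when tuning is picked up, the heuristic above can be wrapped as a standalone function with the two thresholds exposed as parameters. The function name `is_suitable` and the parameter names are hypothetical; the defaults are the values from the diff.

```python
import numpy as np


def is_suitable(arr, max_unique=28, max_density=0.15):
    """Return True if the matrix has few distinct magnitudes or is sparse."""
    arr = np.asarray(arr)
    nnz = np.count_nonzero(arr)            # number of non-zero entries
    nuq = len(np.unique(np.abs(arr)))      # distinct absolute values (zero included)
    density = nnz / arr.size               # fraction of non-zero entries
    return bool((nuq <= max_unique) or (density <= max_density))
```

Exposing the thresholds makes it straightforward to sweep them against benchmark results without editing the backend itself.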
% for idx, kx in enumerate(bchunks[bb]):
ld.shared.${pftype} bv, [bsub_thread + ${bsub_off(buf_cur, idx)}];
% for j, row_j in enumerate(mcx):
<% jx = A[row_j, kx] %>
See if NumPy indexing, e.g. A[mcx, kx], can be used in place of the per-row lookup in the for loop.
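A small sketch of what that suggestion amounts to, assuming `mcx` is a list of row indices and `kx` a single column index as in the template. The matrix and index values here are made up for illustration.

```python
import numpy as np

# Illustrative operator matrix and indices (not taken from the PR)
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0],
              [4.0, 0.0, 5.0]])
mcx = [0, 2]   # hypothetical chunk of row indices
kx = 2         # hypothetical column index

# Per-row lookup, as in the template's inner loop
loop_vals = [A[row_j, kx] for row_j in mcx]

# Single fancy-indexing expression: one value per row in mcx
vec_vals = A[mcx, kx]
```

Both forms yield the same values in the same order, so the inner loop could iterate over `zip(mcx, A[mcx, kx])` instead of indexing `A` once per row.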
This adds a PTX backend to GiMMiK. The key features are:
Optimisations have focused on FP64; FP32 is future work.