PTX Backend#18

Open
WillTrojak wants to merge 12 commits into PyFR:master from WillTrojak:feature/ptx

@WillTrojak
Member

This adds a PTX backend to GiMMiK. The key features are:

  • Mild optimisation of existing CUDA algorithms.
  • Optional async loads for some sparse kernels.
  • Dense kernel generation for Hopper and above.

Optimisations have focused on FP64; FP32 is future work.

@FreddieWitherden
Contributor

I know this is an utter pain, but for FP32/FP64 can you confirm correctness for all relevant PyFR matrices at a suite of N values, for all instances where a kernel is expected to work (A100/H100/B100)?
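A check of this kind could be sketched roughly as below. This is only an illustration: `check_kernel` and the `run_kernel` callable are hypothetical names, not part of GiMMiK, and the tolerances are assumed values rather than ones agreed in this thread. The reference result is computed in FP64 with NumPy and compared against whatever the kernel under test returns.

```python
import numpy as np


def check_kernel(run_kernel, A, n, dtype, seed=0):
    """Compare run_kernel(A, B) against a NumPy FP64 reference.

    run_kernel is a hypothetical callable standing in for a generated
    kernel; it takes (A, B) and returns the product as an array.
    """
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=dtype)
    B = rng.standard_normal((A.shape[1], n)).astype(dtype)

    # Reference product accumulated in double precision
    expected = A.astype(np.float64) @ B.astype(np.float64)
    actual = np.asarray(run_kernel(A, B), dtype=np.float64)

    # Assumed tolerances; looser for single precision
    tol = 1e-5 if np.dtype(dtype) == np.float32 else 1e-12
    return bool(np.allclose(actual, expected, rtol=tol, atol=tol))


# Sweep both precisions and a few N values with a stand-in kernel
A = np.eye(4, 3)
results = {
    (np.dtype(dt).name, n): check_kernel(lambda a, b: a @ b, A, n, dt)
    for dt in (np.float32, np.float64)
    for n in (16, 1000)
}
```

In practice the stand-in lambda would be replaced by a wrapper that compiles and launches the generated kernel on each target device.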

Comment thread gimmik/kernels/ptx/base.mako Outdated
.param .u64 _c)
{
% endif
.reg .u32 n, id, tid_x, tid_y;
Contributor

Ensure we throw higher up if n is too big.

Contributor

Checking here

Member Author

We don't handle n being too large in any of the other backends.

Contributor

In the embedded case we do: https://github.com/PyFR/GiMMiK/blob/master/gimmik/kernels/cuda/cstream.mako#L20 (the argument case doesn't, but that is not currently used for CUDA).
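A host-side guard along these lines might look like the sketch below. It assumes that "too big" means a value that does not fit in the `.u32` register declared in the template; the thread does not state the exact limit, and `check_n_fits_u32` is a hypothetical helper, not an existing GiMMiK function.

```python
U32_MAX = 2**32 - 1  # largest value a .u32 register can hold


def check_n_fits_u32(n):
    """Raise before code generation if n cannot be held in a u32.

    Hypothetical helper; the limit is an assumption based on the
    '.reg .u32 n' declaration in the template.
    """
    if not 0 < n <= U32_MAX:
        raise ValueError(f'n={n} is outside the range (0, {U32_MAX}]')
    return n
```

Calling it before rendering the template means the failure surfaces as a Python exception rather than as silently wrapped indices in the kernel.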

Comment thread gimmik/cuda.py
nnz = np.count_nonzero(arr)
nuq = len(np.unique(np.abs(arr)))
density = nnz / arr.size
return (nuq <= 28) or (density <= 0.15)
Contributor

Check whether these thresholds could do with tuning.

Member Author

I think that would be a separate PR.
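For anyone picking up the tuning later, the heuristic under discussion can be exercised in isolation. The sketch below wraps the four lines from the diff in a function with a hypothetical name (`passes_sparsity_heuristic`) and exposes the two thresholds as parameters; the parameter names are assumptions, while the default values 28 and 0.15 come from the diff.

```python
import numpy as np


def passes_sparsity_heuristic(arr, max_unique=28, max_density=0.15):
    """Return True if arr has few distinct magnitudes or is sparse.

    Hypothetical wrapper around the check shown in the diff.
    """
    nnz = np.count_nonzero(arr)
    # Distinct absolute values; zero counts as one of them if present
    nuq = len(np.unique(np.abs(arr)))
    density = nnz / arr.size
    return bool((nuq <= max_unique) or (density <= max_density))


identity = np.eye(10)                            # 2 unique values, 10% dense
dense = np.arange(1.0, 1601.0).reshape(40, 40)   # 1600 unique values, 100% dense

identity_ok = passes_sparsity_heuristic(identity)
dense_ok = passes_sparsity_heuristic(dense)
```

With the defaults, the identity matrix passes on both criteria and the fully dense matrix with all-distinct entries fails both.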

Comment thread gimmik/kernels/ptx/bstream-msplit.mako Outdated
% for idx, kx in enumerate(bchunks[bb]):
ld.shared.${pftype} bv, [bsub_thread + ${bsub_off(buf_cur, idx)}];
% for j, row_j in enumerate(mcx):
<% jx = A[row_j, kx] %>
Contributor

See if NumPy indexing can be used in place of the inner for loop, e.g. A[mcx, kx].
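As a small illustration of the suggestion, NumPy integer-array indexing returns the same entries as the per-row lookup in the template loop. The matrix and index values below are made up for the example; only the names `A`, `mcx` and `kx` come from the diff.

```python
import numpy as np

# Made-up data: a 3x4 matrix, two selected rows and one column
A = np.arange(12.0).reshape(3, 4)
mcx = [0, 2]
kx = 1

# Per-row lookup, as in the template's inner loop
loop_vals = [A[row_j, kx] for row_j in mcx]

# Single indexing expression selecting rows mcx of column kx
vec_vals = A[mcx, kx]
```

Here `vec_vals` is a 1-D array with one entry per row in `mcx`, so the template could iterate over it directly instead of indexing `A` on every pass.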
