
[QST] how can i do w4a8 (int4 * int8) using cutlass? #1370

Closed
yyfcc17 opened this issue Mar 1, 2024 · 6 comments
Labels
question Question

Comments

yyfcc17 commented Mar 1, 2024

I see that int4 × fp8 is supported on Hopper GPUs.

But GPUs like the A30 and A100 have no fp8 support, so I need int4 × int8, with two int4 values packed into one int8.

How can I accomplish this kind of mixed-input GEMM using CUTLASS?

Thanks!
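As a reference for what "two int4 packed into an int8" means, here is a minimal NumPy sketch of packing signed int4 values into int8 bytes and recovering them via sign extension. This is an illustrative simulation only, not CUTLASS code; the nibble order (even index in the low nibble) is an assumed convention, and actual CUTLASS kernels may lay data out differently:

```python
import numpy as np

def pack_int4_pairs(vals):
    """Pack a flat array of signed int4 values (-8..7) into int8 bytes.

    Assumed convention: the even-indexed value goes in the low nibble,
    the odd-indexed value in the high nibble of each byte.
    """
    vals = np.asarray(vals, dtype=np.int8)
    assert vals.size % 2 == 0
    lo = vals[0::2] & 0x0F          # two's-complement low nibble
    hi = (vals[1::2] & 0x0F) << 4   # shift into the high nibble (wraps in int8)
    return lo | hi

def unpack_int4_pairs(packed):
    """Recover signed int4 values from packed int8 bytes via sign extension."""
    packed = np.asarray(packed, dtype=np.int8)
    lo = (packed << 4) >> 4   # shift up then arithmetic-shift down: sign-extends
    hi = packed >> 4          # arithmetic right shift sign-extends the high nibble
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2] = lo
    out[1::2] = hi
    return out
```

On the GPU side, the unpack step corresponds to the data-conversion stage a mixed-input kernel must perform before feeding int8 tensor-core instructions.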

hwu36 (Collaborator) commented Mar 1, 2024

#1190

@alexsamardzic

alexsamardzic (Contributor) commented Mar 1, 2024

My PR is about F16/S4 MM on Ampere, and will also cover BF16/S4, F16/U4, and BF16/U4. Through earlier work by @manishucsd, CUTLASS already supports these combinations with S8 and U8 in place of S4 and U4, respectively, on Ampere. That is the extent of mixed data-type MM support, so what the OP needs is not in the works yet, but it may be an interesting follow-up.

yyfcc17 (Author) commented Mar 4, 2024

Thanks for the reply.

Since LLM w8a8 quantization is largely solved by PTQ methods, and w4a4 is too aggressive for PTQ, w4a8 PTQ is a good option for further optimization.

In my own experiments, the LLM's accuracy degradation is minor under w4a8 (per-channel weight × per-token activation) PTQ settings. There is also a paper about it: https://arxiv.org/abs/2311.09550; the results look promising.

TensorRT-LLM already supports w4a8 PTQ and inference on Hopper, but only for fp8.

Is it possible to extend your PR to support w4a8 (int4 × int8) on Ampere? @alexsamardzic
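The per-channel × per-token scheme mentioned above can be sketched in NumPy as a small simulation. This is an illustrative sketch under assumed symmetric quantization, not CUTLASS or TensorRT-LLM code; all function names and shapes here are hypothetical:

```python
import numpy as np

def quantize_per_token_int8(x):
    """Symmetric per-token (per-row) int8 quantization of activations."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def quantize_per_channel_int4(w):
    """Symmetric per-output-channel (per-column) int4 quantization of weights."""
    scale = np.abs(w).max(axis=0, keepdims=True) / 7.0
    # int4 values stored in int8 containers; a real kernel would pack two per byte
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # activations: tokens x features
w = rng.standard_normal((64, 32)).astype(np.float32)  # weights: features x channels

qx, sx = quantize_per_token_int8(x)
qw, sw = quantize_per_channel_int4(w)

# The integer GEMM accumulates in int32; scales are applied to the result.
acc = qx.astype(np.int32) @ qw.astype(np.int32)
y = acc.astype(np.float32) * sx * sw   # per-token rows x per-channel cols broadcast

err = np.abs(y - x @ w).max()
```

The key point for the kernel is that both scale factors factor out of the inner product, so the hot loop stays pure int4 × int8 → int32, with the float rescaling applied once per output element in the epilogue.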

alexsamardzic (Contributor) commented
It would be better to open another PR. The combination of data types that my PR handles is rather tricky with respect to loading data from shared memory into registers, which I believe should not be the case for S4/S8 MM. Also, there isn't much to share between the two: the data conversion, the reshuffling of S4 data between threads, and the tests would all be different for S4/S8.

alexsamardzic (Contributor) commented
I'm now working on this feature; #1413 has been created, so this issue can be closed.

yyfcc17 (Author) commented Mar 20, 2024

@alexsamardzic that's great, thank you!

yyfcc17 closed this as completed Mar 20, 2024