
Refactor fused MLP + fused attention loading. Fix for fused MLP requiring Triton even when not used. #85

Conversation

@TheBloke (Contributor) commented May 16, 2023

This is a fix for: #43 (comment)

Changes:

  1. In modeling/_base.py, add new classmethods get_fused_attention_module and get_fused_mlp_module. If called on the base class directly, they log "this class does not support" warnings.
  2. In modeling/llama.py, override the same classmethods, wrapping the imports of FusedLlamaMLPForQuantizedModel and FusedLlamaAttentionForQuantizedModel in try/except blocks.
  3. In modeling/_base.py, add checks for inject_fused_attention and inject_fused_mlp so that get_fused_attention_module and get_fused_mlp_module are only called when the right conditions are met. In particular, get_fused_mlp_module is not called unless use_triton is True. (A rough sketch of the pattern follows the list.)
  4. As a result, FusedLlamaMLPForQuantizedModel is not imported unless the user specifies both use_triton and inject_fused_mlp.
  5. The user therefore does not need Triton installed and can run the CUDA code path without errors.
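The overall shape of the change is roughly as in the sketch below. The class names (BaseGPTQForCausalLM, LlamaGPTQForCausalLM) and import paths are illustrative placeholders, not the exact code in this PR:

```python
# Sketch of the classmethod pattern described above; names and paths are placeholders.
from logging import getLogger

logger = getLogger(__name__)


class BaseGPTQForCausalLM:
    @classmethod
    def get_fused_attention_module(cls):
        logger.warning(f"{cls.__name__} does not support fused attention injection.")
        return None

    @classmethod
    def get_fused_mlp_module(cls):
        logger.warning(f"{cls.__name__} does not support fused MLP injection.")
        return None

    @classmethod
    def resolve_fused_modules(cls, use_triton, inject_fused_attention, inject_fused_mlp):
        # Only look up the fused modules when they were requested; the Triton-backed
        # fused MLP is never touched unless use_triton is True.
        fused_attn = cls.get_fused_attention_module() if inject_fused_attention else None
        fused_mlp = cls.get_fused_mlp_module() if (use_triton and inject_fused_mlp) else None
        return fused_attn, fused_mlp


class LlamaGPTQForCausalLM(BaseGPTQForCausalLM):
    @classmethod
    def get_fused_attention_module(cls):
        try:
            from auto_gptq.nn_modules.fused_llama_attn import FusedLlamaAttentionForQuantizedModel
            return FusedLlamaAttentionForQuantizedModel
        except ImportError:
            logger.warning("Fused attention module could not be imported.")
            return None

    @classmethod
    def get_fused_mlp_module(cls):
        try:
            # This import requires Triton, so it only happens on the Triton path.
            from auto_gptq.nn_modules.fused_llama_mlp import FusedLlamaMLPForQuantizedModel
            return FusedLlamaMLPForQuantizedModel
        except ImportError:
            logger.warning("Triton is not available; fused MLP injection is disabled.")
            return None
```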

Testing done:

  1. Inference with: CUDA, CUDA + FA, Triton, Triton + FA, Triton + FM, Triton + FA + FM (FA = fused attention, FM = fused MLP)
  2. Quantisation with CUDA

@TheBloke mentioned this pull request May 16, 2023
@LexSong (Contributor) commented May 18, 2023

This patch works. Thanks.

@PanQiWei (Collaborator)

Thank you @TheBloke for creating this PR and solving the problem users hit when trying to inject fused modules without Triton.

However, I think it would be better to fix this at the root cause rather than adding new functions as patches, which would make the code more complex.

In my opinion, a better design pattern for keeping things "automatic" is to set the relevant object to None, or to use a global flag to disable the feature when it is not supported in a given environment, as is done in flash-attention's block.py and text-generation-inference's layers.py.
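For reference, a minimal sketch of that flag-based approach (the import path and the maybe_inject_fused_mlp helper are hypothetical, not code from either of those repositories):

```python
# Hypothetical illustration of the module-level flag pattern; not taken from
# flash-attention, text-generation-inference, or AutoGPTQ.
from logging import getLogger

logger = getLogger(__name__)

try:
    # This import is the only place that requires Triton.
    from auto_gptq.nn_modules.fused_llama_mlp import FusedLlamaMLPForQuantizedModel
    TRITON_AVAILABLE = True
except ImportError:
    FusedLlamaMLPForQuantizedModel = None
    TRITON_AVAILABLE = False


def maybe_inject_fused_mlp(model, use_triton: bool, inject_fused_mlp: bool):
    """Inject the fused MLP only if it was requested and Triton is available."""
    if not (use_triton and inject_fused_mlp):
        return model
    if not TRITON_AVAILABLE:
        logger.warning("Triton is not installed; skipping fused MLP injection.")
        return model
    # ... perform the actual injection with FusedLlamaMLPForQuantizedModel here ...
    return model
```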

I will open a new PR that fixes the Triton problem in a different way, so this PR can be closed.

Once again, I really appreciate your contributions; they are truly making this project better and better. ❤️‍🔥

@PanQiWei (Collaborator)

I've opened #92 to fix the import error when Triton is not installed, and optimized the code so that Triton integration is more automatic.

@TheBloke (Contributor, Author)

Looks good! Will test it in a minute

@TheBloke closed this May 20, 2023