
Squeezing out faster inference on a 3090? Is CUDA_USE_TENSOR_CORES something I can compile for? #8422

wwoodsTM asked this question in Q&A
Answered by ggerganov


Try to quantize the KV cache and enable Flash Attention:

-ctk q8_0 -ctv q8_0 -fa 1

This should give you some room for extra layers on the GPU.
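For reference, -ctk/--cache-type-k and -ctv/--cache-type-v set the quantization type used for the K and V cache, and -fa enables Flash Attention. Below is a minimal sketch of how these flags might be combined in a full server run, assuming a recent build where the binary is named llama-server; the model path, context size (-c), and GPU layer count (-ngl) are illustrative placeholders, not values from this thread.

    # Placeholder model path, -c and -ngl values; adjust for your model and VRAM.
    ./llama-server -m ./models/model-Q4_K_M.gguf \
        -ngl 99 \
        -c 8192 \
        -ctk q8_0 -ctv q8_0 -fa 1

Storing the K/V cache as q8_0 instead of the default f16 roughly halves its VRAM footprint, which is where the "room for extra layers" comes from: the saved memory lets you raise -ngl and offload more of the model onto the 3090.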

Answer selected by wwoodsTM