
4-bit Inference

Released by @TimDettmers on 12 Jul

Efficient 4-bit Inference (NF4, FP4)

This release adds efficient inference routines for batch size 1. Expected speedups vs 16-bit precision (fp16/bf16) for matrix multiplications with an inner product dimension of at least 4096 (LLaMA 7B) are listed below, followed by a rough timing sketch:

  • 2.2x for Turing (T4, RTX 2080, etc.)
  • 3.4x for Ampere (A100, A40, RTX 3090, etc.)
  • 4.0x for Ada/Hopper (H100, L40, RTX 4090, etc.)
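
As a rough way to check these numbers on your own hardware, here is a minimal timing sketch, assuming bitsandbytes is installed with CUDA support: it compares a batch-size-1 forward pass through an fp16 torch.nn.Linear against bnb.nn.Linear4bit at hidden size 4096 (the layer weights are initialized independently, since only timing matters here, and the measured speedup will vary by GPU).

```python
import time
import torch
import bitsandbytes as bnb

dim = 4096  # inner product dimension, comparable to LLaMA 7B's hidden size

# fp16 baseline layer and an NF4-quantized layer of the same shape.
fp16_layer = torch.nn.Linear(dim, dim, bias=False).half().cuda()
nf4_layer = bnb.nn.Linear4bit(dim, dim, bias=False,
                              compute_dtype=torch.float16,
                              quant_type="nf4").cuda()  # quantizes on .cuda()

x = torch.randn(1, dim, dtype=torch.float16, device="cuda")  # batch size 1

@torch.no_grad()
def bench(layer, iters=1000):
    for _ in range(10):  # warmup
        layer(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        layer(x)
    torch.cuda.synchronize()
    return (time.time() - start) / iters

print(f"fp16 linear: {bench(fp16_layer) * 1e6:.1f} us/iter")
print(f"nf4 linear : {bench(nf4_layer) * 1e6:.1f} us/iter")
```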

The inference kernels for batch size 1 are about 8x faster than the 4-bit training kernels used for QLoRA. This means you can take advantage of the new kernels by splitting a multi-batch 4-bit query into multiple requests with batch size 1.

No code changes are needed to take advantage of the new kernels as long as a batch size of 1 is used.
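
As a sketch of what batch-size-1 usage can look like, assuming the Hugging Face transformers integration (BitsAndBytesConfig with load_in_4bit) and an example model id, each prompt is issued as its own batch-size-1 request instead of batching the prompts together:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "huggyllama/llama-7b"  # example model; any causal LM works the same way

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # or "fp4"
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

prompts = ["The capital of France is", "4-bit inference is fast because"]

# Issue each prompt as its own batch-size-1 request instead of batching them,
# so generation runs through the fast 4-bit inference kernels.
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```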

Big thanks to @crowsonkb, @Birch-san, and @sekstini for beta testing and helping to debug early errors.

Changelog

Features:

  • Added 4-bit inference kernels for batch size=1. Currently supported data types are NF4 and FP4.
  • Added support for quantization of bfloat16 input data (see the sketch after this list).
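
A minimal sketch of the new data types and bfloat16 support, assuming the bitsandbytes.functional helpers quantize_4bit / dequantize_4bit and a CUDA device: a bfloat16 tensor is quantized to NF4 and dequantized again (swap quant_type="fp4" for the FP4 format).

```python
import torch
import bitsandbytes.functional as F

# Quantize a bfloat16 weight matrix to 4-bit NF4 (blockwise quantization).
w = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")
w_4bit, quant_state = F.quantize_4bit(w, quant_type="nf4")  # packed, two 4-bit values per byte

# Dequantize for comparison; the packed 4-bit tensor is what the inference kernels consume.
w_restored = F.dequantize_4bit(w_4bit, quant_state, quant_type="nf4")

print(w_4bit.dtype, w_4bit.numel())                    # uint8 storage, half the element count
print((w.float() - w_restored.float()).abs().mean())   # mean absolute quantization error
```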

Bug fixes:

  • Added a device attribute to bitsandbytes layers for compatibility with PyTorch layers.

Deprecated:

  • Binaries for CUDA 11.2 and 11.6 no longer ship with pip install bitsandbytes and must be compiled from source.