
4-bit Inference

Released by @TimDettmers on 12 Jul

Efficient 4-bit Inference (NF4, FP4)

This release adds efficient inference routines for batch size 1. Expected speedups vs 16-bit precision (fp16/bf16) for matrix multiplications with an inner product dimension of at least 4096 (LLaMA 7B) are listed below, followed by a rough timing sketch:

  • 2.2x for Turing (T4, RTX 2080, etc.)
  • 3.4x for Ampere (A100, A40, RTX 3090, etc.)
  • 4.0x for Ada/Hopper (H100, L40, RTX 4090, etc.)
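
As a rough way to check these numbers on your own hardware, here is a minimal timing sketch, assuming bitsandbytes is installed with CUDA support: it compares a batch-size-1 forward pass through an fp16 torch.nn.Linear against bnb.nn.Linear4bit at hidden size 4096 (the layer weights are initialized independently, since only timing matters here, and the measured speedup will vary by GPU).

```python
import time
import torch
import bitsandbytes as bnb

dim = 4096  # inner product dimension, comparable to LLaMA 7B's hidden size

# fp16 baseline layer and an NF4-quantized layer of the same shape.
fp16_layer = torch.nn.Linear(dim, dim, bias=False).half().cuda()
nf4_layer = bnb.nn.Linear4bit(dim, dim, bias=False,
                              compute_dtype=torch.float16,
                              quant_type="nf4").cuda()  # quantizes on .cuda()

x = torch.randn(1, dim, dtype=torch.float16, device="cuda")  # batch size 1

@torch.no_grad()
def bench(layer, iters=1000):
    for _ in range(10):  # warmup
        layer(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        layer(x)
    torch.cuda.synchronize()
    return (time.time() - start) / iters

print(f"fp16 linear: {bench(fp16_layer) * 1e6:.1f} us/iter")
print(f"nf4 linear : {bench(nf4_layer) * 1e6:.1f} us/iter")
```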

The inference kernels for batch size 1 are about 8x faster than the 4-bit training kernels used for QLoRA. This means you can take advantage of the new kernels by splitting a multi-batch 4-bit query into multiple requests with batch size 1.

No code changes are needed to take advantage of the new kernels as long as a batch size of 1 is used.
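
As a sketch of what batch-size-1 usage can look like, assuming the Hugging Face transformers integration (BitsAndBytesConfig with load_in_4bit) and an example model id, each prompt is issued as its own batch-size-1 request instead of batching the prompts together:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "huggyllama/llama-7b"  # example model; any causal LM works the same way

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # or "fp4"
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

prompts = ["The capital of France is", "4-bit inference is fast because"]

# Issue each prompt as its own batch-size-1 request instead of batching them,
# so generation runs through the fast 4-bit inference kernels.
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```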

Big thanks to @crowsonkb, @Birch-san, and @sekstini for beta testing and helping to debug early errors.

Changelog

Features:

  • Added 4-bit inference kernels for batch size=1. Currently supported data types are NF4 and FP4.
  • Added support for quantization of bfloat16 input data (see the sketch after this list).
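
A minimal sketch of the new data types and bfloat16 support, assuming the bitsandbytes.functional helpers quantize_4bit / dequantize_4bit and a CUDA device: a bfloat16 tensor is quantized to NF4 and dequantized again (swap quant_type="fp4" for the FP4 format).

```python
import torch
import bitsandbytes.functional as F

# Quantize a bfloat16 weight matrix to 4-bit NF4 (blockwise quantization).
w = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")
w_4bit, quant_state = F.quantize_4bit(w, quant_type="nf4")  # packed, two 4-bit values per byte

# Dequantize for comparison; the packed 4-bit tensor is what the inference kernels consume.
w_restored = F.dequantize_4bit(w_4bit, quant_state, quant_type="nf4")

print(w_4bit.dtype, w_4bit.numel())                    # uint8 storage, half the element count
print((w.float() - w_restored.float()).abs().mean())   # mean absolute quantization error
```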

Bug fixes:

  • Added a device attribute to bitsandbytes layers for compatibility with PyTorch layers.

Deprecated:

  • Binaries for CUDA 11.2 and 11.6 no longer ship with pip install bitsandbytes and must be compiled from source.