
Release 1.2.0

@shiyingjin released this 27 Dec 07:39

Feature List

We forked the vLLM repository and added new features to accelerate LLM inference (a usage sketch follows the list):

  • Support int8 inference.
  • Support int4 inference, with a throughput increase of 1.9–4.0x compared to the FP16 model.
  • Support FP8 KV cache, which not only simplifies the quantization and dequantization operations but also requires no additional GPU memory for storing scales. Throughput can reach up to 1.54x compared to running with this feature disabled.
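
The sketch below shows how these options might be enabled through vLLM's offline `LLM` API. The parameter names follow upstream vLLM conventions, and the specific values `"int8"` and `"fp8"` are assumptions; this fork may expose its int8/int4 and FP8 KV cache support under different flag names, so treat this as a minimal illustration rather than the release's exact API.

```python
from vllm import LLM, SamplingParams

# Minimal sketch, assuming the fork reuses upstream vLLM's engine arguments.
llm = LLM(
    model="facebook/opt-125m",   # any supported model path or HF id
    quantization="int8",         # assumed value; the fork may name its int8/int4 modes differently
    kv_cache_dtype="fp8",        # FP8 KV cache, as described in the feature list above
)

sampling = SamplingParams(temperature=0.8, max_tokens=64)

# Generate with the quantized weights and FP8 KV cache active.
outputs = llm.generate(["Explain KV cache quantization briefly."], sampling)
for out in outputs:
    print(out.outputs[0].text)
```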