Intel® Neural Compressor supports advanced quantization technologies for large language models (LLMs), including SmoothQuant (SQ) and Weight-Only Quantization (WOQ),
and has verified a list of LLMs on the 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with PyTorch,
Intel® Extension for PyTorch, and Intel® Extension for Transformers.
This document publishes the specific recipes we achieved for popular LLMs, helping users quickly obtain an optimized LLM within 1% accuracy loss. A minimal SmoothQuant usage sketch follows the notes below.
Notes:
- The quantization algorithms are provided by Intel® Neural Compressor, and the evaluation functions are provided by Intel® Extension for Transformers.
- The model list is continuously updated; expect to find more LLMs in the future.
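
For reference, below is a minimal sketch of applying SmoothQuant INT8 quantization with the Intel® Neural Compressor 2.x post-training API. The model choice, calibration texts, and alpha value are illustrative placeholders rather than the tuned per-model recipes; consult the recipes linked below for validated settings.

```python
# Minimal SmoothQuant INT8 sketch (Intel Neural Compressor 2.x API).
# The model, calibration texts, and alpha below are illustrative only;
# the published recipes tune alpha (and other knobs) per model.
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor import PostTrainingQuantConfig, quantization

model_name = "EleutherAI/gpt-j-6b"  # any model from the table below
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tiny stand-in calibration set; the real recipes calibrate on a
# proper dataset rather than a couple of sentences.
texts = [
    "Quantization reduces model size and speeds up inference.",
    "SmoothQuant migrates activation outliers into the weights.",
]
samples = [tokenizer(t, return_tensors="pt")["input_ids"] for t in texts]

class CalibDataloader:
    """Iterable with a batch_size attribute, as INC expects."""
    batch_size = 1
    def __iter__(self):
        for input_ids in samples:
            yield input_ids

conf = PostTrainingQuantConfig(
    backend="ipex",  # run through Intel Extension for PyTorch
    recipes={
        "smooth_quant": True,
        "smooth_quant_args": {"alpha": 0.5},  # alpha is tuned per model
    },
)
q_model = quantization.fit(model, conf, calib_dataloader=CalibDataloader())
q_model.save("./saved_sq_int8")
```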
Models | SQ INT8 | WOQ INT8 | WOQ INT4 |
---|---|---|---|
EleutherAI/gpt-j-6b | ✔ | ✔ | ✔ |
facebook/opt-1.3b | ✔ | ✔ | ✔ |
facebook/opt-30b | ✔ | ✔ | ✔ |
meta-llama/Llama-2-7b-hf | ✔ | ✔ | ✔ |
meta-llama/Llama-2-13b-hf | ✔ | ✔ | ✔ |
meta-llama/Llama-2-70b-hf | ✔ | ✔ | ✔ |
tiiuae/falcon-7b | ✔ | ✔ | ✔ |
tiiuae/falcon-40b | ✔ | ✔ | ✔ |
baichuan-inc/Baichuan-13B-Chat | ✔ | ✔ | ✔ |
baichuan-inc/Baichuan2-13B-Chat | ✔ | ✔ | ✔ |
baichuan-inc/Baichuan2-7B-Chat | ✔ | ✔ | ✔ |
bigscience/bloom-1b7 | ✔ | ✔ | ✔ |
databricks/dolly-v2-12b | ✖ | ✔ | ✖ |
EleutherAI/gpt-neox-20b | ✖ | ✔ | ✔ |
mistralai/Mistral-7B-v0.1 | ✖ | ✔ | ✔ |
THUDM/chatglm2-6b | WIP | ✔ | ✔ |
THUDM/chatglm3-6b | WIP | ✔ | ✔ |
Detailed recipes can be found HERE. A weight-only quantization sketch follows the notes below.
Notes:
- This model list comes from IPEX (Intel® Extension for PyTorch).
- The WIP recipes will be published soon.
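
For comparison, here is a minimal weight-only quantization sketch using the same 2.x API. It uses RTN, the simplest data-free WOQ algorithm; the GPTQ and AutoRound variants in the table below additionally require a calibration dataloader, as in the SmoothQuant sketch above. The bit width, group size, and scheme are illustrative defaults, not tuned recipe values.

```python
# Minimal WOQ INT4 sketch (Intel Neural Compressor 2.x API) using RTN,
# which needs no calibration data. Swapping "algorithm" to "GPTQ" (plus
# a calib_dataloader in quantization.fit) matches the WOQ INT4 GPTQ
# column in the accuracy table below.
from transformers import AutoModelForCausalLM
from neural_compressor import PostTrainingQuantConfig, quantization

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # apply to every matched op type
            "weight": {
                "bits": 4,         # WOQ INT4; use 8 for WOQ INT8
                "group_size": 128,
                "scheme": "asym",
                "algorithm": "RTN",
            },
        },
    },
)
q_model = quantization.fit(model, conf)
q_model.save("./saved_woq_int4")
```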
Accuracy (ACC) is measured on the lambada_openai task; Ratio is the quantized ACC divided by the FP32 ACC.

Model | FP32 ACC | SQ INT8 ACC | Ratio | WOQ INT8 ACC | Ratio | WOQ INT4 GPTQ ACC | Ratio | WOQ INT4 AutoRound ACC | Ratio |
---|---|---|---|---|---|---|---|---|---|
baichuan-inc/Baichuan-13B-Chat | 67.57% | 68.23% | 1.0098 | 67.57% | 1.0000 | 67.84% | 1.0040 | NA | NA |
baichuan-inc/Baichuan2-13B-Chat | 71.51% | 70.89% | 0.9913 | 71.53% | 1.0003 | 71.76% | 1.0035 | NA | NA |
baichuan-inc/Baichuan2-7B-Chat | 67.67% | 67.96% | 1.0043 | 67.59% | 0.9988 | 67.24% | 0.9936 | 67.42% | 0.9963 |
bigscience/bloom-1b7 | 46.34% | 47.99% | 1.0356 | 46.38% | 1.0009 | 46.19% | 0.9968 | NA | NA |
databricks/dolly-v2-12b | 64.35% | NA | NA | 64.10% | 0.9961 | NA | NA | NA | NA |
EleutherAI/gpt-j-6b | 68.31% | 68.33% | 1.0003 | 68.23% | 0.9988 | 68.79% | 1.0070 | 68.43% | 1.0018 |
EleutherAI/gpt-neox-20b | 72.33% | NA | NA | 72.25% | 0.9989 | 71.96% | 0.9949 | NA | NA |
facebook/opt-1.3b | 57.89% | 57.54% | 0.9940 | 58.08% | 1.0033 | 58.57% | 1.0117 | NA | NA |
facebook/opt-30b | 71.49% | 71.51% | 1.0003 | 71.51% | 1.0003 | 71.82% | 1.0046 | 72.11% | 1.0087 |
meta-llama/Llama-2-13b-hf | 76.77% | 76.25% | 0.9932 | 76.75% | 0.9997 | 77.43% | 1.0086 | 76.75% | 0.9997 |
meta-llama/Llama-2-70b-hf | 79.64% | 79.55% | 0.9989 | 79.57% | 0.9991 | 80.09% | 1.0057 | 79.97% | 1.0041 |
meta-llama/Llama-2-7b-hf | 73.92% | 73.45% | 0.9936 | 73.96% | 1.0005 | 73.45% | 0.9936 | 73.49% | 0.9942 |
mistralai/Mistral-7B-v0.1 | 75.90% | NA | NA | 75.80% | 0.9987 | 76.13% | 1.0030 | 75.61% | 0.9962 |
THUDM/chatglm2-6b | 53.23% | NA | NA | 53.19% | 0.9992 | 52.77% | 0.9914 | 53.35% | 1.0023 |
THUDM/chatglm3-6b | 59.09% | NA | NA | 59.01% | 0.9986 | NA | NA | 58.61% | 0.9919 |
tiiuae/falcon-40b | 77.22% | 77.04% | 0.9977 | 77.22% | 1.0000 | 77.94% | 1.0093 | 78.79% | 1.0203 |
tiiuae/falcon-7b | 74.67% | 76.44% | 1.0237 | 74.77% | 1.0013 | 75.00% | 1.0044 | NA | NA |
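
As a rough guide to reproducing these numbers, the sketch below evaluates a model on lambada_openai with EleutherAI's lm-evaluation-harness, which the Intel® Extension for Transformers evaluation utilities build on, and computes a Ratio entry as quantized accuracy divided by FP32 accuracy. The harness version, result keys, and batch size are assumptions based on lm-eval 0.4.x, not the exact evaluation path used for the table.

```python
# Sketch of reproducing a Ratio entry: evaluate on lambada_openai and
# divide quantized accuracy by FP32 accuracy. Assumes lm-eval >= 0.4,
# whose results are keyed as results["lambada_openai"]["acc,none"].
import lm_eval

def lambada_acc(model_path: str) -> float:
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_path}",
        tasks=["lambada_openai"],
        batch_size=8,
    )
    return out["results"]["lambada_openai"]["acc,none"]

acc_fp32 = lambada_acc("facebook/opt-1.3b")
# For a saved quantized checkpoint, point at its directory instead
# (loading details depend on the quantization format used).
# Example from the table: 0.5754 / 0.5789 ~= 0.9940 for SQ INT8 opt-1.3b.
print(f"FP32 lambada_openai ACC = {acc_fp32:.4f}")
```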