Intel® Neural Compressor supports advanced quantization technologies for large language models (LLMs), including SmoothQuant (SQ) and Weight-Only Quantization (WOQ),
and has verified a list of LLMs on the 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with PyTorch,
Intel® Extension for PyTorch, and Intel® Extension for Transformers.
This document publishes the specific recipes we achieved for popular LLMs, helping users quickly obtain an optimized LLM within 1% accuracy loss. A minimal SmoothQuant usage sketch follows the notes below.
Notes:
- The quantization algorithms are provided by Intel® Neural Compressor, and the evaluation functions are provided by Intel® Extension for Transformers.
- The model list is continuously updated; expect to find more LLMs in the future.
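
For reference, below is a minimal sketch of applying SmoothQuant INT8 quantization with the Intel® Neural Compressor 2.x post-training API. The model choice, calibration texts, and alpha value are illustrative placeholders rather than the tuned per-model recipes; consult the recipes linked below for validated settings.

```python
# Minimal SmoothQuant INT8 sketch (Intel Neural Compressor 2.x API).
# The model, calibration texts, and alpha below are illustrative only;
# the published recipes tune alpha (and other knobs) per model.
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor import PostTrainingQuantConfig, quantization

model_name = "EleutherAI/gpt-j-6b"  # any model from the table below
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tiny stand-in calibration set; the real recipes calibrate on a
# proper dataset rather than a couple of sentences.
texts = [
    "Quantization reduces model size and speeds up inference.",
    "SmoothQuant migrates activation outliers into the weights.",
]
samples = [tokenizer(t, return_tensors="pt")["input_ids"] for t in texts]

class CalibDataloader:
    """Iterable with a batch_size attribute, as INC expects."""
    batch_size = 1
    def __iter__(self):
        for input_ids in samples:
            yield input_ids

conf = PostTrainingQuantConfig(
    backend="ipex",  # run through Intel Extension for PyTorch
    recipes={
        "smooth_quant": True,
        "smooth_quant_args": {"alpha": 0.5},  # alpha is tuned per model
    },
)
q_model = quantization.fit(model, conf, calib_dataloader=CalibDataloader())
q_model.save("./saved_sq_int8")
```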
Models | SQ INT8 | WOQ INT8 | WOQ INT4 |
---|---|---|---|
EleutherAI/gpt-j-6b | ✔ | ✔ | ✔ |
facebook/opt-1.3b | ✔ | ✔ | ✔ |
facebook/opt-30b | ✔ | ✔ | ✔ |
meta-llama/Llama-2-7b-hf | ✔ | ✔ | ✔ |
meta-llama/Llama-2-13b-hf | ✔ | ✔ | ✔ |
meta-llama/Llama-2-70b-hf | ✔ | ✔ | ✔ |
tiiuae/falcon-7b | ✔ | ✔ | ✔ |
tiiuae/falcon-40b | ✔ | ✔ | ✔ |
baichuan-inc/Baichuan-13B-Chat | ✔ | ✔ | ✔ |
baichuan-inc/Baichuan2-13B-Chat | ✔ | ✔ | ✔ |
baichuan-inc/Baichuan2-7B-Chat | ✔ | ✔ | ✔ |
bigscience/bloom-1b7 | ✔ | ✔ | ✔ |
databricks/dolly-v2-12b | ✖ | ✔ | ✖ |
EleutherAI/gpt-neox-20b | ✖ | ✔ | ✔ |
mistralai/Mistral-7B-v0.1 | ✖ | ✔ | ✔ |
THUDM/chatglm2-6b | WIP | ✔ | ✔ |
THUDM/chatglm3-6b | WIP | ✔ | ✔ |
Detailed recipes can be found HERE. A weight-only quantization sketch follows the notes below.
Notes:
- This model list comes from IPEX (Intel® Extension for PyTorch).
- The WIP recipes will be published soon.
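
For comparison, here is a minimal weight-only quantization sketch using the same 2.x API. It uses RTN, the simplest data-free WOQ algorithm; the GPTQ and AutoRound variants in the table below additionally require a calibration dataloader, as in the SmoothQuant sketch above. The bit width, group size, and scheme are illustrative defaults, not tuned recipe values.

```python
# Minimal WOQ INT4 sketch (Intel Neural Compressor 2.x API) using RTN,
# which needs no calibration data. Swapping "algorithm" to "GPTQ" (plus
# a calib_dataloader in quantization.fit) matches the WOQ INT4 GPTQ
# column in the accuracy table below.
from transformers import AutoModelForCausalLM
from neural_compressor import PostTrainingQuantConfig, quantization

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # apply to every matched op type
            "weight": {
                "bits": 4,         # WOQ INT4; use 8 for WOQ INT8
                "group_size": 128,
                "scheme": "asym",
                "algorithm": "RTN",
            },
        },
    },
)
q_model = quantization.fit(model, conf)
q_model.save("./saved_woq_int4")
```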
Accuracy (ACC) is measured on the lambada_openai task; Ratio is the quantized ACC divided by the FP32 ACC.

Model | FP32 ACC | SQ INT8 ACC | Ratio | WOQ INT8 ACC | Ratio | WOQ INT4 GPTQ ACC | Ratio | WOQ INT4 AutoRound ACC | Ratio |
---|---|---|---|---|---|---|---|---|---|
baichuan-inc/Baichuan-13B-Chat | 67.57% | 68.23% | 1.0098 | 67.57% | 1.0000 | 67.84% | 1.0040 | NA | NA |
baichuan-inc/Baichuan2-13B-Chat | 71.51% | 70.89% | 0.9913 | 71.53% | 1.0003 | 71.76% | 1.0035 | NA | NA |
baichuan-inc/Baichuan2-7B-Chat | 67.67% | 67.96% | 1.0043 | 67.59% | 0.9988 | 67.24% | 0.9936 | 67.42% | 0.9963 |
bigscience/bloom-1b7 | 46.34% | 47.99% | 1.0356 | 46.38% | 1.0009 | 46.19% | 0.9968 | NA | NA |
databricks/dolly-v2-12b | 64.35% | NA | NA | 64.10% | 0.9961 | NA | NA | NA | NA |
EleutherAI/gpt-j-6b | 68.31% | 68.33% | 1.0003 | 68.23% | 0.9988 | 68.79% | 1.0070 | 68.43% | 1.0018 |
EleutherAI/gpt-neox-20b | 72.33% | NA | NA | 72.25% | 0.9989 | 71.96% | 0.9949 | NA | NA |
facebook/opt-1.3b | 57.89% | 57.54% | 0.9940 | 58.08% | 1.0033 | 58.57% | 1.0117 | NA | NA |
facebook/opt-30b | 71.49% | 71.51% | 1.0003 | 71.51% | 1.0003 | 71.82% | 1.0046 | 72.11% | 1.0087 |
meta-llama/Llama-2-13b-hf | 76.77% | 76.25% | 0.9932 | 76.75% | 0.9997 | 77.43% | 1.0086 | 76.75% | 0.9997 |
meta-llama/Llama-2-70b-hf | 79.64% | 79.55% | 0.9989 | 79.57% | 0.9991 | 80.09% | 1.0057 | 79.97% | 1.0041 |
meta-llama/Llama-2-7b-hf | 73.92% | 73.45% | 0.9936 | 73.96% | 1.0005 | 73.45% | 0.9936 | 73.49% | 0.9942 |
mistralai/Mistral-7B-v0.1 | 75.90% | NA | NA | 75.80% | 0.9987 | 76.13% | 1.0030 | 75.61% | 0.9962 |
THUDM/chatglm2-6b | 53.23% | NA | NA | 53.19% | 0.9992 | 52.77% | 0.9914 | 53.35% | 1.0023 |
THUDM/chatglm3-6b | 59.09% | NA | NA | 59.01% | 0.9986 | NA | NA | 58.61% | 0.9919 |
tiiuae/falcon-40b | 77.22% | 77.04% | 0.9977 | 77.22% | 1.0000 | 77.94% | 1.0093 | 78.79% | 1.0203 |
tiiuae/falcon-7b | 74.67% | 76.44% | 1.0237 | 74.77% | 1.0013 | 75.00% | 1.0044 | NA | NA |
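
As a rough guide to reproducing these numbers, the sketch below evaluates a model on lambada_openai with EleutherAI's lm-evaluation-harness, which the Intel® Extension for Transformers evaluation utilities build on, and computes a Ratio entry as quantized accuracy divided by FP32 accuracy. The harness version, result keys, and batch size are assumptions based on lm-eval 0.4.x, not the exact evaluation path used for the table.

```python
# Sketch of reproducing a Ratio entry: evaluate on lambada_openai and
# divide quantized accuracy by FP32 accuracy. Assumes lm-eval >= 0.4,
# whose results are keyed as results["lambada_openai"]["acc,none"].
import lm_eval

def lambada_acc(model_path: str) -> float:
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_path}",
        tasks=["lambada_openai"],
        batch_size=8,
    )
    return out["results"]["lambada_openai"]["acc,none"]

acc_fp32 = lambada_acc("facebook/opt-1.3b")
# For a saved quantized checkpoint, point at its directory instead
# (loading details depend on the quantization format used).
# Example from the table: 0.5754 / 0.5789 ~= 0.9940 for SQ INT8 opt-1.3b.
print(f"FP32 lambada_openai ACC = {acc_fp32:.4f}")
```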