🇨🇳Chinese | 🌐English | 📖Documentation | ❓Issues | 💬Discussions | ⚔️Arena




🤗 Hugging Face | 🤖 ModelScope | 🐿️ Machine Heart SOTA! Model | 🟣 wisemodel | 🤗 Online Demo

This project is developed based on Meta's newly released next-generation open-source large language model Llama-3 and is the third generation of the Chinese-LLaMA-Alpaca open-source LLM series (1st gen, 2nd gen). It open-sources the Llama-3-Chinese base model and the Llama-3-Chinese-Instruct instruction-tuned model. These models perform continual pre-training on the original Llama-3 with large-scale Chinese data and are fine-tuned on curated instruction data, further strengthening Chinese semantic understanding and instruction following, with significant performance gains over the second-generation models.

Main Content

  • 🚀 Open-source Llama-3-Chinese base model and Llama-3-Chinese-Instruct instruction model (v1, v2, v3)
  • 🚀 Released pre-training scripts and instruction fine-tuning scripts, allowing users to further train or fine-tune the model as needed
  • 🚀 Released alpaca_zh_51k, stem_zh_instruction, ruozhiba_gpt4 (4o/4T) instruction data
  • 🚀 Provides a tutorial for quickly quantizing and deploying large models locally using a personal computer's CPU/GPU
  • 🚀 Supports 🤗transformers, llama.cpp, text-generation-webui, vLLM, Ollama, and other tools in the Llama-3 ecosystem

Chinese Mixtral | Chinese LLaMA-2 & Alpaca-2 Large Models | Chinese LLaMA & Alpaca Large Models | Multimodal Chinese LLaMA & Alpaca Large Models | Multimodal VLE | Chinese MiniRBT | Chinese LERT | Chinese-English PERT | Chinese MacBERT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | Knowledge Distillation Tool TextBrewer | Model Pruning Tool TextPruner | Distillation and Pruning Integrated GRAIN

News

[2024/05/30] Released Llama-3-Chinese-8B-Instruct-v3, which performs better on downstream tasks than v1/v2. For details, see: 📚Version 3.0 Release Log

[2024/05/08] Released Llama-3-Chinese-8B-Instruct-v2, which is directly tuned from Meta-Llama-3-8B-Instruct with 5M instructions. For details, see: 📚Version 2.0 Release Log

[2024/05/07] Added pre-training and SFT scripts. For details, see: 📚Version 1.1 Release Log

[2024/04/30] Released the Llama-3-Chinese-8B base model and Llama-3-Chinese-8B-Instruct instruction model. For details, see: 📚Version 1.0 Release Log

[2024/04/19] 🚀 Officially launched the Chinese-LLaMA-Alpaca-3 project

Content Guide

| Section | Description |
| :--- | :--- |
| 💁🏻‍♂️ Model Introduction | Briefly introduces the technical features of the models in this project |
| ⏬ Model Download | Download links for the Chinese Llama-3 large models |
| 💻 Inference and Deployment | How to quantize the model and deploy it on a personal computer to experience the large model |
| 💯 Model Performance | The models' results on selected tasks |
| 📝 Training and Fine-Tuning | How to train and fine-tune the Chinese Llama-3 large models |
| ❓ Frequently Asked Questions | Answers to common questions |

Model Introduction

This project has launched the Chinese open-source large models Llama-3-Chinese and Llama-3-Chinese-Instruct based on Meta Llama-3. The main features are as follows:

📖 Uses the Original Llama-3 Vocabulary

  • Llama-3 has significantly expanded its vocabulary from 32K to 128K and switched to a tiktoken-based BPE tokenizer.
  • Preliminary encoding-efficiency tests on Wikipedia data show that the Llama-3 vocabulary's Chinese encoding efficiency is about 95% of that of our expanded vocabulary in Chinese LLaMA-2, i.e., roughly comparable.
  • Based on our experience and experimental conclusions with Chinese Mixtral [1], we did not expand the vocabulary further.

🚄 Extended Context Length from 4K in the Second Generation to 8K

  • Llama-3 has increased the native context window length from 4K to 8K, allowing for further processing of longer context information.
  • Users can also use methods like PI, NTK, and YaRN to extend the model's long-context capability to support longer text processing (a hedged sketch follows below).
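
As a rough illustration of the second point, 🤗transformers exposes a `rope_scaling` option for Llama-family models covering linear (PI-style) and dynamic NTK scaling. This is a minimal sketch, assuming the `hfl/llama-3-chinese-8b-instruct-v3` repo id; the scaling factor is illustrative, not a tuned recommendation, and config keys may vary across transformers versions.

```python
from transformers import AutoModelForCausalLM

# Minimal sketch: stretch the usable context beyond the native 8K via RoPE scaling.
# "linear" corresponds to PI-style scaling, "dynamic" to NTK-aware scaling.
# factor=2.0 (roughly 16K context) is illustrative, not a tuned recommendation.
model = AutoModelForCausalLM.from_pretrained(
    "hfl/llama-3-chinese-8b-instruct-v3",            # assumed Hugging Face repo id
    rope_scaling={"type": "dynamic", "factor": 2.0},
    torch_dtype="auto",
    device_map="auto",
)
```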

⚡ Uses Grouped Query Attention Mechanism

  • Llama-3 adopts the Grouped Query Attention (GQA) mechanism used in the large-parameter (70B) version of Llama-2, which further improves inference efficiency.

🗒 New Instruction Template

  • Llama-3-Instruct uses a new instruction template that is incompatible with Llama-2-chat; always follow the official instruction template when using it. (See Instruction Template)

Model Download

Model Selection Guide

Here's a comparison of the models in this project and recommended usage scenarios. For chat interactions, please choose the Instruct version.

| Comparison Item | Llama-3-Chinese-8B | Llama-3-Chinese-8B-Instruct |
| :--- | :--- | :--- |
| Model Type | Base model | Instruction/chat model (similar to ChatGPT) |
| Model Size | 8B | 8B |
| Training Type | Causal LM (CLM) | Instruction fine-tuning |
| Training Method | LoRA + full emb/lm-head | LoRA + full emb/lm-head |
| Initial Model | Meta-Llama-3-8B | v1: Llama-3-Chinese-8B<br/>v2: Meta-Llama-3-8B-Instruct<br/>v3: mix of inst/inst-v2/inst-meta |
| Training Corpus | Unlabeled general corpus (approx. 120 GB) | Labeled instruction data (approx. 5 million entries) |
| Vocabulary Size | Original vocabulary (128,256) | Original vocabulary (128,256) |
| Supported Context Length | 8K | 8K |
| Input Template | Not required | Requires the Llama-3-Instruct template |
| Applicable Scenarios | Text continuation: given a context, the model generates the following text | Instruction understanding: Q&A, writing, chatting, interaction, etc. |

Here is a comparison between different versions of Instruct. Unless there is a clear preference, please prioritize using the Instruct-v3 version.

| Comparison Item | Instruct-v1 | Instruct-v2 | Instruct-v3 |
| :--- | :--- | :--- | :--- |
| Release Date | 2024/4/30 | 2024/5/8 | 2024/5/30 |
| Base Model | Original Meta-Llama-3-8B | Original Meta-Llama-3-8B-Instruct | (see Training Method) |
| Training Method | Stage 1: pre-training with 120 GB Chinese corpus<br/>Stage 2: fine-tuning with 5 million instruction entries | Direct fine-tuning with 5 million instruction entries | Model merging of inst-v1, inst-v2, and inst-meta, followed by fine-tuning on a small amount of instruction data |
| Chinese Proficiency | 49.3 / 51.5 | 51.6 / 51.6 | 55.2 / 54.8 👍🏻 |
| English Proficiency | 63.21 | 66.68 | 66.81 👍🏻 |
| Long Text Capability | 29.6 | 46.4 👍🏻 | 40.5 |
| LLM Arena Win Rate / Elo | 49.4% / 1430 | 66.1% / 1559 | 83.6% / 1627 👍🏻 |

Note

Chinese proficiency results are from C-Eval (valid); English proficiency results are from Open LLM Leaderboard (avg); long text capability results are from LongBench (avg). For detailed performance, please refer to the 💯 Model Performance section.

Download Links

| Model Name | Full Version | LoRA Version | GGUF Version |
| :--- | :--- | :--- | :--- |
| Llama-3-Chinese-8B-Instruct-v3 (chat model) | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | N/A | [🤗Hugging Face] [🤖ModelScope] |
| Llama-3-Chinese-8B-Instruct-v2 (chat model) | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | [🤗Hugging Face] [🤖ModelScope] |
| Llama-3-Chinese-8B-Instruct (chat model) | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | [🤗Hugging Face] [🤖ModelScope] |
| Llama-3-Chinese-8B (base model) | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | [🤗Hugging Face] [🤖ModelScope] |

Model Type Description:

  • Full Model: Can be used directly for training and inference, no other merging steps required.
  • LoRA Model: Must be merged with the original base model to produce a full version (a generic merging sketch follows this list); merging steps: 💻 Model Merging Steps
  • GGUF Model: A quantized format released by llama.cpp, compatible with common inference tools such as ollama; recommended for users who only need inference and deployment. Models whose names carry the -im suffix are generated with an importance matrix and generally perform better.
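
For orientation, a generic 🤗peft merging flow looks like the sketch below. This is only an approximation: since these LoRA models also train the embedding and lm-head in full, the project's own merging script (linked above) should be used in practice. The LoRA path is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the original base model, attach the LoRA adapter, then fold the
# adapter deltas into the base weights to obtain a standalone full model.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "path/to/llama-3-chinese-8b-lora")  # placeholder path
merged = model.merge_and_unload()

merged.save_pretrained("llama-3-chinese-8b-full")
AutoTokenizer.from_pretrained("path/to/llama-3-chinese-8b-lora").save_pretrained("llama-3-chinese-8b-full")
```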

Note

If access to Hugging Face is blocked, consider using a mirror site (such as hf-mirror.com); please look up the specific usage on your own.

Inference and Deployment

The models in this project primarily support the following quantization, inference, and deployment methods. Please refer to the corresponding tutorials for detailed information.

| Tool | Features | Tutorial |
| :--- | :--- | :--- |
| llama.cpp | Rich GGUF quantization options and efficient local inference | [link] |
| 🤗transformers | Native transformers inference interface | [link] |
| OpenAI-compatible API server | Server demo with an interface similar to the OpenAI API | [link] |
| text-generation-webui | Front-end web UI deployment | [link] |
| LM Studio | Multi-platform chat software with a GUI | [link] |
| Ollama | Local large model inference | [link] |
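
Before diving into the linked tutorials, a minimal 🤗transformers chat sketch may help orient. It assumes the `hfl/llama-3-chinese-8b-instruct-v3` repo id (see the download table above for actual links); generation settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hfl/llama-3-chinese-8b-instruct-v3"  # assumed repo id; see the download table
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant. 你是一个乐于助人的助手。"},
    {"role": "user", "content": "你好"},
]
# apply_chat_template renders the Llama-3-Instruct template (see Instruction Template below)
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```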

Model Performance

To evaluate the effectiveness of the related models, this project conducted both generative performance evaluations and objective (NLU-type) performance evaluations, assessing the large models from different perspectives. We recommend that users test on tasks of interest and choose the models best suited to those tasks.

Generative Performance Evaluation

  • This project has launched an online model battle platform, modeled after the Fastchat Chatbot Arena, where users can browse and evaluate the quality of model responses. The battle platform provides metrics such as win rates and Elo scores, and allows viewing the win rates between different models. ⚔️ Model Arena: http://llm-arena.ymcui.com
  • The examples directory provides output samples of Llama-3-Chinese-8B-Instruct and Chinese-Mixtral-Instruct, scored by GPT-4-turbo: Llama-3-Chinese-8B-Instruct averaged 8.1 and Chinese-Mixtral-Instruct 7.8. 📄 Output Sample Comparison: examples
  • This project has joined the Machine Heart SOTA! Model platform; an online experience will be available later: https://sota.jiqizhixin.com/project/chinese-llama-alpaca-3

Objective Performance Evaluation

C-Eval

C-Eval is a comprehensive Chinese fundamental model evaluation suite, with its validation and test sets comprising 1.3K and 12.3K multiple-choice questions respectively, covering 52 subjects. For C-Eval inference code, please refer to this project: 📖GitHub Wiki

| Models | Valid (0-shot) | Valid (5-shot) | Test (0-shot) | Test (5-shot) |
| :--- | :---: | :---: | :---: | :---: |
| Llama-3-Chinese-8B-Instruct-v3 | 55.2 | 54.8 | 52.1 | 52.4 |
| Llama-3-Chinese-8B-Instruct-v2 | 51.6 | 51.6 | 49.7 | 49.8 |
| Llama-3-Chinese-8B-Instruct | 49.3 | 51.5 | 48.3 | 49.4 |
| Llama-3-Chinese-8B | 47.0 | 50.5 | 46.1 | 49.0 |
| Meta-Llama-3-8B-Instruct | 51.3 | 51.3 | 49.5 | 51.0 |
| Meta-Llama-3-8B | 49.3 | 51.2 | 46.1 | 49.4 |
| Chinese-Mixtral-Instruct (8x7B) | 51.7 | 55.0 | 50.0 | 51.5 |
| Chinese-Mixtral (8x7B) | 45.8 | 54.2 | 43.1 | 49.1 |
| Chinese-Alpaca-2-13B | 44.3 | 45.9 | 42.6 | 44.0 |
| Chinese-LLaMA-2-13B | 40.6 | 42.7 | 38.0 | 41.6 |

CMMLU

CMMLU is another comprehensive Chinese evaluation dataset specifically designed to assess language models' knowledge and reasoning capabilities in a Chinese context, covering topics from basic subjects to advanced professional levels, with a total of 11.5K multiple-choice questions. For CMMLU inference code, please refer to this project: 📖GitHub Wiki

| Models | Test (0-shot) | Test (5-shot) |
| :--- | :---: | :---: |
| Llama-3-Chinese-8B-Instruct-v3 | 54.4 | 54.8 |
| Llama-3-Chinese-8B-Instruct-v2 | 51.8 | 52.4 |
| Llama-3-Chinese-8B-Instruct | 49.7 | 51.5 |
| Llama-3-Chinese-8B | 48.0 | 50.9 |
| Meta-Llama-3-8B-Instruct | 53.0 | 53.5 |
| Meta-Llama-3-8B | 47.8 | 50.8 |
| Chinese-Mixtral-Instruct (8x7B) | 50.0 | 53.0 |
| Chinese-Mixtral (8x7B) | 42.5 | 51.0 |
| Chinese-Alpaca-2-13B | 43.2 | 45.5 |
| Chinese-LLaMA-2-13B | 38.9 | 42.5 |

MMLU

MMLU is an English evaluation dataset for assessing natural language understanding capabilities, one of the main datasets used today for evaluating large models' capabilities, with its validation and test sets comprising 1.5K and 14.1K multiple-choice questions respectively, covering 57 subjects. For MMLU inference code, please refer to this project: 📖GitHub Wiki

| Models | Valid (0-shot) | Valid (5-shot) | Test (0-shot) | Test (5-shot) |
| :--- | :---: | :---: | :---: | :---: |
| Llama-3-Chinese-8B-Instruct-v3 | 64.7 | 65.0 | 64.8 | 65.9 |
| Llama-3-Chinese-8B-Instruct-v2 | 62.1 | 63.9 | 62.6 | 63.7 |
| Llama-3-Chinese-8B-Instruct | 60.1 | 61.3 | 59.8 | 61.8 |
| Llama-3-Chinese-8B | 55.5 | 58.5 | 57.3 | 61.1 |
| Meta-Llama-3-8B-Instruct | 63.4 | 64.8 | 65.1 | 66.4 |
| Meta-Llama-3-8B | 58.6 | 62.5 | 60.5 | 65.0 |
| Chinese-Mixtral-Instruct (8x7B) | 65.1 | 69.6 | 67.5 | 69.8 |
| Chinese-Mixtral (8x7B) | 63.2 | 67.1 | 65.5 | 68.3 |
| Chinese-Alpaca-2-13B | 49.6 | 53.2 | 50.9 | 53.5 |
| Chinese-LLaMA-2-13B | 46.8 | 50.0 | 46.6 | 51.8 |

LongBench

LongBench is a benchmark for evaluating large models' long-text understanding capabilities, composed of 20 tasks across 6 categories. Most tasks have an average length of 5K-15K, and the benchmark totals approximately 4.75K test entries. Below are the evaluation results of this project's models on the Chinese tasks (including code tasks). For LongBench inference code, please refer to this project's 📖GitHub Wiki

| Models | Single-doc QA | Multi-doc QA | Summarization | Few-Shot Learning | Code | Synthesis | Average |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Llama-3-Chinese-8B-Instruct-v3 | 20.3 | 28.8 | 24.5 | 28.1 | 59.4 | 91.9 | 40.5 |
| Llama-3-Chinese-8B-Instruct-v2 | 57.3 | 27.1 | 13.9 | 30.3 | 60.6 | 89.5 | 46.4 |
| Llama-3-Chinese-8B-Instruct | 44.1 | 24.0 | 12.4 | 33.5 | 51.8 | 11.5 | 29.6 |
| Llama-3-Chinese-8B | 16.4 | 19.3 | 4.3 | 28.7 | 14.3 | 4.6 | 14.6 |
| Meta-Llama-3-8B-Instruct | 55.1 | 15.1 | 0.1 | 24.0 | 51.3 | 94.5 | 40.0 |
| Meta-Llama-3-8B | 21.2 | 22.9 | 2.7 | 35.8 | 65.9 | 40.8 | 31.6 |
| Chinese-Mixtral-Instruct (8x7B) | 50.3 | 34.2 | 16.4 | 42.0 | 56.1 | 89.5 | 48.1 |
| Chinese-Mixtral (8x7B) | 32.0 | 23.7 | 0.4 | 42.5 | 27.4 | 14.0 | 23.3 |
| Chinese-Alpaca-2-13B-16K | 47.9 | 26.7 | 13.0 | 22.3 | 46.6 | 21.5 | 29.7 |
| Chinese-LLaMA-2-13B-16K | 36.7 | 17.7 | 3.1 | 29.8 | 13.8 | 3.0 | 17.3 |
| Chinese-Alpaca-2-7B-64K | 44.7 | 28.1 | 14.4 | 39.0 | 44.6 | 5.0 | 29.3 |
| Chinese-LLaMA-2-7B-64K | 27.2 | 16.4 | 6.5 | 33.0 | 7.8 | 5.0 | 16.0 |

Open LLM Leaderboard

Open LLM Leaderboard is an English LLM benchmark maintained by the HuggingFaceH4 team, covering the ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K datasets. Below are the evaluation results of this project's models.

| Models | ARC | HellaS | MMLU | TQA | WinoG | GSM8K | Average |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Llama-3-Chinese-8B-Instruct-v3 | 63.40 | 80.51 | 67.90 | 53.57 | 76.24 | 59.21 | 66.81 |
| Llama-3-Chinese-8B-Instruct-v2 | 62.63 | 79.72 | 66.48 | 53.93 | 76.72 | 60.58 | 66.68 |
| Llama-3-Chinese-8B-Instruct | 61.26 | 80.24 | 63.10 | 55.15 | 75.06 | 44.43 | 63.21 |
| Llama-3-Chinese-8B | 55.88 | 79.53 | 63.70 | 41.14 | 77.03 | 37.98 | 59.21 |
| Meta-Llama-3-8B-Instruct | 60.75 | 78.55 | 67.07 | 51.65 | 74.51 | 68.69 | 66.87 |
| Meta-Llama-3-8B | 59.47 | 82.09 | 66.69 | 43.90 | 77.35 | 45.79 | 62.55 |
| Chinese-Mixtral-Instruct (8x7B) | 67.75 | 85.67 | 71.53 | 57.46 | 83.11 | 55.65 | 70.19 |
| Chinese-Mixtral (8x7B) | 67.58 | 85.34 | 70.38 | 46.86 | 82.00 | 0.00 | 58.69 |

Note: The MMLU results differ from those reported earlier in this README because the evaluation scripts differ.

Quantization Performance Evaluation

Using llama.cpp, we tested the quantized performance of Llama-3-Chinese-8B (the base model), as shown in the table below. The actual speed is slightly slower than that of the second-generation Llama-2-7B.

| | F16 | Q8_0 | Q6_K | Q5_K | Q5_0 | Q4_K | Q4_0 | Q3_K | Q2_K |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Size (GB) | 14.97 | 7.95 | 6.14 | 5.34 | 5.21 | 4.58 | 4.34 | 3.74 | 2.96 |
| BPW | 16.00 | 8.50 | 6.56 | 5.70 | 5.57 | 4.89 | 4.64 | 4.00 | 3.16 |
| PPL | 5.130 | 5.135 | 5.148 | 5.181 | 5.222 | 5.312 | 5.549 | 5.755 | 11.859 |
| PP Speed | 5.99 | 6.10 | 7.17 | 7.34 | 6.65 | 6.38 | 6.00 | 6.85 | 6.43 |
| TG Speed | 44.03 | 26.08 | 21.61 | 22.33 | 20.93 | 18.93 | 17.09 | 22.50 | 19.21 |

Note

  • Model size: in GB
  • BPW (bits per weight): average bits per parameter; for example, Q8_0's actual average precision is 8.50 (a sanity check follows this list)
  • PPL (perplexity): measured with an 8K context (the natively supported length); lower is better
  • PP/TG speed: prompt processing (PP) and text generation (TG) speeds on an Apple M3 Max (Metal), in ms/token; lower is faster
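
The BPW column can be sanity-checked from file size and parameter count. A minimal check, assuming the table's sizes are binary gigabytes (GiB) and Llama-3-8B's commonly cited ~8.03B parameters:

```python
# Sanity-check BPW: total bits in the file divided by the number of parameters.
# Assumptions: sizes are GiB, and n_params ≈ 8.03e9 (commonly cited for Llama-3-8B).
n_params = 8.03e9
sizes_gib = {"F16": 14.97, "Q8_0": 7.95, "Q4_0": 4.34, "Q2_K": 2.96}
for name, gib in sizes_gib.items():
    bpw = gib * 1024**3 * 8 / n_params
    print(f"{name}: {bpw:.2f} bits per weight")
# F16 ≈ 16.0, Q8_0 ≈ 8.5, Q4_0 ≈ 4.6, Q2_K ≈ 3.2 — consistent with the table
```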

Training and Fine-Tuning

Manual Training and Fine-Tuning

Instruction Template

Our Llama-3-Chinese-Instruct models adopt the original instruction template of Llama-3-Instruct. The following is a chat example.

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant. 你是一个乐于助人的助手。<|eot_id|><|start_header_id|>user<|end_header_id|>

你好<|eot_id|><|start_header_id|>assistant<|end_header_id|>

你好!有什么可以帮助你的吗?<|eot_id|>
```
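
In practice you rarely need to construct this string by hand; the tokenizer's built-in chat template should reproduce it. A small check, assuming the `hfl/llama-3-chinese-8b-instruct-v3` repo id:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/llama-3-chinese-8b-instruct-v3")  # assumed repo id
messages = [
    {"role": "system", "content": "You are a helpful assistant. 你是一个乐于助人的助手。"},
    {"role": "user", "content": "你好"},
]
# tokenize=False returns the raw prompt string; add_generation_prompt appends the
# trailing assistant header so the model knows it should answer next.
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```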

Data

Below is some of the instruction data open-sourced by this project. For more details, please see: 📚 Instruction Data

| Data Name | Description | Quantity |
| :--- | :--- | :--- |
| alpaca_zh_51k | Alpaca data translated using gpt-3.5 | 51K |
| stem_zh_instruction | STEM data scraped using gpt-3.5, including physics, chemistry, medicine, biology, and earth sciences | 256K |
| ruozhiba_gpt4 | ruozhiba Q&A data obtained using GPT-4o and GPT-4T | 2449 |

Frequently Asked Questions

Before submitting an issue, please check the FAQ to see whether a solution already exists. For specific questions and answers, refer to the project's 📖GitHub Wiki.

Question 1: Why is there no vocabulary expansion like in phases one and two?
Question 2: Will there be a 70B version released?
Question 3: Why is the instruction model no longer called Alpaca?
Question 4: Can the models from this repository be used commercially?
Question 5: Why not perform full pre-training instead of using LoRA?
Question 6: Why is the conversational performance of Llama-3-Chinese not good?
Question 7: Why does the instruction model reply saying it is ChatGPT?
Question 8: What are the differences between v1 and v2 of the Instruct model?

Disclaimer

This project is developed based on Meta's Llama-3 model. Please strictly adhere to the Llama-3 open-source license agreement when using it. When using third-party code, comply with the relevant open-source licenses. The accuracy of model-generated content may be affected by computational methods, random factors, and loss of quantization precision; therefore, this project makes no guarantees regarding the accuracy of model outputs and accepts no liability for losses arising from the use of the related resources and outputs. If the models are used for commercial purposes, developers must comply with local laws and regulations and ensure the legality of the model outputs. This project takes no responsibility for any products or services derived from it.

Feedback

If you have questions, please submit them in the GitHub Issues. Ask politely and help build a harmonious discussion community.

  • Before submitting an issue, check if the FAQ addresses your question and consider reviewing past issues that might solve your problem.
  • When submitting an issue, please use the project's issue template to help quickly identify specific problems.
  • Duplicate issues and those unrelated to this project will be handled by stale-bot; please understand.

Footnotes

  1. Cui and Yao, 2024. Rethinking LLM Language Adaptation: A Case Study on Chinese Mixtral