🇨🇳Chinese | 🌐English | 📖Documentation | ❓Issues | 💬Discussions | ⚔️Arena




🤗 Hugging Face | 🤖 ModelScope | 🐿️ Machine Heart SOTA! Model | 🟣 wisemodel | 🤗 Online Demo

This project is developed based on Meta's newly released next-generation open-source large language model Llama-3 and is the third generation of the Chinese-LLaMA-Alpaca open-source LLM series (1st gen, 2nd gen). It open-sources the Llama-3-Chinese base model and the Llama-3-Chinese-Instruct instruction-tuned model. These models perform continual pre-training on the original Llama-3 with large-scale Chinese data and are fine-tuned on curated instruction data, further strengthening Chinese semantic understanding and instruction following, with significant performance gains over the second-generation models.

Main Content

  • 🚀 Open-source Llama-3-Chinese base model and Llama-3-Chinese-Instruct instruction model (v1, v2, v3)
  • 🚀 Released pre-training scripts and instruction fine-tuning scripts, allowing users to further train or fine-tune the model as needed
  • 🚀 Released alpaca_zh_51k, stem_zh_instruction, ruozhiba_gpt4 (4o/4T) instruction data
  • 🚀 Provides a tutorial for quickly quantizing and deploying large models locally using a personal computer's CPU/GPU
  • 🚀 Supports 🤗transformers, llama.cpp, text-generation-webui, vLLM, Ollama, and other tools in the Llama-3 ecosystem

Chinese Mixtral | Chinese LLaMA-2 & Alpaca-2 Large Models | Chinese LLaMA & Alpaca Large Models | Multimodal Chinese LLaMA & Alpaca Large Models | Multimodal VLE | Chinese MiniRBT | Chinese LERT | Chinese-English PERT | Chinese MacBERT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | Knowledge Distillation Tool TextBrewer | Model Pruning Tool TextPruner | Distillation and Pruning Integrated GRAIN

News

[2024/05/30] Released Llama-3-Chinese-8B-Instruct-v3, which performs better on downstream tasks than v1/v2. For details, see: 📚Version 3.0 Release Log

[2024/05/08] Released Llama-3-Chinese-8B-Instruct-v2, which is directly tuned from Meta-Llama-3-8B-Instruct with 5M instructions. For details, see: 📚Version 2.0 Release Log

[2024/05/07] Added pre-training and SFT scripts. For details, see: 📚Version 1.1 Release Log

[2024/04/30] Released the Llama-3-Chinese-8B base model and Llama-3-Chinese-8B-Instruct instruction model. For details, see: 📚Version 1.0 Release Log

[2024/04/19] 🚀 Officially launched the Chinese-LLaMA-Alpaca-3 project

Content Guide

| Section | Description |
| :--- | :--- |
| 💁🏻‍♂️ Model Introduction | Briefly introduces the technical features of the models in this project |
| ⏬ Model Download | Download links for the Chinese Llama-3 large models |
| 💻 Inference and Deployment | How to quantize the model and deploy it on a personal computer to experience the large model |
| 💯 Model Performance | The models' results on selected tasks |
| 📝 Training and Fine-Tuning | How to train and fine-tune the Chinese Llama-3 large models |
| ❓ Frequently Asked Questions | Answers to common questions |

Model Introduction

This project has launched the Chinese open-source large models Llama-3-Chinese and Llama-3-Chinese-Instruct based on Meta Llama-3. The main features are as follows:

📖 Uses the Original Llama-3 Vocabulary

  • Llama-3 has significantly expanded its vocabulary from 32K to 128K and switched to a tiktoken-based BPE tokenizer.
  • Preliminary encoding-efficiency tests on Wikipedia data show that the Llama-3 vocabulary's Chinese encoding efficiency is about 95% of that of our expanded vocabulary in Chinese LLaMA-2, i.e., roughly comparable.
  • Based on our experience and experimental conclusions with Chinese Mixtral [1], we did not expand the vocabulary further.

🚄 Extended Context Length from 4K in the Second Generation to 8K

  • Llama-3 has increased the native context window length from 4K to 8K, allowing for further processing of longer context information.
  • Users can also use methods like PI, NTK, and YaRN to extend the model's long-context capability to support longer text processing (a hedged sketch follows below).
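
As a rough illustration of the second point, 🤗transformers exposes a `rope_scaling` option for Llama-family models covering linear (PI-style) and dynamic NTK scaling. This is a minimal sketch, assuming the `hfl/llama-3-chinese-8b-instruct-v3` repo id; the scaling factor is illustrative, not a tuned recommendation, and config keys may vary across transformers versions.

```python
from transformers import AutoModelForCausalLM

# Minimal sketch: stretch the usable context beyond the native 8K via RoPE scaling.
# "linear" corresponds to PI-style scaling, "dynamic" to NTK-aware scaling.
# factor=2.0 (roughly 16K context) is illustrative, not a tuned recommendation.
model = AutoModelForCausalLM.from_pretrained(
    "hfl/llama-3-chinese-8b-instruct-v3",            # assumed Hugging Face repo id
    rope_scaling={"type": "dynamic", "factor": 2.0},
    torch_dtype="auto",
    device_map="auto",
)
```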

⚡ Uses Grouped Query Attention Mechanism

  • Llama-3 adopts the Grouped Query Attention (GQA) mechanism used in the large-parameter (70B) version of Llama-2, which further improves inference efficiency.

🗒 New Instruction Template

  • Llama-3-Instruct uses a new instruction template that is incompatible with Llama-2-chat; always follow the official instruction template when using it. (See Instruction Template)

Model Download

Model Selection Guide

Here's a comparison of the models in this project and recommended usage scenarios. For chat interactions, please choose the Instruct version.

| Comparison Item | Llama-3-Chinese-8B | Llama-3-Chinese-8B-Instruct |
| :--- | :--- | :--- |
| Model Type | Base model | Instruction/chat model (similar to ChatGPT) |
| Model Size | 8B | 8B |
| Training Type | Causal LM (CLM) | Instruction fine-tuning |
| Training Method | LoRA + full emb/lm-head | LoRA + full emb/lm-head |
| Initial Model | Meta-Llama-3-8B | v1: Llama-3-Chinese-8B<br/>v2: Meta-Llama-3-8B-Instruct<br/>v3: mix of inst/inst-v2/inst-meta |
| Training Corpus | Unlabeled general corpus (approx. 120 GB) | Labeled instruction data (approx. 5 million entries) |
| Vocabulary Size | Original vocabulary (128,256) | Original vocabulary (128,256) |
| Supported Context Length | 8K | 8K |
| Input Template | Not required | Requires the Llama-3-Instruct template |
| Applicable Scenarios | Text continuation: given a context, the model generates the following text | Instruction understanding: Q&A, writing, chatting, interaction, etc. |

Here is a comparison between different versions of Instruct. Unless there is a clear preference, please prioritize using the Instruct-v3 version.

| Comparison Item | Instruct-v1 | Instruct-v2 | Instruct-v3 |
| :--- | :--- | :--- | :--- |
| Release Date | 2024/4/30 | 2024/5/8 | 2024/5/30 |
| Base Model | Original Meta-Llama-3-8B | Original Meta-Llama-3-8B-Instruct | (see Training Method) |
| Training Method | Stage 1: pre-training with 120 GB Chinese corpus<br/>Stage 2: fine-tuning with 5 million instruction entries | Direct fine-tuning with 5 million instruction entries | Model merging of inst-v1, inst-v2, and inst-meta, followed by fine-tuning on a small amount of instruction data |
| Chinese Proficiency | 49.3 / 51.5 | 51.6 / 51.6 | 55.2 / 54.8 👍🏻 |
| English Proficiency | 63.21 | 66.68 | 66.81 👍🏻 |
| Long Text Capability | 29.6 | 46.4 👍🏻 | 40.5 |
| LLM Arena Win Rate / Elo | 49.4% / 1430 | 66.1% / 1559 | 83.6% / 1627 👍🏻 |

Note

Chinese proficiency results are from C-Eval (valid); English proficiency results are from Open LLM Leaderboard (avg); long text capability results are from LongBench (avg). For detailed performance, please refer to the 💯 Model Performance section.

Download Links

| Model Name | Full Version | LoRA Version | GGUF Version |
| :--- | :--- | :--- | :--- |
| Llama-3-Chinese-8B-Instruct-v3 (chat model) | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | N/A | [🤗Hugging Face] [🤖ModelScope] |
| Llama-3-Chinese-8B-Instruct-v2 (chat model) | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | [🤗Hugging Face] [🤖ModelScope] |
| Llama-3-Chinese-8B-Instruct (chat model) | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | [🤗Hugging Face] [🤖ModelScope] |
| Llama-3-Chinese-8B (base model) | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | [🤗Hugging Face] [🤖ModelScope] [🟣wisemodel] | [🤗Hugging Face] [🤖ModelScope] |

Model Type Description:

  • Full Model: Can be used directly for training and inference, no other merging steps required.
  • LoRA Model: Must be merged with the original base model to produce a full version (a generic merging sketch follows this list); merging steps: 💻 Model Merging Steps
  • GGUF Model: A quantized format released by llama.cpp, compatible with common inference tools such as ollama; recommended for users who only need inference and deployment. Models whose names carry the -im suffix are generated with an importance matrix and generally perform better.
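
For orientation, a generic 🤗peft merging flow looks like the sketch below. This is only an approximation: since these LoRA models also train the embedding and lm-head in full, the project's own merging script (linked above) should be used in practice. The LoRA path is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the original base model, attach the LoRA adapter, then fold the
# adapter deltas into the base weights to obtain a standalone full model.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "path/to/llama-3-chinese-8b-lora")  # placeholder path
merged = model.merge_and_unload()

merged.save_pretrained("llama-3-chinese-8b-full")
AutoTokenizer.from_pretrained("path/to/llama-3-chinese-8b-lora").save_pretrained("llama-3-chinese-8b-full")
```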

Note

If access to Hugging Face is blocked, consider using a mirror site (such as hf-mirror.com); please look up the specific usage on your own.

Inference and Deployment

The models in this project primarily support the following quantization, inference, and deployment methods. Please refer to the corresponding tutorials for detailed information.

| Tool | Features | Tutorial |
| :--- | :--- | :--- |
| llama.cpp | Rich GGUF quantization options and efficient local inference | [link] |
| 🤗transformers | Native transformers inference interface | [link] |
| OpenAI-compatible API server | Server demo with an interface similar to the OpenAI API | [link] |
| text-generation-webui | Front-end web UI deployment | [link] |
| LM Studio | Multi-platform chat software with a GUI | [link] |
| Ollama | Local large model inference | [link] |
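
Before diving into the linked tutorials, a minimal 🤗transformers chat sketch may help orient. It assumes the `hfl/llama-3-chinese-8b-instruct-v3` repo id (see the download table above for actual links); generation settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hfl/llama-3-chinese-8b-instruct-v3"  # assumed repo id; see the download table
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant. 你是一个乐于助人的助手。"},
    {"role": "user", "content": "你好"},
]
# apply_chat_template renders the Llama-3-Instruct template (see Instruction Template below)
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```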

Model Performance

To evaluate the effectiveness of the related models, this project conducted both generative performance evaluations and objective (NLU-type) performance evaluations, assessing the large models from different perspectives. We recommend that users test on tasks of interest and choose the models best suited to those tasks.

Generative Performance Evaluation

  • This project has launched an online model battle platform, modeled after the Fastchat Chatbot Arena, where users can browse and evaluate the quality of model responses. The battle platform provides metrics such as win rates and Elo scores, and allows viewing the win rates between different models. ⚔️ Model Arena: http://llm-arena.ymcui.com
  • The examples directory provides output samples of Llama-3-Chinese-8B-Instruct and Chinese-Mixtral-Instruct, scored by GPT-4-turbo: Llama-3-Chinese-8B-Instruct averaged 8.1 and Chinese-Mixtral-Instruct 7.8. 📄 Output Sample Comparison: examples
  • This project has joined the Machine Heart SOTA! Model platform; an online experience will be available later: https://sota.jiqizhixin.com/project/chinese-llama-alpaca-3

Objective Performance Evaluation

C-Eval

C-Eval is a comprehensive Chinese fundamental model evaluation suite, with its validation and test sets comprising 1.3K and 12.3K multiple-choice questions respectively, covering 52 subjects. For C-Eval inference code, please refer to this project: 📖GitHub Wiki

| Models | Valid (0-shot) | Valid (5-shot) | Test (0-shot) | Test (5-shot) |
| :--- | :---: | :---: | :---: | :---: |
| Llama-3-Chinese-8B-Instruct-v3 | 55.2 | 54.8 | 52.1 | 52.4 |
| Llama-3-Chinese-8B-Instruct-v2 | 51.6 | 51.6 | 49.7 | 49.8 |
| Llama-3-Chinese-8B-Instruct | 49.3 | 51.5 | 48.3 | 49.4 |
| Llama-3-Chinese-8B | 47.0 | 50.5 | 46.1 | 49.0 |
| Meta-Llama-3-8B-Instruct | 51.3 | 51.3 | 49.5 | 51.0 |
| Meta-Llama-3-8B | 49.3 | 51.2 | 46.1 | 49.4 |
| Chinese-Mixtral-Instruct (8x7B) | 51.7 | 55.0 | 50.0 | 51.5 |
| Chinese-Mixtral (8x7B) | 45.8 | 54.2 | 43.1 | 49.1 |
| Chinese-Alpaca-2-13B | 44.3 | 45.9 | 42.6 | 44.0 |
| Chinese-LLaMA-2-13B | 40.6 | 42.7 | 38.0 | 41.6 |

CMMLU

CMMLU is another comprehensive Chinese evaluation dataset specifically designed to assess language models' knowledge and reasoning capabilities in a Chinese context, covering topics from basic subjects to advanced professional levels, with a total of 11.5K multiple-choice questions. For CMMLU inference code, please refer to this project: 📖GitHub Wiki

| Models | Test (0-shot) | Test (5-shot) |
| :--- | :---: | :---: |
| Llama-3-Chinese-8B-Instruct-v3 | 54.4 | 54.8 |
| Llama-3-Chinese-8B-Instruct-v2 | 51.8 | 52.4 |
| Llama-3-Chinese-8B-Instruct | 49.7 | 51.5 |
| Llama-3-Chinese-8B | 48.0 | 50.9 |
| Meta-Llama-3-8B-Instruct | 53.0 | 53.5 |
| Meta-Llama-3-8B | 47.8 | 50.8 |
| Chinese-Mixtral-Instruct (8x7B) | 50.0 | 53.0 |
| Chinese-Mixtral (8x7B) | 42.5 | 51.0 |
| Chinese-Alpaca-2-13B | 43.2 | 45.5 |
| Chinese-LLaMA-2-13B | 38.9 | 42.5 |

MMLU

MMLU is an English evaluation dataset for assessing natural language understanding capabilities, one of the main datasets used today for evaluating large models' capabilities, with its validation and test sets comprising 1.5K and 14.1K multiple-choice questions respectively, covering 57 subjects. For MMLU inference code, please refer to this project: 📖GitHub Wiki

| Models | Valid (0-shot) | Valid (5-shot) | Test (0-shot) | Test (5-shot) |
| :--- | :---: | :---: | :---: | :---: |
| Llama-3-Chinese-8B-Instruct-v3 | 64.7 | 65.0 | 64.8 | 65.9 |
| Llama-3-Chinese-8B-Instruct-v2 | 62.1 | 63.9 | 62.6 | 63.7 |
| Llama-3-Chinese-8B-Instruct | 60.1 | 61.3 | 59.8 | 61.8 |
| Llama-3-Chinese-8B | 55.5 | 58.5 | 57.3 | 61.1 |
| Meta-Llama-3-8B-Instruct | 63.4 | 64.8 | 65.1 | 66.4 |
| Meta-Llama-3-8B | 58.6 | 62.5 | 60.5 | 65.0 |
| Chinese-Mixtral-Instruct (8x7B) | 65.1 | 69.6 | 67.5 | 69.8 |
| Chinese-Mixtral (8x7B) | 63.2 | 67.1 | 65.5 | 68.3 |
| Chinese-Alpaca-2-13B | 49.6 | 53.2 | 50.9 | 53.5 |
| Chinese-LLaMA-2-13B | 46.8 | 50.0 | 46.6 | 51.8 |

LongBench

LongBench is a benchmark for evaluating large models' long-text understanding capabilities, composed of 20 tasks across 6 categories. Most tasks have an average length of 5K-15K, and the benchmark totals approximately 4.75K test entries. Below are the evaluation results of this project's models on the Chinese tasks (including code tasks). For LongBench inference code, please refer to this project's 📖GitHub Wiki

| Models | Single-doc QA | Multi-doc QA | Summarization | Few-Shot Learning | Code | Synthesis | Average |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Llama-3-Chinese-8B-Instruct-v3 | 20.3 | 28.8 | 24.5 | 28.1 | 59.4 | 91.9 | 40.5 |
| Llama-3-Chinese-8B-Instruct-v2 | 57.3 | 27.1 | 13.9 | 30.3 | 60.6 | 89.5 | 46.4 |
| Llama-3-Chinese-8B-Instruct | 44.1 | 24.0 | 12.4 | 33.5 | 51.8 | 11.5 | 29.6 |
| Llama-3-Chinese-8B | 16.4 | 19.3 | 4.3 | 28.7 | 14.3 | 4.6 | 14.6 |
| Meta-Llama-3-8B-Instruct | 55.1 | 15.1 | 0.1 | 24.0 | 51.3 | 94.5 | 40.0 |
| Meta-Llama-3-8B | 21.2 | 22.9 | 2.7 | 35.8 | 65.9 | 40.8 | 31.6 |
| Chinese-Mixtral-Instruct (8x7B) | 50.3 | 34.2 | 16.4 | 42.0 | 56.1 | 89.5 | 48.1 |
| Chinese-Mixtral (8x7B) | 32.0 | 23.7 | 0.4 | 42.5 | 27.4 | 14.0 | 23.3 |
| Chinese-Alpaca-2-13B-16K | 47.9 | 26.7 | 13.0 | 22.3 | 46.6 | 21.5 | 29.7 |
| Chinese-LLaMA-2-13B-16K | 36.7 | 17.7 | 3.1 | 29.8 | 13.8 | 3.0 | 17.3 |
| Chinese-Alpaca-2-7B-64K | 44.7 | 28.1 | 14.4 | 39.0 | 44.6 | 5.0 | 29.3 |
| Chinese-LLaMA-2-7B-64K | 27.2 | 16.4 | 6.5 | 33.0 | 7.8 | 5.0 | 16.0 |

Open LLM Leaderboard

Open LLM Leaderboard is an English LLM benchmark maintained by the HuggingFaceH4 team, covering the ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K datasets. Below are the evaluation results of this project's models.

| Models | ARC | HellaS | MMLU | TQA | WinoG | GSM8K | Average |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Llama-3-Chinese-8B-Instruct-v3 | 63.40 | 80.51 | 67.90 | 53.57 | 76.24 | 59.21 | 66.81 |
| Llama-3-Chinese-8B-Instruct-v2 | 62.63 | 79.72 | 66.48 | 53.93 | 76.72 | 60.58 | 66.68 |
| Llama-3-Chinese-8B-Instruct | 61.26 | 80.24 | 63.10 | 55.15 | 75.06 | 44.43 | 63.21 |
| Llama-3-Chinese-8B | 55.88 | 79.53 | 63.70 | 41.14 | 77.03 | 37.98 | 59.21 |
| Meta-Llama-3-8B-Instruct | 60.75 | 78.55 | 67.07 | 51.65 | 74.51 | 68.69 | 66.87 |
| Meta-Llama-3-8B | 59.47 | 82.09 | 66.69 | 43.90 | 77.35 | 45.79 | 62.55 |
| Chinese-Mixtral-Instruct (8x7B) | 67.75 | 85.67 | 71.53 | 57.46 | 83.11 | 55.65 | 70.19 |
| Chinese-Mixtral (8x7B) | 67.58 | 85.34 | 70.38 | 46.86 | 82.00 | 0.00 | 58.69 |

Note: The MMLU results differ from those reported earlier in this README because the evaluation scripts differ.

Quantization Performance Evaluation

Using llama.cpp, we tested the quantized performance of Llama-3-Chinese-8B (the base model), as shown in the table below. The actual speed is slightly slower than that of the second-generation Llama-2-7B.

| | F16 | Q8_0 | Q6_K | Q5_K | Q5_0 | Q4_K | Q4_0 | Q3_K | Q2_K |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Size (GB) | 14.97 | 7.95 | 6.14 | 5.34 | 5.21 | 4.58 | 4.34 | 3.74 | 2.96 |
| BPW | 16.00 | 8.50 | 6.56 | 5.70 | 5.57 | 4.89 | 4.64 | 4.00 | 3.16 |
| PPL | 5.130 | 5.135 | 5.148 | 5.181 | 5.222 | 5.312 | 5.549 | 5.755 | 11.859 |
| PP Speed | 5.99 | 6.10 | 7.17 | 7.34 | 6.65 | 6.38 | 6.00 | 6.85 | 6.43 |
| TG Speed | 44.03 | 26.08 | 21.61 | 22.33 | 20.93 | 18.93 | 17.09 | 22.50 | 19.21 |

Note

  • Model size: in GB
  • BPW (bits per weight): average bits per parameter; for example, Q8_0's actual average precision is 8.50 (a sanity check follows this list)
  • PPL (perplexity): measured with an 8K context (the natively supported length); lower is better
  • PP/TG speed: prompt processing (PP) and text generation (TG) speeds on an Apple M3 Max (Metal), in ms/token; lower is faster
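
The BPW column can be sanity-checked from file size and parameter count. A minimal check, assuming the table's sizes are binary gigabytes (GiB) and Llama-3-8B's commonly cited ~8.03B parameters:

```python
# Sanity-check BPW: total bits in the file divided by the number of parameters.
# Assumptions: sizes are GiB, and n_params ≈ 8.03e9 (commonly cited for Llama-3-8B).
n_params = 8.03e9
sizes_gib = {"F16": 14.97, "Q8_0": 7.95, "Q4_0": 4.34, "Q2_K": 2.96}
for name, gib in sizes_gib.items():
    bpw = gib * 1024**3 * 8 / n_params
    print(f"{name}: {bpw:.2f} bits per weight")
# F16 ≈ 16.0, Q8_0 ≈ 8.5, Q4_0 ≈ 4.6, Q2_K ≈ 3.2 — consistent with the table
```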

Training and Fine-Tuning

Manual Training and Fine-Tuning

Instruction Template

Our Llama-3-Chinese-Instruct models adopt the original instruction template of Llama-3-Instruct. The following is a chat example.

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant. 你是一个乐于助人的助手。<|eot_id|><|start_header_id|>user<|end_header_id|>

你好<|eot_id|><|start_header_id|>assistant<|end_header_id|>

你好!有什么可以帮助你的吗?<|eot_id|>
```
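
In practice you rarely need to construct this string by hand; the tokenizer's built-in chat template should reproduce it. A small check, assuming the `hfl/llama-3-chinese-8b-instruct-v3` repo id:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/llama-3-chinese-8b-instruct-v3")  # assumed repo id
messages = [
    {"role": "system", "content": "You are a helpful assistant. 你是一个乐于助人的助手。"},
    {"role": "user", "content": "你好"},
]
# tokenize=False returns the raw prompt string; add_generation_prompt appends the
# trailing assistant header so the model knows it should answer next.
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```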

Data

Below is some of the instruction data open-sourced by this project. For more details, please see: 📚 Instruction Data

| Data Name | Description | Quantity |
| :--- | :--- | :--- |
| alpaca_zh_51k | Alpaca data translated using gpt-3.5 | 51K |
| stem_zh_instruction | STEM data scraped using gpt-3.5, including physics, chemistry, medicine, biology, and earth sciences | 256K |
| ruozhiba_gpt4 | ruozhiba Q&A data obtained using GPT-4o and GPT-4T | 2449 |

Frequently Asked Questions

Before submitting an issue, please check the FAQ to see whether a solution already exists. For specific questions and answers, refer to the project's 📖GitHub Wiki.

Question 1: Why is there no vocabulary expansion like in phases one and two?
Question 2: Will there be a 70B version released?
Question 3: Why is the instruction model no longer called Alpaca?
Question 4: Can the models from this repository be used commercially?
Question 5: Why not perform full pre-training instead of using LoRA?
Question 6: Why is the conversational performance of Llama-3-Chinese not good?
Question 7: Why does the instruction model reply saying it is ChatGPT?
Question 8: What are the differences between v1 and v2 of the Instruct model?

Disclaimer

This project is developed based on Meta's Llama-3 model. Please strictly adhere to the Llama-3 open-source license agreement when using it. When using third-party code, comply with the relevant open-source licenses. The accuracy of model-generated content may be affected by computational methods, random factors, and loss of quantization precision; therefore, this project makes no guarantees regarding the accuracy of model outputs and accepts no liability for losses arising from the use of the related resources and outputs. If the models are used for commercial purposes, developers must comply with local laws and regulations and ensure the legality of the model outputs. This project takes no responsibility for any products or services derived from it.

Feedback

If you have questions, please submit them in the GitHub Issues. Ask politely and help build a harmonious discussion community.

  • Before submitting an issue, check if the FAQ addresses your question and consider reviewing past issues that might solve your problem.
  • When submitting an issue, please use the project's issue template to help quickly identify specific problems.
  • Duplicate issues and those unrelated to this project will be handled by stale-bot; please understand.

Footnotes

  1. Cui and Yao, 2024. Rethinking LLM Language Adaptation: A Case Study on Chinese Mixtral