
# Introducing Qwen-7B: Open, state-of-the-art foundation and human-aligned models

Large language models have recently attracted enormous attention. The rise of ChatGPT accelerated progress toward artificial general intelligence and showed that large language models compress world knowledge into neural networks, and that alignment with human intent can yield powerful conversational assistants. The latest version of ChatGPT, based on GPT-4, demonstrates strong performance across a wide range of capabilities, such as language understanding, logical reasoning, and planning. Moreover, its integration with external tools and models unlocks agents that can understand instructions, execute code, and use tools to accomplish the objectives set by human users.

This progress underscores the importance of large language models as the foundation of AI services.

We are happy to release Qwen-7B, the 7B-parameter models of our large pretrained model series Qwen (abbreviated from Tongyi Qianwen). This release includes the model weights and code for both the pretrained and human-aligned 7B-parameter language models:

- Qwen-7B is the pretrained language model, and Qwen-7B-Chat is fine-tuned to align with human intent.
- Qwen-7B is pretrained on over 2.2 trillion tokens with a context length of 2048. On the series of benchmarks we tested, Qwen-7B generally performs better than existing open models of similar scale and appears to be on par with some larger models.
- Qwen-7B-Chat is fine-tuned on curated data, including not only task-oriented data but also specific security- and service-oriented data, which is relatively scarce in existing open models.
- Example code for fine-tuning, evaluation, and inference is included, along with guides on long-context inference and tool use.

Goal of release: We believe that, while the recent wave of LLM releases has deepened our understanding of model behavior under standard regimes, it remains unclear how the techniques that now commonly accompany LLMs, such as 1) quantization and fine-tuning after quantization, 2) training-free long-context inference, and 3) fine-tuning with service-oriented data, including search and tool use, affect the models as a whole. The open release of Qwen-7B marks our first step towards fully understanding the real-world application of such techniques. We hope it will enable the community to analyze and continue to improve the safety of these models, and to help establish responsible development and deployment of LLMs.

Disclaimer: Although the weights and code are released openly and commercial use is not prohibited, Qwen-7B, like other pretrained language models, comes with potential risks influenced by complex factors, including but not limited to overly diverse, inaccurate, or misleading generation. Developers and stakeholders should perform their own red teaming, put appropriate security measures in place before deployment, and comply with local governance and regulations. In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights or code.

The remainder of this document describes our pretraining and fine-tuning methodology.

## Pretraining

Qwen-7B is a transformer-based decoder-only language model with an architecture similar to the LLaMA series of models. It is pretrained on over 2.2 trillion tokens of publicly available data with a context length of 2048, covering general and specialized domains with a focus on English and Chinese.

### Data

Pretraining data: Our training data is a mix of publicly available sources, consisting mainly of web documents and code files. The data is multilingual, with the majority in English and Chinese. We employed an ensemble of models to exclude low-quality data and data deemed unfit for pretraining, such as NSFW content. For mathematical reasoning, we include RFT data from gsm8k-ScRel. The final data underwent global fuzzy deduplication. The mix of pretraining corpora was optimized through numerous ablation experiments.
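
As a rough illustration of what global fuzzy deduplication can look like, here is a minimal sketch using MinHash with locality-sensitive hashing via the `datasketch` library. The thresholds, shingle size, and helper names are illustrative assumptions, not the exact pipeline used for Qwen-7B.

```python
# Minimal fuzzy-deduplication sketch (illustrative; not the exact Qwen-7B pipeline).
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128           # number of hash permutations per document signature
JACCARD_THRESHOLD = 0.8  # documents above this estimated similarity count as duplicates

def minhash_of(text: str, ngram: int = 5) -> MinHash:
    """Build a MinHash signature from character n-gram shingles of a document."""
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(len(text) - ngram + 1, 1)):
        m.update(text[i:i + ngram].encode("utf-8"))
    return m

def deduplicate(docs: dict[str, str]) -> list[str]:
    """Return the keys of documents kept after near-duplicate removal."""
    lsh = MinHashLSH(threshold=JACCARD_THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for key, text in docs.items():
        sig = minhash_of(text)
        if lsh.query(sig):  # a sufficiently similar document is already kept
            continue
        lsh.insert(key, sig)
        kept.append(key)
    return kept

if __name__ == "__main__":
    corpus = {
        "a": "Qwen-7B is pretrained on over 2.2 trillion tokens.",
        "b": "Qwen-7B is pretrained on over 2.2 trillion tokens!",  # near-duplicate of "a"
        "c": "A completely different document about code files.",
    }
    print(deduplicate(corpus))  # e.g. ['a', 'c']
```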

Tokenization: In contrast to mainstream open models built on Chinese- and English-centric vocabularies, we use a vocabulary of 151,851 tokens. It is designed first for efficient encoding of Chinese, English, and code, and it is also friendlier to other languages, enabling users to directly enhance the capability for some languages without expanding the vocabulary. It segments numbers into single digits and relies on the tiktoken library for fast tokenization. After tokenization, the data amounts to over 2.2 trillion tokens.
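
For reference, a minimal sketch of loading the tokenizer and inspecting its behavior is shown below. It assumes the model is published on the Hugging Face Hub as `Qwen/Qwen-7B` and that the accompanying tiktoken-based tokenizer is loaded with `trust_remote_code=True`; the sample texts and printed counts are illustrative only.

```python
# Illustrative sketch: load the Qwen tokenizer and inspect its encoding behavior.
# Assumes the checkpoint is hosted as "Qwen/Qwen-7B" and that `tiktoken` is installed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

samples = {
    "en": "Large language models compress world knowledge into neural networks.",
    "zh": "通义千问是一个大规模语言模型。",
    "code": "def add(a, b):\n    return a + b",
    "digits": "The answer is 12345.",  # numbers are segmented into single digits
}

for name, text in samples.items():
    ids = tokenizer.encode(text)
    print(f"{name}: {len(text)} chars -> {len(ids)} tokens")
```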

*Figure: Tokenization efficiency (encoding compression rates across languages).*

We randomly selected 1 million documents per language to test and compare the encoding compression rates of different models (with XLM-R, which supports 100 languages, as the baseline value of 1; not shown in the figure). While ensuring efficient encoding of Chinese, English, and code, Qwen-7B also achieves a high compression rate for many other languages (such as th, he, ar, ko, vi, ja, tr, id, pl, ru, nl, pt, it, de, es, fr, etc.), equipping the model with strong scalability as well as high training and inference efficiency in these languages.

### Model

Model architecture: Qwen-7B is built with an architecture similar to LLaMA. The main differences from the standard transformer are: 1) untied input and output embeddings, 2) rotary positional embeddings (RoPE), 3) no biases except in the QKV projections of attention, 4) RMSNorm instead of LayerNorm, 5) SwiGLU instead of ReLU, and 6) FlashAttention to accelerate training. The model has 32 layers, an embedding dimension of 4096, and 32 attention heads.
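
A compact summary of these architectural choices as a configuration object may be helpful; the field names below are chosen for readability and are illustrative assumptions, not the model's actual config keys.

```python
# Illustrative configuration summary of the Qwen-7B architecture described above.
# Field names are for readability only; they are not the released config's keys.
from dataclasses import dataclass

@dataclass
class Qwen7BArchitecture:
    num_layers: int = 32                  # transformer decoder blocks
    hidden_size: int = 4096               # embedding dimension
    num_attention_heads: int = 32         # head dimension = 4096 / 32 = 128
    vocab_size: int = 151_851             # tiktoken-based vocabulary
    max_position_embeddings: int = 2048   # pretraining context length
    tie_word_embeddings: bool = False     # untied input/output embeddings
    positional_embedding: str = "rotary"  # RoPE
    bias: str = "qkv_only"                # no biases except the QKV projections
    norm: str = "rmsnorm"                 # RMSNorm instead of LayerNorm
    activation: str = "swiglu"            # SwiGLU instead of ReLU
    attention_impl: str = "flash_attention"
```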

Training details: The model is trained using the AdamW optimizer, with $\beta_1=0.9, \beta_2=0.95, \epsilon=10^{-6}$. The sequence length is 2048 and the batch size is 2048, so each optimization step processes over 4 million tokens. We use a cosine learning rate schedule with a warm-up of 2000 steps, a peak learning rate of $3 \times 10^{-4}$, and a minimum learning rate of 10% of the peak. We use a weight decay of 0.1 and gradient clipping of 1.0. Training uses mixed precision with bfloat16.
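
As a rough PyTorch sketch of this optimization setup (AdamW, linear warm-up to a peak of 3e-4, cosine decay to 10% of the peak, weight decay 0.1, gradient clipping 1.0, bfloat16), the snippet below may be useful. The model, batch, and total step count are placeholders; the actual training framework may differ.

```python
# Illustrative optimizer and learning-rate schedule matching the described setup.
import math
import torch

model = torch.nn.Linear(4096, 4096)  # stand-in for the actual transformer

PEAK_LR, MIN_LR_RATIO = 3e-4, 0.10
WARMUP_STEPS, TOTAL_STEPS = 2000, 500_000  # TOTAL_STEPS is a placeholder

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=PEAK_LR, betas=(0.9, 0.95), eps=1e-6, weight_decay=0.1,
)

def lr_lambda(step: int) -> float:
    """Linear warm-up, then cosine decay down to 10% of the peak learning rate."""
    if step < WARMUP_STEPS:
        return step / max(WARMUP_STEPS, 1)
    progress = (step - WARMUP_STEPS) / max(TOTAL_STEPS - WARMUP_STEPS, 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

def training_step(batch: torch.Tensor) -> None:
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(batch).float().mean()  # placeholder loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad(set_to_none=True)
```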

### Evaluation

We report results of Qwen-7B on standard benchmarks.

#### World knowledge

C-Eval is a common benchmark for evaluating the knowledge of pretrained models in Chinese. It covers 52 subjects in four major categories: humanities, social sciences, STEM, and other specialties. Following standard practice, we use samples from the development set as few-shot prompts and report the 5-shot accuracy of the Qwen-7B pretrained model on the validation and test sets.

The accuracy comparison of the Qwen-7B model and other models on the C-Eval validation set is as follows:

| Model | Average |
|---|---|
| Alpaca-7B | 28.9 |
| Vicuna-7B | 31.2 |
| ChatGLM-6B | 37.1 |
| Baichuan-7B | 42.7 |
| ChatGLM2-6B | 50.9 |
| InternLM-7B | 53.4 |
| ChatGPT | 53.5 |
| Claude-v1.3 | 55.5 |
| Qwen-7B | 60.8 |

The performance comparison of the Qwen-7B pretrained model and other models on the C-Eval test set is shown in the following table:

| Model | Avg. | Avg. (Hard) | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|---|
| ChatGLM-6B | 38.9 | 29.2 | 33.3 | 48.3 | 41.3 | 38.0 |
| Chinese-Alpaca-Plus-13B | 41.5 | 30.5 | 36.6 | 49.7 | 43.1 | 41.2 |
| Baichuan-7B | 42.8 | 31.5 | 38.2 | 52.0 | 46.2 | 39.3 |
| WestlakeLM-19B | 44.6 | 34.9 | 41.6 | 51.0 | 44.3 | 44.5 |
| AndesLM-13B | 46.0 | 29.7 | 38.1 | 61.0 | 51.0 | 41.9 |
| BatGPT-15B-sirius | 47.0 | 31.9 | 42.7 | 57.5 | 48.6 | 43.6 |
| ChatGLM2-6B | 51.7 | 37.1 | 48.6 | 60.5 | 51.3 | 49.8 |
| InternLM-7B | 52.8 | 37.1 | 48.0 | 67.4 | 55.4 | 45.8 |
| Baichuan-13B | 53.6 | 36.7 | 47.0 | 66.8 | 57.3 | 49.8 |
| Claude-v1.3 | 54.2 | 39.0 | 51.9 | 61.7 | 52.1 | 53.7 |
| ChatGPT | 54.4 | 41.4 | 52.9 | 61.8 | 50.9 | 53.6 |
| Qwen-7B | 59.6 | 41.0 | 52.8 | 74.1 | 63.1 | 55.2 |

As can be seen, Qwen-7B achieves the best performance among the listed models of similar scale and even surpasses larger ones.

MMLU is currently one of the most recognized benchmarks for evaluating English comprehension, covering 57 subtasks across different academic fields and difficulty levels. The 5-shot MMLU accuracy of Qwen-7B is shown in the following table:

| Model | Average | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|
| LLaMA-7B | 35.1 | 30.5 | 38.3 | 34.0 | 38.1 |
| Baichuan-7B | 42.3 | 35.6 | 48.9 | 38.4 | 48.1 |
| LLaMA2-7B | 45.3 | 36.4 | 51.2 | 42.9 | 52.2 |
| LLaMA-13B | 46.9 | 35.8 | 53.8 | 45.0 | 53.3 |
| ChatGLM2-6B | 47.9 | 41.2 | 54.4 | 43.7 | 54.5 |
| InternLM-7B | 51.0 | - | - | - | - |
| Baichuan-13B | 51.6 | 41.6 | 60.9 | 47.4 | 58.5 |
| LLaMA2-13B | 54.8 | 44.1 | 62.6 | 52.8 | 61.1 |
| ChatGLM2-12B | 56.2 | 48.2 | 65.1 | 52.6 | 60.9 |
| Qwen-7B | 56.7 | 47.6 | 65.9 | 51.5 | 64.7 |

In terms of English, Qwen-7B also surpasses other similar open pretrained models, and is competitive when compared to larger versions of other models.

#### Coding

We compared the code capabilities of pretrained models on HumanEval, and the results are as follows:

| Model | Pass@1 |
|---|---|
| Baichuan-7B | 9.2 |
| ChatGLM2-6B | 9.2 |
| InternLM-7B | 10.4 |
| LLaMA-7B | 10.5 |
| LLaMA2-7B | 12.8 |
| Baichuan-13B | 12.8 |
| LLaMA-13B | 15.8 |
| MPT-7B | 18.3 |
| LLaMA2-13B | 18.3 |
| Qwen-7B | 24.4 |

#### Math

We compared the math capabilities of pretrained models on GSM8K (8-shot), and the results are as follows:

| Model | Accuracy |
|---|---|
| MPT-7B | 6.8 |
| Falcon-7B | 6.8 |
| Baichuan-7B | 9.7 |
| LLaMA-7B | 11.0 |
| LLaMA2-7B | 14.6 |
| LLaMA-13B | 17.8 |
| Baichuan-13B | 26.6 |
| LLaMA2-13B | 28.7 |
| InternLM-7B | 31.2 |
| ChatGLM2-6B | 32.4 |
| ChatGLM2-12B | 40.9 |
| Qwen-7B | 51.6 |

#### Natural language processing

We compared the translation capabilities of pretrained models on WMT22 zh-en and en-zh (5-shot BLEU), and the results are as follows:

| Model | Average | zh-en | en-zh |
|---|---|---|---|
| InternLM-7B | 11.8 | 9.0 | 14.5 |
| LLaMA-7B | 12.7 | 16.7 | 8.7 |
| LLaMA-13B | 15.8 | 19.5 | 12.0 |
| LLaMA2-7B | 19.9 | 21.9 | 17.9 |
| Bloom-7B | 20.3 | 19.1 | 21.4 |
| LLaMA2-13B | 23.3 | 22.4 | 24.2 |
| PolyLM-13B | 23.6 | 20.2 | 27.0 |
| Baichuan-7B | 24.6 | 22.6 | 26.6 |
| Qwen-7B | 27.5 | 24.3 | 30.6 |

#### Long-context inference

We include support for training-free long-context inference based on NTK-aware interpolation, LogN attention scaling, and local window attention, which extends the usable context from 2048 to over 8192 tokens. The following table reports perplexity (PPL) on arXiv data at different sequence lengths; a sketch of the scaling ideas follows the table.

| Model | 1024 | 2048 | 4096 | 8192 | 16384 |
|---|---|---|---|---|---|
| Qwen-7B | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 |
| + dynamic_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 |
| + dynamic_ntk + logn | 4.23 | 3.78 | 3.58 | 3.56 | 4.62 |
| + dynamic_ntk + logn + local_attn | 4.23 | 3.78 | 3.58 | 3.49 | 4.32 |
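
To make the dynamic NTK-aware interpolation and LogN attention scaling above concrete, here is a minimal sketch of both ideas. The exact scaling rules in the released inference code may differ; the function names and constants are illustrative.

```python
# Illustrative sketch of dynamic NTK-aware RoPE scaling and LogN attention scaling.
import math
import torch

TRAIN_LEN = 2048     # pretraining context length
ROPE_BASE = 10000.0  # standard RoPE base
HEAD_DIM = 128       # per-head dimension (4096 hidden / 32 heads)

def ntk_scaled_inv_freq(seq_len: int, dim: int = HEAD_DIM, base: float = ROPE_BASE) -> torch.Tensor:
    """Enlarge the RoPE base when the sequence exceeds the training length,
    so low-frequency components are interpolated rather than extrapolated."""
    scale = max(seq_len / TRAIN_LEN, 1.0)
    adjusted_base = base * scale ** (dim / (dim - 2))
    return 1.0 / (adjusted_base ** (torch.arange(0, dim, 2).float() / dim))

def logn_scale(position: int) -> float:
    """Scale attention queries at positions beyond the training length by
    log(position) / log(TRAIN_LEN), keeping attention entropy roughly stable."""
    return max(math.log(position) / math.log(TRAIN_LEN), 1.0)

if __name__ == "__main__":
    for seq_len in (2048, 8192, 16384):
        inv_freq = ntk_scaled_inv_freq(seq_len)
        print(seq_len, float(inv_freq[-1]), round(logn_scale(seq_len), 3))
```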

## Fine-tuning

Qwen-7B-Chat reflects our practice in aligning the model with human intent, internalizing safety, and building intelligent agents for services.

### Data

Alignment data: The data includes common instruction-style conversations as well as security- and service-oriented data, which required substantial annotation effort. The instruction data covers a broad range of abilities, such as writing, question answering, brainstorming and planning, content understanding, summarization, natural language processing, and coding. The security data aims to prevent the model from generating harmful or inappropriate content. The service data enhances the model with specific conversation patterns that can be parsed to invoke and incorporate external systems.

Data formatting: Since the data consists of conversation turns, we arrange them into texts using the ChatML format, a meta language that can describe both the metadata (e.g., roles) and the content of each turn. The roles currently used are system, user, and assistant.
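
For illustration, a minimal helper that renders a conversation into ChatML-style text is sketched below. The special-token spellings follow the common ChatML convention (`<|im_start|>` / `<|im_end|>`), and the helper itself is not part of the released code.

```python
# Illustrative ChatML rendering (the helper is not part of the released code).
def to_chatml(messages: list[dict[str, str]]) -> str:
    """Render a list of {"role": ..., "content": ...} turns as ChatML text."""
    return "\n".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages
    )

example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
print(to_chatml(example))
```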

### Model

Training details: The model is fine-tuned with the causal language modeling objective, except that no loss is computed on the tokens of the user's turns. The model is trained using the AdamW optimizer, with $\beta_1=0.9, \beta_2=0.95, \epsilon=10^{-6}$. The sequence length is limited to 2048, and the batch size is 128. The model is trained for 4000 steps, and the learning rate is warmed up to $1 \times 10^{-5}$ over the first 1430 steps. We use a weight decay of 0.1, dropout of 0.1, and gradient clipping of 1.0.
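
A minimal sketch of excluding user turns from the loss is given below; the `-100` ignore index follows the PyTorch/Hugging Face convention, and the segmentation logic is simplified for illustration.

```python
# Illustrative loss masking for supervised fine-tuning: tokens outside the
# assistant's turns are excluded from the causal LM loss via label -100.
IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss(ignore_index=-100)

def build_labels(token_ids: list[int], is_assistant_token: list[bool]) -> list[int]:
    """Copy targets for assistant tokens; mask everything else (system/user turns)."""
    return [
        tok if keep else IGNORE_INDEX
        for tok, keep in zip(token_ids, is_assistant_token)
    ]

# Example: only the last three tokens belong to the assistant's reply.
token_ids = [11, 22, 33, 44, 55, 66]
mask      = [False, False, False, True, True, True]
print(build_labels(token_ids, mask))  # [-100, -100, -100, 44, 55, 66]
```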

### Evaluation

Evaluation of human-aligned models is non-trivial and often non-standardized, since such models often target specific applications. We evaluate Qwen-7B-Chat from multiple perspectives.

#### World knowledge

Since fine-tuning uses a much smaller dataset than pretraining, and the human-written alignment data may not cover the breadth of world knowledge, we also evaluate the world knowledge of Qwen-7B-Chat on C-Eval and MMLU in a zero-shot and generative manner.

We demonstrate the zero-shot accuracy of Qwen-7B-Chat on the C-Eval validation set.

| Model | Avg. Acc. |
|---|---|
| LLaMA2-7B-Chat | 31.9 |
| LLaMA2-13B-Chat | 40.6 |
| Chinese-Alpaca-2-7B | 41.3 |
| Chinese-Alpaca-Plus-13B | 43.3 |
| Baichuan-13B-Chat | 50.4 |
| ChatGLM2-6B-Chat | 50.7 |
| InternLM-7B-Chat | 53.2 |
| Qwen-7B-Chat | 54.2 |

The zero-shot accuracy of Qwen-7B-Chat on the C-Eval test set is provided below:

| Model | Avg. | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|
| Chinese-Alpaca-Plus-13B | 41.5 | 36.6 | 49.7 | 43.1 | 41.2 |
| Chinese-Alpaca-2-7B | 40.3 | - | - | - | - |
| ChatGLM2-6B-Chat | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
| Baichuan-13B-Chat | 51.5 | 43.7 | 64.6 | 56.2 | 49.2 |
| Qwen-7B-Chat | 54.6 | 47.8 | 67.6 | 59.3 | 50.6 |

Compared with other models with comparable model sizes, the human-aligned Qwen-7B-Chat performs well in C-Eval accuracy.

The zero-shot accuracy of Qwen-7B-Chat on MMLU is provided below. Qwen-7B-Chat remains at the top among human-aligned models of comparable size.

| Model | Avg. Acc. |
|---|---|
| ChatGLM2-6B-Chat | 45.5 |
| LLaMA2-7B-Chat | 47.0 |
| InternLM-7B-Chat | 50.8 |
| Baichuan-13B-Chat | 52.1 |
| ChatGLM2-12B-Chat | 52.1 |
| Qwen-7B-Chat | 53.9 |

#### Coding

The zero-shot Pass@1 of Qwen-7B-Chat on HumanEval is shown below:

| Model | Pass@1 |
|---|---|
| LLaMA2-7B-Chat | 12.2 |
| InternLM-7B-Chat | 14.0 |
| Baichuan-13B-Chat | 16.5 |
| LLaMA2-13B-Chat | 18.9 |
| Qwen-7B-Chat | 24.4 |

#### Math

The accuracy of Qwen-7B-Chat on GSM8K is shown below:

| Model | Zero-shot Acc. | 4-shot Acc. |
|---|---|---|
| ChatGLM2-6B-Chat | - | 28.0 |
| LLaMA2-7B-Chat | 20.4 | 28.2 |
| LLaMA2-13B-Chat | 29.4 | 36.7 |
| InternLM-7B-Chat | 32.6 | 34.5 |
| Baichuan-13B-Chat | - | 36.3 |
| ChatGLM2-12B-Chat | - | 38.1 |
| Qwen-7B-Chat | 41.1 | 43.5 |

#### Service

LLMs have shown the capability to coordinate multiple external systems to carry out given instructions, which creates new opportunities for traditional online services, most notably web search.

Qwen supports calling plugins/tools/APIs through ReAct prompting, which is also one of the main approaches used by the LangChain framework. For how to write and use prompts for ReAct prompting, please refer to the ReAct examples. On our evaluation benchmark for assessing tool-use capabilities, Qwen performs as follows:

| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
|---|---|---|---|
| GPT-4 | 95% | 0.90 | 15.0% |
| GPT-3.5 | 85% | 0.88 | 75.0% |
| Qwen-7B | 99% | 0.89 | 9.7% |

The plugins in the evaluation set do not appear in Qwen's training set. This benchmark evaluates the model's accuracy in selecting the correct plugin from multiple candidates, the soundness of the parameters passed to the plugin, and the false positive rate. False positive: incorrectly invoking a plugin when it should not have been called in response to a query. An illustrative ReAct-style prompt is sketched below.
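
As a rough illustration of the ReAct prompt structure, the snippet below builds a tool-calling prompt in the common ReAct format. The tool and the template wording are hypothetical examples, not the exact prompt shipped with the released ReAct examples.

```python
# Illustrative ReAct-style prompt for tool calling (tool and template are hypothetical).
REACT_PROMPT = """Answer the following question as best you can. You have access to the following tools:

search: Useful for looking up current information on the web. Input: a search query.

Use the following format:

Question: the input question
Thought: reasoning about what to do next
Action: the tool to use, one of [search]
Action Input: the input to the tool
Observation: the result returned by the tool
... (Thought/Action/Action Input/Observation can repeat)
Thought: I now know the final answer
Final Answer: the answer to the original question

Question: {question}"""

print(REACT_PROMPT.format(question="What is the weather in Beijing today?"))
```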

Qwen can also be used as a HuggingFace Agent. Its performance on the benchmark provided by HuggingFace is as follows:

| Model | Tool Selection↑ | Tool Used↑ | Code↑ |
|---|---|---|---|
| GPT-4 | 100.00 | 100.00 | 97.41 |
| GPT-3.5 | 95.37 | 96.30 | 87.04 |
| StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
| Qwen-7B | 90.74 | 92.59 | 74.07 |

## Conclusion

In this document, we describe Qwen-7B, including a pretrained model and a human-aligned model. These models demonstrate strong performance compared to existing open models of similar or even larger scale. As part of our ongoing commitment to Model as a Service, the release also includes practical pieces such as long-context inference and external system integration, which we hope will help developers realize their own ideas. We believe that the open release of the Qwen-7B models will further our understanding of the variables and techniques introduced in realistic settings and help drive progress in this important area together with the community.