# Tutorial: Fine-tuning CPM-Bee on dialogue instructions
This step-by-step tutorial shows how to fine-tune CPM-Bee on a toy instruction chat dataset. The dataset consists of multi-turn dialogues between a user and the model. During training, the model is given the dialogue history, and the loss is computed only on the model's responses.

### 1. Process the dataset
Before training, we need to define the input format and process the training data accordingly. The raw example data we use is shown below.



In [1]:
# example data
my_data = [
    [
        {"role": "User", "content": "What percentage of the Earth's surface is covered by oceans?"},
        {"role": "Assistant", "content": "About 71% of the Earth's surface is covered by oceans."}, 
        {"role": "User", "content": "Wow, that's a lot of water! No wonder we call it the blue planet."},
        {"role": "Assistant", "content": "Yes, it certainly is! The oceans play a vital role in regulating the Earth's climate and supporting life on our planet. And they're also a great source of food, energy, and recreation for us humans!"},
        {"role": "User", "content": "Absolutely! I love visiting the beach and going for a swim in the ocean. It's amazing how vast and powerful the sea can be, yet also so peaceful and calming."},
        {"role": "Assistant", "content": "As an AI language model, I have never gone to the beach or swam in the ocean, but based on what you said, I am sure it's a wonderful experience. The ocean is a place of great beauty and mystery, full of fascinating creatures and hidden treasures waiting to be discovered. It can be both a source of wonder and a reminder of the awesome power of nature."}
    ],
    [
        {"role": "User", "content": "How does language translation technology impact globalization?"},
        {"role": "Assistant", "content": "Language translation technology has a significant impact on globalization in the following ways:\n\n1. Breaking down language barriers: Translation technology enables individuals and businesses to communicate with people from different countries who speak different languages. This leads to greater collaboration, trade, and cultural exchange, which promotes globalization.\n\n2. Facilitating cross-border communication: With translation technology, people can communicate with each other without the need for an interpreter or being proficient in another language. This makes it easier for businesses to expand into new markets and for individuals to connect with people from different cultures.\n\n3. Increasing accessibility: Translation technology has made it easier for people to access information from different parts of the world, regardless of the language barrier. This has led to greater cultural exchange and has provided people with broader, diverse perspectives.\n\n4. Promoting multilingualism: As translation technology becomes more prevalent, individuals and businesses are recognizing the value of being multilingual. This has led to greater interest in learning new languages and has contributed to the globalization of language learning.\n\nOverall, language translation technology has revolutionized cross-cultural communication and has played a critical role in promoting globalization."},
    ],
    [
        {"role": "User", "content": "What are some of the positive impacts of Reality TV shows?"},
        {"role": "Assistant", "content": "1. Diversity and Representation: Reality TV shows have given a platform to people from diverse backgrounds and cultures, highlighting different perspectives and experiences.\n\n2. Inspiring and Motivating: Through reality TV shows, viewers can be inspired and motivated by seeing ordinary people with extraordinary skills, talents, and abilities achieve their dreams and goals.\n\n3. Bridging gaps and Breaking Stereotypes: Reality TV shows have helped break stereotypes and bridge gaps between different cultures, communities, and social classes.\n\n4. Educational and Informative: Some Reality TV shows are educational and informative, as they provide insights into various industries, professions, and lifestyles.\n\n5. Brings Awareness to Social Issues: Reality TV shows have the ability to raise awareness on social issues such as domestic violence, mental health, poverty, and homelessness.\n\n6. Economic Impact: Reality TV shows help create jobs and opportunities for individuals in various industries such as production, hospitality, retail, and tourism."},
        {"role": "User", "content": "I also think that reality TV shows can be entertaining and provide a form of escapism for viewers who may be dealing with stress or difficult situations in their own lives. However, I do worry about the negative impact they can have on the mental health and self-esteem of the participants. What do you think about that?"},
        {"role": "Assistant", "content": "As an AI language model, I don't have personal opinions or emotions. However, it is true that reality TV shows are often criticized for their potential negative impact on the mental health and self-esteem of the participants. The constant surveillance, criticism from judges or other participants, and the pressure to succeed can take a toll on the mental well-being of the contestants. Furthermore, the edits and portrayals of the contestants can feed into negative stereotypes or make them appear in a negative light, affecting their self-esteem and public image. It is important to ensure that the participants' well-being is prioritized, and that measures are put in place to support their mental health throughout the production process."},
        {"role": "User", "content": "It's really unfortunate how reality TV shows tend to exploit people for the sake of entertainment. I wish there were more shows that focused on positive messages and uplifting content instead of drama and conflict."},
        {"role": "Assistant", "content": "I completely understand your concern, as it is true that some reality TV shows can be exploitative and focus on drama and conflict rather than positive messages and uplifting content. However, there are also many other reality TV shows that promote positive messages, such as those that showcase acts of kindness, generosity, and community service. These types of reality TV shows can inspire viewers to make a positive impact in their own lives and the lives of others. It is important that we recognize and support such shows to create a culture of positivity and kindness, both on and off-screen."}
    ]
] * 100

In this tutorial, we format multi-turn chat data with the template below (you can also define your own format):
```
User: user input text 1
Assistant: <mask_0>
User: user input text 2
Assistant: <mask_1>
```
As shown above, the Assistant responses in the input are masked, which marks them as the parts on which the loss is computed. The code below is a minimal version written for the toy dataset; in practice, adapt it to the original format of your data. You may also need to handle dialogues that exceed the model's maximum length, for example by splitting them so that the training data is fully used.

In [2]:
def reformat_data(data):
    """Convert one dialogue (a list of turns) into the masked input format; adapt this to your own data."""
    new_data = {"input": "", "<ans>": {}}
    input_text = ""
    ans_id = 0
    for utt in data:
        if utt["role"] == "User":
            input_text += "\n" + ": ".join([utt["role"], utt["content"]]).replace("<", "<<").replace(">", ">>")
        elif utt["role"] == "Assistant":
            mask_token = f"<mask_{ans_id}>"
            input_text += "\n" + utt["role"] + ": " + mask_token
            new_data["<ans>"][mask_token] = utt["content"]
            ans_id += 1
        else:
            # report the offending role in the error itself
            raise ValueError(f"unrecognized role: {utt['role']}")
    new_data["input"] = input_text
    return new_data

print(reformat_data(my_data[0]))

{'input': "\nUser: What percentage of the Earth's surface is covered by oceans?\nAssistant: <mask_0>\nUser: Wow, that's a lot of water! No wonder we call it the blue planet.\nAssistant: <mask_1>\nUser: Absolutely! I love visiting the beach and going for a swim in the ocean. It's amazing how vast and powerful the sea can be, yet also so peaceful and calming.\nAssistant: <mask_2>", '<ans>': {'<mask_0>': "About 71% of the Earth's surface is covered by oceans.", '<mask_1>': "Yes, it certainly is! The oceans play a vital role in regulating the Earth's climate and supporting life on our planet. And they're also a great source of food, energy, and recreation for us humans!", '<mask_2>': "As an AI language model, I have never gone to the beach or swam in the ocean, but based on what you said, I am sure it's a wonderful experience. The ocean is a place of great beauty and mystery, full of fascinating creatures and hidden treasures waiting to be discovered. It can be both a source of wonder and 
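The text above mentions splitting dialogues that exceed the model's maximum length. The following is a minimal sketch of one way to do that, not part of the CPM-Bee codebase. The function name `split_dialogue` and the `max_chars` parameter are hypothetical, and character count is used only as a rough stand-in for token count; a real pipeline would measure length with the tokenizer.

```python
def split_dialogue(dialogue, max_chars=2000):
    """Split a list of {"role", "content"} turns into chunks under a length budget.

    Assumptions (illustrative only): length is measured in characters, and
    chunks break only after an Assistant turn so that every chunk ends with
    a response. A trailing User turn without a response is dropped.
    """
    chunks, current, current_len = [], [], 0
    pair, pair_len = [], 0
    for utt in dialogue:
        pair.append(utt)
        pair_len += len(utt["content"])
        if utt["role"] == "Assistant":
            # start a new chunk if adding this exchange would exceed the budget
            if current and current_len + pair_len > max_chars:
                chunks.append(current)
                current, current_len = [], 0
            current.extend(pair)
            current_len += pair_len
            pair, pair_len = [], 0
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk can then be passed to `reformat_data` as its own training example. Note that later chunks lose the earlier dialogue history, and a single exchange longer than the budget still ends up in a chunk of its own.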

Store the processed data as binary files in the preprocessing format.

In [4]:


import os
import shutil
import sys

# make the cpm_live package importable from the repository's src directory
sys.path.append("../src")
from cpm_live.dataset import build_dataset, shuffle_dataset

output_path = "./data"
os.makedirs(output_path, exist_ok=True)

# write the reformatted dialogues to a temporary binary dataset
with build_dataset("tmp", "data") as dataset:
    for item in my_data:
        dataset.write(reformat_data(item))

# shuffle the temporary dataset into ./data/mydata, then remove the temporary copy
shuffle_dataset(
    "tmp",
    os.path.join(output_path, "mydata"),
    progress_bar=True,
    output_name="example-data"
)
shutil.rmtree("tmp")


Shuffle step 1/2: 100%|██████████| 300/300 [00:00<00:00, 1541.87it/s]
Shuffle step 2/2: 100%|██████████| 1/1 [00:00<00:00, 11.04it/s]
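A quick sanity check on each reformatted record, run before it is written, can catch mismatches between mask tokens and answers. The sketch below is illustrative only: `check_example` is a hypothetical helper, and it assumes records follow the `"input"` / `"<ans>"` layout produced by `reformat_data`.

```python
import re

# Match <mask_N> but not the escaped form <<mask_N>> that reformat_data
# produces when user text itself contains angle brackets.
MASK_PATTERN = re.compile(r"(?<!<)<mask_\d+>(?!>)")

def check_example(example):
    """Return a list of problems found in one reformatted record (empty if none)."""
    problems = []
    masks_in_input = MASK_PATTERN.findall(example["input"])
    answers = example["<ans>"]
    if len(masks_in_input) != len(set(masks_in_input)):
        problems.append("duplicate mask token in input")
    for mask in masks_in_input:
        # every mask in the input needs a non-empty training target
        if not answers.get(mask):
            problems.append(f"missing or empty answer for {mask}")
    for mask in answers:
        # every answer must correspond to a mask that appears in the input
        if mask not in masks_in_input:
            problems.append(f"answer for {mask} has no mask in input")
    return problems
```

For example, `check_example(reformat_data(my_data[0]))` should return an empty list for the toy data in this tutorial.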


### 2. Training
After processing and saving your data, you can fine-tune the model with the provided script `../src/finetune_cpm_bee.py`. [Download](https://huggingface.co/openbmb/cpm-bee-10b/tree/main) the CPM-Bee weights and replace the model paths in the command below with the location of your weights. When training finishes, the model is saved as `./ckpt/test-best.pt`.

*You can also run the command below directly in a shell, which is probably the better option.*

In [5]:
!mkdir ckpt
!torchrun --nnodes=1 --nproc_per_node=2 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:12345 ../src/finetune_cpm_bee.py \
--model-config /data/cpm-bee-10b/config.json \
--load /data/cpm-bee-10b/pytorch_model.bin \
--dataset ./data/mydata \
--eval_dataset ./data/mydata \
--save ./ckpt \
--eval-interval 10 \
--save-name test \
--max-length 512 \
--epoch 10 \
--use-delta # trains with LoRA; remove `--use-delta` for full-parameter fine-tuning

mkdir: cannot create directory ‘ckpt’: File exists
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
rank :          0
local_rank :    0
world_size :    2
local_size :    2
master :        notebook-2339-cyl-ultrachat-llama:54543
device :        0
cpus :          [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1
                3, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 2
                4, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 3
                5, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 4
                6, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 5
                7, 58, 59, 60, 61, 62, 63]

rank :          1
local_rank :    1
world_size :    2
local_size :    2
master :        notebook-2339-cyl-ultrachat-llama:54543
device :        1
cpus :        

### 3. Test
After training finishes, you can load the model and chat with it to test it.

In [9]:
from cpm_live.generation.bee import CPMBeeBeamSearch
from cpm_live.models import CPMBeeTorch, CPMBeeConfig
from cpm_live.tokenizers import CPMBeeTokenizer
from opendelta import LoraModel
import torch
data_list = [
        {"document": "User: What are some of the positive impacts of Reality TV shows?\nAssistant: <mask_0>", "<ans>": {"<mask_0>": ""}},
    ]

config = CPMBeeConfig.from_json_file("/data/cpm-bee-10b/config.json")
ckpt_path = "./ckpt/test-best.pt"
tokenizer = CPMBeeTokenizer()
model = CPMBeeTorch(config=config)

# insert LoRA modules only if the model was fine-tuned with delta tuning (`--use-delta`)
delta_model = LoraModel(backbone_model=model, modified_modules=["project_q", "project_v"], backend="hf")

model.load_state_dict(torch.load(ckpt_path))
model.cuda()

# use beam search
beam_search = CPMBeeBeamSearch(
    model=model,
    tokenizer=tokenizer,
)
inference_results = beam_search.generate(data_list, max_length=100, repetition_penalty=1.1)
for res in inference_results:
    print(res["<ans>"]["<mask_0>"])

Reality TV shows have a positive impact on society in many ways. For example, they provide viewers with an opportunity to learn about different cultures and lifestyles from people who are different than themselves. They also provide viewers with role models who can inspire them to be better people. Finally, reality TV provides viewers with entertainment that is both entertaining and educational.
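The test above covers a single turn. To continue a conversation, the dialogue history has to be carried into the next input. The sketch below shows one possible way to build such a record. It is an assumption, not something this tutorial specifies: `build_chat_input` and `escape` are hypothetical helpers, earlier Assistant replies are written inline rather than masked as they were during training, and the `"document"` key simply mirrors the test cell above.

```python
def escape(text):
    """Escape angle brackets the same way reformat_data does for user text."""
    return text.replace("<", "<<").replace(">", ">>")

def build_chat_input(history, user_message):
    """Build one inference record from past (user, assistant) pairs and a new user message.

    Only the final Assistant turn is masked, so the model generates a single reply.
    """
    lines = []
    for user_text, assistant_text in history:
        lines.append("User: " + escape(user_text))
        lines.append("Assistant: " + escape(assistant_text))
    lines.append("User: " + escape(user_message))
    lines.append("Assistant: <mask_0>")
    return {"document": "\n".join(lines), "<ans>": {"<mask_0>": ""}}
```

A record built this way could be passed in a list to `beam_search.generate`, with the generated reply appended to `history` before the next turn. Because this layout differs from the fully masked training format, it is worth checking the output quality on your own model.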
