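# Data Selection
We use [Deita](https://github.com/hkust-nlp/deita), a data selection toolkit, to pick a small, high-quality subset out of an existing large-scale SFT dataset.  

Deita originates from the paper *What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning*.  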

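## Core idea of Deita
1. Sort all samples by instruction complexity and response quality.
2. Visit the sorted pool in order; a sample joins the final dataset only if its similarity to the samples already selected is below a threshold.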
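
The two steps above can be sketched in a few lines. This is a minimal illustration, not Deita's implementation: the function name `select_diverse`, the product `complexity * quality` as the sort key, the threshold value, and the toy embeddings are all assumptions made for the example.

```python
import numpy as np

def select_diverse(embeddings, complexity, quality, threshold=0.9, budget=None):
    """Greedy score-first, diversity-aware selection (illustrative sketch)."""
    emb = np.asarray(embeddings, dtype=np.float64)
    # Unit-normalise rows so that a dot product equals cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    # Step 1: visit samples from the highest combined score to the lowest.
    order = np.argsort(-(np.asarray(complexity) * np.asarray(quality)), kind="stable")
    selected = []
    for idx in order:
        # Step 2: skip a sample whose nearest selected neighbour is too similar.
        if selected and np.max(emb[selected] @ emb[idx]) >= threshold:
            continue
        selected.append(int(idx))
        if budget is not None and len(selected) >= budget:
            break
    return selected

toy_emb = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
picked = select_diverse(toy_emb, complexity=[3.0, 2.9, 1.0], quality=[3.0, 3.0, 2.0])
print(picked)
```

In the toy data the second sample points almost in the same direction as the first, so it is skipped even though its score is high.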

## Installing Deita
`pip install deita`


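## Selecting data with Deita along three dimensions
1. Data complexity
2. Data quality
3. Data diversity

### 1. Data complexity
Data complexity is assessed with the [hkust-nlp/deita-complexity-scorer](https://huggingface.co/hkust-nlp/deita-complexity-scorer) model, which was obtained by further training Llama1-13B.

The complexity scorer is trained as follows:

1. ChatGPT is prompted to increase the complexity of the current instruction step by step. Each original sample is evolved into 5 progressively more complex samples, and 6,000 samples are collected in total. The prompt mainly adds constraints or deepens the question: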
```
I want you act as a Prompt Rewriter.
Your objective is to rewrite a given prompt into a more complex version to
make those famous AI systems (e.g., ChatGPT and GPT4) a bit harder to handle.
But the rewritten prompt must be reasonable and must be understood and
responded by humans.
Your rewriting cannot omit the non-text parts such as the table and code in
#Given Prompt#:. Also, please do not omit the input in #Given Prompt#.
You SHOULD complicate the given prompt using the following method:
Please add one more constraints/requirements into #Given Prompt#
You should try your best not to make the #Rewritten Prompt# become verbose,
#Rewritten Prompt# can only add 10 to 20 words into #Given Prompt#.
‘#Given Prompt#’, ‘#Rewritten Prompt#’, ‘given prompt’ and ‘rewritten prompt’
are not allowed to appear in #Rewritten Prompt#
#Given Prompt#:
<Here is instruction>
#Rewritten Prompt#:
```
2. The collected samples are then organized into the following format and used to fine-tune the Llama1 model.
```
You are a helpful assistant. Please identify the complexity score of the following user query. \n##Query: {instruction}  \n##Complexity: 
```
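
To make the two steps concrete, the sketch below evolves one instruction five times and then renders each variant in the scorer prompt format shown above. `rewrite_fn` stands in for the ChatGPT call and `fake_rewrite` is a stub invented for illustration. The score labels used here (position in the chain) are hypothetical placeholders; how the real labels are assigned is outside this sketch.

```python
COMPLEXITY_TEMPLATE = (
    "You are a helpful assistant. Please identify the complexity score of the "
    "following user query. \n##Query: {instruction}  \n##Complexity: "
)

def evolve(instruction, rewrite_fn, steps=5):
    """Return the original instruction followed by `steps` evolved variants."""
    chain = [instruction]
    for _ in range(steps):
        chain.append(rewrite_fn(chain[-1]))
    return chain

def to_training_text(instruction, score):
    """Prompt in the scorer format with the target score appended."""
    return COMPLEXITY_TEMPLATE.format(instruction=instruction) + str(score)

def fake_rewrite(prompt):
    # Stub standing in for a ChatGPT call that adds one constraint.
    return prompt + " (plus one more constraint)"

chain = evolve("Give three tips for staying healthy.", fake_rewrite)
# Hypothetical labels: the position in the chain is used purely as a placeholder.
examples = [to_training_text(text, score) for score, text in enumerate(chain, start=1)]
print(len(chain))
print(examples[0])
```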

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np
from scipy.special import softmax

model_name = "hkust-nlp/deita-complexity-scorer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# load_in_4bit=True loads the weights in 4-bit precision, which reduces GPU memory
# usage when memory is tight (this path requires the bitsandbytes package).
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, device_map="auto")
```

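We then call the generation method built into transformers. With greedy search, we read the model's scores for each candidate rating at the first generated position, convert them into probabilities, and take the probability-weighted sum as the final complexity score of the instruction.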

```python
def infer_complexity(model, tokenizer, input_text):
    complexity_template = ("You are a helpful assistant. Please identify the complexity score of the following user query. \n##Query: {instruction}  \n##Complexity: ")
    user_input = complexity_template.format(instruction=input_text)
    # Move the prompt to the model's device (needed when the model sits on a GPU).
    input_ids = tokenizer.encode(user_input, return_tensors="pt").to(model.device)
    print("User input:", user_input)
    print("Tokenized user input:", input_ids)

    max_length = 512
    outputs = model.generate(input_ids,
                             max_length=max_length,
                             num_return_sequences=1,
                             return_dict_in_generate=True,
                             output_scores=True)
    # Vocabulary scores at the first generated position; only this step is used.
    logprobs_list = outputs.scores[0][0].float().cpu()
    print("Generation scores:", logprobs_list)
    score_logits = []
    # Token ids of the digits "1" to "6" in the scorer's tokenizer.
    id2score = {
        29896: "1",
        29906: "2",
        29941: "3",
        29946: "4",
        29945: "5",
        29953: "6"
    }
    score_template = np.array([1,2,3,4,5,6])
    for k in id2score:
        score_logits.append(logprobs_list[k])
    score_logits = np.array(score_logits)
    score_npy = softmax(score_logits, axis=0)
    print("Logits at the positions of scores 1-6:", score_logits)
    print("Probabilities of scores 1-6:", score_npy)
    score_npy = score_npy * score_template

    score_npy = np.sum(score_npy, axis=0)
    print("Final score = probability-weighted sum over scores 1-6:", score_npy)
    return score_npy

input_text = "write a performance review for a junior data scientist"
complexity_score = infer_complexity(model, tokenizer, input_text)
```

Output:
```
User input: You are a helpful assistant. Please identify the complexity score of the following user query. 
##Query: write a performance review for a junior data scientist  
##Complexity: 
Tokenized user input: tensor([[    1,   887,   526,   263,  8444, 20255, 29889,  3529, 12439,   278,
         13644,  8158,   310,   278,  1494,  1404,  2346, 29889, 29871,    13,
          2277,  3010, 29901,  2436,   263,  4180,  9076,   363,   263, 20183,
           848,  9638,   391,   259,    13,  2277,  8909, 29916,   537, 29901,
         29871]])
Generation scores: tensor([ -7.9375, -23.4375,   8.3828,  ...,  -4.6406,  -2.2578,  -2.6875])
Logits at the positions of scores 1-6: [18.859375  24.484375  21.453125  15.9296875 14.0078125 12.984375 ]
Probabilities of scores 1-6: [3.4279895e-03 9.5048648e-01 4.5865994e-02 1.8310170e-04 2.6793698e-05
 9.6285166e-06]
Final score = probability-weighted sum over scores 1-6: 2.042923080154651
```
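
The final number can be reproduced by hand from the six logits in the output above. The snippet below is a plain-Python check of the softmax and the weighted sum; the variable names are chosen for this illustration only.

```python
import math

logits = [18.859375, 24.484375, 21.453125, 15.9296875, 14.0078125, 12.984375]
scores = [1, 2, 3, 4, 5, 6]

# Numerically stable softmax: subtract the maximum before exponentiating.
peak = max(logits)
exps = [math.exp(x - peak) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Expected score under the six-way distribution.
expected_score = sum(p * s for p, s in zip(probs, scores))
print(round(expected_score, 4))
```

Almost all of the probability mass sits on score 2, with a small share on score 3, which is why the result lands slightly above 2.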

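### 2. Data quality
For data quality, a strategy similar to the one for complexity is used: ChatGPT is prompted to produce progressively higher-quality data. For example, the following template improves the helpfulness of a response: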

```
I want you to act as a Response Rewriter
Your goal is to enhance the quality of the response given by an AI assistant
to the #Given Prompt# through rewriting.
But the rewritten response must be reasonable and must be understood by humans.
Your rewriting cannot omit the non-text parts such as the table and code in
#Given Prompt# and #Given Response#. Also, please do not omit the input
in #Given Prompt#.
You Should enhance the quality of the response using the following method:
Please make the Response more helpful to the user.
You should try your best not to make the #Rewritten Response# become verbose,
#Rewritten Response# can only add 10 to 20 words into #Given Response#.
‘#Given Response#’, ‘#Rewritten Response#’, ‘given response’ and ‘rewritten response’
are not allowed to appear in #Rewritten Response#
#Given Prompt#:
Give three tips for staying healthy.
#Given Response#:
<Response>
#Rewritten Response#:
```

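Quality is improved mainly along several dimensions, including Helpfulness, Relevance, and Depth. Once data of progressively higher quality has been obtained, it is used to build scored data and train a quality scorer. The official [hkust-nlp/deita-quality-scorer](https://huggingface.co/hkust-nlp/deita-quality-scorer) is used much like the complexity scorer: it predicts the final score from a prompt.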

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np
from scipy.special import softmax
model_name = "hkust-nlp/deita-quality-scorer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```

```python
def infer_quality(model, tokenizer, input_text, resp_text):
    quality_template = (
        "You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. \n #Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: ")
    user_input = quality_template.format(
        instruction=input_text, output=resp_text)
    # Move the prompt to the model's device (a no-op when the model is on the CPU).
    input_ids = tokenizer.encode(user_input, return_tensors="pt").to(model.device)
    print("User input:", user_input)
    print("Tokenized user input:", input_ids)

    max_length = 512
    outputs = model.generate(input_ids, max_length=max_length, num_return_sequences=1,
                             return_dict_in_generate=True, output_scores=True)
    # Vocabulary scores at the first generated position; only this step is used.
    logprobs_list = outputs.scores[0][0].float().cpu()
    print("Generation scores:", logprobs_list)
    score_logits = []
    # Token ids of the digits "1" to "6" in the scorer's tokenizer.
    id2score = {
        29896: "1",
        29906: "2",
        29941: "3",
        29946: "4",
        29945: "5",
        29953: "6"
    }
    score_template = np.array([1, 2, 3, 4, 5, 6])
    for k in id2score:
        score_logits.append(logprobs_list[k])
    score_logits = np.array(score_logits)
    # Convert the six logits into probabilities (this step was missing before).
    score_npy = softmax(score_logits, axis=0)
    print("Logits at the positions of scores 1-6:", score_logits)
    print("Probabilities of scores 1-6:", score_npy)
    score_npy = score_npy * score_template

    score_npy = np.sum(score_npy, axis=0)
    print("Final score = probability-weighted sum over scores 1-6:", score_npy)
    return score_npy


input_text = "word to describe UI with helpful tooltips"  # Example Input
output_text = "User-friendly or intuitive UI"  # Example Output
quality_score = infer_quality(model, tokenizer, input_text, output_text)

print(quality_score)
```

Output:
```
User input: You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. 
 #Question#:
word to describe UI with helpful tooltips
#Response#:
User-friendly or intuitive UI 
##Quality: 
Tokenized user input: tensor([[    1,   887,   526,   263,  8444, 20255, 29889,  3529, 12439,   278,
         11029,  8158,   310,   278, 13291,  6590,   304,   278,   894, 29889,
         29871,    13,   396, 16492, 29937, 29901,    13,  1742,   304,  8453,
          3740,   411,  8444,  5780,  2034,   567,    13, 29937,  5103, 29937,
         29901,    13,  2659, 29899, 18326,   368,   470, 27951,   573,  3740,
         29871,    13,  2277, 24399,   537, 29901, 29871]])
Generation scores: tensor([ -7.2109, -18.8594,  10.8828,  ...,  -3.5664,  -0.7129,   1.8398])
Logits at the positions of scores 1-6: [15.90625   23.515625  22.90625   16.40625   12.8203125 10.9375   ]
Probabilities of scores 1-6: [3.2088807e-04 6.4723665e-01 3.5189646e-01 5.2905490e-04 1.4660470e-05
 2.2307599e-06]
Final score = probability-weighted sum over scores 1-6: 2.352686479498516
2.352686479498516
```

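Note that, to speed up inference in practice, Deita also provides an interface for running inference with [vllm](https://github.com/vllm-project/vllm). vllm raises inference throughput mainly through ideas such as PagedAttention and continuous batching. These techniques and their code will be covered in detail later in the course; for now it is enough to know how to use the interface: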

```python
# Source file: deita/src/deita/selection/scorer/base.py (excerpt from the scorer's constructor)
if not is_vllm:
    self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    self.model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
else:
    from vllm import LLM, SamplingParams
    self.llm = LLM(model_name_or_path)
    self.sampling_params = SamplingParams(max_tokens=2, logprobs=1000)
```

vllm's speed advantage shows mainly in large-batch inference.

Finally, we run the following script to score the current dataset:
```bash
# Predict the complexity of every sample
SCORETYPE="complexity"
DATAPATH="./sg_52k.json"
OUTPUTPATH="./output/dieta/complexity_sg_52k.json"
MODELPATH="hkust-nlp/deita-complexity-scorer"
SCORER="llama"
ISVLLM=false

python ./score_dataset.py \
    --data_path $DATAPATH \
    --output_path $OUTPUTPATH \
    --score_type $SCORETYPE \
    --scorer $SCORER \
    --scorer_name_or_path $MODELPATH

# Predict the quality of every sample
SCORETYPE="quality"
DATAPATH="./output/dieta/complexity_sg_52k.json"
OUTPUTPATH="./output/dieta/complexity_quality_sg_52k.json"
MODELPATH="hkust-nlp/deita-quality-scorer"
SCORER="llama"
ISVLLM=false

python ./score_dataset.py \
    --data_path $DATAPATH \
    --output_path $OUTPUTPATH \
    --score_type $SCORETYPE \
    --scorer $SCORER \
    --scorer_name_or_path $MODELPATH
```

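The core scorer script is in `./run_scorer.sh`. Scoring is slow, so we recommend speeding it up with vLLM or data-parallel computation (see [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) for reference), or simply sharding the data, scoring each shard on its own GPU, and merging the results afterwards.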
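
The sharding approach can be sketched as follows. The helper names `shard` and `merge` and the round-robin split are assumptions for illustration; in practice each shard would be written to its own file and scored by a separate process (for example one per GPU) before merging.

```python
def shard(samples, num_shards):
    """Split samples round-robin, remembering each sample's original position."""
    shards = [[] for _ in range(num_shards)]
    for pos, sample in enumerate(samples):
        shards[pos % num_shards].append((pos, sample))
    return shards

def merge(scored_shards):
    """Restore the original order after every shard has been scored."""
    flat = [item for part in scored_shards for item in part]
    return [sample for _, sample in sorted(flat, key=lambda pair: pair[0])]

data = [{"id": i} for i in range(7)]
parts = shard(data, num_shards=3)
# Stand-in for the real scorer: attach a dummy score to every sample.
scored = [[(pos, {**s, "score": 0.0}) for pos, s in part] for part in parts]
merged = merge(scored)
print([len(p) for p in parts], [m["id"] for m in merged])
```

Keeping the original position next to each sample makes the merge independent of the order in which the shards finish.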

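### 3. Data diversity
Data diversity is enforced by computing the cosine similarity between the current sample and its nearest neighbor in the sample pool. The core code has two parts: sample embedding and cosine similarity computation.

For sample embedding, Deita by default uses Llama1-13B to embed the entire pool to be filtered, turning every sample into a vector. The core code is shown below.

The data pool to be filtered is stored in sharegpt format, which looks like this: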

```
{
    "id": "QWJhYvA_0",
    "conversations": [
      {
        "from": "human",
        "value": "Summarize the main ideas of Jeff Walker's Product Launch Formula into bullet points as it pertains to a growth marketing agency implementing these strategies and tactics for their clients..."
      },
      {
        "from": "gpt",
        "value": "Here are the main ideas of Jeff Walker's Product Launch Formula that can be applied by a growth marketing agency for their clients:\n\n1. Identify the target audience and their needs: Understand the ideal customer for the product or service, and create a messaging that resonates with them.\n2. Pre-launch: Build anticipation and excitement for the launch by creating buzz, gathering testimonials and case studies, and using social media to create awareness.\n3. Launch: Use a well-crafted launch sequence to maximize sales and conversions. This can include offering bonuses, creating scarcity, and using a deadline to create urgency.\n4. Post-launch: Follow up with customers, gather feedback, and continue to provide value to keep them engaged and loyal.\n5. Create a product suite: Once the initial product is successful, expand the offering by creating additional products or services that address other needs of the same audience.\n6. Continual optimization: Continually monitor and optimize the launch process and product suite to improve results.\n7. Build a community: Use the launch process to build a community of customers who are passionate about the product and can help spread the word.\n8. Use automation: Use technology and automation to streamline the launch process and improve efficiency."
      },
      {
        "from": "human",
        "value": "Summarize the main ideas of Brendon Burchard's Experts Academy into bullet points as it pertains to a growth marketing agency implementing these strategies and tactics for their clients..."
      },
      {
        "from": "gpt",
        "value": "Here are the main ideas of Brendon Burchard's Experts Academy that can be applied by a growth marketing agency for their clients:\n\n1. Identify your expertise: Define your area of expertise and create a unique value proposition that differentiates you from others.\n2. Create a content strategy: Develop a content strategy that showcases your expertise and provides value to your audience. This can include blog posts, videos, podcasts, and social media content.\n3. Build your audience: Use social media, advertising, and other channels to build your audience and increase your visibility.\n4. Develop your products: Create products and services that align with your expertise and meet the needs of your audience. These can include digital products, courses, coaching, and consulting services.\n5. Create a brand: Develop a strong brand that reflects your expertise and resonates with your audience.\n6. Build authority: Establish yourself as an authority in your industry by speaking at events, publishing a book, or appearing on podcasts or TV shows.\n7. Monetize your expertise: Develop a monetization strategy that leverages your expertise and products to generate revenue.\n8. Build a team: As your business grows, build a team of experts to help you scale your business and provide excellent service to your clients.\n9. Continual improvement: Continually improve your products, services, and marketing strategies to stay ahead of the competition and provide the best possible experience for your clients.\n10. Focus on impact: Ultimately, focus on making a positive impact in the lives of your clients and using your expertise to make a difference in the world."
      }
    ]
}
```
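
A record in this format can be split into (instruction, response) turns before scoring or embedding. The sketch below is illustrative only: the helper name `iter_turns` and the shortened sample are made up for the example, and it assumes strictly alternating `human` / `gpt` messages.

```python
import json

def iter_turns(record):
    """Yield (human_text, gpt_text) pairs from one sharegpt-style record."""
    msgs = record["conversations"]
    # Pair message 0 with 1, 2 with 3, and so on.
    for first, second in zip(msgs[::2], msgs[1::2]):
        if first["from"] == "human" and second["from"] == "gpt":
            yield first["value"], second["value"]

raw = """
{
  "id": "demo_0",
  "conversations": [
    {"from": "human", "value": "Question A"},
    {"from": "gpt", "value": "Answer A"},
    {"from": "human", "value": "Question B"},
    {"from": "gpt", "value": "Answer B"}
  ]
}
"""
turns = list(iter_turns(json.loads(raw)))
print(turns)
```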

First, build the data for the embedder:

```python
# Source file: deita/src/deita/selection/embedder/clm_embedder.py
# Excerpt: preprocess and DataCollatorForSupervisedDataset are defined in the Deita source.
from functools import partial

import torch
from datasets import Dataset

def encode_samples(self, data):
    # Extract the conversations from the sharegpt-format JSON data
    conversations = [item["conversations"] for item in data]

    # Build the data buffer, converting each element into a format supported by Hugging Face datasets
    dataset_buf, data_size = self.create_databuffer(conversations, sort_by_length = True)
    # Build the dataset with the Hugging Face Dataset class
    raw_dataset = Dataset.from_list(dataset_buf)

    # Preprocess: render the final model input (apply the template, add markers, concatenate) and tokenize
    preprocess_func = partial(preprocess, 
                            conv_template = self.conv_template,
                            only_answer = self.only_answer,
                            max_length = self.max_length,
                            tokenizer = self.tokenizer)
    
    # Tokenization runs on the main process first; the other processes then run the same map afterwards
    with self.accelerator.main_process_first():
        # Dataset.map applies a function to the elements of the dataset
        tokenized_datasets = raw_dataset.map(
            preprocess_func,
            batched = True,
            num_proc = 8,
            remove_columns = ["conversations", "specific_length"],
            desc = "Tokenizing and reformatting instruction data"
        )
    # The collator assembles dataset elements into batches (padding, conversion to tensors, and so on)
    data_collator = DataCollatorForSupervisedDataset(tokenizer = self.tokenizer)
    # Build the dataloader
    dataloader = torch.utils.data.DataLoader(tokenized_datasets, batch_size = self.batch_size_per_device, collate_fn = data_collator)
```

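Once the data is prepared, the embeddings can be computed batch by batch. To speed up inference, Deita supports multi-GPU data-parallel inference here, which involves some of accelerator's distributed APIs and communication concepts such as gather that we covered in the distributed training course. Reading the code closely is recommended: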

```python
# Continuation of encode_samples (excerpt; dist is torch.distributed)
# Let accelerator prepare the model and dataloader (distributed data-parallel inference)
model, dataloader = self.accelerator.prepare(self.model, dataloader)

all_embeddings_list = []

total_samples = len(tokenized_datasets)
total_batches = len(dataloader)
# The last batch may not be full, so it needs special handling
last_batch_size = total_samples % self.minibatch_size if total_samples % self.minibatch_size != 0 else self.minibatch_size

# Iterate over the batches and run inference
for b_idx, batch in enumerate(tqdm(dataloader, total = len(tokenized_datasets) // self.minibatch_size, disable = not self.accelerator.is_local_main_process)):
    
    model.eval()

    batch_idx = batch["idx"]
    attention_mask = batch["attention_mask"]

    # Run the model and return the hidden states; attention_mask accounts for pad tokens so they do not interfere
    outputs = model(input_ids = batch["input_ids"], attention_mask = batch["attention_mask"], output_hidden_states = True)
    
    seq_len = attention_mask.sum(1, keepdim = True)
    
    # Tokenizers differ in where they put padding, so both cases are handled
    if self.tokenizer.padding_side == "right":
        # Right padding: use the last non-pad token's representation as the sentence representation
        last_hidden_state = outputs.hidden_states[-1][torch.arange(seq_len.size(0))[:, None], seq_len - 1]
    elif self.tokenizer.padding_side == "left":    
        # Left padding: the final token is always a non-pad token, so take it directly
        last_hidden_state = outputs.hidden_states[-1][:, -1]
    else:
        raise ValueError("Invalid padding strategy")
    
    sample_idx = batch_idx.tolist()
    sample_dict = [{"embedding": lst_hs, "idx": s_id} for lst_hs, s_id in zip(last_hidden_state.tolist(), sample_idx)]
    
    # Multi-GPU data-parallel inference; for more details see: https://medium.com/@geronimo7/llms-multi-gpu-inference-with-accelerate-5a8333e4c5db
    if self.world_size > 1:
        all_process_embeddings = [[] for _ in range(self.world_size)]
        # Gather the results from all GPUs in rank order onto the main process
        dist.gather_object(sample_dict, all_process_embeddings if dist.get_rank() == 0 else None, dst=0)
    else:
        all_process_embeddings = [sample_dict]
    
    # On the main process, extend the final result list with this batch's results
    if self.accelerator.is_local_main_process:
        if b_idx == total_batches - 1:
            for process_list in all_process_embeddings[:last_batch_size]:
                all_embeddings_list.extend(process_list)
        else:
            for process_list in all_process_embeddings:
                all_embeddings_list.extend(process_list)   

return all_embeddings_list  # final result, cached to disk afterwards
```
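
The padding-side logic can be checked on a tiny example. The NumPy sketch below mirrors the indexing above on a made-up batch; `last_token_states` is a name invented for this illustration and is not part of Deita.

```python
import numpy as np

def last_token_states(hidden, attention_mask, padding_side):
    """Pick the hidden state of the last non-pad token of every sequence."""
    hidden = np.asarray(hidden)
    mask = np.asarray(attention_mask)
    if padding_side == "right":
        last = mask.sum(axis=1) - 1  # index of the last real token per row
        return hidden[np.arange(hidden.shape[0]), last]
    if padding_side == "left":
        return hidden[:, -1]  # the final position is always a real token
    raise ValueError("Invalid padding strategy")

# Batch of 2 sequences, length 3, hidden size 1; each value encodes its position.
hidden = np.array([[[0.0], [1.0], [2.0]],
                   [[0.0], [1.0], [2.0]]])
right_mask = np.array([[1, 1, 1], [1, 1, 0]])  # second sequence has one pad on the right
pooled = last_token_states(hidden, right_mask, "right")
print(pooled.ravel())
```

With right padding the second sequence is represented by position 1 rather than by the pad at position 2, which is exactly what the `seq_len - 1` index achieves.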

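After embedding, the actual selection step decides based on the maximum cosine similarity between the current sample and the samples already selected: if the current sample is very similar to its nearest neighbor, it is not added to the candidate dataset (food for thought: are there other strategies?).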

```python
# Source file: deita/src/deita/selection/embedder/utils.py
def filter(self, df):
    
    logger.info(f"Data number before filtering: #{len(df)}")
    
    df_sorted = self._sort(df)

    embeddings = df_sorted[self.embedding_field]  # embeddings of all samples
    embeddings = np.array(embeddings.values.tolist())  # after sorting, convert to a NumPy array
    
    filtered_indices = [0]

    start_cnt = 0
    # Note: for efficiency, similarity is checked one batch at a time
    for i in tqdm(range(1, embeddings.shape[0], self.batch_size), total = embeddings.shape[0] // self.batch_size):

        cur_emb = torch.tensor(embeddings[i:i+self.batch_size], dtype = torch.float32).to(self.device)
        
        if cur_emb.ndim == 4:
            cur_emb = cur_emb.squeeze(1).squeeze(1)

        if cur_emb.ndim == 1:
            cur_emb = cur_emb.unsqueeze(0)

        # torch.range is deprecated; torch.arange with an exclusive end yields the same indices
        batch_idx = torch.arange(i, i + cur_emb.size(0), dtype = torch.int64).to(self.device)
        
        # Convert to a tensor so that tensor-only calls such as unsqueeze are valid
        existing_emb = torch.tensor(embeddings[filtered_indices], dtype = torch.float32).to(self.device)

        if existing_emb.ndim == 1:
            existing_emb = existing_emb.unsqueeze(0)

        # Similarity between the current batch and the samples already selected; drop a sample that is too close to any of them
        # distance_chunk_by_chunk works chunk by chunk to avoid slow computation on large inputs
        distance_existed = self.distance_chunk_by_chunk(existing_emb, cur_emb)
        distance_existed_bool = torch.any(distance_existed > self.threshold, dim = 1)
        
        # Similarity among the samples inside the current batch; drop a sample that is too close to another one
        distance_cur = self.distance_chunk_by_chunk(cur_emb, cur_emb)

        # Keep only the lower triangle without the diagonal (no self-comparison, no duplicate pairs)
        distance_cur = distance_cur.tril(-1)
        
        distance_cur_bool = torch.any(distance_cur > self.threshold, dim = 1)
        
        # The union of both masks marks the samples to drop
        distance_bool = distance_existed_bool | distance_cur_bool
        
        # The remaining samples are added to filtered_indices
        filtered_indices.extend(batch_idx[~distance_bool].tolist())

        if len(filtered_indices) - start_cnt > 1000:
            logger.info("Now data number: #{}".format(len(filtered_indices)))
            start_cnt = len(filtered_indices)

        if self.data_size > -1:
            if len(filtered_indices) >= self.data_size:
                break
        
    # Take out the samples that were kept
    df_filtered = df_sorted.iloc[filtered_indices]        
    logger.info(f"Data number after filtering: #{len(df_filtered)}")
    
    if self.data_size > -1:
        # Only the first data_size samples enter the final dataset; they are already sorted by complexity_score*quality_score
        return df_filtered[:self.data_size]
    else:
        return df_filtered
```
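
The lower-triangle trick for within-batch deduplication is easy to verify on toy data. In this NumPy sketch (the variable names and the 0.9 threshold are illustrative), each row is compared only with earlier rows of the same batch, so the first of two near-duplicates is kept and the later one is flagged.

```python
import numpy as np

batch = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.01]])
unit = batch / np.linalg.norm(batch, axis=1, keepdims=True)
sim = unit @ unit.T                    # pairwise cosine similarity
sim_lower = np.tril(sim, k=-1)         # keep entries with j < i, drop the diagonal
drop = (sim_lower > 0.9).any(axis=1)   # True where an earlier row is too similar
print(drop)
```

Without `k=-1` every row would match itself on the diagonal and the whole batch would be dropped.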

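The cosine similarity code is also worth studying: normalize each vector to unit length first, then take the dot product (there is no need to divide by the product of the norms afterwards, because the vectors are already normalized).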

```python
# Source file: deita/src/deita/selection/embedder/utils.py
def compute_distance(self, matrix, matrix_2):
    """
        Compute pairwise cosine similarity or Manhattan distance with PyTorch
    """
    
    # Optionally normalize the embeddings first
    if self.normalize_emb:
        # Scale every vector to unit length
        matrix = matrix / matrix.norm(dim=1)[:, None]
        matrix_2 = matrix_2 / matrix_2.norm(dim=1)[:, None]

    # Metric: cosine similarity
    if self.distance_metric == 'cosine':
        # Scale every vector to unit length (harmless if already normalized above)
        matrix_norm = matrix / matrix.norm(dim=1)[:, None]
        matrix_2_norm = matrix_2 / matrix_2.norm(dim=1)[:, None]
        # The matrix product of unit vectors is the cosine similarity
        return torch.mm(matrix_norm, matrix_2_norm.t())
    # Metric: Manhattan distance
    elif self.distance_metric == 'manhattan':
        # Pairwise L1 distance between the two matrices
        return torch.cdist(matrix[None], matrix_2[None], p = 1).squeeze(0)
    else:
        # Unsupported metric
        raise ValueError("Metric not supported. Only support cosine and manhattan")
```
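
The claim that a dot product of unit vectors equals cosine similarity can be checked numerically. The NumPy sketch below is a standalone illustration (the function name `cosine_matrix` and the sample vectors are made up) that compares the normalize-then-multiply route with the textbook formula.

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarity via unit-normalised rows and one matrix product."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

a = [[3.0, 4.0], [1.0, 0.0]]
b = [[6.0, 8.0], [0.0, 2.0]]
sim = cosine_matrix(a, b)
# Textbook formula for one pair: dot product divided by the product of the norms.
direct = np.dot(a[0], b[1]) / (np.linalg.norm(a[0]) * np.linalg.norm(b[1]))
print(sim)
print(direct)
```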

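The complete embedding code is in `./embed_datasets.py`; the launch command is in `run_embed_datasets.sh`.

### Putting it together

Once embedding is done, the `run_combined_filter.sh` script performs the selection based on the precomputed scores and the embedding file.

The full score, embed, and filter pipeline can be run by following `./deita_tutorial/run_deita.sh`.

### Notes
Running Deita requires substantial GPU resources. If these are not available, download the already-filtered data from this [link](https://huggingface.co/datasets/hkust-nlp/deita-10k-v0) for the model fine-tuning experiments.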
