# Fine-Tuning an LLM to Follow Human Instructions

In [1]:
import json
import os

import matplotlib.pyplot as plt
import tiktoken
import torch

Outline:
1. Build the instruction dataset
2. Fine-tune the LLM on the instruction dataset
3. Evaluate the LLM with a judge model

## 1. Building the Instruction Dataset

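### 1.1 First, we need a utility class that provides the following:
1. Convert raw instruction-input-output entries into the Alpaca prompt template;
2. Pad every token sequence in a batch to the length of the longest sequence in that batch;
3. Use -100 as the mask value so that the padding tokens are ignored during training.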

In [2]:
from typing import Dict, Iterable, List, Optional, Tuple


class BatchingTool:
    @staticmethod
    def load_corpus(corpus_file_path: str) -> List[Dict[str, str]]:
        with open(corpus_file_path, "r", encoding="utf-8") as f:
            data = json.load(f)
        return data

    @staticmethod
    def format_input(item: Dict[str, str]) -> str:
        instruction_text = (
            f"Below is an instruction that describes a task. "
            f"Write a response that appropriately completes the request."
            f"\n\n### Instruction:\n{item['instruction']}"
        )
        input_text = f"\n\n### Input:\n{item['input']}" if item["input"] else ""
        return instruction_text + input_text
    
    @staticmethod
    def format_output(item: Dict[str, str]) -> str:
        output_text = f"\n\n### Response:\n{item['output']}"
        return output_text
    
    @staticmethod
    def format_full(item: Dict[str, str]) -> str:
        return BatchingTool.format_input(item) + BatchingTool.format_output(item)

    @staticmethod
    def collate(
        batch: Iterable[List[int]],
        pad_token_id: int = 50256,
        ignore_index: int = -100,
        allowed_max_length: Optional[int] = None,
        device: str  = "cpu"
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        inputs_batch = []
        targets_batch = []
        # +1 leaves room for the end-of-text token appended in `pad`
        max_len_in_batch = max(len(item) + 1 for item in batch)
        for item in batch:
            item_padded = BatchingTool.pad(item, max_len_in_batch, pad_token_id)
            inputs, targets = BatchingTool.split(item_padded)
            targets = BatchingTool.mask(targets, pad_token_id, ignore_index)
            if allowed_max_length is not None:
                inputs, targets = BatchingTool.truncate(inputs, targets, allowed_max_length)
            inputs_batch.append(inputs)
            targets_batch.append(targets)
        inputs_tensor = torch.stack(inputs_batch).to(device)
        targets_tensor = torch.stack(targets_batch).to(device)
        return inputs_tensor, targets_tensor
    
    @staticmethod
    def pad(item: List[int], max_len: int, pad_token_id: int) -> List[int]:
        # Append one end-of-text token, then pad up to max_len with the same id
        new_item = item.copy()
        new_item += [pad_token_id]
        padded = new_item + [pad_token_id] * (max_len - len(new_item))
        return padded
    
    @staticmethod
    def split(item: List[int]) -> Tuple[torch.Tensor, torch.Tensor]:
        # Targets are the inputs shifted one position to the left
        inputs = torch.tensor(item[:-1])
        targets = torch.tensor(item[1:])
        return inputs, targets
    
    @staticmethod
    def mask(targets: torch.Tensor, pad_token_id: int, ignore_index: int = -100) -> torch.Tensor:
        # Keep the first padding token (it acts as the end-of-text target)
        # and replace every later one with ignore_index
        masked = targets == pad_token_id
        indices = torch.nonzero(masked).squeeze()
        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index
        return targets
    
    @staticmethod
    def truncate(inputs: torch.Tensor, targets: torch.Tensor, allowed_max_length: int) -> Tuple[torch.Tensor, torch.Tensor]:
        inputs = inputs[:allowed_max_length]
        targets = targets[:allowed_max_length]
        return inputs, targets

Load the raw instruction-input-output text data:

In [3]:
data = BatchingTool.load_corpus(corpus_file_path="./instruction-data.json")

from pprint import pprint
pprint(data)

[{'input': 'freind --> friend',
  'instruction': 'Evaluate the following phrase by transforming it into the '
                 'spelling given.',
  'output': 'The spelling of the given phrase "freind" is incorrect, the '
            'correct spelling is "friend".'},
 {'input': 'He go to the park every day.',
  'instruction': 'Edit the following sentence for grammar.',
  'output': 'He goes to the park every day.'},
 {'input': '',
  'instruction': 'Convert 45 kilometers to meters.',
  'output': '45 kilometers is 45000 meters.'},
 {'input': '',
  'instruction': "Rewrite this sentence to start with 'Although': Despite the "
                 'rain, they went for a walk.',
  'output': 'Although it was raining, they went for a walk.'},
 {'input': '',
  'instruction': 'What are the first 10 square numbers?',
  'output': '1, 4, 9, 16, 25, 36, 49, 64, 81, 100.'},
 {'input': '',
  'instruction': 'Suggest a more formal synonym for "happy."',
  'output': 'A more formal synonym for "happy" is "conte

Reformat an instruction-input-output entry with the Alpaca template:

In [4]:
example_data = data[50]
example_text = BatchingTool.format_full(example_data)
print(example_text)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Identify the correct spelling of the following word.

### Input:
Ocassion

### Response:
The correct spelling is 'Occasion.'
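
The entry above has a non-empty `input`, so the prompt contains an `### Input:` section. When `input` is empty, `format_input` leaves that section out. The sketch below re-implements the same template as a standalone function (the name `alpaca_prompt` is made up here so the snippet runs outside the notebook) and applies it to one of the input-free entries printed earlier.

```python
from typing import Dict


def alpaca_prompt(item: Dict[str, str]) -> str:
    """Standalone copy of the template used by BatchingTool.format_full."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{item['instruction']}"
    )
    # The Input section is only emitted for non-empty inputs
    if item["input"]:
        prompt += f"\n\n### Input:\n{item['input']}"
    return prompt + f"\n\n### Response:\n{item['output']}"


# Entry copied from the raw data shown earlier (empty "input" field)
no_input_entry = {
    "instruction": "Convert 45 kilometers to meters.",
    "input": "",
    "output": "45 kilometers is 45000 meters.",
}
prompt_text = alpaca_prompt(no_input_entry)
print(prompt_text)
```

The printed prompt goes straight from `### Instruction:` to `### Response:`, which is the layout the model sees for roughly half of the entries shown in the data listing.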


Collate dataset samples into batches:

In [5]:
inputs_1 = [0, 1, 2, 3, 4]
inputs_2 = [5, 6]
inputs_3 = [7, 8, 9]

batch = (
    inputs_1,
    inputs_2,
    inputs_3
)

pad_token_id = tiktoken.get_encoding("gpt2").encode("<|endoftext|>", allowed_special={"<|endoftext|>"})[-1]

inputs, targets = BatchingTool.collate(batch=batch, pad_token_id=pad_token_id)
print(f"inputs are: \n{inputs}")
print(f"targets are: \n{targets}")


inputs are: 
tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
targets are: 
tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256,  -100,  -100,  -100],
        [    8,     9, 50256,  -100,  -100]])
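
The `-100` entries in `targets` matter because PyTorch's `torch.nn.functional.cross_entropy` skips any target equal to its `ignore_index` argument, whose default is `-100`. To show what "skipping" means numerically without depending on the notebook's model, here is a small pure-Python sketch; the function `masked_cross_entropy` and the logits are hypothetical and only mimic that behavior.

```python
import math
from typing import List

IGNORE_INDEX = -100  # same sentinel value used by the collate function above


def masked_cross_entropy(logits: List[List[float]], targets: List[int]) -> float:
    """Mean negative log-likelihood over positions whose target is not IGNORE_INDEX."""
    losses = []
    for row, target in zip(logits, targets):
        if target == IGNORE_INDEX:
            continue  # padded position: contributes nothing to the loss
        # Numerically stable log-sum-exp over the vocabulary
        max_logit = max(row)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in row))
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)


# Hypothetical logits over a 3-token vocabulary for 3 positions
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.0, 0.0, 5.0]]

loss_all = masked_cross_entropy(logits, [0, 1, 0])
loss_masked = masked_cross_entropy(logits, [0, 1, IGNORE_INDEX])
loss_first_two = masked_cross_entropy(logits[:2], [0, 1])

print(f"all positions:       {loss_all:.4f}")
print(f"last position masked: {loss_masked:.4f}")
print(f"first two only:       {loss_first_two:.4f}")
```

Masking the last position gives the same loss as dropping it entirely, so padded positions neither reward nor penalize the model. The first padding token is deliberately left unmasked in `targets` so the model still learns to emit the end-of-text token.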


### 1.2 Splitting the Dataset

First, implement the dataset class:

In [6]:
from torch.utils.data import Dataset


class InstructionDataset(Dataset):
    def __init__(self, corpus_data: List[Dict[str, str]], tokenizer: tiktoken.Encoding):
        super().__init__()
        self.tokenizer = tokenizer
        self.encoded_texts = []
        self._encoded_corpus_data(corpus_data=corpus_data)

    def _encoded_corpus_data(self, corpus_data: List[Dict[str, str]]):
        for d in corpus_data:
            text = BatchingTool.format_full(d)
            encoded_text = self.tokenizer.encode(text=text)
            self.encoded_texts.append(encoded_text)

    def __getitem__(self, index: int):
        return self.encoded_texts[index]
    
    def __len__(self):
        return len(self.encoded_texts)

Then split the raw corpus into training, test, and validation sets:

In [7]:
# Split sizes: 85% training, 10% test, and the remaining 5% validation
length = len(data)
train_pos = int(length * 0.85)
test_pos = int(length * 0.1)

train_data = data[:train_pos]
test_data = data[train_pos:train_pos + test_pos]
valid_data = data[train_pos + test_pos:]


In [8]:
print(f"train set size: {len(train_data)}")
print(f"test set size: {len(test_data)}")
print(f"validation set size: {len(valid_data)}")

train set size: 935
test set size: 110
validation set size: 55
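
The three sizes above add up to 1100, which implies `len(data) == 1100` (inferred from the output, not stated in the notebook). The arithmetic can be reproduced with a small helper; `split_sizes` is a hypothetical name, and giving the remainder to the validation set mirrors the slicing used in the cell above.

```python
from typing import Tuple


def split_sizes(n: int, train_frac: float = 0.85, test_frac: float = 0.1) -> Tuple[int, int, int]:
    """Return (train, test, validation) sizes; validation takes the remainder."""
    train_size = int(n * train_frac)  # int() truncates toward zero
    test_size = int(n * test_frac)
    valid_size = n - train_size - test_size
    return train_size, test_size, valid_size


sizes = split_sizes(1100)
print(sizes)
```

Because the validation size is computed as a remainder, the three parts always sum to `n` even when the fractions do not divide the corpus evenly. Note that the split is positional: the data is not shuffled first, so any ordering in the JSON file carries over into the splits.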


In [9]:
tokenizer = tiktoken.get_encoding(encoding_name="gpt2")

train_set = InstructionDataset(corpus_data=train_data, tokenizer=tokenizer)
test_set = InstructionDataset(corpus_data=test_data, tokenizer=tokenizer)
valid_set = InstructionDataset(corpus_data=valid_data, tokenizer=tokenizer)

Before creating the batch loaders, we need to build the collate function that is passed to `DataLoader`:

In [10]:
from functools import partial


collate = partial(
    BatchingTool.collate,
    pad_token_id=pad_token_id,
    ignore_index=-100,
    allowed_max_length=1024,
    device="cpu"
)

In [11]:
inputs, targets = collate(batch=batch)
print(f"inputs are: \n{inputs}")
print(f"targets are: \n{targets}")

inputs are: 
tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
targets are: 
tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256,  -100,  -100,  -100],
        [    8,     9, 50256,  -100,  -100]])


In [12]:
from torch.utils.data import DataLoader



torch.manual_seed(123)

batch_size = 4
num_workers = 0

train_batching = DataLoader(
    dataset=train_set,
    batch_size=batch_size,
    shuffle=True,
    drop_last=True,
    collate_fn=collate,
    num_workers=num_workers
)

test_batching = DataLoader(
    dataset=test_set,
    batch_size=batch_size,
    shuffle=False,
    drop_last=False,
    collate_fn=collate,
    num_workers=num_workers
)

valid_batching = DataLoader(
    dataset=valid_set,
    batch_size=batch_size,
    shuffle=False,
    drop_last=False,
    collate_fn=collate,
    num_workers=num_workers
)

A quick check that batches can be loaded from all three loaders:

In [13]:
print("train batching:")
for inputs, targets in train_batching:
    print(inputs.shape, targets.shape)

print("test batching:")
for inputs, targets in test_batching:
    print(inputs.shape, targets.shape)

print("valid batching:")
for inputs, targets in valid_batching:
    print(inputs.shape, targets.shape)

train batching:
torch.Size([4, 61]) torch.Size([4, 61])
torch.Size([4, 58]) torch.Size([4, 58])
torch.Size([4, 62]) torch.Size([4, 62])
torch.Size([4, 76]) torch.Size([4, 76])
torch.Size([4, 73]) torch.Size([4, 73])
torch.Size([4, 55]) torch.Size([4, 55])
torch.Size([4, 68]) torch.Size([4, 68])
torch.Size([4, 68]) torch.Size([4, 68])
torch.Size([4, 65]) torch.Size([4, 65])
torch.Size([4, 57]) torch.Size([4, 57])
torch.Size([4, 72]) torch.Size([4, 72])
torch.Size([4, 60]) torch.Size([4, 60])
torch.Size([4, 80]) torch.Size([4, 80])
torch.Size([4, 64]) torch.Size([4, 64])
torch.Size([4, 63]) torch.Size([4, 63])
torch.Size([4, 67]) torch.Size([4, 67])
torch.Size([4, 61]) torch.Size([4, 61])
torch.Size([4, 62]) torch.Size([4, 62])
torch.Size([4, 68]) torch.Size([4, 68])
torch.Size([4, 75]) torch.Size([4, 75])
torch.Size([4, 52]) torch.Size([4, 52])
torch.Size([4, 62]) torch.Size([4, 62])
torch.Size([4, 67]) torch.Size([4, 67])
torch.Size([4, 68]) torch.Size([4, 68])
torch.Size([4, 65]) torc
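
The number of batches each loader yields follows from the dataset size, the batch size, and `drop_last`. As a rough sanity check, the helper below (a hypothetical function, not part of the notebook) reproduces the rule `DataLoader` applies when computing its length.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches a loader yields per epoch."""
    if drop_last:
        # The final, incomplete batch is discarded
        return num_samples // batch_size
    # The final, incomplete batch is kept
    return math.ceil(num_samples / batch_size)


train_batches = num_batches(935, batch_size=4, drop_last=True)
test_batches = num_batches(110, batch_size=4, drop_last=False)
valid_batches = num_batches(55, batch_size=4, drop_last=False)
print(train_batches, test_batches, valid_batches)
```

With `drop_last=True` the training loader silently skips the 3 leftover samples in each epoch, while the test and validation loaders end with a smaller final batch. The second dimension of each printed shape differs per batch because padding is applied to the longest sequence in that batch only.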

Visually inspect `inputs` and `targets` to confirm that they contain the `pad_token_id` and the mask value:

In [14]:
print(inputs[1])

tensor([21106,   318,   281, 12064,   326,  8477,   257,  4876,    13, 19430,
          257,  2882,   326, 20431, 32543,   262,  2581,    13,   198,   198,
        21017, 46486,    25,   198, 15946,   485,   257,  6171,  5177,   329,
          705,  2375,   332,  2637,   198,   198, 21017, 18261,    25,   198,
           32,  6171,  5177,   329,   705,  2375,   332,     6,   318,   705,
        27004,  2637, 50256, 50256, 50256, 50256, 50256, 50256])


In [15]:
print(targets[1])

tensor([  318,   281, 12064,   326,  8477,   257,  4876,    13, 19430,   257,
         2882,   326, 20431, 32543,   262,  2581,    13,   198,   198, 21017,
        46486,    25,   198, 15946,   485,   257,  6171,  5177,   329,   705,
         2375,   332,  2637,   198,   198, 21017, 18261,    25,   198,    32,
         6171,  5177,   329,   705,  2375,   332,     6,   318,   705, 27004,
         2637, 50256,  -100,  -100,  -100,  -100,  -100,  -100])


## 2. Loading a Pretrained LLM and Fine-Tuning It

Outline:
1. Load the pretrained LLM
2. Fine-tune it on the instruction dataset

### 2.1 Loading the Pretrained Model

First, load the model configuration:

In [16]:
with open("gpt2_small_config.json", "r", encoding="utf-8") as f:
    model_config = json.load(f)

pprint(model_config)

{'ctx_len': 256,
 'dropout_rate': 0.1,
 'emb_dim': 768,
 'n_heads': 12,
 'n_layers': 12,
 'vocab_size': 50257,
 'with_bias': False,
 'with_mask': True}


Next, instantiate the model and load its parameters:

In [17]:
from previous_chapters import *


model = GPT2Small(**model_config)
print(model)

GPT2Small(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(256, 768)
  (decoder): Sequential(
    (0): TransformerDecoderOnly(
      (drop): Dropout(p=0.1, inplace=False)
      (norm1): LayerNorm()
      (mha): MultiHeadAttention(
        (wq): Linear(in_features=768, out_features=768, bias=False)
        (wk): Linear(in_features=768, out_features=768, bias=False)
        (wv): Linear(in_features=768, out_features=768, bias=False)
        (dropout): Dropout(p=0.1, inplace=False)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
      )
      (norm2): LayerNorm()
      (ffn): FeedForward(
        (layers): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU()
          (2): Linear(in_features=3072, out_features=768, bias=True)
          (3): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (1): TransformerDecoderOnly(
      (drop): Dropout(p=0.1, inplace=False)
      (norm1): LayerNorm()
      (mh

In [20]:
model_finetuned_path = "gpt2_small_fine_tuning_for_instruction.pth"
model_pretrained_path = "gpt2_small_pretrained.pth"
device = "cpu"
to_retrain = False

# Prefer an existing fine-tuned checkpoint; otherwise start from the pretrained weights
if not os.path.exists(model_finetuned_path):
    params = torch.load(model_pretrained_path, map_location=device)
    to_retrain = True
    print(f"Loading pretrained model from {model_pretrained_path}")
else:
    params = torch.load(model_finetuned_path, map_location=device)
    print(f"Loading finetuned model from {model_finetuned_path}")
print(params.keys())
model.load_state_dict(params["model_static_dict"])

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
# Restore the optimizer state only if the checkpoint actually contains it
if "optimizer_static_dict" in params:
    optimizer.load_state_dict(params["optimizer_static_dict"])

model.train()

Loading finetuned model from gpt2_small_fine_tuning_for_instruction.pth
dict_keys(['model_static_dict', 'optimizer_static_dict'])


GPT2Small(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(256, 768)
  (decoder): Sequential(
    (0): TransformerDecoderOnly(
      (drop): Dropout(p=0.1, inplace=False)
      (norm1): LayerNorm()
      (mha): MultiHeadAttention(
        (wq): Linear(in_features=768, out_features=768, bias=False)
        (wk): Linear(in_features=768, out_features=768, bias=False)
        (wv): Linear(in_features=768, out_features=768, bias=False)
        (dropout): Dropout(p=0.1, inplace=False)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
      )
      (norm2): LayerNorm()
      (ffn): FeedForward(
        (layers): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU()
          (2): Linear(in_features=3072, out_features=768, bias=True)
          (3): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (1): TransformerDecoderOnly(
      (drop): Dropout(p=0.1, inplace=False)
      (norm1): LayerNorm()
      (mh

Before fine-tuning, let's look at how the model performs on the instruction dataset:

In [21]:
example_data = valid_data[0]
example_text = BatchingTool.format_input(example_data)

token_ids = text_to_token_ids(text=example_text, tokenizer=tokenizer)

generated_response = generate_text_simple(
    model=model,
    indices=token_ids,
    max_new_tokens=10,
    context_size=model_config["ctx_len"]  # must not exceed the model's context length (256)
)
generated_text = token_ids_to_text(token_ids=generated_response, tokenizer=tokenizer)

print(f"input text: \n{example_text}\n")
print("-" * 80)
print(f"model response: \n{generated_text}")

input text: 
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Convert the active sentence to passive: 'The chef cooks the meal every day.'

--------------------------------------------------------------------------------
model response: 
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Convert the active sentence to passive: 'The chef cooks the meal every day.'

### Response:
The process by the


As we can see, before fine-tuning the model's answer has essentially nothing to do with the instruction.

### 2.2 Fine-Tuning the Model

Before the actual fine-tuning, let's look at the model's initial loss:

In [22]:

model.to(device)

torch.manual_seed(123)

with torch.no_grad():
    train_loss = calc_loss_epoch(train_batching, model, device, num_batches=5)
    val_loss = calc_loss_epoch(valid_batching, model, device, num_batches=5)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)

Training loss: 1.42116436958313
Validation loss: 2.3730215549468996
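
One way to read these numbers is to convert them to perplexity, which is the exponential of the mean cross-entropy. The sketch below assumes that `calc_loss_epoch` returns a mean per-token cross-entropy in nats; its implementation is not shown in this notebook, so treat that as an assumption.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) into perplexity."""
    return math.exp(mean_cross_entropy)


# Loss values copied from the output above
train_ppl = perplexity(1.42116436958313)
val_ppl = perplexity(2.3730215549468996)

print(f"train perplexity: {train_ppl:.2f}")
print(f"validation perplexity: {val_ppl:.2f}")
```

A perplexity of k means the model is, on average, about as uncertain as if it were choosing uniformly among k tokens at each step, so the gap between the two values gives a more tangible sense of how much worse the model does on unseen data.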


As we can see, the loss is fairly high.
OK, now let's move on to the actual fine-tuning.

In [23]:
if to_retrain:

    import time

    start_time = time.time()

    torch.manual_seed(123)

    num_epochs = 2

    train_losses, val_losses, tokens_seen = train_model_simple(
        model, train_batching, valid_batching, optimizer, device,
        num_epochs=num_epochs, eval_freq=5, eval_iter=5,
        start_context=BatchingTool.format_input(valid_data[0]), tokenizer=tokenizer
    )

    end_time = time.time()
    execution_time_minutes = (end_time - start_time) / 60
    print(f"Training completed in {execution_time_minutes:.2f} minutes.")

Training is expensive, so let's save the fine-tuned model parameters first:

In [24]:
torch.save(
    obj={
        "model_static_dict": model.state_dict(),
        "optimizer_static_dict": optimizer.state_dict(),
    },
    f="gpt2_small_fine_tuning_for_instruction.pth"
)

Then, take a quick look at the model's responses to the first three instructions in the test set:

In [25]:
torch.manual_seed(123)


for entry in test_data[:3]:

    input_text = BatchingTool.format_input(entry)

    token_ids = generate_text_simple(
        model=model,
        indices=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=model_config["ctx_len"]
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)
    response_text = (
        generated_text[len(input_text):]
        .replace("### Response:", "")
        .strip()
    )

    print(input_text)
    print(f"\nCorrect response:\n>> {entry['output']}")
    print(f"\nModel response:\n>> {response_text.strip()}")
    print("-------------------------------------")

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Rewrite the sentence using a simile.

### Input:
The car is very fast.

Correct response:
>> The car is as fast as lightning.

Model response:
>> The process by which the sentence is 'I love you?<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>
3.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>,




The process by the sentence.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endoftext|>, and the word 'I love
3.<|endoftext|>.<|endoftext|>.<|endoftext|>.<|endo
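
The raw output above keeps going after the first `<|endoftext|>` because generation runs for a fixed number of new tokens. One option, sketched below as an assumption rather than something this notebook does, is to cut the text at the first end-of-text marker during post-processing. The helper name `extract_response` and the sample strings are hypothetical.

```python
def extract_response(generated_text: str, prompt: str, eos: str = "<|endoftext|>") -> str:
    """Strip the prompt and the response header, then cut at the first end-of-text marker."""
    response = generated_text[len(prompt):].replace("### Response:", "", 1)
    # Everything after the first marker is discarded
    response = response.split(eos, 1)[0]
    return response.strip()


# Hypothetical prompt and generation, shaped like the output above
prompt = "### Instruction:\nSay hi."
generated = prompt + "\n\n### Response:\nHi there.<|endoftext|>.<|endoftext|>."

cleaned = extract_response(generated, prompt)
print(repr(cleaned))
```

This only tidies the text that is shown or scored. It does not make the underlying model any better, and stopping generation as soon as the end-of-text token is produced would additionally save compute.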

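As shown above, the GPT2 Small model does not converge at all here, but larger models cannot be trained on the author's laptop, so we continue with a model of this size.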

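### 2.3 Saving the Model's Responses on the Test Set

Save the model's responses on the test set so that a more capable model can later act as a judge and assess their quality. Because of the laptop's limited performance, only the first 10 test samples are processed.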

In [26]:
from tqdm import tqdm

test_samples = test_data[:10]
for i, entry in tqdm(enumerate(test_samples), total=len(test_samples)):

    input_text = BatchingTool.format_input(entry)

    token_ids = generate_text_simple(
        model=model,
        indices=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=model_config["ctx_len"]
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)
    response_text = generated_text[len(input_text):].replace("### Response:", "").strip()

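    # test_samples shares its dict objects with test_data, so this also updates test_data
    test_samples[i]["model_response"] = response_text


# Save only the samples that actually received a model response
with open("instruction-data-with-response.json", "w", encoding="utf-8") as file:
    json.dump(test_samples, file, indent=4)  # "indent" for pretty-printing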

100%|██████████| 10/10 [13:16<00:00, 79.60s/it]


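## 3. Evaluating the Model

In this section we evaluate the model on the test set. The approach is to use the phi3 model served by ollama as a judge model that scores (from 0 to 100) the first 10 test-set responses saved above.

Outline:
1. Run the phi3 model on ollama
2. Call the judge model through the REST API
3. Evaluate the GPT2Small model on the test set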

### 3.1 Running the phi3 Model on ollama

Run the following command in a separate terminal:
```bash
ollama serve
```

Check whether ollama is running:

In [34]:
import psutil


def is_ollama_running(process_name: str = "ollama") -> bool:
    for p in psutil.process_iter():
        if p.name() == process_name:
            return True
    return False

In [35]:
is_ollama_running()

True

### 3.2 Calling the Judge Model through the REST API

In [38]:
import requests


def query_judge_model(
    prompt,
    # model="llama3",
    model="phi3",
    url="http://localhost:11434/api/chat"
):
    payload = {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "options": {     # Settings below are required for deterministic responses
            "seed": 123,
            "temperature": 0,
            "num_ctx": 2048
        }
    }
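    # `json=payload` already sets the Content-Type header
    resp = requests.post(url, json=payload, stream=True)
    resp.raise_for_status()

    # The response is streamed as one JSON object per line
    resp_data = ""
    for line in resp.iter_lines():
        if not line:
            continue  # skip empty keep-alive lines rather than stopping early
        resp_json = json.loads(line)
        resp_data += resp_json["message"]["content"]
    return resp_data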
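
The loop at the end of `query_judge_model` stitches a streamed reply back together. To make that step easier to test without a running server, here is a standalone sketch. It assumes each line is a JSON object with a `message.content` string (as the function above relies on) plus a `done` flag on the final chunk; the helper name and the fake stream are made up for illustration.

```python
import json
from typing import Iterable


def join_chat_stream(lines: Iterable[bytes]) -> str:
    """Concatenate the content fragments of a newline-delimited JSON chat stream."""
    parts = []
    for line in lines:
        if not line:
            continue  # blank keep-alive line: skip it instead of stopping
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # final chunk reached
    return "".join(parts)


# Hypothetical stream: two content fragments, a blank line, and a closing chunk
fake_stream = [
    b'{"message": {"role": "assistant", "content": "8"}, "done": false}',
    b"",
    b'{"message": {"role": "assistant", "content": "5"}, "done": false}',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}',
]
text = join_chat_stream(fake_stream)
print(text)
```

Skipping blank lines rather than breaking on them avoids truncating a reply if an empty line appears in the middle of the stream.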

In [39]:
# model = "llama3"
model = "phi3"
result = query_judge_model("What do Llamas eat?", model)
print(result)

Llamas are herbivores and primarily graze on grasses, but they can also consume a variety of other plant materials. Their diet includes:

- Grasses (both native to their habitat in the Andes mountains as well as introduced species)
- Herbs
- Flowers 
- Leaves from shrubs and trees
- Hay or straw when fresh grass is not available, especially during dry seasons
- They are also known to eat salt licks for mineral supplementation. Llamas have a three-chambered stomach that allows them to ferment plant material efficiently before digestion in the rest of their gut. This adaptation helps break down cellulose and other tough fibers found in plants, which is why they can thrive on such fibrous diets.


### 3.3 Evaluating the GPT2Small Model on the Test Set

In [40]:
def judge(json_data, json_key, model="phi3"):
    scores = []
    for entry in tqdm(json_data, desc="Scoring entries"):
        prompt = (
            f"Given the input `{BatchingTool.format_input(entry)}` "
            f"and correct output `{entry['output']}`, "
            f"score the model response `{entry[json_key]}`"
            f" on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the integer number only."
        )
        score = query_judge_model(prompt, model)
        try:
            scores.append(int(score))
        except ValueError:
            print(f"Could not convert score: {score}")
            continue
    return scores
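
The `int(score)` conversion in `judge` only succeeds when the judge replies with a bare integer, and small judge models do not always comply. A somewhat more tolerant parser is sketched below; `extract_score` is a hypothetical helper, and the two patterns it accepts (an integer alone on the first line, or a JSON-style `"score"` field) are a design choice, not something the notebook prescribes. Long free-form replies are still rejected rather than guessed at.

```python
import re
from typing import Optional


def extract_score(reply: str) -> Optional[int]:
    """Return an integer score in [0, 100], or None if the reply has no usable score."""
    candidates = []
    # Case 1: the first line of the reply is a bare integer
    first_line = re.match(r"\s*(\d{1,3})[ \t]*(?:\r?\n|$)", reply)
    if first_line:
        candidates.append(int(first_line.group(1)))
    # Case 2: a JSON-style field such as {"score": 75}
    json_field = re.search(r'"score"\s*:\s*(\d{1,3})\b', reply)
    if json_field:
        candidates.append(int(json_field.group(1)))
    # Accept the first candidate that lies in the valid range
    for value in candidates:
        if 0 <= value <= 100:
            return value
    return None


# Sample replies in the style of typical judge-model failures
print(extract_score("85\n`.` '. Happy to assist!"))
print(extract_score('```json\n{\n   "score": 75\n}\n```'))
print(extract_score("As an AI developed by Microsoft, I cannot rate this. Score: 95/100"))
```

In `judge`, the `int(score)` call could be replaced by `extract_score(score)`, appending the value only when it is not `None`. Replies that bury numbers inside a long explanation stay unscored on purpose, since picking one of several numbers would be a guess.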


Compute the average score:

In [41]:
scores = judge(test_samples, "model_response")
print(f"Number of scores: {len(scores)} of {len(test_samples)}")
print(f"Average score: {sum(scores)/len(scores):.2f}\n")

Scoring entries:  20%|██        | 2/10 [00:31<02:20, 17.59s/it]

Could not convert score: As an AI developed by Microsoft, I don't have personal experiences or feelings towards sentences. However, if you are asking for assistance in evaluating a sentence based on criteria such as clarity, coherence, grammar, and style (which might be what your request is hinting at), here’s how one could approach it:

1. Clarity - How easily can the meaning of the sentence be understood? A score close to 100 would indicate that even someone unfamiliar with the context should grasp its message without difficulty. Score: 95/100 (assuming high clarity).
   
2. Coherence - Does each part logically follow from another within the sentence? A cohesive and well-structured sentence would score highly here as well, perhaps around 98/100 for excellent flow of ideas without any abrupt shifts in thought or structure. Score: 97/100 (assuming high coherence).
   
3. Grammar - Are the grammatical rules correctly applied? Assuming no errors and proper use of tense, voice, agreement 

Scoring entries:  40%|████      | 4/10 [00:35<00:44,  7.35s/it]

Could not convert score: ```json
{
   "score": 75
}
```


Scoring entries:  50%|█████     | 5/10 [00:45<00:40,  8.07s/it]

Could not convert score: As an AI developed by Microsoft, I don't have personal experiences or feelings such as love for sentences. However, if you are asking me to evaluate a sentence based on criteria like clarity, grammar, and coherence (which could metapclty be interpreted as "love" in this context), here is an example:

Sentence: The quick brown fox jumps over the lazy dog. 
Score: 100

This sentence is a well-known English pangram, containing every letter of the alphabet at least once and thus often used for typing practice or font display purposes due to its concise nature while being grammatically correct and coherent in meaning.


Scoring entries:  70%|███████   | 7/10 [00:50<00:15,  5.16s/it]

Could not convert score: 85
`.` '. Happy to assist! If you need further help or have more questions in the future, feel free to reach out. Have a great day ahead!


Scoring entries:  90%|█████████ | 9/10 [00:59<00:05,  5.02s/it]

Could not convert score:  As an AI developed by Microsoft, I don't have personal experiences or emotions like humans do. However, if you want me to simulate a happy mood for this task, here it goes: Happy! Although as an artificial intelligence, my "happiness" is not based on human feelings but rather designed and programmed responses that aim to provide helpful, positive interactions with users such as yourself. Let's pretend I am at 100% happiness in spirit for this interaction!


Scoring entries: 100%|██████████| 10/10 [01:39<00:00, 10.00s/it]

Could not convert score: To provide an evaluation of this text based soleth criteria provided: clarity (2/10), conciseness (3/10), coherence and structure (4/10), grammar and spelling (5/10). The total score is therefore a weighted average, which results in 3.6 out of 10 when considering the importance given to each aspect: clarity being most important followed by conciseness then coherence & structure with grammatical correctness as least critical but still essential for understanding and communication.

 , 5, 5, the, the, the, 5, the, the, the, and : The text provided is a series of numbers separated by commas followed by 'and'. It lacks clear context or meaningful content beyond this repetition which makes it difficult to understand its purpose. Therefore, I would rate clarity as low (2/10).

 , 5, 5, the, and : The text is slightly more concise than before but still consists of repeated elements without additional information for context or meaningful content. Thus, this aspect als




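Unfortunately, the evaluation did not go smoothly either. Still, we now have a basic understanding of the whole instruction fine-tuning process; the next step is to spend money on NVIDIA GPUs to improve our model.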