<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Evaluating Instruction Responses Locally Using a Llama 3 Model via Ollama

- This notebook uses an 8-billion-parameter Llama 3 model through ollama to evaluate the responses of instruction-finetuned LLMs. It expects a dataset in JSON format that includes the generated model responses, for example:



```python
{
    "instruction": "What is the atomic number of helium?",
    "input": "",
    "output": "The atomic number of helium is 2.",               # <-- The target given in the test set
    "model 1 response": "\nThe atomic number of helium is 2.0.", # <-- Response by an LLM
    "model 2 response": "\nThe atomic number of helium is 3."    # <-- Response by a 2nd LLM
},
```

- The code doesn't require a GPU and runs on a laptop (it was tested on an M3 MacBook Air)

In [1]:
from importlib.metadata import version

pkgs = ["tqdm",    # Progress bar
        ]

for p in pkgs:
    print(f"{p} version: {version(p)}")

tqdm version: 4.66.4


## Installing Ollama and Downloading Llama 3

- Ollama is an application for running LLMs efficiently
- It is a wrapper around [llama.cpp](https://github.com/ggerganov/llama.cpp), which implements LLMs in pure C/C++ to maximize efficiency
- Note that it is a tool for using LLMs to generate text (inference), not for training or finetuning LLMs
- Before running the code below, install ollama by visiting [https://ollama.com](https://ollama.com) and following the instructions (for instance, clicking on the "Download" button and downloading the ollama application for your operating system)

- For macOS and Windows users, click on the ollama application you downloaded; if it prompts you to install the command-line tool, say "yes"
- Linux users can use the installation command provided on the ollama website

- In general, before we can use ollama from the command line, we have to either start the ollama application or run `ollama serve` in a separate terminal

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/ollama-eval/ollama-serve.webp?1">
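- As a minimal sketch (not part of the original notebook), the check below tests whether an HTTP server answers on ollama's local port before any model is queried. The helper name `is_ollama_reachable` is hypothetical, and the base URL is an assumption derived from the `http://localhost:11434/api/chat` endpoint used later in this notebook:

```python
import urllib.error
import urllib.request


def is_ollama_reachable(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers at base_url, else False."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx HTTP status
        return False


print("Ollama reachable:", is_ollama_reachable())
```

- If this prints `False`, start the ollama application or run `ollama serve` in a separate terminal and try again; note that the check only confirms that some server responds on that port, not that a model has been downloaded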


- With the ollama application or `ollama serve` running, open a different terminal and execute the following command to try out the 8-billion-parameter Llama 3 model (the model takes up 4.7 GB of storage space and is downloaded automatically the first time you execute this command)

```bash
# 8B model
ollama run llama3
```


The output looks as follows:

```
$ ollama run llama3
pulling manifest 
pulling 6a0746a1ec1a... 100% ▕████████████████▏ 4.7 GB                         
pulling 4fa551d4f938... 100% ▕████████████████▏  12 KB                         
pulling 8ab4849b038c... 100% ▕████████████████▏  254 B                         
pulling 577073ffcc6c... 100% ▕████████████████▏  110 B                         
pulling 3f8eb4da87fa... 100% ▕████████████████▏  485 B                         
verifying sha256 digest 
writing manifest 
removing any unused layers 
success 
```

- Note that `llama3` refers to the instruction-finetuned 8-billion-parameter Llama 3 model

- Alternatively, if your machine supports it, you can use the larger 70-billion-parameter Llama 3 model by replacing `llama3` with `llama3:70b`

- After the download has completed, you will see a command-line prompt that allows you to chat with the model

- Try a prompt like "What do llamas eat?", which should return an output similar to the following:

```
>>> What do llamas eat?
Llamas are ruminant animals, which means they have a four-chambered 
stomach and eat plants that are high in fiber. In the wild, llamas 
typically feed on:
1. Grasses: They love to graze on various types of grasses, including tall 
grasses, wheat, oats, and barley.
```

- You can end this session by entering `/bye`

## Using Ollama's REST API

- An alternative way to interact with the model is through its REST API, which we can call from Python using the function below
- Before you run the next cells in this notebook, make sure that ollama is still running, as described above, via either
  - `ollama serve` in a terminal
  - the ollama application

- First, let's run the following code cell to try the API with a simple example and make sure it works as intended:

In [2]:
# urllib.request sends the HTTP request; json encodes and decodes the payload
import urllib.request
import json


def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat"):
    # Create the data payload as a dictionary
    data = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        # Settings that make the responses as reproducible as possible
        "options": {
            "seed": 123,          # Fixed random seed
            "temperature": 0,     # No sampling randomness
            "num_ctx": 2048       # Context window size
        }
    }

    # Convert the dictionary to a JSON-formatted string and encode it to bytes
    payload = json.dumps(data).encode("utf-8")

    # Create a POST request object and add the necessary header
    request = urllib.request.Request(url, data=payload, method="POST")
    request.add_header("Content-Type", "application/json")

    # Send the request and capture the response
    response_data = ""
    with urllib.request.urlopen(request) as response:
        # The response arrives line by line, one JSON object per line
        while True:
            line = response.readline().decode("utf-8")
            # An empty string means there is nothing left to read
            if not line:
                break
            response_json = json.loads(line)
            # Append this chunk's text to the accumulated response
            response_data += response_json["message"]["content"]

    return response_data


# Try the query function on a simple prompt
result = query_model("What do Llamas eat?")
print(result)

Llamas are herbivores, which means they primarily feed on plant-based foods. Their diet typically consists of:

1. Grasses: Llamas love to graze on various types of grasses, including tall grasses, short grasses, and even weeds.
2. Hay: High-quality hay, such as alfalfa or timothy hay, is a staple in a llama's diet. They enjoy the sweet taste and texture of fresh hay.
3. Grains: Llamas may receive grains like oats, barley, or corn as part of their daily ration. However, it's essential to provide these grains in moderation, as they can be high in calories.
4. Fruits and vegetables: Llamas enjoy a variety of fruits and veggies, such as apples, carrots, sweet potatoes, and leafy greens like kale or spinach.
5. Minerals: Llamas require access to mineral supplements, which help maintain their overall health and well-being.

In the wild, llamas might also eat:

1. Leaves: They'll munch on leaves from trees and shrubs, including plants like willow, alder, and birch.
2. Bark: In some cases, ll
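- To make the read loop in `query_model` easier to follow, here is a small offline sketch of the same accumulation step. It assumes, as the loop above does, that the server sends one JSON object per line with the text under `["message"]["content"]`. The helper name `collect_streamed_content` is hypothetical, and the sample lines (including the `done` field) are hand-written for illustration, not captured server output:

```python
import json


def collect_streamed_content(lines):
    """Concatenate the message content from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
    return "".join(parts)


# Hand-written sample chunks (an assumption about the shape, not real output)
sample_lines = [
    '{"message": {"role": "assistant", "content": "Llamas "}, "done": false}',
    '{"message": {"role": "assistant", "content": "eat grass."}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]

print(collect_streamed_content(sample_lines))
```

- Joining the parts once at the end avoids repeated string concatenation, but the result is the same as the `+=` accumulation used in `query_model`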

## Load JSON Entries

- Now, let's get to the data evaluation part

- Here, we assume that we saved the test dataset and the model responses as a JSON file that we can load as follows:

In [3]:
# Path to the JSON file with the test set and the model responses
json_file = "eval-example-data.json"

# Open and parse the JSON file
with open(json_file, "r") as file:
    json_data = json.load(file)

# Print the number of entries
print("Number of entries:", len(json_data))

Number of entries: 100


- The structure of this file is as follows: each entry contains the reference response from the test dataset (`'output'`) and the responses of two different models (`'model 1 response'` and `'model 2 response'`):

In [4]:
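# Show the first JSON entry, which contains:
# - instruction: the task instruction
# - input: an optional input for the task
# - output: the reference response from the test set
# - model 1 response: the response of the 1st model
# - model 2 response: the response of the 2nd model
json_data[0]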

{'instruction': 'Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.',
 'input': '',
 'output': 'The hypotenuse of the triangle is 10 cm.',
 'model 1 response': '\nThe hypotenuse of the triangle is 3 cm.',
 'model 2 response': '\nThe hypotenuse of the triangle is 12 cm.'}
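- Because the scoring code further below indexes each entry by these keys, it can help to check a file for incomplete entries first. The following is a minimal sketch under the assumption that every entry should carry all five keys shown above; the helper name `find_incomplete_entries` and the second sample entry are hypothetical:

```python
REQUIRED_KEYS = {
    "instruction", "input", "output", "model 1 response", "model 2 response"
}


def find_incomplete_entries(entries, required_keys=REQUIRED_KEYS):
    """Return (index, missing_keys) pairs for entries lacking a required key."""
    problems = []
    for idx, entry in enumerate(entries):
        missing = required_keys - set(entry)
        if missing:
            problems.append((idx, sorted(missing)))
    return problems


sample_entries = [
    {   # First entry of the dataset, as printed above
        "instruction": "Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.",
        "input": "",
        "output": "The hypotenuse of the triangle is 10 cm.",
        "model 1 response": "\nThe hypotenuse of the triangle is 3 cm.",
        "model 2 response": "\nThe hypotenuse of the triangle is 12 cm.",
    },
    {   # Hypothetical incomplete entry, written for this illustration only
        "instruction": "What is the atomic number of helium?",
        "input": "",
        "output": "The atomic number of helium is 2.",
        "model 1 response": "\nThe atomic number of helium is 2.0.",
    },
]

print(find_incomplete_entries(sample_entries))
```

- With the loaded data, the same check would be `find_incomplete_entries(json_data)`; an empty list means every entry has all the expected keys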

- Below is a small utility function that formats an entry's instruction and optional input into a single prompt string, which we embed in the scoring prompts later:

In [5]:
def format_input(entry):
    # Build the instruction text: a fixed task description plus the instruction
    instruction_text = (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    # Add the input section only if the entry has a non-empty input
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    # Return the complete formatted text (instruction text + input text)
    return instruction_text + input_text
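- As a quick illustration of the two cases `format_input` handles, the sketch below formats one entry without an input and one with an input. It repeats a compact copy of the function so that it runs on its own; the second entry is hypothetical and not taken from the dataset:

```python
def format_input(entry):
    instruction_text = (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text


# Entry from the example at the top of this notebook (empty "input" field)
entry_without_input = {
    "instruction": "What is the atomic number of helium?",
    "input": "",
}

# Hypothetical entry, written for this illustration only
entry_with_input = {
    "instruction": "Rewrite the sentence in passive voice.",
    "input": "The chef cooked the meal.",
}

prompt_without_input = format_input(entry_without_input)
prompt_with_input = format_input(entry_with_input)

print(prompt_without_input)
print("-----")
print(prompt_with_input)
```

- The `### Input:` section appears only in the second prompt, because an empty input string is falsy and is therefore skipped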

- Now, let's use the ollama API to compare the model responses (we only evaluate the first 5 responses for a visual comparison):

In [6]:
# Iterate over the first 5 entries
for entry in json_data[:5]:
    # Build the scoring prompt from the formatted input, the correct output,
    # and the model response, asking for a score between 0 and 100
    prompt = (f"Given the input `{format_input(entry)}` "
              f"and correct output `{entry['output']}`, "
              f"score the model response `{entry['model 1 response']}`"
              f" on a scale from 0 to 100, where 100 is the best score. "
              )
    # Print the reference response from the dataset
    print("\nDataset response:")
    print(">>", entry['output'])
    # Print the model's response
    print("\nModel response:")
    print(">>", entry["model 1 response"])
    # Print the score assigned by Llama 3
    print("\nScore:")
    print(">>", query_model(prompt))
    # Print a separator line
    print("\n-------------------------")


Dataset response:
>> The hypotenuse of the triangle is 10 cm.

Model response:
>> 
The hypotenuse of the triangle is 3 cm.

Score:
>> I'd score this response as 0 out of 100.

The correct answer is "The hypotenuse of the triangle is 10 cm.", not "3 cm.". The model failed to accurately calculate the length of the hypotenuse, which is a fundamental concept in geometry and trigonometry.

-------------------------

Dataset response:
>> 1. Squirrel
2. Eagle
3. Tiger

Model response:
>> 
1. Squirrel
2. Tiger
3. Eagle
4. Cobra
5. Tiger
6. Cobra

Score:
>> I'd rate this model response as 60 out of 100.

Here's why:

* The model correctly identifies two animals that are active during the day: Squirrel and Eagle.
* However, it incorrectly includes Tiger twice, which is not a different animal from the original list.
* Cobra is also an incorrect answer, as it is typically nocturnal or crepuscular (active at twilight).
* The response does not meet the instruction to provide three different animals

- Note that the responses are very verbose; to quantify which model is better, we ask the model to return only the integer score:

In [7]:
# tqdm displays a progress bar
from tqdm import tqdm


def generate_model_scores(json_data, json_key):
    # List that collects the integer scores
    scores = []
    # Iterate over the dataset while showing a progress bar
    for entry in tqdm(json_data, desc="Scoring entries"):
        # Build the scoring prompt from the input, the correct output,
        # and the model response stored under json_key
        prompt = (
            f"Given the input `{format_input(entry)}` "
            f"and correct output `{entry['output']}`, "
            f"score the model response `{entry[json_key]}`"
            f" on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the integer number only."
        )
        # Query the model for the score
        score = query_model(prompt)
        try:
            # Convert the score to an integer and store it
            scores.append(int(score))
        except ValueError:
            # Skip entries whose response is not a plain integer
            continue

    return scores
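- `generate_model_scores` drops any response that `int()` cannot parse. One possible alternative, sketched here as an assumption rather than as the notebook's method, is to search the response for the first integer in the valid range. The helper name `extract_score` is hypothetical:

```python
import re


def extract_score(text, low=0, high=100):
    """Return the first integer in text that lies in [low, high], else None."""
    for match in re.finditer(r"\d+", text):
        value = int(match.group())
        if low <= value <= high:
            return value
    return None


print(extract_score("85"))
print(extract_score("I'd score this response as 0 out of 100."))
print(extract_score("No score given."))
```

- This heuristic can pick the wrong number when the response mentions other values before the score (for example, a quoted "3 cm"), so asking for "the integer number only" and skipping unparsable responses, as done above, remains the more conservative choice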

- Let's now apply this evaluation to the whole dataset and compute the average score of each model (this takes about 1 minute per model on an M3 MacBook Air laptop)
- Note that ollama is not fully deterministic across operating systems (as of this writing), so the numbers you get might differ slightly from the ones shown below

In [8]:
# Path handles the file paths for saving the scores
from pathlib import Path

# Iterate over the responses of the two models
for model in ("model 1 response", "model 2 response"):

    # Generate the list of scores for the current model
    scores = generate_model_scores(json_data, model)
    print(f"\n{model}")
    # Report how many entries received a parsable score
    print(f"Number of scores: {len(scores)} of {len(json_data)}")
    print(f"Average score: {sum(scores)/len(scores):.2f}\n")

    # Optionally save the scores to a JSON file
    save_path = Path("scores") / f"llama3-8b-{model.replace(' ', '-')}.json"
    # Create the "scores" directory if it does not exist yet
    save_path.parent.mkdir(parents=True, exist_ok=True)
    with open(save_path, "w") as file:
        json.dump(scores, file)

Scoring entries: 100%|████████████████████████| 100/100 [01:02<00:00,  1.59it/s]



model 1 response
Number of scores: 100 of 100
Average score: 78.48



Scoring entries: 100%|████████████████████████| 100/100 [01:10<00:00,  1.42it/s]


model 2 response
Number of scores: 99 of 100
Average score: 64.98






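- Based on the evaluation above, the 1st model (average score 78.48) performs better than the 2nd model (average score 64.98); note that the 2nd average is computed over 99 of the 100 entries, because one score could not be parsed as an integer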
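- Since unparsable responses are skipped, the two averages above are computed over 100 and 99 entries, respectively. A small sketch like the following reports the coverage next to the mean and median so that such gaps stay visible. The helper name `summarize_scores` is hypothetical, and the score list is toy data, not a result from this notebook:

```python
from statistics import mean, median


def summarize_scores(scores, total_entries):
    """Summarize a list of integer scores and the share of entries scored."""
    if not scores:
        return {"count": 0, "coverage": 0.0, "mean": None, "median": None}
    return {
        "count": len(scores),
        "coverage": len(scores) / total_entries,
        "mean": mean(scores),
        "median": median(scores),
    }


# Toy data for illustration: 4 parsable scores out of 5 entries
toy_scores = [100, 80, 60, 0]
summary = summarize_scores(toy_scores, total_entries=5)
print(summary)
```

- The median is included because a few scores of 0 can pull the mean down considerably; to apply this to the notebook's results, pass the list returned by `generate_model_scores` together with `len(json_data)`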