<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Create "Passive Voice" Entries for an Instruction Dataset

- This notebook uses OpenAI's GPT-4 to create "passive voice" entries for an instruction dataset, as shown in the example below:

```python
{  
   'instruction': 'Identify the verb in the following sentence',
   'input': 'The cat sleeps on the couch.',
   'output': 'The verb in the sentence is "sleeps."',
   'output_2': 'The sentence is "sleeps."'   #  <---- Newly created entry
}  
```

In [1]:
# pip install -r requirements-extra.txt

In [2]:
from importlib.metadata import version

pkgs = ["openai",  # OpenAI API
        "tqdm",    # Progress bar
       ]

for p in pkgs:
    print(f"{p} version: {version(p)}")

openai version: 1.30.3
tqdm version: 4.65.0


## Test OpenAI API

- First, let's test whether the OpenAI API is set up correctly
- If you don't have an account yet, you need to create one at https://platform.openai.com/
- Note that you will also have to add funds to your account, because the GPT-4 API is not free (see https://platform.openai.com/settings/organization/billing/overview)
- Creating the ~200 passive voice entries using the code in this notebook costs about $0.13 (13 cents)

- Next, we need to provide our OpenAI API secret key, which can be found at https://platform.openai.com/api-keys
- Make sure not to share this key with anyone
- Add this secret key (`"sk-..."`) to the `config.json` file in this folder

In [3]:
# Import the required libraries
import json
from openai import OpenAI

# Load the API key from a JSON file
# Make sure to replace "sk-..." with your actual API key from https://platform.openai.com/api-keys
with open("config.json", "r") as config_file:
    config = json.load(config_file)
    api_key = config["OPENAI_API_KEY"]

# Initialize the OpenAI client
client = OpenAI(api_key=api_key)
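
- As an optional variation, the key lookup could fall back to an environment variable when `config.json` is absent. The sketch below is only an illustration: the helper name `load_api_key`, the fallback order, and the check for an unedited `"sk-..."` placeholder are assumptions and not part of the original notebook.

```python
import json
import os


def load_api_key(config_path="config.json", env_var="OPENAI_API_KEY"):
    """Hypothetical helper: read the key from a JSON config file,
    falling back to an environment variable if the file is absent."""
    if os.path.exists(config_path):
        with open(config_path, "r") as config_file:
            config = json.load(config_file)
        key = config.get(env_var, "")
    else:
        key = os.environ.get(env_var, "")

    # Reject missing keys and the unedited "sk-..." placeholder (assumed value)
    if not key or key == "sk-...":
        raise ValueError(f"No usable API key found in {config_path} or ${env_var}")
    return key
```

- With such a helper, the client could be created via `OpenAI(api_key=load_api_key())`.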

- Let's try the API with a simple example to make sure it works as intended:

In [4]:
# Define a function that calls the chat completions API
def run_chatgpt(prompt, client, model="gpt-4-turbo"):
    # Create the chat completion request
    response = client.chat.completions.create(
        model=model,  # Model to use; defaults to gpt-4-turbo
        messages=[{"role": "user", "content": prompt}],  # The user message
        temperature=0.0,  # Temperature 0 makes the output as deterministic as possible
    )
    # Return the text of the generated reply
    return response.choices[0].message.content


# Prepare the input
sentence = "I ate breakfast"  # Example sentence
prompt = f"Convert the following sentence to passive voice: '{sentence}'"  # Build the prompt
run_chatgpt(prompt, client)  # Call the API to get the passive voice version

'Breakfast was eaten by me.'
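
- API calls can fail intermittently (for example, because of rate limits or network hiccups). One possible safeguard, sketched below, is a generic retry wrapper with exponential backoff. The helper name `call_with_retries` and its parameters are hypothetical, and catching every `Exception` is a simplification; in practice one would likely narrow this to transient errors only.

```python
import time


def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0, **kwargs):
    """Hypothetical helper: call fn(*args, **kwargs), retrying on any
    exception with exponentially growing pauses between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            time.sleep(base_delay * 2 ** (attempt - 1))
```

- Assuming the `run_chatgpt` function above, a call would look like `call_with_retries(run_chatgpt, prompt, client)`.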

## Create JSON Entries

- Next, we load the file we want to modify:

In [5]:
# Import the json module for reading and writing JSON files
import json

# Path of the JSON file to read
json_file = "instruction-examples.json"

# Open and parse the JSON file
with open(json_file, "r") as file:
    json_data = json.load(file)

# Print the number of entries in the dataset
print("Number of entries:", len(json_data))

Number of entries: 200


- We first try the OpenAI chat API on a small sample to ensure that it works correctly:

In [6]:
# Loop over the first 5 entries as a test
for entry in json_data[:5]:
    # Get the output text of the entry
    text = entry["output"]
    # Build the prompt: convert the text to passive voice, with no extra explanation
    prompt = f"Without adding any response or explanation, convert the following text to passive voice: {text}"

    # Print the original text
    print("\nInput:")
    print(">>", text)
    # Print the converted text
    print("\nOutput:")
    print(">>", run_chatgpt(prompt, client))
    print("\n-------------------------")


Input:
>> The verb in the sentence is "sleeps."

Output:
>> The sentence is "sleeps."

-------------------------

Input:
>> The plural form of "goose" is "geese."

Output:
>> The plural form of "goose" is referred to as "geese."

-------------------------

Input:
>> The three primary colors are red, blue, and yellow.

Output:
>> Red, blue, and yellow are considered the three primary colors.

-------------------------

Input:
>> They had finished the game.

Output:
>> The game had been finished by them.

-------------------------

Input:
>> The abbreviation for "Doctor of Philosophy" is Ph.D.

Output:
>> The abbreviation "Ph.D." is used for "Doctor of Philosophy".

-------------------------
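
- Since the same prompt string is reused in the cells that follow, it could be factored into a small helper so that the wording only needs to be changed in one place. The names `PASSIVE_PROMPT` and `build_passive_prompt` below are hypothetical; the prompt text itself is the one used above.

```python
# Prompt template, copied from the loop above
PASSIVE_PROMPT = (
    "Without adding any response or explanation, "
    "convert the following text to passive voice: {text}"
)


def build_passive_prompt(text):
    """Hypothetical helper: fill the entry text into the prompt template."""
    return PASSIVE_PROMPT.format(text=text)
```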


- Let's now extend the code to store the generated entries in `json_data` and add a progress bar (still using only the first 5 entries):

In [7]:
# Import tqdm to display a progress bar
from tqdm import tqdm


# Loop over the first 5 entries with a progress bar
for i, entry in tqdm(enumerate(json_data[:5]), total=len(json_data[:5])):
    # Get the output text of the entry
    text = entry["output"]
    # Build the prompt: convert the text to passive voice, with no extra explanation
    prompt = f"Without adding any response or explanation, convert the following text to passive voice: {text}"
    # Store the generated passive voice text in the "output_2" field
    json_data[i]["output_2"] = run_chatgpt(prompt, client)

100%|██████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.23it/s]


- Once more, let's make sure that the new entries (`"output_2"`) look ok:

In [8]:
# Show the first entry to check that the "output_2" field was added
json_data[0]

{'instruction': 'Identify the verb in the following sentence: The cat sleeps on the couch.',
 'input': '',
 'output': 'The verb in the sentence is "sleeps."',
 'output_2': 'The sentence is "sleeps."'}
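
- Inspecting entries by hand does not scale to the full dataset, so a rough automated check can help flag candidates for manual review. The sketch below uses a crude length heuristic; the function name `flag_suspicious` and the `min_ratio` threshold are assumptions, and the check will produce both false positives and misses.

```python
def flag_suspicious(entries, min_ratio=0.7):
    """Hypothetical check: return indices of entries whose 'output_2'
    is missing, empty, unchanged, or much shorter than 'output'."""
    flagged = []
    for i, entry in enumerate(entries):
        original = entry["output"].strip()
        converted = entry.get("output_2", "").strip()
        # A much shorter conversion may have dropped part of the meaning
        too_short = len(converted) < min_ratio * len(original)
        if not converted or converted == original or too_short:
            flagged.append(i)
    return flagged
```

- For example, `flag_suspicious(json_data[:5])` would return the indices of the converted entries that deserve a second look.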

- Finally, if everything above looks ok, let's run the conversion to passive voice on our entire JSON dataset (this took just under 4 minutes in the run shown here):

In [9]:
# Loop over all entries with a progress bar
for i, entry in tqdm(enumerate(json_data), total=len(json_data)):
    # Get the output text of the entry
    text = entry["output"]
    # Build the prompt: convert the text to passive voice, with no extra explanation
    prompt = f"Without adding any response or explanation, convert the following text to passive voice: {text}"
    # Store the generated passive voice text in the "output_2" field
    json_data[i]["output_2"] = run_chatgpt(prompt, client)

100%|██████████████████████████████████████████████████████████████████| 200/200 [03:43<00:00,  1.12s/it]
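
- Because each API call costs money and the full run takes several minutes, it can be useful to skip entries that already have an `"output_2"` field when the loop is rerun after an interruption. The following is a minimal sketch; `fill_missing` and its `convert_fn` argument are hypothetical names, with `convert_fn` standing in for whatever function turns the original text into its passive voice version.

```python
def fill_missing(entries, convert_fn, key="output_2"):
    """Hypothetical helper: apply convert_fn to entry['output'] only for
    entries that do not have `key` yet; returns the number of new calls."""
    n_calls = 0
    for entry in entries:
        if key in entry:
            continue  # Already converted in an earlier run
        entry[key] = convert_fn(entry["output"])
        n_calls += 1
    return n_calls
```

- In this notebook, `convert_fn` could be a small function that builds the prompt for a given text and passes it to `run_chatgpt`.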


- After the conversion is completed, we save the file:

In [10]:
# Build the new file name by appending "-modified" to the original name
new_json_file = json_file.replace(".json", "-modified.json")


# Write the processed data to the new JSON file
with open(new_json_file, "w") as file:
    json.dump(json_data, file, indent=4)  # Indent the JSON output for readability
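
- As an optional last step, the saved file can be reloaded to confirm that it parses and that the new field is present. The helper below is a sketch; the name `count_converted` is hypothetical.

```python
import json


def count_converted(path, key="output_2"):
    """Hypothetical check: reload the saved JSON file and return
    (total entries, entries that contain `key`)."""
    with open(path, "r") as file:
        data = json.load(file)
    return len(data), sum(1 for entry in data if key in entry)
```

- Calling `count_converted(new_json_file)` after a complete run should return two equal numbers; a smaller second number would indicate entries that were not converted.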