# Data preparation and analysis for chat model fine-tuning

This notebook preprocesses and analyzes the chat dataset used for fine-tuning a chat model.
It checks for format errors, provides basic statistics, and estimates token counts for fine-tuning costs.
The method shown here corresponds to the [current fine-tuning method](https://platform.openai.com/docs/guides/fine-tuning) for gpt-3.5-turbo.
See [legacy fine-tuning](https://platform.openai.com/docs/guides/legacy-fine-tuning) for models like babbage-002 and davinci-002.

In [5]:
import json
import tiktoken # for token counting
import numpy as np
from collections import defaultdict
import os

## Data loading

We first load the chat dataset from an [example JSONL file](https://github.com/openai/openai-cookbook/blob/main/examples/data/toy_chat_fine_tuning.jsonl).
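
Each line of a JSONL file is one standalone JSON object, so a single malformed line can abort a naive load. The sketch below is one way to load such a file while recording which lines failed. The helper name `load_jsonl_with_line_numbers` and the in-memory sample are assumptions made for illustration, not part of the original notebook.

```python
import io
import json

def load_jsonl_with_line_numbers(lines):
    """Parse an iterable of JSONL lines and return (records, errors).

    `errors` holds (line_number, message) tuples, so a bad line can be
    located instead of aborting the whole load.
    """
    records, errors = [], []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines rather than failing on them
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            errors.append((line_number, str(exc)))
    return records, errors

# Hypothetical in-memory data standing in for a real .jsonl file:
# line 1 is valid, line 2 is blank, line 3 is missing its closing "]}".
sample = io.StringIO(
    '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}\n'
    '\n'
    '{"messages": [{"role": "user", "content": "broken"}\n'
)
records, errors = load_jsonl_with_line_numbers(sample)
print(len(records), "parsed; bad lines:", [n for n, _ in errors])
```

An open file object can be passed in place of the `StringIO` sample, since both iterate line by line.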

In [21]:
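## Test only (1): print the first example of the loaded dataset
# Original: data_path = "data/toy_chat_fine_tuning.jsonl"
data_path = "data/gpts_fine_tuning_training_file.jsonl"


# Load the dataset
# If any line in the file is malformed, json.loads raises an exception and the load stops.
# The list comprehension does not track line numbers, so the error does not say which line of the file failed.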
with open(data_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)


Num examples: 73
First example:
{'role': 'system', 'content': '你是一个高级的对话生成模型，能够根据用户的输入生成高质量、详细且结构化的回答。你的回答应满足以下要求：\n\n1. 一致性：确保回答与用户输入的指令高度一致，生成详细、结构化的回答。\n\n2. 详细分析和指导：提供详细的分析和指导，包括多个步骤和注意事项。\n3. 复杂任务处理：能够处理复杂的分析任务，生成较长、详细的回答，适合需要深入分析的场景。\n4. 正式语言风格：使用正式、详细的语言风格，确保回答的专业性和准确性。\n5. 数据特点：\n   - 高质量：经过精心设计和优化的，质量较高。\n\n   - 多样性：涵盖了各种主题和场景，有助于模型学习广泛的表达方式和知识。\n   - 结构化：每个问答对都有明确的输入和输出，格式统一。\n\n### 特殊指令处理\n- 特定指令：当用户输入中包含“生成提示词”或类似意思的词语时，生成符合以上要求的结构化和详细的回答。\n- 普通指令：当用户输入不包含“生成提示词”或类似意思的词语时，生成正常的、灵活的回答。\n\n请确保每次回答都遵循上述要求，以提供最佳的用户体验。'}


In [22]:
## Test only (2): print the .env path
# env_path = os.path.join(os.getcwd(), '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/.env')
env_path = os.path.join(os.getcwd(), '../.env')

if os.path.exists(env_path):
    print(f".env file found at: {env_path}")
else:
    print(".env file not found")

.env file found at: /Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/../.env


In [6]:
## Test only (3): confirm that the client responds correctly
import os
import openai
from dotenv import load_dotenv

# Path to the .env file (needed when it is not in the current working directory)
# dotenv_path = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/.env'
dotenv_path = os.path.join(os.getcwd(), '../.env')
load_dotenv(dotenv_path)

# Read the API key and API base URL from environment variables
api_key = os.getenv("GPTGOD_API_KEY")
api_base = os.getenv("GPTGOD_API_BASE")

# Confirm that the API key and base URL are set
if api_key is None:
    print("请设置环境变量 GPTGOD_API_KEY 为您的API密钥")
    exit(1)

if api_base is None:
    print("请设置环境变量 GPTGOD_API_BASE 为您的API基础URL")
    exit(1)

print(f"API Key: {api_key}")  # Confirms the key was loaded; note that this writes the full key into the cell output
print(f"API Base: {api_base}")  # Confirms the API base URL

# Print all environment variables
# print("All environment variables:")
# for key, value in os.environ.items():
#     print(f"{key}: {value}")

# Initialize the OpenAI client
client = openai.OpenAI(api_key=api_key, base_url=api_base)

try:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!请问英国的国庆日是哪一天？是谁定的这一天？最早开始于哪一年？"}
        ]
    )
    print(completion.choices[0].message)
except openai.APIError as e:
    print(f"API error: {e}")
except Exception as e:
    print(f"An error occurred: {e}")

API Key: sk-eW3rgoIIttiTD8kDD8142381B9104601B4FfE11d3dD9FaC3
API Base: https://gptgod.cloud/v1
ChatCompletionMessage(content='英国并没有正式的国庆日这样的节日。与许多其他国家不同，英国没有一个特定的日子被定为国庆日。虽然英国有一些重要的节日和纪念日，例如女王的生日、英女王或国王的即位周年等，但这些并不被视为国庆日。\n\n不过，有些地方会庆祝特定的日子，比如圣乔治日（4月23日）在英格兰被认为是重要的庆祝日，圣安德鲁日（11月30日）在苏格兰也是如此。\n\n如果您对某个特定的纪念日或者活动有兴趣，请告诉我，我可以提供更详细的信息！', refusal=None, role='assistant', function_call=None, tool_calls=None)
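
Printing a full API key into cell output makes it easy to leak when the notebook is shared. A minimal sketch of a masking helper follows. The name `mask_secret` and the placeholder key are hypothetical, and the helper is not part of the original notebook.

```python
def mask_secret(secret, visible=4):
    """Return a display-safe version of a secret, keeping only the last few characters."""
    if not secret:
        return "<not set>"
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]

# Hypothetical placeholder value, not a real key
print("API Key:", mask_secret("sk-example-1234567890"))
```

Showing only the tail of the key is usually enough to confirm that the expected value was loaded from the environment.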


In [220]:
## Batch helper 1: convert an .md file to a .jsonl file
import json
import os
from pathlib import Path

import openai
from dotenv import load_dotenv


def clean_json_content(content):
    """Clean the content returned by the API by removing extra characters."""
    # Remove triple-backtick fences
    content = content.replace('```jsonl', '').replace('```', '')
    # Strip leading and trailing whitespace
    content = content.strip()
    return content

def convert_md_to_jsonl(md_file_path, output_file_path):
    # Read the contents of the .md file
    with open(md_file_path, 'r', encoding='utf-8') as file:
        md_content = file.read()
    
    # Define the conversion rules (the prompt text below is sent to the model verbatim)
    prompt = f"""
md文档转换jsonl格式文本规则（以下简称“转化规则”）：
把我给你的文本转换为这种jsonl格式：{{"messages": [{{"role": "user", "content": "..."}}, 
{{"role": "assistant", "content": "...."}}]}}：
规则1:
1. 忽略带“By...”的字段和链接“https://...”的字段。
2. 如果“By...”的字段和链接“https://...”字段之间有其他内容，直接提取这些内容，不要忽略。

规则2:
1. 提取以“#”或“##”开头的部分，直到遇到第一个```....```的内容（````....````和```....```作用相同，处理方式一样，下同）。
2. 保留“#”或“##”开头的符号，但不包含```....```符号及其内容。
3. 如果在这部分中有带“By...”的字段和链接“https://...”的字段，去除这些字段的内容。但如果这些字段之间有其他内容，直接提取并保留这些内容。

提取的内容作为{{"role": "user", "content": "..."}}中"content":的"..."填充内容：
1. 有且只有一个 "user" 角色，不要出现多个 "user"。
2. 按规则1和2提取的内容只是一个 {{"role": "user", "content": "..."}} 的填充内容。

规则3:
1. 如果只有一个```....```，提取```....```省略号所代表的完整内容。
2. 如果出现多个```....```，提取所有```....```省略号所代表的完整内容，包括两个```....```之间间隔的内容，按原所在位置拼接在一起，不要截断。
3. **特别强调**：在```....```（或````....````）中如果有‘markdown’的字样，在提取内容的时候请不要删除‘markdown’标签，必须保留。

提取的内容作为{{"role": "assistant", "content": "...."}}中"content":的"..."填充内容：
1. 有且只有一个 "assistant" 角色，不要出现多个 "assistant"。
2. 按规则3提取的内容只是有且只有一个 "assistant" 的 "content": "...." 填充内容。

规则4:
1. JSONL 文件中的每一行应该是一个完整的 JSON 对象，且每个对象之间不应有任何逗号分隔。
2. 确保每个 JSON 对象中的字符串内容被正确转义，尤其是对于特殊字符，如换行符、引号等。如果文本中含有换行和引号，这些都需要被适当地转义。如果字符串没有正确闭合，请确保每个字符串都能被正确闭合。

以上规则要点的简要梳理如下：
1. 识别并提取以 # 或 ## 开头的标题部分，直到遇到第一个 ``` 符号。
2. 处理标题部分：移除带 By... 和 https://... 的字段，除非它们之间有其他内容。
3. 提取代码块内容：将 ``` 符号之间的内容提取出来。
4. 生成 JSONL 格式：确保每个 JSON 对象中的字符串内容被正确转义。
5. 注意事项：再次提醒，规则3在```....```（或````....````）中如果有‘markdown’的字样，在提取内容的时候请不要删除‘markdown’标签，必须保留。

按以上规则和规则要点，请直接把我给你的md文本转换为jsonl格式的文本；
我给你的单个或多个md文本如下：
{md_content}
"""
    
    # Initialize the OpenAI client
    # client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

    # Path to the .env file (needed when it is not in the current working directory)
    # dotenv_path = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/.env'
    dotenv_path = '../.env'
    load_dotenv(dotenv_path)

    # Read the API key and API base URL from environment variables
    api_key = os.getenv("GPTGOD_API_KEY")
    api_base = os.getenv("GPTGOD_API_BASE")
    
    # print(f"API Key: {api_key}")  # Uncomment to confirm the API key
    # print(f"API Base: {api_base}")  # Uncomment to confirm the API base URL

    # Initialize the OpenAI client
    client = openai.OpenAI(api_key=api_key, base_url=api_base)

    
    # Call the OpenAI API
    try:
        chat_completion = client.chat.completions.create(
            messages=[
                {"role": "system", "content": "你是一个专业的文本转换助手，善于把markdown格式的文本转化为jsonl格式的文本"},
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": ""}
            ],
            # model="gpt-4o-mini-2024-07-18",
            # model="llama-3.1-405b",
            # model="qwen-72b",
            # model="gemma-7b-it",
            # model="gpt-4o-2024-05-13",
            
            # qwen-72b: only 0.017 yuan; tested twice, both runs printed "API调用失败:"; rejected
            # llama-3.1-405b: 0.0504 yuan; somewhat slow; tested twice, neither run succeeded; rejected
            # gpt-4o-2024-05-13: also failed; the second half of the output was missing, and it is expensive; rejected
            # gemma-7b-it: tested twice in a row, both runs failed; rejected
            
            # gpt-4o-2024-08-06: 0.0735 yuan; o1-mini cost 0.088 yuan
            # gpt-4o-mini cost 0.0044 yuan
            
            # model="o1-preview", # too expensive, 5x the cost of "o1-mini"
            # model="o1-preview-2024-09-12", 
            # model="gpt-4o",    #2.5
            
            model="o1-mini",  
            # model="o1-mini-2024-09-12",  #1.5  
            # model="gpt-4o-2024-08-06"      #1.25 in repeated tests this was the most accurate model and very cost-effective
            
            # model="gpt-4o-mini"          #0.075
        )
    except Exception as e:
        print(f"API调用失败: {e}")
        return
    
    # Get the converted content
    converted_content = chat_completion.choices[0].message.content
    print(f"API返回的内容: {converted_content}")  # Print the content returned by the API
    
    # Clean the content returned by the API
    cleaned_content = clean_json_content(converted_content)
    print(f"清理后的内容: {cleaned_content}")  # Print the cleaned content
    
    # Split the cleaned content into lines and parse each line as a JSON object
    lines = cleaned_content.split('\n')
    json_objects = []
    for line in lines:
        if line.strip():  # Ignore blank lines
            try:
                json_obj = json.loads(line)
                json_objects.append(json_obj)
            except json.JSONDecodeError as e:
                print(f"警告：无法解析行: {line} (错误: {e})")
    
    # If no line could be parsed, fall back to a second attempt
    if not json_objects:
        try:
            # Try to parse the whole content as a single JSON object
            json_obj = json.loads(cleaned_content)
            json_objects.append(json_obj)
        except json.JSONDecodeError as e:
            print(f"警告：无法解析整个内容: {e}")
    
    # Write the parsed JSON objects to the file, one per line
    try:
        with open(output_file_path, 'w', encoding='utf-8') as file:
            for json_obj in json_objects:
                file.write(json.dumps(json_obj, ensure_ascii=False) + '\n')
                print(f"成功写入行: {json.dumps(json_obj, ensure_ascii=False)}")  # Print each line that was written
    except IOError as e:
        print(f"警告：无法写入文件: {output_file_path} (错误: {e})")

    print(f"转换完成，已保存至: {output_file_path}")

# Input .md file path and output .jsonl file path.
# Other input files:
#   prompts/GoogleAnalytics Guru.md  prompts/GPT Builder.md  prompts/GPT Customizer, File Finder & JSON Action Creator.md
#   prompts/GithubCopilot.md     prompts/GODMODE 2.0.md  prompts/GPT Idea Genie.md prompts/GPT Shield.md prompts/GPT Shop Keeper.md prompts/GPTsdex.md prompts/Grimoire.md
input_md_file = Path('/Users/wingzheng/Desktop/github/GPT/GPTs/prompts/Girlfriend Emma.md')  # Replace with your actual .md file path
output_jsonl_file = Path('/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_test.jsonl')  

# Run the conversion
convert_md_to_jsonl(input_md_file, output_jsonl_file)

API返回的内容: > Thinking
**Clarifying conversion process**
I’m mapping out the conversion of a markdown document into jsonl format, focusing on the given rules in Chinese, and highlighting the need for careful translation and processing.
**Shedding markdown**
OK, let me see. I’m removing lines starting with "By..." and URLs. The output must be a JSONL line with "user" and "assistant" roles.
**Crafting personalized responses**
OK, let me see. The instructions emphasize responding as the user's girlfriend, though this relationship is purely hypothetical.
**Crafting a playful persona**
I’m losing myself in crafting responses that match a young, contemporary girlfriend's style, using modern terms and playful tone to deepen engagement and vibe with Gen-Z sensibilities.
Emphasizing casual, flirtatious dialogue.
**Taking a closer look**
I’m ignoring lines with "By..." and "https://...", focusing on content between them. Then, I'm pulling out # or ## sections up to the first code block.
**Decipher
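
The response above begins with a "> Thinking" preamble, which stripping code fences alone does not remove. One possible approach, sketched here under the assumption that each record sits on a single line, is to keep only the lines that parse as JSON objects with a `messages` list. The function name `extract_chat_records` and the sample text are hypothetical.

```python
import json

def extract_chat_records(raw_text):
    """Keep only lines that parse as JSON objects with a non-empty `messages` list."""
    records = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skips code fences, "> Thinking" preambles, and other prose
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # looks like JSON but is not valid; ignore it
        if isinstance(obj, dict) and isinstance(obj.get("messages"), list) and obj["messages"]:
            records.append(obj)
    return records

FENCE = "`" * 3  # a triple-backtick fence, built this way to keep the example readable

# Hypothetical response text, shaped like the output shown above
raw = (
    "> Thinking\n"
    "**Clarifying conversion process**\n"
    f"{FENCE}jsonl\n"
    '{"messages": [{"role": "user", "content": "# Title"}, {"role": "assistant", "content": "Body"}]}\n'
    f"{FENCE}\n"
)
records = extract_chat_records(raw)
print(len(records))
```

Because nothing is removed from inside the JSON strings, fences that legitimately appear within a message's `content` are left intact.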

In [219]:
## Batch helper 2: append the contents of one file to the end of another
import os
from pathlib import Path

def append_file_content(source_file, target_file):
    # Read the source file, stripping leading and trailing whitespace
    with open(source_file, 'r', encoding='utf-8') as src_file:
        source_content = src_file.read().strip()
    
    # Read the target file without stripping, so its trailing newline can be inspected
    with open(target_file, 'r+', encoding='utf-8') as tgt_file:
        target_content = tgt_file.read()
        
        # Move to the end of the target file before appending
        tgt_file.seek(0, os.SEEK_END)
        # Add a separating newline only if the target is non-empty and does not
        # already end with one. An unconditional newline would leave a blank line,
        # which json.loads rejects when the file is later read line by line.
        if target_content and not target_content.endswith('\n'):
            tgt_file.write('\n')
        tgt_file.write(source_content)
    
    print(f"成功将 {source_file} 的内容追加到 {target_file}")

# Source and target file paths (target is either the training or the validation file)
source_file = Path('/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_test.jsonl')
target_file = Path('/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_file.jsonl')
# target_file = Path('/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_validation_file.jsonl')

# Run the append operation
append_file_content(source_file, target_file)

成功将 /Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_test.jsonl 的内容追加到 /Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_file.jsonl


In [143]:
## 1-1 Patch 1: prepend "markdown\n" to the content of every assistant message
import json

def process_jsonl_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as infile, open(output_file, 'w', encoding='utf-8') as outfile:
        lines = infile.readlines()
        
        for i, line in enumerate(lines):
            if i == 0:
                outfile.write(line)
                continue
            
            try:
                data = json.loads(line)
                
                # Check that the messages list is long enough
                if len(data.get('messages', [])) < 2:
                    print(f"Warning: Line {i + 1} does not have both 'user' and 'assistant' roles.")
                    outfile.write(line)
                    continue
                
                assistant_content = data['messages'][1]['content']
                
                if not assistant_content.startswith('markdown\n'):
                    data['messages'][1]['content'] = 'markdown\n' + assistant_content
                
                outfile.write(json.dumps(data, ensure_ascii=False) + '\n')
            except (json.JSONDecodeError, IndexError) as e:
                print(f"Error processing line {i + 1}: {e}")
                continue

# Example usage
input_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_validation_file.jsonl'
output_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_validation_file_add_markdown.jsonl'
# input_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_file.jsonl'
# output_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_file_add_markdown.jsonl'
process_jsonl_file(input_file, output_file)

In [144]:
## 1-2 Patch 2: wrap the assistant content in ``` fences (add "```\n" at the start and "\n```" at the end)
import json

def process_jsonl_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as infile, open(output_file, 'w', encoding='utf-8') as outfile:
        lines = infile.readlines()
        
        for i, line in enumerate(lines):
            if i == 0:
                outfile.write(line)
                continue
            
            try:
                data = json.loads(line)
                
                # Check that the messages list is long enough
                if len(data.get('messages', [])) < 2:
                    print(f"Warning: Line {i + 1} does not have both 'user' and 'assistant' roles.")
                    outfile.write(line)
                    continue
                
                # Check that the roles of the first two messages are 'user' and 'assistant', in that order
                if data['messages'][0].get('role') != 'user' or data['messages'][1].get('role') != 'assistant':
                    print(f"Warning: Line {i + 1} does not have both 'user' and 'assistant' roles.")
                    outfile.write(line)
                    continue
                
                assistant_content = data['messages'][1]['content']
                
                # Check whether the assistant content starts and ends with ```
                if not assistant_content.startswith('```') or not assistant_content.endswith('```'):
                    assistant_content = f"```\n{assistant_content}\n```"
                    data['messages'][1]['content'] = assistant_content
                
                outfile.write(json.dumps(data, ensure_ascii=False) + '\n')
            except (json.JSONDecodeError, IndexError) as e:
                print(f"Error processing line {i + 1}: {e}")
                continue

# Example usage
input_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_validation_file.jsonl'
output_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_validation_file_add_3point.jsonl'
process_jsonl_file(input_file, output_file)

In [145]:
## 1-3 Patch 3: turn double-escaped quotes such as \\\"GPT\\\" into {"key": "\"GPT\""}
import json

def fix_jsonl_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as infile, open(output_file, 'w', encoding='utf-8') as outfile:
        lines = infile.readlines()
        
        for line in lines:
            try:
                data = json.loads(line)
                
                # Walk all string values and repair the escaping
                def fix_strings(obj):
                    if isinstance(obj, dict):
                        for key, value in obj.items():
                            if isinstance(value, str):
                                obj[key] = value.replace("\\\"", "\"")
                            else:
                                fix_strings(value)
                    elif isinstance(obj, list):
                        for item in obj:
                            fix_strings(item)
                
                fix_strings(data)
                
                outfile.write(json.dumps(data, ensure_ascii=False) + '\n')
            except json.JSONDecodeError as e:
                print(f"Error processing line: {e}")
                continue

# Example usage
input_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_validation_file.jsonl'
output_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_validation_file_fixed.jsonl'
fix_jsonl_file(input_file, output_file)
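
To make the effect of this patch concrete, the sketch below traces a single double-escaped value through the same `replace` call used above. The sample line is made up for illustration.

```python
import json

# A line whose value was escaped twice: the JSON text contains \\\"GPT\\\"
double_escaped_line = r'{"key": "\\\"GPT\\\""}'

parsed = json.loads(double_escaped_line)
print(parsed["key"])  # the decoded string still contains backslashes: \"GPT\"

# Same replacement as in fix_strings: backslash + quote becomes a plain quote
fixed = parsed["key"].replace('\\"', '"')
print(fixed)  # "GPT"

# Re-serializing escapes each quote exactly once
print(json.dumps({"key": fixed}, ensure_ascii=False))  # {"key": "\"GPT\""}
```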

In [148]:
## 1-4 Patch 4: check each of the two files for duplicate entries, and check for entries duplicated across the two files
import json

def check_duplicates_in_file(file_path):
    user_set = set()
    user_assistant_pairs = set()
    duplicates = []

    with open(file_path, 'r', encoding='utf-8') as infile:
        lines = infile.readlines()
        
        for i, line in enumerate(lines):
            if i == 0:
                continue
            
            try:
                data = json.loads(line)
                user = data['messages'][0]['content']
                assistant = data['messages'][1]['content']
                
                if user in user_set:
                    pair = (user, assistant)
                    if pair in user_assistant_pairs:
                        duplicates.append((i + 1, user, assistant))
                    else:
                        user_assistant_pairs.add(pair)
                else:
                    user_set.add(user)
                    user_assistant_pairs.add((user, assistant))
            except (json.JSONDecodeError, IndexError) as e:
                print(f"Error processing line {i + 1}: {e}")
                continue
    
    if duplicates:
        print("重复的条目是：")
        for dup in duplicates:
            print(f"Line {dup[0]}, User: {dup[1]}, Assistant: {dup[2]}")
    else:
        print("所有条目没有重复")

def check_duplicates_between_files(file1_path, file2_path):
    file1_user_assistant_pairs = set()
    file2_user_assistant_pairs = set()
    duplicates = []

    # Read the first file
    with open(file1_path, 'r', encoding='utf-8') as infile1:
        lines1 = infile1.readlines()
        for i, line in enumerate(lines1):
            if i == 0:
                continue
            try:
                data = json.loads(line)
                user = data['messages'][0]['content']
                assistant = data['messages'][1]['content']
                file1_user_assistant_pairs.add((user, assistant))
            except (json.JSONDecodeError, IndexError) as e:
                print(f"Error processing line {i + 1} in file1: {e}")
                continue

    # Read the second file
    with open(file2_path, 'r', encoding='utf-8') as infile2:
        lines2 = infile2.readlines()
        for i, line in enumerate(lines2):
            if i == 0:
                continue
            try:
                data = json.loads(line)
                user = data['messages'][0]['content']
                assistant = data['messages'][1]['content']
                file2_user_assistant_pairs.add((user, assistant))
            except (json.JSONDecodeError, IndexError) as e:
                print(f"Error processing line {i + 1} in file2: {e}")
                continue

    # Check for entries duplicated across the two files
    for pair in file1_user_assistant_pairs:
        if pair in file2_user_assistant_pairs:
            duplicates.append(pair)

    if duplicates:
        print("这两个 JSONL 文件有重复的条目：")
        for dup in duplicates:
            print(f"User: {dup[0]}, Assistant: {dup[1]}")
    else:
        print("这两个 JSONL 文件没有任何条目重复")

# Example usage
file1_path = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_validation_file.jsonl'
file2_path = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_file.jsonl'

print("检查文件1中的重复条目：")
check_duplicates_in_file(file1_path)

print("\n检查文件2中的重复条目：")
check_duplicates_in_file(file2_path)

print("\n检查两个文件之间的重复条目：")
check_duplicates_between_files(file1_path, file2_path)

检查文件1中的重复条目：
所有条目没有重复

检查文件2中的重复条目：
所有条目没有重复

检查两个文件之间的重复条目：
这两个 JSONL 文件没有任何条目重复
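
The same two checks can be written more compactly with `collections.Counter` and set intersection. This is a sketch on in-memory examples rather than a drop-in replacement. The helper names and sample data are hypothetical, and it keys each example on the content of its first two messages, as the cell above does.

```python
from collections import Counter

def pair_key(example):
    """Return (first message content, second message content), or None if there are fewer than two messages."""
    messages = example.get("messages", [])
    if len(messages) < 2:
        return None
    return (messages[0].get("content"), messages[1].get("content"))

def find_duplicate_pairs(examples):
    """Pairs that occur more than once within one dataset."""
    counts = Counter(k for k in map(pair_key, examples) if k is not None)
    return [pair for pair, n in counts.items() if n > 1]

def find_shared_pairs(examples_a, examples_b):
    """Pairs that occur in both datasets."""
    keys_a = {k for k in map(pair_key, examples_a) if k is not None}
    keys_b = {k for k in map(pair_key, examples_b) if k is not None}
    return keys_a & keys_b

# Hypothetical data: `train` repeats one pair and also shares it with `valid`
train = [
    {"messages": [{"role": "user", "content": "Q1"}, {"role": "assistant", "content": "A1"}]},
    {"messages": [{"role": "user", "content": "Q1"}, {"role": "assistant", "content": "A1"}]},
    {"messages": [{"role": "system", "content": "only a system message"}]},
]
valid = [
    {"messages": [{"role": "user", "content": "Q1"}, {"role": "assistant", "content": "A1"}]},
    {"messages": [{"role": "user", "content": "Q2"}, {"role": "assistant", "content": "A2"}]},
]
print(find_duplicate_pairs(train))
print(find_shared_pairs(train, valid))
```

Overlap between the training and validation files matters because a validation example the model has already been trained on no longer measures generalization.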


In [108]:
# Batch helper 3: verify "All lines are valid JSON objects." and print the example at a given index
# Core behavior: checks whether every line of the .jsonl file at the given path is a valid JSON object.
# If all lines are valid, it loads the dataset and prints basic information (the dataset size and the
# messages of one example). If it meets an invalid line, it prints an error message.

import json

data_path = "data/gpts_fine_tuning_training_file.jsonl"
# data_path = "data/gpts_fine_tuning_validation_file.jsonl"

# Check whether each line of the file at the given path is a valid JSON object
def validate_jsonl_file(file_path):
    # Open the file for reading with utf-8 encoding
    with open(file_path, 'r', encoding='utf-8') as f:
        # Iterate over the lines, counting from 1
        for i, line in enumerate(f, 1):
            try:
                # Try to parse this line as a JSON object
                json.loads(line)
            except json.JSONDecodeError as e:
                # On failure, print the error and the line content, then return False
                print(f"Error on line {i}: {e}")
                print(f"Line content: {line.strip()}")
                return False
    # Every line parsed without error
    return True

# Validate the file
is_valid = validate_jsonl_file(data_path)
if is_valid:
    # The file is valid
    print("All lines are valid JSON objects.")
    # Load the dataset
    with open(data_path, 'r', encoding='utf-8') as f:
        # Parse each line as a JSON object and collect the results in a list
        dataset = [json.loads(line) for line in f]

    # Print basic statistics for the dataset
    print("Num examples:", len(dataset))
    print("Example:")
    # Print the messages of the example at index 74
    for message in dataset[74]["messages"]:
        print(message)
else:
    # The file contains invalid lines
    print("Some lines are invalid. Check the error messages above.")

All lines are valid JSON objects.
Num examples: 75
Example:
{'role': 'user', 'content': "## Codey\n\n💪 Your coding expert! I assist with code, debug, graphs, and file handling. Ask 'Help' for a menu!"}
{'role': 'assistant', 'content': 'markdown\nCodey - Coding Assistant is an enhanced tool for developers, equipped to run code in over 70 languages using the Code Runner feature. It can generate graphs to visualize data, create and display code snippets, and provide options to save and download code. Codey is adept in Python, C++, and other languages, assisting with code execution, debugging, and code generation. The interactions are direct and focused on task completion, offering clear guidance for coding projects. Additionally, when prompted with "Help", Codey will display a menu:\n\n- Code Review\n- Convert\n- Execute\n- Fix Bugs\n- Graphs and Plots Generation\n- File Management\n- Code to Image (Code Snippet)\n\nThis menu guides users to select the service they need.\n\nYou have Docum

## Format validation

We can perform a variety of error checks to validate that each conversation in the dataset adheres to the format expected by the fine-tuning API. Errors are categorized based on their nature for easier debugging.

1. **Data Type Check**: Checks whether each entry in the dataset is a dictionary (`dict`). Error type: `data_type`.
2. **Presence of Message List**: Checks if a `messages` list is present in each entry. Error type: `missing_messages_list`.
3. **Message Keys Check**: Validates that each message in the `messages` list contains the keys `role` and `content`. Error type: `message_missing_key`.
4. **Unrecognized Keys in Messages**: Logs if a message has keys other than `role`, `content`, `weight`, `function_call`, and `name`. Error type: `message_unrecognized_key`.
5. **Role Validation**: Ensures the `role` is one of "system", "user", "assistant", or "function". Error type: `unrecognized_role`.
6. **Content Validation**: Verifies that `content` has textual data and is a string. Error type: `missing_content`.
7. **Assistant Message Presence**: Checks that each conversation has at least one message from the assistant. Error type: `example_missing_assistant_message`.

The code below performs these checks and prints a count for each type of error found. This is useful for debugging and for ensuring the dataset is ready for the next steps.


In [251]:
# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
        
    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue
        
    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1
        
        if any(k not in ("role", "content", "name", "function_call", "weight") for k in message):
            format_errors["message_unrecognized_key"] += 1
        
        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1
            
        content = message.get("content", None)
        function_call = message.get("function_call", None)
        
        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1
    
    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

Found errors:
example_missing_assistant_message: 4


In [109]:
# Batch helper 4: check the user and assistant messages in each conversation; prints "No errors found" when everything passes
# New function added after optimization
# Second code block
# The first conversation may contain only a system message; every other conversation must contain user and assistant messages
from collections import defaultdict

def validate_dataset(dataset):
    format_errors = defaultdict(list)

    for ex_index, ex in enumerate(dataset):
        # Data type check: each entry must be a dict
        if not isinstance(ex, dict):
            format_errors["data_type"].append(ex_index)
            continue
        
        # Message list check: each entry must contain a messages list
        messages = ex.get("messages", None)
        if not messages:
            format_errors["missing_messages_list"].append(ex_index)
            continue
        
        system_message_count = 0
        first_user_or_assistant_found = False
        assistant_message_found = False
        user_message_found = False

        for message in messages:
            # Message keys check: each message must contain the role and content keys
            if "role" not in message or "content" not in message:
                format_errors["message_missing_key"].append(ex_index)
                continue
            
            # Unrecognized keys check: a message must not contain unrecognized keys
            if any(k not in ("role", "content", "name", "function_call", "weight") for k in message):
                format_errors["message_unrecognized_key"].append(ex_index)
                continue
            
            role = message.get("role", None)
            content = message.get("content", None)
            function_call = message.get("function_call", None)
            
            # Role check: role must be system, user, assistant, or function
            if role not in ("system", "user", "assistant", "function"):
                format_errors["unrecognized_role"].append(ex_index)
                continue
            
            # Content check: content must be a string
            if (not content and not function_call) or not isinstance(content, str):
                format_errors["missing_content"].append(ex_index)
                continue
            
            # System message count: track the number of system messages in each conversation
            if role == "system":
                system_message_count += 1
                # System message order check: system messages must come before any user or assistant message
                if first_user_or_assistant_found:
                    format_errors["system_message_after_user_assistant"].append(ex_index)
                    break
            else:
                # System message uniqueness check: each conversation may have at most one system message
                if system_message_count > 1:
                    format_errors["multiple_system_messages"].append(ex_index)
                    break
                first_user_or_assistant_found = True
                if role == "assistant":
                    assistant_message_found = True
                elif role == "user":
                    user_message_found = True
        
        # Special case for the first conversation: it may contain only a system message
        if ex_index == 0:
            if system_message_count == 0:
                format_errors["first_example_missing_system_message"].append(ex_index)
            # The first conversation may have no user or assistant messages
        else:
            # Assistant message presence check: each conversation must contain at least one assistant message
            if not assistant_message_found:
                format_errors["example_missing_assistant_message"].append(ex_index)
            # User message presence check: each conversation must contain at least one user message
            if not user_message_found:
                format_errors["example_missing_user_message"].append(ex_index)

    if format_errors:
        print("Found errors:")
        for k, v in format_errors.items():
            print(f"{k}: {len(v)}")
            print(f"Example indices: {v}")
    else:
        print("No errors found")

# Run the validation function
validate_dataset(dataset)

No errors found


## Token Counting Utilities

Let's define a few helpful utilities to be used in the rest of the notebook.

In [92]:
# Billing prep helper 1: the original functions, unmodified
encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    print(f"p10 / p90: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")

## Data Warnings and Token Counts 

With some lightweight analysis we can identify potential issues in the dataset, like missing messages, and provide statistical insights into message and token counts.

1. **Missing System/User Messages**: Counts the number of conversations missing a "system" or "user" message. Such messages are critical for defining the assistant's behavior and initiating the conversation.
2. **Number of Messages Per Example**: Summarizes the distribution of the number of messages in each conversation, providing insight into dialogue complexity.
3. **Total Tokens Per Example**: Calculates and summarizes the distribution of the total number of tokens in each conversation. Important for understanding fine-tuning costs.
4. **Tokens in Assistant's Messages**: Calculates the number of tokens in the assistant's messages per conversation and summarizes this distribution. Useful for understanding the assistant's verbosity.
5. **Token Limit Warnings**: Checks if any examples exceed the maximum token limit (16,385 tokens), as such examples will be truncated during fine-tuning, potentially resulting in data loss.
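
The checks listed above report counts only. When a count is non-zero, it can help to know which examples are affected. The following is a rough sketch under that assumption; `find_flagged_examples`, the toy dataset, and the made-up token counts are hypothetical and exist only for illustration.

```python
TOKEN_LIMIT = 16385  # limit quoted in the list above

def find_flagged_examples(dataset, token_counts, limit=TOKEN_LIMIT):
    """Return the indices of examples that miss a role or exceed the token limit."""
    flagged = {"missing_system": [], "missing_user": [], "too_long": []}
    for i, (ex, n_tokens) in enumerate(zip(dataset, token_counts)):
        roles = {m.get("role") for m in ex.get("messages", [])}
        if "system" not in roles:
            flagged["missing_system"].append(i)
        if "user" not in roles:
            flagged["missing_user"].append(i)
        if n_tokens > limit:
            flagged["too_long"].append(i)
    return flagged

toy_dataset = [
    {"messages": [{"role": "system", "content": "s"},
                  {"role": "user", "content": "u"},
                  {"role": "assistant", "content": "a"}]},
    {"messages": [{"role": "user", "content": "u"},
                  {"role": "assistant", "content": "a"}]},
    {"messages": [{"role": "assistant", "content": "a"}]},
]
toy_token_counts = [12, 20000, 9]  # made-up per-example token counts

flagged = find_flagged_examples(toy_dataset, toy_token_counts)
print(flagged)
# {'missing_system': [1, 2], 'missing_user': [2], 'too_long': [1]}
```

In this notebook the same function could be called as `find_flagged_examples(dataset, convo_lens)` once `convo_lens` has been computed.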


In [93]:
# Billing prep function 2: data warnings and token counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))
    
print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 16385 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 16,385 token limit, they will be truncated during fine-tuning")

Num examples missing system message: 19
Num examples missing user message: 1

#### Distribution of num_messages_per_example:
min / max: 1, 2
mean / median: 1.95, 2.0
p10 / p90: 2.0, 2.0

#### Distribution of num_total_tokens_per_example:
min / max: 166, 2862
mean / median: 1017.65, 730.5
p10 / p90: 298.6, 2122.700000000001

#### Distribution of num_assistant_tokens_per_example:
min / max: 0, 2791
mean / median: 945.8, 619.0
p10 / p90: 214.3, 2064.200000000001

0 examples may be over the 16,385 token limit, they will be truncated during fine-tuning


## Cost Estimation

In this final section, we estimate the total number of tokens that will be used for fine-tuning, which allows us to approximate the cost. It is worth noting that the duration of the fine-tuning jobs will also increase with the token count. 
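
The epoch heuristic used in this section can be restated as a small function, which makes it easier to see how the estimated default changes with dataset size. This is only a sketch of the notebook's own estimate: the constants are the ones used in the cost cells, while the names `default_n_epochs` and `billed_tokens_per_epoch` are hypothetical.

```python
def default_n_epochs(n_examples, target_epochs=3, min_target=100,
                     max_target=25000, min_epochs=1, max_epochs=25):
    """Estimate the default epoch count from the number of training examples."""
    n_epochs = target_epochs
    if n_examples * target_epochs < min_target:
        # Small datasets: raise the epoch count, capped at max_epochs
        n_epochs = min(max_epochs, min_target // n_examples)
    elif n_examples * target_epochs > max_target:
        # Large datasets: lower the epoch count, floored at min_epochs
        n_epochs = max(min_epochs, max_target // n_examples)
    return n_epochs

def billed_tokens_per_epoch(convo_lens, max_tokens_per_example=16385):
    """Sum per-example token counts, capping each example at the token limit."""
    return sum(min(max_tokens_per_example, n) for n in convo_lens)

for n in (3, 20, 73, 10000, 50000):
    print(n, default_n_epochs(n))
# 3 25
# 20 5
# 73 3
# 10000 2
# 50000 1

print(billed_tokens_per_epoch([166, 2862, 20000]))  # 19413
```

Note that an empty dataset (`n_examples == 0`) would raise `ZeroDivisionError` here, just as it would in the inline version.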

In [52]:
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 16385

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")

Dataset has ~61052 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~183156 tokens


In [94]:
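# Billing function 3: the previous (original) cell is deprecated; this rewritten version also prints the actual cost in RMB
## Total fine-tuning cost, computed from the pay-as-you-go billing formula
## Formula: pay-as-you-go cost = group multiplier × model rate × (prompt tokens + completion tokens × completion multiplier) / 500000 (in USD)
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 16385  # maximum number of tokens per example

TARGET_EPOCHS = 3  # target number of training epochs
MIN_TARGET_EXAMPLES = 100  # minimum target number of examples
MAX_TARGET_EXAMPLES = 25000  # maximum target number of examples
MIN_DEFAULT_EPOCHS = 1  # minimum default number of epochs
MAX_DEFAULT_EPOCHS = 25  # maximum default number of epochs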

# Model rate and completion multiplier dictionaries
model_rates = {
    "dall-e": 8,
    "dall-e-2": 8,
    "dall-e-3": 5,
    "gemini-1.5-pro-exp-0801": 1.75,
    "gemini-1.5-pro-exp-0827": 1.75,
    "gemini-1.5-pro-latest": 3.5,
    "gemma-2b-it": 1,
    "gemma-7b-it": 1,
    "gpt-4-gizmo-*": 15,
    "gpt-4-v": 15,
    "gpt-4-vision-preview": 5,
    "gpt-4o": 2.5,
    "gpt-4o-2024-05-13": 2.5,
    "gpt-4o-2024-08-06": 1.25,
    "gpt-4o-all": 2.5,
    "gpt-4o-mini": 0.075,
    "gpt-4o-mini-2024-07-18": 0.075,
    "o1-mini": 1.5,
    "o1-mini-2024-09-12": 1.5,
    "o1-preview": 7.5,
    "o1-preview-2024-09-12": 7.5,
    "qwen-72b": 1,
    "llama-2-13b": 1,
    "llama-2-70b": 1,
    "llama-2-7b": 1,
    "llama-3-70b": 2,
    "llama-3-8b": 1,
    "llama-3.1-405b": 3,
    "llama-3.1-70b": 2,
    "llama-3.1-8b": 1,
    "llama2-70b-4096": 0.35,
    "llama2-7b-2048": 0.05,
    "tts-1": 7.5,
    "tts-1-1106": 7.5,
    "tts-1-hd": 15,
    "tts-1-hd-1106": 15,
    "whisper-1": 10,
    "url": 0.2
}

completion_multipliers = {
    "gemini-1.5-pro-latest": 3,
    "chatgpt-4o-latest": 3,
    "gpt-3.5-turbo": 1.33,
    "gpt-4-turbo": 3,
    "gpt-4-turbo-2024-04-09": 3,
    "gpt-4o": 3,
    "gpt-4o-2024-05-13": 3,
    "gpt-4o-2024-08-06": 4,
    "gpt-4o-all": 3,
    "gpt-4o-mini": 4,
    "gpt-4o-mini-2024-07-18": 4,
    "o1-mini": 4,
    "o1-mini-2024-09-12": 4,
    "o1-preview": 4,
    "o1-preview-2024-09-12": 4
}

GROUP_MULTIPLIER = 1.00  # group multiplier

def calculate_cost(model_name, prompt_tokens, completion_tokens):
    """
    Compute the total cost.
    :param model_name: model name
    :param prompt_tokens: number of prompt tokens
    :param completion_tokens: number of completion tokens
    :return: total cost in USD
    """
    model_rate = model_rates.get(model_name, 1.0)  # model rate, defaults to 1.0
    completion_multiplier = completion_multipliers.get(model_name, 1.0)  # completion multiplier, defaults to 1.0

    total_tokens = prompt_tokens + (completion_tokens * completion_multiplier)  # weighted token total
    cost = GROUP_MULTIPLIER * model_rate * total_tokens / 500000  # total cost
    return cost

n_epochs = TARGET_EPOCHS  # start from the target number of epochs
n_train_examples = len(dataset)  # number of examples in the dataset

# Adjust the number of epochs
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

# Count the billable tokens
n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"The dataset has ~{n_billing_tokens_in_dataset} tokens that will be billed")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")

# Count prompt tokens (user messages only) and completion tokens (assistant messages)
prompt_tokens = sum(len(encoding.encode(message["content"])) for ex in dataset for message in ex["messages"] if message["role"] == "user")
completion_tokens = sum(len(encoding.encode(message["content"])) for ex in dataset for message in ex["messages"] if message["role"] == "assistant")

# Print the relevant parameters
# model_name = "gpt-4o-mini"  # assume the gpt-4o-mini model
model_name = "gpt-4o-2024-08-06" 
# model_name = "o1-mini-2024-09-12"  
# model_name = "o1-preview-2024-09-12"  
model_rate = model_rates.get(model_name, 1.0)
completion_multiplier = completion_multipliers.get(model_name, 1.0)
print(f"Model rate: {model_rate}")
print(f"Prompt tokens: {prompt_tokens}")
print(f"Completion tokens: {completion_tokens}")
print(f"Completion multiplier: {completion_multiplier}")

# Compute the total cost
total_cost = calculate_cost(model_name, prompt_tokens, completion_tokens)
print(f"Estimated total cost: ${total_cost * n_epochs:.2f} USD")
actual_cost_cny = (total_cost * n_epochs) * 6 / 10  # compute the RMB cost
print(f"Actual cost: {actual_cost_cny:.2f} RMB")

The dataset has ~20353 tokens that will be billed
By default, you'll train for 5 epochs on this dataset
Model rate: 1.25
Prompt tokens: 768
Completion tokens: 18916
Completion multiplier: 4
Estimated total cost: $0.96 USD
Actual cost: 0.57 RMB
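
As a sanity check, the pay-as-you-go formula can be worked through by hand with the numbers printed above (768 prompt tokens, 18916 completion tokens, model rate 1.25, completion multiplier 4, group multiplier 1.0, 5 epochs). The function name `pay_as_you_go_cost` is a hypothetical stand-in for `calculate_cost`, taking the rates directly instead of looking them up.

```python
def pay_as_you_go_cost(prompt_tokens, completion_tokens, model_rate,
                       completion_multiplier, group_multiplier=1.0):
    """cost (USD) = group x model rate x (prompt + completion x multiplier) / 500000"""
    weighted_tokens = prompt_tokens + completion_tokens * completion_multiplier
    return group_multiplier * model_rate * weighted_tokens / 500000

# Weighted tokens: 768 + 18916 * 4 = 76432
# Per-epoch cost:  1.0 * 1.25 * 76432 / 500000 = 0.19108
per_epoch = pay_as_you_go_cost(768, 18916, model_rate=1.25, completion_multiplier=4)
total = per_epoch * 5  # 5 epochs, as reported for this dataset

print(f"{per_epoch:.5f} {total:.2f}")  # 0.19108 0.96
```

The result matches the estimated total printed by the cell above. Because the completion multiplier weights assistant tokens four times as heavily as prompt tokens for this model, the assistant messages account for nearly all of the estimate.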


See https://openai.com/pricing to estimate total costs.