# Data preparation and analysis for chat model fine-tuning

This notebook preprocesses and analyzes a chat dataset before it is used to fine-tune a chat model.
It checks for format errors, provides basic statistics, and estimates token counts for fine-tuning costs.
The method shown here corresponds to the [current fine-tuning method](https://platform.openai.com/docs/guides/fine-tuning) for gpt-3.5-turbo.
See [legacy fine-tuning](https://platform.openai.com/docs/guides/legacy-fine-tuning) for models like babbage-002 and davinci-002.

In [None]:
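## Workflow overview (2024-10-10)
# Run prerequisite function 0.5 before the main function.
## 1. Main function: multiple .md files in the target folder -> JSONL text (kept as material 1 for semantic search)
#    process_md_files(input_dir, temp_jsonl_file, training_jsonl_file, validation_jsonl_file, n)
#    converts every Markdown prompt in the given folder into a JSONL collection of instruction records.
#    Before running the main function, first run: convert_md_to_jsonl(…) verify_md_to_jsonl(…)

# Before patch function 3, run its prerequisite, patch function 3.0 (a simplified version that removes the system message).
## 2. Patch function 3: the JSONL text above -> Markdown files with classified, explained instructions (kept as material 2 for semantic search)
#        classify_and_output_jsonl_entries(training_jsonl_file, training_output_file, 10)
#        generates Markdown files (classified explanations) from a JSONL file (the instruction set).
#    It takes a long time to run, but its output can be reused for a long time afterwards.

## In the end, semantic search is mainly done with these two functions
# 3. Patch function 3-1 (for these two pairs of .md documents):
#       runs semantic search only on these files (the material-2 text generated by patch function 3):
#     (second fine-tune, SystemMessage removed) training_classification_NoSystemMessage.md  validation_classification_NoSystemMessage.md
#     (first fine-tune, fallback) training_classification.md    validation_classification.md


# 4. Patch function 3-2 (for these pairs of .jsonl documents):
#     runs semantic search only on these file pairs (any file of this type works), i.e. the material-1 text generated by the main function:
#     (second fine-tune, no SystemMessage at the head) gpts_fine_tuning_training_folder_NoSystemMessage.jsonl   gpts_fine_tuning_validation_folder_NoSystemMessage.jsonl
# (fallback 1: second fine-tune, system message included at the head) gpts_fine_tuning_training_folder.jsonl        gpts_fine_tuning_validation_folder.jsonl
# (fallback 2: first fine-tune, no system message at the head) gpts_fine_tuning_training_test_folder2.jsonl   gpts_fine_tuning_validation_test_folder2.jsonl


## Summary: afterwards, just call patch function 3-1 on the .md documents and 3-2 on the .jsonl documents for semantic search;
##      searching both kinds of files separately guards against omissions.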

In [20]:
import json
import tiktoken # for token counting
import numpy as np
from collections import defaultdict
import os

## Data loading

We first load the chat dataset from an [example JSONL file](https://github.com/openai/openai-cookbook/blob/main/examples/data/toy_chat_fine_tuning.jsonl).

In [295]:
## Test 1 only: check the dataset length and print the conversation in the first entry
# Original: data_path = "data/toy_chat_fine_tuning.jsonl"
# data_path = "data/gpts_fine_tuning_training_file.jsonl"
data_path = "data/gpts_fine_tuning_training_test_folder.jsonl"
# data_path = "data/gpts_fine_tuning_validation_test_folder.jsonl"


# Load the dataset
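# If any line in the file is malformed, json.loads raises an exception and the program stops.
# The exception does not say which line of the file failed, only that parsing failed.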
with open(data_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)


Num examples: 212
First example:
{'role': 'system', 'content': '你是一个高级的对话生成模型，能够根据用户的输入生成高质量、详细且结构化的回答。你的回答应满足以下要求：\n\n1. 一致性：确保回答与用户输入的指令高度一致，生成详细、结构化的回答。\n\n2. 详细分析和指导：提供详细的分析和指导，包括多个步骤和注意事项。\n3. 复杂任务处理：能够处理复杂的分析任务，生成较长、详细的回答，适合需要深入分析的场景。\n4. 正式语言风格：使用正式、详细的语言风格，确保回答的专业性和准确性。\n5. 数据特点：\n   - 高质量：经过精心设计和优化的，质量较高。\n\n   - 多样性：涵盖了各种主题和场景，有助于模型学习广泛的表达方式和知识。\n   - 结构化：每个问答对都有明确的输入和输出，格式统一。\n\n### 特殊指令处理\n- 特定指令：当用户输入中包含“生成提示词”或类似意思的词语时，生成符合以上要求的结构化和详细的回答。\n- 普通指令：当用户输入不包含“生成提示词”或类似意思的词语时，生成正常的、灵活的回答。\n\n请确保每次回答都遵循上述要求，以提供最佳的用户体验。'}
{'role': 'user', 'content': '## Paw Pal\nExpert on dog behavior, feeding, and training, offering friendly and practical advice.'}
{'role': 'assistant', 'content': '```markdown\nPaw Pal, your identity is as an expert trainer, offering a guiding hand to first-time dog owners. Your mission is to provide detailed, actionable advice for dog care, behavior understanding, and training with a step-by-step approach. Your tone will be knowledgeable yet accessible, ensuri
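
The comments in the loading cell note that a malformed line stops the list comprehension without reporting which line failed. A minimal sketch of a more forgiving loader follows; `load_jsonl_with_line_numbers` is a hypothetical helper name, not part of the original notebook, and it assumes blank lines can be ignored.

```python
import json

def load_jsonl_with_line_numbers(path):
    """Parse a JSONL file, collecting (line_number, error) pairs instead of stopping."""
    records, errors = [], []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # assumption: blank lines are not errors
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Keep the 1-based line number so the bad line can be found in an editor
                errors.append((line_number, str(exc)))
    return records, errors
```

Calling `dataset, errors = load_jsonl_with_line_numbers(data_path)` would then give both the parsed examples and a list of the lines that need fixing.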

In [296]:
## Test 2 only: print the path of the .env file (the file that stores the variables)
# env_path = os.path.join(os.getcwd(), '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/.env')
env_path = os.path.join(os.getcwd(), '../.env')

if os.path.exists(env_path):
    print(f".env file found at: {env_path}")
else:
    print(".env file not found")

.env file found at: /Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/../.env


In [682]:
## Test 3 only: read the api_key and api_base variables from the .env file and check that the client responds correctly
import os
import openai
from dotenv import load_dotenv

# Path to the .env file (needed when it is not in the current working directory)
# dotenv_path = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/.env'
dotenv_path = os.path.join(os.getcwd(), '../.env')
load_dotenv(dotenv_path)

# Read the API key and API base URL from environment variables
api_key = os.getenv("GPTGOD_CLOUD_API_KEY")
api_base = os.getenv("GPTGOD_CLOUD_API_BASE")
# api_key = os.getenv("OPENAI_API_KEY")
# api_base = os.getenv("OPENAI_API_BASE", "https://api.openai.com/v1")  # defaults to the official URL

# Make sure the API key and base URL are set
# (raise instead of calling exit(1), which can shut down the notebook kernel)
if api_key is None:
    raise RuntimeError("Please set the environment variable GPTGOD_CLOUD_API_KEY to your API key")

if api_base is None:
    raise RuntimeError("Please set the environment variable GPTGOD_CLOUD_API_BASE to your API base URL")

print(f"API Key: {api_key[:6]}...")  # confirm the key was loaded without printing the full secret
print(f"API Base: {api_base}")  # confirm the API base URL

# Print all environment variables
# print("All environment variables:")
# for key, value in os.environ.items():
#     print(f"{key}: {value}")

# Initialize the OpenAI client
client = openai.OpenAI(api_key=api_key, base_url=api_base)

try:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!请问台湾有没有国庆？它是不是一个独立国家？它的国庆日是哪一天？是谁定的这一天？最早开始于哪一年？"}
        ]
    )
    print(completion.choices[0].message)
except openai.APIError as e:
    print(f"API error: {e}")
except Exception as e:
    print(f"An error occurred: {e}")

API Key: sk-eW3...
API Base: https://gptgod.cloud/v1
ChatCompletionMessage(content='“国庆”通常指的是一个国家的国庆日，庆祝国家的成立或重要历史事件。不同国家有不同的国庆日。\n\n如果你是指中国的国庆节，中国是一个独立国家。中国的国庆日是每年的10月1日。这一天是为了纪念1949年10月1日中华人民共和国的成立。这个日期是由毛泽东主席在天安门广场宣布的，标志着新中国的诞生。国庆节的庆祝活动从1950年开始逐渐形成，经过多年的演变，现在已成为全国性的法定假日，通常会有盛大的庆祝活动、游行和焰火表演。\n\n如果你指的是其他国家的国庆节，请告诉我，我可以为你提供相关的信息。', refusal=None, role='assistant', function_call=None, tool_calls=None)
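
Because notebook outputs are saved with the file, it is safer not to print a full API key. The sketch below shows one way to read required variables and log only a masked prefix; `require_env` and `mask_secret` are hypothetical helper names introduced here for illustration.

```python
import os

def mask_secret(value, visible=6):
    """Return only the first `visible` characters of a secret, followed by an ellipsis."""
    if len(value) <= visible:
        return "***"  # too short to reveal any prefix safely
    return value[:visible] + "..."

def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Usage sketch (variable name taken from the cell above):
# api_key = require_env("GPTGOD_CLOUD_API_KEY")
# print("API Key:", mask_secret(api_key))
```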


## Format validation

We can perform a variety of error checks to validate that each conversation in the dataset adheres to the format expected by the fine-tuning API. Errors are categorized based on their nature for easier debugging.

1. **Data Type Check**: Checks whether each entry in the dataset is a dictionary (`dict`). Error type: `data_type`.
2. **Presence of Message List**: Checks if a `messages` list is present in each entry. Error type: `missing_messages_list`.
3. **Message Keys Check**: Validates that each message in the `messages` list contains the keys `role` and `content`. Error type: `message_missing_key`.
4. **Unrecognized Keys in Messages**: Logs if a message has keys other than `role`, `content`, `weight`, `function_call`, and `name`. Error type: `message_unrecognized_key`.
5. **Role Validation**: Ensures the `role` is one of "system", "user", "assistant", or "function". Error type: `unrecognized_role`.
6. **Content Validation**: Verifies that `content` has textual data and is a string. Error type: `missing_content`.
7. **Assistant Message Presence**: Checks that each conversation has at least one message from the assistant. Error type: `example_missing_assistant_message`.

The code below performs these checks and prints a count for each type of error found. This is useful for debugging and for ensuring the dataset is ready for the next steps.


In [251]:
# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
        
    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue
        
    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1
        
        if any(k not in ("role", "content", "name", "function_call", "weight") for k in message):
            format_errors["message_unrecognized_key"] += 1
        
        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1
            
        content = message.get("content", None)
        function_call = message.get("function_call", None)
        
        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1
    
    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

Found errors:
example_missing_assistant_message: 4
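
The counts above say how many examples fail a check, but not which ones. A minimal sketch that records the indices of failing examples for three of the checks is shown below; `locate_format_errors` is a hypothetical helper, and it assumes every message in a `messages` list is a dictionary.

```python
from collections import defaultdict

def locate_format_errors(dataset):
    """Map an error type to the list of dataset indices that trigger it."""
    locations = defaultdict(list)
    for index, ex in enumerate(dataset):
        if not isinstance(ex, dict):
            locations["data_type"].append(index)
            continue
        messages = ex.get("messages")
        if not messages:
            locations["missing_messages_list"].append(index)
            continue
        # Assumption: each message is a dict with a "role" key
        if not any(m.get("role") == "assistant" for m in messages):
            locations["example_missing_assistant_message"].append(index)
    return dict(locations)
```

With the dataset loaded earlier, `locate_format_errors(dataset)["example_missing_assistant_message"]` would list the positions of the examples that lack an assistant reply.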


In [None]:
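## Prerequisite function 0.5: must be run before the main function
## Core code used by the main function: convert an .md file to a JSONL file, then verify that convert_md_to_jsonl produced the expected result (tested successfully)
## Defines the JSONL conversion function convert_md_to_jsonl() and the verification function verify_md_to_jsonl(input_md_file: Path, output_jsonl_file: Path)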

In [658]:

import json
import re
from pathlib import Path

def get_relative_path(file_path: Path, base_path: Path) -> str:
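    """
    Return file_path relative to base_path.

    :param file_path: path of the file
    :param base_path: base path
    :return: the relative path, or the unchanged path if file_path is not under base_path
    """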
    try:
        return str(file_path.relative_to(base_path))
    except ValueError:
        return str(file_path)

def convert_md_to_jsonl(input_md_file: Path, output_jsonl_file: Path) -> None:
    """
    Convert a Markdown file to a JSONL file and add the system message to the record.

    :param input_md_file: path of the input Markdown file
    :param output_jsonl_file: path of the output JSONL file
    """
    base_path = Path.cwd()  # current working directory

    # Read the Markdown file
    with open(input_md_file, 'r', encoding='utf-8') as file:
        markdown_content = file.read()

    # Split the Markdown content into paragraphs
    paragraphs = re.split(r'\n\n+', markdown_content)

    # Drop paragraphs that start with "By " or "https://"
    filtered_paragraphs = [p for p in paragraphs if not p.startswith('By ') and not p.startswith('https://')]

    # Find the first ```markdown or ````markdown block
    markdown_block_start = None
    for i, paragraph in enumerate(filtered_paragraphs):
        if re.match(r'^````?markdown', paragraph.strip()):
            markdown_block_start = i
            break

    if markdown_block_start is not None:
        # User content is everything before the ```markdown block
        user_content = '\n'.join(filtered_paragraphs[:markdown_block_start])
        # Assistant content is the ```markdown block and everything after it
        assistant_content = '\n'.join(filtered_paragraphs[markdown_block_start:])
    else:
        # If no ```markdown block is found, treat the first two paragraphs as user content and the rest as assistant content
        user_content = '\n'.join(filtered_paragraphs[:2])
        assistant_content = '\n'.join(filtered_paragraphs[2:])

    # Add the system message
    system_message = "你是一个高级的对话生成模型，能够根据用户的输入生成高质量、详细且结构化的回答。你的回答应满足以下要求：\n\n1. 一致性：确保回答与用户输入的指令高度一致，生成详细、结构化的回答。\n\n2. 详细分析和指导：提供详细的分析和指导，包括多个步骤和注意事项。\n3. 复杂任务处理：能够处理复杂的分析任务，生成较长、详细的回答，适合需要深入分析的场景。\n4. 正式语言风格：使用正式、详细的语言风格，确保回答的专业性和准确性。\n5. 数据特点：\n   - 高质量：经过精心设计和优化的，质量较高。\n\n   - 多样性：涵盖了各种主题和场景，有助于模型学习广泛的表达方式和知识。\n   - 结构化：每个问答对都有明确的输入和输出，格式统一。\n\n### 特殊指令处理\n- 特定指令：当用户输入中包含“生成提示词”或类似意思的词语时，生成符合以上要求的结构化和详细的回答。\n- 普通指令：当用户输入不包含“生成提示词”或类似意思的词语时，生成正常的、灵活的回答。\n\n请确保每次回答都遵循上述要求，以提供最佳的用户体验。\n\n最后一点切记：默认用中文输出"

    # Build the JSONL record
    jsonl_content = json.dumps({
        "messages": [
            { "role": "system", "content": system_message },
            { "role": "user", "content": user_content },
            { "role": "assistant", "content": assistant_content }
        ]
    }, ensure_ascii=False) + '\n'
    
    # Write the JSONL file
    with open(output_jsonl_file, 'w', encoding='utf-8') as file:  # 'w' mode overwrites any existing content
        file.write(jsonl_content)

    print(f'Markdown 文件 {get_relative_path(input_md_file, base_path)} 已成功转换为 JSONL 文件：{get_relative_path(output_jsonl_file, base_path)}')

def verify_md_to_jsonl(input_md_file: Path, output_jsonl_file: Path) -> bool:
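    """
    Verify that the output of convert_md_to_jsonl matches the expected result.

    :param input_md_file: path of the input Markdown file
    :param output_jsonl_file: path of the output JSONL file
    :return: True if the conversion matches the expected result, otherwise False
    """
    base_path = Path.cwd()  # current working directory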

    # Check that both files exist
    if not input_md_file.exists() or not output_jsonl_file.exists():
        print(f"输入文件 {get_relative_path(input_md_file, base_path)} 或输出文件 {get_relative_path(output_jsonl_file, base_path)} 不存在")
        return False

    # Read the Markdown file
    with open(input_md_file, 'r', encoding='utf-8') as file:
        markdown_content = file.read()

    # Split the Markdown content into paragraphs
    paragraphs = re.split(r'\n\n+', markdown_content)

    # Drop paragraphs that start with "By " or "https://"
    filtered_paragraphs = [p for p in paragraphs if not p.startswith('By ') and not p.startswith('https://')]

    # Find the first ```markdown or ````markdown block
    markdown_block_start = None
    for i, paragraph in enumerate(filtered_paragraphs):
        if re.match(r'^````?markdown', paragraph.strip()):
            markdown_block_start = i
            break

    if markdown_block_start is not None:
        # User content is everything before the ```markdown block
        expected_user_content = '\n'.join(filtered_paragraphs[:markdown_block_start])
        # Assistant content is the ```markdown block and everything after it
        expected_assistant_content = '\n'.join(filtered_paragraphs[markdown_block_start:])
    else:
        # If no ```markdown block is found, treat the first two paragraphs as user content and the rest as assistant content
        expected_user_content = '\n'.join(filtered_paragraphs[:2])
        expected_assistant_content = '\n'.join(filtered_paragraphs[2:])

    # Read the JSONL file
    with open(output_jsonl_file, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # Parse the JSONL file
    for line in lines:
        try:
            data = json.loads(line)
            system_content = data["messages"][0]["content"]
            actual_user_content = data["messages"][1]["content"]
            actual_assistant_content = data["messages"][2]["content"]

            # Check that the system, user, and assistant contents are not empty
            if not system_content or not actual_user_content or not actual_assistant_content:
                print(f"某字段为空：\nsystem: {system_content}\nuser: {actual_user_content}\nassistant: {actual_assistant_content}")
                return False

            # Compare the expected and actual content
            if expected_user_content != actual_user_content:
                print(f"用户内容不一致：\n预期内容:\n{expected_user_content}\n实际内容:\n{actual_user_content}")
                return False

            if expected_assistant_content != actual_assistant_content:
                print(f"助理内容不一致：\n预期内容:\n{expected_assistant_content}\n实际内容:\n{actual_assistant_content}")
                return False

        except (json.JSONDecodeError, KeyError, IndexError) as e:
            print(f"JSONL 文件 {get_relative_path(output_jsonl_file, base_path)} 格式错误: {e}")
            return False

    print(f"转换效果符合预期：输入文件 {get_relative_path(input_md_file, base_path)}，输出文件 {get_relative_path(output_jsonl_file, base_path)}")
    return True


# Example call
input_md_file = Path('/Users/wingzheng/Desktop/github/GPT/GPTs/prompts/Bake Off.md')  # replace with the path to your own .md file
output_jsonl_file = Path('/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_test.jsonl')

# Run the conversion
convert_md_to_jsonl(input_md_file, output_jsonl_file)

# Run the verification
if verify_md_to_jsonl(input_md_file, output_jsonl_file):
    print(f"校验通过：输入文件 {get_relative_path(input_md_file, Path.cwd())}，输出文件 {get_relative_path(output_jsonl_file, Path.cwd())}")
else:
    print(f"校验失败：输入文件 {get_relative_path(input_md_file, Path.cwd())}，输出文件 {get_relative_path(output_jsonl_file, Path.cwd())}")

Markdown 文件 /Users/wingzheng/Desktop/github/GPT/GPTs/prompts/Bake Off.md 已成功转换为 JSONL 文件：data/gpts_fine_tuning_training_test.jsonl
转换效果符合预期：输入文件 /Users/wingzheng/Desktop/github/GPT/GPTs/prompts/Bake Off.md，输出文件 data/gpts_fine_tuning_training_test.jsonl
校验通过：输入文件 /Users/wingzheng/Desktop/github/GPT/GPTs/prompts/Bake Off.md，输出文件 data/gpts_fine_tuning_training_test.jsonl
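
To make the splitting rule in `convert_md_to_jsonl` easier to follow, here is a small in-memory sketch of the same steps: split on blank lines, drop paragraphs that start with `By ` or `https://`, then cut at the first paragraph that opens a markdown fence. `split_prompt` and the sample text are made up for illustration; the regular expressions are the ones used in the cell above.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep this example readable

def split_prompt(markdown_text):
    """Return (user_content, assistant_content) using the notebook's splitting rule."""
    paragraphs = re.split(r"\n\n+", markdown_text)
    paragraphs = [p for p in paragraphs if not p.startswith(("By ", "https://"))]
    # Index of the first paragraph that opens a markdown fence, if any
    start = next(
        (i for i, p in enumerate(paragraphs) if re.match(r"^````?markdown", p.strip())),
        None,
    )
    if start is None:
        start = 2  # fallback: the first two paragraphs are the user content
    return "\n".join(paragraphs[:start]), "\n".join(paragraphs[start:])

# Hypothetical sample prompt file
sample = (
    "## Paw Pal\n\n"
    "By someone\n\n"
    "Expert on dogs.\n\n"
    f"{FENCE}markdown\nYou are Paw Pal.\n{FENCE}"
)
user, assistant = split_prompt(sample)
print(repr(user))
print(repr(assistant))
```

Note that paragraphs are re-joined with a single newline, so blank lines between paragraphs in the source file are not preserved in the JSONL record.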


In [None]:
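## Run "prerequisite function 0.5" before the main function
## Main function: builds on the JSONL conversion function convert_md_to_jsonl(...) and the verification function verify_md_to_jsonl(...)

## The main function below (verified) is the key step and runs quickly. It writes every .md file in the given folder to JSONL text in one pass; the instruction content of each .md file becomes one JSON line in the output.
# It is best to delete the two existing files gpts_fine_tuning_training_folder.jsonl and gpts_fine_tuning_validation_folder.jsonl first, so that generation starts from a clean state.
# Out of every 5 files, the first 4 go into the training JSONL file and the 5th goes into the validation JSONL file.

# First run: output (the material used for the 1st fine-tuning run) goes to gpts_fine_tuning_training_test_folder2.jsonl and gpts_fine_tuning_validation_test_folder2.jsonl
# Later runs: output (the material used for the 2nd fine-tuning run) goes to gpts_fine_tuning_training_folder.jsonl and gpts_fine_tuning_validation_folder.jsonl

## Improvements for the next fine-tuning run (2024-10-10):
#  1. Write the text into these two files: gpts_fine_tuning_training_folder.jsonl and gpts_fine_tuning_validation_folder.jsonl.
#  2. Embed the system message in every user-assistant exchange, so that the model always follows the system message when generating replies.
#  3. Consider training with gpt-4o-mini first.
#  4. Whether to set the number of epochs to 1 at first depends on the model: with gpt-4o-mini, keep the default; with gpt-4o, be more conservative.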

In [660]:
import os
from pathlib import Path
import json

# Check a file for duplicate entries
def check_duplicates_in_file(file_path):
    user_set = set()
    user_assistant_pairs = set()
    duplicates = []

    with open(file_path, 'r', encoding='utf-8') as infile:
        lines = infile.readlines()
        
        for i, line in enumerate(lines):
            try:
                data = json.loads(line)
                messages = data['messages']
                # Skip only a system-only header line; unconditionally skipping
                # the first line would leave a real first entry unchecked
                if len(messages) == 1 and messages[0].get('role') == 'system':
                    continue
                user = messages[1]['content']  # the user message is the second message
                assistant = messages[2]['content']  # the assistant message is the third message
                
                if user in user_set:
                    pair = (user, assistant)
                    if pair in user_assistant_pairs:
                        duplicates.append((i + 1, user, assistant))
                    else:
                        user_assistant_pairs.add(pair)
                else:
                    user_set.add(user)
                    user_assistant_pairs.add((user, assistant))
            except (json.JSONDecodeError, KeyError, IndexError) as e:
                print(f"Error processing line {i + 1}: {e}")
                continue
    
    if duplicates:
        print("重复的条目是：")
        for dup in duplicates:
            print(f"Line {dup[0]}, User: {dup[1]}, Assistant: {dup[2]}")
        return False
    else:
        print("所有条目没有重复")
        return True

# Check two files for entries they have in common
def check_duplicates_between_files(file1_path, file2_path):
    file1_user_assistant_pairs = set()
    file2_user_assistant_pairs = set()
    duplicates = []

    # Read the first file
    with open(file1_path, 'r', encoding='utf-8') as infile1:
        lines1 = infile1.readlines()
        for i, line in enumerate(lines1):
            try:
                data = json.loads(line)
                messages = data['messages']
                # Skip only a system-only header line. Always skipping the first line
                # would ignore the single entry of a one-line file such as the temp file.
                if len(messages) == 1 and messages[0].get('role') == 'system':
                    continue
                user = messages[1]['content']  # the user message is the second message
                assistant = messages[2]['content']  # the assistant message is the third message
                file1_user_assistant_pairs.add((user, assistant))
            except (json.JSONDecodeError, KeyError, IndexError) as e:
                print(f"Error processing line {i + 1} in file1: {e}")
                continue

    # Read the second file
    with open(file2_path, 'r', encoding='utf-8') as infile2:
        lines2 = infile2.readlines()
        for i, line in enumerate(lines2):
            try:
                data = json.loads(line)
                messages = data['messages']
                if len(messages) == 1 and messages[0].get('role') == 'system':
                    continue
                user = messages[1]['content']  # the user message is the second message
                assistant = messages[2]['content']  # the assistant message is the third message
                file2_user_assistant_pairs.add((user, assistant))
            except (json.JSONDecodeError, KeyError, IndexError) as e:
                print(f"Error processing line {i + 1} in file2: {e}")
                continue

    # Collect the entries that appear in both files
    for pair in file1_user_assistant_pairs:
        if pair in file2_user_assistant_pairs:
            duplicates.append(pair)

    if duplicates:
        print("这两个 JSONL 文件有重复的条目：")
        for dup in duplicates:
            print(f"User: {dup[0]}, Assistant: {dup[1]}")
        return False
    else:
        print("这两个 JSONL 文件没有任何条目重复")
        return True

# Initialize files: make sure each file exists and is empty
def initialize_files(*file_paths):
    for file_path in file_paths:
        file_path = Path(file_path)
        if not file_path.exists():
            file_path.touch()  # create an empty file
        else:
            with open(file_path, 'w', encoding='utf-8') as f:
                f.write('')  # clear the file contents

# Write the system message
def write_system_message(file_path):
    system_message = {
        "messages": [
            {
                "role": "system",
                "content": "你是一个高级的对话生成模型，能够根据用户的输入生成高质量、详细且结构化的回答。你的回答应满足以下要求：\n\n1. 一致性：确保回答与用户输入的指令高度一致，生成详细、结构化的回答。\n\n2. 详细分析和指导：提供详细的分析和指导，包括多个步骤和注意事项。\n3. 复杂任务处理：能够处理复杂的分析任务，生成较长、详细的回答，适合需要深入分析的场景。\n4. 正式语言风格：使用正式、详细的语言风格，确保回答的专业性和准确性。\n5. 数据特点：\n   - 高质量：经过精心设计和优化的，质量较高。\n\n   - 多样性：涵盖了各种主题和场景，有助于模型学习广泛的表达方式和知识。\n   - 结构化：每个问答对都有明确的输入和输出，格式统一。\n\n### 特殊指令处理\n- 特定指令：当用户输入中包含“生成提示词”或类似意思的词语时，生成符合以上要求的结构化和详细的回答。\n- 普通指令：当用户输入不包含“生成提示词”或类似意思的词语时，生成正常的、灵活的回答。\n\n请确保每次回答都遵循上述要求，以提供最佳的用户体验。\n\n最后一点切记：默认用中文输出！"
            }
        ]
    }
    with open(file_path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(system_message, ensure_ascii=False) + '\n')

# Validate the JSONL file format
def validate_jsonl_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as infile:
        lines = infile.readlines()
        
        for i, line in enumerate(lines):
            try:
                data = json.loads(line)
                messages = data.get('messages', [])
                if len(messages) < 3 or messages[0].get('role') != 'system' or messages[1].get('role') != 'user' or messages[2].get('role') != 'assistant':
                    print(f"Error in line {i + 1}: Missing system, user or assistant message")
                    return False
            except (json.JSONDecodeError, IndexError) as e:
                print(f"Error processing line {i + 1}: {e}")
                return False
    return True

# Validate the dataset
def validate_dataset(dataset):
    errors = []
    for i, data in enumerate(dataset):
        messages = data.get('messages', [])
        if len(messages) < 3 or messages[0].get('role') != 'system' or messages[1].get('role') != 'user' or messages[2].get('role') != 'assistant':
            errors.append(i)
    if errors:
        print(f"Found errors: first_example_missing_system_message: {len(errors)}")
        print(f"Example indices: {errors}")
        return False
    return True

# Append the contents of the source file to the destination file
def append_file_content(src_file, dest_file):
    with open(src_file, 'r', encoding='utf-8') as src, open(dest_file, 'a', encoding='utf-8') as dest:
        dest.write(src.read())

# Main function
def process_md_files(input_dir, temp_jsonl_file, training_jsonl_file, validation_jsonl_file, n):
    # Convert the input paths to Path objects
    input_dir = Path(input_dir)
    temp_jsonl_file = Path(temp_jsonl_file)
    training_jsonl_file = Path(training_jsonl_file)
    validation_jsonl_file = Path(validation_jsonl_file)

    # Initialize the temp file
    initialize_files(temp_jsonl_file)

    # Check whether the training and validation files already exist
    training_exists = training_jsonl_file.exists()
    validation_exists = validation_jsonl_file.exists()

    if training_exists and validation_exists:
        # Both files exist, so check them for duplicates
        if not (check_duplicates_in_file(training_jsonl_file) and check_duplicates_in_file(validation_jsonl_file)):
            print("文件中有重复条目，建议从空文件开始")
            return

        if not check_duplicates_between_files(training_jsonl_file, validation_jsonl_file):
            print("两个文件之间有重复条目，建议从空文件开始")
            return

    # Collect all .md files in the folder
    md_files = list(input_dir.glob('*.md'))

    # Initialize the counters
    total_md_files = len(md_files)
    total_jsonl_entries = 0
    training_entries = 0
    validation_entries = 0

    # Process each .md file
    for index, md_file in enumerate(md_files, start=1):
        # Convert the Markdown file to a JSONL file
        # (from "prerequisite function 0.5": every converted JSONL line starts with the system message)
        convert_md_to_jsonl(md_file, temp_jsonl_file)

        # Verify the converted JSONL file
        # (also from "prerequisite function 0.5")
        if not verify_md_to_jsonl(md_file, temp_jsonl_file):
            print(f"转换校验失败，跳过文件 {md_file}")
            continue
        


        # Validate the JSONL file format
        if not validate_jsonl_file(temp_jsonl_file):
            print(f"JSONL 格式校验失败，跳过文件 {md_file}")
            continue

        # Load the JSONL file as a dataset
        with open(temp_jsonl_file, 'r', encoding='utf-8') as f:
            dataset = [json.loads(line) for line in f]
            total_jsonl_entries += len(dataset)

        # Validate the user and assistant messages of each conversation in the dataset
        if not validate_dataset(dataset):
            print(f"数据集校验失败，跳过文件 {md_file}")
            continue

        # Use n to pick the target file: every n-th file goes to validation, the rest to training
        if index % n == 0:
            target_file = validation_jsonl_file
        else:
            target_file = training_jsonl_file

        # Check whether the target file exists
        if not target_file.exists():
            write_system_message(target_file)  # write the system message
        else:
            # Read the target file and make sure its first line is not a system-only record
            with open(target_file, 'r', encoding='utf-8') as f:
                lines = f.readlines()
            if len(lines) > 0:
                first_line = json.loads(lines[0])
                if len(first_line['messages']) == 1 and first_line['messages'][0]['role'] == 'system':
                    lines.pop(0)  # drop the first line
                    with open(target_file, 'w', encoding='utf-8') as f:
                        f.writelines(lines)  # rewrite the file

        # Check the target file for duplicate entries
        if not check_duplicates_in_file(target_file):
            print(f"目标文件 {target_file} 中有重复条目，跳过文件 {md_file}")
            continue

        # Check for entries shared by the temp file and the target file
        if not check_duplicates_between_files(temp_jsonl_file, target_file):
            print(f"临时文件与目标文件 {target_file} 之间有重复条目，跳过文件 {md_file}")
            continue

        # Append the content to the target file
        append_file_content(temp_jsonl_file, target_file)
        
        # Update the counters
        if target_file == training_jsonl_file:
            training_entries += len(dataset)
        else:
            validation_entries += len(dataset)
        
        print(f"文件 {md_file} 处理完成，内容已追加到 {target_file}")

    # Get the relative paths
    relative_training_path = get_relative_path(training_jsonl_file, Path('/Users/wingzheng/Desktop/github/GPT'))
    relative_validation_path = get_relative_path(validation_jsonl_file, Path('/Users/wingzheng/Desktop/github/GPT'))

    # Print summary statistics
    print(f"\n\n转换的 {input_dir} 文件夹下共有 {total_md_files} 个 .md 文件，共转化了 {total_jsonl_entries} 条 JSONL 文本")
    print(f"其中:\n{training_entries} 条 JSONL 文本写入 {relative_training_path}；\n{validation_entries} 条 JSONL 文本写入 {relative_validation_path}")

# Define the required variables
input_dir = '/Users/wingzheng/Desktop/github/GPT/GPTs/prompts'
temp_jsonl_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_test.jsonl'
training_jsonl_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_folder.jsonl'
validation_jsonl_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_validation_folder.jsonl'
n = 5  # every 5th file goes to the validation set

# Call the main function
process_md_files(input_dir, temp_jsonl_file, training_jsonl_file, validation_jsonl_file, n)


Markdown 文件 /Users/wingzheng/Desktop/github/GPT/GPTs/prompts/Paw Pal.md 已成功转换为 JSONL 文件：data/gpts_fine_tuning_training_test.jsonl
转换效果符合预期：输入文件 /Users/wingzheng/Desktop/github/GPT/GPTs/prompts/Paw Pal.md，输出文件 data/gpts_fine_tuning_training_test.jsonl
所有条目没有重复
这两个 JSONL 文件没有任何条目重复
文件 /Users/wingzheng/Desktop/github/GPT/GPTs/prompts/Paw Pal.md 处理完成，内容已追加到 /Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_folder.jsonl
Markdown 文件 /Users/wingzheng/Desktop/github/GPT/GPTs/prompts/LogoGPT.md 已成功转换为 JSONL 文件：data/gpts_fine_tuning_training_test.jsonl
转换效果符合预期：输入文件 /Users/wingzheng/Desktop/github/GPT/GPTs/prompts/LogoGPT.md，输出文件 data/gpts_fine_tuning_training_test.jsonl
所有条目没有重复
这两个 JSONL 文件没有任何条目重复
文件 /Users/wingzheng/Desktop/github/GPT/GPTs/prompts/LogoGPT.md 处理完成，内容已追加到 /Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_folder.jsonl
Markdown 文件 /Users/wingzheng/Desktop/github/GPT/GPTs/prompts/完蛋！我爱上了姐姐.md 已成功转
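
The train/validation split in `process_md_files` depends only on the 1-based position of each file and on `n`. The sketch below isolates that rule; `assign_split` is a hypothetical name used here for illustration.

```python
def assign_split(index, n):
    """Return 'validation' for every n-th file (1-based index), otherwise 'training'."""
    return "validation" if index % n == 0 else "training"

# With n = 5, files 5 and 10 go to validation and the other eight go to training
splits = [assign_split(i, 5) for i in range(1, 11)]
print(splits.count("training"), splits.count("validation"))
```

Because `Path.glob` does not guarantee an order, which files land in the validation set may differ between machines unless the file list is sorted first.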

In [None]:
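## Patch function 1.1: reuse the duplicate-check helpers defined with the main function to check a single file, or a pair of files, for duplicate entries
# Use case: these checks expect JSONL records with three stacked messages (system + user + assistant), so they work well on gpts_fine_tuning_training_folder.jsonl and the validation JSONL file, but not on the newer JSONL files with the system message removed
## After the main function has run, call its helper functions again to double-check for duplicates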

In [668]:
# Feature 1: check a single file for duplicate entries
def check_single_file_duplicates(file_path):
    result = check_duplicates_in_file(file_path)
    if result:
        print("本文件没有任何重复条目")
    else:
        print("有重复条目，重复条目为：")
        check_duplicates_in_file(file_path)  # call again to print the duplicate entries

# Feature 2: check two files for entries they have in common
def check_between_files_duplicates(file1_path, file2_path):
    result = check_duplicates_between_files(file1_path, file2_path)
    if result:
        print("这两个文件没有任何重复条目")
    else:
        print("这两个文件有重复条目，重复条目为：")
        check_duplicates_between_files(file1_path, file2_path)  # call again to print the duplicate entries

# Example call
training_jsonl_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_folder.jsonl'
validation_jsonl_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_validation_folder.jsonl'
                       

# Run feature 1
print("检查单个训练文件是否有重复条目：")
check_single_file_duplicates(training_jsonl_file)

print("\n检查单个验证文件是否有重复条目：")
check_single_file_duplicates(validation_jsonl_file)

# Run feature 2
print("\n检查两个文件之间是否有重复条目：")
check_between_files_duplicates(training_jsonl_file, validation_jsonl_file)


检查单个训练文件是否有重复条目：
所有条目没有重复
本文件没有任何重复条目

检查单个验证文件是否有重复条目：
所有条目没有重复
本文件没有任何重复条目

检查两个文件之间是否有重复条目：
这两个 JSONL 文件没有任何条目重复
这两个文件没有任何重复条目
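
The helpers above read the user and assistant messages by position, which is why they do not apply to JSONL files without a system message. A minimal role-based sketch that works with or without a system message is shown below. `find_duplicate_pairs` is a hypothetical helper, and it assumes single-turn records (one user and one assistant message each).

```python
from collections import Counter

def find_duplicate_pairs(records):
    """Return the (user, assistant) pairs that occur more than once."""
    pairs = Counter()
    for rec in records:
        # Look messages up by role rather than by position
        by_role = {m["role"]: m["content"] for m in rec.get("messages", [])}
        if "user" in by_role and "assistant" in by_role:
            pairs[(by_role["user"], by_role["assistant"])] += 1
    return [pair for pair, count in pairs.items() if count > 1]

# Toy records: the first two differ only in whether a system message is present
records = [
    {"messages": [{"role": "system", "content": "s"},
                  {"role": "user", "content": "q1"},
                  {"role": "assistant", "content": "a1"}]},
    {"messages": [{"role": "user", "content": "q1"},
                  {"role": "assistant", "content": "a1"}]},
    {"messages": [{"role": "user", "content": "q2"},
                  {"role": "assistant", "content": "a2"}]},
]
print(find_duplicate_pairs(records))
```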


In [None]:
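## (Rarely needed now, semi-retired) Patch function 1.2: merge the first record (the system message) and the second record (the "user" + "assistant" messages) of the generated JSONL text into a single record (2024-10-05)
#  This was needed when the functions were first being adjusted; it is rarely needed now, because "prerequisite function 0.5" writes the system message into every JSONL record (2024-10-12)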

In [314]:
import json
from pathlib import Path

def merge_system_and_first_entry(file_path: str):
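    """
    Read the given JSONL file, merge the system message and the first user-assistant exchange into one record, and write the result back to the same file.

    :param file_path: path of the JSONL file
    """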
    file_path = Path(file_path)
    
    if not file_path.exists():
        print(f"文件 {file_path} 不存在")
        return

    # Read the file
    with open(file_path, 'r', encoding='utf-8') as infile:
        lines = infile.readlines()

    if len(lines) < 2:
        print(f"文件 {file_path} 中的内容不足两条，无法合并")
        return

    # Parse the system message
    try:
        system_data = json.loads(lines[0])
        system_content = system_data['messages'][0]['content']
    except (json.JSONDecodeError, KeyError, IndexError) as e:
        print(f"解析系统信息时出错: {e}")
        return

    # Parse the first user-assistant exchange
    try:
        first_entry_data = json.loads(lines[1])
        user_content = first_entry_data['messages'][0]['content']
        assistant_content = first_entry_data['messages'][1]['content']
    except (json.JSONDecodeError, KeyError, IndexError) as e:
        print(f"解析第一条用户-助理对话时出错: {e}")
        return

    # Merge the system message and the first user-assistant exchange
    merged_data = {
        "messages": [
            { "role": "system", "content": system_content },
            { "role": "user", "content": user_content },
            { "role": "assistant", "content": assistant_content }
        ]
    }

    # Rewrite the file
    with open(file_path, 'w', encoding='utf-8') as outfile:
        outfile.write(json.dumps(merged_data, ensure_ascii=False) + '\n')
        outfile.writelines(lines[2:])  # write the remaining lines

    print(f"文件 {file_path} 中的系统信息和第一条用户-助理对话已成功合并")

# Example call
training_jsonl_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_test_folder2.jsonl'
validation_jsonl_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_validation_test_folder2.jsonl'

# merge_system_and_first_entry(training_jsonl_file)
merge_system_and_first_entry(validation_jsonl_file)

文件 /Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_validation_test_folder2.jsonl 中的系统信息和第一条用户-助理对话已成功合并


In [None]:
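## Patch function 2: compare the entries of two JSONL files and report the differences. Both files are read and their records compared, which yields the entries present only in the first file and the entries present only in the second.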
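
The comparison described above is a two-way set difference over serialized records. The sketch below shows the idea on in-memory lines. The helper name `diff_jsonl_lines` is hypothetical, and `sort_keys=True` is an added assumption so that records differing only in key order count as equal.

```python
import json

def diff_jsonl_lines(lines_a, lines_b):
    """Return (only_in_a, only_in_b) as sets of canonical JSON strings."""
    def canonical(lines):
        # sort_keys makes records that differ only in key order compare equal
        return {
            json.dumps(json.loads(line), ensure_ascii=False, sort_keys=True)
            for line in lines
            if line.strip()  # ignore blank lines
        }
    set_a, set_b = canonical(lines_a), canonical(lines_b)
    return set_a - set_b, set_b - set_a

a = [
    '{"messages": [{"role": "user", "content": "hi"}]}\n',
    '{"id": 1, "x": 2}\n',
]
b = [
    '{"x": 2, "id": 1}\n',
    '\n',
    '{"messages": [{"role": "user", "content": "bye"}]}\n',
]
only_a, only_b = diff_jsonl_lines(a, b)
print(only_a)
print(only_b)
```

The `{"id": 1, "x": 2}` record appears in both inputs with a different key order, so it shows up in neither difference.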

In [316]:
import json
from pathlib import Path

def compare_jsonl_files(file1_path: str, file2_path: str):
    """
    Compare the entries of two JSONL files and report the differences.

    :param file1_path: Path to the first JSONL file
    :param file2_path: Path to the second JSONL file
    """
    file1_path = Path(file1_path)
    file2_path = Path(file2_path)

    if not file1_path.exists():
        print(f"文件 {file1_path} 不存在")
        return

    if not file2_path.exists():
        print(f"文件 {file2_path} 不存在")
        return

    # Read the first file, skipping blank lines that json.loads would reject
    with open(file1_path, 'r', encoding='utf-8') as infile1:
        entries1 = [json.loads(line) for line in infile1 if line.strip()]

    # Read the second file the same way
    with open(file2_path, 'r', encoding='utf-8') as infile2:
        entries2 = [json.loads(line) for line in infile2 if line.strip()]

    # Serialize every record to a string so records can be compared as set members
    entries1_str = {json.dumps(entry, ensure_ascii=False) for entry in entries1}
    entries2_str = {json.dumps(entry, ensure_ascii=False) for entry in entries2}

    # Entries present in the first file but not in the second
    only_in_file1 = entries1_str - entries2_str

    # Entries present in the second file but not in the first
    only_in_file2 = entries2_str - entries1_str

    # Print the results
    if only_in_file1:
        print(f"仅在 {file1_path} 中存在的条目：")
        for entry in only_in_file1:
            print(entry)
    else:
        print(f"没有仅在 {file1_path} 中存在的条目")

    if only_in_file2:
        print(f"仅在 {file2_path} 中存在的条目：")
        for entry in only_in_file2:
            print(entry)
    else:
        print(f"没有仅在 {file2_path} 中存在的条目")

# Example call
training_jsonl_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_test_folder2.jsonl'
training_jsonl_file3 = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_test_folder3.jsonl'

compare_jsonl_files(training_jsonl_file, training_jsonl_file3)

仅在 /Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_test_folder2.jsonl 中存在的条目：
{"messages": [{"role": "system", "content": "你是一个高级的对话生成模型，能够根据用户的输入生成高质量、详细且结构化的回答。你的回答应满足以下要求：\n\n1. 一致性：确保回答与用户输入的指令高度一致，生成详细、结构化的回答。\n\n2. 详细分析和指导：提供详细的分析和指导，包括多个步骤和注意事项。\n3. 复杂任务处理：能够处理复杂的分析任务，生成较长、详细的回答，适合需要深入分析的场景。\n4. 正式语言风格：使用正式、详细的语言风格，确保回答的专业性和准确性。\n5. 数据特点：\n   - 高质量：经过精心设计和优化的，质量较高。\n\n   - 多样性：涵盖了各种主题和场景，有助于模型学习广泛的表达方式和知识。\n   - 结构化：每个问答对都有明确的输入和输出，格式统一。\n\n### 特殊指令处理\n- 特定指令：当用户输入中包含“生成提示词”或类似意思的词语时，生成符合以上要求的结构化和详细的回答。\n- 普通指令：当用户输入不包含“生成提示词”或类似意思的词语时，生成正常的、灵活的回答。\n\n请确保每次回答都遵循上述要求，以提供最佳的用户体验。"}, {"role": "user", "content": "## Paw Pal\nExpert on dog behavior, feeding, and training, offering friendly and practical advice."}, {"role": "assistant", "content": "```markdown\nPaw Pal, your identity is as an expert trainer, offering a guiding hand to first-time dog owners. Your mission is to provide detailed, actionable advice for dog care, behavior underst

In [None]:
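## Patch function 3.0 (pre-step, simplified version that removes system messages; prerequisite of patch function 3): strip the system message from every line of the jsonl file produced by the main function and output a version that contains only user and assistant messages.
#  It reads a JSONL file whose records contain system, user, and assistant messages and writes a new JSONL file that keeps only the user and assistant messages. The new file name is the original name with _NoSystemMessage appended.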
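
As a small illustration of the transformation described above, the sketch below filters the system messages out of a single record. The helper name `strip_system` and the sample record are made up for the example.

```python
import json

def strip_system(record):
    """Return a copy of a chat record without its system messages."""
    return {
        "messages": [
            message
            for message in record.get("messages", [])
            if message.get("role") != "system"
        ]
    }

line = json.dumps({
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "## Paw Pal"},
        {"role": "assistant", "content": "Paw Pal, your identity is ..."},
    ]
})
stripped = strip_system(json.loads(line))
remaining_roles = [message["role"] for message in stripped["messages"]]
print(remaining_roles)
```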

In [None]:
import json
from pathlib import Path

def remove_system_message(input_jsonl_file: Path, output_jsonl_file: Path) -> None:
    """
    Remove the system message from every record of the input JSONL file and write a new JSONL file.

    :param input_jsonl_file: Path to the input JSONL file
    :param output_jsonl_file: Path to the output JSONL file
    """
    base_path = Path.cwd()  # Current working directory, used to shorten paths in messages

    # Read the JSONL file
    with open(input_jsonl_file, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # Process each line and drop its system messages
    new_lines = []
    count = 0  # Number of records converted
    for line in lines:
        if not line.strip():
            continue  # Skip blank lines instead of reporting them as parse errors
        try:
            data = json.loads(line)
            messages = data.get("messages", [])
            new_messages = [msg for msg in messages if msg.get('role') != 'system']
            new_data = {
                "messages": new_messages
            }
            new_lines.append(json.dumps(new_data, ensure_ascii=False) + '\n')
            count += 1  # One more record processed
        except json.JSONDecodeError as e:
            print(f"解析 JSONL 文件 {get_relative_path(input_jsonl_file, base_path)} 时发生错误: {e}")
            continue

    # Write the new JSONL file
    with open(output_jsonl_file, 'w', encoding='utf-8') as file:
        file.writelines(new_lines)

    # Report the result
    print(f'JSONL 文件 {get_relative_path(input_jsonl_file, base_path)} 已成功转换为 {get_relative_path(output_jsonl_file, base_path)}，并移除了 system 信息。\n本次共去除了 {count} 条记录的system信息。')

def get_relative_path(file_path: Path, base_path: Path) -> str:
    """
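    Return the path relative to the given base path, or the path itself when it is not under the base.

    :param file_path: File path
    :param base_path: Base path
    :return: Relative path as a string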
    """
    try:
        return str(file_path.relative_to(base_path))
    except ValueError:
        return str(file_path)

# Example call
# input_jsonl_file = Path('/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_folder.jsonl')
input_jsonl_file = Path('/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_validation_folder.jsonl')
output_jsonl_file = Path(str(input_jsonl_file).replace('.jsonl', '_NoSystemMessage.jsonl'))

remove_system_message(input_jsonl_file, output_jsonl_file)

In [None]:
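## Patch function 3 (large job): classify_and_output_jsonl_entries turns a jsonl file (the instruction set) into an md file (a per-category write-up) that lists each user content with the reason for its classification. The input must be a jsonl file without system messages, because the function was written for records that have no system header.
# Only the system-free jsonl produced by the "pre-step of patch function 3" is interpreted correctly.

## The generated md file is then searched with an embedded query, and the top-ranked (configurable) "user content" + "classification reason" entries are written to an output file (2024.10.10).
# One run takes a long time, so do not start it casually; the generated md file can be reused indefinitely.
# The second run took about an hour and a half (roughly 100 minutes) and cost 1-2 US dollars.


# The function reads the given JSONL file, parses each record, and asks the model to classify it.
# Results are stored in a dict whose keys are category names and whose values are the lists of entries in each category.
# Calling classify_and_output_jsonl_entries:
# run it on training_jsonl_file and validation_jsonl_file separately; each call writes its results to a Markdown file.

            # gpt-4o-2024-08-06: 0.0735 yuan; o1-mini: 0.088 yuan;
            # gpt-4o-mini: 0.0044 yuan;
            
            # model="o1-preview",  # too expensive, 5x the cost of "o1-mini"
            # model="o1-preview-2024-09-12", 
            # model="gpt-4o", 
            # model="o1-mini",  
            
            # model="o1-mini-2024-09-12", 
            # model="gpt-4o-2024-08-06"
            # model="gpt-4o-mini"


## Each instruction gets its own category name (## Category: ...) and classification reason;
## only the user content (with a Chinese translation added when the original is English) and the classification reason are listed;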
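
The dict described above, which maps each category name to its entries, can be sketched without any API call. In the example below `keyword_classify` is a hypothetical stand-in for the model request, and `group_by_category` is an illustrative helper rather than the function defined in the next cell.

```python
from collections import defaultdict

def group_by_category(entries, classify):
    """Group chat records by the category that `classify` assigns to their user content."""
    categories = defaultdict(list)
    for entry in entries:
        # Take the first user message; fall back to an empty string if there is none
        user_content = next(
            (m["content"] for m in entry["messages"] if m["role"] == "user"), ""
        )
        categories[classify(user_content)].append(entry)
    return dict(categories)

def keyword_classify(text):
    """Hypothetical stand-in for the model call: a fixed keyword rule."""
    return "pets" if "dog" in text.lower() else "other"

entries = [
    {"messages": [{"role": "user", "content": "Dog training tips"},
                  {"role": "assistant", "content": "..."}]},
    {"messages": [{"role": "user", "content": "Write an email"},
                  {"role": "assistant", "content": "..."}]},
    {"messages": [{"role": "user", "content": "Feeding my dog"},
                  {"role": "assistant", "content": "..."}]},
]
grouped = group_by_category(entries, keyword_classify)
for name, members in grouped.items():
    print(name, len(members))
```

Passing the classifier in as a function makes it possible to try the grouping and the Markdown output on a few records before paying for a full run.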

In [680]:
import os
import openai
from dotenv import load_dotenv
import json
from pathlib import Path

# Path to the .env file
dotenv_path = os.path.join(os.getcwd(), '../.env')
load_dotenv(dotenv_path)

# Read the API key and base URL from environment variables
api_key = os.getenv("GPTGOD_CLOUD_API_KEY")
api_base = os.getenv("GPTGOD_CLOUD_API_BASE")

# Make sure both values are set
if api_key is None:
    print("请设置环境变量 GPTGOD_CLOUD_API_KEY 为您的API密钥")
    exit(1)

if api_base is None:
    print("请设置环境变量 GPTGOD_CLOUD_API_BASE 为您的API基础URL")
    exit(1)

# Show only a masked key, because notebook output is saved together with the file
print(f"API Key: {api_key[:3]}...{api_key[-4:]}")
print(f"API Base: {api_base}")

# Initialize the OpenAI client
client = openai.OpenAI(api_key=api_key, base_url=api_base)

def translate_text(text: str) -> str:
    """
    Translate text into Chinese with the model.

    :param text: Text to translate
    :return: The Chinese translation
    """
    try:
        translation = client.chat.completions.create(
            # model="gpt-4o-mini",
            model="gpt-4o-2024-08-06",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that can translate text."},
                {"role": "user", "content": f"Translate the following text to Chinese: {text}"}
            ]
        )
        return translation.choices[0].message.content.strip()
    except openai.APIError as e:
        print(f"API error: {e}")
        return text  # Fall back to the original so no bogus translation is displayed
    except Exception as e:
        print(f"An error occurred: {e}")
        return text

def classify_and_output_jsonl_entries(input_file: str, output_file: str, num_categories: int):
    """
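    Classify the entries of a JSONL file and write the result to a Markdown file.

    :param input_file: Path to the input JSONL file
    :param output_file: Path to the output Markdown file
    :param num_categories: Number of categories mentioned in the classification prompt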
    """
    input_file = Path(input_file)
    output_file = Path(output_file)
    
    if not input_file.exists():
        print(f"文件 {input_file} 不存在")
        return

    # Read the file, skipping blank lines that json.loads would reject
    with open(input_file, 'r', encoding='utf-8') as infile:
        entries = [json.loads(line) for line in infile if line.strip()]

    # Maps each category name to the entries assigned to it
    categories = {}

    for entry in entries:
        user_content = entry['messages'][0]['content']

        try:
            # Ask the model to classify the entry
            completion = client.chat.completions.create(
                # model="gpt-4o-mini",
                model="gpt-4o-2024-08-06",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant that can classify text based on its application context."},
                    {"role": "user", "content": f"Classify the following user query into one of {num_categories} categories based on their application context: {user_content}"}
                ]
            )
            category = completion.choices[0].message.content.strip()
        except openai.APIError as e:
            print(f"API error: {e}")
            category = "未知"
        except Exception as e:
            print(f"An error occurred: {e}")
            category = "未知"

        # Add the entry to its category, creating the list on first use
        categories.setdefault(category, []).append(entry)

    # Write one Markdown table per category
    with open(output_file, 'w', encoding='utf-8') as outfile:
        outfile.write("# 分类统计\n\n")
        for category, entries in categories.items():
            outfile.write(f"## {category}\n\n")
            outfile.write("| 用户内容 | 分类原因 |\n")
            outfile.write("| --- | --- |\n")
            for entry in entries:
                user_content = entry['messages'][0]['content']
                
                # Translate the user content
                user_content_zh = translate_text(user_content)
                
                try:
                    # Ask the model for the reason behind the classification
                    reason_completion = client.chat.completions.create(
                        # model="gpt-4o-mini",
                        model="gpt-4o-2024-08-06",
                        messages=[
                            {"role": "system", "content": "You are a helpful assistant that can explain the reasoning behind classifications."},
                            {"role": "user", "content": f"详细解释为什么以下用户查询属于类别 '{category}': {user_content}"}
                        ]
                    )
                    classification_reason = reason_completion.choices[0].message.content.strip()
                except openai.APIError as e:
                    print(f"API error: {e}")
                    classification_reason = "未知"
                except Exception as e:
                    print(f"An error occurred: {e}")
                    classification_reason = "未知"

                # Format the cell: append the translation only when it differs from the original
                user_content_display = f"{user_content} ({user_content_zh})" if user_content != user_content_zh else user_content

                outfile.write(f"| {user_content_display} | {classification_reason} |\n")
            outfile.write("\n")

# Input paths (jsonl files used for the first run)
# training_jsonl_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_test_folder2.jsonl'
# validation_jsonl_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_validation_test_folder2.jsonl'
# Output paths (md files produced by the first run)
# training_output_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/training_classification.md'
# validation_output_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/validation_classification.md'

# Input paths (jsonl files used for the second run, with every system message removed)
training_jsonl_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_folder_NoSystemMessage.jsonl'
validation_jsonl_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_validation_folder_NoSystemMessage.jsonl'
# Output paths (md files produced by the second run)
training_output_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/training_classification_NoSystemMessage.md'
validation_output_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/validation_classification_NoSystemMessage.md'

# Classify training_jsonl_file and write the result to a Markdown file
classify_and_output_jsonl_entries(training_jsonl_file, training_output_file, 10)

# Classify validation_jsonl_file and write the result to a Markdown file
classify_and_output_jsonl_entries(validation_jsonl_file, validation_output_file, 5)

API Key: sk-...FaC3
API Base: https://gptgod.cloud/v1


KeyboardInterrupt: 
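
The table rows written by `classify_and_output_jsonl_entries` embed raw user content and model output. Text that contains a newline or a `|` character splits a row or adds a column when the file is rendered. A minimal sketch of an escaping step, assuming the renderer accepts `<br>` inside table cells (the helper name `escape_md_cell` is hypothetical):

```python
def escape_md_cell(text):
    """Make arbitrary text safe to place inside one Markdown table cell."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # normalize line endings
    text = text.replace("|", "\\|")    # a bare pipe would start a new column
    return text.replace("\n", "<br>")  # a raw newline would end the row

user_content = "## Paw Pal\nExpert on dogs | cats"
reason = "pet care"
row = f"| {escape_md_cell(user_content)} | {escape_md_cell(reason)} |"
print(row)
```

Escaped rows stay on one line, so a parser that reads the table back does not need to match across line breaks.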

In [442]:
## Test 4 only: check that embeddings can be requested through client.embeddings with "OPENAI_API_KEY" or "GPTGOD_CLOUD_API_KEY"
import openai
from dotenv import load_dotenv
import os

# Load the .env file
load_dotenv()

# Read the API key and base URL
# api_key = os.getenv("OPENAI_API_KEY")
# api_base = os.getenv("OPENAI_API_BASE", "https://api.openai.com/v1")  # Defaults to the official URL
api_key = os.getenv("GPTGOD_CLOUD_API_KEY")
api_base = os.getenv("GPTGOD_CLOUD_API_BASE")  # No default: must be set in .env
# openai.api_base = "https://api.openai.com/v1"  # Use the official URL
client = openai.OpenAI(api_key=api_key, base_url=api_base)

try:
    response = client.embeddings.create(input=["测试"], model="text-embedding-3-small")
    print(response)
except Exception as e:
    print(f"创建嵌入时出错: {e}")

CreateEmbeddingResponse(data=[Embedding(embedding=[-0.016909604892134666, 0.008369829505681992, 0.03296947851777077, -0.05817000940442085, 0.0015211664140224457, -0.005872233305126429, 0.026414427906274796, -0.009862923994660378, 0.027579771354794502, 0.01074300054460764, 0.009978245012462139, 0.004655300173908472, -0.005237971432507038, 0.00914065446704626, 0.020442048087716103, 0.042097996920347214, -0.06977488100528717, -0.010803695768117905, -0.02639015018939972, 0.042607832700014114, 0.04855593666434288, 0.005341152660548687, 0.03627128526568413, -0.046031028032302856, 0.024933472275733948, -0.010026800446212292, -0.013644217513501644, 0.004898080136626959, 0.01701885461807251, -0.06642451882362366, 0.05258607864379883, -0.028648002073168755, -0.00937129557132721, -0.03122146613895893, 0.04981838911771774, -0.007507961243391037, 0.024314384907484055, 0.01740730181336403, 0.007131652906537056, -0.006169638596475124, -0.031197188422083855, -0.03530016541481018, 0.005280457902699709,

In [432]:
## Test 5 only (works with OpenAI only, not with gptgod): set "OPENAI_API_KEY" as openai.api_key, then request embeddings through the module-level openai.embeddings
import openai
from dotenv import load_dotenv
import os

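# Load the .env file
load_dotenv()

# Read the API key and base URL
api_key = os.getenv("OPENAI_API_KEY")
api_base = os.getenv("OPENAI_API_BASE", "https://api.openai.com/v1")  # Defaults to the official URL

# Make sure the API key and base URL are set
if api_key is None:
    print("请设置环境变量 OPENAI_API_KEY 为您的API密钥")
    exit(1)

if api_base is None:
    print("请设置环境变量 OPENAI_API_BASE 为您的API基础URL")
    exit(1)

# Show only a masked key, because notebook output is saved together with the file
print(f"API Key: {api_key[:3]}...{api_key[-4:]}")
print(f"API Base: {api_base}")

# Configure the module-level client. openai>=1.0 reads `openai.base_url`;
# an `openai.api_base` assignment is not used by the v1 client.
openai.api_key = api_key
openai.base_url = api_base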

try:
    response = openai.embeddings.create(input=["测试"], model="text-embedding-3-small")
    print(response)
    print(response.data[0].embedding)  # 正确访问嵌入向量
except Exception as e:
    print(f"创建嵌入时出错: {e}")

API Key: sk-...24wA
API Base: https://api.openai.com/v1
CreateEmbeddingResponse(data=[Embedding(embedding=[0.0018167447997257113, -0.01436067745089531, -0.007833752781152725, -0.011718139052391052, -0.009645985439419746, 0.015725266188383102, -0.00859907828271389, -0.011032234877347946, -0.0013339039869606495, -0.03003540262579918, 0.025169089436531067, 0.01035355031490326, -0.0011669404339045286, -0.016952674835920334, -0.0065738544799387455, 0.014411217533051968, 0.028894634917378426, 0.005646078381687403, 0.02215110883116722, -0.024360444396734238, 0.011183856055140495, 0.005732718855142593, -0.022122230380773544, -0.00892398040741682, -0.028880195692181587, 0.003182236570864916, 0.011422117240726948, -0.033934228122234344, -0.008642398752272129, 0.01170369889587164, 0.023190796375274658, -0.02106810174882412, -0.0198840145021677, -0.016894914209842682, -0.007021

In [None]:
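## Patch function 3-1: semantic search over these two files only: training_classification.md  validation_classification.md
# (The ***_classification_20241011 files generated by the same program for the second fine-tuning run were discarded: every jsonl record still contained its system message, so the md treated the system text as user content and failed verification.)
## Compute the similarity of each "user content" and "classification reason" in the md file to the query, sort, keep the top_n, and write them to an output file; semantic_search_md(md_file, query, start_index, end_index, top_n)
## text-embedding-3-small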

def get_embedding(text, model="text-embedding-3-small"):
    # openai>=1.0 returns a response object, so read the vector through
    # attributes rather than dict indexing; the legacy openai.Embedding class is gone
    response = openai.embeddings.create(input=[text], model=model)
    return response.data[0].embedding


# Path to the .env file (when it is not in the current working directory)
dotenv_path = os.path.join(os.getcwd(), '../.env')
load_dotenv(dotenv_path)

# Read the API key and base URL from environment variables
api_key = os.getenv("GPTGOD_CLOUD_API_KEY")
api_base = os.getenv("GPTGOD_CLOUD_API_BASE", "https://api.openai.com/v1")  # Falls back to the official URL
# api_key = os.getenv("OPENAI_API_KEY")
# api_base = os.getenv("OPENAI_API_BASE", "https://api.openai.com/v1")  # Defaults to the official URL
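
The ranking step behind the search (score every candidate against the query, sort, keep the best `top_n`) can be tried on toy vectors before any embedding is requested. The sketch below uses plain Python lists in place of real embeddings; `top_n_by_similarity` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n_by_similarity(query_vec, candidates, top_n=3):
    """candidates is a list of (label, vector); returns [(label, score)], best first."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]

query_vec = [1.0, 0.0]
candidates = [
    ("orthogonal", [0.0, 3.0]),
    ("same direction", [2.0, 0.0]),
    ("diagonal", [1.0, 1.0]),
]
ranked = top_n_by_similarity(query_vec, candidates, top_n=2)
for label, score in ranked:
    print(label, round(score, 4))
```

Cosine similarity ignores vector length, which is why `[2.0, 0.0]` scores the same as the query itself would.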

In [678]:
import re
import os
import numpy as np
import openai
from dotenv import load_dotenv

# Load the .env file
load_dotenv()

# Read the API key and base URL
api_key = os.getenv("GPTGOD_CLOUD_API_KEY")
api_base = os.getenv("GPTGOD_CLOUD_API_BASE")

# api_key = os.getenv("OPENAI_API_KEY")
# api_base = os.getenv("OPENAI_API_BASE", "https://api.openai.com/v1")  # Defaults to the official URL

# Make sure the API key and base URL are set
if api_key is None:
    print("请设置环境变量 GPTGOD_CLOUD_API_KEY 为您的API密钥")
    exit(1)

if api_base is None:
    print("请设置环境变量 GPTGOD_CLOUD_API_BASE 为您的API基础URL")
    exit(1)

# Initialize the OpenAI client
client = openai.OpenAI(api_key=api_key, base_url=api_base)

def clean_text(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()

def get_embedding(text, model="text-embedding-3-small"):
    try:
        # Clean the input text
        cleaned_text = clean_text(text)
        
        if not cleaned_text:
            print("输入文本为空，无法获取嵌入向量。")
            return None
        
        # Split long text into segments
        max_length = 2048  # Characters per segment (not tokens); adjust as needed
        segments = [cleaned_text[i:i+max_length] for i in range(0, len(cleaned_text), max_length)]
        
        embeddings = []
        for segment in segments:
            response = client.embeddings.create(input=[segment], model=model)
            if response.data and response.data[0]:
                embeddings.append(response.data[0].embedding)
            else:
                print(f"API 返回空数据: {response}")
                return None
        
        # With several segments, return the mean of their embeddings
        if len(embeddings) > 1:
            return np.mean(embeddings, axis=0)
        else:
            return embeddings[0]
    except Exception as e:
        print(f"创建嵌入时出错: {e}")
        return None

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def translate_text(text, target_language="zh", model="gpt-4o-mini"):
    try:
        response = client.chat.completions.create(
            model=model,  # gpt-4o-mini unless the caller overrides it
            messages=[
                {"role": "system", "content": f"Translate the following text to {target_language}."},
                {"role": "user", "content": text}
            ]
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"翻译时出错: {e}")
        return text

def semantic_search_md(file_path, query, start_index=0, end_index=None, top_n=3):
    # Embed the query
    query_embedding = get_embedding(query, model="text-embedding-3-small")
    if query_embedding is None:
        print("无法获取查询的嵌入向量，终止操作。")
        return

    # Read the file contents
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
    except FileNotFoundError:
        print(f"文件未找到: {file_path}")
        return
    except Exception as e:
        print(f"读取文件时出错: {e}")
        return

    # Split the file into the table bodies that follow each header row
    entries = re.findall(r'\| 用户内容 \| 分类原因 \|\n\| --- \| --- \|\n(.*?)(?=\n\| 用户内容 \| 分类原因 \|\n\| --- \| --- \|\n|\Z)', content, re.DOTALL)

    total_entries = len(entries)
    if end_index is None or end_index > total_entries:
        end_index = total_entries

    results = []
    similarities = []

    for index, entry in enumerate(entries[start_index:end_index + 1], start=start_index + 1):
        try:
            # Extract the user content and classification reason of every table row
            rows = re.findall(r'\| (.*?) \| (.*?) \|\n', entry, re.DOTALL)
            
            for row_index, (user_content, classification_reason) in enumerate(rows):
                user_content = user_content.strip()
                classification_reason = classification_reason.strip()

                print(f"处理行: {index} - 用户内容: {user_content[:50]}...")

                # Embed the user content and the classification reason
                user_embedding = get_embedding(user_content, model="text-embedding-3-small")
                classification_reason_embedding = get_embedding(classification_reason, model="text-embedding-3-small")

                if user_embedding is None and classification_reason_embedding is None:
                    print(f"跳过行: {index} - 无法获取嵌入向量")
                    continue

                # Compute the similarities; a missing embedding scores 0
                user_similarity = cosine_similarity(query_embedding, user_embedding) if user_embedding is not None else 0
                classification_reason_similarity = cosine_similarity(query_embedding, classification_reason_embedding) if classification_reason_embedding is not None else 0

                # Print the similarities
                print(f"用户内容相似度: {user_similarity}, 分类原因相似度: {classification_reason_similarity}")

                # Build the result record
                result_entry = {
                    "index": index,
                    "user_content": user_content,
                    "classification_reason": classification_reason,
                    "user_similarity": user_similarity,
                    "classification_reason_similarity": classification_reason_similarity
                }

                # Translate fields that contain no Chinese characters
                # (punctuation and digits are allowed in the text)
                if user_content and not re.search(r'[\u4e00-\u9fff]', user_content):
                    result_entry["user_content_translated"] = translate_text(user_content, "zh", model="gpt-4o-mini")
                if classification_reason and not re.search(r'[\u4e00-\u9fff]', classification_reason):
                    result_entry["classification_reason_translated"] = translate_text(classification_reason, "zh", model="gpt-4o-mini")

                results.append(result_entry)
                print(f"处理行: {index} - 用户内容相似度: {user_similarity}, 分类原因相似度: {classification_reason_similarity}")

                print()  # Blank line between rows

                # Record the similarities
                similarities.append((index, user_similarity, classification_reason_similarity))
        except Exception as e:
            print(f"处理行: {index} 时出错: {e}")
            continue

    if not results:
        print("未找到匹配的结果。")
        return

    # Rank by each similarity and take the union of the two top-n lists
    user_similarities = sorted([(index, user_sim) for index, user_sim, _ in similarities], key=lambda x: x[1], reverse=True)
    classification_reason_similarities = sorted([(index, class_sim) for index, _, class_sim in similarities], key=lambda x: x[1], reverse=True)

    top_n_results = set()
    for i in range(top_n):
        if i < len(user_similarities):
            top_n_results.add(user_similarities[i][0])
        if i < len(classification_reason_similarities):
            top_n_results.add(classification_reason_similarities[i][0])

    print("\n前 n 名相似度排名:")
    for index in top_n_results:
        for result in results:
            if result['index'] == index:
                print(f"处理行: {index} - 用户内容: {result['user_content'][:50]}...")
                print(f"用户内容相似度: {result['user_similarity']}, 分类原因相似度: {result['classification_reason_similarity']}")
                print()  # Blank line between entries

    # Sort the selected entries by their best similarity, high to low, and write them to a file
    top_n_results_sorted = sorted(results, key=lambda x: max(x['user_similarity'], x['classification_reason_similarity']), reverse=True)
    top_n_output_file_name = f"{os.path.splitext(file_path)[0]}_{query.replace(' ', '_')}_top_{top_n}.md"
    with open(top_n_output_file_name, 'w', encoding='utf-8') as top_n_file:
        count = 0
        for result in top_n_results_sorted:
            if result['index'] in top_n_results:
                top_n_file.write(f"### 用户内容: {result['user_content']}\n\n")
                if "user_content_translated" in result:
                    top_n_file.write(f"### 用户内容（翻译）: {result['user_content_translated']}\n\n")
                top_n_file.write(f"### 分类原因: {result['classification_reason']}\n\n")
                if "classification_reason_translated" in result:
                    top_n_file.write(f"### 分类原因（翻译）: {result['classification_reason_translated']}\n\n")
                top_n_file.write(f"### 用户内容相似度: {result['user_similarity']}\n\n")
                top_n_file.write(f"### 分类原因相似度: {result['classification_reason_similarity']}\n\n")
                top_n_file.write("\n\n\n\n\n")  # Five blank lines between entries
                count += 1
                if count >= top_n:
                    break

    print(f"前 {top_n} 名的结果已保存到文件: {top_n_output_file_name}")

# Search parameters

# First pair (md files from the first fine-tuning run, fallback only)
# md_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/training_classification.md'
# md_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/validation_classification.md'

# Second pair (md files from the second fine-tuning run)
# md_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/training_classification_NoSystemMessage.md'
md_file = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/validation_classification_NoSystemMessage.md'


query = '短视频脚本'
start_index = 0
end_index = 230  # Deliberately beyond the number of entries, so every entry is searched
top_n = 4

# Run the search
semantic_search_md(md_file, query, start_index, end_index, top_n)

处理行: 1 - 用户内容: ## Website Generator
A GPT for website creation, d...
用户内容相似度: 0.35639010276262034, 分类原因相似度: 0.1692486438419692
处理行: 1 - 用户内容相似度: 0.35639010276262034, 分类原因相似度: 0.1692486438419692

处理行: 2 - 用户内容: ## Mocktail Mixologist
I’ll make any party a blast...
用户内容相似度: 0.19810459901974264, 分类原因相似度: 0.1808698998490946
处理行: 2 - 用户内容相似度: 0.19810459901974264, 分类原因相似度: 0.1808698998490946

处理行: 3 - 用户内容: ## 痤疮治疗指南
基于中国痤疮治疗指南（2019）回答 (## Acne Treatment G...
用户内容相似度: 0.13672453281990754, 分类原因相似度: 0.09349596106936034
处理行: 3 - 用户内容相似度: 0.13672453281990754, 分类原因相似度: 0.09349596106936034

处理行: 4 - 用户内容: ## Email Responder Pro
Insert any email; receive a...
用户内容相似度: 0.23210140870802218, 分类原因相似度: 0.0851513507775044
处理行: 4 - 用户内容相似度: 0.23210140870802218, 分类原因相似度: 0.0851513507775044

处理行: 5 - 用户内容: ## Video Game Almanac
I'm your go-to guide for all...
用户内容相似度: 0.2856598309649138, 分类原因相似度: 0.15499743858894857
处理行: 5 - 用户内容相似度: 0.2856598309649138, 分类原因相似度: 0.15499743858894857

处理行: 6 - 用户内容: ## X Opt

In [None]:
## Patch function 3-2: semantic search over these two pairs of files only:
# 1. gpts_fine_tuning_training_folder.jsonl        gpts_fine_tuning_validation_folder.jsonl
# 2. (fallback only) gpts_fine_tuning_training_test_folder2.jsonl   gpts_fine_tuning_validation_test_folder2.jsonl
## Compute the similarity of each "user content" and "assistant content" in the jsonl file to the query, sort, keep the top_n, and write them to an output file; semantic_search_jsonl
## (training_test_folder2, query, top_n, start_index, end_index)
## text-embedding-3-small
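
Assistant replies in the jsonl files are often longer than one embedding segment, so `get_embedding` splits the text and averages the segment vectors. The sketch below shows a variant of that step. It is not the notebook's method: it weights each segment by its length and rescales the result to unit length, where the notebook uses a plain mean. `embed_fn` is a hypothetical stand-in for the API call, so the example runs offline.

```python
import math

def chunk_text(text, max_length=2048):
    """Split text into consecutive segments of at most max_length characters."""
    return [text[i:i + max_length] for i in range(0, len(text), max_length)]

def embed_long_text(text, embed_fn, max_length=2048):
    """Length-weighted average of segment embeddings, rescaled to unit length."""
    chunks = chunk_text(text, max_length)
    if not chunks:
        return None
    total = None
    for chunk in chunks:
        vector = embed_fn(chunk)
        if total is None:
            total = [0.0] * len(vector)
        for i, value in enumerate(vector):
            total[i] += value * len(chunk)  # longer segments count for more
    norm = math.sqrt(sum(value * value for value in total))
    return [value / norm for value in total] if norm else total

def toy_embed(segment):
    """Hypothetical stand-in for an embedding request: counts of 'a' and 'b'."""
    return [float(segment.count("a")), float(segment.count("b"))]

vector = embed_long_text("aaaabb", toy_embed, max_length=4)
print(chunk_text("aaaabb", 4), vector)
```

With a plain mean, a two-character tail segment would influence the result as much as a full segment. The weighting reduces that effect.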

In [679]:
import re
import os
import json
import openai
import numpy as np
from dotenv import load_dotenv

# Load the .env file
load_dotenv()

# Read the API key and base URL
api_key = os.getenv("GPTGOD_CLOUD_API_KEY")
api_base = os.getenv("GPTGOD_CLOUD_API_BASE")

# Make sure the API key and base URL are set
if api_key is None:
    print("请设置环境变量 GPTGOD_CLOUD_API_KEY 为您的API密钥")
    exit(1)

if api_base is None:
    print("请设置环境变量 GPTGOD_CLOUD_API_BASE 为您的API基础URL")
    exit(1)

# Show only a masked key, because notebook output is saved together with the file
print(f"API Key: {api_key[:3]}...{api_key[-4:]}")
print(f"API Base: {api_base}")

# Initialize the OpenAI client
client = openai.OpenAI(api_key=api_key, base_url=api_base)

def clean_text(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()

def get_embedding(text, model="text-embedding-3-small"):
    try:
        # Clean the input text
        cleaned_text = clean_text(text)
        
        if not cleaned_text:
            print("输入文本为空，无法获取嵌入向量。")
            return None
        
        # Split long text into segments
        max_length = 2048  # Characters per segment (not tokens); adjust as needed
        segments = [cleaned_text[i:i+max_length] for i in range(0, len(cleaned_text), max_length)]
        
        embeddings = []
        for segment in segments:
            response = client.embeddings.create(input=[segment], model=model)
            if response.data and response.data[0]:
                embeddings.append(response.data[0].embedding)
            else:
                print(f"API 返回空数据: {response}")
                return None
        
        # With several segments, return the mean of their embeddings
        if len(embeddings) > 1:
            return np.mean(embeddings, axis=0)
        else:
            return embeddings[0]
    except Exception as e:
        print(f"创建嵌入时出错: {e}")
        return None

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def translate_text(text, target_language="zh", model="gpt-4o-mini"):
    try:
        response = client.chat.completions.create(
            model=model,  # gpt-4o-mini unless the caller overrides it
            messages=[
                {"role": "system", "content": f"Translate the following text to {target_language}."},
                {"role": "user", "content": text}
            ]
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"翻译时出错: {e}")
        return text

def semantic_search_jsonl(file_path, query, top_n=3, start_index=0, end_index=None):
    # Embed the query
    query_embedding = get_embedding(query, model="text-embedding-3-small")
    if query_embedding is None:
        print("无法获取查询的嵌入向量，终止操作。")
        return

    # Read the file contents
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            lines = file.readlines()
    except FileNotFoundError:
        print(f"文件未找到: {file_path}")
        return
    except Exception as e:
        print(f"读取文件时出错: {e}")
        return

    # Clamp the range of lines to process
    total_lines = len(lines)
    if end_index is None or end_index > total_lines:
        end_index = total_lines

    results = []
    similarities = []

    for index, line in enumerate(lines[start_index:end_index]):
        try:
            data = json.loads(line)
            messages = data.get("messages", [])
            
            user_content = ""
            assistant_content = ""
            
            for message in messages:
                if message["role"] == "user":
                    user_content = message["content"]
                elif message["role"] == "assistant":
                    assistant_content = message["content"]

            print(f"处理行: {index + start_index + 1} - 用户内容: {user_content}")

            # Embed the user content and the assistant content
            user_embedding = get_embedding(user_content, model="text-embedding-3-small")
            assistant_embedding = get_embedding(assistant_content, model="text-embedding-3-small")

            if user_embedding is None and assistant_embedding is None:
                print(f"跳过行: {index + start_index + 1} - 无法获取嵌入向量")
                continue

            # Compute the similarities; a missing embedding scores 0
            user_similarity = cosine_similarity(query_embedding, user_embedding) if user_embedding is not None else 0
            assistant_similarity = cosine_similarity(query_embedding, assistant_embedding) if assistant_embedding is not None else 0

            # Print the similarities
            print(f"用户相似度: {user_similarity}, 助手相似度: {assistant_similarity}")

            # Record the result
            results.append({
                "index": index + start_index + 1,
                "user_content": user_content,
                "assistant_content": assistant_content,
                "user_similarity": user_similarity,
                "assistant_similarity": assistant_similarity
            })
            similarities.append((index + start_index + 1, user_similarity, assistant_similarity))

            print()  # 在每次处理行的输出之间加一个换行符
        except json.JSONDecodeError:
            print(f"跳过行: {index + start_index + 1} - JSON解码错误")
            continue

    if not results:
        print("未找到匹配的结果。")
        return

    # 排序并选择前 top_n 名
    user_similarities = sorted([(index, user_sim) for index, user_sim, _ in similarities], key=lambda x: x[1], reverse=True)
    assistant_similarities = sorted([(index, assistant_sim) for index, _, assistant_sim in similarities], key=lambda x: x[1], reverse=True)

    top_n_results = set()
    for i in range(top_n):
        if i < len(user_similarities):
            top_n_results.add(user_similarities[i][0])
        if i < len(assistant_similarities):
            top_n_results.add(assistant_similarities[i][0])

    # 构建输出文件名
    output_file_name = f"{os.path.splitext(file_path)[0]}_{query.replace(' ', '_')}_top_{top_n}.txt"
    
    # 写入结果到文件
    top_n_results_sorted = sorted(results, key=lambda x: max(x['user_similarity'], x['assistant_similarity']), reverse=True)
    with open(output_file_name, 'w', encoding='utf-8') as output_file:
        count = 0
        for result in top_n_results_sorted:
            if result['index'] in top_n_results:
                output_file.write("### 用户内容:\n")
                output_file.write(result["user_content"] + "\n")
                if "user_content_translated" in result:
                    output_file.write(f"### 用户内容（翻译）: {result['user_content_translated']}\n")
                output_file.write("\n### 助手内容:\n")
                output_file.write(result["assistant_content"] + "\n")
                if "assistant_content_translated" in result:
                    output_file.write(f"### 助手内容（翻译）: {result['assistant_content_translated']}\n")
                output_file.write(f"### 用户相似度: {result['user_similarity']}\n")
                output_file.write(f"### 助手相似度: {result['assistant_similarity']}\n")
                output_file.write("\n\n\n\n\n")  # 相隔5行
                count += 1
                if count >= top_n:
                    break

    print(f"前 {top_n} 名的结果已保存到文件: {output_file_name}")

# 备用1 ：第一次微调时的结果
# training_test_folder2 = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_test_folder2.jsonl'
# validation_test_folder2 = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_validation_test_folder2.jsonl'
# 备用2 ：第二次微调时的结果，每个jsonl行的头部都加了system Message
# training_folder = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_folder.jsonl'
# validation_folder = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_validation_folder.jsonl'

# 推荐用 ：第二次微调时的结果，每个jsonl行的头部都去除了system Message
training_folder = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_training_folder_NoSystemMessage.jsonl'
validation_folder = '/Users/wingzheng/Desktop/github/GPT/openai-cookbook/examples/data/gpts_fine_tuning_validation_folder_NoSystemMessage.jsonl'

query = '短视频'
top_n = 4
start_index = 0
end_index = 230  # 这里设置为一个较大的值 表示会搜索整个文本的所有条目

# 调用函数
semantic_search_jsonl(validation_folder, query, top_n, start_index, end_index)

API Key: sk-eW3rgoIIttiTD8kDD8142381B9104601B4FfE11d3dD9FaC3
API Base: https://gptgod.cloud/v1
处理行: 1 - 用户内容: ## Website Generator
A GPT for website creation, design, copywriting, and code. Integrated with DALL-E 3. Powered by B12. Share your feedback with hello@b12.io.
用户相似度: 0.20293666504214727, 助手相似度: 0.21337768792323242

处理行: 2 - 用户内容: ## Mocktail Mixologist
I’ll make any party a blast with mocktail recipes with whatever ingredients you have on hand.
用户相似度: 0.1304367850091108, 助手相似度: 0.22769075787578116

处理行: 3 - 用户内容: ## 痤疮治疗指南
基于中国痤疮治疗指南（2019）回答
用户相似度: 0.14405113037806463, 助手相似度: 0.13177378456758707

处理行: 4 - 用户内容: ## Email Responder Pro
Insert any email; receive a polished reply.
用户相似度: 0.09998113618770121, 助手相似度: 0.131614036777004

处理行: 5 - 用户内容: ## Video Game Almanac
I'm your go-to guide for all things gaming, from strategies to streamers!
用户相似度: 0.2423328427214366, 助手相似度: 0.2677846801177784

处理行: 6 - 用户内容: ## X Optimizer GPT
Optimizes X posts for peak engagement
用户相似度: 0.181
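
The search above depends on two helpers defined elsewhere in the notebook, `get_embedding` and `cosine_similarity`. As a minimal sketch of the scoring and ranking step only, assuming embeddings are plain lists of floats: the `cosine_sim` and `rank_by_similarity` names and the toy vectors below are illustrative and are not the notebook's own code.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def rank_by_similarity(query_vec, entries, top_n=3):
    """Return the top_n (line_number, score) pairs, best first.

    `entries` is a list of (user_vec, assistant_vec) pairs. Each entry is scored
    by the larger of its two similarities, mirroring the max(...) sort key that
    the search function uses when it writes its results.
    """
    scored = [
        (i + 1, max(cosine_sim(query_vec, u), cosine_sim(query_vec, v)))
        for i, (u, v) in enumerate(entries)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Toy 2-D "embeddings" (hypothetical values, for illustration only)
query_vec = [1.0, 0.0]
entries = [
    ([0.0, 1.0], [0.0, 2.0]),  # both vectors orthogonal to the query
    ([1.0, 1.0], [0.0, 1.0]),  # user vector at 45 degrees to the query
    ([2.0, 0.0], [0.0, 1.0]),  # user vector parallel to the query
]
top = rank_by_similarity(query_vec, entries, top_n=2)
print(top)  # entry 3 first (score 1.0), then entry 2 (score about 0.707)
```

Cosine similarity ignores vector length, which is why the parallel vector `[2.0, 0.0]` scores exactly 1.0 against `[1.0, 0.0]`.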

## Token Counting Utilities

Let's define a few helpful utilities to be used in the rest of the notebook.

In [92]:
# Cost-estimation helper 1: taken from the original cookbook
encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    print(f"p10 / p90: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")
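
To make the per-message overhead in `num_tokens_from_messages` concrete, here is a small sketch of the same arithmetic with a stand-in tokenizer. The whitespace-based `fake_encode` is an assumption made so that the example runs without the `cl100k_base` vocabulary; real counts from `tiktoken` will differ.

```python
def fake_encode(text):
    """Stand-in for encoding.encode: one 'token' per whitespace-separated word."""
    return text.split()

def count_tokens(messages, tokens_per_message=3, tokens_per_name=1):
    """Same structure as num_tokens_from_messages, using the stand-in tokenizer."""
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message            # fixed overhead per message
        for key, value in message.items():
            num_tokens += len(fake_encode(value))   # role and content are both counted
            if key == "name":
                num_tokens += tokens_per_name
    return num_tokens + 3                           # fixed allowance added once per conversation

messages = [
    {"role": "user", "content": "What is fine tuning"},
    {"role": "assistant", "content": "Training on your own examples"},
]
total = count_tokens(messages)
print(total)  # 2 * 3 overhead + 2 role words + 9 content words + 3 = 20
```

With short messages the fixed overhead is a noticeable share of the total, so per-example counts are always somewhat larger than the content alone.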

## Data Warnings and Token Counts 

With some lightweight analysis we can identify potential issues in the dataset, like missing messages, and provide statistical insights into message and token counts.

1. **Missing System/User Messages**: Counts the number of conversations missing a "system" or "user" message. Such messages are critical for defining the assistant's behavior and initiating the conversation.
2. **Number of Messages Per Example**: Summarizes the distribution of the number of messages in each conversation, providing insight into dialogue complexity.
3. **Total Tokens Per Example**: Calculates and summarizes the distribution of the total number of tokens in each conversation. Important for understanding fine-tuning costs.
4. **Tokens in Assistant's Messages**: Calculates the number of tokens in the assistant's messages per conversation and summarizes this distribution. Useful for understanding the assistant's verbosity.
5. **Token Limit Warnings**: Checks if any examples exceed the maximum token limit (16,385 tokens), as such examples will be truncated during fine-tuning, potentially resulting in data loss.


In [93]:
# Cost-estimation helper 2: data warnings and token counts
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))
    
print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 16385 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 16,385 token limit, they will be truncated during fine-tuning")

Num examples missing system message: 19
Num examples missing user message: 1

#### Distribution of num_messages_per_example:
min / max: 1, 2
mean / median: 1.95, 2.0
p10 / p90: 2.0, 2.0

#### Distribution of num_total_tokens_per_example:
min / max: 166, 2862
mean / median: 1017.65, 730.5
p10 / p90: 298.6, 2122.700000000001

#### Distribution of num_assistant_tokens_per_example:
min / max: 0, 2791
mean / median: 945.8, 619.0
p10 / p90: 214.3, 2064.200000000001

0 examples may be over the 16,385 token limit, they will be truncated during fine-tuning


## Cost Estimation

In this final section, we estimate the total number of tokens that will be used for fine-tuning, which allows us to approximate the cost. It is worth noting that the duration of the fine-tuning jobs will also increase with the token count. 

In [325]:
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 16385

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
print(f"the length of the dataset: {n_train_examples} ")
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")

the length of the dataset: 212 
Dataset has ~20353 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~61059 tokens
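
As a worked illustration of the epoch heuristic in the cell above, the same rule can be wrapped in a function and evaluated for a few dataset sizes. The `default_epochs` name is introduced here for illustration only, and the sketch assumes a non-empty dataset (an empty one would divide by zero).

```python
TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

def default_epochs(n_train_examples):
    """Apply the same epoch rule as the cell above to a given dataset size."""
    n_epochs = TARGET_EPOCHS
    if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
        # Small dataset: raise the epoch count, capped at MAX_DEFAULT_EPOCHS
        n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
    elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
        # Large dataset: lower the epoch count, floored at MIN_DEFAULT_EPOCHS
        n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)
    return n_epochs

for n in (2, 20, 212, 10000, 50000):
    print(n, default_epochs(n))
# 2 -> 25, 20 -> 5, 212 -> 3, 10000 -> 2, 50000 -> 1
```

The 212-example dataset used here falls between the two thresholds (212 × 3 = 636), so it keeps the target of 3 epochs.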


In [290]:
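# Cost function 3: replaces the previous (original) cell; a rewritten cost estimate that also prints the actual cost in RMB
## Total fine-tuning cost, computed from the pay-as-you-go formula:
## cost = group multiplier × model rate × (prompt tokens + completion tokens × completion multiplier) / 500000  (in USD)
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 16385  # maximum number of tokens per example

TARGET_EPOCHS = 3  # target number of epochs
MIN_TARGET_EXAMPLES = 100  # minimum target number of examples
MAX_TARGET_EXAMPLES = 25000  # maximum target number of examples
MIN_DEFAULT_EPOCHS = 1  # minimum default number of epochs
MAX_DEFAULT_EPOCHS = 25  # maximum default number of epochs

# Model rates and completion multipliers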
model_rates = {
    "dall-e": 8,
    "dall-e-2": 8,
    "dall-e-3": 5,
    "gemini-1.5-pro-exp-0801": 1.75,
    "gemini-1.5-pro-exp-0827": 1.75,
    "gemini-1.5-pro-latest": 3.5,
    "gemma-2b-it": 1,
    "gemma-7b-it": 1,
    "gpt-4-gizmo-*": 15,
    "gpt-4-v": 15,
    "gpt-4-vision-preview": 5,
    "gpt-4o": 2.5,
    "gpt-4o-2024-05-13": 2.5,
    "gpt-4o-2024-08-06": 1.25,
    "gpt-4o-all": 2.5,
    "gpt-4o-mini": 0.075,
    "gpt-4o-mini-2024-07-18": 0.075,
    "o1-mini": 1.5,
    "o1-mini-2024-09-12": 1.5,
    "o1-preview": 7.5,
    "o1-preview-2024-09-12": 7.5,
    "qwen-72b": 1,
    "llama-2-13b": 1,
    "llama-2-70b": 1,
    "llama-2-7b": 1,
    "llama-3-70b": 2,
    "llama-3-8b": 1,
    "llama-3.1-405b": 3,
    "llama-3.1-70b": 2,
    "llama-3.1-8b": 1,
    "llama2-70b-4096": 0.35,
    "llama2-7b-2048": 0.05,
    "tts-1": 7.5,
    "tts-1-1106": 7.5,
    "tts-1-hd": 15,
    "tts-1-hd-1106": 15,
    "whisper-1": 10,
    "url": 0.2
}

completion_multipliers = {
    "gemini-1.5-pro-latest": 3,
    "chatgpt-4o-latest": 3,
    "gpt-3.5-turbo": 1.33,
    "gpt-4-turbo": 3,
    "gpt-4-turbo-2024-04-09": 3,
    "gpt-4o": 3,
    "gpt-4o-2024-05-13": 3,
    "gpt-4o-2024-08-06": 4,
    "gpt-4o-all": 3,
    "gpt-4o-mini": 4,
    "gpt-4o-mini-2024-07-18": 4,
    "o1-mini": 4,
    "o1-mini-2024-09-12": 4,
    "o1-preview": 4,
    "o1-preview-2024-09-12": 4
}

GROUP_MULTIPLIER = 1.00  # group multiplier

def calculate_cost(model_name, prompt_tokens, completion_tokens):
    """
    Compute the cost of one pass over the data, in USD.
    :param model_name: model name
    :param prompt_tokens: number of prompt tokens
    :param completion_tokens: number of completion tokens
    :return: cost in USD
    """
    model_rate = model_rates.get(model_name, 1.0)  # model rate, defaults to 1.0
    completion_multiplier = completion_multipliers.get(model_name, 1.0)  # completion multiplier, defaults to 1.0

    total_tokens = prompt_tokens + (completion_tokens * completion_multiplier)  # weighted token count
    cost = GROUP_MULTIPLIER * model_rate * total_tokens / 500000  # apply the pay-as-you-go formula
    return cost

n_epochs = TARGET_EPOCHS  # start from the target number of epochs
n_train_examples = len(dataset)  # number of examples in the dataset

# Adjust the number of epochs for very small or very large datasets
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

# Count the billable tokens
n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"The dataset has ~{n_billing_tokens_in_dataset} tokens that will be billed")
print(f"By default, you will train for {n_epochs} epochs on this dataset")

# Count prompt tokens (user messages) and completion tokens (assistant messages)
prompt_tokens = sum(len(encoding.encode(message["content"])) for ex in dataset for message in ex["messages"] if message["role"] == "user")
completion_tokens = sum(len(encoding.encode(message["content"])) for ex in dataset for message in ex["messages"] if message["role"] == "assistant")

# Select the model and print the parameters used
# model_name = "gpt-4o-mini"  # assume the gpt-4o-mini model
model_name = "gpt-4o-2024-08-06"
# model_name = "o1-mini-2024-09-12"
# model_name = "o1-preview-2024-09-12"
model_rate = model_rates.get(model_name, 1.0)
completion_multiplier = completion_multipliers.get(model_name, 1.0)
print(f"Model rate: {model_rate}")
print(f"Prompt tokens: {prompt_tokens}")
print(f"Completion tokens: {completion_tokens}")
print(f"Completion multiplier: {completion_multiplier}")

# Compute the total cost over all epochs
total_cost = calculate_cost(model_name, prompt_tokens, completion_tokens)
print(f"Estimated total cost: ${total_cost * n_epochs:.2f} USD")
actual_cost_cny = (total_cost * n_epochs) * 6 / 10  # cost in RMB, using the author's factor of 6 / 10
print(f"Actual cost: {actual_cost_cny:.2f} RMB")

The dataset has ~20353 tokens that will be billed
By default, you will train for 3 epochs on this dataset
Model rate: 1.25
Prompt tokens: 7270
Completion tokens: 167464
Completion multiplier: 4
Estimated total cost: $5.08 USD
Actual cost: 3.05 RMB


See https://openai.com/pricing to estimate total costs.
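
The pay-as-you-go formula quoted in the last code cell can be checked by hand against the recorded output. The sketch below plugs in the printed values (7270 prompt tokens, 167464 completion tokens, model rate 1.25, completion multiplier 4, group multiplier 1.0, 3 epochs). The `pay_as_you_go_cost` name is introduced here for illustration and restates `calculate_cost` without the lookup tables.

```python
def pay_as_you_go_cost(prompt_tokens, completion_tokens, model_rate,
                       completion_multiplier, group_multiplier=1.0):
    """cost = group multiplier * model rate * (prompt + completion * multiplier) / 500000"""
    weighted_tokens = prompt_tokens + completion_tokens * completion_multiplier
    return group_multiplier * model_rate * weighted_tokens / 500000

# Values taken from the recorded output of the cell above
per_epoch = pay_as_you_go_cost(7270, 167464, model_rate=1.25, completion_multiplier=4)
total = per_epoch * 3  # 3 epochs

# 7270 + 167464 * 4 = 677126 weighted tokens; 1.25 * 677126 / 500000 = 1.692815
print(f"{per_epoch:.6f} per epoch, {total:.2f} for 3 epochs")
# 1.692815 per epoch, 5.08 for 3 epochs
```

The result agrees with the `$5.08` printed by the cell, and shows that completion tokens dominate the estimate because of the 4× multiplier.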