# 溯源树构建
1. 使用LLM解析论文，处理成节点；节点belike：论文，论文信息（题目，时间，tag/主题），引用论文中和主题相关的论文
2. 使用节点进行建树，根=选择一篇作为核心，根据其引用论文进行溯源，类似beam search，每次溯源到一个节点，选择其topk相关的引用节点，从根节点开始迭代进行n次，形成n层的检索树
3. 输出为树结构，打印  

基于本地处理，整体过程可以用检索实现，收集到的论文作为知识库，处理节点时，可以在知识库中进行检索，找到的topk作为当前节点的溯源节点；  
知识库中的每条文档只存放论文、时间、摘要、相关引用即可，正文可以省略；  
所以可以分为基于引用关系的溯源，和基于时间&主题的溯源，前者是按照上面说的建立引用树，后者是利用知识库+检索找出主题相关且更早的工作；  
可以考虑二者混合，形成混合检索系统

获取apikey

In [2]:
# get Dashscope key
import os
from dotenv import load_dotenv

# 加载 .env 文件
load_dotenv()

# 尝试获取环境变量并打印出来
dashscope_api_key = os.getenv("DASHSCOPE_API_KEY")
# print(f"DASHSCOPE_API_KEY: {dashscope_api_key}")

## 1.论文解析
解析论文，处理成dict，每个对象中包括：{"id","name","publish_time（处理成一个数如20250311）","abstract"}  
遍历文件夹articles中的论文pdf，顺序设置id，使用pdf解析器，解析出name、publish_time、abstract,   
//决定还是不加key_words了，后面直接让LLM对abstract进行比较，否则两次使用LLM，精度不好保证   
将得到的dict存储成json文件，方便后续处理。

In [None]:
# pip install PyPDF2

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting PyPDF2
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/8e/5e/c86a5643653825d3c913719e788e41386bee415c2b87b4f955432f2de6b2/pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
Note: you may need to restart the kernel to use updated packages.


尝试使用字符串解析，结果不太行；  
还是使用LLM进行提取吧；  
测试了一下网页工具上是可以直接提取时间的，那就还是按原来的计划提取时间+题目+摘要；  
以下是测试代码

对文件夹中的论文进行批量处理

In [104]:
from langchain_community.llms import Tongyi
from langchain.prompts import PromptTemplate

class ParseLLM:
    def __init__(self,model_name,key):
        self.llm=Tongyi(model=model_name,api_key=key)
        
        # 解析摘要，对应建树方法1
        # self.parse_prompt_template=PromptTemplate(template="""
        #     <document>{document}</document>
        #         Extract the following information from the document above and return it in JSON format:
        #         - Title of the paper
        #         - publication date, use'%Y-%m-%d' fromat
        #         - Abstract
                
        #         Output example:
        #         {{
        #             "title": "The title of the paper",
        #             "publication_date": "2025-03-13",
        #             "abstract": "The abstract of the paper"
        #         }}
        #     """)
        
        # 解析引用，对应建树方法2
        # self.parse_prompt_template = PromptTemplate(template="""
        #     <document>{document}</document>
            
        #     Extract the following information from the document above and return it in JSON format:
        #     - Title of the paper
        #     - Publication date (use '%Y-%m-%d' format)
        #     - References: List of titles from the "References" section (exact strings as they appear)

        #     Output example:
        #     {{
        #         "title": "Attention Is All You Need",
        #         "publication_date": "2017-12-01",
        #         "references": [
        #             "Neural Machine Translation by Jointly Learning to Align and Translate",
        #             "ImageNet Classification with Deep Convolutional Neural Networks"
        #         ]
        #     }}
        # """)
        
        # 解析类别，对应建树方法3
        self.parse_prompt_template = PromptTemplate(template="""
            <document>{document}</document>
            
            Extract the following information from the document above and return it in JSON format:
            - Title of the paper
            - Publication date, use '%Y-%m-%d' format
            - Tags of the paper: Choose one or more of the following tags based on the abstract. If multiple tags apply, include all relevant ones:
                * "jailbreak": Papers discussing methods to bypass the safety mechanisms of language models [[1]].
                * "alignment": Papers focused on ensuring language models behave ethically and align with human values [[2]].
                * "hallucination": Papers addressing issues where language models generate incorrect or nonsensical content [[3]].
                * "prompt-injection": Papers exploring techniques to manipulate language models through crafted inputs [[4]].
            
            Output example:
            {{
                "title": "The title of the paper",
                "publication_date": "2025-03-13",
                "tags": ["alignment", "hallucination"]  // Can be one or more tags
            }}
        """)
        
    
    def response(self,prompt):
        response=self.llm.invoke(prompt)
        return response
    
    def parse_paper(self,paper):
        prompt=self.parse_prompt_template.format(document=paper)
        response=self.response(prompt)
        return response

In [4]:
import os
import json
from PyPDF2 import PdfReader
from datetime import datetime
from tqdm import tqdm

# Step 1: 读取指定文件夹中的所有 PDF 文件内容
def read_papers_from_folder(folder_path):
    papers = []
    
    # 获取所有 PDF 文件路径
    pdf_files = [filename for filename in os.listdir(folder_path) if filename.endswith(".pdf")]
    
    # 使用 tqdm 包裹 pdf_files 列表，显示进度条
    for filename in tqdm(pdf_files, desc="Reading PDFs", unit="file"):
        file_path = os.path.join(folder_path, filename)
        try:
            reader = PdfReader(file_path)
            paper_content = ""
            
            # 遍历每一页并提取文本
            for page in reader.pages:
                paper_content += page.extract_text()
            
            papers.append(paper_content)
        
        except Exception as e:
            # 捕获异常并打印错误信息
            tqdm.write(f"Error reading {filename}: {e}")  # 使用 tqdm.write 避免干扰进度条
    
    return papers

# 辅助函数：去掉多余字符，将输出转为json格式
def parse_to_json(response):
    """
    解析模型返回的带有 Markdown 格式代码块的 JSON 字符串。
    
    :param response: 模型返回的字符串，例如：
                     ```json\n{\n  "论文题目": "INJEC AGENT...",\n  "发布时间": "2024-08-04",\n  "摘要": "Recent work..."\n}\n```
    :return: 解析后的 Python 字典
    """
    try:
        # Step 1: 去掉 Markdown 格式的代码块标记
        if response.startswith("```json") and response.endswith("```"):
            # 去掉开头的 ```json 和结尾的 ```
            response = response[len("```json"):-len("```")].strip()
        
        # Step 2: 将剩余的字符串解析为 JSON
        parsed_data = json.loads(response)
        return parsed_data
    except json.JSONDecodeError as e:
        print(f"JSON 解析失败: {e}")
        return None
    except Exception as e:
        print(f"发生错误: {e}")
        return None
    
# Step 2: 使用 ParseLLM 处理每篇论文，生成解析后的信息列表
def process_papers_with_parse_llm(papers, model_name, api_key=None):
    parse_llm = ParseLLM(model_name=model_name, key=api_key)
    paper_infos = []
    for i, paper in enumerate(tqdm(papers,desc="Processing papers", unit="paper")):
        # print(f"Processing paper {i + 1}/{len(papers)}...")
        try:
            response = parse_llm.parse_paper(paper)
            parsed_info = parse_to_json(response)  # 将 LLM 的响应解析为字典
            parsed_info_with_index={"index": i + 1, **parsed_info}
            paper_infos.append(parsed_info_with_index)
        except Exception as e:
            tqdm.write(f"Error parsing paper {i + 1}: {e}") 
    return paper_infos

# Step 3: 将解析后的信息写入 JSON 文件
def save_paper_infos_to_json(paper_infos, output_file_path):
    with open(output_file_path, 'w') as f:
        json.dump(paper_infos, f, ensure_ascii=False, indent=4)
    print(f"Paper infos have been saved to {output_file_path}")

# 解析
def parse(folder_path, output_file_path, model_name, api_key=None):
    # Step 1: 读取 PDF 文件
    papers = read_papers_from_folder(folder_path)
    if not papers:
        print("No PDF files found in the specified folder.")
        return

    # Step 2: 使用 ParseLLM 解析论文
    paper_infos = process_papers_with_parse_llm(papers, model_name, api_key)

    # Step 3: 将解析结果写入 JSON 文件
    save_paper_infos_to_json(paper_infos, output_file_path)


In [112]:
# 配置参数
folder_path = "../data"  # 替换为你的 PDF 文件夹路径
# folder_path="../EDK/Agent.Data/PDF"
output_file_path = "output/paper_infos5.json"  # 替换为你想保存的 JSON 文件路径

可以分步执行子函数，或者直接执行封装的parse函数

In [109]:
papers=read_papers_from_folder(folder_path)
papers[0][:500]

Reading PDFs:  30%|███       | 112/369 [01:41<03:38,  1.17file/s]FloatObject (b'0.000000000000-14210855') invalid; use 0.0 instead
Reading PDFs: 100%|██████████| 369/369 [06:32<00:00,  1.06s/file]


'SafeAgentBench: A Benchmark for Safe Task Planning of\nEmbodied LLM Agents\nSheng Yin1*Xianghe Pang1*Yuanzhuo Ding1Menglan Chen1\nYutong Bi1Yichen Xiong1Wenhao Huang1\nZhen Xiang2Jing Shao3Siheng Chen1†\n1Shanghai Jiao Tong University2University of Georgia3Shanghai AI Laboratory\n1{Yin.sheng011224,xianghep,ssansjhicvc,vevive,biyutong\ncc_eason,1579515851,sihengc}@sjtu.edu.cn;2zhen.xiang.lance@gmail.com;\n3shaojing@pjlab.org.cn\nAbstract\nWith the integration of large language models (LLMs), em-\nbodied age'

In [110]:
parse_llm = ParseLLM("qwen-turbo", dashscope_api_key)
response=parse_llm.parse_paper(papers[0])
response

'```json\n{\n    "title": "SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents",\n    "publication_date": "2025-03-10",\n    "tags": ["alignment"]\n}\n```'

In [111]:
papers_infos=process_papers_with_parse_llm(papers,"qwen-turbo",dashscope_api_key)
papers_infos[0]

Processing papers:  33%|███▎      | 121/369 [07:09<14:53,  3.60s/paper]

Error parsing paper 121: status_code: 400 
 code: DataInspectionFailed 
 message: Input data may contain inappropriate content.


Processing papers:  38%|███▊      | 139/369 [08:11<12:41,  3.31s/paper]

Error parsing paper 139: status_code: 400 
 code: DataInspectionFailed 
 message: Input data may contain inappropriate content.


Processing papers:  40%|███▉      | 146/369 [08:35<13:11,  3.55s/paper]

Error parsing paper 146: status_code: 400 
 code: DataInspectionFailed 
 message: Input data may contain inappropriate content.


Processing papers:  58%|█████▊    | 213/369 [13:37<09:06,  3.50s/paper]

Error parsing paper 213: status_code: 400 
 code: DataInspectionFailed 
 message: Input data may contain inappropriate content.


Processing papers:  60%|█████▉    | 220/369 [14:01<08:32,  3.44s/paper]

Error parsing paper 220: status_code: 400 
 code: DataInspectionFailed 
 message: Input data may contain inappropriate content.


Processing papers:  71%|███████   | 262/369 [16:48<05:58,  3.35s/paper]

Error parsing paper 262: status_code: 400 
 code: DataInspectionFailed 
 message: Input data may contain inappropriate content.


Processing papers:  73%|███████▎  | 270/369 [17:16<05:20,  3.23s/paper]

Error parsing paper 270: status_code: 400 
 code: DataInspectionFailed 
 message: Input data may contain inappropriate content.


Processing papers:  75%|███████▍  | 276/369 [17:42<05:57,  3.84s/paper]

Error parsing paper 276: status_code: 400 
 code: DataInspectionFailed 
 message: Input data may contain inappropriate content.


Processing papers:  75%|███████▌  | 277/369 [17:48<06:55,  4.52s/paper]

Error parsing paper 277: status_code: 400 
 code: DataInspectionFailed 
 message: Input data may contain inappropriate content.


Processing papers:  77%|███████▋  | 283/369 [18:12<05:06,  3.56s/paper]

Error parsing paper 283: status_code: 400 
 code: DataInspectionFailed 
 message: Input data may contain inappropriate content.


Processing papers:  78%|███████▊  | 287/369 [18:26<04:32,  3.33s/paper]

Error parsing paper 287: status_code: 400 
 code: DataInspectionFailed 
 message: Input data may contain inappropriate content.


Processing papers:  86%|████████▌ | 317/369 [20:16<03:08,  3.62s/paper]

Error parsing paper 317: status_code: 400 
 code: DataInspectionFailed 
 message: Input data may contain inappropriate content.


Processing papers:  93%|█████████▎| 342/369 [21:45<01:36,  3.58s/paper]

Error parsing paper 342: status_code: 400 
 code: DataInspectionFailed 
 message: Input data may contain inappropriate content.


Processing papers:  95%|█████████▌| 352/369 [22:21<00:56,  3.31s/paper]

JSON 解析失败: Expecting value: line 1 column 1 (char 0)
Error parsing paper 352: 'NoneType' object is not a mapping


Processing papers: 100%|██████████| 369/369 [23:19<00:00,  3.79s/paper]


{'index': 1,
 'title': 'SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents',
 'publication_date': '2025-03-10',
 'tags': ['alignment']}

In [114]:
save_paper_infos_to_json(papers_infos, output_file_path)

Paper infos have been saved to output/paper_infos5.json


执行解析函数

In [None]:
parse(folder_path, output_file_path, "qwen-turbo", dashscope_api_key)

## 2.建树
使用anytree，可以在建树完成后直接画出树状图；   
每次建树时，从前面建立的论文库中选取publish time更早且keywords有重叠的；

build_treace_tree(json_file_path,layers)
1. 读取json_file_path指定的json文件，将论文信息读入列表；
2. 选取一个论文作为根结点，这里选时间最晚的、最新的论文；
3. 执行以下操作：对于每个待处理节点，如果存在时间更早且主题相关的论文，则将相关且更早的论文作为anytree中的子节点，并且添加到待处理节点中；重复以上操作直到树的层数达到layers

方法局限性：复杂度n方，并且调用大模型次数过多——当论文数量很多时，建树还是应该交给更简单的算法，比如：  
1. 基于摘要的嵌入向量相似度；
2. 在解析阶段，对文章进行分类，建树阶段基于类别进行溯源；

最后还是使用基于引用关系进行建树了——所以绕一大圈是为了什么……（证明了其他方法的不可行性）  
基本流程和初版一致，把is_related改为判断是否在引用文献中就行；  
可能会导致建的树也很稀疏  

### 2.0辅助函数

In [None]:
# 下载anytree
# pip install anytree

将所有键转为小写——当LLM指令遵循做的比较差、出现不符合要求的输出时可能需要使用；

In [7]:
import json

def convert_keys_to_lowercase(data):
    """
    递归地将 JSON 数据中的所有键转换为小写。
    
    参数:
        data: JSON 数据（可以是字典、列表或其他类型）。
        
    返回:
        转换后的数据，其中所有字典的键都变为小写。
    """
    if isinstance(data, dict):  # 如果是字典
        return {k.lower(): convert_keys_to_lowercase(v) for k, v in data.items()}
    elif isinstance(data, list):  # 如果是列表
        return [convert_keys_to_lowercase(item) for item in data]
    else:  # 其他类型（如字符串、数字等），直接返回
        return data

def overwrite_json_file(file_path):
    """
    读取 JSON 文件，将所有键转换为小写，并直接覆盖原文件。
    
    参数:
        file_path: JSON 文件的路径。
    """
    # Step 1: 读取 JSON 文件
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    # Step 2: 转换所有键为小写
    converted_data = convert_keys_to_lowercase(data)
    
    # Step 3: 将转换后的数据写回原文件
    with open(file_path, 'w', encoding='utf-8') as f:
        json.dump(converted_data, f, ensure_ascii=False, indent=4)


In [None]:
info_path = "output/paper_infos3.json"  # 输入文件路径

overwrite_json_file(info_path)
print("处理完成！已将所有键转换为小写并覆盖原文件。")

处理完成！已将所有键转换为小写并覆盖原文件。


In [None]:
# 打印树结构
def print_tree(root_node):
    for pre, fill, node in RenderTree(root_node):
        print(f"{pre}{node.name}")

### 2.1 基于摘要相关性建树

In [28]:
from langchain_community.llms import Tongyi
from langchain.prompts import PromptTemplate

class JudgeLLM:
    def __init__(self,model_name,key):
        self.llm=Tongyi(model=model_name,api_key=key)
        self.prompt_template=PromptTemplate(template="""
            You are an expert academic assistant tasked with determining whether two research papers are strongly related. Below are the details of the task:

            1. Input the abstracts of two research papers:
            - Abstract 1: {abstract1}
            - Abstract 2: {abstract2}

            2. Evaluation Criteria:
            - The two papers must address the same topic.
            - The two papers must share multiple keywords (e.g., research methods, technical terms, core concepts, etc.).

            3. Output:
            - If the two papers are strongly related, return `True`.
            - If the two papers are not strongly related, return `False`.

            Carefully analyze the content of the two abstracts, extract their themes and keywords, and draw a conclusion based on the criteria above.
            """)
    
    def response(self,prompt):
        response=self.llm.invoke(prompt)
        return response
    
    def is_related(self,abstract1,abstract2):
        prompt=self.prompt_template.format(abstract1=abstract1,abstract2=abstract2)
        response=self.response(prompt)
        if "True" in response:
            return True
        return False

In [118]:
import json
from anytree import Node, RenderTree
from datetime import datetime

# 使用LLM判断相关性进行建树
def build_relate_trace_tree(json_file_path, layers, model_name, api_key, root_num=None):
    """
    构建论文的追溯树。
    
    :param json_file_path: JSON 文件路径，包含论文信息
    :param layers: 树的最大层数
    :param model_name: 使用的模型名称
    :param api_key: API 密钥
    :return: 根节点（anytree.Node）
    """
    # 初始化 JudgeLLM
    judge_llm = JudgeLLM(model_name, api_key)
    
    # Step 1: 读取 JSON 文件，将论文信息读入列表
    with open(json_file_path, 'r', encoding='utf-8') as f:
        papers = json.load(f)

    # Step 2: 建根节点
    papers = [paper for paper in papers if "title" in paper and "publication_date" in paper and "references" in paper]
    papers.sort(key=lambda x: datetime.strptime(x["publication_date"], "%Y-%m-%d"), reverse=True)
    root_paper = papers[root_num]
    root_node = Node(root_paper["title"], data=root_paper)
    
    # 全局计数器
    progress_counter = 0
    total_papers = len(papers) - 1  # 总任务量（去掉根节点）


    # Step 3: 构建追溯树
    def add_children(parent_node, remaining_papers, current_layer):
        nonlocal progress_counter  # 使用非局部变量更新计数器
        
        if current_layer >= layers or not remaining_papers:
            return

        parent_paper = parent_node.data
        parent_title = parent_paper["title"]
        parent_date = datetime.strptime(parent_paper["publication_date"], "%Y-%m-%d")

        # 遍历剩余论文的副本
        for paper in remaining_papers:
            paper_date = datetime.strptime(paper["publication_date"], "%Y-%m-%d")
            paper_title = paper["title"]

            # 检查是否满足条件：时间更早且主题相关
            if (paper_date < parent_date and 
                judge_llm.is_related(parent_paper["abstract"], paper["abstract"])):
                
                # 创建子节点
                child_node = Node(paper_title, parent=parent_node, data=paper)
                
                # 更新计数器和进度条
                progress_counter += 1
                print(f"{progress_counter}/{total_papers} papers processed.")
                
                # 从 remaining_papers 中移除已添加的论文
                remaining_papers.remove(paper)
                
                # 递归添加子节点
                add_children(child_node, remaining_papers, current_layer + 1)

    # Step 4: 开始构建树
    papers.remove(root_paper) 
    remaining_papers = papers # 剩余论文（去掉根节点）
    add_children(root_node, remaining_papers, current_layer=1)

    # Step 5: 返回根节点
    return root_node

### 2.2 基于引用关系建树

In [None]:
# pip install fuzzywuzzy

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting fuzzywuzzy
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/43/ff/74f23998ad2f93b945c0309f825be92e04e0348e062026998b5eefef4c33/fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0
Note: you may need to restart the kernel to use updated packages.


In [124]:
from anytree import Node, RenderTree
from tqdm import tqdm
import json
from datetime import datetime
from fuzzywuzzy import fuzz
from collections import deque

def build_refer_trace_tree(info_file_path, layers, root_num=None):
    """
    构建论文的追溯树。
    
    :param info_file_path: JSON 文件路径，包含论文信息
    :param layers: 树的最大层数
    :param root_num: 根节点的序号，默认为 None，越小说明文章越新
    :return: 根节点（anytree.Node）
    """
    # Step 1: 读取 JSON 文件，将论文信息读入列表
    with open(info_file_path, 'r', encoding='utf-8') as f:
        papers = json.load(f)

    # Step 2: 建根节点
    papers = [paper for paper in papers if "title" in paper and "publication_date" in paper and "references" in paper]
    papers.sort(key=lambda x: datetime.strptime(x["publication_date"], "%Y-%m-%d"), reverse=True)
    root_paper = papers[root_num]
    root_node = Node(root_paper["title"], data=root_paper)

    # 全局计数器
    progress_counter = 0
    total_papers = len(papers) - 1  # 总任务量（去掉根节点）

    # 辅助函数，检查引用关系
    def is_refered(paper1, paper2):
        if "references" not in paper1 or not paper1["references"]:
            return False
        for ref_title in paper1["references"]:
            if fuzz.ratio(ref_title.lower(), paper2["title"].lower()) >= 90:
                return True
        return False
    
    # Step 3: 构建追溯树
    # def add_children(parent_node, remaining_papers, current_layer):
    #     nonlocal progress_counter  # 使用非局部变量更新计数器

    #     if current_layer > layers or not remaining_papers:
    #         return

    #     parent_paper = parent_node.data
    #     parent_title = parent_paper["title"]

    #     # 遍历剩余论文
    #     for paper in remaining_papers:  
    #         paper_title = paper["title"]

    #         # 检查是否满足条件：有引用关系
    #         if is_refered(parent_paper, paper):
    #             # 创建子节点
    #             child_node = Node(paper_title, parent=parent_node, data=paper)
                
    #             # 更新计数器和进度条
    #             progress_counter += 1
    #             print(f"{progress_counter}/{total_papers} papers processed.")
                
    #             # 从剩余论文中移除已添加的论文
    #             remaining_papers.remove(paper)
                
    #             # 递归添加子节点
    #             add_children(child_node, remaining_papers, current_layer + 1)


    # Step 3: 构建追溯树（广度优先搜索）
    def add_children(root_node, remaining_papers, layers):
        # 初始化计数器和队列
        progress_counter = 0
        total_papers = len(remaining_papers)
        queue = deque([(root_node, 1)])  # 队列元素为 (当前节点, 当前层数)

        while queue:
            parent_node, current_layer = queue.popleft()  # 取出队列头部的节点
            if current_layer > layers:  # 如果超过最大层数，停止扩展
                continue

            parent_paper = parent_node.data
            parent_title = parent_paper["title"]

            # 遍历剩余论文
            for paper in remaining_papers:
                paper_title = paper["title"]

                # 检查是否满足条件：有引用关系
                if is_refered(parent_paper, paper):
                    # 创建子节点
                    child_node = Node(paper_title, parent=parent_node, data=paper)

                    # 更新计数器并打印进度
                    progress_counter += 1
                    print(f"{progress_counter}/{total_papers} papers processed.")

                    # 从剩余论文中移除已添加的论文
                    remaining_papers.remove(paper)

                    # 将子节点加入队列，准备后续扩展
                    queue.append((child_node, current_layer + 1))

    # Step 4: 开始构建树
    papers.remove(root_paper)
    remaining_papers = papers
    # print(remaining_papers)
    add_children(root_node, remaining_papers, layers)

    # Step 5: 返回根节点
    return root_node

In [None]:
import json
from anytree import Node, RenderTree
from datetime import datetime

# 根据论文tag进行建树
def build_tag_trace_tree(info_file_path, layers, root_index=None):
    """
    构建论文的追溯树。
    
    :param info_file_path: JSON 文件路径，包含论文信息
    :param layers: 树的最大层数
    :param root_index: 根节点的索引，默认为 None
    :return: 根节点（anytree.Node）
    """
    # Step 1: 读取 JSON 文件，将论文信息读入列表
    with open(info_file_path, 'r', encoding='utf-8') as f:
        papers = json.load(f)

    # Step 2: 建根节点
    if root_index is not None:
        root_paper = papers[root_index]
    else:
        # 过滤掉缺少必要字段的论文
        papers = [paper for paper in papers if "title" in paper and "publication_date" in paper and "references" in paper]
        papers.sort(key=lambda x: datetime.strptime(x["publication_date"], "%Y-%m-%d"), reverse=True)
        root_paper = papers[0]  
        
    root_node = Node(root_paper["title"], data=root_paper)

    # 辅助函数，检查引用关系
    def has_same_tag(paper1, paper2):
         # 确保两篇论文都有 tags 字段
        if "tags" not in paper1 or "tags" not in paper2:
            return False

        # 将 tags 转换为集合
        tags1 = set(paper1["tags"])
        tags2 = set(paper2["tags"])

        # 求交集并判断是否非空
        common_tags = tags1 & tags2  # 或者使用 tags1.intersection(tags2)
        return bool(common_tags)  # 如果交集非空，返回 True；否则返回 False
    
    # Step 3: 构建追溯树（广度优先搜索）
    def add_children(root_node, remaining_papers, layers):
        # 初始化计数器和队列
        progress_counter = 0
        total_papers = len(remaining_papers)
        queue = deque([(root_node, 1)])  # 队列元素为 (当前节点, 当前层数)

        while queue:
            parent_node, current_layer = queue.popleft()  # 取出队列头部的节点
            if current_layer > layers:  # 如果超过最大层数，停止扩展
                continue

            parent_paper = parent_node.data
            parent_title = parent_paper["title"]
            parent_date = datetime.strptime(parent_paper["publication_date"], "%Y-%m-%d")

            # 遍历剩余论文
            for paper in remaining_papers: 
                paper_title = paper["title"]
                paper_date = datetime.strptime(paper["publication_date"], "%Y-%m-%d")

                # 检查是否满足条件：有引用关系
                if has_same_tag(parent_paper, paper) and parent_date > paper_date:
                    # 创建子节点
                    child_node = Node(paper_title, parent=parent_node, data=paper)

                    # 更新计数器并打印进度
                    progress_counter += 1
                    print(f"{progress_counter}/{total_papers} papers processed.")

                    # 从剩余论文中移除已添加的论文
                    remaining_papers.remove(paper)

                    # 将子节点加入队列，准备后续扩展
                    queue.append((child_node, current_layer + 1))

    # Step 4: 开始构建树
    papers.remove(root_paper) 
    remaining_papers = papers # 剩余论文（去掉根节点）
    add_children(root_node, remaining_papers, layers)

    # Step 5: 返回根节点
    return root_node

### 执行

In [125]:
# 示例调用

info_file_path = "output/paper_infos5.json"  # 替换为你的 JSON 文件路径
layers = 10  # 树的最大层数
paper_num=1


# 构建追溯树
# root = build_relate_trace_tree(info_file_path, layers,paper_num)
# root = build_refer_trace_tree(info_file_path, layers,paper_num)
root = build_tag_trace_tree(info_file_path, layers,paper_num)

# 打印树结构
print_tree(root)

1/354 papers processed.
2/354 papers processed.
3/354 papers processed.
4/354 papers processed.
5/354 papers processed.
6/354 papers processed.
7/354 papers processed.
8/354 papers processed.
9/354 papers processed.
10/354 papers processed.
11/354 papers processed.
12/354 papers processed.
13/354 papers processed.
14/354 papers processed.
15/354 papers processed.
16/354 papers processed.
17/354 papers processed.
18/354 papers processed.
19/354 papers processed.
20/354 papers processed.
21/354 papers processed.
22/354 papers processed.
23/354 papers processed.
24/354 papers processed.
25/354 papers processed.
26/354 papers processed.
27/354 papers processed.
28/354 papers processed.
29/354 papers processed.
30/354 papers processed.
31/354 papers processed.
32/354 papers processed.
33/354 papers processed.
34/354 papers processed.
35/354 papers processed.
36/354 papers processed.
37/354 papers processed.
38/354 papers processed.
39/354 papers processed.
40/354 papers processed.
41/354 pa

### 后续处理：将建好的树存入json文件、从json文件读取并打印树
将树结构存储到json文件中

In [43]:
import json
from anytree import NodeMixin

def tree_to_dict(node):
    """
    将 anytree 节点及其子节点递归转换为字典。
    
    参数:
        node: anytree 节点。
        
    返回:
        表示树结构的字典。
    """
    return {
        "name": node.name,
        "data": node.data,  # 假设每个节点有 data 属性
        "children": [tree_to_dict(child) for child in node.children]
    }

def save_tree_to_json(root, file_path):
    """
    将 anytree 树保存为 JSON 文件。
    
    参数:
        root: 树的根节点。
        file_path: 输出 JSON 文件的路径。
    """
    tree_dict = tree_to_dict(root)
    with open(file_path, 'w', encoding='utf-8') as f:
        json.dump(tree_dict, f, ensure_ascii=False, indent=4)


In [126]:
# 示例调用
save_file_path = "output/tree_structure6.json"
save_tree_to_json(root, save_file_path)
print(f"树结构已保存到 {save_file_path}")

树结构已保存到 output/tree_structure6.json


将json文件中的树重建并打印

In [45]:
import json
from anytree import Node, RenderTree

def dict_to_tree(tree_dict, parent=None):
    """
    将字典形式的树结构递归转换为 anytree 节点。
    
    参数:
        tree_dict: 表示树结构的字典。
        parent: 当前节点的父节点（用于递归构建子节点）。
        
    返回:
        构建完成的 anytree 节点。
    """
    # 创建当前节点
    current_node = Node(tree_dict["name"], parent=parent, data=tree_dict.get("data"))
    
    # 递归构建子节点
    for child_dict in tree_dict.get("children", []):
        dict_to_tree(child_dict, parent=current_node)
    
    return current_node

def load_tree_from_json(file_path):
    """
    从 JSON 文件中加载树结构并返回根节点。
    
    参数:
        file_path: 输入 JSON 文件的路径。
        
    返回:
        树的根节点。
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        tree_dict = json.load(f)
    
    # 从字典重建树
    root = dict_to_tree(tree_dict)
    return root

def print_tree(root):
    """
    打印树的层次结构。
    
    参数:
        root: 树的根节点。
    """
    for pre, _, node in RenderTree(root):
        print(f"{pre}{node.name}")


In [127]:
# 加载树结构
root = load_tree_from_json(save_file_path)

# 打印树结构
print("树结构如下：")
print_tree(root)

树结构如下：
A Causal Explainable Guardrails for Large Language Models
├── Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models
│   ├── A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares
│   ├── AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs
│   ├── An Analysis of Recent Advances in Deepfake Image Detection in an Evolving Threat Landscape
│   ├── Applying Pre-trained Multilingual BERT in Embeddings for Improved Malicious Prompt Injection Attacks Detection
│   ├── Automatic Jailbreaking of the Text-to-Image Generative AI Systems
│   ├── Can Large Language Models Automatically Jailbreak GPT-4V?
│   ├── Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent
│   ├── Chain of Attack: a Semantic-Driven Contextual Multi-Turn attacker for LLM
│   ├── Continuous Embedding Attacks via Clipped Inputs in Jailbreaking Large Language Models
│   ├