In [1]:
import logging
from setting.db import SessionLocal
from llm_inference.base import LLMInterface
from graph.graph_knowledge_base import GraphKnowledgeBase, SearchAction

logging.basicConfig(level=logging.INFO)

llm_client = LLMInterface("openai", "o3-mini")

gkb = GraphKnowledgeBase(llm_client, "entities_150001", "relationships_150001", "chunks_150001")
session = SessionLocal()

In [2]:
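# English gloss of the query below: "In TiDB, if a node fails (goes down) and the
# node's instance still exists, will the number of down-peers decrease after all
# replicas of the failed node's instance have been migrated? Please explain in detail
# TiDB's replica migration mechanism and how the down-peer count changes."
query = "在 TiDB 中, 如果某个节点发生故障 (down机), 并且该节点的实例一直存在, 那么在故障节点的实例副本全部迁移完成后, down-peer 的数量会减少吗？请详细说明 TiDB 的副本迁移机制和 down-peer 数量变化的过程。"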
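# These look like Ollama-style generation options and are kept only for reference:
# the assignment after this dict resets model_kwargs to an empty dict.
model_kwargs = {
    "options": {
        "num_ctx": 8092,
        "num_gpu": 80,
        "num_predict": 10000,
        "temperature": 0.1,
    }
}
model_kwargs = {}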

In [None]:
gkb.retrieve_documents(session, "TiDB fault tolerance behavior during node failure")

In [3]:
from graph.query_analyzer import DeepUnderstandingAnalyzer

analyzer = DeepUnderstandingAnalyzer(llm_client)
analysis_res = analyzer.perform(query)
print(analysis_res)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Analysis Result:
Reasoning: At the most basic level, the user's question revolves around understanding the relationship between a node failure and the subsequent behavior of TiDB’s internal metrics—specifically, the number of down-peers—after the system has automatically migrated the data replicas from the failed node. Breaking it down from first principles, we first recognize that a 'down-peer' in TiDB is essentially a replica that is detected as unavailable or unresponsive. The process of replica migration is a core self-healing mechanism in distributed systems: once a node becomes unresponsive (even if its corresponding instance still exists), TiDB is designed to move its data replicas to other healthy nodes to preserve data redundancy and availability. This setup indicates a cause-and-effect relationship, where the failure triggers a recovery mechanism, which in turn should eventually adjust the system metrics (like the down-peer count). Essentially, the fundamental inquiry is abou

In [4]:
action_history = []
current_findings = []
docs = {}

next_actions = [SearchAction(
    tool="retrieve_documents",
    query=a
) for a in analysis_res.initial_queries]

reasoning = analysis_res.reasoning
queries = analysis_res.initial_queries

In [9]:

knowledge_retrieved = {}
for action in next_actions:
    print(action)
    if action.tool == 'retrieve_knowledge':
        data = gkb.retrieve_graph_data(session, action.query)
    elif action.tool == 'retrieve_neighbors':
        data = gkb.retrieve_neighbors(session, action.entity_ids, action.query)
    elif action.tool == 'retrieve_documents':
        data = gkb.retrieve_documents(session, action.query, 30)
    else:
        raise ValueError(f"Invalid tool: {action.tool}")

    for doc_id, doc in data.documents.items():
        if doc_id not in knowledge_retrieved:
            knowledge_retrieved[doc_id] = doc
        
        for chunk_id, chunk in doc.chunks.items():
            if chunk_id not in knowledge_retrieved[doc_id].chunks:
                knowledge_retrieved[doc_id].chunks[chunk_id] = chunk
                continue

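            existing_chunk = knowledge_retrieved[doc_id].chunks[chunk_id]
            # Stored relationships may be objects (first insert) or dicts (after an
            # earlier merge), so normalize them to dicts before indexing by id.
            rel_dict = {}
            for r in existing_chunk.relationships:
                r = r if isinstance(r, dict) else r.to_dict()
                rel_dict[r['id']] = r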
            for relationship in chunk.relationships:
                rel_id = relationship.id
                if rel_id in rel_dict:
                    rel_dict[rel_id]['similarity_score'] = max(
                        rel_dict[rel_id]['similarity_score'],
                        relationship.similarity_score
                    )
                else:
                    rel_dict[rel_id] = relationship.to_dict()

            knowledge_retrieved[doc_id].chunks[chunk_id].relationships = list(rel_dict.values())

    # Record every executed action; outside the loop only the last one was kept.
    action_history.append(action)

knowledge_retrieved

SearchAction(tool=retrieve_documents, query=TiDB node failure replica migration process)


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:graph.chunk_filter:Filter chunks ['7d9ec6766af442c4a9caf13707ad492a', '34cf7fad4a7747aeb1dba70f006b58df', '60edc9f98e204e9ab2f937c600f9b9ca', '3f95c501d40049cb833bf2ddc19944e8', 'c47725fec0f04f14bce35d3ed8e66720']
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:graph.chunk_filter:Filter Eval Result {'chunk_id': '7d9ec6766af442c4a9caf13707ad492a', 'is_relevant': False, 'confidence': 0.9, 'reasoning': 'This chunk describes migrating data from one TiDB cluster to another and the steps for full data migration via backup/restore. It does not mention node failure or replica migration processes, which is the core of the query.'}
INFO:graph.chunk_filter:Filter Eval Result {'chunk_id': '34cf7fad4a7747aeb1dba70f006b58df', 'is_relevant': False, 'confidence': 0.9, 'reasoning': 'This chunk focuses on handling failed DDL statements during TiDB data migration. It d

SearchAction(tool=retrieve_documents, query=TiDB down-peer metric behavior after replica migration)


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:graph.chunk_filter:Filter chunks ['1f4539140b64445fa4c40ba0d247c183', '657b346a50934f27aa4f687aca9192dc', 'bfe13c052dfd492594ff07d7985205e7', 'fa5f56759f4c43668c9f18253ed1190a', '60ba80257982426cb7e9d9ea0714b41d']
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:graph.chunk_filter:Filter Eval Result {'chunk_id': '1f4539140b64445fa4c40ba0d247c183', 'is_relevant': False, 'confidence': 0.9, 'reasoning': 'This chunk is a release note for TiDB 3.0.0-rc.2 and its components. It does not mention anything about the down-peer metric nor replica migration.'}
INFO:graph.chunk_filter:Filter Eval Result {'chunk_id': '657b346a50934f27aa4f687aca9192dc', 'is_relevant': False, 'confidence': 0.9, 'reasoning': 'This is the TiDB 7.6.0 release note focusing on various bug fixes, without any mention of down-peer metrics or replica migration behavior.'}
INFO:graph.chunk_fil

SearchAction(tool=retrieve_documents, query=How does TiDB handle node failures and update down-peer counts?)


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:graph.chunk_filter:Filter chunks ['cdfe5a33d5f6418dab8133d3f08b746b', '9c2dcf083bf7478c9eec150497da5c78', '308096fb1d284787bd62233bbf631960', '83a68645994d43dea55b680c0cf32148', '333969ccdc41465ba39a766f4495f06c']
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:graph.chunk_filter:Filter Eval Result {'chunk_id': 'cdfe5a33d5f6418dab8133d3f08b746b', 'is_relevant': False, 'confidence': 0.7, 'reasoning': 'This chunk lists various bug fixes in TiDB 6.5.4, including handling scenarios when a TiFlash node is down, but it does not mention how TiDB updates down-peer counts or details on overall node failure handling relevant to the query.'}
INFO:graph.chunk_filter:Filter Eval Result {'chunk_id': '9c2dcf083bf7478c9eec150497da5c78', 'is_relevant': True, 'confidence': 0.9, 'reasoning': 'This chunk explicitly mentions a fix for the execution of `replace-down-peer`

SearchAction(tool=retrieve_documents, query=TiDB replica migration mechanism and metric normalization)


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:graph.chunk_filter:Filter chunks ['40bde88511df4ce1842260d5b36d43f3', '657b346a50934f27aa4f687aca9192dc', '787ae113a4e64cb1a619498335936bbf', '96c1b86a8eec4d80b563ee56e9519ddf', '6c28efc2b22a46358d851e9ba8a3b3ef']
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:graph.chunk_filter:Filter Eval Result {'chunk_id': '40bde88511df4ce1842260d5b36d43f3', 'is_relevant': False, 'confidence': 0.9, 'reasoning': 'This chunk focuses on bulk-insert techniques, data import, and auto-random primary key handling. It only briefly mentions data migration tools (e.g. TiDB Data Migration) without discussing replica migration mechanisms or metric normalization.'}
INFO:graph.chunk_filter:Filter Eval Result {'chunk_id': '657b346a50934f27aa4f687aca9192dc', 'is_relevant': False, 'confidence': 0.85, 'reasoning': 'This release notes chunk lists bug fixes for various components i

 16482: DocumentData(id=16482, chunks={}, content="---\ntitle: Daily Check for TiDB Data Migration\nsummary: Learn about the daily check of TiDB Data Migration (DM).\n---\n\n# Daily Check for TiDB Data Migration\n\nThis document summarizes how to perform a daily check on TiDB Data Migration (DM).\n\n+ Method 1: Execute the `query-status` command to check the running status of the task and the error output (if any). For details, see [Query Status](/dm/dm-query-status.md).\n\n+ Method 2: If Prometheus and Grafana are correctly deployed when you deploy the DM cluster using TiUP, you can view DM monitoring metrics in Grafana. For example, suppose that the Grafana's address is `172.16.10.71`, go to <http://172.16.10.71:3000>, enter the Grafana dashboard, and select the DM Dashboard to check monitoring metrics of DM. For more information of these metrics, see [DM Monitoring Metrics](/dm/monitor-a-dm-cluster.md).\n\n+ Method 3: Check the running status of DM and the error (if any) using the l
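
The merge step in the loop above (deduplicate relationships by id and keep the highest similarity score) can be checked in isolation. The following is only a minimal sketch: `Relationship` and `merge_relationships` are hypothetical stand-ins written for this example, not classes or functions from the `graph` package, and the only assumption carried over from the notebook is that relationship objects expose `id`, `similarity_score`, and `to_dict()`.

```python
from dataclasses import dataclass


@dataclass
class Relationship:
    """Hypothetical stand-in for a relationship object returned by retrieval."""
    id: int
    similarity_score: float

    def to_dict(self) -> dict:
        return {"id": self.id, "similarity_score": self.similarity_score}


def merge_relationships(existing: list, incoming: list) -> list:
    """Merge two relationship lists by id, keeping the highest similarity score."""
    merged = {}
    for rel in list(existing) + list(incoming):
        # Accept plain dicts as well as objects exposing to_dict(); copy dicts
        # so the caller's data is never mutated.
        rel_dict = dict(rel) if isinstance(rel, dict) else rel.to_dict()
        current = merged.get(rel_dict["id"])
        if current is None:
            merged[rel_dict["id"]] = rel_dict
        else:
            current["similarity_score"] = max(
                current["similarity_score"], rel_dict["similarity_score"]
            )
    return list(merged.values())


first_pass = [Relationship(1, 0.42), Relationship(2, 0.80)]
second_pass = [Relationship(1, 0.91), Relationship(3, 0.10)]

merged = merge_relationships(first_pass, second_pass)
# A later pass can feed the already-merged dicts back in.
merged_again = merge_relationships(merged, [Relationship(2, 0.50)])
print(merged_again)
```

Normalizing both inputs to dicts up front means the function behaves the same whether a chunk is merged for the first time or has already been through an earlier merge.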

In [10]:
len(knowledge_retrieved)

3

In [None]:
from graph.knowledge_synthesizer import KnowledgeSynthesizer

synthesizer = KnowledgeSynthesizer(llm_client)
result = synthesizer.iterative_answer_synthesis(
    query=query,
    documents=knowledge_retrieved,
    reasoning=reasoning
)

# Access the results
final_answer = result["final_answer"]
evolution = result["evolution_history"]

Processing document(15833, https://docs.pingcap.com/tidb/v8.1/release-6.6.0)


In [8]:
print(final_answer)

'在 TiDB 集群中，当某个节点（TiKV 实例所在机器）发生故障，并且该节点仍然存在于集群中时，会出现部分 Region 的副本处于 down 状态（常称 down-peer）。TiDB（主要由 PD 和 TiKV 协同工作实现副本管理）对副本迁移有专门的机制，以确保集群数据的高可用性。下面详细说明这一过程以及 down-peer 数量如何变化：\n\n1. 节点故障检测与标记\n   • PD 会周期性地从各个 TiKV 实例接收心跳信息，当检测到某个 TiKV 节点长时间未反馈（超过配置的超时时间），该节点所属的所有 Region 中对应的副本就会被标记为 down-peer。\n   • 这些 down-peer 指示该副本暂时无法提供服务，从而可能影响该 Region 的数据可用性（例如，如果出现超过半数副本不可用，就会影响 Raft 选举）。\n\n2. 副本自动迁移（Replica Rebalancing/Replacement）\n   • 当 PD 检测到某个节点的副本长时间处于 down 状态后，会发起副本调度任务。调度逻辑会选择健康的 TiKV 节点来复制数据，从而在保持 Region 副本数（例如 3 副本）的同时替换失效副本。\n   • 调度时，PD 会依据集群负载、数据本地性、标签规则等因素选择目标节点，新副本在新节点上被创建后，Raft 协议会进行日志复制，以确保新加入的副本跟上已有副本的数据进度。\n   • 当新副本同步完成且达到一定状态之后，PD 会发出指令，将 down 状态的副本从 Raft 配置中移除，从而完成一次迁移操作。\n\n3. Down-Peer 数量的变化\n   • 在故障节点存在期间，因检测到无法通信，集群监控与 PD 调度模块会持续统计该节点上拥有的 down-peer 数量。\n   • 一旦调度完成，并且 Region 的副本风险得到恢复（即用能够正常通信的新副本替换掉原先 down 的副本后），整个 Region 的健康状态恢复正常，此时该 Region 不再计入 down-peer 数量。\n   • 因此，随着所有故障节点上 down 状态的实例副本成功迁移到健康节点后，监控中所统计的 down-peer 数量就会逐步减少，直至恢复正常（当然前提是故障节点仍然在集群中但其 Region 数据已
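
The claim in the answer above, that the down-peer count falls as each affected Region gets a replacement replica, can be illustrated with a toy model. This is only a sketch under strong simplifying assumptions: it is not PD's scheduler, it ignores Raft log replication, labels, and load, and every name in it (`Region`, `count_down_peers`, `replace_one_down_peer`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A Region is modeled only as the set of store ids holding its replicas."""
    region_id: int
    stores: set


def count_down_peers(regions, down_stores):
    """Count replicas that currently sit on a store considered down."""
    return sum(len(r.stores & down_stores) for r in regions)


def replace_one_down_peer(region, down_stores, healthy_stores):
    """Add a replica on a healthy store first, then drop the down replica."""
    down = sorted(region.stores & down_stores)
    candidates = sorted(healthy_stores - region.stores)
    if not down or not candidates:
        return False
    region.stores.add(candidates[0])
    region.stores.remove(down[0])
    return True


# Three Regions with three replicas each; store 1 is down but still registered.
regions = [Region(1, {1, 2, 3}), Region(2, {1, 3, 4}), Region(3, {2, 3, 4})]
down_stores = {1}
healthy_stores = {2, 3, 4, 5}

history = [count_down_peers(regions, down_stores)]
for region in regions:
    if replace_one_down_peer(region, down_stores, healthy_stores):
        history.append(count_down_peers(regions, down_stores))

print(history)  # [2, 1, 0]
```

In this model the down store never leaves the cluster. The count reaches zero only because no Region still lists a replica on it, which is the scenario the query asks about.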