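# How to construct knowledge graphs
This guide covers the basic approach to constructing a knowledge graph from unstructured text. The resulting graph can then serve as a knowledge base in a [RAG](/docs/concepts/rag/) application.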
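## ⚠️ Security note ⚠️
Constructing a knowledge graph requires write access to the database, which carries inherent risk. Verify and validate the data before importing it. For more on general security best practices, [see here](/docs/security).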
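
One way to act on the advice above is to review a batch of extracted data before anything is written. The following is a minimal sketch, assuming the extracted nodes and relationships have already been flattened into plain tuples; the `review_batch` helper, the tuple layout, and the expected type list are hypothetical and not part of any library.

```python
from collections import Counter

# Hypothetical extracted data, flattened into plain tuples for illustration.
nodes = [
    ("Marie Curie", "Person"),
    ("Pierre Curie", "Person"),
    ("University Of Paris", "Organization"),
]
relationships = [
    ("Marie Curie", "MARRIED", "Pierre Curie"),
    ("Marie Curie", "PROFESSOR", "University Of Paris"),
]

# Assumption: the node types we are prepared to see in the database.
EXPECTED_NODE_TYPES = {"Person", "Organization", "Country"}


def review_batch(nodes, relationships, expected_node_types):
    """Summarize a batch so a person (or a check) can approve it before import."""
    node_counts = Counter(node_type for _, node_type in nodes)
    rel_counts = Counter(rel_type for _, rel_type, _ in relationships)
    unexpected = sorted(set(node_counts) - set(expected_node_types))
    return {
        "node_counts": dict(node_counts),
        "relationship_counts": dict(rel_counts),
        "unexpected_node_types": unexpected,
    }


report = review_batch(nodes, relationships, EXPECTED_NODE_TYPES)
# Only proceed to the write step when nothing unexpected turned up.
safe_to_import = not report["unexpected_node_types"]
print(report, safe_to_import)
```

A check like this does not replace database-level permissions; it only makes the contents of a write visible before it happens.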

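## Architecture
At a high level, constructing a knowledge graph from text involves two steps:
1. **Extracting structured information from text**: A model extracts structured graph information from the text.
2. **Storing into a graph database**: The extracted structured graph information is stored in a graph database, where it can support downstream RAG applications.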
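
The two steps above can be sketched without any external service. In this toy version, a hard-coded string pattern stands in for the model and a pair of Python sets stands in for the graph database; `Triple`, `extract_triples`, and `InMemoryGraph` are hypothetical names used only to show the shape of the pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    source: str
    relation: str
    target: str


def extract_triples(text):
    """Step 1 (toy stand-in for the model): match one hard-coded pattern."""
    triples = []
    for sentence in text.split("."):
        sentence = sentence.strip()
        if " works at " in sentence:
            person, org = sentence.split(" works at ", 1)
            triples.append(Triple(person, "WORKS_AT", org))
    return triples


class InMemoryGraph:
    """Step 2 (toy stand-in for a graph database): keep nodes and edges in sets."""

    def __init__(self):
        self.nodes = set()
        self.edges = set()

    def add_triples(self, triples):
        for triple in triples:
            self.nodes.update((triple.source, triple.target))
            self.edges.add(triple)


store = InMemoryGraph()
store.add_triples(extract_triples("Ada works at Acme. Bob works at Acme."))
print(sorted(store.nodes))
print(len(store.edges))
```

Using sets means that an entity mentioned in several sentences ends up as a single node, which is the same deduplication a graph database performs when nodes are merged on their identifier.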
## Setup
First, install the required packages and set the environment variables. This example uses a Neo4j graph database.

In [1]:
%pip install --upgrade --quiet  langchain langchain-neo4j langchain-openai langchain-experimental neo4j


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


This guide uses OpenAI models by default.

In [2]:
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()

# Uncomment the below to use LangSmith. Not required.
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass()
# os.environ["LANGSMITH_TRACING"] = "true"



Next, define the Neo4j credentials and connection. Follow [these installation steps](https://neo4j.com/docs/operations-manual/current/installation/) to set up a Neo4j database.

In [3]:
import os

from langchain_neo4j import Neo4jGraph

os.environ["NEO4J_URI"] = "bolt://localhost:7687"
os.environ["NEO4J_USERNAME"] = "neo4j"
os.environ["NEO4J_PASSWORD"] = "password"

graph = Neo4jGraph(refresh_schema=False)
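
Because the connection above is configured through environment variables, a missing variable tends to surface only as a connection error. The following sketch checks for the three variables used in this guide before a connection is attempted; the `missing_neo4j_settings` helper is hypothetical, and the example dictionary stands in for a real environment.

```python
import os

# The three variables set in the cell above.
REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")


def missing_neo4j_settings(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# A plain dict stands in for os.environ so the check can be shown in isolation.
example_env = {"NEO4J_URI": "bolt://localhost:7687", "NEO4J_USERNAME": "neo4j"}
missing = missing_neo4j_settings(example_env)
print(missing)
```

Calling `missing_neo4j_settings()` with no argument inspects the real process environment.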

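## LLM Graph Transformer
Extracting graph data from text turns unstructured information into a structured format, which makes complex relationships and patterns easier to explore and navigate. `LLMGraphTransformer` converts text documents into structured graph documents by using an LLM to parse and categorize entities and their relationships. The choice of LLM significantly affects the output, because it determines the accuracy and nuance of the extracted graph data.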

In [4]:
import os

from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0, model_name="gpt-4-turbo")

llm_transformer = LLMGraphTransformer(llm=llm)

Now we can pass in example text and examine the results.

In [5]:
from langchain_core.documents import Document

text = """
Marie Curie, born in 1867, was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity.
She was the first woman to win a Nobel Prize, the first person to win a Nobel Prize twice, and the only person to win a Nobel Prize in two scientific fields.
Her husband, Pierre Curie, was a co-winner of her first Nobel Prize, making them the first-ever married couple to win the Nobel Prize and launching the Curie family legacy of five Nobel Prizes.
She was, in 1906, the first woman to become a professor at the University of Paris.
"""
documents = [Document(page_content=text)]
graph_documents = llm_transformer.convert_to_graph_documents(documents)
print(f"Nodes:{graph_documents[0].nodes}")
print(f"Relationships:{graph_documents[0].relationships}")

Nodes:[Node(id='Marie Curie', type='Person', properties={}), Node(id='Pierre Curie', type='Person', properties={}), Node(id='University Of Paris', type='Organization', properties={})]
Relationships:[Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Pierre Curie', type='Person', properties={}), type='MARRIED', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='University Of Paris', type='Organization', properties={}), type='PROFESSOR', properties={})]


The following image shows the structure of the generated knowledge graph.
![graph_construction1.png](../../static/img/graph_construction1.png)
Note that graph construction is non-deterministic because it relies on an LLM, so each run may produce slightly different results.
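
To get a feel for how much two runs differ, one option is to reduce each run to a set of `(source, type, target)` triples and compare the sets. The sketch below assumes that reduction has already been done; `run_a` mirrors the output shown earlier, while `run_b` is invented purely for illustration.

```python
# Triples from the run shown above.
run_a = {
    ("Marie Curie", "MARRIED", "Pierre Curie"),
    ("Marie Curie", "PROFESSOR", "University Of Paris"),
}
# Invented second run, assuming the model picked a different relationship name.
run_b = {
    ("Marie Curie", "SPOUSE", "Pierre Curie"),
    ("Marie Curie", "PROFESSOR", "University Of Paris"),
}

stable = run_a & run_b      # triples both runs agree on
only_in_a = run_a - run_b   # triples unique to the first run
only_in_b = run_b - run_a   # triples unique to the second run

# Jaccard overlap: shared triples divided by all distinct triples.
overlap = len(stable) / len(run_a | run_b)
print(stable, only_in_a, only_in_b, round(overlap, 2))
```

A low overlap driven mostly by relationship naming, as in this invented case, is one signal that constraining the allowed types could help.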
You can also restrict extraction to specific node and relationship types to suit your requirements.

In [6]:
llm_transformer_filtered = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Person", "Country", "Organization"],
    allowed_relationships=["NATIONALITY", "LOCATED_IN", "WORKED_AT", "SPOUSE"],
)
graph_documents_filtered = llm_transformer_filtered.convert_to_graph_documents(
    documents
)
print(f"Nodes:{graph_documents_filtered[0].nodes}")
print(f"Relationships:{graph_documents_filtered[0].relationships}")

Nodes:[Node(id='Marie Curie', type='Person', properties={}), Node(id='Pierre Curie', type='Person', properties={}), Node(id='University Of Paris', type='Organization', properties={})]
Relationships:[Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Pierre Curie', type='Person', properties={}), type='SPOUSE', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='University Of Paris', type='Organization', properties={}), type='WORKED_AT', properties={})]
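
The effect of the two allowlists can be pictured as a filter over extracted data. The sketch below is an illustration of that effect only, not the transformer's implementation; `filter_graph` and the sample data (including the `Concept` node and `RESEARCHED` relationship) are hypothetical.

```python
def filter_graph(nodes, relationships, allowed_nodes, allowed_relationships):
    """Keep nodes of allowed types, and relationships of allowed types
    whose endpoints both survived the node filter."""
    kept_nodes = {
        node_id: node_type
        for node_id, node_type in nodes.items()
        if node_type in allowed_nodes
    }
    kept_relationships = [
        (source, rel_type, target)
        for source, rel_type, target in relationships
        if rel_type in allowed_relationships
        and source in kept_nodes
        and target in kept_nodes
    ]
    return kept_nodes, kept_relationships


# Hypothetical unfiltered extraction: node id -> node type.
nodes = {
    "Marie Curie": "Person",
    "Pierre Curie": "Person",
    "Radioactivity": "Concept",
}
relationships = [
    ("Marie Curie", "SPOUSE", "Pierre Curie"),
    ("Marie Curie", "RESEARCHED", "Radioactivity"),
]

kept_nodes, kept_relationships = filter_graph(
    nodes,
    relationships,
    allowed_nodes={"Person", "Country", "Organization"},
    allowed_relationships={"NATIONALITY", "LOCATED_IN", "WORKED_AT", "SPOUSE"},
)
print(kept_nodes)
print(kept_relationships)
```

Dropping a relationship when either endpoint is filtered out avoids edges that point at nodes which no longer exist.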


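To define the graph schema more precisely, consider specifying relationships as three-element tuples. Each tuple consists of a source node type, a relationship type, and a target node type.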

In [7]:
allowed_relationships = [
    ("Person", "SPOUSE", "Person"),
    ("Person", "NATIONALITY", "Country"),
    ("Person", "WORKED_AT", "Organization"),
]

llm_transformer_tuple = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Person", "Country", "Organization"],
    allowed_relationships=allowed_relationships,
)
graph_documents_filtered = llm_transformer_tuple.convert_to_graph_documents(documents)
print(f"Nodes:{graph_documents_filtered[0].nodes}")
print(f"Relationships:{graph_documents_filtered[0].relationships}")

Nodes:[Node(id='Marie Curie', type='Person', properties={}), Node(id='Pierre Curie', type='Person', properties={}), Node(id='University Of Paris', type='Organization', properties={})]
Relationships:[Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Pierre Curie', type='Person', properties={}), type='SPOUSE', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='University Of Paris', type='Organization', properties={}), type='WORKED_AT', properties={})]
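
What the tuple form adds over a flat list of relationship names is a constraint on the endpoint types. The following sketch shows the difference with a simple membership check; the `conforms` helper is hypothetical and does not reflect how the transformer enforces the schema internally.

```python
# Same schema as in the cell above.
ALLOWED_TRIPLES = {
    ("Person", "SPOUSE", "Person"),
    ("Person", "NATIONALITY", "Country"),
    ("Person", "WORKED_AT", "Organization"),
}
# The flat list only knows relationship names.
ALLOWED_NAMES = {rel_type for _, rel_type, _ in ALLOWED_TRIPLES}


def conforms(source_type, rel_type, target_type, schema=ALLOWED_TRIPLES):
    """True when the full (source type, relationship, target type) is in the schema."""
    return (source_type, rel_type, target_type) in schema


# A valid edge passes both checks.
ok = conforms("Person", "WORKED_AT", "Organization")
# An Organization "married" to a Person passes the name check but fails the triple check.
name_only = "SPOUSE" in ALLOWED_NAMES
full_check = conforms("Organization", "SPOUSE", "Person")
print(ok, name_only, full_check)
```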


We can visualize the resulting graph again to understand it better.
![graph_construction2.png](../../static/img/graph_construction2.png)

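The `node_properties` parameter enables extraction of node properties, which produces a more detailed graph. When set to `True`, the LLM autonomously identifies and extracts relevant node properties. When `node_properties` is instead defined as a list of strings, the LLM extracts only the specified properties from the text.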

In [8]:
llm_transformer_props = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Person", "Country", "Organization"],
    allowed_relationships=["NATIONALITY", "LOCATED_IN", "WORKED_AT", "SPOUSE"],
    node_properties=["born_year"],
)
graph_documents_props = llm_transformer_props.convert_to_graph_documents(documents)
print(f"Nodes:{graph_documents_props[0].nodes}")
print(f"Relationships:{graph_documents_props[0].relationships}")

Nodes:[Node(id='Marie Curie', type='Person', properties={'born_year': '1867'}), Node(id='Pierre Curie', type='Person', properties={}), Node(id='University Of Paris', type='Organization', properties={}), Node(id='Poland', type='Country', properties={}), Node(id='France', type='Country', properties={})]
Relationships:[Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Poland', type='Country', properties={}), type='NATIONALITY', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='France', type='Country', properties={}), type='NATIONALITY', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Pierre Curie', type='Person', properties={}), type='SPOUSE', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='University Of Paris', type='Organization', properties={}), type='WORKED_AT', properties={})]
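
In the output above, `born_year` comes back as the string `'1867'`, and nodes without the property simply have an empty `properties` dictionary. If downstream code needs a number, a small conversion step can handle both cases. This sketch uses plain dictionaries in place of node objects, and the `born_year` helper is hypothetical.

```python
# Plain-dict stand-ins for two of the nodes shown above.
nodes = [
    {"id": "Marie Curie", "type": "Person", "properties": {"born_year": "1867"}},
    {"id": "Pierre Curie", "type": "Person", "properties": {}},
]


def born_year(node):
    """Return born_year as an int, or None when it is missing or not numeric."""
    raw = node["properties"].get("born_year")
    if raw is None:
        return None
    try:
        return int(raw)
    except ValueError:
        return None


years = {node["id"]: born_year(node) for node in nodes}
print(years)
```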

## Storing to graph database
The generated graph documents can be stored in a graph database using the `add_graph_documents` method.

In [9]:
graph.add_graph_documents(graph_documents_props)

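Most graph databases support indexes to optimize data import and retrieval. Since we might not know all the node labels in advance, we can handle this by adding a secondary base label to each node with the `baseEntityLabel` parameter.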

In [10]:
graph.add_graph_documents(graph_documents, baseEntityLabel=True)

The result will look like this:
![graph_construction3.png](../../static/img/graph_construction3.png)
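
To see why a shared base label helps with indexing, consider how a node might be merged with and without one. The sketch below only builds Cypher strings for comparison; the label name `__Entity__`, the `merge_statement` helper, and the query text are assumptions for illustration, and the queries the library actually issues may differ.

```python
# Assumed name of the shared base label.
BASE_LABEL = "__Entity__"


def merge_statement(node_type, use_base_label):
    """Build an illustrative Cypher MERGE for one node, identified by $id."""
    # Double any backticks so a model-generated type cannot break out of the label.
    safe_type = node_type.replace("`", "``")
    if use_base_label:
        # Match on the shared label, so one index on (__Entity__.id) covers every node.
        return f"MERGE (n:`{BASE_LABEL}` {{id: $id}}) SET n:`{safe_type}`"
    # Without a base label, each node type would need its own index.
    return f"MERGE (n:`{safe_type}` {{id: $id}})"


with_base = merge_statement("Person", use_base_label=True)
without_base = merge_statement("Person", use_base_label=False)
print(with_base)
print(without_base)
```

With the base label, lookups by `id` go through a single label regardless of which types the model happens to produce.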
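The final option is to also import the source documents from which the nodes and relationships were extracted. This lets us track which documents each entity appears in.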

In [11]:
graph.add_graph_documents(graph_documents, include_source=True)

The graph will have the following structure:
![graph_construction4.png](../../static/img/graph_construction4.png)

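In this visualization, the source document is highlighted in blue, and all entities extracted from it are connected to it by `MENTIONS` relationships.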
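
The `MENTIONS` structure amounts to a set of (document, entity) pairs, which can be inverted to answer "which documents mention this entity?". The following in-memory sketch shows that lookup; the document ids, the pair list, and the `documents_by_entity` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical MENTIONS edges as (document id, entity id) pairs.
mentions = [
    ("doc-1", "Marie Curie"),
    ("doc-1", "Pierre Curie"),
    ("doc-2", "Marie Curie"),
]


def documents_by_entity(mentions):
    """Invert the MENTIONS pairs into entity id -> set of document ids."""
    index = defaultdict(set)
    for doc_id, entity_id in mentions:
        index[entity_id].add(doc_id)
    return dict(index)


index = documents_by_entity(mentions)
print(sorted(index["Marie Curie"]))
```

In a graph database the same question is answered by traversing `MENTIONS` relationships from the entity back to its source documents.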