# Introduction

Consider the following text.

```
Mary had a little lamb,
You’ve heard this tale before;
But did you know she passed her plate,
And had a little more!
```

Here is one possible representation of the text as a knowledge graph (KG).

![图 0](http://image.rarelimiting.com/e8a35f81599f1e43bd12c61da39cfc4bc8673f7ebdf8590bf2566cf52f7e539f.png)  


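If you ask GPT how to create a knowledge graph from a given text, it may suggest a process like the following.

1. Extract concepts and entities from the body of work. These are the nodes.
2. Extract relations between the concepts. These are the edges.
3. Populate the nodes (concepts) and edges (relations) in a graph data structure or a graph database.
4. Visualize, for some artistic gratification if nothing else.

Steps 3 and 4 sound understandable. But how do you achieve steps 1 and 2?

Here is a flow diagram of the method I devised to extract a graph of concepts from any given text corpus. It is similar to the method above, with a few minor differences.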

![图 1](http://image.rarelimiting.com/a4782470be82e2d83fec656acb866e7e8ed945b33d5e6f79b3bad7979dbb0855.png)  

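1. Split the corpus of text into chunks. Assign a chunk_id to each of these chunks.
2. For every text chunk, extract concepts and their semantic relationships using an LLM. Let's assign this relation a weight of W1. There can be multiple relationships between the same pair of concepts. Every such relation is an edge between a pair of concepts.
3. Consider that concepts occurring in the same text chunk are also related by their contextual proximity. Let's assign this relation a weight of W2. Note that the same pair of concepts may occur in multiple chunks.
4. Group similar pairs, sum their weights, and concatenate their relationships. We now have only one edge between any distinct pair of concepts. The edge has a certain weight and a list of relations as its name.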
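The grouping in the last step can be sketched in plain Python. The values W1 = 4 and W2 = 1 match the weights used later in this article; the toy edges and the helper name `merge_edges` are made up for illustration, and unlike the pandas code further down, this sketch de-duplicates relation names.

```python
from collections import defaultdict

W1 = 4  # weight of one LLM-extracted semantic relation
W2 = 1  # weight of one contextual-proximity co-occurrence

# (node_1, node_2, relation, weight) tuples; toy data for illustration only
toy_edges = [
    ("mary", "lamb", "owns", W1),
    ("mary", "lamb", "contextual proximity", W2),
    ("mary", "plate", "passed", W1),
    ("mary", "lamb", "contextual proximity", W2),
]


def merge_edges(edges):
    """Collapse parallel edges: sum their weights and join their relation names."""
    merged = defaultdict(lambda: {"weight": 0, "relations": []})
    for node_1, node_2, relation, weight in edges:
        entry = merged[(node_1, node_2)]
        entry["weight"] += weight
        # Assumption: repeated relation names are listed only once
        if relation not in entry["relations"]:
            entry["relations"].append(relation)
    return {
        pair: (entry["weight"], ",".join(entry["relations"]))
        for pair, entry in merged.items()
    }


merged = merge_edges(toy_edges)
print(merged)
```

After merging, each pair of concepts has exactly one entry, carrying the summed weight and the joined relation names.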

## Load packages

In [1]:
import pandas as pd
import numpy as np
import os
from langchain.document_loaders import PyPDFLoader, UnstructuredPDFLoader, PyPDFium2Loader
from langchain.document_loaders import PyPDFDirectoryLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pathlib import Path
import random

## Input data directory
data_dir = "cureus"
inputdirectory = Path(f"./data_input/{data_dir}")
## This is where the output csv files will be written
out_dir = data_dir
outputdirectory = Path(f"./data_output/{out_dir}")

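## Load documents and split them

Step 1 in the flow diagram above is easy. Langchain provides a plethora of text splitters we can use to split our text into chunks.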

In [2]:
## Dir PDF Loader
# loader = PyPDFDirectoryLoader(inputdirectory)
## File Loader
# loader = PyPDFLoader("./data/MedicalDocuments/orf-path_health-n1.pdf")
loader = DirectoryLoader(inputdirectory, show_progress=True)
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150,
    length_function=len,
    is_separator_regex=False,
)

pages = splitter.split_documents(documents)
print("Number of chunks = ", len(pages))
print(pages[3].page_content)


100%|██████████| 1/1 [00:01<00:00,  1.82s/it]

Number of chunks =  23
An extensive literature search was performed, and 56 articles published in peer-reviewed journals between 2005 and 2021 were selected and analyzed. The corresponding authors' experiential knowledge served as the foundation for the analysis.
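To make `chunk_size` and `chunk_overlap` concrete, here is a sketch of a naive fixed-size splitter. It is not what `RecursiveCharacterTextSplitter` does internally: that class prefers to break on separators such as paragraph and line breaks, so its chunks are usually shorter than `chunk_size`. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive splitter: consecutive chunks share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(toy_chunks)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one of the neighboring chunks, as long as it is shorter than the overlap.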





## Create a dataframe of all the chunks

In [3]:
from helpers.df_helpers import documents2Dataframe
df = documents2Dataframe(pages)
print(df.shape)
df.head()

(23, 3)


| | text | source | chunk_id |
|---|---|---|---|
| 0 | Abstract India’s health indicators have improv... | data_input/cureus/cureus-0015-00000040274.txt | 0f56d8fbefa04f1e877f573938f78ff1 |
| 1 | Categories: Public Health, Epidemiology/Public... | data_input/cureus/cureus-0015-00000040274.txt | 92789b719a254c8385327b9d243935b6 |
| 2 | Introduction And Background India’s health ind... | data_input/cureus/cureus-0015-00000040274.txt | 9eefb3bf352a459c8895f272b632724e |
| 3 | An extensive literature search was performed, ... | data_input/cureus/cureus-0015-00000040274.txt | 7c21bdb708d14855b7b3de9d8564b175 |
| 4 | Review Overview of the public and private heal... | data_input/cureus/cureus-0015-00000040274.txt | bfc37e1213e7428d963fdac63eb80079 |
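The helper `documents2Dataframe` is not shown in this article. Judging only from the output above (a text column, the document metadata, and a 32-character hexadecimal `chunk_id`), it might look roughly like the sketch below. The names `Chunk` and `chunks_to_dataframe` are hypothetical, `Chunk` stands in for a LangChain document, and the use of `uuid4` is an assumption.

```python
import uuid
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class Chunk:
    """Stand-in for a LangChain document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def chunks_to_dataframe(chunks) -> pd.DataFrame:
    """One row per chunk: its text, its metadata, and a random unique id."""
    rows = [
        {"text": c.page_content, **c.metadata, "chunk_id": uuid.uuid4().hex}
        for c in chunks
    ]
    return pd.DataFrame(rows)


toy = [
    Chunk("first chunk", {"source": "a.txt"}),
    Chunk("second chunk", {"source": "a.txt"}),
]
toy_df = chunks_to_dataframe(toy)
print(toy_df.columns.tolist())
```

A random id is enough here because the `chunk_id` only has to tie each extracted edge back to the chunk it came from.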


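## Extract concepts

Step 2 is where the real fun starts. To extract the concepts and their relationships, I am using the Mistral 7B model. Before converging on the variant of the model best suited for our purpose, I experimented with the following: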

```
Mistral Instruct
Mistral OpenOrca, and
Zephyr (Hugging Face version derived from Mistral)
```

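I used the 4-bit quantized versions of these models, so that my Mac doesn't start hating me, hosted locally with Ollama.

These models are all instruction-tuned models with a system prompt and a user prompt. They all do a pretty good job of following the instructions and formatting the answer neatly in JSON if we tell them to.

After a few rounds of trial and error, I finally converged on the Zephyr model with the following prompts.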

```
SYS_PROMPT = (
    "You are a network graph maker who extracts terms and their relations from a given context. "
    "You are provided with a context chunk (delimited by ```) Your task is to extract the ontology "
    "of terms mentioned in the given context. These terms should represent the key concepts as per the context. \n"
    "Thought 1: While traversing through each sentence, Think about the key terms mentioned in it.\n"
        "\tTerms may include object, entity, location, organization, person, \n"
        "\tcondition, acronym, documents, service, concept, etc.\n"
        "\tTerms should be as atomistic as possible\n\n"
    "Thought 2: Think about how these terms can have one on one relation with other terms.\n"
        "\tTerms that are mentioned in the same sentence or the same paragraph are typically related to each other.\n"
        "\tTerms can be related to many other terms\n\n"
    "Thought 3: Find out the relation between each such related pair of terms. \n\n"
    "Format your output as a list of json. Each element of the list contains a pair of terms "
    "and the relation between them, like the following: \n"
    "[\n"
    "   {\n"
    '       "node_1": "A concept from extracted ontology",\n'
    '       "node_2": "A related concept from extracted ontology",\n'
    '       "edge": "relationship between the two concepts, node_1 and node_2 in one or two sentences"\n'
    "   }, {...}\n"
    "]"
)

USER_PROMPT = f"context: ```{input}``` \n\n output: "
```

If we pass our nursery rhyme through this prompt, here is the result.

```json
[
  {
    "node_1": "Mary",
    "node_2": "lamb",
    "edge": "owned by"
  },
  {
    "node_1": "plate",
    "node_2": "food",
    "edge": "contained"
  }, 
  . . .
]
```

Notice that it even guessed "food" as a concept, which was not explicitly mentioned in the text chunk. Isn't that wonderful!
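A model may not always return well-formed JSON, so the response is worth parsing defensively before it goes into a dataframe. The sketch below is a hypothetical helper named `parse_edges`, not the article's `graph2Df`; it keeps only entries that carry all three keys and tags each with the chunk it came from.

```python
import json


def parse_edges(llm_output: str, chunk_id: str) -> list[dict]:
    """Parse the model's JSON answer; return [] if it is malformed."""
    try:
        items = json.loads(llm_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    edges = []
    for item in items:
        # Keep only dict entries that carry all three expected keys
        if isinstance(item, dict) and item.keys() >= {"node_1", "node_2", "edge"}:
            edges.append({**item, "chunk_id": chunk_id})
    return edges


raw = '[{"node_1": "Mary", "node_2": "lamb", "edge": "owned by"}, {"node_1": "plate"}]'
parsed = parse_edges(raw, chunk_id="chunk-0")
print(parsed)
```

The incomplete second entry is dropped rather than raising an error, so a single bad entry does not discard the rest of a chunk's edges.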

In [4]:
## This function uses the helpers/prompt function to extract concepts from text
from helpers.df_helpers import df2Graph
from helpers.df_helpers import graph2Df

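If we run this over every text chunk of our example article and convert the JSON into a Pandas dataframe, we get the dataframe printed by the cell below.

Every row there represents a relation between a pair of concepts. Each row is an edge between two nodes in our graph, and there can be multiple edges or relationships between the same pair of concepts. The count column in that dataframe is the weight, which I arbitrarily set to 4.


If regenerate is set to True, the dataframes are regenerated and both are written to disk as CSV files, so we don't have to compute them again.

- dfg1 = dataframe of edges
- df = dataframe of chunks

Otherwise, the edges dataframe is read from the output directory.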

In [5]:
## To regenerate the graph with the LLM, set this to True
regenerate = False

if regenerate:
    concepts_list = df2Graph(df, model='zephyr:latest')
    dfg1 = graph2Df(concepts_list)
    if not os.path.exists(outputdirectory):
        os.makedirs(outputdirectory)
    
    dfg1.to_csv(outputdirectory/"graph.csv", sep="|", index=False)
    df.to_csv(outputdirectory/"chunks.csv", sep="|", index=False)
else:
    dfg1 = pd.read_csv(outputdirectory/"graph.csv", sep="|")

dfg1.replace("", np.nan, inplace=True)
dfg1.dropna(subset=["node_1", "node_2", 'edge'], inplace=True)
dfg1['count'] = 4 
## Increase the weight of the relation to 4.
## We will assign a weight of 1 when we calculate contextual proximity later.
print(dfg1.shape)
dfg1.head()

(149, 5)


| | node_1 | node_2 | edge | chunk_id | count |
|---|---|---|---|---|---|
| 0 | india's health indicators | peer nations | continue to lag behind | ae0fd26675d645e787964255667e90f4 | 4 |
| 2 | health workers density | doctors and nurses/midwives | for 10,00 persons | ae0fd26675d645e787964255667e90f4 | 4 |
| 4 | skilled health workforce | india | reinforces the central role human resources ha... | ae0fd26675d645e787964255667e90f4 | 4 |
| 5 | skewed inter-state | urban-rural | and public-private sector divide | ae0fd26675d645e787964255667e90f4 | 4 |
| 7 | health budget | federal | offers an unprecedented opportunity to do this | ae0fd26675d645e787964255667e90f4 | 4 |


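## Calculate contextual proximity

I assume here that concepts occurring close to each other in the text corpus are related. Let's call this relation "contextual proximity".

To calculate the contextual proximity edges, we melt the dataframe so that node_1 and node_2 collapse into a single column. Then we create a self-join of this dataframe using chunk_id as the key. Nodes that have the same chunk_id therefore pair up with each other to form a row.

But this also means that each concept is paired with itself. This is called a self-loop, where an edge starts and ends on the same node. To remove these self-loops, we drop every row where node_1 is the same as node_2.

In the end, we get a dataframe very similar to our original dataframe.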
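To see what the melt and self-join produce, here is a plain-Python restatement on two toy edges from a single chunk. It is a sketch that mirrors the `contextual_proximity` function in the next cell, including its filter that drops pairs counted only once; the toy rows are made up.

```python
from collections import Counter, defaultdict
from itertools import product

# (node_1, node_2, chunk_id) rows; toy data for illustration only
toy_rows = [("a", "b", "c1"), ("a", "c", "c1")]

# "Melt": list every node mention per chunk
mentions = defaultdict(list)
for node_1, node_2, chunk_id in toy_rows:
    mentions[chunk_id] += [node_1, node_2]

# "Self-join": pair every mention with every mention in the same chunk
pair_counts = Counter()
for nodes in mentions.values():
    for left, right in product(nodes, repeat=2):
        if left != right:  # drop self-loops
            pair_counts[(left, right)] += 1

# Keep only pairs counted more than once
proximity = {pair: n for pair, n in pair_counts.items() if n != 1}
print(proximity)
```

Two things stand out in the result. Each pair appears in both directions, and the count is the number of mention pairings rather than the number of distinct chunks: "a" is mentioned twice in chunk c1, so the pair ("a", "b") is counted twice.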

In [6]:
def contextual_proximity(df: pd.DataFrame) -> pd.DataFrame:
    ## Melt the dataframe into a list of nodes
    dfg_long = pd.melt(
        df, id_vars=["chunk_id"], value_vars=["node_1", "node_2"], value_name="node"
    )
    dfg_long.drop(columns=["variable"], inplace=True)
    # Self-join with chunk_id as the key links terms that occur in the same text chunk.
    dfg_wide = pd.merge(dfg_long, dfg_long, on="chunk_id", suffixes=("_1", "_2"))
    # Drop self-loops
    self_loops_drop = dfg_wide[dfg_wide["node_1"] == dfg_wide["node_2"]].index
    dfg2 = dfg_wide.drop(index=self_loops_drop).reset_index(drop=True)
    ## Group and count edges.
    dfg2 = (
        dfg2.groupby(["node_1", "node_2"])
        .agg({"chunk_id": [",".join, "count"]})
        .reset_index()
    )
    dfg2.columns = ["node_1", "node_2", "chunk_id", "count"]
    dfg2.replace("", np.nan, inplace=True)
    dfg2.dropna(subset=["node_1", "node_2"], inplace=True)
    # Drop edges with 1 count; copy so that the column assignment below
    # does not write into a slice of the filtered frame
    dfg2 = dfg2[dfg2["count"] != 1].copy()
    dfg2["edge"] = "contextual proximity"
    return dfg2


dfg2 = contextual_proximity(dfg1)
dfg2.tail()

| | node_1 | node_2 | chunk_id | count | edge |
|---|---|---|---|---|---|
| 2827 | world-class health facilities | nhm strategies | 0857ab4513ad4383aed095bcf24506fa,0857ab4513ad4... | 10 | contextual proximity |
| 2828 | world-class health facilities | rural areas | 0857ab4513ad4383aed095bcf24506fa,0857ab4513ad4... | 2 | contextual proximity |
| 2829 | world-class health facilities | social norms | 0857ab4513ad4383aed095bcf24506fa,0857ab4513ad4... | 2 | contextual proximity |
| 2830 | world-class health facilities | urban areas | 0857ab4513ad4383aed095bcf24506fa,0857ab4513ad4... | 2 | contextual proximity |
| 2831 | world-class health facilities | urban slums | 0857ab4513ad4383aed095bcf24506fa,0857ab4513ad4... | 2 | contextual proximity |


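### Merge the two dataframes

The count column here is the number of times node_1 and node_2 were paired within the same chunk. A single chunk can contribute more than one pairing, because a concept that appears in several extracted relations is melted into several rows; this is why the same chunk id repeats in the output above. The chunk_id column lists the chunk behind each pairing.

So we now have two dataframes, one with the semantic relations and another with the contextual proximity relations between the concepts mentioned in the text. We can combine them to form our network graph dataframe.

We have finished building the edge list for our graph of concepts, but it is not yet a graph object. Our goal is to visualize the graph like the featured image at the beginning of this article, and we are not far from that goal.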

In [7]:
dfg = pd.concat([dfg1, dfg2], axis=0)
dfg = (
    dfg.groupby(["node_1", "node_2"])
    .agg({"chunk_id": ",".join, "edge": ','.join, 'count': 'sum'})
    .reset_index()
)
dfg

| | node_1 | node_2 | chunk_id | edge | count |
|---|---|---|---|---|---|
| 0 | 56 articles | extensive literature search | d7a3e5085c7f4de4bc28fb0bd9cb0a94,d7a3e5085c7f4... | contextual proximity | 2 |
| 1 | [54] | increasing violence against healthcare personnel | 640835e2521045a395ab6465cc1ba4ca,640835e252104... | contextual proximity | 2 |
| 2 | [55] | increasing violence against healthcare personnel | 640835e2521045a395ab6465cc1ba4ca,640835e252104... | contextual proximity | 2 |
| 3 | a bad situation | increasing violence against healthcare personnel | 640835e2521045a395ab6465cc1ba4ca,640835e252104... | contextual proximity | 2 |
| 4 | a worrisome new trend | increasing violence against healthcare personnel | 640835e2521045a395ab6465cc1ba4ca,640835e252104... | contextual proximity | 2 |
| ... | ... | ... | ... | ... | ... |
| 753 | world-class health facilities | nhm strategies | 0857ab4513ad4383aed095bcf24506fa,0857ab4513ad4... | contextual proximity | 10 |
| 754 | world-class health facilities | rural areas | 0857ab4513ad4383aed095bcf24506fa,0857ab4513ad4... | contextual proximity | 2 |
| 755 | world-class health facilities | social norms | 0857ab4513ad4383aed095bcf24506fa,0857ab4513ad4... | contextual proximity | 2 |
| 756 | world-class health facilities | urban areas | 0857ab4513ad4383aed095bcf24506fa,0857ab4513ad4... | contextual proximity | 2 |


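## Build the NetworkX graph

NetworkX is a Python library that makes dealing with graphs very easy. If you are not already familiar with the library, follow the link below to learn more.

https://networkx.org

Adding our dataframe to a NetworkX graph takes just a few lines of code.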

In [8]:
nodes = pd.concat([dfg['node_1'], dfg['node_2']], axis=0).unique()
nodes.shape

(215,)

In [9]:
import networkx as nx
G = nx.Graph()

## Add nodes to the graph
for node in nodes:
    G.add_node(
        str(node)
    )

## Add edges to the graph
for index, row in dfg.iterrows():
    G.add_edge(
        str(row["node_1"]),
        str(row["node_2"]),
        title=row["edge"],
        weight=row['count']/4
    )
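The explicit loops above could also be replaced with NetworkX's `from_pandas_edgelist`, which builds the graph directly from an edge dataframe. This is an alternative sketch on made-up toy data, not the code the article runs; the relation column is named `title` here so the edge attribute matches the one set in the loop.

```python
import networkx as nx
import pandas as pd

# Toy edge list for illustration only
toy = pd.DataFrame(
    {
        "node_1": ["mary", "mary"],
        "node_2": ["lamb", "plate"],
        "title": ["owns,contextual proximity", "passed"],
        "count": [6, 4],
    }
)
toy["weight"] = toy["count"] / 4  # same scaling as in the loop above

toy_graph = nx.from_pandas_edgelist(
    toy, source="node_1", target="node_2", edge_attr=["title", "weight"]
)
print(toy_graph.edges(data=True))
```

Nodes are created implicitly from the two endpoint columns, so a separate pass to add nodes is not needed with this approach.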

### Calculate communities for coloring the nodes

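This is where we can start harnessing the power of network graphs. NetworkX provides plenty of network algorithms out of the box. Here is a link to the list of algorithms we can run on our graph.

https://networkx.org/documentation/stable/reference/algorithms/index.html

Here I use a community detection algorithm to add colors to the nodes. Communities are groups of nodes that are more tightly connected with each other than with the rest of the graph. Communities of concepts can give us a good idea of the broad themes discussed in the text.

The Girvan-Newman algorithm detected 17 communities of concepts in the review article we are working with. Here is one such community.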

```
[
  'digital technology', 
  'EVIN', 
  'medical devices', 
  'online training management information systems', 
  'wearable, trackable technology'
]
```

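This immediately gives us an idea of the broad theme of health technologies discussed in the review paper, and lets us ask questions that we can then answer with our RAG pipeline. Isn't that great?

Let us also calculate the degree of each concept in our graph. The degree of a node is the total number of edges it is connected to. So in our case, the higher the degree of a concept, the more central it is to the subject of our text. We will use the degree as the size of the node in our visualization.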
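As a small illustration of degree, here is a sketch on a made-up four-node graph. Note that `degree` counts edges and ignores edge weights unless a `weight` argument is passed.

```python
import networkx as nx

# Toy graph for illustration only
toy_graph = nx.Graph()
toy_graph.add_edges_from([("mary", "lamb"), ("mary", "plate"), ("plate", "food")])

# Map each node to the number of edges touching it
degrees = dict(toy_graph.degree)
print(degrees)
```

Here "mary" and "plate" each touch two edges, so they would be drawn larger than "lamb" and "food".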


In [10]:
communities_generator = nx.community.girvan_newman(G)
top_level_communities = next(communities_generator)
next_level_communities = next(communities_generator)
communities = sorted(map(sorted, next_level_communities))
print("Number of Communities = ", len(communities))
print(communities)

Number of Communities =  17
[['56 articles', 'analysis', "corresponding authors' experiential knowledge", 'extensive literature search', 'peer-reviewed journals'], ['[54]', '[55]', 'a bad situation', 'a worrisome new trend', 'adequately compensated', 'can reverse the situation', 'defensive medicine practices', 'increasing violence against healthcare personnel', 'intense focus on specialization', 'low physician-to-patient ratio', 'overwhelmed physicians', 'primary care physicians', 'private marketplace', 'protect themselves by ordering unnecessary tests and procedures', 'results in delays in attending patients', 'set in', 'tempted to take on more patients than they can reasonably serve', 'thoughtful approach to government planning', 'underpaid physicians', 'unethical practices by pharmaceutical companies', 'will not be able to solve this'], ['accredit health facilities', 'enforcement of existing rules', 'health insurance scheme for central government employees', 'health system standardi
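To see why the cell above calls `next` twice, it helps to run Girvan-Newman on a graph small enough to check by hand. The generator yields successively finer partitions, one per call. This sketch uses a made-up graph of two triangles joined by a single bridge edge.

```python
import networkx as nx

# Two triangles joined by one bridge edge; toy data for illustration only
toy_graph = nx.Graph()
toy_graph.add_edges_from(
    [
        ("a", "b"), ("b", "c"), ("a", "c"),  # first triangle
        ("d", "e"), ("e", "f"), ("d", "f"),  # second triangle
        ("c", "d"),                          # bridge
    ]
)

levels = nx.community.girvan_newman(toy_graph)
# The first partition appears once the highest-betweenness edge is removed
first_split = sorted(map(sorted, next(levels)))
print(first_split)
```

The bridge carries every shortest path between the two triangles, so it is removed first and the graph falls apart into the two triangles. Each further `next` call splits one level deeper.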

### Create a dataframe for community colors

In [11]:
import seaborn as sns
palette = "hls"

## Now add these colors to communities and make another dataframe
def colors2Community(communities) -> pd.DataFrame:
    ## Define a color palette
    p = sns.color_palette(palette, len(communities)).as_hex()
    random.shuffle(p)
    rows = []
    group = 0
    for community in communities:
        color = p.pop()
        group += 1
        for node in community:
            rows += [{"node": node, "color": color, "group": group}]
    df_colors = pd.DataFrame(rows)
    return df_colors


colors = colors2Community(communities)
colors

| | node | color | group |
|---|---|---|---|
| 0 | 56 articles | #db57db | 1 |
| 1 | analysis | #db57db | 1 |
| 2 | corresponding authors' experiential knowledge | #db57db | 1 |
| 3 | extensive literature search | #db57db | 1 |
| 4 | peer-reviewed journals | #db57db | 1 |
| ... | ... | ... | ... |
| 210 | rural medical assistants (rmas) | #57bcdb | 15 |
| 211 | limited uptake | #db57ac | 16 |
| 212 | national health protection mission | #db57ac | 16 |
| 213 | private health sector systems | #57dbcc | 17 |


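### Add colors to the graph and visualize it

Visualization is the most fun part of this exercise. It has a certain quality to it that gives you an artistic gratification.

I am using the Pyvis library to create an interactive graph. Pyvis is a Python library for visualizing networks. Here is an article that demonstrates the ease and power of the library.

https://towardsdatascience.com/pyvis-visualize-interactive-network-graphs-in-python-77e059791f01

Pyvis has a built-in NetworkX helper that translates our NetworkX graph into Pyvis objects. So we don't need any more coding... Yay!!

Remember, we use the weight of each edge as its thickness, the community of each node as its color, and the degree of each node as its size.

So, here is our graph.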


In [12]:
for index, row in colors.iterrows():
    G.nodes[row['node']]['group'] = row['group']
    G.nodes[row['node']]['color'] = row['color']
    G.nodes[row['node']]['size'] = G.degree[row['node']]
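The same node attributes could also be assigned without a row-by-row loop, using `nx.set_node_attributes` with a dictionary keyed by node. This is an alternative sketch on a made-up graph and color mapping, not the code the article runs.

```python
import networkx as nx

# Toy graph and color mapping for illustration only
toy_graph = nx.Graph([("mary", "lamb"), ("mary", "plate")])
toy_colors = {"mary": "#db57db", "lamb": "#db57db", "plate": "#57bcdb"}

# Each call sets one named attribute on every node listed in the dict
nx.set_node_attributes(toy_graph, toy_colors, name="color")
nx.set_node_attributes(toy_graph, dict(toy_graph.degree), name="size")
print(toy_graph.nodes(data=True))
```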

In [13]:
from pyvis.network import Network

graph_output_directory = "./docs/index.html"

net = Network(
    notebook=False,
    # bgcolor="#1a1a1a",
    cdn_resources="remote",
    height="900px",
    width="100%",
    select_menu=True,
    # font_color="#cccccc",
    filter_menu=False,
)

net.from_nx(G)
# net.repulsion(node_distance=150, spring_length=400)
net.force_atlas_2based(central_gravity=0.015, gravity=-31)
# net.barnes_hut(gravity=-18100, central_gravity=5.05, spring_length=380)
net.show_buttons(filter_=["physics"])

net.show(graph_output_directory, notebook=False)

./docs/index.html


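We can zoom in and out and move the nodes and edges around as we wish. We also have a slider panel at the bottom of the page to change the physics of the graph. See how the graph can help us ask the right questions and understand the subject matter better!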