In [5]:
import os

GRAPHRAG_FOLDER = "C:\Code\III_2024_golf\\ragtest\output"

### Depedendencies

We only need Pandas and the neo4j Python driver with the rust extension for faster network transport.

In [1]:
import time

import pandas as pd
from neo4j import GraphDatabase

## Neo4j Installation

You can create a free instance of Neo4j [online](https://console.neo4j.io). You get a credentials file that you can use for the connection credentials. You can also get an instance in any of the cloud marketplaces.

If you want to install Neo4j locally either use [Neo4j Desktop](https://neo4j.com/download) or 
the official Docker image: `docker run -e NEO4J_AUTH=neo4j/password -p 7687:7687 -p 7474:7474 neo4j` 

In [10]:
NEO4J_URI="neo4j+s://c67dfcd1.databases.neo4j.io"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="d_D68NVqp4GYu8pUiJ3BmelzZhX5FBsa2CklfeeS3f8"
NEO4J_DATABASE="neo4j"
# Create a Neo4j driver
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

## Batched Import

The batched import function takes a Cypher insert statement (needs to use the variable `value` for the row) and a dataframe to import.
It will send by default 1k rows at a time as query parameter to the database to be inserted.

In [12]:
def batched_import(statement, df, batch_size=1000):
    """
    Import a dataframe into Neo4j using a batched approach.

    Parameters: statement is the Cypher query to execute, df is the dataframe to import, and batch_size is the number of rows to import in each batch.
    """
    total = len(df)
    start_s = time.time()
    for start in range(0, total, batch_size):
        batch = df.iloc[start : min(start + batch_size, total)]
        result = driver.execute_query(
            "UNWIND $rows AS value " + statement,
            rows=batch.to_dict("records"),
            database_=NEO4J_DATABASE,
        )
        print(result.summary.counters)
    print(f"{total} rows in {time.time() - start_s} s.")
    return total

## Indexes and Constraints

Indexes in Neo4j are only used to find the starting points for graph queries, e.g. quickly finding two nodes to connect.
Constraints exist to avoid duplicates, we create them mostly on id's of Entity types.

We use some Types as markers with two underscores before and after to distinguish them from the actual entity types.

The default relationship type here is `RELATED` but we could also infer a real relationship-type from the description or the types of the start and end-nodes.

* `__Entity__`
* `__Document__`
* `__Chunk__`
* `__Community__`
* `__Covariate__`

## Import Process

### Importing the Documents

We're loading the parquet file for the documents and create nodes with their ids and add the title property.
We don't need to store text_unit_ids as we can create the relationships and the text content is also contained in the chunks.

In [13]:
doc_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/create_final_documents.parquet", columns=["id", "title"]
)
doc_df.head(2)

Unnamed: 0,id,title
0,79a497eed050a67abeac53133083370776d5896e1f0e60...,現今高爾夫球桿基本概念.txt
1,bba9ca74289df8e717153c49b987e9cd05361bdf376440...,綜評高爾夫下揮桿之腕部釋放技巧.txt


In [14]:
# Import documents
statement = """
CREATE (d:Document {
    id: value.id,
    title: value.title,
    raw_content: value.raw_content,
    text_unit_ids: value.text_unit_ids
});
"""

batched_import(statement, doc_df)

{'_contains_updates': True, 'labels_added': 7, 'nodes_created': 7, 'properties_set': 14}
7 rows in 0.6817796230316162 s.


7

### Loading Text Units

We load the text units, create a node per id and set the text and number of tokens.
Then we connect them to the documents that we created before.

In [15]:
text_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/create_final_text_units.parquet",
    columns=["id", "text", "n_tokens", "document_ids"],
)
text_df.head(2)

Unnamed: 0,id,text,n_tokens,document_ids
0,23f9f55a3d55f4d9c08729314d06b8454c3cb06b3e2277...,高爾夫球桿桿身的相關研究\n\ndoi:10.29706/GS.201208.0007\n大...,1200,[0fc45284410a8448f5e5b3553f77b2ed518df093fd089...
1,bbec972c236d5442cf6c1d6ad96c4a9110148e6cbe3a4b...,業餘的高爾夫愛好者在合適的球具、適當的練習下，即能夠在很\n短的時間，就具備不錯的揮桿距離與...,1200,[0fc45284410a8448f5e5b3553f77b2ed518df093fd089...


In [16]:
statement = """
CREATE (t:TextUnit {
    id: value.id,
    text: value.text,
    n_tokens: toFloat(value.n_tokens),
    document_ids: value.document_ids,
    entity_ids: value.entity_ids,
    relationship_ids: value.relationship_ids
});

"""

batched_import(statement, text_df)

{'_contains_updates': True, 'labels_added': 58, 'nodes_created': 58, 'properties_set': 232}
58 rows in 0.6316184997558594 s.


58

### Loading Entities

For the nodes we store id, name, description, embedding (if available), human readable id.

In [17]:
entity_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/create_final_entities.parquet",
    columns=[
        "id",  
        "human_readable_id",    
        "title",
        "type",
        "description",
        "text_unit_ids",
    ],
)
entity_df.head(2)

Unnamed: 0,id,human_readable_id,title,type,description,text_unit_ids
0,bf2440ec-3d12-498c-9ed3-ac7576270785,0,王順正,PERSON,王順正 is an author affiliated with National Chun...,[23f9f55a3d55f4d9c08729314d06b8454c3cb06b3e227...
1,f4907ce6-5165-487a-bfda-cffcb579ce2b,1,林玉瓊,PERSON,林玉瓊 is an author affiliated with Wu Feng Unive...,[23f9f55a3d55f4d9c08729314d06b8454c3cb06b3e227...


In [19]:
entity_statement = """
CREATE (e:Entity {
    id: value.id,
    name: value.name,
    type: value.type,
    description: value.description,
    human_readable_id: toInteger(value.human_readable_id),
    text_unit_ids: value.text_unit_ids
});
"""

batched_import(entity_statement, entity_df)

{'_contains_updates': True, 'labels_added': 629, 'nodes_created': 629, 'properties_set': 3145}
629 rows in 0.5855612754821777 s.


629

### Import Relationships

For the relationships we find the source and target node by name, using the base `__Entity__` type.
After creating the `RELATED` relationships, we set the description as attribute.

In [20]:
rel_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/create_final_relationships.parquet",
    columns=[
        "id",
        "human_readable_id",
        "source",
        "target",
        "description",
        "weight",
        "combined_degree",
        "text_unit_ids"
    ],
)
rel_df.head(2)

Unnamed: 0,id,human_readable_id,source,target,description,weight,combined_degree,text_unit_ids
0,dd3e85e6-fcab-4d49-86f8-fa2fc2b3545f,0,王順正,高爾夫球桿,王順正 is an author who conducted research on gol...,8.0,45,[23f9f55a3d55f4d9c08729314d06b8454c3cb06b3e227...
1,3164e95b-f59f-49b7-b85c-1cb6a0ccb889,1,林玉瓊,高爾夫球桿,林玉瓊 is an author who conducted research on gol...,8.0,45,[23f9f55a3d55f4d9c08729314d06b8454c3cb06b3e227...


In [21]:
rel_statement = """
CREATE (r:Relationship {
    source: value.source,
    target: value.target,
    weight: toFloat(value.weight),
    description: value.description,
    id: value.id,
    human_readable_id: value.human_readable_id,
    source_degree: toInteger(value.source_degree),
    target_degree: toInteger(value.target_degree),
    /*
    rank: toInteger(value.rank),
    */
    text_unit_ids: value.text_unit_ids
});
"""

batched_import(rel_statement, rel_df)

{'_contains_updates': True, 'labels_added': 818, 'nodes_created': 818, 'properties_set': 5726}
818 rows in 0.4259073734283447 s.


818

### Import node

In [23]:
community_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/create_final_nodes.parquet",
    columns=["id","human_readable_id", "title","community","level","degree","x","y"],
)

community_df.head(2)

Unnamed: 0,id,human_readable_id,title,community,level,degree,x,y
0,bf2440ec-3d12-498c-9ed3-ac7576270785,0,王順正,12,0,1,0.0,0.0
1,bf2440ec-3d12-498c-9ed3-ac7576270785,0,王順正,78,1,1,0.0,0.0


In [24]:
statement = """
CREATE (n:Node {
    id: value.id,
    level: toInteger(value.level),
    title: value.title,
    type: value.type,
    description: value.description,
    source_id: value.source_id,
    community: value.community,
    degree: toInteger(value.degree),
    human_readable_id: toInteger(value.human_readable_id),
    size: toInteger(value.size),
    entity_type: value.entity_type,
    top_level_node_id: value.top_level_node_id,
    x: toInteger(value.x),
    y: toInteger(value.y)
});
"""

batched_import(statement, community_df)


{'_contains_updates': True, 'labels_added': 1000, 'nodes_created': 1000, 'properties_set': 8000}
{'_contains_updates': True, 'labels_added': 352, 'nodes_created': 352, 'properties_set': 2816}
1352 rows in 0.6490049362182617 s.


1352

### Importing Communities

For communities we import their id, title, level.
We connect the `__Community__` nodes to the start and end nodes of the relationships they refer to.

Connecting them to the chunks they orignate from is optional, as the entites are already connected to the chunks.

In [26]:
community_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/create_final_communities.parquet",
    columns=["id", "level", "title", "text_unit_ids", "relationship_ids"],
)

community_df.head(2)

Unnamed: 0,id,level,title,text_unit_ids,relationship_ids
0,19e5deee-4930-4811-b758-b7bcead41273,0,Community 0,[0b8bb9591af8ed28801e86e8a5bc1de89e027c48434fd...,"[0188c1f0-e2e6-42d6-a140-544efb447bf2, 0378d07..."
1,ac5e9bda-c3a5-428a-930d-7af7c99f11d3,0,Community 1,[63460779082864c6d13982183b698189691ca36e76117...,"[0ca09d3e-b942-4dbe-91a0-806aab6c20ad, 104a829..."


In [27]:
statement = """
CREATE (c:Community {
    id: value.id,
    title: value.title,
    level: toInteger(value.level),
    raw_community: value.raw_community,
    relationship_ids: value.relationship_ids,
    text_unit_ids: value.text_unit_ids
});
"""

batched_import(statement, community_df)

{'_contains_updates': True, 'labels_added': 113, 'nodes_created': 113, 'properties_set': 565}
113 rows in 0.23726105690002441 s.


113

### Importing Community Reports

Fo the community reports we create nodes for each communitiy set the id, community, level, title, summary, rank, and rank_explanation and connect them to the entities they are about.
For the findings we create the findings in context of the communities.

In [35]:
community_report_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/create_final_community_reports.parquet",
    columns=[
        "id",
        "community",
        "level",
        "title",
        "summary",
        "findings",
        "rank",
        "rank_explanation",
        "full_content",
        "full_content_json",
    ],
)
community_report_df

Unnamed: 0,id,community,level,title,summary,findings,rank,rank_explanation,full_content,full_content_json
0,7f94ed38c3f945e3ad1ad19fa548a3ba,83,2,力量与速度结合原则与Wenger与Bell的研究,该社区围绕力量与速度结合原则及其理论基础展开，Wenger与Bell的研究为这一原则提供了重...,[{'explanation': '力量与速度结合原则是高尔夫挥杆中的关键概念，运动员需要合...,4.0,影响严重性评分为中等，主要由于力量与速度结合原则在高尔夫运动中的应用可能影响运动员的表现。,# 力量与速度结合原则与Wenger与Bell的研究\n\n该社区围绕力量与速度结合原则及其...,"{\n ""title"": ""力量与速度结合原则与Wenger与Bell的研究"",\n ..."
1,7768076af1e14a3e8a887a481ad52c3a,84,2,Golf Swing Mechanics and Muscle Dynamics,This community focuses on the key muscle group...,[{'explanation': 'The community identifies sev...,7.5,The impact severity rating is high due to the ...,# Golf Swing Mechanics and Muscle Dynamics\n\n...,"{\n ""title"": ""Golf Swing Mechanics and Musc..."
2,3159c82f269a4e9d917cb25520941adc,85,2,Golf Clubs and Their Components,The community focuses on various types of golf...,[{'explanation': 'The community encompasses a ...,6.5,The impact severity rating is moderate to high...,# Golf Clubs and Their Components\n\nThe commu...,"{\n ""title"": ""Golf Clubs and Their Componen..."
3,c92340cde4cd431b8b3f981b81f62d22,86,2,Amateur and Professional Golfers Community,The community consists of amateur and professi...,[{'explanation': 'The community is characteriz...,4.0,The impact severity rating is moderate due to ...,# Amateur and Professional Golfers Community\n...,"{\n ""title"": ""Amateur and Professional Golf..."
4,b254bb63d09940718b8141c9bb9bd601,87,2,Golf Swing Mechanics Research Community,The community is centered around the collabora...,[{'explanation': 'DeDe and Linda have a strong...,4.0,The impact severity rating is moderate due to ...,# Golf Swing Mechanics Research Community\n\nT...,"{\n ""title"": ""Golf Swing Mechanics Research..."
...,...,...,...,...,...,...,...,...,...,...
108,c37b97c25340485f92a87334bf5b8daa,9,0,Golf Swing Techniques and Research Community,This community focuses on the various aspects ...,[{'explanation': 'Preparatory movements are cr...,7.5,The impact severity rating is high due to the ...,# Golf Swing Techniques and Research Community...,"{\n ""title"": ""Golf Swing Techniques and Res..."
109,1c5881de6f7047b59ef1dfb97ebf3bc8,10,0,Golf Swing Techniques and Research Community,This community focuses on the techniques and r...,[{'explanation': '揮桿 is the fundamental action...,7.5,The impact severity rating is high due to the ...,# Golf Swing Techniques and Research Community...,"{\n ""title"": ""Golf Swing Techniques and Res..."
110,40d27c3a053f46ad8316a7c6eebc08ab,11,0,Golf Swing Techniques and Specialized Strength...,This community focuses on the intersection of ...,[{'explanation': 'The study of biomechanics is...,7.5,The impact severity rating is high due to the ...,# Golf Swing Techniques and Specialized Streng...,"{\n ""title"": ""Golf Swing Techniques and Spe..."
111,4e70b8064b8840a89cdfa6ecfbc0ff3a,12,0,Golf Equipment Research Community,The community is centered around the research ...,[{'explanation': 'The design of the golf club ...,7.5,The impact severity rating is high due to the ...,# Golf Equipment Research Community\n\nThe com...,"{\n ""title"": ""Golf Equipment Research Commu..."


In [36]:
# Import communities
community_statement = """
CREATE (cr:CommunityReport {
    id: value.id,
    community: value.community,
    full_content: value.full_content,
    level: toInteger(value.level),
    rank: toFloat(value.rank),
    title: value.title,
    rank_explanation: value.rank_explanation,
    summary: value.summary,
    /*
    findings: value.findings,
    */
    full_content_json: value.full_content_json
});
"""
batched_import(community_statement, community_report_df)

{'_contains_updates': True, 'labels_added': 113, 'nodes_created': 113, 'properties_set': 1017}
113 rows in 0.4844679832458496 s.


113

In [46]:
Creat_statement="""
CREATE INDEX FOR (d:Document) ON (d.id);
CREATE INDEX FOR (t:TextUnit) ON (t.id);
CREATE INDEX FOR (e:Entity) ON (e.id);
CREATE INDEX FOR (r:Relationship) ON (r.id);
CREATE INDEX FOR (n:Node) ON (n.id);
CREATE INDEX FOR (c:Community) ON (c.id);
CREATE INDEX FOR (cr:CommunityReport) ON (cr.id);
"""


for i in Creat_statement.split(";")[:-1]:
    print(i)
    driver.execute_query(i, database_=NEO4J_DATABASE)



CREATE INDEX FOR (d:Document) ON (d.id)


ClientError: {code: Neo.ClientError.Schema.EquivalentSchemaRuleAlreadyExists} {message: An equivalent index already exists, 'Index( id=2, name='index_a0750792', type='RANGE', schema=(:Document {id}), indexProvider='range-1.0' )'.}

In [76]:
driver.execute_query("""MATCH (d:Document)
UNWIND split(d.text_unit_ids, ',') AS textUnitId
MATCH (t:TextUnit {id: trim(textUnitId)})
CREATE (d)-[:HAS_TEXT_UNIT]->(t);

""", database_=NEO4J_DATABASE)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x000002D335DC7B80>, keys=[])

In [70]:
driver.execute_query("""MATCH (t:TextUnit)
UNWIND split(t.document_ids, ',') AS docId
MATCH (d:Document {id: trim(docId)})
CREATE (t)-[:BELONGS_TO]->(d);

""", database_=NEO4J_DATABASE)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x000002D335C37490>, keys=[])

In [75]:
Match_statement="""
MATCH (d:Document)
UNWIND split(d.text_unit_ids, ',') AS textUnitId
MATCH (t:TextUnit {id: trim(textUnitId)})
CREATE (d)-[:HAS_TEXT_UNIT]->(t);
MATCH (t:TextUnit)
UNWIND split(t.entity_ids, ',') AS entityId
MATCH (e:Entity {id: trim(entityId)})
CREATE (t)-[:HAS_ENTITY]->(e);
MATCH (t:TextUnit)
UNWIND split(t.relationship_ids, ',') AS relId
MATCH (r:Relationship {id: trim(relId)})
CREATE (t)-[:HAS_RELATIONSHIP]->(r);
MATCH (r:Relationship)
MATCH (source:Entity {name: r.source})
MATCH (target:Entity {name: r.target})
CREATE (source)-[:RELATES_TO]->(target);
MATCH (cr:CommunityReport)
MATCH (c:Community {id: cr.community})
CREATE (cr)-[:REPORTS_ON]->(c);
"""
for i in Match_statement.split(";")[:-1]:
    print(i)
    print("---")
    driver.execute_query(i, database_=NEO4J_DATABASE)



MATCH (d:Document)
UNWIND split(d.text_unit_ids, ',') AS textUnitId
MATCH (t:TextUnit {id: trim(textUnitId)})
CREATE (d)-[:HAS_TEXT_UNIT]->(t)
---

MATCH (t:TextUnit)
UNWIND split(t.entity_ids, ',') AS entityId
MATCH (e:Entity {id: trim(entityId)})
CREATE (t)-[:HAS_ENTITY]->(e)
---

MATCH (t:TextUnit)
UNWIND split(t.relationship_ids, ',') AS relId
MATCH (r:Relationship {id: trim(relId)})
CREATE (t)-[:HAS_RELATIONSHIP]->(r)
---

MATCH (r:Relationship)
MATCH (source:Entity {name: r.source})
MATCH (target:Entity {name: r.target})
CREATE (source)-[:RELATES_TO]->(target)
---

MATCH (cr:CommunityReport)
MATCH (c:Community {id: cr.community})
CREATE (cr)-[:REPORTS_ON]->(c)
---
