In [1]:
%pip install -U git+https://github.com/microsoft/graphrag.git future pandas neo4j-rust-ext

Collecting git+https://github.com/microsoft/graphrag.git
  Cloning https://github.com/microsoft/graphrag.git to /private/var/folders/4j/rps61y256hl6shjt0sj6v2c80000gp/T/pip-req-build-yi7stznm
  Running command git clone --filter=blob:none --quiet https://github.com/microsoft/graphrag.git /private/var/folders/4j/rps61y256hl6shjt0sj6v2c80000gp/T/pip-req-build-yi7stznm
  Resolved https://github.com/microsoft/graphrag.git to commit 0d348d607053c27331863c206ed0d49380b35af4
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Note: you may need to restart the kernel to use updated packages.


### Data Preparation
Download a small text file of about a thousand lines from Project Gutenberg and use it for GraphRAG indexing.

The dataset tells the story of Leonardo da Vinci. We use GraphRAG to build a graph index of the relationships around da Vinci, and a LanceDB vector store to search for relevant knowledge when answering questions.

In [3]:
import nest_asyncio
nest_asyncio.apply()

In [4]:
import os
import urllib.request

index_root = os.path.join(os.getcwd(), 'graphrag_index')
os.makedirs(os.path.join(index_root, 'input'), exist_ok=True)

url = "https://www.gutenberg.org/cache/epub/7785/pg7785.txt"
file_path = os.path.join(index_root, 'input', 'davinci.txt')

urllib.request.urlretrieve(url, file_path)

with open(file_path, 'r+', encoding='utf-8') as file:
    # Keep only the first 934 lines; the rest of the file is not relevant to this example.
    # To reduce API cost, truncate the file to fewer lines.
    lines = file.readlines()
    file.seek(0)
    file.writelines(lines[:934])  # Lower this number to reduce API cost.
    file.truncate()
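
Before indexing, it can help to check how large the truncated file actually is. The helper below is a minimal sketch: `estimate_corpus_size` is a hypothetical name, and the default of roughly four characters per token is only a rule of thumb for English text, not a tokenizer count.

```python
from pathlib import Path


def estimate_corpus_size(path, chars_per_token=4.0):
    """Return line, word, and rough token counts for a UTF-8 text file.

    chars_per_token is a rule-of-thumb assumption for English text,
    not an exact tokenizer count.
    """
    text = Path(path).read_text(encoding="utf-8")
    return {
        "lines": len(text.splitlines()),
        "words": len(text.split()),
        "approx_tokens": int(len(text) / chars_per_token),
    }
```

Calling `estimate_corpus_size(file_path)` after the truncation step gives a quick sense of how much text the indexer will send to the LLM.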


### Initialize the workspace

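Now use GraphRAG to index the text file. First, initialize the workspace with the `graphrag.index --init` command. If the workspace has already been initialized, the command stops with `ValueError: Project already initialized`, as in the output below.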

In [2]:
!python -m graphrag.index --init --root ./graphrag_index

Initializing project at ./graphrag_index
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/jiangs5/Documents/RAG/.venv/lib/python3.12/site-packages/graphrag/index/__main__.py", line 104, in <module>
    index_cli(
  File "/Users/jiangs5/Documents/RAG/.venv/lib/python3.12/site-packages/graphrag/index/cli.py", line 126, in index_cli
    _initialize_project_at(root_dir, progress_reporter)
  File "/Users/jiangs5/Documents/RAG/.venv/lib/python3.12/site-packages/graphrag/index/cli.py", line 199, in _initialize_project_at
    raise ValueError(msg)
ValueError: Project already initialized at graphrag_index



#### Configure the env file and settings
The `.env` file is in the index root directory (`./graphrag_index`). Add your OpenAI API key to it.

Important notes:

- We will use OpenAI models for this example; ensure you have an API key ready.

- GraphRAG indexing is costly as it processes the entire text corpus with LLMs. Running this demo may cost a few dollars. To save money, consider truncating the text file to a smaller size.
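
A missing variable in `.env` only shows up later as a `KeyError`, so a quick check can save an expensive rerun. The following is a minimal sketch: `read_env_file` and `missing_vars` are hypothetical helpers, the parser handles only simple `KEY=VALUE` lines, and the variable names are the ones this notebook reads later through `os.environ`.

```python
# Names this notebook reads from the environment in the search cells.
REQUIRED_VARS = [
    "GRAPHRAG_API_KEY",
    "GRAPHRAG_API_BASE",
    "GRAPHRAG_EMBEDDING_API_BASE",
    "GRAPHRAG_LLM_MODEL",
    "GRAPHRAG_EMBEDDING_MODEL",
]


def read_env_file(path):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(values, required=REQUIRED_VARS):
    """Return the required names that are absent or empty."""
    return [name for name in required if not values.get(name)]
```

For example, `missing_vars(read_env_file("graphrag_index/.env"))` returns an empty list when every variable is set.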

#### Running the indexing pipeline
The indexing process takes some time. Once it completes, you’ll find a new folder at ./graphrag_index/output/<timestamp>/artifacts containing a series of parquet files.

In [17]:
!python -m graphrag.index --root ./graphrag_index

Logging enabled at /Users/jiangs5/Documents/RAG/graphrag/graphrag_index/output/indexing-engine.log
⠋ GraphRAG Indexer
├── Loading Input (InputFileType.text) - 738 files loaded (386 filtered)  100%
└── create_base_text_units

In [6]:
%load_ext dotenv
%dotenv graphrag_index/.env

### Querying with the vector database
During the querying stage, we use the vector database to store entity description embeddings for GraphRAG local search. This method combines structured data from the knowledge graph with unstructured data from input documents, enhancing the LLM context with relevant entity information for more precise answers.
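
To make the embedding lookup concrete, here is a toy sketch of retrieving the entities closest to a query vector. It is not GraphRAG's or LanceDB's implementation: the three-dimensional vectors are made up for illustration, and cosine similarity is just one common choice of metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_entities(query_vec, entity_vecs, k=2):
    """Return the k (name, score) pairs most similar to query_vec."""
    scored = [
        (name, cosine_similarity(query_vec, vec)) for name, vec in entity_vecs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy embeddings, invented for this illustration only.
toy_embeddings = {
    "LEONARDO DA VINCI": [0.9, 0.1, 0.0],
    "VINCI": [0.7, 0.3, 0.1],
    "PROJECT GUTENBERG": [0.0, 0.2, 0.9],
}
nearest = top_k_entities([1.0, 0.0, 0.0], toy_embeddings, k=2)
print([name for name, _ in nearest])
```

In local search, the entities found this way act as entry points into the graph; their relationships, text units, and community reports then fill the context window.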

In [7]:
import os

import pandas as pd
import tiktoken

from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey
from graphrag.query.indexer_adapters import (
    # read_indexer_covariates,
    read_indexer_entities,
    read_indexer_relationships,
    read_indexer_reports,
    read_indexer_text_units,
)
from graphrag.query.input.loaders.dfs import (
    store_entity_semantic_embeddings,
)
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
from graphrag.query.llm.oai.embedding import OpenAIEmbedding
from graphrag.query.llm.oai.typing import OpenaiApiType
from graphrag.query.question_gen.local_gen import LocalQuestionGen
from graphrag.query.structured_search.local_search.mixed_context import (
    LocalSearchMixedContext,
)
from graphrag.query.structured_search.local_search.search import LocalSearch
from graphrag.vector_stores.lancedb import LanceDBVectorStore



In [8]:
output_dir = os.path.join(index_root, "output")
subdirs = [os.path.join(output_dir, d) for d in os.listdir(output_dir)]
latest_subdir = max(subdirs, key=os.path.getmtime)  # Get latest output directory
INPUT_DIR = os.path.join(latest_subdir, "artifacts")
LANCEDB_URI = f"{INPUT_DIR}/lancedb"

COMMUNITY_REPORT_TABLE = "create_final_community_reports"
ENTITY_TABLE = "create_final_nodes"
ENTITY_EMBEDDING_TABLE = "create_final_entities"
RELATIONSHIP_TABLE = "create_final_relationships"
COVARIATE_TABLE = "create_final_covariates"
TEXT_UNIT_TABLE = "create_final_text_units"
COMMUNITY_LEVEL = 2

### Load data from the indexing process
The indexing process generates several parquet files. We load them into memory and store the entity description embeddings in a LanceDB vector store.
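
If indexing stopped early, some of these files will be missing and `pd.read_parquet` will fail partway through the notebook. The sketch below checks up front; `missing_tables` is a hypothetical helper, and the table names are the ones assigned to the constants earlier in this notebook (covariates are left out because they are disabled by default).

```python
from pathlib import Path

# Table names used by the loading cells in this notebook.
EXPECTED_TABLES = [
    "create_final_nodes",
    "create_final_entities",
    "create_final_relationships",
    "create_final_community_reports",
    "create_final_text_units",
]


def missing_tables(artifacts_dir, tables=EXPECTED_TABLES):
    """Return the table names whose .parquet file is not in artifacts_dir."""
    root = Path(artifacts_dir)
    return [name for name in tables if not (root / f"{name}.parquet").is_file()]
```

Running `missing_tables(INPUT_DIR)` should return an empty list before you continue with the cells below.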

#### Read entities

In [9]:
# read nodes table to get community and degree data
entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
entity_embedding_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_EMBEDDING_TABLE}.parquet")

entities = read_indexer_entities(entity_df, entity_embedding_df, COMMUNITY_LEVEL)

# load description embeddings into a LanceDB vector store stored at LANCEDB_URI
# to connect to a remote db, specify url and port values.
description_embedding_store = LanceDBVectorStore(
    collection_name="entity_description_embeddings",
)
description_embedding_store.connect(db_uri=LANCEDB_URI)
entity_description_embeddings = store_entity_semantic_embeddings(
    entities=entities, vectorstore=description_embedding_store
)

print(f"Entity count: {len(entity_df)}")
entity_df.head()

Entity count: 513


[2024-08-27T08:23:08Z WARN  lance::dataset] No existing dataset at /Users/jiangs5/Documents/RAG/graphrag/graphrag_index/output/20240827-111342/artifacts/lancedb/entity_description_embeddings.lance, it will be created


Unnamed: 0,level,title,type,description,source_id,community,degree,human_readable_id,id,size,graph_embedding,entity_type,top_level_node_id,x,y
0,0,PROJECT GUTENBERG,ORGANIZATION,Project Gutenberg is an organization that pro...,"357e55788727fd070ce35607043d4169,909b09dc034dd...",8,4,0,b45241d70f0e43fca764df95b2b81f77,4,,,b45241d70f0e43fca764df95b2b81f77,0,0
1,0,LEONARDO DA VINCI,PERSON,"Leonardo da Vinci was a renowned artist, poly...","357e55788727fd070ce35607043d4169,39dcda694bba3...",0,53,1,4119fd06010c494caa07f439b333f4c5,53,,,4119fd06010c494caa07f439b333f4c5,0,0
2,0,MAURICE W. BROCKWELL,PERSON,Maurice W. Brockwell is the author of the eBoo...,909b09dc034ddc0cfa8c635f3b7ad5c5,0,1,2,d3835bf3dda84ead99deadbeac5d0d7d,1,,,d3835bf3dda84ead99deadbeac5d0d7d,0,0
3,0,JULIET SUTHERLAND,PERSON,Juliet Sutherland is one of the producers of t...,909b09dc034ddc0cfa8c635f3b7ad5c5,8,1,3,077d2820ae1845bcbb1803379a3d1eae,1,,,077d2820ae1845bcbb1803379a3d1eae,0,0
4,0,DAVID WIDGER,PERSON,David Widger is one of the producers of the eBook,909b09dc034ddc0cfa8c635f3b7ad5c5,8,1,4,3671ea0dd4e84c1a9b02c5ab2c8f4bac,1,,,3671ea0dd4e84c1a9b02c5ab2c8f4bac,0,0


#### Read relationships

In [10]:
relationship_df = pd.read_parquet(f"{INPUT_DIR}/{RELATIONSHIP_TABLE}.parquet")
relationships = read_indexer_relationships(relationship_df)

print(f"Relationship count: {len(relationship_df)}")
relationship_df.head()

Relationship count: 185


Unnamed: 0,source,target,weight,description,text_unit_ids,id,human_readable_id,source_degree,target_degree,rank
0,PROJECT GUTENBERG,LEONARDO DA VINCI,9.0,Project Gutenberg provides an eBook about Leon...,"[357e55788727fd070ce35607043d4169, 909b09dc034...",c8b2408617804483b620e1a6691ac90d,0,4,53,57
1,PROJECT GUTENBERG,JULIET SUTHERLAND,7.0,Juliet Sutherland produced the eBook for Proje...,[909b09dc034ddc0cfa8c635f3b7ad5c5],a5e0d1644eb547ba9a5c3211aac4631a,1,4,1,5
2,PROJECT GUTENBERG,DAVID WIDGER,7.0,David Widger produced the eBook for Project Gu...,[909b09dc034ddc0cfa8c635f3b7ad5c5],5a28b94bc63b44edb30c54748fd14f15,2,4,1,5
3,PROJECT GUTENBERG,DP TEAM,7.0,DP Team contributed to the production of the e...,[909b09dc034ddc0cfa8c635f3b7ad5c5],f97011b2a99d44648e18d517e1eae15c,3,4,1,5
4,LEONARDO DA VINCI,MAURICE W. BROCKWELL,9.0,Maurice W. Brockwell authored a book about Leo...,[909b09dc034ddc0cfa8c635f3b7ad5c5],35489ca6a63b47d6a8913cf333818bc1,4,53,1,54


#### Read covariates

In [11]:
# # NOTE: covariates are turned off by default, because they generally need prompt tuning to be valuable
# # Please see the GRAPHRAG_CLAIM_* settings
# covariate_df = pd.read_parquet(f"{INPUT_DIR}/{COVARIATE_TABLE}.parquet")

# claims = read_indexer_covariates(covariate_df)

# print(f"Claim records: {len(claims)}")
# covariates = {"claims": claims}

#### Read community reports

In [12]:
report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
reports = read_indexer_reports(report_df, entity_df, COMMUNITY_LEVEL)

print(f"Report records: {len(report_df)}")
report_df.head()

Report records: 25


Unnamed: 0,community,full_content,level,rank,title,rank_explanation,summary,findings,full_content_json,id
0,23,# Leonardo da Vinci and His Artistic Community...,2,7.5,Leonardo da Vinci and His Artistic Community,The impact severity rating is high due to the ...,The community revolves around Leonardo da Vinc...,[{'explanation': 'Leonardo da Vinci is the cen...,"{\n ""title"": ""Leonardo da Vinci and His Art...",2f0f8811-02f5-487a-9dea-0aee92010f06
1,24,# Leonardo da Vinci's The Virgin of the Rocks ...,2,7.5,Leonardo da Vinci's The Virgin of the Rocks an...,The impact severity rating is high due to the ...,The community revolves around Leonardo da Vinc...,[{'explanation': ''The Virgin of the Rocks' is...,"{\n ""title"": ""Leonardo da Vinci's The Virgi...",aa63d12b-619b-41e1-9aac-061541905a63
2,11,# Caterina and Her Family Network\n\nThe commu...,1,3.0,Caterina and Her Family Network,The impact severity rating is low due to the f...,"The community revolves around Caterina, the mo...",[{'explanation': 'Caterina is the central enti...,"{\n ""title"": ""Caterina and Her Family Netwo...",26bf187d-6db3-423d-b234-6a6b2bd41fdf
3,12,# Leonardo da Vinci and His Renaissance Networ...,1,8.5,Leonardo da Vinci and His Renaissance Network,The impact severity rating is high due to Leon...,The community revolves around Leonardo da Vinc...,[{'explanation': 'Leonardo da Vinci is the cen...,"{\n ""title"": ""Leonardo da Vinci and His Ren...",a9fff998-ba76-4a7e-b340-bdfff5870e65
4,13,# Lucrezia Crivelli and Her Artistic Connectio...,1,7.0,Lucrezia Crivelli and Her Artistic Connections,The impact severity rating is high due to the ...,The community revolves around Lucrezia Crivell...,[{'explanation': 'Lucrezia Crivelli is the cen...,"{\n ""title"": ""Lucrezia Crivelli and Her Art...",d7799795-972d-4f70-b088-6cf4e60868d8


#### Read text units

In [13]:
text_unit_df = pd.read_parquet(f"{INPUT_DIR}/{TEXT_UNIT_TABLE}.parquet")
text_units = read_indexer_text_units(text_unit_df)

print(f"Text unit records: {len(text_unit_df)}")
text_unit_df.head()

Text unit records: 10


Unnamed: 0,id,text,n_tokens,document_ids,entity_ids,relationship_ids
0,909b09dc034ddc0cfa8c635f3b7ad5c5,﻿The Project Gutenberg eBook of Leonardo Da Vi...,1200,[93341e4fe83c1fbe4d419240f35dfd17],"[b45241d70f0e43fca764df95b2b81f77, 4119fd06010...","[c8b2408617804483b620e1a6691ac90d, a5e0d1644eb..."
1,12f4bdb549741e42ac24cc4002ac46ee,"by his grandfather\nAntonio, in whose house h...",1200,[93341e4fe83c1fbe4d419240f35dfd17],"[f7e11b0e297a44a896dc67928368f600, e1fd0e904a5...","[fdc954b454744820804d7798f3e0b5de, 49c13838369..."
2,29b5df96bd5ef1c197a3756304a046e4,"the top left-hand corner is\nreversed, and pr...",1200,[93341e4fe83c1fbe4d419240f35dfd17],"[254770028d7a4fa9877da4ba0ad5ad21, deece7e64b2...","[ac6e5a44e0c04a4fa93589376fde4c34, f9005e5c01b..."
3,8762d6f9ffcb3b569708c8d0f0be13b1,.16 x 8.09)]\n\nHe goes on to say that he can ...,1200,[93341e4fe83c1fbe4d419240f35dfd17],"[f7e11b0e297a44a896dc67928368f600, 1943f245ee4...","[859dedcc3736439a8a563419f16cb3d8, 668cf1fdfd6..."
4,50e6365cd0b75ca60c1abe0ef0c3475c,", painted about 1482, which\nbetween 1491 and ...",1200,[93341e4fe83c1fbe4d419240f35dfd17],"[f7e11b0e297a44a896dc67928368f600, 1943f245ee4...","[859dedcc3736439a8a563419f16cb3d8, 478e4c72d8f..."


### Visualizing nodes and relationships with `yfiles-jupyter-graphs`

`yfiles-jupyter-graphs` is a graph visualization extension that provides interactive and customizable visualizations for structured node and relationship data.

Here we use it to provide an interactive visualization of the knowledge graph from the local search sample by passing node and relationship lists converted from the parquet files. The input data needs an `id` attribute for each node and `start`/`end` properties for each relationship that correspond to node ids. Additional attributes can be added in the `properties` of each node/relationship dict:
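
The id requirement is easy to break when nodes and relationships come from different tables. As a minimal sketch, the hypothetical helper `find_dangling_edges` below reports relationships whose `start` or `end` does not match any node `id`; the sample data is made up for illustration.

```python
def find_dangling_edges(nodes, edges):
    """Return edges whose start or end does not match any node id."""
    node_ids = {node["id"] for node in nodes}
    return [
        edge
        for edge in edges
        if edge["start"] not in node_ids or edge["end"] not in node_ids
    ]


# Toy input in the shape described above.
sample_nodes = [
    {"id": "LEONARDO DA VINCI", "properties": {}},
    {"id": "VINCI", "properties": {}},
]
sample_edges = [
    {"start": "LEONARDO DA VINCI", "end": "VINCI", "properties": {}},
    {"start": "LEONARDO DA VINCI", "end": "MILAN", "properties": {}},
]
dangling = find_dangling_edges(sample_nodes, sample_edges)
print([edge["end"] for edge in dangling])
```

An empty result means every relationship can be attached to a node in the list.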

In [14]:
%pip install yfiles_jupyter_graphs --quiet
from yfiles_jupyter_graphs import GraphWidget


# converts the entities dataframe to a list of dicts for yfiles-jupyter-graphs
def convert_entities_to_dicts(df):
    """Convert the entities dataframe to a list of dicts for yfiles-jupyter-graphs."""
    nodes_dict = {}
    for _, row in df.iterrows():
        # Create a dictionary for each row and collect unique nodes
        node_id = row["title"]
        if node_id not in nodes_dict:
            nodes_dict[node_id] = {
                "id": node_id,
                "properties": row.to_dict(),
            }
    return list(nodes_dict.values())


# converts the relationships dataframe to a list of dicts for yfiles-jupyter-graphs
def convert_relationships_to_dicts(df):
    """Convert the relationships dataframe to a list of dicts for yfiles-jupyter-graphs."""
    relationships = []
    for _, row in df.iterrows():
        # Create a dictionary for each row
        relationships.append({
            "start": row["source"],
            "end": row["target"],
            "properties": row.to_dict(),
        })
    return relationships


w = GraphWidget()
w.directed = True
w.nodes = convert_entities_to_dicts(entity_df)
w.edges = convert_relationships_to_dicts(relationship_df)

Note: you may need to restart the kernel to use updated packages.


#### Configure data-driven visualization

The additional properties can be used to configure the visualization for different use cases.

In [15]:
# show title on the node
w.node_label_mapping = "title"


# map community to a color
def community_to_color(community):
    """Map a community to a color."""
    colors = [
        "crimson",
        "darkorange",
        "indigo",
        "cornflowerblue",
        "cyan",
        "teal",
        "green",
    ]
    return (
        colors[int(community) % len(colors)] if community is not None else "lightgray"
    )


def edge_to_source_community(edge):
    """Get the community of an edge's source node, or None if the node is missing."""
    source_node = next(
        (entry for entry in w.nodes if entry["properties"]["title"] == edge["start"]),
        None,
    )
    if source_node is None:
        return None
    return source_node["properties"]["community"]


w.node_color_mapping = lambda node: community_to_color(node["properties"]["community"])
w.edge_color_mapping = lambda edge: community_to_color(edge_to_source_community(edge))
# map size data to a reasonable factor
w.node_scale_factor_mapping = lambda node: 0.5 + node["properties"]["size"] * 1.5 / 20
# use weight for edge thickness
w.edge_thickness_factor_mapping = "weight"

#### Automatic layouts

The widget provides different automatic layouts that serve different purposes: `Circular`, `Hierarchic`, `Organic (interactive or static)`, `Orthogonal`, `Radial`, `Tree`, `Geo-spatial`.

For the knowledge graph, this sample uses the `Circular` layout, though `Hierarchic` or `Organic` are also suitable choices.

In [16]:
# Use the circular layout for this visualization. For larger graphs, the default organic layout is often preferable.
w.circular_layout()

#### Display the graph

In [17]:
display(w)

GraphWidget(layout=Layout(height='800px', width='100%'))

### Create a local search engine
We have prepared the necessary data for the local search engine. Now, we can build a LocalSearch instance with them, an LLM, and an embedding model.

In [18]:
api_base = os.environ["GRAPHRAG_API_BASE"]
embedding_api_base = os.environ["GRAPHRAG_EMBEDDING_API_BASE"]
api_key = os.environ["GRAPHRAG_API_KEY"]  # Your OpenAI API key
llm_model = os.environ["GRAPHRAG_LLM_MODEL"]
embedding_model = os.environ["GRAPHRAG_EMBEDDING_MODEL"]

llm = ChatOpenAI(
    api_base=api_base,
    api_key=api_key,
    model=llm_model,
    api_type=OpenaiApiType.OpenAI,
    max_retries=20,
)

token_encoder = tiktoken.get_encoding("cl100k_base")

text_embedder = OpenAIEmbedding(
    api_key=api_key,
    api_base=embedding_api_base,
    api_type=OpenaiApiType.OpenAI,
    model=embedding_model,
    deployment_name=embedding_model,
    max_retries=20,
)

#### Create local search context builder

In [19]:
context_builder = LocalSearchMixedContext(
    community_reports=reports,
    text_units=text_units,
    entities=entities,
    relationships=relationships,
    covariates=None,  # covariates are disabled here; pass the covariates dict from the "Read covariates" cell to enable them
    entity_text_embeddings=description_embedding_store,
    embedding_vectorstore_key=EntityVectorStoreKey.ID,  # if the vectorstore uses entity title as ids, set this to EntityVectorStoreKey.TITLE
    text_embedder=text_embedder,
    token_encoder=token_encoder,
)

In [20]:
# text_unit_prop: proportion of context window dedicated to related text units
# community_prop: proportion of context window dedicated to community reports.
# The remaining proportion is dedicated to entities and relationships. Sum of text_unit_prop and community_prop should be <= 1
# conversation_history_max_turns: maximum number of turns to include in the conversation history.
# conversation_history_user_turns_only: if True, only include user queries in the conversation history.
# top_k_mapped_entities: number of related entities to retrieve from the entity description embedding store.
# top_k_relationships: control the number of out-of-network relationships to pull into the context window.
# include_entity_rank: if True, include the entity rank in the entity table in the context window. Default entity rank = node degree.
# include_relationship_weight: if True, include the relationship weight in the context window.
# include_community_rank: if True, include the community rank in the context window.
# return_candidate_context: if True, return a set of dataframes containing all candidate entity/relationship/covariate records that
# could be relevant. Note that not all of these records will be included in the context window. The "in_context" column in these
# dataframes indicates whether the record is included in the context window.
# max_tokens: maximum number of tokens to use for the context window.


local_context_params = {
    "text_unit_prop": 0.5,
    "community_prop": 0.1,
    "conversation_history_max_turns": 5,
    "conversation_history_user_turns_only": True,
    "top_k_mapped_entities": 10,
    "top_k_relationships": 10,
    "include_entity_rank": True,
    "include_relationship_weight": True,
    "include_community_rank": False,
    "return_candidate_context": False,
    "embedding_vectorstore_key": EntityVectorStoreKey.ID,  # set this to EntityVectorStoreKey.TITLE if the vectorstore uses entity title as ids
    "max_tokens": 12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
}

llm_params = {
    "max_tokens": 2_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000-1500)
    "temperature": 0.0,
}
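
To see what the proportions in the comments above mean in tokens, the sketch below splits a context window the way those comments describe. `context_token_budget` is a hypothetical helper, and GraphRAG's internal rounding may differ.

```python
def context_token_budget(max_tokens, text_unit_prop, community_prop):
    """Split max_tokens into text-unit, community-report, and remaining shares."""
    if text_unit_prop + community_prop > 1:
        msg = "text_unit_prop + community_prop must be <= 1"
        raise ValueError(msg)
    text_unit_tokens = int(max_tokens * text_unit_prop)
    community_tokens = int(max_tokens * community_prop)
    return {
        "text_units": text_unit_tokens,
        "community_reports": community_tokens,
        # Whatever is left goes to entities and relationships.
        "entities_and_relationships": max_tokens - text_unit_tokens - community_tokens,
    }


budget = context_token_budget(12_000, 0.5, 0.1)
print(budget)
```

With the settings used here, half of the 12,000-token window goes to text units, a tenth to community reports, and the remaining 4,800 tokens to entities and relationships.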

In [21]:
search_engine = LocalSearch(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    llm_params=llm_params,
    context_builder_params=local_context_params,
    response_type="multiple paragraphs",  # free form text describing the response type and format, can be anything, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report
)

### Make a query

In [22]:
result = await search_engine.asearch("Tell me about Leonardo Da Vinci")
print(result.response)

 ### Leonardo da Vinci: A Renaissance Polymath

Leonardo da Vinci, born in 1452 in Vinci, Italy, is renowned as one of the most versatile and brilliant figures of the Italian Renaissance. His contributions span various fields, including art, science, engineering, and architecture. Leonardo's life and work are extensively documented, highlighting his role as a central figure in his community and his influence on numerous associates, patrons, and students [Data: Entities (1); Relationships (5, 11, 13, 16, 17, +more)].

### Artistic Legacy

Leonardo's artistic legacy is vast and enduring. His most famous works include the "Mona Lisa" and "The Last Supper," which showcase his mastery of technique and composition. These paintings continue to be celebrated and studied for their innovative use of sfumato and chiaroscuro, techniques that have been adopted by many artists. Leonardo's influence on other artists, such as Boltraffio and Salai, is evident in their works and the establishment of his

#### Inspecting the context data used to generate the response

In [23]:
result.context_data["entities"].head()

Unnamed: 0,id,entity,description,number of relationships,in_context
0,8,VINCI,Vinci is the birthplace of Leonardo da Vinci,3,True
1,128,VASARI,An art historian who wrote about Leonardo da V...,1,True
2,1,LEONARDO DA VINCI,"Leonardo da Vinci was a renowned artist, poly...",53,True
3,24,LEONARDO,"Leonardo da Vinci, born in 1452, is a renowne...",44,True
4,126,CESARE BORGIA,A person for whom Leonardo da Vinci worked as ...,1,True


In [24]:
result.context_data["relationships"].head()

Unnamed: 0,id,source,target,description,weight,rank,links,in_context
0,0,PROJECT GUTENBERG,LEONARDO DA VINCI,Project Gutenberg provides an eBook about Leon...,9.0,57,1,True
1,6,LEONARDO DA VINCI,VINCI,Leonardo da Vinci was born in Vinci,10.0,56,2,True
2,8,LEONARDO DA VINCI,CATERINA,Caterina is the mother of Leonardo da Vinci,10.0,55,1,True
3,20,LEONARDO DA VINCI,CESARE BORGIA,Leonardo da Vinci worked as an engineer and ar...,7.0,54,1,True
4,21,LEONARDO DA VINCI,VASARI,Vasari wrote about Leonardo da Vinci's portrai...,1.0,54,2,True


In [25]:
result.context_data["reports"].head()

Unnamed: 0,id,title,content
0,12,Leonardo da Vinci and His Renaissance Network,# Leonardo da Vinci and His Renaissance Networ...
1,12,Leonardo da Vinci and His Renaissance Network,# Leonardo da Vinci and His Renaissance Networ...


In [26]:
result.context_data["sources"].head()

Unnamed: 0,id,text
0,0,﻿The Project Gutenberg eBook of Leonardo Da Vi...
1,6,at the sound of whose name all the muses rise\...
2,8,"of Shakespeare, Leonardo da Vinci made his wi..."
3,7,"not appear to be painted, but truly flesh and ..."
4,9,of giving to his discoveries a practical and\...


In [27]:
if "claims" in result.context_data:
    print(result.context_data["claims"].head())

### Visualizing the result context of `graphrag` queries

The result context of a `graphrag` query lets you inspect the context graph behind the response. This data can likewise be visualized as a graph with `yfiles-jupyter-graphs`.

In [28]:
"""
Helper function to visualize the result context with `yfiles-jupyter-graphs`.

The dataframes are converted into supported nodes and relationships lists and then passed to yfiles-jupyter-graphs.
Additionally, some values are mapped to visualization properties.
"""


def show_graph(result):
    """Visualize the result context with yfiles-jupyter-graphs."""
    from yfiles_jupyter_graphs import GraphWidget

    if (
        "entities" not in result.context_data
        or "relationships" not in result.context_data
    ):
        msg = "The passed results do not contain 'entities' or 'relationships'"
        raise ValueError(msg)

    # converts the entities dataframe to a list of dicts for yfiles-jupyter-graphs
    def convert_entities_to_dicts(df):
        """Convert the entities dataframe to a list of dicts for yfiles-jupyter-graphs."""
        nodes_dict = {}
        for _, row in df.iterrows():
            # Create a dictionary for each row and collect unique nodes
            node_id = row["entity"]
            if node_id not in nodes_dict:
                nodes_dict[node_id] = {
                    "id": node_id,
                    "properties": row.to_dict(),
                }
        return list(nodes_dict.values())

    # converts the relationships dataframe to a list of dicts for yfiles-jupyter-graphs
    def convert_relationships_to_dicts(df):
        """Convert the relationships dataframe to a list of dicts for yfiles-jupyter-graphs."""
        relationships = []
        for _, row in df.iterrows():
            # Create a dictionary for each row
            relationships.append({
                "start": row["source"],
                "end": row["target"],
                "properties": row.to_dict(),
            })
        return relationships

    w = GraphWidget()
    # use the converted data to visualize the graph
    w.nodes = convert_entities_to_dicts(result.context_data["entities"])
    w.edges = convert_relationships_to_dicts(result.context_data["relationships"])
    w.directed = True
    # show title on the node
    w.node_label_mapping = "entity"
    # use weight for edge thickness
    w.edge_thickness_factor_mapping = "weight"
    display(w)


show_graph(result)

GraphWidget(layout=Layout(height='700px', width='100%'))

### Question Generation

GraphRAG can also generate questions based on historical queries, which is useful for creating recommended questions in a chatbot dialogue. This method combines structured data from the knowledge graph with unstructured data from input documents to produce candidate questions related to specific entities.

In [29]:
question_generator = LocalQuestionGen(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    llm_params=llm_params,
    context_builder_params=local_context_params,
)

In [30]:
question_history = [
    "Tell me about Leonardo Da Vinci",
    "Leonardo's early works",
]

#### Generate questions based on history

In [31]:
candidate_questions = await question_generator.agenerate(
    question_history=question_history, context_data=None, question_count=5
)
candidate_questions.response

[" - What were some of Leonardo da Vinci's early works and their significance?",
 "- How did Leonardo da Vinci's early training influence his artistic style?",
 "- What are some notable paintings from Leonardo da Vinci's early career?",
 "- How did Leonardo da Vinci's early works contribute to his reputation as an artist?",
 '- What techniques and themes did Leonardo da Vinci explore in his early works?']
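
The generated strings above still carry list bullets and stray whitespace. Before showing them in a chatbot UI you may want to tidy them up; `clean_questions` below is a hypothetical helper that assumes each item is a single line with an optional leading `-`.

```python
def clean_questions(raw_questions):
    """Strip surrounding whitespace and a leading '-' bullet from each question."""
    cleaned = []
    for raw in raw_questions:
        text = raw.strip()
        if text.startswith("-"):
            text = text[1:].strip()
        if text:  # skip items that are empty after cleaning
            cleaned.append(text)
    return cleaned


sample = [
    " - What were some of Leonardo da Vinci's early works and their significance?",
    "- How did Leonardo da Vinci's early training influence his artistic style?",
]
print(clean_questions(sample))
```

In the notebook, the same call would be `clean_questions(candidate_questions.response)`.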

### Global Search example

The global search method generates answers by searching over all AI-generated community reports in a map-reduce fashion. It is resource-intensive, but it often gives good responses to questions that require an understanding of the dataset as a whole (e.g. "What are the main themes in Leonardo da Vinci's life and work?").
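
The map-reduce pattern itself can be sketched in a few lines. This is a toy illustration only: in GraphRAG both steps are LLM calls over community reports, whereas here simple keyword counting stands in for them, and the reports are invented placeholders.

```python
def map_report(report, question_terms):
    """Map step: score one report by how many question terms it mentions."""
    text = report["content"].lower()
    score = sum(1 for term in question_terms if term in text)
    return {"title": report["title"], "score": score}


def reduce_points(points, top_n=2):
    """Reduce step: drop zero scores and keep the highest-scoring report titles."""
    ranked = sorted(
        (p for p in points if p["score"] > 0),
        key=lambda p: p["score"],
        reverse=True,
    )
    return [p["title"] for p in ranked[:top_n]]


# Placeholder reports, not taken from the real index.
toy_reports = [
    {"title": "Report A", "content": "Leonardo painted in Milan."},
    {"title": "Report B", "content": "A family network in Vinci."},
    {"title": "Report C", "content": "Leonardo and his Milan patrons."},
]
points = [map_report(r, ["leonardo", "milan"]) for r in toy_reports]
print(reduce_points(points))
```

Each report is processed independently in the map step, which is why the real search can run many LLM calls concurrently before a final reduce call combines them.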

In [32]:
from graphrag.query.structured_search.global_search.community_context import (
    GlobalCommunityContext,
)
from graphrag.query.structured_search.global_search.search import GlobalSearch

#### Build global context based on community reports

In [33]:
context_builder = GlobalCommunityContext(
    community_reports=reports,
    entities=entities,  # default to None if you don't want to use community weights for ranking
    token_encoder=token_encoder,
)

#### Perform global search

In [34]:
global_context_builder_params = {
    "use_community_summary": False,  # False means using full community reports. True means using community short summaries.
    "shuffle_data": True,
    "include_community_rank": True,
    "min_community_rank": 0,
    "community_rank_name": "rank",
    "include_community_weight": True,
    "community_weight_name": "occurrence weight",
    "normalize_community_weight": True,
    "max_tokens": 12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
    "context_name": "Reports",
}

map_llm_params = {
    "max_tokens": 1000,
    "temperature": 0.0,
    "response_format": {"type": "json_object"},
}

reduce_llm_params = {
    "max_tokens": 2000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000-1500)
    "temperature": 0.0,
}

In [35]:
search_engine = GlobalSearch(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    max_data_tokens=12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
    map_llm_params=map_llm_params,
    reduce_llm_params=reduce_llm_params,
    allow_general_knowledge=False,  # setting this to True adds an instruction encouraging the LLM to incorporate general knowledge in the response, which may increase hallucinations but can be useful in some use cases.
    json_mode=True,  # set this to False if your LLM model does not support JSON mode.
    context_builder_params=global_context_builder_params,
    concurrent_coroutines=32,
    response_type="multiple paragraphs",  # free form text describing the response type and format, can be anything, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report
)

In [36]:
result = await search_engine.asearch(
    "Who is Leonardo Da Vinci?"
)

result.response

" Leonardo da Vinci is a renowned artist, scientist, engineer, and inventor, celebrated for his extraordinary talent and versatility across various fields. He is known for masterpieces such as the Mona Lisa and The Last Supper, which highlight his artistic prowess [Data: Reports (23, 12, 15, 17, 24, +more)].\n\nLeonardo da Vinci is a central figure in the artistic community, with numerous relationships connecting him to other entities. His role as an artist, inventor, and engineer is well-documented. His significant relationships with patrons and rulers, including Ludovico Sforza and King Francis I, provided him with the support and resources needed to create his masterpieces and pursue his scientific interests [Data: Reports (12, 15, 17, 24, +more)].\n\nLeonardo da Vinci's scientific contributions are notable, with his notebooks containing sketches and theories that anticipated later scientific discoveries. His work on anatomy, engineering, and natural phenomena showcases his interdis

In [37]:
# inspect the data used to build the context for the LLM responses
result.context_data["reports"]

Unnamed: 0,id,title,occurrence weight,content,rank
0,2,Milan and Leonardo da Vinci's Legacy,1.0,# Milan and Leonardo da Vinci's Legacy\n\nThe ...,7.5
1,23,Leonardo da Vinci and His Artistic Community,1.0,# Leonardo da Vinci and His Artistic Community...,7.5
2,12,Leonardo da Vinci and His Renaissance Network,0.875,# Leonardo da Vinci and His Renaissance Networ...,8.5
3,21,Ludovico Sforza and His Associates: A Historic...,0.625,# Ludovico Sforza and His Associates: A Histor...,7.5
4,14,Leonardo da Vinci and Florence Community,0.5,# Leonardo da Vinci and Florence Community\n\n...,7.5
5,15,Mona Lisa and Francis I Community,0.5,# Mona Lisa and Francis I Community\n\nThe com...,7.5
6,13,Lucrezia Crivelli and Her Artistic Connections,0.5,# Lucrezia Crivelli and Her Artistic Connectio...,7.0
7,17,The Last Supper and Its Historical Context,0.375,# The Last Supper and Its Historical Context\n...,7.5
8,24,Leonardo da Vinci's The Virgin of the Rocks an...,0.375,# Leonardo da Vinci's The Virgin of the Rocks ...,7.5
9,20,Leonardo da Vinci's Annunciation and Its Museums,0.25,# Leonardo da Vinci's Annunciation and Its Mus...,7.5


In [38]:
# inspect number of LLM calls and tokens
print(f"LLM calls: {result.llm_calls}. LLM tokens: {result.prompt_tokens}")

LLM calls: 2. LLM tokens: 13917


### Neo4j Import of GraphRAG Result Parquet files

This notebook imports the results of the GraphRAG indexing process into the Neo4j Graph database for further processing, analysis or visualization. 

You can also build your own GenAI applications using Neo4j and a number of RAG strategies with LangChain, LlamaIndex, Haystack, and many other frameworks.
See: https://neo4j.com/labs/genai-ecosystem

Here is what the end result looks like:

![](https://dev.assets.neo4j.com/wp-content/uploads/graphrag-neo4j-visualization.png)

#### How does it work?

The notebook reads the parquet files from the `output` folder of your indexing process into Pandas dataframes.
It then sends the data to Neo4j in batches to create nodes and relationships and to set their properties. The id-arrays on most entities are turned into relationships.

All operations use MERGE, so they are idempotent, and you can run the script multiple times.
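
As a rough illustration of the two ideas above (id-arrays becoming relationships, and re-runs being harmless), here is a minimal in-memory sketch. It does not talk to Neo4j; the helper `expand_id_array` and the sample rows are hypothetical, and a Python `set` stands in for the effect of `MERGE`.

```python
def expand_id_array(rows, source_key, array_key):
    """Yield one (source_id, target_id) pair per element of the id-array."""
    for row in rows:
        for target_id in row[array_key]:
            yield (row[source_key], target_id)


# Hypothetical rows shaped like the text-unit dataframe (id + document_ids).
text_units = [
    {"id": "chunk-1", "document_ids": ["doc-1"]},
    {"id": "chunk-2", "document_ids": ["doc-1", "doc-2"]},
]

# A set plays the role of MERGE here: adding an existing pair changes nothing.
edges = set()
for _ in range(2):  # simulate running the import twice
    edges.update(expand_id_array(text_units, "id", "document_ids"))

print(sorted(edges))
```

Running the loop twice leaves three pairs, not six, which is the behavior the `MERGE`-based statements aim for in the database.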

If you need to clean out the database, you can run the following statement:

```cypher
MATCH (n)
CALL { WITH n DETACH DELETE n } IN TRANSACTIONS OF 25000 ROWS;
```

In [39]:
import time

import pandas as pd
from neo4j import GraphDatabase

#### Neo4j Installation

You can create a free instance of Neo4j [online](https://console.neo4j.io). You get a credentials file that you can use for the connection credentials. You can also get an instance in any of the cloud marketplaces.

If you want to install Neo4j locally either use [Neo4j Desktop](https://neo4j.com/download) or 
the official Docker image: `docker run -e NEO4J_AUTH=neo4j/password -p 7687:7687 -p 7474:7474 neo4j` 

In [40]:
NEO4J_URI = "neo4j://localhost"  # or neo4j+s://xxxx.databases.neo4j.io
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "email-double-forest-senior-mobile-1426"  # your password
NEO4J_DATABASE = "neo4j"

# Create a Neo4j driver
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

#### Batched Import

The batched import function takes a Cypher insert statement (which must use the variable `value` for the row) and a dataframe to import.
By default, it sends 1,000 rows at a time to the database as a query parameter.

In [41]:
def batched_import(statement, df, batch_size=1000):
    """
    Import a dataframe into Neo4j using a batched approach.

    Parameters: statement is the Cypher query to execute, df is the dataframe to import, and batch_size is the number of rows to import in each batch.
    """
    total = len(df)
    start_s = time.time()
    for start in range(0, total, batch_size):
        batch = df.iloc[start : min(start + batch_size, total)]
        result = driver.execute_query(
            "UNWIND $rows AS value " + statement,
            rows=batch.to_dict("records"),
            database_=NEO4J_DATABASE,
        )
        print(result.summary.counters)
    print(f"{total} rows in {time.time() - start_s} s.")
    return total
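
To see which row ranges a given `batch_size` produces without touching the database, the slicing logic can be isolated. This is a small sketch; `batch_bounds` is a hypothetical helper that only mirrors the `range`/`min` arithmetic of `batched_import`.

```python
def batch_bounds(total, batch_size=1000):
    """Return the (start, stop) row ranges a batched import would send."""
    return [
        (start, min(start + batch_size, total))
        for start in range(0, total, batch_size)
    ]


# 2,500 rows with the default batch size are sent as three batches.
print(batch_bounds(2500))
```

The last batch is simply shorter than the others, and an empty dataframe results in no queries at all.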

#### Indexes and Constraints

Indexes in Neo4j are only used to find the starting points for graph queries, e.g. quickly finding two nodes to connect.
Constraints exist to avoid duplicates; we create them mostly on the ids of entity types.

We use some types as markers, with two underscores before and after, to distinguish them from the actual entity types:

* `__Entity__`
* `__Document__`
* `__Chunk__`
* `__Community__`
* `__Covariate__`

The default relationship type here is `RELATED`, but we could also infer a real relationship type from the description or the types of the start and end nodes.

In [42]:
# create constraints, idempotent operation

statements = """
create constraint chunk_id if not exists for (c:__Chunk__) require c.id is unique;
create constraint document_id if not exists for (d:__Document__) require d.id is unique;
create constraint community_id if not exists for (c:__Community__) require c.community is unique;
create constraint entity_id if not exists for (e:__Entity__) require e.id is unique;
create constraint entity_title if not exists for (e:__Entity__) require e.name is unique;
create constraint covariate_title if not exists for (e:__Covariate__) require e.title is unique;
create constraint related_id if not exists for ()-[rel:RELATED]->() require rel.id is unique;
""".split(";")

for statement in statements:
    if len((statement or "").strip()) > 0:
        print(statement)
        driver.execute_query(statement)


create constraint chunk_id if not exists for (c:__Chunk__) require c.id is unique

create constraint document_id if not exists for (d:__Document__) require d.id is unique

create constraint community_id if not exists for (c:__Community__) require c.community is unique

create constraint entity_id if not exists for (e:__Entity__) require e.id is unique

create constraint entity_title if not exists for (e:__Entity__) require e.name is unique

create constraint covariate_title if not exists for (e:__Covariate__) require e.title is unique

create constraint related_id if not exists for ()-[rel:RELATED]->() require rel.id is unique
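
Constraint names must be unique in a database, and to our understanding `if not exists` skips a statement without an error when a constraint with the same name already exists. A quick check for repeated names before running a script can therefore be useful. The sketch below is an assumption-laden helper (`duplicate_constraint_names` and the sample script are hypothetical) that uses a simple regular expression rather than a real Cypher parser.

```python
import re
from collections import Counter

# Captures the constraint name; the lookahead skips unnamed constraints
# of the form "create constraint if not exists ...".
NAME_PATTERN = re.compile(r"create\s+constraint\s+(?!if\b)(\w+)", re.IGNORECASE)


def duplicate_constraint_names(script):
    """Return the constraint names that occur more than once in the script."""
    names = [
        match.group(1)
        for statement in script.split(";")
        if (match := NAME_PATTERN.search(statement))
    ]
    return sorted(name for name, count in Counter(names).items() if count > 1)


sample = """
create constraint chunk_id if not exists for (c:__Chunk__) require c.id is unique;
create constraint entity_id if not exists for (c:__Community__) require c.community is unique;
create constraint entity_id if not exists for (e:__Entity__) require e.id is unique;
"""

print(duplicate_constraint_names(sample))
```

An empty result means every named constraint in the script has its own name.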


#### Import Process

##### Importing the Documents

We load the parquet file for the documents, create a node per document id, and set the title property.
We don't need to store `text_unit_ids`, as we can create the relationships from the chunks, and the text content is also contained in the chunks.

In [43]:
doc_df = pd.read_parquet(
    f"{INPUT_DIR}/create_final_documents.parquet", columns=["id", "title"]
)
doc_df.head(2)

Unnamed: 0,id,title
0,93341e4fe83c1fbe4d419240f35dfd17,davinci.txt


In [44]:
# Import documents
statement = """
MERGE (d:__Document__ {id:value.id})
SET d += value {.title}
"""

batched_import(statement, doc_df)

{'_contains_updates': True, 'labels_added': 1, 'nodes_created': 1, 'properties_set': 2}
1 rows in 0.07135295867919922 s.


1

##### Loading Text Units

We load the text units, create a node per id and set the text and number of tokens.
Then we connect them to the documents that we created before.

In [45]:
text_df = pd.read_parquet(
    f"{INPUT_DIR}/create_final_text_units.parquet",
    columns=["id", "text", "n_tokens", "document_ids"],
)
text_df.head(2)

Unnamed: 0,id,text,n_tokens,document_ids
0,909b09dc034ddc0cfa8c635f3b7ad5c5,﻿The Project Gutenberg eBook of Leonardo Da Vi...,1200,[93341e4fe83c1fbe4d419240f35dfd17]
1,12f4bdb549741e42ac24cc4002ac46ee,"by his grandfather\nAntonio, in whose house h...",1200,[93341e4fe83c1fbe4d419240f35dfd17]


In [46]:
statement = """
MERGE (c:__Chunk__ {id:value.id})
SET c += value {.text, .n_tokens}
WITH c, value
UNWIND value.document_ids AS document
MATCH (d:__Document__ {id:document})
MERGE (c)-[:PART_OF]->(d)
"""

batched_import(statement, text_df)

{'_contains_updates': True, 'labels_added': 10, 'relationships_created': 10, 'nodes_created': 10, 'properties_set': 30}
10 rows in 0.08815479278564453 s.


10

##### Loading Nodes

For the entity nodes we store the id, name, description, embedding (if available), and human-readable id.

In [47]:
entity_df = pd.read_parquet(
    f"{INPUT_DIR}/create_final_entities.parquet",
    columns=[
        "name",
        "type",
        "description",
        "human_readable_id",
        "id",
        "description_embedding",
        "text_unit_ids",
    ],
)
entity_df.head(2)

Unnamed: 0,name,type,description,human_readable_id,id,description_embedding,text_unit_ids
0,PROJECT GUTENBERG,ORGANIZATION,Project Gutenberg is an organization that pro...,0,b45241d70f0e43fca764df95b2b81f77,"[-0.0216522216796875, -0.005603790283203125, -...","[357e55788727fd070ce35607043d4169, 909b09dc034..."
1,LEONARDO DA VINCI,PERSON,"Leonardo da Vinci was a renowned artist, poly...",1,4119fd06010c494caa07f439b333f4c5,"[-0.003841400146484375, 0.00652313232421875, -...","[357e55788727fd070ce35607043d4169, 39dcda694bb..."


In [48]:
entity_statement = """
MERGE (e:__Entity__ {id:value.id})
SET e += value {.human_readable_id, .description, name:replace(value.name,'"','')}
WITH e, value
CALL db.create.setNodeVectorProperty(e, "description_embedding", value.description_embedding)
CALL apoc.create.addLabels(e, case when coalesce(value.type,"") = "" then [] else [apoc.text.upperCamelCase(replace(value.type,'"',''))] end) yield node
UNWIND value.text_unit_ids AS text_unit
MATCH (c:__Chunk__ {id:text_unit})
MERGE (c)-[:HAS_ENTITY]->(e)
"""

batched_import(entity_statement, entity_df)

{'_contains_updates': True, 'labels_added': 171, 'relationships_created': 234, 'nodes_created': 171, 'properties_set': 684}
171 rows in 0.4161531925201416 s.


171
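
The entity statement adds a second label derived from the `type` column through `apoc.text.upperCamelCase`. To preview roughly which labels to expect, the conversion can be approximated in plain Python. This is only an approximation under our own assumptions, not APOC's actual implementation, and `entity_labels` is a hypothetical helper.

```python
import re


def entity_labels(entity_type):
    """Approximate the extra label list: [] for a missing or empty type,
    otherwise a single UpperCamelCase label with quotes stripped."""
    cleaned = (entity_type or "").replace('"', "")
    words = [word for word in re.split(r"[^0-9A-Za-z]+", cleaned) if word]
    if not words:
        return []
    return ["".join(word[:1].upper() + word[1:].lower() for word in words)]


for raw in ["PERSON", '"ORGANIZATION"', "", None]:
    print(repr(raw), "->", entity_labels(raw))
```

With this approximation, `PERSON` maps to the label `Person`, and rows without a type keep only the `__Entity__` marker label.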

##### Import Relationships

For the relationships we find the source and target node by name, using the base `__Entity__` type.
After creating the `RELATED` relationships, we set the description and the remaining columns as properties.

In [49]:
rel_df = pd.read_parquet(
    f"{INPUT_DIR}/create_final_relationships.parquet",
    columns=[
        "source",
        "target",
        "id",
        "rank",
        "weight",
        "human_readable_id",
        "description",
        "text_unit_ids",
    ],
)
rel_df.head(2)

Unnamed: 0,source,target,id,rank,weight,human_readable_id,description,text_unit_ids
0,PROJECT GUTENBERG,LEONARDO DA VINCI,c8b2408617804483b620e1a6691ac90d,57,9.0,0,Project Gutenberg provides an eBook about Leon...,"[357e55788727fd070ce35607043d4169, 909b09dc034..."
1,PROJECT GUTENBERG,JULIET SUTHERLAND,a5e0d1644eb547ba9a5c3211aac4631a,5,7.0,1,Juliet Sutherland produced the eBook for Proje...,[909b09dc034ddc0cfa8c635f3b7ad5c5]


In [50]:
rel_statement = """
    MATCH (source:__Entity__ {name:replace(value.source,'"','')})
    MATCH (target:__Entity__ {name:replace(value.target,'"','')})
    // we merge on id; as there is only one relationship per pair, merging on the pair alone would also work
    MERGE (source)-[rel:RELATED {id: value.id}]->(target)
    SET rel += value {.rank, .weight, .human_readable_id, .description, .text_unit_ids}
    RETURN count(*) as createdRels
"""

batched_import(rel_statement, rel_df)

{'_contains_updates': True, 'relationships_created': 185, 'properties_set': 1110}
185 rows in 0.14702892303466797 s.


185
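
Because the relationship statement uses `MATCH` on entity names, a row whose source or target name has no matching `__Entity__` node produces no relationship and no error. A pre-check on the records can surface such rows. The sketch below assumes records shaped like `rel_df` and `entity_df`; `unmatched_relationships` and the sample data are hypothetical.

```python
def unmatched_relationships(relationships, entities):
    """Return ids of relationship rows whose source or target name
    does not correspond to any entity name (quotes stripped on both sides)."""

    def normalize(name):
        return name.replace('"', "")

    known = {normalize(entity["name"]) for entity in entities}
    return [
        rel["id"]
        for rel in relationships
        if normalize(rel["source"]) not in known
        or normalize(rel["target"]) not in known
    ]


# Hypothetical sample records.
entities = [{"name": '"PROJECT GUTENBERG"'}, {"name": "LEONARDO DA VINCI"}]
relationships = [
    {"id": "r1", "source": "PROJECT GUTENBERG", "target": "LEONARDO DA VINCI"},
    {"id": "r2", "source": "PROJECT GUTENBERG", "target": "UNKNOWN PERSON"},
]

print(unmatched_relationships(relationships, entities))
```

With the real data, the same function could be fed `rel_df.to_dict("records")` and `entity_df.to_dict("records")`; an empty list suggests every row can be matched.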

##### Importing Communities

For communities we import their id, title, and level.
We connect the `__Community__` nodes to the start and end nodes of the relationships they refer to.

Connecting them to the chunks they originate from is optional, as the entities are already connected to the chunks.

In [51]:
community_df = pd.read_parquet(
    f"{INPUT_DIR}/create_final_communities.parquet",
    columns=["id", "level", "title", "text_unit_ids", "relationship_ids"],
)

community_df.head(2)

Unnamed: 0,id,level,title,text_unit_ids,relationship_ids
0,8,0,Community 8,"[357e55788727fd070ce35607043d4169,909b09dc034d...","[c8b2408617804483b620e1a6691ac90d, a5e0d1644eb..."
1,0,0,Community 0,"[357e55788727fd070ce35607043d4169,39dcda694bba...","[35489ca6a63b47d6a8913cf333818bc1, 5d3344f45e6..."


In [52]:
statement = """
MERGE (c:__Community__ {community:value.id})
SET c += value {.level, .title}
/*
UNWIND value.text_unit_ids as text_unit_id
MATCH (t:__Chunk__ {id:text_unit_id})
MERGE (c)-[:HAS_CHUNK]->(t)
WITH distinct c, value
*/
WITH *
UNWIND value.relationship_ids as rel_id
MATCH (start:__Entity__)-[:RELATED {id:rel_id}]->(end:__Entity__)
MERGE (start)-[:IN_COMMUNITY]->(c)
MERGE (end)-[:IN_COMMUNITY]->(c)
RETURN count(distinct c) as createdCommunities
"""

batched_import(statement, community_df)

{'_contains_updates': True, 'labels_added': 25, 'relationships_created': 412, 'nodes_created': 25, 'properties_set': 75}
25 rows in 0.24241280555725098 s.


25
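
The community statement derives membership indirectly: each id in `relationship_ids` is resolved to a relationship, and both of its endpoints are attached to the community. A plain-Python sketch of that resolution, with a hypothetical helper `community_members` and made-up sample records, looks like this:

```python
def community_members(communities, relationships):
    """Map each community id to the set of entity names found at either
    end of the relationships it lists. Unknown relationship ids are skipped,
    much like a MATCH that finds nothing."""
    endpoints = {rel["id"]: (rel["source"], rel["target"]) for rel in relationships}
    members = {}
    for community in communities:
        names = set()
        for rel_id in community["relationship_ids"]:
            names.update(endpoints.get(rel_id, ()))
        members[community["id"]] = names
    return members


# Hypothetical sample records.
relationships = [
    {"id": "r1", "source": "A", "target": "B"},
    {"id": "r2", "source": "B", "target": "C"},
]
communities = [
    {"id": "0", "relationship_ids": ["r1", "r2"]},
    {"id": "1", "relationship_ids": ["r2", "missing"]},
]

print(community_members(communities, relationships))
```

An entity that appears in several listed relationships is counted once per community, which corresponds to the `MERGE` on `IN_COMMUNITY`.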

##### Importing Community Reports

For the community reports, we merge onto the `__Community__` node of each community and set its level, title, summary, rank, rank_explanation, and full_content.
Each finding is created as a `Finding` node in the context of its community and attached via `HAS_FINDING`.

In [53]:
community_report_df = pd.read_parquet(
    f"{INPUT_DIR}/create_final_community_reports.parquet",
    columns=[
        "id",
        "community",
        "level",
        "title",
        "summary",
        "findings",
        "rank",
        "rank_explanation",
        "full_content",
    ],
)
community_report_df.head(2)

Unnamed: 0,id,community,level,title,summary,findings,rank,rank_explanation,full_content
0,2f0f8811-02f5-487a-9dea-0aee92010f06,23,2,Leonardo da Vinci and His Artistic Community,The community revolves around Leonardo da Vinc...,[{'explanation': 'Leonardo da Vinci is the cen...,7.5,The impact severity rating is high due to the ...,# Leonardo da Vinci and His Artistic Community...
1,aa63d12b-619b-41e1-9aac-061541905a63,24,2,Leonardo da Vinci's The Virgin of the Rocks an...,The community revolves around Leonardo da Vinc...,[{'explanation': ''The Virgin of the Rocks' is...,7.5,The impact severity rating is high due to the ...,# Leonardo da Vinci's The Virgin of the Rocks ...


In [54]:
# Import community reports
community_statement = """
MERGE (c:__Community__ {community:value.community})
SET c += value {.level, .title, .rank, .rank_explanation, .full_content, .summary}
WITH c, value
UNWIND range(0, size(value.findings)-1) AS finding_idx
WITH c, value, finding_idx, value.findings[finding_idx] as finding
MERGE (c)-[:HAS_FINDING]->(f:Finding {id:finding_idx})
SET f += finding
"""
batched_import(community_statement, community_report_df)

{'_contains_updates': True, 'labels_added': 137, 'relationships_created': 137, 'nodes_created': 137, 'properties_set': 561}
25 rows in 0.08192276954650879 s.


25
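
The report statement walks the `findings` list by index and creates one `Finding` per position within its community. The same flattening can be sketched in Python to inspect the findings as rows. `flatten_findings` is a hypothetical helper, and the `summary` key in the sample data is an assumption (only `explanation` is visible in the dataframe preview above).

```python
def flatten_findings(reports):
    """Return one row per finding, keyed by community and position in the list."""
    rows = []
    for report in reports:
        for finding_idx, finding in enumerate(report["findings"]):
            rows.append(
                {"community": report["community"], "finding_id": finding_idx, **finding}
            )
    return rows


# Hypothetical sample reports; the "summary" key is assumed.
reports = [
    {
        "community": "23",
        "findings": [
            {"summary": "Central figure", "explanation": "Leonardo is the center."},
            {"summary": "Patrons", "explanation": "Patrons supported his work."},
        ],
    },
    {"community": "24", "findings": []},
]

for row in flatten_findings(reports):
    print(row["community"], row["finding_id"], row["summary"])
```

Note that the index restarts at 0 for every community, so a finding is only identified by the pair of community and index.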

##### Importing Covariates

Covariates are, for instance, claims on entities. We connect them to the chunks they originate from.

In [55]:
# cov_df = pd.read_parquet(
#     f"{INPUT_DIR}/create_final_covariates.parquet",
#     # columns=["id", "text_unit_id"],
# )
# cov_df.head(2)
# # Subject ids do not match entity ids

In [56]:
# # Import covariates
# cov_statement = """
# MERGE (c:__Covariate__ {id:value.id})
# SET c += apoc.map.clean(value, ["text_unit_id", "document_ids", "n_tokens"], [NULL, ""])
# WITH c, value
# MATCH (ch:__Chunk__ {id: value.text_unit_id})
# MERGE (ch)-[:HAS_COVARIATE]->(c)
# """
# batched_import(cov_statement, cov_df)

##### Visualize your data

You can now open Neo4j on Aura; you need to log in with either SSO or your credentials.

Or open https://workspace-preview.neo4j.io and connect to your local instance. Remember that the URI is `neo4j://localhost`, the username is `neo4j`, and the password is `password`.

In "Explore" you can explore by using visual graph patterns and then explore and expand further.

In "Query", you can open the left sidebar and explore by clicking on the nodes and relationships.
You can also use the co-pilot to generate Cypher queries for you. Here are some examples.

###### Show a few `__Entity__` nodes and their relationships (Entity Graph)

```cypher
MATCH path = (:__Entity__)-[:RELATED]->(:__Entity__)
RETURN path LIMIT 200
```

###### Show the Chunks and the Document (Lexical Graph)

```cypher
MATCH (d:__Document__) WITH d LIMIT 1
MATCH path = (d)<-[:PART_OF]-(c:__Chunk__)
RETURN path LIMIT 100
```

###### Show a Community and its Entities

```cypher
MATCH (c:__Community__) WITH c LIMIT 1
MATCH path = (c)<-[:IN_COMMUNITY]-()-[:RELATED]-(:__Entity__)
RETURN path LIMIT 100
```

###### Show everything

```cypher
MATCH (d:__Document__) WITH d LIMIT 1
MATCH path = (d)<-[:PART_OF]-(:__Chunk__)-[:HAS_ENTITY]->()-[:RELATED]-()-[:IN_COMMUNITY]->()
RETURN path LIMIT 250
```

We showed the visualization of this last query at the beginning.