*Author: [Daniel Puente Viejo](https://www.linkedin.com/in/danielpuenteviejo/)*

## **GraphRAG Tutorial — Neo4j + LLMs**

<p align="left">
  <img src="./imgs/cover_page.gif" alt="alt text" width="700"/>
</p>

This tutorial is a practical example of how to build a **GraphRAG** (Retrieval-Augmented Generation) system using **Neo4j** as the knowledge graph database and **OpenAI**'s language models, accessed here through Azure OpenAI, for natural language processing tasks.   
We will also use **LangChain** to integrate the LLMs with the graph database.

### **Index:**

- <a href='#1'><ins>1. Setup</ins></a>
    - <a href='#1.1'><ins>1.1 Install Libraries</ins></a>
    - <a href='#1.2'><ins>1.2 Environment Variables</ins></a>
    - <a href='#1.3'><ins>1.3 Initialize the LLM and Embeddings</ins></a>
    - <a href='#1.4'><ins>1.4 Initialize the graph</ins></a>
    - <a href='#1.5'><ins>1.5 Load the chunks</ins></a>
- <a href='#2'><ins>2. Graph construction</ins></a>
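    - <a href='#2.1'><ins>2.1 Adapt the format</ins></a>
    - <a href='#2.2'><ins>2.2 Identify the nodes and relationships</ins></a>
    - <a href='#2.3'><ins>2.3 LLM to identify the nodes and relationships</ins></a>
    - <a href='#2.4'><ins>2.4 Add the nodes</ins></a>
    - <a href='#2.5'><ins>2.5 Create the keyword index</ins></a>
    - <a href='#2.6'><ins>2.6 Add the embeddings</ins></a>
- <a href='#3'><ins>3. Retriever</ins></a>
    - <a href='#3.1'><ins>3.1 Vector retriever (Classical RAG)</ins></a>
    - <a href='#3.2'><ins>3.2 Hybrid retriever (Graph RAG)</ins></a>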
- <a href='#4'><ins>4. Extra commands</ins></a>
    - <a href='#4.1'><ins>4.1 Delete all nodes</ins></a>
    - <a href='#4.2'><ins>4.2 Close the session</ins></a>
    - <a href='#4.3'><ins>4.3 See the available indexes in the Neo4j database</ins></a>
    - <a href='#4.4'><ins>4.4 See all the filenames</ins></a>
    - <a href='#4.5'><ins>4.5 How to filter a concrete id</ins></a>

## <a id='1' style="color: skyblue;">**1. Setup**</a>

###  <a id='1.1'>**1.1 Install Libraries**</a>

In [1]:
# Standard Imports
from dotenv import load_dotenv
import os
import pandas as pd
import ast

# LangChain Imports
from langchain.schema import Document
from langchain_openai import AzureChatOpenAI
from langchain_community.graphs.neo4j_graph import Neo4jGraph
from langchain_community.vectorstores.neo4j_vector import Neo4jVector
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_huggingface import HuggingFaceEmbeddings

# Neo4j Imports
from neo4j import GraphDatabase
from neo4j_graphrag.retrievers import VectorRetriever
from neo4j_graphrag.retrievers.hybrid import HybridRetriever
from neo4j_graphrag.embeddings.sentence_transformers import SentenceTransformerEmbeddings
from neo4j_graphrag.llm import AzureOpenAILLM
from neo4j_graphrag.generation import GraphRAG

### <a id='1.2'>**1.2 Environment Variables**</a>

Here we load the environment variables from a `.env` file. Make sure the file contains your OpenAI credentials (`OPENAI_KEY`, `OPENAI_VERSION`, `OPENAI_ENDPOINT`) and your Neo4j credentials (`NEO4J_URI`, `NEO4J_USERNAME`, `NEO4J_PASSWORD`).

*Important:* In case of any connection issue with Neo4j, try changing the `NEO4J_URI` from `neo4j+s://` to `neo4j+ssc://`.

In [2]:
load_dotenv()
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_KEY')
os.environ["NEO4J_URI"] = os.getenv('NEO4J_URI') # In case of any issue, change neo4j+s to neo4j+ssc
os.environ["NEO4J_USERNAME"] = os.getenv('NEO4J_USERNAME')
os.environ["NEO4J_PASSWORD"] = os.getenv('NEO4J_PASSWORD')
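A missing variable usually only surfaces later as a confusing connection or authentication error, so it can help to check the configuration up front. The following is a minimal sketch, not part of the original notebook: the helper names `missing_env_vars` and `relax_certificate_check` are hypothetical, and the host `example.com` is a placeholder. The second helper applies the `neo4j+s://` to `neo4j+ssc://` swap mentioned above.

```python
import os

# Variable names used throughout this tutorial's .env file.
REQUIRED_VARS = [
    "OPENAI_KEY", "OPENAI_VERSION", "OPENAI_ENDPOINT",
    "NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD",
]


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def relax_certificate_check(uri):
    """Swap the neo4j+s:// scheme for neo4j+ssc:// and leave other URIs unchanged."""
    prefix = "neo4j+s://"
    if uri.startswith(prefix):
        return "neo4j+ssc://" + uri[len(prefix):]
    return uri


# Hypothetical usage with a partial configuration and a placeholder host.
print(missing_env_vars({"OPENAI_KEY": "abc"}))
print(relax_certificate_check("neo4j+s://example.com"))
```

Calling `missing_env_vars()` without arguments right after `load_dotenv()` checks the real environment.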

### <a id='1.3'>**1.3 Initialize the LLM and Embeddings**</a>

For this project, we will use OpenAI's `gpt-4o-mini` (served through Azure OpenAI) for text generation and `sentence-transformers/all-MiniLM-L6-v2`, running on CPU, for generating embeddings.

In [3]:
llm = AzureChatOpenAI(
    model_name="gpt-4o-mini",
    api_key=os.getenv('OPENAI_KEY'),
    api_version=os.getenv('OPENAI_VERSION'),
    azure_endpoint=os.getenv('OPENAI_ENDPOINT'),
    temperature=0
)

model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)


### <a id='1.4'>**1.4 Initialize the graph**</a>

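Here we create the `Neo4jGraph` instance, the driver, and a session to interact with the Neo4j database. `Neo4jGraph()` is called without arguments because it reads `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` from the environment variables set above.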

In [4]:
graph = Neo4jGraph()
driver = GraphDatabase.driver(uri=os.getenv('NEO4J_URI'), auth=(os.getenv('NEO4J_USERNAME'), os.getenv('NEO4J_PASSWORD')))
session = driver.session()


### <a id='1.5'>**1.5 Load the chunks**</a>

In this section, we will load the document chunks that will later be ingested into the Neo4j graph database.

In this case, the chunks are already stored in a CSV file. It is a very simple CSV file that contains all the information from the text files in `/data/original_data/`. You can take a look at them if you want to see the original documents.   
The data is synthetic and has been generated just for this tutorial.

In [5]:
PATH = "data/documents.csv"

df = pd.read_csv(PATH)
df

| Unnamed: 0 | chunk_id | text | filename | metadata |
|---|---|---|---|---|
| 0 | 7f8d58f9be5543c0952d60084f7eb160 | During the 2024 European Society for Medical O... | oncology_update_spain.txt | {"filename": "oncology_update_spain.txt", "cou... |
| 1 | c006dc492ec541b8bc9ee9aa9e634d30 | Another highlight came from the collaborative ... | oncology_update_spain.txt | {"filename": "oncology_update_spain.txt", "cou... |
| 2 | bbc171c93c5a47c1805d888489d75ea9 | The Spanish Oncology Network (RedOnco) announc... | oncology_update_spain.txt | {"filename": "oncology_update_spain.txt", "cou... |
| 3 | 69c14a8b63da4cf3b18fabbd00cefb4f | The CARDI-RED study, conducted by the Mayo Cli... | cardio_study_usa.txt | {"filename": "cardio_study_usa.txt", "country"... |
| 4 | a6a430bdc7764d3185888aa4beb3d026 | A genomic substudy within CARDI-RED identified... | cardio_study_usa.txt | {"filename": "cardio_study_usa.txt", "country"... |
| 5 | 2b501bb8caf74fb9ab814f8af9530576 | A genomic substudy within CARDI-RED identified... | cardio_study_usa.txt | {"filename": "cardio_study_usa.txt", "country"... |
| 6 | c80f7c811e8d485e832a445f6199f967 | The IL-6 Inhibition Trial (UK-IL6-2024) was a ... | trial_inflammation_uk.txt | {"filename": "trial_inflammation_uk.txt", "cou... |
| 7 | cda3cfe5ae894b11bcbad195cee4a221 | A post-hoc biomarker analysis performed by the... | trial_inflammation_uk.txt | {"filename": "trial_inflammation_uk.txt", "cou... |


## <a id='2' style="color: skyblue;">**2. Graph construction**</a>

### <a id='2.1'>**2.1 Adapt the format**</a>

Neo4jGraph requires a specific format to ingest data. We will adapt the format of our CSV file to match the required structure.   
We need to use `Document` from `langchain.schema` to create the documents.

In [6]:
documents = [
    Document(
        page_content=str(element['text']),
        metadata=ast.literal_eval(element['metadata'])
    )
    for _, element in df.iterrows()
]

documents

[Document(metadata={'filename': 'oncology_update_spain.txt', 'country': 'Spain', 'topic': 'Oncology', 'source_type': 'Congress Summary', 'page_number': 1, 'title': 'ESMO 2024: Advances in EGFR Inhibition', 'keywords': ['AZD5478', 'EGFR', 'NSCLC', 'AstraZeneca', 'ESMO'], 'main_investigator': 'Dr. Isabel Romero', 'institution': 'Hospital Vall d’Hebron', 'year': 2024}, page_content='During the 2024 European Society for Medical Oncology (ESMO) Congress in Madrid, Spanish oncology researchers presented a series of groundbreaking updates on targeted and immunotherapy-based treatments. The centerpiece of the conference was AZD5478, a next-generation EGFR inhibitor developed by AstraZeneca, which demonstrated significant activity in non-small-cell lung cancer (NSCLC) patients harboring EGFR exon 20 insertions. The Phase II data showed an objective response rate of 56% and median progression-free survival (PFS) of 10.4 months, surpassing current standards of care.\n\nDr. Isabel Romero from Hosp
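The `metadata` column is stored as text, so each cell has to be parsed back into a dictionary before it can be attached to a `Document`. The cell above uses `ast.literal_eval`, which works as long as the text is also a valid Python literal. A slightly more defensive sketch, assuming the column may hold either JSON or a Python dict literal, is shown below. The helper name `parse_metadata` and the `reviewed` key are hypothetical and only serve the illustration.

```python
import ast
import json


def parse_metadata(raw):
    """Parse a metadata cell that holds either JSON or a Python dict literal."""
    try:
        # JSON handles double-quoted keys and the literals true/false/null.
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to Python literal syntax (single quotes, True/False/None).
        return ast.literal_eval(raw)


# A JSON-style cell and a Python-literal-style cell.
print(parse_metadata('{"filename": "cardio_study_usa.txt", "year": 2024}'))
print(parse_metadata("{'filename': 'cardio_study_usa.txt', 'reviewed': True}"))
```

The JSON branch matters when a value is `null`, `true`, or `false`, which `ast.literal_eval` alone would reject.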

### <a id='2.2'>**2.2 Identify the nodes and relationships**</a>

Now, we will define which node and relationship types can be extracted from our documents. These lists act as the schema that constrains the extraction in the next step.   
The nodes and relationships used in this example are dummy values for demonstration purposes only. You can modify them according to your needs.

* Nodes: Nodes represent entities or concepts in the graph.
* Relationships: Relationships define how nodes are connected. We will create relationships such as `WORKS

In [7]:
allowed_nodes = [
    "Drug", "Disease", "Biomarker", "Trial",
    "Researcher", "Institution", "Company", "Country"
]

allowed_relationships = [
    "TREATS", "TARGETS", "EVALUATED_IN", "STUDIES",
    "CONDUCTED_BY", "LEAD_BY", "SPONSORED_BY", "LOCATED_IN"
]
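To make the role of these two lists concrete, the sketch below filters candidate triples against them. It only illustrates the idea of a schema filter and is not how `LLMGraphTransformer` is implemented; the function `is_allowed` and the candidate triples are made up for the example.

```python
ALLOWED_NODES = {
    "Drug", "Disease", "Biomarker", "Trial",
    "Researcher", "Institution", "Company", "Country",
}

ALLOWED_RELATIONSHIPS = {
    "TREATS", "TARGETS", "EVALUATED_IN", "STUDIES",
    "CONDUCTED_BY", "LEAD_BY", "SPONSORED_BY", "LOCATED_IN",
}


def is_allowed(triple):
    """Check a (source_label, relationship_type, target_label) triple against the schema."""
    source_label, rel_type, target_label = triple
    return (
        source_label in ALLOWED_NODES
        and rel_type in ALLOWED_RELATIONSHIPS
        and target_label in ALLOWED_NODES
    )


# Hypothetical candidates: only the first one fits the schema.
candidates = [
    ("Drug", "TREATS", "Disease"),
    ("Researcher", "EMPLOYED_BY", "Institution"),  # relationship type not in the schema
    ("Drug", "TARGETS", "Gene"),                   # node label not in the schema
]

kept = [triple for triple in candidates if is_allowed(triple)]
print(kept)
```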

### <a id='2.3'>**2.3 LLM to identify the nodes and relationships**</a>

To identify nodes and relationships in unstructured text, we will leverage our LLM. The model analyzes the content of each document chunk and extracts entities and their connections, restricted to the allowed node and relationship types defined above.

In [8]:
transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=allowed_nodes,
    allowed_relationships=allowed_relationships,
    node_properties=False, 
    relationship_properties=False
) 

### <a id='2.4'>**2.4 Add the nodes**</a>

Now that we have identified the nodes and relationships, we will proceed to add them to the Neo4j graph database using LangChain's `Neo4jGraph` functionalities.
Thanks to the previous `LLMGraphTransformer`, the nodes and relationships will be automatically created in the graph.

In [9]:
graph_documents = transformer.convert_to_graph_documents(documents)
graph.add_graph_documents(graph_documents, include_source=True)

### <a id='2.5'>**2.5 Create the keyword index**</a>

The keyword index is important because it enables hybrid searches that combine vector-based and keyword-based search, alongside the graph structure.

In [10]:
session.run("DROP INDEX keyword IF EXISTS")
session.run("""
CREATE FULLTEXT INDEX keyword FOR (n:Document) ON EACH [n.text]
""")

<neo4j._sync.work.result.Result at 0x3088d49e0>
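The cell above drops and recreates the index on every run. As an alternative sketch, the statement can be built with `IF NOT EXISTS`, so that an existing index is kept. The helper `fulltext_index_statement` is hypothetical; because the names are interpolated into the Cypher text, it rejects anything that is not a plain identifier.

```python
def fulltext_index_statement(name, label, properties):
    """Build a CREATE FULLTEXT INDEX statement for one label and its text properties."""
    for identifier in (name, label, *properties):
        # Names are interpolated into the statement, so reject anything unusual.
        if not identifier.isidentifier():
            raise ValueError(f"Unexpected identifier: {identifier!r}")
    props = ", ".join(f"n.{prop}" for prop in properties)
    return (
        f"CREATE FULLTEXT INDEX {name} IF NOT EXISTS "
        f"FOR (n:{label}) ON EACH [{props}]"
    )


statement = fulltext_index_statement("keyword", "Document", ["text"])
print(statement)
```

The resulting string can be passed to `session.run(statement)` in place of the two calls above.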

### <a id='2.6'>**2.6 Add the embeddings**</a>

Now we will add the embeddings to the nodes in the graph. This will allow us to perform vector-based similarity searches later on.

In [11]:
_ = Neo4jVector.from_documents(
    documents,
    embeddings,
    pre_delete_collection=True, 
    index_name="vector",
    keyword_index_name="keyword"
)

## <a id='3' style="color: skyblue;">**3. Retriever**</a>

### <a id='3.1'>**3.1 Vector retriever (Classical RAG)**</a>

This is an example of how to create a vector retriever using the `neo4j_graphrag` library.  
It behaves like a classical RAG system: the retriever fetches relevant nodes based on vector similarity and does not consider relationships between nodes.

In [14]:
embedder = SentenceTransformerEmbeddings()

vector_retriever = VectorRetriever(
    driver,
    embedder=embedder,
    index_name="vector",
    return_properties=["text"],
)
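Conceptually, a vector retriever embeds the query and returns the stored chunks whose embeddings are closest to it. The sketch below reproduces that idea in plain Python with toy vectors, assuming cosine similarity as the distance measure. The functions, chunk texts, and numbers are all made up for illustration, and this is not the library's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunks, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors; all-MiniLM-L6-v2 embeddings have 384 dimensions.
chunks = [
    ("chunk about oncology", [0.9, 0.1, 0.0]),
    ("chunk about cardiology", [0.1, 0.9, 0.0]),
    ("chunk about inflammation", [0.0, 0.2, 0.9]),
]
query_vector = [0.8, 0.2, 0.1]

for score, text in top_k(query_vector, chunks, k=2):
    print(f"{score:.3f}  {text}")
```

In the real retriever, the vector index in Neo4j performs this nearest-neighbour lookup instead of a full scan over all chunks.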

#### **3.1.1 Testing the Vector Retriever**

In this example, we will test the vector retriever by providing a query and retrieving the top relevant nodes based on vector similarity.    
We will display the first result to see the details of the most relevant chunk.

In [15]:
query = "Which researchers in Spain are involved in studies evaluating immunotherapy drugs for breast cancer?"

retriever_result = vector_retriever.search(query_text=query, top_k=2)

# Show the retrieved items (content and metadata of each result)
retriever_result.model_dump()['items']

[{'content': "{'text': 'Another highlight came from the collaborative trial between Hospital 12 de Octubre (Madrid) and the Spanish Breast Cancer Group (GEICAM), focusing on combination therapies for triple-negative breast cancer (TNBC). The study evaluated pembrolizumab in combination with the antibody-drug conjugate sacituzumab govitecan in patients with metastatic disease who had received at least two prior lines of therapy. The combination achieved an overall response rate of 28% and disease control rate of 64%, with median overall survival extending to 13.2 months.\\n\\nResearchers also discussed biomarker-driven insights, revealing that patients with PD-L1 expression ≥10% and BRCA1/2 mutations had significantly higher response rates. “These findings underscore the evolving role of immunotherapy even in traditionally hard-to-treat tumor types,” stated Dr. Laura Sánchez, co-investigator and immuno-oncologist at Hospital 12 de Octubre. Safety data indicated manageable toxicity, with

### <a id='3.2'>**3.2 Hybrid retriever (Graph RAG)**</a>

This is an example of how to create a hybrid retriever using the `neo4j_graphrag` library.  
Here the retriever fetches relevant nodes from the graph database by combining vector similarity with full-text (keyword) search, using the `vector` and `keyword` indexes created above.

In [8]:
embedder = SentenceTransformerEmbeddings()

graphrag_retriever = HybridRetriever(
    driver,
    embedder=embedder,
    vector_index_name="vector",
    fulltext_index_name="keyword",
    return_properties=["text"],
)
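Vector scores and keyword scores live on different scales, so a hybrid search has to bring them onto a common one before ranking. The sketch below shows one simple way to do that: divide each score by the best score in its own list and keep the higher value per chunk. This is an illustrative assumption, not a description of the library's internals, and the chunk ids and scores are invented.

```python
def normalize(scores):
    """Scale scores so that the best hit in the list gets 1.0."""
    top = max(scores.values(), default=0.0)
    return {key: (value / top if top else 0.0) for key, value in scores.items()}


def hybrid_rank(vector_scores, keyword_scores, k=2):
    """Merge two score dictionaries and return the k best (chunk_id, score) pairs."""
    merged = {}
    for scores in (normalize(vector_scores), normalize(keyword_scores)):
        for chunk_id, score in scores.items():
            # Keep the better of the two normalized scores for each chunk.
            merged[chunk_id] = max(score, merged.get(chunk_id, 0.0))
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up scores: chunk_c is a weak vector match but the best keyword match.
vector_scores = {"chunk_a": 0.82, "chunk_b": 0.78, "chunk_c": 0.40}
keyword_scores = {"chunk_c": 6.0, "chunk_b": 3.0}

print(hybrid_rank(vector_scores, keyword_scores, k=2))
```

In this toy example `chunk_c` enters the top results only because of its keyword score, which is the kind of case a hybrid search is meant to catch.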

#### **3.2.1 Testing the Graph Retriever**

In this example, we will test the hybrid retriever by providing a query and retrieving the top relevant nodes based on vector similarity and keyword matching.

In [9]:
query = "Which researchers in Spain are involved in studies evaluating immunotherapy drugs for breast cancer?"

retriever_result = graphrag_retriever.search(query_text=query, top_k=2)

# Show the retrieved items (content and metadata of each result)
retriever_result.model_dump()['items']

[{'content': "{'text': 'Another highlight came from the collaborative trial between Hospital 12 de Octubre (Madrid) and the Spanish Breast Cancer Group (GEICAM), focusing on combination therapies for triple-negative breast cancer (TNBC). The study evaluated pembrolizumab in combination with the antibody-drug conjugate sacituzumab govitecan in patients with metastatic disease who had received at least two prior lines of therapy. The combination achieved an overall response rate of 28% and disease control rate of 64%, with median overall survival extending to 13.2 months.\\n\\nResearchers also discussed biomarker-driven insights, revealing that patients with PD-L1 expression ≥10% and BRCA1/2 mutations had significantly higher response rates. “These findings underscore the evolving role of immunotherapy even in traditionally hard-to-treat tumor types,” stated Dr. Laura Sánchez, co-investigator and immuno-oncologist at Hospital 12 de Octubre. Safety data indicated manageable toxicity, with

#### **3.2.2 Generate Graph RAG retriever-and-answer**

We will now reuse the hybrid retriever, which combines vector similarity and keyword search to fetch relevant nodes from the Neo4j graph database, and wrap it together with the LLM so that we automatically get an answer in addition to the relevant chunks.

In [10]:
llm_graphrag = AzureOpenAILLM(
    model_name="gpt-4o-mini",
    api_key=os.getenv('OPENAI_KEY'),
    api_version=os.getenv('OPENAI_VERSION'),
    azure_endpoint=os.getenv('OPENAI_ENDPOINT'),
)

rag = GraphRAG(
    retriever=graphrag_retriever, 
    llm=llm_graphrag,
)

#### **3.2.3 Testing the Graph RAG Retriever + Generation**

Now we can test the Graph RAG retriever by providing a query.   
As you can see, the retriever configuration does not have exactly the same format as in the vector retriever: the search parameters are passed through `retriever_config`. However, both accept similar parameters such as `top_k`.

In [18]:
query = "Which researchers in Spain are involved in studies evaluating immunotherapy drugs for breast cancer?"

response = rag.search(
    query_text=query,
    retriever_config={
        "top_k": 2,
        # "filters": {}
    },
    return_context=True
)

In [19]:
response.retriever_result.model_dump()['items']

[{'content': "{'text': 'Another highlight came from the collaborative trial between Hospital 12 de Octubre (Madrid) and the Spanish Breast Cancer Group (GEICAM), focusing on combination therapies for triple-negative breast cancer (TNBC). The study evaluated pembrolizumab in combination with the antibody-drug conjugate sacituzumab govitecan in patients with metastatic disease who had received at least two prior lines of therapy. The combination achieved an overall response rate of 28% and disease control rate of 64%, with median overall survival extending to 13.2 months.\\n\\nResearchers also discussed biomarker-driven insights, revealing that patients with PD-L1 expression ≥10% and BRCA1/2 mutations had significantly higher response rates. “These findings underscore the evolving role of immunotherapy even in traditionally hard-to-treat tumor types,” stated Dr. Laura Sánchez, co-investigator and immuno-oncologist at Hospital 12 de Octubre. Safety data indicated manageable toxicity, with

In [20]:
print(response.answer)

Researchers from Hospital 12 de Octubre in Madrid, specifically Dr. Laura Sánchez, are involved in studies evaluating immunotherapy drugs for breast cancer, particularly in a trial focusing on combination therapies for triple-negative breast cancer (TNBC).
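The generation step can be pictured as placing the retrieved chunks into a prompt together with the question and sending that prompt to the LLM. The sketch below shows this assembly in its simplest form. The function `build_prompt` and its wording are hypothetical and do not reproduce the prompt template that `GraphRAG` uses internally.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the user question into a single prompt string."""
    context = "\n\n".join(
        f"[{position}] {chunk}" for position, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for the texts returned by the retriever.
prompt = build_prompt(
    "Which researchers in Spain are involved in studies evaluating "
    "immunotherapy drugs for breast cancer?",
    ["First retrieved chunk text.", "Second retrieved chunk text."],
)
print(prompt)
```

Because `return_context=True` was set in the search call, the chunks that fed the answer remain available in `response.retriever_result` for inspection.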


## <a id='4' style="color: skyblue;">**4. Extra commands**</a>

In this section, we provide some extra commands that can be useful for managing and interacting with the Neo4j graph database.

### <a id='4.1'>**4.1 Delete all nodes**</a>

In [12]:
# with driver.session() as session:
#     session.run("MATCH (n) DETACH DELETE n")

### <a id='4.2'>**4.2 Close the session**</a>

In [13]:
# session.close()
# driver.close()

### <a id='4.3'>**4.3 See the available indexes in the Neo4j database**</a>

In [9]:
query = "SHOW INDEXES;"
with driver.session() as session:
    result = session.run(query)
    for record in result:
        print(record)

<Record id=3 name='constraint_f8c8c4e0' state='ONLINE' populationPercent=100.0 type='RANGE' entityType='NODE' labelsOrTypes=['Chunk'] properties=['id'] indexProvider='range-1.0' owningConstraint='constraint_f8c8c4e0' lastRead=neo4j.time.DateTime(2025, 11, 10, 11, 50, 28, 698000000, tzinfo=<UTC>) readCount=88>
<Record id=1 name='index_1b9dcc97' state='ONLINE' populationPercent=100.0 type='LOOKUP' entityType='RELATIONSHIP' labelsOrTypes=None properties=None indexProvider='token-lookup-1.0' owningConstraint=None lastRead=neo4j.time.DateTime(2025, 11, 10, 12, 8, 13, 147000000, tzinfo=<UTC>) readCount=3>
<Record id=0 name='index_460996c0' state='ONLINE' populationPercent=100.0 type='LOOKUP' entityType='NODE' labelsOrTypes=None properties=None indexProvider='token-lookup-1.0' owningConstraint=None lastRead=neo4j.time.DateTime(2025, 11, 10, 12, 7, 50, 896000000, tzinfo=<UTC>) readCount=646>
<Record id=5 name='keyword' state='ONLINE' populationPercent=100.0 type='FULLTEXT' entityType='NODE' la

### <a id='4.4'>**4.4 See all the filenames**</a>

In [10]:
query = """
MATCH (n)
WHERE n.filename IS NOT NULL
RETURN DISTINCT n.filename AS filename
ORDER BY filename
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["filename"])

cardio_study_usa.txt
oncology_update_spain.txt
trial_inflammation_uk.txt


### <a id='4.5'>**4.5 How to filter a concrete id**</a>

In [13]:
random_id = "4:50a55e0e-0d46-4a3c-85dd-d974565f398f:40"

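query = """MATCH (n)
WHERE elementId(n) = $element_id
RETURN n.text AS text
"""

with driver.session() as session:
    # Pass the id as a query parameter instead of interpolating it into the Cypher string
    result = session.run(query, element_id=random_id)
    for record in result:
        print(record["text"])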

During the 2024 European Society for Medical Oncology (ESMO) Congress in Madrid, Spanish oncology researchers presented a series of groundbreaking updates on targeted and immunotherapy-based treatments. The centerpiece of the conference was AZD5478, a next-generation EGFR inhibitor developed by AstraZeneca, which demonstrated significant activity in non-small-cell lung cancer (NSCLC) patients harboring EGFR exon 20 insertions. The Phase II data showed an objective response rate of 56% and median progression-free survival (PFS) of 10.4 months, surpassing current standards of care.

Dr. Isabel Romero from Hospital Vall d’Hebron in Barcelona highlighted the importance of molecular profiling, noting that “precision oncology is no longer an aspiration—it is the standard.” The presentation emphasized the need for routine next-generation sequencing (NGS) to detect resistance mutations and guide therapy selection. Moreover, safety outcomes were favorable, with diarrhea and rash being the most 

---