## Neo4j Graph Agent Setup

This notebook demonstrates how to run multiple Neo4j instances in Docker, connect them to LlamaIndex as property graph stores, and build custom retrievers and agents on top of them. It covers installing dependencies, configuring the Neo4j containers, installing the APOC plugin, and creating agents that query the knowledge graphs. The agents perform comparative analysis of documents or articles, with Neo4j as the underlying data store.


## Setup (Installs, Data, Models)

In [1]:
# !pip install llama-index
# !pip install llama-index-core==0.10.42
# !pip install llama-index-embeddings-openai
# !pip install llama-index-postprocessor-flag-embedding-reranker
# !pip install git+https://github.com/FlagOpen/FlagEmbedding.git
# !pip install llama-index-graph-stores-neo4j
# !pip install llama-parse


# How to Create Multiple Neo4j Instances in Docker

## Correct Setup Process:

### 1. Stop all running containers:
Before creating a new Neo4j container, stop any previously running containers: list them with `docker ps`, then stop each one with `docker stop <container_name>`.

### 2. Create a new Neo4j container:
You can create a new Neo4j instance by specifying unique volume and port numbers for each container.

Example command to run a new container (`neo4j_4`, backed by the volume `neo4j_new_volume_4`). Replace `your_password` with your own password:
```bash
docker run -d \
  -p 7477:7474 -p 7690:7687 \
  --name neo4j_4 \
  -e NEO4J_AUTH=neo4j/your_password \
  -e NEO4J_apoc_export_file_enabled=true \
  -e NEO4J_apoc_import_file_enabled=true \
  -e NEO4J_apoc_import_file_use__neo4j__config=true \
  -e NEO4J_PLUGINS='["apoc"]' \
  -v neo4j_new_volume_4:/var/lib/neo4j \
  neo4j:5.26.1
```

**Explanation:**
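- `-p 7477:7474 -p 7690:7687`: Publishes container port 7474 (HTTP, used by Neo4j Browser) on host port 7477 and container port 7687 (Bolt) on host port 7690. Use a different pair of host ports for each container.
- `--name neo4j_4`: Names the container (you can give it any unique name).
- `-v neo4j_new_volume_4:/var/lib/neo4j`: Attaches a new named volume for persistent data storage.
- `NEO4J_AUTH`: Sets the initial username and password in the form `neo4j/<password>`.
- `NEO4J_PLUGINS='["apoc"]'`: Asks the official image to install the APOC plugin at startup.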
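
When several instances are needed, it can help to generate the `docker run` commands instead of editing ports by hand. The following is a minimal sketch, not part of the original notebook: `Neo4jInstance`, `make_instance`, and `docker_run_command` are hypothetical helpers, and the port numbering (instance 4 maps to 7477 and 7690) is an assumption chosen to match the example above. The APOC export/import flags are omitted for brevity.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class Neo4jInstance:
    """Settings for one containerized Neo4j instance (hypothetical helper)."""

    index: int
    http_port: int  # host port published for container port 7474 (HTTP)
    bolt_port: int  # host port published for container port 7687 (Bolt)

    @property
    def name(self) -> str:
        return f"neo4j_{self.index}"

    @property
    def volume(self) -> str:
        return f"neo4j_new_volume_{self.index}"


def make_instance(index: int, http_base: int = 7473, bolt_base: int = 7686) -> Neo4jInstance:
    # Assumed numbering scheme: instance 4 -> 7477 / 7690, as in the example above.
    return Neo4jInstance(index, http_base + index, bolt_base + index)


def docker_run_command(inst: Neo4jInstance, password: str, image: str = "neo4j:5.26.1") -> str:
    """Build a shell-quoted `docker run` command for one instance."""
    args = [
        "docker", "run", "-d",
        "-p", f"{inst.http_port}:7474",
        "-p", f"{inst.bolt_port}:7687",
        "--name", inst.name,
        "-e", f"NEO4J_AUTH=neo4j/{password}",
        "-e", 'NEO4J_PLUGINS=["apoc"]',
        "-v", f"{inst.volume}:/var/lib/neo4j",
        image,
    ]
    return shlex.join(args)
```

Calling `print(docker_run_command(make_instance(4), "your_password"))` prints a command equivalent to the one above, and each instance index yields its own ports, container name, and volume.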

### 3. Access the Neo4j container:
Run the following command to access the container's bash shell:
```bash
docker exec -it neo4j_4 bash
```

### 4. Download the APOC plugin:
Inside the container, create the plugin directory:
```bash
mkdir -p plugins
```

Navigate to the plugins directory:
```bash
pushd plugins
```

Download the APOC plugin:
```bash
wget https://github.com/neo4j/apoc/releases/download/5.26.1/apoc-5.26.1-core.jar
```

### 5. Edit the neo4j.conf file:
Navigate to the Neo4j configuration directory:
```bash
cd /var/lib/neo4j/conf
```

Edit the neo4j.conf file:
```bash
nano neo4j.conf
```

Set the following two options in the configuration file (around lines 242 and 246; the exact line numbers may differ between versions):
- `dbms.security.procedures.unrestricted=algo.*,apoc.*`
- `dbms.security.procedures.allowlist=algo.*,apoc.*`

Save the changes and exit the editor.
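
If you prefer to script this edit rather than use `nano`, the change amounts to setting two `key=value` lines. Below is a minimal sketch under stated assumptions: `set_conf_value` is a hypothetical helper (not a Neo4j tool), and it treats the file as plain `key=value` lines, replacing an existing or commented-out assignment and otherwise appending the setting.

```python
def set_conf_value(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with `key=value` present exactly once.

    An existing assignment of `key` (commented out or not) is replaced in
    place; otherwise the setting is appended at the end.
    """
    new_line = f"{key}={value}"
    out = []
    replaced = False
    for line in conf_text.splitlines():
        # Strip a leading comment marker so "#key=..." is also recognized.
        candidate = line.lstrip("# \t")
        if candidate.split("=", 1)[0].strip() == key:
            if not replaced:
                out.append(new_line)
                replaced = True
            continue  # drop any further duplicates of the same key
        out.append(line)
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"
```

To apply it, read the contents of `neo4j.conf`, call the function once per setting, and write the result back. Running it twice produces the same file, so it is safe to re-run.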

### 6. Restart the container:
After making the changes, leave the container shell with `exit` and restart the container from the host (for example, `docker restart neo4j_4`) to apply the new configuration.
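
Neo4j needs some time after a restart before it accepts connections. The sketch below polls the published Bolt port until a TCP connection succeeds. It is an illustration only: `wait_for_port` is a hypothetical helper, the default timeout is arbitrary, and an open port only shows that something is listening, not that the database is fully ready.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the container created above, `wait_for_port("localhost", 7690)` checks the host port that was mapped to Bolt.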

### 7. Test the installation:
Open Neo4j Browser (at `http://localhost:7477` with the port mapping above) and run the following query to check whether APOC is installed:
```cypher
RETURN apoc.version()
```

If the query returns a version string, the plugin has been installed successfully.

### 8. Verify the new database:
To confirm the database is created, you can run the following query:
```cypher
MATCH (n) RETURN n;
```

On a new, empty database this query returns no records.

# Main part


In [5]:
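from getpass import getpass

from llama_index.graph_stores.neo4j import Neo4jPGStore

# Prompt without echoing the password into the notebook output.
password = getpass("Please enter your password for Neo4j: ")
# The port must be the host port published for Bolt (`-p <host_port>:7687`)
# by the first container.
graph_store1 = Neo4jPGStore(
    username="neo4j",
    password=password,
    url="bolt://localhost:7687",
)
vec_store = None  # no separate vector store is used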


In [6]:
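from getpass import getpass

from llama_index.graph_stores.neo4j import Neo4jPGStore

# Prompt without echoing the password into the notebook output.
password = getpass("Please enter your password for Neo4j: ")
# The port must be the host port published for Bolt (`-p <host_port>:7687`)
# by the second container.
graph_store2 = Neo4jPGStore(
    username="neo4j",
    password=password,
    url="bolt://localhost:7688",
)
vec_store = None  # no separate vector store is used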
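
For unattended runs, typing the password into a prompt is inconvenient. One common alternative is to read it from an environment variable and fall back to a hidden prompt. This is a sketch under stated assumptions: the variable name `NEO4J_PASSWORD` and the helpers `neo4j_password` and `bolt_url` are hypothetical and not part of LlamaIndex or Neo4j.

```python
import os
from getpass import getpass


def neo4j_password(env_var: str = "NEO4J_PASSWORD") -> str:
    """Read the password from an environment variable, else prompt without echo."""
    value = os.environ.get(env_var)
    if value:
        return value
    return getpass(f"Neo4j password ({env_var} is not set): ")


def bolt_url(host_port: int, host: str = "localhost") -> str:
    """Build the Bolt URL for an instance published on the given host port."""
    return f"bolt://{host}:{host_port}"
```

The port passed to `bolt_url` has to be the host side of the `-p <host_port>:7687` mapping. For the `neo4j_4` container from the Docker section that would be `bolt_url(7690)`.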


## Construct Knowledge Graph, Get Retrievers

This section shows how to build a `PropertyGraphIndex` for each Neo4j instance. `PropertyGraphIndex.from_existing` attaches to the graph already stored in the database instead of ingesting the documents again.

**Note**: The default extractors (`ImplicitPathExtractor` and `SimpleLLMPathExtractor`) are configured here. You can also use a predefined schema, as shown in this [notebook](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/property_graph/property_graph_advanced.ipynb).

In [8]:
from llama_index.core.indices.property_graph import (
    ImplicitPathExtractor,
    SimpleLLMPathExtractor,
)
from llama_index.core import PropertyGraphIndex
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

In [12]:

index1 = PropertyGraphIndex.from_existing(
    graph_store1,
    embed_model=OpenAIEmbedding(model_name="text-embedding-3-small"),
    kg_extractors=[
        ImplicitPathExtractor(),
        SimpleLLMPathExtractor(
            llm=OpenAI(model="gpt-3.5-turbo", temperature=0.3),
            num_workers=4,
            max_paths_per_chunk=10,
        ),
    ],
    show_progress=True,
)


In [10]:

index2 = PropertyGraphIndex.from_existing(
    graph_store2,
    embed_model=OpenAIEmbedding(model_name="text-embedding-3-small"),
    kg_extractors=[
        ImplicitPathExtractor(),
        SimpleLLMPathExtractor(
            llm=OpenAI(model="gpt-3.5-turbo", temperature=0.3),
            num_workers=4,
            max_paths_per_chunk=10,
        ),
    ],
    show_progress=True,
)

#### Define Vector Retriever

Here we define the vector context retriever: it finds the initial nodes via vector search, then traverses their relations (up to `path_depth` hops) to pull in more nodes and context.
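
To make the two stages concrete, here is a toy, in-memory version of the idea: rank nodes by cosine similarity to the query, then collect the triples reachable within a fixed number of hops. It is a simplified illustration with made-up data structures, not how `VectorContextRetriever` is implemented. The names `cosine` and `vector_context_retrieve` are hypothetical.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_context_retrieve(query_vec, embeddings, edges, top_k=2, depth=1):
    """Pick top_k seed nodes by similarity, then gather triples within `depth` hops.

    embeddings: dict mapping node name -> vector
    edges: list of (subject, relation, object) triples
    """
    seeds = sorted(
        embeddings, key=lambda n: cosine(query_vec, embeddings[n]), reverse=True
    )[:top_k]
    seen = set(seeds)
    triples = []
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, hops = frontier.popleft()
        if hops == depth:
            continue  # do not expand beyond the requested depth
        for subj, rel, obj in edges:
            if node in (subj, obj):
                if (subj, rel, obj) not in triples:
                    triples.append((subj, rel, obj))
                other = obj if node == subj else subj
                if other not in seen:
                    seen.add(other)
                    frontier.append((other, hops + 1))
    return seeds, triples
```

In this toy model, `top_k` plays the role of `similarity_top_k` and `depth` the role of `path_depth`: raising `depth` from 1 to 2 also returns triples attached to the neighbors of the seed nodes.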


In [14]:
from llama_index.core.indices.property_graph import VectorContextRetriever

kg_retriever1 = VectorContextRetriever(
    index1.property_graph_store,
    embed_model=OpenAIEmbedding(model_name="text-embedding-3-small"),
    similarity_top_k=2,
    path_depth=1,
    # include_text=False,
    include_text=True,
)

In [18]:
nodes1 = kg_retriever1.retrieve("Name of Article")
print(len(nodes1))
for idx, node in enumerate(nodes1):
    print(f">> IDX: {idx}, {node.get_content()}")

1
>> IDX: 0, Here are some facts extracted from the provided text:

Aconex -> Is -> Informational document
Aconex -> Is -> An informational document

DIVISION OF ENGINEERING SERVICES ‑ GEOTECHNICAL SERVICES
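
The retrieved text mixes extracted facts in the form `subject -> relation -> object` with the source passage. If you want the facts as structured data, a small parser is enough. This sketch assumes the line format seen in the output above; `parse_triples` is a hypothetical helper, not a LlamaIndex function.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def parse_triples(content: str) -> List[Triple]:
    """Extract `subject -> relation -> object` lines from retrieved node text."""
    triples = []
    for line in content.splitlines():
        parts = [p.strip() for p in line.split("->")]
        # Keep only lines with exactly three non-empty fields.
        if len(parts) == 3 and all(parts):
            triples.append((parts[0], parts[1], parts[2]))
    return triples
```

It could be applied to each result with `parse_triples(node.get_content())`. Lines that do not match the pattern, such as the source passage itself, are ignored.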

In [20]:
from llama_index.core.indices.property_graph import VectorContextRetriever

kg_retriever2 = VectorContextRetriever(
    index2.property_graph_store,
    embed_model=OpenAIEmbedding(model_name="text-embedding-3-small"),
    similarity_top_k=2,
    path_depth=1,
    # include_text=False,
    include_text=True,
)

In [23]:
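from llama_index.core import VectorStoreIndex
from llama_index.core.query_engine import RetrieverQueryEngine

# NOTE: `sub_docs` and `embed_model` are not defined anywhere in this notebook.
# `sub_docs` is assumed to be the list of parsed source documents, and
# `embed_model` an embedding model such as the OpenAIEmbedding used above.
base_index = VectorStoreIndex.from_documents(sub_docs, embed_model=embed_model)
base_retriever = base_index.as_retriever(similarity_top_k=2)
base_query_engine = RetrieverQueryEngine(base_retriever)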

In [27]:
response = base_query_engine.query(
    "Summarize the report"
)
print(str(response))

The report is a foundation report for the Navy Overhead widening project, identified as EA#12-OH1004. It includes various sections detailing the purpose and scope of work, project description, and pertinent reports and investigations. The report outlines the field investigation and laboratory testing program, including exploratory borings, soil sampling, and cone penetration tests. It also covers site geology and subsurface conditions, regional geology, and groundwater information. Evaluations for scour, corrosion, and seismic recommendations are provided, including ground surface rupture, acceleration response spectrum, liquefaction evaluation, seismic slope stability, and lateral spreading. Additionally, it contains as-built foundation data and subsurface characterization. The document is identified by Aconex# PI405-RPT-GEO-000091, dated January 18, 2019, and is a revision by the 2018 OC 405 Partners.


In [29]:
print(len(response.source_nodes))
for node in response.source_nodes:
    print("---")
    print(node.get_content())

2
---
FOUNDATION REPORT
                                                                                   NAVY OVERHEAD (WIDEN)
                                                                                                   EA#12-OH1004

 List of Figures

 Figure 1       Project Location Map
 Figure 2       Site Location Map
 Figure 3       Exploration Location Map
 Figure 4       Quaternary Geologic Map
 Figure 5       Regional Fault Map
 Figure 6a      Subsurface Cross Section “A-A”
 Figure 6b      Subsurface Cross Section “B-B”
 Figure 6c      Cross Section Key
 Figure 7a      ARS Curves – Non-Liquefied Case
 Figure 7b      ARS Curves – Liquefied Case
 Figure 8       Design ARS Curves
 Figure 9       Deaggregation for PGA and 975-Year Return Period
 Figure 10      Liquefaction Hazard Map
 Figure 11a Liquefaction Subsurface Profile A
 Figure 11b Liquefaction Subsurface Profile B
 Figure 12      Expansive Soil Exclusion Zone

Aconex#: PI405-RPT-GEO-000091
Document Date: 20190118  

## Build Custom Retriever

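Build a custom retriever around the KG retriever. The hook for combining it with a plain vector retriever is left commented out in the code, so only KG results are returned.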
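
If a plain vector retriever is added alongside the KG retriever, the two result lists have to be merged without duplicates. The sketch below shows one way to do that with stand-in types. `ScoredNode` and `merge_results` are hypothetical and only mimic the `node_id` and `score` fields. Unlike the notebook's dictionary update, where the later list wins, this variant keeps the highest-scoring copy of each node.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ScoredNode:
    """Stand-in for a retrieved node; not the LlamaIndex NodeWithScore class."""

    node_id: str
    text: str
    score: float


def merge_results(*result_lists: Iterable[ScoredNode]) -> List[ScoredNode]:
    """Union several result lists, keeping the best-scoring copy of each node_id."""
    best = {}
    for results in result_lists:
        for node in results:
            current = best.get(node.node_id)
            if current is None or node.score > current.score:
                best[node.node_id] = node
    # Highest score first, so downstream consumers see the strongest hits first.
    return sorted(best.values(), key=lambda n: n.score, reverse=True)
```

Whether to prefer the highest score or a fixed source is a design choice: scores from a KG retriever and a vector retriever are not necessarily on the same scale, so a fixed precedence can be the safer default.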

In [24]:
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore
from typing import List


class CustomRetriever(BaseRetriever):
    """Custom retriever that wraps the KG retriever and de-duplicates its results.

    The direct vector search branch is commented out; uncomment it to merge both.
    """

    def __init__(self, kg_retriever):
        self._kg_retriever = kg_retriever
        # self._vector_retriever = vector_retriever
        super().__init__()  # initialize BaseRetriever state (callbacks, etc.)

    def _retrieve(self, query_bundle) -> List[NodeWithScore]:
        """Retrieve nodes given query."""
        kg_nodes = self._kg_retriever.retrieve(query_bundle)
        # vector_nodes = self._vector_retriever.retrieve(query_bundle)

        # Key by node_id so each node appears only once in the result.
        unique_nodes = {n.node_id: n for n in kg_nodes}
        # unique_nodes.update({n.node_id: n for n in vector_nodes})
        return list(unique_nodes.values())

In [28]:
custom_retriever1 = CustomRetriever(kg_retriever1)

In [30]:
custom_retriever2 = CustomRetriever(kg_retriever2)

In [32]:
nodes = custom_retriever1.retrieve(
    "LOG OF CPT-17-3902"
)
# len(nodes)

## Build Agent

Now that we have the retrievers, we can wrap each one in a query engine, expose it as a tool, and hand the tools to an agent that can perform basic chain-of-thought (CoT) reasoning and maintain conversation memory over time.


In [34]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import RetrieverQueryEngine

kg_query_engine1 = RetrieverQueryEngine(custom_retriever1)
kg_query_tool1 = QueryEngineTool(
    query_engine=kg_query_engine1,
    metadata=ToolMetadata(
        name="query_tool1",
        description="Provides information about the First Article.",
    ),
)

In [36]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import RetrieverQueryEngine

kg_query_engine2 = RetrieverQueryEngine(custom_retriever2)
kg_query_tool2 = QueryEngineTool(
    query_engine=kg_query_engine2,
    metadata=ToolMetadata(
        name="query_tool2",
        description="Provides information about the Second Article.",
    ),
)

In [44]:
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.openai import OpenAI

# Initialize the LLM
llm = OpenAI(
    model="gpt-3.5-turbo",
    temperature=0.2,
)

# Define the system prompt for comparison
COMPARISON_SYSTEM_PROMPT = """You are an AI assistant specialized in analyzing and comparing articles. Your primary tasks are:
1. Use two query tools to retrieve information from two different articles.
2. Carefully analyze the retrieved information.
3. Identify similarities and differences between the articles.
4. Provide a detailed comparative analysis.

When answering:
- Retrieve relevant information from both articles separately.
- Ensure responses are based on the specific content returned by the query tools.
- Clearly highlight similarities and differences.
- Organize the information in a way that makes the comparison easy to understand.
"""

# Function to create the comparison agent
def create_comparison_agent(kg_query_tool1, kg_query_tool2, llm):
    # Combine the two query tools
    tools = [kg_query_tool1, kg_query_tool2]
    
    # Create an agent worker
    agent_worker = FunctionCallingAgentWorker.from_tools(
        tools=tools,
        llm=llm,
        verbose=True,
        system_prompt=COMPARISON_SYSTEM_PROMPT,
        # Allow parallel tool calls for improved efficiency
        allow_parallel_tool_calls=True
    )
    
    return agent_worker.as_agent()

# Create an agent instance
comparison_agent = create_comparison_agent(kg_query_tool1, kg_query_tool2, llm)
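
The system prompt asks the agent to query each article separately and then compare the answers. Independent of any LLM, that fan-out step can be sketched with plain callables. This is an illustration of the pattern only, not the behavior of `FunctionCallingAgentWorker`. The functions `gather_for_comparison` and `format_comparison_prompt` are hypothetical.

```python
from typing import Callable, Dict


def gather_for_comparison(
    question: str, tools: Dict[str, Callable[[str], str]]
) -> Dict[str, str]:
    """Ask every tool the same question and return the answers keyed by tool name."""
    return {name: tool(question) for name, tool in tools.items()}


def format_comparison_prompt(question: str, answers: Dict[str, str]) -> str:
    """Lay the per-article answers out one after another for a final comparison step."""
    sections = [f"[{name}]\n{answer}" for name, answer in answers.items()]
    return (
        f"Question: {question}\n\n"
        + "\n\n".join(sections)
        + "\n\nList the similarities, then the differences."
    )
```

With the query engines from this notebook, each callable could be a small wrapper such as `lambda q: str(kg_query_engine1.query(q))`, and the formatted prompt could be sent to the LLM for the final comparison.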


## Try out Queries

Now that the agent is set up, let's try out some queries.

In [46]:
# Example usage
comparison_agent.reset()
response = comparison_agent.chat(
    "Please analyze the key similarities and differences between these two articles, including:\n" +
    "1. The main arguments presented in each article\n" +
    "2. The key pieces of evidence used\n" +
    "3. The conclusions drawn"
)
print(str(response))

Added user message to memory: Please analyze the key similarities and differences between these two articles, including:
1. The main arguments presented in each article
2. The key pieces of evidence used
3. The conclusions drawn
=== Calling Function ===
Calling function: query_tool1 with args: {"input": "main arguments, key evidence, conclusions"}
=== Function Output ===
The main arguments presented in the text are related to the Standard Penetration Test (SPT) as a method for assessing soil strength, the importance of hammer efficiency in driving rods, and the use of N-values for design purposes. Key evidence includes details on how the SPT is conducted, the significance of energy measurements during testing, and the adjustments that can be made based on measured energy values. Additionally, the text discusses the use of equipment like the SPT Analyzer or Pile Driving Analyzer for measurements and analysis. The conclusions drawn emphasize the variability of energy in testing methods b

In [233]:
# NOTE: `agent` is not defined in this notebook; judging by the `query_tool` name in later logs, it appears to be a single-tool agent from an earlier session.
response = agent.chat("Based on the table data, please compare the potential seismic risk of the following two faults (Newport Inglewood Fault Zone and Anaheim Fault) to the project site. Please provide a detailed analysis based on their distance, maximum magnitude, and slip rate, and conclude which fault poses a higher seismic risk.")

Added user message to memory: Based on the table data, please compare the potential seismic risk of the following two faults (Newport Inglewood Fault Zone and Anaheim Fault) to the project site. Please provide a detailed analysis based on their distance, maximum magnitude, and slip rate, and conclude which fault poses a higher seismic risk.
=== LLM Response ===
To compare the potential seismic risk of the Newport Inglewood Fault Zone and the Anaheim Fault to the project site, we will analyze three key factors: distance, maximum magnitude, and slip rate.

### Newport Inglewood Fault Zone

- **Distance**: 5.6 km
  - This fault is very close to the project site, which increases the potential for strong ground shaking during an earthquake.

- **Maximum Magnitude**: 7.2
  - The fault is capable of producing a significant earthquake, which can result in substantial ground shaking and potential damage.

- **Slip Rate**: 1.0 mm/yr
  - A moderate slip rate suggests a relatively frequent movemen

In [235]:
print(str(response))

To compare the potential seismic risk of the Newport Inglewood Fault Zone and the Anaheim Fault to the project site, we will analyze three key factors: distance, maximum magnitude, and slip rate.

### Newport Inglewood Fault Zone

- **Distance**: 5.6 km
  - This fault is very close to the project site, which increases the potential for strong ground shaking during an earthquake.

- **Maximum Magnitude**: 7.2
  - The fault is capable of producing a significant earthquake, which can result in substantial ground shaking and potential damage.

- **Slip Rate**: 1.0 mm/yr
  - A moderate slip rate suggests a relatively frequent movement along the fault, indicating a higher likelihood of seismic activity over time.

### Anaheim Fault

- **Distance**: 7.1 km
  - This fault is also relatively close to the project site, though slightly further than the Newport Inglewood Fault Zone.

- **Maximum Magnitude**: 6.4
  - The potential maximum magnitude is lower than that of the Newport Inglewood Fault 

In [241]:
agent.reset()
response = agent.chat(
    "print the information in LOG OF CPT-17-3901"
)
print(str(response))

Added user message to memory: print the information in LOG OF CPT-17-3901
=== Calling Function ===
Calling function: query_tool with args: {"input": "LOG OF CPT-17-3901"}
=== Function Output ===
The context does not provide specific information about the log of CPT-17-3901.
=== LLM Response ===
I couldn't find any specific information about the "LOG OF CPT-17-3901." If you have more details or another request, feel free to let me know!
I couldn't find any specific information about the "LOG OF CPT-17-3901." If you have more details or another request, feel free to let me know!


In [173]:
print(str(response.source_nodes[0].get_content()))

Here are some facts extracted from the provided text:

New retaining wall -> Will be constructed on -> The side of the approach abutment

FOUNDATION REPORT
                                                                                              NAVY OVERHEAD (WIDEN)
                                                                                                              EA#12-OH1004

 10.4   Approach Embankment Earthwork
 10.4.1     Staging and New Approach Fill
 New fill will be placed to widen and raise the existing approach embankments and convert slopes
 of the embankments into retaining walls. The subject bridge will be constructed in a single stage.
 Table 31, presents simplified depictions of the new fill areas for each abutment at each stage,
 located behind the abutments of the proposed new bridge. Also noted in Table 31, is when a new
 retaining wall will be constructed on the side of the approach abutment. Short, pile supported wing
 walls are also proposed at each 

In [205]:
agent.reset()
response = agent.chat(
    # "in the Liquefaction Triggering Assessment and Settlement Calculation Standard Penetration Tests on Navy OH ‐ Abutment 1,try to find out at which Elevation,the Results,FS is less than 2? and what does this mean?"

"in CONE PENETRATION TEST RECORD,as Depth increase ,how does the TIP BEARING change"
)
print(str(response))

Added user message to memory: in CONE PENETRATION TEST RECORD,as Depth increase ,how does the TIP BEARING change
=== LLM Response ===
In a Cone Penetration Test (CPT), the tip resistance (also known as tip bearing or cone resistance) typically changes with depth due to variations in soil type, density, and other geotechnical properties. Generally, as depth increases, the tip resistance may increase, indicating denser or more compact soil layers. However, this is not always the case, as the tip resistance can also decrease or remain constant depending on the soil stratigraphy and conditions encountered.

Here are some typical scenarios:

1. **Increase in Tip Resistance**: This is common when penetrating through loose surface layers into denser subsurface layers, such as transitioning from loose sand to dense sand or gravel.

2. **Decrease in Tip Resistance**: This might occur when moving from a dense layer to a softer or more compressible layer, such as transitioning from dense sand to 

In [51]:
agent.reset()
response = agent.chat(
    "What does the report identify as the most critical factor for ensuring global stability in Section 10.3?"
)


Added user message to memory: What does the report identify as the most critical factor for ensuring global stability in Section 10.3?
=== Calling Function ===
Calling function: query_tool with args: {"input": "most critical factor for ensuring global stability in Section 10.3"}
=== Function Output ===
The most critical factor for ensuring global stability in Section 10.3 is meeting the required factors of safety of 1.5 for static loading and 1.1 for seismic loading.
=== LLM Response ===
The report identifies meeting the required factors of safety—1.5 for static loading and 1.1 for seismic loading—as the most critical factor for ensuring global stability in Section 10.3.
