# Knowledge Graph Agent with LlamaParse

In [1]:
!pip install llama-index
!pip install llama-index-core==0.10.42
!pip install llama-index-embeddings-openai
!pip install llama-index-postprocessor-flag-embedding-reranker
!pip install git+https://github.com/FlagOpen/FlagEmbedding.git
!pip install llama-index-graph-stores-neo4j
!pip install llama-parse

Collecting pypdf<5.0.0,>=4.0.1 (from llama-index-readers-file<0.2.0,>=0.1.4->llama-index)
  Downloading pypdf-4.2.0-py3-none-any.whl.metadata (7.4 kB)
Downloading pypdf-4.2.0-py3-none-any.whl (290 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m290.4/290.4 kB[0m [31m704.3 kB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[0mInstalling collected packages: pypdf
  Attempting uninstall: pypdf
    Found existing installation: pypdf 3.17.4
    Uninstalling pypdf-3.17.4:
      Successfully uninstalled pypdf-3.17.4
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
embedchain 0.1.102 requires langchain<0.2.0,>=0.1.4, but you have langchain 0.2.5 which is incompatible.
embedchain 0.1.102 requires pypdf<4.0.0,>=3.11.0, but you have pypdf 4.2.0 which is incompatible.
crewai-tools 0.1.6 requires langchain<0.2.0,>=0.1.4, but you have langchain 0.2.5 whi

In [26]:
import nest_asyncio
import os, sys
import numpy as np
from dotenv import load_dotenv, find_dotenv
sys.path.append('../..')
_ = load_dotenv(find_dotenv())
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")
nest_asyncio.apply()

#### Setup Model

Here we use gpt-4o and default OpenAI embeddings.

In [2]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

llm = OpenAI(model="gpt-4o")
embed_model = OpenAIEmbedding(model="text-embedding-3-large")

Settings.llm = llm
Settings.embed_model = embed_model

  from pandas.core import (


#### Load Data

In [5]:
from llama_parse import LlamaParse

docs = LlamaParse(api_key="llx-8e3LORqG5sX1AJGGVYI2Ueo7Z3rghhusBRRWtO19f7ZqSO8B", result_type="text").load_data("../Test_Documents/5_CyberPeace_Report.pdf")

Started parsing the file under job_id f1f81fd4-8b2b-49e5-a277-232caf3561f9
.

In [8]:
documents = []
for file in os.listdir("../Test_Documents/"):
    if file != "5_CyberPeace_Report.pdf":
        docs = LlamaParse(api_key="llx-8e3LORqG5sX1AJGGVYI2Ueo7Z3rghhusBRRWtO19f7ZqSO8B", result_type="text").load_data("../Test_Documents/" + file)
        documents.append(docs)

Started parsing the file under job_id 871d08a4-b9dd-4933-b480-7b4fefaa9f54
Started parsing the file under job_id 86a9d491-aa0e-4515-bcd8-faa685854275
Started parsing the file under job_id a69f6414-1857-4b72-8f8d-9f59daf9f3d8
Started parsing the file under job_id 42bd440d-32e3-45b7-b31c-ebccc3a8d9c9


In [9]:
documents.append(docs)

In [10]:
documents

[[Document(id_='da785d88-d3b3-4c4d-b4d1-e6da550a1115', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='            Sales Tax     1\n\n\nLAWS     OF MALAYSIA\n\n\n           Act 806\n\n\n      SALES TAX ACT 2018\n---\n2                                                Laws of Malaysia                                                    A CT       806\n\n\nDate of Royal Assent                                                          ...       ...     24 August 2018\n\n\nDate of publication in the                                          ...       ...       ...     28 August 2018\nGazette\n\n\nPublisher’s Copyright    C\nPERCETAKAN NASIONAL MALAYSIA BERHAD\nAll rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means\nelectronic, mechanical, photocopying, recording and/or otherwise without the prior permission of Percetakan Nasional Malaysia Berh

In [11]:
from copy import deepcopy
from llama_index.core.schema import TextNode, Document
from llama_index.core import VectorStoreIndex


def get_sub_docs(documents):
    """Split docs into pages, by separator."""
    sub_docs = []
    for docs in documents:
        for doc in docs:
            doc_chunks = doc.text.split("\n---\n")
            for doc_chunk in doc_chunks:
                sub_doc = Document(
                    text=doc_chunk,
                    metadata=deepcopy(doc.metadata),
                )
                sub_docs.append(sub_doc)

    return sub_docs

In [12]:
sub_docs = get_sub_docs(documents)

#### Initialize Graph Store

Here we use Neo4j but you can also use our other integrations like Nebula (see an [example notebook](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/property_graph/property_graph_advanced.ipynb)).

To launch Neo4j locally, first ensure you have docker installed. Then, you can launch the database with the following docker command

```bash
docker run \
    -p 7474:7474 -p 7687:7687 \
    -v $PWD/data:/data -v $PWD/plugins:/plugins \
    --name neo4j-apoc \
    -e NEO4J_apoc_export_file_enabled=true \
    -e NEO4J_apoc_import_file_enabled=true \
    -e NEO4J_apoc_import_file_use__neo4j__config=true \
    -e NEO4JLABS_PLUGINS=\[\"apoc\"\] \
    neo4j:latest
```

From here, you can open the db at [http://localhost:7474/](http://localhost:7474/). On this page, you will be asked to sign in. Use the default username/password of `neo4j` and `neo4j`.

Once you login for the first time, you will be asked to change the password.

After this, you are ready to create your first property graph!

In [14]:
from llama_index.graph_stores.neo4j import Neo4jPGStore

graph_store = Neo4jPGStore(
    username="neo4j",
    password="Ochieng@2024",
    url="bolt://localhost:7687",
)
vec_store = None

## Construct Knowledge Graph, Get Retrievers

This section shows you how to construct the knowledge graph over the existing documents.

**Note**: we have the default extractors (implicit path, simple llm path) configured. You can also choose to use a pre-defined schema as mentioned in this [notebook](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/property_graph/property_graph_advanced.ipynb).

In [15]:
from llama_index.core.indices.property_graph import (
    ImplicitPathExtractor,
    SimpleLLMPathExtractor,
)
from llama_index.core import PropertyGraphIndex
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

In [16]:
index = PropertyGraphIndex.from_documents(
    sub_docs,
    embed_model=OpenAIEmbedding(model_name="text-embedding-3-large"),
    kg_extractors=[
        ImplicitPathExtractor(),
        SimpleLLMPathExtractor(
            llm=OpenAI(model="gpt-4", temperature=0.3),
            num_workers=4,
            max_paths_per_chunk=10,
        ),
    ],
    property_graph_store=graph_store,
    show_progress=True,
)

Parsing nodes:   0%|          | 0/246 [00:00<?, ?it/s]

Extracting implicit paths: 100%|██████████| 246/246 [00:00<00:00, 55273.95it/s]
Extracting paths from text: 100%|██████████| 246/246 [11:37<00:00,  2.84s/it]
Generating embeddings: 100%|██████████| 3/3 [00:06<00:00,  2.17s/it]
Generating embeddings: 100%|██████████| 44/44 [06:21<00:00,  8.66s/it]


In [17]:
# run this if index is already loaded
index = PropertyGraphIndex.from_existing(
    graph_store,
    embed_model=OpenAIEmbedding(model_name="text-embedding-3-large"),
    kg_extractors=[
        ImplicitPathExtractor(),
        SimpleLLMPathExtractor(
            llm=OpenAI(model="gpt-4", temperature=0.3),
            num_workers=4,
            max_paths_per_chunk=10,
        ),
    ],
    show_progress=True,
)

Extracting implicit paths: 0it [00:00, ?it/s]
Extracting paths from text: 0it [00:00, ?it/s]
Generating embeddings: 0it [00:00, ?it/s]
Generating embeddings: 0it [00:00, ?it/s]


The constructed knowledge graph should look something like this
![knowledge graph](./sf2023_budget_kg_screenshot.png)

#### Define Vector Retriever

Here we define our vector context retriever - it returns initial nodes via vector search, and traverses the relations to pull in more nodes/context.

In [18]:
from llama_index.core.indices.property_graph import VectorContextRetriever

kg_retriever = VectorContextRetriever(
    index.property_graph_store,
    embed_model=OpenAIEmbedding(model_name="text-embedding-3-large"),
    similarity_top_k=5,
    path_depth=1,
    include_text=True,
)

### Evaluate the knowledge graph retrieval system retrieval system using TruLens

#### Load TruLens Library Modules

In [19]:
from trulens_eval import Tru
from trulens_eval.tru_custom_app import instrument
from trulens_eval import Feedback, Select
from trulens_eval.feedback import Groundedness

In [20]:
tru = Tru()

🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of Tru` to prevent this.


In [21]:
from openai import OpenAI
oai_client = OpenAI()

In [32]:
class KnowledgeGraphRetrieval:
    @instrument
    def retrieve(self, query: str) -> list:
        """
        Retrieve relevant text from vector store.
        """
        results = kg_retriever.retrieve(query)
        print(results)
        return results[0].text
    
    @instrument
    def generate_completion(self, query: str, context_str: list) -> str:
        """
        Generate answer from context.
        """
        completion = oai_client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,
        messages=
        [
            {"role": "user",
            "content": 
            f"We have provided context information below. \n"
            f"---------------------\n"
            f"{context_str}"
            f"\n---------------------\n"
            f"Given this information, please answer the question: {query}"
            }
        ]
        ).choices[0].message.content
        return completion

    @instrument
    def query(self, query: str) -> str:
        context_str = self.retrieve(query)
        completion = self.generate_completion(query, context_str)
        return completion

In [33]:
knowledge_graph_rag = KnowledgeGraphRetrieval()

In [34]:
from trulens_eval.feedback.provider.openai import OpenAI

provider = OpenAI()
grounded = Groundedness(groundedness_provider=provider)

[nltk_data] Downloading package punkt to
[nltk_data]     /home/adeptschneiderthedev/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [35]:
# Define a groundedness feedback function
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons, name = "Groundedness")
    .on(Select.RecordCalls.retrieve.rets.collect())
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)

# Question/answer relevance between overall question and answer.
f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name = "Answer Relevance")
    .on(Select.RecordCalls.retrieve.args.query)
    .on_output()
)

# Question/statement relevance between question and each context chunk.
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name = "Context Relevance")
    .on(Select.RecordCalls.retrieve.args.query)
    .on(Select.RecordCalls.retrieve.rets.collect())
    .aggregate(np.mean)
)

✅ In Groundedness, input source will be set to __record__.app.retrieve.rets.collect() .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Answer Relevance, input prompt will be set to __record__.app.retrieve.args.query .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input question will be set to __record__.app.retrieve.args.query .
✅ In Context Relevance, input context will be set to __record__.app.retrieve.rets.collect() .


### Construct the TruLens App

In [36]:
from trulens_eval import TruCustomApp
tru_rag = TruCustomApp(knowledge_graph_rag,
    app_id = 'Knowledge Graph Retrieval Pipeline Testing v1',
    feedbacks = [f_groundedness, f_answer_relevance, f_context_relevance])

In [37]:
queries = [
    "Can the Conference of the Parties of the WHO FCTC assist countries in securing financial resources for implementation?",
    "What should be the minimum size of health warnings and messages on tobacco products, and where should they be placed?",
    "I opened a company to produce sensors in Kuala Lumpur. Based on the law in the file, how should I register for sales tax, and what are my obligations?",
    "I opened a company to produce sensors in Kuala Lumpur. During product I paid sales tax on my inputs. Based on the law in the file, what are conditions to be eligible for a refund of the sales tax?",
    "What specific indicators and targets are outlined in Canada's Cybersecurity Strategy?",
    "What measures is the government of Canada taking in response to data security challenges posed by the emergence of novel technologies?",
    "What are the API requirements that apply to the Consent building block?",
    "What additional building blocks are essential to support the functionality of the consent building block?",
    "What are the key findings of the CyberPeace Institute's analysis of cyber threats affecting NGOs in International Geneva?",
    "What are the key lessons learnt from the case studies examined in the report?"
]

In [38]:
def tru_knowledge_graph_rag_retrieval_pipeline(query):
    with tru_rag as recording:
        knowledge_graph_rag.query(query)
    tru.get_leaderboard(app_ids=["Knowledge Graph Retrieval Pipeline Testing v1"])

In [39]:
for query in queries:
    tru_knowledge_graph_rag_retrieval_pipeline(query)

[NodeWithScore(node=TextNode(id_='5fcfd3c5-0216-4b97-889b-820d1ab17b65', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='8ac8cd80-0d10-4f8c-8fe8-b00bb077d61e', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='2a3575fc84374422fcb4fdd517485aa2a887e5045cd526c0eda37d691215e744')}, text='Here are some facts extracted from the provided text:\n\nParties -> Shall raise -> Financial resources for effective implementation of the convention\nParties -> Provide -> Financial assistance for developing country parties\n\nWHO Framework Convention on Tobacco Control\n\n\n                                                    Article 32\n                                                  Right to vote\n\n\n1.       Each Party to this Convention shall have one vote, except as provided for in\nparagraph 2 of this Article.\n\n\n2.       Regional economic integration organizations, in mat

Groundedness per statement in source:   0%|          | 0/4 [00:00<?, ?it/s]



Groundedness per statement in source:   0%|          | 0/3 [00:00<?, ?it/s]

[NodeWithScore(node=TextNode(id_='2f42aae6-03b8-4706-8414-03ef1bca5cf0', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='dda96486-af7a-406c-b1f6-68f2b35cdbf4', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='8bf4de7354acf0038b648eca640caf498f488d6de4eab61a9888ab1d1388e91d')}, text='Here are some facts extracted from the provided text:\n\nThe person -> Exports -> Taxable goods from malaysia\n\nSales Tax                           47\n   (4) The Director General may reduce or disallow any refund\ndue in respect of the claim under subsection (1) to the extent\nthat the refund would unjustly enrich the person referred to in\nsubsection (1).\n\n\n   (5) A claim for refund under this section shall be supported\nby such evidence as required by the Director General.\n\n\nDrawback\n\n\n40.   (1) The Director General may allow drawback of the full\namount of sales tax paid

Groundedness per statement in source:   0%|          | 0/25 [00:00<?, ?it/s]

[NodeWithScore(node=TextNode(id_='2f42aae6-03b8-4706-8414-03ef1bca5cf0', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='dda96486-af7a-406c-b1f6-68f2b35cdbf4', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='8bf4de7354acf0038b648eca640caf498f488d6de4eab61a9888ab1d1388e91d')}, text='Here are some facts extracted from the provided text:\n\nThe person -> Exports -> Taxable goods from malaysia\n\nSales Tax                           47\n   (4) The Director General may reduce or disallow any refund\ndue in respect of the claim under subsection (1) to the extent\nthat the refund would unjustly enrich the person referred to in\nsubsection (1).\n\n\n   (5) A claim for refund under this section shall be supported\nby such evidence as required by the Director General.\n\n\nDrawback\n\n\n40.   (1) The Director General may allow drawback of the full\namount of sales tax paid

Groundedness per statement in source:   0%|          | 0/14 [00:00<?, ?it/s]

[NodeWithScore(node=TextNode(id_='0607f1ca-6597-4584-83c1-72423635d725', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='5d283ec7-743d-4223-9d28-14c499c36a5d', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='5e04af5e72d076bc57e09befc925462b896b864d43a262dd508e3bc029beb9ef')}, text='Here are some facts extracted from the provided text:\n\nStrategy -> Is -> Roadmap for canada’s path forward on cyber security\n\nExecutive Summary\n\n\nImplementing the Strategy\n\n\nRecognizing that the pace of change we see today will only accelerate, this\nStrategy is designed as the mainstay of the Government’s continuous efforts\nto enhance cyber security in Canada. The Government’s actions will evolve\nalongside the ground-breaking technological developments and resulting\nparadigm shifts that have become common in our connected world.\n\n\nCyber security action plans will supp

Groundedness per statement in source:   0%|          | 0/3 [00:00<?, ?it/s]

[NodeWithScore(node=TextNode(id_='a6a30d92-fd6e-4596-ab4b-7e4ac6ddaaca', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='019137d1-086b-409b-a85f-3d5e5f44e359', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='d5e9d6abb1be0430b224619ff3e7912e717785fd9b68103dfbd5859d02396823')}, text='Here are some facts extracted from the provided text:\n\nGovernment -> Enhance -> Cyber security in canada\nFederal government -> Will position -> Canada as a global leader in cyber security\n\nLeadership and Collaboration\n\n\n       Smart cities use digital technologies to enhance quality of life by\n\n\n       making services more efficient, cost-effective, and responsive for\n\n\n       urban residents. For example, “smart” traffic lights will measure and\n\n\n       adapt timing to improve traffic flows and connected sewer systems\n\n\n       will detect leaks and monitor real-ti

Groundedness per statement in source:   0%|          | 0/17 [00:00<?, ?it/s]

[NodeWithScore(node=TextNode(id_='f4e9ccb4-355a-41a7-9179-5e9abd266dcc', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='51b54d1c-0869-4d7b-a933-aca3d1b374c3', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='c05f69e009dc87274c25fd0755d83c3f9867e10d46ed836dfcfdd9cb56b03e93')}, text="Here are some facts extracted from the provided text:\n\nData provider or data consumer admin -> Configures -> Consent building block\n\nSource: GovStack | 23Q4 | GovStack Specification (gitbook.io)\n\n\ndata processing is being (or has been) processed according to the Data Policy\nrequiring a consent, is relevant to various entities involved in the data\nprocessing. For this reason, the generic “verification” activity may be executed\nas part of various workflows satisfying the needs of different actors.\nFollowing is the first core set of key functionalities of the Consent Building 

Groundedness per statement in source:   0%|          | 0/24 [00:00<?, ?it/s]

[NodeWithScore(node=TextNode(id_='06f719a3-0b67-48d8-b2e8-3c845f9c1ec9', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='0e6da802-178b-440f-a2be-e18521417e21', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='f212291d91b04883270fc4890b811441608802180b026774375f2ded4f0349fe')}, text="Here are some facts extracted from the provided text:\n\nConsent building block -> Implements -> Key functionalities described in the consent management lifecycle\nConsent building block -> Follow -> Privacy principles\nConsent building block -> Allows -> Both data consuming organisation and data providing organisation to verify their conformance with the underlying data policy\nConsent building block -> Has obtained -> Requisite access tokens\nConsent building block -> Integrates with -> Most other building blocks\nConsent building block -> Will include -> Capacity to sign a consent 

Groundedness per statement in source:   0%|          | 0/20 [00:00<?, ?it/s]

[NodeWithScore(node=TextNode(id_='8de1f4d6-1b1e-4c28-bcdf-00c9b6788b7a', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='844fac02-2398-4bd8-890f-5e3ca18e5ac7', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='8f3ebdac2ebea4b588ce32e49b79288f277398a4cba245a55fad22925a86a391')}, text='Here are some facts extracted from the provided text:\n\nCyber threat -> Is -> Threat actor using internet\nCyber threat -> Takes advantage of -> Known vulnerability in a product\n\nCyber Threat\nA threat actor, using the internet, who takes advantage of a\nknown vulnerability in a product for the purposes of exploiting a\nnetwork and the information the network carries.\nCybercrime\nA crime committed with the aid of, or directly involving, a data\nprocessing system or computer network. The computer or its\ndata may be the target of the crime or the computer may be the\ntool with whic

Groundedness per statement in source:   0%|          | 0/3 [00:00<?, ?it/s]

[NodeWithScore(node=TextNode(id_='a6a30d92-fd6e-4596-ab4b-7e4ac6ddaaca', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='019137d1-086b-409b-a85f-3d5e5f44e359', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='d5e9d6abb1be0430b224619ff3e7912e717785fd9b68103dfbd5859d02396823')}, text='Here are some facts extracted from the provided text:\n\nNational cyber security strategy -> Includes -> Executive summary\n\nLeadership and Collaboration\n\n\n       Smart cities use digital technologies to enhance quality of life by\n\n\n       making services more efficient, cost-effective, and responsive for\n\n\n       urban residents. For example, “smart” traffic lights will measure and\n\n\n       adapt timing to improve traffic flows and connected sewer systems\n\n\n       will detect leaks and monitor real-time water flow.                                                      

Groundedness per statement in source:   0%|          | 0/18 [00:00<?, ?it/s]

In [40]:
tru.run_dashboard()

Starting dashboard ...


Accordion(children=(VBox(children=(VBox(children=(Label(value='STDOUT'), Output())), VBox(children=(Label(valu…

Dashboard started at http://192.168.43.140:8501 .


<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>

In [41]:
tru.stop_dashboard()