# Parse, Chunk and Load Documents 

This notebook performs three steps:
- **Parsing and Chunking**: The documents are parsed with LangChain's [PyPDFLoader](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/pdf/#using-pypdf) and then split into chunks. More documentation can be found here: [LangChain API](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html).
- **Embeddings**: An embedding is created for every chunk using an OpenAI embeddings model: [text-embedding-3-small](https://platform.openai.com/docs/models/embeddings).
- **Load to Database**: The documents and chunks are loaded into Neo4j using the [Python Driver](https://neo4j.com/docs/api/python-driver/current/), which allows the database to be queried from a Python script.

In [9]:
%pip install pypdf langchain_community langchain langchain_openai IPython neo4j python-dotenv pandas numpy

Note: you may need to restart the kernel to use updated packages.

In [1]:
import pandas as pd
import numpy as np
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
import os
from dotenv import load_dotenv
from neo4j import GraphDatabase
import ast
from IPython.display import clear_output

## Get Credentials

In [1]:
# Run the import cell above first; this cell relies on `os` and `load_dotenv`.
if os.path.exists('credentials.env'):
    load_dotenv('credentials.env', override=True)

    # Neo4j
    uri = os.getenv('NEO4J_URI')
    username = os.getenv('NEO4J_USERNAME')
    password = os.getenv('NEO4J_PASSWORD')
    database = os.getenv('NEO4J_DATABASE')

    # OpenAI
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
    if OPENAI_API_KEY is None:
        # Assigning None to os.environ would raise a TypeError.
        print("OPENAI_API_KEY is not set in 'credentials.env'.")
    else:
        os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
else:
    print("File 'credentials.env' not found.")

In [49]:
documents_path = "documents/"

## Parse and Chunk Documents

In [4]:
chunk_size = 1000
chunk_overlap = 100

In [5]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = chunk_size,
    chunk_overlap  = chunk_overlap,
    length_function = len,
    is_separator_regex = False,
)
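
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as paragraph and line breaks, so its output will generally differ. The helper name `sliding_window_chunks` is hypothetical and not part of LangChain.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Hypothetical helper
    for illustration only; it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
example_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(example_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the role `chunk_overlap` plays in the splitter above: text near a chunk boundary appears in both neighbouring chunks.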

In [10]:
chunk_seq_id = 0
chunks_with_metadata = []

for doc_name in sorted(os.listdir(documents_path)):
    print(f"Parsing: {doc_name}")
    doc_path = os.path.join(documents_path, doc_name)
    loader = PyPDFLoader(doc_path)
    # load_and_split() already applies LangChain's default text splitter;
    # each resulting piece is split again below with our own splitter.
    pages = loader.load_and_split()
    num_chunks = 0
    for page in pages:
        chunks = text_splitter.split_text(page.page_content)
        for chunk in chunks:
            chunks_with_metadata.append({
                'file': page.metadata['source'],
                'page': page.metadata['page'],  # zero-based page index
                'chunks': chunk,
                'num_chuncks': len(chunks),  # this key spelling is reused in later cells
                'chunk_seq_id': chunk_seq_id
            })
            chunk_seq_id += 1
            num_chunks += 1
    print(f"chunked {len(pages)} pages in {num_chunks} chunks")

Parsing: NN_zorg_vrij_basic_2024.pdf


Create a DataFrame of Chunks

In [13]:
df = pd.DataFrame.from_dict(chunks_with_metadata)

In [14]:
df

| | file | page | chunks | num_chuncks | chunk_seq_id |
|---|---|---|---|---|---|
| 0 | documents/NN_zorg_vrij_basic_2024.pdf | 0 | Reimbursements and terms and conditions for 20... | 1 | 0 |
| 1 | documents/NN_zorg_vrij_basic_2024.pdf | 1 | Abroad .......................................... | 5 | 1 |
| 2 | documents/NN_zorg_vrij_basic_2024.pdf | 1 | General practitioner ............................ | 5 | 2 |
| 3 | documents/NN_zorg_vrij_basic_2024.pdf | 1 | Medical mental healthcare........................ | 5 | 3 |
| 4 | documents/NN_zorg_vrij_basic_2024.pdf | 1 | Sensory impairment care.......................... | 5 | 4 |
| ... | ... | ... | ... | ... | ... |
| 771 | documents/NN_zorg_vrij_basic_2024.pdf | 202 | assess whether we should change it. Such a rec... | 4 | 771 |
| 772 | documents/NN_zorg_vrij_basic_2024.pdf | 203 | Reimbursements and terms and conditions for 20... | 2 | 772 |
| 773 | documents/NN_zorg_vrij_basic_2024.pdf | 203 | (`Nederlandse Zorgautoriteit', NZa): Postbus 3... | 2 | 773 |
| 774 | documents/NN_zorg_vrij_basic_2024.pdf | 204 | Reimbursements and terms and conditions for 20... | 2 | 774 |


## Create embeddings

Load an embedding model

In [15]:
model = 'text-embedding-3-small'

In [21]:
embeddings_model = OpenAIEmbeddings(
    model = model,
    openai_api_key = OPENAI_API_KEY
)

Add an embedding for every chunk in the DataFrame

In [22]:
df['embedding'] = df['chunks'].apply(lambda x: embeddings_model.embed_query(x))
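
Calling `embed_query` once per row issues one request per chunk. LangChain embedding classes also expose `embed_documents`, which accepts a list of texts, so the chunks could be embedded in batches instead. The sketch below shows only the batching logic, with a stand-in embedding function so that it runs without an API key. The names `embed_in_batches` and `fake_embed_documents` and the batch size are assumptions for illustration.

```python
def embed_in_batches(texts, embed_documents, batch_size=100):
    """Embed texts in fixed-size batches and return one vector per text, in order."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_documents(batch))
    return vectors


def fake_embed_documents(batch):
    # Stand-in for embeddings_model.embed_documents: returns a 1-dimensional
    # "embedding" holding the text length, so the sketch runs offline.
    return [[float(len(text))] for text in batch]


demo_texts = ["a", "bb", "ccc", "dddd", "eeeee"]
demo_vectors = embed_in_batches(demo_texts, fake_embed_documents, batch_size=2)
print(demo_vectors)
# [[1.0], [2.0], [3.0], [4.0], [5.0]]
```

With the real model, the call would be along the lines of `df['embedding'] = embed_in_batches(df['chunks'].tolist(), embeddings_model.embed_documents)`.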

## Create Neo4j Connection

Set up the Python Driver for Neo4j with the loaded credentials.

In [52]:
class App:
    """Thin wrapper around the Neo4j Python driver."""

    def __init__(self, uri, user, password, database=None):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))
        # Target database; None lets the server use the user's default database.
        self.database = database

    def close(self):
        self.driver.close()

    def query(self, query):
        return self.driver.execute_query(query, database_=self.database)

    def query_params(self, query, parameters):
        return self.driver.execute_query(
            query, parameters_=parameters, database_=self.database
        )

    def count_nodes_in_db(self):
        result = self.query("MATCH (n) RETURN count(n) AS node_count")
        return result.records[0]["node_count"]

    def remove_nodes_relationships(self):
        # Delete all nodes and their relationships in batches (requires APOC).
        query = """
            CALL apoc.periodic.iterate(
                "MATCH (c) RETURN c",
                "WITH c DETACH DELETE c",
                {batchSize: 1000}
            )
        """
        self.query(query)

    def remove_all_constraints(self):
        # Drop all existing constraints and indexes (requires APOC).
        query = """
            CALL apoc.schema.assert({}, {})
        """
        self.query(query)

In [53]:

app = App(uri, username, password, database)

In [54]:
app.count_nodes_in_db()

200

## Load to Database

Create some constraints

In [55]:
app.query("CREATE CONSTRAINT unique_policy IF NOT EXISTS FOR (p:Policy) REQUIRE p.id IS UNIQUE")

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x30abbf940>, keys=[])

In [56]:
app.query("CREATE CONSTRAINT unique_chunk IF NOT EXISTS FOR (c:Chunk) REQUIRE c.id IS UNIQUE")

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x30b275dc0>, keys=[])

### Load Policies Nodes to database

Create a DataFrame of the policies, with one row per file.

In [57]:
policies_df = df['file'].drop_duplicates(keep='first').copy()
policies_df = policies_df.reset_index().drop('index',axis=1).reset_index()
policies_df = policies_df.rename(columns={"index": "policy_id", "file": "file_location"})
policies_df['file_name'] = policies_df['file_location'].apply(lambda x: x.split('/')[-1])
policies_df


| | policy_id | file_location | file_name |
|---|---|---|---|
| 0 | 0 | documents/NN_zorg_vrij_basic_2024.pdf | NN_zorg_vrij_basic_2024.pdf |


Get number of pages per file

In [58]:
df = pd.merge(df, policies_df, left_on='file', right_on='file_location', how='left').copy()

In [59]:
df

| | file | page | chunks | num_chuncks | chunk_seq_id | embedding | policy_id | file_location | file_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | documents/NN_zorg_vrij_basic_2024.pdf | 0 | Reimbursements and terms and conditions for 20... | 1 | 0 | [-0.017664135392465398, 0.0010989804845562235,... | 0 | documents/NN_zorg_vrij_basic_2024.pdf | NN_zorg_vrij_basic_2024.pdf |
| 1 | documents/NN_zorg_vrij_basic_2024.pdf | 1 | Abroad .......................................... | 5 | 1 | [0.009289972868684463, 0.011951055872454076, 0... | 0 | documents/NN_zorg_vrij_basic_2024.pdf | NN_zorg_vrij_basic_2024.pdf |
| 2 | documents/NN_zorg_vrij_basic_2024.pdf | 1 | General practitioner ............................ | 5 | 2 | [0.03081734351206423, 0.03832021428989872, 0.1... | 0 | documents/NN_zorg_vrij_basic_2024.pdf | NN_zorg_vrij_basic_2024.pdf |
| 3 | documents/NN_zorg_vrij_basic_2024.pdf | 1 | Medical mental healthcare........................ | 5 | 3 | [0.05817534559804301, 0.005862851320530374, 0.... | 0 | documents/NN_zorg_vrij_basic_2024.pdf | NN_zorg_vrij_basic_2024.pdf |
| 4 | documents/NN_zorg_vrij_basic_2024.pdf | 1 | Sensory impairment care.......................... | 5 | 4 | [0.04950871115614762, 0.003652064254791728, 0.... | 0 | documents/NN_zorg_vrij_basic_2024.pdf | NN_zorg_vrij_basic_2024.pdf |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 771 | documents/NN_zorg_vrij_basic_2024.pdf | 202 | assess whether we should change it. Such a rec... | 4 | 771 | [0.030318757627279064, 0.03105452667823687, 0.... | 0 | documents/NN_zorg_vrij_basic_2024.pdf | NN_zorg_vrij_basic_2024.pdf |
| 772 | documents/NN_zorg_vrij_basic_2024.pdf | 203 | Reimbursements and terms and conditions for 20... | 2 | 772 | [0.002802086345248927, 0.03354124784343089, 0.... | 0 | documents/NN_zorg_vrij_basic_2024.pdf | NN_zorg_vrij_basic_2024.pdf |
| 773 | documents/NN_zorg_vrij_basic_2024.pdf | 203 | (`Nederlandse Zorgautoriteit', NZa): Postbus 3... | 2 | 773 | [-0.012716539281772982, 0.03440676650553472, 0... | 0 | documents/NN_zorg_vrij_basic_2024.pdf | NN_zorg_vrij_basic_2024.pdf |
| 774 | documents/NN_zorg_vrij_basic_2024.pdf | 204 | Reimbursements and terms and conditions for 20... | 2 | 774 | [0.0028165730735622684, 0.03883166687510502, 0... | 0 | documents/NN_zorg_vrij_basic_2024.pdf | NN_zorg_vrij_basic_2024.pdf |


In [61]:
# Page indices are zero-based, so the page count is the highest index plus one.
pages_df = df.groupby(['policy_id', 'file_name'])['page'].max() + 1

In [62]:
policies_df = pd.merge(policies_df, pages_df, on='policy_id', how='left')
policies_df

| | policy_id | file_location | file_name | page |
|---|---|---|---|---|
| 0 | 0 | documents/NN_zorg_vrij_basic_2024.pdf | NN_zorg_vrij_basic_2024.pdf | 205 |


### Load the Policies

In [63]:
merge_file_query = """
    MERGE(mergedPolicy:Policy {id: $policy_id})
        ON CREATE SET
            mergedPolicy.file_location = $file_location,
            mergedPolicy.file_name = $file_name,
            mergedPolicy.pages = $file_pages
    RETURN mergedPolicy
"""

In [64]:
for index, row in policies_df.iterrows():
    clear_output(wait=True)
    d = {
        'file_location': row['file_location'],
        'file_name': row['file_name'],
        'policy_id': row['policy_id'],
        'file_pages': row['page']
    }
    app.query_params(merge_file_query, d)
    print(f"Loaded {row['file_name']}")
    print("Progress: ", np.round((index+1)/policies_df.shape[0]*100,2), "%")

Loaded NN_zorg_vrij_basic_2024.pdf
Progress:  100.0 %


### Load Chunk Nodes to database

Create a DataFrame for the chunks.

In [65]:
chunks_df = df[['chunk_seq_id', 'num_chuncks', 'page', 'chunks', 'embedding']]
chunks_df

| | chunk_seq_id | num_chuncks | page | chunks | embedding |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | Reimbursements and terms and conditions for 20... | [-0.017664135392465398, 0.0010989804845562235,... |
| 1 | 1 | 5 | 1 | Abroad .......................................... | [0.009289972868684463, 0.011951055872454076, 0... |
| 2 | 2 | 5 | 1 | General practitioner ............................ | [0.03081734351206423, 0.03832021428989872, 0.1... |
| 3 | 3 | 5 | 1 | Medical mental healthcare........................ | [0.05817534559804301, 0.005862851320530374, 0.... |
| 4 | 4 | 5 | 1 | Sensory impairment care.......................... | [0.04950871115614762, 0.003652064254791728, 0.... |
| ... | ... | ... | ... | ... | ... |
| 771 | 771 | 4 | 202 | assess whether we should change it. Such a rec... | [0.030318757627279064, 0.03105452667823687, 0.... |
| 772 | 772 | 2 | 203 | Reimbursements and terms and conditions for 20... | [0.002802086345248927, 0.03354124784343089, 0.... |
| 773 | 773 | 2 | 203 | (`Nederlandse Zorgautoriteit', NZa): Postbus 3... | [-0.012716539281772982, 0.03440676650553472, 0... |
| 774 | 774 | 2 | 204 | Reimbursements and terms and conditions for 20... | [0.0028165730735622684, 0.03883166687510502, 0... |


In [66]:
merge_chunck_query = """
    MERGE(mergedChunk:Chunk {id: $chunk_seq_id})
        ON CREATE SET
            mergedChunk.page = $page,
            mergedChunk.chunk = $chunk,
            mergedChunk.embedding = $embedding
    RETURN mergedChunk
"""

In [67]:
for index, row in chunks_df.iterrows():
    clear_output(wait=True)
    d = {
        'chunk_seq_id': row['chunk_seq_id'],
        'page': row['page'],
        'chunk': row['chunks'],
        'embedding': row['embedding']
    }
    app.query_params(merge_chunck_query, d)
    print("Progress: ", np.round(((index+1)/chunks_df.shape[0])*100,2), "%")

Progress:  100.0 %
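
The loop above sends one query per chunk. A common alternative is to pass a list of rows as a single parameter and let Cypher iterate over it with `UNWIND`, which reduces the number of round trips. The following is a sketch, not the method used in this notebook: the query `merge_chunks_batch_query`, the helper `batched`, and the batch size are assumptions, and the Neo4j call itself is left as a comment because it needs a running database.

```python
merge_chunks_batch_query = """
    UNWIND $rows AS row
    MERGE (c:Chunk {id: row.chunk_seq_id})
        ON CREATE SET
            c.page = row.page,
            c.chunk = row.chunk,
            c.embedding = row.embedding
"""


def batched(rows, batch_size):
    """Yield consecutive slices of rows with at most batch_size items each."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Small stand-in for the chunk rows; real rows would come from chunks_df.
demo_rows = [
    {"chunk_seq_id": i, "page": i // 2, "chunk": f"text {i}", "embedding": [0.0]}
    for i in range(5)
]

batch_sizes = []
for batch in batched(demo_rows, batch_size=2):
    # With a live connection this would be:
    # app.query_params(merge_chunks_batch_query, {"rows": batch})
    batch_sizes.append(len(batch))

print(batch_sizes)
# [2, 2, 1]
```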


### Load File to Chunk Relationship

In [68]:
part_of_df = df[['chunk_seq_id', 'policy_id']].copy()
part_of_df

| | chunk_seq_id | policy_id |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 3 | 0 |
| 4 | 4 | 0 |
| ... | ... | ... |
| 771 | 771 | 0 |
| 772 | 772 | 0 |
| 773 | 773 | 0 |
| 774 | 774 | 0 |


In [69]:
merge_part_of_query = """
    MATCH
        (policy:Policy {id: $policy_id}),
        (chunk:Chunk {id: $chunk_id})
    MERGE (policy)<-[r:PART_OF]-(chunk)
    RETURN policy.file_name, type(r), chunk.id
"""

In [70]:
for index, row in part_of_df.iterrows():
    clear_output(wait=True)
    d = {
        'policy_id': row['policy_id'],
        'chunk_id': row['chunk_seq_id']
    }
    app.query_params(merge_part_of_query, d)
    # print(f"Loaded relationship from policy {row['policy_id']} to chunk {row['chunk_seq_id']}")
    print("Progress: ", np.round(((index+1)/part_of_df.shape[0])*100,2), "%")


Progress:  100.0 %


### Load Chunk to Chunk Relationship

Link the chunks in order by the "NEXT" relationship.

In [71]:
next_query = """
    MATCH (policy:Policy)
    WITH policy
    CALL {
        WITH policy
        MATCH (policy)<-[:PART_OF]-(chunks:Chunk)
        WITH chunks ORDER BY chunks.id ASC
        WITH collect(chunks) as chunk_list
        CALL apoc.nodes.link(
            chunk_list,
            "NEXT",
            {avoidDuplicates: true}
        )
        RETURN size(chunk_list) as size_chunk_list
    }
    WITH policy, size_chunk_list
    RETURN policy, size_chunk_list
"""

In [72]:
app.query(next_query)

EagerResult(records=[<Record policy=<Node element_id='207' labels=frozenset({'Policy'}) properties={'file_location': 'documents/NN_zorg_vrij_basic_2024.pdf', 'pages': 205, 'file_name': 'NN_zorg_vrij_basic_2024.pdf', 'id': 0}> size_chunk_list=776>], summary=<neo4j._work.summary.ResultSummary object at 0x30b508b20>, keys=['policy', 'size_chunk_list'])
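
Conceptually, `apoc.nodes.link` walks the ordered list and connects each node to the one that follows it. Below is a plain-Python sketch of the resulting pairs for one policy, using made-up chunk ids; the function name `next_pairs` is hypothetical.

```python
def next_pairs(chunk_ids):
    """Return (source, target) id pairs for NEXT relationships in id order."""
    ordered = sorted(chunk_ids)
    # Pair every id with its successor; the last id has no outgoing NEXT.
    return list(zip(ordered, ordered[1:]))


demo_pairs = next_pairs([3, 0, 2, 1])
print(demo_pairs)
# [(0, 1), (1, 2), (2, 3)]
```

A list of n chunks therefore yields n - 1 `NEXT` relationships per policy.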