# LLM-Powered PDF Ingestion
## Outline
1. Data Cleansing
2. Prompt Definition
3. Entity & Relationship Extraction
4. Neo4j Cypher Generation
5. Data Ingestion

## Environment Set-up

Before starting this exercise, prepare your Neo4j Sandbox instance:

[Neo4j Sandbox](https://neo4j.com/sandbox/)

In [167]:
# Enter your Sandbox credentials below

connectionUrl = 'bolt://44.203.125.118:7687'
username = 'neo4j'
password = 'methods-front-booms'

In [None]:
%pip uninstall -y openai

In [169]:
%%capture
%pip install graphdatascience
%pip install openai==0.28
%pip install python-dotenv
%pip install retry
%pip install PyPDF2
%pip install langchain
%pip install sentence-transformers

In [170]:
import os
import openai
from retry import retry
import re
from string import Template
import json
import ast
import time
import pandas as pd
from graphdatascience import GraphDataScience
import glob
from timeit import default_timer as timer
from dotenv import load_dotenv

In [171]:
from google.colab import userdata
# Check that the secret exists without echoing it; an API key should never appear in notebook output
assert userdata.get('open_key'), "Add a Colab secret named 'open_key' first"

In [172]:
os.environ["OPENAI_API_KEY"] = userdata.get('open_key')
openai.api_key = os.getenv('OPENAI_API_KEY')

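## Optional section to test our connection to the LLM

Note: `langchain_openai` depends on a 1.x release of `openai`, so installing it may replace the `openai==0.28` pin from the set-up cell, which the `openai.ChatCompletion` call later in this notebook relies on. If you run this section, reinstall `openai==0.28` afterwards.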

In [None]:
%pip install langchain_openai

In [None]:
from langchain_openai import OpenAI

llm = OpenAI(openai_api_key=userdata.get('open_key'))

response = llm.invoke("What is Neo4j?")

print(response)

## Data Cleansing

First, let's define a function that cleans the input data. To keep things simple, it does one thing: the corpus refers to figures such as scan images, which we don't have, so we remove any such references.

In [173]:
def clean_text(text):
  # Remove parenthesised figure references such as "(Fig. 3)"; the figures themselves are not available
  return re.sub(r'\(fig[^)]*\)', '', text, flags=re.IGNORECASE)
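
As a quick check of what the pattern removes, the sketch below applies the same regular expression to a made-up sentence. The helper name `strip_figure_refs` and the sample text are illustrative only and are not taken from the PDF.

```python
import re

def strip_figure_refs(text):
    # Same pattern as clean_text: drop parenthesised references that start with "(fig"
    return re.sub(r'\(fig[^)]*\)', '', text, flags=re.IGNORECASE)

sample = "The scan (Fig. 2) shows the layout (see figure 3)."
cleaned = strip_figure_refs(sample)
print(cleaned)
```

Only references that begin with `(fig` are removed, so `(see figure 3)` survives, and the space on either side of a removed reference is left in place.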

Let's take this document and extract entities and relationships using the LLM.

### Source PDF File

The example PDF document is the recent "Building Knowledge Graphs" book by Jesus Barrasa and Jim Webber. Set `pdf` below to the path of the PDF you want to ingest.

In [174]:
pdf = 'ukgovai.pdf'

In [175]:
from PyPDF2 import PdfReader

pdf_reader = PdfReader(pdf)

article_txt = ""
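for page in pdf_reader.pages:
    # extract_text() may return nothing for image-only pages; the newline keeps
    # the last word of one page from fusing with the first word of the next
    article_txt += (page.extract_text() or "") + "\n"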

In [176]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2500, chunk_overlap=200, length_function=len
)

chunks = text_splitter.split_text(text=article_txt)

In [157]:
len(chunks)

6
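
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and newlines); `naive_split` is a hypothetical helper that only illustrates how consecutive chunks share characters.

```python
def naive_split(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

demo = "abcdefghijklmnopqrstuvwxyz"
parts = naive_split(demo, chunk_size=10, chunk_overlap=3)
print(parts)
```

Each chunk starts with the last three characters of the previous one, which is the role the 200-character overlap plays above: text that straddles a chunk boundary still appears whole in at least one chunk.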

## Prompt Definition

This is a helper function that sends our prompt and text input to the LLM.

In [178]:
# GPT-4 Prompt to complete
@retry(tries=2, delay=5)
def process_gpt(system,
                prompt):

    completion = openai.ChatCompletion.create(
        # engine="gpt-3.5-turbo",
        model="gpt-4",
        max_tokens=2500,
        # Try to be as deterministic as possible
        temperature=0,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ]
    )
    nlp_results = completion.choices[0].message.content
    return nlp_results
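
The `@retry(tries=2, delay=5)` decorator makes at most two attempts, pausing five seconds before the second one if the first raises. A minimal standard-library sketch of that behaviour follows, assuming a hypothetical `simple_retry` decorator; it is not the `retry` package's implementation, only an illustration of the semantics.

```python
import functools
import time

def simple_retry(tries, delay):
    """Hypothetical stand-in: call the function up to `tries` times, sleeping `delay` seconds between attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == tries:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay)
        return wrapper
    return decorator

calls = {"count": 0}

@simple_retry(tries=2, delay=0)
def flaky():
    # Fails on the first call and succeeds on the second, like a transient API error
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("transient failure")
    return "ok"

result = flaky()
print(result, calls["count"])
```

With only two attempts, a second consecutive failure (for example a rate limit that lasts longer than the delay) still propagates to the caller.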

This is a simple prompt to start with. If the processing is complex, you can chain several prompts. Here I use a single prompt that extracts information strictly according to the entities and relationships it defines. This is a simplification: in a real project you would work with domain experts to define the ontology systematically and capture the important information, and you might also fine-tune the model.

## Prompts

Our prompt below is deliberately generic: we ask the LLM to extract information from the text and categorize it. The LLM will also enrich the description of the extracted information with supporting information of its own.

In [179]:
prompt1="""From the text below, extract any Entities & Relationships which are of interest,
these could be business concepts, technology, people, locations, processes or financial values

0. ALWAYS FINISH THE OUTPUT. Never send partial responses.  You should aim to extract as many entities from the text as possible

1. First look for Entities of interest in the text and generate as a comma-separated format similar to the entity type.
  The entity label should be defined by the high level category of the entity extracted, look for common terms and groups and use this as the entity label,
  replace the label 'Thing' below with a category label.   The name property should be the short name of the extracted entity
  'id' property of each entity must be alphanumeric and must be unique among the entities.
  You will be referring to this property to define the relationship between each entity

  label:'Paper', name:string, summary:string //Title of the article;`name` property is the title of the paper,
  in lowercase & camel-case & should always start with an alphabet; summary is a description as defined within openai
  label:'Thing', name:string, summary:string //any item of interest within the text,
  in lowercase & camel-case & should always start with an alphabet; summary is a description as defined within openai

2. Next generate each relationship as a triple of head, relationship and tail.  To refer to the head and tail entity, use their respective 'id' property.
   Relationship property should be mentioned within brackets as comma-separated.
   They should follow these relationship types below. You will have to generate as many relationships as needed as defined below:
    Relationship types:
    Paper|MENTIONS|Thing
    Thing|RELATES_TO|Thing

    The output should look like:
{
    "entities": [{"label":"Paper","id":string,"name":string,"summary":string}],
    "relationships": ["paper|MENTIONS|businesstrend"]
}

Text:
$ctext
"""
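
The `$ctext` placeholder at the end of the prompt is filled in with `string.Template`. A small sketch of that substitution, using a shortened stand-in prompt (`mini_prompt` and the sample sentence are illustrative only):

```python
from string import Template

mini_prompt = """Extract entities from the text below.

Text:
$ctext
"""

# substitute() replaces $ctext; a '$' inside the inserted text is left untouched
# because only the template itself is scanned for placeholders
filled = Template(mini_prompt).substitute(ctext="Budget rises to $5m in 2024.")
print(filled)
```

`Template` suits this prompt because the JSON braces in the expected-output example need no escaping, which they would with `str.format`. A literal dollar sign in the prompt itself would have to be written as `$$`.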

In [None]:
article_txt

#### Run the analysis

Let's run our completion task with our LLM

In [180]:
%%time
def run_completion(prompt, results, ctext):
    system = "You are a helpful business analyst who extracts relevant information and stores it in a Neo4j Knowledge Graph"
    try:
        pr = Template(prompt).substitute(ctext=ctext)
        res = process_gpt(system, pr)
        # The reply is expected to be the JSON object described in the prompt
        results.append(json.loads(res))
    except Exception as e:
        # Report the failure but keep whatever was collected so far
        print(e)
    # Always return the list so the caller never overwrites results with None
    return results

prompts = [prompt1]
results = []
for p in prompts:
    results = run_completion(p, results, clean_text(article_txt))


CPU times: user 110 ms, sys: 15.8 ms, total: 126 ms
Wall time: 26.3 s


#### Results

In [None]:
results
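
The model's reply is free text, so it can be worth checking its shape before generating Cypher. The sketch below assumes a hypothetical `parse_extraction` helper; the checks it applies (a JSON object with `entities` and `relationships`, and `head|TYPE|tail` triples whose ends refer to known ids) follow the output format requested in the prompt. The sample reply is made up.

```python
import json

def parse_extraction(raw):
    """Parse one LLM reply and keep only well-formed entities and relationships."""
    data = json.loads(raw)
    entities = [e for e in data.get("entities", [])
                if isinstance(e, dict) and "id" in e and "label" in e]
    known_ids = {e["id"] for e in entities}
    relationships = []
    for triple in data.get("relationships", []):
        parts = [p.strip() for p in str(triple).split("|")]
        # Keep only head|TYPE|tail triples whose ends are known entity ids
        if len(parts) == 3 and parts[0] in known_ids and parts[2] in known_ids:
            relationships.append("|".join(parts))
    return {"entities": entities, "relationships": relationships}

raw_reply = json.dumps({
    "entities": [
        {"label": "Paper", "id": "paper1", "name": "examplePaper", "summary": "A made-up paper"},
        {"label": "Technology", "id": "graphDb", "name": "graphDatabase", "summary": "A made-up entity"},
    ],
    "relationships": ["paper1|MENTIONS|graphDb", "paper1|MENTIONS|missingId", "broken"],
})
parsed = parse_extraction(raw_reply)
print(parsed["relationships"])
```

Dropping dangling relationships at this stage means a single invented id in the reply does not derail the Cypher generation step.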

## Neo4j Cypher Generation

The entities and relationships returned by the LLM have to be transformed into Cypher so that we can ingest them into Neo4j.

In [181]:
#pre-processing results for uploading into Neo4j - helper function:
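def get_prop_str(prop_dict, _id):
    s = []
    for key, val in prop_dict.items():
        if key != 'label' and key != 'id':
            # Escape backslashes and double quotes so the value stays a valid Cypher string literal
            escaped = str(val).replace('\\', '\\\\').replace('"', '\\"')
            s.append(_id+"."+key+' = "'+escaped+'"')
    # An entity with no extra properties needs no SET clause
    return ' ON CREATE SET ' + ','.join(s) if s else ''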

def get_cypher_compliant_var(_id):
    return "_"+ re.sub(r'[\W_]', '', _id)

def generate_cypher(in_json):
    e_map = {}
    e_stmt = []
    r_stmt = []
    e_stmt_tpl = Template("($id:$label{id:'$key'})")
    r_stmt_tpl = Template("""
      MATCH $src
      MATCH $tgt
      MERGE ($src_id)-[:$rel]->($tgt_id)
    """)
    for obj in in_json:
      for j in obj['entities']:
          label = j['label']
          key = j['id']
          # 'Case' and 'Person' entities get a time-based key in place of the LLM-supplied id
          if label == 'Case':
                key = 'c'+str(time.time_ns())
          elif label == 'Person':
                key = 'p'+str(time.time_ns())
          varname = get_cypher_compliant_var(j['id'])
          stmt = e_stmt_tpl.substitute(id=varname, label=label, key=key)
          e_map[varname] = stmt
          e_stmt.append('MERGE '+ stmt + get_prop_str(j, varname))

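      for st in obj['relationships']:
          rels = st.split("|")
          if len(rels) != 3:
              continue  # skip anything that is not a head|TYPE|tail triple
          src_id = get_cypher_compliant_var(rels[0].strip())
          rel = rels[1].strip()
          tgt_id = get_cypher_compliant_var(rels[2].strip())
          if src_id not in e_map or tgt_id not in e_map:
              continue  # skip relationships that refer to an entity the LLM did not define
          stmt = r_stmt_tpl.substitute(
              src_id=src_id, tgt_id=tgt_id, src=e_map[src_id], tgt=e_map[tgt_id], rel=rel)

          r_stmt.append(stmt)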

    return e_stmt, r_stmt
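
An alternative to building string literals by hand is to pass the values as query parameters. The sketch below builds a `(query, params)` pair per entity. `entity_to_param_query` is a hypothetical helper, the fallback label `Thing` is an assumption borrowed from the prompt, and the label is still interpolated (after stripping non-word characters) because a label cannot be supplied as a plain `$param` value.

```python
import re

def entity_to_param_query(entity):
    """Return (query, params) for one entity dict produced by the LLM."""
    # Labels are interpolated, so keep only letters, digits and underscores;
    # falling back to 'Thing' for an empty label is an assumption of this sketch
    label = re.sub(r'\W', '', str(entity["label"])) or "Thing"
    props = {k: v for k, v in entity.items() if k not in ("label", "id")}
    query = f"MERGE (n:`{label}` {{id: $id}}) ON CREATE SET n += $props"
    return query, {"id": entity["id"], "props": props}

query, params = entity_to_param_query(
    {"label": "Technology", "id": "graphDb", "name": 'The "graph" database', "summary": "Made-up entity"}
)
print(query)
print(params)
```

Each pair could then be passed along the lines of `gds.run_cypher(query, params=params)`, so quotes inside a name or summary never reach the query text.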

In [182]:
ent_cyp, rel_cyp = generate_cypher(results)

_Optional - View the generated Cypher Statements_

In [None]:
ent_cyp

## Data Ingestion

In [183]:
gds = GraphDataScience(connectionUrl, auth=(username, password))
gds.version()

'2.6.0'

Ingest the entities.

In [184]:
%%time
for e in ent_cyp:
    gds.run_cypher(e)


CPU times: user 31.4 ms, sys: 943 µs, total: 32.3 ms
Wall time: 1.92 s


Now ingest the relationships.

In [185]:
%%time
for r in rel_cyp:
    gds.run_cypher(r)

CPU times: user 38.7 ms, sys: 3.91 ms, total: 42.6 ms
Wall time: 4.96 s
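
If one generated statement is malformed, the ingestion loops above stop at the first error. A sketch of a more forgiving loop follows, assuming a hypothetical `run_all` helper that takes any callable (for example `gds.run_cypher`) and collects failures instead of aborting. The `fake_run` function stands in for the database so the sketch runs on its own.

```python
def run_all(statements, run):
    """Execute each statement with `run`; return (statement, error message) pairs for failures."""
    failures = []
    for stmt in statements:
        try:
            run(stmt)
        except Exception as exc:
            failures.append((stmt, str(exc)))
    return failures

# Stand-in for gds.run_cypher so the sketch runs without a database
def fake_run(stmt):
    if "BAD" in stmt:
        raise ValueError("syntax error")

failed = run_all(
    ["MERGE (a:Thing {id:'a'})", "BAD STATEMENT", "MERGE (b:Thing {id:'b'})"],
    fake_run,
)
print(failed)
```

The returned list can be inspected afterwards to see which statements the LLM output rendered invalid.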
