# Data Extraction (Text-to-KG)

In [1]:
import os
import time
from typing import List, Dict, Any
from dotenv import load_dotenv
from langchain.globals import set_llm_cache
from langchain_community.cache import InMemoryCache
from langchain_core.documents.base import Document


'''
You need the following environment variables:

    1. NEO4J_URL (Default="bolt://localhost:7687"), 
    2. NEO4J_USERNAME (Default="neo4j"), 
    3. NEO4J_PASSWORD (Default=None), 
    4. NEO4J_DATABASE (Default="neo4j")
    (The above are for establishing a Neo4j Connection and can be passed as arguments as well)
    5. MODELS_CACHE_FOLDER (Default=None):- 
        The SentenceTransformer model "all-MiniLM-L6-v2" is used to create embeddings for calculating vector similarity.
        This is the path to the folder used as the model cache.
    6. GOOGLE_API_KEY (for ChatGoogleGenerativeAI), or the API key variable of whichever other LLM you want to use
    7. TESSDATA_PREFIX (may be needed when using PyTesseract)
'''
load_dotenv('../.env')

from kgrag.data_extraction import Text2KG
from kgrag.parse_pdf import PDFParserMarkdown, OCREngine

from langchain_google_genai import ChatGoogleGenerativeAI

set_llm_cache(InMemoryCache()) # Set LLM Caching (Optional)

# Replace with llm you want to use
llm = ChatGoogleGenerativeAI(model="models/gemini-1.0-pro", temperature=0)


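The Neo4j variables listed in the docstring above can be gathered in one place before a connection is opened. The following is a minimal sketch and not part of `kgrag`: the helper name `neo4j_settings` is hypothetical, and the defaults are the ones stated in the docstring.

```python
import os
from typing import Dict, Mapping, Optional


def neo4j_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Collect Neo4j connection settings, falling back to the documented defaults.

    Hypothetical helper for illustration only; it is not part of kgrag.
    """
    env = os.environ if env is None else env
    return {
        "neo4j_url": env.get("NEO4J_URL", "bolt://localhost:7687"),
        "neo4j_username": env.get("NEO4J_USERNAME", "neo4j"),
        "neo4j_password": env.get("NEO4J_PASSWORD"),  # no default: None if unset
        "neo4j_database": env.get("NEO4J_DATABASE", "neo4j"),
    }


# Pass a plain dict instead of os.environ to see the fallbacks in action.
settings = neo4j_settings({"NEO4J_PASSWORD": "secret"})
print(settings["neo4j_url"])
```

Accepting a mapping argument keeps the helper easy to check without touching the real environment; calling `neo4j_settings()` with no argument reads `os.environ` after `load_dotenv` has run.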

In [2]:
# filepath = input("Please enter filepath of the file you want to process: ")
# filepath = "../SampleDocs/Leadership-Etsko-Schuitema.pdf" # ...or any other file you want to use
# filepath = "/media/wali/D_Drive/Documents/Books/C++_Programming_Program_Design_Including_Data_Structure_D.S.Malik_5th_DS.pdf"
filepath = "/media/wali/D_Drive/Documents/Books/C-How-to-Program-7th-Edition.pdf"

In [3]:

# Can also accept neo4j_url, neo4j_username, neo4j_password & neo4j_database arguments
# if the above mentioned environment variables have not been set
text2kg = Text2KG(
    llm=llm,
    emb_model=None, # By default, SentenceTransformer is used; pass any other embedding model to override it
    disambiguate_nodes=False,
    link_nodes=True,
    node_vector_similarity_threshold=0.90,
    subject=os.path.splitext(os.path.basename(filepath))[0].replace('_', ' ').replace('-', ' '), # Subject can be anything or nothing; the filename without its extension works well for most cases (splitext keeps names that contain dots intact)
    verbose=True
) 

In [6]:
'''
Parse or read the PDF file
Can use any parser as long as some conditions are observed in the output:
    1. Must be in langchain_core.documents.base.Document format
    2. Must contain the following keys in doc.metadata:
        1. page
        2. filename/filepath or source
'''

pages =  list(range(37, 45)) #list(range(68,70)) # None

parser = PDFParserMarkdown(
    pdf_path=filepath,
    pages=pages, # Can pass a list of pages to read, useful for debugging
    ocr_engine=OCREngine.PYTESSERACT, # 3 OCR Options: PYTESSERACT, LLM, RAPIDOCR (LLM is most accurate)
) 

doc_dicts: List[Dict[str, Any]] = parser.process_pdf_document()

docs: List[Document] = [
    Document(
        page_content=doc['text'],
        metadata={**doc['page_metadata'], **doc['doc_metadata']}
    )
    for doc in doc_dicts
]

Processing /media/wali/D_Drive/Documents/Books/C-How-to-Program-7th-Edition.pdf...

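The metadata conditions listed in the docstring above can be checked before the documents are handed to `Text2KG`. A minimal sketch, assuming only that each document exposes a `metadata` dict; the helper `find_invalid_docs` is hypothetical and not part of `kgrag`.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List

# Any one of these keys is accepted as the document's origin.
SOURCE_KEYS = ("filename", "filepath", "source")


def find_invalid_docs(docs: Iterable[Any]) -> List[int]:
    """Return indices of documents whose metadata lacks `page` or a source key.

    Hypothetical helper: it only inspects `.metadata`, so it works on
    langchain Document objects or any object with a `metadata` dict.
    """
    bad = []
    for i, doc in enumerate(docs):
        meta = getattr(doc, "metadata", None) or {}
        has_page = "page" in meta
        has_source = any(key in meta for key in SOURCE_KEYS)
        if not (has_page and has_source):
            bad.append(i)
    return bad


# Stand-in objects so the sketch runs without langchain installed.
sample = [
    SimpleNamespace(page_content="a", metadata={"page": 37, "source": "book.pdf"}),
    SimpleNamespace(page_content="b", metadata={"page": 38}),  # missing source key
]
print(find_invalid_docs(sample))  # → [1]
```

An empty result from `find_invalid_docs(docs)` means every document carries the keys named above.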

In [7]:
print(docs[-1].page_content)

Machine-language programming was simply too slow, tedious and error prone for
most programmers. Instead of using the strings of numbers that computers could directly
understand, programmers began using English-like abbreviations to represent elementary
operations. These abbreviations formed the basis of assembly languages. **Translator pro-**
**grams called assemblers were developed to convert early assembly-language programs to**
machine language at computer speeds. The following section of an assembly-language program also adds overtime pay to base pay and stores the result in gross pay:


load  basepay
add   overpay
store  grosspay


Although such code is clearer to humans, it’s incomprehensible to computers until translated to machine language.
Computer usage increased rapidly with the advent of assembly languages, but programmers still had to use many instructions to accomplish even the simplest tasks. To
speed the programming process, **high-level languages were developed in whic

In [8]:
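start_time = time.perf_counter() # Wall-clock timer; time.process_time() counts only CPU time and would exclude time spent waiting on LLM and database calls
text2kg.process_documents(docs, use_existing_node_types=False)
end_time = time.perf_counter()
print(f"Total Time: {end_time - start_time:.2f} s")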

----------------------------------------------------------------------------------------------------
Doc # 1
Nodes: [Node(id='Internet', type='Technology', properties=[Property(key='chiefMerit', value='clearness')], aliases=[], definition='A global network connecting millions of computers.'), Node(id='World Wide Web', type='Technology', properties=[Property(key='chiefMerit', value='clearness')], aliases=[], definition='A system of interlinked hypertext documents accessed via the Internet.')]
Relationships: [Relationship(start_node_id='Internet', end_node_id='World Wide Web', type='PART_OF', properties=[], context='')]

----------------------------------------------------------------------------------------------------
Doc # 2
Nodes: [Node(id='C++', type='ProgrammingLanguage', properties=[Property(key='Authors', value='Deitel & Associates, Inc.')], aliases=[], definition='A powerful computer programming language that is appropriate for technically oriented people with little or no progr


# KG Search / Retrieval

In [10]:
import os
from dotenv import load_dotenv

load_dotenv("../.env")

from langchain_google_genai import ChatGoogleGenerativeAI

from kgrag.kg_search import KGSearch


llm = ChatGoogleGenerativeAI(model="models/gemini-1.0-pro", temperature=0)



In [20]:

kg_search = KGSearch(
    ent_llm=llm,
    cypher_llm=llm,
    cypher_examples_json="examples.json",
    fulltext_search_top_k=6,
    vector_search_top_k=6,
    vector_search_min_score=0.7
)

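The `vector_search_top_k` and `vector_search_min_score` arguments suggest a filter-then-rank step over similarity scores. A minimal sketch, assuming cosine similarity as the score (the metric `KGSearch` actually uses is not shown in this notebook); `top_k_above`, `cosine` and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_above(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    k: int,
    min_score: float,
) -> List[Tuple[str, float]]:
    """Keep candidates scoring at least `min_score`, best first, at most `k`."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    kept = [item for item in scored if item[1] >= min_score]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:k]


# Toy 2-D "embeddings"; real embeddings have hundreds of dimensions.
nodes = {"Web 2.0": [1.0, 0.0], "Web 3.0": [0.8, 0.6], "Assembler": [0.0, 1.0]}
hits = top_k_above([1.0, 0.0], nodes, k=6, min_score=0.7)
print([name for name, _ in hits])  # → ['Web 2.0', 'Web 3.0']
```

Raising `min_score` trades recall for precision, while `k` only caps how many of the surviving matches are returned.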

In [None]:
'''
Alternatively, get the lists of strings separately using kg_search.retrieve; retrieve_as_string is only a wrapper around this function

rels, docs, gen_cypher_results = kg_search.retrieve(
    query, 
    nresults=30,
    use_fulltext_search=True, 
    use_vector_search=True,
    generate_cypher=False
)
rels: All the triples that the entities/nodes in the query are involved in 
     Empty list is returned if `use_fulltext_search=False` & `use_vector_search=False`
docs: All the documents that contain any entities mentioned in the query
     Empty list is returned if `use_fulltext_search=False` & `use_vector_search=False`
gen_cypher_results: List of string/JSON results from the generated cypher
     Empty list is returned if `generate_cypher=False`

If no entities are found using vector or fulltext search, a Cypher query is generated regardless of the value of `generate_cypher`,
and its results are returned in gen_cypher_results.
WARNING: Using `generate_cypher` is unreliable and error-prone. It needs more work and more examples in 'examples.json'.
'''

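The three lists returned by `kg_search.retrieve` can be merged into a single context string for a prompt. A sketch under assumptions: `join_context` and its section headings are hypothetical and do not reproduce the format that `retrieve_as_string` actually emits.

```python
from typing import List


def join_context(rels: List[str], docs: List[str], gen_cypher_results: List[str]) -> str:
    """Join the non-empty result lists into one labelled context string.

    Hypothetical helper; the headings are illustrative, not kgrag's own.
    """
    sections = [
        ("Relationships", rels),
        ("Documents", docs),
        ("Generated Cypher results", gen_cypher_results),
    ]
    parts = []
    for title, items in sections:
        if items:  # skip lists that came back empty
            parts.append(f"{title}:\n" + "\n".join(items))
    return "\n\n".join(parts)


print(join_context(["(A)-[:USES]->(B)"], ["Page 37 text"], []))
```

Skipping empty lists keeps the prompt short when, for example, `generate_cypher=False` leaves the third list empty.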
In [34]:

query = "Describe the differences between web 3.0 and web 2.0." #input("Enter your query: ")

docs_string = kg_search.retrieve_as_string( #rels, docs, gen_res
    query, 
    nresults=500,
    use_fulltext_search=True, # Extract all entities (using ent_llm) in the input query and search using fulltext search
    use_vector_search=True, # Search for all entities/nodes in the query using vector search
    generate_cypher=False # Use LLM (cypher_llm) to generate cypher - Uses examples from `cypher_examples_json` for guidance
)
# print(len(rels))

In [35]:
q = f"""Answer the given query as best you can. Use the given search results to assist you with answering the query.
Query: {query}

Context:
{docs_string}

Answer:"""
ans = llm.invoke(q)

In [36]:
print(ans.content)

**Web 3.0**

Web 3.0 refers to the next movement in web development—one that realizes the full potential of the web. The Internet in its current state is a giant conglomeration of single websites with loose connections. Web 3.0 will resolve this by moving toward the Semantic Web—or the “web of meaning”—in which the web becomes a giant database meaningfully searchable by computers.

**Web 2.0**

Web 2.0 has no single definition but can be explained through a series of Internet trends, one being the empowerment of the user. Companies such as eBay, Facebook and Twitter are built almost entirely on community-generated content. Web 2.0 takes advantage of collective intelligence, the idea that collaboration will result in intelligent ideas. For example, wikis, such as the encyclopedia Wikipedia, allow users access to edit content. Tagging, or labeling content, is another key part of the collaborative theme of Web 2.0, which can be seen in sites such as Flickr, a photo-sharing site, and del.i

In [37]:
print(docs_string)

Nodes Relations:-
(Q113420780: Technology:Node {'wiki_description': 'scientific article published on 11 June 2022', 'alias': ['Web 3.0'], 'definition': 'The next movement in web development—one that realizes the full potential of the web', 'wiki_type': 'scholarly article', 'url': 'www.wikidata.org/wiki/Q113420780', 'labels': ['Web 3.0: The Future of Web']})-[:USES]->(Q134471: Technology:Node {'wiki_description': 'group of interrelated Web development techniques', 'alias': ['Ajax'], 'definition': 'A set of web development techniques using many web technologies on the client-side to create asynchronous web applications', 'wiki_type': 'acronym', 'url': 'www.wikidata.org/wiki/Q134471', 'labels': ['AJAX', 'Asynchronous JavaScript and XML']})
(Q113420780: Technology:Node {'wiki_description': 'scientific article published on 11 June 2022', 'alias': ['Web 3.0'], 'definition': 'The next movement in web development—one that realizes the full potential of the web', 'wiki_type': 'scholarly article