In [2]:
%load_ext autoreload
%autoreload 2


import sys
sys.path.append("..")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Upload your documents
---

This section loads the scientific article bioclip.pdf from the datasets/scientific_articles directory. PyPDFLoader splits the document into pages, each of which serves as one chunk, and only the first 16 pages are kept so that the reference list is excluded from further processing.

In [3]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("../datasets/scientific_articles/bioclip.pdf")
pages = loader.load_and_split()

In [4]:
# excluding references
pages = pages[:16]
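
The cut-off of 16 pages is specific to this article. For other PDFs, a small heuristic can locate the page where the reference list starts. The sketch below is only an illustration: the function name first_references_page and the heading pattern are assumptions, not part of itext2kg or LangChain, and the pattern will miss reference lists that use a different heading.

```python
import re
from typing import List, Optional

# Hypothetical pattern: a line holding only "References" or "Bibliography",
# optionally preceded by a section number such as "7." or "7".
_REFERENCES_HEADING = re.compile(
    r"^\s*(\d+\.?\s*)?(references|bibliography)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def first_references_page(page_texts: List[str]) -> Optional[int]:
    """Return the index of the first page with a references heading, or None."""
    for index, text in enumerate(page_texts):
        if _REFERENCES_HEADING.search(text):
            return index
    return None


# Made-up page texts, used only to show the behaviour.
sample_pages = [
    "1 Introduction\nSome text.",
    "This page cites earlier references inline.",
    "6 Conclusion\nClosing remarks.\nReferences\n[1] First entry",
]
cut = first_references_page(sample_pages)
```

With the loaded document, the same function could be applied to [page.page_content for page in pages]. Whether to keep or drop the page at the returned index is a judgment call, since that page often still contains the end of the main text.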

# Document Distiller
---

This section uses an OpenAI API key to run DocumentsDisiller (the class name as spelled in the itext2kg import below), which extracts structured information from the loaded pages according to the Article schema. The distilled fields are then formatted into semantic blocks that serve as input for knowledge graph construction.

In [7]:
OPENAI_API_KEY = "##"
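
As an alternative to pasting the key into the notebook, it can be read from the environment. This is a minimal sketch that assumes the key has been exported under the variable name OPENAI_API_KEY; the name is a common convention, not something this notebook requires.

```python
import os

# Read the key from the environment. The fallback "##" is the same
# placeholder used in the cell above and is not a working key.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "##")
```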

In [8]:
from itext2kg.documents_distiller import DocumentsDisiller, Article
from itext2kg.graph_integration import iText2KG


document_distiller = DocumentsDisiller(openai_api_key=OPENAI_API_KEY)

In [9]:
IE_query = '''
# DIRECTIVES : 
- Act like an experienced information extractor. 
- You have a chunk of a scientific paper.
- If you do not find the right information, keep its place empty.
'''
# Curly braces in the page text are replaced with square brackets to avoid an error in the query.
distilled_doc = document_distiller.distill(
    documents=[page.page_content.replace("{", "[").replace("}", "]") for page in pages],
    IE_query=IE_query,
    output_data_structure=Article,
)

In [10]:
semantic_blocks = [f"{key} - {value}".replace("{", "[").replace("}", "]") for key, value in distilled_doc.items()]
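
The brace replacement applied to the pages and to the semantic blocks can be illustrated in isolation. The sketch below assumes the error comes from brace-based prompt templating, where any {name} in the text is treated as a placeholder. It uses Python's built-in str.format as a stand-in for whatever templating the library performs internally; neutralize_braces is a hypothetical helper name.

```python
def neutralize_braces(text: str) -> str:
    """Replace curly braces with square brackets, as done in the cells above."""
    return text.replace("{", "[").replace("}", "]")


# Made-up chunk containing set notation, as often found in scientific papers.
chunk = "We minimise the loss over the set {x_i} of samples."

# Stand-in for a brace-based template: formatting fails because "x_i"
# is interpreted as a placeholder that was never supplied.
try:
    ("Context: " + chunk).format()
    raw_chunk_failed = False
except (KeyError, IndexError, ValueError):
    raw_chunk_failed = True

# After the replacement the same call succeeds.
safe_prompt = ("Context: " + neutralize_braces(chunk)).format()
```

The replacement is lossy, because the original braces cannot be told apart from genuine square brackets afterwards. For extraction purposes that is usually acceptable.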

# iText2KG
---

This section initializes an instance of iText2KG and uses it to construct a knowledge graph. Two methods are shown: the first uses the entities of each section as local context and is more precise (highly recommended); the second uses entities extracted across all sections as global context, which yields a richer but less precise graph.

## Local entities as context (more precise method - highly recommended)

In [11]:
itext2kg = iText2KG(openai_api_key=OPENAI_API_KEY)

In [12]:
global_ent, global_rel = itext2kg.build_graph(sections=semantic_blocks)

[INFO] Extracting Entities from the Document 1
{'entities': [{'label': 'Model', 'name': 'BIOCLIP'}, {'label': 'Domain', 'name': 'Tree of Life'}]}
[INFO] Extracting Relations from the Document 1
{'relationships': [{'startNode': 'BIOCLIP', 'endNode': 'tree of life', 'name': 'is a vision foundation model for'}]}
Some isolated entities without relations were detected ... trying to solve them!
{'relationships': []}
{'name': 'BIOCLIP', 'label': 'entity', 'properties': {'embeddings': array([ 0.03529191, -0.00025876, -0.02022829, ..., -0.02821856,
       -0.02163173,  0.02438248])}}
[INFO] Wohoo ! Entity using embeddings is matched --- BIOCLIP -merged--> bioclip 
[INFO] Extracting Entities from the Document 2
{'entities': [{'label': 'Person', 'name': 'Samuel Stevens'}, {'label': 'Person', 'name': 'Jiaman Wu'}, {'label': 'Person', 'name': 'Matthew J Thompson'}, {'label': 'Person', 'name': 'Elizabeth G Campolongo'}, {'label': 'Person', 'name': 'Chan Hee Song'}, {'label': 'Person', 'name': 'David
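
The log line "Entity using embeddings is matched --- BIOCLIP -merged--> bioclip" indicates that entities are merged when their embeddings are close. The sketch below shows one common way to implement such a match, using cosine similarity with a threshold. It is not the library's actual code: the function names, the 0.9 threshold, and the two-dimensional vectors are assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_entity(
    candidate: List[float],
    known: Dict[str, List[float]],
    threshold: float = 0.9,  # hypothetical value, not taken from itext2kg
) -> Optional[str]:
    """Return the most similar known entity name, or None if none reaches the threshold."""
    best_name, best_score = None, threshold
    for name, vector in known.items():
        score = cosine_similarity(candidate, vector)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy embeddings; real embeddings have far more dimensions.
known_entities = {"bioclip": [1.0, 0.0], "tree of life": [0.0, 1.0]}
merged_into = match_entity([0.99, 0.05], known_entities)
unmatched = match_entity([0.7, 0.7], known_entities)
```

Here the first candidate is merged into "bioclip", while the second is equally far from both known entities and stays unmatched. A higher threshold reduces wrong merges at the cost of more duplicate entities.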

## Global entities as context (less precise but more enriched graph)

In [None]:
global_entities = itext2kg.extract_entities_for_all_sections(sections=semantic_blocks)
global_relations = itext2kg.extract_relations_for_all_sections(sections=semantic_blocks, entities=global_entities)

# Draw the graph
---

The final section visualizes the constructed knowledge graph with GraphIntegrator. It connects to a Neo4j graph database using the credentials specified below and renders the extracted entities and relationships as a graph.

In [18]:
from itext2kg.graph_integration import GraphIntegrator


URI = "bolt://localhost:7687"
USERNAME = "neo4j"
PASSWORD = "##"


new_graph = {"nodes": global_ent, "relationships": global_rel}

GraphIntegrator(uri=URI, username=USERNAME, password=PASSWORD).visualize_graph(json_graph=new_graph)
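
The extraction log above reports "isolated entities without relations", so it can be useful to check a graph for such nodes before visualizing it. The sketch below assumes nodes are dicts with a "name" key and relationships are dicts with string-valued "startNode" and "endNode" keys, as in the intermediate log output. The structures actually returned by build_graph may differ, in which case the key access needs adjusting. find_isolated_nodes is a hypothetical helper, not part of itext2kg.

```python
from typing import Dict, List


def find_isolated_nodes(nodes: List[Dict], relationships: List[Dict]) -> List[str]:
    """Return the names of nodes that appear in no relationship.

    Names are compared case-insensitively, since the log shows the same
    entity written as 'Tree of Life' and 'tree of life'.
    """
    connected = set()
    for rel in relationships:
        connected.add(rel["startNode"].lower())
        connected.add(rel["endNode"].lower())
    return [node["name"] for node in nodes if node["name"].lower() not in connected]


# Sample data modelled on the log output above.
sample_nodes = [
    {"label": "Model", "name": "BIOCLIP"},
    {"label": "Domain", "name": "Tree of Life"},
    {"label": "Person", "name": "Samuel Stevens"},
]
sample_relationships = [
    {
        "startNode": "BIOCLIP",
        "endNode": "tree of life",
        "name": "is a vision foundation model for",
    }
]
isolated = find_isolated_nodes(sample_nodes, sample_relationships)
```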