In [1]:
%load_ext autoreload
%autoreload 2


import sys
sys.path.append("..")

iText2KG is now compatible with all language models supported by LangChain.

To use iText2KG, you will need both a chat model and an embeddings model. 

For available chat models, refer to the options listed at: https://python.langchain.com/v0.2/docs/integrations/chat/. 
For embedding models, explore the choices at: https://python.langchain.com/v0.2/docs/integrations/text_embedding/. 

This notebook will show you how to run iText2KG using Mistral, Ollama, and OpenAI models. 

**Please ensure that you install the necessary package for each chat model before use.**

# OpenAI

The same applies to OpenAI: install its integration package before use.

Set up your chat model by following this tutorial: https://python.langchain.com/v0.2/docs/integrations/chat/openai/
For the embedding model, see: https://python.langchain.com/v0.2/docs/integrations/text_embedding/openai/

# Ollama

The same applies to Ollama.

Set up your chat model by following this tutorial: https://python.langchain.com/v0.2/docs/integrations/chat/ollama/
For the embedding model, see: https://python.langchain.com/v0.2/docs/integrations/text_embedding/ollama/

In [2]:
from langchain_ollama import ChatOllama, OllamaEmbeddings

llm = ChatOllama(
    model="llama3.1",
    temperature=0,
)

embeddings = OllamaEmbeddings(
    model="llama3.1",
)

# iText2KG

* Use Case: We aim to connect an online job description with a generated CV using Knowledge Graphs. 

* The objective is to assess the candidate's suitability for the job offer. You can use a different LLM or embedding model for each module of iText2KG. However, the node and relation embeddings must have the same dimension across models: cosine similarity is only defined for vectors of equal length, so embeddings of different dimensions cannot be compared during matching.
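
To make the dimension requirement concrete, here is a minimal sketch in plain Python. The function name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of iText2KG; the point is only that the dot product behind cosine similarity needs both vectors to have the same length.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; raises if their dimensions differ."""
    if len(a) != len(b):
        raise ValueError(f"Embedding dimensions differ: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (real ones are much longer).
node_vec = [1.0, 0.0, 1.0]
same_dim_vec = [1.0, 0.0, 1.0]
other_model_vec = [1.0, 0.0]  # as if produced by a model with another dimension

print(cosine_similarity(node_vec, same_dim_vec))

try:
    cosine_similarity(node_vec, other_model_vec)
except ValueError as err:
    print(f"Cannot compare: {err}")
```

If you mix models across modules, checking the length of one embedding from each model before building the graph is a cheap way to catch a mismatch early.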

## Document Distiller

In [3]:
# Document loaders live in the langchain_community package in recent LangChain versions.
from langchain_community.document_loaders import PyPDFLoader, TextLoader

# loader = TextLoader("../datasets/scientific_articles/bertology.txt")
# docs = loader.load()
# docs

loader = PyPDFLoader("../datasets/scientific_articles/bertology.pdf")
docs = loader.load_and_split()
docs

[Document(metadata={'source': '../datasets/scientific_articles/bertology.pdf', 'page': 0}, page_content='Published as a conference paper at ICLR 2021\nBERT OLOGY MEETS BIOLOGY : INTERPRETING\nATTENTION IN PROTEIN LANGUAGE MODELS\nJesse Vig1Ali Madani1Lav R. Varshney1,2Caiming Xiong1\nRichard Socher1Nazneen Fatema Rajani1\n1Salesforce Research,2University of Illinois at Urbana-Champaign\n{jvig,amadani,cxiong,rsocher,nazneen.rajani}@salesforce.com\nvarshney@illinois.edu\nABSTRACT\nTransformer architectures have proven to learn useful representations for pro-\ntein classiﬁcation and generation tasks. However, these representations present\nchallenges in interpretability. In this work, we demonstrate a set of methods for\nanalyzing protein Transformer models through the lens of attention. We show that\nattention: (1) captures the folding structure of proteins, connecting amino acids that\nare far apart in the underlying sequence, but spatially close in the three-dimensional\nstructure, (2)

In [4]:
from itext2kg.documents_distiller import DocumentsDisiller, Article

document_distiller = DocumentsDisiller(llm_model=llm)

In [5]:
docs = [doc.page_content.replace("{", '[').replace("}", "]") for doc in docs]
docs

['Published as a conference paper at ICLR 2021\nBERT OLOGY MEETS BIOLOGY : INTERPRETING\nATTENTION IN PROTEIN LANGUAGE MODELS\nJesse Vig1Ali Madani1Lav R. Varshney1,2Caiming Xiong1\nRichard Socher1Nazneen Fatema Rajani1\n1Salesforce Research,2University of Illinois at Urbana-Champaign\n[jvig,amadani,cxiong,rsocher,nazneen.rajani]@salesforce.com\nvarshney@illinois.edu\nABSTRACT\nTransformer architectures have proven to learn useful representations for pro-\ntein classiﬁcation and generation tasks. However, these representations present\nchallenges in interpretability. In this work, we demonstrate a set of methods for\nanalyzing protein Transformer models through the lens of attention. We show that\nattention: (1) captures the folding structure of proteins, connecting amino acids that\nare far apart in the underlying sequence, but spatially close in the three-dimensional\nstructure, (2) targets binding sites, a key functional component of proteins, and\n(3) focuses on progressively more 
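
The cell above swaps curly braces for square brackets because braces in the document text cause an error in the query. A plausible reason (an assumption, not confirmed against the iText2KG source) is that the text ends up inside a prompt template in which `{name}` marks a placeholder. The sketch below reproduces this kind of failure with Python's built-in `str.format`; the `template` string and the `sanitize_braces` helper are hypothetical.

```python
def sanitize_braces(text: str) -> str:
    """Replace curly braces with square brackets, as done in the notebook."""
    return text.replace("{", "[").replace("}", "]")


raw_chunk = "{jvig,amadani}@salesforce.com"

# Hypothetical template: the document text is pasted into the template string
# first, and the template is formatted afterwards.
template = "Extract the authors from this text: " + raw_chunk + "\nQuery: {query}"

try:
    template.format(query="list the authors")
except (KeyError, ValueError, IndexError) as err:
    # The braces in the e-mail pattern are parsed as a placeholder.
    print(f"Formatting failed: {type(err).__name__}: {err}")

safe_template = (
    "Extract the authors from this text: "
    + sanitize_braces(raw_chunk)
    + "\nQuery: {query}"
)
print(safe_template.format(query="list the authors"))
```

The replacement changes the text slightly (for example, the e-mail pattern now uses brackets), which is usually an acceptable trade-off for information extraction.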

In [6]:
IE_query = '''
# DIRECTIVES : 
- Act like an experienced information extractor. 
- You have a chunk of a scientific paper.
- If you do not find the right information, keep its place empty.
'''
# The curly braces in the documents were replaced with square brackets (previous cell) to avoid an error in the query.
distilled_doc = document_distiller.distill(documents=docs, IE_query=IE_query, output_data_structure=Article)

[{'$defs': {'Author': {'properties': {'name': {'description': 'The name of the author', 'title': 'Name', 'type': 'string'}, 'affiliation': {'description': 'The affiliation of the author', 'title': 'Affiliation', 'type': 'string'}}, 'required': ['name', 'affiliation'], 'title': 'Author', 'type': 'object'}}, 'properties': {'title': {'description': 'The title of the scientific article', 'title': 'Title', 'type': 'string'}, 'authors': {'description': "The list of the article's authors and their affiliation", 'items': {'$ref': '#/$defs/Author'}, 'title': 'Authors', 'type': 'array'}, 'abstract': {'description': "The article's abstract", 'title': 'Abstract', 'type': 'string'}, 'key_findings': {'description': 'The key findings of the article', 'title': 'Key Findings', 'type': 'string'}, 'limitation_of_sota': {'description': 'limitation of the existing work', 'title': 'Limitation Of Sota', 'type': 'string'}, 'proposed_solution': {'description': 'the proposed solution in details', 'title': 'Prop

In [7]:
# Format the distilled document into semantic sections.
semantic_blocks = [f"{key} - {value}".replace("{", "[").replace("}", "]") for key, value in distilled_doc.items()]
semantic_blocks

["$defs - ['Author': ['properties': ['name': ['description': 'The name of the author', 'title': 'Name', 'type': 'string'], 'affiliation': ['description': 'The affiliation of the author', 'title': 'Affiliation', 'type': 'string']], 'required': ['name', 'affiliation'], 'title': 'Author', 'type': 'object']]",
 'properties - [\'title\': [\'description\': \'The title of the scientific article\', \'title\': \'Title\', \'type\': \'string\'], \'authors\': [\'description\': "The list of the article\'s authors and their affiliation", \'items\': [\'$ref\': \'#/$defs/Author\'], \'title\': \'Authors\', \'type\': \'array\'], \'abstract\': [\'description\': "The article\'s abstract", \'title\': \'Abstract\', \'type\': \'string\'], \'key_findings\': [\'description\': \'The key findings of the article\', \'title\': \'Key Findings\', \'type\': \'string\'], \'limitation_of_sota\': [\'description\': \'limitation of the existing work\', \'title\': \'Limitation Of Sota\', \'type\': \'string\'], \'proposed_s
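
In the output above, the keys of `distilled_doc` are `$defs` and `properties`. These are JSON Schema keywords rather than fields of `Article` such as `title` or `abstract`, which suggests the model returned the schema itself instead of filling it in. A small guard like the one below could flag this before the graph is built. The helper name `looks_like_schema` and the sample dictionaries are hypothetical, and the marker list is a heuristic rather than anything iText2KG provides.

```python
from typing import Any, Mapping

# Top-level keys that belong to a JSON Schema, not to extracted article data.
SCHEMA_MARKERS = ("$defs", "properties")


def looks_like_schema(distilled: Mapping[str, Any]) -> bool:
    """Heuristic check: True if JSON Schema keywords appear at the top level."""
    return any(key in distilled for key in SCHEMA_MARKERS)


# Hypothetical examples of the two situations.
echoed = {"$defs": {"Author": {}}, "properties": {"title": {"type": "string"}}}
filled = {"title": "A hypothetical title", "abstract": "A hypothetical abstract"}

for name, candidate in (("echoed", echoed), ("filled", filled)):
    if looks_like_schema(candidate):
        print(f"{name}: the schema appears to have been echoed back")
    else:
        print(f"{name}: looks like extracted data")
```

If the check fires, the sections passed to graph construction describe the schema rather than the paper, so the extracted entities will reflect field descriptions instead of the article's content.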

## iText2KG for graph construction

In [11]:
from itext2kg import iText2KG


itext2kg = iText2KG(llm_model=llm, embeddings_model=embeddings)
global_ent, global_rel = itext2kg.build_graph(sections=semantic_blocks, entity_key="entities")

[INFO] Extracting Entities from the Document 1
{'entities': [{'label': 'Author', 'name': 'The name of the author'}]}
[INFO] Extracting Relations from the Document 1
{'$defs': {'Relationship': {'properties': {'startNode': {'default': 'The starting entity, which is present in the entities list.', 'title': 'Startnode', 'type': 'string'}, 'endNode': {'default': 'The ending entity, which is present in the entities list.', 'title': 'Endnode', 'type': 'string'}, 'name': {'default': 'The predicate that defines the relationship between the two entities. This predicate should represent a single, semantically distinct relation.', 'title': 'Name', 'type': 'string'}}, 'title': 'Relationship', 'type': 'object'}}, 'properties': {'relationships': {'default': 'Based on the provided entities and context, identify the predicates that define relationships between these entities. The predicates should be chosen with precision to accurately reflect the expressed relationships.', 'items': {'$ref': '#/$defs/R

KeyboardInterrupt: 

# Draw the graph
---

The final section visualizes the constructed knowledge graph with GraphIntegrator. It connects to a Neo4j graph database using the credentials given below and renders the entities and relationships extracted from the document.

To start a local Neo4j instance, run `docker run -p7474:7474 -p7687:7687 -e NEO4J_AUTH=neo4j/secretgraph neo4j:latest` in a terminal.
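
Before running the cells below, it can help to confirm that the container is accepting connections on the Bolt port. The following is a minimal sketch using only the standard library; the helper name `is_port_open` is hypothetical, and the host and port are taken from the `docker run` command above. It only checks that a TCP connection succeeds, not that the credentials are valid.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if is_port_open("localhost", 7687):
    print("Neo4j Bolt port 7687 is reachable.")
else:
    print("Port 7687 is not reachable; start the Neo4j container first.")
```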

In [None]:
# Keep only the entities that have a non-empty label.
global_ent_list = [ent for ent in global_ent if ent["label"] != ""]

global_ent_list

In [None]:
# Keep only the relationships whose start and end nodes are both non-empty.
global_rel_list = [
    rel for rel in global_rel
    if rel["startNode"] != "" and rel["endNode"] != ""
]

global_rel_list

In [None]:
from itext2kg.graph_integration import GraphIntegrator


URI = "bolt://localhost:7687"
USERNAME = "neo4j"
PASSWORD = "secretgraph"

new_graph = {}
new_graph["nodes"] = global_ent_list
new_graph["relationships"] = global_rel_list
GraphIntegrator(uri=URI, username=USERNAME, password=PASSWORD).visualize_graph(json_graph=new_graph)