In [1]:
%load_ext autoreload
%autoreload 2


import sys
sys.path.append("..")

iText2KG is now compatible with all language models supported by LangChain. 

To use iText2KG, you will need both a chat model and an embeddings model. 

For available chat models, refer to the options listed at: https://python.langchain.com/v0.2/docs/integrations/chat/. 
For embedding models, explore the choices at: https://python.langchain.com/v0.2/docs/integrations/text_embedding/. 

This notebook shows how to run iText2KG with Mistral, OpenAI, and Ollama models. 

**Please ensure that you install the necessary package for each chat model before use.**
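
One way to check this before running the cells below is to ask Python whether each provider package can be found. This is a minimal sketch using only the standard library; the package names are taken from the imports used later in this notebook, and the helper `missing_packages` is a hypothetical name introduced for illustration.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names used by the provider sections of this notebook.
PROVIDER_PACKAGES = ["langchain_mistralai", "langchain_openai", "langchain_ollama"]


def missing_packages(import_names: Iterable[str]) -> List[str]:
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if find_spec(name) is None]


missing = missing_packages(PROVIDER_PACKAGES)
if missing:
    print("Not installed:", ", ".join(missing))
```

Only the package for the provider you actually use needs to be present.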

# Mistral

For Mistral, set up your chat model by following the tutorial at https://python.langchain.com/v0.2/docs/integrations/chat/mistralai/. For the embedding model, follow the setup guide at https://python.langchain.com/v0.2/docs/integrations/text_embedding/mistralai/.

In [4]:
from langchain_mistralai import ChatMistralAI
from langchain_mistralai import MistralAIEmbeddings

mistral_api_key = "##"
mistral_llm_model = ChatMistralAI(
    api_key = mistral_api_key,
    model="mistral-large-latest",
    temperature=0,
    max_retries=2,
)


mistral_embeddings_model = MistralAIEmbeddings(
    model="mistral-embed",
    api_key = mistral_api_key
)
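
Hard-coding a key in a notebook makes it easy to leak. As an alternative, the sketch below reads the key from an environment variable. The variable name `MISTRAL_API_KEY` and the helper `read_api_key` are assumptions made for illustration; adapt them to however your keys are stored.

```python
import os
from getpass import getpass


def read_api_key(env_var: str, prompt_if_missing: bool = False) -> str:
    """Return the key stored in `env_var`, optionally prompting when it is unset."""
    key = os.environ.get(env_var, "")
    if not key and prompt_if_missing:
        # getpass hides the typed value, which is useful in a shared notebook.
        key = getpass(f"Enter value for {env_var}: ")
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    return key
```

With this in place, `mistral_api_key = read_api_key("MISTRAL_API_KEY")` could replace the placeholder string above, and the same helper works for the other providers.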



# OpenAI

The same applies to OpenAI. 

Set up your chat model by following the tutorial at https://python.langchain.com/v0.2/docs/integrations/chat/openai/. 
For the embedding model, follow the guide at https://python.langchain.com/v0.2/docs/integrations/text_embedding/openai/.

In [2]:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

openai_api_key = "##"

openai_llm_model = ChatOpenAI(
    api_key = openai_api_key,
    model="gpt-4o",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

openai_embeddings_model = OpenAIEmbeddings(
    api_key = openai_api_key ,
    model="text-embedding-3-large",
)

# Ollama

The same applies to Ollama. 

Set up your chat model by following the tutorial at https://python.langchain.com/v0.2/docs/integrations/chat/ollama/. 
For the embedding model, follow the guide at https://python.langchain.com/v0.2/docs/integrations/text_embedding/ollama/.

In [5]:
from langchain_ollama import ChatOllama, OllamaEmbeddings

ollama_llm_model = ChatOllama(
    model="llama3",
    temperature=0,
)

ollama_embeddings_model = OllamaEmbeddings(
    model="llama3",
)

# iText2KG

* Use case: we aim to connect two scientific papers. 

* The objective is to detect the key concepts common to both papers and to identify the central themes, keywords, and topics that dominate each one. Linking these themes exposes overlaps and gaps in coverage, helping researchers find areas that need more study or where novel connections could be made.

## Document Distiller

### Scientific articles

In [7]:
from langchain_community.document_loaders import PyPDFLoader
from itext2kg.documents_distiller import DocumentsDistiller, Article
from pydantic import BaseModel, Field
from typing import List, Tuple, Type


class ArticleResults(BaseModel):
    abstract: str = Field(description="Brief summary of the article's abstract")
    key_findings: str = Field(description="The key findings of the article")
    limitation_of_sota: str = Field(description="limitation of the existing work")
    proposed_solution: str = Field(description="the proposed solution in details")
    paper_limitations: str = Field(description="The limitations of the proposed solution of the paper")

# Sample input data as a list of 4-tuples, structured as:
# (document path, 1-based page numbers to exclude, blueprint, document type)
documents_information = [
    ("../datasets/scientific_articles/llm-tikg.pdf", [11,10], ArticleResults, 'scientific article'),
    ("../datasets/scientific_articles/actionable-cyber-threat.pdf", [12,11,10], ArticleResults, 'scientific article')
]

def upload_and_distill(documents_information: List[Tuple[str, List[int], Type[BaseModel], str]]):
    distilled_docs = []
    
    for path_, exclude_pages, blueprint, document_type in documents_information:
        
        loader = PyPDFLoader(path_)
        pages = loader.load_and_split()
        # Exclude unnecessary pages (for example, the references); exclude_pages is 1-based
        pages = [page for page in pages if page.metadata["page"] + 1 not in exclude_pages]
        document_distiller = DocumentsDistiller(llm_model=openai_llm_model)
        
        IE_query = f'''
        # DIRECTIVES : 
        - Act like an experienced information extractor.
        - You have a chunk of a {document_type}
        - If you do not find the right information, keep its place empty.
        '''
        
        # Distill document content with query
        distilled_doc = document_distiller.distill(
            documents=[page.page_content.replace("{", '[').replace("}", "]") for page in pages],
            IE_query=IE_query,
            output_data_structure=blueprint
        )
        
        # Filter and format distilled document results
        distilled_docs.append([
            f"{document_type}'s {key} - {value}".replace("{", "[").replace("}", "]") 
            for key, value in distilled_doc.items() 
            if value  # skips empty strings and empty lists alike
        ])
    
    return distilled_docs


In [9]:
distilled_docs = upload_and_distill(documents_information=documents_information)
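
To see what `upload_and_distill` hands to the graph builder without calling a model, the sketch below applies the same filter-and-format step to a hand-written dictionary. The dictionary values are invented for illustration and are not real distiller output; `format_sections` is a hypothetical helper that mirrors the list comprehension above.

```python
from typing import Any, Dict, List


def format_sections(distilled_doc: Dict[str, Any], document_type: str) -> List[str]:
    """Skip empty fields, prefix each value with its field name,
    and replace curly braces with square brackets."""
    return [
        f"{document_type}'s {key} - {value}".replace("{", "[").replace("}", "]")
        for key, value in distilled_doc.items()
        if value
    ]


# Hand-written stand-in for a distilled document (not real model output).
example_doc = {
    "abstract": "A method for building knowledge graphs from text.",
    "key_findings": "",  # empty fields are dropped
    "proposed_solution": "Uses {LLM} prompts.",
}

sections = format_sections(example_doc, "scientific article")
for section in sections:
    print(section)
```

Each surviving field becomes one labelled section string, and `distilled_docs` holds one such list per document. The brace replacement is presumably there to keep stray `{` and `}` characters from being read as prompt-template placeholders further down the pipeline.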

## iText2KG for graph construction

In [11]:
from itext2kg import iText2KG


itext2kg = iText2KG(llm_model = openai_llm_model, embeddings_model = openai_embeddings_model)

We first build a knowledge graph from the distilled sections of the first article.

In [12]:
kg = itext2kg.build_graph(sections=distilled_docs[0], ent_threshold=0.7, rel_threshold=0.7)

[INFO] ------- Extracting Entities from the Document 1
{'entities': [{'name': 'Open-source threat intelligence', 'label': 'Data Source'}, {'name': 'Knowledge graph', 'label': 'Data Structure'}, {'name': 'Intrusion detection', 'label': 'Application'}, {'name': 'LLM-TIKG', 'label': 'Methodology'}, {'name': 'Large language model', 'label': 'Technology'}, {'name': 'Few-shot learning', 'label': 'Technique'}, {'name': 'Data annotation', 'label': 'Process'}, {'name': 'Data augmentation', 'label': 'Process'}, {'name': 'Topic classification', 'label': 'Task'}, {'name': 'Entity extraction', 'label': 'Task'}, {'name': 'Relationship extraction', 'label': 'Task'}, {'name': 'TTP extraction', 'label': 'Task'}, {'name': 'GPT-3.5', 'label': 'Model'}, {'name': 'Llama2-7B', 'label': 'Model'}, {'name': 'Instruction-based Information Extraction', 'label': 'Methodology'}, {'name': 'Threat hunting', 'label': 'Application'}, {'name': 'Attack attribution', 'label': 'Application'}]}
[Entity(name=instruction bas
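
The `ent_threshold` and `rel_threshold` arguments act as similarity cut-offs when entities and relationships are matched. The sketch below illustrates how a threshold of this kind can decide whether two embeddings count as the same concept, assuming cosine similarity as the measure. The vectors are toy values and the helper names are hypothetical; this is not the library's internal code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_same_concept(a: Sequence[float], b: Sequence[float], threshold: float = 0.7) -> bool:
    """Treat two embeddings as the same concept when their similarity reaches the threshold."""
    return cosine_similarity(a, b) >= threshold


# Toy 3-dimensional embeddings; real embedding vectors are far longer.
knowledge_graph = [1.0, 0.0, 0.0]
knowledge_graphs = [0.9, 0.1, 0.0]
intrusion_detection = [0.0, 1.0, 0.0]

print(is_same_concept(knowledge_graph, knowledge_graphs))
print(is_same_concept(knowledge_graph, intrusion_detection))
```

Under this reading, raising the threshold makes matching stricter, so fewer entities are merged, while lowering it merges more aggressively.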

We then build the second graph, passing the knowledge graph of the first article as `existing_knowledge_graph` so that the entities and relationships of the second article are integrated with it.

In [13]:
kg2 = itext2kg.build_graph(sections=distilled_docs[1], existing_knowledge_graph=kg, rel_threshold=0.7, ent_threshold=0.7)

[INFO] ------- Extracting Entities from the Document 1
{'entities': [{'name': 'Cyber Threat Intelligence', 'label': 'Data Structure'}, {'name': 'Large Language Models', 'label': 'Methodology'}, {'name': 'Knowledge Graphs', 'label': 'Data Structure'}, {'name': 'Llama 2 7B chat', 'label': 'Model'}, {'name': 'Llama 70B chat', 'label': 'Model'}, {'name': 'Mistral', 'label': 'Model'}, {'name': 'Zephyr', 'label': 'Model'}, {'name': 'Prompt Engineering', 'label': 'Technique'}, {'name': 'Link Prediction', 'label': 'Technique'}]}
[Entity(name=large language models, label=Methodology, properties=embeddings=array([-0.00911226,  0.00577835, -0.02530644, ...,  0.00196522,
       -0.01056079,  0.0010359 ])), Entity(name=llama 70b chat, label=Model, properties=embeddings=array([-0.02101934, -0.0112489 , -0.0145149 , ..., -0.00411553,
       -0.00812611,  0.00033667])), Entity(name=zephyr, label=Model, properties=embeddings=array([-0.02932358, -0.00352876, -0.00856502, ...,  0.00673707,
       -0.0070
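
As a quick sanity check on the use case, a few of the entity names printed by the two extraction runs can be compared directly. The sketch below uses `difflib` from the standard library. String similarity is only a rough stand-in for the embedding-based matching, and the 0.9 cut-off and the helper `similar_names` are arbitrary choices made for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# A few entity names copied from the two extraction outputs above.
first_paper = ["Knowledge graph", "Large language model", "Llama2-7B", "Few-shot learning"]
second_paper = ["Knowledge Graphs", "Large Language Models", "Llama 2 7B chat", "Link Prediction"]


def similar_names(left: List[str], right: List[str], cutoff: float = 0.9) -> List[Tuple[str, str]]:
    """Return the pairs whose lowercase names have a similarity ratio of at least `cutoff`."""
    pairs = []
    for a in left:
        for b in right:
            if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff:
                pairs.append((a, b))
    return pairs


for a, b in similar_names(first_paper, second_paper):
    print(f"{a!r} <-> {b!r}")
```

The singular and plural forms pair up, but `Llama2-7B` and `Llama 2 7B chat` fall below the cut-off, which illustrates why plain string comparison is too brittle for linking concepts across papers.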

# Draw the graph
---

This final section visualizes the constructed knowledge graph with `GraphIntegrator`. The Neo4j graph database is accessed with the credentials below, and the resulting graph shows the entities and relationships extracted from the two articles.

In [14]:
from itext2kg.graph_integration import GraphIntegrator


URI = "bolt://localhost:7687"
USERNAME = "neo4j"
PASSWORD = "##"

GraphIntegrator(uri=URI, username=USERNAME, password=PASSWORD).visualize_graph(knowledge_graph=kg2)
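
For readers curious about what loading entities into Neo4j involves, the sketch below builds parameterised Cypher `MERGE` statements from name and label pairs. It does not reproduce the queries `GraphIntegrator` runs; the function names are hypothetical and the approach is an assumption made for illustration.

```python
import re
from typing import Dict, Tuple


def to_cypher_label(label: str) -> str:
    """Turn a free-text label into a safe Cypher label (letters, digits, underscores)."""
    cleaned = re.sub(r"[^0-9A-Za-z_]", "_", label.strip())
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def merge_node_statement(name: str, label: str) -> Tuple[str, Dict[str, str]]:
    """Return a MERGE query and its parameters for one entity.

    Labels cannot be passed as query parameters in Cypher, so the label is
    sanitised and interpolated, while the name is passed as a parameter.
    """
    query = f"MERGE (n:{to_cypher_label(label)} {{name: $name}})"
    return query, {"name": name}


query, params = merge_node_statement("zephyr", "Model")
print(query, params)
```

A statement and parameter dictionary of this shape could be passed to a session's `run` method in the official `neo4j` Python driver. Keeping the name as a parameter avoids quoting problems with entity names that contain apostrophes or braces.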