<a href="https://colab.research.google.com/github/KaifAhmad1/code-test/blob/main/RAG_with_Knowledge_Graph_OpenSource.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **RAG with Knowledge Graph**
- This notebook demonstrates how to build a knowledge graph using website data, visualize it, and answer complex queries that cannot be handled by traditional naive RAG.
- It includes optimizations for cybersecurity data, utilizing the ReLiK model for entity detection and relation extraction, and integrating with Kuzu DB for efficient querying.

**Installations**

In [1]:
!pip install -q langchain langchain-experimental langchain-core langchain-community langchain-groq pandas networkx
!pip install -q mpi4py pyvis ampligraph transformers relik deepspeed kuzu bitsandbytes

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.4/50.4 kB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m990.6/990.6 kB[0m [31m53.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m204.3/204.3 kB[0m [31m19.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m384.0/384.0 kB[0m [31m31.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.3/2.3 MB[0m [31m47.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m103.5/103.5 kB[0m [31m6.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m140.4/140.4 kB[0m [31m11.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.6/75.6 kB[0m [31m7.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

**Web Data Loading**
- This function loads documents from a list of URLs.
- Loading data from websites is the first step in any data processing pipeline. It ensures that we have the raw data needed for further analysis.

In [2]:
from langchain_community.document_loaders import WebBaseLoader

def load_data_from_websites(urls):
    """Load documents from a list of URLs."""
    web_base_loader = WebBaseLoader(urls)
    documents = web_base_loader.load()
    print(f"Loaded {len(documents)} documents.")
    return documents

# List of websites to load data from
websites = [
    "https://www.scmagazine.com/home/security-news/",
    "https://thehackernews.com/",
    "https://www.securityweek.com/",
    "https://www.darkreading.com/",
    "https://krebsonsecurity.com/",
    "https://www.bleepingcomputer.com/",
    "https://threatpost.com/",
    "https://www.cyberscoop.com/",
    "https://www.infosecurity-magazine.com/",
    "https://www.zdnet.com/topic/security/",
]

documents = load_data_from_websites(websites)



Loaded 10 documents.


**Split Documents into Chunks**
- This function splits the loaded documents into smaller chunks for easier processing.
- Splitting documents into chunks helps in managing large texts and ensures that each chunk can be processed independently.
- This function is used to break down long documents into manageable pieces for entity detection and triplet extraction.

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_data(documents, chunk_size=500, chunk_overlap=50):
    """Split documents into smaller chunks."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        is_separator_regex=False,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Number of chunks created: {len(chunks)}")
    return chunks

chunks = chunk_data(documents)

Number of chunks created: 363


**Create a DataFrame of Chunks**
- This function converts the list of chunks into a pandas DataFrame.
- Converting chunks into a DataFrame allows for easier data manipulation and analysis using pandas.
- This function is used to create a structured format for the chunks, which will be used in subsequent steps.

In [4]:
import pandas as pd

def create_chunks_dataframe(chunks):
    """Create a DataFrame from document chunks."""
    data = {'content': [chunk.page_content for chunk in chunks]}
    return pd.DataFrame(data)

chunks_df = create_chunks_dataframe(chunks)
chunks_df.head()

Unnamed: 0,content
0,404: Not FoundCISO StoriesTopicsEventsPodcasts...
1,in any form without prior authorization.\n ...
2,The Hacker News | #1 Trusted Cybersecurity New...
3,Contact/Tip Us\n\n\n\nReach out to get featur...
4,"In 2023, no fewer than 94 percent of businesse..."


**Detect and Classify Entities**
- This function uses a Named Entity Recognition (NER) pipeline to detect entities in the text chunks.
- Detecting entities helps in identifying key concepts and entities in the text, which is crucial for understanding the content.
- This function is used to extract entities from the text chunks and create a DataFrame of detected entities.

In [5]:
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from relik import Relik
from langchain.schema.document import Document
import torch
import pandas as pd

def extract_entities_with_relik(chunks_df, max_retries=3):
    """Extract entities from document chunks using the ReLiK entity extraction model."""
    session = requests.Session()
    retry = Retry(total=max_retries, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    relik = Relik.from_pretrained(
        "relik-ie/relik-cie-tiny",
        device="cuda" if torch.cuda.is_available() else "cpu",
        precision="bf16" if torch.cuda.is_available() else "fp32",
        skip_metadata=True,  # don't load index metadata to keep low memory requirements
    )

    documents = [Document(page_content=chunk) for chunk in chunks_df['content']]
    entities = []

    for doc in documents:
        # Use the ReLiK model to extract entities
        relik_output = relik(doc.page_content)
        for span in relik_output.spans:
            entities.append({
                'Entity': span.text,
                'Type': span.label
            })

    entities_df = pd.DataFrame(entities)
    return entities_df

entities_df = extract_entities_with_relik(chunks_df)
entities_df.head()

                ___              __         
               /\_ \      __    /\ \        
 _ __     __   \//\ \    /\_\   \ \ \/'\    
/\`'__\ /'__`\   \ \ \   \/\ \   \ \ , <    
\ \ \/ /\  __/    \_\ \_  \ \ \   \ \ \\`\  
 \ \_\ \ \____\   /\____\  \ \_\   \ \_\ \_\
  \/_/  \/____/   \/____/   \/_/    \/_/\/_/
                                            
                                            



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.yaml:   0%|          | 0.00/942 [00:00<?, ?B/s]

[2024-08-12 15:55:55,878] [INFO] [relik.inference.annotator.from_pretrained:700] [PID:407] [RANK:0] Loading Relik from relik-ie/relik-cie-tiny[39m
[2024-08-12 15:55:55,884] [INFO] [relik.inference.annotator.from_pretrained:701] [PID:407] [RANK:0] {
    '_target_': 'relik.inference.annotator.Relik',
    'index': {
        'span': {
            '_target_': 'relik.retriever.indexers.inmemory.InMemoryDocumentIndex.from_pretrained',
            'name_or_path': 'relik-ie/index-e5-small-v2-wikipedia-matryoshka',
        },
        'triplet': {
            '_target_': 'relik.retriever.indexers.inmemory.InMemoryDocumentIndex.from_pretrained',
            'name_or_path': 'relik-ie/encoder-e5-small-v2-wikipedia-relations-index',
        },
    },
    'metadata_fields': [],
    'reader': {
        '_target_': 'relik.reader.pytorch_modules.triplet.RelikReaderForTripletExtraction',
        'transformer_model': 'relik-ie/relik-reader-deberta-v3-small-cie-wikipedia',
        'use_nme': True,
    },
 

config.json:   0%|          | 0.00/771 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/720 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

config.yaml:   0%|          | 0.00/202 [00:00<?, ?B/s]

documents.jsonl:   0%|          | 0.00/2.96G [00:00<?, ?B/s]

embeddings.pt:   0%|          | 0.00/4.79G [00:00<?, ?B/s]

[2024-08-12 16:01:04,795] [INFO] [relik.retriever.indexers.base.from_pretrained:482] [PID:407] [RANK:0] Loading Index from config:[39m
[2024-08-12 16:01:04,798] [INFO] [relik.retriever.indexers.base.from_pretrained:483] [PID:407] [RANK:0] {
    '_target_': 'relik.retriever.indexers.inmemory.InMemoryDocumentIndex',
    'device': 'cuda',
    'metadata_fields': [],
    'name_or_path': '/media/ssd/perelluis/relik_experiments/indexes/index-e5-small-v2-wikipedia-matryoshka',
    'precision': 'bf16',
    'separator': None,
    'use_faiss': False,
}[39m
[2024-08-12 16:01:04,800] [INFO] [relik.retriever.indexers.base.from_pretrained:490] [PID:407] [RANK:0] Loading documents from /root/.cache/huggingface/hub/models--relik-ie--index-e5-small-v2-wikipedia-matryoshka/snapshots/8d119e710a7a8be5000b789dfcde3e661767982b/documents.jsonl[39m
[2024-08-12 16:01:58,147] [INFO] [relik.retriever.indexers.base.from_pretrained:533] [PID:407] [RANK:0] Loading embeddings from /root/.cache/huggingface/hub/mode

config.yaml:   0%|          | 0.00/171 [00:00<?, ?B/s]

documents.jsonl:   0%|          | 0.00/107k [00:00<?, ?B/s]

embeddings.pt:   0%|          | 0.00/476k [00:00<?, ?B/s]

[2024-08-12 16:02:23,168] [INFO] [relik.retriever.indexers.base.from_pretrained:482] [PID:407] [RANK:0] Loading Index from config:[39m
[2024-08-12 16:02:23,169] [INFO] [relik.retriever.indexers.base.from_pretrained:483] [PID:407] [RANK:0] {
    '_target_': 'relik.retriever.indexers.inmemory.InMemoryDocumentIndex',
    'device': 'cuda',
    'metadata_fields': [],
    'name_or_path': 'relik-ie/encoder-e5-small-v2-wikipedia-relations-index',
    'precision': 'bf16',
    'separator': None,
    'use_faiss': False,
}[39m
[2024-08-12 16:02:23,173] [INFO] [relik.retriever.indexers.base.from_pretrained:490] [PID:407] [RANK:0] Loading documents from /root/.cache/huggingface/hub/models--relik-ie--encoder-e5-small-v2-wikipedia-relations-index/snapshots/f311d53631c26f80a2b1ba16ac65337d06561946/documents.jsonl[39m
[2024-08-12 16:02:23,178] [INFO] [relik.retriever.indexers.base.from_pretrained:533] [PID:407] [RANK:0] Loading embeddings from /root/.cache/huggingface/hub/models--relik-ie--encoder-e5

config.json:   0%|          | 0.00/753 [00:00<?, ?B/s]

configuration_relik.py:   0%|          | 0.00/1.70k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/603M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/578 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/286M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/20.2k [00:00<?, ?B/s]

spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/8.67M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/2.02k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.40k [00:00<?, ?B/s]

[2024-08-12 16:02:39,014] [INFO] [relik.inference.utils.load_reader:383] [PID:407] [RANK:0] Moving reader to `cuda`.[39m
[2024-08-12 16:02:39,020] [INFO] [relik.inference.utils.load_reader:386] [PID:407] [RANK:0] Setting precision of reader to `torch.bfloat16`.[39m


tokenizer_config.json:   0%|          | 0.00/1.19k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]



tokenizer_config.json:   0%|          | 0.00/1.19k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

[36m[2024-08-12 16:02:48,258] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:407] [RANK:0] Dataset finished: 1 number of elements processed[39m
[36m[2024-08-12 16:02:49,394] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:407] [RANK:0] Dataset finished: 1 number of elements processed[39m
[36m[2024-08-12 16:02:50,636] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:407] [RANK:0] Dataset finished: 2 number of elements processed[39m
[36m[2024-08-12 16:02:51,998] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:407] [RANK:0] Dataset finished: 3 number of elements processed[39m
[36m[2024-08-12 16:02:53,229] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:407] [RANK:0] Dataset finished: 2 number of elements processed[39m
[36m[2024-08-12 16:02:54,426] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:407] [RANK:0] Dataset finished: 2 number of elements processed[39m
[36

Unnamed: 0,Entity,Type
0,2024,--NME--
1,CyberRisk Alliance,--NME--
2,this website,--NME--
3,CyberRisk Alliance,--NME--
4,Hacker News,Hacker News


**Initializing LLM for Knowledge Extraction**
- Initialize the Mistral LLM using Groq for knowledge extraction.

In [6]:
from langchain_groq import ChatGroq
GROQ_API_KEY = "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"

# Initialize the Mistral LLM using Groq
llm = ChatGroq(
    temperature=0,
    model="mixtral-8x7b-32768",
    api_key=GROQ_API_KEY
)

**Extract Triplets using ReLiK**
- Use the ReLiK model for relation extraction.
- Extracting triplets helps in understanding the relationships between entities, which is essential for building a knowledge graph.
- This function is used to extract triplets from the text chunks and create a list of triplets.

In [7]:
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
import torch
from relik import Relik

# Extract Triplets using ReLiK
def extract_triplets_relik(chunks, max_retries=3):
    """Extract triplets using the ReLiK model."""
    session = requests.Session()
    retry = Retry(total=max_retries, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    relik = Relik.from_pretrained(
        "relik-ie/relik-relation-extraction-small-wikipedia",
        device="cuda" if torch.cuda.is_available() else "cpu",
        precision="bf16" if torch.cuda.is_available() else "fp32",
        skip_metadata=True,  # don't load index metadata to keep low memory requirements
    )

    all_triplets = []
    for chunk in chunks:
        # Use the ReLiK model to extract triplets
        relik_output = relik(chunk.page_content)
        for triplet in relik_output.triplets:
            all_triplets.append({
                'Subject': triplet.subject.text,
                'Predicate': triplet.label,
                'Object': triplet.object.text
            })

    return all_triplets

triplets = extract_triplets_relik(chunks)
print(f"Number of triplets extracted: {len(triplets)}")
for triplet in triplets:
    print(triplet)

                ___              __         
               /\_ \      __    /\ \        
 _ __     __   \//\ \    /\_\   \ \ \/'\    
/\`'__\ /'__`\   \ \ \   \/\ \   \ \ , <    
\ \ \/ /\  __/    \_\ \_  \ \ \   \ \ \\`\  
 \ \_\ \ \____\   /\____\  \ \_\   \ \_\ \_\
  \/_/  \/____/   \/____/   \/_/    \/_/\/_/
                                            
                                            





config.yaml:   0%|          | 0.00/619 [00:00<?, ?B/s]

[2024-08-12 16:12:56,930] [INFO] [relik.inference.annotator.from_pretrained:700] [PID:407] [RANK:0] Loading Relik from relik-ie/relik-relation-extraction-small-wikipedia[39m
[2024-08-12 16:12:56,937] [INFO] [relik.inference.annotator.from_pretrained:701] [PID:407] [RANK:0] {
    '_target_': 'relik.inference.annotator.Relik',
    'index': {
        'triplet': {
            '_target_': 'relik.retriever.indexers.inmemory.InMemoryDocumentIndex.from_pretrained',
            'name_or_path': 'relik-ie/encoder-e5-small-v2-wikipedia-relations-index',
        },
    },
    'metadata_fields': [],
    'reader': {
        '_target_': 'relik.reader.pytorch_modules.triplet.RelikReaderForTripletExtraction',
        'transformer_model': 'relik-ie/relik-reader-deberta-v3-small-re-wikipedia',
    },
    'retriever': {
        'triplet': {
            '_target_': 'relik.retriever.pytorch_modules.model.GoldenRetriever',
            'question_encoder': 'relik-ie/encoder-e5-small-v2-wikipedia-relations',
  

config.json:   0%|          | 0.00/895 [00:00<?, ?B/s]

configuration_relik.py:   0%|          | 0.00/1.70k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/586M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/6.94k [00:00<?, ?B/s]

spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/8.65M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/613 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/1.42k [00:00<?, ?B/s]

[2024-08-12 16:13:14,294] [INFO] [relik.inference.utils.load_reader:383] [PID:407] [RANK:0] Moving reader to `cuda`.[39m
[2024-08-12 16:13:14,298] [INFO] [relik.inference.utils.load_reader:386] [PID:407] [RANK:0] Setting precision of reader to `torch.bfloat16`.[39m




[36m[2024-08-12 16:13:15,208] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:407] [RANK:0] Dataset finished: 1 number of elements processed[39m
[36m[2024-08-12 16:13:15,829] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:407] [RANK:0] Dataset finished: 1 number of elements processed[39m
[36m[2024-08-12 16:13:16,501] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:407] [RANK:0] Dataset finished: 2 number of elements processed[39m
[36m[2024-08-12 16:13:17,192] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:407] [RANK:0] Dataset finished: 3 number of elements processed[39m
[36m[2024-08-12 16:13:17,848] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:407] [RANK:0] Dataset finished: 2 number of elements processed[39m
[36m[2024-08-12 16:13:18,518] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:407] [RANK:0] Dataset finished: 2 number of elements processed[39m
[36

IndexError: list index out of range

**Analyze Relationships**
- This function converts the list of triplets into a DataFrame.
- Converting triplets into a DataFrame allows for easier analysis and manipulation of the relationships between entities.
- This function is used to create a structured format for the triplets, which will be used in subsequent steps.

In [None]:
# Analyze Relationships
def analyze_relationships(triplets):
    """Convert list of triplets into a DataFrame."""
    def create_triplets_dataframe(triplets):
        triplet_data = []
        for triplet in triplets:
            subject, predicate, obj = triplet.strip("()").split(", ")
            triplet_data.append({
                "subject": subject.strip(),
                "predicate": predicate.strip(),
                "object": obj.strip()
            })
        return pd.DataFrame(triplet_data)
    triplets_df = create_triplets_dataframe(triplets)
    return triplets_df

triplets_df = analyze_relationships(triplets)

**Calculate Contextual Proximity**
- This function calculates the contextual proximity between nodes by merging the DataFrame with itself and counting the occurrences of node pairs within the same chunk.
- Contextual proximity helps in understanding the relationships between entities that appear together in the same context.
- This function is used to create a DataFrame of contextual proximity relationships between entities.


In [None]:
def calculate_contextual_proximity(df):
    """Calculate contextual proximity between nodes."""
    long_format_df = pd.melt(
        df, id_vars=["chunk_id"], value_vars=["node_1", "node_2"], value_name="node"
    )
    long_format_df.drop(columns=["variable"], inplace=True)
    merged_df = pd.merge(long_format_df, long_format_df, on="chunk_id", suffixes=("_1", "_2"))
    self_loops_index = merged_df[merged_df["node_1"] == merged_df["node_2"]].index
    merged_df = merged_df.drop(index=self_loops_index).reset_index(drop=True)
    grouped_df = (
        merged_df.groupby(["node_1", "node_2"])
        .agg({"chunk_id": [",".join, "count"]})
        .reset_index()
    )
    grouped_df.columns = ["node_1", "node_2", "chunk_id", "count"]
    grouped_df.replace("", np.nan, inplace=True)
    grouped_df.dropna(subset=["node_1", "node_2"], inplace=True)
    grouped_df = grouped_df[grouped_df["count"] != 1]
    grouped_df["edge"] = "contextual proximity"
    return grouped_df

# Calculate contextual proximity
contextual_proximity_df = calculate_contextual_proximity(triplets_df)
contextual_proximity_df.head()

**Merge DataFrames**
- This function merges the concepts DataFrame with the contextual proximity DataFrame and aggregates the data.
- Merging the DataFrames allows for a comprehensive view of the relationships between entities.
- This function is used to create a merged DataFrame that will be used to create the graph.

In [None]:
def merge_dataframes(concepts_df, contextual_proximity_df):
    """Merge the concepts DataFrame with the contextual proximity DataFrame."""
    merged_df = pd.concat([concepts_df, contextual_proximity_df], ignore_index=True)
    return merged_df

# Merge DataFrames
merged_df = merge_dataframes(triplets_df, contextual_proximity_df)
merged_df.head()

**Create NetworkX Graph**
- This function creates a NetworkX graph from the merged DataFrame, with nodes and edges representing the relationships between entities.
- Creating a graph allows for visualizing and analyzing the relationships between entities.
- This function is used to create a graph that will be used for further analysis and visualization.

In [None]:
from pyvis.network import Network
import networkx as nx

def create_network_graph(triplets_df):
    """Create a network graph using Pyvis."""
    G = nx.MultiDiGraph()

    for _, row in triplets_df.iterrows():
        G.add_node(row['subject'], label=row['subject'], title=row['subject'])
        G.add_node(row['object'], label=row['object'], title=row['object'])
        G.add_edge(row['subject'], row['object'], label=row['predicate'], title=row['predicate'])

    net = Network(notebook=True, height="750px", width="100%", bgcolor="#222222", font_color="white")
    net.from_nx(G)

    net.set_options("""
    var options = {
      "nodes": {
        "shape": "dot",
        "size": 16,
        "font": {
          "size": 12
        }
      },
      "edges": {
        "width": 2,
        "color": {
          "inherit": true
        }
      },
      "physics": {
        "forceAtlas2Based": {
          "gravitationalConstant": -50,
          "centralGravity": 0.01,
          "springLength": 230,
          "springConstant": 0.18
        },
        "maxVelocity": 50,
        "solver": "forceAtlas2Based",
        "timestep": 0.22,
        "stabilization": {
          "iterations": 150
        }
      }
    }
    """)

    return net

# Create network graph
net = create_network_graph(triplets_df)
net.show("network_graph.html")

**Generate Graph Embeddings of these Metrics**
- This function calculates various centrality metrics for the graph and sets them as node attributes.
- Centrality metrics help in understanding the importance and influence of nodes in the graph.
- This function is used to enrich the graph with additional metrics that will be used for analysis and visualization.


In [None]:
from ampligraph.latent_features import ComplEx
from ampligraph.evaluation import evaluate_performance, mr_score, mrr_score, hits_at_n_score

def create_knowledge_graph_embeddings(triplets_df):
    """Create knowledge graph embeddings using AmpliGraph."""
    X = triplets_df[['subject', 'predicate', 'object']].values
    model = ComplEx(batches_count=10, epochs=50, k=100, eta=5, optimizer='adam', optimizer_params={'lr': 0.0005},
                    loss='multiclass_nll', regularizer='LP', regularizer_params={'p': 3, 'lambda': 0.1},
                    seed=0, verbose=True)
    model.fit(X)
    return model

# Create knowledge graph embeddings
model = create_knowledge_graph_embeddings(triplets_df)

**Model Evaluation**
- We'll evaluate the knowledge graph embedding model:


In [None]:
def evaluate_model(model, triplets_df):
    """Evaluate the knowledge graph embedding model."""
    X = triplets_df[['subject', 'predicate', 'object']].values
    filter_triples = np.ones(X.shape[0])
    ranks = evaluate_performance(X, model=model, filter_triples=filter_triples, use_default_protocol=True, verbose=True)
    mr = mr_score(ranks)
    mrr = mrr_score(ranks)
    hits_at_10 = hits_at_n_score(ranks, n=10)
    metrics = {
        'MR': mr,
        'MRR': mrr,
        'Hits@10': hits_at_10
    }
    return metrics

# Evaluate the model
metrics = evaluate_model(model, triplets_df)
print(metrics)

**Set up Kuzu DB and Create the Schema**
- We'll set up the Kuzu DB and create the schema:


In [None]:
import kuzu

def setup_kuzu_db(db_name):
    """Set up Kuzu DB and create the schema."""
    db = kuzu.Database(db_name)
    conn = kuzu.Connection(db)

    conn.execute("CREATE NODE TABLE Movie (name STRING, PRIMARY KEY(name))")
    conn.execute("CREATE NODE TABLE Person (name STRING, birthDate STRING, PRIMARY KEY(name))")
    conn.execute("CREATE REL TABLE ActedIn (FROM Person TO Movie)")

    # Create additional tables and relationships
    conn.execute("CREATE NODE TABLE Incident (name STRING, date STRING, type STRING, PRIMARY KEY(name))")
    conn.execute("CREATE REL TABLE InvolvedIn (FROM Person TO Incident)")
    conn.execute("CREATE REL TABLE Targeted (FROM Incident TO Movie)")

    return conn

# Set up Kuzu DB
conn = setup_kuzu_db("test_db")

**Insert Data into Kuzu DB**
- We'll insert data into the Kuzu DB

In [None]:
def insert_data_into_kuzu(conn, triplets_df):
    """Insert data into Kuzu DB."""
    for _, row in triplets_df.iterrows():
        subject, predicate, obj = row['subject'], row['predicate'], row['object']
        if predicate == 'ActedIn':
            conn.execute(f"CREATE (:Person {{name: '{subject}'}})-[:ActedIn]->(:Movie {{name: '{obj}'}})")
        elif predicate == 'InvolvedIn':
            conn.execute(f"CREATE (:Person {{name: '{subject}'}})-[:InvolvedIn]->(:Incident {{name: '{obj}'}})")
        elif predicate == 'Targeted':
            conn.execute(f"CREATE (:Incident {{name: '{subject}'}})-[:Targeted]->(:Movie {{name: '{obj}'}})")

# Insert data into Kuzu DB
insert_data_into_kuzu(conn, triplets_df)

**Create KuzuQAChain and Query the Graph**
- We'll create a KuzuQAChain for querying the graph


In [None]:
from langchain.chains import KuzuQAChain
from langchain_community.graphs import KuzuGraph

def create_kuzu_qa_chain(db, api_key):
    """Create KuzuQAChain for querying the graph."""
    graph = KuzuGraph(db)
    chain = KuzuQAChain.from_llm(llm, graph=graph, verbose=True)
    return chain

# Create KuzuQAChain
chain = create_kuzu_qa_chain(db, GROQ_API_KEY)

# Process queries
queries = [
    "List all details on BFSI security incidents in India.",
    "List all ransomware attacks targeting the healthcare industry in the last 7 days.",
    "Provide recent incidents related to Lockbit Ransomware gang.",
    "Provide recent incidents related to BlackBasta Ransomware."
]

# Query the graph
for query in queries:
    print(f"Query: {query}")
    result = chain.run(query)
    print(result)
    print("\n")

**Refresh Schema Information**
- We'll refresh the schema information



In [None]:
def refresh_schema(graph):
    """Refresh the schema information needed to generate Cypher statements."""
    graph.refresh_schema()
    print("Schema information refreshed.")

# Refresh schema
refresh_schema(graph)

**Add Indexing**
- We'll create indexes on the Kuzu DB

In [None]:
def create_indexes(conn):
    """Create indexes on the Kuzu DB."""
    safe_execute(conn, "CREATE INDEX ON Movie(name)")
    safe_execute(conn, "CREATE INDEX ON Person(name)")
    safe_execute(conn, "CREATE INDEX ON Incident(name)")

# Create indexes
create_indexes(conn)

**Add More Complex Queries**
- We'll run more complex queries using the KuzuQAChain

In [None]:
def run_complex_queries(chain):
    """Run more complex queries using the KuzuQAChain."""
    complex_queries = [
        "Which actors were involved in incidents related to Lockbit Ransomware?",
        "List all movies targeted by ransomware attacks in the last month.",
        "Who is the oldest actor involved in any incident?"
    ]

    for query in complex_queries:
        print(f"Query: {query}")
        result = chain.run(query)
        print(result)
        print("\n")

# Run complex queries
run_complex_queries(chain)