<a href="https://colab.research.google.com/github/KaifAhmad1/code-test/blob/main/RAG_with_Knowledge_Graph_OpenSource.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **RAG with Knowledge Graph**
- This notebook demonstrates how to build a knowledge graph using website data, visualize it, and answer complex queries that cannot be handled by traditional naive RAG.
- It includes optimizations for cybersecurity data, utilizing the ReLiK model for entity detection and relation extraction, and integrating with Kuzu DB for efficient querying.

**Installations**

In [None]:
!pip install -q langchain langchain-experimental langchain-core langchain-community langchain-groq pandas networkx
!pip install -q mpi4py pyvis ampligraph transformers relik kuzu pykeen torch

**Web Data Loading**
- This function loads documents from a list of URLs.
- Loading data from websites is the first step in any data processing pipeline. It ensures that we have the raw data needed for further analysis.

In [None]:
from langchain_community.document_loaders import WebBaseLoader

def load_data_from_websites(urls):
    """Load documents from a list of URLs."""
    web_base_loader = WebBaseLoader(urls)
    documents = web_base_loader.load()
    print(f"Loaded {len(documents)} documents.")
    return documents

# List of websites to load data from
websites = [
    "https://www.scmagazine.com/home/security-news/",
    "https://thehackernews.com/",
    "https://www.securityweek.com/",
    "https://www.darkreading.com/",
    "https://krebsonsecurity.com/",
    "https://www.bleepingcomputer.com/",
    "https://threatpost.com/",
    "https://www.cyberscoop.com/",
    "https://www.infosecurity-magazine.com/",
    "https://www.zdnet.com/topic/security/",
]

documents = load_data_from_websites(websites)



Loaded 10 documents.


**Split Documents into Chunks**
- This function splits the loaded documents into smaller chunks for easier processing.
- Splitting documents into chunks helps in managing large texts and ensures that each chunk can be processed independently.
- This function is used to break down long documents into manageable pieces for entity detection and triplet extraction.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_data(documents, chunk_size=500, chunk_overlap=50):
    """Split documents into smaller chunks."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        is_separator_regex=False,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Number of chunks created: {len(chunks)}")
    return chunks

chunks = chunk_data(documents)

Number of chunks created: 363


**Create a DataFrame of Chunks**
- This function converts the list of chunks into a pandas DataFrame.
- Converting chunks into a DataFrame allows for easier data manipulation and analysis using pandas.
- This function is used to create a structured format for the chunks, which will be used in subsequent steps.

In [None]:
import pandas as pd

def create_chunks_dataframe(chunks):
    """Create a DataFrame from document chunks."""
    data = {'content': [chunk.page_content for chunk in chunks]}
    return pd.DataFrame(data)

chunks_df = create_chunks_dataframe(chunks)
chunks_df.head()

Unnamed: 0,content
0,404: Not FoundCISO StoriesTopicsEventsPodcasts...
1,in any form without prior authorization.\n ...
2,The Hacker News | #1 Trusted Cybersecurity New...
3,Contact/Tip Us\n\n\n\nReach out to get featur...
4,The Computer Emergency Response Team of Ukrain...


**Detect and Classify Entities**
- This function uses a Named Entity Recognition (NER) pipeline to detect entities in the text chunks.
- Detecting entities helps in identifying key concepts and entities in the text, which is crucial for understanding the content.
- This function is used to extract entities from the text chunks and create a DataFrame of detected entities.

In [None]:
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from relik import Relik
from langchain.schema.document import Document
import torch
import pandas as pd

def extract_entities_with_relik(chunks_df, max_retries=3):
    """Extract entities from document chunks using the ReLiK entity extraction model."""
    session = requests.Session()
    retry = Retry(total=max_retries, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    relik = Relik.from_pretrained(
        "relik-ie/relik-cie-tiny",
        device="cuda" if torch.cuda.is_available() else "cpu",
        precision="bf16" if torch.cuda.is_available() else "fp32",
        skip_metadata=True,  # don't load index metadata to keep low memory requirements
    )

    documents = [Document(page_content=chunk) for chunk in chunks_df['content']]
    entities = []

    for doc in documents:
        print(f"Processing document: {doc.page_content[:50]}...")  # Print the first 50 characters of the document
        if not doc.page_content.strip():
            print("Warning: Empty document content")
            continue

        # Use the ReLiK model to extract entities
        try:
            relik_output = relik(doc.page_content)
            for span in relik_output.spans:
                entities.append({
                    'Entity': span.text,
                    'Type': span.label
                })
        except Exception as e:
            print(f"Error processing document: {e}")

    entities_df = pd.DataFrame(entities)
    return entities_df

entities_df = extract_entities_with_relik(chunks_df)
entities_df.head()

                ___              __         
               /\_ \      __    /\ \        
 _ __     __   \//\ \    /\_\   \ \ \/'\    
/\`'__\ /'__`\   \ \ \   \/\ \   \ \ , <    
\ \ \/ /\  __/    \_\ \_  \ \ \   \ \ \\`\  
 \ \_\ \ \____\   /\____\  \ \_\   \ \_\ \_\
  \/_/  \/____/   \/____/   \/_/    \/_/\/_/
                                            
                                            



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


[2024-08-13 08:55:43,465] [INFO] [relik.inference.annotator.from_pretrained:700] [PID:3035] [RANK:0] Loading Relik from relik-ie/relik-cie-tiny[39m
[2024-08-13 08:55:43,477] [INFO] [relik.inference.annotator.from_pretrained:701] [PID:3035] [RANK:0] {
    '_target_': 'relik.inference.annotator.Relik',
    'index': {
        'span': {
            '_target_': 'relik.retriever.indexers.inmemory.InMemoryDocumentIndex.from_pretrained',
            'name_or_path': 'relik-ie/index-e5-small-v2-wikipedia-matryoshka',
        },
        'triplet': {
            '_target_': 'relik.retriever.indexers.inmemory.InMemoryDocumentIndex.from_pretrained',
            'name_or_path': 'relik-ie/encoder-e5-small-v2-wikipedia-relations-index',
        },
    },
    'metadata_fields': [],
    'reader': {
        '_target_': 'relik.reader.pytorch_modules.triplet.RelikReaderForTripletExtraction',
        'transformer_model': 'relik-ie/relik-reader-deberta-v3-small-cie-wikipedia',
        'use_nme': True,
    },

documents.jsonl:  92%|#########1| 2.71G/2.96G [00:00<?, ?B/s]

embeddings.pt:   0%|          | 0.00/4.79G [00:00<?, ?B/s]

[2024-08-13 08:56:33,104] [INFO] [relik.retriever.indexers.base.from_pretrained:482] [PID:3035] [RANK:0] Loading Index from config:[39m
[2024-08-13 08:56:33,108] [INFO] [relik.retriever.indexers.base.from_pretrained:483] [PID:3035] [RANK:0] {
    '_target_': 'relik.retriever.indexers.inmemory.InMemoryDocumentIndex',
    'device': 'cuda',
    'metadata_fields': [],
    'name_or_path': '/media/ssd/perelluis/relik_experiments/indexes/index-e5-small-v2-wikipedia-matryoshka',
    'precision': 'bf16',
    'separator': None,
    'use_faiss': False,
}[39m
[2024-08-13 08:56:33,109] [INFO] [relik.retriever.indexers.base.from_pretrained:490] [PID:3035] [RANK:0] Loading documents from /root/.cache/huggingface/hub/models--relik-ie--index-e5-small-v2-wikipedia-matryoshka/snapshots/8d119e710a7a8be5000b789dfcde3e661767982b/documents.jsonl[39m
[2024-08-13 08:57:33,031] [INFO] [relik.retriever.indexers.base.from_pretrained:533] [PID:3035] [RANK:0] Loading embeddings from /root/.cache/huggingface/hub/

config.yaml:   0%|          | 0.00/171 [00:00<?, ?B/s]

documents.jsonl:   0%|          | 0.00/107k [00:00<?, ?B/s]

embeddings.pt:   0%|          | 0.00/476k [00:00<?, ?B/s]

[2024-08-13 08:57:59,710] [INFO] [relik.retriever.indexers.base.from_pretrained:482] [PID:3035] [RANK:0] Loading Index from config:[39m
[2024-08-13 08:57:59,712] [INFO] [relik.retriever.indexers.base.from_pretrained:483] [PID:3035] [RANK:0] {
    '_target_': 'relik.retriever.indexers.inmemory.InMemoryDocumentIndex',
    'device': 'cuda',
    'metadata_fields': [],
    'name_or_path': 'relik-ie/encoder-e5-small-v2-wikipedia-relations-index',
    'precision': 'bf16',
    'separator': None,
    'use_faiss': False,
}[39m
[2024-08-13 08:57:59,716] [INFO] [relik.retriever.indexers.base.from_pretrained:490] [PID:3035] [RANK:0] Loading documents from /root/.cache/huggingface/hub/models--relik-ie--encoder-e5-small-v2-wikipedia-relations-index/snapshots/f311d53631c26f80a2b1ba16ac65337d06561946/documents.jsonl[39m
[2024-08-13 08:57:59,722] [INFO] [relik.retriever.indexers.base.from_pretrained:533] [PID:3035] [RANK:0] Loading embeddings from /root/.cache/huggingface/hub/models--relik-ie--encode

config.json:   0%|          | 0.00/753 [00:00<?, ?B/s]

configuration_relik.py:   0%|          | 0.00/1.70k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/603M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/578 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/286M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/20.2k [00:00<?, ?B/s]

spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/8.67M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/2.02k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.40k [00:00<?, ?B/s]

[2024-08-13 09:02:44,189] [INFO] [relik.inference.utils.load_reader:383] [PID:3035] [RANK:0] Moving reader to `cuda`.[39m
[2024-08-13 09:02:44,194] [INFO] [relik.inference.utils.load_reader:386] [PID:3035] [RANK:0] Setting precision of reader to `torch.bfloat16`.[39m
Processing document: 404: Not FoundCISO StoriesTopicsEventsPodcastsRese...


tokenizer_config.json:   0%|          | 0.00/1.19k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]



tokenizer_config.json:   0%|          | 0.00/1.19k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

[36m[2024-08-13 09:02:53,534] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:3035] [RANK:0] Dataset finished: 1 number of elements processed[39m
Processing document: in any form without prior authorization.
      You...
[36m[2024-08-13 09:02:55,263] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:3035] [RANK:0] Dataset finished: 1 number of elements processed[39m
Processing document: The Hacker News | #1 Trusted Cybersecurity News Si...
[36m[2024-08-13 09:02:56,817] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:3035] [RANK:0] Dataset finished: 2 number of elements processed[39m
Processing document: Contact/Tip Us



Reach out to get featured—conta...
[36m[2024-08-13 09:02:58,184] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:3035] [RANK:0] Dataset finished: 3 number of elements processed[39m
Processing document: The Computer Emergency Response Team of Ukraine (C...
[36m[2024-08-13 09:02:59,380] [

Unnamed: 0,Entity,Type
0,2024,--NME--
1,CyberRisk Alliance,--NME--
2,this website,--NME--
3,CyberRisk Alliance,--NME--
4,Hacker News,Hacker News


**Extract Triplets using ReLiK**
- Use the ReLiK model for relation extraction.
- Extracting triplets helps in understanding the relationships between entities, which is essential for building a knowledge graph.
- This function is used to extract triplets from the text chunks and create a list of triplets.

In [None]:
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
import torch
from relik import Relik

def extract_triplets_relik(chunks, max_retries=3):
    """Extract triplets using the ReLiK model."""
    session = requests.Session()
    retry = Retry(total=max_retries, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    relik = Relik.from_pretrained(
        "relik-ie/relik-relation-extraction-small-wikipedia",
        device="cuda" if torch.cuda.is_available() else "cpu",
        precision="bf16" if torch.cuda.is_available() else "fp32",
        skip_metadata=True,
    )
    print("ReLiK model loaded successfully")

    all_triplets = []
    for chunk_id, chunk in enumerate(chunks):
        if not chunk.page_content or not isinstance(chunk.page_content, str):
            print(f"Skipping invalid chunk: {chunk}")
            continue

        print(f"Processing chunk with length: {len(chunk.page_content)}")

        try:
            relik_output = relik(chunk.page_content)
            if relik_output is None:
                print(f"ReLiK output is None for chunk with length: {len(chunk.page_content)}")
                continue

            for triplet in relik_output.triplets:
                all_triplets.append({
                    'chunk_id': chunk_id,
                    'Subject': triplet.subject.text,
                    'Predicate': triplet.label,
                    'Object': triplet.object.text
                })
        except IndexError as e:
            print(f"IndexError encountered: {e}")
            # No need to check the length of windows if relik_output is None
            if relik_output is not None:
                try:
                    if hasattr(relik_output, 'windows'):
                        print(f"Length of windows: {len(relik_output.windows)}")
                    if hasattr(relik_output, 'windows_candidates'):
                        print(f"Length of windows_candidates: {len(relik_output.windows_candidates)}")
                except Exception as inner_e:
                    print(f"Inner exception while accessing relik_output attributes: {inner_e}")
        except Exception as e:
            print(f"Unexpected error encountered: {e}")

    return all_triplets

triplets = extract_triplets_relik(chunks)
print(f"Number of triplets extracted: {len(triplets)}")

                ___              __         
               /\_ \      __    /\ \        
 _ __     __   \//\ \    /\_\   \ \ \/'\    
/\`'__\ /'__`\   \ \ \   \/\ \   \ \ , <    
\ \ \/ /\  __/    \_\ \_  \ \ \   \ \ \\`\  
 \ \_\ \ \____\   /\____\  \ \_\   \ \_\ \_\
  \/_/  \/____/   \/____/   \/_/    \/_/\/_/
                                            
                                            





config.yaml:   0%|          | 0.00/619 [00:00<?, ?B/s]

[2024-08-13 09:11:14,337] [INFO] [relik.inference.annotator.from_pretrained:700] [PID:3035] [RANK:0] Loading Relik from relik-ie/relik-relation-extraction-small-wikipedia[39m
[2024-08-13 09:11:14,343] [INFO] [relik.inference.annotator.from_pretrained:701] [PID:3035] [RANK:0] {
    '_target_': 'relik.inference.annotator.Relik',
    'index': {
        'triplet': {
            '_target_': 'relik.retriever.indexers.inmemory.InMemoryDocumentIndex.from_pretrained',
            'name_or_path': 'relik-ie/encoder-e5-small-v2-wikipedia-relations-index',
        },
    },
    'metadata_fields': [],
    'reader': {
        '_target_': 'relik.reader.pytorch_modules.triplet.RelikReaderForTripletExtraction',
        'transformer_model': 'relik-ie/relik-reader-deberta-v3-small-re-wikipedia',
    },
    'retriever': {
        'triplet': {
            '_target_': 'relik.retriever.pytorch_modules.model.GoldenRetriever',
            'question_encoder': 'relik-ie/encoder-e5-small-v2-wikipedia-relations',


config.json:   0%|          | 0.00/895 [00:00<?, ?B/s]

configuration_relik.py:   0%|          | 0.00/1.70k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/586M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/6.94k [00:00<?, ?B/s]

spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/8.65M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/613 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/1.42k [00:00<?, ?B/s]

[2024-08-13 09:11:30,970] [INFO] [relik.inference.utils.load_reader:383] [PID:3035] [RANK:0] Moving reader to `cuda`.[39m
[2024-08-13 09:11:30,979] [INFO] [relik.inference.utils.load_reader:386] [PID:3035] [RANK:0] Setting precision of reader to `torch.bfloat16`.[39m
ReLiK model loaded successfully
Processing chunk with length: 461




[36m[2024-08-13 09:11:31,953] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:3035] [RANK:0] Dataset finished: 1 number of elements processed[39m
Processing chunk with length: 149
[36m[2024-08-13 09:11:32,605] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:3035] [RANK:0] Dataset finished: 1 number of elements processed[39m
Processing chunk with length: 420
[36m[2024-08-13 09:11:33,290] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:3035] [RANK:0] Dataset finished: 2 number of elements processed[39m
Processing chunk with length: 434
[36m[2024-08-13 09:11:34,161] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:3035] [RANK:0] Dataset finished: 3 number of elements processed[39m
Processing chunk with length: 490
[36m[2024-08-13 09:11:35,233] [DEBUG] [relik.reader.data.relik_reader_re_data.__iter__:399] [PID:3035] [RANK:0] Dataset finished: 2 number of elements processed[39m
Processing chunk with length:

**Analyze Relationships**
- This function converts the list of triplets into a DataFrame.
- Converting triplets into a DataFrame allows for easier analysis and manipulation of the relationships between entities.
- This function is used to create a structured format for the triplets, which will be used in subsequent steps.

In [None]:
import pandas as pd

def analyze_relationships(triplets):
    """Convert list of triplets into a DataFrame."""
    def create_triplets_dataframe(triplets):
        triplet_data = []
        for triplet in triplets:
            triplet_data.append({
                "chunk_id": triplet['chunk_id'],
                "subject": triplet['Subject'],
                "predicate": triplet['Predicate'],
                "object": triplet['Object']
            })
        return pd.DataFrame(triplet_data)

    # Debug: Print the first few triplets to inspect their structure
    print("First few triplets:", triplets[:5])

    triplets_df = create_triplets_dataframe(triplets)
    return triplets_df

triplets_df = analyze_relationships(triplets)
triplets_df.head()

First few triplets: [{'chunk_id': 1, 'Subject': 'this website', 'Predicate': 'digital rights management system', 'Object': 'CyberRisk Alliance'}, {'chunk_id': 3, 'Subject': 'RSS Feeds', 'Predicate': 'has use', 'Object': 'Social Media'}, {'chunk_id': 3, 'Subject': 'Phishing', 'Predicate': 'country', 'Object': 'Ukraine'}, {'chunk_id': 3, 'Subject': 'Government Computers', 'Predicate': 'country', 'Object': 'Ukraine'}, {'chunk_id': 3, 'Subject': 'Government Computers', 'Predicate': 'country', 'Object': 'Ukraine'}]


Unnamed: 0,chunk_id,subject,predicate,object
0,1,this website,digital rights management system,CyberRisk Alliance
1,3,RSS Feeds,has use,Social Media
2,3,Phishing,country,Ukraine
3,3,Government Computers,country,Ukraine
4,3,Government Computers,country,Ukraine


**Calculate Contextual Proximity**
- This function calculates the contextual proximity between nodes by merging the DataFrame with itself and counting the occurrences of node pairs within the same chunk.
- Contextual proximity helps in understanding the relationships between entities that appear together in the same context.
- This function is used to create a DataFrame of contextual proximity relationships between entities.


In [None]:
import numpy as np

def calculate_contextual_proximity(df):
    """Calculate contextual proximity between nodes."""
    long_format_df = pd.melt(
        df, id_vars=["chunk_id"], value_vars=["subject", "object"], value_name="node"
    )
    long_format_df.drop(columns=["variable"], inplace=True)
    merged_df = pd.merge(long_format_df, long_format_df, on="chunk_id", suffixes=("_1", "_2"))
    self_loops_index = merged_df[merged_df["node_1"] == merged_df["node_2"]].index
    merged_df = merged_df.drop(index=self_loops_index).reset_index(drop=True)

    # Convert chunk_id to string before joining
    merged_df['chunk_id'] = merged_df['chunk_id'].astype(str)

    grouped_df = (
        merged_df.groupby(["node_1", "node_2"])
        .agg({"chunk_id": [",".join, "count"]})
        .reset_index()
    )
    grouped_df.columns = ["node_1", "node_2", "chunk_id", "count"]
    grouped_df.replace("", np.nan, inplace=True)
    grouped_df.dropna(subset=["node_1", "node_2"], inplace=True)
    grouped_df = grouped_df[grouped_df["count"] != 1]
    grouped_df["edge"] = "contextual proximity"
    return grouped_df

# Calculate contextual proximity
contextual_proximity_df = calculate_contextual_proximity(triplets_df)
contextual_proximity_df.head()

Unnamed: 0,node_1,node_2,chunk_id,count,edge
0,.top,August 2024,136136,2,contextual proximity
1,.top,Chinese,136136136136136136139139139139139139,12,contextual proximity
2,.top,ICANN,137137142142143143143,7,contextual proximity
3,.top,Internet Corporation for Assigned Names and Nu...,137137,2,contextual proximity
4,.top,Jiangsu,142142142142,4,contextual proximity


In [None]:
print(triplets_df.columns)
print(triplets_df.head())

Index(['chunk_id', 'subject', 'predicate', 'object'], dtype='object')
   chunk_id               subject                         predicate  \
0         1          this website  digital rights management system   
1         3             RSS Feeds                           has use   
2         3              Phishing                           country   
3         3  Government Computers                           country   
4         3  Government Computers                           country   

               object  
0  CyberRisk Alliance  
1        Social Media  
2             Ukraine  
3             Ukraine  
4             Ukraine  


**Merge DataFrames**
- This function merges the concepts DataFrame with the contextual proximity DataFrame and aggregates the data.
- Merging the DataFrames allows for a comprehensive view of the relationships between entities.
- This function is used to create a merged DataFrame that will be used to create the graph.

In [None]:
def merge_dataframes(concepts_df, contextual_proximity_df):
    """Merge the concepts DataFrame with the contextual proximity DataFrame."""
    merged_df = pd.concat([concepts_df, contextual_proximity_df], axis=0, ignore_index=True, sort=False)
    return merged_df
# Merge DataFrames
merged_df = merge_dataframes(triplets_df, contextual_proximity_df)
merged_df.head()

Unnamed: 0,chunk_id,subject,predicate,object,node_1,node_2,count,edge
0,1,this website,digital rights management system,CyberRisk Alliance,,,,
1,3,RSS Feeds,has use,Social Media,,,,
2,3,Phishing,country,Ukraine,,,,
3,3,Government Computers,country,Ukraine,,,,
4,3,Government Computers,country,Ukraine,,,,


**Create NetworkX Graph**
- This function creates a NetworkX graph from the merged DataFrame, with nodes and edges representing the relationships between entities.
- Creating a graph allows for visualizing and analyzing the relationships between entities.
- This function is used to create a graph that will be used for further analysis and visualization.

In [None]:
import networkx as nx
from pyvis.network import Network
import torch
from pykeen.triples import TriplesFactory
from pykeen.models import RotatE
from pykeen.training import SLCWATrainingLoop
from pykeen.evaluation import RankBasedEvaluator

def create_network_graph(merged_df):
    """Create a network graph from the merged DataFrame."""
    G = nx.Graph()

    # Check if 'subject' and 'object' exist in the DataFrame
    if 'subject' in merged_df.columns and 'object' in merged_df.columns:
        for _, row in merged_df.iterrows():
            G.add_edge(row['subject'], row['object'], weight=1)
    # Check if 'node_1' and 'node_2' exist in the DataFrame
    elif 'node_1' in merged_df.columns and 'node_2' in merged_df.columns:
        for _, row in merged_df.iterrows():
            G.add_edge(row['node_1'], row['node_2'], weight=row['count'])

    return G

# Create the network graph
G = create_network_graph(merged_df)

[2024-08-13 09:16:00,784] [INFO] [pykeen.utils.<module>:1616] [PID:3035] Using opt_einsum


**Knowledge Graph Embedding Creation**

In [None]:
from pykeen.triples import TriplesFactory
from pykeen.models import RotatE
from pykeen.training import SLCWATrainingLoop

def create_knowledge_graph_embeddings(triplets_df):
    """Create knowledge graph embeddings using PyKEEN's RotatE model."""
    triples_factory = TriplesFactory.from_labeled_triples(
        triplets_df[['subject', 'predicate', 'object']].values
    )

    model = RotatE(triples_factory=triples_factory)
    training_loop = SLCWATrainingLoop(model=model, triples_factory=triples_factory)
    training_loop.train(triples_factory)

    entity_embeddings = model.entity_representations[0](indices=None).cpu().detach().numpy()
    return entity_embeddings

entity_embeddings = create_knowledge_graph_embeddings(triplets_df)
print("\nEntity embeddings shape:", entity_embeddings.shape)

[2024-08-13 09:16:02,640] [INFO] [pykeen.training.training_loop._train:598] [PID:3035] Currently automatic memory optimization only supports GPUs, but you're using a CPU. Therefore, the batch_size will be set to the default value '{batch_size}'


Training epochs on cpu:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cpu:   0%|          | 0/4 [00:00<?, ?batch/s]


Entity embeddings shape: (620, 200)


**Enhance Network Graph with Embeddings**

In [None]:
def enhance_with_embeddings(network_graph, entity_embeddings, triplets_df):
    """Enhance the network graph with PyKEEN embeddings."""
    for node in network_graph.nodes:
        node_index = triplets_df[triplets_df['subject'] == node].index.to_list()
        if not node_index:
            node_index = triplets_df[triplets_df['object'] == node].index.to_list()
        if node_index:
            # Ensure the index is within bounds
            if node_index[0] < entity_embeddings.shape[0]:
                node_embedding = entity_embeddings[node_index[0]]
                network_graph.nodes[node]['embedding'] = node_embedding.tolist()
            else:
                print(f"Warning: Node index {node_index[0]} is out of bounds for embeddings array.")

    return network_graph

# Enhance the network graph with embeddings
enhanced_graph = enhance_with_embeddings(G, entity_embeddings, triplets_df)



**Metrics**
- Evaluation of the model


In [None]:
# Print top 5 nodes for each centrality metric
print("\nTop 5 nodes by Degree Centrality:")
print(sorted(nx.degree_centrality(G).items(), key=lambda x: x[1], reverse=True)[:5])

print("\nTop 5 nodes by Betweenness Centrality:")
print(sorted(nx.betweenness_centrality(G).items(), key=lambda x: x[1], reverse=True)[:5])

print("\nTop 5 nodes by PageRank:")
print(sorted(nx.pagerank(G).items(), key=lambda x: x[1], reverse=True)[:5])


Top 5 nodes by Degree Centrality:
[('Windows', 0.07580645161290323), ('U.S.', 0.0532258064516129), ('Microsoft', 0.04516129032258064), ('Google', 0.041935483870967745), ('Android', 0.03225806451612903)]

Top 5 nodes by Betweenness Centrality:
[('U.S.', 0.24278923894167984), ('Windows', 0.15452874323625648), ('Microsoft', 0.11294314042160963), ('US', 0.10452456718248068), ('Google', 0.09888360712327454)]

Top 5 nodes by PageRank:
[('Windows', 0.024787637741347388), ('U.S.', 0.019215316683784383), ('Google', 0.01472173665226386), ('Microsoft', 0.013855939730300547), ('Android', 0.012287823106718518)]


**Initializing LLM**
- Initialize the Mistral LLM using Groq for RAG Pipeline.

In [None]:
from langchain_groq import ChatGroq
GROQ_API_KEY = "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"

# Initialize the Mistral LLM using Groq
llm = ChatGroq(
    temperature=0,
    model="mixtral-8x7b-32768",
    api_key=GROQ_API_KEY
)

**Set up Kuzu DB and Create the Schema**
- We'll set up the Kuzu DB and create the schema:


In [None]:
import kuzu

def setup_kuzu_db(db_name):
    """Set up Kuzu DB and create the schema."""
    db = kuzu.Database(db_name)
    conn = kuzu.Connection(db)

    # Create node tables
    conn.execute("CREATE NODE TABLE Incident (name STRING, date STRING, type STRING, description STRING, PRIMARY KEY(name))")
    conn.execute("CREATE NODE TABLE Actor (name STRING, type STRING, description STRING, PRIMARY KEY(name))")
    conn.execute("CREATE NODE TABLE Target (name STRING, type STRING, description STRING, PRIMARY KEY(name))")
    conn.execute("CREATE NODE TABLE Tool (name STRING, type STRING, description STRING, PRIMARY KEY(name))")

    # Create relationship tables
    conn.execute("CREATE REL TABLE InvolvedIn (FROM Actor TO Incident)")
    conn.execute("CREATE REL TABLE Targeted (FROM Incident TO Target)")
    conn.execute("CREATE REL TABLE UsedTool (FROM Actor TO Tool)")
    conn.execute("CREATE REL TABLE AssociatedWith (FROM Incident TO Tool)")

    return db, conn

# Set up Kuzu DB
db, conn = setup_kuzu_db("cyber_db")

**Insert Data into Kuzu DB**
- We'll insert data into the Kuzu DB

In [None]:
def insert_data_into_kuzu(conn, triplets_df):
    """Insert data into Kuzu DB."""
    for _, row in triplets_df.iterrows():
        subject, predicate, obj = row['subject'], row['predicate'], row['object']
        if predicate == 'InvolvedIn':
            conn.execute(f"CREATE (:Actor {{name: '{subject}'}})-[:InvolvedIn]->(:Incident {{name: '{obj}'}})")
        elif predicate == 'Targeted':
            conn.execute(f"CREATE (:Incident {{name: '{subject}'}})-[:Targeted]->(:Target {{name: '{obj}'}})")
        elif predicate == 'UsedTool':
            conn.execute(f"CREATE (:Actor {{name: '{subject}'}})-[:UsedTool]->(:Tool {{name: '{obj}'}})")
        elif predicate == 'AssociatedWith':
            conn.execute(f"CREATE (:Incident {{name: '{subject}'}})-[:AssociatedWith]->(:Tool {{name: '{obj}'}})")

# Insert data into Kuzu DB
insert_data_into_kuzu(conn, triplets_df)

**Create KuzuQAChain and Query the Graph**
- We'll create a KuzuQAChain for querying the graph


In [25]:
from langchain.chains import KuzuQAChain
from langchain_community.graphs import KuzuGraph

def create_kuzu_qa_chain(db, api_key):
    """Create KuzuQAChain for querying the graph."""
    graph = KuzuGraph(db)
    chain = KuzuQAChain.from_llm(llm, graph=graph, verbose=True)
    return chain

# Create KuzuQAChain
chain = create_kuzu_qa_chain(db, GROQ_API_KEY)

# Process queries
queries = [
    "Which actors were involved in incidents related to Lockbit Ransomware?",
    "List all targets affected by ransomware attacks in the last month.",
    "Who is the oldest actor involved in any incident?",
    "List all details on BFSI security incidents in India.",
    "List all ransomware attacks targeting the healthcare industry in the last 7 days.",
    "Provide recent incidents related to Lockbit Ransomware gang.",
    "Provide recent incidents related to BlackBasta Ransomware."
]

# Query the graph
for query in queries:
    print(f"Query: {query}")
    try:
        result = chain.invoke(query)
        print(result)
    except RuntimeError as e:
        print(f"Error: {e}")
    print("\n")

Query: Which actors were involved in incidents related to Lockbit Ransomware?


> Entering new KuzuQAChain chain...
[2024-08-13 09:34:16,845] [INFO] [httpx._send_single_request:1026] [PID:3035] HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
Generated Cypher:
MATCH (a:Actor)-[:InvolvedIn]->(i:Incident)-[:Targeted]->(t:Target)
WHERE t.name = 'Lockbit Ransomware'
RETURN a.name
Full Context:
[]
[2024-08-13 09:34:17,181] [INFO] [httpx._send_single_request:1026] [PID:3035] HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"

> Finished chain.
{'query': 'Which actors were involved in incidents related to Lockbit Ransomware?', 'result': "I don't have information on any actors involved in incidents related to Lockbit Ransomware."}


Query: List all targets affected by ransomware attacks in the last month.


> Entering new KuzuQAChain chain...
[2024-08-13 09:34:17,629] [INFO] [httpx._send_single_request:1026] [PID:3035] HTTP 