<a href="https://colab.research.google.com/github/KaifAhmad1/code-test/blob/main/KG_Enhanced_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Problem Statement

#### Task
Develop a co-pilot for threat researchers, security analysts, and professionals that addresses the limitations of current AI solutions like ChatGPT and Perplexity.

#### Current Challenges
1. **Generic Data**: Existing AI solutions provide generic information that lacks specificity.
2. **Context Understanding**: These solutions fail to understand and maintain context.
3. **Limited Information**: The data sources are often limited and not comprehensive.
4. **Single Source Dependency**: Relying on a single source of information reduces reliability and accuracy.
5. **Inadequate AI Models**: Current models do not meet the specialized needs of cybersecurity professionals.

#### Requirement
Create a chatbot capable of collecting and curating data from multiple sources, starting with search engines, and expanding to website crawling and Twitter scraping.

#### Features Required

##### User Interface (UI)
- Chat UI with file upload capabilities.
- Options to save and select prompts.
- Configuration settings for connectors with enable/disable toggles.
- Interface for configuring knowledge and variables (similar to Dify.ai).

##### Technical Specifications
- **No Hallucinations**: Ensure the chatbot provides accurate and reliable information.
- **RAG (Retrieval-Augmented Generation)**: Use RAG to determine which connectors to use based on user inputs.
- **Query Chunking and Distribution**: Optimize the process of breaking down queries and distributing them across different sources.
- **Data Curation Steps**:
  1. Collect links from approximately 50 sources.
  2. Aggregate data from websites and Twitter.
  3. Curate data using a knowledge graph to find relationships and generate responses.
- **Chatbot Capabilities**: Answer queries such as:
  - "List all details on {{BFSI}} security incidents in {{India}}."
  - "List all ransomware attacks targeting the healthcare industry in {{last 7 days/last 3 months/last week/last month}}."
  - "Provide recent incidents related to Lockbit Ransomware gang / BlackBasta Ransomware."

#### Source Tools

##### Website Crawling and Scraping
- [Firecrawl](https://www.firecrawl.dev/playground)
- [Crawl4AI](https://github.com/unclecode/crawl4ai)
- [Apify](https://apify.com/apify/website-content-crawler)
- [Exa](https://exa.ai/search)

##### Twitter Sources
- [Apify Tweet Scraper](https://apify.com/apidojo/tweet-scraper)
- [Twitter API](https://developer.x.com/en/docs/twitter-api)

##### Development Tools
- [Flowise AI](https://flowiseai.com/)
- [Langgenius Dify](https://github.com/langgenius/dify)

#### Goal
Develop a data collector that integrates multiple specific sources to enrich the knowledge base, enabling the model to better understand context and deliver accurate results. The solution should be modular, allowing customization and configuration of sources.

#### Summary
The goal is to build an advanced, modular chatbot for cybersecurity professionals that overcomes the limitations of existing AI solutions by integrating multiple data sources and ensuring context-aware, accurate responses. The chatbot will utilize state-of-the-art techniques like RAG and knowledge graphs to provide comprehensive, curated information from diverse sources.


In [None]:
# Install necessary libraries
%pip install -qU langchain langchain-community faiss-cpu kuzu pyvis
%pip install -qU sentence-transformers torch plotly pandas scikit-learn networkx
%pip install -qU torch torchvision
%pip install -qU langchain-groq apify_client snscrape

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/74.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m74.8/74.8 kB[0m [31m6.1 MB/s[0m eta [36m0:00:00[0m
[?25h

In [None]:
# Import necessary libraries
import kuzu
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_groq import ChatGroq
from langchain.prompts import PromptTemplate, ChatPromptTemplate
from langchain.chains import LLMChain
import networkx as nx
import plotly.graph_objects as go
import plotly.express as px
from sklearn.manifold import TSNE
import pandas as pd
import requests
from bs4 import BeautifulSoup
from apify_client import ApifyClient



In [None]:
# Initialize HuggingFace embeddings
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Initialize Llama-3.1 from Meta using Groq LPU Inference
llm = ChatGroq(
    temperature=0,
    model="llama-3.1-70b-versatile",
    api_key="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"
)

system = "You are a helpful assistant."
human = "{text}"
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", human)])

chain = prompt | llm

  warn_deprecated(
  from tqdm.autonotebook import tqdm, trange
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [None]:
# Initialize Apify client
apify_client = ApifyClient("apify_api_t9YCnrjquQgW4BCNM8yYZrX6Q2a1uF1ImYkB")

In [None]:
# Cybersecurity-specific websites
websites = [
    "https://www.cisa.gov/uscert/ncas/alerts",
    "https://www.virustotal.com/gui/home/upload",
    "https://attack.mitre.org/",
    "https://www.darkreading.com/",
    "https://threatpost.com/",
]

In [None]:
# Function to scrape websites using Apify
def scrape_websites(urls):
    run_input = {
        "startUrls": [{"url": url} for url in urls],
        "maxCrawlPages": 10,
        "maxCrawlDepth": 1,
    }

    run = apify_client.actor("apify/website-content-crawler").call(run_input=run_input)

    dataset_items = apify_client.dataset(run["defaultDatasetId"]).list_items().items

    return [item.get('text', '') for item in dataset_items if 'text' in item]

documents = scrape_websites(websites)
print(documents[:5])

['Cybersecurity Alerts & Advisories | CISALocksearchsearchNational Terrorism Advisory System Widget\nCybersecurity Advisory: In-depth reports covering a specific cybersecurity issue, often including threat actor tactics, techniques, and procedures; indicators of compromise; and mitigations.\nAlert: Concise summaries covering cybersecurity topics, such as mitigations that vendors have published for vulnerabilities in their products.\nICS Advisory: Concise summaries covering industrial control system (ICS) cybersecurity topics, primarily focused on mitigations that ICS vendors have published for vulnerabilities in their products.\nICS Medical Advisory: Concise summaries covering ICS medical cybersecurity topics, primarily focused on mitigations that ICS medical vendors have published for vulnerabilities in their products.\nAnalysis Report: In-depth analysis of a new or evolving cyber threat, including technical details and remediations.', 'ATT&CKcon 5.0 returns October 22-23, 2024 in McL

In [None]:
import json
from apify_client import ApifyClient

def fetch_scraped_tweets(api_token, actor_id, actor_input):
    """Fetch tweets scraped by Apify using the actor."""
    client = ApifyClient(api_token)

    # Start the actor and wait for it to finish
    actor_call = client.actor(actor_id).call(run_input=actor_input)

    # Fetch results from the actor run's default dataset
    dataset_items = client.dataset(actor_call['V8T72VcrBoADgdn4I']).list_items().items

    return dataset_items

def display_tweets(tweets):
    """Display tweets in a readable format."""
    if tweets:
        for tweet in tweets:
            print(json.dumps(tweet, indent=4))
    else:
        print("No tweets available.")

if __name__ == "__main__":
    # Replace with your Apify API token and actor ID
    APIFY_API_TOKEN = 'apify_api_t9YCnrjquQgW4BCNM8yYZrX6Q2a1uF1ImYkB'
    ACTOR_ID = 'apidojo/tweet-scraper'
    ACTOR_INPUT = {
        "queries": ["#cybersecurity"],
        "maxTweets": 100
    }
    tweets = fetch_scraped_tweets(APIFY_API_TOKEN, ACTOR_ID, ACTOR_INPUT)
    display_tweets(tweets)

ERROR:snscrape.base:Error retrieving https://twitter.com/i/api/graphql/7jT5GT59P8IFjgxwqnEdQw/SearchTimeline?variables=%7B%22rawQuery%22%3A%22cybersecurity%22%2C%22count%22%3A20%2C%22product%22%3A%22Latest%22%2C%22withDownvotePerspective%22%3Afalse%2C%22withReactionsMetadata%22%3Afalse%2C%22withReactionsPerspective%22%3Afalse%7D&features=%7B%22rweb_lists_timeline_redesign_enabled%22%3Afalse%2C%22blue_business_profile_image_shape_enabled%22%3Afalse%2C%22responsive_web_graphql_exclude_directive_enabled%22%3Atrue%2C%22verified_phone_label_enabled%22%3Afalse%2C%22creator_subscriptions_tweet_preview_api_enabled%22%3Afalse%2C%22responsive_web_graphql_timeline_navigation_enabled%22%3Atrue%2C%22responsive_web_graphql_skip_user_profile_image_extensions_enabled%22%3Afalse%2C%22tweetypie_unmention_optimization_enabled%22%3Atrue%2C%22vibe_api_enabled%22%3Atrue%2C%22responsive_web_edit_tweet_api_enabled%22%3Atrue%2C%22graphql_is_translatable_rweb_tweet_is_translatable_enabled%22%3Atrue%2C%22view_co

ScraperException: 4 requests to https://twitter.com/i/api/graphql/7jT5GT59P8IFjgxwqnEdQw/SearchTimeline?variables=%7B%22rawQuery%22%3A%22cybersecurity%22%2C%22count%22%3A20%2C%22product%22%3A%22Latest%22%2C%22withDownvotePerspective%22%3Afalse%2C%22withReactionsMetadata%22%3Afalse%2C%22withReactionsPerspective%22%3Afalse%7D&features=%7B%22rweb_lists_timeline_redesign_enabled%22%3Afalse%2C%22blue_business_profile_image_shape_enabled%22%3Afalse%2C%22responsive_web_graphql_exclude_directive_enabled%22%3Atrue%2C%22verified_phone_label_enabled%22%3Afalse%2C%22creator_subscriptions_tweet_preview_api_enabled%22%3Afalse%2C%22responsive_web_graphql_timeline_navigation_enabled%22%3Atrue%2C%22responsive_web_graphql_skip_user_profile_image_extensions_enabled%22%3Afalse%2C%22tweetypie_unmention_optimization_enabled%22%3Atrue%2C%22vibe_api_enabled%22%3Atrue%2C%22responsive_web_edit_tweet_api_enabled%22%3Atrue%2C%22graphql_is_translatable_rweb_tweet_is_translatable_enabled%22%3Atrue%2C%22view_counts_everywhere_api_enabled%22%3Atrue%2C%22longform_notetweets_consumption_enabled%22%3Atrue%2C%22tweet_awards_web_tipping_enabled%22%3Afalse%2C%22freedom_of_speech_not_reach_fetch_enabled%22%3Afalse%2C%22standardized_nudges_misinfo%22%3Atrue%2C%22tweet_with_visibility_results_prefer_gql_limited_actions_policy_enabled%22%3Afalse%2C%22interactive_text_enabled%22%3Atrue%2C%22responsive_web_text_conversations_enabled%22%3Afalse%2C%22longform_notetweets_rich_text_read_enabled%22%3Afalse%2C%22longform_notetweets_inline_media_enabled%22%3Afalse%2C%22responsive_web_enhance_cards_enabled%22%3Afalse%2C%22responsive_web_twitter_blue_verified_badge_is_enabled%22%3Atrue%7D failed, giving up.

In [None]:
# Combine all texts
all_texts = documents + tweets

In [None]:
# Split texts into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_text("\n\n".join(all_texts))



In [None]:
# Create a vector store using FAISS and embeddings
vectorstore = FAISS.from_texts(texts, embeddings)

In [None]:
# Initialize Kuzu DB
db = kuzu.Database("cybersecurity_knowledge_graph")
conn = kuzu.Connection(db)

# Create schema for the graph
conn.execute("CREATE NODE TABLE Entity (name STRING, type STRING, PRIMARY KEY (name))")
conn.execute("CREATE REL TABLE Relation (FROM Entity TO Entity, predicate STRING)")

<kuzu.query_result.QueryResult at 0x7d486599ae60>

In [None]:
# Knowledge extraction and graph population
kg_triple_extract_template = """
Extract up to 5 cybersecurity-related knowledge triplets from the text below in the form (subject, predicate, object).
Focus on threats, vulnerabilities, attack techniques, and security measures.
Text: {text}
Triplets:
"""
kg_triple_extract_prompt = PromptTemplate(
    input_variables=["text"],
    template=kg_triple_extract_template,
)

In [None]:
kg_triple_extract_chain = LLMChain(llm=llm, prompt=kg_triple_extract_prompt)

# Extract triplets and populate the knowledge graph
for text in texts:
    triplets = kg_triple_extract_chain.invoke({"text": text})
    for triplet in triplets['text'].split('\n'):
        if triplet.strip():
            try:
                subject, predicate, obj = eval(triplet.strip())
                conn.execute("INSERT INTO Entity (name, type) VALUES (?, ?) ON CONFLICT DO NOTHING", [subject, "Cybersecurity_Entity"])
                conn.execute("INSERT INTO Entity (name, type) VALUES (?, ?) ON CONFLICT DO NOTHING", [obj, "Cybersecurity_Entity"])
                conn.execute("INSERT INTO Relation VALUES (?, ?, ?)", [subject, obj, predicate])
            except Exception as e:
                print(f"Failed to process triplet: {triplet}. Error: {e}")

  warn_deprecated(


Failed to process triplet: Here are 5 cybersecurity-related knowledge triplets extracted from the text:. Error: invalid syntax (<string>, line 1)
Failed to process triplet: 1. (Threat Actor, Uses, Tactics, Techniques, and Procedures). Error: invalid syntax. Perhaps you forgot a comma? (<string>, line 1)
Failed to process triplet: 2. (Vulnerability, Has, Indicators of Compromise). Error: invalid syntax. Perhaps you forgot a comma? (<string>, line 1)
Failed to process triplet: 3. (Vendor, Publishes, Mitigations). Error: name 'Vendor' is not defined
Failed to process triplet: 4. (Industrial Control System, Is Vulnerable To, Cyber Threats). Error: invalid syntax. Perhaps you forgot a comma? (<string>, line 1)
Failed to process triplet: 5. (ICS Medical Vendor, Publishes, Mitigations for Vulnerabilities). Error: invalid syntax. Perhaps you forgot a comma? (<string>, line 1)
Failed to process triplet: Note: These triplets are in the form (subject, predicate, object) as requested. Let me know 



Failed to process triplet: Here are 5 cybersecurity-related knowledge triplets extracted from the text:. Error: invalid syntax (<string>, line 1)
Failed to process triplet: 1. (Phishing, uses, Malicious Link). Error: invalid syntax. Perhaps you forgot a comma? (<string>, line 1)
Failed to process triplet: 2. (Exploitation of Remote Services, uses, Vulnerabilities). Error: invalid syntax. Perhaps you forgot a comma? (<string>, line 1)
Failed to process triplet: 3. (Data Obfuscation, uses, Steganography). Error: invalid syntax. Perhaps you forgot a comma? (<string>, line 1)
Failed to process triplet: 4. (Lateral Movement, uses, Remote Desktop Protocol). Error: invalid syntax. Perhaps you forgot a comma? (<string>, line 1)
Failed to process triplet: 5. (Data Exfiltration, uses, Web Service). Error: invalid syntax. Perhaps you forgot a comma? (<string>, line 1)
Failed to process triplet: Note: These triplets are in the format (subject, predicate, object) and represent relationships between

In [None]:
# Function to retrieve graph data
def get_graph_data():
    nodes_result = conn.execute("MATCH (e:Entity) RETURN e.name")
    edges_result = conn.execute("MATCH (e1:Entity)-[r:Relation]->(e2:Entity) RETURN e1.name, r.predicate, e2.name")

    nodes = [row.getString(0) for row in nodes_result]
    edges = [(row.getString(0), row.getString(1), row.getString(2)) for row in edges_result]

    return nodes, edges

In [None]:
# Enhanced graph visualization using Plotly
def visualize_graph_plotly():
    nodes, edges = get_graph_data()
    G = nx.Graph()

    for node in nodes:
        G.add_node(node)

    for edge in edges:
        G.add_edge(edge[0], edge[2], label=edge[1])

    pos = nx.spring_layout(G)

    edge_x = []
    edge_y = []
    for edge in G.edges():
        x0, y0 = pos[edge[0]]
        x1, y1 = pos[edge[1]]
        edge_x.extend([x0, x1, None])
        edge_y.extend([y0, y1, None])

    edge_trace = go.Scatter(
        x=edge_x, y=edge_y,
        line=dict(width=0.5, color='#888'),
        hoverinfo='none',
        mode='lines')

    node_x = [pos[node][0] for node in G.nodes()]
    node_y = [pos[node][1] for node in G.nodes()]

    node_trace = go.Scatter(
        x=node_x, y=node_y,
        mode='markers',
        hoverinfo='text',
        marker=dict(
            showscale=True,
            colorscale='YlGnBu',
            reversescale=True,
            color=[],
            size=10,
            colorbar=dict(
                thickness=15,
                title='Node Connections',
                xanchor='left',
                titleside='right'
            ),
            line_width=2))

    node_adjacencies = []
    node_text = []
    for node, adjacencies in G.adjacency():
        node_adjacencies.append(len(adjacencies))
        node_text.append(f'{node}# of connections: {len(adjacencies)}')

    node_trace.marker.color = node_adjacencies
    node_trace.text = node_text

    fig = go.Figure(data=[edge_trace, node_trace],
                    layout=go.Layout(
                        title='Knowledge Graph',
                        titlefont_size=16,
                        showlegend=False,
                        hovermode='closest',
                        margin=dict(b=20,l=5,r=5,t=40),
                        annotations=[ dict(
                            text="",
                            showarrow=False,
                            xref="paper", yref="paper",
                            x=0.005, y=-0.002 ) ],
                        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False))
                    )

    fig.show()

In [None]:
# Embedding visualization
def visualize_embeddings():
    # Get embeddings
    doc_embeddings = [embeddings.embed_query(text) for text in texts]

    # Reduce dimensionality for visualization
    tsne = TSNE(n_components=2, random_state=0)
    tsne_results = tsne.fit_transform(doc_embeddings)

    # Create a DataFrame for Plotly
    df = pd.DataFrame(tsne_results, columns=['x', 'y'])

    fig = px.scatter(df, x='x', y='y', title='Document Embeddings Visualization')
    fig.show()

In [None]:
# Example queries
questions = [
    "What are the latest threats targeting the healthcare industry?",
    "Can you provide details on recent ransomware attacks?",
    "What are the most critical vulnerabilities discovered in the last month?",
    "How can organizations protect against phishing attacks?",
    "What are the emerging trends in cybersecurity for financial institutions?"
]

In [None]:
# Query the knowledge graph and visualize
def query_graph(query):
    return chain.invoke({"text": query})

# Execute example queries and print results
for query in questions:
    answer = query_graph(query)
    print(f"Query: {query}\nAnswer: {answer}\n")

Query: What are the latest threats targeting the healthcare industry?
Answer: content='The healthcare industry is a prime target for cyber attacks, and the threats are constantly evolving. Here are some of the latest threats targeting the healthcare industry:\n\n1. **Ransomware Attacks**: Ransomware attacks are still a major threat to the healthcare industry. These attacks involve encrypting sensitive data and demanding payment in exchange for the decryption key. Recent attacks have shown that ransomware can have devastating effects on healthcare organizations, including data breaches, system downtime, and even patient harm.\n2. **Data Breaches**: Data breaches are a persistent threat to the healthcare industry, with hackers targeting sensitive patient data, including protected health information (PHI). These breaches can occur through various means, including phishing attacks, malware infections, and insider threats.\n3. **Medical Device Hacking**: Medical devices, such as pacemakers,

In [None]:
# Visualize the graph and embeddings
visualize_graph_plotly()
visualize_embeddings()

TypeError: 'QueryResult' object is not iterable