<a href="https://colab.research.google.com/github/KaifAhmad1/code-test/blob/main/KG_Enhanced_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Problem Statement

#### Task
Develop a co-pilot for threat researchers, security analysts, and other security professionals that addresses the limitations of general-purpose AI assistants such as ChatGPT and Perplexity.

#### Current Challenges
1. **Generic Data**: Existing AI solutions provide generic information that lacks specificity.
2. **Context Understanding**: These solutions fail to understand and maintain context.
3. **Limited Information**: The data sources are often limited and not comprehensive.
4. **Single Source Dependency**: Relying on a single source of information reduces reliability and accuracy.
5. **Inadequate AI Models**: Current models do not meet the specialized needs of cybersecurity professionals.

#### Requirement
Create a chatbot capable of collecting and curating data from multiple sources, starting with search engines, and expanding to website crawling and Twitter scraping.

#### Features Required

##### User Interface (UI)
- Chat UI with file upload capabilities.
- Options to save and select prompts.
- Configuration settings for connectors with enable/disable toggles.
- Interface for configuring knowledge and variables (similar to Dify.ai).

##### Technical Specifications
- **No Hallucinations**: Ground every answer in retrieved sources so the chatbot returns accurate, verifiable information.
- **RAG (Retrieval-Augmented Generation)**: Use RAG to determine which connectors to use based on user inputs.
- **Query Chunking and Distribution**: Optimize the process of breaking down queries and distributing them across different sources.
- **Data Curation Steps**:
  1. Collect links from approximately 50 sources.
  2. Aggregate data from websites and Twitter.
  3. Curate data using a knowledge graph to find relationships and generate responses.
- **Chatbot Capabilities**: Answer queries such as:
  - "List all details on {{BFSI}} security incidents in {{India}}."
  - "List all ransomware attacks targeting the healthcare industry in {{last 7 days/last 3 months/last week/last month}}."
  - "Provide recent incidents related to Lockbit Ransomware gang / BlackBasta Ransomware."
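The example queries above carry relative time windows such as "last 7 days" or "last 3 months". A minimal sketch of how such a phrase could be turned into a concrete date range before it is sent to the connectors is shown below. The function name `parse_time_window` and the approximation of a month as 30 days are assumptions made for illustration; they are not part of the original specification.

```python
import re
from datetime import datetime, timedelta

# Assumed approximation: a month is treated as 30 days.
_UNIT_DAYS = {"day": 1, "week": 7, "month": 30}


def parse_time_window(phrase, now=None):
    """Return (start, end) datetimes for phrases like 'last 7 days' or 'last month'."""
    now = now or datetime.now()
    match = re.fullmatch(
        r"last\s+(?:(\d+)\s+)?(day|week|month)s?", phrase.strip().lower()
    )
    if match is None:
        raise ValueError(f"Unrecognized time window: {phrase!r}")
    count = int(match.group(1) or 1)  # "last week" means one week
    days = count * _UNIT_DAYS[match.group(2)]
    return now - timedelta(days=days), now
```

The returned range can then be used to filter collected articles and tweets by publication date.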

#### Source Tools

##### Website Crawling and Scraping
- [Firecrawl](https://www.firecrawl.dev/playground)
- [Crawl4AI](https://github.com/unclecode/crawl4ai)
- [Apify](https://apify.com/apify/website-content-crawler)
- [Exa](https://exa.ai/search)

##### Twitter Sources
- [Apify Tweet Scraper](https://apify.com/apidojo/tweet-scraper)
- [Twitter API](https://developer.x.com/en/docs/twitter-api)

##### Development Tools
- [Flowise AI](https://flowiseai.com/)
- [Langgenius Dify](https://github.com/langgenius/dify)

#### Goal
Develop a data collector that integrates multiple specific sources to enrich the knowledge base, enabling the model to better understand context and deliver accurate results. The solution should be modular, allowing customization and configuration of sources.
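One way to read the modularity requirement is a small registry in which each source is a connector that can be switched on or off. The sketch below is illustrative only: `Connector`, `ConnectorRegistry`, and the stub fetch functions are hypothetical names, and real connectors would wrap the crawling and Twitter tools listed above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Connector:
    name: str
    fetch: Callable[[str], List[str]]  # takes a query, returns text snippets
    enabled: bool = True


class ConnectorRegistry:
    """Holds the configured sources and queries only the enabled ones."""

    def __init__(self) -> None:
        self._connectors: Dict[str, Connector] = {}

    def register(self, name: str, fetch: Callable[[str], List[str]], enabled: bool = True) -> None:
        self._connectors[name] = Connector(name, fetch, enabled)

    def set_enabled(self, name: str, enabled: bool) -> None:
        self._connectors[name].enabled = enabled

    def collect(self, query: str) -> List[str]:
        results: List[str] = []
        for connector in self._connectors.values():
            if connector.enabled:
                results.extend(connector.fetch(query))
        return results


# Stub connectors standing in for real crawlers and scrapers.
registry = ConnectorRegistry()
registry.register("web", lambda q: [f"web result for {q}"])
registry.register("twitter", lambda q: [f"tweet about {q}"])
registry.set_enabled("twitter", False)
```

With this shape, the enable/disable toggles in the UI map directly onto `set_enabled`, and adding a source does not require changes to the collection loop.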

#### Summary
The goal is to build an advanced, modular chatbot for cybersecurity professionals that overcomes the limitations of existing AI solutions by integrating multiple data sources and ensuring context-aware, accurate responses. The chatbot will utilize state-of-the-art techniques like RAG and knowledge graphs to provide comprehensive, curated information from diverse sources.


In [1]:
# Install necessary libraries
%pip install -qU langchain langchain-community faiss-cpu kuzu pyvis
%pip install -qU sentence-transformers torch plotly pandas scikit-learn networkx
%pip install -qU torch torchvision
%pip install -qU langchain-groq apify-client


In [2]:
# Import necessary libraries
import kuzu
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_groq import ChatGroq
from langchain.prompts import PromptTemplate, ChatPromptTemplate
from langchain.chains import LLMChain
import networkx as nx
import plotly.graph_objects as go
import plotly.express as px
from sklearn.manifold import TSNE
import pandas as pd
import requests
from bs4 import BeautifulSoup
from apify_client import ApifyClient



In [3]:
# Initialize HuggingFace embeddings
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Initialize Llama 3.1 (Meta) served through Groq's LPU inference API.
# Read the key from the environment instead of hard-coding it in the notebook.
import os

llm = ChatGroq(
    temperature=0,
    model="llama-3.1-70b-versatile",
    api_key=os.environ["GROQ_API_KEY"],
)

system = "You are a helpful assistant."
human = "{text}"
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", human)])

chain = prompt | llm


In [None]:
chain.invoke({"text": "Explain the importance of low latency for LLMs."})

AIMessage(content='Low latency is crucial for Large Language Models (LLMs) as it directly impacts the user experience, model performance, and overall efficiency. Here are some reasons why low latency is important for LLMs:\n\n1. **Real-time interactions**: LLMs are often used in applications that require real-time interactions, such as chatbots, virtual assistants, and language translation. Low latency ensures that the model responds quickly to user input, enabling a seamless and natural conversation flow.\n2. **User experience**: High latency can lead to frustration and a poor user experience. When users have to wait too long for a response, they may abandon the application or lose interest. Low latency helps maintain user engagement and satisfaction.\n3. **Conversational flow**: In conversational AI, latency can disrupt the natural flow of conversation. If the model takes too long to respond, the user may forget the context or lose their train of thought. Low latency helps maintain t

In [4]:
# Initialize Kuzu DB (the later cells need `conn`, so this cell must actually run)
db = kuzu.Database("cybersecurity_knowledge_graph")
conn = kuzu.Connection(db)

# Create schema for the graph
conn.execute("CREATE NODE TABLE Entity (name STRING, type STRING, PRIMARY KEY (name))")
conn.execute("CREATE REL TABLE Relation (FROM Entity TO Entity, predicate STRING)")

In [5]:
# Cybersecurity-specific websites
websites = [
    "https://www.cisa.gov/uscert/ncas/alerts",
    "https://www.virustotal.com/gui/home/upload",
    "https://attack.mitre.org/",
    "https://www.darkreading.com/",
    "https://threatpost.com/",
]

In [6]:
# Apify client setup
# Read the token from the environment instead of hard-coding it in the notebook.
import logging
import os

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

APIFY_API_TOKEN = os.environ["APIFY_API_TOKEN"]
apify_client = ApifyClient(APIFY_API_TOKEN)

# Function to scrape tweets using Apify
def scrape_tweets(query, max_results=100):
    # Input keys depend on the actor's input schema; these are the ones this notebook uses.
    run_input = {
        'searchQuery': query,
        'maxResults': max_results
    }

    try:
        # Run the actor and wait for it to finish
        logger.info(f"Starting Apify actor to scrape tweets for query: {query}")
        run = apify_client.actor("microworlds/twitter-scraper").call(run_input=run_input)

        # call() returns run metadata; the scraped items live in the run's default dataset
        if not run or 'defaultDatasetId' not in run:
            logger.error("Apify run returned no dataset. Unable to extract tweets.")
            return []
        tweets = apify_client.dataset(run['defaultDatasetId']).iterate_items()

        # Extract relevant text content from tweets
        tweet_texts = []
        for tweet in tweets:
            if 'full_text' in tweet:
                tweet_texts.append(tweet['full_text'])
            elif 'text' in tweet:
                tweet_texts.append(tweet['text'])
            else:
                logger.warning(f"Unexpected tweet format: {tweet}")

        logger.info(f"Successfully scraped {len(tweet_texts)} tweets")
        return tweet_texts

    except Exception as e:
        logger.error(f"Error occurred while scraping tweets: {str(e)}")
        return []

In [7]:
# Load and process documents
def scrape_websites(urls):
    # Minimal helper (the notebook called this without defining it): load each page's text with WebBaseLoader
    return [doc.page_content for doc in WebBaseLoader(urls).load()]

documents = scrape_websites(websites)
tweet_texts = scrape_tweets('cybersecurity')

all_texts = documents + tweet_texts

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_text("\n\n".join(all_texts))

vectorstore = FAISS.from_texts(texts, embeddings)
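To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window chunker. It is only an illustration of the overlap idea: `CharacterTextSplitter` splits on a separator first and then merges pieces, so its chunk boundaries will generally differ from this fixed-width version. The name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-width windows where consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each new window advances
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


# Small values make the overlap visible: each chunk repeats the last 4 characters of the previous one.
chunks = chunk_text("abcdefghijklmnopqrstuvwxy", chunk_size=10, chunk_overlap=4)
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some text twice.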

In [None]:
# Knowledge extraction and graph population
kg_triple_extract_template = """
Extract up to 5 cybersecurity-related knowledge triplets from the text below in the form (subject, predicate, object).
Focus on threats, vulnerabilities, attack techniques, and security measures.
Text: {text}
Triplets:
"""
kg_triple_extract_prompt = PromptTemplate(
    input_variables=["text"],
    template=kg_triple_extract_template,
)

kg_triple_extract_chain = LLMChain(llm=llm, prompt=kg_triple_extract_prompt)

import ast

for text in texts:
    triplets = kg_triple_extract_chain.invoke({"text": text})
    for triplet in triplets['text'].split('\n'):
        if triplet.strip():
            try:
                # literal_eval parses a quoted tuple without executing arbitrary code, unlike eval
                subject, predicate, obj = ast.literal_eval(triplet.strip())
                # Kuzu is queried with Cypher, not SQL: MERGE creates each entity only if it is missing
                for name in (subject, obj):
                    conn.execute(
                        "MERGE (e:Entity {name: $name}) ON CREATE SET e.type = $type",
                        {"name": name, "type": "Cybersecurity_Entity"},
                    )
                conn.execute(
                    "MATCH (a:Entity {name: $s}), (b:Entity {name: $o}) "
                    "CREATE (a)-[:Relation {predicate: $p}]->(b)",
                    {"s": subject, "o": obj, "p": predicate},
                )
            except Exception as e:
                print(f"Failed to process triplet: {triplet}. Error: {e}")

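LLM output for the triplet prompt is not guaranteed to be valid Python tuples: models often return unquoted items or numbered lines. A more tolerant parser, sketched below under that assumption, pulls `(subject, predicate, object)` groups out with a regular expression. The name `parse_triplets` and the sample text are hypothetical, and the pattern does not handle nested parentheses.

```python
import re

# Three comma-separated fields inside parentheses; the object field may itself contain commas.
_TRIPLET_RE = re.compile(r"\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*,\s*([^()]+?)\s*\)")


def parse_triplets(llm_output):
    """Extract (subject, predicate, object) tuples from free-form LLM text."""
    triplets = []
    for match in _TRIPLET_RE.finditer(llm_output):
        # Strip whitespace and any surrounding quotes from each field.
        subject, predicate, obj = (part.strip().strip("'\"") for part in match.groups())
        if subject and predicate and obj:
            triplets.append((subject, predicate, obj))
    return triplets


sample = (
    "1. (LockBit, targets, healthcare sector)\n"
    "('CVE-2024-3400', 'affects', 'PAN-OS')\n"
    "no triplet on this line"
)
parsed = parse_triplets(sample)
```

Lines without a parenthesized triple are skipped instead of raising an error, which keeps one malformed line from discarding the rest of a response.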

In [None]:
def get_graph_data():
    nodes_result = conn.execute("MATCH (e:Entity) RETURN e.name")
    edges_result = conn.execute("MATCH (e1:Entity)-[r:Relation]->(e2:Entity) RETURN e1.name, r.predicate, e2.name")

    # Kuzu's Python QueryResult is read with has_next()/get_next(); each row is a list of column values
    nodes = []
    while nodes_result.has_next():
        nodes.append(nodes_result.get_next()[0])
    edges = []
    while edges_result.has_next():
        edges.append(tuple(edges_result.get_next()))

    return nodes, edges

# Enhanced graph visualization using Plotly
def visualize_graph_plotly():
    nodes, edges = get_graph_data()
    G = nx.Graph()

    for node in nodes:
        G.add_node(node)

    for edge in edges:
        G.add_edge(edge[0], edge[2], label=edge[1])

    pos = nx.spring_layout(G)

    edge_x = []
    edge_y = []
    for edge in G.edges():
        x0, y0 = pos[edge[0]]
        x1, y1 = pos[edge[1]]
        edge_x.extend([x0, x1, None])
        edge_y.extend([y0, y1, None])

    edge_trace = go.Scatter(
        x=edge_x, y=edge_y,
        line=dict(width=0.5, color='#888'),
        hoverinfo='none',
        mode='lines')

    node_x = [pos[node][0] for node in G.nodes()]
    node_y = [pos[node][1] for node in G.nodes()]

    node_trace = go.Scatter(
        x=node_x, y=node_y,
        mode='markers',
        hoverinfo='text',
        marker=dict(
            showscale=True,
            colorscale='YlGnBu',
            reversescale=True,
            color=[],
            size=10,
            colorbar=dict(
                thickness=15,
                title=dict(text='Node Connections', side='right'),
                xanchor='left'
            ),
            line_width=2))

    node_adjacencies = []
    node_text = []
    for node, adjacencies in G.adjacency():
        node_adjacencies.append(len(adjacencies))
        node_text.append(f'{node}<br># of connections: {len(adjacencies)}')

    node_trace.marker.color = node_adjacencies
    node_trace.text = node_text

    fig = go.Figure(data=[edge_trace, node_trace],
                    layout=go.Layout(
                        title=dict(text='Knowledge Graph', font=dict(size=16)),
                        showlegend=False,
                        hovermode='closest',
                        margin=dict(b=20,l=5,r=5,t=40),
                        annotations=[ dict(
                            text="",
                            showarrow=False,
                            xref="paper", yref="paper",
                            x=0.005, y=-0.002 ) ],
                        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False))
                    )

    fig.show()

# Embedding visualization
def visualize_embeddings():
    # Get embeddings for all chunks in one batch
    doc_embeddings = embeddings.embed_documents(texts)

    # Reduce dimensionality for visualization; t-SNE requires perplexity < number of samples
    tsne = TSNE(n_components=2, random_state=0, perplexity=min(30, len(texts) - 1))
    tsne_results = tsne.fit_transform(doc_embeddings)

    # Create a DataFrame for Plotly
    df = pd.DataFrame(tsne_results, columns=['x', 'y'])

    fig = px.scatter(df, x='x', y='y', title='Document Embeddings Visualization')
    fig.show()

In [None]:
# Example queries
questions = [
    "What are the latest threats targeting the healthcare industry?",
    "Can you provide details on recent ransomware attacks?",
    "What are the most critical vulnerabilities discovered in the last month?",
    "How can organizations protect against phishing attacks?",
    "What are the emerging trends in cybersecurity for financial institutions?"
]


In [None]:
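# Answer each query with context retrieved from the vector store, then visualize
def query_graph(query, k=4):
    # Retrieve the chunks most similar to the query and pass them to the LLM as context
    docs = vectorstore.similarity_search(query, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)
    return chain.invoke({"text": f"Answer using only this context:\n{context}\n\nQuestion: {query}"})

for query in questions:
    answer = query_graph(query)
    print(f"Query: {query}\nAnswer: {answer.content}\n")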

# Visualize the graph and embeddings
visualize_graph_plotly()
visualize_embeddings()
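The extracted triplets can also feed retrieval directly. The sketch below shows one simple way to do that, assuming the triplets are available as Python tuples: it indexes them by entity and returns the facts whose subject or object is mentioned in the question. `build_index`, `graph_context`, and the sample triplets are hypothetical, and the substring matching is deliberately naive; a real system would likely use entity linking or a Cypher query against the graph.

```python
from collections import defaultdict


def build_index(triplets):
    """Map each lowercased entity name to the triplets it appears in."""
    index = defaultdict(list)
    for subject, predicate, obj in triplets:
        index[subject.lower()].append((subject, predicate, obj))
        index[obj.lower()].append((subject, predicate, obj))
    return index


def graph_context(question, index, limit=5):
    """Return up to `limit` facts about entities named in the question."""
    question_lower = question.lower()
    facts = []
    for entity, entity_triplets in index.items():
        if entity in question_lower:  # naive substring match
            for triplet in entity_triplets:
                if triplet not in facts:
                    facts.append(triplet)
    return [f"{s} {p} {o}" for s, p, o in facts[:limit]]


# Illustrative triplets only, not extracted from real data.
example_triplets = [
    ("LockBit", "targets", "healthcare"),
    ("Black Basta", "uses", "Qakbot"),
]
context_lines = graph_context("Recent LockBit incidents?", build_index(example_triplets))
```

The returned lines could be appended to the vector-store context in the prompt so that answers draw on both sources.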