<a href="https://colab.research.google.com/github/KaifAhmad1/code-test/blob/main/KG_Enhanced_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Problem Statement

#### Task
Develop a co-pilot for threat researchers, security analysts, and professionals that addresses the limitations of current AI solutions like ChatGPT and Perplexity.

#### Current Challenges
1. **Generic Data**: Existing AI solutions provide generic information that lacks specificity.
2. **Context Understanding**: These solutions fail to understand and maintain context.
3. **Limited Information**: The data sources are often limited and not comprehensive.
4. **Single Source Dependency**: Relying on a single source of information reduces reliability and accuracy.
5. **Inadequate AI Models**: Current models do not meet the specialized needs of cybersecurity professionals.

#### Requirement
Create a chatbot capable of collecting and curating data from multiple sources, starting with search engines, and expanding to website crawling and Twitter scraping.

#### Features Required

##### User Interface (UI)
- Chat UI with file upload capabilities.
- Options to save and select prompts.
- Configuration settings for connectors with enable/disable toggles.
- Interface for configuring knowledge and variables (similar to Dify.ai).
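
The connector toggles listed above could be backed by a small in-memory registry. The following is only a sketch under stated assumptions: the `Connector` dataclass, the connector names, and the helper functions are hypothetical and do not appear elsewhere in this notebook.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Connector:
    """Hypothetical description of one data source."""
    name: str
    enabled: bool = True


# Illustrative registry; the names are placeholders, not real settings.
CONNECTORS: Dict[str, Connector] = {
    "search": Connector("search"),
    "web_crawler": Connector("web_crawler"),
    "twitter": Connector("twitter", enabled=False),
}


def set_enabled(name: str, enabled: bool) -> None:
    """Flip the toggle for one connector; unknown names raise KeyError."""
    CONNECTORS[name].enabled = enabled


def enabled_connectors() -> List[str]:
    """Return the names of all connectors that are currently switched on."""
    return [c.name for c in CONNECTORS.values() if c.enabled]


print(enabled_connectors())
```

A UI toggle would then only need to call `set_enabled`, and the data collector would iterate over `enabled_connectors()`.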

##### Technical Specifications
- **No Hallucinations**: Ground every answer in the collected sources so the chatbot provides accurate and reliable information.
- **RAG (Retrieval-Augmented Generation)**: Use RAG to determine which connectors to use based on user inputs.
- **Query Chunking and Distribution**: Optimize the process of breaking down queries and distributing them across different sources.
- **Data Curation Steps**:
  1. Collect links from approximately 50 sources.
  2. Aggregate data from websites and Twitter.
  3. Curate data using a knowledge graph to find relationships and generate responses.
- **Chatbot Capabilities**: Answer queries such as:
  - "List all details on {{BFSI}} security incidents in {{India}}."
  - "List all ransomware attacks targeting the healthcare industry in {{last 7 days/last 3 months/last week/last month}}."
  - "Provide recent incidents related to Lockbit Ransomware gang / BlackBasta Ransomware."

#### Goal
Develop a data collector that integrates multiple specific sources to enrich the knowledge base, enabling the model to better understand context and deliver accurate results. The solution should be modular, allowing customization and configuration of sources.

#### Summary
The goal is to build a modular chatbot for cybersecurity professionals that overcomes the limitations of existing AI solutions by integrating multiple data sources and producing context-aware, accurate responses. The chatbot will use RAG and a knowledge graph to provide curated information from diverse sources.


#### Installation and Setup

In [1]:
!pip uninstall -yq torch torchvision pandas
!pip install -q torch==2.3.1 torchvision==0.18.1 pandas==2.0.3
!pip install -qU langchain langchain-community faiss-cpu kuzu pyvis
!pip install -qU sentence-transformers networkx pydantic
!pip install -qU langchain-groq apify_client langgraph python-dotenv


**Imports**

In [11]:
# Imports
import os
import logging
from typing import List, Dict, Any, Optional

import networkx as nx
from pyvis.network import Network
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain
from langchain.schema import SystemMessage, HumanMessage
from langchain_groq import ChatGroq
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.tools import BaseTool
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_core.messages import BaseMessage
from langchain.output_parsers import PydanticOutputParser
from langchain.memory import ConversationBufferMemory
from pydantic import BaseModel, Field

from apify_client import ApifyClient

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

**Embedding and LLM Initialization**

In [3]:
# Initialize HuggingFace embeddings
embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-small-en",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)

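# Initialize Llama-3.1 from Meta using Groq LPU Inference.
# The key is read from the GROQ_API_KEY environment variable (or a Colab secret
# exported to it) so that no credential is stored in the notebook.
llm = ChatGroq(
    temperature=0,
    model="llama-3.1-70b-versatile",
    api_key=os.environ["GROQ_API_KEY"]
)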

# Define system and human messages
system_message = """You are an expert cybersecurity analyst with extensive knowledge in threat analysis,
vulnerability assessment, and security recommendations. Provide detailed, precise, and actionable insights.
Always consider the latest threat intelligence and best practices in your analysis."""
prompt_template = ChatPromptTemplate.from_messages([("system", system_message), ("human", "{text}")])


**Knowledge Graph Implementation**

In [4]:
# Knowledge Graph Implementation
class KnowledgeGraph:
    def __init__(self):
        self.graph = nx.Graph()

    def add_entity(self, entity: str, entity_type: str):
        self.graph.add_node(entity, type=entity_type)
        logger.info(f"Added entity: {entity} of type: {entity_type}")

    def add_relation(self, entity1: str, entity2: str, relation: str):
        self.graph.add_edge(entity1, entity2, relation=relation)
        logger.info(f"Added relation: {relation} between {entity1} and {entity2}")

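    def get_related_entities(self, entity: str) -> List[Dict[str, str]]:
        # Unknown entities would make networkx raise, so return an empty list instead.
        if entity not in self.graph:
            logger.warning(f"Entity not found in knowledge graph: {entity}")
            return []
        related = [{"entity": neighbor, "relation": self.graph.get_edge_data(entity, neighbor)["relation"]}
                   for neighbor in self.graph.neighbors(entity)]
        logger.info(f"Related entities for {entity}: {related}")
        return related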

    def visualize(self, output_file: str = "knowledge_graph.html"):
        net = Network(notebook=True, width="100%", height="500px")
        for node, node_data in self.graph.nodes(data=True):
            # Nodes created implicitly by add_edge have no 'type' attribute.
            net.add_node(node, label=node, title=f"Type: {node_data.get('type', 'Unknown')}")
        for edge in self.graph.edges(data=True):
            net.add_edge(edge[0], edge[1], title=edge[2]['relation'])
        net.show(output_file)
        logger.info(f"Knowledge graph visualized at {output_file}")

# Initialize knowledge graph
kg = KnowledgeGraph()
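
`get_related_entities` returns only direct neighbours. When curating context for a response it can be useful to include entities a few hops away as well. The sketch below shows a breadth-first expansion with a hop limit. It uses a plain adjacency dict as a stand-in for the networkx graph, and both the sample edges and the function `related_within` are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical adjacency data: entity -> list of (neighbor, relation).
EDGES: Dict[str, List[Tuple[str, str]]] = {
    "LockBit": [("Ransomware", "is_a"), ("Hospital A", "targets")],
    "Ransomware": [("LockBit", "is_a")],
    "Hospital A": [("LockBit", "targets"), ("India", "located_in")],
    "India": [("Hospital A", "located_in")],
}


def related_within(start: str, max_hops: int) -> Dict[str, int]:
    """Breadth-first search returning each reachable entity and its hop count."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        entity = queue.popleft()
        if hops[entity] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor, _relation in EDGES.get(entity, []):
            if neighbor not in hops:
                hops[neighbor] = hops[entity] + 1
                queue.append(neighbor)
    del hops[start]  # the start entity is not "related" to itself
    return hops


print(related_within("LockBit", 2))
```

With networkx, a similar result could be obtained from the graph directly; the dict version is used here only to keep the sketch dependency-free.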

**Data Collection Functions**

In [5]:
# Data Collection Functions
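# Read the Apify token from the environment (variable name is a convention, adjust
# as needed) rather than hardcoding a credential in the notebook.
apify_client = ApifyClient(os.environ["APIFY_API_TOKEN"])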

def scrape_websites(urls: List[str]) -> List[str]:
    logger.info(f"Scraping {len(urls)} websites...")
    run_input = {
        "startUrls": [{"url": url} for url in urls],
        "maxCrawlPages": 10,
        "maxCrawlDepth": 1,
    }
    try:
        run = apify_client.actor("apify/website-content-crawler").call(run_input=run_input)
        dataset_items = apify_client.dataset(run["defaultDatasetId"]).list_items().items
        scraped_content = [item.get('text', '') for item in dataset_items if 'text' in item]
        logger.info(f"Successfully scraped {len(scraped_content)} pages.")
        return scraped_content
    except Exception as e:
        logger.error(f"Error scraping websites: {str(e)}")
        return []

def fetch_scraped_tweets(query: str, max_tweets: int = 100) -> List[Dict[str, Any]]:
    logger.info(f"Fetching tweets for query: {query}")
    actor_input = {
        "queries": [query],
        "maxTweets": max_tweets
    }
    try:
        run = apify_client.actor("apidojo/tweet-scraper").call(run_input=actor_input)
        dataset_id = run["defaultDatasetId"]
        items = apify_client.dataset(dataset_id).list_items().items
        logger.info(f"Fetched {len(items)} tweets.")
        return items
    except Exception as e:
        logger.error(f"Error fetching tweets: {str(e)}")
        return []

import requests
from datetime import datetime, timedelta, timezone

def fetch_nvd_data(days: int = 30) -> List[Dict[str, Any]]:
    logger.info(f"Fetching NVD data for the last {days} days...")
    # The 1.0 endpoint has been retired. The 2.0 API expects ISO-8601 timestamps,
    # limits the publication window to 120 days, and paginates its results;
    # this function returns the first page only.
    base_url = "https://services.nvd.nist.gov/rest/json/cves/2.0"
    current_date = datetime.now(timezone.utc)
    start_date = current_date - timedelta(days=days)
    params = {
        "pubStartDate": start_date.strftime("%Y-%m-%dT%H:%M:%S.000"),
        "pubEndDate": current_date.strftime("%Y-%m-%dT%H:%M:%S.000")
    }
    try:
        response = requests.get(base_url, params=params, timeout=30)
        response.raise_for_status()
        data = response.json()
        # In the 2.0 schema each entry wraps the record under a "cve" key.
        vulnerabilities = data.get("vulnerabilities", [])
        logger.info(f"Fetched {len(vulnerabilities)} vulnerabilities from NVD.")
        return vulnerabilities
    except Exception as e:
        logger.error(f"Error fetching NVD data: {str(e)}")
        return []

In [6]:
# Cybersecurity-specific websites
websites = [
    "https://www.cisa.gov/uscert/ncas/alerts",
    "https://www.virustotal.com/gui/home/upload",
    "https://attack.mitre.org/",
    "https://www.darkreading.com/",
    "https://threatpost.com/",
]

# Scrape websites
scraped_content = scrape_websites(websites)

# Fetch tweets
tweets = fetch_scraped_tweets("#cybersecurity")
tweet_content = [tweet.get('full_text', '') for tweet in tweets]

# Combine scraped content and tweets
all_content = scraped_content + tweet_content
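
Tweets without a `full_text` field produce empty strings, and different sources may return the same text more than once. A light cleaning pass before indexing could look like the sketch below. The helper `clean_texts` is hypothetical and not part of the notebook.

```python
from typing import Iterable, List, Optional


def clean_texts(texts: Iterable[Optional[str]]) -> List[str]:
    """Collapse whitespace, drop empty entries, and remove exact duplicates in order."""
    seen = set()
    cleaned = []
    for text in texts:
        normalized = " ".join((text or "").split())  # collapse runs of whitespace
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


print(clean_texts(["Alert:  patch now", "", "Alert: patch now", None, " New advisory "]))
```

Applied to the cell above, this would be `all_content = clean_texts(scraped_content + tweet_content)`.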

**Vector Store and Retriever Setup Functions**

In [8]:
# Vector Store and Retriever Setup Functions
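def create_vectorstore(texts: List[str]) -> FAISS:
    # Skip blank entries and fail early with a clear message if scraping returned nothing.
    texts = [text for text in texts if text and text.strip()]
    if not texts:
        raise ValueError("No content to index; check the data collection step.")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    documents = text_splitter.create_documents(texts)
    return FAISS.from_documents(documents, embeddings)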

def setup_retriever(vectorstore: FAISS) -> ContextualCompressionRetriever:
    base_retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 4})
    compressor = LLMChainExtractor.from_llm(llm)
    return ContextualCompressionRetriever(base_compressor=compressor, base_retriever=base_retriever)

# Create vector store and retriever
vectorstore = create_vectorstore(all_content)
retriever = setup_retriever(vectorstore)
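
To make the `chunk_size=1000` and `chunk_overlap=200` settings concrete, the sketch below splits text into fixed-size character windows in which neighbouring chunks share an overlap. This is a simplification written for illustration: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs, lines, and spaces, so its actual chunk boundaries will differ. The function `chunk_text` is hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size windows where consecutive chunks share an overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(2000))
chunks = chunk_text(sample)
print([len(chunk) for chunk in chunks])
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of indexing some text twice.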

**Pydantic Models for Structured Output**

In [12]:
# Pydantic Models for Structured Output
class ThreatAnalysis(BaseModel):
    threat_type: str = Field(description="Type of cybersecurity threat")
    severity: str = Field(description="Severity level of the threat (Low, Medium, High, Critical)")
    description: str = Field(description="Brief description of the threat")
    potential_impact: str = Field(description="Potential impact on organizations")
    mitigation_steps: List[str] = Field(description="List of steps to mitigate the threat")
    ioc_list: Optional[List[str]] = Field(description="List of Indicators of Compromise (IoCs)", default=None)

class VulnerabilityAssessment(BaseModel):
    vulnerability_name: str = Field(description="Name or identifier of the vulnerability")
    affected_systems: List[str] = Field(description="List of affected systems or software")
    cvss_score: float = Field(description="CVSS score of the vulnerability")
    description: str = Field(description="Brief description of the vulnerability")
    remediation_steps: List[str] = Field(description="List of steps to remediate the vulnerability")
    references: Optional[List[str]] = Field(description="List of references for more information", default=None)

class SecurityRecommendation(BaseModel):
    recommendation: str = Field(description="Security recommendation")
    priority: str = Field(description="Priority level (Low, Medium, High)")
    implementation_difficulty: str = Field(description="Difficulty of implementation (Easy, Moderate, Complex)")
    expected_impact: str = Field(description="Expected impact of implementing the recommendation")
    estimated_cost: Optional[str] = Field(description="Estimated cost of implementation", default=None)
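
The `severity` field above is a free-form string, so a model reply could contain a value outside the four documented levels. The sketch below shows the kind of check that could be applied to raw JSON before it is trusted. It uses only the standard library as a stand-in for Pydantic validation, and `check_threat_payload` is a hypothetical helper.

```python
import json
from typing import Any, Dict

ALLOWED_SEVERITIES = {"Low", "Medium", "High", "Critical"}
REQUIRED_FIELDS = ["threat_type", "severity", "description",
                   "potential_impact", "mitigation_steps"]


def check_threat_payload(raw: str) -> Dict[str, Any]:
    """Parse JSON text and verify the fields a ThreatAnalysis would need."""
    data = json.loads(raw)
    missing = [key for key in REQUIRED_FIELDS if key not in data]
    if missing:
        raise ValueError(f"Missing fields: {missing}")
    if data["severity"] not in ALLOWED_SEVERITIES:
        raise ValueError(f"Unexpected severity: {data['severity']!r}")
    if not isinstance(data["mitigation_steps"], list):
        raise ValueError("mitigation_steps must be a list")
    return data
```

In the Pydantic models themselves, a comparable constraint could be expressed by narrowing the field type instead of using `str`.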

**Specialized Agent Tools**

In [13]:
# Specialized Agent Tools
def build_structured_chain(system_text: str, schema):
    # Prompt -> LLM -> parser. The parser's format instructions tell the model
    # which JSON fields to emit so the reply can be validated against `schema`.
    parser = PydanticOutputParser(pydantic_object=schema)
    prompt = ChatPromptTemplate.from_messages([
        ("system", system_text + "\n\n{format_instructions}"),
        ("human", "{query}"),
    ]).partial(format_instructions=parser.get_format_instructions())
    return prompt | llm | parser

class ThreatAnalyzerTool(BaseTool):
    # Tool-calling APIs expect names without spaces; fields need type annotations.
    name: str = "threat_analyzer"
    description: str = "Analyzes cybersecurity threats and provides detailed information"

    def _run(self, query: str) -> ThreatAnalysis:
        chain = build_structured_chain("You are a cybersecurity threat analyst. Provide a detailed analysis of the given threat, including its type, severity, description, potential impact, mitigation steps, and if possible, a list of Indicators of Compromise (IoCs).", ThreatAnalysis)
        return chain.invoke({"query": query})

class VulnerabilityAssessorTool(BaseTool):
    name: str = "vulnerability_assessor"
    description: str = "Assesses cybersecurity vulnerabilities and provides detailed information"

    def _run(self, query: str) -> VulnerabilityAssessment:
        chain = build_structured_chain("You are a cybersecurity vulnerability assessor. Provide a detailed assessment of the given vulnerability, including its name, affected systems, CVSS score, description, remediation steps, and if available, references for more information.", VulnerabilityAssessment)
        return chain.invoke({"query": query})

class SecurityAdvisorTool(BaseTool):
    name: str = "security_advisor"
    description: str = "Provides security recommendations based on the analysis of threats and vulnerabilities"

    def _run(self, query: str) -> SecurityRecommendation:
        chain = build_structured_chain("You are a cybersecurity advisor. Provide a security recommendation based on the given analysis, including the recommendation, priority, implementation difficulty, expected impact, and if possible, an estimated cost of implementation.", SecurityRecommendation)
        return chain.invoke({"query": query})

In [14]:
# Initialize tools
threat_analyzer = ThreatAnalyzerTool()
vulnerability_assessor = VulnerabilityAssessorTool()
security_advisor = SecurityAdvisorTool()

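# Create the agent prompt. Variables must be declared as template entries
# (tuples or MessagesPlaceholder). Pre-built SystemMessage/HumanMessage objects
# are treated as literal text, which is why `agent_scratchpad` was not found.
from langchain_core.prompts import MessagesPlaceholder
from langchain.agents import create_tool_calling_agent

agent_prompt = ChatPromptTemplate.from_messages([
    ("system", system_message),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

# Create a tool-calling agent (used here instead of the OpenAI functions agent,
# since the model is served by Groq rather than OpenAI)
agent = create_tool_calling_agent(
    llm=llm,
    tools=[threat_analyzer, vulnerability_assessor, security_advisor],
    prompt=agent_prompt
)

# Create an agent executor; the memory fills the `chat_history` placeholder
memory = ConversationBufferMemory(
    memory_key="chat_history", return_messages=True,
    input_key="input", output_key="output"
)
agent_executor = AgentExecutor(
    agent=agent,
    tools=[threat_analyzer, vulnerability_assessor, security_advisor],
    memory=memory,
    verbose=True
)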

In [15]:
# Function to process multiple queries
def process_queries(queries: List[str]) -> List[str]:
    results = []
    for query in queries:
        # The agent takes a single `input` variable and returns its answer under `output`.
        response = agent_executor.invoke({"input": query})
        results.append(response["output"])
    return results

# Example usage
queries = [
    "Analyze the latest ransomware threat affecting financial institutions.",
    "Assess the vulnerability CVE-2024-12345 in Windows Server.",
    "Provide a security recommendation for mitigating phishing attacks."
]

results = process_queries(queries)
for i, result in enumerate(results):
    print(f"Result for Query {i+1}: {result}")


In [16]:
# Main execution
if __name__ == "__main__":
    # Add some sample data to the knowledge graph
    kg.add_entity("Ransomware", "Threat")
    kg.add_entity("Financial Institutions", "Target")
    kg.add_relation("Ransomware", "Financial Institutions", "targets")

    # Process queries and update knowledge graph
    results = process_queries(queries)
    for i, (query, result) in enumerate(zip(queries, results), start=1):
        # Here you would parse the result and update the knowledge graph accordingly
        # This is a simplified example: each query gets its own typed result node,
        # so visualize() finds a 'type' on every node.
        result_node = f"Result {i}"
        kg.add_entity(query, "Query")
        kg.add_entity(result_node, "Result")
        kg.add_relation(query, result_node, str(result)[:50])  # Truncated for brevity

    # Visualize the knowledge graph
    kg.visualize("cybersecurity_knowledge_graph.html")

    print("Cybersecurity co-pilot pipeline executed successfully.")
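
The comment about parsing each result and updating the knowledge graph can be illustrated with a small extraction step. The sketch below pulls CVE identifiers out of a result string and turns them into (subject, relation, object) triples. The helper `cve_triples` and the relation name `mentions` are hypothetical, and real results would likely need richer entity extraction than a single regular expression.

```python
import re
from typing import List, Tuple

# CVE identifiers have the form CVE-<year>-<four or more digits>.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def cve_triples(query: str, result: str) -> List[Tuple[str, str, str]]:
    """Return (query, 'mentions', CVE-ID) triples for each distinct CVE in the result."""
    seen = []
    for match in CVE_RE.findall(result):
        cve_id = match.upper()
        if cve_id not in seen:
            seen.append(cve_id)
    return [(query, "mentions", cve_id) for cve_id in seen]


print(cve_triples("Assess the vulnerability", "See CVE-2024-12345 for details."))
```

Each triple could then be added with `kg.add_entity(cve_id, "Vulnerability")` followed by `kg.add_relation(query, cve_id, "mentions")`.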
