<a href="https://colab.research.google.com/github/KaifAhmad1/code-test/blob/main/Enhanced_Cyber_Security_Copilot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Problem Statement

##### Task
Develop a co-pilot for threat researchers, security analysts, and professionals that addresses the limitations of current AI solutions like ChatGPT and Perplexity.

##### Current Challenges
1. **Generic Data**: Existing AI solutions provide generic information that lacks specificity.
2. **Context Understanding**: These solutions fail to understand and maintain context.
3. **Limited Information**: The data sources are often limited and not comprehensive.
4. **Single Source Dependency**: Relying on a single source of information reduces reliability and accuracy.
5. **Inadequate AI Models**: Current models do not meet the specialized needs of cybersecurity professionals.

##### Requirement
Create a chatbot capable of collecting and curating data from multiple sources, starting with search engines, and expanding to website crawling and Twitter scraping.

###### Technical Specifications
- **No Hallucinations**: Ensure the chatbot provides accurate and reliable information.
- **RAG (Retrieval-Augmented Generation)**: Use RAG to determine which connectors to use based on user inputs.
- **Query Chunking and Distribution**: Optimize the process of breaking down queries and distributing them across different sources.
- **Data Curation Steps**:
  1. Collect links from approximately 50 sources.
  2. Aggregate data from websites and Twitter.
  3. Curate data using a knowledge graph to find relationships and generate responses.
- **Chatbot Capabilities**: Answer queries such as:
  - "List all details on {{BFSI}} security incidents in {{India}}."
  - "List all ransomware attacks targeting the healthcare industry in {{last 7 days/last 3 months/last week/last month}}."
  - "Provide recent incidents related to Lockbit Ransomware gang / BlackBasta Ransomware."
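The query-handling requirements above (choosing connectors and extracting the time window from a question) can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the connector names, keyword lists, and the 30-day month approximation are all assumptions.

```python
import re
from datetime import timedelta
from typing import List, Optional

# Hypothetical connector names and trigger keywords (assumptions for illustration).
CONNECTOR_KEYWORDS = {
    "cve": ["cve-", "vulnerability", "patch"],
    "twitter": ["ransomware", "gang", "leak"],
    "news": ["incident", "attack", "breach"],
}

# A month is approximated as 30 days.
_UNIT_DAYS = {"day": 1, "week": 7, "month": 30}


def parse_timeframe(query: str) -> Optional[timedelta]:
    """Return the look-back window named in the query, or None if there is none."""
    match = re.search(r"last\s+(\d+)?\s*(day|week|month)s?", query.lower())
    if not match:
        return None
    count = int(match.group(1)) if match.group(1) else 1
    return timedelta(days=count * _UNIT_DAYS[match.group(2)])


def select_connectors(query: str) -> List[str]:
    """Pick every connector whose keywords appear in the query; default to web search."""
    lowered = query.lower()
    chosen = [
        name
        for name, words in CONNECTOR_KEYWORDS.items()
        if any(word in lowered for word in words)
    ]
    return chosen or ["search"]


query = "List all ransomware attacks targeting the healthcare industry in last 7 days"
print(select_connectors(query), parse_timeframe(query))
```

Keyword routing is only a starting point; a retrieval step over connector descriptions could replace the keyword table without changing the function signatures.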

##### Goal
Develop a data collector that integrates multiple specific sources to enrich the knowledge base, enabling the model to better understand context and deliver accurate results. The solution should be modular, allowing customization and configuration of sources.
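One way to keep sources modular and configurable is a small registry in which each source can be switched on or off and capped independently. The sketch below is an assumption about how that could look; `SourceConfig`, `SourceRegistry`, and the stub fetchers are hypothetical names, not part of the notebook.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SourceConfig:
    """One configurable data source (illustrative structure)."""
    name: str
    fetch: Callable[[str], List[dict]]
    enabled: bool = True
    max_results: int = 50


@dataclass
class SourceRegistry:
    sources: Dict[str, SourceConfig] = field(default_factory=dict)

    def register(self, config: SourceConfig) -> None:
        self.sources[config.name] = config

    def collect(self, query: str) -> Dict[str, List[dict]]:
        """Run every enabled source and cap each result list at max_results."""
        return {
            name: cfg.fetch(query)[: cfg.max_results]
            for name, cfg in self.sources.items()
            if cfg.enabled
        }


# Stub fetchers stand in for real connectors such as the news or tweet tools.
registry = SourceRegistry()
registry.register(SourceConfig("news", lambda q: [{"title": f"news about {q}"}]))
registry.register(SourceConfig("twitter", lambda q: [{"text": q}] * 3, max_results=2))
registry.register(SourceConfig("forum", lambda q: [{"text": "x"}], enabled=False))

results = registry.collect("lockbit")
print(sorted(results))
```

Disabling a source or changing its cap then becomes a configuration change rather than a code change.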

##### Summary
The goal is to build an advanced, modular chatbot for cybersecurity professionals that overcomes the limitations of existing AI solutions by integrating multiple data sources and ensuring context-aware, accurate responses. The chatbot will utilize state-of-the-art techniques like RAG and knowledge graphs to provide comprehensive, curated information from diverse sources.
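To make the knowledge-graph idea concrete, here is a minimal sketch that stores relationships as subject-relation-object triples and answers a "who targets X" question. It uses only the standard library to stay self-contained (the notebook installs `networkx`, which could play the same role). The triples are placeholder data, not real incidents.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Illustrative triples only; real ones would be extracted from curated documents.
TRIPLES: List[Tuple[str, str, str]] = [
    ("ExampleGang", "targets", "healthcare"),
    ("ExampleGang", "uses", "phishing"),
    ("OtherGang", "targets", "healthcare"),
    ("OtherGang", "targets", "BFSI"),
]


def build_graph(triples: List[Tuple[str, str, str]]) -> Dict[str, Dict[str, Set[str]]]:
    """Index triples as subject -> relation -> set of objects."""
    graph: Dict[str, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for subject, relation, obj in triples:
        graph[subject][relation].add(obj)
    return graph


def subjects_with(graph: Dict[str, Dict[str, Set[str]]], relation: str, obj: str) -> List[str]:
    """Return every subject linked to obj through relation, sorted for stable output."""
    return sorted(s for s, rels in graph.items() if obj in rels.get(relation, set()))


graph = build_graph(TRIPLES)
print(subjects_with(graph, "targets", "healthcare"))
```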


**Install Dependencies**

In [3]:
!pip install -q apify-client langchain langchain-community langchain-groq networkx pyvis spacy transformers pandas
!pip install -q sentence-transformers requests beautifulsoup4 ratelimit langgraph pyLDAvis faiss-cpu crewai crewai_tools exa exa_py matplotlib seaborn textblob

**Import Libraries and Set Up Logging**

In [10]:
import os
import logging
from datetime import datetime
from apify_client import ApifyClient
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from concurrent.futures import ThreadPoolExecutor, as_completed
from ratelimit import limits, sleep_and_retry
from bs4 import BeautifulSoup
from crewai_tools import tool
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from textblob import TextBlob
from exa_py import Exa
import requests
from typing import List, Dict, Any
# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

In [8]:
# Constants and API keys
# Read keys from environment variables; never hard-code real keys in a notebook.
APIFY_API_KEY = os.getenv("APIFY_API_KEY")
NEWS_API_KEY = os.getenv("NEWS_API_KEY")
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
EXA_API_KEY = os.getenv("EXA_API_KEY")

# Initialize Apify client
apify_client = ApifyClient(APIFY_API_KEY)
# Configure requests session with retries and timeouts
session = requests.Session()
retries = Retry(total=5, backoff_factor=0.1, status_forcelist=[429, 500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))
session.mount('http://', HTTPAdapter(max_retries=retries))

##### **Define Tools and Tasks**
Define functions for real-time data fetching, Twitter scraping, news fetching, CVE data fetching, Exa.ai integration, and advanced data analysis.
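The `extract_iocs` tool in the cell below only checks text against a fixed two-item list. A regex-based extractor is one possible replacement. The sketch here is illustrative: `find_iocs` is a hypothetical helper, and the patterns are deliberately simplified (for example, the domain pattern only knows a handful of TLDs).

```python
import re
from typing import Dict, List

_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"

# Simplified patterns that favor readability over full coverage.
IOC_PATTERNS = {
    "ipv4": re.compile(rf"\b(?:{_OCTET}\.){{3}}{_OCTET}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
    "domain": re.compile(r"\b(?:[a-z0-9-]+\.)+(?:com|net|org|io|ru|cn)\b", re.IGNORECASE),
}


def find_iocs(text: str) -> Dict[str, List[str]]:
    """Return the unique matches for each IOC type, omitting types with no match."""
    found: Dict[str, List[str]] = {}
    for kind, pattern in IOC_PATTERNS.items():
        matches = sorted(set(pattern.findall(text)))
        if matches:
            found[kind] = matches
    return found


sample = "Beacon to 192.168.1.1 and evil.example.com exploiting CVE-2024-12345"
print(find_iocs(sample))
```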


In [11]:
@tool
def scrape_website(url: str) -> Dict[str, Any]:
    """Fetch a web page and return its visible text together with the URL and a timestamp."""
    try:
        response = session.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        text = soup.get_text(separator=' ', strip=True)
        return {"url": url, "text": text, "timestamp": datetime.now().isoformat()}
    except Exception as e:
        logger.error(f"Error scraping {url}: {str(e)}")
        return {"url": url, "text": "", "timestamp": datetime.now().isoformat(), "error": str(e)}

@tool
def fetch_tweets(query: str, max_tweets: int = 100) -> List[Dict[str, Any]]:
    """Fetch English tweets matching the query through the Apify tweet-scraper actor."""
    actor_input = {"searchTerms": [query], "maxTweets": max_tweets, "languageCode": "en"}
    try:
        run = apify_client.actor("apidojo/tweet-scraper").call(run_input=actor_input)
        dataset_id = run["defaultDatasetId"]
        items = apify_client.dataset(dataset_id).list_items().items
        return items
    except Exception as e:
        logger.error(f"Error fetching tweets: {str(e)}")
        return []

@tool
def fetch_news(query: str, max_results: int = 50) -> List[Dict[str, Any]]:
    """Fetch the most recent English news articles matching the query from NewsAPI."""
    url = "https://newsapi.org/v2/everything"
    params = {"q": query, "language": "en", "pageSize": max_results, "apiKey": NEWS_API_KEY, "sortBy": "publishedAt"}
    try:
        response = session.get(url, params=params, timeout=10)
        response.raise_for_status()
        articles = response.json().get("articles", [])
        return articles
    except Exception as e:
        logger.error(f"Error fetching news: {str(e)}")
        return []

@tool
def fetch_cve_data() -> List[Dict[str, Any]]:
    """Fetch the latest CVE entries from the CIRCL CVE API."""
    url = "https://cve.circl.lu/api/last"
    try:
        response = session.get(url, timeout=30)
        response.raise_for_status()
        cve_items = response.json()
        return cve_items
    except Exception as e:
        logger.error(f"Error fetching CVE data: {str(e)}")
        return []

@tool
def fetch_exa_research(query: str, max_results: int = 50) -> List[Dict[str, Any]]:
    """Search Exa.ai for the query and return each result's title, URL, and publication date."""
    try:
        # Use the Exa class imported from exa_py; search() takes num_results.
        exa_client = Exa(api_key=EXA_API_KEY)
        response = exa_client.search(query, num_results=max_results)
        return [
            {"title": r.title, "url": r.url, "publishedAt": r.published_date}
            for r in response.results
        ]
    except Exception as e:
        logger.error(f"Error fetching Exa.ai research: {str(e)}")
        return []

@tool
def analyze_cve_severity(cve_description: str) -> str:
    """Infer a CVE's severity from the first severity keyword found in its description."""
    severity_keywords = ["critical", "high", "medium", "low"]
    severity = "unknown"
    for keyword in severity_keywords:
        if keyword in cve_description.lower():
            severity = keyword
            break
    return f"The CVE severity is {severity}."

@tool
def extract_iocs(text: str) -> List[str]:
    """Return the known IOCs that occur in the text (placeholder: checks a fixed sample list)."""
    iocs = ["example.com", "192.168.1.1"]
    return [ioc for ioc in iocs if ioc in text]

@tool
def trend_analysis(data: List[Dict[str, Any]], timeframe: str) -> str:
    """Summarize threat trends for a timeframe (placeholder: returns a fixed statement)."""
    return f"Trend analysis for the timeframe {timeframe} shows increasing threats."

@tool
def sentiment_analysis(text: str) -> str:
    """Classify the text as positive, negative, or neutral using TextBlob polarity."""
    sentiment = TextBlob(text).sentiment.polarity
    if sentiment > 0:
        return "Positive sentiment"
    elif sentiment < 0:
        return "Negative sentiment"
    else:
        return "Neutral sentiment"

@tool
def topic_modeling(texts: List[str], num_topics: int = 5) -> List[str]:
    """Return up to num_topics topic labels (placeholder: uses a fixed topic list)."""
    topics = ["Cybersecurity", "Threats", "Vulnerabilities", "Attacks", "Defense"]
    return topics[:num_topics]

##### **Create Agents**
Define the agents with specific roles and goals, and assign the necessary tools.

In [12]:
from crewai import Agent

def create_agents(vector_store) -> Dict[str, Agent]:
    # All agents share the same tool set.
    shared_tools = [
        scrape_website, fetch_tweets, fetch_news, fetch_cve_data, fetch_exa_research,
        analyze_cve_severity, extract_iocs, trend_analysis, sentiment_analysis, topic_modeling,
    ]

    # name -> (role, goal, backstory). Agent requires a backstory in addition to role and goal.
    specs = {
        "researcher": ("Researcher", "Gather and provide relevant information",
                       "Collects cybersecurity information from websites, news, tweets, and CVE feeds."),
        "analyst": ("Analyst", "Analyze data and provide insights",
                    "Interprets the collected security data and highlights what matters."),
        "advisor": ("Advisor", "Give recommendations based on analysis",
                    "Turns analysis results into practical security recommendations."),
        "threat_hunter": ("Threat Hunter", "Identify potential threats and IOCs",
                          "Looks for indicators of compromise and emerging threats in the data."),
        "incident_responder": ("Incident Responder", "Provide guidance on handling security incidents",
                               "Advises on containing and recovering from security incidents."),
    }

    # `llm` is the ChatGroq instance created in the LLM initialization cell.
    return {
        name: Agent(role=role, goal=goal, backstory=backstory, tools=shared_tools, llm=llm)
        for name, (role, goal, backstory) in specs.items()
    }

##### **Form the Crew**
Organize the agents into a Crew and define the process.

In [13]:
from crewai import Crew, Process

def create_crew(agents: Dict[str, Agent]) -> Crew:
    return Crew(
        agents=list(agents.values()),
        process=Process.sequential
    )

##### **Implement Data Collection**
Define functions to collect and curate data from various sources.
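Because several sources often report the same story, collected records usually need deduplication before curation. The following is a minimal sketch under stated assumptions: `record_key` and `deduplicate` are hypothetical helpers, and lowercasing the whole URL is a simplification (paths can be case-sensitive).

```python
import hashlib
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def record_key(record: Dict[str, str]) -> str:
    """Key a record by its normalized URL if present, else by a hash of its text."""
    url = record.get("url")
    if url:
        parts = urlsplit(url.lower())
        # Ignore the scheme and any trailing slash so http/https variants collapse.
        return f"{parts.netloc}{parts.path.rstrip('/')}"
    text = record.get("text", "").strip().lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def deduplicate(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    {"url": "https://Example.com/a/"},
    {"url": "http://example.com/a"},
    {"text": "Hello "},
    {"text": "hello"},
]
print(len(deduplicate(records)))
```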


In [14]:
def collect_data():
    websites = [
            "https://www.cisa.gov/uscert/ncas/alerts",
            "https://attack.mitre.org/",
            "https://www.darkreading.com/",
            "https://threatpost.com/",
            "https://krebsonsecurity.com/",
            "https://www.bleepingcomputer.com/",
            "https://www.zdnet.com/topic/security/",
            "https://www.securityweek.com/",
            "https://www.sans.org/newsletters/newsbites/",
            "https://www.cyberscoop.com/",
            "https://www.csoonline.com/",
            "https://www.infosecurity-magazine.com/",
            "https://www.wired.com/category/security/",
            "https://www.schneier.com/",
            "https://www.theregister.com/security/",
            "https://thehackernews.com/",
            "https://www.cyberdefensemagazine.com/",
            "https://www.fireeye.com/blog.html",
            "https://unit42.paloaltonetworks.com/",
            "https://www.microsoft.com/security/blog/",
            "https://www.us-cert.gov/ncas/current-activity",
            "https://nakedsecurity.sophos.com/",
            "https://www.recordedfuture.com/blog/",
            "https://www.cybersecurity-insiders.com/",
            "https://www.malwarebytes.com/blog/",
    ]

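    # Scrape each URL concurrently with the single-page tool.
    # Objects created by @tool are invoked through .run() rather than called directly.
    with ThreadPoolExecutor(max_workers=8) as executor:
        scraped_data = list(executor.map(scrape_website.run, websites))
    tweets = fetch_tweets.run("cybersecurity")
    news = fetch_news.run("cybersecurity")
    cve_data = fetch_cve_data.run()
    exa_research = fetch_exa_research.run("cybersecurity")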

    return {
        "scraped_data": scraped_data,
        "tweets": tweets,
        "news": news,
        "cve_data": cve_data,
        "exa_research": exa_research
    }

##### **Curate and Store Data**
Aggregate and store the curated data in a vector database.
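Scraped pages can be far longer than what an embedding model handles well, so splitting text into overlapping chunks before indexing is a common preparatory step. The sketch below is one simple word-based approach; `chunk_text` and its default sizes are assumptions, not values taken from the notebook.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word-based chunks in which consecutive chunks share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample_text = " ".join(str(i) for i in range(10))
print(chunk_text(sample_text, chunk_size=4, overlap=1))
```

Each chunk could then become its own `Document`, with the source URL kept in the metadata so answers remain traceable.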

In [None]:
def curate_data(scraped_data, tweets, news, cve_data, exa_research):
    curated_data = []

    for page in scraped_data:
        curated_data.append({
            "source": "Website",
            "url": page.get("url"),
            "text": page.get("text"),
            "timestamp": page.get("timestamp")
        })

    for tweet in tweets:
        curated_data.append({
            "source": "Twitter",
            "text": tweet.get("text"),
            "user": tweet.get("user"),
            "timestamp": tweet.get("timestamp")
        })

    for article in news:
        curated_data.append({
            "source": "News",
            "url": article.get("url"),
            "title": article.get("title"),
            "description": article.get("description"),
            "timestamp": article.get("publishedAt")
        })

    for cve in cve_data:
        cve_meta = cve.get("cve", {}).get("CVE_data_meta", {})
        description_data = cve.get("cve", {}).get("description", {}).get("description_data", [{}])
        curated_data.append({
            "source": "CVE",
            "cve_id": cve_meta.get("ID"),
            "description": description_data[0].get("value"),
            "timestamp": cve.get("publishedDate")
        })

    for research in exa_research:
        curated_data.append({
            "source": "Exa.ai",
            "title": research.get("title"),
            "abstract": research.get("abstract"),
            "url": research.get("url"),
            "timestamp": research.get("publishedAt")
        })

    return curated_data

def store_in_vector_db(curated_data):
    embeddings = HuggingFaceBgeEmbeddings(
        model_name="BAAI/bge-small-en",
        model_kwargs={"device": "cpu"},
        encode_kwargs={"normalize_embeddings": True}
    )

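    documents = []
    for item in curated_data:
        # Only website and tweet records carry "text"; fall back to other descriptive fields.
        content = item.get("text") or item.get("description") or item.get("abstract") or item.get("title")
        if content:
            documents.append(Document(page_content=content, metadata=item))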
    vector_store = FAISS.from_documents(documents, embeddings)
    vector_store.save_local("vector_store")

def load_vector_store():
    embeddings = HuggingFaceBgeEmbeddings(
        model_name="BAAI/bge-small-en",
        model_kwargs={"device": "cpu"},
        encode_kwargs={"normalize_embeddings": True}
    )
    return FAISS.load_local("vector_store", embeddings, allow_dangerous_deserialization=True)

##### **LLM Initialization with Groq**


In [15]:
from langchain_groq import ChatGroq
# Initialize Llama-3.1 from Meta using Groq LPU Inference
llm = ChatGroq(
    temperature=0,
    model="llama-3.1-70b-versatile",
    api_key=GROQ_API_KEY
)

##### **Run the Multi-Agent System**
Implement the main function to run the multi-agent system, process queries, and provide responses.
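The step that decides which agent should handle the next turn could be driven by the content of the previous response. The sketch below is one hypothetical approach based on keyword counts. The agent names match the keys returned by `create_agents`, while `ROUTING_HINTS` and `select_next_agent` are assumptions introduced for illustration.

```python
from typing import List, Optional

# Keyword hints per agent (illustrative; tune these for real traffic).
ROUTING_HINTS = {
    "threat_hunter": ["ioc", "indicator", "malware", "ransomware"],
    "incident_responder": ["incident", "breach", "contain"],
    "analyst": ["cve", "vulnerability", "severity"],
    "advisor": ["recommend", "mitigat", "best practice"],
}


def select_next_agent(response: str, visited: List[str]) -> Optional[str]:
    """Return the unvisited agent whose hints occur most often in the response, or None."""
    lowered = response.lower()
    best_name, best_score = None, 0
    for name, hints in ROUTING_HINTS.items():
        if name in visited:
            continue
        score = sum(lowered.count(hint) for hint in hints)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


text = "Found ransomware IOC indicators after the breach"
print(select_next_agent(text, visited=["researcher"]))
```

Returning `None` when nothing matches gives the calling loop a natural stopping condition.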

In [None]:
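from crewai import Task

def process_query(query: str, agents: Dict[str, Agent], max_steps: int = 5) -> str:
    current_agent_name = "researcher"
    responses = []

    for step in range(max_steps):
        current_agent = agents[current_agent_name]
        # Agents run Task objects, so wrap the raw query in a Task before executing it.
        task = Task(
            description=query,
            expected_output="An accurate, source-backed answer to the query.",
            agent=current_agent
        )
        response = current_agent.execute_task(task)
        responses.append(f"{current_agent_name.capitalize()}: {response}")

        # Placeholder routing: always selects the researcher, so the loop stops after one step.
        next_agent_name = "researcher"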
        if next_agent_name == current_agent_name or step == max_steps - 1:
            break
        current_agent_name = next_agent_name

    return "\n\n".join(responses)

def main():
    try:
        vector_store = load_vector_store()
        agents = create_agents(vector_store)
        crew = create_crew(agents)

        logger.info("Enhanced Cybersecurity Multi-Agent system initialized successfully.")

        queries = [
            "Assess the vulnerability CVE-2024-12345 in Windows Server.",
            "Provide a security recommendation for mitigating phishing attacks.",
            "List all details on BFSI security incidents in India.",
            "List all ransomware attacks targeting the healthcare industry in the last 7 days.",
            "Provide recent incidents related to Lockbit Ransomware gang.",
            "Provide recent incidents related to BlackBasta Ransomware."
        ]

        for query in queries:
            logger.info(f"Processing query: {query}")
            result = process_query(query, agents)
            print(f"\nQuery: {query}")
            print(f"Response:\n{result}")
    except Exception as e:
        logger.error(f"An error occurred in the main function: {str(e)}")

if __name__ == "__main__":
    main()

##### **Enhanced Prompt Templates and Hallucination-Free Queries**
Use parameterized prompt templates so that each query type is phrased consistently. Narrowing the request this way reduces the room for hallucination, although it does not eliminate it.
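Before a template is filled in, it helps to confirm that every placeholder has a value, so a malformed request fails early instead of reaching the model. This is a minimal standard-library sketch; `template_fields` and `render` are hypothetical helpers and independent of LangChain.

```python
from string import Formatter
from typing import Dict, Set


def template_fields(template: str) -> Set[str]:
    """Return the placeholder names used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def render(template: str, values: Dict[str, str]) -> str:
    """Fill the template, raising KeyError if any placeholder has no value."""
    missing = template_fields(template) - set(values)
    if missing:
        raise KeyError(f"missing template variables: {sorted(missing)}")
    return template.format(**values)


incident_template = "List all details on {sector} security incidents in {location}."
print(render(incident_template, {"sector": "BFSI", "location": "India"}))
```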

In [None]:
from langchain.prompts import PromptTemplate

# PromptTemplate accepts a plain template string plus its input variables;
# ChatPromptTemplate expects a list of chat messages instead.
detailed_prompt_templates = {
    "vulnerability_assessment": PromptTemplate(
        input_variables=["cve_id", "system"],
        template="Assess the vulnerability {cve_id} in {system}. Provide detailed information including potential impacts and mitigation steps."
    ),
    "security_recommendation": PromptTemplate(
        input_variables=["threat"],
        template="Provide a security recommendation for mitigating {threat}. Include preventive measures and best practices."
    ),
    "incident_details": PromptTemplate(
        input_variables=["sector", "location"],
        template="List all details on {sector} security incidents in {location}."
    ),
    "ransomware_attacks": PromptTemplate(
        input_variables=["industry", "timeframe"],
        template="List all ransomware attacks targeting the {industry} industry in the last {timeframe}."
    ),
    "recent_incidents": PromptTemplate(
        input_variables=["ransomware_gang"],
        template="Provide recent incidents related to {ransomware_gang} Ransomware."
    )
}

def process_query_with_templates(query_type: str, input_variables: Dict[str, str], agents: Dict[str, Agent], max_steps: int = 5) -> str:
    prompt_template = detailed_prompt_templates.get(query_type)
    if not prompt_template:
        return "Invalid query type."

    # format() takes the variables as keyword arguments, so unpack the dict.
    query = prompt_template.format(**input_variables)
    return process_query(query, agents, max_steps)

def main():
    try:
        vector_store = load_vector_store()
        agents = create_agents(vector_store)
        crew = create_crew(agents)

        logger.info("Enhanced Cybersecurity Multi-Agent system initialized successfully.")

        queries = [
            ("vulnerability_assessment", {"cve_id": "CVE-2024-12345", "system": "Windows Server"}),
            ("security_recommendation", {"threat": "phishing attacks"}),
            ("incident_details", {"sector": "BFSI", "location": "India"}),
            ("ransomware_attacks", {"industry": "healthcare", "timeframe": "7 days"}),
            ("recent_incidents", {"ransomware_gang": "Lockbit"}),
            ("recent_incidents", {"ransomware_gang": "BlackBasta"})
        ]

        for query_type, input_variables in queries:
            logger.info(f"Processing query: {query_type} with inputs: {input_variables}")
            result = process_query_with_templates(query_type, input_variables, agents)
            print(f"\nQuery: {query_type}")
            print(f"Response:\n{result}")
    except Exception as e:
        logger.error(f"An error occurred in the main function: {str(e)}")

if __name__ == "__main__":
    main()