<a href="https://colab.research.google.com/github/KaifAhmad1/code-test/blob/main/Enhanced_Cyber_Security_Copilot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Problem Statement

##### Task
Develop a co-pilot for threat researchers, security analysts, and professionals that addresses the limitations of current AI solutions like ChatGPT and Perplexity.

##### Current Challenges
1. **Generic Data**: Existing AI solutions provide generic information that lacks specificity.
2. **Context Understanding**: These solutions fail to understand and maintain context.
3. **Limited Information**: The data sources are often limited and not comprehensive.
4. **Single Source Dependency**: Relying on a single source of information reduces reliability and accuracy.
5. **Inadequate AI Models**: Current models do not meet the specialized needs of cybersecurity professionals.

##### Requirement
Create a chatbot capable of collecting and curating data from multiple sources, starting with search engines, and expanding to website crawling and Twitter scraping.

###### Technical Specifications
- **No Hallucinations**: Ensure the chatbot provides accurate and reliable information.
- **RAG (Retrieval-Augmented Generation)**: Use RAG to determine which connectors to use based on user inputs.
- **Query Chunking and Distribution**: Optimize the process of breaking down queries and distributing them across different sources.
- **Data Curation Steps**:
  1. Collect links from approximately 50 sources.
  2. Aggregate data from websites and Twitter.
  3. Curate data using a knowledge graph to find relationships and generate responses.
- **Chatbot Capabilities**: Answer queries such as:
  - "List all details on {{BFSI}} security incidents in {{India}}."
  - "List all ransomware attacks targeting the healthcare industry in {{last 7 days/last 3 months/last week/last month}}."
  - "Provide recent incidents related to Lockbit Ransomware gang / BlackBasta Ransomware."

##### Goal
Develop a data collector that integrates multiple specific sources to enrich the knowledge base, enabling the model to better understand context and deliver accurate results. The solution should be modular, allowing customization and configuration of sources.

##### Summary
The goal is to build an advanced, modular chatbot for cybersecurity professionals that overcomes the limitations of existing AI solutions by integrating multiple data sources and ensuring context-aware, accurate responses. The chatbot will utilize state-of-the-art techniques like RAG and knowledge graphs to provide comprehensive, curated information from diverse sources.


**Install Dependencies**

In [1]:
%pip install -q apify-client langchain langchain-community langchain-groq networkx pyvis spacy transformers pandas
%pip install -q sentence-transformers requests beautifulsoup4 ratelimit langgraph pyLDAvis

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.4/50.4 kB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m70.6/70.6 kB[0m [31m5.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m990.3/990.3 kB[0m [31m43.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.3/2.3 MB[0m [31m68.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m756.0/756.0 kB[0m [31m43.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m103.5/103.5 kB[0m [31m8.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.6/75.6 kB[0m [31m5.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m77.9/77.9 kB[0m [31m7.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

**Import Libraries and Set Up Logging**

In [2]:
import os
from datetime import datetime, timedelta
from typing import List, Dict, Any, Annotated, TypedDict
import logging
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from concurrent.futures import ThreadPoolExecutor, as_completed
from ratelimit import limits, sleep_and_retry
from bs4 import BeautifulSoup
from apify_client import ApifyClient
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain
from langchain_groq import ChatGroq
import networkx as nx
from pyvis.network import Network
import spacy
from transformers import pipeline
import json
from langchain.agents import Tool
from langchain.memory import ConversationBufferMemory
from langchain.callbacks import get_openai_callback
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolExecutor
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re
from textblob import TextBlob
from gensim import corpora
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

**Constants and API Keys**

In [3]:
# Constants
APIFY_API_KEY = "apify_api_yUkcz99gMX1pwNckRi7EyXLwhVTd0j3m4Mtt"
NEWS_API_KEY = os.getenv("c50f733b00e34575a7c203c38cd97391")
GROQ_API_KEY = "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"
WEBSITES = [
    "https://www.cisa.gov/uscert/ncas/alerts",
    "https://attack.mitre.org/",
    "https://www.darkreading.com/",
    "https://threatpost.com/",
    "https://krebsonsecurity.com/",
    "https://www.bleepingcomputer.com/",
    "https://www.zdnet.com/topic/security/",
    "https://www.securityweek.com/",
    "https://www.sans.org/newsletters/newsbites/",
    "https://www.cyberscoop.com/",
    "https://www.csoonline.com/",
    "https://www.infosecurity-magazine.com/",
    "https://www.wired.com/category/security/",
    "https://www.schneier.com/",
    "https://www.theregister.com/security/",
    "https://thehackernews.com/",
    "https://www.cyberdefensemagazine.com/",
    "https://www.fireeye.com/blog.html",
    "https://unit42.paloaltonetworks.com/",
    "https://www.microsoft.com/security/blog/",
    "https://www.us-cert.gov/ncas/current-activity",
    "https://nakedsecurity.sophos.com/",
    "https://www.recordedfuture.com/blog/",
    "https://www.cybersecurity-insiders.com/",
    "https://www.malwarebytes.com/blog/",
]
RSS_FEEDS = [
    "https://www.cisa.gov/uscert/ncas/alerts.xml",
    "https://krebsonsecurity.com/feed/",
    "https://threatpost.com/feed/",
    "https://www.darkreading.com/rss_simple.asp"
]

  and should_run_async(code)


**Initialize Apify Client and Configure Requests Session**

In [4]:
# Initialize Apify client
apify_client = ApifyClient(APIFY_API_KEY)

# Configure requests session with retries and timeouts
session = requests.Session()
retries = Retry(total=5, backoff_factor=0.1, status_forcelist=[429, 500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))
session.mount('http://', HTTPAdapter(max_retries=retries))

  and should_run_async(code)


**Rate-Limited GET Request**

In [5]:
# Rate-limited GET request
@sleep_and_retry
@limits(calls=15, period=1)  # 5 calls per second
def rate_limited_get(url: str, **kwargs) -> requests.Response:
    return session.get(url, timeout=10, **kwargs)

  and should_run_async(code)


**Website Scraping Functions and Fetch Data Functions**

In [11]:
def scrape_website_with_apify(url: str) -> Dict[str, Any]:
    """Scrape a website using Apify."""
    logger.info(f"Scraping {url} with Apify...")
    try:
        actor_input = {"url": url, "proxyConfiguration": {"useApifyProxy": True}}
        run = apify_client.actor("apify/website-content-crawler").call(run_input=actor_input)
        dataset_id = run["defaultDatasetId"]
        items = apify_client.dataset(dataset_id).list_items().items
        if items:
            return {"url": url, "text": items[0].get("text", ""), "timestamp": datetime.now().isoformat()}
        else:
            return {"url": url, "text": "", "timestamp": datetime.now().isoformat(), "error": "No content found"}
    except Exception as e:
        logger.error(f"Error scraping {url} with Apify: {str(e)}")
        return {"url": url, "text": "", "timestamp": datetime.now().isoformat(), "error": str(e)}

def scrape_website(url: str) -> Dict[str, Any]:
    """Scrape a website using BeautifulSoup."""
    try:
        response = rate_limited_get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        text = soup.get_text(separator=' ', strip=True)
        return {"url": url, "text": text, "timestamp": datetime.now().isoformat()}
    except Exception as e:
        logger.error(f"Error scraping {url}: {str(e)}")
        return {"url": url, "text": "", "timestamp": datetime.now().isoformat(), "error": str(e)}

def scrape_websites(urls: List[str]) -> List[Dict[str, Any]]:
    """Scrape multiple websites concurrently."""
    logger.info(f"Scraping {len(urls)} websites...")
    with ThreadPoolExecutor(max_workers=10) as executor:
        future_to_url = {executor.submit(scrape_website, url): url for url in urls}
        results = [future.result() for future in as_completed(future_to_url)]
    logger.info(f"Successfully scraped {len(results)} pages.")
    return results

def fetch_tweets(query: str, max_tweets: int = 100) -> List[Dict[str, Any]]:
    """Fetch tweets using Apify's Twitter scraper."""
    logger.info(f"Fetching tweets for query: {query}")
    actor_input = {"searchTerms": [query], "maxTweets": max_tweets, "languageCode": "en"}
    try:
        run = apify_client.actor("apidojo/tweet-scraper").call(run_input=actor_input)
        dataset_id = run["defaultDatasetId"]
        items = apify_client.dataset(dataset_id).list_items().items
        logger.info(f"Fetched {len(items)} tweets.")
        return items
    except Exception as e:
        logger.error(f"Error fetching tweets: {str(e)}")
        return []

def fetch_news(query: str, max_results: int = 50) -> List[Dict[str, Any]]:
    """Fetch news articles using NewsAPI."""
    logger.info(f"Fetching news for query: {query}")
    url = "https://newsapi.org/v2/everything"
    params = {"q": query, "language": "en", "pageSize": max_results, "apiKey": NEWS_API_KEY, "sortBy": "publishedAt"}
    try:
        response = rate_limited_get(url, params=params)
        response.raise_for_status()
        articles = response.json().get("articles", [])
        logger.info(f"Fetched {len(articles)} news articles.")
        return articles
    except Exception as e:
        logger.error(f"Error fetching news: {str(e)}")
        return []

def fetch_cve_data() -> List[Dict[str, Any]]:
    """Fetch CVE data from CIRCL API."""
    logger.info("Fetching CVE data")
    url = "https://cve.circl.lu/api/last"
    try:
        response = rate_limited_get(url)
        response.raise_for_status()
        cve_items = response.json()
        logger.info(f"Fetched {len(cve_items)} CVE items.")
        return cve_items
    except Exception as e:
        logger.error(f"Error fetching CVE data: {str(e)}")
        return []

def fetch_rss_feeds(urls: List[str]) -> List[Dict[str, Any]]:
    """Fetch RSS feeds using Apify's RSS scraper."""
    logger.info(f"Fetching RSS feeds from {len(urls)} URLs")
    run_input = {"startUrls": urls, "maxItems": 50}
    try:
        run = apify_client.actor("jupri/rss-xml-scraper").call(run_input=run_input)
        dataset_id = run["defaultDatasetId"]
        items = apify_client.dataset(dataset_id).list_items().items
        logger.info(f"Fetched {len(items)} RSS feed items.")
        return items
    except Exception as e:
        logger.error(f"Error fetching RSS feeds: {str(e)}")
        return []

  and should_run_async(code)


**Curate Data Function**

In [14]:
def curate_data(website_data, tweets, news, cve_data, rss_feeds):
    """Curate data from various sources."""
    curated_data = []

    for page in website_data:
        curated_data.append({
            "source": "Website",
            "url": page.get("url"),
            "text": page.get("text"),
            "timestamp": page.get("timestamp")
        })

    for tweet in tweets:
        curated_data.append({
            "source": "Twitter",
            "text": tweet.get("text"),
            "user": tweet.get("user"),
            "timestamp": tweet.get("timestamp")
        })

    for article in news:
        curated_data.append({
            "source": "News",
            "url": article.get("url"),
            "title": article.get("title"),
            "description": article.get("description"),
            "timestamp": article.get("publishedAt")
        })

    for cve in cve_data:
        cve_meta = cve.get("cve", {}).get("CVE_data_meta", {})
        description_data = cve.get("cve", {}).get("description", {}).get("description_data", [{}])
        curated_data.append({
            "source": "CVE",
            "cve_id": cve_meta.get("ID"),
            "description": description_data[0].get("value"),
            "timestamp": cve.get("publishedDate")
        })

    for feed in rss_feeds:
        curated_data.append({
            "source": "RSS",
            "url": feed.get("link"),
            "title": feed.get("title"),
            "description": feed.get("description"),
            "timestamp": feed.get("pubDate")
        })

    return curated_data

  and should_run_async(code)


**Process Scraped Data**

In [15]:
def preprocess_item(item: Dict[str, Any]) -> Dict[str, Any]:
    """Preprocess a single data item."""
    processed_item = {
        "source": item["source"],
        "content": "",
        "timestamp": item.get("timestamp", ""),
        "keywords": [],
        "sentiment": 0
    }

    if item["source"] == "Website":
        processed_item["content"] = item.get("text", "")[:500]  # Truncate to first 500 characters
    elif item["source"] == "Twitter":
        processed_item["content"] = item.get("text", "")
    elif item["source"] in ["News", "RSS"]:
        processed_item["content"] = f"{item.get('title', '')} - {item.get('description', '')}"
    elif item["source"] == "CVE":
        processed_item["content"] = f"{item.get('cve_id', '')} - {item.get('description', '')}"

    processed_item["keywords"] = extract_keywords(processed_item["content"])
    processed_item["sentiment"] = perform_sentiment_analysis(processed_item["content"])

    return processed_item

def extract_keywords(text: str, top_n: int = 5) -> List[str]:
    """Extract top keywords from text."""
    words = text.lower().split()
    word_freq = {}
    for word in words:
        if len(word) > 3:  # Ignore short words
            word_freq[word] = word_freq.get(word, 0) + 1
    return sorted(word_freq, key=word_freq.get, reverse=True)[:top_n]

def perform_sentiment_analysis(text: str) -> float:
    """Perform sentiment analysis on text."""
    return TextBlob(text).sentiment.polarity

def process_curated_data(curated_data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Process all curated data items."""
    return [preprocess_item(item) for item in curated_data]

def store_in_vector_db(processed_data: List[Dict[str, Any]], file_path: str = "vector_store") -> None:
    """Store processed data in a vector database."""
    embeddings = HuggingFaceBgeEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    vector_store = FAISS.from_documents([{"page_content": item["content"], "metadata": item} for item in processed_data], embeddings)
    vector_store.save_local(file_path)

def store_in_kg(processed_data: List[Dict[str, Any]], file_path: str = "knowledge_graph.json") -> nx.Graph:
    """Store processed data in a knowledge graph."""
    graph = nx.Graph()

    for item in processed_data:
        source = item["source"]
        content = item["content"][:50]  # Use first 50 chars as node identifier
        keywords = item["keywords"]

        graph.add_node(source, type="source")
        graph.add_node(content, type="content", full_content=item["content"], sentiment=item["sentiment"])
        graph.add_edge(source, content)

        for keyword in keywords:
            graph.add_node(keyword, type="keyword")
            graph.add_edge(content, keyword)

    # Save the graph as JSON for easier inspection
    data = nx.node_link_data(graph)
    with open(file_path, 'w') as f:
        json.dump(data, f)

    return graph

  and should_run_async(code)


In [16]:
def main():
    """Main execution function."""
    # Define queries for cybersecurity-related data
    queries = [
        "cybersecurity", "BFSI security incidents", "ransomware attacks healthcare",
        "Lockbit Ransomware", "BlackBasta Ransomware", "latest cybersecurity incidents",
        "cyber threats", "data breaches", "malware", "phishing attacks",
        "network security", "cloud security", "cyber defense", "cybercrime",
        "information security", "vulnerability management", "threat intelligence",
        "incident response", "security awareness", "cybersecurity trends"
    ]

    # Gather data from various sources
    website_data = scrape_websites(WEBSITES)

    tweets = []
    for query in queries:
        tweets.extend(fetch_tweets(query))

    news_articles = []
    for query in queries:
        news_articles.extend(fetch_news(query))

    cve_data = fetch_cve_data()
    rss_data = fetch_rss_feeds(RSS_FEEDS)

    # Curate all collected data
    curated_data = curate_data(website_data, tweets, news_articles, cve_data, rss_data)

    # Process and analyze the curated data
    processed_data = process_curated_data(curated_data)

    # Store data in vector database and knowledge graph
    store_in_vector_db(processed_data)
    graph = store_in_kg(processed_data)

    logger.info(f"Processed {len(processed_data)} items.")
    logger.info(f"Knowledge graph has {graph.number_of_nodes()} nodes and {graph.number_of_edges()} edges.")

if __name__ == "__main__":
    main()

  and should_run_async(code)
ERROR:__main__:Error scraping https://www.securityweek.com/: 403 Client Error: Forbidden for url: https://www.securityweek.com/
ERROR:__main__:Error scraping https://www.bleepingcomputer.com/: 403 Client Error: Forbidden for url: https://www.bleepingcomputer.com/
ERROR:__main__:Error scraping https://www.theregister.com/security/: 403 Client Error: Forbidden for url: https://www.theregister.com/security/
ERROR:__main__:Error scraping https://www.us-cert.gov/ncas/current-activity: 404 Client Error: Not Found for url: https://www.cisa.gov/ncas/current-activity
ERROR:__main__:Error scraping https://www.cybersecurity-insiders.com/: 403 Client Error: Forbidden for url: https://www.cybersecurity-insiders.com/
ERROR:__main__:Error fetching news: 401 Client Error: Unauthorized for url: https://newsapi.org/v2/everything?q=cybersecurity&language=en&pageSize=50&sortBy=publishedAt
ERROR:__main__:Error fetching news: 401 Client Error: Unauthorized for url: https://newsa

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

AttributeError: 'dict' object has no attribute 'page_content'

**Initialize HuggingFace Embeddings and LLM**

In [10]:
# Initialize HuggingFace embeddings
embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-small-en",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)

# Initialize Llama-3.1 from Meta using Groq LPU Inference
llm = ChatGroq(
    temperature=0,
    model="llama-3.1-70b-versatile",
    api_key=GROQ_API_KEY
)

# Define system and human messages
system_message = """You are an expert cybersecurity analyst with extensive knowledge in threat analysis,
vulnerability assessment, and security recommendations. Provide detailed, precise, and actionable insights.
Always consider the latest threat intelligence and best practices in your analysis."""
prompt_template = ChatPromptTemplate.from_messages([("system", system_message), ("human", "{text}")])

  and should_run_async(code)
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/90.8k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

**Enhanced KnowledgeGraph Class**

In [None]:
class KnowledgeGraph:
    def __init__(self):
        self.graph = nx.DiGraph()

    def add_node(self, node, **attrs):
        self.graph.add_node(node, **attrs)

    def add_edge(self, u, v, **attrs):
        self.graph.add_edge(u, v, **attrs)

    def visualize(self, output_file):
        net = Network(notebook=True, height="750px", width="100%", bgcolor="#222222", font_color="white")
        net.set_options("""
        var options = {
          "nodes": {
            "shape": "dot",
            "size": 16,
            "font": {
              "size": 12
            }
          },
          "edges": {
            "color": {
              "inherit": true
            },
            "smooth": {
              "type": "continuous"
            }
          },
          "physics": {
            "forceAtlas2Based": {
              "gravitationalConstant": -50,
              "centralGravity": 0.01,
              "springLength": 230,
              "springConstant": 0.18
            },
            "maxVelocity": 50,
            "solver": "forceAtlas2Based",
            "timestep": 0.22,
            "stabilization": {
              "iterations": 150
            }
          }
        }
        """)

        for node in self.graph.nodes(data=True):
            node_id = node[0]
            node_attrs = node[1]
            title = node_attrs.get('title', node_id)
            color = node_attrs.get('color', '#00ff00')
            shape = node_attrs.get('shape', 'dot')
            size = node_attrs.get('size', 16)
            net.add_node(node_id, label=title, color=color, shape=shape, size=size)

        for edge in self.graph.edges(data=True):
            source = edge[0]
            target = edge[1]
            edge_attrs = edge[2]
            relation = edge_attrs.get('relation', '')
            color = edge_attrs.get('color', '#ffffff')
            net.add_edge(source, target, title=relation, color=color)

        net.show(output_file)
        logger.info(f"Knowledge graph visualized at {output_file}")

# Initialize knowledge graph
kg = KnowledgeGraph()
# Visualize the graph
kg.visualize("enhanced_knowledge_graph.html")

enhanced_knowledge_graph.html


  and should_run_async(code)


**Vector Store**

**Advanced Cybersecurity Analysis Tools**

In [None]:
# Advanced cybersecurity analysis tools
def analyze_cve_severity(cve_description: str) -> str:
    severity_keywords = {
        "critical": 10, "high": 7, "medium": 5, "low": 3,
        "remote code execution": 9, "privilege escalation": 8,
        "denial of service": 6, "information disclosure": 4
    }

    description_lower = cve_description.lower()
    max_severity = max(score for keyword, score in severity_keywords.items() if keyword in description_lower)

    if max_severity >= 9:
        return f"Critical (Score: {max_severity})"
    elif max_severity >= 7:
        return f"High (Score: {max_severity})"
    elif max_severity >= 4:
        return f"Medium (Score: {max_severity})"
    else:
        return f"Low (Score: {max_severity})"

def extract_iocs(text: str) -> Dict[str, List[str]]:
    iocs = {
        "ip_addresses": [],
        "domains": [],
        "hashes": [],
        "urls": []
    }

    # Regular expressions for different IOC types
    ip_pattern = r'\b(?:\d{1,3}\.){3}\d{1,3}\b'
    domain_pattern = r'\b(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}\b'
    hash_pattern = r'\b[a-fA-F0-9]{32,64}\b'
    url_pattern = r'https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&//=]*)'

    iocs["ip_addresses"] = re.findall(ip_pattern, text)
    iocs["domains"] = re.findall(domain_pattern, text)
    iocs["hashes"] = re.findall(hash_pattern, text)
    iocs["urls"] = re.findall(url_pattern, text)

    return iocs

def trend_analysis(data: List[Dict[str, Any]], timeframe: str) -> str:
    keywords = ["ransomware", "phishing", "data breach", "malware", "zero-day", "supply chain attack", "cloud security", "insider threat"]
    timeframe_days = {"week": 7, "month": 30, "3months": 90}

    if timeframe not in timeframe_days:
        return "Invalid timeframe. Please use 'week', 'month', or '3months'."

    cutoff_date = datetime.now() - timedelta(days=timeframe_days[timeframe])
    recent_data = [item for item in data if datetime.fromisoformat(item['timestamp']) > cutoff_date]

    keyword_counts = {keyword: sum(1 for item in recent_data if keyword in (item.get('text', '') + item.get('title', '') + item.get('description', '')).lower()) for keyword in keywords}

    sorted_trends = sorted(keyword_counts.items(), key=lambda x: x[1], reverse=True)
    trend_report = f"Top cybersecurity trends in the last {timeframe}:\n"
    trend_report += "\n".join(f"- {keyword.capitalize()}: {count} mentions" for keyword, count in sorted_trends)

    # Generate a bar plot of the trends
    plt.figure(figsize=(12, 6))
    sns.barplot(x=[item[0] for item in sorted_trends], y=[item[1] for item in sorted_trends])
    plt.title(f"Cybersecurity Trends - Last {timeframe}")
    plt.xlabel("Keywords")
    plt.ylabel("Mention Count")
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig(f"trend_analysis_{timeframe}.png")
    plt.close()

    trend_report += f"\n\nA bar plot of the trends has been saved as 'trend_analysis_{timeframe}.png'."

    return trend_report

# New function: Sentiment analysis
def sentiment_analysis(text: str) -> Dict[str, Any]:
    blob = TextBlob(text)
    sentiment = blob.sentiment
    return {
        "polarity": sentiment.polarity,
        "subjectivity": sentiment.subjectivity,
        "sentiment": "positive" if sentiment.polarity > 0 else "negative" if sentiment.polarity < 0 else "neutral"
    }

# New function: Topic modeling
def topic_modeling(texts: List[str], num_topics: int = 5) -> str:
    # Preprocess the texts
    texts = [re.sub(r'\s+', ' ', text.lower()) for text in texts]
    texts = [re.sub(r'[^\w\s]', '', text) for text in texts]

    # Create a dictionary and corpus
    dictionary = corpora.Dictionary([text.split() for text in texts])
    corpus = [dictionary.doc2bow(text.split()) for text in texts]

    # Build the LDA model
    lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=100, update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True)

    # Prepare the visualization
    vis = gensimvis.prepare(lda_model, corpus, dictionary)
    pyLDAvis.save_html(vis, 'lda_visualization.html')

    # Generate a summary of the topics
    topics_summary = "Topic Modeling Results:\n\n"
    for idx, topic in lda_model.print_topics(-1):
        topics_summary += f"Topic {idx}: {topic}\n"

    topics_summary += "\nAn interactive visualization of the topics has been saved as 'lda_visualization.html'."

    return topics_summary

  and should_run_async(code)


**Define Tools**

In [None]:
# Define the agent's tools
def define_tools(vector_store: FAISS, scraped_data: List[Dict[str, Any]]) -> List[Tool]:
    return [
        Tool(
            name="Search",
            func=lambda q: vector_store.similarity_search(q, k=3),
            description="Useful for searching information in the knowledge base"
        ),
        Tool(
            name="Summarize",
            func=lambda q: llm.predict(f"Summarize the following text:\n{q}"),
            description="Useful for summarizing long pieces of text"
        ),
        Tool(
            name="Analyze CVE Severity",
            func=analyze_cve_severity,
            description="Analyzes the severity of a CVE based on its description"
        ),
        Tool(
            name="Extract IOCs",
            func=extract_iocs,
            description="Extracts potential Indicators of Compromise (IOCs) from text"
        ),
        Tool(
            name="Trend Analysis",
            func=lambda timeframe: trend_analysis(scraped_data, timeframe),
            description="Analyzes cybersecurity trends over a given timeframe (week, month, or 3months)"
        ),
        Tool(
            name="Sentiment Analysis",
            func=sentiment_analysis,
            description="Analyzes the sentiment of a given text"
        ),
        Tool(
            name="Topic Modeling",
            func=lambda texts: topic_modeling(texts, num_topics=5),
            description="Performs topic modeling on a collection of texts"
        )
    ]

# Test agent tools
tools = define_tools(vector_store, scraped_data)
logger.info(f"Tools defined: {tools}")

  and should_run_async(code)


**Define Agent Nodes and Multi-Agent System**

In [None]:
# Define agent types
class AgentState(TypedDict):
    messages: Annotated[List[Dict[str, str]], "The messages in the conversation"]
    current_agent: Annotated[str, "The current agent processing the message"]
    scratchpad: Annotated[List[Dict[str, str]], "The agent's scratchpad"]

# Define agent nodes
def create_agent_node(role: str, system_message: str):
    def agent_function(state: AgentState, tools: List[Tool]):
        messages = state['messages']
        prompt = ChatPromptTemplate.from_messages([
            ("system", system_message),
            ("human", "{input}"),
            ("human", "Thought: {agent_scratchpad}")
        ])
        chain = LLMChain(llm=llm, prompt=prompt)
        result = chain.run(input=messages[-1]['content'], agent_scratchpad=state['scratchpad'])
        return {**state, "messages": messages + [{"role": "assistant", "content": result}]}
    return agent_function

researcher_agent = create_agent_node(
    "researcher",
    "You are a cybersecurity researcher. Your role is to gather and analyze information using the provided tools."
)

analyst_agent = create_agent_node(
    "analyst",
    "You are a cybersecurity analyst. Your role is to interpret data and provide insights based on the information gathered."
)

advisor_agent = create_agent_node(
    "advisor",
    "You are a cybersecurity advisor. Your role is to provide recommendations and action plans based on the analysis."
)

threat_hunter_agent = create_agent_node(
    "threat_hunter",
    "You are a threat hunter. Your role is to proactively search for hidden threats and advanced persistent threats (APTs) in the data. Use IOC extraction and analysis tools to identify potential compromises."
)

incident_responder_agent = create_agent_node(
    "incident_responder",
    "You are an incident responder. Your role is to analyze potential security incidents, provide immediate mitigation steps, and develop longer-term remediation plans."
)

# Define the agent selection function
def select_next_agent(state: AgentState):
    last_message = state['messages'][-1]['content'].lower()
    if "cve" in last_message or "vulnerability" in last_message:
        return "analyst"
    elif "recommend" in last_message or "mitigat" in last_message:
        return "advisor"
    elif "incident" in last_message or "attack" in last_message:
        return "incident_responder"
    elif "ransomware" in last_message or "threat" in last_message:
        return "threat_hunter"
    else:
        return "researcher"

# Create the multi-agent system
def create_multi_agent_system(tools: List[Tool]):
    workflow = StateGraph(AgentState)

    # Add agent nodes
    workflow.add_node("researcher", researcher_agent)
    workflow.add_node("analyst", analyst_agent)
    workflow.add_node("advisor", advisor_agent)
    workflow.add_node("threat_hunter", threat_hunter_agent)
    workflow.add_node("incident_responder", incident_responder_agent)

    # Add edges
    for node in ["researcher", "analyst", "advisor", "threat_hunter", "incident_responder"]:
        workflow.add_edge(node, select_next_agent)

    # Set the entrypoint
    workflow.set_entry_point("researcher")

    # Compile the graph
    return workflow.compile()

# Test multi-agent system
multi_agent_system = create_multi_agent_system(tools)
logger.info(f"Multi-agent system created")

  and should_run_async(code)


**Main Function**

In [None]:
# Enhanced main function
def main(scraped_data: List[Dict[str, Any]]):
    try:
        processed_texts = process_scraped_data(scraped_data)
        vector_store = create_vector_store(processed_texts)
        tools = define_tools(vector_store, scraped_data)
        multi_agent_system = create_multi_agent_system(tools)

        logger.info("Enhanced Cybersecurity Multi-Agent system initialized successfully.")

        memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

        queries = [
            "Assess the vulnerability CVE-2024-12345 in Windows Server.",
            "Provide a security recommendation for mitigating phishing attacks.",
            "List all details on BFSI security incidents in India.",
            "List all ransomware attacks targeting the healthcare industry in the last 7 days.",
            "Provide recent incidents related to Lockbit Ransomware gang.",
            "Provide recent incidents related to BlackBasta Ransomware."
        ]

        for query in queries:
            print(f"\nQuery: {query}")
            with get_openai_callback() as cb:
                initial_state = AgentState(
                    messages=[{"role": "human", "content": query}],
                    current_agent="researcher",
                    scratchpad=[]
                )
                final_state = multi_agent_system.invoke(initial_state)

                # Process and display the final response
                final_response = final_state['messages'][-1]['content']
                print(f"Response: {final_response}")

                # Update knowledge graph based on the response
                update_knowledge_graph([{"source": "User Query", "title": query}])

                logger.info(f"Tokens used: {cb.total_tokens}")
                logger.info(f"Cost of query: ${cb.total_cost:.4f}")

        # Generate final report
        generate_final_report(scraped_data, processed_texts)

    except Exception as e:
        logger.error(f"An error occurred: {str(e)}")

  and should_run_async(code)


**Entry Point**

In [None]:
if __name__ == "__main__":
    try:
        # Load your scraped data here
        with open('scraped_data.json', 'r') as f:
            scraped_data = json.load(f)
        main(scraped_data)
    except FileNotFoundError:
        logger.error("scraped_data.json file not found. Please ensure the file exists in the current directory.")
    except json.JSONDecodeError:
        logger.error("Error decoding JSON from scraped_data.json. Please ensure the file contains valid JSON.")
    except Exception as e:
        logger.error(f"An unexpected error occurred: {str(e)}")

  and should_run_async(code)
ERROR:__main__:scraped_data.json file not found. Please ensure the file exists in the current directory.


In [None]:
# Test data collection
scraped_data = scrape_websites(WEBSITES)
logger.info(f"Scraped data: {scraped_data}")

# Test data processing
processed_texts = process_scraped_data(scraped_data)
logger.info(f"Processed texts: {processed_texts}")

# Test vector store creation
vector_store = create_vector_store(processed_texts)
logger.info(f"Vector store created")

# Test agent tools
tools = define_tools(vector_store, scraped_data)
logger.info(f"Tools defined: {tools}")

# Test multi-agent system
multi_agent_system = create_multi_agent_system(tools)
logger.info(f"Multi-agent system created")


  and should_run_async(code)
ERROR:__main__:Error scraping https://www.securityweek.com/: 403 Client Error: Forbidden for url: https://www.securityweek.com/
ERROR:__main__:Error scraping https://www.bleepingcomputer.com/: 403 Client Error: Forbidden for url: https://www.bleepingcomputer.com/
ERROR:__main__:Error scraping https://www.theregister.com/security/: 403 Client Error: Forbidden for url: https://www.theregister.com/security/
ERROR:__main__:Error scraping https://www.cybersecurity-insiders.com/: 403 Client Error: Forbidden for url: https://www.cybersecurity-insiders.com/
ERROR:__main__:Error scraping https://www.us-cert.gov/ncas/current-activity: 404 Client Error: Not Found for url: https://www.cisa.gov/ncas/current-activity


KeyError: 'source'