<a href="https://colab.research.google.com/github/KaifAhmad1/code-test/blob/main/Enhanced_Cyber_Security_Copilot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Problem Statement

##### Task
Develop a co-pilot for threat researchers, security analysts, and professionals that addresses the limitations of current AI solutions like ChatGPT and Perplexity.

##### Current Challenges
1. **Generic Data**: Existing AI solutions provide generic information that lacks specificity.
2. **Context Understanding**: These solutions fail to understand and maintain context.
3. **Limited Information**: The data sources are often limited and not comprehensive.
4. **Single Source Dependency**: Relying on a single source of information reduces reliability and accuracy.
5. **Inadequate AI Models**: Current models do not meet the specialized needs of cybersecurity professionals.

##### Requirement
Create a chatbot capable of collecting and curating data from multiple sources, starting with search engines, and expanding to website crawling and Twitter scraping.

###### Technical Specifications
- **No Hallucinations**: Ensure the chatbot provides accurate and reliable information.
- **RAG (Retrieval-Augmented Generation)**: Use RAG to determine which connectors to use based on user inputs.
- **Query Chunking and Distribution**: Optimize the process of breaking down queries and distributing them across different sources.
- **Data Curation Steps**:
  1. Collect links from approximately 50 sources.
  2. Aggregate data from websites and Twitter.
  3. Curate data using a knowledge graph to find relationships and generate responses.
- **Chatbot Capabilities**: Answer queries such as:
  - "List all details on {{BFSI}} security incidents in {{India}}."
  - "List all ransomware attacks targeting the healthcare industry in {{last 7 days/last 3 months/last week/last month}}."
  - "Provide recent incidents related to Lockbit Ransomware gang / BlackBasta Ransomware."

##### Goal
Develop a data collector that integrates multiple specific sources to enrich the knowledge base, enabling the model to better understand context and deliver accurate results. The solution should be modular, allowing customization and configuration of sources.

##### Summary
The goal is to build an advanced, modular chatbot for cybersecurity professionals that overcomes the limitations of existing AI solutions by integrating multiple data sources and ensuring context-aware, accurate responses. The chatbot will utilize state-of-the-art techniques like RAG and knowledge graphs to provide comprehensive, curated information from diverse sources.


In [1]:
%pip install -q apify-client langchain langchain-community langchain-groq networkx pyvis spacy transformers pandas
%pip install -q sentence-transformers requests beautifulsoup4 ratelimit

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.4/50.4 kB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m70.6/70.6 kB[0m [31m6.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m990.3/990.3 kB[0m [31m39.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.3/2.3 MB[0m [31m58.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m756.0/756.0 kB[0m [31m37.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m103.5/103.5 kB[0m [31m10.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.6/75.6 kB[0m [31m7.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m77.9/77.9 kB[0m [31m8.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [3]:
import os
from datetime import datetime, timedelta
from typing import List, Dict, Any
import logging
import re
from apify_client import ApifyClient
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_groq import ChatGroq
import networkx as nx
from pyvis.network import Network
import spacy
from transformers import pipeline
import requests
from bs4 import BeautifulSoup

from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from concurrent.futures import ThreadPoolExecutor, as_completed
from ratelimit import limits, sleep_and_retry

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

import os
apify_client = ApifyClient("apify_api_yUkcz99gMX1pwNckRi7EyXLwhVTd0j3m4Mtt")

# Configure requests session with retries and timeouts
session = requests.Session()
retries = Retry(total=5, backoff_factor=0.1, status_forcelist=[429, 500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))
session.mount('http://', HTTPAdapter(max_retries=retries))

In [4]:
# Initialize HuggingFace embeddings
embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-small-en",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)

# Initialize Llama-3.1 from Meta using Groq LPU Inference
llm = ChatGroq(
    temperature=0,
    model="llama-3.1-70b-versatile",
    api_key="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"
)

# Define system and human messages
system_message = """You are an expert cybersecurity analyst with extensive knowledge in threat analysis,
vulnerability assessment, and security recommendations. Provide detailed, precise, and actionable insights.
Always consider the latest threat intelligence and best practices in your analysis."""
prompt_template = ChatPromptTemplate.from_messages([("system", system_message), ("human", "{text}")])

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/90.8k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [5]:
class KnowledgeGraph:
    def __init__(self):
        self.graph = nx.Graph()

    def add_entity(self, entity: str, entity_type: str):
        self.graph.add_node(entity, type=entity_type)
        logger.info(f"Added entity: {entity} of type: {entity_type}")

    def add_relation(self, entity1: str, entity2: str, relation: str):
        self.graph.add_edge(entity1, entity2, relation=relation)
        logger.info(f"Added relation: {relation} between {entity1} and {entity2}")

    def get_related_entities(self, entity: str) -> List[Dict[str, str]]:
        related = [{"entity": neighbor, "relation": self.graph.get_edge_data(entity, neighbor)["relation"]}
                   for neighbor in self.graph.neighbors(entity)]
        logger.info(f"Related entities for {entity}: {related}")
        return related

    def visualize(self, output_file: str = "knowledge_graph.html"):
        net = Network(notebook=True, width="100%", height="500px")
        for node, node_data in self.graph.nodes(data=True):
            net.add_node(node, label=node, title=f"Type: {node_data['type']}")
        for edge in self.graph.edges(data=True):
            net.add_edge(edge[0], edge[1], title=edge[2]['relation'])
        net.show(output_file)
        logger.info(f"Knowledge graph visualized at {output_file}")

# Initialize knowledge graph
kg = KnowledgeGraph()
kg.visualize()

knowledge_graph.html


In [6]:
websites = [
    "https://www.cisa.gov/uscert/ncas/alerts",
    "https://attack.mitre.org/",
    "https://www.darkreading.com/",
    "https://threatpost.com/",
    "https://krebsonsecurity.com/",
    "https://www.bleepingcomputer.com/",
    "https://www.zdnet.com/topic/security/",
    "https://www.securityweek.com/",
    "https://www.sans.org/newsletters/newsbites/",
    "https://www.cyberscoop.com/",
    "https://www.csoonline.com/",
    "https://www.infosecurity-magazine.com/",
    "https://www.wired.com/category/security/",
    "https://www.schneier.com/",
    "https://www.theregister.com/security/",
    "https://thehackernews.com/",
    "https://www.cyberdefensemagazine.com/",
    "https://www.fireeye.com/blog.html",
    "https://unit42.paloaltonetworks.com/",
    "https://www.microsoft.com/security/blog/",
    "https://www.us-cert.gov/ncas/current-activity",
    "https://nakedsecurity.sophos.com/",
    "https://www.recordedfuture.com/blog/",
    "https://www.cybersecurity-insiders.com/",
    "https://www.malwarebytes.com/blog/",
]

In [7]:
@sleep_and_retry
@limits(calls=5, period=1)  # 5 calls per second
def rate_limited_get(url: str, **kwargs) -> requests.Response:
    return session.get(url, timeout=10, **kwargs)

def search_google(query: str, max_results: int = 50) -> List[Dict[str, Any]]:
    logger.info(f"Searching Google for: {query}")
    run_input = {
        "queries": [query],
        "maxPagesPerQuery": max_results,
        "mobileResults": False,
        "languageCode": "en",
    }
    try:
        run = apify_client.actor("apify/google-search-scraper").call(run_input=run_input)
        dataset_items = apify_client.dataset(run["defaultDatasetId"]).list_items().items
        logger.info(f"Found {len(dataset_items)} Google search results.")
        return dataset_items
    except Exception as e:
        logger.error(f"Error searching Google: {str(e)}")
        return []

def scrape_website(url: str) -> Dict[str, Any]:
    try:
        response = rate_limited_get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        text = soup.get_text(separator=' ', strip=True)
        return {"url": url, "text": text, "timestamp": datetime.now().isoformat()}
    except Exception as e:
        logger.error(f"Error scraping {url}: {str(e)}")
        return {"url": url, "text": "", "timestamp": datetime.now().isoformat(), "error": str(e)}

def scrape_websites(urls: List[str]) -> List[Dict[str, Any]]:
    logger.info(f"Scraping {len(urls)} websites...")
    with ThreadPoolExecutor(max_workers=10) as executor:
        future_to_url = {executor.submit(scrape_website, url): url for url in urls}
        results = []
        for future in as_completed(future_to_url):
            results.append(future.result())
    logger.info(f"Successfully scraped {len(results)} pages.")
    return results

def fetch_tweets(query: str, max_tweets: int = 100) -> List[Dict[str, Any]]:
    logger.info(f"Fetching tweets for query: {query}")
    actor_input = {
        "searchTerms": [query],
        "maxTweets": max_tweets,
        "languageCode": "en"
    }
    try:
        run = apify_client.actor("quacker/twitter-scraper").call(run_input=actor_input)
        dataset_id = run["defaultDatasetId"]
        items = apify_client.dataset(dataset_id).list_items().items
        logger.info(f"Fetched {len(items)} tweets.")
        return items
    except Exception as e:
        logger.error(f"Error fetching tweets: {str(e)}")
        return []

def fetch_news(query: str, max_results: int = 50) -> List[Dict[str, Any]]:
    logger.info(f"Fetching news for query: {query}")
    url = "https://newsapi.org/v2/everything"
    params = {
        "q": query,
        "language": "en",
        "pageSize": max_results,
        "apiKey": os.getenv("NEWS_API_KEY"),
        "sortBy": "publishedAt"
    }
    try:
        response = rate_limited_get(url, params=params)
        response.raise_for_status()
        articles = response.json().get("articles", [])
        logger.info(f"Fetched {len(articles)} news articles.")
        return articles
    except Exception as e:
        logger.error(f"Error fetching news: {str(e)}")
        return []

def scrape_reddit(query: str, max_results: int = 50) -> List[Dict[str, Any]]:
    logger.info(f"Scraping Reddit for query: {query}")
    run_input = {
        "subreddits": ["cybersecurity", "netsec", "security", "hacking", "infosec"],
        "searchType": "posts",
        "searchQuery": query,
        "maxItems": max_results
    }
    try:
        run = apify_client.actor("apify/reddit-scraper").call(run_input=run_input)
        dataset_items = apify_client.dataset(run["defaultDatasetId"]).list_items().items
        logger.info(f"Scraped {len(dataset_items)} Reddit posts.")
        return dataset_items
    except Exception as e:
        logger.error(f"Error scraping Reddit: {str(e)}")
        return []

def fetch_cve_data(days_back: int = 30) -> List[Dict[str, Any]]:
    logger.info(f"Fetching CVE data for the last {days_back} days")
    base_url = "https://services.nvd.nist.gov/rest/json/cves/1.0"
    start_date = (datetime.now() - timedelta(days=days_back)).strftime("%Y-%m-%dT00:00:00:000 UTC-00:00")
    params = {
        "pubStartDate": start_date,
        "resultsPerPage": 2000
    }
    try:
        response = rate_limited_get(base_url, params=params)
        response.raise_for_status()
        cve_data = response.json().get("result", {}).get("CVE_Items", [])
        logger.info(f"Fetched {len(cve_data)} CVE entries.")
        return cve_data
    except Exception as e:
        logger.error(f"Error fetching CVE data: {str(e)}")
        return []

In [8]:
def clean_text(text: str) -> str:
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text.lower()

def tag_content(content: List[Dict[str, Any]], tags: List[str]) -> List[Dict[str, Any]]:
    tagged_content = []
    for item in content:
        cleaned_text = clean_text(item.get("text", ""))
        item_tags = [tag for tag in tags if tag.lower() in cleaned_text]
        tagged_content.append({**item, "tags": item_tags})
    return tagged_content

def save_to_json(data: Dict[str, Any], filename: str):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

def collect_data(query: str, tags: List[str]):
    google_results = search_google(query)
    urls = websites + [result['url'] for result in google_results]
    website_content = scrape_websites(urls)
    tweets = fetch_tweets(query)
    news = fetch_news(query)
    reddit_posts = scrape_reddit(query)
    cve_data = fetch_cve_data()

    tagged_google_results = tag_content(google_results, tags)
    tagged_website_content = tag_content(website_content, tags)
    tagged_tweets = tag_content(tweets, tags)
    tagged_news = tag_content(news, tags)
    tagged_reddit_posts = tag_content(reddit_posts, tags)
    tagged_cve_data = tag_content(cve_data, tags)

    collected_data = {
        "google_results": tagged_google_results,
        "website_content": tagged_website_content,
        "tweets": tagged_tweets,
        "news": tagged_news,
        "reddit_posts": tagged_reddit_posts,
        "cve_data": tagged_cve_data,
        "collection_timestamp": datetime.now().isoformat()
    }

    save_to_json(collected_data, f"cybersecurity_data_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json")
    return collected_data

In [10]:
import json
# Example usage
if __name__ == "__main__":
    query = "latest cybersecurity threats"
    tags = [
        "malware", "ransomware", "threat", "cybersecurity", "phishing",
        "data breach", "DDoS attack", "APT", "zero-day", "exploit",
        "vulnerability", "spyware", "adware", "rootkit", "backdoor",
        "botnet", "keylogger", "Trojans", "worms", "virus",
        "incident response", "threat intelligence", "SIEM", "EDR",
        "XDR", "cloud security", "IoT security", "AI security",
        "blockchain security", "cryptography", "network security",
        "application security", "DevSecOps", "container security",
        "Kubernetes security", "SOAR", "threat hunting", "OSINT",
        "penetration testing", "red teaming", "blue teaming",
        "purple teaming", "cyber insurance", "compliance", "GDPR",
        "HIPAA", "PCI DSS", "NIST", "ISO 27001", "zero trust",
        "passwordless", "biometrics", "MFA", "IAM", "PAM",
        "cyber resilience", "cyber hygiene", "security awareness",
        "social engineering", "insider threat", "supply chain attack",
        "quantum computing", "post-quantum cryptography", "5G security",
        "OT security", "ICS security", "SCADA security", "mobile security",
        "endpoint security", "email security", "web security",
        "API security", "CASB", "CWPP", "CSPM", "CNAPP",
        "cyber warfare", "cyber espionage", "hacktivism", "cyber terrorism",
        "cyber crime", "dark web", "threat actor", "nation-state attack"
    ]
    data = collect_data(query, tags)
    print("Data collection complete. Results saved to JSON file.")

ERROR:__main__:Error searching Google: Input is not valid: Field input.queries must be string
ERROR:__main__:Error scraping https://www.securityweek.com/: 403 Client Error: Forbidden for url: https://www.securityweek.com/
ERROR:__main__:Error scraping https://www.bleepingcomputer.com/: 403 Client Error: Forbidden for url: https://www.bleepingcomputer.com/
ERROR:__main__:Error scraping https://www.theregister.com/security/: 403 Client Error: Forbidden for url: https://www.theregister.com/security/
ERROR:__main__:Error scraping https://www.us-cert.gov/ncas/current-activity: 404 Client Error: Not Found for url: https://www.cisa.gov/ncas/current-activity
ERROR:__main__:Error scraping https://www.cybersecurity-insiders.com/: 403 Client Error: Forbidden for url: https://www.cybersecurity-insiders.com/
ERROR:__main__:Error fetching news: 401 Client Error: Unauthorized for url: https://newsapi.org/v2/everything?q=latest+cybersecurity+threats&language=en&pageSize=50&sortBy=publishedAt
ERROR:__m

Data collection complete. Results saved to JSON file.
