<a href="https://colab.research.google.com/github/KaifAhmad1/code-test/blob/main/Enhanced_Cyber_Security_Copilot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Problem Statement

##### Task
Develop a co-pilot for threat researchers, security analysts, and professionals that addresses the limitations of current AI solutions like ChatGPT and Perplexity.

##### Current Challenges
1. **Generic Data**: Existing AI solutions provide generic information that lacks specificity.
2. **Context Understanding**: These solutions fail to understand and maintain context.
3. **Limited Information**: The data sources are often limited and not comprehensive.
4. **Single Source Dependency**: Relying on a single source of information reduces reliability and accuracy.
5. **Inadequate AI Models**: Current models do not meet the specialized needs of cybersecurity professionals.

##### Requirement
Create a chatbot capable of collecting and curating data from multiple sources, starting with search engines, and expanding to website crawling and Twitter scraping.

###### Technical Specifications
- **No Hallucinations**: Ensure the chatbot provides accurate and reliable information.
- **RAG (Retrieval-Augmented Generation)**: Use RAG to determine which connectors to use based on user inputs.
- **Query Chunking and Distribution**: Optimize the process of breaking down queries and distributing them across different sources.
- **Data Curation Steps**:
  1. Collect links from approximately 50 sources.
  2. Aggregate data from websites and Twitter.
  3. Curate data using a knowledge graph to find relationships and generate responses.
- **Chatbot Capabilities**: Answer queries such as:
  - "List all details on {{BFSI}} security incidents in {{India}}."
  - "List all ransomware attacks targeting the healthcare industry in {{last 7 days/last 3 months/last week/last month}}."
  - "Provide recent incidents related to Lockbit Ransomware gang / BlackBasta Ransomware."
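
The example queries above contain `{{...}}` placeholders and relative time windows. A minimal sketch of how they might be resolved before a query is sent to any connector is shown below. The placeholder names (`sector`, `country`), the helper names, and the 30-day approximation of a month are assumptions made for illustration; they are not part of the original specification.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed conversion table; a month is approximated as 30 days.
_UNIT_DAYS = {"day": 1, "week": 7, "month": 30}


def fill_template(template: str, values: dict) -> str:
    """Replace each {{name}} placeholder with values[name]; unknown names are left untouched."""
    def _substitute(match: re.Match) -> str:
        key = match.group(1).strip()
        return str(values.get(key, match.group(0)))

    return re.sub(r"\{\{(.*?)\}\}", _substitute, template)


def parse_time_window(phrase: str, now: datetime) -> datetime:
    """Convert 'last 7 days', 'last week' or 'last 3 months' into a cutoff datetime."""
    match = re.fullmatch(r"last\s+(?:(\d+)\s+)?(day|week|month)s?", phrase.strip().lower())
    if not match:
        raise ValueError(f"Unrecognized time window: {phrase!r}")
    count = int(match.group(1) or 1)  # "last week" means one week
    return now - timedelta(days=count * _UNIT_DAYS[match.group(2)])


if __name__ == "__main__":
    query = fill_template(
        "List all details on {{sector}} security incidents in {{country}}.",
        {"sector": "BFSI", "country": "India"},
    )
    print(query)
    print(parse_time_window("last 7 days", datetime.now(timezone.utc)))
```

The cutoff returned by `parse_time_window` can then be compared against each record's timestamp when filtering results.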

##### Goal
Develop a data collector that integrates multiple specific sources to enrich the knowledge base, enabling the model to better understand context and deliver accurate results. The solution should be modular, allowing customization and configuration of sources.
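
One way to read the modularity goal is as a registry of connectors that can be switched on or off by configuration. The sketch below is only an illustration of that idea: `SourceRegistry`, its method names, and the stub connectors in the usage example are hypothetical and do not appear elsewhere in this notebook.

```python
from typing import Any, Callable, Dict, List

# A connector takes a query string and returns a list of raw records.
Connector = Callable[[str], List[Dict[str, Any]]]


class SourceRegistry:
    """Keeps named connectors and lets each one be enabled or disabled."""

    def __init__(self) -> None:
        self._connectors: Dict[str, Connector] = {}
        self._enabled: Dict[str, bool] = {}

    def register(self, name: str, connector: Connector, enabled: bool = True) -> None:
        self._connectors[name] = connector
        self._enabled[name] = enabled

    def set_enabled(self, name: str, enabled: bool) -> None:
        if name not in self._connectors:
            raise KeyError(f"Unknown source: {name}")
        self._enabled[name] = enabled

    def collect(self, query: str) -> Dict[str, List[Dict[str, Any]]]:
        """Run every enabled connector; a failing connector yields an empty list."""
        results: Dict[str, List[Dict[str, Any]]] = {}
        for name, connector in self._connectors.items():
            if not self._enabled[name]:
                continue
            try:
                results[name] = connector(query)
            except Exception:
                results[name] = []
        return results


if __name__ == "__main__":
    registry = SourceRegistry()
    registry.register("news", lambda q: [{"title": f"stub article about {q}"}])
    registry.register("tweets", lambda q: [{"text": f"stub tweet about {q}"}], enabled=False)
    print(registry.collect("ransomware"))
```

With this shape, adding a new source means registering one more function rather than editing the orchestration code.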

##### Summary
The goal is to build an advanced, modular chatbot for cybersecurity professionals that overcomes the limitations of existing AI solutions by integrating multiple data sources and ensuring context-aware, accurate responses. The chatbot will utilize state-of-the-art techniques like RAG and knowledge graphs to provide comprehensive, curated information from diverse sources.
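
Before retrieval can work, collected text has to be split into pieces small enough to embed. The sketch below is a plain character-window splitter with overlap, written with the standard library only. It is a simplified stand-in for illustration, not the splitter of any particular library, and the default sizes are assumptions.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap must be in [0, chunk_size)")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early so the last window is not just a repeat of the previous overlap.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=1):
        print(chunk)
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of some duplicated text in the index.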


In [1]:
%pip install -q apify-client langchain langchain-community langchain-groq networkx pyvis spacy transformers pandas
%pip install -q sentence-transformers requests beautifulsoup4 ratelimit


In [3]:
import os
from datetime import datetime
from typing import List, Dict, Any
import logging
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from concurrent.futures import ThreadPoolExecutor, as_completed
from ratelimit import limits, sleep_and_retry
from bs4 import BeautifulSoup
from apify_client import ApifyClient
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain
from langchain_groq import ChatGroq
import networkx as nx
from pyvis.network import Network
import spacy
from transformers import pipeline
import json

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

In [4]:
# Constants
# API keys are read from environment variables so that no secret is stored in the notebook.
# The variable names below are assumed; set them before running this cell.
APIFY_API_KEY = os.getenv("APIFY_API_KEY")
NEWS_API_KEY = os.getenv("NEWS_API_KEY")
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
WEBSITES = [
    "https://www.cisa.gov/uscert/ncas/alerts",
    "https://attack.mitre.org/",
    "https://www.darkreading.com/",
    "https://threatpost.com/",
    "https://krebsonsecurity.com/",
    "https://www.bleepingcomputer.com/",
    "https://www.zdnet.com/topic/security/",
    "https://www.securityweek.com/",
    "https://www.sans.org/newsletters/newsbites/",
    "https://www.cyberscoop.com/",
    "https://www.csoonline.com/",
    "https://www.infosecurity-magazine.com/",
    "https://www.wired.com/category/security/",
    "https://www.schneier.com/",
    "https://www.theregister.com/security/",
    "https://thehackernews.com/",
    "https://www.cyberdefensemagazine.com/",
    "https://www.fireeye.com/blog.html",
    "https://unit42.paloaltonetworks.com/",
    "https://www.microsoft.com/security/blog/",
    "https://www.us-cert.gov/ncas/current-activity",
    "https://nakedsecurity.sophos.com/",
    "https://www.recordedfuture.com/blog/",
    "https://www.cybersecurity-insiders.com/",
    "https://www.malwarebytes.com/blog/",
]
RSS_FEEDS = [
    "https://www.cisa.gov/uscert/ncas/alerts.xml",
    "https://krebsonsecurity.com/feed/",
    "https://threatpost.com/feed/",
    "https://www.darkreading.com/rss_simple.asp"
]

In [5]:
# Initialize Apify client
apify_client = ApifyClient(APIFY_API_KEY)

# Configure requests session with retries and timeouts
session = requests.Session()
retries = Retry(total=5, backoff_factor=0.1, status_forcelist=[429, 500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))
session.mount('http://', HTTPAdapter(max_retries=retries))

In [6]:
# Initialize HuggingFace embeddings
embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-small-en",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)

# Initialize Llama-3.1 from Meta using Groq LPU Inference
llm = ChatGroq(
    temperature=0,
    model="llama-3.1-70b-versatile",
    api_key=GROQ_API_KEY
)

# Define system and human messages
system_message = """You are an expert cybersecurity analyst with extensive knowledge in threat analysis,
vulnerability assessment, and security recommendations. Provide detailed, precise, and actionable insights.
Always consider the latest threat intelligence and best practices in your analysis."""
prompt_template = ChatPromptTemplate.from_messages([("system", system_message), ("human", "{text}")])


In [7]:
import logging
from pyvis.network import Network
import networkx as nx

# Configure logger
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class KnowledgeGraph:
    def __init__(self):
        self.graph = nx.DiGraph()

    def add_node(self, node, **attrs):
        self.graph.add_node(node, **attrs)

    def add_edge(self, u, v, **attrs):
        self.graph.add_edge(u, v, **attrs)

    def visualize(self, output_file):
        net = Network(notebook=True)
        for node in self.graph.nodes(data=True):
            net.add_node(node[0], title=node[1].get('title', node[0]))
        for edge in self.graph.edges(data=True):
            net.add_edge(edge[0], edge[1], title=edge[2].get('relation', ''))
        net.show(output_file)
        logger.info(f"Knowledge graph visualized at {output_file}")

# Initialize knowledge graph
kg = KnowledgeGraph()
# Visualize the graph
kg.visualize("knowledge_graph.html")

knowledge_graph.html
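
The graph above is visualized while still empty. A minimal sketch of how scraped records could be turned into relationships is shown below. It uses only the standard library and produces `(source, relation, target)` triples. The `extract_triples` helper, the `"mentions"` relation label, and the `THREAT_ACTORS` watch list (taken from the groups named in the problem statement) are assumptions for illustration.

```python
import re
from typing import Dict, List, Tuple

CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,7}", re.IGNORECASE)
# Assumed watch list; extend it with whichever groups matter to the analyst.
THREAT_ACTORS = ("LockBit", "BlackBasta")


def extract_triples(record: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (source, "mentions", entity) triples for CVE IDs and known actors in a record."""
    text = record.get("text") or ""
    source = record.get("url") or record.get("source") or "unknown"
    triples: List[Tuple[str, str, str]] = []
    # Normalize CVE IDs to upper case and drop repeats within the same record.
    for cve_id in sorted({match.upper() for match in CVE_PATTERN.findall(text)}):
        triples.append((source, "mentions", cve_id))
    lowered = text.lower()
    for actor in THREAT_ACTORS:
        if actor.lower() in lowered:
            triples.append((source, "mentions", actor))
    return triples


if __name__ == "__main__":
    sample = {
        "url": "https://example.com/report",
        "text": "LockBit affiliates were seen exploiting cve-2023-4966.",
    }
    for triple in extract_triples(sample):
        print(triple)
```

Each triple maps onto the class above as `kg.add_node(source)`, `kg.add_node(target)`, and `kg.add_edge(source, target, relation=relation)`.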


In [8]:
# Rate-limited GET request
@sleep_and_retry
@limits(calls=15, period=1)  # At most 15 calls per second
def rate_limited_get(url: str, **kwargs) -> requests.Response:
    return session.get(url, timeout=10, **kwargs)

In [9]:
# Website scraping using Apify actor
def scrape_website_with_apify(url: str) -> Dict[str, Any]:
    logger.info(f"Scraping {url} with Apify...")
    try:
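        # The crawler takes its entry points as a list of {"url": ...} objects under "startUrls".
        actor_input = {
            "startUrls": [{"url": url}],
            "proxyConfiguration": {"useApifyProxy": True}
        }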
        run = apify_client.actor("apify/website-content-crawler").call(run_input=actor_input)
        dataset_id = run["defaultDatasetId"]
        items = apify_client.dataset(dataset_id).list_items().items
        if items:
            return {"url": url, "text": items[0].get("text", ""), "timestamp": datetime.now().isoformat()}
        else:
            return {"url": url, "text": "", "timestamp": datetime.now().isoformat(), "error": "No content found"}
    except Exception as e:
        logger.error(f"Error scraping {url} with Apify: {str(e)}")
        return {"url": url, "text": "", "timestamp": datetime.now().isoformat(), "error": str(e)}

# Website scraping
def scrape_website(url: str) -> Dict[str, Any]:
    try:
        response = rate_limited_get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        text = soup.get_text(separator=' ', strip=True)
        return {"url": url, "text": text, "timestamp": datetime.now().isoformat()}
    except Exception as e:
        logger.error(f"Error scraping {url}: {str(e)}")
        return {"url": url, "text": "", "timestamp": datetime.now().isoformat(), "error": str(e)}

def scrape_websites(urls: List[str]) -> List[Dict[str, Any]]:
    logger.info(f"Scraping {len(urls)} websites...")
    with ThreadPoolExecutor(max_workers=10) as executor:
        future_to_url = {executor.submit(scrape_website, url): url for url in urls}
        results = [future.result() for future in as_completed(future_to_url)]
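    # Failed pages still return a record (with an "error" key), so count successes separately.
    succeeded = sum(1 for result in results if "error" not in result)
    logger.info(f"Successfully scraped {succeeded} of {len(results)} pages.")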
    return results

# Fetch tweets
def fetch_tweets(query: str, max_tweets: int = 100) -> List[Dict[str, Any]]:
    logger.info(f"Fetching tweets for query: {query}")
    actor_input = {
        "searchTerms": [query],
        "maxTweets": max_tweets,
        "languageCode": "en"
    }
    try:
        run = apify_client.actor("apidojo/tweet-scraper").call(run_input=actor_input)
        dataset_id = run["defaultDatasetId"]
        items = apify_client.dataset(dataset_id).list_items().items
        logger.info(f"Fetched {len(items)} tweets.")
        return items
    except Exception as e:
        logger.error(f"Error fetching tweets: {str(e)}")
        return []

# Fetch news articles
def fetch_news(query: str, max_results: int = 50) -> List[Dict[str, Any]]:
    logger.info(f"Fetching news for query: {query}")
    url = "https://newsapi.org/v2/everything"
    params = {
        "q": query,
        "language": "en",
        "pageSize": max_results,
        "apiKey": NEWS_API_KEY,
        "sortBy": "publishedAt"
    }
    try:
        response = rate_limited_get(url, params=params)
        response.raise_for_status()
        articles = response.json().get("articles", [])
        logger.info(f"Fetched {len(articles)} news articles.")
        return articles
    except Exception as e:
        logger.error(f"Error fetching news: {str(e)}")
        return []

# Scrape Reddit
def scrape_reddit(query: str, max_results: int = 100) -> List[Dict[str, Any]]:
    logger.info(f"Scraping Reddit for: {query}")
    actor_input = {
        "searchTerms": [query],
        "maxPosts": max_results
    }
    try:
        run = apify_client.actor("comchat/reddit-api-scraper").call(run_input=actor_input)
        dataset_id = run["defaultDatasetId"]
        items = apify_client.dataset(dataset_id).list_items().items
        logger.info(f"Fetched {len(items)} Reddit posts.")
        return items
    except Exception as e:
        logger.error(f"Error scraping Reddit: {str(e)}")
        return []

# Fetch CVE data
def fetch_cve_data() -> List[Dict[str, Any]]:
    logger.info("Fetching CVE data")
    url = "https://cve.circl.lu/api/last"
    try:
        response = rate_limited_get(url)
        response.raise_for_status()
        cve_items = response.json()
        logger.info(f"Fetched {len(cve_items)} CVE items.")
        return cve_items
    except Exception as e:
        logger.error(f"Error fetching CVE data: {str(e)}")
        return []

# Fetch Google News articles
def fetch_google_news(query: str, max_results: int = 50) -> List[Dict[str, Any]]:
    logger.info(f"Fetching Google News for: {query}")
    actor_input = {
        "queries": query,
        "maxPagesPerQuery": max_results
    }
    try:
        run = apify_client.actor("lhotanova/google-news-scraper").call(run_input=actor_input)
        dataset_id = run["defaultDatasetId"]
        items = apify_client.dataset(dataset_id).list_items().items
        logger.info(f"Fetched {len(items)} Google News articles.")
        return items
    except Exception as e:
        logger.error(f"Error fetching Google News: {str(e)}")
        return []

# Fetch Bing search results
def fetch_bing_search(query: str, max_results: int = 50) -> List[Dict[str, Any]]:
    logger.info(f"Fetching Bing search results for: {query}")
    actor_input = {
        "queries": query,
        "maxPagesPerQuery": max_results
    }
    try:
        run = apify_client.actor("curious_coder/bing-search-scraper").call(run_input=actor_input)
        dataset_id = run["defaultDatasetId"]
        items = apify_client.dataset(dataset_id).list_items().items
        logger.info(f"Fetched {len(items)} Bing search results.")
        return items
    except Exception as e:
        logger.error(f"Error fetching Bing search results: {str(e)}")
        return []

# Fetch LinkedIn posts
def fetch_linkedin_posts(query: str, max_posts: int = 100) -> List[Dict[str, Any]]:
    logger.info(f"Fetching LinkedIn posts for query: {query}")
    actor_input = {
        "searchTerms": [query],
        "maxPosts": max_posts,
        "languageCode": "en"
    }
    try:
        run = apify_client.actor("curious_coder/linkedin-post-search-scraper").call(run_input=actor_input)
        dataset_id = run["defaultDatasetId"]
        items = apify_client.dataset(dataset_id).list_items().items
        logger.info(f"Fetched {len(items)} LinkedIn posts.")
        return items
    except Exception as e:
        logger.error(f"Error fetching LinkedIn posts: {str(e)}")
        return []

# Fetch RSS feeds
def fetch_rss_feeds(urls: List[str]) -> List[Dict[str, Any]]:
    logger.info(f"Fetching RSS feeds from {len(urls)} URLs")
    run_input = {
        "startUrls": urls,
        "maxItems": 50
    }
    try:
        run = apify_client.actor("jupri/rss-xml-scraper").call(run_input=run_input)
        dataset_id = run["defaultDatasetId"]
        items = apify_client.dataset(dataset_id).list_items().items
        logger.info(f"Fetched {len(items)} RSS feed items.")
        return items
    except Exception as e:
        logger.error(f"Error fetching RSS feeds: {str(e)}")
        return []
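
The Apify-based fetchers above all follow the same run-then-read-dataset pattern. A sketch of a shared helper is given below. `run_actor` is a hypothetical name, and the client is passed in as a parameter so that the function can be exercised without network access. It relies only on the `actor(...).call(run_input=...)` and `dataset(...).list_items().items` calls already used in this notebook.

```python
import logging
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def run_actor(client: Any, actor_id: str, run_input: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Run an actor and return the items of its default dataset, or [] on any failure."""
    try:
        run = client.actor(actor_id).call(run_input=run_input)
        items = client.dataset(run["defaultDatasetId"]).list_items().items
        logger.info("Fetched %d items from %s.", len(items), actor_id)
        return list(items)
    except Exception as exc:
        logger.error("Error running actor %s: %s", actor_id, exc)
        return []
```

With such a helper, a fetcher like `fetch_tweets` would reduce to building its input dictionary and calling `run_actor(apify_client, "apidojo/tweet-scraper", actor_input)`.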

In [None]:
# Curate data from various sources
def curate_data(website_data, tweets, news, reddit_posts, cve_data, google_news, bing_results, linkedin_posts, rss_feeds):
    curated_data = []

    # Process and curate data from websites
    for page in website_data:
        curated_data.append({
            "source": "Website",
            "url": page.get("url"),
            "text": page.get("text"),
            "timestamp": page.get("timestamp")
        })

    # Process and curate data from Twitter
    for tweet in tweets:
        curated_data.append({
            "source": "Twitter",
            "text": tweet.get("text"),
            "user": tweet.get("user"),
            "timestamp": tweet.get("timestamp")
        })

    # Process and curate data from news articles
    for article in news:
        curated_data.append({
            "source": "News",
            "url": article.get("url"),
            "title": article.get("title"),
            "description": article.get("description"),
            "timestamp": article.get("publishedAt")
        })

    # Process and curate data from Reddit posts
    for post in reddit_posts:
        curated_data.append({
            "source": "Reddit",
            "url": post.get("url"),
            "title": post.get("title"),
            "selftext": post.get("selftext"),
            "timestamp": post.get("created_utc")
        })

    # Process and curate data from CVE data
    for cve in cve_data:
        cve_meta = cve.get("cve", {}).get("CVE_data_meta", {})
        # Fall back to one empty entry when the description list is missing or empty.
        description_data = cve.get("cve", {}).get("description", {}).get("description_data") or [{}]
        curated_data.append({
            "source": "CVE",
            "cve_id": cve_meta.get("ID"),
            "description": description_data[0].get("value"),
            "timestamp": cve.get("publishedDate")
        })

    # Process and curate data from Google News articles
    for article in google_news:
        curated_data.append({
            "source": "Google News",
            "url": article.get("url"),
            "title": article.get("title"),
            "description": article.get("description"),
            "timestamp": article.get("publishedAt")
        })

    # Process and curate data from Bing search results
    for item in bing_results:
        curated_data.append({
            "source": "Bing",
            "url": item.get("url"),
            "title": item.get("title"),
            "snippet": item.get("snippet"),
            "timestamp": item.get("timestamp")
        })

    # Process and curate data from LinkedIn posts
    for post in linkedin_posts:
        curated_data.append({
            "source": "LinkedIn",
            "text": post.get("text"),
            "user": post.get("user"),
            "timestamp": post.get("timestamp")
        })

    # Process and curate data from RSS feeds
    for feed in rss_feeds:
        curated_data.append({
            "source": "RSS",
            "url": feed.get("link"),
            "title": feed.get("title"),
            "description": feed.get("description"),
            "timestamp": feed.get("pubDate")
        })

    return curated_data
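
`curate_data` merges the sources but leaves duplicates and mixed timestamp formats in place. A minimal follow-up pass is sketched below. The helper names are hypothetical, timestamps without a timezone are assumed to be UTC, and formats other than ISO 8601 strings or Unix epoch numbers (for example RSS `pubDate` values) are treated as unknown.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def parse_timestamp(value: Any) -> Optional[datetime]:
    """Best-effort parse of ISO 8601 strings or Unix epoch seconds; None if unparseable."""
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    try:
        parsed = datetime.fromisoformat(str(value).replace("Z", "+00:00"))
    except ValueError:
        return None
    # Assumption: naive timestamps are interpreted as UTC.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def dedupe_records(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first record per URL, or per (source, text/title) when no URL exists."""
    seen = set()
    unique = []
    for record in records:
        key = record.get("url") or (record.get("source"), record.get("text") or record.get("title"))
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique


def filter_since(records: List[Dict[str, Any]], cutoff: datetime) -> List[Dict[str, Any]]:
    """Keep records whose timestamp parses and is not older than the cutoff."""
    kept = []
    for record in records:
        timestamp = parse_timestamp(record.get("timestamp"))
        if timestamp is not None and timestamp >= cutoff:
            kept.append(record)
    return kept
```

Records with unparseable timestamps are dropped by `filter_since`; whether to keep them instead is a design choice that depends on how strictly time-bounded queries should be answered.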

In [None]:
# Define tags and queries
tags = [
    "malware", "ransomware", "threat", "cybersecurity", "phishing",
    "data breach", "DDoS attack", "APT", "zero-day", "exploit",
    "vulnerability", "incident response", "threat intelligence",
    "SIEM", "EDR", "XDR", "cloud security", "IoT security",
    "AI security", "blockchain security", "cryptography",
    "network security", "application security", "DevSecOps",
    "container security", "Kubernetes security", "SOAR",
    "threat hunting", "OSINT", "penetration testing",
    "red teaming", "blue teaming", "purple teaming",
    "cyber insurance", "compliance", "GDPR", "HIPAA",
    "PCI DSS", "NIST", "ISO 27001", "zero trust",
    "passwordless", "biometrics", "MFA", "IAM", "PAM",
    "cyber resilience", "cyber hygiene", "security awareness",
    "social engineering", "insider threat", "supply chain attack",
    "quantum computing", "post-quantum cryptography", "5G security",
    "OT security", "ICS security", "SCADA security", "mobile security",
    "endpoint security", "email security", "web security",
    "API security", "CASB", "CWPP", "CSPM", "CNAPP",
    "cyber warfare", "cyber espionage", "hacktivism", "cyber terrorism",
    "cyber crime", "dark web", "threat actor", "nation-state attack",
    "latest cybersecurity incidents", "recent cyber attacks", "real-time threats",
    "emerging vulnerabilities", "critical infrastructure security", "cyber defense",
    "cybersecurity trends", "cybersecurity news", "cybersecurity alerts",
    "cybersecurity updates", "cybersecurity bulletins", "cybersecurity advisories",
    "cybersecurity reports", "cybersecurity analysis", "cybersecurity research"
]

queries = [
    "cybersecurity threats",
    "vulnerability assessment",
    "latest security updates",
    "List all details on {{BFSI}} security incidents in {{India}}.",
    "List all ransomware attacks targeting the healthcare industry in {{last 7 days/last 3 months/last week/last month}}.",
    "Provide recent incidents related to Lockbit Ransomware gang / BlackBasta Ransomware.",
    "Recent data breaches",
    "Latest phishing campaigns",
    "Real-time cybersecurity alerts",
    "Emerging cyber threats",
    "Critical infrastructure security incidents",
    "Recent DDoS attacks",
    "Latest zero-day vulnerabilities",
    "Recent APT activities",
    "Latest cybersecurity news",
    "Recent cybersecurity trends",
    "Latest cybersecurity advisories",
    "Recent cybersecurity bulletins",
    "Latest cybersecurity reports",
    "Recent cybersecurity research",
    "Latest cybersecurity analysis"
]

# Main function to orchestrate the process
def main():
    # These sources do not depend on the query, so fetch them once rather than once per query.
    website_data = scrape_websites(WEBSITES)
    cve_data = fetch_cve_data()
    rss_feeds = fetch_rss_feeds(RSS_FEEDS)

    for query in queries:
        tweets = fetch_tweets(query)
        news = fetch_news(query)
        reddit_posts = scrape_reddit(query)
        google_news = fetch_google_news(query)
        bing_results = fetch_bing_search(query)
        linkedin_posts = fetch_linkedin_posts(query)

        curated_data = curate_data(
            website_data, tweets, news, reddit_posts, cve_data,
            google_news, bing_results, linkedin_posts, rss_feeds
        )

        for data in curated_data:
            print(json.dumps(data, indent=2))

if __name__ == "__main__":
    main()