# 📊 Dataset Acquisition for SentinelGem

**Author:** Muzan Sano  
**Purpose:** Download and prepare real cybersecurity datasets for SentinelGem training and testing

This notebook catalogs candidate datasets for our multimodal cybersecurity analysis requirements, lists the Kaggle CLI commands to download them, and sets up the directory layout they will be extracted into:

### 🎯 Target Dataset Categories
- **📧 Phishing Email Detection** - Real phishing samples vs legitimate emails
- **🖼️ Screenshot/Image Phishing** - Fake websites, UI spoofing attempts
- **🎤 Social Engineering Audio** - Voice-based attacks and scam calls
- **📋 Malware Log Analysis** - System logs with malware activity
- **🕵️ Network Traffic** - Suspicious network behavior patterns
- **🔍 URL/Domain Analysis** - Malicious URLs and domain characteristics

---

In [None]:
# Setup and imports
import os
import sys
import json
import requests
import pandas as pd
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Any

# Add project root to path
project_root = Path.cwd().parent
sys.path.append(str(project_root))

print("🛡️ SentinelGem Dataset Acquisition Started")
print(f"📅 Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"📂 Project Root: {project_root}")
print(f"🐍 Python Version: {sys.version}")
print("="*60)

## 🔍 Kaggle Dataset Search

Let's search for cybersecurity datasets on Kaggle that match our needs:

In [None]:
# Kaggle search terms for cybersecurity datasets
search_terms = [
    "phishing email detection",
    "malware detection",
    "network intrusion detection", 
    "social engineering",
    "cybersecurity logs",
    "phishing websites",
    "spam detection",
    "malicious urls",
    "voice phishing",
    "cyber threat detection"
]

print("🔍 Search terms used to find relevant cybersecurity datasets on Kaggle:")
print(f"📋 {', '.join(search_terms)}")

# The Kaggle API requires authentication, so this cell does not query it.
# The dataset entries below were collected by hand.
recommended_datasets = [
    {
        "name": "phishing-site-urls",
        "title": "Phishing Site URLs", 
        "url": "https://www.kaggle.com/datasets/taruntiwarihp/phishing-site-urls",
        "description": "Collection of phishing and legitimate URLs for detection training",
        "size": "~50MB",
        "modality": "text/url",
        "relevance": "high"
    },
    {
        "name": "malware-detection", 
        "title": "Malware Detection Dataset",
        "url": "https://www.kaggle.com/datasets/xwolf12/malware-detection",
        "description": "Malware samples and system behavior logs",
        "size": "~200MB",
        "modality": "logs/binary",
        "relevance": "high"
    },
    {
        "name": "spam-email-detection",
        "title": "Email Spam Detection Dataset", 
        "url": "https://www.kaggle.com/datasets/nitishabharathi/email-spam-dataset",
        "description": "Email content for spam/phishing detection",
        "size": "~25MB",
        "modality": "text/email",
        "relevance": "high"
    },
    {
        "name": "network-intrusion-detection",
        "title": "Network Intrusion Detection",
        "url": "https://www.kaggle.com/datasets/sampadab17/network-intrusion-detection", 
        "description": "Network traffic patterns for intrusion detection",
        "size": "~100MB",
        "modality": "network/logs",
        "relevance": "medium"
    },
    {
        "name": "android-malware-detection",
        "title": "Android Malware Detection",
        "url": "https://www.kaggle.com/datasets/shashwatwork/android-malware-detection-using-machine-learning",
        "description": "Android app features for malware detection", 
        "size": "~15MB",
        "modality": "features/mobile",
        "relevance": "medium"
    }
]

# Display dataset recommendations
print("\n📊 Recommended Datasets for SentinelGem:")
print("="*60)

for i, dataset in enumerate(recommended_datasets, 1):
    print(f"\n{i}. **{dataset['title']}**")
    print(f"   🔗 URL: {dataset['url']}")
    print(f"   📝 Description: {dataset['description']}")
    print(f"   📏 Size: {dataset['size']}")
    print(f"   🎯 Modality: {dataset['modality']}")
    print(f"   ⭐ Relevance: {dataset['relevance']}")
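
Each catalog entry stores a full URL, while the Kaggle CLI expects an `owner/dataset` slug. The sketch below shows one way to derive the slug from the URL so the download command does not have to be typed twice. The helper name `kaggle_slug` is our own and is not part of any Kaggle library; it assumes URLs of the form `/datasets/<owner>/<dataset>`.

```python
from urllib.parse import urlparse


def kaggle_slug(url: str) -> str:
    """Return the 'owner/dataset' slug from a Kaggle dataset URL (hypothetical helper)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected path shape: /datasets/<owner>/<dataset>
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


# Sample entry copied from the catalog above
sample = {"url": "https://www.kaggle.com/datasets/taruntiwarihp/phishing-site-urls"}
slug = kaggle_slug(sample["url"])
command = f"kaggle datasets download -d {slug}"
print(command)  # kaggle datasets download -d taruntiwarihp/phishing-site-urls
```

Raising on an unexpected URL shape keeps a mistyped catalog entry from silently turning into a bad download command.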

## 🎵 Audio Dataset Search

For social engineering voice detection, we need audio datasets:

In [None]:
# Audio datasets for social engineering detection
audio_datasets = [
    {
        "name": "common-voice",
        "title": "Mozilla Common Voice",
        "url": "https://www.kaggle.com/datasets/mozillaorg/common-voice", 
        "description": "Large-scale voice dataset - can be used to simulate social engineering scenarios",
        "size": "~50GB (subset available)",
        "modality": "audio/speech",
        "use_case": "Voice pattern analysis, speech synthesis detection"
    },
    {
        "name": "speech-emotion-recognition",
        "title": "Speech Emotion Recognition",
        "url": "https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio",
        "description": "Emotional speech patterns - useful for detecting manipulative speech",
        "size": "~200MB", 
        "modality": "audio/emotion",
        "use_case": "Emotional manipulation detection in social engineering"
    },
    {
        "name": "voice-gender-detection",
        "title": "Voice Gender Detection",
        "url": "https://www.kaggle.com/datasets/primaryobjects/voicegender",
        "description": "Voice characteristics dataset for pattern analysis",
        "size": "~5MB",
        "modality": "audio/features", 
        "use_case": "Voice spoofing and impersonation detection"
    }
]

print("🎤 Audio Datasets for Social Engineering Detection:")
print("="*60)

for i, dataset in enumerate(audio_datasets, 1):
    print(f"\n{i}. **{dataset['title']}**")
    print(f"   🔗 URL: {dataset['url']}")
    print(f"   📝 Description: {dataset['description']}")
    print(f"   📏 Size: {dataset['size']}")
    print(f"   🎯 Use Case: {dataset['use_case']}")

## 🖼️ Image/Screenshot Datasets

For phishing website and UI spoofing detection:

In [None]:
# Image datasets for phishing website detection
image_datasets = [
    {
        "name": "phishing-website-screenshots",
        "title": "Phishing Website Screenshots",
        "url": "https://www.kaggle.com/datasets/shashwatwork/phishing-website-screenshots",
        "description": "Screenshots of phishing websites vs legitimate sites",
        "size": "~500MB",
        "modality": "image/screenshot",
        "use_case": "Visual phishing detection, UI spoofing analysis"
    },
    {
        "name": "website-screenshots",
        "title": "Website Screenshots Dataset", 
        "url": "https://www.kaggle.com/datasets/sid321axn/website-screenshots-dataset",
        "description": "Large collection of website screenshots for classification",
        "size": "~1GB",
        "modality": "image/web",
        "use_case": "Website legitimacy classification"
    },
    {
        "name": "ui-mockups",
        "title": "UI Design Screenshots",
        "url": "https://www.kaggle.com/datasets/jonathanoheix/ui-design-screenshots",
        "description": "Various UI designs - can help distinguish legitimate vs fake interfaces",
        "size": "~300MB",
        "modality": "image/ui",
        "use_case": "UI authenticity verification"
    }
]

print("📷 Image/Screenshot Datasets for Phishing Detection:")
print("="*60)

for i, dataset in enumerate(image_datasets, 1):
    print(f"\n{i}. **{dataset['title']}**")
    print(f"   🔗 URL: {dataset['url']}")
    print(f"   📝 Description: {dataset['description']}")
    print(f"   📏 Size: {dataset['size']}")
    print(f"   🎯 Use Case: {dataset['use_case']}")

## 📋 System Logs & Network Data

For malware detection and system behavior analysis:

In [None]:
# Log and network datasets
log_datasets = [
    {
        "name": "kdd-cup-99",
        "title": "KDD Cup 1999 Network Intrusion Detection",
        "url": "https://www.kaggle.com/datasets/galaxyh/kdd-cup-1999-data",
        "description": "Classic network intrusion detection dataset",
        "size": "~75MB",
        "modality": "network/logs",
        "use_case": "Network anomaly detection, intrusion patterns"
    },
    {
        "name": "windows-pe-malware-detection",
        "title": "Windows PE Malware Detection",
        "url": "https://www.kaggle.com/datasets/amauricio/pe-malware-machine-learning-dataset",
        "description": "Windows PE file features for malware classification",
        "size": "~100MB",
        "modality": "binary/features",
        "use_case": "Executable malware detection"
    },
    {
        "name": "system-call-traces",
        "title": "System Call Traces for Malware Detection",
        "url": "https://www.kaggle.com/datasets/selfishgene/syscall-sequences-malware-detection", 
        "description": "System call sequences from malware and benign processes",
        "size": "~50MB",
        "modality": "logs/syscall",
        "use_case": "Behavioral malware detection"
    },
    {
        "name": "dns-tunneling-detection",
        "title": "DNS Tunneling Detection",
        "url": "https://www.kaggle.com/datasets/7h3rAm/dns-tunneling-queries",
        "description": "DNS queries for detecting covert communication channels",
        "size": "~25MB",
        "modality": "network/dns",
        "use_case": "Covert channel detection, C2 communication"
    }
]

print("📊 System Logs & Network Datasets:")
print("="*60)

for i, dataset in enumerate(log_datasets, 1):
    print(f"\n{i}. **{dataset['title']}**")
    print(f"   🔗 URL: {dataset['url']}")
    print(f"   📝 Description: {dataset['description']}")
    print(f"   📏 Size: {dataset['size']}")
    print(f"   🎯 Use Case: {dataset['use_case']}")
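
With the four catalogs defined, the `size` strings can be turned into numbers to get a rough download budget. This is a minimal sketch: `size_to_mb` is a hypothetical helper, it only understands the `MB` and `GB` units used in this notebook, and it treats 1 GB as 1024 MB as a convention.

```python
import re


def size_to_mb(size: str) -> float:
    """Parse strings such as '~50MB' or '~50GB (subset available)' into megabytes."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(MB|GB)", size, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Unrecognized size string: {size!r}")
    value, unit = float(match.group(1)), match.group(2).upper()
    # Assumed convention: 1 GB = 1024 MB
    return value * 1024 if unit == "GB" else value


# Size strings copied from the log_datasets catalog above
log_sizes = ["~75MB", "~100MB", "~50MB", "~25MB"]
total_mb = sum(size_to_mb(s) for s in log_sizes)
print(f"Log datasets: ~{total_mb:.0f} MB")  # Log datasets: ~250 MB
```

The sizes in the catalogs are approximate, so the total is only a planning figure.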

## 🚀 Dataset Download Commands

Here are the Kaggle CLI commands to download the most relevant datasets:

In [None]:
# Priority datasets for immediate download
priority_downloads = [
    {
        "command": "kaggle datasets download -d taruntiwarihp/phishing-site-urls",
        "description": "Phishing URLs - Critical for URL analysis",
        "priority": 1
    },
    {
        "command": "kaggle datasets download -d nitishabharathi/email-spam-dataset", 
        "description": "Email spam/phishing - Critical for email analysis",
        "priority": 1
    },
    {
        "command": "kaggle datasets download -d xwolf12/malware-detection",
        "description": "Malware samples - Critical for threat detection",
        "priority": 1
    },
    {
        "command": "kaggle datasets download -d shashwatwork/phishing-website-screenshots",
        "description": "Phishing screenshots - Critical for visual detection", 
        "priority": 2
    },
    {
        "command": "kaggle datasets download -d uwrfkaggler/ravdess-emotional-speech-audio",
        "description": "Emotional speech - Important for social engineering",
        "priority": 2
    },
    {
        "command": "kaggle datasets download -d galaxyh/kdd-cup-1999-data",
        "description": "Network intrusion - Important for log analysis",
        "priority": 3
    }
]

print("🎯 Priority Dataset Downloads:")
print("="*60)

# Group by priority; only priority 1 is labeled "Download First"
for priority in [1, 2, 3]:
    priority_items = [d for d in priority_downloads if d['priority'] == priority]
    if priority_items:
        suffix = " (Download First)" if priority == 1 else ""
        print(f"\n🔥 Priority {priority}{suffix}:")
        for item in priority_items:
            print(f"   💻 {item['command']}")
            print(f"   📝 {item['description']}")
            print()

print("\n📋 Download Instructions:")
print("1. Set up Kaggle API credentials: https://www.kaggle.com/docs/api")
print("2. Create kaggle.json with your API token")
print("3. Place it in the ~/.kaggle/ directory and run: chmod 600 ~/.kaggle/kaggle.json")
print("4. Run the download commands above")
print("5. Extract datasets to assets/datasets/ directory")
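
The instructions above can be sketched in Python: check that credentials are in place, then build the CLI command as an argument list. The function names are our own, and the destination directory is just one choice from our planned layout. The sketch checks `~/.kaggle/kaggle.json` and the `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables; it does not handle a custom `KAGGLE_CONFIG_DIR`.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def kaggle_credentials_present(home: Optional[Path] = None,
                               env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if a kaggle.json file or the credential env variables exist."""
    home = Path.home() if home is None else home
    env = os.environ if env is None else env
    has_file = (home / ".kaggle" / "kaggle.json").is_file()
    has_env = bool(env.get("KAGGLE_USERNAME")) and bool(env.get("KAGGLE_KEY"))
    return has_file or has_env


def build_download_command(slug: str, dest: Path) -> List[str]:
    """Build the CLI call; -p sets the target directory, --unzip extracts the archive."""
    return ["kaggle", "datasets", "download", "-d", slug, "-p", str(dest), "--unzip"]


cmd = build_download_command("taruntiwarihp/phishing-site-urls",
                             Path("assets/datasets/phishing/urls"))
print(" ".join(cmd))
if kaggle_credentials_present():
    print("Credentials found; run with subprocess.run(cmd, check=True)")
else:
    print("No Kaggle credentials found; complete steps 1-3 first")
```

Building the command as a list rather than a single string avoids shell quoting issues when it is later passed to `subprocess.run`.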

## 🏗️ Dataset Integration Strategy

How we'll integrate these datasets into SentinelGem:

In [None]:
# Integration strategy for each modality
integration_plan = {
    "text_phishing": {
        "datasets": ["phishing-site-urls", "email-spam-dataset"],
        "integration": "Train pattern recognition, improve Gemma 3n prompts",
        "files": ["src/inference.py", "config/rules.yaml"],
        "validation": "Test against existing phishing samples"
    },
    "image_analysis": {
        "datasets": ["phishing-website-screenshots", "website-screenshots-dataset"],
        "integration": "Enhance OCR pipeline with real phishing UI patterns",
        "files": ["src/ocr_pipeline.py"],
        "validation": "Visual similarity testing, pattern matching accuracy"
    },
    "audio_analysis": {
        "datasets": ["ravdess-emotional-speech-audio", "voicegender"],
        "integration": "Train social engineering voice pattern detection",
        "files": ["src/audio_pipeline.py"],
        "validation": "Emotion detection accuracy, speech pattern analysis"
    },
    "log_analysis": {
        "datasets": ["kdd-cup-1999-data", "pe-malware-machine-learning-dataset"],
        "integration": "Enhance malware detection rules and signatures",
        "files": ["src/log_parser.py", "config/rules.yaml"],
        "validation": "Intrusion detection accuracy, false positive rate"
    }
}

print("🔧 Dataset Integration Strategy:")
print("="*60)

for modality, plan in integration_plan.items():
    print(f"\n🎯 **{modality.replace('_', ' ').title()}**")
    print(f"   📊 Datasets: {', '.join(plan['datasets'])}")
    print(f"   🔧 Integration: {plan['integration']}")
    print(f"   📁 Files: {', '.join(plan['files'])}")
    print(f"   ✅ Validation: {plan['validation']}")

print("\n🚀 Next Steps:")
print("1. Download priority datasets using Kaggle CLI")
print("2. Create data preprocessing notebooks")
print("3. Update threat detection rules with real patterns")
print("4. Retrain/fine-tune detection models")
print("5. Validate against known attack samples")
print("6. Update bootstrap notebook with real data examples")

## 📁 Directory Structure for Datasets

Let's create the recommended directory structure for organizing our datasets:

In [None]:
# Create dataset directory structure
dataset_structure = {
    "assets/datasets": {
        "phishing": ["urls", "emails", "screenshots"],
        "malware": ["samples", "logs", "network_traces"],
        "audio": ["social_engineering", "emotional_speech", "voice_samples"],
        "images": ["phishing_sites", "legitimate_sites", "ui_samples"],
        "logs": ["system_logs", "network_logs", "application_logs"],
        "processed": ["training", "validation", "testing"]
    }
}

def create_directory_structure(base_path: Path, structure: dict):
    """Create directory structure recursively"""
    for key, value in structure.items():
        current_path = base_path / key
        current_path.mkdir(parents=True, exist_ok=True)
        print(f"📁 Created: {current_path}")
        
        if isinstance(value, dict):
            create_directory_structure(current_path, value)
        elif isinstance(value, list):
            for subdir in value:
                subdir_path = current_path / subdir
                subdir_path.mkdir(parents=True, exist_ok=True)
                print(f"📂 Created: {subdir_path}")

# Create the directory structure
print("🏗️ Creating Dataset Directory Structure:")
print("="*60)

base_path = project_root
create_directory_structure(base_path, dataset_structure)

print("\n✅ Directory structure created successfully!")
print("\n📋 Directory Usage:")
print("- phishing/: Email, URL, and website phishing samples")
print("- malware/: Malware samples and behavioral logs")
print("- audio/: Voice recordings for social engineering detection")
print("- images/: Screenshots of websites and UI elements")
print("- logs/: System, network, and application log files")
print("- processed/: Cleaned and preprocessed data for training")
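
Once the directories exist, each download needs a destination. One possible approach is a small routing table from the catalog `modality` values to subdirectories of `assets/datasets`. The mapping below is an assumption made for illustration, not a decision recorded elsewhere in the project, and it covers only a few modalities.

```python
from pathlib import Path

# Hypothetical routing table: catalog "modality" value -> subdirectory of assets/datasets
MODALITY_TO_SUBDIR = {
    "text/url": "phishing/urls",
    "text/email": "phishing/emails",
    "image/screenshot": "phishing/screenshots",
    "audio/emotion": "audio/emotional_speech",
    "network/logs": "logs/network_logs",
}


def target_dir(root: Path, modality: str) -> Path:
    """Return the directory a dataset of the given modality should be extracted into."""
    try:
        return root / "assets" / "datasets" / MODALITY_TO_SUBDIR[modality]
    except KeyError:
        raise ValueError(f"No target directory configured for modality {modality!r}") from None


print(target_dir(Path("."), "text/url").as_posix())  # assets/datasets/phishing/urls
```

Unmapped modalities raise an error instead of falling back to a default, so a new dataset type has to be routed explicitly.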

## 🔄 Data Processing Pipeline

Once datasets are downloaded, here's how we'll process them:

In [None]:
# Data processing pipeline template
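processing_pipeline = """
# SentinelGem Data Processing Pipeline
# Run after downloading datasets

# Paths below are relative to the project root; this script lives in scripts/
cd "$(dirname "$0")/.." || exit 1

# 1. Extract and organize downloaded datasets
for dataset in assets/datasets/*/; do
    echo "Processing $dataset"
    # Handle each archive separately: unzip accepts only one archive per call
    for archive in "$dataset"*.zip; do
        [ -e "$archive" ] || continue
        unzip -q -o "$archive" -d "$dataset"
    done
done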

# 2. Run data preprocessing notebooks
jupyter nbconvert --execute notebooks/02_data_preprocessing.ipynb
jupyter nbconvert --execute notebooks/03_feature_extraction.ipynb

# 3. Update threat detection rules
python scripts/update_threat_rules.py --input assets/datasets/ --output config/

# 4. Validate processed data
python scripts/validate_datasets.py --datasets assets/datasets/processed/

# 5. Generate data statistics
python scripts/dataset_statistics.py --output reports/dataset_analysis.html
"""

# Save processing pipeline as shell script
import stat

pipeline_script_path = project_root / "scripts" / "process_datasets.sh"
pipeline_script_path.parent.mkdir(parents=True, exist_ok=True)

# The template already begins with a title comment, so only the shebang
# and the author line are added here
with open(pipeline_script_path, 'w', encoding='utf-8') as f:
    f.write("#!/bin/bash\n")
    f.write("# Author: Muzan Sano\n")
    f.write(processing_pipeline)

# Make the script executable by its owner
pipeline_script_path.chmod(pipeline_script_path.stat().st_mode | stat.S_IEXEC)

print(f"💾 Created processing pipeline: {pipeline_script_path}")
print("\n🔄 Processing Steps:")
print("1. Extract downloaded zip files")
print("2. Run preprocessing notebooks")
print("3. Update threat detection rules with real patterns")
print("4. Validate data quality and consistency")
print("5. Generate comprehensive dataset statistics")
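
Step 1 of the pipeline can also be run from Python, which avoids depending on `unzip` being installed. This is a sketch using the standard `zipfile` module. Unlike the shell loop, which only looks inside the top-level category directories, it searches for archives recursively. The function name `extract_archives` is our own.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_archives(datasets_root: Path) -> List[Path]:
    """Extract every *.zip under datasets_root into the directory that holds it."""
    extracted = []
    for archive in sorted(datasets_root.rglob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.parent)
        extracted.append(archive)
    return extracted


# Relative path assumes the current directory is the project root
root = Path("assets/datasets")
if root.is_dir():
    for archive in extract_archives(root):
        print(f"Extracted {archive}")
```

The archives are left in place after extraction, so rerunning the function overwrites the extracted files rather than failing.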

## 📊 Dataset Summary & Next Actions

Summary of identified datasets and immediate action items:

In [None]:
# Comprehensive summary
total_datasets = len(recommended_datasets) + len(audio_datasets) + len(image_datasets) + len(log_datasets)
estimated_size = "~3-5GB total (selective downloads)"

summary = {
    "total_datasets_identified": total_datasets,
    "priority_downloads": len(priority_downloads),
    "estimated_total_size": estimated_size,
    "modalities_covered": ["text", "images", "audio", "logs", "network"],
    "ready_for_download": True
}

print("📈 DATASET ACQUISITION SUMMARY")
print("="*60)
print(f"🎯 Total Datasets Identified: {summary['total_datasets_identified']}")
print(f"🔥 Priority Downloads: {summary['priority_downloads']}")
print(f"💾 Estimated Size: {summary['estimated_total_size']}")
print(f"🎭 Modalities: {', '.join(summary['modalities_covered'])}")
print(f"✅ Ready for Download: {summary['ready_for_download']}")

print("\n🚀 IMMEDIATE ACTION ITEMS:")
print("="*60)

action_items = [
    "1. 🔑 Set up Kaggle API credentials (kaggle.json)",
    "2. 📥 Download Priority 1 datasets (phishing, malware, spam)",
    "3. 📁 Extract datasets to assets/datasets/ structure", 
    "4. 📊 Create the data preprocessing notebooks (02_data_preprocessing.ipynb, 03_feature_extraction.ipynb)",
    "5. 🔄 Run the data processing pipeline (scripts/process_datasets.sh)",
    "6. ⚙️ Update threat detection rules with real patterns",
    "7. 🧪 Validate SentinelGem performance on real data",
    "8. 📈 Generate accuracy metrics and performance reports"
]

for item in action_items:
    print(item)

print("\n🎯 SUCCESS METRICS:")
print("- Phishing detection accuracy >90%")
print("- Social engineering detection accuracy >85%")
print("- Malware detection accuracy >95%")
print("- False positive rate <5%")
print("- Real-time processing <2 seconds")

print("\n🛡️ SentinelGem Dataset Acquisition Complete!")
print(f"📅 Completed: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("Ready to proceed with dataset downloads and integration. 🚀")
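
The accuracy and false positive targets above can be checked from confusion-matrix counts once a model has been evaluated. The sketch below uses the standard definitions, accuracy = (TP + TN) / total and false positive rate = FP / (FP + TN). The counts are made up for illustration and are not measured SentinelGem results.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples that were classified correctly."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("No samples to evaluate")
    return (tp + tn) / total


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of benign samples that were flagged as threats."""
    negatives = fp + tn
    if negatives == 0:
        raise ValueError("No negative samples to evaluate")
    return fp / negatives


# Made-up counts for illustration only: 500 phishing and 500 legitimate samples
tp, tn, fp, fn = 455, 480, 20, 45
acc = accuracy(tp, tn, fp, fn)
fpr = false_positive_rate(fp, tn)
print(f"accuracy={acc:.3f}, fpr={fpr:.3f}")
print("Meets phishing targets:", acc > 0.90 and fpr < 0.05)
```

Checking both numbers together matters because a detector can reach high accuracy on an imbalanced dataset while still flagging too many legitimate samples.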