# 🚀 Running Phishpedia on Google Colab

This notebook will help you set up and run Phishpedia on Google Colab without Pixi!

**Instructions:**
1. Upload this notebook to Google Colab
2. Enable GPU: Runtime → Change runtime type → GPU
3. Run cells in order
4. Wait for setup to complete
5. Test with your own data!

In [None]:
# 🔧 STEP 1: COMPLETE COLAB SETUP FOR PHISHPEDIA
# Run this cell first in Google Colab!

print("🚀 Setting up Phishpedia on Google Colab...")
print("=" * 60)

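# Step 1: Clone the repository (safe to re-run: the clone is skipped if it already exists)
print("📥 Step 1: Cloning Phishpedia repository...")
import os
if not os.path.exists("phishpedia.py") and not os.path.isdir("Phishpedia"):
    !git clone https://github.com/lindsey98/Phishpedia.git
if os.path.isdir("Phishpedia"):
    %cd Phishpedia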

# Step 2: Install additional dependencies (most are already in Colab)
print("\n📦 Step 2: Installing additional dependencies...")
!pip install selenium webdriver-manager
!pip install opencv-python-headless  # Headless version for Colab
!apt-get update -qq
!apt-get install -y chromium-browser chromium-chromedriver

# Step 3: Set up ChromeDriver for Selenium
print("\n🌐 Step 3: Setting up ChromeDriver...")
import os
os.environ['PATH'] += ':/usr/lib/chromium-browser/'

# Step 4: Check what's already available in Colab
print("\n✅ Step 4: Checking pre-installed packages in Colab...")
from importlib import metadata

# Distribution names as published on PyPI (Pillow provides the PIL module)
colab_packages = [
    'torch', 'torchvision', 'numpy', 'Pillow', 'requests',
    'beautifulsoup4', 'matplotlib', 'pandas', 'scikit-learn',
]

print("Pre-installed packages in Colab:")
for package in colab_packages:
    try:
        print(f"  ✅ {package}: {metadata.version(package)}")
    except metadata.PackageNotFoundError:
        print(f"  ❌ {package}: Not found")

# OpenCV is published under several distribution names
# (opencv-python, opencv-python-headless, ...), so check the module itself
try:
    import cv2
    print(f"  ✅ opencv-python: {cv2.__version__}")
except ImportError:
    print("  ❌ opencv-python: Not found")

print("\n🎯 Colab setup completed! Ready to run Phishpedia.")
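
Step 3 of the setup cell only extends `PATH`; it does not confirm that the browser and driver binaries are actually reachable. A minimal sketch of such a check follows. The executable names `chromium-browser` and `chromedriver` are assumptions based on the apt package names and may differ between Colab images, and `find_binaries` is a hypothetical helper, not part of Phishpedia.

```python
import shutil

def find_binaries(names):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

# Assumed executable names; adjust them if your Colab image installs different ones.
for name, path in find_binaries(["chromium-browser", "chromedriver"]).items():
    status = f"found at {path}" if path else "not found on PATH"
    print(f"{name}: {status}")
```

If either name is reported as not found, the WebDriver test in the next cell is likely to fail as well.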

In [None]:
# 🧪 STEP 2: TEST SETUP AND CHECK GPU
# Run this cell to verify everything works

import sys
import os
from pathlib import Path

print("🧪 Testing Phishpedia setup...")
print("=" * 40)

# Test 1: Check if we can import the main modules
print("🔍 Test 1: Import test")
try:
    # Add current directory to Python path
    if str(Path(".").absolute()) not in sys.path:
        sys.path.append(str(Path(".").absolute()))
    
    # Try importing main components
    imports_to_test = [
        ('torch', 'PyTorch'),
        ('torchvision', 'TorchVision'), 
        ('cv2', 'OpenCV'),
        ('numpy', 'NumPy'),
        ('requests', 'Requests'),
        ('selenium', 'Selenium'),
        ('PIL', 'Pillow')
    ]
    
    for module, name in imports_to_test:
        try:
            __import__(module)
            print(f"  ✅ {name}")
        except ImportError as e:
            print(f"  ❌ {name}: {e}")
            
except Exception as e:
    print(f"  ❌ Import test failed: {e}")

# Test 2: GPU availability
print("\n🔥 Test 2: GPU availability")
try:
    import torch
    if torch.cuda.is_available():
        gpu_count = torch.cuda.device_count()
        gpu_name = torch.cuda.get_device_name(0)
        print(f"  ✅ GPU available: {gpu_name} ({gpu_count} device(s))")
        print(f"  ✅ CUDA version: {torch.version.cuda}")
    else:
        print("  ⚠️  GPU not available, using CPU")
        print("  💡 In Colab: Runtime > Change runtime type > Hardware accelerator > GPU")
except Exception as e:
    print(f"  ❌ GPU test failed: {e}")

# Test 3: Check Selenium WebDriver
print("\n🌐 Test 3: Selenium WebDriver test")
driver = None
try:
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    # Set up Chrome options for Colab
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--disable-gpu')

    # Try to create a driver instance
    driver = webdriver.Chrome(options=chrome_options)
    print("  ✅ Chrome WebDriver initialized successfully")

    # Test a simple page load
    driver.get("https://www.google.com")
    print(f"  ✅ Page loaded successfully: {driver.title}")

except Exception as e:
    print(f"  ❌ WebDriver test failed: {e}")
finally:
    # Always close the browser, even if the page load fails
    if driver is not None:
        driver.quit()

print("\n🎉 Test completed! Check results above.")
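
The import test above prints one line per module. If you would rather collect the results and act on them (for example, to decide what to `pip install`), the same check can be sketched as a function that returns a dictionary. `check_imports` is a hypothetical helper; the module names are the ones used in Test 1.

```python
import importlib

def check_imports(module_names):
    """Return {module_name: None} on success or {module_name: error_text} on failure."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # diagnostic only: record any failure instead of stopping
            results[name] = f"{type(exc).__name__}: {exc}"
    return results

# Module names taken from Test 1.
results = check_imports(["torch", "torchvision", "cv2", "numpy", "requests", "selenium", "PIL"])
missing = [name for name, err in results.items() if err is not None]
print("Modules that failed to import:", missing or "none")
```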

In [None]:
# 📁 STEP 3: CHECK PROJECT STRUCTURE AND MODEL FILES
# Verify Phishpedia files and model availability

import os
from pathlib import Path

print("📁 Checking Phishpedia project structure...")
print("=" * 50)

# Check main script
main_script = Path("phishpedia.py")
if main_script.exists():
    print(f"✅ Main script found: {main_script}")
else:
    print(f"❌ Main script not found: {main_script}")

# Check models directory
models_dir = Path("models")
if not models_dir.exists():
    print("Creating models directory...")
    models_dir.mkdir()

# List required model files
required_files = [
    "rcnn_bet365.pth",
    "faster_rcnn.yaml", 
    "resnetv2_rgb_new.pth.tar",
    "domain_map.pkl"
]

print("\nRequired model files:")
for i, file in enumerate(required_files, 1):
    file_path = models_dir / file
    if file_path.exists():
        size = file_path.stat().st_size / (1024*1024)  # Size in MB
        print(f"  {i}. ✅ {file} ({size:.1f} MB)")
    else:
        print(f"  {i}. ❌ {file} (missing)")

# Check for expand_targetlist directory
# (counts only sub-folders, assuming one folder per brand)
targetlist_dir = models_dir / "expand_targetlist"
if targetlist_dir.exists():
    brand_count = sum(1 for p in targetlist_dir.iterdir() if p.is_dir())
    print(f"  5. ✅ expand_targetlist/ ({brand_count} brands)")
else:
    print("  5. ❌ expand_targetlist/ (missing)")

# Show current directory contents
print("\n🔍 Current directory contents:")
for item in sorted(Path(".").iterdir()):
    if item.is_dir():
        item_count = sum(1 for _ in item.iterdir())
        print(f"  📁 {item.name}/ ({item_count} items)")
    else:
        size_kb = item.stat().st_size // 1024
        print(f"  📄 {item.name} ({size_kb} KB)")

# Show help if main script exists
if main_script.exists():
    print("\n📋 Getting help information...")
    !python phishpedia.py --help
else:
    print("\n❌ Cannot show help - main script not found")
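
The model-file check above prints a status line for every file. A compact variant that returns only the missing names can be handy as a guard before running the main script. This is a sketch: `missing_model_files` is a hypothetical helper, and the file names are copied from the list in the cell above.

```python
from pathlib import Path

# File names copied from the check in the cell above.
REQUIRED_FILES = [
    "rcnn_bet365.pth",
    "faster_rcnn.yaml",
    "resnetv2_rgb_new.pth.tar",
    "domain_map.pkl",
]

def missing_model_files(models_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from models_dir."""
    models_dir = Path(models_dir)
    return [name for name in required if not (models_dir / name).is_file()]

missing = missing_model_files("models")
if missing:
    print("Missing model files:", ", ".join(missing))
else:
    print("All required model files are present.")
```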

In [None]:
# 🎯 STEP 4: CREATE SAMPLE TEST DATA
# Create test data structure for Phishpedia

from pathlib import Path
import os

print("📁 Creating sample test data structure...")
print("=" * 40)

# Create test directory structure
test_dir = Path("sample_test_sites")
test_dir.mkdir(exist_ok=True)

# Sample sites data
sample_sites = [
    {
        "name": "test_site_1",
        "url": "https://example-phishing-site.com",
        "description": "Sample phishing site mimicking a bank"
    },
    {
        "name": "test_site_2", 
        "url": "https://legitimate-site.com",
        "description": "Sample legitimate site"
    }
]

for site in sample_sites:
    site_dir = test_dir / site["name"]
    site_dir.mkdir(exist_ok=True)
    
    # Create info.txt
    info_file = site_dir / "info.txt"
    info_file.write_text(site["url"])
    
    print(f"✅ Created: {site_dir}/")
    print(f"   └── info.txt (contains: {site['url']})")
    print(f"   └── shot.png (you need to add screenshot)")
    print()

print("📝 To run Phishpedia on test data:")
print("   1. Add screenshot files (shot.png) to each test site folder")
print("   2. Run: !python phishpedia.py --folder ./sample_test_sites")
print("   3. Or use existing datasets if available")

# Check if datasets directory exists
datasets_dir = Path("datasets")
if datasets_dir.exists():
    print(f"\n✅ Found datasets directory with {len(list(datasets_dir.iterdir()))} items")
    for item in datasets_dir.iterdir():
        if item.is_dir():
            print(f"  📁 {item.name}/")
else:
    print("\n❌ No datasets directory found")
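
Each site folder needs both `info.txt` and `shot.png`, and the cell above creates only the former. A small sketch for listing which folders are still incomplete follows; `find_incomplete_sites` is a hypothetical helper written for this notebook, not a Phishpedia function.

```python
from pathlib import Path

REQUIRED_SITE_FILES = ("info.txt", "shot.png")

def find_incomplete_sites(folder):
    """Map each site sub-folder to the required files it lacks (complete sites are omitted)."""
    incomplete = {}
    for site_dir in sorted(Path(folder).iterdir()):
        if not site_dir.is_dir():
            continue
        missing = [f for f in REQUIRED_SITE_FILES if not (site_dir / f).is_file()]
        if missing:
            incomplete[site_dir.name] = missing
    return incomplete

folder = Path("sample_test_sites")
if folder.is_dir():
    for site, missing in find_incomplete_sites(folder).items():
        print(f"{site}: missing {', '.join(missing)}")
```

An empty result means every site folder has both files and is ready for the run in Step 5.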

In [None]:
# 🚀 STEP 5: RUN PHISHPEDIA (EXAMPLE)
# This cell shows how to run Phishpedia on your data

import os
from pathlib import Path

print("🚀 Running Phishpedia Example...")
print("=" * 40)

# Check if we can run the main script
if Path("phishpedia.py").exists():
    print("✅ Phishpedia script found")
    
    # Method 1: Run on test sites (if they exist)
    test_folder = "datasets/test_sites"
    if Path(test_folder).exists():
        print(f"\n🎯 Running on existing test data: {test_folder}")
        !python phishpedia.py --folder ./datasets/test_sites
    else:
        print(f"\n❌ Test folder not found: {test_folder}")
        
        # Try sample data we created
        sample_folder = "sample_test_sites"
        if Path(sample_folder).exists():
            print(f"\n⚠️  Sample data exists but needs screenshots")
            print("   Add shot.png files to each folder first")
            print(f"   Then run: !python phishpedia.py --folder ./{sample_folder}")
        else:
            print("\n❌ No test data available")
            print("   Create test data in the previous cell first")
    
    # Show available options
    print("\n📋 Available Phishpedia commands:")
    print("   !python phishpedia.py --folder <your_test_folder>")
    print("   !python phishpedia.py --help")
    
else:
    print("❌ Phishpedia script not found")
    print("   Make sure you ran the setup cell first")

print("\n💡 Usage Tips:")
print("   • Each test site needs: info.txt (URL) + shot.png (screenshot)")
print("   • Results will show: Phish/Benign + Target brand (if phishing)")
print("   • GPU will speed up processing significantly")
print("   • Check model files are downloaded if you get errors")
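
The `!python ...` lines above are notebook shell escapes. To drive the run from Python instead (for example, to loop over several folders), one option is `subprocess`. The sketch below assumes only the `--folder` option shown above; `build_command` and `run_phishpedia` are hypothetical helper names.

```python
import subprocess
import sys
from pathlib import Path

def build_command(folder, script="phishpedia.py"):
    """Build the argument list equivalent to: python phishpedia.py --folder <folder>."""
    return [sys.executable, script, "--folder", str(folder)]

def run_phishpedia(folder, script="phishpedia.py"):
    """Run the script on one folder; return its exit code, or None if inputs are missing."""
    if not Path(script).is_file() or not Path(folder).is_dir():
        print(f"Skipping {folder}: script or folder not found")
        return None
    return subprocess.run(build_command(folder, script), check=False).returncode

# Show the command that would be executed for the sample folder.
print(" ".join(build_command("sample_test_sites")))
```

Calling `run_phishpedia("sample_test_sites")` then executes that command and returns the exit code, so a non-zero value can be handled in code rather than read off the cell output.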

## 📋 Complete Usage Guide

### 🎯 Quick Start Steps:
1. **Setup**: Run the first cell (clones repo, installs dependencies)
2. **Test**: Run the second cell (verifies GPU, imports, WebDriver)
3. **Check**: Run the third cell (verifies project structure)
4. **Data**: Run the fourth cell (creates sample test structure)
5. **Run**: Use the fifth cell (runs Phishpedia on your data)

### ⚡ Colab Advantages:
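✅ **Free GPU access** (typically an NVIDIA T4; availability is not guaranteed)  
✅ **Pre-installed packages** (PyTorch, OpenCV, NumPy)  
✅ **No local setup** needed  
✅ **Easy sharing** via link  
✅ **Long sessions** (up to about 12 hours on the free tier; idle runtimes disconnect sooner)  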

### 🚀 Ready-to-Use Commands:
```python
# Clone and set up (notebook cell syntax: `!` runs a shell command, `%cd` changes directory)
!git clone https://github.com/lindsey98/Phishpedia.git
%cd Phishpedia

# Install dependencies  
!pip install selenium webdriver-manager opencv-python-headless

# Run on your data
!python phishpedia.py --folder ./your_test_folder

# Check GPU
import torch; print(f'GPU: {torch.cuda.is_available()}')
```

### 💾 Important Notes:
- **Enable GPU**: Runtime → Change runtime type → GPU
- **Save work**: Download results before session ends
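- **Model files**: The files checked in Step 3 (`rcnn_bet365.pth`, `faster_rcnn.yaml`, `resnetv2_rgb_new.pth.tar`, `domain_map.pkl`, and `expand_targetlist/`) must be present in `models/`; they are large and may need to be downloaded separately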
- **Test data**: Each site needs `info.txt` (URL) + `shot.png` (screenshot)
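
Because Colab storage is discarded when the session ends, one way to save work is to bundle a folder into a zip archive and download it. The sketch below makes several assumptions: the folder name `sample_test_sites` is only a stand-in for wherever your outputs live, `archive_folder` is a hypothetical helper, and `google.colab.files.download` is importable only inside Colab, so the import is guarded.

```python
import shutil
from pathlib import Path

def archive_folder(folder, archive_name="phishpedia_results"):
    """Zip the contents of `folder` into <archive_name>.zip and return the archive path."""
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"No such folder: {folder}")
    return Path(shutil.make_archive(archive_name, "zip", root_dir=folder))

folder = Path("sample_test_sites")  # stand-in; point this at your own output folder
if folder.is_dir():
    zip_path = archive_folder(folder)
    try:
        from google.colab import files  # only importable inside Colab
        files.download(str(zip_path))
    except ImportError:
        print(f"Not running in Colab; archive saved at {zip_path}")
```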