### **Robust Autoencoder Trainer for Phishing Detection (v2)**

This notebook implements a revised, more robust training pipeline for the phishing detection Autoencoder, conforming to the thesis proposal. It addresses key challenges identified in the previous version, primarily data scarcity and unreliable threshold optimization.

**Key Improvements:**
1.  **Cross-Validation (`StratifiedKFold`):** Replaces the single train/validation/test split with k-fold cross-validation to provide a statistically reliable estimate of model performance, which is critical for small user datasets.
2.  **Balanced Fine-Tuning:** Each fine-tuning set now combines user-reported URLs with a sample of phishing URLs from the public dataset, so that public phishing examples stay in the training data. Because the Autoencoder is fitted on benign rows only, these phishing rows do not update the weights in the current code.
3.  **Intelligent Threshold Optimization:** Instead of using a fixed percentile, this version finds the optimal anomaly threshold by analyzing the Precision-Recall curve on each validation fold, maximizing the F1-score.
4.  **Robust Performance Metrics:** Final performance is reported as the average and standard deviation across all cross-validation folds, giving a much clearer picture of real-world effectiveness.
5.  **Simplified Focus (AE First):** The GCN components are temporarily commented out to focus on perfecting the core content-based detector first. They can be re-enabled once a larger user dataset is available.

--- 
#### **1. Setup and Installations**

In [None]:
!pip -q install pandas scikit-learn tensorflow tldextract google-cloud-firestore matplotlib seaborn

--- 
#### **2. Imports and Initial Configuration**

In [None]:
import os
import json
import random
import pickle
import numpy as np
import pandas as pd
import tldextract
import matplotlib.pyplot as plt
import seaborn as sns

# TensorFlow for Autoencoder
import tensorflow as tf
from tensorflow import keras

# Scikit-learn for preprocessing, cross-validation, and metrics
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, confusion_matrix, roc_auc_score, 
                             precision_recall_curve, auc)

# Google Cloud for data fetching
from google.colab import files
from google.cloud import firestore

# --- Reproducibility ---
SEED = 42
np.random.seed(SEED)
random.seed(SEED)
tf.random.set_seed(SEED)

print("Libraries imported and seed set.")

--- 
### **Part 1: Data Loading and Preprocessing**

#### **3. Load Public Dataset and Authenticate with Firebase**
First, we load the large public dataset of URL features. It is used to pre-train the Autoencoder on benign URLs so that the model learns what "normal" looks like; the phishing rows are kept for later steps. We also set up the connection to Firestore so that user-reported data can be fetched later.

In [None]:
# The public dataset contains 111 features per URL, plus the `phishing` label column
URL = "https://raw.githubusercontent.com/GregaVrbancic/Phishing-Dataset/master/dataset_full.csv"
print("Downloading public dataset...")
df_public = pd.read_csv(URL)

# Separate features (X) from the label (y)
y_public = df_public["phishing"].astype(int)
X_public = df_public.drop(columns=["phishing"]).astype(np.float32)
feature_cols = X_public.columns.tolist()

print(f"Public dataset loaded with {X_public.shape[0]} samples and {X_public.shape[1]} features.")

# --- Firebase Authentication ---
print("\nPlease upload your Firebase service account JSON key file.")
try:
    uploaded = files.upload()
    sa_path = next(iter(uploaded.keys()))
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = sa_path
    db = firestore.Client()
    print("\n✅ Firebase authentication configured.")
except Exception as e:
    print(f"\n❌ Firebase authentication failed: {e}")
    db = None

#### **4. Feature Engineering and Scaling**
Here we define the functions to extract lexical features from a URL. Crucially, we also define our feature scaler (`RobustScaler` is preferred for its resilience to outliers). This scaler will be fitted **only on the benign data** from the public training set to learn the distribution of "normal" URLs.

In [None]:
def get_lexical_features(url: str) -> dict:
    """Extracts basic lexical features from a URL string."""
    features = {}
    try:
        ext = tldextract.extract(url)
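        # Join only the non-empty parts: for a bare IP or a suffix-less host,
        # ext.suffix is '' and an f-string join would leave a trailing dot,
        # which makes the domain_in_ip check below always return 0.
        domain = ".".join(part for part in (ext.domain, ext.suffix) if part)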
        hostname = f"{ext.subdomain}.{domain}" if ext.subdomain else domain
        
        features['qty_dot_url'] = url.count('.')
        features['qty_hyphen_url'] = url.count('-')
        features['qty_underline_url'] = url.count('_')
        features['qty_slash_url'] = url.count('/')
        features['qty_questionmark_url'] = url.count('?')
        features['qty_equal_url'] = url.count('=')
        features['qty_at_url'] = url.count('@')
        features['qty_and_url'] = url.count('&')
        features['qty_exclamation_url'] = url.count('!')
        features['qty_space_url'] = url.count(' ')
        features['qty_tilde_url'] = url.count('~')
        features['qty_comma_url'] = url.count(',')
        features['qty_plus_url'] = url.count('+')
        features['qty_asterisk_url'] = url.count('*')
        features['qty_hashtag_url'] = url.count('#')
        features['qty_dollar_url'] = url.count('$')
        features['qty_percent_url'] = url.count('%')
        features['qty_dot_domain'] = domain.count('.')
        features['qty_hyphen_domain'] = domain.count('-')
        features['qty_underline_domain'] = domain.count('_')
        features['qty_at_domain'] = domain.count('@')
        features['qty_vowels_domain'] = sum(1 for char in domain if char in 'aeiouAEIOU')
        features['domain_length'] = len(domain)
        features['domain_in_ip'] = 1 if all(part.isdigit() for part in domain.split('.')) else 0
        features['server_client_domain'] = 1 if 'server' in domain or 'client' in domain else 0
        
    except Exception:
        # Return empty dict on error
        return {}
    return features

def create_feature_vector(url: str, all_feature_columns: list) -> np.ndarray:
    """Creates a full feature vector for a URL, filling missing values with 0."""
    lexical_feats = get_lexical_features(url)
    feature_vector = np.array([lexical_feats.get(col, 0.0) for col in all_feature_columns], dtype=np.float32)
    return feature_vector

# Split public data to get a set for fitting the scaler
X_public_train, _, y_public_train, _ = train_test_split(
    X_public, y_public, test_size=0.3, random_state=SEED, stratify=y_public
)

# --- Feature Scaling ---
print("Fitting RobustScaler on benign public training data...")
# Using RobustScaler is better for data with outliers, which is common in lexical features.
scaler = RobustScaler().fit(X_public_train[y_public_train == 0])

print("✅ Scaler fitted.")
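To make the benign-only fitting concrete, here is a small sketch on made-up numbers (the toy arrays and the name `toy_scaler` are illustrative, not taken from the dataset). With default settings, `RobustScaler` centers each feature on its median and divides by the interquartile range, so a single extreme benign value barely moves the fitted statistics.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy feature matrix: one column, five benign rows (one extreme) and two phishing rows.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0], [50.0], [60.0]])
y_toy = np.array([0, 0, 0, 0, 0, 1, 1])

# Fit on benign rows only, as the notebook does.
benign_rows = X_toy[y_toy == 0]
toy_scaler = RobustScaler().fit(benign_rows)

# center_ is the per-feature median, scale_ the interquartile range (75th - 25th percentile).
print("median:", toy_scaler.center_[0])
print("IQR:", toy_scaler.scale_[0])

# Phishing rows are transformed with the benign statistics.
phish_scaled = toy_scaler.transform(X_toy[y_toy == 1]).ravel()
print(phish_scaled)
```

The outlier `100.0` leaves the median at 3 and the IQR at 2, whereas a mean/standard-deviation scaler would be pulled strongly toward it.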

--- 
### **Part 2: Autoencoder Pre-training**

#### **5. Define and Pre-train the Autoencoder Model**
We define the Autoencoder architecture as specified in the thesis. Then, we pre-train it **only on the benign samples** from the public dataset. This teaches the model to accurately reconstruct "normal" URLs, establishing a strong baseline.

In [None]:
# Autoencoder hyperparameters (from thesis)
AE_LAYER1 = 64
AE_LAYER2 = 32
AE_BOTTLENECK = 16
AE_DROPOUT = 0.1
LR_PRETRAIN = 1e-3
BATCH_PRETRAIN = 512
EPOCHS_PRETRAIN = 30

def build_autoencoder(input_shape):
    """Builds the Keras Autoencoder model."""
    inp = keras.Input(shape=(input_shape,))
    x = keras.layers.Dense(AE_LAYER1, activation='relu')(inp)
    x = keras.layers.Dropout(AE_DROPOUT)(x)
    x = keras.layers.Dense(AE_LAYER2, activation='relu')(x)
    z = keras.layers.Dense(AE_BOTTLENECK, activation='relu', name='bottleneck')(x)  # Bottleneck
    x = keras.layers.Dense(AE_LAYER2, activation='relu')(z)
    x = keras.layers.Dense(AE_LAYER1, activation='relu')(x)
    out = keras.layers.Dense(input_shape)(x)
    
    model = keras.Model(inp, out)
    return model

# Build and compile the model for pre-training
pretrain_ae_model = build_autoencoder(X_public.shape[1])
pretrain_ae_model.compile(optimizer=keras.optimizers.Adam(LR_PRETRAIN), loss='mse')

print('--- Pre-training Autoencoder on Public Dataset ---')

# Scale the training data
X_public_train_scaled = scaler.transform(X_public_train)

# Train only on benign data (convert the pandas mask to NumPy before indexing the scaled array)
X_benign_public_train = X_public_train_scaled[(y_public_train == 0).to_numpy()]

history = pretrain_ae_model.fit(
    X_benign_public_train, X_benign_public_train,
    epochs=EPOCHS_PRETRAIN,
    batch_size=BATCH_PRETRAIN,
    shuffle=True,
    verbose=1,
    validation_split=0.2, # Use a portion of training data for validation
    callbacks=[keras.callbacks.EarlyStopping(patience=5, monitor='val_loss', restore_best_weights=True)]
)

print('\n✅ Pre-training complete.')

# Save the pre-trained weights for use in cross-validation
pretrain_ae_model.save_weights('pretrained_ae_weights.weights.h5')
print('Pre-trained model weights saved.')
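Training on benign samples only is what turns reconstruction error into an anomaly score: inputs that resemble the training data should be reconstructed closely, while unfamiliar inputs should not. The sketch below shows the per-row mean squared error on hand-written arrays; `x_out` is a stand-in for what `model.predict` would return, and the helper name `reconstruction_error` is hypothetical.

```python
import numpy as np

def reconstruction_error(x_scaled: np.ndarray, x_reconstructed: np.ndarray) -> np.ndarray:
    """Per-row mean squared error between inputs and their reconstructions."""
    return np.mean(np.square(x_scaled - x_reconstructed), axis=1)

# Hypothetical scaled inputs and model outputs (stand-ins for `model.predict`).
x_in = np.array([[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]])
x_out = np.array([[0.0, 1.0, 2.0], [3.0, 1.0, 2.0]])

errors = reconstruction_error(x_in, x_out)
# First row is reconstructed exactly; second row is off by 3 in one of three features.
print(errors)
```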

--- 
### **Part 3: User Data Processing and Fine-Tuning**

#### **6. Fetch and Process User-Reported URLs**
Now we fetch the labeled data provided by your application's users from Firestore. We process these reports to create a clean dataset of unique URLs and their corresponding labels (0 for benign, 1 for phishing).

In [None]:
def fetch_user_reports(db_client):
    """Fetches and processes user reports from Firestore."""
    if not db_client:
        print("Firestore client not initialized. Skipping user data fetch.")
        return pd.DataFrame(columns=['url', 'label'])

    APP_ID = "ads-phishing-link"  # As per your app.py
    REPORTS_PATH = f"artifacts/{APP_ID}/private_user_reports"
    print(f"Fetching user reports from: {REPORTS_PATH}...")

    try:
        report_docs = list(db_client.collection(REPORTS_PATH).stream())
        print(f"Found {len(report_docs)} total user reports.")
    except Exception as e:
        print(f"❌ Error fetching from Firestore: {e}")
        return pd.DataFrame(columns=['url', 'label'])

    # Process reports into a list of {'url': ..., 'label': ...} dicts
    processed_reports = []
    for doc in report_docs:
        d = doc.to_dict()
        report_type = d.get('type')
        payload = d.get('payload', {})
        url = payload.get('url')

        if not url or not report_type:
            continue

        # Map report type to binary label (as per your thesis)
        # 1 = Phishing (true_positive, false_negative)
        # 0 = Benign (false_positive, true_negative)
        label = 1 if report_type in ('true_positive', 'false_negative') else 0
        processed_reports.append({'url': url, 'label': label})

    if not processed_reports:
        print("No valid reports found.")
        return pd.DataFrame(columns=['url', 'label'])
        
    # Create DataFrame and drop exact (url, label) duplicates.
    # Note: a URL reported with both labels keeps one row per label.
    df_user = pd.DataFrame(processed_reports).drop_duplicates().reset_index(drop=True)
    print(f"Created a dataset of {len(df_user)} unique user-reported URLs.")
    print("User data label distribution:")
    print(df_user['label'].value_counts())
    return df_user

# Fetch the data
df_user_reports = fetch_user_reports(db)
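Because `drop_duplicates()` compares whole rows, a URL that was reported as both phishing and benign survives with two contradictory rows. The following is a minimal sketch, on made-up reports, of how such URLs could be listed before training; the helper names `label_for` and `conflicting_urls` are hypothetical and not part of the application.

```python
from collections import defaultdict

PHISHING_TYPES = ("true_positive", "false_negative")

def label_for(report_type: str) -> int:
    """Map a report type to 1 (phishing) or 0 (benign), mirroring the cell above."""
    return 1 if report_type in PHISHING_TYPES else 0

def conflicting_urls(reports: list) -> list:
    """Return URLs that were reported with both labels."""
    labels_by_url = defaultdict(set)
    for report in reports:
        labels_by_url[report["url"]].add(label_for(report["type"]))
    return sorted(url for url, labels in labels_by_url.items() if len(labels) > 1)

# Made-up reports for illustration.
toy_reports = [
    {"url": "http://a.example/login", "type": "true_positive"},
    {"url": "http://a.example/login", "type": "false_positive"},
    {"url": "http://b.example/", "type": "true_negative"},
]
conflicts = conflicting_urls(toy_reports)
print(conflicts)
```

How to resolve a conflict (most recent report, majority vote, or dropping the URL) is a design decision this sketch leaves open.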

#### **7. Prepare Data for Cross-Validation and Fine-Tuning**
This is a critical step. We create the final feature matrix (`X_user`) and label vector (`y_user`) from the user reports. 

More importantly, to prevent **catastrophic forgetting**, we create a balanced fine-tuning dataset. For each cross-validation fold, the training set will consist of:
1.  All **benign** URLs from the user training fold.
2.  All **phishing** URLs from the user training fold.
3.  An equal number of **phishing** URLs sampled from the original public dataset.

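This keeps public phishing examples alongside your user data in every fold. Be aware that the Autoencoder is fitted on the label-0 rows of this set only (step 5 of the loop in Part 4), so the phishing rows, whether from users or from the public dataset, do not change the weights as the code is written.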

In [None]:
if not df_user_reports.empty:
    # Create feature vectors for all user-reported URLs
    X_user = np.array([create_feature_vector(url, feature_cols) for url in df_user_reports['url']])
    y_user = df_user_reports['label'].values

    # Get phishing examples from the public dataset to prevent forgetting
    X_public_phish = X_public[y_public == 1]
    
    print(f"User data prepared for cross-validation: X_user shape {X_user.shape}, y_user shape {y_user.shape}")
    print(f"Found {len(X_public_phish)} phishing examples in public data for balanced fine-tuning.")
else:
    print("User reports DataFrame is empty. Cannot proceed with fine-tuning.")
    # Create empty arrays to avoid errors in subsequent cells if run out of order
    X_user, y_user, X_public_phish = np.array([]), np.array([]), np.array([])
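The assembly described above can be sketched on toy arrays. All names ending in `_toy` and the values are illustrative. The last step applies the same label-0 filter that the fine-tuning loop uses, which shows what actually reaches the Autoencoder.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-ins: 4 user rows (2 benign, 2 phishing) and a pool of 10 public phishing rows.
X_user_toy = np.arange(8, dtype=np.float32).reshape(4, 2)
y_user_toy = np.array([0, 0, 1, 1])
X_public_phish_toy = np.full((10, 2), 99.0, dtype=np.float32)

# Sample as many public phishing rows as there are user phishing rows.
n_phish = int(np.sum(y_user_toy == 1))
picked = rng.choice(len(X_public_phish_toy), size=n_phish, replace=False)

X_mix = np.vstack([X_user_toy, X_public_phish_toy[picked]])
y_mix = np.concatenate([y_user_toy, np.ones(n_phish, dtype=int)])
print(X_mix.shape, y_mix.tolist())

# Selecting label 0 afterwards keeps only the user benign rows.
X_benign_only = X_mix[y_mix == 0]
print(X_benign_only.shape)
```

In this toy case six rows are assembled but only the two benign user rows pass the filter.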

--- 
### **Part 4: Cross-Validation, Fine-Tuning, and Evaluation**

#### **8. The Cross-Validation and Fine-Tuning Loop**
This is the core of the robust training process. We use `StratifiedKFold` to split our small user dataset into multiple folds, ensuring each fold has a similar ratio of benign to phishing URLs. 

For each fold, we:
1.  **Load** the pre-trained AE weights.
2.  **Fine-tune** the model on the balanced training set for that fold.
3.  **Evaluate** on the validation set to find the best anomaly threshold.
4.  **Store** the performance metrics and the optimized threshold.
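
As a quick illustration of what stratification buys on a small dataset, the sketch below splits 20 toy labels (15 benign, 5 phishing; the counts are made up) and records how many phishing samples land in each validation fold. Keep in mind that scikit-learn complains when a class has fewer members than `n_splits`, so very small user datasets may need fewer folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 15 benign (0) and 5 phishing (1); features are irrelevant to the split.
y_toy = np.array([0] * 15 + [1] * 5)
X_toy = np.zeros((len(y_toy), 1))

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
phish_per_val_fold = []
for train_idx, val_idx in skf_toy.split(X_toy, y_toy):
    phish_per_val_fold.append(int(y_toy[val_idx].sum()))
    print(len(train_idx), len(val_idx), phish_per_val_fold[-1])
```

Each validation fold receives four samples, exactly one of them phishing, so every fold can produce a meaningful precision and recall.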

In [None]:
if len(X_user) > 0:
    # --- Cross-Validation Setup ---
    N_SPLITS = 5  # Use 5 folds for a good balance of computation and reliability
    skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED)

    # --- Fine-Tuning Hyperparameters ---
    LR_FINETUNE = 1e-5
    BATCH_FINETUNE = 16
    EPOCHS_FINETUNE = 50

    # --- Storage for results across folds ---
    fold_results = []
    best_thresholds = []

    # --- The Loop ---
    for fold_idx, (train_indices, val_indices) in enumerate(skf.split(X_user, y_user)):
        print(f"\n--- Starting Fold {fold_idx + 1}/{N_SPLITS} ---")

        # 1. Get data for this fold
        X_train_fold, X_val_fold = X_user[train_indices], X_user[val_indices]
        y_train_fold, y_val_fold = y_user[train_indices], y_user[val_indices]

        # 2. Create the balanced fine-tuning dataset
        num_user_phish = np.sum(y_train_fold == 1)
        if num_user_phish > 0:
            # Sample from public phishing data to balance
            sample_indices = np.random.choice(len(X_public_phish), size=num_user_phish, replace=False)
            X_public_phish_sample = X_public_phish.iloc[sample_indices]
            
            # Combine user data with public phishing data
            X_finetune = np.vstack([X_train_fold, X_public_phish_sample])
            y_finetune = np.concatenate([y_train_fold, np.ones(num_user_phish)])
        else: # Handle case with no phishing examples in user training fold
            X_finetune = X_train_fold
            y_finetune = y_train_fold
        
        # 3. Scale the fine-tuning and validation data
        X_finetune_scaled = scaler.transform(X_finetune)
        X_val_scaled = scaler.transform(X_val_fold)

        # 4. Build a fresh model and load pre-trained weights
        fine_tune_ae = build_autoencoder(X_user.shape[1])
        fine_tune_ae.load_weights('pretrained_ae_weights.weights.h5')
        fine_tune_ae.compile(optimizer=keras.optimizers.Adam(LR_FINETUNE), loss='mse')

        # 5. Fine-tune the model ONLY on BENIGN data from the balanced set
        X_benign_finetune = X_finetune_scaled[y_finetune == 0]
        if len(X_benign_finetune) > 0:
            print(f"Fine-tuning on {len(X_benign_finetune)} benign samples...")
            fine_tune_ae.fit(
                X_benign_finetune, X_benign_finetune,
                epochs=EPOCHS_FINETUNE,
                batch_size=BATCH_FINETUNE,
                shuffle=True,
                verbose=0,
                callbacks=[keras.callbacks.EarlyStopping(patience=5, monitor='loss', restore_best_weights=True)]
            )
        else:
            print("No benign samples in this fold for fine-tuning.")

        # 6. Evaluate on the validation fold to find the best threshold
        if len(X_val_fold) > 0:
            val_reconstructed = fine_tune_ae.predict(X_val_scaled, verbose=0)
            val_errors = np.mean(np.square(X_val_scaled - val_reconstructed), axis=1)

            # --- Intelligent Threshold Optimization via PR Curve ---
            precision, recall, thresholds = precision_recall_curve(y_val_fold, val_errors)
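            # Calculate F1 for each threshold, avoiding division by zero.
            # precision/recall have one more entry than thresholds (the final
            # precision=1, recall=0 point), so drop it to keep indices aligned.
            f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-9)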
            
            best_f1_idx = np.argmax(f1_scores)
            best_threshold = thresholds[best_f1_idx]
            best_f1 = f1_scores[best_f1_idx]
            best_thresholds.append(best_threshold)
            print(f"Best Threshold found: {best_threshold:.4f} (with F1 score: {best_f1:.4f})")

            # 7. Calculate and store performance metrics for this fold
            # Use >= to match precision_recall_curve, which counts score >= threshold as positive
            y_pred_val = (val_errors >= best_threshold).astype(int)
            fold_metrics = {
                'accuracy': accuracy_score(y_val_fold, y_pred_val),
                'precision': precision_score(y_val_fold, y_pred_val, zero_division=0),
                'recall': recall_score(y_val_fold, y_pred_val, zero_division=0),
                'f1': f1_score(y_val_fold, y_pred_val, zero_division=0),
                'roc_auc': roc_auc_score(y_val_fold, val_errors) if len(np.unique(y_val_fold)) > 1 else 0.5
            }
            fold_results.append(fold_metrics)
            print(f"Validation Metrics for Fold {fold_idx + 1}: {fold_metrics}")
        else:
            print("Validation fold is empty, skipping evaluation for this fold.")
else:
    print("Cannot run cross-validation as there is no user data.")
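
The threshold search in step 6 can be followed on a four-sample toy example (labels and scores are made up). Two details of `precision_recall_curve` matter here: `precision` and `recall` have one more entry than `thresholds`, and a sample counts as positive when its score is greater than or equal to the threshold.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and anomaly scores (higher score = more anomalous).
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# The final (precision=1, recall=0) point has no threshold, so drop it before the argmax.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-9)
best_idx = int(np.argmax(f1))
best_threshold = float(thresholds[best_idx])

# Apply the threshold with the same >= convention the curve uses.
y_pred = (scores >= best_threshold).astype(int)
print(best_threshold, y_pred.tolist())
```

Here the best F1 (0.8) is reached at a threshold of 0.35. Using a strict `>` would leave out the sample that sits exactly on the threshold and change the reported metrics.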

#### **9. Aggregate and Display Final Performance**
After completing all folds, we calculate the mean and standard deviation of each performance metric. This is more stable than a single split, but note that each fold's threshold is chosen on the same validation data its metrics are computed on, so the reported numbers are likely optimistic for new, unseen data.

In [None]:
if fold_results:
    df_results = pd.DataFrame(fold_results)
    print("\n--- Cross-Validation Results Summary ---")
    print(df_results)

    print("\n--- Average Performance Metrics (± Std Dev) ---")
    mean_metrics = df_results.mean()
    std_metrics = df_results.std()
    summary_df = pd.concat([mean_metrics, std_metrics], axis=1)
    summary_df.columns = ['Mean', 'Std Dev']
    print(summary_df)
    
    # Determine the final threshold, e.g., by taking the mean of the best thresholds
    final_threshold = np.mean(best_thresholds)
    print(f"\n✅ Final Optimized Anomaly Threshold (Mean across folds): {final_threshold:.6f}")
else:
    print("\nNo results to aggregate.")
    final_threshold = 0.5 # Default fallback
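
For reference, the "± Std Dev" column can be reproduced by hand. pandas' `DataFrame.std()` defaults to the sample standard deviation (`ddof=1`), which is what `statistics.stdev` computes. The per-fold F1 values below are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean, stdev

# Hypothetical per-fold F1 scores, for illustration only.
fold_f1 = [0.80, 0.70, 0.90, 0.60, 0.75]

f1_mean = mean(fold_f1)
f1_std = stdev(fold_f1)  # sample standard deviation (n - 1 in the denominator)
print(f"F1: {f1_mean:.3f} ± {f1_std:.3f}")
```

With only five folds the standard deviation is itself a rough estimate, so small differences between runs should not be over-interpreted.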

--- 
### **Part 5: Final Model Training and Artifact Export**

#### **10. Retrain Final Model on All User Data**
With the hyperparameters validated and an optimal thresholding strategy established, we now train the final model. We use the **entire user dataset** for this final fine-tuning step to ensure the deployed model has learned from all available information.

In [None]:
if len(X_user) > 0:
    print("\n--- Training Final Model on ALL User Data ---")
    
    # Create the final balanced fine-tuning dataset using all user data
    num_user_phish_total = np.sum(y_user == 1)
    if num_user_phish_total > 0:
        sample_indices_final = np.random.choice(len(X_public_phish), size=num_user_phish_total, replace=False)
        X_public_phish_sample_final = X_public_phish.iloc[sample_indices_final]
        X_final_train = np.vstack([X_user, X_public_phish_sample_final])
        y_final_train = np.concatenate([y_user, np.ones(num_user_phish_total)])
    else:
        X_final_train = X_user
        y_final_train = y_user

    # Scale the data
    X_final_train_scaled = scaler.transform(X_final_train)
    X_benign_final_train = X_final_train_scaled[y_final_train == 0]

    # Build and compile the final model
    final_ae_model = build_autoencoder(X_user.shape[1])
    final_ae_model.load_weights('pretrained_ae_weights.weights.h5') # Start from pre-trained state
    final_ae_model.compile(optimizer=keras.optimizers.Adam(LR_FINETUNE), loss='mse')

    # Fine-tune on all available benign user data
    if len(X_benign_final_train) > 0:
        print(f"Fine-tuning final model on {len(X_benign_final_train)} benign samples...")
        final_ae_model.fit(
            X_benign_final_train, X_benign_final_train,
            epochs=EPOCHS_FINETUNE,
            batch_size=BATCH_FINETUNE,
            shuffle=True,
            verbose=1,
            callbacks=[keras.callbacks.EarlyStopping(patience=5, monitor='loss', restore_best_weights=True)]
        )
        print("\n✅ Final model training complete.")
    else:
        print("No benign user data available for final training. Using pre-trained model as final.")
        final_ae_model = pretrain_ae_model # Fallback to the pre-trained model
else:
    print("\nNo user data. The final model will be the pre-trained public model.")
    final_ae_model = pretrain_ae_model

#### **11. Export Artifacts for Application**
Finally, we save the three essential artifacts required by your backend application:
1.  `phishing_autoencoder_model.keras`: The trained model itself.
2.  `scaler.pkl`: The fitted scaler.
3.  `autoencoder_threshold.txt`: The optimized anomaly threshold.

In [None]:
print("\n--- Exporting Final Artifacts ---")

# 1. Save the final trained model
final_ae_model.save("phishing_autoencoder_model.keras")
print("Saved phishing_autoencoder_model.keras")

# 2. Save the scaler
with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)
print("Saved scaler.pkl")

# 3. Save the optimized threshold
with open("autoencoder_threshold.txt", "w") as f:
    f.write(str(final_threshold))
print(f"Saved autoencoder_threshold.txt with value: {final_threshold}")

print("\n✅ All artifacts exported successfully. You can now download them from the Colab files panel.")
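
On the consuming side, the threshold file is plain text and can be read back with `float`. The sketch below is an assumption about how a backend might use it, not a description of the actual application: `load_threshold` and `is_phishing` are hypothetical helpers, the value `0.0421` is made up, and the comparison operator should match whatever the training notebook used.

```python
import os
import tempfile

def load_threshold(path: str) -> float:
    """Read the anomaly threshold written as plain text."""
    with open(path, "r") as f:
        return float(f.read().strip())

def is_phishing(reconstruction_error: float, threshold: float) -> bool:
    """Flag a URL when its reconstruction error reaches the threshold."""
    return reconstruction_error >= threshold

# Round-trip demo with a made-up threshold value in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "autoencoder_threshold.txt")
    with open(demo_path, "w") as f:
        f.write(str(0.0421))
    demo_threshold = load_threshold(demo_path)

print(demo_threshold, is_phishing(0.05, demo_threshold), is_phishing(0.01, demo_threshold))
```

The scaler and model would be loaded alongside it (for example with `pickle.load` and `keras.models.load_model`), and inputs must be scaled with the same fitted scaler before the error is computed.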