## <span style="color: red;">CatchPhish</span> - An ML Approach to URL Phishing Detection
<br>

### Advanced Modeling Part 3 - Stress Testing

<br>

#### Author: Omar Kreidie


## Table of Contents: 

1. [Notebook Introduction](#1)
2. [Libraries](#2)
3. [The Optimal XGBoost Model](#3)
4. [Stress Testing](#4)
    - [Synthesized Dataset](#5)
    - [Splitting The Dataset](#6)
    - [The Clean Label - Performance](#7)
    - [The Noisy (synthetic) Data](#8)
    - [Evaluating ONLY the Flipped Samples](#9)
5. [Conclusion](#10)


 <a id ='2'></a>
## Libraries

In [1]:
#Data frames + Array libraries
import pandas as pd
import numpy as np 

#Plotting Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

#Statistics libraries
from scipy import stats
from scipy.stats import norm 


#Sklearn model library for logistic regression + preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score, roc_curve, auc
)
from sklearn.feature_selection import VarianceThreshold
from sklearn.inspection import permutation_importance


#Modelling libraries for the DecisionTree + PCA Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix

#Random Forest from sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier

#importing xgboost
import xgboost as xgb
from xgboost import XGBClassifier

from warnings import filterwarnings
filterwarnings(action='ignore')

 <a id ='1'></a>
## Notebook Introduction

In this notebook, we stress test the optimal XGBoost model from notebook 03.1. The stress test itself is explained in more detail below. 

In [2]:
# loading the full dataset.
old_phish = pd.read_csv('../dataset/PhiUSIIL_Phishing_URL_Dataset.csv')

#loading the reduced dataset from the reduction notebook. 
phish = pd.read_csv('../dataset/reduced_phish4.csv')

In [3]:
#sanity check: the two DataFrames must have the same number of rows before the labels are copied over
assert len(phish) == len(old_phish), "DataFrame row counts are not equal"

#adding the target variable to the reduced DF
phish['label'] = old_phish['label'].values

In [4]:
#Sanity check, making sure there are the right number of features and rows. 
print(f'Number of rows are: {phish.shape[0]} \nNumber of columns are: {phish.shape[1]}')

Number of rows are: 235795 
Number of columns are: 32


 <a id ='3'></a>
## The Optimal XGBoost Model

In [5]:
# splitting independent & dependent variables

X = phish.drop(columns=['label'])
y = old_phish['label']

# Split (same stratified setup)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

xgb_pipeline = Pipeline([
    ('clf', xgb.XGBClassifier(
        objective='binary:logistic',
        eval_metric='auc',
        use_label_encoder=False,
        random_state=42
    ))
])

xgb_param_grid_iter1 = {
    'clf__n_estimators': [150, 200, 250],    # Number of trees
    'clf__max_depth': [5, 6, 7],             # Tree complexity
    'clf__learning_rate': [0.1, 0.15, 0.2],  # smaller = slower learning, better generalization
    'clf__subsample': [0.9, 1.0],            # Row subsampling for regularization
    'clf__colsample_bytree': [0.6, 0.7, 0.8] # Column subsampling (helps avoid overfitting)
}
# Grid search
xgb_grid = GridSearchCV(
    xgb_pipeline,
    param_grid=xgb_param_grid_iter1,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)

xgb_grid.fit(X_train, y_train)

Fitting 5 folds for each of 162 candidates, totalling 810 fits


## XGBoost Model Result

In [6]:
y_pred_xgb = xgb_grid.predict(X_test)
y_proba_xgb = xgb_grid.predict_proba(X_test)[:, 1]

print('Best Parameters:', xgb_grid.best_params_)
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred_xgb))
print('Classification Report:\n', classification_report(y_test, y_pred_xgb))
print('ROC AUC Score:', roc_auc_score(y_test, y_proba_xgb))

Best Parameters: {'clf__colsample_bytree': 0.7, 'clf__learning_rate': 0.2, 'clf__max_depth': 5, 'clf__n_estimators': 250, 'clf__subsample': 1.0}
Confusion Matrix:
 [[20165    24]
 [    7 26963]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     20189
           1       1.00      1.00      1.00     26970

    accuracy                           1.00     47159
   macro avg       1.00      1.00      1.00     47159
weighted avg       1.00      1.00      1.00     47159

ROC AUC Score: 0.9999907896701715


 <a id ='4'></a>
# Stress Testing

To stress test the model, we will introduce label noise. This will simulate real-world label inconsistencies, such as human error or outdated information. 

<br>Using Generative AI, I synthesized a dataset in which 1% of the rows were randomly selected and had their labels flipped, from legitimate to phishing and vice versa. The resulting labels are stored in a column called noisy_label. Because there would otherwise be no way of tracking which labels were flipped, I also prompted the Gen AI model to add a column called label_flipped (0 = unchanged, 1 = flipped). 

<br> The point of the stress test is to simulate real-world imperfection in labeled data. The goal is to see how much performance degrades with a small amount of label noise. This is especially important for phishing detection systems, where label certainty is not guaranteed and false negatives can have major consequences. 
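The notebook does not show how the Gen AI tool produced the file, so the block below is only a minimal sketch of one way such a noisy-label column could be built. The function name `flip_labels`, the seed, and the toy labels are assumptions made for illustration, not the procedure that was actually used.

```python
import numpy as np

def flip_labels(labels, flip_rate=0.01, seed=42):
    """Return (noisy_labels, flipped_flags) for a binary 0/1 label array."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    n_flip = int(len(labels) * flip_rate)           # number of rows to corrupt
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)

    flipped = np.zeros(len(labels), dtype=int)      # 0 = unchanged, 1 = flipped
    flipped[flip_idx] = 1
    noisy = np.where(flipped == 1, 1 - labels, labels)  # invert only the selected rows
    return noisy, flipped

# Toy example (hypothetical data): 1,000 labels, so 10 of them get flipped.
toy_labels = np.array([0, 1] * 500)
toy_noisy, toy_flipped = flip_labels(toy_labels)
print("Flipped rows:", toy_flipped.sum())
```

Keeping the flag array next to the noisy labels is what makes the later "flipped samples only" evaluation possible. With this rounding choice, a 1% rate on 235,795 rows gives 2,357 flips, which is the flipped-row count this notebook reports.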

 <a id ='5'></a>
 ## The Synthesized Dataset

In [7]:
# The stress test dataset
#loading the synthesized dataset (reduced features plus label, noisy_label and label_flipped)
synthetic_phish= pd.read_csv('../dataset/final_stress_test_dataset.csv')

In [8]:
synthetic_phish.tail()

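| | DomainLength | TLDLegitimateProb | TLDLength | NoOfSubDomain | NoOfLettersInURL | LetterRatioInURL | NoOfEqualsInURL | NoOfQMarkInURL | LineOfCode | LargestLineLength | ... | Pay | Crypto | NoOfImage | NoOfCSS | NoOfJS | NoOfEmptyRef | NoOfExternalRef | label | noisy_label | label_flipped |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 235790 | 22 | 0.522907 | 3 | 1 | 16 | 0.552 | 0 | 0 | 2007 | 9381 | ... | 1 | 0 | 51 | 7 | 21 | 2 | 191 | 1 | 1 | 0 |
| 235791 | 21 | 0.028555 | 2 | 2 | 14 | 0.5 | 0 | 0 | 1081 | 348 | ... | 1 | 0 | 50 | 1 | 7 | 0 | 31 | 1 | 1 | 0 |
| 235792 | 23 | 0.003319 | 2 | 1 | 17 | 0.567 | 0 | 0 | 709 | 13277 | ... | 0 | 0 | 27 | 10 | 30 | 2 | 67 | 1 | 1 | 0 |
| 235793 | 47 | 0.000961 | 3 | 2 | 39 | 0.709 | 0 | 0 | 125 | 1807 | ... | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| 235794 | 26 | 0.522907 | 3 | 1 | 20 | 0.606 | 0 | 0 | 1038 | 3346 | ... | 0 | 0 | 21 | 6 | 18 | 0 | 261 | 1 | 1 | 0 |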


In [9]:
#Sanity check, making sure there are the right number of features and rows. 
#We are expecting 34 columns
print(f'Number of rows are: {synthetic_phish.shape[0]} \nNumber of columns are: {synthetic_phish.shape[1]}')

Number of rows are: 235795 
Number of columns are: 34


 <a id ='6'></a>
## Splitting The Dataset (2 target variables, clean & noisy)

In [10]:
# Split features and labels
X_stress = synthetic_phish.drop(columns=['label', 'noisy_label', 'label_flipped'])
y_noisy = synthetic_phish['noisy_label']  # labels with 1% noise
y_clean = synthetic_phish['label']        # original target variable
flipped_mask = synthetic_phish['label_flipped'] == 1  # finding the flipped labels

# the fundamental sanity checks 
print("Shape of X:", X_stress.shape)
print("Noisy label distribution:\n", y_noisy.value_counts())
print("Number of flipped rows:", flipped_mask.sum())

Shape of X: (235795, 31)
Noisy label distribution:
 noisy_label
1    134469
0    101326
Name: count, dtype: int64
Number of flipped rows: 2357


In [11]:
best_model = xgb_grid.best_estimator_

 <a id ='7'></a>
## The Clean Label

In [17]:
# Evaluating performance on the clean labels for comparison
print("Evaluation on Clean Data\n")
y_pred_clean = best_model.predict(X_stress)
print("Confusion Matrix:")
print(confusion_matrix(y_clean, y_pred_clean))
print("\nClassification Report:")
print(classification_report(y_clean, y_pred_clean))

Evaluation on Clean Data

Confusion Matrix:
[[100918     27]
 [     7 134843]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    100945
           1       1.00      1.00      1.00    134850

    accuracy                           1.00    235795
   macro avg       1.00      1.00      1.00    235795
weighted avg       1.00      1.00      1.00    235795



 <a id ='8'></a>
## The Noisy (synthetic) Data

In [14]:
# generating predictions on the stress test features
y_pred = best_model.predict(X_stress)
y_proba = best_model.predict_proba(X_stress)[:, 1]


print("Evaluation on Synthetic (Noisy) Data\n")
print("Confusion Matrix:")
print(confusion_matrix(y_noisy, y_pred))
print("\nClassification Report:")
print(classification_report(y_noisy, y_pred))
print("ROC AUC Score:", roc_auc_score(y_noisy, y_proba))

Evaluation on Synthetic (Noisy) Data

Confusion Matrix:
[[ 99932   1394]
 [   993 133476]]

Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99    101326
           1       0.99      0.99      0.99    134469

    accuracy                           0.99    235795
   macro avg       0.99      0.99      0.99    235795
weighted avg       0.99      0.99      0.99    235795

ROC AUC Score: 0.989495943062187


With the noisy labels, where 1% of the labels were flipped, we can see the following: 

1. The number of misclassifications (false positives plus false negatives) increased from 34 to 2,387. 
2. The model was not retrained, so its predictions are the same as in the clean evaluation. The additional errors come from the rows whose labels were flipped. 
3. ROC AUC dropped from 0.9999 to 0.9895, which is still a great score, but it shows that the model's measured performance has clearly decreased. 

----

What this tells us:

- The model was not trained to handle label noise, which means that even a small share of inconsistent labels lowers its measured performance. 
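As a rough cross-check of the size of this drop, the sketch below estimates the accuracy one would expect against the noisy labels, assuming the flips are independent of whether the model was right. The helper name `expected_noisy_accuracy` is hypothetical; the counts are taken from the outputs printed above.

```python
def expected_noisy_accuracy(clean_accuracy, flip_rate):
    """Expected accuracy against binary labels after random flips.

    A correct prediction stays correct only if its label was NOT flipped,
    while a wrong prediction becomes "correct" if its label WAS flipped.
    Assumes flips are independent of the model's correctness.
    """
    return clean_accuracy * (1 - flip_rate) + (1 - clean_accuracy) * flip_rate

n_rows = 235795
clean_errors = 27 + 7        # off-diagonal of the clean confusion matrix
noisy_errors = 1394 + 993    # off-diagonal of the noisy confusion matrix
n_flipped = 2357             # flipped rows reported by the sanity check

clean_accuracy = 1 - clean_errors / n_rows
flip_rate = n_flipped / n_rows

estimate = expected_noisy_accuracy(clean_accuracy, flip_rate)
observed = 1 - noisy_errors / n_rows
print(f"Estimated accuracy on noisy labels: {estimate:.4f}")
print(f"Observed accuracy on noisy labels:  {observed:.4f}")
```

Under that independence assumption, the estimate and the observed value should land very close to each other, which suggests the drop is almost entirely explained by the flipped labels rather than by a change in the model's behaviour.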

 <a id ='9'></a>
## ONLY the Flipped Samples 

In [15]:
# Evaluating performance specifically on the flipped samples
print("Evaluation on Flipped (Noisy) Samples Only\n")
# creating masks for flipped samples
flipped_indices = synthetic_phish[synthetic_phish['label_flipped'] == 1].index
X_flipped = X_stress.loc[flipped_indices]
y_noisy_flipped = y_noisy.loc[flipped_indices]
y_clean_flipped = y_clean.loc[flipped_indices]

# Making predictions on flipped subset
y_pred_flipped = best_model.predict(X_flipped)
y_proba_flipped = best_model.predict_proba(X_flipped)[:, 1]

print("Confusion Matrix (Flipped Samples):")
print(confusion_matrix(y_noisy_flipped, y_pred_flipped))
print("\nClassification Report (Flipped Samples):")
print(classification_report(y_noisy_flipped, y_pred_flipped))
print("ROC AUC Score (Flipped Samples):", roc_auc_score(y_noisy_flipped, y_proba_flipped))

Evaluation on Flipped (Noisy) Samples Only

Confusion Matrix (Flipped Samples):
[[   0 1369]
 [ 986    2]]

Classification Report (Flipped Samples):
              precision    recall  f1-score   support

           0       0.00      0.00      0.00      1369
           1       0.00      0.00      0.00       988

    accuracy                           0.00      2357
   macro avg       0.00      0.00      0.00      2357
weighted avg       0.00      0.00      0.00      2357

ROC AUC Score (Flipped Samples): 1.4786643520640723e-05


The result of flipping 1% of all the labels is: 
- The model almost always predicted the original class, meaning that its answers would be correct for the most part if the labels had not been flipped. 
- This means that it consistently disagreed with what the target variable was telling it. This is actually not a bad thing, because the label was flipped. 

<br> For example, if the model was told that a phishing URL was legitimate, the model will still tell you it's phishing. This is a positive: although the model fails numerically in this stress test, its logic makes sense, which means it's refusing to believe the wrong labels. 
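To make this concrete, the sketch below reuses the three confusion matrices printed in this notebook and splits the errors on the noisy labels into those caused by flipped labels and those the model genuinely gets wrong. The helper names are hypothetical, and the reasoning relies on the labels being binary, so disagreeing with a flipped label is the same as agreeing with the original one.

```python
def off_diagonal(cm):
    """Number of rows where prediction and label disagree."""
    return cm[0][1] + cm[1][0]

def diagonal(cm):
    """Number of rows where prediction and label agree."""
    return cm[0][0] + cm[1][1]

# Confusion matrices copied from the cell outputs above.
clean_cm = [[100918, 27], [7, 134843]]      # predictions vs. original labels
noisy_cm = [[99932, 1394], [993, 133476]]   # predictions vs. noisy labels
flipped_cm = [[0, 1369], [986, 2]]          # flipped rows only, vs. noisy labels

errors_vs_clean = off_diagonal(clean_cm)
errors_vs_noisy = off_diagonal(noisy_cm)

# On flipped rows, disagreeing with the noisy label means matching the original label.
matches_original_on_flipped = off_diagonal(flipped_cm)
genuine_errors_on_flipped = diagonal(flipped_cm)
genuine_errors_on_unflipped = errors_vs_noisy - matches_original_on_flipped

print("Errors vs. noisy labels:", errors_vs_noisy)
print("  caused by flipped labels:", matches_original_on_flipped)
print("  genuine errors on unflipped rows:", genuine_errors_on_unflipped)
print("Genuine errors on flipped rows:", genuine_errors_on_flipped)
```

The genuine errors on unflipped and flipped rows add up to the 34 misclassifications seen on the clean labels, so the three evaluations are consistent with one another: 2,355 of the 2,387 errors on the noisy labels are rows where the model simply kept predicting the original class.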

 <a id ='10'></a>
## Conclusion 

The clean labels had 99.99% accuracy with an ROC AUC of 0.99999. That's a great performance. 

----

The noisy labels had an accuracy of 98.99% with an ROC AUC of 0.9895. That's a good performance, but a degradation was found after adding 1% label noise. 

----

When looking at just the flipped labels, numerically the model failed, with roughly 0% accuracy (2 of 2,357 rows) and an ROC AUC of about 0.00001. However, logically this is fantastic because the error was on the human and not the model. 
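The near-zero ROC AUC on the flipped rows is the mirror image of a near-perfect one: for binary labels, scoring the same predictions against fully inverted labels gives 1 minus the original AUC. The sketch below illustrates this with a small hand-written pairwise AUC and made-up toy scores (both are assumptions for illustration, not the notebook's data or scikit-learn's implementation).

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): original labels, model scores, and fully flipped labels.
true_labels = [1, 1, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.40, 0.60, 0.10, 0.05]
flipped_labels = [1 - y for y in true_labels]

auc_true = pairwise_auc(true_labels, scores)
auc_flipped = pairwise_auc(flipped_labels, scores)
print("AUC vs. original labels:", auc_true)
print("AUC vs. flipped labels: ", auc_flipped)

# Applying the same identity to the value reported on the flipped rows.
reported_auc_on_flipped = 1.4786643520640723e-05
implied_auc_on_original = 1 - reported_auc_on_flipped
print("Implied AUC vs. original labels on the flipped rows:", implied_auc_on_original)
```

Because every label in the flipped subset is inverted, the reported value implies an ROC AUC of roughly 0.99999 against the original labels on those same rows.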

----

In conclusion, the XGBoost model is confident and consistent, and it predicts based on learned patterns. 

In [18]:
import joblib

#saving the entire XGBmodel pipeline into a .pkl file. 
joblib.dump(best_model, 'xgb_model.pkl')

['xgb_model.pkl']

In [20]:
phish.columns

Index(['DomainLength', 'TLDLegitimateProb', 'TLDLength', 'NoOfSubDomain',
       'NoOfLettersInURL', 'LetterRatioInURL', 'NoOfEqualsInURL',
       'NoOfQMarkInURL', 'LineOfCode', 'LargestLineLength', 'HasTitle',
       'DomainTitleMatchScore', 'HasFavicon', 'Robots', 'IsResponsive',
       'NoOfURLRedirect', 'NoOfSelfRedirect', 'NoOfPopup', 'NoOfiFrame',
       'HasExternalFormSubmit', 'HasSubmitButton', 'HasHiddenFields',
       'HasPasswordField', 'Bank', 'Pay', 'Crypto', 'NoOfImage', 'NoOfCSS',
       'NoOfJS', 'NoOfEmptyRef', 'NoOfExternalRef', 'label'],
      dtype='object')

In [21]:
phish[phish['label'] == 0].head()

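| | DomainLength | TLDLegitimateProb | TLDLength | NoOfSubDomain | NoOfLettersInURL | LetterRatioInURL | NoOfEqualsInURL | NoOfQMarkInURL | LineOfCode | LargestLineLength | ... | HasPasswordField | Bank | Pay | Crypto | NoOfImage | NoOfCSS | NoOfJS | NoOfEmptyRef | NoOfExternalRef | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | 16 | 0.522907 | 3 | 1 | 10 | 0.455 | 0 | 0 | 2 | 87 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 20 | 20 | 0.018013 | 2 | 2 | 6 | 0.231 | 0 | 0 | 9 | 44 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 21 | 17 | 5.3e-05 | 2 | 1 | 12 | 0.5 | 0 | 0 | 20 | 211 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 27 | 29 | 0.522907 | 3 | 1 | 26 | 0.703 | 0 | 0 | 16 | 113 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 28 | 26 | 0.03265 | 2 | 2 | 16 | 0.5 | 0 | 0 | 43 | 283 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |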


In [23]:
phish.iloc[20]

DomainLength             20.000000
TLDLegitimateProb         0.018013
TLDLength                 2.000000
NoOfSubDomain             2.000000
NoOfLettersInURL          6.000000
LetterRatioInURL          0.231000
NoOfEqualsInURL           0.000000
NoOfQMarkInURL            0.000000
LineOfCode                9.000000
LargestLineLength        44.000000
HasTitle                  1.000000
DomainTitleMatchScore     0.000000
HasFavicon                0.000000
Robots                    0.000000
IsResponsive              0.000000
NoOfURLRedirect           0.000000
NoOfSelfRedirect          0.000000
NoOfPopup                 0.000000
NoOfiFrame                0.000000
HasExternalFormSubmit     0.000000
HasSubmitButton           0.000000
HasHiddenFields           0.000000
HasPasswordField          0.000000
Bank                      0.000000
Pay                       0.000000
Crypto                    0.000000
NoOfImage                 0.000000
NoOfCSS                   0.000000
NoOfJS              