🛡️ Comprehensive Machine Learning Pipeline for Phishing Detection
This project implements and compares multiple machine learning models to identify phishing URLs with high accuracy. Using the Kaggle Phishing Dataset, I developed several classification models designed to distinguish between legitimate and phishing websites based on 48 structural features (URL length, special characters, prefix/suffix, etc.).
Phishing remains one of the primary vectors for cyber-attacks. This project explores different machine learning approaches to build robust classifiers for URL-based phishing detection. The analysis includes traditional models like Logistic Regression and ensemble methods, as well as modern techniques like XGBoost and deep learning with TensorFlow.
- Source: Kaggle Phishing Dataset
- File: `Data/Phishing_Legitimate_full.csv`
- Features: 48 URL-based features
- Target: Binary classification (0: Legitimate, 1: Phishing)
- Size: ~11,000 samples
- Notebook: `Logistrical_Regression.ipynb`
- Description: L2-regularized logistic regression for interpretable classification
- Key Features: Feature importance analysis, confusion matrix visualization
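The setup above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: `make_classification` stands in for the real CSV, and names like `clf` and `top5` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 48 URL features (the notebook loads the real CSV)
X, y = make_classification(n_samples=1000, n_features=48, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)

# L2 regularization is sklearn's default penalty; C controls its strength
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
clf.fit(scaler.transform(X_train), y_train)

# Coefficient magnitudes give a rough feature-importance ranking
importance = np.abs(clf.coef_[0])
top5 = np.argsort(importance)[::-1][:5]
print("test accuracy:", clf.score(scaler.transform(X_test), y_test))
```

Scaling matters here: without it, coefficient magnitudes are not comparable across features with different ranges.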
- Notebook: `Decision_Tree_with_Random_Forest.ipynb`
- Description: Ensemble of decision trees with random feature selection
- Parameters: 100 trees, sqrt features, bootstrap sampling
- Visualization: Decision tree plot (first tree, max depth 2)
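A minimal sketch of this configuration, using synthetic data in place of the dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=48, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Mirrors the parameters above: 100 trees, sqrt(n_features) per split,
# bootstrap sampling of the training set for each tree
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=42)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))

# The notebook's tree plot (first tree, depth 2) would be:
# from sklearn.tree import plot_tree
# plot_tree(rf.estimators_[0], max_depth=2)
```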
- Notebook: `Decision_Tree_with_Sample_Replace.ipynb`
- Description: Bootstrap aggregation of decision trees
- Parameters: 100 trees, bootstrap=True, max_samples=1.0
- Visualization: Decision tree plot (first tree, max depth 2)
- Notebook: `Decision_Tree_with_XGBoost.ipynb`
- Description: Gradient boosting with decision trees
- Parameters: 100 estimators, max_depth=5, learning_rate=0.1, subsample=0.8, colsample_bytree=0.8
- Visualization: Feature importance plot (top 5 features)
- Notebook: `Tensorflow.ipynb`
- Description: Deep learning model with hyperparameter tuning using Keras Tuner
- Architecture: Tunable layers (1-5), units (8-128), activations (relu/tanh), optimizers (adam/rmsprop)
- Features: Early stopping, feature scaling, training history plots
- Tuning: Random search with 20 trials, objective: val_accuracy
All models use a consistent 60/20/20 split:
- Training: 60% of data
- Cross-Validation: 20% of data (for hyperparameter tuning and validation)
- Testing: 20% of data (final evaluation)
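The 60/20/20 split can be produced with two calls to `train_test_split`; this sketch uses synthetic data in place of the CSV the notebooks load:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=48, random_state=42)

# First hold out 20% as the test set, then split the remaining 80% so that
# a quarter of it (20% of the total) becomes the cross-validation set
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
X_train, X_cv, y_train, y_cv = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42)

print(len(X_train), len(X_cv), len(X_test))  # → 600 200 200
```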
- Language: Python 3.x
- Libraries:
- Data Processing: Pandas, NumPy
- Machine Learning: Scikit-Learn, XGBoost
- Deep Learning: TensorFlow, Keras, Keras Tuner
- Visualization: Matplotlib
- Environment: Virtual environment (`.venv`)
| Model | Test Accuracy | Key Strengths | Notebook |
|---|---|---|---|
| Logistic Regression | ~94% | Interpretable, fast | Logistrical_Regression.ipynb |
| Random Forest | High | Robust, handles non-linearity | Decision_Tree_with_Random_Forest.ipynb |
| Bagging (Decision Trees) | High | Reduces overfitting | Decision_Tree_with_Sample_Replace.ipynb |
| XGBoost | High | Gradient boosting, feature importance | Decision_Tree_with_XGBoost.ipynb |
| TensorFlow NN | Tunable | Deep learning, automated tuning | Tensorflow.ipynb |
Note: Exact metrics vary; run notebooks for detailed classification reports including precision, recall, and F1-scores.
1. Clone/Setup the Repository:

   ```shell
   # Navigate to the project directory
   cd "Phishing Dataset - UV/Phishing-Dataset"
   ```

2. Activate Virtual Environment:

   ```shell
   # Activate the existing venv (Windows PowerShell)
   .venv\Scripts\Activate.ps1
   ```

3. Install Dependencies (if needed):

   ```shell
   pip install pandas scikit-learn numpy matplotlib xgboost tensorflow keras-tuner
   ```

4. Run Individual Notebooks:
   - Open any `.ipynb` file in Jupyter/VS Code
   - Execute cells sequentially
   - Note: the TensorFlow notebook requires `keras-tuner` for hyperparameter tuning
Core requirements (install via pip):

```
pandas
scikit-learn
numpy
matplotlib
xgboost
tensorflow
keras-tuner
```
- Model Comparison: Ensemble methods (Random Forest, XGBoost) generally outperform single models but with higher computational cost
- Interpretability: Logistic Regression provides clear feature importance, crucial for understanding phishing indicators
- Deep Learning: Neural networks offer flexibility but require careful tuning and more resources
- Evaluation: In cybersecurity, recall (catching phishing) is often prioritized over precision to minimize security risks
- Data Splitting: Proper train/CV/test splits prevent overfitting and ensure reliable performance estimates
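The recall-over-precision point can be made concrete with scikit-learn's metrics; the labels below are toy values, not results from the notebooks:

```python
from sklearn.metrics import classification_report, recall_score

# Toy ground truth and predictions (1 = phishing);
# in practice these come from a trained model
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# Recall on the phishing class: what fraction of phishing URLs were caught?
# Here 3 of 4 phishing samples are flagged → 0.75
print("phishing recall:", recall_score(y_true, y_pred, pos_label=1))
print(classification_report(y_true, y_pred,
                            target_names=["legitimate", "phishing"]))
```

A missed phishing URL (false negative) is usually costlier than a false alarm, which is why the phishing-class recall row of the report deserves the most attention.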
The dataset includes 48 features capturing various URL characteristics:
- Length metrics
- Special character counts
- Domain properties
- Security indicators
- Structural patterns
Top phishing indicators typically include suspicious domain patterns, excessive special characters, and abnormal URL structures.
- Feature engineering for additional URL-based signals
- Model stacking/ensembling of best performers
- Real-time deployment considerations
- Cross-validation with different data splits
- Comparison with other datasets
- `Logistrical_Regression.ipynb`: Baseline interpretable model
- `Decision_Tree_with_Random_Forest.ipynb`: Ensemble learning with random forests
- `Decision_Tree_with_Sample_Replace.ipynb`: Bagging approach
- `Decision_Tree_with_XGBoost.ipynb`: Gradient boosting implementation
- `Tensorflow.ipynb`: Deep learning with automated hyperparameter tuning
Each notebook includes data loading, preprocessing, model training, evaluation, and visualization.