🛡️ Comprehensive Machine Learning Pipeline for Phishing Detection
This project implements and compares multiple machine learning models to identify phishing URLs with high accuracy. Using the Kaggle Phishing Dataset, I developed several classification models designed to distinguish between legitimate and phishing websites based on 48 structural features (URL length, special characters, prefix/suffix, etc.).
Phishing remains one of the primary vectors for cyber-attacks. This project explores different machine learning approaches to build robust classifiers for URL-based phishing detection. The analysis includes traditional models like Logistic Regression and ensemble methods, as well as modern techniques like XGBoost and deep learning with TensorFlow.
- Source: Kaggle Phishing Dataset
- File: `Data/Phishing_Legitimate_full.csv`
- Features: 48 URL-based features
- Target: Binary classification (0: Legitimate, 1: Phishing)
- Size: ~11,000 samples
- Notebook: `Logistrical_Regression.ipynb`
- Description: L2-regularized logistic regression for interpretable classification
- Key Features: Feature importance analysis, confusion matrix visualization
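The setup above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: `make_classification` stands in for the real CSV, and names like `clf` and `top5` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 48 URL features (the notebook loads the real CSV)
X, y = make_classification(n_samples=1000, n_features=48, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)

# L2 regularization is sklearn's default penalty; C controls its strength
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
clf.fit(scaler.transform(X_train), y_train)

# Coefficient magnitudes give a rough feature-importance ranking
importance = np.abs(clf.coef_[0])
top5 = np.argsort(importance)[::-1][:5]
print("test accuracy:", clf.score(scaler.transform(X_test), y_test))
```

Scaling matters here: without it, coefficient magnitudes are not comparable across features with different ranges.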
- Notebook: `Decision_Tree_with_Random_Forest.ipynb`
- Description: Ensemble of decision trees with random feature selection
- Parameters: 100 trees, sqrt features, bootstrap sampling
- Visualization: Decision tree plot (first tree, max depth 2)
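A minimal sketch of this configuration, using synthetic data in place of the dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=48, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Mirrors the parameters above: 100 trees, sqrt(n_features) per split,
# bootstrap sampling of the training set for each tree
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=42)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))

# The notebook's tree plot (first tree, depth 2) would be:
# from sklearn.tree import plot_tree
# plot_tree(rf.estimators_[0], max_depth=2)
```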
- Notebook: `Decision_Tree_with_Sample_Replace.ipynb`
- Description: Bootstrap aggregation of decision trees
- Parameters: 100 trees, bootstrap=True, max_samples=1.0
- Visualization: Decision tree plot (first tree, max depth 2)
- Notebook: `Decision_Tree_with_XGBoost.ipynb`
- Description: Gradient boosting with decision trees
- Parameters: 100 estimators, max_depth=5, learning_rate=0.1, subsample=0.8, colsample_bytree=0.8
- Visualization: Feature importance plot (top 5 features)
- Notebook: `Tensorflow.ipynb`
- Description: Deep learning model with hyperparameter tuning using Keras Tuner
- Architecture: Tunable layers (1-5), units (8-128), activations (relu/tanh), optimizers (adam/rmsprop)
- Features: Early stopping, feature scaling, training history plots
- Tuning: Random search with 20 trials, objective: val_accuracy
All models use a consistent 60/20/20 split:
- Training: 60% of data
- Cross-Validation: 20% of data (for hyperparameter tuning and validation)
- Testing: 20% of data (final evaluation)
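The 60/20/20 split can be produced with two calls to `train_test_split`; this sketch uses synthetic data in place of the CSV the notebooks load:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=48, random_state=42)

# First hold out 20% as the test set, then split the remaining 80% so that
# a quarter of it (20% of the total) becomes the cross-validation set
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
X_train, X_cv, y_train, y_cv = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42)

print(len(X_train), len(X_cv), len(X_test))  # → 600 200 200
```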
- Language: Python 3.x
- Libraries:
- Data Processing: Pandas, NumPy
- Machine Learning: Scikit-Learn, XGBoost
- Deep Learning: TensorFlow, Keras, Keras Tuner
- Visualization: Matplotlib
- Environment: Virtual environment (`.venv`)
| Model | Test Accuracy | Key Strengths | Notebook |
|---|---|---|---|
| Logistic Regression | ~94% | Interpretable, fast | Logistrical_Regression.ipynb |
| Random Forest | High | Robust, handles non-linearity | Decision_Tree_with_Random_Forest.ipynb |
| Bagging (Decision Trees) | High | Reduces overfitting | Decision_Tree_with_Sample_Replace.ipynb |
| XGBoost | High | Gradient boosting, feature importance | Decision_Tree_with_XGBoost.ipynb |
| TensorFlow NN | Tunable | Deep learning, automated tuning | Tensorflow.ipynb |
Note: Exact metrics vary; run notebooks for detailed classification reports including precision, recall, and F1-scores.
1. Clone/Setup the Repository:

   ```shell
   # Navigate to the project directory
   cd "Phishing Dataset - UV/Phishing-Dataset"
   ```

2. Activate Virtual Environment:

   ```shell
   # Activate the existing venv (Windows PowerShell)
   .venv\Scripts\Activate.ps1
   ```

3. Install Dependencies (if needed):

   ```shell
   pip install pandas scikit-learn numpy matplotlib xgboost tensorflow keras-tuner
   ```

4. Run Individual Notebooks:
   - Open any `.ipynb` file in Jupyter/VS Code
   - Execute cells sequentially
   - Note: the TensorFlow notebook requires `keras-tuner` for hyperparameter tuning
Core requirements (install via pip):

```
pandas
scikit-learn
numpy
matplotlib
xgboost
tensorflow
keras-tuner
```
- Model Comparison: Ensemble methods (Random Forest, XGBoost) generally outperform single models but with higher computational cost
- Interpretability: Logistic Regression provides clear feature importance, crucial for understanding phishing indicators
- Deep Learning: Neural networks offer flexibility but require careful tuning and more resources
- Evaluation: In cybersecurity, recall (catching phishing) is often prioritized over precision to minimize security risks
- Data Splitting: Proper train/CV/test splits prevent overfitting and ensure reliable performance estimates
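The recall-over-precision point can be made concrete with scikit-learn's metrics; the labels below are toy values, not results from the notebooks:

```python
from sklearn.metrics import classification_report, recall_score

# Toy ground truth and predictions (1 = phishing);
# in practice these come from a trained model
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# Recall on the phishing class: what fraction of phishing URLs were caught?
# Here 3 of 4 phishing samples are flagged → 0.75
print("phishing recall:", recall_score(y_true, y_pred, pos_label=1))
print(classification_report(y_true, y_pred,
                            target_names=["legitimate", "phishing"]))
```

A missed phishing URL (false negative) is usually costlier than a false alarm, which is why the phishing-class recall row of the report deserves the most attention.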
The dataset includes 48 features capturing various URL characteristics:
- Length metrics
- Special character counts
- Domain properties
- Security indicators
- Structural patterns
Top phishing indicators typically include suspicious domain patterns, excessive special characters, and abnormal URL structures.
- Feature engineering for additional URL-based signals
- Model stacking/ensembling of best performers
- Real-time deployment considerations
- Cross-validation with different data splits
- Comparison with other datasets
- `Logistrical_Regression.ipynb`: Baseline interpretable model
- `Decision_Tree_with_Random_Forest.ipynb`: Ensemble learning with random forests
- `Decision_Tree_with_Sample_Replace.ipynb`: Bagging approach
- `Decision_Tree_with_XGBoost.ipynb`: Gradient boosting implementation
- `Tensorflow.ipynb`: Deep learning with automated hyperparameter tuning
Each notebook includes data loading, preprocessing, model training, evaluation, and visualization.