
Phishing Website Detection using Machine Learning

🛡️ Comprehensive Machine Learning Pipeline for Phishing Detection

This project implements multiple machine learning models to identify malicious phishing URLs with high accuracy. Using the Kaggle Phishing Dataset, I developed and compared various classification models designed to distinguish between legitimate and phishing websites based on 48 structural features (URL length, special characters, prefix/suffix, etc.).

📌 Project Overview

Phishing remains one of the primary vectors for cyber-attacks. This project explores different machine learning approaches to build robust classifiers for URL-based phishing detection. The analysis includes traditional models like Logistic Regression and ensemble methods, as well as modern techniques like XGBoost and deep learning with TensorFlow.

🗂️ Dataset

  • Source: Kaggle Phishing Dataset
  • File: Data/Phishing_Legitimate_full.csv
  • Features: 48 URL-based features
  • Target: Binary classification (0: Legitimate, 1: Phishing)
  • Size: ~11,000 samples

🤖 Models Implemented

1. Logistic Regression

  • Notebook: Logistrical_Regression.ipynb
  • Description: L2-regularized logistic regression for interpretable classification
  • Key Features: Feature importance analysis, confusion matrix visualization

2. Random Forest

  • Notebook: Decision_Tree_with_Random_Forest.ipynb
  • Description: Ensemble of decision trees with random feature selection
  • Parameters: 100 trees, sqrt features, bootstrap sampling
  • Visualization: Decision tree plot (first tree, max depth 2)
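With the parameters listed above, the scikit-learn setup looks roughly like this (synthetic data stands in for the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical 48-feature data in place of the Kaggle dataset.
X, y = make_classification(n_samples=500, n_features=48, random_state=0)

# 100 trees, sqrt(n_features) candidates per split, bootstrap sampling.
rf = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", bootstrap=True, random_state=0)
rf.fit(X, y)

# The notebook's plot of the first tree (depth-limited) would be e.g.:
# from sklearn.tree import plot_tree
# plot_tree(rf.estimators_[0], max_depth=2)
```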

3. Bagging Classifier (Decision Trees with Sample Replace)

  • Notebook: Decision_Tree_with_Sample_Replace.ipynb
  • Description: Bootstrap aggregation of decision trees
  • Parameters: 100 trees, bootstrap=True, max_samples=1.0
  • Visualization: Decision tree plot (first tree, max depth 2)

4. XGBoost

  • Notebook: Decision_Tree_with_XGBoost.ipynb
  • Description: Gradient boosting with decision trees
  • Parameters: 100 estimators, max_depth=5, learning_rate=0.1, subsample=0.8, colsample_bytree=0.8
  • Visualization: Feature importance plot (top 5 features)

5. TensorFlow Neural Network

  • Notebook: Tensorflow.ipynb
  • Description: Deep learning model with hyperparameter tuning using Keras Tuner
  • Architecture: Tunable layers (1-5), units (8-128), activations (relu/tanh), optimizers (adam/rmsprop)
  • Features: Early stopping, feature scaling, training history plots
  • Tuning: Random search with 20 trials, objective: val_accuracy

📊 Data Splitting Strategy

All models use a consistent 60/20/20 split:

  • Training: 60% of data
  • Cross-Validation: 20% of data (for hyperparameter tuning and validation)
  • Testing: 20% of data (final evaluation)
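One way to produce this 60/20/20 split is two chained `train_test_split` calls (a sketch with hypothetical data; the second split takes 25% of the remaining 80%, i.e. 20% of the total):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the ~11,000-sample, 48-feature dataset.
X = np.random.rand(1000, 48)
y = np.random.randint(0, 2, 1000)

# First carve off the 20% test set...
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
# ...then 25% of the remainder becomes the CV set: 0.25 * 0.80 = 0.20.
X_train, X_cv, y_train, y_cv = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)
```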

🛠️ Technical Stack

  • Language: Python 3.x
  • Libraries:
    • Data Processing: Pandas, NumPy
    • Machine Learning: Scikit-Learn, XGBoost
    • Deep Learning: TensorFlow, Keras, Keras Tuner
    • Visualization: Matplotlib
  • Environment: Virtual environment (.venv)

📈 Model Performance Summary

| Model | Test Accuracy | Key Strengths | Notebook |
|-------|---------------|---------------|----------|
| Logistic Regression | ~94% | Interpretable, fast | Logistrical_Regression.ipynb |
| Random Forest | High | Robust, handles non-linearity | Decision_Tree_with_Random_Forest.ipynb |
| Bagging (Decision Trees) | High | Reduces overfitting | Decision_Tree_with_Sample_Replace.ipynb |
| XGBoost | High | Gradient boosting, feature importance | Decision_Tree_with_XGBoost.ipynb |
| TensorFlow NN | Tunable | Deep learning, automated tuning | Tensorflow.ipynb |

Note: Exact metrics vary; run notebooks for detailed classification reports including precision, recall, and F1-scores.

🚀 How to Run

  1. Clone/Setup the Repository:

    # Navigate to the project directory
    cd "Phishing Dataset - UV/Phishing-Dataset"
  2. Activate Virtual Environment:

    # Activate the existing venv
    .venv\Scripts\Activate.ps1  # Windows PowerShell
  3. Install Dependencies (if needed):

    pip install pandas scikit-learn numpy matplotlib xgboost tensorflow keras-tuner
  4. Run Individual Notebooks:

    • Open any .ipynb file in Jupyter/VS Code
    • Execute cells sequentially
    • Note: TensorFlow notebook requires keras-tuner for hyperparameter tuning

📋 Dependencies

Core requirements (install via pip):

pandas
scikit-learn
numpy
matplotlib
xgboost
tensorflow
keras-tuner

💡 Key Learnings & Insights

  • Model Comparison: Ensemble methods (Random Forest, XGBoost) generally outperform single models but with higher computational cost
  • Interpretability: Logistic Regression provides clear feature importance, crucial for understanding phishing indicators
  • Deep Learning: Neural networks offer flexibility but require careful tuning and more resources
  • Evaluation: In cybersecurity, recall (catching phishing) is often prioritized over precision to minimize security risks
  • Data Splitting: Proper train/CV/test splits prevent overfitting and ensure reliable performance estimates
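The recall-over-precision point can be made concrete with a toy example, where class 1 marks phishing (illustrative values only):

```python
from sklearn.metrics import precision_score, recall_score

# A missed phishing site (false negative) is usually costlier than a
# false alarm, so recall on the phishing class is the key metric.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]  # one phishing site missed, one false alarm

recall = recall_score(y_true, y_pred)        # 2 of 3 phishing sites caught
precision = precision_score(y_true, y_pred)  # 2 of 3 flagged sites are real
```

The notebooks' `classification_report` output exposes both quantities per class, which is why it is preferred over accuracy alone.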

🔍 Feature Analysis

The dataset includes 48 features capturing various URL characteristics:

  • Length metrics
  • Special character counts
  • Domain properties
  • Security indicators
  • Structural patterns

Top phishing indicators typically include suspicious domain patterns, excessive special characters, and abnormal URL structures.

📝 Future Improvements

  • Feature engineering for additional URL-based signals
  • Model stacking/ensembling of best performers
  • Real-time deployment considerations
  • Cross-validation with different data splits
  • Comparison with other datasets

📄 Notebooks Overview

  • Logistrical_Regression.ipynb: Baseline interpretable model
  • Decision_Tree_with_Random_Forest.ipynb: Ensemble learning with random forests
  • Decision_Tree_with_Sample_Replace.ipynb: Bagging approach
  • Decision_Tree_with_XGBoost.ipynb: Gradient boosting implementation
  • Tensorflow.ipynb: Deep learning with automated hyperparameter tuning

Each notebook includes data loading, preprocessing, model training, evaluation, and visualization.
