# The Intelligent Phishing Detection System  
## Using Machine Learning, URL Features, and Web Content Analysis  

---

### Developed By: Hyndavi Yasarapu
### ID: 24090531
### Course: MSc Advanced Computer Science
### Course Title: 7COM1039-0206-2025 - Advanced Computer Science Masters Project

---

### 📌 Project Overview

This project presents an intelligent phishing detection system that combines:

- **Machine Learning (ML) Techniques**
- **URL-Based Feature Extraction**
- **Web Content Analysis**
- **Hybrid Detection Approach (Rule-Based + ML-Based)**

The system is designed to accurately detect malicious phishing websites by analyzing structural URL patterns, extracting meaningful web content features, and applying trained machine learning models.
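To make "structural URL patterns" concrete, here is a minimal sketch of a few lexical URL features. The function name `url_structure_features` is hypothetical and is not the project's extractor; the feature names are borrowed from the dataset columns shown in the Step 1 output, and their exact definitions here are assumptions.

```python
from urllib.parse import urlparse


def url_structure_features(url: str) -> dict:
    """Compute a few lexical features from a URL string.

    Hypothetical helper for illustration only; the definitions are
    assumptions and may differ from the dataset's own.
    """
    hostname = urlparse(url).hostname or ""
    return {
        "length_url": len(url),            # total characters in the URL
        "length_hostname": len(hostname),  # characters in the host part
        "nb_dots": url.count("."),         # number of '.' characters
        "nb_hyphens": url.count("-"),      # number of '-' characters
        "nb_at": url.count("@"),           # number of '@' characters
    }


features = url_structure_features("http://www.crestonwood.com/router.php")
print(features)
```

For this URL the sketch yields a URL length of 37, a hostname length of 19, and 3 dots, which agrees with the corresponding columns for the same URL in the Step 1 output.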

---

### ðŸŽ¯ Objective

To build a robust, scalable, and intelligent cybersecurity solution that improves phishing detection accuracy by integrating multiple analytical techniques.

---

## Importing Required Libraries and Project Modules

In this section, we import the libraries and modules required to run the Intelligent Phishing Detection System.

### Standard Python Libraries
- **os** → Used for handling file paths and directory operations.
- **sys** → Allows modification of the Python path to include project folders.
- **warnings** → Used to suppress unnecessary warning messages for cleaner output.

-- <small> **Author** Hyndavi Yasarapu </small> --

In [2]:
import os
import sys
import warnings
warnings.filterwarnings('ignore')

from src.data_preparation_hy import DataPreparation, create_sample_dataset
from src.feature_extraction_hy import extract_features_from_dataset

from config import DATA_DIR, MODELS_DIR, RESULTS_DIR
import pandas as pd


#### Directory Setup

This function creates the required project folders automatically if they don’t already exist.

**DATA_DIR** → Where your dataset and processed features are stored

**MODELS_DIR** → Where trained ML models are saved

**RESULTS_DIR** → Where evaluation results and outputs are saved

In [3]:
def setup_directories_hy():
    """Create necessary project directories"""
    for directory_hy in [DATA_DIR, MODELS_DIR, RESULTS_DIR]:
        os.makedirs(directory_hy, exist_ok=True)
    
    print("Project directories created/verified successfully.")
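The three directory constants come from the project's `config` module, which is not shown. Below is a minimal sketch of how such a module might define them with `pathlib`. The `data` folder name matches the path printed in the Step 1 output; `models` and `results` are assumed names, and `PROJECT_ROOT` is a hypothetical variable.

```python
from pathlib import Path

# Assumption: the project root is the current working directory.
# The real config.py may resolve it differently.
PROJECT_ROOT = Path.cwd()

DATA_DIR = PROJECT_ROOT / "data"        # datasets and extracted features
MODELS_DIR = PROJECT_ROOT / "models"    # assumed folder name
RESULTS_DIR = PROJECT_ROOT / "results"  # assumed folder name
```

`os.makedirs` accepts `Path` objects, so constants defined this way work unchanged with `setup_directories_hy()`. Note that the function above is only defined; it has to be called once (for example, before Step 1) for the folders to be created.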

#### Step 1: Data Preparation

1. The dataset is loaded using the `DataPreparation` class.  
2. The system tries to download the Kaggle dataset automatically (if configured).  
3. If the dataset is not available, a sample dataset is created for demonstration.  
4. Duplicate URLs are removed to avoid repeated data.  
5. Missing values and invalid entries are cleaned from the dataset.  
6. The cleaned dataset is prepared for the next step: Feature Extraction.
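The de-duplication and cleaning steps listed above can be sketched with pandas as follows. This is not the implementation of `DataPreparation.clean_data`, which is not shown; `clean_urls` is a hypothetical helper, and treating blank strings as invalid entries is an assumption.

```python
import pandas as pd


def clean_urls(df: pd.DataFrame, url_col: str = "url") -> pd.DataFrame:
    """Drop rows with missing or blank URLs, then drop duplicate URLs."""
    cleaned = df.dropna(subset=[url_col]).copy()
    # Normalise whitespace so "  " counts as an empty (invalid) entry.
    cleaned[url_col] = cleaned[url_col].astype(str).str.strip()
    cleaned = cleaned[cleaned[url_col] != ""]
    # Keep the first occurrence of each URL.
    cleaned = cleaned.drop_duplicates(subset=[url_col], keep="first")
    return cleaned.reset_index(drop=True)


raw = pd.DataFrame({
    "url": ["http://a.example", "http://a.example", None, "  ", "http://b.example"],
    "label": [0, 0, 1, 1, 1],
})
cleaned = clean_urls(raw)
print(cleaned)
```

Of the five example rows, one duplicate, one missing value, and one blank entry are removed, leaving two rows.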

In [4]:
# ============================================
# STEP 1: DATA PREPARATION
# ============================================

print("="*70)
print("STEP 1: DATA PREPARATION (KAGGLE DATASET)")
print("="*70)

# Data preparation variable is prep_hy

prep_hy = DataPreparation()

try:
    # Try loading Kaggle dataset
    df_hy = prep_hy.load_raw_data(try_auto_download=True)
    df_hy = prep_hy.clean_data(df_hy)

except FileNotFoundError:
    print("\nDEMO MODE: Creating Sample Dataset")
    print("Kaggle dataset not found. Creating sample dataset...")

    # Bind the fallback data to the same name (df_hy) so that the
    # rest of the notebook works in both Kaggle and demo mode.
    df_hy = create_sample_dataset(n_samples=500)
    df_hy = prep_hy.clean_data(df_hy)

print("\nData Preparation Complete!")
print("Dataset shape:", df_hy.shape)
df_hy.head()

STEP 1: DATA PREPARATION (KAGGLE DATASET)

Loading Kaggle dataset from C:\Users\Pradeep Alvala\Documents\MastersProjectAndDoc\24090531_PhishingDetectionHYBRID\data\phishing_dataset.csv...
Initial dataset shape: (11430, 89)
Columns found: ['url', 'length_url', 'length_hostname', 'ip', 'nb_dots', 'nb_hyphens', 'nb_at', 'nb_qm', 'nb_and', 'nb_or', 'nb_eq', 'nb_underscore', 'nb_tilde', 'nb_percent', 'nb_slash', 'nb_star', 'nb_colon', 'nb_comma', 'nb_semicolumn', 'nb_dollar', 'nb_space', 'nb_www', 'nb_com', 'nb_dslash', 'http_in_path', 'https_token', 'ratio_digits_url', 'ratio_digits_host', 'punycode', 'port', 'tld_in_path', 'tld_in_subdomain', 'abnormal_subdomain', 'nb_subdomains', 'prefix_suffix', 'random_domain', 'shortening_service', 'path_extension', 'nb_redirection', 'nb_external_redirection', 'length_words_raw', 'char_repeat', 'shortest_words_raw', 'shortest_word_host', 'shortest_word_path', 'longest_words_raw', 'longest_word_host', 'longest_word_path', 'avg_words_raw', 'avg_word_hos

| Unnamed: 0 | url | length_url | length_hostname | ip | nb_dots | nb_hyphens | nb_at | nb_qm | nb_and | nb_or | ... | domain_with_copyright | whois_registered_domain | domain_registration_length | domain_age | web_traffic | dns_record | google_index | page_rank | status | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://www.crestonwood.com/router.php | 37 | 19 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 45 | -1 | 0 | 1 | 1 | 4 | legitimate | 0 |
| 1 | http://shadetreetechnology.com/V4/validation/a... | 77 | 23 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 77 | 5767 | 0 | 0 | 1 | 2 | phishing | 1 |
| 2 | https://support-appleld.com.secureupdate.duila... | 126 | 50 | 1 | 4 | 1 | 0 | 1 | 2 | 0 | ... | 0 | 0 | 14 | 4004 | 5828815 | 0 | 1 | 0 | phishing | 1 |
| 3 | http://rgipt.ac.in | 18 | 11 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 62 | -1 | 107721 | 0 | 0 | 3 | legitimate | 0 |
| 4 | http://www.iracing.com/tracks/gateway-motorspo... | 55 | 15 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 224 | 8175 | 8725 | 0 | 0 | 6 | legitimate | 0 |


In [6]:
# ============================================
# STEP 2: FEATURE EXTRACTION
# ============================================

print("\n" + "="*70)
print("STEP 2: FEATURE EXTRACTION")
print("="*70)

# For sample/small datasets, extract features from every URL.
# For large datasets, you may want to extract features in batches.
# This cell uses df_hy and prep_hy, the names defined in Step 1.

if len(df_hy) <= 500:
    print("\nExtracting features from URLs...")
    print("Note: This may take a few minutes for network-based features...")
    features_df = prep_hy.load_or_extract_features(df_hy, force_extract=False)
else:
    print("\nLarge dataset detected. Extracting features from first 500 URLs...")
    print("For full dataset, run feature extraction separately.")
    df_sample = df_hy.head(500)
    features_df = prep_hy.load_or_extract_features(df_sample, force_extract=False)

print("\nFeature extraction complete!")
print(f"Features shape: {features_df.shape}")
print(f"Features: {list(features_df.columns)}")


STEP 2: FEATURE EXTRACTION
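
The comment in the cell above suggests extracting features in batches for large datasets. One way this could look is sketched below. `iter_batches` and `extract_in_batches` are hypothetical helpers, and `extract_fn` stands in for whatever extractor the project provides; the toy extractor is only there to make the example runnable and does not reproduce the project's features.

```python
import pandas as pd


def iter_batches(df: pd.DataFrame, batch_size: int = 500):
    """Yield consecutive row slices of at most batch_size rows."""
    for start in range(0, len(df), batch_size):
        yield df.iloc[start:start + batch_size]


def extract_in_batches(df: pd.DataFrame, extract_fn, batch_size: int = 500) -> pd.DataFrame:
    """Apply extract_fn to each batch and concatenate the results."""
    parts = [extract_fn(batch) for batch in iter_batches(df, batch_size)]
    if not parts:  # empty input: nothing to concatenate
        return pd.DataFrame()
    return pd.concat(parts, ignore_index=True)


def toy_extractor(batch: pd.DataFrame) -> pd.DataFrame:
    """Stand-in extractor (assumption): URL length only."""
    return pd.DataFrame({"length_url": batch["url"].str.len()})


urls = pd.DataFrame({"url": [f"http://site{i}.example/" for i in range(1200)]})
batch_sizes = [len(b) for b in iter_batches(urls, batch_size=500)]
all_features = extract_in_batches(urls, toy_extractor, batch_size=500)
print(batch_sizes, all_features.shape)
```

Processing in fixed-size slices keeps each call short and makes it possible to save intermediate results between batches, which may matter when features depend on slow network lookups.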