# 🔐 Phishing Website Detection – Exploratory Data Analysis (EDA)

## 1. Introduction
Phishing websites are malicious websites designed to trick users into revealing sensitive information such as login credentials, banking details, or personal data by impersonating legitimate entities.

This notebook focuses on **Exploratory Data Analysis (EDA)** to understand data quality, structure, feature behavior, and class distribution before building machine learning models.

## 2. Lifecycle of a Machine Learning Model

The complete lifecycle of this machine learning model follows a structured pipeline to ensure reliability, performance, and reproducibility:

- Problem Understanding
- Data Collection
- Dataset Understanding
- Data Validation & Quality Checks
- Exploratory Data Analysis (EDA)
- Data Preprocessing & Feature Engineering
- Model Selection & Training
- Model Evaluation
- Hyperparameter Tuning
- Model Deployment & Monitoring

This project focuses mainly on classification, where the goal is to identify whether a website is phishing or legitimate.

## 3. Problem Statement

Phishing websites are fraudulent websites designed to steal sensitive information such as usernames, passwords, and banking details by impersonating legitimate entities.

### 🎯 Objective

To build a machine learning classification model that can accurately detect phishing websites based on URL, domain, HTML, JavaScript, and traffic-based features.

### 🧠 Problem Type

Binary classification

Target variable (`Result`):

- `1` → Legitimate website
- `-1` → Phishing website
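
Some modeling libraries expect class labels encoded as `0` and `1` rather than `-1` and `1`. The sketch below shows one way to remap the target on a toy Series standing in for `df["Result"]`. The name `LABEL_MAP` and the choice of phishing as the positive class are illustrative assumptions, not part of this notebook.

```python
import pandas as pd

# Hypothetical mapping: phishing (-1) becomes the positive class 1,
# legitimate (1) becomes 0. This choice is an assumption for illustration.
LABEL_MAP = {-1: 1, 1: 0}

# Toy stand-in for df["Result"]
result = pd.Series([-1, 1, 1, -1], name="Result")

is_phishing = result.map(LABEL_MAP).rename("is_phishing")
print(is_phishing.tolist())  # [1, 0, 0, 1]
```

Keeping the mapping in one named dictionary makes it easy to invert predictions back to the original `-1`/`1` labels later.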

## 4. Data Collection
The dataset is sourced from the **UCI Machine Learning Repository**, collected using phishing and legitimate website samples from:
- PhishTank  
- Alexa  
- WHOIS  
- Google Index  
- Statistical phishing reports
##### Data Source - https://archive.ics.uci.edu/dataset/327/phishing+websites

### 4.1 Dataset Information
- **Total records:** 11,055  
- **Total features:** 30 input features + 1 target  
- **Target column:** Result  
- **Feature values:** -1, 0, 1  
- **Missing values:** None  

### 4.2 Import Data and Required Packages
#### Importing Pandas, NumPy, Matplotlib, Seaborn, and the Warnings Library

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

#### Import the CSV Data as Pandas DataFrame and Show Top 5 Records

In [2]:
file_path = r"../Network_Data/phisingData.csv"
df = pd.read_csv(file_path)
df.head()

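|   | having_IP_Address | URL_Length | Shortining_Service | having_At_Symbol | double_slash_redirecting | Prefix_Suffix | having_Sub_Domain | SSLfinal_State | Domain_registeration_length | Favicon | ... | popUpWidnow | Iframe | age_of_domain | DNSRecord | web_traffic | Page_Rank | Google_Index | Links_pointing_to_page | Statistical_report | Result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 1 | 1 | 1 | -1 | -1 | -1 | -1 | -1 | 1 | ... | 1 | 1 | -1 | -1 | -1 | -1 | 1 | 1 | -1 | -1 |
| 1 | 1 | 1 | 1 | 1 | 1 | -1 | 0 | 1 | -1 | 1 | ... | 1 | 1 | -1 | -1 | 0 | -1 | 1 | 1 | 1 | -1 |
| 2 | 1 | 0 | 1 | 1 | 1 | -1 | -1 | -1 | -1 | 1 | ... | 1 | 1 | 1 | -1 | 1 | -1 | 1 | 0 | -1 | -1 |
| 3 | 1 | 0 | 1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 | ... | 1 | 1 | -1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 |
| 4 | 1 | 0 | -1 | 1 | 1 | -1 | 1 | 1 | -1 | 1 | ... | -1 | 1 | -1 | -1 | 0 | -1 | 1 | 1 | 1 | 1 |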


### Information about the dataset

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11055 entries, 0 to 11054
Data columns (total 31 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   having_IP_Address            11055 non-null  int64
 1   URL_Length                   11055 non-null  int64
 2   Shortining_Service           11055 non-null  int64
 3   having_At_Symbol             11055 non-null  int64
 4   double_slash_redirecting     11055 non-null  int64
 5   Prefix_Suffix                11055 non-null  int64
 6   having_Sub_Domain            11055 non-null  int64
 7   SSLfinal_State               11055 non-null  int64
 8   Domain_registeration_length  11055 non-null  int64
 9   Favicon                      11055 non-null  int64
 10  port                         11055 non-null  int64
 11  HTTPS_token                  11055 non-null  int64
 12  Request_URL                  11055 non-null  int64
 13  URL_of_Anchor                11055 non-null  int64
 ...

### 4.3 Feature Categories
#### Address Bar Based Features
- IP address usage, URL length, URL shortening, special symbols, subdomains, HTTPS.

#### Abnormal Based Features
- Request URL, anchor URL, SFH, email submission, abnormal URLs.

#### HTML & JavaScript Based Features
- Forwarding, status bar manipulation, right-click disabling, pop-ups, iframe usage.

#### Domain Based Features
- Domain age, DNS record, traffic, PageRank, Google index, backlinks.

#### Statistical Reports
- Blacklist-based domain and IP reputation.
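
For group-wise analysis it can help to hold these categories in a dictionary. The sketch below is illustrative only: the column names come from the dataset, but the assignment of each column to a category is an assumption based on column order, and the name `FEATURE_GROUPS` is hypothetical.

```python
# Illustrative grouping only: which column belongs to which category is an
# assumption based on column order, not something stated in this notebook.
FEATURE_GROUPS = {
    "address_bar": [
        "having_IP_Address", "URL_Length", "Shortining_Service",
        "having_At_Symbol", "double_slash_redirecting", "Prefix_Suffix",
        "having_Sub_Domain", "SSLfinal_State",
        "Domain_registeration_length", "Favicon", "port", "HTTPS_token",
    ],
    "abnormal": [
        "Request_URL", "URL_of_Anchor", "Links_in_tags", "SFH",
        "Submitting_to_email", "Abnormal_URL",
    ],
    "html_js": [
        "Redirect", "on_mouseover", "RightClick", "popUpWidnow", "Iframe",
    ],
    "domain": [
        "age_of_domain", "DNSRecord", "web_traffic", "Page_Rank",
        "Google_Index", "Links_pointing_to_page",
    ],
    "statistical_report": ["Statistical_report"],
}

all_features = [col for cols in FEATURE_GROUPS.values() for col in cols]
# Every feature should appear exactly once across the groups.
assert len(all_features) == len(set(all_features))
print(len(all_features))  # 30
```

With the notebook's DataFrame, `df[FEATURE_GROUPS["domain"]]` would then select one category at a time.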


## 5. Data Quality Checks
- Shape validation  
- Data type validation  
- Missing value check  
- Duplicate record check  
- Feature value range check  
- Class imbalance check  

### 5.1 Shape validation 

In [4]:
df.shape

(11055, 31)
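
The expected shape can also be asserted in code, so that a changed or truncated input file fails loudly instead of passing unnoticed. A minimal sketch on a toy frame; `check_shape` is a hypothetical helper, not a pandas function.

```python
import pandas as pd


def check_shape(frame: pd.DataFrame, n_rows: int, n_cols: int) -> None:
    """Raise ValueError if the frame does not have the expected shape."""
    if frame.shape != (n_rows, n_cols):
        raise ValueError(f"expected {(n_rows, n_cols)}, got {frame.shape}")


# Toy frame standing in for the real data; with the notebook's DataFrame
# the call would be check_shape(df, 11055, 31).
toy = pd.DataFrame({"f1": [1, -1, 0], "Result": [1, -1, 1]})
check_shape(toy, 3, 2)  # passes silently
```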

### 5.2 Data type validation 

In [5]:
df.dtypes

having_IP_Address              int64
URL_Length                     int64
Shortining_Service             int64
having_At_Symbol               int64
double_slash_redirecting       int64
Prefix_Suffix                  int64
having_Sub_Domain              int64
SSLfinal_State                 int64
Domain_registeration_length    int64
Favicon                        int64
port                           int64
HTTPS_token                    int64
Request_URL                    int64
URL_of_Anchor                  int64
Links_in_tags                  int64
SFH                            int64
Submitting_to_email            int64
Abnormal_URL                   int64
Redirect                       int64
on_mouseover                   int64
RightClick                     int64
popUpWidnow                    int64
Iframe                         int64
age_of_domain                  int64
DNSRecord                      int64
web_traffic                    int64
Page_Rank                      int64
Google_Index                   int64
Links_pointing_to_page         int64
Statistical_report             int64
Result                         int64
dtype: object

### Check for non-numeric columns

In [6]:
df.select_dtypes(exclude=['number']).columns

Index([], dtype='object')

### Validate target column values

In [7]:
df['Result'].unique()

array([-1,  1])

### Validate feature value ranges

In [8]:
invalid_cols = []

for col in df.columns:
    if col != 'Result':
        if not set(df[col].unique()).issubset({-1, 0, 1}):
            invalid_cols.append(col)

invalid_cols

[]

### Data Type Validation

All features in the dataset are numerical and stored as `int64` data types.  
The target variable `Result` contains only two valid class labels: `1` (Legitimate) and `-1` (Phishing).

No object or categorical data types were found, and all feature values lie within the expected range of `-1`, `0`, and `1`.  
This confirms that the columns can be passed to machine learning models without additional type conversion or encoding.
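
The dtype check above can be tightened from "numeric" to "integer", which would also catch a column silently read as float. A minimal sketch on a toy frame, using `pandas.api.types.is_integer_dtype`; the column names are placeholders.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Toy frame: f2 is deliberately a float column
toy = pd.DataFrame(
    {"f1": [1, -1, 0], "f2": [0.5, 1.0, -1.0], "Result": [1, -1, 1]}
)

# Collect the columns whose dtype is not an integer type
non_integer = [col for col in toy.columns if not is_integer_dtype(toy[col])]
print(non_integer)  # ['f2']
```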


### 5.3 Missing value check

In [9]:
df.isnull().sum()

having_IP_Address              0
URL_Length                     0
Shortining_Service             0
having_At_Symbol               0
double_slash_redirecting       0
Prefix_Suffix                  0
having_Sub_Domain              0
SSLfinal_State                 0
Domain_registeration_length    0
Favicon                        0
port                           0
HTTPS_token                    0
Request_URL                    0
URL_of_Anchor                  0
Links_in_tags                  0
SFH                            0
Submitting_to_email            0
Abnormal_URL                   0
Redirect                       0
on_mouseover                   0
RightClick                     0
popUpWidnow                    0
Iframe                         0
age_of_domain                  0
DNSRecord                      0
web_traffic                    0
Page_Rank                      0
Google_Index                   0
Links_pointing_to_page         0
Statistical_report             0
Result                         0
dtype: int64

#### There are no missing values in the dataset
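
With 31 columns, the per-column listing is long. A more compact report keeps only the columns that actually have gaps. The sketch below uses a toy frame with one injected `NaN`; the column names are placeholders.

```python
import numpy as np
import pandas as pd

# Toy frame with a single missing value in f1
toy = pd.DataFrame({"f1": [1, np.nan, 0], "f2": [1, 1, -1]})

missing = toy.isna().sum()
missing = missing[missing > 0]  # keep only columns with at least one gap

# Convert counts to plain ints for a readable report
report = {col: int(n) for col, n in missing.items()}
print(report)  # {'f1': 1}
```

An empty dictionary means no column has missing values.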

### 5.4 Duplicate record check

In [10]:
df.duplicated().sum()

np.int64(5206)

In [11]:
df = df.drop_duplicates()
df.shape

(5849, 31)

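Removing duplicates reduced the dataset from 11,055 to 5,849 rows: 5,206 rows (about 47%) were exact copies of another row. Such a high rate is plausible because the features are rule-based and take at most three values, so many different websites map to the same feature vector. The remaining rows are unique combinations of feature values and label, and the class distribution is nearly balanced, as the counts below show.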

In [12]:
df['Result'].value_counts()

Result
-1    3019
 1    2830
Name: count, dtype: int64
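
Note that `drop_duplicates()` compares all 31 columns, so two rows with identical features but different `Result` values both survive. The sketch below shows one way to look for such conflicting rows on a toy frame. Whether the real dataset contains any is not established here.

```python
import pandas as pd

toy = pd.DataFrame({
    "f1":     [1, 1, -1, -1, 0],
    "f2":     [0, 0,  1,  1, 1],
    "Result": [1, 1,  1, -1, -1],
})

deduped = toy.drop_duplicates()  # drops the repeated (1, 0, 1) row
feature_cols = [c for c in deduped.columns if c != "Result"]

# After exact deduplication, a feature vector that still appears more than
# once must carry different labels.
conflicts = deduped[deduped.duplicated(subset=feature_cols, keep=False)]
print(len(deduped), len(conflicts))  # 4 2
```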

### 5.5 Feature value range check

In [13]:
for col in df.columns:
    print(col, df[col].unique())

having_IP_Address [-1  1]
URL_Length [ 1  0 -1]
Shortining_Service [ 1 -1]
having_At_Symbol [ 1 -1]
double_slash_redirecting [-1  1]
Prefix_Suffix [-1  1]
having_Sub_Domain [-1  0  1]
SSLfinal_State [-1  1  0]
Domain_registeration_length [-1  1]
Favicon [ 1 -1]
port [ 1 -1]
HTTPS_token [-1  1]
Request_URL [ 1 -1]
URL_of_Anchor [-1  0  1]
Links_in_tags [ 1 -1  0]
SFH [-1  1  0]
Submitting_to_email [-1  1]
Abnormal_URL [-1  1]
Redirect [0 1]
on_mouseover [ 1 -1]
RightClick [ 1 -1]
popUpWidnow [ 1 -1]
Iframe [ 1 -1]
age_of_domain [-1  1]
DNSRecord [-1  1]
web_traffic [-1  0  1]
Page_Rank [-1  1]
Google_Index [ 1 -1]
Links_pointing_to_page [ 1  0 -1]
Statistical_report [-1  1]
Result [-1  1]


### Programmatic validation

In [14]:
allowed_values = {-1, 0, 1}

invalid_columns = {}

for col in df.columns:
    if col != 'Result':   # target handled separately
        unique_vals = set(df[col].unique())
        if not unique_vals.issubset(allowed_values):
            invalid_columns[col] = unique_vals

# Show the dictionary built above (empty means every value is allowed)
invalid_columns

{}

### Validate target column separately

In [15]:
set(df['Result'].unique())

{np.int64(-1), np.int64(1)}

### Feature Value Range Check

All feature columns in the dataset are rule-based and encoded with the values `-1`, `0`, and `1`. Some features use all three levels, while others (for example `having_IP_Address`) take only two.  
A feature value range check was performed to ensure that no unexpected or corrupted values were present.

The target column `Result` was validated separately and confirmed to contain only two valid class labels: `-1` and `1`.  
No invalid feature values were found, confirming data integrity.
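
The same range check can be written without an explicit loop by using `DataFrame.isin`. A minimal sketch on a toy frame in which `f2` deliberately contains an out-of-range value; the column names are placeholders.

```python
import pandas as pd

ALLOWED = [-1, 0, 1]

toy = pd.DataFrame({"f1": [1, -1, 0], "f2": [1, 2, -1], "Result": [1, -1, 1]})
features = toy.drop(columns="Result")  # target is validated separately

# isin() returns a boolean frame; all() reduces it column by column
in_range = features.isin(ALLOWED).all()
invalid = in_range[~in_range].index.tolist()
print(invalid)  # ['f2']
```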


### 5.6 Class imbalance check

#### Count samples per class

In [16]:
df['Result'].value_counts()

Result
-1    3019
 1    2830
Name: count, dtype: int64

#### Check percentage distribution

In [17]:
df['Result'].value_counts(normalize=True) * 100

Result
-1    51.615661
 1    48.384339
Name: proportion, dtype: float64

#### Calculate imbalance ratio

In [18]:
counts = df['Result'].value_counts()
imbalance_ratio = counts.max() / counts.min()
imbalance_ratio

np.float64(1.0667844522968197)

### Class Imbalance Check

The target variable distribution was analyzed to identify potential class imbalance.  
After removing duplicate records, the dataset shows a nearly balanced distribution between phishing and legitimate websites.

The imbalance ratio is about 1.07 (51.6% phishing vs. 48.4% legitimate), so no resampling techniques are required.  
A stratified train-test split is still advisable so that both splits keep these class proportions during model evaluation.
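
As an illustration of stratification, the sketch below samples the same fraction from each class using only pandas. It is a stand-in for the more common scikit-learn route (`train_test_split` with its `stratify` argument), and the toy frame and 50% test fraction are assumptions chosen to keep the arithmetic easy to follow.

```python
import pandas as pd

# Toy target with a 60/40 class mix
toy = pd.DataFrame({"f1": range(10), "Result": [-1] * 6 + [1] * 4})

# Draw 50% of each class for the test set; the rest is the training set
test = toy.groupby("Result").sample(frac=0.5, random_state=0)
train = toy.drop(test.index)

print(test["Result"].value_counts().to_dict())   # {-1: 3, 1: 2}
print(train["Result"].value_counts().to_dict())  # {-1: 3, 1: 2}
```

Both splits keep the original 60/40 mix, which is the property a stratified split is meant to guarantee.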

## 6. Key Observations
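- All 31 columns are numerical (`int64`), with feature values limited to `-1`, `0`, and `1`.
- No missing values.
- 5,206 exact duplicate rows were removed, leaving 5,849 unique records.
- Classes are nearly balanced after deduplication (imbalance ratio of about 1.07).
- Suitable for classification models with minimal preprocessing.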

## 7. Conclusion
This dataset is well suited to phishing detection with machine learning models such as Logistic Regression, Random Forest, XGBoost, and CatBoost.