# Exploratory Data Analysis (EDA) - Phishing Website Detection

**Author:** AIAP Batch 22 Assessment  
**Date:** November 2025  
**Dataset:** Phishing Website Detection Database

## Objective
This notebook performs an exploratory data analysis on the phishing website dataset to understand the characteristics of phishing vs legitimate websites and inform feature engineering and model selection for the machine learning pipeline.


In [None]:
# Import necessary libraries
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")


## Step 1: Data Loading and Initial Exploration

**Purpose:** Load the phishing dataset from the SQLite database and perform an initial inspection to understand the data structure, size, and basic characteristics.


In [None]:
# Connect to SQLite database and load data
conn = sqlite3.connect('data/phishing.db')
try:
    df = pd.read_sql_query("SELECT * FROM phishing_data", conn)
finally:
    # Close the connection even if the query fails
    conn.close()

print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print(f"Number of samples: {df.shape[0]}")
# df.shape[1] counts every column, including the index column and the target
print(f"Number of columns: {df.shape[1]}")

# Display first few rows
df.head()


**Explanation:** The dataset contains 10,500 website samples and 16 columns. The 'Unnamed: 0' column is a row index rather than a feature, and 'label' is the target variable (0 = legitimate, 1 = phishing).
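
Because 'Unnamed: 0' is only a row index, it should not reach the model as a feature. The following is a minimal sketch of how it might be checked and separated out. The `toy` frame and its values are invented for illustration and stand in for the real `df`.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "DomainAgeMonths": [120, 3, 48, 1],
    "label": [0, 1, 0, 1],
})

# A column that is unique in every row behaves like a row identifier.
is_row_id = toy["Unnamed: 0"].is_unique

# Separate features from the target and leave the index column out.
features = toy.drop(columns=["Unnamed: 0", "label"])
target = toy["label"]

print(is_row_id, list(features.columns))
```

The same `drop(columns=...)` call should carry over to the real `df`, assuming the column names match those shown by `df.head()`.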


In [None]:
# Basic information about the dataset
print("="*50)
print("DATASET INFORMATION")
print("="*50)
df.info()

print("\n" + "="*50)
print("DATA TYPES SUMMARY")
print("="*50)
print(df.dtypes.value_counts())


## Step 2: Data Quality Assessment

**Purpose:** Check for missing values, duplicates, and data quality issues that may affect model performance.


In [None]:
# Check for missing values
print("="*50)
print("MISSING VALUES ANALYSIS")
print("="*50)
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
missing_df = pd.DataFrame({
    'Feature': missing_values.index,
    'Missing Count': missing_values.values,
    'Percentage': missing_percentage.values
})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)
print(missing_df)

print(f"\nTotal missing values: {df.isnull().sum().sum()}")
print(f"Total cells: {df.shape[0] * df.shape[1]}")
print(f"Percentage of missing data: {(df.isnull().sum().sum() / (df.shape[0] * df.shape[1]) * 100):.2f}%")


In [None]:
# Check for duplicates
print("="*50)
print("DUPLICATE ANALYSIS")
print("="*50)
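duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")
print(f"Percentage of duplicates: {(duplicates / len(df) * 100):.2f}%")

# 'Unnamed: 0' is a row index, so also compare rows with it excluded
duplicates_no_index = df.drop(columns=['Unnamed: 0']).duplicated().sum()
print(f"Duplicate rows ignoring the index column: {duplicates_no_index}")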


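**Conclusion:** The LineOfCode feature contains missing values, which will need to be handled during preprocessing. No fully identical rows were found; because 'Unnamed: 0' is a row index that can differ between otherwise identical records, rows should also be compared with that column excluded before ruling out duplicates.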
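
One possible way to handle the missing LineOfCode values is median imputation combined with a flag that records which rows were missing. This is a sketch on invented numbers, not the preprocessing actually used in the pipeline. The names `LineOfCode_missing` and `LineOfCode_imputed` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy column with gaps; the values are invented for illustration.
toy = pd.DataFrame({"LineOfCode": [100.0, np.nan, 300.0, np.nan, 200.0]})

# Keep a 0/1 flag so a model can still see which rows were missing.
toy["LineOfCode_missing"] = toy["LineOfCode"].isna().astype(int)

# median() skips NaN, so this is the median of the observed values only.
median_loc = toy["LineOfCode"].median()
toy["LineOfCode_imputed"] = toy["LineOfCode"].fillna(median_loc)

print(toy)
```

In a real pipeline the median would be computed on the training split only and then reused for validation and test data.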


## Step 3: Target Variable Analysis

**Purpose:** Understand the distribution of the target variable (phishing vs legitimate websites) to assess class balance.


In [None]:
# Target variable distribution
print("="*50)
print("TARGET VARIABLE DISTRIBUTION")
print("="*50)
label_counts = df['label'].value_counts()
label_percentages = df['label'].value_counts(normalize=True) * 100

print("\nLabel Distribution:")
print(f"  Legitimate websites (0): {label_counts[0]} ({label_percentages[0]:.2f}%)")
print(f"  Phishing websites (1): {label_counts[1]} ({label_percentages[1]:.2f}%)")

# Visualize target distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
sns.countplot(data=df, x='label', palette='Set2', ax=axes[0])
axes[0].set_title('Target Variable Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Label (0=Legitimate, 1=Phishing)', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_xticklabels(['Legitimate (0)', 'Phishing (1)'])

# Add counts on bars
for container in axes[0].containers:
    axes[0].bar_label(container)

# Pie chart
# value_counts() sorts by frequency, so reindex to match the label order (0, 1)
colors = ['#8dd3c7', '#fb8072']
axes[1].pie(label_counts.reindex([0, 1]), labels=['Legitimate', 'Phishing'], autopct='%1.1f%%',
            colors=colors, startangle=90)
axes[1].set_title('Target Variable Percentage', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()


**Explanation:** The dataset is imbalanced, with approximately 70% phishing websites and 30% legitimate websites, so legitimate sites are the minority class. This imbalance should guide the choice of evaluation metrics and may call for techniques such as class weighting or SMOTE.
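
Class weighting can be illustrated with a short calculation. The sketch below applies the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, to ten invented labels with the same 70/30 split. This is the same formula scikit-learn documents for `class_weight="balanced"`.

```python
import pandas as pd

# Toy labels with a 70/30 split (1 = phishing, 0 = legitimate).
labels = pd.Series([1] * 7 + [0] * 3)

counts = labels.value_counts().sort_index()
n_samples, n_classes = len(labels), counts.size

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
class_weights = (n_samples / (n_classes * counts)).to_dict()

print(class_weights)
```

The minority class (legitimate) ends up with the larger weight, so mistakes on it cost more during training.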


## Step 4: Statistical Summary of Numerical Features

**Purpose:** Generate descriptive statistics for all numerical features to understand their distributions, ranges, and central tendencies.


In [None]:
# Statistical summary of numerical features
print("="*50)
print("STATISTICAL SUMMARY - NUMERICAL FEATURES")
print("="*50)

# Select numerical columns (excluding index and label)
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
numerical_cols = [col for col in numerical_cols if col not in ['Unnamed: 0', 'label']]

print(f"\nNumerical features: {numerical_cols}")
print(f"\nNumber of numerical features: {len(numerical_cols)}\n")

# Display statistics
df[numerical_cols].describe().T


## Step 5: Categorical Features Analysis

**Purpose:** Examine categorical features (Industry and HostingProvider) to understand their distributions and unique values.


In [None]:
# Categorical features analysis
categorical_cols = ['Industry', 'HostingProvider']

print("="*50)
print("CATEGORICAL FEATURES ANALYSIS")
print("="*50)

for col in categorical_cols:
    print(f"\n{col}:")
    print(f"  Unique values: {df[col].nunique()}")
    print(f"  Missing values: {df[col].isnull().sum()}")
    print(f"  Top 10 most common values:")
    print(df[col].value_counts().head(10))
    print()


In [None]:
# Visualize top categories
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Industry distribution
industry_top = df['Industry'].value_counts().head(15)
sns.barplot(x=industry_top.values, y=industry_top.index, palette='viridis', ax=axes[0])
axes[0].set_title('Top 15 Industries', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Count', fontsize=12)
axes[0].set_ylabel('Industry', fontsize=12)

# HostingProvider distribution
hosting_top = df['HostingProvider'].value_counts().head(15)
sns.barplot(x=hosting_top.values, y=hosting_top.index, palette='plasma', ax=axes[1])
axes[1].set_title('Top 15 Hosting Providers', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Count', fontsize=12)
axes[1].set_ylabel('Hosting Provider', fontsize=12)

plt.tight_layout()
plt.show()


**Explanation:** Both Industry and HostingProvider have high cardinality (many unique values). For the ML pipeline, we'll need to apply encoding techniques such as target encoding or frequency encoding to handle these categorical features effectively.
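
Frequency encoding, one of the options mentioned above, replaces each category with its share of the training rows. The following is a minimal sketch on invented provider names. The `_freq` column name and the choice to map unseen categories to 0 are assumptions made for illustration.

```python
import pandas as pd

# Toy data; the provider names are invented for illustration.
train = pd.DataFrame({"HostingProvider": ["A", "A", "B", "C", "A", "B"]})
new = pd.DataFrame({"HostingProvider": ["A", "C", "Z"]})  # "Z" was never seen

# Relative frequency of each category, learned from the training rows only.
freq = train["HostingProvider"].value_counts(normalize=True)

train["HostingProvider_freq"] = train["HostingProvider"].map(freq)
# Categories absent from training map to NaN; treat them as frequency 0.
new["HostingProvider_freq"] = new["HostingProvider"].map(freq).fillna(0.0)

print(new)
```

Learning the frequencies from the training rows only keeps information from the evaluation data out of the encoding.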


## Step 6: Feature Distributions Visualization

**Purpose:** Visualize the distribution of numerical features using histograms and box plots to identify skewness and potential outliers.


In [None]:
# Distribution plots for numerical features
fig, axes = plt.subplots(4, 3, figsize=(18, 16))
axes = axes.ravel()

for idx, col in enumerate(numerical_cols):
    if idx < len(axes):
        df[col].hist(bins=50, ax=axes[idx], edgecolor='black', color='skyblue')
        axes[idx].set_title(f'{col} Distribution', fontsize=12, fontweight='bold')
        axes[idx].set_xlabel(col, fontsize=10)
        axes[idx].set_ylabel('Frequency', fontsize=10)
        
        # Add mean and median lines
        axes[idx].axvline(df[col].mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {df[col].mean():.2f}')
        axes[idx].axvline(df[col].median(), color='green', linestyle='--', linewidth=2, label=f'Median: {df[col].median():.2f}')
        axes[idx].legend(fontsize=8)

plt.tight_layout()
plt.show()


In [None]:
# Box plots to identify outliers
fig, axes = plt.subplots(4, 3, figsize=(18, 16))
axes = axes.ravel()

for idx, col in enumerate(numerical_cols):
    if idx < len(axes):
        sns.boxplot(data=df, y=col, ax=axes[idx], color='lightcoral')
        axes[idx].set_title(f'{col} Box Plot', fontsize=12, fontweight='bold')
        axes[idx].set_ylabel(col, fontsize=10)

plt.tight_layout()
plt.show()


**Conclusion:** Many features show right-skewed distributions and contain outliers. Features like LineOfCode, LargestLineLength, and NoOfImage have significant outliers. We'll need to consider outlier handling and potentially feature scaling in our preprocessing pipeline.
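
Skewness and outliers can also be quantified rather than read off the plots. The sketch below uses the 1.5 × IQR rule (the same rule box plot whiskers follow by default) and compares skewness before and after a `log1p` transform. The numbers are invented, and the log transform is one possible treatment, not necessarily the one the pipeline should use.

```python
import numpy as np
import pandas as pd

# Toy right-skewed feature with one extreme value; invented for illustration.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100], name="NoOfImage")

# Flag points outside the 1.5 * IQR fences.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Compare skewness on the raw scale and after a log1p transform.
skew_raw = values.skew()
skew_log = np.log1p(values).skew()

print(outliers.tolist(), round(skew_raw, 2), round(skew_log, 2))
```

The transform reduces the skew here but does not remove it, which is one reason capping or robust scaling may still be worth considering.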


## Step 7: Correlation Analysis

**Purpose:** Examine correlations between numerical features and identify multicollinearity that might affect model performance.


In [None]:
# Correlation matrix including target variable
correlation_features = numerical_cols + ['label']
correlation_matrix = df[correlation_features].corr()

# Plot correlation heatmap
plt.figure(figsize=(14, 12))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Heatmap - All Features', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Print features most correlated with target
print("="*50)
print("CORRELATION WITH TARGET VARIABLE")
print("="*50)
target_correlation = correlation_matrix['label'].sort_values(ascending=False)
print("\nFeatures sorted by correlation with target (label):")
print(target_correlation)


In [None]:
# Identify highly correlated feature pairs (potential multicollinearity)
print("\n" + "="*50)
print("HIGHLY CORRELATED FEATURE PAIRS")
print("="*50)
print("\nFeature pairs with |correlation| > 0.7:")

correlation_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.7 and correlation_matrix.columns[i] != 'label' and correlation_matrix.columns[j] != 'label':
            correlation_pairs.append({
                'Feature 1': correlation_matrix.columns[i],
                'Feature 2': correlation_matrix.columns[j],
                'Correlation': correlation_matrix.iloc[i, j]
            })

if correlation_pairs:
    # Sort by absolute value so strong negative correlations are not pushed to the bottom
    correlation_df = pd.DataFrame(correlation_pairs).sort_values('Correlation', key=abs, ascending=False)
    print(correlation_df)
else:
    print("No feature pairs with |correlation| > 0.7 found.")


**Explanation:** The correlation analysis shows which features have the strongest linear relationship with the target variable. Features with a large absolute correlation (positive or negative) are likely to be useful predictors, although Pearson correlation can miss non-linear effects. High correlation between features (multicollinearity) can destabilize the coefficients of linear models such as logistic regression but matters much less for tree-based models.
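
Since several features are heavily skewed, a rank-based (Spearman) correlation may be a useful cross-check on the Pearson values. The sketch below computes it as the Pearson correlation of ranks on invented data. It is offered as an optional robustness check, not as part of the original analysis.

```python
import pandas as pd

# Toy data with one extreme value; invented for illustration.
toy = pd.DataFrame({
    "NoOfImage": [1, 2, 3, 4, 1000],
    "label": [0, 0, 1, 1, 1],
})

# Pearson correlation on the raw values is dominated by the extreme point.
pearson = toy["NoOfImage"].corr(toy["label"])

# Spearman correlation is the Pearson correlation of the ranks.
ranks = toy.rank()
spearman = ranks["NoOfImage"].corr(ranks["label"])

print(round(pearson, 3), round(spearman, 3))
```

A large gap between the two values for a feature suggests that outliers or a non-linear but monotonic relationship are affecting the Pearson figure.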


## Step 8: Feature Comparison - Phishing vs Legitimate Websites

**Purpose:** Compare feature distributions between phishing and legitimate websites to identify distinguishing characteristics.


In [None]:
# Compare key features between phishing and legitimate sites
key_features = ['DomainAgeMonths', 'NoOfURLRedirect', 'NoOfPopup', 'NoOfiFrame', 
                'Robots', 'IsResponsive']

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for idx, feature in enumerate(key_features):
    # Box plot comparison
    df_plot = df[[feature, 'label']].copy()
    df_plot['Website Type'] = df_plot['label'].map({0: 'Legitimate', 1: 'Phishing'})
    
    sns.boxplot(data=df_plot, x='Website Type', y=feature, 
                palette={'Legitimate': '#8dd3c7', 'Phishing': '#fb8072'}, ax=axes[idx])
    axes[idx].set_title(f'{feature} Comparison', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('Website Type', fontsize=10)
    axes[idx].set_ylabel(feature, fontsize=10)

plt.tight_layout()
plt.show()


In [None]:
# Statistical comparison between phishing and legitimate websites
print("="*50)
print("STATISTICAL COMPARISON BY WEBSITE TYPE")
print("="*50)

comparison_stats = df.groupby('label')[numerical_cols].agg(['mean', 'median', 'std'])
print("\nMean values by website type:")
print(comparison_stats.xs('mean', axis=1, level=1).T)


**Explanation:** Key differences observed between phishing and legitimate websites:
- **DomainAgeMonths**: Legitimate websites tend to have older domains
- **NoOfURLRedirect**: Phishing sites often have more URL redirects
- **Robots**: Legitimate sites are more likely to have a robots.txt file
- **IsResponsive**: Legitimate sites are more likely to use a responsive design
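
For yes/no features such as Robots and IsResponsive, a box plot shows little, and comparing the share of sites with the flag set per class is usually easier to read. The sketch below assumes these columns are encoded as 0/1, which the notebook does not state explicitly. The data is invented.

```python
import pandas as pd

# Toy data; assumes Robots is a 0/1 flag. Values are invented for illustration.
toy = pd.DataFrame({
    "Robots": [1, 1, 0, 1, 0, 0, 0, 1],
    "label":  [0, 0, 0, 0, 1, 1, 1, 1],
})
toy["Website Type"] = toy["label"].map({0: "Legitimate", 1: "Phishing"})

# The mean of a 0/1 flag is the share of sites where the flag is set.
share_with_robots = toy.groupby("Website Type")["Robots"].mean()

print(share_with_robots)
```

On the real data, the same `groupby(...).mean()` call could be applied to each binary column to back the observations above with concrete proportions.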


## Key Findings and Interpretation

Based on the exploratory data analysis, here are the key findings:

### 1. **Dataset Characteristics**
- 10,500 samples with 16 columns (including the 'Unnamed: 0' index and the 'label' target)
- Class imbalance: ~70% phishing, ~30% legitimate websites
- Missing values in LineOfCode feature (~3,000 missing values)
- High cardinality categorical features (Industry, HostingProvider)

### 2. **Feature Quality**
- Most numerical features are right-skewed with outliers
- Several features show strong correlation with target variable
- No duplicate records in the dataset

### 3. **Key Discriminating Features**
Based on the analysis, the following features appear most useful for classification:
- **DomainAgeMonths**: Younger domains are associated with phishing
- **NoOfURLRedirect**: More redirects are associated with phishing
- **Robots**: Presence of robots.txt suggests legitimate sites
- **IsResponsive**: Responsive design more common in legitimate sites
- **NoOfPopup**: Phishing sites may have more popups

### 4. **Data Quality Issues**
- Missing values in LineOfCode require imputation
- Outliers present in multiple features
- High cardinality categorical variables need appropriate encoding


## Recommendations for ML Pipeline

Based on the EDA findings, the following preprocessing steps and modeling choices are recommended:

### Feature Engineering & Preprocessing:
1. **Handle Missing Values**: 
   - Impute LineOfCode missing values using median or KNN imputation
   
2. **Outlier Treatment**:
   - Consider capping outliers or using robust scaling methods
   - Tree-based models are naturally robust to outliers
   
3. **Categorical Encoding**:
   - Use target encoding or frequency encoding for Industry and HostingProvider
   - Consider grouping rare categories so that high cardinality does not inflate the feature space
   
4. **Feature Scaling**:
   - Standardization or normalization for models sensitive to scale (Logistic Regression, SVM, Neural Networks)
   - Not required for tree-based models
   
5. **Handle Class Imbalance**:
   - Use stratified sampling for train-test split
   - Consider class weights in model training
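   - Alternatively, use SMOTE to oversample the minority (legitimate) class, fitting it on the training split only so that synthetic samples do not leak into evaluation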
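
The preprocessing steps above could be wired together roughly as follows. This is a sketch under several assumptions: the data is synthetic, only three columns are included, and one-hot encoding stands in for the target or frequency encoding recommended above, purely to keep the example short. Logistic Regression is used here only because it is the baseline named in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: column names follow the notebook, values are random.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "LineOfCode": rng.lognormal(5, 1, n),
    "DomainAgeMonths": rng.integers(0, 240, n).astype(float),
    "Industry": rng.choice(["Retail", "Finance", "Tech"], n),
})
X.loc[rng.choice(n, 60, replace=False), "LineOfCode"] = np.nan
y = pd.Series(rng.choice([0, 1], n, p=[0.3, 0.7]), name="label")

numeric = ["LineOfCode", "DomainAgeMonths"]
categorical = ["Industry"]

preprocess = ColumnTransformer([
    # Median imputation plus a missing-value flag, then scaling.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median", add_indicator=True)),
        ("scale", StandardScaler()),
    ]), numeric),
    # One-hot encoding as a simple stand-in for target/frequency encoding.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Stratify so both splits keep roughly the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Keeping imputation, scaling, and encoding inside one `Pipeline` means they are fitted on the training split only, which avoids leaking test-set statistics into the model.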

### Model Selection:
1. **Primary Models to Consider**:
   - **Random Forest**: Handles non-linear relationships, robust to outliers, provides feature importance
   - **XGBoost/LightGBM**: High performance, handles missing values, good with imbalanced data
   - **Logistic Regression**: Baseline model for interpretability
   
2. **Ensemble Methods**: Combine multiple models (for example, by voting or stacking) and keep the ensemble only if it outperforms the best single model

### Evaluation Metrics:
Given class imbalance, use:
- **Primary**: F1-score, Precision, Recall
- **ROC-AUC**: Overall model discrimination ability
- **Confusion Matrix**: Understand false positives vs false negatives
- **Avoid**: Accuracy alone (misleading with imbalanced data)
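
A small worked example shows why accuracy alone can mislead here. It uses ten invented labels with the same 70/30 split and a degenerate "model" that flags every site as phishing. The metrics are computed by hand so that the formulas are visible.

```python
# Toy labels: 7 phishing (1) and 3 legitimate (0), mirroring the ~70/30 split.
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
# A degenerate "model" that flags every site as phishing.
y_pred = [1] * len(y_true)

# Confusion matrix counts, with phishing (1) as the positive class.
pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
# Share of legitimate sites correctly let through (true negative rate).
specificity = tn / (tn + fp)

print(accuracy, precision, recall, round(f1, 3), specificity)
```

Accuracy is 70% and the phishing F1-score looks respectable, yet every legitimate site is misclassified. Only the confusion matrix and the per-class rates expose this.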

### Other Considerations:
- Use cross-validation for robust performance estimation
- Perform hyperparameter tuning using GridSearchCV or RandomizedSearchCV
- Monitor for overfitting with validation curves
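
Stratified cross-validation with an imbalance-aware score might look like the sketch below. The data is synthetic, with one informative feature and one noise feature. The model settings are illustrative assumptions rather than tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data with a ~70/30 class split; values are random.
rng = np.random.default_rng(0)
n = 300
y = rng.choice([0, 1], size=n, p=[0.3, 0.7])
# First column is the label plus noise (informative), second is pure noise.
X = np.column_stack([y + rng.normal(0, 0.8, n), rng.normal(0, 1, n)])

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42
)

# F1 on the positive class (phishing = 1) instead of plain accuracy.
scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")

print(scores.mean(), scores.std())
```

Reporting the mean together with the spread across folds gives a rough sense of how stable the estimate is.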
