# Exploratory Data Analysis: Phishing Detection

This notebook explores the phishing vs. benign classification dataset: class distribution, feature summaries, and a brief insight.

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt

# Resolve the project root whether the notebook runs from notebooks/ or from the root.
cwd = os.getcwd()
PROJECT_ROOT = os.path.abspath(os.path.join(cwd, "..") if os.path.basename(cwd) == "notebooks" else cwd)

# Prefer the processed sample; fall back to the raw dataset if it is missing.
DATA_PATH = os.path.join(PROJECT_ROOT, "data", "processed", "sample_phishing.csv")
if not os.path.isfile(DATA_PATH):
    DATA_PATH = os.path.join(PROJECT_ROOT, "data", "raw", "phishing.csv")

df = pd.read_csv(DATA_PATH)
print("Shape:", df.shape)
df.head()

## Class distribution

In [None]:
label_col = "label"
counts = df[label_col].value_counts().sort_index()
# Label encoding assumed by the plot: 0 = benign, 1 = phishing.
labels = ["Benign", "Phishing"]
fig, ax = plt.subplots(figsize=(5, 4))
ax.bar(labels, [counts.get(0, 0), counts.get(1, 0)], color=["steelblue", "coral"])
ax.set_ylabel("Count")
ax.set_title("Class Distribution")
plt.tight_layout()
plt.show()
print("Class counts:", counts.to_dict())
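
The class counts above can be turned into per-class weights that offset the imbalance. The sketch below is a minimal, standard-library illustration and is not part of the original notebook: the helper name `balanced_class_weights` and the example label vector are hypothetical. It uses the heuristic `n_samples / (n_classes * class_count)`, which is the same formula scikit-learn applies for `class_weight="balanced"`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical label vector: 8 benign (0) and 2 phishing (1) samples.
example_labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(example_labels)
print(weights)  # {0: 0.625, 1: 2.5}
```

In the notebook itself, the same helper could be called as `balanced_class_weights(df[label_col])`, assuming the label column holds the class ids directly.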

## Feature summaries

In [None]:
feature_cols = [c for c in df.columns if c != label_col]
summary = df[feature_cols].describe()
summary

In [None]:
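# Size the grid from the number of numeric features so the layout never overflows.
numeric_cols = df[feature_cols].select_dtypes("number").columns
n_rows = -(-len(numeric_cols) // 3)  # ceiling division: three plots per row
df[numeric_cols].hist(bins=20, figsize=(12, 4 * n_rows), layout=(n_rows, 3))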
plt.suptitle("Feature distributions")
plt.tight_layout()
plt.show()

## Written insight

The dataset is imbalanced toward benign samples, although phishing prevalence depends on the data source. URL length, the presence of an IP address in the host, and path depth are common features in phishing detectors: malicious URLs tend to be longer, often use raw IP addresses to sidestep domain reputation checks, and tend to have deeper paths. Improving the baseline will likely require handling the class imbalance (e.g. class weighting or resampling) and possibly engineering features that better separate the two classes.
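
The three URL features named above could be derived from a raw URL roughly as follows. This is a sketch under stated assumptions, not the pipeline that produced this dataset: the function name `url_features`, the output keys, and the example URLs are all hypothetical, and the dataset's actual column names may differ.

```python
import ipaddress
from urllib.parse import urlparse


def url_features(url):
    """Compute three illustrative URL features (key names are hypothetical)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # The host counts as an IP if it parses as an IPv4 or IPv6 address.
    try:
        ipaddress.ip_address(host)
        host_is_ip = 1
    except ValueError:
        host_is_ip = 0

    # Path depth = number of non-empty segments between slashes.
    path_depth = len([segment for segment in parsed.path.split("/") if segment])

    return {
        "url_length": len(url),
        "host_is_ip": host_is_ip,
        "path_depth": path_depth,
    }


# Made-up example URLs for illustration only.
suspicious = url_features("http://192.168.0.1/login/verify/account.php")
ordinary = url_features("https://example.com/docs")
print(suspicious)
print(ordinary)
```

Applied to a column of raw URLs, the resulting dictionaries can be collected into a DataFrame and joined with the label column before fitting a baseline.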

In [None]:
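# Absolute correlation of each numeric feature with the label, strongest first.
# Restricting to numeric columns avoids errors when the frame contains text columns.
df.select_dtypes("number").corr()[label_col].drop(label_col).abs().sort_values(ascending=False)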