# Phishing Email Detection - Exploration Notebook

This notebook performs exploratory data analysis (EDA) on the phishing email dataset. It checks for missing values, visualizes the class balance and feature correlations, and splits the data into training and test sets for a machine learning model.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_theme(style='whitegrid')  # set_theme is the current name; sns.set is an alias for it

In [2]:
# Load the dataset
data_path = '../data/processed/phishing_emails.csv'
df = pd.read_csv(data_path)

# Display the first few rows of the dataset
df.head()

In [3]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values[missing_values > 0]
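
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch of such a summary; `missing_summary` is a hypothetical helper, and the toy frame with its column names only stands in for `df`, whose real columns are not shown here.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": 100 * counts / len(frame),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical toy data; the column names are assumptions, not the real schema
demo = pd.DataFrame({
    "num_links": [3, np.nan, 1, 0],
    "has_attachment": [1, 0, np.nan, np.nan],
    "label": [1, 0, 1, 0],
})
summary = missing_summary(demo)
print(summary)
```

In the notebook itself, the call would be `missing_summary(df)`.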

In [4]:
# Visualize the distribution of the target variable
plt.figure(figsize=(8, 6))
sns.countplot(x='label', data=df)
plt.title('Distribution of Phishing vs Legitimate Emails')
plt.xlabel('Label')
plt.ylabel('Count')
plt.show()
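
The bar chart shows the class balance visually; the exact shares are useful when deciding whether accuracy is a meaningful metric. A small sketch, assuming the label column holds two classes (the toy series below is made up, and `class_balance` is a hypothetical helper):

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and relative share of each class."""
    counts = labels.value_counts()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Hypothetical labels standing in for df['label']
demo_labels = pd.Series([1, 0, 0, 0, 1, 0, 0, 0], name="label")
balance = class_balance(demo_labels)

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced
imbalance_ratio = balance["count"].max() / balance["count"].min()
print(balance)
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```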

In [5]:
# Analyze feature correlations
plt.figure(figsize=(12, 10))
# numeric_only=True skips text columns, which raise an error in pandas >= 2.0
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()
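
A full heatmap can be hard to read when there are many features. One option is to rank features by the strength of their correlation with the target. The sketch below assumes the label is encoded numerically (for example 0/1); `label_correlations` and the toy columns are hypothetical.

```python
import pandas as pd


def label_correlations(frame: pd.DataFrame, target: str = "label") -> pd.Series:
    """Return each numeric feature's correlation with the target, strongest first."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Order by absolute value so strong negative correlations are not buried
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Hypothetical toy data; the column names are assumptions, not the real schema
demo = pd.DataFrame({
    "num_links": [5, 4, 0, 1, 6, 0],
    "subject_length": [10, 30, 20, 40, 25, 15],
    "sender": ["a", "b", "c", "d", "e", "f"],  # text column, ignored
    "label": [1, 1, 0, 0, 1, 0],
})
ranked = label_correlations(demo)
print(ranked)
```

Correlation only captures linear relationships, so a low value here does not by itself mean a feature is uninformative.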

In [6]:
# Prepare data for model training
X = df.drop('label', axis=1)
y = df['label']

# Split the dataset into training and testing sets
# stratify=y keeps the phishing/legitimate ratio the same in both sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Display the shapes of the resulting datasets
(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
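
Beyond the shapes, it can be worth confirming that both splits contain a similar share of phishing emails. The sketch below uses synthetic data with hypothetical feature names and passes `stratify` to `train_test_split`, which keeps the class proportions matched across the splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 200 rows, 20% positive class
rng = np.random.default_rng(42)
demo_X = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
demo_y = pd.Series([1] * 40 + [0] * 160, name="label")

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, stratify=demo_y, random_state=42
)

# With 0/1 labels, the mean is the share of the positive class
train_share = y_tr.mean()
test_share = y_te.mean()
print(f"Positive share: train={train_share:.3f}, test={test_share:.3f}")
```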

## Next Steps

1. Implement feature engineering techniques to improve model performance.
2. Train various machine learning models and evaluate their performance.
3. Fine-tune the models using hyperparameter optimization.
4. Document findings and prepare for model deployment.
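
As a starting point for step 2, a simple baseline gives later models something to beat. The following is only a sketch under stated assumptions: it uses synthetic data from `make_classification` instead of the real dataset, and logistic regression with balanced class weights is one reasonable choice of baseline, not a recommendation drawn from the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the phishing features and labels
features, target = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, stratify=target, random_state=42
)

# Scaling lives inside the pipeline so it is fit on the training data only
baseline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
baseline.fit(X_tr, y_tr)

# F1 is more informative than accuracy when classes are imbalanced
baseline_f1 = f1_score(y_te, baseline.predict(X_te))
print(f"Baseline F1 on the synthetic data: {baseline_f1:.3f}")
```

With the real data, `X_train`, `X_test`, `y_train`, and `y_test` from the split above would replace the synthetic arrays, provided all features are numeric.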