# Exploratory Analysis
---
This notebook downloads the two primary datasets used to train the model: one containing ham and spam emails, the other containing phishing emails. It then inspects and cleans the data and saves the result to a CSV file.

1. Imports
2. Downloading the data
3. Inspecting the data
4. Cleaning the data
5. Saving the cleaned data to a CSV file

## 1. Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import kagglehub
import os
from pathlib import Path

## 2. Downloading the Data

The datasets are downloaded with `kagglehub` and stored under `/home/<user>/.cache/kagglehub/datasets/...` (Unix/Linux path). The path may differ on a Windows system.

1. https://www.kaggle.com/datasets/marcelwiechmann/enron-spam-data
2. https://www.kaggle.com/datasets/subhajournal/phishingemails
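
The cache layout described above can be expressed as a small path helper. This is only a minimal sketch: `cached_dataset_dir` is a hypothetical name, not part of `kagglehub`, and it assumes the default cache location under the user's home directory with an `<owner>/<dataset>/versions/<n>` layout.

```python
from pathlib import Path
from typing import Optional


def cached_dataset_dir(handle: str, version: int, home: Optional[Path] = None) -> Path:
    """Build the expected kagglehub cache directory for a dataset handle.

    Hypothetical helper: assumes the default layout
    ~/.cache/kagglehub/datasets/<owner>/<dataset>/versions/<version>.
    """
    owner, dataset = handle.split("/")
    base = home if home is not None else Path.home()
    return base / ".cache" / "kagglehub" / "datasets" / owner / dataset / "versions" / str(version)


# Check whether the Enron dataset is already cached for the current user
enron_dir = cached_dataset_dir("marcelwiechmann/enron-spam-data", 3)
print(enron_dir, enron_dir.exists())
```

Joining `Path` segments avoids stray or doubled separators that string formatting can introduce, and the `exists()` check mirrors the one used in the download cells.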

In [12]:
# Expected location of the Enron spam dataset in the current user's kagglehub cache
path = f"{os.path.expanduser('~')}/.cache/kagglehub/datasets/marcelwiechmann/enron-spam-data/versions/3"

if not Path(path).exists():
    # If the cached directory doesn't exist, download the dataset
    path = kagglehub.dataset_download("marcelwiechmann/enron-spam-data")

hamSpam_df = pd.read_csv(f"{path}/enron_spam_data.csv")
hamSpam_df.head(5)

hamSpam_df.to_csv("hamSpam.csv")

In [13]:
# Expected location of the phishing email dataset in the current user's kagglehub cache
path = f"{os.path.expanduser('~')}/.cache/kagglehub/datasets/subhajournal/phishingemails/versions/1"

if not Path(path).exists():
    # If the cached directory doesn't exist, download the dataset
    path = kagglehub.dataset_download("subhajournal/phishingemails")

phish_df = pd.read_csv(f"{path}/Phishing_Email.csv")
phish_df.head(5)

phish_df.to_csv("phish.csv")

In [30]:
# Compare the column names of the two frames side by side
list(hamSpam_df.columns.values), list(phish_df.columns.values)

(['Unnamed: 0', 'Email Text', 'Email Type'],
 ['Unnamed: 0', 'Email Text', 'Email Type'])
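
If the two frames are later combined, it can help to check programmatically that their column sets match instead of comparing the lists by eye. The following is a minimal sketch on toy data; `column_diff` is a hypothetical helper, and the sample rows are invented for illustration only.

```python
import pandas as pd


def column_diff(left, right):
    """Return the columns found only in `left` and only in `right` (hypothetical helper)."""
    only_left = [c for c in left.columns if c not in right.columns]
    only_right = [c for c in right.columns if c not in left.columns]
    return only_left, only_right


# Toy frames standing in for hamSpam_df and phish_df (invented rows)
a = pd.DataFrame({"Email Text": ["hi"], "Email Type": ["ham"], "Date": ["2001-01-01"]})
b = pd.DataFrame({"Email Text": ["click here"], "Email Type": ["phishing"]})

only_a, only_b = column_diff(a, b)
print(only_a, only_b)
```

Two empty lists would indicate that both frames share the same columns.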

In [29]:
# Rename known columns (no error if source names are missing)
hamSpam_df = hamSpam_df.rename(columns={"Spam/Ham": "Email Type", "Message": "Email Text"})

# Drop optional columns only if they exist to avoid KeyError
cols_to_drop = [c for c in ["Date", "Subject"] if c in hamSpam_df.columns]
if cols_to_drop:
    hamSpam_df = hamSpam_df.drop(columns=cols_to_drop)
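
The conditional drop above can also be written with the `errors="ignore"` option of `DataFrame.drop`, which skips labels that are absent. A minimal sketch on an invented toy frame, not the actual Enron data:

```python
import pandas as pd

# Toy frame with one invented row; "Date" is absent on purpose
toy = pd.DataFrame({"Subject": ["hello"], "Message": ["see you at noon"], "Spam/Ham": ["ham"]})

# rename() silently skips mapping keys that are not present in the frame
toy = toy.rename(columns={"Spam/Ham": "Email Type", "Message": "Email Text"})

# errors="ignore" drops only the labels that actually exist
toy = toy.drop(columns=["Date", "Subject"], errors="ignore")

print(list(toy.columns))
```

Both forms behave the same when a column is missing; the explicit list comprehension has the advantage of making the dropped columns available for logging.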