# Exploratory Analysis
---
This notebook downloads, inspects, and cleans the two primary datasets used to train the model: one containing ham and spam emails, the other containing safe and phishing emails. It covers the following steps:

1. Imports
2. Downloading the data
3. Data inspection
4. Cleaning the data
5. Saving the cleaned data to a CSV file

## 1. Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import kagglehub
import os
from pathlib import Path
import csv

## 2. Downloading the Data

The datasets are downloaded with `kagglehub` and stored under `/home/<user>/.cache/kagglehub/datasets/...` (Unix/Linux path). The path may differ on a Windows system.

1. https://www.kaggle.com/datasets/marcelwiechmann/enron-spam-data
2. https://www.kaggle.com/datasets/subhajournal/phishingemails
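
The cache location described above can be assembled in an OS-independent way with `pathlib`. The following is a minimal sketch: the helper name `kagglehub_dataset_dir` is hypothetical, and the layout `~/.cache/kagglehub/datasets/<owner>/<dataset>/versions/<n>` is assumed from the paths used in this notebook, not taken from the `kagglehub` documentation.

```python
from pathlib import Path


def kagglehub_dataset_dir(owner: str, dataset: str, version: int) -> Path:
    """Return the directory where this notebook expects kagglehub to cache a dataset.

    Hypothetical helper: the layout is an assumption based on the paths used here.
    """
    return (
        Path.home() / ".cache" / "kagglehub" / "datasets"
        / owner / dataset / "versions" / str(version)
    )


enron_dir = kagglehub_dataset_dir("marcelwiechmann", "enron-spam-data", 3)
print(enron_dir, "(exists)" if enron_dir.exists() else "(not downloaded yet)")
```

Building the path from `Path.home()` avoids hard-coding `/` as the separator, which matters if the notebook is run on Windows.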

In [2]:
# Expected kagglehub cache directory for the Enron ham/spam dataset (current user's home)
path = f"{os.path.expanduser('~')}/.cache/kagglehub/datasets/marcelwiechmann/enron-spam-data/versions/3"

if not Path(path).exists():
    # If the dataset directory doesn't exist, download it; dataset_download returns the local path
    path = kagglehub.dataset_download("marcelwiechmann/enron-spam-data")

hamSpam_df = pd.read_csv(f"{path}/enron_spam_data.csv")
hamSpam_df.to_csv("hamSpam.csv")
hamSpam_df.head(5)

Unnamed: 0.1,Unnamed: 0,Subject,Message,Spam/Ham,Date
0,0,christmas tree farm pictures,,ham,1999-12-10
1,1,"vastar resources , inc .","gary , production from the high island larger ...",ham,1999-12-13
2,2,calpine daily gas nomination,- calpine daily gas nomination 1 . doc,ham,1999-12-14
3,3,re : issue,fyi - see note below - already done .\nstella\...,ham,1999-12-14
4,4,meter 7268 nov allocation,fyi .\n- - - - - - - - - - - - - - - - - - - -...,ham,1999-12-14


In [3]:
# Expected kagglehub cache directory for the phishing email dataset (current user's home)
path = f"{os.path.expanduser('~')}/.cache/kagglehub/datasets/subhajournal/phishingemails/versions/1"

if not Path(path).exists():
    # If the dataset directory doesn't exist, download it; dataset_download returns the local path
    path = kagglehub.dataset_download("subhajournal/phishingemails")

phish_df = pd.read_csv(f"{path}/Phishing_Email.csv")
phish_df.to_csv("phish.csv")
phish_df.head(5)

Unnamed: 0.1,Unnamed: 0,Email Text,Email Type
0,0,"re : 6 . 1100 , disc : uniformitarianism , re ...",Safe Email
1,1,the other side of * galicismos * * galicismo *...,Safe Email
2,2,re : equistar deal tickets are you still avail...,Safe Email
3,3,\nHello I am your hot lil horny toy.\n I am...,Phishing Email
4,4,software at incredibly low prices ( 86 % lower...,Phishing Email


## 3. Data Inspection

In [4]:
list(hamSpam_df.columns.values) , list(phish_df.columns.values)

(['Unnamed: 0', 'Subject', 'Message', 'Spam/Ham', 'Date'],
 ['Unnamed: 0', 'Email Text', 'Email Type'])

In [5]:
hamSpam_df.count()

Unnamed: 0    33716
Subject       33716
Message       33664
Spam/Ham      33716
Date          33716
dtype: int64

In [6]:
phish_df.count()

Unnamed: 0    18650
Email Text    18634
Email Type    18650
dtype: int64
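
Because `DataFrame.count()` only counts non-null cells, the gaps above indicate missing text: 33716 - 33664 = 52 empty `Message` values in the ham/spam data and 18650 - 18634 = 16 empty `Email Text` values in the phishing data. As a minimal sketch on invented toy rows (not the real datasets), the same figure can be read off directly with `isna().sum()`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real frames; the rows are invented for illustration.
toy_df = pd.DataFrame(
    {
        "Email Text": ["meeting at noon", np.nan, "win a free prize", np.nan],
        "Email Type": ["ham", "ham", "spam", "spam"],
    }
)

non_null = toy_df.count()      # non-null cells per column
missing = toy_df.isna().sum()  # null cells per column

print(missing)

# The two views always add up to the number of rows.
assert (non_null + missing == len(toy_df)).all()
```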

## 4. Cleaning the Data

In [7]:
# Rename known columns (no error if source names are missing)
hamSpam_df = hamSpam_df.rename(columns={"Spam/Ham": "Email Type", "Message": "Email Text"})

# Drop optional columns only if they exist to avoid KeyError
cols_to_drop = [c for c in ["Date", "Subject"] if c in hamSpam_df.columns]
if cols_to_drop:
    hamSpam_df = hamSpam_df.drop(columns=cols_to_drop)
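
As an alternative to filtering the column list by hand, `DataFrame.drop` accepts `errors="ignore"`, which skips labels that are not present. A small sketch on invented toy data:

```python
import pandas as pd

# Toy frame for illustration; note that it has no "Date" column.
toy_df = pd.DataFrame(
    {
        "Subject": ["re : issue"],
        "Message": ["fyi - see note below"],
        "Spam/Ham": ["ham"],
    }
)

cleaned = (
    toy_df
    .rename(columns={"Spam/Ham": "Email Type", "Message": "Email Text"})
    .drop(columns=["Date", "Subject"], errors="ignore")  # the missing "Date" is skipped
)

print(list(cleaned.columns))
```

Both forms behave the same here. The explicit list comprehension makes the intent visible, while `errors="ignore"` is shorter.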

In [8]:
hamSpam_df = hamSpam_df[hamSpam_df['Email Type'].isin(['ham', 'spam'])]
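
The filter above keeps only rows whose label is exactly `ham` or `spam`. Before discarding the rest, it can help to look at what would be dropped. The sketch below uses invented toy rows with one deliberately invalid label. It makes no claim about what the real dataset contains.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "Email Text": ["lunch ?", "cheap meds", "broken row"],
        "Email Type": ["ham", "spam", "2004-01-05"],  # last label is deliberately invalid
    }
)

valid_mask = toy_df["Email Type"].isin(["ham", "spam"])
filtered = toy_df[valid_mask]

print(f"kept {valid_mask.sum()} rows, dropped {(~valid_mask).sum()}")
# Show which unexpected labels were filtered out, and how often each occurs.
print(toy_df.loc[~valid_mask, "Email Type"].value_counts())
```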

In [17]:
hamSpam_df.to_csv(
    "hamSpam_fixed.csv",
    index=False,
    quoting=csv.QUOTE_ALL,
    escapechar="\\"
)
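
Email bodies contain commas, quotes, and line breaks, which is presumably why every field is quoted on export. A quick way to gain confidence in the settings is a round trip through an in-memory buffer. This is a sketch on invented toy rows, with `io.StringIO` standing in for the real file:

```python
import csv
import io

import pandas as pd

toy_df = pd.DataFrame(
    {
        "Email Text": ["fyi .\n- see note below", 'he said "hi", then left'],
        "Email Type": ["ham", "spam"],
    }
)

buffer = io.StringIO()  # in-memory stand-in for "hamSpam_fixed.csv"
toy_df.to_csv(buffer, index=False, quoting=csv.QUOTE_ALL, escapechar="\\")

buffer.seek(0)
restored = pd.read_csv(buffer)

# Multi-line and quoted text should come back unchanged.
assert restored["Email Text"].tolist() == toy_df["Email Text"].tolist()
```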

In [23]:
phish_df['Email Type'].apply(repr).unique()


array(["'Safe Email'", "'Phishing Email'"], dtype=object)

In [24]:
phish_df['Email Type'] = phish_df['Email Type'].replace({'Safe Email': 'ham', 'Phishing Email': 'phish'})
phish_df['Email Type'].apply(repr).unique()

array(["'ham'", "'phish'"], dtype=object)
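
`Series.replace` leaves any label that is not in the mapping untouched, so an unexpected value would pass through silently. A stricter variant, sketched here on invented toy labels, uses `Series.map`, which turns unmapped values into `NaN` so they can be detected:

```python
import pandas as pd

label_map = {"Safe Email": "ham", "Phishing Email": "phish"}

toy_labels = pd.Series(["Safe Email", "Phishing Email", "Safe Email"], name="Email Type")

mapped = toy_labels.map(label_map)  # labels missing from label_map become NaN

unmapped = toy_labels[mapped.isna()].unique()
if len(unmapped) > 0:
    raise ValueError(f"Unexpected labels: {list(unmapped)}")

print(mapped.tolist())
```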

In [25]:
emailDataset = pd.concat([hamSpam_df, phish_df], ignore_index=True)
emailDataset['Email Type'].apply(repr).unique()

array(["'ham'", "'spam'", "'phish'"], dtype=object)

In [27]:
emailDataset.count()

Unnamed: 0    52366
Email Text    52298
Email Type    52366
dtype: int64
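
These totals are consistent with the two inputs: 33716 + 18650 = 52366 rows and 33664 + 18634 = 52298 non-null texts, so 68 rows carry no `Email Text`. The sketch below shows two possible follow-up steps on invented toy frames. Neither is performed by the notebook at this point: checking the row count after `pd.concat`, and dropping empty texts together with the leftover `Unnamed: 0` column.

```python
import numpy as np
import pandas as pd

# Toy frames for illustration only.
ham_spam_toy = pd.DataFrame(
    {"Unnamed: 0": [0, 1], "Email Text": ["lunch ?", np.nan], "Email Type": ["ham", "spam"]}
)
phish_toy = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1],
        "Email Text": ["verify your account", "see you monday"],
        "Email Type": ["phish", "ham"],
    }
)

combined = pd.concat([ham_spam_toy, phish_toy], ignore_index=True)
assert len(combined) == len(ham_spam_toy) + len(phish_toy)

tidy = (
    combined
    .dropna(subset=["Email Text"])  # remove rows without any text
    .drop(columns=["Unnamed: 0"])   # per-source row numbers, not a feature
    .reset_index(drop=True)
)
print(tidy)
```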

In [29]:
emailDataset

Unnamed: 0.1,Unnamed: 0,Email Text,Email Type
0,0,,ham
1,1,"gary , production from the high island larger ...",ham
2,2,- calpine daily gas nomination 1 . doc,ham
3,3,fyi - see note below - already done .\nstella\...,ham
4,4,fyi .\n- - - - - - - - - - - - - - - - - - - -...,ham
...,...,...,...
52361,18646,date a lonely housewife always wanted to date ...,phish
52362,18647,request submitted : access request for anita ....,ham
52363,18648,"re : important - prc mtg hi dorn & john , as y...",ham
52364,18649,press clippings - letter on californian utilit...,ham
