# Exploratory Analysis
---
This notebook downloads, inspects, and cleans the two primary datasets used to train the model: one containing ham and spam emails, the other containing phishing emails. It covers the following steps:

1. Imports
2. Downloading the data
3. Data inspection
4. Cleaning the data
5. Saving cleaned data to a CSV file

## 1. Imports

In [26]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import kagglehub
import os
from pathlib import Path

## 2. Downloading the Data

The datasets are downloaded using `kagglehub` and stored in `/home/<user>/.cache/kagglehub/datasets/...` (Unix/Linux path). The path may differ on a Windows system.

1. https://www.kaggle.com/datasets/marcelwiechmann/enron-spam-data
2. https://www.kaggle.com/datasets/subhajournal/phishingemails
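
As a rough illustration of where that cache path comes from, the sketch below builds it with `pathlib` instead of string formatting, which also produces a sensible path on Windows. The helper name `kagglehub_cache_dir` is hypothetical, and the `~/.cache/kagglehub/datasets/<handle>/versions/<n>` layout is assumed from the paths used in this notebook rather than taken from the `kagglehub` documentation.

```python
from pathlib import Path


def kagglehub_cache_dir(handle: str, version: int) -> Path:
    """Build the expected cache directory for a dataset (assumed layout)."""
    # Path.home() resolves the current user's home directory on any OS.
    base = Path.home() / ".cache" / "kagglehub" / "datasets"
    # A handle such as "owner/name" is split into two path components.
    return base.joinpath(*handle.split("/"), "versions", str(version))


enron_dir = kagglehub_cache_dir("marcelwiechmann/enron-spam-data", 3)
print(enron_dir)
```

The version numbers (3 and 1) are the ones hard-coded in the cells below; a newer dataset version would land in a different directory.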

In [27]:
# Expected location of the Enron spam dataset in the current user's kagglehub cache
path = f"{os.path.expanduser('~')}/.cache/kagglehub/datasets/marcelwiechmann/enron-spam-data/versions/3"

if not Path(path).exists():
    # If the cached directory doesn't exist, download the dataset
    path = kagglehub.dataset_download("marcelwiechmann/enron-spam-data")

# Use a separate name so that phish_df in the next cell does not overwrite this frame
enron_df = pd.read_csv(f"{path}/enron_spam_data.csv")
enron_df.head(5)

| Unnamed: 0.1 | Unnamed: 0 | Subject | Message | Spam/Ham | Date |
|---|---|---|---|---|---|
| 0 | 0 | christmas tree farm pictures | | ham | 1999-12-10 |
| 1 | 1 | vastar resources , inc . | gary , production from the high island larger ... | ham | 1999-12-13 |
| 2 | 2 | calpine daily gas nomination | - calpine daily gas nomination 1 . doc | ham | 1999-12-14 |
| 3 | 3 | re : issue | fyi - see note below - already done .\nstella\... | ham | 1999-12-14 |
| 4 | 4 | meter 7268 nov allocation | fyi .\n- - - - - - - - - - - - - - - - - - - -... | ham | 1999-12-14 |


In [28]:
# Expected location of the phishing email dataset in the current user's kagglehub cache
path = f"{os.path.expanduser('~')}/.cache/kagglehub/datasets/subhajournal/phishingemails/versions/1"

if not Path(path).exists():
    # If the cached directory doesn't exist, download the dataset
    path = kagglehub.dataset_download("subhajournal/phishingemails")

phish_df = pd.read_csv(f"{path}/Phishing_Email.csv")
phish_df.head(5)

| Unnamed: 0.1 | Unnamed: 0 | Email Text | Email Type |
|---|---|---|---|
| 0 | 0 | re : 6 . 1100 , disc : uniformitarianism , re ... | Safe Email |
| 1 | 1 | the other side of * galicismos * * galicismo *... | Safe Email |
| 2 | 2 | re : equistar deal tickets are you still avail... | Safe Email |
| 3 | 3 | \nHello I am your hot lil horny toy.\n I am... | Phishing Email |
| 4 | 4 | software at incredibly low prices ( 86 % lower... | Phishing Email |
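
Both loading cells follow the same pattern: use the cached directory if it exists, otherwise download the dataset. A minimal sketch of that pattern as a reusable function might look like the following. The name `resolve_dataset_dir` is hypothetical, and the download step is passed in as a callable (in this notebook it would wrap `kagglehub.dataset_download`) so the function can be exercised without network access.

```python
from pathlib import Path
from typing import Callable


def resolve_dataset_dir(cache_dir: Path, download: Callable[[], str]) -> Path:
    """Return cache_dir if it already exists, else the path given by download()."""
    if cache_dir.exists():
        return cache_dir
    # Only call the downloader when the cached copy is missing.
    return Path(download())


# Usage with a stand-in downloader; in the notebook this would be
# lambda: kagglehub.dataset_download("subhajournal/phishingemails")
missing = Path("does-not-exist-kagglehub-demo")
print(resolve_dataset_dir(missing, lambda: "downloaded/path"))
```

Keeping the check in one place means the comment, the fallback, and the path handling cannot drift apart between the two datasets.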
