# Data Collection
---
The ham and spam data come from the Kaggle dataset [marcelwiechmann/enron-spam-data](https://www.kaggle.com/datasets/marcelwiechmann/enron-spam-data), and the phishing data come from [subhajournal/phishingemails](https://www.kaggle.com/datasets/subhajournal/phishingemails).

## 1. Imports

In [7]:
import pandas as pd
import kagglehub
import os
from pathlib import Path

## 2. Download the Data from Kaggle
The datasets are downloaded with `kagglehub` and stored in `/home/<user>/.cache/kagglehub/datasets/...` (Unix/Linux path). The path may differ on a Windows system.
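Because the cache location depends on the current user's home directory, the path can be assembled with `pathlib` instead of a hand-written string. The following is a minimal sketch, assuming the `datasets/<owner>/<name>/versions/<n>` layout that this notebook's download cells rely on. `cached_dataset_dir` and its `home` parameter are hypothetical names for illustration and are not part of `kagglehub`.

```python
from pathlib import Path
from typing import Optional


def cached_dataset_dir(handle: str, version: int, home: Optional[Path] = None) -> Path:
    """Build the expected cache directory for one version of a Kaggle dataset.

    Assumption: the cache lives under <home>/.cache/kagglehub, as in this notebook.
    """
    # A dataset handle has the form "<owner>/<name>".
    owner, name = handle.split("/")
    base = home if home is not None else Path.home()
    # pathlib inserts the correct separator for the current operating system.
    return base / ".cache" / "kagglehub" / "datasets" / owner / name / "versions" / str(version)
```

For example, `cached_dataset_dir("marcelwiechmann/enron-spam-data", 3)` builds the same location as the hard-coded string used for the ham and spam data, without assuming `/` as the separator.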

Create `data` folder.

In [8]:
Path("./data").mkdir(parents=True, exist_ok=True)

Download ham and spam data.

In [9]:
# Expected cache directory for the enron_spam_data.csv file, under the current user's home directory
path = f"{os.path.expanduser('~')}/.cache/kagglehub/datasets/marcelwiechmann/enron-spam-data/versions/3"

if not Path(path).exists():
    # If the dataset is not cached yet, download it (kagglehub returns the local path)
    path = kagglehub.dataset_download("marcelwiechmann/enron-spam-data")

hamSpam_df = pd.read_csv(f"{path}/enron_spam_data.csv")
hamSpam_df.drop("Unnamed: 0", axis=1, inplace=True)
hamSpam_df.to_csv("./data/1_hamSpam.csv", index=False)
hamSpam_df.head(5)

,Subject,Message,Spam/Ham,Date
0,christmas tree farm pictures,,ham,1999-12-10
1,"vastar resources , inc .","gary , production from the high island larger ...",ham,1999-12-13
2,calpine daily gas nomination,- calpine daily gas nomination 1 . doc,ham,1999-12-14
3,re : issue,fyi - see note below - already done .\nstella\...,ham,1999-12-14
4,meter 7268 nov allocation,fyi .\n- - - - - - - - - - - - - - - - - - - -...,ham,1999-12-14


Download phishing data.

In [10]:
# Expected cache directory for the Phishing_Email.csv file, under the current user's home directory
path = f"{os.path.expanduser('~')}/.cache/kagglehub/datasets/subhajournal/phishingemails/versions/1"

if not Path(path).exists():
    # If the dataset is not cached yet, download it (kagglehub returns the local path)
    path = kagglehub.dataset_download("subhajournal/phishingemails")

phish_df = pd.read_csv(f"{path}/Phishing_Email.csv")
phish_df.drop("Unnamed: 0", axis=1, inplace=True)
phish_df.to_csv("./data/1_phish.csv", index=False)
phish_df.head(5)

,Email Text,Email Type
0,"re : 6 . 1100 , disc : uniformitarianism , re ...",Safe Email
1,the other side of * galicismos * * galicismo *...,Safe Email
2,re : equistar deal tickets are you still avail...,Safe Email
3,\nHello I am your hot lil horny toy.\n I am...,Phishing Email
4,software at incredibly low prices ( 86 % lower...,Phishing Email
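Both download cells follow the same pattern: use the cached directory if it exists, otherwise download. One way to avoid repeating it is a small helper, sketched below. `resolve_dataset_dir` is a hypothetical name, and the download step is passed in as a callable so the helper makes no assumptions about `kagglehub` beyond it returning a local path.

```python
from pathlib import Path
from typing import Callable


def resolve_dataset_dir(cache_dir: Path, download: Callable[[], str]) -> Path:
    """Return cache_dir if it already exists; otherwise download and return the new path."""
    if cache_dir.exists():
        # Cached copy found, so the download is skipped entirely.
        return cache_dir
    # download() is expected to return the local directory it wrote to.
    return Path(download())
```

In this notebook it could be called as `resolve_dataset_dir(Path(path), lambda: kagglehub.dataset_download("subhajournal/phishingemails"))`, and likewise for the ham and spam dataset.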


Show the columns of each dataset.

In [11]:
list(hamSpam_df.columns.values), list(phish_df.columns.values)

(['Subject', 'Message', 'Spam/Ham', 'Date'], ['Email Text', 'Email Type'])

Show the shape (rows, columns) of each dataset.

In [12]:
hamSpam_df.shape, phish_df.shape

((33716, 4), (18650, 2))

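The two sources do not share a schema at initial load: the ham and spam data has `Subject`, `Message`, `Spam/Ham`, and `Date` columns, while the phishing data has only `Email Text` and `Email Type`. The data must be cleaned and combined into a common dataframe structure. Continue to step 2.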
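To make the schema mismatch concrete, here is a rough sketch of one possible common structure: a single `text` column and a single `label` column. It is not the cleaning procedure of step 2. The target column names, the label values, the choice to map `Safe Email` to `ham`, and the presence of a `spam` value in `Spam/Ham` are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical target labels; the notebook does not specify these names.
HAM_SPAM_LABELS = {"ham": "ham", "spam": "spam"}
PHISH_LABELS = {"Safe Email": "ham", "Phishing Email": "phish"}


def to_common_frame(df, text_cols, label_col, label_map):
    """Return a two-column frame: joined text and a normalized label."""
    # Treat missing fields as empty strings, then join the text columns row by row.
    text = df[text_cols].fillna("").astype(str).agg(" ".join, axis=1).str.strip()
    return pd.DataFrame({"text": text, "label": df[label_col].map(label_map)})


def combine(ham_spam_df, phish_df):
    """Stack both sources into one frame with a fresh 0..n-1 index."""
    parts = [
        to_common_frame(ham_spam_df, ["Subject", "Message"], "Spam/Ham", HAM_SPAM_LABELS),
        to_common_frame(phish_df, ["Email Text"], "Email Type", PHISH_LABELS),
    ]
    return pd.concat(parts, ignore_index=True)
```

Calling `combine(hamSpam_df, phish_df)` would return one dataframe whose row count is the sum of the two inputs. Any label value missing from the maps becomes `NaN`, which makes unexpected categories easy to spot.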