# Data Cleaning
---
Take the raw data from `./data/1_hamSpam.csv` and `./data/1_phish.csv`, conform the ham/spam dataset to the structure of the phishing dataset, and combine the two. Then clean the combined dataset by removing NA values, standardizing the text (lowercasing and removing stopwords), and stripping any embedded HTML elements.

## 1. Imports & Setup

In [1]:
import pandas as pd

import nltk
nltk.download('stopwords')
from nltk.tokenize import word_tokenize

from utils import clean_text

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/prokope/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Download the remaining NLTK resources (tokenizer models and WordNet).

In [2]:
# NLTK downloads
nltk.download("punkt")
nltk.download("wordnet")
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /home/prokope/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/prokope/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/prokope/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## 2. Load Data from CSV

In [3]:
hamSpam_df = pd.read_csv("./data/1_hamSpam.csv")
phish_df = pd.read_csv("./data/1_phish.csv")
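
Since `rename` silently ignores source names that are missing, it can help to confirm right after loading that the expected columns are present. The following is a minimal sketch on a made-up in-memory CSV; the helper `missing_columns` and the toy rows are hypothetical and not part of the original notebook.

```python
import io

import pandas as pd

# Toy CSV standing in for ./data/1_hamSpam.csv (contents are made up).
toy_csv = io.StringIO(
    "Subject,Message,Spam/Ham,Date\n"
    "hello,see attached,ham,2001-01-01\n"
    "win,claim your prize,spam,2001-01-02\n"
)
toy_df = pd.read_csv(toy_csv)


def missing_columns(df: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]


# Hypothetical check: both source columns used by the rename step exist.
missing = missing_columns(toy_df, ["Spam/Ham", "Message"])
print(missing)
```

An empty list means the rename in the next section will find both of its source columns.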

## 3. Process Ham/Spam Data

In [4]:
# Rename known columns (no error if source names are missing)
hamSpam_df.rename(
    columns={"Spam/Ham": "Email Type", "Message": "Email Text"},
    inplace=True
)

# Drop optional columns only if they exist to avoid KeyError
cols_to_drop = [c for c in ["Date", "Subject"] if c in hamSpam_df.columns]

if cols_to_drop:
    hamSpam_df.drop(columns=cols_to_drop, inplace=True)

# Filter out emails that are not ham/spam
hamSpam_df = hamSpam_df[hamSpam_df['Email Type'].isin(['ham', 'spam'])]

# Export as processed dataset
hamSpam_df.to_feather(
    "data/2_hamSpam_processed.feather"
)
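
As an alternative to filtering the column list by hand, `DataFrame.drop` accepts `errors="ignore"`, which skips labels that do not exist. The sketch below chains the same steps on a made-up frame; the toy rows are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows; note that there is no "Date" column here.
toy_df = pd.DataFrame(
    {
        "Subject": ["hi", "offer"],
        "Message": ["lunch at noon?", "cheap meds"],
        "Spam/Ham": ["ham", "spam"],
    }
)

# errors="ignore" skips the absent "Date" label instead of raising KeyError.
processed = (
    toy_df
    .rename(columns={"Spam/Ham": "Email Type", "Message": "Email Text"})
    .drop(columns=["Date", "Subject"], errors="ignore")
)

# .copy() makes the filtered result an independent frame.
processed = processed[processed["Email Type"].isin(["ham", "spam"])].copy()
print(list(processed.columns))
```

This variant avoids `inplace=True`, so each step returns a new frame and the original input stays untouched.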

## 4. Process Phishing Data

Inspect the distinct email types.

In [5]:
phish_df['Email Type'].apply(repr).unique()

array(["'Safe Email'", "'Phishing Email'"], dtype=object)

Standardize email types.

In [6]:
phish_df['Email Type'] = phish_df['Email Type'].replace(
    {'Safe Email': 'ham', 'Phishing Email': 'phish'}
)
phish_df['Email Type'].apply(repr).unique()

array(["'ham'", "'phish'"], dtype=object)
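
`replace` leaves any value that is not a key in the mapping unchanged, so a stray label variant would pass through silently. A minimal sketch of a guard against that, assuming a hypothetical third label with different casing and trailing whitespace (the real dataset shows only the two labels above):

```python
import pandas as pd

LABEL_MAP = {"Safe Email": "ham", "Phishing Email": "phish"}

# The third value is a made-up variant used only to demonstrate the check.
toy_labels = pd.Series(["Safe Email", "Phishing Email", "safe email "])

# Labels that replace() would leave untouched because they are not map keys.
unmapped = sorted(set(toy_labels.unique()) - set(LABEL_MAP))
print(unmapped)

# Normalising whitespace and case first lets near-duplicates map cleanly.
normalised_map = {key.lower(): value for key, value in LABEL_MAP.items()}
standardised = toy_labels.str.strip().str.lower().map(normalised_map)
print(standardised.tolist())
```

Unlike `replace`, `map` turns anything still unmapped into a missing value, which makes leftovers easy to spot with `isna()`.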

## 5. Combine Datasets

In [7]:
email_df = pd.concat([hamSpam_df, phish_df], ignore_index=True)
email_df['Email Type'].apply(repr).unique()

array(["'ham'", "'spam'", "'phish'"], dtype=object)

Display one example from each category.

In [22]:
email_df[email_df["Email Type"] == "spam"].iloc[[5000]]

|  | Email Text | Email Type |
|---|---|---|
| 18545 | start date : 2 / 6 / 02 ; hourahead hour : 24 ... | spam |


In [23]:
email_df[email_df["Email Type"] == "ham"].iloc[[5000]]

|  | Email Text | Email Type |
|---|---|---|
| 6500 | dave ,\nyou are representing martin lin , an a... | ham |


In [24]:
email_df[email_df["Email Type"] == "phish"].iloc[[5000]]

|  | Email Text | Email Type |
|---|---|---|
| 46443 | we sell windows xp pro for 50 bucks saturnine ... | phish |
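
The three cells above pick a fixed position (`iloc[[5000]]`), which raises an `IndexError` if a class has fewer than 5001 rows. One possible alternative is to draw a reproducible random row per class with `groupby(...).sample`. The sketch below uses made-up rows, and the seed value is arbitrary.

```python
import pandas as pd

# Made-up rows, two per class.
toy_df = pd.DataFrame(
    {
        "Email Text": [
            "meeting at 3",
            "free prize",
            "verify your account",
            "see notes",
            "cheap pills",
            "reset password now",
        ],
        "Email Type": ["ham", "spam", "phish", "ham", "spam", "phish"],
    }
)

# One random row per class; random_state makes the draw repeatable.
examples = toy_df.groupby("Email Type").sample(n=1, random_state=0)
print(examples)
```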


Show the number of non-null values in each column.

In [8]:
email_df.count()

Email Text    52298
Email Type    52366
dtype: int64
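
`count()` reports non-null values per column, so the gap between the two numbers above (52366 - 52298 = 68) is the number of rows with missing text. A short sketch of how the total row count, the per-class counts, and the missing-text count could be read separately, using made-up rows:

```python
import pandas as pd

# Made-up rows; one email has no text.
toy_df = pd.DataFrame(
    {
        "Email Text": ["hello", None, "free prize", "verify account"],
        "Email Type": ["ham", "ham", "spam", "phish"],
    }
)

rows_total = len(toy_df)                          # all rows, including NA text
per_class = toy_df["Email Type"].value_counts()   # rows per label
missing_text = toy_df["Email Text"].isna().sum()  # rows without text

print(rows_total, missing_text)
print(per_class)
```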

## 6. Drop NA and "empty" Rows

In [26]:
print(email_df.isna().sum())

email_df.dropna(inplace=True)

empty_rows = email_df[email_df["Email Text"] == "empty"]
email_df.drop(empty_rows.index, inplace=True)

print(email_df.isna().sum())

Email Text    68
Email Type     0
dtype: int64
Email Text    0
Email Type    0
dtype: int64
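
The exact-match test `== "empty"` misses variants such as `" Empty "` or whitespace-only text. Whether the real data contains such variants is not shown, so the following is only a sketch under that assumption, on made-up rows. It also resets the index, which keeps row labels contiguous and avoids the default-index requirement that older pandas releases imposed on `to_feather`.

```python
import pandas as pd

# Made-up rows covering NA, the "empty" placeholder, and blank variants.
toy_df = pd.DataFrame(
    {
        "Email Text": ["hello there", None, "empty", "  Empty ", "   ", "click here"],
        "Email Type": ["ham", "ham", "spam", "phish", "ham", "phish"],
    }
)

# Normalise only for the comparison; the stored text is left as-is.
text = toy_df["Email Text"].str.strip().str.lower()

# Keep rows whose text is present, non-blank, and not the "empty" placeholder.
keep = text.notna() & (text != "") & (text != "empty")
cleaned = toy_df[keep].reset_index(drop=True)
print(cleaned)
```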


## 7. Bulk Clean Dataset & Export to Feather


Clean the text by tokenizing and lemmatizing it and by removing stop words, punctuation, and special characters.

In [27]:
cleaned_email_df = email_df.copy()
cleaned_email_df["Cleaned Email Text"] = email_df["Email Text"].apply(clean_text)
cleaned_email_df.head()

|  | Email Text | Email Type | Cleaned Email Text |
|---|---|---|---|
| 1 | gary , production from the high island larger ... | ham | gary production high island larger block comme... |
| 2 | - calpine daily gas nomination 1 . doc | ham | calpine daily gas nomination doc |
| 3 | fyi - see note below - already done .\nstella\... | ham | fyi see note already done stella forwarded ste... |
| 4 | fyi .\n- - - - - - - - - - - - - - - - - - - -... | ham | fyi forwarded lauri allen hou ect pm kimberly ... |
| 5 | jackie ,\nsince the inlet to 3 river plant is ... | ham | jackie since inlet river plant shut last day f... |
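
The implementation of `utils.clean_text` is not shown in this notebook. As a rough illustration of the kind of steps described, here is a simplified sketch using only the standard library. The function name `clean_text_sketch` and the tiny hard-coded stopword set are assumptions: the set stands in for NLTK's English stopword list, and lemmatization is left out entirely.

```python
import html
import re

# Hypothetical stand-in for NLTK's English stopword list (far from complete).
STOPWORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "for", "from", "on"}


def clean_text_sketch(text: str) -> str:
    """Lowercase, strip HTML, drop non-letters, and remove stopwords."""
    text = html.unescape(text)              # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)    # remove HTML tags
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # keep letters and whitespace only
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(clean_text_sketch("<p>Gary , production from the High Island</p>"))
print(clean_text_sketch("- calpine daily gas nomination 1 . doc"))
```

Applied with `email_df["Email Text"].apply(clean_text_sketch)`, this would produce a column of the same shape as `Cleaned Email Text`, though the real function may differ in stopword coverage and lemmatization.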


Export the cleaned dataset to Feather.

In [28]:
cleaned_email_df.to_feather(
    "./data/2_clean_email_dataset.feather"
)