# Data Cleaning
---
Take the raw data from `./data/hamSpam.csv` and `./data/phish.csv`, conform the ham/spam dataset to the structure of the phishing dataset, and merge the two. Then clean the combined dataset by removing NA values, standardizing the text (lowercasing and removing stop words), and removing any embedded HTML elements from the text.

## 1. Imports

In [27]:
import pandas as pd
import csv
import re

import nltk
from nltk.tokenize import word_tokenize

## 2. Load Data from CSV

In [28]:
hamSpam_df = pd.read_csv("./data/hamSpam.csv")
phish_df = pd.read_csv("./data/phish.csv")
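
The column counts printed later in this notebook include `Unnamed: 0.1` and `Unnamed: 0`, which usually means a CSV was saved with its index and then re-read. The sketch below shows one way to trim such columns after loading. It assumes those columns carry no information needed downstream; the inline CSV text and the `trimmed` name are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV whose first column is an unnamed, saved index.
csv_text = (
    ",Email Text,Email Type\n"
    "0,hello there,Safe Email\n"
    "1,win a prize now,Phishing Email\n"
)

raw = pd.read_csv(io.StringIO(csv_text))

# pandas names a header-less column "Unnamed: <position>".
# Keep every column whose name does not start with that prefix.
trimmed = raw.loc[:, ~raw.columns.str.startswith("Unnamed")]

print(list(raw.columns))
print(list(trimmed.columns))
```

Dropping these columns before `dropna()` also means that a missing value in a leftover index column cannot cause an otherwise valid email to be discarded.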

## 3. Process Ham/Spam Data

In [29]:
# Rename known columns (no error if source names are missing)
hamSpam_df.rename(
    columns={"Spam/Ham": "Email Type", "Message": "Email Text"},
    inplace=True
)

# Drop optional columns only if they exist to avoid KeyError
cols_to_drop = [c for c in ["Date", "Subject"] if c in hamSpam_df.columns]

if cols_to_drop:
    hamSpam_df.drop(columns=cols_to_drop, inplace=True)

# Filter out emails that are not ham/spam
hamSpam_df = hamSpam_df[hamSpam_df['Email Type'].isin(['ham', 'spam'])]

# Export as processed dataset
hamSpam_df.to_csv(
    "data/hamSpam_processed.csv",
    index=False,
    quoting=csv.QUOTE_ALL,
    escapechar="\\"
)
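
Because `rename` silently ignores missing source names, a typo in a column name would go unnoticed until a later cell fails. The sketch below wraps the same rename, drop, and filter steps in a function that checks the resulting schema. The function name `conform_ham_spam`, the `REQUIRED_COLUMNS` list, and the toy data are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Assumed target schema, matching the phishing dataset's column names.
REQUIRED_COLUMNS = ["Email Text", "Email Type"]


def conform_ham_spam(df):
    """Return a renamed, filtered copy of df in the phishing-style schema."""
    out = df.rename(columns={"Spam/Ham": "Email Type", "Message": "Email Text"})
    # errors="ignore" skips columns that are not present.
    out = out.drop(columns=["Date", "Subject"], errors="ignore")

    missing = [c for c in REQUIRED_COLUMNS if c not in out.columns]
    if missing:
        raise KeyError(f"missing expected columns: {missing}")

    # Keep only rows labelled ham or spam.
    return out[out["Email Type"].isin(["ham", "spam"])].reset_index(drop=True)


toy = pd.DataFrame(
    {
        "Subject": ["hi", "win", "??"],
        "Message": ["see you at 5", "free prize", "n/a"],
        "Spam/Ham": ["ham", "spam", "unknown"],
        "Date": ["2001-01-01"] * 3,
    }
)
conformed = conform_ham_spam(toy)
print(conformed)
```

Returning a new frame instead of mutating in place makes the cell safe to re-run without restarting the kernel.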

## 4. Process Phishing Data

Observe email types.

In [30]:
phish_df['Email Type'].apply(repr).unique()

array(["'Safe Email'", "'Phishing Email'"], dtype=object)

Standardize email types.

In [31]:
phish_df['Email Type'] = phish_df['Email Type'].replace(
    {'Safe Email': 'ham', 'Phishing Email': 'phish'}
)
phish_df['Email Type'].apply(repr).unique()

array(["'ham'", "'phish'"], dtype=object)
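
`replace` leaves any label it does not recognize unchanged, so an unexpected value would pass through silently. A minimal sketch of a stricter mapping follows, assuming the two labels shown above are the only valid ones; `LABEL_MAP` and `standardize_labels` are hypothetical names.

```python
import pandas as pd

# Mapping taken from the labels observed above.
LABEL_MAP = {"Safe Email": "ham", "Phishing Email": "phish"}


def standardize_labels(labels):
    """Map raw labels to ham/phish, raising on any label not in LABEL_MAP."""
    unknown = set(labels.dropna().unique()) - set(LABEL_MAP)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return labels.map(LABEL_MAP)


example = standardize_labels(pd.Series(["Safe Email", "Phishing Email"]))
print(example.tolist())
```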

## 5. Combine Datasets

In [32]:
emailDataset = pd.concat([hamSpam_df, phish_df], ignore_index=True)
emailDataset['Email Type'].apply(repr).unique()

array(["'ham'", "'spam'", "'phish'"], dtype=object)

Show the number of non-null values in each column.

In [33]:
emailDataset.count()

Unnamed: 0.1    52366
Unnamed: 0      52366
Email Text      52298
Email Type      52366
dtype: int64
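
In the output above, `Email Text` has 52298 non-null values out of 52366 rows, so 68 emails have no text. The sketch below shows how to read that number directly and how to check the class balance, using a small made-up frame in place of `emailDataset`.

```python
import pandas as pd

# Toy stand-in for emailDataset.
toy = pd.DataFrame(
    {
        "Email Text": ["hello", None, "free prize", "empty"],
        "Email Type": ["ham", "ham", "spam", "phish"],
    }
)

# Missing values per column (the complement of DataFrame.count()).
missing_per_column = toy.isna().sum()

# Number of emails per label.
class_counts = toy["Email Type"].value_counts()

print(missing_per_column)
print(class_counts)
```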

## 6. Drop NA and "empty" Rows

In [34]:
emailDataset.dropna(inplace=True)

empty_rows = emailDataset[emailDataset["Email Text"] == "empty"]
emailDataset.drop(empty_rows.index, inplace=True)

emailDataset

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Email Text,Email Type
1,1,1,"gary , production from the high island larger ...",ham
2,2,2,- calpine daily gas nomination 1 . doc,ham
3,3,3,fyi - see note below - already done .\nstella\...,ham
4,4,4,fyi .\n- - - - - - - - - - - - - - - - - - - -...,ham
5,5,5,"jackie ,\nsince the inlet to 3 river plant is ...",ham
...,...,...,...,...
52360,18644,18645,\nRick Moen a Ã©crit:> > I'm confused. I thou...,ham
52361,18645,18646,date a lonely housewife always wanted to date ...,phish
52362,18646,18647,request submitted : access request for anita ....,ham
52363,18647,18648,"re : important - prc mtg hi dorn & john , as y...",ham
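
Calling `dropna()` without arguments removes a row if any column is missing, and the `== "empty"` test only matches that exact string. The sketch below is a slightly stricter variant, under the assumption that whitespace-only text and case variants of "empty" should also be discarded; `drop_unusable_rows` is a hypothetical helper.

```python
import pandas as pd


def drop_unusable_rows(df, text_col="Email Text"):
    """Drop rows whose text is missing, blank, or the placeholder 'empty'."""
    # Only the text column decides whether a row is dropped.
    out = df.dropna(subset=[text_col])
    stripped = out[text_col].astype(str).str.strip()
    # Assumption: placeholder matching is case-insensitive.
    keep = (stripped != "") & (stripped.str.lower() != "empty")
    return out[keep].reset_index(drop=True)


toy = pd.DataFrame(
    {
        "Email Text": ["hello", None, "   ", "empty", "Empty "],
        "Email Type": ["ham", "ham", "spam", "phish", "ham"],
    }
)
result = drop_unusable_rows(toy)
print(result)
```

The `reset_index(drop=True)` call renumbers the rows; without it the index keeps gaps where rows were removed, as in the output above.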


## 7. Bulk Clean Dataset & Export to CSV


In [35]:
# NLTK downloads
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("punkt_tab")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /home/prokope/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/prokope/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/prokope/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/prokope/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Clean the text by lowercasing it, removing URLs, email addresses, punctuation, stop words, and non-alphabetic tokens, and then lemmatizing the remaining tokens.

In [36]:
import string
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean_text(text):
    # 1. Lowercase
    text = text.lower()

    # 2. Remove URLs
    text = re.sub(r"http[s]?://\S+|www\.\S+", " ", text)

    # 3. Remove email addresses
    text = re.sub(r"\S+@\S+", " ", text)

    # 4. Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))

    # 5. Tokenize
    tokens = word_tokenize(text)

    # 6. Remove stop words + non-alpha words
    tokens = [w for w in tokens if w.isalpha() and w not in stop_words]

    # 7. Lemmatize (optional; may help the downstream SVM)
    tokens = [lemmatizer.lemmatize(w) for w in tokens]

    return " ".join(tokens)
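
`clean_text` has no explicit HTML step: punctuation removal deletes the angle brackets, but tag and attribute names can survive as ordinary words. The sketch below shows one way to strip markup before cleaning, using only the standard library. `strip_html` and `_TextExtractor` are hypothetical names, and skipping `<script>` and `<style>` content is an assumption about what counts as noise.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, ignoring tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(text):
    """Return text with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


print(strip_html("<p>Click <a href='http://x.test'>here</a> &amp; win</p>"))
```

If adopted, a call such as `text = strip_html(text)` would go at the top of `clean_text`, before the URL step, so that links inside `href` attributes are removed together with their tags.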

In [37]:
emailDataset["clean_text"] = emailDataset["Email Text"].apply(clean_text)
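
To see what the first four steps of `clean_text` do to a single message, the sketch below repeats them without the NLTK-dependent steps (tokenizing, stop-word removal, lemmatizing). The regular expressions are copied from the function above; the helper name and the sample string are made up.

```python
import re
import string


def strip_urls_emails_punct(text):
    """Apply steps 1 to 4 of clean_text, then collapse whitespace."""
    text = text.lower()
    text = re.sub(r"http[s]?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"\S+@\S+", " ", text)  # email addresses
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


sample = "Visit https://example.com/win NOW or mail bob@example.com, don't wait!"
print(strip_urls_emails_punct(sample))
# -> visit now or mail dont wait
```

The order matters: if punctuation were removed first, `://` and `@` would be gone and the URL and email patterns would no longer match. Contractions also lose their apostrophe here (`don't` becomes `dont`), so they may no longer match stop-word entries that contain one.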

Export to CSV.

In [38]:
emailDataset.to_csv(
    "data/email_dataset.csv",
    index=False,
)
emailDataset.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Email Text,Email Type,clean_text
1,1,1,"gary , production from the high island larger ...",ham,gary production high island larger block comme...
2,2,2,- calpine daily gas nomination 1 . doc,ham,calpine daily gas nomination doc
3,3,3,fyi - see note below - already done .\nstella\...,ham,fyi see note already done stella forwarded ste...
4,4,4,fyi .\n- - - - - - - - - - - - - - - - - - - -...,ham,fyi forwarded lauri allen hou ect pm kimberly ...
5,5,5,"jackie ,\nsince the inlet to 3 river plant is ...",ham,jackie since inlet river plant shut last day f...
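
One thing to watch when the exported file is read back: an email that consists only of stop words or non-alphabetic tokens ends up with an empty `clean_text`, which `read_csv` interprets as NaN by default. The round trip below illustrates this with an in-memory buffer and made-up rows.

```python
import io

import pandas as pd

# Toy frame: the first row cleaned down to an empty string.
toy = pd.DataFrame(
    {
        "Email Text": ["the and of", "free prize"],
        "clean_text": ["", "free prize"],
    }
)

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing turns the empty field into NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps empty fields as empty strings.
buffer.seek(0)
safe_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["clean_text"].isna().sum())
print(safe_read["clean_text"].tolist())
```

Depending on the downstream model, it may be simpler to drop rows with an empty `clean_text` before exporting.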


In [39]:
# Optional summary statistics (disabled): work on a copy of the dataset
# summary_df = emailDataset.copy()

# # Text length metrics
# summary_df["char_count"] = summary_df["Email Text"].str.len()
# summary_df["word_count"] = summary_df["Email Text"].apply(lambda x: len(x.split()))

# # Count URLs
# url_pattern = r"http[s]?://\S+|www\.\S+"
# summary_df["url_count"] = summary_df["Email Text"].apply(lambda x: len(re.findall(url_pattern, x)))

# # Count special characters
# summary_df["special_chars"] = summary_df["Email Text"].apply(lambda x: sum(not c.isalnum() and not c.isspace() for c in x))

# # Count uppercase words
# summary_df["uppercase_words"] = summary_df["Email Text"].apply(lambda x: len([w for w in x.split() if w.isupper()]))

# # Save the DataFrame to a JSON file
# summary_df[[
#     "char_count",
#     "word_count",
#     "url_count",
#     "special_chars",
#     "uppercase_words",
# ]].to_json('./data/email_dataset_summary_stats.json', orient='records')
