# Data Collection
---
Ham and spam data is downloaded from [meruvulikith/190k-spam-ham-email-dataset-for-classification](https://www.kaggle.com/datasets/meruvulikith/190k-spam-ham-email-dataset-for-classification) and the phishing data is downloaded from [naserabdullahalam/phishing-email-dataset](https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset).

## 1. Imports

In [1]:
import pandas as pd
import kagglehub
import os
from pathlib import Path

## 2. Download the Data from Kaggle
The datasets are downloaded with `kagglehub` and stored in `/home/<user>/.cache/kagglehub/datasets/...` (Unix/Linux path). The path may differ on a Windows system.
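The cache location can also be built in a platform-independent way with `pathlib` rather than string formatting. The sketch below assumes the default `.cache/kagglehub/datasets/<owner>/<dataset>/versions/<n>` layout described here. `cached_dataset_dir` and `ensure_dataset` are hypothetical helpers, not part of `kagglehub`, and the `download` argument stands in for `kagglehub.dataset_download`.

```python
from pathlib import Path


def cached_dataset_dir(handle, version=1, home=None):
    """Return the expected kagglehub cache directory for a dataset handle.

    Assumes the default cache layout; `handle` is "<owner>/<dataset>".
    """
    base = Path(home) if home is not None else Path.home()
    owner, name = handle.split("/", 1)
    return base / ".cache" / "kagglehub" / "datasets" / owner / name / "versions" / str(version)


def ensure_dataset(handle, download, version=1, home=None):
    """Return the cached directory if it exists, otherwise call `download(handle)`.

    `download` is expected to behave like kagglehub.dataset_download:
    it takes the handle and returns the local directory as a string.
    """
    local_dir = cached_dataset_dir(handle, version=version, home=home)
    if local_dir.exists():
        return local_dir
    return Path(download(handle))
```

In this notebook the call would look like `ensure_dataset(from_source, kagglehub.dataset_download)`, which avoids repeating the dataset handle inside a hard-coded path.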

Create the `data` folder if it does not already exist.

In [3]:
Path("../data").mkdir(parents=True, exist_ok=True)

Download ham and spam data.

Alternative datasets:
- marcelwiechmann/enron-spam-data
- jackksoncsie/spam-email-dataset
- subhajournal/phishingemails

In [None]:
from_source = "meruvulikith/190k-spam-ham-email-dataset-for-classification"
full_local_path = f"{os.path.expanduser('~')}/.cache/kagglehub/datasets/meruvulikith/190k-spam-ham-email-dataset-for-classification/versions/1/"

if not Path(full_local_path).exists():
    # If the file doesn't exist, download it
    full_local_path = kagglehub.dataset_download(from_source)

hamSpam_df = pd.read_csv(f"{full_local_path}/spam_Emails_data.csv")
hamSpam_df.to_csv("../data/1_hamSpam.csv", index=False)
hamSpam_df.head(5)

Unnamed: 0,label,text
0,Spam,viiiiiiagraaaa\nonly for the ones that want to...
1,Ham,got ice thought look az original message ice o...
2,Spam,yo ur wom an ne eds an escapenumber in ch ma n...
3,Spam,start increasing your odds of success & live s...
4,Ham,author jra date escapenumber escapenumber esca...


In [6]:
hamSpam_df["label"].value_counts()

label
Ham     102160
Spam     91692
Name: count, dtype: int64

Count the true duplicates (rows identical in every column), grouped by label.

In [7]:
hamSpam_df[hamSpam_df.duplicated()]["label"].value_counts()

Series([], Name: count, dtype: int64)
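`DataFrame.duplicated()` with no arguments marks a row only when every column matches an earlier row, which is the sense of "true duplicates" used here. A minimal sketch on made-up rows (illustrative data, not taken from the dataset) shows how that differs from checking the `text` column alone:

```python
import pandas as pd

# Toy rows: the same text appears three times, twice with the same label.
toy = pd.DataFrame({
    "label": ["Spam", "Spam", "Ham", "Spam"],
    "text": ["win cash now", "win cash now", "win cash now", "meeting at noon"],
})

full_dupes = toy.duplicated()                 # every column must match an earlier row
text_dupes = toy.duplicated(subset=["text"])  # only the text must match an earlier row

print(full_dupes.sum())  # 1
print(text_dupes.sum())  # 2
```

An empty result from the full-row check therefore does not rule out repeated email text that carries different labels.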

Download phishing data.

In [None]:
from_source = "naserabdullahalam/phishing-email-dataset"
full_local_path = f"{os.path.expanduser('~')}/.cache/kagglehub/datasets/{from_source}/versions/1/"

if not Path(full_local_path).exists():
    # If the dataset isn't cached locally, download it and use the returned path
    full_local_path = kagglehub.dataset_download(from_source)

# Define the list of files to read
file_names = ["Ling.csv", "Nazario.csv", "Nigerian_Fraud.csv", "CEAS_08.csv"]

# List to hold individual dataframes
dataframes = []

# Loop through each file, read it, and append it to the list
for file_name in file_names:
    file_path = os.path.join(full_local_path, file_name)
    try:
        df = pd.read_csv(file_path)
        dataframes.append(df)
        print(f"Successfully loaded: {file_name}")
    except FileNotFoundError:
        print(f"Error: File not found at {file_path}")
    except Exception as e:
        print(f"An error occurred while reading {file_name}: {e}")

phish_df = pd.concat(dataframes, ignore_index=True)
phish_df = phish_df[phish_df["label"] == 1]
phish_df.to_csv("../data/1_phish.csv", index=False)
phish_df.head(5)

Successfully loaded: Ling.csv
Successfully loaded: Nazario.csv
Successfully loaded: Nigerian_Fraud.csv
Successfully loaded: CEAS_08.csv


Unnamed: 0,subject,body,label,sender,receiver,date,urls
21,free,this is a multi-part message in mime format . ...,1,,,,
38,the internet success toolbox,note : we do not wish to send e-mail to anyone...,1,,,,
84,free stealth 3 . 0 bulk email software . . .,"just released . . . 30 , 000 , 000 email addre...",1,,,,
85,need more money ?,"hi , would you like to earn an extra $ 700 a w...",1,,,,
86,cable decsrambler now only $ 6 . 99 !,this is really cool ! premium channels and pay...,1,,,,


In [9]:
print("Number of true duplicates: ", phish_df[phish_df.duplicated()]["label"].value_counts())

Number of true duplicates:  Series([], Name: count, dtype: int64)


Show the columns of each dataset.

In [10]:
list(hamSpam_df.columns), list(phish_df.columns)

(['label', 'text'],
 ['subject', 'body', 'label', 'sender', 'receiver', 'date', 'urls'])

Show the shape (rows, columns) of each dataset.

In [11]:
hamSpam_df.shape, phish_df.shape

((193852, 2), (27197, 7))

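The two sources do not share a schema at initial load: the ham/spam data has only `label` and `text`, while the phishing data has `subject`, `body`, `label`, `sender`, `receiver`, `date`, and `urls`. The data must be cleaned and combined into a common dataframe structure. Continue to step 2.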
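As a rough sketch of what a common structure could look like, the example below maps the phishing columns onto the `label`/`text` layout of the ham/spam data. Joining `subject` and `body` into `text`, the `"Phish"` label value, and the toy rows are all assumptions made for illustration; step 2 may make different choices.

```python
import pandas as pd

# Toy stand-ins for the two loaded dataframes (not real dataset rows).
ham_spam = pd.DataFrame({
    "label": ["Ham", "Spam"],
    "text": ["see you at noon", "win cash now"],
})
phish = pd.DataFrame({
    "subject": ["account locked", None],
    "body": ["verify your password here", "claim your refund"],
    "label": [1, 1],
})

# Join subject and body into one text field, treating missing values as empty.
phish_text = (phish["subject"].fillna("") + " " + phish["body"].fillna("")).str.strip()

# Assumed label name for phishing rows; the scalar is broadcast to every row.
phish_common = pd.DataFrame({"label": "Phish", "text": phish_text})

combined = pd.concat([ham_spam, phish_common], ignore_index=True)
print(combined.shape)  # (4, 2)
```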