### Phishing Email Detection

In this notebook, we will collect and preprocess data from publicly available Kaggle datasets focused on **Phishing Email Detection** based on the email body text. To create a more comprehensive dataset, we will combine the following sources:

1. [Phishing Emails Dataset by Subhajournal](https://www.kaggle.com/datasets/subhajournal/phishingemails)  
2. [Phishing Email Dataset by Naser Abdullah Alam](https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset)

The merged dataset will then be cleaned, normalized, and saved for further analysis.  

In the next code cell, we will install the `kaggle` package, which provides the command-line client used to download the datasets.


In [1]:
!pip install kaggle -q

In the following code cell, we will import all the necessary packages that will be used throughout this notebook.


In [2]:
import os
import json
import pandas as pd
import re

print("Pandas: ", pd.__version__)

Pandas:  2.2.2


### Datasets

The two datasets listed at the top of this notebook will be downloaded from Kaggle using the **Kaggle API**. The next code cell reads the API credentials from a local `kaggle.json` file, exports them as environment variables, and then downloads and unzips both datasets.


In [3]:
# Load the Kaggle API credentials and expose them to the Kaggle CLI.
with open('kaggle.json', 'r') as reader:
    keys = json.load(reader)

os.environ['KAGGLE_USERNAME'] = keys['username']
os.environ['KAGGLE_KEY'] = keys['key']

!kaggle datasets download subhajournal/phishingemails --unzip
!kaggle datasets download naserabdullahalam/phishing-email-dataset --unzip

Dataset URL: https://www.kaggle.com/datasets/subhajournal/phishingemails
License(s): GNU Lesser General Public License 3.0
Downloading phishingemails.zip to /content
  0% 0.00/18.0M [00:00<?, ?B/s]
100% 18.0M/18.0M [00:00<00:00, 1.41GB/s]
Dataset URL: https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset
License(s): CC-BY-SA-4.0
Downloading phishing-email-dataset.zip to /content
  0% 0.00/77.1M [00:00<?, ?B/s]
100% 77.1M/77.1M [00:00<00:00, 1.28GB/s]
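The credential-loading step can also be wrapped in a small helper that fails early with a clear message when a field is missing. This is only a minimal sketch: the function name `load_kaggle_credentials` is hypothetical, and it assumes the token file holds the same `username` and `key` fields read in the cell above.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(path: str = "kaggle.json") -> dict:
    """Read a Kaggle token file and export it as environment variables.

    Hypothetical helper, not part of the Kaggle package.
    """
    keys = json.loads(Path(path).read_text())
    # Fail early if the token file does not have the expected fields.
    missing = {"username", "key"} - keys.keys()
    if missing:
        raise KeyError(f"{path} is missing fields: {sorted(missing)}")
    os.environ["KAGGLE_USERNAME"] = keys["username"]
    os.environ["KAGGLE_KEY"] = keys["key"]
    return keys
```

Calling `load_kaggle_credentials()` before the `!kaggle datasets download ...` commands would have the same effect as the inline code.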


### Phishing Email Detection

The first dataset used in this task was downloaded from [this Kaggle source](https://www.kaggle.com/datasets/subhajournal/phishingemails). The dataset file is named **`Phishing_Email.csv`**, which contains labeled email samples for phishing detection.


In [4]:
df1 = pd.read_csv('Phishing_Email.csv')
df1.head(2)

Unnamed: 0.1,Unnamed: 0,Email Text,Email Type
0,0,"re : 6 . 1100 , disc : uniformitarianism , re ...",Safe Email
1,1,the other side of * galicismos * * galicismo *...,Safe Email


In the following code cell, we drop the `Unnamed: 0` column from the `df1` dataframe. It appears to duplicate the row index and is not needed for this task.

In [5]:
df1.drop(columns=["Unnamed: 0"], inplace=True)
df1.head(2)

Unnamed: 0,Email Text,Email Type
0,"re : 6 . 1100 , disc : uniformitarianism , re ...",Safe Email
1,the other side of * galicismos * * galicismo *...,Safe Email
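When a CSV carries more than one saved index column, the same cleanup can be done by name prefix rather than by listing each column. The sketch below uses a made-up toy frame in place of `Phishing_Email.csv`; the values are illustrative only.

```python
import pandas as pd

# Toy frame standing in for Phishing_Email.csv (values are made up).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Email Text": ["hello team", "click here now"],
    "Email Type": ["Safe Email", "Phishing Email"],
})

# Collect every column whose name starts with "Unnamed" and drop them all.
unnamed = [col for col in toy.columns if col.startswith("Unnamed")]
toy = toy.drop(columns=unnamed)

print(list(toy.columns))
```

Assigning the result back instead of using `inplace=True` keeps the original frame available if the cell is re-run.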


In the following code cell, we will rename the dataset columns to ensure consistency across all data sources. Specifically, we will standardize the columns to **`body`** (containing the email content) and **`label`** (indicating whether the email is **`legitimate`** or **`phishing`**).


In [6]:
df1.rename(columns={"Email Text": "body", "Email Type": "label"}, inplace=True)
df1.head(2)

Unnamed: 0,body,label
0,"re : 6 . 1100 , disc : uniformitarianism , re ...",Safe Email
1,the other side of * galicismos * * galicismo *...,Safe Email


Before mapping the label values to `legitimate` and `phishing`, we first drop all rows that contain null values.

In [7]:
# Drop rows with missing values, if there are any, and renumber the index.
if df1.isna().any().any():
    df1.dropna(inplace=True)
    df1.reset_index(drop=True, inplace=True)
df1.isna().any()

Unnamed: 0,0
body,False
label,False


In [8]:
df1['label'] = df1['label'].apply(lambda x: 'phishing' if x == 'Phishing Email' else 'legitimate')
df1.head(2)

Unnamed: 0,body,label
0,"re : 6 . 1100 , disc : uniformitarianism , re ...",legitimate
1,the other side of * galicismos * * galicismo *...,legitimate
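The lambda above labels every value other than `Phishing Email` as `legitimate`, so an unexpected spelling would be silently treated as legitimate. One alternative, shown here as a sketch on a toy frame, is an explicit dictionary with `Series.map`, which leaves unknown values as `NaN` so they can be inspected. The name `LABEL_MAP` is introduced for illustration.

```python
import pandas as pd

# Explicit mapping from the raw labels to the standardized ones.
LABEL_MAP = {"Safe Email": "legitimate", "Phishing Email": "phishing"}

# Toy labels; the last one is deliberately misspelled.
toy = pd.DataFrame({"label": ["Safe Email", "Phishing Email", "safe email"]})

mapped = toy["label"].map(LABEL_MAP)

# Values missing from LABEL_MAP come back as NaN and can be listed.
unexpected = toy.loc[mapped.isna(), "label"].unique().tolist()
print(unexpected)
```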


### Phishing Email Dataset

The second dataset used in this task was obtained from [this Kaggle source](https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset). From this dataset, we will use the following files (`Ling.csv` is also part of the dataset but is not loaded in the cells below):

```
CEAS_08.csv
Enron.csv
Nazario.csv
Nigerian_Fraud.csv
SpamAssasin.csv
```

These individual data files will be read and combined into a single dataframe for phishing email detection. The dataset can be cited as follows:

> *Al-Subaiey, A., Al-Thani, M., Alam, N. A., Antora, K. F., Khandakar, A., & Zaman, S. A. U. (2024, May 19). Novel Interpretable and Robust Web-based AI Platform for Phishing Email Detection. ArXiv.org. [https://arxiv.org/abs/2405.11619](https://arxiv.org/abs/2405.11619)*


In [9]:
df2 = pd.read_csv('CEAS_08.csv')[['body', 'label']]
df2.head(2)

Unnamed: 0,body,label
0,"Buck up, your troubles caused by small dimensi...",1
1,\nUpgrade your sex and pleasures with these te...,1


In [10]:
df3 = pd.read_csv('Enron.csv')[['body', 'label']]
df3.head(2)

Unnamed: 0,body,label
0,( see attached file : hplno 525 . xls )\r\n- h...,0
1,- - - - - - - - - - - - - - - - - - - - - - fo...,0


In [11]:
df4 = pd.read_csv('Nazario.csv')[['body', 'label']]
df4.head(2)

Unnamed: 0,body,label
0,This text is part of the internal format of yo...,1
1,Business with \t\t\t\t\t\t\t\tcPanel & WHM \t...,1


In [12]:
df5 = pd.read_csv('Nigerian_Fraud.csv')[['body', 'label']]
df5.head(2)

Unnamed: 0,body,label
0,FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2...,1
1,"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",1


In [13]:
df6 = pd.read_csv('SpamAssasin.csv')[['body', 'label']]
df6.head(2)

Unnamed: 0,body,label
0,"Date: Wed, 21 Aug 2002 10:54:46 -0500 ...",0
1,"Martin A posted:\nTassos Papadopoulos, the Gre...",0
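The five reads above follow the same pattern, so they could also be expressed as a loop. The helper below is a sketch under that assumption; `load_body_label` is a hypothetical name, and it expects every file to contain `body` and `label` columns.

```python
import pandas as pd


def load_body_label(paths):
    """Read each CSV, keep the `body` and `label` columns, and stack the rows."""
    frames = [
        # usecols skips unused columns; the indexing fixes the column order.
        pd.read_csv(path, usecols=["body", "label"])[["body", "label"]]
        for path in paths
    ]
    # ignore_index gives the combined frame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)


# Example call (file names as used in this notebook):
# combined = load_body_label(
#     ["CEAS_08.csv", "Enron.csv", "Nazario.csv", "Nigerian_Fraud.csv", "SpamAssasin.csv"]
# )
```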


In the following code cell, we will merge the dataframes **`df2`**, **`df3`**, **`df4`**, **`df5`**, and **`df6`** into a single dataframe. After merging, we will update the **`label`** column so that all instances of `0` are replaced with **`legitimate`** and all instances of `1` are replaced with **`phishing`**.


In [14]:
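# ignore_index=True renumbers the rows so the combined frame has a unique index.
df7 = pd.concat([df2, df3, df4, df5, df6], ignore_index=True)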
df7.head(2)

Unnamed: 0,body,label
0,"Buck up, your troubles caused by small dimensi...",1
1,\nUpgrade your sex and pleasures with these te...,1


Before converting the label values in **`df7`** to categorical format, we will first remove all rows containing null or missing values to ensure data consistency and reliability.


In [15]:
# Drop rows with missing values, if there are any, and renumber the index.
if df7.isna().any().any():
    df7.dropna(inplace=True)
    df7.reset_index(drop=True, inplace=True)
df7.isna().any()

Unnamed: 0,0
body,False
label,False
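`dropna` leaves a frame unchanged when there is nothing to drop, so the `if` guard is optional. The sketch below, on a made-up toy frame, drops nulls unconditionally and also reports how many rows are lost, which the cell above does not show.

```python
import pandas as pd

# Toy frame with one missing body and one missing label (values are made up).
toy = pd.DataFrame({"body": ["hi there", None, "special offer"], "label": [0, 1, None]})

# Null count per column before cleaning.
null_counts = toy.isna().sum()
print(null_counts)

# Drop rows missing either field and renumber the index.
cleaned = toy.dropna(subset=["body", "label"]).reset_index(drop=True)

rows_removed = len(toy) - len(cleaned)
print(rows_removed)
```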


In [16]:
df7['label'] = df7['label'].apply(lambda x: 'phishing' if x == 1 else 'legitimate')
df7.head(2)

Unnamed: 0,body,label
0,"Buck up, your troubles caused by small dimensi...",phishing
1,\nUpgrade your sex and pleasures with these te...,phishing


In the following code cell, we will join **`df7`** and **`df1`** into a single object named **`dataframe`**. This combined dataset will contain two columns: **`body`**, representing the email content, and **`label`**, representing the categorical label indicating whether an email is **legitimate** or **phishing**.


In [17]:
# ignore_index=True avoids duplicate index values coming from the two frames.
dataframe = pd.concat([df7, df1], ignore_index=True)
dataframe.head(2)

Unnamed: 0,body,label
0,"Buck up, your troubles caused by small dimensi...",phishing
1,\nUpgrade your sex and pleasures with these te...,phishing
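An optional check at this point, which is not part of the original notebook, is to look for repeated emails and at the class balance. Whether the two sources actually overlap is an assumption here and is not verified. The sketch uses a made-up toy frame.

```python
import pandas as pd

# Toy frame in which the first and last rows repeat the same body.
toy = pd.DataFrame({
    "body": ["win a prize", "meeting at noon", "win a prize"],
    "label": ["phishing", "legitimate", "phishing"],
})

# Count rows whose body already appeared earlier in the frame.
n_dupes = int(toy.duplicated(subset="body").sum())

# Keep the first occurrence of each body and renumber the index.
deduped = toy.drop_duplicates(subset="body").reset_index(drop=True)

print(n_dupes)
print(deduped["label"].value_counts())
```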


### Data Cleaning

In the following code cell, we will clean the **`body`** column of the dataframe. The text is lowercased, and URLs, email addresses, `@` and `#` tokens, digits, and punctuation other than apostrophes are replaced with spaces. Runs of whitespace are then collapsed into a single space. The cleaned email body is the feature we will use for phishing detection.


In [18]:
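# Patterns for the elements that are stripped from each email body.
MENTION_HASHTAG_RE = re.compile(r"(@|#)([A-Za-z0-9]+)")
EMAIL_RE = re.compile(r"([A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+")
URL_RE = re.compile(r"https?\S+", re.MULTILINE)
DIGIT_RE = re.compile(r"\d")
PUNCT_RE = re.compile(r"[^\w\s\']")  # everything except word characters, whitespace, and apostrophes
SPACE_RE = re.compile(r"\s+")

def clean_sentence(sent: str, lower: bool = True) -> str:
    """Normalize an email body: optionally lowercase it and strip noisy tokens."""
    if lower:
        sent = sent.lower()
    # Remove email addresses and URLs first. If the mention pattern ran first,
    # it would consume the "@domain" part and leave fragments of the address behind.
    sent = EMAIL_RE.sub(" ", sent)
    sent = URL_RE.sub(" ", sent)
    sent = MENTION_HASHTAG_RE.sub(" ", sent)
    sent = DIGIT_RE.sub(" ", sent)
    sent = PUNCT_RE.sub(" ", sent)
    # Collapse the whitespace left behind by the substitutions.
    sent = SPACE_RE.sub(" ", sent).strip()
    return sent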

dataframe['body'] = dataframe.body.apply(clean_sentence)
dataframe.head(2)

Unnamed: 0,body,label
0,buck up your troubles caused by small dimensio...,phishing
1,upgrade your sex and pleasures with these tech...,phishing
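To see what each pattern removes, the sketch below applies the patterns one at a time, each to its own short made-up sample. The helper `strip_pattern` and the sample strings are introduced here for illustration and are not part of the notebook.

```python
import re

# The same patterns as in the cleaning cell, keyed by what they target.
PATTERNS = {
    "mention/hashtag": re.compile(r"(@|#)([A-Za-z0-9]+)"),
    "email": re.compile(r"([A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+"),
    "url": re.compile(r"https?\S+", re.MULTILINE),
    "digit": re.compile(r"\d"),
    "punctuation": re.compile(r"[^\w\s\']"),
}
SPACE_RE = re.compile(r"\s+")

# One made-up sample per pattern.
SAMPLES = {
    "mention/hashtag": "follow @support #urgent",
    "email": "contact john.doe@example.com now",
    "url": "visit https://example.com/login today",
    "digit": "win 1000 dollars",
    "punctuation": "act now!!! don't wait...",
}


def strip_pattern(pattern, text):
    """Replace matches with a space, then collapse whitespace."""
    return SPACE_RE.sub(" ", pattern.sub(" ", text)).strip()


RESULTS = {name: strip_pattern(PATTERNS[name], SAMPLES[name]) for name in PATTERNS}
for name, cleaned in RESULTS.items():
    print(f"{name:16s} {SAMPLES[name]!r} -> {cleaned!r}")
```

Note that the punctuation pattern keeps apostrophes, so contractions such as `don't` survive cleaning.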


Finally, we can save our combined and cleaned dataframe to a **`.csv`** file. This file will be used for further analysis and to train a phishing detection model in the subsequent notebooks.


In [19]:
dataframe.to_csv("clean_phishing_email.csv", index=False)
print("Done!")

Done!
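One thing to keep in mind when the saved file is loaded later: a body made up only of removed characters becomes an empty string after cleaning, and `pd.read_csv` reads empty fields back as `NaN` by default. The sketch below demonstrates this on a made-up toy frame written to a temporary file, and shows one way to filter such rows before saving.

```python
import os
import tempfile

import pandas as pd

# Toy frame in which the second body is empty after cleaning.
toy = pd.DataFrame({
    "body": ["verify your account", ""],
    "label": ["phishing", "legitimate"],
})

# Round-trip through a temporary CSV file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The empty string is read back as a missing value.
n_missing = int(reloaded["body"].isna().sum())
print(n_missing)

# Keeping only non-empty bodies before saving avoids the surprise.
non_empty = toy[toy["body"].str.len() > 0].reset_index(drop=True)
print(len(non_empty))
```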
