# **Phishing Emails Analysis.**
"Phishing is the practice of sending fraudulent communications that appear to come from a legitimate and reputable source, usually through email and text messaging. The attacker's goal is to steal money, gain access to sensitive data and login information, or to install malware on the victim's device. Phishing is a dangerous, damaging, and an increasingly common type of cyberattack."

[Source](https://www.cisco.com/c/en/us/products/security/email-security/what-is-phishing.html)


This project involves cleaning data collected on emails classified as 'phishing'.

Data source: https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset

**Reference**

*Al-Subaiey, A., Al-Thani, M., Alam, N. A., Antora, K. F., Khandakar, A., & Zaman, S. A. U. (2024, May 19). Novel Interpretable and Robust Web-based AI Platform for Phishing Email Detection. ArXiv.org. https://arxiv.org/abs/2405.11619*


# **Loading Data.**

In [None]:
# Importing libraries
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Mounting drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Loading data
df = pd.read_csv("/content/drive/MyDrive/Nigerian_Fraud.csv")
df.head()

Unnamed: 0,sender,receiver,date,subject,body,urls,label
0,MR. JAMES NGOLA. <james_ngola2002@maktoob.com>,webmaster@aclweb.org,"Thu, 31 Oct 2002 02:38:20 +0000",URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2...,0,1
1,Mr. Ben Suleman <bensul2004nng@spinfinder.com>,R@M,"Thu, 31 Oct 2002 05:10:00 -0000",URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",0,1
2,PRINCE OBONG ELEME <obong_715@epatra.com>,webmaster@aclweb.org,"Thu, 31 Oct 2002 22:17:55 +0100",GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,0,1
3,PRINCE OBONG ELEME <obong_715@epatra.com>,webmaster@aclweb.org,"Thu, 31 Oct 2002 22:44:20 -0000",GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,0,1
4,Maryam Abacha <m_abacha03@www.com>,R@M,"Fri, 01 Nov 2002 01:45:04 +0100",I Need Your Assistance.,"Dear sir, \n \nIt is with a heart full of hope...",0,1


In [None]:
# Renaming columns
df.columns = ['sender_information', 'receiver_email', 'date_and_time', 'subject', 'body', 'urls', 'label']
df.head()

Unnamed: 0,sender_information,receiver_email,date_and_time,subject,body,urls,label
0,MR. JAMES NGOLA. <james_ngola2002@maktoob.com>,webmaster@aclweb.org,"Thu, 31 Oct 2002 02:38:20 +0000",URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2...,0,1
1,Mr. Ben Suleman <bensul2004nng@spinfinder.com>,R@M,"Thu, 31 Oct 2002 05:10:00 -0000",URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",0,1
2,PRINCE OBONG ELEME <obong_715@epatra.com>,webmaster@aclweb.org,"Thu, 31 Oct 2002 22:17:55 +0100",GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,0,1
3,PRINCE OBONG ELEME <obong_715@epatra.com>,webmaster@aclweb.org,"Thu, 31 Oct 2002 22:44:20 -0000",GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,0,1
4,Maryam Abacha <m_abacha03@www.com>,R@M,"Fri, 01 Nov 2002 01:45:04 +0100",I Need Your Assistance.,"Dear sir, \n \nIt is with a heart full of hope...",0,1


## **Data Cleaning.**
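The `sender_information` column combines the sender's display name and email address in one string (for example, `Maryam Abacha <m_abacha03@www.com>`), so it is split into separate `sender_email` and `sender_name` columns.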

In [None]:
# Convert the 'sender_information' column to string type
df['sender_information'] = df['sender_information'].astype(str)

# Regular expressions to extract the email address and the display name
email_pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
name_pattern = re.compile(r'([a-zA-Z\s\.\-]+)\s*<')

# Function to extract email
def extract_email(sender):
    match = email_pattern.search(sender)
    return match.group(0) if match else sender

# Function to extract name
def extract_name(sender):
    match = name_pattern.search(sender)
    return match.group(1).strip() if match else 'Unknown'

# Apply the functions to the 'sender_information' column
df['sender_email'] = df['sender_information'].apply(extract_email)
df['sender_name'] = df['sender_information'].apply(extract_name)

# Display the cleaned DataFrame
df.head()

Unnamed: 0,sender_information,receiver_email,date_and_time,subject,body,urls,label,sender_email,sender_name
0,MR. JAMES NGOLA. <james_ngola2002@maktoob.com>,webmaster@aclweb.org,"Thu, 31 Oct 2002 02:38:20 +0000",URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2...,0,1,james_ngola2002@maktoob.com,MR. JAMES NGOLA.
1,Mr. Ben Suleman <bensul2004nng@spinfinder.com>,R@M,"Thu, 31 Oct 2002 05:10:00 -0000",URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",0,1,bensul2004nng@spinfinder.com,Mr. Ben Suleman
2,PRINCE OBONG ELEME <obong_715@epatra.com>,webmaster@aclweb.org,"Thu, 31 Oct 2002 22:17:55 +0100",GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,0,1,obong_715@epatra.com,PRINCE OBONG ELEME
3,PRINCE OBONG ELEME <obong_715@epatra.com>,webmaster@aclweb.org,"Thu, 31 Oct 2002 22:44:20 -0000",GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,0,1,obong_715@epatra.com,PRINCE OBONG ELEME
4,Maryam Abacha <m_abacha03@www.com>,R@M,"Fri, 01 Nov 2002 01:45:04 +0100",I Need Your Assistance.,"Dear sir, \n \nIt is with a heart full of hope...",0,1,m_abacha03@www.com,Maryam Abacha
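
To see how the two patterns behave, here is a small standalone sketch that reuses the same regular expressions on a few sample strings. The last two inputs are hypothetical edge cases, not rows from the dataset: a bare address has no display name, and the string `nan` is what a missing sender can turn into after `astype(str)` in pandas versions that stringify missing values.

```python
import re

# Same patterns as in the cleaning cell above.
email_pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
name_pattern = re.compile(r'([a-zA-Z\s\.\-]+)\s*<')

def extract_email(sender):
    # Return the first email-like substring, or the raw input if none is found.
    match = email_pattern.search(sender)
    return match.group(0) if match else sender

def extract_name(sender):
    # Return the text in front of '<', or 'Unknown' if there is no '<'.
    match = name_pattern.search(sender)
    return match.group(1).strip() if match else 'Unknown'

samples = [
    "Maryam Abacha <m_abacha03@www.com>",  # row from the dataset
    "someone@example.com",                 # hypothetical: address only
    "nan",                                 # hypothetical: stringified missing value
]

# Collect (name, email) pairs so the behaviour is easy to inspect.
results = [(extract_name(s), extract_email(s)) for s in samples]
for sample, result in zip(samples, results):
    print(sample, '->', result)
```

Because `extract_email` falls back to the raw input, a row whose sender was missing could carry the text `nan` in `sender_email` rather than a true missing value. A zero count from `isnull()` therefore does not by itself show that every row has a valid address.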


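The `date_and_time` column stores the date, time, and UTC offset in a single string, so it is parsed into a UTC timestamp (`datetime`) from which separate `date` and `time` columns are derived.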

In [None]:
df['datetime'] = pd.to_datetime(df['date_and_time'], utc=True)
df['date'] = df['datetime'].dt.date
df['time'] = df['datetime'].dt.time
df.head()

Unnamed: 0,sender_information,receiver_email,date_and_time,subject,body,urls,label,sender_email,sender_name,datetime,date,time
0,MR. JAMES NGOLA. <james_ngola2002@maktoob.com>,webmaster@aclweb.org,"Thu, 31 Oct 2002 02:38:20 +0000",URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2...,0,1,james_ngola2002@maktoob.com,MR. JAMES NGOLA.,2002-10-31 02:38:20+00:00,2002-10-31,02:38:20
1,Mr. Ben Suleman <bensul2004nng@spinfinder.com>,R@M,"Thu, 31 Oct 2002 05:10:00 -0000",URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",0,1,bensul2004nng@spinfinder.com,Mr. Ben Suleman,2002-10-31 05:10:00+00:00,2002-10-31,05:10:00
2,PRINCE OBONG ELEME <obong_715@epatra.com>,webmaster@aclweb.org,"Thu, 31 Oct 2002 22:17:55 +0100",GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,0,1,obong_715@epatra.com,PRINCE OBONG ELEME,2002-10-31 21:17:55+00:00,2002-10-31,21:17:55
3,PRINCE OBONG ELEME <obong_715@epatra.com>,webmaster@aclweb.org,"Thu, 31 Oct 2002 22:44:20 -0000",GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,0,1,obong_715@epatra.com,PRINCE OBONG ELEME,2002-10-31 22:44:20+00:00,2002-10-31,22:44:20
4,Maryam Abacha <m_abacha03@www.com>,R@M,"Fri, 01 Nov 2002 01:45:04 +0100",I Need Your Assistance.,"Dear sir, \n \nIt is with a heart full of hope...",0,1,m_abacha03@www.com,Maryam Abacha,2002-11-01 00:45:04+00:00,2002-11-01,00:45:04
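
The effect of `utc=True` can be illustrated without pandas. The sketch below uses the standard library's `email.utils.parsedate_to_datetime` to parse two of the date strings shown above and convert them to UTC. The helper name `to_utc` is hypothetical, and treating a `-0000` offset as UTC is an assumption that matches the output above, not something the email date format guarantees.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

def to_utc(raw_date):
    """Parse an email-style date string and return it as a UTC datetime."""
    parsed = parsedate_to_datetime(raw_date)
    if parsed.tzinfo is None:
        # A "-0000" offset means "offset unknown", and the standard library
        # returns a naive datetime for it. Assumption: treat it as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

# Two date strings taken from the rows shown above.
with_offset = to_utc("Thu, 31 Oct 2002 22:17:55 +0100")
unknown_offset = to_utc("Thu, 31 Oct 2002 22:44:20 -0000")

# Split into date and time parts, as the notebook does with .dt.date / .dt.time.
print(with_offset.isoformat(), with_offset.date(), with_offset.time())
print(unknown_offset.isoformat(), unknown_offset.date(), unknown_offset.time())
```

The `+0100` timestamp moves back one hour when expressed in UTC, which is why row 2 above shows `21:17:55` rather than `22:17:55`.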


In [None]:
# Total observations / rows
df.shape

(3332, 12)

In [None]:
# Observations labelled as phishing (label == 1)
(df['label'] == 1).sum()

3332

In [None]:
# Checking missing values
df.isnull().sum()

sender_information       0
receiver_email        1324
date_and_time          482
subject                 39
body                     0
urls                     0
label                    0
sender_email             0
sender_name              0
datetime               482
date                   482
time                   482
dtype: int64

In [None]:
# Drop columns [sender_information, date_and_time, label]
df.drop(['sender_information', 'date_and_time', 'label'], axis=1, inplace=True)
df.head()

Unnamed: 0,receiver_email,subject,body,urls,sender_email,sender_name,datetime,date,time
0,webmaster@aclweb.org,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2...,0,james_ngola2002@maktoob.com,MR. JAMES NGOLA.,2002-10-31 02:38:20+00:00,2002-10-31,02:38:20
1,R@M,URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",0,bensul2004nng@spinfinder.com,Mr. Ben Suleman,2002-10-31 05:10:00+00:00,2002-10-31,05:10:00
2,webmaster@aclweb.org,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,0,obong_715@epatra.com,PRINCE OBONG ELEME,2002-10-31 21:17:55+00:00,2002-10-31,21:17:55
3,webmaster@aclweb.org,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,0,obong_715@epatra.com,PRINCE OBONG ELEME,2002-10-31 22:44:20+00:00,2002-10-31,22:44:20
4,R@M,I Need Your Assistance.,"Dear sir, \n \nIt is with a heart full of hope...",0,m_abacha03@www.com,Maryam Abacha,2002-11-01 00:45:04+00:00,2002-11-01,00:45:04


In [None]:
# Drop rows with missing 'datetime', 'date', and 'time'.
# .copy() makes the result an independent DataFrame, so the assignments
# below do not raise SettingWithCopyWarning.
df = df.dropna(subset=['datetime', 'date', 'time']).copy()

# Impute missing 'receiver_email' with a placeholder value
df['receiver_email'] = df['receiver_email'].fillna('unknown@example.com')

# Impute missing 'subject' with a placeholder value
df['subject'] = df['subject'].fillna('No Subject')

# Ensure 'datetime' has a datetime dtype (it was already parsed with utc=True earlier)
df['datetime'] = pd.to_datetime(df['datetime'], errors='coerce')


In [None]:
# Checking missing values again
df.isnull().sum()

receiver_email    0
subject           0
body              0
urls              0
sender_email      0
sender_name       0
datetime          0
date              0
time              0
dtype: int64
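
The drop-then-impute step can be tried on a small toy frame. The sketch below uses made-up sample values and a hypothetical variable name `cleaned`. Taking an explicit `.copy()` after `dropna` gives an independent DataFrame, which avoids `SettingWithCopyWarning` on later column assignments in pandas versions that emit it.

```python
import pandas as pd

# Toy data (made up) with the same kinds of gaps as the real dataset.
toy = pd.DataFrame({
    "receiver_email": ["a@example.org", None, "c@example.org"],
    "subject": [None, "Hello", "Offer"],
    "datetime": pd.to_datetime(
        ["2002-10-31 02:38:20", "2002-10-31 05:10:00", None], utc=True
    ),
})

# Drop rows without a timestamp and take an explicit copy of the result.
cleaned = toy.dropna(subset=["datetime"]).copy()

# Fill the remaining gaps with placeholder values.
cleaned["receiver_email"] = cleaned["receiver_email"].fillna("unknown@example.com")
cleaned["subject"] = cleaned["subject"].fillna("No Subject")

# Total number of missing values left in the cleaned frame.
print(cleaned.isnull().sum().sum())
```

Placeholders such as `unknown@example.com` keep the rows but are not real observations, so any later analysis of receivers or subjects may want to exclude them.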

In [None]:
# Save the cleaned data to a CSV file
df.to_csv('fraud_data.csv', index=False)

In [None]:
# Downloading the file.
from google.colab import files
files.download('fraud_data.csv')
