# **Phishing Emails Analysis.**
"Phishing is the practice of sending fraudulent communications that appear to come from a legitimate and reputable source, usually through email and text messaging. The attacker's goal is to steal money, gain access to sensitive data and login information, or to install malware on the victim's device. Phishing is a dangerous, damaging, and an increasingly common type of cyberattack."

[Source](https://www.cisco.com/c/en/us/products/security/email-security/what-is-phishing.html)


This project cleans email data from two files in the Kaggle phishing email dataset, SpamAssasin.csv and Nazario.csv, and merges them into a single file for analysis.

Data source: https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset

**Reference**

*Al-Subaiey, A., Al-Thani, M., Alam, N. A., Antora, K. F., Khandakar, A., & Zaman, S. A. U. (2024, May 19). Novel Interpretable and Robust Web-based AI Platform for Phishing Email Detection. ArXiv.org. https://arxiv.org/abs/2405.11619*


# **Loading Data.**

In [None]:
# Importing libraries
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Mounting drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Loading data file 1
df1 = pd.read_csv("/content/drive/MyDrive/Phishing Emails/SpamAssasin.csv")
df1.shape

(5809, 7)

In [None]:
# Loading data file 2
df2 = pd.read_csv("/content/drive/MyDrive/Phishing Emails/Nazario.csv")
df2.shape

(1565, 7)

In [None]:
# Merging the files
df = pd.concat([df1, df2], ignore_index=True)
df.head()

Unnamed: 0,sender,receiver,date,subject,body,label,urls
0,Robert Elz <kre@munnari.OZ.AU>,Chris Garrigues <cwg-dated-1030377287.06fa6d@D...,"Thu, 22 Aug 2002 18:26:25 +0700",Re: New Sequences Window,"Date: Wed, 21 Aug 2002 10:54:46 -0500 ...",0,1
1,Steve Burt <Steve_Burt@cursor-system.com>,"""'zzzzteana@yahoogroups.com'"" <zzzzteana@yahoo...","Thu, 22 Aug 2002 12:46:18 +0100",[zzzzteana] RE: Alexander,"Martin A posted:\nTassos Papadopoulos, the Gre...",0,1
2,"""Tim Chapman"" <timc@2ubh.com>",zzzzteana <zzzzteana@yahoogroups.com>,"Thu, 22 Aug 2002 13:52:38 +0100",[zzzzteana] Moscow bomber,Man Threatens Explosion In Moscow \n\nThursday...,0,1
3,Monty Solomon <monty@roscom.com>,undisclosed-recipient: ;,"Thu, 22 Aug 2002 09:15:25 -0400",[IRR] Klez: The Virus That Won't Die,Klez: The Virus That Won't Die\n \nAlready the...,0,1
4,Stewart Smith <Stewart.Smith@ee.ed.ac.uk>,zzzzteana@yahoogroups.com,"Thu, 22 Aug 2002 14:38:22 +0100",Re: [zzzzteana] Nothing like mama used to make,"> in adding cream to spaghetti carbonara, whi...",0,1


In [None]:
df.shape

(7374, 7)
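
The merged shape is consistent with the two inputs (5809 + 1565 = 7374 rows). A minimal sketch of the same concatenation on toy frames, with an extra `source` column so each row can be traced back to its file; the toy rows and the `source` column are assumptions for illustration, not part of the notebook:

```python
import pandas as pd

# Toy stand-ins for the two files (hypothetical rows, not the real data)
spam_assassin = pd.DataFrame({"sender": ["a@example.com", "b@example.com"], "label": [0, 1]})
nazario = pd.DataFrame({"sender": ["c@example.com"], "label": [1]})

# Tag each frame with its origin before stacking them
combined = pd.concat(
    [spam_assassin.assign(source="SpamAssassin"), nazario.assign(source="Nazario")],
    ignore_index=True,
)

# The merged row count should equal the sum of the parts
assert len(combined) == len(spam_assassin) + len(nazario)
print(combined["source"].value_counts())
```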

In [None]:
# Renaming columns
df.columns = ['sender_information', 'receiver_email', 'date_and_time', 'subject', 'body', 'urls', 'label']
df.head()

Unnamed: 0,sender_information,receiver_email,date_and_time,subject,body,urls,label
0,Robert Elz <kre@munnari.OZ.AU>,Chris Garrigues <cwg-dated-1030377287.06fa6d@D...,"Thu, 22 Aug 2002 18:26:25 +0700",Re: New Sequences Window,"Date: Wed, 21 Aug 2002 10:54:46 -0500 ...",0,1
1,Steve Burt <Steve_Burt@cursor-system.com>,"""'zzzzteana@yahoogroups.com'"" <zzzzteana@yahoo...","Thu, 22 Aug 2002 12:46:18 +0100",[zzzzteana] RE: Alexander,"Martin A posted:\nTassos Papadopoulos, the Gre...",0,1
2,"""Tim Chapman"" <timc@2ubh.com>",zzzzteana <zzzzteana@yahoogroups.com>,"Thu, 22 Aug 2002 13:52:38 +0100",[zzzzteana] Moscow bomber,Man Threatens Explosion In Moscow \n\nThursday...,0,1
3,Monty Solomon <monty@roscom.com>,undisclosed-recipient: ;,"Thu, 22 Aug 2002 09:15:25 -0400",[IRR] Klez: The Virus That Won't Die,Klez: The Virus That Won't Die\n \nAlready the...,0,1
4,Stewart Smith <Stewart.Smith@ee.ed.ac.uk>,zzzzteana@yahoogroups.com,"Thu, 22 Aug 2002 14:38:22 +0100",Re: [zzzzteana] Nothing like mama used to make,"> in adding cream to spaghetti carbonara, whi...",0,1
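
The first `df.head()` output lists the last two columns as `label, urls`, whereas the positional list assigned to `df.columns` names them `urls, label`, so the two names appear to end up swapped relative to the loaded data. A name-based mapping with `DataFrame.rename` does not depend on column order. A minimal sketch on a one-row toy frame (the receiver address and body are placeholders, not values from the dataset):

```python
import pandas as pd

# Toy frame using the column order shown in the first head() output
raw = pd.DataFrame(
    [["Robert Elz <kre@munnari.OZ.AU>", "x@example.com",
      "Thu, 22 Aug 2002 18:26:25 +0700", "Re: New Sequences Window", "...", 0, 1]],
    columns=["sender", "receiver", "date", "subject", "body", "label", "urls"],
)

# Map old names to new names; columns not listed keep their names and values
column_map = {
    "sender": "sender_information",
    "receiver": "receiver_email",
    "date": "date_and_time",
}
renamed = raw.rename(columns=column_map)
print(renamed.columns.tolist())
```

With this approach `label` and `urls` keep the values they were loaded with, regardless of where they sit in the file.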


## **Data Cleaning.**
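The `sender_information` column combines the sender's display name and email address (for example, `Robert Elz <kre@munnari.OZ.AU>`), so the two parts are extracted into separate `sender_name` and `sender_email` columns.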

In [None]:
# Convert the 'sender_information' column to string type (NaN becomes the string 'nan')
df['sender_information'] = df['sender_information'].astype(str)

# Regular expressions to extract the email address and the display name before '<'
email_pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
name_pattern = re.compile(r'([a-zA-Z\s\.\-]+)\s*<')

# Function to extract email
def extract_email(sender):
    match = email_pattern.search(sender)
    return match.group(0) if match else sender

# Function to extract name
def extract_name(sender):
    match = name_pattern.search(sender)
    return match.group(1).strip() if match else 'Unknown'

# Apply the functions to the 'sender_information' column
df['sender_email'] = df['sender_information'].apply(extract_email)
df['sender_name'] = df['sender_information'].apply(extract_name)

# Display the cleaned DataFrame
df.head()

Unnamed: 0,sender_information,receiver_email,date_and_time,subject,body,urls,label,sender_email,sender_name
0,Robert Elz <kre@munnari.OZ.AU>,Chris Garrigues <cwg-dated-1030377287.06fa6d@D...,"Thu, 22 Aug 2002 18:26:25 +0700",Re: New Sequences Window,"Date: Wed, 21 Aug 2002 10:54:46 -0500 ...",0,1,kre@munnari.OZ.AU,Robert Elz
1,Steve Burt <Steve_Burt@cursor-system.com>,"""'zzzzteana@yahoogroups.com'"" <zzzzteana@yahoo...","Thu, 22 Aug 2002 12:46:18 +0100",[zzzzteana] RE: Alexander,"Martin A posted:\nTassos Papadopoulos, the Gre...",0,1,Steve_Burt@cursor-system.com,Steve Burt
2,"""Tim Chapman"" <timc@2ubh.com>",zzzzteana <zzzzteana@yahoogroups.com>,"Thu, 22 Aug 2002 13:52:38 +0100",[zzzzteana] Moscow bomber,Man Threatens Explosion In Moscow \n\nThursday...,0,1,timc@2ubh.com,
3,Monty Solomon <monty@roscom.com>,undisclosed-recipient: ;,"Thu, 22 Aug 2002 09:15:25 -0400",[IRR] Klez: The Virus That Won't Die,Klez: The Virus That Won't Die\n \nAlready the...,0,1,monty@roscom.com,Monty Solomon
4,Stewart Smith <Stewart.Smith@ee.ed.ac.uk>,zzzzteana@yahoogroups.com,"Thu, 22 Aug 2002 14:38:22 +0100",Re: [zzzzteana] Nothing like mama used to make,"> in adding cream to spaghetti carbonara, whi...",0,1,Stewart.Smith@ee.ed.ac.uk,Stewart Smith
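
In the output above, row 2 (`"Tim Chapman" <timc@2ubh.com>`) has an empty `sender_name`: the quote character is not in the name pattern's character class, so the only text matched directly before `<` is a single space, which `strip()` reduces to an empty string. One possible alternative is the standard library's `email.utils.parseaddr`, which handles quoted display names. The sketch below is an illustration, not the notebook's method; `split_sender` is a hypothetical helper name:

```python
from email.utils import parseaddr

def split_sender(sender: str) -> tuple[str, str]:
    """Return (name, email) from a From-style header value."""
    name, address = parseaddr(sender)
    # Fall back to the notebook's conventions when a part is missing
    return (name or "Unknown", address or sender)

samples = [
    "Robert Elz <kre@munnari.OZ.AU>",
    '"Tim Chapman" <timc@2ubh.com>',
    "monty@roscom.com",
]
for sample in samples:
    print(split_sender(sample))
```

It could be applied with `df['sender_information'].apply(split_sender)` and the resulting tuples unpacked into the two columns.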


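The `date_and_time` column stores the date, time, and UTC offset in a single string. It is parsed into a UTC timestamp and then split into separate `date` and `time` columns; values that cannot be parsed become missing (`NaT`).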

In [None]:
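# Parse the raw date string into a timezone-aware UTC timestamp; unparseable values become NaT
df['datetime'] = pd.to_datetime(df['date_and_time'], utc=True, errors='coerce')
# Split the timestamp into separate date and time columns
df['date'] = df['datetime'].dt.date
df['time'] = df['datetime'].dt.time
df.head()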

Unnamed: 0,sender_information,receiver_email,date_and_time,subject,body,urls,label,sender_email,sender_name,datetime,date,time
0,Robert Elz <kre@munnari.OZ.AU>,Chris Garrigues <cwg-dated-1030377287.06fa6d@D...,"Thu, 22 Aug 2002 18:26:25 +0700",Re: New Sequences Window,"Date: Wed, 21 Aug 2002 10:54:46 -0500 ...",0,1,kre@munnari.OZ.AU,Robert Elz,2002-08-22 11:26:25+00:00,2002-08-22,11:26:25
1,Steve Burt <Steve_Burt@cursor-system.com>,"""'zzzzteana@yahoogroups.com'"" <zzzzteana@yahoo...","Thu, 22 Aug 2002 12:46:18 +0100",[zzzzteana] RE: Alexander,"Martin A posted:\nTassos Papadopoulos, the Gre...",0,1,Steve_Burt@cursor-system.com,Steve Burt,2002-08-22 11:46:18+00:00,2002-08-22,11:46:18
2,"""Tim Chapman"" <timc@2ubh.com>",zzzzteana <zzzzteana@yahoogroups.com>,"Thu, 22 Aug 2002 13:52:38 +0100",[zzzzteana] Moscow bomber,Man Threatens Explosion In Moscow \n\nThursday...,0,1,timc@2ubh.com,,2002-08-22 12:52:38+00:00,2002-08-22,12:52:38
3,Monty Solomon <monty@roscom.com>,undisclosed-recipient: ;,"Thu, 22 Aug 2002 09:15:25 -0400",[IRR] Klez: The Virus That Won't Die,Klez: The Virus That Won't Die\n \nAlready the...,0,1,monty@roscom.com,Monty Solomon,2002-08-22 13:15:25+00:00,2002-08-22,13:15:25
4,Stewart Smith <Stewart.Smith@ee.ed.ac.uk>,zzzzteana@yahoogroups.com,"Thu, 22 Aug 2002 14:38:22 +0100",Re: [zzzzteana] Nothing like mama used to make,"> in adding cream to spaghetti carbonara, whi...",0,1,Stewart.Smith@ee.ed.ac.uk,Stewart Smith,2002-08-22 13:38:22+00:00,2002-08-22,13:38:22


In [None]:
# Total observations / rows
df.shape

(7374, 12)

In [None]:
# Observations classified as 'Spam' / have a label 1
(df['label'] == 1).sum()

6425

In [None]:
# Checking missing values
df.isnull().sum()

sender_information       0
receiver_email         306
date_and_time            1
subject                 20
body                     1
urls                     0
label                    0
sender_email             0
sender_name              0
datetime              1838
date                  1838
time                  1838
dtype: int64
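
Only 1 raw `date_and_time` value is missing, yet 1838 `datetime` values are, so most of the gap comes from strings that `pd.to_datetime` could not parse and coerced to `NaT`. Email `Date` headers follow the RFC 2822 style, which the standard library's `email.utils.parsedate_to_datetime` is designed to read. The sketch below shows a per-value parser built on it; `parse_email_date` is a hypothetical helper, and whether it would recover any of the coerced rows depends on what the raw strings look like, which the notebook does not show:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

def parse_email_date(value: object) -> Optional[datetime]:
    """Parse an RFC 2822 style date string to a UTC datetime, or None on failure."""
    if not isinstance(value, str):
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError
        return None
    if parsed.tzinfo is None:
        # Assumption: treat dates without a usable offset as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

print(parse_email_date("Thu, 22 Aug 2002 18:26:25 +0700"))
print(parse_email_date("not a date"))
```

For the first sample row this yields 11:26:25 UTC on 2002-08-22, matching the `datetime` value shown in the table above.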

In [None]:
# Drop columns: sender_information, date_and_time, label
df.drop(['sender_information', 'date_and_time', 'label'], axis=1, inplace=True)
df.head()

Unnamed: 0,receiver_email,subject,body,urls,sender_email,sender_name,datetime,date,time
0,Chris Garrigues <cwg-dated-1030377287.06fa6d@D...,Re: New Sequences Window,"Date: Wed, 21 Aug 2002 10:54:46 -0500 ...",0,kre@munnari.OZ.AU,Robert Elz,2002-08-22 11:26:25+00:00,2002-08-22,11:26:25
1,"""'zzzzteana@yahoogroups.com'"" <zzzzteana@yahoo...",[zzzzteana] RE: Alexander,"Martin A posted:\nTassos Papadopoulos, the Gre...",0,Steve_Burt@cursor-system.com,Steve Burt,2002-08-22 11:46:18+00:00,2002-08-22,11:46:18
2,zzzzteana <zzzzteana@yahoogroups.com>,[zzzzteana] Moscow bomber,Man Threatens Explosion In Moscow \n\nThursday...,0,timc@2ubh.com,,2002-08-22 12:52:38+00:00,2002-08-22,12:52:38
3,undisclosed-recipient: ;,[IRR] Klez: The Virus That Won't Die,Klez: The Virus That Won't Die\n \nAlready the...,0,monty@roscom.com,Monty Solomon,2002-08-22 13:15:25+00:00,2002-08-22,13:15:25
4,zzzzteana@yahoogroups.com,Re: [zzzzteana] Nothing like mama used to make,"> in adding cream to spaghetti carbonara, whi...",0,Stewart.Smith@ee.ed.ac.uk,Stewart Smith,2002-08-22 13:38:22+00:00,2002-08-22,13:38:22


In [None]:
# Drop rows with missing 'datetime', 'date', 'time', or 'body'.
# .copy() gives an independent DataFrame, so the assignments below do not trigger SettingWithCopyWarning.
df = df.dropna(subset=['datetime', 'date', 'time', 'body']).copy()

# Impute missing 'receiver_email' with a placeholder value
df['receiver_email'] = df['receiver_email'].fillna('unknown@example.com')

# Impute missing 'subject' with a placeholder value
df['subject'] = df['subject'].fillna('No Subject')

# Ensure the 'datetime' column has a datetime dtype
df['datetime'] = pd.to_datetime(df['datetime'], errors='coerce')


In [None]:
# Checking missing values again
df.isnull().sum()

receiver_email    0
subject           0
body              0
urls              0
sender_email      0
sender_name       0
datetime          0
date              0
time              0
dtype: int64

In [None]:
df.shape

(5535, 9)

In [None]:
# Save the cleaned data to a CSV file
df.to_csv('fraud_data.csv', index=False)

In [None]:
# Downloading the file.
from google.colab import files
files.download('fraud_data.csv')
