# **Phishing Website Detection using Decision Trees**

**A CS 180 Machine Learning Project**

## **Main Objective**

In the digital age, the Internet is essential for communication and commerce but also brings security threats, with phishing being a major concern. Phishing tricks people into giving up sensitive information by posing as legitimate entities, leading to financial loss and identity theft (Dutta, 2021). The main challenge in fighting phishing is its evolving nature, as cybercriminals constantly update their tactics, outpacing traditional methods like blacklists (Almenari & Alshammari, 2023).

Machine Learning (ML) offers a promising solution by analyzing various website characteristics—such as URL length, HTTPS usage, and PageRank—to predict phishing attempts. This approach is necessary given the limitations of current security measures (Dutta, 2021). Our project stands out by using a diverse set of features to train ML models, improving prediction accuracy. For example, a short URL may not be suspicious alone, but combined with a low PageRank and no HTTPS, it could indicate a phishing site (Almenari & Alshammari, 2023).

This project has practical implications for enhancing online security, offering real-time warnings about potential phishing sites and aiding cybersecurity professionals in identifying threats more efficiently (Dutta, 2021). In summary, using ML to detect phishing websites based on various characteristics is a novel, challenging, and valuable endeavor to make the Internet safer for all users.

## **Preliminaries**


Let's import the libraries needed for this project.

In [30]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import json
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB

Finally, let's import the dataset that will be used. Note that this dataset was constructed via `dataset-construction.ipynb`, which was adapted from the Google Colab notebook created by GitHub user shreyagopal that can be found [here](https://colab.research.google.com/github/shreyagopal/Phishing-Website-Detection-by-Machine-Learning-Techniques/blob/master/URL%20Feature%20Extraction.ipynb).

In [16]:
df_url = pd.read_csv('./datasets/final_dataset.csv')
df_url.head()

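|   | Domain | Have_IP | Have_At | URL_Length | URL_Depth | Redirection | https_Domain | TinyURL | Prefix/Suffix | DNS_Record | Domain_Age | Domain_End | iFrame | Mouse_Over | Right_Click | Web_Forwards | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ucmo.edu | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 1 | amazon.com | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 2 | juicyfinder.com | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 3 | martindale.com | 0 | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 4 | montrealladies.com | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |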


## **Data Preprocessing**

First, let's evaluate the dataframe.

In [17]:
# Determine the shape of the dataframe
df_url.shape

(30000, 17)

In [18]:
# Determine its columns
df_url.columns

Index(['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth',
       'Redirection', 'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record',
       'Domain_Age', 'Domain_End', 'iFrame', 'Mouse_Over', 'Right_Click',
       'Web_Forwards', 'Label'],
      dtype='object')

This indicates that the dataframe has `17 columns` and `30000 rows`. Each column describes one property of a URL, and `Label` is the target we want to predict. Now let's explore the dataframe further.

In [19]:
df_url.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Domain         30000 non-null  object
 1   Have_IP        30000 non-null  int64 
 2   Have_At        30000 non-null  int64 
 3   URL_Length     30000 non-null  int64 
 4   URL_Depth      30000 non-null  int64 
 5   Redirection    30000 non-null  int64 
 6   https_Domain   30000 non-null  int64 
 7   TinyURL        30000 non-null  int64 
 8   Prefix/Suffix  30000 non-null  int64 
 9   DNS_Record     30000 non-null  int64 
 10  Domain_Age     30000 non-null  int64 
 11  Domain_End     30000 non-null  int64 
 12  iFrame         30000 non-null  int64 
 13  Mouse_Over     30000 non-null  int64 
 14  Right_Click    30000 non-null  int64 
 15  Web_Forwards   30000 non-null  int64 
 16  Label          30000 non-null  int64 
dtypes: int64(16), object(1)
memory usage: 3.9+ MB


Each column contains an `int` except for `Domain`, which is a string. Let's analyze each feature, first by checking its unique values.

In [20]:
for feature in df_url.columns:
    print(f'{feature}: {df_url[feature].unique()}')

Domain: ['ucmo.edu' 'amazon.com' 'juicyfinder.com' ... 'app.bronto.com'
 'masterdonatelli.com' 'msnpromo.free.fr']
Have_IP: [0 1]
Have_At: [0 1]
URL_Length: [0 1]
URL_Depth: [ 4  2  3  1  6  5  7  8  9 10 11 14 16 13 17 12 15 25 19 18 23]
Redirection: [0 1]
https_Domain: [0]
TinyURL: [0 1]
Prefix/Suffix: [0 1]
DNS_Record: [1]
Domain_Age: [1]
Domain_End: [1]
iFrame: [1 0]
Mouse_Over: [1 0]
Right_Click: [1]
Web_Forwards: [1 0]
Label: [0 1]


Next, we will examine the number of unique values. We'll focus on the Domain column, as the unique values in the other columns can be easily determined from the previous section.

In [21]:
print(f"Domain: {df_url['Domain'].nunique()}")

Domain: 20240


There are repeating domains within the dataset: with 30000 rows and only 20240 unique domains, 9760 rows repeat a domain that appears earlier. For now, let's not remove them since they might be helpful later.
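To look at the repeats more closely, a minimal sketch along these lines could be used. The `toy` dataframe and its values are made up for illustration; on the real data, `df_url` would take its place (before the `Domain` column is encoded).

```python
import pandas as pd

# Toy stand-in for df_url; the values are hypothetical.
toy = pd.DataFrame({
    "Domain": ["a.com", "b.org", "a.com", "c.net", "a.com", "b.org"],
    "Label":  [0, 1, 0, 0, 0, 1],
})

# How many times each domain occurs, most frequent first.
domain_counts = toy["Domain"].value_counts()

# Number of rows whose domain already appeared earlier in the frame.
n_repeated_rows = int(toy["Domain"].duplicated().sum())

# Domains that occur with more than one distinct label.
labels_per_domain = toy.groupby("Domain")["Label"].nunique()
conflicting = labels_per_domain[labels_per_domain > 1].index.tolist()

print(domain_counts)
print(n_repeated_rows)  # 3
print(conflicting)      # []
```

Checking for domains with conflicting labels is worthwhile before deciding whether the repeats should be kept or dropped.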

From these, we can remove the `https_Domain`, `Right_Click`, `DNS_Record`, `Domain_Age`, and `Domain_End` columns from the dataframe since each of them has only 1 unique value, which is not helpful in determining whether a set of URL features belongs to a phishing website.

In [22]:
df_url.drop(['https_Domain', 'Right_Click', 'DNS_Record', 'Domain_Age', 'Domain_End'], axis=1, inplace=True)
df_url.columns
df_url.shape

(30000, 12)

The dataframe now only has `12 columns`.
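Rather than listing the constant columns by hand, they can also be found programmatically. The sketch below uses a small made-up dataframe (`toy`) in place of `df_url`; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy stand-in for df_url; the values are hypothetical.
toy = pd.DataFrame({
    "Have_IP":      [0, 1, 0, 0],
    "https_Domain": [0, 0, 0, 0],   # constant
    "DNS_Record":   [1, 1, 1, 1],   # constant
    "URL_Depth":    [4, 2, 3, 1],
    "Label":        [0, 1, 0, 1],
})

# Count the distinct values per column and keep the names of the constant ones.
nunique = toy.nunique()
constant_cols = nunique[nunique == 1].index.tolist()

# Drop them without mutating the original frame.
reduced = toy.drop(columns=constant_cols)

print(constant_cols)   # ['https_Domain', 'DNS_Record']
print(reduced.shape)   # (4, 3)
```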

Before we proceed to modelling, note that the `Domain` column still has type `object`. This is problematic since neural networks don't handle string values directly. Let's label each domain using an integer.

In [23]:
# Use LabelEncoder to convert the Domain column to numerical values
le = LabelEncoder()
df_url['Domain'] = le.fit_transform(df_url['Domain'])

# Get the mapping from labels to their encoded values
domain_mapping = dict(zip(le.classes_, range(len(le.classes_))))

# Write the mapping to a file
with open('domain_mapping.json', 'w') as f:
    json.dump(domain_mapping, f)

print(domain_mapping)



In [24]:
print(df_url.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Domain         30000 non-null  int64
 1   Have_IP        30000 non-null  int64
 2   Have_At        30000 non-null  int64
 3   URL_Length     30000 non-null  int64
 4   URL_Depth      30000 non-null  int64
 5   Redirection    30000 non-null  int64
 6   TinyURL        30000 non-null  int64
 7   Prefix/Suffix  30000 non-null  int64
 8   iFrame         30000 non-null  int64
 9   Mouse_Over     30000 non-null  int64
 10  Web_Forwards   30000 non-null  int64
 11  Label          30000 non-null  int64
dtypes: int64(12)
memory usage: 2.7 MB
None


Every column now has an integer dtype. Hence, we can now proceed to data modelling.
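As a small illustration of what `LabelEncoder` does, the sketch below encodes a short, hand-picked list of domains. It shows that codes follow the sorted order of the class names, that the mapping can be reversed with `inverse_transform`, and that a domain not seen during fitting raises a `ValueError`. The domain `never-seen.example` is a made-up name.

```python
from sklearn.preprocessing import LabelEncoder

domains = ["ucmo.edu", "amazon.com", "juicyfinder.com", "amazon.com"]

le = LabelEncoder()
codes = le.fit_transform(domains)

# classes_ is sorted, so each code is the position of the name in sorted order.
mapping = {str(name): int(code) for code, name in enumerate(le.classes_)}

# The encoding is reversible.
decoded = le.inverse_transform(codes).tolist()

# A label that was not present during fit cannot be transformed.
try:
    le.transform(["never-seen.example"])
    unseen_ok = True
except ValueError:
    unseen_ok = False

print(codes.tolist())  # [2, 0, 1, 0]
print(mapping)         # {'amazon.com': 0, 'juicyfinder.com': 1, 'ucmo.edu': 2}
print(unseen_ok)       # False
```

Because the codes only reflect alphabetical position, a model may treat them as if they carried an ordering or magnitude that has no real meaning, which is worth keeping in mind when interpreting the results.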

## **Data Modelling**

Before we proceed, let's once again check the first few rows of our dataframe.

In [25]:
df_url.head()

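|   | Domain | Have_IP | Have_At | URL_Length | URL_Depth | Redirection | TinyURL | Prefix/Suffix | iFrame | Mouse_Over | Web_Forwards | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18429 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 1 | 1688 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 2 | 9743 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 3 | 11195 | 0 | 0 | 1 | 3 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 4 | 11928 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |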


Now, let's randomize the row order, since the dataset is grouped by class rather than interleaved (the first rows shown above all have `Label` 0).

In [26]:
df_url = df_url.sample(frac=1).reset_index(drop=True)
df_url.head(10)

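|   | Domain | Have_IP | Have_At | URL_Length | URL_Depth | Redirection | TinyURL | Prefix/Suffix | iFrame | Mouse_Over | Web_Forwards | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2173 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 1 | 11339 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 2 | 3419 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 3 | 10691 | 0 | 0 | 1 | 3 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| 4 | 15026 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 5 | 13601 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 6 | 8975 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 7 | 13627 | 0 | 0 | 1 | 4 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 8 | 11135 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 9 | 19920 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |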


We can see that the rows have been randomized. Now, let's split the dataset into training and testing sets.

In [27]:
X = df_url.drop('Label', axis=1)
y = df_url['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Length of X_train: {len(X_train)}")
print(f"Length of X_test: {len(X_test)}")
print(f"Length of y_train: {len(y_train)}")
print(f"Length of y_test: {len(y_test)}")

Length of X_train: 24000
Length of X_test: 6000
Length of y_train: 24000
Length of y_test: 6000
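For reference, here is a hedged sketch of a variant of the shuffle-and-split steps that is reproducible and keeps the class proportions equal in both subsets. It runs on a made-up 20-row dataframe (`toy`); the `random_state` on `sample` and the `stratify=y` argument are additions for illustration and are not used in the cells above. Note that `train_test_split` already shuffles by default, so the explicit shuffle mainly serves inspection.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df_url, deliberately ordered by class.
toy = pd.DataFrame({
    "URL_Depth": list(range(20)),
    "Label": [0] * 10 + [1] * 10,
})

# Seeded shuffle, so the same row order is obtained on every run.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

X = shuffled.drop("Label", axis=1)
y = shuffled["Label"]

# stratify=y keeps the 50/50 class ratio in both the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))          # 16 4
print(y_test.value_counts().to_dict())
```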


Data modelling can now proceed. Let's model our data using `MLPClassifier`.

In [28]:
mlp = MLPClassifier(
    hidden_layer_sizes=(200, 200, 200, 200, 200),
    activation='relu',
    solver='adam',
    alpha=0.0001,
    tol=0.00001,
    max_iter=10000,
    random_state=42,
)
mlp_model = mlp.fit(X_train, y_train)
print(f"Training score: {mlp_model.score(X_train, y_train)}")
print(f"Testing score: {mlp_model.score(X_test, y_test)}")

Training score: 0.537625
Testing score: 0.5405
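One possible contributor to scores like these, which we have not verified on this dataset, is feature scale: the encoded `Domain` column spans thousands of values while the other features are small integers, and multilayer perceptrons are generally sensitive to unscaled inputs. The sketch below shows how a `StandardScaler` could be placed in front of an `MLPClassifier` with a pipeline. The data is synthetic, and the smaller network size is chosen only to keep the example quick.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic features: one wide-range ID-like column and one binary flag.
big_id = rng.integers(0, 20000, size=n)
flag = rng.integers(0, 2, size=n)
X_demo = np.column_stack([big_id, flag]).astype(float)
y_demo = flag  # in this toy setup the label depends only on the flag

# The scaler is fitted on the training data only, inside the pipeline.
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=42),
)
scaled_mlp.fit(X_demo, y_demo)

# After scaling, every column has (approximately) zero mean and unit variance.
scaled = scaled_mlp.named_steps["standardscaler"].transform(X_demo)
demo_score = scaled_mlp.score(X_demo, y_demo)
print(scaled.mean(axis=0), demo_score)
```

Whether scaling helps on the real dataset would have to be checked by rerunning the cell above with such a pipeline.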


`MLPClassifier` underfits the data: training and testing accuracy are both only about 0.54. Let's try using `GaussianNB`.

In [29]:
gnb = GaussianNB()
gnb_model = gnb.fit(X_train, y_train)
print(f"Training score: {gnb_model.score(X_train, y_train)}")
print(f"Testing score: {gnb_model.score(X_test, y_test)}")

Training score: 0.56575
Testing score: 0.551


What about `BernoulliNB`?

In [31]:
bnb = BernoulliNB()
bnb_model = bnb.fit(X_train, y_train)
print(f"Training score: {bnb_model.score(X_train, y_train)}")
print(f"Testing score: {bnb_model.score(X_test, y_test)}")

Training score: 0.583875
Testing score: 0.5756666666666667
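To put accuracy values like these in context, it helps to compare them against a trivial baseline that always predicts the most frequent class, and to look at the confusion matrix rather than a single number. The sketch below does this on small hand-written label arrays (`y_true` and `y_pred` are made up); on the real data, `y_test` and a model's predictions would be used instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and model predictions.
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 0, 0])

# The baseline ignores the features, so a placeholder column is enough.
X_placeholder = np.zeros((len(y_true), 1))
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_true)
baseline_acc = baseline.score(X_placeholder, y_true)

model_acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(baseline_acc)  # 0.625
print(model_acc)     # 0.75
print(cm)            # [[4 1]
                     #  [1 2]]
```

A model is only useful to the extent that it beats this baseline on the test set.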
