# **Phishing Website Detection using Decision Trees**

**A CS 180 Machine Learning Project**

## **Main Objective**

In the digital age, the Internet is essential for communication and commerce but also brings security threats, with phishing being a major concern. Phishing tricks people into giving up sensitive information by posing as legitimate entities, leading to financial loss and identity theft (Dutta, 2021). The main challenge in fighting phishing is its evolving nature, as cybercriminals constantly update their tactics, outpacing traditional methods like blacklists (Almenari & Alshammari, 2023).

Machine Learning (ML) offers a promising solution by analyzing various website characteristics—such as URL length, HTTPS usage, and PageRank—to predict phishing attempts. This approach is necessary given the limitations of current security measures (Dutta, 2021). Our project stands out by using a diverse set of features to train ML models, improving prediction accuracy. For example, a short URL may not be suspicious alone, but combined with a low PageRank and no HTTPS, it could indicate a phishing site (Almenari & Alshammari, 2023).

This project has practical implications for enhancing online security, offering real-time warnings about potential phishing sites and aiding cybersecurity professionals in identifying threats more efficiently (Dutta, 2021). In summary, using ML to detect phishing websites based on various characteristics is a novel, challenging, and valuable endeavor to make the Internet safer for all users.

## **Preliminaries**


Let's import the libraries needed for this project.

In [11]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import json
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression


Finally, let's load the dataset that will be used. Note that this dataset was constructed via `dataset-construction.ipynb`, which was adapted from the Google Colab notebook created by GitHub user shreyagopal that can be found [here](https://colab.research.google.com/github/shreyagopal/Phishing-Website-Detection-by-Machine-Learning-Techniques/blob/master/URL%20Feature%20Extraction.ipynb).

In [2]:
df_url = pd.read_csv('./datasets/new_dataset.csv')
df_url.head()

Unnamed: 0,url,length_url,length_hostname,ip,nb_dots,nb_hyphens,nb_at,nb_qm,nb_and,nb_or,...,domain_in_title,domain_with_copyright,whois_registered_domain,domain_registration_length,domain_age,web_traffic,dns_record,google_index,page_rank,status
0,https://www.todayshomeowner.com/how-to-make-ho...,82,23,0,2,7,0,0,0,0,...,1,1,0,240,8892,67860,0,1,4,legitimate
1,http://thapthan.ac.th/information/confirmation...,93,14,1,2,0,0,0,0,0,...,1,0,1,0,2996,4189860,0,1,2,phishing
2,http://app.dialoginsight.com/T/OFC4/L2S/3888/B...,121,21,1,3,0,0,0,0,0,...,1,1,0,30,2527,346022,0,1,3,phishing
3,https://www.bedslide.com,24,16,0,2,0,0,0,0,0,...,0,0,0,139,7531,1059151,0,0,4,legitimate
4,https://tabs.ultimate-guitar.com/s/sex_pistols...,73,24,0,3,1,0,0,0,0,...,0,0,0,3002,7590,635,0,1,5,legitimate


## **Data Preprocessing**

First, let's evaluate the dataframe.

In [3]:
# Determine the shape of the dataframe
df_url.shape

(11430, 89)

In [5]:
# Determine its columns
df_url.columns

Index(['url', 'length_url', 'length_hostname', 'ip', 'nb_dots', 'nb_hyphens',
       'nb_at', 'nb_qm', 'nb_and', 'nb_or', 'nb_eq', 'nb_underscore',
       'nb_tilde', 'nb_percent', 'nb_slash', 'nb_star', 'nb_colon', 'nb_comma',
       'nb_semicolumn', 'nb_dollar', 'nb_space', 'nb_www', 'nb_com',
       'nb_dslash', 'http_in_path', 'https_token', 'ratio_digits_url',
       'ratio_digits_host', 'punycode', 'port', 'tld_in_path',
       'tld_in_subdomain', 'abnormal_subdomain', 'nb_subdomains',
       'prefix_suffix', 'random_domain', 'shortening_service',
       'path_extension', 'nb_redirection', 'nb_external_redirection',
       'length_words_raw', 'char_repeat', 'shortest_words_raw',
       'shortest_word_host', 'shortest_word_path', 'longest_words_raw',
       'longest_word_host', 'longest_word_path', 'avg_words_raw',
       'avg_word_host', 'avg_word_path', 'phish_hints', 'domain_in_brand',
       'brand_in_subdomain', 'brand_in_path', 'suspecious_tld',
       'statistical_report', 

This indicates that the dataframe has `89 columns` and `11430 rows`. Each column represents a feature associated with each URL. Now let's perform some exploration with the dataframe. First, let's see if there are null values within the dataset.

In [6]:
df_url.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11430 entries, 0 to 11429
Data columns (total 89 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   url                         11430 non-null  object 
 1   length_url                  11430 non-null  int64  
 2   length_hostname             11430 non-null  int64  
 3   ip                          11430 non-null  int64  
 4   nb_dots                     11430 non-null  int64  
 5   nb_hyphens                  11430 non-null  int64  
 6   nb_at                       11430 non-null  int64  
 7   nb_qm                       11430 non-null  int64  
 8   nb_and                      11430 non-null  int64  
 9   nb_or                       11430 non-null  int64  
 10  nb_eq                       11430 non-null  int64  
 11  nb_underscore               11430 non-null  int64  
 12  nb_tilde                    11430 non-null  int64  
 13  nb_percent                  114

Fortunately, there are no null values. However, that is a lot of features. Every column contains an `int` or a `float` except for `url` and `status`, which both hold strings. The `url` column can be dropped, while the `status` column will be retained.

But before that, let's reduce the number of columns by checking if they only have one unique value.

In [7]:
lst_removed_features = []
for feature in df_url.columns:
    if len(df_url[feature].unique()) == 1:
        lst_removed_features.append(feature)
        df_url.drop([feature], axis=1, inplace=True)
print(lst_removed_features)
df_url.shape

['nb_or', 'ratio_nullHyperlinks', 'ratio_intRedirection', 'ratio_intErrors', 'submit_email', 'sfh']


(11430, 83)
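
The same check can also be written without an explicit loop. The following is a minimal sketch on a toy frame, not the project's data: the name `toy` and its values are made up for illustration, and `nunique()` ignores nulls by default, which matches `unique()` here only because the dataset has no null values.

```python
import pandas as pd

# Toy frame standing in for df_url; the values are illustrative only.
toy = pd.DataFrame({
    "length_url": [82, 93, 121],
    "nb_or": [0, 0, 0],   # constant column
    "ip": [0, 1, 1],
    "sfh": [0, 0, 0],     # constant column
})

# nunique() counts the distinct non-null values of every column.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()

# Drop all constant columns in one call instead of one per loop iteration.
toy_reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(toy_reduced.shape)
```

Collecting the column names first and dropping them once avoids modifying the dataframe while its columns are being iterated over.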

Six features have only one unique value, which leaves 83 columns. Let's now drop the `url` column.


In [8]:
df_url.drop(['url'], axis=1, inplace=True)
df_url.shape

(11430, 82)

Next, let's convert the `status` column. We shall label `phishing` as `1` and `legitimate` as `0`.

In [12]:
print(df_url['status'].iloc[0])
df_url['status'] = np.where(df_url['status'] == 'legitimate', 0, 1)
print(df_url['status'].iloc[0])

legitimate
0
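
Note that `np.where` as used above maps every value other than `legitimate` to `1`, so a misspelled or unexpected label would silently count as phishing. A possible stricter variant, shown here as a sketch on a small made-up series (`status` and `label_map` are hypothetical names, not part of the notebook), uses an explicit mapping and fails loudly on unknown labels.

```python
import pandas as pd

# Made-up labels standing in for df_url['status'].
status = pd.Series(["legitimate", "phishing", "phishing", "legitimate"])

# Explicit mapping: phishing -> 1, legitimate -> 0.
label_map = {"legitimate": 0, "phishing": 1}
encoded = status.map(label_map)

# Labels that are missing from label_map come back as NaN; report them.
unexpected = status[encoded.isna()].unique()
if len(unexpected) > 0:
    raise ValueError(f"Unexpected status labels: {list(unexpected)}")

encoded = encoded.astype(int)
print(encoded.tolist())
```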


Let's confirm that all columns are now either an `int` or a `float`.

In [13]:
df_url.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11430 entries, 0 to 11429
Data columns (total 82 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   length_url                  11430 non-null  int64  
 1   length_hostname             11430 non-null  int64  
 2   ip                          11430 non-null  int64  
 3   nb_dots                     11430 non-null  int64  
 4   nb_hyphens                  11430 non-null  int64  
 5   nb_at                       11430 non-null  int64  
 6   nb_qm                       11430 non-null  int64  
 7   nb_and                      11430 non-null  int64  
 8   nb_eq                       11430 non-null  int64  
 9   nb_underscore               11430 non-null  int64  
 10  nb_tilde                    11430 non-null  int64  
 11  nb_percent                  11430 non-null  int64  
 12  nb_slash                    11430 non-null  int64  
 13  nb_star                     114

Great! Every column is now either an `int` or a `float`. Let's also determine the number of 'good' and 'bad' URLs in the dataframe.

In [18]:
print(df_url['status'].value_counts())

status
0    5715
1    5715
Name: count, dtype: int64


The dataset has an equal number of `good` and `bad` URLs (5715 each). Now, let's examine the unique values of each feature.



In [14]:
for feature in df_url.columns:
    print(f"{feature}: {df_url[feature].unique()}")

length_url: [  82   93  121   24   73   47   66   80   56   71   20   97   51  104
   52  160   35   22   36   33   30   50   31   34   79   94   59   48
   61  102   85   55   44   78   81   53   62   65   68  105   27   74
   92   40   25  107   26   32   46   57  110   43   98   41   28   21
   69   49  100  268   39   67   23   83  130  120  214  112   37   87
   58  109   42   29   89   54   76  125  117   18   95   45  126   86
  219  190   60  218   38  111   77   17   19   70   63   75   15  106
  202   64  113  243   72  430  127  263  180  118  151  137  253  131
  140  159  114  124  199  176  139  128  119  293  938  149   16   84
   90  150   96  230  123  129   99  138  196  188  330  158  425  133
  461  216  122   91  203  332  155  173  136  204  116  396  565  187
  156  157  250  464  208  267   88  264  238  132  278  101  162  103
  154  135  177  404  148  141  115  147  415  153  152  552  476  305
  259  437  254  276  165  172  201  108  629  206  142  536  299

A majority of the features contain only 0s and 1s. Let's quickly see which of these binary features greatly influence the results.

In [17]:

for feature in df_url.columns:
    if np.array_equal(df_url[feature].unique(), np.array([0, 1])) or np.array_equal(df_url[feature].unique(), np.array([1, 0])):
        print(feature)
        print(f'Number of Good URLs tagged as 0: {len(df_url[(df_url[feature] == 0) & (df_url["status"] == 0)])}')
        print(f'Number of Good URLs tagged as 1: {len(df_url[(df_url[feature] == 1) & (df_url["status"] == 0)])}')
        print(f'Number of Bad URLs tagged as 0: {len(df_url[(df_url[feature] == 0) & (df_url["status"] == 1)])}')
        print(f'Number of Bad URLs tagged as 1: {len(df_url[(df_url[feature] == 1) & (df_url["status"] == 1)])}')
        print()

ip
Number of Good URLs tagged as 0: 5512
Number of Good URLs tagged as 1: 203
Number of Bad URLs tagged as 0: 4197
Number of Bad URLs tagged as 1: 1518

nb_tilde
Number of Good URLs tagged as 0: 5691
Number of Good URLs tagged as 1: 24
Number of Bad URLs tagged as 0: 5663
Number of Bad URLs tagged as 1: 52

nb_star
Number of Good URLs tagged as 0: 5715
Number of Good URLs tagged as 1: 0
Number of Bad URLs tagged as 0: 5707
Number of Bad URLs tagged as 1: 8

nb_dslash
Number of Good URLs tagged as 0: 5711
Number of Good URLs tagged as 1: 4
Number of Bad URLs tagged as 0: 5644
Number of Bad URLs tagged as 1: 71

https_token
Number of Good URLs tagged as 0: 2543
Number of Good URLs tagged as 1: 3172
Number of Bad URLs tagged as 0: 1904
Number of Bad URLs tagged as 1: 3811

punycode
Number of Good URLs tagged as 0: 5715
Number of Good URLs tagged as 1: 0
Number of Bad URLs tagged as 0: 5711
Number of Bad URLs tagged as 1: 4

port
Number of Good URLs tagged as 0: 5704
Number of Good URLs ta

It looks like `external_favicon` and `google_index` are good indicators of whether a URL is good or bad. We shall see soon.
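
The four counts printed per feature form a 2x2 contingency table, which `pd.crosstab` can build directly. The sketch below uses a tiny made-up frame; the column names mirror the dataset, but the values are invented and do not reflect the real counts.

```python
import pandas as pd

# Invented values for illustration; not taken from the phishing dataset.
toy = pd.DataFrame({
    "google_index": [0, 0, 1, 1, 1, 0],
    "status":       [0, 0, 1, 1, 0, 0],
})

# Rows: feature value, columns: status (0 = good, 1 = bad).
counts = pd.crosstab(toy["google_index"], toy["status"])

# normalize="columns" gives the share of each feature value within a class.
shares = pd.crosstab(toy["google_index"], toy["status"], normalize="columns")

print(counts)
print(shares)
```

Comparing the two columns of `shares` makes it easier to spot features whose distribution differs strongly between good and bad URLs.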

## **Data Modelling**

Before we proceed, let's once again check the first few rows of our dataframe.

In [19]:
df_url.head()

Unnamed: 0,length_url,length_hostname,ip,nb_dots,nb_hyphens,nb_at,nb_qm,nb_and,nb_eq,nb_underscore,...,domain_in_title,domain_with_copyright,whois_registered_domain,domain_registration_length,domain_age,web_traffic,dns_record,google_index,page_rank,status
0,82,23,0,2,7,0,0,0,0,0,...,1,1,0,240,8892,67860,0,1,4,0
1,93,14,1,2,0,0,0,0,0,0,...,1,0,1,0,2996,4189860,0,1,2,1
2,121,21,1,3,0,0,0,0,0,0,...,1,1,0,30,2527,346022,0,1,3,1
3,24,16,0,2,0,0,0,0,0,0,...,0,0,0,139,7531,1059151,0,0,4,0
4,73,24,0,3,1,0,0,0,0,5,...,0,0,0,3002,7590,635,0,1,5,0


Now, let's randomize the rows before we split the data.

In [20]:
df_url = df_url.sample(frac=1).reset_index(drop=True)
df_url.head(10)

Unnamed: 0,length_url,length_hostname,ip,nb_dots,nb_hyphens,nb_at,nb_qm,nb_and,nb_eq,nb_underscore,...,domain_in_title,domain_with_copyright,whois_registered_domain,domain_registration_length,domain_age,web_traffic,dns_record,google_index,page_rank,status
0,45,37,0,4,0,0,0,0,0,0,...,1,1,0,181,4202,0,0,0,0,1
1,50,27,0,2,1,0,0,0,0,0,...,1,0,0,128,4255,0,0,0,3,0
2,40,31,1,4,0,0,0,0,0,0,...,1,1,0,228,5616,0,0,1,5,1
3,32,17,0,1,0,0,0,0,0,1,...,1,0,0,104,626,0,0,0,0,1
4,61,20,1,2,0,0,0,0,0,0,...,1,0,0,24,7645,0,0,1,0,1
5,65,24,0,3,0,0,0,0,0,2,...,1,0,0,2816,4489,0,0,0,0,0
6,62,24,0,3,3,0,0,0,0,0,...,1,1,0,635,5208,27339,0,0,5,0
7,53,16,0,3,0,0,1,0,1,0,...,1,0,0,221,8546,21974,0,0,5,0
8,52,18,0,2,0,0,1,0,1,0,...,1,0,0,542,5301,0,0,1,0,1
9,89,23,1,2,0,0,0,0,0,0,...,1,0,0,270,461,0,0,1,0,1


We can see that the rows have been randomized. Now, let's split the dataset into training and testing sets.

In [21]:
X = df_url.drop('status', axis=1)
y = df_url['status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Length of X_train: {len(X_train)}")
print(f"Length of X_test: {len(X_test)}")
print(f"Length of y_train: {len(y_train)}")
print(f"Length of y_test: {len(y_test)}")

Length of X_train: 9144
Length of X_test: 2286
Length of y_train: 9144
Length of y_test: 2286
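
`train_test_split` shuffles rows by default, and its `stratify` argument keeps the class ratio identical in both splits. The following is a small sketch on synthetic arrays (`X_toy` and `y_toy` are made-up stand-ins for `X` and `y`); stratification is an optional variation, not something the notebook above uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, balanced data standing in for X and y.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 50 + [1] * 50)

# stratify=y_toy preserves the 50/50 class balance in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"Train size: {len(X_tr)}, positives: {int(y_tr.sum())}")
print(f"Test size: {len(X_te)}, positives: {int(y_te.sum())}")
```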


Data modelling can now proceed. Let's model our data using `MLPClassifier`.

In [23]:
mlp = MLPClassifier(hidden_layer_sizes=(200, 200, 200, 200, 200), activation='tanh', max_iter=10000, alpha=0.0001, solver='adam', random_state=42, tol=0.00001)
mlp_model = mlp.fit(X_train, y_train)
print(f"Training score: {mlp_model.score(X_train, y_train)}")
print(f"Testing score: {mlp_model.score(X_test, y_test)}")

Training score: 0.7981189851268592
Testing score: 0.789588801399825


With training and testing scores both below 0.80, `MLPClassifier` underfits the data. Let's try using `GaussianNB`.

In [24]:
gnb = GaussianNB()
gnb_model = gnb.fit(X_train, y_train)
print(f"Training score: {gnb_model.score(X_train, y_train)}")
print(f"Testing score: {gnb_model.score(X_test, y_test)}")

Training score: 0.7490157480314961
Testing score: 0.7405949256342957


What about `BernoulliNB`?

In [25]:
bnb = BernoulliNB()
bnb_model = bnb.fit(X_train, y_train)
print(f"Training score: {bnb_model.score(X_train, y_train)}")
print(f"Testing score: {bnb_model.score(X_test, y_test)}")

Training score: 0.8794838145231846
Testing score: 0.8862642169728784

Lastly, let's try `LogisticRegression`.


In [28]:
lr = LogisticRegression(max_iter=1000000, random_state=42, tol=0.0001)
lr_model = lr.fit(X_train, y_train)
print(f"Training score: {lr_model.score(X_train, y_train)}")
print(f"Testing score: {lr_model.score(X_test, y_test)}")

Training score: 0.8842957130358705
Testing score: 0.8902012248468941


STOP: TOTAL NO. of f AND g EVALUATIONS EXCEEDS LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
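
The warning above suggests scaling the data. One common way to do that is to chain `StandardScaler` and the classifier in a pipeline, so that the scaler is fitted on the training split only. The sketch below runs on synthetic data, not on the phishing dataset: `flag`, `traffic`, and the toy label are assumptions made purely for illustration, and no claim is made about how the real scores would change.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (not the phishing dataset): a 0/1 flag next to a large,
# skewed count, loosely mimicking columns such as `ip` and `web_traffic`.
rng = np.random.default_rng(0)
n_rows = 400
flag = rng.integers(0, 2, size=n_rows)
traffic = rng.exponential(scale=1e6, size=n_rows)
X_toy = np.column_stack([flag, traffic])
y_toy = flag  # toy label: fully determined by the flag, illustration only

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The scaler is fitted on the training split, then applied to both splits.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

test_score = scaled_lr.score(X_te, y_te)
n_iter = scaled_lr.named_steps["logisticregression"].n_iter_
print(f"Testing score: {test_score}")
print(f"lbfgs iterations: {n_iter}")
```

The same pipeline pattern would apply to `MLPClassifier`, which is also sensitive to feature scale.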
