# **Phishing Website Detection using Decision Trees**

**A CS 180 Machine Learning Project**

## **Main Objective**

In the digital age, the Internet is essential for communication and commerce but also brings security threats, with phishing being a major concern. Phishing tricks people into giving up sensitive information by posing as legitimate entities, leading to financial loss and identity theft (Dutta, 2021). The main challenge in fighting phishing is its evolving nature, as cybercriminals constantly update their tactics, outpacing traditional methods like blacklists (Almenari & Alshammari, 2023).

Machine Learning (ML) offers a promising solution by analyzing website characteristics, such as URL length, HTTPS usage, and PageRank, to predict phishing attempts. This approach is necessary given the limitations of current security measures (Dutta, 2021). Our project uses a diverse set of features to train ML models, with the aim of improving prediction accuracy. For example, a short URL may not be suspicious on its own, but combined with a low PageRank and no HTTPS, it could indicate a phishing site (Almenari & Alshammari, 2023).

This project has practical implications for enhancing online security, offering real-time warnings about potential phishing sites and aiding cybersecurity professionals in identifying threats more efficiently (Dutta, 2021). In summary, using ML to detect phishing websites based on various characteristics is a novel, challenging, and valuable endeavor to make the Internet safer for all users.

## **Preliminaries**


Let's import the libraries needed for this project.

In [31]:
import pandas as pd

Next, let's load the dataset that will be used. Note that this dataset was constructed via `dataset-construction.ipynb`, which was adapted from the Google Colab notebook created by GitHub user shreyagopal that can be found [here](https://colab.research.google.com/github/shreyagopal/Phishing-Website-Detection-by-Machine-Learning-Techniques/blob/master/URL%20Feature%20Extraction.ipynb).

In [32]:
df_url = pd.read_csv('./datasets/final_dataset.csv')
df_url.head()

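| | Domain | Have_IP | Have_At | URL_Length | URL_Depth | Redirection | https_Domain | TinyURL | Prefix/Suffix | DNS_Record | Domain_Age | Domain_End | iFrame | Mouse_Over | Right_Click | Web_Forwards | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ucmo.edu | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 1 | amazon.com | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 2 | juicyfinder.com | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 3 | martindale.com | 0 | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 4 | montrealladies.com | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |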


## **Data Preprocessing**

First, let's evaluate the dataframe.

In [33]:
# Determine the shape of the dataframe
df_url.shape

(15000, 17)

In [34]:
# Determine its columns
df_url.columns

Index(['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth',
       'Redirection', 'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record',
       'Domain_Age', 'Domain_End', 'iFrame', 'Mouse_Over', 'Right_Click',
       'Web_Forwards', 'Label'],
      dtype='object')

This indicates that the dataframe has `15000 rows` and `17 columns`. `Domain` identifies each URL, `Label` is the target class, and the remaining 15 columns are features extracted for that URL. Now let's explore the dataframe further.

In [35]:
df_url.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Domain         15000 non-null  object
 1   Have_IP        15000 non-null  int64 
 2   Have_At        15000 non-null  int64 
 3   URL_Length     15000 non-null  int64 
 4   URL_Depth      15000 non-null  int64 
 5   Redirection    15000 non-null  int64 
 6   https_Domain   15000 non-null  int64 
 7   TinyURL        15000 non-null  int64 
 8   Prefix/Suffix  15000 non-null  int64 
 9   DNS_Record     15000 non-null  int64 
 10  Domain_Age     15000 non-null  int64 
 11  Domain_End     15000 non-null  int64 
 12  iFrame         15000 non-null  int64 
 13  Mouse_Over     15000 non-null  int64 
 14  Right_Click    15000 non-null  int64 
 15  Web_Forwards   15000 non-null  int64 
 16  Label          15000 non-null  int64 
dtypes: int64(16), object(1)
memory usage: 1.9+ MB


Every column has dtype `int64` except `Domain`, which holds strings (dtype `object`), and no column has missing values. Let's look at the values each feature takes.

In [36]:
for feature in df_url.columns:
    print(f'{feature}: {df_url[feature].unique()}')

Domain: ['ucmo.edu' 'amazon.com' 'juicyfinder.com' ... 'linkzip.net' 'smxqfps.biz'
 'klaretech.com']
Have_IP: [0 1]
Have_At: [0 1]
URL_Length: [0 1]
URL_Depth: [ 4  2  3  1  6  5  7  8  9 10 11 14 13 17 12 15 16 25 19]
Redirection: [0 1]
https_Domain: [0]
TinyURL: [0 1]
Prefix/Suffix: [0 1]
DNS_Record: [0 1]
Domain_Age: [1 0]
Domain_End: [1 0]
iFrame: [1 0]
Mouse_Over: [1 0]
Right_Click: [1]
Web_Forwards: [1 0]
Label: [0 1]
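
The per-column listing above can also be condensed into a single table. The following is a minimal sketch on a small made-up dataframe (`toy` and its values are hypothetical stand-ins, not the project's dataset); it pairs each column's dtype with its number of distinct values, which makes single-valued columns easy to spot.

```python
import pandas as pd

# Hypothetical stand-in for df_url; the values are made up for illustration
toy = pd.DataFrame({
    'Domain': ['a.com', 'b.org', 'c.net'],
    'URL_Depth': [4, 2, 4],
    'Right_Click': [1, 1, 1],
    'Label': [0, 0, 1],
})

# One row per column: its dtype and how many distinct values it takes
summary = pd.DataFrame({
    'dtype': toy.dtypes.astype(str),
    'n_unique': toy.nunique(),
})
print(summary)
```

A column whose `n_unique` is 1 takes the same value in every row.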


Since `https_Domain` and `Right_Click` each take a single value across all rows, they carry no information for distinguishing phishing websites from legitimate ones, so we can remove both columns from the dataframe.

In [37]:
# Drop the two constant columns, then confirm the new shape
df_url.drop(columns=['https_Domain', 'Right_Click'], inplace=True)
df_url.shape

(15000, 15)

The dataframe now only has `15 columns`.
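
Instead of naming the constant columns by hand, they could also be found programmatically. The sketch below is one possible approach, not the method used in this notebook; it runs on a hypothetical toy dataframe (`toy`) so that it is self-contained, and it assumes that "constant" means exactly one distinct value.

```python
import pandas as pd

# Hypothetical stand-in for df_url; the values are made up for illustration
toy = pd.DataFrame({
    'Have_IP': [0, 1, 0, 0],
    'https_Domain': [0, 0, 0, 0],
    'Right_Click': [1, 1, 1, 1],
    'Label': [0, 1, 0, 1],
})

# A column with a single distinct value cannot help separate the classes
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# Return a new dataframe without those columns (toy itself is unchanged)
reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(reduced.shape)
```

Applied to `df_url`, the same list comprehension would be expected to pick out `https_Domain` and `Right_Click`, given the unique values printed earlier.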