# **Phishing Website Detection using Decision Trees**

**A CS 180 Machine Learning Project**

## **Main Objective**

In the digital age, the Internet is essential for communication and commerce but also brings security threats, with phishing being a major concern. Phishing tricks people into giving up sensitive information by posing as legitimate entities, leading to financial loss and identity theft (Dutta, 2021). The main challenge in fighting phishing is its evolving nature, as cybercriminals constantly update their tactics, outpacing traditional methods like blacklists (Almenari & Alshammari, 2023).

Machine Learning (ML) offers a promising solution by analyzing various website characteristics, such as URL length, HTTPS usage, and PageRank, to predict phishing attempts. This approach is necessary given the limitations of current security measures (Dutta, 2021). Our project trains ML models on a diverse set of features, with the aim of improving prediction accuracy. For example, a short URL may not be suspicious on its own, but combined with a low PageRank and no HTTPS, it could indicate a phishing site (Almenari & Alshammari, 2023).
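
To make the idea of "website characteristics" concrete, here is a minimal sketch of how a few URL-based features could be computed with the Python standard library. The helper name `extract_url_features` and the exact feature definitions are assumptions for illustration only; they are not necessarily how the features in this project's dataset were produced.

```python
import ipaddress
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Return a few illustrative lexical features for a URL (hypothetical helper)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # Treat the URL as "has IP" when the host parses as an IPv4/IPv6 address.
    try:
        ipaddress.ip_address(host)
        has_ip = 1
    except ValueError:
        has_ip = 0

    # Depth is taken here as the number of non-empty path segments.
    depth = len([segment for segment in parsed.path.split("/") if segment])

    return {
        "url_length": len(url),
        "uses_https": int(parsed.scheme == "https"),
        "has_at": int("@" in url),
        "has_ip": has_ip,
        "url_depth": depth,
    }


print(extract_url_features("http://192.168.0.1/login/verify"))
```

No single value is decisive here; a classifier would look at combinations of such features, as described above.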

This project has practical implications for enhancing online security, offering real-time warnings about potential phishing sites and aiding cybersecurity professionals in identifying threats more efficiently (Dutta, 2021). In summary, using ML to detect phishing websites based on various characteristics is a novel, challenging, and valuable endeavor to make the Internet safer for all users.

## **Preliminaries**


First, Google Drive must be mounted.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Then, the working directory must be changed to the dataset folder.

In [2]:
%cd /content/drive/Shared drives/CS 180 2324B Final Project/datasets

/content/drive/Shared drives/CS 180 2324B Final Project/datasets


The folder must contain three .csv files: `shorturl_services_list.csv`, `preprocessed_dataset.csv`, and `final_dataset.csv`. For this project, `final_dataset.csv` will be used.

In [3]:
%ls

final_dataset.csv  preprocessed_dataset.csv  shorturl_services_list.csv
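
As an optional alternative to inspecting the listing by eye, the check can be done programmatically. The following is a small sketch using only the standard library; the helper name `missing_files` is hypothetical, and it assumes the current working directory is the dataset folder.

```python
from pathlib import Path

# File names taken from the description above.
REQUIRED_FILES = [
    "shorturl_services_list.csv",
    "preprocessed_dataset.csv",
    "final_dataset.csv",
]


def missing_files(folder=".", required=REQUIRED_FILES):
    """Return the required file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in required if not (folder / name).is_file()]


missing = missing_files()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are present.")
```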


Let's also import the libraries needed for this project.

In [5]:
import pandas as pd

Finally, let's import the dataset that will be used.

In [6]:
df_url = pd.read_csv('final_dataset.csv')
df_url.head()

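| Unnamed: 0 | Domain | Have_IP | Have_At | URL_Length | URL_Depth | Redirection | https_Domain | TinyURL | Prefix/Suffix | DNS_Record | Domain_Age | Domain_End | iFrame | Mouse_Over | Right_Click | Web_Forwards | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ucmo.edu | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 1 | amazon.com | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 2 | juicyfinder.com | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 3 | martindale.com | 0 | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 4 | montrealladies.com | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |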
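
The `Unnamed: 0` column in the output above is the name pandas assigns to a CSV column with an empty header, which typically happens when a DataFrame was saved together with its index. Assuming that is the case here (an assumption, not something the dataset confirms), the sketch below shows how `index_col=0` in `pd.read_csv` reads that column as the index instead. It uses a tiny in-memory stand-in for the file, with only a few of the real columns.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for final_dataset.csv (illustrative subset of columns).
csv_text = ",Domain,Have_IP,Label\n0,ucmo.edu,0,0\n1,amazon.com,0,0\n"

# Default read: the unnamed first column becomes a regular column.
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0: the first column is used as the row index.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

Either way, the column carries no information about the website itself, so it should not be used as a model feature.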


## **Data Preprocessing**

First, let's determine