# Phishing URL Detection

## Introduction

### Relevant Background Info

Phishing is a cybercrime that lures unsuspecting victims into clicking on malicious URLs. Attackers impersonate a legitimate institution and contact the victim through emails, texts, or social media. Phishing attacks are increasingly sophisticated and hard to spot, and as we rely more on our online profiles, phishing messages that steal sensitive information become more dangerous. Phishing threatens not only individuals but also large corporations. For instance, in the 2021 Colonial Pipeline attack, attackers used a compromised password to access the company network and deploy ransomware, and the company paid a ransom of roughly 4.4 million US dollars. The company shut down its pipeline for several days, and fuel prices rose in the affected regions. Stolen credentials, which phishing is designed to harvest, can therefore cause severe damage, so more sophisticated methods are needed to prevent phishing. To counteract these dangers, our group will classify URLs as 'phishing' or 'legitimate' to warn victims before attackers steal their sensitive information.

### Predictive Question

Given the attributes of a URL, can we predict whether it is 'phishing' or 'legitimate'?

### Dataset

The dataset used in this project comes from: https://data.mendeley.com/datasets/c2gw7fy2j4/3/files/575316f4-ee1d-453e-a04f-7b950915b61b <br>
The dataset is used in the article <i><a href="https://www.sciencedirect.com/science/article/pii/S0952197621001950#">Towards benchmark datasets for machine learning based website phishing detection: An experimental study</a></i>, published in the journal <i><a href="https://www.sciencedirect.com/journal/engineering-applications-of-artificial-intelligence">Engineering Applications of Artificial Intelligence</a></i>.

## Preliminary Exploratory Data Analysis

Loading necessary libraries.

In [2]:
library(tidyverse)
library(repr)
library(tidymodels)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

### Reading the Data

As no host allows us to read the dataset directly from the web, we have downloaded the dataset and will read it locally (on the Jupyter server). The credibility of the data is discussed above, in the *Dataset* section of this proposal.

In [3]:
options(repr.matrix.max.rows = 5)
phishing_data <- read_csv("data/dataset_phishing.csv")
phishing_data

Rows: 11430 Columns: 89
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): url, status
dbl (87): length_url, length_hostname, ip, nb_dots, nb_hyphens, nb_at, nb_qm...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| url | length_url | length_hostname | ip | nb_dots | nb_hyphens | nb_at | nb_qm | nb_and | nb_or | ⋯ | domain_in_title | domain_with_copyright | whois_registered_domain | domain_registration_length | domain_age | web_traffic | dns_record | google_index | page_rank | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| `http://www.crestonwood.com/router.php` | 37 | 19 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 1 | 0 | 45 | -1 | 0 | 1 | 1 | 4 | legitimate |
| `http://shadetreetechnology.com/V4/validation/a111aedc8ae390eabcfa130e041a10a4` | 77 | 23 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ⋯ | 1 | 0 | 0 | 77 | 5767 | 0 | 0 | 1 | 2 | phishing |
| `https://support-appleld.com.secureupdate.duilawyeryork.com/ap/89e6a3b4b063b8d/?cmd=_update&dispatch=89e6a3b4b063b8d1b&locale=_` | 126 | 50 | 1 | 4 | 1 | 0 | 1 | 2 | 0 | ⋯ | 1 | 0 | 0 | 14 | 4004 | 5828815 | 0 | 1 | 0 | phishing |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| `http://www.mypublicdomainpictures.com/` | 38 | 30 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ⋯ | 1 | 0 | 0 | 85 | 2836 | 2455493 | 0 | 0 | 4 | legitimate |
| `http://174.139.46.123/ap/signin?openid.pape.max_auth_age=0&amp;openid.return_to=https%3A%2F%2Fwww.amazon.co.jp%2F%3Fref_%3Dnav_em_hd_re_signin&amp;openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&amp;openid.assoc_handle=jpflex&amp;openid.mode=checkid_setup&amp;key=a@b.c&amp;openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&amp;openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&amp;&amp;ref_=nav_em_hd_clc_signin` | 477 | 14 | 1 | 24 | 0 | 1 | 1 | 9 | 0 | ⋯ | 1 | 1 | 1 | 0 | -1 | 0 | 1 | 1 | 0 | phishing |
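
Most feature columns in the output above are numeric summaries of the URL string itself. As an illustration only, a few of them can be recomputed with `stringr`. The hostname regular expression and the name `url_features` are assumptions of this sketch, not the dataset authors' extraction code.

```r
library(tibble)
library(dplyr)
library(stringr)

# Two URLs copied from the dataset output shown in this notebook
urls <- tibble(url = c(
  "http://www.crestonwood.com/router.php",
  "http://www.iracing.com/tracks/gateway-motorsports-park/"
))

url_features <- urls |>
  mutate(
    # Assumed rule: the hostname is everything between "://" and the next "/"
    hostname = str_extract(url, "(?<=://)[^/]+"),
    length_url = str_length(url),
    length_hostname = str_length(hostname),
    # fixed() matches the literal character instead of a regex wildcard
    nb_dots = str_count(url, fixed(".")),
    nb_hyphens = str_count(url, fixed("-"))
  )

url_features
```

For these two URLs the sketch matches the values in the dataset: `length_url` is 37 and 55, `length_hostname` is 19 and 15, `nb_dots` is 3 and 2, and `nb_hyphens` is 0 and 2.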


### Cleaning and Wrangling the Data into a tidy format

The data we have found already satisfies all the rules of tidy data:
- Each row is a single observation.
- Each column is a single variable.
- Each cell holds a single value (no entry in the data frame is shared).

Therefore, reshaping the data, for example by pivoting wider or pivoting longer, would make it less tidy. Additionally, as the data comes in a single table, we do NOT need to merge it with other datasets.
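
The tidy-data claims above can be spot-checked in code. Below is a minimal sketch on a hypothetical three-row stand-in (`toy`, not the real dataset) that counts missing values, duplicated URLs, and distinct labels. The same calls could be pointed at `phishing_data`.

```r
library(tibble)
library(dplyr)

# Hypothetical stand-in with the same kinds of columns as phishing_data
toy <- tibble(
  url = c("http://a.example/", "http://b.example/login", "http://c.example/"),
  length_url = c(17, 22, 17),
  status = c("legitimate", "phishing", "legitimate")
)

checks <- tibble(
  n_missing = sum(is.na(toy)),                 # any NA in any cell
  n_duplicate_urls = sum(duplicated(toy$url)), # repeated observations
  n_status_values = n_distinct(toy$status)     # expect exactly two labels
)

checks
```

Zero missing values, zero duplicated URLs, and exactly two label values would be consistent with one observation per row and a clean binary response.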

The only preparation we need is ***selecting the important columns*** and ***separating the dataset into a training set and a testing set***.

In [24]:
# Keep the URL, its URL-based features, the domain registration length, and the label
phishing_imp <- phishing_data |>
  select(url, length_url, length_hostname, nb_dots, nb_hyphens, nb_at, nb_qm, nb_and, nb_or, nb_eq, nb_underscore, nb_tilde,
         nb_percent, nb_slash, nb_star, nb_colon, nb_comma, nb_semicolumn, nb_dollar, nb_space, nb_www, nb_com, nb_dslash,
         http_in_path, https_token, ratio_digits_url, ratio_digits_host, tld_in_path, tld_in_subdomain, nb_subdomains,
         longest_word_host, longest_word_path, avg_words_raw, avg_word_host, domain_registration_length, status)

# Fix the random seed so the split is reproducible (the seed value is arbitrary)
set.seed(2022)

# Stratified 75/25 split: keeps the share of each status similar in both sets
phishing_split <- initial_split(phishing_imp, prop = 3/4, strata = status)
phishing_train <- training(phishing_split)
phishing_test <- testing(phishing_split)
phishing_train
phishing_test

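Training set (`phishing_train`):

| url | length_url | length_hostname | nb_dots | nb_hyphens | nb_at | nb_qm | nb_and | nb_or | nb_eq | ⋯ | ratio_digits_host | tld_in_path | tld_in_subdomain | nb_subdomains | longest_word_host | longest_word_path | avg_words_raw | avg_word_host | domain_registration_length | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| `http://www.crestonwood.com/router.php` | 37 | 19 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 3 | 11 | 6 | 5.750000 | 7 | 45 | legitimate |
| `http://rgipt.ac.in` | 18 | 11 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 2 | 5 | 0 | 5.000000 | 5 | 62 | legitimate |
| `http://www.iracing.com/tracks/gateway-motorsports-park/` | 55 | 15 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 2 | 7 | 11 | 6.333333 | 5 | 224 | legitimate |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| `http://www.heinzreber.net/homeflash1.html` | 41 | 18 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 3 | 10 | 10 | 6.750000 | 6.5 | 977 | phishing |
| `http://www.peoplemakingplaces.com/includes/Support/En/log/signin/customer_center/customer-IDPP00C644/myaccount/signin` | 117 | 26 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 2 | 18 | 10 | 7.230769 | 10.5 | 134 | phishing |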


Testing set (`phishing_test`):

| url | length_url | length_hostname | nb_dots | nb_hyphens | nb_at | nb_qm | nb_and | nb_or | nb_eq | ⋯ | ratio_digits_host | tld_in_path | tld_in_subdomain | nb_subdomains | longest_word_host | longest_word_path | avg_words_raw | avg_word_host | domain_registration_length | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| `http://appleid.apple.com-app.es/` | 32 | 24 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 3 | 7 | 0 | 4.5 | 4.5 | 0 | phishing |
| `http://www.shadetreetechnology.com/V4/validation/ba4b8bddd7958ecb8772c836c2969531` | 81 | 27 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 2 | 19 | 32 | 13.2 | 11.0 | 76 | phishing |
| `https://www.missfiga.com/` | 25 | 16 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 2 | 8 | 0 | 5.5 | 5.5 | 880 | legitimate |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| `https://www.facebook.com/Interactive-Television-Pvt-Ltd-Group-M-100230523435650/photos/?ref=page_internal` | 105 | 16 | 2 | 6 | 0 | 1 | 0 | 0 | 1 | ⋯ | 0.0000000 | 0 | 0 | 2 | 8 | 15 | 6.153846 | 5.50 | 2809 | legitimate |
| `http://174.139.46.123/ap/signin?openid.pape.max_auth_age=0&amp;openid.return_to=https%3A%2F%2Fwww.amazon.co.jp%2F%3Fref_%3Dnav_em_hd_re_signin&amp;openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&amp;openid.assoc_handle=jpflex&amp;openid.mode=checkid_setup&amp;key=a@b.c&amp;openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&amp;openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&amp;&amp;ref_=nav_em_hd_clc_signin` | 477 | 14 | 24 | 0 | 1 | 1 | 9 | 0 | 9 | ⋯ | 0.7857143 | 1 | 1 | 3 | 3 | 12 | 4.377778 | 2.75 | 0 | phishing |
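
Because the split is stratified on `status`, both partitions should contain similar shares of each class. The sketch below checks this on a hypothetical balanced stand-in (`toy`, 200 simulated rows). The simulated data, the seed, and the name `class_share` are assumptions for illustration and say nothing about the class balance of the real dataset.

```r
library(tibble)
library(dplyr)
library(rsample)

set.seed(1)  # arbitrary seed so the sketch is repeatable

# Hypothetical stand-in for phishing_imp: one numeric feature, balanced labels
toy <- tibble(
  length_url = rpois(200, 50),
  status = factor(rep(c("legitimate", "phishing"), each = 100))
)

# Same call pattern as the notebook: 75/25 split stratified on the label
toy_split <- initial_split(toy, prop = 3/4, strata = status)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Share of each class within each partition
class_share <- bind_rows(
  train = count(toy_train, status),
  test  = count(toy_test, status),
  .id = "partition"
) |>
  group_by(partition) |>
  mutate(share = n / sum(n)) |>
  ungroup()

class_share
```

Running the same `count()` and `mutate()` steps on `phishing_train` and `phishing_test` would show whether the real split preserved the class proportions.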


### Data Summary

<p style="color:red;">TODO</p>

### Data Visualization

<p style="color:red;">TODO</p>

## Methods

### How we will conduct data analysis

<p style="color:red;">TODO</p>

### How we will visualize the results

<p style="color:red;">TODO</p>

## Expected Outcomes and Significance

<p style="color:red;">TODO</p>