# Phishing URL Detection

## Introduction

### Relevant Background Info

Phishing is a cybercrime in which attackers pose as a trusted institution and bait unsuspecting victims into clicking malicious URLs, typically delivered by email or social media. Phishing is often used to steal user data, and as more of our data moves online, the attacks are becoming more damaging. Phishing threatens large corporations as well as individuals. For instance, in the 2021 Colonial Pipeline attack, attackers used compromised credentials to breach the company, which paid a ransom of roughly 4.4 million US dollars. To counteract these dangers, our group will classify URLs as 'phishing' or 'legitimate' so that victims can be warned before attackers steal their sensitive information.

### Predictive Question

Can we classify a URL as phishing or legitimate?

### Dataset

The dataset used in this project comes from Mendeley Data: https://data.mendeley.com/datasets/c2gw7fy2j4/3/files/575316f4-ee1d-453e-a04f-7b950915b61b
The dataset is used in an article published in the journal Engineering Applications of Artificial Intelligence.

## Preliminary Exploratory Data Analysis

In [2]:
library(tidyverse)
library(repr)
library(tidymodels)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

### Reading the Data

In [3]:
options(repr.matrix.max.rows = 5)
phishing_data <- read_csv("https://brianhan.tech/media/dsci/dataset_phishing.csv")

Rows: 11430 Columns: 89
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): url, status
dbl (87): length_url, length_hostname, ip, nb_dots, nb_hyphens, nb_at, nb_qm...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [4]:
phishing_subset <- phishing_data |>
                   select(status, length_url, length_hostname, nb_dots, nb_hyphens, nb_at, nb_qm, nb_and, nb_or, nb_eq, nb_underscore, nb_tilde,
                          nb_percent, nb_slash, nb_star, nb_colon, nb_comma, nb_semicolumn, nb_dollar, nb_space, http_in_path, https_token, 
                          ratio_digits_url, ratio_digits_host, nb_subdomains,longest_word_host, longest_word_path, avg_words_raw, avg_word_host, 
                          domain_registration_length)
phishing_subset

status,length_url,length_hostname,nb_dots,nb_hyphens,nb_at,nb_qm,nb_and,nb_or,nb_eq,⋯,http_in_path,https_token,ratio_digits_url,ratio_digits_host,nb_subdomains,longest_word_host,longest_word_path,avg_words_raw,avg_word_host,domain_registration_length
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
legitimate,37,19,3,0,0,0,0,0,0,⋯,0,1,0.0000000,0,3,11,6,5.75,7.0,45
phishing,77,23,1,0,0,0,0,0,0,⋯,0,1,0.2207792,0,1,19,32,15.75,19.0,77
phishing,126,50,4,1,0,1,2,0,3,⋯,0,0,0.1507937,0,3,13,17,8.25,8.4,14
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
legitimate,38,30,2,0,0,0,0,0,0,⋯,0,1,0.00000000,0.0000000,2,22,0,12.500000,12.50,85
phishing,477,14,24,0,1,1,9,0,9,⋯,4,1,0.08595388,0.7857143,3,3,12,4.377778,2.75,0


### Choosing Relevant Predictors using Forward Selection

Credit goes to datasciencebook.ca, Classification II: evaluation & tuning, 6.8.3 Forward Selection in R

In [7]:
phishing_subset <- phishing_data |>
                   select(status, length_url, length_hostname, nb_dots, nb_hyphens, nb_at, nb_qm, nb_and, nb_or, nb_eq, nb_underscore, nb_tilde,
                          nb_percent, nb_slash, nb_star, nb_colon, nb_comma, nb_semicolumn, nb_dollar, nb_space, http_in_path, https_token, 
                          ratio_digits_url, ratio_digits_host, nb_subdomains,longest_word_host, longest_word_path, avg_words_raw, avg_word_host, 
                          domain_registration_length) %>%
                   mutate(https_token = as_factor(https_token))

names <- colnames(phishing_subset |> select(-status))

# create an empty tibble to store the results
accuracies <- tibble(size = integer(), 
                     model_string = character(), 
                     accuracy = numeric())

# create a model specification
knn_spec <- nearest_neighbor(weight_func = "rectangular", 
                             neighbors = tune()) |>
     set_engine("kknn") |>
     set_mode("classification")

# create a 5-fold cross-validation object
phishing_vfold <- vfold_cv(phishing_subset, v = 5, strata = status)

# store the total number of predictors
n_total <- length(names)

# stores selected predictors
selected <- c()

# for every size from 1 to the total number of predictors
for (i in 1:n_total) {
    # for every predictor still not added yet
    accs <- list()
    models <- list()
    for (j in 1:length(names)) {
        # create a model string for this combination of predictors
        preds_new <- c(selected, names[[j]])
        model_string <- paste("status", "~", paste(preds_new, collapse="+"))

        # create a recipe from the model string
        phishing_recipe <- recipe(as.formula(model_string), 
                                  data = phishing_subset) |>
                          step_scale(all_numeric_predictors()) |>
                          step_center(all_numeric_predictors())

        # tune the KNN classifier with these predictors, 
        # and collect the accuracy for the best K
        acc <- workflow() |>
               add_recipe(phishing_recipe) |>
               add_model(knn_spec) |>
               tune_grid(resamples = phishing_vfold, grid = 10) |>
               collect_metrics() |>
               filter(.metric == "accuracy") |>
               summarize(mx = max(mean))
        acc <- acc$mx |> unlist()

        # add this result to the dataframe
        accs[[j]] <- acc
        models[[j]] <- model_string
    }
    jstar <- which.max(unlist(accs))
    accuracies <- accuracies |> 
      add_row(size = i, 
              model_string = models[[jstar]], 
              accuracy = accs[[jstar]])
    selected <- c(selected, names[[jstar]])
    names <- names[-jstar]
}
accuracies

phishing_subset

! Fold1: preprocessor 1/1: Column(s) have zero variance so scaling cannot be used: `nb_or`. Conside...

! Fold2: preprocessor 1/1: Column(s) have zero variance so scaling cannot be used: `nb_or`. Conside...

! Fold3: preprocessor 1/1: Column(s) have zero variance so scaling cannot be used: `nb_or`. Conside...

! Fold4: preprocessor 1/1: Column(s) have zero variance so scaling cannot be used: `nb_or`. Conside...

! Fold5: preprocessor 1/1: Column(s) have zero variance so scaling cannot be used: `nb_or`. Conside...

size,model_string,accuracy
<int>,<chr>,<dbl>
1,status ~ longest_word_path,0.6675416
2,status ~ longest_word_path+domain_registration_length,0.7453193
3,status ~ longest_word_path+domain_registration_length+nb_hyphens,0.7703412
⋮,⋮,⋮
28,status ~ longest_word_path+domain_registration_length+nb_hyphens+nb_slash+nb_dots+ratio_digits_url+length_hostname+avg_word_host+longest_word_host+length_url+nb_underscore+nb_at+ratio_digits_host+http_in_path+nb_semicolumn+nb_tilde+nb_or+nb_dollar+nb_star+nb_subdomains+nb_percent+avg_words_raw+nb_and+nb_colon+https_token+nb_qm+nb_eq+nb_comma,0.8588801
29,status ~ longest_word_path+domain_registration_length+nb_hyphens+nb_slash+nb_dots+ratio_digits_url+length_hostname+avg_word_host+longest_word_host+length_url+nb_underscore+nb_at+ratio_digits_host+http_in_path+nb_semicolumn+nb_tilde+nb_or+nb_dollar+nb_star+nb_subdomains+nb_percent+avg_words_raw+nb_and+nb_colon+https_token+nb_qm+nb_eq+nb_comma+nb_space,0.8579178


status,length_url,length_hostname,nb_dots,nb_hyphens,nb_at,nb_qm,nb_and,nb_or,nb_eq,⋯,http_in_path,https_token,ratio_digits_url,ratio_digits_host,nb_subdomains,longest_word_host,longest_word_path,avg_words_raw,avg_word_host,domain_registration_length
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
legitimate,37,19,3,0,0,0,0,0,0,⋯,0,1,0.0000000,0,3,11,6,5.75,7.0,45
phishing,77,23,1,0,0,0,0,0,0,⋯,0,1,0.2207792,0,1,19,32,15.75,19.0,77
phishing,126,50,4,1,0,1,2,0,3,⋯,0,0,0.1507937,0,3,13,17,8.25,8.4,14
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
legitimate,38,30,2,0,0,0,0,0,0,⋯,0,1,0.00000000,0.0000000,2,22,0,12.500000,12.50,85
phishing,477,14,24,0,1,1,9,0,9,⋯,4,1,0.08595388,0.7857143,3,3,12,4.377778,2.75,0


In [9]:
accuracies <- accuracies %>%
              arrange(desc(accuracy))

accuracies

size,model_string,accuracy
<int>,<chr>,<dbl>
16,status ~ longest_word_path+domain_registration_length+nb_hyphens+nb_slash+nb_dots+ratio_digits_url+length_hostname+avg_word_host+longest_word_host+length_url+nb_underscore+nb_at+ratio_digits_host+http_in_path+nb_semicolumn+nb_tilde,0.8597550
17,status ~ longest_word_path+domain_registration_length+nb_hyphens+nb_slash+nb_dots+ratio_digits_url+length_hostname+avg_word_host+longest_word_host+length_url+nb_underscore+nb_at+ratio_digits_host+http_in_path+nb_semicolumn+nb_tilde+nb_or,0.8597550
18,status ~ longest_word_path+domain_registration_length+nb_hyphens+nb_slash+nb_dots+ratio_digits_url+length_hostname+avg_word_host+longest_word_host+length_url+nb_underscore+nb_at+ratio_digits_host+http_in_path+nb_semicolumn+nb_tilde+nb_or+nb_dollar,0.8596675
⋮,⋮,⋮
2,status ~ longest_word_path+domain_registration_length,0.7453193
1,status ~ longest_word_path,0.6675416


The best-performing model string is:

In [10]:
"status ~ longest_word_path+domain_registration_length+nb_hyphens+nb_slash+nb_dots+ratio_digits_url+length_hostname+avg_word_host+longest_word_host+length_url+nb_underscore+nb_at+ratio_digits_host+http_in_path+nb_semicolumn+nb_tilde"

This model string gives the most relevant predictors. Notably, the top two models tie at an accuracy of 0.8597550; we choose the first because it has fewer predictors (16 vs 17). The extra predictor in the second model, `nb_or`, has zero variance according to the warnings above, so it adds no information. Fewer predictors also means fewer redundant variables influencing the distance calculation.
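
The tie-breaking rule described here can be written directly in code. The following is a minimal sketch rather than one of the notebook's own cells: `top_models` is a hypothetical tibble that copies only the `size` and `accuracy` values of the top three rows of the sorted table.

```r
library(tidyverse)

# Hypothetical stand-in for the top rows of `accuracies`
# (sizes and accuracies copied from the sorted table).
top_models <- tibble(
  size     = c(16L, 17L, 18L),
  accuracy = c(0.8597550, 0.8597550, 0.8596675)
)

# Keep every model tied for the highest accuracy,
# then prefer the one with the fewest predictors.
best_model <- top_models |>
  slice_max(accuracy, n = 1, with_ties = TRUE) |>
  slice_min(size, n = 1, with_ties = FALSE)

best_model
```

Applied to the full `accuracies` tibble, the same two calls would presumably return the 16-predictor row without having to read the table by eye.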

We then reduce the dataset to the variables chosen by forward selection.

In [12]:
cleaned_phishing_data <- phishing_data |>
                         select(status, longest_word_path, domain_registration_length, nb_hyphens, nb_slash, nb_dots, ratio_digits_url,
                                length_hostname, avg_word_host, longest_word_host, length_url, nb_underscore, nb_at, ratio_digits_host, http_in_path,
                                nb_semicolumn, nb_tilde) |>
                         # store the label as a factor so that classification metrics can use it later
                         mutate(status = as_factor(status))

Next, we split the data into training (75%) and testing (25%) sets, stratified by `status`.

In [14]:
cleaned_phishing_data_split <- initial_split(cleaned_phishing_data, prop = 3/4, strata = status)
phishing_train <- training(cleaned_phishing_data_split)
phishing_test <- testing(cleaned_phishing_data_split)
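
Because the split is random, it can help to fix a seed and confirm that stratifying on `status` preserved the class balance. The sketch below is illustrative only: `toy_data` is a made-up stand-in for `cleaned_phishing_data`, and the seed value is an arbitrary assumption (the notebook does not set one).

```r
library(tidymodels)

# Arbitrary seed (an assumption; the original notebook does not set one)
set.seed(2022)

# Made-up stand-in for cleaned_phishing_data: 100 rows per class
toy_data <- tibble(
  status     = factor(rep(c("legitimate", "phishing"), each = 100)),
  length_url = rpois(200, lambda = 60)
)

toy_split <- initial_split(toy_data, prop = 3/4, strata = status)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Class proportions in each partition should stay close to the original 50/50
train_props <- toy_train |> count(status) |> mutate(prop = n / sum(n))
test_props  <- toy_test  |> count(status) |> mutate(prop = n / sum(n))

train_props
test_props
```

The same `count()` and `mutate()` pattern could be run on `phishing_train` and `phishing_test` to check the real split.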

### Data Summary

### Data Visualization

## Methods

To classify a URL as legitimate or phishing, we will use K-nearest neighbours classification. The steps are as follows:

1. Use Forward Selection to choose the relevant predictors
2. Use Cross-Validation to find the optimal value of K
3. Perform K-nearest neighbours classification
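
Steps 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's final code: `toy_data` is randomly generated stand-in data with two of the selected predictor names, the seed and the grid of K values are arbitrary, and the model specification mirrors the one used for forward selection.

```r
library(tidymodels)

set.seed(2022)  # arbitrary seed, an assumption

# Randomly generated stand-in data: two numeric predictors, two classes
toy_data <- tibble(
  status     = factor(rep(c("legitimate", "phishing"), each = 80)),
  length_url = c(rnorm(80, mean = 40, sd = 10), rnorm(80, mean = 80, sd = 20)),
  nb_dots    = c(rnorm(80, mean = 2, sd = 1), rnorm(80, mean = 4, sd = 1))
)
toy_split <- initial_split(toy_data, prop = 3/4, strata = status)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Standardize predictors so no single variable dominates the distance
toy_recipe <- recipe(status ~ ., data = toy_train) |>
  step_scale(all_numeric_predictors()) |>
  step_center(all_numeric_predictors())

# Step 2: 5-fold cross-validation over a small grid of odd K values
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular",
                                  neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

toy_vfold <- vfold_cv(toy_train, v = 5, strata = status)
k_grid    <- tibble(neighbors = seq(1, 15, by = 2))

cv_accuracy <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = toy_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

best_k <- cv_accuracy |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Step 3: refit on the whole training set with the chosen K,
# then predict the held-out test set
knn_final_spec <- nearest_neighbor(weight_func = "rectangular",
                                   neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(knn_final_spec) |>
  fit(data = toy_train)

toy_predictions <- predict(knn_fit, toy_test) |>
  bind_cols(toy_test)
```

Tuning uses only the training partition, so the test set stays untouched until the final prediction step.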

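As shown above, in <i>Choosing Relevant Predictors using Forward Selection</i> under <i>Preliminary Exploratory Data Analysis</i>, we have chosen the following 16 columns to proceed with the analysis: `longest_word_path`, `domain_registration_length`, `nb_hyphens`, `nb_slash`, `nb_dots`, `ratio_digits_url`, `length_hostname`, `avg_word_host`, `longest_word_host`, `length_url`, `nb_underscore`, `nb_at`, `ratio_digits_host`, `http_in_path`, `nb_semicolumn`, and `nb_tilde`.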

### Visualization of Results

The results will be presented in three visualizations:
1. A plot of the number of neighbours K against the accuracy estimates from cross-validation
2. Visualizations of the distribution described by the metrics
3. A confusion matrix


The first visualization lets us choose the value of K that maximizes the cross-validation accuracy estimate.
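
A plot of this kind could be drawn roughly as follows. The numbers in `cv_accuracy` are placeholder values chosen only to make the sketch runnable; they are not results from this project. The tibble is assumed to have the `neighbors` and `mean` columns that `collect_metrics()` returns after filtering for accuracy.

```r
library(tidyverse)

# Placeholder values only, shaped like the accuracy rows of collect_metrics();
# they are NOT results from this project.
cv_accuracy <- tibble(
  neighbors = c(1, 3, 5, 7, 9),
  mean      = c(0.80, 0.84, 0.86, 0.85, 0.83)
)

accuracy_vs_k <- ggplot(cv_accuracy, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of neighbours (K)",
       y = "Cross-validation accuracy estimate") +
  theme(text = element_text(size = 12))

# K with the highest estimated accuracy in the placeholder data
best_k <- cv_accuracy |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

accuracy_vs_k
```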

The second visualization shows how accurately our model classifies URLs as either phishing or legitimate.

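The third visualization will be a confusion matrix, which helps us understand what the accuracy implies in practice. Not all errors are equally harmful: a phishing URL classified as legitimate exposes the victim, whereas a legitimate URL flagged as phishing is only an inconvenience.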
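
A confusion matrix and the related metrics can be computed with the yardstick functions that `tidymodels` attaches. The sketch below uses eight hand-written toy predictions, not project results; `toy_predictions` is a hypothetical stand-in for the test set with a `.pred_class` column bound on.

```r
library(tidymodels)

class_levels <- c("legitimate", "phishing")

# Eight hand-written toy rows (NOT project results):
# 3 true positives, 1 false negative, 3 true negatives, 1 false positive
toy_predictions <- tibble(
  status = factor(c("phishing", "phishing", "phishing", "legitimate",
                    "legitimate", "legitimate", "legitimate", "phishing"),
                  levels = class_levels),
  .pred_class = factor(c("phishing", "phishing", "legitimate", "legitimate",
                         "legitimate", "phishing", "legitimate", "phishing"),
                       levels = class_levels)
)

# Rows are predictions, columns are the true labels
toy_conf_mat <- conf_mat(toy_predictions, truth = status, estimate = .pred_class)

toy_accuracy <- accuracy(toy_predictions, truth = status, estimate = .pred_class)

# Treat "phishing" (the second factor level) as the event of interest,
# so recall measures the share of phishing URLs that were caught
toy_recall <- recall(toy_predictions, truth = status, estimate = .pred_class,
                     event_level = "second")

toy_conf_mat
```

Here the cell in row `legitimate`, column `phishing` counts the missed phishing URLs, which is the more harmful kind of error.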

## Expected Outcomes and Significance

To the question, "Can we classify a URL as phishing or legitimate?", we expect the answer to be "yes." By using forward selection and cross-validation and then building a KNN classifier, we expect to identify whether a URL is phishing or legitimate with a high level of accuracy. These findings would let us flag phishing URLs and help prevent bad actors from stealing people's information. This project leads to future questions such as:

- What percentage of phishing is prevented by your email service?
- Will phishing URLs advance to the point where they are undetectable?
- What new types of phishing attacks will come into play in the future?
- Can the power of quantum computing increase the accuracy or speed in classifying phishing/legitimate URLs?