# **Phishing Website Dataset Construction**

## **Acknowledgement**

The original dataset was created by Hemanth Pingali and published on Kaggle. It can be found [here](https://www.kaggle.com/datasets/hemanthpingali/phishing-url/data).

## **Important Packages**

In [14]:
import pandas as pd
import os

## **Dataset Construction**

First, let's load the `.csv` files from the `datasets` folder. Note that the author of the dataset already split it into a training set and a testing set, so we need to combine them.

In [9]:
training_set = pd.read_csv("./datasets/training.csv")
print(training_set.shape)
training_set.head()


(7658, 89)


Unnamed: 0,url,length_url,length_hostname,ip,nb_dots,nb_hyphens,nb_at,nb_qm,nb_and,nb_or,...,domain_in_title,domain_with_copyright,whois_registered_domain,domain_registration_length,domain_age,web_traffic,dns_record,google_index,page_rank,status
0,https://www.todayshomeowner.com/how-to-make-ho...,82,23,0,2,7,0,0,0,0,...,1,1,0,240,8892,67860,0,1,4,legitimate
1,http://thapthan.ac.th/information/confirmation...,93,14,1,2,0,0,0,0,0,...,1,0,1,0,2996,4189860,0,1,2,phishing
2,http://app.dialoginsight.com/T/OFC4/L2S/3888/B...,121,21,1,3,0,0,0,0,0,...,1,1,0,30,2527,346022,0,1,3,phishing
3,https://www.bedslide.com,24,16,0,2,0,0,0,0,0,...,0,0,0,139,7531,1059151,0,0,4,legitimate
4,https://tabs.ultimate-guitar.com/s/sex_pistols...,73,24,0,3,1,0,0,0,0,...,0,0,0,3002,7590,635,0,1,5,legitimate


In [10]:
testing_set = pd.read_csv("./datasets/testing.csv")
print(testing_set.shape)
testing_set.head()

(3772, 89)


Unnamed: 0,url,length_url,length_hostname,ip,nb_dots,nb_hyphens,nb_at,nb_qm,nb_and,nb_or,...,domain_in_title,domain_with_copyright,whois_registered_domain,domain_registration_length,domain_age,web_traffic,dns_record,google_index,page_rank,status
0,https://clubedemilhagem.com/home.php,36,19,0,2,0,0,0,0,0,...,1,0,0,344,21,0,0,1,0,phishing
1,http://www.medicalnewstoday.com/articles/18893...,51,24,0,3,0,0,0,0,0,...,1,1,0,103,6106,737,0,1,6,legitimate
2,https://en.wikipedia.org/wiki/NBC_Nightly_News,46,16,0,2,0,0,0,0,0,...,0,1,0,901,7134,12,0,0,7,legitimate
3,http://secure.web894.com/customer_center/custo...,185,17,1,2,1,0,1,2,0,...,1,1,0,247,1944,0,0,1,0,phishing
4,https://en.wikipedia.org/wiki/Transaction_proc...,52,16,0,2,0,0,0,0,0,...,0,1,0,901,7134,12,0,0,7,legitimate


In [11]:
combined_dataset = pd.concat([training_set, testing_set])
print(combined_dataset.shape)
combined_dataset.head()

(11430, 89)


Unnamed: 0,url,length_url,length_hostname,ip,nb_dots,nb_hyphens,nb_at,nb_qm,nb_and,nb_or,...,domain_in_title,domain_with_copyright,whois_registered_domain,domain_registration_length,domain_age,web_traffic,dns_record,google_index,page_rank,status
0,https://www.todayshomeowner.com/how-to-make-ho...,82,23,0,2,7,0,0,0,0,...,1,1,0,240,8892,67860,0,1,4,legitimate
1,http://thapthan.ac.th/information/confirmation...,93,14,1,2,0,0,0,0,0,...,1,0,1,0,2996,4189860,0,1,2,phishing
2,http://app.dialoginsight.com/T/OFC4/L2S/3888/B...,121,21,1,3,0,0,0,0,0,...,1,1,0,30,2527,346022,0,1,3,phishing
3,https://www.bedslide.com,24,16,0,2,0,0,0,0,0,...,0,0,0,139,7531,1059151,0,0,4,legitimate
4,https://tabs.ultimate-guitar.com/s/sex_pistols...,73,24,0,3,1,0,0,0,0,...,0,0,0,3002,7590,635,0,1,5,legitimate
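The previews above include an `Unnamed: 0` column, which looks like a row index that was saved into the original CSV files. By default, `pd.concat` also keeps each frame's own row labels, so the combined frame can contain repeated index values. The following is a minimal sketch on toy data of one way to handle both; the two small frames and their values are hypothetical stand-ins, not rows from the real dataset.

```python
import pandas as pd

# Hypothetical stand-ins for the training and testing sets.
train = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "url": ["http://a.example", "http://b.example"],
    "status": ["legitimate", "phishing"],
})
test = pd.DataFrame({
    "Unnamed: 0": [0],
    "url": ["http://c.example"],
    "status": ["phishing"],
})

# ignore_index=True renumbers the rows 0..n-1 instead of
# repeating each frame's original row labels.
combined = pd.concat([train, test], ignore_index=True)

# Drop the leftover index column if it exists;
# errors="ignore" makes this a no-op when the column is absent.
combined = combined.drop(columns=["Unnamed: 0"], errors="ignore")

print(combined.shape)
print(combined.index.tolist())
```

Whether to drop `Unnamed: 0` in the real dataset is a judgment call that depends on whether anything downstream relies on it.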


Now that the training and testing sets have been combined, let's write the dataset to a `.csv` file.

In [15]:
combined_dataset.to_csv("./datasets/final_dataset.csv", index=False)
files = os.listdir('./datasets')
print(files)


['final_dataset.csv', 'training.csv', 'testing.csv', 'new_dataset.csv']
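As a quick sanity check on the write step, the file can be read back and compared with the frame that was saved. The sketch below does this on a small hypothetical frame in a temporary directory, so it does not touch the real `datasets` folder; the column values are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the combined dataset.
df = pd.DataFrame({
    "url": ["http://a.example", "http://b.example"],
    "length_url": [16, 16],
    "status": ["legitimate", "phishing"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "final_dataset.csv")
    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the file is read back.
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.shape == df.shape)
print(list(reloaded.columns))
```

If the shapes or column names differ after the round trip, the most likely cause is the index being written as an extra column.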


We now have the dataset that will be used for modelling. Let's quickly check the features that the dataset has.

In [13]:
print(combined_dataset.columns)
combined_dataset.info()

Index(['url', 'length_url', 'length_hostname', 'ip', 'nb_dots', 'nb_hyphens',
       'nb_at', 'nb_qm', 'nb_and', 'nb_or', 'nb_eq', 'nb_underscore',
       'nb_tilde', 'nb_percent', 'nb_slash', 'nb_star', 'nb_colon', 'nb_comma',
       'nb_semicolumn', 'nb_dollar', 'nb_space', 'nb_www', 'nb_com',
       'nb_dslash', 'http_in_path', 'https_token', 'ratio_digits_url',
       'ratio_digits_host', 'punycode', 'port', 'tld_in_path',
       'tld_in_subdomain', 'abnormal_subdomain', 'nb_subdomains',
       'prefix_suffix', 'random_domain', 'shortening_service',
       'path_extension', 'nb_redirection', 'nb_external_redirection',
       'length_words_raw', 'char_repeat', 'shortest_words_raw',
       'shortest_word_host', 'shortest_word_path', 'longest_words_raw',
       'longest_word_host', 'longest_word_path', 'avg_words_raw',
       'avg_word_host', 'avg_word_path', 'phish_hints', 'domain_in_brand',
       'brand_in_subdomain', 'brand_in_path', 'suspecious_tld',
       'statistical_report', 

That is a lot of features! Most of them are discussed in research on phishing website detection, such as Mohammad & McCluskey (2015) and Mamun et al. (2016), and in Python notebooks such as the one written by Shreyagopal (2020).

With this, we can now use the dataset to train and test a model that predicts whether a set of features belongs to a phishing website.
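Before modelling, it can help to look at how the two classes in the `status` column are distributed and to turn the text labels into numbers. The sketch below assumes the labels are exactly `legitimate` and `phishing`, as in the previews; the toy frame and the `target` column name are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the status column of the combined dataset.
labels = pd.DataFrame(
    {"status": ["legitimate", "phishing", "phishing", "legitimate"]}
)

# Absolute and relative class frequencies.
counts = labels["status"].value_counts()
shares = labels["status"].value_counts(normalize=True)

# Encode the text labels as integers (1 = phishing); any label
# missing from the mapping would become NaN.
labels["target"] = labels["status"].map({"legitimate": 0, "phishing": 1})

print(counts.to_dict())
print(shares.to_dict())
```

A strongly imbalanced distribution would be a reason to use a stratified split when the data is divided again for training and testing.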

## **References**

* Mamun, M.S., Rathore, M.A., Habibi Lashkari, A., Stakhanova, N., & Ghorbani, A.A. (2016). Detecting Malicious URLs Using Lexical Analysis. International Conference on Network and System Security.

* Mohammad, R., & McCluskey, L. (2015, March 25). Phishing Websites. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/327/phishing+websites

* Shreyagopal. (2020). GitHub - shreyagopal/Phishing-Website-Detection-by-Machine-LearningTechniques. GitHub. https://github.com/shreyagopal/Phishing-Website-Detection-by-Machine-LearningTechniques?tab=readme-ov-file

