# Phishing Website Detection by Machine Learning - Feature Extraction

This notebook collects the relevant data and extracts the selected features.

In [104]:
import os
import requests
import pandas as pd
from urllib.parse import urlparse, urlencode
import ipaddress
import re

# 1. Data Collection:

For this project we need to collect both legitimate (labelled 0) and phishing URLs.

- Phishing URLs are collected from an open-source service called PhishTank, http://data.phishtank.com/data/online-valid.json. The download cell below fetches the CSV version of the same feed.
- Legitimate URLs are collected from a dataset provided by the University of New Brunswick, https://www.unb.ca/cic/datasets/url-2016.html. This collection contains 35,300 legitimate URLs. Among its files, 'Benign_list_big_final.csv' is the one of interest.

## 1.1. Phishing URLs:

Phishing URLs are collected using PhishTank and loaded into the DataFrame.

In [84]:
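# Download the phishing URLs file from PhishTank

CSV_URL = 'http://data.phishtank.com/data/online-valid.csv'

os.makedirs("Dataset", exist_ok=True)  # make sure the target folder exists

with open(os.path.join("Dataset", os.path.basename(CSV_URL)), 'wb') as f, \
        requests.get(CSV_URL, stream=True) as r:
    r.raise_for_status()  # stop early on an HTTP error instead of saving an error page
    for line in r.iter_lines():
        f.write(line + b'\n')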

In [85]:
path = "Dataset/online-valid.csv"
df_phishing = pd.read_csv(path)

In [86]:
df_phishing.head()

,phish_id,url,phish_detail_url,submission_time,verified,verification_time,online,target
0,7373519,http://uat.bakebe.com/de/bw-bank/,http://www.phishtank.com/phish_detail.php?phis...,2021-12-04T19:45:21+00:00,yes,2021-12-04T20:05:02+00:00,yes,Other
1,7373479,https://mercarir.jp.dbctscc.cn,http://www.phishtank.com/phish_detail.php?phis...,2021-12-04T18:04:30+00:00,yes,2021-12-04T19:52:04+00:00,yes,Other
2,7373476,https://www.bancowlnterbank.com/,http://www.phishtank.com/phish_detail.php?phis...,2021-12-04T17:53:44+00:00,yes,2021-12-04T18:10:52+00:00,yes,Other
3,7373471,https://clientedescontaoitau.000webhostapp.com/,http://www.phishtank.com/phish_detail.php?phis...,2021-12-04T17:35:13+00:00,yes,2021-12-04T19:42:55+00:00,yes,Itau
4,7373461,https://verifiedsbcglobal.weebly.com/,http://www.phishtank.com/phish_detail.php?phis...,2021-12-04T16:36:02+00:00,yes,2021-12-04T18:10:52+00:00,yes,Other


In [87]:
df_phishing.isnull().sum()

phish_id             0
url                  0
phish_detail_url     0
submission_time      0
verified             0
verification_time    0
online               0
target               0
dtype: int64

In [88]:
df_phishing.shape

(12479, 8)

We randomly sample 5,000 URLs from the DataFrame above.

In [89]:
# Randomly sample 5,000 phishing URLs (random_state fixed for reproducibility)

df_phishing_final = df_phishing.sample(n = 5000, random_state = 12).copy()
df_phishing_final = df_phishing_final.reset_index(drop=True)
df_phishing_final.head()

,phish_id,url,phish_detail_url,submission_time,verified,verification_time,online,target
0,7368857,https://luftcomsa.com/wp-includes/fonts/tesco....,http://www.phishtank.com/phish_detail.php?phis...,2021-11-30T13:45:52+00:00,yes,2021-11-30T18:42:36+00:00,yes,Tesco
1,6924120,http://hunterpowersport.com/wp-includes/custom...,http://www.phishtank.com/phish_detail.php?phis...,2021-01-16T14:07:43+00:00,yes,2021-01-16T14:23:19+00:00,yes,Other
2,7046965,https://urlth.me/jvr9U,http://www.phishtank.com/phish_detail.php?phis...,2021-03-27T12:42:21+00:00,yes,2021-03-27T12:44:06+00:00,yes,Other
3,7304838,https://sites.google.com/view/xopauyr/home,http://www.phishtank.com/phish_detail.php?phis...,2021-09-28T08:20:38+00:00,yes,2021-09-28T08:25:37+00:00,yes,Other
4,6376701,http://poligrafiapias.com/Secured-adobe/08fd9f...,http://www.phishtank.com/phish_detail.php?phis...,2020-01-30T02:02:10+00:00,yes,2020-02-06T14:06:27+00:00,yes,Other


In [90]:
df_phishing_final.shape

(5000, 8)

## 1.2. Legitimate URLs:

In [100]:
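path = "Dataset/Benign_list_big_final.csv"
# Note: read_csv treats the first line of the file as a header row; if the file
# has no header, header=None would be needed to keep that first URL.
df_legitimate = pd.read_csv(path)
df_legitimate.columns = ["URLs"]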

In [101]:
df_legitimate.head()

,URLs
0,http://1337x.to/torrent/1110018/Blackhat-2015-...
1,http://1337x.to/torrent/1122940/Blackhat-2015-...
2,http://1337x.to/torrent/1124395/Fast-and-Furio...
3,http://1337x.to/torrent/1145504/Avengers-Age-o...
4,http://1337x.to/torrent/1160078/Avengers-age-o...


In [102]:
# Randomly sample 5,000 legitimate URLs (random_state fixed for reproducibility)

df_legitimate_final = df_legitimate.sample(n = 5000, random_state = 12).copy()
df_legitimate_final = df_legitimate_final.reset_index(drop=True)
df_legitimate_final.head()

,URLs
0,http://graphicriver.net/search?date=this-month...
1,http://ecnavi.jp/redirect/?url=http://www.cros...
2,https://hubpages.com/signin?explain=follow+Hub...
3,http://extratorrent.cc/torrent/4190536/AOMEI+B...
4,http://icicibank.com/Personal-Banking/offers/o...


In [103]:
df_legitimate_final.shape

(5000, 1)
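
Both samples now hold 5,000 URLs each, but their URL columns have different names (`url` and `URLs`) and neither carries a class label yet. The following is a minimal sketch of how the two could be labelled and stacked. It assumes label 0 for legitimate URLs (as stated in the data collection notes) and label 1 for phishing URLs, which is an assumption. The tiny `legit` and `phish` frames and the `example.com`/`example.org` URLs are hypothetical stand-ins for `df_legitimate_final` and `df_phishing_final`.

```python
import pandas as pd

# Toy stand-ins for df_legitimate_final and df_phishing_final.
# The legitimate URLs here are placeholders, not rows from the real dataset.
legit = pd.DataFrame({"URLs": ["http://example.com/", "http://example.org/docs"]})
phish = pd.DataFrame({"url": ["https://urlth.me/jvr9U",
                              "https://sites.google.com/view/xopauyr/home"]})

# Align the column names, then attach the class label.
legit_labeled = legit.rename(columns={"URLs": "url"}).assign(label=0)  # 0 = legitimate
phish_labeled = phish[["url"]].assign(label=1)  # 1 = phishing (assumed encoding)

# Stack both classes into one frame with a fresh 0..n-1 index.
urls = pd.concat([legit_labeled, phish_labeled], ignore_index=True)
print(urls)
```

Keeping only the `url` column from the PhishTank sample drops metadata such as `submission_time` and `target`; whether that metadata is needed later depends on the features chosen.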

# 2. Feature Extraction:

In this step, features are extracted from the Legitimate URLs dataset.

The extracted features fall into three categories:

1. Address Bar based Features
2. Domain based Features
3. HTML & JavaScript based Features

## 2.1. Address Bar based Features:

The following features are considered address bar based features:

- Domain of URL
- IP Address in URL
- "@" Symbol in URL
- Length of URL
- Depth of URL
- Redirection "//" in URL
- "http/https" in Domain name
- Using URL Shortening Services "TinyURL"
- Prefix or Suffix "-" in Domain
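
The features listed above can each be computed from the URL string alone. The sketch below is one possible way to do so, not necessarily the implementation used later in this project. All function names and the short `SHORTENER_HOSTS` set are hypothetical, and the functions return raw values or simple 0/1 flags, because any thresholds (for example, what counts as a "long" URL) are not given in this section.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical, deliberately short list of URL-shortener hosts (an assumption).
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co", "ow.ly"}


def get_domain(url):
    """Return the lower-cased host of the URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def has_ip_address(url):
    """1 if the host is a literal IPv4/IPv6 address, else 0."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return 1
    except ValueError:
        return 0


def has_at_symbol(url):
    """1 if '@' appears anywhere in the URL, else 0."""
    return 1 if "@" in url else 0


def url_length(url):
    """Raw character length of the URL (no threshold applied)."""
    return len(url)


def url_depth(url):
    """Number of non-empty path segments, e.g. '/a/b/' has depth 2."""
    return sum(1 for segment in urlparse(url).path.split("/") if segment)


def has_redirection(url):
    """1 if '//' occurs anywhere after the scheme separator, else 0."""
    scheme_end = url.find("://")
    rest = url[scheme_end + 3:] if scheme_end != -1 else url
    return 1 if "//" in rest else 0


def has_http_in_domain(url):
    """1 if the token 'http' appears inside the host name, else 0."""
    host = urlparse(url).hostname or ""
    return 1 if "http" in host else 0


def uses_shortener(url):
    """1 if the host is in the (assumed) shortener list, else 0."""
    return 1 if get_domain(url) in SHORTENER_HOSTS else 0


def has_hyphen_in_domain(url):
    """1 if the host name contains '-', else 0."""
    return 1 if "-" in get_domain(url) else 0


def address_bar_features(url):
    """Collect all address bar based features for one URL into a dict."""
    return {
        "domain": get_domain(url),
        "have_ip": has_ip_address(url),
        "have_at": has_at_symbol(url),
        "url_length": url_length(url),
        "url_depth": url_depth(url),
        "redirection": has_redirection(url),
        "http_in_domain": has_http_in_domain(url),
        "shortener": uses_shortener(url),
        "hyphen_in_domain": has_hyphen_in_domain(url),
    }


print(address_bar_features("https://sites.google.com/view/xopauyr/home"))
```

Applied to a DataFrame column, the same function could be mapped over every URL, for example with `df["url"].apply(address_bar_features)`, and the resulting dicts expanded into feature columns.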