# Phishing Website Detection by Machine Learning Techniques

# 1. Objective:
**Phishing is a common social engineering attack in which a website mimics trusted uniform resource locators (URLs) and webpages. The objective of this notebook is to collect data and extract selected features from the URLs. This project was developed on Google Colaboratory.**

# 2. Collecting the Data: 
**For this project, we need a set of URLs labeled as legitimate (0) or phishing (1).**

**Collecting phishing URLs is fairly easy thanks to the open-source service PhishTank. This service provides a set of phishing URLs in multiple formats, such as CSV and JSON, that is updated hourly. To download the data, see https://www.phishtank.com/developer_info.php**

**For the legitimate URLs, we found a source that has a collection of benign, spam, phishing, malware & defacement URLs. The source of the dataset is the University of New Brunswick, https://www.unb.ca/cic/datasets/url-2016.html.**


In [None]:
#importing required packages for this module
import pandas as pd

In [None]:
from google.colab import files

uploaded = files.upload()

Saving Phishing_list.csv to Phishing_list.csv


In [None]:
#loading the phishing URLs data to dataframe
data0 = pd.read_csv("Phishing_list.csv")
data0.head()

Unnamed: 0,phish_id,url,phish_detail_url,submission_time,verified,verification_time,online,target
0,6557033,http://u1047531.cp.regruhosting.ru/acces-inges...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T22:01:43+00:00,yes,2020-05-09T22:03:07+00:00,yes,Other
1,6557032,http://hoysalacreations.com/wp-content/plugins...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T22:01:37+00:00,yes,2020-05-09T22:03:07+00:00,yes,Other
2,6557011,http://www.accsystemprblemhelp.site/checkpoint...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T21:54:31+00:00,yes,2020-05-09T21:55:38+00:00,yes,Facebook
3,6557010,http://www.accsystemprblemhelp.site/login_atte...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T21:53:48+00:00,yes,2020-05-09T21:54:34+00:00,yes,Facebook
4,6557009,https://firebasestorage.googleapis.com/v0/b/so...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T21:49:27+00:00,yes,2020-05-09T21:51:24+00:00,yes,Microsoft


In [None]:
# shape of Phishing dataframe
data0.shape

(14858, 8)

In [None]:
# size of Phishing dataframe
data0.size

118864

The data contains 14,858 phishing URLs, and because the feed is updated hourly this count changes between downloads. To avoid class imbalance, I use a fixed sample of 5,000 phishing URLs and 5,000 legitimate URLs.

Accordingly, 5,000 samples are picked at random from the above dataframe.

In [None]:
#Collecting 5,000 Phishing URLs randomly
phishurl = data0.sample(n = 5000, random_state = 12).copy()
phishurl = phishurl.reset_index(drop=True)
phishurl.head()

Unnamed: 0,phish_id,url,phish_detail_url,submission_time,verified,verification_time,online,target
0,6514946,http://confirmprofileaccount.com/,http://www.phishtank.com/phish_detail.php?phis...,2020-04-19T11:06:55+00:00,yes,2020-04-19T13:42:41+00:00,yes,Other
1,4927651,http://www.marreme.com/MasterAdmin/04mop.html,http://www.phishtank.com/phish_detail.php?phis...,2017-04-04T19:35:54+00:00,yes,2017-05-03T23:00:42+00:00,yes,Other
2,5116976,http://modsecpaststudents.com/review/,http://www.phishtank.com/phish_detail.php?phis...,2017-07-25T18:48:30+00:00,yes,2017-07-28T16:01:36+00:00,yes,Other
3,6356131,https://docs.google.com/forms/d/e/1FAIpQLScL6L...,http://www.phishtank.com/phish_detail.php?phis...,2020-01-13T20:13:37+00:00,yes,2020-01-17T01:55:38+00:00,yes,Other
4,6535965,https://oportunidadedasemana.com/americanas//?...,http://www.phishtank.com/phish_detail.php?phis...,2020-04-29T00:01:03+00:00,yes,2020-05-01T10:55:35+00:00,yes,Other


In [None]:
# shape of 5000 Phishing dataframe
phishurl.shape

(5000, 8)

# Legitimate URLs:
**From the Benign_list.csv file, the URLs are loaded into a dataframe.**

In [None]:
uploaded = files.upload()

Saving Benign_list.csv to Benign_list.csv


In [None]:
# Loading legitimate files 
data1 = pd.read_csv("Benign_list.csv")
data1.columns = ['URLs']
data1.head()

Unnamed: 0,URLs
0,http://1337x.to/torrent/1110018/Blackhat-2015-...
1,http://1337x.to/torrent/1122940/Blackhat-2015-...
2,http://1337x.to/torrent/1124395/Fast-and-Furio...
3,http://1337x.to/torrent/1145504/Avengers-Age-o...
4,http://1337x.to/torrent/1160078/Avengers-age-o...


In [None]:
# shape of Legitimate dataframe
data1.shape

(35377, 1)

In [None]:
# size of Legitimate dataframe
data1.size

35377

In [None]:
#Collecting 5,000 Legitimate URLs randomly
legiurl = data1.sample(n = 5000, random_state = 12).copy()
legiurl = legiurl.reset_index(drop=True)
legiurl.head()

Unnamed: 0,URLs
0,http://graphicriver.net/search?date=this-month...
1,http://ecnavi.jp/redirect/?url=http://www.cros...
2,https://hubpages.com/signin?explain=follow+Hub...
3,http://extratorrent.cc/torrent/4190536/AOMEI+B...
4,http://icicibank.com/Personal-Banking/offers/o...


In [None]:
# shape of 5000 Legitimate dataframe
legiurl.shape

(5000, 1)

# Feature Extraction:
**In this step, features are extracted from the URLs dataset.**

# Address Bar Based Features:
Many features can be extracted that can be considered address bar based features. Of these, the ones listed below were considered for this project.

- Domain of URL
- IP Address in URL
- "@" Symbol in URL
- Length of URL
- Depth of URL
- Redirection "//" in URL
- "http/https" in Domain name
- Prefix or Suffix "-" in Domain

## Each of these features is explained and coded below:

In [None]:
# importing required packages for this section
from urllib.parse import urlparse
import ipaddress
import re

# 1. Domain of the URL

### Here, we simply extract the domain present in the URL. This feature has little significance for training and may even be dropped when training the model.

In [None]:
# 1.Domain of the URL (Domain)
def getDomain(url):
  domain = urlparse(url).netloc
  # strip only a leading "www." (the dot is escaped so it matches a literal ".")
  domain = re.sub(r"^www\.", "", domain)
  return domain
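As a quick illustration of what `urlparse` returns, here is a minimal sketch on a made-up URL (the URL is an assumption chosen only for this example). Note that `netloc` keeps the port if one is present, while `hostname` does not.

```python
from urllib.parse import urlparse

# Hypothetical example URL, used only to show the parsed components
parts = urlparse("http://www.example.com:8080/login/index.html?user=a")

print(parts.scheme)    # http
print(parts.netloc)    # www.example.com:8080
print(parts.hostname)  # www.example.com
print(parts.path)      # /login/index.html
```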

# 2. IP Address in the URL

### Checks for the presence of an IP address in the URL. URLs may use an IP address instead of a domain name. If an IP address is used in place of the domain name, it is a strong indication that someone is trying to steal personal information with this URL.

### If the domain part of URL has IP address, the value assigned to this feature is 1 (phishing) or else 0 (legitimate).

In [None]:
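# 2.Checks for IP address in URL (Have_IP)
def havingIP(url):
  try:
    # ip_address() only accepts a bare address, so test the host part, not the full URL
    host = urlparse(url).hostname or ""
    ipaddress.ip_address(host)
    ip = 1            # phishing
  except ValueError:
    ip = 0            # legitimate
  return ip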
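The sketch below illustrates why the host has to be isolated before the check: `ipaddress.ip_address` accepts only a bare address and raises `ValueError` for anything else, including a full URL. The helper `is_ip` and the example URL are hypothetical and exist only for this illustration.

```python
import ipaddress
from urllib.parse import urlparse

url = "http://192.168.0.1/login"   # hypothetical example URL

def is_ip(text):
    """Return True if `text` is a bare IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False

print(is_ip(url))                      # False: the full URL is not an address
print(is_ip(urlparse(url).hostname))   # True: the host part is
print(is_ip("example.com"))            # False
```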

# 3. "@" Symbol in URL
### Checks for the presence of the '@' symbol in the URL. Using the “@” symbol in a URL leads the browser to ignore everything preceding it, and the real address often follows the “@” symbol.

### If the URL has '@' symbol, the value assigned to this feature is 1 (phishing) or else 0 (legitimate).

In [None]:
# 3.Checks the presence of @ in URL (Have_At)
def haveAtSign(url):
  if "@" in url:
    at = 1            # phishing    
  else:
    at = 0            # legitimate    
  return at
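To see the effect of "@" concretely, the following sketch parses a made-up URL (an assumption for illustration only). Everything before the "@" is treated as user information, and the actual host is what follows it.

```python
from urllib.parse import urlparse

# Hypothetical URL: the trusted-looking name sits before the "@"
url = "http://www.paypal.com@evil.example/login"
parts = urlparse(url)

print(parts.username)  # www.paypal.com
print(parts.hostname)  # evil.example
```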

# 4. Length of URL
### Computes the length of the URL. Phishers can use a long URL to hide the suspicious part in the address bar. In this project, if the length of the URL is greater than or equal to 54 characters, the URL is classified as phishing, otherwise as legitimate.

### If the length of URL >= 54 , the value assigned to this feature is 1 (phishing) or else 0 (legitimate).

In [None]:
# 4.Finding the length of URL and categorizing (URL_Length)
def getLength(url):
  if len(url) < 54:
    length = 0            # legitimate            
  else:
    length = 1            # phishing            
  return length

# 5. Depth of URL
### Computes the depth of the URL. This feature counts the number of sub pages in the given URL, based on the '/' separators in its path.

### The value of this feature is a number that depends on the URL.

In [None]:
# 5.Gives the number of non-empty path segments in URL (URL_Depth)
def getDepth(url):
  s = urlparse(url).path.split('/')
  depth = 0
  for j in range(len(s)):
    if len(s[j]) != 0:
      depth = depth+1
  return depth
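A few worked cases may make the depth definition clearer. The sketch below uses a hypothetical helper, `path_depth`, that counts non-empty path segments in the same spirit as the function above. The URLs are made up for illustration.

```python
from urllib.parse import urlparse

def path_depth(url):
    """Count the non-empty segments of the URL path (hypothetical helper)."""
    return sum(1 for segment in urlparse(url).path.split("/") if segment)

print(path_depth("http://example.com/"))                # 0
print(path_depth("http://example.com/a/b/c.html"))      # 3
print(path_depth("http://example.com/a//b/?q=x/y"))     # 2
```

Empty segments (from a trailing "/" or a doubled "//") are skipped, and slashes in the query string do not count because only the path is split.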

# 6. Redirection "//" in URL
### Checks the presence of "//" in the URL. The existence of “//” within the URL path means that the user will be redirected to another website. The location of the “//” in the URL is computed. If the URL starts with “HTTP”, the “//” should appear in the sixth position. However, if the URL employs “HTTPS”, the “//” should appear in the seventh position.

### If the "//" is anywhere in the URL apart from after the protocol, the value assigned to this feature is 1 (phishing) or else 0 (legitimate).

In [None]:
# 6.Checking for redirection '//' in the URL (Redirection)
def redirection(url):
  # index of the last '//'; the one after the protocol is at index 5 (http) or 6 (https)
  pos = url.rfind('//')
  if pos > 7:
    return 1            # phishing
  else:
    return 0            # legitimate
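The positions mentioned above are 1-based, while Python's `str.rfind` returns a 0-based index. The sketch below prints the index of the last "//" for three made-up URLs (assumptions for illustration) to show how the two conventions line up.

```python
# Hypothetical example URLs
urls = [
    "http://example.com/home",
    "https://example.com/home",
    "http://example.com/redirect//http://other.example",
]

# 0-based index of the last "//" in each URL
positions = [u.rfind("//") for u in urls]

print(positions[0])      # 5  (sixth character of an http URL)
print(positions[1])      # 6  (seventh character of an https URL)
print(positions[2] > 7)  # True: a later "//" appears after the protocol
```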

# 7. "http/https" in Domain name
### Checks for the presence of "http/https" in the domain part of the URL. The phishers may add the “HTTPS” token to the domain part of a URL in order to trick users.

### If the URL has "http/https" in the domain part, the value assigned to this feature is 1 (phishing) or else 0 (legitimate).

In [None]:
# 7.Checks existence of “HTTPS” Token in the Domain Part of the URL (https_Domain)
def httpDomain(url):
  domain = urlparse(url).netloc
  if 'https' in domain:
    return 1            # phishing
  else:
    return 0            # legitimate
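The point of this feature is that the token appears in the domain, not in the scheme. A minimal sketch on a made-up URL (an assumption for illustration) shows the difference:

```python
from urllib.parse import urlparse

# Hypothetical URL: plain http scheme, but "https" embedded in the domain
url = "http://https-paypal.example.com/login"
parts = urlparse(url)

print(parts.scheme)              # http
print("https" in parts.netloc)   # True
```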

# 8. Prefix or Suffix "-" in Domain
### Checking the presence of '-' in the domain part of URL. The dash symbol is rarely used in legitimate URLs. Phishers tend to add prefixes or suffixes separated by (-) to the domain name so that users feel that they are dealing with a legitimate webpage.

### If the URL has '-' symbol in the domain part of the URL, the value assigned to this feature is 1 (phishing) or else 0 (legitimate).

In [None]:
# 8.Checking for Prefix or Suffix Separated by (-) in the Domain (Prefix/Suffix)
def prefixSuffix(url):
    if '-' in urlparse(url).netloc:
        return 1            # phishing
    else:
        return 0            # legitimate

# Computing URL Features
Create a list and a function that calls the other functions and stores all the features of the URL in the list. We will extract the features of each URL and append to this list.

In [None]:
# Function to extract features
def featureExtraction(url,label):

  features = []
  #Address bar based features
  features.append(getDomain(url))
  features.append(havingIP(url))
  features.append(haveAtSign(url))
  features.append(getLength(url))
  features.append(getDepth(url))
  features.append(redirection(url))
  features.append(httpDomain(url))
  features.append(prefixSuffix(url))
  features.append(label)

  return features

# Legitimate URLs:
**Now, feature extraction is done on legitimate URLs.**



In [None]:
#Extracting the features & storing them in a list
legi_features = []
label = 0

for i in range(0, 5000):
  url = legiurl['URLs'][i]
  legi_features.append(featureExtraction(url,label))
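The loop above hard-codes the sample size of 5,000. As a possible alternative, the sketch below iterates over the column directly, so it works for any number of rows. The toy dataframe and the `toy_features` extractor are hypothetical stand-ins, not the notebook's actual data or functions.

```python
import pandas as pd

# Toy stand-in for the notebook's URL dataframe (hypothetical URLs)
urls = pd.DataFrame({"URLs": ["http://example.com/a", "http://example.org/b/c"]})

def toy_features(url, label):
    # Hypothetical two-feature extractor used only for this illustration
    return [len(url), url.count("/"), label]

# Iterating over the column visits every row, whatever the sample size
rows = [toy_features(u, 0) for u in urls["URLs"]]
frame = pd.DataFrame(rows, columns=["length", "slashes", "Label"])
print(frame.shape)  # (2, 3)
```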

In [None]:
#converting the list to dataframe
feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', 
                      'https_Domain',  'Prefix/Suffix', 'Label']

legitimate = pd.DataFrame(legi_features, columns= feature_names)
legitimate.head()

Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,Prefix/Suffix,Label
0,graphicriver.net,0,0,1,1,0,0,0,0
1,ecnavi.jp,0,0,1,1,1,0,0,0
2,hubpages.com,0,0,1,1,0,0,0,0
3,extratorrent.cc,0,0,1,3,0,0,0,0
4,icicibank.com,0,0,1,3,0,0,0,0


In [None]:
# Storing the extracted legitimate URLs features to csv file
legitimate.to_csv('legitimate.csv', index= False)

# Phishing URLs:
Now, feature extraction is performed on phishing URLs.

In [None]:
#Extracting the features & storing them in a list
phish_features = []
label = 1
for i in range(0, 5000):
  url = phishurl['url'][i]
  phish_features.append(featureExtraction(url,label))

In [None]:
#converting the list to dataframe
feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', 
                      'https_Domain', 'Prefix/Suffix', 'Label']

phishing = pd.DataFrame(phish_features, columns= feature_names)
phishing.head()

Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,Prefix/Suffix,Label
0,confirmprofileaccount.com,0,0,0,0,0,0,0,1
1,marreme.com,0,0,0,2,0,0,0,1
2,modsecpaststudents.com,0,0,0,1,0,0,0,1
3,docs.google.com,0,0,1,5,0,0,0,1
4,oportunidadedasemana.com,0,0,1,1,1,0,0,1


In [None]:
# Storing the extracted phishing URLs features to csv file
phishing.to_csv('phishing.csv', index= False)

# Final Dataset
In the above sections we formed two dataframes of legitimate & phishing URL features. Now, we will combine them into a single dataframe and export the data to a csv file for the Machine Learning training done in another notebook.

In [None]:
#Concatenating the dataframes into one 
urldata = pd.concat([legitimate, phishing]).reset_index(drop=True)

urldata.head() # label 0 shows legitimate included

Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,Prefix/Suffix,Label
0,graphicriver.net,0,0,1,1,0,0,0,0
1,ecnavi.jp,0,0,1,1,1,0,0,0
2,hubpages.com,0,0,1,1,0,0,0,0
3,extratorrent.cc,0,0,1,3,0,0,0,0
4,icicibank.com,0,0,1,3,0,0,0,0


In [None]:
urldata.tail()   # label 1 shows that phishing URLs are also included

Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,Prefix/Suffix,Label
9995,rebrand.ly,0,0,0,1,0,0,0,1
9996,paypal-test.projektumfeld.de,0,0,0,0,0,0,1,1
9997,sporttime.com.mx,0,0,1,9,0,0,0,1
9998,stopcarpeliculas.com.br,0,0,1,6,0,0,0,1
9999,mathbariatimes.com,0,0,0,2,0,0,0,1


In [None]:
# shape of Final dataframe
urldata.shape

(10000, 9)

In [None]:
# Storing the data in CSV file
urldata.to_csv('urldata.csv', index=False)
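Because `pd.concat` keeps the input order, all legitimate rows come before all phishing rows in the combined dataframe. The sketch below, on toy data (the domain names are made up), shows one way to confirm the class balance and to shuffle the rows before a later train/test split. Whether shuffling is needed depends on how the training notebook splits the data, which is an assumption here.

```python
import pandas as pd

# Toy stand-ins for the two feature dataframes (hypothetical values)
legit = pd.DataFrame({"Domain": ["a.example", "b.example"], "Label": [0, 0]})
phish = pd.DataFrame({"Domain": ["c.example", "d.example"], "Label": [1, 1]})

combined = pd.concat([legit, phish]).reset_index(drop=True)

# Count rows per class to confirm the dataset is balanced
counts = combined["Label"].value_counts().to_dict()
print(counts)

# Shuffle the rows reproducibly so the classes are interleaved
shuffled = combined.sample(frac=1, random_state=12).reset_index(drop=True)
print(shuffled.shape)  # (4, 2)
```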