### FEATURE EXTRACTION

In [1]:
import pandas as pd

<hr>

#### Data Ingestion 

Phishing data : https://www.phishtank.com/developer_info.php  <br>
Legitimate data : https://www.unb.ca/cic/datasets/url-2016.html

Loading Phishing data

In [2]:
phishing_data=pd.read_csv("Datasets/phishing.csv")
phishing_data.shape

(14858, 8)

In [3]:
phishing_data.head()

Unnamed: 0,phish_id,url,phish_detail_url,submission_time,verified,verification_time,online,target
0,6557033,http://u1047531.cp.regruhosting.ru/acces-inges...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T22:01:43+00:00,yes,2020-05-09T22:03:07+00:00,yes,Other
1,6557032,http://hoysalacreations.com/wp-content/plugins...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T22:01:37+00:00,yes,2020-05-09T22:03:07+00:00,yes,Other
2,6557011,http://www.accsystemprblemhelp.site/checkpoint...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T21:54:31+00:00,yes,2020-05-09T21:55:38+00:00,yes,Facebook
3,6557010,http://www.accsystemprblemhelp.site/login_atte...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T21:53:48+00:00,yes,2020-05-09T21:54:34+00:00,yes,Facebook
4,6557009,https://firebasestorage.googleapis.com/v0/b/so...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T21:49:27+00:00,yes,2020-05-09T21:51:24+00:00,yes,Microsoft


In [4]:
phishing_data.dtypes

phish_id              int64
url                  object
phish_detail_url     object
submission_time      object
verified             object
verification_time    object
online               object
target               object
dtype: object

In [5]:
#randomly taking the 5000 samples
phishing_sample=phishing_data.sample(n=5000,random_state=10).copy()
#re-indexing from 0
phishing_sample=phishing_sample.reset_index(drop=True)
phishing_sample.head()

Unnamed: 0,phish_id,url,phish_detail_url,submission_time,verified,verification_time,online,target
0,6555613,https://www.momos-ambachtelijkeijs.nl/https/35...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T02:02:37+00:00,yes,2020-05-09T02:07:51+00:00,yes,Other
1,5957208,http://623112j4j3.codesandbox.io/IlOysTgNjFrGt...,http://www.phishtank.com/phish_detail.php?phis...,2019-03-05T11:15:13+00:00,yes,2019-03-15T07:18:26+00:00,yes,Other
2,6537214,http://airmaxzoomturkey.com/index.php,http://www.phishtank.com/phish_detail.php?phis...,2020-04-29T14:58:16+00:00,yes,2020-04-29T14:59:33+00:00,yes,Other
3,6496943,http://betasus21.blogspot.com,http://www.phishtank.com/phish_detail.php?phis...,2020-04-10T08:22:26+00:00,yes,2020-04-10T08:54:51+00:00,yes,Other
4,6541414,https://welovebellsouthandswbell.weebly.com,http://www.phishtank.com/phish_detail.php?phis...,2020-05-01T14:57:37+00:00,yes,2020-05-01T14:58:41+00:00,yes,Other


In [6]:
phishing_sample.shape

(5000, 8)

In [7]:
# Storing the sampled phishing data in csv file
phishing_sample.to_csv('Datasets/phishing_data_sample.csv', index= False)

Loading Legitimate data

In [8]:
legitimate_data=pd.read_csv("Datasets/legitimate.csv")
# assigning the column name
legitimate_data.columns=["urls"]
legitimate_data.head()

Unnamed: 0,urls
0,http://1337x.to/torrent/1110018/Blackhat-2015-...
1,http://1337x.to/torrent/1122940/Blackhat-2015-...
2,http://1337x.to/torrent/1124395/Fast-and-Furio...
3,http://1337x.to/torrent/1145504/Avengers-Age-o...
4,http://1337x.to/torrent/1160078/Avengers-age-o...


In [9]:
legitimate_data.shape

(35377, 1)

In [10]:
legitimate_data.dtypes

urls    object
dtype: object

In [11]:
#randomly taking the 5000 samples
legitimate_sample=legitimate_data.sample(n=5000,random_state=10).copy()
#re-indexing from 0
legitimate_sample=legitimate_sample.reset_index(drop=True)
legitimate_sample.head()

Unnamed: 0,urls
0,http://lifehacker.com/5900260/how-can-stop-wor...
1,http://cookpad.com/recipe/list/212659?utf8=%E2...
2,http://conservativetribune.com/civil-rights-le...
3,http://distractify.com/igor-feng/28-photos-tha...
4,http://motthegioi.vn/mao-trach-dong-qua-sach-b...


In [12]:
legitimate_sample.shape

(5000, 1)

In [13]:
# Storing the sampled legitimate data in csv file
legitimate_sample.to_csv('Datasets/legitimate_data_sample.csv', index= False)

<hr>

<b>Features collected from academic studies for the phishing domain detection with machine learning techniques are grouped as given below.</b>

1) URL-Based Features<br>
2) Domain-Based Features<br>
3) Content-Based Features<br>

<hr>

#### 1) URL-Based Features:

In [14]:
#importing the required libraries
from urllib.parse import urlparse,urlencode
import ipaddress
import re


Many URL-based features can be extracted. The following were used in this project:

1. Domain name
2. IP address in the URL
3. "@" in the URL
4. Length of URL
5. Depth of the URL 
6. Redirection in URL
7. "http/https" in the domain name
8. Using URL Shortening services "TinyURL"
9. Adding Prefix or Suffix Separated by (-) to the Domain

#### 1. Domain name

In [15]:
#function to get the domain name of the URL
def getDomainName(url):
    domainName=urlparse(url).netloc
    #strip only a leading "www." (the dot is escaped so it is matched literally)
    domainName=re.sub(r"^www\.","",domainName)
    return domainName
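
As a quick check of what `urlparse` returns, the sketch below splits an example URL and strips a leading `www.`. The helper name `strip_www` and the example URLs are illustrative assumptions, not part of the notebook's pipeline.

```python
from urllib.parse import urlparse

def strip_www(netloc):
    # Hypothetical helper: remove only a leading "www.", nothing further inside the host
    return netloc[4:] if netloc.startswith("www.") else netloc

parts = urlparse("http://www.example.com/login/index.html")
domain = strip_www(parts.netloc)
print(parts.netloc)   # www.example.com
print(domain)         # example.com
```

Stripping only the prefix matters for hosts such as `news.www.example.com`, where a global replace would also remove the inner `www.`.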

#### 2. IP address in the URL

If an IP address is used in place of the domain name in the URL, such as “http://125.98.3.123/fake.html”, this is a strong sign that someone is trying to steal personal information. 

<b>Rule:</b> 
* If the Domain Part has an IP Address → Phishing
* Otherwise → Legitimate


In [16]:
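#checking if the host part of the URL is an IP address
def checkIPadddress(url):
    try:
        #ip_address() expects a bare address, so pass the host rather than the full URL
        ipaddress.ip_address(urlparse(url).hostname)
        return 1   #phishing
    except Exception:
        return 0  #legitimate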
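
The rule applies to the domain part of the URL, and `ipaddress.ip_address` only accepts a bare address string. The sketch below, with the hypothetical helper `host_is_ip`, shows the difference between testing the host and testing the full URL.

```python
import ipaddress
from urllib.parse import urlparse

def host_is_ip(url):
    # hostname drops the scheme, port, and path, leaving only the host
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

print(host_is_ip("http://125.98.3.123/fake.html"))   # True
print(host_is_ip("http://www.example.com/"))         # False

# Passing the whole URL is rejected, because it is not a bare address
try:
    ipaddress.ip_address("http://125.98.3.123/fake.html")
    full_url_accepted = True
except ValueError:
    full_url_accepted = False
print(full_url_accepted)                             # False
```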

#### 3. "@" in the URL

Using '@' symbol in the URL leads the browser to ignore everything preceding the '@' symbol and the real address often follows the '@' symbol. 

<b>Rule:</b> 
* If Url has '@' symbol→ Phishing
* Otherwise→ Legitimate

In [17]:
#checking if there is "@" symbol in the URL
def checkAtSymbol(url):
    if "@" in url:
        return 1   #phishing
    else:
        return 0    #legitimate
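
To see why the part before '@' is misleading, the sketch below parses an example URL (the host names are placeholders) with `urlparse`. Everything before the '@' is treated as user information, and the host that is actually contacted comes after it.

```python
from urllib.parse import urlparse

parts = urlparse("http://www.example.com@phish.example.net/login")
print(parts.username)   # www.example.com   (looks like the destination, but is only user info)
print(parts.hostname)   # phish.example.net (the host the browser connects to)
```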

#### 4. Length of the URL

Phishers can use a long URL to hide the suspicious part in the address bar. 


To choose a threshold, we calculated the average length of the phishing URLs in the dataset. Based on this average, a URL with 82 or more characters is classified as phishing.

In [18]:
#average length of the URLs in a list or Series
def getAverageLength(li):
    total=0   #avoids shadowing the built-in sum()
    for url in li:
        total=total+len(url)
    avglen=total/len(li)
    return avglen


In [19]:
phishing_url=phishing_data["url"]
print("Average length of the phishing urls: ",getAverageLength(phishing_url))

Average length of the phishing urls:  82.80421321846816



<b>Rule:</b>
* If URL length<82 → Legitimate
* Otherwise → Phishing

In [20]:
#finding the length of url
def getLength(url):
  if len(url) < 82:
    return 0    #legitimate        
  else:
    return 1    #phishing

#### 5. Depth of the URL

This feature counts the number of sub pages in the given URL, based on the '/' separators in its path. The value of the feature is numerical.

In [21]:
#finding the depth of url
def getDepth(url):
  split_url = urlparse(url).path.split('/')
  depth = 0
  for i in range(len(split_url)):
    if len(split_url[i]) != 0:
      depth = depth+1
  return depth
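
A compact sketch of the same counting idea, using the hypothetical helper `url_depth` and placeholder URLs. It shows that only the path is counted, so a '/' inside the query string does not add depth.

```python
from urllib.parse import urlparse

def url_depth(url):
    # Count non-empty path segments; the query string is not part of .path
    return sum(1 for segment in urlparse(url).path.split("/") if segment)

print(url_depth("http://example.com"))                    # 0
print(url_depth("http://example.com/a/b/c.html?x=1/2"))   # 3
```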

#### 6. Redirection in URL

The existence of '//' within the URL path means that the user will be redirected to another website. An example of such a URL is: http://www.legitimate.com//http://www.phishing.com. 
We examine the location where the last '//' appears. If the URL starts with 'HTTP', the '//' of the scheme appears in the sixth position; if the URL uses 'HTTPS', it appears in the seventh position. A '//' found beyond these positions therefore belongs to the rest of the URL rather than to the scheme.

<b>Rule:</b> 
* If the position of the last occurrence of '//' in the URL is > 7 → Phishing
* Otherwise → Legitimate


In [22]:
#checking if there is redirection '//' in the URL
def checkRedirection(url):
    position=url.rfind('//')
    if position>7:
        return 1    #phishing
    else:
        return 0    #legitimate
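
The positions mentioned above can be checked directly with `str.rfind`, which returns a zero-based index (so index 5 is the sixth character). The first two URLs are placeholders; the third is the example from the text.

```python
urls = [
    "http://www.example.com/index.html",
    "https://www.example.com/index.html",
    "http://www.legitimate.com//http://www.phishing.com",
]
# Index of the last '//' in each URL
positions = [u.rfind("//") for u in urls]
print(positions)   # [5, 6, 32]
```

Only the third URL has its last '//' beyond index 7, so only it is flagged by the rule.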
    

#### 7. "http/https" in the domain name

The phishers may add the “HTTPS” token to the domain part of a URL in order to trick users. For example,
http://https-www-paypal-it-webapps-mpp-home.soft-hair.com/. <br>
<b>Rule:</b> 
* If http/https token in domain part of the URL → Phishing
* Otherwise → Legitimate

In [23]:
#checking if we have "http/https" in the domain name of the url
def checkHttpDomain(url):
    domainName = urlparse(url).netloc
    #"https" contains "http", so a single membership test covers both tokens
    if 'http' in domainName:
        return 1  #phishing
    else:
        return 0  #legitimate
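
A minimal sketch of the token check on the network location, using the hypothetical helper `has_http_token`. It also shows a Python precedence detail worth keeping in mind: `"http" or "https" in s` is parsed as `"http" or ("https" in s)`, and a non-empty string is always truthy.

```python
from urllib.parse import urlparse

def has_http_token(url):
    # "https" contains "http", so one membership test covers both tokens
    return "http" in urlparse(url).netloc.lower()

print(has_http_token("http://https-www-paypal-it-webapps-mpp-home.soft-hair.com/"))  # True
print(has_http_token("https://www.example.com/"))                                    # False

# Precedence pitfall: this is True for any string on the right-hand side
always_true = bool("http" or "https" in "www.example.com")
print(always_true)   # True
```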

#### 8. Using URL Shortening services "TinyURL"

URL shortening is a method on the “World Wide Web” in which a URL may be made considerably smaller in length and still lead to the required webpage. This is accomplished by means of an “HTTP Redirect” on a domain name that is short, which links to the webpage that has a long URL. For example, the URL http://portal.hud.ac.uk/ can be shortened to “bit.ly/19DXSk4”.  <br>
<b>Rule:</b> 
* If the URL uses a shortening service (e.g. TinyURL) → Phishing
* Otherwise→ Legitimate


In [24]:
#shortening services
shortening_services = r"bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|" \
                      r"yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|" \
                      r"short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|" \
                      r"doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|db\.tt|" \
                      r"qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|q\.gs|is\.gd|" \
                      r"po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|x\.co|" \
                      r"prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|" \
                      r"tr\.im|link\.zip\.net"

In [25]:
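#checking if the host of the url belongs to a shortening service
def checktinyURL(url):
    #match the whole host, so that e.g. "t\.co" does not match inside "microsoft.com"
    host=urlparse(url).hostname or ""
    tinyURL_match=re.fullmatch(r"(?:www\.)?(?:"+shortening_services+r")",host,re.IGNORECASE)
    if tinyURL_match:
        return 1    #phishing
    else:
        return 0    #legitimate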
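
One point to watch with shortener patterns is where they are matched. Searched anywhere in the URL, a short pattern such as `t\.co` also occurs inside unrelated hosts. The sketch below compares a substring search with an exact host lookup; `SHORTENER_HOSTS` is a small illustrative subset and `uses_shortener` is a hypothetical helper.

```python
import re
from urllib.parse import urlparse

# Illustrative subset only (assumption), not the full list of services
SHORTENER_HOSTS = {"bit.ly", "goo.gl", "t.co", "tinyurl.com", "ow.ly"}

def uses_shortener(url):
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host in SHORTENER_HOSTS

# Substring search: "t.co" occurs inside "microsoft.com"
substring_hit = re.search(r"t\.co", "https://www.microsoft.com/") is not None
print(substring_hit)                                   # True (false positive)

print(uses_shortener("https://www.microsoft.com/"))    # False
print(uses_shortener("http://bit.ly/19DXSk4"))         # True
```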

#### 9. Adding Prefix or Suffix Separated by (-) to the Domain

The dash symbol is rarely used in legitimate URLs. Phishers tend to add prefixes or suffixes separated by (-) to the domain name so that users feel that they are dealing with a legitimate webpage. For example http://www.Confirme-paypal.com/.  <br>
<b>Rule:</b> 
* If Domain Name Part Includes (-) Symbol → Phishing
* Otherwise → Legitimate

In [26]:
#checking if there is prefix or suffix separated by "-" in the domain name
def checkPrefixSuffix(url):
    domainName = urlparse(url).netloc
    if '-' in domainName:
        return 1  #phishing
    else:
        return 0  #legitimate

<hr>

#### 2) Domain-Based Features

In [27]:
#importing the required libraries
import re
from bs4 import BeautifulSoup
import whois
import urllib
import urllib.request
from datetime import datetime

Many domain-based features can be extracted. The following were used in this project:

1. DNS_Record
2. Age of Domain
3. End Period of Domain

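#### 1. DNS_Record

This feature has no separate function. It is set inside `featureExtraction`: if the WHOIS lookup for the domain raises an error, the value is 1 (phishing); otherwise it is 0 (legitimate).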

#### 2. Age of Domain

This feature can be extracted from the WHOIS database (Whois 2005). Most phishing websites live for a short period of time. By reviewing our dataset, we find that the minimum age of a legitimate domain is 6 months. In the code below, the age is measured as the time between the creation date and the expiration date, with a month approximated as 30 days.

<b>Rule:</b> 
* If Age Of Domain ≥ 6 months → Legitimate
* Otherwise → Phishing


In [41]:
#finding domain age
def domainAge(domain_name):
  creation_date = domain_name.creation_date
  expiration_date = domain_name.expiration_date

  if (isinstance(creation_date,str) or isinstance(expiration_date,str)):
    try:
      creation_date = datetime.strptime(creation_date,'%Y-%m-%d')
      expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
    except:
      return 1

  if ((expiration_date is None) or (creation_date is None)):
      return 1

  elif ((type(expiration_date) is list) or (type(creation_date) is list)):
      return 1
      
  else:
    ageofdomain = abs((expiration_date - creation_date).days)
    if ((ageofdomain/30) < 6):
      return 1      # Phishing
    else:
      return 0      # Legitimate
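
The date arithmetic can be tried in isolation. The sketch below assumes both dates are already `datetime` objects (WHOIS results may also arrive as strings, lists, or `None`, which the function above treats as phishing). The helper name `registration_span_months` and the dates are illustrative.

```python
from datetime import datetime

def registration_span_months(creation_date, expiration_date):
    # Approximate a month as 30 days, matching the /30 convention used above
    return abs((expiration_date - creation_date).days) / 30

short_span = registration_span_months(datetime(2020, 1, 1), datetime(2020, 4, 1))
long_span = registration_span_months(datetime(2015, 1, 1), datetime(2025, 1, 1))

# Under the 6-month rule, the first domain is flagged and the second is not
print(short_span < 6, long_span < 6)   # True False
```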

#### 3. End Period of Domain

This feature can be extracted from the WHOIS database (Whois 2005). For this feature, the remaining domain time is calculated as the difference between the expiration time and the current time. The end period considered for the legitimate domain is 6 months or less for this project.

<b>Rule:</b> 
* If End Of Domain < 6 months → Legitimate
* Otherwise → Phishing

In [None]:

#finding the remaining registration time of the domain
def domainEnd(domain_name):
  expiration_date = domain_name.expiration_date
  if isinstance(expiration_date,str):
    try:
      expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
    except:
      return 1
  if (expiration_date is None):
      return 1
  elif (type(expiration_date) is list):
      return 1
  else:
    today = datetime.now()
    end = abs((expiration_date - today).days)
    if ((end/30) < 6):
      end = 0
    else:
      end = 1
  return end

<hr>

#### 3) Content-Based Features

Many content-based features can be extracted. The following were used in this project:
1. iframe Redirection
2. Status Bar Customization
3. Disabling Right Click
4. Website Forwarding

#### 1. iframe Redirection

An iframe is an HTML tag used to display an additional webpage inside the one that is currently shown. Phishers can use the 'iframe' tag and make it invisible, i.e. without frame borders. 
To do so, they set the 'frameBorder' attribute, which controls whether the browser renders a visual border around the frame. 

<b>Rule:</b> 
* If an iframe tag or a frameBorder attribute is found → Phishing
* Otherwise → Legitimate


In [30]:
#checking iframe in webpage content
def iframe(response):
    try:
      #match the "<iframe" tag or the frameBorder attribute, not single characters
      if re.findall(r"<iframe|frameborder", response.text, re.IGNORECASE):
        return 1     # Phishing
      else:
        return 0     # Legitimate
    except:
        return 1
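
As an alternative to regular expressions, iframe tags can be located with the standard-library HTML parser. This is a sketch under assumptions: the class `IframeFinder`, the helper `find_iframes`, and the sample page are illustrative. Note also that square brackets in a regular expression define a character class, so a pattern like `[<iframe>]` matches any single one of those characters rather than the tag itself.

```python
from html.parser import HTMLParser

class IframeFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.iframes = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this
        if tag == "iframe":
            self.iframes.append(dict(attrs))

def find_iframes(html):
    finder = IframeFinder()
    finder.feed(html)
    return finder.iframes

page = '<html><body><IFRAME src="http://example.net" frameBorder="0"></IFRAME></body></html>'
frames = find_iframes(page)
print(len(frames))                    # 1
print(frames[0].get("frameborder"))   # 0
```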

#### 2. Status Bar Customization

Phishers may use JavaScript to show a fake URL in the status bar. To extract this feature, we inspect the webpage source code, particularly the 'onMouseOver' event, and check whether it makes any changes to the status bar.

<b>Rule:</b>
* If onMouseOver Changes Status Bar → Phishing
* Otherwise → Legitimate


In [31]:
#checking mouseover in scripts
def mouseOver(response): 
  try:
    if re.findall("<script>.+onmouseover.+</script>", response.text):
      return 1   # Phishing
    else:
      return 0    # Legitimate
  except:
      return 1
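
Script blocks often span several lines and carry attributes on the opening tag. The sketch below is one possible pattern that tolerates both, using `re.DOTALL` and `re.IGNORECASE`. The name `has_mouseover_script` and the sample markup are assumptions for illustration.

```python
import re

# [^>]* allows attributes on <script>; DOTALL lets "." cross line breaks
MOUSEOVER_IN_SCRIPT = re.compile(
    r"<script[^>]*>.*?onmouseover.*?</script>", re.IGNORECASE | re.DOTALL
)

def has_mouseover_script(html):
    return MOUSEOVER_IN_SCRIPT.search(html) is not None

sample = (
    "<script type='text/javascript'>\n"
    "  link.onMouseOver = function () { window.status = 'http://example.com'; };\n"
    "</script>"
)
print(has_mouseover_script(sample))                    # True
print(has_mouseover_script("<p>no scripts here</p>"))  # False
```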

#### 3. Disabling Right Click

Phishers use JavaScript to disable the right-click function, so that users cannot view and save the webpage source code. This feature is treated exactly as 'Using onMouseOver to hide the Link'. Nonetheless, for this feature, we will search for event “event.button==2” in the webpage source code and check if the right click is disabled. 

<b>Rule:</b> 
* If Right Click Disabled → Phishing 
* Otherwise→Legitimate


In [32]:
#checking the event button
def rightClick(response):
  try:
    #the dot is escaped so that it only matches a literal "."
    if re.findall(r"event\.button ?== ?2", response.text):
      return 1    # Phishing
    else:
      return 0    # Legitimate
  except:
    return 1
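
A small sketch of the pattern on sample snippets (the snippets and the name `disables_right_click` are assumptions). Escaping the dot keeps the pattern from matching arbitrary characters between `event` and `button`.

```python
import re

RIGHT_CLICK = re.compile(r"event\.button\s*==\s*2")

def disables_right_click(html):
    return RIGHT_CLICK.search(html) is not None

print(disables_right_click("if (event.button == 2) { return false; }"))  # True
print(disables_right_click("if (event.button==2) return false;"))        # True
print(disables_right_click("var eventXbutton==2;"))                      # False
```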

#### 4. Website Forwarding

One signal that distinguishes phishing websites from legitimate ones is how many times a website has been redirected. In our dataset, we find that legitimate websites have been redirected at most once. On the other hand, phishing websites containing this feature have been redirected at least 4 times. The function below flags a response as phishing when its redirect history contains more than two entries.

In [46]:
# Checks the number of forwardings (Web_Forwards)    
def forwarding(response):
  if response == "":
    return 1
  else:
    if len(response.history) <= 2:
      return 0
    else:
      return 1
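
The redirect count comes from the `history` attribute of a `requests` response, which lists the intermediate redirect responses. The sketch below uses stand-in objects instead of real network calls, so `direct` and `chained` are assumptions for illustration.

```python
from types import SimpleNamespace

def redirect_count(response):
    # On a real requests.Response, history holds one entry per redirect followed
    return len(response.history)

# Stand-ins (assumption): real code would pass the result of requests.get(url)
direct = SimpleNamespace(history=[])
chained = SimpleNamespace(history=["r1", "r2", "r3", "r4"])

print(redirect_count(direct), redirect_count(chained))   # 0 4
```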

<hr>

#### Computing Features

In [33]:
import requests

In [47]:
#Function to extract features
def featureExtraction(url,label):

  features = []
  # URL-Based features (9)
  features.append(getDomainName(url))
  features.append(checkIPadddress(url))
  features.append(checkAtSymbol(url))
  features.append(getLength(url))
  features.append(getDepth(url))
  features.append(checkRedirection(url))
  features.append(checkHttpDomain(url))
  features.append(checktinyURL(url))
  features.append(checkPrefixSuffix(url))
  
  #Domain-Based features (3)
  dns = 0
  try:
    domain_name = whois.whois(urlparse(url).netloc)
  except:
    dns = 1

  features.append(dns)
  features.append(1 if dns == 1 else domainAge(domain_name))
  features.append(1 if dns == 1 else domainEnd(domain_name))

  #Content-Based features (4)
  try:
    response = requests.get(url, timeout=10)   #timeout value is an assumption; it keeps an unresponsive host from stalling the loop
  except:
    response = ""
    
  features.append(iframe(response))
  features.append(mouseOver(response))
  features.append(rightClick(response))
  features.append(forwarding(response))

  features.append(label)
  return features

<hr>

#### Phishing Features

In [48]:
#Extracting the features & storing them in a list
phishing_features = []
label = 1
for i in range(0,5000):
  url = phishing_sample['url'][i]
  phishing_features.append(featureExtraction(url,label))

In [None]:
#converting the list to dataframe
feature_names = [ 'Domain_Name','Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', 
                 'https_Domain', 'TinyURL', 'Prefix_Suffix',

                 'DNS_Record', 'Domain_Age','End_Domain',

                 'iFrame','Mouse_Over','Right_Click','Web_Forwards', 'label']

phishing = pd.DataFrame(phishing_features, columns= feature_names)
phishing.head()

In [51]:
# Storing the extracted phishing URL features in a csv file
phishing.to_csv('Datasets/phishing_feature_extracted.csv', index= False)

<hr>

#### Legitimate Features

In [52]:
#Extracting the features & storing them in a list
legitimate_features = []
label = 0
for i in range(0,5000):
  url = legitimate_sample['urls'][i]
  legitimate_features.append(featureExtraction(url,label))

In [None]:
#converting the list to dataframe
feature_names = [ 'Domain_Name','Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', 
                 'https_Domain', 'TinyURL', 'Prefix_Suffix',

                 'DNS_Record', 'Domain_Age','End_Domain',

                 'iFrame','Mouse_Over','Right_Click','Web_Forwards', 'label']

legitimate = pd.DataFrame(legitimate_features, columns= feature_names)

In [55]:
# Storing the extracted legitimate URL features in a csv file
legitimate.to_csv('Datasets/legitimate_feature_extracted.csv', index= False)

<hr>

#### Final Dataset

In the sections above we built two dataframes, one with the legitimate URL features and one with the phishing URL features. We now combine them into a single dataframe and export it to a csv file for machine learning training.

In [None]:
#Concatenating the dataframes into one 
urldata = pd.concat([phishing,legitimate]).reset_index(drop=True)
urldata.head()

In [57]:
# Storing the data in CSV file
urldata.to_csv('Datasets/final_feature_extracted.csv', index=False)

<hr>

#### Conclusion

With this, we have extracted 16 features (17 columns including the label) for 10,000 URLs: 5,000 phishing and 5,000 legitimate.