# Address Bar Based Features

Many features can be extracted from a URL and treated as address bar based features. Of these, the following were considered for this project.

1. Domain of URL
2. IP Address in URL
3. "@" Symbol in URL
4. Length of URL
5. Depth of URL
6. Redirection "//" in URL
7. "http/https" in Domain name
8. Using URL Shortening Services "TinyURL"
9. Prefix or Suffix "-" in Domain

In [1]:
from urllib.parse import urlparse
import ipaddress
import re
import pandas as pd

In [2]:
# 1. Domain of the URL (Domain)
def getDomain(url):
  domain = urlparse(url).netloc
  # Strip only a leading "www." (the dot is escaped so it matches literally)
  domain = re.sub(r"^www\.", "", domain)
  return domain

In [3]:
# 2.Checks for IP address in URL (Have_IP)
def havingIP(url):
    match = re.search(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', url)
    if match:
        return 1
    else:
        return 0
# 0->No IP address
# 1->IP address present in URL
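
The regular expression above matches any dotted group of four numbers anywhere in the URL, including in the path, and does not check that each number is at most 255. As a possible stricter variant (a sketch, not the method used for the results in this notebook), the `ipaddress` module can validate just the host part. The function name `host_is_ip` is hypothetical.

```python
import ipaddress
from urllib.parse import urlparse

def host_is_ip(url):
    """Return 1 if the URL's host parses as an IPv4 or IPv6 address, else 0."""
    # .hostname drops any port, userinfo, and IPv6 brackets
    host = urlparse(url).hostname
    if host is None:
        return 0
    try:
        ipaddress.ip_address(host)
        return 1
    except ValueError:
        return 0
```

With this variant, `http://192.168.0.1/login` is flagged, while `http://example.com/1.2.3.4/x` is not, because only the host is inspected.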

In [4]:
# 3.Checks the presence of @ in URL (Have_At)
def haveAtSign(url):
  if "@" in url:
    at = 1    
  else:
    at = 0    
  return at
# 0->No @
# 1->@ present in URL

In [5]:
# 4.Finding the length of URL
def getLength(url):
  length = len(url)           
  return length

In [6]:
# 5. Counts the non-empty path segments of the URL (URL_Depth)
def getDepth(url):
  s = urlparse(url).path.split('/')
  depth = 0
  for j in range(len(s)):
    if len(s[j]) != 0:
      depth = depth+1
  return depth
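
To make the depth feature concrete, here is a compact sketch that performs the same count of non-empty path segments. The name `path_depth` is hypothetical and is only used for illustration.

```python
from urllib.parse import urlparse

def path_depth(url):
    """Count the non-empty segments of the URL path."""
    return sum(1 for segment in urlparse(url).path.split('/') if segment)

# Path is /questions/694540/windows-asks-... -> three segments
print(path_depth("http://superuser.com/questions/694540/"
                 "windows-asks-me-for-a-bluetooth-pairing-code-for-my-headset"))  # 3
# Empty segments produced by repeated or trailing slashes are ignored
print(path_depth("http://example.com//a//b/"))  # 2
```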

In [7]:
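# 6. Checking for redirection '//' in the URL (Redirection)
def ns_double_slash(url):
  # The '//' after the scheme starts at index 5 ("http://") or 6 ("https://"),
  # so a last '//' starting beyond index 7 is an additional one.
  pos = url.rfind('//')
  if pos > 7:
    return 1
  else:
    return 0
# 0->no '//' after the scheme separator
# 1->'//' present elsewhere in the URL, which indicates a non-standard URL structure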
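
The position threshold is easier to see on a few inputs. The sketch below applies the same rule (last `//` starting beyond index 7) under the hypothetical name `has_extra_double_slash`; the example URLs are made up for illustration.

```python
def has_extra_double_slash(url):
    """Return 1 if the last '//' starts beyond index 7, else 0."""
    return 1 if url.rfind('//') > 7 else 0

print(has_extra_double_slash("http://example.com"))            # 0, '//' at index 5
print(has_extra_double_slash("https://example.com/"))          # 0, '//' at index 6
print(has_extra_double_slash("http://example.com/a//b"))       # 1, '//' at index 20
print(has_extra_double_slash("http://a.com/?u=http://b.com"))  # 1, '//' at index 21
```

A percent-encoded `//` (for example `http%3A%2F%2F` in a query string) is not counted, since the check operates on the raw string.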

In [8]:
# 7. Checks whether the URL scheme is HTTPS (https_Domain)
# Note: this inspects the scheme, not an "https" token inside the domain name.
def httpDomain(url):
  tp = urlparse(url).scheme
  if tp == 'https':
    return 1
  else:
    return 0
# 0->http (or any other scheme)
# 1->https
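
The feature name refers to an "http/https" token in the domain, whereas `httpDomain` looks at the URL scheme. If the token-in-domain variant is wanted instead, it could look roughly like the sketch below. The name `https_token_in_host` is hypothetical, and this variant was not used to produce the table in this notebook.

```python
from urllib.parse import urlparse

def https_token_in_host(url):
    """Return 1 if the host part itself contains the token 'http' or 'https'."""
    host = urlparse(url).netloc.lower()
    # 'https' contains 'http', so one substring test covers both tokens
    return 1 if "http" in host else 0

print(https_token_in_host("http://https-secure-login.example.com/"))  # 1
print(https_token_in_host("https://example.com/"))                    # 0
```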

In [9]:
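# 8. Checking for Shortening Services in URL (Tiny_URL)
from functools import lru_cache

@lru_cache(maxsize=1)
def load_shorteners():
    # Read the shortener list once instead of reopening the file for every URL
    with open(r'data\url_shortening_domains.txt', 'r') as file:
        return frozenset(line.strip() for line in file if line.strip())

def tinyURL(url):
    domain = urlparse(url).netloc
    if domain in load_shorteners():
        return 1
    else:
        return 0
# 0->Normal URL
# 1->Shortened URL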
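
The lookup itself can be tried without the data file. The sketch below uses a small in-memory set; the three domains are only an illustrative sample and not the contents of `url_shortening_domains.txt`, and the name `is_shortened` is hypothetical. Unlike `tinyURL`, it also lowercases the host and drops a leading `www.` before the lookup, which is an assumption about how the list is written.

```python
from urllib.parse import urlparse

# Illustrative sample only; the project reads its list from a text file
EXAMPLE_SHORTENERS = frozenset({"bit.ly", "tinyurl.com", "t.co"})

def is_shortened(url, shorteners=EXAMPLE_SHORTENERS):
    """Return 1 if the URL's host is a known shortening service, else 0."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return 1 if host in shorteners else 0

print(is_shortened("https://bit.ly/abc"))         # 1
print(is_shortened("http://example.com/bit.ly"))  # 0, match must be on the host
```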

In [10]:
# 9. Counts the hyphens in the domain (Hyphen_Count). Phishing domains often use many hyphens to build long domain names.
def domain_dash_count(url):
    domain = urlparse(url).netloc
    dash_count = domain.count('-')
    return dash_count

In [11]:
def featureExtraction(url,label):
    features = []
    features.append(url)
    
    #Address bar based features (9)
    features.append(getDomain(url))
    print("Check 01 of 17 complete.", end="\r")
    features.append(havingIP(url))
    print("Check 02 of 17 complete.", end="\r")
    features.append(haveAtSign(url))
    print("Check 03 of 17 complete.", end="\r")
    features.append(getLength(url))
    print("Check 04 of 17 complete.", end="\r")
    features.append(getDepth(url))
    print("Check 05 of 17 complete.", end="\r")
    features.append(ns_double_slash(url))
    print("Check 06 of 17 complete.", end="\r")
    features.append(httpDomain(url))
    print("Check 07 of 17 complete.", end="\r")
    features.append(tinyURL(url))
    print("Check 08 of 17 complete.", end="\r")
    features.append(domain_dash_count(url))
    print("Check 09 of 17 complete.", end="\r")
    features.append(label)
    return features
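
As a quick sanity check of the purely lexical features, the sketch below recomputes four of them for the first benign URL without needing any data file. The helper name `basic_url_features` and the dict layout are hypothetical; the notebook itself collects the values in a list.

```python
from urllib.parse import urlparse

def basic_url_features(url):
    """Hypothetical helper: lexical features that need no external data file."""
    parsed = urlparse(url)
    return {
        "Have_At": 1 if "@" in url else 0,
        "URL_Length": len(url),
        "URL_Depth": sum(1 for part in parsed.path.split("/") if part),
        "Hyphen_Count": parsed.netloc.count("-"),
    }

row = basic_url_features(
    "http://superuser.com/questions/694540/"
    "windows-asks-me-for-a-bluetooth-pairing-code-for-my-headset"
)
print(row)
# {'Have_At': 0, 'URL_Length': 97, 'URL_Depth': 3, 'Hyphen_Count': 0}
```

Note that `Hyphen_Count` is computed on the domain only, so the hyphens in the path of this URL do not contribute.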

In [12]:
legi_features = []
label = 0
df_real = pd.read_csv(r'data\selected_benign_urls.csv')
for i in range(5000):
    url = df_real['URLs'][i]
    print("Extracting features for url number "+str(i)+" : "+url)
    legi_features.append(featureExtraction(url,label))

Extracting features for url number 0 : http://superuser.com/questions/694540/windows-asks-me-for-a-bluetooth-pairing-code-for-my-headset
Check 09 of 17 complete.
Extracting features for url number 1 : http://bdnews24.com/bangladesh/2015/05/14/bdnews24.com-journalist-threatened-for-protesting-against-ananta-murder
Check 09 of 17 complete.
Extracting features for url number 2 : https://codepen.io/anon/embed/QbNqgv?height=300&slug-hash=QbNqgv&default-tab=result&host=http%3A%2F%2Fcodepen.io
[output truncated: the same progress lines repeat for the remaining URLs]

In [13]:
feature_names = ['URL', 'Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Non-standard_Doubleslash', 
                      'https_Domain', 'Shortened_URL', 'Hyphen_Count', 'Label']

legitimate = pd.DataFrame(legi_features, columns= feature_names)
legitimate.head(10)

| | URL | Domain | Have_IP | Have_At | URL_Length | URL_Depth | Non-standard_Doubleslash | https_Domain | Shortened_URL | Hyphen_Count | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://superuser.com/questions/694540/windows-... | superuser.com | 0 | 0 | 97 | 3 | 0 | 0 | 0 | 0 | 0 |
| 1 | http://bdnews24.com/bangladesh/2015/05/14/bdne... | bdnews24.com | 0 | 0 | 113 | 5 | 0 | 0 | 0 | 0 | 0 |
| 2 | https://codepen.io/anon/embed/QbNqgv?height=30... | codepen.io | 0 | 0 | 112 | 3 | 0 | 1 | 0 | 0 | 0 |
| 3 | http://kakaku.com/pc/note-pc-battery/ranking_0... | kakaku.com | 0 | 0 | 84 | 6 | 0 | 0 | 0 | 0 | 0 |
| 4 | http://kienthuc.net.vn/hoi-dap-ve-tuyen-sinh/c... | kienthuc.net.vn | 0 | 0 | 111 | 2 | 0 | 0 | 0 | 0 | 0 |
| 5 | http://metro.co.uk/2015/05/10/louis-tomlinson-... | metro.co.uk | 0 | 0 | 101 | 4 | 0 | 0 | 0 | 0 | 0 |
| 6 | http://hubpages.com/hub/First-Black-Female-Nav... | hubpages.com | 0 | 0 | 87 | 2 | 0 | 0 | 0 | 0 | 0 |
| 7 | http://tinnhanh360.net/tram-tro-voi-phi-m-truo... | tinnhanh360.net | 0 | 0 | 90 | 1 | 0 | 0 | 0 | 0 | 0 |
| 8 | http://torcache.net/torrent/AD6D871D2F56EDEEFF... | torcache.net | 0 | 0 | 164 | 2 | 0 | 0 | 0 | 0 | 0 |
| 9 | http://nesn.com/2015/05/barcelonas-luis-suarez... | nesn.com | 0 | 0 | 96 | 3 | 0 | 0 | 0 | 0 | 0 |


In [14]:
legitimate.to_csv(r"feature_extracted_data\legitimate_urls_ABBF.csv", header=True, index=False)

In [15]:
# df = pd.read_csv(r'feature_extracted_data\legitimate_urls_ABBF.csv')
# df1 = pd.concat([df, legitimate])
# df1.to_csv(r"feature_extracted_data\legitimate_urls.csv", header=True, index=False)