# Phishing Website Detection - Feature Extraction


Step -1 : 数据预处理 Data preprocessing 

这个数据集包括一些网站的网址（一些是合法的一些是钓鱼的）
This dataset contains few website links (Some of them are legitimate websites and a few are fake websites)
建模前的预处理，并根据现有情况计算特征值
Pre-Processing the data before building a model and also Extracting the features from the data based on certain conditions

In [1]:
#importing numpy and pandas which are required for data pre-processing
import numpy as np
import pandas as pd

In [2]:
#Loading the data
raw_data = pd.read_csv("./raw_datasets/100-legitimate-art.txt") #loading only 100 samples (art websites data)

In [3]:
raw_data.head()

Unnamed: 0,websites
0,http://www.emuck.com:3000/archive/egan.html
1,http://danoday.com/summit.shtml
2,http://groups.yahoo.com/group/voice_actor_appr...
3,http://voice-international.com/
4,http://www.livinglegendsltd.com/


首先我们需要根据URL的部分对URL进行切割
We need to split the data according to parts of the URL
一个典型的URL，比如https://www.example.com/index.html , 包括协议(http)，主机名(www.example.com)和文件名(index.html)

A typical URL could have the form http://www.example.com/index.html, which indicates a protocol (http), a hostname (www.example.com), and a file name (index.html).

对URL细节的具体解释：1.https://doepud.co.uk/blog/anatomy-of-a-url                                                    
                                 2.https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL                  
                                 3.https://techwelkin.com/understanding-the-components-and-structure-of-a-url

Detailed explanation about URL:  1.https://doepud.co.uk/blog/anatomy-of-a-url                                                    
                                 2.https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL                  
                                 3.https://techwelkin.com/understanding-the-components-and-structure-of-a-url
                                        
                                            

In [4]:
raw_data['websites'].str.split("://").head() #Here we divided the protocol from the entire URL. but need it to be divided it 
                                                 #seperate column

0         [http, www.emuck.com:3000/archive/egan.html]
1                     [http, danoday.com/summit.shtml]
2    [http, groups.yahoo.com/group/voice_actor_appr...
3                     [http, voice-international.com/]
4                    [http, www.livinglegendsltd.com/]
Name: websites, dtype: object








refer to this link for splitting of data --> https://apassionatechie.wordpress.com/2018/02/24/how-do-i-split-a-string-into-several-columns-in-a-dataframe-with-pandas-python/

In [5]:
seperation_of_protocol = raw_data['websites'].str.split("://",expand = True) #expand argument in the split method will give you a new column

In [6]:
seperation_of_protocol.head()

Unnamed: 0,0,1
0,http,www.emuck.com:3000/archive/egan.html
1,http,danoday.com/summit.shtml
2,http,groups.yahoo.com/group/voice_actor_appreciatio...
3,http,voice-international.com/
4,http,www.livinglegendsltd.com/


In [7]:
type(seperation_of_protocol)

pandas.core.frame.DataFrame

In [8]:
seperation_domain_name = seperation_of_protocol[1].str.split("/",1,expand = True) #split(seperator,no of splits according to seperator(delimiter),expand)

In [9]:
type(seperation_domain_name)

pandas.core.frame.DataFrame

In [10]:
seperation_domain_name.columns=["domain_name","address"] #renaming columns of data frame

In [11]:
seperation_domain_name.head()

Unnamed: 0,domain_name,address
0,www.emuck.com:3000,archive/egan.html
1,danoday.com,summit.shtml
2,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...
3,voice-international.com,
4,www.livinglegendsltd.com,


In [12]:
#Concatenation of data frames
splitted_data = pd.concat([raw_data['websites'],seperation_of_protocol[0],seperation_domain_name],axis=1)

In [13]:
splitted_data.head()

Unnamed: 0,websites,0,domain_name,address
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...
3,http://voice-international.com/,http,voice-international.com,
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,


In [14]:
splitted_data.columns = ['url','protocol','domain_name','address']

In [15]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...
3,http://voice-international.com/,http,voice-international.com,
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,


Domain name column can be further sub divided into domain_names as well as sub_domain_names 

Similarly, address column can also be further sub divided into path,query_string,file..................

In [16]:
type(splitted_data)

pandas.core.frame.DataFrame

### Features Extraction

特征1
Feature-1

1.用来隐藏可疑部分的长URL

1.Long URL to Hide the Suspicious Part

如果URL的长度超过54个字符，URL就被判定为钓鱼网站

If the length of the URL is greater than or equal 54 characters then the URL classified as phishing


0 --- indicates legitimate

1 --- indicates Phishing

2 --- indicates Suspicious

In [17]:
def long_url(l):
    """This function is defined in order to differntiate website based on the length of the URL"""
    if len(l) < 54:
        return 0
    elif len(l) >= 54 and len(l) <= 75:
        return 2
    return 1

In [18]:
#Applying the above defined function in order to divide the websites into 3 categories
splitted_data['long_url'] = raw_data['websites'].apply(long_url) 


In [19]:
#Will show the results only the websites which are legitimate according to above condition as 0 is legitimate website
splitted_data[splitted_data.long_url == 0].head()

Unnamed: 0,url,protocol,domain_name,address,long_url
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0
3,http://voice-international.com/,http,voice-international.com,,0
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0
5,http://voicechasers.com/forum/viewforum.php?f=8,http,voicechasers.com,forum/viewforum.php?f=8,0


特征2
Feature-2

2.URL有'@'符号

2.URL’s having “@” Symbol

在URL中使用'@'符号会使浏览器忽略'@'前的字符，真实信息可能隐藏在'@'后面

Using “@” symbol in the URL leads the browser to ignore everything preceding the “@” symbol and the real address often follows the “@” symbol.

IF {Url Having @ Symbol→ Phishing
    Otherwise→ Legitimate }


0 --- indicates legitimate

1 --- indicates Phishing


In [20]:
def have_at_symbol(l):
    """This function is used to check whether the URL contains @ symbol or not"""
    if "@" in l:
        return 1
    return 0
    

In [21]:
splitted_data['having_@_symbol'] = raw_data['websites'].apply(have_at_symbol)

In [22]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0
3,http://voice-international.com/,http,voice-international.com,,0,0
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0


特征3
Feature-3

3.使用'//'重定向

3.Redirecting using “//”

URL路径中'//'意味着重定向到其他网址, 比如“http://www.legitimate.com//http://www.phishing.com”.

我们检查'//'出现的位置, 如果是HTTP使用，则出现在第6位, HTTPS使用，则出现在第七位

The existence of “//” within the URL path means that the user will be redirected to another website.
An example of such URL’s is: “http://www.legitimate.com//http://www.phishing.com”. 
We examine the location where the “//” appears. 
We find that if the URL starts with “HTTP”, that means the “//” should appear in the sixth position. 
However, if the URL employs “HTTPS” then the “//” should appear in seventh position.

IF {ThePosition of the Last Occurrence of "//" in the URL > 7→ Phishing
    
    Otherwise→ Legitimate

0 --- indicates legitimate

1 --- indicates Phishing


In [23]:
def redirection(l):
    """If the url has symbol(//) after protocol then such URL is to be classified as phishing """
    if "//" in l:
        return 1
    return 0

In [24]:
seperation_of_protocol.head()[1]

0                 www.emuck.com:3000/archive/egan.html
1                             danoday.com/summit.shtml
2    groups.yahoo.com/group/voice_actor_appreciatio...
3                             voice-international.com/
4                            www.livinglegendsltd.com/
Name: 1, dtype: object

In [25]:
splitted_data['redirection_//_symbol'] = seperation_of_protocol[1].apply(redirection)

In [26]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0
3,http://voice-international.com/,http,voice-international.com,,0,0,0
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0


特征4
Feature-4

4.域名前缀后缀间用(-)分割
4.Adding Prefix or Suffix Separated by (-) to the Domain

合法域名一般没有(-)，钓鱼网站往往使用(-)来误导用户自己在访问正常网站。

The dash symbol is rarely used in legitimate URLs. Phishers tend to add prefixes or suffixes separated by (-) to the domain name
so that users feel that they are dealing with a legitimate webpage. 

For example http://www.Confirme-paypal.com/.
    
IF {Domain Name Part Includes (−) Symbol → Phishing
    
    Otherwise → Legitimate
    
1 --> indicates phishing

0 --> indicates legitimate
    

In [27]:
def prefix_suffix_seperation(l):
    if '-' in l:
        return 1
    return 0

In [28]:
splitted_data['prefix_suffix_seperation'] = seperation_domain_name['domain_name'].apply(prefix_suffix_seperation)

In [29]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0


特征5
Feature - 5

5.二级域名和多二级域名

5. Sub-Domain and Multi Sub-Domains

The legitimate URL link has two dots in the URL since we can ignore typing “www.”. 
If the number of dots is equal to three then the URL is classified as “Suspicious” since it has one sub-domain.
However, if the dots are greater than three it is classified as “Phishy” since it will have multiple sub-domains

0 --- indicates legitimate

1 --- indicates Phishing

2 --- indicates Suspicious


In [30]:
def sub_domains(l):
    if l.count('.') < 3:
        return 0
    elif l.count('.') == 3:
        return 2
    return 1

In [31]:
splitted_data['sub_domains'] = splitted_data['domain_name'].apply(sub_domains)

In [32]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0


特征6
Feature-6

6.使用IP地址

6.Using the IP Address

If an IP address is used as an alternative of the domain name in the URL, such as “http://125.98.3.123/fake.html”,
users can be sure that someone is trying to steal their personal information. Sometimes, 
the IP address is even transformed into hexadecimal code as shown in the following link “http://0x58.0xCC.0xCA.0x62/2/paypal.ca/index.html”.

        Rule: IF{If The Domain Part has an IP Address → Phishing
                 Otherwise→ Legitimate

                 1 --> indicates phishing

                 0 --> indicates legitimate

In [33]:
import re
def having_ip_address(url):
    match=re.search('(([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\/)|'  #IPv4
                    '((0x[0-9a-fA-F]{1,2})\\.(0x[0-9a-fA-F]{1,2})\\.(0x[0-9a-fA-F]{1,2})\\.(0x[0-9a-fA-F]{1,2})\\/)'  #IPv4 in hexadecimal
                    '(?:[a-fA-F0-9]{1,4}:){7}[a-fA-F0-9]{1,4}',url)     #Ipv6
    if match:
        #print match.group()
        return 1
    else:
        #print 'No matching pattern found'
        return 0


In [34]:
splitted_data['having_ip_address'] = raw_data['websites'].apply(having_ip_address)

In [35]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains,having_ip_address
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0,0
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0,0
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0,0
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0,0
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0,0


特征7
Feature-7

7.使用短域名服务

7.Using URL Shortening Services “TinyURL”

URL shortening is a method on the “World Wide Web” in which a URL may be made considerably smaller in length and still lead to the required webpage. 
This is accomplished by means of an “HTTP Redirect” on a domain name that is short, which links to the webpage that has a long URL. 
For example, the URL “http://portal.hud.ac.uk/” can be shortened to “bit.ly/19DXSk4”.
    
Rule: IF{TinyURL → Phishing
         
         Otherwise→ Legitimate
         
                 1 --> indicates phishing

                 0 --> indicates legitimate

In [36]:
#we have imported re module in the above feature. So need not to import again
def shortening_service(url):
    match=re.search('bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|'
                    'yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|'
                    'short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|'
                    'doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|'
                    'db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|'
                    'q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|'
                    'x\.co|prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|tr\.im|link\.zip\.net',url)
    if match:
        return 1
    else:
        return 0



In [37]:
splitted_data['shortening_service'] = raw_data['websites'].apply(shortening_service)

In [38]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains,having_ip_address,shortening_service
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0,0,0
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0,0,0
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0,0,0
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0,0,0
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0,0,0


特征8
Feature - 8 

8.HTTPS出现在域名中

8.The Existence of “HTTPS” Token in the Domain Part of the URL

The phishers may add the “HTTPS” token to the domain part of a URL in order to trick users.
For example, http://https-www-paypal-it-webapps-mpp-home.soft-hair.com/.

    Rule: IF{Using HTTP Token in Domain Part of The URL→ Phishing
             
             Otherwise→ Legitimate

In [39]:
def https_token(url):
    match=re.search('https://|http://',url)
    if match.start(0)==0:
        url=url[match.end(0):]
    match=re.search('http|https',url)
    if match:
        return 1
    else:
        return 0


In [40]:
splitted_data['https_token'] = raw_data['websites'].apply(https_token)

In [41]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains,having_ip_address,shortening_service,https_token
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0,0,0,0
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0,0,0,0
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0,0,0,0
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0,0,0,0
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0,0,0,0


特征9
Feature - 9

9.不正常的URL

9.Abnormal_URL

这个特征可以从WHOIS数据库中提取出来。对于一个合法的网站，身份通常是其URL的一部分。
但是身份怎么确定很难说，是name还是org还是别的？

This feature can be extracted from WHOIS database. For a legitimate website, identity is typically part of its URL.

Rule: IF {The Host Name Is Not Included In URL → Phishing 
          
          Otherwise→ Legitimate

In [42]:
import whois
import re

In [43]:
def abnormal_url_sub(hostname,url):
    match=re.search(hostname,url)
    if match:
        return 0
    else:
        print("In sub")
        return 1
    
# A possiable version about use name and org
#def abnormal_url_sub(result,url):
#    match1 = re.search(result.name,url)
#    match2 = re.search(result.org,url)
#    if match:
#        return 0
#    else:
#        print("In sub")
#        return 1
    

In [44]:
def abnormal_url_main(vec):
    domain = vec[0]
    print("domain: ", end="")
    print(domain)
    url = vec[1]
#    print("url: ", end="")
#    print(url)
    dns = 0
    try:
        whois_result = whois.whois(domain)
    except Exception as e:
        print ('str(e):\t\t', str(e))
        print ('repr(e):\t', repr(e))
        dns = 1
        
    if dns == 1:
        print("whois failed = 1")
        return 1
    
    print("whois result: ", end="")
    print(whois_result)
    
    if whois_result.domain_name is None:
        print("whois result is null 1")
        return 1
    print("whois_result.domain_name: ", end="")
    print(whois_result.domain_name)
    domain_name = whois_result.domain_name[1]
    return abnormal_url_sub(domain_name,url)
#    print("whois result: ", end="")
#    print(whois_result)
    
#    return abnormal_url_sub(whois_result,url)   
    

In [45]:
# ?
 #x - - cross check 
splitted_data['abnormal_url'] = splitted_data[['domain_name', 'url']].apply(abnormal_url_main, axis=1)

domain: www.emuck.com:3000
whois result: {
  "domain_name": null,
  "registrar": null,
  "whois_server": null,
  "referral_url": null,
  "updated_date": null,
  "creation_date": null,
  "expiration_date": null,
  "name_servers": null,
  "status": null,
  "emails": null,
  "dnssec": null,
  "name": null,
  "org": null,
  "address": null,
  "city": null,
  "state": null,
  "zipcode": null,
  "country": null
}
whois result is null 1
domain: danoday.com
whois result: {
  "domain_name": [
    "DANODAY.COM",
    "danoday.com"
  ],
  "registrar": "DREAMHOST",
  "whois_server": "WHOIS.DREAMHOST.COM",
  "referral_url": null,
  "updated_date": "2019-02-07 08:15:50",
  "creation_date": [
    "1997-03-10 05:00:00",
    "1997-03-09 21:00:00"
  ],
  "expiration_date": "2020-03-11 04:00:00",
  "name_servers": [
    "A.DNS.HOSTWAY.NET",
    "B.DNS.HOSTWAY.NET"
  ],
  "status": [
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientTransferProhibited https://www.ica

str(e):		 timed out
repr(e):	 timeout('timed out')
whois failed = 1
domain: anp.awn.com
str(e):		 timed out
repr(e):	 timeout('timed out')
whois failed = 1
domain: www.awn.com
str(e):		 timed out
repr(e):	 timeout('timed out')
whois failed = 1
domain: www.awn.com
str(e):		 timed out
repr(e):	 timeout('timed out')
whois failed = 1
domain: www.awn.com
str(e):		 timed out
repr(e):	 timeout('timed out')
whois failed = 1
domain: us.imdb.com
whois result: {
  "domain_name": [
    "IMDB.COM",
    "imdb.com"
  ],
  "registrar": "MarkMonitor, Inc.",
  "whois_server": "whois.markmonitor.com",
  "referral_url": null,
  "updated_date": [
    "2019-05-07 23:07:16",
    "2019-08-26 12:19:56"
  ],
  "creation_date": [
    "1996-01-05 05:00:00",
    "1996-01-04 21:00:00"
  ],
  "expiration_date": [
    "2024-01-04 05:00:00",
    "2024-01-03 00:00:00"
  ],
  "name_servers": [
    "NS1.P31.DYNECT.NET",
    "NS2.P31.DYNECT.NET",
    "NS3.P31.DYNECT.NET",
    "NS4.P31.DYNECT.NET",
    "PDNS1.ULTRADNS.NET"

whois result: {
  "domain_name": "TOONZONE.NET",
  "registrar": "Network Solutions, LLC",
  "whois_server": "whois.networksolutions.com",
  "referral_url": null,
  "updated_date": [
    "2019-12-18 13:15:55",
    "2019-10-09 07:20:11"
  ],
  "creation_date": "1998-08-03 04:00:00",
  "expiration_date": "2022-08-02 04:00:00",
  "name_servers": [
    "COREY.NS.CLOUDFLARE.COM",
    "LUCIANA.NS.CLOUDFLARE.COM"
  ],
  "status": "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
  "emails": [
    "abuse@web.com",
    "rkim4123@gmail.com"
  ],
  "dnssec": "unsigned",
  "name": "Kim, Rodney",
  "org": null,
  "address": "1321 1/2 S WILTON PL",
  "city": "Los Angeles",
  "state": "CA",
  "zipcode": "90019-4713",
  "country": "US"
}
whois_result.domain_name: TOONZONE.NET
In sub
domain: members.tripod.com
whois result: {
  "domain_name": [
    "TRIPOD.COM",
    "tripod.com"
  ],
  "registrar": "CSC CORPORATE DOMAINS, INC.",
  "whois_server": "whois.corporatedomains.com",
  

whois result: {
  "domain_name": [
    "EPINIONS.COM",
    "epinions.com"
  ],
  "registrar": "MarkMonitor, Inc.",
  "whois_server": "whois.markmonitor.com",
  "referral_url": null,
  "updated_date": [
    "2019-10-17 16:16:35",
    "2019-10-17 09:17:21"
  ],
  "creation_date": [
    "1999-02-12 05:00:00",
    "1999-02-11 21:00:00"
  ],
  "expiration_date": [
    "2020-02-12 05:00:00",
    "2020-02-11 00:00:00"
  ],
  "name_servers": [
    "NS1.MARKMONITOR.COM",
    "NS2.MARKMONITOR.COM",
    "NS3.MARKMONITOR.COM",
    "NS4.MARKMONITOR.COM",
    "NS5.MARKMONITOR.COM",
    "NS6.MARKMONITOR.COM",
    "NS7.MARKMONITOR.COM",
    "ns6.markmonitor.com",
    "ns3.markmonitor.com",
    "ns2.markmonitor.com",
    "ns7.markmonitor.com",
    "ns5.markmonitor.com",
    "ns1.markmonitor.com",
    "ns4.markmonitor.com"
  ],
  "status": [
    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clien

str(e):		 timed out
repr(e):	 timeout('timed out')
whois failed = 1
domain: us.imdb.com
str(e):		 timed out
repr(e):	 timeout('timed out')
whois failed = 1
domain: us.imdb.com
str(e):		 timed out
repr(e):	 timeout('timed out')
whois failed = 1
domain: www.tv-now.com
str(e):		 timed out
repr(e):	 timeout('timed out')
whois failed = 1
domain: www.cdaccess.com
str(e):		 timed out
repr(e):	 timeout('timed out')
whois failed = 1
domain: genetic_mishap.tripod.com
whois result: {
  "domain_name": [
    "TRIPOD.COM",
    "tripod.com"
  ],
  "registrar": "CSC CORPORATE DOMAINS, INC.",
  "whois_server": "whois.corporatedomains.com",
  "referral_url": null,
  "updated_date": [
    "2019-11-01 15:22:21",
    "2018-12-17 17:15:26"
  ],
  "creation_date": "1994-09-29 04:00:00",
  "expiration_date": "2020-09-28 04:00:00",
  "name_servers": [
    "NS1.LYCOS.COM",
    "NS2.LYCOS.COM",
    "NS3.LYCOS.COM",
    "NS4.LYCOS.COM",
    "ns1.lycos.com",
    "ns2.lycos.com",
    "ns4.lycos.com",
    "ns3.lycos

In [46]:
splitted_data.head(10)

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains,having_ip_address,shortening_service,https_token,abnormal_url
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0,0,0,0,1
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0,0,0,0,0
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0,0,0,0,1
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0,0,0,0,1
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0,0,0,0,1
5,http://voicechasers.com/forum/viewforum.php?f=8,http,voicechasers.com,forum/viewforum.php?f=8,0,0,0,0,0,0,0,0,1
6,http://hollywoodcollectorshow.com/,http,hollywoodcollectorshow.com,,0,0,0,0,0,0,0,0,1
7,http://www.geocities.com/hollywood/hills/8944/,http,www.geocities.com,hollywood/hills/8944/,0,0,0,0,0,0,0,0,1
8,http://asifa.proboards61.com/index.cgi?action=...,http,asifa.proboards61.com,index.cgi?action=calendarviewall,2,0,0,0,0,0,0,0,1
9,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/cal///group/voi...,1,0,1,0,0,0,0,0,0


Feature - 10:
    
10.Google Index

This feature examines whether a website is in Google’s index or not. When a site is indexed by Google,
it is displayed on search results (Webmaster resources, 2014). 
Usually, phishing webpages are merely accessible for a short period and as a result, 
many phishing webpages may not be found on the Google index.

Rule: IF{Webpage Indexed by Google → Legitimate
         
         Otherwise → Phishing

In [47]:
from google import google
from fake_useragent import UserAgent


In [48]:
def google_index(url):
    site=google.search(url,5)
    if site:
        return 0
    else:
        return 1

In [49]:
splitted_data['google_index'] = raw_data['websites'].apply(google_index)

Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.awn.com%2Fmag%2Fissue1.7%2Farticles%2Fkowlaskidi1.7.html&start=0&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.awn.com%2Fmag%2Fissue1.7%2Farticles%2Fkowlaskidi1.7.html&start=10&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.awn.com%2Fmag%2Fissue1.7%2Farticles%2Fkowlaskidi1.7.html&start=20&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.awn.com%2Fmag%2Fissue1.7%2Farticles%2Fkowlaskidi1.7.html&start=30&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.awn.com%2Fmag%2Fissue1.7%2Farticles%2Fkowlaskidi1.7.html&start=40&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.awn.com%2Fmag%2

Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.awn.com%2Fmag%2Fissue5.03%2F5.03pages%2Fevanierforay.php3&start=20&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.awn.com%2Fmag%2Fissue5.03%2F5.03pages%2Fevanierforay.php3&start=30&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.awn.com%2Fmag%2Fissue5.03%2F5.03pages%2Fevanierforay.php3&start=40&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.awn.com%2Fmag%2Fissue4.02%2F4.02pages%2Fforaylittlejohn.php3&start=0&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.awn.com%2Fmag%2Fissue4.02%2F4.02pages%2Fforaylittlejohn.php3&start=10&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.awn.

Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fnews.toonzone.net%2F2000%2Foct%2F27%2Fhearing_voices.php%23pat_on_the_head&start=30&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fnews.toonzone.net%2F2000%2Foct%2F27%2Fhearing_voices.php%23pat_on_the_head&start=40&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fmembers.tripod.com%2F~leemichaelwithers%2Fsfh.htm&start=0&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fmembers.tripod.com%2F~leemichaelwithers%2Fsfh.htm&start=10&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fmembers.tripod.com%2F~leemichaelwithers%2Fsfh.htm&start=20&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fmembers.tripod.com%2F~le

Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.povonline.com%2Fcols%2Fcol051.htm&start=30&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.povonline.com%2Fcols%2Fcol051.htm&start=40&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.povonline.com%2F2001%2Fnews060901.htm&start=0&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.povonline.com%2F2001%2Fnews060901.htm&start=10&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.povonline.com%2F2001%2Fnews060901.htm&start=20&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.povonline.com%2F2001%2Fnews060901.htm&start=30&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://w

Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.epinions.com%2Fmusc-album-musc-freberg__stan%2Ftk_~mm032.1.2&start=30&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.epinions.com%2Fmusc-album-musc-freberg__stan%2Ftk_~mm032.1.2&start=40&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fcitypaper.net%2Farticles%2F091699%2Fmus.stan.shtml&start=0&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fcitypaper.net%2Farticles%2F091699%2Fmus.stan.shtml&start=10&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fcitypaper.net%2Farticles%2F091699%2Fmus.stan.shtml&start=20&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fcitypaper.net%2Farticles%2F091699%2Fmus.s

Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fus.imdb.com%2Fname%2Fnm0001242%2F&start=30&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fus.imdb.com%2Fname%2Fnm0001242%2F&start=40&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.geocities.com%2Fenchantedforest%2Fdell%2F8545%2Fva.html&start=0&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.geocities.com%2Fenchantedforest%2Fdell%2F8545%2Fva.html&start=10&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.geocities.com%2Fenchantedforest%2Fdell%2F8545%2Fva.html&start=20&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.geocities.com%2Fenchantedforest%2Fdell%2F8545%2Fva.html&start=30&num=10&

Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fus.imdb.com%2Fname%2Fnm0304000%2F&start=30&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fus.imdb.com%2Fname%2Fnm0304000%2F&start=40&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.taoslandandfilm.com%2Fbradentnwkly.html&start=0&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.taoslandandfilm.com%2Fbradentnwkly.html&start=10&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.taoslandandfilm.com%2Fbradentnwkly.html&start=20&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.taoslandandfilm.com%2Fbradentnwkly.html&start=30&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://w

Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fgenetic_mishap.tripod.com%2Flandofthreenamedpeople%2F&start=10&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fgenetic_mishap.tripod.com%2Flandofthreenamedpeople%2F&start=20&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fgenetic_mishap.tripod.com%2Flandofthreenamedpeople%2F&start=30&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fgenetic_mishap.tripod.com%2Flandofthreenamedpeople%2F&start=40&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.angelfire.com%2Fceleb%2Fstarzbios%2Fgottfried_g.html&start=0&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.angelfire.com%2Fceleb%2Fstarzbios%2Fgottfried_g

Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.kylehebert.com%2F&start=40&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.post-gazette.com%2Fmagazine%2F19990223voicetalent1.asp&start=0&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.post-gazette.com%2Fmagazine%2F19990223voicetalent1.asp&start=10&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.post-gazette.com%2Fmagazine%2F19990223voicetalent1.asp&start=20&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.post-gazette.com%2Fmagazine%2F19990223voicetalent1.asp&start=30&num=10&tbs=
HTTP Error 429: Too Many Requests
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.post-gazette.com%2Fmagazine%2F19990223voicetalent1.asp&start=4

Feature - 11

11.Website Traffic

This feature measures the popularity of the website by determining the number of visitors and the number of pages they visit.
However, since phishing websites live for a short period of time, they may not be recognized by the Alexa database 
(Alexa the Web Information Company., 1996). By reviewing our dataset, we find that in worst scenarios,
legitimate websites ranked among the top 100,000. Furthermore,
if the domain has no traffic or is not recognized by the Alexa database, it is classified as “Phishing”.
Otherwise, it is classified as “Suspicious”.

Rule: IF{Website Rank<100,000 → LegitimateWebsite Rank>100,
         
         000 →SuspiciousOtherwise → Phish

In [50]:
from bs4 import BeautifulSoup
import urllib.request
def web_traffic(url):
    try:
        rank = BeautifulSoup(urllib.request.urlopen("http://data.alexa.com/data?cli=10&dat=s&url=" + url).read(), "xml").find("REACH")['RANK']
    except TypeError:
        return 1
    rank= int(rank)
    if (rank<100000):
        return 0
    else:
        return 2

In [51]:
splitted_data['web_traffic'] = raw_data['websites'].apply(web_traffic)

In [52]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains,having_ip_address,shortening_service,https_token,abnormal_url,google_index,web_traffic
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0,0,0,0,1,0,1
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0,0,0,0,0,0,2
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0,0,0,0,1,1,0
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0,0,0,0,1,0,1
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0,0,0,0,1,0,1


Feature - 11

Domain Registration Length

Based on the fact that a phishing website lives for a short period of time, 
we believe that trustworthy domains are regularly paid for several years in advance. 
In our dataset, we find that the longest fraudulent domains have been used for one year only.

Rule: IF{Domains Expires on≤ 1 years → Phishing
         
         Otherwise→ Legitimate

In [53]:
import whois
from datetime import datetime
import time
def domain_registration_length_sub(domain):
    expiration_date = domain.expiration_date
    today = time.strftime('%Y-%m-%d')
    today = datetime.strptime(today, '%Y-%m-%d')
    if expiration_date is None:
        return 1
    elif type(expiration_date) is list or type(today) is list :
        return 2             #If it is a type of list then we can't select a single value from list. So,it is regarded as suspected website  
    else:
        registration_length = abs((expiration_date - today).days)
        if registration_length / 365 <= 1:
            return 1
        else:
            return 0

    
    
    

In [54]:
def domain_registration_length_main(domain):
    dns = 0
    try:
        domain_name = whois.whois(domain)
    except:
        dns = 1
        
    if dns == 1:
        return 1
    else:
        return domain_registration_length_sub(domain_name)
    

In [55]:
splitted_data['domain_registration_length'] = splitted_data['domain_name'].apply(domain_registration_length_main)

In [56]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains,having_ip_address,shortening_service,https_token,abnormal_url,google_index,web_traffic,domain_registration_length
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0,0,0,0,1,0,1,1
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0,0,0,0,0,0,2,1
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0,0,0,0,1,1,0,1
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0,0,0,0,1,0,1,1
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0,0,0,0,1,0,1,1


In [57]:
#Testing the above function
domain_registration_length_main('www.google.com')

2

In [58]:
domain_registration_length_main('www.w3schools.com')

1

Feature - 12

This feature can be extracted from WHOIS database (Whois 2005). 
Most phishing websites live for a short period of time. 
By reviewing our dataset, we find that the minimum age of the legitimate domain is 6 months.

Rule: IF {Age Of Domain≥6 months → Legitimate
          
          Otherwise → Phishing

In [59]:
def age_of_domain_sub(domain):
    creation_date = domain.creation_date
    expiration_date = domain.expiration_date
    if ((expiration_date is None) or (creation_date is None)):
        return 1
    elif ((type(expiration_date) is list) or (type(creation_date) is list)):
        return 2
    else:
        ageofdomain = abs((expiration_date - creation_date).days)
        if ((ageofdomain/30) < 6):
            return 1
        else:
            return 0

In [60]:
def age_of_domain_main(domain):
    dns = 0
    try:
        domain_name = whois.whois(domain)
    except:
        dns = 1
        
    if dns == 1:
        return 1
    else:
        return age_of_domain_sub(domain_name)



In [61]:
splitted_data['age_of_domain'] = splitted_data['domain_name'].apply(age_of_domain_main)

In [62]:
splitted_data

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains,having_ip_address,shortening_service,https_token,abnormal_url,google_index,web_traffic,domain_registration_length,age_of_domain
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0,0,0,0,1,0,1,1,1
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0,0,0,0,0,0,2,1,2
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0,0,0,0,1,1,0,1,1
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0,0,0,0,1,0,1,1,1
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0,0,0,0,1,0,1,1,1
5,http://voicechasers.com/forum/viewforum.php?f=8,http,voicechasers.com,forum/viewforum.php?f=8,0,0,0,0,0,0,0,0,1,0,1,1,1
6,http://hollywoodcollectorshow.com/,http,hollywoodcollectorshow.com,,0,0,0,0,0,0,0,0,1,0,1,1,0
7,http://www.geocities.com/hollywood/hills/8944/,http,www.geocities.com,hollywood/hills/8944/,0,0,0,0,0,0,0,0,1,0,0,1,1
8,http://asifa.proboards61.com/index.cgi?action=...,http,asifa.proboards61.com,index.cgi?action=calendarviewall,2,0,0,0,0,0,0,0,1,1,1,1,1
9,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/cal///group/voi...,1,0,1,0,0,0,0,0,0,1,0,1,1


Feature - 13

DNS Record

For phishing websites, either the claimed identity is not recognized by the WHOIS database (Whois 2005) or 
no records founded for the hostname (Pan and Ding 2006). 
If the DNS record is empty or not found then the website is classified as “Phishing”, otherwise it is classified as “Legitimate”.

Rule: IF{no DNS Record For The Domain → Phishing
         
         Otherwise→ Legitimate
        

In [63]:
def dns_record(domain):
    dns = 0
    try:
        domain_name = whois.whois(domain)
        print(domain_name)
    except:
        dns = 1
        
    if dns == 1:
        return 1
    else:
        return dns



In [64]:
splitted_data['dns_record'] = splitted_data['domain_name'].apply(dns_record)

{
  "domain_name": null,
  "registrar": null,
  "whois_server": null,
  "referral_url": null,
  "updated_date": null,
  "creation_date": null,
  "expiration_date": null,
  "name_servers": null,
  "status": null,
  "emails": null,
  "dnssec": null,
  "name": null,
  "org": null,
  "address": null,
  "city": null,
  "state": null,
  "zipcode": null,
  "country": null
}


In [65]:
splitted_data

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains,having_ip_address,shortening_service,https_token,abnormal_url,google_index,web_traffic,domain_registration_length,age_of_domain,dns_record
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0,0,0,0,1,0,1,1,1,0
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0,0,0,0,0,0,2,1,2,1
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0,0,0,0,1,1,0,1,1,1
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0,0,0,0,1,0,1,1,1,1
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0,0,0,0,1,0,1,1,1,1
5,http://voicechasers.com/forum/viewforum.php?f=8,http,voicechasers.com,forum/viewforum.php?f=8,0,0,0,0,0,0,0,0,1,0,1,1,1,1
6,http://hollywoodcollectorshow.com/,http,hollywoodcollectorshow.com,,0,0,0,0,0,0,0,0,1,0,1,1,0,1
7,http://www.geocities.com/hollywood/hills/8944/,http,www.geocities.com,hollywood/hills/8944/,0,0,0,0,0,0,0,0,1,0,0,1,1,1
8,http://asifa.proboards61.com/index.cgi?action=...,http,asifa.proboards61.com,index.cgi?action=calendarviewall,2,0,0,0,0,0,0,0,1,1,1,1,1,1
9,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/cal///group/voi...,1,0,1,0,0,0,0,0,0,1,0,1,1,1


Feature - 14

Statistical-Reports Based Feature

Several parties such as PhishTank (PhishTank Stats, 2010-2012),
and StopBadware (StopBadware, 2010-2012) formulate numerous statistical reports on phishing websites at every given period of time
some are monthly and others are quarterly. 

Rule: IF{Host Belongs to Top Phishing IPs or Top Phishing Domains → Phishing
         
         Otherwise → Legitimate


In [66]:
import socket
def statistical_report(url):
    hostname = url
    h = [(x.start(0), x.end(0)) for x in re.finditer('https://|http://|www.|https://www.|http://www.', hostname)]
    z = int(len(h))
    if z != 0:
        y = h[0][1]
        hostname = hostname[y:]
        h = [(x.start(0), x.end(0)) for x in re.finditer('/', hostname)]
        z = int(len(h))
        if z != 0:
            hostname = hostname[:h[0][0]]
    url_match=re.search('at\.ua|usa\.cc|baltazarpresentes\.com\.br|pe\.hu|esy\.es|hol\.es|sweddy\.com|myjino\.ru|96\.lt|ow\.ly',url)
    try:
        ip_address = socket.gethostbyname(hostname)
        ip_match=re.search('146\.112\.61\.108|213\.174\.157\.151|121\.50\.168\.88|192\.185\.217\.116|78\.46\.211\.158|181\.174\.165\.13|46\.242\.145\.103|121\.50\.168\.40|83\.125\.22\.219|46\.242\.145\.98|107\.151\.148\.44|107\.151\.148\.107|64\.70\.19\.203|199\.184\.144\.27|107\.151\.148\.108|107\.151\.148\.109|119\.28\.52\.61|54\.83\.43\.69|52\.69\.166\.231|216\.58\.192\.225|118\.184\.25\.86|67\.208\.74\.71|23\.253\.126\.58|104\.239\.157\.210|175\.126\.123\.219|141\.8\.224\.221|10\.10\.10\.10|43\.229\.108\.32|103\.232\.215\.140|69\.172\.201\.153|216\.218\.185\.162|54\.225\.104\.146|103\.243\.24\.98|199\.59\.243\.120|31\.170\.160\.61|213\.19\.128\.77|62\.113\.226\.131|208\.100\.26\.234|195\.16\.127\.102|195\.16\.127\.157|34\.196\.13\.28|103\.224\.212\.222|172\.217\.4\.225|54\.72\.9\.51|192\.64\.147\.141|198\.200\.56\.183|23\.253\.164\.103|52\.48\.191\.26|52\.214\.197\.72|87\.98\.255\.18|209\.99\.17\.27|216\.38\.62\.18|104\.130\.124\.96|47\.89\.58\.141|78\.46\.211\.158|54\.86\.225\.156|54\.82\.156\.19|37\.157\.192\.102|204\.11\.56\.48|110\.34\.231\.42',ip_address)  
    except:
        return 1

    if url_match:
        return 1
    else:
        return 0


In [67]:
splitted_data['statistical_report'] = raw_data['websites'].apply(statistical_report)

In [68]:
splitted_data

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains,having_ip_address,shortening_service,https_token,abnormal_url,google_index,web_traffic,domain_registration_length,age_of_domain,dns_record,statistical_report
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0,0,0,0,1,0,1,1,1,0,1
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0,0,0,0,0,0,2,1,2,1,1
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0,0,0,0,1,1,0,1,1,1,1
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0,0,0,0,1,0,1,1,1,1,1
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0,0,0,0,1,0,1,1,1,1,1
5,http://voicechasers.com/forum/viewforum.php?f=8,http,voicechasers.com,forum/viewforum.php?f=8,0,0,0,0,0,0,0,0,1,0,1,1,1,1,1
6,http://hollywoodcollectorshow.com/,http,hollywoodcollectorshow.com,,0,0,0,0,0,0,0,0,1,0,1,1,0,1,1
7,http://www.geocities.com/hollywood/hills/8944/,http,www.geocities.com,hollywood/hills/8944/,0,0,0,0,0,0,0,0,1,0,0,1,1,1,1
8,http://asifa.proboards61.com/index.cgi?action=...,http,asifa.proboards61.com,index.cgi?action=calendarviewall,2,0,0,0,0,0,0,0,1,1,1,1,1,1,1
9,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/cal///group/voi...,1,0,1,0,0,0,0,0,0,1,0,1,1,1,1


In [69]:
whois.

SyntaxError: invalid syntax (<ipython-input-69-20dcbfaa433e>, line 1)