# Phishing Website Detection - Feature Extraction


#### 数据预处理

这个数据集包括一些网站的网址（一些是合法的一些是钓鱼的）
建模前的预处理，并根据现有情况计算特征值

In [1]:
#importing numpy and pandas which are required for data pre-processing
import numpy as np
import pandas as pd

In [2]:
#Loading the data
raw_data = pd.read_csv("./raw_datasets/100-legitimate-art.txt") #loading only 100 samples (art websites data)

In [3]:
raw_data.head()

Unnamed: 0,websites
0,http://www.emuck.com:3000/archive/egan.html
1,http://danoday.com/summit.shtml
2,http://groups.yahoo.com/group/voice_actor_appr...
3,http://voice-international.com/
4,http://www.livinglegendsltd.com/


首先我们需要根据URL的部分对URL进行切割

一个典型的URL，比如https://www.example.com/index.html , 包括协议(http)，主机名(www.example.com)和文件名(index.html)

对URL细节的具体解释：1.https://doepud.co.uk/blog/anatomy-of-a-url                                                    
                                 2.https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL                  
                                 3.https://techwelkin.com/understanding-the-components-and-structure-of-a-url


In [4]:
raw_data['websites'].str.split("://").head() #Here we divided the protocol from the entire URL. but need it to be divided it 
                                                 #seperate column

0         [http, www.emuck.com:3000/archive/egan.html]
1                     [http, danoday.com/summit.shtml]
2    [http, groups.yahoo.com/group/voice_actor_appr...
3                     [http, voice-international.com/]
4                    [http, www.livinglegendsltd.com/]
Name: websites, dtype: object

关于数据分割的参考 --> https://apassionatechie.wordpress.com/2018/02/24/how-do-i-split-a-string-into-several-columns-in-a-dataframe-with-pandas-python/

In [5]:
seperation_of_protocol = raw_data['websites'].str.split("://",expand = True) #expand argument in the split method will give you a new column

In [6]:
seperation_of_protocol.head()

Unnamed: 0,0,1
0,http,www.emuck.com:3000/archive/egan.html
1,http,danoday.com/summit.shtml
2,http,groups.yahoo.com/group/voice_actor_appreciatio...
3,http,voice-international.com/
4,http,www.livinglegendsltd.com/


In [7]:
type(seperation_of_protocol)

pandas.core.frame.DataFrame

In [8]:
seperation_domain_name = seperation_of_protocol[1].str.split("/",1,expand = True) #split(seperator,no of splits according to seperator(delimiter),expand)

In [9]:
type(seperation_domain_name)

pandas.core.frame.DataFrame

In [10]:
seperation_domain_name.columns=["domain_name","address"] #renaming columns of data frame

In [11]:
seperation_domain_name.head()

Unnamed: 0,domain_name,address
0,www.emuck.com:3000,archive/egan.html
1,danoday.com,summit.shtml
2,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...
3,voice-international.com,
4,www.livinglegendsltd.com,


In [12]:
#Concatenation of data frames
splitted_data = pd.concat([raw_data['websites'],seperation_of_protocol[0],seperation_domain_name],axis=1)

In [13]:
splitted_data.head()

Unnamed: 0,websites,0,domain_name,address
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...
3,http://voice-international.com/,http,voice-international.com,
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,


In [14]:
splitted_data.columns = ['url','protocol','domain_name','address']

In [15]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...
3,http://voice-international.com/,http,voice-international.com,
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,


域名列可以进一步细分为domain_names和sub_domain_names（子域名）

同样，地址栏也可以进一步细分为path，query_string，file .........

In [16]:
type(splitted_data)

pandas.core.frame.DataFrame

### 特征提取 Features Extraction

#### 特征1：用来隐藏可疑部分的长URL

如果URL的长度超过54个字符，URL就被判定为钓鱼网站

0 --- indicates legitimate

1 --- indicates Phishing

2 --- indicates Suspicious

In [17]:
def long_url(l):
    """This function is defined in order to differntiate website based on the length of the URL"""
    if len(l) < 54:
        return 0
    elif len(l) >= 54 and len(l) <= 75:
        return 2
    return 1

In [18]:
#Applying the above defined function in order to divide the websites into 3 categories
splitted_data['long_url'] = raw_data['websites'].apply(long_url) 


In [19]:
#Will show the results only the websites which are legitimate according to above condition as 0 is legitimate website
splitted_data[splitted_data.long_url == 0].head()

Unnamed: 0,url,protocol,domain_name,address,long_url
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0
3,http://voice-international.com/,http,voice-international.com,,0
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0
5,http://voicechasers.com/forum/viewforum.php?f=8,http,voicechasers.com,forum/viewforum.php?f=8,0


#### 特征2：URL是否有'@'符号

在URL中使用'@'符号会使浏览器忽略'@'前的字符，真实信息可能隐藏在'@'后面

IF {Url Having @ Symbol→ Phishing
    Otherwise→ Legitimate }

0 --- indicates legitimate

1 --- indicates Phishing


In [20]:
def have_at_symbol(l):
    """This function is used to check whether the URL contains @ symbol or not"""
    if "@" in l:
        return 1
    return 0
    

In [21]:
splitted_data['having_@_symbol'] = raw_data['websites'].apply(have_at_symbol)

In [22]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0
3,http://voice-international.com/,http,voice-international.com,,0,0
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0


#### 特征3：使用'//'重定向

URL路径中'//'意味着重定向到其他网址, 比如“http://www.legitimate.com//http://www.phishing.com”.

我们检查'//'出现的位置, 如果是HTTP使用，则出现在第6位, HTTPS使用，则出现在第七位

IF {The Position of the Last Occurrence of "//" in the URL} > 7→ Phishing
    
    Otherwise→ Legitimate

0 --- indicates legitimate

1 --- indicates Phishing


In [23]:
def redirection(l):
    """If the url has symbol(//) after protocol then such URL is to be classified as phishing """
    if "//" in l:
        return 1
    return 0

In [24]:
seperation_of_protocol.head()[1]

0                 www.emuck.com:3000/archive/egan.html
1                             danoday.com/summit.shtml
2    groups.yahoo.com/group/voice_actor_appreciatio...
3                             voice-international.com/
4                            www.livinglegendsltd.com/
Name: 1, dtype: object

In [25]:
splitted_data['redirection_//_symbol'] = seperation_of_protocol[1].apply(redirection)

In [26]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0
3,http://voice-international.com/,http,voice-international.com,,0,0,0
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0


#### 特征4：域名前缀后缀间用(-)分割

合法域名一般没有(-)，钓鱼网站往往使用(-)来误导用户自己在访问正常网站。

例如：http://www.Confirme-paypal.com/.
    
IF {Domain Name Part Includes (−) Symbol → Phishing
    
    Otherwise → Legitimate
    
1 --> indicates phishing

0 --> indicates legitimate
    

In [27]:
def prefix_suffix_seperation(l):
    if '-' in l:
        return 1
    return 0

In [28]:
splitted_data['prefix_suffix_seperation'] = seperation_domain_name['domain_name'].apply(prefix_suffix_seperation)

In [29]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0


#### 特征5：二级域名和多级域名

合法的URL链接在URL中有两个点，因为我们可以忽略键入“ www. ”。
如果点数等于三个，则该URL被归类为“可疑”，因为它具有一个子域。
但是，如果点数大于3，则将其归类为“网络钓鱼”，因为它将具有多个子域。

0 --- indicates legitimate

1 --- indicates Phishing

2 --- indicates Suspicious


In [30]:
def sub_domains(l):
    if l.count('.') < 3:
        return 0
    elif l.count('.') == 3:
        return 2
    return 1

In [31]:
splitted_data['sub_domains'] = splitted_data['domain_name'].apply(sub_domains)

In [32]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0


#### 特征6：使用IP地址

如果使用IP地址代替URL中的域名，例如“ http://125.98.3.123/fake.html”，
用户可以确定有人试图窃取其个人信息。 有时，
IP地址甚至被转换为十六进制代码，如以下链接“ http：//0x58.0xCC.0xCA.0x62/2/paypal.ca/index.html”所示。

        Rule: IF{If The Domain Part has an IP Address} → Phishing
                 Otherwise→ Legitimate

1 --> indicates phishing

0 --> indicates legitimate

In [33]:
import re
def having_ip_address(url):
    match=re.search('(([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\/)|'  #IPv4
                    '((0x[0-9a-fA-F]{1,2})\\.(0x[0-9a-fA-F]{1,2})\\.(0x[0-9a-fA-F]{1,2})\\.(0x[0-9a-fA-F]{1,2})\\/)'  #IPv4 in hexadecimal
                    '(?:[a-fA-F0-9]{1,4}:){7}[a-fA-F0-9]{1,4}',url)     #Ipv6
    if match:
        #print match.group()
        return 1
    else:
        #print 'No matching pattern found'
        return 0


In [34]:
splitted_data['having_ip_address'] = raw_data['websites'].apply(having_ip_address)

In [35]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains,having_ip_address
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0,0
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0,0
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0,0
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0,0
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0,0


#### 特征7：使用短域名服务

把URL缩短是“万维网”上的一种方法，短域名可以使URL的长度显着减小，并且仍然可以跳转至所需的网页。
这是通过短域名上的“ HTTP重定向”来实现的，该域名链接到具有长URL的网页。
例如，URL“ http://portal.hud.ac.uk/” 可以缩短为“ bit.ly/19DXSk4”。

Rule: IF{Tiny URL} → Phishing
         
         Otherwise→ Legitimate
         
1 --> indicates phishing

0 --> indicates legitimate

In [36]:
#we have imported re module in the above feature. So need not to import again
def shortening_service(url):
    match=re.search('bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|'
                    'yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|'
                    'short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|'
                    'doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|'
                    'db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|'
                    'q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|'
                    'x\.co|prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|tr\.im|link\.zip\.net',url)
    if match:
        return 1
    else:
        return 0



In [37]:
splitted_data['shortening_service'] = raw_data['websites'].apply(shortening_service)

In [38]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains,having_ip_address,shortening_service
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0,0,0
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0,0,0
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0,0,0
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0,0,0
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0,0,0


#### 特征8：HTTPS出现在域名中

网络钓鱼者可能将“ HTTPS”协议名添加到URL的域部分，以欺骗用户。

例如，http://https-www-paypal-it-webapps-mpp-home.soft-hair.com/。

    Rule: IF{Using HTTP Token in Domain Part of The URL}→ Phishing
             
             Otherwise→ Legitimate
             
0 --- indicates legitimate

1 --- indicates Phishing

In [39]:
def https_token(url):
    match=re.search('https://|http://',url)
    if match.start(0)==0:
        url=url[match.end(0):]
    match=re.search('http|https',url)
    if match:
        return 1
    else:
        return 0


In [40]:
splitted_data['https_token'] = raw_data['websites'].apply(https_token)

In [41]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains,having_ip_address,shortening_service,https_token
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0,0,0,0
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0,0,0,0
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0,0,0,0
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0,0,0,0
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0,0,0,0


#### 特征9：不正常的URL

这个特征可以从WHOIS数据库中提取出来。对于一个合法的网站，身份通常是其URL的一部分。
（但是身份怎么确定很难说，是name还是org还是别的？）

Rule: IF {The Host Name Is Not Included In URL} → Phishing 
          
          Otherwise→ Legitimate

0 --- indicates legitimate

1 --- indicates Phishing

In [42]:
import whois
import re

In [43]:
def abnormal_url_sub(hostname,url):
    match=re.search(hostname,url)
    if match:
        return 0
    else:
        print("In sub")
        return 1
    
# A possiable version about use name and org
#def abnormal_url_sub(result,url):
#    match1 = re.search(result.name,url)
#    match2 = re.search(result.org,url)
#    if match:
#        return 0
#    else:
#        print("In sub")
#        return 1
    

In [44]:
def abnormal_url_main(vec):
    domain = vec[0]
    print("domain: ", end="")
    print(domain)
    url = vec[1]
#    print("url: ", end="")
#    print(url)
    dns = 0
    try:
        whois_result = whois.whois(domain)
    except Exception as e:
        print ('str(e):\t\t', str(e))
        print ('repr(e):\t', repr(e))
        dns = 1
        
    if dns == 1:
        print("whois failed = 1")
        return 1
    
    print("whois result: ", end="")
    print(whois_result)
    
    if whois_result.domain_name is None:
        print("whois result is null 1")
        return 1
    print("whois_result.domain_name: ", end="")
    print(whois_result.domain_name)
    domain_name = whois_result.domain_name[1]
    return abnormal_url_sub(domain_name,url)
#    print("whois result: ", end="")
#    print(whois_result)
    
#    return abnormal_url_sub(whois_result,url)   
    

In [45]:
# ?
 #x - - cross check 
splitted_data['abnormal_url'] = splitted_data[['domain_name', 'url']].apply(abnormal_url_main, axis=1)

domain: www.emuck.com:3000
whois result: {
  "domain_name": null,
  "registrar": null,
  "whois_server": null,
  "referral_url": null,
  "updated_date": null,
  "creation_date": null,
  "expiration_date": null,
  "name_servers": null,
  "status": null,
  "emails": null,
  "dnssec": null,
  "name": null,
  "org": null,
  "address": null,
  "city": null,
  "state": null,
  "zipcode": null,
  "country": null
}
whois result is null 1
domain: danoday.com
whois result: {
  "domain_name": [
    "DANODAY.COM",
    "danoday.com"
  ],
  "registrar": "DREAMHOST",
  "whois_server": "WHOIS.DREAMHOST.COM",
  "referral_url": null,
  "updated_date": "2019-02-07 08:15:50",
  "creation_date": [
    "1997-03-10 05:00:00",
    "1997-03-09 21:00:00"
  ],
  "expiration_date": "2020-03-11 04:00:00",
  "name_servers": [
    "A.DNS.HOSTWAY.NET",
    "B.DNS.HOSTWAY.NET"
  ],
  "status": [
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientTransferProhibited https://www.ica

str(e):		 [WinError 10054] 远程主机强迫关闭了一个现有的连接。
repr(e):	 ConnectionResetError(10054, '远程主机强迫关闭了一个现有的连接。', None, 10054, None)
whois failed = 1
domain: www.pamelynferdin.com
whois result: {
  "domain_name": [
    "PAMELYNFERDIN.COM",
    "pamelynferdin.com"
  ],
  "registrar": "1&1 IONOS SE",
  "whois_server": "whois.ionos.com",
  "referral_url": null,
  "updated_date": [
    "2019-10-14 07:45:02",
    "2018-10-05 19:19:14"
  ],
  "creation_date": "1999-10-13 21:36:22",
  "expiration_date": "2020-10-13 21:36:22",
  "name_servers": [
    "NS1056.UI-DNS.BIZ",
    "NS1075.UI-DNS.DE",
    "NS1076.UI-DNS.ORG",
    "NS1110.UI-DNS.COM"
  ],
  "status": [
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientTransferProhibited https://www.icann.org/epp#clientTransferProhibited"
  ],
  "emails": [
    "abuse@ionos.com",
    "privacy@1and1.com"
  ],
  "dnssec": [
    "unsigned",
    "Unsigned"
  ],
  "name": "Oneandone Private Registration",
  "org": "1&1 Internet

whois result: {
  "domain_name": [
    "IMDB.COM",
    "imdb.com"
  ],
  "registrar": "MarkMonitor, Inc.",
  "whois_server": "whois.markmonitor.com",
  "referral_url": null,
  "updated_date": [
    "2019-05-07 23:07:16",
    "2019-08-26 12:19:56"
  ],
  "creation_date": [
    "1996-01-05 05:00:00",
    "1996-01-04 21:00:00"
  ],
  "expiration_date": [
    "2024-01-04 05:00:00",
    "2024-01-03 00:00:00"
  ],
  "name_servers": [
    "NS1.P31.DYNECT.NET",
    "NS2.P31.DYNECT.NET",
    "NS3.P31.DYNECT.NET",
    "NS4.P31.DYNECT.NET",
    "PDNS1.ULTRADNS.NET",
    "PDNS2.ULTRADNS.NET",
    "PDNS3.ULTRADNS.ORG",
    "PDNS4.ULTRADNS.ORG",
    "PDNS5.ULTRADNS.INFO",
    "PDNS6.ULTRADNS.CO.UK",
    "pdns1.ultradns.net",
    "pdns6.ultradns.co.uk",
    "pdns4.ultradns.org",
    "pdns2.ultradns.net",
    "ns3.p31.dynect.net",
    "pdns5.ultradns.info",
    "ns1.p31.dynect.net",
    "ns4.p31.dynect.net",
    "ns2.p31.dynect.net",
    "pdns3.ultradns.org"
  ],
  "status": [
    "clientDeleteProhibi

str(e):		 [WinError 10054] 远程主机强迫关闭了一个现有的连接。
repr(e):	 ConnectionResetError(10054, '远程主机强迫关闭了一个现有的连接。', None, 10054, None)
whois failed = 1
domain: www.voicechasers.com
str(e):		 timed out
repr(e):	 timeout('timed out')
whois failed = 1
domain: patfraley.com
whois result: {
  "domain_name": "PATFRALEY.COM",
  "registrar": "NetEarth One, Inc.",
  "whois_server": "whois.netearthone.com",
  "referral_url": null,
  "updated_date": [
    "2019-01-13 07:21:46",
    "2019-03-15 02:16:43"
  ],
  "creation_date": "2003-03-17 22:06:32",
  "expiration_date": "2020-03-17 21:06:32",
  "name_servers": [
    "NS3.BLUEJETHOSTING.COM",
    "NS4.BLUEJETHOSTING.COM",
    "ns3.bluejethosting.com",
    "ns4.bluejethosting.com"
  ],
  "status": "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
  "emails": [
    "a-b-u-s-e.whois.field@netearthone.com",
    "domains@capalon.com"
  ],
  "dnssec": [
    "unsigned",
    "Unsigned"
  ],
  "name": "Domain Admin",
  "org": "Capalon Communic

str(e):		 timed out
repr(e):	 timeout('timed out')
whois failed = 1
domain: www.tsimon.com
whois result: {
  "domain_name": [
    "TSIMON.COM",
    "tsimon.com"
  ],
  "registrar": "pair Domains",
  "whois_server": "pairDomains.com",
  "referral_url": null,
  "updated_date": [
    "2019-07-07 01:59:53",
    "2019-07-07T01:59:53+0000Z"
  ],
  "creation_date": [
    "2000-04-18 20:43:13",
    "2000-04-18T20:43:13+0000Z"
  ],
  "expiration_date": [
    "2022-04-18 20:43:13",
    "2022-04-18T20:43:13+0000Z"
  ],
  "name_servers": [
    "DNS4.PAIR.COM",
    "NS0.NS0.COM"
  ],
  "status": "ok https://icann.org/epp#ok",
  "emails": "abuse@pairdomains.com",
  "dnssec": "unsigned",
  "name": "GDPR Redacted",
  "org": null,
  "address": "GDPR Redacted",
  "city": "GDPR Redacted",
  "state": "MO",
  "zipcode": "GDPR Redacted",
  "country": "US"
}
whois_result.domain_name: ['TSIMON.COM', 'tsimon.com']
domain: www.firezine.net
whois result: {
  "domain_name": "FIREZINE.NET",
  "registrar": "Network

str(e):		 timed out
repr(e):	 timeout('timed out')
whois failed = 1
domain: www.cyberonic.net
whois result: {
  "domain_name": "CYBERONIC.NET",
  "registrar": "TUCOWS, INC.",
  "whois_server": "whois.tucows.com",
  "referral_url": null,
  "updated_date": [
    "2015-08-08 06:13:17",
    "2016-12-21 13:59:05"
  ],
  "creation_date": "1998-07-16 04:00:00",
  "expiration_date": "2020-07-15 04:00:00",
  "name_servers": [
    "NS1.LINODE.COM",
    "NS2.LINODE.COM",
    "NS3.LINODE.COM",
    "NS4.LINODE.COM",
    "NS5.LINODE.COM",
    "ns1.linode.com",
    "ns2.linode.com",
    "ns3.linode.com",
    "ns4.linode.com",
    "ns5.linode.com"
  ],
  "status": "ok https://icann.org/epp#ok",
  "emails": [
    "cyberonic.net@contactprivacy.com",
    "domainabuse@tucows.com",
    "support@cyberonic.com"
  ],
  "dnssec": "unsigned",
  "name": "Contact Privacy Inc. Customer 013275771",
  "org": "Contact Privacy Inc. Customer 013275771",
  "address": "96 Mowat Ave",
  "city": "Toronto",
  "state": "ON",

str(e):		 timed out
repr(e):	 timeout('timed out')
whois failed = 1
domain: bakerstreetdozen.com
str(e):		 timed out
repr(e):	 timeout('timed out')
whois failed = 1
domain: us.imdb.com
whois result: {
  "domain_name": [
    "IMDB.COM",
    "imdb.com"
  ],
  "registrar": "MarkMonitor, Inc.",
  "whois_server": "whois.markmonitor.com",
  "referral_url": null,
  "updated_date": [
    "2019-05-07 23:07:16",
    "2019-08-26 12:19:56"
  ],
  "creation_date": [
    "1996-01-05 05:00:00",
    "1996-01-04 21:00:00"
  ],
  "expiration_date": [
    "2024-01-04 05:00:00",
    "2024-01-03 00:00:00"
  ],
  "name_servers": [
    "NS1.P31.DYNECT.NET",
    "NS2.P31.DYNECT.NET",
    "NS3.P31.DYNECT.NET",
    "NS4.P31.DYNECT.NET",
    "PDNS1.ULTRADNS.NET",
    "PDNS2.ULTRADNS.NET",
    "PDNS3.ULTRADNS.ORG",
    "PDNS4.ULTRADNS.ORG",
    "PDNS5.ULTRADNS.INFO",
    "PDNS6.ULTRADNS.CO.UK",
    "pdns1.ultradns.net",
    "pdns6.ultradns.co.uk",
    "pdns4.ultradns.org",
    "pdns2.ultradns.net",
    "ns3.p31.d

whois result: {
  "domain_name": [
    "GEOCITIES.COM",
    "geocities.com"
  ],
  "registrar": "MarkMonitor, Inc.",
  "whois_server": "whois.markmonitor.com",
  "referral_url": null,
  "updated_date": [
    "2019-11-12 10:11:38",
    "2019-11-12 02:11:38"
  ],
  "creation_date": [
    "1995-12-15 05:00:00",
    "1995-12-15 00:00:00"
  ],
  "expiration_date": [
    "2020-12-14 05:00:00",
    "2020-12-13 00:00:00"
  ],
  "name_servers": [
    "NS1.YAHOO.COM",
    "NS2.YAHOO.COM",
    "NS3.YAHOO.COM",
    "NS4.YAHOO.COM",
    "NS5.YAHOO.COM",
    "ns5.yahoo.com",
    "ns3.yahoo.com",
    "ns4.yahoo.com",
    "ns2.yahoo.com",
    "ns1.yahoo.com"
  ],
  "status": [
    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",
    "serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited",
    "serverTransferProhibi

whois result: {
  "domain_name": "KATHYGARVER.COM",
  "registrar": "TUCOWS, INC.",
  "whois_server": "whois.tucows.com",
  "referral_url": null,
  "updated_date": "2019-06-19 13:11:29",
  "creation_date": "1999-07-04 02:37:34",
  "expiration_date": "2020-07-04 02:37:34",
  "name_servers": [
    "NS1.IPOWER.COM",
    "NS2.IPOWER.COM",
    "ns1.ipower.com",
    "ns2.ipower.com"
  ],
  "status": "ok https://icann.org/epp#ok",
  "emails": [
    "kathygarver.com@contactprivacy.com",
    "domainabuse@tucows.com",
    "support@ipower-inc.com"
  ],
  "dnssec": "unsigned",
  "name": "Contact Privacy Inc. Customer 017991298",
  "org": "Contact Privacy Inc. Customer 017991298",
  "address": "96 Mowat Ave",
  "city": "Toronto",
  "state": "ON",
  "zipcode": "M6K 3M1",
  "country": "CA"
}
whois_result.domain_name: KATHYGARVER.COM
In sub
domain: us.imdb.com
whois result: {
  "domain_name": [
    "IMDB.COM",
    "imdb.com"
  ],
  "registrar": "MarkMonitor, Inc.",
  "whois_server": "whois.markmonitor.

whois result: {
  "domain_name": [
    "TRIPOD.COM",
    "tripod.com"
  ],
  "registrar": "CSC CORPORATE DOMAINS, INC.",
  "whois_server": "whois.corporatedomains.com",
  "referral_url": null,
  "updated_date": [
    "2019-11-01 15:22:21",
    "2018-12-17 17:15:26"
  ],
  "creation_date": "1994-09-29 04:00:00",
  "expiration_date": "2020-09-28 04:00:00",
  "name_servers": [
    "NS1.LYCOS.COM",
    "NS2.LYCOS.COM",
    "NS3.LYCOS.COM",
    "NS4.LYCOS.COM",
    "ns1.lycos.com",
    "ns2.lycos.com",
    "ns4.lycos.com",
    "ns3.lycos.com"
  ],
  "status": [
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited",
    "serverTransferProhibited https://icann.org/epp#serverTransferProhibited",
    "serverUpdateProhibited https://icann.org/epp#serverUpdateProhibited",
    "clientTransferProhibited http://www.icann.org/epp#clientTransferProhibited"
  ],
  "emails": [
    "domainabuse@cscglobal.co

whois result: {
  "domain_name": "POST-GAZETTE.COM",
  "registrar": "Network Solutions, LLC",
  "whois_server": "whois.networksolutions.com",
  "referral_url": null,
  "updated_date": [
    "2018-11-05 08:35:59",
    "2018-11-05 08:37:04"
  ],
  "creation_date": "1995-01-05 05:00:00",
  "expiration_date": "2024-01-04 05:00:00",
  "name_servers": [
    "DNS1.POST-GAZETTE.COM",
    "DNS2.POST-GAZETTE.COM",
    "DNS3.POST-GAZETTE.COM",
    "DNS4.POST-GAZETTE.COM"
  ],
  "status": "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
  "emails": [
    "abuse@web.com",
    "ea8ra3297gq@networksolutionsprivateregistration.com",
    "wz55y3xp2yc@networksolutionsprivateregistration.com"
  ],
  "dnssec": "unsigned",
  "name": "PERFECT PRIVACY, LLC",
  "org": null,
  "address": "5335 Gate Parkway care of Network Solutions PO Box 459",
  "city": "Jacksonville",
  "state": "FL",
  "zipcode": "32256",
  "country": "US"
}
whois_result.domain_name: POST-GAZETTE.COM
In sub
domain:

In [46]:
splitted_data.head(10)

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains,having_ip_address,shortening_service,https_token,abnormal_url
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0,0,0,0,1
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0,0,0,0,0
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0,0,0,0,0
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0,0,0,0,1
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0,0,0,0,1
5,http://voicechasers.com/forum/viewforum.php?f=8,http,voicechasers.com,forum/viewforum.php?f=8,0,0,0,0,0,0,0,0,1
6,http://hollywoodcollectorshow.com/,http,hollywoodcollectorshow.com,,0,0,0,0,0,0,0,0,1
7,http://www.geocities.com/hollywood/hills/8944/,http,www.geocities.com,hollywood/hills/8944/,0,0,0,0,0,0,0,0,0
8,http://asifa.proboards61.com/index.cgi?action=...,http,asifa.proboards61.com,index.cgi?action=calendarviewall,2,0,0,0,0,0,0,0,1
9,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/cal///group/voi...,1,0,1,0,0,0,0,0,0


#### 特征10：Google索引

此功能检查网站是否在Google的索引中。 Google将网站编入索引后，它会显示在搜索结果中（网站站长资源，2014年）。

通常，网上诱骗网页只能在很短的时间内访问，因此，Google索引中可能找不到许多网络钓鱼网页。

Rule: IF{Webpage Indexed by Google} → Legitimate
         
         Otherwise → Phishing

0 --- indicates legitimate

1 --- indicates Phishing

In [47]:
from google import google
from fake_useragent import UserAgent


In [48]:
def google_index(url):
    site=google.search(url,5)
    if site:
        return 0
    else:
        return 1

In [49]:
splitted_data['google_index'] = raw_data['websites'].apply(google_index)

Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.emuck.com%3A3000%2Farchive%2Fegan.html&start=0&num=10&tbs=
<urlopen error [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应，连接尝试失败。>
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.emuck.com%3A3000%2Farchive%2Fegan.html&start=10&num=10&tbs=
<urlopen error [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应，连接尝试失败。>
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.emuck.com%3A3000%2Farchive%2Fegan.html&start=20&num=10&tbs=
<urlopen error [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应，连接尝试失败。>
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.emuck.com%3A3000%2Farchive%2Fegan.html&start=30&num=10&tbs=
<urlopen error [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应，连接尝试失败。>
Error accessing: http://www.google.com/search?nl=en&q=http%3A%2F%2Fwww.emuck.com%3A3000%2Farchive%2Fegan.html&start=40&num=10&tbs=
<urlopen error [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应，连接尝试失败。>
E

KeyboardInterrupt: 

#### 特征11：网站流量

此功能通过确定访问者数量和他们访问的页面数量来衡量网站的受欢迎程度。

但是，由于网络钓鱼网站的生存时间很短，因此Alexa数据库可能无法识别它们（Alexa Web信息公司，1996年）。 通过查看数据集，我们发现在最坏的情况下，
合法网站跻身前十万。 

此外，如果该域没有流量或未被Alexa数据库识别，则将其归类为“网络钓鱼”。否则，将其分类为“可疑”。

Rule: IF Website Rank<100,000 → Legitimate Website Rank>100,
         
         000 →Suspicious
         Otherwise → Phish

0 --- indicates legitimate

1 --- indicates Phishing

2 --- indicates Suspicious

In [50]:
from bs4 import BeautifulSoup
import urllib.request
def web_traffic(url):
    try:
        rank = BeautifulSoup(urllib.request.urlopen("http://data.alexa.com/data?cli=10&dat=s&url=" + url).read(), "xml").find("REACH")['RANK']
    except TypeError:
        return 1
    rank= int(rank)
    if (rank<100000):
        return 0
    else:
        return 2

In [51]:
splitted_data['web_traffic'] = raw_data['websites'].apply(web_traffic)

ConnectionResetError: [WinError 10054] 远程主机强迫关闭了一个现有的连接。

In [52]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains,having_ip_address,shortening_service,https_token,abnormal_url
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0,0,0,0,1
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0,0,0,0,0
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0,0,0,0,0
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0,0,0,0,1
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0,0,0,0,1


#### 特征12：域注册时间长度

基于网络钓鱼网站的生存期很短的事实，我们认为，值得信赖的域名会提前几年定期付款。

在我们的数据集中，我们发现最长的欺诈域仅被使用了一年。

Rule: IF{Domains Expires on≤ 1 years} → Phishing
         
         Otherwise→ Legitimate
         
0 --- indicates legitimate

1 --- indicates Phishing

In [53]:
import whois
from datetime import datetime
import time
def domain_registration_length_sub(domain):
    expiration_date = domain.expiration_date
    today = time.strftime('%Y-%m-%d')
    today = datetime.strptime(today, '%Y-%m-%d')
    if expiration_date is None:
        return 1
    elif type(expiration_date) is list or type(today) is list :
        return 2             #If it is a type of list then we can't select a single value from list. So,it is regarded as suspected website  
    else:
        registration_length = abs((expiration_date - today).days)
        if registration_length / 365 <= 1:
            return 1
        else:
            return 0

    
    
    

In [54]:
def domain_registration_length_main(domain):
    dns = 0
    try:
        domain_name = whois.whois(domain)
    except:
        dns = 1
        
    if dns == 1:
        return 1
    else:
        return domain_registration_length_sub(domain_name)
    

In [55]:
splitted_data['domain_registration_length'] = splitted_data['domain_name'].apply(domain_registration_length_main)

In [56]:
splitted_data.head()

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains,having_ip_address,shortening_service,https_token,abnormal_url,domain_registration_length
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0,0,0,0,1,1
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0,0,0,0,0,1
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0,0,0,0,0,2
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0,0,0,0,1,1
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0,0,0,0,1,1


In [57]:
#Testing the above function
domain_registration_length_main('www.google.com')

2

In [58]:
domain_registration_length_main('www.w3schools.com')

0

#### 特征13：域生存时间

可以从WHOIS数据库中提取此功能（Whois 2005）。大多数网络钓鱼网站的生存时间很短。

通过查看我们的数据集，我们发现合法域的最小生存时间为6个月。

Rule: IF {Age Of Domain≥6 months} → Legitimate
          
          Otherwise → Phishing
          
0 --- indicates legitimate

1 --- indicates Phishing

In [59]:
def age_of_domain_sub(domain):
    creation_date = domain.creation_date
    expiration_date = domain.expiration_date
    if ((expiration_date is None) or (creation_date is None)):
        return 1
    elif ((type(expiration_date) is list) or (type(creation_date) is list)):
        return 2
    else:
        ageofdomain = abs((expiration_date - creation_date).days)
        if ((ageofdomain/30) < 6):
            return 1
        else:
            return 0

In [60]:
def age_of_domain_main(domain):
    dns = 0
    try:
        domain_name = whois.whois(domain)
    except:
        dns = 1
        
    if dns == 1:
        return 1
    else:
        return age_of_domain_sub(domain_name)



In [61]:
splitted_data['age_of_domain'] = splitted_data['domain_name'].apply(age_of_domain_main)

In [62]:
splitted_data

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains,having_ip_address,shortening_service,https_token,abnormal_url,domain_registration_length,age_of_domain
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0,0,0,0,1,1,1
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0,0,0,0,0,1,2
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0,0,0,0,0,2,2
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0,0,0,0,1,1,1
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0,0,0,0,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,http://www.post-gazette.com/magazine/19990223v...,http,www.post-gazette.com,magazine/19990223voicetalent1.asp,2,0,0,1,0,0,0,0,1,0,0
97,http://www.serkworks.com/roommates/index.htm,http,www.serkworks.com,roommates/index.htm,0,0,0,0,0,0,0,0,1,1,1
98,http://www.armory.com/~keeper/jesshirt.html,http,www.armory.com,~keeper/jesshirt.html,0,0,0,0,0,0,0,0,1,0,0
99,http://www.voicechasers.com/database/showactor...,http,www.voicechasers.com,database/showactor.php?actorid=1220,2,0,0,0,0,0,0,0,1,1,1


#### 特征14：DNS记录

对于网络钓鱼网站，WHOIS数据库无法识别声称的身份（Whois 2005）或没有找到有关主机名的记录（Pan和Ding 2006）。

如果DNS记录为空或未找到，则该网站被归类为“网络钓鱼”，否则被归类为“合法”。

Rule: IF{no DNS Record For The Domain} → Phishing
         
         Otherwise→ Legitimate
        
0 --- indicates legitimate

1 --- indicates Phishing

In [63]:
def dns_record(domain):
    dns = 0
    try:
        domain_name = whois.whois(domain)
        print(domain_name)
    except:
        dns = 1
        
    if dns == 1:
        return 1
    else:
        return dns



In [64]:
splitted_data['dns_record'] = splitted_data['domain_name'].apply(dns_record)

{
  "domain_name": null,
  "registrar": null,
  "whois_server": null,
  "referral_url": null,
  "updated_date": null,
  "creation_date": null,
  "expiration_date": null,
  "name_servers": null,
  "status": null,
  "emails": null,
  "dnssec": null,
  "name": null,
  "org": null,
  "address": null,
  "city": null,
  "state": null,
  "zipcode": null,
  "country": null
}
{
  "domain_name": [
    "DANODAY.COM",
    "danoday.com"
  ],
  "registrar": "DREAMHOST",
  "whois_server": "WHOIS.DREAMHOST.COM",
  "referral_url": null,
  "updated_date": "2019-02-07 08:15:50",
  "creation_date": [
    "1997-03-10 05:00:00",
    "1997-03-09 21:00:00"
  ],
  "expiration_date": "2020-03-11 04:00:00",
  "name_servers": [
    "A.DNS.HOSTWAY.NET",
    "B.DNS.HOSTWAY.NET"
  ],
  "status": [
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientTransferProhibited https://www.icann.org/epp#clientTransferProhibited"
  ],
  "emails": [
    "lfxhxq6lbw4ucmc@proxy.dreamhost.com",

{
  "domain_name": "MONKEYDOG.COM",
  "registrar": "DYNADOT LLC",
  "whois_server": "whois.dynadot.com",
  "referral_url": null,
  "updated_date": "2019-08-25 12:37:30",
  "creation_date": "2010-09-08 18:59:31",
  "expiration_date": "2020-09-08 18:59:31",
  "name_servers": [
    "NS1.GRIDHOST.COM",
    "NS2.GRIDHOST.COM",
    "ns1.gridhost.com",
    "ns2.gridhost.com"
  ],
  "status": [
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientTransferProhibited"
  ],
  "emails": [
    "abuse@dynadot.com",
    "monkeydog.com@superprivacyservice.com"
  ],
  "dnssec": "unsigned",
  "name": "Tech Admin",
  "org": "Virtual Point Inc.",
  "address": "1340 Reynolds Ave # 116-290",
  "city": "Irvine",
  "state": "CA",
  "zipcode": "92614-5525",
  "country": "US"
}
{
  "domain_name": "TEAMKNIGHTRIDER.COM",
  "registrar": "TUCOWS, INC.",
  "whois_server": "whois.tucows.com",
  "referral_url": null,
  "updated_date": "2019-03-09 08:35:34",
  "creation_date": "1997

{
  "domain_name": "VORTEX.COM",
  "registrar": "Network Solutions, LLC",
  "whois_server": "whois.networksolutions.com",
  "referral_url": null,
  "updated_date": "2016-10-19 15:08:01",
  "creation_date": "1986-10-27 05:00:00",
  "expiration_date": "2021-10-26 04:00:00",
  "name_servers": [
    "NS01.VORTEX.COM",
    "NS02.VORTEX.COM",
    "NS1.METRON.COM",
    "NS1.VORTEX.COM",
    "NS2.METRON.COM",
    "NS2.VORTEX.COM",
    "NS3.METRON.COM"
  ],
  "status": "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
  "emails": "abuse@web.com",
  "dnssec": "unsigned",
  "name": null,
  "org": null,
  "address": null,
  "city": null,
  "state": null,
  "zipcode": null,
  "country": null
}
{
  "domain_name": "AWN.COM",
  "registrar": "Network Solutions, LLC",
  "whois_server": "whois.networksolutions.com",
  "referral_url": null,
  "updated_date": "2018-07-26 07:24:50",
  "creation_date": "1995-09-25 04:00:00",
  "expiration_date": "2020-09-24 04:00:00",
  "name_servers

{
  "domain_name": "TOONZONE.NET",
  "registrar": "Network Solutions, LLC",
  "whois_server": "whois.networksolutions.com",
  "referral_url": null,
  "updated_date": "2019-12-18 13:15:55",
  "creation_date": "1998-08-03 04:00:00",
  "expiration_date": "2022-08-02 04:00:00",
  "name_servers": [
    "COREY.NS.CLOUDFLARE.COM",
    "LUCIANA.NS.CLOUDFLARE.COM"
  ],
  "status": "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
  "emails": "abuse@web.com",
  "dnssec": "unsigned",
  "name": null,
  "org": null,
  "address": null,
  "city": null,
  "state": null,
  "zipcode": null,
  "country": null
}
{
  "domain_name": [
    "TRIPOD.COM",
    "tripod.com"
  ],
  "registrar": "CSC CORPORATE DOMAINS, INC.",
  "whois_server": "whois.corporatedomains.com",
  "referral_url": null,
  "updated_date": [
    "2019-11-01 15:22:21",
    "2018-12-17 17:15:26"
  ],
  "creation_date": "1994-09-29 04:00:00",
  "expiration_date": "2020-09-28 04:00:00",
  "name_servers": [
    "NS1.LYC

{
  "domain_name": "THEHUTCH.COM",
  "registrar": "TUCOWS, INC.",
  "whois_server": "whois.tucows.com",
  "referral_url": null,
  "updated_date": "2016-07-21 07:12:08",
  "creation_date": "2000-04-22 17:31:52",
  "expiration_date": "2022-04-22 17:31:52",
  "name_servers": [
    "NS2.POWWEB.COM",
    "NS3.POWWEB.COM",
    "ns2.powweb.com",
    "ns3.powweb.com"
  ],
  "status": [
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited"
  ],
  "emails": [
    "domainabuse@tucows.com",
    "support@powweb.com"
  ],
  "dnssec": "unsigned",
  "name": "REDACTED FOR PRIVACY",
  "org": "REDACTED FOR PRIVACY",
  "address": "REDACTED FOR PRIVACY",
  "city": "REDACTED FOR PRIVACY",
  "state": "MN",
  "zipcode": "REDACTED FOR PRIVACY",
  "country": "US"
}
{
  "domain_name": [
    "POVONLINE.COM",
    "povonline.com"
  ],
  "registrar": "DREAMHOST",
  "whois_server": "WHOIS.DREAMHOST.COM",
  "referral_url

{
  "domain_name": [
    "EPINIONS.COM",
    "epinions.com"
  ],
  "registrar": "MarkMonitor, Inc.",
  "whois_server": "whois.markmonitor.com",
  "referral_url": null,
  "updated_date": [
    "2019-10-17 16:16:35",
    "2019-10-17 09:17:21"
  ],
  "creation_date": [
    "1999-02-12 05:00:00",
    "1999-02-11 21:00:00"
  ],
  "expiration_date": [
    "2020-02-12 05:00:00",
    "2020-02-11 00:00:00"
  ],
  "name_servers": [
    "NS1.MARKMONITOR.COM",
    "NS2.MARKMONITOR.COM",
    "NS3.MARKMONITOR.COM",
    "NS4.MARKMONITOR.COM",
    "NS5.MARKMONITOR.COM",
    "NS6.MARKMONITOR.COM",
    "NS7.MARKMONITOR.COM",
    "ns2.markmonitor.com",
    "ns3.markmonitor.com",
    "ns7.markmonitor.com",
    "ns1.markmonitor.com",
    "ns6.markmonitor.com",
    "ns5.markmonitor.com",
    "ns4.markmonitor.com"
  ],
  "status": [
    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientUpdateProhibi

{
  "domain_name": [
    "GEOCITIES.COM",
    "geocities.com"
  ],
  "registrar": "MarkMonitor, Inc.",
  "whois_server": "whois.markmonitor.com",
  "referral_url": null,
  "updated_date": [
    "2019-11-12 10:11:38",
    "2019-11-12 02:11:38"
  ],
  "creation_date": [
    "1995-12-15 05:00:00",
    "1995-12-15 00:00:00"
  ],
  "expiration_date": [
    "2020-12-14 05:00:00",
    "2020-12-13 00:00:00"
  ],
  "name_servers": [
    "NS1.YAHOO.COM",
    "NS2.YAHOO.COM",
    "NS3.YAHOO.COM",
    "NS4.YAHOO.COM",
    "NS5.YAHOO.COM",
    "ns5.yahoo.com",
    "ns4.yahoo.com",
    "ns2.yahoo.com",
    "ns3.yahoo.com",
    "ns1.yahoo.com"
  ],
  "status": [
    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",
    "serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited",
    "serverTransferProhibited https://ic

{
  "domain_name": "TAOSLANDANDFILM.COM",
  "registrar": "Network Solutions, LLC",
  "whois_server": "whois.networksolutions.com",
  "referral_url": null,
  "updated_date": "2018-09-17 16:14:15",
  "creation_date": "1996-09-19 04:00:00",
  "expiration_date": "2023-09-18 04:00:00",
  "name_servers": [
    "NS1141.DNS.DYN.COM",
    "NS2182.DNS.DYN.COM",
    "NS3194.DNS.DYN.COM",
    "NS4157.DNS.DYN.COM"
  ],
  "status": "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
  "emails": "abuse@web.com",
  "dnssec": "unsigned",
  "name": null,
  "org": null,
  "address": null,
  "city": null,
  "state": null,
  "zipcode": null,
  "country": null
}
{
  "domain_name": [
    "IMDB.COM",
    "imdb.com"
  ],
  "registrar": "MarkMonitor, Inc.",
  "whois_server": "whois.markmonitor.com",
  "referral_url": null,
  "updated_date": [
    "2019-05-07 23:07:16",
    "2019-08-26 12:19:56"
  ],
  "creation_date": [
    "1996-01-05 05:00:00",
    "1996-01-04 21:00:00"
  ],
  "expirati

{
  "domain_name": "IMDB.COM",
  "registrar": "MarkMonitor Inc.",
  "whois_server": "whois.markmonitor.com",
  "referral_url": null,
  "updated_date": "2019-05-07 23:07:16",
  "creation_date": "1996-01-05 05:00:00",
  "expiration_date": "2024-01-04 05:00:00",
  "name_servers": [
    "NS1.P31.DYNECT.NET",
    "NS2.P31.DYNECT.NET",
    "NS3.P31.DYNECT.NET",
    "NS4.P31.DYNECT.NET",
    "PDNS1.ULTRADNS.NET",
    "PDNS2.ULTRADNS.NET",
    "PDNS3.ULTRADNS.ORG",
    "PDNS4.ULTRADNS.ORG",
    "PDNS5.ULTRADNS.INFO",
    "PDNS6.ULTRADNS.CO.UK"
  ],
  "status": [
    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",
    "serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited",
    "serverTransferProhibited https://icann.org/epp#serverTransferProhibited",
    "serverUpdateProhibited https://icann.org/epp#server

{
  "domain_name": [
    "KYLEHEBERT.COM",
    "kylehebert.com"
  ],
  "registrar": "1&1 IONOS SE",
  "whois_server": "whois.ionos.com",
  "referral_url": null,
  "updated_date": [
    "2019-10-10 07:54:59",
    "2018-03-14 00:15:26"
  ],
  "creation_date": "2001-10-10 02:47:40",
  "expiration_date": "2020-10-10 02:47:40",
  "name_servers": [
    "NS1056.UI-DNS.BIZ",
    "NS1056.UI-DNS.COM",
    "NS1056.UI-DNS.DE",
    "NS1056.UI-DNS.ORG"
  ],
  "status": [
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientTransferProhibited https://www.icann.org/epp#clientTransferProhibited"
  ],
  "emails": [
    "abuse@ionos.com",
    "privacy@1and1.com"
  ],
  "dnssec": [
    "unsigned",
    "Unsigned"
  ],
  "name": "Oneandone Private Registration",
  "org": "1&1 Internet Inc",
  "address": [
    "701 Lee Road Suite 300",
    "ATTN"
  ],
  "city": "Chesterbrook",
  "state": "PA",
  "zipcode": "19087",
  "country": "US"
}
{
  "domain_name": "POST-GAZETTE.COM"

In [65]:
splitted_data

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains,having_ip_address,shortening_service,https_token,abnormal_url,domain_registration_length,age_of_domain,dns_record
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0,0,0,0,1,1,1,0
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0,0,0,0,0,1,2,0
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0,0,0,0,0,2,2,0
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0,0,0,0,1,1,1,1
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0,0,0,0,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,http://www.post-gazette.com/magazine/19990223v...,http,www.post-gazette.com,magazine/19990223voicetalent1.asp,2,0,0,1,0,0,0,0,1,0,0,0
97,http://www.serkworks.com/roommates/index.htm,http,www.serkworks.com,roommates/index.htm,0,0,0,0,0,0,0,0,1,1,1,1
98,http://www.armory.com/~keeper/jesshirt.html,http,www.armory.com,~keeper/jesshirt.html,0,0,0,0,0,0,0,0,1,0,0,0
99,http://www.voicechasers.com/database/showactor...,http,www.voicechasers.com,database/showactor.php?actorid=1220,2,0,0,0,0,0,0,0,1,1,1,1


#### 特征15：基于统计报告的特征

PhishTank（PhishTank Stats，2010-2012），和StopBadware（StopBadware，2010-2012年）在每个给定时间段内在网络钓鱼网站上制定了大量统计报告
有些是每月，有些是每季度。

Rule: IF{Host Belongs to Top Phishing IPs or Top Phishing Domains} → Phishing
         
         Otherwise → Legitimate
         
0 --- indicates legitimate

1 --- indicates Phishing


In [66]:
import socket
def statistical_report(url):
    hostname = url
    h = [(x.start(0), x.end(0)) for x in re.finditer('https://|http://|www.|https://www.|http://www.', hostname)]
    z = int(len(h))
    if z != 0:
        y = h[0][1]
        hostname = hostname[y:]
        h = [(x.start(0), x.end(0)) for x in re.finditer('/', hostname)]
        z = int(len(h))
        if z != 0:
            hostname = hostname[:h[0][0]]
    url_match=re.search('at\.ua|usa\.cc|baltazarpresentes\.com\.br|pe\.hu|esy\.es|hol\.es|sweddy\.com|myjino\.ru|96\.lt|ow\.ly',url)
    try:
        ip_address = socket.gethostbyname(hostname)
        ip_match=re.search('146\.112\.61\.108|213\.174\.157\.151|121\.50\.168\.88|192\.185\.217\.116|78\.46\.211\.158|181\.174\.165\.13|46\.242\.145\.103|121\.50\.168\.40|83\.125\.22\.219|46\.242\.145\.98|107\.151\.148\.44|107\.151\.148\.107|64\.70\.19\.203|199\.184\.144\.27|107\.151\.148\.108|107\.151\.148\.109|119\.28\.52\.61|54\.83\.43\.69|52\.69\.166\.231|216\.58\.192\.225|118\.184\.25\.86|67\.208\.74\.71|23\.253\.126\.58|104\.239\.157\.210|175\.126\.123\.219|141\.8\.224\.221|10\.10\.10\.10|43\.229\.108\.32|103\.232\.215\.140|69\.172\.201\.153|216\.218\.185\.162|54\.225\.104\.146|103\.243\.24\.98|199\.59\.243\.120|31\.170\.160\.61|213\.19\.128\.77|62\.113\.226\.131|208\.100\.26\.234|195\.16\.127\.102|195\.16\.127\.157|34\.196\.13\.28|103\.224\.212\.222|172\.217\.4\.225|54\.72\.9\.51|192\.64\.147\.141|198\.200\.56\.183|23\.253\.164\.103|52\.48\.191\.26|52\.214\.197\.72|87\.98\.255\.18|209\.99\.17\.27|216\.38\.62\.18|104\.130\.124\.96|47\.89\.58\.141|78\.46\.211\.158|54\.86\.225\.156|54\.82\.156\.19|37\.157\.192\.102|204\.11\.56\.48|110\.34\.231\.42',ip_address)  
    except:
        return 1

    if url_match:
        return 1
    else:
        return 0


In [67]:
splitted_data['statistical_report'] = raw_data['websites'].apply(statistical_report)

In [68]:
splitted_data

Unnamed: 0,url,protocol,domain_name,address,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains,having_ip_address,shortening_service,https_token,abnormal_url,domain_registration_length,age_of_domain,dns_record,statistical_report
0,http://www.emuck.com:3000/archive/egan.html,http,www.emuck.com:3000,archive/egan.html,0,0,0,0,0,0,0,0,1,1,1,0,1
1,http://danoday.com/summit.shtml,http,danoday.com,summit.shtml,0,0,0,0,0,0,0,0,0,1,2,0,0
2,http://groups.yahoo.com/group/voice_actor_appr...,http,groups.yahoo.com,group/voice_actor_appreciation/links/events_an...,1,0,0,0,0,0,0,0,0,2,2,0,0
3,http://voice-international.com/,http,voice-international.com,,0,0,0,1,0,0,0,0,1,1,1,1,0
4,http://www.livinglegendsltd.com/,http,www.livinglegendsltd.com,,0,0,0,0,0,0,0,0,1,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,http://www.post-gazette.com/magazine/19990223v...,http,www.post-gazette.com,magazine/19990223voicetalent1.asp,2,0,0,1,0,0,0,0,1,0,0,0,0
97,http://www.serkworks.com/roommates/index.htm,http,www.serkworks.com,roommates/index.htm,0,0,0,0,0,0,0,0,1,1,1,1,0
98,http://www.armory.com/~keeper/jesshirt.html,http,www.armory.com,~keeper/jesshirt.html,0,0,0,0,0,0,0,0,1,0,0,0,0
99,http://www.voicechasers.com/database/showactor...,http,www.voicechasers.com,database/showactor.php?actorid=1220,2,0,0,0,0,0,0,0,1,1,1,1,0


In [70]:
whois.

SyntaxError: invalid syntax (<ipython-input-70-20dcbfaa433e>, line 1)