# HoxHunt Summer Hunters 2019 - Data - Home assignment


## What we expect

Investigate potential features you could extract from the given URL and implement extractors for the ones that interest you the most. The example code below extracts one feature but does not store it very efficiently (it just prints it to the console). Implement a sensible data structure, using a well-known data structure library, to store the features per URL. Also consider how you would approach error handling if one feature extractor fails.

Be prepared to discuss questions such as: what features could indicate the maliciousness of a given URL? What goes into the thinking of attackers when they choose a site for an attack? What would you develop next?

## What we don't expect

Implementing a humongous set of features.

Implementing any kind of actual prediction model that uses the features to predict maliciousness at this stage :)

In [2]:
import requests
import json
import re
import pandas as pd

from urllib.parse import urlparse
from urllib.request import urlopen
import tldextract

from bs4 import BeautifulSoup
from bs4.element import Comment

from nltk.tokenize import word_tokenize

#import whois (you also need to install the whois tool on the system separately)

# Feature extraction

In study [1], which built a phishing website classifier from five feature sets (f1-f5), the URL-related features (f1) had the best precision, recall and false-positive rate of any single set. Combining the URL features with the webpage content features (f5) gave recall and false-positive rate close to those of the classifier that used all features. This suggests that combining these two feature sets is a good starting point when building a classifier. Note, however, that the study did not report the performance of any pair of feature sets other than f1,5.

I also looked at the Twitter phishing paper [2], which used the WHOIS registrar name and domain age to identify phishing sites. I will need to scrape these features from an online WHOIS site, as the internal tool does not work well enough.
 
[1] [Know Your Phish: Novel Techniques for Detecting Phishing Sites and their Targets](https://arxiv.org/pdf/1510.06501.pdf)

[2] [PhishAri: Automatic Realtime Phishing Detection on Twitter](https://arxiv.org/pdf/1301.6899.pdf)


## URL-related features

From the URL-related feature set f1 in [1], we build extractors for 8 of the 9 URL-related features, for both the starting and the landing URL.  

1 protocol used (http/https)<br>
2 count of dots ‘.’ in FreeURL<br>
3 count of level domains<br>
4 length of the URL<br>
5 length of the FQDN<br>
6 length of the mld<br>
7 count of terms in the URL<br>
8 count of terms in the mld<br>

Intuitively, these form a good starting point for the classifier, as I can often tell from the URL alone whether a site is trustworthy, except when the site is hidden behind a URL shortener.

There are also a large number of URL features related to links inside the landing page. These are important for the classifier to work correctly, and I should build them next to replicate the performance of the f1 feature set.

In [3]:
example_urls = ["https://www.slideshare.net/weaveworks/client-side-monitoring-with-prometheus",
                "http://cartaobndes.gov.br.cv31792.tmweb.ru/",
                "https://paypal.co.uk.yatn.eu/m/",
                "http://college-eisk.ru/cli/",
                "https://dotpay-platnosc3.eu/dotpay/"
               ]

In [4]:
start_land_urls = []
for url in example_urls:
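    try:
        # A timeout keeps one unresponsive host from blocking the whole loop.
        response = urlopen(url, timeout=10)
        landing_url = response.geturl()
        start_land_urls.append((url, landing_url))
    except Exception:
        # Unreachable or erroring URLs are kept, but without a landing URL.
        start_land_urls.append((url, None))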

start_land_urls

[('https://www.slideshare.net/weaveworks/client-side-monitoring-with-prometheus',
  'https://www.slideshare.net/weaveworks/client-side-monitoring-with-prometheus'),
 ('http://cartaobndes.gov.br.cv31792.tmweb.ru/',
  'https://vh76.timeweb.ru/parking/?ref=cartaobndes.gov.br.cv31792.tmweb.ru'),
 ('https://paypal.co.uk.yatn.eu/m/', None),
 ('http://college-eisk.ru/cli/', None),
 ('https://dotpay-platnosc3.eu/dotpay/', None)]

In [5]:
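# Parses the URL into the components described in [1].
def url_components(url):
    url_parsed = urlparse(url)
    url_extracted = tldextract.extract(url)
    
    protocol = url_parsed.scheme
    # FQDN: fully qualified domain name (the network location of the URL)
    FQDN = url_parsed.netloc
    # RDN: registered domain name; mld: main level domain
    RDN = url_extracted.domain + '.' + url_extracted.suffix
    mld = url_extracted.domain
    # FreeURL as used here: the subdomains before the RDN and the path after it
    FreeURL_start = url_extracted.subdomain
    FreeURL_end = url_parsed.path
    
    return (protocol, FQDN, RDN, mld, FreeURL_start, FreeURL_end)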

url_components('https://www.amazon.co.uk/ap/signin? encoding=UTF8')

('https', 'www.amazon.co.uk', 'amazon.co.uk', 'amazon', 'www', '/ap/signin')

In [6]:
def url_features(url):
    
    if url is not None:
         
        protocol,FQDN, RDN, mld, FreeURL_start, FreeURL_end = url_components(url)
        
        protocol_used = 1 if protocol == 'https' else 0
        free_url_dots = FreeURL_start.count('.') +  FreeURL_end.count('.')
        
        # Count of level domains, approximated by the number of dots in the FQDN.
        # Not exact: subdomain labels are included when they exist.
        level_domain = FQDN.count('.')

        url_length = len(url)
        FQDN_length = len(FQDN)
        mld_length = len(mld)
         
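        # The hyphen goes last in the character class so it is a literal '-'
        # and not a range from '&' to '_' (which would also match digits).
        url_terms = len(re.split(r"[/:.?=&_-]+", url))
        mld_terms = len(re.split(r"[-]+", mld))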
 
        return [protocol_used,free_url_dots,level_domain,  url_length,FQDN_length,mld_length,url_terms,mld_terms]
    else:
        return [None] * 8
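
A note on the term-splitting pattern in `url_features`: inside a regular-expression character class, a hyphen placed between two characters defines a range, so `&-_` matches every ASCII character from `&` to `_`, which includes digits and uppercase letters. The short sketch below contrasts the two spellings on one of the example URLs. The constant names are illustrative only and are not used elsewhere in the notebook.

```python
import re

# Hyphen between '&' and '_' forms a range (0x26 to 0x5F): digits, uppercase
# letters and most punctuation are all treated as separators.
RANGE_PATTERN = re.compile(r"[/:\.?=&-_]+")

# Hyphen placed last in the class is a literal hyphen.
LITERAL_PATTERN = re.compile(r"[/:.?=&_-]+")

url = "http://cartaobndes.gov.br.cv31792.tmweb.ru/"

range_terms = RANGE_PATTERN.split(url)
literal_terms = LITERAL_PATTERN.split(url)

print(range_terms)    # 'cv31792' is cut down to 'cv'
print(literal_terms)  # 'cv31792' stays intact
```

Both splits return eight items for this URL, the last one being an empty string produced by the trailing slash, so the term counts agree here. The terms themselves differ (`cv` versus `cv31792`), and the counts would diverge for URLs where digits or uppercase letters sit between lowercase runs.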

In [7]:
col_names =  ['starting-url', 'protocol', 'free-url-dots','level_domain',
              'url-len', 'fqdn-len', 'mld-len','url-terms','mld-terms']

df_start  = pd.DataFrame(columns = col_names)
for url in start_land_urls:
    df_start.loc[len(df_start)] = [url[0]] + url_features(url[0])
df_start.head()       

| | starting-url | protocol | free-url-dots | level_domain | url-len | fqdn-len | mld-len | url-terms | mld-terms |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.slideshare.net/weaveworks/client-s... | 1 | 0 | 2 | 76 | 18 | 10 | 10 | 1 |
| 1 | http://cartaobndes.gov.br.cv31792.tmweb.ru/ | 0 | 3 | 5 | 43 | 35 | 5 | 8 | 1 |
| 2 | https://paypal.co.uk.yatn.eu/m/ | 1 | 2 | 4 | 31 | 20 | 4 | 8 | 1 |
| 3 | http://college-eisk.ru/cli/ | 0 | 0 | 1 | 27 | 15 | 12 | 6 | 2 |
| 4 | https://dotpay-platnosc3.eu/dotpay/ | 1 | 0 | 1 | 35 | 19 | 16 | 6 | 2 |


In [8]:
col_names =  ['landing-url', 'protocol', 'free-url-dots','level_domain',
              'url-len', 'fqdn-len', 'mld-len','url-terms','mld-terms']

df_land  = pd.DataFrame(columns = col_names)
for url in start_land_urls:
    df_land.loc[len(df_land)] = [url[1]] + url_features(url[1])
df_land.head()   

| | landing-url | protocol | free-url-dots | level_domain | url-len | fqdn-len | mld-len | url-terms | mld-terms |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.slideshare.net/weaveworks/client-s... | 1.0 | 0.0 | 2.0 | 76.0 | 18.0 | 10.0 | 10.0 | 1.0 |
| 1 | https://vh76.timeweb.ru/parking/?ref=cartaobnd... | 1.0 | 0.0 | 2.0 | 72.0 | 15.0 | 7.0 | 12.0 | 1.0 |
| 2 | | | | | | | | | |
| 3 | | | | | | | | | |
| 4 | | | | | | | | | |


## Webpage content features

From [1], we consider the full f5 set of webpage content features: the number of terms in the title and body, and the number of input fields, images and iframes.

In [9]:
#https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text
def tag_visible(element):  
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    
    return True

def text_from_html(soup): 
    # find_all(string=True) is the current spelling of findAll(text=True)
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)  
    
    return u" ".join(t.strip() for t in visible_texts)

In [8]:
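def title_terms_amount(soup):
    # A page may have no <title>, or an empty one; count zero terms in that case.
    title_tag = soup.find('title')
    if title_tag is None or title_tag.string is None:
        return 0
    tokenized_title = word_tokenize(title_tag.string)
    # Count unique terms of at least three characters.
    long_terms = [x for x in tokenized_title if len(x) >= 3]
    
    return len(set(long_terms))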

def body_terms_amount(soup):  
    text = text_from_html(soup)
    tokenized_text = word_tokenize(text)
    long_terms = [x for x in tokenized_text if len(x) >= 3]
    
    return len(set(long_terms))

def input_field_amount(soup):
    #the research paper does not discern between types of input fields
    input_fields = soup.find_all('input')
    
    return len(input_fields)

def images_amount(soup):
    images = soup.find_all('img')
    
    return len(images)

def iframes_amount(soup):
    iframes = soup.find_all('iframe')
    
    return len(iframes)

def content_features(url): 
    if url is not None:
        
        try:
            page = urlopen(url).read()
            soup = BeautifulSoup(page, 'html.parser')

            title_terms_count = title_terms_amount(soup)      
            body_terms_count = body_terms_amount(soup)
            inputs_count = input_field_amount(soup)
            images_count = images_amount(soup)
            iframes_count = iframes_amount(soup)    
      
            return [title_terms_count, body_terms_count, inputs_count, images_count, iframes_count]
        # Any failure (network, parsing, tokenizing) blanks all five features.
        except Exception:
            return [None] * 5
    else:
        return [None] * 5
    
#print(content_features(example_urls[0]))
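
To make the tag-counting features concrete without a network request, the sketch below counts `input`, `img` and `iframe` tags in an inline HTML string. It uses the standard library's `html.parser` instead of BeautifulSoup purely so that it runs without extra dependencies; the class name `TagCounter` and the sample page are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags of interest while parsing an HTML string."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; self-closing tags such as
        # <img/> are also routed through this method by default.
        if tag in self.wanted:
            self.counts[tag] += 1


# Hypothetical login-style page used only for illustration.
sample_html = (
    "<html><head><title>Log in</title></head><body>"
    "<form><input name='user'><input name='pass' type='password'></form>"
    "<img src='logo.png'/>"
    "<iframe src='frame.html'></iframe>"
    "</body></html>"
)

counter = TagCounter(["input", "img", "iframe"])
counter.feed(sample_html)
print(dict(counter.counts))
```

For the sample page this reports two input fields, one image and one iframe, which correspond to the `inputs-count`, `images-count` and `iframes-count` columns.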

In [10]:
col_names =  ['landing-url', 'title-terms', 'body-terms', 'inputs-count', 'images-count', 'iframes-count']
df_content  = pd.DataFrame(columns = col_names)
for url in start_land_urls:
    df_content.loc[len(df_content)] = [url[1]] + content_features(url[1])
df_content.head()   

| | landing-url | title-terms | body-terms | inputs-count | images-count | iframes-count |
|---|---|---|---|---|---|---|
| 0 | https://www.slideshare.net/weaveworks/client-s... | 5.0 | 413.0 | 13.0 | 50.0 | 1.0 |
| 1 | https://vh76.timeweb.ru/parking/?ref=cartaobnd... | 5.0 | 65.0 | 0.0 | 1.0 | 1.0 |
| 2 | | | | | | |
| 3 | | | | | | |
| 4 | | | | | | |
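
On the assignment's question of what to do when one feature extractor fails: `content_features` wraps all five extractors in a single `try`, so one exception blanks every content feature for that URL. A possible alternative, sketched below under the assumption that each extractor is a plain function of one argument, is to run the extractors individually and record `None` only for the one that raised. The names `run_extractors`, `text_length` and `always_fails` are hypothetical.

```python
from typing import Any, Callable, Dict, Optional


def run_extractors(
    source: Any,
    extractors: Dict[str, Callable[[Any], Any]],
) -> Dict[str, Optional[Any]]:
    """Run each extractor on its own; a failure yields None for that feature only."""
    features: Dict[str, Optional[Any]] = {}
    for name, extractor in extractors.items():
        try:
            features[name] = extractor(source)
        except Exception:
            # Isolate the failure so the remaining features are still computed.
            features[name] = None
    return features


# Toy extractors standing in for the real ones (hypothetical).
def text_length(text: str) -> int:
    return len(text)


def always_fails(text: str) -> int:
    raise ValueError("simulated extractor failure")


toy_extractors = {"length": text_length, "broken": always_fails}
result = run_extractors("hello", toy_extractors)
print(result)
```

A list of such dictionaries, one per URL, can be passed to `pd.DataFrame(...)` to get a table with one column per feature name.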


# Appendix

### links
[Know Your Phish: Novel Techniques for Detecting Phishing Sites and their Targets](https://arxiv.org/pdf/1510.06501.pdf)

[DeltaPhish: Detecting Phishing Webpages in Compromised Websites](https://arxiv.org/pdf/1707.00317.pdf)

[PhishAri: Automatic Realtime Phishing Detection on Twitter](https://arxiv.org/pdf/1301.6899.pdf)

[More or Less? Predict the Social Influence of Malicious URLs on Social Media](https://arxiv.org/abs/1812.02978)

[awesome-threat-intelligence](https://github.com/hslatman/awesome-threat-intelligence)



In [10]:
def get_domain_age_in_days(domain):
    print(domain)
    show = "https://input.payapi.io/v1/api/fraud/domain/age/" + domain
    data = requests.get(show).json()
    return data['result'] if 'result' in data else None

def parse_domain_from_url(url):
    t = urlparse(url).netloc
    return '.'.join(t.split('.')[-2:])

def analyze_url(url):
    # First feature, if domain is new it could indicate that the bad guy has bought it recently...
    age_in_days_feature = get_domain_age_in_days(parse_domain_from_url(url))
    # Hmm...maybe I could do something more sensible with the data than just printing out
    print(url, age_in_days_feature)

# Note some of these urls are live phishing sites (as of 2019-03-21) use with caution! More can be found at https://www.phishtank.com/
example_urls = ["https://www.slideshare.net/weaveworks/client-side-monitoring-with-prometheus",
                "http://cartaobndes.gov.br.cv31792.tmweb.ru/",
                "https://paypal.co.uk.yatn.eu/m/",
                "http://college-eisk.ru/cli/",
                "https://dotpay-platnosc3.eu/dotpay/"
               ]
for url in example_urls: 
    analyze_url(url)


slideshare.net
https://www.slideshare.net/weaveworks/client-side-monitoring-with-prometheus 4741
tmweb.ru
http://cartaobndes.gov.br.cv31792.tmweb.ru/ 4655
yatn.eu
https://paypal.co.uk.yatn.eu/m/ None
college-eisk.ru
http://college-eisk.ru/cli/ 2723
dotpay-platnosc3.eu
https://dotpay-platnosc3.eu/dotpay/ None
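
`parse_domain_from_url` keeps the last two labels of the host, which works for `slideshare.net` but returns `co.uk` for a host such as `www.amazon.co.uk`. The feature-extraction cells use `tldextract` for this instead. The sketch below only illustrates the idea with a tiny hand-written suffix set (`TOY_SUFFIXES`, hypothetical and far from complete); a real implementation would rely on the Public Suffix List.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately tiny suffix set for illustration only.
TOY_SUFFIXES = {"co.uk", "gov.br", "net", "eu", "ru"}


def naive_domain(url: str) -> str:
    """Last two labels of the host, as in parse_domain_from_url."""
    host = urlparse(url).netloc
    return ".".join(host.split(".")[-2:])


def suffix_aware_domain(url: str) -> str:
    """Longest matching suffix from TOY_SUFFIXES plus one label to its left."""
    labels = (urlparse(url).hostname or "").split(".")
    # Starting from the longest candidate means the longest suffix wins.
    for start in range(1, len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in TOY_SUFFIXES:
            return ".".join(labels[start - 1:])
    # Fall back to the naive rule when no suffix in the toy set matches.
    return ".".join(labels[-2:])


print(naive_domain("https://www.amazon.co.uk/ap/signin"))
print(suffix_aware_domain("https://www.amazon.co.uk/ap/signin"))
print(suffix_aware_domain("https://paypal.co.uk.yatn.eu/m/"))
```

The third call shows why the suffix has to be matched from the right-hand end of the host: the `co.uk` embedded in the phishing host is part of the subdomain, not the suffix.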


In [12]:
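#might add WHOIS features in the future, if I get whois to work more reliably
import whois  # third-party package; the system whois tool must also be installed

url_parsed = urlparse('https://en.wikipedia.org/wiki/Domain_name_registrar')
FQDN = url_parsed.netloc
print(FQDN)

domain = whois.query(FQDN)
#domain.registrar
#domain.creation_date
# Guard in case the lookup yields no record.
if domain is not None:
    print(domain.__dict__)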
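
Once a creation date is available from a WHOIS lookup, the domain-age feature reduces to a date difference. The helper below is a minimal sketch: the name `domain_age_in_days` and its signature are hypothetical, and it assumes `datetime.date` inputs (a `datetime` value would need `.date()` first).

```python
from datetime import date
from typing import Optional


def domain_age_in_days(creation_date: Optional[date], today: date) -> Optional[int]:
    """Return the age in days, or None when the date is unknown or in the future."""
    if creation_date is None:
        return None
    age = (today - creation_date).days
    # A negative age points to bad input data, so it is reported as unknown.
    return age if age >= 0 else None


print(domain_age_in_days(date(2019, 1, 1), date(2019, 3, 21)))
print(domain_age_in_days(None, date(2019, 3, 21)))
```

Passing `today` explicitly keeps the function deterministic, which makes it easier to test and to recompute features for a fixed snapshot date.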