# HoxHunt Summer Hunters 2019 - Data - Home assignment


<img src="https://www.dropbox.com/s/zmuij2fyjo27j1u/Screenshot%202019-03-21%2017.28.10.png?dl=1" width="1000">

## Assignment

In this assignment you, as a HoxHunt Data Science Hunter, are given the task of extracting interesting features from a possibly malicious indicator of compromise, more specifically in this case from a given potentially malicious URL.

<img src="https://www.dropbox.com/s/ao0neaphtfama7g/Screenshot%202019-03-21%2017.23.40.png?dl=1" width="400">

This assignment assumes that you are comfortable with (or quick to learn) Jupyter Notebooks and a suitable programming environment such as Python, R or Julia. The example below uses Python and has some external dependencies, such as the Requests library.

Happy hunting!


## Interesting research papers & resources

Below is a list of interesting research papers on the topic. They might give you good tips on what features you could extract from a given URL:


[Know Your Phish: Novel Techniques for Detecting Phishing Sites and their Targets](https://arxiv.org/pdf/1510.06501.pdf)

[DeltaPhish: Detecting Phishing Webpages in Compromised Websites](https://arxiv.org/pdf/1707.00317.pdf)

[PhishAri: Automatic Realtime Phishing Detection on Twitter](https://arxiv.org/pdf/1301.6899.pdf)

[More or Less? Predict the Social Influence of Malicious URLs on Social Media](https://arxiv.org/abs/1812.02978)

[awesome-threat-intelligence](https://github.com/hslatman/awesome-threat-intelligence)



## What we expect

Investigate potential features you could extract from the given URL and implement extractors for the ones that interest you the most. The example code below extracts one feature but does not store it very efficiently (it just logs it to the console). Implement a sensible data structure, using a well-known data structure library, to store the features per URL. Also consider how you would approach error handling if one feature extractor fails.

Be prepared to discuss questions such as: What features could indicate the maliciousness of a given URL? What goes into the thinking of the attacker when they are choosing a site for an attack? What would you develop next?

## What we don't expect

Implement a humongous set of features.

Implement any kind of actual prediction model that uses the features to predict maliciousness at this stage :)

In [35]:
import requests
import json
import re

from urllib.parse import urlparse
from urllib.request import urlopen
import tldextract

from bs4 import BeautifulSoup
from bs4.element import Comment

import nltk


In [2]:

def get_domain_age_in_days(domain):
    show = "https://input.payapi.io/v1/api/fraud/domain/age/" + domain
    data = requests.get(show).json()
    return data['result'] if 'result' in data else None

def parse_domain_from_url(url):
    t = urlparse(url).netloc
    return '.'.join(t.split('.')[-2:])

def analyze_url(url):
    # First feature, if domain is new it could indicate that the bad guy has bought it recently...
    age_in_days_feature = get_domain_age_in_days(parse_domain_from_url(url))
    # Hmm...maybe I could do something more sensible with the data than just printing out
    print(url, age_in_days_feature)

# Note: some of these URLs are live phishing sites (as of 2019-03-21), use with caution! More can be found at https://www.phishtank.com/
example_urls = ["https://www.slideshare.net/weaveworks/client-side-monitoring-with-prometheus",
                "http://cartaobndes.gov.br.cv31792.tmweb.ru/",
                "https://paypal.co.uk.yatn.eu/m/",
                "http://college-eisk.ru/cli/",
                "https://dotpay-platnosc3.eu/dotpay/"
               ]
for url in example_urls: 
    analyze_url(url)


https://www.slideshare.net/weaveworks/client-side-monitoring-with-prometheus 4739
http://cartaobndes.gov.br.cv31792.tmweb.ru/ 4653
https://paypal.co.uk.yatn.eu/m/ None
http://college-eisk.ru/cli/ 2722
https://dotpay-platnosc3.eu/dotpay/ None
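
The example above only prints the feature. One possible way to store features per URL, and to keep going when a single extractor fails, is sketched below. This is a minimal sketch under assumptions: `extract_features`, `EXTRACTORS` and the three toy extractors are hypothetical names introduced for illustration, and `always_fails` merely stands in for an extractor that breaks (for example on a network error).

```python
from urllib.parse import urlparse


def url_length(url):
    """Toy extractor: number of characters in the URL."""
    return len(url)


def uses_https(url):
    """Toy extractor: True when the scheme is https."""
    return urlparse(url).scheme == "https"


def always_fails(url):
    """Stand-in for an extractor that breaks, e.g. on a network error."""
    raise RuntimeError("service unavailable")


# Hypothetical registry: feature name -> extractor function.
EXTRACTORS = {
    "url_length": url_length,
    "uses_https": uses_https,
    "domain_age_days": always_fails,
}


def extract_features(url, extractors=EXTRACTORS):
    """Run every extractor; a failure is recorded instead of aborting the row."""
    row = {"url": url, "errors": {}}
    for name, extractor in extractors.items():
        try:
            row[name] = extractor(url)
        except Exception as exc:  # one broken extractor must not stop the rest
            row[name] = None
            row["errors"][name] = repr(exc)
    return row


rows = [extract_features(u) for u in ["https://paypal.co.uk.yatn.eu/m/",
                                      "http://college-eisk.ru/cli/"]]
for row in rows:
    print(row)
```

Each row is a plain dict, so a list of rows could be handed to a data structure library such as pandas (`pandas.DataFrame(rows)`) if a tabular view is wanted.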


| Feature set | Count | Description |
|---|---|---|
| f1 | 106 | URL |
| f2 | 66 | Term usage consistency |
| f3 | 22 | Usage of starting and landing mld |
| f4 | 13 | RDN usage |
| f5 | 5 | Webpage content |

URL-related features had the best precision, recall and FP rate, together with term use frequency, but a combination of URL and webpage content related features



In [3]:
example_urls = ["https://www.slideshare.net/weaveworks/client-side-monitoring-with-prometheus",
                "http://cartaobndes.gov.br.cv31792.tmweb.ru/",
                "https://paypal.co.uk.yatn.eu/m/",
                "http://college-eisk.ru/cli/",
                "https://dotpay-platnosc3.eu/dotpay/"
               ]

In [4]:
start_land_urls = []
for url in example_urls:
    try:
        response = urlopen(url)
        landing_url = response.geturl()
        start_land_urls.append((url, landing_url))
    except Exception:  # fetch failed; unlike a bare except this lets KeyboardInterrupt through
        start_land_urls.append((url, None))

start_land_urls

[('https://www.slideshare.net/weaveworks/client-side-monitoring-with-prometheus',
  'https://www.slideshare.net/weaveworks/client-side-monitoring-with-prometheus'),
 ('http://cartaobndes.gov.br.cv31792.tmweb.ru/',
  'https://vh76.timeweb.ru/parking/?ref=cartaobndes.gov.br.cv31792.tmweb.ru'),
 ('https://paypal.co.uk.yatn.eu/m/', None),
 ('http://college-eisk.ru/cli/', None),
 ('https://dotpay-platnosc3.eu/dotpay/', None)]
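
The start/landing pairs collected above could be turned into simple redirect features. The following is a minimal sketch using only `urllib.parse`; `redirect_features` is a hypothetical helper, and it compares full hostnames rather than registered domains (comparing registered domains would need something like `tldextract`).

```python
from urllib.parse import urlparse


def redirect_features(start_url, landing_url):
    """Compare a starting URL with the URL the request finally landed on."""
    if landing_url is None:
        # The page could not be fetched, so nothing can be compared.
        return {"reachable": False, "redirected": None,
                "host_changed": None, "scheme_changed": None}
    start, landing = urlparse(start_url), urlparse(landing_url)
    return {
        "reachable": True,
        "redirected": start_url != landing_url,
        "host_changed": start.hostname != landing.hostname,
        "scheme_changed": start.scheme != landing.scheme,
    }


# One pair taken from the output above.
pair = ("http://cartaobndes.gov.br.cv31792.tmweb.ru/",
        "https://vh76.timeweb.ru/parking/?ref=cartaobndes.gov.br.cv31792.tmweb.ru")
print(redirect_features(*pair))
print(redirect_features("https://paypal.co.uk.yatn.eu/m/", None))
```

Unreachable pages get `None` for the comparison fields rather than `False`, so that "could not check" stays distinguishable from "did not redirect".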

In [5]:
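def url_components(url):
    url_parsed = urlparse(url)
    url_extracted = tldextract.extract(url)

    protocol = url_parsed.scheme
    # FQDN: fully qualified domain name
    FQDN = url_parsed.netloc
    # RDN: registered domain name (mld + public suffix)
    RDN = url_extracted.domain + '.' + url_extracted.suffix
    # mld: main level domain
    mld = url_extracted.domain
    # FreeURL: the parts the phisher can choose freely (subdomains and path)
    FreeURL_start = url_extracted.subdomain
    FreeURL_end = url_parsed.path

    return (protocol, FQDN, RDN, mld, FreeURL_start, FreeURL_end)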

for url in example_urls:
    print(url_components(url))

('https', 'www.slideshare.net', 'slideshare.net', 'slideshare', 'www', '/weaveworks/client-side-monitoring-with-prometheus')
('http', 'cartaobndes.gov.br.cv31792.tmweb.ru', 'tmweb.ru', 'tmweb', 'cartaobndes.gov.br.cv31792', '/')
('https', 'paypal.co.uk.yatn.eu', 'yatn.eu', 'yatn', 'paypal.co.uk', '/m/')
('http', 'college-eisk.ru', 'college-eisk.ru', 'college-eisk', '', '/cli/')
('https', 'dotpay-platnosc3.eu', 'dotpay-platnosc3.eu', 'dotpay-platnosc3', '', '/dotpay/')


In [6]:
#url_parsed =  urlparse(example_urls[0])
#test = url_parsed.hostname
#print(url_parsed)
tldextract.extract(example_urls[2])
#tldextract.extract('https://www.amazon.co.uk/ap/signin? encoding=UTF8')
#urlparse(example_urls[0])

ExtractResult(subdomain='paypal.co.uk', domain='yatn', suffix='eu')

In [15]:
print(example_urls[2])
url_terms = re.split(r"[/:\.?=&]+",example_urls[2])
print(url_terms)

https://paypal.co.uk.yatn.eu/m/
['https', 'paypal', 'co', 'uk', 'yatn', 'eu', 'm', '']
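
The split above leaves an empty string at the end of the list whenever the URL ends with a separator. A small sketch of a term splitter that drops empty terms and lower-cases the rest is shown below; `split_url_terms` and `SEPARATORS` are hypothetical names, and treating `-` and `_` as separators as well is an assumption.

```python
import re

# Separators between URL terms; the hyphen is escaped so it is not read as a range.
SEPARATORS = re.compile(r"[/:.?=&\-_]+")


def split_url_terms(url):
    """Return the non-empty, lower-cased terms of a URL."""
    return [term.lower() for term in SEPARATORS.split(url) if term]


print(split_url_terms("https://paypal.co.uk.yatn.eu/m/"))
# ['https', 'paypal', 'co', 'uk', 'yatn', 'eu', 'm']
```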


In [28]:
def url_features(url):
    # Landing URLs are None when the page could not be fetched.
    if url is None:
        return None

    protocol, FQDN, RDN, mld, FreeURL_start, FreeURL_end = url_components(url)
    # count of dots '.' in FreeURL
    free_url_dots = FreeURL_start.count('.') + FreeURL_end.count('.')

    # count of level domains (not implemented yet)

    # length of the URL
    url_length = len(url)
    # length of the FQDN
    FQDN_length = len(FQDN)
    # length of the mld
    mld_length = len(mld)
    # count of terms in the URL; the hyphen is escaped so that '&-_' is not
    # read as a character range. A trailing separator adds one empty term.
    url_terms = len(re.split(r"[/:.?=&\-_]+", url))
    # count of terms in the mld
    mld_terms = len(re.split(r"[-]+", mld))

    return (protocol, free_url_dots, url_length, FQDN_length, mld_length, url_terms, mld_terms)

    
for url in start_land_urls: 
    print(url_features(url[0]), url_features(url[1]))
#print(url_features(example_urls[3]))

('https', 0, 76, 18, 10, 10, 1) ('https', 0, 76, 18, 10, 10, 1)
('http', 3, 43, 35, 5, 8, 1) ('https', 0, 72, 15, 7, 12, 1)
('https', 2, 31, 20, 4, 8, 1) None
('http', 0, 27, 15, 12, 6, 2) None
('https', 0, 35, 19, 16, 6, 2) None
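
The count of level domains is noted in `url_features` above but not implemented. One way it could be filled in is sketched here using only the standard library; `domain_level_count` is a hypothetical helper. It counts every dot-separated label of the hostname, so a public suffix such as `co.uk` counts as two levels.

```python
from urllib.parse import urlparse


def domain_level_count(url):
    """Count the dot-separated labels in the hostname of a URL."""
    hostname = urlparse(url).hostname
    if not hostname:
        return 0
    # Ignore empty labels such as the one left by a trailing dot.
    return len([label for label in hostname.split(".") if label])


for url in ["https://www.slideshare.net/weaveworks/client-side-monitoring-with-prometheus",
            "http://cartaobndes.gov.br.cv31792.tmweb.ru/",
            "http://college-eisk.ru/cli/"]:
    print(domain_level_count(url), url)
```

For the three example URLs this prints 3, 6 and 2 respectively.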


In [72]:
page = requests.get(example_urls[0])
soup = BeautifulSoup(page.text, 'html.parser')

In [56]:
#artist_name_list = soup.find('body')
#artist_name_list.string
#nltk.word_tokenize(artist_name_list.string)
#text = artist_name_list.getText()
#nltk.word_tokenize(text)

In [74]:
#https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text
def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # find_all(string=True) is the current spelling of the older findAll(text=True)
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)

#html = urlopen(example_urls[0]).read()
#text = text_from_html(html)

In [89]:
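def title_terms_amount(url):

    page = urlopen(url).read()
    soup = BeautifulSoup(page, 'html.parser')
    title_tag = soup.find('title')
    # Pages without a <title> (or with an empty one) have no title terms.
    if title_tag is None or title_tag.string is None:
        return 0
    tokenized_title = nltk.word_tokenize(title_tag.string)
    long_terms = [x for x in tokenized_title if len(x) >= 3]

    return len(set(long_terms))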

def body_terms_amount(url):
    
    html = urlopen(url).read()
    text = text_from_html(html)
    tokenized_text = nltk.word_tokenize(text)
    long_terms = [x for x in tokenized_text if len(x) >= 3]
    
    return len(set(long_terms))

def input_field_amount(url):
    #the research paper does not discern between types of input fields
    page = urlopen(url).read()
    soup = BeautifulSoup(page, 'html.parser')
    input_fields = soup.find_all('input')
    
    return len(input_fields)

def images_amount(url):
    
    page = urlopen(url).read()
    soup = BeautifulSoup(page, 'html.parser')
    images = soup.find_all('img')
    
    return len(images)

def iframes_amount(url):
    
    page = urlopen(url).read()
    soup = BeautifulSoup(page, 'html.parser')
    iframes = soup.find_all('iframe')
    
    return len(iframes)

def content_features(url):
    # Note: every helper downloads the page again, so this makes five requests per URL.
    if url is None:
        return None

    title_terms_count = title_terms_amount(url)
    body_terms_count = body_terms_amount(url)
    inputs_count = input_field_amount(url)
    images_count = images_amount(url)
    iframes_count = iframes_amount(url)

    return (title_terms_count, body_terms_count, inputs_count, images_count, iframes_count)
    
print(content_features(example_urls[0]))

(5, 414, 13, 50, 0)
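
The helpers above each download the page separately. The tag counts could also be computed from a single HTML string that was fetched once. The sketch below uses the standard library's `html.parser.HTMLParser` instead of BeautifulSoup so that it runs without extra dependencies; `TagCounter`, `count_tags` and the sample HTML are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each start tag occurs in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html, tags=("input", "img", "iframe")):
    """Return the number of occurrences of each requested tag."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return {tag: counter.counts[tag] for tag in tags}


# Made-up sample page, not taken from any of the example URLs.
sample_html = """
<html><head><title>Log in</title></head>
<body>
  <form><input type="text" name="user"><input type="password" name="pw"></form>
  <img src="logo.png">
  <iframe src="https://www.w3schools.com"></iframe>
</body></html>
"""
print(count_tags(sample_html))
# {'input': 2, 'img': 1, 'iframe': 1}
```

The same idea applies with BeautifulSoup: parse the page once and pass the resulting `soup` object to each counting helper.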


In [92]:
iframes_amount('https://www.w3schools.com/tags/tryit.asp?filename=tryhtml_iframe')

0

In [93]:
page = urlopen('https://www.w3schools.com/tags/tryit.asp?filename=tryhtml_iframe').read()

In [95]:
soup = BeautifulSoup(page, 'html.parser')
iframes = soup.find_all('iframe')

In [96]:
iframes

[<iframe src="https://www.w3schools.com">
   <p>Your browser does not support iframes.</p>
 </iframe>]

# Appendix

## Notes

[Know Your Phish: Novel Techniques for Detecting
Phishing Sites and their Targets](https://arxiv.org/pdf/1510.06501.pdf)

Modeling phisher limitations: To increase their chances
of success, phishers try to make their phish mimic its
target closely and obscure any signal that might tip off the
victim. However, in crafting the structure of the phishing
webpage, phishers are restricted in two significant ways.
First, external hyperlinks in the phishing webpage, especially those pointing to the target, are to domains outside
the control of phishers. 

Second, while phishers can freely
change most parts of the phishing page, the latter part
of its domain name is constrained as they are limited
to domains that the phishers control. We conjecture that
by modeling these limitations in our phishing detection
classifier, we can improve its effectiveness.

Measuring consistency in term usage: A webpage can
be represented by a collection of key terms that occur
in multiple parts of the page such as its body text, title,
domain name, other parts of the URL etc. We conjecture
that the way in which these terms are used in different
parts of the page will be different in legitimate and
phishing webpages.
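
To make the idea of term usage consistency concrete, the sketch below compares the terms of two parts of a page with a Jaccard overlap. This is only an illustration under assumptions: the Jaccard overlap is swapped in for simplicity and is not claimed to be the measure used in the paper, `terms` and `jaccard_overlap` are hypothetical helpers, and the page title is made up.

```python
import re


def terms(text, min_length=3):
    """Lower-cased alphabetic terms of at least min_length characters."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if len(t) >= min_length}


def jaccard_overlap(a, b):
    """Size of the intersection divided by the size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


title_terms = terms("PayPal: Log in to your account")  # made-up page title
domain_terms = terms("paypal.co.uk.yatn.eu")           # FQDN from the example URLs
print(title_terms & domain_terms)
print(jaccard_overlap(title_terms, domain_terms))
```

Here the only shared term is `paypal`, out of five distinct terms in total, so the overlap is 0.2.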


A phisher has full control over the subdomains portion and can set it to
any value. The RDN portion is constrained since it has to be
registered with a domain name registrar.

Useful features: starting URL, landing URL, redirection chain, logged links, HTML





