Based on the materials provided by Hoxhunt and some additional searching, I will explain which features we can extract from a URL (assuming we are all familiar with the structure of a URL): 

- Phishing URLs are usually longer and combine more terms. Attackers have full control over the FreeURL parts of the URL, which comprise the subdomains of the FQDN, the path, and the query. Therefore, we can consider the following as features: 

    - Length of the URL
    - Number of dots in FreeURL
    - Length of the FQDN 
    - Length of the mld
    - Count of terms in the URL
    - Count of terms in the mld 

- Another feature can be the webpage content. Phishing webpages tend to have less text but more images and input fields. Therefore we can count the number of terms in the text, and the number of images and input fields. 

- Another feature can be the age of the website, since phishing sites tend to be younger than legitimate ones. However, this one was already covered by the problem statement itself. 

Below, I extract some of these features and store them in a DataFrame. 
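
The term counts listed above are not computed in the code that follows. A minimal sketch of how they might be obtained is given here, assuming that a "term" is a maximal run of letters or digits; that tokenisation rule and the helper name `count_terms` are my own assumptions, not something taken from the Hoxhunt material.

```python
import re

def count_terms(text):
    """Count maximal alphanumeric runs in text (assumed definition of a 'term')."""
    return len(re.findall(r"[A-Za-z0-9]+", text))

# Works on a whole URL or on a single part such as the mld.
print(count_terms("https://paypal.co.uk.yatn.eu/m/"))
print(count_terms("dotpay-platnosc3"))
```

The same helper can be applied to the mld alone once it has been separated from the rest of the hostname.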

In [237]:
import pandas as pd
import tldextract
import requests
import json
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import re
import urllib.request

# Length of the URL
def get_url_length(url):
    return len(url)

# Number of dots in FreeURL (subdomains, path, params of the last path element, query, fragment)
def get_dots_in_freeurl_length(url):
    parsed = urlparse(url)
    # Collect the FreeURL parts of the URL.
    # Note: only the first hostname label is used as the subdomain part,
    # so dots between nested subdomains are not counted.
    subdom_url = parsed.hostname.split('.')[0] if parsed.hostname else ""
    freeurl = subdom_url + parsed.path + parsed.params + parsed.query + parsed.fragment
    return freeurl.count('.')

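# Length of the FQDN (taken from netloc, which also includes any port or credentials)
def get_fqdn_length(url):
    return len(urlparse(url).netloc)

# Length of the mld (second level domain)
def get_mld_length(url):
    hostname = urlparse(url).hostname
    # Assumes a "<subdomain>.<mld>.<tld>" hostname and takes the second label;
    # hostnames without a dot yield 0 instead of raising IndexError.
    labels = hostname.split('.') if hostname else []
    mld = labels[1] if len(labels) > 1 else ""
    return len(mld)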

#Age of the domain 
def get_domain_age_in_days(domain):
    show = "https://input.payapi.io/v1/api/fraud/domain/age/" + domain
    data = requests.get(show).json()
    return data['result'] if 'result' in data else None

def parse_domain_from_url(url):
    t = urlparse(url).netloc
    return '.'.join(t.split('.')[-2:])

def analyze_url(url):
    # First feature, if domain is new it could indicate that the bad guy has bought it recently...
    age_in_days_feature = get_domain_age_in_days(parse_domain_from_url(url))
    return age_in_days_feature

# Features based on content of the webpage:
def get_num_of_img(url):
    try:
        r = requests.get(url)
        # Name the parser explicitly so results do not depend on what is installed.
        soup = BeautifulSoup(r.text, "html.parser")
        return len(soup.find_all('img'))
    except Exception:
        print("An unexpected error happened for ", url)
        return None

def get_num_inputform(url):
    try:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "html.parser")
        # Only the first <form> on the page is inspected; it may be missing.
        form = soup.find('form')
        fields = form.find_all('input') if form is not None else []
        return len(fields)
    except Exception:
        print("An unexpected error happened for ", url)
        return None

def get_num_word(url):
    try:
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, "html.parser")
        # Drop script and style elements so only visible text remains.
        for script in soup(["script", "style"]):
            script.decompose()
        # Counts non-empty text fragments, not individual words.
        strips = list(soup.stripped_strings)
        return len(strips)
    except Exception:
        print("An unexpected error happened for ", url)
        return None
    
def is_https(url):
    protocol = urlparse(url).scheme
    return (protocol=='https')
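
The hostname handling above takes fixed label positions, which breaks down for hosts such as `college-eisk.ru` or `paypal.co.uk.yatn.eu`. The following is only a sketch of a more careful split, under the assumption that a list of public suffixes is available. `MULTI_LABEL_SUFFIXES` is a tiny made-up sample and `split_hostname` and `dots_in_freeurl` are hypothetical helper names; a package such as `tldextract` (already imported above) is designed for this job, and I avoid it here only to keep the sketch self-contained.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately tiny sample of multi-label public suffixes.
# A real implementation would consult the full Public Suffix List.
MULTI_LABEL_SUFFIXES = {"co.uk", "gov.br", "com.au"}

def split_hostname(url):
    """Return (subdomains, mld, suffix) for the URL's hostname."""
    host = urlparse(url).hostname or ""
    labels = host.split(".") if host else []
    if len(labels) < 2:
        return "", host, ""
    # Use a two-label suffix when the last two labels form a known one.
    two_label = ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES
    n_suffix = 2 if two_label and len(labels) > 2 else 1
    suffix = ".".join(labels[-n_suffix:])
    mld = labels[-n_suffix - 1]
    subdomains = ".".join(labels[:-n_suffix - 1])
    return subdomains, mld, suffix

def dots_in_freeurl(url):
    """Count dots in all subdomains plus path, params, query and fragment."""
    parts = urlparse(url)
    subdomains, _, _ = split_hostname(url)
    free = subdomains + parts.path + parts.params + parts.query + parts.fragment
    return free.count(".")

print(split_hostname("https://paypal.co.uk.yatn.eu/m/"))
print(dots_in_freeurl("https://paypal.co.uk.yatn.eu/m/"))
```

With this split, the brand-like labels an attacker places in front of the registered domain contribute to the dot count, which is the signal the FreeURL feature is meant to capture.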
        

In [208]:
def url_to_features(url):
    dic = {}
    dic['url'] = url
    dic['len_url'] = get_url_length(url)
    dic['num_dots_in_freeurl'] = get_dots_in_freeurl_length(url)
    dic['len_FQDN'] = get_fqdn_length(url)
    dic['len_mld'] = get_mld_length(url)
    dic['age_domain'] = analyze_url(url)
    dic['num_img'] = get_num_of_img(url)
    dic['num_inputframes'] = get_num_inputform(url)
    dic['num_words'] = get_num_word(url)
    dic['is_https'] = is_https(url)
    return dic
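
Note that `url_to_features` downloads each page three times, once per content feature. A possible alternative is to fetch the HTML once and derive all three counts from that single string. The sketch below shows only the counting step; it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages. `ContentCounter` and `content_features` are hypothetical names, and unlike the functions above this version counts every `<input>` on the page (not only those in the first form) and counts whitespace-separated words rather than text fragments.

```python
from html.parser import HTMLParser

class ContentCounter(HTMLParser):
    """Count <img> tags, <input> tags and visible words in one pass."""

    def __init__(self):
        super().__init__()
        self.num_img = 0
        self.num_input = 0
        self.num_words = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.num_img += 1
        elif tag == "input":
            self.num_input += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore text that belongs to scripts or stylesheets.
        if self._skip_depth == 0:
            self.num_words += len(data.split())

def content_features(html):
    """Return image, input and word counts for an HTML string."""
    counter = ContentCounter()
    counter.feed(html)
    counter.close()
    return {"num_img": counter.num_img,
            "num_inputs": counter.num_input,
            "num_words": counter.num_words}

# Made-up page for illustration.
sample = """<html><head><style>p {color: red}</style><script>var x = 1;</script></head>
<body><p>Sign in to continue</p><img src="a.png"><img src="b.png"/>
<form><input name="user"><input name="pass" type="password"></form></body></html>"""
print(content_features(sample))
```

In the pipeline, the string passed to `content_features` would come from a single `requests.get(url).text` call wrapped in one `try` block, so a failing URL is reported once instead of three times.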

In [238]:
url_examples = ["https://www.slideshare.net/weaveworks/client-side-monitoring-with-prometheus",
                "http://cartaobndes.gov.br.cv31792.tmweb.ru/",
                "https://paypal.co.uk.yatn.eu/m/",
                "http://college-eisk.ru/cli/",
                "https://dotpay-platnosc3.eu/dotpay/",
                "https://www.google.fi",
                "https://www.wikipedia.org/",
                "https://www.google.fi/jnj.sn;dk?loo.k#lol",
                "abc"
               ]

feature_list = []
for url in url_examples: 
    feature_list.append(url_to_features(url))
data = pd.DataFrame(feature_list)
data

An unexpected error happened for  http://cartaobndes.gov.br.cv31792.tmweb.ru/
An unexpected error happened for  http://cartaobndes.gov.br.cv31792.tmweb.ru/
An unexpected error happened for  http://cartaobndes.gov.br.cv31792.tmweb.ru/
An unexpected error happened for  https://paypal.co.uk.yatn.eu/m/
An unexpected error happened for  https://paypal.co.uk.yatn.eu/m/
An unexpected error happened for  https://paypal.co.uk.yatn.eu/m/
An unexpected error happened for  http://college-eisk.ru/cli/
An unexpected error happened for  https://dotpay-platnosc3.eu/dotpay/
An unexpected error happened for  https://dotpay-platnosc3.eu/dotpay/
An unexpected error happened for  https://dotpay-platnosc3.eu/dotpay/
An unexpected error happened for  https://www.google.fi/jnj.sn;dk?loo.k#lol
An unexpected error happened for  abc
An unexpected error happened for  abc
An unexpected error happened for  abc


Unnamed: 0,url,len_url,num_dots_in_freeurl,len_FQDN,len_mld,age_domain,num_img,num_inputframes,num_words,is_https
0,https://www.slideshare.net/weaveworks/client-s...,76,0,18,10,5106.0,53.0,2.0,251.0,True
1,http://cartaobndes.gov.br.cv31792.tmweb.ru/,43,0,35,3,5020.0,,,,False
2,https://paypal.co.uk.yatn.eu/m/,31,0,20,2,,,,,True
3,http://college-eisk.ru/cli/,27,0,15,2,3089.0,0.0,0.0,,False
4,https://dotpay-platnosc3.eu/dotpay/,35,0,19,2,,,,,True
5,https://www.google.fi,21,0,13,6,,2.0,10.0,27.0,True
6,https://www.wikipedia.org/,26,0,17,9,7014.0,1.0,4.0,450.0,True
7,https://www.google.fi/jnj.sn;dk?loo.k#lol,41,2,13,6,,0.0,0.0,,True
8,abc,3,0,0,0,,,,,False


Before using any machine learning method, it is better to take care of the NaN values. The functions are implemented to return None in case of an exception. One could think of returning a default value such as -1 when a function fails; however, I did not do that, since such a substitution should happen only as post-processing after the function calls. \
The following code first looks at how many NaNs we have and then replaces them with -1 for the features that can only take non-negative values.

In [225]:
print("Number of nulls per feature:")
print(data.isnull().sum())
data.fillna(-1, inplace=True)
data

Number of nulls per feature:
url                    0
len_url                0
num_dots_in_freeurl    0
len_FQDN               0
len_mld                0
age_domain             0
num_img                0
num_inputframes        0
num_words              0
is_https               0
dtype: int64


Unnamed: 0,url,len_url,num_dots_in_freeurl,len_FQDN,len_mld,age_domain,num_img,num_inputframes,num_words,is_https
0,https://www.slideshare.net/weaveworks/client-s...,76,0,18,10,5106.0,53.0,2.0,251.0,True
1,http://cartaobndes.gov.br.cv31792.tmweb.ru/,43,0,35,3,5020.0,-1.0,-1.0,-1.0,False
2,https://paypal.co.uk.yatn.eu/m/,31,0,20,2,-1.0,-1.0,-1.0,-1.0,True
3,http://college-eisk.ru/cli/,27,0,15,2,3089.0,0.0,0.0,-1.0,False
4,https://dotpay-platnosc3.eu/dotpay/,35,0,19,2,-1.0,-1.0,-1.0,-1.0,True
5,https://www.google.fi,21,0,13,6,-1.0,2.0,10.0,27.0,True
6,https://www.wikipedia.org/,26,0,17,9,7014.0,1.0,4.0,450.0,True
7,https://www.google.fi/jnj.sn;dk?loo.k#lol,41,2,13,6,-1.0,0.0,0.0,-1.0,True
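
As a variation on the fill step above, one might also keep a record of which values were originally missing, so that a model can tell a failed fetch apart from a genuine count. The sketch below does this on a small made-up frame; the `_missing` indicator columns and the `toy` data are my own illustration, not part of the run shown above.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not results from the run above.
toy = pd.DataFrame({
    "url": ["a.example", "b.example", "c.example"],
    "num_img": [3.0, np.nan, 0.0],
    "num_words": [120.0, np.nan, np.nan],
})

numeric_cols = ["num_img", "num_words"]

# Count missing values before changing anything.
print(toy[numeric_cols].isnull().sum())

# Record missingness first, then fill with the -1 sentinel.
for col in numeric_cols:
    toy[col + "_missing"] = toy[col].isnull()
toy[numeric_cols] = toy[numeric_cols].fillna(-1)

print(toy)
```

Restricting the fill to the numeric feature columns also avoids writing -1 into columns such as `url`, should they ever contain missing entries.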
