# Features
We have 17 features in our dataset. Let's go through what each one means.

## Domain
The domain name present in the URL, without a leading "www.".
It mainly identifies the sample and contributes little to training.

In [3]:
import re
from urllib.parse import urlparse

def getDomain(url):
    domain = urlparse(url).netloc
    # Remove only a leading "www." (escaped dot, anchored at the start)
    domain = re.sub(r"^www\.", "", domain)
    return domain

## Have_IP
Checks for the presence of an IP address in the URL. A URL may use an IP address instead of a domain name. Legitimate sites rarely do this, so an IP address in place of the domain name is treated as a strong sign that the URL is trying to steal personal information.

If the domain part of URL has IP address, the value assigned to this feature is 1 (phishing) else 0 (legitimate).

In [5]:
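import ipaddress

def havingIP(url):
    try:
        # ip_address() expects a bare address, so test the host part, not the full URL
        ipaddress.ip_address(urlparse(url).hostname or "")
        ip = 1
    except ValueError:
        ip = 0
    return ip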

## Have_At
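Checks for the presence of the '@' symbol in the URL. Browsers ignore everything that precedes '@' in the address, so the real destination often follows it.

If the URL has an '@' symbol, the value assigned to this feature is 1 (phishing) else 0 (legitimate).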

In [7]:
def haveAtSign(url):
    if "@" in url:
        at = 1    
    else:
        at = 0    
    return at
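To illustrate why '@' is suspicious, here is a minimal sketch using `urlparse`. The URL is made up for the example; the part before '@' is parsed as user info, and the actual host is whatever follows.

```python
from urllib.parse import urlparse

# Hypothetical URL: it reads like paypal.com, but the real host follows the '@'.
url = "http://paypal.com@evil.example/login"
parts = urlparse(url)

print(parts.username)  # paypal.com
print(parts.hostname)  # evil.example
```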

## URL_Length
Computes the length of the URL. Long URLs can hide the suspicious part in the address bar.

If the length of the URL is >= 54 characters, the value assigned to this feature is 1 (phishing) else 0 (legitimate).

In [8]:
def getLength(url):
    if len(url) < 54:
        length = 0            
    else:
        length = 1            
    return length

## URL_Depth
This feature counts the number of sub-pages in the given URL, that is, the non-empty path segments separated by '/'.

The value of this feature is a non-negative integer.

In [9]:
def getDepth(url):
    s = urlparse(url).path.split('/')
    depth = 0
    for j in range(len(s)):
        if len(s[j]) != 0:
            depth = depth+1
    return depth
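As a worked example, the same count can be written compactly and tried on a few made-up URLs. The name `url_depth` is only for this sketch; note that the query string is not part of the path, so slashes there are not counted.

```python
from urllib.parse import urlparse

def url_depth(url):
    """Count the non-empty segments of the URL path."""
    return sum(1 for segment in urlparse(url).path.split("/") if segment)

print(url_depth("https://example.com/"))                # 0
print(url_depth("https://example.com/a/b/index.html"))  # 3
print(url_depth("https://example.com/a//b/?q=x/y"))     # 2
```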

## Redirection
Checks for the presence of "//" in the URL.

If "//" appears anywhere in the URL other than directly after the protocol, the value assigned to this feature is 1 (phishing) else 0 (legitimate).

In [None]:
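def redirection(url):
    # Position of the last "//" in the URL
    pos = url.rfind('//')
    # In "https://" the protocol's slashes occupy indices 6-7,
    # so only a "//" starting after index 7 counts as a redirect
    if pos > 7:
        return 1
    else:
        return 0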
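The position check is easier to follow with concrete numbers. This small sketch prints where the last "//" sits in three made-up URLs: index 5 for plain "http://", index 6 for "https://", and a much larger index when a second "//" is embedded later in the URL.

```python
urls = [
    "http://example.com",
    "https://example.com",
    "http://example.com//http://evil.example",  # hypothetical embedded redirect
]

# Index of the last "//" in each URL
positions = [u.rfind("//") for u in urls]
for pos, u in zip(positions, urls):
    print(pos, u)
```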

## https_Domain
Checks for the presence of "http/https" in the domain part of the URL.

If the URL has "http/https" in the domain part, the value assigned to this feature is 1 (phishing) else 0 (legitimate).

In [10]:
def httpDomain(url):
    domain = urlparse(url).netloc
    # 'http' also matches 'https', so both tokens are covered
    if 'http' in domain:
        return 1
    else:
        return 0

## TinyURL
A shortened URL hides the real destination. If the URL uses a URL shortening service (for example bit.ly or tinyurl.com), the value assigned to this feature is 1 (phishing) else 0 (legitimate).

In [12]:
shortening_services = r"bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|" \
                      r"yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|" \
                      r"short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|" \
                      r"doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|db\.tt|" \
                      r"qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|q\.gs|is\.gd|" \
                      r"po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|x\.co|" \
                      r"prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|" \
                      r"tr\.im|link\.zip\.net"

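def tinyURL(url):
    # Match the whole host, so that "t\.co" does not fire on hosts like "microsoft.com"
    host = urlparse(url).hostname or ""
    match = re.fullmatch(r"(?:www\.)?(?:" + shortening_services + r")", host, re.IGNORECASE)
    if match:
        return 1
    else:
        return 0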

## Prefix/Suffix
Checks for the presence of '-' in the domain part of the URL. The dash symbol is rarely used in legitimate URLs; phishers add a prefix or suffix separated by '-' to a domain name so that it resembles a legitimate one.

If the URL has a '-' symbol in the domain part, the value assigned to this feature is 1 (phishing) else 0 (legitimate).

In [13]:
def prefixSuffix(url):
    if '-' in urlparse(url).netloc:
        return 1
    else:
        return 0

## DNS_Record
For phishing websites, either the claimed identity is not recognized by the WHOIS database or no records are found for the hostname. If the DNS record is empty or not found, the value assigned to this feature is 1 (phishing) else 0 (legitimate).

In [16]:
#!pip install python-whois
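The rule above could be sketched as follows. This is an illustration under assumptions, not the notebook's own code: `dns_record` and its `lookup` parameter are hypothetical names, and `lookup` stands for any callable that takes a hostname and raises or returns an empty result when nothing is found (for instance `whois.whois` from python-whois; how that package signals a missing record may vary by version).

```python
from urllib.parse import urlparse

def dns_record(url, lookup):
    """Return 1 (phishing) if `lookup` fails or returns nothing for the host, else 0.

    `lookup` is a hypothetical hook: any callable that takes a hostname.
    """
    host = urlparse(url).netloc
    try:
        record = lookup(host)
    except Exception:
        # Lookup libraries raise a variety of errors for unknown domains.
        return 1
    return 0 if record else 1

# Stand-in for a real WHOIS lookup, so the sketch runs offline.
def fake_lookup(host):
    if host == "example.com":
        return {"domain_name": "example.com"}
    raise LookupError(host)

print(dns_record("https://example.com/a", fake_lookup))     # 0
print(dns_record("https://no-such.invalid/", fake_lookup))  # 1
```

Passing the lookup in as an argument keeps the rule testable without network access.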

## Web_Traffic
This feature measures the popularity of the website by the number of visitors and the number of pages they visit. Since phishing websites live for a short period of time, they may not be recognized by the Alexa database (Alexa the Web Information Company., 1996). In our dataset, legitimate websites ranked among the top 100,000 in the worst case. If the domain has no traffic or is not recognized by the Alexa database, it is classified as “Phishing”.

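If the rank of the domain is < 100000, the value of this feature is 0 (legitimate) else 1 (phishing).

In [17]:
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

def web_traffic(url):
    try:
        url = urllib.parse.quote(url)
        rank = BeautifulSoup(urllib.request.urlopen("http://data.alexa.com/data?cli=10&dat=s&url=" + url).read(), "xml").find("REACH")['RANK']
        rank = int(rank)
    except (TypeError, ValueError, OSError):
        # No REACH element, a non-numeric rank, or a failed request: treat as unranked
        return 1
    # Ranked within the top 100,000: legitimate
    if rank < 100000:
        return 0
    else:
        return 1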

## Domain_Age
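This feature can be extracted from the WHOIS database. Most phishing websites live for a short period of time.

If the span between the domain's creation and expiration dates is < 6 months, or either date is missing or ambiguous, the value of this feature is 1 (phishing) else 0 (legitimate).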

In [19]:
def domainAge(domain_name):
    creation_date = domain_name.creation_date
    expiration_date = domain_name.expiration_date
    if (isinstance(creation_date,str) or isinstance(expiration_date,str)):
        try:
            creation_date = datetime.strptime(creation_date,'%Y-%m-%d')
            expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
        except (TypeError, ValueError):  # unparseable date, or only one value is a string
            return 1
    if ((expiration_date is None) or (creation_date is None)):
        return 1
    elif ((type(expiration_date) is list) or (type(creation_date) is list)):
        return 1
    else:
        ageofdomain = abs((expiration_date - creation_date).days)
        if ((ageofdomain/30) < 6):
            age = 1
        else:
            age = 0
    return age
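The month arithmetic used here (days divided by 30) can be checked on made-up dates. The helper name `span_in_months` is only for this sketch.

```python
from datetime import datetime

def span_in_months(start, end):
    """Approximate the gap between two datetimes in 30-day months."""
    return abs((end - start).days) / 30

# Hypothetical registrations: one for three months, one for ten years.
short = span_in_months(datetime(2023, 1, 1), datetime(2023, 4, 1))
long = span_in_months(datetime(2015, 1, 1), datetime(2025, 1, 1))

print(short, short < 6)  # 3.0 True   -> below the 6-month threshold
print(long < 6)          # False      -> at or above the threshold
```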

## Domain_End
This feature can be extracted from the WHOIS database.

If the time remaining until the domain expires is > 6 months, the value of this feature is 1 (phishing) else 0 (legitimate).

In [20]:
def domainEnd(domain_name):
    expiration_date = domain_name.expiration_date
    if isinstance(expiration_date,str):
        try:
            expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
        except ValueError:  # date string not in YYYY-MM-DD format
            return 1
    if (expiration_date is None):
        return 1
    elif (type(expiration_date) is list):
        return 1
    else:
        today = datetime.now()
        end = abs((expiration_date - today).days)
        if ((end/30) < 6):
            end = 0
        else:
            end = 1
    return end

## iFrame
An iframe is an HTML tag used to embed an additional webpage in the one currently shown.

If the response is empty or no iframe is found, the value assigned to this feature is 1 (phishing) else 0 (legitimate).

In [21]:
def iframe(response):
    if response == "":
        return 1
    else:
        # Assumed pattern: an <iframe> tag or a frameBorder attribute
        if re.findall(r"<iframe|frameBorder", response.text, re.IGNORECASE):
            return 0
        else:
            return 1

## Mouse_Over
Phishers may use JavaScript to show a fake URL in the status bar to users.

If the response is empty or onmouseover is found, the value assigned to this feature is 1 (phishing) else 0 (legitimate).

In [22]:
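def mouseOver(response):
    if response == "":
        return 1
    else:
        # An empty pattern matches everything; search for the handler name instead
        if re.findall(r"onmouseover", response.text, re.IGNORECASE):
            return 1
        else:
            return 0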
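For illustration, this is the kind of markup the check is meant to catch. The HTML snippet and the helper name `has_mouseover` are made up for this sketch; it only tests for the handler name, which is a simplification.

```python
import re

def has_mouseover(html):
    """Return True if the page source contains an onmouseover handler."""
    return re.search(r"onmouseover", html, re.IGNORECASE) is not None

# Hypothetical link that shows a different address in the status bar on hover.
html = """<a href="http://evil.example" onmouseover="window.status='https://bank.example'; return true;">Log in</a>"""

print(has_mouseover(html))                     # True
print(has_mouseover("<a href='/home'>Home</a>"))  # False
```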

## Right_Click
Phishers use JavaScript to disable the right-click function, so that users cannot view and save the webpage source code. 

If the response is empty or a script that checks for the right mouse button ("event.button == 2") is found, the value assigned to this feature is 1 (phishing) else 0 (legitimate).

In [None]:
def rightClick(response):
    if response == "":
        return 1
    else:
        # A handler that tests for button 2 is taken as an attempt to block right-click
        if re.findall(r"event\.button ?== ?2", response.text):
            return 1
        else:
            return 0

## Web_Forwards
The fine line that distinguishes phishing websites from legitimate ones is how many times the website has been redirected. In our dataset, legitimate websites were redirected at most once, while phishing websites with this feature were redirected at least 4 times.

If the response is empty or the page was redirected more than 2 times, the value assigned to this feature is 1 (phishing) else 0 (legitimate).

In [23]:
def forwarding(response):
    if response == "":
        return 1
    else:
        if len(response.history) <= 2:
            return 0
        else:
            return 1
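In the requests library, `response.history` is the list of intermediate responses for the redirects that were followed, so its length is the redirect count. The sketch below exercises the same threshold offline with stand-in objects; `forwarding_feature` and `max_redirects` are names chosen for this example.

```python
from types import SimpleNamespace

def forwarding_feature(response, max_redirects=2):
    """Return 1 if the response is empty or was redirected more than `max_redirects` times."""
    if response == "":
        return 1
    return 0 if len(response.history) <= max_redirects else 1

# Stand-ins for requests.Response objects; only `.history` is needed here.
direct = SimpleNamespace(history=[])
bounced = SimpleNamespace(history=["hop1", "hop2", "hop3", "hop4"])

print(forwarding_feature(direct))   # 0
print(forwarding_feature(bounced))  # 1
print(forwarding_feature(""))       # 1
```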

## Label
The class of the URL: 1 (phishing) or 0 (legitimate).