# Decision tree
Mathematics and Methods in Machine Learning and Neural Networks<br>
Jori Nordlund, Simo Ojala, Esa Ryömä<br>
Helsinki Metropolia University of Applied Sciences<br>
06.02.2020

In [1]:
import pandas as pd
import numpy as np
from sklearn import tree
import graphviz
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
from sklearn import model_selection
from IPython.display import IFrame

## Objective
The objective of this assignment is to build a decision tree that classifies a website as either legitimate or a phishing site.

## Data handling

In [2]:
filename = r'http://users.metropolia.fi/~simooj/phishing.csv'
df = pd.read_csv(filename,
                index_col = None,
                sep = ';',
                na_values='?')
df.head(10)

Unnamed: 0,having_IP_Address,URL_Length,Shortining_Service,having_At_Symbol,double_slash_redirecting,Prefix_Suffix,having_Sub_Domain,SSLfinal_State,Domain_registeration_length,Favicon,...,popUpWindow,Iframe,age_of_domain,DNSRecord,web_traffic,Page_Rank,Google_Index,Links_pointing_to_page,Statistical_report,Result
0,-1,1,1,1,-1,-1,-1,-1,-1,1,...,1,1,-1,-1,-1,-1,1,1,-1,-1
1,1,1,1,1,1,-1,0,1,-1,1,...,1,1,-1,-1,0,-1,1,1,1,-1
2,1,0,1,1,1,-1,-1,-1,-1,1,...,1,1,1,-1,1,-1,1,0,-1,-1
3,1,0,1,1,1,-1,-1,-1,1,1,...,1,1,-1,-1,1,-1,1,-1,1,-1
4,1,0,-1,1,1,-1,1,1,-1,1,...,-1,1,-1,-1,0,-1,1,1,1,1
5,-1,0,-1,1,-1,-1,1,1,-1,1,...,1,1,1,1,1,-1,1,-1,-1,1
6,1,0,-1,1,1,-1,-1,-1,1,1,...,1,1,1,-1,-1,-1,1,0,-1,-1
7,1,0,1,1,1,-1,-1,-1,1,1,...,1,1,-1,-1,0,-1,1,0,1,-1
8,1,0,-1,1,1,-1,1,1,-1,1,...,1,1,1,-1,1,1,1,0,1,1
9,1,1,-1,1,1,-1,-1,1,-1,1,...,1,1,1,-1,0,-1,1,0,1,-1


In [3]:
df['Result'] = df['Result'].astype(pd.api.types.CategoricalDtype(ordered=False))
print(df.shape)
print(df.dtypes) # Checking the types of the data

(11055, 31)
having_IP_Address                 int64
URL_Length                        int64
Shortining_Service                int64
having_At_Symbol                  int64
double_slash_redirecting          int64
Prefix_Suffix                     int64
having_Sub_Domain                 int64
SSLfinal_State                    int64
Domain_registeration_length       int64
Favicon                           int64
port                              int64
HTTPS_token                       int64
Request_URL                       int64
URL_of_Anchor                     int64
Links_in_tags                     int64
SFH                               int64
Submitting_to_email               int64
Abnormal_URL                      int64
Redirect                          int64
on_mouseover                      int64
RightClick                        int64
popUpWindow                       int64
Iframe                            int64
age_of_domain                     int64
DNSRecord                   

In [4]:
colnames = df.columns.array
print(colnames) # Printing the column names of the data 

<PandasArray>
[          'having_IP_Address',                  'URL_Length',
          'Shortining_Service',            'having_At_Symbol',
    'double_slash_redirecting',               'Prefix_Suffix',
           'having_Sub_Domain',              'SSLfinal_State',
 'Domain_registeration_length',                     'Favicon',
                        'port',                 'HTTPS_token',
                 'Request_URL',               'URL_of_Anchor',
               'Links_in_tags',                         'SFH',
         'Submitting_to_email',                'Abnormal_URL',
                    'Redirect',                'on_mouseover',
                  'RightClick',                 'popUpWindow',
                      'Iframe',               'age_of_domain',
                   'DNSRecord',                 'web_traffic',
                   'Page_Rank',                'Google_Index',
      'Links_pointing_to_page',          'Statistical_report',
                      'Result']
Length: 3

In [5]:
# Slicing the labels and input variables from the data

data = df.loc[:, 'having_IP_Address':'Statistical_report'] # input variables
labels = df.loc[:,'Result'] # labels
print(labels)

0       -1
1       -1
2       -1
3       -1
4        1
        ..
11050    1
11051   -1
11052   -1
11053   -1
11054   -1
Name: Result, Length: 11055, dtype: category
Categories (2, int64): [-1, 1]


## Decision tree
- Defining the tree
- Fitting the data
- Validating the tree

In [6]:
# Defining the tree
classifier = tree.DecisionTreeClassifier(max_depth=2)

#Fitting the data
classifier.fit(data, labels)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
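
The fitted classifier reports `criterion='gini'`, meaning splits are chosen to reduce Gini impurity. As a rough illustration (this is not scikit-learn's internal code), the impurity at the root node can be computed by hand from the class counts of this dataset, 4898 samples labeled -1 and 6157 labeled 1, as listed in the classification report further down. The helper name `gini_impurity` is made up for this sketch.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_i^2) for a node with the given class counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0  # an empty node is treated as pure in this sketch
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Class counts of the whole dataset, i.e. the root node before any split
root_counts = [4898, 6157]
root_gini = gini_impurity(root_counts)
print("Gini impurity at the root: %.3f" % root_gini)
```

For two classes the impurity can be at most 0.5, so a value close to that simply reflects that the two classes are fairly balanced before any split is made.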

In [7]:
# Creating a picture of the tree that was built
dot_data = tree.export_graphviz(classifier, out_file=None, feature_names=colnames[:30], class_names=['-1','1'])
graph = graphviz.Source(dot_data)
graph.render("kuva")

IFrame("http://users.metropolia.fi/~simooj/phishing.pdf", width=900, height=700)

### Confusion matrix and cross validation

In [8]:
Y_pred = classifier.predict(data)

# Confusion matrix
cm = confusion_matrix(labels, Y_pred)
print(cm)

[[4425  473]
 [ 563 5594]]


In [9]:
# Estimating the accuracy with k-fold cross-validation
k = 10
scores = cross_val_score(estimator=classifier,
                        X=data,
                        y=labels,
                        scoring="accuracy",
                        cv=k)
print("Accuracies from %d individual folds:" % k)
print(scores)
print("Accuracy calculated using %d-fold cross validation = %.3f" % (k, scores.mean()))

Accuracies from 10 individual folds:
[0.89783002 0.8960217  0.90144665 0.90777577 0.90596745 0.920434
 0.90415913 0.89864253 0.91757246 0.91304348]
Accuracy calculated using 10-fold cross validation = 0.906
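
Besides the mean, the spread of the fold accuracies gives a sense of how stable the estimate is. The following minimal sketch recomputes the mean from the ten scores printed above and adds the sample standard deviation and the range. The variable names are chosen for this example only.

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score output above
fold_scores = [0.89783002, 0.8960217, 0.90144665, 0.90777577, 0.90596745,
               0.920434, 0.90415913, 0.89864253, 0.91757246, 0.91304348]

mean_accuracy = mean(fold_scores)
spread = stdev(fold_scores)  # sample standard deviation over the folds

print("Mean accuracy: %.3f" % mean_accuracy)
print("Standard deviation over folds: %.3f" % spread)
print("Range: %.3f to %.3f" % (min(fold_scores), max(fold_scores)))
```

A small standard deviation relative to the mean suggests that the accuracy does not depend strongly on which part of the data is held out.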


### Classification report

In [10]:
print(classification_report(labels, Y_pred, target_names=['-1', '1']))

              precision    recall  f1-score   support

          -1       0.89      0.90      0.90      4898
           1       0.92      0.91      0.92      6157

    accuracy                           0.91     11055
   macro avg       0.90      0.91      0.91     11055
weighted avg       0.91      0.91      0.91     11055



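From the confusion matrix and the classification report we can see that our decision tree identifies the phishing sites with a recall of 91%. In practical terms, out of 6157 sample phishing sites the tree correctly identified 5594. Note that both the confusion matrix and the classification report were computed on the same data the tree was fitted on; the 10-fold cross-validated accuracy of 0.906 is the better estimate of how the tree performs on unseen sites.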
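
The figures in the classification report can be traced back to the confusion matrix. The sketch below redoes the arithmetic by hand, assuming the scikit-learn convention that rows are true labels and columns are predicted labels, both in the order -1, 1. The variable names are illustrative.

```python
# Confusion matrix printed above: rows = true label, columns = predicted label
cm = [[4425, 473],
      [563, 5594]]

true_neg, false_pos = cm[0]   # true label -1
false_neg, true_pos = cm[1]   # true label 1

recall_pos = true_pos / (true_pos + false_neg)      # 5594 / 6157
precision_pos = true_pos / (true_pos + false_pos)   # 5594 / 6067
recall_neg = true_neg / (true_neg + false_pos)      # 4425 / 4898
precision_neg = true_neg / (true_neg + false_neg)   # 4425 / 4988
accuracy = (true_pos + true_neg) / sum(map(sum, cm))

print("Class  1: precision %.2f, recall %.2f" % (precision_pos, recall_pos))
print("Class -1: precision %.2f, recall %.2f" % (precision_neg, recall_neg))
print("Accuracy: %.2f" % accuracy)
```

The rounded values agree with the rows of the classification report.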

# Instructions
For web security purposes, the decision tree produced in this document gives the following steps to follow:
- Step 1: Check whether the target site uses an HTTPS security certificate. If there <b>IS an HTTPS certificate</b>, the site is likely to be <b>legitimate</b>. If there is <b>NO HTTPS certificate</b>, go to step 2.
Whether an existing certificate comes from a trusted source was not a decisive factor in this tree.

- Step 2: Check whether the "anchors" (links in a-tags) point outside the site to other domains. On legitimate sites most anchors link to the site itself. It is also suspicious if the anchors do not link to any webpage but contain void JavaScript or point to non-existent parts of the page.
  - If fewer than 31% of the links point outside the site's domain, the site is legitimate.
  - If between 31% and 67% point outside the domain, the site is suspicious.
  - If more than 67% point outside the domain, the site is likely to be a phishing site.

As a rule of thumb: if over 31% of links link outside the domain, treat the site as dangerous!
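
The two steps above can be written as a small function. This is only a sketch of the written instructions, not an export of the fitted tree, which works on the encoded feature values (-1, 0, 1). The function name `classify_site`, its parameters, and the treatment of values exactly at 31% and 67% are assumptions made for this example.

```python
def classify_site(has_https, external_anchor_fraction):
    """Apply the two-step rule: HTTPS first, then the share of external anchors.

    external_anchor_fraction is the share (0.0 to 1.0) of a-tag links that
    point outside the site's own domain.
    """
    if not 0.0 <= external_anchor_fraction <= 1.0:
        raise ValueError("external_anchor_fraction must be between 0 and 1")

    # Step 1: a site with an HTTPS certificate is treated as likely legitimate
    if has_https:
        return "legitimate"

    # Step 2: thresholds from the instructions (boundary handling is assumed)
    if external_anchor_fraction < 0.31:
        return "legitimate"
    if external_anchor_fraction <= 0.67:
        return "suspicious"
    return "phishing"


print(classify_site(has_https=False, external_anchor_fraction=0.5))
```

Following the rule of thumb, both the "suspicious" and the "phishing" results should be treated as dangerous.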

# Sources
Documentation on the data was lacking, but after some searching several descriptions of the same dataset were found.

Original dataset: https://archive.ics.uci.edu/ml/datasets/phishing+websites#

Description of the data column names: https://github.com/tensorflow/tfjs-examples/tree/master/website-phishing