## Cog-Sys mathematics and methods: Assignment 3 - Phishing websites
Juha Nuutinen

In [44]:
import pandas as pd
import numpy as np

from sklearn import tree
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

import graphviz
from IPython.display import IFrame

In [45]:
df = pd.read_csv("phishing.csv", sep=';')
df.describe()

| | having_IP_Address | URL_Length | Shortining_Service | having_At_Symbol | double_slash_redirecting | Prefix_Suffix | having_Sub_Domain | SSLfinal_State | Domain_registeration_length | Favicon | ... | popUpWindow | Iframe | age_of_domain | DNSRecord | web_traffic | Page_Rank | Google_Index | Links_pointing_to_page | Statistical_report | Result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 11055.0 | 11055.0 | 11055.0 | 11055.0 | 11055.0 | 11055.0 | 11055.0 | 11055.0 | 11055.0 | 11055.0 | ... | 11055.0 | 11055.0 | 11055.0 | 11055.0 | 11055.0 | 11055.0 | 11055.0 | 11055.0 | 11055.0 | 11055.0 |
| mean | 0.313795 | -0.633198 | 0.738761 | 0.700588 | 0.741474 | -0.734962 | 0.063953 | 0.250927 | -0.336771 | 0.628584 | ... | 0.613388 | 0.816915 | 0.061239 | 0.377114 | 0.287291 | -0.483673 | 0.721574 | 0.344007 | 0.719584 | 0.113885 |
| std | 0.949534 | 0.766095 | 0.673998 | 0.713598 | 0.671011 | 0.678139 | 0.817518 | 0.911892 | 0.941629 | 0.777777 | ... | 0.789818 | 0.576784 | 0.998168 | 0.926209 | 0.827733 | 0.875289 | 0.692369 | 0.569944 | 0.694437 | 0.993539 |
| min | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 25% | -1.0 | -1.0 | 1.0 | 1.0 | 1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 1.0 | ... | 1.0 | 1.0 | -1.0 | -1.0 | 0.0 | -1.0 | 1.0 | 0.0 | 1.0 | -1.0 |
| 50% | 1.0 | -1.0 | 1.0 | 1.0 | 1.0 | -1.0 | 0.0 | 1.0 | -1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | -1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| 75% | 1.0 | -1.0 | 1.0 | 1.0 | 1.0 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


## Decision tree with depth=2
70% of the data is used for training the model, and the rest for validating it.

In [47]:
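colnames = df.columns.to_numpy()  # Index.get_values() was removed in pandas 1.0
# Shuffle the rows, then cut at the 70% mark: training rows first, validation rows after.
df_shuffled = df.sample(frac=1)
n_train = int(0.7*len(df))
df_train, df_validate = df_shuffled.iloc[:n_train], df_shuffled.iloc[n_train:]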
x = df_train.loc[:, "having_IP_Address":"Statistical_report"]
x_validate = df_validate.loc[:, "having_IP_Address":"Statistical_report"]
y = df_train.loc[:, "Result"]
y_validate = df_validate.loc[:, "Result"]
print("Number of instances in  training set: {0}".format(len(x)))
classifier = tree.DecisionTreeClassifier(max_depth=2)
classifier.fit(x, y)

Number of instances in  training set: 7738


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [54]:
dot_data = tree.export_graphviz(classifier, out_file=None,
                                feature_names=colnames[:-1],
                                class_names=["normal","phishing"])
graph = graphviz.Source(dot_data) 
graph.render("Phishing")

IFrame("Phishing.pdf", width=700, height=600)

## Instructions for spotting a phishing site
If the site uses HTTPS, it is probably legit, regardless of the SSL certificate issuer and its age. Next we look at the anchor elements, because combining them with the HTTPS status gives a more accurate verdict.

For anchor elements, two features are examined:

1) whether the anchor tag and the website have different domain names

2) whether the anchor tag does not link to any webpage

Then we compute the percentage of the site's anchor elements that meet these criteria.
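
The anchor check described above can be sketched with the Python standard library. This is only an illustration, not the procedure used to build the dataset. The names `AnchorCollector`, `is_suspicious` and `suspicious_anchor_percentage` are hypothetical, the list of placeholder links is an assumption, and an anchor is counted if it meets either criterion (also an assumption). Domains are compared as exact strings, so `www.example.com` and `example.com` would count as different.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumption: href values that are treated as "links to no webpage".
PLACEHOLDER_HREFS = ("", "#", "#content", "#skip")


class AnchorCollector(HTMLParser):
    """Collects the href value (or None) of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


def is_suspicious(href, site_domain):
    """True if the anchor links to no webpage or to a different domain."""
    if href is None:
        return True
    cleaned = href.strip().lower()
    if cleaned in PLACEHOLDER_HREFS or cleaned.startswith("javascript:"):
        return True
    link_domain = urlparse(cleaned).netloc
    # Relative links have an empty netloc and stay on the same site.
    return link_domain != "" and link_domain != site_domain.lower()


def suspicious_anchor_percentage(html, site_domain):
    """Percentage of anchors on the page that meet either criterion."""
    collector = AnchorCollector()
    collector.feed(html)
    if not collector.hrefs:
        return 0.0
    flagged = sum(is_suspicious(h, site_domain) for h in collector.hrefs)
    return 100.0 * flagged / len(collector.hrefs)


page = """
<a href="/about">About</a>
<a href="https://example.com/login">Login</a>
<a href="https://evil.test/steal">Click</a>
<a href="#">Nothing</a>
"""
print(suspicious_anchor_percentage(page, "example.com"))  # 50.0
```

In this made-up page two of the four anchors are flagged, so the page would land on the "31% or more" side of the split.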

If the site uses HTTPS and fewer than 31% of its anchors meet the aforementioned criteria, the site is very likely legit (gini=0.0, meaning every training sample in this leaf belongs to the same class).
If the site uses HTTPS but 31% or more of its anchors meet the criteria, it is probably legit, but there is a real chance that it is a phishing site (gini=0.415).

Next we look at sites that do not use HTTPS, examining the anchor elements in the same way as for the HTTPS sites.
If the anchor element percentage is less than 31%, the site is still likely legit, even though it doesn't use HTTPS (gini=0.199).
If the anchor element percentage is 31% or more, the site is very likely a phishing site (gini=0.144).
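
To make the gini values quoted above easier to read, here is a small sketch of the Gini impurity for a leaf and, for the two-class case, the share of the minority class that a given impurity implies. The function names `gini_impurity` and `minority_share` are hypothetical. The formula is the standard one, gini = 1 - sum of squared class shares, which for two classes equals 2p(1-p).

```python
import math


def gini_impurity(class_counts):
    """Gini impurity of a node, given the number of samples per class."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def minority_share(gini):
    """For two classes gini = 2p(1-p); solve for the smaller share p."""
    return (1.0 - math.sqrt(1.0 - 2.0 * gini)) / 2.0


print(gini_impurity([10, 0]))  # 0.0, a pure leaf
print(gini_impurity([5, 5]))   # 0.5, the two-class maximum

# Gini values of the four leaves as quoted in the text.
for gini in (0.0, 0.415, 0.199, 0.144):
    print("gini=%.3f -> minority class share = %.3f" % (gini, minority_share(gini)))
```

Read this way, gini=0.415 corresponds to a minority share of roughly 29% of the training samples in that leaf, while gini=0.144 corresponds to roughly 8%.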

So to identify a phishing site relatively accurately, one only needs to examine two properties of the site: HTTPS and anchor elements. If the site does not use HTTPS and many of its anchor elements link to a different domain or to no webpage at all, it is very likely a phishing site.
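
The four cases above can be written down as a plain function. This is a hand-written paraphrase of the instructions, not code exported from the fitted tree. The name `classify_site` and its two parameters are hypothetical, and the gini values are simply the ones quoted in the text.

```python
def classify_site(uses_https, suspicious_anchor_pct):
    """Return (verdict, gini of the matching leaf) for a site.

    uses_https: True if the site is served over HTTPS.
    suspicious_anchor_pct: percentage (0-100) of anchors that meet the criteria.
    """
    many_suspicious = suspicious_anchor_pct >= 31.0
    if uses_https:
        if many_suspicious:
            return ("probably legit", 0.415)
        return ("very likely legit", 0.0)
    if many_suspicious:
        return ("very likely phishing", 0.144)
    return ("likely legit", 0.199)


print(classify_site(True, 10.0))
print(classify_site(False, 50.0))
```

The second element of the result lets the analyst see how mixed the matching leaf was in the training data, which matters most for the HTTPS case with many suspicious anchors.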

## Accuracy, confusion matrix and classification report

In [56]:
print("Number of instances in validation set: {0}".format(len(x_validate)))
y_pred = classifier.predict(x_validate)
cm = confusion_matrix(y_validate, y_pred)
print("Confusion matrix:\n",cm)

accuracy = (cm[0][0]+cm[1][1])/(cm[0][0]+cm[1][1]+cm[0][1]+cm[1][0])
print("Accuracy calculated from the validation set = %.3f" % (accuracy))

print(classification_report(y_validate, y_pred,
                            target_names=["normal","phishing"]))

Number of instances in validation set: 3317
Confusion matrix:
 [[1279  143]
 [ 154 1741]]
Accuracy calculated from the validation set = 0.910
              precision    recall  f1-score   support

      normal       0.89      0.90      0.90      1422
    phishing       0.92      0.92      0.92      1895

   micro avg       0.91      0.91      0.91      3317
   macro avg       0.91      0.91      0.91      3317
weighted avg       0.91      0.91      0.91      3317
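
As a cross-check, the accuracy, precision and recall in the report can be recomputed by hand from the confusion matrix printed above. The sketch below uses plain Python lists and relies on the scikit-learn convention that rows are true classes and columns are predicted classes. The helper names `precision` and `recall` are hypothetical.

```python
# Confusion matrix copied from the output above:
# rows = true class, columns = predicted class.
cm = [[1279, 143],
      [154, 1741]]
labels = ["normal", "phishing"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total


def precision(cm, k):
    """Correct predictions of class k divided by all predictions of class k."""
    return cm[k][k] / sum(row[k] for row in cm)


def recall(cm, k):
    """Correct predictions of class k divided by all true members of class k."""
    return cm[k][k] / sum(cm[k])


print("accuracy = %.3f" % accuracy)
for k, label in enumerate(labels):
    print("%s: precision = %.2f, recall = %.2f, support = %d"
          % (label, precision(cm, k), recall(cm, k), sum(cm[k])))
```

The rounded values agree with the classification report: 0.89 and 0.90 for the first class, 0.92 and 0.92 for the second, with supports of 1422 and 1895.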



## Decision tree with depth=3
Now we'll try again with a decision tree with a depth of 3.

In [55]:
classifier = tree.DecisionTreeClassifier(max_depth=3)
classifier.fit(x, y)
dot_data = tree.export_graphviz(classifier, out_file=None,
                                feature_names=colnames[:-1],
                                class_names=["normal","phishing"])
graph = graphviz.Source(dot_data) 
graph.render("Phishing2")
IFrame("Phishing2.pdf", width=600, height=600)

In [57]:
print("Number of instances in validation set: {0}".format(len(x_validate)))
y_pred = classifier.predict(x_validate)
cm = confusion_matrix(y_validate, y_pred)
print("Confusion matrix:\n",cm)

accuracy = (cm[0][0]+cm[1][1])/(cm[0][0]+cm[1][1]+cm[0][1]+cm[1][0])
print("Accuracy calculated from the validation set = %.3f" % (accuracy))

print(classification_report(y_validate, y_pred,
                            target_names=["normal","phishing"]))

Number of instances in validation set: 3317
Confusion matrix:
 [[1279  143]
 [ 154 1741]]
Accuracy calculated from the validation set = 0.910
              precision    recall  f1-score   support

      normal       0.89      0.90      0.90      1422
    phishing       0.92      0.92      0.92      1895

   micro avg       0.91      0.91      0.91      3317
   macro avg       0.91      0.91      0.91      3317
weighted avg       0.91      0.91      0.91      3317



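As can be seen above, the two reports are identical, so the deeper tree shows no improvement, and the instructions for the analyst are based on the depth-2 tree. One caveat: the cell counters show that the depth-3 fit (In [55]) ran before the first evaluation (In [56]), and both evaluations use the same `classifier` variable, so both reports may describe the depth-3 tree.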
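
Both evaluation cells read the same `classifier` variable, so a comparison between depths is easier to trust when each depth gets its own model object. The sketch below shows one way to do this. It runs on synthetic data from `make_classification` as a stand-in for phishing.csv (an assumption made so that the example is self-contained), and the names `models` and `scores` are hypothetical. Fixing `random_state` makes the split and the trees reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; the notebook itself uses phishing.csv).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 70% for training, the rest for validation, with a fixed seed.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.7, random_state=0)

# One fitted classifier per depth, so no evaluation can pick up the wrong model.
models = {
    depth: DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    for depth in (2, 3)
}

# Validation accuracy for each depth, computed from its own model.
scores = {
    depth: accuracy_score(y_val, model.predict(X_val))
    for depth, model in models.items()
}

for depth, score in scores.items():
    print("max_depth=%d: validation accuracy = %.3f" % (depth, score))
```

With the real data, `X` and `y` would be the feature columns and the `Result` column of `df`, and the two printed accuracies could then be compared directly.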