## Phishing Detection using Machine Learning
This exercise involves building a phishing detector using two different algorithms:
- Logistic Regression
- Decision Trees

A later part of the exercise covers spam detection using Natural Language Processing (NLP).

The first part of the exercise uses data from the [UCI Machine Learning Repository (Phishing Websites Data Set)](https://archive.ics.uci.edu/ml/datasets/Phishing+Websites), as converted to CSV for the Machine Learning for Pentesting course (hosted [at this GitHub link](https://raw.githubusercontent.com/PacktPublishing/Mastering-Machine-Learning-for-Penetration-Testing/master/Chapter02/dataset.csv)). Note that the CSV has 31 columns: each row is a vector of 30 feature attributes followed by one result label.

### Part 1: Logistic Regression
We will start by using logistic regression to train a predictive model.

In [1]:
# Import NumPy, the scikit-learn pieces used below, and Matplotlib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

In [2]:
# Load the data from a local CSV file as 32-bit integers
training_data = np.genfromtxt('dataset-phishing-detection.csv', delimiter=',', dtype=np.int32)

In [3]:
# Print the data and the number of columns
print(training_data)
print('Columns: ' + str(training_data.shape[1]))

[[-1  1  1 ...  1 -1 -1]
 [ 1  1  1 ...  1  1 -1]
 [ 1  0  1 ...  0 -1 -1]
 ...
 [ 1 -1  1 ...  0  1 -1]
 [-1 -1  1 ...  1  1 -1]
 [-1 -1  1 ...  1 -1 -1]]
Columns: 31


#### Commentary on the data
The output above shows that the data is a two-dimensional array with 31 columns, where every value is -1, 0 or 1. The first 30 columns are features and the last is the result, so we have labelled data. The features are described in [the document accompanying the dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/00327/Phishing%20Websites%20Features.docx). They include whether a website uses an IP address in the address bar or a URL shortener (e.g. tinyurl), features found in the HTML and JavaScript of malicious pages (e.g. popups, iframes, redirects) and other indicators of phishing websites. Each indicator has either a binary value (-1 or 1) representing legitimate or phishing, or a ternary value (-1, 0 or 1) representing legitimate, suspicious or phishing. The final (31st) column is the label: a legitimate site is marked -1 whilst a phishing site is marked 1.
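The claim that each column is binary or ternary can be checked directly rather than read off the printed array. The following is a minimal sketch: `summarise_columns` is a hypothetical helper, and the small `sample` array is made up to stand in for `training_data`.

```python
import numpy as np

def summarise_columns(data):
    """Return the unique values of each feature column and the count of each label."""
    features, labels = data[:, :-1], data[:, -1]
    # Binary columns yield two unique values, ternary columns yield three
    values_per_feature = [np.unique(features[:, i]).tolist()
                          for i in range(features.shape[1])]
    # Count how many rows carry each label value (a quick class-balance check)
    label_values, counts = np.unique(labels, return_counts=True)
    label_counts = dict(zip(label_values.tolist(), counts.tolist()))
    return values_per_feature, label_counts

# Made-up stand-in for training_data: two feature columns plus a label column
sample = np.array([[-1,  0,  1],
                   [ 1,  1, -1],
                   [ 1, -1, -1],
                   [-1,  1,  1]], dtype=np.int32)

values_per_feature, label_counts = summarise_columns(sample)
print(values_per_feature)
print(label_counts)
```

Calling `summarise_columns(training_data)` on the real array would show which of the 30 features are binary and which are ternary, and whether the two classes are roughly balanced.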

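Because every feature takes a value between -1 and 1, the features are already on a common scale, and logistic regression should work well if the two classes can be separated by a weighted sum of the features passed through a logistic curve. However, with 30 features that may interact with each other, a decision tree might be more accurate.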
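To make the logistic regression idea concrete: the model passes a weighted sum of the features through the logistic (sigmoid) function to get a probability, then thresholds it. A minimal sketch follows; the weights, bias and three-feature input are invented for illustration and are not values learned from the dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def probability_of_positive(x, weights, bias):
    """Probability of the positive class for one feature vector x."""
    return sigmoid(np.dot(x, weights) + bias)

# Hypothetical parameters for a toy model with three features
weights = np.array([0.8, -0.5, 1.2])
bias = -0.1

# One made-up input using the same -1/0/1 encoding as the dataset
x = np.array([1, -1, 1])

p = probability_of_positive(x, weights, bias)
# Threshold at 0.5 to turn the probability into a -1 / 1 prediction
label = 1 if p >= 0.5 else -1
print(p, label)
```

Training amounts to finding the weights and bias that make these probabilities agree with the labels as closely as possible.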

We might also want to apply dimensionality reduction techniques.
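One common dimensionality reduction technique is principal component analysis (PCA). The sketch below uses scikit-learn's `PCA` on randomly generated stand-in data with the same shape of values as the real features. The choice of 10 components is arbitrary, and whether PCA actually helps on this dataset is untested here.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the 30 feature columns (values -1, 0 or 1)
rng = np.random.default_rng(0)
fake_inputs = rng.integers(-1, 2, size=(200, 30))

# Project the 30 features onto 10 principal components (10 is an arbitrary choice)
pca = PCA(n_components=10)
reduced_inputs = pca.fit_transform(fake_inputs)

print(reduced_inputs.shape)
# Fraction of the total variance retained by the kept components
retained = pca.explained_variance_ratio_.sum()
print(retained)
```

On the real data, the retained variance gives a rough guide to how many components are worth keeping before training a classifier on `reduced_inputs`.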

In [4]:
# Separate the inputs (attributes) and outputs (results - the last column)
inputs = training_data[:,:-1]
outputs = training_data[:,-1]

In [5]:
# Divide the data: the first 2,000 rows are used for training, the remaining rows for testing
training_inputs = inputs[:2000]
training_outputs = outputs[:2000]
testing_inputs = inputs[2000:]
testing_outputs = outputs[2000:]
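Slicing by position relies on the rows being in no particular order; if the file happened to be sorted, the first 2,000 rows might not be representative. An alternative, sketched here on made-up stand-in data, is scikit-learn's `train_test_split`, which shuffles the rows and can keep the class proportions equal in both parts. The 80/20 split and the `random_state` value are assumptions for illustration, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-in for inputs and outputs: 100 rows, 30 features, balanced labels
rng = np.random.default_rng(0)
fake_inputs = rng.integers(-1, 2, size=(100, 30))
fake_outputs = np.array([-1, 1] * 50)

# Shuffle, hold out 20% for testing, and preserve the label proportions
X_train, X_test, y_train, y_test = train_test_split(
    fake_inputs,
    fake_outputs,
    test_size=0.2,
    shuffle=True,
    stratify=fake_outputs,
    random_state=42,  # fixed seed so the split is reproducible
)

print(X_train.shape, X_test.shape)
```

Using a different split would change the accuracy figures reported in this notebook, so the numbers are only comparable between models trained and tested on the same split.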

In [6]:
# Create the Logistic Regression classifier
classifier = LogisticRegression()

In [7]:
# Train the classifier
classifier.fit(training_inputs, training_outputs)

LogisticRegression()

In [8]:
# Make predictions
predictions = classifier.predict(testing_inputs)

In [9]:
# Print the accuracy of the model
accuracy = 100.0 * accuracy_score(testing_outputs, predictions)
print("The accuracy of the Logistic Regression model on test data is: " + str(accuracy))

The accuracy of the Logistic Regression model on test data is: 84.51684152401988
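Accuracy alone does not show which kind of mistake the model makes: a missed phishing site and a blocked legitimate site both count as one error. A confusion matrix with precision and recall separates the two. The sketch below uses small made-up label arrays in place of `testing_outputs` and `predictions`, and treats label 1 as the positive class, following the label description in the commentary above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true labels and predictions standing in for the real test results
y_true = np.array([1,  1, 1, -1, -1, -1, -1, 1])
y_pred = np.array([1, -1, 1, -1, -1,  1, -1, 1])

# Rows are true labels, columns are predicted labels, both ordered [-1, 1]
cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])

# Precision: of the rows predicted as 1, how many really are 1
precision = precision_score(y_true, y_pred, pos_label=1)
# Recall: of the rows that really are 1, how many were predicted as 1
recall = recall_score(y_true, y_pred, pos_label=1)

print(cm)
print(precision, recall)
```

The same three calls can be applied to `testing_outputs` and `predictions` from the cells above.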


### Part 2: Decision Trees
An accuracy of 84.5% isn't bad, but a decision tree might do better than logistic regression. We will use scikit-learn's `DecisionTreeClassifier`.

In [10]:
# Import the library
from sklearn import tree

In [11]:
# Create a tree classifier
tree_classifier = tree.DecisionTreeClassifier()

In [12]:
# Train the model
tree_classifier.fit(training_inputs, training_outputs)

DecisionTreeClassifier()

In [13]:
# Compute some predictions
tree_predictions = tree_classifier.predict(testing_inputs)

In [14]:
# Calculate accuracy and print
accuracy = 100 * accuracy_score(testing_outputs, tree_predictions)
print("The accuracy of the Decision Tree model on testing data is " + str(accuracy))

The accuracy of the Decision Tree model on testing data is 90.70127001656545
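A single train/test split gives one accuracy figure per model, which can vary with the choice of split. A more robust comparison, sketched here on made-up data with an invented labelling rule, is k-fold cross-validation with scikit-learn's `cross_val_score`. The 5 folds, `max_iter=1000` and `random_state=0` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in data: 300 rows, 30 features with values -1, 0 or 1
rng = np.random.default_rng(0)
fake_inputs = rng.integers(-1, 2, size=(300, 30))
# Invented labelling rule so that the models have something to learn
fake_outputs = np.where(fake_inputs[:, 0] + fake_inputs[:, 1] > 0, 1, -1)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # Train and evaluate on 5 different train/test partitions of the data
    fold_scores = cross_val_score(model, fake_inputs, fake_outputs, cv=5)
    scores[name] = fold_scores.mean()
    print(f"{name}: mean accuracy {100 * scores[name]:.1f}% over 5 folds")
```

Running the same loop on `inputs` and `outputs` would show whether the gap between the two models holds up across different splits.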


#### Discussion and next steps
The next steps in the tutorial involve training an NLP model. After that, I plan to plot the decision boundaries of the models and examine them to decide which model to try next.

Once a model has been suitably trained, it could be used for phishing detection. It would need to be fed data in the same format as the training input: the same 30 features with the same encoding. One could scan emails for links and extract the relevant data points about each link. This would require considerable processing power, so it is a task best performed by an email server or similar.

The idea would be to use the model's predictions as part of our usual virus and spam scanning: if we find any suspicious links, we can quarantine the email and flag it for review. The result of each review can also provide more labelled data for re-training the model.
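As a rough illustration of extracting data points from a link, the sketch below checks two of the indicators mentioned earlier: an IP address in place of a domain name, and a URL shortener. The function names and the short `SHORTENER_DOMAINS` list are hypothetical. The functions return booleans; converting them to the dataset's -1/0/1 encoding, and computing the remaining features, is left out.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical, deliberately short list of URL-shortener domains
SHORTENER_DOMAINS = {"tinyurl.com", "bit.ly", "t.co"}

def host_is_ip_address(url):
    """Return True if the host part of the URL is a literal IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        # Not parseable as IPv4 or IPv6, so treat it as a domain name
        return False

def uses_url_shortener(url):
    """Return True if the host is one of the listed shortener domains."""
    host = urlparse(url).hostname or ""
    return host in SHORTENER_DOMAINS

def extract_link_indicators(url):
    """Collect the two indicators for one link into a dictionary."""
    return {
        "ip_address_host": host_is_ip_address(url),
        "url_shortener": uses_url_shortener(url),
    }

print(extract_link_indicators("http://192.168.0.1/login"))
print(extract_link_indicators("https://tinyurl.com/abc123"))
print(extract_link_indicators("https://example.com/"))
```

A real pipeline would need all 30 features, in the same column order and encoding as the training data, before calling the classifier's `predict` method.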