# Phishing Websites


## About the Dataset
This dataset consists of 11055 training samples and 2456 testing samples. Each sample has 30 feature attributes plus the class label Result, listed below with their possible values:
- having_IP_Address  { -1,1 }
- URL_Length   { 1,0,-1 } 
- Shortining_Service { 1,-1 } 
- having_At_Symbol   { 1,-1 } 
- double_slash_redirecting { -1,1 } 
- Prefix_Suffix  { -1,1 } 
- having_Sub_Domain  { -1,0,1 } 
- SSLfinal_State  { -1,1,0 } 
- Domain_registeration_length { -1,1 } 
- Favicon { 1,-1 } 
- port { 1,-1 } 
- HTTPS_token { -1,1 } 
- Request_URL  { 1,-1 } 
- URL_of_Anchor { -1,0,1 } 
- Links_in_tags { 1,-1,0 } 
- SFH  { -1,1,0 } 
- Submitting_to_email { -1,1 } 
- Abnormal_URL { -1,1 }
- Redirect  { 0,1 } 
- on_mouseover  { 1,-1 }
- RightClick  { 1,-1 } 
- popUpWidnow  { 1,-1 } 
- Iframe { 1,-1 } 
- age_of_domain  { -1,1 } 
- DNSRecord   { -1,1 } 
- web_traffic  { -1,0,1 } 
- Page_Rank { -1,1 } 
- Google_Index { 1,-1 } 
- Links_pointing_to_page { 1,0,-1 } 
- Statistical_report { -1,1 } 
- Result  { -1,1 } 

The values 1, 0 and -1 indicate that the website is legitimate, suspicious or phishing, respectively, with respect to that attribute.

###### The following segment of code converts each .arff file in the current directory into a .csv copy of the dataset.

In [1]:
import os

# Getting all the arff files from the current directory
files = [arff for arff in os.listdir('.') if arff.endswith(".arff")]

# Function for converting arff list to csv list
def toCsv(content):
    data = False
    header = ""
    newContent = []
    for line in content:
        if not data:
            if "@attribute" in line:
                attri = line.split()
                columnName = attri[attri.index("@attribute")+1]
                header = header + columnName + ","
            elif "@data" in line:
                data = True
                header = header[:-1]
                header += '\n'
                newContent.append(header)
        else:
            newContent.append(line)
    return newContent

# Main loop for reading and writing files
for file in files:
    with open(file , "r") as inFile:
        content = inFile.readlines()
        name,ext = os.path.splitext(inFile.name)
        new = toCsv(content)
        with open(name+".csv", "w") as outFile:
            outFile.writelines(new)
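
As a quick sanity check of the conversion logic, the sketch below applies the same header-building idea to a small in-memory ARFF snippet. The helper name `arff_lines_to_csv` and the sample lines are illustrative assumptions, not part of the original notebook. It also assumes attribute names contain no spaces and, unlike `toCsv`, matches the `@attribute` and `@data` keywords case-insensitively.

```python
def arff_lines_to_csv(lines):
    """Turn ARFF lines into CSV lines: one header row, then the data rows."""
    columns = []
    rows = []
    in_data = False
    for line in lines:
        stripped = line.strip()
        if not in_data:
            lowered = stripped.lower()
            if lowered.startswith("@attribute"):
                # "@attribute <name> <type>": the name is the second token
                columns.append(stripped.split()[1])
            elif lowered.startswith("@data"):
                in_data = True
        elif stripped and not stripped.startswith("%"):
            # skip blank lines and ARFF comments inside the data section
            rows.append(stripped)
    return [",".join(columns)] + rows


# made-up sample with three of the attributes listed above
sample = [
    "@relation phishing",
    "@attribute having_IP_Address { -1,1 }",
    "@attribute URL_Length { 1,0,-1 }",
    "@attribute Result { -1,1 }",
    "@data",
    "-1,1,-1",
    "1,0,1",
]
csv_lines = arff_lines_to_csv(sample)
print("\n".join(csv_lines))
```

The first printed line is the CSV header built from the attribute names; the remaining lines are the data rows copied through unchanged.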

In [2]:
from math import *
import numpy as np
from scipy.io import arff
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style = "whitegrid", color_codes = True)


data_train = pd.read_csv('Training Dataset.csv', sep = ',')
data_test = pd.read_csv('old.csv', sep = ',')

feature_index_names = {0:'having_IP_Address', 1:'URL_Length', 2:'Shortining_Service', 3:'having_At_Symbol', 
                        4:'double_slash_redirecting', 5:'Prefix_Suffix', 6:'having_Sub_Domain', 7:'SSLfinal_State',
                        8:'Domain_registeration_length', 9:'Favicon', 10:'port', 11:'HTTPS_token',12:'Request_URL', 
                        13:'URL_of_Anchor', 14:'Links_in_tags', 15:'SFH', 16:'Submitting_to_email', 17:'Abnormal_URL', 
                        18:'Redirect', 19:'on_mouseover', 20:'RightClick', 21:'popUpWidnow', 22:'Iframe',
                        23:'age_of_domain', 24:'DNSRecord', 25:'web_traffic', 26:'Page_Rank', 27:'Google_Index', 
                        28:'Links_pointing_to_page', 29:'Statistical_report'} 
#training dataset
training_data = np.array(data_train)
x_training = training_data[:, :-1]
y_training = training_data[:, -1]

print('training dataset : ')
print('{:_<24s} = {:d}'.format('number of samples', y_training.shape[0]))
print('{:_<24s} = {:d}'.format('number of negative ones', np.sum(y_training == -1)))
print('{:_<24s} = {:d}'.format('number of ones', np.sum(y_training == 1)))

#testing dataset
testing_data = np.array(data_test)
x_testing = testing_data[:, :-1]
y_testing = testing_data[:, -1]

print('testing dataset : ')
print('{:_<24s} = {:d}'.format('number of samples', y_testing.shape[0]))
print('{:_<24s} = {:d}'.format('number of negative ones', np.sum(y_testing == -1)))
print('{:_<24s} = {:d}'.format('number of ones', np.sum(y_testing == 1)))

training dataset : 
number of samples_______ = 11055
number of negative ones_ = 4898
number of ones__________ = 6157
testing dataset : 
number of samples_______ = 2456
number of negative ones_ = 1362
number of ones__________ = 1094


### Logistic Regression 

We can use logistic regression to detect phishing websites. To do so, we first need to define our hypothesis (or model):
$$
\begin{array}{rcl}
h_{\boldsymbol{\Theta}}(\mathbf{x}) & = & \frac{1}{1 + e^{-(\theta_{0} x_{0} + \theta_{1} x_{1} + \cdots + \theta_{d} x_{d})}} \\
& = & \frac{1}{1 + e^{-\mathbf{x} \boldsymbol{\Theta}^{T}}},
\end{array}
$$
where $\mathbf{x} = [x_{0}, x_{1}, \ldots, x_{d}]$, $\boldsymbol{\Theta}=[\theta_{0}, \theta_{1}, \ldots, \theta_{d}]$ and $x_{0} = 1$. 

The following function implements this model; it assumes that all vectors are row vectors.

In [3]:
def h(x, theta):
    s = np.dot(x, theta.T)
    u = 1.0 / (1.0 + np.exp(-s))
    return u
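
To make the row-vector convention concrete, here is a small sketch that evaluates the model on two made-up augmented samples. The input values and the parameter vector are assumptions chosen for illustration only; `h` is restated so that the block runs on its own.

```python
import numpy as np

def h(x, theta):
    # same model as above: sigmoid of the inner product of x and theta
    return 1.0 / (1.0 + np.exp(-np.dot(x, theta.T)))

# two augmented samples (x_0 = 1) with d = 2 features; values are made up
x = np.array([[1.0, 0.0, 0.0],
              [1.0, 2.0, -1.0]])
theta = np.array([[0.0, 1.0, 1.0]])   # shape (1, d + 1), a row vector

p = h(x, theta)
print(p.shape)   # (2, 1): one value per sample
print(p[0, 0])   # 0.5, because the inner product for the first sample is 0
```

Passing an `(N, d + 1)` matrix together with a `(1, d + 1)` parameter row returns an `(N, 1)` column, which is why later cells index the result with `[:, 0]`.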

Now we need to define a loss function to learn the parameters of the model from the training dataset. We can use the mean squared error as our loss function, i.e.,

$$
J\left(\boldsymbol{\Theta}\right) = \frac{1}{2 N} \sum\limits_{n=1}^{N} \left(h_{ \boldsymbol{\Theta}}\left(\mathbf{x}^{\left(n\right)}\right) - y^{\left(n\right)} \right)^2 = \frac{1}{N} \sum\limits_{n=1}^{N}Cost\left(h_{ \boldsymbol{\Theta}}\left(\mathbf{x}^{\left(n\right)}\right) , y^{\left(n\right)}\right).
$$

The following function implements the loss function $J\left(\boldsymbol{\Theta}\right)$.

In [4]:
def J(x, y, theta):
    N = y.shape[0]
    mse = 1.0 / (2*N) * np.sum((h(x, theta) - y)**2)
    return mse

Now we need to implement a gradient descent solver to learn the parameters $\boldsymbol{\Theta}$ of our model. The following function implements batch gradient descent on the training dataset.

In [5]:
def bgd(x, y, theta, alpha=0.1, epsilon=0.001, max_iter=1000):
    y = y[:, np.newaxis]
    N = y.shape[0]
    t = 0
    while True:
        # print the value of loss function for each iteration
        print('iteration #{:>8d}, loss = {:>8f}'.format(t, J(x,y,theta)))
        # keep a copy of the parameter vector before the update for checking the convergence criterion
        theta_previous = theta.copy()
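        # update the parameter vector; e * x averaged over the samples is the
        # cross-entropy gradient (the exact MSE gradient has an extra h * (1 - h) factor)
        e = (h(x, theta) - y)
        theta  = theta - alpha * 1.0 / N * np.sum( e * x, axis=0)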
        t = t + 1
        # check the convergence criterion
        if (np.max(np.abs(theta-theta_previous)) < epsilon) or (t>max_iter):
            break
    return theta
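
The update `theta - alpha / N * sum(e * x)` is the gradient step for the cross-entropy loss with labels in {0, 1}, so it can be informative to monitor that loss as well. The sketch below is an illustration under stated assumptions, not part of the original notebook: `to_binary` and `cross_entropy` are hypothetical helpers, the toy data is made up, and mapping the label -1 to 0 is a choice made here so that the sigmoid output can reach both targets.

```python
import numpy as np

def h(x, theta):
    return 1.0 / (1.0 + np.exp(-np.dot(x, theta.T)))

def to_binary(y):
    # hypothetical helper: map labels {-1, 1} to {0, 1}
    return (y == 1).astype(float)

def cross_entropy(x, y01, theta, eps=1e-12):
    # mean negative log-likelihood for labels in {0, 1}; eps avoids log(0)
    p = np.clip(h(x, theta)[:, 0], eps, 1.0 - eps)
    return -np.mean(y01 * np.log(p) + (1.0 - y01) * np.log(1.0 - p))

# toy data, made up for illustration: bias column plus one feature
x = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([-1, -1, 1, 1])
y01 = to_binary(y)

theta = np.zeros((1, 2))
loss_before = cross_entropy(x, y01, theta)
for _ in range(100):
    # same update rule as in bgd, with alpha = 0.1
    e = h(x, theta) - y01[:, np.newaxis]
    theta = theta - 0.1 / y01.shape[0] * np.sum(e * x, axis=0)
loss_after = cross_entropy(x, y01, theta)
print(loss_before, loss_after)
```

With the parameters initialised to zero, every prediction is 0.5 and the starting loss is log 2; on this separable toy set the loss then decreases under the update.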

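Now we standardize the features, using the mean and standard deviation of the training set for both the training and testing data.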

In [6]:
m = np.mean(x_training, axis=0)
s = np.std(x_training, axis=0)
x_training = (x_training - m) / s
x_testing = (x_testing - m) / s
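
If a feature happens to be constant in the training set, its standard deviation is zero and the division above produces non-finite values. A minimal sketch of one way to guard against this follows; the helper name `standardize` and the choice to leave constant columns unscaled are assumptions made for illustration.

```python
import numpy as np

def standardize(train, test):
    # statistics come from the training set only
    m = np.mean(train, axis=0)
    s = np.std(train, axis=0)
    # assumption: leave constant columns unscaled instead of dividing by zero
    s_safe = np.where(s == 0, 1.0, s)
    return (train - m) / s_safe, (test - m) / s_safe

# made-up data: the second column is constant
train = np.array([[1.0, 5.0], [-1.0, 5.0], [1.0, 5.0], [-1.0, 5.0]])
test = np.array([[1.0, 5.0]])
train_std, test_std = standardize(train, test)
print(train_std.mean(axis=0), train_std.std(axis=0))
print(test_std)
```

The testing data is transformed with the training statistics, as in the cell above, so that both sets share the same feature scale.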

Augment the training and testing vectors with a leading 1 (the $x_{0} = 1$ term) to take advantage of fast matrix operations in NumPy.

In [7]:
x_training_aug = np.hstack((np.ones((x_training.shape[0],1)), x_training))
x_testing_aug = np.hstack((np.ones((x_testing.shape[0],1)), x_testing))

Create a random parameter vector and learn the parameters.

In [8]:
theta = np.random.randn(1, x_training_aug.shape[1])
theta = bgd(x_training_aug, y_training, theta)

iteration #       0, loss = 0.601014
iteration #       1, loss = 0.584591
iteration #       2, loss = 0.568730
iteration #       3, loss = 0.553518
iteration #       4, loss = 0.539009
iteration #       5, loss = 0.525235
iteration #       6, loss = 0.512214
iteration #       7, loss = 0.499949
iteration #       8, loss = 0.488437
iteration #       9, loss = 0.477664
iteration #      10, loss = 0.467609
iteration #      11, loss = 0.458244
iteration #      12, loss = 0.449539
iteration #      13, loss = 0.441458
iteration #      14, loss = 0.433963
iteration #      15, loss = 0.427013
iteration #      16, loss = 0.420568
iteration #      17, loss = 0.414589
iteration #      18, loss = 0.409035
iteration #      19, loss = 0.403868
iteration #      20, loss = 0.399049
iteration #      21, loss = 0.394544
iteration #      22, loss = 0.390323
iteration #      23, loss = 0.386361
iteration #      24, loss = 0.382634
iteration #      25, loss = 0.379125
iteration #      26, loss = 0.375817
...

iteration #     323, loss = 0.301389
iteration #     324, loss = 0.301396
iteration #     325, loss = 0.301402
iteration #     326, loss = 0.301409
iteration #     327, loss = 0.301416
iteration #     328, loss = 0.301422
iteration #     329, loss = 0.301429
iteration #     330, loss = 0.301435
iteration #     331, loss = 0.301442
iteration #     332, loss = 0.301448
iteration #     333, loss = 0.301455
iteration #     334, loss = 0.301461
iteration #     335, loss = 0.301468
iteration #     336, loss = 0.301474
iteration #     337, loss = 0.301481
iteration #     338, loss = 0.301487
iteration #     339, loss = 0.301493
iteration #     340, loss = 0.301500
iteration #     341, loss = 0.301506
iteration #     342, loss = 0.301512
iteration #     343, loss = 0.301519
iteration #     344, loss = 0.301525
iteration #     345, loss = 0.301531
iteration #     346, loss = 0.301537
iteration #     347, loss = 0.301544
iteration #     348, loss = 0.301550
iteration #     349, loss = 0.301556
...

iteration #     657, loss = 0.303363
iteration #     658, loss = 0.303366
iteration #     659, loss = 0.303370
iteration #     660, loss = 0.303374
iteration #     661, loss = 0.303377
iteration #     662, loss = 0.303381
iteration #     663, loss = 0.303385
iteration #     664, loss = 0.303388
iteration #     665, loss = 0.303392
iteration #     666, loss = 0.303396
iteration #     667, loss = 0.303399
iteration #     668, loss = 0.303403
iteration #     669, loss = 0.303406
iteration #     670, loss = 0.303410
iteration #     671, loss = 0.303414
iteration #     672, loss = 0.303417
iteration #     673, loss = 0.303421
iteration #     674, loss = 0.303424
iteration #     675, loss = 0.303428
iteration #     676, loss = 0.303431
iteration #     677, loss = 0.303435
iteration #     678, loss = 0.303438
iteration #     679, loss = 0.303442
iteration #     680, loss = 0.303445
iteration #     681, loss = 0.303449
iteration #     682, loss = 0.303452
iteration #     683, loss = 0.303456
...

iteration #     980, loss = 0.304170
iteration #     981, loss = 0.304172
iteration #     982, loss = 0.304173
iteration #     983, loss = 0.304175
iteration #     984, loss = 0.304177
iteration #     985, loss = 0.304178
iteration #     986, loss = 0.304180
iteration #     987, loss = 0.304181
iteration #     988, loss = 0.304183
iteration #     989, loss = 0.304184
iteration #     990, loss = 0.304186
iteration #     991, loss = 0.304188
iteration #     992, loss = 0.304189
iteration #     993, loss = 0.304191
iteration #     994, loss = 0.304192
iteration #     995, loss = 0.304194
iteration #     996, loss = 0.304195
iteration #     997, loss = 0.304197
iteration #     998, loss = 0.304198
iteration #     999, loss = 0.304200
iteration #    1000, loss = 0.304202


After the parameters are learned, we can test the performance of the classifier. Note that logistic regression returns a number between 0 and 1, so we need to binarize it, here with a threshold of 0.5.

In [9]:
y_prediction = (h(x_testing_aug, theta)>=0.5)[:,0]
test_error = np.sum(y_testing != y_prediction) / y_testing.shape[0]
print('logistic regression test error = {:.4f}'.format(test_error))

logistic regression test error = 0.9914
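
Because the labels in this dataset are -1 and 1, while `h(...) >= 0.5` yields booleans, a boolean prediction can never equal the label -1, so every negative-class sample is counted as an error in the comparison above. A minimal sketch of one way to align the two encodings is shown below. The toy arrays are made up and the helper name `to_labels` is an assumption, so the numbers it prints say nothing about the real test set.

```python
import numpy as np

def to_labels(prob, threshold=0.5):
    # hypothetical helper: map probabilities to the dataset's {-1, 1} labels
    return np.where(prob >= threshold, 1, -1)

prob = np.array([0.9, 0.2, 0.6, 0.4])   # made-up model outputs
y_true = np.array([1, -1, -1, -1])

naive_error = np.mean(y_true != (prob >= 0.5))      # booleans vs. {-1, 1}
aligned_error = np.mean(y_true != to_labels(prob))  # both in {-1, 1}
print(naive_error, aligned_error)
```

On this toy input only one of the four predictions is actually wrong, yet the boolean comparison reports three errors, because the two correctly rejected negative samples are compared as `False` against -1.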
