# Phishing Link Analysis
Hello, and welcome to my phishing link data project. In this project I take a dataset of phishing link records and attempt to train a model that predicts whether a URL is malicious, based on a set of yes, no, or maybe attributes (stored in the data as the integers -1, 0, and 1). The data was gathered from https://archive.ics.uci.edu/ml/machine-learning-databases/00327

In [1]:
import pandas as pd

data = pd.read_json('http://127.0.0.1:5000/item/jupyter')
data.head()

Unnamed: 0,Abnormal_URL,DNSRecord,Domain_registeration_length,Favicon,Google_Index,HTTPS_token,Iframe,Links_in_tags,Links_pointing_to_page,Page_Rank,...,age_of_domain,double_slash_redirecting,having_At_Symbol,having_IP_Address,having_Sub_Domain,id,on_mouseover,popUpWidnow,port,web_traffic
0,-1,-1,-1,1,1,-1,1,1,1,-1,...,-1,-1,1,-1,-1,1,1,1,1,-1
1,1,-1,-1,1,1,-1,1,-1,1,-1,...,-1,1,1,1,0,2,1,1,1,0
2,-1,-1,-1,1,1,-1,1,-1,0,-1,...,1,1,1,1,-1,3,1,1,1,1
3,1,-1,1,1,1,-1,1,0,-1,-1,...,-1,1,1,1,-1,4,1,1,1,1
4,1,-1,-1,1,1,1,1,0,1,-1,...,-1,1,1,1,1,5,-1,-1,1,0


# Preprocessing
I am going to run a few small functions on the loaded dataframe to see the data types, some summary statistics, and the distributions of a few columns.

In [10]:
data.describe()

Unnamed: 0,id,having_IP_Address,URL_Length,Shortining_Service,having_At_Symbol,double_slash_redirecting,Prefix_Suffix,having_Sub_Domain,SSLfinal_State,Domain_registeration_length,...,popUpWidnow,Iframe,age_of_domain,DNSRecord,web_traffic,Page_Rank,Google_Index,Links_pointing_to_page,Statistical_report,Result
count,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0,...,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0
mean,5528.0,0.313795,-0.633198,0.738761,0.700588,0.741474,-0.734962,0.063953,0.250927,-0.336771,...,0.613388,0.816915,0.061239,0.377114,0.287291,-0.483673,0.721574,0.344007,0.719584,0.113885
std,3191.447947,0.949534,0.766095,0.673998,0.713598,0.671011,0.678139,0.817518,0.911892,0.941629,...,0.789818,0.576784,0.998168,0.926209,0.827733,0.875289,0.692369,0.569944,0.694437,0.993539
min,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
25%,2764.5,-1.0,-1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,...,1.0,1.0,-1.0,-1.0,0.0,-1.0,1.0,0.0,1.0,-1.0
50%,5528.0,1.0,-1.0,1.0,1.0,1.0,-1.0,0.0,1.0,-1.0,...,1.0,1.0,1.0,1.0,1.0,-1.0,1.0,0.0,1.0,1.0
75%,8291.5,1.0,-1.0,1.0,1.0,1.0,-1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
max,11055.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [11]:
data.dtypes

id                             int64
having_IP_Address              int64
URL_Length                     int64
Shortining_Service             int64
having_At_Symbol               int64
double_slash_redirecting       int64
Prefix_Suffix                  int64
having_Sub_Domain              int64
SSLfinal_State                 int64
Domain_registeration_length    int64
Favicon                        int64
port                           int64
HTTPS_token                    int64
Request_URL                    int64
URL_of_Anchor                  int64
Links_in_tags                  int64
SFH                            int64
Submitting_to_email            int64
Abnormal_URL                   int64
Redirect                       int64
on_mouseover                   int64
RightClick                     int64
popUpWidnow                    int64
Iframe                         int64
age_of_domain                  int64
DNSRecord                      int64
web_traffic                    int64
P

In [12]:
data['having_IP_Address'].value_counts()

 1    7262
-1    3793
Name: having_IP_Address, dtype: int64

In [13]:
data['age_of_domain'].value_counts()

 1    5866
-1    5189
Name: age_of_domain, dtype: int64

In [14]:
data['Domain_registeration_length'].value_counts()

-1    7389
 1    3666
Name: Domain_registeration_length, dtype: int64

In [15]:
data['web_traffic'].value_counts()

 1    5831
-1    2655
 0    2569
Name: web_traffic, dtype: int64
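
The per-column checks above can also be gathered into one table. This is a minimal sketch, assuming a small made-up frame (`sample`) in place of the real `data`; only the two column names come from the dataset, and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data` frame; the values are made up.
sample = pd.DataFrame({
    "having_IP_Address": [1, -1, 1, 1],
    "web_traffic": [1, 0, -1, 1],
})

# Count how often each of the three possible values appears in every column.
# reindex() makes sure -1, 0 and 1 all get a row, even when a value is absent.
counts = pd.DataFrame({
    column: sample[column].value_counts().reindex([-1, 0, 1], fill_value=0)
    for column in sample.columns
})
print(counts)
```

Each column of `counts` holds the same numbers that a separate `value_counts()` call would return, which makes it easier to spot which attributes never take the value 0.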

# Correlation
I want to see how each variable correlates with Result. The snippet below shows the direction (positive or negative) and the strength of the relationship between each variable and Result. Note: correlation does not imply causation.

In [3]:
import pandas as pd

data = pd.read_json('http://127.0.0.1:5000/item/jupyter')
# drop the id column because it doesn't add any value
data.drop('id',axis=1,inplace=True)
# correlation between every variable and Result
data.corr()['Result'].sort_values()

Domain_registeration_length   -0.225789
Shortining_Service            -0.067966
Abnormal_URL                  -0.060488
HTTPS_token                   -0.039854
double_slash_redirecting      -0.038608
Redirect                      -0.020113
Iframe                        -0.003394
Favicon                       -0.000280
popUpWidnow                    0.000086
RightClick                     0.012653
Submitting_to_email            0.018249
Links_pointing_to_page         0.032574
port                           0.036419
on_mouseover                   0.041838
having_At_Symbol               0.052948
URL_Length                     0.057430
DNSRecord                      0.075718
Statistical_report             0.079857
having_IP_Address              0.094160
Page_Rank                      0.104645
age_of_domain                  0.121496
Google_Index                   0.128950
SFH                            0.221419
Links_in_tags                  0.248229
Request_URL                    0.253372


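Dropping the variables that are only weakly correlated with Result simplifies the model and may help the accuracy of our regression model. We will drop every variable whose correlation with Result is smaller than 0.05 in absolute value. Note that this is a simple correlation cutoff rather than a formal test of statistical significance.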

In [6]:
import pandas as pd
data = pd.read_json('http://127.0.0.1:5000/item/jupyter')

# drop the variables whose correlation with Result is smaller than 0.05 in absolute value,
# plus the id column, because they add little information
data.drop(['HTTPS_token', 'double_slash_redirecting', 'Redirect', 'Iframe', 'Favicon',
           'popUpWidnow', 'RightClick', 'Submitting_to_email', 'port', 'on_mouseover',
           'Links_pointing_to_page', 'id'], axis=1, inplace=True)



# Another check for correlation after dropping insignificant variables
data.corr()['Result'].sort_values()


Domain_registeration_length   -0.225789
Shortining_Service            -0.067966
Abnormal_URL                  -0.060488
having_At_Symbol               0.052948
URL_Length                     0.057430
DNSRecord                      0.075718
Statistical_report             0.079857
having_IP_Address              0.094160
Page_Rank                      0.104645
age_of_domain                  0.121496
Google_Index                   0.128950
SFH                            0.221419
Links_in_tags                  0.248229
Request_URL                    0.253372
having_Sub_Domain              0.298323
web_traffic                    0.346103
Prefix_Suffix                  0.348606
URL_of_Anchor                  0.692935
SSLfinal_State                 0.714741
Result                         1.000000
Name: Result, dtype: float64
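
The list of columns to drop was written out by hand, but it can also be derived from the correlations themselves. This is a minimal sketch, assuming a toy frame (`toy`) with made-up values; the feature names `strong_feature`, `negative_feature`, and `noise_feature` are hypothetical, and only the `Result` column name and the 0.05 cutoff come from this notebook.

```python
import pandas as pd

# Toy frame with made-up values; only the column name Result mirrors the notebook.
toy = pd.DataFrame({
    "strong_feature":   [1, 1, 1, -1, -1, -1, -1, -1],
    "negative_feature": [-1, -1, -1, 1, 1, 1, 1, 1],
    "noise_feature":    [1, -1, 1, -1, 1, -1, 1, -1],
    "Result":           [1, 1, 1, 1, -1, -1, -1, -1],
})

threshold = 0.05  # same cutoff as used in the text

# Correlation of every feature with Result (Result itself is excluded).
corr_with_result = toy.corr()["Result"].drop("Result")

# Compare absolute values so that strong negative correlations are kept.
weak_columns = corr_with_result[corr_with_result.abs() < threshold].index.tolist()

reduced = toy.drop(columns=weak_columns)
print(weak_columns)
print(list(reduced.columns))
```

Using the absolute value matters here: a column such as Domain_registeration_length has a negative correlation with Result, yet it is one of the more informative ones and should not be dropped.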

# Using the data for Regression and Prediction
We now want to use this dataframe, which has been reduced to the more strongly correlated variables, as a training and testing set for logistic regression. The resulting model will hopefully help us predict whether future links are malicious.

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

data = pd.read_json('http://127.0.0.1:5000/item/jupyter')

# drop the variables whose correlation with Result is smaller than 0.05 in absolute value,
# plus the id column, because they add little information
data.drop(['HTTPS_token', 'double_slash_redirecting', 'Redirect', 'Iframe', 'Favicon',
           'popUpWidnow', 'RightClick', 'Submitting_to_email', 'port', 'on_mouseover',
           'Links_pointing_to_page', 'id'], axis=1, inplace=True)



# split the columns into attributes (X) and the label (y); the label is the one we want to predict
# ('web_traffic' is listed once; a repeated name would duplicate that column in X)
X = data[['Domain_registeration_length', 'Shortining_Service', 'Abnormal_URL', 'having_At_Symbol',
          'URL_Length', 'DNSRecord', 'Statistical_report', 'having_IP_Address', 'Page_Rank', 'age_of_domain', 'Google_Index',
          'SFH', 'Links_in_tags', 'Request_URL', 'having_Sub_Domain', 'web_traffic', 'Prefix_Suffix', 'URL_of_Anchor',
          'SSLfinal_State']].values
# Label variable
y = data['Result'].values


# Split the data into a training and a testing frame
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# we will now train the model using the library
regressor = LogisticRegression()  
regressor.fit(X_train, y_train)


# Run predictions
y_pred = regressor.predict(X_test)


# Show the first 25 predictions next to the actual values
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1 = df.head(25)
print(df1)
print('\n')


print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(regressor.score(X_test, y_test)))

    Actual  Predicted
0       -1         -1
1       -1         -1
2       -1          1
3       -1         -1
4        1          1
5        1          1
6        1          1
7       -1         -1
8       -1         -1
9        1          1
10       1         -1
11       1          1
12       1          1
13      -1         -1
14      -1         -1
15      -1          1
16       1          1
17       1          1
18      -1         -1
19      -1          1
20      -1         -1
21       1          1
22      -1         -1
23      -1         -1
24       1          1


Accuracy of logistic regression classifier on test set: 0.91
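
Accuracy alone does not show which kind of mistake the model makes. As a rough sketch, the 25 Actual and Predicted pairs printed above can be copied by hand into two lists and summarised with a confusion matrix. This covers only those 25 rows, not the full test set, so its accuracy need not match the overall score.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# The 25 Actual / Predicted pairs printed above, copied by hand.
actual = [-1, -1, -1, -1, 1, 1, 1, -1, -1, 1, 1, 1, 1,
          -1, -1, -1, 1, 1, -1, -1, -1, 1, -1, -1, 1]
predicted = [-1, -1, 1, -1, 1, 1, 1, -1, -1, 1, -1, 1, 1,
             -1, -1, 1, 1, 1, -1, 1, -1, 1, -1, -1, 1]

# Rows are actual classes, columns are predicted classes, both ordered [-1, 1].
cm = confusion_matrix(actual, predicted, labels=[-1, 1])
print(cm)

# Share of these 25 rows that were predicted correctly.
sample_accuracy = accuracy_score(actual, predicted)
print(sample_accuracy)
```

On these rows, three links labelled -1 were predicted as 1 and one link labelled 1 was predicted as -1. Running the same two calls on the full `y_test` and `y_pred` would show whether that imbalance holds for the whole test set.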




# Results
Now that we have trained our model on the training split and evaluated it on the held-out testing split, we can see that it reaches around 91% accuracy. A company or individual could now use this model to judge the credibility of any given link: extract the same attributes for a potentially malicious URL and run them through the model to predict whether the link is malicious.
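
Scoring a single new link could look roughly like the sketch below. It is an illustration under stated assumptions, not the model trained above: the training rows are made up, only two of the dataset's column names are used, and the step that turns a raw URL into attribute values is assumed to exist elsewhere.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up training rows using two column names from the dataset; the values
# are illustrative only and are not taken from the real data.
train = pd.DataFrame({
    "SSLfinal_State": [1, 1, 1, -1, -1, -1, 1, -1],
    "URL_of_Anchor":  [1, 0, 1, -1, 0, -1, 1, -1],
    "Result":         [1, 1, 1, -1, -1, -1, 1, -1],
})
feature_columns = ["SSLfinal_State", "URL_of_Anchor"]

model = LogisticRegression()
model.fit(train[feature_columns], train["Result"])

# A new link must first be turned into the same attributes, in the same order.
# These attribute values are hypothetical.
new_link = pd.DataFrame([{"SSLfinal_State": -1, "URL_of_Anchor": -1}])[feature_columns]

# Predicted class for the new link.
label = model.predict(new_link)[0]

# predict_proba returns one probability per class, ordered like model.classes_.
probabilities = dict(zip(model.classes_, model.predict_proba(new_link)[0]))
print(label, probabilities)
```

Looking at the class probabilities rather than only the predicted label lets the user set a stricter cutoff when a wrongly trusted link would be costly.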