## PhishingWebsites

- One of the challenges our research faced was the lack of reliable training datasets, a challenge shared by any researcher in the field. Although many articles on predicting phishing websites have been published, no reliable training dataset has been made publicly available. This may be because the literature does not agree on the definitive features that characterize phishing webpages, which makes it difficult to build a dataset that covers all possible features.
In this dataset, we shed light on the important features that have proved sound and effective in predicting phishing websites. In addition, we propose some new features.


In [15]:
import pandas as pd
import seaborn as sns
from sklearn.datasets import fetch_openml
import numpy as np

In [16]:
df = fetch_openml(data_id=4534,as_frame=True).frame

In [17]:
df.head()

Unnamed: 0,having_IP_Address,URL_Length,Shortining_Service,having_At_Symbol,double_slash_redirecting,Prefix_Suffix,having_Sub_Domain,SSLfinal_State,Domain_registeration_length,Favicon,...,popUpWidnow,Iframe,age_of_domain,DNSRecord,web_traffic,Page_Rank,Google_Index,Links_pointing_to_page,Statistical_report,Result
0,-1,1,1,1,-1,-1,-1,-1,-1,1,...,1,1,-1,-1,-1,-1,1,1,-1,-1
1,1,1,1,1,1,-1,0,1,-1,1,...,1,1,-1,-1,0,-1,1,1,1,-1
2,1,0,1,1,1,-1,-1,-1,-1,1,...,1,1,1,-1,1,-1,1,0,-1,-1
3,1,0,1,1,1,-1,-1,-1,1,1,...,1,1,-1,-1,1,-1,1,-1,1,-1
4,1,0,-1,1,1,-1,1,1,-1,1,...,-1,1,-1,-1,0,-1,1,1,1,1


In [18]:
df.columns

Index(['having_IP_Address', 'URL_Length', 'Shortining_Service',
       'having_At_Symbol', 'double_slash_redirecting', 'Prefix_Suffix',
       'having_Sub_Domain', 'SSLfinal_State', 'Domain_registeration_length',
       'Favicon', 'port', 'HTTPS_token', 'Request_URL', 'URL_of_Anchor',
       'Links_in_tags', 'SFH', 'Submitting_to_email', 'Abnormal_URL',
       'Redirect', 'on_mouseover', 'RightClick', 'popUpWidnow', 'Iframe',
       'age_of_domain', 'DNSRecord', 'web_traffic', 'Page_Rank',
       'Google_Index', 'Links_pointing_to_page', 'Statistical_report',
       'Result'],
      dtype='object')

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11055 entries, 0 to 11054
Data columns (total 31 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   having_IP_Address            11055 non-null  category
 1   URL_Length                   11055 non-null  category
 2   Shortining_Service           11055 non-null  category
 3   having_At_Symbol             11055 non-null  category
 4   double_slash_redirecting     11055 non-null  category
 5   Prefix_Suffix                11055 non-null  category
 6   having_Sub_Domain            11055 non-null  category
 7   SSLfinal_State               11055 non-null  category
 8   Domain_registeration_length  11055 non-null  category
 9   Favicon                      11055 non-null  category
 10  port                         11055 non-null  category
 11  HTTPS_token                  11055 non-null  category
 12  Request_URL                  11055 non-null  category
 13  U

In [20]:
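# Replace every categorical column with its integer codes.
# Note: codes are category positions (0, 1, 2), not the original -1/0/1 values.
cat_columns = df.select_dtypes(['category']).columns
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)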
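
The conversion in this cell replaces each value with its category position, so the encoded frame no longer contains the -1/0/1 values printed by `df.head()`. A minimal sketch on a toy column shows the mapping and one way to keep the original numbers instead. The toy series and its string categories are assumptions about how the data is loaded, not something taken from the dataset itself.

```python
import pandas as pd

# Toy stand-in for one feature column; string categories are an assumption.
s = pd.Series(["-1", "0", "1", "-1"], dtype="category")

# cat.codes returns the position of each value within s.cat.categories.
codes = s.cat.codes.tolist()
code_to_label = dict(enumerate(s.cat.categories))

# Alternative that keeps the original numeric values instead of positions.
original_values = s.astype(str).astype(int).tolist()

print(codes)
print(code_to_label)
print(original_values)
```

Either encoding works for a distance-based model as long as it is applied consistently, but the mapping matters when class labels are interpreted later.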

In [21]:
df.isna().sum()

having_IP_Address              0
URL_Length                     0
Shortining_Service             0
having_At_Symbol               0
double_slash_redirecting       0
Prefix_Suffix                  0
having_Sub_Domain              0
SSLfinal_State                 0
Domain_registeration_length    0
Favicon                        0
port                           0
HTTPS_token                    0
Request_URL                    0
URL_of_Anchor                  0
Links_in_tags                  0
SFH                            0
Submitting_to_email            0
Abnormal_URL                   0
Redirect                       0
on_mouseover                   0
RightClick                     0
popUpWidnow                    0
Iframe                         0
age_of_domain                  0
DNSRecord                      0
web_traffic                    0
Page_Rank                      0
Google_Index                   0
Links_pointing_to_page         0
Statistical_report             0
Result    

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11055 entries, 0 to 11054
Data columns (total 31 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   having_IP_Address            11055 non-null  int8 
 1   URL_Length                   11055 non-null  int8 
 2   Shortining_Service           11055 non-null  int8 
 3   having_At_Symbol             11055 non-null  int8 
 4   double_slash_redirecting     11055 non-null  int8 
 5   Prefix_Suffix                11055 non-null  int8 
 6   having_Sub_Domain            11055 non-null  int8 
 7   SSLfinal_State               11055 non-null  int8 
 8   Domain_registeration_length  11055 non-null  int8 
 9   Favicon                      11055 non-null  int8 
 10  port                         11055 non-null  int8 
 11  HTTPS_token                  11055 non-null  int8 
 12  Request_URL                  11055 non-null  int8 
 13  URL_of_Anchor                11055 non-null  i

In [23]:
df.shape

(11055, 31)

In [24]:
target = 'Result'
X = df.drop(target,axis=1)
y = df.loc[:,target]

In [25]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.095, random_state=578)
X_train.shape, X_test.shape,y_train.shape

((10004, 30), (1051, 30), (10004,))

In [26]:
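from sklearn.compose import make_column_selector
# Every column is int8 after the integer encoding, so num_col selects all
# columns and cat_col selects none.
num_col = make_column_selector(dtype_exclude=object)
cat_col = make_column_selector(dtype_include=object)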
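
To see what these two selectors return on an all-integer frame, here is a small sketch on a toy DataFrame. The frame `toy` and its column names are hypothetical; only the `int8` dtype mirrors the `df.info()` output in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame with the same dtype as the encoded features (int8).
toy = pd.DataFrame({
    "a": np.array([0, 1, 2], dtype="int8"),
    "b": np.array([1, 0, 1], dtype="int8"),
})

# Calling a selector on a DataFrame returns the list of matching column names.
num_cols = make_column_selector(dtype_exclude=object)(toy)
cat_cols = make_column_selector(dtype_include=object)(toy)

print(num_cols)
print(cat_cols)
```

With no object columns present, the categorical branch of a column transformer built from these selectors receives no columns.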

In [27]:
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder,OrdinalEncoder
from sklearn.pipeline import make_pipeline

In [28]:
imp_mean = SimpleImputer(strategy='mean')
imp_median = SimpleImputer(strategy='median')
imp_cat = SimpleImputer(strategy='most_frequent')
onehot = OneHotEncoder(handle_unknown='ignore')
ordencode = OrdinalEncoder(handle_unknown='use_encoded_value',unknown_value=-1)

In [29]:
col_transform_mean = make_column_transformer(
    (make_pipeline(imp_mean),num_col),
    (make_pipeline(imp_cat,onehot),cat_col),
    remainder='passthrough'
)

In [30]:
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier()
pipe_mean = make_pipeline(col_transform_mean,knn_model)

In [40]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
K_fold = KFold(n_splits=5,shuffle=True,random_state=985)
cross_val_score(pipe_mean,X_train,y_train,cv=K_fold).mean()

0.94932023988006
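
The mean hides how much the score varies between folds. The sketch below prints the per-fold accuracies and their standard deviation. It uses synthetic data from `make_classification` as a stand-in (an assumption, so the numbers will not match the phishing dataset).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the phishing dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=985)
scores = cross_val_score(KNeighborsClassifier(), X_demo, y_demo, cv=cv)

# One accuracy value per fold; the standard deviation shows how stable the mean is.
print(scores.round(3))
print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
```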

In [41]:
from sklearn.model_selection import GridSearchCV

In [42]:
pipe_mean.get_params()

{'columntransformer': ColumnTransformer(remainder='passthrough',
                   transformers=[('pipeline-1',
                                  Pipeline(steps=[('simpleimputer',
                                                   SimpleImputer())]),
                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7fa9cbebeb90>),
                                 ('pipeline-2',
                                  Pipeline(steps=[('simpleimputer',
                                                   SimpleImputer(strategy='most_frequent')),
                                                  ('onehotencoder',
                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7fa9cbebe910>)]),
 'columntransformer__n_jobs': None,
 'columntransformer__pipeline-1': Pipeline(steps=[('simpleimputer', SimpleImputer())]),
 'columntra

In [43]:
params = {
          'kneighborsclassifier__n_neighbors': range(2,15),
          'kneighborsclassifier__p': [1,2,3]
          }

In [44]:
grdcv = GridSearchCV(pipe_mean,param_grid=params,cv=K_fold,n_jobs=-1)

In [50]:
from sklearn.model_selection import RandomizedSearchCV
rmdcv = RandomizedSearchCV(pipe_mean,params,random_state=895,n_jobs=-1,cv=K_fold)
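
The two searches differ in how much of the grid they cover. As a quick check, `ParameterGrid` can count the candidates defined by `params`; the dictionary is repeated here so the sketch runs on its own.

```python
from sklearn.model_selection import ParameterGrid

params = {
    "kneighborsclassifier__n_neighbors": range(2, 15),  # 13 values
    "kneighborsclassifier__p": [1, 2, 3],               # 3 values
}

# Number of distinct parameter combinations in the full grid.
n_candidates = len(ParameterGrid(params))
print(n_candidates)
```

`GridSearchCV` would evaluate all of these combinations, whereas `RandomizedSearchCV` with its default `n_iter=10` samples only 10 of them, so the best parameters it reports are the best among the sampled candidates rather than the whole grid.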

In [51]:
rmdcv.fit(X_train,y_train)

RandomizedSearchCV(cv=KFold(n_splits=5, random_state=985, shuffle=True),
                   estimator=Pipeline(steps=[('columntransformer',
                                              ColumnTransformer(remainder='passthrough',
                                                                transformers=[('pipeline-1',
                                                                               Pipeline(steps=[('simpleimputer',
                                                                                                SimpleImputer())]),
                                                                               <sklearn.compose._column_transformer.make_column_selector object at 0x7fa9cbebeb90>),
                                                                              ('pipeline-2',
                                                                               Pipeline(steps=...
                                                                                             

In [52]:
rmdcv.best_score_

0.9554178410794603

In [53]:
rmdcv.best_params_

{'kneighborsclassifier__n_neighbors': 3, 'kneighborsclassifier__p': 3}
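
In `KNeighborsClassifier`, `p` is the power parameter of the default Minkowski metric: `p=1` is the Manhattan distance, `p=2` the Euclidean distance, and `p=3` a higher-order variant. The helper `minkowski` below is a hypothetical illustration written with NumPy, not the routine scikit-learn uses internally.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))

u, v = [0, 0], [3, 4]
d1 = minkowski(u, v, 1)  # Manhattan: |3| + |4|
d2 = minkowski(u, v, 2)  # Euclidean: sqrt(9 + 16)
d3 = minkowski(u, v, 3)  # cube root of 27 + 64

print(d1, d2, round(d3, 4))
```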

In [54]:
rmdcv.best_estimator_

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('pipeline-1',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7fa9cbdb9f90>),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  <sklearn.compose._column_trans

In [56]:
knn_model_final = KNeighborsClassifier(n_neighbors=3,p=3)

In [57]:
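knn_model_final.fit(X_train,y_train)
# Predictions are made on the training data, so the scores below measure
# training fit, not performance on the held-out X_test.
knn_pred = knn_model_final.predict(X_train)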

In [58]:
from sklearn.metrics import confusion_matrix,classification_report

In [59]:
confusion_matrix(y_train,knn_pred)

array([[4276,  158],
       [  56, 5514]])
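
The per-class precision and recall that `classification_report` prints can be recomputed directly from this matrix, which is a useful sanity check. The sketch copies the four counts from the output above and relies on scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = np.array([[4276, 158],
               [56, 5514]])

precision = np.diag(cm) / cm.sum(axis=0)  # correct / everything predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / everything truly in that class
accuracy = np.trace(cm) / cm.sum()

print(precision.round(2), recall.round(2), round(float(accuracy), 2))
```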

In [60]:
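# Labels are integer codes at this point: 0 is the first category of Result and 1 the second.
# Confirm which original value denotes phishing before relying on these names.
print(classification_report(y_train,knn_pred,target_names=['not-phishing','phishing']))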

              precision    recall  f1-score   support

not-phishing       0.99      0.96      0.98      4434
    phishing       0.97      0.99      0.98      5570

    accuracy                           0.98     10004
   macro avg       0.98      0.98      0.98     10004
weighted avg       0.98      0.98      0.98     10004
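
The report above is computed on the rows the model was fitted on, and a k-nearest-neighbours model tends to score well on data it has memorized. A minimal sketch of the same evaluation on a held-out split follows. It uses synthetic data from `make_classification` as a stand-in (an assumption); in the notebook the same calls would take `X_train`, `X_test`, `y_train`, and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; replace with X and y from the notebook).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.095, random_state=578
)

model = KNeighborsClassifier(n_neighbors=3, p=3).fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)  # optimistic: the model has seen these rows
test_acc = model.score(X_te, y_te)   # estimate of performance on unseen rows

print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
print(classification_report(y_te, model.predict(X_te)))
```

Comparing the two accuracies gives a rough idea of how much the training-set figures overstate performance on new data.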

