# Naive Bayes Classification of Phishing Websites

Notebook adapted from the 05.05 Naive Bayes notebook from the Python Data Science Handbook.  
Modified by: Gábor Major  
Last Modified date: 2025-01-17

Import libraries.

In [35]:
from scipy.io import arff
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import (
    BernoulliNB,
    CategoricalNB,
    ComplementNB,
    GaussianNB,
    MultinomialNB,
)
from sklearn import metrics

## Import Data
Load the data from the ARFF file.

In [29]:
data = arff.loadarff('../phishing_websites_data/Training Dataset.arff')
df = pd.DataFrame(data[0])
df.head()

Unnamed: 0,having_IP_Address,URL_Length,Shortining_Service,having_At_Symbol,double_slash_redirecting,Prefix_Suffix,having_Sub_Domain,SSLfinal_State,Domain_registeration_length,Favicon,...,popUpWidnow,Iframe,age_of_domain,DNSRecord,web_traffic,Page_Rank,Google_Index,Links_pointing_to_page,Statistical_report,Result
0,b'-1',b'1',b'1',b'1',b'-1',b'-1',b'-1',b'-1',b'-1',b'1',...,b'1',b'1',b'-1',b'-1',b'-1',b'-1',b'1',b'1',b'-1',b'-1'
1,b'1',b'1',b'1',b'1',b'1',b'-1',b'0',b'1',b'-1',b'1',...,b'1',b'1',b'-1',b'-1',b'0',b'-1',b'1',b'1',b'1',b'-1'
2,b'1',b'0',b'1',b'1',b'1',b'-1',b'-1',b'-1',b'-1',b'1',...,b'1',b'1',b'1',b'-1',b'1',b'-1',b'1',b'0',b'-1',b'-1'
3,b'1',b'0',b'1',b'1',b'1',b'-1',b'-1',b'-1',b'1',b'1',...,b'1',b'1',b'-1',b'-1',b'1',b'-1',b'1',b'-1',b'1',b'-1'
4,b'1',b'0',b'-1',b'1',b'1',b'-1',b'1',b'1',b'-1',b'1',...,b'-1',b'1',b'-1',b'-1',b'0',b'-1',b'1',b'1',b'1',b'1'


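Clean up the data: decode the byte strings to remove the `b'...'` prefix, convert the values to integers, and add 1 so that the original values -1, 0, 1 become 0, 1, 2. This keeps every feature non-negative, which some of the Naive Bayes models used later require.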

In [30]:
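# Keep only the object (byte-string) columns
df = df.select_dtypes([object])
# Decode the byte strings to regular strings, e.g. b'-1' -> '-1'
df = df.stack().str.decode('utf-8').unstack()
# Convert the decoded strings to integers
df = df.apply(pd.to_numeric)
# Shift the values from -1, 0, 1 to 0, 1, 2
df = df.add(1)
df.head()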

Unnamed: 0,having_IP_Address,URL_Length,Shortining_Service,having_At_Symbol,double_slash_redirecting,Prefix_Suffix,having_Sub_Domain,SSLfinal_State,Domain_registeration_length,Favicon,...,popUpWidnow,Iframe,age_of_domain,DNSRecord,web_traffic,Page_Rank,Google_Index,Links_pointing_to_page,Statistical_report,Result
0,0,2,2,2,0,0,0,0,0,2,...,2,2,0,0,0,0,2,2,0,0
1,2,2,2,2,2,0,1,2,0,2,...,2,2,0,0,1,0,2,2,2,0
2,2,1,2,2,2,0,0,0,0,2,...,2,2,2,0,2,0,2,1,0,0
3,2,1,2,2,2,0,0,0,2,2,...,2,2,0,0,2,0,2,0,2,0
4,2,1,0,2,2,0,2,2,0,2,...,0,2,0,0,1,0,2,2,2,2
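
The clean-up steps above can be sketched on a tiny hand-made frame. The `toy` DataFrame and its column name `feature_a` are made up for illustration and are not part of the phishing data set; the per-column `apply` is just one alternative to the `stack`/`unstack` approach, not necessarily the better one.

```python
import pandas as pd

# Hypothetical toy data mimicking what arff.loadarff returns: byte strings
toy = pd.DataFrame({
    "feature_a": [b"-1", b"0", b"1"],
    "Result": [b"-1", b"1", b"1"],
})

# Decode each column from bytes to str, e.g. b'-1' -> '-1'
decoded = toy.apply(lambda col: col.str.decode("utf-8"))

# Convert the strings to integers
numeric = decoded.apply(pd.to_numeric)

# Shift -1, 0, 1 to 0, 1, 2 so that every value is non-negative
shifted = numeric + 1
print(shifted)
```

Note that the shift also applies to the `Result` column, so the class labels become 0 and 2 rather than -1 and 1.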


## Create Data Sets
Split data into 60% training, 20% validation, and 20% testing sets.

In [31]:
data_target = df['Result']
data_features = df.drop(columns=['Result'])

In [32]:
# Split off 20% test set
xTrain, xTest, yTrain, yTest = train_test_split(data_features, data_target, test_size=0.2)
# Split the remaining 80% into 60% training and 20% validation sets (0.25 * 0.8 = 0.2)
xTrain, xValidation, yTrain, yValidation = train_test_split(xTrain, yTrain, test_size=0.25)
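
The arithmetic behind the two-stage split can be checked on synthetic data. This is a minimal sketch, not the notebook's own code: the arrays `X` and `y` are made up, and `random_state` and `stratify` are optional additions (assumed here for reproducibility and balanced classes) that the cell above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 samples, two balanced classes labelled 0 and 2
X = np.arange(1000).reshape(-1, 1)
y = np.tile([0, 2], 500)

# Stage 1: hold out 20% of the full data as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Stage 2: 25% of the remaining 80% equals 20% of the full data
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

sizes = (len(X_train), len(X_val), len(X_test))
print(sizes)
```

Fixing `random_state` makes the split repeatable between runs, which helps when comparing models on the same validation set.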

## Create Models
Use all five Naive Bayes models available in scikit-learn to compare which performs best on this data set.

In [33]:
naive_bayes_models_dictionary = {
    "Bernoulli": BernoulliNB(),
    "Categorical": CategoricalNB(),
    "Complement": ComplementNB(),
    "Gaussian": GaussianNB(),
    "Multinomial": MultinomialNB()
}
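
One detail worth keeping in mind when comparing these models is that `BernoulliNB` binarizes its input with a default threshold of `binarize=0.0`, so it treats feature values 1 and 2 identically. The sketch below illustrates this on synthetic data; the array `X` and labels `y` are made up for illustration and are not the phishing data.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical features in {0, 1, 2} and labels in {0, 2}
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 5))
y = np.where(X[:, 0] == 2, 2, 0)

# Fit once on the raw values and once on an explicitly binarized copy
X_binary = (X > 0).astype(int)
model_raw = BernoulliNB().fit(X, y)
model_binary = BernoulliNB().fit(X_binary, y)

# Both models see the same 0/1 features internally, so predictions agree
same_predictions = np.array_equal(
    model_raw.predict(X), model_binary.predict(X_binary)
)
print(same_predictions)
```

If the distinction between 1 and 2 carries information, a model that keeps all three levels, such as `CategoricalNB`, may be a more natural fit.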

In [34]:
predictions = {}
for name, model in naive_bayes_models_dictionary.items():
    print(name)
    model.fit(xTrain, yTrain)
    predictions[name] = model.predict(xValidation)

Bernoulli
      having_IP_Address  URL_Length  Shortining_Service  having_At_Symbol  \
9991                  2           0                   2                 0   
4833                  2           0                   2                 0   
14                    2           2                   0                 2   
9183                  0           0                   2                 0   
3902                  2           0                   2                 2   
...                 ...         ...                 ...               ...   
7402                  0           0                   2                 2   
9144                  2           0                   2                 2   
9690                  0           2                   2                 2   
149                   2           2                   2                 2   
7782                  2           0                   2                 2   

      double_slash_redirecting  Prefix_Suffix  having_Sub_Domain 
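
When scoring predictions such as those collected above, the argument order of scikit-learn's metric functions matters: they take the true labels first and the predictions second. The sketch below uses small hand-written label lists (hypothetical, not taken from the validation set) to show that swapping the two arguments exchanges precision and recall.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: class 2 is treated as the positive class
y_true = [0, 0, 0, 2, 2, 2, 2, 2]
y_pred = [0, 0, 2, 2, 2, 2, 2, 2]

# Conventional order: (y_true, y_pred)
precision = precision_score(y_true, y_pred, pos_label=2)
recall = recall_score(y_true, y_pred, pos_label=2)

# Swapped order: the predictions are treated as if they were the truth
swapped_precision = precision_score(y_pred, y_true, pos_label=2)
swapped_recall = recall_score(y_pred, y_true, pos_label=2)

print(precision, recall)
print(swapped_precision, swapped_recall)
```

The same applies to `metrics.classification_report`, whose signature is also `(y_true, y_pred)`.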

In [41]:
print(predictions)
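for name, y_pred in predictions.items():
    print("Model " + name)
    # Note: classification_report expects (y_true, y_pred). The predictions are
    # passed first here, so the precision and recall columns are swapped and
    # "support" counts predicted labels rather than true labels.
    print(metrics.classification_report(y_pred, yValidation))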

{'Bernoulli': array([2, 2, 2, ..., 0, 0, 0]), 'Categorical': array([2, 2, 2, ..., 0, 0, 0]), 'Complement': array([2, 2, 2, ..., 0, 0, 0]), 'Gaussian': array([0, 2, 2, ..., 0, 0, 0]), 'Multinomial': array([2, 2, 2, ..., 0, 0, 0])}
Model Bernoulli
              precision    recall  f1-score   support

           0       0.88      0.92      0.90       898
           2       0.94      0.92      0.93      1313

    accuracy                           0.92      2211
   macro avg       0.91      0.92      0.91      2211
weighted avg       0.92      0.92      0.92      2211

Model Categorical
              precision    recall  f1-score   support

           0       0.91      0.94      0.93       912
           2       0.96      0.94      0.95      1299

    accuracy                           0.94      2211
   macro avg       0.93      0.94      0.94      2211
weighted avg       0.94      0.94      0.94      2211

Model Complement
              precision    recall  f1-score   support

          