# Naive Bayes Classification of Phishing Websites

Notebook adapted from the 05.05 Naive Bayes notebook from the Python Data Science Handbook.  
Modified by: Gábor Major  
Last Modified date: 2025-01-17

Import libraries.

In [24]:
from scipy.io import arff
import pandas as pd
from sklearn.model_selection import train_test_split
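# Import the five Naive Bayes estimators explicitly instead of using a wildcard
from sklearn.naive_bayes import BernoulliNB, CategoricalNB, ComplementNB, GaussianNB, MultinomialNB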

## Import Data
Load the data from the ARFF file.

In [17]:
data = arff.loadarff('../phishing_websites_data/Training Dataset.arff')
df = pd.DataFrame(data[0])
df.head()

Unnamed: 0,having_IP_Address,URL_Length,Shortining_Service,having_At_Symbol,double_slash_redirecting,Prefix_Suffix,having_Sub_Domain,SSLfinal_State,Domain_registeration_length,Favicon,...,popUpWidnow,Iframe,age_of_domain,DNSRecord,web_traffic,Page_Rank,Google_Index,Links_pointing_to_page,Statistical_report,Result
0,b'-1',b'1',b'1',b'1',b'-1',b'-1',b'-1',b'-1',b'-1',b'1',...,b'1',b'1',b'-1',b'-1',b'-1',b'-1',b'1',b'1',b'-1',b'-1'
1,b'1',b'1',b'1',b'1',b'1',b'-1',b'0',b'1',b'-1',b'1',...,b'1',b'1',b'-1',b'-1',b'0',b'-1',b'1',b'1',b'1',b'-1'
2,b'1',b'0',b'1',b'1',b'1',b'-1',b'-1',b'-1',b'-1',b'1',...,b'1',b'1',b'1',b'-1',b'1',b'-1',b'1',b'0',b'-1',b'-1'
3,b'1',b'0',b'1',b'1',b'1',b'-1',b'-1',b'-1',b'1',b'1',...,b'1',b'1',b'-1',b'-1',b'1',b'-1',b'1',b'-1',b'1',b'-1'
4,b'1',b'0',b'-1',b'1',b'1',b'-1',b'1',b'1',b'-1',b'1',...,b'-1',b'1',b'-1',b'-1',b'0',b'-1',b'1',b'1',b'1',b'1'
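
The `b'...'` values above appear because `arff.loadarff` returns nominal attributes as byte strings. The following is a minimal sketch of that behaviour on a tiny in-memory ARFF document; the relation name `toy` and the two-attribute layout are made up for illustration and are not the real dataset.

```python
import io

import pandas as pd
from scipy.io import arff

# Hypothetical two-attribute ARFF document (not the real phishing dataset)
arff_text = """@relation toy
@attribute having_IP_Address {-1,1}
@attribute Result {-1,1}
@data
-1,-1
1,1
"""

# loadarff returns a (records, metadata) tuple
records, meta = arff.loadarff(io.StringIO(arff_text))
toy_df = pd.DataFrame(records)

print(meta.names())  # attribute names from the ARFF header
print(toy_df)        # nominal values are stored as bytes
```

Unpacking the tuple into `records` and `meta` is equivalent to indexing `data[0]` as the cell above does, and keeps the attribute metadata available for inspection.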


Clean up the data by decoding the byte strings to UTF-8 text, which removes the `b'...'` wrapper around each value.

In [18]:
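# Keep only the object (byte string) columns
df = df.select_dtypes([object])
# Flatten to a single Series, decode bytes to str, then restore the table shape
df = df.stack().str.decode('utf-8').unstack()
df.head()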

Unnamed: 0,having_IP_Address,URL_Length,Shortining_Service,having_At_Symbol,double_slash_redirecting,Prefix_Suffix,having_Sub_Domain,SSLfinal_State,Domain_registeration_length,Favicon,...,popUpWidnow,Iframe,age_of_domain,DNSRecord,web_traffic,Page_Rank,Google_Index,Links_pointing_to_page,Statistical_report,Result
0,-1,1,1,1,-1,-1,-1,-1,-1,1,...,1,1,-1,-1,-1,-1,1,1,-1,-1
1,1,1,1,1,1,-1,0,1,-1,1,...,1,1,-1,-1,0,-1,1,1,1,-1
2,1,0,1,1,1,-1,-1,-1,-1,1,...,1,1,1,-1,1,-1,1,0,-1,-1
3,1,0,1,1,1,-1,-1,-1,1,1,...,1,1,-1,-1,1,-1,1,-1,1,-1
4,1,0,-1,1,1,-1,1,1,-1,1,...,-1,1,-1,-1,0,-1,1,1,1,1
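
After decoding, the values are still strings rather than numbers. As a hedged alternative, the sketch below decodes column by column and casts the result to integers, so the features are numeric before modelling. The `toy` frame is a hypothetical stand-in for the real data, and the integer cast is an addition that the cell above does not perform.

```python
import pandas as pd

# Hypothetical toy frame mimicking the byte-string columns loaded from the ARFF file
toy = pd.DataFrame({
    "having_IP_Address": [b"-1", b"1", b"1"],
    "URL_Length": [b"1", b"0", b"-1"],
    "Result": [b"-1", b"-1", b"1"],
})

# Decode every column from bytes to str, then cast the whole frame to int
toy_clean = toy.apply(lambda col: col.str.decode("utf-8")).astype(int)

print(toy_clean)
print(toy_clean.dtypes)
```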


## Create Data Sets
Split the data into 60% training, 20% validation, and 20% test sets.

In [28]:
# Features are every column except the label; the target is the 'Result' label
data_features = df.drop(columns=['Result'])
data_target = df['Result']

In [29]:
# Hold out 20% of the full data as the test set
xTrain, xTest, yTrain, yTest = train_test_split(data_features, data_target, test_size=0.2)
# Split the remaining 80% into training and validation sets:
# 0.25 * 80% = 20% validation, leaving 60% of the full data for training
xTrain, xValidation, yTrain, yValidation = train_test_split(xTrain, yTrain, test_size=0.25)
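
To check that the two-step split really yields 60/20/20, the sketch below repeats it on synthetic data and reports the resulting sizes. The `random_state` and `stratify` arguments are assumptions added here for reproducibility and class balance; the cell above uses neither.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 samples, one feature, balanced -1/1 labels
X = np.arange(1000).reshape(-1, 1)
y = np.array([-1, 1] * 500)

# Step 1: hold out 20% as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
# Step 2: 25% of the remaining 80% becomes the validation set (20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

sizes = (len(X_train), len(X_val), len(X_test))
print(sizes)  # (600, 200, 200)
```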

## Create Models
Use all five Naive Bayes models available in scikit-learn: Bernoulli, Categorical, Complement, Gaussian, and Multinomial.

In [30]:
naive_bayes_models_dictionary = {
    "Bernoulli": BernoulliNB(),
    "Categorical": CategoricalNB(),
    "Complement": ComplementNB(),
    "Gaussian": GaussianNB(),
    "Multinomial": MultinomialNB()
}
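
One way such a dictionary might be used is to fit every model in a loop and compare validation accuracy. Note that `CategoricalNB`, `ComplementNB`, and `MultinomialNB` reject negative feature values, so features coded as -1/0/1 need remapping first. The sketch below runs on synthetic data, and the shift by +1 (giving 0/1/2) is an assumption made for illustration, not a step taken from this notebook.

```python
import numpy as np
from sklearn.naive_bayes import (
    BernoulliNB, CategoricalNB, ComplementNB, GaussianNB, MultinomialNB
)

# Synthetic features in {-1, 0, 1} and a label derived from the first two columns
rng = np.random.default_rng(0)
X = rng.integers(-1, 2, size=(300, 5))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# Assumed preprocessing: shift to {0, 1, 2} so no feature value is negative
X_shifted = X + 1
X_train, X_val = X_shifted[:200], X_shifted[200:]
y_train, y_val = y[:200], y[200:]

models = {
    "Bernoulli": BernoulliNB(),
    "Categorical": CategoricalNB(),
    "Complement": ComplementNB(),
    "Gaussian": GaussianNB(),
    "Multinomial": MultinomialNB(),
}

# Fit each model on the training part and record its validation accuracy
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_val, y_val)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

With its default `binarize=0.0`, `BernoulliNB` treats the shifted values 1 and 2 as the same "on" state, so a different encoding may suit it better.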