<a href="https://colab.research.google.com/github/Maria-mbugua/IPWeek9-Core/blob/main/Naive_Bayes_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1) DEFINING THE QUESTION**

## **a) Specifying the Question**

Using the dataset that has been provided to train and test the data.

## **b) Defining the metrics of success**

Performing Naive Bayes Classification with the dataset.

## **c) Understanding the context**

Using the Naive Bayes Classifier in the dataset.

## **d) Recording the Experimental Design**

1. Define the question, the metric for success, the context, experimental design.

2. Read and clean the dataset.

3. Define the appropriate method to answer the question.

4. Find and deal with outliers, anomalies, and missing data within the dataset.

5. Perform univariate, bivariate and multivariate analysis.

6. Partitioning the dataset into 80 - 20 sets.

7. Using the Naive Bayes Classifier.

8. Repeat step 6 to step 7 twice, each time splitting the dataset differently i.e. 70-30, 60-40, then note the outcomes of your modeling.

9. Applying an optimization technique.

10. Recommendation for Naive Bayes Classifier.

## **e) Relevance of the data**

The data used in this project is for performing Naive Bayes Classification.

# **2) DATA ANALYSIS**

## **a) Checking the Data**

In [91]:
# Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline
from scipy.stats import norm
from sklearn.model_selection import train_test_split
from sklearn import feature_extraction, model_selection, naive_bayes, metrics, svm
from sklearn.naive_bayes import BernoulliNB

In [92]:
# Acquiring the data
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data

--2022-05-08 00:02:03--  https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 702942 (686K) [application/x-httpd-php]
Saving to: ‘spambase.data.2’


2022-05-08 00:02:04 (2.08 MB/s) - ‘spambase.data.2’ saved [702942/702942]



In [93]:
# Reading the data
data = pd.read_csv('spambase.data', header=None)


In [94]:
# Previewing the top of the dataset
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [95]:
# Previewing the tail of the dataset
data.tail()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
4596,0.31,0.0,0.62,0.0,0.0,0.31,0.0,0.0,0.0,0.0,...,0.0,0.232,0.0,0.0,0.0,0.0,1.142,3,88,0
4597,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.353,0.0,0.0,1.555,4,14,0
4598,0.3,0.0,0.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.102,0.718,0.0,0.0,0.0,0.0,1.404,6,118,0
4599,0.96,0.0,0.0,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.057,0.0,0.0,0.0,0.0,1.147,5,78,0
4600,0.0,0.0,0.65,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.125,0.0,0.0,1.25,5,40,0


In [96]:
# Previewing the columns of the dataset
data.columns

Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
            34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
            51, 52, 53, 54, 55, 56, 57],
           dtype='int64')

In [97]:
# Previewing the shape of the dataset
data.shape

(4601, 58)

In [98]:
# Displaying the number of unique values of the columns in the dataset
data.nunique()

0      142
1      171
2      214
3       43
4      255
5      141
6      173
7      170
8      144
9      245
10     113
11     316
12     158
13     133
14     118
15     253
16     197
17     229
18     575
19     148
20     401
21      99
22     164
23     143
24     395
25     281
26     240
27     200
28     156
29     179
30     128
31     106
32     184
33     110
34     177
35     159
36     188
37      53
38     163
39     125
40     108
41     186
42     136
43     160
44     230
45     227
46      38
47     106
48     313
49     641
50     225
51     964
52     504
53     316
54    2161
55     271
56     919
57       2
dtype: int64

In [99]:
# Displaying the datatypes
data.dtypes

0     float64
1     float64
2     float64
3     float64
4     float64
5     float64
6     float64
7     float64
8     float64
9     float64
10    float64
11    float64
12    float64
13    float64
14    float64
15    float64
16    float64
17    float64
18    float64
19    float64
20    float64
21    float64
22    float64
23    float64
24    float64
25    float64
26    float64
27    float64
28    float64
29    float64
30    float64
31    float64
32    float64
33    float64
34    float64
35    float64
36    float64
37    float64
38    float64
39    float64
40    float64
41    float64
42    float64
43    float64
44    float64
45    float64
46    float64
47    float64
48    float64
49    float64
50    float64
51    float64
52    float64
53    float64
54    float64
55      int64
56      int64
57      int64
dtype: object

In [100]:
# Dataset information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 58 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       4601 non-null   float64
 1   1       4601 non-null   float64
 2   2       4601 non-null   float64
 3   3       4601 non-null   float64
 4   4       4601 non-null   float64
 5   5       4601 non-null   float64
 6   6       4601 non-null   float64
 7   7       4601 non-null   float64
 8   8       4601 non-null   float64
 9   9       4601 non-null   float64
 10  10      4601 non-null   float64
 11  11      4601 non-null   float64
 12  12      4601 non-null   float64
 13  13      4601 non-null   float64
 14  14      4601 non-null   float64
 15  15      4601 non-null   float64
 16  16      4601 non-null   float64
 17  17      4601 non-null   float64
 18  18      4601 non-null   float64
 19  19      4601 non-null   float64
 20  20      4601 non-null   float64
 21  21      4601 non-null   float64
 22  

## **b) Data Cleaning**

In [101]:
# Checking for null values
data.isnull().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
16    0
17    0
18    0
19    0
20    0
21    0
22    0
23    0
24    0
25    0
26    0
27    0
28    0
29    0
30    0
31    0
32    0
33    0
34    0
35    0
36    0
37    0
38    0
39    0
40    0
41    0
42    0
43    0
44    0
45    0
46    0
47    0
48    0
49    0
50    0
51    0
52    0
53    0
54    0
55    0
56    0
57    0
dtype: int64

In [102]:
# Checking for duplicates
data.duplicated().any()

True

In [103]:
# Count of duplicates
data.duplicated().sum()

391

In [104]:
# Dropping duplicates
data1 = data.drop_duplicates()

In [105]:
# Count of duplicates
data1.duplicated().sum()

0

In [106]:
# Shape of the dataset
data1.shape

(4210, 58)

In [107]:
# Previewing the numerical values
data1.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
count,4210.0,4210.0,4210.0,4210.0,4210.0,4210.0,4210.0,4210.0,4210.0,4210.0,...,4210.0,4210.0,4210.0,4210.0,4210.0,4210.0,4210.0,4210.0,4210.0,4210.0
mean,0.104366,0.112656,0.291473,0.063078,0.325321,0.096656,0.117475,0.108,0.09186,0.24842,...,0.040403,0.144048,0.017376,0.281136,0.076057,0.045798,5.383896,52.139905,291.181948,0.398812
std,0.300005,0.45426,0.515719,1.352487,0.687805,0.27603,0.397284,0.410282,0.282144,0.656638,...,0.252533,0.274256,0.105731,0.843321,0.239708,0.435925,33.147358,199.582168,618.654838,0.489712
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.6275,7.0,40.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.073,0.0,0.016,0.0,0.0,2.297,15.0,101.5,0.0
75%,0.0,0.0,0.44,0.0,0.41,0.0,0.0,0.0,0.0,0.19,...,0.0,0.194,0.0,0.331,0.053,0.0,3.70675,44.0,273.75,1.0
max,4.54,14.28,5.1,42.81,10.0,5.88,7.27,11.11,5.26,18.18,...,4.385,9.752,4.081,32.478,6.003,19.829,1102.5,9989.0,15841.0,1.0


In [108]:
# Anomalies
q11 = data1[1].quantile(.25)
q31 = data1[1].quantile(.75)

iqr11 = q31 - q11
iqr11

q11, q31 = np.percentile(data1[1], [25, 75]) 

iqr = q31 - q11

l_bound = q11 - (1.5*iqr)
u_bound = q31 + (1.5 * iqr)

print(iqr11, iqr)

0.0 0.0


In [109]:
# Removing outliers
Q1 = data1.quantile(0.25)
Q3 = data1.quantile(0.75)
IQR = Q3 - Q1
 
data2 = data1[~((data1 < (Q1 - 1.5 * IQR)) | (data1 > (Q3 + 1.5 * IQR))).any(axis=1)]

# New dataset
print(data2.shape)

# Old dataset
print(data1.shape)

(114, 58)
(4210, 58)


# **3) EXPLORATORY DATA ANALYSIS**

In [110]:
# There are no titles for the columns hence its impossible to plot the columns.

In [111]:
# Correlation coefficients
corr = data2.corr()
corr

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,1.0,,0.006354,,,,,0.294984,...,,0.230738,,-0.05121,-0.014969,,0.021999,0.25268,0.274692,-0.075612
3,,,,,,,,,,,...,,,,,,,,,,
4,,,0.006354,,1.0,,,,,-0.03182,...,,-0.077894,,0.139739,-0.022449,,0.348735,0.468506,0.572577,0.311559
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,,,0.294984,,-0.03182,,,,,1.0,...,,0.118677,,-0.01941,-0.012544,,-0.066304,0.022307,0.095105,0.097456


In [112]:
# Previewing the numerical values
data2.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
count,114.0,114.0,114.0,114.0,114.0,114.0,114.0,114.0,114.0,114.0,...,114.0,114.0,114.0,114.0,114.0,114.0,114.0,114.0,114.0,114.0
mean,0.0,0.0,0.014649,0.0,0.044474,0.0,0.0,0.0,0.0,0.006754,...,0.0,0.035263,0.0,0.084588,0.001018,0.0,1.968623,7.184211,27.692982,0.184211
std,0.0,0.0,0.09247,0.0,0.187192,0.0,0.0,0.0,0.0,0.050877,...,0.0,0.101903,0.0,0.202994,0.010864,0.0,1.088017,8.484829,35.767224,0.389367
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.224,3.0,8.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.6605,4.0,16.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,2.28175,8.0,31.25,0.0
max,0.0,0.0,0.75,0.0,0.94,0.0,0.0,0.0,0.0,0.41,...,0.0,0.478,0.0,0.787,0.116,0.0,6.666,55.0,226.0,1.0


In [113]:
data3 = data1.astype(int)
data3

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,3,61,278,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,5,101,1028,1
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,9,485,2259,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,3,40,191,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,3,40,191,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4596,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,3,88,0
4597,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,4,14,0
4598,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,6,118,0
4599,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,5,78,0


# **4) SOLUTION**

## **i) 80:20 partition**

In [114]:
# Splitting the dataset to train and test by using 80% of the data
X = data3.iloc[:, :-1].values
y = data3.iloc[:, 5].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [115]:
# Normalization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
X_train

array([[-0.12254214, -0.12967027, -0.2720487 , ..., -0.05489053,
        -0.21017521, -0.42461157],
       [-0.12254214, -0.12967027, -0.2720487 , ..., -0.11609605,
        -0.22897892, -0.31819569],
       [-0.12254214, -0.12967027, -0.2720487 , ...,  0.31234264,
         0.0107683 , -0.23994871],
       ...,
       [-0.12254214, -0.12967027, -0.2720487 , ..., -0.11609605,
        -0.21957707, -0.16952644],
       [-0.12254214, -0.12967027, -0.2720487 , ..., -0.11609605,
        -0.18667058,  0.2639618 ],
       [-0.12254214, -0.12967027, -0.2720487 , ..., -0.11609605,
        -0.18667058, -0.34949448]])

In [116]:
# Training our model
from sklearn.naive_bayes import GaussianNB 
clf = GaussianNB()  
model = clf.fit(X_train, y_train) 

In [117]:
# Predicting our test predictors
predicted = model.predict(X_test)
print(np.mean(predicted == y_test))

0.997624703087886


In [118]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.models import load_model

In [119]:
# Predicting the Test set results
classifier = Sequential()
y_pred = classifier.predict(X_test)
y_pred

array([[-0.12254214,  3.058136  , -0.2720487 , ..., -0.08549329,
        -0.18667059, -0.3745335 ],
       [-0.12254214, -0.12967028,  2.3089654 , ...,  0.03691776,
         2.497558  ,  1.6238942 ],
       [-0.12254214, -0.12967028, -0.2720487 , ..., -0.11609606,
        -0.21017522, -0.39331278],
       ...,
       [-0.12254214, -0.12967028, -0.2720487 , ..., -0.05489053,
         0.04367479, -0.16326667],
       [-0.12254214, -0.12967028, -0.2720487 , ..., -0.11609606,
        -0.21017522, -0.36670882],
       [-0.12254214, -0.12967028, -0.2720487 , ..., -0.05489053,
        -0.18667059, -0.35731918]], dtype=float32)

## **ii) 70:30 partition**

In [120]:
# Splitting the dataset to train and test by using 70% of the data
X = data3.iloc[:, :-1].values
y = data3.iloc[:, 5].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

In [121]:
# Normalization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
X_train

array([[-0.12624921, -0.103736  ,  2.25386172, ..., -0.06891015,
        -0.17468415, -0.37203213],
       [-0.12624921, -0.103736  , -0.27262382, ..., -0.1413217 ,
        -0.22008246, -0.29976933],
       [-0.12624921, -0.103736  , -0.27262382, ..., -0.1413217 ,
        -0.23824179, -0.43074565],
       ...,
       [-0.12624921, -0.103736  , -0.27262382, ..., -0.06891015,
         0.04322776,  0.03143515],
       [-0.12624921, -0.103736  , -0.27262382, ...,  0.07591296,
        -0.16560449, -0.39611973],
       [-0.12624921, -0.103736  , -0.27262382, ..., -0.1413217 ,
        -0.19284347, -0.35246095]])

In [122]:
# Training our model
from sklearn.naive_bayes import GaussianNB 
clf = GaussianNB()  
model = clf.fit(X_train, y_train) 

In [123]:
# Predicting our test predictors
predicted = model.predict(X_test)
print(np.mean(predicted == y_test))

0.9984164687252574


In [124]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.models import load_model

In [125]:
# Predicting the Test set results
classifier = Sequential()
y_pred = classifier.predict(X_test)
y_pred

array([[-0.12624921, -0.103736  , -0.2726238 , ...,  0.00350141,
         0.04322776,  0.48608857],
       [-0.12624921, -0.103736  , -0.2726238 , ..., -0.10511592,
        -0.02032988, -0.03028932],
       [-0.12624921, -0.103736  , -0.2726238 , ..., -0.10511592,
         0.06138708,  1.4390543 ],
       ...,
       [-0.12624921, -0.103736  , -0.2726238 , ..., -0.06891014,
         0.07046675,  0.60803205],
       [-0.12624921, -0.103736  , -0.2726238 , ..., -0.06891014,
        -0.15652482, -0.30428576],
       [-0.12624921, -0.103736  , -0.2726238 , ..., -0.03270437,
         0.00690911,  0.9527858 ]], dtype=float32)

## **iii) 60:40 partition**

In [126]:
# Splitting the dataset to train and test by using 60% of the data
X = data3.iloc[:, :-1].values
y = data3.iloc[:, 5].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)

In [127]:
# Normalization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
X_train

array([[-0.12951308, -0.11333283, -0.27020714, ..., -0.11047646,
        -0.1770414 , -0.16870197],
       [-0.12951308, -0.11333283, -0.27020714, ..., -0.08573239,
        -0.17281214,  0.14887005],
       [-0.12951308, -0.11333283, -0.27020714, ..., -0.01150021,
        -0.14743656, -0.40532425],
       ...,
       [-0.12951308, -0.11333283,  2.26712835, ..., -0.03624427,
        -0.05862202,  0.70617779],
       [-0.12951308, -0.11333283, -0.27020714, ..., -0.08573239,
        -0.1770414 , -0.39287044],
       [-0.12951308, -0.11333283, -0.27020714, ..., -0.06098833,
        -0.1770414 , -0.30102364]])

In [128]:
# Training our model
from sklearn.naive_bayes import GaussianNB 
clf = GaussianNB()  
model = clf.fit(X_train, y_train) 

In [129]:
# Predicting our test predictors
predicted = model.predict(X_test)
print(np.mean(predicted == y_test))

0.995249406175772


In [130]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.models import load_model

In [131]:
# Predicting the Test set results
classifier = Sequential()
y_pred = classifier.predict(X_test)
y_pred

array([[-0.12951308, -0.11333283, -0.27020714, ..., -0.11047646,
        -0.19818772, -0.2854564 ],
       [-0.12951308, -0.11333283, -0.27020714, ..., -0.08573239,
        -0.14743656, -0.28234294],
       [-0.12951308, -0.11333283, -0.27020714, ..., -0.08573239,
        -0.19395846, -0.41777804],
       ...,
       [-0.12951308, -0.11333283, -0.27020714, ..., -0.03624427,
         0.04288032,  1.2012165 ],
       [-0.12951308, -0.11333283, -0.27020714, ..., -0.11047646,
        -0.16435361, -0.39131373],
       [-0.12951308, -0.11333283, -0.27020714, ..., -0.11047646,
        -0.16012435, -0.34928212]], dtype=float32)

## **iv) Hyperparameter Optimization using GridSearch CV**

In [136]:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV
import pandas as pd

param_grid = {'max_depth': [3, 5, 10],
              'min_samples_split': [2, 5, 10]}
base_estimator = RandomForestClassifier(random_state=0)
X, y = make_classification(n_samples=1000, random_state=0)
sh = HalvingGridSearchCV(base_estimator, param_grid, cv=5,
                         factor=2, resource='n_estimators',
                         max_resources=30).fit(X, y)
sh.best_estimator_


RandomForestClassifier(max_depth=5, n_estimators=24, random_state=0)

# **5) CONCLUSION**

1. 80:20 partition model accuracy score - 0.9976

2. 70:30 partition model accuracy score - 0.9984

3. 60:40 partition model accuracy score - 0.9952

The best partition model is 70:30 with an accuracy score of 0.984