# Python Programming: Naive Bayes - Exploratory Data Analysis

## 1. Defining the Question

### a) Specifying the Data Analytic Question

Predict whether an e-mail is spam or not.

### b) Defining the Metric for Success

A Naive Bayes model that achieves at least 80% accuracy on held-out test data.

### c) Recording the Experimental Design

1. Loading, reading and checking the data
2. Data cleaning
3. Exploratory analysis
4. Modelling and implementing our solution
5. Challenging the solution

### d) Data Relevance

The data was collected from research on a collection of e-mails, each labelled as spam or not spam. It is therefore relevant to our study.

## 2. Reading the Data

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

In [4]:
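# load the data
#
# Note: spambase.data has no header row, so read_csv promotes the first record
# to column names (hence labels such as '0.64') and the frame holds 4600 rows.
# The columns are renamed in section 5.
spam = pd.read_csv('spambase.data', delimiter=',')
spam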



Unnamed: 0,0,0.64,0.64.1,0.1,0.32,0.2,0.3,0.4,0.5,0.6,...,0.40,0.41,0.42,0.778,0.43,0.44,3.756,61,278,1
0,0.21,0.28,0.50,0.0,0.14,0.28,0.21,0.07,0.00,0.94,...,0.000,0.132,0.0,0.372,0.180,0.048,5.114,101,1028,1
1,0.06,0.00,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.010,0.143,0.0,0.276,0.184,0.010,9.821,485,2259,1
2,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.137,0.0,0.137,0.000,0.000,3.537,40,191,1
3,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.135,0.0,0.135,0.000,0.000,3.537,40,191,1
4,0.00,0.00,0.00,0.0,1.85,0.00,0.00,1.85,0.00,0.00,...,0.000,0.223,0.0,0.000,0.000,0.000,3.000,15,54,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4595,0.31,0.00,0.62,0.0,0.00,0.31,0.00,0.00,0.00,0.00,...,0.000,0.232,0.0,0.000,0.000,0.000,1.142,3,88,0
4596,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.353,0.000,0.000,1.555,4,14,0
4597,0.30,0.00,0.30,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.102,0.718,0.0,0.000,0.000,0.000,1.404,6,118,0
4598,0.96,0.00,0.00,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.057,0.0,0.000,0.000,0.000,1.147,5,78,0
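
The column labels in the output above (0, 0.64, 0.64.1, ...) are data values, because the file has no header line. A minimal sketch of reading a header-less CSV without losing the first record is shown below. It uses a made-up four-column sample held in memory rather than the real file, and the names in `names` are illustrative assumptions for this sample only.

```python
import io
import pandas as pd

# Two made-up records in the same comma-separated, header-less layout
# (shortened to 4 columns for illustration; the real file has 58).
sample = io.StringIO("0,0.64,0.64,1\n0.21,0.28,0.5,1\n")

# Hypothetical column names for this 4-column sample.
names = ["freq_make", "freq_address", "freq_all", "spam"]

# header=None tells pandas that the first line is data, not column names.
df = pd.read_csv(sample, header=None, names=names)

print(df.shape)
print(df.head())
```

With `header=None` the first record stays in the frame as row 0, so no rename from value-like labels is needed afterwards.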


In [5]:
# data description
#
with open("spambase.names") as f:
  print(f.read())


| SPAM E-MAIL DATABASE ATTRIBUTES (in .names format)
|
| 48 continuous real [0,100] attributes of type word_freq_WORD 
| = percentage of words in the e-mail that match WORD,
| i.e. 100 * (number of times the WORD appears in the e-mail) / 
| total number of words in e-mail.  A "word" in this case is any 
| string of alphanumeric characters bounded by non-alphanumeric 
| characters or end-of-string.
|
| 6 continuous real [0,100] attributes of type char_freq_CHAR
| = percentage of characters in the e-mail that match CHAR,
| i.e. 100 * (number of CHAR occurences) / total characters in e-mail
|
| 1 continuous real [1,...] attribute of type capital_run_length_average
| = average length of uninterrupted sequences of capital letters
|
| 1 continuous integer [1,...] attribute of type capital_run_length_longest
| = length of longest uninterrupted sequence of capital letters
|
| 1 continuous integer [1,...] attribute of type capital_run_length_total
| = sum of length of uninterrupted sequences of



## 3. Checking the Data

In [6]:
# Determining the number of records in our dataset (non-null values per column)
#
spam.count()

0         4600
0.64      4600
0.64.1    4600
0.1       4600
0.32      4600
0.2       4600
0.3       4600
0.4       4600
0.5       4600
0.6       4600
0.7       4600
0.64.2    4600
0.8       4600
0.9       4600
0.10      4600
0.32.1    4600
0.11      4600
1.29      4600
1.93      4600
0.12      4600
0.96      4600
0.13      4600
0.14      4600
0.15      4600
0.16      4600
0.17      4600
0.18      4600
0.19      4600
0.20      4600
0.21      4600
0.22      4600
0.23      4600
0.24      4600
0.25      4600
0.26      4600
0.27      4600
0.28      4600
0.29      4600
0.30      4600
0.31      4600
0.32.2    4600
0.33      4600
0.34      4600
0.35      4600
0.36      4600
0.37      4600
0.38      4600
0.39      4600
0.40      4600
0.41      4600
0.42      4600
0.778     4600
0.43      4600
0.44      4600
3.756     4600
61        4600
278       4600
1         4600
dtype: int64

In [7]:
# Previewing the top of our dataset
#
spam.head()

Unnamed: 0,0,0.64,0.64.1,0.1,0.32,0.2,0.3,0.4,0.5,0.6,...,0.40,0.41,0.42,0.778,0.43,0.44,3.756,61,278,1
0,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
2,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,...,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54,1


In [8]:
# Previewing the bottom of our dataset
#
spam.tail()

Unnamed: 0,0,0.64,0.64.1,0.1,0.32,0.2,0.3,0.4,0.5,0.6,...,0.40,0.41,0.42,0.778,0.43,0.44,3.756,61,278,1
4595,0.31,0.0,0.62,0.0,0.0,0.31,0.0,0.0,0.0,0.0,...,0.0,0.232,0.0,0.0,0.0,0.0,1.142,3,88,0
4596,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.353,0.0,0.0,1.555,4,14,0
4597,0.3,0.0,0.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.102,0.718,0.0,0.0,0.0,0.0,1.404,6,118,0
4598,0.96,0.0,0.0,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.057,0.0,0.0,0.0,0.0,1.147,5,78,0
4599,0.0,0.0,0.65,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.125,0.0,0.0,1.25,5,40,0


In [9]:
# Checking whether each column has an appropriate datatype
#
spam.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 58 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       4600 non-null   float64
 1   0.64    4600 non-null   float64
 2   0.64.1  4600 non-null   float64
 3   0.1     4600 non-null   float64
 4   0.32    4600 non-null   float64
 5   0.2     4600 non-null   float64
 6   0.3     4600 non-null   float64
 7   0.4     4600 non-null   float64
 8   0.5     4600 non-null   float64
 9   0.6     4600 non-null   float64
 10  0.7     4600 non-null   float64
 11  0.64.2  4600 non-null   float64
 12  0.8     4600 non-null   float64
 13  0.9     4600 non-null   float64
 14  0.10    4600 non-null   float64
 15  0.32.1  4600 non-null   float64
 16  0.11    4600 non-null   float64
 17  1.29    4600 non-null   float64
 18  1.93    4600 non-null   float64
 19  0.12    4600 non-null   float64
 20  0.96    4600 non-null   float64
 21  0.13    4600 non-null   float64
 22  

## 4. External Data Source Validation

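### a) Validation

The attribute description in spambase.names matches the 58 columns loaded above (57 features plus the class label), so we treat the data source as valid.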

## 5. Tidying the Dataset

In [10]:
# Checking for Outliers
#
Q1 = spam.quantile(0.25)
Q3 = spam.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

print((spam < (Q1 - 1.5 * IQR)) | (spam > (Q3 + 1.5 * IQR)))



0           0.00000
0.64        0.00000
0.64.1      0.42000
0.1         0.00000
0.32        0.38250
0.2         0.00000
0.3         0.00000
0.4         0.00000
0.5         0.00000
0.6         0.16000
0.7         0.00000
0.64.2      0.80000
0.8         0.00000
0.9         0.00000
0.10        0.00000
0.32.1      0.10000
0.11        0.00000
1.29        0.00000
1.93        2.64000
0.12        0.00000
0.96        1.27000
0.13        0.00000
0.14        0.00000
0.15        0.00000
0.16        0.00000
0.17        0.00000
0.18        0.00000
0.19        0.00000
0.20        0.00000
0.21        0.00000
0.22        0.00000
0.23        0.00000
0.24        0.00000
0.25        0.00000
0.26        0.00000
0.27        0.00000
0.28        0.00000
0.29        0.00000
0.30        0.00000
0.31        0.00000
0.32.2      0.00000
0.33        0.00000
0.34        0.00000
0.35        0.00000
0.36        0.11000
0.37        0.00000
0.38        0.00000
0.39        0.00000
0.40        0.00000
0.41        0.18800
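
Printing the full boolean frame is hard to read, so a per-column count of flagged values can be more useful. The sketch below applies the same 1.5 × IQR rule to a small made-up frame (`toy` and its column names are assumptions for illustration). It also shows a consequence visible in the IQR output above: when a column's IQR is 0, every value that differs from the quartiles is flagged.

```python
import pandas as pd

# Toy data: one column that is mostly zero (like many word frequencies here)
# and one column with a single extreme value.
toy = pd.DataFrame({
    "mostly_zero": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5],
    "spread": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# True where a value lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for its column.
outlier_mask = (toy < (q1 - 1.5 * iqr)) | (toy > (q3 + 1.5 * iqr))

# Number of flagged values per column.
outlier_counts = outlier_mask.sum()
print(outlier_counts)
```

In `mostly_zero` the IQR is 0, so the single non-zero value is flagged even though it is small. For sparse frequency columns this rule may therefore flag ordinary non-zero values rather than genuine anomalies.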


In [11]:
# Checking for duplicates
#
print(spam.duplicated().sum())


391


In [12]:
# dropping the 391 duplicate rows, keeping the first occurrence of each
#
spam.drop_duplicates(keep='first', inplace=True)

In [14]:
# rename columns
#
# Map the value-like labels produced at load time to descriptive names.
# (Named column_names so the built-in dict is not shadowed.)
column_names = {
        # word frequencies
        '0': 'freq_make',
        '0.64': 'freq_address',
        '0.64.1': 'freq_all',
        '0.1': 'fred_3d',  # typo for 'freq_3d'; the outputs below use this name
        '0.32': 'freq_our',
        '0.2': 'freq_over',
        '0.3': 'freq_remove',
        '0.4': 'freq_internet',
        '0.5': 'freq_order',
        '0.6': 'freq_mail',
        '0.7': 'freq_receive',
        '0.64.2': 'freq_will',
        '0.8': 'freq_people',
        '0.9': 'freq_report',
        '0.10': 'freq_addresses',
        '0.32.1': 'freq_free',
        '0.11': 'freq_business',
        '1.29': 'freq_email',
        '1.93': 'freq_you',
        '0.12': 'freq_credit',
        '0.96': 'freq_your',
        '0.13': 'freq_font',
        '0.14': 'freq_000',
        '0.15': 'freq_money',
        '0.16': 'freq_hp',
        '0.17': 'freq_hpl',
        '0.18': 'freq_george',
        '0.19': 'freq_650',
        '0.20': 'freq_lab',
        '0.21': 'freq_labs',
        '0.22': 'freq_telnet',
        '0.23': 'freq_857',
        '0.24': 'freq_data',
        '0.25': 'freq_415',
        '0.26': 'freq_85',
        '0.27': 'freq_technology',
        '0.28': 'freq_1999',
        '0.29': 'freq_parts',
        '0.30': 'freq_pm',
        '0.31': 'freq_direct',
        '0.32.2': 'freq_cs',
        '0.33': 'freq_meeting',
        '0.34': 'freq_original',
        '0.35': 'freq_project',
        '0.36': 'freq_re',
        '0.37': 'freq_edu',
        '0.38': 'freq_table',
        '0.39': 'freq_conference',
        # character frequencies
        '0.40': 'freq_;',
        '0.41': 'freq_(',
        '0.42': 'freq_[',
        '0.778': 'freq_!',
        '0.43': 'freq_$',
        '0.44': 'freq_#',
        # capital-run-length statistics
        '3.756': 'cap_len_av',
        '61': 'cap_len_lon',
        '278': 'cap_len_tot',
        # target: 1 = spam, 0 = not spam
        '1': 'spam'}

spam.rename(columns=column_names, inplace=True)
spam.head()

Unnamed: 0,freq_make,freq_address,freq_all,fred_3d,freq_our,freq_over,freq_remove,freq_internet,freq_order,freq_mail,...,freq_;,freq_(,freq_[,freq_!,freq_$,freq_#,cap_len_av,cap_len_lon,cap_len_tot,spam
0,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
2,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,...,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54,1


## 6. Exploratory Analysis

In [15]:
# univariate summary
#
spam.describe()

Unnamed: 0,freq_make,freq_address,freq_all,fred_3d,freq_our,freq_over,freq_remove,freq_internet,freq_order,freq_mail,...,freq_;,freq_(,freq_[,freq_!,freq_$,freq_#,cap_len_av,cap_len_lon,cap_len_tot,spam
count,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,...,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0
mean,0.104391,0.11253,0.29139,0.063093,0.325322,0.096679,0.117503,0.108026,0.091882,0.248479,...,0.040413,0.144082,0.01738,0.281018,0.076075,0.045809,5.384282,52.1378,291.18508,0.39867
std,0.300036,0.454241,0.515752,1.352647,0.687887,0.276059,0.397327,0.410328,0.282174,0.656705,...,0.252562,0.27428,0.105743,0.843387,0.239734,0.435976,33.151287,199.605834,618.72831,0.489683
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.627,7.0,40.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.073,0.0,0.016,0.0,0.0,2.297,15.0,101.0,0.0
75%,0.0,0.0,0.44,0.0,0.41,0.0,0.0,0.0,0.0,0.19,...,0.0,0.194,0.0,0.331,0.053,0.0,3.706,44.0,273.0,1.0
max,4.54,14.28,5.1,42.81,10.0,5.88,7.27,11.11,5.26,18.18,...,4.385,9.752,4.081,32.478,6.003,19.829,1102.5,9989.0,15841.0,1.0


In [16]:
print(spam.median())
print('*' * 50)
print(spam.mode())
print('*' * 50)
print(spam.skew())
print('*' * 50)
print(spam.kurt())

freq_make            0.000
freq_address         0.000
freq_all             0.000
fred_3d              0.000
freq_our             0.000
freq_over            0.000
freq_remove          0.000
freq_internet        0.000
freq_order           0.000
freq_mail            0.000
freq_receive         0.000
freq_will            0.190
freq_people          0.000
freq_report          0.000
freq_addresses       0.000
freq_free            0.000
freq_business        0.000
freq_email           0.000
freq_you             1.360
freq_credit          0.000
freq_your            0.290
freq_font            0.000
freq_000             0.000
freq_money           0.000
freq_hp              0.000
freq_hpl             0.000
freq_george          0.000
freq_650             0.000
freq_lab             0.000
freq_labs            0.000
freq_telnet          0.000
freq_857             0.000
freq_data            0.000
freq_415             0.000
freq_85              0.000
freq_technology      0.000
freq_1999            0.000
f

In [17]:
# Computing the bivariate summary: pairwise correlations between columns
#
spam.corr()

Unnamed: 0,freq_make,freq_address,freq_all,fred_3d,freq_our,freq_over,freq_remove,freq_internet,freq_order,freq_mail,...,freq_;,freq_(,freq_[,freq_!,freq_$,freq_#,cap_len_av,cap_len_lon,cap_len_tot,spam
freq_make,1.0,0.034114,0.063267,0.005384,0.02174,0.05504,0.011037,-0.004392,0.103818,0.040799,...,-0.027262,-0.01968,-0.034798,0.059236,0.101934,-0.009158,0.044779,0.059132,0.084126,0.129321
freq_address,0.034114,1.0,0.027609,-0.008936,0.036983,0.005623,0.077468,0.013781,0.053525,0.184441,...,0.014782,-0.028209,-0.019394,0.031702,0.044449,0.030702,0.029373,0.053866,0.034601,0.100346
freq_all,0.063267,0.027609,1.0,-0.019895,0.065681,0.066608,0.02863,0.007263,0.077618,0.025201,...,-0.036754,-0.024462,-0.03578,0.097231,0.073,-0.004503,0.095684,0.092511,0.051967,0.172193
fred_3d,0.005384,-0.008936,-0.019895,1.0,0.000256,-0.009167,0.014286,0.003798,-0.001044,-0.003849,...,-3e-05,-0.01145,-0.007516,-0.003861,0.008269,0.000133,0.005754,0.022106,0.023784,0.056407
freq_our,0.02174,0.036983,0.065681,0.000256,1.0,0.041392,0.135958,0.02341,0.014482,0.027232,...,-0.035049,-0.054583,-0.027988,0.019205,0.040696,0.002144,0.050832,0.042985,-0.010498,0.230117
freq_over,0.05504,0.005623,0.066608,-0.009167,0.041392,1.0,0.046844,0.079683,0.097012,0.010068,...,-0.021758,-0.011628,-0.016743,0.058173,0.105903,0.020405,-0.013559,0.065714,0.063402,0.212455
freq_remove,0.011037,0.077468,0.02863,0.014286,0.135958,0.046844,1.0,0.033675,0.049216,0.05567,...,-0.034092,-0.061357,-0.029478,0.051036,0.067215,0.0493,0.039171,0.050828,-0.017082,0.334605
freq_internet,-0.004392,0.013781,0.007263,0.003798,0.02341,0.079683,0.033675,1.0,0.106872,0.079023,...,-0.028986,-0.042785,-0.021393,0.029069,0.05355,-0.008128,0.009974,0.035606,0.036894,0.20078
freq_order,0.103818,0.053525,0.077618,-0.001044,0.014482,0.097012,0.049216,0.106872,1.0,0.123341,...,-0.015016,-0.03742,0.017308,0.035985,0.152436,-0.002084,0.110786,0.166304,0.233199,0.221591
freq_mail,0.040799,0.184441,0.025201,-0.003849,0.027232,0.010068,0.05567,0.079023,0.123341,1.0,...,0.006757,-0.005008,0.003785,0.031657,0.077378,0.034473,0.073125,0.101552,0.078587,0.131822


In [18]:
# correlation heatmap
#
plt.figure(figsize=(25, 25))
sns.heatmap(spam.corr(), annot=True)
plt.show()
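
An annotated heatmap of this many columns can be difficult to read. As a complement, the features can be ranked by their correlation with the target. The sketch below does this on synthetic data; the frame `toy` and the columns `strong`, `weak` and `noise` are made up for illustration, and only the column name `spam` mirrors the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic binary target and three features with different signal strength.
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "strong": target + rng.normal(0, 0.3, size=n),  # closely follows the target
    "weak": target + rng.normal(0, 3.0, size=n),    # mostly noise
    "noise": rng.normal(0, 1.0, size=n),            # unrelated to the target
    "spam": target,
})

# Correlation of every feature with the target, strongest first.
corr_with_target = (
    toy.corr()["spam"]
    .drop("spam")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

On the notebook's frame the same expression, applied to `spam`, would list the features most associated with the label without having to scan the full matrix.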

## 7. Implementing the Solution

In [20]:
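# drop redundant or highly correlated columns
#
# Feature matrix before any columns are dropped (used in the normalization cell below)
X = spam.iloc[:, 0:57].values

# Absolute pairwise correlations; keep only the upper triangle so each pair is checked once
corr_matrix = spam.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# Drop by column label rather than passing a DataFrame as the labels
spam1 = spam.drop(columns=to_drop)
spam1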




Unnamed: 0,freq_make,freq_address,freq_all,fred_3d,freq_our,freq_over,freq_remove,freq_internet,freq_order,freq_mail,...,freq_;,freq_(,freq_[,freq_!,freq_$,freq_#,cap_len_av,cap_len_lon,cap_len_tot,spam
0,0.21,0.28,0.50,0.0,0.14,0.28,0.21,0.07,0.00,0.94,...,0.000,0.132,0.0,0.372,0.180,0.048,5.114,101,1028,1
1,0.06,0.00,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.010,0.143,0.0,0.276,0.184,0.010,9.821,485,2259,1
2,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.137,0.0,0.137,0.000,0.000,3.537,40,191,1
3,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.135,0.0,0.135,0.000,0.000,3.537,40,191,1
4,0.00,0.00,0.00,0.0,1.85,0.00,0.00,1.85,0.00,0.00,...,0.000,0.223,0.0,0.000,0.000,0.000,3.000,15,54,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4595,0.31,0.00,0.62,0.0,0.00,0.31,0.00,0.00,0.00,0.00,...,0.000,0.232,0.0,0.000,0.000,0.000,1.142,3,88,0
4596,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.353,0.000,0.000,1.555,4,14,0
4597,0.30,0.00,0.30,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.102,0.718,0.0,0.000,0.000,0.000,1.404,6,118,0
4598,0.96,0.00,0.00,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.057,0.0,0.000,0.000,0.000,1.147,5,78,0
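
To make the upper-triangle filter used above easier to follow, here is a small sketch of the same technique on a made-up three-column frame (`toy`, `a`, `b` and `c` are assumptions for illustration). Column `b` is almost a multiple of `a`, so it is the one that gets dropped.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 6.0, 8.1, 10.0],  # nearly 2 * a
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to a and b
})

corr_matrix = toy.corr().abs()

# Boolean mask that is True strictly above the diagonal, so each pair
# of columns appears once and self-correlations are ignored.
mask = np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
upper = corr_matrix.where(mask)

# A column is dropped if it correlates above 0.95 with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

Because only the upper triangle is inspected, the later column of each highly correlated pair is removed and the earlier one is kept.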


In [21]:
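# normalize data
#
from sklearn.preprocessing import Normalizer

# Normalizer rescales each row (sample) to unit L2 norm.
# The result is only displayed here; it is not assigned, so the model below does not use it.
normalizer = Normalizer(norm='l2')
normalizer.fit_transform(X)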

array([[2.03297105e-04, 2.71062806e-04, 4.84040726e-04, ...,
        4.95076854e-03, 9.77762266e-02, 9.95187732e-01],
       [2.59683978e-05, 0.00000000e+00, 3.07292708e-04, ...,
        4.25059392e-03, 2.09911216e-01, 9.77710178e-01],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        1.81192865e-02, 2.04911354e-01, 9.78451717e-01],
       ...,
       [2.53812946e-03, 0.00000000e+00, 2.53812946e-03, ...,
        1.18784459e-02, 5.07625892e-02, 9.98330921e-01],
       [1.22759766e-02, 0.00000000e+00, 0.00000000e+00, ...,
        1.46672345e-02, 6.39373781e-02, 9.97423098e-01],
       [0.00000000e+00, 0.00000000e+00, 1.59858723e-02, ...,
        3.07420620e-02, 1.22968248e-01, 9.83745986e-01]])
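
In the array above the last column is close to 1 in every row, which suggests the largest-valued feature dominates each row's norm. The sketch below contrasts row-wise L2 normalization with column-wise standardization on a made-up array (`X_toy` is an assumption for illustration), to show what each transformer changes.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Made-up samples: two small features and one much larger one.
X_toy = np.array([
    [0.2, 5.0, 1000.0],
    [0.0, 3.0, 50.0],
    [1.5, 9.0, 2000.0],
])

# Normalizer works per row: every sample ends up with L2 norm 1.
X_l2 = Normalizer(norm="l2").fit_transform(X_toy)
row_norms = np.linalg.norm(X_l2, axis=1)

# StandardScaler works per column: every feature ends up with mean 0.
X_std = StandardScaler().fit_transform(X_toy)
col_means = X_std.mean(axis=0)

print(row_norms)
print(X_l2[0])
print(col_means)
```

After row-wise normalization the large feature carries almost all of each row's norm, while standardization puts the features on a comparable scale. The modelling cells in this notebook use `StandardScaler`.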

In [22]:
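# defining the variables and splitting into train and test
#
# Select features and target by name rather than by position, so the split
# stays correct whichever columns were dropped above
X = spam1.drop(columns='spam').values
y = spam1['spam'].values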

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [23]:
# Training our model
# 
gaussian = GaussianNB()  
model = gaussian.fit(X_train, y_train) 

In [24]:
# Predicting on the test set and computing the accuracy
#
predicted = model.predict(X_test)
print(np.mean(predicted == y_test))

0.8171021377672208


In [25]:
# cross validating the model (5 folds on the training set)
#
all_accuracies = cross_val_score(estimator=gaussian, X=X_train, y=y_train, cv=5)

print(all_accuracies.mean())

0.8286281806517607
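
In the cross-validation above, the training data was standardized once before being split into folds, so each validation fold has influenced the scaler. One common way to avoid this is to wrap the scaler and the model in a pipeline, so the scaler is refitted inside every fold. The sketch below shows the idea on synthetic data from `make_classification`; `X_demo`, `y_demo` and `pipeline` are placeholder names, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled feature matrix and labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, random_state=0
)

# The scaler is fitted on the training part of each fold only.
pipeline = make_pipeline(StandardScaler(), GaussianNB())

scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
print(scores)
print(scores.mean())
```

For this model the practical difference may be small, but the pipeline keeps the cross-validated estimate free of information from the validation folds.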


In [26]:
# Model accuracy, predictions and confusion matrix
#
y_pred = gaussian.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.8171021377672208
[[353 142]
 [ 12 335]]
              precision    recall  f1-score   support

           0       0.97      0.71      0.82       495
           1       0.70      0.97      0.81       347

    accuracy                           0.82       842
   macro avg       0.83      0.84      0.82       842
weighted avg       0.86      0.82      0.82       842
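
The figures in the report can be reproduced by hand from the confusion matrix, which helps in reading it. scikit-learn puts true labels in rows and predicted labels in columns, so the matrix above contains 142 non-spam e-mails predicted as spam and 12 spam e-mails predicted as non-spam. The sketch below recomputes the main metrics from those four counts.

```python
import numpy as np

# Confusion matrix from the 80/20 split above:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
cm = np.array([
    [353, 142],
    [12, 335],
])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all e-mails classified correctly
precision_spam = tp / (tp + fp)   # share of predicted spam that is spam
recall_spam = tp / (tp + fn)      # share of spam that was caught
recall_ham = tn / (tn + fp)       # share of non-spam left alone

print(round(accuracy, 4))
print(round(precision_spam, 2), round(recall_spam, 2), round(recall_ham, 2))
```

The model catches nearly all spam, but its precision for class 1 is about 0.70 because it flags a sizeable share of legitimate e-mails.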



Training on 70% of the data:

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [28]:
# Training our model
# 
gaussian = GaussianNB()  
model = gaussian.fit(X_train, y_train) 

In [29]:
# Predicting on the test set and computing the accuracy
#
predicted = model.predict(X_test)
print(np.mean(predicted == y_test))

0.8266033254156769


In [30]:
# cross validating the model (5 folds on the training set)
#
all_accuracies = cross_val_score(estimator=gaussian, X=X_train, y=y_train, cv=5)

print(all_accuracies.mean())

0.8255227187706827


In [31]:
# Model accuracy, predictions and confusion matrix
#
y_pred = gaussian.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.8266033254156769
[[538 199]
 [ 20 506]]
              precision    recall  f1-score   support

           0       0.96      0.73      0.83       737
           1       0.72      0.96      0.82       526

    accuracy                           0.83      1263
   macro avg       0.84      0.85      0.83      1263
weighted avg       0.86      0.83      0.83      1263



Training on 60% of the data:

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [33]:
# Training our model
# 
gaussian = GaussianNB()  
model = gaussian.fit(X_train, y_train) 

In [34]:
# Predicting on the test set and computing the accuracy
#
predicted = model.predict(X_test)
print(np.mean(predicted == y_test))

0.831353919239905


In [35]:
# cross validating the model (5 folds on the training set)
#
all_accuracies = cross_val_score(estimator=gaussian, X=X_train, y=y_train, cv=5)

print(all_accuracies.mean())

0.8332673267326733


In [36]:
# Model accuracy, predictions and confusion matrix
#
y_pred = gaussian.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.831353919239905
[[737 257]
 [ 27 663]]
              precision    recall  f1-score   support

           0       0.96      0.74      0.84       994
           1       0.72      0.96      0.82       690

    accuracy                           0.83      1684
   macro avg       0.84      0.85      0.83      1684
weighted avg       0.86      0.83      0.83      1684



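All three models meet the 80% accuracy target. The model trained on 60% of the data has the highest test accuracy at about 83.1%, compared with 82.7% for the 70% split and 81.7% for the 80% split, so the differences between splits are small.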
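
The three runs above repeat the same steps with a different `test_size`. A possible way to express that comparison more compactly is a small helper that is called once per split. The sketch below uses synthetic data from `make_classification`; `evaluate_split`, `X_demo` and `y_demo` are hypothetical names introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's feature matrix and labels.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, random_state=0
)


def evaluate_split(X, y, test_size, random_state=0):
    """Split, standardize, fit Gaussian Naive Bayes, return test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)  # fit on training data only
    X_test = scaler.transform(X_test)
    model = GaussianNB().fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# One accuracy per test size, mirroring the 80/20, 70/30 and 60/40 splits.
results = {ts: evaluate_split(X_demo, y_demo, ts) for ts in (0.2, 0.3, 0.4)}
for test_size, acc in results.items():
    print(f"test_size={test_size}: accuracy={acc:.3f}")
```

Keeping the steps in one function also makes it harder for the splits to drift apart, for example by scaling one of them differently.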

## 8. Challenging the Solution

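Accuracy alone does not settle the question. In every split the confusion matrix shows that most errors are legitimate e-mails flagged as spam (142 of 495 in the 80/20 split), which is usually the costlier mistake for a spam filter. Further analysis and optimization techniques are required to confirm the validity of the accuracy scores achieved.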
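
One simple way to challenge the scores is to compare them with a majority-class baseline. The `describe()` output in section 6 gives a mean of about 0.40 for the `spam` column, so a model that always predicts "not spam" would already be right about 60% of the time. The sketch below illustrates this with scikit-learn's `DummyClassifier` on made-up labels in that 60/40 proportion; `X_demo` and `y_demo` are placeholders, not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 60 non-spam (0) and 40 spam (1), roughly the class
# balance reported by describe() after duplicates were dropped.
y_demo = np.array([0] * 60 + [1] * 40)

# The baseline ignores the features, so a constant column is enough.
X_demo = np.zeros((100, 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)

print(baseline_accuracy)
```

Measured against a baseline of roughly 60%, an accuracy of 82% to 83% is a clear improvement, though it still leaves room for the further checks mentioned above.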

## 9. Follow-up Questions

### a) Did we have the right data?

Yes.

### b) Do we need other data to answer our question?

More up-to-date data is required, since the data at hand is from 1999.

### c) Did we have the right question?

Yes.