# DEFINING THE QUESTION

## a) Specifying the Question

This week's project requires us to implement a Naive Bayes classifier and calculate the resulting metrics. We will try to determine whether an e-mail is spam.

---

## b) Defining the Metric for Success

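Being able to accurately predict spam e-mails, as measured by the accuracy, precision, recall, and f1-score of the classifier on held-out test data.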

---

## c) Understanding the context

The last column of 'spambase.data' denotes whether the e-mail was considered spam (1) or not (0), where spam means unsolicited commercial e-mail. Most of the attributes indicate whether a particular word or character occurred frequently in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. Here are the definitions of the attributes:

48 continuous real [0,100] attributes of type word_freq_WORD
= percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.

6 continuous real [0,100] attributes of type char_freq_CHAR
= percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail

1 continuous real [1,...] attribute of type capital_run_length_average
= average length of uninterrupted sequences of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_longest
= length of longest uninterrupted sequence of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_total
= sum of length of uninterrupted sequences of capital letters
= total number of capital letters in the e-mail

1 nominal {0,1} class attribute of type spam
= denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
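
The definitions above can be illustrated with a small sketch. This is not the code used to build the dataset; the function names are hypothetical, and the case-insensitive word match is an assumption, since the documentation does not say how case was handled.

```python
import re

def word_freq(text: str, word: str) -> float:
    """Percentage of words in `text` that equal `word`.

    A "word" is any run of alphanumeric characters, as in the definition above.
    Matching is case-insensitive here, which is an assumption.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    matches = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * matches / len(words)

def char_freq(text: str, char: str) -> float:
    """Percentage of characters in `text` that equal `char`."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)

def capital_runs(text: str) -> list:
    """Lengths of the uninterrupted sequences of capital letters in `text`."""
    return [len(run) for run in re.findall(r"[A-Z]+", text)]

sample = "Free money! Claim your free prize now!"
print(word_freq(sample, "free"))   # share of words equal to "free"
print(char_freq(sample, "!"))      # share of characters equal to "!"

runs = capital_runs("HELLO there, BUY NOW")
print(sum(runs) / len(runs), max(runs), sum(runs))  # average, longest, total
```

The three values printed on the last line correspond to capital_run_length_average, capital_run_length_longest, and capital_run_length_total.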

---

## d) Experimental Design
1. Read and explore the given dataset.

2. Find and deal with outliers, anomalies, and missing data within the dataset.

3. Perform Exploratory Data Analysis.

4. Perform Naive Bayes Classification.

5. Provide a recommendation based on your analysis.

6. Challenge your solution by providing insights on how the model can be improved.

---

## e) Data Relevance

The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...

This collection of spam e-mails came from the collector's postmaster and individuals who had filed spam. The collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.

---


# DATA PREPARATION

In [1]:
import pandas as pd
import numpy as np

In [2]:
pd.read_csv('/content/spambase.data', header=None)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57
0,0.00,0.64,0.64,0.0,0.32,0.00,0.00,0.00,0.00,0.00,0.00,0.64,0.00,0.00,0.00,0.32,0.00,1.29,1.93,0.00,0.96,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.000,0.000,0.0,0.778,0.000,0.000,3.756,61,278,1
1,0.21,0.28,0.50,0.0,0.14,0.28,0.21,0.07,0.00,0.94,0.21,0.79,0.65,0.21,0.14,0.14,0.07,0.28,3.47,0.00,1.59,0.0,0.43,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.000,0.132,0.0,0.372,0.180,0.048,5.114,101,1028,1
2,0.06,0.00,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,0.38,0.45,0.12,0.00,1.75,0.06,0.06,1.03,1.36,0.32,0.51,0.0,1.16,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.06,0.0,0.0,0.12,0.00,0.06,0.06,0.0,0.0,0.010,0.143,0.0,0.276,0.184,0.010,9.821,485,2259,1
3,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.00,0.00,0.31,0.00,0.00,3.18,0.00,0.31,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.000,0.137,0.0,0.137,0.000,0.000,3.537,40,191,1
4,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.00,0.00,0.31,0.00,0.00,3.18,0.00,0.31,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.000,0.135,0.0,0.135,0.000,0.000,3.537,40,191,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4596,0.31,0.00,0.62,0.0,0.00,0.31,0.00,0.00,0.00,0.00,0.00,1.88,0.00,0.00,0.00,0.00,0.00,0.00,0.62,0.00,0.00,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.31,0.31,0.31,0.0,0.0,0.000,0.232,0.0,0.000,0.000,0.000,1.142,3,88,0
4597,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,6.00,0.00,2.00,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.00,2.00,0.0,0.0,0.000,0.000,0.0,0.353,0.000,0.000,1.555,4,14,0
4598,0.30,0.00,0.30,0.0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.80,0.30,0.00,0.00,0.00,0.00,0.90,1.50,0.00,0.30,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.00,1.20,0.0,0.0,0.102,0.718,0.0,0.000,0.000,0.000,1.404,6,118,0
4599,0.96,0.00,0.00,0.0,0.32,0.00,0.00,0.00,0.00,0.00,0.00,0.32,0.00,0.00,0.00,0.00,0.00,0.00,1.93,0.00,0.32,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.32,0.00,0.32,0.0,0.0,0.000,0.057,0.0,0.000,0.000,0.000,1.147,5,78,0


In [3]:
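# 'delimiter' is not expected to occur in the file, so each line is read as a single column;
# engine='python' avoids the fallback warning raised for multi-character separators
pd.read_csv('/content/spambase.names', sep='delimiter', header=None, engine='python')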


Unnamed: 0,0
0,| SPAM E-MAIL DATABASE ATTRIBUTES (in .names f...
1,|
2,"| 48 continuous real [0,100] attributes of typ..."
3,| = percentage of words in the e-mail that mat...
4,| i.e. 100 * (number of times the WORD appears...
...,...
82,char_freq_$: continuous.
83,char_freq_#: continuous.
84,capital_run_length_average: continuous.
85,capital_run_length_longest: continuous.
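
The attribute lines shown above (for example `char_freq_$: continuous.`) could be used to name the columns of the data file. The following is a minimal sketch, assuming that comment lines start with `|` and that every attribute line has the form `name: type.`; the helper name, the sample lines, and the label column name `spam` are all hypothetical.

```python
def parse_attribute_names(lines):
    """Collect attribute names from lines shaped like 'name: continuous.'."""
    names = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and '|' comment lines
        if not line or line.startswith("|"):
            continue
        name, sep, _ = line.partition(":")
        if sep:  # only lines that contain a colon declare an attribute
            names.append(name.strip())
    return names

# Illustrative lines only; the last four mirror the output shown above
sample_lines = [
    "| a comment line",
    "|",
    "0, 1.    | hypothetical class declaration line",
    "",
    "char_freq_$: continuous.",
    "char_freq_#: continuous.",
    "capital_run_length_average: continuous.",
    "capital_run_length_longest: continuous.",
]

names = parse_attribute_names(sample_lines)
print(names)

# With the real file, the names could then be passed to pandas, e.g.:
#   with open('spambase.names') as fh:
#       columns = parse_attribute_names(fh) + ['spam']
#   df = pd.read_csv('spambase.data', header=None, names=columns)
```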


In [4]:
# Same approach as above: read each line of the documentation file as a single column
pd.read_csv('/content/spambase.DOCUMENTATION', sep='delimiter', header=None, engine='python')


Unnamed: 0,0
0,1. Title: SPAM E-mail Database
1,2. Sources:
2,"(a) Creators: Mark Hopkins, Erik Reeber, Georg..."
3,"Hewlett-Packard Labs, 1501 Page Mill Rd., Palo..."
4,(b) Donor: George Forman (gforman at nospam hp...
...,...
117,56 1 9989 52.173 194.89 374
118,57 1 15841 283.29 606.35 214
119,58 0 1 0.39404 0.4887 124
120,This file: 'spambase.DOCUMENTATION' at the UCI...


In [5]:
# Alternative reading of data

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data

--2021-03-05 08:08:48--  https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 702942 (686K) [application/x-httpd-php]
Saving to: ‘spambase.data.1’


2021-03-05 08:08:49 (11.6 MB/s) - ‘spambase.data.1’ saved [702942/702942]



In [6]:
group_name = 'spam'

# Loads the CSV data
df = pd.read_csv('spambase.data', header=None)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57
0,0.00,0.64,0.64,0.0,0.32,0.00,0.00,0.00,0.00,0.00,0.00,0.64,0.00,0.00,0.00,0.32,0.00,1.29,1.93,0.00,0.96,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.000,0.000,0.0,0.778,0.000,0.000,3.756,61,278,1
1,0.21,0.28,0.50,0.0,0.14,0.28,0.21,0.07,0.00,0.94,0.21,0.79,0.65,0.21,0.14,0.14,0.07,0.28,3.47,0.00,1.59,0.0,0.43,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.000,0.132,0.0,0.372,0.180,0.048,5.114,101,1028,1
2,0.06,0.00,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,0.38,0.45,0.12,0.00,1.75,0.06,0.06,1.03,1.36,0.32,0.51,0.0,1.16,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.06,0.0,0.0,0.12,0.00,0.06,0.06,0.0,0.0,0.010,0.143,0.0,0.276,0.184,0.010,9.821,485,2259,1
3,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.00,0.00,0.31,0.00,0.00,3.18,0.00,0.31,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.000,0.137,0.0,0.137,0.000,0.000,3.537,40,191,1
4,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.00,0.00,0.31,0.00,0.00,3.18,0.00,0.31,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.000,0.135,0.0,0.135,0.000,0.000,3.537,40,191,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4596,0.31,0.00,0.62,0.0,0.00,0.31,0.00,0.00,0.00,0.00,0.00,1.88,0.00,0.00,0.00,0.00,0.00,0.00,0.62,0.00,0.00,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.31,0.31,0.31,0.0,0.0,0.000,0.232,0.0,0.000,0.000,0.000,1.142,3,88,0
4597,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,6.00,0.00,2.00,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.00,2.00,0.0,0.0,0.000,0.000,0.0,0.353,0.000,0.000,1.555,4,14,0
4598,0.30,0.00,0.30,0.0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.80,0.30,0.00,0.00,0.00,0.00,0.90,1.50,0.00,0.30,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.00,1.20,0.0,0.0,0.102,0.718,0.0,0.000,0.000,0.000,1.404,6,118,0
4599,0.96,0.00,0.00,0.0,0.32,0.00,0.00,0.00,0.00,0.00,0.00,0.32,0.00,0.00,0.00,0.00,0.00,0.00,1.93,0.00,0.32,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.32,0.00,0.32,0.0,0.0,0.000,0.057,0.0,0.000,0.000,0.000,1.147,5,78,0


In [7]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.64,0.0,0.0,0.0,0.32,0.0,1.29,1.93,0.0,0.96,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,0.21,0.79,0.65,0.21,0.14,0.14,0.07,0.28,3.47,0.0,1.59,0.0,0.43,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,0.38,0.45,0.12,0.0,1.75,0.06,0.06,1.03,1.36,0.32,0.51,0.0,1.16,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.12,0.0,0.06,0.06,0.0,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.0,0.0,0.31,0.0,0.0,3.18,0.0,0.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.0,0.0,0.31,0.0,0.0,3.18,0.0,0.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [8]:
df.tail()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57
4596,0.31,0.0,0.62,0.0,0.0,0.31,0.0,0.0,0.0,0.0,0.0,1.88,0.0,0.0,0.0,0.0,0.0,0.0,0.62,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.31,0.31,0.31,0.0,0.0,0.0,0.232,0.0,0.0,0.0,0.0,1.142,3,88,0
4597,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.353,0.0,0.0,1.555,4,14,0
4598,0.3,0.0,0.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.8,0.3,0.0,0.0,0.0,0.0,0.9,1.5,0.0,0.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.2,0.0,0.0,0.102,0.718,0.0,0.0,0.0,0.0,1.404,6,118,0
4599,0.96,0.0,0.0,0.0,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.32,0.0,0.0,0.0,0.0,0.0,0.0,1.93,0.0,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.32,0.0,0.32,0.0,0.0,0.0,0.057,0.0,0.0,0.0,0.0,1.147,5,78,0
4600,0.0,0.0,0.65,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.65,0.0,0.0,0.0,0.0,0.0,4.6,0.0,0.65,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.97,0.65,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,1.25,5,40,0


In [9]:
df.shape

(4601, 58)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 58 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       4601 non-null   float64
 1   1       4601 non-null   float64
 2   2       4601 non-null   float64
 3   3       4601 non-null   float64
 4   4       4601 non-null   float64
 5   5       4601 non-null   float64
 6   6       4601 non-null   float64
 7   7       4601 non-null   float64
 8   8       4601 non-null   float64
 9   9       4601 non-null   float64
 10  10      4601 non-null   float64
 11  11      4601 non-null   float64
 12  12      4601 non-null   float64
 13  13      4601 non-null   float64
 14  14      4601 non-null   float64
 15  15      4601 non-null   float64
 16  16      4601 non-null   float64
 17  17      4601 non-null   float64
 18  18      4601 non-null   float64
 19  19      4601 non-null   float64
 20  20      4601 non-null   float64
 21  21      4601 non-null   float64
 22  

# DATA CLEANING

In [11]:
df.isnull().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
16    0
17    0
18    0
19    0
20    0
21    0
22    0
23    0
24    0
25    0
26    0
27    0
28    0
29    0
30    0
31    0
32    0
33    0
34    0
35    0
36    0
37    0
38    0
39    0
40    0
41    0
42    0
43    0
44    0
45    0
46    0
47    0
48    0
49    0
50    0
51    0
52    0
53    0
54    0
55    0
56    0
57    0
dtype: int64

The dataset has no missing values, so no imputation is needed.

---

# DATA ANALYSIS

In [12]:
df.corr()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57
0,1.0,-0.016759,0.065627,0.013273,0.023119,0.059674,0.007669,-0.00395,0.106263,0.041198,0.188459,0.105801,0.066438,0.03678,0.028439,0.059386,0.081928,0.053324,0.128243,0.021295,0.197049,-0.024349,0.134072,0.188155,-0.072504,-0.061686,-0.066424,-0.04868,-0.041251,-0.052799,-0.039066,-0.032058,-0.041014,-0.02769,-0.044954,-0.054673,-0.057312,-0.00796,-0.011134,-0.036095,-0.009703,-0.02607,-0.024292,-0.022116,-0.037105,-0.034056,-0.000953,-0.017755,-0.026505,-0.021196,-0.033301,0.058292,0.117419,-0.008844,0.044491,0.061382,0.089165,0.126208
1,-0.016759,1.0,-0.033526,-0.006923,-0.02376,-0.02484,0.003918,-0.01628,-0.003826,0.032962,-0.006864,-0.040398,-0.018858,-0.009206,0.00533,-0.009117,-0.01837,0.0335,-0.055476,-0.015806,-0.018191,-0.00885,-0.020502,0.001984,-0.043483,-0.038211,-0.030307,-0.029221,-0.02194,-0.027508,-0.018097,-0.003326,-0.024903,-0.004303,-0.024058,-0.028198,-0.024013,-0.008922,-0.019124,-0.014821,-0.01542,-0.025177,-0.00237,-0.019739,-0.016418,-0.023858,-0.009818,-0.015747,-0.007282,-0.049837,-0.018527,-0.014461,-0.009605,0.001946,0.002083,0.000271,-0.02268,-0.030224
2,0.065627,-0.033526,1.0,-0.020246,0.077734,0.087564,0.036677,0.012003,0.093786,0.032075,0.048254,0.08321,0.047593,0.008552,0.122113,0.063906,0.036262,0.121923,0.139329,0.031111,0.156651,-0.035681,0.123671,0.041145,-0.087924,-0.062459,-0.108886,-0.050648,-0.057726,-0.032547,-0.038927,-0.06187,-0.054759,-0.061706,-0.048335,-0.046504,-0.067015,0.032407,-0.014809,-0.047066,-0.030956,-0.005811,-0.044325,-0.053464,-0.050664,-0.056655,0.029339,-0.026344,-0.033213,-0.016495,-0.03312,0.10814,0.087618,-0.003336,0.097398,0.107463,0.070114,0.196988
3,0.013273,-0.006923,-0.020246,1.0,0.003238,-0.010014,0.019784,0.010268,-0.002454,-0.004947,-0.012976,-0.019221,-0.013199,0.012008,0.002707,0.007432,0.00347,0.019391,-0.010834,-0.005381,0.008176,0.028102,0.011368,0.03536,-0.015181,-0.013708,-0.010684,-0.010368,-0.007798,-0.010476,-0.007529,-0.006717,-0.008075,-0.006729,-0.006122,-0.006515,-0.007761,-0.002669,-0.004602,-0.007643,-0.00567,-0.008095,-0.009268,-0.005933,-0.012957,-0.009181,-0.003348,-0.001924,-0.000591,-0.01237,-0.007148,-0.003138,0.010862,-0.000298,0.00526,0.022081,0.021369,0.057371
4,0.023119,-0.02376,0.077734,0.003238,1.0,0.054054,0.147336,0.029598,0.020823,0.034495,0.068382,0.066788,0.031126,0.003445,0.056177,0.083024,0.143443,0.062344,0.09851,0.031526,0.136605,-0.020207,0.070037,3.9e-05,-0.072502,-0.075456,-0.088011,-0.061501,0.032048,-0.052066,-0.042535,-0.026748,-0.031998,-0.02696,-0.049732,-0.048844,-0.072599,0.130812,-0.042044,-0.021442,-0.047505,0.115041,-0.048879,0.015234,-0.042336,-0.077986,-0.0269,-0.032005,-0.032759,-0.046361,-0.02639,0.025509,0.041582,0.002016,0.052662,0.05229,0.002492,0.24192
5,0.059674,-0.02484,0.087564,-0.010014,0.054054,1.0,0.061163,0.079561,0.117438,0.013897,0.0539,0.009264,0.077631,0.009673,0.173066,0.019865,0.064137,0.07835,0.095505,0.058979,0.106833,0.007956,0.211455,0.059329,-0.084402,-0.087271,-0.069051,-0.066223,-0.048673,-0.048127,-0.046383,-0.036835,-0.034164,-0.037315,-0.054315,-0.052819,-0.057465,-0.017918,-0.047619,-0.029866,-0.029457,-0.054812,-0.030616,-0.028826,-0.053637,-0.033046,-0.014343,-0.031693,-0.019119,-0.008705,-0.015133,0.065043,0.105692,0.019894,-0.010278,0.090172,0.082089,0.232604
6,0.007669,0.003918,0.036677,0.019784,0.147336,0.061163,1.0,0.044545,0.050786,0.056809,0.159578,-0.001461,0.013295,-0.022723,0.042904,0.128436,0.187981,0.122011,0.111792,0.046134,0.130794,-0.002093,0.064795,0.030575,-0.089494,-0.08033,-0.065893,-0.066947,-0.048482,-0.058101,-0.04628,-0.040538,-0.041372,-0.04091,-0.053202,-0.053978,-0.052035,-0.014781,-0.046978,-0.022121,-0.03312,-0.049664,-0.049079,-0.034461,-0.050811,-0.056166,-0.017512,-0.031408,-0.033089,-0.051885,-0.027653,0.053706,0.070127,0.046612,0.041565,0.059677,-0.008344,0.332117
7,-0.00395,-0.01628,0.012003,0.010268,0.029598,0.079561,0.044545,1.0,0.105302,0.083129,0.128495,-0.002973,0.026274,0.012426,0.072782,0.051115,0.216422,0.037738,0.020641,0.109163,0.156905,-0.016192,0.089226,0.034127,-0.053038,-0.04145,-0.057189,-0.049988,-0.037047,-0.043405,-0.035816,-0.034276,-0.03922,-0.034811,-0.035174,-0.033747,-0.017466,-0.012119,-0.030392,-0.005988,-0.003884,-0.043626,-0.004542,-0.030134,-0.002423,-0.037916,-0.006397,-0.021224,-0.027432,-0.032494,-0.019548,0.031454,0.05791,-0.008012,0.011254,0.037575,0.040252,0.206808
8,0.106263,-0.003826,0.093786,-0.002454,0.020823,0.117438,0.050786,0.105302,1.0,0.130624,0.13776,0.030344,0.034738,0.06684,0.238436,0.008269,0.15839,0.098804,0.039017,0.123217,0.159112,-0.019648,0.1268,0.099461,-0.069931,-0.049775,-0.064608,-0.056764,-0.04484,-0.043643,-0.040158,-0.033984,-0.014403,-0.033601,-0.041847,-0.05627,-0.033244,-0.002216,-0.040844,-0.009867,-0.035177,-0.048223,-0.03419,-0.035159,-0.075558,-0.056817,0.007521,-0.026017,-0.014646,-0.031003,0.013601,0.043639,0.149365,-0.000522,0.111308,0.189247,0.248724,0.231551
9,0.041198,0.032962,0.032075,-0.004947,0.034495,0.013897,0.056809,0.083129,0.130624,1.0,0.125319,0.071157,0.045737,0.017901,0.160543,0.025601,0.081363,0.035977,0.093509,0.030859,0.098072,0.0082,0.096809,0.052129,-0.033534,-0.013045,-0.067817,0.019356,-0.026903,0.008677,-0.024423,-0.015137,-0.035366,-0.014434,-0.020092,-0.016955,-0.004944,-0.01795,-0.016091,0.004163,-0.025084,-0.054467,0.0232,-0.026654,-0.032065,-0.030326,-0.015546,-0.016842,0.011945,0.003936,0.007357,0.036737,0.075786,0.04483,0.073677,0.103308,0.087273,0.138962


# MODELING

## PREPARATION

In [13]:
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

X = df.loc[:, 0:56]
y = df.loc[:,57]

## 80:20 Model 

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size= 0.20, random_state= 10)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [15]:
gaus = GaussianNB()

gaus.fit(X_train, y_train)

y_pred = gaus.predict(X_test)

print('Accuracy Score is: ', accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy Score is:  0.8273615635179153
[[407 152]
 [  7 355]]
              precision    recall  f1-score   support

           0       0.98      0.73      0.84       559
           1       0.70      0.98      0.82       362

    accuracy                           0.83       921
   macro avg       0.84      0.85      0.83       921
weighted avg       0.87      0.83      0.83       921



## 70:30 Model

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size= 0.30, random_state= 10)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [17]:
gaus = GaussianNB()

gaus.fit(X_train, y_train)

y_pred = gaus.predict(X_test)

print('Accuracy Score is: ', accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy Score is:  0.8211440984793628
[[595 226]
 [ 21 539]]
              precision    recall  f1-score   support

           0       0.97      0.72      0.83       821
           1       0.70      0.96      0.81       560

    accuracy                           0.82      1381
   macro avg       0.84      0.84      0.82      1381
weighted avg       0.86      0.82      0.82      1381



## 60:40 Model

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size= 0.40, random_state= 10)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [21]:
gaus = GaussianNB()

gaus.fit(X_train, y_train)

y_pred = gaus.predict(X_test)

print('Accuracy Score is: ', accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy Score is:  0.8245518739815317
[[801 289]
 [ 34 717]]
              precision    recall  f1-score   support

           0       0.96      0.73      0.83      1090
           1       0.71      0.95      0.82       751

    accuracy                           0.82      1841
   macro avg       0.84      0.84      0.82      1841
weighted avg       0.86      0.82      0.83      1841



The accuracy scores of the three splits are within one percentage point of each other (0.827, 0.821, and 0.825) and do not fall steadily as the training share shrinks, so the split size has little effect here. The 80:20 split gave the highest accuracy, so we use it for the optimization that follows.
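
As a check on the comparison, the accuracy of each split can be recomputed from the confusion matrices printed above. The sketch below only uses those printed numbers; the variable names are our own.

```python
# Confusion matrices copied from the outputs above
# (rows: actual class 0, 1; columns: predicted class 0, 1)
confusion = {
    "80:20": [[407, 152], [7, 355]],
    "70:30": [[595, 226], [21, 539]],
    "60:40": [[801, 289], [34, 717]],
}

def accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return correct / total

accuracies = {split: accuracy(cm) for split, cm in confusion.items()}
for split, acc in accuracies.items():
    print(f"{split}: {acc:.4f}")

# Gap between the best and the worst split
spread = max(accuracies.values()) - min(accuracies.values())
print(f"spread: {spread:.4f}")
```

The spread is well under one percentage point, so a single random split is not strong evidence that one ratio is better than another.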

## OPTIMIZING 80:20 Model

In [22]:
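corr_matrix = X.corr().abs()

# Select upper triangle of correlation matrix
# (the np.bool alias was removed in NumPy 1.24, so the built-in bool is used)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find feature columns whose correlation with an earlier column is greater than 0.80
to_drop = [column for column in upper.columns if any(upper[column] > 0.80)]

# Preview the features with those columns dropped.
# The result is not assigned, so X itself is unchanged in the later cells.
X.drop(columns=to_drop)

# The preview omits columns 33 and 39, so two features exceed the threshold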

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,35,36,37,38,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56
0,0.00,0.64,0.64,0.0,0.32,0.00,0.00,0.00,0.00,0.00,0.00,0.64,0.00,0.00,0.00,0.32,0.00,1.29,1.93,0.00,0.96,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.000,0.000,0.0,0.778,0.000,0.000,3.756,61,278
1,0.21,0.28,0.50,0.0,0.14,0.28,0.21,0.07,0.00,0.94,0.21,0.79,0.65,0.21,0.14,0.14,0.07,0.28,3.47,0.00,1.59,0.0,0.43,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.000,0.132,0.0,0.372,0.180,0.048,5.114,101,1028
2,0.06,0.00,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,0.38,0.45,0.12,0.00,1.75,0.06,0.06,1.03,1.36,0.32,0.51,0.0,1.16,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.12,0.00,0.06,0.06,0.0,0.0,0.010,0.143,0.0,0.276,0.184,0.010,9.821,485,2259
3,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.00,0.00,0.31,0.00,0.00,3.18,0.00,0.31,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.000,0.137,0.0,0.137,0.000,0.000,3.537,40,191
4,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.00,0.00,0.31,0.00,0.00,3.18,0.00,0.31,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.000,0.135,0.0,0.135,0.000,0.000,3.537,40,191
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4596,0.31,0.00,0.62,0.0,0.00,0.31,0.00,0.00,0.00,0.00,0.00,1.88,0.00,0.00,0.00,0.00,0.00,0.00,0.62,0.00,0.00,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.00,0.31,0.31,0.31,0.0,0.0,0.000,0.232,0.0,0.000,0.000,0.000,1.142,3,88
4597,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,6.00,0.00,2.00,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.00,0.00,0.00,2.00,0.0,0.0,0.000,0.000,0.0,0.353,0.000,0.000,1.555,4,14
4598,0.30,0.00,0.30,0.0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.80,0.30,0.00,0.00,0.00,0.00,0.90,1.50,0.00,0.30,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.00,0.00,0.00,1.20,0.0,0.0,0.102,0.718,0.0,0.000,0.000,0.000,1.404,6,118
4599,0.96,0.00,0.00,0.0,0.32,0.00,0.00,0.00,0.00,0.00,0.00,0.32,0.00,0.00,0.00,0.00,0.00,0.00,1.93,0.00,0.32,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.00,0.32,0.00,0.32,0.0,0.0,0.000,0.057,0.0,0.000,0.000,0.000,1.147,5,78
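
For the correlation filter to affect the model, the reduced frame has to be assigned to a variable and used in the split. A minimal sketch on toy data (not spambase), with hypothetical column names, could look like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Toy frame: "a_copy" is almost identical to "a", "b" is independent
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

corr_matrix = toy.corr().abs()

# Keep only the upper triangle so each pair of columns is checked once
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
upper = corr_matrix.where(mask)

# A column is dropped if it correlates above 0.80 with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.80).any()]

# Assign the result; drop() returns a new frame and leaves `toy` untouched
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

In the notebook, the equivalent step would be to assign the result of `X.drop(columns=to_drop)` to a new variable and pass that to `train_test_split`.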


In [23]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size= 0.20, random_state= 10)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [24]:
# To try to improve the scores, we can address the class imbalance
# indicated by the lower support values of class 1 in all models.

# import library
from imblearn.over_sampling import RandomOverSampler
from collections import Counter

ros = RandomOverSampler(random_state=42)

# Resample the predictors and target so that both classes have the same count.
# Note: resampling before the train/test split means duplicated rows can land
# in both sets, which may make the test scores optimistic.

x_ros, y_ros = ros.fit_resample(X, y)

print('Original dataset shape', Counter(y))
print('Resample dataset shape', Counter(y_ros))

Original dataset shape Counter({0: 2788, 1: 1813})
Resample dataset shape Counter({1: 2788, 0: 2788})




In [25]:
X_train, X_test, y_train, y_test = train_test_split(x_ros, y_ros, test_size=0.20, random_state= 10)

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [26]:
gaus = GaussianNB()

gaus.fit(X_train, y_train)

y_pred = gaus.predict(X_test)

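print('Accuracy Score is: ', accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Compared with the first 80:20 model, accuracy (0.83 to 0.84) and the class 1 f1-score
# (0.82 to 0.86) have improved slightly, while the class 0 f1-score dropped (0.84 to 0.82),
# possibly due to the improved class balance shown by the support values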

Accuracy Score is:  0.8449820788530465
[[398 148]
 [ 25 545]]
              precision    recall  f1-score   support

           0       0.94      0.73      0.82       546
           1       0.79      0.96      0.86       570

    accuracy                           0.84      1116
   macro avg       0.86      0.84      0.84      1116
weighted avg       0.86      0.84      0.84      1116
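
One way to keep duplicated rows out of the test set is to split first and oversample only the training portion. The sketch below illustrates the idea with plain NumPy on toy data; `oversample_minority` is a hand-rolled stand-in for `RandomOverSampler`, not the imblearn implementation, and the data and split sizes are assumptions.

```python
import numpy as np

def oversample_minority(X, y, rng):
    """Randomly duplicate rows until every class has as many rows as the largest class."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    indices = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw the missing rows with replacement from this class
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        indices.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(indices)
    return X[idx], y[idx]

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))          # toy features, not spambase
y = np.array([0] * 70 + [1] * 30)      # imbalanced toy labels

# Hold out the test rows first so they are never duplicated into training
perm = rng.permutation(len(y))
test_idx, train_idx = perm[:20], perm[20:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Balance only the training portion
X_train_bal, y_train_bal = oversample_minority(X_train, y_train, rng)

print(np.bincount(y_train), np.bincount(y_train_bal), len(y_test))
```

With this ordering, the test set keeps the original class ratio and contains no copies of training rows.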



# CONCLUSION

Precision tells us what share of the e-mails predicted as spam really were spam, so it is lowered by false positives. Recall tells us what share of the actual spam e-mails were caught, so it is lowered by false negatives. The f1-score is the harmonic mean of precision and recall and balances the two. For this challenge, the f1-score is the best metric to work with because it avoids the accuracy paradox, where a high accuracy score hides poor performance on one class. Using the f1-score, the 80:20 split model is the best model.
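
As a worked example, the class 1 (spam) metrics of the first 80:20 model can be recomputed from its confusion matrix. Only the printed matrix is used; the variable names are our own.

```python
# Confusion matrix of the first 80:20 Gaussian NB model
# (rows: actual class 0, 1; columns: predicted class 0, 1)
cm = [[407, 152], [7, 355]]

tn, fp = cm[0]  # actual non-spam: predicted non-spam, predicted spam
fn, tp = cm[1]  # actual spam: predicted non-spam, predicted spam

precision = tp / (tp + fp)   # share of predicted spam that is really spam
recall = tp / (tp + fn)      # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.70 recall=0.98 f1=0.82
```

These values agree with the class 1 row of the classification report for that model.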

---

# CHALLENGING THE SOLUTION

In [27]:
# Using Multinomial

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.1, random_state=69)

model = MultinomialNB()
model.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [28]:
from sklearn.model_selection import GridSearchCV

# Candidate alpha values, log-spaced from 1 to 1e10 (values below 1 are not searched)
alpha = np.logspace(0,10,20)
hyperparams = dict(alpha = alpha)

clf = GridSearchCV(model, hyperparams, cv=5)

best_model = clf.fit(X_train, y_train)

In [29]:
# Viewing best hyperparameters
print('Best alpha:', best_model.best_estimator_.get_params()['alpha'])
print('Best Class_prior:', best_model.best_estimator_.get_params()['class_prior'])
print('Best Fit_prior:', best_model.best_estimator_.get_params()['fit_prior'])

# The best hyperparameters match the defaults of the basic model.
# Note that alpha=1.0 is also the smallest value in the grid.

Best alpha: 1.0
Best Class_prior: None
Best Fit_prior: True


In [30]:
y_pred = model.predict(X_test)

print('Accuracy Score is: ', accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# The values are lower than the Gaussian NB results, suggesting that Gaussian NB was our best model.
# This run uses a different split (90:10), so the comparison is indicative only.

Accuracy Score is:  0.7722342733188721
[[230  55]
 [ 50 126]]
              precision    recall  f1-score   support

           0       0.82      0.81      0.81       285
           1       0.70      0.72      0.71       176

    accuracy                           0.77       461
   macro avg       0.76      0.76      0.76       461
weighted avg       0.77      0.77      0.77       461
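
The grid used above starts at alpha = 1, so smaller smoothing values are never tried. As a sketch only, on synthetic count data rather than spambase, the search could also cover values below 1 and score the refitted best estimator directly. The grid bounds, the Poisson toy data, and all variable names here are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Synthetic non-negative, count-like features (NOT the spambase data)
n = 400
y = rng.integers(0, 2, size=n)
rates = np.where(y[:, None] == 1, [3.0, 1.0, 0.5], [1.0, 1.0, 2.0])
X = rng.poisson(rates)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Log-spaced alphas from 0.001 to 10, so values below 1 are included
param_grid = {"alpha": np.logspace(-3, 1, 9)}
search = GridSearchCV(MultinomialNB(), param_grid, cv=5)
search.fit(X_train, y_train)

best_alpha = search.best_params_["alpha"]
# With the default refit=True, score() uses the best estimator refitted on all training data
test_accuracy = search.score(X_test, y_test)

print("Best alpha:", best_alpha)
print("Test accuracy:", test_accuracy)
```

Predicting through the fitted search object (rather than the original `model`) guarantees that the selected alpha is the one being evaluated.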



# RECOMMENDATION

Our Gaussian method gave the better f1-score values. However, the data file has no header row, so the features were loaded as numbered columns. The dataset should be given descriptive column names in order to know what each feature represents.

---

# FOLLOW UP QUESTIONS

## a) Did we have the right data?

No, we do not know what all the features represent.

## b) Do we need other data to answer our question?

Yes, we need column names in order to know what each feature represents.

## c) Did we have the right question?

Yes, because spam e-mails have led to many scams.