# Email Spam Detection



## Problem Statement

Given the features in the dataset, we use Naive Bayes classifiers to determine whether an email is spam or not.



## Evaluation Metrics

We will use the accuracy score and F1 score to evaluate the performance of our model.
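
As a minimal sketch of what these two metrics measure, the snippet below computes accuracy and the F1 score by hand on a small set of made-up labels. The labels and the helper names `accuracy` and `f1` are illustrative assumptions, not part of the dataset or of scikit-learn. Here 1 denotes spam and 0 a legitimate email.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration only
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 1, 0, 0, 0, 1, 0, 1]

print("accuracy :", accuracy(y_true_demo, y_pred_demo))
print("F1 (spam):", f1(y_true_demo, y_pred_demo))
```

Accuracy counts every correct prediction equally, while F1 focuses on one class and penalizes both false alarms and misses, which is why the two are reported together.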

## Loading Libraries and Files

In [2]:
# Importing the necessary libraries to our environment

import numpy as np        
import pandas as pd       


import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler       
from sklearn.preprocessing import Normalizer
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score              
from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings('ignore')                 


In [3]:
# Loading the dataset into our environment.

spam = pd.read_csv('spambase_csv.csv')

## Data Exploration

In [4]:
# Displaying the first five records of the data.

spam.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,word_freq_receive,word_freq_will,word_freq_people,word_freq_report,word_freq_addresses,word_freq_free,word_freq_business,word_freq_email,word_freq_you,word_freq_credit,word_freq_your,word_freq_font,word_freq_000,word_freq_money,word_freq_hp,word_freq_hpl,word_freq_george,word_freq_650,word_freq_lab,word_freq_labs,word_freq_telnet,word_freq_857,word_freq_data,word_freq_415,word_freq_85,word_freq_technology,word_freq_1999,word_freq_parts,word_freq_pm,word_freq_direct,word_freq_cs,word_freq_meeting,word_freq_original,word_freq_project,word_freq_re,word_freq_edu,word_freq_table,word_freq_conference,char_freq_%3B,char_freq_%28,char_freq_%5B,char_freq_%21,char_freq_%24,char_freq_%23,capital_run_length_average,capital_run_length_longest,capital_run_length_total,class
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.64,0.0,0.0,0.0,0.32,0.0,1.29,1.93,0.0,0.96,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,0.21,0.79,0.65,0.21,0.14,0.14,0.07,0.28,3.47,0.0,1.59,0.0,0.43,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,0.38,0.45,0.12,0.0,1.75,0.06,0.06,1.03,1.36,0.32,0.51,0.0,1.16,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.12,0.0,0.06,0.06,0.0,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.0,0.0,0.31,0.0,0.0,3.18,0.0,0.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.0,0.0,0.31,0.0,0.0,3.18,0.0,0.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [5]:
# Let's print out the column names 

spam.columns

Index(['word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d',
       'word_freq_our', 'word_freq_over', 'word_freq_remove',
       'word_freq_internet', 'word_freq_order', 'word_freq_mail',
       'word_freq_receive', 'word_freq_will', 'word_freq_people',
       'word_freq_report', 'word_freq_addresses', 'word_freq_free',
       'word_freq_business', 'word_freq_email', 'word_freq_you',
       'word_freq_credit', 'word_freq_your', 'word_freq_font', 'word_freq_000',
       'word_freq_money', 'word_freq_hp', 'word_freq_hpl', 'word_freq_george',
       'word_freq_650', 'word_freq_lab', 'word_freq_labs', 'word_freq_telnet',
       'word_freq_857', 'word_freq_data', 'word_freq_415', 'word_freq_85',
       'word_freq_technology', 'word_freq_1999', 'word_freq_parts',
       'word_freq_pm', 'word_freq_direct', 'word_freq_cs', 'word_freq_meeting',
       'word_freq_original', 'word_freq_project', 'word_freq_re',
       'word_freq_edu', 'word_freq_table', 'word_freq_conference',


- This helps us to familiarize ourselves with the columns that we have.

In [6]:
# Let's check the datatypes of the columns in our dataset
spam.dtypes

word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_hp                  float64
word_freq_hpl                 float64
word_freq_ge

- All the columns have numerical datatypes.

In [7]:
# Getting a general description of the data we have using the describe() function.

spam.describe()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,word_freq_receive,word_freq_will,word_freq_people,word_freq_report,word_freq_addresses,word_freq_free,word_freq_business,word_freq_email,word_freq_you,word_freq_credit,word_freq_your,word_freq_font,word_freq_000,word_freq_money,word_freq_hp,word_freq_hpl,word_freq_george,word_freq_650,word_freq_lab,word_freq_labs,word_freq_telnet,word_freq_857,word_freq_data,word_freq_415,word_freq_85,word_freq_technology,word_freq_1999,word_freq_parts,word_freq_pm,word_freq_direct,word_freq_cs,word_freq_meeting,word_freq_original,word_freq_project,word_freq_re,word_freq_edu,word_freq_table,word_freq_conference,char_freq_%3B,char_freq_%28,char_freq_%5B,char_freq_%21,char_freq_%24,char_freq_%23,capital_run_length_average,capital_run_length_longest,capital_run_length_total,class
count,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0
mean,0.104553,0.213015,0.280656,0.065425,0.312223,0.095901,0.114208,0.105295,0.090067,0.239413,0.059824,0.541702,0.09393,0.058626,0.049205,0.248848,0.142586,0.184745,1.6621,0.085577,0.809761,0.121202,0.101645,0.094269,0.549504,0.265384,0.767305,0.124845,0.098915,0.102852,0.064753,0.047048,0.097229,0.047835,0.105412,0.097477,0.136953,0.013201,0.078629,0.064834,0.043667,0.132339,0.046099,0.079196,0.301224,0.179824,0.005444,0.031869,0.038575,0.13903,0.016976,0.269071,0.075811,0.044238,5.191515,52.172789,283.289285,0.394045
std,0.305358,1.290575,0.504143,1.395151,0.672513,0.273824,0.391441,0.401071,0.278616,0.644755,0.201545,0.861698,0.301036,0.335184,0.258843,0.825792,0.444055,0.531122,1.775481,0.509767,1.20081,1.025756,0.350286,0.442636,1.671349,0.886955,3.367292,0.538576,0.593327,0.456682,0.403393,0.328559,0.555907,0.329445,0.53226,0.402623,0.423451,0.220651,0.434672,0.349916,0.361205,0.766819,0.223812,0.621976,1.011687,0.911119,0.076274,0.285735,0.243471,0.270355,0.109394,0.815672,0.245882,0.429342,31.729449,194.89131,606.347851,0.488698
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.588,6.0,35.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,1.31,0.0,0.22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.065,0.0,0.0,0.0,0.0,2.276,15.0,95.0,0.0
75%,0.0,0.0,0.42,0.0,0.38,0.0,0.0,0.0,0.0,0.16,0.0,0.8,0.0,0.0,0.0,0.1,0.0,0.0,2.64,0.0,1.27,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.11,0.0,0.0,0.0,0.0,0.188,0.0,0.315,0.052,0.0,3.706,43.0,266.0,1.0
max,4.54,14.28,5.1,42.81,10.0,5.88,7.27,11.11,5.26,18.18,2.61,9.67,5.55,10.0,4.41,20.0,7.14,9.09,18.75,18.18,11.11,17.1,5.45,12.5,20.83,16.66,33.33,9.09,14.28,5.88,12.5,4.76,18.18,4.76,20.0,7.69,6.89,8.33,11.11,4.76,7.14,14.28,3.57,20.0,21.42,22.05,2.17,10.0,4.385,9.752,4.081,32.478,6.003,19.829,1102.5,9989.0,15841.0,1.0


### Dealing with Duplicate Records 

In [8]:
# First we check the number of rows and columns of our data.

print('Our dataset has ', spam.shape[0], 'rows and ', spam.shape[1], 'columns')

Our dataset has  4601 rows and  58 columns


In [9]:
# Checking for the presence and the number of duplicates in the data

print('Presence : ',spam.duplicated(keep = 'first').any())
print('Count    : ',spam.duplicated(keep = 'first').sum())

Presence :  True
Count    :  391


In [10]:
# Dealing with duplicate values while keeping the first occurrence of the record.

spam.drop_duplicates(keep = 'first', inplace = True)

In [11]:
# We then check the shape of the dataframe to confirm that the duplicates have been dropped.

print('The dataset now has ', spam.shape[0], 'rows and ', spam.shape[1], 'columns')

The dataset now has  4210 rows and  58 columns
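
To make the `keep = 'first'` behaviour concrete, here is a small sketch on a made-up three-row frame (the values and the name `toy` are illustrative, not taken from the spambase file). It shows that only the repeats after the first occurrence are flagged and dropped.

```python
import pandas as pd

# Toy frame (made-up values): the second row repeats the first
toy = pd.DataFrame({"word_freq_free": [0.5, 0.5, 0.0],
                    "class": [1, 1, 0]})

dup_mask = toy.duplicated(keep="first")      # True only for repeats after the first
deduped = toy.drop_duplicates(keep="first")  # returns a new frame; `toy` is unchanged

print(dup_mask.tolist())
print(deduped.shape)
```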


## Analysis

In [12]:
# First we will verify whether the features are normally distributed or not.

# Using the Shapiro-Wilk test, we will test for normality on the features of our data.
# We will use a for loop to iterate through all the columns systematically.

# importing the shapiro function
from scipy.stats import shapiro

alpha = 0.05
gaussian_columns = []                    # columns for which normality is not rejected

for i in spam.columns:

  stat, p = shapiro(spam[i])             # testing for normality
  print('Statistics=%.3f, p=%.3f' % (stat, p))
  if p > alpha:
    gaussian_columns.append(i)

# interpreting the results for all columns, not only the last one tested
if gaussian_columns:
  print('Columns that look Gaussian:', gaussian_columns)
else:
  print('No column looks Gaussian')

Statistics=0.392, p=0.000
Statistics=0.245, p=0.000
Statistics=0.626, p=0.000
Statistics=0.022, p=0.000
Statistics=0.522, p=0.000
Statistics=0.396, p=0.000
Statistics=0.329, p=0.000
Statistics=0.277, p=0.000
Statistics=0.373, p=0.000
Statistics=0.410, p=0.000
Statistics=0.351, p=0.000
Statistics=0.676, p=0.000
Statistics=0.351, p=0.000
Statistics=0.171, p=0.000
Statistics=0.185, p=0.000
Statistics=0.336, p=0.000
Statistics=0.368, p=0.000
Statistics=0.397, p=0.000
Statistics=0.859, p=0.000
Statistics=0.156, p=0.000
Statistics=0.734, p=0.000
Statistics=0.103, p=0.000
Statistics=0.321, p=0.000
Statistics=0.193, p=0.000
Statistics=0.383, p=0.000
Statistics=0.346, p=0.000
Statistics=0.205, p=0.000
Statistics=0.260, p=0.000
Statistics=0.161, p=0.000
Statistics=0.254, p=0.000
Statistics=0.153, p=0.000
Statistics=0.136, p=0.000
Statistics=0.173, p=0.000
Statistics=0.139, p=0.000
Statistics=0.198, p=0.000
Statistics=0.278, p=0.000
Statistics=0.388, p=0.000
Statistics=0.035, p=0.000
Statistics=0

- None of the features is normally distributed, hence we need to normalize the data before modeling.

## Implementing the Solution

In [13]:
# Splitting the data into the independent variables (X) and the dependent variable (y).

X = spam.drop('class', axis = 1)
y = spam['class']



In [14]:
X.head(2)

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,word_freq_receive,word_freq_will,word_freq_people,word_freq_report,word_freq_addresses,word_freq_free,word_freq_business,word_freq_email,word_freq_you,word_freq_credit,word_freq_your,word_freq_font,word_freq_000,word_freq_money,word_freq_hp,word_freq_hpl,word_freq_george,word_freq_650,word_freq_lab,word_freq_labs,word_freq_telnet,word_freq_857,word_freq_data,word_freq_415,word_freq_85,word_freq_technology,word_freq_1999,word_freq_parts,word_freq_pm,word_freq_direct,word_freq_cs,word_freq_meeting,word_freq_original,word_freq_project,word_freq_re,word_freq_edu,word_freq_table,word_freq_conference,char_freq_%3B,char_freq_%28,char_freq_%5B,char_freq_%21,char_freq_%24,char_freq_%23,capital_run_length_average,capital_run_length_longest,capital_run_length_total
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.64,0.0,0.0,0.0,0.32,0.0,1.29,1.93,0.0,0.96,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,0.21,0.79,0.65,0.21,0.14,0.14,0.07,0.28,3.47,0.0,1.59,0.0,0.43,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028


In [15]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: class, dtype: int64

In [16]:
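# Normalizing the features
# Note: fit() only creates the Normalizer; X itself is not transformed here,
# so the models below are trained on the original feature values.

norm = Normalizer().fit(X)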
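
For reference, scikit-learn's `Normalizer` with its default `norm='l2'` rescales each row (each email) to unit Euclidean length, and the rescaling only happens when `transform` or `fit_transform` is called. The sketch below reproduces that row-wise scaling with NumPy on made-up values; `X_demo` and `X_scaled` are illustrative names, not variables from this notebook.

```python
import numpy as np

# Made-up feature rows for illustration (two "emails", three features)
X_demo = np.array([[3.0, 4.0, 0.0],
                   [0.0, 0.0, 2.0]])

# Row-wise L2 norms, kept as a column so they broadcast across the features
row_norms = np.linalg.norm(X_demo, axis=1, keepdims=True)

# Guard against all-zero rows to avoid dividing by zero
row_norms[row_norms == 0] = 1.0

X_scaled = X_demo / row_norms
print(X_scaled)
```

With scikit-learn itself, the equivalent call would be `Normalizer().fit_transform(X)`; the returned array is what a model would have to be trained on for the normalization to take effect.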

## Gaussian Model

In [17]:

# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 55)

# Training the model.
gauss = GaussianNB().fit(X_train, y_train)

# Predicting
y_pred = gauss.predict(X_test)

# Evaluating the predictions made by the model

# 1. Using Classification report
print(classification_report(y_test, y_pred))


# 2. Using the confusion matrix and the accuracy score
print(confusion_matrix(y_test, y_pred))
print('The accuracy :',accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.96      0.72      0.82       737
           1       0.71      0.96      0.81       526

    accuracy                           0.82      1263
   macro avg       0.83      0.84      0.82      1263
weighted avg       0.86      0.82      0.82      1263

[[529 208]
 [ 22 504]]
The accuracy : 0.8178939034045922


- There is a large disparity in the Gaussian NB model's precision between the two classes: it predicts legitimate (ham) emails with a precision of 96% but spam emails with a precision of only 71%. The confusion matrix shows why: 208 legitimate emails were flagged as spam.
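
The figures in the report can be re-derived from the confusion matrix printed above. The sketch below does the arithmetic in plain Python, using scikit-learn's convention that rows are true classes and columns are predicted classes. The variable names are illustrative.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions
cm = [[529, 208],
      [22, 504]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

per_class = {}
for cls in (0, 1):
    predicted_as_cls = cm[0][cls] + cm[1][cls]   # column sum
    actually_cls = sum(cm[cls])                  # row sum
    precision = cm[cls][cls] / predicted_as_cls
    recall = cm[cls][cls] / actually_cls
    per_class[cls] = (precision, recall)
    print(f"class {cls}: precision={precision:.2f}, recall={recall:.2f}")

print(f"accuracy={accuracy:.4f}")
```

Spam precision is 504 / (504 + 208), so the 208 false alarms are what pull it down to 71%, while only 22 spam emails are missed.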

## Multinomial NB Model

In [18]:

# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 55)

# Training the model.
multi = MultinomialNB().fit(X_train, y_train)

# Predicting
y_pred = multi.predict(X_test)

# Evaluating the predictions made by the model

# 1. Using Classification report
print(classification_report(y_test, y_pred))


# 2. Using the confusion matrix and the accuracy score
print(confusion_matrix(y_test, y_pred))
print('The accuracy :',accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.86      0.84       737
           1       0.78      0.73      0.76       526

    accuracy                           0.80      1263
   macro avg       0.80      0.79      0.80      1263
weighted avg       0.80      0.80      0.80      1263

[[631 106]
 [142 384]]
The accuracy : 0.8036421219319082


- Using Multinomial NB, the model predicts legitimate emails with a precision of 82% and spam emails with a precision of 78%.

## Bernoulli Model

In [19]:
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 55)

# Training the model.
bern = BernoulliNB().fit(X_train, y_train)

# Predicting
y_pred = bern.predict(X_test)

# Evaluating the predictions made by the model

# 1. Using Classification report
print(classification_report(y_test, y_pred))


# 2. Using the confusion matrix and the accuracy score
print(confusion_matrix(y_test, y_pred))
print('The accuracy :',accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.89      0.92      0.90       737
           1       0.89      0.83      0.86       526

    accuracy                           0.89      1263
   macro avg       0.89      0.88      0.88      1263
weighted avg       0.89      0.89      0.89      1263

[[680  57]
 [ 87 439]]
The accuracy : 0.8859857482185273


- The Bernoulli NB predicts both spam and legitimate emails with a precision of 89%.
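
One detail that may help interpret this result: `BernoulliNB` binarizes its inputs with a default threshold of `binarize=0.0`, so each feature is reduced to whether it is present (greater than zero) or absent. The sketch below applies the same thresholding with NumPy to made-up frequencies. It illustrates only this preprocessing step, not the full classifier, and `freqs` and `binary` are illustrative names.

```python
import numpy as np

# Made-up word frequencies for two "emails"
freqs = np.array([[0.0, 0.64, 1.93],
                  [0.21, 0.0, 0.0]])

# Values above the threshold become 1 (present), everything else 0 (absent)
binary = (freqs > 0.0).astype(int)
print(binary)
```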


1. Gaussian NB shows a large disparity in precision between the two classes: 96% for legitimate (ham) emails but only 71% for spam emails, with an accuracy of about 82%.
2. Multinomial NB predicts legitimate emails with a precision of 82% and spam emails with a precision of 78%, with an accuracy of about 80%.
3. Bernoulli NB predicts both spam and legitimate emails with a precision of 89%, with an accuracy of about 89%.
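
Since the three models share the same split and the same evaluation, the comparison could also be written as a single loop. The following is a sketch on synthetic stand-in data (not the spambase file), so the scores it prints are not comparable to the results above; `X_demo`, `y_demo`, `models` and `results` are illustrative names.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic stand-in data (NOT the spambase file): sparse, non-negative
# "frequencies" whose non-zero rate depends on the class label.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=600)
present = rng.random((600, 10)) < (0.2 + 0.4 * y_demo[:, None])
X_demo = present * rng.exponential(scale=1.0, size=(600, 10))

# One split shared by every model, mirroring test_size and random_state above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=55
)

models = {
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
}

results = {}
for name, model in models.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, y_hat),
        "f1": f1_score(y_te, y_hat),
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}, "
          f"f1={results[name]['f1']:.3f}")
```

Collecting the scores in one dictionary keeps the accuracy and F1 values side by side, which matches the two evaluation metrics chosen at the start.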

## Conclusion

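Bernoulli NB gives the best overall results: the highest accuracy (about 89%) and a balanced precision of 89% on both classes. Gaussian NB reaches a higher precision on legitimate emails (96%) but only 71% on spam, and Multinomial NB, the variant usually associated with word-count features, reaches an accuracy of about 80%. On this dataset, Bernoulli NB is therefore the better choice for classifying whether an email is spam or not.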