<a href="https://colab.research.google.com/github/KevOdhiambo/KNN-Naive-Bayes-Classification/blob/main/Spam_Email_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Spam Email Detection

**Specifying Analysis Question**

Create a Naive Bayes classification model to detect spam emails.

**Defining the Metric for Success**

The model should correctly predict whether an email is spam or not, judged by accuracy, precision and recall on a held-out test set.

**Understanding the context**

Email spam, also known as junk email, refers to unsolicited email messages, usually sent in bulk to a large list of recipients. These emails usually carry little to no information and are sometimes used for phishing. Here, I'll build a Naive Bayes classification model that can be used to detect such emails.

**Recording the Experimental Design**

1. Load Data
2. Data Cleaning
3. Exploratory Data Analysis
4. Data Modelling
5. Model Evaluation
6. Model improvement and tuning
7. Conclusion

#1. Reading Data

In [1]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

# Set global parameters
%matplotlib inline
sns.set()
plt.rcParams['figure.figsize'] = (10.0, 8.0)
warnings.filterwarnings('ignore')

In [2]:
#read the column-names file that ships with the dataset
with open('spambase.names') as file:
  names = file.read()
  print(names)

| SPAM E-MAIL DATABASE ATTRIBUTES (in .names format)
|
| 48 continuous real [0,100] attributes of type word_freq_WORD 
| = percentage of words in the e-mail that match WORD,
| i.e. 100 * (number of times the WORD appears in the e-mail) / 
| total number of words in e-mail.  A "word" in this case is any 
| string of alphanumeric characters bounded by non-alphanumeric 
| characters or end-of-string.
|
| 6 continuous real [0,100] attributes of type char_freq_CHAR
| = percentage of characters in the e-mail that match CHAR,
| i.e. 100 * (number of CHAR occurences) / total characters in e-mail
|
| 1 continuous real [1,...] attribute of type capital_run_length_average
| = average length of uninterrupted sequences of capital letters
|
| 1 continuous integer [1,...] attribute of type capital_run_length_longest
| = length of longest uninterrupted sequence of capital letters
|
| 1 continuous integer [1,...] attribute of type capital_run_length_total
| = sum of length of uninterrupted sequences of
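
The attribute definitions above can be sketched in a few lines of Python. This is only an illustration of the formulas in the `.names` file, not the code used to build the dataset: the function names are hypothetical, and the case-insensitive word matching and the counting of whitespace in the character total are assumptions.

```python
import re


def word_freq(text: str, word: str) -> float:
    """Percentage of words in `text` equal to `word`.

    A word is a run of alphanumeric characters. Matching is
    case-insensitive here, which is an assumption.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * hits / len(words)


def char_freq(text: str, char: str) -> float:
    """Percentage of characters in `text` equal to `char`.

    Whitespace counts towards the total, which is an assumption.
    """
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)


def capital_runs(text: str) -> dict:
    """Average, longest and total length of uninterrupted capital-letter runs."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return {"average": 0.0, "longest": 0, "total": 0}
    return {
        "average": sum(runs) / len(runs),
        "longest": max(runs),
        "total": sum(runs),
    }


sample = "FREE money!!! Get your FREE offer NOW"
print(word_freq(sample, "free"))   # 2 of 7 words
print(char_freq(sample, "!"))      # 3 of 37 characters
print(capital_runs(sample))        # runs: FREE, G, FREE, NOW
```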

In [3]:
#create the list of column names
column_names=['word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d', 'word_freq_our',
'word_freq_over', 'word_freq_remove', 'word_freq_internet', 'word_freq_order', 'word_freq_mail',
'word_freq_receive','word_freq_will','word_freq_people','word_freq_report','word_freq_addresses',
'word_freq_free','word_freq_business','word_freq_email','word_freq_you','word_freq_credit','word_freq_your',
'word_freq_font','word_freq_000','word_freq_money','word_freq_hp','word_freq_hpl','word_freq_george',
'word_freq_650','word_freq_lab','word_freq_labs','word_freq_telnet','word_freq_857','word_freq_data',
'word_freq_415','word_freq_85','word_freq_technology','word_freq_1999','word_freq_parts','word_freq_pm',
'word_freq_direct','word_freq_cs','word_freq_meeting','word_freq_original','word_freq_project',
'word_freq_re','word_freq_edu','word_freq_table','word_freq_conference','char_freq_;','char_freq_(',
'char_freq_[','char_freq_exclamation','char_freq_dollar','char_freq_hashtag','capital_run_length_average',
'capital_run_length_longest','capital_run_length_total','spam']


In [4]:
#now load the dataset with their column names
spam=pd.read_csv('spambase.data',names=column_names)

#2. Checking the Data

In [5]:
# Determining the no. of records in our dataset
spam.shape

#the dataset has 4601 rows/entries and 58 columns

(4601, 58)

In [6]:
# Previewing the top of our dataset
spam.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_exclamation,char_freq_dollar,char_freq_hashtag,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [7]:
# Previewing the bottom of our dataset
spam.tail()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_exclamation,char_freq_dollar,char_freq_hashtag,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spam
4596,0.31,0.0,0.62,0.0,0.0,0.31,0.0,0.0,0.0,0.0,...,0.0,0.232,0.0,0.0,0.0,0.0,1.142,3,88,0
4597,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.353,0.0,0.0,1.555,4,14,0
4598,0.3,0.0,0.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.102,0.718,0.0,0.0,0.0,0.0,1.404,6,118,0
4599,0.96,0.0,0.0,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.057,0.0,0.0,0.0,0.0,1.147,5,78,0
4600,0.0,0.0,0.65,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.125,0.0,0.0,1.25,5,40,0


In [8]:
# Checking whether each column has an appropriate datatype
spam.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 58 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   word_freq_make              4601 non-null   float64
 1   word_freq_address           4601 non-null   float64
 2   word_freq_all               4601 non-null   float64
 3   word_freq_3d                4601 non-null   float64
 4   word_freq_our               4601 non-null   float64
 5   word_freq_over              4601 non-null   float64
 6   word_freq_remove            4601 non-null   float64
 7   word_freq_internet          4601 non-null   float64
 8   word_freq_order             4601 non-null   float64
 9   word_freq_mail              4601 non-null   float64
 10  word_freq_receive           4601 non-null   float64
 11  word_freq_will              4601 non-null   float64
 12  word_freq_people            4601 non-null   float64
 13  word_freq_report            4601 

Almost every column is of float type; only capital_run_length_longest, capital_run_length_total and spam are integers.

#3. Tidying the Dataset

In [9]:
# Identifying the Missing Data
spam.isnull().sum()

word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq_will                0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_freq_telnet              0
word_fre

In [10]:
#check for duplicates
spam.duplicated().sum()

391

In [11]:
#as there are 391 duplicates, I'll drop them
spam.drop_duplicates(inplace=True)

In [12]:
#have a look at the proportion of each class in the target variable
spam.spam.value_counts(normalize=True)*100

#the classes are imbalanced (about 60% non-spam vs 40% spam), so I'll balance them using SMOTE

0    60.118765
1    39.881235
Name: spam, dtype: float64

#4. Modelling

In [13]:
#create a copy of my dataset for modelling
spam_df=spam.copy()

In [14]:
# Get the dependent and independent variables
X = spam_df.iloc[:, :-1]
y = spam_df.spam

In [15]:
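# Apply SMOTE to X and y
# Note: resampling happens before the split, so synthetic rows can be derived from rows that end up in the test set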
sm = SMOTE(sampling_strategy='auto', k_neighbors=1, random_state=42)
X_res, y_res = sm.fit_resample(X, y)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_res, y_res, test_size=.3, random_state=23)
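
Because the cell above resamples before splitting, the test set can contain synthetic rows built from training rows. A common alternative is to split first and balance only the training portion. The sketch below illustrates that ordering under stated assumptions: it uses synthetic data from `make_classification` rather than the spam data, and plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE. With `imblearn`, the same ordering would mean calling `fit_resample` on the training portion only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the spam data (roughly a 60/40 class split).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.6, 0.4], random_state=0
)

# 1) Split first, so the test set contains only original rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=23
)

# 2) Oversample the minority class in the training set only.
classes, counts = np.unique(y_train, return_counts=True)
minority = classes[np.argmin(counts)]
n_extra = int(counts.max() - counts.min())

X_minority = X_train[y_train == minority]
X_extra = resample(X_minority, replace=True, n_samples=n_extra, random_state=42)

X_train_bal = np.vstack([X_train, X_extra])
y_train_bal = np.concatenate([y_train, np.full(n_extra, minority)])

print(np.bincount(y_train_bal))  # both classes now have the same count
print(len(X_test))               # test set is untouched by resampling
```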

In [16]:
# Scale data
scaler = MinMaxScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
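
To make the scaling step concrete, here is a minimal NumPy sketch of what min-max scaling does when the minimum and range are learned from the training data only. The small arrays are made up for illustration; they are not taken from the spam dataset.

```python
import numpy as np

# Toy data: two features, three training rows and one test row.
train = np.array([[0.0, 10.0], [2.0, 30.0], [4.0, 20.0]])
test = np.array([[1.0, 40.0]])

# Learn the per-column minimum and range from the training data only.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

# Apply the same statistics to both sets.
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range

print(train_scaled)  # every training column spans exactly [0, 1]
print(test_scaled)   # [[0.25 1.5]]: test values may fall outside [0, 1]
```

Fitting on the training set alone keeps test-set information out of the preprocessing, at the cost of test values occasionally landing outside the unit interval.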

In [18]:
# Perform linear discriminant analysis on data

lda = LinearDiscriminantAnalysis(n_components=1)
lda.fit(x_train, y_train)

# Get explained variation by lda
lda.explained_variance_ratio_

array([1.])

In [19]:
# With two classes, LDA yields a single discriminant component, so it accounts for 100% of the between-class variance
lda_pred = lda.predict(x_test)
print(classification_report(y_test, lda_pred))

              precision    recall  f1-score   support

           0       0.87      0.93      0.90       749
           1       0.92      0.87      0.89       770

    accuracy                           0.90      1519
   macro avg       0.90      0.90      0.90      1519
weighted avg       0.90      0.90      0.90      1519



The LDA classifier predicts whether an email is spam or not with about 90% accuracy on the test set.

In [20]:
# create the confusion matrix
print(confusion_matrix(y_test, lda_pred, labels=[0,1]))

[[693  56]
 [103 667]]


The LDA classifier does a better job of predicting non-spam emails (0) than spam emails (1): it produces more false negatives (103 spam emails labelled as non-spam) than false positives (56 non-spam emails labelled as spam).
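
The figures in the classification report can be recovered by hand from the confusion matrix above. The short sketch below does the arithmetic, relying on scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are just for illustration.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
tn, fp = 693, 56    # true non-spam: correctly kept / wrongly flagged as spam
fn, tp = 103, 667   # true spam: missed / correctly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_spam = tp / (tp + fp)
recall_spam = tp / (tp + fn)
recall_non_spam = tn / (tn + fp)

print(f"accuracy        = {accuracy:.2f}")         # 0.90
print(f"precision (1)   = {precision_spam:.2f}")   # 0.92
print(f"recall (1)      = {recall_spam:.2f}")      # 0.87
print(f"recall (0)      = {recall_non_spam:.2f}")  # 0.93
```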

In [21]:
# Apply Naive Bayes classifiers to lda transformed data

x_train_lda = lda.transform(x_train)
x_test_lda = lda.transform(x_test)

gaussian_bayes = GaussianNB()
gaussian_bayes.fit(x_train_lda, y_train)

gaussian_pred = gaussian_bayes.predict(x_test_lda)

print(classification_report(y_test, gaussian_pred))

              precision    recall  f1-score   support

           0       0.87      0.92      0.90       749
           1       0.92      0.87      0.89       770

    accuracy                           0.90      1519
   macro avg       0.90      0.90      0.90      1519
weighted avg       0.90      0.90      0.90      1519



In [22]:
# Get confusion matrix after applying Naive Bayes
matrix = confusion_matrix(y_test, gaussian_pred)
pd.DataFrame(matrix, columns=[0,1], index=[0,1])

Unnamed: 0,0,1
0,691,58
1,100,670


The Gaussian Naive Bayes classifier performs similarly to the LDA classifier.

In [23]:
#second test with 40% test size
# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_res, y_res, test_size=.4, random_state=34)

In [24]:
# Scale data
scaler = MinMaxScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

In [25]:
# Apply LDA transformation
# Note: this reuses the LDA fitted on the first split; it is not refitted on the new training set
x_train_lda = lda.transform(x_train)
x_test_lda = lda.transform(x_test)
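
Since the `lda` object above was fitted on the first split, one way to keep every step tied to the current training set is to wrap the scaler, LDA and classifier in a single scikit-learn pipeline and refit it for each split. The following is a minimal sketch on synthetic data from `make_classification`, not a rerun of the spam experiment; the helper `evaluate` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the spam data.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)


def evaluate(test_size: float, seed: int) -> float:
    """Fit scaler, LDA and GaussianNB on the training part only; return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    model = make_pipeline(
        MinMaxScaler(),
        LinearDiscriminantAnalysis(n_components=1),
        GaussianNB(),
    )
    model.fit(X_tr, y_tr)  # every step is refitted on this split's training data
    return model.score(X_te, y_te)


for test_size, seed in [(0.3, 23), (0.4, 34)]:
    print(test_size, round(evaluate(test_size, seed), 3))
```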

In [26]:
# Modelling
gaussian_bayes = GaussianNB()
gaussian_bayes.fit(x_train_lda, y_train)

gaussian_pred = gaussian_bayes.predict(x_test_lda)

print(classification_report(y_test, gaussian_pred))

              precision    recall  f1-score   support

           0       0.88      0.95      0.91       995
           1       0.95      0.88      0.91      1030

    accuracy                           0.91      2025
   macro avg       0.91      0.91      0.91      2025
weighted avg       0.91      0.91      0.91      2025



In [27]:
# Get confusion matrix
matrix = confusion_matrix(y_test, gaussian_pred)
pd.DataFrame(matrix, columns=[0,1], index=[0,1])

Unnamed: 0,0,1
0,944,51
1,128,902


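With the test size increased to 40%, the model's accuracy also improves to 91%. However, false negatives (128) still exceed false positives (51). Note that the LDA transform used here was fitted on the first split rather than refitted, so the two runs are not strictly comparable.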

#Conclusion

Our best-performing Gaussian Naive Bayes model has an accuracy of 91% with a test size of 40%, compared with 90% at 30%. This might suggest that increasing the test size further would improve measured performance in detecting spam emails (class 1) and non-spam emails (class 0), although the two runs also use different random seeds, so a one-point difference is weak evidence.