<a href="https://colab.research.google.com/github/Molly-Abisage/WK9_KNN-and-NAIVE-BAYES-CLASSIFIERS/blob/main/NAIVE_BAYES_WEEK_9_IP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Defining the question

### Specifying the research question

*   We have been given a dataset describing emails, both malicious and legitimate, whose features capture the kind of content each email contains. We are required to use the Naive Bayes classifier to identify an email as spam or not spam.


### Defining the metrics for success

The metrics for success in this study are:

*   Obtaining an accuracy score of at least 90% on the model
*   Use of the most appropriate metrics to assess our models and explain why they are appropriate.



### Context


**Spam detection**

Accurate spam detection is considered a difficult task due to several reasons including:
*    **subjective nature of spam** - _for instance, a message containing several drug names might be spam, but it might not be if the message is exchanged in the context of medical organizations_

 
This study will therefore make use of the Naive Bayes Classifier to detect if an email is spam or not.


### Record the experimental design

The following steps will be followed for the study:

1.   Importing libraries and loading data from a csv file
1.   Checking the data
1.   Conducting necessary data cleaning procedures
1.   Performing Exploratory Data Analysis
1.   Performing data pre-processing
1.   Building the most suitable model for this study
1.   Assessing/ Evaluating the model
1.   Making a conclusion on the study


### Data Relevance


An assumption of spam detection is that the content of a spam email differs from that of a legitimate email.

Statistical features of a typical spam email include:
1.   char_freq_!
1.   word_freq_remove
1.   word_freq_credit
1.   char_freq_
1.   word_freq_hp
1.   word_freq_edu
1.   capital_run_length_longest
1.   word_freq_free
1.   capital_run_length_total
1.   word_freq_george

The dataset provided contains all these features hence we consider the data relevant for the study.


## 2. Reading the data

In [None]:
# importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


In [None]:
# loading datasets
spam = pd.read_csv("/content/spambase_csv.csv")



## 3. Checking the data

In [None]:
# previewing the top of the dataset
spam.head()

In [None]:
# previewing the bottom of the dataset
spam.tail()

In [None]:
# checking the size of the dataset
spam.shape

display("There are {} observations with {} features ".format(spam.shape[0], spam.shape[1]))

In [None]:
# investigating the features
spam.columns

In [None]:
# checking data types
spam.dtypes

All the columns above are numerical features.

## 4. External dataset validation

In [None]:
spam.describe()

According to https://help.campaignmonitor.com/, spam filters look at an email as a whole, with thresholds set for certain criteria. If a threshold is exceeded, the email gets marked as spam. Things that can be caught by spam filters include:

*  An entire email composed of capital letters
*  Frequent, random capitalization

From the summary statistics above, the largest total number of capital letters in a single email is 15,841. This unusually high value belongs to an email categorized as spam, which is consistent with the criteria above, so we conclude that the data is valid and can be used for this study.


## Tidying the dataset

### Completeness

In [None]:
# checking for missing data
spam.isnull().values.any()

The dataset is complete, as there are no missing values.

### Consistency

In [None]:
# listing the rows flagged as duplicates of an earlier row
print(spam[spam.duplicated()])


The `duplicated()` check flags 391 rows as duplicates. Most of the columns may have similar values, but the _capital_run_length_total_ column appears to have a different value for each record, so we will retain these records as they are not really duplicates.
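
As a side check on what `duplicated()` reports, here is a minimal sketch on a toy frame (the values are hypothetical and not taken from the spam data). By default pandas flags a row only when it matches an earlier row in every column; passing `subset=` restricts the comparison to the listed columns.

```python
import pandas as pd

# Toy frame with hypothetical values (not from the spam dataset)
toy = pd.DataFrame({
    "word_freq_free": [0.0, 0.0, 0.5, 0.0],
    "capital_run_length_total": [10, 10, 10, 12],
})

# Default: a row is flagged only if it equals an earlier row in every column
full_dupes = toy.duplicated()

# With subset=..., only the listed columns are compared
partial_dupes = toy.duplicated(subset=["word_freq_free"])

print(full_dupes.tolist())
print(partial_dupes.tolist())
```

If flagged rows still look different in some column, calling `duplicated(keep=False)` marks every member of a duplicate group, which makes it easier to compare them side by side.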

### Uniformity

In [None]:
spam.dtypes

All the columns have appropriate data types.

### Outliers

In [None]:
# drawing one box plot per column to inspect outliers

fig, ax = plt.subplots(len(spam.columns), figsize = (18, 100))

for i, col_val in enumerate(spam.columns):

  sns.boxplot(y= spam[col_val], ax=ax[i])
  ax[i].set_title('Box plot-{}'.format(col_val), fontsize=10)
  ax[i].set_xlabel(col_val, fontsize=8)

plt.show()

**Observations**

Most of the columns seem to have most of the data points concentrated at zero. The columns also seem to have a few observations as outlier values. 

**Conclusion**

As the outlier values are the ones that determine whether an email will be classified as spam or not, we will retain them.

## Exploratory Data Analysis

In [None]:
!pip install -U pandas-profiling

In [None]:
import pandas_profiling as pp
import warnings
warnings.filterwarnings("ignore")

from pandas_profiling import ProfileReport
ProfileReport(spam, title = "spam email report")

## Univariate Analysis

In [None]:
# statistical summary
spam.describe()

### Distribution plots

In [None]:
# plotting distribution plots for all the above columns 

sns.set_style('darkgrid')
fig, axes = plt.subplots(len(spam.columns), figsize = (18, 100))
fig.suptitle('Distributions of all columns', y= 1.01, color = 'black', fontsize = 15)

for ax, name in zip(axes.flatten(), spam.columns):
  # sns.distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it
  sns.histplot(spam[name], ax = ax, kde = True, color = 'purple')
plt.tight_layout()

**Observations**

All the independent variables have the mode at 0 and are positively skewed. The dependent variable is categorical.

### Target variable

In [None]:
# obtaining the count of the target variable

spam["class"].value_counts()

We have 2788 emails in class 0 (not spam) and 1813 in class 1 (spam), so the classes are imbalanced: more emails were identified as not spam than as spam.
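
The degree of imbalance can be quantified directly from the two counts reported above. This is a small arithmetic sketch that uses only those counts; the share of the majority class is also the accuracy a classifier would reach by always predicting "not spam", which is a useful reference point for the 90% target.

```python
# Class counts as reported by value_counts() above
counts = {0: 2788, 1: 1813}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# How many non-spam emails there are per spam email
imbalance_ratio = counts[0] / counts[1]

print("total emails:", total)
print("share of not spam: {:.3f}".format(shares[0]))
print("share of spam: {:.3f}".format(shares[1]))
print("not spam per spam email: {:.2f}".format(imbalance_ratio))
```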

In [None]:
# visualizing the target variable

sns.catplot(y="class", kind="count", edgecolor=".6", data=spam);

The plot confirms the counts stated above.

## Bivariate Analysis

We will use box-and-whisker plots to see how the target variable relates to some numerical columns.


In [None]:
# the columns capital_run_length_average, capital_run_length_longest and capital_run_length_total
# describe runs of capital letters in an email. Let's explore how class relates to them
# plotting box-and-whisker plots
capital_columns = ['capital_run_length_average', 'capital_run_length_longest', 'capital_run_length_total']

fig, ax = plt.subplots(3, 1, figsize=(10, 12))

for var, subplot in zip(capital_columns, ax.flatten()):
    sns.boxplot(x='class', y=var, data=spam, ax=subplot)

**Observations:**

Spam emails have much higher capital-letter values than emails that are not spam. However, for the _capital_run_length_total_ column, the distributions appear similar in both classes, except that class 1 has more outlying observations than class 0. Most of the spam emails have observations concentrated around zero, with a few points as outliers.

In [None]:
# let's see how the percentage of characters in an email relate to spam emails 
# 
character_columns = ['char_freq_%3B', 'char_freq_%28', 'char_freq_%5B', 'char_freq_%21', 'char_freq_%24', 'char_freq_%23']

fig, ax = plt.subplots(3, 2, figsize=(10, 12))

for var, subplot in zip(character_columns, ax.flatten()):
    sns.boxplot(x='class', y=var, data=spam, ax=subplot)              


**Observations**:

Depending on the type of character, an email is more likely to be spam or not spam. Emails that are not spam have more of the **char_freq_%3B, char_freq_%5B, char_freq_%21** characters, while spam emails have more of the **char_freq_%28, char_freq_%24, char_freq_%23** characters.

In [None]:
# let's see how the type of words in an email relates to the email being spam or not


word_columns = ['word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d','word_freq_our', 'word_freq_over', 
       'word_freq_remove','word_freq_internet', 'word_freq_order', 'word_freq_mail','word_freq_receive', 'word_freq_will', 
       'word_freq_people', 'word_freq_report', 'word_freq_addresses', 'word_freq_free','word_freq_business', 'word_freq_email', 
       'word_freq_you', 'word_freq_credit', 'word_freq_your', 'word_freq_font', 'word_freq_000','word_freq_money', 'word_freq_hp', 
       'word_freq_hpl', 'word_freq_george','word_freq_650', 'word_freq_lab', 'word_freq_labs', 'word_freq_telnet','word_freq_857', 
       'word_freq_data', 'word_freq_415', 'word_freq_85','word_freq_technology', 'word_freq_1999', 'word_freq_parts','word_freq_pm', 
       'word_freq_direct', 'word_freq_cs', 'word_freq_meeting','word_freq_original', 'word_freq_project', 'word_freq_re',
       'word_freq_edu', 'word_freq_table', 'word_freq_conference']

fig, ax = plt.subplots(12, 4, figsize=(18, 40))

for var, subplot in zip(word_columns, ax.flatten()):
    sns.boxplot(x='class', y=var, data=spam, ax=subplot)              

**Observations:**

Words that are more frequent in spam emails than in emails that are not spam include: _3d, internet, remove, addresses, receive, credit, money, font, business, 000_

Words that are more frequent in emails that are not spam than in spam emails include: _re, edu, table, conference, cs, meeting, original, direct, project, 1999, parts, pm, 415, data, technology, lab, labs, hp, hpl, george, our, people, report, over, address, all, mail_

Some words are equally represented in both classes, such as: make, order, will, your. Others appear only as outliers in both kinds of emails.

### Feature Reduction

In [None]:
# Data Reduction
X = spam.drop('class', axis = 1)
y = spam['class']    


# import train_test_split
from sklearn.model_selection import train_test_split

# splitting the data into 80% train set and 20% test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# normalizing our data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# applying Linear Discriminant Analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# with two classes, LDA can produce at most n_classes - 1 = 1 component
lda = LDA(n_components = 1)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

# listing the features in order of the weight each carries in separating spam from non-spam
factors = pd.DataFrame (index = X.columns.values, data = lda.coef_[0].T)

# pd.options.display.float_format = '{:.8f}'.format
factors.sort_values(0, ascending = False)
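
For a two-class target such as spam versus not spam, Linear Discriminant Analysis yields at most one discriminant component (the number of classes minus one). The sketch below illustrates this on synthetic data; the array names and the random data are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data with 5 features (illustrative only)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(40, 5)),   # class 0
    rng.normal(1.0, 1.0, size=(40, 5)),   # class 1
])
y_demo = np.array([0] * 40 + [1] * 40)

# One component is the maximum for a binary target
lda_demo = LinearDiscriminantAnalysis(n_components=1)
X_proj = lda_demo.fit_transform(X_demo, y_demo)

print(X_proj.shape)          # projected data: one column
print(lda_demo.coef_.shape)  # one weight per original feature
```

The single row of `coef_` is what a ranking of feature weights, like the one above, is built from.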

## Baseline Model

In [None]:
# necessary imports
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split



In [None]:
X = spam.drop('class', axis = 1)
y = spam['class']    

# splitting the data into 80 - 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=6) 

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()  
model = clf.fit(X_train, y_train) 

predicted = model.predict(X_test)
# print(np.mean(predicted == y_test))

from sklearn.metrics import accuracy_score
print('Accuracy', accuracy_score(y_test, predicted))


In [None]:
# evaluating the model
from sklearn.metrics import confusion_matrix
from sklearn import metrics

confusion_test = metrics.confusion_matrix(y_test, predicted)
pd.DataFrame(data = confusion_test, columns = ['Predicted 0', 'Predicted 1'],
            index = ['Actual 0', 'Actual 1'])

Our baseline model has an accuracy of 81.5%. As we are interested in the spam emails (class 1), we have 347 true positives and 411 true negatives. The number of false positives is 148, which is quite high, while the number of false negatives is 15.

Let's employ techniques that will help in improving our model.

In [None]:
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
from sklearn.metrics import auc

# predict probabilities with the baseline model
lr_probs = model.predict_proba(X_test)

# keep probabilities for spam emails only
lr_probs = lr_probs[:, 1]

# precision-recall curve, f1 score and area under the curve on the baseline test set
lr_precision, lr_recall, _ = precision_recall_curve(y_test, lr_probs)
lr_f1, lr_auc = f1_score(y_test, predicted), auc(lr_recall, lr_precision)

# summarize scores
print('Gaussian: f1=%.3f auc=%.3f' % (lr_f1, lr_auc))

# plot the precision-recall curves
no_skill = len(y_test[y_test==1]) / len(y_test)
plt.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
plt.plot(lr_recall, lr_precision, marker='.', label='Gaussian')

# axis labels
plt.xlabel('Recall')
plt.ylabel('Precision')
# show the legend
plt.legend()
# show the plot
plt.show()

## Pre-processing - Improving model performance

### Multicollinearity

In [None]:
# identify highly correlated features 
# create correlation matrix

corr_matrix = spam.corr().abs()
corr_matrix

In [None]:
# select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
upper

In [None]:
# find index of feature columns with correlation greater than 0.95

to_drop = [column for column in upper.columns if any (upper[column] > 0.95)]
to_drop

One column, word_freq_415, has a correlation greater than 0.95 with another feature; we will drop it in an attempt to improve model performance.
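
The upper-triangle filter used above can be seen in isolation on a small synthetic frame. In this sketch the column names and data are hypothetical; `a_copy` is built to be almost perfectly correlated with `a`, so it is the only column expected to be flagged.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": rng.normal(size=200),
    # nearly a rescaled copy of "a", hence highly correlated with it
    "a_copy": 2.0 * a + 0.001 * rng.normal(size=200),
})

# absolute correlations, keeping only the entries above the diagonal
toy_corr = toy.corr().abs()
mask = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
toy_upper = toy_corr.where(mask)

# a column is flagged if it correlates above 0.95 with any earlier column
to_drop_toy = [col for col in toy_upper.columns if (toy_upper[col] > 0.95).any()]
print(to_drop_toy)
```

Because only the upper triangle is inspected, the later column of each highly correlated pair is flagged and the earlier one is kept.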

In [None]:
# drop highly correlated features

spam = spam.drop(columns=to_drop)

In [None]:
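# calculating VIF scores to verify we do not have any other columns that are highly correlated
# the VIF of each feature is the matching diagonal entry of the inverse of the signed
# correlation matrix of the remaining features (target excluded)
feature_corr = spam.drop('class', axis = 1).corr()
pd.DataFrame(np.linalg.inv(feature_corr.values), index = feature_corr.index, columns = feature_corr.columns)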

From the above, no column has a VIF score greater than 5, so we conclude that our data does not exhibit multicollinearity.
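
The VIF of a feature equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix. The sketch below shows this on synthetic data; the helper name `vif_from_corr` and the toy columns are assumptions for illustration, with `x3` deliberately built as a near-linear combination of `x1` and `x2`.

```python
import numpy as np
import pandas as pd

def vif_from_corr(features: pd.DataFrame) -> pd.Series:
    """Return the VIF of each column as the diagonal of the inverse correlation matrix."""
    corr = features.corr().values  # signed correlations between predictors
    inv = np.linalg.inv(corr)
    return pd.Series(np.diag(inv), index=features.columns, name="VIF")

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
toy = pd.DataFrame({
    "x1": x1,
    "x2": x2,
    "x3": x1 + x2 + 0.1 * rng.normal(size=500),  # almost x1 + x2
})

vif_all = vif_from_corr(toy)                   # collinear set: large VIFs
vif_indep = vif_from_corr(toy[["x1", "x2"]])   # independent pair: VIFs near 1
print(vif_all.round(1))
print(vif_indep.round(3))
```

Values close to 1 indicate little shared variance, while the common rule of thumb treats values above 5 as a sign of multicollinearity.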

In [None]:
# verifying the new shape of the dataframe
spam.shape

# verifies we dropped one column

### Dealing with imbalanced classes

We have imbalanced classes, i.e. more observations of emails that are not spam than of spam emails. We will therefore upsample the spam emails to obtain a balanced number of spam and non-spam observations.

In [None]:
# creating dependent and independent sets

M = spam.drop('class', axis = 1)
n = spam['class']  

# normalizing our data (left commented out: M_train and M_test are only created by the split in the next cell)
# from sklearn.preprocessing import StandardScaler
# sc = StandardScaler()
# M_train = sc.fit_transform(M_train)
# M_test = sc.transform(M_test)


In [None]:
# Split into train (2/3) and test (1/3) sets
test_size = 0.33
seed = 7
M_train, M_test, n_train, n_test = train_test_split(M, n, test_size=test_size, random_state=seed)

# Put X and y training data back together again
Mn_train = pd.concat([M_train, n_train], axis=1)

# Split into spam and non - spam
Mn_train_0 = Mn_train[Mn_train['class']==0]
Mn_train_1 = Mn_train[Mn_train['class']==1]

# counting the two classes
print( 'Non spam class: ', Mn_train_0.shape[0] )
print( 'Spam class: ', Mn_train_1.shape[0] )

In [None]:
# Resampling
from sklearn.utils import resample

# Undersampling non-spam emails (the majority class) down to the size of the spam class;
# sampling without replacement so that no row is repeated
Mn_train_0_undersampled = resample(Mn_train_0, replace=False, n_samples=Mn_train_1.shape[0])
print( Mn_train_0_undersampled.shape)

# Oversample spam emails (as it is the minority class)
Mn_train_1_oversampled = resample(Mn_train_1, replace=True, n_samples=Mn_train_0.shape[0])
print( Mn_train_1_oversampled.shape )

In [None]:
# We can either go with the oversampled spam, or undersampled non spam
# Let's go with oversampling
combined = pd.concat([Mn_train_1_oversampled, Mn_train_0])

# Show that we now have balanced classes
combined['class'].value_counts()

## Gaussian Naive Bayes Classifier

The Naive Bayes classifier is a classification algorithm based on Bayes' theorem. It assumes that, within a class, the effect of each feature is independent of the other features, which is why it is referred to as naive.
The Gaussian Naive Bayes classifier additionally assumes that the features are normally distributed within each class.
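
To make the two assumptions concrete, the sketch below scores a point by hand, adding the log prior of each class to the sum of per-feature Gaussian log densities, and compares the resulting label with scikit-learn's `GaussianNB`. The data is synthetic and the helper `manual_gnb_log_posterior` is a hypothetical name used only for this illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data with two features (illustrative only)
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 2)),  # class 0
    rng.normal(loc=2.0, scale=1.0, size=(50, 2)),  # class 1
])
y_toy = np.array([0] * 50 + [1] * 50)

def manual_gnb_log_posterior(X_fit, y_fit, x_new):
    """Unnormalised log posterior per class: log prior + sum of Gaussian log densities."""
    scores = []
    for label in np.unique(y_fit):
        rows = X_fit[y_fit == label]
        prior = len(rows) / len(X_fit)
        mean = rows.mean(axis=0)
        var = rows.var(axis=0)
        # independence assumption: per-feature log densities are simply added
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x_new - mean) ** 2 / var)
        scores.append(np.log(prior) + log_lik)
    return np.array(scores)

x_new = np.array([1.5, 1.8])
manual_label = int(np.argmax(manual_gnb_log_posterior(X_toy, y_toy, x_new)))
sk_label = int(GaussianNB().fit(X_toy, y_toy).predict(x_new.reshape(1, -1))[0])
print(manual_label, sk_label)
```

scikit-learn also adds a very small smoothing term to the variances, so the log scores can differ marginally from the manual ones even though the predicted label is the same here.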


In [None]:
# separating features and target in the balanced data
Q = combined.drop('class', axis = 1)
s = combined['class']   

In [None]:
# splitting the data into 80 - 20%

Q_train, Q_test, s_train, s_test = train_test_split(Q, s, test_size=0.2, random_state=6) 

# normalizing our data
# from sklearn.preprocessing import StandardScaler
# sc = StandardScaler()
# X_train = sc.fit_transform(X_train)
# X_test = sc.transform(X_test)

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()  
model = clf.fit(Q_train, s_train) 

predicted = model.predict(Q_test)
print(np.mean(predicted == s_test))

from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics
print(classification_report(s_test, predicted))
confusion_test = metrics.confusion_matrix(s_test, model.predict(Q_test))
pd.DataFrame(data = confusion_test, columns = ['Predicted 0', 'Predicted 1'],
            index = ['Actual 0', 'Actual 1'])

The Gaussian Naive Bayes Model has attained an accuracy score of ~85%. 
Let's break down our output:

*   True Positives: 350
*   True Negatives: 291
*   False Positives: 100
*   False Negatives: 16
*   Accuracy: 0.846, i.e. our model is approximately 85% accurate
*   Precision: 0.78 for the spam emails
*   Recall(Sensitivity): 0.96 for the spam emails
*   f1 score: 0.86

We can consider our model a better fit than the baseline model, and it is generally a good model.
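
The summary metrics listed above follow directly from the four confusion-matrix counts. As a worked check, this sketch recomputes them using only the counts reported above; no other values are assumed.

```python
# Confusion-matrix counts reported above (spam is the positive class)
tp, tn, fp, fn = 350, 291, 100, 16

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all emails classified correctly
precision = tp / (tp + fp)                   # share of predicted spam that is really spam
recall = tp / (tp + fn)                      # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print("accuracy:  {:.3f}".format(accuracy))
print("precision: {:.2f}".format(precision))
print("recall:    {:.2f}".format(recall))
print("f1 score:  {:.2f}".format(f1))
```

The gap between recall and precision shows that most of the remaining errors are legitimate emails flagged as spam rather than spam that slipped through.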

In [None]:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

# generate a no skill prediction (majority class)
ns_probs = [0 for _ in range(len(s_test))]

# predict probabilities
lr_probs = model.predict_proba(Q_test)

# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]

# calculate scores
ns_auc = roc_auc_score(s_test, ns_probs)
lr_auc = roc_auc_score(s_test, lr_probs)

# summarize scores
print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('Gaussian: ROC AUC=%.3f' % (lr_auc))

# calculate roc curves
ns_fpr, ns_tpr, _ = roc_curve(s_test, ns_probs)
lr_fpr, lr_tpr, _ = roc_curve(s_test, lr_probs)

# plot the roc curve for the model
plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
plt.plot(lr_fpr, lr_tpr, marker='.', label='Gaussian')

# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
# show the plot
plt.show()

As we are dealing with balanced classes, we will use the ROC curve (False positive Rate against True positive Rate) to show the trade-off between sensitivity (or TPR) and specificity (1 – FPR). Our curve is close to the top left corner and has a high value of the area under the curve, an indication that our model is skillful.
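
For reference, the two extremes of ROC AUC can be reproduced on a tiny label vector (the labels are hypothetical): constant scores, like the no-skill predictions above, give an AUC of 0.5, while scores that rank every spam email above every non-spam email give 1.0.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels for six emails (1 = spam)
y_true_demo = np.array([0, 0, 1, 1, 0, 1])

# No-skill: the same score for every email, so no ranking information
constant_scores = np.zeros(len(y_true_demo))

# Perfect ranking: every spam email scores higher than every non-spam email
perfect_scores = y_true_demo.astype(float)

auc_constant = roc_auc_score(y_true_demo, constant_scores)
auc_perfect = roc_auc_score(y_true_demo, perfect_scores)
print(auc_constant, auc_perfect)
```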

In [None]:
R = combined.drop('class', axis = 1)
u = combined['class']  

# splitting the data into 70 - 30%
R_train, R_test, u_train, u_test = train_test_split(R, u, test_size=0.3, random_state=6) 

# normalizing our data
# from sklearn.preprocessing import StandardScaler
# sc = StandardScaler()
# X_train = sc.fit_transform(X_train)
# X_test = sc.transform(X_test)

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()  
model = clf.fit(R_train, u_train) 

predicted = model.predict(R_test)
print(np.mean(predicted == u_test))

from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics
print(classification_report(u_test, predicted))
confusion_test = metrics.confusion_matrix(u_test, predicted)
pd.DataFrame(data = confusion_test, columns = ['Predicted 0', 'Predicted 1'],
            index = ['Actual 0', 'Actual 1'])

The accuracy score has increased slightly to approximately 86%. Breaking down our outputs:

*   True Positives: 525
*   True Negatives: 448
*   False Positives: 139
*   False Negatives: 24

The precision and recall have remained the same but the f1 score for the spam emails has increased slightly.

In [None]:
O = combined.drop('class', axis = 1)
p = combined['class']  

# splitting the data into 60 - 40%
O_train, O_test, p_train, p_test = train_test_split(O, p, test_size=0.4, random_state=6) 

# normalizing our data
# from sklearn.preprocessing import StandardScaler
# sc = StandardScaler()
# X_train = sc.fit_transform(X_train)
# X_test = sc.transform(X_test)

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()  
model = clf.fit(O_train, p_train) 

predicted = model.predict(O_test)
print(np.mean(predicted == p_test))

from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics
print(classification_report(p_test, predicted))
confusion_test = metrics.confusion_matrix(p_test, predicted)
pd.DataFrame(data = confusion_test, columns = ['Predicted 0', 'Predicted 1'],
            index = ['Actual 0', 'Actual 1'])

The accuracy of this model has reduced slightly in comparison to prior models.

## Challenging the Solution

The models we have built have an accuracy of approximately 85% and an f1 score of about 0.86, which is good enough. This was obtained after applying several techniques to improve model performance, such as resampling to obtain equal class sizes and dropping highly correlated features. It could therefore be difficult to beat this particular model. To challenge it, we will build an SVM model.

### SVM Model

In [None]:
R = combined.drop('class', axis = 1)
u = combined['class']  

# splitting the data into 70 - 30%
R_train, R_test, u_train, u_test = train_test_split(R, u, test_size=0.3, random_state=6) 

# Building the model 
from sklearn.svm import SVC,LinearSVC
rbfclassifier = SVC(kernel='rbf', gamma=0.01, C=2.0)

# Training the model using the training set
rbfclassifier.fit(R_train, u_train)

# Predict the response for the test set
u_pred = rbfclassifier.predict(R_test)

# assessing the model

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn import metrics
print(classification_report(u_test, u_pred))
confusion_test = metrics.confusion_matrix(u_test, u_pred)
pd.DataFrame(data = confusion_test, columns = ['Predicted 0', 'Predicted 1'],
            index = ['Actual 0', 'Actual 1'])
model_accuracy = accuracy_score(u_test,u_pred)
print(model_accuracy)


In [None]:
from sklearn.model_selection import GridSearchCV

C_range = np.logspace(-2, 10, 13)
gamma_range = np.logspace(-9, 3, 13)
param_grid = dict(gamma=gamma_range, C=C_range)
# cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
grid = GridSearchCV(SVC(kernel ="rbf"), param_grid=param_grid)
grid.fit(R_train, u_train)

print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))

The SVM model has given an accuracy score that is slightly higher than the Naive Bayes models hence this could be a bit difficult to beat. 

It is important to note that the Naive Bayes was a faster model as compared to SVM.

## Follow-up questions

The Gaussian Naive Bayes classifier is a good fit for our data; however, we did not attain our success metric of an accuracy score of at least 90%. Could this be because the features are derived from text, in which case Multinomial Naive Bayes, which is better suited to text classification, should be used instead?
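
One way to explore this question would be to fit scikit-learn's `MultinomialNB`, which expects non-negative features such as word counts or frequencies. The sketch below is only a starting point on synthetic count data: the three "words", their rates and the variable names are all assumptions, and it says nothing about how the model would perform on the spam dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Synthetic counts for 3 hypothetical words; the spam-like class uses word 0 more often
ham_counts = rng.poisson(lam=[1.0, 3.0, 3.0], size=(200, 3))
spam_like_counts = rng.poisson(lam=[4.0, 1.0, 3.0], size=(200, 3))

X_syn = np.vstack([ham_counts, spam_like_counts])
y_syn = np.array([0] * 200 + [1] * 200)

# stratified split so both classes appear in the same proportion in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0, stratify=y_syn
)

mnb = MultinomialNB().fit(X_tr, y_tr)
syn_accuracy = mnb.score(X_te, y_te)
print("accuracy on synthetic counts:", syn_accuracy)
```

On the real data, the same estimator could be fitted on the unscaled frequency columns, since standardised features can be negative and `MultinomialNB` rejects negative inputs.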