<h1><center>Spam message filter using Naive Bayes Classification</center></h1>

### INTRODUCTION
In machine learning, Bayes classifiers are statistical classifiers that predict which class a given data tuple belongs to.

Bayes classification is based on Bayes's theorem.

![](https://annalloyd.files.wordpress.com/2019/03/bayes-1.png?w=635&h=391)

In Bayes classification, Bayes' theorem is used to estimate the probability that a tuple belongs to a particular class, given its features.
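
Written out in symbols (the notation is our own choice: $C$ is a class such as spam or ham, and $x = (x_1, \dots, x_n)$ are the features of a tuple), Bayes' theorem reads:

$$
P(C \mid x) = \frac{P(x \mid C)\, P(C)}{P(x)}
$$

The "naive" part of Naive Bayes is the assumption that the features are conditionally independent given the class, which lets the likelihood factor into per-feature terms:

$$
P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)
$$

The classifier then picks the class with the largest value of the right-hand side.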

In this project, our goal is to build a classifier that detects spam messages using Naive Bayes classification.

The dataset used below is stored in <a href='https://raw.githubusercontent.com/HuyQuangCSE/dataset/main/SMSSpamCollection'>this repository</a>.


### DATA PREPARATION
First, we import the packages required for this project.

In [None]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

Then, we collect the raw data from the source. This dataset has already been labeled.

In [None]:
url = 'https://raw.githubusercontent.com/HuyQuangCSE/dataset/main/SMSSpamCollection'
data = pd.read_csv(url, header=None, sep='\t',names=['label', 'content'])
data.head(10)

| | label | content |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


We split the data into two parts, one for training and one for testing, using an 8:2 ratio (80% for training and 20% for testing).

The dataset is shuffled before being split (the default behaviour of `train_test_split`), so that neither part is biased by the order of the rows in the file.
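
As a minimal sketch of how the split behaves, the following uses ten made-up messages (the names `toy_messages` and `toy_labels` are illustrative assumptions, not part of the dataset). It shows that an 8:2 split of ten items yields eight and two, and that fixing `random_state` makes the shuffle reproducible.

```python
from sklearn.model_selection import train_test_split

# Ten made-up messages and labels, used only for illustration.
toy_messages = [f"message {i}" for i in range(10)]
toy_labels = ["spam" if i % 5 == 0 else "ham" for i in range(10)]

# Same call shape as in the notebook: 80% train, fixed seed.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_messages, toy_labels, train_size=0.8, random_state=1
)

# Repeating the call with the same seed gives the same split.
repeat_X_train, repeat_X_test, _, _ = train_test_split(
    toy_messages, toy_labels, train_size=0.8, random_state=1
)

print(len(toy_X_train), len(toy_X_test))
print(toy_X_test == repeat_X_test)
```

Changing `random_state` produces a different shuffle, so reported accuracy can vary slightly from one seed to another.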

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(data.content, data.label, train_size=0.8, random_state=1)
print('Size of train data: ',len(X_train))
print('Size of test data',len(X_test))


Size of train data:  4457
Size of test data 1115


At this step, we use the **Count Vectorizer** from the scikit-learn library to tokenize and vectorize the training and testing datasets.

This tool also provides built-in options for data cleaning. Here we use them to remove stop words, drop punctuation (non-word characters), and convert all words to lowercase.
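
To see what these options do, here is a minimal sketch on two made-up messages (the sample texts and the `toy_` names are illustrative assumptions, not taken from the dataset).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up messages, used only for illustration.
toy_messages = [
    "WINNER!! Claim your cash prize now",
    "Is the dinner tonight?",
]

toy_vectorizer = CountVectorizer(stop_words='english', lowercase=True)
toy_counts = toy_vectorizer.fit_transform(toy_messages)

# vocabulary_ maps each kept token to its column index.
toy_vocabulary = sorted(toy_vectorizer.vocabulary_)
print(toy_vocabulary)

# One row per message, one column per vocabulary entry.
print(toy_counts.toarray())
```

Tokens such as "your" and "the" are dropped as stop words, "WINNER!!" is kept as the lowercase token "winner", and each message becomes a row of word counts.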

In [None]:
vectorizer = CountVectorizer(stop_words='english',lowercase=True).fit(X_train)
vectorized_X_train = vectorizer.transform(X_train)
vectorized_X_test = vectorizer.transform(X_test)

# The sparse matrix exposes .shape directly, so no dense copy is needed.
training_size, vocabulary_size = vectorized_X_train.shape
print('Training dataset\'s size: ', training_size)
print('Vocabulary\'s size: ', vocabulary_size)
pd.DataFrame({
    'content':X_train,
    'label': Y_train,
}).head(10)

Training dataset's size:  4457
Vocabulary's size:  7457


| | content | label |
|---|---|---|
| 1642 | Hi , where are you? We're at and they're not ... | ham |
| 2899 | If you r @ home then come down within 5 min | ham |
| 480 | When're you guys getting back? G said you were... | ham |
| 3485 | Tell my bad character which u Dnt lik in me. ... | ham |
| 157 | I'm leaving my house now... | ham |
| 4430 | Hey they r not watching movie tonight so i'll ... | ham |
| 2625 | S da..al r above &lt;#&gt; | ham |
| 5365 | Camera - You are awarded a SiPix Digital Camer... | spam |
| 4067 | Fyi I'm gonna call you sporadically starting a... | ham |
| 3120 | Stop knowing me so well! | ham |


### TRAINING & TESTING

Now we can use the processed data for training a Naive Bayes Classification model.

In this model, Laplace smoothing is applied with α = 1. This solves the zero-probability problem: without smoothing, a word that never appears in one class during training would force that class's probability to zero for any message containing it.
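
The effect of smoothing can be sketched in a few lines of plain Python. The helper `smoothed_probabilities` and the word counts below are hypothetical and only illustrate the additive-smoothing formula (count + α) / (total + α · vocabulary size); they are not the internals of `MultinomialNB`.

```python
def smoothed_probabilities(word_counts, alpha=1.0):
    """Return P(word | class) with additive (Laplace) smoothing."""
    total = sum(word_counts.values())
    vocabulary_size = len(word_counts)
    return {
        word: (count + alpha) / (total + alpha * vocabulary_size)
        for word, count in word_counts.items()
    }

# Hypothetical word counts for the spam class; "hello" was never seen in spam.
spam_counts = {"free": 3, "win": 2, "hello": 0}

unsmoothed = smoothed_probabilities(spam_counts, alpha=0.0)
smoothed = smoothed_probabilities(spam_counts, alpha=1.0)

print(unsmoothed["hello"])  # 0.0
print(smoothed["hello"])    # 0.125
```

With α = 1, the unseen word gets a small non-zero probability (1/8 here), and the probabilities still sum to 1.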

The trained model will then be tested on the processed test set that we split off from the collected data earlier.

In [None]:
classifier = MultinomialNB(alpha=1)
classifier.fit(vectorized_X_train, Y_train)
print('Raw testing Data')
pd.DataFrame(X_test).head(10)

Raw testing Data


| | content |
|---|---|
| 1078 | Yep, by the pretty sculpture |
| 4028 | Yes, princess. Are you going to make me moan? |
| 958 | Welp apparently he retired |
| 4642 | Havent. |
| 4674 | I forgot 2 ask ü all smth.. There's a card on ... |
| 5461 | Ok i thk i got it. Then u wan me 2 come now or... |
| 4210 | I want kfc its Tuesday. Only buy 2 meals ONLY ... |
| 4216 | No dear i was sleeping :-P |
| 1603 | Ok pa. Nothing problem:-) |
| 1504 | Ill be there on &lt;#&gt; ok. |


The classification results are evaluated by comparing the predictions with the original labels of the test data.

In [None]:
predictions = classifier.predict(vectorized_X_test)
print("Accuracy:", 100 * sum(predictions == Y_test) / len(predictions), '%')

Accuracy: 99.10313901345292 %
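
Accuracy alone does not show which kind of mistake the model makes (ham flagged as spam, or spam let through). As a hedged sketch, the following applies the same accuracy expression to five made-up labels and adds a confusion matrix from `sklearn.metrics`; the label arrays are hypothetical and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, used only for illustration.
true_labels = np.array(['ham', 'ham', 'spam', 'ham', 'spam'])
predicted_labels = np.array(['ham', 'ham', 'spam', 'spam', 'spam'])

# Same expression as in the cell above: share of matching labels, in percent.
accuracy = 100 * sum(predicted_labels == true_labels) / len(predicted_labels)
print("Accuracy:", accuracy, '%')

# Rows are true classes, columns are predicted classes, in the order given by labels.
matrix = confusion_matrix(true_labels, predicted_labels, labels=['ham', 'spam'])
print(matrix)
```

Here four of the five labels match, and the single error sits in the "true ham, predicted spam" cell of the matrix.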


### APPLICATION

In this section, we use the model to classify some sample messages and evaluate the results.

In [None]:
classifier.predict(vectorizer.transform(
    [
        "Thank you, ABC. Can you also share your LinkedIn profile? As you are a good at programming at pyhthon, would be willing to see your personal/college projects.",
        "Hi y'all, We have a Job Openings in the positions of software engineer, IT officer at ABC Company.Kindly, send us your resume and the cover letter as soon as possible if you think you are an eligible candidate and meet the criteria.",
        "Dear ABC, Congratulations! You have been selected as a SOftware Developer at XYZ Company. We were really happy to see your enthusiasm for this vision and mission. We are impressed with your background and we think you would make an excellent addition to the team.",
    ])
)


array(['ham', 'ham', 'ham'], dtype='<U4')

In this first example, the test messages are clearly normal (ham), and the model classifies all three of them correctly.

For the next example, let's try with some spam messages.

In [None]:
classifier.predict(vectorizer.transform(
    [
        "congratulations, you became today's lucky winner",
        "1-month unlimited calls offer Activate now",
        "You have win your self an Iphone 13 pro vjp, please send us your bank account number to claim the prize!!!"
    ])
)


array(['spam', 'spam', 'spam'], dtype='<U4')

As we can see, the model also classifies all three spam messages correctly.
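
Each prediction above needs two steps: `vectorizer.transform` followed by `classifier.predict`. One possible way to bundle them, sketched here on four made-up training messages, is scikit-learn's `make_pipeline`. The training texts and the name `spam_filter` are illustrative assumptions; this is not the model trained above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Four made-up training messages, used only for illustration.
toy_messages = [
    "win cash prize now",
    "claim your free prize",
    "see you at dinner",
    "meeting at noon tomorrow",
]
toy_labels = ["spam", "spam", "ham", "ham"]

# The pipeline vectorizes raw text and then classifies it in one call.
spam_filter = make_pipeline(
    CountVectorizer(stop_words='english', lowercase=True),
    MultinomialNB(alpha=1),
)
spam_filter.fit(toy_messages, toy_labels)

toy_predictions = spam_filter.predict(["free cash prize", "dinner tomorrow"])
print(toy_predictions)
```

With a pipeline, raw strings can be passed directly to `predict`, which avoids accidentally vectorizing new messages with a different vocabulary.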