# Building a Spam Filter with Naive Bayes

In this project we will build a spam filter for SMS messages using the multinomial Naive Bayes algorithm.

To classify messages as spam or non-spam, the computer will:

1. Learn how humans classify messages.
2. Use that human knowledge to estimate two probabilities for each new message: the probability that it is spam and the probability that it is non-spam.
3. Classify the new message based on these probability values. If the probability for spam is greater, the message is classified as spam. If the probability for non-spam is greater, it is classified as non-spam. If the two probabilities are equal, we may need a human to classify the message.
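
Step 3 can be sketched as a small function. This is only an illustration: the name `classify` and the `"needs human review"` return value are assumptions made here, not part of the project code.

```python
def classify(p_spam, p_ham):
    """Return a label from two class scores (they may be unnormalized)."""
    if p_spam > p_ham:
        return "spam"
    if p_ham > p_spam:
        return "ham"
    # Equal scores: the filter cannot decide on its own.
    return "needs human review"


print(classify(0.027, 0.003))  # spam
print(classify(0.001, 0.001))  # needs human review
```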

# How Humans Classify Messages

As stated above, our first task is to teach the computer how to classify messages.

To do that, we will use the multinomial Naive Bayes algorithm with a dataset of 5,572 SMS messages that have already been classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The data collection process is described in more detail [on this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the authors' papers.

In [1]:
# Import the dataset
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

data=pd.read_csv("SMSSpamCollection",sep='\t',header=None,
                 names=['Label','SMS'])

In [2]:
data.shape

(5572, 2)

In [3]:
data.head()

|   | Label | SMS                                                 |
|---|-------|-----------------------------------------------------|
| 0 | ham   | Go until jurong point, crazy.. Available only ...   |
| 1 | ham   | Ok lar... Joking wif u oni...                       |
| 2 | spam  | Free entry in 2 a wkly comp to win FA Cup fina...   |
| 3 | ham   | U dun say so early hor... U c already then say...   |
| 4 | ham   | Nah I don't think he goes to usf, he lives aro...   |


In [4]:
# Percentage of spam and non spam messages
#ham means non spam
data["Label"].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

We have approximately 87% non-spam and 13% spam messages in our dataset.

# Splitting into Training and Test Sets

Now that we have explored the dataset, we can start building our spam filter.
However, we also need a way to test the new filter to see how well it works.
For this purpose we will first split the data into a training set and a test set.
The training set will hold 80% of the data and the test set the remaining 20%.

The model will be built on the training set. We will treat the messages in the test set as new messages and use them to test the model. If the model classifies more than 80% of the test messages correctly, we will consider it to be working well.
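
The 80% target can be checked with a simple accuracy measure: the share of test messages whose predicted label matches the human label. The sketch below uses a hypothetical helper named `accuracy` and made-up labels for ten messages. It only illustrates the calculation and is not a result from the dataset.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of messages whose predicted label matches the human label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels for ten test messages (not taken from the dataset).
true_labels = ["ham", "ham", "spam", "ham", "ham",
               "spam", "ham", "ham", "ham", "spam"]
predicted = ["ham", "ham", "spam", "ham", "spam",
             "spam", "ham", "ham", "ham", "spam"]

acc = accuracy(true_labels, predicted)
print(f"accuracy: {acc:.0%}")           # accuracy: 90%
print("meets 80% target:", acc > 0.80)  # meets 80% target: True
```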

In [5]:
# Randomize the entire dataset
data_random=data.sample(frac=1,random_state=1)

#Split the above randomized data into training and testing set
train_set=data_random.sample(frac=0.8,random_state=1)

# Drop whatever is classified as train set from the test set
test_set=data_random.drop(train_set.index)

In [6]:
# Reset the indexes for both train and test set
train_set.reset_index(inplace=True,drop=True)

test_set.reset_index(inplace=True,drop=True)


In [7]:
# Compute the percentage of spam and non spam in both train and test sets

train_set["Label"].value_counts(normalize=True)

ham     0.866756
spam    0.133244
Name: Label, dtype: float64

In [8]:
test_set["Label"].value_counts(normalize=True)

ham     0.862657
spam    0.137343
Name: Label, dtype: float64

As we can see, both the training and test sets retain roughly the original ratio of 87% non-spam to 13% spam. Now we can start building the model on the training set.

# Naive Bayes Overview

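The multinomial Naive Bayes algorithm first estimates, for every word $w_i$ in the vocabulary, how likely that word is in each class:

$$P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}$$

$$P(w_i \mid \text{Ham}) = \frac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}$$

These word probabilities are then combined in the Naive Bayes equations for a message made up of the words $w_1, w_2, \ldots, w_n$:

$$P(\text{Spam} \mid w_1, w_2, \ldots, w_n) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam})$$

$$P(\text{Ham} \mid w_1, w_2, \ldots, w_n) \propto P(\text{Ham}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Ham})$$

where:

* $P(w_i \mid \text{Spam})$ = probability that the word $w_i$ is present in a message, given that the message is spam
* $P(w_i \mid \text{Ham})$ = probability that the word $w_i$ is present in a message, given that the message is ham
* $\prod$ = product over the words $w_1, w_2, \ldots, w_n$
* $N_{w_i \mid \text{Spam}}$ = the number of times the word $w_i$ occurs in spam messages
* $N_{w_i \mid \text{Ham}}$ = the number of times the word $w_i$ occurs in non-spam messages
* $N_{\text{Spam}}$ = total number of words in spam messages
* $N_{\text{Ham}}$ = total number of words in non-spam messages
* $N_{\text{Vocabulary}}$ = total number of words in the vocabulary
* $\alpha = 1$ ($\alpha$ is a smoothing parameter)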
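
To make the formulas concrete, here is a minimal worked sketch on three toy messages that are already cleaned and split into words. Several things here are assumptions made for illustration: the names `p_word_given` and `score`, taking P(Spam) and P(Ham) as the fraction of training messages with each label, and skipping words that never appeared in training.

```python
from collections import Counter

# Toy training data (illustrative only): (label, list of cleaned words).
messages = [
    ("spam", ["secret", "prize", "claim", "secret", "prize", "now"]),
    ("ham", ["coming", "to", "my", "secret", "party"]),
    ("spam", ["winner", "claim", "secret", "prize", "now"]),
]
alpha = 1  # smoothing parameter

vocabulary = {word for _, words in messages for word in words}
n_vocabulary = len(vocabulary)

# N_{w|Spam} and N_{w|Ham}: word counts per class.
counts = {"spam": Counter(), "ham": Counter()}
for label, words in messages:
    counts[label].update(words)

# N_Spam and N_Ham: total number of words per class.
n_words = {label: sum(c.values()) for label, c in counts.items()}

# Assumption: class priors are the share of messages with each label.
priors = {
    label: sum(1 for lab, _ in messages if lab == label) / len(messages)
    for label in counts
}


def p_word_given(word, label):
    """Smoothed estimate of P(word | label)."""
    return (counts[label][word] + alpha) / (n_words[label] + alpha * n_vocabulary)


def score(words, label):
    """Quantity proportional to P(label | words)."""
    result = priors[label]
    for word in words:
        if word in vocabulary:  # assumption: unseen words are skipped
            result *= p_word_given(word, label)
    return result


new_message = ["secret", "prize"]
spam_score = score(new_message, "spam")
ham_score = score(new_message, "ham")
# Ties are not handled in this sketch.
prediction = "spam" if spam_score > ham_score else "ham"

print(p_word_given("secret", "spam"))  # 0.2, i.e. (3 + 1) / (11 + 1 * 9)
print(prediction)                      # spam
```

Because of the smoothing term, a word that never occurs in one class (such as "prize" in the ham message) still gets a small non-zero probability instead of forcing the whole product to zero.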


# Data Cleaning and Preparation
In order to calculate the above probabilities we will need to arrange the dataset in a specific format.

For example, if the data looks like this:

| Label 	| SMS                                    	|
|-------	|----------------------------------------	|
| spam  | SECRET PRIZE! CLAIM SECRET PRIZE NOW!! |
| ham   | Coming to my secret party?             |
| spam  | Winner! Claim secret prize now!        |


It should be transformed into the following format:

| Label 	| secret 	| prize 	| claim 	| now  	| coming 	| to 	| my  	| party 	| winner 	|
|-------	|--------	|-------	|-------	|------	|--------	|----	|-----	|-------	|--------	|
| spam  	| 2      	| 2     	| 1     	| 1    	| 0      	| 0  	| 0   	| 0     	| 0      	|
| ham   	| 1      	| 0     	| 0     	| 0    	| 1      	| 1  	| 1   	| 1     	| 0      	|
| spam  	| 1      	| 1     	| 1     	| 1    	| 0      	| 0  	| 0   	| 0     	| 1      	|

In the transformation above:

* The SMS column doesn't exist anymore. Instead, it is replaced by a series of new columns, where each column represents a unique word from the vocabulary.
* Each row describes a single message. For instance, the first row corresponds to the message "SECRET PRIZE! CLAIM SECRET PRIZE NOW!!", and it has the values spam, 2, 2, 1, 1, 0, 0, 0, 0, 0. These values tell us that:
  * The message is spam.
  * The word "secret" occurs two times inside the message.
  * The word "prize" occurs two times inside the message.
  * The word "claim" occurs one time inside the message.
  * The word "now" occurs one time inside the message.
  * The words "coming", "to", "my", "party", and "winner" occur zero times inside the message.
* All words in the vocabulary are in lower case, so "SECRET" and "secret" are considered the same word.
* Punctuation is not taken into account anymore (for instance, we can't look at the table and conclude that the first message initially had three exclamation marks).
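
The row for the first example message can be reproduced with a few lines of standard-library Python. This is a sketch for a single message, not the notebook's pandas code. The `vocabulary` list is simply copied from the column order of the table above.

```python
import re
from collections import Counter

# Column order copied from the example table.
vocabulary = ["secret", "prize", "claim", "now",
              "coming", "to", "my", "party", "winner"]

message = "SECRET PRIZE! CLAIM SECRET PRIZE NOW!!"

# Replace every non-word character with a space, lower-case the text,
# then split on whitespace.
words = re.sub(r"\W", " ", message).lower().split()
counts = Counter(words)

# A Counter returns 0 for words that do not occur in the message.
row = [counts[word] for word in vocabulary]
print(row)  # [2, 2, 1, 1, 0, 0, 0, 0, 0]
```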

In [9]:
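# Remove punctuation and convert the text to lower case

# \W is a regex pattern that matches any character that is not a letter,
# a digit, or an underscore; every such character is replaced with a space.
# regex=True is needed because recent pandas versions treat the pattern
# as a literal string by default.
train_set["clean_SMS"]=train_set["SMS"].str.replace(r"\W"," ",regex=True)
train_set["clean_SMS"]=train_set["clean_SMS"].str.lower()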




Now that we are done cleaning the data, we need to transform it as described above. Each new column corresponds to one unique word found in the SMS column, and each value is the number of times that word occurs in the message. We will refer to this set of unique words as the **vocabulary**.

In [10]:
# Create the vocabulary

# Transform the SMS column into a list of words
train_set['clean_sms_list']=train_set["clean_SMS"].str.split()

# Collect each unique word once, in the order it is first seen.
# dict.fromkeys avoids a slow "word not in list" scan for every word.
vocabulary=list(dict.fromkeys(
    word for sms in train_set["clean_sms_list"] for word in sms))

In [11]:
len(vocabulary)

7712

There are 7,712 unique words in the vocabulary of the training set.

Now that we have the vocabulary, we will transform the dataset into the desired format.
One approach is to create a dictionary whose keys are the unique words in the vocabulary and whose values are lists holding the number of times each word appears in each SMS. We can then convert that dictionary to a dataframe to get the data in the desired format.
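
A minimal sketch of this dictionary approach, run on three toy messages rather than the real training set. The names `clean_sms_list` and `word_counts_per_sms` are illustrative assumptions, and the final conversion to a dataframe is only described in a comment so that the example runs without pandas.

```python
# Toy stand-ins for train_set["clean_sms_list"] (already cleaned and split).
clean_sms_list = [
    ["secret", "prize", "claim", "secret", "prize", "now"],
    ["coming", "to", "my", "secret", "party"],
    ["winner", "claim", "secret", "prize", "now"],
]

# Unique words in first-seen order.
vocabulary = list(dict.fromkeys(word for sms in clean_sms_list for word in sms))

# One key per unique word; each value holds one count per message.
word_counts_per_sms = {word: [0] * len(clean_sms_list) for word in vocabulary}

for index, sms in enumerate(clean_sms_list):
    for word in sms:
        word_counts_per_sms[word][index] += 1

print(word_counts_per_sms["secret"])  # [2, 1, 1]
print(word_counts_per_sms["party"])   # [0, 1, 0]

# With pandas available, pd.DataFrame(word_counts_per_sms) would turn this
# dictionary into a table with one column per word and one row per message.
```

Pre-filling every list with zeros means a word only needs to be touched in the messages where it actually occurs.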