Naive Bayes Spam Filter

The purpose of this code is to build a spam filter for SMS messages based on the Naive Bayes algorithm. Of course, no filter can be 100% accurate, but that is the goal we pursue!
Our task is to 'teach' the computer how to classify messages. To do that, we have a dataset of 5572 SMS messages that were already classified by humans (https://archive.ics.uci.edu/ml/datasets/sms+spam+collection), and the algorithm to be used, the so-called multinomial Naive Bayes algorithm.
The algorithm counts how often each word occurs in the spam messages and in the non-spam messages, and then assigns each word a probability of being part of spam. For example, based on the historical data, the word 'gift' has occurred more often in spam messages than in non-spam ones, so it gets a higher probability of belonging to spam.
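To make the idea concrete, here is a minimal sketch of how such per-word probabilities could be estimated. The messages are made up for illustration (they are not taken from the SMS dataset), and the Laplace smoothing constant `ALPHA` is an assumption added here so that unseen words do not get a zero probability.

```python
from collections import Counter

# Toy, made-up messages (hypothetical, not from the SMS dataset)
spam_messages = ["win a free gift now", "free gift card waiting"]
ham_messages = ["see you at lunch", "thanks for the gift", "are you free at noon"]

def word_counts(messages):
    """Count how often each word appears across a list of messages."""
    counts = Counter()
    for message in messages:
        counts.update(message.split())
    return counts

spam_counts = word_counts(spam_messages)
ham_counts = word_counts(ham_messages)
vocabulary = set(spam_counts) | set(ham_counts)

ALPHA = 1  # Laplace smoothing constant (an assumption for this sketch)

def p_word_given_class(word, class_counts):
    """Smoothed estimate of P(word | class)."""
    total_words = sum(class_counts.values())
    return (class_counts[word] + ALPHA) / (total_words + ALPHA * len(vocabulary))

p_gift_spam = p_word_given_class("gift", spam_counts)
p_gift_ham = p_word_given_class("gift", ham_counts)
print(f"P(gift | spam) = {p_gift_spam:.3f}")
print(f"P(gift | ham)  = {p_gift_ham:.3f}")
```

In this toy data 'gift' makes up a larger share of the spam words than of the non-spam words, so its estimated probability given spam comes out higher.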
Let's start with the practical side of our filter.

In [1]:
import pandas as pd

In [2]:
dataset = pd.read_csv('https://dq-content.s3.amazonaws.com/433/SMSSpamCollection',
                     sep = '\t', header = None, names = ['Label', 'SMS'])

In [5]:
dataset.head()

   Label  SMS
0  ham    Go until jurong point, crazy.. Available only ...
1  ham    Ok lar... Joking wif u oni...
2  spam   Free entry in 2 a wkly comp to win FA Cup fina...
3  ham    U dun say so early hor... U c already then say...
4  ham    Nah I don't think he goes to usf, he lives aro...


A good practice for checking how well the filter works is to split the data into a training set and a test set. Here we use 80% of the messages for training and keep the remaining 20% for testing.

In [7]:
# Randomize the dataset
data_randomized = dataset.sample(frac=1, random_state=1)

# Calculate index for split
training_test_index = round(len(data_randomized) * 0.8)

# Training/Test split
training_set = data_randomized[:training_test_index].reset_index(drop=True)
test_set = data_randomized[training_test_index:].reset_index(drop=True)
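After a split like this, it can be worth checking that the two sets have the expected sizes and a roughly similar share of spam and ham. The sketch below repeats the same steps on a small, hypothetical stand-in DataFrame (`toy`, with invented rows) so that it runs without downloading the data; on the real data the same checks would be applied to `training_set` and `test_set`.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: 8 ham and 2 spam messages
toy = pd.DataFrame({
    'Label': ['ham'] * 8 + ['spam'] * 2,
    'SMS': [f'message {i}' for i in range(10)],
})

# Same steps as the split above, applied to the toy data
toy_randomized = toy.sample(frac=1, random_state=1)
split_index = round(len(toy_randomized) * 0.8)
toy_training = toy_randomized[:split_index].reset_index(drop=True)
toy_test = toy_randomized[split_index:].reset_index(drop=True)

# Sizes of the two sets
print(toy_training.shape, toy_test.shape)

# Share of each label in the two sets
print(toy_training['Label'].value_counts(normalize=True))
print(toy_test['Label'].value_counts(normalize=True))
```

With only ten rows the label shares can differ noticeably between the two sets; with several thousand messages they would be expected to be much closer.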

It is time to perform some data cleaning and manipulation in order to prepare the data for the algorithm.

In [8]:
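# Replace every non-word character (punctuation etc.) with a space
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)
# Lowercase everything so that 'Free' and 'free' count as the same word
training_set['SMS'] = training_set['SMS'].str.lower()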
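To see what this cleaning does, here is a small sketch on two sample messages. The first one is invented for illustration and the second is the second row of the dataset preview. The pattern is written as a raw string with an explicit `regex=True`, because since pandas 2.0 `str.replace` treats the pattern as a literal string by default.

```python
import pandas as pd

# Two sample messages: the first is made up, the second comes from the preview above
sample = pd.Series(['Free entry!! Call 0800-123 NOW', 'Ok lar... Joking wif u oni...'])

# Replace non-word characters with spaces, then lowercase
cleaned = sample.str.replace(r'\W', ' ', regex=True).str.lower()

# Splitting on whitespace shows the words that remain after cleaning
for message in cleaned:
    print(message.split())
```

Punctuation becomes whitespace, so splitting each cleaned message on whitespace yields only lowercase words and numbers.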