# Building a spam filter with Naive Bayes #

Using the SMS Spam Collection dataset by Tiago A. Almeida and Jose Maria Gomez Hidalgo, downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection), our goal is to build a spam filter that teaches the computer to classify messages as a human would, using **Multinomial Naive Bayes**. The computer will then classify new messages as spam or non-spam (ham).

In [1]:
import pandas as pd

collection = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label','SMS'])
collection.tail(5)

| | Label | SMS |
|---|---|---|
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |
| 5571 | ham | Rofl. Its true to its name |


### Calculating the share of spam and ham (non-spam) labels: ###

In [2]:
spam = collection[collection['Label'] == 'spam']['Label'].count()/collection.shape[0]
ham = collection[collection['Label'] == 'ham']['Label'].count()/collection.shape[0]
# ham is printed first and spam second, so the label names them in that order
print('Non-spam (ham) and spam percentages respectively:',round(ham,2)*100,round(spam,2)*100)

Non-spam (ham) and spam percentages respectively: 87.0 13.0


### Splitting the dataset into training and test sets ###

In [3]:
random = collection.sample(frac=1, random_state=1)
random.head()

| | Label | SMS |
|---|---|---|
| 1078 | ham | Yep, by the pretty sculpture |
| 4028 | ham | Yes, princess. Are you going to make me moan? |
| 958 | ham | Welp apparently he retired |
| 4642 | ham | Havent. |
| 4674 | ham | I forgot 2 ask ü all smth.. There's a card on ... |


In [4]:
# The first 80% of the shuffled rows go to training, the remaining 20% to testing
training_index = int(random['Label'].count() * 0.8)
training = random[ : training_index].reset_index(drop=True)
test = random[training_index :].reset_index(drop=True)

In [5]:
training_spam = training[training['Label']=='spam']['Label'].count()/training.shape[0]
training_n_spam = training[training['Label']=='ham']['Label'].count()/training.shape[0]
print('Percentage of spam and non-spam in training dataset:{0}%,{1}%'.format(round(training_spam * 100), round(training_n_spam *100)))

Percentage of spam and non-spam in training dataset:13%,87%


In [6]:
test_spam = test[test['Label']=='spam']['Label'].count()/test.shape[0]
test_n_spam = test[test['Label']=='ham']['Label'].count()/test.shape[0]
print('Percentage of spam and non-spam in test dataset:{0}%,{1}%'.format(round(test_spam * 100), round(test_n_spam *100)))

Percentage of spam and non-spam in test dataset:13%,87%


**Notice that the percentages of spam and non-spam are the same in the full dataset, the training set, and the test set, so the split preserved the class balance.**

## Data Cleaning ##

We will convert all upper-case letters to lower case, so that differently capitalized copies of a word are treated as the same word, and replace every non-word character (punctuation and symbols) with a space.
This will allow us to transform the dataset's SMS column into **multiple columns of words and their counts**.
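
As a quick illustration of what this cleaning does to a single message, here is a minimal sketch using only the standard library. The helper name `clean_and_tokenize` and the sample message are made up for this example; the notebook itself applies the same steps to whole pandas columns.

```python
import re

def clean_and_tokenize(message):
    """Lowercase a message, replace non-word characters with spaces, split on whitespace."""
    return re.sub(r'\W', ' ', message.lower()).split()

# Hypothetical sample message, not taken from the dataset
tokens = clean_and_tokenize("SECRET PRIZE! Claim your prize NOW!!")
print(tokens)  # ['secret', 'prize', 'claim', 'your', 'prize', 'now']
```

Note that apostrophes are also replaced, so a word such as "won't" ends up as the two tokens `won` and `t`.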

In [7]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


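# regex=True is needed in pandas 2.0+, where str.replace treats the pattern
# as a literal string by default
training['SMS'] = training['SMS'].str.lower()
training['SMS'] = training['SMS'].str.replace(r'\W', ' ', regex=True)
test['SMS'] = test['SMS'].str.lower()
test['SMS'] = test['SMS'].str.replace(r'\W', ' ', regex=True)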


In [8]:
# Split each SMS into a list of words
training['SMS'] = training['SMS'].str.split()

# Collect every word, then keep a single copy of each
vocabulary = []
for row in training['SMS']:
    for word in row:
        vocabulary.append(word)

vocabulary = list(set(vocabulary))

In the cell above, we created a vocabulary list containing a single copy of every whitespace-separated word in the SMS column of the training dataset.

In [9]:
word_counts_per_sms = {unique_word: [0] * len(training['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1 

In [10]:
word_count = pd.DataFrame(word_counts_per_sms)


In [11]:
word_count.head()

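| | urgent | ileave | tendencies | aphex | honeymoon | oops | studies | expires | happy | speeding | ... | pints | 07046744435 | yetty | isnt | news | 900 | mesages | 4eva | hesitant | 69669 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |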


In [12]:
training_count = pd.concat([training,word_count], axis = 1)
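
To make the shape of the word-count table concrete, the following sketch repeats the same counting logic on a toy corpus of two messages. The messages and the `toy_` names are invented for illustration, and plain dictionaries stand in for the DataFrame.

```python
# Toy corpus: two already-tokenized messages (made up for illustration)
toy_messages = [['secret', 'prize', 'claim', 'prize'], ['see', 'you', 'there']]

# One copy of every word, sorted so the column order is reproducible
toy_vocabulary = sorted({word for message in toy_messages for word in message})

# For each vocabulary word, one count per message
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for index, message in enumerate(toy_messages):
    for word in message:
        toy_counts[word][index] += 1

print(toy_vocabulary)       # ['claim', 'prize', 'secret', 'see', 'there', 'you']
print(toy_counts['prize'])  # [2, 0]

# The counts in one row add up to the number of words in that message
row_totals = [sum(toy_counts[word][i] for word in toy_vocabulary)
              for i in range(len(toy_messages))]
print(row_totals)           # [4, 3]
```

Each key of the dictionary becomes a column and each list position becomes a row, which is why most entries in the real table are 0: any single SMS uses only a handful of the vocabulary words.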

## Creating the spam filter ##
We're going to use the Naive Bayes relation:

p(spam|w1, w2, w3, ... wn) ∝ p(spam) x p(w1|spam) x p(w2|spam) x ... x p(wn|spam)

together with the matching expression for ham, to train the program to recognize spam messages. We're going to estimate every probability in it from the training set.

### Calculating Constants ###

In [13]:
spam_set = training_count[training_count['Label']=='spam']
ham_set = training_count[training_count['Label']=='ham']
p_spam = spam_set['Label'].count()/training_count.shape[0]
p_spam = round(p_spam,2)
p_ham = ham_set['Label'].count()/training_count.shape[0]
p_ham = round(p_ham,2)


In [14]:
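# Number of words in each spam message. Summing whole rows would also try to
# add the non-numeric Label and SMS columns, which fails in recent pandas.
n_spam_row = spam_set['SMS'].apply(len)
n_spam = n_spam_row.sum()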

In [None]:
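# Number of words in each ham message. Summing whole rows would also try to
# add the non-numeric Label and SMS columns, which fails in recent pandas.
n_ham_row = ham_set['SMS'].apply(len)
n_ham = n_ham_row.sum()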

In [None]:
# Number of unique words in the vocabulary
n_vocabulary = len(vocabulary)

#Laplace smoothing constant
alpha = 1

In [None]:
d_spam = {}
d_ham = {}

for col in word_count.columns:
    d_spam[col] = 0
    d_ham[col] = 0
    

In [None]:
spam_set_messages = spam_set['SMS']
ham_set_messages = ham_set['SMS']

In [None]:
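# d_spam and d_ham share the same keys, so one loop covers both.
# Each value is P(word|class) with Laplace smoothing.
for word in d_spam:
    d_spam[word] = (spam_set[word].sum() + alpha) / (n_spam + alpha * n_vocabulary)
    d_ham[word] = (ham_set[word].sum() + alpha) / (n_ham + alpha * n_vocabulary)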
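
The effect of the smoothing constant is easiest to see with small numbers. The sketch below applies the same formula to hypothetical counts (none of them come from the dataset) and uses `Fraction` so the results are exact. The function name `smoothed_probability` is an assumption made for this example.

```python
from fractions import Fraction

def smoothed_probability(word_count, class_word_total, vocabulary_size, alpha=1):
    """P(word | class) with additive (Laplace) smoothing, as an exact fraction."""
    return Fraction(word_count + alpha, class_word_total + alpha * vocabulary_size)

# Hypothetical counts: 'prize' occurs 3 times among 20 spam words,
# 'hello' never occurs in spam, and the vocabulary holds 10 distinct words
p_prize_given_spam = smoothed_probability(3, 20, 10)
p_hello_given_spam = smoothed_probability(0, 20, 10)

print(p_prize_given_spam)  # 2/15
print(p_hello_given_spam)  # 1/30
```

Without smoothing, a word that never appears in spam would get a probability of 0 and wipe out the whole product for any message containing it. With `alpha = 1` it gets a small but non-zero probability instead.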
   
                                                    

In [None]:
import re

def classify(message):
    # Clean the message the same way as the training data
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Multiply in P(word|class) for every known word;
    # words that are not in the training vocabulary are skipped
    p_spam_all_words = 1
    p_ham_all_words = 1
    for word in message:
        if word in d_spam:
            p_spam_all_words = p_spam_all_words * d_spam[word]
        if word in d_ham:
            p_ham_all_words = p_ham_all_words * d_ham[word]
    p_spam_given_message = p_spam * p_spam_all_words
    p_ham_given_message = p_ham * p_ham_all_words

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')
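
Multiplying many probabilities that are each well below 1 can underflow to 0.0 for long messages. A common workaround is to add logarithms instead of multiplying. The sketch below is one possible variant, not part of the original notebook: the name `classify_log`, its explicit parameters, and the toy probabilities are all assumptions made for illustration. It also returns the label rather than printing it.

```python
import math
import re

def classify_log(message, p_spam, p_ham, d_spam, d_ham):
    """Return 'spam', 'ham', or 'needs human classification' by comparing summed log-probabilities."""
    words = re.sub(r'\W', ' ', message.lower()).split()
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the training vocabulary are skipped
        if word in d_spam:
            log_spam += math.log(d_spam[word])
        if word in d_ham:
            log_ham += math.log(d_ham[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

# Made-up parameters, for illustration only
toy_d_spam = {'prize': 0.2, 'see': 0.01}
toy_d_ham = {'prize': 0.01, 'see': 0.2}

print(classify_log('PRIZE prize!', 0.13, 0.87, toy_d_spam, toy_d_ham))  # spam
print(classify_log('See you', 0.13, 0.87, toy_d_spam, toy_d_ham))       # ham
```

Because the logarithm is monotonic, comparing the two sums gives the same ordering as comparing the two products. SMS messages are short, so underflow is unlikely here, but the log form costs nothing extra.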

## Testing our filter ##


In [None]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

In [None]:
classify("Sounds good, Tom, then see u there")

In [None]:
print(test.head(5))
test[0:5]['SMS'].apply(classify)

## Results ##
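The filter runs end to end: it produces a label for both hand-written messages and for the first five test messages. Its accuracy has not been measured in this notebook. To support a claim about accuracy, the predicted labels need to be compared with the `Label` column across the whole test set.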
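
A minimal sketch of that comparison follows. It assumes a variant of `classify` that returns the label instead of printing it, which the notebook does not define, and the two lists below are placeholder labels rather than real predictions.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of messages whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Placeholder labels for illustration; in the notebook these would be
# test['Label'] and the labels returned for each row of test['SMS']
true_labels = ['ham', 'spam', 'ham', 'ham']
predicted_labels = ['ham', 'spam', 'spam', 'ham']

print(accuracy(true_labels, predicted_labels))  # 0.75
```

Since about 87% of the messages are ham, a filter that always answers "ham" would already reach roughly 87% accuracy, so that figure is a useful baseline when reading the result.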