## Spam filter using Naïve Bayes

In this project I use the Naive Bayes algorithm to build a spam filter which works by classifying text messages as either spam or not. I start by importing a csv file that contains messages and labels of 'spam' or 'ham' (ham meaning not spam)

In [33]:
import pandas as pd
sms_spam = pd.read_csv('SMSSpamCollection',sep='\t',header=None,names=['Label','SMS'])
sms_spam.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


I explore the data set by checking its size and the how many messages in it are spam.

In [34]:
sms_spam.shape

(5572, 2)

In [35]:
sms_spam["Label"].value_counts()

ham     4825
spam     747
Name: Label, dtype: int64

I split the data up into training and test data.

In [36]:
sms_spam.sample(frac=1,random_state=1)
pct80 = round(0.8*sms_spam.shape[0])
training = sms_spam[:pct80].reset_index(drop=True)
test = sms_spam[pct80:].reset_index(drop=True)

I remove punctuation and change the text to lower case.

In [37]:
training['SMS'] = training['SMS'].str.replace('\W', ' ')
training['SMS'] = training['SMS'].str.lower()

test['SMS'] = test['SMS'].str.replace('\W', ' ')
test['SMS'] = test['SMS'].str.lower()

  """Entry point for launching an IPython kernel.
  after removing the cwd from sys.path.


I now create a list of the words in the training set. I then make a table of all these words, and how many times they appear in each of the messages.

In [38]:
training['SMS'] = training['SMS'].str.split()

vocabulary = []
for sms in training['SMS']:
    for word in sms:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

In [39]:
word_counts_per_sms = {unique_word: [0] * len(training['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,receiving,natural,iz,diesel,relocate,merememberin,downs,0089,sir,effect,...,freezing,expensive,9061100010,l8r,campus,09064019788,450ppw,tall,reckon,public
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


I combine this table with the messages and labels.

In [40]:
training_clean = pd.concat([training, word_counts], axis=1)
training_clean.head()

Unnamed: 0,Label,SMS,receiving,natural,iz,diesel,relocate,merememberin,downs,0089,...,freezing,expensive,9061100010,l8r,campus,09064019788,450ppw,tall,reckon,public
0,ham,"[go, until, jurong, point, crazy, available, o...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[ok, lar, joking, wif, u, oni]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,spam,"[free, entry, in, 2, a, wkly, comp, to, win, f...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,"[u, dun, say, so, early, hor, u, c, already, t...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[nah, i, don, t, think, he, goes, to, usf, he,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


This section calculates the constants needed for the classifier.

In [41]:
spam_messages = training_clean[training_clean['Label'] == 'spam']
ham_messages = training_clean[training_clean['Label'] == 'ham']

p_spam = len(spam_messages) / len(training_clean)
p_ham = len(ham_messages) / len(training_clean)

n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

n_vocabulary = len(vocabulary)

alpha = 1

In [42]:
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham

Now to define the classifier function.

In [43]:
def classify(message, printout=False):
    
    message = message.replace('\W', ' ')
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
    
    if(printout):
        print('P(Spam|message):', p_spam_given_message)
        print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

Trying out the classifier on an example:

In [44]:
classify('Free tickets to the ball game!', printout=True)

P(Spam|message): 1.3750618713561217e-15
P(Ham|message): 5.82971456244398e-16


'spam'

Now to apply this function to the test messages and see how accurate the model is.

In [45]:
test['predicted'] = test['SMS'].apply(classify)

In [46]:
correct = 0
total = test.shape[0]
    
for row in test.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1098
Incorrect: 16
Accuracy: 0.9856373429084381


So the model has over 98.5% accuracy!