# Building a Spam Filter with Naive Bayes

In this project, we're going to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. Our goal is to write a program that classifies new messages with an accuracy greater than 80%, meaning that more than 80% of new messages should be classified correctly as spam or ham (non-spam).

Our first task is to "train" the algorithm to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that have already been classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the UCI Machine Learning Repository.

**Note that due to the nature of spam messages, the dataset contains content that may be offensive to some users.**

In [1]:
import pandas as pd

In [2]:
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [3]:
sms.shape

(5572, 2)

In [4]:
sms.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
sms['Label'].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

When creating software (a spam filter is software), a good rule of thumb is that designing the test comes before creating the software. If we write the software first, then it's tempting to come up with a biased test just to make sure the software passes it.

Once our spam filter is done, we'll need to test how well it classifies new messages. To do that, we're first going to split our dataset into two sets:

- A training set, which we'll use to "train" the algorithm to classify messages.
- A test set, which we'll use to test how well the spam filter classifies new messages.

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

- The training set will have 4,458 messages (about 80% of the dataset).
- The test set will have 1,114 messages (about 20% of the dataset).

In [6]:
# randomizing the entire dataset

randomized = sms.sample(frac=1, random_state=1)

In [7]:
# Split the randomized dataset into a training and a test set

# Calculate index for split
training_test = round(len(randomized) * 0.8)

# Training/Test split
train = randomized[:training_test].reset_index(drop=True)
test = randomized[training_test:].reset_index(drop=True)

print(training_test)
print(train.shape)
print(test.shape)

4458
(4458, 2)
(1114, 2)


In [8]:
# percentage of spam and ham in both the training and the test set

train['Label'].value_counts(normalize=True)*100


ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [9]:
test['Label'].value_counts(normalize=True)*100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

In [10]:
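# Replace non-word characters with spaces and lowercase the text.
# regex=True is required: pandas 2.0+ treats the pattern as a literal string by default.
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()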

In [11]:
train.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


In [12]:
train['SMS'] = train['SMS'].str.split()

# Collect every word from the training messages, then deduplicate
vocabulary = []
for message in train['SMS']:
    for word in message:
        vocabulary.append(word)

vocabulary = list(set(vocabulary))

In [13]:
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

# The loop variable is named `message` so it does not overwrite the `sms` DataFrame
for index, message in enumerate(train['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1
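
To see what `word_counts_per_sms` looks like, here is a minimal sketch of the same counting logic on two made-up messages. The `toy_` names are hypothetical and not part of the notebook. Each key is a vocabulary word and each value is a list holding one count per message, which is why the dictionary converts directly into a DataFrame with one column per word.

```python
# Two made-up, already tokenized messages (hypothetical data)
toy_messages = [['free', 'prize', 'free'], ['see', 'you', 'soon']]

# Unique words across all messages, sorted only to make the output stable
toy_vocabulary = sorted({word for message in toy_messages for word in message})

# One list per word, with one slot per message, initialised to zero
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}

for index, message in enumerate(toy_messages):
    for word in message:
        toy_counts[word][index] += 1

print(toy_counts)
```

Here `toy_counts['free']` ends up as `[2, 0]`: the word appears twice in the first message and never in the second.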

In [14]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [15]:
new_train = pd.concat([train, word_counts], axis=1)
new_train.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [16]:
spam = new_train[new_train['Label'] == 'spam']
ham = new_train[new_train['Label'] == 'ham']

In [17]:
# Prior probabilities P(Spam) and P(Ham), estimated from the training set
p_spam = len(spam) / len(new_train)
p_ham = len(ham) / len(new_train)

In [18]:
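# Total number of words in all spam messages and in all ham messages
n_spam = spam['SMS'].apply(len).sum()
n_ham = ham['SMS'].apply(len).sum()
# Number of unique words in the vocabulary
n_v = len(vocabulary)
# Smoothing parameter (alpha = 1 is Laplace smoothing)
alpha = 1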

In [19]:
# Conditional probabilities P(word|Spam) and P(word|Ham), with additive smoothing
ham_dict = {unique_word: 0 for unique_word in vocabulary}
spam_dict = {unique_word: 0 for unique_word in vocabulary}

for word in vocabulary:
    n_spam_word = spam[word].sum()
    p_n_spam_word = (n_spam_word + alpha) / (n_spam + alpha * n_v)
    spam_dict[word] = p_n_spam_word

    n_ham_word = ham[word].sum()
    p_n_ham_word = (n_ham_word + alpha) / (n_ham + alpha * n_v)
    ham_dict[word] = p_n_ham_word
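
Written as formulas, the loop above estimates each word's conditional probability with additive smoothing. The symbol names are our own notation, chosen to map onto the notebook's variables: $N_{w \mid Spam}$ is `n_spam_word`, $N_{Spam}$ is `n_spam`, $N_{Vocabulary}$ is `n_v`, and $\alpha$ is `alpha` (likewise for ham).

$$
P(w \mid Spam) = \frac{N_{w \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\qquad
P(w \mid Ham) = \frac{N_{w \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
$$

For a message made of the words $w_1, \dots, w_n$, the two scores that get compared are then

$$
P(Spam \mid w_1, \dots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)
\qquad
P(Ham \mid w_1, \dots, w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham)
$$

With $\alpha = 1$, a word that never appears in spam still gets a small non-zero probability, so a single unseen word cannot force the whole product to zero.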

In [20]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for i in message:
        if i in spam_dict:
            p_spam_given_message *= spam_dict[i]
            
        if i in ham_dict:
            p_ham_given_message *= ham_dict[i]
    
    
 
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [21]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [22]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
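
The scores printed above are already as small as 1e-25 to 1e-27 for short messages. For much longer messages, a product of many small probabilities can underflow to 0.0 for both classes, which would produce a false tie. A common workaround is to add logarithms instead of multiplying probabilities. The sketch below is one possible variant, not the notebook's method: `classify_log` and the `toy_` parameters are hypothetical, and the toy probabilities are made up rather than learned from the dataset.

```python
import math
import re

# Hypothetical toy parameters, NOT the values learned in this notebook
toy_p_spam, toy_p_ham = 0.2, 0.8
toy_spam_dict = {'win': 0.30, 'money': 0.25, 'see': 0.05, 'you': 0.40}
toy_ham_dict = {'win': 0.05, 'money': 0.05, 'see': 0.40, 'you': 0.50}

def classify_log(message, p_spam, p_ham, spam_dict, ham_dict):
    """Compare log scores instead of raw probability products."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)

    for word in words:
        # Words outside the vocabulary are skipped, as in classify()
        if word in spam_dict:
            log_spam += math.log(spam_dict[word])
        if word in ham_dict:
            log_ham += math.log(ham_dict[word])

    if log_ham > log_spam:
        return 'ham'
    if log_ham < log_spam:
        return 'spam'
    return 'needs human classification'

label_short = classify_log('Win money!!', toy_p_spam, toy_p_ham,
                           toy_spam_dict, toy_ham_dict)
# 1,000 repeated words: the plain products would underflow to 0.0 for both classes
label_long = classify_log('win ' * 1000, toy_p_spam, toy_p_ham,
                          toy_spam_dict, toy_ham_dict)
print(label_short, label_long)
```

Because the logarithm is monotonic, comparing log scores gives the same label as comparing the products whenever the products are representable as floats.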


In [25]:
def classify_test(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for i in message:
        if i in spam_dict:
            p_spam_given_message *= spam_dict[i]
            
        if i in ham_dict:
            p_ham_given_message *= ham_dict[i]
    
    
 
    #print('P(Spam|message):', p_spam_given_message)
    #print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [26]:
test['predicted'] = test['SMS'].apply(classify_test)
test.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [29]:
correct = 0
total = len(test)

for row in test.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1

acc = correct/total
incorr = total - correct
print('Correct:', correct)
print('Incorrect:', incorr)
print('Accuracy:', acc)

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833
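
Accuracy alone can hide which kind of mistake the filter makes. About 87% of the test messages are ham, so a filter that labelled every message as ham would already score about 87%. A minimal sketch of a per-class breakdown follows. The `confusion_counts` helper and the `toy_` data are hypothetical and the toy labels are made up, so the numbers below say nothing about the notebook's actual results.

```python
from collections import Counter

def confusion_counts(labels, predictions):
    """Count how often each (actual label, predicted label) pair occurs."""
    return Counter(zip(labels, predictions))

# Hypothetical toy data standing in for test['Label'] and test['predicted']
toy_labels = ['ham', 'spam', 'ham', 'spam', 'ham']
toy_predictions = ['ham', 'spam', 'ham', 'ham', 'ham']

toy_confusion = confusion_counts(toy_labels, toy_predictions)

# Accuracy: share of pairs where the prediction matches the actual label
toy_correct = sum(n for (actual, predicted), n in toy_confusion.items()
                  if actual == predicted)
toy_accuracy = toy_correct / len(toy_labels)

# Spam recall: share of actual spam messages that were labelled spam
toy_spam_total = sum(n for (actual, _), n in toy_confusion.items()
                     if actual == 'spam')
toy_spam_recall = toy_confusion[('spam', 'spam')] / toy_spam_total

print(toy_confusion)
print('Accuracy:', toy_accuracy)
print('Spam recall:', toy_spam_recall)
```

On the real data, the helper could be called as `confusion_counts(test['Label'], test['predicted'])`, since `zip` iterates over the values of a pandas Series.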
