# Building a Spam Filter with Naive Bayes

**In this guided project, we're going to study the practical side of the Naive Bayes algorithm by building a spam filter for SMS messages.**

To classify messages as spam or non-spam, we saw in the previous mission that the computer:

* Learns how humans classify messages.
* Uses that human knowledge to estimate probabilities for new messages: the probability that a message is spam and the probability that it is non-spam.
* Classifies a new message based on these probability values. If the probability for spam is greater, it classifies the message as spam; if the probability for non-spam is greater, it classifies the message as non-spam. If the two values are equal, we may need a human to classify the message.

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.
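
As a reminder of what the algorithm compares, here is a sketch of the two quantities for a message made of the words $w_1, \ldots, w_n$. The notation is our own shorthand and assumes the standard multinomial Naive Bayes formulation, in which words are treated as conditionally independent given the label:

```latex
\begin{align*}
P(\text{Spam} \mid w_1, \ldots, w_n) &\propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam}) \\
P(\text{Ham} \mid w_1, \ldots, w_n) &\propto P(\text{Ham}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Ham})
\end{align*}
```

The message is assigned to whichever label has the larger value.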

In [1]:
import pandas as pd

In [2]:
sms = pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=['Label', 'SMS'])

In [3]:
sms.head(5)

| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [4]:
sms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [5]:
# Percent spam and ham in data
sms['Label'].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

In [6]:
# Randomizing the data so we can extract training data and test data

In [7]:
random_sms = sms.sample(frac=1, random_state=1)

In [8]:
training_sms = random_sms.iloc[:4458, :] # 80 percent of data
training_sms.reset_index(drop=True, inplace=True)

In [9]:
test_sms = random_sms.iloc[4458:, :] # 20 percent of data
test_sms.reset_index(drop=True, inplace=True)
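
The shuffle-and-split steps above can be wrapped in a small helper. This is only a sketch: `shuffle_and_split`, `train_fraction`, and the toy DataFrame are hypothetical names introduced here for illustration, not part of the project code.

```python
import pandas as pd

def shuffle_and_split(df, train_fraction=0.8, seed=1):
    """Shuffle all rows, then cut the result into a training and a test set."""
    shuffled = df.sample(frac=1, random_state=seed)
    cutoff = round(len(shuffled) * train_fraction)
    # reset_index returns new DataFrames, so nothing is modified in place on a slice
    train = shuffled.iloc[:cutoff].reset_index(drop=True)
    test = shuffled.iloc[cutoff:].reset_index(drop=True)
    return train, test

# Toy data standing in for the SMS dataset
toy = pd.DataFrame({
    "Label": ["ham"] * 8 + ["spam"] * 2,
    "SMS": [f"message {i}" for i in range(10)],
})
train, test = shuffle_and_split(toy)
print(len(train), len(test))
```

With 5,572 rows and `train_fraction=0.8`, the cutoff rounds to 4,458, which matches the hard-coded index used above.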

In [10]:
# Percent spam and ham in training data
training_sms["Label"].value_counts(normalize=True)*100

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [11]:
# Percent spam and ham in test data
test_sms["Label"].value_counts(normalize=True)*100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

The proportions of spam and ham in both sets are very similar to those in the original (full) dataset.

In [12]:
# Recent pandas versions need regex=True here; assign() returns a new DataFrame instead of writing into a slice
training_sms = training_sms.assign(SMS=training_sms["SMS"].str.replace(r"\W", " ", regex=True).str.lower().str.split())
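
To make the cleaning step concrete, here is a minimal sketch that applies the same three operations to two made-up messages. The `toy` Series is hypothetical; only the method chain mirrors the cell above.

```python
import pandas as pd

toy = pd.Series(["SECRET PRIZE! Claim NOW!!", "Ok lar... see u @ 5"])

cleaned = (
    toy.str.replace(r"\W", " ", regex=True)  # every non-word character becomes a space
       .str.lower()                          # case no longer distinguishes words
       .str.split()                          # each message becomes a list of words
)
print(cleaned.tolist())
# [['secret', 'prize', 'claim', 'now'], ['ok', 'lar', 'see', 'u', '5']]
```

Punctuation disappears and digits survive, which is why tokens such as phone numbers end up in the vocabulary.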


In [13]:
vocabulary = set()
for i in training_sms["SMS"]:
    for j in i:
        vocabulary.add(j)

In [14]:
vocabulary = list(vocabulary)

In [15]:
vocabulary[:5]

['person', '08718726978', 'stuffing', '09064011000', 'lov']

In [16]:
word_counts_per_sms = {unique_word : [0]*len(training_sms)
                      for unique_word in vocabulary}

In [17]:
for idx, message in enumerate(training_sms["SMS"]):  # avoid reusing the name `sms`, which holds the full DataFrame
    for word in message:
        word_counts_per_sms[word][idx] += 1

In [18]:
new_sms = pd.DataFrame(word_counts_per_sms)

In [19]:
new_sms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4458 entries, 0 to 4457
Columns: 7783 entries, 0 to 鈥
dtypes: int64(7783)
memory usage: 264.7 MB
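
The same dictionary-of-lists construction is easier to follow on a tiny example. The sketch below uses two made-up, already tokenized messages; `toy_messages`, `toy_vocabulary`, and `toy_counts` are hypothetical names used only here.

```python
import pandas as pd

toy_messages = [["free", "prize", "free"], ["see", "you", "soon"]]
toy_vocabulary = sorted({word for message in toy_messages for word in message})

# One list of zeros per word, with one slot per message
counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for idx, message in enumerate(toy_messages):
    for word in message:
        counts[word][idx] += 1

# Each column is a word, each row a message
toy_counts = pd.DataFrame(counts)
print(toy_counts)
```

Row 0 holds a 2 under `free` and a 1 under `prize`; row 1 holds a 1 under each of `see`, `soon`, and `you`. Every other cell is 0, which is why the full table is mostly zeros.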


In [20]:
training_sms = pd.concat([training_sms, new_sms], axis=1)

In [21]:
training_sms.head(3)

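| | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |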


In [22]:
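# value_counts() sorts by frequency (ham first), so look the priors up by label
# instead of unpacking by position, which would swap them. The outputs recorded
# further down came from the positional unpacking and may change on a rerun.
label_shares = training_sms["Label"].value_counts(normalize=True)
p_spam, p_ham = label_shares["spam"], label_shares["ham"]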

n_spam = training_sms[training_sms["Label"] == "spam"]["SMS"].str.len().sum()

n_ham = training_sms[training_sms["Label"] == "ham"]["SMS"].str.len().sum()

n_vocabulary = len(vocabulary)

alpha = 1

In [23]:
p_w_spam = {unique_word : 0 for unique_word in vocabulary}
p_w_ham = {unique_word : 0 for unique_word in vocabulary}

In [24]:
spam_sms = training_sms[training_sms["Label"] == "spam"]
ham_sms = training_sms[training_sms["Label"] == "ham"]

In [25]:
for word in vocabulary:
    n_word_spam = spam_sms[word].sum()
    numerator = n_word_spam + alpha
    denominator = n_spam + (alpha*n_vocabulary)
    p_word_spam = numerator / denominator
    p_w_spam[word] = p_word_spam

In [26]:
for word in vocabulary:
    n_word_ham = ham_sms[word].sum()
    numerator = n_word_ham + alpha
    denominator = n_ham + (alpha*n_vocabulary)
    p_word_ham = numerator / denominator
    p_w_ham[word] = p_word_ham
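
Written as formulas, the two loops above compute the following smoothed estimates for each word $w$ in the vocabulary. The symbols are our own shorthand: $N_{w \mid \text{Spam}}$ corresponds to `n_word_spam`, $N_{\text{Spam}}$ to `n_spam`, $N_{\text{Vocabulary}}$ to `n_vocabulary`, and $\alpha$ to `alpha` (and likewise for ham).

```latex
\begin{align*}
P(w \mid \text{Spam}) &= \frac{N_{w \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}} \\
P(w \mid \text{Ham}) &= \frac{N_{w \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
\end{align*}
```

Because $\alpha = 1$, every numerator is at least 1, so a word that never appears in spam (or in ham) still gets a small nonzero probability instead of zeroing out the whole product.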

In [40]:
import re

def classify(message):
    message = re.sub(r"\W", " ", message)
    message = message.lower()
    message = message.split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in p_w_spam:
            p_spam_given_message *= p_w_spam[word]
        if word in p_w_ham:
            p_ham_given_message *= p_w_ham[word]
    
    if p_spam_given_message > p_ham_given_message:
        return "spam"
    elif p_ham_given_message > p_spam_given_message:
        return "ham"
    else:
        return "Needs human classification"
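
One possible refinement, not used in this project, is to add log-probabilities instead of multiplying raw probabilities, since a long product of small numbers can underflow to zero in floating point. The sketch below is a variant under stated assumptions: `classify_log` and the `toy_*` parameters are hypothetical stand-ins for `p_spam`, `p_ham`, `p_w_spam`, and `p_w_ham`, and it takes an already tokenized message.

```python
import math

# Hypothetical toy parameters, for illustration only
toy_p_spam, toy_p_ham = 0.2, 0.8
toy_p_w_spam = {"free": 0.5, "prize": 0.3, "hello": 0.2}
toy_p_w_ham = {"free": 0.1, "prize": 0.1, "hello": 0.8}

def classify_log(words):
    """Compare sums of logs rather than products of probabilities."""
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        if word in toy_p_w_spam:  # words outside the vocabulary are skipped
            log_spam += math.log(toy_p_w_spam[word])
            log_ham += math.log(toy_p_w_ham[word])
    if log_spam > log_ham:
        return "spam"
    if log_ham > log_spam:
        return "ham"
    return "Needs human classification"

print(classify_log(["free", "prize"]))  # spam
print(classify_log(["hello"]))          # ham
```

Because the logarithm is increasing, comparing the sums gives the same decision as comparing the products whenever the products are representable. The smoothing step guarantees that every word probability is above zero, so the logarithm is always defined.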

In [41]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

'spam'

In [42]:
classify("Sounds good, Tom, then see u there")

'ham'

In [45]:
test_sms = test_sms.assign(predicted=test_sms["SMS"].apply(classify))  # assign() avoids writing into a slice
test_sms.head()


| | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [53]:
correct = (test_sms["Label"] == test_sms["predicted"]).sum()
total = len(test_sms)

In [67]:
accuracy = correct/total
print('Correct:', correct)
print('Incorrect:', total-correct)
print('Accuracy:', accuracy*100)

Correct: 1061
Incorrect: 53
Accuracy: 95.2423698384201
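
Accuracy alone hides which kind of mistake the filter makes. Since ham heavily outnumbers spam, it can help to break the errors down with a confusion matrix. The sketch below uses made-up labels and predictions, not the project's results; the `toy` DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical labels and predictions, for illustration only
toy = pd.DataFrame({
    "Label":     ["ham", "ham", "ham", "spam", "spam", "ham"],
    "predicted": ["ham", "ham", "spam", "spam", "ham", "ham"],
})

# Rows are the true labels, columns are the predicted labels
confusion = pd.crosstab(toy["Label"], toy["predicted"])
print(confusion)

true_spam = confusion.loc["spam", "spam"]
precision = true_spam / confusion["spam"].sum()   # share of flagged messages that really are spam
recall = true_spam / confusion.loc["spam"].sum()  # share of actual spam that was caught
print(precision, recall)
```

On the real test set, the `predicted` column may also contain "Needs human classification", which would show up as an extra column in the table.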


# Conclusions

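In this project, we built a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter classified 1,061 of the 1,114 test messages correctly, an accuracy of **95.24%**. We initially aimed for an accuracy of over 80%, and the filter comfortably exceeded that goal. For context, about 86.8% of the test messages are ham, so a filter that labels every message as ham would score roughly that much; the result here is well above that baseline.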