## Spam Filter

We're going to use the multinomial Naive Bayes algorithm to build a spam filter for SMS messages. The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from [the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

Our goal is to create a spam filter that classifies new messages with an accuracy greater than 80%. In other words, we expect more than 80% of new messages to be classified correctly as spam or ham (non-spam).

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
file = 'https://dq-content.s3.amazonaws.com/433/SMSSpamCollection'
spam_collection = pd.read_csv(file, sep='\t', header=None, names=['Label', 'SMS'])

In [3]:
spam_collection.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
spam_collection.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


The dataset has 2 columns and 5572 rows.

In [5]:
spam_collection['Label'].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

About **87%** of the messages are ham ('ham' means non-spam) and **13%** of the messages are spam.

## Training and Test Set

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

* The training set will have 4,457 messages (about 80% of the dataset).
* The test set will have 1,115 messages (about 20% of the dataset).

Start by randomizing the entire dataset by using **the DataFrame.sample()** method.

In [6]:
spam_collection = spam_collection.sample(frac=1, random_state=1)

We'll split the randomized dataset into a training and a test set.

In [7]:
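train_size = int(len(spam_collection) * 0.8)
# .copy() makes each subset an independent DataFrame, so later column
# assignments don't trigger SettingWithCopyWarning
train_set = spam_collection.iloc[:train_size].copy()
test_set = spam_collection.iloc[train_size:].copy()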
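
Because `int()` truncates rather than rounds, the exact sizes of the two sets can be checked with a little arithmetic. This is a minimal sketch that assumes only the 5,572-row count reported by `info()` above; the variable names `n_messages` and `test_size` are illustrative and not part of the project code.

```python
n_messages = 5572  # row count reported by spam_collection.info()

train_size = int(n_messages * 0.8)   # int() drops the fractional part of 4457.6
test_size = n_messages - train_size  # everything after the first train_size rows

print(train_size, test_size)  # prints: 4457 1115
```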

We'll reset the index labels for both datasets, since they remained unordered after randomization.

In [8]:
train_set.reset_index(inplace=True)
test_set.reset_index(inplace=True)

In [9]:
train_set['Label'].value_counts(normalize=True)*100

ham     86.53803
spam    13.46197
Name: Label, dtype: float64

In [10]:
test_set['Label'].value_counts(normalize=True)*100

ham     86.816143
spam    13.183857
Name: Label, dtype: float64

The percentages of spam and ham in both the training and the test set are similar to those in the full dataset.

## Data Cleaning and Format

To calculate all the probabilities, we'll first need to perform a bit of data cleaning to bring the data into a format that lets us easily extract all the information we need. Right now, the **SMS** column of our training and test sets holds the raw message text, with mixed case and punctuation.

We'll remove all the punctuation from **the SMS** column. For each message, we'll transform every letter in every word to lower case.
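
As a small illustration of what this cleaning does to a single message, the sketch below applies the same steps with Python's `re` module instead of the pandas string methods. The sample text is one of the messages classified later in this project; the names `sample` and `cleaned` are only for illustration.

```python
import re

sample = 'WINNER!! This is the secret code to unlock the money: C3421.'

# Replace every non-word character with a space, then lowercase the text
cleaned = re.sub(r'\W', ' ', sample).lower()

# split() with no arguments collapses the runs of spaces left by the substitution
print(cleaned.split())
```

Note that `\W` matches anything other than letters, digits, and the underscore, so the digits in `C3421` survive the cleaning.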

In [11]:
train_set['SMS'] = train_set['SMS'].str.replace(r'\W', ' ', regex=True)
train_set['SMS'] = train_set['SMS'].str.lower()



In [12]:
train_set.head()

Unnamed: 0,index,Label,SMS
0,1078,ham,yep by the pretty sculpture
1,4028,ham,yes princess are you going to make me moan
2,958,ham,welp apparently he retired
3,4642,ham,havent
4,4674,ham,i forgot 2 ask ü all smth there s a card on ...


## Creating Vocabulary

We'll create a list with all of the unique words that occur in the messages of our training set.

In [13]:
train_set['SMS'] = train_set['SMS'].str.split()




In [14]:
vocabulary = []

for text in train_set['SMS']:
    for word in text:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

In [15]:
len(vocabulary)

7782

## The Final Training Set

In [16]:
word_counts_per_sms = {unique_word: [0] * len(train_set['SMS']) for unique_word in vocabulary}

In [17]:
for index, sms in enumerate(train_set['SMS']):
    for word in sms:
         word_counts_per_sms[word][index] += 1

In [18]:
word_counts = pd.DataFrame(word_counts_per_sms)

In [19]:
word_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
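
To make the structure of this table easier to see, here is a toy run of the same dictionary-of-lists approach on a two-message corpus. The messages and the `toy_` names are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Two already-tokenized messages (hypothetical data)
toy_messages = [['free', 'prize', 'free'], ['see', 'you', 'there']]
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One list of zeros per unique word, one slot per message
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}

for index, sms in enumerate(toy_messages):
    for word in sms:
        toy_counts[word][index] += 1

# Each row is a message, each column a word, each cell a count
toy_word_counts = pd.DataFrame(toy_counts)
print(toy_word_counts)
```

The word `free` appears twice in the first message, so its cell in row 0 holds 2, in the same way that `ü` holds 2 in row 4 of the output above.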


Now we'll concatenate the training set and the DataFrame of word counts.

In [20]:
training_set = pd.concat([train_set, word_counts], axis=1)

In [21]:
training_set.head()

Unnamed: 0,index,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,1078,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,4028,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,958,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4642,ham,[havent],0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4674,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


## Calculating Constants First

We'll calculate P(Spam), P(Ham), the number of words in all the spam messages, the number of words in all the non-spam messages, and the number of words in the vocabulary.
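
For reference, these constants feed into the usual multinomial Naive Bayes formulation with additive smoothing, which is what the code in the following sections appears to compute. The notation is an assumption made for this note: $N_{w_i \mid \text{Spam}}$ is the number of times the word $w_i$ occurs in spam messages, $N_{\text{Spam}}$ is the total number of words in spam messages, $N_{\text{Vocabulary}}$ is the vocabulary size, and $\alpha$ is the smoothing parameter.

```latex
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam})

P(\text{Ham} \mid w_1, \dots, w_n) \propto P(\text{Ham}) \prod_{i=1}^{n} P(w_i \mid \text{Ham})

P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \, N_{\text{Vocabulary}}}

P(w_i \mid \text{Ham}) = \frac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \, N_{\text{Vocabulary}}}
```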

In [22]:
# split training_set into spam and ham messages
spam_messages = training_set[training_set['Label'] == 'spam']
ham_messages = training_set[training_set['Label'] == 'ham']

p_spam = len(spam_messages) / len(training_set)
p_ham = len(ham_messages) / len(training_set)

In [23]:
# the number of words in all the spam messages
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

In [24]:
# the number of words in all the non-spam messages
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

In [25]:
# the number of words in vocabulary
n_vocabulary = len(vocabulary)

We'll also use Laplace smoothing and set α=1.

In [26]:
alpha = 1

## Calculating Parameters

We'll initialize two dictionaries, one for spam and one for ham. In each, every key is a unique word (from our vocabulary) represented as a string, and every value is 0.

In [28]:
spam_dict = {word:0 for word in vocabulary}
ham_dict = {word:0 for word in vocabulary}

In [30]:
for word in vocabulary:
    # total occurrences of the word across all spam messages
    n_word_given_spam = spam_messages[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha * n_vocabulary)
    spam_dict[word] = p_word_given_spam

    # same calculation for the ham messages
    n_word_given_ham = ham_messages[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha * n_vocabulary)
    ham_dict[word] = p_word_given_ham
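
The effect of the smoothing term is easier to see on small numbers. The sketch below uses a hypothetical class with 10 words in total and a 5-word vocabulary; the helper `smoothed_probability()` and the counts are illustrative and not part of the project code.

```python
def smoothed_probability(word_count, n_class_words, n_vocabulary, alpha=1):
    """Laplace-smoothed estimate of P(word | class)."""
    return (word_count + alpha) / (n_class_words + alpha * n_vocabulary)

# Hypothetical per-word counts for one class: 10 words in total, 5 vocabulary words
toy_class_counts = [3, 2, 5, 0, 0]
n_class_words = sum(toy_class_counts)
n_vocabulary = len(toy_class_counts)

toy_probabilities = [
    smoothed_probability(count, n_class_words, n_vocabulary)
    for count in toy_class_counts
]

print(toy_probabilities)
print(sum(toy_probabilities))
```

A word that never appears in the class still gets a probability of 1/15 rather than 0, so a single unseen word can't zero out the whole product, and the smoothed probabilities still sum to 1 over the vocabulary (up to floating-point rounding).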

## Classifying A New Message

Now we'll write a function that calculates p_spam_given_message and p_ham_given_message for a message and prints the resulting label. Words that aren't in the vocabulary are ignored.

In [31]:
def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]
        
        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]
        
    
    
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

We'll use **the classify() function** to classify two new messages.

In [32]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3489598779101096e-25
P(Ham|message): 1.9380782419077522e-27
Label: Spam


In [33]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4385273359614485e-25
P(Ham|message): 3.6893872875947e-21
Label: Ham
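
The products printed above are already as small as 1e-25 to 1e-27 for short messages. For much longer messages, multiplying many small probabilities could underflow to 0.0 for both classes. A common alternative, which this project does not use, is to add logarithms instead of multiplying probabilities. The sketch below assumes hypothetical probability dictionaries and priors; `classify_log()` and the `toy_` names are illustrative only.

```python
import math
import re

def classify_log(message, p_spam, p_ham, spam_probs, ham_probs):
    """Compare log-probabilities instead of raw products."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)

    for word in words:
        # Words missing from the dictionaries are ignored, as in classify()
        if word in spam_probs:
            log_spam += math.log(spam_probs[word])
        if word in ham_probs:
            log_ham += math.log(ham_probs[word])

    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

# Hypothetical parameters, not the values learned from the training set
toy_spam_probs = {'free': 0.05, 'prize': 0.04, 'see': 0.001}
toy_ham_probs = {'free': 0.002, 'prize': 0.001, 'see': 0.03}

print(classify_log('FREE prize!!', 0.13, 0.87, toy_spam_probs, toy_ham_probs))
print(classify_log('see you', 0.13, 0.87, toy_spam_probs, toy_ham_probs))
```

Because the logarithm is strictly increasing, comparing the two sums gives the same label as comparing the two products whenever the products don't underflow.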


## Measuring the Spam Filter's Accuracy

We'll now try to determine how well the spam filter does on our test set of 1,115 messages. We'll start by writing a function that returns classification labels instead of printing them.

In [38]:
def classify_test_set(message):    
      
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]
            
        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

Now that we have a function that returns labels instead of printing them, we can use it to create a new column in our test set.

In [39]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()



Unnamed: 0,index,Label,SMS,predicted
0,3482,ham,Wherre's my boytoy ? :-(,ham
1,2131,ham,Later i guess. I needa do mcat study too.,ham
2,3418,ham,But i haf enuff space got like 4 mb...,ham
3,3424,spam,Had your mobile 10 mths? Update to latest Oran...,spam
4,1538,ham,All sounds good. Fingers . Makes it difficult ...,ham


Now we'll measure the accuracy of our spam filter by counting how many predicted labels match the actual labels.

In [40]:
correct = 0
total = test_set.shape[0]

In [43]:
for _, row in test_set.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1101
Incorrect: 14
Accuracy: 0.9874439461883409
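
The same count can also be computed without an explicit loop by comparing the two columns directly. This is a sketch on a small hypothetical DataFrame, not on the actual test set; `toy_results` and the other names are illustrative.

```python
import pandas as pd

# Hypothetical labels and predictions for four messages
toy_results = pd.DataFrame({
    'Label': ['ham', 'spam', 'ham', 'ham'],
    'predicted': ['ham', 'spam', 'spam', 'ham'],
})

# Element-wise comparison gives a boolean Series; True counts as 1
matches = toy_results['Label'] == toy_results['predicted']

toy_correct = int(matches.sum())
toy_accuracy = matches.mean()

print('Correct:', toy_correct)
print('Incorrect:', len(toy_results) - toy_correct)
print('Accuracy:', toy_accuracy)
```

With three of the four toy predictions matching, this reports an accuracy of 0.75.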


The accuracy is about 98.74%, which is really good. Our spam filter looked at 1,115 messages that it hadn't seen in training, and classified 1,101 of them correctly.

In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 98.74% on the test set, which is an excellent result. We initially aimed for an accuracy of over 80%, but we managed to do much better than that.

In [45]:
isolate_incorrectly = test_set[test_set['Label']!=test_set['predicted']]

In [46]:
isolate_incorrectly

Unnamed: 0,index,Label,SMS,predicted
115,3460,spam,Not heard from U4 a while. Call me now am here...,ham
136,1940,spam,More people are dogging in your area now. Call...,ham
153,3890,ham,Unlimited texts. Limited minutes.,spam
160,991,ham,26th OF JULY,spam
285,4862,ham,Nokia phone is lovly..,spam
294,2370,ham,A Boy loved a gal. He propsd bt she didnt mind...,needs human classification
303,326,ham,No calls..messages..missed calls,spam
320,5046,ham,We have sent JD for Customer Service cum Accou...,spam
505,3864,spam,Oh my god! I've found your number again! I'm s...,ham
547,4676,spam,"Hi babe its Chloe, how r u? I was smashed on s...",ham


In [47]:
incorrectly_vocabulary = isolate_incorrectly['SMS'].str.split()

In [48]:
incorrectly_vocabulary

115    [Not, heard, from, U4, a, while., Call, me, no...
136    [More, people, are, dogging, in, your, area, n...
153               [Unlimited, texts., Limited, minutes.]
160                                     [26th, OF, JULY]
285                          [Nokia, phone, is, lovly..]
294    [A, Boy, loved, a, gal., He, propsd, bt, she, ...
303                 [No, calls..messages..missed, calls]
320    [We, have, sent, JD, for, Customer, Service, c...
505    [Oh, my, god!, I've, found, your, number, agai...
547    [Hi, babe, its, Chloe,, how, r, u?, I, was, sm...
742    [0A$NETWORKS, allow, companies, to, bill, for,...
877    [RCT', THNQ, Adrian, for, U, text., Rgds, Vatian]
886                                     [2/2, 146tf150p]
954    [Hello., We, need, some, posh, birds, and, cha...
Name: SMS, dtype: object

In [49]:
vocabulary_new = []

for text in incorrectly_vocabulary:
    for word in text:
        vocabulary_new.append(word)
        
vocabulary_new = list(set(vocabulary_new))

In [57]:
dict_matches = {}

for word in vocabulary:
    if word in vocabulary_new:
        dict_matches[word]=0

In [58]:
len(dict_matches)

125

In [59]:
len(vocabulary) - len(dict_matches)

7657

In [60]:
len(vocabulary_new)

243

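The training vocabulary contains only 125 of the 243 unique tokens found in the 14 misclassified messages. These tokens were split without lowercasing or removing punctuation, so the true overlap is probably higher than this count suggests. Still, words missing from the vocabulary are ignored by the classifier, which may help explain why these messages were misclassified.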
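
Since the classifier lowercases a message and strips its punctuation before looking words up, a closer check of the overlap could apply the same cleaning to the misclassified messages first. The sketch below shows the idea on one message from the table above, using a small hypothetical vocabulary in place of the real one; `unseen_words()` and `toy_vocabulary` are illustrative names.

```python
import re

def unseen_words(message, known_vocabulary):
    """Return the cleaned tokens of a message that are absent from the vocabulary."""
    tokens = re.sub(r'\W', ' ', message).lower().split()
    return [token for token in tokens if token not in known_vocabulary]

# Hypothetical stand-in for the training vocabulary
toy_vocabulary = {'nokia', 'phone', 'is', 'call', 'me'}

print(unseen_words('Nokia phone is lovly..', toy_vocabulary))
```

With this toy vocabulary, only the misspelled `lovly` is reported as unseen, whereas a case-sensitive comparison of the raw tokens would also have flagged `Nokia` and `lovly..`.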