## Task 
Building a spam filter for SMS messages

### Data source:
We'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that have already been classified by humans. The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from [the UCI Machine Learning Repository](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection).
The data collection process is described in more detail on [this page](https://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the authors' papers.

In [1]:
import pandas as pd
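# The file is tab-separated and has no header row, so we supply the column names.
# The text column is named 'SMS' (uppercase) so it can't clash with the word "sms" inside the messages.
sms = pd.read_table("SMSSpamCollection", header=None, names=['label', 'SMS'])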
sms.head()

Unnamed: 0,label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
# explore data
print(sms.describe())
sms['label'].value_counts(normalize=True)

       label                     SMS
count   5572                    5572
unique     2                    5169
top      ham  Sorry, I'll call later
freq    4825                      30


ham     0.865937
spam    0.134063
Name: label, dtype: float64

To test the spam filter, we're first going to split our dataset into two sets:

* A training set, which we'll use to "train" the computer to classify messages.
* A test set, which we'll use to test how good the spam filter is at classifying new messages.

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

* The training set will have 4,458 messages (about 80% of the dataset).
* The test set will have 1,114 messages (about 20% of the dataset).

For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80%, so we expect more than 80% of the new messages to be classified correctly as spam or ham (non-spam).

In [3]:
# Randomize dataset
sms_random=sms.sample(frac=1, random_state=1) #frac=1 to randomize entire dataset
separate_index=round(len(sms)*0.8)
training_set=sms_random[:separate_index].reset_index(drop=True) # drop=True to drop old index column of original dataset
test_set=sms_random[separate_index:].reset_index(drop=True)
print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


In [4]:
print(training_set['label'].value_counts(normalize=True))
print(test_set['label'].value_counts(normalize=True))

ham     0.86541
spam    0.13459
Name: label, dtype: float64
ham     0.868043
spam    0.131957
Name: label, dtype: float64


The percentages of spam and ham in both the training set and the test set are close to those of the original dataset (about 87% ham and 13% spam), so the split looks representative.

## Clean data
### Letter Case and Punctuation

In [5]:
training_set.head()

Unnamed: 0,label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [6]:
# Replace every non-word character with a space, then lowercase the text
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()
training_set.head()


Unnamed: 0,label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
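
To make the cleaning step concrete, here is a minimal sketch of the same transformation applied to single strings with Python's `re` module. The helper name `clean` is hypothetical and not part of the notebook; the sample messages are taken from the training set shown above.

```python
import re

def clean(message):
    """Replace every non-word character with a space, then lowercase.

    Hypothetical helper mirroring the pandas call used in the notebook.
    """
    return re.sub(r'\W', ' ', message).lower()

# Punctuation becomes whitespace, so split() later yields clean tokens.
print(clean("Yep, by the pretty sculpture").split())
# ['yep', 'by', 'the', 'pretty', 'sculpture']

# \W is Unicode-aware for str patterns, so letters such as 'ü' are kept.
print(clean("I forgot 2 ask ü all smth.. There's a card").split())
# ['i', 'forgot', '2', 'ask', 'ü', 'all', 'smth', 'there', 's', 'a', 'card']
```

Note that the apostrophe in "There's" is also replaced, which is why "s" shows up as a separate token.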


Now that we're done with data cleaning and have a training set to work with, we can begin creating the spam filter. The Naive Bayes algorithm will need to know the probability values of the two equations below to be able to classify the new messages.
\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam) \\
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}
The formulas above depend on the terms P(w_i|Spam) and P(w_i|Ham), which we calculate with the equations below:
\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}
Some of the terms in the four equations above will have the same value for every new message. As a start, let's first calculate:
* P(Spam): the probability that a message is spam   
* P(Ham): the probability that a message is not spam
* N_Spam: the number of words in all the spam messages
* N_Ham: the number of words in all the non-spam messages
* N_Vocabulary: the number of words in our vocabulary
* Alpha (the α symbol in the numerator and the denominator): a smoothing constant, usually 1, added to every word count so that a word that never appears in one class doesn't turn the whole product into zero. This is known as Laplace smoothing.
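
As a worked example of the smoothed formula, the sketch below computes P(w_i|Spam) and P(w_i|Ham) on a tiny made-up corpus. All `toy_` names and the messages themselves are assumptions for illustration only; they are not part of the SMS dataset.

```python
# Toy corpus (hypothetical), already cleaned and split into words
toy_spam = [["win", "money", "now"], ["win", "prize"]]
toy_ham = [["see", "you", "now"], ["call", "me", "later", "now"]]

toy_alpha = 1
toy_vocabulary = sorted({word for msg in toy_spam + toy_ham for word in msg})
toy_n_vocabulary = len(toy_vocabulary)          # 9 unique words
toy_n_spam = sum(len(msg) for msg in toy_spam)  # 5 words in spam messages
toy_n_ham = sum(len(msg) for msg in toy_ham)    # 7 words in ham messages

def p_word_given_class(word, messages, n_words_in_class):
    """(N_word|class + alpha) / (N_class + alpha * N_vocabulary)."""
    n_word = sum(msg.count(word) for msg in messages)
    return (n_word + toy_alpha) / (n_words_in_class + toy_alpha * toy_n_vocabulary)

# "win" appears twice in spam: (2 + 1) / (5 + 9) = 3/14
print(p_word_given_class("win", toy_spam, toy_n_spam))
# "win" never appears in ham, yet smoothing keeps it above zero: (0 + 1) / (7 + 9) = 1/16
print(p_word_given_class("win", toy_ham, toy_n_ham))
```

Without the alpha terms, the second probability would be exactly 0 and would wipe out the whole product for any message containing "win".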

## Create vocabulary

In [7]:
training_set['SMS']=training_set['SMS'].str.split()
print(training_set.head())

  label                                                SMS
0   ham                  [yep, by, the, pretty, sculpture]
1   ham  [yes, princess, are, you, going, to, make, me,...
2   ham                    [welp, apparently, he, retired]
3   ham                                           [havent]
4   ham  [i, forgot, 2, ask, ü, all, smth, there, s, a,...


In [8]:
vocabulary=[]
for message in training_set['SMS']:  # 'message' avoids shadowing the sms DataFrame
    for word in message:
        if word not in vocabulary:  # keep first-seen order, skip duplicates
            vocabulary.append(word)
print(vocabulary[:5])
n_vocabulary= len(vocabulary)
print(n_vocabulary)

# Alternative: collect every word, then deduplicate with a set
# (faster, but the order of the words is not preserved):
# vocabulary = []
# for message in training_set['SMS']:
#     for word in message:
#         vocabulary.append(word)
# vocabulary = list(set(vocabulary))

['yep', 'by', 'the', 'pretty', 'sculpture']
7783


In [9]:
# One list of zeros per unique word, with one slot per message in the training set
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}
for index, message in enumerate(training_set['SMS']):
    for word in message:
        # the word selects the list; the index selects this message's slot in it
        word_counts_per_sms[word][index] += 1

word_counts=pd.DataFrame(word_counts_per_sms)
print(word_counts.head())
    

   yep  by  the  pretty  sculpture  yes  princess  are  you  going  ...  \
0    1   1    1       1          1    0         0    0    0      0  ...   
1    0   0    0       0          0    1         1    1    1      1  ...   
2    0   0    0       0          0    0         0    0    0      0  ...   
3    0   0    0       0          0    0         0    0    0      0  ...   
4    0   0    0       0          0    0         0    0    0      0  ...   

   beauty  hides  secrets  n8  jewelry  related  trade  arul  bx526  wherre  
0       0      0        0   0        0        0      0     0      0       0  
1       0      0        0   0        0        0      0     0      0       0  
2       0      0        0   0        0        0      0     0      0       0  
3       0      0        0   0        0        0      0     0      0       0  
4       0      0        0   0        0        0      0     0      0       0  

[5 rows x 7783 columns]


In [10]:
new_training=pd.concat([training_set, word_counts], axis=1)
new_training.head()

Unnamed: 0,label,SMS,yep,by,the,pretty,sculpture,yes,princess,are,...,beauty,hides,secrets,n8,jewelry,related,trade,arul,bx526,wherre
0,ham,"[yep, by, the, pretty, sculpture]",1,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
# Look up each share by its label rather than by position
label_shares = new_training['label'].value_counts(normalize=True)
p_spam = label_shares['spam']
p_ham = label_shares['ham']
print("P_spam: ",p_spam)
print("P_ham: ", p_ham)

P_spam:  0.13458950201884254
P_ham:  0.8654104979811574


In [12]:
spam=new_training[new_training['label']=='spam']
ham=new_training[new_training['label']=='ham']
N_spam=0
for message in spam['SMS']:
    N_spam += len(message)

N_ham=ham['SMS'].apply(len).sum()
print(N_spam)
print(N_ham)

15190
57237


## Calculating Parameters
The key detail here is that P(w_i|Spam) and P(w_i|Ham) depend only on the training set. Taking the word "secret" as an example, P("secret"|Spam) stays constant as long as we don't change the training set, and the same reasoning applies to P("secret"|Ham). This means we can calculate these parameters once, before classifying any new message.

In [13]:
# Initialize the parameters; every placeholder is overwritten in the loop below
parameter_spam={unique_word:0 for unique_word in vocabulary}
parameter_ham={unique_word:0 for unique_word in vocabulary}
alpha=1

for word in vocabulary:
    n_word_given_spam=spam[word].sum()
    p_word_given_spam=(n_word_given_spam+alpha)/(N_spam+ alpha*n_vocabulary)
    parameter_spam[word]=p_word_given_spam
    
    n_word_given_ham=ham[word].sum()
    p_word_given_ham=(n_word_given_ham+alpha)/(N_ham+ alpha*n_vocabulary)
    parameter_ham[word]=p_word_given_ham
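
One quick sanity check on parameters like these: because every vocabulary word gets the same alpha in the numerator and the denominator adds alpha once per vocabulary word, the smoothed probabilities of one class should sum to 1 over the vocabulary. The sketch below shows this on made-up counts; `smoothed_parameters` and `toy_spam_counts` are hypothetical names, not part of the notebook.

```python
import math

def smoothed_parameters(word_counts, alpha=1):
    """Return P(word|class) for every word, using Laplace smoothing.

    word_counts must contain every vocabulary word, including those
    with a count of 0 in this class.
    """
    n_words = sum(word_counts.values())
    n_vocabulary = len(word_counts)
    return {
        word: (count + alpha) / (n_words + alpha * n_vocabulary)
        for word, count in word_counts.items()
    }

# Hypothetical counts of each vocabulary word within the spam messages
toy_spam_counts = {"win": 2, "money": 1, "now": 1, "see": 0, "you": 0}
toy_parameters = smoothed_parameters(toy_spam_counts)

print(toy_parameters["win"])  # (2 + 1) / (4 + 5) = 1/3
print(math.isclose(sum(toy_parameters.values()), 1.0))  # True
```

The same check can be run on the notebook's dictionaries, for example `math.isclose(sum(parameter_spam.values()), 1.0)`.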


In [14]:
import re

def classify(message):
    message = re.sub(r'\W', ' ', message)  # raw string avoids the invalid-escape warning
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in vocabulary:
            p_spam_given_message *= parameter_spam[word]
            p_ham_given_message *= parameter_ham[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'equal probabilities, have a human classify this!'
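
Multiplying many small probabilities can underflow to 0.0 for long messages. A common variant, which is not what the notebook does, is to add logarithms instead of multiplying; the comparison gives the same answer because the logarithm is monotonic. The sketch below is one possible version under that assumption. The function `classify_log` and the toy parameter values are hypothetical.

```python
import math
import re

def classify_log(message, p_spam, p_ham, parameter_spam, parameter_ham):
    """Classify a message by comparing log-probabilities (a sketch)."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words never seen in training are skipped, as in classify()
        if word in parameter_spam:
            log_spam += math.log(parameter_spam[word])
            log_ham += math.log(parameter_ham[word])

    if log_ham > log_spam:
        return 'ham'
    if log_ham < log_spam:
        return 'spam'
    return 'equal probabilities, have a human classify this!'

# Toy parameters (made up for illustration, not computed from the dataset)
toy_parameter_spam = {"win": 0.4, "now": 0.1, "see": 0.05}
toy_parameter_ham = {"win": 0.01, "now": 0.2, "see": 0.3}

print(classify_log("WIN now!!", 0.13, 0.87, toy_parameter_spam, toy_parameter_ham))    # spam
print(classify_log("See you now", 0.13, 0.87, toy_parameter_spam, toy_parameter_ham))  # ham
```

Looking words up in the parameter dictionary rather than in the vocabulary list also makes each membership test a constant-time operation.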

In [15]:
test_set['predict']=test_set['SMS'].apply(classify)
test_set.head()

Unnamed: 0,label,SMS,predict
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [16]:
test_set['correct']=test_set['predict']==test_set['label']
print(test_set['correct'].head())
test_set['correct'].value_counts(normalize=True)

0    True
1    True
2    True
3    True
4    True
Name: correct, dtype: bool


True     0.987433
False    0.012567
Name: correct, dtype: float64

The filter had an accuracy of 98.74% on the test set, well above our goal of 80%, which is an excellent result.
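
Accuracy alone can flatter a classifier on imbalanced data: about 87% of the test messages are ham, so a filter that always answers "ham" would already score about 87%. A possible follow-up is to compare accuracy against that baseline and to look at precision and recall for the spam class. The sketch below uses made-up labels; `evaluate` and the toy lists are hypothetical and not results from this notebook.

```python
from collections import Counter

def evaluate(labels, predictions):
    """Return accuracy, majority-class baseline, and spam precision/recall."""
    pairs = Counter(zip(labels, predictions))
    total = len(labels)

    true_pos = pairs[('spam', 'spam')]
    false_pos = pairs[('ham', 'spam')]
    # Spam that was not predicted as spam (ham or an undecided answer)
    false_neg = sum(n for (label, pred), n in pairs.items()
                    if label == 'spam' and pred != 'spam')

    return {
        'accuracy': sum(n for (label, pred), n in pairs.items() if label == pred) / total,
        'baseline': Counter(labels).most_common(1)[0][1] / total,
        'precision': true_pos / (true_pos + false_pos) if true_pos + false_pos else float('nan'),
        'recall': true_pos / (true_pos + false_neg) if true_pos + false_neg else float('nan'),
    }

# Toy data: 8 ham and 2 spam messages, with one spam message missed
toy_labels = ['ham'] * 8 + ['spam'] * 2
toy_predictions = ['ham'] * 8 + ['spam', 'ham']

print(evaluate(toy_labels, toy_predictions))
# {'accuracy': 0.9, 'baseline': 0.8, 'precision': 1.0, 'recall': 0.5}
```

With the notebook's data, the call would be `evaluate(list(test_set['label']), list(test_set['predict']))`.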