### Building a Spam Filter using Naive Bayes

In this guided project, we're going to study the practical side of the Naive Bayes algorithm by building a spam filter for SMS messages.

To classify messages as spam or non-spam, we saw in the previous mission that the computer:

1. Learns how humans classify messages.
2. Uses that human knowledge to estimate probabilities for new messages: the probability of spam and the probability of non-spam.
3. Classifies a new message based on these probability values. If the probability of spam is greater, it classifies the message as spam. If the probability of non-spam is greater, it classifies the message as non-spam. If the two probabilities are equal, we may need a human to classify the message.
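The decision rule in step 3 can be sketched as a small function. The name `decide` and the example probabilities are hypothetical, chosen only for illustration; they are not values from the dataset.

```python
def decide(p_spam_given_message, p_ham_given_message):
    """Return a label by comparing the two probability values."""
    if p_spam_given_message > p_ham_given_message:
        return 'spam'
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    # Equal values: the rule cannot decide on its own
    return 'needs human classification'

print(decide(0.02, 0.001))
print(decide(0.5, 0.5))
```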


So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). You can also download the dataset directly from this [link](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection). The data collection process is described in more detail on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the authors' papers.

In [1]:
import pandas as pd
import numpy as np

# The file is tab-separated and has no header row
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [3]:
df.head(10)

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [4]:
# Share of each label, in percent
df['Label'].value_counts(normalize=True) * 100


ham     86.593683
spam    13.406317
Name: Label, dtype: float64

## Slide 2

In [5]:
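# Randomly shuffle the dataset (fixed seed for reproducibility)
df_rand = df.sample(frac=1, random_state=1)

# Split roughly 80/20: 4,457 training rows and 1,114 test rows
# (the row at position 4457 falls in neither slice)
train = df_rand[0:4457]
test = df_rand[4458:5572]

# Reset the indexes
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)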
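As a minimal sketch of an alternative, the cut point can be derived from the data length and reused for both slices, so that no row is left out. The example uses plain Python lists as stand-ins for the messages; lists slice with the same half-open convention as the positional slices above. The names `rows`, `cut`, `train_rows`, and `test_rows` are hypothetical.

```python
import random

rows = list(range(5572))         # stand-ins for the 5,572 messages
random.Random(1).shuffle(rows)   # fixed seed for reproducibility

cut = round(len(rows) * 0.8)     # one cut point shared by both slices
train_rows = rows[:cut]          # everything before the cut
test_rows = rows[cut:]           # everything from the cut onward

print(cut, len(train_rows), len(test_rows))
```

Because both slices share `cut`, every row ends up in exactly one of the two sets.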



In [6]:
train['Label'].value_counts(normalize=True) * 100

ham     86.53803
spam    13.46197
Name: Label, dtype: float64

In [7]:
test['Label'].value_counts(normalize=True) * 100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

## Slide 3

In [8]:
train.head(10)


Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...
5,ham,Ok i thk i got it. Then u wan me 2 come now or...
6,ham,I want kfc its Tuesday. Only buy 2 meals ONLY ...
7,ham,No dear i was sleeping :-P
8,ham,Ok pa. Nothing problem:-)
9,ham,Ill be there on &lt;#&gt; ok.


In [9]:
# Replace every non-word character with a space, then lowercase
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True)
train['SMS'] = train['SMS'].str.lower()



train.head(10)

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
5,ham,ok i thk i got it then u wan me 2 come now or...
6,ham,i want kfc its tuesday only buy 2 meals only ...
7,ham,no dear i was sleeping p
8,ham,ok pa nothing problem
9,ham,ill be there on lt gt ok


## Slide 4

To build the vocabulary, we first collect every word from the training messages into a list. We then transform the vocabulary list into a set using the set() function, which removes the duplicates, and transform the set back into a list using the list() function.
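A minimal sketch of the de-duplication step on a toy word list (the names `words` and `vocabulary_demo` are hypothetical and not part of the project):

```python
words = ['ok', 'see', 'u', 'ok', 'u']

# set() drops the duplicates; list() turns the result back into a list
vocabulary_demo = list(set(words))

# Sets are unordered, so sort before printing to get a stable display
print(sorted(vocabulary_demo))
```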

In [10]:
vocabulary = []

train['SMS']=train['SMS'].str.split()

for sms in train['SMS']:
    for word in sms:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

In [11]:
len(vocabulary)

vocabulary

['zyada',
 'gailxx',
 'dad',
 'studies',
 'jersey',
 'careful',
 'game',
 'let',
 'sat',
 'angels',
 'bit',
 'mobstorequiz10ppm',
 'joining',
 'adam',
 'ago',
 'halla',
 'four',
 'taxless',
 'prsn',
 'adsense',
 'easily',
 'mca',
 'singles',
 'thesedays',
 'suply',
 'sigh',
 'name1',
 'gosh',
 'twittering',
 'beforehand',
 'prove',
 'finish',
 'tissco',
 'sometme',
 'weigh',
 'relax',
 'playing',
 'teju',
 '08708800282',
 '0808',
 'score',
 'heaven',
 'eg',
 'forgets',
 'fear',
 'prince',
 'freshers',
 'birds',
 'bishan',
 'windy',
 'bridge',
 'feellikw',
 'packs',
 'sha',
 'lip',
 'hasnt',
 'calld',
 'linerental',
 'nauseous',
 'courage',
 'wicket',
 'staying',
 'unfolds',
 'called',
 'reply',
 'beloved',
 'urination',
 'eng',
 'ms',
 'magic',
 '09058094454',
 '80608',
 'yr',
 'somtimes',
 'amigos',
 '45239',
 'girl',
 '9755',
 'thew',
 'jobs',
 'inclusive',
 'intend',
 'twins',
 'specs',
 'flute',
 'conducts',
 'hurry',
 'under',
 '09056242159',
 'flaked',
 'partnership',
 'venaam',


In [38]:
len(vocabulary)

7782

## Slide 5

In [12]:
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
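The same dictionary-of-lists structure can be sketched on two toy messages to show what it holds: one key per vocabulary word, and one count per message. The names `messages`, `vocab`, and `counts` are hypothetical.

```python
# Two already-tokenized toy messages
messages = [['secret', 'prize', 'prize'], ['see', 'u', 'secret']]

# Unique words across all messages
vocab = sorted({word for sms in messages for word in sms})

# One list of zeros per word, with one slot per message
counts = {word: [0] * len(messages) for word in vocab}

for index, sms in enumerate(messages):
    for word in sms:
        counts[word][index] += 1

print(counts)
```

Each list has the same length as the number of messages, which is what lets the dictionary be turned into a DataFrame with one row per message.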
        


In [13]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [14]:
train_clean = pd.concat([train, word_counts], axis = 1)

train_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


## Slide 6

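Next, we calculate the constants that the algorithm needs: P(Spam), P(Ham), N_Spam, N_Ham, N_Vocabulary, and the smoothing parameter alpha.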
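For reference, these quantities plug into the multinomial Naive Bayes formulas with additive (Laplace) smoothing. The notation below is an assumption chosen to mirror the variable names used in the code (`p_spam`, `n_spam`, `n_vocabulary`, `alpha`); the formulas for ham are analogous.

```latex
\begin{aligned}
P(Spam \mid w_1, \dots, w_n) &\propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam) \\
P(w_i \mid Spam) &= \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{aligned}
```

Here $N_{w_i \mid Spam}$ is the number of times the word $w_i$ occurs in spam messages, $N_{Spam}$ is the total number of words in all spam messages, and $N_{Vocabulary}$ is the number of unique words in the vocabulary.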

In [39]:
# Number of messages per label (raw counts, not percentages)
spam_perc = train_clean['Label'].value_counts()

spam_perc

ham     3857
spam     600
Name: Label, dtype: int64

In [41]:
# Number of spam and ham messages (message counts, not word counts)
n_spam_messages = spam_perc["spam"]
n_ham_messages = spam_perc["ham"]

n_ham_messages

3857

In [17]:
len(train_clean)

4457

In [18]:

# P(Spam) and P(Ham)
p_spam = spam_perc["spam"]/len(train_clean)

p_ham = spam_perc["ham"]/len(train_clean)

print('prob of spam is: ', p_spam)
print('prob of ham is: ', p_ham)

prob of spam is:  0.13461969934933812
prob of ham is:  0.8653803006506618


In [54]:
# N_Vocabulary
n_vocabulary = len(vocabulary)

n_vocabulary

7782

In [20]:
alpha = 1


spam_msg = train_clean[train_clean["Label"]=="spam"]

ham_msg = train_clean[train_clean["Label"]=="ham"]

# N_Spam
n_words_per_spam_message = spam_msg['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_msg['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()


In [49]:
# Isolating spam and ham messages first
spam_messages = train_clean[train_clean['Label'] == 'spam']
ham_messages = train_clean[train_clean['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(train_clean)
p_ham = len(ham_messages) / len(train_clean)

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1


In [52]:
n_vocabulary

7782

## Slide 7

In [28]:
# Initialize the parameters (theta) for ham and spam:
# one entry per vocabulary word, starting at 0
param_ham = {word: 0 for word in vocabulary}
param_spam = {word: 0 for word in vocabulary}




In [22]:
for word in vocabulary:
    # P(word|Spam) with Laplace smoothing; spam_msg is defined in a cell above
    n_word_given_spam = spam_msg[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    param_spam[word] = p_word_given_spam

    # P(word|Ham) with Laplace smoothing; ham_msg is defined in a cell above
    n_word_given_ham = ham_msg[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    param_ham[word] = p_word_given_ham
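To see what the smoothing does, here is a minimal sketch on a toy vocabulary of four words. All numbers and the names `counts_in_spam`, `n_spam_demo`, and `param_spam_demo` are hypothetical, not values from the SMS dataset.

```python
alpha_demo = 1
counts_in_spam = {'win': 3, 'money': 2, 'secret': 1, 'hello': 0}

n_vocabulary_demo = len(counts_in_spam)      # 4 unique words
n_spam_demo = sum(counts_in_spam.values())   # 6 words in spam overall

# Same formula as in the loop above, applied to the toy counts
param_spam_demo = {
    word: (count + alpha_demo) / (n_spam_demo + alpha_demo * n_vocabulary_demo)
    for word, count in counts_in_spam.items()
}

print(param_spam_demo)
```

The word 'hello' never occurs in the toy spam messages, yet it still receives a non-zero probability, so a single unseen word cannot force the whole product to zero.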

In [48]:
spam_msg[word]

16      0
18      0
56      0
60      0
61      0
62      0
70      0
71      0
84      0
89      0
98      0
106     0
113     0
142     0
144     0
158     0
159     0
162     0
164     0
165     0
166     0
179     0
181     0
186     0
191     0
200     0
203     0
206     0
218     0
219     0
       ..
4297    0
4298    0
4306    0
4312    0
4318    0
4331    0
4332    0
4350    0
4353    0
4354    0
4357    0
4359    0
4373    0
4377    0
4379    0
4383    0
4387    0
4388    0
4390    0
4392    0
4401    0
4403    0
4407    0
4414    0
4433    0
4437    0
4439    0
4443    0
4449    0
4455    0
Name: 81618, Length: 600, dtype: int64

In [None]:
print(param_spam)


## Slide 8

In [23]:
import re

def classify(message):
    # Apply the same cleaning as for the training set
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Initialize both probabilities with the class priors
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    # Multiply in P(word|Spam) and P(word|Ham) for every known word;
    # words that are not in the vocabulary are ignored
    for word in message:
        if word in param_spam:
            p_spam_given_message *= param_spam[word]
        if word in param_ham:
            p_ham_given_message *= param_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [24]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

In [25]:
classify("Sounds good, Tom, then see u there")
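Multiplying many small probabilities produces very small numbers, as the printed values on the order of 1e-25 show. One common variation, not used in this notebook, is to add log-probabilities instead of multiplying probabilities. The sketch below assumes hypothetical toy parameters (`toy_spam`, `toy_ham`) and a hypothetical helper `log_score`; it is an illustration, not the project's method.

```python
import math

def log_score(words, prior, params):
    """Sum log-probabilities instead of multiplying raw probabilities."""
    score = math.log(prior)
    for word in words:
        if word in params:   # words outside the vocabulary are ignored
            score += math.log(params[word])
    return score

# Hypothetical toy parameters, not the values estimated from the SMS data
toy_spam = {'win': 0.4, 'money': 0.3, 'see': 0.1, 'u': 0.2}
toy_ham = {'win': 0.1, 'money': 0.1, 'see': 0.4, 'u': 0.4}

message = ['win', 'money', 'now']   # 'now' is not in the toy vocabulary
spam_score = log_score(message, 0.13, toy_spam)
ham_score = log_score(message, 0.87, toy_ham)

print('spam' if spam_score > ham_score else 'ham')
```

Because the logarithm is monotonic, comparing the two sums gives the same label as comparing the two products.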
