# Guided Project: Classifying Messages with the Naive Bayes Algorithm

In this guided project, our task is to teach the computer to classify messages with the Naive Bayes algorithm; in other words, to build a spam filter based on Bayes' theorem.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo and can be downloaded [here](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). It consists of 5,572 SMS messages that are already classified. The data collection process is described in more detail on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition).


In [1]:
# Import pandas and read the tab-separated data file
import pandas as pd
sms_data = pd.read_csv("SMSSpamCollection", sep='\t', header=None, names=['Label', 'SMS'])

In [2]:
sms_data.shape

(5572, 2)

In [3]:
sms_data["Label"].value_counts(normalize = True) *100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

The output shows that about 86.6% of the messages are ham (non-spam) and 13.4% are spam. We will split the data into a training set and a test set: 80% of the messages go to the training set and the remaining 20% to the test set.

In [4]:
sms_data.head(5)

| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [5]:
data_randomized = sms_data.sample(frac = 1, random_state = 1)
training_test_index = round(len(data_randomized)*0.8)

training_set = data_randomized[:training_test_index].reset_index(drop = True)
test_set = data_randomized[training_test_index:].reset_index(drop=True)

In [6]:
print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)
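
Because the split is purely random, it is worth checking that the spam/ham proportions in both subsets stay close to the 86.6% / 13.4% seen in the full dataset. The following is a minimal sketch on a small made-up DataFrame: the `toy` data and the helper name `label_share` are illustrative assumptions, not part of the project. The same `value_counts(normalize=True)` call can be applied to `training_set` and `test_set`.

```python
import pandas as pd

# Hypothetical toy data standing in for sms_data: 8 ham and 2 spam messages.
toy = pd.DataFrame({
    "Label": ["ham"] * 8 + ["spam"] * 2,
    "SMS": [f"message {i}" for i in range(10)],
})

# Same split recipe as in the cell above: shuffle, then cut at 80%.
shuffled = toy.sample(frac=1, random_state=1)
cut = round(len(shuffled) * 0.8)
toy_train = shuffled[:cut].reset_index(drop=True)
toy_test = shuffled[cut:].reset_index(drop=True)


def label_share(df):
    """Return the percentage of each label in df (illustrative helper)."""
    return df["Label"].value_counts(normalize=True) * 100


train_share = label_share(toy_train)
test_share = label_share(toy_test)
print(train_share)
print(test_share)
```

With only ten rows the proportions can drift a lot between the two subsets; with several thousand messages they would be expected to stay much closer to the overall distribution.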


In [7]:
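# Replace every non-word character with a space, then lowercase the text.
# regex=True is required in recent pandas versions, where str.replace is literal by default.
training_set["SMS"] = training_set["SMS"].str.replace(r'\W', ' ', regex=True)
training_set["SMS"] = training_set["SMS"].str.lower()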

In [8]:
training_set["SMS"]

0                            yep  by the pretty sculpture
1           yes  princess  are you going to make me moan 
2                              welp apparently he retired
3                                                 havent 
4       i forgot 2 ask ü all smth   there s a card on ...
5       ok i thk i got it  then u wan me 2 come now or...
6       i want kfc its tuesday  only buy 2 meals only ...
7                              no dear i was sleeping   p
8                               ok pa  nothing problem   
9                         ill be there on   lt   gt   ok 
10      my uncles in atlanta  wish you guys a great se...
11                                               my phone
12                           ok which your another number
13      the greatest test of courage on earth is to be...
14      dai what this da   can i send my resume to thi...
15                          i am late  i will be there at
16      freemsg why haven t you replied to my text  i ...
17            

In [9]:
training_set["SMS"] = training_set["SMS"].str.split()

#empty list to append each string
vocabulary = []
#iterate over sms column
for row in training_set["SMS"]:
    for word in row:
        vocabulary.append(word)

In [10]:
vocabulary = list(set(vocabulary))

In [11]:
len(vocabulary)

7783

In [12]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [13]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [14]:
training_set_clean = pd.concat([training_set,word_counts], axis= 1)
training_set_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [15]:
spam_messages = training_set_clean[training_set_clean["Label"]=='spam']
ham_messages = training_set_clean[training_set_clean["Label"]=='ham']

In [16]:
#p(Spam) and p(Ham)
p_spam = len(spam_messages)/len(training_set_clean)
p_ham = len(ham_messages)/len(training_set_clean)

#N_spam
n_words_per_spam_message = spam_messages["SMS"].apply(len)
n_spam = n_words_per_spam_message.sum()

#N_ham
n_words_per_ham_message = ham_messages["SMS"].apply(len)
n_ham = n_words_per_ham_message.sum()

#N_vocabulary
n_vocabulary = len(vocabulary)

alpha = 1

In [17]:
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()
    p_word_given_spam = (n_word_given_spam+alpha)/(n_spam +alpha * n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha)/(n_ham+alpha * n_vocabulary)
    parameters_ham[word] = p_word_given_ham
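
For reference, the quantities computed in the two cells above appear to correspond to the usual multinomial Naive Bayes estimates with additive (Laplace) smoothing. The symbols below are notation chosen here to mirror the variable names in the code: $N_{w_i \mid \text{Spam}}$ is `n_word_given_spam`, $N_{\text{Spam}}$ is `n_spam`, $N_{\text{Vocabulary}}$ is `n_vocabulary`, and $\alpha$ is `alpha`.

```latex
P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}
\qquad
P(w_i \mid \text{Ham}) = \frac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}

P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam})
\qquad
P(\text{Ham} \mid w_1, \dots, w_n) \propto P(\text{Ham}) \prod_{i=1}^{n} P(w_i \mid \text{Ham})
```

With $\alpha = 1$, a word that never occurs in one class still gets a small non-zero probability, so a single unseen word cannot force the whole product to zero.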

### Creating the spam filter

In [18]:
import re

def classify(message):
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message) 
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [19]:
classify('WINNER!! This is the secret code to unlock the money: C3421')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [20]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
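
The products printed above are already on the order of 1e-21 to 1e-27 for short messages, and for much longer messages such products can underflow to 0.0 in floating point. A common remedy, not used in this project, is to add logarithms instead of multiplying probabilities. The sketch below assumes small made-up priors and parameter dictionaries (`toy_p_spam`, `toy_p_ham`, `toy_params_spam`, `toy_params_ham`) and a hypothetical function name `classify_log`; only the comparison logic mirrors `classify`.

```python
import math
import re

# Made-up values for illustration only; in the project these would be
# p_spam, p_ham, parameters_spam and parameters_ham.
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_params_spam = {"winner": 0.02, "money": 0.03, "see": 0.001, "you": 0.01}
toy_params_ham = {"winner": 0.0005, "money": 0.002, "see": 0.02, "you": 0.03}


def classify_log(message):
    """Compare log-scores instead of raw products (hypothetical helper)."""
    words = re.sub(r"\W", " ", message).lower().split()

    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)

    for word in words:
        # Words outside the vocabulary are skipped, as in classify().
        if word in toy_params_spam:
            log_spam += math.log(toy_params_spam[word])
        if word in toy_params_ham:
            log_ham += math.log(toy_params_ham[word])

    if log_ham > log_spam:
        return "ham"
    if log_spam > log_ham:
        return "spam"
    return "needs human classification"


print(classify_log("WINNER!! Money money money"))
print(classify_log("See you"))
# A very long message: the raw products would underflow, the log-sums do not.
print(classify_log("money " * 500))
```

Since the logarithm is monotonic, comparing the log-scores yields the same label as comparing the products whenever the products are representable.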


### Measuring the spam filter's accuracy

In [21]:
def classify_test_set(message):
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *=parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [22]:
test_set["predicted"] = test_set["SMS"].apply(classify_test_set)
test_set.head()

| | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [23]:
correct = 0
total = test_set.shape[0]

for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)


Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833
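
The counting loop above can also be written as a vectorized comparison, and a cross-tabulation shows which kind of error the filter makes (ham flagged as spam, or spam let through). This is a sketch on a made-up results frame: `toy_results` is hypothetical and its values are not taken from the project. On the real data the same calls would take `test_set['Label']` and `test_set['predicted']`.

```python
import pandas as pd

# Hypothetical labels and predictions for five messages.
toy_results = pd.DataFrame({
    "Label":     ["ham", "ham", "spam", "ham", "spam"],
    "predicted": ["ham", "ham", "spam", "spam", "spam"],
})

# Boolean Series: True where the prediction matches the true label.
matches = toy_results["Label"] == toy_results["predicted"]
correct = int(matches.sum())
accuracy = matches.mean()

# Rows are true labels, columns are predicted labels.
confusion = pd.crosstab(toy_results["Label"], toy_results["predicted"])

print("Correct:", correct)
print("Incorrect:", len(toy_results) - correct)
print("Accuracy:", accuracy)
print(confusion)
```

Because ham makes up roughly 87% of the messages, accuracy alone can look high even if many spam messages slip through, so the per-class breakdown is a useful complement.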


### Conclusion

In this project we built a spam filter with the Naive Bayes algorithm. On the test set it classified 1,100 of 1,114 messages correctly, an accuracy of 98.74%.