# Spam Filter for SMS Messages

## Using the Naive Bayes algorithm

     
The first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that have already been classified by humans.
The dataset is available from the [UCI ML Repo](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).


In [6]:
import pandas as pd
smsspam = pd.read_csv('SMSSpamCollection',sep='\t',header=None,names=['Label','SMS'])
smsspam.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [7]:
smsspam['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

In [8]:
smsspam.shape

(5572, 2)

### Training and Testing

In [9]:
data_randomized = smsspam.sample(frac=1, random_state=1)

In [11]:
# Row position that separates the 80% training portion from the 20% test portion
training_test_index = round(len(data_randomized) * 0.8)

In [12]:
train_set = data_randomized[:training_test_index].reset_index(drop=True)
test_set = data_randomized[training_test_index:].reset_index(drop=True)

In [13]:
print(train_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


In [14]:
train_set['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [15]:
test_set['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

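Our Naive Bayes algorithm will classify each message by comparing the values it computes for these two expressions:

$$P(\text{Spam} \mid w_1, w_2, \ldots, w_n) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam})$$

$$P(\text{Ham} \mid w_1, w_2, \ldots, w_n) \propto P(\text{Ham}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Ham})$$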

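To calculate $P(w_i \mid \text{Spam})$ and $P(w_i \mid \text{Ham})$ in the formulas above, we use these equations:

$$P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}$$

$$P(w_i \mid \text{Ham}) = \frac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}$$

where:

* $N_{w_i \mid \text{Spam}}$ = the number of times the word $w_i$ occurs in spam messages
* $N_{w_i \mid \text{Ham}}$ = the number of times the word $w_i$ occurs in non-spam (ham) messages
* $N_{\text{Spam}}$ = total number of words in spam messages
* $N_{\text{Ham}}$ = total number of words in non-spam (ham) messages
* $N_{\text{Vocabulary}}$ = total number of unique words in the vocabulary
* $\alpha = 1$ ($\alpha$ is a smoothing parameter)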
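As a quick numeric check of the smoothed estimate, the sketch below applies the formula for P(wi|Spam) to a handful of invented counts. The numbers and the helper name `smoothed_probability` are assumptions made for illustration; they do not come from the SMS dataset.

```python
# Toy counts, invented for illustration only (not taken from the SMS dataset).
alpha = 1            # Laplace smoothing parameter
n_spam = 8           # total number of words across all spam messages
n_vocabulary = 5     # number of unique words in the vocabulary
word_counts_in_spam = {"free": 3, "win": 2, "hello": 0}

def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """Return P(word | class) using additive (Laplace) smoothing."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

p_word_given_spam = {
    word: smoothed_probability(count, n_spam, n_vocabulary, alpha)
    for word, count in word_counts_in_spam.items()
}

for word, probability in p_word_given_spam.items():
    print(word, probability)
```

Note that "hello" never occurs in the toy spam messages, yet its estimate is 1/13 rather than 0. Without smoothing, a single unseen word would force the whole product to zero.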


### Data Cleaning

In [17]:
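# Replace every non-word character with a space, then lowercase the text.
# regex=True is required in recent pandas versions, where the default is a literal match.
train_set['SMS'] = train_set['SMS'].str.replace(r'\W', ' ', regex=True)
train_set['SMS'] = train_set['SMS'].str.lower()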
train_set.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


In [18]:
vocabulary = []
for sms in train_set['SMS']:
    for word in sms.split():
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

In [20]:
len(vocabulary)

7783

In [23]:
word_counts_per_sms = {unique_word: [0] * len(train_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train_set['SMS'].str.split()):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [24]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [25]:
train_set_clean = pd.concat([train_set, word_counts], axis=1)
train_set_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,yep by the pretty sculpture,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,yes princess are you going to make me moan,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,welp apparently he retired,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,havent,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,i forgot 2 ask ü all smth there s a card on ...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [27]:
spam_messages = train_set_clean[train_set_clean['Label'] == 'spam']
ham_messages = train_set_clean[train_set_clean['Label'] == 'ham']

p_spam = len(spam_messages) / len(train_set_clean)
p_ham = len(ham_messages) / len(train_set_clean)

n_words_per_spam_message = spam_messages['SMS'].str.split().apply(len)
n_spam = n_words_per_spam_message.sum()

n_words_per_ham_message = ham_messages['SMS'].str.split().apply(len)
n_ham = n_words_per_ham_message.sum()

n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

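* `p_spam` is the proportion of spam messages in the training set, used as the prior P(Spam); `p_ham` is the same quantity for ham messages.
* `n_words_per_spam_message` is the number of words in each spam message; its sum, `n_spam`, is the total number of words in all spam messages. `n_ham` is computed the same way for ham messages.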
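To see what these quantities look like on a small scale, here is a minimal sketch that repeats the same steps on a four-message DataFrame. The messages and every `toy_`-prefixed name are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical four-message training set, for illustration only.
toy_train = pd.DataFrame({
    "Label": ["ham", "spam", "ham", "ham"],
    "SMS": ["see you at lunch", "win a free prize now", "ok", "call me later"],
})

toy_spam = toy_train[toy_train["Label"] == "spam"]
toy_ham = toy_train[toy_train["Label"] == "ham"]

# Priors: the share of messages that belong to each class.
toy_p_spam = len(toy_spam) / len(toy_train)
toy_p_ham = len(toy_ham) / len(toy_train)

# Total number of words in each class (repeated words are counted every time).
toy_n_spam = toy_spam["SMS"].str.split().apply(len).sum()
toy_n_ham = toy_ham["SMS"].str.split().apply(len).sum()

print("P(Spam):", toy_p_spam, "P(Ham):", toy_p_ham)
print("N_Spam:", toy_n_spam, "N_Ham:", toy_n_ham)
```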

### Calculating Parameters

In [29]:
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

In [30]:
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham

In [31]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [32]:
classify("Hi Abhijith nice meeting you!")

P(Spam|message): 1.9792483149905178e-15
P(Ham|message): 1.4936355491254438e-11
Label: Ham


In [33]:
classify("This is your secret code: C1234. Send us the OTP and get the offer.")

P(Spam|message): 4.891859631055442e-33
P(Ham|message): 2.5880882207320876e-34
Label: Spam
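The products printed above are already as small as 1e-34 for a short message, and for much longer messages they could underflow to 0.0 for both classes, which would produce a false tie. A common variant, which is not what this notebook does, is to compare sums of log-probabilities instead. The sketch below shows the idea with hypothetical toy parameters; in the notebook, `p_spam`, `p_ham`, `parameters_spam` and `parameters_ham` would take their place.

```python
import math
import re

# Hypothetical toy parameters, for illustration only.
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_parameters_spam = {"free": 0.05, "win": 0.04, "hello": 0.001}
toy_parameters_ham = {"free": 0.002, "win": 0.001, "hello": 0.03}

def classify_log(message):
    """Classify a message by comparing sums of log-probabilities."""
    words = re.sub(r"\W", " ", message).lower().split()

    log_p_spam = math.log(toy_p_spam)
    log_p_ham = math.log(toy_p_ham)

    for word in words:
        # Words outside the vocabulary are skipped, as in classify().
        if word in toy_parameters_spam:
            log_p_spam += math.log(toy_parameters_spam[word])
        if word in toy_parameters_ham:
            log_p_ham += math.log(toy_parameters_ham[word])

    if log_p_ham > log_p_spam:
        return "ham"
    if log_p_spam > log_p_ham:
        return "spam"
    return "needs human classification"

print(classify_log("Win FREE now!"))
print(classify_log("hello"))
```

Because the logarithm is monotonic, comparing the sums gives the same label as comparing the products whenever the products are representable as floats.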


### Accuracy in the Test dataset

In [34]:
def classify_test_set(message):    
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [38]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [43]:
correct = 0
total = test_set.shape[0]
    
# iterrows() yields (index, row) pairs; only the row is needed here
for _, row in test_set.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833
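The same accuracy figure can be obtained without an explicit loop, and a cross-tabulation shows which kind of error occurs. The sketch below is an optional alternative using a hypothetical five-row DataFrame; with the notebook's data, `test_set` would replace `toy_test`.

```python
import pandas as pd

# Hypothetical labels and predictions, for illustration only.
toy_test = pd.DataFrame({
    "Label":     ["ham", "ham", "spam", "spam", "ham"],
    "predicted": ["ham", "ham", "spam", "ham", "ham"],
})

# Element-wise comparison gives a boolean Series; its mean is the accuracy.
toy_accuracy = (toy_test["Label"] == toy_test["predicted"]).mean()

# Rows are the true labels, columns are the predicted labels.
toy_confusion = pd.crosstab(toy_test["Label"], toy_test["predicted"])

print("Accuracy:", toy_accuracy)
print(toy_confusion)
```

In a spam filter the two error types are not equally costly: a ham message flagged as spam may go unread, so the breakdown is worth checking alongside the overall accuracy.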


## The filter achieves an accuracy of about 98.7% on the test set