# Spam Filter

With the rise of digital communication, people around the world receive more and more spam messages. Humans are usually good at telling spam from ordinary messages; however, as the volume of spam grows, it is not feasible to read through every message.

This project aims to teach a computer to classify messages. The multinomial Naive Bayes algorithm is used to assign each message a probability of being spam and a probability of being non-spam. If the spam probability is the greater of the two, the message is classified as "spam"; if the non-spam probability is greater, it is classified as non-spam.
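For reference, the decision rule that the rest of the notebook implements can be written as follows. The notation is ours: $w_1, \dots, w_n$ are the words of a message, and the two quantities are compared up to a shared normalising constant.

```latex
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam})
\qquad
P(\text{Ham} \mid w_1, \dots, w_n) \propto P(\text{Ham}) \prod_{i=1}^{n} P(w_i \mid \text{Ham})
```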

In [1]:
import pandas as pd

In [2]:
spam_file = pd.read_csv('SMSSpamCollection', sep='\t',
                       header=None, names=['Label', 'SMS'])

print("File dimensions: " + str(spam_file.shape))

File dimensions: (5572, 2)


In [3]:
spam_file.head(5)

| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [4]:
spam_file['Label'].value_counts(normalize=True) * 100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

#### According to the dataset, about 13.4% of messages are spam. Note: "ham" refers to non-spam messages.

## Designing Test and Training Datasets

Test dataset: Used to measure how accurately the algorithm classifies messages. The project aims for an accuracy of 80% or higher.

Training dataset: Used to teach the algorithm how to classify messages.

In [5]:
random = spam_file.sample(frac=1, random_state=1)
training = random[:4458]
test = random[4458:]

training = training.reset_index()
test = test.reset_index()
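The cut-off of 4,458 rows used above corresponds to 80% of the 5,572 messages. A minimal sketch of the same shuffle-then-split idea, using only the standard library and computing the cut-off instead of hard-coding it, might look like this. The helper `split_rows` and the toy rows are hypothetical and not part of the notebook.

```python
import random

def split_rows(rows, train_fraction=0.8, seed=1):
    """Shuffle a copy of rows and split it into (training, test) lists."""
    shuffled = list(rows)
    # A seeded generator makes the shuffle reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy data standing in for (Label, SMS) rows.
rows = [("ham", f"message {i}") for i in range(10)]
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))  # 8 2
```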

In [6]:
training['Label'].value_counts(normalize=True) * 100

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [7]:
test['Label'].value_counts(normalize=True) * 100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

## Cleaning the Datasets

In [8]:
import re

def clean_data(a_string):
    # Replace every non-word character (punctuation, symbols) with a space.
    a_string = re.sub(r'\W', ' ', a_string)
    return a_string.lower()

test_1 = clean_data('Secret!! Money, goods.') 
test_1

'secret   money  goods '
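The runs of spaces left behind by the cleaning step are harmless, because `str.split()` with no arguments collapses consecutive whitespace. A small sketch that combines cleaning and splitting in one step is shown here; the name `tokenize` is an assumption for illustration and does not appear in the notebook.

```python
import re

def tokenize(text):
    """Replace non-word characters with spaces, lower-case, and split on whitespace."""
    return re.sub(r"\W", " ", text).lower().split()

print(tokenize("Secret!! Money, goods."))  # ['secret', 'money', 'goods']
```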

In [9]:
training['SMS'] = training['SMS'].apply(clean_data)
test['SMS'] = test['SMS'].apply(clean_data)

In [10]:
training.head(5)

| | index | Label | SMS |
|---|---|---|---|
| 0 | 1078 | ham | yep by the pretty sculpture |
| 1 | 4028 | ham | yes princess are you going to make me moan |
| 2 | 958 | ham | welp apparently he retired |
| 3 | 4642 | ham | havent |
| 4 | 4674 | ham | i forgot 2 ask ü all smth there s a card on ... |


In [11]:
test.head(5)

| | index | Label | SMS |
|---|---|---|---|
| 0 | 2131 | ham | later i guess i needa do mcat study too |
| 1 | 3418 | ham | but i haf enuff space got like 4 mb |
| 2 | 3424 | spam | had your mobile 10 mths update to latest oran... |
| 3 | 1538 | ham | all sounds good fingers makes it difficult ... |
| 4 | 5393 | ham | all done all handed in don t know if mega sh... |


## Creating a vocabulary

In [12]:
vocabulary = []
for string in training['SMS']:
    list_of_words = string.split()
    for word in list_of_words:
        vocabulary.append(word)
        
# Convert the list to a set to remove duplicates, then convert it back to a list.
vocabulary = list(set(vocabulary))
vocabulary[:10]

['finished',
 'crack',
 'kavalan',
 'smth',
 'someonone',
 'gopalettan',
 'ar',
 'deepak',
 'ondu',
 'sun']
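The same vocabulary can be built in one expression with a set comprehension. The sketch below uses a few made-up messages, and it sorts the result so that the order is reproducible (a plain set has no guaranteed order, which is why the listing above looks arbitrary).

```python
# Toy messages standing in for the cleaned training['SMS'] column.
messages = ["secret money goods", "money money now", "see u there"]

# Collect every distinct word, then sort for a stable order.
vocabulary = sorted({word for sms in messages for word in sms.split()})
print(vocabulary)
# ['goods', 'money', 'now', 'secret', 'see', 'there', 'u']
```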

## Creating a dictionary of the words

In [13]:
word_count_per_sms = {unique_word: [0] * len(training['SMS'])
                      for unique_word in vocabulary}

In [14]:
for index, sms in enumerate(training['SMS']):
    for word in sms.split():
        word_count_per_sms[word][index] += 1
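The dictionary above stores one count per word per message, most of which are zero. As a hedged alternative, `collections.Counter` keeps only the words that actually occur in each message and reports 0 for anything else. The toy messages below are illustrative only.

```python
from collections import Counter

# Toy messages standing in for the cleaned training['SMS'] column.
messages = ["secret money goods", "money money now"]

# One Counter per message: word -> number of occurrences in that message.
counts_per_sms = [Counter(sms.split()) for sms in messages]

print(counts_per_sms[1]["money"])  # 2
print(counts_per_sms[0]["now"])    # 0 (missing words count as zero)
```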

In [15]:
word_counts = pd.DataFrame(word_count_per_sms)
word_counts.head()

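| | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | 0207 | 02072069400 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |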


In [16]:
word_counts = pd.concat([training, word_counts], axis=1)
word_counts.head()

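| | index | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1078 | ham | yep by the pretty sculpture | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 4028 | ham | yes princess are you going to make me moan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 958 | ham | welp apparently he retired | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4642 | ham | havent | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4674 | ham | i forgot 2 ask ü all smth there s a card on ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |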


With the datasets cleaned and the word counts in place, we can begin to calculate the probabilities.

## Probabilities and Word Counts

In [17]:
p_spam = word_counts['Label'].value_counts(normalize=True)['spam']
p_spam

0.13458950201884254

In [18]:
p_ham = word_counts['Label'].value_counts(normalize=True)['ham']
p_ham

0.8654104979811574

In [19]:
def calculate_n_words(label):
    # Total number of words across all training messages with the given label.
    n = 0
    for sms in word_counts[word_counts['Label'] == label]['SMS']:
        n += len(sms.split())

    return n

In [20]:
n_spam = calculate_n_words('spam')
n_ham = calculate_n_words('ham')

print("Total Spam words are: " + str(n_spam))
print("Total Non-spam words are: " + str(n_ham))

Total Spam words are: 15190
Total Non-spam words are: 57237


In [21]:
n_vocabulary = len(vocabulary)
alpha = 1
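The quantities gathered so far feed the Laplace-smoothed estimate computed in the next section. Written out, with $N_{w \mid \text{Spam}}$ for the number of times word $w$ occurs in spam messages (the symbols are our notation for `n_word_given_spam`, `n_spam`, `n_vocabulary` and `alpha`):

```latex
P(w \mid \text{Spam}) = \frac{N_{w \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}
\qquad
P(w \mid \text{Ham}) = \frac{N_{w \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
```

With $\alpha = 1$ the numerator is never zero, so every word in the vocabulary receives a small positive probability in both classes.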

## Calculating Parameters

In [22]:
parameters_spam = {unique_word : 0 for unique_word in vocabulary}
parameters_ham = {unique_word : 0 for unique_word in vocabulary}

In [23]:
spam_messages = training[training['Label'] == 'spam']
ham_messages = training[training['Label'] == 'ham']

In [24]:
for unique_word in vocabulary:
    n_word_given_spam = 0
    for message in spam_messages['SMS']:
        # Caution: str.count matches substrings, so 'p' is also counted inside 'pls'.
        n_word_given_spam += message.count(unique_word)
    # Laplace-smoothed estimate; rounding to 4 places turns small values into 0.0.
    probability = round((n_word_given_spam + alpha)/
                        (n_spam + (alpha * n_vocabulary)), 4)
    parameters_spam[unique_word] = probability

In [25]:
parameters_spam

{'july': 0.0001,
 'finished': 0.0,
 'solve': 0.0,
 'crack': 0.0,
 'kavalan': 0.0,
 'oredi': 0.0,
 'darkness': 0.0,
 'smth': 0.0,
 'someonone': 0.0001,
 'gopalettan': 0.0,
 'ar': 0.0175,
 'deepak': 0.0,
 'ondu': 0.0,
 'sun': 0.0004,
 'bowl': 0.0,
 '6wu': 0.0001,
 'hanks': 0.0006,
 'summer': 0.0003,
 'calls1': 0.0001,
 'disasters': 0.0,
 'sim': 0.0003,
 '09094646899': 0.0001,
 'ji': 0.0,
 'contention': 0.0,
 '08712466669': 0.0001,
 'natwest': 0.0,
 'hopeu': 0.0,
 'smells': 0.0,
 '08002986030': 0.0001,
 'servs': 0.0001,
 'each': 0.0006,
 'hearing': 0.0,
 'chill': 0.0,
 'heater': 0.0,
 'social': 0.0,
 'marine': 0.0,
 '5we': 0.0003,
 'identification': 0.0,
 'bloo': 0.0002,
 'choose': 0.0004,
 'birds': 0.0,
 'talent': 0.0,
 'sunday': 0.0,
 'pura': 0.0,
 'pls': 0.0004,
 'lion': 0.0002,
 'crore': 0.0,
 'recession': 0.0,
 'pub': 0.0001,
 'iyo': 0.0,
 'survey': 0.0001,
 'exterminator': 0.0,
 'thnk': 0.0,
 'p': 0.0743,
 'abt': 0.0001,
 'bloomberg': 0.0002,
 'every': 0.0013,
 'cookies': 0.0,
 'fav

In [26]:
for unique_word in vocabulary:
    n_word_given_ham = 0
    for message in ham_messages['SMS']:
        # Caution: str.count matches substrings, so 'p' is also counted inside 'pls'.
        n_word_given_ham += message.count(unique_word)
    # Laplace-smoothed estimate; rounding to 4 places turns small values into 0.0.
    probability = round(((n_word_given_ham + alpha)/
                        (n_ham + (alpha * n_vocabulary))), 4)
    parameters_ham[unique_word] = probability

In [27]:
parameters_ham

{'july': 0.0,
 'finished': 0.0003,
 'solve': 0.0001,
 'crack': 0.0001,
 'kavalan': 0.0,
 'oredi': 0.0002,
 'darkness': 0.0,
 'smth': 0.0002,
 'someonone': 0.0,
 'gopalettan': 0.0,
 'ar': 0.0238,
 'deepak': 0.0,
 'ondu': 0.0001,
 'sun': 0.0006,
 'bowl': 0.0001,
 '6wu': 0.0,
 'hanks': 0.0009,
 'summer': 0.0,
 'calls1': 0.0,
 'disasters': 0.0,
 'sim': 0.0003,
 '09094646899': 0.0,
 'ji': 0.0003,
 'contention': 0.0,
 '08712466669': 0.0,
 'natwest': 0.0,
 'hopeu': 0.0,
 'smells': 0.0,
 '08002986030': 0.0,
 'servs': 0.0,
 'each': 0.001,
 'hearing': 0.0,
 'chill': 0.0001,
 'heater': 0.0001,
 'social': 0.0001,
 'marine': 0.0,
 '5we': 0.0,
 'identification': 0.0,
 'bloo': 0.0003,
 'choose': 0.0001,
 'birds': 0.0001,
 'talent': 0.0001,
 'sunday': 0.0002,
 'pura': 0.0,
 'pls': 0.0016,
 'lion': 0.0001,
 'crore': 0.0,
 'recession': 0.0,
 'pub': 0.0003,
 'iyo': 0.0001,
 'survey': 0.0,
 'exterminator': 0.0,
 'thnk': 0.0001,
 'p': 0.0582,
 'abt': 0.0003,
 'bloomberg': 0.0,
 'every': 0.0012,
 'cookies':
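Many of the values listed above are 0.0 because they were rounded to four decimal places, and short strings such as 'p' and 'ar' score high because `str.count` matches substrings. One possible way to estimate the same parameters from whole-word counts, keeping the unrounded values, is sketched below. The function `estimate_parameters` and the toy messages are hypothetical and not part of the notebook.

```python
from collections import Counter

def estimate_parameters(messages, vocabulary, alpha=1):
    """Return Laplace-smoothed P(word | class) for every word in vocabulary."""
    # Count whole tokens only, so 'p' is not counted inside 'pls'.
    token_counts = Counter(word for sms in messages for word in sms.split())
    n_words = sum(token_counts.values())
    denominator = n_words + alpha * len(vocabulary)
    # No rounding: smoothed probabilities stay strictly positive.
    return {word: (token_counts[word] + alpha) / denominator
            for word in vocabulary}

# Toy data standing in for the cleaned spam and ham messages.
spam = ["win money now", "free money"]
ham = ["see you now"]
vocabulary = sorted({w for sms in spam + ham for w in sms.split()})

params_spam = estimate_parameters(spam, vocabulary)
print(params_spam["money"])  # 3/11: two occurrences plus alpha, over 5 words + 6 vocabulary entries
print(params_spam["see"])    # 1/11: never seen in spam, but still positive
```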

## Building the actual spam filter

In [28]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()


    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if (word in parameters_spam):
            p_spam_given_message *= parameters_spam[word]
        
        if (word in parameters_ham):
            p_ham_given_message *= parameters_ham[word]
            

#     print('P(Spam|message):', p_spam_given_message)
#     print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'
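Multiplying many small probabilities can underflow towards zero for long messages. A common remedy is to add logarithms instead of multiplying probabilities. The sketch below follows the structure of `classify` above but works in log space; it assumes every parameter is strictly positive (which holds for unrounded Laplace-smoothed values), and the function name `classify_log` and the toy parameters are illustrative assumptions.

```python
import math
import re

def classify_log(message, p_spam, p_ham, parameters_spam, parameters_ham):
    """Compare log-scores for spam and ham instead of raw probability products."""
    words = re.sub(r"\W", " ", message).lower().split()
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the vocabulary are skipped for both classes.
        if word in parameters_spam and word in parameters_ham:
            log_spam += math.log(parameters_spam[word])
            log_ham += math.log(parameters_ham[word])
    if log_spam > log_ham:
        return "spam"
    if log_ham > log_spam:
        return "ham"
    return "needs human classification"

# Toy parameters; real values would come from the training data.
toy_spam = {"money": 0.3, "see": 0.1, "you": 0.1}
toy_ham = {"money": 0.1, "see": 0.3, "you": 0.3}

print(classify_log("Money, money!", 0.5, 0.5, toy_spam, toy_ham))  # spam
print(classify_log("See you", 0.5, 0.5, toy_spam, toy_ham))        # ham
```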

In [29]:
#Test

print(classify('WINNER!! This is the secret code to unlock the money: C3421.'))
print("\n")
print(classify("Sounds good, Tom, then see u there"))

spam


ham


In [30]:
test['predicted'] = test['SMS'].apply(classify)
test.head(10)

| | index | Label | SMS | predicted |
|---|---|---|---|---|
| 0 | 2131 | ham | later i guess i needa do mcat study too | ham |
| 1 | 3418 | ham | but i haf enuff space got like 4 mb | needs human classification |
| 2 | 3424 | spam | had your mobile 10 mths update to latest oran... | spam |
| 3 | 1538 | ham | all sounds good fingers makes it difficult ... | ham |
| 4 | 5393 | ham | all done all handed in don t know if mega sh... | needs human classification |
| 5 | 2744 | ham | but my family not responding for anything now... | needs human classification |
| 6 | 1553 | ham | u too | ham |
| 7 | 4335 | ham | boo what time u get out u were supposed to ta... | ham |
| 8 | 2817 | ham | genius what s up how your brother pls send h... | ham |
| 9 | 4702 | ham | i liked the new mobile | ham |


## Assessing the accuracy of the filter

In [31]:
correct = 0
total = test['SMS'].size


for index, row in test.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1
        
accuracy = (correct / total) * 100
print('The Accuracy is: {:.2f}%'.format(accuracy))

The Accuracy is: 63.73%


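The accuracy of 63.73% falls short of the 80% target. A likely cause is that many word probabilities are stored as 0.0 after being rounded to four decimal places: a single such word drives a message's probability for that class to zero, and when both class probabilities are zero the message is returned as 'needs human classification', which never matches the true label. In addition, message.count(unique_word) counts substring matches rather than whole words, which inflates the counts for short strings such as 'p' and 'ar'.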
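To see where the errors come from, it can help to count each (label, prediction) pair rather than only the overall accuracy. A minimal sketch with made-up labels follows; the helper `summarize` is hypothetical, and the numbers are not results from this dataset.

```python
from collections import Counter

def summarize(labels, predictions):
    """Return (accuracy in percent, Counter of (label, prediction) pairs)."""
    pairs = Counter(zip(labels, predictions))
    correct = sum(n for (label, pred), n in pairs.items() if label == pred)
    return 100 * correct / len(labels), pairs

# Toy data standing in for test['Label'] and test['predicted'].
labels = ["ham", "ham", "spam", "ham"]
predictions = ["ham", "needs human classification", "spam", "spam"]

accuracy, pairs = summarize(labels, predictions)
print(f"{accuracy:.2f}%")                            # 50.00%
print(pairs[("ham", "needs human classification")])  # 1
```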