# Building a Spam Filter with Naive Bayes

#### The objective of this project is to apply Bayes' theorem to build an algorithm that filters spam messages with more than 80% accuracy.
The dataset can be downloaded [here](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). By convention, messages are classified as either "spam" or "ham" (non-spam).

### Methodology
#### 1. Importing and inspecting the data
#### 2. Creating Training and Testing sets
#### 3. Data Cleaning and Processing
#### 4. Creating the Spam Filter
#### 5. Classifying new messages

### Conclusion: The algorithm developed here reached 98% accuracy on the test set.

# 1. Importing and inspecting the data

In [51]:
import pandas as pd 
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

data = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

data.head()

|  | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [21]:
print('Original dataset: {} rows and {} columns'.format(data.shape[0], data.shape[1]))

Original dataset: 5572 rows and 2 columns


In [23]:
print("Spam entries: {0:.0%}".format(data['Label'].value_counts(normalize=True)['spam']))
print('')
print("Ham entries: {0:.0%}".format(data['Label'].value_counts(normalize=True)['ham']))

Spam entries: 13%

Ham entries: 87%


In [20]:
# Share of ham, the majority class (looked up by label rather than by position)
print("{0:.0%}".format(data['Label'].value_counts(normalize=True)['ham']))

87%


# 2. Creating Training and Testing sets
### The first step is to shuffle the entire dataset so that spam and ham messages are spread evenly across both sets.

### Next, the 5,572 messages are divided as follows:
1) Training Set (80% = 4,458 messages)

2) Testing Set (20% = 1,114 messages)

In [15]:
randomized_data = data.sample(frac=1, random_state=1)

split_index = round(len(randomized_data) * 0.8)

train_data = randomized_data[:split_index].reset_index(drop=True)

test_data = randomized_data[split_index:].reset_index(drop=True)

print('Train data: {} rows and {} columns'.format(train_data.shape[0], train_data.shape[1]))

print('')

print('Test data: {} rows and {} columns'.format(test_data.shape[0], test_data.shape[1]))

Train data: 4458 rows and 2 columns

Test data: 1114 rows and 2 columns
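
The shuffle-then-slice logic can also be illustrated without pandas. The following is a minimal sketch, assuming a hypothetical helper `shuffle_split` and toy data; it is not the notebook's code, only an illustration of how the split index is derived.

```python
import random

def shuffle_split(items, train_fraction=0.8, seed=1):
    """Shuffle a copy of items, then cut it at round(len * train_fraction)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    split_index = round(len(shuffled) * train_fraction)
    return shuffled[:split_index], shuffled[split_index:]

# Toy data: ten integers standing in for ten messages.
toy_train, toy_test = shuffle_split(range(10))
print(len(toy_train), len(toy_test))  # 8 2

# The same rounding applied to the real dataset size gives the 4458 / 1114 split.
print(round(5572 * 0.8), 5572 - round(5572 * 0.8))  # 4458 1114
```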


### Now we must verify the distribution of spam and ham entries in both datasets

In [29]:
print("train_data Spam entries: {0:.0%}".format(train_data['Label'].value_counts(normalize=True)['spam']))
print('')
print("train_data Ham entries: {0:.0%}".format(train_data['Label'].value_counts(normalize=True)['ham']))

train_data Spam entries: 13%

train_data Ham entries: 87%


In [44]:
print("test_data Spam entries: {0:.0%}".format(test_data['Label'].value_counts(normalize=True)['spam']))
print('')
print("test_data Ham entries: {0:.0%}".format(test_data['Label'].value_counts(normalize=True)['ham']))

test_data Spam entries: 13%

test_data Ham entries: 87%


### Checking values in greater detail before proceeding

In [46]:
print("train_data:")
print(train_data['Label'].value_counts(normalize=True))

print('')

print("test_data:")
print(test_data['Label'].value_counts(normalize=True))

train_data:
ham     0.86541
spam    0.13459
Name: Label, dtype: float64

test_data:
ham     0.868043
spam    0.131957
Name: Label, dtype: float64


# 3. Data Cleaning and Processing

### In this step we'll lowercase all messages and remove characters that carry no statistical significance for the algorithm (e.g., punctuation)

In [48]:
# Checking dataset before processing
train_data.head()

|  | Label | SMS |
|---|---|---|
| 0 | ham | Yep, by the pretty sculpture |
| 1 | ham | Yes, princess. Are you going to make me moan? |
| 2 | ham | Welp apparently he retired |
| 3 | ham | Havent. |
| 4 | ham | I forgot 2 ask ü all smth.. There's a card on ... |


In [52]:
# Removing punctuation and special characters
# regex=True is required for '\W' to be treated as a pattern in recent pandas versions

train_data['SMS'] = train_data['SMS'].str.replace(r'\W', ' ', regex=True)

train_data['SMS'] = train_data['SMS'].str.lower()

train_data.head()

|  | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |
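
To see what the cleaning step does to a single message, here is a small standalone sketch using Python's `re` module. The helper name `clean_sms` is hypothetical and only mirrors the replace, lowercase and split sequence applied to the `SMS` column.

```python
import re

def clean_sms(text):
    """Replace non-word characters with spaces, lowercase, and split into words."""
    return re.sub(r'\W', ' ', text).lower().split()

print(clean_sms("Yes, princess. Are you going?"))
# ['yes', 'princess', 'are', 'you', 'going']

# An apostrophe is a non-word character, so "There's" becomes two tokens.
print(clean_sms("There's a card"))
# ['there', 's', 'a', 'card']
```

Because `\W` is Unicode-aware for Python 3 strings, letters such as "ü" are kept as word characters.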


In [54]:
# Creating a list ('vocabulary') to store unique words
 
vocabulary = []

train_data['SMS'] = train_data['SMS'].str.split()

for x in train_data['SMS']:
    for y in x:
        vocabulary.append(y)

vocabulary = list(set(vocabulary))

print("Number of unique words: {}".format(len(vocabulary)))

Number of unique words: 7783


In [62]:
# Counting the frequency of each unique word

word_counts_per_sms = {unique_word:[0] * len(train_data['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train_data['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

# Transforming the frequency dictionary into a DataFrame

word_counts = pd.DataFrame(word_counts_per_sms)

# This DataFrame has 7783 columns (one for each unique word) and 4458 rows (one for each index (row) in train_data)

word_counts.head()

Unnamed: 0,purple,reality,09099725823,instructions,transaction,adsense,78,def,seen,her,im,yours,sao,trishul,twins,tmorrow,aretaking,150,hiya,library,forward,dysentry,iwana,plum,o2fwd,ha,convince,opener,trip,3510i,gist,partnership,fletcher,bought,ecstasy,wating,flirt,route,which,feb,...,434,howu,university,bcm,tobacco,humanities,petexxx,placed,choice,gain,vital,os,option,spiffing,treacle,uv,yun,8077,computer,helen,cupboard,sang,cruise,apartment,sayy,ore,disc,connected,had,clearing,poet,fetch,xin,bettr,dhina,cough,othrwise,while,intentions,epi
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
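
The real table is too wide to read comfortably, so the same transformation is sketched here on three made-up messages. All `toy_` names are hypothetical; the loop has the same shape as the one that fills `word_counts_per_sms`.

```python
# Hypothetical tokenized messages standing in for train_data['SMS'].
toy_messages = [["free", "prize", "free"], ["see", "you"], ["free", "you"]]

toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One list per word, one slot per message.
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for index, sms in enumerate(toy_messages):
    for word in sms:
        toy_counts[word][index] += 1

for word in toy_vocabulary:
    print(word, toy_counts[word])
# free [2, 0, 1]
# prize [1, 0, 0]
# see [0, 1, 0]
# you [0, 1, 1]
```

Each dictionary key becomes a column and each list position becomes a row once the dictionary is passed to `pd.DataFrame`.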


In [67]:
# Concatenating train_data and word_counts

train_data_clean = pd.concat([train_data, word_counts], axis = 1)

train_data_clean.head()

Unnamed: 0,Label,SMS,purple,reality,09099725823,instructions,transaction,adsense,78,def,seen,her,im,yours,sao,trishul,twins,tmorrow,aretaking,150,hiya,library,forward,dysentry,iwana,plum,o2fwd,ha,convince,opener,trip,3510i,gist,partnership,fletcher,bought,ecstasy,wating,flirt,route,...,434,howu,university,bcm,tobacco,humanities,petexxx,placed,choice,gain,vital,os,option,spiffing,treacle,uv,yun,8077,computer,helen,cupboard,sang,cruise,apartment,sayy,ore,disc,connected,had,clearing,poet,fetch,xin,bettr,dhina,cough,othrwise,while,intentions,epi
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# 4. Creating the Spam Filter

### To apply Naive Bayes we first need the class priors and the smoothed probability of each word given each class
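
For reference, the quantities computed in the next cells can be written as follows. The symbols are chosen here to match the variable names in the code (`p_spam`, `n_spam`, `n_vocabulary`, `alpha`); the ham versions are obtained by swapping Spam for Ham throughout.

```latex
% Posterior up to a constant factor, under the naive independence assumption
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam})

% Per-word probability with additive (Laplace) smoothing
P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \, N_{\text{Vocabulary}}}
```

Here $N_{w_i \mid \text{Spam}}$ is the number of times the word $w_i$ occurs in spam messages, $N_{\text{Spam}}$ is the total number of words in spam messages, and $N_{\text{Vocabulary}}$ is the number of unique words in the training set.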

In [68]:
# Breaking spam and ham into two DataFrames
spam_messages = train_data_clean[train_data_clean['Label'] == 'spam']
ham_messages = train_data_clean[train_data_clean['Label'] == 'ham']

alpha = 1

# Probabilities of ham and spam
p_spam = len(spam_messages) / len(train_data_clean)
p_ham = len(ham_messages) / len(train_data_clean)

n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

n_vocabulary = len(vocabulary)

In [69]:
# Calculating the probability of each word in the vocabulary given the message is ham or spam

parameters_spam = {unique_word: 0 for unique_word in vocabulary}
parameters_ham = {unique_words: 0 for unique_words in vocabulary}

for word in vocabulary:
    
    n_word_given_spam = spam_messages[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam

    n_word_given_ham = ham_messages[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham
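
A tiny worked example may make the smoothing easier to check by hand. This is a sketch on a made-up corpus; `smoothed_params` and every `toy_` name are hypothetical and are not part of the notebook.

```python
# Toy corpus (hypothetical), already lowercased and tokenized.
toy_spam_docs = [["win", "money", "now"], ["win", "prize"]]
toy_ham_docs = [["see", "you", "now"]]

toy_alpha = 1  # Laplace smoothing, the same value the notebook uses
toy_vocab = sorted({w for doc in toy_spam_docs + toy_ham_docs for w in doc})

def smoothed_params(docs, vocab, alpha):
    """Return P(word | class) for every vocabulary word, with additive smoothing."""
    n_class = sum(len(doc) for doc in docs)  # total words in this class
    counts = {w: 0 for w in vocab}
    for doc in docs:
        for w in doc:
            counts[w] += 1
    denominator = n_class + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denominator for w in vocab}

toy_params_spam = smoothed_params(toy_spam_docs, toy_vocab, toy_alpha)
toy_params_ham = smoothed_params(toy_ham_docs, toy_vocab, toy_alpha)

# "win" appears twice in 5 spam words, with 6 vocabulary words: (2 + 1) / (5 + 6)
print(toy_params_spam["win"])
# "see" never appears in spam, yet smoothing keeps it above zero: (0 + 1) / (5 + 6)
print(toy_params_spam["see"])
```

Without smoothing, a single word unseen in one class would drive that class's whole product to zero.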

# 5. Classifying new messages
### We now have all the parameters the algorithm needs
The filter can be understood as a function that:

1. Takes a new message (word 1, word 2, ..., word n) as input

2. Calculates P(Spam|word 1, word 2, ..., word n) and P(Ham|word 1, word 2, ..., word n)

3. Compares the two probabilities and classifies the message according to whichever is higher

4. If the two probabilities are equal, requests human assistance
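
The four steps above can also be sketched in log space. Multiplying many small probabilities can underflow for long messages, and summing logarithms preserves the same ordering. This is an optional variant under that assumption, not the notebook's implementation; `classify_log` and the `demo_` parameters are hypothetical.

```python
import math
import re

def classify_log(message, p_spam, p_ham, parameters_spam, parameters_ham):
    """Compare log-posteriors (up to a shared constant) instead of raw products."""
    words = re.sub(r'\W', ' ', message).lower().split()
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words missing from the parameter tables are skipped.
        if word in parameters_spam:
            log_spam += math.log(parameters_spam[word])
        if word in parameters_ham:
            log_ham += math.log(parameters_ham[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'Human classification required'

# Tiny hand-made parameters, purely illustrative.
demo_spam = {"win": 0.3, "money": 0.3, "see": 0.05, "you": 0.05}
demo_ham = {"win": 0.05, "money": 0.05, "see": 0.3, "you": 0.3}

print(classify_log("Win money!!", 0.5, 0.5, demo_spam, demo_ham))  # spam
print(classify_log("see you", 0.5, 0.5, demo_spam, demo_ham))      # ham
```

A message made only of unknown words leaves both scores at their priors, so with equal priors it falls through to the human-review branch.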

In [88]:
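import re

def classify(message):

    # Python's str.replace would search for the literal text '\W';
    # re.sub treats it as a pattern and strips non-word characters
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()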

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'Human classification required'

In [93]:
# Test 1

classify("WINNER!! This is the secret code to unlock the money: C3421.")

'spam'

In [90]:
# Test 2

classify("Sounds good, Tom, then see u there")

'ham'

In [95]:
# Checking test_data

test_data.head()

|  | Label | SMS |
|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. |
| 1 | ham | But i haf enuff space got like 4 mb... |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... |
| 4 | ham | All done, all handed in. Don't know if mega sh... |


In [96]:
# Applying the algorithm to test_data

test_data['predicted'] = test_data['SMS'].apply(classify)

test_data.head()

|  | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [98]:
# Measuring algorithm accuracy

total = len(test_data)

# Count the rows where the predicted label matches the true label
correct = (test_data['Label'] == test_data['predicted']).sum()

accuracy = correct / total

print("Correct predictions: {}".format(correct))
print("Incorrect predictions: {}".format(total - correct))
print("Algorithm Accuracy: {0:.0%}".format(accuracy))

Correct predictions: 1095
Incorrect predictions: 19
Algorithm Accuracy: 98%
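
Since about 87% of the messages are ham, a filter that always answers "ham" would already score roughly 87% accuracy. Precision and recall for the spam class give a complementary view. The sketch below uses made-up labels and a hypothetical helper `spam_metrics`; it does not report results for this dataset.

```python
def spam_metrics(actual, predicted):
    """Return (precision, recall) for the 'spam' class."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a == 'spam' and p == 'spam' for a, p in pairs)
    false_pos = sum(a != 'spam' and p == 'spam' for a, p in pairs)
    false_neg = sum(a == 'spam' and p != 'spam' for a, p in pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Toy labels, purely illustrative.
toy_actual = ['spam', 'ham', 'ham', 'spam', 'ham']
toy_predicted = ['spam', 'ham', 'spam', 'ham', 'ham']

print(spam_metrics(toy_actual, toy_predicted))  # (0.5, 0.5)
```

In the notebook the same helper could be called as `spam_metrics(test_data['Label'], test_data['predicted'])`.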
