# Building a Spam Filter with Naive Bayes

## Exploring the Dataset

### In this project I will be "teaching" my computer to classify messages between spam and non-spam. To do this I will use the multinominal Naive Bayes theorem and a dataset of 5572 SMS messages that have already been classified by humans.

##### The dataset has been provided by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the The UCI Machine Learning Repository.

In [1]:
import pandas as pd

# Dataset has no header row and is tab separated
sms_data = pd.read_csv('SMSSpamCollection', header = None, names = ['Label', 'SMS'] , sep = '\t')

sms_data.info()

sms_data.head()
# sms_data.tail()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
spam = sms_data.loc[sms_data['Label'] == 'spam']

ratio_spam = spam['Label'].count() / sms_data['Label'].count()
percent_spam = ratio_spam * 100

round(percent_spam,1)

13.4

#### So 13.4% of the dataset messages are spam, that is 747 messages.

## Training and Test Set

### Here I will split my dataset into two section. I will use one as a training set, used to teach the computer how to classify messages. The other set will be the test set, this will be used to mark how well the spam filter performs on new messages.

### I'll use an 80/20 split for the SMS data, then when the filter is complete I can test by simply comparing the output of the filter to the label column completed by humans.

#### I will first randomize my entire dataset, then I will split and perform some minor analysis.

In [3]:
# Randomize Data
random_data = sms_data.sample(frac = 1, random_state = 1)

# Find the split lengths
training_test_split = round(len(random_data) * 0.8)

# Split the randomized data
training_data = random_data[:training_test_split].reset_index()
test_data = random_data[training_test_split:].reset_index()

# Check the size of each set
print(training_data.shape)
print(test_data.shape)



(4458, 3)
(1114, 3)


In [4]:
# Find the percentage of spam and ham in both sets
training_spam = training_data.loc[sms_data['Label'] == 'spam']
test_spam = test_data.loc[sms_data['Label'] == 'spam']

no_training_spam = training_spam['Label'].count()
no_test_spam = test_spam['Label'].count()
# can also do training_spam['Label'].value_counts(normalize = True)

print('Training Spam Percent:', round((no_training_spam/4458)*100,1))
print('Test Spam Percent:', round((no_test_spam/1114)*100, 1))

Training Spam Percent: 13.5
Test Spam Percent: 15.1


#### The percentage of spam in the training set is 13.5% and 15.1% in the test set, I should expect the same portion of spam when testing the filter. These percentages are both within reasonable range of the percent spam in the full data.

## Letter Case and Punctuation

### Now I will use the training set to teach the algorithm to classify new messages. I'll use the Naive Bayes algorithm, for which I'll need to clean the data a little and format it into an efficient structure for information extraction.

### Currently there are two columns, Label and SMS. To make calculations easier, it will be useful to replace the SMS column with a set of columns representing each new disctinct word from the dictionary. These columns will be integers counting how many times each column word has occurred in a specific SMS message. I will also need to remove any punctuation and make letter case uniform.

In [5]:
# Before
training_data.head()

Unnamed: 0,index,Label,SMS
0,1078,ham,"Yep, by the pretty sculpture"
1,4028,ham,"Yes, princess. Are you going to make me moan?"
2,958,ham,Welp apparently he retired
3,4642,ham,Havent.
4,4674,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [6]:
# After

# Remove punctuation
training_data['SMS'] = training_data['SMS'].str.replace('\W', ' ')
# Standardize to lowercase
training_data['SMS'] = training_data['SMS'].str.lower()

training_data.head()


Unnamed: 0,index,Label,SMS
0,1078,ham,yep by the pretty sculpture
1,4028,ham,yes princess are you going to make me moan
2,958,ham,welp apparently he retired
3,4642,ham,havent
4,4674,ham,i forgot 2 ask ü all smth there s a card on ...


### Create Vocabulary

#### I'll need to create a list of all the disctinct words used in the training data SMS messages. For this I will change each SMS from a string into a list.

In [7]:
training_data['SMS'] = training_data['SMS'].str.split()

vocabulary = []

for row in training_data['SMS']:
    for word in row:
        vocabulary.append(word)
        
# Now we can use a set to remove all duplications
vocabulary = list(set(vocabulary))

print(len(vocabulary))


7783


#### There are 7783 unique words in total in the training data set.

## The Final Training Set

### Now that I have a vocabulary, I need to create my final training data set. For this purpose I'll use a dictionary, with a key for each distinct word, and each keys value will be a list, where each element will be the number of occurances of the key word for one SMS message.

### I'll initialize all the list elements to 0, then do use a loop to count the occurances per SMS.

In [8]:
# Create the structure of the dictionary and lists with 0s
word_counts_per_sms = {unique_word: [0] * len(training_data['SMS']) for unique_word in vocabulary}

# Fill dictionary with occurances
# We can use the enumerate function to pass the index and the SMS to the loop
for index, sms in enumerate(training_data['SMS']):
    for word in sms:
        # Increment the correct list element (SMS) for the correct key (Word)
        word_counts_per_sms[word][index] += 1


In [9]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


#### Concatenate the word counts data set with the training data set so I can see both the full SMS message, label and word counts.

In [10]:
training_data_clean = pd.concat([training_data, word_counts], axis=1)
training_data_clean.head()

Unnamed: 0,index,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,1078,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,4028,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,958,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4642,ham,[havent],0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4674,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


## Calculating Constants First

### Now that I have a clean training set, I can begin to make the filter. For this purpose I will need several constants such as P(Spam), P(Ham), $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$. I'll also use Laplace smoothing and set $\alpha$ = 1.

    

In [None]:
# Count the spam and ham messages
spam_msg = training_data_clean[training_data_clean['Label'] == 'spam']
ham_msg = training_data_clean[training_data_clean['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_msg) / len(training_data_clean)
p_ham = len(ham_msg) / len(training_data_clean)

# N_Spam, N_Ham, N_Vocabulary

# N_Spam
words_per_spam_msg = spam_msg['SMS'].apply(len)
n_spam = words_per_spam_msg.sum()

# N_Ham
words_per_ham_msg = ham_msg['SMS'].apply(len)
n_ham = words_per_ham_msg.sum()

# N_Vocabulary
n_vocab = len(vocabulary)

# Laplace Smoothing
alpha = 1


## Calculating Parameters

### Now I have my constants, but I also need probabilities that vary depending on the individual word, like P($w_{i}$|Ham) and P($w_{i}$|Spam) where $w_{i}$ is the word in question. Although the probability is word dependent, it is message independent. It will change with the word but not the SMS. We need to calculate both of the stated parameters for each word, around 15.5k probabilities.

### I will calculate these probabilities once beforehands, so that they won't need to be done each time a new message is recieved.

In [None]:
# Initiate parameters
params_spam = {unique_word:0 for unique_word in vocabulary}
params_ham = {unique_word:0 for unique_word in vocabulary}

# Calculate parameters
for word in vocabulary:
    n_word_given_spam = spam_msg[word].sum()   # spam_messages already defined in a cell above
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocab)
    params_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_msg[word].sum()   # ham_messages already defined in a cell above
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocab)
    params_ham[word] = p_word_given_ham
    
    

## Classifying A New Message

### I have now calculated all the paramaters and constants I will need, to make the spam filter I will make a function that:

#### > Takes in a new message as input ($w_{1}$, $w_{2}$, $w_{3}$ ..., $w_{n}$)
#### > Calculates P(Spam|$w_{1}$, $w_{2}$, $w_{3}$ ..., $w_{n}$) and P(Ham|$w_{1}$, $w_{2}$, $w_{3}$ ..., $w_{n}$)
#### > Compares these values and:
####     - If P(Ham|$w_{1}$, $w_{2}$, $w_{3}$ ..., $w_{n}$) > P(Spam|$w_{1}$, $w_{2}$, $w_{3}$ ..., $w_{n}$) then the message is classified as ham
####     - If P(Ham|$w_{1}$, $w_{2}$, $w_{3}$ ..., $w_{n}$) < P(Spam|$w_{1}$, $w_{2}$, $w_{3}$ ..., $w_{n}$) then the message is classified as spam
####     - If P(Ham|$w_{1}$, $w_{2}$, $w_{3}$ ..., $w_{n}$) = P(Spam|$w_{1}$, $w_{2}$, $w_{3}$ ..., $w_{n}$), then the algorithm will ask for human help

In [None]:
print(len(spam_msg))

In [None]:
import re

def classify(message):
    '''
    message: a string
    '''
    
    # Clean the message
    message = re.sub('\W', ' ', message)
    message = message.lower().split()
    
    # Initialize the initial probability for the equation
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    # Cycle through the words in the SMS, find the probability for that
    # word in the params dictionary and multiply with p_x_given_message
    for word in message:
        if word in params_spam:
            p_spam_given_message *= params_spam[word]
            
        if word in params_ham:
            p_ham_given_message *= params_ham[word]
    
    # Display probabilities
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    # Give classification
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal proabilities, have a human classify this!')

In [None]:
classify('Hi Mum, Did you get the gift card I got you? Make sure to claim it soon!')

In [None]:
classify('You have won a bunch of money, click the link below to claim. Your code is 25F TS2')

#### After some playing, the function seems to be working appropriately.

## Measuring the Spam Filter's Accuracy

### Now I'll test my function on the test data set. The function will classify each message and I'll compare it to the 'Label' column, complete by humans, to test its accuracy.

### Just need to make a small adjustment to the classify function so that it will return the classification rather than display it. Then I can add the return to a new column that I'll call predicted.


In [None]:
def classify_test_set(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in params_spam:
            p_spam_given_message *= params_spam[word]

        if word in params_ham:
            p_ham_given_message *= params_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [None]:
# Create the predicted column and fill
test_data['predicted'] = test_data['SMS'].apply(classify_test_set)
test_data.head()

In [None]:
correct = 0
total = len(test_data['Label'])

incorrect = []

for msg in test_data.iterrows():
    row = msg[1]
    if row['Label'] == row['predicted']:
        correct += 1
    else:
        incorrect.append(row['SMS'])
    
percent_acc = correct / total * 100

print(round(percent_acc,2),'%')

#### The spam filter is actually very accurate. It has removed almost 99% of spam, much better than I expected!

In [None]:
wrong_class = {'Incorrect_SMS' : incorrect}

pd.DataFrame(wrong_class)



#### The common reason for the function missclassing messages seems to be the abbreviations, overuse of non-letter characters and spelling mistakes within the email, this will low the amount of words contributing to the p_ham probability and hence making it easier for p_spam to be higher.