**Project: Building a Spam Filter with Naive Bayes**

**Goal:** to study the practical side of the Naive Bayes algorithm by building a spam filter for SMS messages.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

Within the dataset, the labels are 'spam' and 'ham', where 'ham' means non-spam.

In [1]:
import pandas as pd

In [2]:
# import data
url = 'https://dq-content.s3.amazonaws.com/433/SMSSpamCollection'
df = pd.read_csv(url, sep = '\t', header = None, names = ['Label', 'SMS'])

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [3]:
# EDA
df.describe()

| | Label | SMS |
|---|---|---|
| count | 5572 | 5572 |
| unique | 2 | 5169 |
| top | ham | Sorry, I'll call later |
| freq | 4825 | 30 |


In [4]:
# Distributions
dist = df['Label'].value_counts(dropna = False, normalize = True)
spam_to_ham = round(dist['spam'] / dist['ham'], 1)   # index by label rather than by position

print('There is a {} to 1 spam to ham ratio'.format(spam_to_ham), '\n')
print('Percentage of Dataset')
round(100 * dist, 1)

There is a 0.2 to 1 spam to ham ratio 

Percentage of Dataset


ham     86.6
spam    13.4
Name: Label, dtype: float64

In [5]:
# Overall 
df['SMS'].value_counts(dropna = False)

Sorry, I'll call later                                                                                                                                                                                                                                                                                                                  30
I cant pick the phone right now. Pls send a message                                                                                                                                                                                                                                                                                     12
Ok...                                                                                                                                                                                                                                                                                                                                   10
Ok.    

In [6]:
# Spam
df[df['Label'] == 'spam']['SMS'].value_counts(dropna = False)

Please call our customer service representative on FREEPHONE 0808 145 4742 between 9am-11pm as you have WON a guaranteed £1000 cash or £5000 prize!                                                                                4
FREE for 1st week! No1 Nokia tone 4 ur mob every week just txt NOKIA to 8007 Get txting and tell ur mates www.getzed.co.uk POBox 36504 W45WQ norm150p/tone 16+                                                                     3
Camera - You are awarded a SiPix Digital Camera! call 09061221066 fromm landline. Delivery within 28 days.                                                                                                                         3
Loan for any purpose £500 - £75,000. Homeowners + Tenants welcome. Have you been previously refused? We can still help. Call Free 0800 1956669 or text back 'help'                                                                 3
HMV BONUS SPECIAL 500 pounds of genuine HMV vouchers to be won. Just answer 4 easy q

In [7]:
# Non-spam
df[df['Label'] == 'ham']['SMS'].value_counts(dropna = False)

Sorry, I'll call later                                                                                                                                                                                                                                                                                                                                                                                                                                          30
I cant pick the phone right now. Pls send a message                                                                                                                                                                                                                                                                                                                                                                                                             12
Ok...                                                                                             

**Model Build**

In [8]:
# Copy raw and split data into train / test
df_rand = df.copy().sample(frac = 1, random_state = 1)
split = int(len(df_rand) * 0.8)

df_train = df_rand.iloc[:split, :].copy().reset_index(drop = True)
df_test = df_rand.iloc[split:, :].copy().reset_index(drop = True)

In [9]:
df_train['Label'].value_counts(dropna = False, normalize = True)

ham     0.86538
spam    0.13462
Name: Label, dtype: float64

In [10]:
df_test['Label'].value_counts(dropna = False, normalize = True)

ham     0.868161
spam    0.131839
Name: Label, dtype: float64

The split preserves the label distribution: ham makes up about 86.5% of the training set and 86.8% of the test set.
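The random shuffle happened to keep the class balance here. If it had not, one option is to sample 80% within each label. The sketch below shows the idea on a toy frame; the `stratified_split` helper and the `toy` data are hypothetical and not part of the notebook.

```python
import pandas as pd

def stratified_split(df, label_col, frac=0.8, random_state=1):
    """Sample `frac` of the rows within each label for training; the rest form the test set."""
    train = df.groupby(label_col, group_keys=False).sample(frac=frac, random_state=random_state)
    test = df.drop(train.index)
    return train.reset_index(drop=True), test.reset_index(drop=True)

# Toy data (hypothetical): 10 ham and 5 spam messages
toy = pd.DataFrame({
    'Label': ['ham'] * 10 + ['spam'] * 5,
    'SMS': ['message {}'.format(i) for i in range(15)],
})

toy_train, toy_test = stratified_split(toy, 'Label')
print(toy_train['Label'].value_counts())
print(toy_test['Label'].value_counts())
```

Each label contributes the same fraction of its rows to the training set, so both sets keep the 2:1 ham-to-spam ratio of the toy data.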

To calculate all the probabilities required by the algorithm, we first need to clean the data so that the individual words of each message can be extracted.
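For reference, the quantities estimated in the cells below can be written as follows. The notation is an assumption chosen to match the variable names used in the code (`p_spam`, `n_spam`, `n_vocabulary`, `alpha`); the expressions for ham are analogous.

```latex
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam})

P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}
```

A message is labelled spam when the first expression is larger for Spam than for Ham.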

In [11]:
# Clean data
def clean_col(df, col):
    df[col] = df[col].str.replace(r'\W', ' ', regex=True)   # regex=True is required in recent pandas versions
    df[col] = df[col].str.lower()
    return df

df_train = clean_col(df_train, 'SMS')
df_train.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |


In [12]:
# list of unique words
def unique_list(df, col):
    df = df.copy()
    df[col] = df[col].str.split()
    vocab = []
    for lst in df[col]:
        for i in lst:
            if i not in vocab:
                vocab.append(i)
                
    return vocab
vocab = unique_list(df_train, 'SMS')
len(vocab)

7782

In [13]:
# word presence per message (1 if the word occurs in the message, 0 otherwise)
def word_counts(df, col, vocab):
    # copy & split
    df = df.copy()
    df[col] = df[col].str.split()
    
    # new DF
    word_counts = pd.DataFrame(index = df.index, columns = vocab, data = 0)
    
    # add data to DF
    for row in df.index:
        for col_name in word_counts.columns:
            if col_name in df.loc[row, col]:
                word_counts.loc[row, col_name] +=1
    
    return word_counts
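The nested loop above visits every (message, vocabulary word) pair and writes through `.loc`, which tends to be slow on a few thousand messages. One alternative is to count tokens once per message with `collections.Counter` and build the frame in a single step. Note that this sketch records how many times each word occurs rather than a 0/1 flag, so it is not a drop-in replacement; the `count_tokens` helper and the toy messages are hypothetical.

```python
from collections import Counter

import pandas as pd

def count_tokens(messages, vocab):
    """Return one row per message and one column per vocabulary word,
    holding the number of times the word occurs in the message."""
    per_message = [Counter(msg.split()) for msg in messages]
    # A Counter returns 0 for missing keys, so absent words need no special handling
    data = [[counter[word] for word in vocab] for counter in per_message]
    return pd.DataFrame(data, index=messages.index, columns=vocab)

# Toy input (hypothetical), already cleaned and lower-cased
toy_messages = pd.Series(['free prize free', 'call me later'])
toy_vocab = ['free', 'prize', 'call', 'me', 'later']

toy_counts = count_tokens(toy_messages, toy_vocab)
print(toy_counts)
```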

In [14]:
counts_train = word_counts(df_train, 'SMS', vocab)
counts_train

Unnamed: 0,yep,by,the,pretty,sculpture,yes,princess,are,you,going,...,prakesh,beauty,hides,secrets,n8,jewelry,related,trade,arul,bx526
0,1,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
# Combine with training df
df_train = pd.concat([df_train, counts_train], axis = 1)
df_train.head(2)

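| | Label | SMS | yep | by | the | pretty | sculpture | yes | princess | are | ... | prakesh | beauty | hides | secrets | n8 | jewelry | related | trade | arul | bx526 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | yep by the pretty sculpture | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | yes princess are you going to make me moan | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |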


In [18]:
# Isolating spam and ham messages first
spam_messages = df_train[df_train['Label'] == 'spam']
ham_messages = df_train[df_train['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(df_train)
p_ham = len(ham_messages) / len(df_train)

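# N_Spam
# Note: 'SMS' holds cleaned strings here, so len returns each message's character count;
# spam_messages['SMS'].str.split().str.len() would count words instead
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham (same caveat as N_Spam)
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()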

# N_Vocabulary
n_vocabulary = len(vocab)

# Laplace smoothing
alpha = 1

In [19]:
# Initiate parameters
parameters_spam = {w: 0 for w in vocab}
parameters_ham = {w: 0 for w in vocab}

# Calculate parameters
for w in vocab:
    n_word_given_spam = spam_messages[w].sum()   # spam_messages already defined in a cell above
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[w] = p_word_given_spam
    
    n_word_given_ham = ham_messages[w].sum()   # ham_messages already defined in a cell above
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[w] = p_word_given_ham
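To see what the smoothing term does, here is a small worked example with made-up counts. All numbers are hypothetical, and `smoothed_probability` is an illustrative helper that applies the same formula as the loop above.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """Laplace-smoothed estimate of P(word | class)."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# Hypothetical counts: N_class = 20 and a 10-word vocabulary
p_seen = smoothed_probability(3, 20, 10)      # word seen 3 times: (3 + 1) / (20 + 10)
p_unseen = smoothed_probability(0, 20, 10)    # word never seen:   (0 + 1) / (20 + 10)
p_unsmoothed = smoothed_probability(0, 20, 10, alpha=0)   # without smoothing: 0 / 20

print(p_seen, p_unseen, p_unsmoothed)
```

Without smoothing, a single unseen word would multiply the whole product by zero; with `alpha = 1` it contributes a small but nonzero factor.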

**Classify New Message**

In [20]:
import re

def classify(message):
    '''
    message: a string
    '''
    
    message = re.sub(r'\W', ' ', message)   # raw string avoids the invalid-escape warning
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [21]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.9444857812782986e-31
P(Ham|message): 1.080492355073868e-33
Label: Spam


In [22]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 1.1174866026064603e-29
P(Ham|message): 6.731692841854824e-26
Label: Ham
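
The probabilities printed above are already in the range of 1e-26 to 1e-33 for short messages. For long messages, the product of many small factors can underflow to 0.0 for both classes, which would make them indistinguishable. A common remedy is to compare sums of log-probabilities instead. The sketch below uses toy stand-ins for `p_spam`, `p_ham`, `parameters_spam` and `parameters_ham`; the `classify_log` function and all values are hypothetical.

```python
import math
import re

# Toy stand-ins (hypothetical values) for the quantities estimated in the notebook
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_parameters_spam = {'win': 0.05, 'money': 0.04, 'see': 0.001, 'you': 0.01}
toy_parameters_ham = {'win': 0.001, 'money': 0.002, 'see': 0.02, 'you': 0.03}

def classify_log(message):
    """Classify by comparing sums of log-probabilities instead of products."""
    words = re.sub(r'\W', ' ', message).lower().split()
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify()
        if word in toy_parameters_spam:
            log_spam += math.log(toy_parameters_spam[word])
        if word in toy_parameters_ham:
            log_ham += math.log(toy_parameters_ham[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

print(classify_log('Win money!!'))
print(classify_log('See you'))
print(classify_log('win ' * 500))   # the plain product would underflow to 0.0 here
```

Because the logarithm is monotonic, the comparison gives the same label as the product whenever the product does not underflow.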


**Test Model**

In [23]:
def classify_test_set(message):    
    '''
    message: a string
    '''
    
    message = re.sub(r'\W', ' ', message)   # raw string avoids the invalid-escape warning
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [24]:
df_test['predicted'] = df_test['SMS'].apply(classify_test_set)
df_test.head()

| | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Wherre's my boytoy ? :-( | ham |
| 1 | ham | Later i guess. I needa do mcat study too. | ham |
| 2 | ham | But i haf enuff space got like 4 mb... | ham |
| 3 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 4 | ham | All sounds good. Fingers . Makes it difficult ... | ham |


In [30]:
total = df_test.shape[0]

# Count the rows where the predicted label matches the true label
correct = (df_test['Label'] == df_test['predicted']).sum()

print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', round(100 * correct/total, 1), '%')

Correct: 1099
Incorrect: 16
Accuracy: 98.6 %
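
Accuracy alone does not show which kind of mistake the filter makes, and with ham making up most of the data the two error types are worth separating. One way to do that is a confusion table built with `pd.crosstab`. The sketch below runs on a toy frame with the same `Label` and `predicted` columns as `df_test`; the data is hypothetical, and on the real test set the table could also contain a 'needs human classification' column.

```python
import pandas as pd

# Toy results (hypothetical) with the same columns as df_test
toy_results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'ham', 'spam', 'spam', 'spam'],
    'predicted': ['ham', 'ham', 'spam', 'spam', 'spam', 'ham'],
})

# Rows are true labels, columns are predicted labels
confusion = pd.crosstab(toy_results['Label'], toy_results['predicted'])

true_pos = confusion.loc['spam', 'spam']    # spam correctly flagged
false_pos = confusion.loc['ham', 'spam']    # ham wrongly flagged as spam
false_neg = confusion.loc['spam', 'ham']    # spam that slipped through

precision = true_pos / (true_pos + false_pos)   # share of flagged messages that are spam
recall = true_pos / (true_pos + false_neg)      # share of spam that was flagged

print(confusion)
print('Precision:', precision)
print('Recall:', recall)
```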


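In this project, we used the Naive Bayes algorithm to build a spam filter that classifies SMS messages with 98.6% accuracy on the test set (1,099 of 1,115 messages), compared with the 86.8% that always predicting 'ham' would achieve.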