# Guided Project 12: Building a Spam Filter with Naive Bayes

In [1]:
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

The spam data set to be worked with will be named `sms_spam`. 

In [68]:
sms_spam_full = pd.read_csv('SMSSpamCollection.txt',
                   sep='\t',
                   names=['Label', 'SMS'])

Below, we can see that only aprox. 13.4% of the messages are spam.

In [69]:
count_label = sms_spam_full.Label.value_counts(normalize=True).round(3)*100

count_label = count_label.rename('ham vs spam (%)')

count_label

ham     86.6
spam    13.4
Name: ham vs spam (%), dtype: float64

The next step is to create a training set and a test set out of `sms_spam_full`:

1. Randomizing the `sms_spam_full`.


2. Split randomized DF into a training set (20%) and a testing set (remaining 80%).


3. Compare the 'Label' value distribution of the entire randomized DF with the previously made subsets.

In [70]:
# 1.
random_sms_spam = sms_spam_full.sample(n=None, frac=1, random_state=1).reset_index(drop=True)

From earlier on we know that `random_sms_spam` has 5572 entries (0 to 5571).

In [71]:
eighty_perc = random_sms_spam.shape[0] * 0.8

print(f'Total number of observations/rows in random_sms_spam: {random_sms_spam.shape[0]}', 
      f'\n20% of total observations/rows in random_sms_spam: {round(eighty_perc, 0)}')

Total number of observations/rows in random_sms_spam: 5572 
20% of total observations/rows in random_sms_spam: 4458.0


Based on the information above we set the training set as the rows 0 to 4458, 80% of the total rows in `random_sms_spam`. The remaining rows, will form the testing set. 

In [72]:
# 2.
training_set = random_sms_spam.copy().iloc[:4458+1, :]

testing_set = random_sms_spam.copy().iloc[4458:, :]

Finally, comparing the distribution of values in the `Label` column across DataFrames.

In [73]:
# 3.

# Training set.
count_label_training = training_set.Label.value_counts(normalize=True).round(3)*100

count_label_training = count_label_training.rename('ham vs spam (%)')

# Testing set.
count_label_testing = testing_set.Label.value_counts(normalize=True).round(3)*100

count_label_testing = count_label_testing.rename('ham vs spam (%)')


# Combining all label percentage counts for comparison.
compare_label = pd.DataFrame({'random_sms_spam': count_label,
                              'count_label_training': count_label_training,
                              'count_label_testing': count_label_testing})

compare_label

Unnamed: 0,random_sms_spam,count_label_training,count_label_testing
ham,86.6,86.5,86.8
spam,13.4,13.5,13.2


As we can see in the table above, the value distribution is very similar across the panel, meaning that we can infer the conclusions produced from the training set to the testing set, since both sets resemble the original series.

In [74]:
# Saving RAM 1
del sms_spam_full
del random_sms_spam

For the next stage we focus on working with the training set. In order to apply the Naive Bayes Theorem, we can build a table that, for each message, classifies it as spam or not spam and counts the number of times a word, from the vocabulary, appears in the message. Whenever a message does not contain a word from the vocabulary, each word in the vocabulary having its own column, it registers `0`. Remember that the vocabulary is the group of all unique words gathered from all the messages within the training set.

Prior to building this DataFrame, two cleaning steps applied to the `SMS` column will be undertaken:

- eliminate punctuation.
- lower case every word.

The list below `non_words_list`, gives the characters that are removed from the strings/messages when `str.replace()` is applied with the parameter `pattern=\W` to the `training_set.SMS` column.
This is important to analise because the regex character class `\W`, when applied, may remove some specific characters that are common in spam messages, hence weakening the algorithm's accuracy.

In [75]:
all_messages_str = ''

for index, value in enumerate(training_set.SMS):
    all_messages_str = ' '.join([all_messages_str, value])
    

non_words_set = set(re.findall('\W', all_messages_str))


non_words_list = list(non_words_set)

non_words_list_w_spaces = [' ' + i + ' ' for i in non_words_list]

In [76]:
non_words_list

[')',
 '\x94',
 '~',
 '%',
 '\t',
 '@',
 ';',
 '|',
 '\x96',
 '>',
 ',',
 '\\',
 '/',
 '.',
 '"',
 '\x93',
 '\x92',
 '“',
 '+',
 '#',
 '$',
 '*',
 '£',
 '?',
 '=',
 ' ',
 '-',
 "'",
 ']',
 '<',
 '!',
 '’',
 '—',
 '¡',
 ':',
 '–',
 '(',
 '┾',
 '&',
 '‘',
 '…',
 '\n',
 '»',
 '[',
 '\x91',
 '^']

As we can observe above, there are caraceters such as currency symbols, which would are removed from the strings if we apply:

    training_set.SMS.copy().str.replace('\W', ' ', regex=True)
    
Therefore, we proceed to maintain characters related to currency in the messages:

In [77]:
# Training set `SMS` cleaned series

ts_cleaned_SMS = training_set.SMS.copy().str.replace('[^A-Za-z0-9\s£€\$]', ' ', regex=True)
ts_cleaned_SMS = ts_cleaned_SMS.str.replace('€', ' € ', regex=False)
ts_cleaned_SMS = ts_cleaned_SMS.str.replace('£', ' £ ', regex=True)
ts_cleaned_SMS = ts_cleaned_SMS.str.replace('\$', ' $ ', regex=True)


We've seen in 'p12 - practice 2' that this innovation will marginally improve the algorithm's accuracy, but some money references were still not accounted for.

We can tell that we are working with a set of messages presumably from UK, since the money/currency references made are either 'GBP' or pences. When filtering the training set for 'pence' or other proxy we realize that there are many variations of this 'tell tell' sign of spamming. Some variations on 'pence' include '150pm', which stands for '150 per message' can be seen below. 

The regex expression inside `re.findall` - `pattern_pences`, looks for every sequence of characters that 
might include a sequence of characters composed of a single  to 3 digits (`\d{1, 3}`) followed immediately by the word `p`. This limitation is due to the fact that we know by default that references to pences range usually from 5 to 150 (or more) but not in the thousands range. Expected matches are: 

|sub-pattern  |  example                 | 
|---------|--------------------------|
|  \d+p\w+ |'**150p**pmsg'   | 
|  \d+p\w+ |'**50p**erwksub'   | 
|  \w+\d+p | '08712400602**450p**'  | 
|  \w+\d{1,}p\w+ | 'com1win**150p**pmx3age16'  | 

We can notice that in some situations such as in the second example, some words may not be referencing pences but we'll leave it as such for now.

In [78]:
list_of_patterns = []

pattern_pences = '\\b\w*\d{1,3}p\w*\\b'

for i, val in enumerate(ts_cleaned_SMS):
    pences = re.findall(pattern_pences, val)
    if len(pences):
        list_of_patterns.append(pences[0])
        
list_of_patterns      

['08700621170150p',
 '150pm',
 '150p',
 '10p',
 '150p',
 '150p',
 '50pmmorefrommobile2Bremoved',
 'MobStoreQuiz10ppm',
 'callcost150ppmmobilesvary',
 '450pw',
 '35p',
 'com1win150ppmx3age16',
 '1x150p',
 '10ppm',
 '000pes',
 '150ppm',
 '150p',
 '150ppm',
 '150p',
 '150p',
 '150p',
 '08712400602450p',
 '150p',
 '150p',
 '150ppm',
 '150p',
 '150ppmPOBox10183BhamB64XE',
 '8pm',
 '150p',
 '150p',
 '150ppmsg',
 '10p',
 '7pm',
 '50perWKsub',
 '50p',
 '25p',
 '150p',
 '150ppm',
 '60p',
 '20p',
 '150p',
 '150p',
 '25p',
 '150p',
 '150ppm',
 'gr8prizes',
 'com1win150ppmx3age16subscription',
 '150p',
 '150p',
 '10p',
 '2price',
 '50p',
 '2p',
 '18p',
 '150p',
 'norm150p',
 '150p',
 '150ppm',
 '100percent',
 '150p',
 '11pm',
 '150p',
 '150p',
 '60p',
 '08712400602450p',
 '150p',
 '50perWKsub',
 '150p',
 '5p',
 '150pm',
 'box245c2150pm',
 '100percent',
 '45pm',
 '50p',
 '25p',
 '150ppermessSubscription',
 '08701417012150p',
 '150p',
 '20p',
 '150p',
 '60p',
 '50p',
 '9pm',
 '7pm',
 '30pm',
 '150p'

One attempt to make the algorithm recognize references to pences is to separate expressions like the ones previously mentioned, e.g. '10p', '150p', from the rest of the characters with a whitespace. The way to achieve this is to first, identify a list of patterns (which is already done above); create a modified string compiled in a `list_of_replacements`, and last, replacing in the main data set the old expression for the new expression inside the same message. Example of a message modification: `training_set.SMS[186]`, contains this expression: 'callcost150ppmmobilesvary'.

In its former form:

'URGENT  This is the 2nd attempt to contact U U have WON  £ 1000CALL 09071512432 b4 300603t csBCM4235WC1N3XX callcost150ppmmobilesvary  max £ 7  50'

In its later form, after the modification:

'URGENT  This is the 2nd attempt to contact U U have WON  £ 1000CALL 09071512432 b4 300603t csBCM4235WC1N3XX callcost 150p pmmobilesvary  max £ 7  50'

To that end, we create 

In [79]:
list_of_replacements = []
pences_references = []

for i, val_1 in enumerate(list_of_patterns):
    if re.findall('([1-9]pm|1[0-2]pm)', val_1) == []:
        pence_ref = val_1
        digits = re.findall('\d{1,3}p', pence_ref)
        pences_references.append(digits)

    final_str = ''
    
    for val_2 in digits:
        
        sub = ' ' + val_2 + ' '
        
        if final_str == '':
            final_str = re.sub(val_2, sub, pence_ref, count=1)
            
        else:
            final_str = re.sub(val_2, sub, final_str, count=1)
    
    list_of_replacements.append(final_str)

list_of_replacements

['08700621170 150p ',
 ' 150p m',
 ' 150p ',
 ' 10p ',
 ' 150p ',
 ' 150p ',
 ' 50p mmorefrommobile2Bremoved',
 'MobStoreQuiz 10p pm',
 'callcost 150p pmmobilesvary',
 ' 450p w',
 ' 35p ',
 'com1win 150p pmx3age16',
 '1x 150p ',
 ' 10p pm',
 ' 000p es',
 ' 150p pm',
 ' 150p ',
 ' 150p pm',
 ' 150p ',
 ' 150p ',
 ' 150p ',
 '08712400602 450p ',
 ' 150p ',
 ' 150p ',
 ' 150p pm',
 ' 150p ',
 ' 150p pmPOBox10183BhamB64XE',
 ' 150p pmPOBox10183BhamB64XE',
 ' 150p ',
 ' 150p ',
 ' 150p pmsg',
 ' 10p ',
 ' 10p ',
 ' 50p erWKsub',
 ' 50p ',
 ' 25p ',
 ' 150p ',
 ' 150p pm',
 ' 60p ',
 ' 20p ',
 ' 150p ',
 ' 150p ',
 ' 25p ',
 ' 150p ',
 ' 150p pm',
 'gr 8p rizes',
 'com1win 150p pmx3age16subscription',
 ' 150p ',
 ' 150p ',
 ' 10p ',
 ' 2p rice',
 ' 50p ',
 ' 2p ',
 ' 18p ',
 ' 150p ',
 'norm 150p ',
 ' 150p ',
 ' 150p pm',
 ' 100p ercent',
 ' 150p ',
 ' 150p ',
 ' 150p ',
 ' 150p ',
 ' 60p ',
 '08712400602 450p ',
 ' 150p ',
 ' 50p erWKsub',
 ' 150p ',
 ' 5p ',
 ' 150p m',
 'box245c2 150p m'

The following function, when applied to a string, looks for expressions that may be in the `list_of_patterns` and if there is a match, it replaces that expression for the equivelent version (with whispaces separating letters from words, as seen previously), resorting to the `list_of_replacements`.

In [80]:
def replace(string):
    
    new_string = ''
    
    for i, val in enumerate(list_of_patterns):
        if val in string:
            new_string += re.sub(val, list_of_replacements[i], string)
            break
            
    if len(new_string):
        return new_string
    else:
        return string

Applying the `replace` function in the `ts_cleaned_SMS` data set:

In [81]:
ts_cleaned_SMS_2 = ts_cleaned_SMS.apply(replace)

Manually verifying changes made by the function in two random spam messages:

In [82]:
# the one from the previous example
ts_cleaned_SMS[186]

'URGENT  This is the 2nd attempt to contact U U have WON  £ 1000CALL 09071512432 b4 300603t csBCM4235WC1N3XX callcost150ppmmobilesvary  max £ 7  50'

In [83]:
ts_cleaned_SMS_2[186]

'URGENT  This is the 2nd attempt to contact U U have WON  £ 1000CALL 09071512432 b4 300603t csBCM4235WC1N3XX callcost 150p pmmobilesvary  max £ 7  50'

In [84]:
ts_cleaned_SMS[2821]

'Someone has contacted our dating service and entered your phone because they fancy you  To find out who it is call from a landline 09111032124   PoBox12n146tf150p'

In [85]:
ts_cleaned_SMS_2[2821]

'Someone has contacted our dating service and entered your phone because they fancy you  To find out who it is call from a landline 09111032124   PoBox12n146tf 150p '

In [86]:
#`\s+` ensures that if there are two or more joined whitespaces they are converted to just one.
ts_cleaned_SMS_2 = ts_cleaned_SMS_2.str.replace('\s+', ' ', regex=True) 

ts_cleaned_SMS_2.head(30)

0                                                                                                                                                                                                                                                                                                       Yep by the pretty sculpture
1                                                                                                                                                                                                                                                                                       Yes princess Are you going to make me moan 
2                                                                                                                                                                                                                                                                                                        Welp apparently he retired
3                           

The last version of `ts_cleaned_SMS` still has rows which have whitespaces at the beginning or end of the message which can be removed.

In [87]:
ts_cleaned_SMS = ts_cleaned_SMS.str.replace('(\A +| +\Z)', '', regex=True)

Below we confirm that there is no whitespaces at the beginning or end of the row/string. 

In [88]:
cond1 = ts_cleaned_SMS.str.contains(pat='(?:\A +| +\Z)')

ts_cleaned_SMS[cond1]

Series([], Name: SMS, dtype: bool)
Series([], Name: SMS, dtype: bool)


In [89]:
# Lower case for every string.
ts_cleaned_SMS_2 = ts_cleaned_SMS_2.str.lower()

Checking if the changes in `training_set.SMS` were successful:

In [90]:
ts_cleaned_SMS_2.head()

0                                                                         yep by the pretty sculpture
1                                                          yes princess are you going to make me moan
2                                                                          welp apparently he retired
3                                                                                              havent
4    i forgot 2 ask all smth there s a card on da present lei how all want 2 write smth or sign on it
Name: SMS, dtype: object

The next stage is producing the vocabulary (set of unique words), with the following steps:

1. Split messages into columns - one unique string per column (removing most of the whitespaces as well).
2. Concatenating all the columns of strings into a single Series.
3. Dropping Nan-values.
3. Convert Series into a list.
4. Converting list into a set, thus excluding duplicated values/words.
5. Convert set into back into a list.

In [91]:
# 1.
ts_cleaned_SMS_split = ts_cleaned_SMS_2.str.split(' ', expand=True) 

# 2.
ts_cleaned_SMS_split_cat = pd.concat([ts_cleaned_SMS_split.iloc[i, :] for i in range(0, ts_cleaned_SMS_split.shape[0])], ignore_index=True)
# 3.
ts_cleaned_SMS_split_cat = ts_cleaned_SMS_split_cat.dropna()

# 4.
ts_cleaned_SMS_split_cat_to_list = ts_cleaned_SMS_split_cat.to_list()

#5.
vocabulary_set = set(ts_cleaned_SMS_split_cat_to_list)

#6.
vocabulary = list(vocabulary_set)

When checking `vocabulary` below, we see that whitespace still appears as a value, thus it can be removed.

In [92]:
vocabulary[:5]

['', 'impede', 'dogging', '7634', 'image']

In [93]:
for index, el in enumerate(vocabulary):
    if el == '':
        del vocabulary[index]
        
vocabulary[:5]

['impede', 'dogging', '7634', 'image', 'gona']

In [94]:
# Free RAM 2

del ts_cleaned_SMS_split_cat 
del ts_cleaned_SMS_split_cat_to_list
del vocabulary_set

Prior to what is in the instructions we create a splitted version of `ts_cleaned_SMS` that does not expand each string into a single column but compiles the message into a list with strings.

Similarly to `vocabulary`, I delete empty strings, `''`; this time resorting to a custom function - `remove_elements` that is applied to `ts_cleaned_SMS_split_listed`.

In [95]:
def remove_elements(list_x, list_strings):
    """Strings in list_strings are removed from list_x if this later 
    list contains any of those strings.
    """
    
    for index, el in enumerate(list_x):
        if el in list_strings:
            del list_x[index]
    
    return list_x

In [96]:
ts_cleaned_SMS_split_listed = ts_cleaned_SMS_2.str.split(' ') # `expand=False` by default

strings_to_remove = ['']

ts_cleaned_SMS_split_listed_1 = ts_cleaned_SMS_split_listed.copy().apply(lambda x: remove_elements(x, strings_to_remove))

As suggested by the tutorial, first a dictionary is created and then I go through every message and count each word repetition (each word is a key), filling out the dictionary.

In [97]:
# From the tutorial.

# 1.
word_counts_per_sms = {unique_word: [0] * len(ts_cleaned_SMS_split_listed_1) for unique_word in vocabulary}

for index, sms in enumerate(ts_cleaned_SMS_split_listed_1):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [98]:
# 2. Convert dictionary into DataFrame

word_counts_per_sms_df = pd.DataFrame(word_counts_per_sms)

Assessing if the conversion was successful. 

In [99]:
word_counts_per_sms_df.iloc[:2, 10:]

Unnamed: 0,charge,ibn,rite,traditions,outsomewhere,oveable,warner,girlie,uncles,cops,...,sundayish,safety,calls,motorola,09050000928,duffer,luck,medicine,aa,while
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Finally, concatenating `training_set` with `word_counts_per_sms_df` into a new DataFrame - `training_set_2`.

In [100]:
# 3. 

# `sort=False` is required to preserve the order of the columns in a 'first in' fashion.
training_set_2 = pd.concat([training_set, word_counts_per_sms_df], axis=1, sort=False)

In [101]:
training_set_2.iloc[:2, :10]

Unnamed: 0,Label,SMS,impede,dogging,7634,image,gona,yep,finn,road
0,ham,"Yep, by the pretty sculpture",0,0,0,0,0,1,0,0
1,ham,"Yes, princess. Are you going to make me moan?",0,0,0,0,0,0,0,0


In [102]:
# Free RAM 3

del ts_cleaned_SMS_split_listed
del word_counts_per_sms

Calculating the elements within the Naive Bayes algorithm. Starting with:

- $P(Spam)$ and $P(Ham)$.
- $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$

The Laplace smoothing parameter is set to 1: $α=1$.

Probability of Spam and Ham. 

In [103]:
label_counts_ts2 = training_set_2.Label.value_counts(normalize=True)

p_spam = label_counts_ts2.spam

p_ham = label_counts_ts2.ham

In [104]:
p_spam

0.13455931823278763

In [105]:
p_ham

0.8654406817672123

Counting $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$.

In [106]:
# n_vocabulary.

n_vocabulary = len(vocabulary)

n_vocabulary

7763

To sum up the total number of words for spam messages and for non-spam messages - $N_{Spam}$, $N_{Ham}$ respectively, the procedure will be the following:

1. add a column summing up the number of words per message/row.

2. calculate sum of words if row if messages are spam; same for non-spam.

In [107]:
# 1.
training_set_2['sum_words_sms'] = training_set_2.iloc[:, 2:].copy().sum(axis=1)

In [108]:
# 2.

n_spam = training_set_2.loc[training_set_2.Label=='spam', 'sum_words_sms'].sum()

n_ham = training_set_2.loc[training_set_2.Label=='ham', 'sum_words_sms'].sum()

In [109]:
print(f'n_spam: {n_spam}',
     f'\nn_ham: {n_ham}')

n_spam: 15557 
n_ham: 57141


Finally setting $α=1$.

In [110]:
alpha = 1 

In [111]:
# 1. the last colum 'sum_words_sms' is not included in the calculations
# of these Series, so I set `.iloc[:, 2:-1]`.
sms_spam = training_set_2.iloc[:, 2:-1].copy()[training_set_2.Label=='spam']

sms_ham = training_set_2.iloc[:, 2:-1].copy()[training_set_2.Label=='ham']

In [112]:
sms_spam.head(3)

Unnamed: 0,impede,dogging,7634,image,gona,yep,finn,road,wish,2day,...,sundayish,safety,calls,motorola,09050000928,duffer,luck,medicine,aa,while
16,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
56,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [113]:
# 2.
sms_spam_sum = sms_spam.sum().transpose()

sms_ham_sum = sms_ham.sum().transpose()

In [114]:
sms_spam_sum.head()

impede     0
dogging    4
7634       1
image      0
gona       0
dtype: int64

In [115]:
# 3.

p_wi_given_spam_dict = {}

p_wi_given_ham_dict = {}

# P(w_i|Spam)
for i in range(0, sms_spam_sum.size):
    index = sms_spam_sum.index[i]
    dividend = sms_spam_sum[index] + alpha
    divisor = n_spam + (alpha*n_vocabulary)
    p_wi_given_spam_dict[index] =  dividend / divisor

    
# P(w_i|Ham)
for i in range(0, sms_ham_sum.size):
    index = sms_ham_sum.index[i]
    dividend = sms_ham_sum[index] + alpha
    divisor = n_ham + (alpha*n_vocabulary)
    p_wi_given_ham_dict[index] =  dividend / divisor

Checking first 20 items in `p_wi_given_spam_dict` and `p_wi_given_ham_dict`:

In [116]:
list(p_wi_given_spam_dict.items())[:10] 

[('impede', 4.288164665523156e-05),
 ('dogging', 0.0002144082332761578),
 ('7634', 8.576329331046312e-05),
 ('image', 4.288164665523156e-05),
 ('gona', 4.288164665523156e-05),
 ('yep', 4.288164665523156e-05),
 ('finn', 4.288164665523156e-05),
 ('road', 4.288164665523156e-05),
 ('wish', 8.576329331046312e-05),
 ('2day', 0.00017152658662092623)]

In [117]:
list(p_wi_given_ham_dict.items())[:10] 

[('impede', 3.081474177246395e-05),
 ('dogging', 1.5407370886231974e-05),
 ('7634', 1.5407370886231974e-05),
 ('image', 3.081474177246395e-05),
 ('gona', 4.622211265869592e-05),
 ('yep', 0.00015407370886231974),
 ('finn', 3.081474177246395e-05),
 ('road', 0.00013866633797608775),
 ('wish', 0.0007549611734253667),
 ('2day', 9.244422531739184e-05)]

In [118]:
# Free RAM 4

del sms_spam
del sms_ham

Now that $P(w_i|Spam)$ and $P(w_i|Ham)$ is calculated throughout the entire span of messages contained in the training set, it is possible to finally build the spam filter, by calculating and comparing $P(Spam|w_1, w_2, ..., w_n)$ with $P(Ham|w_1, w_2, ..., w_n)$.

In [119]:
def classify(message):
    """Takes in a string - a cellphone message (SMS), and returns the probability of Spam given the input message,
    the probability of non-spam (ham) given the input message and classifies whether
    the message is spam, not spam (ham), or if a human is required to classify the message.
    """

    message = re.sub('[^A-Za-z0-9\s£€\$]', ' ', message) # still a string
    message = re.sub('£', ' £ ', message)
    message = re.sub('€', ' € ', message)
    message = re.sub('\$', ' $ ', message)
    message = replace(message)
    message = re.sub('\s+', ' ', message)
#     message = message.lower() # still a string
    message = message.split() # now a list of strings

    
    # Calculating P(Spam|w_1, w_2, ..., w_n) with P(Ham|w_1, w_2, ..., w_n).
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    # Note: if `word` is not in the spam or in the non-spam DataFrames the loop does nothing by default.
    for word in message:
        
        if word in p_wi_given_spam_dict.keys():
            p_spam_given_message *= p_wi_given_spam_dict[word]
            
        if word in p_wi_given_ham_dict.keys():
            p_ham_given_message *= p_wi_given_ham_dict[word]
        

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
        
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
        
    else:
        print('Equal proabilities, have a human classify this!')

Testing the filter:

In [120]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 6.443283729086874e-20
P(Ham|message): 4.02491069038485e-20
Label: Spam


In [121]:
classify('Sounds good, Tom, then see u there')

P(Spam|message): 5.965566857521189e-17
P(Ham|message): 4.369378838594486e-13
Label: Ham


In [122]:
classify("Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50")

P(Spam|message): 1.5715754774554154e-65
P(Ham|message): 8.431834489246886e-65
Label: Ham


Instead of going for the suggested method in the instructions we'll add two new columns to `testing_set`: first, `Test` displays the output returned by `classify_test_set()` for each row/message; the second, `Correct`, is '1' if the filter returns the right classification, i.e. the same classification in the `Label` column (which previously classifies the message has 'spam' or 'ham'), or '0' otherwise.

In [123]:
def classify_test_set(message):
    """Takes in a string - a cellphone message (SMS), and returns a classification of whether 
    the message is spam, not spam (ham), or if a human is required to classify the message.
    """

    message = re.sub('[^A-Za-z0-9\s£€\$]', ' ', message) # still a string
    message = re.sub('£', ' £ ', message)
    message = re.sub('€', ' € ', message)
    message = re.sub('\$', ' $ ', message)
    message = replace(message)
    message = re.sub('\s+', ' ', message)
    message = message.lower() # still a string
    message = message.split() # now a list of strings

    
    # Calculating P(Spam|w_1, w_2, ..., w_n) with P(Ham|w_1, w_2, ..., w_n).
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    # Note: if `word` is not in the spam or in the non-spam DataFrames the loop does nothing by default.
    for word in message:
        
        if word in p_wi_given_spam_dict.keys():
            p_spam_given_message *= p_wi_given_spam_dict[word]
            
        if word in p_wi_given_ham_dict.keys():
            p_ham_given_message *= p_wi_given_ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'Requires human classification.'

In [124]:
testing_set['Test'] = testing_set['SMS'].apply(classify_test_set)


# Assigns True if condition is met and multplying by one converts True in '1' and False in '0'.
testing_set['Correct'] = (testing_set['Label'] == testing_set['Test'])*1

In [125]:
testing_set.head()

Unnamed: 0,Label,SMS,Test,Correct
4458,ham,Later i guess. I needa do mcat study too.,ham,1
4459,ham,But i haf enuff space got like 4 mb...,ham,1
4460,spam,Had your mobile 10 mths? Update to latest Orange camera/video phones for FREE. Save £s with Free texts/weekend calls. Text YES for a callback orno to opt out,spam,1
4461,ham,All sounds good. Fingers . Makes it difficult to type,ham,1
4462,ham,"All done, all handed in. Don't know if mega shop in asda counts as celebration but thats what i'm doing!",ham,1


Percentage of well classified messages when using the `classify_test_set()` function/filter:

In [126]:
test_accuracy = (testing_set['Correct'].sum() / testing_set.shape[0])*100

test_accuracy = test_accuracy.round(2)

print(f'When applied to the messages in the `testing_set` ({testing_set.shape[0]} entries) the test accuracy was aprox. {test_accuracy}%.')

When applied to the messages in the `training_set` (1114 entries) the test accuracy was aprox. 98.83%.


The result above is case sensitive, i.e. the algorithm does not distinguish capitalized from lower case words. In this case , it is the same result an in the 'practice 2 version', where the only modification from the original is the non-removal of currency symbols.

Lowercase off (case insensitive): When applied to the messages in the `training_set` (1114 entries) the test accuracy was aprox. 98.74%. In this case, the result is slightly better <1% then the analogous version of the 'practice 2' version.

## Extra Tasks

### 1. Isolate the 14 messages that were classified incorrectly and try to figure out why the algorithm reached the wrong conclusion.

In [127]:
cond_incorrect = testing_set.Correct == 0

incorrect = testing_set[cond_incorrect].copy().reset_index(drop=True)

In [128]:
incorrect.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Label    13 non-null     object
 1   SMS      13 non-null     object
 2   Test     13 non-null     object
 3   Correct  13 non-null     int32 
dtypes: int32(1), object(3)
memory usage: 272.0+ bytes


In [129]:
 pd.options.display.max_colwidth = 500

incorrect[['Label', 'SMS']]

Unnamed: 0,Label,SMS
0,spam,Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net
1,ham,Unlimited texts. Limited minutes.
2,ham,26th OF JULY
3,ham,Nokia phone is lovly..
4,ham,"A Boy loved a gal. He propsd bt she didnt mind. He gv lv lttrs, Bt her frnds threw thm. Again d boy decided 2 aproach d gal , dt time a truck was speeding towards d gal. Wn it was about 2 hit d girl,d boy ran like hell n saved her. She asked 'hw cn u run so fast?' D boy replied ""Boost is d secret of my energy"" n instantly d girl shouted ""our energy"" n Thy lived happily 2gthr drinking boost evrydy Moral of d story:- I hv free msgs:D;): gud ni8"
5,ham,No calls..messages..missed calls
6,ham,"We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us"
7,ham,Just taste fish curry :-P
8,spam,"Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50"
9,spam,"Hi babe its Chloe, how r u? I was smashed on saturday night, it was great! How was your weekend? U been missing me? SP visionsms.com Text stop to stop 150p/text"


In [130]:
incorrect_split = incorrect.SMS.str.replace('\W', ' ', regex=True)

In [131]:
incorrect_split 

0                                                                                                                                                                                                                                                                                                          Not heard from U4 a while  Call me now am here all night with just my knickers on  Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4 net
1                                                                                                                                                                                                                                                                                                                                                                                                                                  Unlimited texts  Limited minutes 
2                                                                                             