# Building A Spam Filter With Naive Bayes Algorithm

## Introduction

This project builds a spam filter that uses a multinomial Naive Bayes algorithm to distinguish spam messages from regular ones. The filter is trained and tested on a data set of 5572 messages that humans have previously labelled as spam or not spam.

The spam filter is considered to be successful if it can filter out 80% of the spam from a test set.

The working data set was compiled by Tiago A. Almeida and José María Gómez Hidalgo, and is publicly available at the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

In [1]:
import numpy as np
import pandas as pd
import re

## The Data Set and the Algorithm's Structure

The data set.

In [3]:
sms_spam_full = pd.read_csv('SMSSpamCollection.txt',
                   sep='\t',
                   names=['Label', 'SMS'])

Basic info.

In [4]:
sms_spam_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


The `Label` column is binary, the values are:

- `spam`
- `ham` (not spam)

The first five messages in the data set are shown below, to give an idea of the general writing style (expected to be informal) and of how it might be taken into account when designing the spam filter.

In [5]:
pd.set_option('display.max_colwidth', None)

sms_spam_full.head()

|   | Label | SMS |
|---|-------|-----|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


Below, we can see that only approx. 13.4% of the messages are spam.

In [6]:
count_label = sms_spam_full.Label.value_counts(normalize=True).round(3)*100

count_label = count_label.rename('ham vs spam (%)')

count_label

ham     86.6
spam    13.4
Name: ham vs spam (%), dtype: float64

### The process to build the spam filter will take 3 steps:


1. Write a script that takes a pre-labelled data set of messages (the training set) and learns from it how humans classify messages.


2. Use the knowledge acquired in 1. to estimate probabilities for new messages — probabilities for spam and non-spam.


3. Classify a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam; otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).

### Step  1: creating a training set and a test set 

1. Randomize the original data set - `sms_spam_full`.


2. Split the randomized data set into a training set (80%) and a testing set (the remaining 20%).


3. Compare the `Label` value frequency distribution of the entire randomized DataFrame with the previously made subsets.

In [7]:
# 1. Randomize.
random_sms_spam = sms_spam_full.sample(n=None, frac=1, random_state=1).reset_index(drop=True)

From earlier on we know that `random_sms_spam` has 5572 entries (0 to 5571).

In [8]:
eighty_perc = random_sms_spam.shape[0] * 0.8

print(f'Total number of messages/rows in random_sms_spam: {random_sms_spam.shape[0]}', 
      f'\n80% of total messages/rows in random_sms_spam: {round(eighty_perc, 0)}')

Total number of messages/rows in random_sms_spam: 5572 
80% of total messages/rows in random_sms_spam: 4458.0


Based on the information above, the training set takes rows 0 to 4458, approximately 80% of the rows in `random_sms_spam`, and the testing set takes rows 4458 onwards (1114 rows).

In [9]:
# 2. Split the randomized data set.
# Note: as written, row 4458 is included in both slices.
training_set = random_sms_spam.iloc[:4458+1, :].copy()

testing_set = random_sms_spam.iloc[4458:, :].copy()
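
The two slices above share row 4458. A minimal sketch of a strictly disjoint 80/20 split is shown below on a small made-up DataFrame; `toy`, `shuffled` and `split_at` are illustrative names that do not appear in the notebook, and the recorded outputs in this notebook were not produced with this variant.

```python
import pandas as pd

# Made-up data standing in for the SMS collection.
toy = pd.DataFrame({'Label': ['ham', 'spam'] * 5,
                    'SMS': [f'message {i}' for i in range(10)]})

# Shuffle, then reset the index so that positions and labels coincide.
shuffled = toy.sample(frac=1, random_state=1).reset_index(drop=True)

# Position of the first testing row: 80% of the rows go to training.
split_at = round(len(shuffled) * 0.8)

toy_training = shuffled.iloc[:split_at].copy()   # rows 0 .. split_at - 1
toy_testing = shuffled.iloc[split_at:].copy()    # rows split_at .. end

# No row label should appear in both subsets.
shared_rows = toy_training.index.intersection(toy_testing.index)
print(len(toy_training), len(toy_testing), len(shared_rows))
```

Using the same boundary for both slices guarantees that every row lands in exactly one subset.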

Finally, comparing the distribution of values in the `Label` column across DataFrames.

In [10]:
# 3. Compare spam value across DataFrames.

# Training set.
count_label_training = training_set.Label.value_counts(normalize=True).round(3)*100

count_label_training = count_label_training.rename('ham vs spam (%)')

# Testing set.
count_label_testing = testing_set.Label.value_counts(normalize=True).round(3)*100

count_label_testing = count_label_testing.rename('ham vs spam (%)')


# Combining all label percentage counts for comparison.
compare_label = pd.DataFrame({'random_sms_spam': count_label,
                              'count_label_training': count_label_training,
                              'count_label_testing': count_label_testing})

compare_label

|      | random_sms_spam | count_label_training | count_label_testing |
|------|-----------------|----------------------|---------------------|
| ham  | 86.6            | 86.5                 | 86.8                |
| spam | 13.4            | 13.5                 | 13.2                |


As the table above shows, the label distribution is very similar across all three sets, so both the training set and the testing set can be treated as representative of the original data set.

In [11]:
# Saving RAM 1
del sms_spam_full
del random_sms_spam

### Step 2: use the training set to teach the algorithm to classify new messages
---
When a new message arrives, the Naive Bayes algorithm classifies it by comparing the probability that the message is ham with the probability that it is spam ("$Spam^C$" and "$Ham$" are used interchangeably here). For example, the message is classified as ham when:


$P(Ham|w_1, w_2, ..., w_n) > P(Spam|w_1, w_2, ..., w_n)$

Building a Naive Bayes algorithm entails two further intertwined steps. The final goal is to take a new message as input, then compute and compare the two quantities below:

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}

\begin{equation}
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}

- If $P(Ham|w_1, w_2, ..., w_n) > P(Spam|w_1, w_2, ..., w_n)$, then the message is classified as ham.


- If $P(Ham|w_1, w_2, ..., w_n) < P(Spam|w_1, w_2, ..., w_n)$, then the message is classified as spam.


- If $P(Ham|w_1, w_2, ..., w_n) = P(Spam|w_1, w_2, ..., w_n)$, then the algorithm must request human help.

To calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ inside the formulas above, we use the two equations below, which read: 'the probability of the word $w_i$ occurring, given that the message is spam, is...' and 'the probability of the (same) word $w_i$ occurring, given that the message is not spam, is...', respectively.

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

This is where the training set comes in: we compute $P(w_i|Spam)$ and $P(w_i|Ham)$ from counts taken from this data set, and then use them to calculate $P(Spam | w_1,w_2, ..., w_n)$ and $P(Ham | w_1,w_2, ..., w_n)$. In other words, Step 2 of the process creates a 'blueprint' data set that holds
$P(w_i|Spam)$ and $P(w_i|Ham)$ for every word of every message in the training data set, from which we look up the probabilities of the words contained in the message we want to classify.

A mock example: 

We want to calculate the probability that this message is spam: 'go crazy'.

The probabilities in the 'blueprints' can be calculated because the word 'go' can be found in 667 normal messages and in 39 spam messages, while 'crazy' can be found in 8 normal messages and in 1 spam message (messages within the training set). 

The blueprint data sets can then be filled out:

- Probability of each targeted word given spam:
    - $P('go'|Spam) = a$
    - $P('crazy'|Spam) = b$


- Probability of each targeted word given ham:
    - $P('go'|Ham) = c$
    - $P('crazy'|Ham) = d$


These allow us to calculate:

$P(Spam | w_1, w_2) \propto P(Spam) \cdot a \cdot b$

$P(Ham | w_1, w_2) \propto P(Ham) \cdot c \cdot d$

With this information, we can compare the two values and classify 'go crazy' as spam or not spam, whichever is more likely.
Note also that if a test message contains a word that is not in the blueprint, because it did not appear in the training set, that word is ignored by the filter, since we have not specified a way to calculate a probability for it.
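
The comparison in the mock example can be sketched in a few lines of Python. All numbers below are hypothetical placeholders for the priors and for $a$, $b$, $c$ and $d$; they are not values computed from the training set.

```python
# Hypothetical priors P(Spam) and P(Ham).
p_spam_prior, p_ham_prior = 0.13, 0.87

# Hypothetical word probabilities (a, b, c, d in the text above).
a, b = 0.004, 0.0002   # P('go'|Spam), P('crazy'|Spam)
c, d = 0.010, 0.0001   # P('go'|Ham),  P('crazy'|Ham)

# Scores proportional to P(Spam|'go','crazy') and P(Ham|'go','crazy').
score_spam = p_spam_prior * a * b
score_ham = p_ham_prior * c * d

if score_ham > score_spam:
    label = 'ham'
elif score_spam > score_ham:
    label = 'spam'
else:
    label = 'needs human classification'

print(label)
```

Because only the comparison matters, the shared normalizing constant of the two posteriors never needs to be computed.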

Let's also summarize what the terms in the equations above mean. Recall that all of these elements will be taken from the training set.

\begin{align}
&N_{w_i|Spam} = \text{the number of times the word } w_i \text{ occurs in spam messages} \\
&N_{w_i|Spam^C} = \text{the number of times the word } w_i \text{ occurs in non-spam messages} \\
\\
&N_{Spam} = \text{total number of words in spam messages} \\
&N_{Spam^C} = \text{total number of words in non-spam messages} \\
\\
&N_{Vocabulary} = \text{total number of words in the vocabulary} \\
&\alpha = 1 \ \ \ \ (\alpha \text{ is a smoothing parameter})
\end{align}


#### Message cleaning.


Prior to building the 'blueprint' for the probabilities we must clean the messages in the `training_set`:

- eliminate punctuation.
- convert every word to lower case.
- remove whitespace at the beginning and end of the message.

In [12]:
# Eliminating punctuation: every non-word character becomes a space.
cleaned = training_set.SMS.copy().str.replace(r'\W', ' ', regex=True)

# ` {2,}` ensures that runs of two or more spaces are converted to just one.
cleaned = cleaned.str.replace(r' {2,}', ' ', regex=True)

# Remove whitespace at the beginning and end of each message.
cleaned = cleaned.str.replace(r'(\A +| +\Z)', '', regex=True)

# Lower case for every string.
cleaned = cleaned.str.lower()

Checking changes in `cleaned`:

In [13]:
cleaned.head()

0                                                                             yep by the pretty sculpture
1                                                              yes princess are you going to make me moan
2                                                                              welp apparently he retired
3                                                                                                  havent
4    i forgot 2 ask ü all smth there s a card on da present lei how ü all want 2 write smth or sign on it
Name: SMS, dtype: object
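
The same cleaning steps can be tried on a single string with the standard `re` module. This is a minimal sketch; `clean_message` is an illustrative helper name that the notebook itself does not define.

```python
import re

def clean_message(message):
    """Replace non-word characters with spaces, collapse repeated spaces,
    strip leading/trailing spaces and convert to lower case."""
    message = re.sub(r'\W', ' ', message)
    message = re.sub(r' {2,}', ' ', message)
    return message.strip(' ').lower()

print(clean_message('Yep, by the pretty sculpture'))
# A message made only of punctuation becomes an empty string.
print(repr(clean_message(':) ')))
```

The second call shows why a few messages end up empty after cleaning.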

#### Producing a 'vocabulary' .


The next stage entails the production of a vocabulary (array of unique words taken from every message in `cleaned`), with the following steps:

1. split each message into a list of words.
2. create an empty set of unique words.
3. add words to the set.
4. convert the set back into a (sorted) list.

In [14]:
# 1.
cleaned_split = cleaned.str.split(' ', expand=False) 

# 2.
vocabulary = set()

# 3.
for sms in cleaned_split:
    for word in sms:
        if word:  # skip the empty strings left by empty messages
            vocabulary.add(word)

# 4.
vocabulary = sorted(vocabulary)
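
For comparison, the four steps can also be written with a set comprehension. The sketch below runs on three made-up cleaned messages; the `toy_` names are illustrative only.

```python
# Made-up cleaned messages; the last one is empty after cleaning.
toy_messages = ['yep by the pretty sculpture', 'ok by me', '']

# 1. Split each message into a list of words.
toy_split = [message.split(' ') for message in toy_messages]

# 2-4. Collect the unique non-empty words and sort them.
toy_vocabulary = sorted({word for sms in toy_split for word in sms if word})

print(toy_vocabulary)
```

The `if word` filter plays the same role as in the loop above: it keeps the empty string out of the vocabulary.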

#### Producing a word counter DataFrame.


To continue organizing the elements required to compute $P(w_i|Spam)$ and $P(w_i|Ham)$ for each word in the _vocabulary_, we build a data set that records how many times each word occurs in each message. We first compile this information in a dictionary and then convert it into a DataFrame, for ease of reading and access.



Both the dictionary and the DataFrame present the data similarly: a word is represented by a key or a column, respectively, and its frequency in each message can be read by position or by row (in the dictionary, each message corresponds to an index position in the list stored under every key/word). 


Before that, we refine the cleaning of the messages a little further:
- Rows 1098 and 2700 are messages that contain only emoticons, so they are empty after cleaning and can be dropped from `cleaned_split`.

In [15]:
for index, value in enumerate(cleaned_split):
    if '' in value:
        print(index, value)

print('\nMessages containing only Emojis:\n')       
print(training_set.iloc[1098,:],'\n')
print(training_set.iloc[2700,:])

cleaned_split = cleaned_split.drop([1098, 2700])

1098 ['']
2700 ['']

Messages containing only Emojis:

Label    ham
SMS      :) 
Name: 1098, dtype: object 

Label        ham
SMS      :-) :-)
Name: 2700, dtype: object


1. Filling out the dictionary.

In [16]:
word_counts_per_sms = {unique_word: [0] * len(cleaned_split) for unique_word in vocabulary}

for index, sms in enumerate(cleaned_split):
    for word in sms:
        word_counts_per_sms[word][index] += 1

What the dictionary looks like and how to read it:

- We look at five random keys/unique words in the dictionary. Each index position in these lists represents one message, in the order in which the messages appear in `cleaned_split`. Here we limit the view to the first 10 messages. In this sample, none of these messages contains any of the following random words taken from the `vocabulary`.

In [17]:
for key in ['dog', 'answer', 'wow', 'drink', 'night']:
    print(f'Word: "{key}". Counts: {word_counts_per_sms[key][:10]}')

Word: "dog". Counts: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Word: "answer". Counts: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Word: "wow". Counts: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Word: "drink". Counts: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Word: "night". Counts: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
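
To see non-zero counts, the same dictionary layout can be built for a tiny made-up corpus. This sketch uses `collections.Counter` to count the words of each message; the `toy_` names are illustrative and not part of the notebook.

```python
from collections import Counter

# Three made-up messages, already cleaned and split.
toy_split = [['free', 'entry', 'free'], ['ok', 'lar'], ['free', 'ok']]
toy_vocabulary = sorted({word for sms in toy_split for word in sms})

# One list per word, one position per message.
toy_counts = {word: [0] * len(toy_split) for word in toy_vocabulary}

for position, sms in enumerate(toy_split):
    for word, n in Counter(sms).items():
        toy_counts[word][position] = n

print(toy_counts['free'])
print(toy_counts['ok'])
```

Reading `toy_counts['free']` from left to right gives the number of times 'free' occurs in the first, second and third message.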


In [18]:
# 2. Convert dictionary into DataFrame.
word_counts_per_sms_df = pd.DataFrame(word_counts_per_sms)    

Assessing if the conversion was successful: using the same example above, we can see that words are now column labels, whilst each message is represented by the (row) index.


In [19]:
word_counts_per_sms_df.loc[:9, ['dog', 'answer', 'wow', 'drink', 'night']]

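|   | dog | answer | wow | drink | night |
|---|-----|--------|-----|-------|-------|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 |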


Finally, `training_set` and `word_counts_per_sms_df` are combined into a new DataFrame, `training_set_2`, to make it easier to look up the original messages.

In [20]:
ordered_cols = list(training_set.columns) + list(word_counts_per_sms_df.columns)

training_set_2 = word_counts_per_sms_df.merge(training_set, left_index=True, right_index=True)

training_set_2  = training_set_2.reindex(columns=ordered_cols)
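
Because `merge(..., left_index=True, right_index=True)` pairs rows by index label, it is worth confirming that both frames still share the same index after rows such as 1098 and 2700 have been dropped. The sketch below uses made-up data (all `toy_` names are illustrative) to show how a positional index and the original index can drift apart, and one way to realign them.

```python
import pandas as pd

toy_training = pd.DataFrame({'Label': ['ham', 'spam', 'ham', 'spam'],
                             'SMS': ['hi there', 'win cash', ':)', 'free prize']})

# Word counts built by position for the three non-empty messages
# ('hi there', 'win cash', 'free prize'); they get a fresh 0..2 index.
toy_counts = pd.DataFrame({'cash': [0, 1, 0], 'free': [0, 0, 1]})

# Merging on the index pairs label 2 of the counts with label 2 of the
# original frame, which is the dropped ':)' message.
merged_by_label = toy_counts.merge(toy_training, left_index=True, right_index=True)

# Dropping the same row from the messages and resetting the index realigns them.
toy_kept = toy_training.drop([2]).reset_index(drop=True)
aligned = toy_counts.merge(toy_kept, left_index=True, right_index=True)

print(merged_by_label.loc[2, 'SMS'])
print(aligned.loc[2, 'SMS'])
```

A quick check of this kind on a row located after a dropped one confirms whether labels and word counts still belong to the same message.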

First two rows of `training_set_2`:

In [21]:
training_set_2.iloc[:2, :10]

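|   | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 |
|---|-------|-----|---|----|-----|--------|--------------|------|-------------|----|
| 0 | ham | Yep, by the pretty sculpture | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | Yes, princess. Are you going to make me moan? | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |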


Assessing, with two examples, whether the rows in `training_set_2` are well aligned with the corresponding rows in `word_counts_per_sms_df`.

Example 1: row 0

In [22]:
cond_row_0 = training_set_2.iloc[0, 2:] > 0

print(f'training_set_2["SMS"], row 0: {training_set_2.iloc[0, 1]}')
print('\n')
print(f'training_set_2, row 0 - columns that are ">0":\n\n{training_set_2.iloc[0, 2:][cond_row_0]}')

training_set_2["SMS"], row 0: Yep, by the pretty sculpture


training_set_2, row 0 - columns that are ">0":

by           1
pretty       1
sculpture    1
the          1
yep          1
Name: 0, dtype: object


Example 2: row 1

In [23]:
cond_row_1 = training_set_2.iloc[1, 2:] > 0

print(training_set_2.iloc[1, 1])
print('\n')
print(training_set_2.iloc[1, 2:][cond_row_1])

Yes, princess. Are you going to make me moan?


are         1
going       1
make        1
me          1
moan        1
princess    1
to          1
yes         1
you         1
Name: 1, dtype: object


In [24]:
# Free RAM 2
del word_counts_per_sms

Calculating the elements within the Naive Bayes algorithm:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}


\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}


Starting with:

- $P(Spam)$ and $P(Ham)$.
- $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$

The Laplace smoothing parameter is set to 1: $\alpha = 1$.

The probabilities of spam and ham are simply the proportions of each label in the total number of messages.

In [25]:
label_counts_ts2 = training_set_2.Label.value_counts(normalize=True)

p_spam = label_counts_ts2.spam

p_ham = label_counts_ts2.ham

print(f'P(Spam) = {p_spam}')
print(f'P(Ham) = {p_ham}')

P(Spam) = 0.13461969934933812
P(Ham) = 0.8653803006506618


Counting $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$.

- $N_{Vocabulary}$ is given by the number of words in the `vocabulary`.

In [26]:
n_vocabulary = len(vocabulary)

print(f'N_vocabulary = {n_vocabulary}')

N_vocabulary = 7785


To sum up the total number of words for spam messages and for non-spam messages - $N_{Spam}$, $N_{Ham}$ respectively, the procedure will be the following:

1. create two DataFrames that only contain either spam or non-spam messages.

2. in the same line of code, sum up all values by row (creating a Series of summed values) and then sum up all those values.

In [27]:
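# Columns 0 and 1 are `Label` and `SMS`. Note that the `2:-1` slice also leaves out
# the last word column, so these frames have one column fewer than the vocabulary.
sms_spam = training_set_2.iloc[:, 2:-1].copy()[training_set_2.Label=='spam']

sms_ham = training_set_2.iloc[:, 2:-1].copy()[training_set_2.Label=='ham']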

n_spam = sms_spam.sum(axis=1).sum()
n_ham = sms_ham.sum(axis=1).sum()

print(f'n_spam: {n_spam}',
     f'\nn_ham: {n_ham}')

n_spam: 11207 
n_ham: 61228


The last elements of $P(w_i|Spam)$ and $P(w_i|Ham)$ that have yet to be calculated are:

\begin{align}
&N_{w_i|Spam} = \text{the number of times the word } w_i \text{ occurs in spam messages} \\
&N_{w_i|Spam^C} = \text{the number of times the word } w_i \text{ occurs in non-spam messages} \\
\\
\end{align}


In order to find these values we can resort again to `sms_spam` and `sms_ham`.

In [28]:
sms_spam_sum = sms_spam.sum().transpose()

sms_ham_sum = sms_ham.sum().transpose()

In [29]:
sms_spam_sum 

0                1
00               7
000              6
000pes           0
008704050406     1
                ..
zyada            0
é                0
ú1               1
ü               15
〨ud              0
Length: 7784, dtype: int64

Now we can locate how many times a given word appears in either spam or ham messages. An example: find those values for the word 'crazy'. We can see below that this word appears 9 times in normal messages and twice in spam messages (we count repeats).

In [30]:
sms_spam_sum['crazy']

2

In [31]:
sms_ham_sum['crazy']

9
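
Plugging the counts printed above into the smoothed formula gives a worked instance for the word 'crazy'. The numbers are copied from the outputs shown earlier in this notebook; the variable names ending in `_crazy_` are illustrative.

```python
alpha = 1

# Values taken from the outputs printed earlier in the notebook.
n_spam, n_ham, n_vocabulary = 11207, 61228, 7785
n_crazy_spam, n_crazy_ham = 2, 9

# P('crazy'|Spam) = (2 + 1) / (11207 + 1 * 7785) = 3 / 18992
p_crazy_given_spam = (n_crazy_spam + alpha) / (n_spam + alpha * n_vocabulary)

# P('crazy'|Ham) = (9 + 1) / (61228 + 1 * 7785) = 10 / 69013
p_crazy_given_ham = (n_crazy_ham + alpha) / (n_ham + alpha * n_vocabulary)

print(p_crazy_given_spam, p_crazy_given_ham)
```

Although 'crazy' appears more often in ham messages in absolute terms, the much larger ham denominator makes its conditional probability slightly smaller than in spam.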

### Arriving at the 'blueprints' 

The last step before building the final spam filter is to fill out two dictionaries, that store $P(w_i|Spam)$ and $P(w_i|Ham)$ for each word in the _vocabulary_, based on the elements already collected:


- `n_spam` and `n_ham`.
- `n_vocabulary`.
- $N_{w_i|Spam}$ and $N_{w_i|Ham}$.
- `alpha`.

In [32]:
p_wi_given_spam_dict = {}

p_wi_given_ham_dict = {}

alpha = 1

# The divisors do not depend on the word, so they are computed once.
spam_divisor = n_spam + (alpha*n_vocabulary)
ham_divisor = n_ham + (alpha*n_vocabulary)

# P(w_i|Spam)
for word, count in sms_spam_sum.items():
    p_wi_given_spam_dict[word] = (count + alpha) / spam_divisor

    
# P(w_i|Ham)
for word, count in sms_ham_sum.items():
    p_wi_given_ham_dict[word] = (count + alpha) / ham_divisor
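
The same calculation can also be expressed as a single vectorized operation on the word-count Series. The sketch below uses a made-up three-word vocabulary (the `toy_` names are illustrative) and also checks a property of Laplace smoothing: when $N_{Vocabulary}$ equals the number of words being counted, the smoothed probabilities add up to 1.

```python
import pandas as pd

alpha = 1

# Made-up spam word counts for a three-word vocabulary.
toy_spam_counts = pd.Series({'free': 3, 'win': 1, 'hello': 0})

toy_n_spam = int(toy_spam_counts.sum())      # total words in spam messages
toy_n_vocabulary = len(toy_spam_counts)      # number of unique words

# Smoothed P(w_i|Spam) for every word at once, then converted to a dictionary.
toy_p_wi_given_spam = (
    (toy_spam_counts + alpha) / (toy_n_spam + alpha * toy_n_vocabulary)
).to_dict()

print(toy_p_wi_given_spam)
print(sum(toy_p_wi_given_spam.values()))
```

A word that never occurs in spam ('hello' here) still receives a small non-zero probability, which is the purpose of the smoothing term.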

Checking first 5 items in `p_wi_given_spam_dict` and `p_wi_given_ham_dict`.

In [33]:
list(p_wi_given_spam_dict.items())[:5] 

[('0', 0.00010530749789385004),
 ('00', 0.00042122999157540015),
 ('000', 0.00036857624262847516),
 ('000pes', 5.265374894692502e-05),
 ('008704050406', 0.00010530749789385004)]

In [34]:
list(p_wi_given_ham_dict.items())[:5] 

[('0', 4.3470070856215496e-05),
 ('00', 4.3470070856215496e-05),
 ('000', 0.00028980047237477),
 ('000pes', 2.8980047237477e-05),
 ('008704050406', 1.44900236187385e-05)]

In [35]:
# Free RAM 3
del sms_spam
del sms_ham

## Putting All The Parts Together - Assembling The Spam Filter And Assessing Its Effectiveness

---

Now that $P(w_i|Spam)$ and $P(w_i|Ham)$ have been calculated throughout the entire span of messages contained in the training set, it is possible to finally build the spam filter by calculating and comparing $P(Spam|w_1, w_2, ..., w_n)$ with $P(Ham|w_1, w_2, ..., w_n)$. If by chance the values calculated for these probabilities are equal, no classification is made, and the function returns a string message asking for a human classification.

In [36]:
def classify_test_set(message):
    """Takes in a string - a cellphone message (SMS), and returns a classification of whether 
    the message is spam, not spam (ham), or if a human is required to classify the message.
    """

    message = re.sub(r'\W', ' ', message) # still a string
    message = message.lower() # still a string
    message = message.split() # now a list of strings

    
    # Start from the priors; the products below are proportional to
    # P(Spam|w_1, w_2, ..., w_n) and P(Ham|w_1, w_2, ..., w_n).
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    # Note: a `word` that is not in the spam or ham dictionaries is skipped.
    for word in message:
        
        if word in p_wi_given_spam_dict:
            p_spam_given_message *= p_wi_given_spam_dict[word]
            
        if word in p_wi_given_ham_dict:
            p_ham_given_message *= p_wi_given_ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'Requires human classification.'
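
Multiplying many small probabilities can underflow to zero for long messages, which would turn the comparison into a tie. A common variant, sketched here as an assumption rather than as the notebook's method, sums logarithms instead. The priors and dictionaries below are hypothetical toy values, and `classify_log` is an illustrative name.

```python
import math
import re

# Hypothetical toy values standing in for the priors and the two dictionaries.
toy_p_spam, toy_p_ham = 0.2, 0.8
toy_p_wi_given_spam = {'free': 0.05, 'prize': 0.04, 'hello': 0.001}
toy_p_wi_given_ham = {'free': 0.005, 'prize': 0.001, 'hello': 0.03}

def classify_log(message):
    """Classify a message by comparing sums of log-probabilities."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)

    for word in words:
        # Words missing from a dictionary are skipped, as in the original filter.
        if word in toy_p_wi_given_spam:
            log_spam += math.log(toy_p_wi_given_spam[word])
        if word in toy_p_wi_given_ham:
            log_ham += math.log(toy_p_wi_given_ham[word])

    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'Requires human classification.'

print(classify_log('Free prize!!!'))
print(classify_log('hello'))
print(classify_log('free ' * 1000))
```

The logarithm is monotonic, so the ordering of the two scores is the same as with the direct products whenever those products do not underflow.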

Now we are all set to apply the filter to every message in the `testing_set`. We record whether the filter classified each message correctly by comparing its output with the classification in the `Label` column. We store these results in a column named `Correct`, where '1' means correctly classified and '0' wrongly classified. This makes it easy to calculate the accuracy rate of the filter.


In [37]:
testing_set['Test'] = testing_set['SMS'].apply(classify_test_set)


# Assigns True if the condition is met; multiplying by one converts True to 1 and False to 0.
testing_set['Correct'] = (testing_set['Label'] == testing_set['Test'])*1

In [38]:
testing_set.head()

|      | Label | SMS | Test | Correct |
|------|-------|-----|------|---------|
| 4458 | ham | Later i guess. I needa do mcat study too. | ham | 1 |
| 4459 | ham | But i haf enuff space got like 4 mb... | ham | 1 |
| 4460 | spam | Had your mobile 10 mths? Update to latest Orange camera/video phones for FREE. Save £s with Free texts/weekend calls. Text YES for a callback orno to opt out | spam | 1 |
| 4461 | ham | All sounds good. Fingers . Makes it difficult to type | ham | 1 |
| 4462 | ham | All done, all handed in. Don't know if mega shop in asda counts as celebration but thats what i'm doing! | ham | 1 |


Percentage of well classified messages when using the `classify_test_set()` function/filter:

In [39]:
test_accuracy = (testing_set['Correct'].sum() / testing_set.shape[0])*100

test_accuracy = test_accuracy.round(2)

print(f'When applied to the messages in the `testing_set` ({testing_set.shape[0]} entries) the test accuracy was aprox. {test_accuracy}%.')

When applied to the messages in the `testing_set` (1114 entries) the test accuracy was aprox. 94.25%.


Given the test accuracy of 94.25%, we can classify the filter as a success, since it exceeded the approval threshold of 80% by a substantial margin. 
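
The success criterion in the introduction is phrased as the share of spam that is filtered out, which is not quite the same quantity as overall accuracy. A minimal sketch of how that share (spam recall) could be computed next to accuracy is shown below on made-up labels; `toy_results` is illustrative, and the figures it prints say nothing about the real testing set.

```python
import pandas as pd

# Made-up true labels and filter outputs for ten messages.
toy_results = pd.DataFrame({
    'Label': ['spam', 'spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham'],
    'Test':  ['spam', 'spam', 'spam', 'ham',  'ham', 'ham', 'ham', 'ham', 'ham', 'spam'],
})

# Accuracy: share of all messages that were classified correctly.
accuracy = (toy_results['Label'] == toy_results['Test']).mean() * 100

# Spam recall: share of the true spam messages that were flagged as spam.
is_spam = toy_results['Label'] == 'spam'
spam_recall = (toy_results.loc[is_spam, 'Test'] == 'spam').mean() * 100

# Rows are true labels, columns are the filter's outputs.
confusion = pd.crosstab(toy_results['Label'], toy_results['Test'])

print(accuracy, spam_recall)
print(confusion)
```

With an imbalanced data set like this one, accuracy can stay high even when a noticeable share of spam is missed, so the two numbers are worth reporting together.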

## Extra Task

### Isolate the 64 messages that were classified incorrectly and try to figure out why the algorithm reached the wrong conclusion.

First of all, let's take a closer look at the 64 messages that were incorrectly tagged. For that purpose we isolate those messages into a separate DataFrame (`incorrect`).

In [40]:
cond_incorrect = testing_set.Correct == 0

incorrect = testing_set[cond_incorrect].copy().reset_index(drop=True)

Basic info on `incorrect`. 

In [41]:
incorrect.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Label    64 non-null     object
 1   SMS      64 non-null     object
 2   Test     64 non-null     object
 3   Correct  64 non-null     int64 
dtypes: int64(1), object(3)
memory usage: 2.1+ KB


In [42]:
incorrect[['Label', 'SMS']]

|     | Label | SMS |
|-----|-------|-----|
| 0 | spam | 25p 4 alfie Moon's Children in need song on ur mob. Tell ur m8s. Txt Tone charity to 8007 for Nokias or Poly charity for polys: zed 08701417012 profit 2 charity. |
| 1 | spam | FreeMsg: Hey - I'm Buffy. 25 and love to satisfy men. Home alone feeling randy. Reply 2 C my PIX! QlynnBV Help08700621170150p a msg Send stop to stop txts |
| 2 | spam | As one of our registered subscribers u can enter the draw 4 a 100 G.B. gift voucher by replying with ENTER. To unsubscribe text STOP |
| 3 | spam | goldviking (29/M) is inviting you to be his friend. Reply YES-762 or NO-762 See him: www.SMS.ac/u/goldviking STOP? Send STOP FRND to 62468 |
| 4 | ham | I wnt to buy a BMW car urgently..its vry urgent.but hv a shortage of &lt;#&gt; Lacs.there is no source to arng dis amt. &lt;#&gt; lacs..thats my prob |
| ... | ... | ... |
| 59 | spam | A link to your picture has been sent. You can also use http://alto18.co.uk/wave/wave.asp?o=44345 |
| 60 | spam | Please call our customer service representative on 0800 169 6031 between 10am-9pm as you have WON a guaranteed £1000 cash or £5000 prize! |
| 61 | spam | FreeMsg:Feelin kinda lnly hope u like 2 keep me company! Jst got a cam moby wanna c my pic?Txt or reply DATE to 82242 Msg150p 2rcv Hlp 08712317606 stop to 82242 |
| 62 | spam | XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL |


From what we can read above, we must first acknowledge that some messages are too ambiguous to label confidently as spam or not spam, even when resorting to human judgment. Some examples: 

- 'Unlimited texts. Limited minutes.' (row 2).
- 'We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us' (row 7).
- 'RCT' THNQ Adrian for U text. Rgds Vatian' (row 11).

One thing we can observe in `incorrect` is that some spam messages contain references to money, such as the currency symbol '£' or expressions like '150p', which stands for 150 pence (£1.50). In the case of currency symbols, we excluded them when we trained the algorithm with this piece of code:

    # Training set `SMS` cleaned series

    # Eliminating punctuation.
    cleaned = training_set.SMS.copy().str.replace('\W', ' ', regex=True) 

Because we chose to replace an indiscriminate group of characters by setting `pattern='\W'`, we prevented the algorithm from recognizing (at least some direct) money references, which are common in spam messages. Therefore, one way to improve the algorithm is to make it recognize currency or money expressions. 


For the sake of comprehension, we can find below all the suppressed characters in `training_set.SMS` which correspond to the regex character class 'not word' ('\W'): 

In [43]:
# Join every training message into one string, then collect the distinct non-word characters.
all_messages_str = ' '.join(training_set.SMS)

set(re.findall(r'\W', all_messages_str))

{'\t',
 '\n',
 ' ',
 '!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '|',
 '~',
 '\x91',
 '\x92',
 '\x93',
 '\x94',
 '\x96',
 '¡',
 '£',
 '»',
 '–',
 '—',
 '‘',
 '’',
 '“',
 '…',
 '┾'}

The table immediately below gives a brief summary of several attempts to improve the algorithm's accuracy by changing which words/characters are recognized in a message. The difference between versions 'a' and 'b' is that in the former the message is converted to lower case, so that the algorithm recognizes the same word regardless of letter case, whereas version 'b' keeps the original case. Because each variation has its own set of new procedures, which need to be justified thoroughly to make sense, I made a support Jupyter notebook, 'p14 - Spam Filter - Alternative Algorithm Versions (II, III and IV)', which explains the thought process behind the changes to the original algorithm.  

| Version | Case sensitive | Accuracy (%) | Description                                                          |
|---------|----------------|--------------|----------------------------------------------------------------------|
| 1a      | no             | 98.74        | Original version.                                                    |
| 1b      | yes            | 98.47        | ---                                                                  |
| 2a      | no             | 98.83        | Recognizes currency symbols: '£', '€' and '$'.                       |
| 2b      | yes            | 98.56        | ---                                                                  |
| 3a      | no             | 98.83        | Same as V2 but recognizes references to GBP pence.                   |
| 3b      | yes            | 98.74        | ---                                                                  |
| 4a      | no             | 98.65        | Separate all numbers from letters and from other characters/symbols. |
| 4b      | yes            | 98.47        | ---                                                                  |


In more detail, version 2 lets the algorithm recognize the aforementioned currency symbols as words (delimited by whitespace), keeping everything else the same as in the original version. Version 3 adds to version 2 a piece of code that lets the algorithm also recognize references to British pence, as mentioned above: e.g. a message that includes the expression '150p/text' will have it converted to '150p text', so that the reference to 150 pence can be recognized. Version 4 isolates every sequence of letters, digits or symbols with whitespace, so that each is recognized as a single expression, e.g. '150p/text' would become '150 p / text'. The idea behind this version is that spam messages may contain unusual or excessive hyperbolic symbols/characters, such as many exclamation points. 
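
The support notebook is not reproduced here, so the following is only a rough sketch of how the ideas behind versions 2 and 4 might be tokenized. Both function names and both regular expressions are assumptions made for illustration, not the code used to produce the table.

```python
import re

def keep_currency_symbols(message):
    """Sketch of the version 2 idea: keep '£', '€' and '$' as separate words."""
    # Surround each currency symbol with spaces so it becomes its own token.
    message = re.sub(r'([£€$])', r' \1 ', message)
    # Replace every other non-word character with a space.
    message = re.sub(r'[^\w£€$]', ' ', message)
    return message.lower().split()

def split_runs(message):
    """Sketch of the version 4 idea: isolate runs of letters, digits and symbols."""
    # Letters | digits | anything that is neither a word character nor whitespace.
    return re.findall(r'[^\W\d_]+|\d+|[^\w\s]+', message.lower())

print(keep_currency_symbols('WON £1000 cash!'))
print(split_runs('150p/text'))
```

With tokens like '£' or '/' kept in the vocabulary, their counts in spam and ham messages would enter the probability tables like those of any other word.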

Analyzing the table above, we can see that improving on the original accuracy is very difficult, given that version 1a is already within 1.26\% of being 100\% accurate. The main point to take from this is that tweaking the original algorithm neither improved nor worsened its accuracy significantly. Versions 2a and 3a classified one more message correctly (each message is 0.09\% of the total, which is the difference between version 1a and versions 2a and 3a). One detail that stands out is that making the algorithm case sensitive slightly decreases its accuracy in all scenarios.   


### Final Thoughts


Since these variations of the algorithm did not reach 100\% accuracy, other approaches can be tried in future versions. One option is to give a certain weight to the probability of spam, $P(Spam|w_1, w_2, ..., w_n)$, based on the number of capitalized words in a message, since spam messages tend to be more hyperbolic and use more capitalized words. Another is to explore the relation between spam and the number of characters per message (relative to non-spam messages), to check whether a very long message is as likely to be spam as non-spam, given the suspicion that spam messages may be shorter and more to the point than other types of messages. 
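
As a starting point for such experiments, the surface features mentioned above could be extracted before any cleaning is applied. This is a hypothetical sketch; `surface_features` and the choice of features are assumptions, and no claim is made about how well they separate spam from ham.

```python
def surface_features(message):
    """Return simple length and capitalization features for one raw message."""
    words = message.split()
    # Count fully upper-case words of at least two characters (e.g. 'WIN', 'YES').
    n_upper_words = sum(1 for word in words if len(word) > 1 and word.isupper())
    return {
        'n_chars': len(message),
        'n_words': len(words),
        'n_upper_words': n_upper_words,
    }

print(surface_features('Text YES to WIN a prize'))
```

Comparing the distributions of these features between the spam and ham messages of the training set would indicate whether they carry any useful signal.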



As a last remark, we can at least recognize that there was room to improve on the original version of the algorithm, and that, if the algorithm were tested on other testing sets, the idea of recognizing currency-related symbols and expressions when classifying a message as spam or not would remain relevant. 

\[End of Project\]
