# Spam Filter with Naive Bayes

[SMS Spam Collection](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection)

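The Naive Bayes algorithm classifies a message by comparing the two quantities below, where B is the event that the message is spam, B^C is its complement (the message is not spam), and e_1, ..., e_n are the words in the message.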

$$\begin{equation}
P(B | e_1,e_2, ..., e_n) \propto P(B) \cdot \prod_{i=1}^{n}P(e_i|B)
\end{equation}$$

$$\begin{equation}
P(B^C | e_1,e_2, ..., e_n) \propto P(B^C) \cdot \prod_{i=1}^{n}P(e_i|B^C)
\end{equation}$$

If the calculated probability of the message being spam is higher than the probability of it being non-spam, the message is classified as spam.
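
As a minimal sketch of this decision rule, the snippet below compares the two unnormalized scores for a made-up two-word message. The prior and per-word probabilities are invented purely for illustration (they are not taken from the dataset), and `score` is a hypothetical helper name.

```python
from math import prod

def score(prior, word_probs):
    """Return the prior multiplied by every per-word conditional probability."""
    return prior * prod(word_probs)

# Invented numbers for a two-word message (not computed from the dataset).
spam_score = score(0.13, [0.05, 0.02])    # P(B) * P(e_1|B) * P(e_2|B)
ham_score = score(0.87, [0.001, 0.003])   # P(B^C) * P(e_1|B^C) * P(e_2|B^C)

# The scores are only proportional to the posteriors, which is enough to compare them.
label = 'spam' if spam_score > ham_score else 'ham'
print(label)
```

Because both scores omit the same normalizing constant, comparing them gives the same answer as comparing the true posterior probabilities.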

In [1]:
import pandas as pd
s_spam = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

print(s_spam.shape)
s_spam.head()

(5572, 2)


Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
s_spam['Label'].value_counts()

ham     4825
spam     747
Name: Label, dtype: int64

The dataset has 4,825 ham messages and 747 spam messages, so roughly 87% of the messages are ham and 13% are spam.

# Training and Test Set

First, the dataset is split into a training set and a test set. We need as much data as possible to train the model while leaving enough for testing, so the training set uses 80% of the dataset and the remaining 20% is held out for testing.

In [3]:
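# Shuffle the whole dataset with a fixed seed so the split is reproducible
randomized=s_spam.sample(frac=1, random_state=1)
# Use the first 80% of the shuffled rows for training and the rest for testing
split_index=round(len(randomized)*0.8)
train=randomized[:split_index].reset_index(drop=True)
test=randomized[split_index:].reset_index(drop=True)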

print(train.shape)
print(test.shape)

(4458, 2)
(1114, 2)


# Letter Case and Punctuation

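The next step is to use the training set to teach the algorithm how to classify new messages. To do that, the messages first need cleaning: punctuation is removed and every letter is converted to lower case.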

![image1](https://dq-content.s3.amazonaws.com/433/cpgp_dataset_1.png)

The goal is to transform the table above into the format below:

![image2](https://dq-content.s3.amazonaws.com/433/cpgp_dataset_2.png)
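
The transformation can be sketched on a tiny example using only the standard library. The two messages below are invented for illustration, and `tokenize`, `toy_vocabulary`, and `toy_rows` are hypothetical names that are not used elsewhere in this notebook.

```python
import re
from collections import Counter

# Two invented messages, standing in for the SMS column.
toy_messages = ['SECRET prize! Claim secret prize now!!', 'Coming to my secret party?']

def tokenize(text):
    # Replace non-word characters with spaces, lowercase, and split on whitespace.
    return re.sub(r'\W', ' ', text).lower().split()

tokenized = [tokenize(message) for message in toy_messages]

# Every distinct word across all messages, sorted so the column order is stable.
toy_vocabulary = sorted({word for words in tokenized for word in words})

# One row per message, one count per vocabulary word.
toy_rows = [[Counter(words)[word] for word in toy_vocabulary] for words in tokenized]

print(toy_vocabulary)
for row in toy_rows:
    print(row)
```

Each row plays the role of one message in the transformed table, with a count of 0 for every vocabulary word the message does not contain.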

In [4]:
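# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
train['SMS']=train['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()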
train.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


In [5]:
# Split each message in the SMS column into a list of individual words
train['SMS']=train['SMS'].str.split()

# Build the vocabulary: every word that appears in any training message, without duplicates
vocabulary=list({w for row in train['SMS'] for w in row})

print(len(vocabulary))

7783


The vocabulary contains 7,783 unique words.

# Final Training Set

For the calculations that follow, the number of times each vocabulary word occurs in each message needs to be counted.

In [6]:
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

word_counts=pd.DataFrame(word_counts_per_sms)   

training_set_clean = pd.concat([train, word_counts], axis=1)
training_set_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


# Calculating Constants

$$\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam) \\
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}$$

P(w_i|Spam) and P(w_i|Ham) are calculated with additive (Laplace) smoothing, using the following equations, where N_{w_i|Spam} and N_{w_i|Ham} are the number of times the word w_i occurs in spam and ham messages:

$$\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}$$

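- p_spam=prior probability of 'spam', P(Spam): the share of training messages labeled spam
- p_ham=prior probability of 'ham', P(Ham): the share of training messages labeled ham
- n_spam=total number of words in all 'spam' messages
- n_ham=total number of words in all 'ham' messages
- n_voc=number of unique words in the training vocabulary
- alpha=Laplace smoothing parameter (set to 1)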
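
To make the effect of smoothing concrete, here is a small sketch with invented counts (a 10-word vocabulary and 20 words in one class). `smoothed_probability` is a hypothetical helper that simply evaluates the formula above; it is not part of the notebook's code.

```python
def smoothed_probability(n_word_in_class, n_class, n_vocabulary, alpha=1):
    """Additive (Laplace) smoothing: (N_w + alpha) / (N_class + alpha * N_vocabulary)."""
    return (n_word_in_class + alpha) / (n_class + alpha * n_vocabulary)

# Invented per-word counts for one class; they add up to 20 words in total.
toy_counts = [3, 0, 5, 2, 10, 0, 0, 0, 0, 0]
toy_n_class = sum(toy_counts)      # 20
toy_n_vocabulary = len(toy_counts)  # 10

p_seen = smoothed_probability(3, toy_n_class, toy_n_vocabulary)    # (3 + 1) / (20 + 10)
p_unseen = smoothed_probability(0, toy_n_class, toy_n_vocabulary)  # (0 + 1) / (20 + 10)

# The smoothed probabilities over the whole vocabulary still add up to 1.
total = sum(smoothed_probability(c, toy_n_class, toy_n_vocabulary) for c in toy_counts)

print(p_seen, p_unseen, total)
```

Without smoothing, a word that never occurs in a class would have probability 0 and would zero out the whole product for that class; with alpha=1 it gets a small nonzero probability instead.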

In [7]:
# Split dataset into separate categories
spam=training_set_clean[training_set_clean['Label']=='spam']
ham=training_set_clean[training_set_clean['Label']=='ham']

p_spam=len(spam)/len(training_set_clean)
p_ham=len(ham)/len(training_set_clean)

n_spam=spam['SMS'].apply(len).sum()
n_ham=ham['SMS'].apply(len).sum()

n_voc=len(vocabulary)
alpha=1

# Calculating Parameters

- p_w_h=probability of each word given 'ham', P(w_i|Ham)

- p_w_s=probability of each word given 'spam', P(w_i|Spam)

$$\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}$$

In [8]:
p_w_h={unique_word:0 for unique_word in vocabulary}
p_w_s={unique_word:0 for unique_word in vocabulary}

for word in vocabulary:
    num_word_ham=ham[word].sum()
    num_word_spam=spam[word].sum()
    p_w_h[word]=(num_word_ham+alpha)/(n_ham+alpha*n_voc)
    p_w_s[word]=(num_word_spam+alpha)/(n_spam+alpha*n_voc)                      

# Classifying A New Message

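Before applying the filter to the full test set, create a function to classify individual messages. The function cleans the message and splits it into words in the same way as the training data, then multiplies the prior by the probability of each word under each label, using the parameters calculated from the training set. Words that do not appear in the vocabulary are skipped, and the label with the larger result is printed.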

In [9]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        if word in p_w_s:
            p_spam_given_message*=p_w_s[word]
        if word in p_w_h:
            p_ham_given_message*=p_w_h[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')
        
classify('WINNER!! This is the secret code to unlock the money: C3421.') 

classify("Sounds good, Tom, then see u there")

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam
P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
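
The products above are already as small as 1e-27 for short messages, and for a long enough message both can underflow to 0.0, at which point the two scores tie. A common workaround, shown here as a sketch rather than as part of the original notebook, is to add logarithms instead of multiplying probabilities. `classify_log` is a hypothetical name; it takes the priors and word-probability dictionaries as arguments so the example runs on its own, and the toy probabilities are invented.

```python
import math
import re

def classify_log(message, p_spam, p_ham, p_w_s, p_w_h):
    """Classify a message by comparing sums of log-probabilities."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words missing from the dictionaries are skipped, as in classify().
        if word in p_w_s:
            log_spam += math.log(p_w_s[word])
        if word in p_w_h:
            log_ham += math.log(p_w_h[word])

    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

# Invented word probabilities, for illustration only.
toy_p_w_s = {'win': 0.05, 'money': 0.04, 'see': 0.001}
toy_p_w_h = {'win': 0.001, 'money': 0.002, 'see': 0.03}

print(classify_log('Win money!!', 0.13, 0.87, toy_p_w_s, toy_p_w_h))
print(classify_log('See you', 0.13, 0.87, toy_p_w_s, toy_p_w_h))

# 400 repetitions would underflow as a plain product, but the log sums stay finite.
print(classify_log('win ' * 400, 0.13, 0.87, toy_p_w_s, toy_p_w_h))
```

Because the logarithm is strictly increasing, comparing log scores picks the same label as comparing the products, and the smoothed probabilities are never zero, so `math.log` is always defined.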


# Measuring the Spam Filter's Accuracy

After this check, the function is modified to return the label instead of printing it, so the predictions can be stored in a new column of the test set.

In [10]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in p_w_s:
            p_spam_given_message*=p_w_s[word]
        if word in p_w_h:
            p_ham_given_message*=p_w_h[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'
    
test['predict']=test['SMS'].apply(classify_test_set)    

In [11]:
# Count the rows where the predicted label matches the true label
correct=int((test['Label']==test['predict']).sum())
total=test.shape[0]

percent_correct=(correct/total)*100

print(percent_correct, '% were correctly classified')

98.74326750448833 % were correctly classified
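
Because ham messages far outnumber spam messages, accuracy alone does not show how the errors are distributed between the two labels. As a sketch, the snippet below counts the four possible outcomes for pairs of true and predicted labels and derives precision and recall for the spam label. The label lists are invented for illustration, and `confusion_counts` is a hypothetical helper that is not part of the notebook.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels, positive='spam'):
    """Count true/false positives and negatives, treating `positive` as the target label."""
    counts = Counter()
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive:
            counts['tp' if truth == positive else 'fp'] += 1
        else:
            # Any other prediction (including a tie) counts as "not spam" here.
            counts['fn' if truth == positive else 'tn'] += 1
    return counts

# Invented labels, for illustration only.
toy_true = ['ham', 'ham', 'spam', 'spam', 'ham', 'spam']
toy_predicted = ['ham', 'spam', 'spam', 'ham', 'ham', 'spam']

counts = confusion_counts(toy_true, toy_predicted)
precision = counts['tp'] / (counts['tp'] + counts['fp'])  # share of predicted spam that is spam
recall = counts['tp'] / (counts['tp'] + counts['fn'])     # share of actual spam that was caught

print(dict(counts), precision, recall)
```

In this notebook the same helper could presumably be called as `confusion_counts(test['Label'], test['predict'])`, since both columns hold 'ham' and 'spam' strings.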


# Conclusion

The spam filter correctly classified 98.74% of the 1,114 messages in the test set (1,100 messages).