### Spam Filter in Python: Naive Bayes from Scratch
#### https://www.kdnuggets.com/2020/07/spam-filter-python-naive-bayes-scratch.html

Read in and look at the data.

In [76]:
import pandas as pd

In [77]:
sms_spam = pd.read_csv(r"C:\Users\Cory\Dropbox\PC\Documents\ENTITY\final_project\spamhamdata.csv", sep='\t', header=None, names=['Label', 'SMS'])

In [78]:
sms_spam.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [79]:
sms_spam.describe()

| | Label | SMS |
|---|---|---|
| count | 5572 | 5572 |
| unique | 2 | 5169 |
| top | ham | Sorry, I'll call later |
| freq | 4825 | 30 |


In [80]:
sms_spam['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

The dataset is made of about 87% ham and 13% spam.

Randomize the dataset and split it 80/20 into training and test sets.

In [81]:
RandSpam = sms_spam.sample(frac=1, random_state=1)

Calculate index for the split.

In [82]:
trainTest_index = round(len(RandSpam) * 0.8)

Split

In [83]:
trainData = RandSpam[:trainTest_index].reset_index(drop=True)

In [84]:
print(trainData.shape)

(4458, 2)


In [85]:
testData = RandSpam[trainTest_index:].reset_index(drop=True)

In [86]:
print(testData.shape)

(1114, 2)


Compare the ham/spam percentages in the training and test sets with those in the original data; they should be similar.

In [87]:
trainData['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [88]:
testData['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64
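The shuffle-and-split steps above can be tried on a small table first. The sketch below uses made-up rows (the `toy` frame and its contents are assumptions, not part of the SMS data) and the same `sample`/slice pattern as the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the SMS data: 8 ham rows and 2 spam rows.
toy = pd.DataFrame({
    'Label': ['ham'] * 8 + ['spam'] * 2,
    'SMS': [f'message {i}' for i in range(10)],
})

# Shuffle every row reproducibly, then cut at the 80% mark.
shuffled = toy.sample(frac=1, random_state=1)
split_index = round(len(shuffled) * 0.8)

toy_train = shuffled[:split_index].reset_index(drop=True)
toy_test = shuffled[split_index:].reset_index(drop=True)

print(toy_train.shape, toy_test.shape)
print(toy_train['Label'].value_counts(normalize=True))
```

With so few rows the class proportions in each part can drift a long way from the original, which is why the comparison above is worth doing on the real data.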

### Clean the data

**before cleaning**

In [89]:
trainData.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | Yep, by the pretty sculpture |
| 1 | ham | Yes, princess. Are you going to make me moan? |
| 2 | ham | Welp apparently he retired |
| 3 | ham | Havent. |
| 4 | ham | I forgot 2 ask ü all smth.. There's a card on ... |


Remove punctuation.

In [90]:
trainData['SMS'] = trainData['SMS'].str.replace(
   r'\W', ' ', regex=True)


Make lowercase.

In [91]:
trainData['SMS'] = trainData['SMS'].str.lower()

**after cleaning**

In [92]:
trainData.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |


Split SMS message strings into individual words, using spaces as cues to guide where to divide.

In [93]:
trainData['SMS'] = trainData['SMS'].str.split()

Create an empty vocabulary list.

In [94]:
vocabulary=[]

Go through SMS list and add every word that was split out to the vocabulary list.

In [95]:
for sms in trainData['SMS']:
   for word in sms:
      vocabulary.append(word)

Remove duplicates by changing the list into a set; then change the set back into a list. Check the length of the vocabulary list.

In [96]:
vocabulary = list(set(vocabulary))
len(vocabulary)

7783
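The cleaning, splitting, and vocabulary steps can be checked on a few short messages. This is a minimal sketch with invented messages (the `toy` frame and the `tokens` and `toy_vocabulary` names are assumptions); it uses a set comprehension instead of the append-then-deduplicate loop, which should give the same set of words.

```python
import pandas as pd

# Hypothetical messages, used only for illustration.
toy = pd.DataFrame({'SMS': ["Free entry!! Win a prize", "Ok, see u at 5", "win WIN win"]})

# Replace every non-word character with a space, lowercase, then split on whitespace.
tokens = (toy['SMS']
          .str.replace(r'\W', ' ', regex=True)
          .str.lower()
          .str.split())

# A set comprehension keeps each distinct word once; sorting makes the order stable.
toy_vocabulary = sorted({word for sms in tokens for word in sms})

print(tokens.tolist())
print(toy_vocabulary)
```

Note that `\W` leaves digits and underscores in place, so "5" survives as a vocabulary entry here, just as "82468" does in the real vocabulary.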

Next, build a refined training dataset. The vocabulary list is the basis for a dictionary of word counts.

Each vocabulary word becomes a key whose value is a list of zeros, one slot per training message. The loop then walks through the messages with enumerate(): index is the message's position in the training set, and every occurrence of a word adds 1 to that word's slot at index. The result is the count of each word in each message.

In [97]:
smsWordCounts = {unique_word: [0] * len(trainData['SMS']) for unique_word in vocabulary}

In [98]:
for index, sms in enumerate(trainData['SMS']):
   for word in sms:
      smsWordCounts[word][index] += 1
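The same two cells on a tiny input make the structure easier to see. The three tokenized messages and the five-word vocabulary below are made up for illustration; the dictionary and loop follow the pattern used above.

```python
import pandas as pd

# Hypothetical tokenized messages and their vocabulary.
toy_sms = [['win', 'a', 'prize'], ['see', 'u'], ['win', 'win']]
toy_vocab = ['a', 'prize', 'see', 'u', 'win']

# One list of zeros per word, with one slot per message.
counts = {word: [0] * len(toy_sms) for word in toy_vocab}

# index is the message's row number; each occurrence of a word bumps its slot in that row.
for index, sms in enumerate(toy_sms):
    for word in sms:
        counts[word][index] += 1

# Each key becomes a column and each message becomes a row.
toy_counts = pd.DataFrame(counts)
print(toy_counts)
```

Here the 'win' column reads 1, 0, 2: once in the first message, absent from the second, twice in the third.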

Using this dictionary, get the word counts from the training dataset.

In [99]:
word_counts = pd.DataFrame(smsWordCounts)
word_counts.head()

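| | got | jaya | nasdaq | 82468 | across | toothpaste | surely | potato | 1000call | philosophical | ... | gal | hugs | aiya | thinking | nike | dracula | mustprovide | fucks | viva | mth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |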


Use concat() to add the label and sms columns.

In [100]:
training_set_clean = pd.concat([trainData, word_counts], axis=1)
training_set_clean.head()

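| | Label | SMS | got | jaya | nasdaq | 82468 | across | toothpaste | surely | potato | ... | gal | hugs | aiya | thinking | nike | dracula | mustprovide | fucks | viva | mth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |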


Now the training data are finally prepped.

### Create the filter

Define terms in the equation that will not change from sms to sms.
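For reference, the equation the code in this section appears to implement can be written as follows. The symbols are named after the notebook's variables and are my own notation: $N_{\text{Spam}}$ is n_spam, $N_{\text{Vocabulary}}$ is n_vocabulary, $N_{w_i \mid \text{Spam}}$ is the number of times word $w_i$ occurs in spam messages, and $\alpha$ is the smoothing constant. The ham equations have the same form.

```latex
P(\text{Spam} \mid w_1, \dots, w_n) \;\propto\; P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam})

P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \, N_{\text{Vocabulary}}}
```

The terms $P(\text{Spam})$, $N_{\text{Spam}}$, $N_{\text{Vocabulary}}$, and $\alpha$ do not depend on the message, so they can be computed once up front.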

First separate the clean ham and the clean spam.

In [101]:
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

Calculate the probability of spam and the probability of ham.

In [102]:
p_spam = len(spam_messages) / len(training_set_clean)
p_ham = len(ham_messages) / len(training_set_clean)

Calculate the number of words in the spam, ham, and vocabulary lists.

In [103]:
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

n_vocabulary = len(vocabulary)

alpha = 1

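Setting alpha = 1 applies Laplace smoothing: alpha is added to every word count so that a word that never appears in one class does not get a probability of zero, which would wipe out the whole product for that class.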
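A quick numeric check shows what the smoothing does. The counts below (100 words in the class, a vocabulary of 50) and the helper name `smoothed_probability` are made up for illustration; the formula matches the one used for the parameters in this notebook.

```python
def smoothed_probability(word_count, class_word_total, vocabulary_size, alpha=1):
    """P(word | class) with additive smoothing."""
    return (word_count + alpha) / (class_word_total + alpha * vocabulary_size)

# A word that never occurs in the class: unsmoothed it would be 0 / 100 = 0.
unseen = smoothed_probability(0, 100, 50)

# A word that occurs 10 times in the class.
seen = smoothed_probability(10, 100, 50)

print(unseen)  # (0 + 1) / (100 + 50)
print(seen)    # (10 + 1) / (100 + 50)
```

The unseen word gets a small but non-zero probability, so one unfamiliar word can no longer force the score for a class to zero.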

Initialize the parameters.

In [104]:
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}


Calculate the parameters.

In [105]:
for word in vocabulary:
   n_word_given_spam = spam_messages[word].sum()
   p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
   parameters_spam[word] = p_word_given_spam

   n_word_given_ham = ham_messages[word].sum() 
   p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
   parameters_ham[word] = p_word_given_ham

**Finally, the spam filter.**

In [106]:
import re


In [107]:
def classify(message):
    """message: a string"""
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham: 
            p_ham_given_message *= parameters_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Is it ham? Is it spam? I don\'t know!')


Test the filter.

In [108]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [109]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
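The scores printed above are already around 1e-25 for short messages, and multiplying many small probabilities can eventually underflow to 0.0 for long ones. A common workaround, which this notebook does not use, is to add logarithms instead of multiplying probabilities. The sketch below assumes small made-up parameter dictionaries (all `toy_` names and their values are hypothetical) in place of the real p_spam, p_ham, parameters_spam, and parameters_ham.

```python
import math
import re

# Hypothetical toy parameters standing in for the trained ones.
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_parameters_spam = {'win': 0.05, 'money': 0.04, 'see': 0.001}
toy_parameters_ham = {'win': 0.001, 'money': 0.002, 'see': 0.03}

def classify_log(message):
    """Return 'spam' or 'ham' by comparing summed log-probabilities."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)

    for word in words:
        # Words outside the vocabulary are skipped, as in classify().
        if word in toy_parameters_spam:
            log_spam += math.log(toy_parameters_spam[word])
        if word in toy_parameters_ham:
            log_ham += math.log(toy_parameters_ham[word])

    # Ties fall to 'ham' in this sketch; classify() reports them separately.
    return 'spam' if log_spam > log_ham else 'ham'

print(classify_log('Win money!'))
print(classify_log('See you'))
```

Because the logarithm is monotonic, comparing the two sums picks the same label as comparing the two products whenever the products have not underflowed.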


Rewrite filter code to classify and return, rather than print after each SMS.

In [110]:
def classify_test_set(message):
    """message: a string"""
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham: 
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'Is it ham? Is it spam? I don\'t know!'

Use classify_test_set() to predict a label for each SMS in the test set and add the predictions as a new column.

In [111]:
testData['predicted'] = testData['SMS'].apply(classify_test_set)
testData.head()

| | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


**Measure accuracy on the test set.**

In [112]:
total = testData.shape[0]

# Count the rows where the predicted label matches the true label.
correct = int((testData['Label'] == testData['predicted']).sum())

print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)
 

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833
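Since about 87% of the test set is ham, a filter that always answers "ham" would already score roughly 0.87 accuracy, so it can help to look at spam precision and recall as well. The sketch below is one possible way to do that with pd.crosstab on made-up labels (the `toy_results` frame and its values are hypothetical); applied to testData it would use the same two columns.

```python
import pandas as pd

# Hypothetical true labels and predictions, for illustration only.
toy_results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'ham', 'spam', 'spam', 'ham', 'spam', 'ham'],
    'predicted': ['ham', 'ham', 'spam', 'spam', 'ham', 'ham', 'spam', 'ham'],
})

# Rows are true labels, columns are predicted labels.
confusion = pd.crosstab(toy_results['Label'], toy_results['predicted'])

true_spam = confusion.loc['spam', 'spam']
precision = true_spam / confusion['spam'].sum()    # of messages flagged spam, share that are spam
recall = true_spam / confusion.loc['spam'].sum()   # of real spam, share that was caught

# Accuracy of a baseline that labels every message ham.
baseline = (toy_results['Label'] == 'ham').mean()

print(confusion)
print('Precision:', precision, 'Recall:', recall, 'Baseline accuracy:', baseline)
```

On real predictions the tie string returned by classify_test_set() would show up as its own column in the table, if any ties occur.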


## Excellent!