# SIMPLE SPAM FILTER

This is my first from-scratch machine learning coding project. Although it is nothing compared to other people's spam filter projects, I am still proud of it. Just do not treat it as a good reference for your studies :>.

First, I want to talk about the mathematics before we start coding. This spam filter is based on Bayes' theorem, which states that:

$$P(spam|word) = \frac{P(word|spam) \times P(spam)}{P(word)}$$

(To keep things simple, the only features I use for predicting spam are the words, including words in the title and the email address.) The formula for predicting whether an email is spam or ham is:

$$P(spam|w_1, w_2,..., w_n) = \frac{P(w_1, w_2,..., w_n|spam) \times P(spam)}{P(w_1, w_2,..., w_n)}$$

(where $w_1, w_2,..., w_n$ belong to the set $W$). By the chain rule, we have:

$$P(w_1, w_2,..., w_n) = P(w_1|w_2,..., w_n) \times P(w_2|w_3,..., w_n) \times \dots \times P(w_n)$$

To make it simple, let's suppose that the words in $W$ are conditionally independent of each other given the category ("spam" or "ham"). This is the "Naive Bayes" assumption. Then we have:

$$P(w_1, w_2,..., w_n|spam) = \prod_{i = 1}^{n} P(w_i|spam)$$

Therefore:

$$P(spam|w_1, w_2,..., w_n) = \frac{P(w_1, w_2,..., w_n|spam) \times P(spam)}{P(w_1, w_2,..., w_n)}$$

$$P(spam|w_1, w_2,..., w_n) = \frac{\prod_{i = 1}^{n} P(w_i|spam) \times P(spam)}{P(w_1, w_2,..., w_n)}$$

At this step, I was confused about how to calculate $P(w_1, w_2,..., w_n)$, so I asked my brilliant friend for advice and he proposed another solution. I take the ratio of $P(ham|w_1, w_2,..., w_n)$ over $P(spam|w_1, w_2,..., w_n)$, in which the common denominator $P(w_1, w_2,..., w_n)$ cancels out:

$$R = \frac{P(ham|w_1, w_2,..., w_n)}{P(spam|w_1, w_2,..., w_n)}$$

$$R = \frac{\prod_{i = 1}^{n} P(w_i|ham) \times P(ham)}{\prod_{i = 1}^{n} P(w_i|spam) \times P(spam)}$$

Now it is easy to calculate $R$, isn't it? Let's go back to the $P(spam|w_1, w_2,..., w_n)$:

$$P(spam|w_1, w_2,..., w_n) = \frac{P(spam|w_1, w_2,..., w_n)}{P(spam|w_1, w_2,..., w_n) + P(ham|w_1, w_2,..., w_n)}$$

$$P(spam|w_1, w_2,..., w_n) = \frac{1}{1 + \frac{P(ham|w_1, w_2,..., w_n)}{P(spam|w_1, w_2,..., w_n)}}$$

$$P(spam|w_1, w_2,..., w_n) = \frac{1}{1 + R}$$
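
As a quick sanity check of the derivation above, here is a minimal sketch that compares $1 / (1 + R)$ with the posterior obtained by normalizing over the two classes directly. The per-word likelihoods and the priors are made-up numbers used only for illustration, not values from the dataset.

```python
import math

# Hypothetical per-word likelihoods and class priors (illustration only).
p_word_given_spam = [0.8, 0.6, 0.3]
p_word_given_ham = [0.1, 0.4, 0.5]
p_spam, p_ham = 0.4, 0.6

# Unnormalized scores: product of P(w_i | class) times P(class).
spam_score = math.prod(p_word_given_spam) * p_spam
ham_score = math.prod(p_word_given_ham) * p_ham

# Posterior by normalizing over the two classes.
posterior_direct = spam_score / (spam_score + ham_score)

# Same quantity through the ratio R = ham_score / spam_score.
R = ham_score / spam_score
posterior_from_ratio = 1 / (1 + R)

print(posterior_direct, posterior_from_ratio)
```

Both printed values should agree up to floating-point rounding, which is why $P(w_1, w_2,..., w_n)$ never has to be computed explicitly.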

Now we can start coding:

In [12]:
# import necessary modules
import pandas as pd
import re, math
# get the dataset
fraud_email = pd.read_csv('fraud_email.csv')
# get the spam_list
spam_list = fraud_email[(fraud_email['Class'] == 1)]['Text'].tolist()
# get the ham_list
ham_list = fraud_email[(fraud_email['Class'] == 0)]['Text'].tolist()
# get the number of instances (rows) in the dataset
dataset_row_count = len(fraud_email)
# get the probabilities of ham/spam mail in this dataset
prob_ham_in_dataset = len(ham_list)/dataset_row_count
prob_spam_in_dataset = len(spam_list)/dataset_row_count
print('setup and initialization are completed!')

setup and initialization are completed!


Define how to extract words:

In [13]:
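def extract_words(text):
    # split on common punctuation and whitespace
    # (a raw string avoids invalid-escape warnings for \* and \.)
    split_list = re.split(r'; |;|, |,|: |:|\t|\*|\n|! | |\.', str(text))
    word_list = [word.lower() for word in split_list]
    # drop duplicates while keeping the first-seen order
    word_list = list(dict.fromkeys(word_list))
    if '' in word_list:
        word_list.remove('')
    return word_list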
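
Because `extract_words` splits on a fixed list of separators, characters that are not in that list (for example "?" or parentheses) stay attached to the words. A possible alternative, sketched below under the assumption that a word is any run of letters, digits, or apostrophes, is to match the tokens instead of the separators. The name `tokenize` is hypothetical and is not used elsewhere in this notebook.

```python
import re

def tokenize(text):
    """Hypothetical alternative to extract_words: unique lowercase tokens, first-seen order."""
    # Match runs of letters, digits, or apostrophes instead of listing separators.
    tokens = re.findall(r"[a-z0-9']+", str(text).lower())
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(tokens))

print(tokenize("Hello, World! Hello: again."))
# ['hello', 'world', 'again']
print(tokenize("I don't know (really)?"))
# ['i', "don't", 'know', 'really']
```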

How to calculate the probability of occurrence of a word in a given text list:

In [14]:
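def prob_in_set(word, mail_list):
    # fraction of mails that contain the word
    # (case-insensitive substring match; the parameter no longer shadows the built-in list)
    count = 0
    for mail in mail_list:
        if word in str(mail).lower():
            count += 1
    return count/len(mail_list)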
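
One thing to be aware of: the test `word in str(mail).lower()` is a substring test, so a short word such as "a" also matches inside "spam" or "cheap". The sketch below contrasts it with a whole-token test. The helper `prob_in_set_tokens` and the toy mails are hypothetical and only meant to show the difference.

```python
import re

def prob_in_set_tokens(word, mail_list):
    """Fraction of mails in which `word` appears as a whole token (hypothetical helper)."""
    if not mail_list:
        return 0.0
    count = 0
    for mail in mail_list:
        tokens = set(re.findall(r"[a-z0-9']+", str(mail).lower()))
        if word in tokens:
            count += 1
    return count / len(mail_list)

# Toy mails, invented for illustration only.
toy_mails = ["Cheap spam offer", "Meeting at noon", "Win a prize"]

# Substring test: "a" is found inside "cheap", "at" and as the word "a".
substring_hits = sum("a" in mail.lower() for mail in toy_mails) / len(toy_mails)
# Token test: only "Win a prize" contains the standalone word "a".
token_hits = prob_in_set_tokens("a", toy_mails)

print(substring_hits, token_hits)
```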

This is how to calculate $\prod_{i = 1}^{n} P(w_i|spam)$:

In [15]:
def product_of_num_list(num_list):
    result = 1
    for num in num_list:
        result *= float(num)
    return result

Now we can start calculating an instance:

In [16]:
def main(text):
    word_list = extract_words(text)
    prob_in_spam_list = list()
    prob_in_ham_list = list()
    for word in word_list:
        prob_in_spam_list.append(prob_in_set(word, spam_list))
        prob_in_ham_list.append(prob_in_set(word, ham_list))
    ratio = (product_of_num_list(prob_in_ham_list) * prob_ham_in_dataset) / \
        (product_of_num_list(prob_in_spam_list) * prob_spam_in_dataset)
    print('Ratio: ' + str(ratio))
    prob = 1 / (1 + ratio)
    print("You email has a chance of " + str(prob*100) + "% of being a spam.")

Let's try it:

In [17]:
text = "I'm just a little bit caught in the middle\
Life is a maze and love is a riddle\
I don't know where to go; can't do it alone; I've tried\
And I don't know why\
Slow it down\
Make it stop\
Or else my heart is going to pop\
\'Cause it\'s too much\
Yeah, it's a lot\
To be something I'm not"
main(text)

ZeroDivisionError: float division by zero

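As you can see, there is a division-by-zero problem: the denominator becomes 0 as soon as one word has a probability of 0 in the spam list, for example a word that never appears in any spam email. Even without that error, the model is overfit and the predicted result will always be around 99%. I fixed it like this: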

In [None]:
def product_of_num_list(num_list):
    result = 1
    for num in num_list:
        result *= float(num + 10)
    return result
# 10 is an offset added to every probability, in the spirit of Laplace correction,
# to avoid a 0.0 probability.
# But why is it 10 instead of another number?
# I do not really know; I just adjusted the number according to the test results.
# If it's too big, the returned number will be infinity, but if it's too small,
# the model is overfit :(.
# Sorry!
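
For comparison, textbook Laplace (additive) smoothing adds a pseudo-count to the word counts before normalizing, so every estimate stays strictly between 0 and 1. This is a different technique from the `+ 10` offset above, not a description of it. A minimal sketch, assuming a presence/absence model (hence the `2 * alpha` in the denominator) and a hypothetical helper name `smoothed_prob`:

```python
def smoothed_prob(word, mail_list, alpha=1.0):
    """P(word | class) with additive smoothing: (count + alpha) / (N + 2 * alpha)."""
    # Count the mails that contain the word (same substring test as prob_in_set).
    count = sum(1 for mail in mail_list if word in str(mail).lower())
    # Two outcomes per word (present / absent), so alpha is added twice below.
    return (count + alpha) / (len(mail_list) + 2 * alpha)

# Toy mails, invented for illustration only.
toy_spam = ["free money now", "claim your free prize"]

print(smoothed_prob("free", toy_spam))     # seen in both mails: (2 + 1) / (2 + 2)
print(smoothed_prob("meeting", toy_spam))  # never seen: (0 + 1) / (2 + 2), not 0
```

With this kind of smoothing the ratio can no longer divide by zero, although very long emails can still underflow when many small probabilities are multiplied.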

Now let's try again:

In [None]:
text = "I'm just a little bit caught in the middle\
Life is a maze and love is a riddle\
I don't know where to go; can't do it alone; I've tried\
And I don't know why\
Slow it down\
Make it stop\
Or else my heart is going to pop\
\'Cause it\'s too much\
Yeah, it's a lot\
To be something I'm not"
main(text)

Ratio: 0.6714713818373087
You email has a chance of 59.827527462706755% of being a spam.


But there is another problem: there are a lot of words like "a", "the", "is", "it", etc., and I think they are too trivial to take into account in this model, so I will filter them out using this inverse document frequency (IDF) function:

In [None]:
def idf(word):
    count = 0
    for mail in fraud_email['Text'].tolist():
        if word in str(mail).lower():
            count += 1
    return math.log(1 + (dataset_row_count/(count + 1)))
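
To get a feeling for what this formula returns, here is a small sketch on a toy corpus. `idf_toy` is a hypothetical copy of the formula above that takes the corpus as an argument, and the four documents are invented for illustration. Solving $\ln(1 + N/(count + 1)) > 1$ gives $(count + 1)/N < 1/(e - 1) \approx 0.58$, so a cut at 1 keeps words that appear in fewer than roughly 58% of the documents.

```python
import math

def idf_toy(word, corpus):
    """Same formula as idf(), but on an explicit corpus (hypothetical helper)."""
    count = sum(1 for doc in corpus if word in doc.lower())
    return math.log(1 + len(corpus) / (count + 1))

# Toy corpus, invented for illustration only.
toy_corpus = [
    "the cat sat",
    "the dog ran",
    "the bird flew",
    "the lottery prize",
]

print(idf_toy("the", toy_corpus))      # in every document: log(1 + 4/5), below 1
print(idf_toy("lottery", toy_corpus))  # in one document:   log(1 + 4/2), above 1
```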

The main() function becomes:

In [None]:
def main(text):
    word_list = extract_words(text)
    prob_in_spam_list = list()
    prob_in_ham_list = list()
    for word in word_list:
        if idf(word) > 1:
            prob_in_spam_list.append(prob_in_set(word, spam_list))
            prob_in_ham_list.append(prob_in_set(word, ham_list))
    ratio = (product_of_num_list(prob_in_ham_list) * prob_ham_in_dataset) / \
        (product_of_num_list(prob_in_spam_list) * prob_spam_in_dataset)
    print('Ratio: ' + str(ratio))
    prob = 1 / (1 + ratio)
    print("You email has a chance of " + str(prob*100) + "% of being a spam.")
    return (0 if (prob < 0.65) else 1)

Now we have a complete simple spam filter. Because testing the entire dataset is too slow (:<), I will try a small slice of instances here:

In [None]:
start = 1000
stop = 1020
sample = fraud_email['Text'].tolist()[start:stop]
sample_result = fraud_email['Class'].tolist()[start:stop]
test_result = list()
for mail in sample:
    test_result.append(main(str(mail)))
print("Sample result: " + str(sample_result))
print("Test result: " + str(test_result))

Ratio: nan
You email has a chance of nan% of being a spam.
Ratio: 0.5417626251102475
You email has a chance of 64.86082771195032% of being a spam.
Ratio: 0.01333739730200361
You email has a chance of 98.68381475533083% of being a spam.
Ratio: 0.026624087640040428
You email has a chance of 97.40663715564644% of being a spam.
Ratio: 0.18698210028417106
You email has a chance of 84.2472687465627% of being a spam.
Ratio: 0.02265880584114123
You email has a chance of 97.78432398843871% of being a spam.
Ratio: 0.76348771554806
You email has a chance of 56.7058103769789% of being a spam.
Ratio: 1.3138832223547976
You email has a chance of 43.21739275080261% of being a spam.
Ratio: 0.7997538386148971
You email has a chance of 55.5631541683282% of being a spam.
Ratio: 1.2298480006601815
You email has a chance of 44.84610608902195% of being a spam.
Ratio: 0.40413926770230413
You email has a chance of 71.218006860272% of being a spam.
Ratio: 0.99681744675709
You email has a chance of 50.079690640
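
The first ratio in the output above is `nan`. One plausible cause (an assumption, I have not verified it on this dataset) is that both products overflow to infinity for a long email, since every factor is at least 10, and `inf / inf` is `nan`. Note also that `nan < 0.65` is `False`, so `main()` returns 1 in that case. A common way around overflow is to sum logarithms instead of multiplying. The sketch below uses made-up factors and a hypothetical function name `spam_probability_log`.

```python
import math

def spam_probability_log(spam_factors, ham_factors, prior_spam, prior_ham):
    """1 / (1 + R) computed from sums of logs instead of raw products."""
    log_spam = math.log(prior_spam) + sum(math.log(p) for p in spam_factors)
    log_ham = math.log(prior_ham) + sum(math.log(p) for p in ham_factors)
    log_ratio = log_ham - log_spam  # log R
    if log_ratio > 700:             # exp() would overflow; R is effectively infinite
        return 0.0
    return 1.0 / (1.0 + math.exp(log_ratio))

# Made-up factors that mimic the "+ 10" offset for a 400-word email.
spam_factors = [10.5] * 400
ham_factors = [10.2] * 400

# Raw products overflow to inf, so the ratio is nan.
raw_spam = raw_ham = 1.0
for s, h in zip(spam_factors, ham_factors):
    raw_spam *= s
    raw_ham *= h
raw_ratio = raw_ham / raw_spam

log_space_prob = spam_probability_log(spam_factors, ham_factors, 0.5, 0.5)
print(raw_ratio, log_space_prob)
```

All factors must be strictly positive for `math.log` to work, which holds for the offset probabilities used here.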

This small test is not enough to draw conclusions about the accuracy. But it looks quite promising, right?
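
To turn the two printed lists into numbers, one could compare them element by element. Here is a minimal sketch with hypothetical helpers `accuracy` and `confusion_counts`. The labels below are invented for illustration and are not the results of this notebook.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def confusion_counts(y_true, y_pred):
    """Counts of true/false positives/negatives, with spam (1) as the positive class."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        if p == 1:
            counts["tp" if t == 1 else "fp"] += 1
        else:
            counts["tn" if t == 0 else "fn"] += 1
    return counts

# Invented labels, for illustration only (1 = spam, 0 = ham).
example_true = [1, 1, 0, 0, 1, 0]
example_pred = [1, 0, 0, 1, 1, 0]

print(accuracy(example_true, example_pred))
print(confusion_counts(example_true, example_pred))
```

In the notebook, the same functions could be called as `accuracy(sample_result, test_result)` after the test loop has run.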

This is the end. Thank you for reading my notebook. I also want to thank my friends who proposed ways to reduce overfitting and solve other problems in this project.