<h1>SIMPLE SPAM FILTER</h1>

<code><bold>P(spam|word) = P(word|spam)*P(spam)/P(word)</bold></code>

<code><bold>P(spam|w_1, w_2,..., w_n) = P(w_1, w_2,..., w_n|spam)\*P(spam)/P(w_1, w_2,..., w_n)</bold></code>

<code><bold>P(w_1, w_2,..., w_n) = P(w_1|w_2,..., w_n)\*P(w_2|w_3,..., w_n)\*...\*P(w_n)</bold></code>

Assuming the words are conditionally independent given the class (the naive Bayes assumption):

<code><bold>P(w_1, w_2,..., w_n|spam) = Pi\<i = 1 -> n>(P(w_i|spam))</bold></code>

Therefore:

<code><bold>P(spam|w_1, w_2,..., w_n) = P(w_1, w_2,..., w_n|spam)\*P(spam)/P(w_1, w_2,..., w_n)</bold></code>

<code><bold>P(spam|w_1, w_2,..., w_n) = Pi\<i = 1 -> n>(P(w_i|spam))\*P(spam)/P(w_1, w_2,..., w_n)</bold></code>

<code><bold>R = P(ham|w_1, w_2,..., w_n)/P(spam|w_1, w_2,..., w_n)</bold></code>

<code><bold>R = (Pi\<i = 1 -> n>(P(w_i|ham))\*P(ham))/(Pi\<i = 1 -> n>(P(w_i|spam))\*P(spam))</bold></code>

<code><bold>P(spam|w_1, w_2,..., w_n) = P(spam|w_1, w_2,..., w_n)/(P(spam|w_1, w_2,..., w_n) + P(ham|w_1, w_2,..., w_n))</bold></code>

<code><bold>P(spam|w_1, w_2,..., w_n) = 1/(1 + (P(ham|w_1, w_2,..., w_n)/P(spam|w_1, w_2,..., w_n)))</bold></code>

<code><bold>P(spam|w_1, w_2,..., w_n) = 1/(1 + R)</bold></code>
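
As a quick numeric check of the last step, the sketch below plugs made-up per-word likelihoods and priors into the ratio R and compares 1/(1 + R) with the directly normalized posterior. All numbers and variable names here are illustrative assumptions, not values from the dataset.

```python
import math

# Hypothetical per-word likelihoods for a two-word message (illustrative only).
p_word_given_spam = [0.8, 0.6]
p_word_given_ham = [0.1, 0.3]
p_spam, p_ham = 0.4, 0.6  # assumed class priors

# Unnormalized class scores under the naive independence assumption.
spam_score = math.prod(p_word_given_spam) * p_spam
ham_score = math.prod(p_word_given_ham) * p_ham

ratio = ham_score / spam_score                            # R
posterior_via_ratio = 1 / (1 + ratio)                     # 1 / (1 + R)
posterior_direct = spam_score / (spam_score + ham_score)  # normalized form

print(ratio, posterior_via_ratio, posterior_direct)
```

Both posterior values agree because P(w_1, w_2,..., w_n) cancels when the two class posteriors are divided, so it never has to be computed.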

In [7]:
# import necessary modules
import pandas as pd
import re, math
# get the dataset
fraud_email = pd.read_csv('fraud_email.csv')
# get the spam_list
spam_list = fraud_email[(fraud_email['Class'] == 1)]['Text'].tolist()
# get the ham_list
ham_list = fraud_email[(fraud_email['Class'] == 0)]['Text'].tolist()
# get the number of rows (instances) in the dataset
dataset_row_count = len(fraud_email)
# get the probabilities of ham/spam mail in this dataset
prob_ham_in_dataset = len(ham_list)/dataset_row_count
prob_spam_in_dataset = len(spam_list)/dataset_row_count
print('setup and initialization are completed!')

setup and initialization are completed!


In [9]:
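def extract_words(text):
    # Raw string avoids invalid-escape warnings for \* and \. in the pattern.
    split_list = re.split(r'; |;|, |,|: |:|\t|\*|\n|! | |\.', str(text))
    word_list = [word.lower() for word in split_list]
    # Drop duplicates while keeping first-seen order.
    word_list = list(dict.fromkeys(word_list))
    # Consecutive delimiters leave one empty token behind; remove it.
    if '' in word_list:
        word_list.remove('')
    return word_list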
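
To make the tokenization step concrete, here is a self-contained sketch that collects tokens with `re.findall` instead of splitting on a list of delimiters. The function name `tokenize` and its regex are assumptions for illustration; this is not the notebook's `extract_words` and it treats punctuation slightly differently.

```python
import re

def tokenize(text):
    """Return the unique lowercase tokens of text, in first-seen order."""
    # Keep runs of letters, digits and apostrophes; everything else separates tokens.
    tokens = re.findall(r"[a-z0-9']+", str(text).lower())
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(tokens))

print(tokenize("Win money now! Win a FREE prize, now."))
# ['win', 'money', 'now', 'a', 'free', 'prize']
```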

In [10]:
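def prob_in_set(word, mails):
    # Fraction of mails that contain `word` (case-insensitive substring match).
    # The parameter is named `mails` so it does not shadow the built-in `list`.
    count = 0
    for mail in mails:
        if word in str(mail).lower():
            count += 1
    return count/len(mails)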
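
Note that `word in str(mail).lower()` is a substring test, so a short word such as "art" is also counted in mails that only contain "party". A minimal sketch of a whole-word variant follows; `doc_frequency` and `toy_mails` are hypothetical names, and the `\b` boundaries only behave as expected for tokens that start and end with a letter or digit.

```python
import re

def doc_frequency(word, mails):
    """Fraction of mails that contain `word` as a whole token."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    hits = sum(1 for mail in mails if pattern.search(str(mail).lower()))
    # Guard against an empty list instead of dividing by zero.
    return hits / len(mails) if mails else 0.0

toy_mails = ["Party tonight", "Modern art sale", "Free money"]
print(doc_frequency("art", toy_mails))  # only the second mail matches
```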

In [20]:
def product_of_num_list(num_list):
    result = 1
    for num in num_list:
        result *= float(num)
    return result

In [14]:
def main(text):
    word_list = extract_words(text)
    prob_in_spam_list = list()
    prob_in_ham_list = list()
    for word in word_list:
        prob_in_spam_list.append(prob_in_set(word, spam_list))
        prob_in_ham_list.append(prob_in_set(word, ham_list))
    ratio = (product_of_num_list(prob_in_ham_list) * prob_ham_in_dataset) / \
        (product_of_num_list(prob_in_spam_list) * prob_spam_in_dataset)
    print('Ratio: ' + str(ratio))
    prob = 1 / (1 + ratio)
    print("Your email has a chance of " + str(prob*100) + "% of being spam.")

In [None]:
text = "I'm just a little bit caught in the middle\
Life is a maze and love is a riddle\
I don't know where to go; can't do it alone; I've tried\
And I don't know why\
Slow it down\
Make it stop\
Or else my heart is going to pop\
\'Cause it\'s too much\
Yeah, it's a lot\
To be something I'm not"
main(text)

In [23]:
def product_of_num_list(num_list):
    result = 1
    for num in num_list:
        result *= float(num + 10)
    return result
# 10 is added to every probability as a smoothing constant, in the spirit of
# the Laplace correction, so that no factor in the product is 0.0.
# Note that the standard Laplace correction adds to the word counts,
# not to the probabilities themselves.
# The value 10 has no principled justification: it was tuned by hand against
# the test results. If it is too large, the product overflows to infinity;
# if it is too small, the model overfits.
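
For comparison, a more conventional add-one smoothing adjusts the counts instead of the probabilities, and summing logarithms sidesteps the overflow to infinity mentioned in the comment. The sketch below runs under those assumptions; `smoothed_prob`, `log_ratio` and the toy counts are hypothetical and not part of the notebook.

```python
import math

def smoothed_prob(count, total, alpha=1.0):
    """Add-alpha estimate of P(word | class) from document counts.

    The 2 * alpha in the denominator covers the two outcomes
    (word present / word absent), so the estimate is never 0 or 1.
    """
    return (count + alpha) / (total + 2 * alpha)

def log_ratio(spam_counts, ham_counts, n_spam, n_ham):
    """Return log R = log P(ham | words) - log P(spam | words)."""
    total = n_spam + n_ham
    log_r = math.log(n_ham / total) - math.log(n_spam / total)
    for s, h in zip(spam_counts, ham_counts):
        log_r += math.log(smoothed_prob(h, n_ham))
        log_r -= math.log(smoothed_prob(s, n_spam))
    return log_r

# Toy counts: how many of 100 spam and 100 ham mails contain each of two words.
log_r = log_ratio(spam_counts=[40, 0], ham_counts=[5, 10], n_spam=100, n_ham=100)
prob_spam = 1 / (1 + math.exp(log_r))
print(prob_spam)
```

The second word never appears in spam, yet its smoothed probability stays positive, so the ratio remains finite. For very large values of `log_r`, `math.exp` raises `OverflowError`, so a real implementation would clamp `log_r` first.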

In [24]:
text = "I'm just a little bit caught in the middle\
Life is a maze and love is a riddle\
I don't know where to go; can't do it alone; I've tried\
And I don't know why\
Slow it down\
Make it stop\
Or else my heart is going to pop\
\'Cause it\'s too much\
Yeah, it's a lot\
To be something I'm not"
main(text)

Ratio: 0.6714713818373087
You email has a chance of 59.827527462706755% of being a spam.


In [25]:
def idf(word):
    count = 0
    for mail in fraud_email['Text'].tolist():
        if word in str(mail).lower():
            count += 1
    return math.log(1 + (dataset_row_count/(count + 1)))
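
To illustrate how this smoothed IDF separates common words from rare ones, the sketch below applies the same formula to a tiny made-up corpus. `toy_idf` and `corpus` are hypothetical names used only for this example.

```python
import math

def toy_idf(word, corpus):
    """Same formula as idf above: log(1 + N / (df + 1)), with substring matching."""
    df = sum(1 for doc in corpus if word in doc.lower())
    return math.log(1 + len(corpus) / (df + 1))

corpus = ["the cat sat", "the dog ran", "the prize is free", "the end"]
print(toy_idf("the", corpus))    # appears in every document: low score
print(toy_idf("prize", corpus))  # appears in one document: higher score
```

With a cutoff of 1, as used in the next cell, "the" would be skipped and "prize" would be kept in this toy corpus.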

In [28]:
def main(text):
    word_list = extract_words(text)
    prob_in_spam_list = list()
    prob_in_ham_list = list()
    for word in word_list:
        if idf(word) > 1:
            prob_in_spam_list.append(prob_in_set(word, spam_list))
            prob_in_ham_list.append(prob_in_set(word, ham_list))
    ratio = (product_of_num_list(prob_in_ham_list) * prob_ham_in_dataset) / \
        (product_of_num_list(prob_in_spam_list) * prob_spam_in_dataset)
    print('Ratio: ' + str(ratio))
    prob = 1 / (1 + ratio)
    print("Your email has a chance of " + str(prob*100) + "% of being spam.")
    # Classify as spam (1) only when the estimated probability reaches 0.65.
    return 0 if prob < 0.65 else 1

In [29]:
start = 1000
stop = 1020
sample = fraud_email['Text'].tolist()[start:stop]
sample_result = fraud_email['Class'].tolist()[start:stop]
test_result = list()
for mail in sample:
    test_result.append(main(str(mail)))
print("Sample result: " + str(sample_result))
print("Test result: " + str(test_result))

Ratio: nan
You email has a chance of nan% of being a spam.
Ratio: 0.5417626251102475
You email has a chance of 64.86082771195032% of being a spam.
Ratio: 0.01333739730200361
You email has a chance of 98.68381475533083% of being a spam.
Ratio: 0.026624087640040428
You email has a chance of 97.40663715564644% of being a spam.
Ratio: 0.18698210028417106
You email has a chance of 84.2472687465627% of being a spam.
Ratio: 0.02265880584114123
You email has a chance of 97.78432398843871% of being a spam.
Ratio: 0.76348771554806
You email has a chance of 56.7058103769789% of being a spam.
Ratio: 1.3138832223547976
You email has a chance of 43.21739275080261% of being a spam.
Ratio: 0.7997538386148971
You email has a chance of 55.5631541683282% of being a spam.
Ratio: 1.2298480006601815
You email has a chance of 44.84610608902195% of being a spam.
Ratio: 0.40413926770230413
You email has a chance of 71.218006860272% of being a spam.
Ratio: 0.99681744675709
You email has a chance of 50.079690640