# Machine Learning Project 2: Spam Detection -- Application of the Naive Bayes Theorem

#### Project description:
In this project, we perform classification for the purpose of spam detection. The Naive Bayes Theorem is used to calculate the probability that an email falls into the spam category, so that at the end we can input our own messages and compute their probability of being spam. The dataset used here was collected from the online learning platform Coursera and contains 5728 sample emails labeled as either 1 or 0 (1 represents spam, 0 represents non-spam). Hopefully this gives some insight into email security and into how spam filtering can improve user satisfaction.

### 1. Naive Bayes Theorem formula review

$$ P(\text{spam}|\text{word}) = \frac{P(\text{word}|\text{spam}) \cdot P(\text{spam})}{P(\text{word}|\text{spam}) \cdot P(\text{spam}) + P(\text{word}|\text{ham}) \cdot P(\text{ham})}$$

One can further expand the above formula into the following version:

$$ P(\text{spam}|\text{word}) = \frac{P(\text{word}_{1}|\text{spam}) \cdot P(\text{word}_{2}|\text{spam}) \cdots P(\text{word}_{n}|\text{spam}) \cdot P(\text{spam})} {P(\text{word}_{1}|\text{spam}) \cdot P(\text{word}_{2}|\text{spam}) \cdots P(\text{word}_{n}|\text{spam}) \cdot P(\text{spam}) + P(\text{word}_{1}|\text{ham}) \cdot P(\text{word}_{2}|\text{ham}) \cdots P(\text{word}_{n}|\text{ham}) \cdot P(\text{ham})} $$

In this case, what we need to calculate is the probability of an email being spam given a particular group of words. The n word events in both the numerator and the denominator are assumed to be conditionally independent given the class (spam or ham); this is the "naive" assumption, and it is what allows the joint probability to be written as a product. Ham means non-spam. We will base our analysis on this formula: the methodology in this project is to code each part of it to compute the probability.
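To make the expanded formula concrete, here is a minimal sketch for a two-word message. All numbers below (the per-word likelihoods and the priors) are hypothetical and are not taken from the dataset; they only illustrate how the numerator and denominator are assembled.

```python
import math

# Hypothetical per-word likelihoods for a two-word message (illustrative values only)
p_words_given_spam = [0.20, 0.10]   # P(word_1|spam), P(word_2|spam)
p_words_given_ham = [0.02, 0.05]    # P(word_1|ham),  P(word_2|ham)
p_spam, p_ham = 0.25, 0.75          # hypothetical priors

# Numerator: product of the spam-side likelihoods times the spam prior
numerator = math.prod(p_words_given_spam) * p_spam
# Denominator: the same spam term plus the corresponding ham term
denominator = numerator + math.prod(p_words_given_ham) * p_ham

posterior_spam = numerator / denominator
print(f"P(spam | word_1, word_2) = {posterior_spam:.4f}")
```

With these made-up values the spam term dominates the denominator, so the posterior comes out well above the 0.25 prior.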

### 2. Load libraries and dataset

In [57]:
import pandas as pd
import numpy as np

In [3]:
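emails=pd.read_csv("emails.csv")
# note: head and tail are referenced here without parentheses, so print shows the
# bound-method repr (which embeds the whole DataFrame); emails.head() would show five rows
print(emails.head)
print(emails.tail)
print(emails.columns)
print(emails.ndim)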

<bound method NDFrame.head of                                                    text  spam
0     Subject: naturally irresistible your corporate...     1
1     Subject: the stock trading gunslinger  fanny i...     1
2     Subject: unbelievable new homes made easy  im ...     1
3     Subject: 4 color printing special  request add...     1
4     Subject: do not have money , get software cds ...     1
...                                                 ...   ...
5723  Subject: re : research and development charges...     0
5724  Subject: re : receipts from visit  jim ,  than...     0
5725  Subject: re : enron case study update  wow ! a...     0
5726  Subject: re : interest  david ,  please , call...     0
5727  Subject: news : aurora 5 . 2 update  aurora ve...     0

[5728 rows x 2 columns]>
<bound method NDFrame.tail of                                                    text  spam
0     Subject: naturally irresistible your corporate...     1
1     Subject: the stock trading gunslinger  f

### 3. Process dataset

In this section we will process the dataset for the convenience of analysis.

In [4]:
# lowercase each email, split it on whitespace, and keep each distinct word once;
# assigning the whole column at once avoids chained indexing (emails["words"][i]=...)
emails["words"]=emails["text"].astype(str).str.lower().str.split().apply(lambda words: list(set(words)))
print(emails.head)


<bound method NDFrame.head of                                                    text  spam  \
0     Subject: naturally irresistible your corporate...     1   
1     Subject: the stock trading gunslinger  fanny i...     1   
2     Subject: unbelievable new homes made easy  im ...     1   
3     Subject: 4 color printing special  request add...     1   
4     Subject: do not have money , get software cds ...     1   
...                                                 ...   ...   
5723  Subject: re : research and development charges...     0   
5724  Subject: re : receipts from visit  jim ,  than...     0   
5725  Subject: re : enron case study update  wow ! a...     0   
5726  Subject: re : interest  david ,  please , call...     0   
5727  Subject: news : aurora 5 . 2 update  aurora ve...     0   

                                                  words  
0     [aim, formats, easy, ;, creativeness, ieader, ...  
1     [edt, slung, hawthorn, incredible, hepburn, gr...  
2     [credit, 

The aim of adding this new column, which holds the same words as the "text" column but lowercased and without repetitions, is to ensure that case variants of the same word (e.g. "Money", "money", "MONEY") are all converted to the lowercase version (only "money" in this case) and appear only once per email. This makes sure each word is counted at most once per email, which is what the frequency counts below rely on.
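The effect of this preprocessing can be sketched on a single string. The helper name `unique_lowercase_words` and the sample sentences are hypothetical, introduced only for illustration. The second call shows a limitation of plain whitespace splitting: punctuation attached to a word yields a separate token.

```python
def unique_lowercase_words(text):
    """Lowercase the text, split on whitespace, and drop repeated words."""
    return set(text.lower().split())

# Case variants collapse into a single token
print(sorted(unique_lowercase_words("Money money MONEY , get rich now")))
# -> [',', 'get', 'money', 'now', 'rich']

# Attached punctuation is not stripped, so these stay distinct tokens
print(sorted(unique_lowercase_words("Money! money")))
# -> ['money', 'money!']
```

Because the dataset text already separates punctuation with spaces (e.g. "money , get"), a free-form message with "money!" may not match the dictionary entry for "money".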

### 4. Calculate P(spam) and P(ham)

Let's start with the simplest part of the Naive Bayes formula above: P(spam) and P(ham).

In [6]:
spam_number=len(emails[emails["spam"]==1])
ham_number=len(emails[emails["spam"]==0])
print("The number of emails which are spam is ",spam_number)
print("The number of emails which are ham is ",ham_number)

The number of emails which are spam is  1368
The number of emails which are ham is  4360


Then we need to calculate the probabilities.

In [11]:
spam_prop=round(spam_number/len(emails),4)
ham_prop=round(ham_number/len(emails),4)
print(f"The proportion of emails which are spam is {spam_prop*100:.2f}%")
print(f"The proportion of emails which are ham is {ham_prop*100:.2f}% ")

The proportion of emails which are spam is 23.88%
The proportion of emails which are ham is 76.12% 


### 5. Calculate P(word|spam) and P(word|ham) for input messages

To calculate the probability of a word appearing given that an email is spam, and likewise given that it is ham, we need to know how many spam and ham emails each word appears in. We can therefore build a dictionary that looks roughly like this (the counts here are made up, just as an example):

$$ \{\text{"money"}:\{\text{"spam"}:567,\text{"ham"}:345\},\ \text{"lottery"}:\{\text{"spam"}:123,\text{"ham"}:456\},\ \ldots\} $$

In [70]:
word_freq_dictionary={}
email_words=emails["words"].values
for words_list,row in zip(email_words,emails["spam"].index):
    for word in words_list:
        if word not in word_freq_dictionary:
            word_freq_dictionary[word]={"spam":0,"ham":0}
    
        if emails["spam"][row]==1:
            word_freq_dictionary[word]["spam"]+=1
        else:
            word_freq_dictionary[word]["ham"]+=1

# display only the five words that appear in the most spam emails, since the full dictionary is too long to show
top_items = dict(sorted(word_freq_dictionary.items(), key=lambda x: x[1]['spam'], reverse=True)[:5])
print(top_items)

# let's try some words
print(f"The frequency of word 'money' in spam and ham: {word_freq_dictionary['money']}")
print(f"The frequency of word 'million' in spam and ham: {word_freq_dictionary['million']}")
print(f"The frequency of word 'tomorrow' in spam and ham: {word_freq_dictionary['tomorrow']}")

# for some words that may not appear in the dictionary
try:
    word_freq_dictionary['wefqwfef']
except KeyError:
    print("This word doesn't exist in the dictionary! Please try other words!")

{'subject:': {'spam': 1368, 'ham': 4360}, '.': {'spam': 1336, 'ham': 4322}, 'to': {'spam': 1161, 'ham': 4056}, ',': {'spam': 1158, 'ham': 4142}, 'the': {'spam': 1083, 'ham': 3999}}
The frequency of word 'money' in spam and ham: {'spam': 280, 'ham': 87}
The frequency of word 'million' in spam and ham: {'spam': 102, 'ham': 71}
The frequency of word 'tomorrow' in spam and ham: {'spam': 11, 'ham': 234}
This word doesn't exist in the dictionary! Please try other words!
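The counting logic can also be sketched on a tiny corpus where the result is easy to verify by hand. The four emails below are hypothetical, and `collections.defaultdict` is used here as one possible alternative to the explicit "if word not in" check; it is not what the notebook itself uses.

```python
from collections import defaultdict

# Hypothetical mini-corpus: (set of distinct words in the email, spam label)
toy_emails = [
    ({"win", "money", "now"}, 1),
    ({"money", "transfer"}, 1),
    ({"meeting", "tomorrow"}, 0),
    ({"money", "report", "tomorrow"}, 0),
]

# Each new word starts with zero counts for both classes
word_freq = defaultdict(lambda: {"spam": 0, "ham": 0})
for words, label in toy_emails:
    key = "spam" if label == 1 else "ham"
    for word in words:
        word_freq[word][key] += 1  # number of emails of this class containing the word

# P(word|spam) is the share of spam emails that contain the word
n_spam = sum(1 for _, label in toy_emails if label == 1)
p_money_given_spam = word_freq["money"]["spam"] / n_spam
print(word_freq["money"], p_money_given_spam)
```

Here "money" appears in both spam emails and in one ham email, so its spam count is 2, its ham count is 1, and the estimated P(money|spam) is 2/2.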


Now that we know how many spam and ham emails each word appears in, we can expand $$ P(\text{word}|\text{spam}) $$ into $$ P(\text{word}_{1}|\text{spam}) \cdot P(\text{word}_{2}|\text{spam}) \cdots P(\text{word}_{n}|\text{spam}) $$
and expand $$ P(\text{word}|\text{ham}) $$ into $$ P(\text{word}_{1}|\text{ham}) \cdot P(\text{word}_{2}|\text{ham}) \cdots P(\text{word}_{n}|\text{ham}) $$
Each factor is estimated as the number of spam (or ham) emails containing the word divided by the total number of spam (or ham) emails.

Moreover, it is more convenient to wrap this in a function so that we do not have to rewrite the loop for every message.

In [71]:
def prop_word_given_spam_or_ham(text):
    """Return (P(words|spam), P(words|ham)) for the distinct words in text."""
    words=set(text.lower().split())  # lowercase and de-duplicate, as in the "words" column
    prop_word_given_spam=1 #initialization
    prop_word_given_ham=1 #initialization
    for word in words:
        # words that never appear in the dataset are skipped
        if word in word_freq_dictionary:
            prop_word_given_spam*=word_freq_dictionary[word]["spam"]/spam_number
            prop_word_given_ham*=word_freq_dictionary[word]["ham"]/ham_number

    return prop_word_given_spam,prop_word_given_ham
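Two properties of this function are worth seeing on a small example: unknown words contribute nothing, and a word whose count is zero in one class drives that class's whole product to zero. The sketch below mirrors the function above but uses a hypothetical three-word dictionary (`toy_freq`) and hypothetical class sizes so that it runs on its own.

```python
# Hypothetical word counts and class sizes (illustrative only)
toy_freq = {
    "money": {"spam": 8, "ham": 2},
    "tomorrow": {"spam": 1, "ham": 6},
    "heir": {"spam": 3, "ham": 0},   # never seen in a ham email
}
toy_spam_number, toy_ham_number = 10, 10

def toy_likelihoods(text):
    """Return (P(words|spam), P(words|ham)) under the independence assumption."""
    p_spam_side, p_ham_side = 1.0, 1.0
    for word in set(text.lower().split()):
        if word in toy_freq:  # unknown words are skipped, as in the function above
            p_spam_side *= toy_freq[word]["spam"] / toy_spam_number
            p_ham_side *= toy_freq[word]["ham"] / toy_ham_number
    return p_spam_side, p_ham_side

print(toy_likelihoods("Money tomorrow"))  # both words known: roughly (0.08, 0.12)
print(toy_likelihoods("money heir"))      # "heir" has ham count 0, so the ham side is 0.0
print(toy_likelihoods("hello world"))     # no known words: both sides stay at 1.0
```

When no word is known, both likelihoods stay at 1 and the posterior reduces to the prior P(spam). When the ham side is exactly zero, the posterior becomes exactly 1 regardless of the other words.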

### 6. Calculate P(spam|word)

In [73]:
def prop_spam_word(final_prop_word_given_spam,final_prop_word_given_ham):
    prop_spam_given_word=(final_prop_word_given_spam*spam_prop)/((final_prop_word_given_spam*spam_prop)+(final_prop_word_given_ham*ham_prop))
    return prop_spam_given_word

Alternatively, we can combine these two functions into one for convenience.

In [74]:
def prop_word_given_spam_or_ham(text):
    """Return P(spam|words) for text; this redefinition replaces the two-step version above."""
    words=set(text.lower().split())
    prop_word_given_spam=1 #initialization
    prop_word_given_ham=1 #initialization
    for word in words:
        if word in word_freq_dictionary:
            prop_word_given_spam*=word_freq_dictionary[word]["spam"]/spam_number
            prop_word_given_ham*=word_freq_dictionary[word]["ham"]/ham_number
    # Bayes' theorem: weight each likelihood by its class prior and normalize
    numerator=prop_word_given_spam*spam_prop
    prop_spam_given_word=numerator/(numerator+prop_word_given_ham*ham_prop)

    return prop_spam_given_word

### 7. Experiment

##### Let's randomly input some messages and check whether the model works well or not.

In [81]:
message1="Check this email out right away! Otherwise you will lose chance of being a millionaire! Let's make money together."
message2="Check the link down below to learn how to learn data science and land a job as a data scientist."
message3="According to the weather forecast, it will being raining tomorrow."
message4="Dear student! Tomorrow will be the deadline. Please submit your homework to the website ASAP!"

# and here is the funny one. My Polish friend told me there used to be a popular but quite obvious type of scam message, and some people still fell prey to it.
message5="Hello! I am King Charles I, and I'm actualy still alive. I have a large sum of money that no one wants to inherit.\n" +\
"I want you to be my heir. Please open the link down below and put into your bank account number and password right now. You will earn a billion\n"+\
"and get rich. Make sure do it within today, otherwise you will lose chance!"

print(f"The probability of message1 being spam is {prop_word_given_spam_or_ham(message1)*100:.2f}%")
print(f"The probability of message2 being spam is {prop_word_given_spam_or_ham(message2)*100:.2f}%")
print(f"The probability of message3 being spam is {prop_word_given_spam_or_ham(message3)*100:.2f}%")
print(f"The probability of message4 being spam is {prop_word_given_spam_or_ham(message4)*100:.2f}%")
print(f"The probability of message5 being spam is {prop_word_given_spam_or_ham(message5)*100:.2f}%")

The probability of message1 being spam is 98.60%
The probability of message2 being spam is 15.38%
The probability of message3 being spam is 0.33%
The probability of message4 being spam is 100.00%
The probability of message5 being spam is 93.97%


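##### It seems the model works reasonably well on these examples. Unsurprisingly, the probability of message5 being spam is quite high. Amusingly, though, message4 (a harmless homework reminder) is also flagged as spam with a probability that rounds to 100%, which is a false positive.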
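One possible explanation for the message4 result, which has not been checked against the actual dictionary, is that a single word with a ham count of zero makes the whole ham-side product zero, forcing the posterior to exactly 1. A common remedy is add-one (Laplace) smoothing combined with log-space arithmetic. Neither technique is part of the notebook above; the sketch below uses hypothetical counts and a hypothetical function name (`smoothed_posterior`) purely for illustration.

```python
import math

# Hypothetical counts (illustrative only): emails containing each word, per class
toy_freq = {
    "deadline": {"spam": 2, "ham": 0},    # zero ham count: unsmoothed ham product would be 0
    "homework": {"spam": 1, "ham": 15},
    "submit": {"spam": 1, "ham": 12},
}
n_spam, n_ham = 10, 30
prior_spam = n_spam / (n_spam + n_ham)
prior_ham = 1 - prior_spam

def smoothed_posterior(text, alpha=1.0):
    """P(spam | words) with add-alpha smoothing (alpha > 0), computed in log space."""
    log_spam = math.log(prior_spam)
    log_ham = math.log(prior_ham)
    for word in set(text.lower().split()):
        if word in toy_freq:
            # +alpha on the count, +2*alpha on the class size (word present or absent)
            log_spam += math.log((toy_freq[word]["spam"] + alpha) / (n_spam + 2 * alpha))
            log_ham += math.log((toy_freq[word]["ham"] + alpha) / (n_ham + 2 * alpha))
    # Subtract the larger log value before exponentiating to avoid underflow
    m = max(log_spam, log_ham)
    spam_term = math.exp(log_spam - m)
    ham_term = math.exp(log_ham - m)
    return spam_term / (spam_term + ham_term)

p = smoothed_posterior("submit homework deadline")
print(f"Smoothed P(spam) = {p:.4f}")
```

With these made-up counts, the unsmoothed formula would return exactly 1 because "deadline" never occurs in ham, while the smoothed version lets the strongly ham-leaning words "homework" and "submit" pull the posterior below 0.5. Summing logarithms instead of multiplying many small probabilities also keeps long messages from underflowing to zero.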