
# Homework 2: Naive Bayes Email Spam Filter

In this homework, we implement a Naive Bayes classifier to build a simple email spam filter. The filter is trained on `test_emails`, a list of pairs `[email, label]`. In each pair the first entry is a string (the email text) and the second entry is the label: 1 if the email is spam and 0 if it is not.

In [3]:
dictionary = ['hello', 'students', 'please', 'learn', 'for', 'the', 'exam', 'buy', 'drugs', 'today', 'sun', 'is', 'shining', 'in', 'nagoya', 'lets', 'sell', 'how', 'are', 'you', 'today?', 'do', 'your', 'homework', 'want', 'free', 'solutions?', 'hey', 'always', 'ask', 'questions', 'if','have', 'any', 'math', 'not', 'good', 'submit', 'pay'] 

test_emails=[
             ["hello students, please learn for the exam", 0],
             ["hello students, please buy drugs", 1],    
             ["hello, today the sun is shining in nagoya", 0],
             ["lets sell drugs in nagoya", 1],
             ["today learn drugs", 1],
             ["how are you today?", 0],
             ["hello students, please do your homework", 0],
             ["hello, do you want free homework solutions?", 1],
             ["hey students, please always ask questions if you have any", 0],
             ["math is not good", 1],
             ["math is good", 0],
             ["please submit your homework", 0],
             ["please buy questions", 1],
             ["please pay for the exam", 1]          
             ]



The feature space for our spam filter is $\mathcal{X}=\{0,1\}^d$, where $d$ denotes the number of words in the dictionary. For a feature vector $x \in \mathcal{X}$ representing an email, the entry $x_i$ for $i=1,\dots,d$ is $1$ if the $i$-th word of the dictionary is contained in the email and $0$ otherwise.

# **Exercise 1**
Implement a function which creates a feature vector out of an email and a function which creates a training set out of test emails. 

You will need to check whether an email contains a given word. Python offers several ways to test whether one string contains another; consult the documentation to find them.
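As a small illustration of that choice, the sketch below contrasts two ways of testing containment. The helper names `contains_substring` and `contains_word` are hypothetical and not part of the assignment; the second one assumes that only "." and "," need to be stripped before splitting on whitespace.

```python
def contains_substring(word, email):
    """True if `word` occurs anywhere in `email`, even inside a longer word."""
    return word in email


def contains_word(word, email):
    """True if `word` is one of the whitespace-separated tokens of `email`.

    Assumption: only '.' and ',' are treated as punctuation to strip.
    """
    tokens = email.replace(".", " ").replace(",", " ").split()
    return word in tokens


email = "this math homework"
print(contains_substring("is", email))  # True: "is" occurs inside "this"
print(contains_word("is", email))       # False: "is" is not a separate token
```

The substring test is the simplest option, but it can report short dictionary words such as "is" or "in" as present when they only occur inside other words.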

In [86]:
import numpy as np

def email_to_feature(dictionary, email):
    """Create a binary feature vector of shape (len(dictionary), 1) for an email."""

    feature_vector = np.zeros((len(dictionary), 1))

    # Entry i is 1 if dictionary[i] occurs anywhere in the email string
    # (substring match, so it can also match inside longer words).
    for i, word in enumerate(dictionary):
        if word in email:
            feature_vector[i] = 1

    return feature_vector

def emails_to_test_samples(dictionary, test_emails):
    """Create [feature_vector, spam_label] samples from the dictionary and emails."""

    test_samples = []

    for text, label in test_emails:
        test_samples.append([email_to_feature(dictionary, text), label])

    return test_samples

training_set = emails_to_test_samples(dictionary, test_emails)



 **Recall from Lecture 6:**

Given a training set  $\mathcal{T} = \left( (x^{(1)}, y^{(1)}) , \dots, (x^{(n)}, y^{(n)})   \right)$ we calculate for $i=1,\dots,d$ the following numbers
\begin{align*}
\phi_{i\mid y=1} &= \frac{1+\sum_{j=1}^n I(x^{(j)}_i = 1  \,\wedge \, y^{(j)}=1 ) }{2+\sum_{j=1}^n I(y^{(j)}=1)}\,,\\
	\phi_{i\mid y=0} &= \frac{1+\sum_{j=1}^n I(x^{(j)}_i = 1  \,\wedge \, y^{(j)}=0 )}{2+\sum_{j=1}^n I(y^{(j)}=0)}\,,\\
		\phi_{y=1} &= \frac{1+\sum_{j=1}^n I(y^{(j)} = 1)}{2+n} \,.
\end{align*}
Here $I$ denotes the indicator function. We use Laplace smoothing (hence the $1+$ in the numerators and the $2+$ in the denominators) so that a word that never appears in one class of the training emails is not assigned probability $0$.
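The three estimates can also be computed for all words at once. The following is a minimal NumPy sketch, assuming the training set has been stacked into an $n \times d$ array `X` of 0/1 features and a length-$n$ array `y` of labels; the function name `estimate_parameters` and the toy data are made up for illustration.

```python
import numpy as np


def estimate_parameters(X, y):
    """Laplace-smoothed estimates (phi_{i|y=1}, phi_{i|y=0}, phi_{y=1}).

    X: (n, d) array of 0/1 features, y: (n,) array of 0/1 labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(y)
    spam = y == 1
    ham = y == 0
    # Numerator: 1 + number of class emails containing word i.
    # Denominator: 2 + number of emails in that class.
    phi_given_spam = (1 + X[spam].sum(axis=0)) / (2 + spam.sum())
    phi_given_ham = (1 + X[ham].sum(axis=0)) / (2 + ham.sum())
    phi_spam = (1 + spam.sum()) / (2 + n)
    return phi_given_spam, phi_given_ham, phi_spam


# Toy data: 4 emails, 2 dictionary words.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])
phi1, phi0, phi_y = estimate_parameters(X, y)
# phi1 = [0.75, 0.5], phi0 = [0.25, 0.5], phi_y = 0.5
print(phi1, phi0, phi_y)
```

For the first word, two of the two spam emails contain it, so $\phi_{1\mid y=1} = (1+2)/(2+2) = 0.75$; without smoothing the estimate would be exactly $1$.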

Now assume we get a new feature (i.e. someone sends us an email) $x \in \{0,1\}^d$. Then we can calculate for each word $i=1,\dots,d$ the probabilities
\begin{align*}
P(x_i = 1 \mid y=1) &= \phi_{i\mid y=1}\,,\qquad &&P(x_i = 1 \mid y=0) = \phi_{i\mid y=0} \,,\\
P(x_i = 0 \mid y=1) &= 1- \phi_{i\mid y=1}\,,\qquad &&P(x_i = 0 \mid y=0) = 1-\phi_{i\mid y=0} \,.
\end{align*}

By the Naive Bayes assumption we have for $x \in \{0,1\}^d$
\begin{align*}
P(x \mid y)  = \prod_{i=1}^d P(x_i \mid y)\,.
\end{align*}
The probability of the new email being spam is then
\begin{align*}
P(y=1 \mid x) &= \frac{ P(x \mid y=1) P(y=1)}{P(x)}\\
&= \frac{\prod_{i=1}^d P(x_i \mid y = 1) \cdot \phi_{y=1}}{\prod_{i=1}^d P(x_i \mid y = 1) \cdot \phi_{y=1} + \prod_{i=1}^d P(x_i \mid y = 0) (1-\phi_{y=1})}\,.
\end{align*}
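For a long dictionary the products above become very small, so one common variant evaluates the same formula with sums of logarithms. The sketch below is one way to do that, not the method prescribed by the lecture; the function name `spam_posterior` and the example parameter values are assumptions, and it relies on Laplace smoothing keeping every $\phi$ strictly between $0$ and $1$ so that all logarithms are finite.

```python
import numpy as np


def spam_posterior(x, phi_given_spam, phi_given_ham, phi_spam):
    """P(y=1 | x) for a 0/1 feature vector x, computed in log space."""
    x = np.asarray(x, dtype=float)
    phi_given_spam = np.asarray(phi_given_spam, dtype=float)
    phi_given_ham = np.asarray(phi_given_ham, dtype=float)

    # log( prod_i P(x_i | y=1) * phi_{y=1} )
    log_spam = np.log(phi_spam) + np.sum(
        x * np.log(phi_given_spam) + (1 - x) * np.log(1 - phi_given_spam)
    )
    # log( prod_i P(x_i | y=0) * (1 - phi_{y=1}) )
    log_ham = np.log(1 - phi_spam) + np.sum(
        x * np.log(phi_given_ham) + (1 - x) * np.log(1 - phi_given_ham)
    )
    # Numerator divided by (spam term + ham term), both kept as logs.
    return float(np.exp(log_spam - np.logaddexp(log_spam, log_ham)))


# Made-up parameters for a 2-word dictionary; the email contains only word 1.
p = spam_posterior([1, 0], [0.75, 0.5], [0.25, 0.5], 0.5)
print(p)  # approximately 0.75
```

Checking by hand: $P(x \mid y=1) = 0.75 \cdot 0.5 = 0.375$ and $P(x \mid y=0) = 0.25 \cdot 0.5 = 0.125$, so $P(y=1 \mid x) = 0.1875 / (0.1875 + 0.0625) = 0.75$.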


# **Exercise 2:** 
Use the above explanation of the Naive Bayes spam filter to implement a function that returns the probability of an email being spam, given the training emails above.


In [90]:
def calculate_phi(training_set, spam_value, feature_index=None):
    """Compute the Laplace-smoothed estimate phi(i=feature_index | y=spam_value).

    Leave feature_index as None to compute the class prior phi(y=spam_value) instead.
    """

    numerator, denominator = 1, 2

    # Calculating just phi(y) if feature_index is blank
    if feature_index is None:
        for sample in training_set:
            if sample[1] == spam_value:
                numerator = numerator + 1
        denominator = denominator + len(training_set)

        phi = numerator/denominator

        print('phi(y={0}) = {1}'.format(spam_value, phi))
        return phi

    # Otherwise calculate phi(i|y) for the dictionary word at feature_index

    else:
        for sample in training_set:
            if sample[1] == spam_value:
                denominator = denominator  + 1
            if sample[0][feature_index] == 1 and sample[1] == spam_value:
                numerator = numerator + 1

        phi = numerator/denominator

        print('phi(i={0}|y={1}) = {2}'.format(feature_index, spam_value, phi))
        return phi

calculate_phi(training_set, 1, 33)  # P('any' appears | spam); 'any' is dictionary[33]


def probability(training_set, x_value, spam_value, feature_index=None):
    """function for getting conditional probability P(xi=x_value|y=spam_value)"""

    if x_value == 1:
        return calculate_phi(training_set, spam_value,  feature_index)

    elif x_value == 0:
        return (1-calculate_phi(training_set, spam_value, feature_index))


print(probability(training_set, 1, 1, 33))
print(probability(training_set, 0, 1, 33))

def spam_percentage(dictionary, training_set, email):
    """function for calculating the probability of an email being spam"""
    
    num_product = calculate_phi(training_set, 1)
    denom_product_1 = calculate_phi(training_set, 1)
    denom_product_2 = 1 - calculate_phi(training_set, 1)

    email_feature_vector = email_to_feature(dictionary, email)
    for i in range(0, len(email_feature_vector)):
        if email_feature_vector[i] == 1:
            num_product = num_product * probability(training_set, 1, 1, i)
            denom_product_1 = denom_product_1 * probability(training_set, 1, 1, i)
            denom_product_2 = denom_product_2 * probability(training_set, 1, 0, i)

        elif email_feature_vector[i] == 0:
            num_product = num_product * probability(training_set, 0, 1, i)
            denom_product_1 = denom_product_1 * probability(training_set, 0, 1, i)
            denom_product_2 = denom_product_2 * probability(training_set, 0, 0, i)

    percentage = num_product / (denom_product_1 + denom_product_2)

    print(percentage)
    return percentage


email="the sun is shining. buy drugs now"
spam_percentage(dictionary, training_set, email)


phi(i=33|y=1) = 0.1111111111111111
phi(i=33|y=1) = 0.1111111111111111
0.1111111111111111
phi(i=33|y=1) = 0.1111111111111111
0.8888888888888888
phi(y=1) = 0.5
phi(y=1) = 0.5
phi(y=1) = 0.5
phi(i=0|y=1) = 0.3333333333333333
phi(i=0|y=1) = 0.3333333333333333
phi(i=0|y=0) = 0.4444444444444444
phi(i=1|y=1) = 0.2222222222222222
phi(i=1|y=1) = 0.2222222222222222
phi(i=1|y=0) = 0.4444444444444444
phi(i=2|y=1) = 0.4444444444444444
phi(i=2|y=1) = 0.4444444444444444
phi(i=2|y=0) = 0.5555555555555556
phi(i=3|y=1) = 0.2222222222222222
phi(i=3|y=1) = 0.2222222222222222
phi(i=3|y=0) = 0.2222222222222222
phi(i=4|y=1) = 0.2222222222222222
phi(i=4|y=1) = 0.2222222222222222
phi(i=4|y=0) = 0.2222222222222222
phi(i=5|y=1) = 0.2222222222222222
phi(i=5|y=1) = 0.2222222222222222
phi(i=5|y=0) = 0.3333333333333333
phi(i=6|y=1) = 0.2222222222222222
phi(i=6|y=1) = 0.2222222222222222
phi(i=6|y=0) = 0.2222222222222222
phi(i=7|y=1) = 0.3333333333333333
phi(i=7|y=1) = 0.3333333333333333
phi(i=7|y=0) = 0.1111111111111

0.9299429164504411

Test your spam filter with the following email.


In [None]:
email="the sun is shining. buy drugs now"
print(spam_percentage(dictionary, training_set, email))


# **Exercise 3**
Extend your spam filter by creating a dynamic dictionary. Instead of starting with a fixed dictionary, you should now create a dictionary out of a list of emails.

Write a function `create_dictionary(emails)` which returns a dictionary created from a list of emails (given as a list of lists `[text, spam/nospam]`). Make sure that no word is included in the dictionary more than once.
To implement this function, look up the string method `split()` in Python. To handle the symbols "." and ",", you can use the string method `replace()`.

In [None]:
def create_dictionary(emails):

  raise NotImplementedError
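One possible way to fill in this function is sketched below, using only `replace()` and `split()` as the exercise suggests. It is a sketch under assumptions rather than a reference solution: it keeps words in order of first appearance, does not change letter case, and strips only "." and ","; the `sample_emails` list is a two-email excerpt chosen for illustration.

```python
def create_dictionary(emails):
    """Return the distinct words of all emails, in order of first appearance.

    emails: list of [text, label] pairs; the label is ignored here.
    """
    dictionary = []
    seen = set()  # fast membership test to avoid duplicates
    for text, _label in emails:
        # Assumption: only '.' and ',' are removed before splitting.
        cleaned = text.replace(".", " ").replace(",", " ")
        for word in cleaned.split():
            if word not in seen:
                seen.add(word)
                dictionary.append(word)
    return dictionary


sample_emails = [
    ["hello students, please learn for the exam", 0],
    ["hello students, please buy drugs", 1],
]
print(create_dictionary(sample_emails))
# ['hello', 'students', 'please', 'learn', 'for', 'the', 'exam', 'buy', 'drugs']
```

Using a list for the result keeps the word order stable, so a feature index always refers to the same word, while the auxiliary set keeps the duplicate check cheap.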