# Programming Assignment: Email Spam Naive Bayes

## Overview/Task

The goal of this programming assignment is to build a naive Bayes classifier from scratch that determines whether an email should be labeled spam or not spam based on its text.

## Review

Remember that a naive Bayes classifier computes the following probability:

$$P(Y|X_1,X_2,...,X_n) \propto P(Y) \cdot P(X_1|Y) \cdot P(X_2|Y) \cdots P(X_n|Y)$$

where $Y$ is the binary class label, $Y \in \{0,1\}$, and $X_i$ is the $i$-th feature of the input.

The classifier assigns each input to the class with the highest value of the expression above.
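As a minimal sketch of this decision rule, the snippet below scores one email under both classes and picks the class with the larger score. The priors and per-word likelihoods are made-up illustrative numbers, not values from the assignment data, and `score` is a hypothetical helper name.

```python
# Toy numbers for illustration only (assumed, not taken from the dataset)
prior = {0: 0.5, 1: 0.5}                   # P(Y) for ham (0) and spam (1)
likelihood = {
    0: {"free": 0.1, "meeting": 0.6},      # P(word | ham)
    1: {"free": 0.8, "meeting": 0.05},     # P(word | spam)
}

def score(words, y):
    """Unnormalized posterior: P(Y=y) times the product of P(word | Y=y)."""
    result = prior[y]
    for word in words:
        result *= likelihood[y][word]
    return result

email = ["free", "meeting"]
scores = {y: score(email, y) for y in (0, 1)}
decision = max(scores, key=scores.get)     # class with the highest score
print(scores, decision)
```

With these numbers the ham score (about 0.03) exceeds the spam score (about 0.02), so the email is assigned class 0.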

## Reminders

Please remember that the classifier must be written from scratch; do NOT use any libraries that implement the classifier for you, such as but not limited to sklearn.

You CAN, however, use sklearn to split the dataset into training and testing sets.

Feel free to look up any tasks you are not familiar with, e.g. the function call to read a CSV file.

## Task list/Recommended Order

In order to provide some guidance, I am giving the following order/checklist to solve this task:
<ol>
  <li>Compute the "prior": P(Y) for Y = 0 and Y = 1</li>
  <li>Compute the "likelihood": $P(X_i|Y)$ for each feature $X_i$ and each class</li>
  <li>Write code that uses the two items above to make a decision on whether or not an email is spam or ham (aka not spam)</li>
  <li>Write code to evaluate your model. Test the model on the training data to debug.</li>
  <li>Test the model on the testing data to check how it performs on unseen emails.</li>
</ol>
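The first two steps can be sketched on a tiny hand-made corpus. This is only an illustration under stated assumptions: the three messages are hypothetical, and the likelihood is estimated as the fraction of a class's messages that contain a word, which is one possible choice rather than the only one.

```python
from collections import Counter

# Hypothetical three-message corpus (illustrative only)
messages = [
    ("ham", "lunch at noon"),
    ("spam", "win cash now"),
    ("spam", "win a free lunch"),
]

# Step 1: prior = fraction of messages in each class
class_counts = Counter(label for label, _ in messages)
priors = {label: n / len(messages) for label, n in class_counts.items()}

# Step 2: likelihood = fraction of a class's messages that contain the word
doc_counts = {label: Counter() for label in class_counts}
for label, text in messages:
    doc_counts[label].update(set(text.split()))   # set: count once per message
likelihoods = {
    label: {word: n / class_counts[label] for word, n in counts.items()}
    for label, counts in doc_counts.items()
}

print(priors)
print(likelihoods["spam"]["win"], likelihoods["spam"]["lunch"])
```

Here "win" appears in both spam messages and "lunch" in one of the two, so their spam likelihoods are 1.0 and 0.5.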

In [92]:
#import cell
import numpy as np
import pandas as pd
import random
import csv

## Function template

In [93]:

def prior(df):
    ham_prior = 0
    spam_prior = 0
    '''YOUR CODE HERE'''
    spam_messages = df[df['label'] == 'spam']
    ham_messages = df[df['label'] == 'ham']
    ham_prior = len(ham_messages) / len(df)
    spam_prior = len(spam_messages) / len(df)

    '''END'''
    return ham_prior, spam_prior


def likelihood(df):
    ham_like_dict = {}
    spam_like_dict = {}
    '''YOUR CODE HERE'''
    # Number of messages in which each word appears, per class
    ham = {}
    spam = {}

    # zip over the column values so this works for any DataFrame index
    for text, label in zip(df['text'].values, df['label'].values):
        # Use a set so each word counts at most once per message
        txt = set([i.strip("/.,:?!'\"") for i in text.split()])
        counts = ham if label == 'ham' else spam
        for word in txt:
            counts[word] = counts.get(word, 0) + 1

    n_ham = len(df[df['label'] == 'ham'])
    n_spam = len(df[df['label'] == 'spam'])

    # P(word | class) = fraction of the class's messages containing the word
    for key in ham:
        ham_like_dict[key] = ham[key] / n_ham
    for key in spam:
        spam_like_dict[key] = spam[key] / n_spam

    # Fallback likelihood for words never seen in a class (avoids multiplying by zero)
    ham_like_dict['unknown word'] = 1 / n_ham
    spam_like_dict['unknown word'] = 1 / n_spam

    '''END'''

    return ham_like_dict, spam_like_dict


def predict(ham_prior, spam_prior, ham_like_dict, spam_like_dict, text):
    '''
    prediction function that uses prior and likelihood structure to compute proportional posterior for a single line of text
    '''
    #ham_spam_decision = 1 if classified as spam, 0 if classified as normal/ham
    ham_spam_decision = None

    '''YOUR CODE HERE'''
    

    #ham_posterior = posterior probability that the email is normal/ham
    ham_posterior = ham_prior

    #spam_posterior = posterior probability that the email is spam
    spam_posterior = spam_prior
    
    txt = set([i.strip("/.,:?!'\"") for i in text.split()])
    for word in txt:
        if word in ham_like_dict:
            ham_posterior *= ham_like_dict[word]
        else:
            ham_posterior *= ham_like_dict['unknown word']
        if word in spam_like_dict:
            spam_posterior *= spam_like_dict[word]
        else:
            spam_posterior *= spam_like_dict['unknown word']
            
    if spam_posterior < ham_posterior:
        ham_spam_decision = 0
    else:
        ham_spam_decision = 1

    
    

    '''END'''
    return ham_spam_decision


def metrics(ham_prior, spam_prior, ham_dict, spam_dict, df):
    '''
    Calls the "predict" function on every row of df and reports the accuracy, precision, and recall of the predictions
    '''
    
    '''YOUR CODE HERE'''
    true_positive = 0
    true_negative = 0
    false_positive = 0
    false_negative = 0
    
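    # zip over the column values so this works for any DataFrame index
    for text, label in zip(df['text'].values, df['label'].values):
        result = predict(ham_prior, spam_prior, ham_dict, spam_dict, text)
        if result == 1:
            if label == "spam":
                true_positive += 1
            else:
                false_positive += 1
        else:
            if label == "spam":
                false_negative += 1
            else:
                true_negative += 1

    total = true_positive + true_negative + false_positive + false_negative
    acc = (true_positive + true_negative) / total if total else 0.0

    # precision: share of predicted spam that is really spam (0.0 if nothing was flagged)
    predicted_spam = true_positive + false_positive
    precision = true_positive / predicted_spam if predicted_spam else 0.0

    # recall: share of actual spam that was caught (0.0 if there is no spam)
    actual_spam = true_positive + false_negative
    recall = true_positive / actual_spam if actual_spam else 0.0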
    
    
    '''END'''
    return acc, precision, recall
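Multiplying many probabilities below 1 can underflow to 0.0 for long emails, at which point both scores tie and the comparison in `predict` stops being meaningful. One common workaround, sketched here under the assumption that the likelihood dictionaries have the same shape as above (including the 'unknown word' fallback key), is to sum logarithms instead of multiplying. `predict_log` is a hypothetical name and is not part of the assignment template.

```python
import math

def predict_log(ham_prior, spam_prior, ham_like_dict, spam_like_dict, text):
    """Same decision rule as predict, but compares sums of log-probabilities."""
    ham_score = math.log(ham_prior)
    spam_score = math.log(spam_prior)
    for word in set(i.strip("/.,:?!'\"") for i in text.split()):
        # Fall back to the 'unknown word' entry for words unseen in a class
        ham_score += math.log(ham_like_dict.get(word, ham_like_dict['unknown word']))
        spam_score += math.log(spam_like_dict.get(word, spam_like_dict['unknown word']))
    # 1 = spam, 0 = ham; ties go to spam, as in predict
    return 0 if spam_score < ham_score else 1

# Tiny made-up dictionaries, for demonstration only
ham_like = {"meeting": 0.6, "unknown word": 0.01}
spam_like = {"free": 0.8, "unknown word": 0.01}
print(predict_log(0.5, 0.5, ham_like, spam_like, "free free offer!"))
print(predict_log(0.5, 0.5, ham_like, spam_like, "meeting today"))
```

Because the logarithm is monotonic, this picks the same class as the product form whenever the product does not underflow. It assumes every prior and likelihood is strictly positive, since `math.log(0)` raises an error.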

## Generate answers with your functions

In [94]:
# load the training and testing data
train_df = pd.read_csv("./TRAIN_balanced_ham_spam.csv")
test_df = pd.read_csv("./TEST_balanced_ham_spam.csv")
df = train_df
#df.head()



In [95]:
#compute the prior

ham_prior, spam_prior = prior(train_df)

print(ham_prior, spam_prior)

0.5 0.5


In [96]:
# compute likelihood

ham_like_dict, spam_like_dict = likelihood(train_df)
print(ham_like_dict,spam_like_dict)



In [97]:
# Test your predict function with some example TEXT

some_text_example = "this is a spam message"
print(predict(ham_prior, spam_prior, ham_like_dict, spam_like_dict, some_text_example))

1


In [98]:
# Predict on test_df and compute metrics 
    
df = test_df
acc, precision, recall = metrics(ham_prior, spam_prior, ham_like_dict, spam_like_dict, df)
print(acc, precision, recall)

0.9566666666666667 0.9228395061728395 0.9966666666666667
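To see how the three printed numbers relate to the confusion counts, here is a small sketch. The counts are an assumption: they are one set of values consistent with the output above if the test set held 600 messages, which the notebook does not state.

```python
# Assumed confusion counts (not shown in the notebook output)
true_positive, false_positive = 299, 25
false_negative, true_negative = 1, 275

total = true_positive + true_negative + false_positive + false_negative
acc = (true_positive + true_negative) / total                  # share of correct predictions
precision = true_positive / (true_positive + false_positive)   # flagged emails that are spam
recall = true_positive / (true_positive + false_negative)      # spam emails that were caught
print(acc, precision, recall)
```

Recall being higher than precision indicates that the classifier rarely misses spam but flags some ham as spam.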
