# Programming Assignment: Email Spam Naive Bayes

## Overview/Task

The goal of this programming assignment is to build a naive Bayes classifier from scratch that determines whether an email should be labeled spam or not spam based on its text.

## Review

Remember that a naive Bayes classifier is based on the following proportionality:

$$P(Y \mid X_1, X_2, \ldots, X_n) \propto P(Y) \cdot P(X_1 \mid Y) \cdot P(X_2 \mid Y) \cdots P(X_n \mid Y)$$

where $Y$ is a binary class in $\{0, 1\}$

and each $X_i$ is a feature of the input.

The classifier assigns each input to the class that gives the highest value of the expression above.
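
As a small illustration of this decision rule, the sketch below scores both classes for an input with two features and picks the larger score. All probabilities are invented for the example and are not taken from the dataset.

```python
# Toy illustration of the naive Bayes decision rule.
# Every number below is made up for the example.
priors = {0: 0.5, 1: 0.5}        # P(Y)
likelihoods = {                  # P(X_i | Y) for the two features of one input
    0: [0.10, 0.40],
    1: [0.30, 0.20],
}

def unnormalized_posterior(y):
    """Return P(Y=y) times the product of P(X_i | Y=y)."""
    score = priors[y]
    for p in likelihoods[y]:
        score *= p
    return score

scores = {y: unnormalized_posterior(y) for y in priors}
decision = max(scores, key=scores.get)  # class with the highest score
print(scores, decision)
```

The scores are not normalized probabilities; only their ordering matters for the decision.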

## Reminders

Please remember that the classifier must be written from scratch; do NOT use any libraries that implement the classifier for you, such as but not limited to sklearn.

You CAN, however, use sklearn to split the dataset into training and testing sets.

Feel free to look up any tasks you are not familiar with, e.g. the function call to read a CSV file.

## Task list/Recommended Order

To provide some guidance, here is a recommended order (and checklist) for solving this task:
<ol>
  <li>Compute the "prior": $P(Y)$ for $Y = 0$ and $Y = 1$</li>
  <li>Compute the "likelihood": $P(X_i|Y)$ for each feature $X_i$</li>
  <li>Write code that uses the two items above to decide whether an email is spam or ham (aka not spam)</li>
  <li>Write code to evaluate your model. Test the model on the training data to debug it</li>
  <li>Test the model on the testing data to evaluate it</li>
</ol>
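
The first two steps can be sketched on a tiny hand-made corpus, as below. This is only one possible approach, not the required solution: the `emails` data, the `tokenize` and `fit` helpers, and the `alpha` parameter are all hypothetical, and the add-alpha (Laplace) smoothing is an assumption added here so that no likelihood is exactly 0.

```python
from collections import Counter

# Tiny invented corpus of (label, text) pairs, for illustration only.
emails = [
    ("spam", "win money now"),
    ("spam", "free money offer"),
    ("ham", "meeting schedule now"),
    ("ham", "project meeting notes"),
]

def tokenize(text):
    # Split on whitespace and strip surrounding punctuation;
    # using a set means each word counts once per email.
    return {w.strip("/.,()-:?!'\"") for w in text.split()}

def fit(emails, alpha=1.0):
    """Return priors P(Y) and smoothed P(word present | Y)."""
    n_docs = Counter(label for label, _ in emails)
    doc_freq = {label: Counter() for label in n_docs}
    for label, text in emails:
        doc_freq[label].update(tokenize(text))
    vocab = set().union(*doc_freq.values())
    priors = {label: n / len(emails) for label, n in n_docs.items()}
    # (emails in class containing w + alpha) / (emails in class + 2 * alpha)
    likelihoods = {
        label: {
            w: (doc_freq[label][w] + alpha) / (n_docs[label] + 2 * alpha)
            for w in vocab
        }
        for label in n_docs
    }
    return priors, likelihoods

priors, likelihoods = fit(emails)
print(priors)
print(likelihoods["spam"]["money"], likelihoods["ham"]["money"])
```

With smoothing, a word that never appears in one class still gets a small nonzero likelihood, so a single such word cannot force the whole product to 0.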

In [2]:
#import cell
import numpy as np
import pandas as pd
import random
import csv

## Function template

In [92]:
from functools import reduce
def prior(df):
    ham_prior = 0
    spam_prior =  0
    '''YOUR CODE HERE'''
    total = len(df['label'])  # named "total" to avoid shadowing the built-in sum()
    ham_prior = len(df[df['label']=='ham']['label'])/total
    spam_prior = len(df[df['label']=='spam']['label'])/total
    '''END'''
    return ham_prior, spam_prior

def likelihood(df):
    ham_like_dict = {}
    spam_like_dict = {}
    '''YOUR CODE HERE'''
    temp = df.copy()  # work on a copy so the caller's DataFrame is not modified
    # Each email becomes the set of its distinct words, stripped of surrounding punctuation
    temp['set'] = temp['text'].apply(lambda text: set(i.strip("/.,()-:?!'\"") for i in text.split()))
    res = set().union(*temp['set'])  # vocabulary: every word seen in any email
    spam_df = temp[temp['label'] == 'spam']
    ham_df = temp[temp['label'] == 'ham']
    ####### above is the setup ##########
    # P(word | class) = fraction of that class's emails containing the word (no smoothing, so it can be 0)
    ham_like_dict = {w: ham_df['set'].apply(lambda s: w in s).sum() / ham_df.shape[0] for w in res}
    spam_like_dict = {w: spam_df['set'].apply(lambda s: w in s).sum() / spam_df.shape[0] for w in res}
    '''END'''

    return ham_like_dict, spam_like_dict

def predict(ham_prior, spam_prior, ham_like_dict, spam_like_dict, text):
    '''
    prediction function that uses prior and likelihood structure to compute proportional posterior for a single line of text
    '''
    #ham_spam_decision = 1 if classified as spam, 0 if classified as normal/ham
    ham_spam_decision = None

    '''YOUR CODE HERE'''
    # Distinct words in the email, stripped of surrounding punctuation (same tokenization as in likelihood)
    text_list = set(i.strip("/.,()-:?!'\"") for i in text.split())

    # ham_posterior = unnormalized posterior that the email is normal/ham:
    # the prior times the likelihood of each word; words missing from the dictionary contribute 0.5
    ham_prob_list = [ham_like_dict[w] if (w in ham_like_dict) else 0.5 for w in text_list]
    ham_posterior = reduce(lambda acc, x : acc * x, ham_prob_list, ham_prior)

    # spam_posterior = unnormalized posterior that the email is spam
    spam_prob_list = [spam_like_dict[w] if (w in spam_like_dict) else 0.5 for w in text_list]
    spam_posterior = reduce(lambda acc, x : acc * x, spam_prob_list, spam_prior)

    # Ties (including both scores being 0) are classified as spam
    if ham_posterior > spam_posterior:
        ham_spam_decision = 0
    else:
        ham_spam_decision = 1
    
    '''END'''
    return ham_spam_decision


def metrics(ham_prior, spam_prior, ham_dict, spam_dict, df):
    '''
    Calls the "predict" function and reports the accuracy, precision, and recall of your predictions
    '''
    
    '''YOUR CODE HERE'''
    # Predicted and true labels (1 = spam, 0 = ham), both aligned on df's index
    pred = df['text'].apply(lambda text: predict(ham_prior, spam_prior, ham_dict, spam_dict, text))
    target = (df['label'] == 'spam').astype(int)
    temp = pd.DataFrame({'pred': pred, 'target': target})
    # confusion-matrix counts
    true_pos = temp[(temp['target'] == 1) & (temp['pred'] == 1)].shape[0]
    false_neg = temp[(temp['target'] == 1) & (temp['pred'] == 0)].shape[0]
    true_neg = temp[(temp['target'] == 0) & (temp['pred'] == 0)].shape[0]
    false_pos = temp[(temp['target'] == 0) & (temp['pred'] == 1)].shape[0]
    # accuracy = correct predictions / all predictions
    acc = (true_pos + true_neg) / (true_pos + true_neg + false_neg + false_pos)
    # precision = true pos / (true pos + false pos)
    precision = true_pos / (true_pos + false_pos)
    # recall = true pos / (true pos + false neg)
    recall = true_pos / (true_pos + false_neg)
    '''END'''
    return acc, precision, recall
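
One thing to watch for in `predict` is that multiplying hundreds of probabilities can underflow to 0.0 in floating point, at which point both scores tie and the comparison no longer reflects the data. A common remedy is to add logarithms instead of multiplying. The sketch below is a hypothetical variant, not the template's required solution: `predict_log`, the `unseen` fallback (mirroring the 0.5 used above) and the `floor` guard against `log(0)` are all assumptions introduced for illustration.

```python
import math

def predict_log(priors, likelihoods, words, unseen=0.5, floor=1e-9):
    """Return (best_label, scores), where scores are summed log-probabilities.

    priors: {label: P(Y)}; likelihoods: {label: {word: P(word | Y)}}.
    unseen: value used for words missing from a class dictionary.
    floor:  hypothetical lower bound so that log(0) is never evaluated.
    """
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for w in words:
            p = likelihoods[label].get(w, unseen)
            score += math.log(max(p, floor))
        scores[label] = score
    return max(scores, key=scores.get), scores

# Invented example: 500 words, each more likely under ham than under spam.
words = [f"w{i}" for i in range(500)]
priors = {"ham": 0.5, "spam": 0.5}
likelihoods = {
    "ham": {w: 0.2 for w in words},
    "spam": {w: 0.1 for w in words},
}

# Direct products underflow to 0.0 for both classes ...
ham_product = math.prod([0.2] * 500, start=0.5)
spam_product = math.prod([0.1] * 500, start=0.5)

# ... while the log-space scores still separate the two classes.
label, scores = predict_log(priors, likelihoods, words)
print(ham_product, spam_product, label)
```

Because the logarithm is monotonic, comparing summed logs selects the same class as comparing products whenever the products are representable.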

## Generate answers with your functions

In [75]:
#loading in the training and testing data
train_df = pd.read_csv("./TRAIN_balanced_ham_spam.csv")
test_df = pd.read_csv("./TEST_balanced_ham_spam.csv")
df = train_df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2398 entries, 0 to 2397
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0.1  2398 non-null   int64 
 1   Unnamed: 0    2398 non-null   int64 
 2   label         2398 non-null   object
 3   text          2398 non-null   object
 4   label_num     2398 non-null   int64 
dtypes: int64(3), object(2)
memory usage: 93.8+ KB


In [65]:
#compute the prior

ham_prior, spam_prior = prior(df)

print(ham_prior, spam_prior)

0.5 0.5


In [67]:
# compute likelihood

ham_like_dict, spam_like_dict = likelihood(df)

In [94]:
# Test your predict function with some example TEXT

some_text_example = df.text.values[1]
print(predict(ham_prior, spam_prior, ham_like_dict, spam_like_dict, some_text_example))

1


In [93]:
# Predict on test_df and compute metrics 
    
df = test_df
acc, precision, recall = metrics(ham_prior, spam_prior, ham_like_dict, spam_like_dict, df)
print(acc, precision, recall)

0.785 0.7002341920374707 0.9966666666666667
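
To help read these three numbers, the sketch below recomputes them from a set of confusion-matrix counts. The notebook does not print the counts, so the ones used here are a reconstruction under the assumption that the test set holds 300 spam and 300 ham emails; treat them as hypothetical. The `safe_div` and `scores_from_counts` helpers are also illustrative names, with a guard added for empty denominators.

```python
def safe_div(num, den):
    # Return 0.0 instead of raising when the denominator is 0
    # (for example, when no email is predicted as spam).
    return num / den if den else 0.0

def scores_from_counts(tp, fp, tn, fn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    acc = safe_div(tp + tn, tp + fp + tn + fn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return acc, precision, recall

# Hypothetical counts (assumed 300 spam + 300 ham test emails),
# chosen because they are consistent with the metrics printed above.
acc, precision, recall = scores_from_counts(tp=299, fp=128, tn=172, fn=1)
print(acc, precision, recall)
```

If those counts are close to the real ones, nearly all spam is caught while a sizable share of ham is also flagged as spam, which is the pattern a high recall combined with a much lower precision indicates.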
