# Programming Assignment: Email Spam Naive Bayes

## Overview/Task

The goal of this programming assignment is to build a naive Bayes classifier from scratch that can determine whether email text should be labeled spam or not spam based on its contents.

## Review

Remember that a naive Bayes classifier computes the following probability:

$$P(Y|X_1,X_2,...,X_n) \propto P(Y) \cdot P(X_1|Y) \cdot P(X_2|Y) \cdots P(X_n|Y)$$

where $Y \in \{0, 1\}$ is a binary class and $X_i$ is a feature of the input.

The classifier assigns each input to the class with the highest probability under the equation above.
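
As a rough illustration of this decision rule, the sketch below scores a two-word message under both classes using per-class word likelihoods $P(X_i|Y)$ and picks the class with the larger score. The probabilities are made-up numbers, and the names `priors`, `likelihoods`, and `score` are hypothetical, not part of the assignment template. It sums logarithms instead of multiplying raw probabilities, which selects the same class while avoiding floating-point underflow on long emails.

```python
import math

# Hypothetical class priors P(Y) for Y = 0 (ham) and Y = 1 (spam).
priors = {0: 0.5, 1: 0.5}

# Hypothetical per-class word likelihoods P(X_i | Y); illustrative values only.
likelihoods = {
    0: {"free": 0.05, "meeting": 0.40},
    1: {"free": 0.60, "meeting": 0.02},
}

def score(words, label):
    """Return log P(Y) + sum_i log P(X_i | Y), the log of the unnormalized posterior."""
    total = math.log(priors[label])
    for word in words:
        total += math.log(likelihoods[label][word])
    return total

message = ["free", "free"]
scores = {label: score(message, label) for label in priors}
decision = max(scores, key=scores.get)  # class with the highest score
print(scores, decision)
```

Because the logarithm is monotonic, comparing summed logs gives the same decision as comparing the products in the equation.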

## Reminders

Please remember that the classifier must be written from scratch; do NOT use any libraries that implement the classifier for you, such as but not limited to sklearn.

You CAN, however, use sklearn to split the dataset into training and testing sets.

Feel free to look up any tasks you are not familiar with, e.g. the function call to read a CSV file.

## Task list/Recommended Order

To provide some guidance, here is a recommended order/checklist for solving this task:
<ol>
  <li>Compute the "prior": $P(Y)$ for $Y = 0$ and $Y = 1$</li>
  <li>Compute the "likelihood": $P(X_i|Y)$</li>
  <li>Write code that uses the two items above to decide whether an email is spam or ham (aka not spam)</li>
  <li>Write code to evaluate your model. Test the model on the training data to debug</li>
  <li>Test the model on the testing data to evaluate it</li>
</ol>
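
The first three steps can be sketched on a toy dataset as follows. This is a minimal illustration under stated assumptions, not the required solution: the four example emails are invented, the names `fit`, `word_prob`, and `classify` are hypothetical, and add-one smoothing is an assumption introduced here so that words unseen in a class do not zero out the product.

```python
import math
from collections import Counter

# Tiny hypothetical dataset of (label, text) pairs; not taken from the assignment data.
emails = [
    ("ham", "lunch meeting at noon"),
    ("ham", "notes from the meeting"),
    ("spam", "win a free prize now"),
    ("spam", "free money win big"),
]

def fit(rows):
    """Estimate priors and add-one smoothed P(word appears | class) from labeled text."""
    class_counts = Counter(label for label, _ in rows)
    doc_counts = {label: Counter() for label in class_counts}
    for label, text in rows:
        # set() counts each word at most once per email
        doc_counts[label].update(set(text.lower().split()))
    priors = {label: n / len(rows) for label, n in class_counts.items()}

    def word_prob(word, label):
        # Add-one smoothing (an assumption of this sketch): unseen words get a small nonzero value.
        return (doc_counts[label][word] + 1) / (class_counts[label] + 2)

    return priors, word_prob

def classify(text, priors, word_prob):
    """Return the label with the highest log prior plus summed log likelihoods."""
    words = set(text.lower().split())
    scores = {
        label: math.log(prior) + sum(math.log(word_prob(w, label)) for w in words)
        for label, prior in priors.items()
    }
    return max(scores, key=scores.get)

priors, word_prob = fit(emails)
print(priors)
print(classify("free prize", priors, word_prob))
print(classify("meeting notes", priors, word_prob))
```

The same three pieces map onto the `prior`, `likelihood`, and `predict` functions in the template, which operate on a pandas DataFrame instead of a list of tuples.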

In [129]:
#import cell
import numpy as np
import pandas as pd
import random
import string
import csv

## Function template

In [130]:
def prior(df):
    ham_prior = 0
    spam_prior = 0
    '''YOUR CODE HERE'''
    # Prior = fraction of rows carrying each label
    ham_count = len(df[df['label'] == 'ham'])
    spam_count = len(df[df['label'] == 'spam'])
    total_count = len(df)
    ham_prior = ham_count / total_count
    spam_prior = spam_count / total_count

    return ham_prior, spam_prior
    '''END'''


def likelihood(df):
    ham_like_dict = {}
    spam_like_dict = {}
    '''YOUR CODE HERE'''
    punctuation_set = set(string.punctuation)
    # For each class, count how many emails contain each word.
    # Note: the stored values are raw per-class email counts, not normalized probabilities.
    for label, like_dict in (('ham', ham_like_dict), ('spam', spam_like_dict)):
        for index, row in df[df['label'] == label].iterrows():
            # Drop single-character punctuation tokens, trim punctuation from word edges,
            # and use a set so each word counts at most once per email.
            words = set([i.strip("/.,:?!'\"") for i in row['text'].split() if i not in punctuation_set])
            for word in words:
                like_dict[word] = like_dict.get(word, 0) + 1
    return ham_like_dict, spam_like_dict
    '''END'''


def predict(ham_prior, spam_prior, ham_like_dict, spam_like_dict, text):
    '''
    Prediction function that uses the prior and likelihood structures to compute
    a score proportional to the posterior for a single line of text.
    Returns 1 if the text is classified as spam, 0 if classified as normal/ham.
    '''
    '''YOUR CODE HERE'''
    # ham_spam_decision = 1 if classified as spam, 0 if classified as normal/ham
    ham_spam_decision = None
    # ham_posterior = unnormalized posterior score that the email is normal/ham
    ham_posterior = ham_prior
    # spam_posterior = unnormalized posterior score that the email is spam
    spam_posterior = spam_prior
    # Tokens are split on whitespace only; punctuation is not stripped here
    words = text.split()
    for word in words:
        if word in ham_like_dict:
            ham_posterior *= ham_like_dict[word]
        else:
            # Word never seen in ham emails: use a small fallback factor
            ham_posterior *= 1 / (len(ham_like_dict) + len(spam_like_dict))

        if word in spam_like_dict:
            spam_posterior *= spam_like_dict[word]
        else:
            # Word never seen in spam emails: use the same fallback factor
            spam_posterior *= 1 / (len(ham_like_dict) + len(spam_like_dict))

    # Ties are resolved in favor of ham
    if ham_posterior >= spam_posterior:
        ham_spam_decision = 0
    else:
        ham_spam_decision = 1

    return ham_spam_decision
    '''END'''

def metrics(ham_prior, spam_prior, ham_dict, spam_dict, df):
    '''
    Calls the "predict" function and reports the accuracy, precision, and recall of your predictions
    '''
    
    '''YOUR CODE HERE'''
    true_positives = 0
    false_positives = 0
    true_negatives = 0
    false_negatives = 0
    
    for i, row in df.iterrows():
        prediction = predict(ham_prior, spam_prior, ham_dict, spam_dict, row['text'])
        
        if row['label'] == 'spam' and prediction == 1:
            true_positives += 1
        elif row['label'] == 'ham' and prediction == 0:
            true_negatives += 1
        elif row['label'] == 'ham' and prediction == 1:
            false_positives += 1
        else:
            false_negatives += 1
            
    acc = (true_positives + true_negatives) / (true_positives + true_negatives + false_positives + false_negatives)
    if true_positives + false_positives == 0:
        precision = 0
    else:
        precision = true_positives / (true_positives + false_positives)
    if true_positives + false_negatives == 0:
        recall = 0
    else:
        recall = true_positives / (true_positives + false_negatives)

    return acc, precision, recall
    '''END'''
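
To sanity-check the metric formulas independently of the classifier, the following sketch computes accuracy, precision, and recall from hand-written label lists. The lists and the helper names `confusion_counts` and `safe_div` are hypothetical and exist only for this illustration; spam is treated as the positive class (1).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives, treating `positive` as the spam class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

# Hypothetical ground-truth labels and predictions (1 = spam, 0 = ham).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = safe_div(tp + tn, tp + tn + fp + fn)
precision = safe_div(tp, tp + fp)  # of emails flagged as spam, how many were spam
recall = safe_div(tp, tp + fn)     # of actual spam emails, how many were flagged
print(accuracy, precision, recall)
```

Note that each guard checks the denominator of its own ratio: precision divides by `tp + fp`, while recall divides by `tp + fn`.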


## Generate answers with your functions

In [131]:
# load the training and testing data
train_df = pd.read_csv("./TRAIN_balanced_ham_spam.csv")
test_df = pd.read_csv("./TEST_balanced_ham_spam.csv")
df = train_df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2398 entries, 0 to 2397
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0.1  2398 non-null   int64 
 1   Unnamed: 0    2398 non-null   int64 
 2   label         2398 non-null   object
 3   text          2398 non-null   object
 4   label_num     2398 non-null   int64 
dtypes: int64(3), object(2)
memory usage: 93.8+ KB


In [132]:
#compute the prior

ham_prior, spam_prior = prior(df)
print(ham_prior, spam_prior)

0.5 0.5


In [133]:
# compute likelihood
ham_like_dict, spam_like_dict = likelihood(df)

In [134]:
# Test your predict function with some example TEXT

some_text_example = "hello my name is tim"
print(predict(ham_prior, spam_prior, ham_like_dict, spam_like_dict, some_text_example))

0


In [135]:
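# Predict on test_df and compute metrics
# Note: prior() and likelihood() are re-fit on test_df here, so the metrics
# below are measured on the same data the model was fit to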
df = test_df
ham_prior, spam_prior = prior(df)
ham_dict, spam_dict = likelihood(df)
acc, precision, recall = metrics(ham_prior, spam_prior, ham_dict, spam_dict, df)
print(acc, precision, recall)

0.9733333333333334 0.9930555555555556 0.9533333333333334
