# Programming Assignment: Email Spam Naive Bayes

## Overview/Task

The goal of this programming assignment is to build a naive Bayes classifier from scratch that determines whether an email should be labeled spam or not spam based on its text.

## Review

Remember that a naive Bayes classifier computes the following probability, assuming the features are conditionally independent given the class:

$$P(Y|X_1,X_2,...,X_n) \propto P(Y)*P(X_1|Y)*P(X_2|Y)*...*P(X_n|Y)$$

Where $Y$ is a binary class {0,1}

Where $X_i$ is a feature of the input

The classifier assigns each input to the class with the highest probability under the equation above.
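
As a small worked illustration of this decision rule, the sketch below scores one two-word email for both classes. The priors and per-word probabilities are made-up numbers chosen for the example, not values from the assignment dataset, and the names `prior`, `likelihood`, and `score` are hypothetical.

```python
# Hypothetical priors and per-word likelihoods P(X_i | Y); illustration only.
prior = {0: 0.5, 1: 0.5}                      # 0 = ham, 1 = spam
likelihood = {
    0: {"free": 0.05, "meeting": 0.30},       # P(word | ham)
    1: {"free": 0.40, "meeting": 0.02},       # P(word | spam)
}

def score(words, y):
    """Unnormalized posterior: P(Y=y) times the product of P(X_i | Y=y)."""
    s = prior[y]
    for w in words:
        s *= likelihood[y][w]
    return s

email = ["free", "meeting"]
scores = {y: score(email, y) for y in (0, 1)}
decision = max(scores, key=scores.get)        # class with the highest score
print(scores, decision)
```

Here the ham score is 0.5 * 0.05 * 0.30 = 0.0075 and the spam score is 0.5 * 0.40 * 0.02 = 0.004, so the email is assigned to class 0 (ham).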

## Reminders

Please remember that the classifier must be written from scratch; do NOT use any libraries that implement the classifier for you, such as but not limited to sklearn.

You CAN, however, use sklearn to split the dataset into training and testing sets.

Feel free to look up any tasks you are not familiar with, e.g. the function call to read a CSV file.
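
For example, one way to read a CSV is the standard-library `csv` module. The sketch below uses an in-memory string as a stand-in for a file, with made-up rows and `label`/`text` columns; with a real file you would pass an open file object instead of the `io.StringIO` object.

```python
import csv
import io

# Stand-in for a file on disk; the two rows are made up for illustration.
sample = io.StringIO(
    'label,text\n'
    'ham,"See you at noon, thanks"\n'
    'spam,Click for free goods\n'
)

# csv.DictReader maps each row to a dict keyed by the header row.
rows = list(csv.DictReader(sample))
for row in rows:
    print(row["label"], "->", row["text"])
```

Note that the quoted field keeps its comma, which a plain `line.split(",")` would not handle.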

## Task list/Recommended Order

To provide some guidance, here is a recommended order (and checklist) for solving this task:
<ol>
  <li>Compute the "prior": P(Y) for Y = 0 and Y = 1</li>
  <li>Compute the "likelihood": $P(X_n|Y)$ for each feature $X_n$ and each class</li>
  <li>Write code that uses the two items above to make a decision on whether or not an email is spam or ham (aka not spam)</li>
  <li>Write code to evaluate your model. Test the model on the training data to debug</li>
  <li>Evaluate the model on the testing data to see how well it generalizes</li>
</ol>
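
The first two steps can be sketched on a tiny made-up corpus. This is only an illustration of the idea, assuming the likelihood of a word is estimated as the fraction of emails in a class that contain it; the variable names are hypothetical and the four emails are not from the dataset.

```python
from collections import Counter

# Tiny made-up corpus of (label, text) pairs; illustration only.
emails = [
    ("spam", "free money now"),
    ("spam", "free offer"),
    ("ham", "meeting at noon"),
    ("ham", "lunch money later"),
]

# Step 1: prior = fraction of emails in each class.
class_counts = Counter(label for label, _ in emails)
priors = {c: n / len(emails) for c, n in class_counts.items()}

# Step 2: likelihood = fraction of emails in a class that contain the word.
doc_freq = {c: Counter() for c in class_counts}
for label, text in emails:
    doc_freq[label].update(set(text.split()))   # set() counts a word once per email
likelihoods = {
    c: {w: n / class_counts[c] for w, n in doc_freq[c].items()}
    for c in class_counts
}

print(priors)
print(likelihoods["spam"]["free"], likelihoods["ham"]["money"])
```

Both spam emails contain "free", so its spam likelihood is 1.0, while "money" appears in one of the two ham emails, giving 0.5.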

In [11]:
#import cell
import numpy as np
import pandas as pd
import random
import csv


## Function template

In [18]:

def prior(df):
    ham_prior = 0
    spam_prior =  0
    '''YOUR CODE HERE'''
    emailC = len(df)
    # Count rows per class directly; len() also handles a class with zero rows
    spamC = len(df.loc[df["label"]=="spam"])
    hamC = len(df.loc[df["label"]=="ham"])
    ham_prior = hamC/emailC
    spam_prior = spamC/emailC
    
    '''END'''
    return ham_prior, spam_prior
    

def likelihood(df):
    ham_like_dict = {}
    spam_like_dict = {}
    '''YOUR CODE HERE'''
    spam = df.loc[df['label']=='spam']
    ham = df.loc[df['label']=='ham']
    for x in range(len(spam)): 
        wordset = set([i.strip("/.,:?!'\"") for i in spam['text'].values[x].split()])
        for e in wordset:
            if len(e)>3:                
                if e not in spam_like_dict:
                    spam_like_dict[e]=1
                else:
                    spam_like_dict[e]+=1
    for y in range(len(ham)):
        wordset = set([i.strip("/.,:?!'\"") for i in ham['text'].values[y].split()])
        for e in wordset:
            if len(e)>3:
                if e not in ham_like_dict:
                    ham_like_dict[e]=1
                else:
                    ham_like_dict[e]+=1
    for e in ham_like_dict:
        ham_like_dict[e] = ham_like_dict[e]/len(ham)
    for e in spam_like_dict:
        spam_like_dict[e] = spam_like_dict[e]/len(spam)
    ham_like_dict = dict(sorted(ham_like_dict.items(), key=lambda item: item[1],reverse=True))
    spam_like_dict = dict(sorted(spam_like_dict.items(), key=lambda item: item[1],reverse=True))
    '''END'''
    
    return ham_like_dict, spam_like_dict

def predict(ham_prior, spam_prior, ham_like_dict, spam_like_dict, text):
    '''
    prediction function that uses prior and likelihood structure to compute proportional posterior for a single line of text
    '''
    #ham_spam_decision = 1 if classified as spam, 0 if classified as normal/ham
    ham_spam_decision = None

    '''YOUR CODE HERE'''
    #ham_posterior = posterior probability that the email is normal/ham
    text_list=set([i.strip("/.,:?!'\"") for i in text.split()])
    haml = np.array([])
    spaml = np.array([])
    for e in text_list:
        if e in ham_like_dict:
            haml = np.append(haml,ham_like_dict[e])
        else:
            haml = np.append(haml,1./(len(ham_like_dict)+len(spam_like_dict)))
        if e in spam_like_dict:
            spaml = np.append(spaml,spam_like_dict[e])
        else:
            spaml = np.append(spaml,1./(len(ham_like_dict)+len(spam_like_dict)))
    hamlikelihood=np.prod(haml)
    spamlikelihood=np.prod(spaml)
    # Unnormalized posteriors: prior times likelihood. The shared denominator
    # P(text) cancels when the two classes are compared, so it is omitted.
    ham_posterior = ham_prior*hamlikelihood
    spam_posterior = spam_prior*spamlikelihood
    if (spam_posterior > ham_posterior):
        ham_spam_decision = 1
    else:
        ham_spam_decision = 0
    '''END'''
    return ham_spam_decision
    

def metrics(ham_prior, spam_prior, ham_like_dict, spam_like_dict, df):
    '''
    Calls "predict" function and report accuracy, precision, and recall of your prediction
    '''
    
    '''YOUR CODE HERE'''
    preHam,preSpam,accHam,accSpam = 0,0,0,0
    Ham = len(df.loc[df['label']=='ham'])
    Spam = len(df.loc[df['label']=='spam'])
    for e in range(len(df)):
        email = df['text'].values[e]
        if predict(ham_prior,spam_prior,ham_like_dict,spam_like_dict, email)== 1:
            preSpam +=1
            if df['label'].values[e] == 'spam':
                accSpam+=1
        else:
            preHam +=1
            if df['label'].values[e] == 'ham':
                accHam +=1
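    # Guard against division by zero when no email is predicted or labeled spam
    precision = accSpam/preSpam if preSpam else 0.0
    recall = accSpam/Spam if Spam else 0.0
    acc = (accSpam+accHam)/(Spam+Ham)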

    
    

    '''END'''
    return acc, precision, recall
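
Multiplying many probabilities below 1 can underflow to 0.0 on long emails, at which point both classes score zero and the comparison is meaningless. A common workaround is to sum log-probabilities instead. The sketch below is one possible variant, not part of the template: `predict_log` and its `unseen` fallback probability are hypothetical, and it assumes both priors are greater than zero.

```python
import math

def predict_log(ham_prior, spam_prior, ham_like_dict, spam_like_dict, text,
                unseen=1e-6):
    """Hypothetical log-space variant of predict(); returns 1 for spam, 0 for ham.

    `unseen` is an assumed fallback probability for words missing from a dictionary.
    """
    words = {w.strip("/.,:?!'\"") for w in text.split()}
    ham_score = math.log(ham_prior)
    spam_score = math.log(spam_prior)
    for w in words:
        # Adding logs is equivalent to multiplying probabilities.
        ham_score += math.log(ham_like_dict.get(w, unseen))
        spam_score += math.log(spam_like_dict.get(w, unseen))
    return 1 if spam_score > ham_score else 0

# Made-up likelihood dictionaries for a quick check.
ham_d = {"meeting": 0.3, "thanks": 0.2}
spam_d = {"free": 0.4, "click": 0.3}
print(predict_log(0.5, 0.5, ham_d, spam_d, "click for free goods"))
print(predict_log(0.5, 0.5, ham_d, spam_d, "thanks for the meeting"))
```

Because the logarithm is monotonic, comparing the summed logs gives the same decision as comparing the products whenever the products do not underflow.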

## Generate answers with your functions

In [13]:
#load the training and testing data
train_df = pd.read_csv("./TRAIN_balanced_ham_spam.csv")
test_df = pd.read_csv("./TEST_balanced_ham_spam.csv")
df = train_df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2398 entries, 0 to 2397
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0.1  2398 non-null   int64 
 1   Unnamed: 0    2398 non-null   int64 
 2   label         2398 non-null   object
 3   text          2398 non-null   object
 4   label_num     2398 non-null   int64 
dtypes: int64(3), object(2)
memory usage: 93.8+ KB


In [14]:
#compute the prior

ham_prior, spam_prior = prior(df)

print(ham_prior, spam_prior)


0.5 0.5


In [22]:
# compute likelihood

ham_like_dict, spam_like_dict = likelihood(df)





In [23]:
# Test your predict function with some example TEXT

some_text_example = "write your test case here"
print(predict(ham_prior, spam_prior, ham_like_dict, spam_like_dict, some_text_example))
some_text_example = "We appreciate your help."
print(predict(ham_prior, spam_prior, ham_like_dict, spam_like_dict, some_text_example))
some_text_example = "Click for free goods"
print(predict(ham_prior, spam_prior, ham_like_dict, spam_like_dict, some_text_example))

1
0
1


In [21]:
# Predict on test_df and compute metrics 
    
df = test_df

acc, precision, recall = metrics(ham_prior, spam_prior, ham_like_dict, spam_like_dict, df)
print(acc, precision, recall)

0.845 0.9859154929577465 0.7
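
To see how the three numbers relate, the sketch below recomputes them from confusion-matrix counts. The counts are an assumption: they are one set that is consistent with the output above if the test set holds 100 spam and 100 ham emails, but the notebook does not show the test set size.

```python
# Assumed confusion-matrix counts (not taken from the notebook output).
tp, fp = 70, 1     # predicted spam: actually spam / actually ham
fn, tn = 30, 99    # predicted ham: actually spam / actually ham

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of emails flagged as spam, the fraction that are spam
recall = tp / (tp + fn)      # of actual spam emails, the fraction that was flagged
print(accuracy, precision, recall)
```

Under these assumed counts, the classifier almost never flags ham as spam (high precision) but lets about 30% of spam through (lower recall).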
