# Homework 3 Problem 6

Our dataset contains emails labeled as spam or ham. You'll use a "naive Bayes" classifier to identify spam emails.

## Read Data

In [1]:
import csv
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

In [2]:
# Read Data

#  The training data matrix trainFeat is a D×W matrix, 
#  where each row represents an email and each column 
#  indicates whether that word appears in that email at least once.
trainFeat = pd.read_csv('./hw3_data/trainFeat.csv', header=None)
testFeat = pd.read_csv('./hw3_data/testFeat.csv', header=None)
#  The ground truth labels are stored in trainLabels, 
#  with “1” indicating spam and “0” ham.
trainLabels = pd.read_csv('./hw3_data/trainLabels.csv', header=None, names=['label'])
testLabels = pd.read_csv('./hw3_data/testLabels.csv', header=None, names=['label'])

display(trainFeat.head(), trainLabels.head())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12665,12666,12667,12668,12669,12670,12671,12672,12673,12674
0,0,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,1,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,label
0,0
1,0
2,0
3,0
4,0


In [3]:
# """
# Important preprocessing note:
# vocab.csv has formatting and parsing problems, so I adjust them manually.
# I removed "," from line 22 in vocab.csv and added it back after reading the file.
# On line 8553, the word "null" is parsed as None, so I fill it back in as well.
# """
# The vocabulary vocab consists of W words, where each
# character string has been mapped to a distinct integer index.
# vocab_data = pd.read_csv('./hw3_data/vocab_revised.csv', quoting=csv.QUOTE_NONE, 
#                          index_col=False, header=None, names=['text', 'drop'])
# vocab_data = vocab_data['text']
# vocab_data.index = vocab_data.index + 1 # Start Index from 1
# vocab_data.loc[22] = ','
# vocab_data.loc[8553] = 'null'
# 
# display(vocab_data)

In [4]:
# The vocabulary vocab consists of W words, where each
# character string has been mapped to a distinct integer index.
with open('./hw3_data/vocab.csv', 'r') as f:
    lines = f.readlines()
# remove the trailing separator from each line
vocab = pd.DataFrame([line.strip()[:-1] for line in lines], columns=['text'])

display(vocab)

Unnamed: 0,text
0,_
1,not
2,interested
3,.
4,prices
...,...
12670,epe
12671,jgs
12672,formalize
12673,liquefied


To define a probabilistic model of this data, we let $Y_i = S$ if email $i$ is spam and $Y_i = H$ if email $i$ is ham (not spam). We assume that the two classes are equally likely a priori:
$$P(Y_i = S) = P(Y_i = H) = 0.5.$$

To encode the data that will be used for classification, we let $X_{ij} = 1$ if email $i$ contains an instance of word $j$, and $X_{ij} = 0$ if email $i$ does not contain word $j$. The set of all available data about email $i$ is then $X_i = \{X_{ij}\ |\ j = 1, \ldots, W\}$.

To implement a simple Bayesian classifier, we will compute the posterior distribution $P(Y_i\ |\ X_i)$ of the class label given the observed features. If $P(Y_i = S\ |\ X_i) > P(Y_i = H \ |\  X_i)$, we classify email $i$ as spam; otherwise, we classify it as ham.

In [5]:
# In training data, W=12675, D=20229
print(trainFeat.shape)

(20229, 12675)


(a) Either prove or disprove the following statement: the Bayesian classifier described above is equivalent to a classifier that assigns label spam if $P(X_i\ |\ Y_i = S)> \ P(X_i \ | \ Y_i = H)$, and label ham otherwise.  Base your reasoning on $P(Y_i = S) = P(Y_i = H) = 0.5$ and Bayes' rule.

**Answer:** We prove the statement above.

The classifier predicts spam if $P(Y_i = S\ |\ X_i) > P(Y_i = H \ |\  X_i)$. By Bayes' rule,

$$P(Y_i = S\ |\ X_i) = \frac{P(Y_i = S\ \cap \ X_i)}{P(X_i)} = \frac{P(X_i\ |\ Y_i = S)\,P(Y_i = S)}{P(X_i)},$$
$$P(Y_i = H\ |\ X_i) = \frac{P(Y_i = H\ \cap \ X_i)}{P(X_i)} = \frac{P(X_i\ |\ Y_i = H)\,P(Y_i = H)}{P(X_i)}.$$

Both posteriors share the same positive denominator $P(X_i)$, so the condition is equivalent to $P(X_i\ |\ Y_i = S)\,P(Y_i = S) > P(X_i\ |\ Y_i = H)\,P(Y_i = H)$. Since $P(Y_i = S) = P(Y_i = H) = 0.5$, dividing both sides by $0.5$ shows that the Bayesian classifier is equivalent to a classifier that assigns label spam if $P(X_i\ |\ Y_i = S)> \ P(X_i \ | \ Y_i = H)$, and label ham otherwise.
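As a quick numerical sanity check of the argument in part (a), the sketch below compares the two decision rules on made-up likelihood values. The arrays `lik_spam` and `lik_ham` are hypothetical stand-ins for $P(X_i\ |\ Y_i = S)$ and $P(X_i\ |\ Y_i = H)$; they are not computed from the homework data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical likelihoods P(X_i | Y_i = S) and P(X_i | Y_i = H) for 1000 made-up emails.
lik_spam = rng.uniform(0.01, 1.0, size=1000)
lik_ham = rng.uniform(0.01, 1.0, size=1000)
prior_spam = prior_ham = 0.5

# P(X_i) by the law of total probability.
evidence = lik_spam * prior_spam + lik_ham * prior_ham

# Posteriors from Bayes' rule.
post_spam = lik_spam * prior_spam / evidence
post_ham = lik_ham * prior_ham / evidence

# Decision by posterior vs. decision by likelihood alone.
by_posterior = post_spam > post_ham
by_likelihood = lik_spam > lik_ham
rules_agree = np.array_equal(by_posterior, by_likelihood)
print(rules_agree)
```

With equal priors the two rules should agree on every email; with unequal priors they generally would not.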

To further simplify the modeling problem, a naive Bayes classifier assumes that given the class label, the observed word features are conditionally independent. From the definition of independence, this implies that
$$P(X_{i,\bullet} \ | \ Y_i = S) = \prod_{j=1}^{W} P(X_{ij} \ | \ Y_i = S),  \ \ \ \ \ \ P(X_{i,\bullet} \ | \ Y_i = H) = \prod_{j=1}^{W} P(X_{ij} \ | \ Y_i = H).$$
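To make the factorization concrete, here is a minimal sketch for a toy email with $W = 3$ words. The probabilities in `p_word_given_spam` are invented for illustration and do not come from the dataset.

```python
import numpy as np

# Hypothetical values of P(X_ij = 1 | Y_i = S) for W = 3 words.
p_word_given_spam = np.array([0.9, 0.2, 0.5])

# Toy email: words 1 and 3 are present, word 2 is absent.
x = np.array([1, 0, 1])

# Each factor is p_j if word j is present and 1 - p_j if it is absent.
factors = np.where(x == 1, p_word_given_spam, 1.0 - p_word_given_spam)

# Under the naive Bayes assumption the likelihood is the product of the factors.
likelihood_spam = factors.prod()
print(likelihood_spam)
```

Here the product is $0.9 \times 0.8 \times 0.5$, which is approximately $0.36$.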

(b) A simple way to estimate the probabilities above is by counting how many times each event occurs in the training data. Let $N_s$ be the total number of spam emails, $N_{sj}$ the number of spam emails in which word $j$ occurs, $N_h$ the total number of ham emails, and $N_{hj}$ the number of ham emails in which word $j$ occurs. We then set
$$P(X_{ij}=1\ | \ Y_i = S) = \frac{N_{sj}}{N_s}, \ \ \ P(X_{ij} = 0 \ | \ Y_i = S) = 1 - P(X_{ij} = 1 \ | \ Y_i = S) = \frac{N_s - N_{sj}}{N_s},$$
$$P(X_{ij}=1\ | \ Y_i = H) = \frac{N_{hj}}{N_h}, \ \ \ P(X_{ij} = 0 \ | \ Y_i = H) = 1 - P(X_{ij} = 1 \ | \ Y_i = H) = \frac{N_h - N_{hj}}{N_h}.$$
Write a program to compute these probabilities using the data in trainFeat and trainLabels.
(Note: $X_{ij}=1$ means email $i$ contains word $j$.)

In [6]:
# number of spam email
N_s = trainLabels[trainLabels['label']==1].shape[0]
# number of spam email in which word j occurs
N_sj = trainFeat[trainLabels['label']==1].sum(axis=0)
# number of ham emails
N_h = trainLabels[trainLabels['label']==0].shape[0]
# number of ham email in which word j occurs
N_hj = trainFeat[trainLabels['label']==0].sum(axis=0)
print('N_s:\n', N_s, '\nN_sj:\n', N_sj, '\nN_h:\n', N_h, '\nN_hj:\n', N_hj)


N_s:
 10302 
N_sj:
 0        1128
1        3643
2         665
3        9278
4        1061
         ... 
12670       1
12671       1
12672       1
12673       1
12674       1
Length: 12675, dtype: int64 
N_h:
 9927 
N_hj:
 0         743
1        3265
2         509
3        9692
4         457
         ... 
12670      24
12671      15
12672      10
12673      12
12674      11
Length: 12675, dtype: int64


In [7]:
# If email i spam, the prob that it contains word j
p_ij_eq1_spam = N_sj/N_s
# If email i spam, the prob that it does not contain word j
p_ij_eq0_spam = 1-p_ij_eq1_spam
# If email i ham, the prob that it contains word j
p_ij_eq1_ham = N_hj/N_h
# If email i ham, the prob that it does not contain word j
p_ij_eq0_ham = 1-p_ij_eq1_ham

probability_j = pd.DataFrame({'p_ij_eq1_spam': p_ij_eq1_spam,
                              'p_ij_eq0_spam': p_ij_eq0_spam,
                              'p_ij_eq1_ham': p_ij_eq1_ham,
                              'p_ij_eq0_ham': p_ij_eq0_ham,
                              })
display(probability_j)

Unnamed: 0,p_ij_eq1_spam,p_ij_eq0_spam,p_ij_eq1_ham,p_ij_eq0_ham
0,0.109493,0.890507,0.074846,0.925154
1,0.353621,0.646379,0.328901,0.671099
2,0.064551,0.935449,0.051274,0.948726
3,0.900602,0.099398,0.976327,0.023673
4,0.102990,0.897010,0.046036,0.953964
...,...,...,...,...
12670,0.000097,0.999903,0.002418,0.997582
12671,0.000097,0.999903,0.001511,0.998489
12672,0.000097,0.999903,0.001007,0.998993
12673,0.000097,0.999903,0.001209,0.998791


In [8]:
print('P(X_ij = 1 | Y_i = S):')
for i in range(len(p_ij_eq1_spam)):
    print(f'index j={i+1}: prob={p_ij_eq1_spam[i]}')

P(X_ij = 1 | Y_i = S):
index j=1: prob=0.10949330227140361
index j=2: prob=0.3536206561832654
index j=3: prob=0.06455057270432926
index j=4: prob=0.9006018248883711
index j=5: prob=0.10298971073577946
index j=6: prob=0.6521063871092991
index j=7: prob=0.045719277810133956
index j=8: prob=0.06416229858279945
index j=9: prob=0.03640069889341875
index j=10: prob=0.3654630168899243
index j=11: prob=0.5786255096097845
index j=12: prob=0.11512327703358571
index j=13: prob=0.04222481071636575
index j=14: prob=0.15637740244612697
index j=15: prob=0.34711706464764125
index j=16: prob=0.7279169093379926
index j=17: prob=0.11842360706658901
index j=18: prob=0.6742380120364978
index j=19: prob=0.5805668802174335
index j=20: prob=0.38031450203843914
index j=21: prob=0.011648223645894
index j=22: prob=0.7878081925839643
index j=23: prob=0.6540477577169481
index j=24: prob=0.3829353523587653
index j=25: prob=0.17074354494272956
index j=26: prob=0.6421083284799068
index j=27: prob=0.018443020772665502

In [9]:
print('P(X_ij = 0 | Y_i = S):')
for i in range(len(p_ij_eq0_spam)):
    print(f'index j={i+1}: prob={p_ij_eq0_spam[i]}')

P(X_ij = 0 | Y_i = S):
index j=1: prob=0.8905066977285964
index j=2: prob=0.6463793438167347
index j=3: prob=0.9354494272956707
index j=4: prob=0.09939817511162885
index j=5: prob=0.8970102892642206
index j=6: prob=0.34789361289070087
index j=7: prob=0.954280722189866
index j=8: prob=0.9358377014172006
index j=9: prob=0.9635993011065812
index j=10: prob=0.6345369831100758
index j=11: prob=0.42137449039021546
index j=12: prob=0.8848767229664143
index j=13: prob=0.9577751892836343
index j=14: prob=0.843622597553873
index j=15: prob=0.6528829353523588
index j=16: prob=0.2720830906620074
index j=17: prob=0.881576392933411
index j=18: prob=0.3257619879635022
index j=19: prob=0.4194331197825665
index j=20: prob=0.6196854979615609
index j=21: prob=0.988351776354106
index j=22: prob=0.21219180741603572
index j=23: prob=0.3459522422830519
index j=24: prob=0.6170646476412347
index j=25: prob=0.8292564550572704
index j=26: prob=0.35789167152009316
index j=27: prob=0.9815569792273345
...

In [10]:
print('P(X_ij = 1 | Y_i = H):')
for i in range(len(p_ij_eq1_ham)):
    print(f'index j={i+1}: prob={p_ij_eq1_ham[i]}')

P(X_ij = 1 | Y_i = H):
index j=1: prob=0.07484637856351364
index j=2: prob=0.3289009771330714
index j=3: prob=0.0512743024075753
index j=4: prob=0.9763271884758739
index j=5: prob=0.04603606326181122
index j=6: prob=0.6831872670494611
index j=7: prob=0.008562506295960512
index j=8: prob=0.015412511332728921
index j=9: prob=0.02095295658305631
index j=10: prob=0.48675329908330817
index j=11: prob=0.3492495215070011
index j=12: prob=0.05379268661226957
index j=13: prob=0.025788254256069305
index j=14: prob=0.03535811423390753
index j=15: prob=0.12924347738490985
index j=16: prob=0.8114233907524931
index j=17: prob=0.17366777475571674
index j=18: prob=0.8220006044122091
index j=19: prob=0.6390651757832175
index j=20: prob=0.27269064168429535
index j=21: prob=0.0003022061045633122
index j=22: prob=0.8553440112823613
index j=23: prob=0.6209328095094188
index j=24: prob=0.4384003223531782
index j=25: prob=0.13639568852624157
index j=26: prob=0.654679157852322
index j=27: prob=0.0062455928276

In [11]:
print('P(X_ij = 0 | Y_i = H):')
for i in range(len(p_ij_eq0_ham)):
    print(f'index j={i+1}: prob={p_ij_eq0_ham[i]}')

P(X_ij = 0 | Y_i = H):
index j=1: prob=0.9251536214364864
index j=2: prob=0.6710990228669286
index j=3: prob=0.9487256975924248
index j=4: prob=0.023672811524126147
index j=5: prob=0.9539639367381888
index j=6: prob=0.3168127329505389
index j=7: prob=0.9914374937040394
index j=8: prob=0.9845874886672711
index j=9: prob=0.9790470434169437
index j=10: prob=0.5132467009166919
index j=11: prob=0.6507504784929989
index j=12: prob=0.9462073133877305
index j=13: prob=0.9742117457439307
index j=14: prob=0.9646418857660924
index j=15: prob=0.8707565226150902
index j=16: prob=0.18857660924750685
index j=17: prob=0.8263322252442833
index j=18: prob=0.17799939558779088
index j=19: prob=0.36093482421678247
index j=20: prob=0.7273093583157046
index j=21: prob=0.9996977938954367
index j=22: prob=0.14465598871763874
index j=23: prob=0.37906719049058124
index j=24: prob=0.5615996776468217
index j=25: prob=0.8636043114737584
index j=26: prob=0.345320842147678
index j=27: prob=0.9937544071723582
...

(c) Consider a simplified dataset that only contains the presence or absence of a single word, $j = $ "money". Compute and report the numerical values of the conditional probabilities $$P(X_{ij} = 1 \ | \ Y_i = S), \ P(X_{ij} = 1 \ | \ Y_i = H).$$ What is the test accuracy of a Bayesian classifier based on this single word?

In [12]:
# "money" is word j=180 (the DataFrame index is zero-based, so it sits at index 179)
print(vocab[vocab['text']=='money'].index)
display(probability_j.iloc[179])

Int64Index([179], dtype='int64')


p_ij_eq1_spam    0.164143
p_ij_eq0_spam    0.835857
p_ij_eq1_ham     0.030321
p_ij_eq0_ham     0.969679
Name: 179, dtype: float64

Recall from part (a) that the classifier assigns label spam if $P(X_i\ |\ Y_i = S)> \ P(X_i \ | \ Y_i = H)$, and label ham otherwise.

In [13]:
print('If Xij = 1, whether to classify email i as spam:')
print(probability_j.iloc[179]['p_ij_eq1_spam'] > probability_j.iloc[179]['p_ij_eq1_ham'])
print('If Xij = 0, whether to classify email i as spam:')
print(probability_j.iloc[179]['p_ij_eq0_spam'] > probability_j.iloc[179]['p_ij_eq0_ham'])

If Xij = 1, whether to classify email i as spam:
True
If Xij = 0, whether to classify email i as spam:
False


In [14]:
test_X_feat_c = testFeat.iloc[:,179]
test_label_c = testLabels.copy()
test_label_c['pred'] = np.where(test_X_feat_c==1, 1, 0)
print('The test accuracy is:', accuracy_score(test_label_c['label'], test_label_c['pred']))

The test accuracy is: 0.5573843416370107


(d) Repeat part (c) for a different single word, $j = $ "thanks". Provide an intuitive explanation for any differences in classification performance.

In [15]:
# "thanks" is word j=860 (the DataFrame index is zero-based, so it sits at index 859)
print(vocab[vocab['text']=='thanks'].index)
display(probability_j.iloc[859])

Int64Index([859], dtype='int64')


p_ij_eq1_spam    0.068530
p_ij_eq0_spam    0.931470
p_ij_eq1_ham     0.320137
p_ij_eq0_ham     0.679863
Name: 859, dtype: float64

In [16]:
print('If Xij = 1, whether to classify email i as spam:')
print(probability_j.iloc[859]['p_ij_eq1_spam'] > probability_j.iloc[859]['p_ij_eq1_ham'])
print('If Xij = 0, whether to classify email i as spam:')
print(probability_j.iloc[859]['p_ij_eq0_spam'] > probability_j.iloc[859]['p_ij_eq0_ham'])

If Xij = 1, whether to classify email i as spam:
False
If Xij = 0, whether to classify email i as spam:
True


In [17]:
test_X_feat_d = testFeat.iloc[:,859]
test_label_d = testLabels.copy()
test_label_d['pred'] = np.where(test_X_feat_d==1, 0, 1)
print('The test accuracy is:', accuracy_score(test_label_d['label'], test_label_d['pred']))

The test accuracy is: 0.6315243179122183


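Explanation: The test accuracy increases from about 0.557 with "money" to about 0.632 with "thanks". A single-word classifier only departs from its default prediction when the word is present. "money" is rare in both classes (about 16% of spam emails and 3% of ham emails in the training data), so almost every email is labeled ham and most spam is missed. "thanks" appears in about 32% of ham emails but only about 7% of spam emails, so its presence is informative for a larger share of the emails.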
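As a rough cross-check of parts (c) and (d), the class counts and conditional probabilities printed earlier imply a training accuracy for each single-word rule. The sketch below hard-codes those printed (rounded) values; the helper `single_word_train_accuracy` is a name introduced here for illustration, and the result is a training accuracy, not the test accuracy reported above.

```python
# Class counts and conditional probabilities copied from the outputs above.
N_s, N_h = 10302, 9927


def single_word_train_accuracy(p1_spam, p1_ham):
    """Training accuracy of the one-word rule implied by the estimated probabilities."""
    if p1_spam > p1_ham:
        # Word present -> spam, word absent -> ham.
        correct = N_s * p1_spam + N_h * (1 - p1_ham)
    else:
        # Word present -> ham, word absent -> spam.
        correct = N_s * (1 - p1_spam) + N_h * p1_ham
    return correct / (N_s + N_h)


acc_money = single_word_train_accuracy(0.164143, 0.030321)
acc_thanks = single_word_train_accuracy(0.068530, 0.320137)
print(round(acc_money, 3), round(acc_thanks, 3))
```

The implied training accuracies (roughly 0.559 and 0.631) are close to the test accuracies of parts (c) and (d), which suggests the gap between the two words comes from how often each word appears in each class.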

(e) Consider a slightly larger dataset which contains the presence or absence of the two words (money,thanks) from parts (c-d). Using the naive Bayes assumption, determine the test accuracy of a classifier based on these two words.

When the number of words $W$ is large, the probabilities become very small, and can underflow to 0 when using finite-precision arithmetic on a computer. To avoid this, we will instead work in the log-domain, and pick the class whose log-probability is largest. Because $\log(ab) = \log(a) + \log(b)$, we have:

$$
\log P(X_{i,\bullet} \ | \ Y_i = S) = \log \prod_{j=1}^{W} P(X_{ij} \ | \ Y_i = S) = \sum_{j=1}^{W} \log P(X_{ij} \ | \ Y_i = S)
$$

with a similar identity for $\log P(X_{i,\bullet} \ | \ Y_i = H).$
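The underflow problem is easy to reproduce. The sketch below uses 2000 hypothetical per-word probabilities of 0.01 (not values from the dataset) and compares the direct product with the sum of logs.

```python
import numpy as np

# Hypothetical per-word probabilities: 2000 factors of 0.01 each.
probs = np.full(2000, 0.01)

# The true product is 1e-4000, far below the smallest positive float64,
# so the direct product underflows to 0.0.
direct_product = np.prod(probs)

# The sum of logs stays finite and can still be compared between classes.
log_sum = np.sum(np.log(probs))

print(direct_product, log_sum)
```

Once both class likelihoods underflow to 0.0 the comparison between them is meaningless, while the log-domain sums remain comparable.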

In [18]:
# Log probability
log_prob_j = np.log(probability_j)
display(log_prob_j.iloc[[179, 859],])

Unnamed: 0,p_ij_eq1_spam,p_ij_eq0_spam,p_ij_eq1_ham,p_ij_eq0_ham
179,-1.807018,-0.179298,-3.495903,-0.030791
859,-2.680478,-0.070992,-1.139006,-0.385864


In [19]:
# Use matrix product
test_X_feat_e = testFeat.iloc[:,[179,859]]
test_X_feat_comp_e = 1 - testFeat.iloc[:,[179,859]]

test_log_sum_spam_e = test_X_feat_e.dot(log_prob_j.iloc[[179, 859], 0])\
    + test_X_feat_comp_e.dot(log_prob_j.iloc[[179, 859], 1])
test_log_sum_ham_e = test_X_feat_e.dot(log_prob_j.iloc[[179, 859], 2])\
    + test_X_feat_comp_e.dot(log_prob_j.iloc[[179, 859], 3])

test_label_e = testLabels.copy()
test_label_e['pred'] = np.where(test_log_sum_spam_e>test_log_sum_ham_e, 1, 0)
print('The test accuracy is:', accuracy_score(test_label_e['label'], test_label_e['pred']))

The test accuracy is: 0.6361209964412812


We see that the test accuracy improves only slightly over part (d), from about 0.6315 to about 0.6361.

(f) Using the identity in the equation above, modify your classification code to compute the log-probability of the spam and ham classes in a numerically robust fashion. Determine the test accuracy of a classifier based on all $W$ words in the full dataset. Hint: this classifier should take seconds (not minutes) to train and test, and be much more accurate than part (e).

In [20]:
def bayes_classifier(prob_j, index_set_j):
    index_set_j = sorted(index_set_j)  # sort a copy so the caller's list is not mutated
    log_proba_j = np.log(prob_j)
    
    # Use matrix product
    test_X_feat = testFeat.iloc[:,index_set_j]
    test_X_feat_comp = 1 - testFeat.iloc[:,index_set_j]
    
    # col0 is spam & Xij=1, col1 is spam & Xij=0
    # col2 is ham & Xij=1, col3 is ham & Xij=0
    test_log_sum_spam = test_X_feat.dot(log_proba_j.iloc[index_set_j, 0])\
        + test_X_feat_comp.dot(log_proba_j.iloc[index_set_j, 1])
    test_log_sum_ham = test_X_feat.dot(log_proba_j.iloc[index_set_j, 2])\
        + test_X_feat_comp.dot(log_proba_j.iloc[index_set_j, 3])
    
    test_label = testLabels.copy()
    test_label['pred'] = np.where(test_log_sum_spam>test_log_sum_ham, 1, 0)
    
    return test_log_sum_spam, test_log_sum_ham, test_label

In [21]:
W=12675
D=20229

index_set = list(range(W))
test_log_sum_spam_f, test_log_sum_ham_f, test_label_f = \
    bayes_classifier(probability_j, index_set)
print('The test accuracy is:', accuracy_score(test_label_f['label'], test_label_f['pred']))

The test accuracy is: 0.9494365361803084


We see that the test accuracy is now very high! We can also check the correctness of the code by changing index_set to [179], [859], or [179, 859], which should reproduce the results of parts (c), (d), and (e). It's also very fast! Cheers!
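One possible simplification, shown here only as a sketch on toy data: because $x \log p + (1 - x) \log(1 - p) = x\,(\log p - \log(1 - p)) + \log(1 - p)$, the two matrix products per class in `bayes_classifier` can be rearranged into one product plus a constant. The matrix `X` and vector `p1` below are hypothetical and are not the homework data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary feature matrix (5 emails, 4 words) and hypothetical P(X_ij = 1 | class).
X = rng.integers(0, 2, size=(5, 4))
p1 = rng.uniform(0.05, 0.95, size=4)

# Two-product form: present words use log p, absent words use log(1 - p).
two_products = X @ np.log(p1) + (1 - X) @ np.log(1 - p1)

# Rearranged form: one product plus a constant that does not depend on the email.
one_product = X @ (np.log(p1) - np.log(1 - p1)) + np.log(1 - p1).sum()

forms_match = np.allclose(two_products, one_product)
print(forms_match)
```

Both forms give the same per-email log-likelihood; the rearranged one avoids building the complement matrix `1 - X`.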