# Text Classification
*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. Please check the pdf file for more details.*

In this exercise you will:
    
- implement a spam classifier using the **Naive Bayes method** for real-world email messages
- learn the **training and testing phases** of a Naive Bayes classifier
- get an idea of the **precision-recall** tradeoff

In [1]:
# some basic imports
import numpy as np
import matplotlib.pyplot as plt
import scipy.sparse
%matplotlib inline

%load_ext autoreload
%autoreload 2

In [2]:
# ham_train contains the occurrences of each word in ham emails. 1-by-N vector
ham_train = np.loadtxt('ham_train.csv', delimiter=',')
# spam_train contains the occurrences of each word in spam emails. 1-by-N vector
spam_train = np.loadtxt('spam_train.csv', delimiter=',')
# N is the size of vocabulary.
N = ham_train.shape[0]
# There are 9034 ham emails and 3372 spam emails in the training samples
num_ham_train = 9034
num_spam_train = 3372
# Add-one (Laplace) smoothing: add 1 to every word count so that no word gets zero probability
x = np.vstack([ham_train, spam_train]) + 1

# ham_test contains the occurrences of each word in each ham test email.
# P-by-N sparse matrix, where P is the number of ham test emails.
# Each row of the text file is a (row index, column index, count) triple with 1-based indices.
i,j,ham_test = np.loadtxt('ham_test.txt').T
# np.int was removed in NumPy 1.24; the builtin int is the replacement
i = i.astype(int)
j = j.astype(int)
ham_test_tight = scipy.sparse.coo_matrix((ham_test, (i - 1, j - 1)))
# pad the columns so that the test matrix covers the full vocabulary of size N
ham_test = scipy.sparse.csr_matrix((ham_test_tight.shape[0], ham_train.shape[0]))
ham_test[:, 0:ham_test_tight.shape[1]] = ham_test_tight
# spam_test contains the occurrences of each word in each spam test email.
# Q-by-N sparse matrix, where Q is the number of spam test emails.
i,j,spam_test = np.loadtxt('spam_test.txt').T
i = i.astype(int)
j = j.astype(int)
spam_test_tight = scipy.sparse.csr_matrix((spam_test, (i - 1, j - 1)))
spam_test = scipy.sparse.csr_matrix((spam_test_tight.shape[0], spam_train.shape[0]))
spam_test[:, 0:spam_test_tight.shape[1]] = spam_test_tight



## Now let's implement a ham/spam email classifier. Please refer to the PDF file for details
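
The `likelihood` function imported in the next cell lives in a separate file that is not shown in this worksheet. As a rough sketch only, and assuming it simply normalizes each row of the smoothed count matrix `x` into a distribution over the vocabulary, it might look like the following. The name `likelihood_sketch` and the toy counts are hypothetical and are not part of the assignment code.

```python
import numpy as np

def likelihood_sketch(x):
    """Hypothetical stand-in for likelihood(): row-normalize word counts.

    x is a C-by-N array of (already smoothed) word counts, one row per class.
    Returns a C-by-N array whose entry [i][j] estimates P(word_j | c_i).
    """
    x = np.asarray(x, dtype=float)
    # divide each row by its total count so that every row sums to 1
    return x / x.sum(axis=1, keepdims=True)

# toy counts for 2 classes and 3 words, with add-one smoothing applied
toy_counts = np.array([[3, 0, 1],
                       [0, 2, 2]]) + 1
l_toy = likelihood_sketch(toy_counts)
```

Each row of `l_toy` sums to 1, and the smoothing guarantees that every entry is strictly positive, so taking the logarithm later is safe.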

In [3]:
from likelihood import likelihood
import linecache
# TODO
# Implement a ham/spam email classifier, and calculate the accuracy of your classifier

# Hint: you can directly do a matrix multiply between a scipy.sparse matrix (e.g. coo_matrix or
# csr_matrix) and a numpy.array. Specifically, you can use sparse_matrix * np_array to do this.
# Note that the "*" operator between two numpy arrays is typically an elementwise multiply.

# begin answer
# l is likelihood of x, 2-by-N, with N the size of vocabulary
# l[i][j] = P(word_j|c_i)
l = likelihood(x)
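# list the top-10 words that are most indicative of the SPAM class:
# the smallest ratios P(word|ham)/P(word|spam) mark the most spam-indicative words
tmp = l[0]/l[1]
# argpartition with kth=range(10) places the 10 smallest ratios, in sorted order, at the front
tmp = np.argpartition(tmp, range(10))[:10]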
print(tmp)
file_path = 'all_word_map.txt'
for i in tmp:
    word = linecache.getline(file_path, i+1).strip().split()[0]
    print(word, end=', ')
print('\n')

[30032 75525 38175 45152  9493 65397 37567 13612 56929  9452]
nbsp, viagra, pills, cialis, voip, php, meds, computron, sex, ooking, 
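
The hint in the cell above relies on `*` meaning matrix multiplication for `scipy.sparse` matrix classes, while for plain NumPy arrays it is elementwise. The small sketch below, using made-up counts and probabilities, checks that `*` and `@` agree for a `csr_matrix` and match the dense result. The `@` operator is the less ambiguous choice, since SciPy's newer sparse array classes (such as `csr_array`) treat `*` as elementwise.

```python
import numpy as np
import scipy.sparse

# toy word-count matrix: 2 emails, 3 vocabulary words (made-up values)
counts = scipy.sparse.csr_matrix(np.array([[2.0, 0.0, 1.0],
                                            [0.0, 3.0, 0.0]]))
# toy log-likelihoods: 2 classes by 3 words (made-up values)
log_l = np.log(np.array([[0.50, 0.25, 0.25],
                         [0.25, 0.50, 0.25]]))

# sparse matrix times dense array: "*" is a matrix product for csr_matrix
scores_star = counts * log_l.T
# "@" is a matrix product for sparse matrices and dense arrays alike
scores_at = counts @ log_l.T
# dense reference computed with plain NumPy
scores_dense = counts.toarray() @ log_l.T
```

Each row of the result holds one email's summed log-likelihood under each class, which is the quantity the classifier compares after adding the log prior.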



In [4]:
# prior: 1-by-2
# prior[i] = log[P(c_i)]
prior = np.array([num_ham_train, num_spam_train])/(num_ham_train + num_spam_train)
# work in log space so that products of many small probabilities become sums and do not underflow
l = np.log(l)
prior = np.log(prior)

ham_post = ham_test*l.T
spam_post = spam_test*l.T
total_ham = ham_post.shape[0]
total_spam = spam_post.shape[0]
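# miss_ham: ham emails whose spam score is higher (misclassified as spam)
miss_ham = np.sum(ham_post[:,0]+prior[0] < ham_post[:,1]+prior[1], dtype='int')
# miss_spam: spam emails whose ham score is higher (misclassified as ham)
miss_spam = np.sum(spam_post[:,0]+prior[0] > spam_post[:,1]+prior[1], dtype='int')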

# accuracy
accuracy = (total_ham+total_spam-miss_ham-miss_spam)/(total_ham+total_spam)
print(total_spam, miss_spam, total_ham, miss_ham)
print(accuracy)
# end answer

1124 31 3011 28
0.9857315598548972
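
The counts printed above are also enough to compute precision and recall, which the objectives mention. The sketch below assumes spam is the positive class, so a misclassified ham email counts as a false positive and a misclassified spam email as a false negative. The variable names `tp`, `fp` and `fn` are introduced here for illustration.

```python
# counts taken from the output above
total_spam, miss_spam = 1124, 31
total_ham, miss_ham = 3011, 28

# treating spam as the positive class (an assumption of this sketch)
tp = total_spam - miss_spam   # spam correctly flagged as spam
fn = miss_spam                # spam that slipped through as ham
fp = miss_ham                 # ham wrongly flagged as spam

precision = tp / (tp + fp)    # fraction of flagged emails that really are spam
recall = tp / (tp + fn)       # fraction of spam emails that were flagged
print(precision, recall)
```

With these counts, precision is about 0.975 and recall about 0.972. Requiring a larger margin before labeling an email as spam would generally raise precision at the cost of recall, which is the tradeoff in question.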
