# CS405 Machine Learning: Lab 2 Preliminary
### Name: 车凯威
### ID: 12032207

__Objectives__：Text mining (deriving information from text) is a wide field which has
gained popularity with the huge text data being generated. Automation of a
number of applications like sentiment analysis, document classification, topic classification, text summarization, machine translation, etc., has been
done using machine learning models. In this lab, you are required to write
your spam filter by using naïve Bayes method. This time you should not
use 3
rd party libraries including scikit-learn.

## Instruction:
Spam filtering is a beginner’s example of document classification task
which involves classifying an email as spam or non-spam (a.k.a. ham) mail. Email dataset will be provided. We will walk through the following steps
to build this application:  
1) Preparing the text data  
2) Creating word dictionary  
3) Feature extraction process  
4) Training the classifier  
5) Checking the results on test set  

## 1 Preparing the text data:
The data-set used here, is split into a training set and a test set containing
702 mails and 260 mails respectively, divided equally between spam and
ham mails. You will easily recognize spam mails as it contains *spmsg*
in its filename.

In any text mining problem, text cleaning is the first step where we
remove those words from the document which may not contribute to the
information we want to extract. Emails may contain a lot of undesirable
characters like punctuation marks, stop words, digits, etc which may not
be helpful in detecting the spam email. The emails in Ling-spam corpus
have been already preprocessed in the following ways:  

a) Removal of stop words – Stop words like “and”, “the”, “of”, etc are
very common in all English sentences and are not very meaningful in
deciding spam or legitimate status, so these words have been removed
from the emails.   

b) Lemmatization – It is the process of grouping together the different
inflected forms of a word so they can be analysed as a single item. For
example, “include”, “includes,” and “included” would all be
represented as “include”. The context of the sentence is also preserved
in lemmatization as opposed to stemming (another buzz word in text
mining which does not consider meaning of the sentence)  

We still need to remove the non-words like punctuation marks or special
characters from the mail documents. There are several ways to do it. Here, we will remove such words after creating a dictionary, which is a very
convenient method to do so since when you have a dictionary; you need
to remove every such word only once.

In [2]:
import os
import numpy as np
from collections import Counter


from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

def make_Dictionary(train_dir):
    emails = [os.path.join(train_dir,f) for f in os.listdir(train_dir)]    
    all_words = []       
    for mail in emails:    
        with open(mail) as m:
            for i,line in enumerate(m):
                if i == 2:
                    words = line.split()
                    all_words += words
    
    dictionary = Counter(all_words)
    
    list_to_remove = list(dictionary)
    for item in list_to_remove:
        if item.isalpha() == False: 
            del dictionary[item]
        elif len(item) == 1:
            del dictionary[item]
    dictionary = dictionary.most_common(3000)
    return dictionary
    
def extract_features(mail_dir): 
    files = [os.path.join(mail_dir,fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files),3000))
    docID = 0
    for fil in files:
      with open(fil) as fi:
        for i,line in enumerate(fi):
          if i == 2:
            words = line.split()
            for word in words:
              wordID = 0
              for i,d in enumerate(dictionary):
                if d[0] == word:
                  wordID = i
                  features_matrix[docID,wordID] = words.count(word)
        docID = docID + 1     
    return features_matrix
    
# Create a dictionary of words with its frequency

train_dir = 'ling-spam\\train-mails'
dictionary = make_Dictionary(train_dir)

# Prepare feature vectors per training mail and its labels

train_labels = np.zeros(702)
train_labels[351:701] = 1
train_matrix = extract_features(train_dir)



In [3]:
P_true = 350/702
P_false = 1-P_true

In [3]:
# P_Feature_true = 
len(train_matrix[:,1])
len(train_matrix[:,1][351:701])
feature_all = sum(train_matrix[:,80])
feature_true = sum(train_matrix[:,80][351:701])
feature_false = feature_all-feature_true
feature_true/feature_all

0.49416342412451364

In [4]:
def train_algorithm(train_matrix):
    p_feature_true = []
    p_feature_false = []
    for feature_row_idx in range(0,len(train_matrix[1,:])):
        #print(feature_row_idx)
        feature_row = train_matrix[:,feature_row_idx]
        ture_list = list(feature_row[351:701])
        ture_all = sum(ture_list)+ture_list.count(0)
        feature_true = sum(ture_list)

        false_list = list(feature_row[0:350])
        false_all = sum(false_list)+false_list.count(0)
        feature_false = sum(false_list)


        p_feature_true.append(feature_true/ture_all)
        p_feature_false.append(feature_false/false_all) 

        
    return p_feature_true,p_feature_false

In [5]:
import math
from functools import reduce

def pred(test_matrix,p_feature_true,p_feature_false):
    email = test_matrix[22,:]
    email_true = []
    email_false = []
    for i,line in enumerate(email):
        if (email[i]!=0)&(p_feature_true[i]!=0) :
            email_true.append(p_feature_true[i]**email[i])
            email_false.append(p_feature_true[i]**email[i])
            

    prb_true = reduce(lambda x,y:x*y,email_true)*P_true
    prb_false = reduce(lambda x,y:x*y,email_false)*P_false

    if prb_true > prb_false:
        result = "true"
    else:
        result = "false"
    print(result)

In [1]:
p_feature_true,p_feature_false = train_algorithm(train_matrix)
pred(test_matrix,p_feature_true,p_feature_false)


NameError: name 'train_algorithm' is not defined

In [10]:
x = 0
for i in train_matrix[:]:
    x = x+1
print(x)

702


In [95]:
def train_algorithm2(train_matrix):
    p_feature_true = []
    p_feature_false = []
    

    size = len(train_matrix[1,:])
    

    for feature_row_idx in range(0,size):
        #print(feature_row_idx)
        feature_row = train_matrix[:,feature_row_idx]
        feature_row_plus1 = feature_row + 1
        
        ture_list = list(feature_row_plus1[351:701])
        true_all = 352
        feature_true = 352 - ture_list.count(1) +1

        false_list = list(feature_row_plus1[0:350])
        false_all = 352
        feature_false = 352 - false_list.count(1) +1

        p_feature_true.append(feature_true/true_all)
        p_feature_false.append(feature_false/false_all) 
        #print(p_feature_false)
        
    return p_feature_true,p_feature_false

In [96]:
p_feature_true,p_feature_false = train_algorithm2(train_matrix)
# p_feature_true.count(1)
# feature_row = train_matrix[:,1]
# feature_row_plus1 = feature_row + 1
# list(feature_row_plus1).count(0)
# p_feature_true.count(0)

In [101]:
import math
from functools import reduce

def pred2(test_matrix,p_feature_true,p_feature_false,idx):

    p_feature_true = np.log(p_feature_true)
    p_feature_false = np.log(p_feature_false)

    email = test_matrix[idx,:] +1
    email_true = []
    email_false = []



    # for i,line in enumerate(email):
    #     email_true.append( * [i])
    #     email_false.append(p_feature_false[i])
    #     #p1 = np.log()

    prb_true = sum(email * p_feature_true) + np.log(P_true)
    prb_false = sum(email * p_feature_false) + np.log(P_false)

    if prb_true > prb_false:
        result = "true"
    else:
        result = "false"
    
    print(result)

In [106]:
# for i in range(130,131):
pred2(test_matrix,p_feature_true,p_feature_false,1)




true


In [4]:
def train_algorithm3(train_matrix):
    p_feature_true = []
    p_feature_false = []
    
    for i in range(0,3000):
        #print(feature_row_idx)
        feature_row = train_matrix[:,i] + 1
        
        ture_list = list(feature_row[351:701])
        
        true_all = sum(ture_list) + 2
        feature_true = sum(ture_list) - ture_list.count(1) + 1

        false_list = list(feature_row[0:350])
        false_all = sum(false_list) + 2
        feature_false = sum(false_list) - false_list.count(1) + 1

        p_feature_true.append(np.log(feature_true/true_all))
        p_feature_false.append(np.log(feature_false/false_all)) 
        #print(p_feature_false)
        
    return p_feature_true,p_feature_false

In [28]:
# for i in range(130,131):

p_feature_true,p_feature_false = train_algorithm3(train_matrix)

#print(p_feature_true)
#print(p_feature_false)
p_feature_true
p_feature_false

444796,
 -2.5489062326209613,
 -0.6253513275520677,
 -1.9049837213460958,
 -1.0670069492527785,
 -1.058982493897342,
 -1.4033057069464272,
 -1.4085449700547104,
 -1.3960505360652553,
 -0.8696764166493955,
 -2.128231705849268,
 -1.6069030568309124,
 -2.1427363920521496,
 -1.6019096460133087,
 -3.6777061535158113,
 -0.8378191164326598,
 -2.5489062326209613,
 -1.3648394261186312,
 -3.9262076404201025,
 -1.3461472931064786,
 -1.3417383395696578,
 -1.3502760619307348,
 -1.6782481188369747,
 -1.0722949803507362,
 -0.7068931014645804,
 -2.136136885356381,
 -1.2944868118667678,
 -2.8553749158590684,
 -1.3814750746839417,
 -1.0743553110225497,
 -3.3155836289391636,
 -0.8406928165592796,
 -1.8127026430732982,
 -5.863631175598097,
 -0.9770570870903027,
 -1.4894785973551214,
 -2.31867123074567,
 -2.7300291078209855,
 -1.629743178594846,
 -0.7593699671730811,
 -0.8119130172782368,
 -1.8782999255591115,
 -1.252762968495368,
 -5.863631175598097,
 -2.008345619996106,
 -1.2146844599208637,
 -1.11937428

In [26]:
list(train_matrix[:,1]).count(0)

431

In [17]:
import math
from functools import reduce

def pred3(test_matrix,p_feature_true,p_feature_false,idx):
    
    email  = test_matrix[idx,:] + 1

    # for i,line in enumerate(email):
    #     email_true.append( * [i])
    #     email_false.append(p_feature_false[i])
    #     #p1 = np.log()

    prb_true = sum(email * p_feature_true) + np.log(P_true)
    prb_false = sum(email * p_feature_false) + np.log(P_false)
    print(prb_true)
    if prb_true > prb_false:
        result=1
    else:
        result = 0
    return result

    #print(prb_false)

In [18]:
#pred3(test_matrix,p_feature_true,p_feature_false,190)
# email  = test_matrix[1,:] + 1
# print(email)
result = []
for i in range(0,260):
    result.append(pred3(test_matrix,p_feature_true,p_feature_false,i))


-10009.622306941168
-9859.956099473093
-9945.72935953161
-9798.707948525054
-9868.9005890284
-10189.417666746507
-9971.646801861254
-10178.322486401463
-9718.833109872956
-10301.397028969766
-9798.734858439942
-9857.478771322918
-9765.977111775344
-10337.1085072827
-10077.436204713713
-9739.088236369025
-9760.882900336765
-9723.631545467422
-9954.6073288005
-9734.499051001372
-10121.988043494355
-9989.450191470534
-10924.642620158666
-10387.652892714104
-9905.531978564086
-10668.612411372773
-10823.548856513195
-10155.879582940779
-10148.28507350718
-10043.111733554928
-9864.114851807917
-10655.158305421648
-9747.19691142196
-11580.634398277585
-11681.830659051437
-9771.812995127993
-9875.20752555215
-10347.07759052921
-9728.166466002773
-10152.978742265077
-9977.46856822304
-10259.223930549222
-9878.37423242737
-9776.561787614697
-10086.983403707
-9773.537174745552
-10061.438246113792
-9999.618063378246
-9822.907461056171
-9928.732127059307
-10037.666888094833
-9809.178250495273
-9882

In [8]:
#############################################
# Training SVM and Naive bayes classifier and its variants

#model1 = LinearSVC()
model2 = MultinomialNB()

#model1.fit(train_matrix,train_labels)
model2.fit(train_matrix,train_labels)


# Test the unseen mails for Spam

test_dir = 'ling-spam\\test-mails'
test_matrix = extract_features(test_dir)
test_labels = np.zeros(260)
test_labels[130:260] = 1

#result1 = model1.predict(test_matrix)
result2 = model2.predict(test_matrix)

#print(confusion_matrix(test_labels,result1))
print(confusion_matrix(test_labels,result2))

correct = 0
for i,item in enumerate(result2):
    if result2[i]==test_labels[i]:
        correct+=1
correct
acc = correct/260
acc

[[129   1]
 [  9 121]]


0.9615384615384616

In [21]:

correct = 0
for i,item in enumerate(result):
    if result[i]==test_labels[i]:
        correct+=1
correct
acc = correct/260
acc

0.9192307692307692