Project 4  
Spam Classifier  
Hantz Angrand  
4/4/2020

A document classifier uses already classified "training" documents to classify new "test" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  One example of such data is the UCI Machine Learning Repository: Spambase Data Set
  
For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or taken from another source such as your own spam folder). 

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.   

Our task is to distinguish between two types of e-mails: "spam" and "non-spam", often called "ham". The machine learning classifier flags an e-mail as spam when it shows certain features, and the textual content of the e-mail plays a big role in spam detection.

We will be using the Enron dataset, which can be downloaded from https://labs-repos.iit.demokritos.gr/skel/i-config/downloads/enron-spam/preprocessed/

In [None]:
##1.- Load the data

In [3]:
#Import libraries and unarchive the file
import os
import nltk

import tarfile
 
def untar(fname):
    #Extract a .tar.gz archive into the current directory
    if fname.endswith("tar.gz"):
        #The with block closes the archive even if extraction fails
        with tarfile.open(fname) as tar:
            tar.extractall()
        print("Extracted in Current Directory")

        

In [5]:
untar('enron1.tar.gz')

Extracted in Current Directory


In [13]:
#Create spam and ham list
def spamhamlist(folder):
    #Read every file in the folder as raw bytes
    sh_list = []
    for fl in os.listdir(folder):
        #The with block closes each file, not only the last one opened
        with open(os.path.join(folder, fl), 'rb') as f:
            sh_list.append(f.read())
    return sh_list

In [14]:
#Create spam list
spam = spamhamlist('enron1/spam/')

In [17]:
#Create ham list
ham = spamhamlist('enron1/ham/')

In [18]:
#Combine the two lists and keep the labels
mails = [(email,'spam') for email in spam]
mails +=[(email, 'ham') for email in ham]

In [19]:
#get the length of mails
print("Mails dataset length is ", len(mails))

Mails dataset length is  5172


In [20]:
##2. Processing the data

In [21]:
#First shuffle the data
import random
random.shuffle(mails)
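
Because `random.shuffle` is called without a seed, each run produces a different ordering and therefore a different train/test split later on. A small sketch of a reproducible shuffle, assuming a fixed seed is acceptable for this project (the helper `shuffled`, the seed value 42 and the toy list are illustrative and not part of the notebook):

```python
import random

# Toy stand-in for the (email, label) pairs built above
toy_mails = [("mail %d" % i, "spam" if i % 3 == 0 else "ham") for i in range(10)]

def shuffled(items, seed=42):
    """Return a shuffled copy of items using a private, seeded generator."""
    rng = random.Random(seed)   # leaves the global random state untouched
    copy = list(items)
    rng.shuffle(copy)
    return copy

first = shuffled(toy_mails)
second = shuffled(toy_mails)
print(first == second)  # the same seed gives the same order
```

With the notebook's data this would read `mails = shuffled(mails)`; the accuracy figures reported below come from an unseeded shuffle and may vary slightly from run to run.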

In [22]:
mails[:1]

[(b'Subject: do you like computers\r\nincredible offers :\r\nwindows x * p pro 2 oo 2 5 o dollars\r\no * r * d * e * r : http : / / epsom . gaulhafmk . com\r\nwe also have :\r\n- ms picture it premium 9\r\n- ms sql server 2 ooo enterprise edition\r\n- ms sql server 2 ooo enterprise edition\r\nthe offer is valid untill may 16 th\r\nstock is limited\r\ncheck out\r\njudith lanier\r\nmasseur\r\nindrel industria refrigera ? ? o londrinense , londrina 86072000 , brazil\r\nphone : 911 - 547 - 3116\r\nmobile : 737 - 434 - 1646\r\nemail : otxfcc @ satyamonline . com\r\nthis message is beng sent to confirm your account . please do not reply directly to this message\r\nthis version is a 27 day definite freeware\r\nnotes :\r\nthe contents of this message is for understanding and should not be ankara conferee\r\nturnout lou venezuela\r\ntime : sat , 07 may 2005 10 : 39 : 39 + 0200\r\n',
  'spam')]

In [38]:
#Split the text on white space and punctuation marks, convert all words to lower case,
#and link the different forms of the same word through lemmatization.
nltk.download('wordnet')
from nltk import word_tokenize, WordNetLemmatizer
def preprocess(sent):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(str(sent))]


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hangr\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.
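
One caveat about `str(sent)` in `preprocess`: the e-mails were read in binary mode, so `sent` is a `bytes` object, and `str()` on bytes returns its printed representation. That string starts with `b'` and contains the two-character escapes `\r` and `\n` instead of real line breaks, which is why tokens such as `\r\nsubject` show up among the features later. A minimal sketch of decoding first, assuming the files are Latin-1 compatible (`to_text` is a hypothetical helper, not part of the notebook):

```python
def to_text(raw, encoding="latin-1"):
    """Decode raw e-mail bytes to str; pass str input through unchanged."""
    if isinstance(raw, bytes):
        # errors="replace" keeps going if a byte cannot be decoded
        return raw.decode(encoding, errors="replace")
    return raw

raw = b"Subject: hello\r\nsee you at 5 pm\r\n"
as_repr = str(raw)      # printed representation of the bytes object
as_text = to_text(raw)  # real text with real line breaks

print(as_repr.split())  # tokens still carry b' and literal \r\n escapes
print(as_text.split())  # clean word tokens
```

Tokenizing `to_text(sent)` instead of `str(sent)` would change the extracted features, so the accuracies and feature rankings shown in this notebook would likely differ.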


In [35]:
#Extract the feature characterising spam and ham emails
#First remove stop words, those are words that are not useful
from nltk.corpus import stopwords
stoplist = stopwords.words('english')

In [36]:
#For each word that is not in the stoplist, either count how often it appears ('bow')
#or simply register the fact that the word occurs in the email (any other setting)
from collections import Counter
def get_features(text, setting):
    if setting == 'bow':
        return {word: count for word, count in Counter(preprocess(text)).items() if word not in stoplist}
    else:
        return {word: True for word in preprocess(text) if word not in stoplist}
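
The two settings differ only in the value stored for each word. The sketch below shows this on a toy sentence; to run without any NLTK downloads it uses a whitespace tokenizer and a tiny stop list, both of which are simplifying assumptions (`toy_tokens`, `toy_features` and `TOY_STOPLIST` are hypothetical names, not the notebook's functions):

```python
from collections import Counter

TOY_STOPLIST = {"the", "a", "is", "to"}   # hypothetical, not NLTK's list

def toy_tokens(text):
    """Stand-in for preprocess(): lower-case and split on white space."""
    return text.lower().split()

def toy_features(text, setting):
    tokens = [w for w in toy_tokens(text) if w not in TOY_STOPLIST]
    if setting == "bow":
        return dict(Counter(tokens))      # word -> number of occurrences
    return {w: True for w in tokens}      # word -> present

text = "Cheap offer cheap price is the best offer"
bow = toy_features(text, "bow")
presence = toy_features(text, "")
print(bow)
print(presence)
```

In the bag-of-words dictionary "cheap" and "offer" map to 2, while the presence dictionary maps every remaining word to `True`.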

In [39]:
#Extract the features using the bag-of-words model
feats =[(get_features(email, 'bow'), label) for (email,label) in mails]

In [41]:
#Extract the features again as word-presence flags; this overwrites the bag-of-words features
feats =[(get_features(email, ''), label) for (email,label) in mails]

##Training a classifier

In [48]:
#Split into a train set and a test set that will be used to evaluate the model
#The train function takes the feature set defined above and a training proportion, which we set to 0.8
from nltk import NaiveBayesClassifier, classify
def train(features, prop):
    train_size = int(len(features) * prop)
    train_set = features[:train_size]
    test_set = features[train_size:]
    print("Training set size " + str(len(train_set)) + " mails")
    print("Test set size " + str(len(test_set)) + " mails")
    #Train the Naive Bayes model
    classifier = NaiveBayesClassifier.train(train_set)
    return train_set, test_set, classifier

In [49]:
#train the classifier
train_set, test_set, classifier=train(feats, 0.8)

Training set size 4137 mails
Test set size 1035 mails


In [None]:
##Evaluation of the model

In [60]:
#Evaluate the classifier by measuring its accuracy on the test dataset, the 20% of the data held out from training
def evaluate(train_set, test_set, classifier):
    print("Accuracy of the training set " + str(round(classify.accuracy(classifier, train_set) * 100)) + "%")
    print("Accuracy of the test set " + str(round(classify.accuracy(classifier, test_set)*100)) + "%")

In [61]:
#Accuracy on the training_set, test set
evaluate(train_set, test_set, classifier)

Accuracy of the training set 97%
Accuracy of the test set 95%
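
Accuracy alone does not show which kind of mistake the classifier makes (spam let through versus ham flagged as spam). A minimal sketch of precision and recall for the spam class, computed from paired gold and predicted labels; the label lists here are made up for illustration and `precision_recall` is a hypothetical helper:

```python
def precision_recall(gold, predicted, positive="spam"):
    """Return (precision, recall) for the positive label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels, not results from the Enron data
gold = ["spam", "spam", "ham", "ham", "ham", "spam"]
pred = ["spam", "ham", "ham", "spam", "ham", "spam"]
precision, recall = precision_recall(gold, pred)
print(round(precision, 2), round(recall, 2))
```

With the notebook's objects, the predictions could be obtained by calling `classifier.classify(features)` on each `(features, label)` pair in `test_set`.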


From a natural language perspective, we can look at which words the classifier finds most informative when deciding whether an email is spam or ham.

In [62]:
#Most informative words
classifier.show_most_informative_features(10)

Most Informative Features
                     ect = True              ham : spam   =    188.4 : 1.0
                     hou = True              ham : spam   =    184.8 : 1.0
             \r\nsubject = True              ham : spam   =    184.8 : 1.0
                pm\r\nto = True              ham : spam   =    178.2 : 1.0
                    2004 = True             spam : ham    =    150.5 : 1.0
                   meter = True              ham : spam   =    102.4 : 1.0
              medication = True             spam : ham    =     91.0 : 1.0
                    spam = True             spam : ham    =     87.7 : 1.0
                     sex = True             spam : ham    =     79.7 : 1.0
                   cheap = True             spam : ham    =     76.5 : 1.0


For instance, "medication" is around 91 times more likely to appear in spam than in ham, and "sex" is around 80 times more likely to appear in spam than in ham.
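
Each ratio in the table compares how likely a feature is under one label versus the other. The sketch below reproduces that kind of ratio from raw counts, assuming add-0.5 smoothing over two outcomes (word present or absent); this is believed to resemble NLTK's default estimator but should be treated as an assumption, and the counts are made up rather than taken from the Enron data:

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """P(word present | label) with additive smoothing."""
    return (count + gamma) / (total + gamma * bins)

# Hypothetical counts: e-mails per label and how many contain the word
spam_total, ham_total = 1000, 3000
spam_with_word, ham_with_word = 90, 2

p_spam = smoothed_prob(spam_with_word, spam_total)
p_ham = smoothed_prob(ham_with_word, ham_total)
ratio = p_spam / p_ham
print("spam : ham = %.1f : 1.0" % ratio)
```

A word seen in 9% of spam but almost never in ham yields a ratio above 100, which is the same order of magnitude as the top entries in the table.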

#Different dataset to test our model


The SMS Spam Collection is a set of tagged SMS messages that have been collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.

In [80]:
#Load the data
import csv
def load_data():
    data = []
    #Read every row after the header; the with block closes the file
    with open('sms/spam.csv') as f:
        reader = csv.reader(f)
        next(reader, None)
        for row in reader:
            data.append(row)
    return data

In [81]:
data = load_data()

In [82]:
#Number of records in the dataset
print(len(data))

5572


In [83]:
data[0]

['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
 'ham']

In [84]:
#Extract the features using the bag-of-words model
feats =[(get_features(email, 'bow'), label) for (email,label) in data]

In [85]:
#Extract the features again as word-presence flags; this overwrites the bag-of-words features
feats =[(get_features(email, ''), label) for (email,label) in data]

In [86]:
##Training a classifier
train_set, test_set, classifier=train(feats, 0.8)

Training set size 4457 mails
Test set size 1115 mails


In [87]:
##Evaluation of the model
evaluate(train_set, test_set, classifier)

Accuracy of the training set 90%
Accuracy of the test set 90%
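
An accuracy figure is easier to interpret next to the score of a trivial classifier that always predicts the most common label. A short sketch of that baseline, using a made-up label list (`majority_baseline` is a hypothetical helper, and the 80/20 label mix is illustrative only):

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Made-up labels, not the SMS data
toy_labels = ["ham"] * 8 + ["spam"] * 2
label, baseline_accuracy = majority_baseline(toy_labels)
print(label, baseline_accuracy)
```

With the notebook's objects this could be called as `majority_baseline([label for _, label in test_set])`.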


In [88]:
#Most informative words
classifier.show_most_informative_features(10)

Most Informative Features
                 service = True             spam : ham    =    154.8 : 1.0
                   nokia = True             spam : ham    =    103.6 : 1.0
                     txt = True             spam : ham    =     93.7 : 1.0
                      uk = True             spam : ham    =     91.7 : 1.0
                  urgent = True             spam : ham    =     90.4 : 1.0
                    code = True             spam : ham    =     87.4 : 1.0
                      16 = True             spam : ham    =     78.9 : 1.0
                delivery = True             spam : ham    =     66.1 : 1.0
                   award = True             spam : ham    =     66.1 : 1.0
                landline = True             spam : ham    =     65.2 : 1.0


In this dataset, "service" is around 155 times more likely to appear in spam than in ham.  The model reaches 90% accuracy on the held-out 20% of messages, data that it has not seen before.

References  
https://www.kaggle.com/uciml/sms-spam-collection-dataset  
https://github.com/sampepose/SpamClassifier/blob/master/naive_bayes.py 
https://github.com/JonathanKross/spambase/blob/master/spamalot.ipynb
