Install NLTK and download the following data package(s) before executing this notebook:
```python
import nltk
nltk.download('stopwords')
```

### Import Libraries

In [1]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /Users/sr/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/sr/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/sr/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
import os
import random
from collections import Counter
from nltk import word_tokenize, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import NaiveBayesClassifier, classify

Very common words that carry little information for classification, such as 'as', 'a', 'the', and 'in', are called `stopwords`. NLTK provides a separate stopword list for each language.

In [4]:
stoplist = stopwords.words('english')
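
As a rough illustration of what removing stopwords does to a token list, the sketch below filters a sentence against a tiny hand-picked stop list. The names `tiny_stoplist`, `tokens` and `content_tokens` are hypothetical; the notebook itself uses the much longer list returned by `stopwords.words('english')`.

```python
# Hypothetical miniature stop list, standing in for stopwords.words('english').
tiny_stoplist = {'as', 'a', 'the', 'in'}

# Hypothetical, already-tokenised sentence.
tokens = ['the', 'meeting', 'is', 'in', 'a', 'week']

# Keep only the tokens that are not in the stop list.
content_tokens = [t for t in tokens if t not in tiny_stoplist]
print(content_tokens)
```

Only words present in the list are dropped, so the result depends entirely on which list is used.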

#### Important Functions

Function to read the contents into a list:

In [5]:
def init_lists(folder):
    a_list = []
    file_list = os.listdir(folder)
    for a_file in file_list:
        # read the raw bytes; the context manager closes every file, not only the last one
        with open(os.path.join(folder, a_file), 'rb') as f:
            a_list.append(f.read())
    return a_list

Function to tokenise, lowercase and lemmatise the text of an email:

In [14]:
def preprocess(sentence):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(str(sentence, errors='ignore'))]
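
Because `init_lists` opens each file in binary mode (`'rb'`), every email reaches `preprocess` as a `bytes` object, which is why it is passed through `str(sentence, errors='ignore')` before tokenising. The sketch below shows that decoding step in isolation; `raw_email` is a made-up value, and the default UTF-8 decoding is a property of Python's `str()` built-in rather than something the notebook states.

```python
# Hypothetical raw email content containing one byte (0xff) that is not valid UTF-8.
raw_email = b'free \xff offer'

# str(bytes, errors='ignore') decodes as UTF-8 and silently drops undecodable bytes.
decoded = str(raw_email, errors='ignore')

print(type(decoded).__name__)
print(repr(decoded))
```

Dropping undecodable bytes keeps the pipeline from raising on malformed emails, at the cost of losing those characters.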

Function to extract features, leaving out the stopwords:

In [15]:
def get_features(text, setting):
    if setting == 'bow':
        return {word: count for word, count in Counter(preprocess(text)).items() if word not in stoplist}
    else:
        return {word: True for word in preprocess(text) if word not in stoplist}
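
The two settings differ only in the feature values. A minimal sketch, assuming a hypothetical token list that has already been preprocessed and stripped of stopwords:

```python
from collections import Counter

# Hypothetical tokens, already lowercased, lemmatised and stopword-free.
tokens = ['free', 'offer', 'free', 'meds']

# 'bow' setting: each word maps to the number of times it occurs.
bow_features = dict(Counter(tokens))

# Any other setting: each word maps to True, recording presence only.
presence_features = {word: True for word in tokens}

print(bow_features)
print(presence_features)
```

With the `'bow'` setting the classifier treats each count as a distinct feature value, which is why the most informative features in this notebook include entries such as `xl = 2`.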

Training the Classifier

In [22]:
def train(features, samples_proportion):
    train_size = int(len(features) * samples_proportion)
    # initialise the training and test sets
    train_set, test_set = features[:train_size], features[train_size:]
    print ('Training set size = ' + str(len(train_set)) + ' emails')
    print ('Test set size = ' + str(len(test_set)) + ' emails')
    # train the classifier
    classifier = NaiveBayesClassifier.train(train_set)
    return train_set, test_set, classifier
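
The split sizes follow directly from the proportion. The sketch below repeats the arithmetic of `train` for the corpus size and proportion used in this notebook; `corpus_size` and `test_size` are illustrative names that do not appear in the function.

```python
corpus_size = 5172          # number of emails reported by the notebook
samples_proportion = 0.8    # proportion passed to train()

# int() truncates, so a fractional boundary rounds down towards the training set.
train_size = int(corpus_size * samples_proportion)
test_size = corpus_size - train_size

print(train_size, test_size)
```

Because the slices are taken in order, the split is only representative if the feature list was shuffled beforehand, as `random.shuffle(all_emails)` does here.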

Accuracy of the classifier

In [25]:
def evaluate(train_set, test_set, classifier):
    # check how the classifier performs on the training and test sets
    print ('Accuracy on the training set = ' + str(classify.accuracy(classifier, train_set)))
    print ('Accuracy of the test set = ' + str(classify.accuracy(classifier, test_set)))
    # check which words are most informative for the classifier
    classifier.show_most_informative_features(10)
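
To make the accuracy figures concrete, here is a minimal sketch of the metric, assuming it is the share of labelled examples whose predicted label matches the true one. The `accuracy` helper, the `toy_predict` rule and the `examples` list are all hypothetical stand-ins, not the NLTK implementation or the trained classifier.

```python
def accuracy(predict, labelled_examples):
    """Return the fraction of (features, label) pairs predicted correctly."""
    correct = sum(1 for features, label in labelled_examples
                  if predict(features) == label)
    return correct / len(labelled_examples)


def toy_predict(features):
    # Hypothetical rule standing in for a trained classifier.
    return 'spam' if 'prescription' in features else 'ham'


examples = [
    ({'prescription': 1, 'pain': 1}, 'spam'),
    ({'meter': 1, 'nom': 2}, 'ham'),
    ({'creative': 1}, 'spam'),  # the toy rule labels this one 'ham'
    ({'xl': 2}, 'ham'),
]

toy_accuracy = accuracy(toy_predict, examples)
print(toy_accuracy)
```

Comparing this figure on the training set and on the held-out test set gives a quick indication of whether the classifier generalises or merely memorises its training data.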

In [7]:
# initialise the data
spam = init_lists('enron1/spam/')

In [8]:
ham = init_lists('enron1/ham/')

In [9]:
all_emails = [(email, 'spam') for email in spam]

In [10]:
all_emails += [(email, 'ham') for email in ham]

In [11]:
random.shuffle(all_emails)

In [12]:
print ('Corpus size = ' + str(len(all_emails)) + ' emails')

Corpus size = 5172 emails


In [32]:
# extract the features
all_features = [(get_features(email, 'bow'), label) for (email, label) in all_emails]

In [33]:
print ('Collected ' + str(len(all_features)) + ' feature sets')

Collected 5172 feature sets


In [34]:
# train the classifier
train_set, test_set, classifier = train(all_features, 0.8)

Training set size = 4137 emails
Test set size = 1035 emails


In [35]:
# evaluate its performance
evaluate(train_set, test_set, classifier)

Accuracy on the training set = 0.957457094512932
Accuracy of the test set = 0.9565217391304348
Most Informative Features
                    2004 = 1                spam : ham    =    104.1 : 1.0
            prescription = 1                spam : ham    =    101.0 : 1.0
                     nom = 1                 ham : spam   =     87.1 : 1.0
                    pain = 1                spam : ham    =     86.5 : 1.0
                      xl = 2                 ham : spam   =     83.2 : 1.0
                    2005 = 1                spam : ham    =     81.6 : 1.0
                    spam = 1                spam : ham    =     81.6 : 1.0
                   meter = 1                 ham : spam   =     72.8 : 1.0
                creative = 1                spam : ham    =     72.0 : 1.0
                featured = 1                spam : ham    =     65.5 : 1.0


[1]:  https://cambridgespark.com/content/tutorials/implementing-your-own-spam-filter/index.html
[2]:  https://labs-repos.iit.demokritos.gr/skel/i-config/downloads/enron-spam/preprocessed/enron1.tar.gz