We will be using Euron spam dataset for spam email classification problem. The euron datasets ar present at the below location:
http://nlp.cs.aueb.gr/software_and_datasets/Enron-Spam/index.html
* Readme about the data is also present at the above location which explains details about it.

In [1]:
import os
import collections
import nltk
from nltk.classify import NaiveBayesClassifier, accuracy
# nltk.download('punkt')
import random
from sklearn.utils import shuffle

In [2]:
# Define few stop words
stop_words = {
    'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 
    'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they',
    'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 
    'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as',
    'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we',
    'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more',
    'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above',
    'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any',
    'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does',
    'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can',
    'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where',
    'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't',
    'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how',
    'further', 'was', 'here', 'than'}

The Euron dataset contains spam (malicious email) and ham (non-malicious email) two folders 
inside the big folder. Each folder spam and ham then again contain many text files. Let's load the data of the files to a list.

In [3]:
def load_files(dir):
    full_data = []
    for file_name in os.listdir(dir):
        with open(dir + '/' + file_name, 'r', encoding='ISO-8859-1') as f:
            full_data.append(f.read())
    return full_data

In [4]:
positive_samples = load_files('./data/enron1/spam')
negative_samples = load_files('./data/enron1/ham')

In [5]:
positive_samples_samples

["Subject: what up , , your cam babe\nwhat are you looking for ?\nif your looking for a companion for friendship , love , a date , or just good ole '\nfashioned * * * * * * , then try our brand new site ; it was developed and created\nto help anyone find what they ' re looking for . a quick bio form and you ' re\non the road to satisfaction in every sense of the word . . . . no matter what\nthat may be !\ntry it out and youll be amazed .\nhave a terrific time this evening\ncopy and pa ste the add . ress you see on the line below into your browser to come to the site .\nhttp : / / www . meganbang . biz / bld / acc /\nno more plz\nhttp : / / www . naturalgolden . com / retract /\ncounterattack aitken step preemptive shoehorn scaup . electrocardiograph movie honeycomb . monster war brandywine pietism byrne catatonia . encomia lookup intervenor skeleton turn catfish .\n",
 'Subject: want to make more money ?\norder confirmation . your order should be shipped by january , via fedex .\nyour 

In [6]:
negative_samples

["Subject: ena sales on hpl\njust to update you on this project ' s status :\nbased on a new report that scott mills ran for me from sitara , i have come up\nwith the following counterparties as the ones to which ena is selling gas off\nof hpl ' s pipe .\naltrade transaction , l . l . c . gulf gas utilities company\nbrazoria , city of panther pipeline , inc .\ncentral illinois light company praxair , inc .\ncentral power and light company reliant energy - entex\nces - equistar chemicals , lp reliant energy - hl & p\ncorpus christi gas marketing , lp southern union company\nd & h gas company , inc . texas utilities fuel company\nduke energy field services , inc . txu gas distribution\nentex gas marketing company union carbide corporation\nequistar chemicals , lp unit gas transmission company inc .\nsince i ' m not sure exactly what gets entered into sitara , pat clynes\nsuggested that i check with daren farmer to make sure that i ' m not missing\nsomething ( which i did below ) . while 

In [7]:
# Preprocessing the data includes lemmatization, tokenization and stop word removal
def preprocess_sentence(sentence):
    lemmatizer = nltk.WordNetLemmatizer()
    tokens = nltk.word_tokenize(sentence)
    tokens = [w.lower() for w in tokens]
    # find least common elements
    word_counts = collections.Counter(tokens)
    uncommon_words = word_counts.most_common()[:-10:-1]
    # filter tokens based on the following
    tokens = [w for w in tokens if w not in stop_words]
    tokens = [w for w in tokens if w not in uncommon_words]
    #lemmatize
    tokens = [lemmatizer.lemmatize(w) for w in tokens] 
    return tokens

In [8]:
# Let us have a look at an email
for email in positive_samples[:1]:
    print(email)

Subject: what up , , your cam babe
what are you looking for ?
if your looking for a companion for friendship , love , a date , or just good ole '
fashioned * * * * * * , then try our brand new site ; it was developed and created
to help anyone find what they ' re looking for . a quick bio form and you ' re
on the road to satisfaction in every sense of the word . . . . no matter what
that may be !
try it out and youll be amazed .
have a terrific time this evening
copy and pa ste the add . ress you see on the line below into your browser to come to the site .
http : / / www . meganbang . biz / bld / acc /
no more plz
http : / / www . naturalgolden . com / retract /
counterattack aitken step preemptive shoehorn scaup . electrocardiograph movie honeycomb . monster war brandywine pietism byrne catatonia . encomia lookup intervenor skeleton turn catfish .



In [9]:
# preprocess sentences 
positive_samples = [preprocess_sentence(email) for email in positive_samples]
negative_samples = [preprocess_sentence(email) for email in negative_samples]

In [10]:
# label samples
positive_samples = [(email, 1) for email in positive_samples]
negative_samples = [(email, 0) for email in negative_samples]
all_samples = positive_samples + negative_samples
# all_samples = shuffle(all_samples)
random.shuffle(all_samples)

In [11]:
print(f"{len(all_samples)} emails processed")

5172 emails processed


In [12]:
# Feature extraction
def feature_extraction(tokens):
    # Each word will be a feature and feature value will be word count
    return dict(collections.Counter(tokens))

In [13]:
# features = [(feature_extraction(corpus), label) for corpus, label in all_samples]
features = [(feature_extraction(corpus), label)
              for corpus, label in all_samples]

In [14]:
features[:1]

[({'-': 3,
   '.': 3,
   ':': 1,
   '@': 1,
   'address': 2,
   'attl': 1,
   'buylow': 1,
   'change': 1,
   'com': 1,
   'houston': 1,
   'htm': 1,
   'ken': 3,
   'later': 1,
   'new': 1,
   'rr': 1,
   'seaman': 2,
   'subject': 1},
  0)]

In [15]:
# train test split
def train_test_split(dataset, train_size=0.8):
    num_train_samples = int(len(dataset) * train_size)
    return dataset[:num_train_samples], dataset[num_train_samples:]

In [16]:
training_set, test_set = train_test_split(features, train_size=0.7)

In [17]:
model = nltk.classify.NaiveBayesClassifier.train(training_set)
training_error = nltk.classify.accuracy(model, training_set)
print(f'Model training complete. Accuracy on training set: {training_error}')

testing_error = nltk.classify.accuracy(model, test_set)
print(f'Accuracy on test set: {testing_error}')

Model training complete. Accuracy on training set: 0.9574585635359116
Accuracy on test set: 0.9478092783505154
