# Introduction
In this project I will build a simple (but hopefully effective) spam filter for emails using a naive Bayes approach on the Enron-Spam dataset. The dataset contains 33,716 emails labeled as spam or non-spam ("ham"). The original dataset was collected by V. Metsis, I. Androutsopoulos and G. Paliouras and can be found [here](http://www2.aueb.gr/users/ion/data/enron-spam). I packaged all the data into a [single csv-file](https://github.com/MWiechmann/enron_spam_data), which I will be using for this project.

# Reading in data

In [6]:
import pandas as pd
import dask.dataframe as dd
import csv

mails = pd.read_csv("enron_spam_data.zip", compression="zip", index_col="Message ID")

In [7]:
mails.head()

Message ID,Subject,Message,Spam/Ham,Date
0,christmas tree farm pictures,,ham,1999-12-10
1,"vastar resources , inc .","gary , production from the high island larger ...",ham,1999-12-13
2,calpine daily gas nomination,- calpine daily gas nomination 1 . doc,ham,1999-12-14
3,re : issue,fyi - see note below - already done .\nstella\...,ham,1999-12-14
4,meter 7268 nov allocation,fyi .\n- - - - - - - - - - - - - - - - - - - -...,ham,1999-12-14


In [None]:
print("\nTotal:\t" + str(mails.shape[0]))
print(mails["Spam/Ham"].value_counts(dropna=False))

print("\nProportion in %")
print(round(mails["Spam/Ham"].value_counts(normalize=True), 4)*100)

## Setting up Training & Testing Data Set
I will use 80% of the data as the training set and the remaining 20% as the test set.

In [None]:
# Randomize dataset
mails = mails.sample(frac=1, random_state=42)

# Reindex after randomization
mails.reset_index(inplace=True, drop=True)

# Get 80% as training data, rest as test data
cutoff_index = int(round(mails.shape[0] * 0.8, 0))

train = mails.iloc[:cutoff_index].copy(deep=True)
test = mails.iloc[cutoff_index:].copy(deep=True)

train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)

In [None]:
print("TRAINING DATA:")
print("Proportion in %")
print(round(train["Spam/Ham"].value_counts(normalize=True), 4)*100)

print("\nTESTING DATA:")
print("Proportion in %")
print(round(test["Spam/Ham"].value_counts(normalize=True), 4)*100)

Proportions are comparable in both datasets.
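
The shuffle-and-cut split used above can be illustrated without pandas. The following is a minimal sketch in plain Python on made-up labels; `shuffle_and_split` and `class_proportions` are hypothetical helper names and not part of the notebook. Because the split is random rather than stratified, the class proportions of the two parts will usually be close but are not guaranteed to be identical.

```python
import random
from collections import Counter


def shuffle_and_split(items, train_frac=0.8, seed=42):
    """Shuffle a copy of `items` and cut it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cutoff = int(round(len(shuffled) * train_frac))
    return shuffled[:cutoff], shuffled[cutoff:]


def class_proportions(labels):
    """Return the share of each label in percent, rounded to 2 decimals."""
    counts = Counter(labels)
    total = len(labels)
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# Toy labels (made up for illustration): 510 spam and 490 ham messages
labels = ["spam"] * 510 + ["ham"] * 490
train_labels, test_labels = shuffle_and_split(labels)

print(len(train_labels), len(test_labels))
print(class_proportions(train_labels))
print(class_proportions(test_labels))
```

If identical proportions in both parts were required, the split would have to be done per class (stratified) instead.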

# Prepping Data for Naive Bayes
In this step I will prepare both the subject line and the email body for later processing.

## Cleaning the Message Word String for Subject Line and Email Body

### Prepping Subject Line Data

In [None]:
# Replace every non-word character (punctuation etc.) with a space and convert everything to lowercase
train["Subject"] = train["Subject"].str.replace(
    r"\W", " ", regex=True).str.lower()
# The replacement above is quick but leaves runs of spaces - collapse them and trim the ends
train["Subject"] = train["Subject"].str.replace(
    r"\s{2,}", " ", regex=True).str.strip()

In [None]:
train["Subject"].head()
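
To make the effect of the two regular expressions concrete, here is a small sketch that applies the same patterns to single strings with the standard `re` module. `clean_text` is a hypothetical helper name used only for illustration; the notebook itself works on whole pandas columns.

```python
import re


def clean_text(text):
    """Hypothetical helper: apply the two cleaning steps to a single string."""
    # Step 1: replace every non-word character with a space, then lowercase
    text = re.sub(r"\W", " ", text).lower()
    # Step 2: collapse runs of whitespace and trim both ends
    return re.sub(r"\s{2,}", " ", text).strip()


print(clean_text("vastar resources , inc ."))   # vastar resources inc
print(clean_text("RE : issue"))                 # re issue
print(clean_text("meter_7268 nov allocation"))  # unchanged: "_" counts as a word character
```

Note that `\W` does not match digits or underscores, so numbers and tokens such as `meter_7268` survive the cleaning unchanged.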

### Prepping Message Data

In [None]:
# Replace every non-word character (punctuation etc.) with a space and convert everything to lowercase
train["Message"] = train["Message"].str.replace(
    r"\W", " ", regex=True).str.lower()
# The replacement above is quick but leaves runs of spaces - collapse them and trim the ends
train["Message"] = train["Message"].str.replace(
    r"\s{2,}", " ", regex=True).str.strip()

In [None]:
train["Message"].head()

## Building the vocabulary

### Vocabulary: Subject Line

In [None]:
# Transform Subject line to list
train["Subject"] = train["Subject"].str.split(" ")

In [None]:
# Build the vocabulary
subject_voc = []
# Add each single word of each subject line to the vocabulary
for index, subject in train["Subject"].items():
    # Skip blank subject lines, where the split left NaN instead of a list
    if isinstance(subject, list):
        for word in subject:
            subject_voc.append(word)
# Get rid of duplicate words
subject_voc = list(set(subject_voc))

In [None]:
print("Unique words in subject line of training data set:")
print(len(subject_voc))
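
The same vocabulary can be collected directly into a set, which avoids building the long intermediate list. The sketch below uses made-up rows and a hypothetical helper name, `build_vocabulary`. It also filters out the empty string: splitting an empty string on `" "` returns `[""]`, so a subject that is empty after cleaning may otherwise add `""` to the vocabulary.

```python
def build_vocabulary(token_lists):
    """Collect the unique, non-empty tokens from an iterable of token lists."""
    vocabulary = set()
    for tokens in token_lists:
        # Rows without text are not lists (NaN in the notebook's data); skip them
        if isinstance(tokens, list):
            vocabulary.update(token for token in tokens if token)
    return sorted(vocabulary)


# Toy rows (made up): two subjects, a missing one, and an empty one
rows = [
    "calpine daily gas nomination".split(" "),
    "daily gas report".split(" "),
    float("nan"),
    "".split(" "),  # -> [""], the empty-string token
]
print(build_vocabulary(rows))
```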

In [None]:
# Build dictionary with word count for each subject line

# Create a dictionary with all unique words as keys and a list as entry
# the list contains one count for each subject line (row) - will be filled in the next step
# because of reindexing ID simply starts at 0...
word_counts_per_subject = {"Message ID": list(range(train.shape[0]))}
word_counts_per_subject.update({word: [0]*train.shape[0] for word in subject_voc})


# loop over all the subject lines for each contained word
# increase the count in appropriate place in the dictionary by +1
for index, subject in train["Subject"].items():
    # blank subject lines are NaN rather than a list and contribute no counts
    if isinstance(subject, list):
        for word in subject:
            word_counts_per_subject[word][index] += 1
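
The dictionary above is dense: it stores one integer per word and row, and most of those integers are zero. As a point of comparison, here is a minimal sketch of a sparse layout that keeps one `Counter` per row and stores only the words that actually occur. This is an alternative for illustration, not the approach taken in this notebook, and `count_words_per_row` is a hypothetical name.

```python
from collections import Counter


def count_words_per_row(token_lists):
    """Return one Counter per row; rows without a token list get an empty Counter."""
    counts = []
    for tokens in token_lists:
        if isinstance(tokens, list):
            counts.append(Counter(token for token in tokens if token))
        else:
            counts.append(Counter())
    return counts


# Toy rows (made up): two tokenized subjects and one missing subject
rows = [["re", "issue"], ["gas", "gas", "nomination"], float("nan")]
per_row = count_words_per_row(rows)

print(per_row[1]["gas"])    # 2
print(per_row[0]["gas"])    # 0, missing words count as zero
print(len(per_row[2]))      # 0, no tokens in the missing row
```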

In [None]:
# Export data to csv file

'''
The data can still be handled by pandas but is getting a bit large to be handled comfortably on all machines (1GB+)
This is of course more of a problem for the later analysis of email message data, which will usually not fit into memory.

Therefore the data will be exported to a csv file so that it can later be read into a dask dataframe.

'''

def save_dict_to_csv(dictionary, output_file_name, progress_step=5):
    print("...opening/creating file " + output_file_name)
    with open(output_file_name, 'w', newline='') as csvfile:
        print("...writing dictionary to file...")
        writer = csv.writer(csvfile)
        writer.writerow(dictionary)  # First row (the keys of the dictionary).
        print("...wrote header to file...")
        print("...now writing values to file...")

        # vals for progress messages
        # number of rows = length of the shortest value list (zip stops there)
        steps = min((len(values) for values in dictionary.values()), default=0)
        next_report = progress_step

        for current_step, values in enumerate(zip(*dictionary.values()), start=1):
            writer.writerow(values)

            # report whenever another progress_step percent has been written
            percent = current_step * 100 // steps
            if percent >= next_report:
                print("...about " + str(percent) + "% done...")
                next_report = percent - percent % progress_step + progress_step

    print("...DONE!")

In [None]:
save_dict_to_csv(word_counts_per_subject, 'test_output3.csv')
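
The exported file has one column per word and one row per subject line. The sketch below shows that layout on a made-up three-row table and reads it back one row at a time, so the full table never has to sit in memory. It uses only the standard `csv` module and an in-memory buffer as a stand-in for the file; it does not show the dask workflow mentioned above.

```python
import csv
import io

# Toy column-wise table (made up): keys are column names, values are columns
table = {"Message ID": [0, 1, 2], "gas": [0, 2, 1], "issue": [1, 0, 0]}

# Write the header, then one row at a time: zip(*columns) turns columns into rows
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(table)
writer.writerows(zip(*table.values()))

# Stream the data back row by row and accumulate per-word totals
buffer.seek(0)
totals = {}
for row in csv.DictReader(buffer):
    for column, value in row.items():
        if column != "Message ID":
            totals[column] = totals.get(column, 0) + int(value)

print(totals)  # {'gas': 3, 'issue': 1}
```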

### Vocabulary: Message

In [None]:
# Transform Messages to list
train["Message"] = train["Message"].str.split(" ")

In [None]:
# Build the message vocabulary
message_voc = []
# Add each single word of each message to the message vocabulary
for index, message in train["Message"].items():
    # Skip blank messages, where the split left NaN instead of a list
    if isinstance(message, list):
        for word in message:
            message_voc.append(word)
# Get rid of duplicate words
message_voc = list(set(message_voc))

In [None]:
print("Unique words in email message of training data set:")
print(len(message_voc))

In [None]:
# Build dataframe with word count for each message
mask_spam = train["Spam/Ham"] == "spam"
mask_ham = train["Spam/Ham"] == "ham"

train_spam = train[mask_spam]
train_ham = train[mask_ham]

# Create a dictionary with all unique words as keys and a list as entry
# the list contains one count for each email message (row) - will be filled in the next step
# word_counts_per_message = {word: [0]*train.shape[0] for word in message_voc}



# loop over all the messages for each contained word
# increase the count in appropriate place in the dictionary by +1
# for index, message in train["Message"].items():
#     for word in message:
#         word_counts_per_message[word][index] += 1
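
The split into `train_spam` and `train_ham` points toward per-class word counts, which are the quantities a naive Bayes classifier needs. The following is a rough sketch under stated assumptions: it assumes a multinomial model with Laplace smoothing (`alpha=1`), which the notebook has not specified at this point, and it runs on made-up toy messages. `class_word_counts` and `word_likelihood` are hypothetical names. The smoothed estimate used is (count of the word in the class + alpha) / (total words in the class + alpha × vocabulary size).

```python
from collections import Counter


def class_word_counts(token_lists, labels):
    """Sum word counts separately for each class label."""
    counts = {}
    for tokens, label in zip(token_lists, labels):
        if isinstance(tokens, list):
            counts.setdefault(label, Counter()).update(t for t in tokens if t)
    return counts


def word_likelihood(word, class_counts, vocabulary_size, alpha=1):
    """Laplace-smoothed P(word | class) for a multinomial model."""
    total_words = sum(class_counts.values())
    return (class_counts[word] + alpha) / (total_words + alpha * vocabulary_size)


# Toy data (made up for illustration)
messages = [["free", "money", "now"], ["meeting", "notes"], ["free", "offer"]]
labels = ["spam", "ham", "spam"]

counts = class_word_counts(messages, labels)
vocabulary = {word for tokens in messages for word in tokens}

print(word_likelihood("free", counts["spam"], len(vocabulary)))  # (2 + 1) / (5 + 6)
print(word_likelihood("free", counts["ham"], len(vocabulary)))   # (0 + 1) / (2 + 6)
```

Keeping one `Counter` per class needs memory proportional to the vocabulary size only, rather than to the vocabulary size times the number of messages.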