# Introduction
In this project I will build a simple (but hopefully effective) spam filter for emails, using a naive Bayes approach on the Enron-Spam dataset. The dataset contains 33,716 emails labeled as spam or non-spam ("ham"). The original dataset was collected by V. Metsis, I. Androutsopoulos and G. Paliouras and can be found [here](http://www2.aueb.gr/users/ion/data/enron-spam). I packaged all the data into a [single csv-file](https://github.com/MWiechmann/enron_spam_data), which I will be using for this project.

Comment about data and generated data go here...

# Reading in data

Quick notes on the current data files (make more readable later):
* enron_spam_data.zip: data set with Message ID, Subject, Message, Spam/Ham, Date; unmodified data from the repo
* train.zip: 80% of the original data set, used for training the model. Same structure as the Enron spam data, but Subject and Message have been processed: converted to lowercase, punctuation removed, and transformed to list objects (one entry per word)
* test.zip: 20% of the original data set, with no further processing yet (Subject and Message as plain strings)
* subject_voc.zip: subject line vocabulary with one column per unique word and one row per subject line; each cell holds the count of that word in that subject line, with Message ID as index
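
A minimal sketch of the kind of preprocessing described for the training set. This is not the script that produced `train.zip`: the function name `tokenize` is hypothetical, and replacing punctuation with spaces (rather than deleting it) is an assumption based on tokens such as `'2', '6', '02'` in the data.

```python
import string

# Map every ASCII punctuation character to a space (assumed behaviour).
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def tokenize(text):
    """Lowercase the text, replace punctuation with spaces, split into words."""
    return text.lower().translate(_PUNCT_TO_SPACE).split()


print(tokenize("Re: Tenaska IV"))
print(tokenize("Start Date: 2/6/02"))
```

Under these assumptions the first call yields `['re', 'tenaska', 'iv']`, the same token list shown for Message ID 0 in the training data.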

In [1]:
import pandas as pd
import dask.dataframe as dd
import csv
import os
import shutil

train = pd.read_csv("data/train_data.zip",
                    compression="zip", index_col="Message ID")
train.head()

| Message ID | Subject | Message | Spam/Ham | Date |
|---|---|---|---|---|
| 0 | ['re', 'tenaska', 'iv'] | ['i', 'tried', 'calling', 'you', 'this', 'am',... | spam | 2004-02-25 |
| 1 | ['neon'] | ['bammel', 'neon', 'groups', 'fall', '2001', '... | ham | 2001-09-24 |
| 2 | ['fw', 're', 'ivanhoe', 'e', 's', 'd'] | ['fyi', 'kim', 'original', 'message', 'from', ... | spam | 2004-12-17 |
| 3 | ['start', 'date', '2', '6', '02', 'hourahead',... | ['start', 'date', '2', '6', '02', 'hourahead',... | spam | 2005-08-31 |
| 4 | ['fw', 're', 'ivanhoe', 'e', 's', 'd'] | ['fyi', 'kim', 'original', 'message', 'from', ... | spam | 2004-09-07 |


In [2]:
print("Total Count")
print(train["Spam/Ham"].value_counts(dropna=False))
print("\nProportion in %")
print(round(train["Spam/Ham"].value_counts(normalize=True), 4)*100)

Total Count
spam    13716
ham     13257
Name: Spam/Ham, dtype: int64

Proportion in %
spam    50.85
ham     49.15
Name: Spam/Ham, dtype: float64


In [3]:
# template for reading in dask data

# Unpack the archive only if the csv has not been extracted yet
if not os.path.exists("data/subject_voc.csv"):
    shutil.unpack_archive("data/subject_voc.zip", "data/")
subject_voc = dd.read_csv("data/subject_voc.csv").set_index("Message ID")
subject_voc.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 6845 entries, bold to doctor
dtypes: int64(6845)

# Building the Spam Filter

## General Constant Parameters for Naive Bayes

In [4]:
# Calculate probability for Spam and Non-Spam (Ham)
p_spam = train["Spam/Ham"].value_counts(normalize=True)["spam"]
p_ham = train["Spam/Ham"].value_counts(normalize=True)["ham"]
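
For reference, the priors computed above and the role they play in the classifier can be written out as follows. The counts are taken from the label counts printed earlier; the decision rule is the standard multinomial naive Bayes form, which I am assuming is the one intended here.

```latex
% Priors estimated from the training labels
P(\text{spam}) = \frac{13716}{13716 + 13257} \approx 0.5085,
\qquad
P(\text{ham}) = \frac{13257}{13716 + 13257} \approx 0.4915

% Naive Bayes score for a message with words w_1, \dots, w_n
P(\text{spam} \mid w_1, \dots, w_n) \propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})

P(\text{ham} \mid w_1, \dots, w_n) \propto P(\text{ham}) \prod_{i=1}^{n} P(w_i \mid \text{ham})
```

A message is labeled spam when the first score exceeds the second.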

## Spam Filter based on Subject Line

### Calculating Subject Line Specific Constant Parameters

In [5]:
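# Word Counts for Spam/ham
# read_csv loads the token lists as strings such as "['re', 'tenaska', 'iv']",
# so len() on the raw cell would count characters instead of words.
# Parse each cell back into a list first (assumes every cell is a list literal).
import ast

train_spam = train[train["Spam/Ham"] == "spam"]
train_ham = train[train["Spam/Ham"] == "ham"]

n_words_subject_spam = train_spam["Subject"].apply(ast.literal_eval).apply(len).sum()
n_words_subject_ham = train_ham["Subject"].apply(ast.literal_eval).apply(len).sum()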
# (could also use train[mask_spam].iloc[:,2:].sum().sum() above, but takes approx 2x as long)

# Unique word count for vocabulary
n_words_subject_voc = subject_voc.shape[1]

# Smoothing parameter
alpha = 1

# Delete train_spam and train_ham again to save memory
del train_spam, train_ham
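
These constants are the ingredients of additive (Laplace) smoothing, which keeps words that never occur in one class from getting a probability of zero. The sketch below shows the estimate on toy counts; the function name `smoothed_word_probability` and all numbers are made up for illustration and are not taken from the Enron data.

```python
def smoothed_word_probability(n_word_in_class, n_words_in_class, n_vocabulary, alpha=1):
    """P(word | class) with additive smoothing."""
    return (n_word_in_class + alpha) / (n_words_in_class + alpha * n_vocabulary)


# Toy class with 10 words in total, spread over a 5-word vocabulary.
toy_counts = [3, 4, 2, 1, 0]
toy_probs = [smoothed_word_probability(c, sum(toy_counts), len(toy_counts))
             for c in toy_counts]

print(toy_probs[0])    # word seen 3 times: (3 + 1) / (10 + 5)
print(toy_probs[-1])   # unseen word still gets (0 + 1) / (10 + 5)
print(sum(toy_probs))  # the smoothed estimates still sum to 1 (up to rounding)
```

Dividing by `n_words_in_class + alpha * n_vocabulary` rather than by the raw word count is what keeps the smoothed probabilities a proper distribution over the vocabulary.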

### Calculate Word-Specific Parameters (Subject Line)

In [21]:
subject_voc_words = list(subject_voc.columns[:2500])

# Add Spam/Ham category to subject_voc_df and separate into spam/ham dataframes
subject_voc_spam = subject_voc[subject_voc_words].copy()
subject_voc_ham = subject_voc[subject_voc_words].copy()
subject_voc_spam["Spam/Ham"] = train["Spam/Ham"]
subject_voc_ham["Spam/Ham"] = train["Spam/Ham"]
subject_voc_spam = subject_voc_spam[subject_voc_spam["Spam/Ham"] == "spam"]
subject_voc_ham = subject_voc_ham[subject_voc_ham["Spam/Ham"] == "ham"]
subject_voc_spam = subject_voc_spam.compute()
subject_voc_ham = subject_voc_ham.compute()

# Build dictionaries with word-specific probability given either spam or non-spam (ham)
p_subject_word_given_spam_dict = {word: 0.0 for word in subject_voc_words}
p_subject_word_given_ham_dict = {word: 0.0 for word in subject_voc_words}

for word in subject_voc_words:
    n_word_given_spam = subject_voc_spam[word].sum()
    prob = (n_word_given_spam + alpha) / \
        (n_words_subject_spam + alpha * n_words_subject_voc)
    p_subject_word_given_spam_dict[word] = prob

    n_word_given_ham = subject_voc_ham[word].sum()
    prob = (n_word_given_ham + alpha) / \
        (n_words_subject_ham + alpha * n_words_subject_voc)
    p_subject_word_given_ham_dict[word] = prob

# Save both dictionaries as one-row csv files (header = words, row = probabilities).
# newline='' is what the csv module expects for files it writes to.
with open('data/p_subject_word_given_spam.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, p_subject_word_given_spam_dict.keys())
    writer.writeheader()
    writer.writerow(p_subject_word_given_spam_dict)

with open('data/p_subject_word_given_ham.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, p_subject_word_given_ham_dict.keys())
    writer.writeheader()
    writer.writerow(p_subject_word_given_ham_dict)
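
Since the probabilities are stored as a one-row csv (words as header, probabilities as the single row), they can be read back into a dictionary later without recomputing them. A minimal sketch, assuming that file layout; `load_probabilities` is a hypothetical helper, and the demo writes a toy file to a temporary directory instead of touching `data/`.

```python
import csv
import os
import tempfile


def load_probabilities(path):
    """Read a one-row csv (header = words) back into a {word: float} dict."""
    with open(path, newline="") as f:
        row = next(csv.DictReader(f))
    return {word: float(prob) for word, prob in row.items()}


# Round trip with toy values
toy = {"free": 0.25, "meeting": 0.75}
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_probs.csv")
    with open(toy_path, "w", newline="") as f:
        writer = csv.DictWriter(f, toy.keys())
        writer.writeheader()
        writer.writerow(toy)
    loaded = load_probabilities(toy_path)

print(loaded)
```

The values come back as strings from `csv.DictReader`, hence the explicit `float` conversion.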

In [28]:
t = range(2500, n_words_subject_voc, 2500)

for val in t:
    print(val)

2500
5000
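
The range above only yields the interior boundaries 2500 and 5000, so the last, shorter chunk (columns 5000 to 6845) needs its own endpoint. One way to make the chunks explicit is a small helper that returns (start, stop) pairs; the name `chunk_bounds` is hypothetical and not part of the notebook.

```python
def chunk_bounds(n_columns, chunk_size):
    """Return (start, stop) column index pairs that cover 0..n_columns."""
    return [(start, min(start + chunk_size, n_columns))
            for start in range(0, n_columns, chunk_size)]


print(chunk_bounds(6845, 2500))
# [(0, 2500), (2500, 5000), (5000, 6845)]
```

Each pair can be used directly as `columns[start:stop]`, so every word is processed exactly once.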


In [42]:
'''
Doing the following computation directly on dask dataframes is extremely slow.
Instead, I create a series of pandas dataframes from the main dask dataframe and perform the operation on those.
The result is stored on file just in case.
For this dataset, pandas dataframes with 2500 columns seem to work without memory problems on my machine.
Therefore the following process goes through the bigger dask dataframe in smaller pandas dataframes,
each containing 2500 columns.
'''

print("creating dict with probabilities given spam/ham for " +
      str(n_words_subject_voc) + " words...")

# Build dictionaries with word-specific probability given either spam or non-spam (ham)
p_subject_word_given_spam_dict = {word: 0.0 for word in subject_voc.columns}
p_subject_word_given_ham_dict = {word: 0.0 for word in subject_voc.columns}

# Determine slice endpoints to go through dask dataframe in 2500 word steps
endpoints = list(range(2500, n_words_subject_voc, 2500))
endpoints.append(n_words_subject_voc)

step = 1
start = 0  # first column of the current step

for endpoint in endpoints:
    print("...creating dictionary - step " + str(step) + "/" +
          str(len(endpoints)) + "...", end="\r")

    # Select only the columns that belong to this step
    subject_voc_words_step = list(subject_voc.columns[start:endpoint])

    # Limit subject vocabulary dataframe to the (up to) 2500 words in this step,
    # Add Spam/Ham to dataframe
    # Separate dataframe into spam/ham dataframes
    # Then transform from dask to pandas dataframe
    subject_voc_spam = subject_voc[subject_voc_words_step].copy()
    subject_voc_ham = subject_voc[subject_voc_words_step].copy()
    subject_voc_spam["Spam/Ham"] = train["Spam/Ham"]
    subject_voc_ham["Spam/Ham"] = train["Spam/Ham"]
    subject_voc_spam = subject_voc_spam[subject_voc_spam["Spam/Ham"] == "spam"]
    subject_voc_ham = subject_voc_ham[subject_voc_ham["Spam/Ham"] == "ham"]
    subject_voc_spam = subject_voc_spam.compute()
    subject_voc_ham = subject_voc_ham.compute()

    for word in subject_voc_words_step:
        n_word_given_spam = subject_voc_spam[word].sum()
        prob = (n_word_given_spam + alpha) / \
            (n_words_subject_spam + alpha * n_words_subject_voc)
        p_subject_word_given_spam_dict[word] = prob

        n_word_given_ham = subject_voc_ham[word].sum()
        prob = (n_word_given_ham + alpha) / \
            (n_words_subject_ham + alpha * n_words_subject_voc)
        p_subject_word_given_ham_dict[word] = prob

    step += 1
    start = endpoint

print("...dictionary created!                     ")

print("Now saving dictionaries with probabilities to file...")

with open("data/p_subject_word_given_spam.csv", 'w', newline='') as f:
    writer = csv.DictWriter(f, p_subject_word_given_spam_dict.keys())
    writer.writeheader()
    writer.writerow(p_subject_word_given_spam_dict)

with open("data/p_subject_word_given_ham.csv", 'w', newline='') as f:
    writer = csv.DictWriter(f, p_subject_word_given_ham_dict.keys())
    writer.writeheader()
    writer.writerow(p_subject_word_given_ham_dict)

print("...done! Csv-files saved to 'data/p_subject_word_given_spam.csv' and 'data/p_subject_word_given_ham.csv'")

creating dict with probabilities given spam/ham for 6845 words...
...dictionary created!          3...
Now saving dictionaries with probabilities to file...
...done! Csv-files saved to 'data/p_subject_word_given_spam.csv' and 'data/p_subject_word_given_ham.csv'
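
With the priors and the two word-probability dictionaries in place, a subject line can be scored for each class. The sketch below is one possible way to do this, not the notebook's final classifier: `classify_subject` is a hypothetical name, the toy dictionaries stand in for the real ones, and skipping words outside the vocabulary is an assumption. Log-probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow on longer texts.

```python
import math


def classify_subject(tokens, p_spam, p_ham, p_word_given_spam, p_word_given_ham):
    """Return "spam" or "ham" for a tokenized subject line (naive Bayes)."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in tokens:
        # Words missing from the vocabulary are ignored (assumed behaviour).
        if word in p_word_given_spam and word in p_word_given_ham:
            log_spam += math.log(p_word_given_spam[word])
            log_ham += math.log(p_word_given_ham[word])
    return "spam" if log_spam > log_ham else "ham"


# Toy probabilities, not values from the Enron data
toy_given_spam = {"free": 0.4, "money": 0.4, "meeting": 0.2}
toy_given_ham = {"free": 0.1, "money": 0.1, "meeting": 0.8}

print(classify_subject(["free", "money"], 0.5, 0.5, toy_given_spam, toy_given_ham))
print(classify_subject(["meeting", "today"], 0.5, 0.5, toy_given_spam, toy_given_ham))
```

In the notebook's terms, the arguments would be `p_spam`, `p_ham`, `p_subject_word_given_spam_dict` and `p_subject_word_given_ham_dict`, with the subject line tokenized the same way as the training data.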
