# Cloud Constable Content-Based Spam/Fraud Detection
______
### Stephen Camera-Murray, Himani Garg, Vijay Thangella
## CSDMC2010 SPAM corpus
(http://csmining.org/index.php/spam-email-datasets-.html)

The corpus contains 4327 messages: 2949 non-spam messages (HAM) and 1378 spam messages (SPAM).

Spam                                                   |  Ham
:-----------------------------------------------------:|:------------------------------------------------------:
<img src="Spam.png" alt="Spam" style="width: 200px;"/> | <img src="Ham.png" alt="Ham" style="width: 200px;"/>

### Step 1 - Data Exploration and Cleansing
____

#### Import required libraries

In [1]:
#import libraries
import numpy as np
import pandas as pd
import string
import email.parser 
import os, sys, stat
import shutil
import nltk
from PIL import Image
import re
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
from bs4 import BeautifulSoup

#### Create helper function
Each email is stored in its own file, and the labels marking each email as spam (0) or ham (1) are stored in a separate label file. We modify the functions included with the dataset to parse each file and extract the email subject and body.

In [2]:
def ExtractSubPayload (filename):
    ''' Extract the subject and payload from the .eml file.

    '''
    # other encodings produced errors. while Latin-1 is not technically correct,
    # as long as a character is garbled in the same way everywhere all is well
    with open(filename, encoding="Latin-1") as fp:
        msg = email.message_from_file(fp)
    payload = msg.get_payload()
    if isinstance(payload, list):
        payload = payload[0] # only use the first part of payload
    sub = str(msg.get('subject'))
    if not isinstance(payload, str):
        payload = str(payload)

    # note: subject and payload are joined without a separator
    return sub + payload
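
The helper above keeps only the first part of a multipart message. For comparison, here is a minimal sketch of a variant that walks every part and keeps the text ones. It is not part of the original notebook: the function name `extract_text_parts` and the sample message are illustrative assumptions.

```python
import email

def extract_text_parts(raw_message):
    """Return the subject followed by every text/* part of the message.

    Hypothetical variant of ExtractSubPayload, for illustration only.
    """
    msg = email.message_from_string(raw_message)
    pieces = [str(msg.get('subject', ''))]
    # walk() visits the message itself and, for multipart messages, each sub-part
    for part in msg.walk():
        if part.get_content_maintype() == 'text':
            pieces.append(part.get_payload())
    return ' '.join(pieces)

# made-up two-part message used only to exercise the function
sample = (
    "Subject: Hello\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "plain body\n"
    "--XX\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>html body</p>\n"
    "--XX--\n"
)
combined = extract_text_parts(sample)
print(combined)
```

Whether the extra parts help or merely duplicate content (many emails carry the same text as both plain and HTML) would need to be checked against the corpus.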

#### Load the spam labels
We load the labels into a dataframe and peek at the first five rows.

In [3]:
# load the labels file
spamLabels = pd.read_csv (
    'data/SPAMTrain.label',
    sep=' ',
    header=None,
    names=['ham','filename']
)
spamLabels.head()

|   | ham | filename        |
|:-:|:---:|:---------------:|
| 0 | 0   | TRAIN_00000.eml |
| 1 | 0   | TRAIN_00001.eml |
| 2 | 1   | TRAIN_00002.eml |
| 3 | 0   | TRAIN_00003.eml |
| 4 | 0   | TRAIN_00004.eml |


#### Loop through, process email files, and add them to a single dataframe
We loop through the directory containing the email dataset and, for each email:
1. Extract the subject and message body
2. Strip out HTML tags with BeautifulSoup
3. Match it up with the correct label
4. Append it to our full dataframe

In [4]:
# create an empty dataframe for the emails
emailsDF = pd.DataFrame(columns=('ham', 'content'))

# loop through each file in the directory, skipping the label file
srcdir = 'data'
files = os.listdir(srcdir)
for idx, file in enumerate(files):
    
    srcpath = os.path.join(srcdir, file)
    if ( file != 'SPAMTrain.label'):

        # extract the subject and body
        body = ExtractSubPayload (srcpath)

        # load the body into beautiful soup to parse the html, if any
        soup = BeautifulSoup(body, "lxml")

        # remove script and style elements, may not be necessary for emails
        for script in soup(["script", "style"]):
            script.extract()

        # extract the text
        text = soup.get_text()
        
        # regex to replace non-letters with blanks. Note: we also want to remove
        # the phrase "[SPAM]", which is added by some mailers' spam detection,
        # so we're not cheating :-)
        clean_text = re.sub(r"\[SPAM\]|[^a-zA-Z]", " ", text ).lower()

        # append label and email body to our dataframe
        emailsDF.loc[idx] = [ spamLabels[spamLabels['filename']==file]['ham'].values[0], clean_text ]
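
To make the effect of the cleaning regex concrete, the sketch below applies the same pattern to a made-up subject line. The wrapper name `clean` and the sample strings are assumptions for illustration, not part of the notebook.

```python
import re

def clean(text):
    # same pattern as the notebook: drop the literal tag "[SPAM]", then turn
    # every remaining non-letter character into a space and lowercase the result
    return re.sub(r"\[SPAM\]|[^a-zA-Z]", " ", text).lower()

sample = "[SPAM] Win $1,000 NOW!"
cleaned = clean(sample)

# digits and punctuation become runs of spaces; splitting collapses them
print(cleaned.split())

# the tag is matched case-sensitively, so a lowercase variant leaves "spam" behind
print(clean("[spam] hi").split())
```

Because the tag match is case-sensitive, any mailer that writes the marker in another form would still leak the word "spam" into the cleaned text.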

In [5]:
emailsDF.head()

|   | ham | content |
|:-:|:---:|:--------|
| 1 | 0.0 | one of a kind money maker try it for free con... |
| 2 | 0.0 | link to my webcam you wanted wanna see sexuall... |
| 3 | 1.0 | re how to manage multiple internet connection... |
| 4 | 0.0 | give her hour rodeoenhance your desire p... |
| 5 | 0.0 | best price on the netf f m suddenlysusan sto... |


#### Create Word Clouds

Let's pull all of the words into two single strings, one for spam and one for ham.

In [6]:
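# get all of the spam words and ham words in a single string; join with a
# space so the last word of one email is not fused to the first word of the next
spam_words = emailsDF[emailsDF['ham']==0]['content'].str.cat(sep=' ')
ham_words  = emailsDF[emailsDF['ham']==1]['content'].str.cat(sep=' ')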

Create the word cloud images

In [34]:
d = os.path.dirname('.')

spam_mask = np.array(Image.open(os.path.join(d, "Spam.jpg")))
ham_mask = np.array(Image.open(os.path.join(d, "Ham.jpg")))

stopwords = set(STOPWORDS)

# generate word cloud
wc_spam = WordCloud(background_color=None, mode="RGBA", max_words=100, mask=spam_mask,
               stopwords=stopwords)
wc_spam.generate(spam_words)

wc_ham = WordCloud(background_color=None, mode="RGBA", max_words=100, mask=ham_mask,
               stopwords=stopwords)
wc_ham.generate(ham_words)

# store to file
wc_spam.to_file(os.path.join(d, "SpamWordCloud.png"))
wc_ham.to_file (os.path.join(d, "HamWordCloud.png"))


We observe that the spam and ham vocabularies are quite different, which should be useful in building a predictive model. There is some repetition, likely due to extra spaces that were not filtered out in our cleansing. Tokenization in our data preparation step should take care of this, but we should double-check to be sure. We also notice that the ham-classified emails seem to be tech-heavy, which may not be representative of email in general. Once we build the model, we should check its accuracy on additional datasets.
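
The double-check on extra spaces can be sketched with the standard library alone. This assumes whitespace tokenization; the function name `top_words` and the sample string are hypothetical, and the real preparation step may tokenize differently.

```python
from collections import Counter

def top_words(text, n=5):
    # str.split() with no argument splits on any run of whitespace,
    # so repeated spaces never produce empty tokens
    tokens = text.split()
    return Counter(tokens).most_common(n)

# made-up sample with irregular spacing
sample = "free   money  free  offer   free money"
counts = top_words(sample, n=2)
print(counts)  # [('free', 3), ('money', 2)]
```

Running something like `top_words(spam_words)` and `top_words(ham_words)` would give a numeric counterpart to the word clouds.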

Spam                                                   |  Ham
:-----------------------------------------------------:|:------------------------------------------------------:
<img src="SpamWordCloud.png" alt="Spam" style="height: 300px;"/> | <img src="HamWordCloud.png" alt="Ham" style="height: 300px;"/>

#### Write our cleansed dataset to the data folder

In [12]:
# write the cleansed dataframe to a file
emailsDF.to_csv('data/cleansedEmails.tab.gz', index=False, compression='gzip', sep='\t')
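
A quick sanity check on the written file is to read the first few rows back. The sketch below uses only the standard library and a tiny stand-in file, so it makes no claims about the real dataset. The helper name `peek_cleansed` and the demo file are assumptions, and it relies on the cleansed content containing no tabs or newlines, which holds here because every non-letter was replaced with a space.

```python
import gzip
import os
import tempfile

def peek_cleansed(path, n=3):
    """Return the header and the first n rows of a gzip-compressed, tab-separated file."""
    with gzip.open(path, mode='rt', encoding='utf-8') as fh:
        header = fh.readline().rstrip('\n').split('\t')
        rows = []
        for line in fh:
            if len(rows) == n:
                break
            rows.append(line.rstrip('\n').split('\t'))
    return header, rows

# demonstrate on a tiny stand-in file rather than the real dataset
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.tab.gz')
with gzip.open(demo_path, mode='wt', encoding='utf-8') as fh:
    fh.write('ham\tcontent\n0\tfree money now\n1\tmeeting notes attached\n')

header, rows = peek_cleansed(demo_path)
print(header, rows)
```

In pandas, `pd.read_csv('data/cleansedEmails.tab.gz', sep='\t')` should load the same file, since the gzip compression is inferred from the `.gz` extension by default.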