# Naive Bayes (the easy way)

We'll cheat by using sklearn.naive_bayes to train a spam classifier! Most of the code just loads our training data into a pandas DataFrame that we can play with. As a side note: if we wanted to scale this up into a big data problem, we could follow the same approach with the email data held in a distributed datastore such as HDFS.

In [5]:
import os
import io
import numpy
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# generator yielding (file path, message body) for every file below path
def readFiles(path):
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:

            # join whole file path (separate name, so the path argument is not shadowed)
            filepath = os.path.join(root, filename)

            # becomes True once the header block has been skipped
            inBody = False

            # start with an empty list of lines
            lines = []

            # open file in read mode; the with block closes it even if reading fails
            with io.open(filepath, 'r', encoding='latin1') as f:

                # read line by line; body starts after the first empty line
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        inBody = True

            # concat all list items to message; each line keeps its own '\n',
            # so body lines end up separated by a blank line
            message = '\n'.join(lines)
            yield filepath, message


def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    
    # reuse readFiles method
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)

# pandas.concat replaces DataFrame.append, which was removed in pandas 2.0
from pandas import concat

my_dir = "C:/Users/kohleggermichael/OneDrive/Teaching/[WCIS] CRM und Information Mining II/Block-1"

data = concat([
    dataFrameFromDirectory(my_dir + '/emails/spam', 'spam'),
    dataFrameFromDirectory(my_dir + '/emails/ham', 'ham'),
])
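
The header-skipping logic in readFiles can be tried out without touching the file system. The following is a minimal sketch, assuming a made-up two-header email; `extract_body` and `sample` are hypothetical names that do not appear in the notebook.

```python
import io

def extract_body(raw_email):
    """Return everything after the first blank line, joined like readFiles does."""
    in_body = False
    lines = []
    # io.StringIO lets us iterate over a string line by line, like an open file
    for line in io.StringIO(raw_email):
        if in_body:
            lines.append(line)
        elif line == '\n':
            # the first empty line separates the headers from the body
            in_body = True
    return '\n'.join(lines)

# Hypothetical sample email: two header lines, a blank line, then the body
sample = "From: alice@example.com\nSubject: Hello\n\nFirst line\nSecond line\n"
body = extract_body(sample)
print(repr(body))
```

Because every line still carries its own trailing newline, joining with `'\n'` leaves a blank line between body lines, which is why `\n\n` shows up in the message column.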


Let's have a look at that DataFrame:

In [6]:
data.head()

| | class | message |
|---|---|---|
| `C:/Users/kohleggermichael/OneDrive/Teaching/[WCIS] CRM und Information Mining II/Block-1/emails/spam\00001.7848dde101aa985090474a91ec93fcf0` | spam | `<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Tr...` |
| `C:/Users/kohleggermichael/OneDrive/Teaching/[WCIS] CRM und Information Mining II/Block-1/emails/spam\00002.d94f1b97e48ed3b553b3508d116e6a09` | spam | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` |
| `C:/Users/kohleggermichael/OneDrive/Teaching/[WCIS] CRM und Information Mining II/Block-1/emails/spam\00003.2ee33bc6eacdb11f38d052c44819ba6c` | spam | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` |
| `C:/Users/kohleggermichael/OneDrive/Teaching/[WCIS] CRM und Information Mining II/Block-1/emails/spam\00004.eac8de8d759b7e74154f142194282724` | spam | `##############################################...` |
| `C:/Users/kohleggermichael/OneDrive/Teaching/[WCIS] CRM und Information Mining II/Block-1/emails/spam\00005.57696a39d7d84318ce497886896bf90d` | spam | `I thought you might like these:\n\n1) Slim Dow...` |


Now we will use a CountVectorizer to **split up each message into its list of words** and count them. The counts object is a sparse matrix with emails on rows, words (aka terms) on columns, and the word count per email in its cells. As the overall vocabulary is large (62,964 distinct words across 3,000 emails), the matrix gets quite big, although only 429,785 of its cells are non-zero. Since we cannot sensibly display the whole matrix, we simply print its summary. *Also mind that we have not done any NLP preprocessing yet (e.g., stemming, stop-word removal).*

![Sparse Matrix Example](img/sparsematrix.png "Sparse Matrix Example")

In [16]:
# vectorize the messages into per-word counts
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)
counts

<3000x62964 sparse matrix of type '<class 'numpy.int64'>'
	with 429785 stored elements in Compressed Sparse Row format>
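
To make the layout of such a matrix concrete, here is a small sketch on three made-up messages (the `toy_` names and the messages themselves are assumptions for illustration, not part of the email data). It shows one row per message and one column per vocabulary term.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, small enough to inspect as a dense array
toy_messages = ["free money now", "meeting at noon", "free free lunch"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages)

# vocabulary_ maps each term to its column index; sort the terms by that index
vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

# toarray() is fine for a toy matrix, but would waste memory on the real one
dense = toy_counts.toarray()

print(toy_counts.shape)   # (number of messages, number of distinct terms)
print(vocabulary)
print(dense)
print(toy_counts.nnz)     # number of stored (non-zero) cells
```

The third row contains a 2 in the column for "free", because that word occurs twice in the third message.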

Now let's throw our sparse matrix into a **MultinomialNB (naive Bayes)** classifier, which also gets the classification (spam/ham) of each email. Call fit() and we've got a trained spam filter ready to go! It's just that easy.

In [24]:
# train a multinomial naive bayes classifier on the counts data
classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
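
The `alpha=1.0` in the output above is the additive (Laplace) smoothing parameter. As a rough sketch of what fit() estimates, the snippet below recomputes the smoothed per-word log probabilities for the spam class by hand and compares them with the fitted model's `feature_log_prob_`. The toy messages and all `toy_`/`manual_` names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, not the course's email data
toy_messages = ["free money now", "free free lunch", "meeting at noon", "lunch at noon"]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
X = toy_vectorizer.fit_transform(toy_messages)
alpha = 1.0
model = MultinomialNB(alpha=alpha).fit(X, toy_labels)

# Total count of every word over all spam messages
is_spam_row = np.array(toy_labels) == "spam"
spam_word_counts = X.toarray()[is_spam_row].sum(axis=0)

# Smoothed estimate: (count + alpha) / (total spam words + alpha * vocabulary size)
n_terms = X.shape[1]
manual_log_prob = np.log((spam_word_counts + alpha) / (spam_word_counts.sum() + alpha * n_terms))

# Rows of feature_log_prob_ follow the order of model.classes_
spam_row = list(model.classes_).index("spam")
print(np.allclose(manual_log_prob, model.feature_log_prob_[spam_row]))
```

The smoothing keeps a word that never occurred in spam from getting probability zero, which would otherwise wipe out the whole product for any message containing it.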

Let's try it out:

In [4]:
examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham'], 
      dtype='<U4')
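
Besides the hard label, the classifier can also report how confident it is via `predict_proba`. A minimal sketch on a made-up toy corpus (the messages and `toy_` names are assumptions; with the notebook's own `vectorizer` and `classifier` the call would look the same):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, not the course's email data
toy_messages = ["free money now", "win free prize", "golf with bob", "meeting tomorrow morning"]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

new_messages = ["free prize now", "golf tomorrow"]
probabilities = toy_classifier.predict_proba(toy_vectorizer.transform(new_messages))

# One row per message; columns follow toy_classifier.classes_ ('ham', 'spam')
for text, row in zip(new_messages, probabilities):
    print(text, dict(zip(toy_classifier.classes_, row.round(3))))
```

A probability threshold other than 0.5 could then be used to trade missed spam against wrongly rejected ham.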

In [22]:
# you could also hand over a larger set of emails (in this case I take the original training data)
classifier.predict(vectorizer.transform(data["message"].values))

array(['spam', 'spam', 'spam', ..., 'ham', 'ham', 'ham'], 
      dtype='<U4')

What does this all mean? This is a pretty good example of a fully implementable classifier that you could **directly use in an application**. Let's assume that you implement a guest book app in PHP. You would probably build a little form that lets users add a new post to your guest book, and it would very likely also attract spammers. What you could do is collect some example spam (and some legitimate posts) and train a classifier with them. Then you could take the trained model and wrap it in a function that you call from your PHP submit handler and that returns a binary marker (spam/ham). On submit, you would run the post content through that function and immediately get a classification. If that classification is spam, you could reject the post right away.
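
The Python side of that idea could look roughly like the sketch below. It assumes a tiny made-up training set, bundles vectorizer and classifier in a scikit-learn pipeline so they travel together, and uses `pickle` to stand in for saving the model to disk. The function name `is_spam` is hypothetical, and how PHP would call it (for example through a small web service) is left open.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training posts; a real filter needs far more examples
train_messages = [
    "free viagra now",
    "win free money now",
    "golf tomorrow with bob",
    "meeting notes attached",
]
train_labels = ["spam", "spam", "ham", "ham"]

# The pipeline keeps the fitted vocabulary and the classifier in one object
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(train_messages, train_labels)

# Serialize and restore the trained model, as an application would do between
# training time and serving time (only unpickle data you trust)
serialized = pickle.dumps(spam_filter)
restored = pickle.loads(serialized)

def is_spam(post_text, model=restored):
    """Return True if the model labels the post as spam."""
    return bool(model.predict([post_text])[0] == "spam")

print(is_spam("free money"))
print(is_spam("golf with bob"))
```

Keeping the vectorizer inside the pipeline matters: new posts must be counted against the same vocabulary the classifier was trained on.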

## Activity

Our data set is small, so our spam classifier isn't actually very good. Try running some different test emails through it and see if you get the results you expect. If you really want to challenge yourself, try applying a train/test split to this spam classifier and see how well it can predict a held-out subset of the ham and spam emails.
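
One possible starting point for the train/test part is sketched below on a made-up toy corpus. The messages, the 75/25 split, and all `toy_`-style names are assumptions; for the activity you would pass `data['message'].values` and `data['class'].values` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the email DataFrame
messages = [
    "free money now", "win a free prize", "cheap pills online", "claim your free money",
    "golf with bob tomorrow", "meeting notes attached", "lunch at noon", "see you at the meeting",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Hold out 25% of the messages; stratify keeps the spam/ham ratio in both parts
train_msgs, test_msgs, train_labels, test_labels = train_test_split(
    messages, labels, test_size=0.25, random_state=0, stratify=labels
)

# Fit the vocabulary on the training part only, then reuse it for the test part
split_vectorizer = CountVectorizer()
train_counts = split_vectorizer.fit_transform(train_msgs)
test_counts = split_vectorizer.transform(test_msgs)

split_classifier = MultinomialNB().fit(train_counts, train_labels)
accuracy = accuracy_score(test_labels, split_classifier.predict(test_counts))
print("held-out accuracy:", accuracy)
```

Unlike predicting on the training data itself, the held-out accuracy says something about emails the model has never seen.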