## Naive Bayes Spam Filter Implementation
In this Colab notebook we use the scikit-learn (sklearn) package to analyze emails and label them as spam or non-spam (ham). The implementation uses a supervised machine learning algorithm known as Naive Bayes.

## Bayesian Classification Algorithm
Naive Bayes classifiers are built on Bayesian classification methods, which rely on Bayes' theorem. In words, Bayes' theorem says that the probability of A given B, $P(A | {\rm B})$, equals the probability of B given A times the probability of A, divided by the probability of B.

Bayes' theorem:

$$
P(A | {\rm B}) = \frac{P({\rm B} | A)P(A)}{P({\rm B})}
$$
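As a quick numeric illustration of the formula, the sketch below takes A to be "the email is spam" and B to be "the email contains the word free". All three input probabilities are made-up values chosen for the example, not estimates from the email dataset, and `posterior` is a hypothetical helper name.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A | B) from Bayes' theorem for a two-outcome problem."""
    # P(B) via the law of total probability over A and not-A
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / p_b

# Hypothetical inputs (assumptions for illustration only)
p_spam = 0.3              # P(A): prior probability that an email is spam
p_free_given_spam = 0.6   # P(B | A): "free" appears in a spam email
p_free_given_ham = 0.05   # P(B | not A): "free" appears in a ham email

p_spam_given_free = posterior(p_spam, p_free_given_spam, p_free_given_ham)
print(round(p_spam_given_free, 3))
```

With these assumed numbers, seeing the word raises the probability of spam from the prior of 0.3 to roughly 0.84.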

If we are comparing two labels, $A_1$ and $A_2$, we can compute the ratio of their probabilities; the common factor $P({\rm B})$ cancels:

$$
\frac{P(A_1 | {\rm B})}{P(A_2 | {\rm B})} = \frac{P({\rm B} | A_1)}{P({\rm B} | A_2)}\frac{P(A_1)}{P(A_2)}
$$

The implementation is called "naive" because it assumes that the words in a message occur independently of one another, given the label. We know this isn't true: sentences follow regular structure, so the presence of one word makes others more or less likely. Modeling those dependencies is outside the scope of this tutorial.
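To make the independence assumption concrete, here is a minimal sketch of the probability ratio for two labels when the likelihood of a message is treated as the product of per-word likelihoods. The word probabilities and priors in the tables are invented for illustration, and `log_posterior_ratio` is a hypothetical helper, not part of sklearn. Logarithms are used so that the product becomes a sum.

```python
import math

# Hypothetical per-word likelihoods P(word | label); not learned from real data
WORD_LIKELIHOODS = {
    "spam": {"free": 0.20, "click": 0.10, "meeting": 0.01},
    "ham": {"free": 0.02, "click": 0.01, "meeting": 0.10},
}
# Hypothetical priors P(label)
PRIORS = {"spam": 0.5, "ham": 0.5}

def log_posterior_ratio(words):
    """Return log[P(spam | words) / P(ham | words)] under the naive assumption."""
    log_ratio = math.log(PRIORS["spam"] / PRIORS["ham"])
    for word in words:
        # Independence lets each word contribute its own likelihood ratio
        log_ratio += math.log(
            WORD_LIKELIHOODS["spam"][word] / WORD_LIKELIHOODS["ham"][word]
        )
    return log_ratio

print(log_posterior_ratio(["free", "click"]) > 0)  # positive: spam is more likely
print(log_posterior_ratio(["meeting"]) > 0)        # negative: ham is more likely
```

A positive value favors the first label (spam) and a negative value favors the second (ham).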

We need to import some of the necessary packages:

In [4]:
import os
import io
import numpy
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

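This function iterates through every file in a directory, builds its full path, and reads it in. Everything up to the first blank line (the email headers) is skipped, so only the message body is kept: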

In [5]:
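def readFiles(path):
    # Walk the directory tree and yield (file path, message body) for each file
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filePath = os.path.join(root, filename)

            inBody = False
            lines = []
            # latin1 can decode any byte sequence, so unusual bytes will not raise
            with io.open(filePath, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        # The first blank line separates the headers from the body
                        inBody = True
            message = '\n'.join(lines)
            yield filePath, message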
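The header-skipping logic can be tried without any files on disk. The sketch below applies the same blank-line rule to an in-memory string; `extract_body` and the sample email are hypothetical and exist only for this illustration.

```python
import io

def extract_body(raw_email):
    """Hypothetical helper: keep only the lines after the first blank line."""
    in_body = False
    body_lines = []
    for line in io.StringIO(raw_email):
        if in_body:
            body_lines.append(line)
        elif line == '\n':
            # First blank line marks the end of the headers
            in_body = True
    return '\n'.join(body_lines)

# Made-up email with two header lines and a two-line body
sample = "From: a@example.com\nSubject: Hello\n\nFirst line\nSecond line\n"
print(repr(extract_body(sample)))
```

Each line keeps its own trailing newline, so joining with another newline leaves a blank line between body lines in the resulting string.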

This creates a dataframe with 2 columns: the body of the e-mail (message) and its classification as spam or ham (class). The file path is used as the index:

In [None]:
def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)

# DataFrame.append was removed in pandas 2.0, so combine the frames with concat
from pandas import concat

data = concat([
    dataFrameFromDirectory('C:\\Users\\joe\\Documents\\DataScience-Python3\\emails\\spam', 'spam'),
    dataFrameFromDirectory('C:\\Users\\joe\\Documents\\DataScience-Python3\\emails\\ham', 'ham'),
])


We can preview the dataframe with the head method. Passing 5 as the argument returns the first 5 rows:

In [8]:
data.head(5)

| | class | message |
|---|---|---|
| `C:\Users\joe\Documents\DataScience-Python3\emails\spam\00001.7848dde101aa985090474a91ec93fcf0` | spam | `<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Tr...` |
| `C:\Users\joe\Documents\DataScience-Python3\emails\spam\00002.d94f1b97e48ed3b553b3508d116e6a09` | spam | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` |
| `C:\Users\joe\Documents\DataScience-Python3\emails\spam\00003.2ee33bc6eacdb11f38d052c44819ba6c` | spam | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` |
| `C:\Users\joe\Documents\DataScience-Python3\emails\spam\00004.eac8de8d759b7e74154f142194282724` | spam | `##############################################...` |
| `C:\Users\joe\Documents\DataScience-Python3\emails\spam\00005.57696a39d7d84318ce497886896bf90d` | spam | `I thought you might like these:\n\n1) Slim Dow...` |


Next we split each message into words. CountVectorizer tokenizes every email and converts it into a vector of word counts, and the variable counts holds the resulting matrix, with one row per email and one column per distinct word. We then create a MultinomialNB classifier and call classifier.fit() with the counts and the matching labels. Once fitted, the classifier can predict whether new emails are spam or ham.

In [9]:
# Tokenizes each word and counts frequency of each word in an email
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
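To see what the vectorizer produces, the sketch below runs CountVectorizer on two made-up messages instead of the email dataset. The names `toy_messages`, `toy_vectorizer`, `toy_counts`, and `terms` are introduced only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_messages = ["free money free", "meeting at noon"]  # made-up examples

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages)

# vocabulary_ maps each term to its column index; list the terms in column order
terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(terms)                 # ['at', 'free', 'meeting', 'money', 'noon']
print(toy_counts.toarray())  # one row per message, one column per term
```

The first row has a 2 in the "free" column because that word appears twice in the first message.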

Now let's try the classifier on a couple of new messages and see how it does. Keep in mind that, in general, the more labeled data we train on, the more accurate the predictions tend to be. Supervised learning algorithms rely on previously labeled examples to make predictions.

In [10]:
test_Accuracy = ['Free STCC tuition! Click Here!', "STCC Weather Delay"]
example_counts = vectorizer.transform(test_Accuracy)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham'],
      dtype='<U4')
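Besides the predicted label, MultinomialNB can report how confident it is through predict_proba. The sketch below is self-contained and trains on four invented messages rather than the email dataset, so the training data and all `toy_` names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set (not the real email data)
train_messages = [
    "free money now",
    "click here free prize",
    "meeting at noon",
    "weather delay today",
]
train_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(train_messages), train_labels)

new_messages = ["free prize click", "weather delay"]
new_counts = toy_vectorizer.transform(new_messages)

toy_predictions = toy_classifier.predict(new_counts)
toy_probabilities = toy_classifier.predict_proba(new_counts)

# Columns of predict_proba follow the order of classes_
print(toy_classifier.classes_)
for message, label, probs in zip(new_messages, toy_predictions, toy_probabilities):
    print(message, "->", label, probs.round(3))
```

Each row of probabilities sums to 1, and the predicted label is the class with the largest value in that row.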