# Naive Bayes Classifier

We will use <b>`sklearn.naive_bayes`</b> to train a spam classifier.

In [2]:
# Standard Library Imports
import os
import io

# External Imports
import numpy
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [3]:
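def readFiles(path):
    """
    Walk "path" and yield (file path, message body) for every file found.
    """
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            # Use a separate name so the "path" argument is not overwritten.
            filepath = os.path.join(root, filename)

            inBody = False
            lines = []
            # The headers end at the first blank line; the rest is the body.
            with io.open(filepath, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        inBody = True
            message = '\n'.join(lines)
            yield filepath, message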

In [4]:
def dataFrameFromDirectory(path, classification):
    """
    Build a DataFrame with one row per file found under the directory
    given as the "path" argument, labelling every row with "classification".
    """
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)

# DataFrame.append was removed in pandas 2.0, so combine the frames with concat.
from pandas import concat

data = concat([
    dataFrameFromDirectory('emails/spam', 'spam'),
    dataFrameFromDirectory('emails/ham', 'ham'),
])

In [5]:
data.head()

| | class | message |
|---|---|---|
| emails/spam/00119.7bd666ac52f079fb3b5ff0be83b55286 | spam | `\n\n<html><body><pre>\n\n\n\n_________________...` |
| emails/spam/00294.df27a988d82cc82296e33e6d727ac47e | spam | `Get your favorite Poker action at http://www.m...` |
| emails/spam/00059.dc5b9ea22c6848c97871f0d9576cc931 | spam | `<HTML><P ALIGN=CENTER><font ptsize="14" family...` |
| emails/spam/00444.33afc8c1f9cea3100ca8502e8a785259 | spam | `=============================================\...` |
| emails/spam/00110.f3c4ebe14b439420b53212332326181f | spam | `Wanna see sexually curious teens playing with ...` |


Now we will use a <b>`CountVectorizer`</b> to split each message into its words and count them, then feed those counts to a <b>`MultinomialNB`</b> classifier. Call <b>`fit()`</b> and we have a trained spam filter ready!

In [6]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
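
To see what the vectorize-then-fit step above produces, here is a minimal sketch on a tiny corpus. The messages, labels, and `toy_*` names are hypothetical and are not taken from the `emails/` directories; the sketch only assumes scikit-learn's default `CountVectorizer` settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus (an assumption for illustration only).
toy_messages = [
    "win free money now",
    "free prize claim now",
    "lunch meeting at noon",
    "see you at the meeting",
]
toy_labels = ["spam", "spam", "ham", "ham"]

# fit_transform learns the vocabulary and returns a sparse count matrix:
# one row per message, one column per distinct word.
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages)
print(toy_counts.shape)
print(sorted(toy_vectorizer.vocabulary_))

# The classifier is trained on the count matrix and the matching labels.
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_counts, toy_labels)
print(toy_classifier.classes_)
```

The count matrix has as many rows as there are messages, and the classifier stores the distinct labels it saw in `classes_`.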

In [8]:
examples = ['Free Gifts now!!!', "Hi John, How about a game of basketball tomorrow?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham'], 
      dtype='<U4')

We passed a list of strings (examples, one of which is spam and the other ham) to the `transform` method because the messages must be converted into the same format the model was trained on. `vectorizer.transform` turns each message into a row of word counts, where each word is represented by a fixed position (column) in the array; words that were not seen during training are ignored.

After this transformation, we can call the classifier's `predict` function on the resulting array of word counts.
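
The mapping from words to array positions can be inspected directly. The following is a small sketch using a separate, hypothetical vectorizer (`demo_vectorizer`) fitted on two made-up sentences, not the one trained on the email data; it assumes scikit-learn's default tokenization and lowercasing.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical vectorizer fitted on two short sentences (illustration only).
demo_vectorizer = CountVectorizer()
demo_vectorizer.fit(["free gifts now", "game of basketball tomorrow"])

# vocabulary_ maps each word to its column index; sort the words by that index.
columns = sorted(demo_vectorizer.vocabulary_, key=demo_vectorizer.vocabulary_.get)
print(columns)

# transform reuses the learned vocabulary: text is lowercased, known words are
# counted in their columns, and unseen words ("for", "john") are dropped.
row = demo_vectorizer.transform(["Free free gifts for John"]).toarray()[0].tolist()
print(row)
```

Because `transform` never adds columns, every new message ends up with the same number of features the classifier was trained on, which is what makes `predict` applicable to it.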