# Naive Bayes (the easy way)

We'll cheat by using sklearn.naive_bayes to train a spam classifier! Most of the code is just loading our training data into a pandas DataFrame that we can play with:

### Creation of a dataframe of known "<u>spam</u>" and "<u>ham</u>" emails

In [1]:
# Importing libraries
import os
import io
import numpy as np
import pandas as pd
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Defining functions to read the data (emails) from files

def readFiles(path):
    # Walk the directory tree and yield (file path, message body) pairs
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(root, filename)

            inBody = False
            lines = []
            # latin1 maps every byte to a character, so decoding never fails
            with io.open(filepath, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        # The first blank line marks the end of the headers
                        inBody = True
            message = '\n'.join(lines)
            yield filepath, message


def dataFrameFromDirectory(path, classification):
    # One row per email, indexed by its file path
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)

# Reading the emails from files into a dataframe
# (raw strings keep the backslashes literal; pd.concat replaces
# DataFrame.append, which was removed in pandas 2.0)

data = pd.concat([
    dataFrameFromDirectory(r'C:\MLCourse\emails/spam', 'spam'),
    dataFrameFromDirectory(r'C:\MLCourse\emails/ham', 'ham'),
])




Let's have a look at that DataFrame:

In [2]:
data.head()

Unnamed: 0,message,class
C:\MLCourse\emails/spam\00001.7848dde101aa985090474a91ec93fcf0,"<!DOCTYPE HTML PUBLIC ""-//W3C//DTD HTML 4.0 Tr...",spam
C:\MLCourse\emails/spam\00002.d94f1b97e48ed3b553b3508d116e6a09,1) Fight The Risk of Cancer!\n\nhttp://www.adc...,spam
C:\MLCourse\emails/spam\00003.2ee33bc6eacdb11f38d052c44819ba6c,1) Fight The Risk of Cancer!\n\nhttp://www.adc...,spam
C:\MLCourse\emails/spam\00004.eac8de8d759b7e74154f142194282724,##############################################...,spam
C:\MLCourse\emails/spam\00005.57696a39d7d84318ce497886896bf90d,I thought you might like these:\n\n1) Slim Dow...,spam


### Shuffling the data

In [3]:
# Note: pass a fixed random_state to data.sample so that the shuffle,
# and therefore the train/test sets, the model and its accuracy,
# stay the same from one run of this notebook to the next.

# data1 = data.sample(frac=1)
data1 = data.sample(frac=1, random_state=1)

In [4]:
# help(data.sample)

In [5]:
data1.tail()

Unnamed: 0,message,class
C:\MLCourse\emails/ham\02264.2d1e4f4cd87e96f5ac2f24c992b59e2a,"URL: http://www.newsisfree.com/click/-4,851800...",ham
C:\MLCourse\emails/ham\00406.fe97503539c60ff6814fa2fadf1aa630,"On 2 Sep 2002, RossO wrote:\n\n\n\n> John Wayl...",ham
C:\MLCourse\emails/ham\00597.d69fd43248612845c272cc277b890e1a,">>>>> ""S"" == Stephen D Williams <sdw@lig.net> ...",ham
C:\MLCourse\emails/spam\00236.2772a068fff32e2f8d7f8a94bd9280cd,"Dear User,\n\n\n\nDo you ever wish you could e...",spam
C:\MLCourse\emails/ham\00562.3f2a351171504facae22864c794c26b6,\n\nHanson is always good.\n\n\n\nOne of my sc...,ham


In [6]:
len(data)

3000

In [7]:
len(data1)

3000

### Creation of train and test sets

In [8]:
# The train set contains 2/3 of the data: 2000 emails
# The test set contains 1/3 of the data: 1000 emails

trainx = data1[["message"]][:2000]
testx = data1[["message"]][2000:]
testy = data1[["class"]][2000:]
trainy = data1[["class"]][:2000]
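The shuffle-then-slice split used here can be illustrated without pandas. The following is a minimal standard-library sketch, not the notebook's own code: `split_train_test` is a hypothetical helper and the toy `emails` list stands in for the real DataFrame. It shows that a fixed seed gives the same split on every run and that the two sets never overlap.

```python
import random


def split_train_test(items, train_size, seed):
    """Shuffle a copy of `items` with a fixed seed, then slice it in two."""
    shuffled = list(items)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # private RNG: same seed, same order
    return shuffled[:train_size], shuffled[train_size:]


# Toy stand-in for the labelled emails (hypothetical data).
emails = [("message %d" % i, "spam" if i % 6 == 0 else "ham") for i in range(30)]

# 2/3 of the items for training, the remaining 1/3 for testing.
train, test = split_train_test(emails, train_size=20, seed=1)
train_again, test_again = split_train_test(emails, train_size=20, seed=1)

print(len(train), len(test))   # prints: 20 10
print(train == train_again)    # prints: True
```

Slicing after a seeded shuffle does not guarantee the same spam/ham ratio in both sets; it only guarantees that the split is repeatable.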

In [9]:
testy.head()

Unnamed: 0,class
C:\MLCourse\emails/ham\02241.aaefd69aeb045921a2c82a01c13d225d,ham
C:\MLCourse\emails/spam\00079.cc3fa7d977a44a09d450dde5db161c37,spam
C:\MLCourse\emails/ham\02030.596414a1b7f0e928af1b40535a6cc0ca,ham
C:\MLCourse\emails/ham\01712.c20d5899b4b27415389a10cd4f400019,ham
C:\MLCourse\emails/ham\02152.8df514c41920019281f8f0723dad0001,ham


In [10]:
trainy.head()

Unnamed: 0,class
C:\MLCourse\emails/ham\01458.faf90fabe2c118e46dd6a60139a7317f,ham
C:\MLCourse\emails/ham\01588.f0623dc7b744dd2ba0417f8f0a98662f,ham
C:\MLCourse\emails/ham\00895.0c7898bdc5199ca3efd6af04c80430d0,ham
C:\MLCourse\emails/ham\01021.3fc1c0955f38f5873882a577f00a5f2c,ham
C:\MLCourse\emails/ham\00599.94c013ab7037d45045aafbac3389bef0,ham


### Creation of a spam filter

Now we will use a CountVectorizer to turn each message into a vector of word counts, and feed those counts into a MultinomialNB classifier. Call fit() and we've got a trained spam filter ready to go! It's just that easy.

In [11]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(trainx['message'].values)

classifier = MultinomialNB()
targets = trainy['class'].values
classifier.fit(counts, targets)

MultinomialNB()
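To make the two steps less of a black box, here is a rough standard-library sketch of the same idea: count words per class, then pick the class with the highest smoothed log probability. It is an illustration under stated assumptions, not scikit-learn's implementation. The functions `tokenize`, `train_naive_bayes` and `predict` are hypothetical names, and the toy messages are made up. The tokenizer imitates CountVectorizer's defaults (lowercasing, tokens of two or more word characters), and `alpha=1.0` matches MultinomialNB's default smoothing.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and keep runs of 2+ word characters, imitating
    # CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())


def train_naive_bayes(messages, labels):
    """Count words per class; return everything needed for prediction."""
    word_counts = defaultdict(Counter)  # class -> word -> count
    class_counts = Counter(labels)      # class -> number of messages
    for message, label in zip(messages, labels):
        word_counts[label].update(tokenize(message))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocabulary


def predict(message, word_counts, class_counts, vocabulary, alpha=1.0):
    """Return the class with the highest log posterior (Laplace-smoothed)."""
    total_messages = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_messages)  # log prior
        for word in tokenize(message):
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            likelihood = (word_counts[label][word] + alpha) / (
                total_words + alpha * len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set, not taken from the email corpus.
toy_messages = [
    "free pills now", "win free money now",
    "golf game tomorrow", "meeting notes for tomorrow",
]
toy_labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(toy_messages, toy_labels)

print(predict("free money", *model))     # prints: spam
print(predict("golf tomorrow", *model))  # prints: ham
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never appeared in one class from driving that class's probability to zero.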

In [12]:
trainx.head()

Unnamed: 0,message
C:\MLCourse\emails/ham\01458.faf90fabe2c118e46dd6a60139a7317f,\n\nRobert Strickler said:\n\n\n\n> Looks like...
C:\MLCourse\emails/ham\01588.f0623dc7b744dd2ba0417f8f0a98662f,"Hi,\n\n\n\nIs it possible to use razor without..."
C:\MLCourse\emails/ham\00895.0c7898bdc5199ca3efd6af04c80430d0,"she read the links. what must it be like, she ..."
C:\MLCourse\emails/ham\01021.3fc1c0955f38f5873882a577f00a5f2c,"\n\nHi Folks,\n\n\n\nI've been trying to set a..."
C:\MLCourse\emails/ham\00599.94c013ab7037d45045aafbac3389bef0,Chuck Murcko wrote:\n\n> > The usual crud. Wh...


### Testing the model on the test set

In [13]:
testx_counts = vectorizer.transform(testx["message"].values)

In [14]:
predictions = classifier.predict(testx_counts)

In [15]:
predictions[:5]

array(['ham', 'spam', 'ham', 'ham', 'ham'], dtype='<U4')

### Model Accuracy

In [16]:
acc_test = (predictions == np.array(testy["class"])).sum()/len(predictions)*100

In [17]:
print('The accuracy on the test set (1/3 of the data was used as a test set) is ' + str(acc_test) + "%.")

The accuracy on the test set (1/3 of the data was used as a test set) is 94.3%.
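Accuracy is a single number. For a spam filter it can also help to separate how many flagged messages really were spam (precision) from how many spam messages were caught (recall). The sketch below computes all three in plain Python; `spam_metrics` is a hypothetical helper, and the toy labels are invented for illustration, not the notebook's results.

```python
def spam_metrics(predicted, actual, positive="spam"):
    """Accuracy, precision and recall, treating `positive` as the target class."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)  # spam caught
    fp = sum(p == positive and a != positive for p, a in pairs)  # ham flagged
    fn = sum(p != positive and a == positive for p, a in pairs)  # spam missed
    correct = sum(p == a for p, a in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels for illustration only.
toy_predicted = ["ham", "spam", "ham", "ham", "spam", "ham"]
toy_actual = ["ham", "spam", "spam", "ham", "ham", "ham"]
metrics = spam_metrics(toy_predicted, toy_actual)

print(metrics["precision"], metrics["recall"])  # prints: 0.5 0.5
```

In the notebook, the test-set `predictions` array and `testy["class"]` could presumably be passed in place of the toy lists, since the helper only iterates over both sequences in parallel.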


### Using the model on new email examples

Let's try it out:

In [18]:
examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham'], dtype='<U4')

In [19]:
examples1 = ["Ancient Healing Secrets Free Ebook"]
examples1_counts = vectorizer.transform(examples1)
predictions = classifier.predict(examples1_counts)
predictions

array(['spam'], dtype='<U4')

In [20]:
examples2 = ["2 new items in your Stack Exchange inbox"]
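# The original cell transformed examples1 here by mistake, so the output
# recorded below reflects examples1, not examples2.
examples2_counts = vectorizer.transform(examples2)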
predictions = classifier.predict(examples2_counts)
predictions

array(['spam'], dtype='<U4')

In [21]:
examples3 = ["UK Research and Innovation MRC funding opportunities Update"]
examples3_counts = vectorizer.transform(examples3)
predictions = classifier.predict(examples3_counts)
predictions

array(['ham'], dtype='<U4')

## Activity

Our data set is small, so our spam classifier isn't actually very good. Try running some different test emails through it and see if you get the results you expect. **<u>Done</u>**.

If you really want to challenge yourself, try applying train/test to this spam classifier - see how well it can predict some subset of the ham and spam emails. **<u>Done</u>**.