# Spam Classifier

In [22]:
import os
import io
import numpy
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def readFiles(path):
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            path = os.path.join(root, filename)

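            # inBody is True from the start, so every line (including the
            # Subject line) is kept and the elif branch below never runs.
            inBody = True
            lines = []
            # The with-block closes the file even if reading fails.
            with io.open(path, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        inBody = False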
            message = '\n'.join(lines)
            yield path, message


def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return pd.DataFrame(rows, index=index)

# DataFrame.append was removed in pandas 2.0, so combine both folders with pd.concat.
data = pd.concat([
    dataFrameFromDirectory("C:/Users/HarshLaptop/Desktop/DataScience/DataScience-Python3/emails/spam", 'spam'),
    dataFrameFromDirectory("C:/Users/HarshLaptop/Desktop/DataScience/DataScience-Python3/emails/ham", 'ham'),
])


Let's have a look at that DataFrame:

In [23]:
data.head()

| | class | message |
|---|---|---|
| C:/Users/HarshLaptop/Desktop/DataScience/DataScience-Python3/emails/spam\0002.2001-05-25.SA_and_HP.spam.txt | spam | Subject: fw : this is the solution i mentioned... |
| C:/Users/HarshLaptop/Desktop/DataScience/DataScience-Python3/emails/spam\0002.2003-12-18.GP.spam.txt | spam | Subject: adv : space saving computer to replac... |
| C:/Users/HarshLaptop/Desktop/DataScience/DataScience-Python3/emails/spam\0003.2003-12-18.GP.spam.txt | spam | Subject: fw : account over due wfxu ppmfztdtet... |
| C:/Users/HarshLaptop/Desktop/DataScience/DataScience-Python3/emails/spam\0004.2001-06-12.SA_and_HP.spam.txt | spam | Subject: spend too much on your phone bill ? 2... |
| C:/Users/HarshLaptop/Desktop/DataScience/DataScience-Python3/emails/spam\0005.2001-06-23.SA_and_HP.spam.txt | spam | Subject: discounted mortgage broker 512517\n\n... |


Now we will use a CountVectorizer to convert each message into a vector of word counts, and feed those counts into a MultinomialNB classifier. We hold out 30% of the messages with train_test_split, call fit() on the rest, and we've got a trained spam filter ready to go!

In [36]:
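from sklearn.model_selection import train_test_split

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)
targets = data['class'].values

# Hold out 30% of the messages for testing. The labels must be passed in too,
# so that they are split in step with the count matrix.
X_train, X_test, y_train, y_test = train_test_split(counts, targets, test_size=0.3)
classifier = MultinomialNB()
classifier.fit(X_train, y_train)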

<class 'scipy.sparse.csr.csr_matrix'>


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
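
To make the vectorizing step more concrete, here is a minimal sketch of what a CountVectorizer produces. The two messages and the names `toy_messages`, `toy_vectorizer`, `toy_counts` and `vocab` are made up for illustration and are not part of the email data above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up messages, purely for illustration (not from the email corpus).
toy_messages = ["free money now", "money money for golf"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages)

# vocabulary_ maps each word to its column index; list the words in column order.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
# One row per message, one column per word, each cell a word count.
print(toy_counts.toarray())
```

Each row is the kind of input MultinomialNB trains on: word order is discarded and only the per-word counts remain.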

Let's try it out:

In [35]:
examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham'], 
      dtype='<U4')

## Activity

Our data set is small, so our spam classifier isn't actually very good. Try running some different test emails through it and see if you get the results you expect.
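
One possible way to run your own test emails is sketched below. It trains on a tiny made-up corpus so that it runs on its own; `train_messages`, `train_labels` and the helper `classify` are hypothetical names, and in the notebook you would use the `vectorizer` and `classifier` trained above instead. Printing the predicted probability alongside the label makes it easier to see how confident the model is.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical miniature training set, standing in for the real emails.
train_messages = [
    "free money now", "win a free prize today", "cheap pills free shipping",
    "golf game tomorrow", "meeting notes attached", "lunch with bob on friday",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

vectorizer = CountVectorizer()
classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(train_messages), train_labels)

def classify(texts):
    """Return (text, predicted label, probability of that label) for each text."""
    probs = classifier.predict_proba(vectorizer.transform(texts))
    results = []
    for text, row in zip(texts, probs):
        best = row.argmax()  # column of the most probable class
        results.append((text, str(classifier.classes_[best]), float(row[best])))
    return results

for text, label, prob in classify(["Free money now!!!", "Lunch tomorrow?"]):
    print(f"{label:>4}  {prob:.2f}  {text}")
```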

If you really want to challenge yourself, try applying train/test to this spam classifier: measure how well it predicts a held-out subset of the ham and spam emails.
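
As a starting point, the sketch below shows one way such an evaluation could look. The corpus is a made-up stand-in (`messages`, `labels`) so the example runs by itself; swap in `data['message']` and `data['class']` for the real thing. Unlike the cell above, it fits the vectorizer on the training messages only, which is a design choice of this sketch rather than something the notebook does.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in corpus; replace with the real messages and classes.
messages = [
    "free money now", "win a free prize today",
    "cheap pills free shipping", "claim your free money prize",
    "golf game tomorrow", "meeting notes attached",
    "lunch with bob on friday", "see you at the golf course",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Split the raw text first; stratify keeps both classes in the test set.
msg_train, msg_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=0, stratify=labels)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(msg_train)  # learn the vocabulary from training text only
X_test = vectorizer.transform(msg_test)        # reuse that vocabulary for the test text

classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Fraction of held-out messages classified correctly.
accuracy = accuracy_score(y_test, classifier.predict(X_test))
print(accuracy)
```

With a corpus this small the score means very little; the point is the workflow of training on one subset and scoring on another.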