# Creating a Spam Filter and a Naive Bayes Classifier

**_Author: Christos Anagnostopoulos_**



###  Introduction


We will use the “SMSSpamCollection” data set (downloadable from [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00228/)) to build a Naïve Bayes spam filter through the following steps:

1. Load the data file and perform exploratory data analysis (EDA).
2. Shuffle and split the messages.
3. Build a simple Naïve Bayes classifier.
4. Use the training set to train the classifier with `train`.
5. Using the validation set, explore how the classifier performs out-of-sample.
6. Define a second classifier and compare its performance with the one defined in Part 3.


### Part 1 - Importing the data set and  exploratory data analysis (EDA)

In [1]:
import pandas as pd

messages = pd.read_csv('SMSSpamCollection.txt', sep = '\t', names = ["label", "sms"])

In [2]:
messages.head(10)

Unnamed: 0,label,sms
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [3]:
messages.shape

(5572, 2)

In [4]:
messages.columns

Index(['label', 'sms'], dtype='object')

In [5]:
messages.describe()

Unnamed: 0,label,sms
count,5572,5572
unique,2,5169
top,ham,"Sorry, I'll call later"
freq,4825,30
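
The `describe()` summary above already implies the class balance and the number of repeated messages. As a minimal sketch, the same quantities can be computed directly; the `toy` frame below is a hypothetical stand-in for the real data (it is not part of the original notebook), while the last two lines only redo the arithmetic on the figures reported above.

```python
import pandas as pd

# Hypothetical miniature stand-in for the real data set (not the actual messages)
toy = pd.DataFrame({
    "label": ["ham", "ham", "spam", "ham"],
    "sms": ["see you later", "ok", "win a free prize now", "ok"],
})

# Share of each class among the messages
class_share = toy["label"].value_counts(normalize=True)
print(class_share)

# Number of messages that repeat an earlier message exactly
n_duplicates = int(toy["sms"].duplicated().sum())
print(n_duplicates)  # 1

# The same arithmetic on the figures reported by describe() above
ham_share_full = 4825 / 5572      # freq of 'ham' divided by count
duplicates_full = 5572 - 5169     # count minus unique
print(round(ham_share_full, 3), duplicates_full)  # 0.866 403
```

On the real data, `messages["label"].value_counts(normalize=True)` should give the same ham share of roughly 0.866.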


### Part 2 - Shuffling  and splitting the text messages

In [6]:
# Shuffle the messages and reset the index
messages = messages.sample(frac = 1, random_state = 0).reset_index(drop = True)


In [7]:
messages.head()

Unnamed: 0,label,sms
0,ham,"Storming msg: Wen u lift d phne, u say ""HELLO""..."
1,spam,<Forwarded from 448712404000>Please CALL 08712...
2,ham,And also I've sorta blown him off a couple tim...
3,ham,"Sir Goodmorning, Once free call me."
4,ham,All will come alive.better correct any good lo...


We split the messages into a training set (the first 2,500 messages), a validation set (the next 1,000 messages) and a test set (the remaining 2,072 messages).

In [8]:
msgs = list(messages.sms) 
lbls = list(messages.label) 
trainingMsgs = msgs[:2500] 
valMsgs = msgs[2500:3500] 
testingMsgs = msgs[3500:]

In [9]:
trainingLbls = lbls[ :2500]
valLbls = lbls[2500:3500]
testingLbls = lbls[3500: ]
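
The positional split above can also be wrapped in a small helper so that messages and labels are always cut at the same indices. This is only a sketch: `split_three_ways` is a hypothetical name, and the list of integers stands in for the 5,572 shuffled messages.

```python
def split_three_ways(items, n_train, n_val):
    """Split a sequence by position into train, validation and test parts."""
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# Stand-in for the 5,572 shuffled messages
dummy = list(range(5572))
train_part, val_part, test_part = split_three_ways(dummy, 2500, 1000)
print(len(train_part), len(val_part), len(test_part))  # 2500 1000 2072
```

Calling the helper once on `msgs` and once on `lbls` with the same sizes would reproduce the six lists defined above.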

### Part 3 - Building a simple Naïve Bayes classifier

In [10]:
import numpy as np


class NaiveBayesForSpam:
    def train (self, hamMessages, spamMessages):
        # Vocabulary: every whitespace-separated token seen in training (case is preserved)
        self.words = set (' '.join (hamMessages + spamMessages).split())
        # Class priors: [P(ham), P(spam)]
        self.priors = np.zeros (2)
        self.priors[0] = float (len (hamMessages)) / (len (hamMessages) + len (spamMessages))
        self.priors[1] = 1.0 - self.priors[0]
        self.likelihoods = []
        for w in self.words:
            # Add-one smoothed fraction of messages containing w (substring match), capped at 0.95
            prob1 = (1.0 + len ([m for m in hamMessages if w in m])) / len (hamMessages)
            prob2 = (1.0 + len ([m for m in spamMessages if w in m])) / len (spamMessages)
            self.likelihoods.append ([min (prob1, 0.95), min (prob2, 0.95)])
        # Shape (2, number of words): row 0 = ham, row 1 = spam
        self.likelihoods = np.array (self.likelihoods).T
        
    def predict (self, message):
        posteriors = np.copy (self.priors)
        # NOTE: the message is lower-cased but the vocabulary is not,
        # so vocabulary words containing upper-case letters never match here
        message = message.lower()
        for i, w in enumerate (self.words):
            if w in message:
                posteriors *= self.likelihoods[:,i]
            else:
                posteriors *= np.ones (2) - self.likelihoods[:,i]
            # Rescale to unit Euclidean length to avoid underflow
            # (the two values do not necessarily sum to 1)
            posteriors = posteriors / np.linalg.norm (posteriors)
        if posteriors[0] > 0.5:
            return ['ham', posteriors[0]]
        return ['spam', posteriors[1]]

    def score (self, messages, labels):
        # Confusion matrix: rows = predicted class, columns = true class, in the order (ham, spam)
        confusion = np.zeros(4).reshape (2,2)
        for m, l in zip (messages, labels):
            predicted = self.predict(m)[0]  # predict once per message
            if predicted == 'ham' and l == 'ham':
                confusion[0,0] += 1
            elif predicted == 'ham' and l == 'spam':
                confusion[0,1] += 1
            elif predicted == 'spam' and l == 'ham':
                confusion[1,0] += 1
            elif predicted == 'spam' and l == 'spam':
                confusion[1,1] += 1
        # Accuracy = correct predictions / all predictions
        return (confusion[0,0] + confusion[1,1]) / float (confusion.sum()), confusion
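
One detail of `predict` is worth illustrating: `np.linalg.norm` returns the Euclidean (L2) norm, so after rescaling the two scores do not sum to 1 and the `> 0.5` test is not a test on a probability. The sketch below uses hypothetical scores to compare that rescaling with dividing by the sum. The second variant is an alternative shown for comparison only, not what the notebook does, and switching to it would change the results reported later.

```python
import numpy as np

# Hypothetical unnormalised [ham, spam] scores, with spam the larger of the two
scores = np.array([0.10, 0.15])

l2_scaled = scores / np.linalg.norm(scores)  # unit Euclidean length, as in predict
l1_scaled = scores / scores.sum()            # sums to 1, reads as a probability

print(l2_scaled, l2_scaled.sum())
print(l1_scaled, l1_scaled.sum())

# With L2 scaling the ham score exceeds 0.5 although spam has the larger score
print(l2_scaled[0] > 0.5, l1_scaled[0] > 0.5)  # True False
```

With L2 scaling, the ham score exceeds 0.5 whenever the spam score is less than about 1.73 (the square root of 3) times the ham score, so the rule as written leans towards `ham`.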

### Part 4 - Training the `train`  classifier

In [11]:
hammsgs = [m for (m, l) in zip(trainingMsgs, trainingLbls) if l == 'ham']

In [12]:
spammsgs = [m for (m, l) in zip(trainingMsgs, trainingLbls) if l == 'spam']

In [13]:
print(len(hammsgs))
print(len(spammsgs))

2170
330


In [24]:
#instantiate the classifier
clf = NaiveBayesForSpam()
clf.train(hammsgs, spammsgs)

### Part 5 - Exploring the performance of the `train` classifier.

In [25]:
score, confusion = clf.score (valMsgs, valLbls)

In [26]:
print("The overall performance is:", score)


The overall performance is: 0.977


In [27]:
print("The confusion matrix is:\n", confusion)

The confusion matrix is:
 [[864.  20.]
 [  3. 113.]]
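
Overall accuracy hides how the errors are distributed between the two classes. As a small sketch, precision and recall for the `spam` class can be derived from the matrix printed above, reading rows as predicted labels and columns as true labels (the way `score` fills them in). The variable names are chosen here for illustration.

```python
import numpy as np

# Validation confusion matrix reported above:
# rows = predicted (ham, spam), columns = actual (ham, spam)
confusion = np.array([[864.0, 20.0],
                      [3.0, 113.0]])

spam_predicted = confusion[1].sum()    # messages flagged as spam: 116
spam_actual = confusion[:, 1].sum()    # messages that really are spam: 133

spam_precision = confusion[1, 1] / spam_predicted
spam_recall = confusion[1, 1] / spam_actual
accuracy = np.trace(confusion) / confusion.sum()

print(f"accuracy={accuracy:.3f} precision={spam_precision:.3f} recall={spam_recall:.3f}")
# accuracy=0.977 precision=0.974 recall=0.850
```

Read this way, the classifier rarely flags a legitimate message (3 out of 867) but lets through 20 of the 133 spam messages.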


As a baseline, we calculate what the success rate would be if we always guessed `ham`.


In [28]:
print('new_score', len([1 for l in valLbls if 'ham' in l]) / float (len ( valLbls)))

new_score 0.867


### Part 6 - Creating a new  `train2` classifier

In [29]:
class NaiveBayesForSpam:
    def train2 (self, hamMessages, spamMessages):
        # Same as train, but keep only words that are far more typical of spam than of ham
        self.words = set (' '.join (hamMessages + spamMessages).split())
        self.priors = np.zeros (2)
        self.priors[0] = float (len (hamMessages)) / (len (hamMessages) + len (spamMessages))
        self.priors[1] = 1.0 - self.priors[0]
        self.likelihoods = []
        spamkeywords = []
        for w in self.words:
            prob1 = (1.0 + len ([m for m in hamMessages if w in m])) / len (hamMessages)
            prob2 = (1.0 + len ([m for m in spamMessages if w in m])) / len (spamMessages)
            # Keep w only if its spam likelihood is more than 20 times its ham likelihood
            if prob1 * 20 < prob2:
                self.likelihoods.append ([min (prob1, 0.95), min (prob2, 0.95)])
                spamkeywords.append (w)
        # The vocabulary is now the (much shorter) list of spam keywords
        self.words = spamkeywords
        self.likelihoods = np.array (self.likelihoods).T
            
    def predict (self, message):
        posteriors = np.copy (self.priors)
        # NOTE: the message is lower-cased but the keywords are not,
        # so keywords containing upper-case letters never match here
        message = message.lower()
        for i, w in enumerate (self.words):
            if w in message:
                posteriors *= self.likelihoods[:,i]
            else:
                posteriors *= np.ones (2) - self.likelihoods[:,i]
            # Rescale to unit Euclidean length to avoid underflow
            posteriors = posteriors / np.linalg.norm (posteriors)
        if posteriors[0] > 0.5:
            return ['ham', posteriors[0]]
        return ['spam', posteriors[1]]

    def score (self, messages, labels):
        # Confusion matrix: rows = predicted class, columns = true class, in the order (ham, spam)
        confusion = np.zeros(4).reshape (2,2)
        for m, l in zip (messages, labels):
            predicted = self.predict(m)[0]  # predict once per message
            if predicted == 'ham' and l == 'ham':
                confusion[0,0] += 1
            elif predicted == 'ham' and l == 'spam':
                confusion[0,1] += 1
            elif predicted == 'spam' and l == 'ham':
                confusion[1,0] += 1
            elif predicted == 'spam' and l == 'spam':
                confusion[1,1] += 1
        return (confusion[0,0] + confusion[1,1]) / float (confusion.sum()), confusion

In [30]:
#define the classifier
clf = NaiveBayesForSpam()

#train it
clf.train2(hammsgs, spammsgs)
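
The only change in `train2` is the filter `prob1 * 20 < prob2`, which keeps a word only when its smoothed spam likelihood is more than 20 times its smoothed ham likelihood. The sketch below isolates that rule with the training-set sizes reported earlier (2170 ham, 330 spam). The function `is_spam_keyword` and the counts passed to it are hypothetical, introduced only to show where the threshold falls.

```python
def is_spam_keyword(ham_count, spam_count, n_ham=2170, n_spam=330, ratio=20):
    """Apply the train2 filter to a word found in ham_count ham and spam_count spam messages."""
    prob1 = (1.0 + ham_count) / n_ham     # smoothed ham likelihood
    prob2 = (1.0 + spam_count) / n_spam   # smoothed spam likelihood
    return prob1 * ratio < prob2

# A word never seen in ham needs to occur in at least 3 spam messages to be kept
print(is_spam_keyword(0, 2))    # False
print(is_spam_keyword(0, 3))    # True

# A word that is common in both classes is dropped
print(is_spam_keyword(50, 50))  # False
print(is_spam_keyword(10, 50))  # True
```

Because of the add-one smoothing, a word that appears in only one or two spam messages is discarded even if it never appears in ham.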

Re-compute the score and the confusion matrix on the *validation set* using the updated classifier.

In [31]:
# This cell may take a long time to run
score_2, confusion_2 = clf.score(valMsgs, valLbls)

In [32]:
print("The overall performance is: ", score_2)

The overall performance is:  0.979


In [33]:
print("The confusion matrix is:\n", confusion_2)

The confusion matrix is:
 [[863.  17.]
 [  4. 116.]]
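
To compare the two classifiers, it helps to put their error counts side by side. The following sketch reuses the two validation confusion matrices printed above (rows = predicted, columns = actual); the helper `error_counts` is a hypothetical name introduced for illustration.

```python
import numpy as np

# Validation confusion matrices reported above for train and train2
first = np.array([[864.0, 20.0],
                  [3.0, 113.0]])
second = np.array([[863.0, 17.0],
                   [4.0, 116.0]])

def error_counts(confusion):
    """Return (spam predicted as ham, ham predicted as spam)."""
    missed_spam = int(confusion[0, 1])
    flagged_ham = int(confusion[1, 0])
    return missed_spam, flagged_ham

for name, cm in [("train", first), ("train2", second)]:
    missed_spam, flagged_ham = error_counts(cm)
    print(name, "missed spam:", missed_spam, "flagged ham:", flagged_ham,
          "total errors:", missed_spam + flagged_ham)
# train missed spam: 20 flagged ham: 3 total errors: 23
# train2 missed spam: 17 flagged ham: 4 total errors: 21
```

On this validation set the second classifier catches three more spam messages at the cost of one additional legitimate message being flagged, a difference of two errors out of 1,000 messages.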
