# Spam SMS Filter
## Naive Bayes Classification

We implement a Naive Bayes classifier for filter out spam SMS.

The dataset was obtained from https://archive.ics.uci.edu/ml/datasets/sms+spam+collection. It contains:

-> 425 spam SMS messages, manually extracted from the Grumbletext Web site which is a UK forum in which cell phone users make public claims about SMS spam messages
-> 3,375 non-spam SMS messages largely originated from Singaporeans and mostly from students attending the National University of Singapore.
-> 450 non-spam SMS messages from Caroline Tag's PhD Thesis.
-> 1,002 non-spam SMS messages and spam 322 SMS messages from the SMS Spam Corpus v.0.1 Big.

^Description paraphrased from the link above.

In [1]:
%matplotlib inline
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import matplotlib.pyplot as plt

In [2]:
#getWords() takes a string as input and outputs the list of words (delimited by non-characters) in lowercase.
import re
def getWords(text):
    return re.compile('\w+').findall(text.lower())

In [3]:
#getInVec() takes a list of test words and an n-length wordlist as input.
#it outputs an n-length feature vector that is 1 if the word in wordlist appears in the list of test words.
def getInVec(message, wordlist):
    x=np.array( [1 if word in message else 0 for word in wordlist] )
    return x

In [4]:
#load the trainingset
trainingset=np.genfromtxt("data/SMSs.dat", delimiter="\t", dtype=None)
trainingset=trainingset[0:5000,:]
m=len(trainingset)

#generate the wordlist
wordlist = []
for i in range(0,m):
    for word in getWords( trainingset[i,1] ):
        wordlist.append( word )
print "The length of the wordlist is " + str(len(wordlist))
wordlist= np.array( list( set(wordlist) ) )
n=len(wordlist)
print "The length of the set of the wordlist is " + str(n)

The length of the wordlist is 78761
The length of the set of the wordlist is 8117


In [5]:
#generate the target matrix
Y=np.array( [1 if i=="spam" else 0 for i in trainingset[:,0]] ) #Y m*1 target matrix 
Y_not=np.logical_not(Y)
sum_spam=sum(Y) #total number of spam messages
sum_notspam=sum(Y_not)

In [6]:
#generate the feature matrix (Sorry this takes a while because of the loop)
X=np.zeros((m,n)) #m*n feature matrix
for i in range(0,m):
    SMS=getWords(trainingset[i,1])
    X[i]=getInVec(SMS, wordlist)
sum_x=np.sum(X, axis=0) #n*1 vector: number of j'th feature

#calculate the parameter F_xj_given_spam 
sum_x_and_spam=np.zeros(n)
for j in range(0,n):
    sum_x_and_spam[i] = sum( np.logical_and(X[:,j],Y) ) #m*1 vector: number of {j'th feature AND spam}
F_xj_given_spam = (sum_x_and_spam+1)/(sum_spam+2) #

#calculate the parameter F_xj_given_notspam 
sum_x_and_notspam=np.zeros(n)
for j in range(0,n):
    sum_x_and_notspam[i] = sum( np.logical_and(X[:,j],Y_not) ) #m*1 vector: number of {j'th feature AND notspam}
F_xj_given_notspam = (sum_x_and_notspam+1)/(sum_notspam+2)

P_spam=float(sum_spam)/m

F_xj=sum_x/m

In [11]:
#load the validationset
validationset=np.genfromtxt("data/SMSs.dat", delimiter="\t", dtype=None)
validationset=validationset[5000:-1,:]

for entry in validationset[0:10,:]:
    trial=getWords(entry[1])
    x_trial=getInVec(trial,wordlist) #feature vector of the trial
    print entry[1]

    log_P_x_given_spam=0
    log_P_x_given_notspam=0

    for j in range (0,n):
        if (x_trial[j]==1):
            log_P_x_given_spam = log_P_x_given_spam + np.log(F_xj_given_spam[j])
            log_P_x_given_notspam = log_P_x_given_notspam + np.log(F_xj_given_notspam[j])
        if (x_trial[j]==0):
            log_P_x_given_spam = log_P_x_given_spam + np.log(1-F_xj_given_spam[j])
            log_P_x_given_notspam = log_P_x_given_notspam + np.log(1-F_xj_given_notspam[j])

    P_x_given_spam = np.exp(log_P_x_given_spam)
    P_x_given_notspam = np.exp(log_P_x_given_notspam)
    P_spam_given_x = P_spam* P_x_given_spam/ (P_x_given_spam*(P_spam) + P_x_given_notspam*(1-P_spam))
    
    print "The probability that this message is spam is " + str(P_spam_given_x)
    
    if(str(P_spam_given_x)==str(1.0)):
        print "I will categorize this message as SPAM."
        if (entry[0]=="spam"):
            print "This message is indeed SPAM."
        else:
            print "I was wrong; this message is NOTSPAM."
                
    if(str(P_spam_given_x)!=str(1.0)):
        print "I will categorize this message as NOTSPAM."
        if (entry[0]=="ham"):
            print "This message is indeed NOTSPAM."
        else:
            print "I was wrong; this message is SPAM."
                
    print "\n"

Hmph. Go head, big baller.
The probability that this message is spam is 0.00159288026456
I will categorize this message as NOTSPAM.
This message is indeed NOTSPAM.


Well its not like you actually called someone a punto. That woulda been worse.
The probability that this message is spam is 0.99999474166
I will categorize this message as NOTSPAM.
This message is indeed NOTSPAM.


Nope. Since ayo travelled, he has forgotten his guy
The probability that this message is spam is 0.730646590324
I will categorize this message as NOTSPAM.
This message is indeed NOTSPAM.


You still around? Looking to pick up later
The probability that this message is spam is 0.945706938036
I will categorize this message as NOTSPAM.
This message is indeed NOTSPAM.


CDs 4u: Congratulations ur awarded £500 of CD gift vouchers or £125 gift guaranteed & Freeentry 2 £100 wkly draw xt MUSIC to 87066 TnCs www.ldew.com1win150ppmx3age16
The probability that this message is spam is 1.0
I will categorize this message as S