# Naive Bayes

**Note: The following cell contains some predefined functions that load a tab-separated text data set, split each message into words, and turn it into a binary word vector for the Naive Bayes classifiers. Please make sure you have run this cell before you run other cells in this notebook.**

In [33]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB

def loadDataSet(dataset):
    # Each line of the file is expected to look like "<label>\t<message text>".
    with open(dataset) as f:
        data=f.readlines()
    text=[entry.split('\t')[1].rstrip() for entry in data]
    # Build a real list: the parsed messages are iterated twice below, and
    # map() would return a one-shot iterator on Python 3.
    parsedText=[textParse(sms) for sms in text]
    vocabList=createVocabList(parsedText)
    returnVec=[setOfWords2Vec(vocabList,parsedSMS) for parsedSMS in parsedText]
    labels=[entry.split('\t')[0] for entry in data]
    return returnVec,labels,vocabList
         
def createVocabList(dataSet):
    vocabSet=set([])
    for document in dataSet:
        vocabSet=vocabSet|set(document)
    return list(vocabSet)
        
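def textParse(bigString):
    import re
    # Split on runs of non-letters. '+' (not '*') avoids empty-string matches,
    # which Python 3.7+ treats as split points between every character.
    listOfTokens=re.split(r'[^A-Za-z]+',bigString)
    # Keep lowercased tokens that are longer than two characters.
    return [tok.lower() for tok in listOfTokens if len(tok)>2]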

def setOfWords2Vec(vocabList,inputSet):
    returnVec=[0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)]=1
        else: print('the word: %s is not in my Vocabulary' % word)
    return returnVec

## Build a classifier
* The variable "dataset" stores the name of the text file that you enter and is passed as the argument of the function "loadDataSet()".
* After processing, "loadDataSet()" returns three variables: "returnVec", "labels", and "vocabList".
* "returnVec" stores the vectorized feature values. "labels" stores the labels of all instances. "vocabList" stores all the distinct words that appear in the dataset.
* The variable "n_foldCV" stores the number of folds that you enter for n-fold cross-validation.
* The variable "clf" stores a Naive Bayes model, and it can be fitted with "returnVec" and "labels". Once the model is fitted, it can be used to predict unseen instances.
* The variable "scores" stores the accuracy of each fold of the n-fold cross-validation of the model.
* "Vectorized" here means the algorithm creates a vocabulary list of all distinct words that appear in the training set and preprocesses each instance into a vector that records whether each word in the vocabulary list appears in it. For instance, if the training set contained two instances, "I am happy" and "I like you", then the vocabulary list would be ["I","am","happy","like","you"] and the two vectorized instances would be [1,1,1,0,0] and [1,0,0,1,1]. 0 means the word is absent and 1 means it is present. (The example is simplified: "textParse()" also lowercases every word and drops words shorter than three characters, and the order of "vocabList" is arbitrary.)
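
The two-sentence example in the last bullet can be reproduced with a minimal sketch. The helper names `build_vocab` and `to_vector` are hypothetical stand-ins, not the notebook's functions: unlike `createVocabList`, the sketch keeps words in first-seen order so that the output matches the vocabulary listed above, and it splits on whitespace without lowercasing or dropping short words.

```python
def build_vocab(documents):
    """Collect distinct words in first-seen order (assumption: order matters here
    only so the result matches the example in the text)."""
    vocab = []
    for doc in documents:
        for word in doc:
            if word not in vocab:
                vocab.append(word)
    return vocab


def to_vector(vocab, doc):
    """Return a 0/1 vector marking which vocabulary words occur in doc."""
    present = set(doc)
    return [1 if word in present else 0 for word in vocab]


docs = [sentence.split() for sentence in ["I am happy", "I like you"]]
vocab = build_vocab(docs)
vectors = [to_vector(vocab, doc) for doc in docs]

print(vocab)    # ['I', 'am', 'happy', 'like', 'you']
print(vectors)  # [[1, 1, 1, 0, 0], [1, 0, 0, 1, 1]]
```

Because each position only records presence or absence, a word that occurs several times in one message still contributes a single 1.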

In [42]:
dataset=raw_input('Please Enter Your Data Set:')
n_foldCV=int(raw_input("Please Enter the Number of Folds:"))
returnVec,labels,vocabList=loadDataSet(dataset)

Please Enter Your Data Set:SMSSpamCollection
Please Enter the Number of Folds:5


Bernoulli Naive Bayes

In [49]:
clf = BernoulliNB()
clf.fit(returnVec, labels)
scores = cross_val_score(clf, returnVec, labels, cv=n_foldCV)

Gaussian Naive Bayes

In [36]:
clf = GaussianNB()
clf.fit(returnVec, labels)
scores = cross_val_score(clf, returnVec, labels, cv=n_foldCV)

Multinomial Naive Bayes

In [38]:
clf = MultinomialNB()
clf.fit(returnVec, labels)
scores = cross_val_score(clf, returnVec, labels, cv=n_foldCV)
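
The three cells above all assign to the same names, `clf` and `scores`, so only the results of the most recently run cell are available at any time. As a sketch of one alternative, the loop below keeps the fold scores of each model in a dictionary. The tiny data set and the names `models` and `all_scores` are invented for illustration and are not part of the notebook; with real data, `returnVec`, `labels`, and `n_foldCV` would take the place of `X`, `y`, and `3`.

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Made-up binary word vectors (assumption: 4-word vocabulary, 6 rows per class).
spam_rows = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 1]] * 2
ham_rows = [[0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]] * 2
X = spam_rows + ham_rows
y = ['spam'] * 6 + ['ham'] * 6

models = {
    'Bernoulli': BernoulliNB(),
    'Gaussian': GaussianNB(),
    'Multinomial': MultinomialNB(),
}

# One array of per-fold accuracies per model, so no result is overwritten.
all_scores = {}
for name, model in models.items():
    all_scores[name] = cross_val_score(model, X, y, cv=3)

for name in sorted(all_scores):
    print("%s: %s" % (name, all_scores[name]))
```

Note that `cross_val_score` fits its own copies of the model on each training fold, so it does not require a prior call to `fit`.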

## Evaluate a classifier
The following cells output the accuracy score of each fold, followed by the mean accuracy with an approximate 95% interval, computed as plus or minus two standard deviations of the fold scores.
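
To make the summary line concrete, here is a minimal sketch that recomputes it in plain Python from the Bernoulli fold scores printed in this section. The function name `summarize` is hypothetical. It uses the population standard deviation, which is what NumPy's `std()` computes by default. Two standard deviations cover roughly 95% only if the fold scores are approximately normally distributed, which is a loose assumption with five folds.

```python
import math


def summarize(fold_scores):
    """Return (mean, 2 * population standard deviation) of the fold scores."""
    n = len(fold_scores)
    mean = sum(fold_scores) / n
    variance = sum((s - mean) ** 2 for s in fold_scores) / n
    return mean, 2 * math.sqrt(variance)


# Fold scores copied from the Bernoulli Naive Bayes output in this notebook.
fold_scores = [0.98207885, 0.97037702, 0.97217235, 0.97486535, 0.97935368]
mean, spread = summarize(fold_scores)
summary = "Accuracy: %0.2f (+/- %0.2f)" % (mean, spread)
print(summary)  # Accuracy: 0.98 (+/- 0.01)
```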

Bernoulli Naive Bayes

In [50]:
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[ 0.98207885  0.97037702  0.97217235  0.97486535  0.97935368]
Accuracy: 0.98 (+/- 0.01)


Gaussian Naive Bayes

In [37]:
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[ 0.88799283  0.87455197  0.88240575  0.88509874  0.89856373]
Accuracy: 0.89 (+/- 0.02)


Multinomial Naive Bayes

In [39]:
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[ 0.97580645  0.97401434  0.96140036  0.97307002  0.96947935]
Accuracy: 0.97 (+/- 0.01)


## Prediction

In [51]:
testset=raw_input('Please Enter Your SMS:')
testset=textParse(testset)
print(testset)
testset=setOfWords2Vec(vocabList,testset)

Please Enter Your SMS:PRIVATE! Your 2004 Account Statement for 07742676969 shows 786 unredeemed Bonus Points. To claim call 08719180248 Identifier Code: 45239 Expires
['private', 'your', 'account', 'statement', 'for', 'shows', 'unredeemed', 'bonus', 'points', 'claim', 'call', 'identifier', 'code', 'expires']
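
The token list above can be reproduced in isolation with the following sketch. The function name `tokenize` is a hypothetical stand-in for `textParse`; it splits on runs of non-letters with a `+` quantifier so that it behaves the same on Python 2 and Python 3, then lowercases each token and drops tokens of one or two characters. This is why the digits, the punctuation, and the word "To" are missing from the output.

```python
import re


def tokenize(text):
    """Split on runs of non-letters, lowercase, and keep tokens longer than 2 chars."""
    tokens = re.split(r'[^A-Za-z]+', text)
    return [tok.lower() for tok in tokens if len(tok) > 2]


sms = ("PRIVATE! Your 2004 Account Statement for 07742676969 shows 786 "
       "unredeemed Bonus Points. To claim call 08719180248 Identifier Code: "
       "45239 Expires")
tokens = tokenize(sms)
print(tokens)
```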


In [54]:
predictions=clf.predict([testset])
print(predictions)

['spam']
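
scikit-learn's `predict` expects a 2D input with one row per instance, so a single vectorized message has to be wrapped in a list (or reshaped) before it is passed in. The sketch below illustrates this on a made-up two-word vocabulary; the data and the names `toy_clf` and `new_message` are assumptions for illustration only, not part of the notebook.

```python
from sklearn.naive_bayes import BernoulliNB

# Made-up training data: column 0 is a "spammy" word, column 1 a "hammy" word.
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = ['spam', 'spam', 'ham', 'ham']

toy_clf = BernoulliNB()
toy_clf.fit(X, y)

new_message = [1, 0]                          # one vectorized message (1D)
predictions = toy_clf.predict([new_message])  # wrapped into a 1 x 2 input
print(predictions)  # ['spam']
```

The result is an array with one label per input row, which is why the notebook output shows a one-element list.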


