# Naive Bayes

**Note: The following cell contains predefined functions for loading a data set and for building, evaluating, and applying Naive Bayes classifiers. Please make sure you have run this cell before you run the other cells in this notebook.**

In [14]:
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB

def loadDataSet(dataset):
    # Read a comma-separated text file: header row first, class label in the last column.
    with open(dataset) as f:
        data=f.readlines()
        attributes=data[0].rstrip().split(',')[:-1]
        instances=[entry.rstrip().split(',')[:-1] for entry in data[1:]]
        dataArray=[]
        for i in range(len(instances[0])):
            try:
                dataArray.append([float(instance[i]) for instance in instances])
            except ValueError:
                # Non-numeric column: replace strings with integer codes and print the codebook.
                encodedData,codeBook=encode([instance[i] for instance in instances])
                dataArray.append(list(encodedData))
                print('%s :  %s' % (attributes[i], list(codeBook.items())))
        instances=np.array(dataArray).T
        labels=[entry.rstrip().split(',')[-1] for entry in data[1:]]
        return instances,labels

def encode(data):
    # Assign each distinct value an integer code; codeBook maps value -> code.
    codeBook={}
    uniqueVals=list(set(data))
    for idx,Val in enumerate(uniqueVals):
        codeBook[Val]=idx
    encodedData=[codeBook[Val] for Val in data]
    return encodedData,codeBook

def chooseClassifier(choice,instances,labels):
    # Fit one model per selected number: 1=Bernoulli, 2=Gaussian, 3=Multinomial.
    clf=[]
    choice=[c.strip() for c in choice.split(',')]
    if "1" in choice:
        clf_B = BernoulliNB()
        clf_B.fit(instances, labels)
        print('Bernoulli Naive Bayes is used.')
        clf.append(clf_B)
    if "2" in choice:
        clf_G = GaussianNB()
        clf_G.fit(instances, labels)
        print("Gaussian Naive Bayes is used.")
        clf.append(clf_G)
    if "3" in choice:
        clf_M = MultinomialNB()
        clf_M.fit(instances, labels)
        print("Multinomial Naive Bayes is used.")
        clf.append(clf_M)
    if not clf:
        # None of the entered values matched 1, 2 or 3.
        print("Please choose a correct classifier.")
    return clf
    
def evaluateClf(clf,instances,labels,n_foldCV):
    # Cross-validate each model and report the per-fold scores and the mean accuracy.
    for item in clf:
        scores = cross_val_score(item, instances, labels, cv=n_foldCV)
        print("======%s======" % type(item).__name__)
        print(scores)
        print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
            
def predict(clf,testset):
    # Print the label that each fitted model predicts for the given instance(s).
    for item in clf:
        prediction=item.predict(testset)
        print("%s: %s" % (type(item).__name__, prediction))


## Build a classifier
Three Naive Bayes classifiers are provided. To use Bernoulli Naive Bayes, input **1**; for Gaussian Naive Bayes, input **2**; for Multinomial Naive Bayes, input **3**. You can choose several classifiers at once by entering their numbers separated by commas (for example, `1,2,3`).  

(Optional) For those of you interested in how data flows between different functions, here is a description:
* The variable "dataset" stores the name of the text file that you input and is passed as an argument to the function "loadDataSet()".  
* After processing, "loadDataSet()" returns two variables, "instances" and "labels".  
* "instances" stores the feature values of each instance. "labels" stores the labels of all instances.   
* The variable "n_foldCV" stores the number of folds that you input for n-fold cross-validation.
* The variable "choice" stores your choice of classifiers. "instances", "labels" and "choice" are the arguments of the function "chooseClassifier()", which returns the variable "clf".
* The variable "clf" stores a list of up to three Naive Bayes models, each already fitted with "instances" and "labels". Once the models are fitted, they can be used to predict unseen instances.  
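
The data flow described above can also be sketched without the interactive prompts. The following is a minimal illustration, not the notebook's own code: the toy data, the helper name `load_from_text`, and the choice of sorted integer codes are assumptions made for this example, and only Gaussian Naive Bayes is fitted.

```python
import io

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Made-up toy data in the layout the notebook expects:
# a header row, feature columns, and the class label in the last column.
TOY_CSV = """outlook,windy,play
sunny,0,no
sunny,1,no
rainy,0,yes
rainy,1,no
overcast,0,yes
overcast,1,yes
sunny,0,no
rainy,0,yes
"""

def load_from_text(handle):
    """Hypothetical stand-in for loadDataSet() that reads a file-like object."""
    rows = [line.rstrip().split(',') for line in handle if line.strip()]
    header, body = rows[0], rows[1:]
    columns = []
    for i in range(len(header) - 1):
        raw = [row[i] for row in body]
        try:
            columns.append([float(v) for v in raw])
        except ValueError:
            # Non-numeric column: map each distinct string to an integer code.
            codes = {val: idx for idx, val in enumerate(sorted(set(raw)))}
            columns.append([float(codes[v]) for v in raw])
    instances = np.array(columns).T          # one row per instance
    labels = [row[-1] for row in body]       # last column is the class label
    return instances, labels

instances, labels = load_from_text(io.StringIO(TOY_CSV))
clf = GaussianNB().fit(instances, labels)    # the fitted model, as stored in "clf"
scores = cross_val_score(GaussianNB(), instances, labels, cv=2)
print(instances.shape, len(labels), scores.shape)
```

Reading from a file-like object instead of a file name keeps the sketch self-contained; the notebook's `loadDataSet()` opens the named file itself.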

In [15]:
dataset=raw_input("Please Enter Your Data Set:")
n_foldCV=int(raw_input("Please Enter the Number of Folds:"))
choice=raw_input("Please Choose Classifiers:")
instances,labels=loadDataSet(dataset)
clf=chooseClassifier(choice,instances,labels)

Please Enter Your Data Set:lenses.txt
Please Enter the Number of Folds:5
Please Choose Classifiers:1,2,3
raining :  [('pre', 0), ('presbyopic', 1), ('young', 2)]
todays :  [('hyper', 0), ('myope', 1)]
msgs :  [('yes', 0), ('no', 1)]
reaction :  [('reduced', 0), ('normal', 1)]
Bernoulli Naive Bayes is used.
Gaussian Naive Bayes is used.
Multinomial Naive Bayes is used.


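## Evaluate a classifier
The following cell outputs the accuracy score of each fold, followed by the mean accuracy and an interval of plus or minus two standard deviations of the fold scores (roughly a 95% interval if the scores are approximately normally distributed).  
**Note: BernoulliNB is designed for binary-valued features. With scikit-learn's default setting `binarize=0.0`, any feature value greater than 0 is treated as 1, so features with more than two codes lose information.**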
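
On the BernoulliNB point: scikit-learn's `BernoulliNB` has a `binarize` parameter (default `0.0`) that thresholds every feature before fitting and predicting. The sketch below uses made-up data (the arrays `X` and `y` are assumptions for illustration) to show that a model fitted on codes 0, 1 and 2 matches one fitted on an explicitly binarized copy.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up feature matrix whose first column takes the values 0, 1 and 2.
X = np.array([[0, 1], [1, 0], [2, 1], [2, 0], [0, 0], [1, 1]], dtype=float)
y = ['a', 'b', 'b', 'b', 'a', 'b']

# Default binarize=0.0: every value greater than 0 is mapped to 1 before fitting.
clf_raw = BernoulliNB().fit(X, y)

# Fit a second model on an explicitly binarized copy of the same data.
X_bin = (X > 0).astype(float)
clf_bin = BernoulliNB().fit(X_bin, y)

# Compare the learned per-class feature log-probabilities of both models.
same_model = np.allclose(clf_raw.feature_log_prob_, clf_bin.feature_log_prob_)
print(same_model)
```

For an encoded column with three codes, such as the first attribute in the lenses output above, this means codes 1 and 2 become indistinguishable to BernoulliNB.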

In [16]:
evaluateClf(clf,instances,labels,n_foldCV)

[ 1.    0.6   0.8   0.8   0.75]
Accuracy: 0.79 (+/- 0.26)
[ 1.    0.8   0.8   0.8   0.75]
Accuracy: 0.83 (+/- 0.17)
[ 0.6   0.6   0.4   0.6   0.75]
Accuracy: 0.59 (+/- 0.22)
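
The summary line can be reproduced by hand. As a small worked example, the sketch below takes the first row of fold scores printed above and recomputes the mean and twice the standard deviation, using the standard library instead of NumPy.

```python
from statistics import mean, pstdev

# First row of fold scores from the output above.
scores = [1.0, 0.6, 0.8, 0.8, 0.75]

avg = mean(scores)
# pstdev is the population standard deviation, which matches NumPy's default std().
spread = 2 * pstdev(scores)

print("Accuracy: %0.2f (+/- %0.2f)" % (avg, spread))  # prints: Accuracy: 0.79 (+/- 0.26)
```

With only five folds, the interval is a rough indication of variability rather than a precise confidence interval.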




## Predict unseen instances
When you are prompted to input a test instance, please enter one instance at a time and separate the values with commas. Make sure the inputs are numbers, not strings. If your data contains strings, encode them manually according to the codebook printed above.  
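
Manual encoding amounts to a lookup in the printed codebooks. The sketch below is one way to do it, assuming the codebooks from the lenses.txt run shown earlier; the helper `encode_instance` is hypothetical and not part of the notebook.

```python
# Codebooks copied from the output printed when lenses.txt was loaded above.
# The codes may differ between runs, so use the ones your own run printed.
CODEBOOKS = [
    {'pre': 0, 'presbyopic': 1, 'young': 2},
    {'hyper': 0, 'myope': 1},
    {'yes': 0, 'no': 1},
    {'reduced': 0, 'normal': 1},
]

def encode_instance(raw_values, codebooks):
    """Hypothetical helper: translate one row of strings into numeric codes."""
    if len(raw_values) != len(codebooks):
        raise ValueError("expected %d values, got %d" % (len(codebooks), len(raw_values)))
    # Look up each value in the codebook of its own column.
    return [float(book[value]) for value, book in zip(raw_values, codebooks)]

encoded = encode_instance(['young', 'myope', 'yes', 'normal'], CODEBOOKS)
print(','.join(str(int(v)) for v in encoded))  # prints: 2,1,0,1
```

The printed string is in the comma-separated form that the prompt below expects.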

In [17]:
testset=raw_input('Please Enter Your Unseen Instance:')
testset=[float(value) for value in testset.split(',')]
testset=np.array(testset).reshape(1, -1)

Please Enter Your Unseen Instance:2,1,0,1


In [18]:
predict(clf,testset)

BernoulliNB:  ['hard']
GaussianNB:  ['hard']
MultinomialNB: ['no lenses']
