# A Notebook to Use Naïve Bayes Classifiers

This notebook shows how to train a Naïve Bayes classifier and use it to classify unseen instances.

For those of you interested in understanding the code, it uses predefined functions from the [sklearn](http://scikit-learn.org) machine learning library. A few more details about the code:  
* The variable "dataset" stores the name of the text file that you input; it is passed as the argument of the function "loadDataSet()".  
* After processing, the loadDataSet function returns two variables, "instances" and "labels".  
* "instances" stores the feature values of each instance. "labels" stores the labels of all instances.   
* The variable "n_foldCV" stores the number of folds that you enter for n-fold cross-validation.
* The variable "choice" stores your choice of classifiers. "instances", "labels" and "choice" are the arguments of the function "chooseClassifier", which returns the variable "clf".
* The variable "clf" is a list that stores up to three Naïve Bayes models, each fitted with "instances" and "labels". Once the models are fitted, they can be used to predict unseen instances. 
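The file layout that the loader relies on (a header row, comma-separated values, and the class label in the last column) can be sketched in a few lines. This is only an illustration: the file contents and column names below are made up, and the parsing is a simplified stand-in for "loadDataSet", not a replacement for it.

```python
import io

# Hypothetical three-row file in the layout the loader expects:
# a header row, comma-separated values, class label in the last column.
sample = io.StringIO(
    "sepal_length,sepal_width,species\n"
    "5.1,3.5,setosa\n"
    "6.2,2.9,versicolor\n"
    "5.9,3.0,virginica\n"
)

lines = sample.readlines()
header = lines[0].rstrip().split(',')
rows = [line.rstrip().split(',') for line in lines[1:]]

attributes = header[:-1]                                    # all columns but the last
instances = [[float(v) for v in row[:-1]] for row in rows]  # feature values
labels = [row[-1] for row in rows]                          # last column is the class

print(attributes)
print(instances)
print(labels)
```

Note that the first line is always treated as a header, so a data file without one would lose its first instance.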

In [13]:
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB

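def loadDataSet(dataset): 
    # Expects a comma-separated text file with a header row;
    # the last column of every row is the class label.
    with open(dataset) as f:
        data=f.readlines()
        attributes=data[0].rstrip().split(',')[:-1]
        instances=[entry.rstrip().split(',')[:-1] for entry in data[1:]]
        dataArray=[]
        for i in range(len(instances[0])):
            try:
                # Numeric column: convert every value to float.
                dataArray.append([float(instance[i]) for instance in instances])
            except ValueError:
                # Non-numeric column: replace each string with an integer code.
                encodedData,codeBook=encode([instance[i] for instance in instances])
                dataArray.append(encodedData)
                print(attributes[i],': ',list(codeBook.items()))
        # Transpose so that rows are instances and columns are attributes.
        instances=np.array(dataArray).T
        labels=[entry.rstrip().split(',')[-1] for entry in data[1:]]
        return instances,labels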

def encode(data):
    # Map each distinct value to an integer code. Sorting keeps the
    # codes the same from one run to the next.
    uniqueVals=sorted(set(data))
    codeBook={val:code for code,val in enumerate(uniqueVals)}
    encodedData=[codeBook[val] for val in data]
    return encodedData,codeBook

def chooseClassifier(choice,instances,labels):
    clf=[]
    # Accept several comma-separated choices, e.g. "1,3".
    choice=[c.strip() for c in choice.split(',')]
    if "1" in choice:
        clf_B = BernoulliNB()
        clf_B.fit(instances, labels)
        print('Bernoulli Naive Bayes is used.')
        clf.append(clf_B)
    if "2" in choice:
        clf_G = GaussianNB()
        clf_G.fit(instances, labels)
        print("Gaussian Naive Bayes is used.")
        clf.append(clf_G)
    if "3" in choice:
        clf_M = MultinomialNB()
        clf_M.fit(instances, labels)
        print("Multinomial Naive Bayes is used.")
        clf.append(clf_M)
    if not clf:
        # None of the entries matched 1, 2 or 3.
        print("Please choose a correct classifier.")
    return clf
    
def evaluateClf(clf,instances,labels,n_foldCV):
    for item in clf:
        # One accuracy score per cross-validation fold.
        scores = cross_val_score(item, instances, labels, cv=n_foldCV)
        print("======%s======" % type(item).__name__)
        print(scores)
        # Mean accuracy plus or minus two standard deviations.
        print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
            
def predict(clf,testset):
    for item in clf:
        # Label each prediction with the class name of the model.
        prediction=item.predict(testset)
        print("%s: " % type(item).__name__,prediction)

## Training: Build a Naïve Bayes Classifier
The cell below asks for a dataset and trains a Naïve Bayes classifier on it. Three Naïve Bayes classifiers are provided. They are based on different mathematical foundations (Bernoulli models binary attributes, Gaussian models continuous attributes with a normal distribution, and Multinomial models counts), so they may perform differently on different datasets.  
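To see how much the choice of model can matter, the three classifiers can be cross-validated on the same data. The sketch below is an illustration under stated assumptions: it uses the copy of the iris data bundled with sklearn (loaded with `load_iris`) rather than the notebook's "iris.data" file, and the 5 folds are an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Assumption: sklearn's bundled iris data stands in for "iris.data".
X, y = load_iris(return_X_y=True)

models = {
    "BernoulliNB": BernoulliNB(),
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
}

mean_accuracy = {}
for name, model in models.items():
    # One accuracy score per fold; keep the mean for comparison.
    scores = cross_val_score(model, X, y, cv=5)
    mean_accuracy[name] = scores.mean()
    print("%s: %0.2f" % (name, mean_accuracy[name]))
```

All four iris attributes are positive measurements, so BernoulliNB's default binarization maps every value to 1 and the model cannot tell the species apart, whereas GaussianNB models the continuous values directly.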

If you want to use Bernoulli Naive Bayes, input **1**. For Gaussian Naive Bayes, input **2**. For Multinomial Naive Bayes, input **3**. You can choose multiple classifiers at the same time: input the numbers and separate them with commas.  
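One way to interpret such a comma-separated choice is sketched below. The helper `parse_choice` and the `CLASSIFIER_NAMES` table are hypothetical names introduced for this illustration; they are not part of the notebook's code.

```python
# Hypothetical mapping from menu numbers to classifier names.
CLASSIFIER_NAMES = {
    "1": "BernoulliNB",
    "2": "GaussianNB",
    "3": "MultinomialNB",
}

def parse_choice(choice):
    """Split a comma-separated choice string into classifier names.

    Unknown entries are ignored and duplicates are kept only once.
    """
    selected = []
    for token in choice.split(','):
        name = CLASSIFIER_NAMES.get(token.strip())  # strip tolerates "1, 3"
        if name is not None and name not in selected:
            selected.append(name)
    return selected

print(parse_choice("1,3"))
print(parse_choice("2, 2, 7"))
```

Stripping whitespace from each entry means that "1, 3" and "1,3" are treated the same way.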

We provide three classification datasets that can be used with the Naïve Bayes algorithms. 
* ["iris.data"](https://archive.ics.uci.edu/ml/datasets/iris) has four attributes with continuous values describing three different iris species.
* ["lenses.txt"](https://archive.ics.uci.edu/ml/datasets/lenses) contains four attributes with discrete values and three classes.
* ["SMSSpamCollection.txt"](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) includes 5572 SMS messages collected from four different research sources, each labeled as spam or ham. "testset_SMS" is the test set: it contains two randomly chosen SMS messages that were removed from the SMSSpamCollection dataset to prevent dataset contamination. Before you fit this dataset to any ML algorithm, please use the text featurization notebook to vectorize the text. When you want to run the prediction function, open the test set and copy and paste its text (without the brackets) into the Jupyter notebook.  

**BernoulliNB is designed for binary-valued attributes. By default, sklearn binarizes every attribute at a threshold of 0, so the classifier will not perform well if the attributes are not binary-valued.**
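The effect of that default can be checked directly. In the sketch below the word counts and labels are made up for illustration; `binarize` is a real parameter of sklearn's `BernoulliNB`, with a default of 0.0, and `binarize=None` tells the model that the input is already binary.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up word counts for four messages, with their labels.
counts = np.array([[3, 0, 1],
                   [0, 2, 0],
                   [5, 0, 0],
                   [0, 1, 4]])
labels = ["spam", "ham", "spam", "ham"]

# Default binarize=0.0: every value above 0 is treated as 1.
clf_counts = BernoulliNB().fit(counts, labels)

# The same model fitted on explicitly binarized data, with binarization off.
binary = (counts > 0).astype(int)
clf_binary = BernoulliNB(binarize=None).fit(binary, labels)

# Both models learn identical per-feature log probabilities.
same = np.allclose(clf_counts.feature_log_prob_, clf_binary.feature_log_prob_)
print(same)
```

In other words, the size of a count is discarded: a word that appears five times is treated exactly like a word that appears once.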

In [19]:
dataset=input("Please Enter Your Data Set:")
choice=input("Please Choose Classifiers:")
instances,labels=loadDataSet(dataset)
clf=chooseClassifier(choice,instances,labels)

Please Enter Your Data Set:./Dataset/SMSSpamCollection_Vectorized.txt
Please Choose Classifiers:1
Bernoulli Naive Bayes is used.


## Predict Unseen Instances
When you are prompted to input a test instance, please enter one instance at a time and separate its values with commas. Make sure that the inputs are numbers rather than strings; if an attribute has string values, encode them manually according to the codebook printed above.  
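Manual encoding with a codebook amounts to a dictionary lookup per attribute. The sketch below illustrates the idea; the attribute names, the codebooks and the helper `encode_instance` are all hypothetical, so substitute the codebook that the loader actually printed for your dataset.

```python
# Hypothetical codebooks: attribute name -> {string value: integer code}.
codebooks = {
    "age": {"pre-presbyopic": 0, "presbyopic": 1, "young": 2},
    "astigmatic": {"no": 0, "yes": 1},
}
# Hypothetical attribute order; "tear_rate" is already numeric.
attributes = ["age", "astigmatic", "tear_rate"]

def encode_instance(raw_values):
    """Turn one raw instance into the comma-separated numbers to type in."""
    encoded = []
    for name, value in zip(attributes, raw_values):
        if name in codebooks:
            encoded.append(float(codebooks[name][value]))  # look up the code
        else:
            encoded.append(float(value))                   # already numeric
    return ",".join(str(v) for v in encoded)

print(encode_instance(["young", "yes", "1.5"]))
```

The resulting string can be pasted at the prompt in the next cell.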

In [20]:
testset=input('Please Enter Your Unseen Instance:')
testset=list(map(float,testset.split(',')))
testset=np.array(testset).reshape(1, -1)

Please Enter Your Unseen Instance:0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [21]:
predict(clf,testset)

BernoulliNB:  ['spam']


## Evaluate a Classifier
The following cell outputs the accuracy score of each cross-validation fold, followed by the mean accuracy plus or minus two standard deviations, which approximates a 95% confidence interval for the model's accuracy.  

In [22]:
dataset=input('Please Enter Your Test Data:')
choice=input("Please Choose Classifiers:")
n_foldCV=int(input("Please Enter the Number of Folds:"))
instances,labels=loadDataSet(dataset)
clf=chooseClassifier(choice,instances,labels)
evaluateClf(clf,instances,labels,n_foldCV)

Please Enter Your Test Data:./Dataset/SMSSpamCollection_Vectorized.txt
Please Choose Classifiers:1
Please Enter the Number of Folds:10
Bernoulli Naive Bayes is used.
[ 0.9874552   0.98028674  0.96057348  0.97849462  0.97491039  0.96768402
  0.97661871  0.97841727  0.97661871  0.98381295]
Accuracy: 0.98 (+/- 0.01)
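The summary line can be reproduced by hand from the ten fold scores printed above: it is the mean of the scores, plus or minus twice their standard deviation. The sketch below uses NumPy's default (population) standard deviation, which is also what `scores.std()` computes in the evaluation function.

```python
import numpy as np

# The ten fold scores printed above.
scores = np.array([0.9874552, 0.98028674, 0.96057348, 0.97849462, 0.97491039,
                   0.96768402, 0.97661871, 0.97841727, 0.97661871, 0.98381295])

mean = scores.mean()
spread = 2 * scores.std()   # two (population) standard deviations

summary = "Accuracy: %0.2f (+/- %0.2f)" % (mean, spread)
print(summary)
```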
