# Understanding Naive Bayes
`Story`

You have been studying machine learning algorithms for a while now, and you want to implement the Naive Bayes algorithm from scratch to study the mathematics behind it and to compare its efficiency with that of the Scikit-Learn models we have.

You know that `classifier systems` are most popular in spam filtering for emails, collaborative filtering for recommendation engines, and sentiment analysis. AI is good at demarcating groups based on patterns in large sets of data.

Naive Bayes classifier is based on Bayes’ theorem and is one of the oldest approaches for classification problems.

`P(A|B) = P(B|A).P(A) / P(B)`


The objective here is to determine the probability of an event A happening given that B has happened.

The naive Bayes classifier combines this probability model with a decision rule, most commonly picking the hypothesis (class) that is most probable given the evidence.

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.

It was initially introduced for text categorisation tasks and is still used as a benchmark.

Over the years there have been many innovations, such as Support Vector Machines and KNN, that solve classification problems with more flexibility. But a Naive Bayes classifier can still be competitive given well pre-processed data, and it has shown good results in medical applications where classification is crucial to diagnosis.


### How Good Is NB Classifier For ML

The first assumption of a Naive Bayes classifier is that the value of a particular feature is independent of the value of any other feature, given the class. This means that the interdependencies within the data are deliberately neglected, hence the name ‘naive’.

A naive Bayes classifier considers every feature to contribute independently to the probability irrespective of the correlations.
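Written out, the independence assumption gives the usual factorised form of the classifier. The symbols below ($c$ for a class, $x_1, \dots, x_n$ for the feature values of one instance, $\hat{c}$ for the predicted class) are notation introduced here for illustration only:

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The denominator $P(x_1, \dots, x_n)$ is the same for every class, so it can be dropped when only the most probable class is needed.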

In many practical applications, the parameters of a naive Bayes model are estimated with the method of maximum likelihood, so the model can be used without adopting any fully Bayesian methods. Such models can be trained very efficiently in a supervised setting.

The `Gaussian Naive Bayes` classifier assumes that the feature values are distributed according to a Gaussian distribution; that is, the likelihood of each feature value given the class is assumed to be Gaussian.


The `Multinomial Naive Bayes` classifier considers feature vectors that represent the frequencies with which certain events have been generated by a multinomial distribution.

In the Bernoulli Naive Bayes approach, by contrast, features are independent booleans (binary variables).

For example, in document classification tasks, Multinomial NB can be used with the number of times a word appears in the document (its frequency), and Bernoulli NB with whether a word appears or not (a binary YES or NO).
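To make the difference between the two feature types concrete, here is a minimal sketch. The document text and the vocabulary are hypothetical and chosen only for illustration:

```python
from collections import Counter

# Hypothetical toy document and vocabulary, chosen only for illustration
document = "free prize free entry claim your free prize"
vocabulary = ["free", "prize", "claim", "meeting"]

counts = Counter(document.split())

# Multinomial-style features: how many times each vocabulary word occurs
multinomial_features = [counts[word] for word in vocabulary]

# Bernoulli-style features: whether each vocabulary word occurs at all
bernoulli_features = [int(counts[word] > 0) for word in vocabulary]

print(multinomial_features)  # [3, 2, 1, 0]
print(bernoulli_features)    # [1, 1, 1, 0]
```

The same document yields two different feature vectors; which one is appropriate depends on whether word frequency or mere presence is expected to carry the signal.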





### Let's take an example to understand Naive Bayes 

Imagine two people, `Alice and Bob`, whose word usage patterns you know. To keep the example simple, let's assume that Alice often uses a combination of three words, [love, great, wonderful], and Bob often uses the words [dog, ball, wonderful].

Let's assume you received an anonymous email whose sender can be either Alice or Bob. Let's say the content of the email is “I love beach sand. Additionally the sunset at beach offers wonderful view”.

#### Can you guess who the sender might be?
Well, if you guessed Alice, you are correct. Perhaps your reasoning was that the content has the words love and wonderful, both of which are used by Alice.

Now let’s add probabilities to the data we have. Suppose Alice and Bob use the following words with the probabilities shown below. Now, can you guess who the sender is for the content “Wonderful Love”?

Now what do you think?

If you guessed Bob, you are correct. If you know the mathematics behind it, good for you. If not, this is where we apply Bayes' theorem.


`P(A|B) = P(B|A).P(A) / P(B)`

**It tells us how often A happens given that B happens, written P(A|B), when we know how often B happens given that A happens, written P(B|A), and how likely A and B are on their own.**

* P(A|B) is the “probability of A given B”: the probability of A given that B happens
* P(A) is the probability of A
* P(B|A) is the “probability of B given A”: the probability of B given that A happens
* P(B) is the probability of B

For example, when P(Fire) means how often there is fire, and P(Smoke) means how often we see smoke, then:

* P(Fire|Smoke) means how often there is fire when we see smoke.
* P(Smoke|Fire) means how often we see smoke when there is fire.

So the formula tells us the “forwards” probability when we know the “backwards” one (or vice versa).

Example: If dangerous fires are rare (1%) but smoke is fairly common (10%) due to factories, and 90% of dangerous fires make smoke then:

$$
P(\text{Fire} \mid \text{Smoke}) = \frac{P(\text{Fire}) \, P(\text{Smoke} \mid \text{Fire})}{P(\text{Smoke})} = \frac{1\% \times 90\%}{10\%} = 9\%
$$

In this case, smoke means a dangerous fire only 9% of the time.
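The same arithmetic can be checked in a few lines of Python. This is a minimal sketch; the helper name `bayes_posterior` is hypothetical and not part of the notebook:

```python
def bayes_posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood_b_given_a * prior_a / evidence_b


p_fire = 0.01              # P(Fire): dangerous fires are rare
p_smoke = 0.10             # P(Smoke): smoke is fairly common
p_smoke_given_fire = 0.90  # P(Smoke|Fire): most dangerous fires make smoke

p_fire_given_smoke = bayes_posterior(p_fire, p_smoke_given_fire, p_smoke)
print(round(p_fire_given_smoke, 4))  # 0.09
```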

# AIM
Now, you want to build both supervised and unsupervised classifiers using Naive Bayes, and use them on a couple of datasets.

`How?`

The first step is to import the necessary libraries; here they are `random` and `itertools` (`permutations`).

In [1]:
# Import
import random
from itertools import permutations

# Function for confusion matrix

`Story`

Now, Mayuri asks you if you know how to define the confusion matrix. You remember having done it before as well.


`How?`

So, you decide to move ahead in the following manner:
* Take the data and the predicted list as arguments.
* Calculate the accuracy of the predicted classes: iterate through the data and check whether the original and predicted classes are the same; if they are, increase the counter by 1.
* Calculate the confusion matrix and print it. Note that you should use a dictionary of dictionaries for this, keyed by actual class and then by predicted class.

In [4]:
def print_matrix(data, predict_list):
    """Print the confusion matrix and the accuracy of the predictions."""
    # Calculate the accuracy of predicted classes
    count = 0
    true_list = []
    for i in range(len(data)):
        # The true class is stored in the last column of each row
        true_list.append(data[i][-1])
        if data[i][-1] == predict_list[i]:
            count += 1

    # Build a nested dictionary: class_dict[actual][predicted] -> count
    classes = list(dict.fromkeys(line[-1] for line in data))  # unique, first-seen order
    class_dict = {actual: {predicted: 0 for predicted in classes} for actual in classes}

    for i in range(len(true_list)):
        class_dict[true_list[i]][predict_list[i]] += 1

    # Print the header row, then one row per actual class
    # (the backslash is escaped so Python does not treat \P as an escape sequence)
    print('{0:>20s}|'.format('Actual\\Predict'), end='')
    for k in class_dict:
        print('{0:>20s}|'.format(k), end='')
    print()
    for k, v in class_dict.items():
        print('{0:>20s}|'.format(k), end='')
        for values in v.values():
            print('{0:>20d}|'.format(values), end='')
        print()
    print("\nAccuracy: {0}\n".format(count / len(data)))

`Interpretation`: The function above calculates and prints the confusion matrix, followed by the accuracy.
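To see the nested-dictionary idea in isolation, here is a minimal sketch on a handful of made-up labels. The label lists are hypothetical and not taken from either dataset:

```python
# Hypothetical true and predicted labels, made up for illustration
true_labels = ["spam", "ham", "spam", "ham", "ham"]
predicted_labels = ["spam", "ham", "ham", "ham", "spam"]

classes = sorted(set(true_labels))

# confusion[actual][predicted] -> number of instances
confusion = {actual: {pred: 0 for pred in classes} for actual in classes}
for actual, pred in zip(true_labels, predicted_labels):
    confusion[actual][pred] += 1

# Accuracy is the share of instances on the diagonal
correct = sum(confusion[c][c] for c in classes)
accuracy = correct / len(true_labels)

print(confusion)
print(accuracy)  # 0.6
```

Each row of the matrix is an actual class and each column a predicted class, so everything off the diagonal is a misclassification.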

# Loading csv file

`Story`:
    
Now, Mayuri asks you if you have opened csv files without pandas before, and you remember doing it for your mathematical models before as well.

`How?`
So, it is not much work for her to brief you on opening a csv using file handling in Python.
You create an empty list and append the data to it by iterating through the lines using the `readlines()` function.
To strip the surrounding whitespace (including the trailing newline), you use the built-in `.strip()` method.

In [5]:
# This function should open a data file in csv, and transform it into a usable format 

file_list = [ 'car.csv', 'hypothyroid.csv']

def preprocess(filename):
    """Read csv file from input filename

    Parameters
    ----------
    filename: name of csv file

    Returns
    -------
    data: 2-D array of data from csv file
    """
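    data = []

    # The with-block closes the file even if reading fails part-way
    with open(filename, 'r') as file:
        for line in file.readlines():
            line = line.strip()
            # Skip blank lines so they do not become empty rows
            if line:
                data.append(line.split(','))

    return data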

`Interpretation`: This function reads the csv file and returns its rows as a list of lists (a 2-D array); no pandas dataframe is involved.
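As an alternative to splitting on commas by hand, Python's standard `csv` module can produce the same list-of-lists shape and also copes with quoted fields. The sketch below is not the notebook's method; it writes a tiny hypothetical file to a temporary location so that it is self-contained:

```python
import csv
import os
import tempfile

# Write a tiny hypothetical CSV so the example is self-contained
rows_written = [["low", "small", "unacc"], ["high", "big", "acc"]]
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    csv.writer(tmp).writerows(rows_written)
    path = tmp.name

# Read it back as a list of lists, one inner list per non-empty row
with open(path, newline="") as handle:
    rows_read = [row for row in csv.reader(handle) if row]

os.remove(path)
print(rows_read)
```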

# Supervised Naive Bayes model

# Prior and Posterior Probability

`Story`

Mayuri now tells you that you can build both supervised as well as unsupervised Naive Bayes classifiers.

Within the field of machine learning, there are two main types of tasks: `supervised, and unsupervised`. The main difference between the two types is that supervised learning is done using a ground truth, or in other words, we have prior knowledge of what the output values for our samples should be. Therefore, the goal of supervised learning is to learn a function that, given a sample of data and desired outputs, best approximates the relationship between input and output observable in the data. Unsupervised learning, on the other hand, does not have labeled outputs, so its goal is to infer the natural structure present within a set of data points.

Supervised learning is typically done in the context of classification, when we want to map input to output labels, or regression, when we want to map input to a continuous output. Common algorithms in supervised learning include logistic regression, naive bayes, support vector machines, artificial neural networks, and random forests. In both regression and classification, the goal is to find specific relationships or structure in the input data that allow us to effectively produce correct output data. Note that “correct” output is determined entirely from the training data, so while we do have a ground truth that our model will assume is true, it is not to say that data labels are always correct in real-world situations. Noisy, or incorrect, data labels will clearly reduce the effectiveness of your model.

The most common tasks within unsupervised learning are clustering, representation learning, and density estimation. In all of these cases, we wish to learn the inherent structure of our data without using explicitly-provided labels. Some common algorithms include k-means clustering, principal component analysis, and autoencoders. Since no labels are provided, there is no specific way to compare model performance in most unsupervised learning methods.


Unsupervised learning is very useful in exploratory analysis because it can automatically identify structure in data. For example, if an analyst were trying to segment consumers, unsupervised clustering methods would be a great starting point for their analysis. In situations where it is either impossible or impractical for a human to propose trends in the data, unsupervised learning can provide initial insights that can then be used to test individual hypotheses.


Naive Bayes can be built for both supervised and unsupervised learning; however, it is mostly used for supervised classification.

So, we will build a standard `supervised Naive Bayes model`.




`How?`

We begin by taking as input the data that we processed with the earlier functions. The data is a 2-D array read from the csv file.

The training function returns the prior and posterior probabilities, which are used by later functions to make predictions with supervised NB.

She now explains to you what prior and posterior probabilities are.

`Story`


### `What is the prior?`

A prior is a probability that expresses one's beliefs about a quantity before some evidence is taken into account. In statistical inference and Bayesian techniques, priors play an important role in shaping the resulting posterior.


`Black Swan Paradox`

Theory of black swan events is a metaphor that describes an event that comes as a surprise, has a major effect, and is often inappropriately rationalized after the fact with the benefit of hindsight. The term is based on an ancient saying that presumed black swans did not exist – a saying that became reinterpreted to teach a different lesson after black swans were discovered in the wild. 

Such events have a huge impact when training Bayesian classifiers, especially naive Bayes, where a single zero makes the whole product of probabilities turn 0. To avoid blindly rejecting a data point, we use the prior probability to move ahead.

The prior probability of an event will be revised as new data or information becomes available, to produce a more accurate measure of a potential outcome. That revised probability becomes the posterior probability and is calculated using Bayes' theorem. In statistical terms, the posterior probability is the probability of event A occurring given that event B has occurred.

`For example`, three acres of land have the labels A, B, and C. One acre has reserves of oil below its surface, while the other two do not. The prior probability of oil being found on acre C is one third, or 0.333. But if a drilling test is conducted on acre B, and the results indicate that no oil is present at the location, then the posterior probability of oil being found on each of acres A and C becomes 0.5, as each acre has one out of two chances.
    
   


### `Posterior probability` 
It is the probability that an event will happen after all evidence or background information has been taken into account. It is closely related to prior probability, which is the probability that an event will happen before you take any new evidence into account. You can think of posterior probability as an adjustment of the prior probability:

* Posterior probability = prior probability updated with new evidence (called likelihood). The "+" often written here is informal; formally, the posterior is proportional to prior × likelihood.

A posterior probability is the probability of assigning observations to groups given the data. A prior probability is the probability that an observation will fall into a group before you collect the data. For example, if you are classifying the buyers of a specific car, you might already know that 60% of purchasers are male and 40% are female. If you know or can estimate these probabilities, a discriminant analysis can use these prior probabilities in calculating the posterior probabilities. When you don't specify prior probabilities, Minitab assumes that the groups are equally likely.

With the assumption that the data have a normal distribution, the linear discriminant function is increased by $\ln(p_i)$, where $p_i$ is the prior probability of group $i$. Because observations are assigned to groups by the smallest generalized distance, or equivalently the largest linear discriminant function, the effect is to increase the posterior probabilities for a group with a high prior probability.


`What is a Posterior Distribution?`

Mayuri goes a step further to explain to you that the posterior distribution is a way to summarize what we know about uncertain quantities in Bayesian analysis. It is a combination of the prior distribution and the likelihood function, which tells you what information is contained in your observed data (the “new evidence”). In other words, the posterior distribution summarizes what you know after the data has been observed. The summary of the evidence from the new observations is the likelihood function.

Posterior Distribution ∝ Prior Distribution × Likelihood Function (“new evidence”)

`Example of a Posterior Probability`

As a simple example to envision posterior probability, suppose there are three acres of land with labels A, B and C. One acre has reserves of oil below its surface, while the other two do not. The prior probability of oil in acre C is one-third or 33%. A drilling test is conducted on acre B, and the results indicate that no oil is present at the location. With acre B eliminated, the posterior probability of acre C containing oil becomes 0.5, or 50%. 
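The oil example can be worked through with Bayes' theorem directly. The sketch below assumes, as the example implicitly does, that the drilling test is perfectly reliable; the dictionary names are hypothetical:

```python
# Uniform prior: exactly one of the three acres holds oil
prior = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

# Likelihood of the evidence "the test on acre B finds no oil"
# under each hypothesis about where the oil is (assumes a perfect test)
likelihood = {"A": 1.0, "B": 0.0, "C": 1.0}

# Bayes' theorem: posterior is prior * likelihood, renormalised
unnormalised = {acre: prior[acre] * likelihood[acre] for acre in prior}
evidence = sum(unnormalised.values())
posterior = {acre: value / evidence for acre, value in unnormalised.items()}

print(posterior)  # {'A': 0.5, 'B': 0.0, 'C': 0.5}
```

The prior of one third for acre C is multiplied by the likelihood and divided by the total probability of the evidence, which yields the 0.5 quoted in the text.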

In [6]:
# This function should build a supervised NB model
def train_supervised(data):
    """Build a supervised Naive Bayes model

    Parameters
    ----------
    data: 2-D array of data from csv file

    Returns
    -------
    prior_dict: dictionary of prior probability
    poste_dict: 3-D dictionary of posterior probability
    """
    # Prior probability P(class)
    prior_dict = {}
    # Probability of each attribute value within each class
    # (called the posterior probability throughout this notebook)
    poste_dict = {}

    # Count how often each class (last column) occurs
    for line in data:
        clas = line[-1]
        prior_dict[clas] = prior_dict.get(clas, 0) + 1

    # Divide by the number of instances to get the prior probability
    for key, value in prior_dict.items():
        prior_dict[key] = value / len(data)

    # For every attribute column, count each value within each class
    for att in range(len(data[0]) - 1):
        clas_dict = {}
        for line in data:
            clas = line[-1]
            # Create the inner dictionary the first time a class is seen
            if clas not in clas_dict:
                clas_dict[clas] = {}
            clas_dict[clas][line[att]] = clas_dict[clas].get(line[att], 0) + 1

        # Normalise the counts within each class so that they sum to 1
        for cla in prior_dict:
            sum_value = sum(clas_dict[cla].values())
            for key, value in clas_dict[cla].items():
                clas_dict[cla][key] = value / sum_value

        poste_dict[att] = clas_dict

    return prior_dict, poste_dict

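`Interpretation`: This function returns `prior_dict`, a dictionary of prior probabilities, and `poste_dict`, a 3-D (nested) dictionary indexed by attribute, class and value. Although the notebook calls the entries of `poste_dict` posterior probabilities, strictly they are the conditional probabilities P(value | class) that Bayes' theorem combines with the prior.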
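The counting behind these two dictionaries can be seen on a toy table. This is a minimal sketch using `collections.Counter`; the rows and column meanings are hypothetical and only the first attribute is processed:

```python
from collections import Counter, defaultdict

# Hypothetical toy data: each row is [outlook, windy, class]
toy_data = [
    ["sunny", "no", "play"],
    ["sunny", "yes", "stay"],
    ["rainy", "yes", "stay"],
    ["sunny", "no", "play"],
]

# Prior: relative frequency of each class label (last column)
class_counts = Counter(row[-1] for row in toy_data)
priors = {label: n / len(toy_data) for label, n in class_counts.items()}

# Conditional: P(value | class) for attribute 0, from per-class counts
value_counts = defaultdict(Counter)
for row in toy_data:
    value_counts[row[-1]][row[0]] += 1
conditionals = {
    label: {value: n / class_counts[label] for value, n in counter.items()}
    for label, counter in value_counts.items()
}

print(priors)        # {'play': 0.5, 'stay': 0.5}
print(conditionals)  # {'play': {'sunny': 1.0}, 'stay': {'sunny': 0.5, 'rainy': 0.5}}
```

Note that "rainy" never occurs with class "play", so it has no entry at all for that class. That gap is what smoothing has to handle at prediction time.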

`Story`

Here, Mayuri asks you to take the data, the prior dictionary and the posterior dictionary as arguments and build the prediction function.

* Create a list to store the predicted values.
* Find the number of rows in the data.
* Iterate through the data, and for each row iterate through the keys and values of `prior_dict`.
* Ignore missing values marked with `?`.
* Check whether the attribute value has a corresponding posterior probability.
* Apply epsilon smoothing if no value exists.
* Append the class with the highest probability.
* Return the predicted list.


`Hint`: Epsilon smoothing replaces a zero probability with a trivially small non-zero number, typically called ε, with ε < 1/n for n instances. Because ε is so small, comparing classes largely comes down to comparing how many ε factors each one picks up.
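A minimal sketch of epsilon smoothing for a single attribute is shown below. The probabilities are hypothetical (consistent with five made-up instances), and ε = 0.01 / n is simply one choice that satisfies ε < 1/n:

```python
# Hypothetical conditional probabilities P(value | class) for one attribute
conditional = {
    "play": {"sunny": 0.75, "rainy": 0.25},
    "stay": {"rainy": 1.0},  # "sunny" was never seen with class "stay"
}
priors = {"play": 0.8, "stay": 0.2}
n_instances = 5

# Epsilon is kept below 1/n so it never outweighs a real observation
epsilon = 0.01 / n_instances


def score(label, value):
    """Prior times conditional, falling back to epsilon for unseen values."""
    return priors[label] * conditional[label].get(value, epsilon)


scores = {label: score(label, "sunny") for label in priors}
prediction = max(scores, key=scores.get)

print(scores)
print(prediction)  # play
```

Without the fallback, the score for "stay" would be exactly zero no matter how strong the other evidence for that class was.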

In [7]:
# This function should predict the class for a set of instances, based on a trained model 
def predict_supervised(data, prior_dict, poste_dict):
    """Predict the class based on a trained model

    Parameters
    ----------
    data: 2-D array of data from csv file (class label in the last column)
    prior_dict: dictionary of prior probability
    poste_dict: 3-D dictionary of posterior probability

    Returns
    -------
    predict_list: list of predicted classes
    """
    predict_list = []
    # Epsilon for smoothing: a small value below 1/n for n instances
    epsilon = 0.01 / len(data)

    for line in data:
        predict_dict = {}
        for key, value in prior_dict.items():
            # Start from the prior probability of the class
            predict_dict[key] = value
            for index in range(len(line)-1):
                att = line[index]
                # Ignore missing values marked with ?
                if att != '?':
                    # Use the stored probability if this value was seen with the class
                    if att in poste_dict[index][key]:
                        predict_dict[key] *= poste_dict[index][key][att]
                    else:
                        # Epsilon smoothing if the value was never seen with the class
                        predict_dict[key] *= epsilon
        # Append the class with the highest probability
        predict_list.append(max(predict_dict, key=predict_dict.get))

    return predict_list

`Interpretation`: This function makes predictions with the trained model and returns the predicted list.

`Story`

Now that everything has been done, Mayuri asks you to make a function that takes the data list, the filename and the predicted list as arguments and prints the confusion matrix for them.

In [8]:
# This function should evaluate a set of predictions, in a supervised context
def evaluate_supervised(data, filename, predict_list):
    """Evaluate the predictions for supervised NB

    Parameters
    ----------
    data: 2-D array of data from csv file
    filename: name of csv file, used only for the banner line
    predict_list: list of predicted classes
    """
    # Banner: the file name without its extension, centred in a row of asterisks
    print('{0:*^105}'.format('supervised' + ' ' + filename.split('.')[0]))
    # Print confusion matrix
    print_matrix(data, predict_list)

`Interpretation` This function prints the confusion matrix for the given file.

`Story`

Now, finally, Mayuri asks you to make the main function that calls all the required functions.

`How`?

This takes the list of files as its argument (remember the file list we created earlier?).
Iterating through each file, it calls the `preprocess` function, the `train_supervised` function, the `predict_supervised` function and the `evaluate_supervised` function.

At last, the main function is called to run the evaluation.

In [9]:
# Main function for supervised NB
def supervised(file_list):

    for filename in file_list:
        data = preprocess(filename)
        prior_dict, poste_dict = train_supervised(data)
        # Note: predictions are made on the same data the model was trained on
        predict_list = predict_supervised(data, prior_dict, poste_dict)
        evaluate_supervised(data, filename, predict_list)

# Running the next line prints the confusion matrix for every dataset in file_list
supervised(file_list)

*********************************************supervised car**********************************************
      Actual\Predict|               unacc|                 acc|               vgood|                good|
               unacc|                1161|                  47|                   0|                   2|
                 acc|                  85|                 289|                   0|                  10|
               vgood|                   0|                  26|                  39|                   0|
                good|                   0|                  46|                   2|                  21|

Accuracy: 0.8738425925925926

*****************************************supervised hypothyroid******************************************
      Actual\Predict|         hypothyroid|            negative|
         hypothyroid|                   0|                 151|
            negative|                   0|                3012|

Accuracy: 0.9522605121719886
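Accuracy alone can hide how a model treats the smaller class. The sketch below takes the counts from the hypothyroid matrix printed above and derives per-class recall and the share of the majority class; the variable names are hypothetical:

```python
# Counts copied from the hypothyroid confusion matrix printed above:
# confusion[actual][predicted]
confusion = {
    "hypothyroid": {"hypothyroid": 0, "negative": 151},
    "negative": {"hypothyroid": 0, "negative": 3012},
}

total = sum(sum(row.values()) for row in confusion.values())
accuracy = sum(confusion[c][c] for c in confusion) / total

# Recall per class: correctly predicted instances / actual instances
recall = {c: confusion[c][c] / sum(confusion[c].values()) for c in confusion}

# Share of the majority class, i.e. the accuracy of always guessing it
majority_share = max(sum(row.values()) for row in confusion.values()) / total

print(accuracy)
print(recall)
print(majority_share)
```

Here the accuracy equals the majority-class share and the recall for `hypothyroid` is zero, because every instance was predicted as `negative`.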



# `Conclusion`:

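We can conclude that
* The model reaches high accuracy on both files (about 87% on car and 95% on hypothyroid), but these figures are measured on the same data the model was trained on. On hypothyroid every instance is predicted as negative, so the accuracy only mirrors the share of the majority class and no hypothyroid case is detected.
* Naive Bayes primarily requires an understanding of the prior and posterior probabilities.
* Naive Bayes is most commonly used as a supervised learning model.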