# Naive Bayes


The **Naïve Bayes classifier** is a conditional probability model combined with the **maximum a posteriori (MAP) decision rule**: it assigns the class label $\hat{y} = C_i$ to an observation $X$ if $p(C_i|X)$ is the highest among all $K$ classes.

**Assumption**: 

- The predictors are conditionally independent of one another given the class.
- In other words, the presence of a particular feature in a class is unrelated to the presence of any other feature.


1. The quantity we want to derive is $p(C_k|X) = p(C_k|x_1, x_2, ..., x_n)$, the posterior probability of class $C_k$ (the target) given the predictor vector $X = (x_1, x_2, ..., x_n)$. According to Bayes' theorem, the posterior probability can be written as:

\begin{align}
p(C_k|X) = \frac{p(X|C_k)p(C_k)}{p(X)} 
\end{align}


2. The denominator $p(X)$ does not depend on the class $C_k$, and the feature values $X$ are given, so the denominator is effectively constant. Therefore,

\begin{align}
p(C_k|X) \propto p(X|C_k)p(C_k)
\end{align}
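
As a small numeric check of the proportionality above, the sketch below uses made-up priors and likelihoods (the class names and numbers are hypothetical, chosen only for illustration). It shows that dividing by $p(X)$ rescales the scores without changing which class has the largest one.

```python
# Hypothetical priors p(C_k) and likelihoods p(X | C_k) for two classes.
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {"spam": 0.02, "ham": 0.005}

# Unnormalized scores: p(X | C_k) * p(C_k)
scores = {c: likelihoods[c] * priors[c] for c in priors}

# The evidence p(X) is the sum of the unnormalized scores over all classes.
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

# The winning class is the same with or without normalization.
best_unnormalized = max(scores, key=scores.get)
best_normalized = max(posteriors, key=posteriors.get)
print(best_unnormalized, best_normalized)
```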


3. If we insert $X = (x_1, x_2, ..., x_n)$ into the formula and apply the **chain rule**:

\begin{align}
p(C_k|x_1, x_2, ..., x_n) &\propto p(x_1, x_2, ..., x_n|C_k)p(C_k) \\
&= p(x_1, x_2, ..., x_n,C_k)\\
&= p(x_1| x_2, ..., x_n,C_k)p(x_2, x_3, ..., x_n, C_k) \\
&= p(x_1| x_2, ..., x_n,C_k)p(x_2| x_3, ..., x_n, C_k)p(x_3, ..., x_n, C_k) \\
&= ... \\
&= p(x_1| x_2, ..., x_n,C_k)p(x_2| x_3, ..., x_n, C_k)p(x_3|x_4, ..., x_n, C_k) ... p(x_{n-1}|x_n, C_k)p(x_n|C_k)p(C_k)
\end{align}


4. Now the "naïve" conditional independence assumption comes into play: because all features in $X$ are assumed to be mutually independent given the class $C_k$,

\begin{align}
p(x_i|x_{i+1}, ..., x_n,C_k) = p(x_i|C_k)
\end{align}


5. Thus, the posterior probability can be expressed as

\begin{align}
p(C_k|x_1, x_2, ..., x_n) &\propto p(x_1, x_2, ..., x_n,C_k) \\
&=p(x_1| x_2, ..., x_n,C_k)p(x_2| x_3, ..., x_n, C_k)p(x_3|x_4, ..., x_n, C_k) ... p(x_{n-1}|x_n, C_k)p(x_n|C_k)p(C_k) \\
&=p(x_1|C_k)p(x_2|C_k)p(x_3|C_k) ... p(x_n|C_k)p(C_k) \\
&=p(C_k)\prod_{i=1}^np(x_i|C_k)
\end{align}

6. Construct a classifier

\begin{align}
\hat{y} = \underset{k \in \{1, \ldots, K\}}{\operatorname{argmax}} \ p(C_k) \displaystyle\prod_{i=1}^n p(x_i \mid C_k) &&&& \text{where $p(C_k)$ is the probability of each class in the dataset}
\end{align}
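
The decision rule above can be sketched in a few lines. The function name `predict`, the table layout, and all the numbers below are assumptions made for illustration. The sketch also compares sums of logarithms instead of raw products; this is a common variant rather than part of the derivation above, and it leaves the argmax unchanged while avoiding underflow when there are many features.

```python
import math


def predict(priors, cond_probs, x):
    """Return the class maximizing log p(C_k) + sum_i log p(x_i | C_k).

    priors:     {class: p(C_k)}
    cond_probs: {class: [ {value: p(x_i = value | C_k)} for each feature i ]}
    x:          sequence of observed feature values
    """
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(x):
            score += math.log(cond_probs[c][i][value])
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical probability tables for two classes and two categorical features.
priors = {"A": 0.6, "B": 0.4}
cond_probs = {
    "A": [{"x": 0.2, "y": 0.8}, {"u": 0.5, "v": 0.5}],
    "B": [{"x": 0.7, "y": 0.3}, {"u": 0.9, "v": 0.1}],
}

print(predict(priors, cond_probs, ["x", "u"]))  # A: 0.6*0.2*0.5 = 0.06, B: 0.4*0.7*0.9 = 0.252
print(predict(priors, cond_probs, ["y", "v"]))  # A: 0.6*0.8*0.5 = 0.24, B: 0.4*0.3*0.1 = 0.012
```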




7. In the final step, we calculate the posterior score of every class for a given observation and pick the class with the highest value. To estimate each $p(x_i|C_k)$, however, we cannot rely on raw frequency-based probabilities alone: when a frequency-based probability is zero, it makes the whole product zero and wipes out all the information in the other probabilities. A solution is **Laplace smoothing**, a technique for smoothing categorical data. With Laplace smoothing, the resulting estimate lies between the empirical probability (relative frequency) $\frac{x_i}{N}$ and the uniform probability $\frac{1}{d}$, where $x_i$ here denotes the count of the $i$-th category, $N$ the total number of observations, and $d$ the number of categories. Usually, we use $\alpha = 1$:

\begin{align}
p_{x_i}=\frac{x_i+\alpha}{N+\alpha d} &&&& (i = 1,..., d)
\end{align}
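
A minimal sketch of the smoothing formula above, assuming the counts of the $d$ categories are given as a plain list (the function name `laplace_smooth` and the example counts are hypothetical):

```python
def laplace_smooth(counts, alpha=1.0):
    """Return smoothed estimates (x_i + alpha) / (N + alpha * d) for each category."""
    n = sum(counts)   # N: total number of observations
    d = len(counts)   # d: number of categories
    return [(c + alpha) / (n + alpha * d) for c in counts]


counts = [3, 0, 1]                             # the second category was never observed
smoothed = laplace_smooth(counts)              # alpha = 1: [4/7, 1/7, 2/7]
unsmoothed = laplace_smooth(counts, alpha=0.0)  # alpha = 0: plain relative frequencies
print(smoothed, unsmoothed)
```

With $\alpha = 1$ the unseen category gets probability $1/7$ instead of $0$, so it no longer zeroes out the product of conditional probabilities.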

# Types of Naive Bayes

## Multinomial Naive Bayes

This is mostly used for **document classification** problems, i.e. deciding whether a document belongs to the category of sports, politics, technology, etc. We can use the frequencies of the words in the document as predictors. To make the classifier more advanced, tf-idf weights are another option. Tf-idf not only counts the occurrences of a word in the given text (or document) but also reflects the **inverse document frequency**, a measure of how much information the word provides.
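
The word-count idea can be sketched with the standard library only. This is an illustrative toy, not a reference implementation: the function names, the four tiny documents, and the choice to ignore words outside the training vocabulary are all assumptions made here.

```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log word likelihoods per class."""
    vocab = sorted({w for d in docs for w in d.split()})
    log_prior, log_lik = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        total = sum(counts.values())
        log_lik[c] = {
            w: math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
    return log_prior, log_lik


def predict_multinomial_nb(model, doc):
    """Return the class with the highest log posterior score for a document."""
    log_prior, log_lik = model
    scores = {}
    for c in log_prior:
        # Assumption: words that never appeared in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in doc.split() if w in log_lik[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy corpus.
docs = ["goal match team win", "team score goal",
        "election vote party", "party vote policy debate"]
labels = ["sports", "sports", "politics", "politics"]

model = train_multinomial_nb(docs, labels)
print(predict_multinomial_nb(model, "team goal"))
print(predict_multinomial_nb(model, "vote party debate"))
```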


## Bernoulli Naive Bayes

This is similar to multinomial Naive Bayes, but the predictors are **boolean variables**. The features used to predict the class variable take only the values yes or no, for example whether a word occurs in the text or not.
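
One practical difference from the multinomial model is that a Bernoulli likelihood also scores the features that are absent. A minimal sketch, assuming `p[i]` is the (already estimated) probability that feature `i` is present in a given class; the function name and the numbers are hypothetical:

```python
import math


def bernoulli_log_likelihood(x, p):
    """log p(x | C) = sum_i [ x_i * log p_i + (1 - x_i) * log(1 - p_i) ]."""
    return sum(
        math.log(p_i) if x_i else math.log(1.0 - p_i)
        for x_i, p_i in zip(x, p)
    )


p_present = [0.8, 0.1, 0.5]   # hypothetical p(word_i present | class)
x = [1, 0, 1]                 # words 0 and 2 occur, word 1 does not

likelihood = math.exp(bernoulli_log_likelihood(x, p_present))
print(likelihood)             # 0.8 * (1 - 0.1) * 0.5
```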


## Gaussian Naive Bayes

When the predictors take **continuous** rather than discrete values, we assume that, within each class, these values are sampled from a Gaussian distribution.
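
Under this assumption, $p(x_i|C_k)$ is the Gaussian density evaluated at $x_i$, with a mean and variance estimated from the training values of that feature in class $C_k$. A minimal sketch with hypothetical helper names and made-up measurements:

```python
import math


def fit_gaussian(values):
    """Return the sample mean and (maximum likelihood) variance of a feature."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# Hypothetical values of one continuous feature for the samples of one class.
values = [4.9, 5.1, 5.0, 5.2, 4.8]
mean, var = fit_gaussian(values)

# Values near the class mean get a much higher likelihood than distant ones.
print(gaussian_pdf(5.0, mean, var), gaussian_pdf(6.0, mean, var))
```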

# Advantages and Disadvantages of Naive Bayes

## Advantages

1. In spite of its simplicity and strict independence assumption, the **Naive Bayes classifier performs surprisingly well** on many real-world classification tasks, and it **needs less training data**. This is because violations of the independence assumption tend not to hurt classification performance: the predicted class depends only on which posterior is largest, not on its exact value.


2. Naive Bayes is naturally an **“incremental learner”**. An incremental learner is an induction technique that can update its model one training example at a time. It does not need to reprocess all past training examples when new training data become available.


3. It is easy and fast to predict the class of a test data set.
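
The incremental-learning point can be illustrated with a sketch in which the model is nothing more than a set of counts, so adding a training example only increments a few counters. The class name `IncrementalNB` and its methods are hypothetical; the data is the same toy dataset used in the implementation at the end of this notebook.

```python
from collections import defaultdict


class IncrementalNB:
    """Categorical Naive Bayes whose counts are updated one example at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = defaultdict(int)    # class -> number of examples
        self.feature_counts = defaultdict(int)  # (feature index, value, class) -> count
        self.feature_values = defaultdict(set)  # feature index -> distinct values seen

    def update(self, x, y):
        """Fold a single training example (x, y) into the counts."""
        self.class_counts[y] += 1
        for i, value in enumerate(x):
            self.feature_counts[(i, value, y)] += 1
            self.feature_values[i].add(value)

    def predict(self, x):
        """Return the class with the highest Laplace-smoothed posterior score."""
        n = sum(self.class_counts.values())
        k = len(self.class_counts)
        scores = {}
        for c, n_c in self.class_counts.items():
            score = (n_c + self.alpha) / (n + self.alpha * k)
            for i, value in enumerate(x):
                d = len(self.feature_values.get(i, ()))
                count = self.feature_counts.get((i, value, c), 0)
                score *= (count + self.alpha) / (n_c + self.alpha * d)
            scores[c] = score
        return max(scores, key=scores.get)


xs = [(1, "S"), (1, "M"), (1, "M"), (1, "S"), (1, "S"),
      (2, "S"), (2, "M"), (2, "M"), (2, "L"), (2, "L"),
      (3, "L"), (3, "M"), (3, "M"), (3, "L"), (3, "L")]
ys = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]

model = IncrementalNB()
for x, y in zip(xs, ys):
    model.update(x, y)  # earlier examples are never revisited

prediction = model.predict((2, "S"))
print(prediction)
```

For the input `(2, "S")` this sketch predicts class `-1`, the same label the batch implementation at the end of this notebook returns.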

## Disadvantages

1. The probability estimates from Naive Bayes can be quite **inaccurate**. They tend to be **overestimated** for the correct class and **underestimated** for the incorrect classes. This does not change which class is ranked highest, but it does become a problem if we are going to use the probability estimates themselves, so Naive Bayes should be used with caution for actual decision-making with costs and benefits.


2. It performs well with **categorical input variables** compared to numerical variables. For numerical variables, a normal distribution (bell curve) is assumed, which is a strong assumption.
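
The first disadvantage can be made concrete with a small, deliberately extreme sketch: one informative feature is fed to the model several times, as if the copies were independent pieces of evidence. The function name and the numbers are hypothetical. Because the copies carry no new information, the true posterior stays at the single-copy value, while the Naive Bayes estimate is pushed toward 1.

```python
def naive_posterior(prior_pos, lik_pos, lik_neg, copies):
    """Posterior of the positive class when one feature is counted `copies` times."""
    pos = prior_pos * lik_pos ** copies
    neg = (1.0 - prior_pos) * lik_neg ** copies
    return pos / (pos + neg)


# Hypothetical numbers: p(C+) = 0.5, p(x | C+) = 0.8, p(x | C-) = 0.4.
single = naive_posterior(0.5, 0.8, 0.4, copies=1)      # 2/3, the correct posterior
duplicated = naive_posterior(0.5, 0.8, 0.4, copies=3)  # 8/9, overconfident
print(single, duplicated)
```

The predicted class is the same in both cases; only the probability attached to it is inflated.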

# Applications of Naive Bayes Algorithms

- Real-time prediction: Naive Bayes is an eager learning classifier and it is fast. Thus, it can be used for making predictions in real time.


- Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers are widely used in **text classification**, where they handle multi-class problems well and the **independence assumption** is a workable approximation, and they often achieve a high success rate compared to other algorithms. As a result, Naive Bayes is widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments).


- Recommendation system: a **Naive Bayes classifier and collaborative filtering** together can build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.

# Naive Bayes Algorithm

In [7]:
import pandas as pd
import numpy as np
from sklearn import datasets

data = datasets.load_iris()
X = data.data  # all four iris features
y = data.target

In [29]:
def naive_bayes_classifiers(features, label, input_data):
    ## 1. Calculate prior probability of each class
    label_category = list(set(label))
    label_number = len(label_category)
    label_prob = {}
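    for i in range(0, label_number):
        # Laplace smoothing: (class count + 1) / (N + number of classes * 1)
        class_count = len(label[label == label_category[i]])
        label_prob[label_category[i]] = (class_count + 1) / (len(label) + label_number * 1)
    # print(label_prob)  # roughly {1: 0.588, -1: 0.412} for the toy data below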

    
    ## 2. Calculate the conditional probability of each sample given their current class
    features_set = {}
    
    ## 2.1 Gather all feature set
    for i in range(len(features)):
        features_set[i] = list(set(features[i]))
    print(features_set)
    
    ## 2.2 Calculate the probability of each feature value under each label
    ## P(Y=label_1)*P(X=feature_1|Y=label_1)*P(X=feature_2|Y=label_1)
    ## e.g. P(Y=1)P(X=2∣Y=1)P(X=S∣Y=1) & P(Y=−1)P(X=2∣Y=−1)P(X=S∣Y=−1)

    features_label_prob = {}
    
    for i in range(len(features)):
        for j in range(len(features_set[i])):
            for k in range(label_number):
                
                label_index = np.where(label==label_category[k])
                object_feature = np.array(features[i])[label_index]
                # print(object_feature)
                
                # Count the number of sample with a certain feature value in the class and add laplace smoothing
                nominator_value = object_feature[object_feature==features_set[i][j]]
                nominator = len(nominator_value) + 1
                # print(nominator)
                
                # Count the number of sample in the class and add laplace smoothing
                label_select = label[label==label_category[k]]
                denominator = len(label_select) + len(features_set[i]) * 1
                # print(denominator)
                
                features_label_prob_key = str(i)+str(features_set[i][j])+str(label_category[k])
                features_label_prob[features_label_prob_key] = nominator/denominator
                
                # print(features_label_prob_key) 
                # print(denominator)
    
    # Predict the class given input data
    calc_label_prob = {}
    
    # Given the init product
    product = 1
    
    # Calculate the conditional probability under each class
    for i in range(0, label_number):
        product = 1
        calc_label_prob[label_category[i]] = label_prob[label_category[i]]
        # print(calc_label_prob[label_category[i]])
        
        for j in range(0, len(input_data)):
            
            # Construct key
            key = str(j)+str(input_data[j])+str(label_category[i])
            # print(key)
            # print(features_label_prob[key])
            
            product = product * features_label_prob[key]
            print(product)
            
        calc_label_prob[label_category[i]] = calc_label_prob[label_category[i]]*product
        

    # print(calc_label_prob) #{1: 0.02222222222222222, -1: 0.06666666666666667}

    # Return the class with highest probability
    output_label = max(calc_label_prob,key=calc_label_prob.get)
    
    # print(output_label)
    return output_label


features = [[1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3], 
            ['S', 'M', 'M', 'S', 'S', 'S', 'M', 'M', 'L', 'L', 'L', 'M', 'M', 'L', 'L']]
features = np.array(features)
label = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]
label = np.array(label)
input_data = np.array([2, 'S'])
output_label = naive_bayes_classifiers(features, label, input_data)
print("the output label of input_data is:{}".format(output_label))

{0: ['1', '2', '3'], 1: ['M', 'S', 'L']}
0.3333333333333333
0.05555555555555555
0.3333333333333333
0.14814814814814814
the output label of input_data is:-1
