# Naive Bayes Classifier from Scratch

In this notebook, we give an overview of the Naive Bayes classifier, implement it from scratch, and fit it on the mushroom classification dataset to predict whether a mushroom is poisonous or edible.

The probabilistic model for this classifier looks like:
\begin{equation}
P(C_k \mid x) = \frac{P(C_k) \cdot P(x \mid C_k)}{P(x)}
\end{equation}


In plain English, using Bayesian probability terminology, the above equation can be written as
\begin{equation}
\text{posterior} = \frac{\text{prior} \cdot \text{likelihood}}{\text{evidence}}
\end{equation}
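
The "naive" part of the name comes from the standard assumption that the features are conditionally independent given the class. As a short sketch (the index $i$ and the feature count $n$ are notation introduced here, not taken from the text above), the likelihood then factorizes over the individual features $x_1, \dots, x_n$ of $x$:
\begin{equation}
P(x \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k)
\end{equation}

Because the evidence $P(x)$ is the same for every class, it can be dropped when we only want to know which class is most probable:
\begin{equation}
\hat{y} = \arg\max_{k} \; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)
\end{equation}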

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

## Load the Dataset

In [None]:
df = pd.read_csv('../input/mushroom-classification/mushrooms.csv')

df.head()

Here, we can see from the dataset that:

- all the features are categorical, and
- the values of the features need to be encoded as numeric values for our classifier.


## Encode the features into Numerical data

We will use the LabelEncoder for this task, which maps each category of a feature to an integer code.

It may not be the best approach in general, but it is really simple.
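
To make the encoding concrete, here is a minimal sketch on a made-up column (the `toy` frame and its `color` values are hypothetical and are not taken from the mushroom dataset). It shows that LabelEncoder sorts the categories and uses each category's position as its code:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column, not part of the mushroom dataset
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

toy_encoder = LabelEncoder()
toy["color_encoded"] = toy_encoder.fit_transform(toy["color"])

# classes_ holds the original categories in sorted order;
# the position of a category in this array is its integer code
print(toy_encoder.classes_.tolist())   # ['blue', 'green', 'red']
print(toy["color_encoded"].tolist())   # [2, 1, 0, 1]
```

The integer codes suggest an ordering that the categories do not really have. For the classifier built in this notebook that should be harmless, since the codes are only ever compared for equality.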

In [None]:
encoder = LabelEncoder()

# Apply the encoder to each of the columns
df = df.apply(encoder.fit_transform)

df.head()

As we can see, the values in each column now range from 0 to one less than the number of categories for that feature. Next, we will split the data and then define our functions for the prior and the likelihood, so that we can compute the posterior probability for each target class and see which class each data point fits best.

## Split the dataset into train and test parts

In [None]:
# Separating our target and features
X = df.drop(columns = ['class'])
y = df['class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

print("X_train = ", X_train.shape)
print("y_train = ", y_train.shape)
print("X_test = ", X_test.shape)
print("y_test = ", y_test.shape)
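
The split above is random, so the rows that end up in the test set change from run to run. If repeatable results are wanted, one option is to pass `random_state` (and optionally `stratify`) to `train_test_split`. The sketch below uses made-up data (`toy_X`, `toy_y` and the seed 42 are assumptions for illustration only):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 60 of class 0 and 40 of class 1
toy_X = pd.DataFrame({"feat": np.arange(100) % 5})
toy_y = pd.Series([0] * 60 + [1] * 40)

# A fixed random_state makes the split repeatable;
# stratify keeps the class proportions the same in both parts
split_a = train_test_split(toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y)
split_b = train_test_split(toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y)

X_tr, X_te, y_tr, y_te = split_a

# Both calls select exactly the same test rows
same_rows = X_te.index.equals(split_b[1].index)
# Share of class 1 in the test part matches the 40% in the full data
test_class1_share = y_te.mean()

print(X_tr.shape, X_te.shape, same_rows, test_class1_share)
```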

## Building Our Classifier

Finally, it is time to build our classifier step by step. As we saw earlier, to classify a data point with Naive Bayes we need the prior probability and the likelihood for each class. We multiply them to get a score proportional to the posterior probability and assign the data point to the class with the highest score.

In [None]:
# First, we calculate the prior probability, which is just the fraction of training points belonging to the given class.
# For example, if 60% of the mushrooms in our training set are edible, then the prior probability of the edible
# class is 0.6 when we make predictions on the test set.

def prior(y_train, label):
    
    total_points = y_train.shape[0]
    class_points = np.sum(y_train == label)
    
    return class_points/float(total_points)
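
As a quick sanity check of the 60% example, here is a small sketch on made-up labels (`toy_y` is hypothetical, and the function is repeated only so that the snippet runs on its own):

```python
import numpy as np
import pandas as pd

def prior(y_train, label):
    # Same computation as above: fraction of training points with this label
    total_points = y_train.shape[0]
    class_points = np.sum(y_train == label)
    return class_points / float(total_points)

# Hypothetical labels: six points of class 0 and four points of class 1
toy_y = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

p_class0 = float(prior(toy_y, 0))
p_class1 = float(prior(toy_y, 1))

print(p_class0, p_class1)  # 0.6 0.4
```

The priors over all classes always sum to 1, which is an easy property to check on real data as well.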

In [None]:
## Next, we define a function to calculate the conditional probability, which we will then use to calculate the
## likelihood.

def cond_prob(X_train, y_train, feat_col, feat_val, label):
    """
    Calculate the conditional probability that is used to build the likelihood.
    The value returned is of the form
        P(x_i = feat_val | y = C)
    which is the probability that the feature x_i (given by feat_col) takes the value feat_val,
    given that the data point belongs to the target class C (given by label).
    
    Effectively, it reduces to
        (points in class C that have the given value in the feature column) / (all points in class C)
    """
    
    # Keep only the training rows that belong to the given class
    X_filtered = X_train[y_train == label]
    
    numerator = np.sum(X_filtered[feat_col] == feat_val)
    denominator = np.sum(y_train == label)
    
    return numerator/float(denominator)
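
One thing to be aware of: if a feature value never occurs together with a class in the training data, the estimate above is exactly 0, and the whole product of conditional probabilities for that class becomes 0. A common remedy is additive (Laplace) smoothing. The function below is a hypothetical variant, not part of the classifier in this notebook; the name `cond_prob_smoothed`, the `alpha` parameter and the toy data are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def cond_prob_smoothed(X_train, y_train, feat_col, feat_val, label, alpha=1.0):
    # Hypothetical smoothed variant of cond_prob (additive / Laplace smoothing)
    X_filtered = X_train[y_train == label]
    # Number of distinct values this feature takes in the training data
    n_categories = X_train[feat_col].nunique()

    # Add alpha to the count, and alpha per category to the denominator,
    # so that the probabilities over all categories still sum to 1
    numerator = np.sum(X_filtered[feat_col] == feat_val) + alpha
    denominator = X_filtered.shape[0] + alpha * n_categories
    return numerator / float(denominator)

# Hypothetical data: value 2 of "odor" never occurs together with class 0
toy_X = pd.DataFrame({"odor": [0, 0, 1, 2, 2, 2]})
toy_y = pd.Series([0, 0, 0, 1, 1, 1])

# Unsmoothed estimate would be 0/3; smoothed it is (0 + 1) / (3 + 1 * 3) = 1/6
p_unseen = cond_prob_smoothed(toy_X, toy_y, "odor", 2, 0)
print(p_unseen)
```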

In [None]:
## Now time to calculate the posterior probability and make predictions

def predict(X_train, y_train, xtest):
    
    # Get the distinct target classes
    classes = np.unique(y_train)
    
    # All the features for our dataset
    features = list(X_train.columns)
    
    
    # Compute posterior probabilities for each class
    post_probs = []
    
    for label in classes:
        
        # The posterior is proportional to prior * likelihood (the evidence P(x) is the same for every
        # class, so we can skip it). We calculate the likelihood as the product of the conditional
        # probabilities of the individual features.
        
        likelihood = 1.0
        
        for f in features:
            cond = cond_prob(X_train, y_train, f, xtest[f], label)
            likelihood *= cond
        
        prior_prob = prior(y_train, label)
        
        posterior = prior_prob * likelihood
        
        post_probs.append(posterior)
        
    # Return the label for which the posterior probability was the maximum.
    # np.argmax gives a position in post_probs, so we map it back through classes.
    prediction = classes[np.argmax(post_probs)]
    
    return prediction
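
Multiplying many probabilities can underflow to 0.0 in floating point. With a modest number of features this is unlikely to be a problem, but a common alternative is to add log-probabilities instead of multiplying probabilities. The sketch below uses made-up numbers (`cond_probs` and `priors` are hypothetical) to show the effect; it is not how `predict` above is implemented:

```python
import numpy as np

# Hypothetical per-feature conditional probabilities for two classes (500 features each)
cond_probs = {
    0: np.full(500, 0.01),
    1: np.full(500, 0.02),
}
priors = {0: 0.5, 1: 0.5}

# The direct product is too small for a float64 and underflows to 0.0 for both classes
direct = {c: float(priors[c] * np.prod(cond_probs[c])) for c in cond_probs}

# The sum of logs stays in a comfortable numeric range and preserves the ordering
log_post = {c: float(np.log(priors[c]) + np.sum(np.log(cond_probs[c]))) for c in cond_probs}

best_class = max(log_post, key=log_post.get)
print(direct, best_class)
```

Note that a conditional probability of exactly 0 gives `-inf` in log space, so this approach is usually combined with some form of smoothing.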

## Time to test our classifier

In [None]:
# First, let's check on a single example from the test set

rand_example = 6

output = predict(X_train, y_train, X_test.iloc[rand_example])

print("Naive Bayes Classifier predicts ", output)
print("Correct Answer ", y_test.iloc[rand_example])

In [None]:
## Now, we'll check the results on each of the test data points and calculate
## an accuracy-based score for our classifier

def accuracy_score(X_train, y_train, xtest, ytest):
    
    preds = []
    
    for i in range(xtest.shape[0]):
        pred_label = predict(X_train, y_train, xtest.iloc[i])
        preds.append(pred_label)
        
    preds = np.array(preds)
    
    accuracy = np.sum(preds == ytest)/ytest.shape[0]
    
    return accuracy

In [None]:
print("Accuracy Score for our classifier == ", accuracy_score(X_train, y_train, X_test, y_test))
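
One way to sanity-check a from-scratch implementation is to compare it against a library model of the same family. scikit-learn provides `CategoricalNB` for integer-encoded categorical features. The sketch below runs on made-up data rather than the mushroom dataset (`toy_X`, `toy_y`, the seed and `alpha=1.0` are assumptions for illustration), so its score says nothing about the result above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)

# Hypothetical integer-encoded data: 200 rows, 3 categorical features with codes 0..2
toy_X = rng.integers(0, 3, size=(200, 3))
# The label depends only on the first feature, so the problem is easy by construction
toy_y = (toy_X[:, 0] == 2).astype(int)

Xtr, Xte, ytr, yte = train_test_split(toy_X, toy_y, test_size=0.3, random_state=0)

# CategoricalNB applies additive smoothing controlled by alpha
reference = CategoricalNB(alpha=1.0)
reference.fit(Xtr, ytr)

reference_accuracy = reference.score(Xte, yte)
print("Reference accuracy on toy data:", reference_accuracy)
```

On the real data, the same idea would be to fit the reference model on `X_train`, `y_train` and compare its test accuracy with ours. Small differences are expected because the library model smooths its probability estimates.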

**This brings us to the end of this notebook. We built a Naive Bayes classifier from scratch, fit it on our training data, and tested it, obtaining an accuracy of 99.7%, which is very good. Since the train/test split above is random, the exact figure can vary slightly between runs.**