<a href="https://colab.research.google.com/github/Massittha/Data-portfolio/blob/main/My_GaussianNB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The naive Bayes Classifier

## Introduction
In this Colab notebook, I created my own **naive Bayes classifier** class and used it to make predictions on the breast cancer Wisconsin dataset from `sklearn`. This follows my [previous notebook](https://github.com/Massittha/Data-portfolio/blob/main/My_KNeighbors_Class.ipynb), in which a **KNeighbors Classifier** was used to solve the same problem.

The steps involved were the following:

1. Load the dataset
2. Split the data into training and testing sets
3. Create my naive Bayes Classifier class
4. Predict the training set using my algorithm
5. Compare the prediction results with `sklearn GaussianNB`
6. Cross validate my algorithm


## 1. Load the dataset

In [2]:
import numpy as np
from sklearn import datasets

cancer = datasets.load_breast_cancer()
X = cancer.data
Y = cancer.target

## 2. Split the data
The dataset was split into training and test sets, with the test set holding 20% of the original data.


In [3]:
from sklearn.model_selection import train_test_split

X_train , X_test, Y_train, Y_test = train_test_split(X,Y,test_size = 0.2, random_state = 10)

## 3. My naive Bayes Classifier class
The naive Bayes Classifier assumes that all features in the training data are independent of each other given the class, and that the values of each feature are normally distributed within each class. The algorithm is constructed with Bayes' theorem as follows:
$$
class_{predicted} = \text{argmax}_k \; p(c_k)\prod_i p(x_i| c_k)
$$

where $p(c_k)$ is the prior probability of the data point belonging to class $c_k$, and $p(x_i|c_k)$ is the probability of feature $i$ having the value $x_i$ given the class $c_k$. The model returns the class $c_k$ that yields the highest value of this product, which is proportional to the posterior probability of $c_k$.
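As a small illustration of this decision rule, the sketch below scores one data point with two features. All priors and likelihood values are made-up numbers chosen for the example, not values from the dataset.

```python
import math

# Hypothetical numbers for a single data point with two features.
priors = {0: 0.4, 1: 0.6}                         # p(c_k)
likelihoods = {0: [0.30, 0.10], 1: [0.05, 0.20]}  # p(x_i | c_k) for i = 1, 2

# Unnormalised posterior score of each class: p(c_k) * prod_i p(x_i | c_k)
scores = {k: priors[k] * math.prod(likelihoods[k]) for k in priors}

# The predicted class is the one with the highest score
predicted = max(scores, key=scores.get)
print(scores)
print(predicted)
```

Here class 0 wins (a score of about 0.012 against about 0.006) even though class 1 has the larger prior, because the features are much more likely under class 0.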

By assuming a normal distribution for each feature, the conditional probability can be substituted with the normal probability density function. Since the logarithm is monotonic, taking the log of the expression leaves the argmax unchanged and gives:

\begin{align}
class_{predicted} &=\text{argmax}_k \left(\log(p(c_k)) + \sum_i \log(p(x_i|c_k))\right)\\
&=\text{argmax}_k \left(\log(p(c_k)) + \sum_i \log\left(\frac{1}{\sigma_{ik}\sqrt{2\pi}} \exp\left(\frac{-(x_i-\mu_{ik})^2}{2\sigma_{ik}^2}\right)\right)\right)
\end{align}

where $\mu_{ik}$ and $\sigma_{ik}$ are the mean and standard deviation of feature $i$ in class $c_k$, and $$\log\left(\frac{1}{\sigma_{ik}\sqrt{2\pi}} \exp\left(\frac{-(x_i-\mu_{ik})^2}{2\sigma_{ik}^2}\right)\right)$$ is the **log-likelihood** of feature $i$ given class $c_k$.
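One practical reason to work with logs is numerical: a product of many small densities can underflow to zero, while the sum of their logs stays finite. The sketch below writes the Gaussian log-density out term by term (the helper name `gaussian_logpdf` and the 500-feature data point are assumptions for illustration only) and compares the two approaches.

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma):
    """Log of the normal density: -log(sigma * sqrt(2*pi)) - (x - mu)^2 / (2 * sigma^2)."""
    return -np.log(sigma * np.sqrt(2 * np.pi)) - (x - mu) ** 2 / (2 * sigma ** 2)

# Hypothetical point with 500 features, each 3 standard deviations from its mean
x = np.full(500, 3.0)
mu = np.zeros(500)
sigma = np.ones(500)

log_terms = gaussian_logpdf(x, mu, sigma)

product_of_densities = np.prod(np.exp(log_terms))  # underflows to 0.0
sum_of_logs = np.sum(log_terms)                    # stays finite

print(product_of_densities)
print(sum_of_logs)
```

With only 30 features the breast cancer dataset is unlikely to hit this limit, but the log form is the safer default.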


In the `My_GaussianNB` class below, the `fit` method separates the training set into a set of class 0 and a set of class 1, and the `predict` method classifies each data point by applying the log form of the decision rule above.


In [4]:
## The naive Bayes Classifier
import numpy as np
from scipy.stats import norm


class My_GaussianNB:

    def fit(self, X_train, Y_train):
        self.X_train = X_train
        self.Y_train = Y_train

        ## Separate training data into a set of class 0 and a set of class 1
        self.data_0 = X_train[Y_train == 0]
        self.data_1 = X_train[Y_train == 1]

    def predict(self, X_test):
        ## 1. calculate mean and std of each feature in both sets of separated data
        means_0, stds_0 = self.data_0.mean(axis=0), self.data_0.std(axis=0)
        means_1, stds_1 = self.data_1.mean(axis=0), self.data_1.std(axis=0)

        ## 2. calculate the prior probability of each class:
        ## number of datapoints in the class / rows of training set
        prob_0 = len(self.data_0) / len(self.X_train)
        prob_1 = len(self.data_1) / len(self.X_train)

        ## 3. calculate the score of each class:
        ## log prior + sum over features of the Gaussian log-likelihood
        log_0 = np.log(prob_0) + np.sum(norm.logpdf(X_test, loc=means_0, scale=stds_0), axis=1)
        log_1 = np.log(prob_1) + np.sum(norm.logpdf(X_test, loc=means_1, scale=stds_1), axis=1)

        ## 4. prediction: class 1 where its score is higher, otherwise class 0
        ## (cast to int so the output uses the same 0/1 labels as Y_train)
        classes = (log_0 < log_1).astype(int)

        return classes

## 4. Predict the training set with `My_GaussianNB`

In [5]:
my_model = My_GaussianNB()
my_model.fit(X_train,Y_train)
my_preds = my_model.predict(X_train)

## 5. Compare the results with `sklearn GaussianNB`

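Both my algorithm and `sklearn GaussianNB` give the same predictions on the training set. `var_smoothing` is set to 0 so that `GaussianNB` uses the raw per-class variances, as `My_GaussianNB` does.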

In [6]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB(var_smoothing=0.)
gnb.fit(X_train, Y_train)
sk_preds = gnb.predict(X_train)

In [7]:
np.all( my_preds == sk_preds)

True
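The comparison sets `var_smoothing=0.` because, as I understand the scikit-learn documentation, `GaussianNB` otherwise adds a portion of the largest feature variance to every variance for numerical stability, which would make its parameters differ slightly from the plain ones. The sketch below reproduces that adjustment on made-up toy data; it simplifies by using a single group of samples rather than per-class variances.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

var_smoothing = 1e-9  # GaussianNB's documented default

# Plain per-feature variances (the square of the std used without smoothing)
plain_var = X.var(axis=0)

# Smoothing term: a fraction of the largest feature variance
epsilon = var_smoothing * plain_var.max()
smoothed_var = plain_var + epsilon

print(plain_var)
print(epsilon)
print(smoothed_var - plain_var)
```

The shift is tiny relative to the variances, so predictions usually agree either way, but setting it to zero makes the two implementations directly comparable.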

## 6. Cross-validation
KFold cross-validation with 10 folds was applied to the training data to assess how well my algorithm generalises; the test data was held back as the unseen set.

In [9]:
## Cross validation
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

kf = KFold(n_splits=10, random_state=10, shuffle=True)

train_accuracies = []
val_accuracies = []

# K-Fold Cross Validation loop
for iteration, (train_index, val_index) in enumerate(kf.split(X_train), start=1):
    Xtr, Xval = X_train[train_index], X_train[val_index]
    ytr, yval = Y_train[train_index], Y_train[val_index]

    # fit a fresh model on the training folds only
    my_gnb = My_GaussianNB()
    my_gnb.fit(Xtr, ytr)

    pred_train = my_gnb.predict(Xtr)
    pred_val = my_gnb.predict(Xval)

    # calculate training and validation accuracy and store
    train_accuracies.append(accuracy_score(ytr, pred_train))
    val_accuracies.append(accuracy_score(yval, pred_val))
    print(f"Iteration {iteration}:", end=" ")
    print(f"train accuracy: {round(train_accuracies[-1], 4)}, val accuracy: {round(val_accuracies[-1], 4)}")

train_accuracy_mean = np.mean(train_accuracies)
val_accuracy_mean = np.mean(val_accuracies)

print("\n")
print(f"Mean Training Accuracy: {train_accuracy_mean:.4f}")
print(f"Mean Validation Accuracy: {val_accuracy_mean:.4f}")






Iteration 1: train accuracy: 0.9193, val accuracy: 1.0
Iteration 2: train accuracy: 0.9291, val accuracy: 0.8913
Iteration 3: train accuracy: 0.9242, val accuracy: 0.9565
Iteration 4: train accuracy: 0.9242, val accuracy: 0.9348
Iteration 5: train accuracy: 0.9291, val accuracy: 0.8913
Iteration 6: train accuracy: 0.9244, val accuracy: 0.8889
Iteration 7: train accuracy: 0.9268, val accuracy: 0.9111
Iteration 8: train accuracy: 0.9268, val accuracy: 0.9333
Iteration 9: train accuracy: 0.9341, val accuracy: 0.9111
Iteration 10: train accuracy: 0.922, val accuracy: 0.9333


Mean Training Accuracy: 0.9260
Mean Validation Accuracy: 0.9252
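The validation accuracies above move in coarse steps because each validation fold is small. The sketch below estimates the fold sizes, assuming the training set has 455 rows (80% of the 569 samples in the dataset). It uses `np.array_split` as a stand-in for `KFold`'s size rule and does not reproduce the shuffling.

```python
import numpy as np

n_train = 455   # assumed: 569 samples minus a test set of 114
n_splits = 10

# np.array_split gives the first (n_train % n_splits) chunks one extra element,
# which is the same size rule KFold documents for its folds.
folds = np.array_split(np.arange(n_train), n_splits)
fold_sizes = [len(f) for f in folds]
print(fold_sizes)

# Change in fold accuracy caused by a single misclassified validation sample
print([round(1 / size, 4) for size in fold_sizes])
```

With folds of 46 and 45 samples, one extra error moves a fold's accuracy by roughly 0.02. For instance, 0.8913 corresponds to 41 of 46 correct and 0.8889 to 40 of 45 correct.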


Finally, the algorithm was fitted on the full training set and applied to the unseen test set.

In [11]:
my_gnb.fit(X_train,Y_train)
my_preds = my_gnb.predict(X_test)

print(f"Final accuracy score: {accuracy_score(Y_test,my_preds):.4f}")

Final accuracy score: 0.9561


The mean training accuracy (0.9260), mean validation accuracy (0.9252) and test accuracy (0.9561) were similar, which suggests that the model generalises to unseen data.
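One way to make "similar" more concrete is to compare the test accuracy with the fold-to-fold spread of the validation accuracies. The sketch below does this with the rounded numbers printed in the cross-validation output. The one-standard-deviation check is an informal rule of thumb chosen for this example, not a formal test.

```python
import statistics

# Validation accuracies copied from the 10-fold cross-validation output
val_accuracies = [1.0, 0.8913, 0.9565, 0.9348, 0.8913,
                  0.8889, 0.9111, 0.9333, 0.9111, 0.9333]
test_accuracy = 0.9561

mean_val = statistics.mean(val_accuracies)
std_val = statistics.stdev(val_accuracies)  # sample standard deviation

print(f"validation: {mean_val:.4f} +/- {std_val:.4f}")
print(f"test - mean validation: {test_accuracy - mean_val:.4f}")

# Informal check: is the test accuracy within one standard deviation of the mean?
within_one_std = abs(test_accuracy - mean_val) <= std_val
print(within_one_std)
```

The test accuracy falls inside the range the validation folds already cover, which is consistent with the conclusion above.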