In [1]:
%matplotlib inline

This note and the dataset it uses are taken from these links:
* https://www.kdnuggets.com/2020/06/naive-bayes-algorithm-everything.html
* https://online.stat.psu.edu/stat505/lesson/10
* https://www.data.gov.my/data/dataset/754d77ae-1dfb-4d56-a0aa-d37cffec4ff1/resource/fd1308bc-954f-4b89-a3f5-ede0ee4ec343/download/m-20210309040336_202211160300310_2008-2022-flows-of-foreign-direct-investment-in-malaysia_blocks.csv

# Naive Bayes Classifier

Now, we are going to look at one of the most widely applied techniques in modern statistics: the Naive Bayes Classifier.

This technique is one of the tools used for classification problems in statistics. The main theory underlying it, the Bayes Theorem, is also widely used in other methods such as cluster analysis and discriminant analysis. The theorem is so simple that it would be remiss not to state it in this note. The Bayes Theorem is

$$ P(A|B) = \frac{P(B|A)\times P(A)}{P(B)} $$

where 

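P(A|B) = probability of A occurring given that B has occurred,

P(B|A) = probability of B occurring given that A has occurred,

P(A) = probability of A occurring,

P(B) = probability of B occurring.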
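
To make the formula concrete, here is a small numeric sketch. The three probabilities are hypothetical values chosen only for illustration, and $P(B)$ is obtained from the law of total probability.

```python
# Hypothetical probabilities, chosen only to illustrate the formula
p_a = 0.3              # P(A)
p_b_given_a = 0.8      # P(B|A)
p_b_given_not_a = 0.1  # P(B|not A)

# P(B) from the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))
```

With these made-up numbers, observing B raises the probability of A from 0.3 to roughly 0.77.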

For classification problems, this theorem is easier to understand if we think of it in terms of prior and posterior probabilities. Suppose we want to know the probability that our observation $x_i$ falls into category A given that we observed some features $\textbf{X}$ (this is P(A|B)). We can calculate it if we know the probability of seeing features $\textbf{X}$ given that the category is A (this is P(B|A)), multiply it by the probability of seeing category A (this is P(A), the prior probability), and divide by the probability of seeing $\textbf{X}$ (this is P(B)). The left side of the equation is the posterior probability. The right side is calculated using information that we already have in hand; in our case, it is the data collected previously.

For the features in $\textbf{X}$, we then assume that they are independent of each other within a category (meaning that if I use just 2 features to predict category A, the occurrence of feature $X_1$ does not depend on the occurrence of feature $X_2$ once the category is known). This further simplifies the calculation above, since under this independence,

$$ P(X_1 \cap X_2|A) = P(X_1|A) \times P(X_2|A) $$

And thus, our new formula for the Bayes Theorem will look like this,

$$ P(A|\textbf{X}) = \frac{P(\textbf{X}|A)\times P(A)}{P(\textbf{X})} $$

$$ P(A|\textbf{X}) = \frac{P(X_1|A)P(X_2|A)\ldots P(X_n|A)\times P(A)}{P(\textbf{X})} $$

$$ P(A|\textbf{X}) = \frac{\prod_{i=1}^{n} P(X_i|A)\times P(A)}{P(\textbf{X})} $$

$$ P(A|\textbf{X}) = \frac{P(A) \times \prod_{i=1}^{n} P(X_i|A)}{P(\textbf{X})} $$

Please remember that in the example above, we simply calculate the posterior probability that $x_i$ falls into category A. If we have other categories such as B, C and D, we need to find the posterior probability for each category and then assign $x_i$ to the category with the highest posterior probability. Since $P(\textbf{X})$ is the same for every category, comparing the numerators alone is enough to pick the winner.
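
The decision rule can be sketched with a toy example. The category names, priors and per-feature likelihoods below are hypothetical numbers, and the sketch assumes the three categories are the only possible ones, so that the scores sum to $P(\textbf{X})$.

```python
import numpy as np

# Hypothetical priors P(category) and likelihoods P(X_i|category)
# for three made-up categories and two features
categories = np.array(["A", "B", "C"])
prior = np.array([0.5, 0.3, 0.2])
likelihood = np.array([
    [0.10, 0.20],  # P(X_1|A), P(X_2|A)
    [0.40, 0.30],  # P(X_1|B), P(X_2|B)
    [0.30, 0.10],  # P(X_1|C), P(X_2|C)
])

# Numerator of the formula: P(category) * product over i of P(X_i|category)
score = prior * likelihood.prod(axis=1)

# Dividing by P(X) rescales the scores so they sum to 1 but keeps their order
posterior = score / score.sum()

predicted = categories[np.argmax(score)]
print(predicted, posterior.round(3))
```

Taking the argmax of `score` or of `posterior` gives the same category, which is why the denominator can be skipped when only the predicted label is needed.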

There are many types of NB classifier, but for this note I will use the Gaussian Naive Bayes classifier, which uses the Gaussian distribution (normal distribution) to calculate each feature's probability, i.e. the term $P(X_i|A)$ is calculated from the normal density, using the mean and standard deviation of feature $X_i$ within category A.

We use Gaussian NB when the features used for prediction are stored as numerical values.
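
As a minimal sketch of the term $P(X_i|A)$, the function below evaluates the normal density at a given value. The function name `gaussian_density` and the mean and standard deviation used are hypothetical; in practice they would be estimated from the training rows of one category.

```python
import numpy as np

def gaussian_density(x, mean, sd):
    """Normal density with the given mean and standard deviation, evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Hypothetical feature whose values within category A have mean 100 and sd 20
at_mean = gaussian_density(100, 100, 20)     # density at the mean
two_sd_away = gaussian_density(140, 100, 20)  # density two standard deviations away
print(at_mean, two_sd_away)
```

Note that these are density values rather than probabilities, so they can exceed 1 when the standard deviation is small. That is fine for classification, because only the comparison between categories matters.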

Next, we'll go through a simulation of how NB works. I will use the data from DOSM, the "Flows of FDI in Malaysia by blocks of country" dataset, and predict the country block given the observed credit, debit and net amounts of a country's FDI into Malaysia.

As usual, before running any analysis, we should check whether our data meet the assumptions of the method. However, for this note, I will skip the assumption checks because I just want to focus on the method and its steps.

In [2]:
import pandas as pd

path = r"https://www.data.gov.my/data/dataset/754d77ae-1dfb-4d56-a0aa-d37cffec4ff1/resource/fd1308bc-954f-4b89-a3f5-ede0ee4ec343/download/m-20210309040336_202211160300310_2008-2022-flows-of-foreign-direct-investment-in-malaysia_blocks.csv"
data = pd.read_csv(path)
data = data.rename(columns = {'Blocks of countries':'blocks','Credit RM Million':'credit','Debit RM Million':'debit','Net RM Million':'net'})
data.head()

Unnamed: 0,Year,blocks,Category,Countries,credit,debit,net
0,2008,East Asia,Total East Asia,Total Country,17175,14129,3046
1,2008,East Asia,of which,"China, People's Republic of",1116,914,201
2,2008,East Asia,of which,"Hong Kong, SAR",5602,4825,777
3,2008,East Asia,of which,Japan,7911,6129,1783
4,2008,East Asia,of which,"Korea, Republic of",886,880,6


In [3]:
# keeping relevant data only. We don't want the Total for each block.
data = data[data['Category']=='of which']

In [4]:
# splitting into train & test dataset
from sklearn.model_selection import train_test_split
X, y = data[['credit','debit','net']], data['blocks']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

In [5]:
import numpy as np

class gauss_NB:
    def __init__(self):
        pass
    
    def fit(self, feature_train, result_train):
        self.xtrain = feature_train
        self.ytrain = result_train
        # prior P(A): share of each category among the training labels
        self.category_prob = result_train.groupby(result_train).count() / len(result_train)
        self.catname = np.array(self.category_prob.index.values)
        self.featname = feature_train.columns.values.tolist()
        # mean and sample standard deviation of every feature within each category,
        # computed once here instead of on every prediction
        grouped = feature_train.groupby(result_train)
        self.means = grouped.mean()
        self.stds = grouped.std()
        
    def get_param(self, feature, categ):
        # (mean, std) of one feature within one category
        return self.means.loc[categ, feature], self.stds.loc[categ, feature]
    
    def gauss_prob(self, xi, ave, sd):
        # normal density evaluated at xi
        return np.exp(-0.5 * (((xi-ave)/sd)**2)) / (sd * np.sqrt(2*np.pi))
       
    def predict_cat(self, feature_test):
        predict = []
        for i in range(len(feature_test)):
            row = feature_test.iloc[i]
            post_prob = []
            for cat in self.catname:
                # likelihood: product of the per-feature densities P(X_i|A)
                feat_prob = [self.gauss_prob(row[feat], *self.get_param(feat, cat)) for feat in self.featname]
                feat_prob = np.prod(feat_prob)
                # numerator of the posterior: likelihood times prior
                cat_prob = feat_prob * self.category_prob.loc[cat]
                post_prob.append(cat_prob)
            # assign the category with the highest posterior
            highest_prob = np.argmax(post_prob)
            predict.append(self.catname[highest_prob])
        return np.array(predict)
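
One design choice worth noting: multiplying many small densities can underflow to zero when there are many features. A common workaround is to add logarithms instead of multiplying. The sketch below is not part of the class above; the helper `log_gauss` and all parameter values are hypothetical, with rows standing for categories and columns for features.

```python
import numpy as np

def log_gauss(x, mean, sd):
    """Logarithm of the normal density evaluated at x."""
    return -0.5 * ((x - mean) / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi))

# Hypothetical parameters: rows are categories, columns are features
means = np.array([[100.0, 50.0], [300.0, 80.0]])
sds = np.array([[20.0, 10.0], [40.0, 15.0]])
priors = np.array([0.6, 0.4])
x = np.array([120.0, 55.0])  # one made-up observation

# log P(A) + sum of log P(X_i|A) ranks the categories
# in the same order as P(A) * product of P(X_i|A)
log_score = np.log(priors) + log_gauss(x, means, sds).sum(axis=1)
best = int(np.argmax(log_score))
print(best, log_score)
```

Because the logarithm is monotonic, the category with the highest log score is also the one with the highest posterior.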

In [6]:
clf = gauss_NB()
clf.fit(X_train, y_train)
our_prediction = clf.predict_cat(X_test)
accuracy = np.sum(y_test == our_prediction) / len(y_test)
accuracy # very low accuracy but hey, we obtained a similar result with the sklearn package below!

0.38596491228070173

In [7]:
# encode each block name as an integer label
block_code = {'Southeast Asia': 1, 'Europe': 2, 'East Asia': 3,
              'North America': 4, 'Latin America': 5, 'Oceania': 6}
y = [block_code[i] for i in y_train]
ytest = [block_code[i] for i in y_test]

# features as plain nested lists, one row per observation
xs = X_train.values.tolist()
xtest = X_test.values.tolist()

from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
result2 = model.fit(xs, y).predict(xtest)

accuracy = np.sum(ytest == result2) / len(ytest)
accuracy # this is the prediction from sklearn

0.38596491228070173

In [8]:
np.sum(y_test == our_prediction) / np.sum(ytest == result2)

1.0
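
A ratio of 1.0 shows that both classifiers got the same number of test rows right, which is a slightly weaker statement than saying they made the same predictions. The toy sketch below uses made-up labels and predictions to show the difference, and how an elementwise comparison would check agreement directly.

```python
import numpy as np

# Made-up true labels and two made-up sets of predictions
truth = np.array(["Europe", "Oceania", "Europe", "East Asia"])
pred_a = np.array(["Europe", "Europe", "Europe", "Oceania"])
pred_b = np.array(["Oceania", "Oceania", "Europe", "Oceania"])

acc_a = np.mean(pred_a == truth)       # accuracy of the first set
acc_b = np.mean(pred_b == truth)       # accuracy of the second set
agreement = np.mean(pred_a == pred_b)  # share of rows where both sets agree
print(acc_a, acc_b, agreement)
```

Here both sets have the same accuracy even though they disagree on half of the rows. To run the same check in this note, the two prediction arrays would first need a common encoding, for example by mapping the block names to the integer codes used for the sklearn model.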

And there you go. Our simulated Naive Bayes classifier works pretty well compared to the established package. Even though our prediction function takes more time to run, it obtains the same accuracy as a package that is far more optimized. What matters most is that we learned something new and now have a better understanding of how the Naive Bayes Classifier works.

Below is the result that we obtained from the simulated NB:

In [10]:
our_prediction

array(['Southeast Asia', 'Europe', 'Latin America', 'Latin America',
       'Latin America', 'Europe', 'North America', 'Oceania', 'Oceania',
       'Europe', 'North America', 'Latin America', 'Latin America',
       'Oceania', 'Latin America', 'Latin America', 'East Asia',
       'Latin America', 'Latin America', 'Oceania', 'Latin America',
       'Southeast Asia', 'Latin America', 'Latin America', 'Oceania',
       'Latin America', 'Latin America', 'North America', 'Oceania',
       'Latin America', 'East Asia', 'Latin America', 'Europe', 'Oceania',
       'Latin America', 'Latin America', 'East Asia', 'Oceania',
       'Latin America', 'Southeast Asia', 'East Asia', 'Latin America',
       'Europe', 'Latin America', 'Latin America', 'East Asia', 'Oceania',
       'Latin America', 'Oceania', 'Latin America', 'Latin America',
       'Europe', 'Oceania', 'North America', 'East Asia', 'East Asia',
       'Oceania'], dtype='<U14')