# Naive Bayes Classifier Prototype

This is my implementation of a *Naive Bayes Classifier*.

In [1]:
import numpy as np
import pandas as pd

## Preparing Datasets

First, let's prepare our data set. 

We convert three-level values such as 'High', 'Normal' and 'Low' into the integers 2, 1 and 0.

We convert 'Yes' and 'No' into the integers 1 and 0.

In [2]:
raw_data = pd.read_csv('./data.csv')
raw_data = raw_data.set_index('Data')
raw_data

Data,Blood Pressure,Fever,Diabetes,Vomit,Suffering from disease Z
1,High,High,Yes,No,No
2,High,High,Yes,Yes,No
3,Low,High,Yes,No,Yes
4,Normal,Mild,Yes,No,Yes
5,Normal,No fever,No,No,Yes
6,Normal,No fever,No,Yes,No
7,Low,No fever,No,Yes,Yes
8,High,Mild,Yes,No,No
9,High,No fever,No,No,Yes
10,Normal,Mild,No,No,Yes


In [3]:
def map_blood_pressure(s: str) -> int:
    match s:
        case 'High':
            return 2
        case 'Normal':
            return 1
        case 'Low':
            return 0


def map_fever(s: str) -> int:
    match s:
        case 'High':
            return 2
        case 'Mild':
            return 1
        case 'No fever':
            return 0


def map_diabetes(s: str) -> int:
    match s:
        case 'Yes':
            return 1
        case 'No':
            return 0


def map_vomit(s: str) -> int:
    match s:
        case 'Yes':
            return 1
        case 'No':
            return 0


evidence = raw_data.drop(columns='Suffering from disease Z')
evidence['Blood Pressure'] = evidence['Blood Pressure'].map(map_blood_pressure)
evidence['Fever'] = evidence['Fever'].map(map_fever)
evidence['Diabetes'] = evidence['Diabetes'].map(map_diabetes)
evidence['Vomit'] = evidence['Vomit'].map(map_vomit)
evidence

Data,Blood Pressure,Fever,Diabetes,Vomit
1,2,2,1,0
2,2,2,1,1
3,0,2,1,0
4,1,1,1,0
5,1,0,0,0
6,1,0,0,1
7,0,0,0,1
8,2,1,1,0
9,2,0,0,0
10,1,1,0,0
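As an aside, the same encoding can be sketched with plain dictionaries instead of `match` statements, since `Series.map` also accepts a dict. The names `BLOOD_PRESSURE_CODES` and `sample` below are hypothetical and only serve the illustration; this is an alternative sketch, not the approach the notebook uses.

```python
import pandas as pd

# Hypothetical lookup table mirroring map_blood_pressure above.
BLOOD_PRESSURE_CODES = {'Low': 0, 'Normal': 1, 'High': 2}

# A small made-up Series standing in for one column of the data set.
sample = pd.Series(['High', 'Normal', 'Low'])

# Series.map looks up every value in the dict.
encoded = sample.map(BLOOD_PRESSURE_CODES)
print(encoded.tolist())
```

One difference worth keeping in mind: with a dict, a label that is missing from the table becomes `NaN`, whereas the `match`-based functions return `None` for an unmatched label.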


In [4]:
def map_suffering_from_disease_z(s: str) -> int:
    match s:
        case 'Yes':
            return 1
        case 'No':
            return 0


belief = raw_data[['Suffering from disease Z']].copy()
belief['Suffering from disease Z'] = belief['Suffering from disease Z'].map(map_suffering_from_disease_z)
belief

Data,Suffering from disease Z
1,0
2,0
3,1
4,1
5,1
6,0
7,1
8,0
9,1
10,1


Second, convert them into a NumPy matrix and a NumPy vector.

In [5]:
evidence = np.array(evidence)
evidence

array([[2, 2, 1, 0],
       [2, 2, 1, 1],
       [0, 2, 1, 0],
       [1, 1, 1, 0],
       [1, 0, 0, 0],
       [1, 0, 0, 1],
       [0, 0, 0, 1],
       [2, 1, 1, 0],
       [2, 0, 0, 0],
       [1, 1, 0, 0],
       [2, 1, 0, 1],
       [0, 1, 1, 1],
       [0, 2, 0, 0],
       [1, 1, 1, 1]], dtype=int64)

In [6]:
belief = np.array(belief)
belief

array([[0],
       [0],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0]], dtype=int64)

## Implement Prototype

Let's recall the *Naive Bayes Rule*. For $n$ conditionally independent pieces of evidence $e_1, \dots, e_n$ and $m$ mutually exclusive beliefs $B_1, \dots, B_m$, we have:

$$
P(B_i|E) = \frac{P(B_i) \prod_{j=1}^{n} P(e_j|B_i)}{\sum_{k=1}^{m} P(B_k) \prod_{j=1}^{n} P(e_j|B_k)}
$$

Since we are implementing a classifier, we **only care about which belief has the highest probability**, not about the exact probability of each belief. The denominator is the same for every belief, so we can drop it without changing the winner:

$$
B = \operatorname{argmax}_{B_i}{P(B_i) \prod_{j=1}^{n} P(e_j|B_i)}
$$

The notation $\operatorname{argmax}_{i}f(i)$ denotes the $i$ that gives the largest $f(i)$.
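To see why dropping the denominator is harmless, here is a minimal sketch with made-up numbers. The values in `priors` and `likelihood_products` are hypothetical and chosen only for illustration; they are not taken from the data set.

```python
import numpy as np

# Hypothetical values, for illustration only.
priors = np.array([0.3, 0.7])                 # P(B_i) for two beliefs
likelihood_products = np.array([0.08, 0.02])  # product of P(e_j | B_i) per belief

# Numerator of the Naive Bayes Rule for each belief.
scores = priors * likelihood_products

# Dividing by the shared denominator turns the scores into probabilities.
posteriors = scores / scores.sum()

# The division rescales every entry by the same positive constant,
# so the position of the maximum does not move.
same_winner = np.argmax(scores) == np.argmax(posteriors)
print(scores, posteriors, same_winner)
```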

Besides, let's recall how we find the *prior probability* $P(B)$ and the *likelihood* $P(e|B)$.

The prior probability $P(B)$ represents how likely a belief is on its own, i.e., $\frac{f_B}{N}$, where $f_B$ is the number of samples in which the belief holds and $N$ is the sample size.

The likelihood $P(e|B)$ represents how likely a piece of evidence is given that the belief holds, i.e., $\frac{f_{e,B}}{f_{B}}$, where $f_{e,B}$ is the number of samples in which both the evidence and the belief occur, and $f_B$ is the number of samples in which the belief holds.
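As a quick worked example, take the 14 encoded samples in the NumPy arrays above: 9 of them are labelled 1 (Yes Z) and 5 are labelled 0 (No Z), and 3 of those 5 have high blood pressure (encoded as 2). Plugging these counts into the two formulas gives:

$$
P(B = 1) = \frac{9}{14} \approx 0.643, \qquad P(e_{\text{Blood Pressure}} = 2 \mid B = 0) = \frac{3}{5} = 0.6
$$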

Let's write the code! 

First, let's calculate the prior probabilities for each belief. Here, the index of the array indicates the belief (0 -> False -> No Z; 1 -> True -> Yes Z) and the corresponding value indicates the prior probability for that belief.

In [7]:
belief_frequencies = np.array([np.count_nonzero(belief == i) for i in range(2)])
prior_probabilities = belief_frequencies / belief.size
prior_probabilities

array([0.35714286, 0.64285714])

Second, let's calculate the likelihoods over the whole state space, starting with the raw frequencies...

In [8]:
# Count occurrences for 2 beliefs and 4 evidence features,
# each feature taking up to 3 possible values (0, 1, 2).
# The result is indexed as [belief, feature, value].
likelihoods = np.array([np.count_nonzero(evidence[belief.ravel() == b][:, e] == v)
                        for b in range(2) 
                        for e in range(4) 
                        for v in range(3)])
likelihoods = likelihoods.reshape(2, 4, 3)
likelihoods


array([[[0, 2, 3],
        [1, 2, 2],
        [1, 4, 0],
        [2, 3, 0]],

       [[4, 3, 2],
        [3, 4, 2],
        [6, 3, 0],
        [6, 3, 0]]])
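The nested comprehension above is compact but dense. The following is a more explicit sketch of the same counting step, using `np.bincount` per feature column. It assumes the encoded arrays printed earlier, copied here so the block runs on its own; `belief` is stored as a flat vector rather than a column, and the names `counts`, `n_beliefs`, `n_features` and `n_values` are hypothetical.

```python
import numpy as np

# Encoded data copied from the arrays printed earlier in this notebook.
evidence = np.array([[2, 2, 1, 0], [2, 2, 1, 1], [0, 2, 1, 0], [1, 1, 1, 0],
                     [1, 0, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1], [2, 1, 1, 0],
                     [2, 0, 0, 0], [1, 1, 0, 0], [2, 1, 0, 1], [0, 1, 1, 1],
                     [0, 2, 0, 0], [1, 1, 1, 1]])
belief = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])  # flattened labels

n_beliefs, n_features, n_values = 2, 4, 3
counts = np.zeros((n_beliefs, n_features, n_values), dtype=np.int64)

for b in range(n_beliefs):
    rows = evidence[belief == b]  # samples whose label is b
    for e in range(n_features):
        # bincount tallies how often each encoded value 0..2 occurs in the column
        counts[b, e] = np.bincount(rows[:, e], minlength=n_values)

print(counts)
```

The binary features (Diabetes and Vomit) never take the value 2, so the last entry of their rows is always 0.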

Then the probabilities...

In [9]:
likelihoods = likelihoods.astype(np.float64)
for i in range(2):
    likelihoods[i, :, :] = likelihoods[i, :, :] / belief_frequencies[i]
likelihoods

array([[[0.        , 0.4       , 0.6       ],
        [0.2       , 0.4       , 0.4       ],
        [0.2       , 0.8       , 0.        ],
        [0.4       , 0.6       , 0.        ]],

       [[0.44444444, 0.33333333, 0.22222222],
        [0.33333333, 0.44444444, 0.22222222],
        [0.66666667, 0.33333333, 0.        ],
        [0.66666667, 0.33333333, 0.        ]]])
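A cheap sanity check at this point is that, for every belief and every feature, the likelihoods over all values sum to 1. The sketch below rebuilds the table from the frequencies shown above, using broadcasting instead of the loop; the variable names `counts` and `row_sums` are hypothetical, and the belief frequencies 5 and 9 are the ones implied by the prior probabilities.

```python
import numpy as np

# Frequencies copied from the output above, indexed as [belief, feature, value].
counts = np.array([[[0, 2, 3], [1, 2, 2], [1, 4, 0], [2, 3, 0]],
                   [[4, 3, 2], [3, 4, 2], [6, 3, 0], [6, 3, 0]]])
belief_frequencies = np.array([5, 9])

# Broadcasting divides each belief's 4x3 block by that belief's frequency.
likelihoods = counts / belief_frequencies[:, None, None]

# Each (belief, feature) row is a probability distribution over values.
row_sums = likelihoods.sum(axis=-1)
print(np.allclose(row_sums, 1.0))
```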

## Predicting

With the prior probabilities and the likelihoods, we can calculate a *score* for each belief and choose the belief with the largest score. That is, again:

$$
B = \operatorname{argmax}_{B_i}{P(B_i) \prod_{j=1}^{n} P(e_j|B_i)}
$$

so the score for belief $B_i$ is:

$$
\mathrm{score}(B_i) = P(B_i) \prod_{j=1}^{n} P(e_j|B_i)
$$

Given a patient: 

$$
\text{(Blood Pressure, Fever, Diabetes, Vomit)} = \text{(High, No fever, Yes, Yes)} = (2, 0, 1, 1)
$$

Let's predict whether the patient is suffering from disease Z. 

Here, we retrieve the corresponding likelihoods.

In [10]:
patient = np.array([2, 0, 1, 1])
scores = likelihoods[:, np.arange(4), patient]
scores

array([[0.6       , 0.2       , 0.8       , 0.6       ],
       [0.22222222, 0.33333333, 0.33333333, 0.33333333]])

Then calculate the product of each row.

In [11]:
scores = scores.prod(axis=-1)
scores

array([0.0576    , 0.00823045])

Then multiply by the prior probabilities.

In [12]:
scores *= prior_probabilities
scores

array([0.02057143, 0.00529101])

Find which belief gives the maximum score.

In [13]:
np.argmax(scores, axis=0)

0

The answer is `0`, which means our patient is likely not suffering from disease Z! 🎉🎉🎉
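The prediction steps can be bundled into one function. The sketch below is one possible way to do it, not part of the original prototype: `predict` is a hypothetical helper, and the frequencies are copied from the outputs above so the block runs on its own. It also normalizes the scores, which restores the denominator we dropped and yields actual posterior probabilities.

```python
import numpy as np

# Frequencies copied from the notebook's output, indexed as [belief, feature, value].
counts = np.array([[[0, 2, 3], [1, 2, 2], [1, 4, 0], [2, 3, 0]],
                   [[4, 3, 2], [3, 4, 2], [6, 3, 0], [6, 3, 0]]])
belief_frequencies = np.array([5, 9])
prior_probabilities = belief_frequencies / belief_frequencies.sum()
likelihoods = counts / belief_frequencies[:, None, None]


def predict(patient, priors, likelihoods):
    """Hypothetical helper: return (winning belief, normalized posteriors)."""
    patient = np.asarray(patient)
    features = np.arange(patient.size)
    # Pick P(e_j | B_i) for the patient's value of every feature j.
    per_feature = likelihoods[:, features, patient]
    scores = priors * per_feature.prod(axis=-1)
    # Normalizing assumes at least one score is non-zero.
    posteriors = scores / scores.sum()
    return int(np.argmax(scores)), posteriors


label, posteriors = predict([2, 0, 1, 1], prior_probabilities, likelihoods)
print(label, posteriors)
```

For the patient above, this gives a posterior of about 0.80 for "No Z" against about 0.20 for "Yes Z", which matches the ratio of the two scores.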