## Naive Bayes

### Imports (Not Important)

In [1]:
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from collections import Counter
import numpy as np

In [29]:
dataset = datasets.load_iris()
X, y = dataset['data'], dataset['target']
train_X, test_X, train_y, test_y = train_test_split(X, y)

### Naive Bayes
Classification, in the abstract, looks for the probability of a class given a feature vector:
$$p(\text{Class} ~|~ (x_1, x_2, \ldots, x_n)) = \frac{p((x_1, x_2, \ldots, x_n), \text{Class})}{p((x_1, x_2, \ldots, x_n))}$$
Note that the denominator does not depend on the Class: it is the same for every candidate class, so it cannot change which class scores highest and can be ignored. Thus we only need to find the numerator. By the chain rule, the numerator can be rewritten as
$$p((x_1, x_2, \ldots, x_n), \text{Class}) = p(x_1~|~x_2, \ldots, x_n, \text{Class}) \cdot p(x_2 ~|~ x_3, \ldots, x_n, \text{Class}) \cdots p(x_n ~|~ \text{Class}) \cdot p(\text{Class}) \tag{1}$$

Up to this point everything has been exact. Using our *Naive* assumption that the features are independent of each other given the class, each factor in eq (1) simplifies to
$$p(x_i ~|~ x_{i+1}, \ldots, x_n, \text{Class}) \approx p(x_i ~|~ \text{Class})$$
So now
$$
p(\text{Class} ~|~ (x_1, x_2, \ldots, x_n)) \propto p((x_1, x_2, \ldots, x_n), \text{Class}) \approx p(\text{Class}) \prod_{i=1}^n p(x_i ~|~ \text{Class})
$$
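
To see why dropping the denominator is harmless, here is a minimal sketch with made-up numbers (the priors and likelihoods below are hypothetical, not taken from the iris data). It computes the unnormalized score $p(\text{Class}) \prod_i p(x_i ~|~ \text{Class})$ for two classes, then normalizes, and checks that the winning class is the same either way.

```python
import numpy as np

# Hypothetical values for a two-class, two-feature problem.
priors = {"A": 0.6, "B": 0.4}                      # p(Class)
likelihoods = {"A": [0.5, 0.2], "B": [0.3, 0.9]}   # p(x_i | Class) for the observed x_1, x_2

# Unnormalized score: p(Class) * prod_i p(x_i | Class)
scores = {c: priors[c] * float(np.prod(likelihoods[c])) for c in priors}

# Under the naive model the denominator is the sum of the scores over all classes.
# Dividing by it rescales every class by the same factor.
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

best_by_score = max(scores, key=scores.get)
best_by_posterior = max(posteriors, key=posteriors.get)
print(best_by_score, best_by_posterior)
```

Normalizing only matters if the actual probability values are needed, not just the predicted class.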

### Practical Points
We want to find
$$p(\text{Class}) \prod_{i=1}^n p(x_i ~|~ \text{Class})$$
Practically speaking, we estimate those values simply by counting occurrences in the dataset:
$$p(\text{Class}) = \frac{\text{Count(Class)}}{\text{Number of Datapoints}} \quad p(x_i ~|~ \text{Class}) = \frac{\text{Count(Class and $x_i$)}}{\text{Count(Class)}}$$
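
As a small illustration of the two counting formulas, the sketch below uses a hypothetical toy dataset (`toy_X` and `toy_y` are made up for this example) with a single categorical feature.

```python
import numpy as np

# Hypothetical toy data: one categorical feature, two classes.
toy_X = np.array([0, 0, 1, 1, 1, 0])
toy_y = np.array([0, 0, 0, 1, 1, 1])

n_points = len(toy_y)

# p(Class = 1) = Count(Class = 1) / Number of Datapoints
count_class_1 = int(np.sum(toy_y == 1))
p_class_1 = count_class_1 / n_points

# p(x = 1 | Class = 1) = Count(Class = 1 and x = 1) / Count(Class = 1)
count_both = int(np.sum((toy_y == 1) & (toy_X == 1)))
p_x1_given_class_1 = count_both / count_class_1

print(p_class_1, p_x1_given_class_1)
```

Three of the six points belong to class 1, and two of those three have $x = 1$, so the estimates are $1/2$ and $2/3$.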

In [3]:
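def count_class_and_x(X, y, target_X, target_y):
    """Count rows where the feature column X equals target_X and the label y equals target_y."""
    select = (y == target_y) & (X == target_X)
    return int(np.sum(select))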

In [106]:
class NaiveBayes:
    def __init__(self):
        pass
    
    def fit(self, X, y):
        N = len(y) # N = number of datapoints
        _, n_col = X.shape
        self.classes = np.unique(y)
        self.count_class = dict(Counter(y.tolist()))
        self.p_class = {_class : count/N for _class, count in self.count_class.items()}
        self.count_values = {}
        self.N = N
        
        # col_vals[col][_class][val] = cnt
        col_vals = {}
        for col in range(n_col):
            class_vals = {}
            for _class in np.unique(y):
                count_vals = {}
                for val in np.unique(X[:, col]):
                    cnt = count_class_and_x(X[:, col], y, val, _class)
                    count_vals[val] = cnt
                class_vals[_class] = count_vals
            
            col_vals[col] = class_vals
        
        self.count_values = col_vals
            
                        
    def _compute_class_prob(self, _class, row):
        class_prob = self.p_class[_class]
        
        for col, val in enumerate(row):
            cnt = 0
            if val in self.count_values[col][_class]:
                cnt = self.count_values[col][_class][val]
            
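            # LAPLACE SMOOTHING
            # Adding 1 to the count handles the 0 count case (a value never seen with this class).
            # Note: the denominator here adds N, the number of training points. Textbook add-one
            # smoothing adds the number of distinct values of the feature instead, so these
            # smoothed values do not sum to 1 over the values of a feature.
            prob = (cnt + 1.)/(self.count_class[_class] + self.N)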
            class_prob *= prob
        return class_prob
    
    def predict(self, row):
        """Return (class, score) for the best class; the score is unnormalized, not a probability."""
        class_probs = [(_class, self._compute_class_prob(_class, row)) for _class in self.classes]
        return max(class_probs, key=lambda x : x[1])
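
For comparison with the smoothing step above, here is a minimal sketch of the common textbook form of add-one (Laplace) smoothing, in which the denominator grows by the number of distinct values the feature can take. The function name `laplace_prob`, its parameters, and the counts are hypothetical and not part of the class above.

```python
def laplace_prob(count, class_count, n_values, alpha=1.0):
    """Add-alpha smoothed estimate of p(x_i = value | Class).

    count       -- times this value was seen together with the class
    class_count -- number of training points in the class
    n_values    -- number of distinct values the feature can take
    """
    return (count + alpha) / (class_count + alpha * n_values)


# Hypothetical counts for one feature with 3 possible values, in a class of 10 points.
counts = [7, 3, 0]
probs = [laplace_prob(c, class_count=10, n_values=3) for c in counts]
print(probs, sum(probs))
```

With this denominator the unseen value still gets a nonzero probability, and the smoothed probabilities for a feature sum to 1 within each class.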
            
            

### Dummy Test
- Fit the classifier on all the data, then feed it back data it has already seen (here `test_X`, which is a subset of `X`) and see how well it does

In [124]:
NB = NaiveBayes()
NB.fit(X, y)
num_correct = 0
for i in range(len(test_X)): 
    _guess, confidence = NB.predict(test_X[i])
    if _guess == test_y[i]:
        num_correct += 1
print("Accuracy", num_correct/len(test_X))

Accuracy 0.9736842105263158


### Generality Test
- Testing with data the classifier has not seen before

In [125]:
NB = NaiveBayes()
NB.fit(train_X, train_y)
num_correct = 0
for i in range(len(test_X)): 
    _guess, confidence = NB.predict(test_X[i])
    if _guess == test_y[i]:
        num_correct += 1
print("Accuracy", num_correct/len(test_X))

Accuracy 0.9473684210526315
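
Both tests repeat the same counting loop, so it could be factored into a helper. The sketch below is one possible way to do that. The names `accuracy` and `ConstantStub` are hypothetical, and the stub only stands in for any object whose `predict(row)` returns a `(label, score)` pair, as `NaiveBayes.predict` does.

```python
def accuracy(model, X, y):
    """Fraction of rows for which model.predict returns the true label."""
    num_correct = 0
    for row, label in zip(X, y):
        guess, _score = model.predict(row)
        if guess == label:
            num_correct += 1
    return num_correct / len(y)


class ConstantStub:
    """Hypothetical stand-in that always predicts the same label."""

    def __init__(self, label):
        self.label = label

    def predict(self, row):
        return self.label, 1.0


# Always guessing 0 is right for three of these four hypothetical labels.
demo_acc = accuracy(ConstantStub(0), [[1.0], [2.0], [3.0], [4.0]], [0, 0, 1, 0])
print("Accuracy", demo_acc)
```

With the classifier above, the same helper would be called as `accuracy(NB, test_X, test_y)`.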
