# Naive Bayes from Scratch 

This notebook implements **Gaussian Naive Bayes** using **only NumPy**.

### Contents
- Theory recap
- Training functions
- Prediction functions
- Example dataset
- Evaluation


### Import Library

In [1]:
import numpy as np

### Theory Recap

Bayes' theorem:

$$P(y|X) = \frac{P(X|y)P(y)}{P(X)}$$

Assumptions:
- Features are conditionally independent given the class
- Each continuous feature follows a Gaussian distribution within each class
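
Taken together, the two assumptions let the class-conditional likelihood factorize over features. The evidence term $P(X)$ does not depend on the class, so it can be dropped when comparing classes. A sketch of the resulting decision rule in log form, where $d$ (the number of features) and $x_j$ (the $j$-th feature of $X$) are notation introduced here for illustration:

$$\hat{y} = \arg\max_{y} \left[ \log P(y) + \sum_{j=1}^{d} \log P(x_j|y) \right]$$

Working with sums of logs rather than products of probabilities avoids numerical underflow when there are many features.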


### Helper Functions

In [2]:
def calculate_mean_variance(X, y):
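    '''
    Estimate per-class Gaussian parameters and class priors.

    inputs
        X = samples, shape (n_samples, n_features) [numpy array]
        y = class label of each sample, shape (n_samples,) [numpy array]
    
    outputs
        classes = sorted unique class labels (0,1,2,3,etc) [numpy array]
        mean = per-feature mean for each class [dict: class -> numpy array]
        var = per-feature variance for each class [dict: class -> numpy array]
        priors = prior probability of each class [dict: class -> float]
    
    '''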
    
    classes = np.unique(y)
    mean = {}
    var = {}
    priors = {}

    for c in classes:
        X_c = X[y == c] # Rows of X whose label equals class c
        mean[c] = np.mean(X_c, axis=0) # Per-feature mean for class c
        var[c] = np.var(X_c, axis=0) # Per-feature variance for class c
        priors[c] = X_c.shape[0] / X.shape[0] # Fraction of all samples that belong to class c

    return classes, mean, var, priors
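
One detail worth knowing about the variance estimate: `np.var` defaults to `ddof=0`, which divides by `n` (the maximum-likelihood estimate) rather than by `n - 1` (the unbiased sample variance). The sketch below uses made-up scores, not the notebook's dataset, to show how the two differ on a small class.

```python
import numpy as np

# Hypothetical mock test scores for a class with only three samples
scores = np.array([20.0, 24.0, 28.0])

var_mle = np.var(scores)               # divides by n = 3
var_unbiased = np.var(scores, ddof=1)  # divides by n - 1 = 2

print("ddof=0:", var_mle)
print("ddof=1:", var_unbiased)
```

The squared deviations sum to 32, so the two estimates are 32/3 and 32/2. The gap shrinks as the class size grows.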

### Gaussian Probability Density Function


$$P(x|y) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

In [3]:
def gaussian_pdf(x, mean, var):
    # Element-wise Gaussian density; var must be strictly positive
    numerator = np.exp(- (x - mean) ** 2 / (2 * var))
    denominator = np.sqrt(2 * np.pi * var)
    return numerator / denominator
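
Because prediction only ever uses the logarithm of this density, the log can also be computed directly, without calling `exp` first. The sketch below is an alternative formulation, not the function used in this notebook; `gaussian_log_pdf` is a hypothetical name. It matters for points far from the mean, where the density itself underflows to zero.

```python
import numpy as np

def gaussian_log_pdf(x, mean, var):
    """Log of the Gaussian density, computed without exponentiating."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

# A point 40 standard deviations from the mean (illustrative values)
x, mu, sigma2 = 40.0, 0.0, 1.0

# Density computed as in gaussian_pdf: underflows to exactly 0.0 in float64
density = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Log density computed directly: still finite
log_density = gaussian_log_pdf(x, mu, sigma2)

print("density:", density)
print("log density:", log_density)
```

Taking `np.log` of the underflowed density would give `-inf`, while the direct form stays finite and can still be compared across classes.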

### Prediction Function

In [4]:
def predict(X, classes, mean, var, priors):
    '''
    Predict the most probable class for each row of X.

    inputs
        X = samples to classify, shape (n_samples, n_features) [numpy array]
        classes = class labels (0,1,2,3,etc) [numpy array]
        mean = per-feature mean for each class [dict: class -> numpy array]
        var = per-feature variance for each class [dict: class -> numpy array]
        priors = prior probability of each class [dict: class -> float]
    
    outputs
        predictions = predicted class for each row of X [numpy array]
    
    '''
    predictions = []

    for x in X:
        posteriors = []

        for c in classes:
            prior = np.log(priors[c]) # log prior for class c
            conditional = np.sum(np.log(gaussian_pdf(x, mean[c], var[c]))) # sum over features of the log Gaussian density of x under class c
            posterior = prior + conditional
            posteriors.append(posterior) # unnormalized log posterior of class c given x

        predictions.append(classes[np.argmax(posteriors)]) # pick the class with the highest log posterior

    return np.array(predictions)
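
The two nested loops can also be replaced by NumPy broadcasting. The following is a minimal sketch of that idea, assuming the same dictionary layout for `mean`, `var`, and `priors` as above; `predict_vectorized` is a hypothetical name, and the parameter values are made up for the demonstration rather than fitted to the notebook's dataset.

```python
import numpy as np

def predict_vectorized(X, classes, mean, var, priors):
    """Score every sample against every class in one broadcasted pass."""
    mu = np.stack([mean[c] for c in classes])         # (n_classes, n_features)
    sigma2 = np.stack([var[c] for c in classes])      # (n_classes, n_features)
    log_prior = np.log([priors[c] for c in classes])  # (n_classes,)

    # diff has shape (n_samples, n_classes, n_features)
    diff = X[:, None, :] - mu[None, :, :]
    # Sum of per-feature log Gaussian densities -> (n_samples, n_classes)
    log_lik = -0.5 * (np.log(2 * np.pi * sigma2) + diff ** 2 / sigma2).sum(axis=2)

    return classes[np.argmax(log_lik + log_prior, axis=1)]

# Hypothetical parameters, chosen only for illustration
demo_classes = np.array([0, 1])
demo_mean = {0: np.array([2.0, 20.0]), 1: np.array([8.0, 42.0])}
demo_var = {0: np.array([1.0, 9.0]), 1: np.array([1.0, 9.0])}
demo_priors = {0: 0.5, 1: 0.5}

X_demo = np.array([[2.0, 20.0], [7.0, 43.0]])
demo_predictions = predict_vectorized(X_demo, demo_classes, demo_mean, demo_var, demo_priors)
print(demo_predictions)
```

With equal priors and equal variances, each point is assigned to the class whose mean is closer, so the first row goes to class 0 and the second to class 1. For a dataset this small the loop version is just as practical; the broadcasted form mainly helps when there are many samples.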

### Accuracy Function

In [5]:
def accuracy(y_true, y_pred):
    return np.sum(y_true == y_pred) / len(y_true)
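
As a quick sanity check of the arithmetic, the sketch below applies the same computation to hypothetical labels that are easy to verify by hand. It also shows that the mean of the boolean match array gives the same number.

```python
import numpy as np

# Hypothetical labels: positions 0, 2 and 3 match, positions 1 and 4 do not
y_true_demo = np.array([0, 0, 1, 1, 1])
y_pred_demo = np.array([0, 1, 1, 1, 0])

matches = (y_true_demo == y_pred_demo)      # boolean array, True where labels agree
acc_by_count = matches.sum() / len(y_true_demo)
acc_by_mean = np.mean(matches)              # True counts as 1, False as 0

print(acc_by_count, acc_by_mean)
```

Three of the five labels agree, so both expressions evaluate to 3/5.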

### Sample Dataset

| Feature Index | Feature Name          | Meaning                              |
| ------------- | --------------------- | ------------------------------------ |
| `X[:, 0]`     | `study_hours_per_day` | Average hours studied per day        |
| `X[:, 1]`     | `mock_test_score`     | Score in a practice test (out of 50) |

So each row of `X` has the form:

`[study_hours_per_day, mock_test_score]`

| Class Value | Class Name      | Meaning                     |
| ----------- | --------------- | --------------------------- |
| 0           | `Not_Qualified` | Student is unlikely to pass |
| 1           | `Qualified`     | Student is likely to pass   |





In [6]:
X = np.array([
    [0.5, 15],
    [1.0, 18],
    [1.2, 19],
    [1.5, 20],
    [2.0, 21],
    [2.5, 22],
    [3.0, 23],
    [3.5, 24],
    [4.0, 25],
    [4.5, 26],
    [5.0, 30],
    [5.5, 32],
    [6.0, 34],
    [6.2, 35],
    [6.5, 36],
    [7.0, 38],
    [7.2, 39],
    [7.5, 40],
    [8.0, 42],
    [8.2, 43],
    [8.5, 44],
    [9.0, 45],
    [9.2, 46],
    [9.5, 47],
    [10.0, 48],
    [10.5, 49],
    [11.0, 50],
    [11.5, 50],
    [12.0, 50],
    [12.5, 50]
])

y = np.array([
    0, 0, 0, 0, 0,
    0, 0, 0, 0, 0,
    1, 1, 1, 1, 1,
    1, 1, 0, 1, 1,
    1, 1, 1, 1, 1,
    0, 1, 1, 1, 1
])


### Train the Model

In [7]:
classes, mean, var, priors = calculate_mean_variance(X, y)

print("Classes:", classes)
print("Means:", mean)  # per-feature mean for each class
# 0: array([3.475, 25.17]) = students who failed studied about 3.5 hours per day and scored about 25.2 on the mock test, on average
# 1: array([8.49, 42.17]) = students who passed studied about 8.5 hours per day and scored about 42.2 on the mock test, on average
print("Variances:", var)  # per-feature variance for each class
print("Priors:", priors)
print(type(classes))

Classes: [0 1]
Means: {0: array([ 3.475     , 25.16666667]), 1: array([ 8.48888889, 42.16666667])}
Variances: {0: array([ 7.856875  , 86.80555556]), 1: array([ 4.91987654, 42.47222222])}
Priors: {0: 0.4, 1: 0.6}
<class 'numpy.ndarray'>


### Make Predictions

In [8]:
X_test = np.array([
    [2, 20],
    [7, 43]
])

predictions = predict(X_test, classes, mean, var, priors)
print("Predictions:", predictions)

Predictions: [0 1]


### Evaluate the Model

In [9]:
y_pred = predict(X, classes, mean, var, priors)
print("Training Accuracy:", accuracy(y, y_pred))

Training Accuracy: 0.8666666666666667
