In this notebook, I implement a Multinomial Naive Bayes classifier on the public [Adult Census Income](https://www.kaggle.com/datasets/uciml/adult-census-income) dataset from 1994, reviewing the relevant Bayesian methods behind the algorithm. 

Bayes' theorem states that 
$$
P(A | B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B|A) P(A)}{P(B)}\quad \Longleftrightarrow \quad \text{posterior} = \frac{\text{class likelihood} \times \text{prior}}{\text{evidence}}
$$
* $P(A \cap B)$: the probability of both $A$ and $B$ occurring
* $P(A|B)$: the probability of $A$ given $B$
* $P(B|A)$: the probability of $B$ given $A$
* $P(A)$: the probability of $A$ occurring
* $P(B)$: the probability of $B$ occurring
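
As a quick numeric check of the theorem, here is a minimal sketch with made-up probabilities (the three input values are assumptions chosen only for illustration). The evidence $P(B)$ is obtained from the law of total probability.

```python
# Hypothetical inputs, chosen only for illustration.
p_a = 0.3               # P(A): prior
p_b_given_a = 0.8       # P(B|A): likelihood of B when A holds
p_b_given_not_a = 0.2   # P(B|not A): likelihood of B when A does not hold

# Evidence P(B) via the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(A|B) from Bayes' theorem.
p_a_given_b = p_b_given_a * p_a / p_b

print("P(B)   =", round(p_b, 4))
print("P(A|B) =", round(p_a_given_b, 4))
```

Observing $B$ raises the probability of $A$ above its prior here because $B$ is more likely when $A$ holds than when it does not.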

When two events $A$ and $B$ are independent, 
$$
P(A \cap B) = P(A) \cdot P(B)
$$

For a class variable $y$ and a feature vector $X$, we can apply Bayes' theorem: 
$$
P(y|X) = \frac{P(X|y)P(y)}{P(X)}, \text{where } X = (x_1, x_2, x_3,...,x_n)
$$

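The Naive Bayes approximation assumes that the feature dimensions (the elements of $X$) are conditionally independent given the class $y$, so the likelihood $P(X|y)$ factorizes into per-feature terms. Applying this to our posterior probability: 
$$
P(y|x_1,...,x_n) = \frac{P(x_1|y)P(x_2|y)...P(x_n|y)P(y)}{P(x_1,...,x_n)} 
$$
Since the evidence $P(x_1,...,x_n)$ does not depend on $y$, it can be dropped when comparing classes: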
$$
P(y|x_1,...,x_n) \propto P(y) \prod_{j=1}^n P(x_j |y)
$$
For class label $k$, 
$$
P(y=k|x) \propto P(y=k) \prod_{j=1}^n P(x_j | y = k)
$$
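
To make the proportionality concrete, the following sketch scores two classes for a single instance. The class names, priors, and likelihood values are all hypothetical. It also shows that dividing by the sum of the scores recovers proper posteriors without changing which class wins.

```python
import math

# Hypothetical priors P(y = k) for two classes.
priors = {"k1": 0.6, "k2": 0.4}

# Hypothetical likelihoods P(x_j | y = k) for the observed values of three features.
likelihoods = {
    "k1": [0.5, 0.2, 0.1],
    "k2": [0.3, 0.6, 0.4],
}

# Unnormalized posterior: P(y = k) * prod_j P(x_j | y = k)
scores = {k: priors[k] * math.prod(likelihoods[k]) for k in priors}

# The evidence is just the normalizing constant shared by all classes.
evidence = sum(scores.values())
posteriors = {k: score / evidence for k, score in scores.items()}

print("unnormalized:", scores)
print("normalized:  ", posteriors)
print("prediction:  ", max(scores, key=scores.get))
```

Although `k1` has the larger prior in this toy setup, the likelihood terms favor `k2` strongly enough to flip the prediction.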

We must now estimate the model parameters ($\theta$'s): each class probability $P(y = k)$ and each class-conditional probability $P(x_j = v | y = k)$. Beginning with the simpler case, 
$$
\theta_k = P(y = k) = \frac{N_k + \alpha}{n+ \alpha \times K}
$$
* $N_k$: number of instances with label $k$
* $n$: number of training instances
* $K$: number of unique classes
* $\alpha$: Laplace smoothing parameter
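
The smoothed prior can be sketched in a few lines of plain Python. The helper name `smoothed_priors` and the toy labels are assumptions for illustration and are not part of the classifier defined later in this notebook.

```python
from collections import Counter

def smoothed_priors(labels, alpha=1.0):
    """Return {class: (N_k + alpha) / (n + alpha * K)} for a list of labels."""
    counts = Counter(labels)   # N_k for every class k
    n = len(labels)            # number of training instances
    K = len(counts)            # number of unique classes
    return {k: (N_k + alpha) / (n + alpha * K) for k, N_k in counts.items()}

# Illustrative labels only: three instances of one class, one of the other.
toy_labels = ["<=50K", "<=50K", "<=50K", ">50K"]
toy_priors = smoothed_priors(toy_labels)
print(toy_priors)
```

With $\alpha = 1$, the estimates are $4/6$ and $2/6$ rather than the raw frequencies $3/4$ and $1/4$: smoothing pulls the priors slightly toward uniform while still summing to one.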

To calculate class-conditional probabilities, 
$$
\theta_{k,j,v} = P(x_j = v_j | y = k) = \frac{N_{k,v_j} + \alpha}{N_k + \alpha \times V_j}
$$
* $N_{k,v_j}$: the number of times the value $v_j$ occurs in feature $x_j$ in training instances where the target class is $k$
* $N_k$: the number of training instances where the target class is $k$ (equivalently, the total count of values of feature $x_j$ over those instances)
* $V_{j}$: the number of distinct values that feature $x_j$ can take
* $\alpha$: Laplace smoothing parameter
  
By setting $\alpha = 1$, we apply Laplace (add-one) smoothing to handle the zero-frequency problem: a feature value that was never observed with a class would otherwise get probability zero and wipe out the entire product for that class. 
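
A small sketch of the zero-frequency problem and its fix, using hypothetical counts for one feature within a single class (the helper `smoothed_conditional` and the value names are assumptions for illustration):

```python
def smoothed_conditional(n_kv, n_k, n_values, alpha=1.0):
    """Return (N_{k,v_j} + alpha) / (N_k + alpha * V_j)."""
    return (n_kv + alpha) / (n_k + alpha * n_values)

# Hypothetical counts of one feature's values among the instances of a single class k.
value_counts = {"Private": 7, "Self-emp": 3, "Never-worked": 0}
n_k = sum(value_counts.values())   # N_k: instances in class k
v_j = len(value_counts)            # V_j: distinct values of the feature

unsmoothed = {v: c / n_k for v, c in value_counts.items()}
smoothed = {v: smoothed_conditional(c, n_k, v_j) for v, c in value_counts.items()}

print("unsmoothed:", unsmoothed)   # the unseen value gets probability 0
print("smoothed:  ", smoothed)     # every value gets a nonzero probability
```

The smoothed estimates are $8/13$, $4/13$, and $1/13$, so they still sum to one while keeping the unseen value from zeroing out the product.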

Lastly, to make a prediction about a new instance, we choose the class with the largest posterior. Because a product of many small probabilities can underflow, we move the calculation to log space; the logarithm is monotonic, so the argmax is unchanged, and the product becomes a sum of logs:
$$
P(y=k|x) \propto P(y=k) \prod_{j=1}^n P(x_j | y = k) 
$$
$$
\hat{y} = \underset{k \in \{1,2,...,K \}  }{\text{argmax}} P(y=k) \prod_{j=1}^n P(x_j | y = k) 
$$
$$\boxed{\boxed{
\hat{y} = \underset{k \in \{1,2,...,K \}  }{\text{argmax}} \left[ \log P(y=k) + \sum_{j=1}^n \log P(x_j | y = k) \right] }}
$$
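
The motivation for working in log space can be seen with a deliberately extreme, hypothetical example: 500 features, each with likelihood 0.01. The direct product underflows to zero in double precision, while the sum of logs stays finite and comparable across classes.

```python
import math

# Hypothetical setup: one class with prior 0.5 and 500 feature likelihoods of 0.01.
prior = 0.5
likelihoods = [0.01] * 500

# Direct product: 0.5 * 0.01**500 is far below the smallest positive double.
product_score = prior * math.prod(likelihoods)

# Log space: log P(y = k) + sum_j log P(x_j | y = k)
log_score = math.log(prior) + sum(math.log(p) for p in likelihoods)

print("product score:", product_score)
print("log score:    ", log_score)
```

Once every class's product has collapsed to `0.0`, the argmax is meaningless, which is why the classifier below stores and adds log probabilities.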

In [3]:
import pandas as pd
import numpy as np

In [4]:
class NaiveBayesClassifier:
    def __init__(self, alpha=1):                                    # Initialize an untrained NB classifier
        self.alpha = alpha                                          # Laplacian smoothing as default
        self.log_priors = None
        self.class_conditional_probs = None
        self.features = None

    def train(self, data, target_class):
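        """
        Estimates log priors and log class-conditional probabilities from `data`,
        a DataFrame of categorical features that includes the `target_class` column.
        """
        # Calculating class frequencies and log priors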
        training_instances = len(data)                              # Count the number of instances in the training data
        class_frequencies = {}                                      # Calculate the frequency of each class in target column (helpful for priors)
        for i in range(training_instances):
            class_label = data.iloc[i][target_class]
            if class_label in class_frequencies:
                class_frequencies[class_label] += 1
            else:
                class_frequencies[class_label] = 1
        self.log_priors = {}                                        # Calculate log of prior probabilities for each class
        for class_label, freq in class_frequencies.items():
            theta = (freq + self.alpha) / (training_instances + self.alpha * len(class_frequencies))
            self.log_priors[class_label] = np.log(theta)

        self.class_conditional_probs = {}                           # Initialize structure to store class-conditional probabilities
        for class_value in class_frequencies.keys():                # Iterate over every feature for every class, skipping target class
            self.class_conditional_probs[class_value] = {}
            for feature in data.columns:                            # Count instances of this feature for the current class
                if feature != target_class:
                    total_count = 0
                    for feature_value in data[feature].unique():
                        total_count += len(data[(data[target_class] == class_value) & (data[feature] == feature_value)])

                    V_j = len(data[feature].unique())                                           # Count how many unique values (options) for this feature
                    for feature_value in data[feature].unique():                                # Now, iterating over each unique value of the feature
                        N_k_vj = len(data[(data[target_class] == class_value) & (data[feature] == feature_value)])
                        theta_k_j_v = (N_k_vj + self.alpha) / (total_count + self.alpha * V_j)
                        if feature not in self.class_conditional_probs[class_value]:            # Log-transform probability and store it in dictionary
                            self.class_conditional_probs[class_value][feature] = {}
                        self.class_conditional_probs[class_value][feature][feature_value] = np.log(theta_k_j_v)

        self.features = [feature for feature in data.columns if feature != target_class]        # List of columns that will be used to train

    def predict(self, new_instance):
        class_probabilities = {}                                        # Initialize structure to store probability of new instance being each class
        for class_label in self.log_priors.keys():                      # Use each prior as initial guess...
            log_prob = self.log_priors[class_label]
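            for feature in self.features:                               # add the log conditional probability of each feature given the current class...
                feature_value = new_instance.get(feature)
                feature_log_probs = self.class_conditional_probs[class_label][feature]
                if feature_value in feature_log_probs:                  # Values never seen in training are skipped instead of raising a KeyError
                    log_prob += feature_log_probs[feature_value]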
            class_probabilities[class_label] = log_prob

        return max(class_probabilities, key=class_probabilities.get)    # and return class which yields the highest probability

In [22]:
classifier = NaiveBayesClassifier()

train_data = pd.read_csv("./datasets/1994_census_cleaned_train.csv")    # Training NB classifier on 1994 Adult Census Income
target_class = "sex"
classifier.train(train_data, target_class)

test_data = pd.read_csv("./datasets/1994_census_cleaned_test.csv")      # Testing it on a different unique batch 

correct_labels, incorrect_labels = 0, 0
for i in range(len(test_data)):
    new_instance = {}                                                   # Feature values of the current test row
    for j in range(len(test_data.columns)):
        if test_data.columns[j] != target_class:
            new_instance[test_data.columns[j]] = test_data.iloc[i, j]   # Positional lookup: row i, column j
        else:
            correct_value = test_data.iloc[i, j]                        # True label of the current row
    if classifier.predict(new_instance) == correct_value:
        correct_labels += 1
    else:
        incorrect_labels += 1

print("Accuracy rate:", round(correct_labels / (correct_labels + incorrect_labels), 2))     # Fraction of correct predictions, between 0 and 1

Accuracy rate: 0.82


If we instead had a dataset of continuous (rather than discrete) variables, we could apply a similar classification method: Gaussian Naive Bayes. To do this, we first assume that, within each class, each feature is normally (Gaussian) distributed. For each class, we can then calculate the mean and variance of each feature. That is, for a dataset with features $x_1, x_2,...,x_n$ and classes $k_1, k_2,...,k_m$, for each feature $x_i$ in class $k_j$, we calculate the mean $\mu_{ij}$ and variance $\sigma_{ij}^2$. For example, the mean is the sum of all values of feature $x_i$ over the instances in class $k_j$, divided by the number of instances $N_{k_j}$ in class $k_j$. These are calculated as:
$$
\mu_{ij} = \frac{\sum_{x \in k_j} x_i }{N_{k_j}} \quad \text{and} \quad \sigma_{ij}^2 = \frac{\sum_{x \in k_j} \left( x_i - \mu_{ij} \right)^2 }{N_{k_j}} 
$$
We can then use the Gaussian probability density function to evaluate the likelihood (a density, not a probability mass) of observing a specific value $x$ for feature $x_i$ given that the instance belongs to class $k_j$. 
$$
P(x_i = x | k_j) = \frac{1}{\sqrt{2 \pi \sigma_{ij}^2}} \exp \left(- \frac{\left( x - \mu_{ij} \right)^2}{2\sigma_{ij}^2}\right)
$$
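
The Gaussian variant can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not part of the classifier above: the function names and the toy data are hypothetical, priors are unsmoothed class frequencies, and every feature is assumed to have nonzero variance within each class (a zero variance would cause a division by zero).

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Return {class: (prior, means, variances)} using population variance (divide by N)."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    params = {}
    for label, rows in rows_by_class.items():
        n_k = len(rows)
        columns = list(zip(*rows))                       # one tuple of values per feature
        means = [sum(col) / n_k for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n_k
                     for col, m in zip(columns, means)]
        params[label] = (n_k / len(X), means, variances)
    return params

def log_gaussian(x, mean, var):
    """Log of the Gaussian density with the given mean and variance, evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict_gaussian_nb(params, x):
    """Return the class maximizing log prior + sum of per-feature log densities."""
    scores = {}
    for label, (prior, means, variances) in params.items():
        scores[label] = math.log(prior) + sum(
            log_gaussian(xi, m, v) for xi, m, v in zip(x, means, variances))
    return max(scores, key=scores.get)

# Toy data: two well-separated classes with two continuous features each.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 4.0], [3.2, 4.1], [2.9, 3.8]]
y = ["low", "low", "low", "high", "high", "high"]

params = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(params, [1.1, 2.1]))
print(predict_gaussian_nb(params, [3.1, 3.9]))
```

As in the categorical case, scoring is done in log space, so the per-feature densities are added rather than multiplied.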