## Assignment 1: Bayes Classifier, Question 3

In this question, you will implement the **Categorical Naive Bayes classifier** from scratch.
To gain a deeper understanding of how the algorithm works internally, you must not
rely on any pre-implemented models (such as those provided by the scikit-learn library).

Through this process, you will explore how probabilities are estimated from data, 
how predictions are made using conditional independence assumptions, and how numerical 
issues can arise in probabilistic models.
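
The numerical issue mentioned above can be seen in a small sketch. The probability value and the number of features are made up for illustration and are not taken from the assignment data. Multiplying many small conditional probabilities underflows to zero in floating point, while summing their logarithms stays finite.

```python
import numpy as np

# Hypothetical example: 1000 features, each with conditional probability 0.01
probs = np.full(1000, 0.01)

# Direct product underflows: 0.01**1000 = 1e-2000 is far below the float64 range
direct_product = np.prod(probs)

# The sum of logs represents the same quantity without underflow
log_sum = np.sum(np.log(probs))

print(direct_product)  # 0.0
print(log_sum)         # roughly -4605.17, i.e. 1000 * ln(0.01)
```

This is why a Naive Bayes classifier is usually written in terms of log-probabilities rather than raw products.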

Unless explicitly stated otherwise, do not change the library imports section of this
notebook. In particular, you may not import any additional Python libraries.

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

**Data Loader and Processing Section**

In [2]:
# =====================================================================================
# TODO: Using Numpy, extract CSV data. Note: Please omit header row.

my_data = np.loadtxt('transactions.csv', delimiter=',', skiprows=1, dtype=str)

# =====================================================================================
# TODO: Split dataset into a feature matrix and label vector.

tmp_feature_mat = my_data[:, :-1]
tmp_label_vector = my_data[:, -1]  # last column, already a 1D array

# =====================================================================================
# TODO: Convert categorical strings to integers for each feature. 
# (Example Low, Medium, High -> 0, 1, 2)

# Each mask selects the matching cells across the whole matrix.
# Cells that match none of the strings below keep the default value 0.
feature_mat = np.zeros(tmp_feature_mat.shape, dtype=int)
feature_mat[tmp_feature_mat == "Low"] = 0
feature_mat[tmp_feature_mat == "Medium"] = 1
feature_mat[tmp_feature_mat == "High"] = 2
feature_mat[tmp_feature_mat == "Old"] = 0
feature_mat[tmp_feature_mat == "Recent"] = 1
feature_mat[tmp_feature_mat == "New"] = 2
feature_mat[tmp_feature_mat == "Morning"] = 0
feature_mat[tmp_feature_mat == "Afternoon"] = 1
feature_mat[tmp_feature_mat == "Evening"] = 2
feature_mat[tmp_feature_mat == "Night"] = 3

label_vector = np.full(tmp_label_vector.shape, fill_value=-10, dtype=int)
label_vector[tmp_label_vector == "no"] = 0
label_vector[tmp_label_vector == "yes"] = 1

# =====================================================================================
# TODO: Use train_test_split function from the Scikit-learn library to split data 
# into training (75%) and test (25%) sets.
# *IMPORTANT: Add the argument random_state=42*
# Hint: Use stratify=y to maintain class distribution

X_train, X_test, y_train, y_test = train_test_split(
    feature_mat, 
    label_vector, 
    test_size=0.25, 
    train_size=0.75,
    stratify=label_vector,
    random_state=42
)
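
One possible alternative to the mask assignments above is a per-column lookup that fails loudly on an unexpected string instead of leaving it at the default 0. The helper `encode_column` and the sample values are hypothetical, written only for illustration. They are not part of the assignment template.

```python
import numpy as np

def encode_column(column, ordered_levels):
    """Map each string in `column` to its position in `ordered_levels`."""
    lookup = {level: code for code, level in enumerate(ordered_levels)}
    # Collect any strings that have no code so the error message is informative
    unknown = sorted(set(column.tolist()) - set(lookup))
    if unknown:
        raise ValueError(f"unexpected categories: {unknown}")
    return np.array([lookup[value] for value in column], dtype=int)

# Hypothetical column values, not read from transactions.csv
amounts = np.array(["Low", "High", "Medium", "Low"])
codes = encode_column(amounts, ["Low", "Medium", "High"])
print(codes.tolist())  # [0, 2, 1, 0]
```

Passing the levels in an explicit order keeps the Low, Medium, High -> 0, 1, 2 convention, which an alphabetical encoding would not.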

**Categorical Naive Bayes Classifier**

Write your code wherever you see the keyword `pass`, and remove the `pass` statement once you have done so.

In [3]:
class CategoricalNaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.classes = None             # Unique class labels
        self.priors = None              # Class prior probabilities
        self.feature_cond_probs = None  # List to store feature probabilities per class

    def fit(self, X, y):
        """
        TODO:
        - Identify unique classes
        - Compute class priors P(y)
        - Count occurrences of each feature value per class
        - Apply Laplace smoothing (alpha)
        """
        
        # get the classes and their counts; np.unique returns the classes
        # already sorted, with counts in the same order
        self.classes, counts = np.unique(y, return_counts=True)
        # divide the count of each class by the total number of samples.
        # This gives the prior P(y = c) for each class.
        self.priors = counts / len(y)
        
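        # find the number of categories |V| for each feature and store it
        features_category_count = np.zeros(X.shape[1], dtype=int)
        for f_idx in range(X.shape[1]):
            # categories are encoded as 0..|V|-1, so |V| is the largest code plus one;
            # unlike counting unique values, this stays valid if a code is absent from X
            features_category_count[f_idx] = X[:, f_idx].max() + 1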

        # find P(category | class) for every category of every feature
        self.feature_cond_probs = []
        # loop over classes
        for c in self.classes:
            # keep only the samples that belong to the current class
            X_class = X[y == c]
            n_samples_in_c = len(X_class)

            # one list of per-feature probability arrays for this class
            self.feature_cond_probs.append([])

            # loop over features
            for f_idx in range(X.shape[1]):
                V = features_category_count[f_idx]
                # count each category within this class; unseen categories get count 0
                category_counts = np.bincount(X_class[:, f_idx], minlength=V)
                # Laplace smoothing: (count + alpha) / (N_c + alpha * |V|)
                category_probs = (category_counts + self.alpha) / (n_samples_in_c + self.alpha * V)
                self.feature_cond_probs[-1].append(category_probs)
                

    def compute_log_likelihood(self, X):
        """
        TODO:
        - For each sample and each class:
        - Sum log-probabilities of all features given the class
        - Add log prior of the class
        - Return a log-likelihood array of shape (n_samples, n_classes)
        """
        log_likelihood = np.zeros((len(X), len(self.classes)))

        for i, x in enumerate(X):
            # loop through the classes by position, so that class labels
            # need not be the integers 0..n_classes-1
            for class_idx in range(len(self.classes)):
                # start from the log prior of the class
                log_sum = np.log(self.priors[class_idx])
                # for each feature, look up the sample's category and add the
                # log of its conditional probability given the class
                for feature_idx, category in enumerate(x):
                    log_sum += np.log(self.feature_cond_probs[class_idx][feature_idx][category])
                log_likelihood[i, class_idx] = log_sum

        return log_likelihood

    def predict(self, X):
        """
        TODO:
        - Use compute_log_likelihood
        - Return the class with highest log-posterior for each sample
        """
        log_likelihood = self.compute_log_likelihood(X)

        # argmax gives a column index; map it back to the actual class label
        return self.classes[np.argmax(log_likelihood, axis=1)]
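
The Laplace smoothing step in `fit` can be checked by hand on a tiny example. The feature column below is hypothetical and stands for the encoded values of one feature among the samples of a single class. It is a minimal sketch of the formula (count + alpha) / (N_c + alpha * |V|), not a replacement for the class above.

```python
import numpy as np

# Hypothetical encoded feature column for the samples of one class
feature_values = np.array([0, 0, 2, 0, 2])
n_categories = 3   # categories 0, 1, 2; category 1 never appears in this class
alpha = 1.0

# Per-category counts, including a zero for the unseen category: 3, 0, 2
counts = np.bincount(feature_values, minlength=n_categories)

# Smoothed estimates: (3+1)/8, (0+1)/8, (2+1)/8
smoothed = (counts + alpha) / (len(feature_values) + alpha * n_categories)

print(smoothed.tolist())  # [0.5, 0.125, 0.375]
print(smoothed.sum())     # 1.0
```

Without smoothing, category 1 would get probability 0, and its logarithm would be negative infinity, which would rule out the class for any sample containing that category.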

**Classifier training and testing**

In [4]:
# =====================================================================================
# TODO: Create an instance of your CategoricalNaiveBayes classifier and train it

cnb = CategoricalNaiveBayes()
cnb.fit(X_train, y_train)

# =====================================================================================
# TODO: Make predictions on the test data

y_pred = cnb.predict(X_test)

# =====================================================================================
# TODO: Using Scikit-learn's accuracy_score and classification_report functions, print
# both the accuracy score and the classification_report metrics for your classifier

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}\n\n")
print(classification_report(y_test, y_pred))

Accuracy: 0.9000


              precision    recall  f1-score   support

           0       0.92      0.94      0.93        35
           1       0.86      0.80      0.83        15

    accuracy                           0.90        50
   macro avg       0.89      0.87      0.88        50
weighted avg       0.90      0.90      0.90        50
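
To see how the figures in the report relate to each other, the sketch below recomputes them from a confusion matrix. The counts are an assumption: they are inferred from the supports (35 and 15) and the rounded recalls in the report above, and were not printed by the notebook.

```python
import numpy as np

# Inferred counts (rows = true class, columns = predicted class).
# Assumption: reconstructed from the printed report, not notebook output.
confusion = np.array([[33, 2],
                      [3, 12]])

correct = np.diag(confusion)                  # correctly classified per class
accuracy = correct.sum() / confusion.sum()    # 45 / 50
precision = correct / confusion.sum(axis=0)   # divide by predicted totals
recall = correct / confusion.sum(axis=1)      # divide by true totals (support)
f1 = 2 * precision * recall / (precision + recall)

print(round(float(accuracy), 2))        # 0.9
print(np.round(precision, 2).tolist())  # [0.92, 0.86]
print(np.round(recall, 2).tolist())     # [0.94, 0.8]
print(np.round(f1, 2).tolist())         # [0.93, 0.83]
```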

