# Assignment 1

---

Author: Adam Ryan

Created Date: 2021-09-14

Date Last Modified: 2021-10-24

Description: A solution to Assignment 1 regarding the implementation of MyGaussianNB which implements Gaussian Naive Bayes.

---

## Task Description

### Objective
The objective of this assignment is to implement a Gaussian Naive Bayes classifier in the scikit-learn framework. A notebook (MajorityClassClf) is provided with a simple example of a classifier that works with scikit-learn.

Note: The code developed in this assignment will be extended in the second assignment to allow for missing values. 

### Requirements
The notebook MajorityClassClf contains some basic code to help you get started. 
Provide a python class MyGaussianNB that implements Gaussian Naive Bayes. The conditional probabilities should be calculated as follows:
        
       


\begin{equation}
\mathrm{P}(x_i | y) = \frac{1}{\sqrt{2 \pi (\sigma_{y_i})^2}} e^{-\frac{(x_i - \mu_y)^2}{2(\sigma_{y_i})^2}} = \frac{1}{\sqrt{2 \pi (\sigma_{y_i})^2}} \text{exp}({-\frac{(x_i - \mu_y)^2}{2(\sigma_{y_i})^2}})
\end{equation}


where 𝜇y is the mean for variable i for class y and 𝜎y is the corresponding standard deviation. Thereafter the classification should use the NB formulae presented in the lectures. Alternatives that use addition of conditional probabilities or logs should not be used. 
    
The API specification for sklearn classifiers is here: https://scikit-learn.org/stable/developers/develop.html 
You should implement the ‘fit’ and ‘predict’ methods, there is no need to implement ‘predict_proba’. 
Prior probabilities should be calculated from the training data. With this, there will be no need to pass parameters when instances are created. 

Test the performance of your implementation against the GaussianNB implementation in scikit-learn. You should use a range of datasets for this testing. Possible test sets used in lectures are penguins,  diabetes and glassV2. 

### Submission
This is an individual (not group) project. Submission is through the Brightspace page. Your submission should comprise your notebook and the second dataset that you use. Clear all outputs in the notebook before saving for submission. You can use markdown cells in the notebook to report your findings and conclusions. 

---


## Solution

### Explanation

1. A function is created which creates models: SK Naive Bayes and My Naive Bayes.
2. The My Naive Bayes function is defined. 
    1. Fit does validation and prepares the class stats for predict.
    2. Predict validates it has been fitted. It goes through each row in the test data, and for each row for every class it goes through each column and gets the probability using the Gaussian NB PDF and then derives total row probability for that class. I do not normalise the probabilities across n-many classes as the only piece which matters is the max class. I add this to an array and return it. Validation is done against std.Dev=0 by setting it to the square root of the minimum float in this instance.
3. Some functions are defined to create models (MyNaiveBayes and SKLearnBaye) and analyse them.
4. Functions are designed to read in all datasets and test my implementation and the default implementation and validate they produce equal classification reports.
5. We observe equality in the classification results we are obtaining.

#### Required Modules

In [1]:
####--------------------------------------
#00.Import Modules
####--------------------------------------


######---------BEGIN
#      SUPPRESS DEPRECIATION WARNINGS: Applicable to datetime_is_numeric=True
######--------END

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

######---------BEGIN
#      ML
######--------END

#import nltk as nl
import sklearn as sk
#import xgboost as xg
#import pymc3 as pymc
#import sympy as sym

from sklearn.model_selection import train_test_split, KFold, cross_val_score

from sklearn import metrics
from sklearn.metrics import mean_squared_error, accuracy_score, precision_score, recall_score

from sklearn.preprocessing import MinMaxScaler, StandardScaler

from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels

from sklearn.naive_bayes import GaussianNB

from sklearn.base import BaseEstimator, ClassifierMixin

#from sklearn.tree import export_graphviz
#from sklearn.tree import DecisionTreeClassifier
#from sklearn.ensemble import RandomForestClassifier


######---------BEGIN
#      SQL
######--------END



######---------BEGIN
#     GENERAL
######--------END

import pandas as pd
import numpy as np
import sys
import time

######---------BEGIN
#     DATA VIS
######--------END

import matplotlib as mp
#from bokeh import *
#from dash import *

import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.dates as mdates

# My Gaussian Naive Bayes Implementation

In [2]:
class MyGaussianNB(BaseEstimator, ClassifierMixin):
    """A class to capture My Gaussian Naive Bayes
    
    Input: 
    BaseEstimator
    ClassifierMixin
    
    Output:
    An item of the MGNB class.
    """
    
    
    def __init__(self):
        """Initialise
        
        Fit function populates each Attribute.
        
        Important Note: I am not defining the attributes here, they're merely being documented.
        This is done so that the is_fitted method works to validate that the instance has been fitted, and raises an error if not"""
    
        #Initial data
        ##self.X_=None
        ##self.y_=None
        
        #Unique Class Labels
        #self.classes_=None
        
        #Statistics for each class - Mean, std, Perc
        ##self.summary_stats_=dict()
        
        #Matrix for each Class
        ##self.class_dict_=dict()
        
        #Predict Variables
        ##self.X_test_=None
        ##self.y_test_=None
        
        #Cont Table
        ##self.cont_dict_=dict()
        ##self.raw_dict_=dict()
        
        
    
    def fit(self, Xt, yt):
        """A function to fit the data.
        
        ---
        First it validates the checks 
        -> Is it a Dataframe, list, or Numpy Array?
        -> Is it non-empty?
        -> Are they the same size?
        -> Is it numeric only?
        
        Second:
        It sets the X and y values
        
        Third:
        It gets the unique clases.
        
        Fourth:
        It gets the summary stats which are needed for the probability calculation for each class.
        ---
        
        I work solely with numpy arrays. 
        In hindsight it would have been much easier to use Pandas dataframes but I believe numpy is meant to be faster as it's based in C.
        
        Input:
        Xt: List - X values
        yt: List - y values
        
        Output:
        ???"""
        
        #Validate the type of the inputs -> Note I wrote this before seeing SKLearn recommendation
        Xt=self.__validate_input_type(Xt)
        yt=self.__validate_input_type(yt)  
        
        #Validate the training set contains numbers
        if not self.__validate_numeric(Xt):
            raise ValueError("All inputs in the training set should be numeric")
        
        if not self.__validate_length(X=Xt,y=yt):
            raise ValueError("Arrays are of different length")
            
        if len(yt)<1:
            raise ValueError("Empty arrays are not alloweed")
            

        #https://scikit-learn.org/stable/developers/develop.html Recommends using these two instead of the above.
        Xt, yt = check_X_y(Xt, yt)
        
        #Post Validation
        self.X_ = Xt
        
        #Store class labels as strings
        self.y_ = np.array(yt, dtype=str)
        
        #Set the list of classes
        self.classes_ = unique_labels(self.y_)
        
        #Set the class dictionary
        self.__class_dict()
        
        #Get the stats for each Class and set it
        self.__set_class_stats()
        
        return self
    
    
    def predict(self, X_test):
        """A function to predict a test set.
        
        Input: 
        X_test: List/array/dataframe
        
        Output:
        ???"""
        
        X_test=self.__validate_input_type(X_test)
        
        #Validate the test set contains numbers
        if not self.__validate_numeric(X_test):
            raise ValueError("All inputs in the training set should be numeric")
            
        #Validate the test set has same column count as training
        if not self.__validate_column_count(X_train=self.X_,X_test=X_test):
            raise ValueError("The test set should have the same column count as the training set")
            
        X_test = check_array(X_test, accept_sparse=True)
    
        #Check the model has been fitted -> Returns error if not.
        check_is_fitted(self, attributes=['X_','y_','classes_','class_dict_','summary_stats_'],msg='is_fitted_')
        
        #Passed validation
        self.X_test_=X_test
        
        #BEGIN WITH PREDICTING
        y_pred=[]
        
        column_count=len(X_test[0])
        
        class_summary_stats=self.get_class_stats()
        
        row_class_dict=dict()
        raw_contingency_dictionary=dict()
        
        #For each row in the test data, take the row
        for row_index in range(len(X_test)):
            
            #get a single row
            row_value=X_test[row_index]
            row_class_dict[row_index]=dict()
            raw_contingency_dictionary[row_index]=dict()
            
            #For that row, get the probability values for every class
            for class_label in self.get_classes():
                
                #Empty array to calculate probability for that row
                row_list=[]
                
                #class Stats
                class_mean=class_summary_stats[class_label]['Mean']
                class_std=class_summary_stats[class_label]['Standard Deviation']
                class_percent=class_summary_stats[class_label]['Percent']
                
                #Go through column for that row and add in probability:
                for column_index in range(column_count):
                    row_list+=[self.__calculate_gaussian_probability(x=row_value[column_index]
                                                                     ,mean=class_mean[column_index]
                                                                     , std_dev=class_std[column_index])]
                    

                
                #This stores the probability values calculated for each row for each class to refer back to later
                raw_contingency_dictionary[row_index][class_label]= np.array(row_list)
                
                #This just stores the probability for that class
                row_class_dict[row_index][class_label]= np.prod(np.array(row_list)) * class_percent
                
            
            #Probabilities calculated for each class - for that row index get the class:
            class_for_row=max(row_class_dict[row_index], key=row_class_dict[row_index].get)
            
            y_pred+=[class_for_row]
            
        y_pred=np.array(y_pred)   
        self.cont_dict_=row_class_dict
        self.raw_dict_=raw_contingency_dictionary
        
        return y_pred
    
    
    
    def predict_proba(self, Xtes):
        """A function to predict the probability"""
        pass
    # We should really be implementing predict_proba as well.
    
    
    
    #A function to get a dictionary for the class
    def __class_dict(self):
        """A function to make a dictionary such that each class is held in the dictionary and all rows of data are put against that class.
        
        Input: 
        X -> n-dimensional Array
        y -> Single-dimensional array
        
        Output:
        Dictionary -> {Class1:[Row_{1,1},Row_{1,2},...Row_{1,K}]
                        ,Class2:[Row_{2,1},...Row_{2,m}]
                        ,...
                        ,ClassN:[R_{N,1},...R_{N,x}]}
        
        """
        class_dict=dict()
        
        X=self.X_
        y=self.y_
        
        #Each row in the matrix
        for row_index in range(len(y)):
            
            class_value=y[row_index]
            
            if class_value not in class_dict:
                class_dict[class_value]=[np.array(X[row_index])]
                
            else:
                class_dict[class_value]+=[np.array(X[row_index])]
                
        
                
        for class_value in self.classes_:
            class_dict[class_value]=np.array(class_dict[class_value])
            
        self.class_dict_=class_dict
        
        return
            
    #A function to get the statistics for each class
    def __set_class_stats(self):
        """For each class, get the summary statistics, mainly:
        -> Mean, standard dev, percentage of presence
        
        I.e.:
        class1:[Mean_1,std_1,Percent of Total_1]
        ,class2:[Mean_2,std_2,Percent of Total_2]
        ,...,
        classN:[Mean_N,std_N,Percent of Total_N]
        
        """
        
        stat_dict=dict()
        
        for pred_class in self.classes_:
            
            #Matrix of classes
            class_matrix=self.class_dict_[pred_class]
            
            #Class:
            stat_dict[pred_class]={
                                'Mean':self.__get_column_means(class_matrix)
                                ,'Standard Deviation':self.__get_column_std_dev(class_matrix)
                                ,'Percent':(self.__get_row_count(class_matrix) / len((self.y_)))
                                }
    
        
        self.summary_stats_=stat_dict
        
        return
        
    
    #Getter
    def get_x(self):
        return self.X_
    
    #Getter
    def get_y(self):
        return self.y_
    
    #Getter
    def get_classes(self):
        return self.classes_
    
    #Getter
    def get_class_dict(self):
        return self.class_dict_
    
    #Getter
    def get_class_stats(self):
        return self.summary_stats_
    

    #Getter
    def get_raw_dict(self):
        return self.raw_dict_
    
    def get_cont_dict(self):
        return cont_dict_
                
        
    #MEANS
    @staticmethod   
    def __get_column_means(X):
        """Get the column mean of an Array x"""
        return X.mean(axis=0)
    
    @staticmethod
    def __get_row_means(X):
        """Get the row mean of an Array x"""
        return X.mean(axis=1)
    
    
    #Standard Deviations
    @staticmethod
    def __get_column_std_dev(X):
        """Get the column std dev of an Array x"""
        return X.std(axis=0)
    
    @staticmethod
    def __get_row_std_dev(X):
        """Get the row std dev of an Array x"""
        return X.std(axis=1)
    
    #Counts
    @staticmethod   
    def __get_column_count(X):
        """Get the column Count of an Array x"""
        return len(X.T)
    
    @staticmethod   
    def __get_row_count(X):
        """Get the row count of an Array x"""
        return len(X)
    
    
    #Probability Function
    @staticmethod
    def __calculate_gaussian_probability(x, mean, std_dev):
        """A function to implement the pdf for a gaussian"""
        
        if std_dev==0:
            #This way squaring it will be well defined
            std_dev=sys.float_info.min**(1/2)
            
        first_term= (1 / (np.sqrt(2 * np.pi) * std_dev))
        second_term= (np.exp(-((x-mean)**2 / (2 * (std_dev**2) ))))
        gaussian_pdf =  first_term*second_term
        
        return  gaussian_pdf
        
    
    #Validation
    @staticmethod
    def __validate_numeric(X):
        """A function to check that X is a numeric array"""
        return X.dtype.kind in ('b','u','i','f','c')
    
    #Validation
    @staticmethod
    def __validate_length(X,y):
        """A function to check that X and y are equal length"""
        return len(X)==len(y)
    
    @staticmethod
    def __validate_column_count(X_train,X_test):
        """A function to check that X_train and X_test are equal length"""
        return len(X_train[0])==len(X_test[0])
    
    @staticmethod
    def __validate_input_type(X):
        """Validate the type of X"""
        if not (isinstance(X,list) or isinstance(X, np.ndarray)):
            if isinstance(X,pd.DataFrame):
                X=X.to_numpy()
                return X

            else:
                raise TypeError("Input must be a list, numpy array, or dataframe")
                
        return np.array(X)



# A function to analyse Models and wrap model creation

In [3]:
def equal_index_array(l1,l2):
    
    bool_array=[]
    
    for pos in range(len(l1)):
        bool_array+=[l1[pos]==l2[pos]]
        
    return bool_array


def model_metrics(testActualVal, predictions,verbose=True):
    """A function to get Metrics for a Model
    
    Note: This is a function I wrote for the Research Practicum: 
    https://github.com/Team10UCD/Frontend/blob/cc12998790b7207a859d5089e10f085a65586294/flask/Data_Analytics/Model_Analytics/Route102_sample/02_local_ModelExplorationAndFeatureSelection_Route102.ipynb"""
    
    try:
        accuracy=metrics.accuracy_score(testActualVal, predictions)
    except:
        accuracy=None
        pass
    
    try:
        confusion_matrix=metrics.confusion_matrix(testActualVal, predictions)
    except:
        confusion_matrix=None
        pass
        
    try:
        classification_rep=metrics.classification_report(testActualVal, predictions,output_dict=True)
    except:
        classification_rep=None
        pass
    
    if verbose:
        
        try:
            print("----DETAIL----")
            print("\n\nAccuracy: \n")
            display(accuracy)
            print("\n\nConfusion matrix: \n")
            display(confusion_matrix)
            print("\n\nClassification report:\n ")
            display(classification_rep)
            
        except:
            print("----DETAIL----")
            print("\n\nAccuracy: \n", accuracy)
            print("\n\nConfusion matrix: \n", confusion_matrix)
            print("\n\nClassification report:\n ", classification_rep)
            
    
    result_dict={}
    result_dict['Accuracy']=accuracy
    result_dict['Confusion']=confusion_matrix
    result_dict['ClassificationRep']=classification_rep
    return result_dict


def create_model(X,y,scaler='Standard',random_state=14395076,plot_comp=True,assess=True,test_size=0.33, verbose=True,mod_type=''):
    """A wrapper to call the create model function"""

    
    if mod_type=='My Naive Bayes':
        mod_result=create_my_naive_bayes(X=X
                                         ,y=y
                                         ,plot_comp=plot_comp
                                         ,scaler=scaler
                                         ,assess=assess
                                         ,random_state=random_state
                                         , verbose=verbose
                                        ,test_size=test_size)
        return mod_result

    
    
    elif mod_type=="SK Naive Bayes":
        mod_result=create_sk_naive_bayes(X=X
                                         ,y=y
                                         ,plot_comp=plot_comp
                                         ,random_state=random_state
                                         ,scaler=scaler
                                         ,assess=assess
                                         , verbose=verbose
                                        ,test_size=test_size)
    
        return mod_result
    
    else:
        print("Sorry, I've not built that yet.")
        raise ValueError("Must be My Naive Bayes or SK Naive Bayes")

# Create a model using the default SKLearn estimator

In [4]:
def create_sk_naive_bayes(X,y,plot_comp,scaler='Standard', random_state=14395076,assess=True, test_size=0.33,verbose=True):
    """Create a model using SK's Naive Bayes."""
    
    
    print("""
-
SKLEARN NAIVE BAYES:
-
          
          """)
    
    X_train, X_test, y_train, y_test = train_test_split(X 
                                                        ,y
                                                        ,random_state=random_state
                                                        ,test_size=test_size)
    
    if scaler=='Standard':
        scaler = StandardScaler()
    elif scaler=='MinMax':    
        scaler = MinMaxScaler()
    else:
        raise ValueError("Need to implemenet your scaler - Choose Standard or MinMax")
    
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    start=time.perf_counter()
    
    #Create the DT Regression
    model = GaussianNB()

    #Fit the data
    model.fit(X_train,y_train)
   
    
    #Check the predictions
    predictions = model.predict(X_test)
    end=time.perf_counter()
    
    print("Total Time to Classify: {}".format(end-start))
    
    
    #Results
    pred_vs_act_df=pd.DataFrame({'Actual':y_test
                                 ,'PredictionClass':predictions
                                ,'Diff':equal_index_array(l1=y_test,l2=predictions)})
    
    model_metric=model_metrics(testActualVal=y_test, predictions=predictions, verbose=True)
    
    
    scores = cross_val_score(GaussianNB(), X, y, scoring='accuracy', cv=5)
    print(scores)

    cv_rmse = scores**0.5
    print("Avg Accuracy score over 5 folds: \n", np.mean(cv_rmse))
    print("Stddev Accuracy score over 5 folds: \n", np.std(cv_rmse))
    
    model_metric['ClassificationRep']['Classification Time']=end-start
    result_dict={}
    result_dict['Model']=model
    result_dict['Classification_Time']=end-start
    result_dict['Actual vs Prediction']=pred_vs_act_df
    result_dict['Accuracy']=model_metric['Accuracy']
    result_dict['Confusion']=model_metric['Confusion']
    result_dict['ClassificationRep']=model_metric['ClassificationRep']
    result_dict['CrossVal_Acc_Mean']=np.mean(cv_rmse)
    result_dict['CrossVal_Acc_Mean']=np.std(cv_rmse)

    return result_dict

# Create a model using my SKLearn estimator
Note: as long as the same random_state and test size is present they'll both work on the same set.

In [5]:
def create_my_naive_bayes(X,y,plot_comp,scaler='Standard',random_state=14395076, assess=True,test_size=0.33, verbose=True):
    """Create a model using My Naive Bayes"""
    

    print("""
-
My NAIVE BAYES:
-
           
          """)
    
    X_train, X_test, y_train, y_test = train_test_split(X 
                                                        ,y
                                                        ,random_state=random_state
                                                        ,test_size=test_size)
    
    if scaler=='Standard':
        scaler = StandardScaler()
    elif scaler=='MinMax':    
        scaler = MinMaxScaler()
    
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    start=time.perf_counter()
    #Create the DT Regression
    model = MyGaussianNB()

    #Fit the data
    model.fit(X_train,y_train)
   
    
    #Check the predictions
    predictions = model.predict(X_test)
    end=time.perf_counter()
    
    print("Total Time to Classify: {}".format(end-start))
    
    #Results
    pred_vs_act_df=pd.DataFrame({'Actual':y_test
                                 ,'PredictionClass':predictions
                                ,'Diff':equal_index_array(l1=y_test,l2=predictions)})
    
    model_metric=model_metrics(testActualVal=y_test, predictions=predictions, verbose=True)
    
    
    scores = cross_val_score(MyGaussianNB(), X, y, scoring='accuracy', cv=5)
    print(scores)

    cv_rmse = scores**0.5
    print("Avg Accuracy score over 5 folds: \n", np.mean(cv_rmse))
    print("Stddev Accuracy score over 5 folds: \n", np.std(cv_rmse))
    
    model_metric['ClassificationRep']['Classification Time']=end-start
    result_dict={}
    result_dict['Model']=model
    result_dict['Classification_Time']=end-start
    result_dict['Actual vs Prediction']=pred_vs_act_df
    result_dict['Accuracy']=model_metric['Accuracy']
    result_dict['Confusion']=model_metric['Confusion']
    result_dict['ClassificationRep']=model_metric['ClassificationRep']
    result_dict['CrossVal_Acc_Mean']=np.mean(cv_rmse)
    result_dict['CrossVal_Acc_Mean']=np.std(cv_rmse)

    return result_dict

# For a single dataset, create both models and add them to the same dataframe for that file for easy comparison.

In [6]:
def test_on_single_dataset(filename,x_columns,y_column,random_state=14395076,test_size=0.33,scaler='Standard'):
    """A function to test a single dataset"""
    
    #Read in the data
    df=pd.read_csv(filename)
    X=df[x_columns].values
    y=df.pop(y_column[0]).astype(str).values
    
    my_dict=create_model(X,y,scaler='Standard',random_state=random_state,plot_comp=True,assess=True,test_size=0.33, verbose=True,mod_type='My Naive Bayes')
    sk_dict=create_model(X,y,scaler='Standard',random_state=random_state,plot_comp=True,assess=True,test_size=0.33, verbose=True,mod_type='SK Naive Bayes')
    
    
    my_df=pd.DataFrame(my_dict['ClassificationRep'])
    sk_df=pd.DataFrame(sk_dict['ClassificationRep'])
    
    my_df['Type']='MyGaussianNaiveBayes'
    sk_df['Type']='SKGaussianNaiveBayes'
    
    report_df=pd.concat([my_df,sk_df])
    
    return report_df

# A function to run the single comparison on all datasets

In [7]:
def test_on_datasets(dataset_dictionary):
    """A function to test on all datasets consistently"""
    
    report_dictionary=dict()
    df_list=[]
    
    for file in dataset_dictionary:
        file_dictionary=dataset_dictionary[file]
        
        fp=file_dictionary['filepath']
        x_columns=file_dictionary['x_columns']
        y_columns=file_dictionary['y_column']
        
        
        print(
    """
---------------
---------------   

BEGIN TESTING ON - {}:

---------------
---------------
    """.format(file))

        single_report_dict=test_on_single_dataset(filename=fp
                               ,x_columns=x_columns
                               ,y_column=y_columns
                               ,random_state=14395076
                               ,test_size=0.33
                               ,scaler='Standard')
        
        single_report_dict['File']=file
        single_report_dict.set_index(['File','Type',single_report_dict.index],inplace=True)
        df_list+=[single_report_dict]
        
        report_dictionary[file]=single_report_dict

    #full_report_df=pd.concat(df_list)
    
    return report_dictionary

# Display the reports in a Notebook file.

In [8]:
def check_all_reports(dataset_dictionary,reports_per_file):
    """A function to display all reports"""
    
    for file in dataset_dictionary:
        display(reports_per_file[file].T)
        
    return

# Define the Files to Test.

They should be in the format:

test_dictionary:{filename:
                        {filepath,
                        target_column,
                        feature_columns}
                  }
                  
Validation Is NOT DONE on the files e.g. to ensure values are numeric/non-numeric - This should be done as a preprocessing step. The functions above assume this work is complete

In [9]:
test_datasets={
    'penguins':
        {
            'filepath':'./Test Datasets/penguins_af.csv'
            ,'y_column':['species']
            ,'x_columns':['bill_length_mm','flipper_length_mm','body_mass_g','bill_depth_mm']
        }
    ,'diabetes':
        {
            'filepath':'./Test Datasets/diabetes.csv'
            ,'y_column':['neg_pos']
            ,'x_columns':['preg', 'plas', 'pres', 'skin', 'insu', 'mass', 'pedi', 'age']
        }
    ,'glass':
        {
            'filepath':'./Test Datasets/glassV2.csv'
            ,'y_column':['Type']
            ,'x_columns':['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe']
        }
}

# Test all of the files.

In [10]:
reports_per_file=test_on_datasets(dataset_dictionary=test_datasets)


---------------
---------------   

BEGIN TESTING ON - penguins:

---------------
---------------
    

-
My NAIVE BAYES:
-
           
          
Total Time to Classify: 0.007652750000000097
----DETAIL----


Accuracy: 



0.9545454545454546



Confusion matrix: 



array([[50,  1,  0],
       [ 4, 17,  0],
       [ 0,  0, 38]])



Classification report:
 


{'Adelie': {'precision': 0.9259259259259259,
  'recall': 0.9803921568627451,
  'f1-score': 0.9523809523809523,
  'support': 51},
 'Chinstrap': {'precision': 0.9444444444444444,
  'recall': 0.8095238095238095,
  'f1-score': 0.8717948717948718,
  'support': 21},
 'Gentoo': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 38},
 'accuracy': 0.9545454545454546,
 'macro avg': {'precision': 0.95679012345679,
  'recall': 0.9299719887955181,
  'f1-score': 0.9413919413919413,
  'support': 110},
 'weighted avg': {'precision': 0.955050505050505,
  'recall': 0.9545454545454546,
  'f1-score': 0.9534465534465534,
  'support': 110}}

[0.98507463 0.95522388 0.95522388 0.96969697 0.98484848]
Avg Accuracy score over 5 folds: 
 0.9848695244513902
Stddev Accuracy score over 5 folds: 
 0.006751912908470249

-
SKLEARN NAIVE BAYES:
-
          
          
Total Time to Classify: 0.0008684579999997943
----DETAIL----


Accuracy: 



0.9545454545454546



Confusion matrix: 



array([[50,  1,  0],
       [ 4, 17,  0],
       [ 0,  0, 38]])



Classification report:
 


{'Adelie': {'precision': 0.9259259259259259,
  'recall': 0.9803921568627451,
  'f1-score': 0.9523809523809523,
  'support': 51},
 'Chinstrap': {'precision': 0.9444444444444444,
  'recall': 0.8095238095238095,
  'f1-score': 0.8717948717948718,
  'support': 21},
 'Gentoo': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 38},
 'accuracy': 0.9545454545454546,
 'macro avg': {'precision': 0.95679012345679,
  'recall': 0.9299719887955181,
  'f1-score': 0.9413919413919413,
  'support': 110},
 'weighted avg': {'precision': 0.955050505050505,
  'recall': 0.9545454545454546,
  'f1-score': 0.9534465534465534,
  'support': 110}}

[0.98507463 0.95522388 0.95522388 0.96969697 0.98484848]
Avg Accuracy score over 5 folds: 
 0.9848695244513902
Stddev Accuracy score over 5 folds: 
 0.006751912908470249

---------------
---------------   

BEGIN TESTING ON - diabetes:

---------------
---------------
    

-
My NAIVE BAYES:
-
           
          
Total Time to Classify: 0.019765999999999728
----DETAIL----


Accuracy: 



0.7874015748031497



Confusion matrix: 



array([[141,  26],
       [ 28,  59]])



Classification report:
 


{'tested_negative': {'precision': 0.834319526627219,
  'recall': 0.844311377245509,
  'f1-score': 0.8392857142857143,
  'support': 167},
 'tested_positive': {'precision': 0.6941176470588235,
  'recall': 0.6781609195402298,
  'f1-score': 0.6860465116279069,
  'support': 87},
 'accuracy': 0.7874015748031497,
 'macro avg': {'precision': 0.7642185868430212,
  'recall': 0.7612361483928693,
  'f1-score': 0.7626661129568106,
  'support': 254},
 'weighted avg': {'precision': 0.7862976229955244,
  'recall': 0.7874015748031497,
  'f1-score': 0.7867982708556779,
  'support': 254}}

[0.75324675 0.72727273 0.74675325 0.78431373 0.74509804]
Avg Accuracy score over 5 folds: 
 0.8667310233848855
Stddev Accuracy score over 5 folds: 
 0.010687913636962644

-
SKLEARN NAIVE BAYES:
-
          
          
Total Time to Classify: 0.001536874999999771
----DETAIL----


Accuracy: 



0.7874015748031497



Confusion matrix: 



array([[141,  26],
       [ 28,  59]])



Classification report:
 


{'tested_negative': {'precision': 0.834319526627219,
  'recall': 0.844311377245509,
  'f1-score': 0.8392857142857143,
  'support': 167},
 'tested_positive': {'precision': 0.6941176470588235,
  'recall': 0.6781609195402298,
  'f1-score': 0.6860465116279069,
  'support': 87},
 'accuracy': 0.7874015748031497,
 'macro avg': {'precision': 0.7642185868430212,
  'recall': 0.7612361483928693,
  'f1-score': 0.7626661129568106,
  'support': 254},
 'weighted avg': {'precision': 0.7862976229955244,
  'recall': 0.7874015748031497,
  'f1-score': 0.7867982708556779,
  'support': 254}}

[0.75324675 0.72727273 0.74675325 0.78431373 0.74509804]
Avg Accuracy score over 5 folds: 
 0.8667310233848855
Stddev Accuracy score over 5 folds: 
 0.010687913636962644

---------------
---------------   

BEGIN TESTING ON - glass:

---------------
---------------
    

-
My NAIVE BAYES:
-
           
          
Total Time to Classify: 0.01342995800000013
----DETAIL----


Accuracy: 



0.47058823529411764



Confusion matrix: 



array([[12,  5, 11,  0,  0],
       [10,  7,  0,  2,  0],
       [ 4,  0,  2,  0,  0],
       [ 0,  2,  0,  1,  0],
       [ 0,  0,  1,  1, 10]])



Classification report:
 


{'1': {'precision': 0.46153846153846156,
  'recall': 0.42857142857142855,
  'f1-score': 0.4444444444444445,
  'support': 28},
 '2': {'precision': 0.5,
  'recall': 0.3684210526315789,
  'f1-score': 0.4242424242424242,
  'support': 19},
 '3': {'precision': 0.14285714285714285,
  'recall': 0.3333333333333333,
  'f1-score': 0.2,
  'support': 6},
 '5': {'precision': 0.25,
  'recall': 0.3333333333333333,
  'f1-score': 0.28571428571428575,
  'support': 3},
 '7': {'precision': 1.0,
  'recall': 0.8333333333333334,
  'f1-score': 0.9090909090909091,
  'support': 12},
 'accuracy': 0.47058823529411764,
 'macro avg': {'precision': 0.47087912087912087,
  'recall': 0.45939849624060153,
  'f1-score': 0.4526984126984127,
  'support': 68},
 'weighted avg': {'precision': 0.5298561732385262,
  'recall': 0.47058823529411764,
  'f1-score': 0.4922247686953569,
  'support': 68}}

[0.31707317 0.34146341 0.36585366 0.43902439 0.09756098]
Avg Accuracy score over 5 folds: 
 0.5454472550595352
Stddev Accuracy score over 5 folds: 
 0.12117101096895212

-
SKLEARN NAIVE BAYES:
-
          
          
Total Time to Classify: 0.0011702499999999283
----DETAIL----


Accuracy: 



0.47058823529411764



Confusion matrix: 



array([[12,  5, 11,  0,  0],
       [10,  7,  0,  2,  0],
       [ 4,  0,  2,  0,  0],
       [ 0,  2,  0,  1,  0],
       [ 0,  0,  1,  1, 10]])



Classification report:
 


{'1': {'precision': 0.46153846153846156,
  'recall': 0.42857142857142855,
  'f1-score': 0.4444444444444445,
  'support': 28},
 '2': {'precision': 0.5,
  'recall': 0.3684210526315789,
  'f1-score': 0.4242424242424242,
  'support': 19},
 '3': {'precision': 0.14285714285714285,
  'recall': 0.3333333333333333,
  'f1-score': 0.2,
  'support': 6},
 '5': {'precision': 0.25,
  'recall': 0.3333333333333333,
  'f1-score': 0.28571428571428575,
  'support': 3},
 '7': {'precision': 1.0,
  'recall': 0.8333333333333334,
  'f1-score': 0.9090909090909091,
  'support': 12},
 'accuracy': 0.47058823529411764,
 'macro avg': {'precision': 0.47087912087912087,
  'recall': 0.45939849624060153,
  'f1-score': 0.4526984126984127,
  'support': 68},
 'weighted avg': {'precision': 0.5298561732385262,
  'recall': 0.47058823529411764,
  'f1-score': 0.4922247686953569,
  'support': 68}}

[0.31707317 0.34146341 0.36585366 0.43902439 0.2195122 ]
Avg Accuracy score over 5 folds: 
 0.5766820074372563
Stddev Accuracy score over 5 folds: 
 0.06342892204503187


# We observe equality in the results which we are seeing between the SKLearn and My implementation.

However, we see the sklearn implementation is considerably faster.

While I tried to use only numpy arrays hoping that this would lead to decent performance, we see that the SKLearn implementation is at least one order of magnitude faster than my implementation in all datasets. My assumption is that SKLearn also stores considerably few variables than I store, so is likely a more optimal approach in regards to both space and time complexity.

In [11]:
check_all_reports(dataset_dictionary=test_datasets,reports_per_file=reports_per_file)

File,penguins,penguins,penguins,penguins,penguins,penguins,penguins,penguins
Type,MyGaussianNaiveBayes,MyGaussianNaiveBayes,MyGaussianNaiveBayes,MyGaussianNaiveBayes,SKGaussianNaiveBayes,SKGaussianNaiveBayes,SKGaussianNaiveBayes,SKGaussianNaiveBayes
Unnamed: 0_level_2,precision,recall,f1-score,support,precision,recall,f1-score,support
Adelie,0.925926,0.980392,0.952381,51.0,0.925926,0.980392,0.952381,51.0
Chinstrap,0.944444,0.809524,0.871795,21.0,0.944444,0.809524,0.871795,21.0
Gentoo,1.0,1.0,1.0,38.0,1.0,1.0,1.0,38.0
accuracy,0.954545,0.954545,0.954545,0.954545,0.954545,0.954545,0.954545,0.954545
macro avg,0.95679,0.929972,0.941392,110.0,0.95679,0.929972,0.941392,110.0
weighted avg,0.955051,0.954545,0.953447,110.0,0.955051,0.954545,0.953447,110.0
Classification Time,0.007653,0.007653,0.007653,0.007653,0.000868,0.000868,0.000868,0.000868


File,diabetes,diabetes,diabetes,diabetes,diabetes,diabetes,diabetes,diabetes
Type,MyGaussianNaiveBayes,MyGaussianNaiveBayes,MyGaussianNaiveBayes,MyGaussianNaiveBayes,SKGaussianNaiveBayes,SKGaussianNaiveBayes,SKGaussianNaiveBayes,SKGaussianNaiveBayes
Unnamed: 0_level_2,precision,recall,f1-score,support,precision,recall,f1-score,support
tested_negative,0.83432,0.844311,0.839286,167.0,0.83432,0.844311,0.839286,167.0
tested_positive,0.694118,0.678161,0.686047,87.0,0.694118,0.678161,0.686047,87.0
accuracy,0.787402,0.787402,0.787402,0.787402,0.787402,0.787402,0.787402,0.787402
macro avg,0.764219,0.761236,0.762666,254.0,0.764219,0.761236,0.762666,254.0
weighted avg,0.786298,0.787402,0.786798,254.0,0.786298,0.787402,0.786798,254.0
Classification Time,0.019766,0.019766,0.019766,0.019766,0.001537,0.001537,0.001537,0.001537


File,glass,glass,glass,glass,glass,glass,glass,glass
Type,MyGaussianNaiveBayes,MyGaussianNaiveBayes,MyGaussianNaiveBayes,MyGaussianNaiveBayes,SKGaussianNaiveBayes,SKGaussianNaiveBayes,SKGaussianNaiveBayes,SKGaussianNaiveBayes
Unnamed: 0_level_2,precision,recall,f1-score,support,precision,recall,f1-score,support
1,0.461538,0.428571,0.444444,28.0,0.461538,0.428571,0.444444,28.0
2,0.5,0.368421,0.424242,19.0,0.5,0.368421,0.424242,19.0
3,0.142857,0.333333,0.2,6.0,0.142857,0.333333,0.2,6.0
5,0.25,0.333333,0.285714,3.0,0.25,0.333333,0.285714,3.0
7,1.0,0.833333,0.909091,12.0,1.0,0.833333,0.909091,12.0
accuracy,0.470588,0.470588,0.470588,0.470588,0.470588,0.470588,0.470588,0.470588
macro avg,0.470879,0.459398,0.452698,68.0,0.470879,0.459398,0.452698,68.0
weighted avg,0.529856,0.470588,0.492225,68.0,0.529856,0.470588,0.492225,68.0
Classification Time,0.01343,0.01343,0.01343,0.01343,0.00117,0.00117,0.00117,0.00117


---

### Manual Intervention

For the purposes of assessment, here is a Notebook Section which can be modified directly to pass in values without working through the entire code above.

1. Run this bit to read in the DF and values

In [12]:
#Modify Here to pass in X and Y
scaler='Standard'

df=pd.read_csv(test_datasets['penguins']['filepath'])


X=df[test_datasets['penguins']['x_columns']].values
y=df.pop(test_datasets['penguins']['y_column'][0]).values

In [13]:



X_train, X_test, y_train, y_test = train_test_split(X 
                                                    ,y
                                                    ,random_state=14395076
                                                    ,test_size=0.33)

if scaler=='Standard':
    scaler = StandardScaler()
elif scaler=='MinMax':    
    scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)



---

2. Run this bit to test on MyGaussianNaiveBayes

In [14]:

#Create the DT Regression
model = MyGaussianNB()

#Fit the data
model.fit(X_train,y_train)


#Check the predictions
predictions = model.predict(X_test)

#Results
pred_vs_act_df=pd.DataFrame({'Actual':y_test
                             ,'PredictionClass':predictions
                            ,'Diff':equal_index_array(l1=y_test,l2=predictions)})

model_metric=model_metrics(testActualVal=y_test, predictions=predictions, verbose=True)


scores = cross_val_score(MyGaussianNB(), X, y, scoring='accuracy', cv=5)
print(scores)

cv_rmse = scores**0.5
print("Avg Accuracy score over 5 folds: \n", np.mean(cv_rmse))
print("Stddev Accuracy score over 5 folds: \n", np.std(cv_rmse))


result_dict={}
result_dict['Model']=model
result_dict['Actual vs Prediction']=pred_vs_act_df
result_dict['Accuracy']=model_metric['Accuracy']
result_dict['Confusion']=model_metric['Confusion']
result_dict['ClassificationRep']=model_metric['ClassificationRep']
result_dict['CrossVal_Acc_Mean']=np.mean(cv_rmse)
result_dict['CrossVal_Acc_Mean']=np.std(cv_rmse)
result_dict

----DETAIL----


Accuracy: 



0.9545454545454546



Confusion matrix: 



array([[50,  1,  0],
       [ 4, 17,  0],
       [ 0,  0, 38]])



Classification report:
 


{'Adelie': {'precision': 0.9259259259259259,
  'recall': 0.9803921568627451,
  'f1-score': 0.9523809523809523,
  'support': 51},
 'Chinstrap': {'precision': 0.9444444444444444,
  'recall': 0.8095238095238095,
  'f1-score': 0.8717948717948718,
  'support': 21},
 'Gentoo': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 38},
 'accuracy': 0.9545454545454546,
 'macro avg': {'precision': 0.95679012345679,
  'recall': 0.9299719887955181,
  'f1-score': 0.9413919413919413,
  'support': 110},
 'weighted avg': {'precision': 0.955050505050505,
  'recall': 0.9545454545454546,
  'f1-score': 0.9534465534465534,
  'support': 110}}

[0.98507463 0.95522388 0.95522388 0.96969697 0.98484848]
Avg Accuracy score over 5 folds: 
 0.9848695244513902
Stddev Accuracy score over 5 folds: 
 0.006751912908470249


{'Model': MyGaussianNB(),
 'Actual vs Prediction':         Actual PredictionClass  Diff
 0       Adelie          Adelie  True
 1       Gentoo          Gentoo  True
 2       Adelie          Adelie  True
 3       Adelie          Adelie  True
 4       Gentoo          Gentoo  True
 ..         ...             ...   ...
 105     Adelie          Adelie  True
 106  Chinstrap       Chinstrap  True
 107     Gentoo          Gentoo  True
 108  Chinstrap       Chinstrap  True
 109     Adelie          Adelie  True
 
 [110 rows x 3 columns],
 'Accuracy': 0.9545454545454546,
 'Confusion': array([[50,  1,  0],
        [ 4, 17,  0],
        [ 0,  0, 38]]),
 'ClassificationRep': {'Adelie': {'precision': 0.9259259259259259,
   'recall': 0.9803921568627451,
   'f1-score': 0.9523809523809523,
   'support': 51},
  'Chinstrap': {'precision': 0.9444444444444444,
   'recall': 0.8095238095238095,
   'f1-score': 0.8717948717948718,
   'support': 21},
  'Gentoo': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 

---

3. Run this bit to test on SK Learns' GaussianNaiveBayes

In [15]:


#Create the DT Regression
model = GaussianNB()

#Fit the data
model.fit(X_train,y_train)


#Check the predictions
predictions = model.predict(X_test)

#Results
pred_vs_act_df=pd.DataFrame({'Actual':y_test
                             ,'PredictionClass':predictions
                            ,'Diff':equal_index_array(l1=y_test,l2=predictions)})

model_metric=model_metrics(testActualVal=y_test, predictions=predictions, verbose=True)


scores = cross_val_score(MyGaussianNB(), X, y, scoring='accuracy', cv=5)
print(scores)

cv_rmse = scores**0.5
print("Avg Accuracy score over 5 folds: \n", np.mean(cv_rmse))
print("Stddev Accuracy score over 5 folds: \n", np.std(cv_rmse))


result_dict={}
result_dict['Model']=model
result_dict['Actual vs Prediction']=pred_vs_act_df
result_dict['Accuracy']=model_metric['Accuracy']
result_dict['Confusion']=model_metric['Confusion']
result_dict['ClassificationRep']=model_metric['ClassificationRep']
result_dict['CrossVal_Acc_Mean']=np.mean(cv_rmse)
result_dict['CrossVal_Acc_Mean']=np.std(cv_rmse)
result_dict

----DETAIL----


Accuracy: 



0.9545454545454546



Confusion matrix: 



array([[50,  1,  0],
       [ 4, 17,  0],
       [ 0,  0, 38]])



Classification report:
 


{'Adelie': {'precision': 0.9259259259259259,
  'recall': 0.9803921568627451,
  'f1-score': 0.9523809523809523,
  'support': 51},
 'Chinstrap': {'precision': 0.9444444444444444,
  'recall': 0.8095238095238095,
  'f1-score': 0.8717948717948718,
  'support': 21},
 'Gentoo': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 38},
 'accuracy': 0.9545454545454546,
 'macro avg': {'precision': 0.95679012345679,
  'recall': 0.9299719887955181,
  'f1-score': 0.9413919413919413,
  'support': 110},
 'weighted avg': {'precision': 0.955050505050505,
  'recall': 0.9545454545454546,
  'f1-score': 0.9534465534465534,
  'support': 110}}

[0.98507463 0.95522388 0.95522388 0.96969697 0.98484848]
Avg Accuracy score over 5 folds: 
 0.9848695244513902
Stddev Accuracy score over 5 folds: 
 0.006751912908470249


{'Model': GaussianNB(),
 'Actual vs Prediction':         Actual PredictionClass  Diff
 0       Adelie          Adelie  True
 1       Gentoo          Gentoo  True
 2       Adelie          Adelie  True
 3       Adelie          Adelie  True
 4       Gentoo          Gentoo  True
 ..         ...             ...   ...
 105     Adelie          Adelie  True
 106  Chinstrap       Chinstrap  True
 107     Gentoo          Gentoo  True
 108  Chinstrap       Chinstrap  True
 109     Adelie          Adelie  True
 
 [110 rows x 3 columns],
 'Accuracy': 0.9545454545454546,
 'Confusion': array([[50,  1,  0],
        [ 4, 17,  0],
        [ 0,  0, 38]]),
 'ClassificationRep': {'Adelie': {'precision': 0.9259259259259259,
   'recall': 0.9803921568627451,
   'f1-score': 0.9523809523809523,
   'support': 51},
  'Chinstrap': {'precision': 0.9444444444444444,
   'recall': 0.8095238095238095,
   'f1-score': 0.8717948717948718,
   'support': 21},
  'Gentoo': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 's