# Assignment 1

---

Author: Adam Ryan

Created Date: 2021-09-14

Date Last Modified: 2021-12-07

Description: A solution to Assignment 1 and Assignment 2 regarding the implementation of MyGaussianNB which implements Gaussian Naive Bayes.

---

## Task Description

### Objective
The objective of this assignment is to implement a Gaussian Naive Bayes classifier in the scikit-learn framework. A notebook (MajorityClassClf) is provided with a simple example of a classifier that works with scikit-learn.

Note: The code developed in this assignment will be extended in the second assignment to allow for missing values. 

### Requirements
The notebook MajorityClassClf contains some basic code to help you get started. 
Provide a python class MyGaussianNB that implements Gaussian Naive Bayes. The conditional probabilities should be calculated as follows:
        

\begin{equation}
\mathrm{P}(x_i \mid y) = \frac{1}{\sqrt{2 \pi \sigma_{y_i}^2}} e^{-\frac{(x_i - \mu_{y_i})^2}{2\sigma_{y_i}^2}} = \frac{1}{\sqrt{2 \pi \sigma_{y_i}^2}} \exp\left(-\frac{(x_i - \mu_{y_i})^2}{2\sigma_{y_i}^2}\right)
\end{equation}


where 𝜇y is the mean for variable i for class y and 𝜎y is the corresponding standard deviation. Thereafter the classification should use the NB formulae presented in the lectures. Alternatives that use addition of conditional probabilities or logs should not be used. 
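
As a minimal sketch of the formula above, the per-feature density and the product over features could be computed with NumPy as follows. The function name `gaussian_pdf` and the example statistics are illustrative assumptions, not part of the assignment brief.

```python
import numpy as np

def gaussian_pdf(x, mean, std_dev):
    """Gaussian density of x for a feature with the given class mean and standard deviation."""
    coefficient = 1.0 / np.sqrt(2.0 * np.pi * std_dev ** 2)
    exponent = -((x - mean) ** 2) / (2.0 * std_dev ** 2)
    return coefficient * np.exp(exponent)

# Hypothetical statistics for one class: two features and a prior of 0.5.
class_mean = np.array([0.0, 1.0])
class_std = np.array([1.0, 2.0])
class_prior = 0.5

row = np.array([0.0, 1.0])  # a row sitting exactly on the class means
likelihoods = gaussian_pdf(row, class_mean, class_std)

# Naive Bayes score: product of the per-feature densities times the prior (no logs, no sums).
score = np.prod(likelihoods) * class_prior
print(likelihoods, score)
```

Repeating the last step for every class and taking the class with the largest score gives the prediction.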
    
The API specification for sklearn classifiers is here: https://scikit-learn.org/stable/developers/develop.html 
You should implement the ‘fit’ and ‘predict’ methods, there is no need to implement ‘predict_proba’. 
Prior probabilities should be calculated from the training data. With this, there will be no need to pass parameters when instances are created. 

Test the performance of your implementation against the GaussianNB implementation in scikit-learn. You should use a range of datasets for this testing. Possible test sets used in lectures are penguins,  diabetes and glassV2. 

### Submission
This is an individual (not group) project. Submission is through the Brightspace page. Your submission should comprise your notebook and the second dataset that you use. Clear all outputs in the notebook before saving for submission. You can use markdown cells in the notebook to report your findings and conclusions. 

---


## Solution - Assignment 1

### Explanation

1. A wrapper function is created which builds either model: SK Naive Bayes or My Naive Bayes.
2. The MyGaussianNB class is defined. 
    1. Fit validates the inputs and prepares the per-class statistics used by predict.
    2. Predict validates that the model has been fitted. For each row in the test data and for each class, it goes through each column, gets the probability from the Gaussian PDF, and multiplies these together with the class prior to get the total row probability for that class. I do not normalise the probabilities across the classes, as only the class with the maximum value matters. The predicted class for each row is added to an array, which is returned. A standard deviation of 0 is guarded against by replacing it with the square root of the smallest positive normalised float, so that its square remains well defined.
3. Some functions are defined to create the models (My Naive Bayes and SK Naive Bayes) and analyse them.
4. Functions are defined to read in all datasets, test my implementation against the default implementation, and validate that they produce equal classification reports.
5. We observe that the classification results are equal across the datasets without missing values.
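
The fit and predict steps described above can be sketched compactly as follows. This is an illustration under stated assumptions rather than the class itself: the helper names `fit_stats` and `unnormalised_posteriors` are hypothetical, and the data is synthetic. It also shows why skipping normalisation is safe: dividing every class score by the same positive total cannot change which score is largest.

```python
import numpy as np

def fit_stats(X, y):
    """Per-class mean, standard deviation and prior (the statistics prepared during fit)."""
    stats = {}
    for label in np.unique(y):
        rows = X[y == label]
        stats[label] = (rows.mean(axis=0), rows.std(axis=0), len(rows) / len(y))
    return stats

def unnormalised_posteriors(row, stats):
    """Product of the per-feature Gaussian densities times the class prior, per class."""
    scores = {}
    for label, (mean, std, prior) in stats.items():
        pdf = np.exp(-((row - mean) ** 2) / (2 * std ** 2)) / (np.sqrt(2 * np.pi) * std)
        scores[label] = np.prod(pdf) * prior
    return scores

# Synthetic two-class data: clusters centred on 0 and on 5.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

stats = fit_stats(X, y)
scores = unnormalised_posteriors(np.array([4.5, 5.2]), stats)

# Normalising divides every score by the same total, so the arg-max is unchanged.
total = sum(scores.values())
normalised = {label: value / total for label, value in scores.items()}

predicted = max(scores, key=scores.get)
predicted_after_normalising = max(normalised, key=normalised.get)
print(predicted, predicted_after_normalising)
```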


----

# Assignment 2 Specific

## Objective

The objective of this assignment is to explore how missing values can be handled in supervised machine learning. 

Two strategies will be explored
1. Consider missing values explicitly in the classification algorithm (Gaussian Naive Bayes) 
2. Use imputation methods to guess missing values. 

## Requirements


You may use the code from your submission for the first assignment as your starting point or you may use the sample solution for that assignment that will be provided. 

### Part 1 

Extend the Gaussian Naive Bayes code so that it handles missing values. Gaussian Naive Bayes can handle missing values in training by calculating conditional probabilities on the values that are present. You may choose to put a limit on the number of missing values allowed. 

Your code should also handle missing values on any test data. The easiest way to do this is to leave features with missing values out of the posterior probability calculation.
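
A small sketch of both ideas, assuming NumPy's NaN-aware reductions are used: class statistics are computed from the values that are present, and features that are missing at prediction time drop out of the product. The helper `gaussian_pdf`, the toy matrix and the prior of 0.4 are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(x, mean, std_dev):
    """Gaussian density; a NaN input propagates to a NaN output."""
    return np.exp(-((x - mean) ** 2) / (2 * std_dev ** 2)) / (np.sqrt(2 * np.pi) * std_dev)

# Training rows for one class, with some values missing.
X_class = np.array([[1.0, 10.0],
                    [np.nan, 12.0],
                    [3.0, np.nan]])

# Statistics are calculated only on the values that are present in each column.
mean = np.nanmean(X_class, axis=0)
std = np.nanstd(X_class, axis=0)
prior = 0.4  # hypothetical class prior

# A test row with one missing feature: nanprod skips the NaN density.
partial_row = np.array([2.0, np.nan])
partial_score = np.nanprod(gaussian_pdf(partial_row, mean, std)) * prior

# A test row with every feature missing: nanprod of an all-NaN array is 1,
# so the score collapses to the class prior.
empty_row = np.array([np.nan, np.nan])
empty_score = np.nanprod(gaussian_pdf(empty_row, mean, std)) * prior

print(mean, std, partial_score, empty_score)
```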

Comment on any design decisions you make in markup. 

### Part 2

Test the performance of your implementation against the scikit-learn GaussianNB using missing value imputation.

Test two imputation options, one univariate and one multi-variate.  To help with your evaluation two versions of the penguins datasets with missing values are provided, one with 20% missing and the other with 40%. 

You should use cross validation for testing, taking care that any scaling and imputation is handled properly within cross validation.
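
One common way to keep scaling and imputation inside cross validation is to wrap them in a scikit-learn pipeline, so that each fold fits the scaler and imputer on its own training portion only. The sketch below assumes synthetic data with roughly 20% of values removed; it is not the penguins data and is not necessarily how the solution later in this notebook performs its cross validation.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data with roughly 20% of the entries knocked out.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (60, 3)), rng.normal(3.0, 1.0, (60, 3))])
y = np.array([0] * 60 + [1] * 60)
X[rng.random(X.shape) < 0.2] = np.nan

# Scaler and imputer are refitted on the training portion of every fold.
pipeline = make_pipeline(StandardScaler(), SimpleImputer(strategy="mean"), GaussianNB())
scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=10)

print(scores.mean(), scores.std())
```

Swapping `SimpleImputer` for `IterativeImputer` or `KNNImputer` in the same pipeline gives the multivariate variants without changing the evaluation code.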

Comment on the results of your evaluation. 


---

## Solution - Assignment 2 Specific - Design Decisions/Implementation Details

1. I extend my assignment 1 solution to handle NANs. 
    
    
    1. In the fit method, I alter the calculation of the class statistics so that they are based on the non-nan values, using the Numpy nanmean and nanstd functions within the \_\_set_class_stats() method. I also change the check_X_y call to include the force_all_finite=False parameter, which allows the input to contain inf, -inf, and nan values.
    
    
    2. In the predict method, I alter the calculation of the row probability to use nanprod * class percent. When calculating a probability, in the event that the entire row is nan, np.nanprod(array)=1, so the probability for that row becomes the overall class probability within the training set and the prediction defaults to the majority class. I believe this is a logical and suitable method in the absence of other information. I am highly reluctant to filter out data: if a user passes a 100 row dataset with 10 NAN rows, the expected outcome in a production environment should be 100 y predictions, not 90 predictions with 10 NANs. While this biases predictions where the training data is unrepresentative or the class weighting is borderline, datasets which are more heavily skewed are better fit by this approach. Where only some of the data in a row is missing, the row probability excludes the missing values from the calculation.
    
    
    3. Outside of the class, I adjust test_datasets to include the missing value datasets. These produce an error with the default SKLearn implementation, so I add a try and except for this case. I also alter the ingestion function to read ? as nan and to treat the first column as an index for the MV datasets.
    
    
    4. For troubleshooting purposes I add a function to show some of the attributes of MyGaussianNB() and to return the models so that I can investigate and verify the probability calculations by performing some spot-checks.
    
    
    5. I enable the ability to not use a scaler but keep the StandardScaler as the default.
    
    
    6. I validate that my changes have not altered the performance on the original dataset.
    
    
    7. One note which might be useful for overcoming scenarios where the bias towards the majority class is undesired: the user could be allowed either to request a probabilistic selection (where all values are nan and the user passes, e.g., 'binomial' as a parameter, the class is selected according to that distribution), or alternatively to supply weightings for the class percentage (this would essentially serve as the user supplying a Bayesian prior probability, as it would capture what the user suspects the ultimate class probability should be within their dataset). As the results of the model were actually surprisingly good with minimal changes over the penguins dataset, I have elected not to include this.
  
    
    8. If a user passes a column consisting entirely of nulls, I raise an error: this should be dealt with in preprocessing and should not be allowed.
    
    
    
    
2. I implement the imputation method.


    1. I change the create_sk_naive_bayes function to accept four possible imputation values: None, KNN, univariate, and multivariate.
    
    
    2. The imputer selected by this parameter is fitted on the training split and used to transform both the training and test splits before the model is fitted. 
    
    
    3. I adjust the function test_on_single_dataset to run SKLearn's implementation with each of the imputers. All functions take the random state 14395076. I use cross validation with 10 folds.
    
    
    4. The report generation function is adjusted to combine the results of each of the runs (MyGNB, SKGNB_NoImpute, SKGNB_meanImpute, SKGNB_iterImpute, SKGNB_KNNImpute) to allow for an easy comparison. I add the cross validation scores into the classification report so that they are consolidated in a single location, but they are also printed during the runs.
    
    
    
 
3. Results of the imputation:


    1. Penguins MV 0.2 - This is largely uninteresting, minor differences between Iter/Multi and Mine/Uni.
    
    
    2. Penguins MV 0.4 - This is the more interesting scenario.
        
        
        1. MyNB()
        
            1. Accuracy: 0.8909090909090909
            2. 10-Fold Acc: 0.8705882352941176
            3. 10-Fold Stddev: 0.038965332777169584
        
        2. SKNB() with Mean Imputation
        
            1. Accuracy: 0.8363636363636363
            2. 10-Fold Acc: 0.8498217468805704
            3. 10-Fold Stddev: 0.05064279039960052
        
        3. SKNB() with Iterative Imputation
        
            1. Accuracy: 0.8454545454545455
            2. 10-Fold Acc: 0.8374331550802138
            3. 10-Fold Stddev: 0.06728459087519195
        
        4. SKNB() with KNN Imputation
        
            1. Accuracy: 0.7909090909090909
            2. 10-Fold Acc: 0.8195187165775402
            3. 10-Fold Stddev: 0.056256088617336775

4. Commentary on results: I'm quite surprised that the custom implementation performs better than SKLearn with any of the imputation methods. Looking at the cross validation means and standard deviations, the differences may not be statistically significant at the 0.05 level outside of the KNN results (taking MyNB as the baseline and checking +-2 std around its 10-fold accuracy mean, all approaches except KNN appear to fall within this range). However, I was expecting the results to be more balanced. I believe a significant reason for these results is that the Penguins dataset is imbalanced towards the majority class: my implementation has a natural bias towards this class because of how it handles missing values, which I believe is elevating its accuracy. If the dataset were more balanced between classes, we would potentially see more of an equivalence between the approaches. Univariate and iterative imputation are already quite similar in terms of performance. KNN imputation is the weakest, but this could potentially be improved with hyperparameter tuning of the number of neighbours; I chose 5 neighbours arbitrarily, and this clearly has not worked well.
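
The hyperparameter tuning mentioned above could be sketched with a grid search over `n_neighbors`, keeping the imputer inside a pipeline so that it is refitted on each training fold. This is an illustration on synthetic data with assumed candidate values; it was not run on the penguins datasets and says nothing about what the best value would be there.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# Synthetic two-class data with roughly 40% of the entries knocked out.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (60, 3)), rng.normal(3.0, 1.0, (60, 3))])
y = np.array([0] * 60 + [1] * 60)
X[rng.random(X.shape) < 0.4] = np.nan

pipeline = Pipeline([("imputer", KNNImputer(weights="uniform")),
                     ("clf", GaussianNB())])

# Candidate neighbour counts are an arbitrary illustrative grid.
param_grid = {"imputer__n_neighbors": [1, 3, 5, 10, 15]}
search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=10)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```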

#### Required Modules

In [None]:
####--------------------------------------
#00.Import Modules
####--------------------------------------


######---------BEGIN
#      SUPPRESS DEPRECATION WARNINGS: Applicable to datetime_is_numeric=True
######--------END

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

######---------BEGIN
#      ML
######--------END

#import nltk as nl
import sklearn as sk
#import xgboost as xg
#import pymc3 as pymc
#import sympy as sym

from sklearn.model_selection import train_test_split, KFold, cross_val_score

from sklearn import metrics
from sklearn.metrics import mean_squared_error, accuracy_score, precision_score, recall_score,f1_score

from sklearn.preprocessing import MinMaxScaler, StandardScaler

from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels

from sklearn.naive_bayes import GaussianNB

from sklearn.base import BaseEstimator, ClassifierMixin

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer, KNNImputer

#from sklearn.tree import export_graphviz
#from sklearn.tree import DecisionTreeClassifier
#from sklearn.ensemble import RandomForestClassifier


######---------BEGIN
#      SQL
######--------END



######---------BEGIN
#     GENERAL
######--------END

import pandas as pd
import numpy as np
import sys
import time

######---------BEGIN
#     DATA VIS
######--------END

import matplotlib as mp
#from bokeh import *
#from dash import *

import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.dates as mdates

# My Gaussian Naive Bayes Implementation

In [None]:
class MyGaussianNB(BaseEstimator, ClassifierMixin):
    """A class to capture My Gaussian Naive Bayes
    
    Input: 
    BaseEstimator
    ClassifierMixin
    
    Output:
    An item of the MGNB class.
    """
    
    
    def __init__(self):
        """Initialise
        
        Fit function populates each Attribute.
        
        Important Note: I am not defining the attributes here, they're merely being documented.
        This is done so that the is_fitted method works to validate that the instance has been fitted, and raises an error if not"""
    
        #Initial data
        ##self.X_=None
        ##self.y_=None
        
        #Unique Class Labels
        #self.classes_=None
        
        #Statistics for each class - Mean, std, Perc
        ##self.summary_stats_=dict()
        
        #Matrix for each Class
        ##self.class_dict_=dict()
        
        #Predict Variables
        ##self.X_test_=None
        ##self.y_test_=None
        
        #Cont Table
        ##self.cont_dict_=dict()
        ##self.raw_dict_=dict()
        
        
    
    def fit(self, Xt, yt):
        """A function to fit the data.
        
        ---
        First it validates the checks 
        -> Is it a Dataframe, list, or Numpy Array?
        -> Is it non-empty?
        -> Are they the same size?
        -> Is it numeric only?
        
        Second:
        It sets the X and y values
        
        Third:
        It gets the unique classes.
        
        Fourth:
        It gets the summary stats which are needed for the probability calculation for each class.
        ---
        
        I work solely with numpy arrays. 
        In hindsight it would have been much easier to use Pandas dataframes but I believe numpy is meant to be faster as it's based in C.
        
        Input:
        Xt: List - X values
        yt: List - y values
        
        Output:
        self: the fitted estimator"""
        
        #Validate the type of the inputs -> Note I wrote this before seeing SKLearn recommendation
        Xt=self.__validate_input_type(Xt)
        yt=self.__validate_input_type(yt)  
        
        #Validate the training set contains numbers
        if not self.__validate_numeric(Xt):
            raise ValueError("All inputs in the training set should be numeric")
        
        if not self.__validate_length(X=Xt,y=yt):
            raise ValueError("Arrays are of different length")
            
        if len(yt)<1:
            raise ValueError("Empty arrays are not allowed")
            
            
        nulls_found_in_all_column=self.__check_if_all_nan(full_set=Xt)

        #https://scikit-learn.org/stable/developers/develop.html Recommends using these two instead of the above.
        Xt, yt = check_X_y(Xt, yt, force_all_finite=False)
        
        #Post Validation
        self.X_ = Xt
        
        #Store class labels as strings
        self.y_ = np.array(yt, dtype=str)
    
        
        #Set the list of classes
        self.classes_ = unique_labels(self.y_)
        
        #Set the class dictionary
        self.__class_dict()
        
        #Get the stats for each Class and set it
        self.__set_class_stats()
        
        return self
    
    
    def predict(self, X_test):
        """A function to predict a test set.
        
        Input: 
        X_test: List/array/dataframe
        
        Output:
        Numpy array of predicted class labels (as strings), one per row of X_test"""
        
        #Check the model has been fitted -> Raises NotFittedError if not.
        #This must run before self.X_ is used in the column count check below.
        check_is_fitted(self, attributes=['X_','y_','classes_','class_dict_','summary_stats_'])
        
        X_test=self.__validate_input_type(X_test)
        
        #Validate the test set contains numbers
        if not self.__validate_numeric(X_test):
            raise ValueError("All inputs in the test set should be numeric")
            
        #Validate the test set has same column count as training
        if not self.__validate_column_count(X_train=self.X_,X_test=X_test):
            raise ValueError("The test set should have the same column count as the training set")
            
            
        nulls_found_in_all_column=self.__check_if_all_nan(full_set=X_test)
        
        X_test = check_array(X_test, accept_sparse=True, force_all_finite=False)
        
        #Passed validation
        self.X_test_=X_test
        
        #BEGIN WITH PREDICTING
        y_pred=[]
        
        column_count=len(X_test[0])
        
        class_summary_stats=self.get_class_stats()
        
        row_class_dict=dict()
        raw_contingency_dictionary=dict()
        
        #For each row in the test data, take the row
        for row_index in range(len(X_test)):
            
            #get a single row
            row_value=X_test[row_index]
            row_class_dict[row_index]=dict()
            raw_contingency_dictionary[row_index]=dict()
            
            #For that row, get the probability values for every class
            for class_label in self.get_classes():
                
                #Empty array to calculate probability for that row
                row_list=[]
                
                #class Stats
                class_mean=class_summary_stats[class_label]['Mean']
                class_std=class_summary_stats[class_label]['Standard Deviation']
                class_percent=class_summary_stats[class_label]['Percent']
                
                #Go through column for that row and add in probability:
                for column_index in range(column_count):
                    
                    row_list+=[self.__calculate_gaussian_probability(x=row_value[column_index]
                                                                         ,mean=class_mean[column_index]
                                                                     , std_dev=class_std[column_index])]
                        
                    

                
                #This stores the probability values calculated for each row for each class to refer back to later
                raw_contingency_dictionary[row_index][class_label]= np.array(row_list)
                
                #This just stores the probability for that class
                row_class_dict[row_index][class_label]= np.nanprod(np.array(row_list)) * class_percent
                
            
            #Probabilities calculated for each class - for that row index get the class:
            class_for_row=max(row_class_dict[row_index], key=row_class_dict[row_index].get)
            
            y_pred+=[class_for_row]
            
        y_pred=np.array(y_pred)   
        self.cont_dict_=row_class_dict
        self.raw_dict_=raw_contingency_dictionary
        
        return y_pred
    
    
    
    def predict_proba(self, Xtes):
        """A function to predict the probability"""
        pass
    # We should really be implementing predict_proba as well.
    
    
    
    #A function to get a dictionary for the class
    def __class_dict(self):
        """A function to make a dictionary such that each class is held in the dictionary and all rows of data are put against that class.
        
        Input: 
        X -> n-dimensional Array
        y -> Single-dimensional array
        
        Output:
        Dictionary -> {Class1:[Row_{1,1},Row_{1,2},...Row_{1,K}]
                        ,Class2:[Row_{2,1},...Row_{2,m}]
                        ,...
                        ,ClassN:[R_{N,1},...R_{N,x}]}
        
        """
        class_dict=dict()
        
        X=self.X_
        y=self.y_
        
        #Each row in the matrix
        for row_index in range(len(y)):
            
            class_value=y[row_index]
            
            if class_value not in class_dict:
                class_dict[class_value]=[np.array(X[row_index])]
                
            else:
                class_dict[class_value]+=[np.array(X[row_index])]
                
        
                
        for class_value in self.classes_:
            class_dict[class_value]=np.array(class_dict[class_value])
            
        self.class_dict_=class_dict
        
        return
            
    #A function to get the statistics for each class
    def __set_class_stats(self):
        """For each class, get the summary statistics, mainly:
        -> Mean, standard dev, percentage of presence
        
        I.e.:
        class1:[Mean_1,std_1,Percent of Total_1]
        ,class2:[Mean_2,std_2,Percent of Total_2]
        ,...,
        classN:[Mean_N,std_N,Percent of Total_N]
        
        """
        
        stat_dict=dict()
        
        for pred_class in self.classes_:
            
            #Matrix of classes
            class_matrix=self.class_dict_[pred_class]
            
            #Class:
            stat_dict[pred_class]={
                                'Mean':self.__get_column_means(class_matrix)
                                ,'Standard Deviation':self.__get_column_std_dev(class_matrix)
                                ,'Percent':(self.__get_row_count(class_matrix) / len((self.y_)))
                                ,'Class Count':(self.__get_row_count(class_matrix))
                                }
    
        
        self.summary_stats_=stat_dict
        
        return
        
    
    #Getter
    def get_x(self):
        return self.X_
    
    #Getter
    def get_y(self):
        return self.y_
    
    #Getter
    def get_classes(self):
        return self.classes_
    
    #Getter
    def get_class_dict(self):
        return self.class_dict_
    
    #Getter
    def get_class_stats(self):
        return self.summary_stats_
    

    #Getter
    def get_raw_dict(self):
        return self.raw_dict_
    
    #Getter
    def get_cont_dict(self):
        return self.cont_dict_
                
        
    #MEANS
    @staticmethod   
    def __get_column_means(X):
        """Get the column mean of an Array x"""
        return np.nanmean(X,axis=0)
    
    @staticmethod
    def __get_row_means(X):
        """Get the row mean of an Array x"""
        return np.nanmean(X,axis=1)
    
    
    #Standard Deviations
    @staticmethod
    def __get_column_std_dev(X):
        """Get the column std dev of an Array x"""
        return np.nanstd(X,axis=0)
    
    @staticmethod
    def __get_row_std_dev(X):
        """Get the row std dev of an Array x"""
        return np.nanstd(X,axis=1)
    
    #Counts
    @staticmethod   
    def __get_column_count(X):
        """Get the column Count of an Array x"""
        return len(X.T)
    
    @staticmethod   
    def __get_row_count(X):
        """Get the row count of an Array x (rows containing NAN are included)"""
        return len(X)
    
    @staticmethod  
    def __check_if_all_nan(full_set):
        """Check if any columns are all NAN and if so return an error."""
        
        nulls_found=False
        
        for col_index in range(full_set.shape[1]):

            if np.all(np.isnan(full_set[:,col_index])):
                nulls_found=True
                raise ValueError("An entire column in your data is NaN. Please remove this column in preprocessing")

        return nulls_found

    
    #Probability Function
    @staticmethod
    def __calculate_gaussian_probability(x, mean, std_dev):
        """A function to implement the pdf for a gaussian"""
        
        if std_dev==0:
            #This way squaring it will be well defined
            std_dev=sys.float_info.min**(1/2)
            
        first_term= (1 / (np.sqrt(2 * np.pi) * std_dev))
        second_term= (np.exp(-((x-mean)**2 / (2 * (std_dev**2) ))))
        gaussian_pdf =  first_term*second_term
        
        return  gaussian_pdf
        
    
    #Validation
    @staticmethod
    def __validate_numeric(X):
        """A function to check that X is a numeric array"""
        return X.dtype.kind in ('b','u','i','f','c')
    
    #Validation
    @staticmethod
    def __validate_length(X,y):
        """A function to check that X and y are equal length"""
        return len(X)==len(y)
    
    @staticmethod
    def __validate_column_count(X_train,X_test):
        """A function to check that X_train and X_test have the same number of columns"""
        return len(X_train[0])==len(X_test[0])
    
    @staticmethod
    def __validate_input_type(X):
        """Validate the type of X"""
        if not (isinstance(X,list) or isinstance(X, np.ndarray)):
            if isinstance(X,pd.DataFrame):
                X=X.to_numpy()
                return X

            else:
                raise TypeError("Input must be a list, numpy array, or dataframe")
                
        return np.array(X)



# A function to analyse Models and wrap model creation

In [None]:
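def equal_index_array(l1,l2):
    """Return a list of booleans flagging, position by position, whether l1 and l2 match"""
    
    #Convert to arrays so positional indexing is used even when a pandas Series is passed
    l1=np.asarray(l1)
    l2=np.asarray(l2)
    
    bool_array=[]
    
    for pos in range(len(l1)):
        bool_array+=[l1[pos]==l2[pos]]
        
    return bool_array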


def model_metrics(testActualVal, predictions,verbose=True):
    """A function to get Metrics for a Model
    
    Note: This is a function I wrote for the Research Practicum: 
    https://github.com/Team10UCD/Frontend/blob/cc12998790b7207a859d5089e10f085a65586294/flask/Data_Analytics/Model_Analytics/Route102_sample/02_local_ModelExplorationAndFeatureSelection_Route102.ipynb"""
    
    try:
        accuracy=metrics.accuracy_score(testActualVal, predictions)
    except Exception:
        accuracy=None
    
    #f1_score defaults to average='binary', so this is None for multiclass targets
    try:
        fscore=metrics.f1_score(testActualVal, predictions)
    except Exception:
        fscore=None
    
    try:
        confusion_matrix=metrics.confusion_matrix(testActualVal, predictions)
    except Exception:
        confusion_matrix=None
        
    try:
        classification_rep=metrics.classification_report(testActualVal, predictions,output_dict=True)
    except Exception:
        classification_rep=None
    
    if verbose:
        
        try:
            print("----DETAIL----")
            print("\n\nAccuracy: \n")
            display(accuracy)
            print("\n\nConfusion matrix: \n")
            display(confusion_matrix)
            print("\n\nClassification report:\n ")
            display(classification_rep)
            
        except:
            print("----DETAIL----")
            print("\n\nAccuracy: \n", accuracy)
            print("\n\nConfusion matrix: \n", confusion_matrix)
            print("\n\nClassification report:\n ", classification_rep)
            
    
    result_dict={}
    result_dict['Accuracy']=accuracy    
    result_dict['F1-Score']=fscore
    result_dict['Confusion']=confusion_matrix
    result_dict['ClassificationRep']=classification_rep
    return result_dict


def create_model(X,y,scaler='Standard',random_state=14395076,imputer=None,plot_comp=True,assess=True,test_size=0.33, verbose=True,mod_type=''):
    """A wrapper to call the create model function"""

    
    if mod_type=='My Naive Bayes':
        mod_result=create_my_naive_bayes(X=X
                                         ,y=y
                                         ,plot_comp=plot_comp
                                         ,scaler=scaler
                                         ,assess=assess
                                         ,random_state=random_state
                                         , verbose=verbose
                                        ,test_size=test_size)
        return mod_result

    
    
    elif mod_type=="SK Naive Bayes":
        mod_result=create_sk_naive_bayes(X=X
                                         ,y=y
                                         ,plot_comp=plot_comp
                                         ,imputer=imputer
                                         ,random_state=random_state
                                         ,scaler=scaler
                                         ,assess=assess
                                         , verbose=verbose
                                        ,test_size=test_size)
    
        return mod_result
    
    else:
        print("Sorry, I've not built that yet.")
        raise ValueError("Must be My Naive Bayes or SK Naive Bayes")

# Create a model using the default SKLearn estimator

In [None]:
def create_sk_naive_bayes(X,y,plot_comp,scaler='Standard', imputer=None, random_state=14395076,assess=True, test_size=0.33,verbose=True):
    """Create a model using SK's Naive Bayes."""
    
    
    print("""
-
SKLEARN NAIVE BAYES - Imputation Method - {}:
-
          
          """.format(imputer))
    
    
    
    
    X_train, X_test, y_train, y_test = train_test_split(X 
                                                        ,y
                                                        ,random_state=random_state
                                                        ,test_size=test_size)
    
    
    
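    #Keep the scaler object in its own variable: overwriting the string
    #argument would make the membership check below always fail.
    if scaler=='Standard':
        scaler_obj = StandardScaler()
    elif scaler=='MinMax':    
        scaler_obj = MinMaxScaler()
    else:
        scaler_obj = None
    
    if scaler_obj is not None:
        X_train = scaler_obj.fit_transform(X_train)
        X_test = scaler_obj.transform(X_test)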
    
    
    if imputer=='univariate':
        imp = SimpleImputer(missing_values=np.nan, strategy='mean')
        X_train = imp.fit_transform(X_train)
        X_test = imp.transform(X_test)
        
    
    elif imputer=='multivariate':
        imp=IterativeImputer(max_iter=100, random_state=random_state)
        X_train = imp.fit_transform(X_train)
        X_test = imp.transform(X_test)
        
    elif imputer=='KNN':
        imp = KNNImputer(n_neighbors=5, weights="uniform")
        X_train = imp.fit_transform(X_train)
        X_test = imp.transform(X_test)
        
        
    start=time.perf_counter()
    
    #Create the SKLearn Gaussian Naive Bayes classifier
    model = GaussianNB()

    #Fit the data
    model.fit(X_train,y_train)
   
    
    #Check the predictions
    predictions = model.predict(X_test)
    end=time.perf_counter()
    
    print("Total Time to Classify: {}".format(end-start))
    
    
    #Results
    pred_vs_act_df=pd.DataFrame({'Actual':y_test
                                 ,'PredictionClass':predictions
                                ,'Diff':equal_index_array(l1=y_test,l2=predictions)})
    
    model_metric=model_metrics(testActualVal=y_test, predictions=predictions, verbose=True)
    
    try:
        if imputer=='univariate':
            imp = SimpleImputer(missing_values=np.nan, strategy='mean')
            X_cross = imp.fit_transform(X)


        elif imputer=='multivariate':
            imp=IterativeImputer(max_iter=100, random_state=random_state)
            X_cross = imp.fit_transform(X)
            
        elif imputer=='KNN':
            imp = KNNImputer(n_neighbors=5, weights="uniform")
            X_cross = imp.fit_transform(X)
            
        else:
            X_cross=X
        
        scores = cross_val_score(GaussianNB(), X_cross, y, scoring='accuracy', cv=10)
        print(scores)
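    except Exception as error:
        #Fall back to NaN so the summary statistics below are still defined
        print("ERROR", error)
        scores=np.array([np.nan])

    cv_rmse = scores #These are accuracy scores; the variable name is a holdover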
    print("Avg Accuracy score over 10 folds: \n", np.mean(cv_rmse))
    print("Stddev Accuracy score over 10 folds: \n", np.std(cv_rmse))
    
    model_metric['ClassificationRep']['Classification Time']=end-start
    result_dict={}
    result_dict['Model']=model
    result_dict['Classification_Time']=end-start
    result_dict['Actual vs Prediction']=pred_vs_act_df
    
    result_dict['Accuracy']=model_metric['Accuracy']
    result_dict['Confusion']=model_metric['Confusion']
    
    
    result_dict['ClassificationRep']=model_metric['ClassificationRep']
    result_dict['ClassificationRep']['Cross_Val_Acc_Mean']=np.mean(cv_rmse)
    result_dict['ClassificationRep']['Cross_Val_Acc_STD']=np.std(cv_rmse)
    
    result_dict['CrossVal_Acc_Mean']=np.mean(cv_rmse)
    result_dict['CrossVal_Acc_STD']=np.std(cv_rmse)

    return result_dict

# Create a model using my SKLearn estimator
Note: As long as the same random_state and test_size are passed, both models are trained and tested on the same split.
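
The note above can be checked directly. The following is a minimal sketch, using toy data that is not part of the assignment, showing that two calls to train_test_split with the same random_state and test_size return identical pieces.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for a real dataset (hypothetical values).
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array(["a", "b"] * 10)

# Two independent calls with the same random_state and test_size.
split_one = train_test_split(X, y, random_state=14395076, test_size=0.33)
split_two = train_test_split(X, y, random_state=14395076, test_size=0.33)

# Every returned piece (X_train, X_test, y_train, y_test) is compared in turn.
same_split = all(np.array_equal(a, b) for a, b in zip(split_one, split_two))
print(same_split)
```

Because the split depends only on the data length, the seed and the test size, the two model-building functions see the same rows as long as they receive the same X and y.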

In [None]:
def create_my_naive_bayes(X,y,plot_comp,scaler='Standard',random_state=14395076, assess=True,test_size=0.33, verbose=True):
    """Create a model using My Naive Bayes"""
    

    print("""
-
My NAIVE BAYES:
-
           
          """)
    
    X_train, X_test, y_train, y_test = train_test_split(X 
                                                        ,y
                                                        ,random_state=random_state
                                                        ,test_size=test_size)
    
    if scaler=='Standard':
        scaler = StandardScaler()
    elif scaler=='MinMax':    
        scaler = MinMaxScaler()
    
    # scaler now holds a scaler object (not the original string), so test its type
    if isinstance(scaler, (StandardScaler, MinMaxScaler)):
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)
    
    start=time.perf_counter()
    #Create the MyGaussianNB classifier
    model = MyGaussianNB()

    #Fit the data
    model.fit(X_train,y_train)
   
    
    #Check the predictions
    predictions = model.predict(X_test)
    end=time.perf_counter()
    
    print("Total Time to Classify: {}".format(end-start))
    
    #Results
    pred_vs_act_df=pd.DataFrame({'Actual':y_test
                                 ,'PredictionClass':predictions
                                ,'Diff':equal_index_array(l1=y_test,l2=predictions)})
    
    model_metric=model_metrics(testActualVal=y_test, predictions=predictions, verbose=True)
    
    
    scores = cross_val_score(MyGaussianNB(), X, y, scoring='accuracy', cv=10)
    print(scores)

    cv_rmse = scores #**0.5
    print("Avg Accuracy score over 10 folds: \n", np.mean(cv_rmse))
    print("Stddev Accuracy score over 10 folds: \n", np.std(cv_rmse))
    
    model_metric['ClassificationRep']['Classification Time']=end-start
    result_dict={}
    result_dict['Model']=model
    result_dict['Classification_Time']=end-start
    result_dict['Actual vs Prediction']=pred_vs_act_df
    result_dict['Accuracy']=model_metric['Accuracy']
    result_dict['Confusion']=model_metric['Confusion']
    
    result_dict['ClassificationRep']=model_metric['ClassificationRep']
    
    result_dict['CrossVal_Acc_Mean']=np.mean(cv_rmse)
    result_dict['CrossVal_Acc_STD']=np.std(cv_rmse)

    return result_dict

# For a single dataset, create both models and add them to the same dataframe for that file for easy comparison.

In [None]:
def test_on_single_dataset(filename,x_columns,y_column,random_state=14395076,test_size=0.33,scaler='Standard'):
    """A function to test a single dataset"""
    
    #Read in the data
    
    if filename in ('penguins 0.2','penguins 0.4'):
        df=pd.read_csv(filename,index_col=0,na_values = '?')
    else:
        df=pd.read_csv(filename,na_values = '?')
        
    X=df[x_columns].values
    y=df.pop(y_column[0]).astype(str).values
    
    use_scaler=scaler
    
    my_dict=create_model(X,y,scaler=use_scaler,random_state=random_state,plot_comp=True,assess=True,test_size=test_size, verbose=True,mod_type='My Naive Bayes')
    
    #Protecting against changes to X by imputing
    
    if filename in ('penguins 0.2','penguins 0.4'):
        df=pd.read_csv(filename,index_col=0,na_values = '?')
    else:
        df=pd.read_csv(filename,na_values = '?')
        
    X=df[x_columns].values
    y=df.pop(y_column[0]).astype(str).values
    
    
    try:
        sk_dict_no_Impute=create_model(X,y,scaler=use_scaler,imputer=None, random_state=random_state,plot_comp=True,assess=True,test_size=test_size, verbose=True,mod_type='SK Naive Bayes')
    except Exception as e:
        sk_dict_no_Impute={'ClassificationRep':dict()}
        print("DEFAULT SKLEARN ERROR: {}\n".format(e))
        
        
    #Protecting against changes to X by imputing  
    if filename in ('penguins 0.2','penguins 0.4'):
        df=pd.read_csv(filename,index_col=0,na_values = '?')
    else:
        df=pd.read_csv(filename,na_values = '?')
        
    X=df[x_columns].values
    y=df.pop(y_column[0]).astype(str).values
    
    try:
        sk_dict_uni_Impute=create_model(X,y,scaler=use_scaler,imputer='univariate', random_state=random_state,plot_comp=True,assess=True,test_size=test_size, verbose=True,mod_type='SK Naive Bayes')
    except Exception as e:
        sk_dict_uni_Impute={'ClassificationRep':dict()}
        print("UNIVARIATE SKLEARN ERROR: {}\n".format(e))
        
    #Protecting against changes to X by imputing    
    if filename in ('penguins 0.2','penguins 0.4'):
        df=pd.read_csv(filename,index_col=0,na_values = '?')
    else:
        df=pd.read_csv(filename,na_values = '?')
        
    X=df[x_columns].values
    y=df.pop(y_column[0]).astype(str).values
    
    try:
        sk_dict_multi_Impute=create_model(X,y,scaler=use_scaler,imputer='multivariate', random_state=random_state,plot_comp=True,assess=True,test_size=test_size, verbose=True,mod_type='SK Naive Bayes')
    except Exception as e:
        sk_dict_multi_Impute={'ClassificationRep':dict()}
        print("MULTIVARIATE SKLEARN ERROR: {}\n".format(e))
        
        
        
    #Protecting against changes to X by imputing   
    if filename in ('penguins 0.2','penguins 0.4'):
        df=pd.read_csv(filename,index_col=0,na_values = '?')
    else:
        df=pd.read_csv(filename,na_values = '?')
        
    X=df[x_columns].values
    y=df.pop(y_column[0]).astype(str).values

    try:
        sk_dict_knn_Impute=create_model(X,y,scaler=use_scaler,imputer='KNN', random_state=random_state,plot_comp=True,assess=True,test_size=test_size, verbose=True,mod_type='SK Naive Bayes')
    except Exception as e:
        sk_dict_knn_Impute={'ClassificationRep':dict()}
        print("KNN SKLEARN ERROR: {}\n".format(e))
        
    
    my_df=pd.DataFrame(my_dict['ClassificationRep'])
    sk_df_no_impute=pd.DataFrame(sk_dict_no_Impute['ClassificationRep'])
    sk_df_uni_impute=pd.DataFrame(sk_dict_uni_Impute['ClassificationRep'])
    sk_df_multi_impute=pd.DataFrame(sk_dict_multi_Impute['ClassificationRep'])
    sk_df_knn_impute=pd.DataFrame(sk_dict_knn_Impute['ClassificationRep'])
    
    my_df['Type']='MyGaussianNB'
    sk_df_no_impute['Type']='SKGaussianNB_NoImputer'
    sk_df_uni_impute['Type']='SKGaussianNB_UniImputer'
    sk_df_multi_impute['Type']='SKGaussianNB_MultiImputer'
    sk_df_knn_impute['Type']='SKGaussianNB_KNN_Imputer'
    
    report_df=pd.concat([my_df,sk_df_no_impute,sk_df_uni_impute,sk_df_multi_impute,sk_df_knn_impute])
    
    return report_df, my_dict, sk_dict_no_Impute, sk_dict_uni_Impute,sk_dict_multi_Impute, sk_dict_knn_Impute

# A function to run the single comparison on all datasets

In [None]:
def test_on_datasets(dataset_dictionary):
    """A function to test on all datasets consistently"""
    
    report_dictionary=dict()
    df_list=[]
    model_dictionay=dict()
    
    for file in dataset_dictionary:
        file_dictionary=dataset_dictionary[file]
        
        fp=file_dictionary['filepath']
        x_columns=file_dictionary['x_columns']
        y_columns=file_dictionary['y_column']
        
        
        print(
    """
---------------
---------------   

BEGIN TESTING ON - {}:

---------------
---------------
    """.format(file))

        single_report_dict, my_model, sk_dict_no_Impute, sk_dict_uni_Impute, sk_dict_multi_Impute, sk_dict_knn_Impute =test_on_single_dataset(filename=fp
                               ,x_columns=x_columns
                               ,y_column=y_columns
                               ,random_state=14395076
                               ,test_size=0.33
                               ,scaler='Standard')
        
        single_report_dict['File']=file
        model_dictionay[file]={'Mine': my_model
                               ,'SKLearn_NoImute':sk_dict_no_Impute
                               ,'SKLearn_UniImute':sk_dict_uni_Impute
                              ,'SKLearn_MultiImute':sk_dict_multi_Impute
                              ,'SKLearn_KNNImute':sk_dict_knn_Impute
                              ,}
        
        single_report_dict.set_index(['File','Type',single_report_dict.index],inplace=True)
        df_list+=[single_report_dict]
        
        report_dictionary[file]=single_report_dict

    #full_report_df=pd.concat(df_list)
    
    return report_dictionary, model_dictionay

# Display the reports in a Notebook file.

In [None]:
def check_all_reports(dataset_dictionary,reports_per_file):
    """A function to display all reports"""
    
    for file in dataset_dictionary:
        display(reports_per_file[file])
        
    return

# Check the contingency table and Raw Data for my Model

1. While generating the MV data I found it very tricky to understand what was happening in rows where every value was nan.

2. Using this method and the model attributes I was able to identify that np.nanprod([nan])=1, so in these instances the prediction is decided by the class priors alone and the most frequent class is returned.

3. Using this format you can access specific data from the model_dictionary. This includes a line-by-line comparison (not for SKLearn unless it has been defined there).
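
The behaviour described in point 2 can be reproduced in isolation. This is a minimal sketch with made-up priors, not the internals of MyGaussianNB: it only shows that np.nanprod treats NaN as 1, so an all-NaN row leaves the prior as the whole score.

```python
import numpy as np

# Hypothetical priors for three classes; class 0 is the most frequent.
priors = np.array([0.5, 0.3, 0.2])

# Per-class likelihoods for a row in which every feature is missing:
# each entry is NaN because there was no feature value to evaluate.
likelihoods = np.full((3, 4), np.nan)

# np.nanprod treats NaN as 1, so each all-NaN row has a product of 1.
row_products = np.nanprod(likelihoods, axis=1)

# The unnormalised posterior therefore equals the prior.
posterior = priors * row_products
predicted = int(np.argmax(posterior))
print(row_products, predicted)
```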

In [None]:
def see_sample_data_for_model(filename,model_dictionary,dataset_dictionary):
    """Display the class stats, raw data and contingency table data of my model for one dataset"""
    
    print("------------")
    
    if filename in dataset_dictionary:
        
        print("""CLASS STATS - {}:""".format(filename))
        display(model_dictionary[filename]['Mine']['Model'].get_class_stats())
        print("""RAW DATA - {}:""".format(filename))
        display(model_dictionary[filename]['Mine']['Model'].raw_dict_)
        print("""CONT. TABLE DATA - {}:""".format(filename))
        display(model_dictionary[filename]['Mine']['Model'].cont_dict_)
        
        print("------------")

    else:
        print("File not found in test datasets")


----

# Assignment Configuration Starts Here

----



# Define the Files to Test.

They should be in the format:

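test_dictionary:{filename:
                        {'filepath',
                        'y_column',
                        'x_columns'}
                  }
                  
Here y_column is a one-element list holding the target column and x_columns is the list of feature columns.

Validation is NOT done on the files (e.g. to ensure values are numeric or non-numeric as expected). This should be done as a preprocessing step; the functions above assume that this work is complete.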
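
Since validation is left to preprocessing, a check along the following lines could be run before testing. This is only a sketch: find_column_problems and the demo frame are hypothetical names introduced for illustration and are not part of the notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def find_column_problems(df, x_columns, y_column):
    """Return a list of readable problems; an empty list means no issue was found."""
    problems = []
    # Every configured column must exist in the frame.
    for col in list(x_columns) + list(y_column):
        if col not in df.columns:
            problems.append("missing column: {}".format(col))
    # Feature columns must have a numeric dtype.
    for col in x_columns:
        if col in df.columns and not is_numeric_dtype(df[col]):
            problems.append("non-numeric feature: {}".format(col))
    return problems


# Small in-memory frame standing in for a CSV file (hypothetical values).
demo_df = pd.DataFrame({
    "bill_length": [39.1, None, 40.3],
    "body_mass": ["3750", "?", "3250"],   # stored as text, so not numeric
    "species": ["Adelie", "Adelie", "Gentoo"],
})

problems = find_column_problems(
    demo_df,
    x_columns=["bill_length", "body_mass", "flipper_length"],
    y_column=["species"],
)
print(problems)
```

A column containing '?' is read as text unless na_values='?' is passed to pd.read_csv, which is the kind of issue this check would surface.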

In [None]:
test_datasets={
    'penguins':
        {
            'filepath':'./Test Datasets/penguins_af.csv'
            ,'y_column':['species']
            ,'x_columns':['bill_length_mm','flipper_length_mm','body_mass_g','bill_depth_mm']
        }
    
    ,'penguins 0.2':
        {
            'filepath':'./Test Datasets/penguinsMV0.2.csv'
            ,'y_column':['species']
            ,'x_columns':['bill_length','flipper_length','body_mass','bill_depth']
        }
    
    ,'penguins 0.4':
        {
            'filepath':'./Test Datasets/penguinsMV0.4.csv'
            ,'y_column':['species']
            ,'x_columns':['bill_length','flipper_length','body_mass','bill_depth']
        }
    
    ,'diabetes':
        {
            'filepath':'./Test Datasets/diabetes.csv'
            ,'y_column':['neg_pos']
            ,'x_columns':['preg', 'plas', 'pres', 'skin', 'insu', 'mass', 'pedi', 'age']
        }
    ,'glass':
        {
            'filepath':'./Test Datasets/glassV2.csv'
            ,'y_column':['Type']
            ,'x_columns':['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe']
        }
}




missing_datasets={
    'penguins 0.2':
        {
            'filepath':'./Test Datasets/penguinsMV0.2.csv'
            ,'y_column':['species']
            ,'x_columns':['bill_length','flipper_length','body_mass','bill_depth']
        }
    
    ,'penguins 0.4':
        {
            'filepath':'./Test Datasets/penguinsMV0.4.csv'
            ,'y_column':['species']
            ,'x_columns':['bill_length','flipper_length','body_mass','bill_depth']
        }
    
}







---

# Test all of the files.

## Assignment 2 Commentary

1. Running the cells below does the following:
    1. It runs model generation (MyGaussianNB and SKLearn's GaussianNB) on all datasets defined in the test_datasets dictionary.
    2. MyGaussianNB runs successfully on the missing value datasets.
    3. The default SKLearn implementation (no imputer) does not run on the missing value datasets.
    4. The SKLearn variants with an imputer (univariate, multivariate, KNN) fill in the missing values first and then run.
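
Points 3 and 4 can be seen on synthetic data. The sketch below is an illustration rather than the method used by the functions above: it shows GaussianNB rejecting NaN input, and an alternative arrangement (an assumption, not the notebook's approach) in which the imputer and classifier are combined in a pipeline so that the imputer is fitted inside each cross-validation fold.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic two-class data with some values knocked out (hypothetical).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 3)), rng.normal(4.0, 1.0, (30, 3))])
y = np.array(["a"] * 30 + ["b"] * 30)
X[::7, 1] = np.nan

# Plain GaussianNB rejects NaN input with a ValueError.
try:
    GaussianNB().fit(X, y)
    raised = False
except ValueError:
    raised = True

# Imputer and classifier in one pipeline: the imputer is re-fitted on the
# training part of every fold, so test folds do not influence imputed values.
pipeline = make_pipeline(SimpleImputer(strategy="mean"), GaussianNB())
scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=5)
print(raised, scores.mean())
```

Imputing all of X before calling cross_val_score, as the functions above do, is simpler but lets the held-out folds contribute to the imputed values.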
    

In [None]:
reports_per_file, model_dictionary=test_on_datasets(dataset_dictionary=test_datasets)

In [None]:
check_all_reports(dataset_dictionary=test_datasets,reports_per_file=reports_per_file)

In [None]:
check_all_reports(dataset_dictionary=missing_datasets,reports_per_file=reports_per_file)

# Checking actual data used.

While making the changes, I had difficulty confirming that the computed values were correct. I developed see_sample_data_for_model to print the raw data (row probabilities for each class, class stats, etc.) for each dataset. The cell below works on both the missing value and complete datasets, and I used it to confirm that the values were adding up correctly.
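
The same kind of check can be done by hand on a tiny example. The following is a minimal sketch using the conditional probability formula from the task description and NaN-aware NumPy functions. It assumes population standard deviations (np.nanstd with its default ddof=0) and toy values, so it is not necessarily identical to what MyGaussianNB computes.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Gaussian density from the task description; NaN in x gives NaN out."""
    return np.exp(-((x - mean) ** 2) / (2 * std ** 2)) / np.sqrt(2 * np.pi * std ** 2)


# Tiny training set with two features and one missing value (hypothetical).
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 12.0],
                    [7.0, 20.0], [8.0, 22.0], [9.0, 24.0]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])
row = np.array([2.5, np.nan])  # test row with the second feature missing

scores = {}
for label in np.unique(y_train):
    rows = X_train[y_train == label]
    prior = len(rows) / len(X_train)
    mean = np.nanmean(rows, axis=0)   # per-feature mean, ignoring NaN
    std = np.nanstd(rows, axis=0)     # per-feature std, ignoring NaN
    likelihoods = gaussian_pdf(row, mean, std)
    # Missing features contribute a factor of 1 to the product.
    scores[label] = prior * np.nanprod(likelihoods)

predicted = max(scores, key=scores.get)
print(scores, predicted)
```

Comparing numbers like these with the displayed class stats and row probabilities is one way to confirm that the missing values are being skipped rather than propagated.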

In [None]:
#for name in missing_datasets:
for name in test_datasets:
    see_sample_data_for_model(name,model_dictionary,dataset_dictionary=test_datasets)

---

### Manual Intervention

For the purposes of assessment, here is a Notebook section which can be modified directly to pass in values without working through all of the code above.


Note: Separating out these elements was awkward because X and y ended up imputed by the previous imputer. This is why each cell begins by reading in the file, splitting, etc., to ensure the values are 'fresh'.
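
An alternative to re-reading the file each time is to keep the DataFrame untouched and take new copies of X and y whenever they are needed. This is a minimal sketch under that assumption; fresh_xy and the demo frame are hypothetical names introduced for illustration and are not used by the cells below.

```python
import numpy as np
import pandas as pd


def fresh_xy(df, x_columns, y_column):
    """Return new X and y arrays without modifying df (no df.pop)."""
    X = df[x_columns].to_numpy(dtype=float, copy=True)
    y = df[y_column[0]].astype(str).to_numpy(copy=True)
    return X, y


# Small in-memory frame standing in for a file that was read once.
demo_df = pd.DataFrame({
    "bill_length": [39.1, np.nan, 40.3],
    "bill_depth": [18.7, 17.4, np.nan],
    "species": ["Adelie", "Adelie", "Gentoo"],
})

X_first, y_first = fresh_xy(demo_df, ["bill_length", "bill_depth"], ["species"])
X_first[np.isnan(X_first)] = 0.0   # simulate an in-place imputation

# A second call still sees the original missing values.
X_second, y_second = fresh_xy(demo_df, ["bill_length", "bill_depth"], ["species"])
print(int(np.isnan(X_second).sum()), "species" in demo_df.columns)
```

Selecting the target with df[...] rather than df.pop(...) leaves the column in place, so the same frame can be reused for every imputer.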

## Penguins 0.2

In [None]:

##############################################################

#-------------READ IN DATA CONFIGURATION---------------------#

##############################################################


#Modify Here to pass in X and Y
scaler='Standard'

fn='penguins 0.2'
fp=test_datasets[fn]['filepath']
    
if fn in ('penguins 0.2','penguins 0.4'):
    df=pd.read_csv(fp,index_col=0,na_values = '?')
else:
    df=pd.read_csv(fp,na_values = '?')


X=df[test_datasets[fn]['x_columns']].values
y=df.pop(test_datasets[fn]['y_column'][0]).values


X_train, X_test, y_train, y_test = train_test_split(X 
                                                    ,y
                                                    ,random_state=14395076
                                                    ,test_size=0.33)

if scaler=='Standard':
    scaler = StandardScaler()
elif scaler=='MinMax':    
    scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


##############################################################

#-------------MODEL---------------------#

##############################################################



print("""


##############################################################

#-------------Mine - {}---------------------#

##############################################################



""".format(fn))


#Create the MyGaussianNB classifier
model = MyGaussianNB()

#Fit the data
model.fit(X_train,y_train)


#Check the predictions
predictions = model.predict(X_test)

#Results
pred_vs_act_df=pd.DataFrame({'Actual':y_test
                             ,'PredictionClass':predictions
                            ,'Diff':equal_index_array(l1=y_test,l2=predictions)})

model_metric=model_metrics(testActualVal=y_test, predictions=predictions, verbose=True)


scores = cross_val_score(MyGaussianNB(), X, y, scoring='accuracy', cv=10)
print(scores)

cv_rmse = scores #**0.5
print("Avg Accuracy score over 10 folds: \n", np.mean(cv_rmse))
print("Stddev Accuracy score over 10 folds: \n", np.std(cv_rmse))


result_dict={}
result_dict['Model']=model
result_dict['Actual vs Prediction']=pred_vs_act_df
result_dict['Accuracy']=model_metric['Accuracy']
result_dict['Confusion']=model_metric['Confusion']
result_dict['ClassificationRep']=model_metric['ClassificationRep']
result_dict['CrossVal_Acc_Mean']=np.mean(cv_rmse)
result_dict['CrossVal_Acc_STD']=np.std(cv_rmse)
result_dict

In [None]:
try:
    
    
    
    ##############################################################

    #-------------READ IN DATA CONFIGURATION---------------------#

    ##############################################################
    

    #Modify Here to pass in X and Y
    scaler='Standard'

    fn='penguins 0.2'
    fp=test_datasets[fn]['filepath']

    if fn in ('penguins 0.2','penguins 0.4'):
        df=pd.read_csv(fp,index_col=0,na_values = '?')
    else:
        df=pd.read_csv(fp,na_values = '?')


    X=df[test_datasets[fn]['x_columns']].values
    y=df.pop(test_datasets[fn]['y_column'][0]).values


    X_train, X_test, y_train, y_test = train_test_split(X 
                                                        ,y
                                                        ,random_state=14395076
                                                        ,test_size=0.33)


    

    print("""


##############################################################

#-------------NO IMPUTER - {}---------------------#

##############################################################



""".format(fn))

    ##############################################################

    #-------------MODEL---------------------#

    ##############################################################

    #Create the SKLearn Gaussian Naive Bayes classifier
    model = GaussianNB()

    #Fit the data
    model.fit(X_train,y_train)


    #Check the predictions
    predictions = model.predict(X_test)

    #Results
    pred_vs_act_df=pd.DataFrame({'Actual':y_test
                                 ,'PredictionClass':predictions
                                ,'Diff':equal_index_array(l1=y_test,l2=predictions)})

    model_metric=model_metrics(testActualVal=y_test, predictions=predictions, verbose=True)


    scores = cross_val_score(GaussianNB(), X, y, scoring='accuracy', cv=10)
    print(scores)

    cv_rmse = scores #**0.5
    print("Avg Accuracy score over 10 folds: \n", np.mean(cv_rmse))
    print("Stddev Accuracy score over 10 folds: \n", np.std(cv_rmse))


    result_dict={}
    result_dict['Model']=model
    result_dict['Actual vs Prediction']=pred_vs_act_df
    result_dict['Accuracy']=model_metric['Accuracy']
    result_dict['Confusion']=model_metric['Confusion']
    result_dict['ClassificationRep']=model_metric['ClassificationRep']
    result_dict['CrossVal_Acc_Mean']=np.mean(cv_rmse)
    result_dict['CrossVal_Acc_STD']=np.std(cv_rmse)
    result_dict
    
except Exception as e:
    print("Default SKLearn won't accept NAN - {}".format(e))

In [None]:

##############################################################

#-------------READ IN DATA CONFIGURATION---------------------#

##############################################################



#Modify Here to pass in X and Y
scaler='Standard'

fn='penguins 0.2'
fp=test_datasets[fn]['filepath']
    
if fn in ('penguins 0.2','penguins 0.4'):
    df=pd.read_csv(fp,index_col=0,na_values = '?')
else:
    df=pd.read_csv(fp,na_values = '?')


X=df[test_datasets[fn]['x_columns']].values
y=df.pop(test_datasets[fn]['y_column'][0]).values


X_train, X_test, y_train, y_test = train_test_split(X 
                                                    ,y
                                                    ,random_state=14395076
                                                    ,test_size=0.33)


if scaler=='Standard':
    scaler = StandardScaler()
elif scaler=='MinMax':    
    scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)





##############################################################

#-------------IMPUTER CONFIGURATION--------------------------#

##############################################################



imputer='univariate'

if imputer=='univariate':
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)


elif imputer=='multivariate':
    imp=IterativeImputer(max_iter=100, random_state=14395076)
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)

elif imputer=='KNN':
    imp = KNNImputer(n_neighbors=5, weights="uniform")
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)


##############################################################




print("""


##############################################################

#-------------UNI IMPUTER - {}---------------------#

##############################################################



""".format(fn))




##############################################################

#-------------MODEL---------------------#

##############################################################


#Create the SKLearn Gaussian Naive Bayes classifier
model = GaussianNB()

#Fit the data
model.fit(X_train,y_train)


#Check the predictions
predictions = model.predict(X_test)

#Results
pred_vs_act_df=pd.DataFrame({'Actual':y_test
                             ,'PredictionClass':predictions
                            ,'Diff':equal_index_array(l1=y_test,l2=predictions)})

model_metric=model_metrics(testActualVal=y_test, predictions=predictions, verbose=True)



    
try:
    if imputer=='univariate':
        imp = SimpleImputer(missing_values=np.nan, strategy='mean')
        X_cross = imp.fit_transform(X)


    elif imputer=='multivariate':
        imp=IterativeImputer(max_iter=10, random_state=14395076)
        X_cross = imp.fit_transform(X)

    elif imputer=='KNN':
        imp = KNNImputer(n_neighbors=5, weights="uniform")
        X_cross = imp.fit_transform(X)

    else:
        X_cross=X

    scores = cross_val_score(GaussianNB(), X_cross, y, scoring='accuracy', cv=10)
    print(scores)
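except Exception as e:
    # Report the failure and fall back to NaN so the summary below still runs
    print("ERROR: {}".format(e))
    scores = np.array([np.nan])

cv_rmse = scores #**0.5
print("Avg Accuracy score over 10 folds: \n", np.mean(cv_rmse))
print("Stddev Accuracy score over 10 folds: \n", np.std(cv_rmse))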


result_dict={}
result_dict['Model']=model
result_dict['Actual vs Prediction']=pred_vs_act_df
result_dict['Accuracy']=model_metric['Accuracy']
result_dict['Confusion']=model_metric['Confusion']
result_dict['ClassificationRep']=model_metric['ClassificationRep']
result_dict['CrossVal_Acc_Mean']=np.mean(cv_rmse)
result_dict['CrossVal_Acc_STD']=np.std(cv_rmse)
result_dict

In [None]:

##############################################################

#-------------READ IN DATA CONFIGURATION---------------------#

##############################################################

#Modify Here to pass in X and Y
scaler='Standard'

fn='penguins 0.2'
fp=test_datasets[fn]['filepath']
    
if fn in ('penguins 0.2','penguins 0.4'):
    df=pd.read_csv(fp,index_col=0,na_values = '?')
else:
    df=pd.read_csv(fp,na_values = '?')


X=df[test_datasets[fn]['x_columns']].values
y=df.pop(test_datasets[fn]['y_column'][0]).values


X_train, X_test, y_train, y_test = train_test_split(X 
                                                    ,y
                                                    ,random_state=14395076
                                                    ,test_size=0.33)


##############################################################

#-------------IMPUTER CONFIGURATION--------------------------#

##############################################################



imputer='multivariate'

if imputer=='univariate':
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)


elif imputer=='multivariate':
    imp=IterativeImputer(max_iter=100, random_state=14395076)
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)

elif imputer=='KNN':
    imp = KNNImputer(n_neighbors=5, weights="uniform")
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)


##############################################################




##############################################################

#-------------MODEL---------------------#

##############################################################




print("""


##############################################################

#-------------Multi IMPUTER - {}---------------------#

##############################################################



""".format(fn))


#Create the SKLearn Gaussian Naive Bayes classifier
model = GaussianNB()

#Fit the data
model.fit(X_train,y_train)


#Check the predictions
predictions = model.predict(X_test)

#Results
pred_vs_act_df=pd.DataFrame({'Actual':y_test
                             ,'PredictionClass':predictions
                            ,'Diff':equal_index_array(l1=y_test,l2=predictions)})

model_metric=model_metrics(testActualVal=y_test, predictions=predictions, verbose=True)



    
try:
    if imputer=='univariate':
        imp = SimpleImputer(missing_values=np.nan, strategy='mean')
        X_cross = imp.fit_transform(X)


    elif imputer=='multivariate':
        imp=IterativeImputer(max_iter=10, random_state=14395076)
        X_cross = imp.fit_transform(X)

    elif imputer=='KNN':
        imp = KNNImputer(n_neighbors=5, weights="uniform")
        X_cross = imp.fit_transform(X)

    else:
        X_cross=X

    scores = cross_val_score(GaussianNB(), X_cross, y, scoring='accuracy', cv=10)
    print(scores)
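except Exception as e:
    # Report the failure and fall back to NaN so the summary below still runs
    print("ERROR: {}".format(e))
    scores = np.array([np.nan])

cv_rmse = scores #**0.5
print("Avg Accuracy score over 10 folds: \n", np.mean(cv_rmse))
print("Stddev Accuracy score over 10 folds: \n", np.std(cv_rmse))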


result_dict={}
result_dict['Model']=model
result_dict['Actual vs Prediction']=pred_vs_act_df
result_dict['Accuracy']=model_metric['Accuracy']
result_dict['Confusion']=model_metric['Confusion']
result_dict['ClassificationRep']=model_metric['ClassificationRep']
result_dict['CrossVal_Acc_Mean']=np.mean(cv_rmse)
result_dict['CrossVal_Acc_STD']=np.std(cv_rmse)
result_dict

In [None]:

##############################################################

#-------------READ IN DATA CONFIGURATION---------------------#

##############################################################

#Modify Here to pass in X and Y
scaler='Standard'

fn='penguins 0.2'
fp=test_datasets[fn]['filepath']
    
if fn in ('penguins 0.2','penguins 0.4'):
    df=pd.read_csv(fp,index_col=0,na_values = '?')
else:
    df=pd.read_csv(fp,na_values = '?')


X=df[test_datasets[fn]['x_columns']].values
y=df.pop(test_datasets[fn]['y_column'][0]).values


X_train, X_test, y_train, y_test = train_test_split(X 
                                                    ,y
                                                    ,random_state=14395076
                                                    ,test_size=0.33)



##############################################################

#-------------IMPUTER CONFIGURATION--------------------------#

##############################################################



imputer='KNN'

if imputer=='univariate':
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)


elif imputer=='multivariate':
    imp=IterativeImputer(max_iter=100, random_state=14395076)
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)

elif imputer=='KNN':
    imp = KNNImputer(n_neighbors=5, weights="uniform")
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)


##############################################################



print("""


##############################################################

#-------------KNN IMPUTER - {}---------------------#

##############################################################



""".format(fn))

##############################################################

#-------------MODEL---------------------#

##############################################################

#Create the SKLearn Gaussian Naive Bayes classifier
model = GaussianNB()

#Fit the data
model.fit(X_train,y_train)


#Check the predictions
predictions = model.predict(X_test)

#Results
pred_vs_act_df=pd.DataFrame({'Actual':y_test
                             ,'PredictionClass':predictions
                            ,'Diff':equal_index_array(l1=y_test,l2=predictions)})

model_metric=model_metrics(testActualVal=y_test, predictions=predictions, verbose=True)



    
try:
    if imputer=='univariate':
        imp = SimpleImputer(missing_values=np.nan, strategy='mean')
        X_cross = imp.fit_transform(X)


    elif imputer=='multivariate':
        imp=IterativeImputer(max_iter=10, random_state=14395076)
        X_cross = imp.fit_transform(X)

    elif imputer=='KNN':
        imp = KNNImputer(n_neighbors=5, weights="uniform")
        X_cross = imp.fit_transform(X)

    else:
        X_cross=X

    scores = cross_val_score(GaussianNB(), X_cross, y, scoring='accuracy', cv=10)
    print(scores)
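except Exception as e:
    # Report the failure and fall back to NaN so the summary below still runs
    print("ERROR: {}".format(e))
    scores = np.array([np.nan])

cv_rmse = scores #**0.5
print("Avg Accuracy score over 10 folds: \n", np.mean(cv_rmse))
print("Stddev Accuracy score over 10 folds: \n", np.std(cv_rmse))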


result_dict={}
result_dict['Model']=model
result_dict['Actual vs Prediction']=pred_vs_act_df
result_dict['Accuracy']=model_metric['Accuracy']
result_dict['Confusion']=model_metric['Confusion']
result_dict['ClassificationRep']=model_metric['ClassificationRep']
result_dict['CrossVal_Acc_Mean']=np.mean(cv_rmse)
result_dict['CrossVal_Acc_STD']=np.std(cv_rmse)
result_dict

---

## Penguins 0.4

In [None]:

##############################################################

#-------------READ IN DATA CONFIGURATION---------------------#

##############################################################


#Modify Here to pass in X and Y
scaler='Standard'

fn='penguins 0.4'
fp=test_datasets[fn]['filepath']
    
if fn in ('penguins 0.2','penguins 0.4'):
    df=pd.read_csv(fp,index_col=0,na_values = '?')
else:
    df=pd.read_csv(fp,na_values = '?')


X=df[test_datasets[fn]['x_columns']].values
y=df.pop(test_datasets[fn]['y_column'][0]).values


X_train, X_test, y_train, y_test = train_test_split(X 
                                                    ,y
                                                    ,random_state=14395076
                                                    ,test_size=0.33)

if scaler=='Standard':
    scaler = StandardScaler()
elif scaler=='MinMax':    
    scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


##############################################################

#-------------MODEL---------------------#

##############################################################



print("""


##############################################################

#-------------MINE - {}---------------------#

##############################################################



""".format(fn))

#Create the MyGaussianNB classifier
model = MyGaussianNB()

#Fit the data
model.fit(X_train,y_train)


#Check the predictions
predictions = model.predict(X_test)

#Results
pred_vs_act_df=pd.DataFrame({'Actual':y_test
                             ,'PredictionClass':predictions
                            ,'Diff':equal_index_array(l1=y_test,l2=predictions)})

model_metric=model_metrics(testActualVal=y_test, predictions=predictions, verbose=True)


scores = cross_val_score(MyGaussianNB(), X, y, scoring='accuracy', cv=10)
print(scores)

cv_rmse = scores  # despite the name, these are per-fold accuracy scores, not RMSE
print("Avg accuracy score over 10 folds: \n", np.mean(cv_rmse))
print("Stddev of accuracy over 10 folds: \n", np.std(cv_rmse))


result_dict={}
result_dict['Model']=model
result_dict['Actual vs Prediction']=pred_vs_act_df
result_dict['Accuracy']=model_metric['Accuracy']
result_dict['Confusion']=model_metric['Confusion']
result_dict['ClassificationRep']=model_metric['ClassificationRep']
result_dict['CrossVal_Acc_Mean']=np.mean(cv_rmse)
result_dict['CrossVal_Acc_STD']=np.std(cv_rmse)
result_dict

---

2. Run this cell to test scikit-learn's GaussianNB without imputation
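
As a minimal sketch of why the cell below is wrapped in `try`/`except`: scikit-learn's `GaussianNB` validates its input and raises a `ValueError` when the feature matrix contains NaN. The toy arrays `X_demo` and `y_demo` and the helper `fit_raises_on_nan` are illustrative names, not part of the assignment code.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data (hypothetical): one missing value in the first feature
X_demo = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, 7.0]])
y_demo = np.array([0, 0, 1, 1])


def fit_raises_on_nan(X, y):
    """Return (True, message) if GaussianNB.fit rejects the input, else (False, '')."""
    try:
        GaussianNB().fit(X, y)
    except ValueError as err:
        return True, str(err)
    return False, ""


raised, message = fit_raises_on_nan(X_demo, y_demo)
print(raised, message)
```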

In [None]:
try:
    
    
    
    ##############################################################

    #-------------READ IN DATA CONFIGURATION---------------------#

    ##############################################################
    

    #Modify Here to pass in X and Y
    scaler='Standard'

    fn='penguins 0.4'
    fp=test_datasets[fn]['filepath']

    if fn in ('penguins 0.2','penguins 0.4'):
        df=pd.read_csv(fp,index_col=0,na_values = '?')
    else:
        df=pd.read_csv(fp,na_values = '?')


    X=df[test_datasets[fn]['x_columns']].values
    y=df.pop(test_datasets[fn]['y_column'][0]).values


    X_train, X_test, y_train, y_test = train_test_split(X 
                                                        ,y
                                                        ,random_state=14395076
                                                        ,test_size=0.33)


    
    

    print("""


##############################################################

#-------------NO IMPUTER - {}---------------------#

##############################################################



""".format(fn))

    ##############################################################

    #-------------MODEL---------------------#

    ##############################################################

    #Create the scikit-learn Gaussian Naive Bayes classifier
    model = GaussianNB()

    #Fit the data
    model.fit(X_train,y_train)


    #Check the predictions
    predictions = model.predict(X_test)

    #Results
    pred_vs_act_df=pd.DataFrame({'Actual':y_test
                                 ,'PredictionClass':predictions
                                ,'Diff':equal_index_array(l1=y_test,l2=predictions)})

    model_metric=model_metrics(testActualVal=y_test, predictions=predictions, verbose=True)


    scores = cross_val_score(GaussianNB(), X, y, scoring='accuracy', cv=10)
    print(scores)

    cv_rmse = scores  # despite the name, these are per-fold accuracy scores, not RMSE
    print("Avg accuracy score over 10 folds: \n", np.mean(cv_rmse))
    print("Stddev of accuracy over 10 folds: \n", np.std(cv_rmse))


    result_dict={}
    result_dict['Model']=model
    result_dict['Actual vs Prediction']=pred_vs_act_df
    result_dict['Accuracy']=model_metric['Accuracy']
    result_dict['Confusion']=model_metric['Confusion']
    result_dict['ClassificationRep']=model_metric['ClassificationRep']
    result_dict['CrossVal_Acc_Mean']=np.mean(cv_rmse)
    result_dict['CrossVal_Acc_STD']=np.std(cv_rmse)
    result_dict
    
except Exception as e:
    print("scikit-learn's GaussianNB does not accept NaN values - {}".format(e))

---

3. Run this cell to test scikit-learn's GaussianNB with mean imputation
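
The cell below imputes the whole of `X` once before calling `cross_val_score`, so the imputed means are computed from rows that later end up in held-out folds. One possible alternative, sketched here on synthetic data, is to wrap the imputer and the classifier in a scikit-learn pipeline so the imputer is refitted on the training part of each fold. The names `X_demo`, `y_demo`, `pipe` and `demo_scores` are illustrative and the data is randomly generated, not the penguins dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic two-class data (hypothetical), with roughly 20% of values set to NaN
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(3, 1, (50, 3))])
y_demo = np.array([0] * 50 + [1] * 50)
X_demo[rng.random(X_demo.shape) < 0.2] = np.nan

# The imputer is fitted inside each training fold, then applied to the held-out fold
pipe = make_pipeline(SimpleImputer(missing_values=np.nan, strategy="mean"), GaussianNB())
demo_scores = cross_val_score(pipe, X_demo, y_demo, scoring="accuracy", cv=10)
print(demo_scores.mean(), demo_scores.std())
```

The same pattern should work with `IterativeImputer` or `KNNImputer` in place of `SimpleImputer`.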

In [None]:

##############################################################

#-------------READ IN DATA CONFIGURATION---------------------#

##############################################################



#Modify Here to pass in X and Y
scaler='Standard'

fn='penguins 0.4'
fp=test_datasets[fn]['filepath']
    
if fn in ('penguins 0.2','penguins 0.4'):
    df=pd.read_csv(fp,index_col=0,na_values = '?')
else:
    df=pd.read_csv(fp,na_values = '?')


X=df[test_datasets[fn]['x_columns']].values
y=df.pop(test_datasets[fn]['y_column'][0]).values


X_train, X_test, y_train, y_test = train_test_split(X 
                                                    ,y
                                                    ,random_state=14395076
                                                    ,test_size=0.33)



if scaler=='Standard':
    scaler = StandardScaler()
elif scaler=='MinMax':    
    scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)





##############################################################

#-------------IMPUTER CONFIGURATION--------------------------#

##############################################################



imputer='univariate'

if imputer=='univariate':
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)


elif imputer=='multivariate':
    imp=IterativeImputer(max_iter=100, random_state=14395076)
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)

elif imputer=='KNN':
    imp = KNNImputer(n_neighbors=5, weights="uniform")
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)


##############################################################






print("""


##############################################################

#-------------UNI IMPUTER - {}---------------------#

##############################################################



""".format(fn))


##############################################################

#-------------MODEL---------------------#

##############################################################


#Create the scikit-learn Gaussian Naive Bayes classifier
model = GaussianNB()

#Fit the data
model.fit(X_train,y_train)


#Check the predictions
predictions = model.predict(X_test)

#Results
pred_vs_act_df=pd.DataFrame({'Actual':y_test
                             ,'PredictionClass':predictions
                            ,'Diff':equal_index_array(l1=y_test,l2=predictions)})

model_metric=model_metrics(testActualVal=y_test, predictions=predictions, verbose=True)



    
try:
    if imputer=='univariate':
        imp = SimpleImputer(missing_values=np.nan, strategy='mean')
        X_cross = imp.fit_transform(X)


    elif imputer=='multivariate':
        imp=IterativeImputer(max_iter=10, random_state=14395076)
        X_cross = imp.fit_transform(X)

    elif imputer=='KNN':
        imp = KNNImputer(n_neighbors=5, weights="uniform")
        X_cross = imp.fit_transform(X)

    else:
        X_cross=X

    scores = cross_val_score(GaussianNB(), X_cross, y, scoring='accuracy', cv=10)
    print(scores)
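except Exception as e:
    # Report the cause and avoid reusing stale scores from an earlier cell
    print("ERROR - {}".format(e))
    scores = np.array([np.nan])

cv_rmse = scores  # despite the name, these are per-fold accuracy scores, not RMSE
print("Avg accuracy score over 10 folds: \n", np.mean(cv_rmse))
print("Stddev of accuracy over 10 folds: \n", np.std(cv_rmse))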


result_dict={}
result_dict['Model']=model
result_dict['Actual vs Prediction']=pred_vs_act_df
result_dict['Accuracy']=model_metric['Accuracy']
result_dict['Confusion']=model_metric['Confusion']
result_dict['ClassificationRep']=model_metric['ClassificationRep']
result_dict['CrossVal_Acc_Mean']=np.mean(cv_rmse)
result_dict['CrossVal_Acc_STD']=np.std(cv_rmse)
result_dict

---

4. Run this cell to test scikit-learn's GaussianNB with iterative imputation
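
`IterativeImputer` is still an experimental estimator in scikit-learn, so it can only be imported after the `enable_iterative_imputer` module has been imported. The sketch below shows that import order and a basic `fit_transform` call on a toy array; `X_demo` and `X_filled` are illustrative names, and `random_state=0` is an arbitrary choice for the example.

```python
import numpy as np
# This import must come first: it enables the experimental estimator
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data (hypothetical): the second column is roughly twice the first
X_demo = np.array([[1.0, 2.0],
                   [2.0, 4.0],
                   [3.0, np.nan],
                   [4.0, 8.0],
                   [np.nan, 10.0]])

# Each feature with missing values is modelled as a function of the other features
imp_demo = IterativeImputer(max_iter=10, random_state=0)
X_filled = imp_demo.fit_transform(X_demo)
print(X_filled)
```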

In [None]:

##############################################################

#-------------READ IN DATA CONFIGURATION---------------------#

##############################################################

#Modify Here to pass in X and Y
scaler='Standard'

fn='penguins 0.4'
fp=test_datasets[fn]['filepath']
    
if fn in ('penguins 0.2','penguins 0.4'):
    df=pd.read_csv(fp,index_col=0,na_values = '?')
else:
    df=pd.read_csv(fp,na_values = '?')


X=df[test_datasets[fn]['x_columns']].values
y=df.pop(test_datasets[fn]['y_column'][0]).values


X_train, X_test, y_train, y_test = train_test_split(X 
                                                    ,y
                                                    ,random_state=14395076
                                                    ,test_size=0.33)



print("""


##############################################################

#-------------MULTI IMPUTER - {}---------------------#

##############################################################



""".format(fn))
##############################################################

#-------------IMPUTER CONFIGURATION--------------------------#

##############################################################



imputer='multivariate'

if imputer=='univariate':
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)


elif imputer=='multivariate':
    imp=IterativeImputer(max_iter=100, random_state=14395076)
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)

elif imputer=='KNN':
    imp = KNNImputer(n_neighbors=5, weights="uniform")
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)


##############################################################




##############################################################

#-------------MODEL---------------------#

##############################################################

#Create the scikit-learn Gaussian Naive Bayes classifier
model = GaussianNB()

#Fit the data
model.fit(X_train,y_train)


#Check the predictions
predictions = model.predict(X_test)

#Results
pred_vs_act_df=pd.DataFrame({'Actual':y_test
                             ,'PredictionClass':predictions
                            ,'Diff':equal_index_array(l1=y_test,l2=predictions)})

model_metric=model_metrics(testActualVal=y_test, predictions=predictions, verbose=True)



    
try:
    if imputer=='univariate':
        imp = SimpleImputer(missing_values=np.nan, strategy='mean')
        X_cross = imp.fit_transform(X)


    elif imputer=='multivariate':
        imp=IterativeImputer(max_iter=10, random_state=14395076)
        X_cross = imp.fit_transform(X)

    elif imputer=='KNN':
        imp = KNNImputer(n_neighbors=5, weights="uniform")
        X_cross = imp.fit_transform(X)

    else:
        X_cross=X

    scores = cross_val_score(GaussianNB(), X_cross, y, scoring='accuracy', cv=10)
    print(scores)
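except Exception as e:
    # Report the cause and avoid reusing stale scores from an earlier cell
    print("ERROR - {}".format(e))
    scores = np.array([np.nan])

cv_rmse = scores  # despite the name, these are per-fold accuracy scores, not RMSE
print("Avg accuracy score over 10 folds: \n", np.mean(cv_rmse))
print("Stddev of accuracy over 10 folds: \n", np.std(cv_rmse))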


result_dict={}
result_dict['Model']=model
result_dict['Actual vs Prediction']=pred_vs_act_df
result_dict['Accuracy']=model_metric['Accuracy']
result_dict['Confusion']=model_metric['Confusion']
result_dict['ClassificationRep']=model_metric['ClassificationRep']
result_dict['CrossVal_Acc_Mean']=np.mean(cv_rmse)
result_dict['CrossVal_Acc_STD']=np.std(cv_rmse)
result_dict

---

5. Run this cell to test scikit-learn's GaussianNB with KNN imputation
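
As a small illustration of what `KNNImputer` does before the model is fitted: each missing entry is replaced using the values of that feature in the nearest rows, where distance is measured on the features both rows have observed. The toy array and `n_neighbors=2` are chosen only to keep the example small (the cell below uses `n_neighbors=5`); `X_demo` and `X_filled` are illustrative names.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (hypothetical) with two missing entries
X_demo = np.array([[1.0, 2.0, np.nan],
                   [3.0, 4.0, 3.0],
                   [np.nan, 6.0, 5.0],
                   [8.0, 8.0, 7.0]])

# With weights="uniform" the neighbours' values are averaged without distance weighting
imp_demo = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imp_demo.fit_transform(X_demo)
print(X_filled)
```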

In [None]:






##############################################################

#-------------READ IN DATA CONFIGURATION---------------------#

##############################################################

#Modify Here to pass in X and Y
scaler='Standard'

fn='penguins 0.4'
fp=test_datasets[fn]['filepath']
    
if fn in ('penguins 0.2','penguins 0.4'):
    df=pd.read_csv(fp,index_col=0,na_values = '?')
else:
    df=pd.read_csv(fp,na_values = '?')


X=df[test_datasets[fn]['x_columns']].values
y=df.pop(test_datasets[fn]['y_column'][0]).values


X_train, X_test, y_train, y_test = train_test_split(X 
                                                    ,y
                                                    ,random_state=14395076
                                                    ,test_size=0.33)




print("""


##############################################################

#-------------KNN IMPUTER - {}---------------------#

##############################################################



""".format(fn))

##############################################################

#-------------IMPUTER CONFIGURATION--------------------------#

##############################################################



imputer='KNN'

if imputer=='univariate':
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)


elif imputer=='multivariate':
    imp=IterativeImputer(max_iter=100, random_state=14395076)
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)

elif imputer=='KNN':
    imp = KNNImputer(n_neighbors=5, weights="uniform")
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)


##############################################################



##############################################################

#-------------MODEL---------------------#

##############################################################

#Create the scikit-learn Gaussian Naive Bayes classifier
model = GaussianNB()

#Fit the data
model.fit(X_train,y_train)


#Check the predictions
predictions = model.predict(X_test)

#Results
pred_vs_act_df=pd.DataFrame({'Actual':y_test
                             ,'PredictionClass':predictions
                            ,'Diff':equal_index_array(l1=y_test,l2=predictions)})

model_metric=model_metrics(testActualVal=y_test, predictions=predictions, verbose=True)



    
try:
    if imputer=='univariate':
        imp = SimpleImputer(missing_values=np.nan, strategy='mean')
        X_cross = imp.fit_transform(X)


    elif imputer=='multivariate':
        imp=IterativeImputer(max_iter=10, random_state=14395076)
        X_cross = imp.fit_transform(X)

    elif imputer=='KNN':
        imp = KNNImputer(n_neighbors=5, weights="uniform")
        X_cross = imp.fit_transform(X)

    else:
        X_cross=X

    scores = cross_val_score(GaussianNB(), X_cross, y, scoring='accuracy', cv=10)
    print(scores)
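except Exception as e:
    # Report the cause and avoid reusing stale scores from an earlier cell
    print("ERROR - {}".format(e))
    scores = np.array([np.nan])

cv_rmse = scores  # despite the name, these are per-fold accuracy scores, not RMSE
print("Avg accuracy score over 10 folds: \n", np.mean(cv_rmse))
print("Stddev of accuracy over 10 folds: \n", np.std(cv_rmse))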


result_dict={}
result_dict['Model']=model
result_dict['Actual vs Prediction']=pred_vs_act_df
result_dict['Accuracy']=model_metric['Accuracy']
result_dict['Confusion']=model_metric['Confusion']
result_dict['ClassificationRep']=model_metric['ClassificationRep']
result_dict['CrossVal_Acc_Mean']=np.mean(cv_rmse)
result_dict['CrossVal_Acc_STD']=np.std(cv_rmse)
result_dict