## Missing Values

In [1]:
import numpy as np
import pandas as pd
import math
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.pipeline import Pipeline
from collections import Counter
from sklearn.metrics import accuracy_score
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.model_selection import cross_val_score

## Extension of MyGaussianNB Class (Handling of Missing Values)

1. Extend the Gaussian Naive Bayes code so that it handles missing values. Gaussian Naive Bayes can handle missing values in training by calculating conditional probabilities on the values that are present. You may choose to put a limit on the number of missing values allowed. Your code should also handle missing values on any test data. The easiest way to do this is to leave features with missing values out of the posterior probability calculation.

In [2]:
class MyGaussianNB(BaseEstimator, ClassifierMixin):          
    def fit(self, Xt, yt):
        self.var_smoothing = 1e-9   # zero variance will cause division by zero errors.
        self.Xt = Xt
        self.yt = yt
        self.n_feat = Xt.shape[1]
        self.mus = {}
        self.sig_sqs = {}
        self.priors = {}
        
        c_dict = Counter(self.yt)
        
        for c in c_dict.keys():
            self.mus[c] = np.zeros(self.n_feat) # where the means will be stored
            self.sig_sqs[c] = np.zeros(self.n_feat) # where the variances will be stored
            self.priors[c] = c_dict[c]/Xt.shape[0]
            
            mask = self.yt == c
            X_tr_c = self.Xt[mask, :] # the rows for this class label
            
            for f in range(self.n_feat):
                # nanmean/nanvar ignore NaNs, so each estimate uses only the observed values
                self.mus[c][f] = np.nanmean(X_tr_c[:,f])
                # the smoothing term is added to the variance itself; adding it to the data would leave the variance unchanged
                self.sig_sqs[c][f] = np.nanvar(X_tr_c[:,f]) + self.var_smoothing
        #print(self.mus)
        #print(self.sig_sqs)
        
        return self
    
    # Predict a class label for each row of Xtes using the Gaussian likelihoods learned in fit.
    def predict(self, Xtes):
        #print("Predicting MGNB")
        self.Xtes = Xtes
         
        res_list = []
        for sample in Xtes:
            res_list.append(self.predict_single(sample))
            
        return np.array(res_list)
    
    def predict_single(self, x_single):
        probs = {}
        for c in self.priors.keys():   # for each of the class labels
            probs[c] = self.priors[c]
            for i, f in enumerate(x_single):
                if np.isnan(f):
                    pxi_y = 1
                else:
                    t1 = 1/math.sqrt(2*math.pi*self.sig_sqs[c][i])
                    num = (f - self.mus[c][i])**2
                    den = 2*self.sig_sqs[c][i]
                    pxi_y = t1 * math.exp(-num/den)
                probs[c] = probs[c] * pxi_y
                #print(t1, num, den, pxi_y)
                #print(probs)
            #print(c, self.priors[c])
        return max(probs, key=probs.get) # Return the key with the largest value
    

### Design Decisions:

#### In the `fit` method

The mean and variance are computed with NumPy's **`nanmean`** and **`nanvar`**, so NaNs are left out of the class-conditional estimates.
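
As a small illustration of that choice, the sketch below compares the plain and NaN-aware statistics on a made-up feature column (`col` and `eps` are hypothetical names, not taken from the class). It also shows why a smoothing constant has to be added to the variance rather than to the data: shifting every value by a constant does not change the variance.

```python
import numpy as np

# Hypothetical single-feature column for one class, with two missing entries.
col = np.array([1.0, 2.0, np.nan, 4.0, np.nan])

mean_plain = np.mean(col)     # NaN: a single missing value poisons the result
mean_nan = np.nanmean(col)    # mean of the observed values 1, 2 and 4
var_nan = np.nanvar(col)      # population variance of the observed values

# Shifting the data by a constant leaves the variance unchanged, so the
# smoothing term only has an effect when added to the variance itself.
eps = 1e-9
var_shifted = np.nanvar(col + eps)
var_smoothed = np.nanvar(col) + eps

print(mean_plain, mean_nan, var_nan)
print(var_shifted - var_nan, var_smoothed - var_nan)
```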

#### In the `predict_single` method

Each test attribute is checked for NaN. A missing attribute is assigned a likelihood of 1, which leaves the running product unchanged and so removes that feature from the posterior probability calculation.
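
The same idea can be sketched in log space, where skipping a feature means adding nothing to the sum. This is only an illustrative variant, not the implementation used in the class above: the function `log_posterior` and the two-class parameters in `params` are hypothetical. Working with logs also avoids underflow when many small likelihoods are multiplied.

```python
import math
import numpy as np

def log_posterior(x, prior, mus, sig_sqs):
    """Unnormalised log posterior for one class, skipping missing features."""
    total = math.log(prior)
    for xi, mu, var in zip(x, mus, sig_sqs):
        if np.isnan(xi):
            continue  # same effect as multiplying the product by 1
        total += -0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
    return total

# Hypothetical parameters for two classes and two features.
params = {
    "A": {"prior": 0.5, "mus": [0.0, 0.0], "sig_sqs": [1.0, 1.0]},
    "B": {"prior": 0.5, "mus": [3.0, 3.0], "sig_sqs": [1.0, 1.0]},
}

x = np.array([2.5, np.nan])  # the second feature is missing
scores = {c: log_posterior(x, **p) for c, p in params.items()}
pred = max(scores, key=scores.get)
print(pred)
```

If every feature of a sample is missing, the score reduces to the log prior, so the prediction falls back to the most frequent class.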

# Testing

2. Test the performance of your implementation against the scikit-learn `GaussianNB` using missing value imputation. Test two imputation options, one **univariate** and one **multi-variate**. To help with your evaluation two versions of the penguins datasets with missing values are provided, one with **20% missing** and the other with **40%**. You should use **cross validation** for testing, taking care that any scaling and imputation is handled properly within cross validation.

### Fidelity test: hold-out testing with univariate and multivariate imputing, comparing the results of both classifiers

In [3]:
def fidelity_tests (X,y, nreps = 10):
    for rs in range(1, nreps + 1):
        X_tr_raw, X_ts_raw, y_train, y_test = train_test_split(X, y, 
                                                               random_state=rs, 
                                                               test_size=1/2)
        
        # Univariate Imputing
        imp = SimpleImputer(missing_values=np.nan, strategy='mean')
        imp.fit(X_tr_raw)
        Xi_train = imp.transform(X_tr_raw)
        Xi_test = imp.transform(X_ts_raw)
        
        # Multivariate Imputing
        imp_kNN = KNNImputer(missing_values = np.nan)
        imp_kNN.fit(X_tr_raw)
        X_train_kNN = imp_kNN.transform(X_tr_raw)
        X_test_kNN = imp_kNN.transform(X_ts_raw)
        
        # Standard scaling of the imputed data; each scaler is fitted on the training half only
        scaler_simple = StandardScaler()
        XiS_train = scaler_simple.fit_transform(Xi_train)
        XiS_test = scaler_simple.transform(Xi_test)
        
        # Second scaler, fitted on the kNN-imputed training data
        scaler_iterative = StandardScaler()
        XS_train_kNN = scaler_iterative.fit_transform(X_train_kNN)
        XS_test_kNN = scaler_iterative.transform(X_test_kNN)
        
        gnb_Simple = GaussianNB()
        mgnb_Simple = MyGaussianNB()
        
        gnb_Iterative = GaussianNB()
        mgnb_Iterative = MyGaussianNB()
        
        gnb_Simple.fit(XiS_train,y_train)
        mgnb_Simple.fit(XiS_train,y_train)
        
        gnb_Iterative.fit(XS_train_kNN,y_train)
        mgnb_Iterative.fit(XS_train_kNN,y_train)
        
        ascore = accuracy_score(gnb_Simple.predict(XiS_test),mgnb_Simple.predict(XiS_test)) 
        gnb_acc_simple = accuracy_score(gnb_Simple.predict(XiS_test),y_test)
        mgnb_acc_simple = accuracy_score(mgnb_Simple.predict(XiS_test),y_test)
        
        print ("Run after Univariate Imputing: %d Score: %.2f SK acc: %.2f My acc: %.2f" % (rs, ascore, gnb_acc_simple, mgnb_acc_simple))
        
        aiscore = accuracy_score(gnb_Iterative.predict(XS_test_kNN),mgnb_Iterative.predict(XS_test_kNN))
        gnb_acc_iterative = accuracy_score(gnb_Iterative.predict(XS_test_kNN),y_test)
        mgnb_acc_iterative = accuracy_score(mgnb_Iterative.predict(XS_test_kNN),y_test)
        
        print ("Run after Multivariate Imputing: %d Score: %.2f SK acc: %.2f My acc: %.2f" % (rs, aiscore, gnb_acc_iterative, mgnb_acc_iterative))
        

### Cross-validation test: univariate and multivariate imputing inside pipelines, comparing the results of both classifiers

In [4]:
def crossval_test (X,y, cv = 8, njobs = -1):
    
    # Pipeline with univariate imputing, standard scaling and sklearn gaussian naive bayes classifier
    GNBSpipe  = Pipeline(steps=[
        ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean')),
        ('scaler', StandardScaler()),
        ('naive_bayes', GaussianNB())])
    
    # Pipeline with Multivariate imputing, standard scaling and sklearn gaussian naive bayes classifier
    GNBIpipe  = Pipeline(steps=[
        ('imputer', KNNImputer(missing_values = np.nan)),
        ('scaler', StandardScaler()),
        ('naive_bayes', GaussianNB())])
    
    # Pipeline with univariate imputing, standard scaling and my gaussian naive bayes classifier
    MGNBSpipe  = Pipeline(steps=[
        ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean')),
        ('scaler', StandardScaler()),
        ('naive_bayes', MyGaussianNB())])
    
    # Pipeline with Multivariate imputing, standard scaling and my gaussian naive bayes classifier
    MGNBIpipe  = Pipeline(steps=[
        ('imputer', KNNImputer(missing_values = np.nan)),
        ('scaler', StandardScaler()),
        ('naive_bayes', MyGaussianNB())])
    
    # Pipeline with standard scaling and my gaussian naive bayes classifier
    MGNBpipe  = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('naive_bayes', MyGaussianNB())])
    
    # Generating scores using cross validation method
    gnb_crossval_simple = cross_val_score(GNBSpipe, X, y, cv=cv, n_jobs = njobs)
    gnb_crossval_iterative = cross_val_score(GNBIpipe, X, y, cv=cv, n_jobs = njobs)
    
    print("Cross Validation Accuracy for SKlearn GNB after Univariate Imputing: {0:4.2f}".format(sum(gnb_crossval_simple)/len(gnb_crossval_simple)))
    print("Cross Validation Accuracy for SKlearn GNB after Multivariate Imputing: {0:4.2f}".format(sum(gnb_crossval_iterative)/len(gnb_crossval_iterative)))
    
    mgnb_crossval_simple = cross_val_score(MGNBSpipe, X, y, cv=cv, n_jobs = njobs)
    mgnb_crossval_iterative = cross_val_score(MGNBIpipe, X, y, cv=cv, n_jobs = njobs)
    
    print("Cross Validation Accuracy for My GNB after Univariate Imputing: {0:4.2f}".format(sum(mgnb_crossval_simple)/len(mgnb_crossval_simple)))
    print("Cross Validation Accuracy for My GNB after Multivariate Imputing: {0:4.2f}".format(sum(mgnb_crossval_iterative)/len(mgnb_crossval_iterative)))
    
    mgnb_crossval = cross_val_score(MGNBpipe, X, y, cv=cv, n_jobs = njobs)
    
    print("Cross Validation Accuracy for My GNB without any Imputing: {0:4.2f}".format(sum(mgnb_crossval)/len(mgnb_crossval)))
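
The multivariate pipelines above use `KNNImputer`. `IterativeImputer`, which the first cell already imports, is another multivariate option. The following is a minimal sketch of swapping it into the same pipeline layout, assuming synthetic data: `X_demo`, `y_demo` and `iter_pipe` are hypothetical names and are not part of the notebook. Because the imputer is a pipeline step, it is refitted on the training folds only in each split.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data with roughly 20% of the cells removed at random.
rng = np.random.default_rng(0)
n_per_class = 60
X_demo = np.vstack([rng.normal(loc=m, scale=1.0, size=(n_per_class, 4))
                    for m in (0.0, 3.0, 6.0)])
y_demo = np.repeat(["a", "b", "c"], n_per_class)
X_demo[rng.random(X_demo.shape) < 0.2] = np.nan

# Same layout as the pipelines above, with IterativeImputer as the imputing step.
iter_pipe = Pipeline(steps=[
    ("imputer", IterativeImputer(random_state=0)),
    ("scaler", StandardScaler()),
    ("naive_bayes", GaussianNB())])

iter_scores = cross_val_score(iter_pipe, X_demo, y_demo, cv=4)
print("Mean accuracy: {0:4.2f}".format(iter_scores.mean()))
```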

## Penguins Dataset with 20% Missing Values

In [5]:
penguins_20 = pd.read_csv('PenguinsMV0.2.csv', index_col = 0)
penguins_20 = penguins_20.replace('?',np.nan)
print(penguins_20.shape)
penguins_20.head()

(333, 5)


| | bill_length | bill_depth | flipper_length | body_mass | species |
|---|---|---|---|---|---|
| 0 | 39.1 | 18.7 | 181.0 | 3750.0 | Adelie |
| 1 | 39.5 | 17.4 | 186.0 | 3800.0 | Adelie |
| 2 | 40.3 | 18.0 | 195.0 | 3250.0 | Adelie |
| 3 | 36.7 | 19.3 | 193.0 | 3450.0 | Adelie |
| 4 | 39.3 | 20.6 | 190.0 | 3650.0 | Adelie |


In [6]:
y = penguins_20.pop('species').values
X_raw = penguins_20.values
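
Before testing, it can be useful to confirm that the feature columns are numeric and to measure how much is actually missing. The sketch below does this on a made-up frame (`demo` and its values are hypothetical); it assumes the placeholder is the string `'?'`, as in the loading cell, and relies on `errors="coerce"` to turn anything non-numeric into NaN.

```python
import pandas as pd

# Hypothetical frame that mimics the '?' placeholder for missing values.
demo = pd.DataFrame({
    "bill_length": ["39.1", "?", "40.3", "36.7"],
    "bill_depth": ["18.7", "17.4", "?", "?"],
})

# Coerce every column to a float dtype; non-numeric entries become NaN.
demo = demo.apply(pd.to_numeric, errors="coerce")

missing_per_column = demo.isna().mean()           # fraction missing in each column
missing_overall = demo.isna().to_numpy().mean()   # fraction missing over all cells

print(missing_per_column)
print(missing_overall)
```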

## Fidelity Testing and Accuracy Scores (20% Missing Values Dataset)

In [7]:
fidelity_tests(X_raw, y)

Run after Univariate Imputing: 1 Score: 1.00 SK acc: 0.94 My acc: 0.94
Run after Multivariate Imputing: 1 Score: 1.00 SK acc: 0.93 My acc: 0.93
Run after Univariate Imputing: 2 Score: 1.00 SK acc: 0.96 My acc: 0.96
Run after Multivariate Imputing: 2 Score: 1.00 SK acc: 0.96 My acc: 0.96
Run after Univariate Imputing: 3 Score: 1.00 SK acc: 0.95 My acc: 0.95
Run after Multivariate Imputing: 3 Score: 1.00 SK acc: 0.95 My acc: 0.95
Run after Univariate Imputing: 4 Score: 1.00 SK acc: 0.95 My acc: 0.95
Run after Multivariate Imputing: 4 Score: 1.00 SK acc: 0.95 My acc: 0.95
Run after Univariate Imputing: 5 Score: 1.00 SK acc: 0.96 My acc: 0.96
Run after Multivariate Imputing: 5 Score: 1.00 SK acc: 0.96 My acc: 0.96
Run after Univariate Imputing: 6 Score: 1.00 SK acc: 0.96 My acc: 0.96
Run after Multivariate Imputing: 6 Score: 1.00 SK acc: 0.96 My acc: 0.96
Run after Univariate Imputing: 7 Score: 1.00 SK acc: 0.98 My acc: 0.98
Run after Multivariate Imputing: 7 Score: 1.00 SK acc: 0.98 My ac

## Cross Validation Testing and Accuracy Scores (20% Missing Values Dataset)

In [8]:
crossval_test(X_raw, y)

Cross Validation Accuracy for SKlearn GNB after Univariate Imputing: 0.97
Cross Validation Accuracy for SKlearn GNB after Multivariate Imputing: 0.97
Cross Validation Accuracy for My GNB after Univariate Imputing: 0.97
Cross Validation Accuracy for My GNB after Multivariate Imputing: 0.97
Cross Validation Accuracy for My GNB without any Imputing: 0.97


**Conclusion for 20% missing values dataset**: The cross-validation scores are identical (0.97) for scikit-learn's `GaussianNB` and `MyGaussianNB` under both univariate and multivariate imputing. In fact, `MyGaussianNB` produces the same accuracy when no imputing is applied at all.

## Penguins Dataset with 40% Missing Values

In [9]:
penguins_40 = pd.read_csv('PenguinsMV0.4.csv', index_col = 0)
penguins_40 = penguins_40.replace('?',np.nan)
print(penguins_40.shape)
penguins_40.head()

(333, 5)


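| | bill_length | bill_depth | flipper_length | body_mass | species |
|---|---|---|---|---|---|
| 0 | 39.1 | NaN | 181.0 | 3750.0 | Adelie |
| 1 | 39.5 | 17.4 | 186.0 | NaN | Adelie |
| 2 | 40.3 | NaN | 195.0 | NaN | Adelie |
| 3 | 36.7 | 19.3 | 193.0 | 3450.0 | Adelie |
| 4 | NaN | NaN | NaN | 3650.0 | Adelie |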


In [10]:
y_40 = penguins_40.pop('species').values
X_raw_40 = penguins_40.values

## Fidelity Testing and Accuracy Scores (40% Missing Values Dataset)

In [11]:
fidelity_tests(X_raw_40, y_40)

Run after Univariate Imputing: 1 Score: 1.00 SK acc: 0.79 My acc: 0.79
Run after Multivariate Imputing: 1 Score: 1.00 SK acc: 0.81 My acc: 0.81
Run after Univariate Imputing: 2 Score: 1.00 SK acc: 0.81 My acc: 0.81
Run after Multivariate Imputing: 2 Score: 1.00 SK acc: 0.82 My acc: 0.82
Run after Univariate Imputing: 3 Score: 1.00 SK acc: 0.83 My acc: 0.83
Run after Multivariate Imputing: 3 Score: 1.00 SK acc: 0.84 My acc: 0.84
Run after Univariate Imputing: 4 Score: 1.00 SK acc: 0.80 My acc: 0.80
Run after Multivariate Imputing: 4 Score: 1.00 SK acc: 0.81 My acc: 0.81
Run after Univariate Imputing: 5 Score: 1.00 SK acc: 0.83 My acc: 0.83
Run after Multivariate Imputing: 5 Score: 1.00 SK acc: 0.84 My acc: 0.84
Run after Univariate Imputing: 6 Score: 1.00 SK acc: 0.87 My acc: 0.87
Run after Multivariate Imputing: 6 Score: 1.00 SK acc: 0.84 My acc: 0.84
Run after Univariate Imputing: 7 Score: 1.00 SK acc: 0.81 My acc: 0.81
Run after Multivariate Imputing: 7 Score: 1.00 SK acc: 0.84 My ac

## Cross Validation Testing and Accuracy Scores (40% Missing Values Dataset)

In [12]:
crossval_test(X_raw_40, y_40)

Cross Validation Accuracy for SKlearn GNB after Univariate Imputing: 0.85
Cross Validation Accuracy for SKlearn GNB after Multivariate Imputing: 0.80
Cross Validation Accuracy for My GNB after Univariate Imputing: 0.85
Cross Validation Accuracy for My GNB after Multivariate Imputing: 0.80
Cross Validation Accuracy for My GNB without any Imputing: 0.87


**Conclusion for 40% missing values dataset**: The cross-validation scores are again identical for scikit-learn's `GaussianNB` and `MyGaussianNB` under each imputing method (0.85 with univariate and 0.80 with multivariate imputing). However, `MyGaussianNB` produces a higher accuracy (0.87) when no imputing is applied.

## Conclusion:

Testing `MyGaussianNB` against the `GaussianNB` implementation in scikit-learn shows that both produce the **exact same scores** under both imputing methods on the **penguins dataset with 20% missing values**. On the **penguins dataset with 40% missing values**, the two classifiers still agree with each other, but the score depends on the imputing method: **85% accuracy with univariate imputing** and **80% accuracy with multivariate imputing**. `MyGaussianNB` produced **87% accuracy with no imputing** applied, the highest cross-validation score on that dataset.