# ID2214/FID3214 Assignment 2 Group no. [2]
### Project members: 
[Farrokh Bolandi, bolandi@kth.se]
[Ezio Cristofoli, ezioc@kth.se]
[Abyel Tesfay, Abyel@kth.se]

### Declaration:
By submitting this solution, it is hereby declared that all individuals listed above have contributed to the solution, either with code that appears in the final solution below, or with code that has been evaluated and compared to the final solution, but for some reason has been excluded. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214/FID3214) and no part of the solution has been provided by someone not listed as project member above.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy, pandas and time may be used in the solution for this assignment.

### Instructions
All parts of the assignment starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of all parts of the assignment starting 
with number 2 below, then the assignment will receive 2 points (in total).

Note that you do not have to develop the code directly within the notebook
but may instead copy the comments and test cases to a more convenient development environment
and when everything works as expected, you may paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above).

## Load NumPy, pandas and time

In [1]:
import numpy as np
import pandas as pd
import time


## Reused functions from Assignment 1

In [2]:
# Copy and paste functions from Assignment 1 here that you need for this assignment
def create_column_filter(df):
    df2 = df.copy()
    column_filter = list(df2.columns)
    columns = [col for col in df2.columns if col != 'CLASS' and col != 'ID']
    for col in columns:
        if df2[col].isnull().all():
            df2.drop(columns=col, inplace=True)
            column_filter.remove(col)
            continue

        if len(df2[col].dropna().unique()) <= 1:
            df2.drop(columns=col, inplace=True)
            column_filter.remove(col)

    return df2, column_filter


def apply_column_filter(df, column_filter):
    df2 = df.copy()
    # Drop every column that is not part of the stored filter
    dropped = [col for col in df2.columns if col not in column_filter]
    df2.drop(columns=dropped, inplace=True)
    return df2

def create_normalization(df, normalizationtype='minmax'):
    df2 = df.copy()
    include_types = np.int32, np.int64, np.float32, np.float64
    columns = [col for col in df2.columns if (col != 'CLASS' and col != 'ID') 
               and df2[col].dtype in include_types]
    normalization = {}
    for col in columns:
        if normalizationtype == 'minmax':
            # Store (type, min, max) and rescale the column to [0, 1]
            col_min = df2[col].min()
            col_max = df2[col].max()
            normalization[col] = normalizationtype, col_min, col_max
            df2[col] = (df2[col] - col_min) / (col_max - col_min)
        elif normalizationtype == 'zscore':
            # Store (type, mean, std) and standardize the column
            mean = df2[col].mean()
            std = df2[col].std()
            normalization[col] = normalizationtype, mean, std
            df2[col] = (df2[col] - mean) / std

    return df2, normalization

def apply_normalization(df, normalization):
    df2 = df.copy()
    # Only columns with a stored mapping are normalized
    for col, (normalizationtype, a, b) in normalization.items():
        if col not in df2.columns:
            continue
        if normalizationtype == 'minmax':
            # a = training min, b = training max
            df2[col] = (df2[col] - a) / (b - a)
        elif normalizationtype == 'zscore':
            # a = training mean, b = training std
            df2[col] = (df2[col] - a) / b

    return df2

def create_imputation(df):
    df2 = df.copy()
    numeric_types = np.int32, np.int64, np.float32, np.float64
    columns = [col for col in df2.columns if (col != 'CLASS' and col != 'ID')]
    imputation = {}
    for col in columns:
        if df2[col].dtype in numeric_types:
            # Mean imputation; fall back to 0 when every value is missing
            imputation[col] = 0 if df2[col].isnull().all() else df2[col].mean()
        elif df2[col].isnull().all():
            # No mode exists: use '' for object columns, else the first category
            if df2[col].dtype == 'object':
                imputation[col] = ''
            else:
                df2[col] = df2[col].astype('category')
                imputation[col] = df2[col].cat.categories[0]
        else:
            # Mode imputation for non-numeric columns
            imputation[col] = df2[col].mode()[0]
        df2[col] = df2[col].fillna(imputation[col])

    return df2, imputation

def apply_imputation(df, imputation):
    df2 = df.copy()
    return df2.fillna(value=imputation)

def create_one_hot(df):
    df2=df.copy()
    columns = [col for col in df2.columns if (col != 'CLASS' and col != 'ID')]
    one_hot={}
    for col in columns:
        if df2[col].dtype.name != 'category':
            continue
        one_hot[col]=df2[col].unique()
        tmp = pd.get_dummies(df2[col], prefix=col, prefix_sep='-', dtype=np.float64)
        df2.drop(columns=col, inplace=True)
        df2 = pd.concat([df2, tmp], axis=1)

    return df2, one_hot

def apply_one_hot(df, one_hot):
    df2=df.copy()
    columns = [col for col in df2.columns if (col != 'CLASS' and col != 'ID')]
    for col in columns:
        if df2[col].dtype.name != 'category':
            continue
        tmp = pd.get_dummies(df2[col], prefix=col, prefix_sep='-', dtype=np.float64)
        df2.drop(columns=col, inplace=True)
        df2 = pd.concat([df2, tmp], axis=1)
    return df2    

def accuracy(df, correctlabels):
    highest_probability = df.idxmax(axis=1)
    correct_occurances = 0
    for correct_label, predicted_label in zip(correctlabels, highest_probability):
        if correct_label==predicted_label:
            correct_occurances+=1 
    
    return correct_occurances/df.index.size

def brier_score(df, correctlabels):
    squared_sum = 0
    # Positional row access, so the result does not depend on df's index labels
    for row, label in enumerate(correctlabels):
        for j, col in enumerate(df.columns):
            p = df.iloc[row, j]
            squared_sum += (1 - p)**2 if label == col else p**2

    return squared_sum/df.index.size

def auc(df, correctlabels):
    auc=0
    for col in df.columns:
        df2 = pd.concat([df[col], pd.Series(correctlabels.astype('category'), name='correct')], axis=1)
        # get dummies for correct labels and sort descending
        df2 = pd.get_dummies(df2.sort_values(col, ascending=False))
        
        # move col to first for easier total tp and fp calculation
        tmp=df2.pop('correct_'+str(col))
        # get the col frequency for calculating weighted AUCs
        col_frequency=tmp.sum()/tmp.index.size
        df2.insert(1, tmp.name, tmp)
        scores={}
        # populate the scores dictionary for column i.e. key=score, value=[tp_sum, fp_sum]
        for row in df.index:
            key=df2.iloc[row, 0]
            current=np.zeros(2, dtype=np.uint) if scores.get(key) is None else scores[key]
            to_add=np.array([1,0]) if df2.iloc[row, 1]==1 else np.array([0,1])
            scores[key]=current+to_add

        # calculate auc based on scores
        cov_tp=0
        column_auc=0
        tot_tp=0
        tot_fp=0
        # calculate total tp and fp 
        for value in scores.values():
            tot_tp+=int(value[0])
            tot_fp+=int(value[1])
            
        # same algorithm as in the lecture (bad naming though)
        for i in scores.values():
            if i[1] == 0:
                cov_tp+=i[0]
            elif i[0] == 0:
                column_auc += (cov_tp/tot_tp)*(i[1]/tot_fp)
            else:
                column_auc += (cov_tp/tot_tp)*(i[1]/tot_fp)+(i[0]/tot_tp)*(i[1]/tot_fp)/2
                cov_tp += i[0]

        auc+=col_frequency*column_auc
    
    return auc

def create_bins(df, nobins=10, bintype='equal-width'):
    df2 = df.copy()
    include_types = np.int32, np.int64, np.float32, np.float64
    columns = [col for col in df2.columns if (col != 'CLASS' and col != 'ID')
               and df2[col].dtype in include_types]
    binning = {}
    for col in columns:
        if bintype == 'equal-width':
            df2[col], bins = pd.cut(df2[col], nobins, retbins=True, duplicates='drop', labels=False)
        elif bintype == 'equal-size':
            # Dropping duplicate edges may result in fewer than nobins bins
            df2[col], bins = pd.qcut(df2[col], nobins, retbins=True, labels=False, duplicates='drop')
        else:
            continue
        # Open the outermost bins so that values outside the training range still fall in a bin
        bins[0] = -np.inf
        bins[-1] = np.inf
        binning[col] = bins
    return df2, binning


def apply_bins(df, binning):
    df2 = df.copy()
    # Only columns that were discretized during training have stored bin edges
    columns = [col for col in df2.columns if col in binning]
    for col in columns:
        labels = list(range(len(binning[col]) - 1))
        df2[col] = pd.cut(df2[col], binning[col], labels=labels)
    return df2

## 1. Define the class kNN

In [3]:
# Define the class kNN with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self - the object itself
#
# Output from __init__:
# <nothing>
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# column_filter, imputation, normalization, one_hot, labels, training_labels, training_data, training_time
#
# Input to fit:
# self              - the object itself
# df                - a dataframe (where the column names "CLASS" and "ID" have special meaning)
# normalizationtype - "minmax" (default) or "zscore"
#
# Output from fit:
# <nothing>
#
# The result of applying this function should be:
#
# self.column_filter   - a column filter (see Assignment 1) from df
# self.imputation      - an imputation mapping (see Assignment 1) from df
# self.normalization   - a normalization mapping (see Assignment 1), using normalizationtype from the imputed df
# self.one_hot         - a one-hot mapping (see Assignment 1)
# self.training_labels - a pandas series corresponding to the "CLASS" column, set to be of type "category" 
# self.labels          - a list of the categories (class labels) of the previous series
# self.training_data   - the values (an ndarray) of the transformed dataframe, i.e., after employing imputation, 
#                        normalization, and possibly one-hot encoding, and also after removing the "CLASS" and "ID" columns
#
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Input to predict:
# self - the object itself
# df   - a dataframe
# k    - an integer >= 1 (default = 5)
# 
# Output from predict:
# predictions - a dataframe with class labels as column names and the rows corresponding to
#               predictions with estimated class probabilities for each row in df, where the class probabilities
#               are estimated by the relative class frequencies in the set of class labels from the k nearest 
#               (with respect to Euclidean distance) neighbors in training_data
#
# Hint 1: Drop any "CLASS" and "ID" columns first and then apply column filtering, imputation, normalization and one-hot
#
# Hint 2: Get the numerical values (as an ndarray) from the resulting dataframe and iterate over the rows 
#         calling some sub-function, e.g., get_nearest_neighbor_predictions(x_test,k), which for a test row
#         (numerical input feature values) finds the k nearest neighbors and calculate the class probabilities.
#
# Hint 3: This sub-function may first find the distances to all training instances, e.g., pairs consisting of
#         training instance index and distance, and then sort them according to distance, and then (using the indexes
#         of the k closest instances) find the corresponding labels and calculate the relative class frequencies

class kNN:
    def __init__(self):
        self.column_filter = None
        self.imputation = None
        self.normalization = None
        self.one_hot = None
        self.labels = None
        self.training_labels = None
        self.training_data = None
        self.training_time = None
    
    def fit(self, df, normalizationtype='minmax'):
        df2 = df.copy()
        df2.drop(columns=['CLASS', 'ID'], inplace=True)
        df2, self.column_filter = create_column_filter(df2)
        df2, self.imputation = create_imputation(df2)
        df2, self.normalization = create_normalization(df2, normalizationtype)
        df2, self.one_hot = create_one_hot(df2)
        
        # Set training labels
        self.training_labels = pd.Series(df['CLASS'].astype('category'))
        
        # Set labels
        self.labels = np.array(self.training_labels.unique().sort_values())
        
        # Set training_data
        self.training_data = df2.values
        
    def get_nearest_neighbor_predictions(self, x_test, k):
        # Euclidean distance from x_test to every training instance
        distances = np.sqrt(np.sum((self.training_data - x_test)**2, axis=1))
        # A stable sort keeps training order among equally distant instances
        nearest = np.argsort(distances, kind='stable')[:k]
        k_nearest_labels = np.array(self.training_labels.iloc[nearest])
        return k_nearest_labels
        
    
    def predict(self, df, k=5):
        df2 = df.copy()
        # "CLASS" and "ID" may be absent from unlabeled data
        df2.drop(columns=['CLASS', 'ID'], inplace=True, errors='ignore')
        df2 = apply_column_filter(df2, self.column_filter)
        df2 = apply_imputation(df2, self.imputation)
        df2 = apply_normalization(df2, self.normalization)
        df2 = apply_one_hot(df2, self.one_hot)
        predictions = pd.DataFrame(np.zeros((df2.index.size, self.labels.size)), columns=self.labels)
        # Iterate positionally so that a non-default index in df does not matter
        for index, x_test in enumerate(df2.values):
            raw_k_neighbours = self.get_nearest_neighbor_predictions(x_test, k)
            # Relative class frequencies among the k nearest neighbours
            for col_index, col in enumerate(self.labels):
                predictions.iloc[index, col_index] = np.count_nonzero(raw_k_neighbours==col)/raw_k_neighbours.size

        return predictions
        
        
# ==============================================
glass_train_df = pd.read_csv("glass_train.csv")
glass_test_df = pd.read_csv("glass_test.csv")
knn_model = kNN()
t0 = time.perf_counter()
knn_model.fit(glass_train_df)
print("Training time: {0:.2f} s.".format(time.perf_counter()-t0))
test_labels = glass_test_df["CLASS"]
k_values = [1,3,5,7,9]
results = np.empty((len(k_values),3))
for i in range(len(k_values)):
    t0 = time.perf_counter()
    predictions = knn_model.predict(glass_test_df,k=k_values[i])
    print("Testing time (k={0}): {1:.2f} s.".format(k_values[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise
results = pd.DataFrame(results,index=k_values,columns=["Accuracy","Brier score","AUC"])
print()
display("results",results)
#=======================================================
train_labels = glass_train_df["CLASS"]
predictions = knn_model.predict(glass_train_df,k=1)
print("Accuracy on training set (k=1): {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set (k=1): {0:.4f}".format(auc(predictions,train_labels)))
print("Brier score on training set (k=1): {0:.4f}".format(brier_score(predictions,train_labels)))

Training time: 0.01 s.
Testing time (k=1): 0.17 s.
Testing time (k=3): 0.25 s.
Testing time (k=5): 0.19 s.
Testing time (k=7): 0.19 s.
Testing time (k=9): 0.20 s.



'results'

 	Accuracy 	Brier score 	AUC
1 	0.747664 	0.504673 	0.81035
3 	0.616822 	0.488058 	0.815859
5 	0.607477 	0.474019 	0.833805
7 	0.635514 	0.470723 	0.834465
9 	0.635514 	0.483674 	0.828734


Accuracy on training set (k=1): 1.0000
AUC on training set (k=1): 1.0000
Brier score on training set (k=1): 0.0000


In [None]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.csv")

glass_test_df = pd.read_csv("glass_test.csv")

knn_model = kNN()

t0 = time.perf_counter()
knn_model.fit(glass_train_df)
print("Training time: {0:.2f} s.".format(time.perf_counter()-t0))

test_labels = glass_test_df["CLASS"]

k_values = [1,3,5,7,9]
results = np.empty((len(k_values),3))

for i in range(len(k_values)):
    t0 = time.perf_counter()
    predictions = knn_model.predict(glass_test_df,k=k_values[i])
    print("Testing time (k={0}): {1:.2f} s.".format(k_values[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=k_values,columns=["Accuracy","Brier score","AUC"])

print()
display("results",results)



In [None]:
 	Accuracy 	Brier score 	AUC
1 	0.747664 	0.504673 	0.810350
3 	0.663551 	0.488058 	0.815859
5 	0.579439 	0.471028 	0.833843
7 	0.598131 	0.471867 	0.833481
9 	0.616822 	0.482981 	0.827727

In [None]:
train_labels = glass_train_df["CLASS"]
predictions = knn_model.predict(glass_train_df,k=1)
print("Accuracy on training set (k=1): {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set (k=1): {0:.4f}".format(auc(predictions,train_labels)))
print("Brier score on training set (k=1): {0:.4f}".format(brier_score(predictions,train_labels)))

### Comment on assumptions, things that do not work properly, etc.
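
The neighbour search described in Hints 2 and 3 of part 1 (distances to all training instances, sorting, relative class frequencies among the k closest) can be sketched in isolation as follows. This is only a minimal illustration, not the submitted class: the function name `knn_class_probabilities`, its parameters and the toy data are hypothetical, and breaking distance ties by training order is an assumption rather than something the assignment prescribes.

```python
import numpy as np

def knn_class_probabilities(train_X, train_y, labels, x_test, k=5):
    """Hypothetical helper: relative class frequencies among the k nearest rows."""
    # Euclidean distance from x_test to every training row (via broadcasting)
    distances = np.sqrt(((train_X - x_test) ** 2).sum(axis=1))
    # Stable sort: ties in distance are resolved by training order (an assumption)
    nearest = np.argsort(distances, kind="stable")[:k]
    nearest_labels = train_y[nearest]
    # Fraction of the k neighbours that carry each label
    return np.array([(nearest_labels == label).mean() for label in labels])

# Toy data, made up for illustration only
train_X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
train_y = np.array(["a", "a", "b", "b"])
probs = knn_class_probabilities(train_X, train_y, ["a", "b"], np.array([0.0, 0.1]), k=3)
print(probs)
```

With k=3 the two "a" rows and one "b" row are the closest, so the estimate is two thirds for "a" and one third for "b". A different tie-breaking rule can change which instances count as the k nearest, and therefore the estimated probabilities.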


## 2. Define the class NaiveBayes

In [None]:
# Define the class NaiveBayes with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self - the object itself
#
# Output from __init__:
# <nothing>
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# column_filter, binning, labels, class_priors, feature_class_value_counts, feature_class_counts
#
# Input to fit:
# self    - the object itself
# df      - a dataframe (where the column names "CLASS" and "ID" have special meaning)
# nobins  - no. of bins (default = 10)
# bintype - either "equal-width" (default) or "equal-size" 
#
# Output from fit:
# <nothing>
#
# The result of applying this function should be:
#
# self.column_filter              - a column filter (see Assignment 1) from df
# self.binning                    - a discretization mapping (see Assignment 1) from df
# self.class_priors               - a mapping (dictionary) from the labels (categories) of the "CLASS" column of df,
#                                   to the relative frequencies of the labels
# self.labels                     - a list of the categories (class labels) of the "CLASS" column of df
# self.feature_class_value_counts - a mapping from the feature (column name) to the number of
#                                   training instances with a specific combination of (non-missing, categorical) 
#                                   value for the feature and class label
# self.feature_class_counts       - a mapping from the feature (column name) to the number of
#                                   training instances with a specific class label and some (non-missing, categorical) 
#                                   value for the feature
#
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Input to predict:
# self - the object itself
# df   - a dataframe
# 
# Output from predict:
# predictions - a dataframe with class labels as column names and the rows corresponding to
#               predictions with estimated class probabilities for each row in df, where the class probabilities
#               are estimated by the naive approximation of Bayes rule (see lecture slides)
#
# Hint 1: First apply the column filter and discretization
#
# Hint 2: Iterating over either columns or rows, and for each possible class label, calculate the relative
#         frequency of the observed feature value given the class (using feature_class_value_counts and 
#         feature_class_counts) 
#
# Hint 3: Calculate the non-normalized estimated class probabilities by multiplying the class priors to the
#         product of the relative frequencies
#
# Hint 4: Normalize the probabilities by dividing by the sum of the non-normalized probabilities; in case
#         this sum is zero, then set the probabilities to the class priors

class NaiveBayes:
    def __init__(self):
        # column_filter, binning, labels, class_priors, feature_class_value_counts, feature_class_counts
        self.column_filter = None
        self.binning = None
        self.labels = None
        self.class_priors = None
        self.feature_class_value_counts = None
        self.feature_class_counts = None

    def fit(self, df, nobins=10, bintype='equal-width'):
        df2 = df.copy()
        df2, self.column_filter = create_column_filter(df2)
        df2, self.binning = create_bins(df2, nobins, bintype)

        # Set labels
        self.labels = np.array(pd.Series(df['CLASS'].astype('category').sort_values()).unique())

        # Set class_priors = P(H)
        self.class_priors = {}
        for label in self.labels:
            prior = np.count_nonzero(df2['CLASS'] == label) / df2['CLASS'].size
            self.class_priors[label] = prior

        # Set class_value_counts = numerator in P(Xi|H) for i in {features}
        self.feature_class_value_counts = {}
        # Set class_counts = denominator in P(Xi|H) for i in {features}
        self.feature_class_counts = {}

        columns = [col for col in df2.columns if (col != 'CLASS' and col != 'ID')]

        # Class labels are sorted, so they are not stored alongside the counts:
        # the position in each list maps to the label at the same position in self.labels.
        # Missing values are skipped, so that the counts only cover non-missing values.
        for col in columns:
            self.feature_class_counts[col] = []
            self.feature_class_value_counts[col] = []

        for label in self.labels:
            for col in columns:
                # Non-missing values of this feature among instances with this label
                values = df2.loc[df2['CLASS'] == label, col].dropna()
                value_count = {unique: 0 for unique in df2[col].dropna().unique()}
                for value in values:
                    value_count[value] += 1
                self.feature_class_counts[col].append(len(values))
                self.feature_class_value_counts[col].append(value_count)
        
    def predict(self, df):
        df2 = df.copy()
        df2 = apply_column_filter(df2, self.column_filter)
        df2 = apply_bins(df2, self.binning)
        predictions = pd.DataFrame(np.zeros((df2.index.size, self.labels.size)), columns=self.labels)
        columns = [col for col in df2.columns if (col != 'CLASS' and col != 'ID')]
        priors = np.array([self.class_priors[label] for label in self.labels])

        # Iterate positionally so that a non-default index in df does not matter
        for row_index, (_, row) in enumerate(df2.iterrows()):
            # Non-normalized probabilities: prior times product of relative frequencies
            probabilities = np.zeros(self.labels.size)
            for label_index, label in enumerate(self.labels):
                relative_frequency = 1
                for col in columns:
                    denominator = self.feature_class_counts[col][label_index]
                    try:
                        numerator = self.feature_class_value_counts[col][label_index][row[col]]
                    except KeyError:
                        # Missing value or value not seen in training: skip the feature
                        continue
                    if denominator == 0:
                        continue
                    relative_frequency *= numerator / denominator
                probabilities[label_index] = relative_frequency * self.class_priors[label]

            # Normalize; if all probabilities are zero, fall back to the class priors
            total = probabilities.sum()
            predictions.iloc[row_index, :] = probabilities / total if total > 0 else priors

        return predictions
    
    
# ====================================================
glass_train_df = pd.read_csv("glass_train.csv")
glass_test_df = pd.read_csv("glass_test.csv")
nb_model = NaiveBayes()
test_labels = glass_test_df["CLASS"]
nobins_values = [3,5,10]
bintype_values = ["equal-width","equal-size"]
parameters = [(nobins,bintype) for nobins in nobins_values for bintype in bintype_values]
results = np.empty((len(parameters),3))
for i in range(len(parameters)):
    t0 = time.perf_counter()
    nb_model.fit(glass_train_df,nobins=parameters[i][0],bintype=parameters[i][1])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = nb_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=pd.MultiIndex.from_product([nobins_values,bintype_values]),
                       columns=["Accuracy","Brier score","AUC"])
print()
display("results",results)
# =======================================================
train_labels = glass_train_df["CLASS"]
nb_model.fit(glass_train_df)
predictions = nb_model.predict(glass_train_df)
print("Accuracy on training set: {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.4f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.4f}".format(brier_score(predictions,train_labels)))

In [None]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.csv")

glass_test_df = pd.read_csv("glass_test.csv")

nb_model = NaiveBayes()

test_labels = glass_test_df["CLASS"]

nobins_values = [3,5,10]
bintype_values = ["equal-width","equal-size"]
parameters = [(nobins,bintype) for nobins in nobins_values for bintype in bintype_values]

results = np.empty((len(parameters),3))

for i in range(len(parameters)):
    t0 = time.perf_counter()
    nb_model.fit(glass_train_df,nobins=parameters[i][0],bintype=parameters[i][1])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = nb_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=pd.MultiIndex.from_product([nobins_values,bintype_values]),
                       columns=["Accuracy","Brier score","AUC"])

print()
display("results",results)

In [None]:
 	 	Accuracy 	Brier score 	AUC
3 	equal-width 	0.616822 	0.622116 	0.724335
    equal-size 	0.607477 	0.554782 	0.780163
5 	equal-width 	0.644860 	0.551101 	0.771688
    equal-size 	0.598131 	0.581556 	0.796675
10 	equal-width 	0.654206 	0.527569 	0.812887
    equal-size 	0.588785 	0.741668 	0.751165

In [None]:
train_labels = glass_train_df["CLASS"]
nb_model.fit(glass_train_df)
predictions = nb_model.predict(glass_train_df)
print("Accuracy on training set: {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.4f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.4f}".format(brier_score(predictions,train_labels)))

### Comment on assumptions, things that do not work properly, etc.
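
Hints 3 and 4 of part 2 describe multiplying the class priors by the product of the relative frequencies and then normalizing, with a fallback to the priors when the sum is zero. The sketch below shows that step on its own for a single row. It is a minimal illustration under stated assumptions: the helper name `normalize_with_prior_fallback` and the numbers are hypothetical, and both arrays are assumed to follow the same class-label order.

```python
import numpy as np

def normalize_with_prior_fallback(scores, priors):
    """Hypothetical helper: normalize non-normalized class scores for one row."""
    scores = np.asarray(scores, dtype=float)
    total = scores.sum()
    if total == 0:
        # Every class was ruled out: fall back to the class priors
        return np.asarray(priors, dtype=float).copy()
    return scores / total

# Made-up priors and per-class products of relative frequencies
priors = np.array([0.5, 0.3, 0.2])
likelihoods = np.array([0.02, 0.0, 0.06])

normalized = normalize_with_prior_fallback(priors * likelihoods, priors)
fallback = normalize_with_prior_fallback(np.zeros(3), priors)
print(normalized)
print(fallback)
```

Returning the priors as the whole probability vector, rather than dividing by a single prior, keeps each row summing to one in the zero-sum case.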