# ID2214/FID3214 Assignment 2 Group no. 13
### Project members: 
- Christian Durán García - chdg@kth.se
- Kailin Wu - kailinw@kth.se
- William Carlstedt - wcar@kth.se

### Declaration:
By submitting this solution, it is hereby declared that all individuals listed above have contributed to the solution, either with code that appears in the final solution below, or with code that has been evaluated and compared to the final solution, but for some reason has been excluded. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214/FID3214) and no part of the solution has been provided by someone not listed as project member above.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy, pandas and time may be used in the solution for this assignment.

### Instructions
All parts of the assignment starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of all parts of the assignment starting 
with number 2 below, then the assignment will receive 2 points (in total).

Note that you do not have to develop the code directly within the notebook
but may instead copy the comments and test cases to a more convenient development environment
and when everything works as expected, you may paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above).

## Load NumPy, pandas and time

In [2]:
import numpy as np
import pandas as pd
import time

In [3]:
from platform import python_version

print(f"Python version: {python_version()}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

Python version: 3.10.13
NumPy version: 1.24.3
Pandas version: 2.1.1


## Reused functions from Assignment 1

In [4]:
# Copy and paste functions from Assignment 1 here that you need for this assignment
def create_column_filter(df):
    # Create copy of df
    df_copy = df.copy()
    # Keep columns CLASS and ID
    columns_to_keep = ['CLASS', 'ID']

    for column in df_copy.columns:
        # Keep the CLASS and ID columns unconditionally
        if column in columns_to_keep:
            continue
        # Drop the column if it has at most one distinct value or contains only missing values
        elif(df_copy[column].nunique()<=1 or df_copy[column].isna().all()):
            df_copy = df_copy.drop(column, axis = 1)

    # Return df and column filter
    return df_copy, df_copy.columns.tolist()

# Function apply column_filter
def apply_column_filter(df, column_filter):
    # Copy of df
    df_copy = df.copy()

    # Return df with the columns filtered
    return df_copy[column_filter].copy()

def create_normalization(df, normalizationtype='minmax'):
    # Copy of df
    df_copy = df.copy()
    # Select columns of dtype float and int
    selected_columns = [col for col in df_copy.select_dtypes(include=['float', 'int']).columns if col not in ['CLASS', 'ID']]
    # DF with selected columns
    df_copy = df_copy[selected_columns]
    # Dictionary to store column and values
    column_mapping = {}

    # Case 1: minmax
    if normalizationtype == 'minmax':
        # For loop for columns in df
        for col in selected_columns:
            # Minimum value of the column (named col_min to avoid shadowing the built-in min)
            col_min = df_copy[col].min()
            # Maximum value of the column (named col_max to avoid shadowing the built-in max)
            col_max = df_copy[col].max()
            # Store min max values in the column_mapping dictionary
            column_mapping[col] = ('minmax', col_min, col_max)
            # Apply the minmax normalization to the whole column at once
            df_copy[col] = (df_copy[col] - col_min) / (col_max - col_min)
        # Return the normalized df and the mapping dictionary
        return df_copy, column_mapping
    
    # Case 2: zscore
    if normalizationtype == 'zscore':
        # For loop for columns in df
        for col in selected_columns:
            # Mean value of the column
            mean = df_copy[col].mean()
            # Standard deviation of the column
            std = df_copy[col].std()
            # Store the mean and std values in the dictionary
            column_mapping[col] = ('zscore', mean, std)
            # Apply the zscore normalization to the whole column at once
            df_copy[col] = (df_copy[col] - mean) / std
        # Return normalized df and the mapping dictionary
        return df_copy, column_mapping

# Function apply_normalization
def apply_normalization(df, normalization):
    # Copy of the original df
    df_copy = df.copy()

    # For loop of columns and items
    for col, (method, param1, param2) in normalization.items():
        # If the method == minmax apply the minmax normalization
        if method == 'minmax':
            # min, max values from the normalization mapping (named to avoid shadowing the built-ins)
            col_min, col_max = param1, param2
            # Apply normalization
            df_copy[col] = (df_copy[col]-col_min)/(col_max-col_min)
        # If the method == zscore apply the zscore normalization
        elif method == 'zscore':
            # mean, std values from the normalization mapping
            mean, std = param1, param2
            # Apply normalization
            df_copy[col] = (df_copy[col]-mean)/std
    # Return the normalized df
    return df_copy

def create_imputation(df):
    # Copy of the original df
    df_copy = df.copy()
    # Exclude columns 'CLASS' and 'ID'
    exclude_columns = ['CLASS']
    if 'ID' in df_copy.columns:
        exclude_columns.append('ID')

    # Dictionary for imputation   
    imputation_mapping = {}

    # For loop for columns
    for col in df_copy.columns:
        # If the column is one of the exclude columns, it'll continue with the next step 
        if col in exclude_columns:
            continue
        
        # If the data type of the column is float or int
        if df_copy[col].dtype == 'float' or df_copy[col].dtype == 'int':
            # If all values are NA
            if df_copy[col].isna().all():
                # Fill NA with 0
                df_copy[col] = df_copy[col].fillna(0)
                # Update mapping
                imputation_mapping[col] = 0
            # If not all are NA
            else:
                # Mean value of the column
                mean_value = df_copy[col].mean()
                # Fill the NA of the column with the mean value (assign back instead of in-place fill on a column selection)
                df_copy[col] = df_copy[col].fillna(mean_value)
                # Update mapping
                imputation_mapping[col] = mean_value
        # If column data type is object
        elif df_copy[col].dtype == 'object':
            # If column contains all NA
            if df_copy[col].isna().all():
                # Fill NAs with ''
                df_copy[col] = df_copy[col].fillna('')
                # Update mapping
                imputation_mapping[col] = ''
            # If not all are NA
            else:
                # Mode value of the column
                mode_value = df_copy[col].mode()[0]
                # Fill NA with mode value (assign back instead of in-place fill on a column selection)
                df_copy[col] = df_copy[col].fillna(mode_value)
                # Update mapping
                imputation_mapping[col] = mode_value
    
    # Return imputed df and mapping
    return df_copy, imputation_mapping

# Function apply_imputation
def apply_imputation(df, imputation):
    # Copy of the original df
    df_copy = df.copy()
    
    # For loop for column and imputation values
    for col, value in imputation.items():
        # Fill NA with the value (assign back instead of in-place fill on a column selection)
        df_copy[col] = df_copy[col].fillna(value)

    # Return imputed df
    return df_copy

def create_bins(df, nobins=10, bintype='equal-width'):
    df1 = df.copy() # Copy of dataframe
    num_cols = df1.select_dtypes(include=['float64', 'int64']).columns.tolist() # Select only numerical columns
    num_cols = [col for col in num_cols if col not in ['CLASS', 'ID']] # Exclude special columns

    mapping = {} # Initialize dictionary for mapping

    for col in num_cols: # Iterate over numerical columns

        if bintype == 'equal-width': # Use cut
            _, bins = pd.cut(df1[col], nobins, labels=False, retbins=True) # Only getting bins
        elif bintype == 'equal-size': # Use qcut
            _, bins = pd.qcut(df1[col], nobins, labels=False, retbins=True, duplicates='drop') # Only getting bins

        bins[0], bins[-1] = -np.inf, np.inf # Change bin boundaries to infinity
        
        df1[col] = pd.cut(df1[col], bins) # Apply the binning strategy

        labels = {} # Initialize dictionary to get labels
        for i in range(len(bins)-1): # Loop for nobins to get according labels (they are ordered already)
            labels[df1[col].cat.categories[i]] = i

        df1[col] = df1[col].replace(labels) # Replace bins for labels

        mapping[col] = bins # Map column to bins

    return df1, mapping

def apply_bins(df, binning):
    df1 = df.copy() # Copying dataframe

    for col, bins in binning.items():
        df1[col] = pd.cut(df1[col], bins)

        labels = {} # Initialize dictionary to get labels
        for i in range(len(bins)-1): # Loop for nobins to get according labels (they are ordered already)
            labels[df1[col].cat.categories[i]] = i

        df1[col] = df1[col].replace(labels) # Replace bins for labels
    
    return df1

def create_one_hot(df):
    df1 = df.copy() # Copy original dataframe
    cat_cols = df1.select_dtypes(include=['object', 'category']).columns.tolist() # Get all categorical columns
    cat_cols = [col for col in cat_cols if col not in ['CLASS', 'ID']] # Exclude special columns

    ### There are no null values in the dataframe, so no manipulation needed for this issue

    mapping = {} # Initialize the mapping dictionary

    columns = [] # Initialize a list to store all the columns generated

    for col in cat_cols: # Iterate over all categorical columns
        mapping[col] = df1[col].unique().tolist() # Get all possible categories of the column

        for val in mapping[col]: # Iterate over the possible values of the column
            encoded_col = df1[col].where(df1[col] == val).fillna(0).replace({val:1}).astype('float64')\
                .rename(col + '_' + val) # Create the encoded column via series manipulation
            columns.append(encoded_col) # Append the column to the list of new columns

    encoded_df = pd.concat([df1['CLASS']] + columns, axis=1) # Concatenate new columns with class column

    return encoded_df, mapping

def apply_one_hot(df, mapping):
    df1 = df.copy() # Copy of the input dataframe

    columns = [] # Initialize list to store all new columns

    for col, values in mapping.items():
        for val in values: # Use the same code as in the previous function
            encoded_col = df1[col].where(df1[col] == val).fillna(0).replace({val:1}).astype('float64')\
                .rename(col + '_' + val) # Create the encoded column via series manipulation
            columns.append(encoded_col) # Append the column to the list of new columns

    encoded_df = pd.concat([df1['CLASS']] + columns, axis=1) # Concatenate new columns with class column

    return encoded_df

def split(df, testfraction=0.5):
    # Calculate the number of test instances
    num_test_instances = int(len(df) * testfraction)
    
    # Get a permuted list of indexes from the DataFrame
    permuted_indices = np.random.permutation(df.index)
    
    # Split the indices into training and test indices
    test_indices = permuted_indices[:num_test_instances]
    train_indices = permuted_indices[num_test_instances:]
    
    # Create the training and test DataFrames
    trainingdf = df.loc[train_indices]
    testdf = df.loc[test_indices]
    
    return trainingdf, testdf

def accuracy(df, correctlabels):
    # Predict, for each row, the first label (in column order) with the highest probability,
    # which also resolves ties in favor of the first such label
    predictions = df.apply(lambda x: x.index[x.values == x.max()][0], axis=1)
    # Fraction of predictions that match the correct labels
    correct_predictions = sum(predictions == correctlabels)
    return correct_predictions / len(correctlabels)

def folds(df, nofolds=10):
    # Shuffle the indices of the DataFrame
    shuffled_indices = np.random.permutation(df.index)
    # Calculate the size of each fold
    fold_size = len(df) // nofolds
    # Initialize the list of folds
    folds = []
    
    # Create each fold
    for i in range(nofolds):
        # Determine the start and end indices of the fold
        start_index = i * fold_size
        # If it's the last fold, it should contain all remaining instances
        if i == nofolds - 1:
            end_index = len(df)
        else:
            end_index = (i + 1) * fold_size
        # Append the fold to the list of folds (the shuffled values are index labels, so select with loc)
        folds.append(df.loc[shuffled_indices[start_index:end_index]])
    
    return folds

def brier_score(df, correctlabels):
    df1 = df.copy() # Copying original dataframe

    brier_scores = [] # Initialize a list to store all the brier scores to later apply a mean

    for i, row in df1.iterrows(): # Iterate over the dataframe's rows
        idx = np.where(df1.columns == correctlabels[i])[0] # Get the index of the column that corresponds with the correct label
        correct_label_vector = np.zeros(len(row)) # Generate an empty vector that will contain the probabilities
        correct_label_vector[idx] = 1 # Replace the index of the correct label with one
        brier_scores.append(np.sum((row - correct_label_vector)**2))# Apply the brier scores formula
    
    return np.mean(brier_scores) # Return the final brier score

def auc_binary(predictions, correctlabels, target_label):
    # Get all the true labels for the target label
    true_labels = (np.array(correctlabels) == target_label) # Expect to get a binary array where true represents the positive class

    # Get the predicted probabilities (the scores) of the positive class (the target label)
    positive_class_probabilities = predictions[target_label]

    # Sort the predictions in descending order
    sorted_idx = np.argsort(positive_class_probabilities)[::-1].tolist() # Sort the indices
    sorted_labels = true_labels[sorted_idx] # Sort the labels based on the sorted index

    # Get the number of positive and negative predictions
    n_pos = sum(sorted_labels) # Number of positives is the sum of the positive cases
    n_neg = len(sorted_labels) - n_pos # Number of negatives is the total number of predictions minus the number of positives

    # Calculate the TPR and FPR
    tpr = np.cumsum(sorted_labels) / n_pos
    fpr = np.cumsum(~sorted_labels) / n_neg # False Positives are the inverse of True Positives, thus the use of ~

    # Calculate the auc using trapezoidal rule (considers the three possible cases for calculating area)
    auc = np.trapz(tpr, fpr)

    return auc

def auc(predictions, correctlabels):
    true_labels = pd.Series(correctlabels) # Convert the correct labels into a Series to apply certain methods to it
    labels = true_labels.unique().tolist() # Get all unique labels
    
    auc_scores = [] # Initialize list for storing the auc values

    # Calculate the binary auc for each label and add the score to the list
    for label in labels:
        auc_scores.append(auc_binary(predictions, correctlabels, label))

    auc_scores = np.array(auc_scores) # Transform into numpy array for broadcast operations

    # Calculate the relative frequency of each label
    rel_freq = (true_labels.value_counts().loc[labels] / true_labels.count()).to_numpy() # Using loc in value counts to be sure of order of scores

    # Calculate the weighted auc
    weighted_auc = np.sum(auc_scores * rel_freq) / sum(rel_freq)

    return weighted_auc

## 1. Define the class kNN

In [3]:
# Define the class kNN with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self - the object itself
#
# Output from __init__:
# <nothing>
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# column_filter, imputation, normalization, one_hot, labels, training_labels, training_data, training_time
#
# Input to fit:
# self              - the object itself
# df                - a dataframe (where the column names "CLASS" and "ID" have special meaning)
# normalizationtype - "minmax" (default) or "zscore"
#
# Output from fit:
# <nothing>
#
# The result of applying this function should be:
#
# self.column_filter   - a column filter (see Assignment 1) from df
# self.imputation      - an imputation mapping (see Assignment 1) from df
# self.normalization   - a normalization mapping (see Assignment 1), using normalizationtype from the imputed df
# self.one_hot         - a one-hot mapping (see Assignment 1)
# self.training_labels - a pandas series corresponding to the "CLASS" column, set to be of type "category" 
# self.labels          - a list of the categories (class labels) of the previous series
# self.training_data   - the values (an ndarray) of the transformed dataframe, i.e., after employing imputation, 
#                        normalization, and possibly one-hot encoding, and also after removing the "CLASS" and "ID" columns
#
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Input to predict:
# self - the object itself
# df   - a dataframe
# k    - an integer >= 1 (default = 5)
# 
# Output from predict:
# predictions - a dataframe with class labels as column names and the rows corresponding to
#               predictions with estimated class probabilities for each row in df, where the class probabilities
#               are estimated by the relative class frequencies in the set of class labels from the k nearest 
#               (with respect to Euclidean distance) neighbors in training_data
#
# Hint 1: Drop any "CLASS" and "ID" columns first and then apply column filtering, imputation, normalization and one-hot
#
# Hint 2: Get the numerical values (as an ndarray) from the resulting dataframe and iterate over the rows 
#         calling some sub-function, e.g., get_nearest_neighbor_predictions(x_test,k), which for a test row
#         (numerical input feature values) finds the k nearest neighbors and calculate the class probabilities.
#
# Hint 3: This sub-function may first find the distances to all training instances, e.g., pairs consisting of
#         training instance index and distance, and then sort them according to distance, and then (using the indexes
#         of the k closest instances) find the corresponding labels and calculate the relative class frequencies



In [4]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.csv")

glass_test_df = pd.read_csv("glass_test.csv")

knn_model = kNN()

t0 = time.perf_counter()
knn_model.fit(glass_train_df)
print("Training time: {0:.2f} s.".format(time.perf_counter()-t0))

test_labels = glass_test_df["CLASS"]

k_values = [1,3,5,7,9]
results = np.empty((len(k_values),3))

for i in range(len(k_values)):
    t0 = time.perf_counter()
    predictions = knn_model.predict(glass_test_df,k=k_values[i])
    print("Testing time (k={0}): {1:.2f} s.".format(k_values[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=k_values,columns=["Accuracy","Brier score","AUC"])

print()
display("results",results)

Training time: 0.01 s.
Testing time (k=1): 0.14 s.
Testing time (k=3): 0.15 s.
Testing time (k=5): 0.15 s.
Testing time (k=7): 0.15 s.
Testing time (k=9): 0.15 s.



'results'

| k | Accuracy | Brier score | AUC |
|---|----------|-------------|-----|
| 1 | 0.747664 | 0.504673 | 0.81035 |
| 3 | 0.663551 | 0.488058 | 0.815859 |
| 5 | 0.579439 | 0.471028 | 0.833843 |
| 7 | 0.598131 | 0.471867 | 0.833481 |
| 9 | 0.616822 | 0.482981 | 0.827727 |


In [5]:
train_labels = glass_train_df["CLASS"]
predictions = knn_model.predict(glass_train_df,k=1)
print("Accuracy on training set (k=1): {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set (k=1): {0:.4f}".format(auc(predictions,train_labels)))
print("Brier score on training set (k=1): {0:.4f}".format(brier_score(predictions,train_labels)))

Accuracy on training set (k=1): 1.0000
AUC on training set (k=1): 1.0000
Brier score on training set (k=1): 0.0000


### Comment on assumptions, things that do not work properly, etc.


## 2. Define the class NaiveBayes

In [5]:
# Define the class NaiveBayes with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self - the object itself
#
# Output from __init__:
# <nothing>
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# column_filter, binning, labels, class_priors, feature_class_value_counts, feature_class_counts
#
# Input to fit:
# self    - the object itself
# df      - a dataframe (where the column names "CLASS" and "ID" have special meaning)
# nobins  - no. of bins (default = 10)
# bintype - either "equal-width" (default) or "equal-size" 
#
# Output from fit:
# <nothing>
#
# The result of applying this function should be:
#
# self.column_filter              - a column filter (see Assignment 1) from df
# self.binning                    - a discretization mapping (see Assignment 1) from df
# self.class_priors               - a mapping (dictionary) from the labels (categories) of the "CLASS" column of df,
#                                   to the relative frequencies of the labels
# self.labels                     - a list of the categories (class labels) of the "CLASS" column of df
# self.feature_class_value_counts - a mapping from the feature (column name) to the number of
#                                   training instances with a specific combination of (non-missing, categorical) 
#                                   value for the feature and class label
# self.feature_class_counts       - a mapping from the feature (column name) to the number of
#                                   training instances with a specific class label and some (non-missing, categorical) 
#                                   value for the feature
#
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Input to predict:
# self - the object itself
# df   - a dataframe
# 
# Output from predict:
# predictions - a dataframe with class labels as column names and the rows corresponding to
#               predictions with estimated class probabilities for each row in df, where the class probabilities
#               are estimated by the naive approximation of Bayes rule (see lecture slides)
#
# Hint 1: First apply the column filter and discretization
#
# Hint 2: Iterating over either columns or rows, and for each possible class label, calculate the relative
#         frequency of the observed feature value given the class (using feature_class_value_counts and 
#         feature_class_counts) 
#
# Hint 3: Calculate the non-normalized estimated class probabilities by multiplying the class priors to the
#         product of the relative frequencies
#
# Hint 4: Normalize the probabilities by dividing by the sum of the non-normalized probabilities; in case
#         this sum is zero, then set the probabilities to the class priors
#
# Hint 5: To clarify the assignment text a little: self.feature_class_value_counts should be a mapping from 
#         a column name (a specific feature) to another mapping, which given a class label and a value for 
#         the feature, returns the number of training instances which have included this combination, 
#         i.e., the number of training instances with both the specific class label and this value on the feature.
#
# Hint 6: As an additional hint, you may take a look at the slides from the NumPy and pandas lecture, to see how you 
#         may use "groupby" in combination with "size" to get the counts for combinations of values from two columns.

class NaiveBayes():
    
    # Initialize the Naive Bayes object
    def __init__(self):
        self.column_filter = None
        self.binning = None
        self.labels = None
        self.class_priors = None
        self.feature_class_value_counts = None
        self.feature_class_counts = None

    # Define the fit method, assign values to attributes
    def fit(self, X, nobins=10, bintype='equal-width'):
        
        ### Set the attributes to a suitable value
        X, self.column_filter = create_column_filter(X)
        X, self.binning = create_bins(X, nobins, bintype)

        # Get the class priors (relative frequencies of each class)
        self.class_priors = {}
        for idx, value in X['CLASS'].astype('category').value_counts(normalize=True).sort_index().items():
            self.class_priors[idx] = value

        # Get the labels
        self.labels = list(self.class_priors.keys())

        # Drop the Class and ID columns for the data
        X_copy = X.drop(['ID', 'CLASS'], axis=1)

        # Feature class value counts and feature class counts
        self.feature_class_value_counts = {}
        self.feature_class_counts = {}

        # Iterate over each class label in the training dataset
        for label in self.labels:
            # Create a subset containing only the rows of the class being examined
            label_df = X[X["CLASS"] == label]

            # Iterate over all feature columns
            for column in X_copy.columns:

                # If the column is not yet a key of the dict, initialize it with an empty dict
                if column not in self.feature_class_value_counts:
                    self.feature_class_value_counts[column] = {}

                # Count the rows of this class that have a non-missing value for the column
                self.feature_class_counts[(label, column)] = label_df[column].count()

                # Iterate over the values of the examined column within the class subset
                for value, count in label_df[column].value_counts().sort_index().items():
                    # Store the count under the (label, value) key of the column
                    self.feature_class_value_counts[column][(label, value)] = count

    # Define predict method
    def predict(self, X):

        # Apply the mappings from the fit method
        X = apply_column_filter(X, self.column_filter)
        X = apply_bins(X, self.binning)

        # Create a copy without the ID and CLASS columns
        X_copy = X.drop(['ID', 'CLASS'], axis=1)

        # Initialize an empty dataframe to calculate the predictions by row
        predictions = pd.DataFrame(columns=self.labels)

        # Loop through the rows in order to calculate the probabilities
        for idx, row in X_copy.iterrows():
            # Initialize an empty dictionary to store all calculated probabilities and append it to the DF
            probabilities = {}

            # Iterate over each label from the training labels
            for label in self.labels:
                # First establish the probability of the label as the prior probability
                prob = self.class_priors[label]

                # Iterate over the given row
                for column, value in row.items():
                    # Check whether this (label, value) combination was recorded for the column during fit
                    if column in self.feature_class_value_counts and (label, value) in self.feature_class_value_counts[column]:
                        # Multiply by the relative frequency of the value given the class
                        prob *= self.feature_class_value_counts[column][(label, value)] / self.feature_class_counts[(label, column)]
                    else:
                        # Combination not recorded: multiply by a small positive factor instead of zero
                        prob *= 1 / (self.feature_class_counts[(label, column)] + len(self.feature_class_value_counts.get(column, [])))

                # Assign the probability of that label 
                probabilities[label] = prob

            # Sum in order to get the total probabilities per row
            total_prob = sum(probabilities.values())

            # If the total probability is 0, set the probabilities to the class priors
            if total_prob == 0:
                predictions.loc[idx] = list(self.class_priors.values())
            else: # If more than 0, normalize the probabilities using the sum of the non normalized probabilities
                predictions.loc[idx] = [prob / total_prob for prob in probabilities.values()]

        return predictions
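
The counting and normalization steps from Hints 3, 4 and 6 can be illustrated in isolation. The sketch below uses made-up, already discretized data with a single feature `F1`; the helper `class_probabilities` is hypothetical. It follows the hints literally, so a combination that was never observed contributes a factor of zero, unlike the `predict` method above, which multiplies by a small positive factor for combinations that were not recorded.

```python
import pandas as pd

# Toy, already discretized training data (made up for illustration)
train = pd.DataFrame({
    "CLASS": ["a", "a", "a", "b", "b"],
    "F1": [0, 0, 1, 1, 1],
})

# Relative class frequencies
class_priors = train["CLASS"].value_counts(normalize=True).to_dict()

# Counts per (class label, feature value) combination via groupby + size (Hint 6)
value_counts = train.groupby(["CLASS", "F1"]).size().to_dict()
# Counts per class label among rows with a non-missing feature value
class_counts = train.dropna(subset=["F1"]).groupby("CLASS").size().to_dict()

def class_probabilities(value):
    """Naive Bayes estimate for a single-feature row (hypothetical helper)."""
    raw = {}
    for label, prior in class_priors.items():
        # Relative frequency of the value given the class; zero if never observed
        rel_freq = value_counts.get((label, value), 0) / class_counts[label]
        # Hint 3: prior times the product of relative frequencies (one feature here)
        raw[label] = prior * rel_freq
    total = sum(raw.values())
    if total == 0:
        # Hint 4: fall back to the class priors when every product is zero
        return dict(class_priors)
    # Hint 4: normalize by the sum of the non-normalized probabilities
    return {label: p / total for label, p in raw.items()}

seen = class_probabilities(0)    # value observed only together with class "a"
unseen = class_probabilities(7)  # value never observed: the priors are returned
print(seen)
print(unseen)
```

In the first call all probability mass goes to class `a`, since the value `0` never occurs with class `b`; in the second call every product is zero, so the class priors are returned unchanged.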


In [6]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.csv")

glass_test_df = pd.read_csv("glass_test.csv")

nb_model = NaiveBayes()

test_labels = glass_test_df["CLASS"]

nobins_values = [3,5,10]
bintype_values = ["equal-width","equal-size"]
parameters = [(nobins,bintype) for nobins in nobins_values for bintype in bintype_values]

results = np.empty((len(parameters),3))

for i in range(len(parameters)):
    t0 = time.perf_counter()
    nb_model.fit(glass_train_df,nobins=parameters[i][0],bintype=parameters[i][1])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = nb_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=pd.MultiIndex.from_product([nobins_values,bintype_values]),
                       columns=["Accuracy","Brier score","AUC"])

print()
display("results",results)

Training time (3, 'equal-width'): 0.34 s.
Testing time (3, 'equal-width'): 0.32 s.
Training time (3, 'equal-size'): 0.17 s.
Testing time (3, 'equal-size'): 0.34 s.
Training time (5, 'equal-width'): 0.24 s.
Testing time (5, 'equal-width'): 0.41 s.
Training time (5, 'equal-size'): 0.26 s.
Testing time (5, 'equal-size'): 0.35 s.
Training time (10, 'equal-width'): 0.24 s.
Testing time (10, 'equal-width'): 0.34 s.
Training time (10, 'equal-size'): 0.35 s.
Testing time (10, 'equal-size'): 0.40 s.



'results'

| nobins | bintype | Accuracy | Brier score | AUC |
|--------|---------|----------|-------------|-----|
| 3 | equal-width | 0.616822 | 0.622116 | 0.726175 |
| 3 | equal-size | 0.607477 | 0.554782 | 0.788255 |
| 5 | equal-width | 0.64486 | 0.551101 | 0.771041 |
| 5 | equal-size | 0.598131 | 0.581556 | 0.797479 |
| 10 | equal-width | 0.654206 | 0.527569 | 0.811645 |
| 10 | equal-size | 0.588785 | 0.741668 | 0.762948 |


In [7]:
train_labels = glass_train_df["CLASS"]
nb_model.fit(glass_train_df)
predictions = nb_model.predict(glass_train_df)
print("Accuracy on training set: {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.4f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.4f}".format(brier_score(predictions,train_labels)))

Accuracy on training set: 0.8505
AUC on training set: 0.9690
Brier score on training set: 0.2263


### Comment on assumptions, things that do not work properly, etc.