# Assignment 2
## Group name: [enter your group name with either the prefix "ID2214-" or "FID3214-"]
### Project members: 
[Francesco Luce, luce@kth.se]

[Leandro Duarte, leandrod@kth.se]

[Stefano Bosoppi, bosoppi@kth.se]

### Declaration:
By submitting this assignment, it is hereby declared that all group members listed above have contributed to the solution, either with code that appear in the final solution below, or with code that has been evaluated and compared to the final solution, but for some reason has been excluded. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214/FID3214), no part of the solution has been provided by someone not listed as a project member above, and no part of the solution has been generated by a system.

It is furthermore declared that the submitted assignment will not be shared during the course, with any individual other than the group members listed above and teachers of the course ID2214/FID3214. In particular, the assignment will not be uploaded to any public repository. The submitted assignment can be shared after the course only if written consent has been provided by the course responsible of ID2214/FID3214.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy and pandas may be used in the solution for this assignment.

### Instructions
All parts of the assignment starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of all parts of the assignment starting 
with number 2 below, then the assignment will receive 2 points (in total).

Note that you do not have to develop the code directly within the notebook
but may instead copy the comments and test cases to a more convenient development environment
and when everything works as expected, you may paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above).

## Load NumPy, pandas and time

In [82]:
import numpy as np
import pandas as pd
import time

In [83]:
from platform import python_version

print(f"Python version: {python_version()}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

Python version: 3.11.5
NumPy version: 1.26.3
Pandas version: 2.1.4


## Reused functions from Assignment 1

In [84]:
# Copy and paste functions from Assignment 1 here that you need for this assignment
def create_column_filter(df: pd.DataFrame) -> tuple[pd.DataFrame, list[str]]:
    """
    Creates a filtered dataframe by removing columns with only missing values or one unique value.
    Keeps CLASS and ID columns.
    """
    # Copy df
    df_copy = df.copy()

    # filter with CLASS and ID if they exist
    column_filter = [col for col in df.columns if col in ['CLASS', 'ID']]

    # Check each column
    for col in df.columns:
        if col not in ['CLASS', 'ID']:
            unique_vals = df[col].dropna().unique()
            if len(unique_vals) > 1:  # Keep if more than 1 unique non-null value
                column_filter.append(col)

    return df_copy[column_filter], column_filter

def apply_column_filter(df: pd.DataFrame, column_filter: list[str]) -> pd.DataFrame:
    """
    Applies a column filter to keep only specified columns.
    """
    return df.copy()[column_filter]

def minmax_normalize(
        col: pd.Series, min_val: float = None, max_val: float = None
) -> tuple[pd.Series, tuple[float, float]]:
    """
    Returns MinMax-normalized `col` and the min and max values, respectively, used for normalization.
    If values for min and/or max are provided, they are used, otherwise they are derived from `col`.
    """
    norm_col = col.copy()

    col_min = col.min() if min_val is None else min_val
    col_max = col.max() if max_val is None else max_val

    norm_col = (norm_col - col_min) / (col_max - col_min)

    return norm_col, (col_min, col_max)


def zscore_normalize(
        col: pd.Series, mean_val: float = None, std_val: float = None
) -> tuple[pd.Series, tuple[float, float]]:
    """
    Returns z-normalized `col` and the mean and standard deviation values, respectively, used for normalization.
    If values for mean and/or standard deviation are provided, they are used, otherwise they are dervied from `col`.
    """
    norm_col = col.copy()

    col_mean = col.mean() if mean_val is None else mean_val
    col_std = col.std() if std_val is None else std_val

    norm_col = (norm_col - col_mean) / col_std

    return norm_col, (col_mean, col_std)


def get_normalizer(normalizationtype: str):
    """
    Returns the normalizer function corresponding to the provided type.
    Accepted types are "minmax" and "zscore".
    """
    match normalizationtype:
        case "minmax":
            return minmax_normalize
        case "zscore":
            return zscore_normalize
        case _:
            raise Exception(f'Normalization type "{normalizationtype}" not supported.')


def create_normalization(
        df: pd.DataFrame, normalizationtype: str = "minmax"
) -> tuple[pd.DataFrame, dict[str, tuple[str, float, float]]]:
    """
    Normalizes `df`'s columns (excluding "CLASS" and "ID") with the normalization type provided.
    Returns the normalized dataframe and a dictionary associating each column with the normalization type and the parameters used by the corresponding normalizer.
    """
    new_df = df.copy()
    normalization = {}

    normalizer = get_normalizer(normalizationtype)

    columns = set(new_df.columns).difference({"CLASS", "ID"})
    for col in columns:
        norm_col, params = normalizer(new_df[col])
        new_df[col] = norm_col

        normalization[col] = tuple([normalizationtype] + [val for val in params])

    return new_df, normalization


def apply_normalization(
        df: pd.DataFrame, normalization: dict[str, tuple[str, float, float]]
) -> pd.DataFrame:
    """
    Normalizes `df`'s column (excluding "CLASS" and "ID") using the normalization type and parameters specified in `normalization`.
    """
    new_df = df.copy()
    columns = set(new_df.columns).difference({"CLASS", "ID"})

    for col in columns:
        col_dets = normalization[col]

        normalizer = get_normalizer(col_dets[0])

        norm_col, _ = normalizer(new_df[col], *col_dets[1:])

        new_df[col] = norm_col

    return new_df

def create_imputation(df: pd.DataFrame) -> tuple[pd.DataFrame, dict]:
    """
    Create imputation values and apply them to missing values in dataframe.
    """

    df_copy = df.copy()
    imputation = {}

    for col in df.columns:
        if col in ["CLASS", "ID"]:
            continue

        # Handle numeric columns
        if pd.api.types.is_numeric_dtype(df[col]):
            fill_value = df[col].mean()
            if pd.isna(fill_value):  # All values missing
                fill_value = 0

        # Handle categorical/object columns
        elif df[col].dtype == 'category':
            fill_value = (
                df[col].mode().iloc[0]
                if not df[col].mode().empty
                else df[col].cat.categories[0]
            )
        else:  # object type
            fill_value = df[col].mode().iloc[0] if not df[col].mode().empty else ""

        df_copy[col] = df_copy[col].fillna(fill_value)
        imputation[col] = fill_value

    return df_copy, imputation


def apply_imputation(df: pd.DataFrame, imputation: dict) -> pd.DataFrame:
    """
    Apply existing imputation values to missing values in dataframe.
    """
    df_copy = df.copy()

    for col, value in imputation.items():
        if col in df_copy.columns:
            df_copy[col] = df_copy[col].fillna(value)

    return df_copy

def create_one_hot(df: pd.DataFrame) -> tuple[pd.DataFrame, dict]:
    """
    Create one-hot encoding for categorical features.
    """

    df_copy = df.copy()
    one_hot = {}

    for col in df.columns:
        if col in ["CLASS", "ID"]:
            continue

        # Only process object or category columns
        if df[col].dtype not in ["object", "category"]:
            continue

        # Get unique categories
        categories = df[col].unique()
        one_hot[col] = categories

        # Create one-hot encoded columns
        for category in categories:
            new_col_name = f"{col}_{category}"
            df_copy[new_col_name] = (df[col] == category).astype(float)

        # Drop original column
        df_copy.drop(columns=[col], inplace=True)

    return df_copy, one_hot


def apply_one_hot(df: pd.DataFrame, one_hot: dict) -> pd.DataFrame:
    """
    Apply one-hot encoding using existing categories.
    """

    df_copy = df.copy()

    for col, categories in one_hot.items():
        if col not in df_copy.columns:
            continue

        # Create one-hot encoded columns
        for category in categories:
            new_col_name = f"{col}_{category}"
            df_copy[new_col_name] = (df[col] == category).astype(float)

        # Drop original column
        df_copy.drop(columns=[col], inplace=True)

    return df_copy

def create_bins(
        df: pd.DataFrame, nobins: int = 10, bintype: str = "equal-width"
) -> tuple[pd.DataFrame, dict]:
    """
    Create bins for numeric features and apply discretization.
    """

    df_copy = df.copy()
    binning = {}

    for col in df.columns:
        if col in ["CLASS", "ID"]:
            continue

        # Only process numeric columns
        if not np.issubdtype(df[col].dtype, np.number):
            continue

        # Create bins based on bintype
        if bintype == "equal-width":
            discretized, bins = pd.cut(df[col], bins=nobins, labels=False, retbins=True)
        else:  # equal-size
            discretized, bins = pd.qcut(
                df[col], q=nobins, labels=False, retbins=True, duplicates="drop"
            )

        # Adjust bin edges
        bins[0] = -np.inf
        bins[-1] = np.inf

        # Store bins and update column
        binning[col] = bins
        df_copy[col] = pd.Categorical(discretized, categories=range(nobins))

    return df_copy, binning


def apply_bins(df: pd.DataFrame, binning: dict) -> pd.DataFrame:
    """
    Apply existing bins to numeric features.
    """

    df_copy = df.copy()

    for col, bins in binning.items():
        if col not in df_copy.columns:
            continue

        nobins = len(bins) - 1  # number of bins is one less than number of thresholds
        discretized = pd.cut(df_copy[col], bins=bins, labels=False)
        df_copy[col] = pd.Categorical(discretized, categories=range(nobins))

    return df_copy

def brier_score(df:pd.DataFrame, correctlabels:list) -> float:
    squared_errors = []

    for i, label in enumerate(correctlabels):
        # Create the true vector (ideal prediction)
        true_vector = np.zeros(len(df.columns))
        true_vector[np.where(df.columns == label)[0][0]] = 1

        # Calculate the squared error for the current prediction
        prediction = df.iloc[i].values
        squared_error = np.sum((prediction - true_vector) ** 2)
        squared_errors.append(squared_error)

    brier_score = np.mean(squared_errors)
    return brier_score

def feature_auc(
        scores_performance: list[tuple[float, int, int]], tot_tp: int, tot_fp: int
) -> float:
    """
    Returns the area under the ROC curve for a specific feature.
    """
    auc_c = 0
    cov_tp = 0

    for s, tp_s, fp_s in scores_performance:

        if fp_s == 0:
            cov_tp += tp_s
        elif tp_s == 0:
            auc_c += (cov_tp / tot_tp) * (fp_s / tot_fp)
        else:
            auc_c += (cov_tp / tot_tp) * (fp_s / tot_fp) + (tp_s / tot_tp) * (
                    fp_s / tot_fp
            ) / 2

    return auc_c


def class_performance(
        class_scores: pd.Series, tp: np.ndarray
) -> dict:
    """
    Returns a dictionary associating the number of true and false positive to each score for a specific class.
    """
    scores_performance = {}
    unique_scores = class_scores.unique()

    for s in unique_scores:
        s_obs = (class_scores == s).astype(np.int32)

        tp_s = np.dot(s_obs, tp.astype(np.int32))
        fp_s = sum(s_obs) - tp_s

        scores_performance[s] = (tp_s, fp_s)

    return scores_performance


def auc(df: pd.DataFrame, correctlabels: list) -> float:
    """
    Returns the weighted area under the ROC curve of a dataframe of scores, given the correct labels.
    """
    assert df.shape[0] == len(
        correctlabels
    ), "the number of correct labels must equal the number of rows in df"

    class_counts = {cls: 0 for cls in df.columns}
    for label in correctlabels:
        class_counts[label] += 1

    auc = 0

    for cls in df.columns:
        tp = np.array(correctlabels) == cls
        fp = np.array(correctlabels) != cls

        scores_performance = class_performance(df[cls], tp)

        scores_performance = [
            (s, scores_performance[s][0], scores_performance[s][1])
            for s in sorted(scores_performance.keys(), reverse=True)
        ]

        auc += (
                class_counts[cls]
                / df.shape[0]
                * feature_auc(scores_performance, sum(tp), sum(fp))
        )

    return auc

def accuracy(df: pd.DataFrame, correctlabels:list) -> float:
    # Retrieving column names for max values, ties are resolved with first value
    predicted_labels = df.idxmax(axis=1)
    n_correct = sum(pred == correct for pred,correct in zip(predicted_labels, correctlabels))
    return n_correct / df.shape[0]


## 1. Define the class kNN

In [85]:
# Define the class kNN with three functions __init__, fit and predict (after the comments):
#
# Input to __init__:
# self - the object itself
#
# Output from __init__:
# <nothing>
#
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# column_filter, imputation, normalization, one_hot, labels, training_labels, training_data, training_time
#
# Input to fit:
# self              - the object itself
# df                - a dataframe (where the column names "CLASS" and "ID" have special meaning)
# normalizationtype - "minmax" (default) or "zscore"
#
# Output from fit:
# <nothing>
#
# The result of applying this function should be:
#
# self.column_filter   - a column filter (see Assignment 1) from df
# self.imputation      - an imputation mapping (see Assignment 1) from df
# self.normalization   - a normalization mapping (see Assignment 1), using normalizationtype from the imputed df
# self.one_hot         - a one-hot mapping (see Assignment 1)
# self.training_labels - a pandas series corresponding to the "CLASS" column, set to be of type "category"
# self.labels          - a list of the categories (class labels) of the previous series
# self.training_data   - the values (an ndarray) of the transformed dataframe, i.e., after employing imputation,
#                        normalization, and possibly one-hot encoding, and also after removing the "CLASS" and "ID" columns
#
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Input to predict:
# self - the object itself
# df   - a dataframe
# k    - an integer >= 1 (default = 5)
#
# Output from predict:
# predictions - a dataframe with class labels as column names and the rows corresponding to
#               predictions with estimated class probabilities for each row in df, where the class probabilities
#               are estimated by the relative class frequencies in the set of class labels from the k nearest
#               (with respect to Euclidean distance) neighbors in training_data
#
# Hint 1: Drop any "CLASS" and "ID" columns first and then apply column filtering, imputation, normalization and one-hot
#
# Hint 2: Get the numerical values (as an ndarray) from the resulting dataframe and iterate over the rows
#         calling some sub-function, e.g., get_nearest_neighbor_predictions(x_test,k), which for a test row
#         (numerical input feature values) finds the k nearest neighbors and calculate the class probabilities.
#
# Hint 3: This sub-function may first find the distances to all training instances, e.g., pairs consisting of
#         training instance index and distance, and then sort them according to distance, and then (using the indexes
#         of the k closest instances) find the corresponding labels and calculate the relative class frequencies
class kNN:
    def __init__(self):
        self.column_filter = None
        self.imputation = None
        self.normalization = None
        self.one_hot = None
        self.labels = None
        self.training_labels = None
        self.training_data = None
        self.training_time = None

    def fit(self, df: pd.DataFrame, normalizationtype: str = "minmax"):
        self.column_filter = create_column_filter(df)[1]
        self.imputation = create_imputation(df)[1]
        self.normalization = create_normalization(
            df, normalizationtype=normalizationtype
        )[1]
        self.one_hot = create_one_hot(df)[1]

        self.training_labels = df["CLASS"].astype("category")
        self.labels = self.training_labels.unique()
        self.training_data = apply_one_hot(
            apply_normalization(
                apply_imputation(df.drop(columns=["CLASS", "ID"]), self.imputation),
                self.normalization,
            ),
            self.one_hot,
        )

    def predict(self, df: pd.DataFrame, k: int) -> pd.DataFrame:
        preprocessed_data = (
            apply_one_hot(
                apply_normalization(
                    apply_imputation(
                        apply_column_filter(df, self.column_filter), self.imputation
                    ),
                    self.normalization,
                ),
                self.one_hot,
            )
            .drop(columns=["CLASS", "ID"])
            .select_dtypes(include=np.number)
        )

        predictions = []
        for _, row in preprocessed_data.iterrows():
            predictions.append(self.__get_single_prediction(row, k))
        return pd.DataFrame(predictions, columns=self.labels)

    def __euclidean_distance(self, a: np.ndarray, b: np.ndarray) -> float:
        distance = np.sqrt(np.sum((a - b) ** 2))
        return distance

    def __get_single_prediction(self, instance: pd.Series, k: int) -> pd.Series:
        numeric_data = self.training_data.select_dtypes(include=np.number)
        prediction = {col: 0 for col in self.labels}
        distances = []

        for (_, row), label in zip(numeric_data.iterrows(), self.training_labels):
            # Calculate the distance and create the tuple
            distance = self.__euclidean_distance(row.values, instance.values)
            distances.append((distance, label))

            # Sort by distance and retain only the closest k distances
            distances = sorted(distances, key=lambda x: x[0])[:k]

        for dist, label in distances:
            prediction[label] += 1

        prediction = pd.Series({col: pred / k for col, pred in prediction.items()})
        return prediction

In [86]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.csv")
glass_test_df = pd.read_csv("glass_test.csv")

knn_model = kNN()


t0 = time.perf_counter()
knn_model.fit(glass_train_df)
print("Training time: {0:.2f} s.".format(time.perf_counter()-t0))

test_labels = glass_test_df["CLASS"]
k_values = [1,3,5,7,9]
results = np.empty((len(k_values),3))

for i in range(len(k_values)):
    t0 = time.perf_counter()
    predictions = knn_model.predict(glass_test_df,k=k_values[i])
    print("Testing time (k={0}): {1:.2f} s.".format(k_values[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=k_values,columns=["Accuracy","Brier score","AUC"])

display("results",results)

Training time: 0.03 s.
Testing time (k=1): 1.15 s.
Testing time (k=3): 0.86 s.
Testing time (k=5): 0.97 s.
Testing time (k=7): 0.95 s.
Testing time (k=9): 1.00 s.


'results'

Unnamed: 0,Accuracy,Brier score,AUC
1,0.747664,0.504673,0.213714
3,0.616822,0.488058,0.203171
5,0.607477,0.474019,0.276589
7,0.635514,0.470723,0.249385
9,0.635514,0.483674,0.226532


In [87]:
train_labels = glass_train_df["CLASS"]
predictions = knn_model.predict(glass_train_df,k=1)
print("Accuracy on training set (k=1): {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set (k=1): {0:.4f}".format(auc(predictions,train_labels)))
print("Brier score on training set (k=1): {0:.4f}".format(brier_score(predictions,train_labels)))

Accuracy on training set (k=1): 1.0000
AUC on training set (k=1): 1.0000
Brier score on training set (k=1): 0.0000


### Comment on assumptions, things that do not work properly, etc.


## 2. Define the class NaiveBayes

In [88]:
# Define the class NaiveBayes with three functions __init__, fit and predict (after the comments):
#
# Input to __init__:
# self - the object itself
#
# Output from __init__:
# <nothing>
#
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# column_filter, binning, labels, class_priors, feature_class_value_counts, feature_class_counts
#
# Input to fit:
# self    - the object itself
# df      - a dataframe (where the column names "CLASS" and "ID" have special meaning)
# nobins  - no. of bins (default = 10)
# bintype - either "equal-width" (default) or "equal-size"
#
# Output from fit:
# <nothing>
#
# The result of applying this function should be:
#
# self.column_filter              - a column filter (see Assignment 1) from df
# self.binning                    - a discretization mapping (see Assignment 1) from df
# self.class_priors               - a mapping (dictionary) from the labels (categories) of the "CLASS" column of df,
#                                   to the relative frequencies of the labels
# self.labels                     - a list of the categories (class labels) of the "CLASS" column of df
# self.feature_class_value_counts - a mapping from the feature (column name) to the number of
#                                   training instances with a specific combination of (non-missing, categorical)
#                                   value for the feature and class label
# self.feature_class_counts       - a mapping from the feature (column name) to the number of
#                                   training instances with a specific class label and some (non-missing, categorical)
#                                   value for the feature
#
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Input to predict:
# self - the object itself
# df   - a dataframe
#
# Output from predict:
# predictions - a dataframe with class labels as column names and the rows corresponding to
#               predictions with estimated class probabilities for each row in df, where the class probabilities
#               are estimated by the naive approximation of Bayes rule (see lecture slides)
#
# Hint 1: First apply the column filter and discretization
#
# Hint 2: Iterating over either columns or rows, and for each possible class label, calculate the relative
#         frequency of the observed feature value given the class (using feature_class_value_counts and
#         feature_class_counts)
#
# Hint 3: Calculate the non-normalized estimated class probabilities by multiplying the class priors to the
#         product of the relative frequencies
#
# Hint 4: Normalize the probabilities by dividing by the sum of the non-normalized probabilities; in case
#         this sum is zero, then set the probabilities to the class priors
#
# Hint 5: To clarify the assignment text a little: self.feature_class_value_counts should be a mapping from
#         a column name (a specific feature) to another mapping, which given a class label and a value for
#         the feature, returns the number of training instances which have included this combination,
#         i.e., the number of training instances with both the specific class label and this value on the feature.
#
# Hint 6: As an additional hint, you may take a look at the slides from the NumPy and pandas lecture, to see how you
#         may use "groupby" in combination with "size" to get the counts for combinations of values from two columns.


class NaiveBayes:
    def __init__(self) -> None:
        self.column_filter: list[str] = None
        self.binning: dict[str, np.ndarray] = None
        self.labels: list[str] = None
        self.class_priors: dict[str, float] = None
        self.feature_class_value_counts: dict[str, pd.Series] = None
        self.feature_class_counts: dict[str, pd.Series] = None

    def fit(
        self, df: pd.DataFrame, nobins: int = 10, bintype: str = "equal-width"
    ) -> None:
        assert bintype in ["equal-width", "equal-size"]

        new_df, self.column_filter = create_column_filter(df)

        new_df, self.binning = create_bins(new_df, nobins=nobins, bintype=bintype)

        self.labels = new_df["CLASS"].unique().tolist()
        self.class_priors = new_df["CLASS"].value_counts(normalize=True).to_dict()

        self.feature_class_value_counts = {}
        self.feature_class_counts = {}

        for col in self.__get_feature_columns():
            self.feature_class_value_counts[col] = new_df.groupby(
                ["CLASS", col], observed=True
            ).size()
            self.feature_class_counts[col] = new_df.groupby("CLASS", observed=True)[
                col
            ].count()

    def __get_feature_columns(self) -> set[str]:
        return set(self.column_filter).difference({"ID", "CLASS"})

    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        new_df = apply_column_filter(df, self.column_filter)
        new_df = apply_bins(new_df, self.binning)

        predictions = []

        for _, row in new_df.iterrows():
            predictions.append(self.__get_class_probabilities(row))

        # Return predictions as a DataFrame
        return pd.DataFrame(predictions)

    def __get_class_probabilities(self, row: pd.Series) -> dict[str, float]:
        class_probabilities = {}

        for label in self.labels:
            # Start with class prior
            prob = self.class_priors[label]

            for col in self.__get_feature_columns():
                value: int = row[col]

                value_given_class_count: int = self.feature_class_value_counts[col].get(
                    (label, value), default=0
                )
                class_count: int = self.feature_class_counts[col][label]

                prob *= value_given_class_count / class_count if class_count > 0 else 1

            class_probabilities[label] = prob

        # Normalize probabilities
        prob_sum = sum(class_probabilities.values())
        if prob_sum > 0:
            for label in class_probabilities:
                class_probabilities[label] /= prob_sum
        else:
            # If sum is zero, fall back to class priors
            class_probabilities = self.class_priors

        return class_probabilities

In [89]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.csv")

glass_test_df = pd.read_csv("glass_test.csv")

nb_model = NaiveBayes()

test_labels = glass_test_df["CLASS"]

nobins_values = [3,5,10]
bintype_values = ["equal-width","equal-size"]
parameters = [(nobins,bintype) for nobins in nobins_values for bintype in bintype_values]

results = np.empty((len(parameters),3))

for i in range(len(parameters)):
    t0 = time.perf_counter()
    nb_model.fit(glass_train_df,nobins=parameters[i][0],bintype=parameters[i][1])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = nb_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=pd.MultiIndex.from_product([nobins_values,bintype_values]),
                       columns=["Accuracy","Brier score","AUC"])

print()
display("results",results)

Training time (3, 'equal-width'): 0.03 s.
Testing time (3, 'equal-width'): 0.30 s.
Training time (3, 'equal-size'): 0.03 s.
Testing time (3, 'equal-size'): 0.34 s.
Training time (5, 'equal-width'): 0.03 s.
Testing time (5, 'equal-width'): 0.38 s.
Training time (5, 'equal-size'): 0.03 s.
Testing time (5, 'equal-size'): 0.43 s.
Training time (10, 'equal-width'): 0.03 s.
Testing time (10, 'equal-width'): 0.38 s.
Training time (10, 'equal-size'): 0.03 s.
Testing time (10, 'equal-size'): 0.44 s.



'results'

Unnamed: 0,Unnamed: 1,Accuracy,Brier score,AUC
3,equal-width,0.616822,0.622116,0.291832
3,equal-size,0.607477,0.554782,0.666537
5,equal-width,0.64486,0.551101,0.382137
5,equal-size,0.598131,0.581556,0.69292
10,equal-width,0.654206,0.527569,0.564259
10,equal-size,0.588785,0.741668,0.4231


In [90]:
train_labels = glass_train_df["CLASS"]
nb_model.fit(glass_train_df)
predictions = nb_model.predict(glass_train_df)
print("Accuracy on training set: {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.4f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.4f}".format(brier_score(predictions,train_labels)))

Accuracy on training set: 0.8505
AUC on training set: 0.9015
Brier score on training set: 0.2263


### Comment on assumptions, things that do not work properly, etc.