# Assignment 2
## Group name: [ID2214 - Group 15]
### Project members: 
[Adalet Adiljan, adalat@kth.se]

### Declaration:
By submitting this assignment, it is hereby declared that all group members listed above have contributed to the solution, either with code that appears in the final solution below, or with code that has been evaluated and compared to the final solution, but for some reason has been excluded. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214/FID3214), no part of the solution has been provided by someone not listed as a project member above, and no part of the solution has been generated by a system.

It is furthermore declared that the submitted assignment will not be shared during the course, with any individual other than the group members listed above and teachers of the course ID2214/FID3214. In particular, the assignment will not be uploaded to any public repository. The submitted assignment can be shared after the course only if written consent has been provided by the course responsible of ID2214/FID3214.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy and pandas may be used in the solution for this assignment.

### Instructions
All parts of the assignment starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of all parts of the assignment starting 
with number 2 below, then the assignment will receive 2 points (in total).

Note that you do not have to develop the code directly within the notebook
but may instead copy the comments and test cases to a more convenient development environment
and when everything works as expected, you may paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above).

## Load NumPy, pandas and time

In [6]:
import numpy as np
import pandas as pd
import time

In [7]:
from platform import python_version

print(f"Python version: {python_version()}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

Python version: 3.12.2
NumPy version: 1.26.4
Pandas version: 2.2.2


## Reused functions from Assignment 1

In [8]:
# Copy and paste functions from Assignment 1 here that you need for this assignment
def create_column_filter(df):
    new_df = df.copy()
    for col in new_df:
        if col in ['CLASS','ID']:   # skip CLASS, ID
            continue
        if new_df[col].nunique() > 1: # skip if >1 unique value in col
            continue
        new_df = new_df.drop(columns=[col]) # drop col
    column_filter = new_df.columns
    return new_df, column_filter

def apply_column_filter(df,column_filter):
    new_df = df.copy()
    new_df = new_df[column_filter]
    return new_df


def create_normalization(df,normalizationtype='minmax'):
    if normalizationtype not in ['minmax','zscore']:
        raise ValueError('Not a valid normalization type')
        
    new_df = df.copy()
    newdict = {}
    for col in new_df:
        if col in ['CLASS','ID']: # skip CLASS, ID
            continue
        if normalizationtype == 'minmax':
            col_min = new_df[col].min() # avoid shadowing the built-in min/max
            col_max = new_df[col].max()
            new_df[col] = [(x-col_min)/(col_max-col_min) for x in new_df[col]] # minmax formula
            newdict[col] = (normalizationtype,col_min,col_max)
        else:
            mean = new_df[col].mean()
            std = new_df[col].std()
            new_df[col] = new_df[col].apply(lambda x: (x-mean)/std) # zscore formula
            newdict[col] = (normalizationtype,mean,std)
    return new_df, newdict

def apply_normalization(df,normalization):
    new_df = df.copy()
    for key, value in normalization.items():
        if key in ['CLASS','ID']:
            continue
        
        if value[0] == 'minmax':
            new_df[key] = [(x-value[1])/(value[2]-value[1]) for x in new_df[key]]
            
        if value[0] == 'zscore':
            new_df[key] = new_df[key].apply(lambda x: (x-value[1])/value[2])
    return new_df


def create_imputation(df):
    new_df = df.copy()
    imputation = {}
    for col in new_df:
        if col in ['CLASS','ID']:
            continue
        
        if new_df[col].dtype in ['float64','int64']:
            if new_df[col].nunique() == 0: # if col is completely empty
                mean = 0 # hint 4
            else:
                mean = new_df[col].mean()
            
            new_df[col] = new_df[col].fillna(mean) # replace NaN with mean
            imputation[col] = mean
            
        if new_df[col].dtype in ['object','category']:
            if new_df[col].dtype == 'object' and new_df[col].nunique() == 0: # if col is object and completely empty
                new_df[col] = "" # hint 4
                
            if new_df[col].dtype == 'category' and new_df[col].nunique() == 0: # if col is category and completely empty
                new_df[col] = new_df[col].fillna(new_df[col].cat.categories[0]) # hint 4
                
            new_df[col] = new_df[col].fillna(new_df[col].mode()[0]) # replace NaN with mode
            imputation[col] = new_df[col].mode()[0]
    
    return new_df,imputation


def apply_imputation(df,imputation):
    new_df = df.copy()
    for key, value in imputation.items():
        if key in ['CLASS','ID']:
            continue
        
        new_df[key] = new_df[key].fillna(value)
    return new_df


def create_bins(df,nobins=10,bintype='equal-width'):
    if bintype not in ['equal-width','equal-size']:
        raise ValueError('Not a valid bintype')
        
    new_df = df.copy()
    binning = {}
    
    for col in new_df:
        if col in ['CLASS','ID']:
            continue
        
        if new_df[col].dtype in ['float64','int64']:
            
            if bintype == 'equal-width':
                res, bins = pd.cut(new_df[col],nobins,labels=False,retbins=True)
            
            if bintype == 'equal-size':
                res, bins = pd.qcut(new_df[col],nobins,labels=False,retbins=True,duplicates='drop') # drop because Bin edges must be unique (source, pandas docs)
                
            # hint 6
            bins[0] = -np.inf
            bins[-1] = np.inf
            
            new_df[col] = res # set the column to bin category/index
            binning[col] = bins # dict
        
    for col in new_df:
        new_df[col] = new_df[col].astype('category') # hint 4
        
        if col in binning:
            new_df[col] = new_df[col].cat.set_categories(range(nobins)) # hint 5
            
    return new_df,binning

def apply_bins(df,binning):
    new_df = df.copy()
    
    for col in new_df:
        if col in binning:
            new_res = pd.cut(new_df[col],binning[col],labels=False)
            
            new_df[col] = new_res
            
            new_df[col] = new_df[col].astype('category') # hint 4
            
            new_df[col] = new_df[col].cat.set_categories(range(len(binning[col])-1)) # hint 5
  
    return new_df


def create_one_hot(df):
    new_df = df.copy()
    one_hot = {}
    
    for col in new_df:
        if col in ['CLASS','ID']:
            continue
        
        if new_df[col].dtype in ['object','category']:
            categories = sorted(new_df[col].unique()) # save all unique "categories" (values)
                                                      # sorted just to have them in same order as provided output
            for val in categories:
                new_df[col+'-'+val] = (new_df[col] == val).astype(float) # set new values to float
                                                                         # True translates to 1
                                                                         # False translates to 0
            one_hot[col] = categories
            new_df.drop(columns=col,inplace=True) # drop original
    return new_df,one_hot # return the encoded copy, not the original df

def apply_one_hot(df,one_hot):
    new_df = df.copy()
    for key, value in one_hot.items():
        if key in ['CLASS','ID']:
            continue
        
        for val in value:
            new_df[key+'-'+val] = (new_df[key] == val).astype(float) 
        
        new_df.drop(columns=key,inplace=True)
    return new_df

def accuracy(df,correctlabels):
    if len(correctlabels) != len(df):
        raise ValueError('#correctlabels and #rows do not match')
    
    col_index = np.apply_along_axis(lambda x:np.argmax(x),1,df) # get maxindex from row on column axis
    actual = []
    for x in col_index: 
        actual.append(df.columns[x]) # get the column name from index
        
    true_positives = 0
    for i, label in enumerate(actual):
        if label == correctlabels[i]:
            true_positives += 1
    
    accuracy = true_positives / len(correctlabels)
    return accuracy

def brier_score(predictions,correctlabels):
    if len(predictions) != len(correctlabels):
        raise ValueError("Both arguments must have the same length.")
        
    square_error_sum = []
    
    for i, label in enumerate(correctlabels):
        predicted_vector = predictions.iloc[i].values # = values in row i
        
        correctlabel_index = np.where(predictions.columns==label)[0] # column index matching current label
        
        cli_vector = np.zeros(len(predictions.columns)) # correct label index vector
        cli_vector[correctlabel_index] = 1 # index for current label = 1, effectively creating a one-hot
        
        square_error = sum((predicted_vector - cli_vector)**2)
        
        square_error_sum.append(square_error)
    
    brier_score = np.mean(square_error_sum)
    return brier_score


def auc(df,correctlabels):
    predictions = df.copy()
    
    col_index = np.apply_along_axis(lambda x:np.argmax(x),1,predictions) # get maxindex from row on column axis
    
    c_labels = [] # list that will become binary tp / fp
    score_cols = predictions.columns # saving the col names with scores
    
    for x in col_index: 
       c_labels.append(predictions.columns[x]) # getting the col name based on col index
    predictions['Positives'] = (pd.Series(c_labels) == correctlabels).astype(int)
    
    class_counts = pd.Series(correctlabels).value_counts(normalize=True) # count the frequency (no. of "A" and "B")
                                                                         # divided by tot no. of "correctlabels"
                                                                         # in this example:
                                                                         # A - 0.5 (2 A, divided by 4 tot)
                                                                         # B - 0.5 (2 B, divided by 4 tot)
    auc_total = 0
    for i in range(len(score_cols)): # create a new df for each class that contains col='score','tp','fp'
        current_class = score_cols[i]
        
        new_df = pd.DataFrame()
        new_df['score'] = predictions[current_class]
        new_df['tp'] = (pd.Series(correctlabels) == current_class).astype(int) # 1 where label matches, 0 where it doesn't
        new_df['fp'] = (pd.Series(correctlabels) != current_class).astype(int)
        
        #new_df = pd.concat([new_df, temp_df], ignore_index=True) # concat to a new_df
        
        new_df = new_df.groupby('score', as_index=False).sum()
        new_df = new_df.sort_values(by='score', ascending=False) # sort in decending order
        new_df = new_df.reset_index(drop=True)
        
        auc = 0
        cov_tp = 0
        tot_tp = new_df['tp'].sum()
        tot_fp = new_df['fp'].sum()
        for j in range(len(new_df)): # lecture formula for auc using the new df created
            if new_df['fp'][j] == 0:
                cov_tp += new_df['tp'][j]
            elif new_df['tp'][j] == 0:
                auc += (cov_tp/tot_tp)*(new_df['fp'][j]/tot_fp)
            else:
                auc += ((cov_tp/tot_tp)*(new_df['fp'][j]/tot_fp)+
                        (new_df['tp'][j]/tot_tp)*(new_df['fp'][j]/tot_fp)/2)
                cov_tp += new_df['tp'][j]
        
        auc_total += auc * class_counts[current_class] # auc * freq for current class, 
                                                       # summed by looping and doin it for all classes
    
    return auc_total
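
As a cross-check for `brier_score`, the same quantity can be computed without an explicit loop. The sketch below is only an illustration, assuming the predictions are given as a plain probability matrix together with the class names in column order. The name `brier_score_vectorized` and the toy values are hypothetical and not part of the assignment.

```python
import numpy as np

def brier_score_vectorized(probabilities, class_names, correct_labels):
    """Mean over rows of the squared distance to the one-hot target vector."""
    probabilities = np.asarray(probabilities, dtype=float)
    class_names = np.asarray(class_names)
    correct_labels = np.asarray(correct_labels)
    # one_hot[i, j] is 1.0 when the true label of row i equals class j, else 0.0
    one_hot = (correct_labels[:, None] == class_names[None, :]).astype(float)
    return float(((probabilities - one_hot) ** 2).sum(axis=1).mean())

# Hypothetical predictions for three rows and two classes
probabilities = [[0.9, 0.1], [0.4, 0.6], [0.5, 0.5]]
score = brier_score_vectorized(probabilities, ["A", "B"], ["A", "B", "A"])
print(round(score, 4))
```

The per-row squared errors here are 0.02, 0.32 and 0.50, so the mean is 0.28.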


## 1. Define the class kNN

In [63]:
# Define the class kNN with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self - the object itself
#
# Output from __init__:
# <nothing>
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# column_filter, imputation, normalization, one_hot, labels, training_labels, training_data, training_time
class kNN:
    def __init__(self): # Initialize all attributes to None
        self.column_filter = None
        self.imputation = None
        self.normalization = None
        self.one_hot = None
        self.labels = None
        self.training_labels = None
        self.training_data = None
        self.training_time = None
# Input to fit:
# self              - the object itself
# df                - a dataframe (where the column names "CLASS" and "ID" have special meaning)
# normalizationtype - "minmax" (default) or "zscore"
#
# Output from fit:
# <nothing>
#
# The result of applying this function should be:
#
# self.column_filter   - a column filter (see Assignment 1) from df
# self.imputation      - an imputation mapping (see Assignment 1) from df
# self.normalization   - a normalization mapping (see Assignment 1), using normalizationtype from the imputed df
# self.one_hot         - a one-hot mapping (see Assignment 1)
# self.training_labels - a pandas series corresponding to the "CLASS" column, set to be of type "category" 
# self.labels          - a list of the categories (class labels) of the previous series
# self.training_data   - the values (an ndarray) of the transformed dataframe, i.e., after employing imputation, 
#                        normalization, and possibly one-hot encoding, and also after removing the "CLASS" and "ID" columns
    
    def fit(self, df, normalizationtype="minmax"):
        # Each step works on the output of the previous one, starting from the filtered dataframe
        filtered, self.column_filter = create_column_filter(df)
        imputed, self.imputation = create_imputation(filtered)
        normalized, self.normalization = create_normalization(imputed, normalizationtype)
        final_result_df, self.one_hot = create_one_hot(normalized)
        self.training_labels = final_result_df["CLASS"].astype("category")
        self.labels = self.training_labels.cat.categories.tolist()
        # Store the feature values as an ndarray, without the "CLASS" and "ID" columns
        self.training_data = final_result_df.drop(columns=["CLASS", "ID"], errors="ignore").values

# Input to predict:
# self - the object itself
# df   - a dataframe
# k    - an integer >= 1 (default = 5)
# 
# Output from predict:
# predictions - a dataframe with class labels as column names and the rows corresponding to
#               predictions with estimated class probabilities for each row in df, where the class probabilities
#               are estimated by the relative class frequencies in the set of class labels from the k nearest 
#               (with respect to Euclidean distance) neighbors in training_data
#
# Hint 1: Drop any "CLASS" and "ID" columns first and then apply column filtering, imputation, normalization and one-hot
#
# Hint 2: Get the numerical values (as an ndarray) from the resulting dataframe and iterate over the rows 
#         calling some sub-function, e.g., get_nearest_neighbor_predictions(x_test,k), which for a test row
#         (numerical input feature values) finds the k nearest neighbors and calculate the class probabilities.
#
    def predict(self, df, k=5):
        # Keep only the training columns that are present; "CLASS" and "ID" may be absent at prediction time
        columns = [col for col in self.column_filter if col in df.columns]
        filtered = apply_column_filter(df, columns)
        imputed = apply_imputation(filtered, self.imputation)
        normalized = apply_normalization(imputed, self.normalization)
        final_result_df = apply_one_hot(normalized, self.one_hot)
        
        # Drop the special columns (if present) and get the feature values as an ndarray
        predicted_values = final_result_df.drop(columns=["CLASS", "ID"], errors="ignore").values
        
        # Estimate the class probabilities for each test instance
        predictions = []
        for row in predicted_values:
            predictions.append(self.get_nearest_neighbor_predictions(row, k))
        # Return the predictions as a dataframe with one column per class label
        predictions = pd.DataFrame(predictions, columns=self.labels)
        # Ensure that there is exactly one prediction per input row
        assert predictions.shape[0] == df.shape[0], (
            f"Mismatch: Predictions have {predictions.shape[0]} rows, "
            f"but test data has {df.shape[0]} rows"
        )
        return predictions
        
# Hint 3: This sub-function may first find the distances to all training instances, e.g., pairs consisting of
#         training instance index and distance, and then sort them according to distance, and then (using the indexes
#         of the k closest instances) find the corresponding labels and calculate the relative class frequencies
    
    def get_nearest_neighbor_predictions(self, x, k):
        distances = np.sqrt(((self.training_data - x) ** 2).sum(axis=1))
        sorted_distances = np.argsort(distances)
        
        nearest_indices = sorted_distances[:k]
        nearest_neighbors = self.training_labels.iloc[nearest_indices]
        each_label_predictions = nearest_neighbors.value_counts(normalize=True).reindex(self.labels, fill_value=0)
        assert np.isclose(each_label_predictions.sum(), 1.0) # Ensuring that the values for all labels are normalized
        return each_label_predictions.values
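
The neighbor search in `get_nearest_neighbor_predictions` can be illustrated on a toy dataset. This is a minimal standalone sketch, not the class above: the function name `knn_class_probabilities`, its arguments, and the four training points are made up for illustration. A stable sort is used here so that ties between equally distant neighbors are broken by row order, which is an assumption rather than something the assignment prescribes.

```python
import numpy as np

def knn_class_probabilities(train_x, train_y, labels, x, k):
    """Relative class frequencies among the k training rows closest to x."""
    train_x = np.asarray(train_x, dtype=float)
    train_y = np.asarray(train_y)
    # Euclidean distance from x to every training row
    distances = np.sqrt(((train_x - np.asarray(x, dtype=float)) ** 2).sum(axis=1))
    # Indices of the k smallest distances (stable sort: ties keep row order)
    nearest = np.argsort(distances, kind="stable")[:k]
    return [float(np.mean(train_y[nearest] == label)) for label in labels]

# Hypothetical training data: two points near the origin (A), two near (1, 1) (B)
train_x = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
train_y = ["A", "A", "B", "B"]
probs = knn_class_probabilities(train_x, train_y, ["A", "B"], x=[0.2, 0.1], k=3)
print(probs)
```

With k=3 the query point has two "A" neighbors and one "B" neighbor, so the estimated probabilities are 2/3 and 1/3.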

In [64]:
glass_train_df = pd.read_csv("glass_train.csv")

glass_test_df = pd.read_csv("glass_test.csv")

knn_model = kNN() # initializing the kNN model: where we handle fitting and prediction.

t0 = time.perf_counter() 
knn_model.fit(glass_train_df) #fit model trains the kNN model on the given dataset. 
print("Training time: {0:.2f} s.".format(time.perf_counter()-t0)) #Time taken for the training.

test_labels = glass_test_df["CLASS"] # true labels are extracted from CLASS column, from test data.

k_values = [1,3,5,7,9]  # Defined for evaluation
results = np.empty((len(k_values),3))

for i in range(len(k_values)): 
    t0 = time.perf_counter()
    predictions = knn_model.predict(glass_test_df,k=k_values[i])
    print("Testing time (k={0}): {1:.2f} s.".format(k_values[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=k_values,columns=["Accuracy","Brier score","AUC"])

print()
display("results",results)

Training time: 0.01 s.
Testing time (k=1): 0.12 s.
Testing time (k=3): 0.09 s.
Testing time (k=5): 0.09 s.
Testing time (k=7): 0.09 s.
Testing time (k=9): 0.08 s.



'results'

| k | Accuracy | Brier score | AUC |
|---|----------|-------------|-----|
| 1 | 0.747664 | 0.504673 | 0.81035 |
| 3 | 0.663551 | 0.488058 | 0.815859 |
| 5 | 0.579439 | 0.474019 | 0.833805 |
| 7 | 0.598131 | 0.470723 | 0.834465 |
| 9 | 0.616822 | 0.483674 | 0.828734 |


In [60]:
train_labels = glass_train_df["CLASS"]
predictions = knn_model.predict(glass_train_df,k=1)
print("Accuracy on training set (k=1): {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set (k=1): {0:.4f}".format(auc(predictions,train_labels)))
print("Brier score on training set (k=1): {0:.4f}".format(brier_score(predictions,train_labels)))

Accuracy on training set (k=1): 1.0000
AUC on training set (k=1): 1.0000
Brier score on training set (k=1): 0.0000
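
The perfect training-set scores for k=1 are what one would expect as long as no two training rows have identical feature values but different labels: every training row is at distance zero from itself, so it is its own nearest neighbor. A small sketch with made-up points (not the glass data) illustrates this.

```python
import numpy as np

# Hypothetical training features, all rows distinct
train_x = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])

# Pairwise Euclidean distances between all training rows (n x n matrix)
differences = train_x[:, None, :] - train_x[None, :, :]
distances = np.sqrt((differences ** 2).sum(axis=2))

# For every row, the index of its closest training row
nearest = distances.argmin(axis=1)
print(nearest)
```

Each index in `nearest` equals its own row position, so with k=1 every training instance is predicted with its own label.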


### Comment on assumptions, things that do not work properly, etc.


## 2. Define the class NaiveBayes

In [70]:
# Define the class NaiveBayes with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self - the object itself
#
# Output from __init__:
# <nothing>
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# column_filter, binning, labels, class_priors, feature_class_value_counts, feature_class_counts
#
import numpy as np
import pandas as pd
import time

class NaiveBayes:
    def __init__(self):
        self.column_filter = None
        self.binning = None # discretization mapping 
        self.labels = None 
        self.class_priors = None
        self.feature_class_value_counts = None
        self.feature_class_counts = None

# Input to fit:
# self    - the object itself
# df      - a dataframe (where the column names "CLASS" and "ID" have special meaning)
# nobins  - no. of bins (default = 10)
# bintype - either "equal-width" (default) or "equal-size" 
#
# Output from fit:
# <nothing>
    def fit(self, df, nobins=10, bintype="equal-width"):
        # Discretize the filtered dataframe so that fit and predict use the same columns
        filtered, self.column_filter = create_column_filter(df)
        binned, self.binning = create_bins(filtered, nobins=nobins, bintype=bintype)
        
        number_of_classes = binned["CLASS"].value_counts(normalize=True)
        self.class_priors = number_of_classes.to_dict()
        self.labels = number_of_classes.index.tolist()
        self.feature_class_value_counts = {}
        self.feature_class_counts = {}
        
        for feature in binned.columns:
            if feature not in ["CLASS", "ID"]:
                
                feature_class_count = binned.groupby(["CLASS", feature], observed=True).size()
                self.feature_class_value_counts[feature] = feature_class_count.to_dict()
                
                totalFeatures_class = binned.groupby("CLASS", observed=True)[feature].count()
                self.feature_class_counts[feature] = totalFeatures_class.to_dict()
# Input to predict:
# self - the object itself
# df   - a dataframe
# 
# Output from predict:
# predictions - a dataframe with class labels as column names and the rows corresponding to
#               predictions with estimated class probabilities for each row in df, where the class probabilities
#               are estimated by the naive approximation of Bayes rule (see lecture slides)
#
# Hint 1: First apply the column filter and discretization
#
# Hint 2: Iterating over either columns or rows, and for each possible class label, calculate the relative
#         frequency of the observed feature value given the class (using feature_class_value_counts and 
#         feature_class_counts) 
#
# Hint 3: Calculate the non-normalized estimated class probabilities by multiplying the class priors to the
#         product of the relative frequencies
#
# Hint 4: Normalize the probabilities by dividing by the sum of the non-normalized probabilities; in case
#         this sum is zero, then set the probabilities to the class priors
#
# Hint 5: To clarify the assignment text a little: self.feature_class_value_counts should be a mapping from 
#         a column name (a specific feature) to another mapping, which given a class label and a value for 
#         the feature, returns the number of training instances which have included this combination, 
#         i.e., the number of training instances with both the specific class label and this value on the feature.
#
# Hint 6: As an additional hint, you may take a look at the slides from the NumPy and pandas lecture, to see how you 
#         may use "groupby" in combination with "size" to get the counts for combinations of values from two columns.
   
    def predict(self, df):
        filtrated = apply_column_filter(df, self.column_filter)
        binned = apply_bins(filtrated, self.binning)
        predictions = []
        for _, row in binned.iterrows(): 
            class_probs = self.calculate_probabilities(row)
            predictions.append(class_probs)
        
        predictions_df = pd.DataFrame(predictions, columns=self.labels)
        return predictions_df # Returning the result as a df 

# helper function to calculate probability for each class:
    def calculate_probabilities(self, row):
        probabilities = {}
        for label in self.labels:
            prob = self.class_priors[label]  # Start with class prior
            for feature, value in row.items():
                if feature in self.feature_class_value_counts:
                    feature_value_count = self.feature_class_value_counts[feature].get((label, value), 0)
                    feature_class_total = self.feature_class_counts[feature].get(label, 1) 
                   
                    # Calculate the relative frequency
                    prob *= feature_value_count / feature_class_total  
            probabilities[label] = prob

        # Normalize; if all non-normalized probabilities are zero, fall back to the class priors (Hint 4)
        total_prob = sum(probabilities.values())
        if total_prob == 0:
            probabilities = {label: self.class_priors[label] for label in self.labels}
        else:
            probabilities = {label: prob / total_prob for label, prob in probabilities.items()}
            
        return list(probabilities.values())
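
The counting scheme behind `fit` and `calculate_probabilities` can be traced by hand on a tiny categorical dataset. The sketch below is a simplified standalone illustration using only the standard library. The function names, the toy weather data, and the use of plain class counts as denominators (which assumes no missing values) are assumptions made for this example and differ in detail from the class above.

```python
from collections import Counter

def fit_counts(rows, labels):
    """Count classes and (feature, class, value) combinations."""
    class_counts = Counter(labels)
    value_counts = Counter()
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            value_counts[(feature, label, value)] += 1
    return class_counts, value_counts

def predict_probabilities(row, class_counts, value_counts):
    """Naive Bayes class probabilities for one row, without smoothing."""
    total = sum(class_counts.values())
    priors = {label: count / total for label, count in class_counts.items()}
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature, value in row.items():
            # Relative frequency of this value among training rows of this class
            score *= value_counts[(feature, label, value)] / class_counts[label]
        scores[label] = score
    score_sum = sum(scores.values())
    if score_sum == 0:
        return priors  # no class is compatible with the row: use the priors
    return {label: score / score_sum for label, score in scores.items()}

# Hypothetical training data: two "no" rows and three "yes" rows
rows = [
    {"outlook": "sunny", "wind": "weak"},
    {"outlook": "sunny", "wind": "strong"},
    {"outlook": "rainy", "wind": "weak"},
    {"outlook": "sunny", "wind": "weak"},
    {"outlook": "sunny", "wind": "weak"},
]
labels = ["no", "no", "yes", "yes", "yes"]
class_counts, value_counts = fit_counts(rows, labels)

seen = predict_probabilities({"outlook": "sunny", "wind": "weak"}, class_counts, value_counts)
unseen = predict_probabilities({"outlook": "rainy", "wind": "strong"}, class_counts, value_counts)
print(seen)
print(unseen)
```

For the first row the non-normalized scores are 0.4 · 1 · 1/2 = 0.2 for "no" and 0.6 · 2/3 · 1 = 0.4 for "yes", giving 1/3 and 2/3 after normalization. The second row combines values never seen together with either class, so both products are zero and the priors 0.4 and 0.6 are returned.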


In [71]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.csv")

glass_test_df = pd.read_csv("glass_test.csv")

nb_model = NaiveBayes()

test_labels = glass_test_df["CLASS"]

nobins_values = [3,5,10]
bintype_values = ["equal-width","equal-size"]
parameters = [(nobins,bintype) for nobins in nobins_values for bintype in bintype_values]

results = np.empty((len(parameters),3))

for i in range(len(parameters)):
    t0 = time.perf_counter()
    nb_model.fit(glass_train_df,nobins=parameters[i][0],bintype=parameters[i][1])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = nb_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=pd.MultiIndex.from_product([nobins_values,bintype_values]),
                       columns=["Accuracy","Brier score","AUC"])

print()
display("results",results)

Training time (3, 'equal-width'): 0.03 s.
Testing time (3, 'equal-width'): 0.01 s.
Training time (3, 'equal-size'): 0.01 s.
Testing time (3, 'equal-size'): 0.01 s.
Training time (5, 'equal-width'): 0.01 s.
Testing time (5, 'equal-width'): 0.01 s.
Training time (5, 'equal-size'): 0.01 s.
Testing time (5, 'equal-size'): 0.01 s.
Training time (10, 'equal-width'): 0.01 s.
Testing time (10, 'equal-width'): 0.01 s.
Training time (10, 'equal-size'): 0.01 s.
Testing time (10, 'equal-size'): 0.01 s.



'results'

| nobins | bintype | Accuracy | Brier score | AUC |
|--------|---------|----------|-------------|-----|
| 3 | equal-width | 0.616822 | 0.622116 | 0.724335 |
| 3 | equal-size | 0.607477 | 0.554782 | 0.780163 |
| 5 | equal-width | 0.64486 | 0.551101 | 0.771688 |
| 5 | equal-size | 0.598131 | 0.581556 | 0.796675 |
| 10 | equal-width | 0.654206 | 0.527569 | 0.812887 |
| 10 | equal-size | 0.588785 | 0.741668 | 0.751165 |
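
The two `bintype` settings compared above place the bin edges differently. The sketch below illustrates the difference on made-up values with one outlier. It uses `np.linspace`, `np.quantile` and `np.digitize` as stand-ins for the `pd.cut` and `pd.qcut` calls in `create_bins`, so it is an approximation of that code rather than a copy of it, and the helper name `bin_counts` is hypothetical.

```python
import numpy as np

# Hypothetical feature values with one large outlier
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
nobins = 3

# Equal-width: edges split the value range into intervals of the same length
width_edges = np.linspace(values.min(), values.max(), nobins + 1)
# Equal-size: edges are quantiles, so each bin receives a similar number of rows
size_edges = np.quantile(values, np.linspace(0, 1, nobins + 1))

def bin_counts(edges):
    """Number of values per bin, treating the first and last bins as open-ended."""
    # Only the interior edges are needed; intervals are closed on the right
    indices = np.digitize(values, edges[1:-1], right=True)
    return np.bincount(indices, minlength=nobins).tolist()

print(bin_counts(width_edges))
print(bin_counts(size_edges))
```

With equal-width edges the outlier stretches the range, so five of the six values land in the first bin and the middle bin stays empty, while the quantile-based edges put two values in each bin.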


In [67]:
train_labels = glass_train_df["CLASS"]
nb_model.fit(glass_train_df)
predictions = nb_model.predict(glass_train_df)
print("Accuracy on training set: {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.4f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.4f}".format(brier_score(predictions,train_labels)))

Accuracy on training set: 0.8505
AUC on training set: 0.9687
Brier score on training set: 0.2263


### Comment on assumptions, things that do not work properly, etc.

In [73]:
# I'm not sure whether rows 5-9 of the results in the first part of the assignment match the expected values in the HTML version.
# I tried to address this by ensuring that the probabilities in each row sum to 1, and by checking for any misalignment between the labels
# and the probabilities calculated for each class and row. Other than that, I found these exercises fun and interesting to work through!