# ID2214/FID3214 Assignment 2 Group no. 5
### Project members: 
[Piriya Sureshkumar, piriya@ug.kth.se]
[Pablo Laso, plaso@kth.se]
[Lucas Trouessin, lucastr@kth.se]

### Declaration:
By submitting this solution, it is hereby declared that all individuals listed above have contributed to the solution, either with code that appears in the final solution below, or with code that has been evaluated and compared to the final solution but, for some reason, has been excluded. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214/FID3214) and no part of the solution has been provided by someone not listed as project member above.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy, pandas and time may be used in the solution for this assignment.

### Instructions
All parts of the assignment starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of all parts of the assignment starting 
with number 2 below, then the assignment will receive 2 points (in total).

Note that you do not have to develop the code directly within the notebook
but may instead copy the comments and test cases to a more convenient development environment
and when everything works as expected, you may paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above).

## Load NumPy, pandas and time

In [1]:
import numpy as np
import pandas as pd
import time

## Reused functions from Assignment 1

In [2]:
# Copy and paste functions from Assignment 1 here that you need for this assignment

def create_column_filter(df):
    output_df=df.copy()
    columns=[col for col in output_df.columns if col not in ['CLASS','ID']]
    
    #Calculate columns to drop
    drop = []
    for col in columns:
        if len(output_df[col].dropna().unique()) < 2:
            drop.append(col)
    
    output_df.drop(drop, axis=1, inplace=True)
    column_filter = output_df.columns
    
    return output_df,column_filter

def apply_column_filter(df,column_filter):
    output_df=df.copy()
    output_df=output_df[output_df.columns.intersection(column_filter)]
    return output_df

def create_imputation(df):
    
    output_df=df.copy()
    columns=output_df.columns
    columns=[col for col in columns if col not in ['CLASS','ID']]
    types=output_df.dtypes
    imputation={}
    
    for col in columns:

        if types[col].name in ['int64','float64']:
            
            # Checking if it is only nan
            if output_df[col].isnull().all():
                imputation[col]=0
            else :
                imputation[col]=output_df[col].mean()

        if types[col].name in ['object','category']:
            
            # Checking if it is only nan
            if output_df[col].isnull().all():
                imputation[col]=''
            else :
                imputation[col]=output_df[col].mode()[0]
                    
    # Apply changes
    output_df.fillna(imputation,inplace=True)
    
    return output_df,imputation

def apply_imputation(df,imputation):
    output_df=df.copy()
    output_df.fillna(value=imputation,inplace=True)
    return output_df

def create_normalization(df,normalizationtype='minmax'):
    
    output_df=df.copy()
    types=output_df.dtypes
    columns = [col for col in output_df.columns if types[col].name in ['int64','float64'] and col not in ['CLASS','ID']]
    
    normalization={}
    
    if normalizationtype=='minmax':
    
        for col in columns:
            min_value=output_df[col].min()
            max_value=output_df[col].max()
            normalization[col]=(normalizationtype,min_value,max_value)

            if min_value != max_value :    #to avoid division by zero
                output_df[col]=output_df[col].apply(lambda x: (x-min_value)/(max_value-min_value))

    elif normalizationtype=='zscore':
        
        for col in columns:
            mean = df[col].mean()
            std = df[col].std()
            normalization[col]=(normalizationtype,mean,std)

            if std != 0 :                  #to avoid division by zero
                output_df[col] = df[col].apply(lambda x: (x-mean)/std)
                    
    return output_df, normalization

def apply_normalization(df,normalization):
    output_df=df.copy()
    
    for col in normalization:
        normalization_type,a,b=normalization[col]
        
        # Avoid division by zero
        if normalization_type=='minmax' and a!=b :
            output_df[col]=output_df[col].apply(lambda x: (x-a)/(b-a))
            
            # Limit to [0,1] (assign the result back; an in-place clip on a selected column may not update the dataframe)
            output_df[col]=output_df[col].clip(0,1)
            
        if normalization_type=='zscore' and b!=0 :
            output_df[col] = output_df[col].apply(lambda x: (x-a)/b)
            
    return output_df

def create_one_hot(df):
    
    output_df=df.copy()
    types=output_df.dtypes
    columns = [col for col in output_df.columns if types[col].name in ['category','object'] and col not in ['CLASS','ID']]
    
    # Create mapping :
    one_hot={}
    for col in columns :
        one_hot[col]=output_df[col].astype('category').cat.categories
    
        # Apply transformation
        for category in one_hot[col]:
            col_name=str(col)+'-'+str(category)
            output_df[col_name]=(output_df[col]==category).astype(float)
        
        # Remove the column
        output_df.drop(columns=col,inplace=True)
    
    return output_df,one_hot

def apply_one_hot(df,one_hot):
    output_df=df.copy()
    types=output_df.dtypes
    columns = [col for col in output_df.columns if types[col].name in ['category','object'] and col not in ['CLASS','ID']]
    
    for col in columns :
        
        # Apply transformation
        for category in one_hot[col]:
            col_name=str(col)+'-'+str(category)
            output_df[col_name]=(output_df[col]==category).astype(float)
        
        # Remove the column
        output_df.drop(columns=col,inplace=True)
        
    return output_df

def accuracy(df,correctlabels):
    n=len(correctlabels)
    compare=[]
    
    for i in range(n):
        
        # The predicted label is the column (class) with the highest probability in row i
        predict=df.loc[i,:].idxmax()
        compare.append(predict==correctlabels[i])
    
    accuracy = sum(compare)/n
    return(accuracy)

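def brier_score(df, correctlabels):
    classes=df.columns
    n=len(correctlabels)
    brier_score=0
    for c in classes:
        # Squared difference between the predicted probability and the 0/1 indicator of class c
        # (use the df argument, not a global variable named predictions)
        brier_score+=sum((df[c]-[x==c for x in correctlabels])**2)
    return brier_score/n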

def auc(df, correctlabels):
    auc=0
    classes = df.columns
    
    # Computing the binary AUCs
    for c in classes:
        correct=[x==c for x in correctlabels]
        
        # Map each distinct score to a pair (no. of positives, no. of negatives) having that score
        scores={}
        for i in range(len(correctlabels)):
            score=df[c][i]
            (p,n)=scores.get(score,(0,0))
            if correct[i]:
                scores[score]=(p+1,n)
            else :
                scores[score]=(p,n+1)
        
        # Create the descending list of pairs
        scores = np.array([[score,scores[score][0],scores[score][1]] for score in scores.keys()])
        scores = scores[np.argsort(scores[:,0])[::-1]]
        
        # Compute the binary AUC
        binary_auc=0
        pos=sum(scores[:,1])
        neg=sum(scores[:,2])
        y=0
        
        for i in range(len(scores[:,0])):
            p,n=scores[:,1][i],scores[:,2][i]
            
            # Case where we just need to increase the height
            if n==0 :
                y+=p/pos
            
            # Case where we just need to add the area of a rectangle with the same height as before
            elif p==0 :
                binary_auc+=n/neg*y
                
            # Case where we need to add the rectangle with the same height as before, plus the area of the triangle
            else :
                binary_auc+=n/neg*(y+p/pos/2)
                y+=p/pos
        
        # Add to the AUC with the proper weight
        auc+= binary_auc*sum(correct)/len(correct)
        
    return auc

def create_bins(df,nobins=10,bintype='equal-width'):
    
    output_df=df.copy()
    types=output_df.dtypes
    columns = [col for col in output_df.columns if types[col].name in ['int64','float64'] and col not in ['CLASS','ID']]
    
    binning={}
    
    for col in columns:
        
        if bintype=='equal-width':
            output_df[col], bins = pd.cut(output_df[col],nobins,retbins=True,labels=False)
            bins[0]=-np.inf
            bins[-1]=np.inf
            binning[col]=bins
            output_df[col]=output_df[col].astype('category')

        elif bintype=='equal-size':
            output_df[col], bins = pd.qcut(output_df[col],nobins,retbins=True,duplicates='drop',labels=False)
            bins[0]=-np.inf
            bins[-1]=np.inf
            binning[col]=bins
            output_df[col]=output_df[col].astype('category')
                    
    return output_df,binning

def apply_bins(df,binning):
    output_df=df.copy()
    for col in binning:
        output_df[col]=pd.cut(output_df[col],binning[col],labels=False)
        output_df[col]=output_df[col].astype('category')
    return output_df
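
The binning functions above replace the outermost bin edges with -inf and +inf, so that test values outside the training range still fall into the first or last bin. A minimal sketch of that idea on made-up values (the toy series below are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

train = pd.Series([1.0, 2.0, 3.0, 4.0])

# Equal-width binning on the training values; retbins=True also returns the bin edges
codes, edges = pd.cut(train, 2, retbins=True, labels=False)

# Open the outer edges so that unseen values outside [1, 4] still get a bin
edges[0], edges[-1] = -np.inf, np.inf

# Reuse the stored edges on new data, as apply_bins does
test = pd.Series([-5.0, 2.4, 100.0])
test_codes = pd.cut(test, edges, labels=False)

print(list(codes), list(test_codes))
```

Without opening the edges, the values -5.0 and 100.0 would fall outside every bin and become missing.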

## 1. Define the class kNN

In [9]:
##### Define the class kNN with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self - the object itself
#
# Output from __init__:
# <nothing>
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# column_filter, imputation, normalization, one_hot, labels, training_labels, training_data, training_time
#
# Input to fit:
# self              - the object itself
# df                - a dataframe (where the column names "CLASS" and "ID" have special meaning)
# normalizationtype - "minmax" (default) or "zscore"
#
# Output from fit:
# <nothing>
#
# The result of applying this function should be:
#
# self.column_filter   - a column filter (see Assignment 1) from df
# self.imputation      - an imputation mapping (see Assignment 1) from df
# self.normalization   - a normalization mapping (see Assignment 1), using normalizationtype from the imputed df
# self.one_hot         - a one-hot mapping (see Assignment 1)
# self.training_labels - a pandas series corresponding to the "CLASS" column, set to be of type "category" 
# self.labels          - a list of the categories (class labels) of the previous series
# self.training_data   - the values (an ndarray) of the transformed dataframe, i.e., after employing imputation, 
#                        normalization, and possibly one-hot encoding, and also after removing the "CLASS" and "ID" columns
#
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Input to predict:
# self - the object itself
# df   - a dataframe
# k    - an integer >= 1 (default = 5)
# 
# Output from predict:
# predictions - a dataframe with class labels as column names and the rows corresponding to
#               predictions with estimated class probabilities for each row in df, where the class probabilities
#               are estimated by the relative class frequencies in the set of class labels from the k nearest 
#               (with respect to Euclidean distance) neighbors in training_data
#
# Hint 1: Drop any "CLASS" and "ID" columns first and then apply column filtering, imputation, normalization and one-hot
#
# Hint 2: Get the numerical values (as an ndarray) from the resulting dataframe and iterate over the rows 
#         calling some sub-function, e.g., get_nearest_neighbor_predictions(x_test,k), which for a test row
#         (numerical input feature values) finds the k nearest neighbors and calculate the class probabilities.
#
# Hint 3: This sub-function may first find the distances to all training instances, e.g., pairs consisting of
#         training instance index and distance, and then sort them according to distance, and then (using the indexes
#         of the k closest instances) find the corresponding labels and calculate the relative class frequencies

import math

class kNN:
    
    def __init__(self):
        self.column_filter=None
        self.imputation=None
        self.normalization=None
        self.one_hot=None
        self.labels=None
        self.training_labels=None
        self.training_data=None
        self.training_time=None
    
    def fit(self,df,normalizationtype="minmax"):
        # Chain the preprocessing steps so that each one builds on the result of the previous one
        data,self.column_filter=create_column_filter(df)
        data,self.imputation=create_imputation(data)
        data,self.normalization=create_normalization(data,normalizationtype)
        data,self.one_hot=create_one_hot(data)
        
        # Class labels of the training instances (series) and the list of distinct class labels
        self.training_labels=df["CLASS"].astype("category")
        self.labels=list(self.training_labels.cat.categories)
        
        # Keep only the feature values, as an ndarray
        self.training_data=data.drop(columns=["CLASS","ID"],errors="ignore").values
        
    def predict(self,df,k=5):
        # Prepare the test data with the mappings learned in fit
        test=df.drop(columns=["CLASS","ID"],errors="ignore")
        test=apply_column_filter(test,self.column_filter)
        test=apply_imputation(test,self.imputation)
        test=apply_normalization(test,self.normalization)
        test=apply_one_hot(test,self.one_hot)
        test=test.values
        train=self.training_data
        
        # Define the sub-function that returns the predicted probabilities of the different classes for one test row
        def get_nearest_neighbors_predictions(x_test,k):
            
            # Get the row indices of the k training vectors closest to x_test (Euclidean distance)
            nearest = np.argsort([np.linalg.norm(x_test-x_train) for x_train in train])[:k]

            # Define the class probabilities
            predicted_classes=[0.0]*len(self.labels)
                                 
            for row_index in nearest:
                # Add a fraction of 1/k for the class of each of the k neighbors
                predicted_classes[self.labels.index(self.training_labels.iloc[row_index])]+=1/k
            
            return predicted_classes
          
        # Go through the rows, append the predictions to a list and generate the related dataframe
        predictions=[]
        for row in test:
            predictions.append(get_nearest_neighbors_predictions(row,k))
        predictions = pd.DataFrame(predictions,columns=self.labels)

        return predictions

In [12]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.csv")

glass_test_df = pd.read_csv("glass_test.csv")

knn_model = kNN()

t0 = time.perf_counter()
knn_model.fit(glass_train_df)
print("Training time: {0:.2f} s.".format(time.perf_counter()-t0))

test_labels = glass_test_df["CLASS"]

k_values = [1,3,5,7,9]
results = np.empty((len(k_values),3))

for i in range(len(k_values)):
    t0 = time.perf_counter()
    predictions = knn_model.predict(glass_test_df,k=k_values[i])
    print("Testing time (k={0}): {1:.2f} s.".format(k_values[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=k_values,columns=["Accuracy","Brier score","AUC"])

print()
display("results",results)

Training time: 0.01 s.
Testing time (k=1): 0.10 s.
Testing time (k=3): 0.08 s.
Testing time (k=5): 0.10 s.
Testing time (k=7): 0.09 s.
Testing time (k=9): 0.10 s.



'results'

| k | Accuracy | Brier score | AUC |
|---|---|---|---|
| 1 | 0.747664 | 0.504673 | 0.81035 |
| 3 | 0.663551 | 0.488058 | 0.815859 |
| 5 | 0.579439 | 0.471028 | 0.833843 |
| 7 | 0.598131 | 0.471867 | 0.833481 |
| 9 | 0.616822 | 0.482981 | 0.827727 |


In [6]:
train_labels = glass_train_df["CLASS"]
predictions = knn_model.predict(glass_train_df,k=1)
print("Accuracy on training set (k=1): {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set (k=1): {0:.4f}".format(auc(predictions,train_labels)))
print("Brier score on training set (k=1): {0:.4f}".format(brier_score(predictions,train_labels)))

Accuracy on training set (k=1): 1.0000
AUC on training set (k=1): 1.0000
Brier score on training set (k=1): 0.0000
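
These perfect scores are what one would expect for k=1 on the training set: each training instance is its own nearest neighbor (distance 0), so, assuming no two instances share identical features but different labels, the predicted probability is 1 for the true class. As a small worked example of how the Brier score reacts to less certain predictions, here is a sketch on made-up probabilities (the function name `brier_score_sketch` and the toy dataframe are hypothetical, not part of the solution):

```python
import pandas as pd

def brier_score_sketch(df, correctlabels):
    # Sum over classes of the squared difference between the predicted
    # probability and the 0/1 indicator of the true class, averaged over rows
    total = 0.0
    for c in df.columns:
        indicator = pd.Series([float(x == c) for x in correctlabels], index=df.index)
        total += ((df[c] - indicator) ** 2).sum()
    return total / len(correctlabels)

# Row 0 is a perfect prediction for "a"; row 1 is undecided although the true class is "b"
toy_predictions = pd.DataFrame({"a": [1.0, 0.5], "b": [0.0, 0.5]})
score = brier_score_sketch(toy_predictions, ["a", "b"])
print(score)
```

The first row contributes 0 and the second row contributes 0.25 + 0.25 = 0.5, so the average over the two rows is 0.25.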


### Comment on assumptions, things that do not work properly, etc.
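
One note on speed: the distance computation in `get_nearest_neighbors_predictions` loops over the training rows in Python. A minimal sketch of a vectorized alternative that uses only NumPy broadcasting is given below. The helper name `knn_class_probabilities` and the toy arrays are hypothetical and serve only as illustration; this is not the code that produced the results above.

```python
import numpy as np

def knn_class_probabilities(train, train_labels, labels, x_test, k=5):
    """Relative class frequencies among the k nearest training rows (hypothetical helper)."""
    # Euclidean distances from x_test to all training rows at once (broadcasting)
    distances = np.sqrt(((train - x_test) ** 2).sum(axis=1))
    # Indices of the k smallest distances (a stable sort keeps ties in training order)
    nearest = np.argsort(distances, kind="stable")[:k]
    # Each neighbor contributes 1/k to the probability of its class
    return [sum(train_labels[i] == c for i in nearest) / k for c in labels]

# Toy data, made up for illustration
toy_train = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
toy_labels = ["a", "a", "b", "b"]
probs = knn_class_probabilities(toy_train, toy_labels, ["a", "b"], np.array([0.0, 0.1]), k=3)
print(probs)
```

With k=3 the three closest rows are the two "a" rows and one "b" row, so the estimated probabilities are 2/3 and 1/3.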


## 2. Define the class NaiveBayes

In [79]:
# Define the class NaiveBayes with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self - the object itself
#
# Output from __init__:
# <nothing>
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# column_filter, binning, labels, class_priors, feature_class_value_counts, feature_class_counts
#
# Input to fit:
# self    - the object itself
# df      - a dataframe (where the column names "CLASS" and "ID" have special meaning)
# nobins  - no. of bins (default = 10)
# bintype - either "equal-width" (default) or "equal-size" 
#
# Output from fit:
# <nothing>
#
# The result of applying this function should be:
#
# self.column_filter              - a column filter (see Assignment 1) from df
# self.binning                    - a discretization mapping (see Assignment 1) from df
# self.class_priors               - a mapping (dictionary) from the labels (categories) of the "CLASS" column of df,
#                                   to the relative frequencies of the labels
# self.labels                     - a list of the categories (class labels) of the "CLASS" column of df
# self.feature_class_value_counts - a mapping from the feature (column name) to the number of
#                                   training instances with a specific combination of (non-missing, categorical) 
#                                   value for the feature and class label
# self.feature_class_counts       - a mapping from the feature (column name) to the number of
#                                   training instances with a specific class label and some (non-missing, categorical) 
#                                   value for the feature
#
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Input to predict:
# self - the object itself
# df   - a dataframe
# 
# Output from predict:
# predictions - a dataframe with class labels as column names and the rows corresponding to
#               predictions with estimated class probabilities for each row in df, where the class probabilities
#               are estimated by the naive approximation of Bayes rule (see lecture slides)
#
# Hint 1: First apply the column filter and discretization
#
# Hint 2: Iterating over either columns or rows, and for each possible class label, calculate the relative
#         frequency of the observed feature value given the class (using feature_class_value_counts and 
#         feature_class_counts) 
#
# Hint 3: Calculate the non-normalized estimated class probabilities by multiplying the class priors to the
#         product of the relative frequencies
#
# Hint 4: Normalize the probabilities by dividing by the sum of the non-normalized probabilities; in case
#         this sum is zero, then set the probabilities to the class priors

class NaiveBayes():
    def __init__(self):
        self.column_filter=None
        self.binning=None
        self.labels=None
        self.class_priors=None
        self.feature_class_value_counts=None
        self.feature_class_counts=None
    
    def fit(self,df,nobins=10,bintype="equal-width"):
        data,self.column_filter=create_column_filter(df)
        data,self.binning=create_bins(data,nobins,bintype)
        self.labels=list(df["CLASS"].astype("category").cat.categories)
        self.class_priors=dict(df["CLASS"].value_counts()/df["CLASS"].value_counts().sum())
        
        # Feature class value counts (only for the columns kept by the column filter):
        features=[col for col in data.columns if col not in ["CLASS","ID"]]
        self.feature_class_value_counts={}
        
        for col in features:
            # Store the dict mapping each tuple (feature_value,class) to its number of appearances in the training data
            # The value_counts method skips rows where one of the fields (feature_value,class) is missing
            self.feature_class_value_counts[col]=dict(data.loc[:,[col,"CLASS"]].value_counts())
            
        # Feature class counts :
        self.feature_class_counts={}
        for col in features:
            # Drop the rows where either the feature or the class is missing before counting the class labels
            self.feature_class_counts[col]=dict(data.loc[:,[col,"CLASS"]].dropna()["CLASS"].value_counts())
        
    def predict(self,df):
        # The test dataframe may lack the "CLASS" and "ID" columns, so ignore them if absent
        data=df.drop(columns=["CLASS","ID"],errors="ignore")
        data=apply_bins(apply_column_filter(data,self.column_filter),self.binning)
        
        # Go through the rows, append the predictions to a list and generate the related dataframe
        predictions=[]
        
        # For each row of data
        for k in range(len(data)):
            row=data.iloc[k,:]
            
            # We need to go through the classes and compute the non-normalized probabilities
            row_predictions=[]
            for c in self.labels:
                
                # Starting with the class priors
                proba=self.class_priors[c]
                
                # We then multiply by each P(feature | class)
                for feature in data.columns:
                    # We might not have seen some values for the feature, hence the default of 0
                    fcv=self.feature_class_value_counts[feature].get((row[feature],c),0)
                    # Guard against a class with no non-missing value for this feature
                    fc=self.feature_class_counts[feature].get(c,0)
                    proba*=fcv/fc if fc>0 else 0

                # And we store it in our list
                row_predictions.append(proba)
            
            # Normalization
            tot=sum(row_predictions)
            if tot==0:
                row_predictions=[self.class_priors[c] for c in self.labels]
            else:
                row_predictions=[x/tot for x in row_predictions]
            
            predictions.append(row_predictions)
        
        # Create a df from the predictions
        predictions=pd.DataFrame(predictions,columns=self.labels)
        
        return predictions
    
# Test my code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.csv")

glass_test_df = pd.read_csv("glass_test.csv")

nb_model = NaiveBayes()

test_labels = glass_test_df["CLASS"]

nobins_values = [3,5,10]
bintype_values = ["equal-width","equal-size"]
parameters = [(nobins,bintype) for nobins in nobins_values for bintype in bintype_values]

results = np.empty((len(parameters),3))

for i in range(len(parameters)):
    t0 = time.perf_counter()
    nb_model.fit(glass_train_df,nobins=parameters[i][0],bintype=parameters[i][1])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = nb_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=pd.MultiIndex.from_product([nobins_values,bintype_values]),
                       columns=["Accuracy","Brier score","AUC"])

print()
display("results",results)

Training time (3, 'equal-width'): 0.06 s.
Testing time (3, 'equal-width'): 0.16 s.
Training time (3, 'equal-size'): 0.06 s.
Testing time (3, 'equal-size'): 0.16 s.
Training time (5, 'equal-width'): 0.06 s.
Testing time (5, 'equal-width'): 0.17 s.
Training time (5, 'equal-size'): 0.07 s.
Testing time (5, 'equal-size'): 0.15 s.
Training time (10, 'equal-width'): 0.05 s.
Testing time (10, 'equal-width'): 0.15 s.
Training time (10, 'equal-size'): 0.07 s.
Testing time (10, 'equal-size'): 0.13 s.



'results'

| nobins | bintype | Accuracy | Brier score | AUC |
|---|---|---|---|---|
| 3 | equal-width | 0.616822 | 0.622116 | 0.724335 |
| 3 | equal-size | 0.607477 | 0.554782 | 0.780163 |
| 5 | equal-width | 0.64486 | 0.551101 | 0.771688 |
| 5 | equal-size | 0.598131 | 0.581556 | 0.796675 |
| 10 | equal-width | 0.654206 | 0.527569 | 0.812887 |
| 10 | equal-size | 0.588785 | 0.741668 | 0.751165 |


In [7]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.csv")

glass_test_df = pd.read_csv("glass_test.csv")

nb_model = NaiveBayes()

test_labels = glass_test_df["CLASS"]

nobins_values = [3,5,10]
bintype_values = ["equal-width","equal-size"]
parameters = [(nobins,bintype) for nobins in nobins_values for bintype in bintype_values]

results = np.empty((len(parameters),3))

for i in range(len(parameters)):
    t0 = time.perf_counter()
    nb_model.fit(glass_train_df,nobins=parameters[i][0],bintype=parameters[i][1])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = nb_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=pd.MultiIndex.from_product([nobins_values,bintype_values]),
                       columns=["Accuracy","Brier score","AUC"])

print()
display("results",results)

Training time (3, 'equal-width'): 0.10 s.
Testing time (3, 'equal-width'): 0.09 s.
Training time (3, 'equal-size'): 0.09 s.
Testing time (3, 'equal-size'): 0.07 s.
Training time (5, 'equal-width'): 0.10 s.
Testing time (5, 'equal-width'): 0.08 s.
Training time (5, 'equal-size'): 0.08 s.
Testing time (5, 'equal-size'): 0.07 s.
Training time (10, 'equal-width'): 0.09 s.
Testing time (10, 'equal-width'): 0.07 s.
Training time (10, 'equal-size'): 0.10 s.
Testing time (10, 'equal-size'): 0.07 s.



'results'

| nobins | bintype | Accuracy | Brier score | AUC |
|---|---|---|---|---|
| 3 | equal-width | 0.616822 | 0.622116 | 0.724335 |
| 3 | equal-size | 0.607477 | 0.554782 | 0.780163 |
| 5 | equal-width | 0.64486 | 0.551101 | 0.771688 |
| 5 | equal-size | 0.598131 | 0.581556 | 0.796675 |
| 10 | equal-width | 0.654206 | 0.527569 | 0.812887 |
| 10 | equal-size | 0.588785 | 0.741668 | 0.751165 |


In [8]:
train_labels = glass_train_df["CLASS"]
nb_model.fit(glass_train_df)
predictions = nb_model.predict(glass_train_df)
print("Accuracy on training set: {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.4f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.4f}".format(brier_score(predictions,train_labels)))

Accuracy on training set: 0.8505
AUC on training set: 0.9687
Brier score on training set: 0.2263


### Comment on assumptions, things that do not work properly, etc.
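
To make the probability computation in `predict` concrete, the following sketch reproduces its steps for a single row with hand-written counts. The function name `naive_bayes_row` and the toy counts are hypothetical; the dictionaries only mirror the layout of `feature_class_value_counts` and `feature_class_counts`.

```python
def naive_bayes_row(row, labels, class_priors, fcv_counts, fc_counts):
    """Class probabilities for one row given as a dict feature -> value (hypothetical helper)."""
    scores = []
    for c in labels:
        proba = class_priors[c]  # start from the class prior
        for feature, value in row.items():
            seen = fcv_counts[feature].get((value, c), 0)  # 0 if the combination was never observed
            total = fc_counts[feature].get(c, 0)
            proba *= seen / total if total > 0 else 0.0
        scores.append(proba)
    tot = sum(scores)
    if tot == 0:
        # All products are zero: fall back to the class priors
        return [class_priors[c] for c in labels]
    return [s / tot for s in scores]

# Toy counts, made up for illustration
toy_labels = ["a", "b"]
toy_priors = {"a": 0.5, "b": 0.5}
toy_fcv = {"f": {(0, "a"): 3, (1, "a"): 1, (0, "b"): 1, (1, "b"): 3}}
toy_fc = {"f": {"a": 4, "b": 4}}

seen_row = naive_bayes_row({"f": 0}, toy_labels, toy_priors, toy_fcv, toy_fc)
unseen_row = naive_bayes_row({"f": 2}, toy_labels, toy_priors, toy_fcv, toy_fc)
print(seen_row, unseen_row)
```

For the value 0 the non-normalized scores are 0.5·3/4 and 0.5·1/4, which normalize to 0.75 and 0.25. The value 2 was never observed, so every product is zero and the class priors are returned. A single unseen feature value therefore zeroes out a class completely, which is one limitation of using raw relative frequencies.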