# ID2214/FID3214 Assignment 2 Group no. [enter]
### Project members: 
[Stephen Moran, smoran@kth.se]
[Stefano Perenzoni, perenz@kth.se]
[Christian Antonio, cantonio@kth.se]

### Declaration:
By submitting this solution, it is hereby declared that all individuals listed above have contributed to the solution, either with code that appears in the final solution below, or with code that has been evaluated and compared to the final solution, but for some reason has been excluded. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214/FID3214) and no part of the solution has been provided by someone not listed as project member above.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy, pandas and time may be used in the solution for this assignment.

### Instructions
All parts of the assignment starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of all parts of the assignment starting 
with number 2 below, then the assignment will receive 2 points (in total).

Note that you do not have to develop the code directly within the notebook
but may instead copy the comments and test cases to a more convenient development environment
and when everything works as expected, you may paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above).

## Load NumPy, pandas and time

In [1]:
import numpy as np
import pandas as pd
import time

## Reused functions from Assignment 1

In [206]:
#Column filter
def create_column_filter(df): 
    exempt = ["CLASS"]
    col_filter = exempt.copy()
    for col in df.columns: 
        if col not in exempt: 
            if (not df[col].isnull().values.all()) and (df[col].nunique() > 1): 
                col_filter.append(col)
    filtered = df.copy()
    filtered = filtered.loc[:, col_filter]
    return filtered, col_filter


def apply_column_filter(df, column_filter):
    return df[column_filter]

#Binning
def create_bins(df, nobins=10, bintype='equal-width'):
    to_rtn = df.copy()
    bins = {}
    for col in to_rtn.columns:
        if col != "CLASS" and col != "ID" and df[col].dtype in ["float64", "float32", "int64", "int32"]:
            if bintype == "equal-width":
                to_rtn[col], binR = pd.cut(to_rtn[col],nobins,retbins=True,duplicates="drop",labels=False)
                bins[col] = binR    
            elif bintype == "equal-size":
                to_rtn[col], binR = pd.qcut(to_rtn[col],q=nobins,retbins=True,duplicates="drop",labels=False)
                bins[col] = binR
            to_rtn[col] = to_rtn[col].astype("category")
            to_rtn[col] = to_rtn[col].cat.set_categories([str(i) for i in to_rtn[col].cat.categories], rename = True)
            #Infinity edges
            bins[col][0] = -np.inf
            bins[col][-1] = np.inf
        else:
            to_rtn[col] = to_rtn[col].astype('category')
    return to_rtn, bins

def apply_bins(df, binning): 
    exempt = ["ID", "CLASS"]
    binned_df = df.copy()
    for col in binned_df.columns: 
        if col not in exempt: 
            if binned_df[col].dtype == "int64" or binned_df[col].dtype == "float64": 
                bins = binning[col]
                bin_length = len(bins) - 1
                labels = np.arange(0, bin_length, 1)
                binned_df[col] = pd.cut(binned_df[col], bins, labels=labels)
                binned_df[col] = binned_df[col].astype('category')
    return binned_df

#Imputation
def create_imputation(df):
    to_rtn = df.copy()
    imp_dict = {}
    for col in to_rtn.columns:
        if col not in ['CLASS', 'ID']:
            if to_rtn[col].dtypes == "int" or to_rtn[col].dtype == "float":
                val = to_rtn[col].mean()
                to_rtn[col] = to_rtn[col].fillna(value=val)
            else:
                #Get first value of the mode, alternatively np.random could be used
                val = to_rtn[col].mode()[0]
                to_rtn[col] = to_rtn[col].fillna(value=val)
            imp_dict[col] = val
    #to_rtn = to_rtn.fillna(value=imp_dict)
    return to_rtn, imp_dict

def apply_imputation(df, imputation):
    to_rtn = df.copy()
    #No type check is needed because only columns of the right type are included in the imputation dictionary
    to_rtn = to_rtn.fillna(value=imputation)
    return to_rtn


#Normalisation
def create_normalization(df, normalizationtype='minmax'):
    to_rtn = df.copy()
    norm_dict = {}
    for col in to_rtn.columns:
        #Check if col is numeric
        if pd.api.types.is_numeric_dtype(to_rtn[col]) and col not in ['CLASS', 'ID']:
            if normalizationtype == 'minmax':
                col_max = to_rtn[col].max()
                col_min = to_rtn[col].min()
                norm_dict[col] = (normalizationtype, col_min, col_max)
                #Normalize using min-max
                to_rtn[col] = (to_rtn[col] - col_min) / (col_max - col_min)
            elif normalizationtype == 'zscore':
                mean = to_rtn[col].mean()
                std = to_rtn[col].std()
                norm_dict[col] = ('zscore', mean, std)
                #Normalize using z-score
                to_rtn[col] = (to_rtn[col] - mean) / std
    return to_rtn, norm_dict


def apply_normalization(df, norm_dict):
    to_rtn = df.copy()
    for col in to_rtn.columns:
        #Only transform numeric columns for which a mapping was learned
        if col in norm_dict and pd.api.types.is_numeric_dtype(to_rtn[col]):
            normtype, first, second = norm_dict[col]
            if normtype == 'minmax':
                #first = min, second = max; clip so that unseen values stay in the range [0,1]
                to_rtn[col] = ((to_rtn[col] - first) / (second - first)).clip(0, 1)
            elif normtype == 'zscore':
                #first = mean, second = standard deviation
                to_rtn[col] = (to_rtn[col] - first) / second
    return to_rtn


#One-hot encoding 
def create_one_hot(df):
    to_rtn = df.copy()
    enc = {}
    for col in [el for el in df.columns if el not in ['CLASS', 'ID']]:
        #Only categorical and object (string) columns are encoded
        if str(df.dtypes[col]) == "category" or str(df.dtypes[col]) == "object":
            #Convert the column to category and get its list of categories
            enc[col] = list(df[col].astype("category").cat.categories)
            for i in enc[col]:
                tit = col + '_' + str(i) #Name of the new column
                #One new 0/1 column per category
                to_rtn[tit] = (df[col] == i).astype("int")
            #Drop the original column once it has been encoded
            to_rtn = to_rtn.drop(columns=col)
            
    return to_rtn, enc

def apply_one_hot(df,one_hot):
    df_new = df.copy()
    for col in df.columns:
        if col in one_hot:    
            for i in one_hot[col]:
                name = col + "_" + str(i)
                new_col = df[col]==i
                new_col = pd.Series(new_col.astype("int"))
                df_new[name] = new_col
            df_new = df_new.drop(columns = col, axis = 1)
            
    return df_new



#AUC 
def auc(df, correctlabels):   
    cols = df.columns
    cor_list = [(c == cols) for c in correctlabels]
    #print(cor_list)
    correct_filter = pd.DataFrame(cor_list, columns=cols)

    # Calculate binary AUC for each class label
    # Treating the predicted probability of this class for each instance as a score
    AUC = 0
    for col in df.columns:     
        # Map each score to the number of positive and negative instances with that score
        class_score = {score: [0, 0] for score in df[col]} # [positive, negative]
        
        for i in range(len(df[col])):  
            # Look up the score and the true class membership of instance i
            score = df[col][i] # probability of class col for row i
            is_positive = correct_filter[col][i] == True
            class_score[score] = [class_score[score][0] + is_positive,
                                  class_score[score][1] + ~is_positive]
        #We create a single reversed list of pairs
        sort_score = sorted(class_score, reverse=True)
        sorted_list = np.array([class_score[score] for score in sort_score])
        
        class_auc = cov_tp = 0
        tp, fp = sorted_list[:, 0], sorted_list[:, 1] 
        tot_tp, tot_fp = sum(tp), sum(fp)
        
        #Evaluate the AUC considering the 3 different cases
        for i in range(len(tp)):   
            if fp[i] == 0:
                #Move up the y-axis
                cov_tp += tp[i]
            elif tp[i] == 0:
                #Have rectangle to calculate
                class_auc += (cov_tp/tot_tp)*(fp[i]/tot_fp)
            else: 
                class_auc += (cov_tp/tot_tp)*(fp[i]/tot_fp)+(tp[i]/tot_tp)*(fp[i]/tot_fp)/2
                cov_tp += tp[i]
                
        frequency = dict(pd.Series(correctlabels).value_counts(normalize=True))
        AUC += frequency[col]*class_auc
    return AUC

#Accuracy
def accuracy(df, correctlabels):
    labels = df.idxmax(axis=1)
    truelabels = (labels == correctlabels).sum()
    accuracy = truelabels/len(df)
    
    return accuracy

#Brier Score
def brier_score(df, correctlabels): 
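    #One-hot encode the true labels with one column per predicted class (all zeros for classes absent from correctlabels)
    label_df = pd.get_dummies(correctlabels).reindex(columns=df.columns, fill_value=0).astype(float)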
    brier_score = np.mean(np.sum((df - label_df)**2, axis=1))
    
    return brier_score
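
As a quick sanity check of the metric definitions above, the following self-contained sketch recomputes accuracy and Brier score by hand on a tiny predictions frame. The data and the names `toy_predictions` and `toy_labels` are illustrative assumptions and not part of the assignment data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy predictions: one row per instance, one column per class label
toy_predictions = pd.DataFrame({"a": [0.8, 0.3, 0.6], "b": [0.2, 0.7, 0.4]})
toy_labels = pd.Series(["a", "b", "b"])

# Accuracy: fraction of rows whose most probable class equals the true label
toy_accuracy = (toy_predictions.idxmax(axis=1) == toy_labels).mean()

# Brier score: mean over rows of the squared distance to the one-hot true label vector
one_hot_labels = (pd.get_dummies(toy_labels)
                  .reindex(columns=toy_predictions.columns, fill_value=0)
                  .astype(float))
toy_brier = ((toy_predictions - one_hot_labels) ** 2).sum(axis=1).mean()

print(toy_accuracy, toy_brier)
```

Two of the three rows are classified correctly, and the per-row squared errors are 0.08, 0.18 and 0.72, so the values can be checked against a manual calculation.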



## 1. Define the class kNN

In [3]:
# Define the class kNN with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self - the object itself
#
# Output from __init__:
# <nothing>
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# column_filter, imputation, normalization, one_hot, labels, training_labels, training_data, training_time
#
# Input to fit:
# self              - the object itself
# df                - a dataframe (where the column names "CLASS" and "ID" have special meaning)
# normalizationtype - "minmax" (default) or "zscore"
#
# Output from fit:
# <nothing>
#
# The result of applying this function should be:
#
# self.column_filter   - a column filter (see Assignment 1) from df
# self.imputation      - an imputation mapping (see Assignment 1) from df
# self.normalization   - a normalization mapping (see Assignment 1), using normalizationtype from the imputed df
# self.one_hot         - a one-hot mapping (see Assignment 1)
# self.training_labels - a pandas series corresponding to the "CLASS" column, set to be of type "category" 
# self.labels          - a list of the categories (class labels) of the previous series
# self.training_data   - the values (an ndarray) of the transformed dataframe, i.e., after employing imputation, 
#                        normalization, and possibly one-hot encoding, and also after removing the "CLASS" and "ID" columns
#
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Input to predict:
# self - the object itself
# df   - a dataframe
# k    - an integer >= 1 (default = 5)
# 
# Output from predict:
# predictions - a dataframe with class labels as column names and the rows corresponding to
#               predictions with estimated class probabilities for each row in df, where the class probabilities
#               are estimated by the relative class frequencies in the set of class labels from the k nearest 
#               (with respect to Euclidean distance) neighbors in training_data
#
# Hint 1: Drop any "CLASS" and "ID" columns first and then apply column filtering, imputation, normalization and one-hot
#
# Hint 2: Get the numerical values (as an ndarray) from the resulting dataframe and iterate over the rows 
#         calling some sub-function, e.g., get_nearest_neighbor_predictions(x_test,k), which for a test row
#         (numerical input feature values) finds the k nearest neighbors and calculate the class probabilities.
#
# Hint 3: This sub-function may first find the distances to all training instances, e.g., pairs consisting of
#         training instance index and distance, and then sort them according to distance, and then (using the indexes
#         of the k closest instances) find the corresponding labels and calculate the relative class frequencies
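
The sub-function described in Hints 2 and 3 could be sketched roughly as follows. This is a minimal illustration on toy data, not the submitted solution: the training data, training labels and label list are passed as explicit arguments here (an assumption made so that the sketch is self-contained, whereas the hint suggests a method taking only `x_test` and `k`), and all `toy_*` names are hypothetical.

```python
import numpy as np

def get_nearest_neighbor_predictions(x_test, training_data, training_labels, labels, k=5):
    """Return the relative class frequencies among the k nearest training rows."""
    # Euclidean distance from the test row to every training row
    distances = np.sqrt(((training_data - x_test) ** 2).sum(axis=1))
    # Indexes of the k closest training instances (stable sort keeps ties in training order)
    nearest = np.argsort(distances, kind="stable")[:k]
    nearest_labels = np.asarray(training_labels)[nearest]
    # Relative frequency of each class label among the neighbours
    return [float(np.mean(nearest_labels == label)) for label in labels]

# Hypothetical toy data: four training rows with two features each
toy_training_data = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
toy_training_labels = ["a", "a", "b", "b"]
toy_probabilities = get_nearest_neighbor_predictions(
    np.array([0.2, 0.1]), toy_training_data, toy_training_labels, labels=["a", "b"], k=3)
print(toy_probabilities)
```

With `k=3` the three closest rows carry the labels a, a and b, so the estimated probabilities are two thirds for class a and one third for class b.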



In [66]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.csv")

glass_test_df = pd.read_csv("glass_test.csv")

knn_model = kNN()

t0 = time.perf_counter()
knn_model.fit(glass_train_df)
print("Training time: {0:.2f} s.".format(time.perf_counter()-t0))

test_labels = glass_test_df["CLASS"]

k_values = [1,3,5,7,9]
results = np.empty((len(k_values),3))

for i in range(len(k_values)):
    t0 = time.perf_counter()
    predictions = knn_model.predict(glass_test_df,k=k_values[i])
    print("Testing time (k={0}): {1:.2f} s.".format(k_values[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=k_values,columns=["Accuracy","Brier score","AUC"])

print()
display("results",results)

NameError: name 'kNN' is not defined

In [5]:
train_labels = glass_train_df["CLASS"]
predictions = knn_model.predict(glass_train_df,k=1)
print("Accuracy on training set (k=1): {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set (k=1): {0:.4f}".format(auc(predictions,train_labels)))
print("Brier score on training set (k=1): {0:.4f}".format(brier_score(predictions,train_labels)))

Accuracy on training set (k=1): 1.0000
AUC on training set (k=1): 1.0000
Brier score on training set (k=1): 0.0000


### Comment on assumptions, things that do not work properly, etc.


## 2. Define the class NaiveBayes

In [272]:
# Define the class NaiveBayes with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self - the object itself
#
# Output from __init__:
# <nothing>
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# column_filter, binning, labels, class_priors, feature_class_value_counts, feature_class_counts
class NaiveBayes(): 
    def __init__(self): 
        self.column_filter = None
        self.binning = None 
        self.labels = None
        self.class_priors = None
        self.feature_class_value_counts = None
        self.feature_class_counts = None
    
    # Input to fit:
    # self    - the object itself
    # df      - a dataframe (where the column names "CLASS" and "ID" have special meaning)
    # nobins  - no. of bins (default = 10)
    # bintype - either "equal-width" (default) or "equal-size" 
    #
    # Output from fit:
    # <nothing>
    #
    # The result of applying this function should be:
    #
    # self.column_filter              - a column filter (see Assignment 1) from df
    # self.binning                    - a discretization mapping (see Assignment 1) from df
    # self.class_priors               - a mapping (dictionary) from the labels (categories) of the "CLASS" column of df,
    #                                   to the relative frequencies of the labels
    # self.labels                     - a list of the categories (class labels) of the "CLASS" column of df
    # self.feature_class_value_counts - a mapping from the feature (column name) to the number of
    #                                   training instances with a specific combination of (non-missing, categorical) 
    #                                   value for the feature and class label
    # self.feature_class_counts       - a mapping from the feature (column name) to the number of
    #                                   training instances with a specific class label and some (non-missing, categorical) 
    #                                   value for the feature
    #
    # Note that the function does not return anything but just assigns values to the attributes of the object.

    def fit(self, df, nobins = 10, bintype = "equal-width"): 
        
        df, self.column_filter = create_column_filter(df)
        temp, self.binning = create_bins(df, nobins, bintype)
        self.class_priors = df['CLASS'].value_counts(normalize = True).to_dict()
        self.labels = df['CLASS'].unique()
        feature_class_value_counts = {}
        feature_class_counts = {}
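        for col in df.columns:
            if col not in ["CLASS", "ID"]: 
                #Number of training instances per (class label, feature value) combination
                inner_value_count_dict = feature_class_value_counts[col] = {}
                for (i,j) in temp.groupby(["CLASS",col], observed=True): 
                    inner_value_count_dict[i[0],i[1]] = len(j)
        
                #Number of training instances per class label with a non-missing value for the feature.
                #Grouping by the column name (not a one-element list) keeps the group keys scalar
                inner_class_count = feature_class_counts[col] = {}
                for (i,j) in temp.groupby("CLASS", observed=True): 
                    inner_class_count[i] = int(j[col].notna().sum())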

        self.feature_class_value_counts = feature_class_value_counts
        self.feature_class_counts = feature_class_counts





    # Input to predict:
    # self - the object itself
    # df   - a dataframe
    # 
    # Output from predict:
    # predictions - a dataframe with class labels as column names and the rows corresponding to
    #               predictions with estimated class probabilities for each row in df, where the class probabilities
    #               are estimated by the naive approximation of Bayes rule (see lecture slides)
    #
    # Hint 1: First apply the column filter and discretization
    #
    # Hint 2: Iterating over either columns or rows, and for each possible class label, calculate the relative
    #         frequency of the observed feature value given the class (using feature_class_value_counts and 
    #         feature_class_counts) 
    #
    # Hint 3: Calculate the non-normalized estimated class probabilities by multiplying the class priors to the
    #         product of the relative frequencies
    #
    # Hint 4: Normalize the probabilities by dividing by the sum of the non-normalized probabilities; in case
    #         this sum is zero, then set the probabilities to the class priors

    def predict(self, df):
        df_c = df.copy()
        df_c = apply_column_filter(df_c, self.column_filter)
        df_c = apply_bins(df_c, self.binning)
        
        #Calculate the relative frequency of the observed features given the class.
        #The result is stored in a local dictionary so that the fitted counts stay
        #unchanged and predict can be called more than once on the same model
        rel_freq = {}
        for col in self.feature_class_value_counts:
            rel_freq[col] = {}
            for lab in self.feature_class_value_counts[col]:
                x = self.feature_class_counts[col][lab[0]]
                rel_freq[col][lab] = self.feature_class_value_counts[col][lab] / x
                        
        #Calculate the non-normalized estimated class probabilities
        #Iterate over rows
        to_rtn = {}
        for i,r in df_c.iterrows():
            #Init the prob to class_priors (first element of the numerator)
            prob = {c:self.class_priors[c] for c in self.labels}
            #Iterate over columns for each row
            for col in df_c.columns.drop(['CLASS', 'ID'], errors='ignore'):
                v = str(r[col])
                #Iterate over labels
                for l in self.labels:
                    #Multiply the probability with the relative frequency of the observed value 'v' given the class 'l'
                    prob[l] = prob[l] * (rel_freq[col].get((l,v), 0))
            to_rtn[i] = prob
        
        #Normalize
        for k in to_rtn:
            sum_tmp = sum(to_rtn[k].values())
            if sum_tmp == 0:
                #Set prob to the class priors
                to_rtn[k] = self.class_priors
                #sum_tmp = sum(self.class_priors.values())
            else: 
                to_rtn[k] = {c:(to_rtn[k][c]/sum_tmp) for c in to_rtn[k]}           
   
        to_rtn = pd.DataFrame.from_dict(to_rtn, orient='index', columns=self.labels)
        return to_rtn
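
The arithmetic behind Hints 3 and 4 can be illustrated in isolation. The sketch below is a simplified stand-alone example and not the class method above; the helper `normalize_class_probabilities` and all `toy_*` values are hypothetical names and numbers chosen for illustration.

```python
def normalize_class_probabilities(unnormalized, class_priors):
    """Normalize per-class scores; fall back to the priors when all scores are zero."""
    total = sum(unnormalized.values())
    if total == 0:
        return dict(class_priors)
    return {label: score / total for label, score in unnormalized.items()}

# Hypothetical numbers: two classes and one binned feature with observed value "1"
toy_priors = {"a": 0.5, "b": 0.5}
toy_relative_frequencies = {("a", "1"): 0.75, ("b", "1"): 0.25}

# Prior times the relative frequency of the observed value given the class
toy_scores = {label: toy_priors[label] * toy_relative_frequencies.get((label, "1"), 0)
              for label in toy_priors}
toy_posterior = normalize_class_probabilities(toy_scores, toy_priors)

# A value never seen in training gives zero for every class, so the priors are returned
toy_unseen = normalize_class_probabilities({"a": 0.0, "b": 0.0}, toy_priors)
print(toy_posterior, toy_unseen)
```

With several features, the score of a class is the prior multiplied by one relative frequency per feature, which is why a single unseen value is enough to drive the product to zero.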


"""
    def predict(self, df): 
        df = apply_column_filter(df, self.column_filter)
        df = apply_bins(df, self.binning)
    
        rel_frequency = {}
        value_count_cols = self.feature_class_value_counts.copy()
        class_counts_cols = self.feature_class_counts.copy()
    
        for col in value_count_cols.keys(): 
            inner_rel_freq = rel_frequency[col] = {}
            value_counts = value_count_cols.get(col)
            class_counts = class_counts_cols.get(col)
            for keys in value_counts: 
                value_freq = value_counts.get(keys)
                class_freq = class_counts.get(keys[0])
                relative = value_freq / class_freq
                inner_rel_freq[keys] = relative


        class_prob = pd.DataFrame(1.,index=np.arange(len(df)), columns=self.class_priors.keys())
        for row in df.index: 
            for c in self.class_priors.keys():
                for col in df.columns: 
                    if col not in ["CLASS", "ID"]: 
                        value = df[col].iloc[row]
                        freq = rel_frequency.get(col)
                        res = freq.get((c,str(value)), 0)
                        class_prob[c][row] *= res
                class_prob[c][row] *= self.class_priors[c]

        #Normalise the probabilities
        for i in class_prob.index: 
            sum_row = sum(class_prob.iloc[i])
            if sum_row == 0: 
                for c in class_prob.columns: 
                    class_prob.loc[i, c] = self.class_priors.get(c)
            else: 
                class_prob.loc[i] = class_prob.loc[i].divide(sum_row)
        return class_prob
"""


In [275]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.csv")

glass_test_df = pd.read_csv("glass_test.csv")

nb_model = NaiveBayes()

test_labels = glass_test_df["CLASS"]

nobins_values = [3,5,10]
bintype_values = ["equal-width","equal-size"]
parameters = [(nobins,bintype) for nobins in nobins_values for bintype in bintype_values]

results = np.empty((len(parameters),3))

for i in range(len(parameters)):
    t0 = time.perf_counter()
    nb_model.fit(glass_train_df,nobins=parameters[i][0],bintype=parameters[i][1])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = nb_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=pd.MultiIndex.from_product([nobins_values,bintype_values]),
                       columns=["Accuracy","Brier score","AUC"])

print()
display("results",results)

Training time (3, 'equal-width'): 0.09 s.
Testing time (3, 'equal-width'): 0.05 s.
Training time (3, 'equal-size'): 0.08 s.
Testing time (3, 'equal-size'): 0.05 s.
Training time (5, 'equal-width'): 0.09 s.
Testing time (5, 'equal-width'): 0.04 s.
Training time (5, 'equal-size'): 0.10 s.
Testing time (5, 'equal-size'): 0.05 s.
Training time (10, 'equal-width'): 0.10 s.
Testing time (10, 'equal-width'): 0.04 s.
Training time (10, 'equal-size'): 0.11 s.
Testing time (10, 'equal-size'): 0.04 s.



'results'

| nobins | bintype | Accuracy | Brier score | AUC |
|---|---|---|---|---|
| 3 | equal-width | 0.616822 | 0.622116 | 0.724335 |
| 3 | equal-size | 0.607477 | 0.554782 | 0.780163 |
| 5 | equal-width | 0.64486 | 0.551101 | 0.771688 |
| 5 | equal-size | 0.598131 | 0.581556 | 0.796675 |
| 10 | equal-width | 0.654206 | 0.527569 | 0.812887 |
| 10 | equal-size | 0.588785 | 0.741668 | 0.751165 |


In [276]:
train_labels = glass_train_df["CLASS"]
nb_model.fit(glass_train_df)
predictions = nb_model.predict(glass_train_df)
print("Accuracy on training set: {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.4f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.4f}".format(brier_score(predictions,train_labels)))

Accuracy on training set: 0.8505
AUC on training set: 0.9687
Brier score on training set: 0.2263


### Comment on assumptions, things that do not work properly, etc.