# ID2214/FID3214 Assignment 3 Group no. 13
### Project members: 
- Christian Durán García - chdg@kth.se
- Kailin Wu - kailinw@kth.se
- William Carlstedt - wcar@kth.se

### Declaration
By submitting this solution, it is hereby declared that all individuals listed above have contributed to the solution, either with code that appears in the final solution below, or with code that has been evaluated and compared to the final solution, but for some reason has been excluded. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214/FID3214) and no part of the solution has been provided by someone not listed as project member above.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy, pandas, time and sklearn.tree, may be used in the solution for this assignment.

### Instructions
All parts of the assignment starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of all parts of the assignment starting 
with number 2 below, then the assignment will receive 2 points (in total).

Note that you do not have to develop the code directly within the notebook
but may instead copy the comments and test cases to a more convenient development environment
and when everything works as expected, you may paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above).

## Load NumPy, pandas, time and DecisionTreeClassifier from sklearn.tree

In [1]:
import numpy as np
import pandas as pd
import time
import sklearn
from sklearn.tree import DecisionTreeClassifier

In [2]:
from platform import python_version

print(f"Python version: {python_version()}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"sklearn version: {sklearn.__version__}")

Python version: 3.10.13
NumPy version: 1.24.3
Pandas version: 2.1.1
sklearn version: 1.2.1


## Reused functions from Assignment 1

In [3]:
# Copy and paste functions from Assignment 1 here that you need for this assignment
def create_column_filter(df):
    # Create copy of df
    df_copy = df.copy()
    # Keep columns CLASS and ID
    columns_to_keep = ['CLASS', 'ID']

    for column in df_copy.columns:
        # Always keep the special columns CLASS and ID
        if column in columns_to_keep:
            continue
        # Drop the column if it has at most one unique value or contains only missing values
        elif(df_copy[column].nunique()<=1 or df_copy[column].isna().all()):
            df_copy = df_copy.drop(column, axis = 1)

    # Return df and column filter
    return df_copy, df_copy.columns.tolist()

# Function apply column_filter
def apply_column_filter(df, column_filter):
    # Copy of df
    df_copy = df.copy()

    # Return df with the columns filtered
    return df_copy[column_filter].copy()

def create_normalization(df, normalizationtype='minmax'):
    # Copy of df
    df_copy = df.copy()
    # Select columns of dtype float and int
    selected_columns = [col for col in df_copy.select_dtypes(include=['float', 'int']).columns.tolist() if col not in ['CLASS', 'ID']]
    # Dictionary to store column and values
    column_mapping = {}

    # Case 1: minmax
    if normalizationtype == 'minmax':
        # For loop for columns in df       
        for col in selected_columns:
            # Minimum value of the column (named col_min to avoid shadowing the built-in min)
            col_min = df_copy[col].min()
            # Maximum value of the column (named col_max to avoid shadowing the built-in max)
            col_max = df_copy[col].max()
            # Store min max values in the column_mapping dictionary
            column_mapping[col] = ('minmax', col_min, col_max)
            # Apply the minmax normalization
            df_copy[col] = (df_copy[col]-col_min)/(col_max-col_min)
        # Return the normalized df and the mapping dictionary
        return df_copy, column_mapping
    
    # Case 2: zscore
    if normalizationtype == 'zscore':
        # For loop for columns in df
        for col in selected_columns:
            # Mean value of the column
            mean = df_copy[col].mean()
            # Standard deviation of the column
            std = df_copy[col].std()
            # Store the mean and std values in the dictionary
            column_mapping[col] = ('zscore', mean, std)
            # Apply the zscore normalization
            df_copy[col] = df[col].apply(lambda x: (x-mean)/std)
        # Return normalized df and the mapping dictionary
        return df_copy, column_mapping

# Function apply_normalization
def apply_normalization(df, normalization):
    # Copy of the original df
    df_copy = df.copy()

    # For loop of columns and items
    for col, (method, param1, param2) in normalization.items():
        # If the method == minmax apply the minmax normalization
        if method == 'minmax':
            # min, max values from the normalization mapping (named to avoid shadowing the built-ins)
            col_min, col_max = param1, param2
            # Apply normalization
            df_copy[col] = (df_copy[col]-col_min)/(col_max-col_min)
        # If the method == zscore apply the zscore normalization
        elif method == 'zscore':
            # mean, std values from the normalization mapping
            mean, std = param1, param2
            # Apply normalization
            df_copy[col] = (df_copy[col]-mean)/std
    # Return the normalized df
    return df_copy

def create_imputation(df):
    df1 = df.copy() # Copy of the dataframe
    null_cols = df1.columns.tolist() # Every column gets an imputation value, whether or not it currently has missing values
    null_cols = [col for col in null_cols if col not in ['CLASS', 'ID']] # Exclude special cols
    null_df = df1[null_cols] # Dataframe restricted to the columns to impute

    mapping = {}

    for col, dtype in null_df.dtypes.items():

        # Check for columns with all missing values
        if df1[col].isna().all():
            if dtype in ['float64', 'int64']:
                mapping[col] = 0
            elif dtype == 'object':
                mapping[col] = ''
            elif dtype == 'category':
                mapping[col] = df1[col].cat.categories[0]

        # For columns with just some missing values
        elif dtype in ['float64', 'int64']:
            mapping[col] = df1[col].mean()
        elif dtype in ['object', 'category']:
            mapping[col] = df1[col].mode()[0]

    df1 = df1.fillna(value=mapping) # Fillna using created mapping

    return df1, mapping

def apply_imputation(df, mapping):
    df1 = df.copy()
    df1 = df1.fillna(value=mapping)
    return df1

def create_bins(df, nobins=10, bintype='equal-width'):
    df1 = df.copy() # Copy of dataframe
    num_cols = df1.select_dtypes(include=['float64', 'int64']).columns.tolist() # Select only numerical columns
    num_cols = [col for col in num_cols if col not in ['CLASS', 'ID']] # Exclude special columns

    mapping = {} # Initialize dictionary for mapping

    for col in num_cols: # Iterate over numerical columns

        if bintype == 'equal-width': # Use cut
            _, bins = pd.cut(df1[col], nobins, labels=False, retbins=True) # Only getting bins
        elif bintype == 'equal-size': # Use qcut
            _, bins = pd.qcut(df1[col], nobins, labels=False, retbins=True, duplicates='drop') # Only getting bins

        bins[0], bins[-1] = -np.inf, np.inf # Change bin boundaries to infinity
        
        df1[col] = pd.cut(df1[col], bins) # Apply the binning strategy

        labels = {} # Initialize dictionary to get labels
        for i in range(len(bins)-1): # Loop for nobins to get according labels (they are ordered already)
            labels[df1[col].cat.categories[i]] = i

        df1[col] = df1[col].replace(labels) # Replace bins for labels

        mapping[col] = bins # Map column to bins

    return df1, mapping

def apply_bins(df, binning):
    df1 = df.copy() # Copying dataframe

    for col, bins in binning.items():
        df1[col] = pd.cut(df1[col], bins)

        labels = {} # Initialize dictionary to get labels
        for i in range(len(bins)-1): # Loop for nobins to get according labels (they are ordered already)
            labels[df1[col].cat.categories[i]] = i

        df1[col] = df1[col].replace(labels) # Replace bins for labels
    
    return df1

def create_one_hot(df):
    df1 = df.copy() # Copy original dataframe
    cat_cols = df1.select_dtypes(include=['object', 'category']).columns.tolist() # Get all categorical columns
    cat_cols = [col for col in cat_cols if col not in ['CLASS', 'ID']] # Exclude special columns

    ### Assumes imputation has already been applied, so there are no null values to handle here

    mapping = {} # Initialize the mapping dictionary

    columns = [] # Initialize a list to store all the columns generated

    for col in cat_cols: # Iterate over all categorical columns
        mapping[col] = np.sort(df1[col].unique()).tolist() # Get all possible categories of the column

        for val in mapping[col]: # Iterate over the possible values of the column
            encoded_col = df1[col].where(df1[col] == val).fillna(0).replace({val:1}).astype('float64')\
                .rename(col + '_' + val) # Create the encoded column via series manipulation
            columns.append(encoded_col) # Append the column to the list of new columns

    encoded_df = pd.concat(columns + [df1.drop(cat_cols, axis=1)], axis=1) # Concatenate new columns with the remaining (non-categorical) columns

    return encoded_df, mapping

def apply_one_hot(df, mapping):
    df1 = df.copy() # Copy of the input dataframe

    columns = [] # Initialize list to store all new columns

    for col, values in mapping.items():
        for val in values: # Use the same code as in the previous function
            encoded_col = df1[col].where(df1[col] == val).fillna(0).replace({val:1}).astype('float64')\
                .rename(col + '_' + val) # Create the encoded column via series manipulation
            columns.append(encoded_col) # Append the column to the list of new columns

    encoded_df = pd.concat(columns + [df1.drop(list(mapping.keys()), axis=1)], axis=1) # Concatenate new columns with the remaining (non-categorical) columns

    return encoded_df

def split(df, testfraction=0.5):
    # Calculate the number of test instances
    num_test_instances = int(len(df) * testfraction)
    
    # Get a permuted list of indexes from the DataFrame
    permuted_indices = np.random.permutation(df.index)
    
    # Split the indices into training and test indices
    test_indices = permuted_indices[:num_test_instances]
    train_indices = permuted_indices[num_test_instances:]
    
    # Create the training and test DataFrames
    trainingdf = df.loc[train_indices]
    testdf = df.loc[test_indices]
    
    return trainingdf, testdf

def accuracy(df, correctlabels):
    # Predict the label with the highest probability for each row;
    # ties are resolved by picking the first such label in column order
    predictions = df.apply(lambda x: x.index[x.values == x.max()][0], axis=1)
    # Calculate accuracy as the fraction of predictions that match the correct labels
    correct_predictions = sum(predictions == correctlabels)
    return correct_predictions / len(correctlabels)

def folds(df, nofolds=10):
    # Shuffle the row positions of the DataFrame (positions rather than index labels,
    # because the folds are selected with iloc)
    shuffled_indices = np.random.permutation(len(df))
    # Calculate the size of each fold
    fold_size = len(df) // nofolds
    # Initialize the list of folds
    folds = []
    
    # Create each fold
    for i in range(nofolds):
        # Determine the start and end indices of the fold
        start_index = i * fold_size
        # If it's the last fold, it should contain all remaining instances
        if i == nofolds - 1:
            end_index = len(df)
        else:
            end_index = (i + 1) * fold_size
        # Append the fold to the list of folds
        folds.append(df.iloc[shuffled_indices[start_index:end_index]])
    
    return folds

def brier_score(df, correctlabels):
    df1 = df.copy() # Copying original dataframe

    brier_scores = [] # Initialize a list to store all the brier scores to later apply a mean

    for i, row in df1.iterrows(): # Iterate over the dataframe's rows
        idx = np.where(df1.columns == correctlabels[i])[0] # Get the index of the column that corresponds with the correct label
        correct_label_vector = np.zeros(len(row)) # Generate an empty vector that will contain the probabilities
        correct_label_vector[idx] = 1 # Replace the index of the correct label with one
        brier_scores.append(np.sum((row - correct_label_vector)**2))# Apply the brier scores formula
    
    return np.mean(brier_scores) # Return the final brier score

def auc_binary(predictions, correctlabels, target_label):
    # Get all the true labels for the target label
    true_labels = (np.array(correctlabels) == target_label) # Expect to get a binary array where true represents the positive class

    # Get the predicted probabilities (used as scores) of the positive class, i.e., the target label
    positive_class_probabilities = predictions[target_label]

    # Sort the predictions in descending order
    sorted_idx = np.argsort(positive_class_probabilities)[::-1].tolist() # Sort the indices
    sorted_labels = true_labels[sorted_idx] # Sort the labels based on the sorted index

    # Get the number of positive and negative predictions
    n_pos = sum(sorted_labels) # Number of positives is the sum of the positive cases
    n_neg = len(sorted_labels) - n_pos # Number of negatives is the total number of predictions minus the number of positives

    # Calculate the TPR and FPR
    tpr = np.cumsum(sorted_labels) / n_pos
    fpr = np.cumsum(~sorted_labels) / n_neg # Negatives are the complement of the positives, thus the use of ~

    # Calculate the auc using trapezoidal rule (considers the three possible cases for calculating area)
    auc = np.trapz(tpr, fpr)

    return auc

def auc(predictions, correctlabels):
    true_labels = pd.Series(correctlabels) # Convert the correct labels into a Series to apply certain methods to it
    labels = true_labels.unique().tolist() # Get all unique labels
    
    auc_scores = [] # Initialize list for storing the auc values

    # Calculate the binary auc for each label and add the score to the list
    for label in labels:
        auc_scores.append(auc_binary(predictions, correctlabels, label))

    auc_scores = np.array(auc_scores) # Transform into numpy array for broadcast operations

    # Calculate the relative frequency of each label
    rel_freq = (true_labels.value_counts().loc[labels] / true_labels.count()).to_numpy() # Using loc in value counts to be sure of order of scores

    # Calculate the weighted auc
    weighted_auc = np.sum(auc_scores * rel_freq) / sum(rel_freq)

    return weighted_auc

## 1. Define the class RandomForest

In [4]:
# Define the class RandomForest with three functions __init__, fit and predict (after the comments):

# Input to __init__: 
# self - the object itself

# Output from __init__:
# <nothing>

# This function does not return anything but just initializes the following attributes of the object (self) to None:
# column_filter, imputation, one_hot, labels, model

# Input to fit:
# self      - the object itself
# df        - a dataframe (where the column names "CLASS" and "ID" have special meaning)
# no_trees  - no. of trees in the random forest (default = 100)

# Output from fit:
# <nothing>

# The result of applying this function should be:

# self.column_filter - a column filter (see Assignment 1) from df
# self.imputation    - an imputation mapping (see Assignment 1) from df
# self.one_hot       - a one-hot mapping (see Assignment 1) from df
# self.labels        - a (sorted) list of the categories of the "CLASS" column of df
# self.model         - a random forest, consisting of no_trees trees, where each tree is generated from a bootstrap sample
#                      and the number of evaluated features is log2|F| where |F| is the total number of features
#                      (for details, see lecture slides)

# Note that the function does not return anything but just assigns values to the attributes of the object.

# Hint 1: First create the column filter, imputation and one-hot mappings

# Hint 2: Then get the class labels and the numerical values (as an ndarray) from the dataframe after dropping the class labels 

# Hint 3: Generate no_trees classification trees, where each tree is generated using DecisionTreeClassifier 
#         from a bootstrap sample (see lecture slides), e.g., generated by np.random.choice (with replacement) 
#         from the row numbers of the ndarray, and where a random sample of the features are evaluated in
#         each node of each tree, of size log2(|F|), where |F| is the total number of features;
#         see the parameter max_features of DecisionTreeClassifier

# Input to predict:
# self - the object itself
# df   - a dataframe

# Output from predict:
# predictions - a dataframe with class labels as column names and the rows corresponding to
#               predictions with estimated class probabilities for each row in df, where the class probabilities
#               are the averaged probabilities output by each decision tree in the forest

# Hint 1: Drop any "CLASS" and "ID" columns of the dataframe first and then apply column filter, imputation and one_hot

# Hint 2: Iterate over the trees in the forest to get the prediction of each tree by the method predict_proba(X) where 
#         X are the (numerical) values of the transformed dataframe; you may get the average predictions of all trees,
#         by first creating a zero-matrix with one row for each test instance and one column for each class label, 
#         to which you add the prediction of each tree on each iteration, and then finally divide the prediction matrix
#         by the number of trees.

# Hint 3: You may assume that each bootstrap sample that was used to generate each tree has included all possible
#         class labels and hence the prediction of each tree will contain probabilities for all class labels
#         (in the same order). Note that this assumption may be violated, and this limitation will be addressed 
#         in the next part of the assignment. 

class RandomForest():
    def __init__(self):
        self.column_filter = None
        self.imputation = None
        self.one_hot = None
        self.labels = None
        self.model = None

    def fit(self, df, no_trees=100):
        
        # Set the mapping for data manipulation
        df, self.column_filter = create_column_filter(df)
        df, self.imputation = create_imputation(df)
        df, self.one_hot = create_one_hot(df)

        # Create a sorted list of the categories in the CLASS column
        self.labels = df['CLASS'].astype('category').cat.categories.sort_values().tolist()

        # Get the training data as an ndarray
        training_features = df.drop(['CLASS'], axis=1).values

        # Get the labels as an ndarray
        training_labels = df['CLASS'].values

        # Create an array to store all the DecisionTrees
        trees = np.empty(no_trees, dtype=DecisionTreeClassifier)

        # Create a loop to train each of the decision trees
        for i in range(no_trees):

            # Create bootstrap indices with replacement
            bootstrap_indices = np.random.choice(len(training_features), size=len(training_features), replace=True)

            # Create the bootstrap sample
            bootstrap_features = training_features[bootstrap_indices]
            bootstrap_labels = training_labels[bootstrap_indices]

            # Create a tree with max_features = log2|F|
            tree = DecisionTreeClassifier(max_features='log2')

            # Fit the tree with the bootstrap samples
            tree.fit(bootstrap_features, bootstrap_labels)

            # Store the fitted tree into the array of trees
            trees[i] = tree

        # Store the array of trees as model
        self.model = trees

    def predict(self, df):

        # Apply the mappings for data manipulation
        df = apply_column_filter(df, self.column_filter)
        df = apply_imputation(df, self.imputation)
        df = apply_one_hot(df, self.one_hot)

        # Get the features
        test_features = df.drop(['CLASS'], axis=1).values

        # Initialize a matrix to store the predicted probabilities of each model
        prediction_matrix = np.zeros((len(test_features), len(self.labels)))

        # Iterate over the trees generated in the fit method
        for tree in self.model:
            # Get class probabilities using predict proba
            class_probabilities = tree.predict_proba(test_features)

            # Add the probabilities to the matrix
            prediction_matrix += class_probabilities

        # Get the average probabilities of all the trees
        avg_proba = prediction_matrix / len(self.model)

        # Create a dataframe with the class labels as columns and the data as the average probabilities
        predictions = pd.DataFrame(avg_proba, columns=self.labels)

        return predictions
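
The two sources of randomness described in the comments above (bootstrap sampling and evaluating about log2|F| features per node) can be illustrated in isolation. The following is a minimal sketch with NumPy only; the row count `n_rows` and the feature count `n_features` are hypothetical values chosen for illustration, and the rounding shown for log2|F| (round down, at least 1) is an assumption about how `max_features='log2'` is resolved.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n_rows = 10_000                 # hypothetical training-set size

# Bootstrap sample: draw n_rows row numbers with replacement
bootstrap_idx = rng.choice(n_rows, size=n_rows, replace=True)

# Rows that were drawn at least once are in-bag; the rest are out-of-bag
in_bag = np.unique(bootstrap_idx)
out_of_bag = np.setdiff1d(np.arange(n_rows), in_bag)

# For large n the in-bag fraction approaches 1 - 1/e (about 0.632)
in_bag_fraction = len(in_bag) / n_rows
print(f"In-bag fraction: {in_bag_fraction:.3f}")
print(f"1 - 1/e:         {1 - np.exp(-1):.3f}")

# Number of features evaluated per node for a hypothetical |F| = 27
# (assumption: log2|F| is rounded down and is at least 1)
n_features = 27
n_evaluated = max(1, int(np.log2(n_features)))
print(f"Features evaluated per node: {n_evaluated}")
```

Roughly a third of the rows are left out of each bootstrap sample, which is why each tree sees a different view of the training set.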

In [9]:
# Test your code (leave this part unchanged, except for if auc is undefined)

train_df = pd.read_csv("tic-tac-toe_train.csv")

test_df = pd.read_csv("tic-tac-toe_test.csv")

rf = RandomForest()

t0 = time.perf_counter()
rf.fit(train_df)
print("Training time: {:.2f} s.".format(time.perf_counter()-t0))

test_labels = test_df["CLASS"]

t0 = time.perf_counter()
predictions = rf.predict(test_df)

print("Testing time: {:.2f} s.".format(time.perf_counter()-t0))

print("Accuracy: {:.4f}".format(accuracy(predictions,test_labels)))
print("AUC: {:.4f}".format(auc(predictions,test_labels))) # Comment this out if not implemented in assignment 1
print("Brier score: {:.4f}".format(brier_score(predictions,test_labels))) # Comment this out if not implemented in assignment 1

Training time: 0.42 s.
Testing time: 0.18 s.
Accuracy: 0.9123
AUC: 0.9907
Brier score: 0.1756


In [10]:
train_labels = train_df["CLASS"]
predictions = rf.predict(train_df)
print("Accuracy on training set: {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.4f}".format(auc(predictions,train_labels))) # Comment this out if not implemented in assignment 1
print("Brier score on training set: {0:.4f}".format(brier_score(predictions,train_labels))) # Comment this out if not implemented in assignment 1

Accuracy on training set: 1.0000
AUC on training set: 1.0000
Brier score on training set: 0.0206


### Comment on assumptions, things that do not work properly, etc.


## 2a. Handling trees with non-aligned predictions

In [11]:
# Define a revised version of the class RandomForest with the same input and output as described in part 1 above,
# where the predict function is able to handle the case where the individual trees are trained on bootstrap samples
# that do not include all class labels in the original training set. As a result, the class probabilities output
# by the individual trees in the forest do not all refer to the same set of class labels.

# Hint 1: The categories obtained with <pandas series>.cat.categories are sorted in the same way as the class labels
#         of a DecisionTreeClassifier; the latter are obtained by <DecisionTreeClassifier>.classes_ 
#         The problem is that classes_ may not include all possible labels, and hence the individual predictions 
#         obtained by <DecisionTreeClassifier>.predict_proba may be of different length or even if they are of the same
#         length do not necessarily refer to the same class labels. You may assume that each class label that is not included
#         in a bootstrap sample should be assigned zero probability by the tree generated from the bootstrap sample. 

# Hint 2: Create a mapping from the complete (and sorted) set of class labels l0, ..., lk-1 to a set of indexes 0, ..., k-1,
#         where k is the number of classes

# Hint 3: For each tree t in the forest, create a (zero) matrix with one row per test instance and one column per class label,
#         to which one column is added at a time from the output of t.predict_proba 

# Hint 4: For each column output by t.predict_proba, its index i may be used to obtain its label by t.classes_[i];
#         you may then obtain the index of this label in the ordered list of all possible labels from the above mapping (hint 2); 
#         this index points to which column in the prediction matrix the output column should be added to 

class RandomForest():
    def __init__(self):
        self.column_filter = None
        self.imputation = None
        self.one_hot = None
        self.labels = None
        self.model = None
        self.label_mapping = None # Adding the mapping to the attributes of the model

    def fit(self, df, no_trees=100):
        
        # Set the mapping for data manipulation
        df, self.column_filter = create_column_filter(df)
        df, self.imputation = create_imputation(df)
        df, self.one_hot = create_one_hot(df)

        # Create a sorted list of the categories in the CLASS column
        self.labels = df['CLASS'].astype('category').cat.categories.sort_values()

        # Create the mapping from class label to column index
        self.label_mapping = {label: i for i, label in enumerate(self.labels)}

        # Get the training data as an ndarray
        training_features = df.drop(['CLASS'], axis=1).values

        # Get the labels as an ndarray
        training_labels = df['CLASS'].values

        # Create an array to store all the DecisionTrees
        trees = np.empty(no_trees, dtype=DecisionTreeClassifier)

        # Create a loop to train each of the decision trees
        for i in range(no_trees):

            # Create bootstrap indices with replacement
            bootstrap_indices = np.random.choice(len(training_features), size=len(training_features), replace=True)

            # Create the bootstrap sample
            bootstrap_features = training_features[bootstrap_indices]
            bootstrap_labels = training_labels[bootstrap_indices]

            # Create a tree with max_features = log2|F|
            tree = DecisionTreeClassifier(max_features='log2')

            # Fit the tree with the bootstrap samples
            tree.fit(bootstrap_features, bootstrap_labels)

            # Store the fitted tree into the array of trees
            trees[i] = tree

        # Store the array of trees as model
        self.model = trees

    def predict(self, df):

        # Apply the mappings for data manipulation
        df = apply_column_filter(df, self.column_filter)
        df = apply_imputation(df, self.imputation)
        df = apply_one_hot(df, self.one_hot)

        # Get the features
        test_features = df.drop(['CLASS'], axis=1).values

        # Initialize an array for storing the matrices
        prediction_matrices = np.empty(len(self.model), dtype=np.ndarray)

        # Iterate over the trees generated in the fit method
        for index, tree in enumerate(self.model):
            # Get class probabilities using predict proba
            class_probabilities = tree.predict_proba(test_features)

            # Initialize a matrix to store the predicted probabilities of each model
            prediction_matrix = np.zeros((len(test_features), len(self.labels)))

            # Add the probabilities to the matrix using the mapping
            for i, column_proba in enumerate(class_probabilities.T):
                # Get the corresponding label for the column
                label = tree.classes_[i]

                # Map the label to its index using the mapping produced in the fit method
                idx = self.label_mapping[label]

                # Add the probabilities to the corresponding column
                prediction_matrix[:, idx] += column_proba

            # Add the matrix to the corresponding index in the array of matrices
            prediction_matrices[index] = prediction_matrix

        # Get the average probabilities of all the trees
        avg_proba = np.mean(prediction_matrices, axis=0)

        # Create a dataframe with the class labels as columns and the data as the average probabilities
        predictions = pd.DataFrame(avg_proba, columns=self.labels)

        return predictions
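
The alignment step from hints 2 to 4 can be checked on a small hand-made case. This is a minimal sketch, assuming a tree that only saw two of three class labels; the helper name `align_probabilities` and the toy arrays are hypothetical, and `tree_classes` stands in for what a fitted tree would expose as `classes_`.

```python
import numpy as np

def align_probabilities(proba, tree_classes, all_labels):
    """Expand proba (columns ordered as tree_classes) to one column per label in all_labels."""
    # Mapping from each label in the complete label list to its column index (hint 2)
    label_to_idx = {label: i for i, label in enumerate(all_labels)}
    # Zero matrix: labels the tree never saw keep probability zero (hint 3)
    aligned = np.zeros((proba.shape[0], len(all_labels)))
    # Copy each output column to the column of its label (hint 4)
    for i, label in enumerate(tree_classes):
        aligned[:, label_to_idx[label]] = proba[:, i]
    return aligned

all_labels = ["a", "b", "c"]        # complete, sorted set of class labels
tree_classes = ["a", "c"]           # labels present in this tree's bootstrap sample
proba = np.array([[0.25, 0.75],     # stand-in for the output of predict_proba
                  [1.00, 0.00]])

aligned = align_probabilities(proba, tree_classes, all_labels)
print(aligned)
```

The column for label `b` stays at zero, and each row still sums to one, so aligned matrices from different trees can be added and averaged directly.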

In [14]:
# Test your code (leave this part unchanged, except for if auc is undefined)

train_df = pd.read_csv("anneal_train.csv")

test_df = pd.read_csv("anneal_test.csv")

rf = RandomForest()

t0 = time.perf_counter()
rf.fit(train_df)
print("Training time: {:.2f} s.".format(time.perf_counter()-t0))

test_labels = test_df["CLASS"]

t0 = time.perf_counter()
predictions = rf.predict(test_df)
print("Testing time: {:.2f} s.".format(time.perf_counter()-t0))

print("Accuracy: {:.4f}".format(accuracy(predictions,test_labels)))
print("AUC: {:.4f}".format(auc(predictions,test_labels))) # Comment this out if not implemented in assignment 1
print("Brier score: {:.4f}".format(brier_score(predictions,test_labels))) # Comment this out if not implemented in assignment 1

Training time: 0.45 s.
Testing time: 0.14 s.
Accuracy: 0.9488
AUC: 0.9695
Brier score: 0.0996


## 2b. Estimate predictive performance using out-of-bag predictions

In [8]:
# Define an extended version of the class RandomForest with the same input and output as described in part 2a above,
# where the results of the fit function also should include:
# self.oob_acc - the accuracy estimated on the out-of-bag predictions, i.e., the fraction of training instances for 
#                which the given (correct) label is the same as the predicted label when using only trees for which
#                the instance is out-of-bag
#
# Hint 1: You may first create a zero matrix with one row for each training instance and one column for each class label
#         and one zero vector to allow for storing aggregated out-of-bag predictions and the number of out-of-bag predictions
#         for each training instance, respectively. By "aggregated out-of-bag predictions" is here meant the sum of all 
#         predicted probabilities (one sum per class and instance). These sums should be divided by the number of predictions
#         (stored in the vector) in order to obtain a single class probability distribution per training instance. 
#         This distribution is considered to be the out-of-bag prediction for each instance, and e.g., the class that 
#         receives the highest probability for each instance can be compared to the correct label of the instance, 
#         when calculating the accuracy using the out-of-bag predictions.
#
# Hint 2: After generating a tree in the forest, iterate over the indexes that were not included in the bootstrap sample
#         and add a prediction of the tree to the out-of-bag prediction matrix and update the count vector
#
# Hint 3: Note that the input to predict_proba has to be a matrix; from a single vector (row) x, a matrix with one row
#         can be obtained by x[None,:]
#
# Hint 4: Finally, divide each row in the out-of-bag prediction matrix with the corresponding element of the count vector
#
#         For example, assuming that we have two class labels, then we may end up with the following matrix:
#
#         2 4
#         4 4
#         5 0
#         ...
#
#         and the vector (no. of predictions) (6, 8, 5, ...)
#
#         The resulting class probability distributions are:
#
#         0.333... 0.666...
#         0.5 0.5
#         1.0 0
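
The out-of-bag bookkeeping described in the hints can be sketched as a standalone function. This is a minimal illustration under stated assumptions, not the implementation used to produce the results below: the function name `oob_accuracy`, its `seed` parameter and the synthetic data are hypothetical, it works directly on ndarrays rather than on a preprocessed dataframe, and it predicts all out-of-bag rows of a tree in one call instead of iterating row by row as in hints 2 and 3.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_accuracy(X, y, no_trees=100, seed=None):
    """Fit bagged trees and estimate accuracy from out-of-bag predictions."""
    rng = np.random.default_rng(seed)
    labels = np.unique(y)                    # sorted, same order as tree.classes_
    n = len(X)
    oob_sums = np.zeros((n, len(labels)))    # aggregated out-of-bag probabilities
    oob_counts = np.zeros(n)                 # no. of out-of-bag predictions per row

    for _ in range(no_trees):
        # Bootstrap sample and the rows left out of it
        in_bag = rng.choice(n, size=n, replace=True)
        oob = np.setdiff1d(np.arange(n), in_bag)

        tree = DecisionTreeClassifier(max_features="log2",
                                      random_state=int(rng.integers(2**31 - 1)))
        tree.fit(X[in_bag], y[in_bag])
        if len(oob) == 0:
            continue

        # Column positions of this tree's classes in the complete label list
        cols = np.searchsorted(labels, tree.classes_)
        oob_sums[np.ix_(oob, cols)] += tree.predict_proba(X[oob])
        oob_counts[oob] += 1

    # Only rows that were out-of-bag at least once have a prediction
    seen = oob_counts > 0
    oob_proba = oob_sums[seen] / oob_counts[seen, None]
    predicted = labels[np.argmax(oob_proba, axis=1)]
    return np.mean(predicted == y[seen])

# Usage on synthetic data (hypothetical, for illustration only)
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(300, 4))
y = np.where(X[:, 0] + X[:, 1] > 0, "pos", "neg")
acc = oob_accuracy(X, y, no_trees=50, seed=0)
print(f"OOB accuracy: {acc:.3f}")
```

Dividing `oob_sums` by `oob_counts` row by row is the same operation as in the worked example of hint 4, where the rows (2, 4), (4, 4) and (5, 0) are divided by 6, 8 and 5.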

In [9]:
# Test your code (leave this part unchanged, except for if auc is undefined)

train_df = pd.read_csv("anneal_train.csv")

test_df = pd.read_csv("anneal_test.csv")

rf = RandomForest()

t0 = time.perf_counter()
rf.fit(train_df)
print("Training time: {:.2f} s.".format(time.perf_counter()-t0))

print("OOB accuracy: {:.4f}".format(rf.oob_acc))

test_labels = test_df["CLASS"]

t0 = time.perf_counter()
predictions = rf.predict(test_df)
print("Testing time: {:.2f} s.".format(time.perf_counter()-t0))

print("Accuracy: {:.4f}".format(accuracy(predictions,test_labels)))
print("AUC: {:.4f}".format(auc(predictions,test_labels))) # Comment this out if not implemented in assignment 1
print("Brier score: {:.4f}".format(brier_score(predictions,test_labels))) # Comment this out if not implemented in assignment 1

Training time: 1.96 s.
OOB accuracy: 0.9555
Testing time: 0.06 s.
Accuracy: 0.9488
AUC: 0.9718
Brier score: 0.0986


In [10]:
train_labels = train_df["CLASS"]
rf = RandomForest()
rf.fit(train_df)
predictions = rf.predict(train_df)
print("Accuracy on training set: {0:.2f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.2f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.2f}".format(brier_score(predictions,train_labels)))

Accuracy on training set: 1.00
AUC on training set: 1.00
Brier score on training set: 0.01


### Comment on assumptions, things that do not work properly, etc.