# Assignment 3
## Group name: [ID2214 - Group - 15]
### Project members: 
[Adalet Adiljan, adalat@kth.se]

### Declaration:
By submitting this assignment, it is hereby declared that all group members listed above have contributed to the solution, either with code that appear in the final solution below, or with code that has been evaluated and compared to the final solution, but for some reason has been excluded. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214/FID3214), no part of the solution has been provided by someone not listed as a project member above, and no part of the solution has been generated by a system.

It is furthermore declared that the submitted assignment will not be shared during the course, with any individual other than the group members listed above and teachers of the course ID2214/FID3214. In particular, the assignment will not be uploaded to any public repository. The submitted assignment can be shared after the course only if written consent has been provided by the course responsible of ID2214/FID3214.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy, pandas, time and sklearn.tree, may be used in the solution for this assignment.

### Instructions
All parts of the assignment starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of all parts of the assignment starting 
with number 2 below, then the assignment will receive 2 points (in total).

Note that you do not have to develop the code directly within the notebook
but may instead copy the comments and test cases to a more convenient development environment
and when everything works as expected, you may paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above).

## Load NumPy, pandas, time and DecisionTreeClassifier from sklearn.tree

In [1]:
import numpy as np
import pandas as pd
import time
import sklearn
from sklearn.tree import DecisionTreeClassifier

In [2]:
from platform import python_version

print(f"Python version: {python_version()}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"sklearn version: {sklearn.__version__}")

Python version: 3.12.7
NumPy version: 1.26.4
Pandas version: 2.2.2
sklearn version: 1.5.2


## Reused functions from Assignment 1

In [3]:
# Copy and paste functions from Assignment 1 here that you need for this assignment

# Copy and paste functions from Assignment 1 here that you need for this assignment
def create_column_filter(df):
    new_df = df.copy()
    for col in new_df:
        if col in ['CLASS','ID']:   # skip CLASS, ID
            continue
        if new_df[col].nunique() > 1: # skip if >1 unique value in col
            continue
        new_df = new_df.drop(columns=[col]) # drop col
    column_filter = new_df.columns
    return new_df, column_filter

def apply_column_filter(df,column_filter):
    new_df = df.copy()
    new_df = new_df[column_filter]
    return new_df


def create_normalization(df,normalizationtype='minmax'):
    if normalizationtype not in ['minmax','zscore']:
        raise ValueError('Not a valid normalization type')
        
    new_df = df.copy()
    newdict = {}
    for col in new_df:
        if col in ['CLASS','ID']: # skip CLASS, ID
            continue
        if normalizationtype == 'minmax':
            min = new_df[col].min()
            max = new_df[col].max()
            new_df[col] = [(x-min)/(max-min) for x in new_df[col]] # minmax formula
            newdict[col] = (normalizationtype,min,max)
        else:
            mean = new_df[col].mean()
            std = new_df[col].std()
            new_df[col] = new_df[col].apply(lambda x: (x-mean)/std) # zscore formula
            newdict[col] = (normalizationtype,mean,std)
    return new_df, newdict

def apply_normalization(df,normalization):
    new_df = df.copy()
    for key, value in normalization.items():
        if key in ['CLASS','ID']:
            continue
        
        if value[0] == 'minmax':
            new_df[key] = [(x-value[1])/(value[2]-value[1]) for x in new_df[key]]
            
        if value[0] == 'zscore':
            new_df[key] = new_df[key].apply(lambda x: (x-value[1])/value[2])
    return new_df


def create_imputation(df):
    new_df = df.copy()
    imputation = {}
    for col in new_df:
        if col in ['CLASS','ID']:
            continue
        
        if new_df[col].dtype in ['float64','int64']:
            if new_df[col].nunique() == 0: # if col is completely empty
                mean = 0 # hint 4
            else:
                mean = new_df[col].mean()
            
            new_df[col] = new_df[col].fillna(mean) # replace NaN with mean
            imputation[col] = mean
            
        if new_df[col].dtype in ['object','category']:
            if new_df[col].dtype == 'object' and new_df[col].nunique() == 0: # if col is object and completely empty
                new_df[col] = "" # hint 4
                
            if new_df[col].dtype == 'category' and new_df[col].nunique() == 0: # if col is category and completely empty
                new_df[col] = new_df[col].fillna(new_df.categories[0]) # hint 4
                
            new_df[col] = new_df[col].fillna(new_df[col].mode()[0]) # replace NaN with mode
            imputation[col] = new_df[col].mode()[0]
    
    return new_df,imputation


def apply_imputation(df,imputation):
    new_df = df.copy()
    for key, value in imputation.items():
        if key in ['CLASS','ID']:
            continue
        
        new_df[key] = new_df[key].fillna(value)
    return new_df


def create_bins(df,nobins=10,bintype='equal-width'):
    if bintype not in ['equal-width','equal-size']:
        raise ValueError('Not a valid bintype')
        
    new_df = df.copy()
    binning = {}
    
    for col in new_df:
        if col in ['CLASS','ID']:
            continue
        
        if new_df[col].dtype in ['float64','int64']:
            
            if bintype == 'equal-width':
                res, bins = pd.cut(new_df[col],nobins,labels=False,retbins=True)
            
            if bintype == 'equal-size':
                res, bins = pd.qcut(new_df[col],nobins,labels=False,retbins=True,duplicates='drop') # drop because Bin edges must be unique (source, pandas docs)
                
            # hint 6
            bins[0] = -np.inf
            bins[-1] = np.inf
            
            new_df[col] = res # set the column to bin category/index
            binning[col] = bins # dict
        
    for col in new_df:
        new_df[col] = new_df[col].astype('category') # hint 4
        
        if col in binning:
            new_df[col] = new_df[col].cat.set_categories(range(nobins)) # hint 5
            
    return new_df,binning

def apply_bins(df,binning):
    new_df = df.copy()
    
    for col in new_df:
        if col in binning:
            new_res = pd.cut(new_df[col],binning[col],labels=False)
            
            new_df[col] = new_res
            
            new_df[col] = new_df[col].astype('category') # hint 4
            
            new_df[col] = new_df[col].cat.set_categories(range(len(binning[col])-1)) # hint 5
  
    return new_df


def create_one_hot(df):
    new_df = df.copy()
    one_hot = {}
    
    for col in new_df:
        if col in ['CLASS','ID']:
            continue
        
        if new_df[col].dtype in ['object','category']:
            categories = sorted(new_df[col].unique()) # save all unique "categories" (values)
                                                      # sorted just to have them in same order as provided output
            for val in categories:
                new_df[col+'-'+val] = (new_df[col] == val).astype(float) # set new values to float
                                                                         # True translates to 1
                                                                         # False translates to 0
            one_hot[col] = categories
            new_df.drop(columns=col,inplace=True) # drop original
    return df,one_hot

def apply_one_hot(df,one_hot):
    new_df = df.copy()
    for key, value in one_hot.items():
        if key in ['CLASS','ID']:
            continue
        
        for val in value:
            new_df[key+'-'+val] = (new_df[key] == val).astype(float) 
        
        new_df.drop(columns=key,inplace=True)
    return new_df

def accuracy(df,correctlabels):
    if len(correctlabels) != len(df):
        raise ValueError('#correctlabels and #rows do not match')
    
    col_index = np.apply_along_axis(lambda x:np.argmax(x),1,df) # get maxindex from row on column axis
    actual = []
    for x in col_index: 
        actual.append(df.columns[x]) # get the column name from index
        
    true_positives = 0
    for i, label in enumerate(actual):
        if label == correctlabels[i]:
            true_positives += 1
    
    accuracy = true_positives / len(correctlabels)
    return accuracy

def brier_score(predictions,correctlabels):
    if len(predictions) != len(correctlabels):
        raise ValueError("Both arguments must have the same length.")
        
    square_error_sum = []
    
    for i, label in enumerate(correctlabels):
        predicted_vector = predictions.iloc[i].values # = values in row i
        
        correctlabel_index = np.where(predictions.columns==label)[0] # column index matching current label
        
        cli_vector = np.zeros(len(predictions.columns)) # correct label index vector
        cli_vector[correctlabel_index] = 1 # index for current label = 1, effectively creating a one-hot
        
        square_error = sum((predicted_vector - cli_vector)**2)
        
        square_error_sum.append(square_error)
    
    brier_score = np.mean(square_error_sum)
    return brier_score


def auc(df,correctlabels):
    predictions = df.copy()
    
    col_index = np.apply_along_axis(lambda x:np.argmax(x),1,predictions) # get maxindex from row on column axis
    
    c_labels = [] # list that will become binary tp / fp
    score_cols = predictions.columns # saving the col names with scores
    
    for x in col_index: 
       c_labels.append(predictions.columns[x]) # getting the col name based on col index
    predictions['Positives'] = (pd.Series(c_labels) == correctlabels).astype(int)
    
    class_counts = pd.Series(correctlabels).value_counts(normalize=True) # count the frequency (no. of "A" and "B")
                                                                         # divided by tot no. of "correctlabels"
                                                                         # in this example:
                                                                         # A - 0.5 (2 A, divided by 4 tot)
                                                                         # B - 0.5 (2 B, divided by 4 tot)
    auc_total = 0
    for i in range(len(score_cols)): # create a new df for each class that contains col='score','tp','fp'
        current_class = score_cols[i]
        
        new_df = pd.DataFrame()
        new_df['score'] = predictions[current_class]
        new_df['tp'] = (pd.Series(correctlabels) == current_class).astype(int) # 1 where label matches, 0 where it doesn't
        new_df['fp'] = (pd.Series(correctlabels) != current_class).astype(int)
        
        #new_df = pd.concat([new_df, temp_df], ignore_index=True) # concat to a new_df
        
        new_df = new_df.groupby('score', as_index=False).sum()
        new_df = new_df.sort_values(by='score', ascending=False) # sort in decending order
        new_df = new_df.reset_index(drop=True)
        
        auc = 0
        cov_tp = 0
        tot_tp = new_df['tp'].sum()
        tot_fp = new_df['fp'].sum()
        for j in range(len(new_df)): # lecture formula for auc using the new df created
            if new_df['fp'][j] == 0:
                cov_tp += new_df['tp'][j]
            elif new_df['tp'][j] == 0:
                auc += (cov_tp/tot_tp)*(new_df['fp'][j]/tot_fp)
            else:
                auc += ((cov_tp/tot_tp)*(new_df['fp'][j]/tot_fp)+
                        (new_df['tp'][j]/tot_tp)*(new_df['fp'][j]/tot_fp)/2)
                cov_tp += new_df['tp'][j]
        
        auc_total += auc * class_counts[current_class] # auc * freq for current class, 
                                                       # summed by looping and doin it for all classes
    
    return auc_total


## 1. Define the class RandomForest

In [28]:
# Define the class RandomForest with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self - the object itself
# Output from __init__:
# <nothing>
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# column_filter, imputation, one_hot, labels, model
class RandomForest:
    def __init__(self):
        self.column_filter = None
        self.imputation = None
        self.one_hot = None
        self.labels = None
        self.model = None

# Input to fit:
# self      - the object itself
# df        - a dataframe (where the column names "CLASS" and "ID" have special meaning)
# no_trees  - no. of trees in the random forest (default = 100)
#
# Output from fit:
# <nothing>
#
# The result of applying this function should be:
#
# self.column_filter - a column filter (see Assignment 1) from df
# self.imputation    - an imputation mapping (see Assignment 1) from df
# self.one_hot       - a one-hot mapping (see Assignment 1) from df
# self.labels        - a (sorted) list of the categories of the "CLASS" column of df
# self.model         - a random forest, consisting of no_trees trees, where each tree is generated from a bootstrap sample
#                      and the number of evaluated features is log2|F| where |F| is the total number of features
#                      (for details, see lecture slides)
# Note that the function does not return anything but just assigns values to the attributes of the object.
    
    def fit(self, df, no_trees=100):
        df_filtered = df.drop(columns=['CLASS'])
        new_df, imputation = create_imputation(df)
        self.column_filter = df_filtered.columns.tolist()
        self.imputation = imputation

        df_one_hot = pd.get_dummies(new_df)
        
        self.one_hot = df_one_hot.columns.tolist()
        self.labels = sorted(df['CLASS'].unique())
        self.model = DecisionTreeClassifier(n_estimators=no_trees, max_features='sqrt')

#
# Hint 1: First create the column filter, imputation and one-hot mappings ( done)
#
# Hint 2: Then get the class labels and the numerical values (as an ndarray) from the dataframe after dropping the class labels 
        y = df['CLASS'].values
        X = df_one_hot.values 
#
# Hint 3: Generate no_trees classification trees, where each tree is generated using DecisionTreeClassifier 
#         from a bootstrap sample (see lecture slides), e.g., generated by np.random.choice (with replacement) 
#         from the row numbers of the ndarray, and where a random sample of the features are evaluated in
#         each node of each tree, of size log2(|F|), where |F| is the total number of features;
#         see the parameter max_features of DecisionTreeClassifier
#
        self.model = []
        for _ in range(no_trees):
            sample_indices = np.random.choice(X.shape[0],X.shape[0], replace=True)
            X_bootstrap = X[sample_indices]
            y_bootstrap = y[sample_indices]
    
            tree = DecisionTreeClassifier(max_features=int(np.log2(X.shape[1])))
            tree.fit(X_bootstrap, y_bootstrap)
            self.model.append(tree) # List of trained decision trees
        
# Input to predict:
# self - the object itself
# df   - a dataframe
# 
# Output from predict:
# predictions - a dataframe with class labels as column names and the rows corresponding to
#               predictions with estimated class probabilities for each row in df, where the class probabilities
#               are the averaged probabilities output by each decision tree in the forest
# 
# Hint 1: Drop any "CLASS" and "ID" columns of the dataframe first and then apply column filter, imputation and one_hot
#
# Hint 2: Iterate over the trees in the forest to get the prediction of each tree by the method predict_proba(X) where 
#         X are the (numerical) values of the transformed dataframe; you may get the average predictions of all trees,
#         by first creating a zero-matrix with one row for each test instance and one column for each class label, 
#         to which you add the prediction of each tree on each iteration, and then finally divide the prediction matrix
#         by the number of trees.
#
# Hint 3: You may assume that each bootstrap sample that was used to generate each tree has included all possible
#         class labels and hence the prediction of each tree will contain probabilities for all class labels
#         (in the same order). Note that this assumption may be violated, and this limitation will be addressed 
#         in the next part of the assignment. 

    def predict(self, df):
            # Drop 'CLASS' and 'ID' columns from the test data and apply imputation and one-hot encoding
            df_filtered = df.drop(columns=['CLASS', 'ID'], errors='ignore')
            new_data = apply_imputation(df, self.imputation)
    
            # One-hot encode the data (assuming df_one_hot is precomputed)
            categorical_columns = new_data.select_dtypes(include=['object', 'category']).columns
            df_one_hot = pd.get_dummies(new_data, columns=categorical_columns, drop_first=True)
            
            # Get feature values as numpy array
            X = df_one_hot.values
            num_samples = X.shape[0]
            num_classes = len(self.labels)
    
            # Initialize a matrix to accumulate probabilities
            prediction_results = np.zeros((num_samples, num_classes))
    
            # Iterate over each tree to get predictions
            for tree in self.model:
                proba = tree.predict_proba(X)  # Get predicted probabilities for each class
                prediction_results += proba  # Sum the probabilities
    
            # Average the predictions across all trees
            prediction_results /= len(self.model)
    
            # Create a DataFrame for the predictions with class labels as column names
            prediction_df = pd.DataFrame(prediction_results, columns=self.labels)
    
            return prediction_df


In [29]:
# Test your code (leave this part unchanged, except for if auc is undefined)

train_df = pd.read_csv("tic-tac-toe_train.csv")

test_df = pd.read_csv("tic-tac-toe_test.csv")

rf = RandomForest()

t0 = time.perf_counter()
rf.fit(train_df)
print("Training time: {:.2f} s.".format(time.perf_counter()-t0))

test_labels = test_df["CLASS"]

t0 = time.perf_counter()
predictions = rf.predict(test_df)

print("Testing time: {:.2f} s.".format(time.perf_counter()-t0))

print("Accuracy: {:.4f}".format(accuracy(predictions,test_labels)))
print("AUC: {:.4f}".format(auc(predictions,test_labels))) # Comment this out if not implemented in assignment 1
print("Brier score: {:.4f}".format(brier_score(predictions,test_labels))) # Comment this out if not implemented in assignment 1

TypeError: DecisionTreeClassifier.__init__() got an unexpected keyword argument 'n_estimators'

In [30]:
train_labels = train_df["CLASS"]
predictions = rf.predict(train_df)
print("Accuracy on training set: {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.4f}".format(auc(predictions,train_labels))) # Comment this out if not implemented in assignment 1
print("Brier score on training set: {0:.4f}".format(brier_score(predictions,train_labels))) # Comment this out if not implemented in assignment 1

TypeError: 'NoneType' object is not iterable

### Comment on assumptions, things that do not work properly, etc.


## 2a. Handling trees with non-aligned predictions

In [41]:
# Define a revised version of the class RandomForest with the same input and output as described in part 1 above,
# where the predict function is able to handle the case where the individual trees are trained on bootstrap samples
# that do not include all class labels in the original training set. This leads to that the class probabilities output
# by the individual trees in the forest do not refer to the same set of class labels.
#
# Hint 1: The categories obtained with <pandas series>.cat.categories are sorted in the same way as the class labels
#         of a DecisionTreeClassifier; the latter are obtained by <DecisionTreeClassifier>.classes_ 
#         The problem is that classes_ may not include all possible labels, and hence the individual predictions 
#         obtained by <DecisionTreeClassifier>.predict_proba may be of different length or even if they are of the same
#         length do not necessarily refer to the same class labels. You may assume that each class label that is not included
#         in a bootstrap sample should be assigned zero probability by the tree generated from the bootstrap sample. 
#
# Hint 2: Create a mapping from the complete (and sorted) set of class labels l0, ..., lk-1 to a set of indexes 0, ..., k-1,
#         where k is the number of classes
#
# Hint 3: For each tree t in the forest, create a (zero) matrix with one row per test instance and one column per class label,
#         to which one column is added at a time from the output of t.predict_proba 
#
# Hint 4: For each column output by t.predict_proba, its index i may be used to obtain its label by t.classes_[i];
#         you may then obtain the index of this label in the ordered list of all possible labels from the above mapping (hint 2); 
#         this index points to which column in the prediction matrix the output column should be added to 

from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd

class RandomForest:
    def __init__(self):
        self.column_filter = None
        self.imputation = None
        self.one_hot = None
        self.labels = None
        self.model = None  # List of decision trees in the random forest

    def fit(self, df, no_trees=100):
        # Drop 'CLASS' column (target variable)
        df_filtered = df.drop(columns=['CLASS'], errors='ignore')
        
        # Apply imputation
        new_df, imputation = create_imputation(df_filtered)
        self.column_filter = new_df.columns.tolist()  # Save column filter
        self.imputation = imputation  # Save imputation mapping
        
        # One-hot encode
        df_one_hot = pd.get_dummies(new_df)
        self.one_hot = df_one_hot.columns.tolist()  # Save one-hot mapping
        
        # Extract class labels
        self.labels = sorted(df['CLASS'].unique())
        
        # Feature matrix and target vector
        y = df['CLASS'].values
        X = df_one_hot.values
        
        # Build the forest
        self.model = []
        for _ in range(no_trees):
            # Bootstrap sample
            sample_indices = np.random.choice(X.shape[0], X.shape[0], replace=True)
            X_bootstrap = X[sample_indices]
            y_bootstrap = y[sample_indices]
            
            # Train tree
            tree = DecisionTreeClassifier(max_features=int(np.log2(X.shape[1])))
            tree.fit(X_bootstrap, y_bootstrap)
            self.model.append(tree)  # Add tree to forest

    def predict(self, df):
        # Drop 'CLASS' and 'ID' columns
        df_filtered = df.drop(columns=['CLASS', 'ID'], errors='ignore')
        new_data = apply_imputation(df_filtered, self.imputation)
        
        # Align one-hot encoding with training data
        df_one_hot = pd.get_dummies(new_data)
        df_one_hot = df_one_hot.reindex(columns=self.one_hot, fill_value=0)
        
        # Feature matrix
        X_test = df_one_hot.values
        
        # Initialize predictions matrix
        num_samples = X_test.shape[0]
        num_classes = len(self.labels)
        predictions = np.zeros((num_samples, num_classes))
        
        # Map class labels to indices
        label_to_index = {label: idx for idx, label in enumerate(self.labels)}
        
        for tree in self.model:
            # Get probabilities from the tree
            probas = tree.predict_proba(X_test)
            
            # Map tree's class labels to global indices
            for i, tree_label in enumerate(tree.classes_):
                global_index = label_to_index[tree_label]
                predictions[:, global_index] += probas[:, i]
        
        # Average predictions across all trees
        predictions /= len(self.model)
        
        # Convert predictions to DataFrame
        prediction_df = pd.DataFrame(predictions, columns=self.labels)
        return prediction_df



In [42]:
# Test your code (leave this part unchanged, except for if auc is undefined)

train_df = pd.read_csv("anneal_train.csv")

test_df = pd.read_csv("anneal_test.csv")

rf = RandomForest()

t0 = time.perf_counter()
rf.fit(train_df)
print("Training time: {:.2f} s.".format(time.perf_counter()-t0))

test_labels = test_df["CLASS"]

t0 = time.perf_counter()
predictions = rf.predict(test_df)
print("Testing time: {:.2f} s.".format(time.perf_counter()-t0))

print("Accuracy: {:.4f}".format(accuracy(predictions,test_labels)))
print("AUC: {:.4f}".format(auc(predictions,test_labels))) # Comment this out if not implemented in assignment 1
print("Brier score: {:.4f}".format(brier_score(predictions,test_labels))) # Comment this out if not implemented in assignment 1

Training time: 0.14 s.
Testing time: 0.06 s.
Accuracy: 0.9488
AUC: 0.9663
Brier score: 0.1051


## 2b. Estimate predictive performance using out-of-bag predictions

In [61]:
# Define an extended version of the class RandomForest with the same input and output as described in part 2a above,
# where the results of the fit function also should include:
# self.oob_acc - the accuracy estimated on the out-of-bag predictions, i.e., the fraction of training instances for 
#                which the given (correct) label is the same as the predicted label when using only trees for which
#                the instance is out-of-bag
#
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd

class RandomForest:

        def fit(self, df, no_trees=100):


# Hint 1: You may first create a zero matrix with one row for each training instance and one column for each class label
#         and one zero vector to allow for storing aggregated out-of-bag predictions and the number of out-of-bag predictions
#         for each training instance, respectively. By "aggregated out-of-bag predictions" is here meant the sum of all 
#         predicted probabilities (one sum per class and instance). These sums should be divided by the number of predictions
#         (stored in the vector) in order to obtain a single class probability distribution per training instance. 
#         This distribution is considered to be the out-of-bag prediction for each instance, and e.g., the class that 
#         receives the highest probability for each instance can be compared to the correct label of the instance, 
#         when calculating the accuracy using the out-of-bag predictions.
#
    zero_matrix = np.zeros((instances, class_labels)) # one row for each training instance & one column for each class label
    zero_vector = np.zeros(instances)

    aggregated_oob = sum(instances, class_labels)


            
# Hint 2: After generating a tree in the forest, iterate over the indexes that were not included in the bootstrap sample
#         and add a prediction of the tree to the out-of-bag prediction matrix and update the count vector
#
# Hint 3: Note that the input to predict_proba has to be a matrix; from a single vector (row) x, a matrix with one row
#         can be obtained by x[None,:]
#
# Hint 4: Finally, divide each row in the out-of-bag prediction matrix with the corresponding element of the count vector
#
#         For example, assuming that we have two class labels, then we may end up with the following matrix:
#
#         2 4
#         4 4
#         5 0
#         ...
#
#         and the vector (no. of predictions) (6, 8, 5, ...)
#
#         The resulting class probability distributions are:
#
#         0.333... 0.666...
#         0.5 0.5
#         1.0 0



In [62]:
# Test your code (leave this part unchanged, except for if auc is undefined)

train_df = pd.read_csv("anneal_train.csv")

test_df = pd.read_csv("anneal_test.csv")

rf = RandomForest()

t0 = time.perf_counter()
rf.fit(train_df)
print("Training time: {:.2f} s.".format(time.perf_counter()-t0))

print("OOB accuracy: {:.4f}".format(rf.oob_acc))

test_labels = test_df["CLASS"]

t0 = time.perf_counter()
predictions = rf.predict(test_df)
print("Testing time: {:.2f} s.".format(time.perf_counter()-t0))

print("Accuracy: {:.4f}".format(accuracy(predictions,test_labels)))
print("AUC: {:.4f}".format(auc(predictions,test_labels))) # Comment this out if not implemented in assignment 1
print("Brier score: {:.4f}".format(brier_score(predictions,test_labels))) # Comment this out if not implemented in assignment 1

ValueError: operands could not be broadcast together with shapes (5,) (4,) (5,) 

In [63]:
train_labels = train_df["CLASS"]
rf = RandomForest()
rf.fit(train_df)
predictions = rf.predict(train_df)
print("Accuracy on training set: {0:.2f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.2f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.2f}".format(brier_score(predictions,train_labels)))

ValueError: operands could not be broadcast together with shapes (5,) (4,) (5,) 

### Comment on assumptions, things that do not work properly, etc.