## Problem Set 11

First, the exercises:
* What is the maximum depth of a decision tree trained on $N$ samples?
The decision tree must make a proper split at each node, so each child contains at least one fewer sample than its parent. Starting from $N$ samples at the root and needing at least one sample in every leaf, the maximum depth of a decision tree is $N-1$.
* If we train a decision tree to an arbitrary depth, what will be the training error?
Assuming samples with identical features always carry the same label in the training data, the training error will be zero. If we train a decision tree to arbitrary depth, we end up with a tree in which each leaf contains only samples with identical features. If all of these samples have the same label, then any of the standard rules (voting, averaging) will return the correct response.
* How can we alter a loss function to help regularize a decision tree?
One of the simplest ways is to add to our loss function an increasing function of the depth $|D|$ of the node. For example, we could add $\lambda |D|$ or perhaps $\lambda 2^{|D|}$, where $\lambda$ is an appropriate hyperparameter (probably very small). One should choose the growth of this regularization term so that it does not dominate the unregularized cost function while splits are still obtaining improvements at the desired rate.
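
The depth penalty from the last exercise can be sketched in a few lines. This is only an illustration under stated assumptions: the names `penalized_loss` and `worth_splitting`, and the rule "split only if the penalized loss goes down", are hypothetical and not part of the problem set.

```python
def penalized_loss(base_loss, depth, lam=1e-3, kind="linear"):
    """Add a depth-dependent penalty to an unregularized loss value.

    kind="linear" adds lam * depth, kind="exponential" adds lam * 2**depth.
    """
    if kind == "linear":
        penalty = lam * depth
    elif kind == "exponential":
        penalty = lam * 2 ** depth
    else:
        raise ValueError("kind must be 'linear' or 'exponential'")
    return base_loss + penalty


def worth_splitting(parent_loss, child_loss, depth, lam=1e-3, kind="linear"):
    """Hypothetical rule: split a node at `depth` only if the penalized loss
    of its children (one level deeper) is lower than its own."""
    before = penalized_loss(parent_loss, depth, lam, kind)
    after = penalized_loss(child_loss, depth + 1, lam, kind)
    return after < before


# The same small improvement is accepted under the linear penalty
# but rejected under the faster-growing exponential one.
print(worth_splitting(0.170, 0.168, depth=3, kind="linear"))
print(worth_splitting(0.170, 0.168, depth=3, kind="exponential"))
```

With the exponential penalty, the cost of going one level deeper doubles at every level, so small improvements stop paying for themselves much sooner than with the linear penalty.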

### Python Lab

Now let us load our standard libraries.

In [1]:
import numpy as np
import pandas as pd

Let us load the credit card dataset and extract a small dataframe of numerical features to test on.

In [26]:
big_df = pd.read_csv("UCI_Credit_Card.csv")

In [28]:
big_df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24,2,2,-1,-1,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


In [29]:
len(big_df)

30000

In [30]:
len(big_df.dropna())

30000

In [31]:
df = big_df.drop(labels = ['ID'], axis = 1)

In [40]:
labels = df['default.payment.next.month']
df.drop('default.payment.next.month', axis = 1, inplace = True)

In [41]:
num_samples = 25000

In [42]:
train_x, train_y = df[0:num_samples], labels[0:num_samples]

In [43]:
test_x, test_y = df[num_samples:], labels[num_samples:]

In [48]:
test_x.head()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
25000,410000.0,1,1,1,38,-1,-1,-1,-1,-2,...,35509.0,0.0,0.0,0.0,0.0,35509.0,0.0,0.0,0.0,0.0
25001,260000.0,1,2,2,35,0,0,0,0,0,...,297313.0,276948.0,2378.0,-2709.0,12325.0,6633.0,6889.0,1025.0,2047.0,194102.0
25002,50000.0,1,2,1,40,0,0,0,0,0,...,11353.0,12143.0,11753.0,11922.0,1200.0,4000.0,2000.0,2000.0,1000.0,1000.0
25003,360000.0,1,3,1,37,-1,-1,-1,-2,-2,...,0.0,0.0,0.0,0.0,303.0,0.0,0.0,0.0,0.0,860.0
25004,50000.0,1,3,1,49,0,0,0,0,0,...,50076.0,48995.0,19780.0,15102.0,2000.0,5000.0,2305.0,3000.0,559.0,3000.0


LIMIT_BAL
SEX
EDUCATION
MARRIAGE
AGE
PAY_0
PAY_2
PAY_3
PAY_4
PAY_5
PAY_6
BILL_AMT1
BILL_AMT2
BILL_AMT3
BILL_AMT4
BILL_AMT5
BILL_AMT6
PAY_AMT1
PAY_AMT2
PAY_AMT3
PAY_AMT4
PAY_AMT5
PAY_AMT6


Now let us write our transformation function.

In [62]:
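def bin_transform(df):
    """Binarize each column: True where the value is at or above that column's median."""
    new = df.copy(deep = True)
    medians = df.median()  # per-column medians of the frame being transformed
    for col_name in df.columns:
        new[col_name] = (df[col_name] >= medians[col_name])
    return new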

In [63]:
bin_transform(df).head()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
0,False,True,True,False,False,True,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,True,True,True,False,False,True,True,True,True,...,False,False,False,False,False,False,False,False,False,True
2,False,True,True,True,True,True,True,True,True,True,...,False,False,False,False,False,False,False,False,False,True
3,False,True,True,False,True,True,True,True,True,True,...,True,True,True,True,False,True,False,False,False,False
4,False,False,True,False,True,False,True,False,True,True,...,True,True,True,True,False,True,True,True,False,False


In [64]:
train_x_t = bin_transform(train_x)

In [65]:
test_x_t = bin_transform(test_x)
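
Note that `bin_transform(test_x)` thresholds the test set at its own medians, so the two sets may be binarized at slightly different cut points. A minimal sketch of an alternative, assuming one would rather reuse the training medians for both sets; `fit_medians` and `bin_transform_with` are hypothetical helper names, not functions from the lab.

```python
import pandas as pd


def fit_medians(train_df):
    """Compute the per-column medians once, on the training frame."""
    return train_df.median()


def bin_transform_with(df, medians):
    """Binarize df against precomputed medians (True where value >= median).

    DataFrame.ge with axis="columns" matches the Series index to df's columns.
    """
    return df.ge(medians, axis="columns")


# Toy frames standing in for train_x and test_x.
toy_train = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
toy_test = pd.DataFrame({"a": [1, 5], "b": [25, 5]})

medians = fit_medians(toy_train)
print(bin_transform_with(toy_test, medians))
```

In the lab this would read `medians = fit_medians(train_x)`, followed by `bin_transform_with(train_x, medians)` and `bin_transform_with(test_x, medians)`.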

Now let us build some simple loss functions for 1d labels.

In [100]:
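def bdd_cross_entropy(pred, label):
    # One-sided cross-entropy: only the label*log(pred) term is included.
    # The small constant keeps log() finite when pred is exactly 0.
    return -np.mean(label*np.log(pred+1e-20))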

In [102]:
def MSE(pred,label):
    return np.mean((pred-label)**2)

In [103]:
def acc(pred,label):
    # Accuracy of thresholding pred at 0.5. This is a score (higher is better), not a loss.
    return np.mean((pred>=0.5)==(label == 1))
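
For comparison, `bdd_cross_entropy` above keeps only the `label*log(pred)` term. A sketch of the usual two-term binary cross-entropy, together with an error rate that can be minimized like the other losses, might look as follows. The names `full_cross_entropy` and `error_rate` and the clipping constant `eps` are assumptions made for this illustration.

```python
import numpy as np


def full_cross_entropy(pred, label, eps=1e-12):
    """Binary cross-entropy with both the positive and the negative term."""
    label = np.asarray(label, dtype=float)
    # Clip so that neither log(pred) nor log(1 - pred) can be log(0).
    pred = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(label * np.log(pred) + (1.0 - label) * np.log(1.0 - pred))


def error_rate(pred, label):
    """Fraction of samples misclassified when pred is thresholded at 0.5."""
    return np.mean((np.asarray(pred) >= 0.5) != (np.asarray(label) == 1))


# Toy node with one positive out of four samples, predicted by its mean.
y = np.array([1.0, 0.0, 0.0, 0.0])
p = y.mean()
print(full_cross_entropy(p, y), error_rate(p, y))
```

Because `error_rate` is `1 - acc`, a lower value means a better split, which matches a search that looks for the minimum.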

Now let us define the find split function.

In [97]:
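def find_split(x, y, loss):
    """Return the column whose True/False split gives the lowest weighted loss."""
    min_ax = None
    min_loss = np.inf
    N = len(x)
    for col_name in x.columns:
        mask = x[col_name]  # boolean column: True side vs. False side
        num_pos = np.sum(mask)
        num_neg = N - num_pos
        # Each side predicts the mean label of the samples it contains.
        pos_y = np.mean(y[mask])
        neg_y = np.mean(y[~mask])
        # Size-weighted average of the two per-side losses.
        split_loss = (num_pos*loss(pos_y, y[mask]) + num_neg*loss(neg_y, y[~mask]))/N
        print("Column {0} split has loss {1}".format(col_name, split_loss))
        if split_loss < min_loss:
            min_loss = split_loss
            min_ax = col_name
    return min_ax, min_loss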
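
A split search like `find_split` can be applied recursively to grow a whole tree. The following is a rough sketch under stated assumptions, not the lab's solution: `best_split` is a hypothetical quiet variant of `find_split` that also skips columns leaving one side empty, and `build_tree` with its nested-dictionary layout is invented for illustration.

```python
import numpy as np
import pandas as pd


def mse(pred, label):
    return np.mean((pred - label) ** 2)


def best_split(x, y, loss):
    """Quiet split search that ignores columns with an empty side."""
    y = np.asarray(y, dtype=float)
    best_col, best_loss = None, np.inf
    n = len(x)
    for col in x.columns:
        mask = x[col].to_numpy(dtype=bool)
        n_pos = int(mask.sum())
        if n_pos == 0 or n_pos == n:
            continue  # not a proper split
        y_pos, y_neg = y[mask], y[~mask]
        split_loss = (n_pos * loss(y_pos.mean(), y_pos)
                      + (n - n_pos) * loss(y_neg.mean(), y_neg)) / n
        if split_loss < best_loss:
            best_col, best_loss = col, split_loss
    return best_col, best_loss


def build_tree(x, y, loss, max_depth=2, depth=0):
    """Grow a tree as nested dicts; each node predicts its mean label."""
    y = np.asarray(y, dtype=float)
    node = {"prediction": float(y.mean()), "n": len(y)}
    if depth >= max_depth or len(y) < 2:
        return node
    col, _ = best_split(x, y, loss)
    if col is None:
        return node  # no column separates the remaining samples
    mask = x[col].to_numpy(dtype=bool)
    node["column"] = col
    node["true"] = build_tree(x[mask], y[mask], loss, max_depth, depth + 1)
    node["false"] = build_tree(x[~mask], y[~mask], loss, max_depth, depth + 1)
    return node


# Toy boolean features: column "a" separates the labels perfectly.
toy_x = pd.DataFrame({"a": [True, True, False, False],
                      "b": [True, False, True, False]})
toy_y = [1, 1, 0, 0]
tree = build_tree(toy_x, toy_y, mse, max_depth=1)
print(tree["column"], tree["true"]["prediction"], tree["false"]["prediction"])
```

The `max_depth` argument is the simplest form of regularization here; a depth penalty on the loss would be another option.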
        

In [98]:
find_split(train_x_t, train_y, MSE)

Column LIMIT_BAL split has loss 0.1695355209804491
Column SEX split has loss 0.1731219496274674
Column EDUCATION split has loss 0.17287344471965424
Column MARRIAGE split has loss 0.17319265757975288
Column AGE split has loss 0.173318892363075
Column PAY_0 split has loss 0.1717514981906695
Column PAY_2 split has loss 0.17202252471313842
Column PAY_3 split has loss 0.17191765041495996
Column PAY_4 split has loss 0.17223636307005985
Column PAY_5 split has loss 0.17232129416490316
Column PAY_6 split has loss 0.17253319155880448
Column BILL_AMT1 split has loss 0.17325282560000002
Column BILL_AMT2 split has loss 0.17330610559999998
Column BILL_AMT3 split has loss 0.17333131520000003
Column BILL_AMT4 split has loss 0.17333436800000004
Column BILL_AMT5 split has loss 0.17332015999999997
Column BILL_AMT6 split has loss 0.17327217920000001
Column PAY_AMT1 split has loss 0.17086868146418618
Column PAY_AMT2 split has loss 0.1709179393133935
Column PAY_AMT3 split has loss 0.17150788901767042
Column

('LIMIT_BAL', 0.16953552098044911)

In [101]:
find_split(train_x_t, train_y, bdd_cross_entropy)

Column LIMIT_BAL split has loss 0.32609324928399475
Column SEX split has loss 0.33421127717335847
Column EDUCATION split has loss 0.33362747952504695
Column MARRIAGE split has loss 0.33436660837163695
Column AGE split has loss 0.33464852355164837
Column PAY_0 split has loss 0.3308843371730157
Column PAY_2 split has loss 0.3316023603256719
Column PAY_3 split has loss 0.331355010904273
Column PAY_4 split has loss 0.3321193974158408
Column PAY_5 split has loss 0.33232409957840303
Column PAY_6 split has loss 0.3328343764615235
Column BILL_AMT1 split has loss 0.334500432415302
Column BILL_AMT2 split has loss 0.33461987645234526
Column BILL_AMT3 split has loss 0.33467637691547464
Column BILL_AMT4 split has loss 0.334683218289005
Column BILL_AMT5 split has loss 0.3346513767448689
Column BILL_AMT6 split has loss 0.3345438245879838
Column PAY_AMT1 split has loss 0.3291165664664216
Column PAY_AMT2 split has loss 0.3292922028001125
Column PAY_AMT3 split has loss 0.33056481518410924
Column PAY_AMT

('LIMIT_BAL', 0.32609324928399475)

In [104]:
find_split(train_x_t, train_y, acc)

Column LIMIT_BAL split has loss 0.77688
Column SEX split has loss 0.77688
Column EDUCATION split has loss 0.77688
Column MARRIAGE split has loss 0.77688
Column AGE split has loss 0.77688
Column PAY_0 split has loss 0.77688
Column PAY_2 split has loss 0.77688
Column PAY_3 split has loss 0.77688
Column PAY_4 split has loss 0.77688
Column PAY_5 split has loss 0.77688
Column PAY_6 split has loss 0.77688
Column BILL_AMT1 split has loss 0.77688
Column BILL_AMT2 split has loss 0.77688
Column BILL_AMT3 split has loss 0.77688
Column BILL_AMT4 split has loss 0.77688
Column BILL_AMT5 split has loss 0.77688
Column BILL_AMT6 split has loss 0.77688
Column PAY_AMT1 split has loss 0.77688
Column PAY_AMT2 split has loss 0.77688
Column PAY_AMT3 split has loss 0.77688
Column PAY_AMT4 split has loss 0.77688
Column PAY_AMT5 split has loss 0.77688
Column PAY_AMT6 split has loss 0.77688


('LIMIT_BAL', 0.77688000000000001)

In [106]:
np.mean(train_y[train_x_t['LIMIT_BAL']])

0.16276337432443017

In [108]:
np.mean(train_y[~train_x_t['LIMIT_BAL']])

0.28611133818360174

In [110]:
np.mean(train_y[train_x_t['AGE']])

0.22733317687553783

In [111]:
np.mean(train_y[~train_x_t['AGE']])

0.21871163133338789
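
All four group means above are below 0.5, which appears to explain why `acc` reported the same value for every column: both sides of every split predict class 0, so the weighted accuracy is simply the overall fraction of negative labels. The sketch below checks this on synthetic labels (not the credit card data). The helper `weighted` and the random masks are assumptions made for the illustration.

```python
import numpy as np


def MSE(pred, label):
    return np.mean((pred - label) ** 2)


def acc(pred, label):
    return np.mean((pred >= 0.5) == (label == 1))


def weighted(metric, y, mask):
    """Size-weighted metric over the two sides of a boolean split."""
    n = len(y)
    pos, neg = y[mask], y[~mask]
    return (len(pos) * metric(pos.mean(), pos)
            + len(neg) * metric(neg.mean(), neg)) / n


rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.2).astype(float)  # roughly 20% positives
p = y.mean()

# For binary labels predicted by their mean, MSE reduces to p * (1 - p),
# and thresholded accuracy reduces to max(p, 1 - p).
print(np.isclose(MSE(p, y), p * (1 - p)))
print(np.isclose(acc(p, y), max(p, 1 - p)))

# Two unrelated random splits give the same weighted accuracy,
# because every side still has a positive rate below 0.5.
mask_a = rng.random(1000) < 0.5
mask_b = rng.random(1000) < 0.5
print(weighted(acc, y, mask_a), weighted(acc, y, mask_b), 1 - p)
```

MSE and cross-entropy do not have this blind spot, since they respond to how far each side's mean moves away from the overall rate even when neither side crosses 0.5.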