## Problem Set 11

First, the exercises:
* What is the maximum depth of a decision tree trained on $N$ samples?
The decision tree must make a proper split at each node, so every child contains at least one fewer sample than its parent. A root holding $N$ samples can therefore be followed by at most $N-1$ levels of splits, so the maximum depth of a decision tree is $N-1$.
* If we train a decision tree to an arbitrary depth, what will be the training error?
Assuming the training data never assigns different labels to samples with identical features, this will be zero. If we train a decision tree to arbitrary depth, we end up with a tree where each leaf contains only samples with identical features. If all of these samples have the same label, then any of the standard rules (voting, averaging) will return the correct response.
* How can we alter a loss function to help regularize a decision tree?
One of the simplest ways is to add to our loss function an increasing function of the depth $|D|$ of the node. For example, we could add $\lambda |D|$ or perhaps $\lambda 2^{|D|}$, where $\lambda$ is an appropriate hyperparameter (probably very small). One should choose the growth of this regularization term so that it does not dominate the unregularized cost function while splits are still yielding improvements at the desired rate.
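
The depth penalty can be turned into a stopping rule: accept a split only when the loss improvement is larger than the extra penalty paid for going one level deeper. The following is a minimal sketch of that idea, not the method used in the lab below; the names `depth_penalty` and `split_is_worthwhile` and the default `lam` are hypothetical.

```python
def depth_penalty(depth, lam=1e-3, kind="linear"):
    """Penalty that grows with node depth: lam * depth or lam * 2**depth."""
    if kind == "linear":
        return lam * depth
    if kind == "exponential":
        return lam * 2.0 ** depth
    raise ValueError("unknown kind: {0}".format(kind))


def split_is_worthwhile(parent_loss, children_loss, depth, lam=1e-3, kind="linear"):
    """Accept a split at `depth` only if the improvement exceeds the added penalty."""
    added = depth_penalty(depth + 1, lam, kind) - depth_penalty(depth, lam, kind)
    return (parent_loss - children_loss) > added
```

With the exponential penalty the bar for accepting a split doubles at every level, so the same improvement that is accepted near the root may be rejected deeper in the tree.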

### Python Lab

Now let us load our standard libraries.

In [249]:
import numpy as np
import pandas as pd

Let us load the credit card dataset and extract a small dataframe of numerical features to test on.

In [250]:
big_df = pd.read_csv("UCI_Credit_Card.csv")

In [251]:
big_df.head()

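| | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 3 | 4 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 4 | 5 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |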


In [252]:
len(big_df)

30000

In [253]:
len(big_df.dropna())

30000

In [254]:
df = big_df.drop(labels = ['ID'], axis = 1)

In [255]:
labels = df['default.payment.next.month']
df.drop('default.payment.next.month', axis = 1, inplace = True)

In [256]:
num_samples = 25000

In [257]:
train_x, train_y = df[0:num_samples], labels[0:num_samples]

In [258]:
test_x, test_y = df[num_samples:], labels[num_samples:]

In [259]:
test_x.head()

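| | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25000 | 410000.0 | 1 | 1 | 1 | 38 | -1 | -1 | -1 | -1 | -2 | ... | 35509.0 | 0.0 | 0.0 | 0.0 | 0.0 | 35509.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25001 | 260000.0 | 1 | 2 | 2 | 35 | 0 | 0 | 0 | 0 | 0 | ... | 297313.0 | 276948.0 | 2378.0 | -2709.0 | 12325.0 | 6633.0 | 6889.0 | 1025.0 | 2047.0 | 194102.0 |
| 25002 | 50000.0 | 1 | 2 | 1 | 40 | 0 | 0 | 0 | 0 | 0 | ... | 11353.0 | 12143.0 | 11753.0 | 11922.0 | 1200.0 | 4000.0 | 2000.0 | 2000.0 | 1000.0 | 1000.0 |
| 25003 | 360000.0 | 1 | 3 | 1 | 37 | -1 | -1 | -1 | -2 | -2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 303.0 | 0.0 | 0.0 | 0.0 | 0.0 | 860.0 |
| 25004 | 50000.0 | 1 | 3 | 1 | 49 | 0 | 0 | 0 | 0 | 0 | ... | 50076.0 | 48995.0 | 19780.0 | 15102.0 | 2000.0 | 5000.0 | 2305.0 | 3000.0 | 559.0 | 3000.0 |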


In [260]:
train_y.head()

0    1
1    1
2    0
3    0
4    0
Name: default.payment.next.month, dtype: int64

Now let us write our transformation function.

In [264]:
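class bin_transformer(object):
    """Binarize numeric columns by thresholding at evenly spaced quantiles."""

    def __init__(self, df, num_quantiles = 2):
        # Interior quantile levels, e.g. 0.2, 0.4, 0.6, 0.8 for num_quantiles = 5.
        self.quantiles = df.quantile(np.linspace(1./num_quantiles, 1.-1./num_quantiles,num_quantiles-1))

    def transform(self, df):
        new = pd.DataFrame()
        fns = {}
        for col_name in df.axes[1]:
            for ix, q in self.quantiles.iterrows():
                quart = q[col_name]
                # One boolean feature per (column, quantile level) pair.
                new[col_name+str(ix)] = (df[col_name] >= quart)
                # Bind col_name and quart as default arguments; a bare closure
                # would only ever see the values from the last loop iteration.
                fns[col_name+str(ix)] = (col_name, lambda x, col_name=col_name, quart=quart: x[col_name] >= quart)
        return new, fns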
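
Storing a lambda per threshold inside a loop is sensitive to how Python closures bind variables: a closure looks up loop variables when it is called, not when it is created. A minimal sketch of the pitfall and the usual default-argument workaround, using made-up thresholds (the names `late` and `bound` are illustrative only):

```python
thresholds = [1, 2, 3]

# Late binding: every lambda looks up `t` when it is called,
# by which time the loop has finished and `t` is the last threshold.
late = [lambda x: x >= t for t in thresholds]

# Default argument: `t=t` freezes the current value when the lambda is created.
bound = [lambda x, t=t: x >= t for t in thresholds]

late_results = [f(2) for f in late]     # all compare against the last threshold
bound_results = [f(2) for f in bound]   # each compares against its own threshold
```

The same consideration applies to any function stored in `fns` that refers to `col_name` and `quart`.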

In [265]:
transformer = bin_transformer(df,5)
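
With `num_quantiles = 5`, the `np.linspace` call in the constructor yields the levels 0.2, 0.4, 0.6 and 0.8, so each original column is turned into four boolean columns. A minimal sketch of the same computation on a made-up toy column (the `toy` frame and the `AGE>=q...` column names are assumptions for illustration, not part of the credit card data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"AGE": [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]})

num_quantiles = 5
levels = np.linspace(1.0 / num_quantiles, 1.0 - 1.0 / num_quantiles, num_quantiles - 1)
thresholds = toy.quantile(levels)  # one row per level, one column per feature

# One boolean feature per (column, level) pair: is the value at or above the threshold?
binary = pd.DataFrame({
    "AGE>=q{0:.1f}".format(level): toy["AGE"] >= row["AGE"]
    for level, row in thresholds.iterrows()
})
```

Each successive column is true for fewer rows, since the threshold rises with the quantile level.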

In [266]:
train_x_t, tr_fns = transformer.transform(train_x)

In [267]:
test_x_t, test_fns = transformer.transform(test_x)

In [268]:
train_x_t.head()

| | LIMIT_BAL0.2 | LIMIT_BAL0.4 | LIMIT_BAL0.6 | LIMIT_BAL0.8 | SEX0.2 | SEX0.4 | SEX0.6 | SEX0.8 | EDUCATION0.2 | EDUCATION0.4 | ... | PAY_AMT40.6 | PAY_AMT40.8 | PAY_AMT50.2 | PAY_AMT50.4 | PAY_AMT50.6 | PAY_AMT50.8 | PAY_AMT60.2 | PAY_AMT60.4 | PAY_AMT60.6 | PAY_AMT60.8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | True | True | True | True | True | True | ... | False | False | True | False | False | False | True | False | False | False |
| 1 | True | True | False | False | True | True | True | True | True | True | ... | False | False | True | False | False | False | True | True | False | False |
| 2 | True | False | False | False | True | True | True | True | True | True | ... | False | False | True | True | False | False | True | True | True | True |
| 3 | True | False | False | False | True | True | True | True | True | True | ... | False | False | True | True | False | False | True | True | False | False |
| 4 | True | False | False | False | True | False | False | False | True | True | ... | True | True | True | False | False | False | True | False | False | False |


In [269]:
tr_fns

{'AGE0.2': ('AGE',
  <function __main__.bin_transformer.transform.<locals>.<lambda>>),
 'AGE0.4': ('AGE',
  <function __main__.bin_transformer.transform.<locals>.<lambda>>),
 'AGE0.6': ('AGE',
  <function __main__.bin_transformer.transform.<locals>.<lambda>>),
 'AGE0.8': ('AGE',
  <function __main__.bin_transformer.transform.<locals>.<lambda>>),
 'BILL_AMT10.2': ('BILL_AMT1',
  <function __main__.bin_transformer.transform.<locals>.<lambda>>),
 'BILL_AMT10.4': ('BILL_AMT1',
  <function __main__.bin_transformer.transform.<locals>.<lambda>>),
 'BILL_AMT10.6': ('BILL_AMT1',
  <function __main__.bin_transformer.transform.<locals>.<lambda>>),
 'BILL_AMT10.8': ('BILL_AMT1',
  <function __main__.bin_transformer.transform.<locals>.<lambda>>),
 'BILL_AMT20.2': ('BILL_AMT2',
  <function __main__.bin_transformer.transform.<locals>.<lambda>>),
 'BILL_AMT20.4': ('BILL_AMT2',
  <function __main__.bin_transformer.transform.<locals>.<lambda>>),
 'BILL_AMT20.6': ('BILL_AMT2',
  <function __main__.bin_tr

Now let us build some simple loss functions for 1d labels.

In [270]:
def bdd_cross_entropy(pred, label):
    return -np.mean(label*np.log(pred+10**(-20)))

In [271]:
def MSE(pred,label):
    return np.mean((pred-label)**2)

In [272]:
def acc(pred,label):
    return np.mean((pred>=0.5)==(label == 1))
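
Note that `bdd_cross_entropy` above keeps only the `label * log(pred)` term, so samples with label 0 contribute nothing to it. For comparison, here is a minimal sketch of the full binary cross-entropy with both terms; the name `binary_cross_entropy` and the clipping constant `eps` are assumptions, and this function is not used in the cells that follow.

```python
import numpy as np


def binary_cross_entropy(pred, label, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    pred = np.clip(pred, eps, 1.0 - eps)
    label = np.asarray(label, dtype=float)
    return -np.mean(label * np.log(pred) + (1.0 - label) * np.log(1.0 - pred))
```

Clipping plays the same role as the small constant added inside the logarithm in `bdd_cross_entropy`: it keeps the loss finite when a prediction is exactly 0 or 1.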

Now let us define the find split function.

In [273]:
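def find_split(x, y, loss, verbose = False):
    # Baseline: loss of predicting the mean label for every sample in the node.
    min_ax = None
    base_loss = loss(np.mean(y),y)
    min_loss = base_loss
    N = len(x)
    for col_name in x.axes[1]:
        # Each column of x is a boolean mask that defines one candidate split.
        mask = x[col_name]
        num_pos = np.sum(mask)
        num_neg = N - num_pos
        # If one side is empty its mean is nan, so the split loss is nan and
        # the comparison below is False: such a column is never selected.
        pos_y = np.mean(y[mask])
        neg_y = np.mean(y[~mask])
        # Size-weighted average of the loss on the two sides of the split.
        split_loss = (num_pos*loss(pos_y, y[mask]) + num_neg*loss(neg_y, y[~mask]))/N
        if verbose:
            print("Column {0} split has improved loss {1}".format(col_name, base_loss-split_loss))
        if split_loss < min_loss:
            min_loss = split_loss
            min_ax = col_name
    return min_ax, min_loss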
        

In [278]:
find_split(train_x_t, train_y, MSE, verbose = True)

Column LIMIT_BAL0.2 split has improved loss 0.0032026111833937665
Column LIMIT_BAL0.4 split has improved loss 0.0036568972936314725
Column LIMIT_BAL0.6 split has improved loss 0.002968295613244798
Column LIMIT_BAL0.8 split has improved loss 0.0017932272689534512
Column SEX0.2 split has improved loss nan
Column SEX0.4 split has improved loss 0.0002155159725325817
Column SEX0.6 split has improved loss 0.0002155159725325817
Column SEX0.8 split has improved loss 0.0002155159725325817
Column EDUCATION0.2 split has improved loss 2.3907091916103296e-05
Column EDUCATION0.4 split has improved loss 0.0004640208803457502
Column EDUCATION0.6 split has improved loss 0.0004640208803457502
Column EDUCATION0.8 split has improved loss 0.0004640208803457502
Column MARRIAGE0.2 split has improved loss 3.249086770407139e-05
Column MARRIAGE0.4 split has improved loss 3.249086770407139e-05
Column MARRIAGE0.6 split has improved loss 0.00014480802024710582
Column MARRIAGE0.8 split has improved loss 0.000144808

('PAY_00.8', 0.14999327547679009)
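
The `nan` reported for `SEX0.2` arises because that column is `True` for every row, so one side of the split is empty and the mean of an empty selection is `nan`. A minimal sketch of a variant that skips such degenerate splits explicitly; `find_split_guarded`, the `min_leaf` parameter, and the toy data are hypothetical and not part of the lab code.

```python
import numpy as np
import pandas as pd


def mse(pred, label):
    return np.mean((pred - label) ** 2)


def find_split_guarded(x, y, loss, min_leaf=1):
    """Best boolean column to split on, ignoring splits with a side smaller than min_leaf."""
    best_col = None
    best_loss = loss(np.mean(y), y)  # loss of the unsplit node
    n = len(x)
    for col_name in x.columns:
        mask = x[col_name].astype(bool)
        num_pos = int(mask.sum())
        num_neg = n - num_pos
        if num_pos < min_leaf or num_neg < min_leaf:
            continue  # degenerate split: one side is (nearly) empty
        pos_y, neg_y = y[mask], y[~mask]
        split_loss = (num_pos * loss(pos_y.mean(), pos_y)
                      + num_neg * loss(neg_y.mean(), neg_y)) / n
        if split_loss < best_loss:
            best_col, best_loss = col_name, split_loss
    return best_col, best_loss


toy_x = pd.DataFrame({"always_true": [True] * 4,
                      "useful": [True, True, False, False]})
toy_y = pd.Series([1, 1, 0, 0])
result = find_split_guarded(toy_x, toy_y, mse)
```

On the toy data the `always_true` column is skipped and the split on `useful` separates the labels perfectly.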

In [279]:
find_split(train_x_t, train_y, bdd_cross_entropy, verbose = 0)

('PAY_00.8', 0.29117246034455752)

In [280]:
find_split(train_x_t, train_y, acc, verbose = 0)

(None, 0.77688000000000001)
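
The `None` result is consistent with how `find_split` works: it keeps a split only when the split lowers the value returned by `loss`, whereas `acc` is a score for which higher is better. One way to use accuracy as a criterion is to minimize the error rate instead. A minimal sketch, where `error_rate` is a hypothetical helper and `acc` is repeated so the block runs on its own:

```python
import numpy as np


def acc(pred, label):
    return np.mean((pred >= 0.5) == (label == 1))


def error_rate(pred, label):
    """Fraction of misclassified samples; lower is better, so it can be minimized."""
    return 1.0 - acc(pred, label)
```

Even with this change a split only helps if it moves the majority label on at least one side, so an error-rate criterion may still report no improving split on imbalanced data.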

In [281]:
np.mean(train_y[train_x_t['PAY_00.8']])

0.19974715549936789

In [283]:
np.mean(train_y[~train_x_t['PAY_00.8']])

0.13956871914783062

In [284]:
np.mean(train_y[train_x_t['AGE0.2']])

0.21648985128130602

In [285]:
np.mean(train_y[~train_x_t['AGE0.2']])

0.25453293550608219