In [1]:
%%html
<style>
.h1_cell, .just_text {
    box-sizing: border-box;
    padding-top:5px;
    padding-bottom:5px;
    font-family: "Times New Roman", Georgia, Serif;
    font-size: 125%;
    line-height: 22px; /* 5px +12px + 5px */
    text-indent: 25px;
    background-color: #fbfbea;
    padding: 10px;
}

hr { 
    display: block;
    margin-top: 0.5em;
    margin-bottom: 0.5em;
    margin-left: auto;
    margin-right: auto;
    border-style: inset;
    border-width: 2px;
}
</style>

<h1>
<center>
Module 5: First look at Bias-Variance Tradeoff
</center>
</h1>
<div class=h1_cell>
<p>
There is a classic problem with machine learning models built from labeled data (i.e., supervised learning). Typically you are given training data and you build your model from that. Once you have your model, you can release it into the wild, where it will start working with new data coming from the world. The problem is that your model can suffer from one of two problems: (1) your model is too weak and underfits the training data or (2) your model is too strong and overfits the training data. The former is called high *bias*. High bias can cause a model to miss the relevant relations between columns/features and the target output. Think of our prediction tree that was a stump using only `sex_female` as a column. It ignored all other columns. Think of this predictor as highly biased toward one column. If it were less biased (more egalitarian), it would have included most if not all of the columns.
<p>
In terms of overfitting, it seems strange to say a model is too strong. What is really meant is that the model pays too much attention to the nuances of the training data. It obtains high accuracy by modeling the random noise in the training data. This is called high *variance*. Take the Titanic data. What if a model used the `Name` column as a predictor? We could get 100% accuracy on the training data from this. The `Name` column has unique values, so we could just assign each unique name to the correct output (Survived, Perished). How would this predictor do when looking at new data not in the training set? Not very well. I claim the `Name` column is random noise in terms of its raw values.
<p>
Caveat 1: I am claiming that the *raw* name value is noise in terms of prediction. That said, I think it could be useful to wrangle the `Name` column a bit to pull out useful info from the raw values. For instance, I see salutations like Master, Reverend, Miss, Honorable, etc. These indeed might carry information. Maybe they identify passengers of "high class" that were let on the lifeboats. We could wrangle a new binary column `upper_class` that is formed by looking for salutations in the Name column.
<p>
Caveat 2: the drawback of using the Titanic data is that it is hard to see beyond the training data. Our Titanic models will not be released for future use. There will not be another Titanic built. So it may be better to think of the Loan Table. The models you build for it could definitely be released for future use. If good enough, I suppose they could replace human loan agents in a bank.
<p>
We will study a variety of methods for handling the Bias-Variance tradeoff in the coming weeks. We are looking for a sweet spot where there is not too much underfitting and not too much overfitting. This week we will try a technique called k-fold cross-validation.
<p>
Let's bring in our previous results now.
</div>
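<div class=h1_cell>
To make the `Name` example concrete, here is a minimal sketch of a predictor that simply memorizes names. Only the first name comes from our table; the other names, the labels on them, and the helper names (`name_lookup`, `name_predictor`, `accuracy_on`) are made up for illustration. It is not part of the course library.
</div>

```python
# Toy rows: (name, survived). Only the first name appears in the Titanic
# table shown in this notebook; the remaining rows are invented.
train_rows = [
    ("Braund, Mr. Owen Harris", 0),
    ("Doe, Mrs. Jane", 1),
    ("Roe, Miss. Ann", 1),
    ("Poe, Mr. Edgar", 0),
]
new_rows = [
    ("Smith, Mr. John", 0),
    ("Jones, Mrs. Mary", 1),
]

# "Training" is nothing more than memorizing each name's label.
name_lookup = dict(train_rows)

def name_predictor(name, default=0):
    # An unseen name is not in the lookup, so we fall back to a fixed guess.
    return name_lookup.get(name, default)

def accuracy_on(rows):
    correct = sum(1 for name, actual in rows if name_predictor(name) == actual)
    return correct / float(len(rows))

train_accuracy = accuracy_on(train_rows)  # every training name is memorized
new_accuracy = accuracy_on(new_rows)      # only as good as the fixed guess
print(train_accuracy)
print(new_accuracy)
```

<div class=h1_cell>
The memorized names score perfectly on the rows they were built from and tell us nothing about new passengers. That gap between training and new data is what high variance looks like.
</div>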

In [2]:
import pandas as pd
import os

week = 4  # from last module

home_path =  os.path.expanduser('~')

file_path = '/Dropbox/cis399_ds1_f17/notebook_history/'

file_name = 'titanic_wrangled_w'+str(week)+'.csv'

titanic_table = pd.read_csv(home_path + file_path + file_name)

pd.__version__  # should see 0.20.3 or higher

u'0.20.3'

In [3]:
pd.set_option('display.max_columns', None)

In [4]:
os.chdir(home_path + '/Dropbox/cis399_ds1_f17/week_libraries/datascience_1')
!git pull

Already up-to-date.


In [5]:
import sys
sys.path.append(home_path + '/Dropbox/cis399_ds1_f17/week_libraries/datascience_1')

In [6]:
from week4 import *

%who function

accuracy	 build_pred	 build_tree_iter	 compute_prediction	 f1	 find_best_splitter	 generate_table	 gig	 gini	 
informedness	 predictor_case	 probabilities	 tree_predictor	 


In [7]:
titanic_table.head(1)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,no_age,filled_age,emb_C,emb_Q,emb_S,emb_nan,age_bin,age_Child,age_Adult,age_Senior,sex_female,sex_male,ok_child,pclass_1,pclass_2,pclass_3,pclass_nan,tree_1,tree_1_type,tree_2,tree_2_type,tree_3,tree_3_type,tree_4,tree_4_type,tree_5,tree_5_type
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,22.0,0,0,1,0,Child,1,0,0,0,1,0,0,0,1,0,0,true_negative,0,true_negative,0,true_negative,0,true_negative,0,true_negative


<h2>
Drop columns from last module
</h2>
<div class=h1_cell>
<p>
We have been using the Titanic table to store our tree exploration. That worked OK in the last module, but we need something more general now. I want to keep our exploration results in a separate table and not pollute the Titanic table with exploration data. First things first: drop the exploration columns from last week.
</div>

In [8]:
titanic_table = titanic_table.drop(['tree_1', 'tree_1_type', 'tree_2', 'tree_2_type', 'tree_3', 'tree_3_type', 'tree_4', 'tree_4_type', 'tree_5', 'tree_5_type'], axis=1)
titanic_table.head(1)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,no_age,filled_age,emb_C,emb_Q,emb_S,emb_nan,age_bin,age_Child,age_Adult,age_Senior,sex_female,sex_male,ok_child,pclass_1,pclass_2,pclass_3,pclass_nan
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,22.0,0,0,1,0,Child,1,0,0,0,1,0,0,0,1,0


<h2>
Using both a training set and a testing set
</h2>
<div class=h1_cell>
<p>
In the prior module we judged the goodness of our trees based on their accuracy, f1 and informedness scores. But you may have noticed that we used the same data to both train and test our trees. That hides overfitting: a tree that models noise in the training data still scores well when tested on that same data. One idea to avoid this problem is to train on part of the data and test on the other part of the data. Let's see how we can do that.
<p>
A common approach is to take 2/3 of the data as training and hold out 1/3 as testing.
</div>

In [9]:
total_len = len(titanic_table.index)
split_boundary = int(total_len*(2/3.0))
split_boundary

594


<div class=h1_cell>
<p>
You can use a slice operator on a table just like you can on a list. Cool.
</div>

In [10]:
training_table = titanic_table[0:split_boundary]  # 0-593
test_table = titanic_table[split_boundary:]       # 594 to 890
print(len(training_table))
print(len(test_table))

594
297
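<div class=h1_cell>
One thing to watch: slicing by row order assumes the rows are not sorted in some meaningful way. If you are unsure, you can shuffle before splitting. Below is a minimal sketch using only the standard library. The function name `shuffled_split` and the `seed` value are my own choices for illustration.
</div>

```python
import random

def shuffled_split(n_rows, train_fraction=2/3.0, seed=0):
    """Return (train_positions, test_positions) as lists of row positions."""
    positions = list(range(n_rows))
    # A seeded generator makes the shuffle repeatable from run to run.
    random.Random(seed).shuffle(positions)
    boundary = int(n_rows * train_fraction)
    return positions[:boundary], positions[boundary:]

train_positions, test_positions = shuffled_split(891)  # 891 rows in our table
print(len(train_positions))
print(len(test_positions))
```

<div class=h1_cell>
With a DataFrame you could then pull the rows by position, e.g., `titanic_table.iloc[train_positions]`. I will stick with the sequential split in this module.
</div>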


<div class=h1_cell>
Now we can build a tree using training data. I am going to use a max-depth of 3.
</div>

In [11]:
splitter_columns = [
 'emb_C',
 'emb_Q',
 'emb_S',
 'emb_nan',
 'age_Child',
 'age_Adult',
 'age_Senior',
 'no_age',
 'ok_child',
 'sex_female',
 'sex_male', 
 'pclass_1',
 'pclass_2',
 'pclass_3',
 'pclass_nan'
]

In [12]:
#Notice using training_table not titanic_table

tree_train = build_tree_iter(training_table, splitter_columns, 'Survived', {'max-depth':3})
tree_train['paths']

[{'conjunction': [('sex_female_1', <function week4.<lambda>>),
   ('pclass_3_1', <function week4.<lambda>>),
   ('emb_Q_1', <function week4.<lambda>>)],
  'gig_score': 0.05562746548323472,
  'prediction': 1},
 {'conjunction': [('sex_female_1', <function week4.<lambda>>),
   ('pclass_3_1', <function week4.<lambda>>),
   ('emb_Q_0', <function week4.<lambda>>)],
  'gig_score': 0.05562746548323472,
  'prediction': 0},
 {'conjunction': [('sex_female_1', <function week4.<lambda>>),
   ('pclass_3_0', <function week4.<lambda>>),
   ('age_Child_1', <function week4.<lambda>>)],
  'gig_score': 0.0015571657971864689,
  'prediction': 1},
 {'conjunction': [('sex_female_1', <function week4.<lambda>>),
   ('pclass_3_0', <function week4.<lambda>>),
   ('age_Child_0', <function week4.<lambda>>)],
  'gig_score': 0.0015571657971864689,
  'prediction': 1},
 {'conjunction': [('sex_female_0', <function week4.<lambda>>),
   ('ok_child_1', <function week4.<lambda>>),
   ('pclass_3_1', <function week4.<lambda>>

<div class=h1_cell>
Now that we have our tree, let's automate a bit of code from last module. I'll define a function that will compute a temporary table to hold the predictions and actuals. I'll then produce a Series object, `types`, with the 4 cases. I'll return it in a nice dictionary-like format.
</div>

In [13]:
def caser(table, tree, target):
    scratch_table = pd.DataFrame(columns=['prediction', 'actual'])
    scratch_table['prediction'] = table.apply(lambda row: tree_predictor(row, tree), axis=1)
    scratch_table['actual'] = table[target]  # just copy the target column
    cases = scratch_table.apply(lambda row: predictor_case(row, pred='prediction', target='actual'), axis=1)
    return cases.value_counts()

In [14]:
train_cases = caser(training_table, tree_train, 'Survived')
train_cases

true_negative     349
true_positive     139
false_negative     95
false_positive     11
dtype: int64

<h2>
Let's get serious about saving our results
</h2>
<div class=h1_cell>
<p>
Up until now we have been storing our testing results in the Titanic table itself. I'd like to stop doing that. I am going to define a new table (a new DataFrame) to hold the results of the various training and testing runs we do.

</div>

In [15]:
columns = ['name', 'true_positive', 'false_positive', 'true_negative', 'false_negative',
           'accuracy', 'f1', 'informedness']
results_table = pd.DataFrame(columns=columns)  # empty for now

<div class=h1_cell>
The table is currently empty. I'll use each row to record my exploration results. I'll start by adding what I get from using my training data, i.e., `train_cases` from above. Notice that the Series object we used for appending (i.e., `train_cases`) has some but not all of the columns defined.
</div>

In [16]:
results_table = results_table.append(train_cases,ignore_index=True)
results_table.head()

Unnamed: 0,name,true_positive,false_positive,true_negative,false_negative,accuracy,f1,informedness
0,,139.0,11.0,349.0,95.0,,,


<div class=h1_cell>
I'll keep appending rows onto the end of the table, so I will need to know what the last row's index is. I could try to maintain that myself in some variable I define. But I'd rather just use the `last_valid_index` method built into pandas.
</div>

In [17]:
end = results_table.last_valid_index()  # should be 0 in this case - just one row

<div class=h1_cell>
I am now ready to fill out the missing columns for the row I just added. I'll do that now.
</div>

In [18]:
# .loc with (row label, column name) writes directly into the table and
# avoids the chained-assignment SettingWithCopyWarning
results_table.loc[end, 'name'] = 'tree training'
results_table.loc[end, 'accuracy'] = accuracy(train_cases)
results_table.loc[end, 'f1'] = f1(train_cases)
results_table.loc[end, 'informedness'] = informedness(train_cases)

results_table.head()


Unnamed: 0,name,true_positive,false_positive,true_negative,false_negative,accuracy,f1,informedness
0,tree training,139.0,11.0,349.0,95.0,0.821549,0.723958,0.563462
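
<div class=h1_cell>
As a reminder of where these three scores come from, here is a sketch of the standard formulas applied to the 4 case counts. The `_sketch` functions are hypothetical stand-ins, not the `week4` library code, and I use a plain dictionary in place of the Series. They do reproduce the training-row values shown above.
</div>

```python
def accuracy_sketch(cases):
    # Fraction of all rows that were predicted correctly.
    total = sum(cases.values())
    return (cases['true_positive'] + cases['true_negative']) / float(total)

def f1_sketch(cases):
    # Harmonic mean of precision and recall, written in terms of counts.
    tp = cases['true_positive']
    fp = cases['false_positive']
    fn = cases['false_negative']
    return 2.0 * tp / (2 * tp + fp + fn)

def informedness_sketch(cases):
    # Recall (true positive rate) plus specificity (true negative rate) minus 1.
    recall = cases['true_positive'] / float(cases['true_positive'] + cases['false_negative'])
    specificity = cases['true_negative'] / float(cases['true_negative'] + cases['false_positive'])
    return recall + specificity - 1

# Counts copied from train_cases above.
train_counts = {'true_negative': 349, 'true_positive': 139,
                'false_negative': 95, 'false_positive': 11}

print(round(accuracy_sketch(train_counts), 6))
print(round(f1_sketch(train_counts), 6))
print(round(informedness_sketch(train_counts), 6))
```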


<div class=h1_cell>
Ok, now let's use the same tree but check it out on the test data, i.e., the 1/3 we held out.
</div>

In [19]:
test_cases = caser(test_table, tree_train, 'Survived')
test_cases

true_negative     182
true_positive      57
false_negative     51
false_positive      7
dtype: int64

In [20]:
results_table = results_table.append(test_cases,ignore_index=True)
end = results_table.last_valid_index()
results_table.loc[end, 'name'] = 'tree test'
results_table.loc[end, 'accuracy'] = accuracy(test_cases)
results_table.loc[end, 'f1'] = f1(test_cases)
results_table.loc[end, 'informedness'] = informedness(test_cases)

results_table.head()

Unnamed: 0,name,true_positive,false_positive,true_negative,false_negative,accuracy,f1,informedness
0,tree training,139.0,11.0,349.0,95.0,0.821549,0.723958,0.563462
1,tree test,57.0,7.0,182.0,51.0,0.804714,0.662791,0.490741


<div class=h1_cell>
Hmmmm. All our scores dropped when using the separate test data. This is almost always the case, typically caused by overfitting on the training set (high variance).
</div>

<h2>
Ok, let's generalize
</h2>
<p>
<div class=h1_cell>
<p>
What we are doing is a simple form of cross-validation: breaking our data up into training and testing sets. I'd like to push a bit harder on the cross-validation idea. I suggest we try more than just one split for training/testing: try a bunch, then average their results. There are many ways to generate a set of splits. I am going to focus on a standard approach called K-Folding. The general idea is that we divide the table into K partitions, or folds. We then build K trees, each trained on K-1 of the folds, and do K tests, each on the one fold its tree did not see. Where does K come from? You get to choose. I will use K=5 below but K=10 is more common.
<p>
Even with this standard algorithm there are variations on how you select the K folds. I am going to use a sequential approach, splitting into folds along the row indices. So my first fold will be from 0 to i, my next fold from i+1 to j, etc.
<p>
I'll also refer to the folds as slices.
</div>
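
<div class=h1_cell>
Here is a small sketch of the sequential approach in terms of row positions only. The helper name `fold_bounds` is my own. Each pair is a (start, stop) range in the usual Python slice sense, so `stop` is not included, and the last fold picks up whatever rows are left over.
</div>

```python
def fold_bounds(n_rows, k):
    """Return k (start, stop) pairs that cover rows 0..n_rows-1 in order."""
    slice_size = int(n_rows / float(k))
    bounds = [(i * slice_size, (i + 1) * slice_size) for i in range(k - 1)]
    bounds.append(((k - 1) * slice_size, n_rows))  # last fold takes the remainder
    return bounds

bounds = fold_bounds(891, 5)  # 891 rows in the Titanic table
fold_sizes = [stop - start for start, stop in bounds]
print(bounds)
print(fold_sizes)
```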

In [21]:
k = 5  # more often 10

total_len = len(titanic_table.index)
slice_size = int(1.0/k*total_len)
slice_size

178

In [22]:
slice_1 = titanic_table[0:slice_size]
slice_2 = titanic_table[1*slice_size:2*slice_size]
slice_3 = titanic_table[2*slice_size:3*slice_size]
slice_4 = titanic_table[3*slice_size:4*slice_size]
slice_5 = titanic_table[4*slice_size:]

<div class=h1_cell>
Now that I have my 5 folds/slices, I'll take the first step: train on 4 of the slices and test on the remaining slice. 
</div>

In [23]:
fold1_test_table = slice_1
fold1_train_table = pd.concat([slice_2, slice_3, slice_4, slice_5])
len(fold1_train_table)

713

<div class=h1_cell>
Keep plodding along until I have 5 training sets and 5 test sets. 
</div>

In [24]:
fold2_test_table = slice_2
fold2_train_table = pd.concat([slice_1, slice_3, slice_4, slice_5])
len(fold2_train_table)

713

In [25]:
fold3_test_table = slice_3
fold3_train_table = pd.concat([slice_1, slice_2, slice_4, slice_5])
len(fold3_train_table)

713

In [26]:
fold4_test_table = slice_4
fold4_train_table = pd.concat([slice_1, slice_2, slice_3, slice_5])
len(fold4_train_table)

713

In [27]:
fold5_test_table = slice_5
fold5_train_table = pd.concat([slice_1, slice_2, slice_3, slice_4])
len(fold5_train_table)

712

<h2>
Try fold-set number 1
</h2>
<p>
<div class=h1_cell>
Whew. We now have 5 pairs of training and test tables. I'll try the first fold and see how it goes. First I'll build a tree from the training table, then test it with the test slice.

</div>

In [28]:
fold1_tree = build_tree_iter(fold1_train_table, splitter_columns, 'Survived', {'max-depth':3})  # train

In [29]:
fold1_cases = caser(fold1_test_table, fold1_tree, 'Survived')  # test

<h2>
Create a new table to hold our results
</h2>
<p>
<div class=h1_cell>
I know I have a results table from the 2/3-1/3 split. I decided to build a new table of results because we are now using something new: K-folding. I'll still use the old columns, though. I am using `hyper5_` to remind myself that K is 5. 

</div>

In [30]:
hyper5_results_table = pd.DataFrame(columns=columns)

In [31]:
hyper5_results_table = hyper5_results_table.append(fold1_cases,ignore_index=True)
end = hyper5_results_table.last_valid_index()
hyper5_results_table.loc[end, 'name'] = 'fold1 test'
hyper5_results_table.loc[end, 'accuracy'] = accuracy(fold1_cases)
hyper5_results_table.loc[end, 'f1'] = f1(fold1_cases)
hyper5_results_table.loc[end, 'informedness'] = informedness(fold1_cases)

hyper5_results_table.head()

Unnamed: 0,name,true_positive,false_positive,true_negative,false_negative,accuracy,f1,informedness
0,fold1 test,34.0,9.0,110.0,25.0,0.808989,0.666667,0.500641


<h2>
Ok, now for fold 2
</h2>
<p>
<div class=h1_cell>
We now have results from the first fold-set. Let's try the next.

</div>

In [32]:
fold2_tree = build_tree_iter(fold2_train_table, splitter_columns, 'Survived', {'max-depth':3})
fold2_cases = caser(fold2_test_table, fold2_tree, 'Survived')
hyper5_results_table = hyper5_results_table.append(fold2_cases,ignore_index=True)
end = hyper5_results_table.last_valid_index()
hyper5_results_table.loc[end, 'name'] = 'fold2 test'
hyper5_results_table.loc[end, 'accuracy'] = accuracy(fold2_cases)
hyper5_results_table.loc[end, 'f1'] = f1(fold2_cases)
hyper5_results_table.loc[end, 'informedness'] = informedness(fold2_cases)

hyper5_results_table.head()

Unnamed: 0,name,true_positive,false_positive,true_negative,false_negative,accuracy,f1,informedness
0,fold1 test,34.0,9.0,110.0,25.0,0.808989,0.666667,0.500641
1,fold2 test,48.0,5.0,95.0,30.0,0.803371,0.732824,0.565385


<h2>
Kind of tedious
</h2>
<p>
<div class=h1_cell>
I could keep going with the remaining 3. But when I move to K=10, it's a bit much to repeat all this 10 times for the 10 folds. So let's build a new function to do it for us. First, I'll define a helper function that, given a list of slices and an index, concatenates every slice except the one at that index and returns the result as a table. Why? Because that is exactly the training table we need for the fold whose test slice is at that index.
</div>

In [33]:
def compute_training(slices, left_out):
    training_slices = []
    for i in range(len(slices)):
        if i == left_out:
            continue
        training_slices.append(slices[i])
    return pd.concat(training_slices)  # note we are returning a table (DataFrame)

<h2>
Ready to automate
</h2>
<p>
<div class=h1_cell>
I'll define a function that takes as arguments (a) the big table, (b) the value for K, (c) the target column, (d) the hyper-parameters to be used in building a model, and (e) the candidate columns to build the splitters from.
<p>
The function's output will be a results table. Notice I also added a comment onto the table to help me remember what hyper-parameters I was using.
</div>

In [34]:
def k_fold(table, k, target, hypers, candidate_columns):
    result_columns = ['name', 'true_positive', 'false_positive', 'true_negative', 'false_negative', 'accuracy', 'f1', 'informedness']
    k_fold_results_table = pd.DataFrame(columns=result_columns)
    
    total_len = len(table.index)
    slice_size = int(total_len/(1.0*k))
    slices = []

    #generate the slices
    for i in range(k-1):
        a_slice =  table[i*slice_size:(i+1)*slice_size]
        slices.append( a_slice )
    slices.append( table[(k-1)*slice_size:] )  # whatever is left
    
    #generate test results
    for i in range(k):
        test_table = slices[i]
        train_table = compute_training(slices, i)
        fold_tree = build_tree_iter(train_table, candidate_columns, target, hypers)  # train
        fold_cases = caser(test_table, fold_tree, target)  # test

        k_fold_results_table = k_fold_results_table.append(fold_cases,ignore_index=True)
        end = k_fold_results_table.last_valid_index()
        k_fold_results_table.loc[end, 'name'] = 'fold '+str(i+1)+' test'
        k_fold_results_table.loc[end, 'accuracy'] = accuracy(fold_cases)
        k_fold_results_table.loc[end, 'f1'] = f1(fold_cases)
        k_fold_results_table.loc[end, 'informedness'] = informedness(fold_cases)
        
    k_fold_results_table.__doc__ = str(hypers)  # adds comment to remind me of hyper params used
    return k_fold_results_table

<h2>
Let's try it out with default hyper params
</h2>
<p>
<div class=h1_cell>
I'll use K=5 and default values for hyper-parameters.
</div>

In [35]:
default5_table = k_fold(titanic_table, 5, 'Survived', {}, splitter_columns)  # max-depth=4

In [36]:
default5_table.__doc__  # will be empty dict because we are using default values

'{}'

In [37]:
default5_table.head()  # since K=5 this will show all the rows

Unnamed: 0,name,true_positive,false_positive,true_negative,false_negative,accuracy,f1,informedness
0,fold 1 test,34.0,9.0,110.0,25.0,0.808989,0.666667,0.500641
1,fold 2 test,51.0,5.0,95.0,27.0,0.820225,0.761194,0.603846
2,fold 3 test,46.0,5.0,103.0,24.0,0.837079,0.760331,0.610847
3,fold 4 test,38.0,6.0,101.0,33.0,0.780899,0.66087,0.479137
4,fold 5 test,41.0,5.0,110.0,23.0,0.843575,0.745455,0.597147


In [38]:
default5_table.describe()  # can use this to see mean of columns

Unnamed: 0,true_positive,false_positive,true_negative,false_negative,accuracy,f1,informedness
count,5.0,5.0,5.0,5.0,5.0,5.0,5.0
mean,42.0,6.0,103.8,26.4,0.818153,0.718903,0.558323
std,6.670832,1.732051,6.379655,3.974921,0.024903,0.05076,0.063119
min,34.0,5.0,95.0,23.0,0.780899,0.66087,0.479137
25%,38.0,5.0,101.0,24.0,0.808989,0.666667,0.500641
50%,41.0,5.0,103.0,25.0,0.820225,0.745455,0.597147
75%,46.0,6.0,110.0,27.0,0.837079,0.760331,0.603846
max,51.0,9.0,110.0,33.0,0.843575,0.761194,0.610847


<h2>
Tuning hyper-parameters
</h2>
<p>
<div class=h1_cell>
We have 2 hyper-parameters. I'll concentrate on max-depth. I have the results for the default max-depth (i.e., 4) above. The means from the 5 folds for (accuracy, f1, informedness) are (0.818153, 0.718903, 0.558323).
<p>
I'll now try changing the max-depth to 3 and see how we do. As a reminder, we did this in the last module and got a result that was the same as with max-depth 4. But that was using the same data for training and testing. Now let's see if K-folding gives us a different answer.
</div>

In [39]:
max3_table = k_fold(titanic_table, 5, 'Survived', {'max-depth':3}, splitter_columns)
max3_table

Unnamed: 0,name,true_positive,false_positive,true_negative,false_negative,accuracy,f1,informedness
0,fold 1 test,34.0,9.0,110.0,25.0,0.808989,0.666667,0.500641
1,fold 2 test,48.0,5.0,95.0,30.0,0.803371,0.732824,0.565385
2,fold 3 test,46.0,5.0,103.0,24.0,0.837079,0.760331,0.610847
3,fold 4 test,38.0,6.0,101.0,33.0,0.780899,0.66087,0.479137
4,fold 5 test,38.0,5.0,110.0,26.0,0.826816,0.71028,0.550272


In [40]:
max3_table.__doc__

"{'max-depth': 3}"

In [41]:
max3_table.describe()

Unnamed: 0,true_positive,false_positive,true_negative,false_negative,accuracy,f1,informedness
count,5.0,5.0,5.0,5.0,5.0,5.0,5.0
mean,40.8,6.0,103.8,27.6,0.811431,0.706194,0.541256
std,5.932959,1.732051,6.379655,3.781534,0.021781,0.042642,0.052476
min,34.0,5.0,95.0,24.0,0.780899,0.66087,0.479137
25%,38.0,5.0,101.0,25.0,0.803371,0.666667,0.500641
50%,38.0,5.0,103.0,26.0,0.808989,0.71028,0.550272
75%,46.0,6.0,110.0,30.0,0.826816,0.732824,0.565385
max,48.0,9.0,110.0,33.0,0.837079,0.760331,0.610847


<div class=h1_cell>
<p>
Here are our means (accuracy, f1, informedness) from depth 4: (0.818153, 0.718903, 0.558323).
<p>
Here are our means (accuracy, f1, informedness) from depth 3: (0.811431, 0.706194, 0.541256).
<p>
Using 5 folds, we lost a little ground when going from 4 to 3.
<p>
Let's try depth 2. Note that by decreasing the depth, we are moving the needle away from high variance but towards high bias.
</div>

In [42]:
max2_table = k_fold(titanic_table, 5, 'Survived', {'max-depth':2}, splitter_columns)
max2_table.head()

Unnamed: 0,name,true_positive,false_positive,true_negative,false_negative,accuracy,f1,informedness
0,fold 1 test,23.0,8.0,111.0,36.0,0.752809,0.511111,0.322604
1,fold 2 test,39.0,3.0,97.0,39.0,0.764045,0.65,0.47
2,fold 3 test,51.0,18.0,90.0,19.0,0.792135,0.733813,0.561905
3,fold 4 test,42.0,17.0,90.0,29.0,0.741573,0.646154,0.432671
4,fold 5 test,43.0,15.0,100.0,21.0,0.798883,0.704918,0.54144


In [43]:
max2_table.__doc__

"{'max-depth': 2}"

In [44]:
max2_table.describe()

Unnamed: 0,true_positive,false_positive,true_negative,false_negative,accuracy,f1,informedness
count,5.0,5.0,5.0,5.0,5.0,5.0,5.0
mean,39.6,12.2,97.6,28.8,0.769889,0.649199,0.465724
std,10.285913,6.457554,8.677557,8.843076,0.024815,0.085648,0.095627
min,23.0,3.0,90.0,19.0,0.741573,0.511111,0.322604
25%,39.0,8.0,90.0,21.0,0.752809,0.646154,0.432671
50%,42.0,15.0,97.0,29.0,0.764045,0.65,0.47
75%,43.0,17.0,100.0,36.0,0.792135,0.704918,0.54144
max,51.0,18.0,111.0,39.0,0.798883,0.733813,0.561905


<div class=h1_cell>
We lost quite a bit of ground with depth 2.
</div>
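
<div class=h1_cell>
To line the three runs up side by side, here is a small sketch that copies the means and the accuracy standard deviations from the `describe()` outputs above into plain dictionaries. The variable names are my own. Picking the winner by mean accuracy alone is just one possible rule, so treat it as an illustration rather than the final word.
</div>

```python
# Means over the 5 folds, copied from the describe() outputs above.
fold_means = {
    4: {'accuracy': 0.818153, 'f1': 0.718903, 'informedness': 0.558323},
    3: {'accuracy': 0.811431, 'f1': 0.706194, 'informedness': 0.541256},
    2: {'accuracy': 0.769889, 'f1': 0.649199, 'informedness': 0.465724},
}
# Standard deviation of accuracy over the 5 folds, same source.
accuracy_std = {4: 0.024903, 3: 0.021781, 2: 0.024815}

# Pick the depth with the highest mean accuracy.
best_depth = max(fold_means, key=lambda depth: fold_means[depth]['accuracy'])

# How big is the gap between depth 4 and depth 3 compared to fold-to-fold spread?
gap_4_vs_3 = fold_means[4]['accuracy'] - fold_means[3]['accuracy']
gap_4_vs_2 = fold_means[4]['accuracy'] - fold_means[2]['accuracy']

print(best_depth)
print(round(gap_4_vs_3, 6))
print(round(gap_4_vs_2, 6))
```

<div class=h1_cell>
The gap between depth 4 and depth 3 is smaller than the fold-to-fold standard deviation, so I would not read too much into it. The drop to depth 2 is larger than that spread.
</div>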

<h2>
Where now?
</h2>
<p>
<div class=h1_cell>
<p>
We could try a few more values of depth. But we could also start playing with the other knob, the gig cutoff. By the end of our exploration, we should have a good idea of what values to set our hyper-parameters to. At that point, we will generate the final tree using all of the data. Something like this:
<p>
<pre>
<code>
optimal_depth = ...  # what we discovered in our K-folding
optimal_gig_cutoff = ...  # ditto
hypers = {'max-depth': optimal_depth, 'gig-cutoff': optimal_gig_cutoff}
final_tree = build_tree_iter(titanic_table, splitter_columns, 'Survived', hypers)
</code>
</pre>
</div>

<hr>
<h1>Write it out</h1>
<div class=h1_cell>

Save the table so we can use it in the next module.
</div>

In [45]:
import os

week = 5  # change this each week

home_path =  os.path.expanduser('~')

file_path = '/Dropbox/cis399_ds1_f17/notebook_history/'

file_name = 'titanic_wrangled_w'+str(week)+'.csv'

titanic_table.to_csv(home_path + file_path + file_name, index=False)


<h2>
Next up
</h2>
<p>
<div class=h1_cell>
Think about this. With K-folding, we are building K separate trees but then throwing them away. The final tree is produced from all the data. What if we decided not to throw those K trees away? What if we chose to keep all the trees as the "final tree"? We would have an ensemble of trees (AKA a forest). How would they agree among themselves on the correct prediction? How about simply letting them vote? That's what is coming up next.
</div>
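
<div class=h1_cell>
As a tiny preview, here is one way voting could look once each tree has produced its 0/1 prediction for a row. The function name `majority_vote` is hypothetical, and the next module may do this differently.
</div>

```python
from collections import Counter

def majority_vote(predictions):
    """Return the prediction that the most trees agree on."""
    counts = Counter(predictions)
    # most_common(1) gives a list holding the single (value, count) pair
    # with the largest count.
    return counts.most_common(1)[0][0]

# Five hypothetical trees, each predicting 0 or 1 for the same passenger.
print(majority_vote([1, 0, 1, 1, 0]))
```

<div class=h1_cell>
With two possible outputs and an odd number of trees (like our K=5), there can be no ties.
</div>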