In [1]:
%%html
<style>
.h1_cell, .just_text {
    box-sizing: border-box;
    padding-top:5px;
    padding-bottom:5px;
    font-family: "Times New Roman", Georgia, Serif;
    font-size: 125%;
    line-height: 22px; /* 5px +12px + 5px */
    text-indent: 25px;
    background-color: #fbfbea;
    padding: 10px;
}

hr { 
    display: block;
    margin-top: 0.5em;
    margin-bottom: 0.5em;
    margin-left: auto;
    margin-right: auto;
    border-style: inset;
    border-width: 2px;
}
</style>

<h1>
<center>
Module 6: Random Forests - another approach to Bias-Variance
</center>
</h1>
<div class=h1_cell>
<p>
In this module we will continue to look at means of tackling the Bias-Variance problem. We will focus on the Variance problem of overfitting. We will see a new concept called *ensemble* learning. I liken it to crowd-sourcing. Instead of relying on just one expert, let's round up a collection (AKA an ensemble) of experts. We can let them each, individually, come up with a prediction. Then we can take a vote and use the winning prediction.
<p>
The technique we will look at is called *Random Forests*. It is a special method falling under the more general heading of *bagging*. We will crowd-source a forest of trees to get their predictions and then take majority vote. Where, you ask, does this forest of trees come from? We build them following relatively straightforward steps. So to summarize, we first build our forest of decision trees. Then when we want to do predictions for real, we give each tree a vote and majority wins. Cool.
</div>
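
<div class=h1_cell>
<p>
To get a feel for why a vote can beat a single expert, here is a small sketch. It rests on an assumption of mine that the module has not made: every expert is right with the same probability, and experts make their mistakes independently of one another. Trees in a real forest are not fully independent, so treat the numbers as an optimistic illustration. The function names `n_choose_k` and `majority_correct` are made up for this example.
</div>

```python
from math import factorial

def n_choose_k(n, k):
    # number of ways to choose k items out of n
    return factorial(n) // (factorial(k) * factorial(n - k))

def majority_correct(n_experts, p_right):
    # Probability that more than half of n_experts independent experts are right,
    # when each one is right with probability p_right.
    # Assumes n_experts is odd so there are no ties.
    needed = n_experts // 2 + 1
    total = 0.0
    for k in range(needed, n_experts + 1):
        total += n_choose_k(n_experts, k) * p_right**k * (1 - p_right)**(n_experts - k)
    return total

single = majority_correct(1, 0.6)   # one expert on their own
crowd = majority_correct(11, 0.6)   # majority vote of eleven such experts
print(single, crowd)
```

<div class=h1_cell>
<p>
Under these assumptions the crowd of 11 is right more often than any one of its members. The less the experts agree in their mistakes, the more the vote helps, which is why we will work to make our trees different from each other.
</div>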

<h2>
Jargon alerts
</h2>
<div class=h1_cell>
<p>
Random forests have jargon that goes with them. I'll alert you to where jargony terms show up.
<p>
Jargon alert: *Random Forest* is jargon :) And so is *bagging* and *ensemble*.
</div>

In [2]:
import pandas as pd
import os

week = 5  # from last module

home_path =  os.path.expanduser('~')

file_path = '/Dropbox/cis399_ds1_f17/notebook_history/'

file_name = 'titanic_wrangled_w'+str(week)+'.csv'

titanic_table = pd.read_csv(home_path + file_path + file_name)

pd.__version__  # should see 0.20.3 or higher

u'0.20.3'

In [3]:
pd.set_option('display.max_columns', None)

In [4]:
os.chdir(home_path + '/Dropbox/cis399_ds1_f17/week_libraries/datascience_1')
!git pull

Already up-to-date.


In [5]:
import sys
sys.path.append(home_path + '/Dropbox/cis399_ds1_f17/week_libraries/datascience_1')

In [6]:
from week5 import *

%who function

accuracy	 build_pred	 build_tree_iter	 caser	 compute_prediction	 compute_training	 f1	 find_best_splitter	 generate_table	 
gig	 gini	 informedness	 k_fold	 predictor_case	 probabilities	 tree_predictor	 


In [7]:
titanic_table.head(1)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,no_age,filled_age,emb_C,emb_Q,emb_S,emb_nan,age_bin,age_Child,age_Adult,age_Senior,sex_female,sex_male,ok_child,pclass_1,pclass_2,pclass_3,pclass_nan
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,22.0,0,0,1,0,Child,1,0,0,0,1,0,0,0,1,0


<h2>
A forest starts with some trees
</h2>
<div class=h1_cell>
<p>
Let's build a forest of two trees to get started. Once we see how to do that, we can scale it up to N trees.
<p>
Our approach will be to do random selections of both the rows (axis=0) and columns (axis=1) as we build the tree.
</div>

<h2>
Step 1. Generate the training data for the first tree (and don't lose the left-out data)
</h2>
<div class=h1_cell>
<p>
We will take a slice from the entire table to use as training. This may sound familiar: we did something similar when doing K-folding. The big difference here is that we will take random rows *with replacement*. This means the same row can appear more than once in our slice. With K-Folding, we did not let this happen. And BTW, the size of the slice is the same as the size of the original table, i.e., the slice has 891 rows!
<p>
We don't want to lose track of the rows we did not use. They will become important later.
<p>
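Jargon alert: selecting a random sample of rows for training (with replacement) is called *bootstrapping*. Training one model per bootstrapped sample and then combining their predictions is called *bagging* (short for bootstrap aggregating).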
</div>

<h2>
Random but predictable
</h2>
<div class=h1_cell>
<p>
I am going to be needing random values both in Python and in Pandas. I am going to set things up so that I always get the same random values when I restart the kernel and run this notebook from the top. Why? Helps me debug to have the same values every time.
</div>
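
<div class=h1_cell>
<p>
Here is a minimal sketch of what "random but predictable" means. It uses a private `random.Random` generator so it does not disturb the global generator that the rest of the notebook seeds. The helper name `draw_three` and the seed values are just for illustration.
</div>

```python
import random

def draw_three(seed):
    # a private generator seeded with `seed`; the global random state is untouched
    rng = random.Random(seed)
    return [rng.randint(0, 890) for _ in range(3)]

first_run = draw_three(1000)
second_run = draw_three(1000)   # same seed, so the same three values come back

print(first_run)
print(first_run == second_run)
```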

In [8]:
# All this does is give me a function that produces a sequence of ints starting at the seed.
# I'm using a closure to hide the counter z. Could use a Python class but here I like a closure better.

def seeder(seed):
    z = [seed]  # why embed in list? See https://stackoverflow.com/a/4851555
    def f():
        val = z[0]
        z[0] += 1
        return val
    return f

new_seed = seeder(100)  # new_seed is a function that will return a sequence of ints, one on each new call

In [9]:
import random
random.seed(1000)  # this seeds the Python random number generator

<h2>
Let's bootstrap!
</h2>
<div class=h1_cell>
<p>
We need a table that is the same size as the Titanic table (891 rows). We will select the rows in the Titanic table randomly. We will allow replacement: the same row can be selected multiple times. What is very cool is that Pandas gives us a method, `sample`, that does exactly what we want. Pretty nice of them.
<p>
As you can see below, I am setting the fraction of the table I want to 100%. And I am using my new_seed function to give me the random seed.
<p>
Once I have my new table, I am going to reindex it. This will create a new column `index` that has the row numbers from the original Titanic table. I'll want those later. Check it out.
</div>
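
<div class=h1_cell>
<p>
If you want to see the mechanics on something small first, here is a sketch on a made-up five-row table (the values are invented, and `random_state=0` is an arbitrary choice). It shows the two things we rely on: the bootstrapped table has the same number of rows as the original, and `reset_index` moves the old row labels into a column named `index`.
</div>

```python
import pandas as pd

# toy stand-in for the Titanic table (made-up values)
toy = pd.DataFrame({'Survived': [0, 1, 1, 0, 1],
                    'sex_female': [0, 1, 1, 0, 0]})

boot = toy.sample(frac=1.0, replace=True, random_state=0)  # same size, rows may repeat
boot = boot.reset_index()  # old row labels move into a new column named 'index'

print(boot)
print(len(boot) == len(toy))
```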

In [10]:
train1 = titanic_table.sample(frac=1.0, replace=True, random_state=new_seed())  # Easy peasy - thanks pandas!
train1 = train1.reset_index()
train1.head()

Unnamed: 0,index,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,no_age,filled_age,emb_C,emb_Q,emb_S,emb_nan,age_bin,age_Child,age_Adult,age_Senior,sex_female,sex_male,ok_child,pclass_1,pclass_2,pclass_3,pclass_nan
0,520,1,1,"Perreault, Miss. Anne",female,30.0,0,0,12749,93.5,B73,S,0,30.0,0,0,1,0,Adult,0,1,0,1,0,0,1,0,0,0
1,792,0,3,"Sage, Miss. Stella Anna",female,,8,2,CA. 2343,69.55,,S,1,29.699118,0,0,1,0,Adult,0,1,0,1,0,0,0,0,1,0
2,835,1,1,"Compton, Miss. Sara Rebecca",female,39.0,1,1,PC 17756,83.1583,E49,C,0,39.0,1,0,0,0,Adult,0,1,0,1,0,0,1,0,0,0
3,871,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S,0,47.0,0,0,1,0,Adult,0,1,0,1,0,0,1,0,0,0
4,855,1,3,"Aks, Mrs. Sam (Leah Rosen)",female,18.0,0,1,392091,9.35,,S,0,18.0,0,0,1,0,Child,1,0,0,1,0,0,0,0,1,0


In [11]:
#Just for giggles, get a count of how many rows duplicated in train1
train1.duplicated(['index'], keep=False).value_counts()

True     595
False    296
dtype: int64

<h2>
We will need the leftovers eventually
</h2>
<div class=h1_cell>
<p>
Since we have duplicates in `train1`, there must be some rows from the Titanic table that were not included in train1. I would like to know which rows were left out of train1.
<p>
Jargon alert: the rows that are left out are called *out of bag*.
</div>

In [12]:
left_out1 = titanic_table.loc[~titanic_table.index.isin(train1['index'])]
left_out1 = left_out1.reset_index()  #builds a new index column
left_out1.head()

Unnamed: 0,index,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,no_age,filled_age,emb_C,emb_Q,emb_S,emb_nan,age_bin,age_Child,age_Adult,age_Senior,sex_female,sex_male,ok_child,pclass_1,pclass_2,pclass_3,pclass_nan
0,10,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S,0,4.0,0,0,1,0,Child,1,0,0,1,0,1,0,0,1,0
1,12,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.05,,S,0,20.0,0,0,1,0,Child,1,0,0,0,1,0,0,0,1,0
2,19,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.225,,C,1,29.699118,1,0,0,0,Adult,0,1,0,1,0,0,0,0,1,0
3,22,1,3,"McGowan, Miss. Anna ""Annie""",female,15.0,0,0,330923,8.0292,,Q,0,15.0,0,1,0,0,Child,1,0,0,1,0,0,0,0,1,0
4,23,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5,A6,S,0,28.0,0,0,1,0,Adult,0,1,0,0,1,0,1,0,0,0


In [13]:
#We expect no True values with this - should have unique rows
left_out1.duplicated(['index'], keep=False).value_counts()

False    342
dtype: int64
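
<div class=h1_cell>
<p>
Is 342 left-out rows about what we should expect? Here is a back-of-the-envelope check. Each of the n draws misses a given row with probability 1 - 1/n, so the chance that a row is never picked in n draws is (1 - 1/n) raised to the n. For large n that approaches 1/e. The variable names below are mine.
</div>

```python
import math

n = 891  # rows in the Titanic table
p_left_out = (1 - 1.0 / n) ** n      # chance a given row is never picked in n draws
expected_left_out = n * p_left_out   # expected number of out of bag rows

print(p_left_out, math.exp(-1), expected_left_out)
```

<div class=h1_cell>
<p>
So we expect a bit over a third of the rows to be left out of any one bootstrapped sample, which is in the same ballpark as the count above.
</div>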


<div class=h1_cell>
<p>
We don't really need the entire left_out1 table. All we need are the values in the `index` column. We can use those to access rows in the Titanic table later. So I will pull those indices out and put them in a list.
</div>

In [14]:
left_out_indices1 = left_out1['index'].tolist()
left_out_indices1[:5]  # should be same as what see in head() above

[10, 12, 19, 22, 23]

<h2>
Let's congratulate ourselves
</h2>
<div class=h1_cell>
<p>
We have completed the first big step in building our two-tree forest. We have generated a bootstrapped table, `train1`, that we can use for training our first tree.
<p>
Next step is to do the training.

</div>

In [15]:
splitter_columns = [
 'emb_C',
 'emb_Q',
 'emb_S',
 'emb_nan',
 'age_Child',
 'age_Adult',
 'age_Senior',
 'no_age',
 'ok_child',
 'sex_female',
 'pclass_1',
 'pclass_2',
 'pclass_3',
 'pclass_nan'
]

<h2>
Training is a bit different
</h2>
<div class=h1_cell>
<p>
Normally we would use all the columns in splitter_columns to build our tree. We are going to do something different. For each node in the tree we are building, I will only choose the best splitter from a subset of splitter_columns. What I choose to be in the subset will be random. The size of the subset, which I will call `m`, is a hyper-parameter that you can set. A rule of thumb is to set `m` to (at max) the square root of the length of `splitter_columns`. That length is 14 so I will use a value of 3 for m.
<p>
Jargon alert: choosing a random subset of columns/features is called *attribute bagging* which is a type of *random subspace sampling*.

</div>

In [16]:
m = int(len(splitter_columns)**.5)  # default is square root of total number of splitters rounded down
m

3

<h2>
The Fickas variation
</h2>
<div class=h1_cell>
<p>
Normally we would take a sample of m *with* replacement. That means, for m = 3, we could potentially have the result be `['no_age', 'no_age', 'no_age']`. Thought question: how would this shake out with `find_best_splitter`?
<p>
For efficiency, I am going to sample *without* replacement. So I am not allowing duplicates in my resulting list. I'll pay a price for this potentially. What is that price? Work on my thought question :)
</div>
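
<div class=h1_cell>
<p>
To make the two sampling styles concrete, here is a small sketch. The column list is a shortened copy of `splitter_columns`, and the seed of 42 is arbitrary. Sampling without replacement always gives m distinct columns. Sampling with replacement gives m picks, but the number of distinct columns among them can be smaller than m.
</div>

```python
import random

columns = ['emb_C', 'emb_Q', 'emb_S', 'no_age', 'ok_child', 'sex_female']
m = 3
rng = random.Random(42)  # private generator, seed chosen arbitrarily

without_replacement = rng.sample(columns, m)                 # m distinct columns
with_replacement = [rng.choice(columns) for _ in range(m)]   # repeats are possible

print(without_replacement, len(set(without_replacement)))
print(with_replacement, len(set(with_replacement)))
```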

In [17]:
#Use random library sample method to get sample without replacement
rcols = random.sample(splitter_columns, m)
rcols

['pclass_1', 'ok_child', 'emb_Q']


<div class=h1_cell>
<p>
We now have the candidate splitters (3 of them) for the root node of the tree. We can use our library functions from past modules to do the rest.
</div>

In [18]:
columns_sorted = find_best_splitter(train1, rcols, 'Survived')  #notice using train1 and rcols
(best_column, gig_value) = columns_sorted[0]
print((best_column, gig_value))

('pclass_1', 0.033921730457458055)



<div class=h1_cell>
<p>
Now we can build the 2 starting paths emanating from the root node.
</div>
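
<div class=h1_cell>
<p>
As a refresher on the path structure, here is a toy sketch. I am not repeating the library code, so `make_pred` and `rows_on_path` are hypothetical stand-ins for `build_pred` and `generate_table`. They reflect my assumption about how those functions behave: a predicate tests one column of a row for one value, and a path keeps only the rows that pass every predicate in its conjunction. The rows are made up.
</div>

```python
def make_pred(column, value):
    # stand-in for build_pred: returns a test that a row has `value` in `column`
    return lambda row: row[column] == value

def rows_on_path(rows, conjunction):
    # stand-in for generate_table: keep rows that pass every test in the conjunction
    return [row for row in rows if all(test(row) for (name, test) in conjunction)]

toy_rows = [
    {'pclass_1': 1, 'age_Child': 0, 'Survived': 1},
    {'pclass_1': 1, 'age_Child': 1, 'Survived': 1},
    {'pclass_1': 0, 'age_Child': 0, 'Survived': 0},
]

# same shape as the path dictionaries used in this notebook
path = {'conjunction': [('pclass_1_1', make_pred('pclass_1', 1)),
                        ('age_Child_0', make_pred('age_Child', 0))],
        'prediction': None,
        'gig_score': None}

matching = rows_on_path(toy_rows, path['conjunction'])
print(matching)
```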

In [19]:

current_paths = [{'conjunction': [(best_column+'_1', build_pred(best_column, 1))],
                  'prediction': None,
                  'gig_score': gig_value},
                 {'conjunction': [(best_column+'_0', build_pred(best_column, 0))],
                  'prediction': None,
                  'gig_score': gig_value}]
                 
current_paths

[{'conjunction': [('pclass_1_1', <function week4.<lambda>>)],
  'gig_score': 0.033921730457458055,
  'prediction': None},
 {'conjunction': [('pclass_1_0', <function week4.<lambda>>)],
  'gig_score': 0.033921730457458055,
  'prediction': None}]

<div class=h1_cell>
I'll follow with another round of splitting, i.e., grow the tree to level 2. I am copying and pasting code from `build_tree_iter` here. But I am making some changes, which I'll mark with `new` comments.
</div>

In [20]:
table = train1

tree_paths = []
new_paths = []
gig_cutoff = 0.0

for path in current_paths:
    conjunct = path['conjunction']
    before_table = generate_table(table, conjunct)
    rcols = random.sample(splitter_columns, m)       #new - chooses random subset for each node
    columns_sorted = find_best_splitter(before_table, rcols, 'Survived')  #new - using rcols
    (best_column, gig_value) = columns_sorted[0]
    if gig_value > gig_cutoff:
        new_path_1 = {'conjunction': conjunct + [(best_column+'_1', build_pred(best_column, 1))],
                    'prediction': None,
                     'gig_score': gig_value}
        new_paths.append( new_path_1 ) #true
        new_path_0 = {'conjunction': conjunct + [(best_column+'_0', build_pred(best_column, 0))],
                    'prediction': None,
                     'gig_score': gig_value
                     }
        new_paths.append( new_path_0 ) #false
    else:
        #not worth splitting so complete the path with a prediction
        path['prediction'] = compute_prediction(before_table, 'Survived')
        tree_paths.append(path)

In [21]:
new_paths

[{'conjunction': [('pclass_1_1', <function week4.<lambda>>),
   ('age_Child_1', <function week4.<lambda>>)],
  'gig_score': 0.01159708069026677,
  'prediction': None},
 {'conjunction': [('pclass_1_1', <function week4.<lambda>>),
   ('age_Child_0', <function week4.<lambda>>)],
  'gig_score': 0.01159708069026677,
  'prediction': None},
 {'conjunction': [('pclass_1_0', <function week4.<lambda>>),
   ('ok_child_1', <function week4.<lambda>>)],
  'gig_score': 0.018515700385069722,
  'prediction': None},
 {'conjunction': [('pclass_1_0', <function week4.<lambda>>),
   ('ok_child_0', <function week4.<lambda>>)],
  'gig_score': 0.018515700385069722,
  'prediction': None}]

In [22]:
tree_paths  # should be empty list

[]

<div class=h1_cell>
I'll stop here with a tree of level 2. Now I compute a prediction for each path and copy the paths into tree_paths.
</div>

In [23]:
for path in new_paths:
    conjunct = path['conjunction']
    before_table = generate_table(table, conjunct)
    path['prediction'] = compute_prediction(before_table, 'Survived')
    tree_paths.append(path)

In [24]:
tree_paths

[{'conjunction': [('pclass_1_1', <function week4.<lambda>>),
   ('age_Child_1', <function week4.<lambda>>)],
  'gig_score': 0.01159708069026677,
  'prediction': 1},
 {'conjunction': [('pclass_1_1', <function week4.<lambda>>),
   ('age_Child_0', <function week4.<lambda>>)],
  'gig_score': 0.01159708069026677,
  'prediction': 1},
 {'conjunction': [('pclass_1_0', <function week4.<lambda>>),
   ('ok_child_1', <function week4.<lambda>>)],
  'gig_score': 0.018515700385069722,
  'prediction': 1},
 {'conjunction': [('pclass_1_0', <function week4.<lambda>>),
   ('ok_child_0', <function week4.<lambda>>)],
  'gig_score': 0.018515700385069722,
  'prediction': 0}]

<div class=h1_cell>
I'm going to add another attribute `oob` to a tree. It stands for Out of Bag. We will make use of it later.
</div>

In [25]:
tree1 = {'paths': tree_paths, 'weight': None, 'oob': left_out_indices1}

In [26]:
forest1 = [tree1]

<h2>
Big hand clap
</h2>
<div class=h1_cell>
<p>
We have successfully built the first tree in the forest. We could stop here with a one-tree forest (boring). Let's add at least one more tree.
<p>
I'll build the second tree without much in the way of comments. It will look the same as the construction of tree1.

</div>

In [27]:
#First generate new training data - every tree gets its own data
train2 = titanic_table.sample(frac=1.0, replace=True, random_state=new_seed())
left_out2 = titanic_table.loc[~titanic_table.index.isin(train2.index)]

In [28]:
train2 = train2.reset_index()
left_out2 =left_out2.reset_index()
left_out_indices2 = left_out2['index'].tolist()

In [29]:
train2.duplicated(['index'], keep=False).value_counts()

True     565
False    326
dtype: int64

In [30]:
left_out_indices2[:5]

[0, 2, 6, 7, 8]

In [31]:
#build the root node

rcols = random.sample(splitter_columns, m)
rcols

['age_Adult', 'age_Senior', 'emb_S']

In [32]:
columns_sorted = find_best_splitter(train2, rcols, 'Survived')
(best_column, gig_value) = columns_sorted[0]
print((best_column, gig_value))

('emb_S', 0.03150290796366739)


In [33]:

current_paths = [{'conjunction': [(best_column+'_1', build_pred(best_column, 1))],
                  'prediction': None,
                  'gig_score': gig_value},
                 {'conjunction': [(best_column+'_0', build_pred(best_column, 0))],
                  'prediction': None,
                  'gig_score': gig_value}]
                 
current_paths

[{'conjunction': [('emb_S_1', <function week4.<lambda>>)],
  'gig_score': 0.03150290796366739,
  'prediction': None},
 {'conjunction': [('emb_S_0', <function week4.<lambda>>)],
  'gig_score': 0.03150290796366739,
  'prediction': None}]

In [34]:
table = train2

tree_paths = []
new_paths = []
gig_cutoff = 0.0

for path in current_paths:
    conjunct = path['conjunction']
    before_table = generate_table(table, conjunct)
    rcols = random.sample(splitter_columns, m)       #new - chooses random subset for each node
    columns_sorted = find_best_splitter(before_table, rcols, 'Survived')  #using rcols
    (best_column, gig_value) = columns_sorted[0]
    if gig_value > gig_cutoff:
        new_path_1 = {'conjunction': conjunct + [(best_column+'_1', build_pred(best_column, 1))],
                    'prediction': None,
                     'gig_score': gig_value}
        new_paths.append( new_path_1 ) #true
        new_path_0 = {'conjunction': conjunct + [(best_column+'_0', build_pred(best_column, 0))],
                    'prediction': None,
                     'gig_score': gig_value
                     }
        new_paths.append( new_path_0 ) #false
    else:
        #not worth splitting so complete the path with a prediction
        path['prediction'] = compute_prediction(before_table, 'Survived')
        tree_paths.append(path)

In [35]:
new_paths

[{'conjunction': [('emb_S_1', <function week4.<lambda>>),
   ('sex_female_1', <function week4.<lambda>>)],
  'gig_score': 0.09709945586203417,
  'prediction': None},
 {'conjunction': [('emb_S_1', <function week4.<lambda>>),
   ('sex_female_0', <function week4.<lambda>>)],
  'gig_score': 0.09709945586203417,
  'prediction': None},
 {'conjunction': [('emb_S_0', <function week4.<lambda>>),
   ('pclass_3_1', <function week4.<lambda>>)],
  'gig_score': 0.07297366447176823,
  'prediction': None},
 {'conjunction': [('emb_S_0', <function week4.<lambda>>),
   ('pclass_3_0', <function week4.<lambda>>)],
  'gig_score': 0.07297366447176823,
  'prediction': None}]

In [36]:
for path in new_paths:
    conjunct = path['conjunction']
    before_table = generate_table(table, conjunct)
    path['prediction'] = compute_prediction(before_table, 'Survived')
    tree_paths.append(path)

In [37]:
tree_paths

[{'conjunction': [('emb_S_1', <function week4.<lambda>>),
   ('sex_female_1', <function week4.<lambda>>)],
  'gig_score': 0.09709945586203417,
  'prediction': 1},
 {'conjunction': [('emb_S_1', <function week4.<lambda>>),
   ('sex_female_0', <function week4.<lambda>>)],
  'gig_score': 0.09709945586203417,
  'prediction': 0},
 {'conjunction': [('emb_S_0', <function week4.<lambda>>),
   ('pclass_3_1', <function week4.<lambda>>)],
  'gig_score': 0.07297366447176823,
  'prediction': 0},
 {'conjunction': [('emb_S_0', <function week4.<lambda>>),
   ('pclass_3_0', <function week4.<lambda>>)],
  'gig_score': 0.07297366447176823,
  'prediction': 1}]

In [38]:
tree2 = {'paths': tree_paths, 'weight': None, 'oob': left_out_indices2}

In [39]:
forest1.append(tree2)

<h2>
Let's stop at a two-tree forest
</h2>
<p>
<div class=h1_cell>
<p>
It would be better to have an odd number to break voting ties, but we will figure something out.
<p>
Now that we have a forest, let's see how to use it for prediction. I'll define a new function, `vote_taker`, that tallies up the votes of all the trees for a single row. Ties go to the negative outcome 0 (arbitrarily).

</div>

In [40]:
def vote_taker(row, forest):
    votes = {0:0, 1:0}
    for tree in forest:
        prediction = tree_predictor(row, tree)
        votes[prediction] += 1
    winner = 1 if votes[1]>votes[0] else 0  #ties go to 0
    return winner

In [41]:
row0 = titanic_table.loc[0]
vote_taker(row0, forest1)  #tree1 0, tree2 0, winner 0

0


<p>
<div class=h1_cell>
<p>
I'm going to go back to using the Titanic table to keep our predictions for now. Not quite ready to set up a full-blown results table.

</div>

In [42]:
#ok, use forest on entire table

titanic_table['forest_1'] = titanic_table.apply(lambda row: vote_taker(row, forest1), axis=1)

In [43]:
titanic_table.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,no_age,filled_age,emb_C,emb_Q,emb_S,emb_nan,age_bin,age_Child,age_Adult,age_Senior,sex_female,sex_male,ok_child,pclass_1,pclass_2,pclass_3,pclass_nan,forest_1
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,22.0,0,0,1,0,Child,1,0,0,0,1,0,0,0,1,0,0
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0,38.0,1,0,0,0,Adult,0,1,0,1,0,0,1,0,0,0,1
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,26.0,0,0,1,0,Child,1,0,0,1,0,0,0,0,1,0,0
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0,35.0,0,0,1,0,Adult,0,1,0,1,0,0,1,0,0,0,1
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0,35.0,0,0,1,0,Adult,0,1,0,0,1,0,0,0,1,0,0


In [44]:
titanic_table['forest_1_type'] = titanic_table.apply(lambda row: predictor_case(row, pred='forest_1', target='Survived'), axis=1)

In [45]:
titanic_table.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,no_age,filled_age,emb_C,emb_Q,emb_S,emb_nan,age_bin,age_Child,age_Adult,age_Senior,sex_female,sex_male,ok_child,pclass_1,pclass_2,pclass_3,pclass_nan,forest_1,forest_1_type
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,22.0,0,0,1,0,Child,1,0,0,0,1,0,0,0,1,0,0,true_negative
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0,38.0,1,0,0,0,Adult,0,1,0,1,0,0,1,0,0,0,1,true_positive
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,26.0,0,0,1,0,Child,1,0,0,1,0,0,0,0,1,0,0,false_negative
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0,35.0,0,0,1,0,Adult,0,1,0,1,0,0,1,0,0,0,1,true_positive
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0,35.0,0,0,1,0,Adult,0,1,0,0,1,0,0,0,1,0,0,true_negative


In [46]:
forest1_types = titanic_table['forest_1_type'].value_counts()  # returns a series
forest1_types

true_negative     514
false_negative    219
true_positive     123
false_positive     35
Name: forest_1_type, dtype: int64

In [47]:
print((accuracy(forest1_types), f1(forest1_types), informedness(forest1_types)))

(0.7149270482603816, 0.49200000000000005, 0.29589684593998644)
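
<div class=h1_cell>
<p>
The `accuracy`, `f1` and `informedness` functions come from the course library, and I have not repeated their code here. Assuming they follow the standard textbook definitions, we can redo the arithmetic by hand from the four counts above. The variable names are mine.
</div>

```python
# counts copied from the forest1_types output above
tn, fn, tp, fp = 514, 219, 123, 35

accuracy_value = (tp + tn) / float(tp + tn + fp + fn)
f1_value = 2.0 * tp / (2 * tp + fp + fn)
recall = tp / float(tp + fn)          # true positive rate
specificity = tn / float(tn + fp)     # true negative rate
informedness_value = recall + specificity - 1

print((accuracy_value, f1_value, informedness_value))
print(recall)
```

<div class=h1_cell>
<p>
The three measures should agree with the values printed above. The recall line shows where the forest is weak: it finds only 123 of the 342 actual survivors.
</div>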


<h2>
Not that good
</h2>
<p>
<div class=h1_cell>
<p>
I'd like to try larger forests. Maybe 10 trees in the forest. But to do that, I don't want to copy and paste all that code. What I would like is a function that can build a forest for me, taking as a hyper-parameter how many trees to include. Here is the start.
<pre>
<code>
def forest_builder(table, column_choices, target, hypers):
    depth = 2 if 'max-depth' not in hypers else hypers['max-depth']
    tree_n = 5 if 'total-trees' not in hypers else hypers['total-trees']
    m = int(len(column_choices)**.5) if 'm' not in hypers else hypers['m']
</code>
</pre>
<p>
The return should be a forest, i.e., a list of trees as seen above.
<p>
My implementation borrows heavily from `build_tree_iter` from module 4. I am still using the nested function `iterative_build` to build a tree. But I have modified it to use bootstrapped training data for each tree and a random subset of attributes for each node in a tree. At the bottom I repeatedly call `iterative_build` to generate the trees for my forest.
</div>

In [48]:
def forest_builder(table, column_choices, target, hypers):
    tree_n = 5 if 'total-trees' not in hypers else hypers['total-trees']
    m = int(len(column_choices)**.5) if 'm' not in hypers else hypers['m']
    k = hypers['max-depth'] if 'max-depth' in hypers else min(2, len(column_choices))
    gig_cutoff = hypers['gig-cutoff'] if 'gig-cutoff' in hypers else 0.0

    #build a single tree - call it multiple times to build multiple trees
    def iterative_build(k):
        train = table.sample(frac=1.0, replace=True, random_state=new_seed())
        train = train.reset_index()
        left_out = table.loc[~table.index.isin(train['index'])]
        left_out = left_out.reset_index() # this gives us the old index in its own column
        oob_list = left_out['index'].tolist()  # list of row indices from original titanic table
        
        rcols = random.sample(column_choices, m)  # subspace sampling
        columns_sorted = find_best_splitter(train, rcols, target)
        (best_column, gig_value) = columns_sorted[0]

        #Note I add _1 or _0 to make it more readable for debugging
        current_paths = [{'conjunction': [(best_column+'_1', build_pred(best_column, 1))],
                          'prediction': None,
                          'gig_score': gig_value},
                         {'conjunction': [(best_column+'_0', build_pred(best_column, 0))],
                          'prediction': None,
                          'gig_score': gig_value}
                        ]
        k -= 1  # we just built a level as seed so subtract 1 from k
        tree_paths = []  # add completed paths here

        while k>0:
            new_paths = []
            for path in current_paths:
                conjunct = path['conjunction']  # a list of (name, lambda)
                before_table = generate_table(train, conjunct)  #the subtable the current conjunct leads to
                rcols = random.sample(column_choices, m)  # subspace
                columns_sorted = find_best_splitter(before_table, rcols, target)
                (best_column, gig_value) = columns_sorted[0]
                if gig_value > gig_cutoff:
                    new_path_1 = {'conjunction': conjunct + [(best_column+'_1', build_pred(best_column, 1))],
                                'prediction': None,
                                 'gig_score': gig_value}
                    new_paths.append( new_path_1 ) #true
                    new_path_0 = {'conjunction': conjunct + [(best_column+'_0', build_pred(best_column, 0))],
                                'prediction': None,
                                 'gig_score': gig_value
                                 }
                    new_paths.append( new_path_0 ) #false
                else:
                    #not worth splitting so complete the path with a prediction
                    path['prediction'] = compute_prediction(before_table, target)
                    tree_paths.append(path)
            #end for loop

            current_paths = new_paths
            if current_paths != []:
                k -= 1
            else:
                break  # nothing left to extend so have copied all paths to tree_paths
        #end while loop

        #Generate predictions for all paths that have None
        for path in current_paths:
            conjunct = path['conjunction']
            before_table = generate_table(train, conjunct)
            path['prediction'] = compute_prediction(before_table, target)
            tree_paths.append(path)
        return (tree_paths, oob_list)
    
    #let's build the forest
    forest = []
    for i in range(tree_n):
        (paths, oob) = iterative_build(k)
        forest.append({'paths': paths, 'weight': None, 'oob': oob})
        
    return forest

<div class=h1_cell>
<p>
Ok, let's build a forest with 5 trees (the default).
</div>

In [49]:
forest2 = forest_builder(titanic_table, splitter_columns, 'Survived', hypers={})
len(forest2)

5

<div class=h1_cell>
<p>
Now get predictions by voting.
</div>

In [50]:
titanic_table['forest_2'] = titanic_table.apply(lambda row: vote_taker(row, forest2), axis=1)

<div class=h1_cell>
<p>
Follow normal process now. Produce column of types.
</div>

In [51]:
titanic_table['forest_2_type'] = titanic_table.apply(lambda row: predictor_case(row, pred='forest_2', target='Survived'), axis=1)

<div class=h1_cell>
<p>
Produce type counts.
</div>

In [52]:
forest2_types = titanic_table['forest_2_type'].value_counts()  # returns a series
forest2_types

true_negative     457
true_positive     217
false_negative    125
false_positive     92
Name: forest_2_type, dtype: int64

<div class=h1_cell>
<p>
Print measures.
</div>

In [53]:
print((accuracy(forest2_types), f1(forest2_types), informedness(forest2_types)))

(0.7564534231200898, 0.6666666666666666, 0.4669255104975554)


<h2>
Try 2 more
</h2>
<p>
<div class=h1_cell>
<p>
One with 11 trees with depth of 2, and one with 11 trees and depth of 1 (i.e., 11 stumps).
</div>

In [54]:
forest3 = forest_builder(titanic_table, splitter_columns, 'Survived', hypers={'total-trees':11})
len(forest3)

11

In [55]:
titanic_table['forest_3'] = titanic_table.apply(lambda row: vote_taker(row, forest3), axis=1)
titanic_table['forest_3_type'] = titanic_table.apply(lambda row: predictor_case(row, pred='forest_3', target='Survived'), axis=1)
forest3_types = titanic_table['forest_3_type'].value_counts()  # returns a series
print((accuracy(forest3_types), f1(forest3_types), informedness(forest3_types)))

(0.77665544332211, 0.716927453769559, 0.5382993001629757)


In [56]:
forest4 = forest_builder(titanic_table, splitter_columns, 'Survived', hypers={'total-trees':11, 'max-depth':1})
len(forest4)

11

In [57]:
titanic_table['forest_4'] = titanic_table.apply(lambda row: vote_taker(row, forest4), axis=1)
titanic_table['forest_4_type'] = titanic_table.apply(lambda row: predictor_case(row, pred='forest_4', target='Survived'), axis=1)
forest4_types = titanic_table['forest_4_type'].value_counts()  # returns a series
print((accuracy(forest4_types), f1(forest4_types), informedness(forest4_types)))

(0.7575757575757576, 0.5846153846153846, 0.39708561020036415)


<h2>
Should explore more
</h2>
<p>
<div class=h1_cell>
If we were ambitious, we could write yet another function that explored for us. It would try combinations of number of trees and depth and report the best performing one. I won't make you write this function but it should be within your grasp by this point. All you would be doing is repeating the steps above, but now in nested loops that produce the various combinations to try.
</div>
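
<div class=h1_cell>
<p>
Here is one possible shape for such an explorer, as a sketch only. The names `explore` and `fake_score` are hypothetical. `explore` takes a scoring callback, so it does not care how the score is produced. In the notebook the callback would build a forest with `forest_builder`, take votes, and return a measure such as f1. To keep the sketch runnable on its own I use a made-up scoring function instead. `itertools.product` plays the role of the nested loops.
</div>

```python
from itertools import product

def explore(score_fn, tree_counts, depths):
    # Try every (trees, depth) combination and keep the best scoring one.
    # score_fn is a hypothetical callback: it takes a hypers dictionary and returns a number.
    best_score, best_hypers = None, None
    for (trees, depth) in product(tree_counts, depths):
        hypers = {'total-trees': trees, 'max-depth': depth}
        score = score_fn(hypers)
        if best_score is None or score > best_score:
            best_score, best_hypers = score, hypers
    return (best_score, best_hypers)

# A made-up scoring function so the sketch runs on its own.
def fake_score(hypers):
    return hypers['total-trees'] * 0.01 + hypers['max-depth'] * 0.1

result = explore(fake_score, [5, 11, 21], [1, 2, 3])
print(result)
```

<div class=h1_cell>
<p>
One caution: if the score is computed on the same rows the forest was trained on, the "best" combination will tend to be the one that overfits the most. Scoring on held-out rows is safer.
</div>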

<h2>
Out Of Bag errors
</h2>
<p>
<div class=h1_cell>
We could now use K-Folding to do a better evaluation of our forests. But I want to consider another approach called *Out of Bag error* (jargon alert). You will now see where that `oob` value on a tree comes in handy. As a reminder, for each tree we generated a training sample. Because we sampled with replacement, some rows were picked more than once and others were left out entirely. We captured these "left out" rows in the `oob` entry. More precisely, we captured their row indices in the full Titanic table.
<p>
Here's what I would like to do. I would like to evaluate a forest on the out of bag rows. It kind of makes sense, right? A tree was trained on some set of rows that excluded the rows in oob. So the oob rows are a bit like the test data from K-Folding. We use the oob rows for testing.
<p>
One way to look at it is I create a new testing table that is the union of the oob list on each tree. I then use this testing table to get predictions by forest vote taking. Here is the twist: a tree gets to vote on a row only if the row is in its oob list.
<p>
I am going to give you the chance to decide how to implement oob testing as part of your homework assignment.
</div>
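
<div class=h1_cell>
<p>
To make the voting twist concrete, here is a toy sketch of one possible reading. Your homework version may well look different. Everything here is made up for illustration: the function name `oob_vote`, the toy rows, and the `predict` entry on each toy tree, which stands in for calling `tree_predictor` on a real tree. I also chose to return `None` when no tree has the row out of bag; that is my assumption, not a rule from the module.
</div>

```python
def oob_vote(row_index, row, forest):
    # Only trees that did NOT train on this row (it is in their oob list) get a vote.
    votes = {0: 0, 1: 0}
    for tree in forest:
        if row_index in tree['oob']:
            votes[tree['predict'](row)] += 1
    if votes[0] == 0 and votes[1] == 0:
        return None  # no tree had this row out of bag, so no prediction
    return 1 if votes[1] > votes[0] else 0  # ties go to 0, as in vote_taker

# Toy forest: 'predict' is a made-up stand-in for calling tree_predictor on a real tree.
toy_forest = [
    {'oob': [0, 2], 'predict': lambda row: row['sex_female']},
    {'oob': [2, 3], 'predict': lambda row: row['pclass_1']},
    {'oob': [0, 3], 'predict': lambda row: 0},
]
toy_rows = {0: {'sex_female': 1, 'pclass_1': 0},
            1: {'sex_female': 0, 'pclass_1': 1},
            2: {'sex_female': 1, 'pclass_1': 1},
            3: {'sex_female': 0, 'pclass_1': 0}}

oob_predictions = dict((i, oob_vote(i, row, toy_forest)) for (i, row) in toy_rows.items())
print(oob_predictions)
```

<div class=h1_cell>
<p>
Row 1 is in no tree's oob list, so it gets no prediction at all. With only a few trees that can happen; with many trees it becomes rare.
</div>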

<hr>
<h1>Write it out</h1>
<div class=h1_cell>

Save the table so we can use it in the next module.
</div>

In [58]:
import os

week = 6  # change this each week

home_path =  os.path.expanduser('~')

file_path = '/Dropbox/cis399_ds1_f17/notebook_history/'

file_name = 'titanic_wrangled_w'+str(week)+'.csv'

titanic_table.to_csv(home_path + file_path + file_name, index=False)


<h2>
Next up
</h2>
<p>
<div class=h1_cell>
We will continue to look at forests. But we will study an alternative way of constructing the trees that is kind of cool.
</div>