# Introduction



In this notebook, we will:
- Learn how to train and evaluate a Boosted Trees classifier
- Explore how training can be sped up for small datasets
- Develop intuition for how some of the hyperparameters affect the performance of boosted trees


In [None]:
# We will use NumPy and pandas for handling the input data.
import numpy as np
import pandas as pd
# And of course, we need tensorflow.
import tensorflow as tf

from distutils.version import StrictVersion

In [None]:
tf.__version__

# Load dataset
We will use the Titanic dataset, where the goal is to predict passenger survival given characteristics such as gender, age, and class.

In [None]:
tf.logging.set_verbosity(tf.logging.INFO)
tf.set_random_seed(123)

# Load dataset.
dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
dfeval = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv')
y_train = dftrain.pop('survived')
y_eval = dfeval.pop('survived')

In [None]:
fcol = tf.feature_column
CATEGORICAL_COLUMNS = ['sex', 'n_siblings_spouses', 'parch', 'class', 'deck',
                       'embark_town', 'alone']
NUMERIC_COLUMNS = ['age', 'fare']

def one_hot_cat_column(feature_name, vocab):
  return fcol.indicator_column(
      fcol.categorical_column_with_vocabulary_list(feature_name,
                                                 vocab))
fc = []
for feature_name in CATEGORICAL_COLUMNS:
  # Need to one-hot encode categorical features.
  vocabulary = dftrain[feature_name].unique()
  fc.append(one_hot_cat_column(feature_name, vocabulary))

for feature_name in NUMERIC_COLUMNS:
  fc.append(fcol.numeric_column(feature_name,
                                dtype=tf.float32))
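
To build intuition for what one-hot encoding a categorical feature means, here is a minimal pure-Python sketch. The helper `one_hot` and the toy vocabulary are hypothetical and exist only for illustration; the feature columns above do the equivalent work inside TensorFlow.

```python
def one_hot(value, vocab):
    """Return a list with a 1 at the position of `value` in `vocab`, 0 elsewhere."""
    return [1 if value == entry else 0 for entry in vocab]


# Toy vocabulary (an assumption), standing in for dftrain[feature_name].unique().
toy_vocab = ['First', 'Second', 'Third']
encoded = one_hot('Second', toy_vocab)
print(encoded)
```

Each categorical feature becomes a vector as long as its vocabulary, which is why the vocabulary is collected from the training data first.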

In [None]:
# Prepare the input fn. Use the entire dataset for a batch since this is such a small dataset.
def make_input_fn(X, y, n_epochs=None, do_batching=True):
  def input_fn():
    BATCH_SIZE = len(y)  # Use entire dataset.
    dataset = tf.data.Dataset.from_tensor_slices((X.to_dict(orient='list'), y))
    # For training, cycle through the dataset as many times as needed (n_epochs=None).
    dataset = dataset.repeat(n_epochs)
    if do_batching:
      dataset = dataset.batch(BATCH_SIZE)
    return dataset
  return input_fn
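
The interaction between repeating and batching can be sketched without TensorFlow. The snippet below is a rough illustration only, with a hypothetical `batches` helper and toy rows: when the batch size equals the dataset size, every batch is one full pass over the data, and `n_epochs=None` keeps producing such batches indefinitely.

```python
from itertools import cycle, islice


def batches(rows, batch_size, n_epochs=None):
    """Yield lists of up to `batch_size` rows, repeating `rows` for
    `n_epochs` passes, or forever when `n_epochs` is None."""
    if n_epochs is None:
        stream = cycle(rows)
    else:
        stream = (row for _ in range(n_epochs) for row in rows)
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            return
        yield batch


toy_rows = [10, 20, 30]  # toy data (an assumption), standing in for the dataset rows

# Evaluation-style input: a single pass gives exactly one full-size batch.
eval_batches = list(batches(toy_rows, batch_size=len(toy_rows), n_epochs=1))

# Training-style input: an endless stream, so take only the first two batches.
train_batches = list(islice(batches(toy_rows, batch_size=len(toy_rows)), 2))

print(eval_batches)
print(train_batches)
```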

# Training and Evaluating Classifiers

In [None]:
TRAIN_SIZE = len(dftrain)
params = {
  'n_trees':10,
  'center_bias':False,
  # Regularization is applied per instance. If you are familiar with XGBoost,
  # divide its regularization values by the number of examples per layer.
  'l2_regularization':1./TRAIN_SIZE
}
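
The scaling described in the comment above is simple arithmetic. The sketch below uses a hypothetical helper `per_instance_l2` and a made-up dataset size; it assumes that each layer is built from one batch containing the whole training set, so the number of examples per layer equals the training set size.

```python
def per_instance_l2(xgboost_style_l2, examples_per_layer):
    """Convert an XGBoost-style L2 value into a per-instance value."""
    return xgboost_style_l2 / examples_per_layer


toy_train_size = 500  # made-up size, not the real Titanic training set size
toy_l2 = per_instance_l2(1.0, toy_train_size)
print(toy_l2)
```

With these toy numbers, an XGBoost-style value of 1.0 corresponds to `1./TRAIN_SIZE` in the `params` dictionary above.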


Exercise: Train a Boosted Trees model using tf.estimator. What are the best results you can get?

Train and evaluate the model. We will look at accuracy first.


In [None]:
# Training and evaluation input functions.
n_batches_per_layer = 1  # Use one batch, consisting of the entire dataset to build each layer in the tree.
DO_BATCHING = True

train_input_fn = make_input_fn(dftrain, y_train, n_epochs=None, do_batching=DO_BATCHING)
eval_input_fn = make_input_fn(dfeval, y_eval, n_epochs=1, do_batching=DO_BATCHING)
est = # TODO

est.train(train_input_fn)

# Eval.
pd.Series(est.evaluate(eval_input_fn))

Exercise #2: Can you get better performance out of the classifier? How do the results, in terms of accuracy and AUC, compare to using a DNN?

# Results

Let's understand how our model is performing.

In [None]:
pred_dicts = list(est.predict(eval_input_fn))
probs = pd.Series([pred['probabilities'][1] for pred in pred_dicts])

probs.plot(kind='hist', bins=20, title='predicted probabilities');

**???** Why are the probabilities right-skewed?

Let's plot an ROC curve to understand how the model performs at various prediction probability thresholds.

In [None]:
from sklearn.metrics import roc_curve
from matplotlib import pyplot as plt

fpr, tpr, _ = roc_curve(y_eval, probs)
plt.plot(fpr, tpr)
plt.title('ROC curve')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.xlim(0,)
plt.ylim(0,);
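
Each point on an ROC curve corresponds to one probability threshold. As a minimal sketch of how a single point is computed, the hypothetical helper `tpr_fpr` below counts outcomes on toy labels and probabilities (made up for illustration, not taken from the model). It assumes both classes are present in the labels.

```python
def tpr_fpr(labels, probs, threshold):
    """Return (true positive rate, false positive rate) at `threshold`.

    An example is predicted positive when its probability is at or above
    the threshold. Assumes `labels` contains both 0s and 1s.
    """
    tp = fp = fn = tn = 0
    for label, prob in zip(labels, probs):
        predicted_positive = prob >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)


toy_labels = [1, 0, 1, 0]
toy_probs = [0.9, 0.6, 0.4, 0.1]
point = tpr_fpr(toy_labels, toy_probs, threshold=0.5)
print(point)
```

Sweeping the threshold from high to low and collecting these pairs traces out the curve.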

**???** What do true positive rate and false positive rate refer to for this dataset?

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License