# Deep learning with tf.data and tf.estimator

Ever since Google I/O 2018 I have been meaning to write a tutorial on the new data pipelines of tensorflow and on the estimator class that was introduced a couple of versions ago. Google is pushing tensorflow to be an easy-to-use framework without a steep learning curve, at least for achieving elementary results, and both libraries showcased in this notebook work toward that goal. tf.data replaces the old-fashioned way of passing a feed_dict to the tf.Session and streamlines the data input flow. tf.estimator acts as a wrapper around deep learning models that have tensorflow under the hood: it takes care of training, evaluation and prediction through wrapper functions on top of your model. We are also going to try DNNClassifier, one of the out-of-the-box classifiers that Google has developed, and test how well it performs.

To examine these libraries we are going to tackle the Kaggle Titanic problem. We will try to predict whether a passenger survived, based on features such as the ticket fare, the passenger's age and the ticket class. Most of the good solutions on Kaggle achieve around 75-85% accuracy on this problem with extensive feature engineering. Here we are not going to bother with feature engineering, since our purpose is not to break the Kaggle record. Let's take a look at the dataset:

In [3]:
# Necessary imports
import tensorflow as tf
import pandas as pd
import numpy as np

In [4]:
# Load the dataset into memory and show the first 5 records
data = pd.read_csv("/Users/Blackbak/giannis_home/python_folder/titanic_dataset.csv")
data.head()

Unnamed: 0,survived,pclass,name,sex,age,sibsp,parch,ticket,fare
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55
2,0,1,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55
3,0,1,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55
4,0,1,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55


In [5]:
# Split the dataset into train and test for us to evaluate the generalization of our models
train = data.iloc[:int(data.shape[0]*0.8)]
test = data.iloc[int(data.shape[0]*0.8):]
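
One caveat with a positional split like this: the five rows shown above are all first-class passengers, which suggests the file may be ordered by ticket class. If it is, the last 20% of the rows would not be representative of the whole dataset. The sketch below shows, in plain Python and on made-up row indices, how one might shuffle before splitting. The helper name shuffled_split and the seed are illustrative assumptions and are not part of this notebook.

```python
import random

def shuffled_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of `rows`, then split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical example: ten row indices standing in for passengers
train_rows, test_rows = shuffled_split(range(10))
print(len(train_rows), len(test_rows))
```

With pandas, calling data.sample(frac=1, random_state=0) before the iloc slicing would have a similar effect. The results reported in the rest of this notebook were produced with the unshuffled split.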

The first thing we need to do is specify our input pipeline. The pipeline is tied to the estimator class because the estimator expects its data to be fed in a specific way: it takes as an argument an input function that returns the next data to be trained on or evaluated. So what we need is essentially a generator function that outputs the next batch of data for batch training and testing, or the next data point for online learning. If you are not familiar with generators I would suggest watching [this youtube video](https://www.youtube.com/watch?v=cKPlPJyQrt4) (actually I would suggest it to everyone regardless). Thankfully Google has provided us with the necessary tools that make this task very easy. But first things first: we need to specify the data that we want to load. Depending on how the data are stored, there are different functions for loading them into the tf.data.Dataset class. The most common ones are shown below:

In [None]:
# If the data are stored in TFRecords, the default format of tensorflow
# (file_pattern is a placeholder for a glob pattern matching your files)
files = tf.data.Dataset.list_files(file_pattern)
dataset = tf.data.TFRecordDataset(files)
# If the data are stored in one or more csv files
dataset = tf.contrib.data.make_csv_dataset("*.csv",  # path to the csv file/files
                                           batch_size=32,  # the batch size has to be specified in this step
                                           column_names=["features", "that", "are", "useful"],
                                           label_name="label_column")
# If the data are already in memory in a dictionary (data_dict is a placeholder)
dataset = tf.data.Dataset.from_tensor_slices(data_dict)

Once the data are loaded into a tf.data.Dataset, our goal is to build the generator function. Before making the data iterable we need to specify some key parameters for how they are consumed: the number of epochs, the batch size, whether we shuffle after each epoch, and whether we want to transform the input. Following their [presentation at Google](https://www.youtube.com/watch?v=uIcqeP7MFH0&t=270s), the usual data pipeline looks something like this:

In [None]:
dataset = dataset.shuffle(1000)  # 1000 is the size of the shuffle buffer that elements are sampled from
dataset = dataset.repeat(num_epochs)
# if some pre-processing is needed we can use map and filter with the help of lambda
# the downside is that it is somewhat complex
# (features is a placeholder for the parsing specification of your records)
dataset = dataset.map(lambda x: tf.parse_single_example(x, features))
dataset = dataset.batch(batch_size)
# here we make the data iterable and ask for the next batch
iterator = dataset.make_one_shot_iterator()
next_data = iterator.get_next()
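
To make the semantics of these calls concrete, here is a rough plain-Python analogue built from generators. It is only a sketch of the behaviour described above and not how tf.data is implemented. The function names shuffle_buffer, repeat and batch are made up for this illustration.

```python
import random

def shuffle_buffer(items, buffer_size, rng):
    """Yield items in randomized order, sampling from a fixed-size buffer."""
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        index = rng.randrange(buffer_size)
        yield buffer[index]      # emit a random buffered element...
        buffer[index] = item     # ...and put the new item in its place
    rng.shuffle(buffer)          # input exhausted: drain what is left
    yield from buffer

def repeat(make_items, num_epochs):
    """Chain num_epochs fresh passes over the data, one after another."""
    for _ in range(num_epochs):
        yield from make_items()

def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current            # final, possibly smaller batch

rng = random.Random(0)
data = list(range(5))
epochs = repeat(lambda: shuffle_buffer(data, buffer_size=3, rng=rng), num_epochs=2)
batches = list(batch(epochs, batch_size=4))
print([len(b) for b in batches])  # 10 elements in batches of 4 -> [4, 4, 2]
```

Because repeat comes before batch, the second batch here mixes elements from the end of the first epoch with elements from the start of the second one. A buffer that is smaller than the dataset only shuffles locally, which is why the buffer size matters.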

At this point all we need to do is wrap these operations in a function that can be passed as an argument to the train and evaluate methods of the estimator. We will need two functions, one for training and one for evaluation. We could write a single function and pass the data as an argument, but this way is a bit clearer. We are going to use only numerical features to keep the notebook short.

In [6]:
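def train_input_fn():
    # Features are passed as a dict of column name -> list of values, labels as an array
    dataset = tf.data.Dataset.from_tensor_slices(
        (train[["pclass", "fare", "age", "sibsp", "parch"]].to_dict("list"),
         train["survived"].values))
    # Shuffle with a 1000-element buffer, repeat indefinitely and batch 100 rows at a time
    dataset = dataset.shuffle(1000).repeat().batch(100)
    iterator = dataset.make_one_shot_iterator()
    feat_next, label_next = iterator.get_next()
    return feat_next, label_next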

In [7]:
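def eval_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices(
        (test[["pclass", "fare", "age", "sibsp", "parch"]].to_dict("list"),
         test["survived"].values))
    # Same pipeline as for training. Shuffling and repeating are not required for
    # evaluation; because repeat() never ends, evaluate() needs an explicit steps argument.
    dataset = dataset.shuffle(1000).repeat().batch(100)
    iterator = dataset.make_one_shot_iterator()
    feat_next, label_next = iterator.get_next()
    return feat_next, label_next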
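
It may help to see what a single element of this dataset looks like before batching. from_tensor_slices cuts its input along the first dimension, so a (dict of columns, label array) pair becomes a sequence of (dict of scalars, label) pairs. The following plain-Python sketch mimics that slicing on three made-up passengers. slice_rows is a hypothetical helper and not a tensorflow function.

```python
def slice_rows(columns, labels):
    """Yield one ({feature_name: value}, label) pair per row."""
    for i, label in enumerate(labels):
        yield {name: values[i] for name, values in columns.items()}, label

# Made-up values in the same layout that DataFrame.to_dict("list") produces
columns = {"pclass": [1, 3, 2], "fare": [211.34, 7.25, 13.0]}
labels = [1, 0, 1]

rows = list(slice_rows(columns, labels))
print(rows[0])
```

After batch(100) each dictionary value becomes a vector with one entry per row in the batch, which is the form in which the estimator receives its features.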

Next we need to define our features so that the estimator understands its input. This step makes sure that values are connected to the corresponding input and that different types of input are handled appropriately, e.g. categorical entries can be translated to one-hot encodings. More on feature columns in the [tensorflow docs](https://www.tensorflow.org/get_started/feature_columns).

In [8]:
feat_name = ["pclass", "fare", "age", "sibsp", "parch"]
# One numeric feature column per input feature
my_feature_columns = [tf.feature_column.numeric_column(key=name) for name in feat_name]
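
All five features used here are numeric, so numeric_column is enough. For a categorical feature such as sex, the one-hot translation mentioned above amounts to the following, shown in plain Python with an assumed two-word vocabulary. one_hot is an illustrative helper rather than a tensorflow call.

```python
def one_hot(value, vocabulary):
    """Return a list with 1.0 at the position of `value` in `vocabulary`, else 0.0."""
    return [1.0 if value == entry else 0.0 for entry in vocabulary]

vocabulary = ["female", "male"]  # assumed vocabulary for the sex column
print(one_hot("male", vocabulary))
```

In tensorflow the corresponding pieces would be tf.feature_column.categorical_column_with_vocabulary_list wrapped in tf.feature_column.indicator_column. This notebook does not use them.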

Now we are ready to define the estimator. Google has developed a handful of predefined estimators to make our life a bit easier. We are going to showcase the DNNClassifier model, which is what the name suggests: a feed-forward neural network classifier.

In [9]:
estimator = tf.estimator.DNNClassifier(feature_columns=my_feature_columns,
                                       hidden_units=[1000, 1000],
                                       dropout=0.5)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/var/folders/n8/wbjbrw4n6wv8v5kbx4zg70wm0000gn/T/tmppsx1900f', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x10a230630>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [10]:
estimator.train(input_fn=train_input_fn, steps=3000)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into /var/folders/n8/wbjbrw4n6wv8v5kbx4zg70wm0000gn/T/tmppsx1900f/model.ckpt.
INFO:tensorflow:loss = 194.63202, step = 1
INFO:tensorflow:global_step/sec: 50.324
INFO:tensorflow:loss = 67.78077, step = 101 (1.988 sec)
INFO:tensorflow:global_step/sec: 56.2207
INFO:tensorflow:loss = 62.972076, step = 201 (1.779 sec)
INFO:tensorflow:global_step/sec: 56.7792
INFO:tensorflow:loss = 63.67186, step = 301 (1.762 sec)
INFO:tensorflow:global_step/sec: 57.0776
INFO:tensorflow:loss = 64.02928, step = 401 (1.752 sec)
INFO:tensorflow:global_step/sec: 56.2241
INFO:tensorflow:loss = 63.789528, step = 501 (1.779 sec)
INFO:tensorflow:global_step/sec: 55.9744
INFO:tensorflow:loss = 64.28005, step = 601 (1.787 sec)
INFO:tensorflow:gl

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x10a230358>

In [11]:
estimator.evaluate(input_fn=eval_input_fn, steps=200)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-06-10-14:39:11
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/n8/wbjbrw4n6wv8v5kbx4zg70wm0000gn/T/tmppsx1900f/model.ckpt-3000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [20/200]
INFO:tensorflow:Evaluation [40/200]
INFO:tensorflow:Evaluation [60/200]
INFO:tensorflow:Evaluation [80/200]
INFO:tensorflow:Evaluation [100/200]
INFO:tensorflow:Evaluation [120/200]
INFO:tensorflow:Evaluation [140/200]
INFO:tensorflow:Evaluation [160/200]
INFO:tensorflow:Evaluation [180/200]
INFO:tensorflow:Evaluation [200/200]
INFO:tensorflow:Finished evaluation at 2018-06-10-14:39:14
INFO:tensorflow:Saving dict for global step 3000: accuracy = 0.78255, accuracy_baseline = 0.79015, auc = 0.5815983, auc_precision_recall = 0.28891575, average_loss = 0.55284804, global_step = 3000, label/

{'accuracy': 0.78255,
 'accuracy_baseline': 0.79015,
 'auc': 0.5815983,
 'auc_precision_recall': 0.28891575,
 'average_loss': 0.55284804,
 'label/mean': 0.20985,
 'loss': 55.284805,
 'precision': 0.4500657,
 'prediction/mean': 0.32641548,
 'recall': 0.16321182,
 'global_step': 3000}
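
Two of these numbers are worth relating to each other. The sketch below recomputes them from the values in the dictionary, under two assumptions: that accuracy_baseline is the accuracy of always predicting the more frequent class, and that loss is the per-example average_loss summed over a batch of 100 (the batch size used in eval_input_fn). Both assumptions match the reported figures.

```python
label_mean = 0.20985        # 'label/mean' from the evaluation output
average_loss = 0.55284804   # 'average_loss' from the evaluation output
batch_size = 100            # batch size set in eval_input_fn

# Accuracy of a constant prediction of the majority class
majority_baseline = max(label_mean, 1.0 - label_mean)

# Loss summed over one batch, assuming every batch holds batch_size examples
summed_loss = average_loss * batch_size

print(round(majority_baseline, 5), round(summed_loss, 4))
```

Read this way, the classifier's accuracy of 0.78255 sits slightly below what a constant "did not survive" prediction would score on this test split.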

These models are nice out-of-the-box solutions for rapid prototyping or for people who are not that familiar with machine learning. In most cases, though, we will need to specify our own model for the task at hand, e.g. image classification calls for convolutional layers with pooling. In this example we are going to build another feed-forward network, this time with batch normalization on the layers. The model is defined as a function whose output depends on the mode it is called in. Each estimator has 3 modes: training, evaluating and predicting - tf.estimator.ModeKeys.(TRAIN/EVAL/PREDICT). In the model function my_model_fn we go through the steps that we must specify in order to comply with the estimator interface.

In [12]:
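# This is the layer that we are going to use as hidden
def dnn_layer(inputs, unit_num, activation, d_rate, mode):
    # Note: `training` is not passed here, so it keeps its default (False) and the
    # layer normalizes with its moving statistics rather than per-batch statistics.
    bn = tf.layers.batch_normalization(inputs=inputs)
    nn = tf.layers.dense(inputs=bn, units=unit_num, activation=activation)
    # Dropout is only active in TRAIN mode
    dn = tf.layers.dropout(nn, rate=d_rate, training=mode == tf.estimator.ModeKeys.TRAIN)
    return dn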
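
For readers new to batch normalization, the computation for a single feature can be sketched in plain Python as follows. This is a simplified illustration that leaves out the learned scale and offset as well as the moving averages used at inference time. batch_norm and epsilon are names chosen for this sketch, and the fares are made up.

```python
import math

def batch_norm(values, epsilon=1e-3):
    """Normalize a batch of scalars to zero mean and (roughly) unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + epsilon) for v in values]

fares = [211.34, 151.55, 7.25, 13.0]  # made-up batch of ticket fares
normalized = batch_norm(fares)
print([round(v, 3) for v in normalized])
```

The appeal for this dataset is that features on very different scales, such as fare and pclass, end up in a comparable range before they reach the dense layer.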

In [16]:
def my_model_fn(features, labels, mode, params):
    # Always the first step of the model function is to connect input and feature definitions
    net = tf.feature_column.input_layer(features, params['feature_columns'])
    # Define the computation graph for forward pass
    for hid_num in params["hidden_units"]:
        net = dnn_layer(inputs=net, unit_num=hid_num, activation=tf.nn.leaky_relu, d_rate=0.5, mode=mode)
    logits = tf.layers.dense(inputs=net, units=params["n_classes"])
    # Prediction part
    predictions = {
        # Predicted class index (used in PREDICT and EVAL modes)
        "classes": tf.argmax(input=logits, axis=1),
        # Class probabilities from a softmax over the logits
        "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
    }
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

    # Calculate Loss (for both TRAIN and EVAL modes)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # Configure the Training Op (for TRAIN mode)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
        train_op = optimizer.minimize(
            loss=loss,
            global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

    # Add evaluation metrics (for EVAL mode)
    if mode == tf.estimator.ModeKeys.EVAL:
        eval_metric_ops = {
            "accuracy": tf.metrics.accuracy(labels=labels, predictions=predictions["classes"]),
            # Note: AUC is computed here from the hard 0/1 class predictions, not from
            # predictions["probabilities"], so it reflects a single operating point.
            "auc": tf.metrics.auc(labels=labels, predictions=predictions["classes"])}
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)
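
The prediction and loss lines are compact, so here is what they compute for a single example, written in plain Python. It is a sketch of the underlying arithmetic only. The helper names are made up and the logits are invented values.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(logits, label):
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(logits)[label])

logits = [0.0, 0.0]                       # a two-class model with no preference
probabilities = softmax(logits)           # [0.5, 0.5]
predicted_class = probabilities.index(max(probabilities))
loss = sparse_cross_entropy(logits, label=1)
print(probabilities, predicted_class, round(loss, 4))
```

A per-example loss near ln 2 ≈ 0.693 therefore corresponds to a two-class model that is guessing, which is a handy reference point when reading the loss values of around 0.6 in the training log further down.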

In [17]:
classifier = tf.estimator.Estimator(
    model_fn=my_model_fn,
    params={
        'feature_columns': my_feature_columns,
        'hidden_units': [1000, 1000],
        'n_classes': 2
    })

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/var/folders/n8/wbjbrw4n6wv8v5kbx4zg70wm0000gn/T/tmp1tui31ju', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x1235e7908>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [20]:
classifier.train(input_fn=train_input_fn, steps=3000)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/n8/wbjbrw4n6wv8v5kbx4zg70wm0000gn/T/tmp1tui31ju/model.ckpt-3000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 3001 into /var/folders/n8/wbjbrw4n6wv8v5kbx4zg70wm0000gn/T/tmp1tui31ju/model.ckpt.
INFO:tensorflow:loss = 0.6218874, step = 3001
INFO:tensorflow:global_step/sec: 41.134
INFO:tensorflow:loss = 0.6308135, step = 3101 (2.433 sec)
INFO:tensorflow:global_step/sec: 42.4112
INFO:tensorflow:loss = 0.59997797, step = 3201 (2.357 sec)
INFO:tensorflow:global_step/sec: 43.8712
INFO:tensorflow:loss = 0.55855, step = 3301 (2.279 sec)
INFO:tensorflow:global_step/sec: 48.3987
INFO:tensorflow:loss = 0.68547094, step = 3401 (2.066 sec)
INFO:tensorflow:global_step/sec: 45.9739
INFO:tensorflow:loss = 0.58312017, step 

<tensorflow.python.estimator.estimator.Estimator at 0x1235e78d0>

In [21]:
classifier.evaluate(input_fn=eval_input_fn, steps=200)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-06-10-14:44:56
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/n8/wbjbrw4n6wv8v5kbx4zg70wm0000gn/T/tmp1tui31ju/model.ckpt-6000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [20/200]
INFO:tensorflow:Evaluation [40/200]
INFO:tensorflow:Evaluation [60/200]
INFO:tensorflow:Evaluation [80/200]
INFO:tensorflow:Evaluation [100/200]
INFO:tensorflow:Evaluation [120/200]
INFO:tensorflow:Evaluation [140/200]
INFO:tensorflow:Evaluation [160/200]
INFO:tensorflow:Evaluation [180/200]
INFO:tensorflow:Evaluation [200/200]
INFO:tensorflow:Finished evaluation at 2018-06-10-14:44:58
INFO:tensorflow:Saving dict for global step 6000: accuracy = 0.79, auc = 0.6002502, global_step = 6000, loss = 0.54535925


{'accuracy': 0.79, 'auc': 0.6002502, 'loss': 0.54535925, 'global_step': 6000}
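
A note on the AUC value: in my_model_fn the metric is fed predictions["classes"], i.e. hard 0/1 decisions, whereas AUC is usually computed from scores or probabilities. The plain-Python sketch below uses an exact pairwise definition of AUC (not the thresholded approximation that tf.metrics.auc computes) on invented data, to show how much information the hard decisions can discard. pairwise_auc is a hypothetical helper.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
probabilities = [0.1, 0.2, 0.3, 0.4]                      # invented scores, perfectly ranked
classes = [1 if p >= 0.5 else 0 for p in probabilities]   # every example predicted as 0

print(pairwise_auc(labels, probabilities), pairwise_auc(labels, classes))
```

In this toy case the probabilities rank every positive above every negative (AUC 1.0), while the thresholded classes cannot tell them apart (AUC 0.5). In the model function, passing predictions["probabilities"][:, 1] to tf.metrics.auc would be the corresponding change. The numbers reported above were produced with the hard classes.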

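We reached a test set accuracy of 79% without any feature engineering, after 6000 training steps in total (the training log above shows the second run resuming from the checkpoint at step 3000). That number needs some context, though: the DNNClassifier evaluation reports an accuracy_baseline of 0.79015 on the same test split, which is the accuracy of always predicting the majority class, so neither model clearly beats that baseline here, and the AUC values of roughly 0.58 and 0.60 point the same way. To conclude, in this guide we have seen an example of the new input pipelines that Google introduced recently and how they tie into the estimator models. For more extensive examples and tutorials I strongly advise visiting the latest version of the [docs](https://www.tensorflow.org/get_started/).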