<h1> Creating a TensorFlow model </h1>

This notebook illustrates:
<ol>
<li> Creating a model using the high-level Estimator API 
</ol>

In [1]:
# change these to try this notebook out
BUCKET = 'asl-ml-immersion-temp'
PROJECT = 'asl-ml-immersion'
REGION = 'us-central1'

In [2]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

In [3]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi

In [4]:
%%bash
ls *.csv

eval.csv
train.csv


<h2> Create TensorFlow model using TensorFlow's Estimator API </h2>
<p>
First, write an input_fn to read the data.

In [9]:
import shutil
import numpy as np
import tensorflow as tf
import tensorflow.contrib.learn as tflearn
import tensorflow.contrib.layers as tflayers
from tensorflow.contrib.learn.python.learn import learn_runner
import tensorflow.contrib.metrics as metrics

In [10]:
CSV_COLUMNS = 'weight_pounds,is_male,mother_age,mother_race,plurality,gestation_weeks,mother_married,cigarette_use,alcohol_use,key'.split(',')
LABEL_COLUMN = 'weight_pounds'
KEY_COLUMN = 'key'
DEFAULTS = [[0.0], ['null'], [0.0], ['null'], [0.0], [0.0], ['null'], ['null'], ['null'], ['nokey']]
TRAIN_STEPS = 1000

def read_dataset(prefix, pattern, batch_size=512):
  # use prefix to create filename
  filename = './{}*{}*'.format(prefix, pattern)
  if prefix == 'train':
    mode = tf.contrib.learn.ModeKeys.TRAIN
  else:
    mode = tf.contrib.learn.ModeKeys.EVAL
    
  # the actual input function passed to TensorFlow
  def _input_fn():
    # could be a path to one file or a file pattern.
    input_file_names = tf.train.match_filenames_once(filename)
    filename_queue = tf.train.string_input_producer(
        input_file_names, shuffle=True)
 
    # read CSV
    reader = tf.TextLineReader()
    _, value = reader.read_up_to(filename_queue, num_records=batch_size)
    value_column = tf.expand_dims(value, -1)
    columns = tf.decode_csv(value_column, record_defaults=DEFAULTS)
    features = dict(zip(CSV_COLUMNS, columns))
    features.pop(KEY_COLUMN)
    label = features.pop(LABEL_COLUMN)
    return features, label
  
  return _input_fn
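
To make the parsing step concrete, here is a minimal pure-Python sketch of what the input function does to each CSV line: fill empty fields from the defaults, pair the values with the column names, drop the key, and split off the label. It uses the standard `csv` module rather than TensorFlow ops, and it handles one line at a time whereas `tf.decode_csv` works on a whole batch. The helper name `parse_line` and the sample row are hypothetical.

```python
import csv

CSV_COLUMNS = ('weight_pounds,is_male,mother_age,mother_race,plurality,'
               'gestation_weeks,mother_married,cigarette_use,alcohol_use,key').split(',')
LABEL_COLUMN = 'weight_pounds'
KEY_COLUMN = 'key'
# One default per column; its type (float or str) also decides the parsed type.
DEFAULTS = [0.0, 'null', 0.0, 'null', 0.0, 0.0, 'null', 'null', 'null', 'nokey']

def parse_line(line):
    """Hypothetical helper: parse one CSV line into (features, label)."""
    fields = next(csv.reader([line]))
    parsed = []
    for raw, default in zip(fields, DEFAULTS):
        if raw == '':
            parsed.append(default)       # empty field falls back to the default
        elif isinstance(default, float):
            parsed.append(float(raw))    # numeric column
        else:
            parsed.append(raw)           # string column
    features = dict(zip(CSV_COLUMNS, parsed))
    features.pop(KEY_COLUMN)             # the key is not a model input
    label = features.pop(LABEL_COLUMN)
    return features, label

# Made-up row with an empty cigarette_use field.
sample = '7.5,True,28,White,1,39,True,,False,abc123'
features, label = parse_line(sample)
print(label)                      # 7.5
print(features['cigarette_use'])  # null
```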

Next, define the feature columns.

In [23]:
def get_wide_deep():
  # define column types
  races = ['White', 'Black', 'American Indian', 'Chinese', 
           'Japanese', 'Hawaiian', 'Filipino', 'Unknown',
           'Asian Indian', 'Korean', 'Samaon', 'Vietnamese']
  is_male,mother_age,mother_race,plurality,gestation_weeks,mother_married,cigarette_use,alcohol_use = \
   [ \
    tflayers.sparse_column_with_keys('is_male', keys=['True', 'False']),
    tflayers.real_valued_column('mother_age'),
    tflayers.sparse_column_with_keys('mother_race', keys=races),
    tflayers.real_valued_column('plurality'),
    tflayers.real_valued_column('gestation_weeks'),
    tflayers.sparse_column_with_keys('mother_married', keys=['True', 'False']),
    tflayers.sparse_column_with_keys('cigarette_use', keys=['True', 'False', 'None']),
    tflayers.sparse_column_with_keys('alcohol_use', keys=['True', 'False', 'None'])
    ]

  # transformations
  plurality_b = tflayers.bucketized_column(plurality, boundaries=np.arange(0.5, 5.5, 1.0).tolist())
  mother_age_b = tflayers.bucketized_column(mother_age, boundaries=np.arange(10, 40, 5).tolist())
  gestation_b = tflayers.bucketized_column(gestation_weeks, boundaries=[25, 30, 35, 38, 40])
  mother_race_e = tflayers.embedding_column(mother_race, 3)
  crosses = tflayers.crossed_column([mother_age_b, plurality_b, gestation_b], hash_bucket_size=10)

  # which columns are wide (sparse, linear relationship to output) and which are deep (complex relationship to output)?
  wide = [is_male, mother_race, 
          plurality_b, mother_age_b, gestation_b, crosses,
          mother_married, cigarette_use, alcohol_use]
  deep = [\
                mother_age,
                plurality,
                gestation_weeks,
                mother_race_e
               ]
  return wide, deep
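
As a rough illustration of the bucketized and crossed columns above, the sketch below maps raw values to bucket indices with `bisect` and then hashes a combination of bucket indices into 10 slots. The function names are hypothetical, and `zlib.crc32` is only a stand-in: TensorFlow uses its own hashing, so the actual cross IDs will differ. The bucketing assumes that a value equal to a boundary falls into the bucket above it.

```python
import bisect
import zlib

GESTATION_BOUNDARIES = [25, 30, 35, 38, 40]
MOTHER_AGE_BOUNDARIES = [10, 15, 20, 25, 30, 35]   # np.arange(10, 40, 5)
PLURALITY_BOUNDARIES = [0.5, 1.5, 2.5, 3.5, 4.5]   # np.arange(0.5, 5.5, 1.0)

def bucketize(value, boundaries):
    """Index of the bucket holding value; there are len(boundaries) + 1 buckets."""
    return bisect.bisect_right(boundaries, value)

def cross_id(bucket_ids, hash_bucket_size=10):
    """Stand-in for a crossed column: hash the bucket combination into a fixed number of slots."""
    key = '_X_'.join(str(b) for b in bucket_ids).encode('utf-8')
    return zlib.crc32(key) % hash_bucket_size

age_b = bucketize(28, MOTHER_AGE_BOUNDARIES)       # 28 lies in [25, 30)
plurality_b = bucketize(1, PLURALITY_BOUNDARIES)   # a single birth lies in [0.5, 1.5)
gestation_b = bucketize(39, GESTATION_BOUNDARIES)  # 39 lies in [38, 40)
print(age_b, plurality_b, gestation_b)             # 4 1 4
print(cross_id([age_b, plurality_b, gestation_b])) # some slot in 0..9
```

With 7 x 6 x 6 = 252 possible bucket combinations and only 10 hash slots, many combinations necessarily share a slot; a larger `hash_bucket_size` would reduce such collisions.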

To predict with the TensorFlow model, we also need a serving input function. It declares a placeholder for every input that the user will supply at prediction time.

In [21]:
def serving_input_fn():
    feature_placeholders = {
      'is_male': tf.placeholder(tf.string, [None]),
      'mother_age': tf.placeholder(tf.float32, [None]),
      'mother_race': tf.placeholder(tf.string, [None]),
      'plurality': tf.placeholder(tf.float32, [None]),
      'gestation_weeks': tf.placeholder(tf.float32, [None]),
      'mother_married': tf.placeholder(tf.string, [None]),
      'cigarette_use': tf.placeholder(tf.string, [None]),
      'alcohol_use': tf.placeholder(tf.string, [None])
    }
    features = {
      key: tf.expand_dims(tensor, -1)
      for key, tensor in feature_placeholders.items()
    }
    return tflearn.utils.input_fn_utils.InputFnOps(
      features,
      None,
      feature_placeholders)
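
For reference, the sketch below shows what prediction instances with these eight inputs might look like, and how a list of instances becomes per-feature batches with a trailing dimension of 1, which is the effect of `tf.expand_dims(tensor, -1)` on each placeholder. It is plain Python with no TensorFlow dependency; the instance values and the helper name `to_batched_features` are made up for illustration.

```python
import json

FEATURE_NAMES = ['is_male', 'mother_age', 'mother_race', 'plurality',
                 'gestation_weeks', 'mother_married', 'cigarette_use', 'alcohol_use']

# Hypothetical instances: strings for categorical inputs, floats for numeric ones.
instances = [
    {'is_male': 'True', 'mother_age': 26.0, 'mother_race': 'Asian Indian',
     'plurality': 1.0, 'gestation_weeks': 39.0, 'mother_married': 'True',
     'cigarette_use': 'False', 'alcohol_use': 'False'},
    {'is_male': 'False', 'mother_age': 29.0, 'mother_race': 'Asian Indian',
     'plurality': 2.0, 'gestation_weeks': 37.0, 'mother_married': 'True',
     'cigarette_use': 'None', 'alcohol_use': 'None'},
]

def to_batched_features(instances):
    """Turn a list of instances into {name: [[v1], [v2], ...]}, i.e. shape [batch, 1]."""
    return {name: [[inst[name]] for inst in instances] for name in FEATURE_NAMES}

features = to_batched_features(instances)
print(features['mother_age'])                    # [[26.0], [29.0]]
print(json.dumps(instances[0], sort_keys=True))  # one instance as JSON
```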

Finally, train!

In [None]:
from tensorflow.contrib.learn.python.learn.utils import saved_model_export_utils

pattern = "csv"

def experiment_fn(output_dir):
    wide, deep = get_wide_deep()
    return tflearn.Experiment(
        tflearn.DNNLinearCombinedRegressor(model_dir=output_dir,
                                           linear_feature_columns=wide,
                                           dnn_feature_columns=deep,
                                           dnn_hidden_units=[64, 32]),
        train_input_fn=read_dataset('train', pattern),
        eval_input_fn=read_dataset('eval', pattern),
        eval_metrics={
            'rmse': tflearn.MetricSpec(
                metric_fn=metrics.streaming_root_mean_squared_error
            )
        },
        export_strategies=[saved_model_export_utils.make_export_strategy(
            serving_input_fn,
            default_output_alternative_key=None,
            exports_to_keep=1
        )],
        min_eval_frequency=200,
        train_steps=TRAIN_STEPS
    )

shutil.rmtree('babyweight_trained', ignore_errors=True) # start fresh each time
learn_runner.run(experiment_fn, 'babyweight_trained')

When I ran it, the final lines of the output (above) were:
<pre>
INFO:tensorflow:SavedModel written to: babyweight_trained/export/Servo/1501608221496/saved_model.pb
({'global_step': 1000, 'loss': 1.2297399, 'rmse': 1.1134831},
 ['babyweight_trained/export/Servo/1501608221496'])
</pre>
The Servo directory contains the final exported model, and the final evaluation RMSE is 1.113 pounds.
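
As a reminder of what the reported metric measures, here is a small worked example of root mean squared error in plain Python. The labels and predictions are made-up numbers, not output from this model. Because the label is `weight_pounds`, the RMSE is expressed in pounds.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical weights in pounds; every prediction is off by exactly one pound.
labels = [7.0, 8.0, 6.0, 7.0]
predictions = [8.0, 7.0, 7.0, 6.0]
print(rmse(labels, predictions))  # 1.0
```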

<h2> Monitor and experiment with training </h2>

In [11]:
from google.datalab.ml import TensorBoard
TensorBoard().start('./babyweight_trained')

5359

In TensorBoard, look at the learned embeddings for mother_race. Are they getting clustered? How about the weights for the hidden layers? What if you run the training longer? What happens if you change the batch size?

In [12]:
# pass the process ID printed by TensorBoard().start() above
TensorBoard.stop(5359)

Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License