<h1> Serving embeddings </h1>

This notebook illustrates how to:
<ol>
<li> Create a custom embedding as part of a regression/classification model
<li> Represent categorical variables in different ways
<li> Do math with feature columns
<li> Serve out the embedding, as well as the original model's predictions
</ol>


In [1]:
# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
REGION = 'us-central1'

In [2]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

In [3]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


In [4]:
%%bash
if ! gcloud storage ls | grep -q gs://${BUCKET}/; then
  gcloud storage buckets create --location=${REGION} gs://${BUCKET}
fi

## Creating dataset

The problem is to estimate demand for bicycles at different rental stations in New York City.  The necessary data is in BigQuery:

In [5]:
query = """
#standardsql
WITH bicycle_rentals AS (
  SELECT
    COUNT(starttime) as num_trips,
    EXTRACT(DATE from starttime) as trip_date,
    MAX(EXTRACT(DAYOFWEEK from starttime)) as day_of_week,
    start_station_id
  FROM `bigquery-public-data.new_york.citibike_trips`
  GROUP BY trip_date, start_station_id
),

rainy_days AS
(
SELECT
  date,
  (MAX(prcp) > 5) AS rainy
FROM (
  SELECT
    wx.date AS date,
    IF (wx.element = 'PRCP', wx.value/10, NULL) AS prcp
  FROM
    `bigquery-public-data.ghcn_d.ghcnd_2016` AS wx
  WHERE
    wx.id = 'USW00094728'
)
GROUP BY
  date
)

SELECT
  num_trips,
  day_of_week,
  start_station_id,
  rainy
FROM bicycle_rentals AS bk
JOIN rainy_days AS wx
ON wx.date = bk.trip_date
"""
import google.datalab.bigquery as bq
df = bq.Query(query).execute().result().to_dataframe()

In [7]:
# shuffle the dataframe to make it easier to split into train/eval later
df = df.sample(frac=1.0)
df.head()

Unnamed: 0,num_trips,day_of_week,start_station_id,rainy
71373,11,4,3050,False
53403,339,3,497,True
36856,114,3,259,False
83250,95,5,377,False
56575,35,4,270,False


## Feature engineering
Let's build a model to predict the number of trips that start at each station, given that we know the day of the week and whether it is a rainy day.

Inputs to the model:
* day of week (integerized, since it is 1-7)
* station id (hash buckets, since we don't know the full vocabulary. The dataset has about 650 unique values; we'll use a much larger hash bucket size, then embed the result into a lower dimension)
* rainy (true/false)

Label:
* num_trips

By embedding the station id into just 2 dimensions, we will also get to learn which stations are like each other, at least in the context of rainy-day-rentals.
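
The hash-then-embed idea described above can be sketched in plain Python. This is only an illustration under stated assumptions: the MD5-based `hash_bucket` helper and the random `embedding_table` are hypothetical stand-ins. TensorFlow uses its own fingerprint hash, and the real embedding values are trainable weights learned during training.

```python
import hashlib
import random

NUM_BUCKETS = 5000   # hash bucket count used later in this notebook
EMBED_DIM = 2        # embedding dimension used later in this notebook

def hash_bucket(station_id, num_buckets=NUM_BUCKETS):
    """Map a station id to a bucket index in [0, num_buckets).

    Illustrative only: TensorFlow uses a different hash function, so these
    indices will not match the buckets TensorFlow assigns.
    """
    digest = hashlib.md5(str(station_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Hypothetical embedding table: one EMBED_DIM-vector per bucket. In the real
# model these values are learned, not fixed random numbers.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-0.01, 0.01) for _ in range(EMBED_DIM)]
    for _ in range(NUM_BUCKETS)
]

def embed(station_id):
    """Look up the embedding vector for a station id."""
    return embedding_table[hash_bucket(station_id)]

print(hash_bucket(519), embed(519))
```

Two stations can still land in the same bucket, in which case they share an embedding vector. Using many more buckets than stations makes that less likely.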

### Change data type

Let's change the Pandas data types to more efficient (for TensorFlow) forms.

In [8]:
df.dtypes

num_trips           int64
day_of_week         int64
start_station_id    int64
rainy                bool
dtype: object

In [9]:
import numpy as np
df = df.astype({'num_trips': np.float32, 'day_of_week': np.int32, 'start_station_id': np.int32, 'rainy': str})
df.dtypes

num_trips           float32
day_of_week           int32
start_station_id      int32
rainy                object
dtype: object
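
One detail may be worth checking here; it is offered as an observation, not as a confirmed problem in the run shown. Converting a boolean to `str` yields the capitalized strings `'True'` and `'False'`, while the vocabulary list and the JSON test instances used later in this notebook are lowercase. Vocabulary lookups are exact string matches, so a minimal sketch of normalizing the values could look like the following (the helper name `normalize_rainy` is hypothetical):

```python
def normalize_rainy(value):
    """Return 'true' or 'false' for a boolean value (hypothetical helper)."""
    return str(value).lower()

# What a plain bool-to-str conversion produces.
raw = [str(v) for v in (True, False)]

# Lowercased form, matching a vocabulary of ['false', 'true'].
normalized = [normalize_rainy(v) for v in (True, False)]

print(raw)
print(normalized)
```

On the DataFrame itself, the equivalent would be something like `df['rainy'] = df['rainy'].str.lower()` after the `astype` call.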

### Scale the label to make it easier to optimize.

In [10]:
df['num_trips'] = df['num_trips'] / 1000.0
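
Because the label is divided by 1000, the model's predictions come back in thousands of trips. A minimal sketch of converting in both directions, assuming the same divisor (the names `LABEL_SCALE`, `scale_label`, and `unscale_prediction` are illustrative and not part of the notebook):

```python
LABEL_SCALE = 1000.0  # same divisor as the cell above

def scale_label(num_trips):
    """Convert a raw trip count to the scaled label the model trains on."""
    return num_trips / LABEL_SCALE

def unscale_prediction(predicted):
    """Convert a model prediction back to an approximate trip count."""
    return predicted * LABEL_SCALE

scaled = scale_label(339)
restored = unscale_prediction(scaled)
print(scaled, restored)
```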

In [11]:
num_train = int(len(df) * 0.8)
train_df = df.iloc[:num_train]
eval_df  = df.iloc[num_train:]
print("Split into {} training examples and {} evaluation examples".format(len(train_df), len(eval_df)))

Split into 104148 training examples and 26037 evaluation examples
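
The shuffle earlier in this notebook is unseeded, so the rows that end up in each split differ from run to run. A small sketch of a reproducible shuffle and 80/20 split in plain Python follows. The function name and the seed value are assumptions for illustration. With pandas, passing `random_state` to `df.sample` should give the same kind of reproducibility.

```python
import random

def shuffle_and_split(rows, train_frac=0.8, seed=42):
    """Shuffle rows with a fixed seed, then split into (train, eval)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    num_train = int(len(rows) * train_frac)
    return rows[:num_train], rows[num_train:]

train_rows, eval_rows = shuffle_and_split(range(10))
print(len(train_rows), len(eval_rows))
```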


In [12]:
train_df.head()

Unnamed: 0,num_trips,day_of_week,start_station_id,rainy
71373,0.011,4,3050,False
53403,0.339,3,497,True
36856,0.114,3,259,False
83250,0.095,5,377,False
56575,0.035,4,270,False


<h2> Creating an Estimator model </h2>

This is a minimal linear regressor on the three features, but it is enough to train, export, and serve predictions.

In [None]:
import tensorflow as tf
import pandas as pd

def make_input_fn(indf, num_epochs):
  return tf.estimator.inputs.pandas_input_fn(
    indf,
    indf['num_trips'],
    num_epochs=num_epochs,
    shuffle=True)

def serving_input_fn():
    feature_placeholders = {
      'day_of_week': tf.placeholder(tf.int32, [None]),
      'start_station_id': tf.placeholder(tf.int32, [None]),
      'rainy': tf.placeholder(tf.string, [None])
    }
    features = {
        key: tf.expand_dims(tensor, -1)
        for key, tensor in feature_placeholders.items()
    }
    return tf.estimator.export.ServingInputReceiver(features, feature_placeholders)
  
def train_and_evaluate(output_dir, nsteps):
  station_embed = tf.feature_column.embedding_column(
      tf.feature_column.categorical_column_with_hash_bucket('start_station_id', 5000, tf.int32), 2)
  feature_cols = [
    tf.feature_column.categorical_column_with_identity('day_of_week', num_buckets = 8),
    station_embed,
    tf.feature_column.categorical_column_with_vocabulary_list('rainy', ['false', 'true'])
  ]
  estimator = tf.estimator.LinearRegressor(
                       model_dir = output_dir,
                       feature_columns = feature_cols)
  train_spec=tf.estimator.TrainSpec(
                       input_fn = make_input_fn(train_df, None),
                       max_steps = nsteps)
  exporter = tf.estimator.LatestExporter('exporter', serving_input_fn)
  eval_spec=tf.estimator.EvalSpec(
                       input_fn = make_input_fn(eval_df, 1),
                       steps = None,
                       start_delay_secs = 1, # start evaluating after N seconds
                       throttle_secs = 10,  # evaluate every N seconds
                       exporters = exporter)
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  
import shutil
OUTDIR='./model_trained'
shutil.rmtree(OUTDIR, ignore_errors=True)
train_and_evaluate(OUTDIR, 10)

## Predict using the exported model

In [14]:
%writefile test.json
{"day_of_week": 3, "start_station_id": 384, "rainy": "false"}
{"day_of_week": 4, "start_station_id": 384, "rainy": "true"}

Overwriting test.json
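
The instances file holds one JSON object per line, and the keys match the placeholders in the serving input function. As a sketch, the same payload could be built programmatically rather than typed by hand (the helper `to_json_lines` is a hypothetical name):

```python
import json

instances = [
    {"day_of_week": 3, "start_station_id": 384, "rainy": "false"},
    {"day_of_week": 4, "start_station_id": 384, "rainy": "true"},
]

def to_json_lines(rows):
    """Serialize one JSON object per line (newline-delimited JSON)."""
    return "\n".join(json.dumps(row) for row in rows) + "\n"

payload = to_json_lines(instances)
print(payload, end="")
```

Writing `payload` to `test.json` with an ordinary `open(..., "w")` call would produce a file equivalent to the one created by the cell above.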


In [23]:
%%bash
EXPORTDIR=./model_trained/export/exporter/
MODELDIR=$(ls $EXPORTDIR | tail -1)
gcloud ml-engine local predict --model-dir=${EXPORTDIR}/${MODELDIR} --json-instances=./test.json

PREDICTIONS
[0.09803415834903717]
[0.07751345634460449]
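
The shell cell picks the most recent export with `ls | tail -1`, which relies on the export directory names sorting in time order. A Python sketch of the same selection follows, assuming the subdirectories are named with numeric timestamps. The function name `latest_export` and the timestamps in the demonstration are made up.

```python
import os
import tempfile

def latest_export(export_dir):
    """Return the path of the newest timestamped export under export_dir.

    Assumes each export lives in a subdirectory whose name is a numeric
    timestamp; anything else in the directory is ignored.
    """
    versions = [d for d in os.listdir(export_dir) if d.isdigit()]
    if not versions:
        raise FileNotFoundError("no numeric export directories in " + export_dir)
    return os.path.join(export_dir, max(versions, key=int))

# Demonstration with made-up timestamps in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1531850000", "1531853000", "1531851000"):
        os.mkdir(os.path.join(tmp, name))
    newest = os.path.basename(latest_export(tmp))
    print(newest)
```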





## Serving out the embedding also

To serve out the embedding, we need to use a model function (a custom estimator) so that we can set the export_outputs of the EstimatorSpec ourselves.

In [None]:
import tensorflow as tf
import pandas as pd

def make_input_fn(indf, num_epochs):
  return tf.estimator.inputs.pandas_input_fn(
    indf,
    indf['num_trips'],
    num_epochs=num_epochs,
    shuffle=True)

def serving_input_fn():
    feature_placeholders = {
      'day_of_week': tf.placeholder(tf.int32, [None]),
      'start_station_id': tf.placeholder(tf.int32, [None]),
      'rainy': tf.placeholder(tf.string, [None])
    }
    features = {
        key: tf.expand_dims(tensor, -1)
        for key, tensor in feature_placeholders.items()
    }
    return tf.estimator.export.ServingInputReceiver(features, feature_placeholders)

def model_fn(features, labels, mode):
  # 1. Build the inputs: the embedded station id plus one-hot categorical columns
  station_col = tf.feature_column.categorical_column_with_hash_bucket('start_station_id', 5000, tf.int32)
  station_embed = tf.feature_column.embedding_column(station_col, 2)  # embed dimension
  embed_layer = tf.feature_column.input_layer(features, station_embed)
  
  cat_cols = [
    tf.feature_column.categorical_column_with_identity('day_of_week', num_buckets = 8),
    tf.feature_column.categorical_column_with_vocabulary_list('rainy', ['false', 'true'])
  ]
  cat_cols = [tf.feature_column.indicator_column(col) for col in cat_cols]
  other_inputs = tf.feature_column.input_layer(features, cat_cols)
  
  all_inputs = tf.concat([embed_layer, other_inputs], axis=1)
  predictions = tf.layers.dense(all_inputs, 1)  # linear model
  
  # 2. Use a regression head to use the standard loss, output, etc.
  my_head = tf.contrib.estimator.regression_head()
  spec = my_head.create_estimator_spec(
    features = features, mode = mode, labels = labels, logits = predictions,
    optimizer = tf.train.FtrlOptimizer(learning_rate = 0.1)
  )
  
  # 3. Create predictions
  predictions_dict = {
    "predicted": predictions,
    "station_embed": embed_layer
  }
    
  # 4. Create export outputs
  export_outputs = {
    "predict_export_outputs": tf.estimator.export.PredictOutput(outputs = predictions_dict)
  }

  # 5. Return EstimatorSpec after modifying the predictions and export outputs
  return spec._replace(predictions = predictions_dict, export_outputs = export_outputs)

def train_and_evaluate(output_dir, nsteps):
  estimator = tf.estimator.Estimator(
                       model_fn = model_fn,
                       model_dir = output_dir)
  train_spec=tf.estimator.TrainSpec(
                       input_fn = make_input_fn(train_df, None),
                       max_steps = nsteps)
  exporter = tf.estimator.LatestExporter('exporter', serving_input_fn)
  eval_spec=tf.estimator.EvalSpec(
                       input_fn = make_input_fn(eval_df, 1),
                       steps = None,
                       start_delay_secs = 1, # start evaluating after N seconds
                       throttle_secs = 10,  # evaluate every N seconds
                       exporters = exporter)
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  
import shutil
OUTDIR='./model_trained'
shutil.rmtree(OUTDIR, ignore_errors=True)
train_and_evaluate(OUTDIR, 1000)
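
To make the data flow in the model function concrete, here is a plain-Python sketch of the forward pass: the 2-dimensional station embedding is concatenated with a one-hot day of week (8 buckets) and a one-hot rainy flag (2 values), and a single dense unit produces the prediction. The weights, bias, and embedding values below are made up for illustration. In the real model they are learned, and this sketch does not use TensorFlow.

```python
def one_hot(index, depth):
    """Return a one-hot list of length depth with a 1.0 at index."""
    return [1.0 if i == index else 0.0 for i in range(depth)]

def forward(day_of_week, rainy, station_embedding, weights, bias):
    """Mimic the linear model: concat inputs, apply one dense unit."""
    day_vec = one_hot(day_of_week, 8)                       # identity column
    rainy_vec = one_hot(["false", "true"].index(rainy), 2)  # vocabulary column
    all_inputs = list(station_embedding) + day_vec + rainy_vec
    predicted = bias + sum(w * x for w, x in zip(weights, all_inputs))
    # Return the embedding next to the prediction, as the export does.
    return {"predicted": [predicted], "station_embed": list(station_embedding)}

# Made-up parameters: 2 embedding weights, 8 day weights, 2 rainy weights.
weights = [0.5, -0.5] + [0.01] * 8 + [0.02, -0.02]
out = forward(4, "true", [0.001, 0.002], weights, bias=0.08)
print(out)
```

The point of the sketch is that the embedding is an intermediate tensor of the model, so it can be returned alongside the prediction without any extra computation.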

In [21]:
%%bash
EXPORTDIR=./model_trained/export/exporter/
MODELDIR=$(ls $EXPORTDIR | tail -1)
gcloud ml-engine local predict --model-dir=${EXPORTDIR}/${MODELDIR} --json-instances=./test.json

PREDICTED              STATION_EMBED
[0.08442232012748718]  [0.0008054960635490716, 0.0008597194100730121]
[0.0879446491599083]   [0.0008054960635490716, 0.0008597194100730121]





## Explore embeddings

Let's explore the embeddings for some stations. Let's look at stations with overall similar numbers of trips. Do they have similar embedding values?

In [21]:
stations="""
  SELECT
    COUNT(starttime) as num_trips,
    MAX(start_station_name) AS station_name,
    start_station_id
  FROM `bigquery-public-data.new_york.citibike_trips`
  GROUP BY start_station_id
  ORDER BY num_trips desc
"""
stationsdf = bq.Query(stations).execute().result().to_dataframe()

In [22]:
stationsdf.head()

Unnamed: 0,num_trips,station_name,start_station_id
0,359182,Pershing Square North,519
1,291615,E 17 St & Broadway,497
2,277060,Lafayette St & E 8 St,293
3,275348,W 21 St & 6 Ave,435
4,268807,8 Ave & W 31 St N,521


In [30]:
stationsdf[500:505]

Unnamed: 0,num_trips,station_name,start_station_id
500,2828,W 87 St & West End Ave,3287
501,2808,47 Ave & 31 St,3221
502,2785,21 St & 41 Ave,3237
503,2734,Hanson Pl & Ashland Pl,3429
504,2678,E 67 St & Park Ave,3133


In [38]:
%writefile test.json
{"day_of_week": 4, "start_station_id": 435, "rainy": "true"}
{"day_of_week": 4, "start_station_id": 521, "rainy": "true"}
{"day_of_week": 4, "start_station_id": 3221, "rainy": "true"}
{"day_of_week": 4, "start_station_id": 3237, "rainy": "true"}

Overwriting test.json


435 and 521 are in the first list (of top rental locations) and in the Chelsea Market area.
3221 and 3237 are in the second list (of rarely used stations) and in Long Island City.
Do the embeddings reflect this?

In [39]:
%%bash
EXPORTDIR=./model_trained/export/exporter/
MODELDIR=$(ls $EXPORTDIR | tail -1)
gcloud ml-engine local predict --model-dir=${EXPORTDIR}/${MODELDIR} --json-instances=./test.json

PREDICTED              STATION_EMBED
[0.08976395428180695]  [-2.3062024411046878e-05, 0.008066227659583092]
[0.08778293430805206]  [-3.2270579595206073e-06, 0.0011660171439871192]
[0.08674221485853195]  [1.1034319868485909e-05, -0.00245896028354764]
[0.08657103031873703]  [9.333280104328878e-06, -0.0030552244279533625]





In this case, the first dimension of the embedding is almost zero for all four stations, so a one-dimensional embedding would have been enough. In the second dimension, the frequently used Manhattan stations have positive values (0.0081, 0.0012), whereas the rarely used Long Island City stations have negative values (-0.0025, -0.0031).
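
The same reading of the output can be checked with a few lines of Python. This sketch copies the four embedding vectors from the prediction output and compares the magnitude of each dimension and the sign of the second one. The grouping into `frequent` and `rare` follows the two station lists shown earlier.

```python
# Embeddings copied from the prediction output for the four test stations.
embeddings = {
    435: (-2.3062024411046878e-05, 0.008066227659583092),
    521: (-3.2270579595206073e-06, 0.0011660171439871192),
    3221: (1.1034319868485909e-05, -0.00245896028354764),
    3237: (9.333280104328878e-06, -0.0030552244279533625),
}
frequent = [435, 521]    # stations from the top of the trip-count ranking
rare = [3221, 3237]      # stations from around rank 500

# Largest absolute value seen in each embedding dimension.
max_abs = [max(abs(vec[d]) for vec in embeddings.values()) for d in range(2)]

print("largest magnitude per dimension:", max_abs)
print("frequent stations, second dimension:", [embeddings[s][1] for s in frequent])
print("rare stations, second dimension:", [embeddings[s][1] for s in rare])
```

Four stations are a very small sample, so this is a spot check and not evidence that the pattern holds across the whole dataset.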

Copyright 2018 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License