<h1> Feature Engineering </h1>

In this notebook, you will learn how to incorporate feature engineering into your pipeline.
<ol>
<li> Adding feature crosses in TensorFlow </li>
<li> Reading data from BigQuery </li>
<li> Carrying out preprocessing using tf.transform </li>
</ol>

Table of Contents:
<div id="toc"></div>

In [4]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')


In [None]:
%bash
pip install --upgrade tensorflow_transform

In [5]:
import google.cloud.ml as ml
import tensorflow as tf
import tensorflow_transform as tft
import shutil
print(tf.__version__)
print(ml.sdk_location)

1.0.0-rc1
gs://cloud-ml/sdk/cloudml-0.1.9-alpha.dataflow.tar.gz


<h2> Environment variables for project and bucket </h2>

Change the cell below to reflect your Project ID and bucket name. See Lab 3a for setup instructions.

In [6]:
import os
PROJECT = 'cloud-training-demos'    # CHANGE THIS
BUCKET = 'cloud-training-demos-ml'  # CHANGE THIS

os.environ['PROJECT'] = PROJECT # for bash
os.environ['BUCKET'] = BUCKET # for bash

<h2> Specifying query to pull the data </h2>

The full dataset is 1 billion rows. For experimentation, let's sample it down to roughly 10,000 rows.
Later, we'll remove the limit and train on the full dataset.
We're also using BigQuery sampling to pull out independent training and validation samples.

Note that because the test dataset is now different, we cannot directly compare test statistics between this approach and the previous .csv-based methods.

In [7]:
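def create_query(phase, EVERY_N):
  """
  Build the BigQuery query for one data split.

  phase: 1=train 2=valid
  EVERY_N: keep roughly 1 in every EVERY_N rows; None uses the full dataset
  """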
  base_query = """
SELECT
  (tolls_amount + fare_amount) AS fare_amount,
  CONCAT(STRING(pickup_datetime), STRING(pickup_longitude), STRING(pickup_latitude), STRING(dropoff_latitude), STRING(dropoff_longitude)) AS key,
  DAYOFWEEK(pickup_datetime)*1.0 AS dayofweek,
  HOUR(pickup_datetime)*1.0 AS hourofday,
  pickup_longitude AS pickuplon,
  pickup_latitude AS pickuplat,
  dropoff_longitude AS dropofflon,
  dropoff_latitude AS dropofflat,
  passenger_count*1.0 AS passengers,
FROM
  [nyc-tlc:yellow.trips]
WHERE
  trip_distance > 0
  AND fare_amount >= 2.5
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  AND passenger_count > 0
  """

  if EVERY_N is None:
    if phase < 2:
      # training: keep hash buckets 0 and 1 out of 4
      query = "{0} AND ABS(HASH(pickup_datetime)) % 4 < 2".format(base_query)
    else:
      # validation: keep the single hash bucket equal to phase
      query = "{0} AND ABS(HASH(pickup_datetime)) % 4 == {1}".format(base_query, phase)
  else:
    # sampled run: keep the one hash bucket out of EVERY_N that equals phase
    query = "{0} AND ABS(HASH(pickup_datetime)) % {1} == {2}".format(base_query, EVERY_N, phase)

  return query

print(create_query(2, 100000))


SELECT
  (tolls_amount + fare_amount) AS fare_amount,
  CONCAT(STRING(pickup_datetime), STRING(pickup_longitude), STRING(pickup_latitude), STRING(dropoff_latitude), STRING(dropoff_longitude)) AS key,
  DAYOFWEEK(pickup_datetime)*1.0 AS dayofweek,
  HOUR(pickup_datetime)*1.0 AS hourofday,
  pickup_longitude AS pickuplon,
  pickup_latitude AS pickuplat,
  dropoff_longitude AS dropofflon,
  dropoff_latitude AS dropofflat,
  passenger_count*1.0 AS passengers,
FROM
  [nyc-tlc:yellow.trips]
WHERE
  trip_distance > 0
  AND fare_amount >= 2.5
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  AND passenger_count > 0
   AND ABS(HASH(pickup_datetime)) % 100000 == 2


To see what the query above does, try it at https://bigquery.cloud.google.com/table/nyc-tlc:yellow.trips (add LIMIT 10 to the query first!).
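The hash-modulo filter at the end of the query is what keeps the training and validation samples independent: every row with the same pickup_datetime always lands in the same bucket. Here is a minimal plain-Python sketch of that idea. It is only an illustration; `stable_hash` and `assign_phase` are hypothetical helpers, MD5 stands in for BigQuery's HASH function (so the actual bucket numbers will differ), and the timestamps are made up.

```python
import hashlib


def stable_hash(value):
    """Deterministic non-negative integer hash of a string (MD5 stand-in for HASH)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16)


def assign_phase(pickup_datetime, every_n=None):
    """Return 'train', 'valid', or None (row not sampled).

    Mirrors the modulo logic of create_query: phase 1 is training,
    phase 2 is validation.
    """
    hashed = stable_hash(pickup_datetime)
    if every_n is None:
        bucket = hashed % 4
        if bucket < 2:
            return "train"   # buckets 0 and 1
        if bucket == 2:
            return "valid"   # bucket 2
        return None          # bucket 3 is left unused
    bucket = hashed % every_n
    if bucket == 1:
        return "train"
    if bucket == 2:
        return "valid"
    return None


# Hypothetical timestamps, one per minute of a single day.
timestamps = ["2015-01-01 %02d:%02d:00" % (hour, minute)
              for hour in range(24) for minute in range(60)]
phases = [assign_phase(ts) for ts in timestamps]
train = [ts for ts, phase in zip(timestamps, phases) if phase == "train"]
valid = [ts for ts, phase in zip(timestamps, phases) if phase == "valid"]

print("train rows: %d, valid rows: %d" % (len(train), len(valid)))
print("overlap: %d" % len(set(train) & set(valid)))
```

Because the bucket depends only on the hashed value, rerunning the split gives the same assignment every time, and no timestamp can appear in both samples.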

<h2> Preprocessing Dataflow job from BigQuery </h2>

This code reads from BigQuery and saves the data to Google Cloud Storage.  We can do preprocessing inside Dataflow, but then we'll have to remember to repeat the preprocessing during inference. It is better to use tf.transform, which will do this book-keeping for you, or to do the preprocessing within your TensorFlow model.

So, our "preprocessing" here is simply a data transformation.  While we could read from BigQuery directly from TensorFlow, it is less expensive to export the data to files once and do the training off those files.
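To see why repeating the preprocessing at inference time matters, consider a scaling step whose statistics come from the training data. The sketch below is plain Python, not the tf.transform API; `analyze` and `scale_to_0_1` are hypothetical helpers and the rows are made up. It shows the book-keeping you would otherwise have to do by hand: compute the statistics once, save them, and reuse exactly the same values when serving.

```python
import json


def analyze(rows, column):
    """Full pass over the training data: compute min and max of one column."""
    values = [row[column] for row in rows]
    return {"min": min(values), "max": max(values)}


def scale_to_0_1(value, stats):
    """Scale a value using statistics computed at training time."""
    span = stats["max"] - stats["min"]
    return (value - stats["min"]) / span if span else 0.0


# Hypothetical training rows.
train_rows = [{"passengers": 1.0}, {"passengers": 2.0}, {"passengers": 5.0}]
stats = analyze(train_rows, "passengers")

# Persist the statistics alongside the model so that serving can reuse them.
saved = json.dumps(stats)

# At inference time: restore the training-time statistics and apply the
# same transformation to the incoming value.
restored = json.loads(saved)
scaled = scale_to_0_1(3.0, restored)
print(scaled)
```

If the serving code recomputed min and max from its own inputs, or skipped the scaling, the model would see features on a different scale from the one it was trained on.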

If you are running on the Cloud, go to the GCP Console (https://console.cloud.google.com/dataflow) to check the status of the job. If you are running locally, you'll see a Running bar, and the job will take up to 5 minutes.

In [8]:
!gsutil -m rm -rf gs://$BUCKET/taxifare/taxi_preproc/

Removing gs://cloud-training-demos-ml/taxifare/taxi_preproc/features_eval-00000-of-00001.tfrecord.gz#1484431738311206...
Removing gs://cloud-training-demos-ml/taxifare/taxi_preproc/features_train-00000-of-00001.tfrecord.gz#1484431731394942...
Removing gs://cloud-training-demos-ml/taxifare/taxi_preproc/metadata.yaml#1484431721525262...
/ [3/3 objects] 100% Done                                                       
Operation completed over 3 objects.                                              


In [None]:
import datetime
import apache_beam as beam

# Change as needed
EVERY_N = 50 * 1000 # Change this to None to preprocess full dataset

# Direct runs locally; Dataflow runs on the Cloud.
RUNNER = 'DirectPipelineRunner'
#RUNNER = 'DataflowPipelineRunner'

OUTPUT_DIR = 'gs://{0}/taxifare/taxi_preproc/'.format(BUCKET)
options = {
    'staging_location': os.path.join(OUTPUT_DIR, 'tmp', 'staging'),
    'temp_location': os.path.join(OUTPUT_DIR, 'tmp'),
    'job_name': 'preprocess-taxifeatures' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S'),
    'project': PROJECT,
    'extra_packages': ['tensorflow_transform'],
    'teardown_policy': 'TEARDOWN_ALWAYS',
    'no_save_main_session': True
}
opts = beam.pipeline.PipelineOptions(flags=[], **options)
pipeline = beam.Pipeline(RUNNER, options=opts)
# preprocess_all is not defined in this notebook as shown; it must be defined before this cell runs.
transform_fn, train_dataset, eval_dataset = preprocess_all(
      pipeline, create_query(1, EVERY_N), create_query(2, EVERY_N), None, OUTPUT_DIR)
pipeline.run()

In [None]:
!gsutil ls gs://$BUCKET/taxifare/taxi_preproc/

In [None]:
%bash
gsutil cp -R gs://$BUCKET/taxifare/taxi_preproc /content/training-data-analyst/CPB102/lab4a

<h2> Training </h2>

Training requires you to package up your TensorFlow model into a Python package. We've done this in the directory 'taxifare'.

In that code, the latitude and longitude are discretized, and feature-crossed. The hourofday and dayofweek are divided into buckets that reflect typical traffic patterns.  The whole model is then trained.
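As a rough illustration of what discretizing and feature-crossing mean, here is a minimal plain-Python sketch. It is not the code in taxifare/trainer/taxifare.py, which builds these features in TensorFlow. The helper names, the number of buckets, and the hour boundaries are all assumptions made for the example; only the latitude and longitude ranges are taken from the query's WHERE clause.

```python
import bisect


def make_boundaries(lo, hi, nbuckets):
    """Return the nbuckets - 1 interior boundaries of equal-width buckets."""
    step = (hi - lo) / float(nbuckets)
    return [lo + step * i for i in range(1, nbuckets)]


def bucketize(value, boundaries):
    """Index of the bucket that value falls into (0 .. len(boundaries))."""
    return bisect.bisect_right(boundaries, value)


def cross(lat_bucket, lon_bucket, nbuckets):
    """Combine two bucket indices into one categorical id (a grid cell)."""
    return lat_bucket * nbuckets + lon_bucket


NBUCKETS = 16  # hypothetical grid resolution
lat_bounds = make_boundaries(37.0, 45.0, NBUCKETS)    # range from the query filter
lon_bounds = make_boundaries(-78.0, -70.0, NBUCKETS)  # range from the query filter

# A made-up pickup location.
pickup_lat, pickup_lon = 40.75, -73.99
lat_b = bucketize(pickup_lat, lat_bounds)
lon_b = bucketize(pickup_lon, lon_bounds)
cell_id = cross(lat_b, lon_b, NBUCKETS)
print(lat_b, lon_b, cell_id)

# Hypothetical hour-of-day boundaries standing in for "typical traffic patterns".
HOUR_BOUNDS = [6, 10, 16, 20]
hour_bucket = bucketize(8, HOUR_BOUNDS)
print(hour_bucket)
```

The crossed id lets a linear model learn a separate weight for each latitude/longitude grid cell, rather than one weight for latitude and one for longitude.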

In [None]:
%bash
grep -A 40 create_inputs taxifare/trainer/taxifare.py

In [None]:
%bash
rm -rf taxifare.tar.gz taxi_trained
tar cvfz taxifare.tar.gz taxifare

In [None]:
%mlalpha train
package_uris: /content/training-data-analyst/CPB102/lab4a/taxifare.tar.gz
python_module: trainer.task
scale_tier: BASIC
region: us-central1
args:
  train_data_paths: /content/training-data-analyst/CPB102/lab4a/taxi_preproc/features_train*
  eval_data_paths: /content/training-data-analyst/CPB102/lab4a/taxi_preproc/features_eval*
  metadata_path: /content/training-data-analyst/CPB102/lab4a/taxi_preproc/metadata.yaml
  output_path: /content/training-data-analyst/CPB102/lab4a/taxi_trained
  max_steps: 2500

Copyright 2016 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License