<h1> Feature Engineering </h1>

This notebook runs the preprocessing on the full dataset. It is identical to feateng.ipynb except that it processes the whole dataset.

In [7]:
import google.cloud.ml as ml
import tensorflow as tf
print tf.__version__
print ml.sdk_location

0.11.0rc0
gs://cloud-ml/sdk/cloudml-0.1.6-alpha.dataflow.tar.gz


<h1> Specifying query to pull the data </h1>

The query is the same as in feateng.ipynb.

In [8]:
def create_query(phase, EVERY_N):
  """
  phase: 1=train 2=valid
  """
  base_query = """
SELECT
  DAYOFWEEK(pickup_datetime)*1.0 AS dayofweek,
  HOUR(pickup_datetime)*1.0 AS hourofday,
  pickup_longitude, pickup_latitude, 
  dropoff_longitude, dropoff_latitude,
  passenger_count*1.0 AS passenger_count,
  (tolls_amount + fare_amount) as fare_amount
FROM
  [nyc-tlc:yellow.trips]
WHERE
    trip_distance > 0
    AND fare_amount >= 2.5
    AND pickup_longitude > -78
    AND pickup_longitude < -70
    AND dropoff_longitude > -78
    AND dropoff_longitude < -70
    AND pickup_latitude > 37
    AND pickup_latitude < 45
    AND dropoff_latitude > 37
    AND dropoff_latitude < 45
    AND passenger_count > 0 
  """

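  if EVERY_N is None:
    if phase < 2:
      # training: keep half of the rows (hash buckets 0 and 1 out of 4)
      query = "{0} AND ABS(HASH(pickup_datetime)) % 4 < 2".format(base_query)
    else:
      # validation (phase 2): keep a quarter of the rows (hash bucket == phase)
      query = "{0} AND ABS(HASH(pickup_datetime)) % 4 == {1}".format(base_query, phase)
  else:
    # sampling: keep 1 row in EVERY_N, using a different hash bucket per phase
    query = "{0} AND ABS(HASH(pickup_datetime)) % {1} == {2}".format(base_query, EVERY_N, phase)

  return query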
    
print create_query(2, 100000)


SELECT
  DAYOFWEEK(pickup_datetime)*1.0 AS dayofweek,
  HOUR(pickup_datetime)*1.0 AS hourofday,
  pickup_longitude, pickup_latitude, 
  dropoff_longitude, dropoff_latitude,
  passenger_count*1.0 AS passenger_count,
  (tolls_amount + fare_amount) as fare_amount
FROM
  [nyc-tlc:yellow.trips]
WHERE
    trip_distance > 0
    AND fare_amount >= 2.5
    AND pickup_longitude > -78
    AND pickup_longitude < -70
    AND dropoff_longitude > -78
    AND dropoff_longitude < -70
    AND pickup_latitude > 37
    AND pickup_latitude < 45
    AND dropoff_latitude > 37
    AND dropoff_latitude < 45
    AND passenger_count > 0 
   AND ABS(HASH(pickup_datetime)) % 100000 == 2
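
The hash-and-modulo filter at the end of the query is what makes the train/validation split repeatable: a given `pickup_datetime` always lands in the same bucket. The following is a minimal pure-Python sketch of that logic. `stable_hash` and `assign_split` are hypothetical helpers written for illustration, and MD5 is only a stand-in for BigQuery's `HASH()`, so the bucket a particular row lands in will differ from what BigQuery computes.

```python
import hashlib


def stable_hash(value):
    """Deterministic stand-in for BigQuery's HASH(); NOT the same function."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16)  # already non-negative, so no ABS() is needed


def assign_split(pickup_datetime, every_n=None):
    """Mirror the filters appended by create_query for phase 1 and phase 2.

    Returns 'train', 'valid', or None when neither query would select the row.
    """
    h = stable_hash(pickup_datetime)
    if every_n is None:
        bucket = h % 4
        if bucket < 2:
            return 'train'   # phase 1: hash % 4 < 2
        if bucket == 2:
            return 'valid'   # phase 2: hash % 4 == 2
        return None          # bucket 3 is used by neither phase
    bucket = h % every_n
    if bucket == 1:
        return 'train'       # phase 1: hash % EVERY_N == 1
    if bucket == 2:
        return 'valid'       # phase 2: hash % EVERY_N == 2
    return None


# Made-up timestamps, only to exercise the functions.
timestamps = ['2014-05-%02d 08:%02d:00' % (day, minute)
              for day in range(1, 29) for minute in range(60)]
full = [assign_split(t) for t in timestamps]
sampled = [assign_split(t, every_n=100) for t in timestamps]
print('full run keeps %d rows, sampled run keeps %d rows' % (
    sum(s is not None for s in full), sum(s is not None for s in sampled)))
```

One consequence of this scheme is that the validation filter compares the remainder with 2, so an `EVERY_N` of 1 or 2 can never select any validation rows.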


<h2> Preprocessing features using Cloud ML SDK </h2>

We could discretize the lat-lon columns using the SDK, but we'll defer that to TensorFlow to enable it to be a hyper-parameter if necessary.
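
To make the idea of discretizing concrete, here is a minimal pure-Python sketch of bucketizing a latitude, with the number of buckets exposed as the tunable value. It uses neither the Cloud ML SDK nor TensorFlow; `make_boundaries` and `bucketize` are hypothetical helpers, and the 37 to 45 range is simply the latitude filter from the query above.

```python
import bisect


def make_boundaries(lo, hi, nbuckets):
    """Return the nbuckets - 1 interior edges that split [lo, hi] evenly."""
    step = (hi - lo) / float(nbuckets)
    return [lo + step * i for i in range(1, nbuckets)]


def bucketize(value, boundaries):
    """Return the index of the bucket that value falls into."""
    return bisect.bisect_right(boundaries, value)


NBUCKETS = 8  # the hyper-parameter: more buckets give a finer grid
lat_edges = make_boundaries(37.0, 45.0, NBUCKETS)
print(bucketize(40.75, lat_edges))  # prints 3
```

Because the bucket count is only an argument here, a tuning run could try several values without re-running the preprocessing, which is the reason for leaving this step to the model code.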

In [10]:
import google.cloud.ml.features as features

import google.cloud.ml as ml
print ml.sdk_location

class TaxifareFeatures(object):
  csv_columns = ('dayofweek', 'hourofday', 'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count','fare_amount')
  fare_amount = features.target('fare_amount').continuous()
  pcount = features.numeric('passenger_count').scale()
  plat = features.numeric('pickup_latitude').scale()
  dlat = features.numeric('dropoff_latitude').scale()
  plon = features.numeric('pickup_longitude').scale()
  dlon = features.numeric('dropoff_longitude').scale()
  dayofweek = features.numeric('dayofweek').identity()
  hourofday = features.numeric('hourofday').identity()

gs://cloud-ml/sdk/cloudml-0.1.6-alpha.dataflow.tar.gz
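
The exact transformation behind `.scale()` is not shown in this notebook. As a rough illustration of what scaling a numeric column involves, the sketch below assumes min-max scaling to the range [-1, 1]; both the method and the target range are assumptions, and `fit_min_max` and `scale` are hypothetical helpers rather than SDK functions.

```python
def fit_min_max(values):
    """Analysis pass: find the range of the column."""
    return min(values), max(values)


def scale(value, lo, hi, out_min=-1.0, out_max=1.0):
    """Transform pass: map [lo, hi] linearly onto [out_min, out_max]."""
    if hi == lo:
        # A constant column carries no information; place it mid-range.
        return (out_min + out_max) / 2.0
    frac = (value - lo) / float(hi - lo)
    return out_min + frac * (out_max - out_min)


latitudes = [37.0, 40.75, 41.0, 45.0]  # made-up values inside the query's range
lo, hi = fit_min_max(latitudes)
scaled = [scale(v, lo, hi) for v in latitudes]
print(scaled)
```

The two-step shape (compute statistics once, then apply them to every row) is why the range has to be saved along with the features: the same `lo` and `hi` must be reused at prediction time.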


<h2> Preprocessing Dataflow job from BigQuery </h2>

This code reads from BigQuery, runs the preprocessing defined above, and saves the results to Google Cloud Storage. Make sure to change the BUCKET and PROJECT variables to your own.

If you are running on the Cloud, go to the GCP Console to check the status of the job. If you are running locally, you'll see a Running bar, and the job will take up to 5 minutes.

In [15]:
%bash
BUCKET=cloud-training-demos
gsutil -m rm -r -f gs://$BUCKET/taxifare/taxi_preproc4a_full

CommandException: 1 files/objects could not be removed.


In [16]:
# imports
import apache_beam as beam
import google.cloud.ml as ml
import google.cloud.ml.dataflow.io.tfrecordio as tfrecordio
import google.cloud.ml.io as io
import os
import datetime

# Change as needed
BUCKET = 'cloud-training-demos'
PROJECT = 'cloud-training-demos'
EVERY_N = 100 # Keeps about 1 row in 100 per phase; change this to None to preprocess the full dataset

# Direct runs locally; Dataflow runs on the Cloud.
#RUNNER = 'DirectPipelineRunner'
RUNNER = 'DataflowPipelineRunner'

OUTPUT_DIR = 'gs://{0}/taxifare/taxi_preproc4a_full/'.format(BUCKET)
options = {
    'staging_location': os.path.join(OUTPUT_DIR, 'tmp', 'staging'),
    'temp_location': os.path.join(OUTPUT_DIR, 'tmp'),
    'job_name': 'preprocess-fulldataset' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S'),
    'project': PROJECT,
    'extra_packages': [ml.sdk_location],
    'teardown_policy': 'TEARDOWN_ALWAYS',
    'no_save_main_session': True
}
opts = beam.pipeline.PipelineOptions(flags=[], **options)
pipeline = beam.Pipeline(RUNNER, options=opts)

# defines
feature_set = TaxifareFeatures()
train_query = create_query(1, EVERY_N)
valid_query = create_query(2, EVERY_N)
train = pipeline | 'read_train' >> beam.Read(beam.io.BigQuerySource(query=train_query))
# named valid rather than eval to avoid shadowing the Python builtin eval()
valid = pipeline | 'read_valid' >> beam.Read(beam.io.BigQuerySource(query=valid_query))

(metadata, train_features, eval_features) = ((train, valid) |
   'Preprocess' >> ml.Preprocess(feature_set))

(metadata
   | 'SaveMetadata'
   >> io.SaveMetadata(os.path.join(OUTPUT_DIR, 'metadata.yaml')))
(train_features
   | 'WriteTraining'
   >> io.SaveFeatures(os.path.join(OUTPUT_DIR, 'features_train')))
(eval_features
   | 'WriteEval'
   >> io.SaveFeatures(os.path.join(OUTPUT_DIR, 'features_eval')))

# run pipeline
pipeline.run()

<DataflowPipelineResult <Job
 id: u'2016-09-28_05_27_01-7932406974919974864'
 projectId: u'cloud-training-demos'
 steps: []
 tempFiles: []
 type: TypeValueValuesEnum(JOB_TYPE_BATCH, 1)> at 0x7ff749b692d0>
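
The cell above builds its `gs://` paths with `os.path.join`, which only yields forward slashes on POSIX systems. As a sketch of a more portable variant, the hypothetical helper below (not part of the SDK or of Beam) builds the path-related and naming options with `posixpath.join` and takes the timestamp as an argument so that the job name is reproducible in tests.

```python
import datetime
import posixpath


def build_job_options(bucket, project, now=None):
    """Return the path and naming options used by the preprocessing job."""
    now = now or datetime.datetime.now()
    output_dir = 'gs://{0}/taxifare/taxi_preproc4a_full/'.format(bucket)
    return {
        # posixpath always joins with '/', which is what GCS URIs need
        'staging_location': posixpath.join(output_dir, 'tmp', 'staging'),
        'temp_location': posixpath.join(output_dir, 'tmp'),
        # the timestamp suffix keeps each submitted job name distinct
        'job_name': 'preprocess-fulldataset-' + now.strftime('%y%m%d-%H%M%S'),
        'project': project,
    }


# 'my-bucket' and 'my-project' are placeholders, not real resources.
opts_sketch = build_job_options('my-bucket', 'my-project',
                                now=datetime.datetime(2016, 9, 28, 5, 27, 1))
print(opts_sketch['job_name'])  # prints preprocess-fulldataset-160928-052701
```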

In [17]:
!gsutil ls gs://cloud-training-demos/taxifare/taxi_preproc4a_full/

gs://cloud-training-demos/taxifare/taxi_preproc4a_full/features_eval-00000-of-00012.tfrecord.gz
gs://cloud-training-demos/taxifare/taxi_preproc4a_full/features_eval-00001-of-00012.tfrecord.gz
gs://cloud-training-demos/taxifare/taxi_preproc4a_full/features_eval-00002-of-00012.tfrecord.gz
gs://cloud-training-demos/taxifare/taxi_preproc4a_full/features_eval-00003-of-00012.tfrecord.gz
gs://cloud-training-demos/taxifare/taxi_preproc4a_full/features_eval-00004-of-00012.tfrecord.gz
gs://cloud-training-demos/taxifare/taxi_preproc4a_full/features_eval-00005-of-00012.tfrecord.gz
gs://cloud-training-demos/taxifare/taxi_preproc4a_full/features_eval-00006-of-00012.tfrecord.gz
gs://cloud-training-demos/taxifare/taxi_preproc4a_full/features_eval-00007-of-00012.tfrecord.gz
gs://cloud-training-demos/taxifare/taxi_preproc4a_full/features_eval-00008-of-00012.tfrecord.gz
gs://cloud-training-demos/taxifare/taxi_preproc4a_full/features_eval-00009-of-00012.tfrecord.gz
gs://cloud-training-demos/taxi
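
The output is written as numbered shards of the form `features_eval-00000-of-00012.tfrecord.gz`. Before training on it, it can be worth checking that every shard is present. The sketch below is one way to do that from a list of paths, such as the lines printed by `gsutil ls`; `missing_shards` is a hypothetical helper, and the file name pattern is inferred only from the listing above.

```python
import re

# prefix-<index>-of-<total>.tfrecord.gz, with five-digit index and total
SHARD_RE = re.compile(
    r'(?P<prefix>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.tfrecord\.gz$')


def missing_shards(paths):
    """Map each shard prefix to the sorted list of shard indices not found."""
    seen = {}
    for path in paths:
        match = SHARD_RE.search(path)
        if not match:
            continue  # ignore anything that is not a shard file
        key = (match.group('prefix'), int(match.group('total')))
        seen.setdefault(key, set()).add(int(match.group('index')))
    missing = {}
    for (prefix, total), indices in seen.items():
        gaps = sorted(set(range(total)) - indices)
        if gaps:
            missing[prefix] = gaps
    return missing


# Only ten of twelve shards are listed here, so two are reported as missing.
example = ['gs://my-bucket/out/features_eval-%05d-of-00012.tfrecord.gz' % i
           for i in range(10)]
print(missing_shards(example))
```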

Copyright 2016 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.