<h1> Feature Engineering </h1>

This notebook is Lab4a of CPB 102, Google's course on Machine Learning using Cloud ML.

This notebook demonstrates:
<ol>
<li> Reading data from BigQuery </li>
<li> Carrying out preprocessing using the ML SDK </li>
<li> Adding feature crosses in TensorFlow </li>
</ol> 

By removing the BigQuery sampling, you can train on the whole dataset.  This will take quite a while, though.

<div id="toc"></div>

In [2]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')


In [2]:
%bash
# Remember to "Reset Session" after executing this cell; the Python kernel must restart to pick up the updated package.
#gsutil cp gs://cloud-ml/sdk/cloudml-0.1.4.tar.gz .
pip install --force-reinstall --upgrade cloudml-0.1.3.tar.gz

Processing ./cloudml-0.1.3.tar.gz
Collecting oauth2client==2.2.0 (from cloudml==0.1.3)
Collecting six>=1.10.0 (from cloudml==0.1.3)
  Using cached six-1.10.0-py2.py3-none-any.whl
Collecting google-cloud-dataflow>=0.4.0 (from cloudml==0.1.3)
Collecting bs4>=0.0.1 (from cloudml==0.1.3)
  Using cached bs4-0.0.1.tar.gz
Collecting numpy>=1.10.4 (from cloudml==0.1.3)
  Using cached numpy-1.11.1-cp27-cp27mu-manylinux1_x86_64.whl
Collecting pillow>=3.2.0 (from cloudml==0.1.3)
  Using cached Pillow-3.3.1-cp27-cp27mu-manylinux1_x86_64.whl
Collecting dpkt>=1.8.7 (from cloudml==0.1.3)
  Using cached dpkt-1.8.8-py2-none-any.whl
Collecting nltk>=3.2.1 (from cloudml==0.1.3)
  Using cached nltk-3.2.1.tar.gz
Collecting httplib2>=0.9.1 (from oauth2client==2.2.0->cloudml==0.1.3)
  Using cached httplib2-0.9.2.zip
Collecting rsa>=3.1.4 (from oauth2client==2.2.0->cloudml==0.1.3)
  Using cached rsa-3.4.2-py2.py3-none-any.whl
Collecting pyasn1>=0.1.7 (from oauth2client==2.2.0->cloudml==0.1.3)
  Using cached p

<h2> Specifying the query to pull the data </h2>

The full dataset is 1 billion rows. For experimentation, let's sample it down to roughly 10,000 rows by keeping about 1 row in every 100,000.
Towards the end of this notebook, we'll remove the limit and train on the full dataset.
We're also using BigQuery sampling to divide the data into training (50%), validation (25%) and test (25%) sets.

In [1]:
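def create_query(phase, EVERY_N):
  """
  Builds the BigQuery query for one dataset split.
  phase: 1=train, 2=valid, 3=test
  EVERY_N: keep roughly 1 in EVERY_N rows; None keeps every row
  """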
  base_query = """
SELECT
  DAYOFWEEK(pickup_datetime)*1.0 AS dayofweek,
  HOUR(pickup_datetime)*1.0 AS hourofday,
  pickup_longitude, pickup_latitude, 
  dropoff_longitude, dropoff_latitude,
  passenger_count,
  (tolls_amount + fare_amount) as fare_amount
FROM
  [nyc-tlc:yellow.trips]
WHERE
    trip_distance > 0
    AND fare_amount >= 2.5
    AND pickup_longitude > -78
    AND pickup_longitude < -70
    AND dropoff_longitude > -78
    AND dropoff_longitude < -70
    AND pickup_latitude > 37
    AND pickup_latitude < 45
    AND dropoff_latitude > 37
    AND dropoff_latitude < 45
    AND passenger_count > 0 
  """

  if EVERY_N is None:
    query = base_query
  else:
    query = "{0} AND ABS(HASH(pickup_datetime)) % {1} == 1".format(base_query, EVERY_N)

  if phase < 2:
    # training
    query = "{0} AND ABS(HASH(pickup_datetime)) % 4 < 2".format(query)
  else:
    query = "{0} AND ABS(HASH(pickup_datetime)) % 4 == {1}".format(query, phase)
    
  return query
    
print(create_query(1, 100000))


SELECT
  DAYOFWEEK(pickup_datetime)*1.0 AS dayofweek,
  HOUR(pickup_datetime)*1.0 AS hourofday,
  pickup_longitude, pickup_latitude, 
  dropoff_longitude, dropoff_latitude,
  passenger_count,
  (tolls_amount + fare_amount) as fare_amount
FROM
  [nyc-tlc:yellow.trips]
WHERE
    trip_distance > 0
    AND fare_amount >= 2.5
    AND pickup_longitude > -78
    AND pickup_longitude < -70
    AND dropoff_longitude > -78
    AND dropoff_longitude < -70
    AND pickup_latitude > 37
    AND pickup_latitude < 45
    AND dropoff_latitude > 37
    AND dropoff_latitude < 45
    AND passenger_count > 0 
   AND ABS(HASH(pickup_datetime)) % 100000 == 1 AND ABS(HASH(pickup_datetime)) % 4 < 2


Try the query above at https://bigquery.cloud.google.com/table/nyc-tlc:yellow.trips if you want to see what it does (add LIMIT 10 to the query first!).
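
The sampling filter and the train/valid/test split both hash the same column, so the two conditions interact. The sketch below is plain Python, not BigQuery: it uses consecutive non-negative integers as stand-ins for `ABS(HASH(pickup_datetime))` (an assumption; real hash values are not reproduced here), and the helper names `phase_of` and `count_phases` are hypothetical. It counts how many sampled values land in each phase under the same modulo rules as `create_query`. Because 100000 is a multiple of 4, any value with remainder 1 modulo 100000 also has remainder 1 modulo 4, so every sampled value satisfies the training condition. That may be why the `features_eval` shard copied later in this notebook is only 3 B.

```python
def phase_of(h):
    """Phase selected by create_query's split rule for a non-negative hash h."""
    r = h % 4
    if r < 2:
        return 1      # training: hash % 4 < 2
    return r          # 2 = validation, 3 = test: hash % 4 == phase


def count_phases(every_n, num_hashes=1000000):
    """Count sampled stand-in hashes (h % every_n == 1) per phase."""
    counts = {1: 0, 2: 0, 3: 0}
    for h in range(num_hashes):
        if h % every_n == 1:
            counts[phase_of(h)] += 1
    return counts


# EVERY_N as used in this notebook: a multiple of 4
counts_multiple_of_4 = count_phases(100000)
# An odd value, chosen only for contrast (not a recommendation from the course)
counts_odd = count_phases(99999)

print(counts_multiple_of_4)
print(counts_odd)
```

With an `EVERY_N` that shares no factor with 4, the sampled values spread across all three phases; with a multiple of 4, they all fall into one.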

<h2> Preprocessing features using Cloud ML SDK </h2>

We could discretize the lat-lon columns using the SDK, but we'll defer that to TensorFlow to enable it to be a hyper-parameter if necessary.

In [2]:
import google.cloud.ml.features as features

import google.cloud.ml as ml
print(ml.sdk_location)

class TaxifareFeatures(object):
  csv_columns = ('dayofweek', 'hourofday',
                 'pickup_longitude', 'pickup_latitude',
                 'dropoff_longitude', 'dropoff_latitude',
                 'passenger_count', 'fare_amount')
  fare_amount = features.target('fare_amount').regression()
  pcount = features.numeric('passenger_count') # default is to scale all values in [-1, 1]
  plat = features.numeric('pickup_latitude') # not discretized here (e.g. .discretize(buckets=10)); deferred to TensorFlow
  dlat = features.numeric('dropoff_latitude')
  plon = features.numeric('pickup_longitude')
  dlon = features.numeric('dropoff_longitude')
  dayofweek = features.numeric('dayofweek').identity()
  hourofday = features.numeric('hourofday').identity()

gs://cloud-ml/sdk/cloudml-0.1.4.tar.gz
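
The comment on `pcount` says numeric features are scaled into [-1, 1] by default. The SDK's internal formula is not shown in this notebook, so the following is only a minimal sketch, assuming plain min-max scaling over the observed range. The function name `scale_to_unit_range` is hypothetical and not part of the SDK.

```python
def scale_to_unit_range(values):
    """Min-max scale a list of numbers into [-1, 1] (assumed scheme, not the SDK's code)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to the midpoint.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


# Made-up passenger counts, for illustration only
passenger_counts = [1, 2, 3, 5]
scaled = scale_to_unit_range(passenger_counts)
print(scaled)
```

Features declared with `.identity()`, such as `dayofweek` and `hourofday` above, would skip this kind of scaling and pass through unchanged.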


<h2> Preprocessing Dataflow job </h2>

This code reads from BigQuery and runs the above preprocessing, saving the results to Google Cloud Storage.  Make sure to change the BUCKET and PROJECT variables to your own.

In [4]:
# imports
import apache_beam as beam
import google.cloud.ml as ml
import google.cloud.ml.dataflow.io.tfrecordio as tfrecordio
import google.cloud.ml.io as io
import os

# Change as needed
BUCKET = 'cloud-training-demos'
PROJECT = 'cloud-training-demos'
EVERY_N = 100 * 1000  # Change this to None to preprocess the full dataset

# Run full data preprocessing on Cloud, else locally
RUNNER = 'DirectPipelineRunner'
if EVERY_N is None:
  RUNNER = 'BlockingDataflowPipelineRunner'

# defines
feature_set = TaxifareFeatures()
pipeline = beam.Pipeline(argv=['--project', PROJECT,
                               '--runner', RUNNER,
                               '--job_name', 'lab3a',
                               '--extra_package', ml.sdk_location,
                               '--no_save_main_session', 'True',  # to prevent pickling and uploading Datalab itself!
                               '--staging_location', 'gs://{0}/taxifare/staging'.format(BUCKET),
                               '--temp_location', 'gs://{0}/taxifare/temp'.format(BUCKET)])



# output location; feature_set and pipeline from above are reused, so the
# runner, staging and temp settings are not discarded by a second definition
OUTPUT_DIR = 'gs://{0}/taxifare/taxi_preproc4a'.format(BUCKET)
train_query = create_query(1, EVERY_N)
valid_query = create_query(2, EVERY_N)

train = pipeline | 'read_train' >> beam.Read(beam.io.BigQuerySource(query=train_query))
# named 'valid' rather than 'eval' to avoid shadowing Python's built-in eval()
valid = pipeline | 'read_valid' >> beam.Read(beam.io.BigQuerySource(query=valid_query))

(metadata, train_features, eval_features) = ((train, valid) |
    ml.Preprocess('Preprocess', feature_set))

train_parameters = tfrecordio.TFRecordParameters(
    file_path_prefix=os.path.join(OUTPUT_DIR, 'features_train'),
    file_name_suffix='',
    shard_file=True,
    compress_file=True)
eval_parameters = tfrecordio.TFRecordParameters(
    file_path_prefix=os.path.join(OUTPUT_DIR, 'features_eval'),
    file_name_suffix='',
    shard_file=True,
    compress_file=True)
(metadata, train_features, eval_features) | (
    io.SavePreprocessed('SavingData', OUTPUT_DIR,
                        file_parameters_list=[
                            os.path.join(OUTPUT_DIR, 'metadata.yaml'),
                            train_parameters, eval_parameters]))

# run pipeline
pipeline.run()

<apache_beam.runners.direct_runner.DirectPipelineResult at 0x7f11905c4710>

In [5]:
!gsutil ls gs://cloud-training-demos/taxifare/taxi_preproc4a/



gs://cloud-training-demos/taxifare/taxi_preproc4a/features_eval-00000-of-00001
gs://cloud-training-demos/taxifare/taxi_preproc4a/features_train-00000-of-00001
gs://cloud-training-demos/taxifare/taxi_preproc4a/info
gs://cloud-training-demos/taxifare/taxi_preproc4a/metadata.yaml


<h2> Training </h2>

Training requires you to package up your TensorFlow model into a Python package. We've done this in the directory 'taxifare'.

In that code, the hourofday and dayofweek are one-hot encoded.  The latitude and longitude are discretized, and feature-crossed. The whole model is then trained.
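
The `taxifare` package itself is not reproduced in this notebook, so the following is only a sketch of the idea in plain Python rather than TensorFlow. It assumes evenly spaced bucket boundaries over the latitude and longitude ranges used in the query (37 to 45 and -78 to -70) and a hypothetical bucket count of 8. All helper names are illustrative, and the pickup coordinates are made up.

```python
import bisect


def make_boundaries(lo, hi, nbuckets):
    """Evenly spaced interior boundaries that split [lo, hi] into nbuckets."""
    step = (hi - lo) / float(nbuckets)
    return [lo + step * i for i in range(1, nbuckets)]


def bucketize(value, boundaries):
    """Index of the bucket that value falls into (0 .. len(boundaries))."""
    return bisect.bisect_right(boundaries, value)


def one_hot(index, size):
    """Vector of zeros with a single 1 at position index."""
    vec = [0] * size
    vec[index] = 1
    return vec


def cross(lat_bucket, lon_bucket, nbuckets):
    """Combine two bucket indices into one id for the (lat, lon) grid cell."""
    return lat_bucket * nbuckets + lon_bucket


NBUCKETS = 8  # hypothetical; this is the kind of value one might tune
lat_bounds = make_boundaries(37.0, 45.0, NBUCKETS)    # 38, 39, ..., 44
lon_bounds = make_boundaries(-78.0, -70.0, NBUCKETS)  # -77, -76, ..., -71

pickup_lat, pickup_lon = 40.75, -73.99
lat_b = bucketize(pickup_lat, lat_bounds)
lon_b = bucketize(pickup_lon, lon_bounds)
cell = cross(lat_b, lon_b, NBUCKETS)

hour_vec = one_hot(17, 24)                     # hourofday = 17, one-hot encoded
cell_vec = one_hot(cell, NBUCKETS * NBUCKETS)  # crossed feature, one-hot encoded
print((lat_b, lon_b, cell))
```

Crossing gives the model one weight per grid cell, so it can pick up location-specific effects that separate latitude and longitude buckets cannot express on their own.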


In [1]:
%bash
rm -rf taxifare.tar.gz taxi_trained
tar cvfz taxifare.tar.gz taxifare
#gsutil cp taxifare.tar.gz gs://cloud-training-demos/taxifare/source/taxifare.tar.gz

taxifare/
taxifare/PKG-INFO
taxifare/setup.cfg
taxifare/setup.py
taxifare/trainer/
taxifare/trainer/__init__.py
taxifare/trainer/task.py
taxifare/trainer/taxifare.py
taxifare/trainer.egg-info/
taxifare/trainer.egg-info/dependency_links.txt
taxifare/trainer.egg-info/PKG-INFO
taxifare/trainer.egg-info/SOURCES.txt
taxifare/trainer.egg-info/top_level.txt


In [7]:
%bash
gsutil cp -R gs://cloud-training-demos/taxifare/taxi_preproc4a /content/CPB102/lab4a

Copying gs://cloud-training-demos/taxifare/taxi_preproc4a/features_eval-00000-of-00001...
Downloading ...ab4a/taxi_preproc4a/features_eval-00000-of-00001: 3 B/3 B
Copying gs://cloud-training-demos/taxifare/taxi_preproc4a/features_train-00000-of-00001...
Downloading ...b4a/taxi_preproc4a/features_train-00000-of-00001: 328.58 KiB/328.58 KiB
Copying gs://cloud-training-demos/taxifare/taxi_preproc4a/info...
Downloading file:///content/CPB102/lab4a/taxi_preproc4a/info:    0 B/44

In [22]:
%ml train

Parameters,Local Run Required,Cloud Run Required,Description
package_uris,True,True,A GCS or local (for local run only) path to your python training program package.
python_module,True,True,The module to run.
scale_tier,False,True,"Type of resources requested for the job. On local run, BASIC means 1 master process only, and any other values mean 1 master 1 worker and 1 ps processes. But you can also override the values by setting worker_count and parameter_server_count. On cloud, see service definition for possible values."
region,False,True,Where the training job runs. For cloud run only.
args,False,False,Args that will be passed to your training program.


In [None]:
%%ml train
package_uris: /content/CPB102/lab4a/taxifare.tar.gz
python_module: trainer.task
scale_tier: BASIC
region: us-central1
args:
  train_data_paths:
    - /content/CPB102/lab4a/taxi_preproc4a/features_train-00000-of-00001
  eval_data_paths:
    - /content/CPB102/lab4a/taxi_preproc4a/features_eval-00000-of-00001
  metadata_path: /content/CPB102/lab4a/taxi_preproc4a/metadata.yaml
  output_path: /content/CPB102/lab4a/taxi_trained
  max_steps: 1000

To build a cell like the one above yourself, type %ml train into an empty code cell and run it, fill in the parameters, and then execute the cell again. Create a new code cell and try it out!

In [1]:
%%ml train --cloud
package_uris: gs://cloud-training-demos/taxifare/source/taxifare.tar.gz
python_module: trainer.task
scale_tier: BASIC
region: us-central1
args:
  train_data_paths:
    - gs://cloud-training-demos/taxifare/taxi_preproc4a/features_train-00000-of-00001
  eval_data_paths:
    - gs://cloud-training-demos/taxifare/taxi_preproc4a/features_eval-00000-of-00001
  metadata_path: gs://cloud-training-demos/taxifare/taxi_preproc4a/metadata.yaml
  output_path: gs://cloud-training-demos/taxifare/taxi_trained4a
  max_steps: 1000

In [7]:
!gsutil ls gs://cloud-training-demos/taxifare/taxi_trained4a



In [8]:
%tensorboard start --logdir gs://cloud-training-demos/taxifare/taxi_trained4a

In [9]:
%tensorboard stop --pid 3222

In [20]:
%ml summary --dir gs://cloud-training-demos/taxifare/taxi_trained4a/summaries  gs://cloud-training-demos/taxifare/taxi_trained4a/eval  --name loss error --step

Copyright 2016 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.