<h1> Feature Engineering </h1>

In this notebook, you will learn how to incorporate feature engineering into your pipeline. This includes:
<ul>
<li> Working with feature columns </li>
<li> Adding feature crosses in TensorFlow </li>
<li> Reading data from BigQuery </li>
<li> Creating datasets using Dataflow </li>
<li> Using a wide-and-deep model </li>
</ul>

In [0]:
#@markdown Copy-paste your GCP Project ID in the following field:

PROJECT = "" #@param {type: "string"}


#@markdown When running this cell you will need to **uncheck "Reset all runtimes before running"** as shown on the following screenshot:
#@markdown ![](https://i.imgur.com/9dgw0h0.png)
#@markdown Next, use Shift-Enter to run this cell and to complete authentication.

try:  
  from google.colab import auth
  auth.authenticate_user()  
  print("AUTHENTICATED")
except Exception:
  print("FAILED to authenticate")
  
REGION = "us-central1"   
BUCKET = PROJECT

# Copy taxi-*.csv files from github if they are missing from the runtime.
!wget -nc --quiet https://github.com/osipov/training-data-analyst/raw/master/bootcamps/serverless_ml/taxi-11k-datasets.zip  
!unzip -q -n taxi-11k-datasets.zip  

<h2> Environment variables for project and bucket </h2>



In [0]:
# for bash
import os
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TF_VERSION'] = '1.12' 

In [0]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

In [0]:
%%bash
rm -rf taxifare
mkdir -p taxifare/trainer

for file in taxifare/setup.py \
            taxifare/trainer/__init__.py \
            taxifare/trainer/model.py \
            taxifare/trainer/task.py
do
  wget --quiet -nc \
  https://github.com/osipov/edu/raw/master/mle1/feateng/$file \
  -O $file
done

find taxifare

After running the next cell, notice the new feature placeholders in `INPUT_COLUMNS`. The two new categorical features are the day of the week (`dayofweek`) and the hour of the day (`hourofday`). There are also three engineered feature placeholders: two for the differences between the latitude and longitude coordinates (`latdiff` and `londiff`) and one for an estimated Euclidean distance (`euclidean`).

In [0]:
!grep -m 1 -A 16 INPUT_COLUMNS taxifare/trainer/model.py

The next cell highlights the changes in `build_estimator` to use bucketized features and feature crosses. The NumPy `np.linspace` function returns evenly spaced points over a range, which serve here as bucket boundaries. In this case, each NYC lat/lon coordinate of a taxi ride is placed into one of `nbuckets` buckets. Finally, both the location buckets and the day/hour features are feature crossed to create 4 additional features.

In [0]:
!grep -m 1 -A 22 build_estimator taxifare/trainer/model.py
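
To make the bucketizing and crossing steps concrete, here is a minimal pure-Python sketch that does not use TensorFlow. The coordinate ranges (38 to 42 for latitude, -76 to -72 for longitude) and the helper names `linspace`, `bucketize`, and `cross_location` are assumptions for illustration, not values read from `model.py`. The cross is encoded as a dense integer index, whereas a TensorFlow crossed column hashes the combination into a fixed number of hash buckets.

```python
import bisect

NBUCKETS = 16  # matches the nbuckets value used later in this notebook


def linspace(start, stop, num):
    """Return num evenly spaced points from start to stop (like np.linspace)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def bucketize(value, boundaries):
    """Return the bucket index: the count of boundaries that are <= value."""
    return bisect.bisect_right(boundaries, value)


# Assumed (hypothetical) bounding box around New York City.
lat_boundaries = linspace(38.0, 42.0, NBUCKETS)
lon_boundaries = linspace(-76.0, -72.0, NBUCKETS)


def cross_location(lat, lon):
    """Combine the two bucket indices into one categorical id."""
    lat_bucket = bucketize(lat, lat_boundaries)
    lon_bucket = bucketize(lon, lon_boundaries)
    # N boundaries produce N + 1 buckets per coordinate.
    return lat_bucket * (NBUCKETS + 1) + lon_bucket


print(cross_location(40.75, -73.99))
```

Two rides that start in the same latitude bucket and the same longitude bucket receive the same id, which lets a linear model learn one weight per grid cell.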

The feature definitions are grouped into wide and deep columns as shown in the next cell...

In [0]:
!grep -m 1 -A 20 wide_columns taxifare/trainer/model.py

...and the model is modified to use the wide-and-deep implementation called `DNNLinearCombinedRegressor`.

In [0]:
!grep -m 1 -A 8 DNNLinearCombinedRegressor taxifare/trainer/model.py

Notice that in the next cell, the values for the location difference and euclidean distance features are computed in-memory, as a part of the TensorFlow model implementation.

In [0]:
!grep -m 1 -A 14 add_engineered taxifare/trainer/model.py
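
The same three engineered features can be sketched for a single ride in plain Python. This is only an illustration under stated assumptions: the function in `model.py` operates on TensorFlow tensors, and the name `add_engineered_scalar`, the sign convention of the differences, and the sample coordinates below are hypothetical.

```python
import math


def add_engineered_scalar(features):
    """Return a copy of features with latdiff, londiff and euclidean added."""
    out = dict(features)
    out["latdiff"] = features["pickuplat"] - features["dropofflat"]
    out["londiff"] = features["pickuplon"] - features["dropofflon"]
    # Straight-line distance in degrees, not in miles or kilometers.
    out["euclidean"] = math.hypot(out["latdiff"], out["londiff"])
    return out


ride = {"pickuplat": 40.75, "pickuplon": -73.99,
        "dropofflat": 40.72, "dropofflon": -73.95}
print(add_engineered_scalar(ride))
```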

<h2> Hyperparameter tuning </h2>

Based on my hyperparameter tuning experiments, I ended up choosing the following values:
<ol>
<li> train_batch_size: 512 </li>
<li> nbuckets: 16 </li>
<li> hidden_units: "64 64 64 8" </li>    
</ol>

This gives an RMSE of **$5.7**, a considerable improvement over the 8.3 we were getting earlier. Let's try this on a larger dataset.
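
For reference, RMSE is the square root of the mean squared difference between predicted and actual fares, so it is expressed in dollars. A minimal sketch, with a hypothetical helper name `rmse` and made-up fares:

```python
import math


def rmse(predictions, labels):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up fares: each prediction is off by exactly 3 dollars.
example_rmse = rmse([13.0, 7.0], [10.0, 10.0])
print(example_rmse)
```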

<h1> Run Cloud training on 2 million row dataset </h1>

This run uses 2 million rows as input and takes ~70 minutes with 10 workers (STANDARD_1 pricing tier). The model is exactly the same as above. The only changes are to the input (to use the larger dataset) and to the Cloud MLE tier (to use STANDARD_1 instead of BASIC; STANDARD_1 is approximately 10x more powerful than BASIC).

When doing distributed training, use `train_steps` instead of `num_epochs`. The distributed workers don't know how many rows there are, but we can calculate train_steps = num_rows \* num_epochs / train_batch_size. In this case, we have 2141023 \* 100 / 512 ≈ 418168 train steps (rounded down).
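
The arithmetic can be checked with a few lines of Python. The function name `compute_train_steps` is hypothetical; integer division rounds the result down to a whole number of steps.

```python
def compute_train_steps(num_rows, num_epochs, train_batch_size):
    """Number of full batches needed for num_epochs passes over num_rows."""
    return (num_rows * num_epochs) // train_batch_size


train_steps = compute_train_steps(2141023, 100, 512)
print(train_steps)
```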

In [0]:
%%bash
OUTDIR=gs://${BUCKET}/taxifare/feateng2.2m
JOBNAME=feateng_$(date -u +%y%m%d_%H%M%S)
TIER=STANDARD_1 

echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR

gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/taxifare/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=$TIER \
   --runtime-version=$TF_VERSION \
   -- \
   --train_data_paths="gs://kmo-us-central1-misc/taxifare/2.2m/1.csv*" \
   --eval_data_paths="gs://kmo-us-central1-misc/taxifare/2.2m/2.csv*"  \
   --output_dir=$OUTDIR \
   --train_steps=418168 \
   --train_batch_size=512 --nbuckets=16 --hidden_units="64 64 64 8"

After you submit the job, you should see a message confirming that your job was QUEUED. To monitor the job's progress from the GCP user interface, navigate to the [Jobs](https://console.cloud.google.com/mlengine/jobs) page of the Cloud ML Engine service. Use the "View Logs" link to get the details.

### Start TensorBoard

Instead of having you wait for the model to complete training, I have already pre-trained a model on 2.2m rows of data. Use the next cell to start TensorBoard and review the average_loss and RMSE. Note that after about 1 hr 30 min of training, the model was evaluated at roughly **$4** RMSE.

In [0]:
!pip install tensorboard==1.13.0
%reload_ext tensorboard.notebook 
%tensorboard --logdir 'gs://kmo-us-central1-misc/taxifare/feateng2.2m'

Compare it to the following visualization of the model trained on 10x as much data. Notice that evaluation RMSE is roughly the same. This means that the model designed for the problem is no longer benefiting from the additional data. Nonetheless, **these models achieved and exceeded the original goal of RMSE of $6 or less!**

In [0]:
%tensorboard --logdir 'gs://kmo-us-central1-misc/taxifare/feateng22m'

### Conclusions

The RMSE after training on the 2-million-row dataset is **$4.1**.  This graph shows the improvements you have achieved in this session.

In [0]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame({'Method' : pd.Series(['Heuristic Benchmark', 'tf.learn', '+Feature Eng.', '+ Hyperparam', '+ 2m rows']),
              'RMSE': pd.Series([8.026, 9.4, 8.3, 5.7, 4.1]) })

ax = sns.barplot(data = df, x = 'Method', y = 'RMSE')
ax.set_ylabel('RMSE (dollars)')
ax.set_xlabel('Methods')
plt.plot(np.linspace(-20, 120, 1000), [5] * 1000, 'b');

Copyright 2019 Counter Factual .AI LLC. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License

<h1>OPTIONAL: Scalable retrieval of millions of rows of data from BigQuery</h1> 

In [0]:
!pip install apache-beam[gcp] google-apitools

You can ignore the message to restart the runtime and continue with the rest of the notebook.

In [0]:
import tensorflow as tf
import apache_beam as beam
import shutil
print(tf.__version__)

<h2>Specifying query to pull the data </h2>

Let's pull out a few extra columns from the timestamp.

In [0]:
def create_query(phase, EVERY_N):
  """
  phase: 1=train 2=valid
  """
  base_query = """
    SELECT
      (tolls_amount + fare_amount) AS fare_amount,
      
      CONCAT( STRING(pickup_datetime), 
              CAST(pickup_longitude AS STRING), 
              CAST(pickup_latitude AS STRING),
              CAST(dropoff_latitude AS STRING), 
              CAST(dropoff_longitude AS STRING)) AS key,
              
      EXTRACT(DAYOFWEEK FROM pickup_datetime) AS dayofweek,
      EXTRACT(HOUR FROM pickup_datetime) AS hourofday,
      pickup_longitude AS pickuplon,
      pickup_latitude AS pickuplat,
      dropoff_longitude AS dropofflon,
      dropoff_latitude AS dropofflat,
      passenger_count*1.0 AS passengers
    FROM
      `nyc-tlc.yellow.trips`
    WHERE
      {}
      AND trip_distance > 0
      AND fare_amount >= 2.5
      AND pickup_longitude > -78
      AND pickup_longitude < -70
      AND dropoff_longitude > -78
      AND dropoff_longitude < -70
      AND pickup_latitude > 37
      AND pickup_latitude < 45
      AND dropoff_latitude > 37
      AND dropoff_latitude < 45
      AND passenger_count > 0
  """
  if EVERY_N is None:
    if phase < 2:
      # training
      selector = "MOD(ABS(FARM_FINGERPRINT(STRING(pickup_datetime))), 4) < 2"
    else:
      # validation
      selector = "MOD(ABS(FARM_FINGERPRINT(STRING(pickup_datetime))), 4) = 2"
  else:
    # keep roughly 1 in EVERY_N rows; phase picks which hash bucket is kept
    selector = "MOD(ABS(FARM_FINGERPRINT(STRING(pickup_datetime))), %d) = %d" % (EVERY_N, phase)
    
  query = base_query.format(selector)

  return query

sql = create_query(2, 100000)
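
The `MOD(ABS(FARM_FINGERPRINT(...)), N) = phase` filter is a deterministic, hash-based sample: the same pickup time always lands in the same bucket, so the training and validation sets never overlap and stay the same from run to run. The sketch below illustrates the idea locally. It substitutes SHA-256 from the standard library for `FARM_FINGERPRINT`, so the bucket assignments will not match BigQuery's, and the helper names `fingerprint` and `in_sample` are hypothetical.

```python
import hashlib


def fingerprint(text):
    """Stable non-negative integer hash of a string (SHA-256 stand-in)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def in_sample(pickup_datetime, every_n, phase):
    """True if the row falls in the hash bucket selected by phase."""
    return fingerprint(pickup_datetime) % every_n == phase


timestamps = ["2015-01-01 00:%02d:00" % minute for minute in range(60)]
train = [t for t in timestamps if in_sample(t, 4, 1)]
valid = [t for t in timestamps if in_sample(t, 4, 2)]
print(len(train), len(valid))
```

Because each timestamp maps to exactly one bucket, `train` and `valid` share no rows.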

<h2>Preprocessing Dataflow job from BigQuery </h2>

While we could read from BQ directly from TensorFlow (See: https://www.tensorflow.org/api_docs/python/tf/contrib/cloud/BigQueryReader), it is quite convenient to export to CSV and do the training off CSV.  Let's use Dataflow to do this at scale.

In [0]:
%%bash
gsutil -m rm -rf gs://$BUCKET/taxifare/taxi_preproc/

In [0]:
import datetime

####
# Arguments:
#   -rowdict: Dictionary. The beam bigquery reader returns a PCollection in
#     which each row is represented as a python dictionary
# Returns:
#   -rowstring: a comma separated string representation of the record with dayofweek
#     converted from int to string (e.g. 3 --> Tue)
####
def to_csv(rowdict):
  days = ['null', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
  CSV_COLUMNS = 'fare_amount,dayofweek,hourofday,pickuplon,pickuplat,dropofflon,dropofflat,passengers,key'.split(',')
  rowdict['dayofweek'] = days[rowdict['dayofweek']]
  rowstring = ','.join([str(rowdict[k]) for k in CSV_COLUMNS])
  return rowstring


####
# Arguments:
#   -EVERY_N: Integer. Sample one out of every N rows from the full dataset.
#     Larger values will yield smaller sample
#   -RUNNER: 'DirectRunner' or 'DataflowRunner'. Specify to run the pipeline
#     locally or on Google Cloud respectively. 
# Side-effects:
#   -Creates and executes dataflow pipeline. 
#     See https://beam.apache.org/documentation/programming-guide/#creating-a-pipeline
####
def preprocess(EVERY_N, RUNNER):
  phase_name = ['', 'train', 'valid']
  job_name = 'preprocess-taxifeatures' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S')
  print('Launching Dataflow job {} ... hang on'.format(job_name))
  OUTPUT_DIR = 'gs://{0}/taxifare/taxi_preproc/'.format(BUCKET)

  #dictionary of pipeline options
  options = {
    'staging_location': os.path.join(OUTPUT_DIR, 'tmp', 'staging'),
    'temp_location': os.path.join(OUTPUT_DIR, 'tmp'),
    'job_name': job_name,
    'project': PROJECT,
    'runner': RUNNER
  }
  #instantiate PipelineOptions object using options dictionary
  opts = beam.pipeline.PipelineOptions(flags=[], **options)
  #instantiate Pipeline object using PipelineOptions
  p = beam.Pipeline(options=opts)
  for phase in [1,2]:
    query = create_query(phase, EVERY_N) 
    outfile = os.path.join(OUTPUT_DIR, '{}.csv'.format(phase_name[phase]))
    (
      p | 'read_{}'.format(phase) >> beam.io.Read(beam.io.BigQuerySource(query=query, use_standard_sql=True))
        | 'tocsv_{}'.format(phase) >> beam.Map(to_csv)
        | 'write_{}'.format(phase) >> beam.io.Write(beam.io.WriteToText(outfile))
    )

  p.run().wait_until_finish()
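
Before launching the pipeline, the row formatting can be checked locally without Beam. The sketch below applies a standalone copy of the `to_csv` logic to a made-up row. The name `to_csv_copy` and every value in `sample_row` are hypothetical, and unlike the original this copy leaves its input dictionary unmodified.

```python
CSV_COLUMNS = ('fare_amount,dayofweek,hourofday,pickuplon,pickuplat,'
               'dropofflon,dropofflat,passengers,key').split(',')
# BigQuery's DAYOFWEEK runs from 1 (Sunday) to 7 (Saturday); index 0 is unused.
DAYS = ['null', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']


def to_csv_copy(rowdict):
    """Format one row dictionary as a comma-separated line."""
    row = dict(rowdict)  # work on a copy so the caller's dict is untouched
    row['dayofweek'] = DAYS[row['dayofweek']]
    return ','.join(str(row[k]) for k in CSV_COLUMNS)


sample_row = {
    'fare_amount': 12.5, 'dayofweek': 3, 'hourofday': 17,
    'pickuplon': -73.99, 'pickuplat': 40.75,
    'dropofflon': -73.98, 'dropofflat': 40.76,
    'passengers': 1.0, 'key': 'hypothetical-key',
}
csv_line = to_csv_copy(sample_row)
print(csv_line)
```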

Run pipeline locally using `DirectRunner`

In [0]:
# EVERY_N / approximate rows sampled / approximate runtime
# 50*10000 / 2.2k / 20 s
# 50*1000  / 22k  / 5 min
# 50*100   / 220k / >60 min
preprocess(50*10000, 'DirectRunner') 
# preprocess(50*10000, 'DataflowRunner') 

To run the pipeline on cloud on a larger sample size, change the arguments to preprocess to use `DataflowRunner` and a different sample size. When running this on Cloud Dataflow, you should go to the GCP Console (https://console.cloud.google.com/dataflow) to look at the status of the job. Note that it will take several minutes for the preprocessing job to launch.

Once the job completes, observe the files created in Google Cloud Storage:

In [0]:
%%bash
gsutil ls -l gs://$BUCKET/taxifare/taxi_preproc/

In [0]:
%%bash
#print first 10 lines of first shard of train.csv
gsutil cat "gs://$BUCKET/taxifare/taxi_preproc/train.csv-00000-of-*" | head
