<h1> Structured Data Solution </h1>

In this notebook, we will use the structured data package in Datalab to build a model that predicts taxi fares.

In [None]:
import os
PROJECT = 'cloud-training-demos'    # CHANGE THIS
BUCKET = 'cloud-training-demos-ml'  # CHANGE THIS
REGION = 'us-central1' # CHANGE THIS

os.environ['PROJECT'] = PROJECT # for bash
os.environ['BUCKET'] = BUCKET # for bash
os.environ['REGION'] = REGION # for bash

In [None]:
%bash
echo "project=$PROJECT"
echo "bucket=$BUCKET"
echo "region=$REGION"
gcloud config set project $PROJECT
gcloud config set compute/region $REGION
gcloud beta ml init-project -q

In [None]:
import tensorflow as tf
import datalab.ml as ml
import google.cloud.ml as cml
import datalab_solutions.structured_data as sd
from tensorflow.python.lib.io import file_io
import json
import shutil

print('tf ' + str(tf.__version__))
print('sd ' + str(sd.__version__))
print('cml ' + str(cml.__version__))

In [None]:
INDIR = '../feateng/sample'
OUTDIR = '.'

<h2> Set up schema file </h2>

The schema describes the columns of the training and test data. It uses the same format as a BigQuery schema, but only the STRING, INTEGER, and FLOAT types are supported.

In [None]:
%writefile taxifare.json
[
    {
        "mode": "NULLABLE",
        "name": "fare_amount",
        "type": "FLOAT"
    }, 
    {
        "mode": "NULLABLE",
        "name": "dayofweek",
        "type": "STRING"
    },
    {
        "mode": "NULLABLE",
        "name": "hourofday",
        "type": "STRING"
    },
    {
        "mode": "NULLABLE",
        "name": "pickuplon",
        "type": "FLOAT"
    },
    {
        "mode": "NULLABLE",
        "name": "pickuplat",
        "type": "FLOAT"
    },
    {
        "mode": "NULLABLE",
        "name": "dropofflon",
        "type": "FLOAT"
    },
    {
        "mode": "NULLABLE",
        "name": "dropofflat",
        "type": "FLOAT"
    },
    {
        "mode": "NULLABLE",
        "name": "passengers",
        "type": "FLOAT"
    },
    {
        "mode": "REQUIRED",
        "name": "key",
        "type": "STRING"
    } 
]
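
Before preprocessing, it can help to confirm that the schema only uses the three supported types. The sketch below is not part of the structured data package: `check_schema` is a hypothetical helper, and the inline schema is a shortened stand-in for the contents of taxifare.json.

```python
import json

# The only column types the schema description above allows.
ALLOWED_TYPES = {'STRING', 'INTEGER', 'FLOAT'}


def check_schema(schema):
    """Return the column names in order; raise ValueError on a bad entry.

    Hypothetical helper for illustration, not part of the sd package.
    """
    names = []
    for col in schema:
        if col['type'] not in ALLOWED_TYPES:
            raise ValueError('unsupported type %s for column %s'
                             % (col['type'], col['name']))
        if col['name'] in names:
            raise ValueError('duplicate column %s' % col['name'])
        names.append(col['name'])
    return names


# Shortened stand-in for the contents of taxifare.json.
schema = json.loads('''[
  {"mode": "NULLABLE", "name": "fare_amount", "type": "FLOAT"},
  {"mode": "NULLABLE", "name": "dayofweek", "type": "STRING"},
  {"mode": "REQUIRED", "name": "key", "type": "STRING"}
]''')
column_names = check_schema(schema)
print(column_names)
```

To check the real file, load it with `json.load(open('taxifare.json'))` and pass the result to the same function.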

<h2> Preprocessing </h2>

The first step of preprocessing is to compute statistics (min, max, etc.) over the training data. These statistics are later used for scaling.

In [None]:
!rm -rf taxi_preproc taxi_trained

In [None]:
train_csv = ml.CsvDataSet(
  file_pattern=os.path.join(INDIR, 'train*'),
  schema_file=os.path.join(OUTDIR, 'taxifare.json'))
sd.local_preprocess(
  dataset=train_csv,
  output_dir=os.path.join(OUTDIR, 'taxi_preproc'),
)
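
To make the idea of "min and max for scaling" concrete, here is a minimal sketch of min-max scaling in plain Python. It is only an illustration: the function names and sample values are made up, and the scaling the package actually applies (its target range, for instance) may differ.

```python
def column_min_max(values):
    """Return (min, max) of a list of numbers."""
    return min(values), max(values)


def scale(value, lo, hi):
    """Linearly map value from [lo, hi] to [0, 1].

    A constant column (hi == lo) maps to 0.0 to avoid dividing by zero.
    """
    if hi == lo:
        return 0.0
    return (value - lo) / float(hi - lo)


# Made-up pickup latitudes, for illustration only.
pickuplat = [40.70, 40.75, 40.80]
lo, hi = column_min_max(pickuplat)
scaled = [scale(v, lo, hi) for v in pickuplat]
print(scaled)
```

The min and max must come from the training data alone, so the same `lo` and `hi` are reused when scaling evaluation and prediction inputs.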

The second step is to specify the feature columns and transformations.  The target and key transforms are required. Everything else is optional.

In [None]:
transforms = {
  "fare_amount": {"transform": "target"},
  "key": {"transform": "key"}, 
  "dayofweek": {"transform": "one_hot"},
  "hourofday": {"transform": "embedding", "embedding_dim": 2}, # embed the hour in 2 dimensions so that similar hours end up grouped together
}
file_io.write_string_to_file(os.path.join(OUTDIR, 'taxi_preproc/transforms.json'),
                             json.dumps(transforms, indent=2))
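
Because the target and key transforms are required, a small check before writing the file can catch a missing or duplicated entry early. This is a sketch under assumptions: `check_transforms` is a hypothetical helper, not something the package provides, and the package may perform its own validation.

```python
def check_transforms(transforms, column_names):
    """Hypothetical check: one target, one key, and only known columns."""
    kinds = [spec['transform'] for spec in transforms.values()]
    if kinds.count('target') != 1:
        raise ValueError('exactly one column needs the "target" transform')
    if kinds.count('key') != 1:
        raise ValueError('exactly one column needs the "key" transform')
    unknown = sorted(set(transforms) - set(column_names))
    if unknown:
        raise ValueError('columns not in schema: %s' % ', '.join(unknown))
    return True


# Column names from the schema file, in order.
column_names = ['fare_amount', 'dayofweek', 'hourofday', 'pickuplon',
                'pickuplat', 'dropofflon', 'dropofflat', 'passengers', 'key']

# Same transforms as in the cell above.
transforms = {
    'fare_amount': {'transform': 'target'},
    'key': {'transform': 'key'},
    'dayofweek': {'transform': 'one_hot'},
    'hourofday': {'transform': 'embedding', 'embedding_dim': 2},
}
print(check_transforms(transforms, column_names))
```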

In [None]:
!ls taxi_preproc

<h2> Local Training and prediction </h2>

Train using the preprocessed data.

In [None]:
eval_csv = ml.CsvDataSet(
  file_pattern=os.path.join(INDIR, 'valid*'),
  schema_file=os.path.join('.', 'taxifare.json'))

shutil.rmtree('./taxi_trained', ignore_errors=True)
sd.local_train(
  train_dataset=train_csv,
  eval_dataset=eval_csv,
  preprocess_output_dir=os.path.join(OUTDIR, 'taxi_preproc'),
  transforms=os.path.join(OUTDIR, 'taxi_preproc/transforms.json'),
  output_dir=os.path.join(OUTDIR, 'taxi_trained'),
  model_type='dnn_regression',
  max_steps=2500,
  layer_sizes=[64, 4]
)

In [54]:
!ls taxi_trained

evaluation_model  model  train


In [55]:
sd.local_predict(
  training_ouput_dir=os.path.join(OUTDIR, 'taxi_trained'),
  data=['Sun,0,-73.984685,40.769262,-73.991065,40.728145,5.0,row_01',
        'Sun,0,-74.006927,40.739993,-73.950025,40.773403,1.0,row_02',
        'Sun,0,-73.977345,40.779387,-73.97615,40.778867,1.0,row_03',
        'Sun,0,-73.97136,40.794413,-73.99623,40.74524,1.0,row_04',
        'Sun,0,-73.997642,40.763853,-73.99485,40.750282,1.0,row_05',
        'Sun,0,-74.004538,40.742202,-73.955823,40.773485,1.0,row_06',
        'Sun,0,-74.000589,40.73731,-73.985902,40.692725,1.0,row_07',
        'Sun,0,-73.995432,40.72114,-73.992403,40.719745,1.0,row_08',
        'Sun,0,-73.945033,40.779203,-73.952037,40.766802,1.0,row_09',
        'Sun,0,-73.968592,40.693262,-73.99231,40.694317,1.0,row_10']
)

Starting local prediction.
Local prediction done.


   key_from_input  predicted_target
0          row_01         11.257042
1          row_02         11.257042
2          row_03         11.257042
3          row_04         11.257042
4          row_05         11.257042
5          row_06         11.257042
6          row_07         11.257042
7          row_08         11.257042
8          row_09         11.257042
9          row_10         11.257042
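
Each string passed to local_predict above lists the schema columns in order, with the target column (fare_amount) left out. If the inputs start out as dictionaries, a row can be assembled as in the sketch below; `make_prediction_row` is a hypothetical helper written for this illustration.

```python
# Column order from taxifare.json.
SCHEMA_COLUMNS = ['fare_amount', 'dayofweek', 'hourofday', 'pickuplon',
                  'pickuplat', 'dropofflon', 'dropofflat', 'passengers', 'key']
TARGET = 'fare_amount'  # omitted from prediction inputs


def make_prediction_row(record):
    """Join the non-target columns of record into one CSV line."""
    return ','.join(str(record[name])
                    for name in SCHEMA_COLUMNS if name != TARGET)


# The first input row from the cell above, expressed as a dictionary.
record = {
    'dayofweek': 'Sun',
    'hourofday': 0,
    'pickuplon': -73.984685,
    'pickuplat': 40.769262,
    'dropofflon': -73.991065,
    'dropofflat': 40.728145,
    'passengers': 5.0,
    'key': 'row_01',
}
row = make_prediction_row(record)
print(row)
```

This simple join does no quoting, so it is only suitable for values that contain no commas.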


In [None]:
shutil.rmtree('./batch_predict', ignore_errors=True)
sd.local_batch_predict(
  training_ouput_dir=os.path.join(OUTDIR, 'taxi_trained'),
  prediction_input_file=os.path.join(INDIR, 'valid*'),
  output_dir=os.path.join(OUTDIR, 'batch_predict'),
  output_format='csv',
  mode='prediction'
)

In [None]:
!ls batch_predict

<h2> Cloud preprocessing and training </h2>

In the above cells, change INDIR and OUTDIR to Google Cloud Storage (GCS) paths.

Then change the calls from their local versions to their cloud versions, for example from local_predict to cloud_predict. That's it.
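
As a sketch of the first change, INDIR and OUTDIR could be pointed at the bucket as follows. The directory layout under the bucket is a made-up example, not a location the notebook prescribes; any gs:// paths you can read from and write to will do.

```python
import os

BUCKET = 'cloud-training-demos-ml'  # same placeholder as at the top; CHANGE THIS

# Hypothetical layout under the bucket, for illustration only.
INDIR = 'gs://{}/taxifare/sample'.format(BUCKET)
OUTDIR = 'gs://{}/taxifare/structured_data'.format(BUCKET)

# The os.path.join calls in the earlier cells then yield GCS paths.
print(os.path.join(INDIR, 'train*'))
print(os.path.join(OUTDIR, 'taxi_preproc'))
```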



Copyright 2016 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.