# Scaling up ML using Cloud AI Platform

In this notebook, we take a previously developed TensorFlow model that predicts taxi fares and package it so that it can run on Cloud AI Platform. For now, we'll run it on a small dataset. The model is simplistic, so its accuracy is not great. However, this notebook illustrates *how* to package a TensorFlow model so that it runs on Cloud AI Platform.

Later in the course, we will look at ways to make a more effective machine learning model.

## Environment variables for project and bucket

Note that:
<ol>
<li> Your project ID is the *unique* string that identifies your project (not the project name). You can find it on the Home page of the GCP Console dashboard. My dashboard reads: <b>Project ID:</b> cloud-training-demos </li>
<li> Cloud training often involves saving and restoring model files. If you don't have a bucket already, I suggest that you create one from the GCP Console (because it will dynamically check whether the bucket name you want is available). A common pattern is to prefix the bucket name with the project ID, so that it is unique. Also, for cost reasons, you might want to use a single-region bucket. </li>
</ol>
<b>Change the cell below</b> to reflect your Project ID and bucket name.


In [None]:
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst

In [1]:
import tensorflow as tf
tf_version = tf.__version__
print(tf_version)

2.1.1-dlenv_tfe


In [2]:
import os
PROJECT = 'qwiklabs-gcp-00-5abd7c7e843f' # REPLACE WITH YOUR PROJECT ID
BUCKET = 'qwiklabs-gcp-00-5abd7c7e843f' # REPLACE WITH YOUR BUCKET NAME
REGION = 'europe-west2' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1
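
The bucket naming pattern from the note above (prefix the bucket name with the project ID) can be sketched in Python. The helper `default_bucket_name` and its `suffix` parameter are hypothetical and not part of the course code. The check covers only the basic Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores and dots; starting and ending with a letter or digit). It cannot tell you whether the name is still available.

```python
import re

# Basic Cloud Storage naming rules only; availability is still checked by GCP.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def default_bucket_name(project_id, suffix=""):
    """Build a bucket name prefixed with the project ID (hypothetical helper)."""
    name = project_id if not suffix else "{}-{}".format(project_id, suffix)
    if not _BUCKET_RE.match(name):
        raise ValueError("Invalid bucket name: {!r}".format(name))
    return name


print(default_bucket_name("qwiklabs-gcp-00-5abd7c7e843f"))
print(default_bucket_name("cloud-training-demos", "ml"))
```

In this notebook the bucket name is simply the project ID, which is the case with an empty `suffix`.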

In [3]:
# For Python Code
# Model Info
MODEL_NAME = 'taxifare'
# Model Version
MODEL_VERSION = 'v1'
# Training Directory name
TRAINING_DIR = 'taxi_trained'

In [4]:
# for bash
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['MODEL_NAME'] = MODEL_NAME
os.environ['MODEL_VERSION'] = MODEL_VERSION
os.environ['TRAINING_DIR'] = TRAINING_DIR 
#os.environ['TFVERSION'] = '2.2.0'  # Tensorflow version
os.environ['TFVERSION'] = tf_version  # Tensorflow version

In [6]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


### Enable the Cloud Machine Learning Engine API

The next command calls the Cloud AI Platform API. For it to work, you must first enable the API in the Cloud Console. Open the [API library](https://console.cloud.google.com/project/_/apis/library), search the API list for Cloud Machine Learning, and enable the API before executing the next cell.

Allow the Cloud AI Platform service account to read/write to the bucket containing training data.

In [7]:
%%bash
# This command will fail if the Cloud Machine Learning Engine API is not enabled using the link above.
echo "Getting the service account email associated with the Cloud AI Platform API"

AUTH_TOKEN=$(gcloud auth print-access-token)
SVC_ACCOUNT=$(curl -X GET -H "Content-Type: application/json" \
    -H "Authorization: Bearer $AUTH_TOKEN" \
    https://ml.googleapis.com/v1/projects/${PROJECT}:getConfig \
    | python -c "import json; import sys; response = json.load(sys.stdin); \
    print (response['serviceAccount'])")  # If this command fails, the Cloud Machine Learning Engine API has not been enabled above.

echo "Authorizing the Cloud AI Platform account $SVC_ACCOUNT to access files in $BUCKET"
gsutil -m defacl ch -u $SVC_ACCOUNT:R gs://$BUCKET   
gsutil -m acl ch -u $SVC_ACCOUNT:R -r gs://$BUCKET   # error message (if bucket is empty) can be ignored.  
gsutil -m acl ch -u $SVC_ACCOUNT:W gs://$BUCKET 

Getting the service account email associated with the Cloud AI Platform API
Authorizing the Cloud AI Platform account service-742239993613@cloud-ml.google.com.iam.gserviceaccount.com to access files in qwiklabs-gcp-00-5abd7c7e843f


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   235    0   235    0     0    318      0 --:--:-- --:--:-- --:--:--   318
Updated default ACL on gs://qwiklabs-gcp-00-5abd7c7e843f/
Encountered a problem: CommandException: No URLs matched: gs://qwiklabs-gcp-00-5abd7c7e843f/*
Updated ACL on gs://qwiklabs-gcp-00-5abd7c7e843f/
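
The inline `python -c` snippet in the cell above fails with a bare `KeyError` when the API is not enabled. A slightly more defensive version might look like the sketch below. The function name `extract_service_account` is hypothetical, and the shape of the error response (`{"error": {"message": ...}}`) is an assumption based on the usual Google API error envelope, not something shown in this notebook.

```python
import json


def extract_service_account(response_text):
    """Return the service account email from a getConfig JSON response."""
    response = json.loads(response_text)
    if "error" in response:
        # Assumption: failures use the usual Google API error envelope.
        message = response["error"].get("message", "unknown error")
        raise RuntimeError("getConfig failed (is the API enabled?): " + message)
    return response["serviceAccount"]


# Sample payload built from the service account printed in the output above.
sample = json.dumps({
    "serviceAccount": "service-742239993613@cloud-ml.google.com.iam.gserviceaccount.com"
})
print(extract_service_account(sample))
```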


## Packaging up the code

Put your code into a standard Python package structure. <a href="taxifare/trainer/model.py">model.py</a> and <a href="taxifare/trainer/task.py">task.py</a> contain the TensorFlow code from earlier (explore the <a href="taxifare/trainer/">directory structure</a>).

In [8]:
%%bash
find ${MODEL_NAME}

taxifare
taxifare/trainer
taxifare/trainer/__init__.py
taxifare/trainer/.ipynb_checkpoints
taxifare/trainer/.ipynb_checkpoints/model-checkpoint.py
taxifare/trainer/task.py
taxifare/trainer/model.py
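
Before running or submitting the package, it can help to confirm that the layout listed above is complete, in particular that `__init__.py` is present so that `trainer` is importable as a package. The helper `missing_package_files` below is a hypothetical sketch; it is demonstrated on a throwaway directory rather than on the real `taxifare/trainer`.

```python
import os
import tempfile


def missing_package_files(package_dir,
                          required=("__init__.py", "task.py", "model.py")):
    """Return the required files that are absent from package_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(package_dir, name))]


# Demonstrate on a temporary copy of the layout, leaving out model.py on purpose.
with tempfile.TemporaryDirectory() as root:
    trainer = os.path.join(root, "taxifare", "trainer")
    os.makedirs(trainer)
    for name in ("__init__.py", "task.py"):
        open(os.path.join(trainer, name), "w").close()
    demo_missing = missing_package_files(trainer)

print(demo_missing)
```

On the real package you would call `missing_package_files("taxifare/trainer")` and expect an empty list.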


In [9]:
%%bash
cat ${MODEL_NAME}/trainer/model.py

#!/usr/bin/env python

# Copyright 2017 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
import numpy as np
import shutil

import logging

logger = tf.get_logger()
logger.setLevel(logging.INFO)

# List the CSV columns
CSV_COLUMNS = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']

#Choose which 

## Find absolute paths to your data

Note the absolute paths below. 

In [13]:
%%bash
echo "Working Directory: ${PWD}"
echo "Head of taxi-train.csv"
head -1 $PWD/taxi-train.csv
echo "Head of taxi-valid.csv"
head -1 $PWD/taxi-valid.csv

Working Directory: /home/jupyter/ml-playground/notebook/training-data-analyst/03_tensorflow
Head of taxi-train.csv
12.0,-73.987625,40.750617,-73.971163,40.78518,1,0
Head of taxi-valid.csv
6.0,-74.013667,40.713935,-74.007627,40.702992,2,0
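
Each line in these files follows the `CSV_COLUMNS` order listed in `model.py`. As a minimal sketch of how one row maps onto those columns, the snippet below parses a line with the standard `csv` module. The function `parse_row` is hypothetical, and treating `fare_amount` as the label is an assumption based on the model predicting taxi fares; the actual input function in `model.py` is not shown in full here.

```python
import csv
import io

CSV_COLUMNS = ['fare_amount', 'pickuplon', 'pickuplat', 'dropofflon',
               'dropofflat', 'passengers', 'key']
LABEL_COLUMN = 'fare_amount'  # assumption: the fare is the prediction target


def parse_row(line):
    """Split one CSV line into (features, label, key)."""
    values = next(csv.reader(io.StringIO(line)))
    if len(values) != len(CSV_COLUMNS):
        raise ValueError("Expected {} columns, got {}".format(
            len(CSV_COLUMNS), len(values)))
    row = dict(zip(CSV_COLUMNS, values))
    key = row.pop('key')  # kept as a string; it is not a model input
    features = {name: float(value) for name, value in row.items()}
    label = features.pop(LABEL_COLUMN)
    return features, label, key


# First line of taxi-train.csv as printed above.
features, label, key = parse_row("12.0,-73.987625,40.750617,-73.971163,40.78518,1,0")
print(label, features, key)
```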


## Running the Python module from the command-line

#### Clean model training dir/output dir

In [14]:
%%bash
# Remove the previous output directory so that each training run starts fresh.
rm -rf $PWD/${TRAINING_DIR}

In [15]:
%%bash
# Setup python so it sees the task module which controls the model.py
export PYTHONPATH=${PYTHONPATH}:${PWD}/${MODEL_NAME}
# Currently set for python 2.  To run with python 3 
#    1.  Replace 'python' with 'python3' in the following command
#    2.  Edit trainer/task.py to reflect proper module import method 
python -m trainer.task \
   --train_data_paths="${PWD}/taxi-train*" \
   --eval_data_paths=${PWD}/taxi-valid.csv  \
   --output_dir=${PWD}/${TRAINING_DIR} \
   --train_steps=1000 --job-dir=./tmp

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/home/jupyter/ml-playground/notebook/training-data-analyst/03_tensorflow/taxi_trained', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Not using Distribute Coordinator
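
The flags passed to `trainer.task` above suggest an argument parser along the following lines. This is a sketch under assumptions, since `task.py` itself is not printed in this notebook: the flag names are taken from the command line above, while the defaults and help texts are invented for illustration. Note that `--job-dir` is accepted even though the trainer writes to `--output_dir`, because AI Platform forwards `--job-dir` to the training module when the flag is given to `gcloud`.

```python
import argparse


def build_parser():
    """Parser for the flags used in this notebook (defaults are assumptions)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_data_paths", required=True,
                        help="Path or glob pattern of the training CSV files")
    parser.add_argument("--eval_data_paths", required=True,
                        help="Path or glob pattern of the evaluation CSV files")
    parser.add_argument("--output_dir", required=True,
                        help="Where checkpoints and the exported model are written")
    parser.add_argument("--train_steps", type=int, default=1000,
                        help="Number of training steps")
    parser.add_argument("--job-dir", default=None,
                        help="Forwarded by AI Platform; accepted so parsing succeeds")
    return parser


args = build_parser().parse_args([
    "--train_data_paths=taxi-train*",
    "--eval_data_paths=taxi-valid.csv",
    "--output_dir=taxi_trained",
    "--train_steps=1000",
    "--job-dir=./tmp",
])
print(vars(args))
```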

In [27]:
%%bash
ls $PWD/${TRAINING_DIR}/export/exporter/

1593203162


In [41]:
# Use $$ for a literal $
#!echo "A system variable: $$HOME"
export_model_number=!ls $$PWD/$$TRAINING_DIR/export/exporter/
export_model_number = export_model_number[0]
export_model_number

'1593203162'
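
The cell above takes the first entry returned by `ls`, while a later cell uses `tail -1`; the two agree only when there is a single export. The exported directories are named with numeric timestamps, so one way to pick the most recent export explicitly is to take the numerically largest name. The helper `latest_export_dir` is a hypothetical sketch, shown here against a temporary directory.

```python
import os
import tempfile


def latest_export_dir(export_base):
    """Return the path of the export whose timestamp name is largest."""
    candidates = [name for name in os.listdir(export_base)
                  if name.isdigit()
                  and os.path.isdir(os.path.join(export_base, name))]
    if not candidates:
        raise FileNotFoundError("No exported models under " + export_base)
    return os.path.join(export_base, max(candidates, key=int))


# Demonstrate with two fake exports and one unrelated directory.
with tempfile.TemporaryDirectory() as base:
    for name in ("1593203162", "1593205175", "temp-123"):
        os.makedirs(os.path.join(base, name))
    demo_latest = os.path.basename(latest_export_dir(base))

print(demo_latest)
```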

In [43]:
!saved_model_cli show --dir $$PWD/$$TRAINING_DIR/export/exporter/{export_model_number} --all


MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['predict']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['dropofflat'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_2:0
    inputs['dropofflon'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_3:0
    inputs['passengers'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_4:0
    inputs['pickuplat'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_1:0
    inputs['pickuplon'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['predictions'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1)
        name: dnn/logits/BiasAdd:0
  Method name is: tensorflow/serving/predict


In [53]:
%%writefile ./test.json
{"pickuplon": -73.885262,"pickuplat": 40.773008,"dropofflon": -73.987232,"dropofflat": 40.732403,"passengers": 2}

Overwriting ./test.json
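
The `--json-instances` file holds one JSON object per line, and each object needs exactly the inputs listed in the `predict` signature above. As a minimal sketch, the helper below (the name `to_json_lines` is hypothetical) checks the keys and produces the newline-delimited text; the second instance is made up purely to show the multi-line case.

```python
import json

# Input names copied from the saved_model_cli output above.
SIGNATURE_INPUTS = {"pickuplon", "pickuplat", "dropofflon", "dropofflat", "passengers"}


def to_json_lines(instances):
    """Serialize instances as one JSON object per line, checking the keys."""
    lines = []
    for index, instance in enumerate(instances):
        if set(instance) != SIGNATURE_INPUTS:
            raise ValueError("Instance {} has keys {}, expected {}".format(
                index, sorted(instance), sorted(SIGNATURE_INPUTS)))
        lines.append(json.dumps(instance))
    return "\n".join(lines) + "\n"


instances = [
    {"pickuplon": -73.885262, "pickuplat": 40.773008,
     "dropofflon": -73.987232, "dropofflat": 40.732403, "passengers": 2},
    # Made-up second trip, for illustration only.
    {"pickuplon": -73.98, "pickuplat": 40.75,
     "dropofflon": -73.97, "dropofflat": 40.78, "passengers": 1},
]
text = to_json_lines(instances)
print(text, end="")
```

Writing `text` to `./test.json` would then give a file with two instances instead of one.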


In [54]:
%%bash
sudo find "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/command_lib/ml_engine" -name '*.pyc' -delete

In [55]:
%%bash
# This model dir is the model exported after training and is used for prediction
#
model_dir=$(ls ${PWD}/${TRAINING_DIR}/export/exporter | tail -1)
# predict using the trained model
gcloud ai-platform local predict  \
    --model-dir=${PWD}/${TRAINING_DIR}/export/exporter/${model_dir} \
    --json-instances=./test.json

PREDICTIONS
[11.127461433410645]


If the signature defined in the model is not serving_default then you must specify it via --signature-name flag, otherwise the command may fail.
Instructions for updating:
non-resource variables are not supported in the long term
2020-06-26 21:01:49.887830: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200000000 Hz
2020-06-26 21:01:49.888099: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55f919a4e150 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-26 21:01:49.888127: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-26 21:01:49.888386: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.l

#### Clean model training dir/output dir

In [50]:
%%bash
# Remove the previous output directory so that each training run starts fresh.
rm -rf $PWD/${TRAINING_DIR}

## Running locally using gcloud

In [51]:
%%bash
# Use gcloud to run the training package locally, on the local file system
gcloud ai-platform local train \
   --module-name=trainer.task \
   --package-path=${PWD}/${MODEL_NAME}/trainer \
   -- \
   --train_data_paths=${PWD}/taxi-train.csv \
   --eval_data_paths=${PWD}/taxi-valid.csv  \
   --train_steps=1000 \
   --output_dir=${PWD}/${TRAINING_DIR} 

INFO:tensorflow:TF_CONFIG environment variable: {'environment': 'cloud', 'cluster': {}, 'job': {'args': ['--train_data_paths=/home/jupyter/ml-playground/notebook/training-data-analyst/03_tensorflow/taxi-train.csv', '--eval_data_paths=/home/jupyter/ml-playground/notebook/training-data-analyst/03_tensorflow/taxi-valid.csv', '--train_steps=1000', '--output_dir=/home/jupyter/ml-playground/notebook/training-data-analyst/03_tensorflow/taxi_trained'], 'job_name': 'trainer.task'}, 'task': {}}
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/home/jupyter/ml-playground/notebook/training-data-analyst/03_tensorflow/taxi_trained', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100

In [52]:
%%bash
ls $PWD/${TRAINING_DIR}

checkpoint
eval
events.out.tfevents.1593205175.tensorflow-2-1-20200626-221646
export
graph.pbtxt
model.ckpt-0.data-00000-of-00001
model.ckpt-0.index
model.ckpt-0.meta
model.ckpt-1000.data-00000-of-00001
model.ckpt-1000.index
model.ckpt-1000.meta


## Submit training job using gcloud

First copy the training data to the cloud.  Then, launch a training job.

After you submit the job, go to the cloud console (http://console.cloud.google.com) and select <b>AI Platform | Jobs</b> to monitor progress.  

<b>Note:</b> Don't be concerned if the notebook stalls (with a blue progress bar) or returns with an error about being unable to refresh auth tokens. This is a long-lived Cloud job and work is going on in the cloud.  Use the Cloud Console link (above) to monitor the job.

In [56]:
%%bash
# Clear Cloud Storage bucket and copy the CSV files to Cloud Storage bucket
echo $BUCKET
gsutil -m rm -rf gs://${BUCKET}/${MODEL_NAME}/smallinput/
gsutil -m cp ${PWD}/*.csv gs://${BUCKET}/${MODEL_NAME}/smallinput/

qwiklabs-gcp-00-5abd7c7e843f


CommandException: 1 files/objects could not be removed.
Copying file:///home/jupyter/ml-playground/notebook/training-data-analyst/03_tensorflow/taxi-test.csv [Content-Type=text/csv]...
Copying file:///home/jupyter/ml-playground/notebook/training-data-analyst/03_tensorflow/taxi-train.csv [Content-Type=text/csv]...
Copying file:///home/jupyter/ml-playground/notebook/training-data-analyst/03_tensorflow/taxi-valid.csv [Content-Type=text/csv]...
/ [3/3 files][580.6 KiB/580.6 KiB] 100% Done                                    
Operation completed over 3 objects/580.6 KiB.                                    


In [57]:
%%bash
OUTDIR=gs://${BUCKET}/${MODEL_NAME}/smallinput/${TRAINING_DIR}
JOBNAME=${MODEL_NAME}_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
# Clear the Cloud Storage Bucket used for the training job
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/${MODEL_NAME}/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=BASIC \
   --runtime-version 2.1 \
   --python-version 3.5 \
   -- \
   --train_data_paths="gs://${BUCKET}/${MODEL_NAME}/smallinput/taxi-train*" \
   --eval_data_paths="gs://${BUCKET}/${MODEL_NAME}/smallinput/taxi-valid*"  \
   --output_dir=$OUTDIR \
   --train_steps=10000

gs://qwiklabs-gcp-00-5abd7c7e843f/taxifare/smallinput/taxi_trained europe-west2 taxifare_200626_210337
jobId: taxifare_200626_210337
state: QUEUED


CommandException: 1 files/objects could not be removed.
Job [taxifare_200626_210337] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe taxifare_200626_210337

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs taxifare_200626_210337
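
The job name in the cell above is the model name followed by a UTC timestamp, which keeps job names unique across submissions. The same construction in Python could look like the sketch below. The function `make_job_name` is hypothetical, and the validation pattern (start with a letter, then letters, digits and underscores) reflects my understanding of the AI Platform job ID rules rather than anything stated in this notebook.

```python
import re
from datetime import datetime, timezone


def make_job_name(model_name, now=None):
    """Mirror of ${MODEL_NAME}_$(date -u +%y%m%d_%H%M%S)."""
    now = now or datetime.now(timezone.utc)
    name = "{}_{}".format(model_name, now.strftime("%y%m%d_%H%M%S"))
    # Assumed job ID rule: a letter first, then letters, digits or underscores.
    if not re.match(r"^[A-Za-z][A-Za-z0-9_]*$", name):
        raise ValueError("Invalid job name: {!r}".format(name))
    return name


# Reproduce the job ID shown in the output above from its submission time.
submitted = datetime(2020, 6, 26, 21, 3, 37, tzinfo=timezone.utc)
print(make_job_name("taxifare", submitted))
```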


Don't be concerned if the notebook appears stalled (with a blue progress bar) or returns with an error about being unable to refresh auth tokens. This is a long-lived Cloud job and work is going on in the cloud. 

<b>Use the Cloud Console link to monitor the job and do NOT proceed until the job is done.</b>

In [58]:
%%bash
gsutil ls gs://${BUCKET}/${MODEL_NAME}/smallinput

gs://qwiklabs-gcp-00-5abd7c7e843f/taxifare/smallinput/taxi-test.csv
gs://qwiklabs-gcp-00-5abd7c7e843f/taxifare/smallinput/taxi-train.csv
gs://qwiklabs-gcp-00-5abd7c7e843f/taxifare/smallinput/taxi-valid.csv
gs://qwiklabs-gcp-00-5abd7c7e843f/taxifare/smallinput/taxi_trained/


## Train on larger dataset

I have already followed the steps below and the files are already available. <b> You don't need to do the steps in this comment. </b> In the next chapter (on feature engineering), we will avoid all this manual processing by using Cloud Dataflow.

Go to http://bigquery.cloud.google.com/ and type the query:
<pre>
SELECT
  (tolls_amount + fare_amount) AS fare_amount,
  pickup_longitude AS pickuplon,
  pickup_latitude AS pickuplat,
  dropoff_longitude AS dropofflon,
  dropoff_latitude AS dropofflat,
  passenger_count*1.0 AS passengers,
  'nokeyindata' AS key
FROM
  [nyc-tlc:yellow.trips]
WHERE
  trip_distance > 0
  AND fare_amount >= 2.5
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  AND passenger_count > 0
  AND ABS(HASH(pickup_datetime)) % 1000 == 1
</pre>

Note that this is now 1,000,000 rows (i.e. 100x the original dataset).  Export this to CSV using the following steps (Note that <b>I have already done this and made the resulting GCS data publicly available</b>, so you don't need to do it.):
<ol>
<li> Click on the "Save As Table" button and note down the name of the dataset and table.
<li> On the BigQuery console, find the newly exported table in the left-hand-side menu, and click on the name.
<li> Click on "Export Table"
<li> Supply your bucket name and give it the name train.csv (for example: gs://cloud-training-demos-ml/taxifare/ch3/train.csv). Note down what this is.  Wait for the job to finish (look at the "Job History" on the left-hand-side menu)
<li> In the query above, change the final "== 1" to "== 2" and export this to Cloud Storage as valid.csv (e.g.  gs://cloud-training-demos-ml/taxifare/ch3/valid.csv)
<li> Download the two files, remove the header line and upload it back to GCS.
</ol>

<p/>
<p/>

## Run Cloud training on 1-million row dataset

This run took 60 minutes and used 1 million rows as input. The model is exactly the same as above. The only changes are to the input (the larger dataset) and to the Cloud MLE scale tier (STANDARD_1 instead of BASIC; STANDARD_1 is approximately 10x more powerful than BASIC). At the end of training the loss was 32, but the RMSE (calculated on the validation dataset) was stubbornly at 9.03. So simply adding more data doesn't help.
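
For reference, RMSE is the square root of the mean squared difference between predicted and actual fares, so it is expressed in the same units as the fare itself. The sketch below computes it in plain Python; the function name `rmse` is hypothetical and the predictions are made-up numbers, not output from the model.

```python
import math


def rmse(labels, predictions):
    """Root-mean-square error between two equal-length sequences."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal in length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Actual fares from the two CSV heads shown earlier, with made-up predictions.
print(rmse([12.0, 6.0], [9.0, 10.0]))
```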

In [66]:
def create_query(phase, sample_size):
    basequery = """
    SELECT
      (tolls_amount + fare_amount) AS fare_amount,
      pickup_longitude AS pickuplon,
      pickup_latitude AS pickuplat,
      dropoff_longitude AS dropofflon,
      dropoff_latitude AS dropofflat,
      passenger_count*1.0 AS passengers,
      'nokeyindata' AS key
    FROM
        `nyc-tlc.yellow.trips`
    WHERE
        trip_distance > 0
        AND fare_amount >= 2.5
        AND pickup_longitude > -78
        AND pickup_longitude < -70
        AND dropoff_longitude > -78
        AND dropoff_longitude < -70
        AND pickup_latitude > 37
        AND pickup_latitude < 45
        AND dropoff_latitude > 37
        AND dropoff_latitude < 45
        AND passenger_count > 0
        AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N)) = 1
    """

    if phase == "TRAIN":
        subsample = """
        AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) >= (EVERY_N * 0)
        AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) <  (EVERY_N * 70)
        """
    elif phase == "VALID":
        subsample = """
        AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) >= (EVERY_N * 70)
        AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) <  (EVERY_N * 85)
        """
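    elif phase == "TEST":
        subsample = """
        AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) >= (EVERY_N * 85)
        AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) <  (EVERY_N * 100)
        """
    else:
        # Fail early instead of hitting an unbound `subsample` below
        raise ValueError("phase must be TRAIN, VALID or TEST, got {!r}".format(phase))

    query = basequery + subsample
    # Accept sample_size as either an int or a str
    return query.replace("EVERY_N", str(sample_size))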
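
The filtering logic in `create_query` can be sketched in plain Python to make the split easier to see: a row is kept only if its hash is 1 modulo `EVERY_N`, and the position of the hash within a window of `EVERY_N * 100` then decides the phase (70% train, 15% validation, 15% test). The functions below are hypothetical, and `stable_hash` uses SHA-256 as a stand-in; it is not `FARM_FINGERPRINT` and will not select the same rows as BigQuery.

```python
import hashlib


def stable_hash(key):
    """Stand-in for FARM_FINGERPRINT: a deterministic non-negative integer."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def phase_for_hash(h, every_n):
    """Apply the same conditions as the SQL to a non-negative hash value."""
    if h % every_n != 1:
        return None  # row is not part of the sample
    bucket = (h % (every_n * 100)) // every_n  # 0..99
    if bucket < 70:
        return "TRAIN"
    if bucket < 85:
        return "VALID"
    return "TEST"


# Made-up timestamps: the same key always lands in the same phase.
for key in ("2015-01-01 00:00:00", "2015-01-01 00:00:01"):
    print(key, phase_for_hash(stable_hash(key), 10))
```

Because the phase depends only on the hashed key, rerunning the queries reproduces the same split, and all rows sharing a `pickup_datetime` fall into the same phase.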

In [65]:
def create_query_not_used(phase, sample_size):
    basequery = """
    SELECT
        (tolls_amount + fare_amount) AS fare_amount,
        EXTRACT(DAYOFWEEK from pickup_datetime) AS dayofweek,
        EXTRACT(HOUR from pickup_datetime) AS hourofday,
        pickup_longitude AS pickuplon,
        pickup_latitude AS pickuplat,
        dropoff_longitude AS dropofflon,
        dropoff_latitude AS dropofflat
    FROM
        `nyc-tlc.yellow.trips`
    WHERE
        trip_distance > 0
        AND fare_amount >= 2.5
        AND pickup_longitude > -78
        AND pickup_longitude < -70
        AND dropoff_longitude > -78
        AND dropoff_longitude < -70
        AND pickup_latitude > 37
        AND pickup_latitude < 45
        AND dropoff_latitude > 37
        AND dropoff_latitude < 45
        AND passenger_count > 0
        AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N)) = 1
    """

    if phase == "TRAIN":
        subsample = """
        AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) >= (EVERY_N * 0)
        AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) <  (EVERY_N * 70)
        """
    elif phase == "VALID":
        subsample = """
        AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) >= (EVERY_N * 70)
        AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) <  (EVERY_N * 85)
        """
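    elif phase == "TEST":
        subsample = """
        AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) >= (EVERY_N * 85)
        AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) <  (EVERY_N * 100)
        """
    else:
        # Fail early instead of hitting an unbound `subsample` below
        raise ValueError("phase must be TRAIN, VALID or TEST, got {!r}".format(phase))

    query = basequery + subsample
    # Accept sample_size as either an int or a str
    return query.replace("EVERY_N", str(sample_size))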

In [67]:
from google.cloud import bigquery
bq = bigquery.Client(project=PROJECT)

for phase in ["TRAIN", "VALID", "TEST"]:
    # 1. Create query string
    query_string = create_query(phase, "100000")
    # 2. Load results into DataFrame
    df = bq.query(query_string).to_dataframe()

    # 3. Write DataFrame to CSV
    df.to_csv("taxi-big-{}.csv".format(phase.lower()), index_label = False, index = False)
    print("Wrote {} lines to {}".format(len(df), "taxi-big-{}.csv".format(phase.lower())))

Wrote 7645 lines to taxi-big-train.csv
Wrote 1814 lines to taxi-big-valid.csv
Wrote 1017 lines to taxi-big-test.csv


In [68]:
%%bash
# Clear Cloud Storage bucket and copy the CSV files to Cloud Storage bucket
echo $BUCKET
gsutil -m rm -rf gs://${BUCKET}/${MODEL_NAME}/biginput/
gsutil -m cp ${PWD}/taxi-big*.csv gs://${BUCKET}/${MODEL_NAME}/biginput/

qwiklabs-gcp-00-5abd7c7e843f


CommandException: 1 files/objects could not be removed.
Copying file:///home/jupyter/ml-playground/notebook/training-data-analyst/03_tensorflow/taxi-big-test.csv [Content-Type=text/csv]...
Copying file:///home/jupyter/ml-playground/notebook/training-data-analyst/03_tensorflow/taxi-big-train.csv [Content-Type=text/csv]...
Copying file:///home/jupyter/ml-playground/notebook/training-data-analyst/03_tensorflow/taxi-big-valid.csv [Content-Type=text/csv]...
/ [3/3 files][656.4 KiB/656.4 KiB] 100% Done                                    
Operation completed over 3 objects/656.4 KiB.                                    


In [60]:
from google.cloud import bigquery
bq = bigquery.Client(project=PROJECT)

query_string = """
#standardSQL
SELECT
  (tolls_amount + fare_amount) AS fare_amount,
  pickup_longitude AS pickuplon,
  pickup_latitude AS pickuplat,
  dropoff_longitude AS dropofflon,
  dropoff_latitude AS dropofflat,
  passenger_count*1.0 AS passengers,
  'nokeyindata' AS key
FROM
  `nyc-tlc.yellow.trips`
WHERE
  trip_distance > 0
  AND fare_amount >= 2.5
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  AND passenger_count > 0
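  -- HASH, % and == are legacy SQL; this is the standard SQL form used elsewhere in this notebook
  AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 1000)) = 1
"""

# Client.query takes the SQL string as its first argument
trips = bq.query(query_string).to_dataframe()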



In [63]:
trips.describe()
trips.to_csv(header=False)

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,trip_distance,fare_amount,extra,mta_tax,imp_surcharge,tip_amount,tolls_amount,total_amount
count,224010.0,224010.0,224008.0,224008.0,224010.0,224010.0,224010.0,224010.0,193917.0,15103.0,224010.0,224010.0,224010.0
mean,-72.567277,39.975019,-72.5686,39.953818,1.687076,2.830173,11.127325,0.287398,0.496207,0.297239,1.092133,0.215548,13.173473
std,17.184744,11.531709,14.173995,12.181259,1.317762,3.30788,9.09585,0.348067,0.047469,0.028856,1.951176,1.087451,10.879476
min,-3327.388155,-2108.147765,-2084.46887,-2587.703973,0.0,0.0,-52.0,-0.5,-1.0,-0.3,0.0,0.0,-52.8
25%,-73.992087,40.735032,-73.991471,40.734159,1.0,1.02,6.0,0.0,0.5,0.3,0.0,0.0,7.15
50%,-73.981826,40.752635,-73.98022,40.75311,1.0,1.75,8.5,0.0,0.5,0.3,0.0,0.0,10.0
75%,-73.967239,40.76714,-73.963937,40.768102,2.0,3.14,12.5,0.5,0.5,0.3,1.72,0.0,14.6
max,3442.185068,2614.663005,3442.185068,2958.581502,49.0,97.3,412.64,1.5,0.5,0.3,100.0,23.5,412.64


In [None]:
%%bash

OUTDIR=gs://${BUCKET}/${MODEL_NAME}/${TRAINING_DIR}
JOBNAME=${MODEL_NAME}_$(date -u +%y%m%d_%H%M%S)
CRS_BUCKET=cloud-training-demos # use the already exported data
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/${MODEL_NAME}/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=STANDARD_1 \
   --runtime-version 2.1 \
   --python-version 3.5 \
   -- \
   --train_data_paths="gs://${CRS_BUCKET}/${MODEL_NAME}/ch3/train.csv" \
   --eval_data_paths="gs://${CRS_BUCKET}/${MODEL_NAME}/ch3/valid.csv"  \
   --output_dir=$OUTDIR \
   --train_steps=100000

In [69]:
%%bash
OUTDIR=gs://${BUCKET}/${MODEL_NAME}/biginput/${TRAINING_DIR}
JOBNAME=${MODEL_NAME}_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
# Clear the Cloud Storage Bucket used for the training job
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/${MODEL_NAME}/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=STANDARD_1 \
   --runtime-version 2.1 \
   --python-version 3.5 \
   -- \
   --train_data_paths="gs://${BUCKET}/${MODEL_NAME}/biginput/taxi-big-train*" \
   --eval_data_paths="gs://${BUCKET}/${MODEL_NAME}/biginput/taxi-big-valid*"  \
   --output_dir=$OUTDIR \
   --train_steps=2000

gs://qwiklabs-gcp-00-5abd7c7e843f/taxifare/biginput/taxi_trained europe-west2 taxifare_200626_215447
jobId: taxifare_200626_215447
state: QUEUED


CommandException: 1 files/objects could not be removed.
Job [taxifare_200626_215447] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe taxifare_200626_215447

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs taxifare_200626_215447


## Challenge Exercise

Modify your solution to the challenge exercise in d_trainandevaluate.ipynb appropriately. Make sure that you implement training and deployment. Increase the size of your dataset by 10x since you are running on the cloud. Does your accuracy improve?

Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License