<h1> Structured data prediction using Cloud ML Engine with scikit-learn </h1>

This notebook illustrates:
<ol>
<li> Creating datasets for Machine Learning using BigQuery
<li> Creating a model using scikit-learn
<li> Training on Cloud ML Engine
<li> Deploying the model
<li> Predicting with the model
</ol>

In [57]:
# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
REGION = 'us-central1'

In [79]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

In [3]:
%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


In [4]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi

## Exploring dataset

Please see [this notebook](../babyweight/babyweight.ipynb) for more context on this problem and how the features were chosen.

In [76]:
#%writefile babyweight/trainer/model.py

# Copyright 2018 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<h2> Creating a ML dataset using BigQuery </h2>

We can use BigQuery to create the training and evaluation datasets. Because of the masking (each record is emitted twice, once with the fields an ultrasound would reveal and once with them hidden), the query itself is a little complex.
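To make the masking concrete, here is a minimal pure-Python sketch (not part of the original notebook) of what the query does to a single record. It emits one copy with the ultrasound-derived fields intact and one copy where the baby's sex is `'Unknown'` and plurality is collapsed to `'Single'` or `'Multiple'`. The helper name `mask_record` and the sample record are illustrative assumptions.

```python
def mask_record(rec):
    """Return the (with_ultrasound, without_ultrasound) variants of one record."""
    with_ultrasound = {
        'label': rec['weight_pounds'],
        'is_male': str(rec['is_male']).lower(),  # BigQuery casts BOOL to 'true'/'false'
        'mother_age': rec['mother_age'],
        'plurality': str(rec['plurality']),
        'gestation_weeks': rec['gestation_weeks'],
    }
    # Without an ultrasound, the sex is unknown and only single vs. multiple is known
    without_ultrasound = dict(with_ultrasound)
    without_ultrasound['is_male'] = 'Unknown'
    without_ultrasound['plurality'] = 'Multiple' if rec['plurality'] > 1 else 'Single'
    return with_ultrasound, without_ultrasound

# Hypothetical record, for illustration only
record = {'weight_pounds': 7.5, 'is_male': True, 'mother_age': 28,
          'plurality': 2, 'gestation_weeks': 39}
known, masked = mask_record(record)
print(known)
print(masked)
```

Training on both variants lets a single model serve requests whether or not ultrasound information is available.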

In [24]:
#%writefile -a babyweight/trainer/model.py
def create_queries():
  query_all = """
  WITH with_ultrasound AS (
    SELECT
      weight_pounds AS label,
      CAST(is_male AS STRING) AS is_male,
      mother_age,
      CAST(plurality AS STRING) AS plurality,
      gestation_weeks,
      ABS(FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING)))) AS hashmonth
    FROM
      publicdata.samples.natality
    WHERE
      year > 2000
      AND gestation_weeks > 0
      AND mother_age > 0
      AND plurality > 0
      AND weight_pounds > 0
  ),

  without_ultrasound AS (
    SELECT
      weight_pounds AS label,
      'Unknown' AS is_male,
      mother_age,
      IF(plurality > 1, 'Multiple', 'Single') AS plurality,
      gestation_weeks,
      ABS(FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING)))) AS hashmonth
    FROM
      publicdata.samples.natality
    WHERE
      year > 2000
      AND gestation_weeks > 0
      AND mother_age > 0
      AND plurality > 0
      AND weight_pounds > 0
  ),

  preprocessed AS (
    SELECT * from with_ultrasound
    UNION ALL
    SELECT * from without_ultrasound
  )

  SELECT
      label,
      is_male,
      mother_age,
      plurality,
      gestation_weeks
  FROM
      preprocessed
  """

  train_query = "{} WHERE MOD(hashmonth, 4) < 3".format(query_all)
  eval_query  = "{} WHERE MOD(hashmonth, 4) = 3".format(query_all)
  return train_query, eval_query
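The split keys on a hash of year and month rather than on a random number, so every row from a given month lands in the same split and the split is repeatable across runs. The sketch below illustrates the idea in plain Python. It uses `hashlib.md5` as a stand-in for BigQuery's `FARM_FINGERPRINT`, so the bucket assigned to any particular month will differ from BigQuery's; `split_for` is a hypothetical helper.

```python
import hashlib

def split_for(year, month, buckets=4):
    """Assign a (year, month) pair to 'train' or 'eval' deterministically."""
    key = '{}{}'.format(year, month).encode('utf-8')   # mirrors CONCAT(year, month)
    hashmonth = int(hashlib.md5(key).hexdigest(), 16)  # stand-in for FARM_FINGERPRINT
    # Buckets 0, 1 and 2 go to training, bucket 3 to evaluation
    return 'train' if hashmonth % buckets < 3 else 'eval'

assignments = {(y, m): split_for(y, m)
               for y in range(2001, 2009) for m in range(1, 13)}
n_train = sum(v == 'train' for v in assignments.values())
print('{} of {} months go to training'.format(n_train, len(assignments)))
```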

In [23]:
print(create_queries()[0])


  WITH with_ultrasound AS (
    SELECT
      weight_pounds AS label,
      CAST(is_male AS STRING) AS is_male,
      mother_age,
      CAST(plurality AS STRING) AS plurality,
      gestation_weeks,
      ABS(FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING)))) AS hashmonth
    FROM
      publicdata.samples.natality
    WHERE
      year > 2000
      AND gestation_weeks > 0
      AND mother_age > 0
      AND plurality > 0
      AND weight_pounds > 0
  ),

  without_ultrasound AS (
    SELECT
      weight_pounds AS label,
      'Unknown' AS is_male,
      mother_age,
      IF(plurality > 1, 'Multiple', 'Single') AS plurality,
      gestation_weeks,
      ABS(FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING)))) AS hashmonth
    FROM
      publicdata.samples.natality
    WHERE
      year > 2000
      AND gestation_weeks > 0
      AND mother_age > 0
      AND plurality > 0
      AND weight_pounds > 0
  ),

  preprocessed AS (
    SELECT * from with_ultras

In [25]:
#%writefile -a babyweight/trainer/model.py
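def create_dataframes(frac=None):
  # optionally sample a small fraction of the data for quick tests;
  # frac=None (or any value outside (0, 1)) uses the full dataset
  if frac is not None and 0 < frac < 1:
    sample = " AND RAND() < {}".format(frac)
  else:
    sample = ""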

  train_query, eval_query = create_queries()
  train_query = "{} {}".format(train_query, sample)
  eval_query =  "{} {}".format(eval_query, sample)
  
  import google.datalab.bigquery as bq
  train_df = bq.Query(train_query).execute().result().to_dataframe()
  eval_df  = bq.Query(eval_query).execute().result().to_dataframe()
  return train_df, eval_df
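Because `train_query` and `eval_query` already end in a `WHERE` clause, the sampling condition can simply be appended with `AND`. A minimal sketch of that string composition, using a shortened placeholder query rather than the real one (`add_sampling` is a hypothetical helper):

```python
def add_sampling(query, frac=None):
    """Append a random-sampling condition to a query that already ends in a WHERE clause."""
    if frac is not None and 0 < frac < 1:
        return "{} AND RAND() < {}".format(query, frac)
    return query  # no sampling: use the full dataset

# Placeholder query, for illustration only
base = "SELECT label FROM preprocessed WHERE MOD(hashmonth, 4) < 3"
print(add_sampling(base, 0.001))
print(add_sampling(base))
```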

In [20]:
train_df, eval_df = create_dataframes(0.001)
train_df.describe()

           label  mother_age  gestation_weeks
count    53092.0     53092.0          53092.0
mean    7.227666   27.369905        38.615592
std     1.312254    6.206432         2.537291
min     0.500449        12.0             17.0
25%     6.563162        22.0             38.0
50%     7.312733        27.0             39.0
75%     8.046873        32.0             40.0
max    13.776687        50.0             47.0


In [21]:
eval_df.head()

      label  is_male  mother_age  plurality  gestation_weeks
0  0.873031    false          23          1               17
1  0.813506  Unknown          19     Single               19
2  0.619499    false          19          2               19
3  0.500449  Unknown          30     Single               19
4  0.626113     true          18          1               20


<h2> Creating a scikit-learn model using random forests </h2>

Let's train the model locally on the small sample before moving to Cloud ML Engine.

In [90]:
#%writefile -a babyweight/trainer/model.py
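def input_fn(indf):
  import copy
  import pandas as pd
  df = copy.deepcopy(indf)  # work on a copy so the caller's dataframe is untouched

  # declare the full set of categories up front so that get_dummies
  # always yields the same one-hot columns, whatever the sample contains
  df["plurality"] = df["plurality"].astype(pd.api.types.CategoricalDtype(
                    categories=["Single","Multiple","1","2","3","4","5"]))
  # NOTE: the query emits is_male as 'true'/'false'/'Unknown'; values that are
  # not listed in the categories become NaN and are encoded as all zeros
  df["is_male"] = df["is_male"].astype(pd.api.types.CategoricalDtype(
                  categories=["Unknown","0","1"]))
  # features, label
  label = df['label']
  del df['label']
  features = pd.get_dummies(df)  # one-hot encode the categorical columns
  return features, label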
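Declaring the categories up front means `pd.get_dummies` always produces the same columns in the same order, even when a sample happens to lack some category. It also means that any value outside the declared list becomes missing and is encoded as all zeros. The toy frame below (made-up values) shows both effects:

```python
import pandas as pd

# Made-up values for illustration; 'true' is not among the declared categories
toy = pd.DataFrame({'is_male': ['Unknown', 'true'], 'mother_age': [23, 35]})
toy['is_male'] = toy['is_male'].astype(
    pd.api.types.CategoricalDtype(categories=['Unknown', '0', '1']))

features = pd.get_dummies(toy)
print(list(features.columns))
print(features.astype(int))  # cast so the indicators print as 0/1 on any pandas version
```

The second row comes out with zeros in all three `is_male_*` columns, which is worth keeping in mind when the strings in the data and the declared categories do not match exactly.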

In [91]:
train_x, train_y = input_fn(train_df)
print(train_x[:5])
print(train_y[:5])

   mother_age  gestation_weeks  is_male_Unknown  is_male_0  is_male_1  \
0          23               17                1          0          0   
1          35               17                0          0          0   
2          37               17                1          0          0   
3          20               18                1          0          0   
4          26               18                0          0          0   

   plurality_Single  plurality_Multiple  plurality_1  plurality_2  \
0                 1                   0            0            0   
1                 0                   0            1            0   
2                 0                   1            0            0   
3                 1                   0            0            0   
4                 0                   0            1            0   

   plurality_3  plurality_4  plurality_5  
0            0            0            0  
1            0            0            0  
2            0   

In [48]:
from sklearn.ensemble import RandomForestRegressor
estimator = RandomForestRegressor(max_depth=5, n_estimators=100, random_state=0)
estimator.fit(train_x, train_y)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=5,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=0, verbose=0, warm_start=False)

In [49]:
#print(estimator.feature_importances_)
eval_x, eval_y = input_fn(eval_df)
print(estimator.predict(eval_x)[1000:1005])
print(eval_y[1000:1005])
print(estimator.score(eval_x, eval_y))

[5.25707439 6.27823851 5.21946407 6.27823851 5.25707439]
1000    5.187477
1001    3.937456
1002    5.749656
1003    7.813183
1004    4.874421
Name: label, dtype: float64
0.3806785037567018
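The number printed by `estimator.score` is the coefficient of determination, R², which compares the model's squared error with that of always predicting the mean label. A minimal sketch of the same computation in plain Python, on made-up numbers (`r_squared` is a hypothetical helper, not the scikit-learn implementation):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / float(len(y_true))
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model's squared error
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # squared error of the mean
    return 1.0 - ss_res / ss_tot

# Made-up labels and predictions, for illustration only
y_true = [6.0, 7.0, 8.0, 9.0]
print(r_squared(y_true, y_true))      # perfect predictions -> 1.0
print(r_squared(y_true, [7.5] * 4))   # always predicting the mean -> 0.0
```

Read this way, a score of about 0.38 means the forest explains roughly 38% of the variance in birth weight on the evaluation sample.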




In [52]:
#%writefile -a babyweight/trainer/model.py
def train_and_evaluate(frac=0.001):
  # get data
  train_df, eval_df = create_dataframes(frac)
  train_x, train_y = input_fn(train_df)
  # train
  from sklearn.ensemble import RandomForestRegressor
  estimator = RandomForestRegressor(max_depth=5, n_estimators=100, random_state=0)
  estimator.fit(train_x, train_y)
  # evaluate
  eval_x, eval_y = input_fn(eval_df)
  print("Eval score={}".format(estimator.score(eval_x, eval_y)))
  return estimator

In [72]:
#%writefile -a babyweight/trainer/model.py
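def save_model(estimator, gcspath, name):
  try:
    from sklearn.externals import joblib  # bundled with older scikit-learn releases
  except ImportError:
    import joblib  # standalone package, for releases without sklearn.externals.joblib
  import os, subprocess, datetime
  # serialize locally, then copy to a timestamped folder on Cloud Storage
  model = '{}.joblib'.format(name)
  joblib.dump(estimator, model)
  model_path = os.path.join(gcspath, datetime.datetime.now().strftime(
    'export_%Y%m%d_%H%M%S'), model)
  subprocess.check_call(['gsutil', 'cp', model, model_path])
  return model_path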

In [69]:
saved = save_model(estimator, 'gs://{}/babyweight/sklearn'.format(BUCKET), 'babyweight')

In [70]:
print(saved)

gs://cloud-training-demos-ml/babyweight/sklearn/export_20180524_233356/babyweight.joblib
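Each call to `save_model` writes to a fresh timestamped folder, so earlier exports are not overwritten, and sorting the folder names alphabetically also sorts them chronologically. A sketch of just the path logic, with a hypothetical helper `export_path` and with `posixpath.join` substituted for `os.path.join` so that the result uses forward slashes on any platform (the bucket name is a placeholder):

```python
import datetime
import posixpath

def export_path(gcspath, name, now=None):
    """Build a timestamped destination like the one save_model returns."""
    now = now or datetime.datetime.now()
    folder = now.strftime('export_%Y%m%d_%H%M%S')
    return posixpath.join(gcspath, folder, '{}.joblib'.format(name))

print(export_path('gs://my-bucket/babyweight/sklearn', 'babyweight',
                  datetime.datetime(2018, 5, 24, 23, 33, 56)))
```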


## Packaging up as a Python package

Note the commented-out %writefile lines at the top of the cells above. I uncommented those and ran the cells to write out model.py.
The following cell writes out task.py.
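For `python -m trainer.task` and the training service to find the code, `trainer` has to be an importable package. Here is a minimal sketch of the layout, built in a temporary directory so that it is safe to run anywhere. The `__init__.py` file is an assumption on my part: the cells in this notebook do not show it, but Python 2.7 needs one to treat a directory as a package. `make_skeleton` is a hypothetical helper.

```python
import os
import tempfile

def make_skeleton(root):
    """Create an empty babyweight/trainer package skeleton under root."""
    trainer = os.path.join(root, 'babyweight', 'trainer')
    os.makedirs(trainer)
    for fname in ['__init__.py', 'model.py', 'task.py']:
        # Empty placeholders; the real contents come from the %writefile cells
        open(os.path.join(trainer, fname), 'w').close()
    return trainer

trainer_dir = make_skeleton(tempfile.mkdtemp())
print(sorted(os.listdir(trainer_dir)))
```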

In [83]:
%writefile babyweight/trainer/task.py
# Copyright 2018 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os

import model

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--bucket',
        help = 'GCS path to output.',
        required = True
    )
    parser.add_argument(
        '--frac',
        help = 'Fraction of input to process',
        type = float,
        required = True
    )
    parser.add_argument(
        '--job-dir',
        help = 'this model ignores this field, but it is required by gcloud',
        default = 'junk'
    )
    
    args = parser.parse_args()
    arguments = args.__dict__
    estimator = model.train_and_evaluate(arguments['frac'])
    loc = model.save_model(estimator, 
                           'gs://{}/babyweight/sklearn'.format(arguments['bucket']), 'babyweight')
    print("Saved model to {}".format(loc))

# done

Overwriting trainer/task.py
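Note that argparse turns dashes in flag names into underscores, so `--job-dir` is read back as `job_dir`. Here is a trimmed-down sketch of the task.py parser that can be exercised without touching BigQuery or Cloud Storage (`build_parser` is a hypothetical helper and the argument values are placeholders):

```python
import argparse

def build_parser():
    """Recreate the three command-line flags that task.py accepts."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--bucket', required=True, help='GCS bucket for output')
    parser.add_argument('--frac', type=float, required=True,
                        help='Fraction of input to process')
    parser.add_argument('--job-dir', default='junk',
                        help='Unused by the model, but passed by gcloud')
    return parser

args = build_parser().parse_args(['--bucket=my-bucket', '--frac=0.001', '--job-dir=./tmp'])
arguments = args.__dict__
print(arguments['bucket'], arguments['frac'], arguments['job_dir'])
```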


In [89]:
%writefile babyweight/trainer/setup.py
# Copyright 2018 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from setuptools import setup

setup(name='babyweight',
      version='1.0',
      description='Natality, with sklearn',
      url='http://github.com/GoogleCloudPlatform/training-data-analyst',
      author='Google',
      author_email='nobody@google.com',
      license='Apache2',
      packages=['babyweight'],
      install_requires=[
          'pydatalab',
      ],
      zip_safe=False)

Overwriting trainer/setup.py


Try out the package locally on a subset of the data.

In [None]:
%bash
export PYTHONPATH=${PYTHONPATH}:${PWD}/babyweight
python -m trainer.task \
   --bucket=${BUCKET} --frac=0.001 --job-dir=./tmp

<h2> Training on Cloud ML Engine </h2>

Submit the code to the ML Engine service. This time the job trains on the full dataset (`--frac=1.0`).

In [93]:
%bash

RUNTIME_VERSION="1.8"
PYTHON_VERSION="2.7"
JOB_NAME="babyweight_skl_$(date +"%Y%m%d_%H%M%S")"
JOB_DIR="gs://$BUCKET/scikit_learn_job_dir"

gcloud ml-engine jobs submit training $JOB_NAME \
  --job-dir $JOB_DIR \
  --package-path $(pwd)/babyweight/trainer \
  --module-name trainer.task \
  --region us-central1 \
  --runtime-version=$RUNTIME_VERSION \
  --python-version=$PYTHON_VERSION \
  -- \
  --bucket=${BUCKET} --frac=1.0

jobId: babyweight_skl_20180525_044903
state: QUEUED


Job [babyweight_skl_20180525_044903] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ml-engine jobs describe babyweight_skl_20180525_044903

or continue streaming the logs with the command

  $ gcloud ml-engine jobs stream-logs babyweight_skl_20180525_044903
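In the submit command, everything after the bare `--` is passed through to `trainer.task` rather than interpreted by gcloud. The sketch below assembles the same command as a Python argument list, which makes that boundary explicit. It only builds and prints the list and does not submit anything; `build_submit_command` is a hypothetical helper and the bucket name is a placeholder.

```python
import datetime

def build_submit_command(bucket, region, package_path, frac=1.0, now=None):
    """Assemble the gcloud training-job command as a list of arguments."""
    now = now or datetime.datetime.now()
    job_name = 'babyweight_skl_{}'.format(now.strftime('%Y%m%d_%H%M%S'))
    gcloud_args = [
        'gcloud', 'ml-engine', 'jobs', 'submit', 'training', job_name,
        '--job-dir', 'gs://{}/scikit_learn_job_dir'.format(bucket),
        '--package-path', package_path,
        '--module-name', 'trainer.task',
        '--region', region,
        '--runtime-version=1.8',
        '--python-version=2.7',
    ]
    # Arguments consumed by trainer/task.py, not by gcloud
    trainer_args = ['--bucket={}'.format(bucket), '--frac={}'.format(frac)]
    return gcloud_args + ['--'] + trainer_args

cmd = build_submit_command('my-bucket', 'us-central1', './babyweight/trainer',
                           now=datetime.datetime(2018, 5, 25, 4, 49, 3))
print(' '.join(cmd))
```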


The training finished with a score of ...

<h2> Deploying the trained model </h2>
<p>
Deploying the trained model to act as a REST web service is a simple gcloud call.

In [None]:
%bash
gsutil ls gs://${BUCKET}/babyweight/sklearn/

In [None]:
%bash
MODEL_NAME="babyweight"
MODEL_VERSION="skl"
# pick the most recent export folder written by save_model
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/babyweight/sklearn/ | tail -1)
echo "Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"
#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ml-engine models delete ${MODEL_NAME}
#gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
# --framework tells the service this is a scikit-learn model rather than TensorFlow
gcloud beta ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} \
    --framework SCIKIT_LEARN --runtime-version 1.8  --python-version=2.7

<h2> Using the model to predict </h2>
<p>
Send a JSON request to the endpoint of the service to make it predict a baby's weight. I am going to try out how well the model would have predicted the weights of our two kids, plus a couple of variations while we are at it.

In [120]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials)

request_data = {'instances':
  [
    {
      'is_male': 'True',
      'mother_age': 26.0,
      'plurality': 'Single(1)',
      'gestation_weeks': 39
    },
    {
      'is_male': 'False',
      'mother_age': 29.0,
      'plurality': 'Single(1)',
      'gestation_weeks': 38
    },
    {
      'is_male': 'True',
      'mother_age': 26.0,
      'plurality': 'Triplets(3)',
      'gestation_weeks': 39
    },
    {
      'is_male': 'Unknown',
      'mother_age': 29.0,
      'plurality': 'Multiple(2+)',
      'gestation_weeks': 38
    },
  ]
}

parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'babyweight', 'soln')
response = api.projects().predict(body=request_data, name=parent).execute()
print("response={0}".format(response))

[2018-01-06 23:55:02,342] {discovery.py:863} INFO - URL being requested: POST https://ml.googleapis.com/v1/projects/cloud-training-demos/models/babyweight/versions/soln:predict?alt=json
[2018-01-06 23:55:02,343] {client.py:614} INFO - Attempting refresh to obtain initial access_token
[2018-01-06 23:55:02,344] {client.py:903} INFO - Refreshing access_token
response={u'predictions': [{u'predictions': [7.649534225463867]}, {u'predictions': [7.198207855224609]}, {u'predictions': [6.499455451965332]}, {u'predictions': [6.16628360748291]}]}




The four predictions are 7.6, 7.2, 6.5, and 6.2 pounds.
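The response nests each value under a `predictions` key inside the outer `predictions` list. A short sketch that flattens the response shown above into plain numbers:

```python
# Response body copied from the output above
response = {'predictions': [{'predictions': [7.649534225463867]},
                            {'predictions': [7.198207855224609]},
                            {'predictions': [6.499455451965332]},
                            {'predictions': [6.16628360748291]}]}

# Take the single value from each inner list and round to one decimal place
weights = [round(p['predictions'][0], 1) for p in response['predictions']]
print(weights)
```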

Copyright 2018 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License