# Online Prediction with scikit-learn on Google Cloud Machine Learning Engine
This notebook builds a model to predict the weight of a baby. It uses United States Baby Weights data from the [Natality BigQuery Sample Table](https://cloud.google.com/bigquery/sample-tables). After training locally, we upload the model to Cloud Machine Learning Engine (ML Engine) and use it to make predictions.

The idea of predicting baby weights with this dataset is used, with permission, from [Lak Lakshmanan](https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/blogs/sklearn/babyweight_skl.ipynb).

# How to bring your model to ML Engine
Getting your model ready for predictions can be done in 5 steps:
1. Save your model to a file
1. Upload the saved model to [Google Cloud Storage](https://cloud.google.com/storage)
1. Create a model resource on ML Engine
1. Create a model version (linking your scikit-learn model)
1. Make an online prediction

# Prerequisites
Before you jump in, let’s cover some of the different tools you’ll be using to get online prediction up and running on ML Engine. 

[Google Cloud Platform](https://cloud.google.com/) lets you build and host applications and websites, store data, and analyze data on Google's scalable infrastructure.

[Cloud ML Engine](https://cloud.google.com/ml-engine/) is a managed service that enables you to easily build machine learning models that work on any type of data, of any size.

[Google Cloud Storage](https://cloud.google.com/storage/) (GCS) is a unified object storage for developers and enterprises, from live data serving to data analytics/ML to data archiving.

[Cloud SDK](https://cloud.google.com/sdk/) is a command line tool which allows you to interact with Google Cloud products. In order to run this notebook, make sure that Cloud SDK is [installed](https://cloud.google.com/sdk/downloads) in the same environment as your Jupyter kernel.

[BigQuery](https://cloud.google.com/bigquery/) is Google's petabyte scale analytical database.


# Part 0: Setup
* [Create a project on GCP](https://cloud.google.com/resource-manager/docs/creating-managing-projects)
* [Create a Google Cloud Storage Bucket](https://cloud.google.com/storage/docs/quickstart-console)
* [Enable Cloud Machine Learning Engine and Compute Engine APIs](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component&_ga=2.217405014.1312742076.1516128282-1417583630.1516128282)
* [Install Cloud SDK](https://cloud.google.com/sdk/downloads)
* [Install scikit-learn](http://scikit-learn.org/stable/install.html)
* [Install NumPy](https://docs.scipy.org/doc/numpy/user/install.html)
* [Install pandas](https://pandas.pydata.org/pandas-docs/stable/install.html)
* [Install Google API Python Client](https://github.com/google/google-api-python-client)
* [Install google-cloud-bigquery](https://pypi.org/project/google-cloud-bigquery/)

## Getting Baby Weight data from BigQuery

Baby Weight data is available as a [BigQuery Sample Table](https://cloud.google.com/bigquery/sample-tables) in `bigquery-public-data:samples.natality`. 

For the purposes of this demo, I extracted a small subset of the rows into a separate table. (The original query can be found in the appendix below.)

In [1]:
# Note I have set up GOOGLE_APPLICATION_CREDENTIALS environment variables
# per https://google-cloud-python.readthedocs.io/en/latest/bigquery/usage.html#id3

from google.cloud import bigquery
client = bigquery.Client() 

In [2]:
# Enable magics in BigQuery. See https://google-cloud-python.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.magics.html#module-google.cloud.bigquery.magics
%load_ext google.cloud.bigquery

**Note:** *The query limits results to 1000 for the purposes of the demo. This also significantly impairs model accuracy.*

In [3]:
%%bigquery df
SELECT * FROM `sgreenberg-project2.misc_ml.baby_weights`
LIMIT 1000

Unnamed: 0,weight_pounds,is_male,mother_age,father_age,weight_gain_pounds,plurality,gestation_weeks,year,month,day,state
0,1.750470,False,42,36,17.0,2,22,2004,4,,FL
1,1.298523,True,14,20,5.0,1,23,2004,5,,NM
2,1.410958,False,45,99,20.0,1,23,2004,11,,NY
3,4.874421,True,15,99,40.0,1,24,2004,6,,WI
4,1.686536,True,15,99,18.0,1,25,2004,11,,IL
5,1.999593,True,42,99,99.0,1,25,2004,9,,CA
6,2.722709,True,42,43,18.0,1,26,2004,5,,NJ
7,1.375685,True,43,46,10.0,1,26,2004,11,,NV
8,6.563162,False,15,99,16.0,1,26,2004,10,,MS
9,1.873929,False,15,99,10.0,1,27,2004,6,,LA


# Part 1: Train/Save the model 
First, the data is loaded into a pandas DataFrame that can be used by scikit-learn. Then a simple model is created and fit against the training data. Lastly, scikit-learn's built-in version of joblib is used to save the model to a file that can be uploaded to ML Engine.

In [4]:
import json
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
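# sklearn.externals.joblib was removed in scikit-learn 0.23;
# on newer versions use `import joblib` instead.
from sklearn.externals import joblib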

from sklearn.ensemble import RandomForestRegressor

In [5]:
# Remove the target
target_name = 'weight_pounds'
y = df[target_name]
del df[target_name]

In [6]:
# Client-side data transformations. 
columns_to_delete = ['day','weight_gain_pounds']
for c in columns_to_delete:
    del df[c]
    
# Coalesce categorical columns into strings
columns_to_stringify = ['is_male', 'year', 'month', 'state']
for c in columns_to_stringify:
    df[c] = df[c].apply(str)
    
# Dataframe to Dict
data = df.to_dict('records')
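
To see why the categorical columns are converted to strings first, here is a minimal pure-Python sketch of the kind of encoding a dict vectorizer performs: string values become one-hot `name=value` columns, while numeric values pass through as a single column. The `vectorize` helper and the sample records are hypothetical illustrations, not scikit-learn's implementation.

```python
def vectorize(records):
    """One-hot encode string values; pass numeric values through.

    Illustrative stand-in for what a dict vectorizer does,
    not scikit-learn's actual code.
    """
    # Collect column names: "key=value" for strings, "key" for numbers.
    names = set()
    for record in records:
        for key, value in record.items():
            if isinstance(value, str):
                names.add('{}={}'.format(key, value))
            else:
                names.add(key)
    names = sorted(names)
    index = {name: i for i, name in enumerate(names)}

    rows = []
    for record in records:
        row = [0.0] * len(names)
        for key, value in record.items():
            if isinstance(value, str):
                row[index['{}={}'.format(key, value)]] = 1.0
            else:
                row[index[key]] = float(value)
        rows.append(row)
    return names, rows


# Hypothetical records shaped like the output of df.to_dict('records').
sample = [
    {'is_male': 'True', 'mother_age': 14, 'state': 'NM'},
    {'is_male': 'False', 'mother_age': 42, 'state': 'FL'},
]
feature_names, matrix = vectorize(sample)
print(feature_names)
print(matrix)
```

Left as integers, `year` and `month` would each become one numeric column; as strings, every distinct value gets its own column.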

In [7]:
# Split data into training and testing.
x_train, x_test, y_train, y_test = train_test_split(data, y, test_size=0.2)
print('training data size: {}'.format(len(x_train)))
print('test data size: {}'.format(len(x_test)))

# Setup and train the pipeline.
pipeline = Pipeline(steps=[("dict_vect", DictVectorizer(sparse=False)),
                           ("estimator", RandomForestRegressor(max_depth=5, n_estimators=100, random_state=0))])


print("Training...")
pipeline.fit(x_train, y_train)
print("Done!")

training data size: 800
test data size: 200
Training...
Done!


In [8]:
# Show some predictions
predictions = pipeline.predict(x_test)

for i, prediction in enumerate(predictions[:10]):
  print('Prediction: %s\tActual: %s' % (prediction, y_test.iloc[i]))

Prediction: 7.736490764536651	Actual: 7.12534030784
Prediction: 7.7286938148220905	Actual: 7.4957169079999995
Prediction: 7.716898935899695	Actual: 7.3744626639
Prediction: 7.214548638244691	Actual: 6.686620406459999
Prediction: 6.960795416941369	Actual: 7.5618555866
Prediction: 6.625167689479333	Actual: 7.4957169079999995
Prediction: 7.022468149919181	Actual: 7.4075320032
Prediction: 7.321764529074358	Actual: 7.3744626639
Prediction: 5.52502228563584	Actual: 6.4374980503999994
Prediction: 7.594756537989166	Actual: 8.70164548114
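
Eyeballing ten rows gives only a rough impression; a single error metric is easier to compare across runs. Below is a minimal sketch of root-mean-square error, using the first three prediction/actual pairs printed above as sample input. The `rmse` helper is a hypothetical name and not part of the original notebook.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length, non-empty sequences."""
    if len(predicted) != len(actual):
        raise ValueError('sequences must have the same length')
    if len(predicted) == 0:
        raise ValueError('sequences must not be empty')
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# First three rows of the output above.
predicted = [7.736490764536651, 7.7286938148220905, 7.716898935899695]
actual = [7.12534030784, 7.4957169079999995, 7.3744626639]
print('RMSE: {:.3f} pounds'.format(rmse(predicted, actual)))
```

In the notebook itself, `rmse(predictions, y_test)` should score the whole test set the same way.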


In [9]:
# Export the model.
model = './model.joblib'
joblib.dump(pipeline, model)

['./model.joblib']

# Part 2: Upload the model
To use your model with ML Engine, you first need to upload it to your project's storage bucket in Google Cloud Storage (GCS). This step takes your local ‘model.joblib’ file and uploads it to GCS via the Cloud SDK using gsutil.

Before continuing, make sure you're [properly authenticated](https://cloud.google.com/sdk/gcloud/reference/auth/) and have [access to the bucket](https://cloud.google.com/storage/docs/access-control/). The commands below set your active project to the one you specify in `PROJECT_ID`.

Note: If you get an error below, make sure the Cloud SDK is installed in the kernel's environment.

These variables will be needed for the following steps.

**Replace:**
* `PROJECT_ID <YOUR_PROJECT_ID>` - with your project's id. Use the PROJECT_ID that matches your Google Cloud Platform project.
* `BUCKET_ID <YOUR_BUCKET_ID>` - with the bucket id you created above.
* `REGION <REGION>` - select a region from [here](https://cloud.google.com/ml-engine/docs/regions) or use the default '`us-central1`'. The region is where the model will be deployed.

**In cells below you'll also need to replace**
* `MODEL_NAME <YOUR_MODEL_NAME>` - with your model name, such as '`census`'
* `VERSION_NAME <YOUR_VERSION>` - with your version name, such as '`v1`'


In [10]:
%env PROJECT_ID sgreenberg-project2
%env BUCKET_ID sgreenberg-sklearn-cmle
%env REGION us-central1

env: PROJECT_ID=sgreenberg-project2
env: BUCKET_ID=sgreenberg-sklearn-cmle
env: REGION=us-central1


In [11]:
! gcloud config set project $PROJECT_ID

Updated property [core/project].


Note: The exact file name of the exported model you upload to GCS is important! Your model must be named “model.joblib”, “model.pkl”, or “model.bst”, depending on the library you used to export it. This restriction ensures that the model will be safely reconstructed later by using the same technique for import as was used during export.
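
Since a misnamed file may only surface as an error later, when the version is created, it can help to check the name before uploading. A minimal sketch follows. The `EXPECTED_FILENAMES` mapping and `check_model_filename` helper are hypothetical names, and the pairing of each library with a file name is an assumption based on the three names listed in the note above.

```python
import os

# Assumed pairing of export library to required file name.
EXPECTED_FILENAMES = {
    'joblib': 'model.joblib',
    'pickle': 'model.pkl',
    'xgboost': 'model.bst',
}


def check_model_filename(path, library):
    """Return path unchanged, or raise ValueError if its name is wrong."""
    expected = EXPECTED_FILENAMES[library]
    actual = os.path.basename(path)
    if actual != expected:
        raise ValueError(
            'expected {!r} for {}, got {!r}'.format(expected, library, actual))
    return path


print(check_model_filename('./model.joblib', 'joblib'))
```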

In [12]:
! gsutil cp ./model.joblib gs://$BUCKET_ID/model.joblib

Copying file://./model.joblib [Content-Type=application/octet-stream]...
| [1 files][358.5 KiB/358.5 KiB]                                                
Operation completed over 1 objects/358.5 KiB.                                    


# Part 3: Create a model resource
Cloud ML Engine organizes your trained models using model and version resources. A Cloud ML Engine model is a container for the versions of your machine learning model. For more information on model resources and model versions look [here](https://cloud.google.com/ml-engine/docs/deploying-models#creating_a_model_version). 

At this step, you create a container that you can use to hold several different versions of your actual model.

This variable will be needed for the following steps.

**Replace:**
* `MODEL_NAME <YOUR_MODEL_NAME>` - with your model name, such as '`census`'

In [13]:
%env MODEL_NAME baby_weight
! gcloud ml-engine models create $MODEL_NAME --regions us-central1

env: MODEL_NAME=baby_weight
Created ml engine model [projects/sgreenberg-project2/models/baby_weight].


# Part 4: Create a model version

Now it’s time to get your model online and ready for predictions. The model version requires a few components as specified [here](https://cloud.google.com/ml-engine/reference/rest/v1/projects.models.versions#Version).

* __name__ - The name specified for the version when it was created. This will be the `VERSION_NAME` variable you set in the cell below.
* __deployment Uri__ (curl) or __origin__ (gcloud) - The Google Cloud Storage location of the trained model used to create the version. This is the bucket that you uploaded the model to with your `BUCKET_ID`
* __runtime__ version - The Google Cloud ML runtime version to use for this deployment. The command below sets this to 1.8.
* __framework__ - The framework specifies if you are using: `TENSORFLOW`, `SCIKIT_LEARN`, `XGBOOST`. This is set to `SCIKIT_LEARN`
* __pythonVersion__ - This specifies whether you’re using Python 2.7 or Python 3.5. The default value is `"2.7"`; if you are using Python 3.5, set the value to `"3.5"`.
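
When a version is created through the REST API rather than gcloud, the components above become fields of the request body. Here is a minimal sketch of assembling that body. `build_version_body` is a hypothetical helper, the bucket name is a placeholder, and the default runtime version mirrors the gcloud command used in this notebook.

```python
import json

ALLOWED_FRAMEWORKS = ('TENSORFLOW', 'SCIKIT_LEARN', 'XGBOOST')


def build_version_body(name, bucket_id, runtime_version='1.8',
                       framework='SCIKIT_LEARN', python_version='2.7'):
    """Assemble the fields described in the list above into one dict."""
    if framework not in ALLOWED_FRAMEWORKS:
        raise ValueError('unknown framework: {}'.format(framework))
    return {
        'name': name,
        'deploymentUri': 'gs://{}'.format(bucket_id),
        'runtimeVersion': runtime_version,
        'framework': framework,
        'pythonVersion': python_version,
    }


# '<YOUR_BUCKET_ID>' is a placeholder, not a real bucket.
body = build_version_body('v1', '<YOUR_BUCKET_ID>')
print(json.dumps(body, indent=2, sort_keys=True))
```

With a `googleapiclient` service object like the one built in Part 5, a body of this shape would presumably be passed to the `versions().create` method along with the parent model name.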


Note: It can take several minutes for your model to be available.

Note: If you require a feature of scikit-learn that isn’t available in the publicly released version yet, you can specify `"runtimeVersion": "HEAD"` instead, which gets the latest version of scikit-learn available from the GitHub repo. Otherwise the following versions will be used:
* scikit-learn: 0.19.0

This variable will be needed for the following steps.

**Replace:**
* `VERSION_NAME <YOUR_VERSION>` - with your version name, such as '`v1`'

In [14]:
%env VERSION_NAME v1
! gcloud ml-engine versions create $VERSION_NAME --async --model $MODEL_NAME --framework scikit-learn --runtime-version 1.8 --origin gs://$BUCKET_ID

env: VERSION_NAME=v1


## Wait for Model deployment to complete

In [19]:
! gcloud ml-engine versions list --model $MODEL_NAME

NAME  DEPLOYMENT_URI                STATE
v1    gs://sgreenberg-sklearn-cmle  READY
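
Rather than re-running the list command by hand, the wait can be scripted. Below is a minimal polling sketch. `wait_until_ready` is a hypothetical helper that accepts any zero-argument function returning the version's state string, so it can be tried without a live deployment. `READY` is the state shown in the output above; treating `FAILED` as a terminal error state is an assumption.

```python
import time


def wait_until_ready(get_state, interval_seconds=10, max_attempts=60,
                     sleep=time.sleep):
    """Poll get_state() until it returns 'READY'; return the poll count.

    Raises RuntimeError on 'FAILED' (assumed terminal state) or when
    max_attempts polls pass without reaching 'READY'.
    """
    for attempt in range(1, max_attempts + 1):
        state = get_state()
        if state == 'READY':
            return attempt
        if state == 'FAILED':
            raise RuntimeError('deployment failed')
        sleep(interval_seconds)
    raise RuntimeError('timed out waiting for deployment')


# Simulated deployment that becomes ready on the third poll.
states = iter(['CREATING', 'CREATING', 'READY'])
attempts = wait_until_ready(lambda: next(states), sleep=lambda seconds: None)
print('ready after {} polls'.format(attempts))
```

Injecting `get_state` and `sleep` keeps the loop independent of any particular client; in the notebook, `get_state` could wrap whatever call you use to look up the version's state.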


# Part 5: Make online predictions
It’s time to make an online prediction with your newly deployed model.

In [21]:
import googleapiclient.discovery
import os

PROJECT_ID = os.environ['PROJECT_ID']
MODEL_NAME = os.environ['MODEL_NAME']
VERSION_NAME = os.environ['VERSION_NAME']

service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/{}/models/{}'.format(PROJECT_ID, MODEL_NAME)
name += '/versions/{}'.format(VERSION_NAME)

responses = service.projects().predict(
    name=name,
    body={'instances': x_test}
).execute()

if 'error' in responses:
  print(responses['error'])
else:
  results = responses['predictions']
  # Print the first 10 responses
  for i, prediction in enumerate(results[:10]):
    print('Prediction: %s\tActual: %s' % (prediction, y_test.iloc[i]))

Prediction: 7.73649076454	Actual: 7.12534030784
Prediction: 7.72869381482	Actual: 7.4957169079999995
Prediction: 7.7168989359	Actual: 7.3744626639
Prediction: 7.21454863824	Actual: 6.686620406459999
Prediction: 6.96079541694	Actual: 7.5618555866
Prediction: 6.62516768948	Actual: 7.4957169079999995
Prediction: 7.02246814992	Actual: 7.4075320032
Prediction: 7.32176452907	Actual: 7.3744626639
Prediction: 5.52502228564	Actual: 6.4374980503999994
Prediction: 7.59475653799	Actual: 8.70164548114
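
The online predictions agree with the local ones from Part 1 up to printed precision, which is a useful sanity check that the deployed model is the one that was trained. Here is a minimal sketch of automating that comparison. `predictions_match` is a hypothetical helper, and the tolerance is an arbitrary choice.

```python
def predictions_match(local, online, rel_tol=1e-9):
    """Return True if both sequences agree element-wise within rel_tol."""
    if len(local) != len(online):
        return False
    return all(
        abs(a - b) <= rel_tol * max(abs(a), abs(b))
        for a, b in zip(local, online)
    )


# First three local and online predictions printed in this notebook.
local = [7.736490764536651, 7.7286938148220905, 7.716898935899695]
online = [7.73649076454, 7.72869381482, 7.7168989359]
print(predictions_match(local, online))
```

In the notebook, the same check could be applied to `predictions` from Part 1 and `results` from the cell above.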


# Appendix - Original query to extract Baby Weight data from BigQuery

```sql
 SELECT
    weight_pounds,
    is_male,
    mother_age,
    father_age,
    weight_gain_pounds,
    plurality,
    gestation_weeks,
    year,
    month,
    day,
    state
  FROM
    publicdata.samples.natality
  WHERE
    gestation_weeks > 0
    AND mother_age > 0
    AND plurality > 0
    AND weight_pounds > 0
    AND year > 2003
```