# Online Prediction with scikit-learn on Google Cloud Machine Learning Engine
This notebook uses the [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) to create a simple model, train the model, upload the model to Cloud Machine Learning Engine (ML Engine), and lastly use the model to make predictions. 

# How to bring your model to ML Engine
Getting your model ready for predictions can be done in 5 steps:
1. Save your model to a file
1. Upload the saved model to [Google Cloud Storage](https://cloud.google.com/storage)
1. Create a model resource on ML Engine
1. Create a model version (linking your scikit-learn model)
1. Make an online prediction

# Prerequisites
Before you jump in, let’s cover some of the different tools you’ll be using to get online prediction up and running on ML Engine. 

[Google Cloud Platform](https://cloud.google.com/) lets you build and host applications and websites, store data, and analyze data on Google's scalable infrastructure.

[Cloud ML Engine](https://cloud.google.com/ml-engine/) is a managed service that enables you to easily build machine learning models that work on any type of data, of any size.

[Google Cloud Storage](https://cloud.google.com/storage/) (GCS) is unified object storage for developers and enterprises, from live data serving to data analytics/ML to data archiving.

[The Cloud SDK](https://cloud.google.com/sdk/) is a set of tools for Cloud Platform. It contains gcloud, gsutil, and bq, which you can use to access Google Compute Engine, Google Cloud Storage, Google BigQuery, and other products and services from the command-line.


# Part 0: Setup
* [Create a project on GCP](https://cloud.google.com/resource-manager/docs/creating-managing-projects)
* [Create a Google Cloud Storage Bucket](https://cloud.google.com/storage/docs/quickstart-console)
* [Enable Cloud Machine Learning Engine and Compute Engine APIs](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component&_ga=2.217405014.1312742076.1516128282-1417583630.1516128282)

These variables will be needed for the following steps.

**Replace:**
* `PROJECT_ID <YOUR_PROJECT_ID>` - with your project's id. Use the PROJECT_ID that matches your Google Cloud Platform project.
* `BUCKET_ID <YOUR_BUCKET_ID>` - with the id of the bucket you created above.
* `MODEL_NAME <YOUR_MODEL_NAME>` - with your model name, such as '`census`'
* `VERSION_NAME <YOUR_VERSION>` - with your version name, such as '`v1`'
* `REGION <REGION>` - select a region from [here](https://cloud.google.com/ml-engine/docs/regions) or use the default '`us-central1`'. The region is where the model will be deployed.
* `INPUT_DATA_FILE <data.json>` - a JSON file that contains the data used as input to your model’s predict method. (You'll create this in the code below.)

In [3]:
%env PROJECT_ID true-ability-192918

env: PROJECT_ID=true-ability-192918


In [4]:
%env BUCKET_ID true-ability-192918

env: BUCKET_ID=true-ability-192918


In [5]:
%env MODEL_NAME census

env: MODEL_NAME=census


In [6]:
%env VERSION_NAME v1

env: VERSION_NAME=v1


In [7]:
%env REGION us-central1

env: REGION=us-central1


In [8]:
%env INPUT_DATA_FILE data.json

env: INPUT_DATA_FILE=data.json
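
Later cells read these values back from the environment, so a missing or misspelled variable only surfaces as a confusing error several steps later. As a minimal sketch, you could check them up front. The helper name `load_settings` and the placeholder values are assumptions made for illustration; only the variable names come from the cells above.

```python
import os

# Variable names match the %env cells above; the helper itself is illustrative.
REQUIRED_VARS = ('PROJECT_ID', 'BUCKET_ID', 'MODEL_NAME',
                 'VERSION_NAME', 'REGION', 'INPUT_DATA_FILE')


def load_settings(environ=os.environ):
    """Return the required settings, or raise if any are missing or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}


# Example with placeholder values (not a real project or bucket).
example = load_settings({
    'PROJECT_ID': 'my-project',
    'BUCKET_ID': 'my-bucket',
    'MODEL_NAME': 'census',
    'VERSION_NAME': 'v1',
    'REGION': 'us-central1',
    'INPUT_DATA_FILE': 'data.json',
})
print('gs://{}/model.joblib'.format(example['BUCKET_ID']))
```

In the notebook itself you would call `load_settings()` with no arguments so that it reads the real environment.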


## Download the data
The [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) that this sample
uses for training is hosted by the [UC Irvine Machine Learning
Repository](https://archive.ics.uci.edu/ml/datasets/). We have hosted the data
on Google Cloud Storage in a slightly cleaned form:

 * Training file is `adult.data`
 * Evaluation file is `adult.test`


### Disclaimer
This dataset is provided by a third party. Google provides no representation,
warranty, or other guarantees about the validity or any other aspects of this dataset.

In [10]:
# Create a directory to hold the data
! mkdir census_data

# Download the data
! curl https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data --output census_data/adult.data
! curl https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test --output census_data/adult.test

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3881k  100 3881k    0     0  2733k      0  0:00:01  0:00:01 --:--:-- 2733k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1956k  100 1956k    0     0  2034k      0 --:--:-- --:--:-- --:--:-- 2033k


# Part 1: Train/Save the model 
First, the data is loaded into a pandas DataFrame that can be used by scikit-learn. Then a simple model is created and fit against the training data. Lastly, sklearn's built-in version of joblib is used to save the model to a file that can be uploaded to ML Engine.

In [7]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# scikit-learn 0.19 bundles joblib here; newer releases removed it in favor of a plain `import joblib`
from sklearn.externals import joblib

# Define the format of your input data including unused columns (These are the columns from the census data files)
COLUMNS = (
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income-level'
)

# Categorical columns are columns that need to be turned into a numerical value to be used by scikit-learn
CATEGORICAL_COLUMNS = (
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country'
)


# Load the training census dataset
with open('./census_data/adult.data', 'r') as train_data:
    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)
# Remove the column we are trying to predict ('income-level') from our features list
train_features = raw_training_data.drop('income-level', axis=1)
# Create our training labels list
train_labels = (raw_training_data['income-level'] == ' >50K')


# Load the test census dataset
with open('./census_data/adult.test', 'r') as test_data:
    raw_testing_data = pd.read_csv(test_data, names=COLUMNS, skiprows=1)
# Remove the column we are trying to predict ('income-level') from our features list
test_features = raw_testing_data.drop('income-level', axis=1)
# Create our testing labels list
test_labels = (raw_testing_data['income-level'] == ' >50K.')


# Convert the categorical columns to a numerical value in both the training and testing dataset
from sklearn.preprocessing import LabelEncoder
encoders = {col:LabelEncoder() for col in CATEGORICAL_COLUMNS}

for col in CATEGORICAL_COLUMNS:
    train_features[col] = encoders[col].fit_transform(train_features[col])

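# Caution: fit_transform re-fits each encoder on the test data, so a value can receive a
# different integer code here than it did in training if the two sets contain different categories.
# Calling encoders[col].transform(...) instead would reuse the training mapping.
for col in CATEGORICAL_COLUMNS:
    test_features[col] = encoders[col].fit_transform(test_features[col])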


# Create and train a classifier
classifier = RandomForestClassifier()
classifier.fit(train_features, train_labels)

# Export the classifier to a file
joblib.dump(classifier, 'model.joblib')

print('Done')

Done
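
The encoding step above deserves a closer look, because the integer codes are only meaningful relative to the data the encoder was fitted on. `LabelEncoder` assigns each distinct value its position among the sorted distinct values seen during fitting. The sketch below is a plain-Python stand-in for that behavior, not scikit-learn's implementation, and the category values are picked purely for illustration.

```python
def fit_codes(values):
    """Map each distinct value to its index in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


# Illustrative categories: the test set is missing one that training has.
train_countries = [' Cuba', ' Holand-Netherlands', ' United-States']
test_countries = [' Cuba', ' United-States']

train_codes = fit_codes(train_countries)
test_codes = fit_codes(test_countries)

# The same value gets a different code depending on which data was fitted.
print(train_codes[' United-States'])  # 2
print(test_codes[' United-States'])   # 1

# Reusing the training mapping keeps the codes consistent.
encoded_test = [train_codes[value] for value in test_countries]
print(encoded_test)  # [0, 2]
```

If the codes drift like this, the model sees feature values that mean something different from what it learned, so reusing the fitted training encoders for any data you send to the model is usually the safer choice.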


# Part 2: Upload the model
Next, upload the model to your project's storage bucket. ML Engine can only deploy models that are stored in Google Cloud Storage (GCS), so this step uses gsutil from the Cloud SDK to copy your local ‘model.joblib’ file to GCS.

Note: The exact file name of the exported model you upload to GCS is important! Your model must be named “model.joblib”, “model.pkl”, or “model.bst”, depending on the library you used to export it. This restriction ensures that the model is reconstructed safely later, using the same technique for import as was used during export.

In [14]:
! gsutil cp ./model.joblib gs://$BUCKET_ID/model.joblib

Copying file://./model.joblib [Content-Type=application/octet-stream]...
-
Operation completed over 1 objects/6.5 MiB.                                      
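
Because the file name is how the export technique is recognized, it can be worth checking the destination path before uploading. The following is a minimal sketch: the three file names come from the note above, while the dictionary keys and the helper `check_model_uri` are hypothetical names chosen for this example.

```python
import posixpath

# File names from the note above, keyed by labels chosen for this sketch.
EXPECTED_MODEL_FILES = {
    'joblib': 'model.joblib',
    'pickle': 'model.pkl',
    'xgboost': 'model.bst',
}


def check_model_uri(gcs_uri, export_library):
    """Return True if the object name matches the export library used."""
    if not gcs_uri.startswith('gs://'):
        raise ValueError('Expected a gs:// URI, got: ' + gcs_uri)
    expected = EXPECTED_MODEL_FILES[export_library]
    return posixpath.basename(gcs_uri) == expected


print(check_model_uri('gs://my-bucket/model.joblib', 'joblib'))  # True
print(check_model_uri('gs://my-bucket/census.joblib', 'joblib'))  # False
```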


# Part 3: Create a model resource
Cloud ML Engine organizes your trained models using model and version resources. A Cloud ML Engine model is a container for the versions of your machine learning model. For more information on model resources and model versions look [here](https://cloud.google.com/ml-engine/docs/deploying-models#creating_a_model_version). 

At this step, you create a container that you can use to hold several different versions of your actual model.

In [12]:
! gcloud ml-engine models create $MODEL_NAME --regions $REGION

Created ml engine model [projects/true-ability-192918/models/census].
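
The same model resource can also be created from Python with the `googleapiclient` library used later in this notebook. This is a sketch under assumptions rather than a tested recipe: the helper names are made up for illustration, the project id is a placeholder, and `create_model` needs the client library, credentials, and network access to run. It has not been executed here.

```python
def model_create_request(project_id, model_name, region):
    """Build the parent path and request body for projects.models.create."""
    parent = 'projects/{}'.format(project_id)
    body = {'name': model_name, 'regions': [region]}
    return parent, body


def create_model(project_id, model_name, region):
    """Send the request; requires google-api-python-client and credentials."""
    # Imported lazily so the request builder above has no dependencies.
    import googleapiclient.discovery
    service = googleapiclient.discovery.build('ml', 'v1')
    parent, body = model_create_request(project_id, model_name, region)
    return service.projects().models().create(parent=parent, body=body).execute()


# Placeholder values; only the request is built here, nothing is sent.
parent, body = model_create_request('my-project', 'census', 'us-central1')
print(parent)  # projects/my-project
print(body)
```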


# Part 4: Create a model version

Now it’s time to get your model online and ready for predictions. The model version requires a few components as specified [here](https://cloud.google.com/ml-engine/reference/rest/v1/projects.models.versions#Version).

* __name__ - The name specified for the version when it was created. This will be the `VERSION_NAME` variable you declared at the beginning.
* __deploymentUri__ (curl) or __origin__ (gcloud) - The Google Cloud Storage location of the trained model used to create the version. This is the bucket that you uploaded the model to, identified by your `BUCKET_ID`.
* __runtimeVersion__ - The Google Cloud ML runtime version to use for this deployment. This is set to 1.4.
* __framework__ - The framework specifies if you are using: `TENSORFLOW`, `SCIKIT_LEARN`, `XGBOOST`. This is set to `SCIKIT_LEARN`

Note: It can take several minutes for your model version to become available.

Note: If you require a feature of scikit-learn that isn’t available in the publicly released version yet, you can specify “runtimeVersion”: “HEAD” instead, which uses the latest version of scikit-learn available from the GitHub repo. Otherwise the following versions will be used:
* scikit-learn: 0.19.0

In [13]:
! curl -X POST -H "Content-Type: application/json" \
   -d '{"name": "'$VERSION_NAME'", "deploymentUri": "gs://'$BUCKET_ID'/", "runtimeVersion": "1.4", "framework": "SCIKIT_LEARN"}' \
   -H "Authorization: Bearer `gcloud auth print-access-token`" \
    https://ml.googleapis.com/v1/projects/$PROJECT_ID/models/$MODEL_NAME/versions

{
  "name": "projects/true-ability-192918/operations/create_census_v1-1517440102159",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.ml.v1.OperationMetadata",
    "createTime": "2018-01-31T23:08:22Z",
    "operationType": "CREATE_VERSION",
    "modelName": "projects/true-ability-192918/models/census",
    "version": {
      "name": "projects/true-ability-192918/models/census/versions/v1",
      "deploymentUri": "gs://true-ability-192918/",
      "createTime": "2018-01-31T23:08:22Z",
      "etag": "tc3SVp7707A=",
      "framework": "SCIKIT_LEARN"
    }
  }
}
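
The curl command above splices shell variables into a hand-written JSON string, which is easy to get wrong when quoting. As an alternative sketch, the same request body can be built with Python's `json` module. The field names and default values are the ones used in the curl command; the helper names and the placeholder ids are assumptions for illustration.

```python
import json


def version_request_body(version_name, bucket_id,
                         runtime_version='1.4', framework='SCIKIT_LEARN'):
    """Return the JSON body for creating a model version."""
    return json.dumps({
        'name': version_name,
        'deploymentUri': 'gs://{}/'.format(bucket_id),
        'runtimeVersion': runtime_version,
        'framework': framework,
    })


def versions_url(project_id, model_name):
    """Return the REST endpoint that the curl command posts to."""
    return 'https://ml.googleapis.com/v1/projects/{}/models/{}/versions'.format(
        project_id, model_name)


# Placeholder ids; nothing is sent over the network here.
request_body = version_request_body('v1', 'my-bucket')
print(versions_url('my-project', 'census'))
print(request_body)
```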


# Part 5: Make an online prediction
It’s time to make an online prediction with your newly deployed model. Before you begin, you'll need to take some of the test data and prepare it, so that the test data can be used by the deployed model.

To get online predictions, the data needs to be converted from a pandas DataFrame to a JSON array.

In [14]:
import json

data = []
for i in range(len(test_features)):
  data.append([])
  for col in COLUMNS[:-1]: # Ignore the 'income-level' column as it is not in the feature set.
    # Convert from numpy integers to standard integers
    data[i].append(int(np.uint64(test_features[col][i]).item()))

# Write the test data to a json file
with open('data.json', 'w') as outfile:
  json.dump(data, outfile)

# Get one person that makes <=50K and one that makes >50K to test our model.
print('Show a person that makes <=50K:')
print('\tFeatures: {0} --> Label: {1}\n'.format(data[0], test_labels[0]))

with open('less_than_50K.json', 'w') as outfile:
  json.dump(data[0], outfile)

  
print('Show a person that makes >50K:')
print('\tFeatures: {0} --> Label: {1}'.format(data[2], test_labels[2]))

with open('more_than_50K.json', 'w') as outfile:
  json.dump(data[2], outfile)

Show a person that makes <=50K:
	Features: [25, 4, 226802, 1, 7, 4, 7, 3, 2, 1, 0, 0, 40, 38] --> Label: False

Show a person that makes >50K:
	Features: [28, 2, 336951, 7, 12, 2, 11, 0, 4, 1, 0, 0, 40, 38] --> Label: True
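
Each of the single-person files above holds exactly one instance. To send several instances in one `gcloud` call, the `--json-instances` file is expected, as far as we understand the flag, to contain one JSON instance per line. The sketch below writes the two people shown above in that layout; the helper name `write_json_instances` and the file name `two_people.json` are made up for this example.

```python
import json
import os
import tempfile


def write_json_instances(instances, path):
    """Write one JSON-encoded instance per line."""
    with open(path, 'w') as outfile:
        for instance in instances:
            outfile.write(json.dumps(instance) + '\n')


# The two feature rows printed above.
people = [
    [25, 4, 226802, 1, 7, 4, 7, 3, 2, 1, 0, 0, 40, 38],
    [28, 2, 336951, 7, 12, 2, 11, 0, 4, 1, 0, 0, 40, 38],
]
path = os.path.join(tempfile.mkdtemp(), 'two_people.json')
write_json_instances(people, path)

with open(path) as infile:
    lines = infile.read().splitlines()
print(len(lines))  # 2
```

Note that this layout differs from `data.json`, which stores the whole test set as a single JSON array for use with the Python client.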


## Use gcloud to make online predictions
Use the two people (as seen in the table) gathered in the previous step for the gcloud predictions.

| **Person** | age | workclass | fnlwgt | education | education-num | marital-status | occupation |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| **1** | 25 | 4 | 226802 | 1 | 7 | 4 | 7 |
| **2** | 28 | 2 | 336951 | 7 | 12 | 2 | 11 |

| **Person** | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income-level (label) |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| **1** | 3 | 2 | 1 | 0 | 0 | 40 | 38 | False (<=50K) |
| **2** | 0 | 4 | 1 | 0 | 0 | 40 | 38 | True (>50K) |


Creating a model version can take several minutes. Check the status of your model version to see whether it is available.

In [22]:
! gcloud ml-engine versions list --model $MODEL_NAME

NAME  DEPLOYMENT_URI             STATE
v1    gs://true-ability-192918/  CREATING


In [27]:
! gcloud ml-engine versions list --model $MODEL_NAME

NAME  DEPLOYMENT_URI             STATE
v1    gs://true-ability-192918/  READY
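
Rather than re-running the list command by hand until the state changes from `CREATING` to `READY`, you could poll in a loop. The sketch below shows only the polling logic: `wait_until_ready` and its parameters are hypothetical, and the state lookup is passed in as a function so the example can run against simulated states. In practice that function might wrap a `versions().get(...)` call on the API client and return its `state` field, which is an assumption not exercised here.

```python
import time


def wait_until_ready(get_state, timeout_s=600, interval_s=10, sleep=time.sleep):
    """Poll get_state() until it returns 'READY'; raise on failure or timeout.

    Returns the number of seconds spent waiting between polls.
    """
    waited = 0
    while True:
        state = get_state()
        if state == 'READY':
            return waited
        if state == 'FAILED':
            raise RuntimeError('Version creation failed')
        if waited >= timeout_s:
            raise RuntimeError('Timed out while version was {}'.format(state))
        sleep(interval_s)
        waited += interval_s


# Simulated states standing in for real API responses; sleeping is skipped.
states = iter(['CREATING', 'CREATING', 'READY'])
waited = wait_until_ready(lambda: next(states), sleep=lambda seconds: None)
print(waited)  # 20
```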


Test the model with an online prediction using the data of a person who makes <=50K.

Note: If you see an error, the model version from Part 4 may not be ready yet, as it takes several minutes for a new model version to be created.

In [28]:
! gcloud ml-engine predict --model $MODEL_NAME --version $VERSION_NAME --json-instances less_than_50K.json

[False]


Test the model with an online prediction using the data of a person who makes >50K.

In [29]:
! gcloud ml-engine predict --model $MODEL_NAME --version $VERSION_NAME --json-instances more_than_50K.json

[True]


## Use Python to make online predictions
Test the model with the entire test set and print out some of the results.

In [16]:
import googleapiclient.discovery
import os

PROJECT_ID = os.environ['PROJECT_ID']
VERSION_NAME = os.environ['VERSION_NAME']
MODEL_NAME = os.environ['MODEL_NAME']

service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/{}/models/{}'.format(PROJECT_ID, MODEL_NAME)
name += '/versions/{}'.format(VERSION_NAME)

response = service.projects().predict(
    name=name,
    body={'instances': data}
).execute()

if 'error' in response:
  print(response['error'])
else:
  predictions = response['predictions']
  # Print the first 10 predictions next to their true labels
  for i, prediction in enumerate(predictions[:10]):
    print('Prediction: {}\tLabel: {}'.format(prediction, test_labels[i]))

[2018-01-31 23:13:42,137] {discovery.py:863} INFO - URL being requested: POST https://ml.googleapis.com/v1/projects/true-ability-192918/models/census/versions/v1:predict?alt=json
Prediction: False	Label: False
Prediction: False	Label: False
Prediction: True	Label: True
Prediction: True	Label: True
Prediction: False	Label: False
Prediction: False	Label: False
Prediction: False	Label: False
Prediction: True	Label: True
Prediction: False	Label: False
Prediction: False	Label: False
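
Eyeballing ten rows is a useful smoke test, but a single accuracy figure summarizes the whole response more reliably. The sketch below uses a hypothetical `accuracy` helper and, to stay self-contained, is run on just the ten prediction and label pairs printed above. In the notebook you would pass the full `predictions` list and `test_labels` instead.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that equal their labels."""
    if len(predictions) != len(labels):
        raise ValueError('predictions and labels must have the same length')
    if not predictions:
        raise ValueError('cannot compute accuracy of an empty list')
    matches = sum(1 for predicted, actual in zip(predictions, labels)
                  if bool(predicted) == bool(actual))
    return matches / float(len(predictions))


# The ten prediction/label pairs from the output above.
sample_predictions = [False, False, True, True, False,
                      False, False, True, False, False]
sample_labels = [False, False, True, True, False,
                 False, False, True, False, False]
print(accuracy(sample_predictions, sample_labels))  # 1.0
```

Ten matching rows say little about overall quality, so treat the 1.0 here as a property of this small sample rather than of the model.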
