# Model Training and Prediction with Google Cloud Machine Learning Engine
In this notebook instance, you will use Cloud Machine Learning Engine to train a model using scikit-learn and serve the trained model. You will then use the served model to classify some new and unseen data.

This notebook uses the [Census Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) to demonstrate how to train a model on Cloud Machine Learning Engine (ML Engine).

# GCP Products Which Will Be Used
Before you jump in, let’s cover some of the different tools you’ll be using to get online prediction up and running on ML Engine. 

[Cloud ML Engine](https://cloud.google.com/ml-engine/) (CMLE) is a managed service that enables you to easily build machine learning models that work on any type of data, of any size.

[Google Cloud Storage](https://cloud.google.com/storage/) (GCS) is a unified object storage service for developers and enterprises, covering everything from live data serving to data analytics/ML to data archiving.

[Cloud SDK](https://cloud.google.com/sdk/) is a command-line tool that allows you to interact with Google Cloud products.


# Step One: Set Up the Environment
In order to use CMLE, first you need to [enable the Cloud Machine Learning Engine and Compute Engine APIs](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component&_ga=2.217405014.1312742076.1516128282-1417583630.1516128282).

You'll also need a number of environment variables to run this sample. We have already defined them for you:

In [None]:
# Get the current GCP Project ID:
PROJECT_LIST = !gcloud config get-value project
PROJECT_ID = PROJECT_LIST[0]

# SET THE BUCKET NAME
import time
BUCKET_NAME = PROJECT_ID + '_census_training_' + str(int(time.time()))
print('Bucket Name: %s' % BUCKET_NAME)

In [None]:
%env PROJECT_ID=$PROJECT_ID
%env BUCKET_NAME=$BUCKET_NAME
%env REGION us-central1
%env PACKAGE_DIR census_training
%env MAIN_TRAINER_MODULE census_training.train
%env JOB_DIR gs://$BUCKET_NAME/census_job_dir
%env OUTPUT_DIR model_directory
%env MODEL_NAME CensusPredictor
%env MODEL_VERSION v1
%env RUNTIME_VERSION 1.9
%env PYTHON_VERSION 3.5
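
Before moving on, it can help to confirm that these variables are actually set, because a typo in a `%env` line tends to surface only later as a confusing gcloud error. The following is a minimal sketch, not part of the original sample; the helper name `missing_env_vars` is hypothetical.

```python
import os

# Names taken from the %env cell above.
REQUIRED_VARS = (
    'PROJECT_ID', 'BUCKET_NAME', 'REGION', 'PACKAGE_DIR',
    'MAIN_TRAINER_MODULE', 'JOB_DIR', 'OUTPUT_DIR',
    'MODEL_NAME', 'MODEL_VERSION', 'RUNTIME_VERSION', 'PYTHON_VERSION',
)


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print('Missing environment variables: %s' % ', '.join(missing))
else:
    print('All required environment variables are set.')
```

If the list is not empty, re-run the `%env` cell before continuing.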

Finally, you'll create the required bucket and directory:

In [None]:
# Create the Bucket
!gsutil mb gs://$BUCKET_NAME
    
# Create the Package Directory:
!mkdir $PACKAGE_DIR
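
`gsutil mb` rejects bucket names that break the Cloud Storage naming rules. Because the bucket name here is derived from the project ID, a quick pre-check can save a round trip. The sketch below covers only a simplified subset of the rules (length, allowed characters, first and last character); the helper name `looks_like_valid_bucket_name` and the example names are hypothetical.

```python
import re

# Simplified rules: 3-63 characters; lowercase letters, digits, dashes,
# underscores and dots; must start and end with a letter or digit.
# Other rules (for example reserved prefixes) are intentionally not checked.
_BUCKET_NAME_RE = re.compile(r'[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]')


def looks_like_valid_bucket_name(name):
    """Return True if `name` passes the simplified checks above."""
    return _BUCKET_NAME_RE.fullmatch(name) is not None


# Hypothetical names, for illustration only.
print(looks_like_valid_bucket_name('my-project_census_training_1516128282'))
print(looks_like_valid_bucket_name('example.com:my-project_census_training_1'))
```

The second name contains a colon, which is not an allowed character, so the check returns `False` for it.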

## The Data
The [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) that this sample
uses for training is provided by the [UC Irvine Machine Learning
Repository](https://archive.ics.uci.edu/ml/datasets/). We have hosted the data on a public GCS bucket `gs://cloud-samples-data/ml-engine/sklearn/census_data/`. 

 * Training file is `adult.data`
 * Evaluation file is `adult.test` (not used in this notebook)

Note: Your typical development process with your own data would require you to upload your data to GCS so that ML Engine can access that data. However, in this case, we have put the data on GCS to avoid the steps of having you download the data from UC Irvine and then upload the data to GCS.
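
To get a feel for the file layout before training, here is a small standard-library sketch that parses one record into named fields. It is only an illustration and is not what `train.py` does (that script uses pandas). The feature values come from the example row shown in the training script's comments; the label value is included for illustration.

```python
import csv
import io

COLUMNS = (
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
    'income-level',
)

# One record: values are separated by a comma followed by a space.
SAMPLE = ('39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, '
          'Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n')


def parse_rows(text):
    """Yield one dict per CSV line, keyed by COLUMNS, with padding removed."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for row in reader:
        if row:  # skip blank lines
            yield dict(zip(COLUMNS, (value.strip() for value in row)))


record = next(parse_rows(SAMPLE))
print(record['workclass'], record['income-level'])
```

Note that every value is still a string here; numeric columns such as `age` would need an explicit conversion.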

#### Disclaimer
This dataset is provided by a third party. Google provides no representation,
warranty, or other guarantees about the validity or any other aspects of this dataset.

# Step Two: Create your Python Model File

First, you'll create the Python model file (provided below) that you'll upload to ML Engine. This is similar to your normal process for creating a scikit-learn model. The main difference is that the training data and the trained model are stored in GCS.

The code in this file loads the data into a pandas DataFrame that can be used by scikit-learn. Then the model is fit against the training data. Lastly, pickle is used to save the model to a file that can be uploaded to [ML Engine's prediction service](https://cloud.google.com/ml-engine/docs/scikit/getting-predictions#deploy_models_and_versions).

Note that the following code is not executed in this notebook. Instead, it will be saved in a Python file, packaged, and passed to CMLE when you create a training job.

In [None]:
%%writefile ./census_training/train.py
import datetime
import argparse
import pickle
import pandas as pd

from google.cloud import storage

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer

parser = argparse.ArgumentParser()
parser.add_argument(
      '--bucket-name',
      help="The bucket name",
      required=True
      )

parser.add_argument(
      '--output',
      help="The output directory",
      required=True
      )

arguments, unknown = parser.parse_known_args()
bucket_name = arguments.bucket_name
output_dir = arguments.output

# ---------------------------------------
# 1. Add code to download the data from GCS (in this case, using the publicly hosted data).
# ML Engine will then be able to use the data when training your model.
# ---------------------------------------
# Public bucket holding the census data
bucket = storage.Client().bucket('cloud-samples-data')

# Path to the data inside the public bucket
blob = bucket.blob('ml-engine/sklearn/census_data/adult.data')
# Download the data
blob.download_to_filename('adult.data')

# ---------------------------------------
# This is where your model code would go. Below is an example model using the census dataset.
# ---------------------------------------
# Define the format of your input data including unused columns (These are the columns from the census data files)
COLUMNS = (
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income-level'
)

# Categorical columns are columns that need to be turned into a numerical value to be used by scikit-learn
CATEGORICAL_COLUMNS = (
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country'
)


# Load the training census dataset
with open('./adult.data', 'r') as train_data:
    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)
    # Removing the whitespaces in categorical features
    for col in CATEGORICAL_COLUMNS:
        raw_training_data[col] = raw_training_data[col].apply(lambda x: str(x).strip())

# Remove the column we are trying to predict ('income-level') from our features list
# and convert the DataFrame to a list of lists
train_features = raw_training_data.drop('income-level', axis=1).values.tolist()
# Create our training labels as a list of booleans (True when income is above 50K).
# 'income-level' is not stripped above, so its values keep their leading space.
train_labels = (raw_training_data['income-level'] == ' >50K').values.tolist()

# [START categorical-feature-conversion]
# Since the census data set has categorical features, we need to convert
# them to numerical values. We'll use a list of pipelines to convert each
# categorical column and then use FeatureUnion to combine them before calling
# the RandomForestClassifier.
categorical_pipelines = []

# Each categorical column needs to be extracted individually and converted to a numerical value.
# To do this, each categorical column will use a pipeline that extracts one feature column via
# SelectKBest(k=1) and a LabelBinarizer() to convert the categorical value to a numerical one.
# A scores array (created below) will select and extract the feature column. The scores array is
# created by iterating over the COLUMNS and checking if it is a CATEGORICAL_COLUMN.
for i, col in enumerate(COLUMNS[:-1]):
    if col in CATEGORICAL_COLUMNS:
        # Create a scores array to get the individual categorical column.
        # Example:
        #  data = [39, 'State-gov', 77516, 'Bachelors', 13, 'Never-married', 'Adm-clerical', 
        #         'Not-in-family', 'White', 'Male', 2174, 0, 40, 'United-States']
        #  scores = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        #
        # Returns: [['State-gov']]
        # Build the scores array
        scores = [0] * len(COLUMNS[:-1])
        # This column is the categorical column we want to extract.
        scores[i] = 1
        skb = SelectKBest(k=1)
        skb.scores_ = scores
        # Convert the categorical column to a numerical value
        lbn = LabelBinarizer()
        r = skb.transform(train_features)
        lbn.fit(r)
        # Create the pipeline to extract the categorical feature
        categorical_pipelines.append(
            ('categorical-{}'.format(i), Pipeline([
                ('SKB-{}'.format(i), skb),
                ('LBN-{}'.format(i), lbn)])))

# Create pipeline to extract the numerical features
skb = SelectKBest(k=6)
# From COLUMNS use the features that are numerical
skb.scores_ = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]
categorical_pipelines.append(('numerical', skb))

# Combine all the features using FeatureUnion
preprocess = FeatureUnion(categorical_pipelines)

# Create the classifier
classifier = RandomForestClassifier()

# Transform the features and fit them to the classifier
classifier.fit(preprocess.transform(train_features), train_labels)

# Create the overall model as a single pipeline
pipeline = Pipeline([
    ('union', preprocess),
    ('classifier', classifier)
])


# ---------------------------------------
# 2. Export and save the model to GCS
# ---------------------------------------
# Export the model to a file

#model = 'model.joblib'
#joblib.dump(pipeline, model)

model = 'model.pkl'

with open(model, 'wb') as model_file:
    pickle.dump(pipeline, model_file)

# Upload the model to GCS
bucket = storage.Client().bucket(bucket_name)
blob = bucket.blob('{}/{}'.format(output_dir, model))
blob.upload_from_filename(model)
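
The "scores array" trick used in `train.py` can be hard to picture. The plain-Python sketch below imitates the selection step only: it keeps the `k` highest-scoring columns of a row while preserving column order. It is an illustration of the idea, not scikit-learn's implementation, and the helper names `one_hot_scores` and `select_columns` are hypothetical.

```python
FEATURE_COLUMNS = (
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
)
CATEGORICAL_COLUMNS = (
    'workclass', 'education', 'marital-status', 'occupation',
    'relationship', 'race', 'sex', 'native-country',
)
# Example row taken from the comments in train.py.
ROW = [39, 'State-gov', 77516, 'Bachelors', 13, 'Never-married',
       'Adm-clerical', 'Not-in-family', 'White', 'Male', 2174, 0, 40,
       'United-States']


def one_hot_scores(index, width):
    """Return a scores list with a single 1 at `index`."""
    scores = [0] * width
    scores[index] = 1
    return scores


def select_columns(row, scores, k):
    """Keep the k highest-scoring columns of `row`, in column order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [row[i] for i in sorted(top)]


# Scores for the numerical pipeline: 1 for every non-categorical column.
NUMERICAL_SCORES = [0 if col in CATEGORICAL_COLUMNS else 1
                    for col in FEATURE_COLUMNS]

print(select_columns(ROW, one_hot_scores(1, len(ROW)), k=1))
print(select_columns(ROW, NUMERICAL_SCORES, k=6))
```

The first call picks out the `workclass` value on its own, and the second keeps the six numerical values. The derived `NUMERICAL_SCORES` list matches the hard-coded list assigned to `skb.scores_` in the training script.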


Before you can run your trainer application with ML Engine, your code and any dependencies must be placed in a Google Cloud Storage location that your Google Cloud Platform project can access. The trainer directory must also be a Python package, so create an `__init__.py` file in it:

In [None]:
%%writefile ./census_training/__init__.py
# Note that __init__.py can be an empty file.

# Step Three: Submit Training Job
Next, submit the job for training on ML Engine. You'll use gcloud to submit the job, with the following flags:

* `job-name` - A name to use for the job (mixed-case letters, numbers, and underscores only, starting with a letter). In this case: `census_training_$(date +"%Y%m%d_%H%M%S")`
* `job-dir` - The path to a Google Cloud Storage location to use for job output.
* `package-path` - A packaged training application that is staged in a Google Cloud Storage location. If you are using the gcloud command-line tool, this step is largely automated.
* `module-name` - The name of the main module in your trainer package. The main module is the Python file you call to start the application. If you use the gcloud command to submit your job, specify the main module name in the --module-name argument. Refer to Python Packages to figure out the module name.
* `region` - The Google Cloud Compute region where you want your job to run. You should run your training job in the same region as the Cloud Storage bucket that stores your training data. Select a region from [here](https://cloud.google.com/ml-engine/docs/regions) or use the default '`us-central1`'.
* `runtime-version` - The version of Cloud ML Engine to use for the job. If you don't specify a runtime version, the training service uses the default Cloud ML Engine runtime version 1.0. See the list of runtime versions for more information.
* `python-version` - The Python version to use for the job. Python 3.5 is available with runtime version 1.4 or greater. If you don't specify a Python version, the training service uses Python 2.7.
* `scale-tier` - A scale tier specifying the type of processing cluster to run your job on. This can be the CUSTOM scale tier, in which case you also explicitly specify the number and type of machines to use.
* `--` - A separator. Anything after it is passed to the Python code as input arguments.
* `bucket-name` - The name of the bucket you created earlier to save the model and the job information.
* `output` - The path where the model will be saved.
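
If you prefer to build the job name in Python rather than with the shell's `date` command, the following sketch produces the same `census_training_<timestamp>` pattern and checks it against the naming rule described in the `job-name` entry. The helper name `make_job_name` is hypothetical.

```python
import datetime
import re

# Letters, digits and underscores only, starting with a letter.
_JOB_NAME_RE = re.compile(r'[A-Za-z][A-Za-z0-9_]*')


def make_job_name(prefix='census_training', now=None):
    """Return '<prefix>_<YYYYmmdd_HHMMSS>' or raise ValueError if invalid."""
    now = datetime.datetime.now() if now is None else now
    name = '{}_{}'.format(prefix, now.strftime('%Y%m%d_%H%M%S'))
    if not _JOB_NAME_RE.fullmatch(name):
        raise ValueError('Invalid job name: {!r}'.format(name))
    return name


print(make_job_name())
```

A prefix containing a dash, such as `census-training`, would raise a `ValueError` here instead of failing later at submission time.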

In [None]:
# Submitting the training job:
! gcloud ml-engine jobs submit training census_training_$(date +"%Y%m%d_%H%M%S") \
  --job-dir $JOB_DIR \
  --package-path ./$PACKAGE_DIR \
  --module-name $MAIN_TRAINER_MODULE \
  --region $REGION \
  --runtime-version=$RUNTIME_VERSION \
  --python-version=$PYTHON_VERSION \
  --scale-tier BASIC \
  -- \
  --bucket-name $BUCKET_NAME \
  --output $OUTPUT_DIR
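
The arguments after the `--` separator reach `train.py` as ordinary command-line arguments. The sketch below reproduces the parser from the training script and runs it on a sample argument list so you can see what the script receives. The extra `--job-dir` flag stands in for any argument the script does not declare, and all values are placeholders.

```python
import argparse


def parse_trainer_args(argv):
    """Parse the trainer flags, returning (known_args, leftover_args)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--bucket-name', help='The bucket name', required=True)
    parser.add_argument('--output', help='The output directory', required=True)
    # parse_known_args collects undeclared arguments in a second list
    # instead of exiting with an error.
    return parser.parse_known_args(argv)


args, unknown = parse_trainer_args([
    '--bucket-name', 'my-bucket',
    '--output', 'model_directory',
    '--job-dir', 'gs://my-bucket/census_job_dir',
])
print(args.bucket_name, args.output, unknown)
```

Using `parse_known_args` rather than `parse_args` keeps the script from failing when it is handed arguments it does not care about.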

### [Optional] Stackdriver Logging
You can view the logs for your training job:
1. Go to https://console.cloud.google.com/
1. Select "Logging" in the left-hand pane
1. Select the "Cloud ML Job" resource from the drop-down
1. In the filter-by-prefix box, enter the job name (it starts with `census_training_`) to view the logs

### Verify Model File in GCS
View the contents of the destination model folder to verify that the model file has been uploaded to GCS.

Note: The model can take a few minutes to train and show up in GCS.

In [None]:
! gsutil ls gs://$BUCKET_NAME/$OUTPUT_DIR
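
Since the model file can take a few minutes to appear, you may want to poll for it instead of re-running the listing by hand. Below is a minimal, generic polling sketch; the helper name `wait_until` and its default timings are hypothetical. The commented wiring at the end assumes the `google-cloud-storage` client is installed and credentials are configured.

```python
import time


def wait_until(check, timeout_s=600, interval_s=15,
               sleep=time.sleep, clock=time.monotonic):
    """Call `check()` until it returns True or `timeout_s` seconds pass.

    Returns True if the check succeeded, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example wiring (not executed here):
#   from google.cloud import storage
#   blob = storage.Client().bucket(BUCKET_NAME).blob('model_directory/model.pkl')
#   found = wait_until(blob.exists)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.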

# Step Four: Serving the Model

Once the model is successfully created and trained, you can serve it. In CMLE, a model can have different versions. Therefore, in order to serve the model, you'll have to create two things: a model and a version.

First, you will create a model with just a name and a region. This acts as a container for different versions:

In [None]:
!gcloud ml-engine models create $MODEL_NAME --regions us-central1

After creating the model, you can create a version under it, with `--origin` pointing to the GCS directory that holds the trained model file:

In [None]:
!gcloud ml-engine versions create $MODEL_VERSION \
  --model=$MODEL_NAME \
  --framework=scikit-learn \
  --origin=gs://$BUCKET_NAME/$OUTPUT_DIR \
  --python-version=$PYTHON_VERSION \
  --runtime-version=$RUNTIME_VERSION

# Step Five: Prediction

With the model and the version created in the last step, you are ready to use it to make some predictions:

In [None]:
MODEL_NAME = %env MODEL_NAME
MODEL_VERSION = %env MODEL_VERSION
import googleapiclient.discovery

instances = [
 [50, 'Private', 160187, '8th', 5, 'Married-spouse-absent', 'Other-service', 'Not-in-family', 'Black', 'Female', 0, 0, 16, 'Jamaica'],
 [52, 'Self-emp-not-inc', 209642, 'HS-grad', 9, 'Married-civ-spouse', 'Exec-managerial', 'Husband', 'White', 'Male', 0, 0, 48, 'United-States']
]

service = googleapiclient.discovery.build('ml', 'v1')

name = 'projects/{}/models/{}/versions/{}'.format(PROJECT_ID, MODEL_NAME, MODEL_VERSION)

response = service.projects().predict(
        name=name,
        body={'instances': instances}
    ).execute()

if 'error' in response:
    print(response['error'])
else:
    print(response['predictions'])
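
The training labels are booleans (`True` when income is above 50K), so the predictions are expected to come back as one boolean per instance. Assuming that response shape, the sketch below pairs each instance with a readable label. The helper name `label_predictions` is hypothetical, and the prediction values in the example are placeholders, not real model output.

```python
def label_predictions(instances, predictions):
    """Pair each instance with a readable income label."""
    if len(instances) != len(predictions):
        raise ValueError('Expected one prediction per instance')
    return [
        (instance, '>50K' if is_high_income else '<=50K')
        for instance, is_high_income in zip(instances, predictions)
    ]


# Placeholder inputs, for illustration only.
example_instances = [['instance-a'], ['instance-b']]
example_predictions = [False, True]

for instance, label in label_predictions(example_instances, example_predictions):
    print(instance, '->', label)
```

In the notebook you would pass `instances` and `response['predictions']` instead of the placeholder lists.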

# Step Six: Cleanup

To clean up and delete everything that you created in this tutorial, simply run the following commands:

In [None]:
# Delete the model version
!gcloud ml-engine versions delete $MODEL_VERSION --model=$MODEL_NAME --quiet

# Delete the model
!gcloud ml-engine models delete $MODEL_NAME --quiet

# Delete the bucket
!gsutil rm -r gs://$BUCKET_NAME