# Training and Serving CARET models using AI Platform Custom Containers and Cloud Run
## Overview

This notebook illustrates how to use [CARET](https://topepo.github.io/caret/) R package to build an ML model to estimate the baby's weight given a number of factors, using the [BigQuery natality dataset](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=samples&t=natality&page=table&_ga=2.99329886.-1705629017.1551465326&_gac=1.109796023.1561476396.CI2rz-z4hOMCFc6RhQods4oEXA). We use [AI Platform Training](https://cloud.google.com/ml-engine/docs/tensorflow/training-overview) with **Custom Containers** to train the TensorFlow model at scale. Rhen use the [Cloud Run](https://cloud.google.com/run/docs/) to serve the trained model as a Web API for online predictions.

R is one of the most widely used programming languages for statistical modeling, which has a large and active community of data scientists and ML professional. 
With over 10,000 packages in the open-source repository of CRAN, R caters to all statistical data analysis applications, ML, and visualisation.


## Dataset
The dataset used in this tutorial is natality data, which describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008, with more than 137 million records.
The dataset is available in [BigQuery public dataset](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=samples&t=natality&page=table&_ga=2.99329886.-1705629017.1551465326&_gac=1.109796023.1561476396.CI2rz-z4hOMCFc6RhQods4oEXA). We use the data extracted from BigQuery and stored as CSV in Cloud Storage (GCS) in the [Exploratory Data Analysis](01_EDA-with-R-and-BigQuery) notebook.

In this notebook, we focus on Exploratory Data Analysis, while the goal is to predict the baby's weight given a number of factors about the pregnancy and the baby's mother.

## Objective
The goal of this tutorial is to:
1. Create a CARET regression model
2. Train the CARET model using on AI Platform Training with custom R container
3. Implement a Web API wrapper to the trained model using Plumber R package
4. Build Docker container image for the prediction Web API
5. Deploy the prediction Web API container image model on Cloud Run
6. Invoke the deployed Web API for predictions.
7. Use the AI Platform Notebooks to drive the workflow.



## Costs
This tutorial uses billable components of Google Cloud Platform (GCP):
1. Create a TensorFlow premade Estimator trainer using R interface
2. Train and export the Estimator on AI Platform Training using the cloudml APIs
3. Deploy the exported model to AI Platform prediction using the cloudml APIs
4. Invoke the deployed model API for predictions.
5. Use the AI Platform Notebooks to drive the workflow.


Learn about GCP pricing, use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.


## 0. Setup

In [1]:
version

               _                           
platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          3                           
minor          5.1                         
year           2018                        
month          07                          
day            02                          
svn rev        74947                       
language       R                           
version.string R version 3.5.1 (2018-07-02)
nickname       Feather Spray               

Install and import the required libraries. 

This may take several minutes if not installed already...

In [2]:
install.packages(c("caret"))

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done


In [3]:
library(caret) # used to build a regression model

Loading required package: lattice
Loading required package: ggplot2


Set your `PROJECT_ID`, `BUCKET_NAME`, and `REGION`

In [4]:
# Set the project id
PROJECT_ID <- "r-on-gcp"

# Set yout GCS bucket
BUCKET_NAME <- "r-on-gcp"

# Set your training and model deployment region
REGION <- 'europe-west1'

## 1. Building a CARET Regression Model

### 1.1. Load data

If you run the [Exploratory Data Analysis](01_EDA-with-R-and-BigQuery) Notebook, you should have the **train_data.csv** and **eval_data.csv** files uploaded to GCS. You can download them to train your model locally using the following cell. However, if you have the files available locally, you can skip the following cell.

In [5]:
dir.create(file.path('data'), showWarnings = FALSE)
gcs_data_dir <- paste0("gs://", BUCKET_NAME, "/data/*_data.csv")
command <- paste("gsutil cp -r", gcs_data_dir, "data/")
print(command)
system(command, intern = TRUE)

[1] "gsutil cp -r gs://r-on-gcp/data/*_data.csv data/"


In [6]:
train_file <- "data/train_data.csv"
eval_file <- "data/eval_data.csv"
header <- c(
    "weight_pounds", 
    "is_male", "mother_age", "mother_race", "plurality", "gestation_weeks", 
    "mother_married", "cigarette_use", "alcohol_use", 
    "key")

target <- "weight_pounds"
key <- "key"
features <- setdiff(header, c(target, key))

train_data <- read.table(train_file, col.names = header, sep=",")
eval_data <- read.table(eval_file, col.names = header, sep=",")

### 1.2. Train the model
In this example, we will train an XGboost Tree model for regression.

In [7]:
trainControl <- trainControl(method = 'boot', number = 10)
hyper_parameters <- expand.grid(
    nrounds = 100,
    max_depth = 6,
    eta = 0.3,
    gamma = 0,
    colsample_bytree = 1,
    min_child_weight = 1,
    subsample = 1
)
  
print('Training the model...')

model <- train(
    y=train_data$weight_pounds, 
    x=train_data[, features], 
    preProc = c("center", "scale"),
    method='xgbTree', 
    trControl=trainControl,
    tuneGrid=hyper_parameters
)

print('Model is trained.')

[1] "Training the model..."
[1] "Model is trained."


### 1.2. Evaluate the model

In [8]:
eval(model)

eXtreme Gradient Boosting 

7708 samples
   8 predictor

Pre-processing: centered (4), scaled (4), ignore (4) 
Resampling: Bootstrapped (10 reps) 
Summary of sample sizes: 7708, 7708, 7708, 7708, 7708, 7708, ... 
Resampling results:

  RMSE      Rsquared   MAE      
  1.094446  0.2985824  0.8440141

Tuning parameter 'nrounds' was held constant at a value of 100
Tuning
 held constant at a value of 1
Tuning parameter 'subsample' was held
 constant at a value of 1

### 1.3. Save the trained model

In [9]:
model_dir <- "models"
model_name <- "caret_babyweight_estimator"

In [10]:
# Saving the trained model
dir.create(model_dir, showWarnings = FALSE)
dir.create(file.path(model_dir, model_name), showWarnings = FALSE)
saveRDS(model, file.path(model_dir, model_name, "trained_model.rds"))

### 1.4. Implementing a model prediction function
This is an implementation of wrapper function to the model to perform prediction. The function expects a list of instances in a JSON format, and returns a list of predictions (estimated weights). This prediction function implementation will be used when serving the model as a Web API for online predictions. 

In [11]:
xgbtree <- readRDS(file.path(model_dir, model_name, "trained_model.rds"))

estimate_babyweights <- function(instances_json){
    library("rjson")
    instances <- jsonlite::fromJSON(instances_json)
    df_instances <- data.frame(instances)
    # fix data types
    boolean_columns <- c("is_male", "mother_married", "cigarette_use", "alcohol_use")
    for(col in boolean_columns){
        df_instances[[col]] <- as.logical(df_instances[[col]])
    }
    
    estimates <- predict(xgbtree, df_instances)
    return(estimates) 
}

instances_json <- '
[
    {
        "is_male": "TRUE",
        "mother_age": 28,
        "mother_race": 8,
        "plurality": 1,
        "gestation_weeks":  28,
        "mother_married": "TRUE",
        "cigarette_use": "FALSE",
        "alcohol_use": "FALSE"
     },
    {
        "is_male": "FALSE",
        "mother_age": 38,
        "mother_race": 18,
        "plurality": 1,
        "gestation_weeks":  28,
        "mother_married": "TRUE",
        "cigarette_use": "TRUE",
        "alcohol_use": "TRUE"
     }
]
'

estimate <- round(estimate_babyweights(instances_json), digits = 2)
print(paste("Estimated weight(s):", estimate))

[1] "Estimated weight(s): 4.5"  "Estimated weight(s): 2.57"


## 3. Submit a Training Job to AI Platform with Custom Containers
In order to train your CARET model in at scale using AI Platform Training, you need to implement your training logic in an R script file, containerize it in a Docker image, and submit the Docker image to AI Platform Training.

The [src/caret/training](src/caret/training) directory includes the following code files:
1. [model_trainer.R](src/caret/training/model_trainer.R) - This is the implementation of the CARET model training logic.
1. [Dockerfile](src/caret/training/Dockerfile) - This is the definition of the Docker container image to run the **model_trainer.R** script.

To submit the training job with the custom container to AI Platform, you need to do the following steps:
1. set your PROJECT_ID and BUCKET_NAME in training/model_trainer.R, and PROJECT_ID in training/Dockerfile so that the first line reads "FROM gcr.io/[PROJECT_ID]/caret_base"
2. **Build** a Docker container image with that runs the model_trainer.R
3. **Push** the Docker container image to **Container Registry**.
4. **Submit** an **AI Platform Training** job with the **custom container**.

### 3.1. Build and Push the Docker container image.
#### A - Build base image
This can take several minutes ...

In [13]:
# Create base image
base_image_url <- paste0("gcr.io/", PROJECT_ID, "/caret_base")
print(base_image_url)

setwd("src/caret")
getwd()

print("Building the base Docker container image...")
command <- paste0("docker build -f Dockerfile --tag ", base_image_url, " ./")
print(command)
system(command, intern = TRUE)

print("Pushing the baseDocker container image...")
command <- paste0("gcloud docker -- push ", base_image_url)
print(command)
system(command, intern = TRUE)

setwd("../..")
getwd()

[1] "gcr.io/r-on-gcp/caret_base"


[1] "Building the base Docker container image..."
[1] "docker build -f Dockerfile --tag gcr.io/r-on-gcp/caret_base ./"


[1] "Pushing the baseDocker container image..."
[1] "gcloud docker -- push gcr.io/r-on-gcp/caret_base"


#### B - Build trainer image

In [14]:
training_image_url <- paste0("gcr.io/", PROJECT_ID, "/", model_name, "_training")
print(training_image_url)

setwd("src/caret/training")
getwd()

print("Building the Docker container image...")
command <- paste0("docker build -f Dockerfile --tag ", training_image_url, " ./")
print(command)
system(command, intern = TRUE)

print("Pushing the Docker container image...")
command <- paste0("gcloud docker -- push ", training_image_url)
print(command)
system(command, intern = TRUE)

setwd("../../..")
getwd()

[1] "gcr.io/r-on-gcp/caret_babyweight_estimator_training"


[1] "Building the Docker container image..."
[1] "docker build -f Dockerfile --tag gcr.io/r-on-gcp/caret_babyweight_estimator_training ./"


[1] "Pushing the Docker container image..."
[1] "gcloud docker -- push gcr.io/r-on-gcp/caret_babyweight_estimator_training"


#### C- Verifying uploaded images to Container Registry

In [15]:
command <- paste0("gcloud container images list --repository=gcr.io/", PROJECT_ID)
system(command, intern = TRUE)

### 3.2. Submit an AI Plaform Training job with the custom container. 

In [16]:
job_name <- paste0("train_caret_contrainer_", format(Sys.time(), "%Y%m%d_%H%M%S"))

command = paste0("gcloud beta ml-engine jobs submit training ", job_name, 
  " --master-image-uri=", training_image_url,
  " --scale-tier=BASIC", 
  " --region=", REGION
)
print(command)

system(command, intern = TRUE)

[1] "gcloud beta ml-engine jobs submit training train_caret_contrainer_20190725_131432 --master-image-uri=gcr.io/r-on-gcp/caret_babyweight_estimator_training --scale-tier=BASIC --region=europe-west1"


Verify the trained model in GCS after the job finishes

In [18]:
model_name <- 'caret_babyweight_estimator'
gcs_model_dir <- paste0("gs://", BUCKET_NAME, "/models/", model_name)
command <- paste0("gsutil ls ", gcs_model_dir)
system(command, intern = TRUE)

## 4. Deploy the trained model to Cloud Run
In order to serve the trained CARET model as a Web API, you need to wrap it with a prediction function, as serve this prediction function as a REST API. Then you containerize this Web API and deploy it in Cloud Run.

The [src/caret/serving](src/caret/serving) directory includes the following code files:
1. [model_prediction.R](src/caret/serving/model_prediction.R) - This script downloads the trained model from GCS and loads (only once). It includes **estimate** function, which accepts instances in JSON format, and return the of baby weight estimate for each instance.
2. [model_api.R](src/caret/serving/model_prediction.R) - This is a [plumber](https://www.rplumber.io/) Web API that runs  **model_prediction.R**.
3. [Dockerfile](src/caret/serving/Dockerfile) - This is the definition of Docker container image that runs the **model_api.R**

To deploy the prediction Web API to Cloud Run, you need to do the following steps:
1. set your PROJECT_ID and BUCKET_NAME in serving/model_prediction.R, and PROJECT_ID in serving/Dockerfile so that the first line reads "FROM gcr.io/[PROJECT_ID]/caret_base"
2. **Build** the Docker container image for the prediction API.
3. **Push** the Docker container image to **Cloud Registry**.
4. Enable the Cloud Run API if not enabled yet, click "Enable" at https://console.developers.google.com/apis/api/run.googleapis.com/overview .
5. **Deploy** the Docker container to **Cloud Run**. 



### (Optional) 4.0. Upload the trained model to GCS
If you train your model using the model_trainer.R in AI Platform, it will upload the saved model to GCS. However, if you only train your model locally and have your saved model locally, you need to upload it to GCS.

In [None]:
model_name <- 'caret_babyweight_estimator'
gcs_model_dir = paste0("gs://", BUCKET_NAME, "/models/", model_name, "/")
command <- paste0("gsutil cp -r models/", model_name ,"/* ",gcs_model_dir)
print(command)
system(command, intern = TRUE)

### 4.1. Build and Push prediction Docker container image

In [19]:
serving_image_url <- paste0("gcr.io/", PROJECT_ID, "/", model_name, "_serving")
print(serving_image_url)

setwd("src/caret/serving")
getwd()

print("Building the Docker container image...")
command <- paste0("docker build -f Dockerfile --tag ", serving_image_url, " ./")
print(command)
system(command, intern = TRUE)

print("Pushing the Docker container image...")
command <- paste0("gcloud docker -- push ", serving_image_url)
print(command)
system(command, intern = TRUE)

setwd("../../..")
getwd()

[1] "gcr.io/r-on-gcp/caret_babyweight_estimator_serving"


[1] "Building the Docker container image..."
[1] "docker build -f Dockerfile --tag gcr.io/r-on-gcp/caret_babyweight_estimator_serving ./"


[1] "Pushing the Docker container image..."
[1] "gcloud docker -- push gcr.io/r-on-gcp/caret_babyweight_estimator_serving"


In [20]:
command <- paste0("gcloud container images list --repository=gcr.io/", PROJECT_ID)
system(command, intern = TRUE)

### 4.2. Deploy prediction container to Cloud Run

In [None]:
service_name <- "caret-babyweight-estimator"
command <- paste(
    "gcloud beta run deploy", service_name,
    "--image", serving_image_url,
    "--platform managed",
    "--allow-unauthenticated",
    "--region", REGION
)

print(command)
system(command, intern = TRUE)

## 5. Invoke the Model API for Predictions

When the **caret-babyweight-estimator** service is deployed to Cloud Run:
1. Go to Cloud Run in the [Cloud Console](https://console.cloud.google.com/run/).
2. Select the **caret-babyweight-estimator** service.
3. Copy the service URL, and use it to update the **url** variable in the following cell.

In [21]:
# Update to the deployed service URL
url <- "https://caret-babyweight-estimator-lbcii4x34q-uc.a.run.app/"
endpoint <- "estimate"

In [22]:
instances_json <- '
[
    {
        "is_male": "TRUE",
        "mother_age": 28,
        "mother_race": 8,
        "plurality": 1,
        "gestation_weeks":  28,
        "mother_married": "TRUE",
        "cigarette_use": "FALSE",
        "alcohol_use": "FALSE"
     },
    {
        "is_male": "FALSE",
        "mother_age": 38,
        "mother_race": 18,
        "plurality": 1,
        "gestation_weeks":  28,
        "mother_married": "TRUE",
        "cigarette_use": "TRUE",
        "alcohol_use": "TRUE"
     }
]
'

In [23]:
library("httr")
full_url <- paste0(url, endpoint)
response <- POST(full_url, body = instances_json)
estimates <- content(response)
print(paste("Estimated weight(s):", estimate))


Attaching package: ‘httr’

The following object is masked from ‘package:caret’:

    progress



[1] "Estimated weight(s): 4.5"  "Estimated weight(s): 2.57"


# License

Authors: Daniel Sparing & Khalid Salama

---
**Disclaimer**: This is not an official Google product. The sample code provided for an educational purpose.

---

Copyright 2019 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.