## Lab: Continuous Training with TensorFlow, PyTorch, XGBoost, and Scikit-learn Models with KubeFlow and AI Platform Pipelines

In this lab we will create containerized training applications for ML models in TensorFlow, PyTorch, XGBoost, and Scikit-learn. We will then use these images as ops in a KubeFlow pipeline and train multiple models in parallel. Finally, we will set up recurring runs of our KubeFlow pipeline in the UI.

First, we will containerize models in TF, PyTorch, XGBoost and Scikit-learn following a step-wise process for each:
* Create the training script
* Package training script into a Docker Image 
* Build and push training image to Google Cloud Container Registry

Once we have all four training images built and pushed to the Container Registry, we will build a KubeFlow pipeline that does two things:
* Queries BigQuery to create training/validation splits and export results as sharded CSV files in GCS
* Launches AI Platform training jobs with our four containerized training applications, using the exported CSV data as input 

Finally, we will compile and deploy our pipeline. In the UI we will set up Continuous Training with recurring pipeline runs.

**PRIOR TO STARTING THE LAB:** Make sure you have created a new AI Platform Pipelines instance. Once the GKE cluster is up, copy the endpoint; you will need it later in this lab.

Install the required package:

In [None]:
%pip install --upgrade google-cloud-bigquery-storage

Start by setting some global variables. Make sure you have already created a bucket named gs://PROJECT_ID, where PROJECT_ID is your project ID.

In [None]:
REGION = 'us-central1'
PROJECT_ID = !(gcloud config get-value core/project)
PROJECT_ID = PROJECT_ID[0]
BUCKET = 'gs://' + PROJECT_ID 

First, create a BigQuery dataset. We will then query a public BigQuery dataset to populate a table in it. This is census data: we will use age, workclass, education level (education_num), occupation, and hours per week to predict income bracket. Note: The query also selects functional_weight. We do not use this feature in our models; we only hash on it when creating the training/validation splits.

In [None]:
!bq --location=US mk census

In [None]:
%%bigquery
    
CREATE OR REPLACE TABLE census.data AS
    
SELECT age, workclass, education_num, occupation, hours_per_week,income_bracket,functional_weight 
FROM `bigquery-public-data.ml_datasets.census_adult_income` 
WHERE AGE IS NOT NULL
AND workclass IS NOT NULL
AND education_num IS NOT NULL
AND occupation IS NOT NULL
AND hours_per_week IS NOT NULL
AND income_bracket IS NOT NULL 
AND functional_weight IS NOT NULL
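
The hashing idea in the note above can be sketched in plain Python. This is only an illustration, not what BigQuery runs: it substitutes `hashlib.sha256` for `FARM_FINGERPRINT` (so the bucket numbers will differ from BigQuery's), `assign_split` is a hypothetical helper, and the sample weights are made up. The 80 and 90 thresholds mirror the ones used in the `get_query` function later in this lab.

```python
import hashlib


def assign_split(functional_weight):
    """Deterministically map a functional_weight value to a data split.

    Assumption: sha256 stands in for BigQuery's FARM_FINGERPRINT, so the
    exact bucket numbers differ from what BigQuery would compute.
    """
    digest = hashlib.sha256(str(functional_weight).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # analogous to MOD(ABS(hash), 100)
    if bucket < 80:
        return "training"
    if bucket < 90:
        return "validation"
    return "test"


# Hypothetical functional_weight values, for illustration only.
weights = [77516, 83311, 215646, 234721, 338409]
splits = {w: assign_split(w) for w in weights}
for weight, split in splits.items():
    print(weight, split)
```

Because the split depends only on the hashed value, a row with a given functional_weight always lands in the same split, no matter how many times the query is re-run.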

### Create Scikit-learn Training Script

We will develop our first training script with Scikit-learn. We will use Pandas to read the CSV data then train a simple [SGD Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html).

In [None]:
!mkdir scikit_trainer_image

In [21]:
%%writefile ./scikit_trainer_image/train.py

"""Census Scikit-learn classifier trainer script."""



import pickle
import subprocess
import sys
import datetime
import os

import fire
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler


def train_evaluate(training_dataset_path, validation_dataset_path, output_dir):
    """Trains the Census Classifier model."""
    
    # Ingest data into Pandas Dataframes 
    df_train = pd.read_csv(training_dataset_path)
    df_validation = pd.read_csv(validation_dataset_path)
    df_train = pd.concat([df_train, df_validation])
    
    numeric_features = [
        'age', 'education_num','hours_per_week'
    ]
    
    categorical_features = ['workclass', 'occupation']
    
    # Scale numeric features, one-hot encode categorical features
    preprocessor = ColumnTransformer(transformers=[(
        'num', StandardScaler(),
        numeric_features),
        ('cat', OneHotEncoder(), categorical_features)])
    
    pipeline = Pipeline([('preprocessor', preprocessor),
                         ('classifier', SGDClassifier(loss='log'))])
    
    num_features_type_map = {feature: 'float64' for feature in numeric_features}
    df_train = df_train.astype(num_features_type_map)
    df_validation = df_validation.astype(num_features_type_map)
    
    X_train = df_train.drop('income_bracket', axis=1)
    y_train = df_train['income_bracket']
    
    # Set parameters of the model and fit
    pipeline.set_params(classifier__alpha=0.0005, classifier__max_iter=250)
    pipeline.fit(X_train, y_train)
    
    # Save the model locally
    model_filename = 'model.pkl'
    with open(model_filename, 'wb') as model_file:
        pickle.dump(pipeline, model_file)
        
    # Copy the model to GCS
    EXPORT_PATH = os.path.join(
        output_dir, datetime.datetime.now().strftime("%Y%m%d%H%M%S"))
    
    gcs_model_path = '{}/{}'.format(EXPORT_PATH, model_filename)
    subprocess.check_call(['gsutil', 'cp', model_filename, gcs_model_path])
    print('Saved model in: {}'.format(gcs_model_path))


if __name__ == '__main__':
    fire.Fire(train_evaluate)
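
Each training script in this lab saves its model under a timestamped subdirectory of `output_dir`. Here is a minimal sketch of how that path is assembled. It assumes a placeholder bucket `gs://my-project`, uses a hypothetical helper `make_export_path`, and uses `posixpath.join` so the result is the same on any operating system (the script's `os.path.join` behaves the same way inside a Linux container).

```python
import datetime
import posixpath


def make_export_path(output_dir, now=None):
    """Return output_dir/<YYYYmmddHHMMSS>, mirroring EXPORT_PATH in train.py."""
    now = now or datetime.datetime.now()
    return posixpath.join(output_dir, now.strftime("%Y%m%d%H%M%S"))


# A fixed timestamp keeps the example reproducible.
fixed = datetime.datetime(2020, 9, 1, 17, 30, 5)
export_path = make_export_path("gs://my-project/census/models/scikit", fixed)
gcs_model_path = "{}/{}".format(export_path, "model.pkl")
print(gcs_model_path)
# gs://my-project/census/models/scikit/20200901173005/model.pkl
```

Since every run writes to a new timestamped directory, recurring pipeline runs do not overwrite earlier models.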



### Package Scikit-learn Training Script into a Docker Image
The next step is to package this training script into a Docker Image. We need to be sure to list the dependencies in this Dockerfile. For the Scikit-learn model we pin scikit-learn to version 0.23.2 and Pandas to version 1.1.1.

In [None]:
%%writefile ./scikit_trainer_image/Dockerfile

FROM gcr.io/deeplearning-platform-release/base-cpu
RUN pip install -U fire scikit-learn==0.23.2 pandas==1.1.1
WORKDIR /app
COPY train.py .

ENTRYPOINT ["python", "train.py"]

### Build the scikit-learn trainer image

Now we will use Cloud Build to build the image and push it to your project's Container Registry. Because the build runs on the remote cloud service, we don't need a local installation of Docker. Note: Building and pushing the image will take a few minutes. Since we will be building and pushing four different images in this lab, I suggest using the wait to take a detailed look at the training scripts and make sure you understand the data ingestion, model building, and training code for the frameworks you develop with.

In [None]:
SCIKIT_IMAGE_NAME='scikit_trainer_image'
SCIKIT_IMAGE_TAG='latest'
SCIKIT_IMAGE_URI='gcr.io/{}/{}:{}'.format(PROJECT_ID, SCIKIT_IMAGE_NAME, SCIKIT_IMAGE_TAG)

In [None]:
!gcloud builds submit --tag $SCIKIT_IMAGE_URI $SCIKIT_IMAGE_NAME

### Create TensorFlow Training Script
One down, three to go! Now we will develop a TensorFlow training script. We will use the tf.data API to ingest the data from CSVs then build/train a neural network with the tf.keras Functional API. 

In [None]:
!mkdir tensorflow_trainer_image

In [None]:
%%writefile ./tensorflow_trainer_image/train.py

"""Census Tensorflow classifier trainer script."""

import pickle
import subprocess
import sys
import fire
import pandas as pd
import tensorflow as tf
import datetime
import os

CSV_COLUMNS = ["age",
               "workclass",
               "education_num",
               "occupation",
               "hours_per_week",
               "income_bracket"]

# Add string name for label column
LABEL_COLUMN = "income_bracket"

# Set default values for each CSV column as a list of lists.
# workclass, occupation, and the income_bracket label are strings.
DEFAULTS = [[18], ["?"], [4], ["?"], [20],["<=50K"]]

def features_and_labels(row_data):
    cols = tf.io.decode_csv(row_data, record_defaults=DEFAULTS)
    feats = {
        'age': tf.reshape(cols[0], [1,]),
        'workclass': tf.reshape(cols[1],[1,]),
        'education_num': tf.reshape(cols[2],[1,]),
        'occupation': tf.reshape(cols[3],[1,]),
        'hours_per_week': tf.reshape(cols[4],[1,]),
        'income_bracket': cols[5]
    }
    label = feats.pop('income_bracket')
    label_int = tf.case([(tf.math.equal(label,tf.constant([' <=50K'])), lambda: 0),
                        (tf.math.equal(label,tf.constant([' >50K'])), lambda: 1)])
    
    return feats, label_int

def load_dataset(pattern, batch_size=1, mode='eval'):
    # Make a CSV dataset
    filelist = tf.io.gfile.glob(pattern)
    dataset = tf.data.TextLineDataset(filelist).skip(1)
    dataset = dataset.map(features_and_labels)

    # Shuffle and repeat for training
    if mode == 'train':
        dataset = dataset.shuffle(buffer_size=10*batch_size).batch(batch_size).repeat()
    else:
        dataset = dataset.batch(10)

    return dataset

def train_evaluate(training_dataset_path, validation_dataset_path, batch_size, num_train_examples, num_evals, output_dir):
    inputs = {
        'age': tf.keras.layers.Input(name='age',shape=[None],dtype='int32'),
        'workclass': tf.keras.layers.Input(name='workclass',shape=[None],dtype='string'),
        'education_num': tf.keras.layers.Input(name='education_num',shape=[None],dtype='int32'),
        'occupation': tf.keras.layers.Input(name='occupation',shape=[None],dtype='string'),
        'hours_per_week': tf.keras.layers.Input(name='hours_per_week',shape=[None],dtype='int32')
    }
    
    batch_size = int(batch_size)
    num_train_examples = int(num_train_examples)
    num_evals = int(num_evals)
    
    feat_cols = {
        'age': tf.feature_column.numeric_column('age'),
        'workclass': tf.feature_column.indicator_column(
            tf.feature_column.categorical_column_with_hash_bucket(
                key='workclass', hash_bucket_size=100
            )
        ),
        'education_num': tf.feature_column.numeric_column('education_num'),
        'occupation': tf.feature_column.indicator_column(
            tf.feature_column.categorical_column_with_hash_bucket(
                key='occupation', hash_bucket_size=100
            )
        ),
        'hours_per_week': tf.feature_column.numeric_column('hours_per_week')
    }
    
    dnn_inputs = tf.keras.layers.DenseFeatures(
        feature_columns=feat_cols.values())(inputs)
    h1 = tf.keras.layers.Dense(64, activation='relu')(dnn_inputs)
    h2 = tf.keras.layers.Dense(128, activation='relu')(h1)
    h3 = tf.keras.layers.Dense(64, activation='relu')(h2)
    output = tf.keras.layers.Dense(1, activation='sigmoid')(h3)
    
    model = tf.keras.models.Model(inputs=inputs,outputs=output)
    model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
    
    trainds = load_dataset(
        pattern=training_dataset_path,
        batch_size=batch_size,
        mode='train')
    
    evalds = load_dataset(
        pattern=validation_dataset_path,
        mode='eval')
    
    
    steps_per_epoch = num_train_examples // (batch_size * num_evals)
    
    history = model.fit(
        trainds,
        validation_data=evalds,
        validation_steps=100,
        epochs=num_evals,
        steps_per_epoch=steps_per_epoch
    )
    
    EXPORT_PATH = os.path.join(
        output_dir, datetime.datetime.now().strftime("%Y%m%d%H%M%S"))
    tf.saved_model.save(
        obj=model, export_dir=EXPORT_PATH)  # with default serving function
    
    print("Exported trained model to {}".format(EXPORT_PATH))
    
if __name__ == '__main__':
    fire.Fire(train_evaluate)
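
To see how `steps_per_epoch` works out in the script above, here is the arithmetic for the values that the pipeline later passes to this trainer (`batch_size` 32, `num_train_examples` 1000, `num_evals` 10). The names `examples_per_epoch` and `total_examples_seen` are introduced here for illustration only.

```python
batch_size = 32
num_train_examples = 1000
num_evals = 10

# Same formula as in train_evaluate: each "epoch" is one slice of the
# training budget, followed by an evaluation pass.
steps_per_epoch = num_train_examples // (batch_size * num_evals)
examples_per_epoch = steps_per_epoch * batch_size
total_examples_seen = examples_per_epoch * num_evals

print(steps_per_epoch)      # 3
print(examples_per_epoch)   # 96
print(total_examples_seen)  # 960
```

Integer division rounds down, so with these settings the model sees 960 examples rather than the full 1000. If `batch_size * num_evals` exceeds `num_train_examples`, `steps_per_epoch` becomes 0, so choose the three values together.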

### Package TensorFlow Training Script into a Docker Image
Note that the dependencies in this Dockerfile differ from those in the Scikit-learn one.

In [None]:
%%writefile ./tensorflow_trainer_image/Dockerfile

FROM gcr.io/deeplearning-platform-release/base-cpu
RUN pip install -U fire tensorflow==2.1.1
WORKDIR /app
COPY train.py .

ENTRYPOINT ["python", "train.py"]

### Build the Tensorflow Trainer Image 
Build the image and push it to your project's container registry. Again, this will take a few minutes. 

In [None]:
TF_IMAGE_NAME='tensorflow_trainer_image'
TF_IMAGE_TAG='latest'
TF_IMAGE_URI='gcr.io/{}/{}:{}'.format(PROJECT_ID, TF_IMAGE_NAME, TF_IMAGE_TAG)

In [None]:
!gcloud builds submit --tag $TF_IMAGE_URI $TF_IMAGE_NAME

### Create PyTorch Training Script
Two down, two to go! Now we will develop a PyTorch training script. We will use Pandas DataFrames combined with PyTorch's Dataset and DataLoader to ingest the data from CSVs, build a model with torch.nn, and write a training loop to train this model.

In [None]:
!mkdir pytorch_trainer_image

In [None]:
%%writefile ./pytorch_trainer_image/train.py

import os 
import subprocess
import datetime
import fire

import torch 
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import pandas as pd 
import numpy as np
from sklearn.preprocessing import StandardScaler

class TrainData(Dataset):
    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data
        
    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]
        
    def __len__ (self):
        return len(self.X_data)
    
class BinaryClassifier(nn.Module):
    def __init__(self):
        super(BinaryClassifier, self).__init__()
        # 27 input features
        self.h1 = nn.Linear(27, 64) 
        self.h2 = nn.Linear(64, 64)
        self.output_layer = nn.Linear(64, 1) 
        
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.1)
        self.batchnorm1 = nn.BatchNorm1d(64)
        self.batchnorm2 = nn.BatchNorm1d(64)
        
    def forward(self, inputs):
        x = self.relu(self.h1(inputs))
        x = self.batchnorm1(x)
        x = self.relu(self.h2(x))
        x = self.batchnorm2(x)
        x = self.dropout(x)
        x = self.output_layer(x)
        
        return x

def binary_acc(y_pred, y_true):
    """Calculates accuracy"""
    y_pred_tag = torch.round(torch.sigmoid(y_pred))

    correct_results_sum = (y_pred_tag == y_true).sum().float()
    acc = correct_results_sum/y_true.shape[0]
    acc = torch.round(acc * 100)
    
    return acc

def train_evaluate(training_dataset_path, validation_dataset_path, batch_size, num_epochs, output_dir):
    
    batch_size = int(batch_size)
    num_epochs = int(num_epochs)
    
    # Read in train/validation data and concat 
    df_train = pd.read_csv(training_dataset_path)
    df_validation = pd.read_csv(validation_dataset_path)
    df = pd.concat([df_train, df_validation])

    categorical_features = ['workclass', 'occupation']
    target='income_bracket'

    # One-hot encode categorical variables 
    df = pd.get_dummies(df,columns=categorical_features)

    # Change label to 0 if <=50K, 1 if >50K
    df[target] = df[target].apply(lambda x: 0 if x==' <=50K' else 1)

    # Split features and labels into 2 different vars
    X_train = df.loc[:, df.columns != target]
    y_train = np.array(df[target])

    # Normalize features 
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)

    # Training data
    train_data = TrainData(torch.FloatTensor(X_train), 
                           torch.FloatTensor(y_train))

    # Use torch DataLoader to feed data to model 
    train_loader = DataLoader(dataset=train_data, batch_size=batch_size, drop_last=True)

    # Instantiate model 
    model = BinaryClassifier()
    
    # Loss is binary crossentropy w/ logits. Must manually implement sigmoid for inference
    criterion = nn.BCEWithLogitsLoss()
    
    # Adam optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    model.train()
    for e in range(1, num_epochs+1):
        epoch_loss = 0
        epoch_acc = 0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()

            y_pred = model(X_batch)

            loss = criterion(y_pred, y_batch.unsqueeze(1))
            acc = binary_acc(y_pred, y_batch.unsqueeze(1))

            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
            epoch_acc += acc.item()


        print(f'Epoch {e}: Loss = {epoch_loss/len(train_loader):.5f} | Acc = {epoch_acc/len(train_loader):.3f}')

    # Save the model locally
    model_filename='model.pt'
    torch.save(model.state_dict(), model_filename)

    EXPORT_PATH = os.path.join(
        output_dir, datetime.datetime.now().strftime("%Y%m%d%H%M%S"))

    # Copy the model to GCS
    gcs_model_path = '{}/{}'.format(EXPORT_PATH, model_filename)
    subprocess.check_call(['gsutil', 'cp', model_filename, gcs_model_path])
    print('Saved model in: {}'.format(gcs_model_path))
    
if __name__ == '__main__':
    fire.Fire(train_evaluate)
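
Because the model outputs raw logits (`BCEWithLogitsLoss` applies the sigmoid internally), the accuracy calculation has to apply the sigmoid itself, as `binary_acc` does. The sketch below reproduces that calculation in plain Python, without PyTorch, for a few made-up logits. `binary_acc_py` is a hypothetical name, and the logits and labels are invented for illustration.

```python
import math


def sigmoid(z):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_acc_py(logits, labels):
    """Percentage of logits whose thresholded sigmoid matches the 0/1 label."""
    preds = [1 if sigmoid(z) > 0.5 else 0 for z in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return round(100.0 * correct / len(labels))


logits = [2.0, -1.5, 0.3, -0.2]  # made-up model outputs
labels = [1, 0, 0, 0]            # made-up ground truth
print(binary_acc_py(logits, labels))  # 75
```

A probability above 0.5 corresponds to a logit above 0, so the third example (logit 0.3) is predicted as class 1 and counted as the single miss.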

### Package PyTorch Training Script into a Docker Image
Note the dependencies: besides PyTorch, the training script imports scikit-learn (for StandardScaler) and Pandas.

In [None]:
%%writefile ./pytorch_trainer_image/Dockerfile

FROM gcr.io/deeplearning-platform-release/base-cpu
RUN pip install -U fire torch==1.6.0 scikit-learn==0.23.2 pandas==1.1.1
WORKDIR /app
COPY train.py .

ENTRYPOINT ["python", "train.py"]

### Build the PyTorch Trainer Image 
Build and push the PyTorch training image to your project's Container Registry. Again, this will take a few minutes.

In [None]:
TORCH_IMAGE_NAME='pytorch_trainer_image'
TORCH_IMAGE_TAG='latest'
TORCH_IMAGE_URI='gcr.io/{}/{}:{}'.format(PROJECT_ID, TORCH_IMAGE_NAME, TORCH_IMAGE_TAG)

In [None]:
!gcloud builds submit --tag $TORCH_IMAGE_URI $TORCH_IMAGE_NAME

### Create XGBoost Training Script
Three down, one to go! Create the final training script. This script will ingest and preprocess data with Pandas, then train a [Gradient Boosted Tree model](https://en.wikipedia.org/wiki/Gradient_boosting). 

In [None]:
!mkdir xgboost_trainer_image

In [None]:
%%writefile ./xgboost_trainer_image/train.py

import os 
import subprocess
import datetime
import fire
import pickle 

import pandas as pd 
import numpy as np
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

def train_evaluate(training_dataset_path, validation_dataset_path, max_depth, n_estimators, output_dir):
    
    df_train = pd.read_csv(training_dataset_path)
    df_validation = pd.read_csv(validation_dataset_path)
    df = pd.concat([df_train, df_validation])

    categorical_features = ['workclass', 'occupation']
    target='income_bracket'

    # One-hot encode categorical variables 
    df = pd.get_dummies(df,columns=categorical_features)

    # Change label to 0 if <=50K, 1 if >50K
    df[target] = df[target].apply(lambda x: 0 if x==' <=50K' else 1)

    # Split features and labels into 2 different vars
    X_train = df.loc[:, df.columns != target]
    y_train = np.array(df[target])

    # Normalize features 
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    
    grid = {
        'max_depth': int(max_depth),
        'n_estimators': int(n_estimators)
    }
    
    model = XGBClassifier()
    model.set_params(**grid)
    model.fit(X_train,y_train)
    
    model_filename = 'xgb_model.pkl'
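    # Use a context manager so the file is flushed and closed before upload
    with open(model_filename, 'wb') as model_file:
        pickle.dump(model, model_file)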
        
    EXPORT_PATH = os.path.join(
        output_dir, datetime.datetime.now().strftime("%Y%m%d%H%M%S"))
    
    gcs_model_path = '{}/{}'.format(EXPORT_PATH, model_filename)
    subprocess.check_call(['gsutil', 'cp', model_filename, gcs_model_path])
    print('Saved model in: {}'.format(gcs_model_path))  

if __name__ == '__main__':
    fire.Fire(train_evaluate)
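
Both the PyTorch and XGBoost scripts compare the label against `' <=50K'`, with a leading space, and map every other value to 1. As a hedged alternative, the sketch below strips whitespace first and rejects unexpected values instead of silently treating them as the positive class. `encode_label` is a hypothetical helper and is not part of the lab's scripts.

```python
def encode_label(raw):
    """Map an income_bracket string to 0 (<=50K) or 1 (>50K)."""
    label = raw.strip()  # tolerate leading/trailing whitespace
    if label == "<=50K":
        return 0
    if label == ">50K":
        return 1
    raise ValueError("Unexpected income_bracket value: {!r}".format(raw))


print([encode_label(x) for x in [" <=50K", " >50K", "<=50K"]])  # [0, 1, 0]
```

With Pandas this could be applied as `df[target].apply(encode_label)` in place of the lambda.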

### Package XGBoost Training Script into a Docker Image
Note the dependencies: besides XGBoost, the training script imports scikit-learn (for StandardScaler) and Pandas.

In [None]:
%%writefile ./xgboost_trainer_image/Dockerfile

FROM gcr.io/deeplearning-platform-release/base-cpu
RUN pip install -U fire scikit-learn==0.23.2 pandas==1.1.1 xgboost==1.2.0
WORKDIR /app
COPY train.py .

ENTRYPOINT ["python", "train.py"]

### Build the XGBoost Trainer Image
Build and push the XGBoost training image. This will take a few minutes (this is the last one, woohoo!) 

In [None]:
XGB_IMAGE_NAME='xgboost_trainer_image'
XGB_IMAGE_TAG='latest'
XGB_IMAGE_URI='gcr.io/{}/{}:{}'.format(PROJECT_ID, XGB_IMAGE_NAME, XGB_IMAGE_TAG)

In [None]:
!gcloud builds submit --tag $XGB_IMAGE_URI $XGB_IMAGE_NAME

## Develop KubeFlow Pipeline
Now that you have all four of your training applications as containers in your project's Container Registry, let's build a KubeFlow pipeline. 

The KubeFlow pipeline will have two BigQuery Ops. We will use the pre-built BigQuery Query component (no need to reinvent the wheel) to do the following: 
* Create a training split in our data and export to CSV
* Create a validation split in our data and export to CSV

The output of these BigQuery Ops will be the input to four AI Platform Training Ops. For this we will also use a pre-built component. Each AI Platform Training Op will train one of our containerized models: TensorFlow, PyTorch, XGBoost, and Scikit-learn.

THERE ARE TODOs IN THE FOLLOWING CODE

In [None]:
!mkdir pipeline

In [None]:
%%writefile ./pipeline/census_training_pipeline.py

import os
import kfp
from kfp.dsl.types import GCPProjectID
from kfp.dsl.types import GCPRegion
from kfp.dsl.types import GCSPath
from kfp.dsl.types import String
from kfp.gcp import use_gcp_secret
import kfp.components as comp
import kfp.dsl as dsl
import kfp.gcp as gcp
import json

# We will use environment vars to set the trainer image names and bucket name
TF_TRAINER_IMAGE = os.getenv('TF_TRAINER_IMAGE')
SCIKIT_TRAINER_IMAGE = os.getenv('SCIKIT_TRAINER_IMAGE')
TORCH_TRAINER_IMAGE = os.getenv('TORCH_TRAINER_IMAGE')
XGB_TRAINER_IMAGE = os.getenv('XGB_TRAINER_IMAGE')
BUCKET = os.getenv('BUCKET')

# Paths to export the training/validation data from bigquery
TRAINING_OUTPUT_PATH = BUCKET + '/census/data/training.csv'
VALIDATION_OUTPUT_PATH = BUCKET + '/census/data/validation.csv'

COMPONENT_URL_SEARCH_PREFIX = 'https://raw.githubusercontent.com/kubeflow/pipelines/0.2.5/components/gcp/'

# Create component factories
component_store = kfp.components.ComponentStore(
    local_search_paths=None, url_search_prefixes=[COMPONENT_URL_SEARCH_PREFIX])

# TODO: Load BigQuery and AI Platform Training ops from component_store
# as bigquery_query_op and mlengine_train_op 

bigquery_query_op = 
mlengine_train_op = 

def get_query(dataset='training'):
    """Function that returns either training or validation query"""
    if dataset=='training':
        split = "MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 100) < 80"
    elif dataset=='validation':
        split = """MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 100) >= 80 
        AND MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 100) < 90"""
    else:
        split = "MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 100) >= 90"
        
    query = """SELECT age, workclass, education_num, occupation, hours_per_week,income_bracket 
    FROM census.data 
    WHERE {0}""".format(split)
    
    return query

# We will use the training/validation queries as inputs to our pipeline.
# This lets us change the training/validation datasets simply by
# changing the query.
TRAIN_QUERY = get_query(dataset='training')
VALIDATION_QUERY=get_query(dataset='validation')

@dsl.pipeline(
    name='Continuous Training with Multiple Frameworks',
    description='Pipeline to create training/validation splits w/ BigQuery then launch multiple AI Platform Training Jobs'
)
def pipeline(
    project_id,
    train_query=TRAIN_QUERY,
    validation_query=VALIDATION_QUERY,
    region='us-central1'
):
    # Creating the training data split
    create_training_split = bigquery_query_op(
        query=train_query,
        project_id=project_id,
        output_gcs_path=TRAINING_OUTPUT_PATH
    ).set_display_name('BQ Train Split')
    
    # TODO: Create the validation data split
    
    # These are the output directories where our models will be saved
    tf_output_dir = BUCKET + '/census/models/tf'
    scikit_output_dir = BUCKET + '/census/models/scikit'
    torch_output_dir = BUCKET + '/census/models/torch'
    xgb_output_dir = BUCKET + '/census/models/xgb'
    
    # Training arguments to be passed to the TF Trainer
    tf_args = [
        '--training_dataset_path', create_training_split.outputs['output_gcs_path'],
        '--validation_dataset_path', create_validation_split.outputs['output_gcs_path'],
        '--output_dir', tf_output_dir,
        '--batch_size', '32', 
        '--num_train_examples', '1000',
        '--num_evals', '10'
    ]
    
    # TODO: Fill in the list of the training arguments to be passed to the Scikit-learn Trainer
    scikit_args = []
    
    # Training arguments to be passed to the PyTorch Trainer
    torch_args = [
        '--training_dataset_path', create_training_split.outputs['output_gcs_path'],
        '--validation_dataset_path', create_validation_split.outputs['output_gcs_path'],
        '--output_dir', torch_output_dir,
        '--batch_size', '32', 
        '--num_epochs', '15',
    ]
    
    # Training arguments to be passed to the XGBoost Trainer 
    xgb_args = [
        '--training_dataset_path', create_training_split.outputs['output_gcs_path'],
        '--validation_dataset_path', create_validation_split.outputs['output_gcs_path'],
        '--output_dir', xgb_output_dir,
        '--max_depth', '10', 
        '--n_estimators', '100'
    ]
    
    # AI Platform Training Jobs with all 4 trainer images 
    
    train_scikit = mlengine_train_op(
        project_id=project_id,
        region=region,
        master_image_uri=SCIKIT_TRAINER_IMAGE,
        args=scikit_args).set_display_name('Scikit-learn Model - AI Platform Training')
    
    train_tf = mlengine_train_op(
        project_id=project_id,
        region=region,
        master_image_uri=TF_TRAINER_IMAGE,
        args=tf_args).set_display_name('Tensorflow Model - AI Platform Training')
    
    train_torch = mlengine_train_op(
        project_id=project_id,
        region=region,
        master_image_uri=TORCH_TRAINER_IMAGE,
        args=torch_args).set_display_name('Pytorch Model - AI Platform Training')
    
    # TODO: Provide arguments to mlengine_train_op to train the XGBoost model
    train_xgb = mlengine_train_op(
        # Arguments go here
    ).set_display_name('XGBoost Model - AI Platform Training')
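
Each `*_args` list in the pipeline alternates flag names and string values, which is the form the `fire`-based training scripts accept on the command line. The sketch below shows one way to assemble such a list. `build_trainer_args` is a hypothetical helper, and the `gs://my-project` paths are placeholders; in the real pipeline the dataset paths come from the BigQuery ops' `output_gcs_path` outputs.

```python
def build_trainer_args(training_path, validation_path, output_dir, **hyperparams):
    """Return a flat ['--flag', 'value', ...] list for a trainer container."""
    args = [
        "--training_dataset_path", training_path,
        "--validation_dataset_path", validation_path,
        "--output_dir", output_dir,
    ]
    # Hyperparameters are passed as strings; the scripts cast them with int().
    for name, value in hyperparams.items():
        args += ["--" + name, str(value)]
    return args


xgb_args = build_trainer_args(
    "gs://my-project/census/data/training.csv",
    "gs://my-project/census/data/validation.csv",
    "gs://my-project/census/models/xgb",
    max_depth=10,
    n_estimators=100,
)
print(xgb_args)
```

The flag names must match the parameter names of each script's `train_evaluate` function, so a trainer with no extra hyperparameters needs only the three path arguments.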
    

Set environment variables for the different trainer image names as well as our bucket.

In [None]:
TAG = 'latest'
SCIKIT_TRAINER_IMAGE = 'gcr.io/{}/scikit_trainer_image:{}'.format(PROJECT_ID, TAG)
TF_TRAINER_IMAGE = 'gcr.io/{}/tensorflow_trainer_image:{}'.format(PROJECT_ID, TAG)
TORCH_TRAINER_IMAGE = 'gcr.io/{}/pytorch_trainer_image:{}'.format(PROJECT_ID, TAG)
XGB_TRAINER_IMAGE = 'gcr.io/{}/xgboost_trainer_image:{}'.format(PROJECT_ID, TAG)

In [None]:
%env TF_TRAINER_IMAGE={TF_TRAINER_IMAGE}
%env SCIKIT_TRAINER_IMAGE={SCIKIT_TRAINER_IMAGE}
%env TORCH_TRAINER_IMAGE={TORCH_TRAINER_IMAGE}
%env XGB_TRAINER_IMAGE={XGB_TRAINER_IMAGE}
%env BUCKET={BUCKET}
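
The pipeline file reads these variables with `os.getenv` when it is compiled. `os.getenv` returns `None` for a variable that is not set, so an expression such as `BUCKET + '/census/data/training.csv'` would fail with a `TypeError`. A small guard makes that failure easier to diagnose. This is a sketch only: `require_env` is a hypothetical helper, and `gs://my-project` is a placeholder value.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError("Environment variable {} is not set".format(name))
    return value


os.environ["BUCKET"] = "gs://my-project"  # placeholder for illustration
bucket = require_env("BUCKET")
training_output_path = bucket + "/census/data/training.csv"
print(training_output_path)
```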

### Compile the Pipeline
Compile the pipeline with the CLI compiler. This will save a census_training_pipeline.yaml file locally.

In [None]:
!dsl-compile --py pipeline/census_training_pipeline.py --output census_training_pipeline.yaml

### Take a look at the head of the yaml file

In [None]:
!head census_training_pipeline.yaml

#### Set the command fields in the pipeline YAML

In [None]:
!sed -i 's/\"command\": \[\]/\"command\": \[python, -u, -m, kfp_component.launcher\]/g' census_training_pipeline.yaml

In [None]:
!grep "component.launcher" census_training_pipeline.yaml

You should see 6 lines in the output that were modified by the sed command, one for each op in the completed pipeline (two BigQuery ops and four AI Platform Training ops).
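
To make the effect of the sed command concrete, here is an equivalent substitution in Python applied to a single made-up line. The sample string is hypothetical and only imitates the shape of an entry with an empty command list; it is not copied from a real compiled pipeline.

```python
import re

# Hypothetical fragment with an empty "command" list.
sample = '"command": [], "image": "example-image"'

pattern = r'"command": \[\]'
replacement = '"command": [python, -u, -m, kfp_component.launcher]'

patched, count = re.subn(pattern, replacement, sample)
print(count)    # 1
print(patched)
# "command": [python, -u, -m, kfp_component.launcher], "image": "example-image"
```

Only entries whose command list is exactly empty are rewritten; everything else in the file is left untouched.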

### Deploy your KubeFlow Pipeline
Now let's deploy the KubeFlow pipeline. Prior to the lab you should have spun up an AI Platform Pipelines Instance. In the AI Platform Pipeline UI click 'settings' on your pipeline and copy the endpoint. Paste the endpoint as the value for the string variable ENDPOINT.

In [None]:
#TODO: Change ENDPOINT to the ENDPOINT for your AI Platform Pipelines Instance
ENDPOINT = ''
PIPELINE_NAME = 'census_trainer_multiple_models'

In [None]:
!kfp --endpoint $ENDPOINT pipeline upload \
-p $PIPELINE_NAME \
./census_training_pipeline.yaml

### Continuous Training: Create Pipeline Run, then Create Recurring Runs in the UI. 
Now that we deployed our KubeFlow pipeline, let's head over to the UI and launch a pipeline run. You can either click **Open Pipelines Dashboard** in the AI Platform Pipeline UI or run the Python cell below and copy/paste the output into the browser. Once in the UI, complete the following steps:
* Select **Pipelines** on the left-hand navigation panel
* Select **census_trainer_multiple_models** then click **Create Run**
* In the **Experiment** field select **Default**
* For **Run Type** select **One-Off**
* Enter your **Project ID** and hit **Start**. You can now monitor your pipeline run in the UI (it will take about **10 minutes** to complete the run).



**NOTE: Your pipeline run may fail due to a bug in the BigQuery component, which does not handle certain race conditions. If the run fails, click Retry at the top right of the KFP UI to re-run the failed steps.**


Now let's set up a recurring run:
* Select **Pipelines** then **census_trainer_multiple_models** and click **Create Run**
* In the **Experiment** field select **Default**
* For **Run Type** select **Recurring**
* Configure the Recurring Run to start tomorrow at 5pm and run once weekly
* Enter your **Project ID** and hit **Start**

Now your pipeline will run weekly starting tomorrow. It's as easy as that!
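
To picture the schedule configured above, the following sketch lists the first few run times for a recurring run that starts tomorrow at 5pm and repeats weekly. It is only an illustration of the dates involved; the actual scheduling is handled by the recurring run you created in the UI. `weekly_run_times` is a hypothetical helper, and the fixed `now` value is made up so the output is reproducible.

```python
import datetime


def weekly_run_times(now, count=3, hour=17):
    """Return `count` run times: tomorrow at `hour`:00, then every 7 days."""
    first = (now + datetime.timedelta(days=1)).replace(
        hour=hour, minute=0, second=0, microsecond=0)
    return [first + datetime.timedelta(weeks=i) for i in range(count)]


now = datetime.datetime(2020, 9, 1, 9, 30)  # made-up "current" time
for run in weekly_run_times(now):
    print(run.isoformat())
# 2020-09-02T17:00:00
# 2020-09-09T17:00:00
# 2020-09-16T17:00:00
```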

In [None]:
print(f"https://{ENDPOINT}")

### End of Lab, Congrats!
In this lab you created containerized training applications for TensorFlow, PyTorch, Scikit-learn, and XGBoost models. You then created a KubeFlow pipeline that used pre-built components to create training/validation splits in BigQuery data, export that data as CSV files to GCS, and launch AI Platform Training Jobs with the four containerized training applications. Finally, you ran the pipeline from the UI and set up Continuous Training to re-run your pipeline once a week!