# Training at scale with the Vertex AI Training Service
**Learning Objectives:**
  1. Learn how to organize your training code into a Python package
  1. Train your model using cloud infrastructure via Google Cloud Vertex AI Training Service
  1. Learn how to run your training package using Docker containers and push training Docker images to a Docker registry
  1. Monitor the cloud training job using TensorBoard

## Introduction

In this notebook we'll make the jump from training locally to training in the cloud, taking advantage of Google Cloud's [Vertex AI Training Service](https://cloud.google.com/vertex-ai/).

Vertex AI Training Service is a managed service that allows the training and deployment of ML models without having to provision or maintain servers. The infrastructure is handled seamlessly by the managed service for us.

Each learning objective will correspond to a __#TODO__ in the [student lab notebook](../labs/1_training_at_scale_vertex.ipynb) -- try to complete that notebook first before reviewing this solution notebook.

Specify your project name and bucket name in the cell below.

In [None]:
import os
import warnings
from datetime import datetime

import tensorboard
from google import api_core
from google.cloud import aiplatform, bigquery

%load_ext tensorboard
warnings.filterwarnings("ignore")
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

Change the following cell as necessary:

In [None]:
# Change below if necessary
PROJECT = !gcloud config get-value project  # noqa: E999
PROJECT = PROJECT[0]
BUCKET = PROJECT
REGION = "us-central1"

OUTDIR = f"gs://{BUCKET}/taxifare/data"

%env PROJECT=$PROJECT
%env BUCKET=$BUCKET
%env REGION=$REGION
%env OUTDIR=$OUTDIR

Confirm below that the bucket is regional and that its region matches the specified region:

In [None]:
%%bash
gsutil ls -Lb gs://$BUCKET | grep "gs://\|Location"
echo $REGION

In [None]:
%%bash
gcloud config set project $PROJECT
gcloud config set ai/region $REGION

## Create BigQuery tables

If you have not already created a BigQuery dataset for our data, run the following cell:

In [None]:
bq = bigquery.Client(project=PROJECT)
dataset = bigquery.Dataset(bq.dataset("taxifare"))

try:
    bq.create_dataset(dataset)
    print("Dataset created")
except api_core.exceptions.Conflict:
    print("Dataset already exists")

Let's create a table with 1 million examples.

This query reflects the best practice of using a hash function (`FARM_FINGERPRINT`) in the `WHERE` and `ORDER BY` clauses to ensure reproducibility while introducing randomness.

Note that the order of columns is exactly what was in our CSV files.
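
To illustrate why hashing gives a repeatable sample, here is a minimal Python sketch. It is not what BigQuery runs: `hashlib.sha256` stands in for `FARM_FINGERPRINT`, and the helpers `fingerprint` and `in_sample` as well as the synthetic timestamps are assumptions made for illustration only.

```python
import hashlib
from datetime import datetime, timedelta


def fingerprint(value: str) -> int:
    """Map a string to a stable 64-bit integer (stand-in for FARM_FINGERPRINT)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def in_sample(value: str, modulus: int, bucket: int) -> bool:
    """Keep a row when its hash lands in one bucket out of `modulus`."""
    return fingerprint(value) % modulus == bucket


# Synthetic pickup timestamps, one per minute (illustrative data only).
start = datetime(2015, 1, 1)
timestamps = [str(start + timedelta(minutes=i)) for i in range(100_000)]

first_run = [t for t in timestamps if in_sample(t, modulus=1000, bucket=1)]
second_run = [t for t in timestamps if in_sample(t, modulus=1000, bucket=1)]

print(first_run == second_run)  # the same rows are selected on every run
print(len(first_run) / len(timestamps))  # roughly 1/1000 of the rows
```

Because the selection depends only on the hashed value, rerunning the query picks the same rows, unlike a filter based on `RAND()`.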

In [None]:
%%bigquery

CREATE OR REPLACE TABLE taxifare.feateng_training_data AS

SELECT
    (tolls_amount + fare_amount) AS fare_amount,
    pickup_datetime,
    pickup_longitude AS pickuplon,
    pickup_latitude AS pickuplat,
    dropoff_longitude AS dropofflon,
    dropoff_latitude AS dropofflat,
    passenger_count*1.0 AS passengers,
    'unused' AS key
FROM `nyc-tlc.yellow.trips` as ny_taxi_trips
WHERE ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 1000)) = 1
AND
    trip_distance > 0
    AND fare_amount >= 2.5
    AND pickup_longitude > -78
    AND pickup_longitude < -70
    AND dropoff_longitude > -78
    AND dropoff_longitude < -70
    AND pickup_latitude > 37
    AND pickup_latitude < 45
    AND dropoff_latitude > 37
    AND dropoff_latitude < 45
    AND passenger_count > 0
ORDER BY FARM_FINGERPRINT(TO_JSON_STRING(ny_taxi_trips))

Make the validation dataset be 1/10 the size of the training dataset.
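
The validation query below keeps one hash bucket out of 10000, whereas the training query keeps one out of 1000, which is where the 1/10 ratio comes from. The following sketch checks that idea on synthetic keys. As before, `hashlib.sha256` is only a stand-in for `FARM_FINGERPRINT`, and the key names are made up for illustration.

```python
import hashlib


def fingerprint(value: str) -> int:
    """Stable non-negative integer hash (stand-in for FARM_FINGERPRINT)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


keys = [f"trip-{i}" for i in range(200_000)]  # synthetic row keys

train = {k for k in keys if fingerprint(k) % 1000 == 1}
valid = {k for k in keys if fingerprint(k) % 10000 == 2}

# A hash with remainder 2 modulo 10000 also has remainder 2 modulo 1000,
# never 1, so the two selections cannot share a row.
print(train.isdisjoint(valid))
print(len(train), len(valid))  # validation is roughly a tenth of training
```

The same disjointness argument applies to the two `WHERE` clauses in this notebook, so no trip ends up in both tables.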

In [None]:
%%bigquery

CREATE OR REPLACE TABLE taxifare.feateng_valid_data AS

SELECT
    (tolls_amount + fare_amount) AS fare_amount,
    pickup_datetime,
    pickup_longitude AS pickuplon,
    pickup_latitude AS pickuplat,
    dropoff_longitude AS dropofflon,
    dropoff_latitude AS dropofflat,
    passenger_count*1.0 AS passengers,
    'unused' AS key
FROM `nyc-tlc.yellow.trips` as ny_taxi_trips
WHERE ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 10000)) = 2
AND
    trip_distance > 0
    AND fare_amount >= 2.5
    AND pickup_longitude > -78
    AND pickup_longitude < -70
    AND dropoff_longitude > -78
    AND dropoff_longitude < -70
    AND pickup_latitude > 37
    AND pickup_latitude < 45
    AND dropoff_latitude > 37
    AND dropoff_latitude < 45
    AND passenger_count > 0
ORDER BY FARM_FINGERPRINT(TO_JSON_STRING(ny_taxi_trips))

## Export the tables as CSV files

In [None]:
%%bash

echo "Deleting current contents of $OUTDIR"
gsutil -m -q rm -rf $OUTDIR

echo "Extracting training data to $OUTDIR"
bq --location=US extract \
   --destination_format CSV  \
   --field_delimiter "," --noprint_header \
   taxifare.feateng_training_data \
   $OUTDIR/taxi-train-*.csv

echo "Extracting validation data to $OUTDIR"
bq --location=US extract \
   --destination_format CSV  \
   --field_delimiter "," --noprint_header \
   taxifare.feateng_valid_data \
   $OUTDIR/taxi-valid-*.csv

gsutil ls -l $OUTDIR

Confirm that you have created both the training and validation datasets in Google Cloud Storage.

In [None]:
!gsutil ls gs://$BUCKET/taxifare/data

In [None]:
!gsutil cat gs://$BUCKET/taxifare/data/taxi-train-000000000000.csv | head -2

In [None]:
!gsutil cat gs://$BUCKET/taxifare/data/taxi-valid-000000000000.csv | head -2

## Make code compatible with Vertex AI Training Service
In order to make our code compatible with Vertex AI Training Service we need to make the following changes:

1. Upload data to Google Cloud Storage
2. Move code into a trainer Python package
3. Submit training job with `gcloud` to train on Vertex AI

### Move code into a Python package

The first thing to do is to convert your training code snippets into a regular Python package. 

A Python package is simply a collection of one or more `.py` files along with an `__init__.py` file to identify the containing directory as a package. The `__init__.py` sometimes contains initialization code but for our purposes an empty file is sufficient.
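
As a small, self-contained illustration of that definition, the sketch below builds a throwaway package in a temporary directory and imports a module from it. The package name `demo_trainer` and its contents are hypothetical and unrelated to the `taxifare` code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a temporary directory that will contain the package directory.
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_trainer"  # hypothetical package name
pkg_dir.mkdir()

# An empty __init__.py is enough to mark the directory as a package.
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "model.py").write_text(
    "def describe():\n    return 'hello from demo_trainer.model'\n"
)

# Make the parent directory importable, then import the module by its dotted name.
sys.path.insert(0, str(root))
demo_model = importlib.import_module("demo_trainer.model")
print(demo_model.describe())
```

Adding the parent directory to `sys.path` plays the same role as extending `PYTHONPATH` before running a module with `python3 -m`.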

#### Create the package directory

Our package directory contains 3 files:

In [None]:
ls ./taxifare/trainer/

#### Paste existing code into model.py

A Python package requires our code to be in a `.py` file, as opposed to notebook cells. So we simply copy the code we developed in [this notebook](../../introduction_to_tensorflow/solutions/5_custom_feature_engineering.ipynb) into a single file.

**Lab Task #1**: Organizing your training code into a Python package

There are two places to fill in TODOs in `model.py`. 

 * in the `build_dnn_model` function, add code to use an optimizer with a custom learning rate.
 * in the `train_and_evaluate` function, add code to define variables using the `hparams` dictionary.

In [None]:
%%writefile ./taxifare/trainer/model.py
"""Data prep, train and evaluate DNN model."""

import logging
import os

import numpy as np
import tensorflow as tf
import keras
from keras import callbacks
from keras.layers import (
    Concatenate,
    Dense,
    Discretization,
    Embedding,
    Flatten,
    HashedCrossing,
    Input,
    Lambda,
)

def parse_csv(row):
    ds = tf.strings.split(row, ",")
    # Label: fare_amount
    label = tf.strings.to_number(ds[0])
    # Feature: pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude
    feature = tf.strings.to_number(ds[2:6])  # use some features only
    # Passing feature in tuple so that we can handle them separately.
    return (feature[0], feature[1], feature[2], feature[3]), label


def create_dataset(pattern, batch_size, num_repeat, mode="eval"):
    ds = tf.data.Dataset.list_files(pattern)
    ds = ds.flat_map(tf.data.TextLineDataset)
    ds = ds.map(parse_csv)
    if mode == "train":
        # shuffle() returns a new dataset, so the result must be reassigned.
        ds = ds.shuffle(buffer_size=1000)
    ds = ds.repeat(num_repeat).batch(batch_size, drop_remainder=True)
    return ds

def lat_lon_parser(row, pick_lat):
    ds = tf.strings.split(row, ",")
    # latitude idx: 3 and 5, longitude idx: 2 and 4
    idx = [3,5] if pick_lat else [2,4]
    return tf.strings.to_number(tf.gather(ds, idx))

def adapt_normalize(train_data_path):
    ds = tf.data.Dataset.list_files(train_data_path)
    ds = ds.flat_map(tf.data.TextLineDataset)
    lat_values = ds.map(lambda x: lat_lon_parser(x, True)).batch(10000)
    lon_values = ds.map(lambda x: lat_lon_parser(x, False)).batch(10000)

    lat_scaler = keras.layers.Normalization(axis=None)
    lon_scaler = keras.layers.Normalization(axis=None)
    lat_scaler.adapt(lat_values)
    lon_scaler.adapt(lon_values)

    print("Computed statistics for latitude:")
    print(f"mean: {lat_scaler.mean}, variance: {lat_scaler.variance}")
    print("+++++")
    print("Computed statistics for longitude:")
    print(f"mean: {lon_scaler.mean}, variance: {lon_scaler.variance}")

    return lat_scaler, lon_scaler


def euclidean(params):
    lon1, lat1, lon2, lat2 = params
    londiff = lon2 - lon1
    latdiff = lat2 - lat1
    return tf.sqrt(londiff * londiff + latdiff * latdiff)


def transform(inputs, nbuckets, normalizers):
    lat_scaler, lon_scaler = normalizers

    # Normalize longitude
    scaled_plon = lon_scaler(inputs["pickup_longitude"])
    scaled_dlon = lon_scaler(inputs["dropoff_longitude"])

    # Normalize latitude
    scaled_plat = lat_scaler(inputs["pickup_latitude"])
    scaled_dlat = lat_scaler(inputs["dropoff_latitude"])

    # Lambda layer for the custom euclidean function
    euclidean_distance = Lambda(euclidean, name="euclidean")(
        [scaled_plon, scaled_plat, scaled_dlon, scaled_dlat]
    )

    # Discretization
    latbuckets = np.linspace(start=-5, stop=5, num=nbuckets).tolist()
    lonbuckets = np.linspace(start=-5, stop=5, num=nbuckets).tolist()

    plon = Discretization(lonbuckets, name="plon_bkt")(scaled_plon)
    plat = Discretization(latbuckets, name="plat_bkt")(scaled_plat)
    dlon = Discretization(lonbuckets, name="dlon_bkt")(scaled_dlon)
    dlat = Discretization(latbuckets, name="dlat_bkt")(scaled_dlat)

    # Feature Cross with HashedCrossing layer
    p_fc = HashedCrossing(num_bins=(nbuckets + 1) ** 2, name="p_fc")((plon, plat))
    d_fc = HashedCrossing(num_bins=(nbuckets + 1) ** 2, name="d_fc")((dlon, dlat))
    pd_fc = HashedCrossing(num_bins=(nbuckets + 1) ** 4, name="pd_fc")((p_fc, d_fc))

    # Embedding with Embedding layer
    pd_embed = Flatten()(
        Embedding(input_dim=(nbuckets + 1) ** 4, output_dim=10, name="pd_embed")(
            pd_fc
        )
    )

    transformed = Concatenate()([
        scaled_plon,
        scaled_dlon,
        scaled_plat,
        scaled_dlat,
        euclidean_distance, 
        pd_embed
    ])

    return transformed


def rmse(y_true, y_pred):
    squared_error = tf.keras.ops.square(y_pred[:, 0] - y_true)
    return tf.keras.ops.sqrt(tf.keras.ops.mean(squared_error))

def build_dnn_model(nbuckets, nnsize, lr, normalizers):
    INPUT_COLS = [
        "pickup_longitude",
        "pickup_latitude",
        "dropoff_longitude",
        "dropoff_latitude",
    ]

    inputs = {
        colname: Input(name=colname, shape=(1,), dtype="float32")
        for colname in INPUT_COLS
    }

    # transforms
    x = transform(inputs, nbuckets, normalizers)

    for layer, nodes in enumerate(nnsize):
        x = Dense(nodes, activation="relu", name=f"h{layer}")(x)
    output = Dense(1, name="fare")(x)

    model = keras.Model(inputs=list(inputs.values()), outputs=output)

    # TODO 1a: Your code here

    return model


def train_and_evaluate(hparams):
    # TODO 1b: Your code here
    nnsize = [int(s) for s in hparams["nnsize"].split()]
    eval_data_path = hparams["eval_data_path"]
    num_evals = hparams["num_evals"]
    num_examples_to_train_on = hparams["num_examples_to_train_on"]
    output_dir = hparams["output_dir"]
    train_data_path = hparams["train_data_path"]

    model_export_path = os.path.join(output_dir, "model.keras")
    serving_model_export_path = os.path.join(output_dir, "savedmodel")
    checkpoint_path = os.path.join(output_dir, "checkpoint.keras")
    tensorboard_path = os.path.join(output_dir, "tensorboard")

    if tf.io.gfile.exists(output_dir):
        tf.io.gfile.rmtree(output_dir)

    # Compute normalization statistics from the training data, not the eval data.
    normalizers = adapt_normalize(train_data_path)

    model = build_dnn_model(nbuckets, nnsize, lr, normalizers)
    logging.info(model.summary())
    
    trainds = create_dataset(
        pattern=train_data_path, batch_size=batch_size, num_repeat=None, mode="train"
    )

    evalds = create_dataset(
        pattern=eval_data_path, batch_size=batch_size, num_repeat=1, mode="eval"
    )

    steps_per_epoch = num_examples_to_train_on // (batch_size * num_evals)

    checkpoint_cb = callbacks.ModelCheckpoint(
        checkpoint_path, verbose=1
    )
    tensorboard_cb = callbacks.TensorBoard(tensorboard_path, histogram_freq=1)

    history = model.fit(
        trainds,
        validation_data=evalds,
        epochs=num_evals,
        steps_per_epoch=max(1, steps_per_epoch),
        verbose=2,  # 0=silent, 1=progress bar, 2=one line per epoch
        callbacks=[checkpoint_cb, tensorboard_cb],
    )

    # Save the Keras model file.
    model.save(model_export_path)
    # Exporting the model in savedmodel for serving.
    model.export(serving_model_export_path)
    return history

### Define Command Line Parser

If you look closely above, you'll notice a new function, `train_and_evaluate`, that wraps the code that actually trains the model. This allows us to parametrize training by passing a dictionary of parameters to this function (e.g., `batch_size`, `num_examples_to_train_on`, `train_data_path`).

This is useful because the output directory, data paths and number of train steps will be different depending on whether we're training locally or in the cloud. Parametrizing allows us to use the same code for both.
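
To make the difference concrete, the sketch below reproduces the `steps_per_epoch` arithmetic from `train_and_evaluate` for the local test settings and for the cloud settings used later in this notebook. The helper name `steps_per_epoch` is introduced here for illustration only.

```python
def steps_per_epoch(num_examples_to_train_on, batch_size, num_evals):
    """Mirror the computation in train_and_evaluate, including the max(1, ...) guard."""
    return max(1, num_examples_to_train_on // (batch_size * num_evals))


# Local smoke test: 100 examples, batch size 5, 1 evaluation.
local_steps = steps_per_epoch(100, 5, 1)

# Cloud job: 500000 examples, batch size 64, 100 evaluations.
cloud_steps = steps_per_epoch(500_000, 64, 100)

print(local_steps)  # 20
print(cloud_steps)  # 78
```

Each "epoch" here is really the interval between two evaluations, so `num_evals` controls how often validation metrics are reported rather than how many passes are made over the data.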

We specify these parameters at run time via the command line, which means we need code that parses command-line arguments and invokes `train_and_evaluate()` with them. This is the job of the `task.py` file.
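
As a minimal sketch of that flow, the snippet below parses a hand-written argument list into a dictionary and then splits the `nnsize` string the same way `train_and_evaluate` does. Only three of the arguments are declared, and the values passed in are placeholders chosen for illustration.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--nnsize", default="32 8")
parser.add_argument("--output_dir", required=True)

# Passing a list to parse_args() stands in for the real command line (sys.argv[1:]).
args = parser.parse_args(["--output_dir", "./demo-output", "--batch_size", "5"])

# vars(args) returns the same mapping as args.__dict__.
hparams = vars(args)
nnsize = [int(s) for s in hparams["nnsize"].split()]

print(hparams)
print(nnsize)  # [32, 8]
```

Arguments that are not supplied fall back to their defaults, which is why `nnsize` is still available in `hparams` here.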

In [None]:
%%writefile taxifare/trainer/task.py
"""Argument definitions for model training code in `trainer.model`."""

import argparse

from trainer import model

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--batch_size",
        help="Batch size for training steps",
        type=int,
        default=32,
    )
    parser.add_argument(
        "--eval_data_path",
        help="GCS location pattern of eval files",
        required=True,
    )
    parser.add_argument(
        "--nnsize",
        help="Hidden layer sizes (provide space-separated sizes)",
        default="32 8",
    )
    parser.add_argument(
        "--nbuckets",
        help="Number of buckets to divide lat and lon with",
        type=int,
        default=10,
    )
    parser.add_argument(
        "--lr", help="learning rate for optimizer", type=float, default=0.001
    )
    parser.add_argument(
        "--num_evals",
        help="Number of times to evaluate the model on eval data during training.",
        type=int,
        default=5,
    )
    parser.add_argument(
        "--num_examples_to_train_on",
        help="Number of examples to train on.",
        type=int,
        default=100,
    )
    parser.add_argument(
        "--output_dir",
        help="GCS location to write checkpoints and export models",
        required=True,
    )
    parser.add_argument(
        "--train_data_path",
        help="GCS location pattern of train files",
        required=True,
    )
    args = parser.parse_args()
    hparams = args.__dict__

    model.train_and_evaluate(hparams)


### Run trainer module package locally

Now we can test our training code locally as follows using the local test data. We'll run a very small training job over a single file with a small batch size and one eval step.

In [None]:
%%bash

EVAL_DATA_PATH=../data/taxi-traffic-valid*
TRAIN_DATA_PATH=../data/taxi-traffic-train*
OUTPUT_DIR=./taxifare-model

test ${OUTPUT_DIR} && rm -rf ${OUTPUT_DIR}
export PYTHONPATH=${PYTHONPATH}:${PWD}/taxifare

python3 -m trainer.task \
--eval_data_path $EVAL_DATA_PATH \
--output_dir $OUTPUT_DIR \
--train_data_path $TRAIN_DATA_PATH \
--batch_size 5 \
--num_examples_to_train_on 100 \
--num_evals 1 \
--nbuckets 10 \
--lr 0.001 \
--nnsize "32 8"

## Run your training package on Vertex AI


Once the code works in standalone mode locally, you can run it on the Cloud using Vertex AI.

You can provide training code to Vertex AI in one of the following forms:

- **A Python training application to use with a prebuilt container**. Create a [Python source distribution](https://packaging.python.org/en/latest/overview/#python-source-distributions) with code that trains an ML model and exports it to Cloud Storage. This training application can use any of the dependencies included in the prebuilt container that you plan to use it with. Use this option if one of the Vertex AI prebuilt containers for training includes all the dependencies that you need for training.

- **A custom container image**. Create a Docker container image with code that trains an ML model and exports it to Cloud Storage. Include any dependencies required by your code in the container image.


### Method 1: Prebuilt Container
First, let's run a cloud training job using a prebuilt container.

In order to do so, we need to package our code as a source distribution. For this, we can use `setuptools`. 

In [None]:
%%writefile taxifare/setup.py
"""Using `setuptools` to create a source distribution."""

from setuptools import find_packages, setup

setup(
    name="taxifare_trainer",
    version="0.1",
    packages=find_packages(),
    include_package_data=True,
    description="Taxifare model training application.",
)

In [None]:
%%bash
cd taxifare
python ./setup.py sdist --formats=gztar
cd ..

We will store our package in the Cloud Storage bucket.

In [None]:
%%bash
gsutil cp taxifare/dist/taxifare_trainer-0.1.tar.gz gs://${BUCKET}/taxifare/

#### Submit Custom Job using the Python SDK

To submit this source distribution to the cloud, we use the [CustomJob](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.CustomJob#google_cloud_aiplatform_CustomJob_get) class from the Vertex AI Python SDK and specify parameters for the Vertex AI Training service under [worker_pool_spec](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1.types.WorkerPoolSpec), which includes:
- [`machine_spec`](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1.types.MachineSpec): Specification of a single machine on which training runs. Here we can specify the `machine_type` as well as the [accelerator specifications](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1.types.AcceleratorType).
- `python_package_spec`: Specification of our Python package runtime, including the prebuilt container URL (`executor_image_uri`), our Python package path (`package_uris`), the module to run (`python_module`), and the custom arguments (`args`) passed to our training application.

Because this is on the entire dataset, it will take a while. You can monitor the job from the GCP console in the Vertex AI Training section.
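
The `args` entry is a flat list that alternates flag names and string values. One possible way to build such a list from a dictionary is sketched below. The helper `to_cli_args` is hypothetical and is not part of the Vertex AI SDK.

```python
def to_cli_args(hparams):
    """Flatten {"name": value} into ["--name", "value", ...] (hypothetical helper)."""
    cli_args = []
    for name, value in hparams.items():
        # Every value is passed as a string, just as on a real command line.
        cli_args += [f"--{name}", str(value)]
    return cli_args


demo_hparams = {"batch_size": 64, "lr": 0.001, "nnsize": "32 8"}
print(to_cli_args(demo_hparams))
# ['--batch_size', '64', '--lr', '0.001', '--nnsize', '32 8']
```

Keeping `"32 8"` as a single list element means the training application receives it as one argument, which matches how `--nnsize` is parsed in `task.py`.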

**Lab Task #2**: Train your model using cloud infrastructure via Google Cloud Vertex AI Training Service
Fill in the TODOs in the code below to submit your job for training on Vertex AI. 

In [None]:
BATCH_SIZE = 64
NUM_EXAMPLES_TO_TRAIN_ON = 500000
NUM_EVALS = 100
NBUCKETS = 10
LR = 0.001
NNSIZE = "32 8"

base_path = f"gs://{BUCKET}/taxifare"
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

args = [
    "--eval_data_path",
    f"{base_path}/data/taxi-valid*",
    "--output_dir",
    f"{base_path}/trained_model_{timestamp}",
    "--train_data_path",
    f"{base_path}/data/taxi-train*",
    "--batch_size",
    f"{BATCH_SIZE}",
    "--num_examples_to_train_on",
    f"{NUM_EXAMPLES_TO_TRAIN_ON}",
    "--num_evals",
    f"{NUM_EVALS}",
    "--nbuckets",
    f"{NBUCKETS}",
    "--lr",
    f"{LR}",
    "--nnsize",
    f"{NNSIZE}",
]

worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-4",
            "accelerator_type": None,
            "accelerator_count": None,
        },
        "replica_count": 1,
        "python_package_spec": {
            "executor_image_uri": "us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-17.py310:latest",
            "package_uris": [f"{base_path}/taxifare_trainer-0.1.tar.gz"],
            "python_module": "trainer.task",
            "args": args,
        },
    }
]

training_job = aiplatform.CustomJob(
    # TODO 2: Your code here
)

training_job.submit()

You can also submit a custom training job from the command line with [`gcloud ai custom-jobs create`](https://cloud.google.com/sdk/gcloud/reference/ai/custom-jobs/create).

Equivalent bash command:
```bash
TIMESTAMP=$(date -u +%Y%m%d_%H%M%S)
OUTDIR=gs://${BUCKET}/taxifare/trained_model_$TIMESTAMP
JOB_NAME=taxifare_$TIMESTAMP
echo ${OUTDIR} ${REGION} ${JOB_NAME}

PYTHON_PACKAGE_URIS=gs://${BUCKET}/taxifare/taxifare_trainer-0.1.tar.gz
MACHINE_TYPE=n1-standard-4
REPLICA_COUNT=1
PYTHON_PACKAGE_EXECUTOR_IMAGE_URI="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-17.py310:latest"
PYTHON_MODULE=trainer.task

# Model and training hyperparameters
BATCH_SIZE=64
NUM_EXAMPLES_TO_TRAIN_ON=500000
NUM_EVALS=100
NBUCKETS=10
LR=0.001
NNSIZE="32 8"

# GCS paths
GCS_PROJECT_PATH=gs://$BUCKET/taxifare
DATA_PATH=$GCS_PROJECT_PATH/data
TRAIN_DATA_PATH=$DATA_PATH/taxi-train*
EVAL_DATA_PATH=$DATA_PATH/taxi-valid*

WORKER_POOL_SPEC="machine-type=$MACHINE_TYPE,\
replica-count=$REPLICA_COUNT,\
executor-image-uri=$PYTHON_PACKAGE_EXECUTOR_IMAGE_URI,\
python-module=$PYTHON_MODULE"

ARGS="--eval_data_path=$EVAL_DATA_PATH,\
--output_dir=$OUTDIR,\
--train_data_path=$TRAIN_DATA_PATH,\
--batch_size=$BATCH_SIZE,\
--num_examples_to_train_on=$NUM_EXAMPLES_TO_TRAIN_ON,\
--num_evals=$NUM_EVALS,\
--nbuckets=$NBUCKETS,\
--lr=$LR,\
--nnsize=$NNSIZE"

gcloud ai custom-jobs create \
  --region=${REGION} \
  --display-name=$JOB_NAME \
  --python-package-uris=$PYTHON_PACKAGE_URIS \
  --worker-pool-spec=$WORKER_POOL_SPEC \
  --args="$ARGS"
```

#### Open TensorBoard
Since we specified the `TensorBoard` callback in `model.fit`, the training application saves intermediate logs that the TensorBoard dashboard can read.

While we can [host TensorBoard in Vertex AI Experiments](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-introduction), here let's simply open it from this notebook.

You might not see any data for a bit until the job begins. Check the job status on the console, and then return here to click the refresh button in the top right to update TensorBoard.

In [None]:
%tensorboard --logdir {base_path}/trained_model_{timestamp} --port 8080

### Method 2: Using a custom container

Vertex AI Training also supports training in custom containers, allowing users to bring their own Docker containers with any pre-installed ML framework or algorithm to run on Vertex AI Training. 

In this last section, we'll see how to submit a Cloud training job using a customized Docker image. 

Containerizing our `./taxifare/trainer` package involves 3 steps:

* Writing a Dockerfile in `./taxifare`
* Building the Docker image
* Pushing it to the Google Cloud Artifact Registry in our GCP project

The `Dockerfile` specifies:
1. How the container needs to be provisioned so that all the dependencies in our code are satisfied
2. Where to copy our trainer package in the container
3. What command to run when the container is run (the `ENTRYPOINT` line)

**Lab Task #3**: Running your training package using Docker containers.
Fill in the TODOs in the code below for the Dockerfile.

In [None]:
%%writefile ./taxifare/Dockerfile
FROM us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-17.py310:latest

# TODO 3: Your code here

In [None]:
ARTIFACT_REGISTRY_DIR = "asl-artifact-repo"
IMAGE_NAME = "taxifare_training_container"
IMAGE_URI = f"us-docker.pkg.dev/{PROJECT}/{ARTIFACT_REGISTRY_DIR}/{IMAGE_NAME}"

os.environ["IMAGE_URI"] = IMAGE_URI

In [None]:
%%bash 

PROJECT_DIR=$(cd ./taxifare && pwd)
DOCKERFILE=$PROJECT_DIR/Dockerfile

# Authorize docker command for Artifact Registry
gcloud auth print-access-token | docker login -u oauth2accesstoken --password-stdin https://us-docker.pkg.dev

docker build $PROJECT_DIR -f $DOCKERFILE -t $IMAGE_URI

docker push $IMAGE_URI

#### Submit Custom Job using the Python SDK

As we did above, let's define the `worker_pool_specs` using the Vertex AI Python SDK and run the cloud training job.

The definition is almost the same, but please note that here we specify `container_spec` instead of `python_package_spec`, which simply includes our custom container image path.
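
To highlight that difference, here is a sketch of a hypothetical helper, `make_worker_pool_spec`, that produces either variant of the dictionary. The key names follow the specs used in this notebook, while the helper itself and the placeholder project, repository, and bucket names are assumptions for illustration.

```python
def make_worker_pool_spec(args, image_uri=None, python_package=None):
    """Build one worker pool entry for a custom container or a Python package.

    `python_package` is a dict with the keys executor_image_uri,
    package_uris and python_module.
    """
    if (image_uri is None) == (python_package is None):
        raise ValueError("Provide exactly one of image_uri or python_package")

    spec = {
        "machine_spec": {"machine_type": "n1-standard-4"},
        "replica_count": 1,
    }
    if image_uri is not None:
        # Custom container: the image already contains the code and its entrypoint.
        spec["container_spec"] = {"image_uri": image_uri, "args": args}
    else:
        # Prebuilt container: the code is installed from the source distribution.
        spec["python_package_spec"] = {**python_package, "args": args}
    return spec


demo_args = ["--batch_size", "64"]
container_pool = make_worker_pool_spec(
    demo_args,
    image_uri="us-docker.pkg.dev/my-project/my-repo/my-image",  # placeholder URI
)
package_pool = make_worker_pool_spec(
    demo_args,
    python_package={
        "executor_image_uri": "us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-17.py310:latest",
        "package_uris": ["gs://my-bucket/taxifare/taxifare_trainer-0.1.tar.gz"],  # placeholder
        "python_module": "trainer.task",
    },
)
print(sorted(container_pool))
print(sorted(package_pool))
```

Either dictionary can be placed in the list passed as `worker_pool_specs`; only one of `container_spec` or `python_package_spec` appears in a given entry.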

In [None]:
BATCH_SIZE = 64
NUM_EXAMPLES_TO_TRAIN_ON = 500000
NUM_EVALS = 100
NBUCKETS = 10
LR = 0.001
NNSIZE = "32 8"

base_path = f"gs://{BUCKET}/taxifare"
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

args = [
    "--eval_data_path",
    f"{base_path}/data/taxi-valid*",
    "--output_dir",
    f"{base_path}/trained_model_{timestamp}",
    "--train_data_path",
    f"{base_path}/data/taxi-train*",
    "--batch_size",
    f"{BATCH_SIZE}",
    "--num_examples_to_train_on",
    f"{NUM_EXAMPLES_TO_TRAIN_ON}",
    "--num_evals",
    f"{NUM_EVALS}",
    "--nbuckets",
    f"{NBUCKETS}",
    "--lr",
    f"{LR}",
    "--nnsize",
    f"{NNSIZE}",
]

worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-4",
            "accelerator_type": None,
            "accelerator_count": None,
        },
        "replica_count": 1,
        # Now we specify container_spec, instead of python_package_spec
        "container_spec": {
            "image_uri": IMAGE_URI,
            "args": args,
        },
    }
]

training_job = aiplatform.CustomJob(
    display_name=f"taxifare_container_{timestamp}",
    worker_pool_specs=worker_pool_specs,
    staging_bucket=f"{base_path}/staging",
)

training_job.submit()

Here is the equivalent bash command to run container training.

```bash
# Output directory and job name
TIMESTAMP=$(date -u +%Y%m%d_%H%M%S)
OUTDIR=gs://${BUCKET}/taxifare/trained_model_$TIMESTAMP
JOB_NAME=taxifare_container_$TIMESTAMP
echo ${OUTDIR} ${REGION} ${JOB_NAME}

# Model and training hyperparameters
BATCH_SIZE=64
NUM_EXAMPLES_TO_TRAIN_ON=500000
NUM_EVALS=100
NBUCKETS=10
LR=0.001
NNSIZE="32 8"

# Vertex AI machines to use for training
MACHINE_TYPE=n1-standard-4
REPLICA_COUNT=1

# GCS paths.
GCS_PROJECT_PATH=gs://$BUCKET/taxifare
DATA_PATH=$GCS_PROJECT_PATH/data
TRAIN_DATA_PATH=$DATA_PATH/taxi-train*
EVAL_DATA_PATH=$DATA_PATH/taxi-valid*

WORKER_POOL_SPEC="machine-type=$MACHINE_TYPE,\
replica-count=$REPLICA_COUNT,\
container-image-uri=$IMAGE_URI"

ARGS="--eval_data_path=$EVAL_DATA_PATH,\
--output_dir=$OUTDIR,\
--train_data_path=$TRAIN_DATA_PATH,\
--batch_size=$BATCH_SIZE,\
--num_examples_to_train_on=$NUM_EXAMPLES_TO_TRAIN_ON,\
--num_evals=$NUM_EVALS,\
--nbuckets=$NBUCKETS,\
--lr=$LR,\
--nnsize=$NNSIZE"

gcloud ai custom-jobs create \
  --region=$REGION \
  --display-name=$JOB_NAME \
  --worker-pool-spec=$WORKER_POOL_SPEC \
  --args="$ARGS"
```

#### Open TensorBoard

Let's check TensorBoard once more. A different port number will be used this time, as `8080` is occupied by another TensorBoard instance above.

In [None]:
%tensorboard --logdir {base_path}/trained_model_{timestamp} --port 8081

Copyright 2025 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License