# A Simple Composer Pipeline using Vertex AI Training, Model Upload and Model Deployment
**Learning Objectives:**
  1. Create custom Airflow operators that leverage Vertex AI jobs with docker containers.
  2. Build an Airflow pipeline that uses Vertex AI for training, model upload, and deployment.
  3. Run the Airflow pipeline with Cloud Composer.

### Before Starting: Spin up Composer Environment. 
This will take approximately 25 minutes. 

**TODO**: Click the plus in the upper left and open a terminal. Copy and run the following command without change to create a Composer environment in your Google Cloud project: 

```gcloud composer environments create my-composer-env --location=us-central1 --image-version=composer-1.17.8-airflow-2.1.4```

After you run the command, feel free to return to this notebook and continue working through steps while your Composer environment spins up.

In [None]:
import os
from google.cloud import bigquery

PROJECT = !gcloud config list --format 'value(core.project)'
PROJECT = PROJECT[0]
BUCKET = PROJECT
REGION = "us-central1"
COMPOSER_ENV = "my-composer-env"

# Export PROJECT too: the bash cells below reference $PROJECT
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION
os.environ["COMPOSER_ENV"] = COMPOSER_ENV

## Organize/Clean the Dataset
* Create a BigQuery Dataset
* Query a public dataset to create tables of clean data for training and testing

In this lab you will be working with the babyweight dataset.

Step 1: Create BigQuery Dataset

In [None]:
%%bash

# Create a BigQuery dataset for babyweight if it doesn't exist
datasetexists=$(bq ls -d | grep -w babyweight)

if [ -n "$datasetexists" ]; then
    echo -e "BigQuery dataset already exists, let's not recreate it."

else
    echo "Creating BigQuery dataset titled: babyweight"
    
    bq --location=US mk --dataset \
        --description "Babyweight" \
        $PROJECT:babyweight
    echo "Here are your current datasets:"
    bq ls
fi
    
## Create GCS bucket if it doesn't exist already...
exists=$(gsutil ls -d | grep -w gs://${BUCKET}/)

if [ -n "$exists" ]; then
    echo -e "Bucket exists, let's not recreate it."
    
else
    echo "Creating a new GCS bucket."
    gsutil mb -l ${REGION} gs://${BUCKET}
    echo "Here are your current buckets:"
    gsutil ls
fi

Step 2: Create training and eval tables
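The query in this step hashes each record's year and month with `FARM_FINGERPRINT` and sends roughly three of every four hash buckets to training and the rest to evaluation, so all records from the same month land in the same split. A minimal sketch of the same idea in Python follows. It assumes `hashlib.sha256` as a stand-in for `FARM_FINGERPRINT`, so the bucket values will not match BigQuery's, and the function names are hypothetical.

```python
import hashlib


def hash_bucket(year: int, month: int, buckets: int = 4) -> int:
    """Map a (year, month) pair to a stable bucket in [0, buckets)."""
    # Same key construction as the SQL: year and month concatenated as strings.
    key = f"{year}{month}".encode("utf-8")
    # Stand-in hash (assumption): any stable hash gives a repeatable split.
    digest = hashlib.sha256(key).digest()
    # The integer is unsigned here, so no ABS() is needed as in the SQL.
    return int.from_bytes(digest[:8], "big") % buckets


def split_for(year: int, month: int) -> str:
    """Buckets 0-2 go to training, bucket 3 goes to evaluation."""
    return "train" if hash_bucket(year, month) < 3 else "eval"


# The same month always maps to the same split, run after run.
print(split_for(2005, 7))
```

Splitting on a hash of the month rather than on a random number keeps the split repeatable across runs.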

In [None]:
%%bigquery
CREATE OR REPLACE TABLE
    babyweight.babyweight_data AS
SELECT
    weight_pounds,
    CAST(is_male AS STRING) AS is_male,
    mother_age,
    CASE
        WHEN plurality = 1 THEN "Single(1)"
        WHEN plurality = 2 THEN "Twins(2)"
        WHEN plurality = 3 THEN "Triplets(3)"
        WHEN plurality = 4 THEN "Quadruplets(4)"
        WHEN plurality = 5 THEN "Quintuplets(5)"
    END AS plurality,
    gestation_weeks,
    FARM_FINGERPRINT(
        CONCAT(
            CAST(year AS STRING),
            CAST(month AS STRING)
        )
    ) AS hashmonth
FROM
    publicdata.samples.natality
WHERE
    year > 2000
    AND weight_pounds > 0
    AND mother_age > 0
    AND plurality > 0
    AND gestation_weeks > 0;
    
CREATE OR REPLACE TABLE
    babyweight.babyweight_augmented_data AS
SELECT
    weight_pounds,
    is_male,
    mother_age,
    plurality,
    gestation_weeks,
    hashmonth
FROM
    babyweight.babyweight_data
UNION ALL
SELECT
    weight_pounds,
    "Unknown" AS is_male,
    mother_age,
    CASE
        WHEN plurality = "Single(1)" THEN plurality
        ELSE "Multiple(2+)"
    END AS plurality,
    gestation_weeks,
    hashmonth
FROM
    babyweight.babyweight_data;
    
CREATE OR REPLACE TABLE
    babyweight.babyweight_data_train AS
SELECT
    weight_pounds,
    is_male,
    mother_age,
    plurality,
    gestation_weeks
FROM
    babyweight.babyweight_augmented_data
WHERE
    ABS(MOD(hashmonth, 4)) < 3;
    
CREATE OR REPLACE TABLE
    babyweight.babyweight_data_eval AS
SELECT
    weight_pounds,
    is_male,
    mother_age,
    plurality,
    gestation_weeks
FROM
    babyweight.babyweight_augmented_data
WHERE
    ABS(MOD(hashmonth, 4)) = 3

### Training Application 
* In babyweight/trainer, feel free to look at `model.py` and `task.py`. These files contain the code for a TensorFlow training application that builds and trains a model to predict baby weight.
* Running the next two cells packages the TensorFlow training application as a source distribution and uploads the gzipped tar file to GCS.
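Before uploading, it can be useful to confirm that the source distribution actually contains the trainer module. A minimal sketch using the standard `tarfile` module follows; the helper names `sdist_members` and `has_trainer_module` are hypothetical and not part of the lab code.

```python
import tarfile
from typing import List


def sdist_members(archive_path: str) -> List[str]:
    """Return the file names stored in a gzipped source distribution."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return archive.getnames()


def has_trainer_module(archive_path: str,
                       module_file: str = "trainer/task.py") -> bool:
    """Check that some archive member ends with the given module path."""
    return any(name.endswith(module_file)
               for name in sdist_members(archive_path))
```

After the packaging cell has run, calling `has_trainer_module("babyweight/dist/babyweight_trainer-0.1.tar.gz")` should return `True` if `trainer/task.py` was picked up by `find_packages()`.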

In [None]:
%%writefile babyweight/setup.py
from setuptools import find_packages
from setuptools import setup

setup(
    name='babyweight_trainer',
    version='0.1',
    packages=find_packages(),
    include_package_data=True,
    description='Babyweight model training application.'
)

In [None]:
%%bash
cd babyweight
python ./setup.py sdist --formats=gztar
cd ..
gsutil cp babyweight/dist/babyweight_trainer-0.1.tar.gz gs://${BUCKET}/babyweight/

### Custom Airflow Operators for Vertex AI  
* Airflow does not currently have Vertex AI operators, so we will build and push Docker containers that call the Vertex AI API. This lets us use Vertex AI services in our Composer DAG.

Feel free to look inside the folders vertex_train_docker, vertex_upload_model_docker, and vertex_deploy_docker if you are interested in the implementation of these custom containers.

Here we are creating Cloud Build config files to build and push the containers. 
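The three configs in the next cell differ only in the image name, so they could also be produced by one helper. A minimal sketch follows, assuming a hypothetical function `make_cloudbuild_config` and a placeholder project ID `my-project`; neither is part of the lab code.

```python
import json


def make_cloudbuild_config(project: str, image_name: str,
                           tag: str = "latest") -> dict:
    """Build a two-step Cloud Build config: docker build, then docker push."""
    image_uri = f"gcr.io/{project}/{image_name}:{tag}"
    builder = "gcr.io/cloud-builders/docker"
    return {
        "steps": [
            # Step 1: build the image from the Dockerfile in the current folder.
            {"name": builder, "args": ["build", "-t", image_uri, "."]},
            # Step 2: push the image to the project's registry.
            {"name": builder, "args": ["push", image_uri]},
        ]
    }


# "my-project" is a placeholder; the notebook uses the PROJECT variable.
config = make_cloudbuild_config("my-project", "vertex_train_image")
print(json.dumps(config, indent=2))
```

The returned dictionary has the same shape as `vertex_train_cloudbuild` and can be written to `cloudbuild.json` with `json.dump` in the same way.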

In [None]:
import json
vertex_train_cloudbuild = {
    "steps": [
        {
            "name": "gcr.io/cloud-builders/docker",
            "args": ["build", "-t", f"gcr.io/{PROJECT}/vertex_train_image:latest", "."],
        },
        {
            "name": "gcr.io/cloud-builders/docker",
            "args": ["push", f"gcr.io/{PROJECT}/vertex_train_image:latest"],
        },
    ]
}

vertex_upload_cloudbuild = {
    "steps": [
        {
            "name": "gcr.io/cloud-builders/docker",
            "args": [
                "build",
                "-t",
                f"gcr.io/{PROJECT}/vertex_upload_model_image:latest",
                ".",
            ],
        },
        {
            "name": "gcr.io/cloud-builders/docker",
            "args": ["push", f"gcr.io/{PROJECT}/vertex_upload_model_image:latest"],
        },
    ]
}
vertex_deploy_cloudbuild = {
    "steps": [
        {
            "name": "gcr.io/cloud-builders/docker",
            "args": [
                "build",
                "-t",
                f"gcr.io/{PROJECT}/vertex_deploy_image:latest",
                ".",
            ],
        },
        {
            "name": "gcr.io/cloud-builders/docker",
            "args": ["push", f"gcr.io/{PROJECT}/vertex_deploy_image:latest"],
        },
    ]
}

with open("./vertex_train_docker/cloudbuild.json", "w") as outfile:
    json.dump(vertex_train_cloudbuild, outfile)
    
with open("./vertex_upload_model_docker/cloudbuild.json", "w") as outfile:
    json.dump(vertex_upload_cloudbuild, outfile)
    
with open("./vertex_deploy_docker/cloudbuild.json", "w") as outfile:
    json.dump(vertex_deploy_cloudbuild, outfile)

#### Build and push the containers to your project's private Container Registry.
Later in the lab, you will use the image URIs of these containers to instantiate them as operators in your Airflow DAG. Each of the next 3 cells may take a few minutes to run, because Docker containers are being built and pushed to your project's private Container Registry.

In [None]:
%%bash
cd vertex_upload_model_docker
chmod +x build_image.sh
./build_image.sh
cd ..

In [None]:
%%bash
cd vertex_train_docker
chmod +x build_image.sh
./build_image.sh
cd ..

In [None]:
%%bash
cd vertex_deploy_docker
chmod +x build_image.sh
./build_image.sh
cd ..

Now you can build out your Composer DAG. The steps of the DAG will be:
* Export the data from your training and eval BigQuery tables to sharded CSVs in GCS
* Launch a Vertex AI Custom Training Job to train the TensorFlow model
* Upload the model to Vertex AI
* Create an endpoint and deploy the model 
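The training step passes its hyperparameters to the trainer as a JSON object inside the `--trainer_args` flag. In the DAG file this string is written by hand with escaped braces. A sketch of building the same string with `json.dumps` follows, assuming a hypothetical helper `build_trainer_args` and a placeholder bucket name; neither is part of the lab code.

```python
import json


def build_trainer_args(bucket: str, outdir: str) -> str:
    """Return the --trainer_args flag with its value serialized as JSON."""
    trainer_args = {
        "train_data_path": f"gs://{bucket}/babyweight/data/train*.csv",
        "eval_data_path": f"gs://{bucket}/babyweight/data/eval*.csv",
        "output_dir": outdir,
        "num_epochs": 10,
        "train_examples": 10000,
        "eval_steps": 100,
        "batch_size": 32,
        "nembeds": 8,
    }
    # json.dumps handles quoting and braces, so no manual escaping is needed.
    return "--trainer_args=" + json.dumps(trainer_args)


# "my-bucket" is a placeholder; the DAG derives BUCKET from the project ID.
print(build_trainer_args("my-bucket",
                         "gs://my-bucket/babyweight/trained_model"))
```

Keeping the arguments in a dictionary makes individual values easier to change than editing one long f-string.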

In [None]:
%%writefile babyweight_composer_dag.py
import datetime
import logging
from base64 import b64encode as b64e
from airflow import DAG
from airflow.hooks.base_hook import BaseHook
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from airflow.contrib.operators.bigquery_to_gcs import (
    BigQueryToCloudStorageOperator)


DEFAULT_ARGS = {
    'owner': 'Google Cloud Learner',
    'depends_on_past': False,
    'start_date': datetime.datetime.now(),
    'email': ['gcp.learning@fake-email.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5)
}

def _get_project_id():
    """Get project ID from default Google Cloud connection."""

    extras = BaseHook.get_connection("google_cloud_default").extra_dejson
    key = "extra__google_cloud_platform__project"
    if key in extras:
        project_id = extras[key]
    else:
        # Raise a real exception; raising a bare string is a TypeError
        raise ValueError("Must configure project_id in google_cloud_default "
                         "connection from Airflow Console")
    return project_id

PROJECT = _get_project_id()
BUCKET = PROJECT

# Output to store the model
OUTDIR = f'gs://{BUCKET}/babyweight/trained_model'
DATETIME = datetime.datetime.now().strftime("%Y%m%d%H%M%S")

# BQ Dataset and train/eval table names
DATASET = 'babyweight'
TRAIN_TABLE = 'babyweight_data_train'
EVAL_TABLE = 'babyweight_data_eval'

# Define Airflow DAG
with DAG(
        'babyweight_dag',
        catchup=False,
        default_args=DEFAULT_ARGS) as dag:
    
    # File path to store sharded CSVs exported from BigQuery
    data_path = f"gs://{BUCKET}/babyweight/data/"
    
    # Operator to shard the training data table to CSVs in GCS
    bq_export_train_csv_op = BigQueryToCloudStorageOperator(
        task_id="bq_export_gcs_train_csv_task",
        source_project_dataset_table=f"{DATASET}.{TRAIN_TABLE}",
        destination_cloud_storage_uris=[data_path + "train-*.csv"],
        export_format="CSV",
        print_header=False,
        dag=dag
    )
    
    # Operator to shard the eval data table to CSVs in GCS
    bq_export_eval_csv_op = BigQueryToCloudStorageOperator(
        task_id="bq_export_gcs_eval_csv_task",
        source_project_dataset_table=f"{DATASET}.{EVAL_TABLE}",
        destination_cloud_storage_uris=[data_path + "eval-*.csv"],
        export_format="CSV",
        print_header=False,
        dag=dag
    )
    
    # Operator to launch a Vertex AI Custom Training job
    vertex_train_op = (
        KubernetesPodOperator(
            image=f"gcr.io/{PROJECT}/vertex_train_image:latest",
            name="vertex_train_pod",
            arguments=[
                '--ml_framework=tensorflow',
                f'--project={PROJECT}',
                '--region=us-central1',
                '--job_display_name=babyweight-model-{}'.format(DATETIME),
                '--replica_count=1',
                '--pre_built_training_container_uri=us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-3:latest',
                f'--model_package_gcs_path=gs://{BUCKET}/babyweight/babyweight_trainer-0.1.tar.gz',
                '--python_module=trainer.task',
                '--machine_type=n1-standard-4',
                f'--trainer_args={{"train_data_path": "gs://{BUCKET}/babyweight/data/train*.csv", "eval_data_path": "gs://{BUCKET}/babyweight/data/eval*.csv", "output_dir": "{OUTDIR}", "num_epochs": 10, "train_examples": 10000, "eval_steps": 100, "batch_size": 32, "nembeds": 8}}'
            ],
            namespace="default",
            task_id="vertex_train_task",
            dag=dag
        )
    )
    
    # Operator to upload model to Vertex
    vertex_upload_model_op = (
        KubernetesPodOperator(
            image=f"gcr.io/{PROJECT}/vertex_upload_model_image:latest",
            name="vertex_upload_model_pod",
            arguments=[
                '--ml_framework=tensorflow',
                f'--project={PROJECT}',
                '--region=us-central1',
                '--model_display_name=babyweight-model-{}'.format(DATETIME),
                '--serving_container_image_uri=us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-3:latest',
                f'--artifact_uri={OUTDIR}',
            ],
            namespace="default",
            task_id="vertex_upload_model_task",
            dag=dag
        )
    )
        
    # Operator to create endpoint and deploy model 
    vertex_deploy_op = (
        KubernetesPodOperator(
            image=f"gcr.io/{PROJECT}/vertex_deploy_image:latest",
            name="vertex_deploy_pod",
            arguments=[
                f'--project={PROJECT}',
                '--region=us-central1',
                '--endpoint_display_name=babyweight-composer-model-endpoint',
                '--model_display_name=babyweight-model-{}'.format(DATETIME),
                '--deployed_model_display_name=deployed_model',
                '--machine_type=n1-standard-4',
            ],
            namespace="default",
            task_id="vertex_deploy_task",
            dag=dag
        )
    )
    
    [bq_export_train_csv_op, bq_export_eval_csv_op] >> vertex_train_op
    vertex_train_op >> vertex_upload_model_op
    vertex_upload_model_op >> vertex_deploy_op
    

Upload the DAG to your Composer environment. 

**Note**: If you get an error here, your Composer environment is likely still spinning up. You can verify this by navigating to the Composer UI in Google Cloud Console. You will need to wait for your Composer environment to be created before moving forward. 
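The environment's state can also be checked from the notebook. The sketch below assumes you have captured the output of `gcloud composer environments describe` run with `--format=json`, and that the JSON carries a top-level `state` field whose value is `RUNNING` once creation has finished. The helper name `environment_is_running` is hypothetical.

```python
import json


def environment_is_running(describe_json: str) -> bool:
    """Return True when the described Composer environment reports RUNNING."""
    description = json.loads(describe_json)
    # Any other state (for example CREATING) means the environment is not ready.
    return description.get("state") == "RUNNING"


# Illustrative input only; real output contains many more fields.
sample = '{"name": "my-composer-env", "state": "CREATING"}'
print(environment_is_running(sample))
```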

In [None]:
%%bash
gcloud composer environments storage dags import \
    --environment $COMPOSER_ENV \
    --location $REGION \
    --source babyweight_composer_dag.py

Run the pipeline

In [None]:
%%bash
gcloud composer environments run $COMPOSER_ENV \
    --location $REGION \
    dags trigger -- babyweight_dag

Monitor your pipeline run in the Airflow UI
* Run the following command to output the config of your Composer environment.
* Click on the airflowUri link to launch the Airflow UI where you can view and monitor your pipeline run
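If you prefer to pull the link out programmatically, the sketch below reads it from the JSON form of the describe output. It assumes the command was run with `--format=json` and that the link sits under `config.airflowUri`. The helper name `airflow_uri` and the sample URL are hypothetical.

```python
import json
from typing import Optional


def airflow_uri(describe_json: str) -> Optional[str]:
    """Return config.airflowUri from describe output, or None if absent."""
    description = json.loads(describe_json)
    # The field may be missing while the environment is still being created.
    return description.get("config", {}).get("airflowUri")


# Illustrative input only; the URL below is a made-up placeholder.
sample = '{"config": {"airflowUri": "https://example-airflow-ui.invalid"}}'
print(airflow_uri(sample))
```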

In [None]:
!gcloud composer environments describe $COMPOSER_ENV --location $REGION

### Clean Up
When you are finished with the lab, spin down your Cloud Composer environment.

In [None]:
%%bash
gcloud composer environments delete $COMPOSER_ENV \
    --location $REGION