In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# AutoMLOps - TensorFlow Transfer Learning GPU Example

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/automlops/blob/main/examples/training/02_tensorflow_transfer_learning_gpu_example.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/automlops/blob/main/examples/training/02_tensorflow_transfer_learning_gpu_example.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/automlops/main/examples/training/02_tensorflow_transfer_learning_gpu_example.ipynb">
        <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>
<br/><br/><br/>

# Overview

In this tutorial, you'll use transfer learning to train an image classification model on the cassava dataset from TensorFlow Datasets. The architecture is a ResNet50 model from the `tf.keras.applications` library, pretrained on the ImageNet dataset. The tutorial walks you through using AutoMLOps to define, create, and run an MLOps pipeline around this model training. Fine-tuning the ResNet50 model runs on a GPU.

# Objective
In this tutorial, you will learn how to create and run MLOps pipelines integrated with CI/CD. The tutorial trains a TensorFlow model using accelerators; the pipeline consists of the following components:
1. importer: Pipeline component that imports the saved TensorFlow model as an artifact for upload to Vertex AI Model Registry.
2. custom_train_model: A custom component that trains a TensorFlow model.
3. model_upload: Google Cloud pipeline component that uploads the model to Vertex AI Model Registry.

# Prerequisites

In order to use AutoMLOps, the following are required:

- Python 3.7 - 3.10
- [Google Cloud SDK 407.0.0](https://cloud.google.com/sdk/gcloud/reference)
- [beta 2022.10.21](https://cloud.google.com/sdk/gcloud/reference/beta)
- `git` installed
- `git` logged-in:
```
  git config --global user.email "you@example.com"
  git config --global user.name "Your Name"
```
- [Application Default Credentials (ADC)](https://cloud.google.com/docs/authentication/provide-credentials-adc) are set up. This can be done with the following commands:
```
gcloud auth application-default login
gcloud config set account <account@example.com>
```

# APIs & IAM
Depending on the options you select, AutoMLOps will enable some or all of the following APIs during the provision step:
- [aiplatform.googleapis.com](https://cloud.google.com/vertex-ai/docs/reference/rest)
- [artifactregistry.googleapis.com](https://cloud.google.com/artifact-registry/docs/reference/rest)
- [cloudbuild.googleapis.com](https://cloud.google.com/build/docs/api/reference/rest)
- [cloudfunctions.googleapis.com](https://cloud.google.com/functions/docs/reference/rest)
- [cloudresourcemanager.googleapis.com](https://cloud.google.com/resource-manager/reference/rest)
- [cloudscheduler.googleapis.com](https://cloud.google.com/scheduler/docs/reference/rest)
- [compute.googleapis.com](https://cloud.google.com/compute/docs/reference/rest/v1)
- [iam.googleapis.com](https://cloud.google.com/iam/docs/reference/rest)
- [iamcredentials.googleapis.com](https://cloud.google.com/iam/docs/reference/credentials/rest)
- [logging.googleapis.com](https://cloud.google.com/logging/docs/reference/v2/rest)
- [pubsub.googleapis.com](https://cloud.google.com/pubsub/docs/reference/rest)
- [run.googleapis.com](https://cloud.google.com/run/docs/reference/rest)
- [storage.googleapis.com](https://cloud.google.com/storage/docs/apis)
- [sourcerepo.googleapis.com](https://cloud.google.com/source-repositories/docs/reference/rest)


AutoMLOps will create the following service account and update its [IAM permissions](https://cloud.google.com/iam/docs/understanding-roles) during the provision step:
1. Pipeline Runner Service Account (defaults to: `vertex-pipelines@PROJECT_ID.iam.gserviceaccount.com`). Roles added:
    - roles/aiplatform.user
    - roles/artifactregistry.reader
    - roles/bigquery.user
    - roles/bigquery.dataEditor
    - roles/iam.serviceAccountUser
    - roles/storage.admin
    - roles/cloudfunctions.admin

# User Guide

For a user guide, see these [slides](../../AutoMLOps_User_Guide.pdf).

# Costs

This tutorial uses billable components of Google Cloud:
- Vertex AI
- Artifact Registry
- Cloud Storage
- Cloud Source Repository
- Cloud Build
- Cloud Run
- Cloud Scheduler
- Cloud Pub/Sub

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

# Ground rules for using AutoMLOps
1. Do not use variables, functions, or other code defined outside the scope of a custom component. Each custom component becomes its own container and has no access to out-of-scope code.
2. Add import statements and helper functions inside the component function, and provide type hints for every parameter.
3. Test each component for accuracy and correctness before running it with AutoMLOps. AutoMLOps cannot fix bugs automatically, and bugs are much harder to fix once the code has been turned into a pipeline.
4. If you are using Kubeflow, define all the packages required to run the custom component. A missing package will cause the container to fail when it runs within a pipeline.
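
To make the first two rules concrete, here is a minimal sketch in plain Python (no AutoMLOps decorator). The function names and the `out_of_scope_names` helper are hypothetical and exist only for illustration; the helper uses the standard library's `inspect.getclosurevars` as a rough heuristic for spotting names a function reads from outside its own body.

```python
import inspect

SCALE = 10.0  # defined outside any component: not available inside its container


def leaky_component(x: float) -> float:
    """Breaks rule 1: depends on SCALE, which is defined outside the function."""
    return x * SCALE


def self_contained_component(values: list, scale: float = 10.0) -> float:
    """Follows rules 1 and 2: typed parameters, imports inside the body."""
    import statistics

    return statistics.mean(values) * scale


def out_of_scope_names(func) -> list:
    """Lists module-level or enclosing-scope names that `func` reads."""
    closure = inspect.getclosurevars(func)
    return sorted(set(closure.globals) | set(closure.nonlocals))


print(out_of_scope_names(leaky_component))
print(out_of_scope_names(self_contained_component))
```

A non-empty list for a component is a hint that it may fail once it is containerized. This check is only a heuristic and does not replace testing the component.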


# Dataset
For training data, we use the [cassava dataset](https://www.tensorflow.org/datasets/catalog/cassava) from [TensorFlow Datasets](https://www.tensorflow.org/datasets). The dataset consists of cassava leaf images showing healthy plants and four disease conditions: Cassava Mosaic Disease (CMD), Cassava Bacterial Blight (CBB), Cassava Green Mite (CGM), and Cassava Brown Streak Disease (CBSD). It contains 9,430 labelled images, split into a training set (5,656), a test set (1,885), and a validation set (1,889). The classes are imbalanced: the two disease classes CMD and CBSD account for 72% of the images.
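
As a quick sanity check on the split sizes quoted above, the following sketch adds them up and prints each split's share of the dataset. It uses only the numbers from this section and does not download anything.

```python
# Split sizes quoted in the dataset description.
splits = {'train': 5656, 'test': 1885, 'validation': 1889}

total = sum(splits.values())
for name, count in splits.items():
    print(f'{name:<10} {count:>5} images  {count / total:.1%}')
print(f'{"total":<10} {total:>5} images')
```

The three splits sum to the stated 9,430 images, in roughly a 60/20/20 ratio.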

# Setup Git
Set up your git configuration below.

In [None]:
!git config --global user.email 'you@example.com'
!git config --global user.name 'Your Name'

# Install AutoMLOps

Install AutoMLOps from [PyPI](https://pypi.org/project/google-cloud-automlops/), or locally by cloning the repo and running `pip install .`

In [None]:
!pip3 install google-cloud-automlops --user

# Restart the kernel
Once you've installed the AutoMLOps package, you need to restart the notebook kernel so it can find the package.

**Note: Once this cell has finished running, continue on. You do not need to re-run any of the cells above.**

In [None]:
import os

if not os.getenv('IS_TESTING'):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

# Set variables
Set the variables below. If you don't know your project ID, leave the placeholder unchanged and the following cell will try to read it from your gcloud configuration.

In [1]:
PROJECT_ID = '[your-project-id]'  # @param {type:"string"}

BUCKET_NAME = 'automlops-sandbox-bucket'  # @param {type:"string"}
BUCKET_URI = f'gs://{BUCKET_NAME}'
MODEL_DIR = BUCKET_URI + '/tensorflow_model'

TRAINING_IMAGE = 'us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-11.py310:latest' # includes required cuda packages
SERVING_IMAGE = 'us-docker.pkg.dev/vertex-ai/prediction/tf-gpu.2-11.py310:latest'

In [2]:
if PROJECT_ID == '' or PROJECT_ID is None or PROJECT_ID == '[your-project-id]':
    # Get your GCP project id from gcloud
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print('Project ID:', PROJECT_ID)

Project ID: automlops-sandbox
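
The `!gcloud ...` syntax above only works inside IPython. If you need the same lookup in a plain Python script, a minimal sketch using `subprocess` might look like the following. The `read_project_id` helper is hypothetical, and a stand-in command replaces `gcloud` so that the example runs without the CLI installed.

```python
import subprocess
import sys
from typing import List, Optional


def read_project_id(command: List[str]) -> Optional[str]:
    """Runs `command` and returns its first non-empty output line, if any."""
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    lines = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    return lines[0] if lines else None


# Real lookup (requires the gcloud CLI on PATH):
#   read_project_id(['gcloud', 'config', 'list', '--format', 'value(core.project)'])

# Stand-in command so the sketch runs anywhere.
print(read_project_id([sys.executable, '-c', 'print("demo-project")']))
```

Returning `None` on empty output avoids the `IndexError` that indexing an empty result would raise when no project is configured.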


In [3]:
! gcloud config set project $PROJECT_ID

Updated property [core/project].


Set your `MODEL_ID` below:

In [None]:
MODEL_ID = 'cassava-resnet-50'

# AutoMLOps TensorFlow Example
This workflow generates a pipeline using the Kubeflow Pipelines spec. AutoMLOps provides 2 functions for defining MLOps pipelines:

- `AutoMLOps.component(...)`: Defines a component, which is a containerized python function.
- `AutoMLOps.pipeline(...)`: Defines a pipeline, which is a series of components.

AutoMLOps provides 6 functions for building and maintaining MLOps pipelines:

- `AutoMLOps.generate(...)`: Generates the MLOps codebase. Users can specify the tooling and technologies they would like to use in their MLOps pipeline.
- `AutoMLOps.provision(...)`: Runs provisioning scripts to create and maintain necessary infra for MLOps.
- `AutoMLOps.deprovision(...)`: Runs deprovisioning scripts to tear down MLOps infra created using AutoMLOps.
- `AutoMLOps.deploy(...)`: Builds and pushes component container, then triggers the pipeline job.
- `AutoMLOps.launchAll(...)`: Runs `generate()`, `provision()`, and `deploy()` all in succession.
- `AutoMLOps.monitor(...)`: Creates model monitoring jobs on deployed endpoints.

Please see the [readme](https://github.com/GoogleCloudPlatform/automlops/blob/main/README.md) for more information.

**Note: This workflow requires python package `kfp>=2.0.0`.**
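
One way to confirm this requirement before continuing is to read the installed version with the standard library's `importlib.metadata` (Python 3.8+). This is a sketch: `installed_major_version` is a hypothetical helper, and it compares only the major version number.

```python
from importlib import metadata
from typing import Optional


def installed_major_version(package: str) -> Optional[int]:
    """Returns the major version of an installed package, or None if it is missing."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return int(version.split('.')[0])


major = installed_major_version('kfp')
if major is None:
    print('kfp is not installed')
elif major < 2:
    print(f'kfp major version {major} found; this workflow expects 2 or later')
else:
    print(f'kfp major version {major} is compatible')
```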

## Imports

In [4]:
from kfp.dsl import Metrics, Model, Output
from google_cloud_automlops import AutoMLOps

## Model Training
Define a custom component for training a model. The architecture is a ResNet50 model from the `tf.keras.applications` library, pretrained on the [ImageNet dataset](https://www.image-net.org/).

In [6]:
@AutoMLOps.component(
    packages_to_install=[
        'tensorflow',
        'tensorflow_datasets',
        'opencv-python-headless'
    ]
)
def custom_train_model(
    metrics: Output[Metrics],
    model_dir: str,
    output_model: Output[Model],
    lr: float = 0.001,
    epochs: int = 10,
    steps: int = 200,
    distribute: str = 'single'
):
    import faulthandler
    import os
    import sys

    import tensorflow as tf
    import tensorflow_datasets as tfds
    from tensorflow.python.client import device_lib

    faulthandler.enable()
    tfds.disable_progress_bar()

    print('Component start')

    print(f'Python Version = {sys.version}')
    print(f'TensorFlow Version = {tf.__version__}')
    print(f'''TF_CONFIG = {os.environ.get('TF_CONFIG', 'Not found')}''')
    print(f'DEVICES = {device_lib.list_local_devices()}')

    # Single machine, single compute device
    if distribute == 'single':
        if tf.config.list_physical_devices('GPU'):
            strategy = tf.distribute.OneDeviceStrategy(device='/gpu:0')
        else:
            strategy = tf.distribute.OneDeviceStrategy(device='/cpu:0')
    # Single machine, multiple compute devices
    elif distribute == 'mirror':
        strategy = tf.distribute.MirroredStrategy()
    # Multiple machines, multiple compute devices
    elif distribute == 'multi':
        strategy = tf.distribute.MultiWorkerMirroredStrategy()
    else:
        raise ValueError(f'Unknown distribute option: {distribute}')

    # Multi-worker configuration
    print(f'num_replicas_in_sync = {strategy.num_replicas_in_sync}')

    # Preparing dataset
    BUFFER_SIZE = 10000
    BATCH_SIZE = 64

    def preprocess_data(image, label):
        '''Resizes and scales images.'''

        image = tf.image.resize(image, (300,300))
        return tf.cast(image, tf.float32) / 255., label

    def create_dataset(batch_size: int):
        '''Loads Cassava dataset and preprocesses data.'''

        data, info = tfds.load(name='cassava', as_supervised=True, with_info=True)
        number_of_classes = info.features['label'].num_classes
        train_data = data['train'].map(preprocess_data,
                                       num_parallel_calls=tf.data.AUTOTUNE)
        train_data = train_data.cache().shuffle(BUFFER_SIZE).repeat()
        train_data = train_data.batch(batch_size)
        train_data = train_data.prefetch(tf.data.AUTOTUNE)

        # Set AutoShardPolicy
        options = tf.data.Options()
        options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
        train_data = train_data.with_options(options)

        return train_data, number_of_classes

    # Build the ResNet50 Keras model
    def create_model(number_of_classes: int, lr: float = 0.001):
        '''Creates and compiles pretrained ResNet50 model.'''

        base_model = tf.keras.applications.ResNet50(weights='imagenet', include_top=False)
        x = base_model.output
        x = tf.keras.layers.GlobalAveragePooling2D()(x)
        x = tf.keras.layers.Dense(1016, activation='relu')(x)
        predictions = tf.keras.layers.Dense(number_of_classes, activation='softmax')(x)
        model = tf.keras.Model(inputs=base_model.input, outputs=predictions)

        model.compile(
            loss=tf.keras.losses.sparse_categorical_crossentropy,
            optimizer=tf.keras.optimizers.Adam(lr),
            metrics=['accuracy'])
        return model

    # Train the model
    NUM_WORKERS = strategy.num_replicas_in_sync
    # Here the batch size scales up by number of workers since
    # `tf.data.Dataset.batch` expects the global batch size.
    GLOBAL_BATCH_SIZE = BATCH_SIZE * NUM_WORKERS
    train_dataset, number_of_classes = create_dataset(GLOBAL_BATCH_SIZE)

    with strategy.scope():
        # Model building and compiling need to happen within `strategy.scope()`.
        resnet_model = create_model(number_of_classes, lr)

    h = resnet_model.fit(x=train_dataset, epochs=epochs, steps_per_epoch=steps)
    acc = h.history['accuracy'][-1]
    resnet_model.save(model_dir)
    
    output_model.path = model_dir
    metrics.log_metric('accuracy', (acc * 100.0))
    metrics.log_metric('framework', 'Tensorflow')
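
The component scales the batch size by the number of replicas and trains for a fixed number of steps per epoch on a repeated dataset. The following sketch works through that arithmetic using the training split size from the Dataset section and the component's defaults. The replica counts are illustrative assumptions, not values from this tutorial.

```python
import math

TRAIN_IMAGES = 5656        # training split size from the Dataset section
PER_REPLICA_BATCH = 64     # BATCH_SIZE in the component
STEPS_PER_EPOCH = 200      # default `steps` argument


def global_batch_size(per_replica_batch: int, num_replicas: int) -> int:
    """Batch size passed to `Dataset.batch` when training on several replicas."""
    return per_replica_batch * num_replicas


for replicas in (1, 2, 4):
    batch = global_batch_size(PER_REPLICA_BATCH, replicas)
    steps_for_one_pass = math.ceil(TRAIN_IMAGES / batch)
    passes_per_epoch = STEPS_PER_EPOCH * batch / TRAIN_IMAGES
    print(f'{replicas} replica(s): global batch {batch}, '
          f'{steps_for_one_pass} steps per full pass, '
          f'{passes_per_epoch:.2f} passes per epoch at {STEPS_PER_EPOCH} steps')
```

With a single replica, 200 steps of 64 images cover the training split a little more than twice per epoch. This works because the dataset is repeated indefinitely.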

## Define the Pipeline
Define your pipeline using `@AutoMLOps.pipeline`. You can optionally give the pipeline a name and description. Define the structure by listing the components to be called in your pipeline; use `.after` to specify the order of execution.

In [7]:
@AutoMLOps.pipeline(name='tensorflow-gpu-example')
def pipeline(
    project_id: str,
    model_dir: str,
    lr: float,
    epochs: int,
    steps: int,
    serving_image: str,
    distribute: str,
):
    from google_cloud_pipeline_components.types import artifact_types
    from google_cloud_pipeline_components.v1.model import ModelUploadOp
    from kfp import dsl

    custom_train_model_task = custom_train_model(
        model_dir=model_dir,
        lr=lr,
        epochs=epochs,
        steps=steps,
        distribute=distribute
    )

    # Import the saved model as an artifact that ModelUploadOp can consume.
    unmanaged_model_importer = dsl.importer(
        artifact_uri=model_dir,
        artifact_class=artifact_types.UnmanagedContainerModel,
        metadata={
            'containerSpec': {
                'imageUri': serving_image
            }
        },
    )

    model_upload_op = ModelUploadOp(
        project=project_id,
        display_name='tensorflow_gpu_example',
        unmanaged_container_model=unmanaged_model_importer.outputs['artifact'],
    )
    model_upload_op.after(custom_train_model_task)
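
One way to picture the resulting execution order is as a small dependency graph: the upload step depends on the importer through its output artifact and on the training step through `.after`. The sketch below resolves that graph with the standard library's `graphlib` (Python 3.9+). The task names are labels chosen for this illustration, and the sketch does not call KFP or AutoMLOps.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    'custom_train_model': set(),
    'importer': set(),
    # 'importer' via its output artifact, 'custom_train_model' via .after()
    'model_upload': {'importer', 'custom_train_model'},
}

order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The training and importer steps have no dependency on each other, so either may run first. The upload step always runs last.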

## Define the Pipeline Arguments

In [8]:
pipeline_params = {
    'project_id': PROJECT_ID,
    'model_dir': MODEL_DIR,
    'lr': 0.01,
    'epochs': 10,
    'steps': 200,
    'serving_image': SERVING_IMAGE,
    'distribute': 'single'
}
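
The keys of `pipeline_params` need to match the pipeline function's parameter names. A quick local check with `inspect.signature` can catch a typo before anything is deployed. This is a sketch: `example_pipeline`, `example_params`, and `compare_params` are hypothetical stand-ins rather than part of AutoMLOps.

```python
import inspect


def example_pipeline(project_id: str, model_dir: str, lr: float, epochs: int):
    """Stand-in with a few of the parameters used by the pipeline above."""


example_params = {
    'project_id': 'my-project',
    'model_dir': 'gs://my-bucket/tensorflow_model',
    'lr': 0.01,
    'epoch': 10,  # deliberate typo for 'epochs'
}


def compare_params(func, params: dict) -> dict:
    """Reports required parameters that are missing and keys that are unknown."""
    signature = inspect.signature(func).parameters
    required = {name for name, p in signature.items()
                if p.default is inspect.Parameter.empty}
    return {
        'missing': sorted(required - set(params)),
        'unexpected': sorted(set(params) - set(signature)),
    }


print(compare_params(example_pipeline, example_params))
```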

## Generate and Run the pipeline
`AutoMLOps.launchAll(...)` runs `generate()`, `provision()`, and `deploy()` in succession. Here we specify a custom training job spec that uses an NVIDIA A100 GPU to accelerate model training.

*Note: if you run the cell below without specifying a larger machine type, the training job will run out of memory and fail with:*
```
The replica workerpool0-0 ran out-of-memory and exited with a non-zero status of 137(SIGKILL). To find out more about why your job exited please check the logs:
```

This use case is a good example of where specifying `custom_training_job_specs` for AutoMLOps is both useful and necessary.

In [None]:
AutoMLOps.launchAll(project_id=PROJECT_ID,
                    pipeline_params=pipeline_params,
                    use_ci=True,
                    schedule_pattern='59 11 * * 0', # cron: retrain every Sunday at 11:59
                    base_image=TRAINING_IMAGE,
                    naming_prefix=MODEL_ID,
                    custom_training_job_specs=[{
                       'component_spec': 'custom_train_model',
                       'display_name': 'train-model-accelerated',
                       'machine_type': 'a2-highgpu-1g',
                       'accelerator_type': 'NVIDIA_TESLA_A100',
                       'accelerator_count': 1
                    }]
)
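
The `schedule_pattern` value is a five-field cron expression. As a rough illustration of how the fields are read, the sketch below splits the pattern into named parts. `describe_cron` is a hypothetical helper and not part of AutoMLOps. In standard unix-cron syntax, a day-of-week value of `0` means Sunday, so `59 11 * * 0` fires at 11:59 on Sundays.

```python
CRON_FIELDS = ('minute', 'hour', 'day_of_month', 'month', 'day_of_week')


def describe_cron(pattern: str) -> dict:
    """Maps the five space-separated fields of a cron pattern to their names."""
    values = pattern.split()
    if len(values) != len(CRON_FIELDS):
        raise ValueError(f'expected {len(CRON_FIELDS)} fields, got {len(values)}')
    return dict(zip(CRON_FIELDS, values))


print(describe_cron('59 11 * * 0'))
```

For a run at midnight on Sundays, the minute and hour fields would both be `0`.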