<a href="https://colab.research.google.com/github/AnandInguva/beam/blob/notebook/beam/examples/notebooks%20/beam-ml/side_Input_model_updates.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License

# Use WatchFilePattern to auto-update ML models in RunInference

The pipeline in this notebook uses a `RunInference` PTransform to run inference on images using TensorFlow models. It uses a side input PCollection that emits `ModelMetadata` to update the model.

With side inputs, you can update your model (which is passed in a `ModelHandler` configuration object) in real time while the Beam pipeline is still running. To do so, either use one of Beam's provided patterns, such as `WatchFilePattern`, or configure a custom side input `PCollection` that defines the logic for the model update.

For more information about side inputs, see the [Side inputs](https://beam.apache.org/documentation/programming-guide/#side-inputs) section in the Apache Beam Programming Guide.

This notebook uses `WatchFilePattern` as a side input. `WatchFilePattern` watches for files that match the `file_pattern` and uses their timestamps to determine the latest one. It emits the latest `ModelMetadata`, which the `RunInference` PTransform uses to automatically update the ML model without stopping the Beam pipeline.
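
The selection step described above can be illustrated without Beam. The following is a minimal sketch, not Beam's implementation: `ModelMetadataSketch` is a hypothetical stand-in for `ModelMetadata`, `latest_model` is a hypothetical helper, and the `(path, last_updated)` tuples stand in for file-match results. It picks the most recently updated file and wraps it as metadata.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class ModelMetadataSketch(NamedTuple):
  """Hypothetical stand-in for Beam's ModelMetadata (illustration only)."""
  model_id: str    # path used to load the model
  model_name: str  # short, human-readable name


def latest_model(
    files: Sequence[Tuple[str, float]]) -> Optional[ModelMetadataSketch]:
  """Returns metadata for the most recently updated file, or None if empty."""
  if not files:
    return None
  path, _ = max(files, key=lambda item: item[1])
  # Assumption: derive the name from the file name without its extension.
  file_name = path.rsplit('/', 1)[-1]
  return ModelMetadataSketch(
      model_id=path, model_name=file_name.rsplit('.', 1)[0])


# Example (path, last_updated) pairs; the paths are placeholders.
matched = [
    ('gs://your-bucket/resnet101.h5', 1000.0),
    ('gs://your-bucket/resnet152.h5', 2000.0),
]
print(latest_model(matched))
```

In the real pipeline, `WatchFilePattern` performs the matching and wrapping itself, so a helper like this is not needed.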


### Before you begin
Install the necessary dependencies that are used to run this notebook.

To use RunInference with side inputs for automatic model updates, install `Apache Beam` version `2.46.0` or later.

In [None]:
!pip install 'apache_beam[gcp]>=2.46.0' --quiet
!pip install tensorflow
!pip install tensorflow_hub
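
To confirm that the installed Beam version meets the `2.46.0` minimum, a simple comparison of the numeric version components is usually enough. The sketch below uses two hypothetical helpers, `parse_version` and `meets_minimum`. It ignores pre-release suffixes such as `rc1`, so treat it as a rough check rather than a full version parser.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
  """Parses the leading numeric components of a version such as '2.46.0'."""
  parts = []
  for piece in version.split('.'):
    digits = ''
    for ch in piece:
      if not ch.isdigit():
        break
      digits += ch
    if not digits:
      break
    parts.append(int(digits))
  return tuple(parts)


def meets_minimum(installed: str, minimum: str = '2.46.0') -> bool:
  """Returns True if the installed version is at least the minimum."""
  return parse_version(installed) >= parse_version(minimum)


# In the notebook, pass beam.__version__ instead of a literal string.
print(meets_minimum('2.46.0'))
```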

In [None]:
# Imports required for the notebook.
import logging
import time
from typing import Iterable

import apache_beam as beam
from apache_beam.examples.inference.tensorflow_imagenet_segmentation import PostProcessor
from apache_beam.examples.inference.tensorflow_imagenet_segmentation import read_image
from apache_beam.ml.inference.base import PredictionResult
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.tensorflow_inference import TFModelHandlerTensor
from apache_beam.ml.inference.utils import WatchFilePattern
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import StandardOptions
from apache_beam.transforms.periodicsequence import PeriodicImpulse

In [None]:
# Authenticate to your Google Cloud account.
from google.colab import auth
auth.authenticate_user()

# Pipeline options

Configure the pipeline options for the pipeline to run on Dataflow. Make sure the streaming mode is on for this pipeline.

In [None]:
options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

# provide required pipeline options for DataflowRunner
options.view_as(StandardOptions).runner = "DataflowRunner"

# Sets the Google Cloud project. Replace 'your-project' with your project ID.
options.view_as(GoogleCloudOptions).project = 'your-project'

# Sets the Google Cloud Region in which Cloud Dataflow runs.
options.view_as(GoogleCloudOptions).region = 'us-central1'

# IMPORTANT! Adjust the following to choose a Cloud Storage location.
# Omit the trailing slash, because the paths below append '/staging' and '/temp'.
dataflow_gcs_location = "gs://your-bucket/tmp"

# Dataflow Staging Location. This location is used to stage the Dataflow Pipeline and SDK binary.
options.view_as(GoogleCloudOptions).staging_location = '%s/staging' % dataflow_gcs_location

# Dataflow Temp Location. This location is used to store temporary files or intermediate results before finally outputting to the sink.
options.view_as(GoogleCloudOptions).temp_location = '%s/temp' % dataflow_gcs_location
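
Formatting a location with `'%s/staging'` produces a double slash whenever the base location ends with `/`. One way to avoid this is a small join helper. The `gcs_join` function below is a hypothetical helper written for illustration and is not part of Beam.

```python
def gcs_join(base: str, *parts: str) -> str:
  """Joins path segments with single slashes, keeping the gs:// scheme intact."""
  cleaned = [part.strip('/') for part in parts]
  segments = [base.rstrip('/')] + [part for part in cleaned if part]
  return '/'.join(segments)


# Works whether or not the base location ends with a slash.
print(gcs_join('gs://your-bucket/tmp/', 'staging'))
print(gcs_join('gs://your-bucket/tmp', 'temp'))
```

The results could then be assigned to `staging_location` and `temp_location` in place of the `%`-formatted strings.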



The pipeline's dependencies, including `tensorflow` and `tensorflow_hub`, must also be installed on the Dataflow workers. Pass them with the `requirements_file` pipeline option.

In [None]:
# Define the dependencies that the pipeline requires.
deps_required_for_pipeline = ['tensorflow>=2.12.0', 'tensorflow-hub>=0.10.0', 'Pillow>=9.0.0']
requirements_file_path = './requirements.txt'
# Write the dependencies to a requirements file.
with open(requirements_file_path, 'w') as f:
  for dep in deps_required_for_pipeline:
    f.write(dep + '\n')

# Dataflow installs the dependencies listed in this file on the workers.
options.view_as(SetupOptions).requirements_file = requirements_file_path

Next, define the configuration for `PeriodicImpulse`.

  * The `PeriodicImpulse` transform generates an infinite sequence of elements at a given runtime interval.

This notebook uses `PeriodicImpulse` to mimic a `Pub/Sub` source. Because the inputs in a streaming pipeline arrive at intervals, `PeriodicImpulse` emits one element every `main_input_fire_interval` seconds.

To learn more about `PeriodicImpulse`, see the [code](https://github.com/apache/beam/blob/9c52e0594d6f0e59cd17ee005acfb41da508e0d5/sdks/python/apache_beam/transforms/periodicsequence.py#L150).

In [None]:
start_timestamp = time.time()
end_timestamp = start_timestamp + 60 * 20
main_input_fire_interval = 60 # interval at which the main input PCollection is emitted.
side_input_fire_interval = 60 # interval at which the side input PCollection is emitted.
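
To get a feel for how many elements this configuration produces, the firing times can be enumerated in plain Python. This is a sketch under a stated assumption: the first element fires at the start timestamp and firing stops strictly before the stop timestamp. The exact boundary behavior of `PeriodicImpulse` may differ, and `impulse_timestamps` is a hypothetical helper.

```python
from typing import List


def impulse_timestamps(
    start: float, stop: float, interval: float) -> List[float]:
  """Lists firing times from start (inclusive) to stop (exclusive)."""
  if interval <= 0:
    raise ValueError('interval must be positive')
  timestamps = []
  current = start
  while current < stop:
    timestamps.append(current)
    current += interval
  return timestamps


# Same shape as the notebook's settings: a 20-minute run, firing every 60 seconds.
start = 0.0
firings = impulse_timestamps(start, start + 60 * 20, 60)
print(len(firings))  # 20 under the boundary assumption above
```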

## TensorFlow ModelHandler
This notebook uses `TFModelHandlerTensor` as the ModelHandler with a `resnet_101` model trained on ImageNet.

Download the model from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet101_weights_tf_dim_ordering_tf_kernels.h5 and place it in the directory that you want to watch for automatic model updates.

In [None]:
model_handler = TFModelHandlerTensor(
    model_uri="gs://your-bucket/resnet101_weights_tf_dim_ordering_tf_kernels.h5")

Now, let's jump into the pipeline code.

**Pipeline steps**:
1. Create a `PeriodicImpulse` transform, which emits an element every `main_input_fire_interval` seconds.
2. Read and pre-process the images using the `read_image` function.
3. Pass the images to the RunInference `PTransform`. RunInference takes `model_handler` and `model_metadata_pcoll` as input parameters.


The `model_metadata_pcoll` is a [side input](https://beam.apache.org/documentation/programming-guide/#side-inputs) `PCollection` to the RunInference `PTransform`. This side input updates the model in the `model_handler` without stopping the Beam pipeline.
This notebook uses `WatchFilePattern` as the side input to watch a glob pattern that matches `.h5` files.

`model_metadata_pcoll` expects a `PCollection` of `ModelMetadata` that is compatible with the [AsSingleton](https://beam.apache.org/releases/pydoc/2.4.0/apache_beam.pvalue.html#apache_beam.pvalue.AsSingleton) view. Because the pipeline uses `WatchFilePattern` as the side input, `WatchFilePattern` takes care of windowing and of wrapping the output into `ModelMetadata`.

**How to watch for auto model update**

After the pipeline starts processing data and you see output from the RunInference `PTransform`, upload a `.h5` `TensorFlow` model (for example, [resnet_152](https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet152_weights_tf_dim_ordering_tf_kernels.h5)) that matches the `file_pattern` to the Google Cloud Storage bucket. RunInference then updates the `model_uri` of `TFModelHandlerTensor` using `WatchFilePattern` as a side input.
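
Before uploading, it can help to check that the new file name matches the `file_pattern`. The sketch below is an approximation and not Beam's matching logic: it assumes `*` matches any characters within a single path segment, and `matches_pattern` is a hypothetical helper.

```python
import re


def matches_pattern(path: str, pattern: str) -> bool:
  """Checks a path against a glob where '*' does not cross '/' boundaries."""
  regex = ''.join(
      '[^/]*' if ch == '*' else re.escape(ch) for ch in pattern)
  return re.fullmatch(regex, path) is not None


file_pattern = 'gs://your-bucket/*.h5'
# A model in the watched directory matches the pattern.
print(matches_pattern('gs://your-bucket/resnet152.h5', file_pattern))
# A file in a subdirectory or with another extension does not.
print(matches_pattern('gs://your-bucket/old/resnet152.h5', file_pattern))
print(matches_pattern('gs://your-bucket/resnet152.txt', file_pattern))
```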

**Note**: Side input update frequency is non-deterministic and can have longer intervals between updates.

When the inference is complete, RunInference outputs a `PredictionResult` object that contains `example`, `inference`, and `model_id` fields. The `model_id` is used to identify which model is used for running the inference.
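
Counting results per `model_id` is one way to see when the pipeline switched models. The following sketch uses `PredictionResultSketch`, a hypothetical stand-in with the three fields named above, so it runs without Beam. The sample values are placeholders, not real pipeline output.

```python
from collections import Counter
from typing import Any, Iterable, NamedTuple, Optional


class PredictionResultSketch(NamedTuple):
  """Hypothetical stand-in for PredictionResult (illustration only)."""
  example: Any
  inference: Any
  model_id: Optional[str] = None


def count_by_model(results: Iterable[PredictionResultSketch]) -> Counter:
  """Counts how many predictions each model produced."""
  return Counter(result.model_id for result in results)


# Placeholder results: two from the initial model, one after an update.
results = [
    PredictionResultSketch('img-1', 'label-a', 'gs://your-bucket/resnet101.h5'),
    PredictionResultSketch('img-2', 'label-a', 'gs://your-bucket/resnet101.h5'),
    PredictionResultSketch('img-3', 'label-a', 'gs://your-bucket/resnet152.h5'),
]
for model_id, count in count_by_model(results).items():
  print(model_id, count)
```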

In [None]:
# file_pattern used in WatchFilePattern to watch for the latest model files.
file_pattern = 'gs://your-bucket/*.h5'
# The context manager creates the pipeline and runs it when the block exits.
with beam.Pipeline(options=options) as pipeline:

  # side input used to watch for .h5 file and auto update the model_uri of the TFModelHandlerTensor.
  side_input_pcoll = (
      pipeline
      | "WatchFilePattern" >> WatchFilePattern(file_pattern=file_pattern,
                                                interval=side_input_fire_interval,
                                                stop_timestamp=end_timestamp))

  read_images = (
      pipeline
      | "MainInputPcoll" >> PeriodicImpulse(
          start_timestamp=start_timestamp,
          stop_timestamp=end_timestamp,
          fire_interval=main_input_fire_interval)
      # since this example focuses on the auto model updates, we will use only one image for every prediction.
      | beam.Map(lambda x: "Cat-with-beanie.jpg")
      | "ReadImage" >> beam.Map(lambda image_name: read_image(
          image_name=image_name, image_dir='https://storage.googleapis.com/apache-beam-samples/image_captioning/')))

  inferences = (read_images | "ApplyWindowing" >> beam.WindowInto(beam.window.FixedWindows(10))
      | "RunInference" >> RunInference(model_handler=model_handler,
                                        model_metadata_pcoll=side_input_pcoll))

  post_processor = (inferences | "PostProcessResults" >> beam.ParDo(PostProcessor()))

  post_processor | "print" >> beam.Map(logging.info)
