# Working with Streaming Data

Learning Objectives
 - Learn how to process real-time data for ML models

## Introduction

It can be useful to leverage real-time data in a machine learning model when making a prediction. However, doing so requires setting up a streaming data pipeline, which can be non-trivial.

Typically you will have the following:
 - A series of IoT devices generating and sending data from the field in real-time (in our case these are the taxis)
 - A messaging bus that receives and temporarily stores the IoT data (in our case this is Cloud Pub/Sub)
 - A streaming processing service that subscribes to the messaging bus, windows the messages and performs data transformations on each window (in our case this is Cloud Dataflow)
 - A persistent store to keep the processed data (in our case this is BigQuery)

These steps happen continuously and in real-time, and are illustrated by the blue arrows in the diagram below. 

Once this streaming data pipeline is established, we need to modify our model serving to leverage it. This simply means adding a call to the persistent store (BigQuery) to fetch the latest real-time data when a prediction request comes in. This flow is illustrated by the red arrows in the diagram below. 

<img src='assets/taxi_streaming_data.png' width='80%'>


In this lab we will address how to process real-time data for machine learning models. We will use the same data as our previous 'taxifare' labs, but with `trips_last_5min` as an additional feature. This is our proxy for real-time traffic.



In [None]:
import tensorflow as tf
import numpy as np
import shutil
import os

from matplotlib import pyplot as plt
from tensorflow import keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, DenseFeatures
from tensorflow.keras.callbacks import TensorBoard

print(tf.__version__)

## Re-train our model with `trips_last_5min` feature

In this lab, we want to show how to process real-time data for training and prediction, so we need to retrain our previous model with this additional feature. Open and run the notebook `train.ipynb` to train and save a model. This notebook is very similar to what we did in the Introduction to TensorFlow module, but note the added `trips_last_5min` feature in the model and the dataset.

## Simulate Real Time Taxi Data

Since we don’t actually have real-time taxi data, we will synthesize it using a simple Python script. The script publishes events to Google Cloud Pub/Sub.

Inspect the `iot_devices.py` script in the `taxicab_traffic` folder. It is configured to send about 2,000 trip messages every five minutes with some randomness in the frequency to mimic traffic fluctuations. These numbers come from looking at the historical average of taxi ride frequency in BigQuery. 

In production this script would be replaced with actual taxis with IoT devices sending trip data to Cloud Pub/Sub. 
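
To illustrate the idea of a randomized publishing rate, here is a minimal sketch. It is not the contents of `iot_devices.py`: the function name `simulate_publishing`, the `jitter` parameter, and the injected `publish` callable are all hypothetical, and the sketch neither sleeps between ticks nor talks to Pub/Sub. It only uses the two figures given in this lab (about 2,000 messages per five minutes, published in 5-second batches).

```python
import random

TRIPS_PER_5MIN = 2000    # average rate quoted in the lab text
PUBLISH_INTERVAL_S = 5   # batch interval quoted in the lab text


def simulate_publishing(num_ticks, publish, rng=None, jitter=0.2):
    """Call `publish` once per simulated trip and return the total sent.

    `jitter` (an assumed parameter) scales each batch by a random factor
    in [1 - jitter, 1 + jitter] to mimic traffic fluctuations.
    """
    rng = rng or random.Random()
    # Average number of messages per 5-second batch.
    base = TRIPS_PER_5MIN * PUBLISH_INTERVAL_S / 300
    total = 0
    for _ in range(num_ticks):
        batch_size = max(0, round(base * rng.uniform(1 - jitter, 1 + jitter)))
        for _ in range(batch_size):
            publish(b"taxi_ride")  # placeholder payload
        total += batch_size
    return total


# Simulate five minutes (60 batches), collecting messages in a list
# instead of sending them to a real topic.
sent = []
total = simulate_publishing(num_ticks=60, publish=sent.append, rng=random.Random(0))
print(total)  # roughly 2,000, depending on the seed
```

Passing the publisher in as a callable keeps the rate logic testable without cloud credentials; in a real script that callable would wrap the Pub/Sub client.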

To execute the `iot_devices.py` script, launch a terminal and navigate to the `training-data-analyst/courses/machine_learning/production_ml` directory. Then run the following two commands.

```bash
PROJECT_ID=$(gcloud config list project --format "value(core.project)")
python ./taxicab_traffic/iot_devices.py --project=$PROJECT_ID
```

You will see new messages being published every 5 seconds. **Keep this terminal open** so it continues to publish events to the Pub/Sub topic. If you open [Pub/Sub in your Google Cloud Console](https://console.cloud.google.com/cloudpubsub/topic/list), you should be able to see a topic called `taxi_rides`.

## Create a BigQuery table to collect the processed data

In the next section, we will create a Dataflow pipeline to write processed taxifare data to a BigQuery table. However, that table does not yet exist. Execute the following commands to create a BigQuery dataset called `taxifare` and a table within that dataset called `traffic_realtime`.

In [None]:
%%bash
bq mk --dataset taxifare

Next, we create a table called `traffic_realtime` and set up the schema.

In [None]:
%%bash
bq mk --table \
 --schema trips_last_5min:INTEGER,time:TIMESTAMP \
 taxifare.traffic_realtime
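
As an illustration of what a row matching this schema could look like, here is a minimal sketch. The helper name `format_row` and the choice of an epoch-seconds input are assumptions for the example; the pipeline script may build its rows differently.

```python
from datetime import datetime, timezone


def format_row(count, epoch_seconds):
    """Build a dict with the two columns of the `traffic_realtime` schema."""
    # BigQuery accepts TIMESTAMP values as 'YYYY-MM-DD HH:MM:SS' strings (UTC).
    time_str = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S"
    )
    return {"trips_last_5min": int(count), "time": time_str}


print(format_row(1987, 0))
# {'trips_last_5min': 1987, 'time': '1970-01-01 00:00:00'}
```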

## Launch Streaming Dataflow Pipeline

Now that we have our taxi data being pushed to Pub/Sub, and our BigQuery table set up, let’s consume the Pub/Sub data using a streaming Dataflow pipeline.

The pipeline is defined in `./taxicab_traffic/streaming_count.py`. Open that file and inspect it.

There are 5 transformations being applied:
 - Read from PubSub
 - Window the messages
 - Count number of messages in the window
 - Format the count for BigQuery
 - Write results to BigQuery

For the second transform, we specify a sliding window that is 5 minutes long, and recalculate values every 15 seconds. 
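
To make the sliding-window behavior concrete, here is a minimal pure-Python sketch that does not use Dataflow. The function name `sliding_window_counts` is hypothetical and the code in `streaming_count.py` may differ. With a 300-second window and a 15-second period, each message falls into 300 / 15 = 20 overlapping windows, which is why a fresh 5-minute count becomes available every 15 seconds.

```python
from collections import Counter

WINDOW_SIZE_S = 300   # 5-minute window, as in the lab
WINDOW_PERIOD_S = 15  # a new window starts every 15 seconds


def sliding_window_counts(event_times, size=WINDOW_SIZE_S, period=WINDOW_PERIOD_S):
    """Count events per sliding window.

    `event_times` are integer timestamps in seconds. The result maps each
    window start time to the number of events in [start, start + size).
    """
    counts = Counter()
    for t in event_times:
        # Latest window start at or before t, aligned to the period.
        start = (t // period) * period
        # Walk back over every window that still contains t.
        while start > t - size:
            counts[start] += 1
            start -= period
    return dict(counts)


events = [0, 10, 20, 310]
counts = sliding_window_counts(events)
print(counts[0], counts[15], counts[300])  # 3 2 1
```

The window starting at 0 sees the first three events, the window starting at 15 sees the events at 20 and 310, and the window starting at 300 sees only the last one.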

In a new terminal, navigate to the same `production_ml` directory and launch the Dataflow pipeline using the command below. You can change the `BUCKET` variable, if necessary. Here it is assumed to be your `PROJECT_ID`.

```bash
PROJECT_ID=$(gcloud config list project --format "value(core.project)")
BUCKET=$PROJECT_ID # CHANGE AS NECESSARY
python ./taxicab_traffic/streaming_count.py \
	--input_topic taxi_rides \
	--runner=DataflowRunner \
	--project=$PROJECT_ID \
	--temp_location=gs://$BUCKET/dataflow_streaming
```

Once you've submitted the command above you can examine the progress of that job in the [Dataflow section of Cloud console](https://console.cloud.google.com/dataflow). 

## Explore the data in the table

After a few moments, you should see new data written to your BigQuery table.

Re-run the query below periodically to observe new data streaming in! You should see a new row every 15 seconds.

In [None]:
%load_ext google.cloud.bigquery

In [None]:
%%bigquery
SELECT
  *
FROM
  `taxifare.traffic_realtime`
ORDER BY
  time DESC

## Make predictions from the new data

To make online predictions, we'll take advantage of [AI Platform's Custom Prediction Routines](https://cloud.google.com/ml-engine/docs/tensorflow/custom-prediction-routines), which allow us to execute custom Python code in response to every online prediction request. There are 5 steps to creating a custom prediction routine:

1. Upload Model Artifacts to GCS
2. Implement Predictor interface 
3. Package the prediction code and dependencies
4. Deploy
5. Invoke API

In the rest of the lab, we'll reference the model we trained and deployed in the previous labs.

In [None]:
PROJECT_ID = 'munn-sandbox'  # CHANGE TO YOUR PROJECT ID
BUCKET = 'munn-sandbox'  # CHANGE TO YOUR BUCKET NAME

MODEL_BASE = "./export/savedmodel"
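# os.listdir returns entries in arbitrary order, so sort before taking the last one.
# Assumes export directories have timestamped names, so the last is the most recent.
MODEL_PATH = os.path.join(MODEL_BASE, sorted(os.listdir(MODEL_BASE))[-1])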
MODEL_NAME = 'taxifare'
VERSION_NAME = 'dnn'

In [None]:
loaded_model = tf.saved_model.load(export_dir=MODEL_PATH)

First, we'll upload our model artifacts to GCS.

In [None]:
!gsutil cp -r $MODEL_PATH/* gs://$BUCKET/taxifare/model/

Then, create our predictor interface. The cell below defines a class `TaxifarePredictor`, which tells AI Platform how to load the model artifacts and is where we specify our custom prediction code. When `.predict` is called, it fetches the most recent `trips_last_5min` value from our real-time table in BigQuery and adds it to every instance in the request before making predictions.
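
To make the feature broadcast concrete, here is a minimal sketch that runs without BigQuery or TensorFlow. The helper name `add_realtime_feature` and the example feature names are hypothetical. It assumes, as the predictor code does, that `instances` maps feature names to equal-length lists.

```python
def add_realtime_feature(instances, trips_last_5min):
    """Return a copy of `instances` with the real-time value repeated per instance."""
    # Every feature list has one entry per instance, so any of them gives the count.
    num_instances = len(next(iter(instances.values())))
    enriched = dict(instances)  # shallow copy so the caller's dict is untouched
    enriched["trips_last_5min"] = [trips_last_5min] * num_instances
    return enriched


# Hypothetical request with two instances; feature names are illustrative only.
request = {"dayofweek": [3, 5], "hourofday": [17, 8]}
enriched = add_realtime_feature(request, trips_last_5min=1987)
print(enriched["trips_last_5min"])  # [1987, 1987]
```

The same traffic value is attached to every instance because all instances in one request are scored at the same moment.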

Note: the correct PROJECT_ID will automatically be inserted using the bash `sed` command in the subsequent cell.

In [None]:
%%writefile predictor.py
import tensorflow as tf
from google.cloud import bigquery

PROJECT_ID = 'will_be_replaced'

class TaxifarePredictor(object):
    def __init__(self, predict_fn):
        self.predict_fn = predict_fn
    
    def predict(self, instances, **kwargs):
        bq = bigquery.Client(PROJECT_ID)
        query_string = """
        SELECT
          *
        FROM
          `taxifare.traffic_realtime`
        ORDER BY
          time DESC
        LIMIT 1
        """
        # Most recent 5-minute trip count written by the streaming pipeline.
        trips = bq.query(query_string).to_dataframe()['trips_last_5min'][0]
        # Repeat the value once per instance in the request.
        num_instances = len(next(iter(instances.values())))
        instances['trips_last_5min'] = [trips] * num_instances
        predictions = self.predict_fn(instances)
        return predictions['predictions'].tolist()  # convert to list so it is JSON serializable (requirement)
    

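    @classmethod
    def from_path(cls, model_dir):
        # TF1 equivalent: tf.contrib.predictor.from_saved_model(model_dir, 'predict')
        loaded_model = tf.saved_model.load(export_dir=model_dir)

        def predict_fn(instances):
            # Assumption: the SavedModel exports a 'serving_default' signature whose
            # inputs match the instance keys; adjust the name if your export differs.
            serving_fn = loaded_model.signatures['serving_default']
            # Signature functions take tensors as keyword arguments and return tensors.
            outputs = serving_fn(**{k: tf.constant(v) for k, v in instances.items()})
            return {k: v.numpy() for k, v in outputs.items()}

        return cls(predict_fn)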

In [None]:
%%bash
PROJECT_ID=$(gcloud config list project --format "value(core.project)")
sed -i -e "s/will_be_replaced/${PROJECT_ID}/g" predictor.py