# FraudFinder - Streaming Inference

## Overview

This series of labs builds on the [FraudFinder](https://github.com/googlecloudplatform/fraudfinder) repository, which implements an end-to-end, real-time fraud detection system on Google Cloud. Throughout the FraudFinder labs, you will learn how to read historical bank transaction data stored in a data warehouse, read from a live stream of new transactions, perform exploratory data analysis (EDA), do feature engineering, ingest features into a feature store, train a model using the feature store, register your model in a model registry, evaluate your model, deploy your model to an endpoint, do real-time inference on your model with the feature store, and monitor your model.


### Objective

As you engineer features for model training, it's important to consider how the features are computed when making predictions with new data. For online predictions, you may have features that can be pre-computed via _batch feature engineering_. You may also have features that need to be computed on-the-fly via _streaming-based feature engineering_. In these FraudFinder labs, features based on the last n _days_ are computed with _batch_ feature engineering in BigQuery, while features based on the last n _minutes_ are computed with _streaming-based_ feature engineering in Dataflow.
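To make the "last n minutes" idea concrete, here is a minimal pure-Python sketch of a sliding-window aggregate. It is only an illustration of the concept, not how Dataflow implements windowing; the class name `SlidingWindowStats` and the sample timestamps and amounts are hypothetical.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingWindowStats:
    """Tracks count and mean of amounts seen in the last `window_secs` seconds."""

    def __init__(self, window_secs: int) -> None:
        self.window_secs = window_secs
        self.events: Deque[Tuple[int, float]] = deque()  # (timestamp, amount)
        self.total = 0.0

    def add(self, ts: int, amount: float) -> Tuple[int, float]:
        # Record the new event, then evict everything older than the window.
        self.events.append((ts, amount))
        self.total += amount
        while self.events and self.events[0][0] <= ts - self.window_secs:
            _, old_amount = self.events.popleft()
            self.total -= old_amount
        count = len(self.events)
        return count, self.total / count


# Hypothetical transactions for one customer, timestamps in seconds.
stats_15min = SlidingWindowStats(window_secs=15 * 60)
print(stats_15min.add(ts=0, amount=10.0))     # (1, 10.0)
print(stats_15min.add(ts=600, amount=30.0))   # (2, 20.0)
print(stats_15min.add(ts=1200, amount=50.0))  # (2, 40.0): the first event has expired
```

Because the result changes with every new event, an aggregate like this cannot be pre-computed in a nightly batch job, which is why the minute-level features are handled by a streaming pipeline.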

In order to calculate very recent customer and terminal activity (i.e. within the last hour), computation has to be done on real-time streaming data, rather than via batch-based feature engineering. This notebook shows a step-by-step guide to create a real-time inference pipeline. You will learn to:

- Create a real-time inference pipeline using Apache Beam.
- Enrich streaming data with features from the Vertex AI Feature Store.
- Perform real-time inference using a deployed Vertex AI model.
- Deploy the Apache Beam pipeline to Dataflow.
- Write the inference results to Pub/Sub and BigQuery.

This lab uses the following Google Cloud services and resources:

- [Pub/Sub](https://cloud.google.com/pubsub/)
- [Dataflow](https://cloud.google.com/dataflow/)
- [Vertex AI](https://cloud.google.com/vertex-ai/)
- [BigQuery](https://cloud.google.com/bigquery/)

The steps performed in this notebook are:

1. Read streaming data from a Pub/Sub topic.
2. Enrich the data by looking up customer and terminal features from the Vertex AI Feature Store.
3. Invoke a deployed Vertex AI model for real-time predictions.
4. Write the prediction results to another Pub/Sub topic for downstream consumption.
5. Write the prediction results to BigQuery for storage and analysis.


### Load configuration settings from the setup notebook

Set the constants used in this notebook and load the config settings from the `00_environment_setup.ipynb` notebook.

In [None]:
GCP_PROJECTS = !gcloud config get-value project
PROJECT_ID = GCP_PROJECTS[0]
BUCKET_NAME = f"{PROJECT_ID}-fraudfinder"
config = !gsutil cat gs://{BUCKET_NAME}/config/notebook_env.py
print(config.n)
exec(config.n)
OUTPUT_TOPIC_NAME = "fraud_finder_inference"

#### Create a Pub/Sub topic and subscription for the inference pipeline

In [None]:
!gcloud pubsub topics create fraud_finder_inference --project=$PROJECT_ID

In [None]:
!gcloud pubsub subscriptions create "fraud_finder_inference_sub" --topic="fraud_finder_inference" --topic-project=$PROJECT_ID

### Create folder

To keep the folder structure clean, we create a separate folder and place all the files produced by this notebook in it.

In [None]:
FOLDER = "./beam_pipeline"
PYTHON_SCRIPT = f"{FOLDER}/main.py"
REQUIREMENTS_FILE = f"{FOLDER}/requirements.txt"

# Create new folder for pipeline files
!rm -rf {FOLDER}
!mkdir -p {FOLDER}

## Before we begin

When deploying Apache Beam pipelines to Dataflow, it is best practice to submit the job from a Python script rather than directly from a notebook. This keeps dependencies manageable and is better suited to production environments. When a job is submitted with `save_main_session=True`, as it is in this lab, Beam serializes the state of the main session and sends it to the workers. Notebook sessions tend to accumulate a lot of state, which can cause this serialization to fail.

Therefore, in the following cells, we will be writing our pipeline code to a Python script called `main.py`. We will then execute this script to deploy the Dataflow job. This method is used for clarity and to demonstrate a more robust deployment strategy.
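The serialization concern can be illustrated with a small analogy. Beam relies on its own pickling library rather than the standard `pickle` module, so the sketch below is not what Dataflow runs; it only shows that plain configuration values serialize cleanly while live objects held in a session may not. The helper `try_pickle` and both sample dictionaries are hypothetical.

```python
import pickle
import threading


def try_pickle(obj) -> str:
    """Return 'ok' if obj can be pickled, else the name of the exception type."""
    try:
        pickle.dumps(obj)
        return "ok"
    except Exception as exc:
        return type(exc).__name__


# Plain values, like the constants written to main.py, serialize without trouble.
plain_state = {"window_secs": 900, "topic": "fraud_finder_inference"}

# Live objects that tend to accumulate in a notebook session often do not.
live_state = {"lock": threading.Lock()}

print(try_pickle(plain_state))  # ok
print(try_pickle(live_state))   # TypeError
```

A standalone script starts from a small, known set of module-level names, which makes this kind of failure far less likely than in a long-running notebook kernel.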

### Write import statements

Here we write the import statements for all required libraries to the external Python script.

In [None]:
%%writefile {PYTHON_SCRIPT}
import apache_beam as beam

from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.combiners import CountCombineFn, MeanCombineFn

from apache_beam.transforms.enrichment import Enrichment
from apache_beam.transforms.enrichment_handlers.vertex_ai_feature_store import VertexAIFeatureStoreEnrichmentHandler

import google.auth

import json
from typing import Any
from typing import Dict

from google.cloud import aiplatform

from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.vertex_ai_inference import VertexAIModelHandlerJSON
from apache_beam.io import PubsubMessage

### Defining an auxiliary magic function

Jupyter's `%%writefile` magic writes the cell body verbatim and does not substitute Python variables inside it. We therefore define an auxiliary magic function that fills in Python variables before writing the cell to a file.
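The substitution such a helper relies on is Python's `str.format`. The sketch below shows the mechanism and one caveat worth knowing; the function `render_template` and the value `my-project` are hypothetical and exist only for illustration.

```python
def render_template(template: str, namespace: dict) -> str:
    """Substitute {name} placeholders using str.format."""
    return template.format(**namespace)


namespace = {"PROJECT_ID": "my-project"}  # hypothetical value

# A {name} placeholder is replaced with the matching value.
print(render_template('PROJECT_ID = "{PROJECT_ID}"', namespace))

# Literal braces must be doubled, otherwise str.format reads them as placeholders.
print(render_template('LABELS = {{"env": "dev"}}', namespace))
```

Because every single brace is treated as a placeholder, a template-style magic is best reserved for short cells that contain only variable assignments. Code containing dict literals or f-strings is easier to write with plain `%%writefile -a`, which is the approach this notebook takes for the pipeline body.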

In [None]:
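from IPython.core.magic import register_cell_magic


@register_cell_magic
def writetemplate(line, cell):
    # `line` is the target file path; `cell` is the cell body.
    # Open in append mode so earlier %%writefile output is preserved.
    with open(line, "a") as f:
        # Replace {name} placeholders with the notebook's global variables.
        f.write(cell.format(**globals()))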

#### Retrieve deployed Vertex AI endpoint id

In [None]:
from google.cloud import aiplatform as vertex_ai

vertex_ai.init(project=PROJECT_ID, location=REGION)

endpoints = vertex_ai.Endpoint.list(
    filter=f"display_name={ENDPOINT_NAME}",  # optional: filter by specific endpoint name
    order_by="update_time",
)

ENDPOINT_ID = endpoints[-1].name
print(f"Vertex AI Endpoint ID={ENDPOINT_ID}")

### Write the variable values

Here we write the variable values to the external Python script using the new magic function.

In [None]:
# Adding additional variables to project_variables
project_variables = "\n".join(config[1:-1])
project_variables += f'\nPROJECT_ID = "{PROJECT_ID}"'
project_variables += f'\nBUCKET_NAME = "{BUCKET_NAME}"'
project_variables += f'\nREQUIREMENTS_FILE = "{REQUIREMENTS_FILE}"'
project_variables += f'\nENDPOINT_ID = "{ENDPOINT_ID}"'

In [None]:
%%writetemplate {PYTHON_SCRIPT}

# Project variables
{project_variables}

### Write constant variables

Here we write constant variables to the external Python script.

In [None]:
%%writefile -a {PYTHON_SCRIPT}

# Pub/Sub variables
SUBSCRIPTION_NAME = "ff-tx-for-feat-eng-sub"
SUBSCRIPTION_PATH = f"projects/{PROJECT_ID}/subscriptions/{SUBSCRIPTION_NAME}"

# Dataflow variables
FIFTEEN_MIN_IN_SECS = 15 * 60
THIRTY_MIN_IN_SECS = 30 * 60
WINDOW_SIZE = 60 * 60 # 1 hour in secs
WINDOW_PERIOD = 1 * 60  # 1 min in secs

### Building the pipeline

Now we are ready to build the pipeline. The pipeline is designed to process streaming data, enrich it with features from a feature store, run inference using a trained model, and then output the results to two different destinations.

Here's a breakdown of the pipeline's components:

*   **Data Source (Pub/Sub):** The pipeline starts by reading messages from a Pub/Sub subscription. These messages represent real-time transactions that need to be evaluated for fraud.

*   **Enrichment (Vertex AI Feature Store):** Each incoming transaction is then enriched with pre-computed features from the Vertex AI Feature Store. We use the `Enrichment` transform from the Apache Beam SDK, along with the `VertexAIFeatureStoreEnrichmentHandler`. This allows us to look up features for both the customer and the terminal involved in the transaction.

*   **Inference (Vertex AI Prediction):** Once the transaction data is enriched with the necessary features, it is passed to a deployed Vertex AI model for inference. We use the `RunInference` transform, which is a generic transform for running machine learning models in an Apache Beam pipeline. We configure it with a `VertexAIModelHandlerJSON` to handle the communication with the Vertex AI Prediction service.

*   **Data Sinks (Pub/Sub and BigQuery):** The pipeline has two output branches:
    *   One branch writes the inference results to a Pub/Sub topic. This is useful for downstream applications that need to react to the predictions in real time.
    *   The other branch writes the results to a BigQuery table. This allows for the storage and long-term analysis of the predictions.

The entire pipeline is defined within the `main` function, which will be written to our Python script and then deployed to Dataflow.
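Before looking at the Beam code, it can help to see the same flow in plain Python. The following is a minimal sketch under stated assumptions: the in-memory dictionaries stand in for the Feature Store views, `predict` is a stub with a made-up threshold rather than the deployed model, and the two lists stand in for the Pub/Sub and BigQuery sinks. None of these names come from the Beam or Vertex AI APIs.

```python
import json
from typing import Any, Dict, List, Tuple

# Hypothetical in-memory stand-ins for the customer and terminal feature views.
CUSTOMER_FEATURES = {"C1": {"customer_id_nb_tx_15min_window": 3}}
TERMINAL_FEATURES = {"T9": {"terminal_id_risk_1day_window": 0.2}}


def decode(message: bytes) -> Dict[str, Any]:
    """Parse a JSON-encoded transaction, like the 'Decode message' step."""
    return json.loads(message.decode("utf-8"))


def enrich(tx: Dict[str, Any]) -> Dict[str, Any]:
    """Merge looked-up features into the transaction; unknown keys add nothing."""
    enriched = dict(tx)
    enriched.update(CUSTOMER_FEATURES.get(tx["CUSTOMER_ID"], {}))
    enriched.update(TERMINAL_FEATURES.get(tx["TERMINAL_ID"], {}))
    return enriched


def predict(features: Dict[str, Any]) -> Dict[str, Any]:
    """Stub model with a made-up rule; the real pipeline calls the endpoint."""
    return {"is_fraud": features["TX_AMOUNT"] > 1000}


def run(messages: List[bytes]) -> Tuple[List[bytes], List[Dict[str, Any]]]:
    pubsub_sink: List[bytes] = []
    bigquery_sink: List[Dict[str, Any]] = []
    for message in messages:
        enriched = enrich(decode(message))
        result = {**enriched, **predict(enriched)}
        # Fan out: the same result is sent to both sinks.
        pubsub_sink.append(json.dumps(result).encode("utf-8"))
        bigquery_sink.append(result)
    return pubsub_sink, bigquery_sink


raw = json.dumps({"CUSTOMER_ID": "C1", "TERMINAL_ID": "T9", "TX_AMOUNT": 1500}).encode("utf-8")
pubsub_out, bq_out = run([raw])
print(bq_out[0])
```

The Beam pipeline follows the same shape, with each function replaced by a transform and the loop replaced by the runner's distributed execution.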

In [None]:
%%writefile -a {PYTHON_SCRIPT}

def main():
    # # Initialize Vertex AI client
    # aiplatform.init(
    #     project=PROJECT_ID,
    #     location=REGION
    # )
    
    API_ENDPOINT = f"{REGION}-aiplatform.googleapis.com"
    
    # Setup pipeline options for deploying to dataflow
    pipeline_options = PipelineOptions(streaming=True,
                                       save_main_session=True,
                                       runner="DataflowRunner",
                                       project=PROJECT_ID,
                                       region=REGION,
                                       temp_location=f"gs://{BUCKET_NAME}/dataflow/tmp",
                                       requirements_file=REQUIREMENTS_FILE,
                                       max_num_workers=1)
    
    # Build pipeline and transformation steps
    pipeline = beam.Pipeline(options=pipeline_options)

    output_table = f'{PROJECT_ID}.tx.streaming_pipeline'
    pub_sub_toipc_output = f"projects/{PROJECT_ID}/topics/fraud_finder_inference"

    def convert_row_to_payload(element: beam.Row):
        element_dict = element._asdict()
        # Default feature values, used when a feature is missing from the feature store:
        default_features = {
            'tx_amount': element_dict['TX_AMOUNT'],
            'customer_id_avg_amount_14day_window': 0,
            'customer_id_avg_amount_15min_window': 0,
            'customer_id_avg_amount_1day_window': 0,
            'customer_id_avg_amount_30min_window': 0,
            'customer_id_avg_amount_60min_window': 0,
            'customer_id_avg_amount_7day_window': 0,
            'customer_id_nb_tx_14day_window': 0,
            'customer_id_nb_tx_7day_window': 0,
            'customer_id_nb_tx_15min_window': 0,
            'customer_id_nb_tx_1day_window': 0,
            'customer_id_nb_tx_30min_window': 0,
            'customer_id_nb_tx_60min_window': 0,
            'terminal_id_avg_amount_15min_window': 0,
            'terminal_id_avg_amount_30min_window': 0,
            'terminal_id_avg_amount_60min_window': 0,
            'terminal_id_nb_tx_14day_window': 0,
            'terminal_id_nb_tx_15min_window': 0,
            'terminal_id_nb_tx_1day_window': 0,
            'terminal_id_nb_tx_30min_window': 0,
            'terminal_id_nb_tx_60min_window': 0,
            'terminal_id_nb_tx_7day_window': 0,
            'terminal_id_risk_14day_window': 0,
            'terminal_id_risk_1day_window': 0,
            'terminal_id_risk_7day_window': 0
        }
        default_features.update(element_dict)
        del default_features['TX_AMOUNT']
        return default_features
    
    def item_to_message(item: Dict[str, Any]) -> PubsubMessage:
        # Re-import needed types. When using the Dataflow runner, this
        # function executes on a worker, where the global namespace is not
        # available. For more information, see:
        # https://cloud.google.com/dataflow/docs/guides/common-errors#name-error
        from apache_beam.io import PubsubMessage

        attributes = {"type": "inference"}
        data = bytes(json.dumps(item), "utf-8")

        return PubsubMessage(data=data, attributes=attributes)

    model_handler = VertexAIModelHandlerJSON(endpoint_id=ENDPOINT_ID,
                                             project=PROJECT_ID,
                                             location=REGION,
                                            ).with_preprocess_fn(convert_row_to_payload)

    vertex_ai_handler_customers = VertexAIFeatureStoreEnrichmentHandler(
        project=PROJECT_ID,
        location=REGION,
        api_endpoint=API_ENDPOINT,
        feature_store_name=FEATURESTORE_ID,
        feature_view_name="fv_fraudfinder_customers",
        row_key="CUSTOMER_ID",
    )

    vertex_ai_handler_terminals = VertexAIFeatureStoreEnrichmentHandler(
        project=PROJECT_ID,
        location=REGION,
        api_endpoint=API_ENDPOINT,
        feature_store_name=FEATURESTORE_ID,
        feature_view_name="fv_fraudfinder_terminals",
        row_key="TERMINAL_ID",
    )

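    # The inference pipeline reads from the "ff-tx-sub" subscription. These local
    # names shadow the module-level SUBSCRIPTION_NAME and SUBSCRIPTION_PATH.
    SUBSCRIPTION_NAME = "ff-tx-sub"
    SUBSCRIPTION_PATH = f"projects/{PROJECT_ID}/subscriptions/{SUBSCRIPTION_NAME}"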

    source = (
        pipeline
        | 'Read from Pub/Sub' >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION_PATH)
        | 'Decode message' >> beam.Map(lambda row: beam.Row(**json.loads(row.decode('utf-8'))))
    )

    inference = (
        source
        | "Enrich Customer Online FS" >> Enrichment(vertex_ai_handler_customers)
        | "Enrich Terminal Online FS" >> Enrichment(vertex_ai_handler_terminals)
        | "RunInference" >> RunInference(model_handler)
        | "Prep BQ Row" >> beam.Map(lambda x: {**x.example, "model_id": x.model_id, **x.inference})
    )

    _ = (
        inference
        | "Convert to Pub/Sub messages" >> beam.Map(item_to_message)
        | "Write to Pub/Sub" >> beam.io.WriteToPubSub(topic=pub_sub_toipc_output, with_attributes=True)
    )

    _ = (
        inference
        | "Write BigQuery" >> beam.io.gcp.bigquery.WriteToBigQuery(
            table=output_table,
            method=beam.io.gcp.bigquery.WriteToBigQuery.Method.STREAMING_INSERTS,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
    
    # Run the pipeline (async)
    pipeline.run()

    
if __name__ == "__main__":
    main()

### Creating `requirements.txt` for Dataflow workers

Because the pipeline uses the `google-cloud-aiplatform` and `google-apitools` packages, we pass a `requirements.txt` file to the Dataflow workers so that they install these packages in their environments before running the job.

In [None]:
%%writefile {REQUIREMENTS_FILE}
google-cloud-aiplatform==1.115.0
google-apitools==0.5.32

### Deploying the pipeline

Now we are ready to deploy this pipeline to Dataflow.

In [None]:
!python3 {PYTHON_SCRIPT}

Congratulations! You have successfully deployed a real-time inference pipeline to Dataflow. You can monitor the job and see the results in the [Dataflow console](https://console.cloud.google.com/dataflow/jobs). This concludes the notebook.