# FraudFinder - Feature Engineering (batch) (New Feature Store)

## Overview

This series of labs is an update of the [FraudFinder](https://github.com/googlecloudplatform/fraudfinder) repository, which builds an end-to-end real-time fraud detection system on Google Cloud. Throughout the FraudFinder labs, you will learn how to read historical bank transaction data stored in a data warehouse, read from a live stream of new transactions, perform exploratory data analysis (EDA), do feature engineering, ingest features into a feature store, train a model using the feature store, register your model in a model registry, evaluate your model, deploy your model to an endpoint, do real-time inference on your model with the feature store, and monitor your model.


In this notebook, we'll focus on a critical step in any machine learning project: **feature engineering**. You'll learn how to transform raw transaction data into meaningful features that can be used to train a powerful fraud detection model. We'll be using BigQuery for batch feature engineering and Vertex AI Feature Store to manage and serve our features.

### Objective

The primary goal of this notebook is to introduce you to the concepts of batch feature engineering and the Vertex AI Feature Store. You will learn how to:

* **Understand the difference between batch and streaming feature engineering.** We'll explore why both are essential for building a real-time fraud detection system.
* **Create powerful features using SQL in BigQuery.** You'll write queries to extract valuable insights from historical transaction data, such as customer spending habits and terminal risk profiles.
* **Leverage the Vertex AI Feature Store for efficient feature management.** You'll learn how to create a feature store, define feature groups, and ingest your newly created features for both training and online serving.
* **Prepare for real-time feature engineering.** We'll also set the stage for the next notebook in this series, where you'll learn how to create streaming features using Dataflow.

This lab uses the following Google Cloud services and resources:

- [Vertex AI](https://cloud.google.com/vertex-ai/)
- [BigQuery](https://cloud.google.com/bigquery/)
- [Vertex AI Feature Store](https://cloud.google.com/vertex-ai/docs/featurestore/latest/overview)

Steps performed in this notebook:

- Build customer and terminal-related features
- Create the feature store, entities, and features
- Ingest feature values into the feature store from BigQuery tables
- Read features from the feature store

### Load configuration settings from the setup notebook

Set the constants used in this notebook and load the config settings from the `00_environment_setup.ipynb` notebook.

In [None]:
GCP_PROJECTS = !gcloud config get-value project
PROJECT_ID = GCP_PROJECTS[0]
BUCKET_NAME = f"{PROJECT_ID}-fraudfinder"
config = !gsutil cat gs://{BUCKET_NAME}/config/notebook_env.py
print(config.n)
exec(config.n)

### Import libraries

In [None]:
# General
import datetime as dt
import json
import os
import random
import sys
import time
from datetime import datetime, timedelta
from typing import List, Union

# Data Engineering
import numpy as np
import pandas as pd

pd.set_option("display.max_columns", 500)

# Vertex AI and Vertex AI Feature Store
from google.cloud import aiplatform as vertex_ai
from google.cloud import bigquery

### Define constants

In [None]:
# Define the date range of transactions for feature engineering (last 10 days up until yesterday)
YESTERDAY = datetime.today() - timedelta(days=1)
YEAR_MONTH_PREFIX = YESTERDAY.strftime("%Y-%m")
DATAPROCESSING_START_DATE = (YESTERDAY - timedelta(days=10)).strftime(
    "%Y-%m-%d"
)
DATAPROCESSING_END_DATE = YESTERDAY.strftime("%Y-%m-%d")

# Define BigQuery dataset and tables to calculate features.
RAW_BQ_TRANSACTION_TABLE_URI = f"{PROJECT_ID}.tx.tx"

INGESTION_BQ_TRANSACTION_TABLE_URI = f"{PROJECT_ID}.tx.ingestion_tx_records"
INGESTION_BQ_LABELS_TABLE_URI = f"{PROJECT_ID}.tx.ingestion_tx_labels"

RAW_BQ_LABELS_TABLE_URI = f"{PROJECT_ID}.tx.txlabels"
FEATURES_BQ_TABLE_URI = f"{PROJECT_ID}.tx.wide_features_table"

# Define Vertex AI Feature store tables and views.

CUSTOMERS_FE_BQ_VIEW_URI = f"{PROJECT_ID}.tx.v_customers_features"
CUSTOMERS_FE_BQ_BATCH_TABLE_URI = f"{PROJECT_ID}.tx.t_customers_batch_features"

TERMINALS_TABLE_NAME = f"terminals_{DATAPROCESSING_END_DATE.replace('-', '')}"

TERMINALS_FE_BQ_VIEW_URI = f"{PROJECT_ID}.tx.v_terminals_features"
TERMINALS_FE_BQ_BATCH_TABLE_URI = f"{PROJECT_ID}.tx.t_terminals_batch_features"

CUSTOMERS_STREAMING_FE_TABLE_URI = (
    f"{PROJECT_ID}.tx.t_customers_streaming_features"
)
TERMINALS_STREAMING_FE_TABLE_URI = (
    f"{PROJECT_ID}.tx.t_terminals_streaming_features"
)

ONLINE_STORAGE_NODES = 1
FEATURE_TIME = "feature_ts"
CUSTOMER_ENTITY_ID = "customer"
TERMINAL_ENTITY_ID = "terminal"

### Helpers

Define a helper function that runs BigQuery queries. It is used throughout this notebook to create the feature tables and views.

In [None]:
def run_bq_query(sql: str, show=False) -> Union[str, pd.DataFrame]:
    """
    Run a BigQuery query and return the job ID or the result as a DataFrame.
    Args:
        sql: SQL query, as a string, to execute in BigQuery
        show: If True, return the query result as a pandas DataFrame
    Returns:
        The query result as a DataFrame if show is True, otherwise the job ID.
        Any BigQuery error is raised as an exception.
    """

    bq_client = bigquery.Client()

    # Try a dry run before executing the query to catch any errors
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    bq_client.query(sql, job_config=job_config)

    # If the dry run succeeds without errors, proceed to run the query
    job_config = bigquery.QueryJobConfig()
    client_result = bq_client.query(sql, job_config=job_config)

    job_id = client_result.job_id

    # Wait for the query job to finish running
    result = client_result.result()
    print(f"Finished job_id: {job_id}")

    # Return the rows as a DataFrame only when requested
    if show:
        return result.to_arrow().to_pandas()
    return job_id

## Creating Destination Tables for the Ingestion Pipeline

Before we start engineering features, we need a place to store our raw transaction data. In this section, we'll create two BigQuery tables to serve as the destination for our ingestion pipeline. These tables will store the raw transaction records and their corresponding labels (i.e., whether a transaction is fraudulent or not). This separation of raw data from engineered features is a good practice that helps with data organization and reusability.

### Creating a Table for Raw Transaction Records

In [None]:
create_ingestion_tx_records_table = f"""
CREATE OR REPLACE TABLE `{INGESTION_BQ_TRANSACTION_TABLE_URI}`
(
  TX_ID STRING OPTIONS(description="Unique transaction identifier"),
  TX_TS TIMESTAMP OPTIONS(description="Timestamp of the transaction"),
  CUSTOMER_ID STRING OPTIONS(description="Unique customer identifier"),
  TERMINAL_ID STRING OPTIONS(description="Unique terminal identifier"),
  TX_AMOUNT FLOAT64 OPTIONS(description="The monetary value of the transaction")
)
PARTITION BY
  DATE(TX_TS)
CLUSTER BY
  CUSTOMER_ID
OPTIONS (
  description = "A table to store customer transaction data, partitioned by day and clustered by customer."
)"""
print(create_ingestion_tx_records_table)

In [None]:
run_bq_query(create_ingestion_tx_records_table)

### Creating a Table for Transaction Labels

In [None]:
create_ingestion_tx_labels_table = f"""
CREATE OR REPLACE TABLE `{INGESTION_BQ_LABELS_TABLE_URI}`
(
  TX_ID STRING OPTIONS(description="Unique transaction identifier"),
  TX_FRAUD INT64 OPTIONS(description="The label for the transaction, 1-fraud/ 0-not a fraud")
)
OPTIONS (
  description = "A table to store fraud labels for transaction data"
)"""
print(create_ingestion_tx_labels_table)

In [None]:
run_bq_query(create_ingestion_tx_labels_table)

## Feature Engineering

### Batch Feature Engineering with BigQuery

Now it's time to dive into the core of this notebook: **batch feature engineering**. In this section, we'll use the power of SQL and BigQuery to create insightful features from our historical transaction data. These features will capture patterns in customer behavior and terminal activity that can help our machine learning model distinguish between legitimate and fraudulent transactions.

We'll be creating two main types of features:

**1. Customer-Related Features:** These features will focus on the spending habits of individual customers. By analyzing their transaction history over different time windows (e.g., the last 1, 7, and 14 days), we can identify unusual patterns that might indicate fraud. For example, a sudden spike in the number of transactions or the average transaction amount could be a red flag.

**2. Terminal-Related Features:** These features will assess the risk associated with different transaction terminals. Some terminals might be more susceptible to fraud than others. By analyzing the history of fraudulent transactions at each terminal, we can create a risk score that can be used as a powerful feature in our model.
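As a rough illustration of such a risk score, the sketch below computes a terminal's fraud rate over a window that ends some days before the current transaction. It mirrors the 7-day delay and the `0.0001` smoothing constant used in the terminal query in this notebook. The reason for the delay is an assumption here (presumably fraud labels are only confirmed some time after a transaction), and the function name and the `(ts_seconds, is_fraud)` event layout are hypothetical.

```python
DAY = 86400  # seconds in one day


def terminal_risk(events, current_ts, window_days, delay_days=7):
    """Fraud rate of one terminal over a window ending `delay_days` before current_ts.

    events: list of (ts_seconds, is_fraud) tuples for ONE terminal (hypothetical layout).
    """
    delay_start = current_ts - delay_days * DAY
    window_start = delay_start - window_days * DAY
    # Keep only transactions inside the delayed window [window_start, delay_start).
    labels = [is_fraud for ts, is_fraud in events if window_start <= ts < delay_start]
    # The small constant avoids division by zero when the window is empty.
    return sum(labels) / (len(labels) + 0.0001)


# Three transactions fall inside the delayed 7-day window (one fraudulent);
# the transaction on day 15 is inside the delay period and is ignored.
events = [(7 * DAY, 1), (10 * DAY, 0), (12 * DAY, 0), (15 * DAY, 1)]
risk_7day = terminal_risk(events, current_ts=20 * DAY, window_days=7)
print(round(risk_7day, 4))
```

The SQL version reaches the same result by subtracting the counts over the delay period from the counts over the delay period plus the window.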

To create these features, we'll be using SQL window functions in BigQuery. These functions allow us to perform calculations across a set of table rows that are somehow related to the current row. This is perfect for our use case, as it allows us to easily calculate aggregated statistics (e.g., the average transaction amount) over different time windows.
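To make the window semantics concrete, here is a minimal pure-Python sketch of what `COUNT(...)` and `AVG(...)` compute with `RANGE BETWEEN n PRECEDING AND CURRENT ROW` when rows are ordered by a timestamp in seconds. The function name and the `(ts_seconds, amount)` row layout are assumptions for illustration; BigQuery does this work for you in the queries of this notebook.

```python
from bisect import bisect_left, bisect_right


def trailing_window_stats(rows, window_seconds):
    """Return (count, average amount) per row over [ts - window_seconds, ts].

    rows: list of (ts_seconds, amount) for ONE customer, sorted by ts_seconds.
    """
    timestamps = [ts for ts, _ in rows]
    # Prefix sums let us compute the sum of any contiguous slice in O(1).
    prefix = [0.0]
    for _, amount in rows:
        prefix.append(prefix[-1] + amount)

    stats = []
    for ts, _ in rows:
        # First row whose timestamp is >= ts - window_seconds (lower bound is inclusive).
        start = bisect_left(timestamps, ts - window_seconds)
        # CURRENT ROW in RANGE mode includes every row sharing the same timestamp.
        end = bisect_right(timestamps, ts)
        count = end - start
        stats.append((count, (prefix[end] - prefix[start]) / count))
    return stats


# Three transactions: at t=0, one hour later, and 25 hours after the first.
rows = [(0, 10.0), (3600, 20.0), (90000, 30.0)]
one_day_stats = trailing_window_stats(rows, window_seconds=86400)
print(one_day_stats)
```

For the third row, the 1-day window covers only the second and third transactions, so the first transaction no longer contributes to the count or the average.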

Let's get started! In the following cells, we'll walk you through the process of creating these features step by step.

#### Creating Batch Features with SQL

Now, let's get our hands dirty and write some SQL! In the following cells, we'll define the queries to create our customer and terminal-related features. We'll be using Common Table Expressions (CTEs) to make our queries more readable and organized.

**Set the Date Range for Feature Engineering**

First, let's define the time window for which we want to create features. We'll use the last 10 days of transaction data as our source.

In [None]:
print(
    f"""
DATAPROCESSING_START_DATE: {DATAPROCESSING_START_DATE}
DATAPROCESSING_END_DATE: {DATAPROCESSING_END_DATE}
"""
)

### Append historical records to the input tables:

#### For transactions table:

In [None]:
insert_transactions_historical_data = f"""INSERT INTO `{INGESTION_BQ_TRANSACTION_TABLE_URI}`
 (TX_ID,
  TX_TS,
  CUSTOMER_ID,
  TERMINAL_ID,
  TX_AMOUNT)
SELECT
  TX_ID,
  TX_TS,
  CUSTOMER_ID,
  TERMINAL_ID,
  TX_AMOUNT
FROM `{RAW_BQ_TRANSACTION_TABLE_URI}`
WHERE TX_TS BETWEEN TIMESTAMP_SUB(current_timestamp(), INTERVAL 15 DAY) AND current_timestamp()
"""
print(insert_transactions_historical_data)
run_bq_query(insert_transactions_historical_data)

#### For labels table:

In [None]:
insert_labels_historical_data = f"""INSERT INTO `{INGESTION_BQ_LABELS_TABLE_URI}`
 (TX_ID,
  TX_FRAUD)
  SELECT
    raw_tx.TX_ID,
    raw_lb.TX_FRAUD
  FROM
      `{INGESTION_BQ_TRANSACTION_TABLE_URI}` as raw_tx
  INNER JOIN 
    `{RAW_BQ_LABELS_TABLE_URI}` as raw_lb
  ON raw_tx.TX_ID = raw_lb.TX_ID
"""
print(insert_labels_historical_data)
run_bq_query(insert_labels_historical_data)

### Create Batch Table

#### Customer feature table

Customer table SQL query string:

In [None]:
create_customer_batch_features_query = f"""
CREATE OR REPLACE TABLE `{CUSTOMERS_FE_BQ_BATCH_TABLE_URI}` AS
WITH
  -- CTE 1: Select raw transaction data from the source table
  get_raw_table AS (
  SELECT
    TX_TS,
    TX_ID,
    CUSTOMER_ID,
    TERMINAL_ID,
    TX_AMOUNT
  FROM `{INGESTION_BQ_TRANSACTION_TABLE_URI}`),

  -- CTE 2: Calculate customer spending behavior using window functions
  get_customer_spending_behaviour AS (
  SELECT
    TX_TS,
    TX_ID,
    CUSTOMER_ID,
    TERMINAL_ID,
    TX_AMOUNT,
    
    -- Calculate the number of transactions for each customer over 1, 7, and 14-day windows
    COUNT(TX_ID) OVER (PARTITION BY CUSTOMER_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 86400 PRECEDING
      AND CURRENT ROW ) AS CUSTOMER_ID_NB_TX_1DAY_WINDOW,
    COUNT(TX_ID) OVER (PARTITION BY CUSTOMER_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 604800 PRECEDING
      AND CURRENT ROW ) AS CUSTOMER_ID_NB_TX_7DAY_WINDOW,
    COUNT(TX_ID) OVER (PARTITION BY CUSTOMER_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 1209600 PRECEDING
      AND CURRENT ROW ) AS CUSTOMER_ID_NB_TX_14DAY_WINDOW,
      
    -- Calculate the average transaction amount for each customer over 1, 7, and 14-day windows
    AVG(TX_AMOUNT) OVER (PARTITION BY CUSTOMER_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 86400 PRECEDING
      AND CURRENT ROW ) AS CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW,
    AVG(TX_AMOUNT) OVER (PARTITION BY CUSTOMER_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 604800 PRECEDING
      AND CURRENT ROW ) AS CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW,
    AVG(TX_AMOUNT) OVER (PARTITION BY CUSTOMER_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 1209600 PRECEDING
      AND CURRENT ROW ) AS CUSTOMER_ID_AVG_AMOUNT_14DAY_WINDOW
  FROM get_raw_table)

-- Final SELECT statement: Create the customer features table
SELECT
  PARSE_TIMESTAMP("%Y-%m-%d %H:%M:%S", FORMAT_TIMESTAMP("%Y-%m-%d %H:%M:%S", TX_TS, "UTC")) as feature_timestamp,
  CUSTOMER_ID AS entity_id,
  CAST(CUSTOMER_ID_NB_TX_1DAY_WINDOW AS INT64) AS customer_id_nb_tx_1day_window,
  CAST(CUSTOMER_ID_NB_TX_7DAY_WINDOW AS INT64) AS customer_id_nb_tx_7day_window,
  CAST(CUSTOMER_ID_NB_TX_14DAY_WINDOW AS INT64) AS customer_id_nb_tx_14day_window,
  CAST(CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW AS FLOAT64) AS customer_id_avg_amount_1day_window,
  CAST(CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW AS FLOAT64) AS customer_id_avg_amount_7day_window,
  CAST(CUSTOMER_ID_AVG_AMOUNT_14DAY_WINDOW AS FLOAT64) AS customer_id_avg_amount_14day_window
FROM
  get_customer_spending_behaviour
"""
print(create_customer_batch_features_query)

##### Run the query 

This creates the initial customer features table from the snapshot of historical records.

In [None]:
run_bq_query(create_customer_batch_features_query)

##### Inspect the result 

You can query some data rows to validate the result of the query

In [None]:
run_bq_query(
    f"SELECT * FROM `{CUSTOMERS_FE_BQ_BATCH_TABLE_URI}` LIMIT 10", show=True
)

#### Terminal feature table

Terminal table SQL query string:

In [None]:
create_terminal_batch_features_query = f"""
CREATE OR REPLACE TABLE `{TERMINALS_FE_BQ_BATCH_TABLE_URI}` AS
WITH
  -- CTE 1: Join transaction data with fraud labels
  get_raw_table AS (
  SELECT
    raw_tx.TX_TS,
    raw_tx.TX_ID,
    raw_tx.CUSTOMER_ID,
    raw_tx.TERMINAL_ID,
    raw_tx.TX_AMOUNT,
    raw_lb.TX_FRAUD
  FROM `{INGESTION_BQ_TRANSACTION_TABLE_URI}` raw_tx
  LEFT JOIN 
    `{INGESTION_BQ_LABELS_TABLE_URI}` as raw_lb
  ON raw_tx.TX_ID = raw_lb.TX_ID),

  -- CTE 2: Calculate delayed window variables for terminal risk assessment
  get_variables_delay_window AS (
  SELECT
    TX_TS,
    TX_ID,
    CUSTOMER_ID,
    TERMINAL_ID,
    
    -- Calculate the number of fraudulent transactions and total transactions over a 7-day delay period
    SUM(TX_FRAUD) OVER (PARTITION BY TERMINAL_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 604800 PRECEDING
      AND CURRENT ROW ) AS NB_FRAUD_DELAY,
    COUNT(TX_FRAUD) OVER (PARTITION BY TERMINAL_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 604800 PRECEDING
      AND CURRENT ROW ) AS NB_TX_DELAY,
      
    -- Calculate the number of fraudulent transactions and total transactions over 1, 7, and 14-day delayed windows
    SUM(TX_FRAUD) OVER (PARTITION BY TERMINAL_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 691200 PRECEDING
      AND CURRENT ROW ) AS NB_FRAUD_1_DELAY_WINDOW,
    SUM(TX_FRAUD) OVER (PARTITION BY TERMINAL_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 1209600 PRECEDING
      AND CURRENT ROW ) AS NB_FRAUD_7_DELAY_WINDOW,
    SUM(TX_FRAUD) OVER (PARTITION BY TERMINAL_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 1814400 PRECEDING
      AND CURRENT ROW ) AS NB_FRAUD_14_DELAY_WINDOW,
    COUNT(TX_FRAUD) OVER (PARTITION BY TERMINAL_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 691200 PRECEDING
      AND CURRENT ROW ) AS NB_TX_1_DELAY_WINDOW,
    COUNT(TX_FRAUD) OVER (PARTITION BY TERMINAL_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 1209600 PRECEDING
      AND CURRENT ROW ) AS NB_TX_7_DELAY_WINDOW,
    COUNT(TX_FRAUD) OVER (PARTITION BY TERMINAL_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 1814400 PRECEDING
      AND CURRENT ROW ) AS NB_TX_14_DELAY_WINDOW
  FROM get_raw_table),

  -- CTE 3: Calculate terminal risk factors
  get_risk_factors AS (
  SELECT
    TX_TS,
    TX_ID,
    CUSTOMER_ID,
    TERMINAL_ID,
    -- Calculate the number of fraudulent transactions for each terminal over 1, 7, and 14-day windows
    NB_FRAUD_1_DELAY_WINDOW - NB_FRAUD_DELAY AS TERMINAL_ID_NB_FRAUD_1DAY_WINDOW,
    NB_FRAUD_7_DELAY_WINDOW - NB_FRAUD_DELAY AS TERMINAL_ID_NB_FRAUD_7DAY_WINDOW,
    NB_FRAUD_14_DELAY_WINDOW - NB_FRAUD_DELAY AS TERMINAL_ID_NB_FRAUD_14DAY_WINDOW,
    -- Calculate the total number of transactions for each terminal over 1, 7, and 14-day windows
    NB_TX_1_DELAY_WINDOW - NB_TX_DELAY AS TERMINAL_ID_NB_TX_1DAY_WINDOW,
    NB_TX_7_DELAY_WINDOW - NB_TX_DELAY AS TERMINAL_ID_NB_TX_7DAY_WINDOW,
    NB_TX_14_DELAY_WINDOW - NB_TX_DELAY AS TERMINAL_ID_NB_TX_14DAY_WINDOW
      FROM
    get_variables_delay_window),

  -- CTE 4: Calculate the terminal risk index
  get_risk_index AS (
    SELECT
    TX_TS,
    TX_ID,
    CUSTOMER_ID,
    TERMINAL_ID,
    TERMINAL_ID_NB_TX_1DAY_WINDOW,
    TERMINAL_ID_NB_TX_7DAY_WINDOW,
    TERMINAL_ID_NB_TX_14DAY_WINDOW,
    -- Calculate the risk index for each terminal over 1, 7, and 14-day windows
    (TERMINAL_ID_NB_FRAUD_1DAY_WINDOW/(TERMINAL_ID_NB_TX_1DAY_WINDOW+0.0001)) AS TERMINAL_ID_RISK_1DAY_WINDOW,
    (TERMINAL_ID_NB_FRAUD_7DAY_WINDOW/(TERMINAL_ID_NB_TX_7DAY_WINDOW+0.0001)) AS TERMINAL_ID_RISK_7DAY_WINDOW,
    (TERMINAL_ID_NB_FRAUD_14DAY_WINDOW/(TERMINAL_ID_NB_TX_14DAY_WINDOW+0.0001)) AS TERMINAL_ID_RISK_14DAY_WINDOW
    FROM get_risk_factors 
  )

-- Final SELECT statement: Create the terminal features table
SELECT
  PARSE_TIMESTAMP("%Y-%m-%d %H:%M:%S", FORMAT_TIMESTAMP("%Y-%m-%d %H:%M:%S", TX_TS, "UTC")) as feature_timestamp,
  TERMINAL_ID AS entity_id,
  CAST(TERMINAL_ID_NB_TX_1DAY_WINDOW AS INT64) AS terminal_id_nb_tx_1day_window,
  CAST(TERMINAL_ID_NB_TX_7DAY_WINDOW AS INT64) AS terminal_id_nb_tx_7day_window,
  CAST(TERMINAL_ID_NB_TX_14DAY_WINDOW AS INT64) AS terminal_id_nb_tx_14day_window,
  CAST(TERMINAL_ID_RISK_1DAY_WINDOW AS FLOAT64) AS terminal_id_risk_1day_window,
  CAST(TERMINAL_ID_RISK_7DAY_WINDOW AS FLOAT64) AS terminal_id_risk_7day_window,
  CAST(TERMINAL_ID_RISK_14DAY_WINDOW AS FLOAT64) AS terminal_id_risk_14day_window
FROM
  get_risk_index
"""
print(create_terminal_batch_features_query)

##### Run the query 

This creates the initial terminal features table from the snapshot of historical records.

In [None]:
run_bq_query(create_terminal_batch_features_query)

##### Inspect the result 

You can query some data rows to validate the result of the query

In [None]:
run_bq_query(
    f"SELECT * FROM `{TERMINALS_FE_BQ_BATCH_TABLE_URI}` LIMIT 10", show=True
)

### Creating BigQuery Views for Feature Tables

Now that we have created the initial batch feature tables, we will create BigQuery views on top of them. A **BigQuery view** is a virtual table defined by a SQL query. It allows you to encapsulate the logic for generating features and provides a simplified, consistent interface for querying them. Views do not store any data themselves; instead, they run the underlying query every time they are accessed, ensuring that you always get the latest data.

In this notebook, we will create two views:
- `v_customers_features`: A view for the customer-related batch features.
- `v_terminals_features`: A view for the terminal-related batch features.

These views will be used as the source for our Vertex AI Feature Store, and they will also be used to create our final training dataset.

#### Customer feature view

Because the view runs its query on every access, it provides an up-to-date look at customer spending behavior.

In [None]:
create_customer_view_query = f"""
CREATE OR REPLACE VIEW `{CUSTOMERS_FE_BQ_VIEW_URI}` AS
WITH
  -- query to select raw transactions from the last 15 days ----------------------------------------------------------------------
  get_raw_table AS (
  SELECT
    raw_tx.TX_TS,
    raw_tx.TX_ID,
    raw_tx.CUSTOMER_ID,
    raw_tx.TERMINAL_ID,
    raw_tx.TX_AMOUNT
  FROM (
    SELECT
      *
    FROM
      `{INGESTION_BQ_TRANSACTION_TABLE_URI}`
    WHERE
      TX_TS BETWEEN TIMESTAMP_SUB(current_timestamp(), INTERVAL 15 DAY) AND current_timestamp()
    ) raw_tx),

  -- query to calculate CUSTOMER spending behaviour --------------------------------------------------------------------------------
  get_customer_spending_behaviour AS (
  SELECT
    TX_TS,
    TX_ID,
    CUSTOMER_ID,
    TERMINAL_ID,
    TX_AMOUNT,
    
    # calc the number of customer tx over daily windows per customer (1, 7 and 14 days, expressed in seconds)
    COUNT(TX_AMOUNT) OVER (PARTITION BY CUSTOMER_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 86400 PRECEDING
      AND CURRENT ROW ) AS CUSTOMER_ID_NB_TX_1DAY_WINDOW,
    COUNT(TX_AMOUNT) OVER (PARTITION BY CUSTOMER_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 604800 PRECEDING
      AND CURRENT ROW ) AS CUSTOMER_ID_NB_TX_7DAY_WINDOW,
    COUNT(TX_AMOUNT) OVER (PARTITION BY CUSTOMER_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 1209600 PRECEDING
      AND CURRENT ROW ) AS CUSTOMER_ID_NB_TX_14DAY_WINDOW,
      
    # calc the customer average tx amount in dollars ($) over daily windows per customer (1, 7 and 14 days, expressed in seconds)
    AVG(TX_AMOUNT) OVER (PARTITION BY CUSTOMER_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 86400 PRECEDING
      AND CURRENT ROW ) AS CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW,
    AVG(TX_AMOUNT) OVER (PARTITION BY CUSTOMER_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 604800 PRECEDING
      AND CURRENT ROW ) AS CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW,
    AVG(TX_AMOUNT) OVER (PARTITION BY CUSTOMER_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 1209600 PRECEDING
      AND CURRENT ROW ) AS CUSTOMER_ID_AVG_AMOUNT_14DAY_WINDOW,
  FROM get_raw_table)

# Create the table with CUSTOMER  features ----------------------------------------------------------------------------
SELECT
  current_timestamp() as feature_timestamp,
  CUSTOMER_ID AS entity_id,
  CAST(CUSTOMER_ID_NB_TX_1DAY_WINDOW AS INT64) AS customer_id_nb_tx_1day_window,
  CAST(CUSTOMER_ID_NB_TX_7DAY_WINDOW AS INT64) AS customer_id_nb_tx_7day_window,
  CAST(CUSTOMER_ID_NB_TX_14DAY_WINDOW AS INT64) AS customer_id_nb_tx_14day_window,
  CAST(CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW AS FLOAT64) AS customer_id_avg_amount_1day_window,
  CAST(CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW AS FLOAT64) AS customer_id_avg_amount_7day_window,
  CAST(CUSTOMER_ID_AVG_AMOUNT_14DAY_WINDOW AS FLOAT64) AS customer_id_avg_amount_14day_window,
FROM
  get_customer_spending_behaviour
"""

In [None]:
print(create_customer_view_query)

##### Run the query 

This creates the customer features view.

In [None]:
run_bq_query(create_customer_view_query)

##### Inspect the result 

You can query some data rows to validate the result of the query

In [None]:
run_bq_query(f"SELECT * FROM `{CUSTOMERS_FE_BQ_VIEW_URI}` LIMIT 10", show=True)

#### Terminal feature view

Terminal view SQL query string:

In [None]:
create_terminal_view_query = f"""
# query to calculate TERMINAL spending behaviour --------------------------------------------------------------------------------
CREATE OR REPLACE VIEW `{TERMINALS_FE_BQ_VIEW_URI}` AS
WITH
  -- query to join labels with features -------------------------------------------------------------------------------------------
  get_raw_table AS (
  SELECT
    raw_tx.TX_TS,
    raw_tx.TX_ID,
    raw_tx.CUSTOMER_ID,
    raw_tx.TERMINAL_ID,
    raw_tx.TX_AMOUNT,
    raw_lb.TX_FRAUD
  FROM (
    SELECT
      *
    FROM
      `{INGESTION_BQ_TRANSACTION_TABLE_URI}`
    WHERE
      TX_TS BETWEEN TIMESTAMP_SUB(current_timestamp(), INTERVAL 15 DAY) AND current_timestamp()
    ) raw_tx
  LEFT JOIN 
    `{INGESTION_BQ_LABELS_TABLE_URI}` as raw_lb
  ON raw_tx.TX_ID = raw_lb.TX_ID),

  # query to calculate TERMINAL spending behaviour --------------------------------------------------------------------------------
  get_variables_delay_window AS (
  SELECT
    TX_TS,
    TX_ID,
    CUSTOMER_ID,
    TERMINAL_ID,
    
    # calc total amount of fraudulent tx and the total number of tx over the delay period per terminal (7 days - delay, expressed in seconds)
    SUM(TX_FRAUD) OVER (PARTITION BY TERMINAL_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 604800 PRECEDING
      AND CURRENT ROW ) AS NB_FRAUD_DELAY,
    COUNT(TX_FRAUD) OVER (PARTITION BY TERMINAL_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 604800 PRECEDING
      AND CURRENT ROW ) AS NB_TX_DELAY,
      
    # calc total amount of fraudulent tx and the total number of tx over the delayed window per terminal (window + 7 days - delay, expressed in seconds)
    SUM(TX_FRAUD) OVER (PARTITION BY TERMINAL_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 691200 PRECEDING
      AND CURRENT ROW ) AS NB_FRAUD_1_DELAY_WINDOW,
    SUM(TX_FRAUD) OVER (PARTITION BY TERMINAL_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 1209600 PRECEDING
      AND CURRENT ROW ) AS NB_FRAUD_7_DELAY_WINDOW,
    SUM(TX_FRAUD) OVER (PARTITION BY TERMINAL_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 1814400 PRECEDING
      AND CURRENT ROW ) AS NB_FRAUD_14_DELAY_WINDOW,
    COUNT(TX_FRAUD) OVER (PARTITION BY TERMINAL_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 691200 PRECEDING
      AND CURRENT ROW ) AS NB_TX_1_DELAY_WINDOW,
    COUNT(TX_FRAUD) OVER (PARTITION BY TERMINAL_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 1209600 PRECEDING
      AND CURRENT ROW ) AS NB_TX_7_DELAY_WINDOW,
    COUNT(TX_FRAUD) OVER (PARTITION BY TERMINAL_ID ORDER BY UNIX_SECONDS(TX_TS) ASC RANGE BETWEEN 1814400 PRECEDING
      AND CURRENT ROW ) AS NB_TX_14_DELAY_WINDOW,
  FROM get_raw_table),

  # query to calculate TERMINAL risk factors ---------------------------------------------------------------------------------------
  get_risk_factors AS (
  SELECT
    TX_TS,
    TX_ID,
    CUSTOMER_ID,
    TERMINAL_ID,
    # calculate numerator of risk index
    NB_FRAUD_1_DELAY_WINDOW - NB_FRAUD_DELAY AS TERMINAL_ID_NB_FRAUD_1DAY_WINDOW,
    NB_FRAUD_7_DELAY_WINDOW - NB_FRAUD_DELAY AS TERMINAL_ID_NB_FRAUD_7DAY_WINDOW,
    NB_FRAUD_14_DELAY_WINDOW - NB_FRAUD_DELAY AS TERMINAL_ID_NB_FRAUD_14DAY_WINDOW,
    # calculate denominator of risk index
    NB_TX_1_DELAY_WINDOW - NB_TX_DELAY AS TERMINAL_ID_NB_TX_1DAY_WINDOW,
    NB_TX_7_DELAY_WINDOW - NB_TX_DELAY AS TERMINAL_ID_NB_TX_7DAY_WINDOW,
    NB_TX_14_DELAY_WINDOW - NB_TX_DELAY AS TERMINAL_ID_NB_TX_14DAY_WINDOW,
      FROM
    get_variables_delay_window),

  # query to calculate the TERMINAL risk index -------------------------------------------------------------------------------------
  get_risk_index AS (
    SELECT
    TX_TS,
    TX_ID,
    CUSTOMER_ID,
    TERMINAL_ID,
    TERMINAL_ID_NB_TX_1DAY_WINDOW,
    TERMINAL_ID_NB_TX_7DAY_WINDOW,
    TERMINAL_ID_NB_TX_14DAY_WINDOW,
    # calculate the risk index
    (TERMINAL_ID_NB_FRAUD_1DAY_WINDOW/(TERMINAL_ID_NB_TX_1DAY_WINDOW+0.0001)) AS TERMINAL_ID_RISK_1DAY_WINDOW,
    (TERMINAL_ID_NB_FRAUD_7DAY_WINDOW/(TERMINAL_ID_NB_TX_7DAY_WINDOW+0.0001)) AS TERMINAL_ID_RISK_7DAY_WINDOW,
    (TERMINAL_ID_NB_FRAUD_14DAY_WINDOW/(TERMINAL_ID_NB_TX_14DAY_WINDOW+0.0001)) AS TERMINAL_ID_RISK_14DAY_WINDOW
    FROM get_risk_factors 
  )

# Create the view with TERMINAL features -------------------------------------------------------------------------------------------
SELECT
  current_timestamp() as feature_timestamp,
  # TERMINAL_ID AS terminal_id,
  TERMINAL_ID AS entity_id,
  CAST(TERMINAL_ID_NB_TX_1DAY_WINDOW AS INT64) AS terminal_id_nb_tx_1day_window,
  CAST(TERMINAL_ID_NB_TX_7DAY_WINDOW AS INT64) AS terminal_id_nb_tx_7day_window,
  CAST(TERMINAL_ID_NB_TX_14DAY_WINDOW AS INT64) AS terminal_id_nb_tx_14day_window,
  CAST(TERMINAL_ID_RISK_1DAY_WINDOW AS FLOAT64) AS terminal_id_risk_1day_window,
  CAST(TERMINAL_ID_RISK_7DAY_WINDOW AS FLOAT64) AS terminal_id_risk_7day_window,
  CAST(TERMINAL_ID_RISK_14DAY_WINDOW AS FLOAT64) AS terminal_id_risk_14day_window,
FROM
  get_risk_index
"""

In [None]:
print(create_terminal_view_query)

##### Run the query 

This creates the terminal features view.

In [None]:
run_bq_query(create_terminal_view_query)

##### Inspect the result 

You can query some data rows to validate the result of the query

In [None]:
run_bq_query(f"SELECT * FROM `{TERMINALS_FE_BQ_VIEW_URI}` LIMIT 10", show=True)

### Automating Feature Updates with BigQuery Scheduled Queries

To ensure our batch features remain up-to-date, we need a way to periodically refresh them with the latest transaction data. Manually re-running our feature generation queries would be inefficient and error-prone. Instead, we can automate this process using **BigQuery scheduled queries**.

A scheduled query is a query that runs automatically on a recurring basis. In our case, we'll set up scheduled queries that run every 15 minutes. Each time they run, they will:

1.  Execute the query within our `v_customers_features` and `v_terminals_features` views to calculate the latest feature values.
2.  Append these new feature values to our batch feature tables (`t_customers_batch_features` and `t_terminals_batch_features`).

This ensures that our feature store always has access to fresh batch features, which is crucial for making accurate, timely fraud predictions. We'll use the `bq mk --transfer_config` command to create these scheduled queries.
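The `--params` flag expects a JSON object with a `query` key. One possible way to build that payload without hand-escaping quotes is to let Python's `json` module serialize it, as in the sketch below. The helper name and the `my-project` table and view names are hypothetical placeholders, not values from this lab.

```python
import json


def build_scheduled_query_params(target_table: str, source_view: str) -> str:
    """Return the JSON string for `--params` of a scheduled query.

    The query appends the current rows of the view to the batch table.
    """
    query = f"INSERT INTO `{target_table}` SELECT * FROM `{source_view}`;"
    # json.dumps handles the quoting of the SQL text inside the JSON object.
    return json.dumps({"query": query})


# Hypothetical fully qualified names, used only for illustration.
params = build_scheduled_query_params(
    "my-project.tx.t_customers_batch_features",
    "my-project.tx.v_customers_features",
)
print(params)
```

The resulting string can then be supplied as the value of `--params` when creating the transfer configuration.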

In [None]:
!echo "{CUSTOMERS_FE_BQ_VIEW_URI}"

In [None]:
!bq mk --transfer_config \
--data_source='scheduled_query' \
--display_name='Append Customers Features to Batch Features Table' \
--target_dataset='tx' \
--schedule='every 15 mins' \
--params='{"query": "INSERT INTO `{CUSTOMERS_FE_BQ_BATCH_TABLE_URI}` SELECT * FROM `{CUSTOMERS_FE_BQ_VIEW_URI}`;"}'

In [None]:
!bq mk --transfer_config \
--data_source='scheduled_query' \
--display_name='Append Terminal Features to Batch Features Table' \
--target_dataset='tx' \
--schedule='every 15 mins' \
--params='{"query": "INSERT INTO `{TERMINALS_FE_BQ_BATCH_TABLE_URI}` SELECT * FROM `{TERMINALS_FE_BQ_VIEW_URI}`;"}'

### Initializing Tables for Real-Time (Streaming) Features

While this notebook focuses on batch features, our end-to-end fraud detection system will also use real-time features calculated over very short time windows (e.g., the last 15, 30, and 60 minutes). These features will be generated by a streaming pipeline using Dataflow, which is covered in the next notebook (`03_feature_engineering_streaming_new_fs.ipynb`).

However, we need to create the destination tables for these features *now*. The following queries will create empty tables in BigQuery with the correct schema for our future streaming features. This step is important because:

1.  **It defines the contract:** It establishes the schema that the streaming pipeline will write to.
2.  **It enables the training view:** It allows our final training dataset view (`v_ff_training_dataset`) to be created successfully, as it can reference these tables even though they are currently empty. The `IFNULL` function in the view will handle the absence of data, ensuring the query doesn't fail.

These tables will be populated with data once we run the Dataflow streaming pipeline in the next lab.
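The effect of `IFNULL` on an empty streaming table can be pictured with a small Python sketch. It assumes, purely for illustration, that the training view left-joins transactions to streaming features and substitutes a default of `0` for missing values; the actual view definition is not shown in this section, and the function name and dictionary layout are hypothetical.

```python
def join_with_default(transactions, streaming_features, feature_name, default=0):
    """Emulate a LEFT JOIN followed by IFNULL(feature, default).

    transactions: list of dicts with a "customer_id" key.
    streaming_features: dict mapping customer_id -> dict of feature values.
    """
    joined = []
    for tx in transactions:
        # A missing entity behaves like an unmatched row in a LEFT JOIN.
        features = streaming_features.get(tx["customer_id"], {})
        value = features.get(feature_name)
        # IFNULL: fall back to the default when no value is available.
        joined.append({**tx, feature_name: default if value is None else value})
    return joined


transactions = [{"tx_id": "a1", "customer_id": "c1"}, {"tx_id": "a2", "customer_id": "c2"}]
empty_table = {}  # the streaming table before the pipeline has written anything
rows = join_with_default(transactions, empty_table, "customer_id_nb_tx_15min_window")
print(rows)
```

Every transaction still appears in the output, with the default value in place of the missing streaming feature.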

##### Customer and terminal feature tables

SQL query strings for the customer and terminal streaming feature tables:

In [None]:
initiate_real_time_customer_features_query = f"""
CREATE OR REPLACE TABLE `{CUSTOMERS_STREAMING_FE_TABLE_URI}`
(
    entity_id STRING,
    feature_timestamp TIMESTAMP,
    customer_id_nb_tx_15min_window INT64,
    customer_id_nb_tx_30min_window INT64,
    customer_id_nb_tx_60min_window INT64,
    customer_id_avg_amount_15min_window FLOAT64,
    customer_id_avg_amount_30min_window FLOAT64,
    customer_id_avg_amount_60min_window FLOAT64
)
"""

In [None]:
initiate_real_time_terminal_features_query = f"""
CREATE OR REPLACE TABLE `{TERMINALS_STREAMING_FE_TABLE_URI}`
(
    entity_id STRING,
    feature_timestamp TIMESTAMP,
    terminal_id_nb_tx_15min_window INT64,
    terminal_id_nb_tx_30min_window INT64,
    terminal_id_nb_tx_60min_window INT64,
    terminal_id_avg_amount_15min_window FLOAT64,
    terminal_id_avg_amount_30min_window FLOAT64,
    terminal_id_avg_amount_60min_window FLOAT64
)
"""

#### Run the queries above to initialize the real-time feature tables.

In [None]:
for query in [
    initiate_real_time_customer_features_query,
    initiate_real_time_terminal_features_query,
]:
    run_bq_query(query)

#### Inspect BigQuery feature tables

In [None]:
run_bq_query(
    f"SELECT * FROM `{CUSTOMERS_STREAMING_FE_TABLE_URI}` LIMIT 5", show=True
)

In [None]:
run_bq_query(
    f"SELECT * FROM `{TERMINALS_STREAMING_FE_TABLE_URI}` LIMIT 5", show=True
)


### Creating the Final Training Dataset View

With our batch and streaming feature tables in place, we can now create a final BigQuery view that will serve as the source for our model training. This view, `v_ff_training_dataset`, will join the raw transaction data with the corresponding feature values from our batch and streaming tables.

A key challenge when creating a training dataset is ensuring that you are not introducing **data leakage**. Data leakage occurs when your training data contains information that would not be available at the time of prediction. For example, if we were to simply join our transaction data with the latest feature values, we would be leaking information from the future into the past.

To prevent this, we will use the `ML.ENTITY_FEATURES_AT_TIME` function in BigQuery. This function allows us to perform a **point-in-time lookup**, which means that for each transaction, we will retrieve the feature values that were valid at the time the transaction occurred. This ensures that our model is trained on the same data that it will see in a real-world prediction scenario, which is crucial for building a robust and accurate fraud detection model.
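
To make the idea of a point-in-time lookup concrete, here is a small pure-Python sketch of the behavior described above when one row is requested per lookup. It is an illustration, not the BigQuery implementation: the function names are hypothetical, timestamps are plain integers, and the cutoff is assumed to be inclusive.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(feature_rows):
    """Group (entity_id, feature_timestamp, value) rows by entity, sorted by time."""
    index = defaultdict(list)
    for entity_id, ts, value in feature_rows:
        index[entity_id].append((ts, value))
    for rows in index.values():
        rows.sort(key=lambda row: row[0])
    return index


def feature_at_time(index, entity_id, lookup_ts, default=0.0):
    """Return the latest value whose feature_timestamp is <= lookup_ts."""
    rows = index.get(entity_id, [])
    timestamps = [ts for ts, _ in rows]
    pos = bisect_right(timestamps, lookup_ts)
    return rows[pos - 1][1] if pos else default


# Toy data: timestamps are plain integers (for example, minutes since midnight).
features = [("c1", 10, 1.0), ("c1", 20, 2.0), ("c1", 30, 3.0)]
index = build_index(features)

print(feature_at_time(index, "c1", 25))  # uses the row at t=20, never the later row at t=30
print(feature_at_time(index, "c1", 5))   # no earlier row exists, so the default is returned
```

A naive "latest value" join would return the t=30 row for a transaction at t=25, which is exactly the leakage described above. The default plays the same role as `IFNULL(..., 0.0)` in the training view.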

In [None]:
batch_customers_features_table = f"{PROJECT}.tx.t_customers_batch_features"
batch_terminals_features_table = f"{PROJECT}.tx.t_terminals_batch_features"

stream_customers_features_table = f"{PROJECT}.tx.t_customers_streaming_features"
stream_terminals_features_table = f"{PROJECT}.tx.t_terminals_streaming_features"

train_dataset_view_sql = f"""
CREATE OR REPLACE VIEW tx.v_ff_training_dataset AS
    WITH
      # -------------------------------------------
      # Using raw transaction table as a base table
      # filtered by specific time range
      # joined with labels data
      raw_tx_labled_range_table AS (
      SELECT
        raw_tx.TX_TS AS tx_timestamp,
        raw_tx.CUSTOMER_ID AS customer_id,
        raw_tx.TERMINAL_ID AS terminal_id,
        raw_tx.TX_AMOUNT AS tx_amount,
        raw_lb.TX_FRAUD AS tx_fraud,
      FROM
        `{INGESTION_BQ_TRANSACTION_TABLE_URI}` AS raw_tx
      LEFT JOIN
        `{INGESTION_BQ_LABELS_TABLE_URI}` AS raw_lb
      ON
        raw_tx.TX_ID = raw_lb.TX_ID
      WHERE
        raw_tx.TX_TS BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY) AND CURRENT_TIMESTAMP()),
      # ---------------------------------------------
      # Using base transaction table
      # to create entity_id and timestamp pairs
      # to look up corresponding features for terminals
      terminals_time_table AS (
      SELECT
        raw_tx_labled_range_table.tx_timestamp AS `time`,
        raw_tx_labled_range_table.TERMINAL_ID AS entity_id,
      FROM
        raw_tx_labled_range_table),
      # ---------------------------------------------
      # Using base transaction table
      # to create entity_id and timestamp pairs
      # to look up corresponding features for customers
      customers_time_table AS (
      SELECT
        raw_tx_labled_range_table.tx_timestamp AS `time`,
        raw_tx_labled_range_table.CUSTOMER_ID AS entity_id,
      FROM
        raw_tx_labled_range_table)

    SELECT
    # Features from raw transaction:
    raw_tx_labled_range_table.tx_amount,
    raw_tx_labled_range_table.tx_fraud,
    raw_tx_labled_range_table.tx_timestamp AS `timestamp`,

    # Features from customers batch pipeline:
    IFNULL(f_customers.customer_id_avg_amount_1day_window, 0.0) as customer_id_avg_amount_1day_window,
    IFNULL(f_customers.customer_id_avg_amount_7day_window, 0.0) as customer_id_avg_amount_7day_window,
    IFNULL(f_customers.customer_id_avg_amount_14day_window, 0.0) as customer_id_avg_amount_14day_window,
    IFNULL(CAST(f_customers.customer_id_nb_tx_1day_window AS FLOAT64), 0.0) as customer_id_nb_tx_1day_window,
    IFNULL(CAST(f_customers.customer_id_nb_tx_7day_window AS FLOAT64), 0.0) as customer_id_nb_tx_7day_window,
    IFNULL(CAST(f_customers.customer_id_nb_tx_14day_window AS FLOAT64), 0.0) as customer_id_nb_tx_14day_window,

    # Features from terminals batch pipeline:
    IFNULL(f_terminals.terminal_id_risk_1day_window, 0.0)  AS terminal_id_risk_1day_window,
    IFNULL(f_terminals.terminal_id_risk_7day_window, 0.0)  AS terminal_id_risk_7day_window,
    IFNULL(f_terminals.terminal_id_risk_14day_window, 0.0)  AS terminal_id_risk_14day_window,
    IFNULL(CAST(f_terminals.terminal_id_nb_tx_1day_window AS FLOAT64), 0.0)  AS terminal_id_nb_tx_1day_window,
    IFNULL(CAST(f_terminals.terminal_id_nb_tx_7day_window AS FLOAT64), 0.0)  AS terminal_id_nb_tx_7day_window,
    IFNULL(CAST(f_terminals.terminal_id_nb_tx_14day_window AS FLOAT64), 0.0)  AS terminal_id_nb_tx_14day_window,
  
    # Features from customers streaming pipeline:
    IFNULL(f_customers_stream.customer_id_avg_amount_15min_window, 0.0) as customer_id_avg_amount_15min_window,
    IFNULL(f_customers_stream.customer_id_avg_amount_30min_window, 0.0) as customer_id_avg_amount_30min_window,
    IFNULL(f_customers_stream.customer_id_avg_amount_60min_window, 0.0) as customer_id_avg_amount_60min_window,
    IFNULL(CAST(f_customers_stream.customer_id_nb_tx_15min_window AS FLOAT64), 0.0)  AS customer_id_nb_tx_15min_window,
    IFNULL(CAST(f_customers_stream.customer_id_nb_tx_30min_window AS FLOAT64), 0.0)  AS customer_id_nb_tx_30min_window,
    IFNULL(CAST(f_customers_stream.customer_id_nb_tx_60min_window AS FLOAT64), 0.0)  AS customer_id_nb_tx_60min_window,
  
    # Features from terminals streaming pipeline:
    IFNULL(f_terminals_stream.terminal_id_avg_amount_15min_window, 0.0) as terminal_id_avg_amount_15min_window,
    IFNULL(f_terminals_stream.terminal_id_avg_amount_30min_window, 0.0) as terminal_id_avg_amount_30min_window,
    IFNULL(f_terminals_stream.terminal_id_avg_amount_60min_window, 0.0) as terminal_id_avg_amount_60min_window,
    IFNULL(CAST(f_terminals_stream.terminal_id_nb_tx_15min_window AS FLOAT64), 0.0)  AS terminal_id_nb_tx_15min_window,
    IFNULL(CAST(f_terminals_stream.terminal_id_nb_tx_30min_window AS FLOAT64), 0.0)  AS terminal_id_nb_tx_30min_window,
    IFNULL(CAST(f_terminals_stream.terminal_id_nb_tx_60min_window AS FLOAT64), 0.0)  AS terminal_id_nb_tx_60min_window
      
    FROM
      raw_tx_labled_range_table
      
    LEFT JOIN
      ML.ENTITY_FEATURES_AT_TIME( TABLE `{batch_customers_features_table}`,
        TABLE customers_time_table,
        num_rows => 1,
        ignore_feature_nulls => TRUE) AS f_customers
    ON
      raw_tx_labled_range_table.customer_id = f_customers.entity_id
      AND raw_tx_labled_range_table.tx_timestamp = f_customers.feature_timestamp
      
    LEFT JOIN

      ML.ENTITY_FEATURES_AT_TIME( TABLE `{batch_terminals_features_table}`,
        TABLE terminals_time_table,
        num_rows => 1,
        ignore_feature_nulls => TRUE) AS f_terminals
    ON
      raw_tx_labled_range_table.terminal_id = f_terminals.entity_id
      AND raw_tx_labled_range_table.tx_timestamp = f_terminals.feature_timestamp
      
    LEFT JOIN
      ML.ENTITY_FEATURES_AT_TIME( TABLE `{stream_customers_features_table}`,
        TABLE customers_time_table,
        num_rows => 1,
        ignore_feature_nulls => TRUE) AS f_customers_stream
    ON
      raw_tx_labled_range_table.customer_id = f_customers_stream.entity_id
      AND raw_tx_labled_range_table.tx_timestamp = f_customers_stream.feature_timestamp
    LEFT JOIN
      ML.ENTITY_FEATURES_AT_TIME( TABLE `{stream_terminals_features_table}`,
        TABLE terminals_time_table,
        num_rows => 1,
        ignore_feature_nulls => TRUE) AS f_terminals_stream
    ON
      raw_tx_labled_range_table.terminal_id = f_terminals_stream.entity_id
      AND raw_tx_labled_range_table.tx_timestamp = f_terminals_stream.feature_timestamp
"""
print(train_dataset_view_sql)

In [None]:
run_bq_query(train_dataset_view_sql, show=True)

#### Inspect BigQuery training dataset view:

In [None]:
run_bq_query("SELECT * FROM tx.v_ff_training_dataset LIMIT 5", show=True)

### Initialize Vertex AI SDK

Initialize the Vertex AI SDK to get access to Vertex AI services programmatically. 

In [None]:
vertex_ai.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)

## Managing Features with Vertex AI Feature Store

Now that we've engineered our features, we need a robust and efficient way to manage them. This is where the **Vertex AI Feature Store** comes in. A feature store is a centralized repository for storing, serving, and managing machine learning features. It plays a crucial role in the MLOps lifecycle by providing a single source of truth for features, which helps to ensure consistency between training and serving, prevent feature leakage, and promote feature reuse across different models and projects.

### Key Concepts in Vertex AI Feature Store

Before we start creating our feature store, let's quickly go over some key concepts:

*   **Feature Store**: A top-level container for organizing and managing your features.
*   **Entity Type**: A collection of semantically related features. In our case, we'll have two entity types: `customer` and `terminal`.
*   **Feature**: A measurable property or characteristic of an entity. For example, `customer_id_nb_tx_1day_window` is a feature of the `customer` entity type.
*   **Feature View**: A logical view of features from a data source. It defines how features are synced from the data source to the online store for serving.

### Benefits of Using a Feature Store

Using a feature store offers several advantages, including:

*   **Preventing Training-Serving Skew**: By using the same feature store for both training and serving, you can ensure that your model is using the exact same features in both environments, which helps to prevent performance degradation due to inconsistencies.
*   **Promoting Feature Reuse**: A centralized feature store makes it easy to discover and reuse existing features across different models and teams, which can save time and effort.
*   **Improving Model Governance**: A feature store provides a centralized place to track feature lineage and metadata, which can help with model explainability and compliance.

In the following cells, we'll create an online store, register our feature groups and features, and sync our batch features from BigQuery to the online store.

### Import libraries

In [None]:
from google.cloud import bigquery
from google.cloud.aiplatform_v1 import (
    FeatureOnlineStoreAdminServiceClient,
    FeatureOnlineStoreServiceClient,
    FeatureRegistryServiceClient,
)
from google.cloud.aiplatform_v1.types import feature as feature_pb2
from google.cloud.aiplatform_v1.types import feature_group as feature_group_pb2
from google.cloud.aiplatform_v1.types import (
    feature_online_store as feature_online_store_pb2,
)
from google.cloud.aiplatform_v1.types import (
    feature_online_store_admin_service as feature_online_store_admin_service_pb2,
)
from google.cloud.aiplatform_v1.types import (
    feature_online_store_service as feature_online_store_service_pb2,
)
from google.cloud.aiplatform_v1.types import (
    feature_registry_service as feature_registry_service_pb2,
)
from google.cloud.aiplatform_v1.types import feature_view as feature_view_pb2
from google.cloud.aiplatform_v1.types import (
    featurestore_service as featurestore_service_pb2,
)
from google.cloud.aiplatform_v1.types import io as io_pb2

### Initialize Admin Service Client

In [None]:
API_ENDPOINT = f"{REGION}-aiplatform.googleapis.com"

In [None]:
admin_client = FeatureOnlineStoreAdminServiceClient(
    client_options={"api_endpoint": API_ENDPOINT}
)
registry_client = FeatureRegistryServiceClient(
    client_options={"api_endpoint": API_ENDPOINT}
)

### Create online store instance

Create a `FeatureOnlineStore` instance backed by Bigtable with autoscaling. In this lab, both the minimum and maximum node counts are set to 1.

In [None]:
online_store_config = feature_online_store_pb2.FeatureOnlineStore(
    bigtable=feature_online_store_pb2.FeatureOnlineStore.Bigtable(
        auto_scaling=feature_online_store_pb2.FeatureOnlineStore.Bigtable.AutoScaling(
            min_node_count=1, max_node_count=1, cpu_utilization_target=50
        )
    )
)

create_store_lro = admin_client.create_feature_online_store(
    feature_online_store_admin_service_pb2.CreateFeatureOnlineStoreRequest(
        parent=f"projects/{PROJECT_ID}/locations/{REGION}",
        feature_online_store_id=FEATURESTORE_ID,
        feature_online_store=online_store_config,
    )
)

### Verify online store instance creation

After the long-running operation (LRO) is complete, show the result.

> **Note:** This operation might take up to 10 minutes to complete.

In [None]:
# Wait for the LRO to finish and get the LRO result.
print(create_store_lro.result())

In [None]:
# Use list to verify the store is created.
admin_client.list_feature_online_stores(
    parent=f"projects/{PROJECT_ID}/locations/{REGION}"
)

### Registering Feature Groups and Features

Before we can use our features for online serving, we need to register their metadata with the Vertex AI Feature Registry. This involves two main concepts:

*   **`FeatureGroup`**: A `FeatureGroup` is a logical container that groups features defined on the same BigQuery data source. It tells the Feature Store where your feature data is located (the `input_uri`) and which column contains the unique entity IDs.

*   **`Feature`**: A `Feature` represents a single column (a feature) within a `FeatureGroup`'s data source.

In our project, we have four distinct sets of features based on the entity type (customer or terminal) and the calculation method (batch or streaming). Therefore, we will create four `FeatureGroup`s to organize them, one for each of our BigQuery feature tables:
1.  **`fraudfinder_customers_batch`**: For batch-calculated customer features.
2.  **`fraudfinder_customers_streaming`**: For streaming-calculated customer features (currently empty).
3.  **`fraudfinder_terminals_batch`**: For batch-calculated terminal features.
4.  **`fraudfinder_terminals_streaming`**: For streaming-calculated terminal features (currently empty).

The following cells will use our helper function to create these four `FeatureGroup`s and register all the associated `Feature`s (columns) within each one.
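
The four groups differ only in entity, metric, and window, so the plan can be summarized as data before any API call is made. The sketch below is one way to do that. The `FeatureGroupSpec` class and `windowed_ids` helper are hypothetical, and the `dataset.table` names are assumed to match the feature tables used in the training view.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureGroupSpec:
    feature_group_id: str
    bq_table: str  # assumed "dataset.table" name, for illustration only
    feature_ids: tuple


def windowed_ids(entity, metrics, windows):
    """Build IDs such as customer_id_nb_tx_1day_window."""
    return tuple(f"{entity}_id_{m}_{w}_window" for m in metrics for w in windows)


BATCH = ("1day", "7day", "14day")
STREAMING = ("15min", "30min", "60min")

SPECS = [
    FeatureGroupSpec("fraudfinder_customers_batch", "tx.t_customers_batch_features",
                     windowed_ids("customer", ("nb_tx", "avg_amount"), BATCH)),
    FeatureGroupSpec("fraudfinder_customers_streaming", "tx.t_customers_streaming_features",
                     windowed_ids("customer", ("nb_tx", "avg_amount"), STREAMING)),
    FeatureGroupSpec("fraudfinder_terminals_batch", "tx.t_terminals_batch_features",
                     windowed_ids("terminal", ("nb_tx", "risk"), BATCH)),
    FeatureGroupSpec("fraudfinder_terminals_streaming", "tx.t_terminals_streaming_features",
                     windowed_ids("terminal", ("nb_tx", "avg_amount"), STREAMING)),
]

for spec in SPECS:
    print(spec.feature_group_id, len(spec.feature_ids))
```

Each spec carries exactly the values the helper function needs: a source table, a group ID, and the list of feature IDs to register.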

#### Define utility method for feature groups creation

In [None]:
def create_fs_feature_group(
    bq_source_uri, entity_id_column, feature_group_id, feature_ids_list
):

    # Now, create the featureGroup
    feature_group_config = feature_group_pb2.FeatureGroup(
        big_query=feature_group_pb2.FeatureGroup.BigQuery(
            big_query_source=io_pb2.BigQuerySource(
                input_uri=f"bq://{bq_source_uri}"
            ),
            # Add the entity_id_columns parameter here
            entity_id_columns=[entity_id_column],
        )
    )
    create_group_lro = registry_client.create_feature_group(
        feature_registry_service_pb2.CreateFeatureGroupRequest(
            parent=f"projects/{PROJECT_ID}/locations/{REGION}",
            feature_group_id=feature_group_id,
            feature_group=feature_group_config,
        )
    )

    # After the long-running operation (LRO) is complete, show the result.
    print(create_group_lro.result())

    # Register each feature (column) in the feature group.
    create_feature_lros = []
    for feature_id in feature_ids_list:
        create_feature_lros.append(
            registry_client.create_feature(
                featurestore_service_pb2.CreateFeatureRequest(
                    parent=f"projects/{PROJECT_ID}/locations/{REGION}/featureGroups/{feature_group_id}",
                    feature_id=feature_id,
                    feature=feature_pb2.Feature(),
                )
            )
        )

    # Wait for each feature creation LRO to complete and show the result.
    for lro in create_feature_lros:
        print(lro.result())

In [None]:
CUSTOMER_ID_COLUMN = "entity_id"  # entity_id

CUSTOMER_BATCH_FEATURES_GROUP_ID = "fraudfinder_customers_batch"

CUSTOMER_BATCH_FEATURE_IDS = [
    "customer_id_nb_tx_14day_window",
    "customer_id_avg_amount_7day_window",
    "customer_id_nb_tx_1day_window",
    "customer_id_avg_amount_1day_window",
    "customer_id_avg_amount_14day_window",
    "customer_id_nb_tx_7day_window",
]

# Creating feature Group for batch for customers
create_fs_feature_group(
    bq_source_uri=CUSTOMERS_FE_BQ_BATCH_TABLE_URI,
    entity_id_column=CUSTOMER_ID_COLUMN,
    feature_group_id=CUSTOMER_BATCH_FEATURES_GROUP_ID,
    feature_ids_list=CUSTOMER_BATCH_FEATURE_IDS,
)

In [None]:
CUSTOMER_STREAMING_FEATURES_GROUP_ID = "fraudfinder_customers_streaming"
CUSTOMER_STREAMING_FEATURE_IDS = [
    "customer_id_nb_tx_15min_window",
    "customer_id_nb_tx_30min_window",
    "customer_id_nb_tx_60min_window",
    "customer_id_avg_amount_15min_window",
    "customer_id_avg_amount_30min_window",
    "customer_id_avg_amount_60min_window",
]

# Creating feature Group for streaming for customers
create_fs_feature_group(
    bq_source_uri=CUSTOMERS_STREAMING_FE_TABLE_URI,
    entity_id_column=CUSTOMER_ID_COLUMN,
    feature_group_id=CUSTOMER_STREAMING_FEATURES_GROUP_ID,
    feature_ids_list=CUSTOMER_STREAMING_FEATURE_IDS,
)

In [None]:
# Now, create the featureGroup for terminals
TERMINAL_ID_COLUMN = "entity_id"

TERMINAL_BATCH_FEATURES_GROUP_ID = "fraudfinder_terminals_batch"
TERMINAL_BATCH_FEATURE_IDS = [
    "terminal_id_nb_tx_1day_window",
    "terminal_id_nb_tx_7day_window",
    "terminal_id_nb_tx_14day_window",
    "terminal_id_risk_1day_window",
    "terminal_id_risk_7day_window",
    "terminal_id_risk_14day_window",
]

# Creating feature group for batch features for terminals
create_fs_feature_group(
    bq_source_uri=TERMINALS_FE_BQ_BATCH_TABLE_URI,
    entity_id_column=TERMINAL_ID_COLUMN,
    feature_group_id=TERMINAL_BATCH_FEATURES_GROUP_ID,
    feature_ids_list=TERMINAL_BATCH_FEATURE_IDS,
)

In [None]:
# Now, create the featureGroup for terminals streaming features
TERMINAL_STREAMING_FEATURES_GROUP_ID = "fraudfinder_terminals_streaming"
TERMINAL_STREAMING_FEATURE_IDS = [
    "terminal_id_nb_tx_15min_window",
    "terminal_id_nb_tx_30min_window",
    "terminal_id_nb_tx_60min_window",
    "terminal_id_avg_amount_15min_window",
    "terminal_id_avg_amount_30min_window",
    "terminal_id_avg_amount_60min_window",
]

# Creating feature group for streaming features for terminals
create_fs_feature_group(
    bq_source_uri=TERMINALS_STREAMING_FE_TABLE_URI,
    entity_id_column=TERMINAL_ID_COLUMN,
    feature_group_id=TERMINAL_STREAMING_FEATURES_GROUP_ID,
    feature_ids_list=TERMINAL_STREAMING_FEATURE_IDS,
)

### Connecting Feature Groups to the Online Store with `FeatureView`

Now that we have registered our `FeatureGroup`s (our feature sources), we need a way to tell the Feature Store to actually serve these features for real-time lookups. This is the job of a **`FeatureView`**.

A `FeatureView` acts as a bridge between the feature sources (`FeatureGroup`s) and the online serving cluster (the `FeatureOnlineStore` we created earlier). It defines which features from which feature groups should be made available for low-latency retrieval.

Key responsibilities of a `FeatureView`:
*   **Linking Sources:** It links one or more `FeatureGroup`s to a specific `FeatureOnlineStore`.
*   **Syncing Data:** It manages the synchronization of data from the BigQuery sources (defined in the `FeatureGroup`s) to the high-performance online store (Bigtable). You can configure this sync to run on a schedule (cron) or continuously. For this lab, we'll use a continuous sync to keep the online data as fresh as possible.

We will create two `FeatureView`s, one for each entity type:
1.  **`fv_fraudfinder_customers`**: This view will combine the batch and streaming features for customers.
2.  **`fv_fraudfinder_terminals`**: This view will combine the batch and streaming features for terminals.

The helper function below will create these views and start the data sync process.
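
Because a view syncs either continuously or on a cron schedule, it can help to validate that choice before calling the API. The following is a minimal sketch with plain dictionaries. The function name `choose_sync_config` is hypothetical, and the returned keys simply mirror the `continuous` and `cron` options mentioned above.

```python
def choose_sync_config(continuous: bool, cron_schedule=None) -> dict:
    """Return plain keyword arguments describing the sync mode."""
    if continuous:
        if cron_schedule is not None:
            raise ValueError("Pass either continuous=True or a cron schedule, not both.")
        return {"continuous": True}
    if not cron_schedule:
        raise ValueError("A cron schedule is required when continuous is False.")
    return {"cron": cron_schedule}


print(choose_sync_config(continuous=True))
print(choose_sync_config(continuous=False, cron_schedule="0 * * * *"))
```

Failing early on a missing schedule avoids submitting a view creation request with an empty sync configuration.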

In [None]:
def create_online_fs_view(
    fs_view_id,
    fs_online_store_id,
    feature_group_ids,
    feature_ids_list,
    continuous,
    cron_schedule=None,
):

    feature_groups = []

    for feature_group_id, feature_ids in zip(
        feature_group_ids, feature_ids_list
    ):
        feature_groups.append(
            feature_view_pb2.FeatureView.FeatureRegistrySource.FeatureGroup(
                feature_group_id=feature_group_id,
                feature_ids=feature_ids,
            )
        )

    feature_registry_source = (
        feature_view_pb2.FeatureView.FeatureRegistrySource(
            feature_groups=feature_groups
        )
    )

    if continuous:
        sync_config = feature_view_pb2.FeatureView.SyncConfig(continuous=True)
    else:
        sync_config = feature_view_pb2.FeatureView.SyncConfig(
            cron=cron_schedule
        )

    create_view_lro = admin_client.create_feature_view(
        feature_online_store_admin_service_pb2.CreateFeatureViewRequest(
            parent=f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{fs_online_store_id}",
            feature_view_id=fs_view_id,
            run_sync_immediately=True,
            feature_view=feature_view_pb2.FeatureView(
                feature_registry_source=feature_registry_source,
                sync_config=sync_config,
            ),
        )
    )

    # Wait for LRO to complete and show result
    print(create_view_lro.result())

#### Test that all resources are provisioned

In [None]:
# Test cell: create and delete a temporary feature view to confirm that all resources are provisioned.
TEST_FEATURE_VIEW_ID = "fv_fraudfinder_test_provisioned"

test_feature_view = f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURESTORE_ID}/featureViews/{TEST_FEATURE_VIEW_ID}"

admin_client.list_feature_views(
    parent=f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURESTORE_ID}"
)

create_online_fs_view(
    fs_view_id=TEST_FEATURE_VIEW_ID,
    fs_online_store_id=FEATURESTORE_ID,
    feature_group_ids=[
        CUSTOMER_BATCH_FEATURES_GROUP_ID,
        CUSTOMER_STREAMING_FEATURES_GROUP_ID,
    ],
    feature_ids_list=[
        CUSTOMER_BATCH_FEATURE_IDS,
        CUSTOMER_STREAMING_FEATURE_IDS,
    ],
    continuous=True,
)

# Delete TEST FeatureView
delete_view_lro = admin_client.delete_feature_view(name=test_feature_view)

print(delete_view_lro.result())

admin_client.list_feature_views(
    parent=f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURESTORE_ID}"
)

#### Creating featurestore view for Customers Features

In [None]:
CUSTOMER_FEATURE_VIEW_ID = "fv_fraudfinder_customers"

create_online_fs_view(
    fs_view_id=CUSTOMER_FEATURE_VIEW_ID,
    fs_online_store_id=FEATURESTORE_ID,
    feature_group_ids=[
        CUSTOMER_BATCH_FEATURES_GROUP_ID,
        CUSTOMER_STREAMING_FEATURES_GROUP_ID,
    ],
    feature_ids_list=[
        CUSTOMER_BATCH_FEATURE_IDS,
        CUSTOMER_STREAMING_FEATURE_IDS,
    ],
    continuous=True,
)

#### Creating featurestore view for Terminals Features:

In [None]:
TERMINAL_FEATURE_VIEW_ID = "fv_fraudfinder_terminals"

create_online_fs_view(
    fs_view_id=TERMINAL_FEATURE_VIEW_ID,
    fs_online_store_id=FEATURESTORE_ID,
    feature_group_ids=[
        TERMINAL_BATCH_FEATURES_GROUP_ID,
        TERMINAL_STREAMING_FEATURES_GROUP_ID,
    ],
    feature_ids_list=[
        TERMINAL_BATCH_FEATURE_IDS,
        TERMINAL_STREAMING_FEATURE_IDS,
    ],
    continuous=True,
)

Verify that the `FeatureView` instances are created by listing all the feature views within the online store.

In [None]:
# Again, list all feature views under the FEATURESTORE_ID to confirm
admin_client.list_feature_views(
    parent=f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURESTORE_ID}"
)

In [None]:
admin_client.list_feature_view_syncs(
    parent=f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURESTORE_ID}/featureViews/{CUSTOMER_FEATURE_VIEW_ID}"
)

In [None]:
admin_client.list_feature_view_syncs(
    parent=f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURESTORE_ID}/featureViews/{TERMINAL_FEATURE_VIEW_ID}"
)

### Online Serving for Real-Time Fraud Detection

Now that we've ingested our features into the Vertex AI Feature Store, it's time to talk about **online serving**. In the context of fraud detection, online serving refers to the process of retrieving feature values for a single entity (e.g., a customer or a terminal) in real-time, with very low latency. This is a critical requirement for our use case, as we need to be able to make a fraud prediction in a matter of milliseconds, while a customer is waiting for their transaction to be approved.

Vertex AI Feature Store provides a scalable, low-latency online serving solution that is optimized for real-time use cases. In this setup, feature data lives in an offline store (BigQuery) for batch use cases and is synced to an online store (Bigtable) for real-time serving. This dual-storage architecture lets you use the same features for both training and inference, which reduces the risk of training-serving skew.

In the following cells, we'll show you how to use the Vertex AI SDK to fetch feature values from the online store.

In [None]:
data_client = FeatureOnlineStoreServiceClient(
    client_options={"api_endpoint": API_ENDPOINT}
)

The `FeatureView` already defines which features are served for the model (in this demo, through the feature groups registered earlier). To fetch the data, submit a `fetch_feature_values` request specifying the `FeatureView` resource path and the ID of the entity.
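
Once values are fetched, they usually need to be shaped into the same form the model saw during training. The sketch below assumes the fetched values have already been converted to a plain dictionary, and it fills missing features with `0.0` to mirror the `IFNULL(..., 0.0)` defaults in the training view. The function name and the sample values are hypothetical.

```python
def to_model_input(fetched: dict, expected_features: list, default: float = 0.0) -> dict:
    """Keep only expected features, replacing missing or None values with a default."""
    row = {}
    for name in expected_features:
        value = fetched.get(name)
        row[name] = float(value) if value is not None else default
    return row


# Hypothetical fetched values: the streaming feature has not been populated yet.
fetched = {
    "customer_id_nb_tx_1day_window": 3,
    "customer_id_avg_amount_1day_window": 42.5,
    "customer_id_nb_tx_15min_window": None,
    "feature_timestamp": "2025-01-01T00:00:00Z",
}
expected = [
    "customer_id_nb_tx_1day_window",
    "customer_id_avg_amount_1day_window",
    "customer_id_nb_tx_15min_window",
]
print(to_model_input(fetched, expected))
```

Applying the same default at serving time as in the training view keeps the two code paths consistent while the streaming tables are still empty.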

#### Let's get a customer record from the BigQuery view:

In [None]:
%%bigquery customer_view_record --project {PROJECT_ID}
SELECT * 
FROM tx.v_customers_features
ORDER BY feature_timestamp DESC
LIMIT 1

In [None]:
customer_view_record

#### Now we can check that this record is available through Vertex AI Feature Store online serving
Note: the first sync can take a while to complete, so the record may not be available immediately.

In [None]:
print(f"Featurestore ID: {FEATURESTORE_ID}")
print(f"Featurestore View ID: {CUSTOMER_FEATURE_VIEW_ID}")

customer_key = customer_view_record["entity_id"][0]
print(f"entity_id={customer_key}")

FEATURE_VIEW_FULL_ID = f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURESTORE_ID}/featureViews/{CUSTOMER_FEATURE_VIEW_ID}"

try:
    fe_data = data_client.fetch_feature_values(
        request=feature_online_store_service_pb2.FetchFeatureValuesRequest(
            feature_view=FEATURE_VIEW_FULL_ID,
            data_key=feature_online_store_service_pb2.FeatureViewDataKey(
                key=customer_key
            ),
            data_format=feature_online_store_service_pb2.FeatureViewDataFormat.PROTO_STRUCT,
        )
    )
    customer_features = json.dumps(
        {k: v for k, v in fe_data.proto_struct.items()}, indent=4
    )
    print(f"Customer Features: {customer_features}")
except Exception as exp:
    print("ERROR: " + str(exp))

In [None]:
%%bigquery terminal_view_record --project {PROJECT_ID}
SELECT * 
FROM tx.v_terminals_features
ORDER BY feature_timestamp DESC
LIMIT 1

In [None]:
terminal_view_record

In [None]:
print(f"Featurestore ID: {FEATURESTORE_ID}")
print(f"Featurestore View ID: {TERMINAL_FEATURE_VIEW_ID}")

terminal_key = terminal_view_record["entity_id"][0]
print(f"entity_id={terminal_key}")

FEATURE_VIEW_FULL_ID = f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURESTORE_ID}/featureViews/{TERMINAL_FEATURE_VIEW_ID}"

try:
    fe_data = data_client.fetch_feature_values(
        request=feature_online_store_service_pb2.FetchFeatureValuesRequest(
            feature_view=FEATURE_VIEW_FULL_ID,
            data_key=feature_online_store_service_pb2.FeatureViewDataKey(
                key=terminal_key
            ),
            data_format=feature_online_store_service_pb2.FeatureViewDataFormat.PROTO_STRUCT,
        )
    )
    terminal_features = json.dumps(
        {k: v for k, v in fe_data.proto_struct.items()}, indent=4
    )
    print(f"Terminal Features: {terminal_features}")
except Exception as exp:
    print("ERROR: " + str(exp))

### Inspect your feature store in the Vertex AI console

You can also inspect your feature store in the [Vertex AI Feature Store console](https://console.cloud.google.com/vertex-ai/feature-store/online-stores)

### END

Now you can go to the next notebook `03_feature_engineering_streaming_new_fs.ipynb`

Copyright 2025 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.