# Deploy a BigQuery ML user churn propensity model to Vertex AI for online predictions

## Learning objectives

* Explore and preprocess a [Google Analytics 4](https://support.google.com/analytics/answer/7029846) data sample in [BigQuery](https://cloud.google.com/bigquery) for machine learning.
* Train a [BigQuery ML (BQML)](https://cloud.google.com/bigquery-ml) [XGBoost](https://xgboost.readthedocs.io/en/latest/) classifier to predict user churn on a mobile gaming application.
* Evaluate the performance of a BQML XGBoost classifier.
* Explain your XGBoost model with [BQML Explainable AI](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-xai-overview) global feature attributions.
* Generate batch predictions with your BQML XGBoost model.
* Export a BQML XGBoost model to [Google Cloud Storage](https://cloud.google.com/storage).
* Upload and deploy a BQML XGBoost model to a [Vertex AI Prediction](https://cloud.google.com/vertex-ai/docs/predictions/getting-predictions) Endpoint for online predictions.

## Introduction

In this lab, you will train, evaluate, explain, and generate batch and online predictions with a BigQuery ML XGBoost model. You will use a Google Analytics 4 dataset from a real mobile application, Flood it! ([Android app](https://play.google.com/store/apps/details?id=com.labpixies.flood), [iOS app](https://itunes.apple.com/us/app/flood-it!/id476943146?mt=8)), to determine the likelihood of users returning to the application. You will generate batch predictions with your BigQuery ML model as well as export and deploy it to **Vertex AI** for online predictions.

[Vertex AI](https://cloud.google.com/vertex-ai) is Google Cloud's next-generation, unified platform for machine learning development. By developing machine learning solutions on Vertex AI, you can use the latest pre-built ML components and AutoML to improve development productivity, scale your workflows and data-driven decision making, and accelerate time to value.

![BQML Vertex AI](./images/vertex-bqml-lab-architecture-diagram.png "Vertex BQML Lab Architecture Diagram")

Note: this lab is inspired by and extends [Churn prediction for game developers using Google Analytics 4 (GA4) and BigQuery ML](https://cloud.google.com/blog/topics/developers-practitioners/churn-prediction-game-developers-using-google-analytics-4-ga4-and-bigquery-ml). See that blog post and accompanying tutorial for additional depth on this use case and BigQuery ML. In this lab, you will go one step further and focus on how Vertex AI extends BigQuery ML's capabilities through online prediction, so you can incorporate customer churn predictions into decision-making UIs such as [Looker dashboards](https://looker.com/google-cloud) and also serve online predictions directly to customer applications to power interventions such as targeted incentives.

### Use case: user churn propensity modeling in the mobile gaming industry

According to a [2019 study](https://gameanalytics.com/reports/mobile-gaming-industry-analysis-h1-2019) of 100K mobile games by the Mobile Gaming Industry Analysis, most mobile games see only a 25% retention rate for users after the first 24 hours, and any game "below 30% retention generally needs improvement". For mobile game developers, improving user retention is critical to revenue stability and increasing profitability. In fact, [Bain & Company research](https://hbr.org/2014/10/the-value-of-keeping-the-right-customers) found that 5% growth in retention rate can result in a 25-95% increase in profits. With lower costs to retain existing customers, the business objective for game developers is clear: reduce churn and improve customer loyalty to drive long-term profitability.

Your task in this lab: use machine learning to predict user churn propensity after day 1, a crucial user onboarding window, and serve these online predictions to inform interventions such as targeted in-game rewards and notifications.

## Setup

### Define constants

In [None]:
# Retrieve the active gcloud project ID and store it in PROJECT_ID.
PROJECT_ID = !(gcloud config get-value core/project)
PROJECT_ID = PROJECT_ID[0]

In [None]:
BQ_LOCATION = 'US'
REGION = 'us-central1'

### Import libraries

In [None]:
from google.cloud import bigquery
from google.cloud import aiplatform as vertexai
import numpy as np
import pandas as pd

### Create a GCS bucket for artifact storage

Create a globally unique Google Cloud Storage bucket for artifact storage. You will use this bucket to export your BQML model later in the lab and upload it to Vertex AI.

In [None]:
GCS_BUCKET = f"{PROJECT_ID}-bqmlga4"

In [None]:
!gsutil mb -l $REGION gs://$GCS_BUCKET

### Create a BigQuery dataset

Next, create a BigQuery dataset from this notebook using the Python-based [`bq` command line utility](https://cloud.google.com/bigquery/docs/bq-command-line-tool). 

This dataset will group your feature views, model, and predictions table together. You can view it in the [BigQuery](https://console.cloud.google.com/bigquery) console.

In [None]:
BQ_DATASET = f"{PROJECT_ID}:bqmlga4"

In [None]:
!bq mk --location={BQ_LOCATION} --dataset {BQ_DATASET}

### Initialize the Vertex Python SDK client

Import the Vertex SDK for Python into your Python environment and initialize it.

In [None]:
vertexai.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{GCS_BUCKET}")

## Exploratory Data Analysis (EDA) in BigQuery

This lab uses a public BigQuery dataset that contains raw event data from a real mobile gaming app called **Flood it!** ([Android app](https://play.google.com/store/apps/details?id=com.labpixies.flood), [iOS app](https://itunes.apple.com/us/app/flood-it!/id476943146?mt=8)).

The data schema originates from Google Analytics for Firebase but is the same schema as Google Analytics 4.

Take a look at a sample of the raw event dataset using the query below:

In [None]:
%%bigquery --project $PROJECT_ID

SELECT 
    *
FROM
  `firebase-public-project.analytics_153293282.events_*`
    
TABLESAMPLE SYSTEM (1 PERCENT)

Google Analytics 4 uses an event-based measurement model, and each row in this dataset is an event. View the [complete schema](https://support.google.com/analytics/answer/7029846) for details about each column. As you can see above, certain columns are nested records that contain detailed information, such as:

* app_info
* device
* ecommerce
* event_params
* geo
* traffic_source
* user_properties
* items*
* web_info*
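
To make the nested layout concrete, the sketch below flattens an `event_params` value into a plain dictionary. It assumes the GA4 export layout in which `event_params` is a repeated record of `key` plus a `value` record with typed slots (`string_value`, `int_value`, `float_value`, `double_value`). The function name `flatten_event_params` and the sample payload are hypothetical and written by hand for illustration; they are not rows from the dataset.

```python
from typing import Any, Dict, List, Optional

VALUE_SLOTS = ("string_value", "int_value", "float_value", "double_value")


def flatten_event_params(event_params: List[Dict[str, Any]]) -> Dict[str, Optional[Any]]:
    """Collapse a list of key/value records into a flat {key: value} dict."""
    flat: Dict[str, Optional[Any]] = {}
    for param in event_params:
        value = param.get("value") or {}
        # Take the first typed slot that is populated; None if all are empty.
        flat[param["key"]] = next(
            (value[slot] for slot in VALUE_SLOTS if value.get(slot) is not None),
            None,
        )
    return flat


# Hypothetical payload, hand-written for illustration only.
sample_params = [
    {"key": "firebase_screen_class", "value": {"string_value": "MainActivity", "int_value": None}},
    {"key": "engagement_time_msec", "value": {"string_value": None, "int_value": 1234}},
]
flat_params = flatten_event_params(sample_params)
print(flat_params)
```

In BigQuery itself the same reshaping is typically done with `UNNEST(event_params)`; the Python version is only meant to show the shape of the data.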

This dataset contains 5.7M events from 15K+ users.

In [None]:
%%bigquery --project $PROJECT_ID

SELECT 
    COUNT(DISTINCT user_pseudo_id) as count_distinct_users,
    COUNT(event_timestamp) as count_events
FROM
  `firebase-public-project.analytics_153293282.events_*`

## Dataset preparation in BigQuery

### Defining churn for each user

There are many ways to define user churn. For the purposes of this lab, you will predict 1-day churn: users who do not come back and use the app again more than 24 hours after their first engagement.

In other words, after 24 hr of a user's first engagement with the app:

* if the user shows no event data thereafter, the user is considered **churned**.
* if the user does have at least one event datapoint thereafter, then the user is considered **returned**.

You may also want to remove users who were unlikely to have ever returned anyway after spending just a few minutes with the app, which is sometimes referred to as "bouncing". For example, you may want to build your model only on users who spent at least 10 minutes with the app (users who didn't bounce).

The query below defines a churned user with the following definition:

**Churned = "any user who spent at least 10 minutes on the app, but after 24 hours from when they first engaged with the app, never used the app again"**
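
The labeling rule can be sketched in plain Python on microsecond timestamps, the unit used by `event_timestamp`. This is an illustration under assumptions, not the lab's implementation: the function name `label_user` is hypothetical, and the two timestamp pairs are copied from the `user_churn` sample output shown in this lab.

```python
DAY_MICROS = 24 * 60 * 60 * 1_000_000   # 86,400,000,000 microseconds
TEN_MIN_MICROS = 10 * 60 * 1_000_000    # 600,000,000 microseconds


def label_user(first_engagement: int, last_engagement: int) -> dict:
    """Return churned/bounced flags from first and last engagement timestamps."""
    return {
        # Churned: the last engagement falls inside the first 24 hours.
        "churned": int(last_engagement < first_engagement + DAY_MICROS),
        # Bounced: the last engagement falls inside the first 10 minutes.
        "bounced": int(last_engagement <= first_engagement + TEN_MIN_MICROS),
    }


returning_user = label_user(1529343934517004, 1538606646358115)
bounced_user = label_user(1529240376163003, 1529240561784039)
print(returning_user, bounced_user)
```

Note that a bounced user is also flagged as churned by this rule; the bounced rows are filtered out later when the training data is assembled.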

You will use the raw event data, from their first touch (app installation) to their last touch, to identify churned and bounced users in the `user_churn` view query below:

In [58]:
%%bigquery --project $PROJECT_ID

CREATE OR REPLACE VIEW bqmlga4.user_churn AS (
  WITH firstlasttouch AS (
    SELECT
      user_pseudo_id,
      MIN(event_timestamp) AS user_first_engagement,
      MAX(event_timestamp) AS user_last_engagement
    FROM
      `firebase-public-project.analytics_153293282.events_*`
    WHERE event_name="user_engagement"
    GROUP BY
      user_pseudo_id

  )
  
SELECT
    user_pseudo_id,
    user_first_engagement,
    user_last_engagement,
    EXTRACT(MONTH from TIMESTAMP_MICROS(user_first_engagement)) as month,
    EXTRACT(DAYOFYEAR from TIMESTAMP_MICROS(user_first_engagement)) as julianday,
    EXTRACT(DAYOFWEEK from TIMESTAMP_MICROS(user_first_engagement)) as dayofweek,

    #add 24 hr to user's first touch
    (user_first_engagement + 86400000000) AS ts_24hr_after_first_engagement,
    
    #churned = 1 if last_touch within 24 hr of app installation, else 0
    IF (user_last_engagement < (user_first_engagement + 86400000000),
    1,
    0 ) AS churned,
    
    #bounced = 1 if last_touch within 10 min, else 0
    IF (user_last_engagement <= (user_first_engagement + 600000000),
    1,
    0 ) AS bounced,
  FROM
    firstlasttouch
  GROUP BY
    user_pseudo_id,
    user_first_engagement,
    user_last_engagement
    );

SELECT 
  * 
FROM 
  bqmlga4.user_churn 
LIMIT 100;



Unnamed: 0,user_pseudo_id,user_first_engagement,user_last_engagement,month,julianday,dayofweek,ts_24hr_after_first_engagement,churned,bounced
0,DF0E6C6C5E4F3F34FEC6EE54FD945B05,1529343934517004,1538606646358115,6,169,2,1529430334517004,0,0
1,E6FEE8B98E75EA5311FE004F98559A27,1529382024104009,1538621546331027,6,170,3,1529468424104009,0,0
2,E50F7AC0680FD87AD6CE6B6700D209A7,1529424105586001,1538503181312005,6,170,3,1529510505586001,0,0
3,153C55ABB8207C667CC7DD1C08DB02AA,1530330807669016,1538624418141011,6,181,7,1530417207669016,0,0
4,233E53BB59E58D20163E84557044105B,1537224397694003,1538603987007129,9,260,2,1537310797694003,0,0
...,...,...,...,...,...,...,...,...,...
95,23D89EE594C105BFA999295B38C80B2B,1529247490656001,1529969545742120,6,168,1,1529333890656001,0,0
96,5C1F957882700804AA2D02552FA51DEC,1528840635045006,1529442415968000,6,163,3,1528927035045006,0,0
97,812C25744DA187D0D3E93A0AE53B4B59,1528902970617003,1537875721746030,6,164,4,1528989370617003,0,0
98,4D2510F8432E1848C9AB5930AEB7786C,1529240376163003,1529240561784039,6,168,1,1529326776163003,1,1


Review how many of the 15k users bounced and returned below:

In [59]:
%%bigquery --project $PROJECT_ID

SELECT
    bounced,
    churned, 
    COUNT(churned) as count_users
FROM
    bqmlga4.user_churn
GROUP BY 1,2
ORDER BY bounced

| bounced | churned | count_users |
|--------:|--------:|------------:|
| 0 | 0 | 6148 |
| 0 | 1 | 1883 |
| 1 | 1 | 5557 |


For the training data, you will only use rows where bounced = 0. Of the 13,588 users in the `user_churn` view, 5,557 (about 41%) bounced within the first ten minutes of their first engagement with the app. Of the remaining 8,031 users, 1,883 (about 23%) churned after 24 hours, which you can validate with the query below:

In [60]:
%%bigquery --project $PROJECT_ID

SELECT
    COUNTIF(churned=1)/COUNT(churned) as churn_rate
FROM
    bqmlga4.user_churn
WHERE bounced = 0

| churn_rate |
|-----------:|
| 0.234466 |
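
The arithmetic behind these figures can be reproduced directly from the bounced/churned counts reported earlier. The sketch below only re-derives the rates from those three counts; the variable names are illustrative.

```python
# Counts copied from the bounced/churned breakdown of the user_churn view.
returned = 6148             # bounced = 0, churned = 0
churned_not_bounced = 1883  # bounced = 0, churned = 1
bounced = 5557              # bounced = 1, churned = 1

total_users = returned + churned_not_bounced + bounced
non_bounced = returned + churned_not_bounced

bounce_rate = bounced / total_users
churn_rate = churned_not_bounced / non_bounced

print(f"users in view: {total_users}")
print(f"bounce rate:   {bounce_rate:.3f}")
print(f"churn rate:    {churn_rate:.6f}")
```

The churn rate is computed over non-bounced users only, which is why its denominator is 8,031 rather than the full user count.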


### Extract user demographic features

There is various user demographic information included in this dataset, including `app_info`, `device`, `ecommerce`, `event_params`, and `geo`. Demographic features can help the model predict whether users on certain devices or countries are more likely to churn.

Note that a user's demographics may occasionally change (e.g. moving countries). For simplicity, you will use the demographic information that Google Analytics 4 provides when the user first engaged with the app, selected by ranking each user's `user_engagement` events by `event_timestamp` and keeping a single row in the query below. This enables every unique user to be represented by a single row.

In [None]:
%%bigquery --project $PROJECT_ID

CREATE OR REPLACE VIEW bqmlga4.user_demographics AS (

  WITH first_values AS (
      SELECT
          user_pseudo_id,
          geo.country as country,
          device.operating_system as operating_system,
          device.language as language,
          # row_num = 1 marks each user's earliest user_engagement event
          ROW_NUMBER() OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp ASC) AS row_num
      FROM `firebase-public-project.analytics_153293282.events_*`
      WHERE event_name="user_engagement"
      )
  SELECT * EXCEPT (row_num)
  FROM first_values
  WHERE row_num = 1
  );

SELECT
  *
FROM
  bqmlga4.user_demographics
LIMIT 10
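
The one-row-per-user idea, keeping the attributes from each user's earliest event as the text describes, can be sketched in plain Python. The function name `first_country_per_user` and the three events are hypothetical and exist only for illustration.

```python
from typing import Dict, Iterable, Tuple

Event = Tuple[str, int, str]  # (user_pseudo_id, event_timestamp, country)


def first_country_per_user(events: Iterable[Event]) -> Dict[str, str]:
    """Keep the country from each user's earliest event."""
    earliest: Dict[str, Tuple[int, str]] = {}
    for user_id, timestamp, country in events:
        if user_id not in earliest or timestamp < earliest[user_id][0]:
            earliest[user_id] = (timestamp, country)
    return {user_id: country for user_id, (_, country) in earliest.items()}


# Hypothetical events, hand-written for illustration only.
events = [
    ("user_a", 300, "Canada"),
    ("user_a", 100, "United States"),
    ("user_b", 200, "Spain"),
]
first_countries = first_country_per_user(events)
print(first_countries)
```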

### Aggregate user behavioral features

Behavioral data in the raw event data spans across multiple events -- and thus rows -- per user. The goal of this section is to aggregate and extract behavioral data for each user, resulting in one row of behavioral data per unique user.



As a first step, you can explore all the unique events that exist in this dataset, based on event_name:

In [None]:
%%bigquery --project $PROJECT_ID

SELECT
    event_name,
    COUNT(event_name) as event_count
FROM
    `firebase-public-project.analytics_153293282.events_*`
GROUP BY 1
ORDER BY
   event_count DESC

For this lab, to predict whether a user will churn or return, you can start by counting the number of times a user engages in the following event types:

* user_engagement
* level_start_quickplay
* level_end_quickplay
* level_complete_quickplay
* level_reset_quickplay
* post_score
* spend_virtual_currency
* ad_reward
* challenge_a_friend
* completed_5_levels
* use_extra_steps

In the SQL query below, you will aggregate the behavioral data by calculating the total number of times when each of the above event_names occurred in the data set per user.

In [61]:
%%bigquery --project $PROJECT_ID

CREATE OR REPLACE VIEW bqmlga4.user_behavior AS (
WITH
  events_first24hr AS (
    #select user data only from first 24 hr of using the app
    SELECT
      e.*
    FROM
      `firebase-public-project.analytics_153293282.events_*` e
    JOIN
      bqmlga4.user_churn c
    ON
      e.user_pseudo_id = c.user_pseudo_id
    WHERE
      e.event_timestamp <= c.ts_24hr_after_first_engagement
    )
SELECT
  user_pseudo_id,
  SUM(IF(event_name = 'user_engagement', 1, 0)) AS cnt_user_engagement,
  SUM(IF(event_name = 'level_start_quickplay', 1, 0)) AS cnt_level_start_quickplay,
  SUM(IF(event_name = 'level_end_quickplay', 1, 0)) AS cnt_level_end_quickplay,
  SUM(IF(event_name = 'level_complete_quickplay', 1, 0)) AS cnt_level_complete_quickplay,
  SUM(IF(event_name = 'level_reset_quickplay', 1, 0)) AS cnt_level_reset_quickplay,
  SUM(IF(event_name = 'post_score', 1, 0)) AS cnt_post_score,
  SUM(IF(event_name = 'spend_virtual_currency', 1, 0)) AS cnt_spend_virtual_currency,
  SUM(IF(event_name = 'ad_reward', 1, 0)) AS cnt_ad_reward,
  SUM(IF(event_name = 'challenge_a_friend', 1, 0)) AS cnt_challenge_a_friend,
  SUM(IF(event_name = 'completed_5_levels', 1, 0)) AS cnt_completed_5_levels,
  SUM(IF(event_name = 'use_extra_steps', 1, 0)) AS cnt_use_extra_steps,
FROM
  events_first24hr
GROUP BY
  user_pseudo_id
  );

SELECT
  *
FROM
  bqmlga4.user_behavior
LIMIT 10



Unnamed: 0,user_pseudo_id,cnt_user_engagement,cnt_level_start_quickplay,cnt_level_end_quickplay,cnt_level_complete_quickplay,cnt_level_reset_quickplay,cnt_post_score,cnt_spend_virtual_currency,cnt_ad_reward,cnt_challenge_a_friend,cnt_completed_5_levels,cnt_use_extra_steps
0,BB6882241BB572C3AF3E198490C83981,53,0,0,0,0,10,2,0,0,1,2
1,9E01D1A09FDD23C5CD84D1F955B9EBA8,1,0,0,0,0,0,0,0,0,0,0
2,55BF8BB2457C1E8B554058947A484334,67,13,5,4,3,4,18,0,0,0,18
3,8669CA0A64242060E992A27B5E4E8830,3,1,1,1,0,1,0,0,0,0,0
4,E33EE4D9090447D3981439CDA8910A97,15,4,4,0,0,0,0,0,0,0,0
5,5B333FE2BC3875C6C3882279B51B8BAF,32,5,5,3,0,7,0,0,0,0,0
6,3B4F8D8584100C1D66809A06266DE7FC,17,4,4,1,0,1,1,0,0,0,1
7,A05E5EF28C83231054B9A7009A418438,40,8,0,0,7,16,0,0,0,0,0
8,61633448555C128551BE7E095F5C0E10,8,0,0,0,0,0,0,0,0,0,0
9,1E9576958AA73D69DB9E4AFBFE27D553,8,3,0,0,2,0,0,0,0,0,0
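
The conditional-count pattern used in the view, `SUM(IF(event_name = ..., 1, 0))`, is a per-user tally of selected event names. A rough Python equivalent is sketched below. The function name `count_events_per_user`, the shortened `TRACKED_EVENTS` list, and the sample events are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# A subset of the tracked event names, shortened for readability.
TRACKED_EVENTS = ["user_engagement", "level_start_quickplay", "post_score"]


def count_events_per_user(events):
    """Return {user_id: {"cnt_<event>": count}} for the tracked events."""
    counts = defaultdict(Counter)
    for user_id, event_name in events:
        counts[user_id][event_name] += 1
    # Counter returns 0 for events a user never triggered, like SUM(IF(...)).
    return {
        user_id: {f"cnt_{name}": counter[name] for name in TRACKED_EVENTS}
        for user_id, counter in counts.items()
    }


# Hypothetical events, hand-written for illustration only.
sample_events = [
    ("user_a", "user_engagement"),
    ("user_a", "user_engagement"),
    ("user_a", "post_score"),
    ("user_b", "level_start_quickplay"),
    ("user_b", "screen_view"),  # not tracked, so it is ignored in the output
]
behavior = count_events_per_user(sample_events)
print(behavior)
```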


### Prepare your train/test datasets for machine learning

In this section, you can now combine these three intermediary views (`user_churn`, `user_demographics`, and `user_behavior`) into the final training data view called `ml_features`. Here you can also specify bounced = 0, in order to limit the training data only to users who did not "bounce" within the first 10 minutes of using the app.

Note in the query below that a manual `data_split` column is created in your feature view using [BigQuery's hashing functions](https://towardsdatascience.com/ml-design-pattern-5-repeatable-sampling-c0ccb2889f39) for repeatable sampling. It specifies an 80% train / 10% eval / 10% test split to evaluate your model's performance and generalization.

In [93]:
%%bigquery --project $PROJECT_ID

CREATE OR REPLACE VIEW bqmlga4.ml_features AS (
    
  SELECT
    dem.user_pseudo_id,
    IFNULL(dem.country, "Unknown") AS country,
    IFNULL(dem.operating_system, "Unknown") AS operating_system,
    IFNULL(dem.language, "Unknown") AS language,
    IFNULL(beh.cnt_user_engagement, 0) AS cnt_user_engagement,
    IFNULL(beh.cnt_level_start_quickplay, 0) AS cnt_level_start_quickplay,
    IFNULL(beh.cnt_level_end_quickplay, 0) AS cnt_level_end_quickplay,
    IFNULL(beh.cnt_level_complete_quickplay, 0) AS cnt_level_complete_quickplay,
    IFNULL(beh.cnt_level_reset_quickplay, 0) AS cnt_level_reset_quickplay,
    IFNULL(beh.cnt_post_score, 0) AS cnt_post_score,
    IFNULL(beh.cnt_spend_virtual_currency, 0) AS cnt_spend_virtual_currency,
    IFNULL(beh.cnt_ad_reward, 0) AS cnt_ad_reward,
    IFNULL(beh.cnt_challenge_a_friend, 0) AS cnt_challenge_a_friend,
    IFNULL(beh.cnt_completed_5_levels, 0) AS cnt_completed_5_levels,
    IFNULL(beh.cnt_use_extra_steps, 0) AS cnt_use_extra_steps,
    ret.user_first_engagement,
    ret.month,
    ret.julianday,
    ret.dayofweek,
    ret.churned,
    # 80% 'TRAIN' | 10%'EVAL' | 10% 'TEST'
    CASE
      WHEN ABS(MOD(FARM_FINGERPRINT(dem.user_pseudo_id), 10)) <= 7
        THEN 'TRAIN'
      WHEN ABS(MOD(FARM_FINGERPRINT(dem.user_pseudo_id), 10)) = 8
        THEN 'EVAL'
      WHEN ABS(MOD(FARM_FINGERPRINT(dem.user_pseudo_id), 10)) = 9
        THEN 'TEST'    
          ELSE '' END AS data_split
  FROM
    bqmlga4.user_churn ret
  LEFT OUTER JOIN
    bqmlga4.user_demographics dem
  ON 
    ret.user_pseudo_id = dem.user_pseudo_id
  LEFT OUTER JOIN 
    bqmlga4.user_behavior beh
  ON
    ret.user_pseudo_id = beh.user_pseudo_id
  WHERE ret.bounced = 0
  );

SELECT
  *
FROM
  bqmlga4.ml_features
LIMIT 10



Unnamed: 0,user_pseudo_id,country,operating_system,language,cnt_user_engagement,cnt_level_start_quickplay,cnt_level_end_quickplay,cnt_level_complete_quickplay,cnt_level_reset_quickplay,cnt_post_score,...,cnt_ad_reward,cnt_challenge_a_friend,cnt_completed_5_levels,cnt_use_extra_steps,user_first_engagement,month,julianday,dayofweek,churned,data_split
0,9D535D543FD07B8247DEADB4669FADE6,United States,ANDROID,en-us,41,16,11,5,3,5,...,0,0,0,0,1531154706915004,7,190,2,0,TRAIN
1,85FD06F19EBB6097388EEE0B207598B7,United States,ANDROID,en-us,64,12,11,1,0,1,...,0,0,0,0,1529121975329001,6,167,7,0,TRAIN
2,5217AB1A454DAED6243E1C9818BE6A20,United States,ANDROID,en-us,89,0,0,0,0,27,...,0,0,1,1,1529861523949001,6,175,1,0,TRAIN
3,E0F91AA200D6F12DBE53B0D166AFD87B,Spain,ANDROID,en-gb,19,5,4,2,0,2,...,0,0,0,0,1530995021146010,7,188,7,0,TRAIN
4,2F0F260BD7B876C346B41977D0D03D62,United States,ANDROID,en-us,78,23,20,8,0,8,...,0,0,0,1,1530931242916001,7,188,7,0,TRAIN
5,7566596A1D6ACA781692A7A0B89B06EF,United States,ANDROID,en-us,8,1,1,0,0,0,...,0,0,0,0,1528843166489001,6,163,3,0,TRAIN
6,88A41BFED275BB69125BE1F5524F3B42,United States,ANDROID,en-us,5,2,1,0,0,1,...,0,0,0,0,1529325237353004,6,169,2,0,TRAIN
7,8BE7BF90C971453A34C1FF6FF2A0ACAE,Canada,ANDROID,en-ca,700,143,132,43,1,74,...,0,0,1,4,1531082372656001,7,189,1,0,TRAIN
8,1B2A9E461E99FB1D165894AFE613AD73,United States,ANDROID,en-us,29,8,7,4,0,4,...,0,0,0,0,1531200788492001,7,191,3,0,TRAIN
9,9B18A1CBA96D52D05CFA7CB7C9011EDE,Mexico,ANDROID,en-us,387,99,78,9,0,9,...,0,0,0,0,1531098901503002,7,190,2,0,TRAIN
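
The repeatable split in the view can be illustrated outside BigQuery. The sketch below is an approximation under stated assumptions: `FARM_FINGERPRINT` is not available in the Python standard library, so SHA-256 is swapped in as the hash, which means the bucket for a given user will not match BigQuery's. The function name `assign_split` is hypothetical.

```python
import hashlib


def assign_split(user_id: str) -> str:
    """Deterministically map a user ID to TRAIN (80%), EVAL (10%) or TEST (10%)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Python's % is non-negative for a non-negative integer, so no ABS() is needed.
    bucket = int.from_bytes(digest[:8], "big") % 10
    if bucket <= 7:
        return "TRAIN"
    if bucket == 8:
        return "EVAL"
    return "TEST"


# The same ID always lands in the same split, run after run.
print(assign_split("9D535D543FD07B8247DEADB4669FADE6"))
```

Because the assignment depends only on the user ID, re-running the pipeline or adding new users never moves an existing user between splits, which is the property the hashing approach is chosen for.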


### Validate feature splits

Run the query below to validate the number of examples in each data partition for the 80/10/10 train/eval/test split.

In [94]:
%%bigquery --project $PROJECT_ID

SELECT
  data_split,
  COUNT(*) AS n_examples
FROM bqmlga4.ml_features
GROUP BY data_split

| data_split | n_examples |
|:-----------|-----------:|
| TRAIN | 6386 |
| EVAL | 846 |
| TEST | 799 |


In [95]:
%%bigquery --project $PROJECT_ID

SELECT
  operating_system
FROM bqmlga4.ml_features
GROUP BY
  operating_system

| operating_system |
|:-----------------|
| ANDROID |
| IOS |
| Unknown |


## Train and tune a BQML XGBoost propensity model to predict customer churn



The following code trains an XGBoost model. This model will take about 10 min to train.

For more information on the default hyperparameters used, you can read the documentation:
[CREATE MODEL statement for Boosted Tree models using XGBoost](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-boosted-tree)

|Model   | BQML model_type | Advantages | Disadvantages| 
|:-------|:----------:|:----------:|-------------:|
|XGBoost |     BOOSTED_TREE_CLASSIFIER [(documentation)](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-boosted-tree)       |   High model performance with feature importances and explainability | Slower to train than BQML LOGISTIC_REG |

Note: When you run the CREATE MODEL statement, BigQuery ML can automatically split your data into training and test sets so you can immediately evaluate your model's performance after training. This is a great option for fast model prototyping. In this lab, however, you split your data manually above using hashing for reproducible data splits.

In [96]:
MODEL_NAME="churn_xgb"

In [97]:
%%bigquery --project $PROJECT_ID

CREATE OR REPLACE MODEL bqmlga4.churn_xgb

OPTIONS(
  MODEL_TYPE="BOOSTED_TREE_CLASSIFIER",
  # Declare label column.
  INPUT_LABEL_COLS=["churned"],
  # Specify custom data splitting using the `data_split` column.
  DATA_SPLIT_METHOD="CUSTOM",
  DATA_SPLIT_COL="data_split",
  # Enable Vertex Explainable AI aggregated feature attributions.
  ENABLE_GLOBAL_EXPLAIN=True,
  # Hyperparameter tuning trials and search space.
  num_trials=20,
  max_parallel_trials=2,
  HPARAM_TUNING_OBJECTIVES=["roc_auc"],
  EARLY_STOP=True,
  LEARN_RATE=HPARAM_RANGE(0.01, 0.1),
  MAX_TREE_DEPTH=HPARAM_CANDIDATES([5,6,7])
) AS

SELECT
  * EXCEPT(user_pseudo_id)
FROM
  bqmlga4.ml_features



In [98]:
%%bigquery --project $PROJECT_ID

SELECT *
FROM
  ML.TRIAL_INFO(MODEL `bqmlga4.churn_xgb`)


## Evaluate BQML XGBoost model performance

Once training is finished, you can run [ML.EVALUATE](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-evaluate) to return model evaluation metrics. 

In [99]:
%%bigquery --project $PROJECT_ID

SELECT
  *
FROM
  ML.EVALUATE(MODEL bqmlga4.churn_xgb);

| precision | recall | accuracy | f1_score | log_loss | roc_auc |
|----------:|-------:|---------:|---------:|---------:|--------:|
| 0.62069 | 0.186047 | 0.781763 | 0.286282 | 0.450754 | 0.777391 |
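
As a quick consistency check on the reported metrics, the F1 score is the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall values in the evaluation output; the variable names are illustrative.

```python
# Precision and recall copied from the ML.EVALUATE output.
precision = 0.62069
recall = 0.186047

# F1 is the harmonic mean of precision and recall.
f1_score = 2 * precision * recall / (precision + recall)
print(f"f1_score: {f1_score:.6f}")
```

The low recall relative to precision indicates that, at the classification threshold used for evaluation, the model identifies only a minority of the users who actually churned.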


In [None]:
%%bigquery --project $PROJECT_ID

SELECT
  expected_label,
  _0 AS predicted_0,
  _1 AS predicted_1
FROM
  ML.CONFUSION_MATRIX(MODEL bqmlga4.churn_xgb)

In [None]:
%%bigquery df_roc --project $PROJECT_ID
SELECT * FROM ML.ROC_CURVE(MODEL bqmlga4.churn_xgb)

In [None]:
df_roc.plot(x="false_positive_rate", y="recall", title="AUC-ROC curve")
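
The area under the plotted curve can be approximated from the same two columns with the trapezoidal rule. The sketch below is a minimal illustration: the function name `trapezoid_auc` and the four curve points are hypothetical, not values returned by `ML.ROC_CURVE`.

```python
def trapezoid_auc(points):
    """Approximate the area under a curve given (false_positive_rate, recall) pairs."""
    ordered = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        # Area of the trapezoid between two neighbouring points.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical ROC points, hand-written for illustration only.
roc_points = [(0.0, 0.0), (0.1, 0.4), (0.4, 0.8), (1.0, 1.0)]
approx_auc = trapezoid_auc(roc_points)
print(f"approximate AUC: {approx_auc:.2f}")
```

With the notebook's DataFrame, the pairs could be built with `zip(df_roc["false_positive_rate"], df_roc["recall"])`; the result should land close to the `roc_auc` value reported by `ML.EVALUATE`, with small differences due to the number of thresholds sampled.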

## Inspect global feature attributions

To provide further context to your model performance, you can use the [ML.GLOBAL_EXPLAIN](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-global-explain#get_global_feature_importance_for_each_class_of_a_boosted_tree_classifier_model) function which leverages Vertex Explainable AI as a back-end. [Vertex Explainable AI](https://cloud.google.com/vertex-ai/docs/explainable-ai) helps you understand your model's outputs for classification and regression tasks. Vertex AI tells you how much each feature in the data contributed to the predicted result. You can then use this information to verify that the model is behaving as expected, identify and mitigate biases in your models, and get ideas for ways to improve your model and your training data.

In [102]:
%%bigquery --project $PROJECT_ID

SELECT
  *
FROM
  ML.GLOBAL_EXPLAIN(MODEL bqmlga4.churn_xgb)
ORDER BY
  attribution DESC

| feature | attribution |
|:--------|------------:|
| cnt_user_engagement | 0.288081 |
| user_first_engagement | 0.140898 |
| julianday | 0.099747 |
| operating_system | 0.075372 |
| cnt_level_start_quickplay | 0.040235 |
| cnt_post_score | 0.026413 |
| cnt_level_end_quickplay | 0.026075 |
| language | 0.017465 |
| dayofweek | 0.015841 |
| country | 0.012824 |


## Generate batch predictions

You can generate batch predictions for your BQML XGBoost model using [ML.PREDICT](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-predict).

In [None]:
%%bigquery --project $PROJECT_ID

SELECT
  *
FROM
  ML.PREDICT(MODEL bqmlga4.churn_xgb,
  (SELECT * FROM bqmlga4.ml_features WHERE data_split = 'TEST'))

In [62]:
%%bigquery --project $PROJECT_ID

CREATE OR REPLACE TABLE bqmlga4.churn_predictions AS (
SELECT
  user_pseudo_id,
  churned,
  predicted_churned,
  predicted_churned_probs[OFFSET(0)].prob as probability_churned
FROM
  ML.PREDICT(MODEL bqmlga4.churn_xgb,
  (SELECT * FROM bqmlga4.ml_features WHERE data_split = 'TEST'))
);



## Export a BQML model to Vertex AI for online predictions

See the official BigQuery ML Guide: [Exporting a BigQuery ML model for online prediction](https://cloud.google.com/bigquery-ml/docs/export-model-tutorial) for additional details.

### Export BQML model to GCS

You will use the `bq extract` command in the `bq` command-line tool to export your BQML XGBoost model assets to Google Cloud Storage for persistence. See the [documentation](https://cloud.google.com/bigquery-ml/docs/exporting-models) for additional model export options.

In [None]:
BQ_MODEL = f"{BQ_DATASET}.{MODEL_NAME}"
BQ_MODEL_EXPORT_DIR = f"gs://{GCS_BUCKET}/{MODEL_NAME}"

In [None]:
!bq --location=$BQ_LOCATION extract \
--destination_format ML_XGBOOST_BOOSTER \
--model $BQ_MODEL \
$BQ_MODEL_EXPORT_DIR

Navigate to [Google Cloud Storage](https://console.cloud.google.com/storage) in the Google Cloud Console and open `gs://{GCS_BUCKET}/{MODEL_NAME}`. You will see your exported model assets in the following format:

```
|--/{GCS_BUCKET}/{MODEL_NAME}/
   |--/assets/                       # Contains preprocessing code.  
      |--0_categorical_label.txt     # Contains country vocabulary.
      |--1_categorical_label.txt     # Contains operating_system vocabulary.
      |--2_categorical_label.txt     # Contains language vocabulary.
      |--model_metadata.json         # Contains model feature and label mappings.
   |--main.py                        # Can be called for local training runs.
   |--model.bst                      # XGBoost saved model format.
   |--xgboost_predictor-0.1.tar.gz   # Compressed XGBoost model with prediction function.
```

### Upload BQML model to Vertex AI from GCS

Vertex AI contains optimized pre-built training and prediction containers for popular ML frameworks such as TensorFlow, PyTorch, and XGBoost. You will upload your XGBoost model from GCS to Vertex AI and provide the [latest pre-built Vertex XGBoost prediction container](https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers) to execute your model code to generate predictions in the cells below.

In [None]:
IMAGE_URI='us-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.1-4:latest'

In [None]:
model = vertexai.Model.upload(
    display_name=MODEL_NAME,
    artifact_uri=BQ_MODEL_EXPORT_DIR,
    serving_container_image_uri=IMAGE_URI,
)

### Deploy a Vertex `Endpoint` for online predictions

Before you use your model to make predictions, you need to deploy it to an `Endpoint` object. When you deploy a model to an `Endpoint`, you associate physical (machine) resources with that model to enable it to serve online predictions. Online predictions have low latency requirements; providing resources to the model in advance reduces latency. You can do this by calling the deploy function on the `Model` resource. This will do two things:

1. Create an `Endpoint` resource for deploying the `Model` resource to.
2. Deploy the `Model` resource to the `Endpoint` resource.

The `deploy()` function takes the following parameters:

* `deployed_model_display_name`: A human readable name for the deployed model.
* `traffic_split`: Percent of traffic at the endpoint that goes to this model, which is specified as a dictionary of one or more key/value pairs. If only one model, then specify as { "0": 100 }, where "0" refers to this model being uploaded and 100 means 100% of the traffic.
* `machine_type`: The type of machine to use for serving predictions.
* `accelerator_type`: The hardware accelerator type.
* `accelerator_count`: The number of accelerators to attach to a worker replica.
* `min_replica_count`: The minimum number of compute instances to provision; the endpoint does not scale below this number.
* `max_replica_count`: The maximum number of compute instances to scale to. In this lab, only one instance is provisioned.
* `explanation_parameters`: Parameters to configure the Explainable AI attribution method.
* `explanation_metadata`: Metadata that describes your model for Explainable AI, such as its features, inputs, and outputs.

Note: this can take about 3-5 minutes to provision prediction resources for your model.
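
As a sketch of how several of these parameters fit together, the helper below wraps a `deploy()` call that pins the endpoint to a single replica. The helper name `deploy_single_replica` is hypothetical and not part of the SDK. The keyword names it passes (`deployed_model_display_name`, `traffic_split`, `machine_type`, `min_replica_count`, `max_replica_count`) are assumed to match the `Model.deploy` signature of the Vertex AI SDK version you have installed, so check them against your SDK's reference before relying on this.

```python
def deploy_single_replica(model, display_name, machine_type="n1-standard-2"):
    """Deploy `model` to a new endpoint with exactly one serving replica.

    `model` is expected to be a Vertex AI `Model` resource, such as the
    object returned by `Model.upload(...)`.
    """
    return model.deploy(
        deployed_model_display_name=display_name,
        # "0" refers to the model being deployed; it receives all traffic.
        traffic_split={"0": 100},
        machine_type=machine_type,
        # Equal min and max replica counts disable autoscaling.
        min_replica_count=1,
        max_replica_count=1,
    )
```

Calling `deploy_single_replica(model, MODEL_NAME)` would return the resulting `Endpoint`. Keeping the replica bounds equal trades elasticity for predictable cost, which is usually acceptable in a lab setting.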

In [None]:
endpoint = model.deploy(
    traffic_split={"0": 100},
    machine_type="n1-standard-2",
)

### Query model for online predictions

XGBoost only takes numerical feature inputs. When you trained your BQML model above with the `CREATE MODEL` statement, it automatically encoded categorical features such as user `country`, `operating_system`, and `language` into numeric representations. For your exported model to generate online predictions, you will use the categorical feature vocabulary files exported under the `assets/` folder of your model directory and the Scikit-Learn preprocessing code below to map your test instances to numeric values.
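
To make the idea concrete, here is a minimal pure-Python sketch of mapping a vocabulary to integer codes. It is an illustration only, not the lab's implementation: `build_vocab_index` and `encode` are hypothetical helpers, the vocabulary is made up, and the sorted ordering is an assumption that mirrors how Scikit-Learn's `OrdinalEncoder` orders the categories it infers from data.

```python
def build_vocab_index(vocab, na_value="Unknown"):
    """Map each vocabulary entry to an integer code, in sorted order."""
    # Replace missing entries with a placeholder string before encoding.
    cleaned = [na_value if v is None else v for v in vocab]
    # Sort the unique values so that codes are deterministic.
    return {value: idx for idx, value in enumerate(sorted(set(cleaned)))}


def encode(value, vocab_index, unknown_value=-1):
    """Return the integer code for value, or unknown_value if it is unseen."""
    return vocab_index.get(value, unknown_value)


# Hypothetical vocabulary; real values come from the exported assets/ files.
os_index = build_vocab_index(["IOS", "ANDROID", None])
print(os_index)                     # {'ANDROID': 0, 'IOS': 1, 'Unknown': 2}
print(encode("IOS", os_index))      # 1
print(encode("WINDOWS", os_index))  # -1
```

Values that never appeared in the vocabulary map to `-1`, which corresponds to the `unknown_value=-1` setting used with `OrdinalEncoder` in this lab.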

In [None]:
CATEGORICAL_FEATURES = ['country',
                        'operating_system',
                        'language']

In [None]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

In [None]:
def _build_cat_feature_encoders(cat_feature_list, gcs_bucket, model_name, na_value='Unknown'):
    """Build categorical feature encoders for mapping text to integers for XGBoost inference. 
    Args:
      cat_feature_list (list): List of string feature names.
      gcs_bucket (str): A string path to your Google Cloud Storage bucket.
      model_name (str): A string model directory in GCS where your BQML model was exported to.
      na_value (str): default is 'Unknown'. String value to replace any vocab NaN values prior to encoding.
    Returns:
      feature_encoders (dict): A dictionary containing OrdinalEncoder objects for integerizing 
        categorical features that has the format [feature] = feature encoder.
    """
    
    feature_encoders = {}
    
    for idx, feature in enumerate(cat_feature_list):
        feature_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
        feature_vocab_file = f"gs://{gcs_bucket}/{model_name}/assets/{idx}_categorical_label.txt"
        feature_vocab_df = pd.read_csv(feature_vocab_file, delimiter = "\t", header=None).fillna(na_value)
        feature_encoder.fit(feature_vocab_df.values)
        feature_encoders[feature] = feature_encoder
    
    return feature_encoders

In [None]:
def preprocess_xgboost(instances, cat_feature_list, feature_encoders):
    """Transform instances to numerical values for inference.
    Args:
      instances (list[dict]): A list of feature dictionaries with the format feature: value. 
      cat_feature_list (list): A list of string feature names.
      feature_encoders (dict): A dictionary with the format feature: feature_encoder.
    Returns:
      transformed_instances (list[list]): A list of lists containing numerical feature values needed
        for Vertex XGBoost inference.
    """
    transformed_instances = []

    for instance in instances:
        # Copy so the caller's dictionaries are not modified in place.
        instance = dict(instance)
        for feature in cat_feature_list:
            feature_int = feature_encoders[feature].transform([[instance[feature]]]).item()
            instance[feature] = feature_int
        # Build the row once, after all categorical features are encoded.
        transformed_instances.append(list(instance.values()))
    return transformed_instances

In [None]:
# Build a dictionary of ordinal categorical feature encoders.
feature_encoders = _build_cat_feature_encoders(CATEGORICAL_FEATURES, GCS_BUCKET, MODEL_NAME)

In [None]:
%%bigquery test_df --project $PROJECT_ID 

SELECT * EXCEPT (user_pseudo_id, churned, data_split)
FROM bqmlga4.ml_features
WHERE data_split=True
LIMIT 3;

In [None]:
# Convert dataframe records to feature dictionaries for preprocessing by feature name.
test_instances = test_df.astype(str).to_dict(orient='records')

In [None]:
# Apply preprocessing to transform categorical features and return numerical instances for prediction.
transformed_test_instances = preprocess_xgboost(test_instances, CATEGORICAL_FEATURES, feature_encoders)

In [None]:
# Generate predictions from model deployed to Vertex AI Endpoint.
predictions = endpoint.predict(instances=transformed_test_instances)

In [None]:
for idx, prediction in enumerate(predictions.predictions):
    # Class labels [1, 0] are listed in model_metadata.json in the GCS model directory,
    # so prediction[0] corresponds to label 1 (churned).
    # The BQML binary classification default threshold is 0.5:
    # at or above is "Churn", below is "Not Churn".
    is_churned = "Churn" if prediction[0] >= 0.5 else "Not Churn"
    print(f"Prediction: Customer {idx} - {is_churned} {prediction}")
    print(test_df.iloc[idx].astype(str).to_json() + "\n")
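
The 0.5 cutoff is a default rather than a requirement. As a minimal sketch of how a different threshold changes the labels, consider the hypothetical helper `label_predictions` below; the probabilities are made-up values for illustration, not output from the deployed model.

```python
def label_predictions(probabilities, threshold=0.5):
    """Label each churn probability as "Churn" or "Not Churn"."""
    return ["Churn" if p >= threshold else "Not Churn" for p in probabilities]


# Hypothetical churn probabilities, not real model output.
sample_probs = [0.82, 0.41, 0.55]
print(label_predictions(sample_probs))                 # ['Churn', 'Not Churn', 'Churn']
print(label_predictions(sample_probs, threshold=0.6))  # ['Churn', 'Not Churn', 'Not Churn']
```

Raising the threshold flags fewer users as likely to churn, which may suit interventions that are costly to deliver.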

## Next steps

Congratulations! In this lab, you trained, tuned, explained, and deployed a BigQuery ML user churn model to generate batch and online churn predictions. You can use these predictions to target customers who are likely to churn with interventions such as in-game rewards and reminder notifications.

In this lab, you used `user_pseudo_id` as a user identifier. As a next step, you can extend this code by having your application return a `user_id` to Google Analytics so that you can join your model's predictions with additional first-party data.

For batch predictions,

For online predictions, 

## License

In [None]:
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.