# Overview
This notebook shows you how you can train, evaluate, and deploy a propensity model in BigQuery ML to predict user retention on a mobile game, based on app measurement data from Google Analytics 4.



### Dataset

This notebook uses [this public BigQuery dataset](https://console.cloud.google.com/bigquery?p=firebase-public-project&d=analytics_153293282&t=events_20181003&page=table), which contains raw event data from a real mobile gaming app called Flood It! ([Android app](https://play.google.com/store/apps/details?id=com.labpixies.flood), [iOS app](https://itunes.apple.com/us/app/flood-it!/id476943146?mt=8)). The [data schema](https://support.google.com/analytics/answer/7029846) originates from Google Analytics for Firebase but is the same schema used by [Google Analytics 4](https://support.google.com/analytics/answer/9358801), so this notebook applies to use cases that use either Google Analytics for Firebase or Google Analytics 4 data.

Google Analytics 4 (GA4) uses an [event-based](https://support.google.com/analytics/answer/9322688) measurement model. Events provide insight into what is happening in an app or on a website, such as user actions, system events, or errors. Every row in the dataset is an event, with various characteristics relevant to that event stored in a nested format within the row. While Google Analytics logs many types of events by default, developers can also define custom events that they wish to log.

Because raw event data cannot be used directly to train a machine learning model, this notebook also walks through the important steps of pre-processing the raw data into a format suitable as training data for classification models.

### Objective and Problem Statement

The goal of this notebook is to provide an end-to-end solution for propensity modeling to predict user churn on GA4 data using BigQuery ML. Using the "Flood It!" dataset, based on a user's activity within the first 24 hrs of app installation, you will try various classification models to predict the propensity to churn (1) or not churn (0).

By the end of this notebook, you will know how to:
* Explore the export of Google Analytics 4 data on BigQuery
* Prepare the training data using demographic data, behavioral data, and the label (churn/not-churn)
* Train classification models using BigQuery ML
* Evaluate classification models using BigQuery ML
* Make predictions on which users will churn using BigQuery ML
* Activate on model predictions

In [7]:
USER_FLAG = "--user"


In [8]:
!pip3 install {USER_FLAG} google-cloud-aiplatform==1.3.0 --upgrade
!pip3 install {USER_FLAG} kfp --upgrade

Collecting google-cloud-aiplatform==1.3.0
  Downloading google_cloud_aiplatform-1.3.0-py2.py3-none-any.whl (1.3 MB)
Installing collected packages: google-cloud-aiplatform
Successfully installed google-cloud-aiplatform-1.3.0
Collecting kfp
  Downloading kfp-1.8.2.tar.gz (248 kB)
Collecting absl-py<=0.11,>=0.9
  Downloading absl_py-0.11.0-py3-none-any.whl (127 kB)
Collecting google-api-python-client<2,>=1.7.8
  Downloading google_api_python_client-1.12.8-py2.py3-none-any.whl (61 kB)
Collecting requests-toolbelt<1,>=0.8.0
  Downloading requests_toolbelt-0.9.1-py2.py3-none-any.whl (54 kB)
Collecting kfp-server-api<2.0.0,>=1.1.2
  Down

In [1]:
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

In [1]:
!python3 -c "import kfp; print('KFP SDK version: {}'.format(kfp.__version__))"


KFP SDK version: 1.8.2


In [2]:
!pip list | grep aiplatform


google-cloud-aiplatform        1.3.0


In [6]:
import os
PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output=!gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

Project ID:  vertex-ai-dev


In [4]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "your-project-id"  # @param {type:"string"}

In [7]:
BUCKET_NAME="gs://" + PROJECT_ID + "-bucket"


In [11]:
import kfp
import matplotlib.pyplot as plt
import pandas as pd
import requests

from kfp import dsl
from kfp.v2 import compiler
from kfp.v2.dsl import (Artifact, Dataset, Input, InputPath, Model, Output,
                        OutputPath, ClassificationMetrics, Metrics, component)

from google.cloud import aiplatform
from google.cloud.aiplatform import pipeline_jobs
from typing import NamedTuple

# We'll use this beta library for metadata querying
from google.cloud import aiplatform_v1beta1
from google.cloud import bigquery


In [8]:
PATH=%env PATH
%env PATH={PATH}:/home/jupyter/.local/bin
REGION="us-central1"

PIPELINE_ROOT = f"{BUCKET_NAME}/pipeline_root/"
PIPELINE_ROOT

env: PATH=/opt/conda/bin:/opt/conda/condabin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/jupyter/.local/bin:/home/jupyter/.local/bin


'gs://vertex-ai-dev-bucket/pipeline_root/'

### Create a BigQuery dataset

In this notebook, you will need to create a dataset in your project called `bqmlga4`. To create it, run the following cell:

#@bigquery
-- Create a dataset in BigQuery

CREATE SCHEMA bqmlga4
OPTIONS(
  location="us"
  )

## The dataset

### Using the sample gaming event data from Flood It!



The sample dataset contains raw event data, as shown in the next cell:

_Note_: Jupyter runs cells starting with `%%bigquery` as SQL queries.

In [2]:
%%bigquery --project vertex-ai-dev

SELECT 
    *
FROM
  `firebase-public-project.analytics_153293282.events_*`
    
TABLESAMPLE SYSTEM (1 PERCENT)

Query complete after 0.08s: 100%|██████████| 2/2 [00:00<00:00, 635.74query/s]                         
Downloading: 100%|██████████| 50000/50000 [00:02<00:00, 24656.10rows/s]


Unnamed: 0,event_date,event_timestamp,event_name,event_params,event_previous_timestamp,event_value_in_usd,event_bundle_sequence_id,event_server_timestamp_offset,user_id,user_pseudo_id,user_properties,user_first_touch_timestamp,user_ltv,device,geo,app_info,traffic_source,stream_id,platform,event_dimensions
0,20180713,1531539374336001,session_start,"[{'key': 'firebase_conversion', 'value': {'str...",1.527598e+15,,98,-103725286,,9E635066BDD2E61E59252D382E0D2C61,"[{'key': 'initial_extra_steps', 'value': {'str...",1482918434034000,,"{'category': 'tablet', 'mobile_brand_name': 'n...","{'continent': 'Americas', 'country': 'United S...","{'id': 'com.labpixies.flood', 'version': '2.62...",{'name': 'Mobile App | US | en | Mobile | Disp...,1051193346,ANDROID,
1,20180713,1531539307624001,screen_view,"[{'key': 'firebase_previous_id', 'value': {'st...",1.531539e+15,,98,-103725286,,9E635066BDD2E61E59252D382E0D2C61,"[{'key': 'initial_extra_steps', 'value': {'str...",1482918434034000,,"{'category': 'tablet', 'mobile_brand_name': 'n...","{'continent': 'Americas', 'country': 'United S...","{'id': 'com.labpixies.flood', 'version': '2.62...",{'name': 'Mobile App | US | en | Mobile | Disp...,1051193346,ANDROID,
2,20180713,1531539305918002,screen_view,"[{'key': 'firebase_previous_id', 'value': {'st...",1.531539e+15,,98,-103725286,,9E635066BDD2E61E59252D382E0D2C61,"[{'key': 'initial_extra_steps', 'value': {'str...",1482918434034000,,"{'category': 'tablet', 'mobile_brand_name': 'n...","{'continent': 'Americas', 'country': 'United S...","{'id': 'com.labpixies.flood', 'version': '2.62...",{'name': 'Mobile App | US | en | Mobile | Disp...,1051193346,ANDROID,
3,20180713,1531539310595006,screen_view,"[{'key': 'firebase_previous_id', 'value': {'st...",1.531539e+15,,98,-103725286,,9E635066BDD2E61E59252D382E0D2C61,"[{'key': 'initial_extra_steps', 'value': {'str...",1482918434034000,,"{'category': 'tablet', 'mobile_brand_name': 'n...","{'continent': 'Americas', 'country': 'United S...","{'id': 'com.labpixies.flood', 'version': '2.62...",{'name': 'Mobile App | US | en | Mobile | Disp...,1051193346,ANDROID,
4,20180713,1531539352717007,screen_view,"[{'key': 'firebase_previous_id', 'value': {'st...",1.531539e+15,,98,-103725286,,9E635066BDD2E61E59252D382E0D2C61,"[{'key': 'initial_extra_steps', 'value': {'str...",1482918434034000,,"{'category': 'tablet', 'mobile_brand_name': 'n...","{'continent': 'Americas', 'country': 'United S...","{'id': 'com.labpixies.flood', 'version': '2.62...",{'name': 'Mobile App | US | en | Mobile | Disp...,1051193346,ANDROID,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,20180713,1531543157708004,user_engagement,"[{'key': 'firebase_screen_class', 'value': {'s...",1.531543e+15,,14,309687,,F76E97B52858963AC10CEB05BDC185C6,"[{'key': 'first_open_time', 'value': {'string_...",1508569714315000,,"{'category': 'mobile', 'mobile_brand_name': 'n...","{'continent': 'Americas', 'country': 'Canada',...","{'id': 'com.labpixies.flood', 'version': '2.62...","{'name': '(direct)', 'medium': '(none)', 'sour...",1051193346,ANDROID,
49996,20180713,1531543189743008,user_engagement,"[{'key': 'firebase_screen_class', 'value': {'s...",1.531543e+15,,14,309687,,F76E97B52858963AC10CEB05BDC185C6,"[{'key': 'first_open_time', 'value': {'string_...",1508569714315000,,"{'category': 'mobile', 'mobile_brand_name': 'n...","{'continent': 'Americas', 'country': 'Canada',...","{'id': 'com.labpixies.flood', 'version': '2.62...","{'name': '(direct)', 'medium': '(none)', 'sour...",1051193346,ANDROID,
49997,20180713,1531543221146000,user_engagement,"[{'key': 'firebase_screen_class', 'value': {'s...",1.531543e+15,,15,669211,,F76E97B52858963AC10CEB05BDC185C6,"[{'key': 'first_open_time', 'value': {'string_...",1508569714315000,,"{'category': 'mobile', 'mobile_brand_name': 'n...","{'continent': 'Americas', 'country': 'Canada',...","{'id': 'com.labpixies.flood', 'version': '2.62...","{'name': '(direct)', 'medium': '(none)', 'sour...",1051193346,ANDROID,
49998,20180713,1531542882131001,screen_view,"[{'key': 'firebase_previous_id', 'value': {'st...",1.531543e+15,,10,367204,,F76E97B52858963AC10CEB05BDC185C6,"[{'key': 'first_open_time', 'value': {'string_...",1508569714315000,,"{'category': 'mobile', 'mobile_brand_name': 'n...","{'continent': 'Americas', 'country': 'Canada',...","{'id': 'com.labpixies.flood', 'version': '2.62...","{'name': '(direct)', 'medium': '(none)', 'sour...",1051193346,ANDROID,


#@bigquery

SELECT 
    *
FROM
  `firebase-public-project.analytics_153293282.events_*`
    
LIMIT 100

It may be helpful to take a look at the overall schema used in Google Analytics 4. As mentioned earlier, Google Analytics 4 uses an event-based measurement model, and each row in this dataset is an event. [Click here](https://support.google.com/analytics/answer/7029846) to view the complete schema and details about each column. As the output above shows, certain columns are nested records that contain detailed information:



* `app_info`
* `device`
* `ecommerce`
* `event_params`
* `geo`
* `traffic_source`
* `user_properties`
* `items`*
* `web_info`*

_* present by default in GA4 datasets_
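
To make the nested format more concrete, the following is a minimal Python sketch of how a repeated key/value record such as `event_params` could be flattened into a plain dictionary. The helper name `flatten_params` and the sample event are hypothetical and only for illustration; the typed value fields (`string_value`, `int_value`, `float_value`, `double_value`) follow the export schema linked above.

```python
# Hypothetical helper: flatten a GA4-style repeated key/value record.
VALUE_FIELDS = ("string_value", "int_value", "float_value", "double_value")


def flatten_params(event_params):
    """Return {key: value}, taking the first typed field that is not None."""
    flat = {}
    for param in event_params:
        typed_value = param["value"]
        for field in VALUE_FIELDS:
            if typed_value.get(field) is not None:
                flat[param["key"]] = typed_value[field]
                break
    return flat


# Illustrative event (values invented for this example, not taken from the dataset).
sample_event = {
    "event_name": "post_score",
    "event_params": [
        {"key": "score", "value": {"int_value": 42}},
        {"key": "firebase_screen_class", "value": {"string_value": "GameScreen"}},
    ],
}

flat_params = flatten_params(sample_event["event_params"])
print(flat_params)
```

In BigQuery itself, the same idea is usually expressed with `UNNEST` on the repeated column; the sketch only illustrates the shape of the data.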

As we can see below, there are 15K users and 5.7M events in this dataset:

#@bigquery
SELECT 
    COUNT(DISTINCT user_pseudo_id) as count_distinct_users,
    COUNT(event_timestamp) as count_events
FROM
  `firebase-public-project.analytics_153293282.events_*`

### Preparing the training data

You cannot simply use raw event data to train a machine learning model as it would not be in the right shape and format to use as training data. So in this section, you will learn how to pre-process the raw data into an appropriate format to use as training data for classification models.


To predict which user is going to _churn_ or _return_, the ideal training data format for classification should look like the following:  


|User ID|User demographic data|User behavioral data|Churned|
|-|-|-|-|
|User1|(e.g., country, device_type)|(e.g., # of times they did something within a time period)|1|
|User2|(e.g., country, device_type)|(e.g., # of times they did something within a time period)|0|
|User3|(e.g., country, device_type)|(e.g., # of times they did something within a time period)|1|


Characteristics of the training data:
- each row is a separate unique user ID
- feature(s) for **demographic data**
- feature(s) for **behavioral data**
- the actual **label** that you want to train the model to predict (e.g., 1 = churned, 0 = returned)

You can train a model with only demographic data or behavioral data, but having a combination of both will likely help you create a more predictive model. For this reason, in this section, you will learn how to pre-process the raw data to follow this training data format.

The following sections will walk you through preparing the demographic data, behavioral data, and the label before joining them all together as the training data.

1. Identifying the label for each user (churned or returned)
1. Extracting demographic data for each user
1. Extracting behavioral data for each user
1. Combining the label, demographic and behavioral data together as training data

#### Step 1: Identifying the label for each user

The raw dataset doesn't have a feature that simply identifies users as "churned" or "returned", so in this section, you will need to create this label based on some of the existing columns.

There are many ways to define user churn, but for the purposes of this notebook, you will use 1-day churn: a user has churned if they do not come back and use the app again more than 24 hours after their first engagement.

In other words, after 24 hr of a user's first engagement with the app:
- if the user _shows no event data thereafter_, the user is considered **churned**. 
- if the user _does have at least one event datapoint thereafter_, then the user is considered **returned**

You may also want to remove users who were unlikely to have ever returned anyway after spending just a few minutes with the app, which is sometimes referred to as "bouncing". For example, you can build the model only on users who spent at least 10 minutes with the app (users who didn't bounce).

So your updated definition of a **churned user** for this notebook is:
> "any user who spent at least 10 minutes on the app, but never used the app again after 24 hours from when they first engaged with the app"


In SQL, since the raw data contains all of the events for every user, from their first touch (app installation) to their last touch, you can use this information to create two columns: `churned` and `bounced`.


Take a look at the following SQL query and the results:

#@bigquery
CREATE OR REPLACE VIEW bqmlga4.returningusers AS (
  WITH firstlasttouch AS (
    SELECT
      user_pseudo_id,
      MIN(event_timestamp) AS user_first_engagement,
      MAX(event_timestamp) AS user_last_engagement
    FROM
      `firebase-public-project.analytics_153293282.events_*`
    WHERE event_name="user_engagement"
    GROUP BY
      user_pseudo_id

  )
  SELECT
    user_pseudo_id,
    user_first_engagement,
    user_last_engagement,
    EXTRACT(MONTH from TIMESTAMP_MICROS(user_first_engagement)) as month,
    EXTRACT(DAYOFYEAR from TIMESTAMP_MICROS(user_first_engagement)) as julianday,
    EXTRACT(DAYOFWEEK from TIMESTAMP_MICROS(user_first_engagement)) as dayofweek,

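    -- 86,400,000,000 microseconds = 24 hours
    (user_first_engagement + 86400000000) AS ts_24hr_after_first_engagement,

    -- churned = 1 if the last engagement falls within 24 hours of the first
    IF (user_last_engagement < (user_first_engagement + 86400000000),
        1,
        0 ) AS churned,

    -- bounced = 1 if the last engagement falls within 10 minutes
    -- (600,000,000 microseconds) of the first
    IF (user_last_engagement <= (user_first_engagement + 600000000),
        1,
        0 ) AS bounced,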
  FROM
    firstlasttouch
  GROUP BY
    1,2,3
    );

SELECT 
  * 
FROM 
  bqmlga4.returningusers 
LIMIT 100;

#@bigquery
-- Note: bqmlga4.train is created in Step 4; run this cell after that view exists.
SELECT 
  COUNT(*) 
FROM 
  bqmlga4.train 


For the `churned` column, `churned=0` if the user performs an action more than 24 hours after their first touch; if their last action occurred within the first 24 hours, then `churned=1`.


For the `bounced` column, `bounced=1` if the user's last action was within the first ten minutes since their first touch with the app, otherwise `bounced=0`. We can use this column to filter our training data later on, by conditionally querying for users where `bounced = 0`.
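
The two rules above can be sketched in plain Python as follows. This is only an illustration of the logic in the view, not a replacement for it: the function name `label_user` is hypothetical, and timestamps are assumed to be integers in microseconds, as in `event_timestamp`.

```python
# Window sizes in microseconds, matching the constants used in the SQL view.
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000   # 86,400,000,000
MICROS_PER_10_MIN = 10 * 60 * 1_000_000     # 600,000,000


def label_user(first_engagement, last_engagement):
    """Return the churned/bounced flags for one user (hypothetical helper)."""
    churned = 1 if last_engagement < first_engagement + MICROS_PER_DAY else 0
    bounced = 1 if last_engagement <= first_engagement + MICROS_PER_10_MIN else 0
    return {"churned": churned, "bounced": bounced}


# A user whose last event came 5 minutes after the first: bounced and churned.
print(label_user(0, 5 * 60 * 1_000_000))
# A user whose last event came 2 hours after the first: churned, not bounced.
print(label_user(0, 2 * 60 * 60 * 1_000_000))
# A user whose last event came 3 days after the first: returned.
print(label_user(0, 3 * MICROS_PER_DAY))
```

Note that every bounced user is also flagged as churned, which is why the training data later filters on `bounced = 0` rather than relying on `churned` alone.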

You might wonder how many of these 15k users bounced and returned? You can run the following query to check:

#@bigquery
SELECT
    bounced,
    churned, 
    COUNT(churned) as count_users
FROM
    bqmlga4.returningusers
GROUP BY 1,2
ORDER BY bounced

For the training data, you will only end up using data where `bounced = 0`. Based on the 15k users, you can see that 5,557 (\~41%) users bounced within the first ten minutes of their first engagement with the app, but of the remaining 8,031 users, 1,883 users (\~23%) churned after 24 hours.
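
As a quick check of the percentages quoted above, the arithmetic can be reproduced in a few lines of Python. The three counts are the ones stated in the text; the variable names are illustrative only.

```python
# Counts quoted in the text above.
bounced_users = 5557
non_bounced_users = 8031
churned_non_bounced_users = 1883

# Users covered by the view (those with at least one user_engagement event).
total_users = bounced_users + non_bounced_users

bounce_rate = bounced_users / total_users
churn_rate_non_bounced = churned_non_bounced_users / non_bounced_users

print(f"users in view: {total_users}")
print(f"bounce rate: {bounce_rate:.1%}")
print(f"churn rate among non-bounced users: {churn_rate_non_bounced:.1%}")
```

The two groups add up to 13,588 users, somewhat fewer than the roughly 15k distinct users counted earlier, presumably because the view only includes users who logged a `user_engagement` event.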

#@bigquery
SELECT
    COUNTIF(churned=1)/COUNT(churned) as churn_rate
FROM
    bqmlga4.returningusers
WHERE bounced = 0

#### Step 2. Extracting demographic data for each user

This section focuses on extracting the demographic information for each user. Several kinds of demographic information about the user are already available in the dataset, including `app_info`, `device`, `ecommerce`, `event_params`, and `geo`. Demographic features can help the model learn whether users on certain devices or in certain countries are more likely to churn.

For this notebook, you can start just with `geo.country`, `device.operating_system`, and `device.language`. If you are using your own dataset and have joinable first-party data, this section is a good opportunity to add any additional attributes for each user that may not be readily available in Google Analytics 4.

Note that a user's demographics may occasionally change (e.g. moving from one country to another). For simplicity, you will just use the demographic information that Google Analytics 4 provides when the user LAST engaged with the app as indicated by `MAX(event_timestamp)`. This enables every unique user to be represented by a single row.

#@bigquery
CREATE OR REPLACE VIEW bqmlga4.user_demographics AS (

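  -- Note: despite its name, this CTE ranks each user's rows newest-first,
  -- so row_num = 1 is the row from the user's LAST engagement.
  WITH first_values AS (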
      SELECT
          user_pseudo_id,
          geo.country as country,
          device.operating_system as operating_system,
          device.language as language,
          ROW_NUMBER() OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp DESC) AS row_num
      FROM `firebase-public-project.analytics_153293282.events_*`
      WHERE event_name="user_engagement"
      )
  SELECT * EXCEPT (row_num)
  FROM first_values
  WHERE row_num = 1
  );

SELECT
  *
FROM
  bqmlga4.user_demographics
LIMIT 10
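
The "latest row per user" selection in the view above can be mirrored in plain Python, which may help when checking the logic on a small sample. This is a sketch under assumptions: the function name `latest_demographics` and the sample rows are hypothetical, and each event is assumed to be a dictionary that already holds the flattened `country`, `operating_system`, and `language` values.

```python
def latest_demographics(events):
    """Keep the demographic fields from each user's most recent event."""
    latest = {}
    for event in events:
        user_id = event["user_pseudo_id"]
        current = latest.get(user_id)
        if current is None or event["event_timestamp"] > current["event_timestamp"]:
            latest[user_id] = event
    return {
        user_id: {
            "country": event["country"],
            "operating_system": event["operating_system"],
            "language": event["language"],
        }
        for user_id, event in latest.items()
    }


# Invented sample rows: user_a moves from Canada to Japan between events.
sample_events = [
    {"user_pseudo_id": "user_a", "event_timestamp": 100,
     "country": "Canada", "operating_system": "IOS", "language": "en-ca"},
    {"user_pseudo_id": "user_a", "event_timestamp": 300,
     "country": "Japan", "operating_system": "IOS", "language": "en-ca"},
    {"user_pseudo_id": "user_b", "event_timestamp": 200,
     "country": "Sweden", "operating_system": "ANDROID", "language": "sv-se"},
]

demographics = latest_demographics(sample_events)
print(demographics["user_a"]["country"])
```

Each user ends up with exactly one row, which is the property the training data needs.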

#### Step 3. Extracting behavioral data for each user

Behavioral data in the raw event data spans across multiple events -- and thus rows -- per user. The goal of this section is to aggregate and extract behavioral data for each user, resulting in one row of behavioral data per unique user.

But what kind of behavioral data will you need to prepare? Since the end goal of this notebook is to predict, based on a user's activity within the first 24 hrs since app installation, whether that user will churn or return thereafter, you will want to use behavioral data from the first 24 hrs in your training data. You can also extract some extra time-related features from `user_first_engagement`, such as the month or day of the first engagement.
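
As an illustration of those time-related features, the following Python sketch derives the month, day of year, and day of week from a microsecond timestamp in the same way the `EXTRACT` calls in the `returningusers` view do. The function name `time_features` is hypothetical, and the sketch assumes UTC. The input value is one of the `user_first_engagement` timestamps that appears in the training-data output of this notebook, where it is listed with month 6, day 172, and day of week 5.

```python
from datetime import datetime, timezone


def time_features(first_engagement_micros):
    """Return (month, day_of_year, day_of_week) for a microsecond UTC timestamp."""
    moment = datetime.fromtimestamp(
        first_engagement_micros // 1_000_000, tz=timezone.utc
    )
    month = moment.month
    day_of_year = moment.timetuple().tm_yday
    # BigQuery's DAYOFWEEK runs from 1 (Sunday) to 7 (Saturday), while
    # isoweekday() runs from 1 (Monday) to 7 (Sunday), so shift by one.
    day_of_week = moment.isoweekday() % 7 + 1
    return month, day_of_year, day_of_week


print(time_features(1529624025404004))  # (6, 172, 5)
```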

Google Analytics automatically collects [certain events](https://support.google.com/analytics/answer/6317485) that you can use to analyze behavior. In addition, there are certain recommended [events for games](https://support.google.com/analytics/answer/6317494). 


As a first step, you can explore all the unique events that exist in this dataset, based on `event_name`:

#@bigquery
SELECT
    event_name,
    COUNT(event_name) as event_count
FROM
    `firebase-public-project.analytics_153293282.events_*`
GROUP BY 1
ORDER BY
   event_count DESC

For this notebook, to predict whether a user will churn or return, you can start by counting the number of times a user engages in the following event types:

* `user_engagement`
* `level_start_quickplay`
* `level_end_quickplay`
* `level_complete_quickplay`
* `level_reset_quickplay`
* `post_score`
* `spend_virtual_currency`
* `ad_reward`
* `challenge_a_friend`
* `completed_5_levels`
* `use_extra_steps`


In SQL, you can aggregate the behavioral data by counting, per user, how many times each of the above `event_name` values occurs within the first 24 hours.

If you are using your own dataset, you may have different event types that you can aggregate and extract. Your app may be sending very different `event_name` values to Google Analytics, so be sure to use the events most suitable to your scenario.


In [1]:
%%bigquery --project vertex-ai-dev

CREATE OR REPLACE VIEW bqmlga4.user_aggregate_behavior AS (
WITH
  events_first24hr AS (
    #select user data only from first 24 hr of using the app
    SELECT
      e.*
    FROM
      `firebase-public-project.analytics_153293282.events_*` e
    JOIN
      bqmlga4.returningusers r
    ON
      e.user_pseudo_id = r.user_pseudo_id
    WHERE
      e.event_timestamp <= r.ts_24hr_after_first_engagement
    )
SELECT
  user_pseudo_id,
  SUM(IF(event_name = 'user_engagement', 1, 0)) AS cnt_user_engagement,
  SUM(IF(event_name = 'level_start_quickplay', 1, 0)) AS cnt_level_start_quickplay,
  SUM(IF(event_name = 'level_end_quickplay', 1, 0)) AS cnt_level_end_quickplay,
  SUM(IF(event_name = 'level_complete_quickplay', 1, 0)) AS cnt_level_complete_quickplay,
  SUM(IF(event_name = 'level_reset_quickplay', 1, 0)) AS cnt_level_reset_quickplay,
  SUM(IF(event_name = 'post_score', 1, 0)) AS cnt_post_score,
  SUM(IF(event_name = 'spend_virtual_currency', 1, 0)) AS cnt_spend_virtual_currency,
  SUM(IF(event_name = 'ad_reward', 1, 0)) AS cnt_ad_reward,
  SUM(IF(event_name = 'challenge_a_friend', 1, 0)) AS cnt_challenge_a_friend,
  SUM(IF(event_name = 'completed_5_levels', 1, 0)) AS cnt_completed_5_levels,
  SUM(IF(event_name = 'use_extra_steps', 1, 0)) AS cnt_use_extra_steps,
FROM
  events_first24hr
GROUP BY
  1
  );

SELECT
  *
FROM
  bqmlga4.user_aggregate_behavior
LIMIT 10

Query complete after 0.01s: 100%|██████████| 1/1 [00:00<00:00, 765.38query/s] 
Downloading: 100%|██████████| 10/10 [00:01<00:00,  6.87rows/s]


Unnamed: 0,user_pseudo_id,cnt_user_engagement,cnt_level_start_quickplay,cnt_level_end_quickplay,cnt_level_complete_quickplay,cnt_level_reset_quickplay,cnt_post_score,cnt_spend_virtual_currency,cnt_ad_reward,cnt_challenge_a_friend,cnt_completed_5_levels,cnt_use_extra_steps
0,C6B7ED4E2FA4AABE5693A61B9A07753C,153,3,2,1,0,27,2,0,0,1,2
1,6D4F9673F3DADCB39EE913EFCA2AE282,15,5,3,2,0,2,0,0,0,0,0
2,76DC62AB78716962E04332447C9FFEA2,49,17,16,13,1,13,0,0,0,0,0
3,CFE07D48CBBF9959BC8BA87C95F2B602,1,1,0,0,0,0,0,0,0,0,0
4,055A7B424EBE798EB0318F3DBA55418C,24,9,5,4,2,4,0,0,0,0,0
5,E8755950FCC0CDF1F62E9024153DD624,12,0,0,0,0,0,0,0,0,0,0
6,8CA1AEE6A223D37BBCD0DB7A2CE8C7B7,2,0,0,0,0,0,0,0,0,0,0
7,094EC71AC16334EFB5A3E0A98074C944,4,1,1,0,0,0,0,0,0,0,0
8,DB8FF95DA0346459D3EF59D5142D9ECA,6,0,0,0,0,0,0,0,0,0,0
9,2AA7CE8A30BD25743594101C76B46A6B,9,0,0,0,0,2,0,0,0,0,0
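
The aggregation performed by the view can also be sketched in plain Python. This is a simplified illustration, not the notebook's implementation: the function name `aggregate_behavior` and the sample events are hypothetical, and only three of the event names are tracked to keep the example short.

```python
from collections import Counter, defaultdict

# A subset of the event names used in the view, kept short for the example.
TRACKED_EVENTS = ("user_engagement", "level_start_quickplay", "post_score")


def aggregate_behavior(events, cutoff_by_user):
    """Count tracked events per user up to each user's 24-hour cutoff.

    events: iterable of (user_pseudo_id, event_name, event_timestamp) tuples.
    cutoff_by_user: {user_pseudo_id: ts_24hr_after_first_engagement}.
    """
    counts = defaultdict(Counter)
    for user_id, event_name, timestamp in events:
        cutoff = cutoff_by_user.get(user_id)
        # Inner-join semantics: skip users without a cutoff and events after it.
        if cutoff is None or timestamp > cutoff:
            continue
        counts[user_id][event_name] += 1
    return {
        user_id: {f"cnt_{name}": counter[name] for name in TRACKED_EVENTS}
        for user_id, counter in counts.items()
    }


DAY = 86_400_000_000  # 24 hours in microseconds
sample_events = [
    ("user_a", "user_engagement", 1_000),
    ("user_a", "post_score", 2_000),
    ("user_a", "post_score", DAY + 5_000),  # after the cutoff, ignored
    ("user_b", "level_start_quickplay", 500),
]
cutoffs = {"user_a": 1_000 + DAY, "user_b": 500 + DAY}

behavior = aggregate_behavior(sample_events, cutoffs)
print(behavior)
```

As in the view, each user gets one row, and event types the user never triggered show up as a count of 0.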


#### Step 4: Combining the label, demographic and behavioral data together as training data

In this section, you can now combine these three intermediary views (label, demographic, and behavioral data) into the final training data. Here you can also specify `bounced = 0`, in order to limit the training data only to users who did not "bounce" within the first 10 minutes of using the app.


In [2]:
%%bigquery --project vertex-ai-dev

CREATE OR REPLACE VIEW bqmlga4.train AS (
    
  SELECT
    dem.*,
    IFNULL(beh.cnt_user_engagement, 0) AS cnt_user_engagement,
    IFNULL(beh.cnt_level_start_quickplay, 0) AS cnt_level_start_quickplay,
    IFNULL(beh.cnt_level_end_quickplay, 0) AS cnt_level_end_quickplay,
    IFNULL(beh.cnt_level_complete_quickplay, 0) AS cnt_level_complete_quickplay,
    IFNULL(beh.cnt_level_reset_quickplay, 0) AS cnt_level_reset_quickplay,
    IFNULL(beh.cnt_post_score, 0) AS cnt_post_score,
    IFNULL(beh.cnt_spend_virtual_currency, 0) AS cnt_spend_virtual_currency,
    IFNULL(beh.cnt_ad_reward, 0) AS cnt_ad_reward,
    IFNULL(beh.cnt_challenge_a_friend, 0) AS cnt_challenge_a_friend,
    IFNULL(beh.cnt_completed_5_levels, 0) AS cnt_completed_5_levels,
    IFNULL(beh.cnt_use_extra_steps, 0) AS cnt_use_extra_steps,
    ret.user_first_engagement,
    ret.month,
    ret.julianday,
    ret.dayofweek,
    ret.churned
  FROM
    bqmlga4.returningusers ret
  LEFT OUTER JOIN
    bqmlga4.user_demographics dem
  ON 
    ret.user_pseudo_id = dem.user_pseudo_id
  LEFT OUTER JOIN 
    bqmlga4.user_aggregate_behavior beh
  ON
    ret.user_pseudo_id = beh.user_pseudo_id
  WHERE ret.bounced = 0
  );

SELECT
  *
FROM
  bqmlga4.train
LIMIT 10

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 798.15query/s] 
Downloading: 100%|██████████| 10/10 [00:01<00:00,  7.04rows/s]


Unnamed: 0,user_pseudo_id,country,operating_system,language,cnt_user_engagement,cnt_level_start_quickplay,cnt_level_end_quickplay,cnt_level_complete_quickplay,cnt_level_reset_quickplay,cnt_post_score,cnt_spend_virtual_currency,cnt_ad_reward,cnt_challenge_a_friend,cnt_completed_5_levels,cnt_use_extra_steps,user_first_engagement,month,julianday,dayofweek,churned
0,CF2898B41B7243671C36D5168D9D89B0,United States,ANDROID,en-us,9,2,2,2,0,2,0,0,0,0,0,1529624025404004,6,172,5,0
1,17636078D57884AD7EA5406C60E2BD10,United States,ANDROID,en-us,169,54,51,19,1,19,0,0,0,0,0,1533397230425001,8,216,7,0
2,8EB99B177B914F63CE215683892205AD,Lebanon,ANDROID,en-gb,1,0,0,0,0,0,0,0,0,0,0,1533314555790000,8,215,6,0
3,5BFDC0426DE9C09CA5FF18E4982F2506,India,ANDROID,en-au,15,0,0,0,0,1,0,0,0,0,0,1533620440108003,8,219,3,0
4,88A41BFED275BB69125BE1F5524F3B42,United States,ANDROID,en-us,5,2,1,0,0,1,0,0,0,0,0,1529325237353004,6,169,2,0
5,920DB84FCC0F4421650B9E257E33180B,Sweden,ANDROID,sv-se,7,3,2,2,0,2,1,0,0,0,1,1528957982564003,6,165,5,0
6,84D185835F0DE9C48712855B8713996E,United States,ANDROID,en-us,11,2,1,0,0,0,0,0,2,0,0,1529579468237008,6,172,5,0
7,C753C435C40D421DB0CC1C7AC6D3356D,Australia,ANDROID,en-au,42,19,9,1,10,1,2,1,0,0,2,1531401977504002,7,193,5,0
8,4789C778386485B99F4077B97DFE34E3,United States,ANDROID,en-us,192,281,46,33,234,33,0,0,0,0,0,1528931183521002,6,164,4,0
9,E512079B53179DF9A608CF4ADE47DE9D,Australia,ANDROID,en-au,5,2,1,0,0,1,0,0,0,0,0,1529351778192007,6,169,2,0
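
The join logic in the `train` view can be summarized with a small Python sketch: start from the labeled users, keep only those with `bounced = 0`, attach demographics where available, and default missing counts to 0 (the role `IFNULL` plays in the SQL). The function name `build_training_rows`, the reduced column list, and the sample inputs are hypothetical.

```python
# Reduced set of count columns, kept short for the example.
COUNT_COLUMNS = ("cnt_user_engagement", "cnt_post_score")


def build_training_rows(labels, demographics, behavior):
    """Left-join demographics and behavior onto non-bounced labeled users."""
    rows = []
    for user_id, label in labels.items():
        if label["bounced"] != 0:
            continue  # mirrors WHERE ret.bounced = 0
        demo = demographics.get(user_id, {})
        counts = behavior.get(user_id, {})
        row = {
            "user_pseudo_id": user_id,
            "country": demo.get("country"),
            "operating_system": demo.get("operating_system"),
            "language": demo.get("language"),
        }
        for column in COUNT_COLUMNS:
            row[column] = counts.get(column, 0)  # mirrors IFNULL(..., 0)
        row["churned"] = label["churned"]
        rows.append(row)
    return rows


# Invented sample inputs.
labels = {
    "user_a": {"churned": 0, "bounced": 0},
    "user_b": {"churned": 1, "bounced": 1},  # bounced, so excluded
    "user_c": {"churned": 1, "bounced": 0},  # no behavior row
}
demographics = {
    "user_a": {"country": "Japan", "operating_system": "IOS", "language": "ja-jp"},
    "user_c": {"country": "Canada", "operating_system": "ANDROID", "language": "en-ca"},
}
behavior = {"user_a": {"cnt_user_engagement": 7, "cnt_post_score": 2}}

training_rows = build_training_rows(labels, demographics, behavior)
for row in training_rows:
    print(row)
```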


In [13]:
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the destination table,
# in the form "your-project.your_dataset.your_table_name".
table_id = "vertex-ai-dev.bqmlga4.churn_prediction_gaming_training"

job_config = bigquery.QueryJobConfig(destination=table_id)

sql = """
    SELECT * FROM bqmlga4.train
"""

# Start the query, passing in the extra configuration.
query_job = client.query(sql, job_config=job_config)  # Make an API request.
query_job.result()  # Wait for the job to complete.

print("Query results loaded to the table {}".format(table_id))

Query results loaded to the table vertex-ai-dev.bqmlga4.churn_prediction_gaming_training


#@bigquery
SELECT * FROM bqmlga4.train

In [48]:
# This cell only needs to run once per session.
# Comment it out afterwards to speed things up.
from google.cloud.bigquery import Client

client = Client()

query = """SELECT * FROM bqmlga4.train"""
job = client.query(query)
df = job.to_dataframe()

In [49]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
df1=df
labels = df1.pop("churned").tolist()
data = df1.values.tolist()
x_train, x_test, y_train, y_test = train_test_split(data, labels)
print(x_train[0:10])
print('\n\n\n\n')
print(y_train[0:10])
skmodel = LogisticRegression()
skmodel.fit(x_train,y_train)
score = skmodel.score(x_test,y_test)
print('accuracy is:',score)

'''metrics.log_metric("accuracy",(score * 100.0))
metrics.log_metric("framework", "Scikit Learn")
metrics.log_metric("dataset_size", len(df))
dump(skmodel, model.path + ".joblib")'''

[['ACAB78964E0D31C1C54BF9454BA84D1F', 'Japan', 'IOS', 'en-jp', 3, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1529962274994011, 6, 176, 2], ['702D18247CCF0214BF213FEA84BBC6D5', 'Japan', 'IOS', 'ja-jp', 7, 2, 2, 2, 0, 2, 0, 0, 0, 0, 0, 1531983759624003, 7, 200, 5], ['42CAF88542680DC1A0671B91943E6089', 'Denmark', 'IOS', 'da-dk', 38, 0, 0, 0, 0, 12, 0, 0, 0, 1, 0, 1536609303838015, 9, 253, 2], ['390E57F874E1FD656FCCBCEDE8F90604', 'Canada', 'ANDROID', 'en-ca', 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1532122274014000, 7, 201, 6], ['4C17F5B133B064508ED4E3E1A0C30D3E', 'United States', 'IOS', 'en-us', 13, 6, 6, 4, 0, 4, 0, 0, 0, 0, 0, 1529029640516003, 6, 166, 6], ['8DBD6ADC48F015EF2088ECB83F7CD107', 'Australia', 'IOS', 'en-au', 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1530589438735000, 7, 184, 3], ['EC3BAA23738340386BBAD8114221E1D7', 'Malaysia', 'ANDROID', 'zh-cn', 5, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1533525500693003, 8, 218, 2], ['5DB8D4CD1C161D01EB4F9174F6316A82', 'United Kingdom', 'IOS', 'en-gb', 4, 1, 1, 0, 0, 0, 0, 0, 

ValueError: could not convert string to float: 'ACAB78964E0D31C1C54BF9454BA84D1F'
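
The `ValueError` above is raised because `LogisticRegression` only accepts numeric features, while the rows printed above still contain a user ID and string-valued columns such as country, platform, and language. One possible fix, sketched below on a small made-up frame, is to drop the ID and one-hot encode the remaining string columns before fitting. The column names, the toy data, and the helper `train_churn_classifier` are assumptions for illustration only; they are not taken from `bqmlga4.train`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the BigQuery result; names and values are made up.
toy = pd.DataFrame({
    "user_pseudo_id": [f"USER{i:03d}" for i in range(40)],
    "country": ["Japan", "Canada", "Denmark", "Australia"] * 10,
    "operating_system": ["IOS", "ANDROID"] * 20,
    "cnt_user_engagement": [i % 7 for i in range(40)],
    "churned": [(i // 3) % 2 for i in range(40)],
})


def train_churn_classifier(frame, label="churned", id_column="user_pseudo_id"):
    """Fit a logistic regression after encoding the non-numeric features."""
    # The ID is unique per user and carries no signal, so it is dropped.
    features = frame.drop(columns=[label, id_column])
    labels = frame[label]

    categorical = features.select_dtypes(exclude="number").columns.tolist()
    numeric = features.select_dtypes(include="number").columns.tolist()

    preprocess = ColumnTransformer([
        # Categories unseen during fit are encoded as all zeros at predict time.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("scale", StandardScaler(), numeric),
    ])
    model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, random_state=0, stratify=labels
    )
    model.fit(x_train, y_train)
    return model, model.score(x_test, y_test)


model, accuracy = train_churn_classifier(toy)
print("accuracy is:", accuracy)
```

Keeping the encoder inside the pipeline means the same transformation is applied at prediction time, which matters if the model is later served from an endpoint.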

In [14]:
@component(
    packages_to_install=["google-cloud-bigquery", "pandas", "pyarrow"],
    base_image="python:3.9",
    output_component_file="create_dataset.yaml"
)
def get_dataframe(
    bq_table: str,
    output_data_path: OutputPath("Dataset")
):
    from google.cloud import bigquery
    import pandas as pd

    bqclient = bigquery.Client()
    table = bigquery.TableReference.from_string(
        bq_table
    )
    rows = bqclient.list_rows(
        table
    )
    dataframe = rows.to_dataframe(
        create_bqstorage_client=True,
    )
    dataframe = dataframe.sample(frac=1, random_state=2)
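    # Omit the index so that the shuffled row numbers do not become a feature
    # when the training component reads the CSV back.
    dataframe.to_csv(output_data_path, index=False)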

In [30]:
@component(
    packages_to_install=["scikit-learn", "pandas", "joblib"],
    base_image="python:3.9",
    output_component_file="beans_model_component.yaml",
)
def sklearn_train(
    dataset: Input[Dataset],
    metrics: Output[Metrics],
    model: Output[Model]
):
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import roc_curve
    from sklearn.model_selection import train_test_split
    from joblib import dump

    import pandas as pd
    df = pd.read_csv(dataset.path)
    labels = df.pop("churned").tolist()
    data = df.values.tolist()
    x_train, x_test, y_train, y_test = train_test_split(data, labels)

    skmodel = DecisionTreeClassifier()
    skmodel.fit(x_train,y_train)
    score = skmodel.score(x_test,y_test)
    print('accuracy is:',score)

    metrics.log_metric("accuracy",(score * 100.0))
    metrics.log_metric("framework", "Scikit Learn")
    metrics.log_metric("dataset_size", len(df))
    dump(skmodel, model.path + ".joblib")

In [16]:
@component(
    packages_to_install=["google-cloud-aiplatform", "joblib", "scikit-learn"],
    base_image="python:3.9",
    output_component_file="beans_deploy_component.yaml",
)
def deploy_model(
    model: Input[Model],
    project: str,
    region: str,
    vertex_endpoint: Output[Artifact],
    vertex_model: Output[Model]
):
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=region)

    deployed_model = aiplatform.Model.upload(
        display_name="beans-model-pipeline",
        artifact_uri = model.uri.replace("model", ""),
        serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-24:latest"
    )
    endpoint = deployed_model.deploy(machine_type="n1-standard-4")

    # Save data to the output params
    vertex_endpoint.uri = endpoint.resource_name
    vertex_model.uri = deployed_model.resource_name

In [31]:
@dsl.pipeline(
    # Default pipeline root. You can override it when submitting the pipeline.
    pipeline_root=PIPELINE_ROOT,
    # A name for the pipeline.
    name="mlmd-pipeline",
)
def pipeline(
    bq_table: str = "",
    output_data_path: str = "data.csv",
    project: str = PROJECT_ID,
    region: str = REGION
):
    dataset_task = get_dataframe(bq_table=bq_table)

    model_task = sklearn_train(
        dataset=dataset_task.output
    )

    deploy_task = deploy_model(
        model=model_task.outputs["model"],
        project=project,
        region=region
    )

In [32]:
compiler.Compiler().compile(
    pipeline_func=pipeline, package_path="mlmd_pipeline.json"
)

In [33]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

In [34]:
run = pipeline_jobs.PipelineJob(
    display_name="mlmd-pipeline-2",
    template_path="mlmd_pipeline.json",
    job_id="mlmd-pipeline-small-{0}".format(TIMESTAMP),
    parameter_values={"bq_table": "vertex-ai-dev.bqmlga4.churn_prediction_gaming_training"},
    enable_caching=True,
)

In [35]:
run.run()

INFO:google.cloud.aiplatform.pipeline_jobs:Creating PipelineJob
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob created. Resource name: projects/931647533046/locations/us-central1/pipelineJobs/mlmd-pipeline-small-20210927095413
INFO:google.cloud.aiplatform.pipeline_jobs:To use this PipelineJob in another session:
INFO:google.cloud.aiplatform.pipeline_jobs:pipeline_job = aiplatform.PipelineJob.get('projects/931647533046/locations/us-central1/pipelineJobs/mlmd-pipeline-small-20210927095413')
INFO:google.cloud.aiplatform.pipeline_jobs:View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/mlmd-pipeline-small-20210927095413?project=931647533046
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob projects/931647533046/locations/us-central1/pipelineJobs/mlmd-pipeline-small-20210927095413 current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob projects/931647533046/locations/us-central1/pi

RuntimeError: Job failed with:
code: 9
message: "The DAG failed because some tasks failed. The failed tasks are: [sklearn-train].; Job (project_id = vertex-ai-dev, job_id = 1382649066067853312) is failed due to the above error.; Failed to handle the job: {project_number = 931647533046, job_id = 1382649066067853312}"
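
The error message only names the failing task (`sklearn-train`), not the cause. One plausible cause, assumed here rather than confirmed by the log, is the same string-valued columns that triggered the `ValueError` in the local scikit-learn cell. A minimal pre-flight check like the one below lists the feature columns in a CSV that cannot be parsed as numbers. The helper `non_numeric_columns` and the sample CSV are hypothetical.

```python
import csv
import io


def non_numeric_columns(csv_text, label="churned"):
    """Return names of feature columns with at least one value float() rejects."""
    offenders = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name, value in row.items():
            if name == label or name in offenders:
                continue
            try:
                float(value)
            except (TypeError, ValueError):
                # TypeError covers short rows, where DictReader fills in None.
                offenders.append(name)
    return offenders


# Hypothetical sample in the shape of the training table.
sample = (
    "user_pseudo_id,country,cnt_user_engagement,churned\n"
    "ACAB,Japan,3,0\n"
    "702D,Canada,7,1\n"
)
print(non_numeric_columns(sample))  # → ['user_pseudo_id', 'country']
```

Running such a check on the CSV written by `get_dataframe` before fitting would surface the problem with a readable message instead of a failed pipeline task.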


In [50]:
%%bigquery --project vertex-ai-dev

CREATE OR REPLACE MODEL bqmlga4.churn_logreg

OPTIONS(
  MODEL_TYPE="LOGISTIC_REG",
  INPUT_LABEL_COLS=["churned"]
) AS

SELECT
  *
FROM
  bqmlga4.train

Query complete after 0.00s: 100%|██████████| 3/3 [00:00<00:00, 1279.79query/s]                        
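
Once the model exists, its quality can be checked with BigQuery ML's `ML.EVALUATE` function. The sketch below builds that query and runs it through the Python client. The helper names `evaluate_query` and `evaluate_model` are assumptions for illustration, and running `evaluate_model` requires credentials and access to the project that holds the model.

```python
def evaluate_query(model_name):
    """Build the ML.EVALUATE statement for a BigQuery ML model."""
    return f"SELECT * FROM ML.EVALUATE(MODEL `{model_name}`)"


def evaluate_model(model_name, project=None):
    """Run ML.EVALUATE and return the metrics as a pandas DataFrame."""
    # Imported lazily so the query builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    return client.query(evaluate_query(model_name)).to_dataframe()


print(evaluate_query("bqmlga4.churn_logreg"))
```

For a logistic regression model the result contains one row with metrics such as precision, recall, accuracy, F1 score, log loss, and ROC AUC.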
