## Goal & Problem Statement
The goal of this notebook is to build a propensity model that predicts user churn from GA4 data using Spark ML. It uses the publicly available "Flood It!" dataset. Based on users' demographics and their activity within the first 24 hours of app installation, we predict the propensity to churn (1) or not churn (0) using a gradient-boosted tree classifier.

## Steps
1. Prepare the training data using demographic, behavioral data, and the label (churn/not-churn)
2. Preprocess the raw events data to identify users and the label features.
3. Process demographic and behavioral features.
4. Train classification models using Spark ML
5. Evaluate classification models using Spark ML
6. Make predictions on which users will churn using Spark ML

## About Dataset
This notebook uses a public BigQuery dataset that contains raw event data from a real mobile gaming app called Flood It! (Android app, iOS app). The data schema originates from Google Analytics for Firebase but is the same schema as Google Analytics 4, so this notebook applies to use cases built on either Google Analytics for Firebase or Google Analytics 4 data.

Google Analytics 4 (GA4) uses an event-based measurement model. Events provide insight into what is happening in an app or on a website, such as user actions, system events, or errors. Every row in the dataset is an event, with various characteristics relevant to that event stored in a nested format within the row. While Google Analytics logs many types of events by default, developers can also define custom events that they wish to log.
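To make the nested format concrete, the sketch below models a single event as a plain Python dictionary and looks up one parameter by key. The top-level field names follow the GA4 export layout used later in this notebook (`event_name`, `event_timestamp`, `user_pseudo_id`, `geo`, `device`, `event_params`). The sample values, the parameter keys, and the helper `get_event_param` are made up for illustration and are not part of the dataset or of any library.

```python
# Illustrative only: a hand-written event shaped like a GA4 export row.
sample_event = {
    "event_name": "level_complete_quickplay",
    "event_timestamp": 1538524800000000,  # microseconds since epoch (made-up value)
    "user_pseudo_id": "user-001",         # made-up ID
    "geo": {"country": "Canada"},
    "device": {"operating_system": "ANDROID", "language": "en-ca"},
    # event_params is a repeated record: a list of key/value entries,
    # where each value has one typed slot filled in.
    "event_params": [
        {"key": "board", "value": {"string_value": "S", "int_value": None}},
        {"key": "value", "value": {"string_value": None, "int_value": 12}},
    ],
}


def get_event_param(event, key):
    """Return the first non-null typed value stored under `key`, or None."""
    for param in event["event_params"]:
        if param["key"] == key:
            for typed_value in param["value"].values():
                if typed_value is not None:
                    return typed_value
    return None


print(get_event_param(sample_event, "board"))  # S
print(get_event_param(sample_event, "value"))  # 12
```

In Spark, the same nesting shows up as struct and array columns, which is why fields are later addressed with dotted paths such as `geo.country`.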

ToDo: Dataproc Templates to get the public data into BigQuery (Refer - https://support.google.com/analytics/answer/9823238#zippy=%2Cin-this-article)

## Setup

In [None]:
# Install packages and dependencies
# !pip install google-cloud-bigquery

In [None]:
# Import required libraries
import json
import pprint
import subprocess

import google.auth
import google.auth.transport.requests
import requests

import pyspark
from pyspark import SparkConf
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as sql_func
from pyspark.sql.functions import countDistinct

from pyspark.ml import Pipeline
import pyspark.ml.classification as classification
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [None]:
# Get credentials to authenticate with Google APIs
credentials, project_id = google.auth.default()
auth_req = google.auth.transport.requests.Request()
credentials.refresh(auth_req)

In [None]:
# Create Spark Session
spark = SparkSession.builder \
    .appName("Customer Churn Prediction - Spark ML") \
    .enableHiveSupport() \
    .getOrCreate()

In [None]:
# Read data from bigquery
# Public bigquery dataset location: firebase-public-project.analytics_153293282.events_20181003
# Public metastore dataset location: dataproc-workspaces-notebooks.propensity_churn_dataset.churn_events

events_data_df = spark.read \
    .format("bigquery") \
    .load("firebase-public-project.analytics_153293282.events_20181003")
events_data_df.show(3)

## Exploring the dataset
It is always helpful to take a look at the overall schema of the data. Here, we look at the overall schema of Google Analytics 4 data which uses an event based measurement model with each row as an event.

In [None]:
# Print the overall schema used in Google Analytics 4.
# GA4 uses an event-based measurement model, so each row in this dataset is an event.
events_data_df.printSchema()

In [None]:
# total number of users
events_data_df.select(countDistinct("user_pseudo_id")).show()

In [None]:
# total number of events
events_data_df.agg({'event_timestamp':'count'}).show()

__Findings:__ Certain columns are nested records. There are about 4K users and 50K events.

## Prepare the training data using demographic, behavioral data, and the label (churn/not-churn)

To predict which user is going to churn or return, the ideal training data format for classification should look like the following:

| User ID | User demographic data | User behavioral data | Churned |
|---|---|---|---|
| User1 | (e.g., country, device_type) | (e.g., # of times they did something within a time period) | 1 |
| User2 | (e.g., country, device_type) | (e.g., # of times they did something within a time period) | 0 |
| User3 | (e.g., country, device_type) | (e.g., # of times they did something within a time period) | 1 |

Characteristics of the training data:

- each row is a separate unique user ID
- feature(s) for demographic data
- feature(s) for behavioral data
- the actual label that you want to train the model to predict (e.g., 1 = churned, 0 = returned)

You can train a model with only demographic data or behavioral data, but having a combination of both will likely help you create a more predictive model. For this reason, in this section, you will learn how to pre-process the raw data to follow this training data format.

The following sections will walk you through preparing the demographic data, behavioral data, and the label before joining them all together as the training data.

1. Identifying the label for each user (churned or returned)
2. Extracting demographic data for each user
3. Extracting behavioral data for each user
4. Combining the label, demographic and behavioral data together as training data

#### Step 1: Identify the label for each user (churned or returned)

Here we create the label for each user (churned or returned) from existing columns, because the raw dataset has no feature that directly identifies users as churned or returned. There are many ways to define user churn; here we predict 1-day churn, defined as users who do not come back and use the app again after 24 hours from their first engagement.

In other words, after 24 hr of a user's first engagement with the app:
- if the user shows no event data thereafter, the user is considered churned.
- if the user does have at least one event data point thereafter, the user is considered returned.

We also exclude users who were unlikely to return after spending only a few minutes with the app, which is sometimes referred to as "bouncing". A churned user is therefore any user who spent at least 10 minutes in the app but never used it again once 24 hours had passed since their first engagement.
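The labeling rule can be sketched in plain Python before it is expressed in SQL. The snippet mirrors the comparisons used in the query in this step, with timestamps in microseconds as in `event_timestamp`. The function name `label_user` and the sample timestamps are illustrative assumptions.

```python
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000  # 86,400,000,000 microseconds
MICROS_PER_10_MIN = 10 * 60 * 1_000_000    # 600,000,000 microseconds


def label_user(first_engagement, last_engagement):
    """Return (churned, bounced) flags from engagement timestamps in microseconds."""
    # Churned: no engagement at or after 24 hours from the first engagement.
    churned = 1 if last_engagement < first_engagement + MICROS_PER_DAY else 0
    # Bounced: all engagement happened within the first 10 minutes.
    bounced = 1 if last_engagement <= first_engagement + MICROS_PER_10_MIN else 0
    return churned, bounced


first = 1_538_524_800_000_000  # made-up first engagement timestamp

print(label_user(first, first + 5 * 60 * 1_000_000))       # (1, 1) left after 5 minutes
print(label_user(first, first + 3 * 60 * 60 * 1_000_000))  # (1, 0) active for 3 hours only
print(label_user(first, first + 2 * MICROS_PER_DAY))       # (0, 0) came back two days later
```

Only rows with `bounced == 0` are candidates for training, so the first user in this example would be dropped rather than counted as churned.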

In [None]:
# Create a local temporary view from events_data_df.
# The lifetime of this temporary view is tied to the SparkSession that was used to create the DataFrame.

events_data_df.createOrReplaceTempView("temp_events_data")

In [None]:
# The raw data contains all of the events for every user, from their first touch (app installation) to their last touch.
# We use this information in SQL to create two columns: churned and bounced.

returning_users = spark.sql("""
        SELECT
            user_pseudo_id,
            user_first_engagement,
            user_last_engagement,
            (user_first_engagement + 86400000000) AS ts_24hr_after_first_engagement,
            IF (user_last_engagement < (user_first_engagement + 86400000000), 1, 0) AS churned,
            IF (user_last_engagement <= (user_first_engagement + 600000000), 1, 0) AS bounced,
            EXTRACT(MONTH from TIMESTAMP_MICROS(user_first_engagement)) as month,
            EXTRACT(DAYOFWEEK from TIMESTAMP_MICROS(user_first_engagement)) as dayofweek
        FROM
            (SELECT
                user_pseudo_id,
                MIN(event_timestamp) AS user_first_engagement,
                MAX(event_timestamp) AS user_last_engagement
            FROM
                temp_events_data
            WHERE 
                event_name="user_engagement"
            GROUP BY
                user_pseudo_id) 
            AS first_last_touch_table
        GROUP BY
            user_pseudo_id,
            user_first_engagement,
            user_last_engagement """)
returning_users.createOrReplaceTempView("returning_users")
returning_users.show(1)

# Note: could not get dayofyear

In [None]:
# Check the number of users for each combination of bounced and churned
spark.sql("""
            SELECT
                bounced,
                churned, 
                COUNT(churned) as count_users
            FROM
                returning_users
            GROUP BY 1,2
            ORDER BY bounced""").show()

In [None]:
spark.sql("""
    SELECT
        COUNT(CASE WHEN churned = 1 THEN 1 ELSE NULL END) / COUNT(*) AS churn_rate
    FROM
        returning_users
    WHERE
      bounced = 0""").show(1)

#### Step 2: Extracting demographic data for each user

Extract the demographic information for each user. Different demographic information about the user is available in the dataset already, including app_info, device, ecommerce, event_params, geo. Demographic features can help the model predict whether users on certain devices or countries are more likely to churn.

__Note:__ User's demographics may occasionally change (e.g. moving from one country to another). For simplicity, we will just use the demographic information that Google Analytics 4 provides when the user first engaged with the app as indicated by MIN(event_timestamp). This enables every unique user to be represented by a single row.
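As a minimal illustration of "one row per user, taken at first engagement", the sketch below keeps the earliest event per user from a small hand-made list. The records, the flattened `country` field, and the helper `first_engagement_rows` are hypothetical; the notebook itself does this in Spark SQL with a window function.

```python
# Made-up events; u1 appears twice with different countries.
events = [
    {"user_pseudo_id": "u1", "event_timestamp": 300, "country": "France"},
    {"user_pseudo_id": "u1", "event_timestamp": 100, "country": "Canada"},
    {"user_pseudo_id": "u2", "event_timestamp": 200, "country": "Japan"},
]


def first_engagement_rows(rows):
    """Keep, for each user, the row with the smallest event_timestamp."""
    earliest = {}
    for row in rows:
        uid = row["user_pseudo_id"]
        if uid not in earliest or row["event_timestamp"] < earliest[uid]["event_timestamp"]:
            earliest[uid] = row
    return earliest


demographics = first_engagement_rows(events)
print(demographics["u1"]["country"])  # Canada
```

Picking the earliest row is a simplification: it ignores later changes such as a user moving to another country, which is the trade-off described in the note above.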



In [None]:
user_demographics = spark.sql("""
SELECT
    user_pseudo_id,
    country,
    operating_system,
    language
FROM
  (SELECT
    user_pseudo_id,
    geo.country as country,
    device.operating_system as operating_system,
    device.language as language,
    ROW_NUMBER() OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp ASC) AS row_num
  FROM
    temp_events_data
  WHERE
    event_name="user_engagement") AS first_values
WHERE
  row_num = 1
""")
user_demographics.createOrReplaceTempView("user_demographics")
user_demographics.show(1)


#### Step 3: Extracting behavioral data for each user
Here we aggregate and extract behavioral data for each user, resulting in one row of behavioral data per unique user.
The end goal of this notebook is to predict, from a user's activity within the first 24 hours of app installation, whether that user will churn or return thereafter. We therefore use only behavioral data from the first 24 hours in the training data.
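The aggregation can be pictured with a small plain-Python sketch: count events by name for one user, keeping only events within 24 hours of that user's first engagement. The event list and the function name `count_first_day_events` are assumptions made for illustration.

```python
from collections import Counter

MICROS_PER_DAY = 86_400_000_000  # 24 hours in microseconds


def count_first_day_events(events, first_engagement):
    """Count events per event_name within 24 hours of first_engagement."""
    cutoff = first_engagement + MICROS_PER_DAY
    return Counter(
        event["event_name"]
        for event in events
        if event["event_timestamp"] <= cutoff
    )


# Made-up events for a single user whose first engagement is at t = 0.
user_events = [
    {"event_name": "user_engagement", "event_timestamp": 0},
    {"event_name": "post_score", "event_timestamp": 1_000_000},
    {"event_name": "post_score", "event_timestamp": 2_000_000},
    {"event_name": "post_score", "event_timestamp": 2 * MICROS_PER_DAY},  # too late, ignored
]

counts = count_first_day_events(user_events, first_engagement=0)
print(counts["post_score"])       # 2
print(counts["user_engagement"])  # 1
```

Events after the cutoff are left out on purpose: including them would leak information about whether the user came back, which is exactly what the model is supposed to predict.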

In [None]:
# First step is to explore all the unique events that exist in this dataset, based on event_name
spark.sql("""
    SELECT
        event_name,
        COUNT(event_name) as event_count
    FROM
        temp_events_data
    GROUP BY 1
    ORDER BY
       event_count DESC""").show(50, truncate=False)

In [None]:
user_aggregate_behaviour = spark.sql("""
SELECT
      user_pseudo_id,
      SUM(IF(event_name = 'user_engagement', 1, 0)) AS cnt_user_engagement,
      SUM(IF(event_name = 'level_start_quickplay', 1, 0)) AS cnt_level_start_quickplay,
      SUM(IF(event_name = 'level_end_quickplay', 1, 0)) AS cnt_level_end_quickplay,
      SUM(IF(event_name = 'level_complete_quickplay', 1, 0)) AS cnt_level_complete_quickplay,
      SUM(IF(event_name = 'level_reset_quickplay', 1, 0)) AS cnt_level_reset_quickplay,
      SUM(IF(event_name = 'post_score', 1, 0)) AS cnt_post_score,
      SUM(IF(event_name = 'spend_virtual_currency', 1, 0)) AS cnt_spend_virtual_currency,
      SUM(IF(event_name = 'ad_reward', 1, 0)) AS cnt_ad_reward,
      SUM(IF(event_name = 'challenge_a_friend', 1, 0)) AS cnt_challenge_a_friend,
      SUM(IF(event_name = 'completed_5_levels', 1, 0)) AS cnt_completed_5_levels,
      SUM(IF(event_name = 'use_extra_steps', 1, 0)) AS cnt_use_extra_steps
FROM
    (SELECT
        e.*
    FROM
      temp_events_data e
    JOIN
      returning_users r
    ON
      e.user_pseudo_id = r.user_pseudo_id
    WHERE
      e.event_timestamp <= r.ts_24hr_after_first_engagement) AS users_event_table
GROUP BY 1 """)
user_aggregate_behaviour.createOrReplaceTempView("user_aggregate_behaviour")
user_aggregate_behaviour.show(1)

#### Step 4: Combining the label, demographic and behavioral data together as training data
We now combine the three intermediary views (label, demographic, and behavioral data) into the final training data. 

__Note:__ The query below keeps only rows with bounced = 0, which limits the training data to users who did not "bounce" within the first 10 minutes of using the app. Remove that filter if you want to train on all users.
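A rough plain-Python picture of this join is shown here. It starts from the label rows, attaches demographics, fills missing behavioral counts with 0 (the role `IFNULL` plays in the SQL), and skips bounced users. All records, the shortened column list, and the function `build_training_rows` are hypothetical.

```python
# Made-up intermediate results keyed by user ID.
labels = {
    "u1": {"churned": 1, "bounced": 0},
    "u2": {"churned": 0, "bounced": 0},
    "u3": {"churned": 1, "bounced": 1},  # bounced, so excluded from training
}
demographics = {
    "u1": {"country": "Canada"},
    "u2": {"country": "Japan"},
    "u3": {"country": "France"},
}
behavior = {"u1": {"cnt_post_score": 4}}  # u2 has no behavioral row

BEHAVIOR_COLUMNS = ["cnt_user_engagement", "cnt_post_score"]  # shortened list


def build_training_rows(labels, demographics, behavior):
    """Left-join demographics and behavior onto labels, defaulting counts to 0."""
    rows = []
    for uid, label in labels.items():
        if label["bounced"] != 0:
            continue
        row = {"user_pseudo_id": uid}
        row.update(demographics.get(uid, {}))
        counts = behavior.get(uid, {})
        for column in BEHAVIOR_COLUMNS:
            row[column] = counts.get(column, 0)
        row["churned"] = label["churned"]
        rows.append(row)
    return rows


training_rows = build_training_rows(labels, demographics, behavior)
for row in training_rows:
    print(row)
```

Starting from the label table means every labeled user appears in the output, including users with no matching behavioral row.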

In [None]:
final_data = spark.sql("""
SELECT
    dem.*,
    IFNULL(beh.cnt_user_engagement, 0) AS cnt_user_engagement,
    IFNULL(beh.cnt_level_start_quickplay, 0) AS cnt_level_start_quickplay,
    IFNULL(beh.cnt_level_end_quickplay, 0) AS cnt_level_end_quickplay,
    IFNULL(beh.cnt_level_complete_quickplay, 0) AS cnt_level_complete_quickplay,
    IFNULL(beh.cnt_level_reset_quickplay, 0) AS cnt_level_reset_quickplay,
    IFNULL(beh.cnt_post_score, 0) AS cnt_post_score,
    IFNULL(beh.cnt_spend_virtual_currency, 0) AS cnt_spend_virtual_currency,
    IFNULL(beh.cnt_ad_reward, 0) AS cnt_ad_reward,
    IFNULL(beh.cnt_challenge_a_friend, 0) AS cnt_challenge_a_friend,
    IFNULL(beh.cnt_completed_5_levels, 0) AS cnt_completed_5_levels,
    IFNULL(beh.cnt_use_extra_steps, 0) AS cnt_use_extra_steps,
    ret.user_first_engagement,
    ret.month,
    ret.dayofweek,
    ret.churned
FROM
    returning_users ret
LEFT OUTER JOIN
    user_demographics dem
ON 
    ret.user_pseudo_id = dem.user_pseudo_id
LEFT OUTER JOIN 
    user_aggregate_behaviour beh
ON
    ret.user_pseudo_id = beh.user_pseudo_id
WHERE 
    ret.bounced = 0
""")

# final_data.createOrReplaceTempView("final_data")
# final_data.show(1)

## Propensity model with Spark ML

In [None]:
# Split the data into training and test sets
# A fixed seed (the value 42 is arbitrary) keeps the split reproducible across runs
(training_data, test_data) = final_data.randomSplit([0.8, 0.2], seed=42)

In [None]:
training_data.show(1)

In [None]:
test_data.show(1)