# Data Journey Day 2: ELT (Extract Load Transform) with BigQuery

<table align="left">

  <td>
    <a href="https://github.com/AmritRaj23/data-journey/blob/main/day-1/ELT%20(Extract%20Load%20Transform)/DataJourney_elt.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/create-managed?download_url=https://github.com/AmritRaj23/data-journey/blob/main/day-1/ELT%20(Extract%20Load%20Transform)/DataJourney_elt.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
     </a>
  </td>
</table>
<br/><br/><br/>

In this notebook we demonstrate [ELT (Extract Load Transform) data processing with BigQuery](https://cloud.google.com/bigquery/docs/migration/pipelines#elt). ELT is the counterpart to ETL (Extract Transform Load), which we already saw in action using Apache Beam and Dataflow.

ELT trades slower on-demand insights for lower storage cost because, in the approach used here, transformations run at query time instead of being stored. ELT typically makes sense when analyzing large data volumes.

## Step 1: Create BigQuery Dataset

To kick things off, we create a BigQuery dataset to accommodate our data. As sample data we use the publicly available Firebase Analytics data. However, since we are following the ELT approach, we will NOT load the data directly into our dataset. ELT assumes the extraction is already done. We will only work with views and materialize them on the fly as needed.

In [2]:
from google.cloud import bigquery

try:
    client = bigquery.Client()  # Construct the BigQuery client object.

    dataset_id = "{}.data_journey_elt".format(client.project)  # Fully qualified dataset ID.
    dataset = bigquery.Dataset(dataset_id)

    dataset.location = "<dataset-location>"  # Set the dataset location.

    dataset = client.create_dataset(dataset, timeout=30)  # Create the dataset via an API request.
    print("Created dataset {}.{}".format(client.project, dataset.dataset_id))
except Exception:  # Not a bare except, which would also swallow KeyboardInterrupt.
    print('Creation failed. Dataset may already exist.')

Creation failed. Dataset may already exist.


## Step 2: Explore BigQuery Data with the BigQuery Python API

To get a better feel for our data and the Python tools, let's explore our dataset using the [BigQuery Python API](https://googleapis.dev/python/bigquery/latest/index.html).

The Python client lets us define a query, run it in BigQuery, and load the result directly into a pandas DataFrame.

In [16]:
query = """
    SELECT *
    FROM `firebase-public-project.analytics_153293282.events_*`
    LIMIT 3
    """

client.query(query).to_dataframe()

| | event_date | event_timestamp | event_name | event_params | event_previous_timestamp | event_value_in_usd | event_bundle_sequence_id | event_server_timestamp_offset | user_id | user_pseudo_id | user_properties | user_first_touch_timestamp | user_ltv | device | geo | app_info | traffic_source | stream_id | platform | event_dimensions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20180814 | 1534311416381007 | level_start_quickplay | [{'key': 'board', 'value': {'string_value': 'S... | 1527918972491007 | | 511 | -340042 | | 6F21DD230241C6587130E8FA2B5C1420 | [{'key': 'plays_quickplay', 'value': {'string_... | 1489016516414000 | | {'category': 'mobile', 'mobile_brand_name': 'n... | {'continent': 'Americas', 'country': 'United S... | {'id': 'com.labpixies.flood', 'version': '2.62... | {'name': '(direct)', 'medium': '(none)', 'sour... | 1051193346 | ANDROID | |
| 1 | 20180814 | 1534311453632006 | level_fail_quickplay | [{'key': 'board', 'value': {'string_value': 'S... | 1527918316709006 | | 512 | -228430 | | 6F21DD230241C6587130E8FA2B5C1420 | [{'key': 'plays_quickplay', 'value': {'string_... | 1489016516414000 | | {'category': 'mobile', 'mobile_brand_name': 'n... | {'continent': 'Americas', 'country': 'United S... | {'id': 'com.labpixies.flood', 'version': '2.62... | {'name': '(direct)', 'medium': '(none)', 'sour... | 1051193346 | ANDROID | |
| 2 | 20180814 | 1534311500632007 | level_end_quickplay | [{'key': 'board', 'value': {'string_value': 'S... | 1527918969486007 | | 512 | -228430 | | 6F21DD230241C6587130E8FA2B5C1420 | [{'key': 'plays_quickplay', 'value': {'string_... | 1489016516414000 | | {'category': 'mobile', 'mobile_brand_name': 'n... | {'continent': 'Americas', 'country': 'United S... | {'id': 'com.labpixies.flood', 'version': '2.62... | {'name': '(direct)', 'medium': '(none)', 'sour... | 1051193346 | ANDROID | |
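
The `event_params` and `user_properties` columns hold repeated key/value records, which is why the preview shows truncated lists of dictionaries. As a minimal sketch, assuming each value record has at most one populated typed field (`string_value`, `int_value`, `float_value`, or `double_value`), a helper like the hypothetical `flatten_params` below turns one such list into a plain dictionary. The sample values are invented for illustration and are not taken from the dataset.

```python
from typing import Any, Dict, List, Optional


def flatten_params(params: List[Dict[str, Any]]) -> Dict[str, Optional[Any]]:
    """Collapse a list of {'key': ..., 'value': {...}} records into one dict.

    Assumption: at most one typed field in each 'value' record is populated.
    """
    flat: Dict[str, Optional[Any]] = {}
    for param in params:
        value = param.get("value") or {}
        # Pick the first typed field that is not None; None if all are empty.
        flat[param["key"]] = next(
            (v for v in value.values() if v is not None), None
        )
    return flat


# Hypothetical sample shaped like one `event_params` cell (values invented).
sample_params = [
    {"key": "board", "value": {"string_value": "S", "int_value": None,
                               "float_value": None, "double_value": None}},
    {"key": "value", "value": {"string_value": None, "int_value": 3,
                               "float_value": None, "double_value": None}},
]

flat_params = flatten_params(sample_params)
print(flat_params)
```

Applied row by row to the DataFrame column, a helper of this kind makes individual parameters easier to inspect than the nested representation.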


## Step 3: Transformation & Feature Engineering Using BigQuery Views

To begin the Transform step of ELT, we use [BigQuery Views](https://cloud.google.com/bigquery/docs/views-intro). A view defines a transformation over a dataset without materializing the result. That saves the corresponding storage and keeps us flexible: if we need to change anything in our processing pipeline, we don't have to recompute all our materialized, transformed data. Instead, we only adapt the query that defines the view.

#### Defining view #1

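To keep track of users who churned, we create binary (0/1) features `churned` and `bounced`. A user counts as churned if their last engagement falls less than 24 hours after their first, and as bounced if it falls within the first 10 minutes.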

In [12]:
query = """
    CREATE OR REPLACE VIEW data_journey_elt.user_returninginfo AS
    WITH firstlasttouch AS (
        SELECT
          user_pseudo_id,
          MIN(event_timestamp) AS user_first_engagement,
          MAX(event_timestamp) AS user_last_engagement
        FROM
          `firebase-public-project.analytics_153293282.events_*`
        WHERE event_name="user_engagement"
        GROUP BY
          user_pseudo_id

    )
    SELECT
      user_pseudo_id,
      user_first_engagement,
      user_last_engagement,
      EXTRACT(MONTH FROM TIMESTAMP_MICROS(user_first_engagement)) AS month,
      EXTRACT(DAYOFYEAR FROM TIMESTAMP_MICROS(user_first_engagement)) AS julianday,
      EXTRACT(DAYOFWEEK FROM TIMESTAMP_MICROS(user_first_engagement)) AS dayofweek,

      -- 86,400,000,000 microseconds = 24 hours
      (user_first_engagement + 86400000000) AS ts_24hr_after_first_engagement,

      -- churned: last engagement less than 24 hours after the first
      IF(user_last_engagement < (user_first_engagement + 86400000000), 1, 0) AS churned,

      -- bounced: last engagement within 10 minutes (600,000,000 microseconds) of the first
      IF(user_last_engagement <= (user_first_engagement + 600000000), 1, 0) AS bounced,
    FROM
      firstlasttouch
    GROUP BY
      1, 2, 3
"""

client.query(query)

QueryJob<project=fourth-carport-363710, location=US, id=c495421b-10fd-4663-b405-fe597295c650>
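
The thresholds in the view are expressed in microseconds because `event_timestamp` is stored that way: 86,400,000,000 µs is 24 hours and 600,000,000 µs is 10 minutes. The following plain-Python sketch mirrors the labeling logic for a single user so the arithmetic can be checked locally. The function name and the example input are illustrative assumptions. Date parts are computed in UTC, which is the default time zone BigQuery uses for `EXTRACT` on a `TIMESTAMP`.

```python
from datetime import datetime, timedelta, timezone

MICROS_PER_DAY = 86_400_000_000    # 24 hours in microseconds
MICROS_PER_10_MIN = 600_000_000    # 10 minutes in microseconds
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def label_user(first_us: int, last_us: int) -> dict:
    """Mirror the view's per-user features for one (first, last) pair."""
    first_dt = EPOCH + timedelta(microseconds=first_us)
    return {
        "month": first_dt.month,
        "julianday": first_dt.timetuple().tm_yday,  # day of year, 1-based
        # BigQuery's DAYOFWEEK runs from 1 (Sunday) to 7 (Saturday).
        "dayofweek": first_dt.isoweekday() % 7 + 1,
        "ts_24hr_after_first_engagement": first_us + MICROS_PER_DAY,
        "churned": int(last_us < first_us + MICROS_PER_DAY),
        "bounced": int(last_us <= first_us + MICROS_PER_10_MIN),
    }


# Illustrative input: first engagement at 2018-08-15 05:36:56 UTC,
# last engagement five minutes later.
first = 1_534_311_416_381_007
features = label_user(first, first + 5 * 60 * 1_000_000)
print(features)
```

Note the asymmetry carried over from the SQL: `churned` uses a strict `<`, while `bounced` uses `<=`, so a last engagement exactly 24 hours after the first does not count as churn.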

#### Defining view #2

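View #2 keeps track of user demographics: the country, operating system, and language taken from each user's most recent `user_engagement` event.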

In [13]:
query = """
        CREATE OR REPLACE VIEW data_journey_elt.user_demographics AS
        WITH first_values AS (
            SELECT
                user_pseudo_id,
                geo.country as country,
                device.operating_system as operating_system,
                device.language as language,
                ROW_NUMBER() OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp DESC) AS row_num
            FROM `firebase-public-project.analytics_153293282.events_*`
            WHERE event_name="user_engagement"
            )
        SELECT * EXCEPT (row_num)
        FROM first_values
        WHERE row_num = 1
"""

client.query(query)

QueryJob<project=fourth-carport-363710, location=US, id=d157636e-ab79-497e-89f4-94d81705bce4>
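
The `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY event_timestamp DESC)` pattern combined with `row_num = 1` keeps one row per user, namely the one with the largest timestamp. A rough pure-Python equivalent, using invented rows and a hypothetical helper name, could look like this:

```python
from typing import Dict, Iterable, List


def latest_row_per_user(rows: Iterable[dict]) -> List[dict]:
    """Keep, for each user_pseudo_id, the row with the largest event_timestamp.

    On ties this keeps the first row seen; ROW_NUMBER() in SQL makes no
    such guarantee and may pick any of the tied rows.
    """
    latest: Dict[str, dict] = {}
    for row in rows:
        user = row["user_pseudo_id"]
        if user not in latest or row["event_timestamp"] > latest[user]["event_timestamp"]:
            latest[user] = row
    # The view does not expose the ordering column, so drop it here as well.
    return [
        {k: v for k, v in row.items() if k != "event_timestamp"}
        for row in latest.values()
    ]


# Invented rows for illustration only.
rows = [
    {"user_pseudo_id": "A", "country": "Germany", "event_timestamp": 10},
    {"user_pseudo_id": "A", "country": "France", "event_timestamp": 30},
    {"user_pseudo_id": "B", "country": "Japan", "event_timestamp": 20},
]
demographics = latest_row_per_user(rows)
print(demographics)
```

Because the ordering is descending, a user who changed country or language between sessions is represented by the most recent values.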

#### Defining view #3 

In view #3 we count selected user behaviour events that occur up to 24 hours after each user's first engagement.

In [15]:
query = """
CREATE OR REPLACE VIEW data_journey_elt.user_aggregate_behaviour AS
WITH events_first24hr AS (
    SELECT
      e.*
    FROM
      `firebase-public-project.analytics_153293282.events_*` e
    JOIN
      data_journey_elt.user_returninginfo r
    ON
      e.user_pseudo_id = r.user_pseudo_id
    WHERE
      e.event_timestamp <= r.ts_24hr_after_first_engagement
    )
SELECT
  user_pseudo_id,
  SUM(IF(event_name = 'user_engagement', 1, 0)) AS cnt_user_engagement,
  SUM(IF(event_name = 'level_start_quickplay', 1, 0)) AS cnt_level_start_quickplay,
  SUM(IF(event_name = 'level_end_quickplay', 1, 0)) AS cnt_level_end_quickplay,
  SUM(IF(event_name = 'level_complete_quickplay', 1, 0)) AS cnt_level_complete_quickplay,
  SUM(IF(event_name = 'level_reset_quickplay', 1, 0)) AS cnt_level_reset_quickplay,
  SUM(IF(event_name = 'post_score', 1, 0)) AS cnt_post_score,
  SUM(IF(event_name = 'spend_virtual_currency', 1, 0)) AS cnt_spend_virtual_currency,
  SUM(IF(event_name = 'ad_reward', 1, 0)) AS cnt_ad_reward,
  SUM(IF(event_name = 'challenge_a_friend', 1, 0)) AS cnt_challenge_a_friend,
  SUM(IF(event_name = 'completed_5_levels', 1, 0)) AS cnt_completed_5_levels,
  SUM(IF(event_name = 'use_extra_steps', 1, 0)) AS cnt_use_extra_steps,
FROM
  events_first24hr
GROUP BY
  1
"""

client.query(query)

QueryJob<project=fourth-carport-363710, location=US, id=830ff960-0bee-417f-92d7-ba2c59e5a5e0>
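
Each `SUM(IF(event_name = '...', 1, 0))` column is a conditional count, restricted to events up to 24 hours after a user's first engagement. The sketch below reproduces that idea for a handful of made-up events with `collections.Counter`. The helper name, the shortened event list, and the cutoff mapping are assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Shortened, illustrative subset of the event names counted in the view.
TRACKED_EVENTS = ("user_engagement", "level_start_quickplay", "post_score")


def count_events_first_24hr(
    events: Iterable[Tuple[str, str, int]],
    cutoff_by_user: Dict[str, int],
) -> Dict[str, Dict[str, int]]:
    """Count tracked events per user up to and including each user's cutoff."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for user, name, ts in events:
        # Inner-join semantics: users without a cutoff are dropped.
        if user in cutoff_by_user and ts <= cutoff_by_user[user]:
            counts[user][name] += 1
    # A Counter returns 0 for missing keys, like SUM(IF(..., 1, 0)) with no match.
    return {
        user: {f"cnt_{name}": c[name] for name in TRACKED_EVENTS}
        for user, c in counts.items()
    }


# Invented (user, event_name, timestamp) tuples and cutoffs.
events = [
    ("u1", "user_engagement", 100),
    ("u1", "post_score", 150),
    ("u1", "post_score", 999),   # after u1's cutoff, ignored
    ("u2", "user_engagement", 50),
    ("u3", "user_engagement", 10),  # no cutoff for u3, dropped
]
behaviour = count_events_first_24hr(events, {"u1": 200, "u2": 60})
print(behaviour)
```

In the actual view, the cutoff comes from `ts_24hr_after_first_engagement` in `data_journey_elt.user_returninginfo`, so view #3 depends on view #1 having been created first.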