This lab is an adaptation of Lab 1: Cell Tower Anomaly Detection. It includes the following changes:
- Uses BigLake to create GCS external tables over Parquet and CSV files (see the DDL sketch after this list).
- Refactors the initial data-cleaning pySpark scripts into SQL.
- Uses the open source tool dbt to implement a data pipeline (DAG) covering both the SQL jobs and the pySpark scripts.
- Adds a bootstrap script that deploys the required cloud infrastructure (GCS bucket, BigQuery datasets, service accounts, etc.) using Terraform.
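For reference, a BigLake table is declared with DDL that points at files in GCS through a BigQuery connection. A minimal sketch (the table, connection, and file names here are illustrative; the actual definitions are created by the bootstrap in step 3):

# Illustrative only -- the bootstrap creates the real tables via Terraform
bq query --use_legacy_sql=false '
CREATE OR REPLACE EXTERNAL TABLE `spark_dataproc.customer_churn_raw`
WITH CONNECTION `us-central1.biglake-gcs`
OPTIONS (
  format = "CSV",
  uris = ["gs://<bucket_name>/raw/customer_churn.csv"]
);'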
The lab includes studying and running a data pipeline (DAG) via dbt; a selection sketch follows this list:
- Curate customer master data (defined using BigQuery SQL)
- Curate Telco customer churn data (defined using BigQuery SQL)
- Calculate cell tower KPIs by customer (defined using pySpark)
- Calculate KPIs by cell tower (defined using pySpark)
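dbt infers this DAG from the ref() dependencies between models, so individual steps can be targeted from the CLI. A sketch, assuming an illustrative model name (the real names live in the project's models directory):

# Run the whole DAG
dbt run
# Run one step plus everything upstream of it
dbt run --select +kpis_by_cell_tower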
Lab architecture:
The following diagram illustrates the GCP architecture:
Prerequisites:
- A working Google Cloud project with billing enabled
- Sufficient permissions (e.g. the Project Owner role)
- Access to Cloud Shell
1) Open a Cloud Shell terminal and clone the repository:
git clone https://github.com/GoogleCloudPlatform/serverless-spark-workshop.git
2) Edit the variables.json file with your own values:
cd serverless-spark-workshop/cell-tower-anomaly-detection-dbt/cell-tower-anomaly-detection-dbt/02-config
vi variables.json
For instance:
{
"project_id": "velascoluis-dev-sandbox",
"spark_serverless" : "false",
"dataproc_cluster_name" : "spark-dataproc",
"region": "us-central1",
"bucket_name": "spark-dataproc-velascoluis-dev-sandbox",
"bq_dataset_name": "spark_dataproc",
"dbt_sa_name": "dbt-sa-account",
"dbt_sa_key_location" : "dbt-sa.json"
}
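The lab scripts read this file for all naming and placement decisions. You can sanity-check your edits with jq, which is preinstalled in Cloud Shell:

# Print the values the scripts will consume
jq -r '.project_id, .region, .bucket_name' variables.json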
If the scripts below fail with timeout errors, simply re-execute them.
NOTE 1: dbt Core v1.3, currently in beta, adds Python models to dbt. For more information, see the dbt documentation on Python models.
NOTE 2: Serverless Spark support in dbt is experimental; see the related GitHub issue. The current adapter implementation uses the default network as the VPC subnetwork that executes Serverless Spark workloads.
NOTE 3: Configure the default VPC network as shown in the example below.
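Dataproc Serverless requires Private Google Access on the subnet it runs in. Assuming the region from variables.json, the default subnet can be configured with:

# Enable Private Google Access on the default subnet
gcloud compute networks subnets update default \
    --region=us-central1 \
    --enable-private-ip-google-access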
3) Launch the infrastructure bootstrap script:
$> ./setup_infra.sh variables.json deploy
This script reads the config file and:
- Checks for and installs required binaries/packages
- Uses Terraform to enable the GCP APIs
- Uses Terraform to create a service account (SA) and key for dbt
- Uses Terraform to grant roles to the dbt SA
- Uses Terraform to deploy infrastructure, including a single-node Dataproc cluster, a GCS bucket and a BigQuery dataset
- Stages data in GCS
- Uses Terraform to deploy data infrastructure, including a BigQuery external connection and the BigLake tables (see the sketch after this list)
- Generates the dbt config files (profile and project settings)
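Conceptually, the Terraform portion of the script boils down to a flow like the following (the directory and variable names are assumptions; run the script rather than these commands directly):

# Illustrative Terraform flow wrapped by setup_infra.sh
cd <terraform_dir>
terraform init
terraform apply \
    -var="project_id=<project_id>" \
    -var="region=<region>"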
4) Launch dbt:
cd serverless-spark-workshop/cell-tower-anomaly-detection-dbt/cell-tower-anomaly-detection-dbt/00-scripts
$> ./run_dbt.sh
Depending on the spark_serverless flag, the pySpark code will be submitted via API using either create_batch (serverless) or submit_job_as_operation (classic).
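The dbt adapter issues these API calls for you; for intuition, the gcloud equivalents of the two submission paths look roughly like this (the script path is illustrative):

# spark_serverless = "true": Batches API (create_batch)
gcloud dataproc batches submit pyspark gs://<bucket_name>/scripts/kpis.py \
    --region=us-central1
# spark_serverless = "false": classic job on the cluster (submit_job_as_operation)
gcloud dataproc jobs submit pyspark gs://<bucket_name>/scripts/kpis.py \
    --cluster=spark-dataproc --region=us-central1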
5) Generate and browse the dbt docs:
cd serverless-spark-workshop/cell-tower-anomaly-detection-dbt/cell-tower-anomaly-detection-dbt/00-scripts
$> ./gen_docs.sh
Preview the docs on the default port 8080 using the Cloud Shell Web Preview, then browse the documentation.
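Under the hood, gen_docs.sh presumably wraps the standard dbt docs commands:

dbt docs generate            # compile the project and build the docs catalog
dbt docs serve --port 8080   # serve the docs site on port 8080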
6) Destroy all resources created:
cd serverless-spark-workshop/cell-tower-anomaly-detection-dbt/cell-tower-anomaly-detection-dbt/02-config
$> ./setup_infra.sh variables.json destroy