This lab is an adaptation of Lab 1: Cell Tower Anomaly Detection. It includes the following changes:
- Uses BigLake to create GCS external tables over Parquet and CSV files (see the DDL sketch after this list).
- Refactors the initial data-cleaning pySpark scripts into SQL.
- Uses the open source tool dbt to implement a data pipeline (DAG) covering both the SQL jobs and the pySpark scripts.
- Adds a bootstrap script that deploys the required cloud infrastructure (GCS bucket, BigQuery datasets, service accounts, etc.) using Terraform.
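For reference, a BigLake table is declared with DDL that points at files in GCS through a BigQuery connection. A minimal sketch (the table, connection, and file names here are illustrative; the actual definitions are created by the bootstrap in step 3):

# Illustrative only -- the bootstrap creates the real tables via Terraform
bq query --use_legacy_sql=false '
CREATE OR REPLACE EXTERNAL TABLE `spark_dataproc.customer_churn_raw`
WITH CONNECTION `us-central1.biglake-gcs`
OPTIONS (
  format = "CSV",
  uris = ["gs://<bucket_name>/raw/customer_churn.csv"]
);'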
The lab includes studying and running a data pipeline (DAG) via dbt; a selection sketch follows this list:
- Curate customer master data (defined using BigQuery SQL)
- Curate Telco customer churn data (defined using BigQuery SQL)
- Calculate cell tower KPIs by customer (defined using pySpark)
- Calculate KPIs by cell tower (defined using pySpark)
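dbt infers this DAG from the ref() dependencies between models, so individual steps can be targeted from the CLI. A sketch, assuming an illustrative model name (the real names live in the project's models directory):

# Run the whole DAG
dbt run
# Run one step plus everything upstream of it
dbt run --select +kpis_by_cell_tower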
Lab architecture:
The following diagram illustrates the GCP architecture:
Prerequisites:
- A working Google Cloud project with billing enabled
- Sufficient permissions (e.g. the Project Owner role)
- Access to Cloud Shell
1) Open a Cloud Shell terminal and clone the repository:
git clone https://github.com/GoogleCloudPlatform/serverless-spark-workshop.git
2) Edit the variables.json file with your own values:
cd serverless-spark-workshop/cell-tower-anomaly-detection-dbt/cell-tower-anomaly-detection-dbt/02-config
vi variables.json
For instance:
{
"project_id": "velascoluis-dev-sandbox",
"spark_serverless" : "false",
"dataproc_cluster_name" : "spark-dataproc",
"region": "us-central1",
"bucket_name": "spark-dataproc-velascoluis-dev-sandbox",
"bq_dataset_name": "spark_dataproc",
"dbt_sa_name": "dbt-sa-account",
"dbt_sa_key_location" : "dbt-sa.json"
}
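The lab scripts read this file for all naming and placement decisions. You can sanity-check your edits with jq, which is preinstalled in Cloud Shell:

# Print the values the scripts will consume
jq -r '.project_id, .region, .bucket_name' variables.json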
If the scripts below fail with timeout errors, simply re-execute them.
NOTE 1: dbt Core v1.3, currently in beta, adds Python models to dbt. For more information, see the dbt documentation on Python models.
NOTE 2: Serverless Spark support in dbt is experimental; see the related GitHub issue. The current adapter implementation uses the default network as the VPC subnetwork that executes Serverless Spark workloads.
NOTE 3: Configure the default VPC network as shown in the example below.
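Dataproc Serverless requires Private Google Access on the subnet it runs in. Assuming the region from variables.json, the default subnet can be configured with:

# Enable Private Google Access on the default subnet
gcloud compute networks subnets update default \
    --region=us-central1 \
    --enable-private-ip-google-access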
3) Launch the infrastructure bootstrap script:
$> ./setup_infra.sh variables.json deploy
This script reads the config file and:
- Checks for and installs required binaries/packages
- Uses Terraform to enable the GCP APIs
- Uses Terraform to create a service account (SA) and key for dbt
- Uses Terraform to grant roles to the dbt SA
- Uses Terraform to deploy infrastructure, including a single-node Dataproc cluster, a GCS bucket and a BigQuery dataset
- Stages data in GCS
- Uses Terraform to deploy data infrastructure, including a BigQuery external connection and the BigLake tables (see the sketch after this list)
- Generates the dbt config files (profile and project settings)
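Conceptually, the Terraform portion of the script boils down to a flow like the following (the directory and variable names are assumptions; run the script rather than these commands directly):

# Illustrative Terraform flow wrapped by setup_infra.sh
cd <terraform_dir>
terraform init
terraform apply \
    -var="project_id=<project_id>" \
    -var="region=<region>"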
4) Launch dbt:
cd serverless-spark-workshop/cell-tower-anomaly-detection-dbt/cell-tower-anomaly-detection-dbt/00-scripts
$> ./run_dbt.sh
Depending on the spark_serverless flag, the pySpark code will be submitted via API using either create_batch (serverless) or submit_job_as_operation (classic).
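The dbt adapter issues these API calls for you; for intuition, the gcloud equivalents of the two submission paths look roughly like this (the script path is illustrative):

# spark_serverless = "true": Batches API (create_batch)
gcloud dataproc batches submit pyspark gs://<bucket_name>/scripts/kpis.py \
    --region=us-central1
# spark_serverless = "false": classic job on the cluster (submit_job_as_operation)
gcloud dataproc jobs submit pyspark gs://<bucket_name>/scripts/kpis.py \
    --cluster=spark-dataproc --region=us-central1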
5) Generate and browse the dbt docs:
cd serverless-spark-workshop/cell-tower-anomaly-detection-dbt/cell-tower-anomaly-detection-dbt/00-scripts
$> ./gen_docs.sh
Preview the docs on the default port 8080 using the Cloud Shell Web Preview, then browse the documentation.
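Under the hood, gen_docs.sh presumably wraps the standard dbt docs commands:

dbt docs generate            # compile the project and build the docs catalog
dbt docs serve --port 8080   # serve the docs site on port 8080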
6) Destroy all resources created:
cd serverless-spark-workshop/cell-tower-anomaly-detection-dbt/cell-tower-anomaly-detection-dbt/02-config
$> ./setup_infra.sh variables.json destroy