DISSY-BQ is a DIStributed Schema Inference tool for loading JSON data into BigQuery, built with Apache Beam. It is trivial to execute on Cloud Dataflow against data that resides in Cloud Storage.
The pipeline loads JSON data from a storage location (e.g. GCS) into BigQuery according to a common schema. The schema is auto-inferred from the data and is used to keep the target table up to date.
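To make the idea concrete, the sketch below shows one way a BigQuery schema could be inferred from a single JSON document. It is an illustration only, not DISSY-BQ's actual implementation (which adds data detectors such as TIMESTAMP, null handling and schema merging); the function and field names are made up for this example.

```python
import json

# Illustrative only: map JSON values to BigQuery field definitions.
# The real pipeline's detectors and merging logic are richer than this.
def infer_field(name, value):
    if isinstance(value, bool):          # check bool before int (bool is a subclass of int)
        return {"name": name, "type": "BOOLEAN", "mode": "NULLABLE"}
    if isinstance(value, int):
        return {"name": name, "type": "INTEGER", "mode": "NULLABLE"}
    if isinstance(value, float):
        return {"name": name, "type": "FLOAT", "mode": "NULLABLE"}
    if isinstance(value, dict):          # nested objects become RECORD fields
        return {"name": name, "type": "RECORD", "mode": "NULLABLE",
                "fields": infer_schema(value)}
    if isinstance(value, list) and value:  # non-empty lists become REPEATED fields
        field = infer_field(name, value[0])
        field["mode"] = "REPEATED"
        return field
    return {"name": name, "type": "STRING", "mode": "NULLABLE"}

def infer_schema(doc):
    return [infer_field(key, value) for key, value in doc.items()]

print(json.dumps(
    infer_schema({"id": 1, "user": {"name": "ada"}, "tags": ["a"]}),
    indent=2))
```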
See the doc folder for a diagram and a Dataflow component overview.
- Bash terminal (for example scripts)
- Python 3.8
- gcloud command installed locally
- A Google Cloud project on which you have sufficient rights, with the Dataflow API enabled
Create your own venv:
bash scripts/001_create_venv.sh
Run the tests:
bash scripts/002_run_tests.sh
Create resources on your current active project:
bash scripts/003_provision_resources.sh
Note that currently, this script generates a service account with owner permissions. Please adjust this to suit your needs; we recommend a least-privilege setup.
Test the pipeline locally
bash scripts/004_run_pipeline_locally.sh
In order to test schema updates, you can uncomment the other pipelines in script 004 to test the base and modified sample datasets.
Run the pipeline on Dataflow
bash scripts/005_run_pipeline_gcp.sh
Finally, scripts 008, 009 and 010 allow you to test with some larger datasets, which can make the Dataflow pipeline scale up.
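Script 005 launches the Beam pipeline with the Dataflow runner. The sketch below shows the kind of pipeline options such a run typically uses, including the standard Dataflow autoscaling knobs that govern the scale-up mentioned above; the project, region, bucket and job names are placeholders, not values from this repository.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values: substitute your own project, region and bucket.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="europe-west1",
    temp_location="gs://my-bucket/tmp",
    job_name="dissy-bq-demo",
    # Standard Dataflow autoscaling options; larger inputs (scripts 008-010)
    # can cause the job to scale up towards max_num_workers.
    autoscaling_algorithm="THROUGHPUT_BASED",
    max_num_workers=10,
)
# The pipeline itself is then constructed with beam.Pipeline(options=options).
```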
- Pipeline can run in batch mode
- Dataflow API is enabled
- Dataset exists (not created by the pipeline, so we have the freedom to e.g. Terraform it and assign IAM policies; see the sketch after this list)
- Nested fields don't need to be nullable, as BQ can handle that when the parent is nullable
- Null fields can be ignored
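Because the pipeline assumes the dataset already exists, you create it yourself, for example with Terraform, the bq CLI, or the BigQuery client library. A minimal sketch with the Python client library, using placeholder project and dataset names:

```python
from google.cloud import bigquery

# Placeholder names; in practice you might manage the dataset with Terraform
# so that IAM policies can be assigned alongside it.
client = bigquery.Client(project="my-gcp-project")
dataset = bigquery.Dataset("my-gcp-project.my_dataset")
dataset.location = "EU"
client.create_dataset(dataset, exists_ok=True)  # no-op if it already exists
```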
- Input pattern support
- Basic primitive JSON types
- TIMESTAMP data detector
- INTEGER to FLOAT coercion
- REQUIRED to NULLABLE conversion
- Table schema updates (the coercion and merge behaviour is sketched below)
- Optional data loading in the same job
- Ignore null fields (until another doc has a value)
- JSON null fields and empty lists are only supported if they appear as non-null (or non-empty) in at least one doc
- Nested lists are not supported
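To illustrate the coercion and relaxation rules listed above (INTEGER to FLOAT, REQUIRED to NULLABLE), here is a simplified sketch of how two inferred field definitions might be merged. It is illustrative only, not the tool's actual merge logic.

```python
# Illustrative merge of two inferred field definitions (not the actual
# DISSY-BQ implementation). Shows INTEGER -> FLOAT coercion and
# REQUIRED -> NULLABLE relaxation when documents disagree.
def merge_fields(a, b):
    merged = dict(a)
    types = {a["type"], b["type"]}
    if types == {"INTEGER", "FLOAT"}:
        merged["type"] = "FLOAT"        # widen integers to floats
    elif a["type"] != b["type"]:
        raise ValueError(f"Incompatible types for {a['name']}: {types}")
    if "NULLABLE" in (a["mode"], b["mode"]):
        merged["mode"] = "NULLABLE"     # relax REQUIRED if either side allows nulls
    return merged

print(merge_fields(
    {"name": "price", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "price", "type": "FLOAT", "mode": "NULLABLE"},
))
# -> {'name': 'price', 'type': 'FLOAT', 'mode': 'NULLABLE'}
```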
Some interesting ideas for the future:
- extended primitive types (e.g., geolocation)
- schema depth/width validation
- validation of values according to schema in step 2
- lineage tracking
- handle invalid inputs
- handle lists containing null values
- create SA with least privilege
- override/seed schema
- forbidden characters in field names
- refactor into class structure
- make data detectors modular
- auto-casting during data load
- add counters to the Dataflow job (e.g. for validation)
- catch BQ ingestion errors
- error tolerance and dead-lettering
- BigQuery integration tests
The source code contains some pointers to future work, marked with the FUTURE tag in the comments.