DISSY-BQ is a DIStributed Schema Inference tool for loading JSON data into BigQuery, built with Apache Beam. It is trivial to execute on Cloud Dataflow against data that resides in Cloud Storage.
The pipeline loads JSON data from a storage location (e.g. GCS) into BigQuery according to a common schema. The schema is auto-inferred from the data and is used to keep the target table up to date.
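To make the idea concrete, the sketch below shows one way a BigQuery schema could be inferred from a single JSON document. It is an illustration only, not DISSY-BQ's actual implementation (which adds data detectors such as TIMESTAMP, null handling and schema merging); the function and field names are made up for this example.

```python
import json

# Illustrative only: map JSON values to BigQuery field definitions.
# The real pipeline's detectors and merging logic are richer than this.
def infer_field(name, value):
    if isinstance(value, bool):          # check bool before int (bool is a subclass of int)
        return {"name": name, "type": "BOOLEAN", "mode": "NULLABLE"}
    if isinstance(value, int):
        return {"name": name, "type": "INTEGER", "mode": "NULLABLE"}
    if isinstance(value, float):
        return {"name": name, "type": "FLOAT", "mode": "NULLABLE"}
    if isinstance(value, dict):          # nested objects become RECORD fields
        return {"name": name, "type": "RECORD", "mode": "NULLABLE",
                "fields": infer_schema(value)}
    if isinstance(value, list) and value:  # non-empty lists become REPEATED fields
        field = infer_field(name, value[0])
        field["mode"] = "REPEATED"
        return field
    return {"name": name, "type": "STRING", "mode": "NULLABLE"}

def infer_schema(doc):
    return [infer_field(key, value) for key, value in doc.items()]

print(json.dumps(
    infer_schema({"id": 1, "user": {"name": "ada"}, "tags": ["a"]}),
    indent=2))
```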
See the doc folder for a diagram and a Dataflow component overview.
- Bash terminal (for example scripts)
- Python 3.8
- gcloud command installed locally
- A Google Cloud project on which you have sufficient rights, with the Dataflow API enabled
Create your own venv:
bash scripts/001_create_venv.sh
Run the tests:
bash scripts/002_run_tests.sh
Create resources on your current active project:
bash scripts/003_provision_resources.sh
Note that currently, this script generates a service account with owner permissions. Please adjust this to suit your needs; we recommend a least-privilege setup.
Test the pipeline locally
bash scripts/004_run_pipeline_locally.sh
In order to test schema updates, you can uncomment the other pipelines in script 004 to test the base and modified sample datasets.
Run the pipeline on Dataflow
bash scripts/005_run_pipeline_gcp.sh
Finally, scripts 008, 009 and 010 allow you to test with some larger datasets, which can make the Dataflow pipeline scale up.
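Script 005 launches the Beam pipeline with the Dataflow runner. The sketch below shows the kind of pipeline options such a run typically uses, including the standard Dataflow autoscaling knobs that govern the scale-up mentioned above; the project, region, bucket and job names are placeholders, not values from this repository.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values: substitute your own project, region and bucket.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="europe-west1",
    temp_location="gs://my-bucket/tmp",
    job_name="dissy-bq-demo",
    # Standard Dataflow autoscaling options; larger inputs (scripts 008-010)
    # can cause the job to scale up towards max_num_workers.
    autoscaling_algorithm="THROUGHPUT_BASED",
    max_num_workers=10,
)
# The pipeline itself is then constructed with beam.Pipeline(options=options).
```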
- Pipeline can run in batch mode
- Dataflow API is enabled
- Dataset exists (not created by the pipeline, so we have the freedom to e.g. Terraform it and assign IAM policies; see the sketch after this list)
- Nested fields don't need to be nullable, as BQ can handle that when the parent is nullable
- Null fields can be ignored
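Because the pipeline assumes the dataset already exists, you create it yourself, for example with Terraform, the bq CLI, or the BigQuery client library. A minimal sketch with the Python client library, using placeholder project and dataset names:

```python
from google.cloud import bigquery

# Placeholder names; in practice you might manage the dataset with Terraform
# so that IAM policies can be assigned alongside it.
client = bigquery.Client(project="my-gcp-project")
dataset = bigquery.Dataset("my-gcp-project.my_dataset")
dataset.location = "EU"
client.create_dataset(dataset, exists_ok=True)  # no-op if it already exists
```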
- Input pattern support
- Basic primitive JSON types
- TIMESTAMP data detector
- INTEGER to FLOAT coercion
- REQUIRED to NULLABLE conversion
- Table schema updates (the coercion and merge behaviour is sketched below)
- Optional data loading in the same job
- Ignore null fields (until another doc has a value)
- JSON null fields and empty lists are only supported if they appear as non-null (or non-empty) in at least one doc
- Nested lists are not supported
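To illustrate the coercion and relaxation rules listed above (INTEGER to FLOAT, REQUIRED to NULLABLE), here is a simplified sketch of how two inferred field definitions might be merged. It is illustrative only, not the tool's actual merge logic.

```python
# Illustrative merge of two inferred field definitions (not the actual
# DISSY-BQ implementation). Shows INTEGER -> FLOAT coercion and
# REQUIRED -> NULLABLE relaxation when documents disagree.
def merge_fields(a, b):
    merged = dict(a)
    types = {a["type"], b["type"]}
    if types == {"INTEGER", "FLOAT"}:
        merged["type"] = "FLOAT"        # widen integers to floats
    elif a["type"] != b["type"]:
        raise ValueError(f"Incompatible types for {a['name']}: {types}")
    if "NULLABLE" in (a["mode"], b["mode"]):
        merged["mode"] = "NULLABLE"     # relax REQUIRED if either side allows nulls
    return merged

print(merge_fields(
    {"name": "price", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "price", "type": "FLOAT", "mode": "NULLABLE"},
))
# -> {'name': 'price', 'type': 'FLOAT', 'mode': 'NULLABLE'}
```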
Some interesting ideas for the future:
- extended primitive types (e.g., geolocation)
- schema depth/width validation
- validation of values according to schema in step 2
- lineage tracking
- handle invalid inputs
- handle lists containing null values
- create SA with least privilege
- override/seed schema
- forbidden characters in field names
- refactor into class structure
- make data detectors modular
- auto-casting during data load
- add counters to the Dataflow job (e.g. for validation)
- catch BQ ingestion errors
- error tolerance and dead-lettering
- BigQuery integration tests
The source code contains some pointers to future work, marked with the FUTURE tag in the comments.