BQ Column sizes

This project is meant to allow extract column sizes from BigQuery tables to monitor fields weight changes over time.

How it works

The job will scan selected GCP project / datasets / tables then scan for partitions to launch a select as dry run per field and per partition to store the bytes that would be read.

Output

output data columns:

job_partition: String => the input partition formatted with the job frequency
project_id: String => the GCP project id of the read table
dataset: String => the dataset of the read table,
table: String => the table name of the read table,
field: String => the analyzed field name of the red table,
table_partition: String => the analyzed time partition of the read table,
size_in_bytes: Long => For the selected field & partition, the size in bytes

CSV

CSV output will create a dedicated file (or overwrite it if existing) that contains the job run will contain the header and the data per line.

BigQuery

BigQuery output will create a dedicated partitioned table if not existing to store the output per partition (based on job_frequency parameter or else daily).

How to launch

There is multiple ways to start the job:

Clone the source code a build the project to run the BQColumnSizes (Main) class
Install Coursier and fetch the Jitpack package

The Coursier + Jitpack option is likely easier to use:

Install Coursier
Fetch the dependency from Jitpack coursier fetch -r jitpack com.github.Kayrnt:bq_column_sizes:master-SNAPSHOT
Start the job coursier launch -r jitpack com.github.Kayrnt:bq_column_sizes:master-SNAPSHOT -- --help

Usage

BQColumnSizes 1.0.0
Usage: bq-column-sizes [options]

Help options:
  --usage     Print usage and exit
  -h, --help  Print help message and exit

Other options:
  --ep, --execution-project-id string?  GCP project id used for query execution if different from read.
  -p, --project-id string               GCP project id to analyze, if no dataset is provided, it will analyze all datasets for that project.
  -d, --dataset string?                 datasets to analyze, if no table is provided it will analyze all tables for that dataset. Use commas to split datasets if you have multiple datasets (ie ds_a,ds_b,...).
  -t, --table string?                   table(s) to analyze. Use commas to split tables if you have multiple tables (ie table_a,table_b,...).
  -o, --output-file-path string?        output file path, default in size.csv, you can use an absolute path such as /Users/myUser/size.csv.
  --mcq, --max-concurrent-queries int?  max concurrent queries as you're throttled on actual BQ requests, default to 4.
  --ow, --output-writer string?         output writer such as bq or csv, default to csv.
  --op, --output-project string?        Bigquery output project when bq output type is selected, default to the projectId field.
  --od, --output-dataset string?        Bigquery output dataset when bq output type is selected.
  --ot, --output-table string?          Bigquery output table when bq output type is selected.
  --pt, --partition string?             Partition to analyze, if not specified all partitions will be analyzed.
Format: yyyy, yyyyMM, yyyyMMdd, yyyyyMMddHH
Example: 20210101
  --jf, --job-frequency string?         Job frequency: it's how often you're going to run the job so that the job know how to partition the data.
Values: year, month, day, hour.
Default to day.
  --offset string?                      Partition offset as duration
Format: <length><unit> based on scala.concurrent.duration.Duration
Example: 1hour, 1d
  --write-disposition string?           output write disposition
based on the BQ write dispositions: write_empty, truncate or append
Default: truncate

Examples

For instance, following command allows to get the column sizes from bigquery-public-data.crypto_bitcoin.blocks table on partition 20210501 and output the result in table my_gcp_project.my_dataset.my_bq_column_sizes (which will be partitioned per day as default):

bq-column-sizes -p bigquery-public-data -d crypto_bitcoin -t blocks --ow bq --ot my_gcp_project --od my_dataset --ot my_bq_column_sizes --pt 20210501

If you would like to store it in a local CSV file, you can use simply use:

bq-column-sizes -p bigquery-public-data -d crypto_bitcoin -t blocks --pt 20210501

Known limitations

If you're using range partitioning on a read table, that partitioning will be ignored and treated as a table without partitioning. It's done so because the job is meant to be run on regular basis (ie a daily job).

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
project		project
src		src
.gitignore		.gitignore
.sbtops		.sbtops
.scalafmt.conf		.scalafmt.conf
LICENSE		LICENSE
README.md		README.md
build.sbt		build.sbt
jitpack.yml		jitpack.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BQ Column sizes

How it works

Output

CSV

BigQuery

How to launch

Usage

Examples

Known limitations

About

Releases

Packages

Languages

License

Kayrnt/bq_column_sizes

Folders and files

Latest commit

History

Repository files navigation

BQ Column sizes

How it works

Output

CSV

BigQuery

How to launch

Usage

Examples

Known limitations

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages