Add BigQuery query client and dialect support#1839

Merged
shangyian merged 17 commits into DataJunction:main from colinmf:colinmf/bigquery-integration
Mar 8, 2026
Conversation

@colinmf
Contributor

@colinmf colinmf commented Mar 7, 2026

Summary

Adds full BigQuery support to DataJunction — schema introspection in datajunction-server and query execution in datajunction-query.

datajunction-server: BigQuery schema introspection & dialect

  • BigQueryClient — direct query client implementing BaseQueryServiceClient (same pattern as SnowflakeClient)
    • Introspects columns via INFORMATION_SCHEMA.COLUMNS with parameterized queries
    • Maps all BigQuery types (INT64, FLOAT64, NUMERIC, BIGNUMERIC, STRING, TIMESTAMP, etc.) to DJ ColumnType
    • Project resolution order: engine URI → client config → catalog name fallback
    • Supports service account credentials (JSON file or dict) and Application Default Credentials
  • BIGQUERY dialect — added to Dialect enum, registered with SQLGlotTranspilationPlugin
  • Config factory_create_configured_query_client() supports bigquery type
  • Optional install: pip install 'datajunction-server[bigquery]'
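The type mapping described above can be sketched as a lookup table. This is an illustrative stand-in, not the actual server code: the real mapping targets DJ's ColumnType classes in datajunction-server, while plain type strings are used here to keep the example self-contained.

```python
# Illustrative BigQuery -> DJ type mapping (names approximate the PR's
# description; the real mapping lives in query_clients/bigquery.py).
BIGQUERY_TO_DJ = {
    "INT64": "bigint",
    "FLOAT64": "double",
    "NUMERIC": "decimal(38, 9)",
    # BIGNUMERIC is capped at precision 38 because DJ's DecimalType
    # limits max_precision to 38 (see the commit notes below).
    "BIGNUMERIC": "decimal(38, 38)",
    "STRING": "string",
    "BOOL": "boolean",
    "TIMESTAMP": "timestamp",
    "DATE": "date",
}

def map_bigquery_type(bq_type: str) -> str:
    """Map a BigQuery column type to a DJ type string, defaulting to string."""
    return BIGQUERY_TO_DJ.get(bq_type.upper(), "string")
```

Unknown types fall back to a string type, which is a common safe default for introspection clients.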

datajunction-query: BigQuery query execution

  • BIGQUERY engine type — added to EngineType enum in djqs/config.py
  • run_bigquery_query() — executes queries via google.cloud.bigquery.Client, following the run_snowflake_query pattern
  • Credential handling: credentials_path from engine extra_params, GOOGLE_APPLICATION_CREDENTIALS env var fallback, or Application Default Credentials
  • Routing: run_query() routes EngineType.BIGQUERY to the new function
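The execution flow above can be sketched as follows. This is a minimal sketch, not the actual djqs code: the real run_bigquery_query() lives in djqs/engine.py, and the client_factory parameter is a testing hook added here so the example runs without GCP credentials.

```python
def run_bigquery_query(sql, extra_params, client_factory=None):
    """Sketch of the run_bigquery_query flow: resolve project/location/
    credentials from extra_params, build a client, run the query."""
    project = extra_params.get("project")
    location = extra_params.get("location")
    credentials_path = extra_params.get("credentials_path")

    if client_factory is not None:
        # Injection point for tests (hypothetical; not in the real code).
        client = client_factory(project=project, location=location)
    else:
        from google.cloud import bigquery
        from google.oauth2 import service_account

        if credentials_path:
            creds = service_account.Credentials.from_service_account_file(
                credentials_path,
            )
            client = bigquery.Client(
                project=project, credentials=creds, location=location,
            )
        else:
            # Application Default Credentials are picked up automatically.
            client = bigquery.Client(project=project, location=location)

    rows = client.query(sql).result()
    return [dict(row) for row in rows]
```

The real implementation streams rows rather than materializing a list; the list comprehension here just keeps the sketch short.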

Files changed

Component File Change
server query_clients/bigquery.py New BigQueryClient with column introspection and type mapping
server models/dialect.py BIGQUERY = "bigquery" enum value
server transpilation.py Register BigQuery with SQLGlot plugin
server query_clients/__init__.py Lazy import for BigQueryClient
server utils.py bigquery case in _create_configured_query_client()
server pyproject.toml bigquery = ["google-cloud-bigquery>=3.0.0"] optional extra
server tests/.../bigquery_query_client_test.py 38 tests — 100% branch coverage on bigquery.py
server tests/utils_test.py Factory tests for BigQuery client creation
djqs djqs/config.py BIGQUERY = "bigquery" engine type
djqs djqs/engine.py run_bigquery_query() + routing in run_query()
djqs pyproject.toml google-cloud-bigquery>=3.11.0 dependency
djqs tests/api/queries_test.py 7 tests — credentials, location, env var, errors, empty results
djqs tests/config.djqs.yml BigQuery test engine and catalog

DJ terminology mapping

DJ concept BigQuery equivalent
catalog GCP project
schema BigQuery dataset
table table name

Configuration

# Server — schema introspection
QUERY_CLIENT__TYPE=bigquery
QUERY_CLIENT__CONNECTION__PROJECT=my-gcp-project
QUERY_CLIENT__CONNECTION__CREDENTIALS_PATH=/path/to/sa.json  # optional

# Query service — engine extra_params
# project, credentials_path, location
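For the query service side, a hypothetical engine entry might look like the fragment below. The field names follow the extra_params keys listed above, but the exact shape should be checked against tests/config.djqs.yml in this PR.

```yaml
# Hypothetical djqs engine config — illustrative only.
engines:
  - name: bigquery-prod
    type: bigquery
    extra_params:
      project: my-gcp-project
      credentials_path: /path/to/sa.json   # optional; ADC is used if omitted
      location: US
```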

Test plan

  • 38 server BigQuery tests pass (100% branch coverage on bigquery.py)
  • 7 djqs BigQuery tests pass (100% coverage on engine.py)
  • All pre-commit hooks pass (ruff, mypy, format)
  • Live-tested SQL generation with deployed BigQuery nodes — single metrics, multi-metric, dimensions, filters, cubes all generate valid BigQuery dialect SQL
  • End-to-end query execution against a real BigQuery instance (requires GCP credentials)

🤖 Generated with Claude Code

Implements a direct BigQuery integration following the same pattern as
the existing Snowflake client. Adds `BigQueryClient` for table
introspection via INFORMATION_SCHEMA, registers `bigquery` as a
supported dialect with sqlglot transpilation, and exposes it as an
optional install extra (`datajunction-server[bigquery]`).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
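The INFORMATION_SCHEMA introspection mentioned above can be sketched roughly as below. The SQL shape and helper name are illustrative, not the actual client code; in the real client the table name is bound with google-cloud-bigquery's QueryJobConfig and ScalarQueryParameter rather than string formatting.

```python
# Illustrative INFORMATION_SCHEMA query builder (hypothetical helper).
# Only the project/dataset are interpolated; the table name stays a
# bound parameter (@table_name) to avoid injection.
INTROSPECTION_SQL = """
SELECT column_name, data_type, ordinal_position
FROM `{project}.{dataset}`.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = @table_name
ORDER BY ordinal_position
"""

def build_introspection_query(project: str, dataset: str) -> str:
    """Return the column-introspection SQL for one dataset."""
    return INTROSPECTION_SQL.format(project=project, dataset=dataset)
```

In BigQuery, INFORMATION_SCHEMA.COLUMNS is scoped per dataset, which is why the dataset appears in the FROM clause rather than a WHERE filter.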
@netlify

netlify bot commented Mar 7, 2026

Deploy Preview for thriving-cassata-78ae72 canceled.

Name Link
🔨 Latest commit 432927c
🔍 Latest deploy log https://app.netlify.com/projects/thriving-cassata-78ae72/deploys/69acb04c9799940008e76172

colinmf and others added 4 commits March 7, 2026 15:44
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Import QueryJobConfig and ScalarQueryParameter at module level so
  tests can patch them (accessing via bigquery=None failed)
- Fix BIGNUMERIC/BIGDECIMAL to use DecimalType(38, 38) since DJ's
  DecimalType caps max_precision at 38

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Collaborator

@shangyian shangyian left a comment

Hi @colinmf, thanks for your contribution!

A few thoughts here:

DJ's design intentionally separates semantic layer concerns (e.g. datajunction-server) from query execution (the query service, which is packaged up in datajunction-query). I see that right now you've only implemented the schema introspection part of the BigQueryClient - that's consistent with how Snowflake is handled in query_clients/snowflake.py, so this looks good.

That said, if the intent is to eventually support query execution through this client too, that's where I'd push back. Query execution in the semantic layer creates scaling problems since they have very different resource profiles. If that comes up, the right home would be a BigQuery query service implementation in datajunction-query.

Mirrors SnowflakeClient's _get_database_from_engine approach: parses
the GCP project from the engine URI netloc (bigquery://my-gcp-project)
so different DJ catalogs can point to different GCP projects.

Also adds BigQuery env config example to .env and updates tests.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
@colinmf colinmf marked this pull request as draft March 7, 2026 16:53
colinmf and others added 5 commits March 7, 2026 18:02
* Add engine URI project resolution to BigQueryClient

Mirrors SnowflakeClient's _get_database_from_engine approach: parses
the GCP project from the engine URI netloc (bigquery://my-gcp-project)
so different DJ catalogs can point to different GCP projects.

Also adds BigQuery env config example to .env and updates tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Address review comments on BigQueryClient

- Add sqlglot to bigquery extra for dialect transpilation support
- Add BIGQUERY_AVAILABLE import coverage tests (True/False paths)
- Add BigQuery config documentation with examples to QueryClientConfig
- Remove redundant 0-based index comment in get_columns_for_table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…_client improvements

- Keep comprehensive _get_project_from_engine (host, path, query param fallbacks)
- Use _get_client(project=...) from fork to pass resolved project to BigQuery client
- Merge test suites: retain all URI parsing tests + fork's _get_client injection test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
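The comprehensive URI resolution described in the commit above (host, then path, then query-param fallbacks) can be sketched with the standard library. The helper name mirrors the one mentioned in the commit but the body is illustrative, not the actual implementation.

```python
from urllib.parse import parse_qs, urlparse

def get_project_from_engine_uri(uri: str):
    """Resolve the GCP project from an engine URI such as
    bigquery://my-gcp-project, trying the host, then the first
    path segment, then a ?project= query parameter."""
    parsed = urlparse(uri)
    if parsed.netloc:
        return parsed.netloc
    segments = [s for s in parsed.path.split("/") if s]
    if segments:
        return segments[0]
    params = parse_qs(parsed.query)
    if "project" in params:
        return params["project"][0]
    return None
```

This ordering lets different DJ catalogs point to different GCP projects simply by varying the engine URI.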
Add BigQuery as a supported engine type in the query service, following
the existing Snowflake pattern. Supports project config via extra_params,
credentials via config or GOOGLE_APPLICATION_CREDENTIALS env var, and
Application Default Credentials as fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@colinmf
Contributor Author

colinmf commented Mar 7, 2026

Hi @colinmf, thanks for your contribution!

A few thoughts here:

DJ's design intentionally separates semantic layer concerns (eg datajunction-server) from query execution (the query service, which is packaged up in datajunction-query). I see that right now you've only implemented the schema introspection part of the BigQueryClient - that's consistent with how Snowflake is handled in query_clients/snowflake.py, so this looks good.

That said, if the intent is to eventually support query execution through this client too, that's where I'd push back. Query execution in the semantic layer creates scaling problems since they have very different resource profiles. If that comes up, the right home would be a BigQuery query service implementation in datajunction-query.

@shangyian

Thanks for the review

Agreed, and thanks for the guidance. The BigQueryClient in datajunction-server remains schema-introspection only (consistent with the Snowflake pattern in query_clients/snowflake.py).

For query execution, we've added BigQuery support in datajunction-query following the existing Snowflake pattern: EngineType.BIGQUERY in config, run_bigquery_query() in engine.py, with credentials via extra_params or GOOGLE_APPLICATION_CREDENTIALS.

colinmf and others added 4 commits March 7, 2026 22:02
Cover credentials path, location, env var fallback, error handling,
multi-row and empty results in datajunction-query. Add client project
override, location, factory with all options, unsupported type, engine
URI project override, and credentials precedence tests in
datajunction-server.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap BigQuery rows in iter() to match Stream (Iterator) type.
Apply ruff format to test file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mock QueryJobConfig and ScalarQueryParameter which are None in CI
(google-cloud-bigquery not installed), matching the pattern used by
other get_columns_for_table tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cover empty path segment fallthrough (131->134) and query params
without project key (136->145) to reach 100% branch coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@colinmf colinmf marked this pull request as ready for review March 7, 2026 22:18
@colinmf colinmf requested a review from shangyian March 7, 2026 22:19
colinmf and others added 2 commits March 8, 2026 00:03
DJ generates SQL with catalog-prefixed table names (e.g. my_catalog.dataset.table)
but BigQuery interprets three-part names as project.dataset.table. Since the BQ
client already has the project configured, strip the catalog prefix so BigQuery
receives dataset.table references. Also remove the GOOGLE_APPLICATION_CREDENTIALS
env var fallback from credentials_path — let bigquery.Client() handle ADC natively.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
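The catalog-prefix rewrite described in the commit above can be sketched for a single table reference. This is an assumption-laden simplification (the real rewrite operates on generated SQL, not bare identifiers, and the helper name is hypothetical):

```python
def strip_catalog_prefix(table_ref: str) -> str:
    """Drop the leading DJ catalog from a three-part name so BigQuery
    sees dataset.table; the client's configured project supplies the
    first part of the fully qualified name."""
    parts = table_ref.split(".")
    if len(parts) == 3:
        return ".".join(parts[1:])
    return table_ref
```

Without this, BigQuery would read my_catalog.dataset.table as project.dataset.table and fail to find the project named my_catalog.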
Collaborator

@shangyian shangyian left a comment

This looks good, thanks for addressing the comments @colinmf!

@colinmf
Contributor Author

colinmf commented Mar 8, 2026

This looks good, thanks for addressing the comments @colinmf!

@shangyian Thanks for approving! Let me know how the merge/release process works and whether I need to do anything on my end.

@shangyian shangyian merged commit 3d522dc into DataJunction:main Mar 8, 2026
17 checks passed
@shangyian shangyian mentioned this pull request Mar 13, 2026
