spark-bigquery: fix a few of the common errors #1377

Merged
2 commits merged into main on Dec 8, 2022

mobuchowski
Member

This PR fixes a few of the common issues with the spark-bigquery integration and adds an integration test for it, together with the matching CI configuration.

  • There are two spark-bigquery dependencies: spark-bigquery itself and spark-bigquery-with-dependencies. They aren't compatible. For example, we use com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryUtil, which is com.google.cloud.bigquery.connector.common.BigQueryUtil in the non-with-dependencies version. Previously we mixed the two; now we use the with-dependencies version everywhere, as it's 10x more popular on Maven Central and is the one that is widely recommended.
  • Split BigQueryNodeVisitor into BigQueryInputNodeVisitor and BigQueryOutputNodeVisitor. The "output" visitors process only the root node, while the "input" ones process the whole tree. If a single node visitor is registered for both input and output events, the root node is counted twice. The split prevents that.
  • Prevent SaveIntoDataSourceCommandVisitor from processing BigQueryRelation.
  • BigQueryOutputNodeVisitor no longer tries to create a BigQueryRelation to get the proper table name; it uses BigQueryUtil.friendlyTableName(config.getTableId()) instead. This prevents an error when the output table does not exist yet and is created by the Spark process.
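
To illustrate the double counting described in the second point, here is a minimal, self-contained sketch. It is not the code from this PR: `Node`, `Kind`, `visitTree`, `visitRoot`, and the three predicates are hypothetical stand-ins for the plan nodes and visitors, and the table names are made up. It only assumes what is stated above, namely that input visitors walk the whole tree and output visitors look at the root only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

public class VisitorSplitSketch {

    // Hypothetical node kinds; not Spark or OpenLineage classes.
    enum Kind { BIGQUERY_READ, BIGQUERY_WRITE, OTHER }

    // Hypothetical stand-in for a logical plan node.
    static final class Node {
        final Kind kind;
        final String table;
        final List<Node> children;

        Node(Kind kind, String table, Node... children) {
            this.kind = kind;
            this.table = table;
            this.children = Arrays.asList(children);
        }
    }

    // Input-style processing: apply the visitor to the root and every descendant.
    static List<String> visitTree(Node node, Predicate<Node> visitor) {
        List<String> found = new ArrayList<>();
        if (visitor.test(node)) {
            found.add(node.table);
        }
        for (Node child : node.children) {
            found.addAll(visitTree(child, visitor));
        }
        return found;
    }

    // Output-style processing: apply the visitor to the root node only.
    static List<String> visitRoot(Node root, Predicate<Node> visitor) {
        return visitor.test(root)
            ? Collections.singletonList(root.table)
            : Collections.<String>emptyList();
    }

    // One visitor registered for both inputs and outputs matches reads and writes.
    static final Predicate<Node> COMBINED = n -> n.kind != Kind.OTHER;
    // After the split, each visitor matches only the node kind it is responsible for.
    static final Predicate<Node> INPUT_ONLY = n -> n.kind == Kind.BIGQUERY_READ;
    static final Predicate<Node> OUTPUT_ONLY = n -> n.kind == Kind.BIGQUERY_WRITE;

    // A write at the root, fed by a read further down the tree.
    static Node samplePlan() {
        return new Node(Kind.BIGQUERY_WRITE, "project.dataset.target",
            new Node(Kind.OTHER, "filter",
                new Node(Kind.BIGQUERY_READ, "project.dataset.source")));
    }

    public static void main(String[] args) {
        Node plan = samplePlan();
        // The combined visitor reports the root table as an input and as an output.
        System.out.println("combined inputs:  " + visitTree(plan, COMBINED));
        System.out.println("combined outputs: " + visitRoot(plan, COMBINED));
        // The split visitors report each table once.
        System.out.println("split inputs:     " + visitTree(plan, INPUT_ONLY));
        System.out.println("split outputs:    " + visitRoot(plan, OUTPUT_ONLY));
    }
}
```

With the combined predicate the root table shows up in both lists, which is the double counting; with the split predicates the source is reported only as an input and the target only as an output.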

@mobuchowski mobuchowski force-pushed the bigquery/do-not-create-relation-on-output branch 2 times, most recently from 58b8338 to 33998a8 Compare December 5, 2022 09:02
Contributor

@pawel-big-lebowski pawel-big-lebowski left a comment

Amazing PR with a tremendous amount of new features and improvements 💯
Added some comments with questions which may help me understand the changes.

Super happy that changes are covered with integration test with BigQuery 🚀🚀🚀🚀🚀🚀🚀 🥳

integration/spark/gradle.properties (outdated)
@@ -45,7 +45,7 @@ configurations {

 ext {
     assertjVersion = '3.23.1'
-    bigqueryVersion = '0.26.0'
+    bigqueryVersion = '0.22.0'
Contributor

Any reason for that?

Member Author

We want to use the same baseline version of the dependency everywhere in the code.

Contributor

Sure, there are just two places within the codebase where this version is kept. Upgrading the other location is the preferred solution, I guess.

Member Author

@pawel-big-lebowski moved to 0.26.0.

@mobuchowski mobuchowski force-pushed the bigquery/do-not-create-relation-on-output branch 12 times, most recently from 136aaf6 to 3ebf91d Compare December 7, 2022 17:49
Signed-off-by: Maciej Obuchowski <obuchowski.maciej@gmail.com>
@mobuchowski mobuchowski force-pushed the bigquery/do-not-create-relation-on-output branch 3 times, most recently from 1028869 to 056a017 Compare December 7, 2022 20:23
Signed-off-by: Maciej Obuchowski <obuchowski.maciej@gmail.com>
@mobuchowski mobuchowski force-pushed the bigquery/do-not-create-relation-on-output branch from 056a017 to 6f008ad Compare December 7, 2022 22:56
@mobuchowski mobuchowski merged commit f69c240 into main Dec 8, 2022
@mobuchowski mobuchowski deleted the bigquery/do-not-create-relation-on-output branch December 8, 2022 10:10