spark-bigquery: fix a few of the common errors #1377
Conversation
Amazing PR with a tremendous amount of new features and improvements 💯
Added some comments with questions that may help me understand the changes.
Super happy that the changes are covered with an integration test against BigQuery 🚀🚀🚀🚀🚀🚀🚀 🥳
integration/spark/app/src/main/java/io/openlineage/spark/agent/OpenLineageSparkListener.java
...on/spark/app/src/test/java/io/openlineage/spark/agent/lifecycle/SparkReadWriteIntegTest.java
integration/spark/app/src/test/resources/spark_scripts/spark_bigquery.py
```diff
@@ -45,7 +45,7 @@ configurations {

 ext {
     assertjVersion = '3.23.1'
-    bigqueryVersion = '0.26.0'
+    bigqueryVersion = '0.22.0'
```
Any reason for that?
We want to use the same version of the dependency as the baseline everywhere in the code.
Sure, there are just two places within the codebase where this version is kept. Upgrading the other location is the preferred solution, I guess.
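For reference, a hedged sketch of the single-source approach discussed here (the file layout and artifact coordinate are assumptions, not taken from this repository): the version can be declared once in the root project's `ext` block and referenced from subprojects instead of being hard-coded in two places.

```groovy
// Root build.gradle (hypothetical): declare the version exactly once.
ext {
    bigqueryVersion = '0.26.0'
}

// Subproject build.gradle (hypothetical): reference the shared value
// instead of repeating the literal version string.
dependencies {
    implementation "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:${rootProject.ext.bigqueryVersion}"
}
```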
@pawel-big-lebowski moved to 0.26.0
...shared/src/main/java/io/openlineage/spark/agent/lifecycle/plan/BigQueryNodeInputVisitor.java
Signed-off-by: Maciej Obuchowski <obuchowski.maciej@gmail.com>
This PR fixes a few of the common issues with the spark-bigquery integration and adds an integration test for it, together with the CI configuration for it.

- There are two `spark-bigquery` dependencies: `spark-bigquery` itself, and `spark-bigquery-with-dependencies`. They aren't compatible - for example, we use `com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryUtil`, which is `com.google.cloud.bigquery.connector.common.BigQueryUtil` in the non-with-dependencies version. Previously we mixed those two; now we use the with-dependencies version everywhere, as it's 10x more popular on Maven Central and recommended everywhere.
- Split `BigQueryNodeVisitor` into `BigQueryInputNodeVisitor` and `BigQueryOutputNodeVisitor`. The "output" visitors process only the root node, while the "input" ones process the whole tree. If we have one node visitor that is registered for both input and output events, the root node is double counted. The split prevents that.
- Excluded `SaveIntoDataSourceCommandVisitor` from processing `BigQueryRelation`.
- `BigQueryOutputNodeVisitor` does not try to create a `BigQueryRelation` to get the proper table name, but utilizes `BigQueryUtil.friendlyTableName(config.getTableId())` instead. This prevents an error when the output table does not exist yet and is created by the Spark process.
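The double-counting rationale behind the visitor split can be sketched without any Spark or OpenLineage dependencies. The sketch below uses hypothetical names (`Node`, `visitTree`, `visitRoot`, and the sample plan) that are illustrations, not the project's actual classes: the "input" pass walks the whole tree while the "output" pass sees only the root, so one shared visitor registered for both events reports the root twice, whereas two visitors that each match only their own node kind count every node once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Minimal sketch (hypothetical names) of why a single visitor registered for
// both input and output events double-counts the root of the logical plan.
public class VisitorSplitSketch {

    // A tiny stand-in for a Spark logical plan node.
    static final class Node {
        final String name;
        final List<Node> children;
        Node(String name, Node... children) {
            this.name = name;
            this.children = List.of(children);
        }
    }

    // "Input" traversal: collect matching nodes over the whole tree.
    static void visitTree(Node node, Predicate<Node> matches, List<String> acc) {
        if (matches.test(node)) acc.add(node.name);
        for (Node child : node.children) visitTree(child, matches, acc);
    }

    // "Output" traversal: apply the visitor to the root node only.
    static void visitRoot(Node root, Predicate<Node> matches, List<String> acc) {
        if (matches.test(root)) acc.add(root.name);
    }

    static Node samplePlan() {
        // Root command writing to BigQuery, reading from one BigQuery relation.
        return new Node("BigQuerySaveCommand", new Node("BigQueryRelation"));
    }

    // One shared visitor for both events: the root command is seen twice.
    static List<String> sharedVisitor() {
        Predicate<Node> matchesAnyBigQueryNode = n -> n.name.startsWith("BigQuery");
        List<String> seen = new ArrayList<>();
        visitTree(samplePlan(), matchesAnyBigQueryNode, seen); // input pass: root + child
        visitRoot(samplePlan(), matchesAnyBigQueryNode, seen); // output pass: root again
        return seen;
    }

    // Split visitors, each matching only the node kind it is meant for.
    static List<String> splitVisitors() {
        Predicate<Node> inputVisitor = n -> n.name.equals("BigQueryRelation");
        Predicate<Node> outputVisitor = n -> n.name.equals("BigQuerySaveCommand");
        List<String> seen = new ArrayList<>();
        visitTree(samplePlan(), inputVisitor, seen);  // inputs: relations only
        visitRoot(samplePlan(), outputVisitor, seen); // output: root command only
        return seen;
    }

    public static void main(String[] args) {
        // Shared: [BigQuerySaveCommand, BigQueryRelation, BigQuerySaveCommand]
        System.out.println(sharedVisitor());
        // Split: [BigQueryRelation, BigQuerySaveCommand] - each node counted once
        System.out.println(splitVisitors());
    }
}
```

The split version mirrors the fix described above: the input visitor never fires on the root command, and the output visitor never walks into the subtree, so neither node is reported twice.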