spark-bigquery: fix a few of the common errors #1377
Conversation
Amazing PR with a tremendous amount of new features and improvements 💯
Added some comments with questions that may help me understand the changes.
Super happy that the changes are covered with an integration test against BigQuery 🚀🚀🚀🚀🚀🚀🚀 🥳
integration/spark/app/src/main/java/io/openlineage/spark/agent/OpenLineageSparkListener.java
...on/spark/app/src/test/java/io/openlineage/spark/agent/lifecycle/SparkReadWriteIntegTest.java
integration/spark/app/src/test/resources/spark_scripts/spark_bigquery.py
```diff
@@ -45,7 +45,7 @@ configurations {

 ext {
     assertjVersion = '3.23.1'
-    bigqueryVersion = '0.26.0'
+    bigqueryVersion = '0.22.0'
```
Any reason for that?
We want to use the same version of the dependency as the baseline everywhere in the code.
Sure, there are just two places within the codebase where this version is kept. Upgrading the other location is the preferred solution, I guess.
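For reference, a hedged sketch of the single-source approach discussed here (the file layout and artifact coordinate are assumptions, not taken from this repository): the version can be declared once in the root project's `ext` block and referenced from subprojects instead of being hard-coded in two places.

```groovy
// Root build.gradle (hypothetical): declare the version exactly once.
ext {
    bigqueryVersion = '0.26.0'
}

// Subproject build.gradle (hypothetical): reference the shared value
// instead of repeating the literal version string.
dependencies {
    implementation "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:${rootProject.ext.bigqueryVersion}"
}
```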
@pawel-big-lebowski moved to 0.26.0
...shared/src/main/java/io/openlineage/spark/agent/lifecycle/plan/BigQueryNodeInputVisitor.java
Signed-off-by: Maciej Obuchowski <obuchowski.maciej@gmail.com>
This PR fixes a few of the common issues with the spark-bigquery integration and adds an integration test for it, together with the CI configuration for it.

- There are two `spark-bigquery` dependencies: `spark-bigquery` itself, and `spark-bigquery-with-dependencies`. They aren't compatible - for example, we use `com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryUtil`, which is `com.google.cloud.bigquery.connector.common.BigQueryUtil` in the non-with-dependencies version. Previously we mixed those two; now we use the with-dependencies version everywhere, as it's 10x more popular on Maven Central and recommended everywhere.
- Split `BigQueryNodeVisitor` into `BigQueryInputNodeVisitor` and `BigQueryOutputNodeVisitor`. The "output" visitors process only the root node, while the "input" ones process the whole tree. If we have one node visitor that is registered for both input and output events, the root node is double counted. The split prevents that.
- Excluded `SaveIntoDataSourceCommandVisitor` from processing `BigQueryRelation`.
- `BigQueryOutputNodeVisitor` does not try to create a `BigQueryRelation` to get the proper table name, but utilizes `BigQueryUtil.friendlyTableName(config.getTableId())` instead. This prevents an error when the output table does not exist yet and is created by the Spark process.
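The double-counting rationale behind the visitor split can be sketched without any Spark or OpenLineage dependencies. The sketch below uses hypothetical names (`Node`, `visitTree`, `visitRoot`, and the sample plan) that are illustrations, not the project's actual classes: the "input" pass walks the whole tree while the "output" pass sees only the root, so one shared visitor registered for both events reports the root twice, whereas two visitors that each match only their own node kind count every node once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Minimal sketch (hypothetical names) of why a single visitor registered for
// both input and output events double-counts the root of the logical plan.
public class VisitorSplitSketch {

    // A tiny stand-in for a Spark logical plan node.
    static final class Node {
        final String name;
        final List<Node> children;
        Node(String name, Node... children) {
            this.name = name;
            this.children = List.of(children);
        }
    }

    // "Input" traversal: collect matching nodes over the whole tree.
    static void visitTree(Node node, Predicate<Node> matches, List<String> acc) {
        if (matches.test(node)) acc.add(node.name);
        for (Node child : node.children) visitTree(child, matches, acc);
    }

    // "Output" traversal: apply the visitor to the root node only.
    static void visitRoot(Node root, Predicate<Node> matches, List<String> acc) {
        if (matches.test(root)) acc.add(root.name);
    }

    static Node samplePlan() {
        // Root command writing to BigQuery, reading from one BigQuery relation.
        return new Node("BigQuerySaveCommand", new Node("BigQueryRelation"));
    }

    // One shared visitor for both events: the root command is seen twice.
    static List<String> sharedVisitor() {
        Predicate<Node> matchesAnyBigQueryNode = n -> n.name.startsWith("BigQuery");
        List<String> seen = new ArrayList<>();
        visitTree(samplePlan(), matchesAnyBigQueryNode, seen); // input pass: root + child
        visitRoot(samplePlan(), matchesAnyBigQueryNode, seen); // output pass: root again
        return seen;
    }

    // Split visitors, each matching only the node kind it is meant for.
    static List<String> splitVisitors() {
        Predicate<Node> inputVisitor = n -> n.name.equals("BigQueryRelation");
        Predicate<Node> outputVisitor = n -> n.name.equals("BigQuerySaveCommand");
        List<String> seen = new ArrayList<>();
        visitTree(samplePlan(), inputVisitor, seen);  // inputs: relations only
        visitRoot(samplePlan(), outputVisitor, seen); // output: root command only
        return seen;
    }

    public static void main(String[] args) {
        // Shared: [BigQuerySaveCommand, BigQueryRelation, BigQuerySaveCommand]
        System.out.println(sharedVisitor());
        // Split: [BigQueryRelation, BigQuerySaveCommand] - each node counted once
        System.out.println(splitVisitors());
    }
}
```

The split version mirrors the fix described above: the input visitor never fires on the root command, and the output visitor never walks into the subtree, so neither node is reported twice.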