spark: get column-level lineage from JDBC dbtable option #2284
Conversation
integration/spark/app/build.gradle
Outdated
@@ -364,7 +368,7 @@ shadowJar {
     relocate 'org.apache.commons.beanutils', 'io.openlineage.spark.shaded.org.apache.commons.beanutils'
     relocate 'org.apache.http', 'io.openlineage.spark.shaded.org.apache.http'
     relocate 'org.yaml.snakeyaml', 'io.openlineage.spark.shaded.org.yaml.snakeyaml'
-    relocate 'org.slf4j', 'io.openlineage.spark.shaded.org.slf4j'
+    // relocate 'org.slf4j', 'io.openlineage.spark.shaded.org.slf4j'
?
We need to dive deep into that relocate, since it makes internal logging not work... probably just make it compileOnly and add it in tests?
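The compileOnly route suggested above could be sketched in build.gradle roughly like this (an illustration of the idea only; the dependency coordinates and version shown are assumptions, not the actual change):

```groovy
// Sketch only: keep slf4j out of the shaded jar so internal logging keeps
// working, compile against the API, and put it back on the test classpath.
// The slf4j version here is an assumption.
dependencies {
    compileOnly 'org.slf4j:slf4j-api:1.7.36'
    testImplementation 'org.slf4j:slf4j-api:1.7.36'
}
```

With this, the relocate line stays out of the shadowJar block and the shaded jar defers to whatever slf4j binding the host application provides.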
void testJdbcColumnDbtable() {
  GenericContainer mysql = SparkContainerUtils.makeMysqlContainer(network);
  try {
    mysql.start();
Wouldn't it be better to create a separate test suite like SparkJdbcIntegrationTest, so the MySQL Docker container is shared across all such tests?
Wouldn't it be better to run Spark in memory and access the Spark session within the test method, instead of running the whole Docker setup? Such tests are way faster than running Spark in Docker and don't require a mock server container.
Rewrote to Spark-in-memory test.
...rc/main/java/io/openlineage/spark/agent/lifecycle/plan/column/ColumnLevelLineageBuilder.java
Outdated
integration/spark/app/src/test/resources/spark_scripts/spark_jdbc_column.py
Outdated
Looks good to me. Two minor comments added: please make sure they aren't leftover debug changes.
@mobuchowski Super happy to see you back contributing to Spark integration 🥇
integration/spark/shared/src/main/java/io/openlineage/spark/agent/util/JdbcUtils.java
Outdated
integration/spark/app/src/test/java/io/openlineage/spark/agent/SparkContainerUtils.java
Outdated
@mobuchowski Some tests are still failing
Codecov Report
All modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##               main    #2284   +/-   ##
=========================================
  Coverage     81.41%   81.41%
  Complexity      125      125
=========================================
  Files            90       90
  Lines          3804     3804
  Branches         33       33
=========================================
  Hits           3097     3097
  Misses          668      668
  Partials         39       39

☔ View full report in Codecov by Sentry.
Signed-off-by: Maciej Obuchowski <obuchowski.maciej@gmail.com>
There were several problems with JDBC column-level lineage.

First, it did not support the dbtable option, which meant one-to-one relationships weren't supported by the column lineage collector.

Second, JdbcColumnLineageCollector did not report lineage when there was only a single input column.

Third, it used a wrong, naive dataset name taken straight from the parser results, rather than the corrected one from JdbcUtils.

These changes fix that.
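The actual JdbcUtils naming fix isn't shown in this thread. As an illustration only, deriving a dataset namespace and name from a JDBC URL and a dbtable option might look like the sketch below (Python; the function names, normalization rules, and error handling are all assumptions, not the real JdbcUtils logic):

```python
import re

def dataset_namespace(jdbc_url: str) -> str:
    """Hypothetical sketch: derive a dataset namespace from a JDBC URL by
    stripping the 'jdbc:' scheme prefix, inline credentials, and query
    parameters. This is NOT the actual JdbcUtils implementation."""
    url = re.sub(r"^jdbc:", "", jdbc_url)   # drop the jdbc: scheme prefix
    url = re.sub(r"//[^@/]+@", "//", url)   # drop user:password@ credentials
    url = url.split("?", 1)[0]              # drop query parameters
    return url

def dataset_name(dbtable: str) -> str:
    """Hypothetical: the dbtable option may hold either a plain table name
    or a parenthesized subquery; only a plain name maps directly to a
    dataset, a subquery needs real SQL parsing."""
    dbtable = dbtable.strip()
    if dbtable.startswith("("):
        raise ValueError("subquery dbtable needs SQL parsing, not a plain name")
    return dbtable

print(dataset_namespace("jdbc:mysql://user:pass@localhost:3306/db?useSSL=false"))
# mysql://localhost:3306/db
print(dataset_name("customers"))
# customers
```

The point of the sketch is the third fix above: the dataset name should come from a normalized URL, not verbatim from parser output, so that the same table reached via different connection strings maps to one dataset.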