spark: get column-level lineage from JDBC dbtable option #2284

mobuchowski · 2023-11-29T01:06:41Z

There were several problems with JDBC column-level lineage.

First, it did not support the dbtable option, which meant that one-to-one relationships weren't supported by the column lineage collector.
Second, the JdbcColumnLineageCollector did not report lineage when there was only a single input column.
Third, it used the wrong, naive dataset name straight from the parser results, rather than the fixed one from JdbcUtils.

These changes fix that.
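
For context, Spark's JDBC source takes either a `dbtable` option (a table name or a parenthesized subquery) or a `query` option. When `dbtable` names a plain table, each output column depends on exactly one input column of the same name, which is the one-to-one case mentioned above. The sketch below illustrates that mapping in plain Java. It is not the code in JdbcColumnLineageCollector: the class and method names are hypothetical, and treating any value that starts with `(` as a subquery is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names do not come from the OpenLineage codebase.
public class DbtableLineageSketch {

  // Assumption: a dbtable value that starts with "(" is a subquery, anything else is a table name.
  static boolean isPlainTable(String dbtable) {
    return !dbtable.trim().startsWith("(");
  }

  // Maps each output column to the list of input fields it depends on.
  // For a plain table this is one-to-one: column -> [table.column].
  static Map<String, List<String>> oneToOneLineage(String dbtable, List<String> columns) {
    if (!isPlainTable(dbtable)) {
      throw new IllegalArgumentException("subqueries need SQL parsing, which this sketch skips");
    }
    Map<String, List<String>> lineage = new LinkedHashMap<>();
    for (String column : columns) {
      List<String> inputs = new ArrayList<>();
      inputs.add(dbtable.trim() + "." + column);
      lineage.put(column, inputs);
    }
    return lineage;
  }

  public static void main(String[] args) {
    // A single input column still produces a lineage entry.
    System.out.println(oneToOneLineage("orders", Arrays.asList("id")));
  }
}
```

A subquery passed through `dbtable` would instead have to go through a SQL parser to find its input columns, which is outside the scope of this sketch.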

pawel-big-lebowski · 2023-11-29T09:58:15Z

integration/spark/app/build.gradle

@@ -364,7 +368,7 @@ shadowJar {
    relocate 'org.apache.commons.beanutils', 'io.openlineage.spark.shaded.org.apache.commons.beanutils'
    relocate 'org.apache.http', 'io.openlineage.spark.shaded.org.apache.http'
    relocate 'org.yaml.snakeyaml', 'io.openlineage.spark.shaded.org.yaml.snakeyaml'
-    relocate 'org.slf4j', 'io.openlineage.spark.shaded.org.slf4j'
+//    relocate 'org.slf4j', 'io.openlineage.spark.shaded.org.slf4j'


We need to look closely at that relocate, since it breaks internal logging. Probably just make it compileOnly and add it in tests?

pawel-big-lebowski · 2023-11-29T10:00:37Z

...ration/spark/app/src/test/java/io/openlineage/spark/agent/SparkContainerIntegrationTest.java

+  void testJdbcColumnDbtable() {
+    GenericContainer mysql = SparkContainerUtils.makeMysqlContainer(network);
+    try {
+      mysql.start();


Wouldn't it be better to create a separate test suite, like SparkJdbcIntegrationTest, so that the MySQL container is shared by all such tests?

Wouldn't it be better to run Spark in memory and access the Spark session within the test method, instead of running a whole Docker container? Such tests are much faster than running Spark in Docker and do not require a mock server container.

Rewrote to Spark-in-memory test.

...rc/main/java/io/openlineage/spark/agent/lifecycle/plan/column/ColumnLevelLineageBuilder.java

integration/spark/app/src/test/resources/spark_scripts/spark_jdbc_column.py

pawel-big-lebowski

Looks good to me. Two minor comments added: please make sure they don't relate to leftover debugging changes.

@mobuchowski Super happy to see you back contributing to Spark integration 🥇

integration/spark/shared/src/main/java/io/openlineage/spark/agent/util/JdbcUtils.java

integration/spark/app/src/test/java/io/openlineage/spark/agent/SparkContainerUtils.java

pawel-big-lebowski · 2023-12-04T07:31:31Z

@mobuchowski Some tests are still failing: ColumnLineageIntegrationTest > columnLevelLineageSingleDestinationTest for 3.2 and 3.4.

codecov-commenter · 2023-12-04T23:14:31Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (0be9239) 81.41% compared to head (d8795ea) 81.41%.
Report is 10 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff            @@
##               main    #2284   +/-   ##
=========================================
  Coverage     81.41%   81.41%           
  Complexity      125      125           
=========================================
  Files            90       90           
  Lines          3804     3804           
  Branches         33       33           
=========================================
  Hits           3097     3097           
  Misses          668      668           
  Partials         39       39

Signed-off-by: Maciej Obuchowski <obuchowski.maciej@gmail.com>

mobuchowski requested a review from pawel-big-lebowski November 29, 2023 01:06

boring-cyborg bot added ci integration/spark labels Nov 29, 2023

mobuchowski force-pushed the spark-cll-dbtable branch 4 times, most recently from c9359c5 to 343db5a Compare November 29, 2023 10:05

pawel-big-lebowski reviewed Nov 29, 2023

mobuchowski force-pushed the spark-cll-dbtable branch 3 times, most recently from a3f733c to 012a13d Compare November 29, 2023 15:27

mobuchowski requested a review from pawel-big-lebowski November 30, 2023 09:58

mobuchowski force-pushed the spark-cll-dbtable branch from 012a13d to bcdc5a8 Compare November 30, 2023 10:08

pawel-big-lebowski approved these changes Nov 30, 2023

mobuchowski force-pushed the spark-cll-dbtable branch from bcdc5a8 to 5196ab0 Compare December 1, 2023 14:33

mobuchowski force-pushed the spark-cll-dbtable branch 8 times, most recently from 3c8a8bc to d8795ea Compare December 4, 2023 22:48

spark: get column-level lineage from JDBC dbtable option

f7ade8d

Signed-off-by: Maciej Obuchowski <obuchowski.maciej@gmail.com>

mobuchowski force-pushed the spark-cll-dbtable branch from d8795ea to f7ade8d Compare December 5, 2023 10:07

mobuchowski merged commit 64993ca into main Dec 5, 2023
55 checks passed

mobuchowski deleted the spark-cll-dbtable branch December 5, 2023 10:36
