spark: get column-level lineage from JDBC dbtable option #2284

Merged
mobuchowski merged 1 commit into main from spark-cll-dbtable on Dec 5, 2023

mobuchowski
Member

There were several problems with JDBC column-level lineage.

First, it did not support the dbtable option, which meant that one-to-one relationships weren't supported by the column lineage collector.
Second, JdbcColumnLineageCollector did not report lineage when there was only a single input column.
Third, it used the wrong, naive dataset name taken straight from the parser results, rather than the fixed one from JdbcUtils.

These changes fix all three.
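
To illustrate the first two points, here is a minimal, self-contained sketch of what one-to-one column lineage for a dbtable read amounts to. It is not the code from this PR: the class `DbtableLineageSketch`, the `InputField` type, and the namespace and dataset values are all hypothetical, and it assumes dbtable holds a plain table name rather than a subquery.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from OpenLineage.
class DbtableLineageSketch {

    // One input field that an output column depends on.
    static final class InputField {
        final String namespace;
        final String dataset;
        final String field;

        InputField(String namespace, String dataset, String field) {
            this.namespace = namespace;
            this.dataset = dataset;
            this.field = field;
        }

        @Override
        public String toString() {
            return namespace + "/" + dataset + "." + field;
        }
    }

    // For a plain table read, each output column depends on exactly one input:
    // the column of the same name in the source table.
    static Map<String, List<InputField>> oneToOne(
            String namespace, String dataset, List<String> columns) {
        Map<String, List<InputField>> lineage = new LinkedHashMap<>();
        for (String column : columns) {
            lineage.put(column, List.of(new InputField(namespace, dataset, column)));
        }
        return lineage;
    }

    public static void main(String[] args) {
        // Assumed example values; a single-column table is reported like any other.
        Map<String, List<InputField>> lineage =
                oneToOne("mysql://localhost:3306", "test.books", List.of("id"));
        lineage.forEach((output, inputs) -> System.out.println(output + " <- " + inputs));
    }
}
```

With a single column the map still has one entry, which corresponds to the case that previously went unreported.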

@@ -364,7 +368,7 @@ shadowJar {
 relocate 'org.apache.commons.beanutils', 'io.openlineage.spark.shaded.org.apache.commons.beanutils'
 relocate 'org.apache.http', 'io.openlineage.spark.shaded.org.apache.http'
 relocate 'org.yaml.snakeyaml', 'io.openlineage.spark.shaded.org.yaml.snakeyaml'
-relocate 'org.slf4j', 'io.openlineage.spark.shaded.org.slf4j'
+// relocate 'org.slf4j', 'io.openlineage.spark.shaded.org.slf4j'
Contributor

?

Member Author

We need to look more closely at that relocate, since it breaks internal logging. Probably just make it compileOnly and add it in tests?

void testJdbcColumnDbtable() {
GenericContainer mysql = SparkContainerUtils.makeMysqlContainer(network);
try {
mysql.start();
Contributor

Wouldn't it be better to create a separate test suite, like SparkJdbcIntegrationTest, so that the MySQL Docker container is shared by all such tests?

Wouldn't it be better to run Spark in memory and access the Spark session within the test method instead of running the whole thing in Docker? Such tests are much faster than running Spark in Docker and do not require a mock server container.

Member Author

Rewrote it as a Spark in-memory test.

Contributor

@pawel-big-lebowski pawel-big-lebowski left a comment

Looks good to me. I added two minor comments; please make sure the flagged changes aren't leftover debugging.

@mobuchowski Super happy to see you back contributing to Spark integration 🥇

@pawel-big-lebowski
Contributor

@mobuchowski Some tests are still failing: ColumnLineageIntegrationTest > columnLevelLineageSingleDestinationTest, for 3.2 and 3.4.

@mobuchowski mobuchowski force-pushed the spark-cll-dbtable branch 8 times, most recently from 3c8a8bc to d8795ea on December 4, 2023 22:48
@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (0be9239) 81.41% compared to head (d8795ea) 81.41%.
Report is 10 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #2284   +/-   ##
=========================================
  Coverage     81.41%   81.41%           
  Complexity      125      125           
=========================================
  Files            90       90           
  Lines          3804     3804           
  Branches         33       33           
=========================================
  Hits           3097     3097           
  Misses          668      668           
  Partials         39       39           


Signed-off-by: Maciej Obuchowski <obuchowski.maciej@gmail.com>
@mobuchowski mobuchowski merged commit 64993ca into main Dec 5, 2023
55 checks passed
@mobuchowski mobuchowski deleted the spark-cll-dbtable branch December 5, 2023 10:36