Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SQL] Fix parsing JDBCOptions(table=...) containing subquery #2546

Merged

Conversation

dolfinus
Copy link
Contributor

@dolfinus dolfinus commented Mar 29, 2024

Problem

馃憢 Thanks for opening a pull request! Please include a brief summary of the problem your change is trying to solve, or bug fix. If your change fixes a bug or you'd like to provide context on why you're making the change, please link the issue as follows:

In the following case:

df = spark.read.jdbc(table="(select * from table where filter='value') T")

openlineage-spark returns datasets with names like database.(select * from table where filter='value') T, not database.table. This is because JDBCOptions.table is expected to always return a table name, not subquery.

Solution

Please describe your change as it relates to the problem, or bug fix, as well as any dependencies. If your change requires a schema change, please describe the schema modification(s) and whether it's a backwards-incompatible or backwards-compatible change, then select one of the following:

Note: All schema changes require discussion. Please link the issue for context.

  • Your change modifies the core OpenLineage model
  • Your change modifies one or more OpenLineage facets

If you're contributing a new integration, please specify the scope of the integration and how/where it has been tested (e.g., Apache Spark integration supports S3 and GCS filesystem operations, tested with AWS EMR).

If JDBCOptions.table contains a subquery, fallback to branch with SQL parsing to extract table names from this query.

I've also added a test that openlineage-spark can parse query of unknown JDBC dialect.

One-line summary:

Fix openlineage-spark producing datasets with names like database.(select * from table) for JDBC sources.

Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (if necessary)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
  • You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2023 contributors to the OpenLineage project

@dolfinus dolfinus changed the title Fix parsing JDBCOptions(table=...) containing subquery [SQL] Fix parsing JDBCOptions(table=...) containing subquery Mar 29, 2024
@mobuchowski mobuchowski self-requested a review March 29, 2024 11:51
@dolfinus dolfinus force-pushed the bug/spark-parse-table-subquery branch from 823de3d to 5682e76 Compare March 29, 2024 12:03
@@ -91,6 +92,12 @@ void testHandlingJdbcTable() {

@Test
void testHandlingJdbcDbTableAsSubQuery() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to create a test testHandlingJdbcDbTableAsSubQueryWithUnknownDialect? I am not following how testUnknownDialect verifies a behaviour described within issue description.

Copy link
Contributor Author

@dolfinus dolfinus Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testUnknownDialect is just checking that if here dialect is null, query still can be parsed using generic dialect:

String dialect = extractDialectFromJdbcUrl(relation.jdbcOptions().url());
Optional<SqlMeta> sqlMeta = OpenLineageSql.parse(Collections.singletonList(query), dialect);

I'm using Clickhouse JDBC driver, and want to be sure that even without Clickhouse dialect support in openlineage-sql, I still can get table names read by Spark.

Copy link
Contributor Author

@dolfinus dolfinus Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redarding testHandlingJdbcDbTableAsSubQueryWithUnknownDialect test, I don't think it will make much sense. Both cases table is not set, query=SELECT ... and table=(SELECT...) are now handled in the same way regardless the dialect, and covered by tests above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Common practice about fixing issues and unit tests is to: (1) write a failing unit test and (2) fix a code to make a failing test pass.

Issue description is

openlineage-spark returns datasets with names like database.(select * from table where filter='value') T, not database.table

Is there any test verifying this?
How does the introduced test testUnknownDialect relate to issue description and code that has been modified?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1) write a failing unit test
(2) fix a code to make a failing test pass

I've updated testHandlingJdbcDbTableAsSubQuery test to cover this. Previously it covered just the same case as testHandlingJdbcQuery

Signed-off-by: Martynov Maxim <martinov_m_s_@mail.ru>
@dolfinus dolfinus force-pushed the bug/spark-parse-table-subquery branch from 5682e76 to fb46c07 Compare April 2, 2024 08:01
@pawel-big-lebowski pawel-big-lebowski merged commit 6bb5142 into OpenLineage:main Apr 2, 2024
33 checks passed
@dolfinus dolfinus deleted the bug/spark-parse-table-subquery branch April 2, 2024 08:53
blacklight pushed a commit to blacklight/OpenLineage that referenced this pull request Apr 4, 2024
)

Signed-off-by: Martynov Maxim <martinov_m_s_@mail.ru>
Signed-off-by: Fabio Manganiello <fabio@manganiello.tech>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants