[SPARK] fix aggregate on databricks #1867

Merged
merged 1 commit into from
May 19, 2023

Conversation

pawel-big-lebowski
Contributor

@pawel-big-lebowski pawel-big-lebowski commented May 19, 2023

Problem

Column lineage is not collected on Databricks for aggregate queries such as groupBy.

Closes: #1861, #1821

Solution

Databricks uses a slightly different implementation of AggregateExpression: instead of a resultId method, it has resultIds, which is a list.
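
As an illustration only, the kind of fallback described above could be sketched with reflection as follows. This is not the code from this PR: the class `AggregateResultIds`, the method `extractResultIds`, and the two stub expression classes are hypothetical stand-ins, and the stubs return plain Java values, whereas the real Spark and Databricks classes may return other types (for example Scala collections) that would need conversion.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

public class AggregateResultIds {

    // Hypothetical stand-in for the variant that exposes a single resultId.
    public static class SingleIdExpression {
        public Long resultId() {
            return 7L;
        }
    }

    // Hypothetical stand-in for the variant that exposes a list via resultIds.
    public static class MultiIdExpression {
        public List<Long> resultIds() {
            return Arrays.asList(7L, 8L);
        }
    }

    // Returns the result ids of an expression regardless of which variant it is.
    // Tries resultId() first, then falls back to resultIds(); returns an empty
    // list when neither method is available.
    public static List<Object> extractResultIds(Object expression) {
        Class<?> type = expression.getClass();
        try {
            Method single = type.getMethod("resultId");
            return Collections.singletonList(single.invoke(expression));
        } catch (NoSuchMethodException e) {
            // No resultId() on this variant; try the list-based accessor below.
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not call resultId()", e);
        }
        try {
            Method multi = type.getMethod("resultIds");
            Object value = multi.invoke(expression);
            // Assumption of this sketch: the list is a java.util.Collection.
            return new ArrayList<Object>((Collection<?>) value);
        } catch (ReflectiveOperationException e) {
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        System.out.println(extractResultIds(new SingleIdExpression()));
        System.out.println(extractResultIds(new MultiIdExpression()));
    }
}
```

Looking the method up by name at runtime lets the same code work with either variant without compiling against a class that only exists in one environment.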

Note: All schema changes require discussion. Please link the issue for context.

  • Your change modifies the core OpenLineage model
  • Your change modifies one or more OpenLineage facets

If you're contributing a new integration, please specify the scope of the integration and how/where it has been tested (e.g., Apache Spark integration supports S3 and GCS filesystem operations, tested with AWS EMR).

One-line summary:

Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (if necessary)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
  • You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2023 contributors to the OpenLineage project

@boring-cyborg boring-cyborg bot added area:documentation Improvements or additions to documentation area:integration/spark labels May 19, 2023
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
@codecov-commenter

Codecov Report

Merging #1867 (b64749e) into main (4be5217) will decrease coverage by 16.45%.
The diff coverage is n/a.

@@              Coverage Diff              @@
##               main    #1867       +/-   ##
=============================================
- Coverage     81.05%   64.61%   -16.45%     
  Complexity      100      100               
=============================================
  Files            80       26       -54     
  Lines          3420      373     -3047     
  Branches         27       27               
=============================================
- Hits           2772      241     -2531     
+ Misses          617      101      -516     
  Partials         31       31               

see 54 files with indirect coverage changes

@pawel-big-lebowski pawel-big-lebowski merged commit 724a0ca into main May 19, 2023
15 checks passed
@pawel-big-lebowski pawel-big-lebowski deleted the spark/fix-databricks-group-by branch May 19, 2023 10:56
Labels
area:documentation Improvements or additions to documentation area:integration/spark tool:databricks Databricks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Column lineage not working for aggregates on Databricks
3 participants