Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPEC] Add fileCount to dataset stats facets #2562

Merged

Conversation

dolfinus
Copy link
Contributor

@dolfinus dolfinus commented Apr 3, 2024

Problem

馃憢 Thanks for opening a pull request! Please include a brief summary of the problem your change is trying to solve, or bug fix. If your change fixes a bug or you'd like to provide context on why you're making the change, please link the issue as follows:

Closes: #2550

Solution

Please describe your change as it relates to the problem, or bug fix, as well as any dependencies. If your change requires a schema change, please describe the schema modification(s) and whether it's a backwards-incompatible or backwards-compatible change, then select one of the following:

Note: All schema changes require discussion. Please link the issue for context.

  • Your change modifies the core OpenLineage model
  • Your change modifies one or more OpenLineage facets

If you're contributing a new integration, please specify the scope of the integration and how/where it has been tested (e.g., Apache Spark integration supports S3 and GCS filesystem operations, tested with AWS EMR).

One-line summary:

Adds "fileCount" field to DataQualityMetricsInputDatasetFacet and OutputStatisticsOutputDatasetFacet specification

Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (if necessary)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
  • You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2023 contributors to the OpenLineage project

@boring-cyborg boring-cyborg bot added area:client/java openlineage-java area:client/python openlineage-python area:documentation Improvements or additions to documentation area:integration/spark kind:proposal A formal proposal for a spec-related or significant change area:proxy Proxy configuration and issues area:spec Specifications and standards for the project labels Apr 3, 2024
@dolfinus dolfinus force-pushed the feature/dataset-stats-file-count branch 2 times, most recently from 4a488ae to dfbf2b8 Compare April 3, 2024 10:32
proposals/168/making_spark_visitors_extensible.md Outdated Show resolved Hide resolved
spec/facets/DataQualityMetricsInputDatasetFacet.json Outdated Show resolved Hide resolved
spec/facets/OutputStatisticsOutputDatasetFacet.json Outdated Show resolved Hide resolved
@dolfinus dolfinus force-pushed the feature/dataset-stats-file-count branch 4 times, most recently from 5e203dd to 35f9b65 Compare April 3, 2024 11:25
Signed-off-by: Martynov Maxim <martinov_m_s_@mail.ru>
@dolfinus dolfinus force-pushed the feature/dataset-stats-file-count branch from 35f9b65 to abeda68 Compare April 3, 2024 11:44
@dolfinus dolfinus changed the title spec: Add fileCount to dataset input & output stats spec: Add fileCount to dataset stats facets Apr 3, 2024
@dolfinus dolfinus changed the title spec: Add fileCount to dataset stats facets Spec: Add fileCount to dataset stats facets Apr 3, 2024
@dolfinus dolfinus marked this pull request as ready for review April 3, 2024 11:47
@dolfinus dolfinus requested a review from JDarDagran April 3, 2024 11:47
@dolfinus
Copy link
Contributor Author

dolfinus commented Apr 3, 2024

Changing number of fields in OutputStatisticsOutputDatasetFacet from 2 to 3 lead to replacing method .newOutputStatisticsOutputDatasetFacet(rowCount, size) with .newOutputStatisticsOutputDatasetFacetBuilder().rowCount(...).size(...).
This could introduce compatibility issues for users which use Java client.

@dolfinus dolfinus changed the title Spec: Add fileCount to dataset stats facets [SPEC] Add fileCount to dataset stats facets Apr 3, 2024
@JDarDagran
Copy link
Contributor

Changing number of fields in OutputStatisticsOutputDatasetFacet from 2 to 3 lead to replacing method .newOutputStatisticsOutputDatasetFacet(rowCount, size) with .newOutputStatisticsOutputDatasetFacetBuilder().rowCount(...).size(...). This could introduce compatibility issues for users which use Java client.

I don't understand why? I just checked and

public OutputStatisticsOutputDatasetFacet newOutputStatisticsOutputDatasetFacet(Long rowCount,
      Long size, Long fileCount) {
    return new OutputStatisticsOutputDatasetFacet(this.producer, rowCount, size, fileCount);
  }

gets generated as well.

@dolfinus
Copy link
Contributor Author

dolfinus commented Apr 3, 2024

Ah, I see, now method accepts 3 arguments instead of 2. Without default value for filesCount, users have to update their code to pass new argument to the method

@mobuchowski
Copy link
Member

Having exact number of arguments is by design. If you don't care you can use builder as solution here.

@JDarDagran
Copy link
Contributor

LGTM 馃憤
@dolfinus, may I kindly ask how did you find the need to add this field?
Would it be worth/possible to add the statistic somewhere in any of the existing integration?

@dolfinus
Copy link
Contributor Author

dolfinus commented Apr 4, 2024

I'm developing a ETL tool which allows both manipulating data in DBMS and file systems using PySpark, but also provide a way to download/upload/move raw files in file systems. Using fileCount field I can collect information that some process, for example, downloaded 100 files from SFTP to S3, but next ETL process in the chain read only 98 files from S3, and 2 files are missing for some reason.

Would it be worth/possible to add the statistic somewhere in any of the existing integration?

Yes, but in a separate PR. This may be tricky, for example Spark does not provide metrics for number of read/written files. In general, number of dataframe partitions is equal to number of files, so we can count number of successful tasks and use it as a file count. But Spark's Catalist can merge small partitions to a large one, or read one file in many partitions if file format is splittable, so this can produce wrong results.

@mobuchowski mobuchowski merged commit ba870d4 into OpenLineage:main Apr 4, 2024
72 checks passed
@mobuchowski
Copy link
Member

Thanks for contribution @dolfinus .

@dolfinus dolfinus deleted the feature/dataset-stats-file-count branch April 4, 2024 12:54
blacklight pushed a commit to blacklight/OpenLineage that referenced this pull request Apr 4, 2024
Signed-off-by: Martynov Maxim <martinov_m_s_@mail.ru>
Signed-off-by: Fabio Manganiello <fabio@manganiello.tech>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:client/java openlineage-java area:client/python openlineage-python area:documentation Improvements or additions to documentation area:integration/spark area:proxy Proxy configuration and issues area:spec Specifications and standards for the project kind:proposal A formal proposal for a spec-related or significant change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[PROPOSAL] Add fileCount to dataset stats facet
3 participants