[SPEC] Add fileCount to dataset stats facets #2562

dolfinus · 2024-04-03T10:32:00Z

Problem

👋 Thanks for opening a pull request! Please include a brief summary of the problem your change is trying to solve, or bug fix. If your change fixes a bug or you'd like to provide context on why you're making the change, please link the issue as follows:

Closes: #2550

Solution

Please describe your change as it relates to the problem, or bug fix, as well as any dependencies. If your change requires a schema change, please describe the schema modification(s) and whether it's a backwards-incompatible or backwards-compatible change, then select one of the following:

Note: All schema changes require discussion. Please link the issue for context.

Your change modifies the core OpenLineage model
Your change modifies one or more OpenLineage facets

If you're contributing a new integration, please specify the scope of the integration and how/where it has been tested (e.g., Apache Spark integration supports S3 and GCS filesystem operations, tested with AWS EMR).

One-line summary:

Adds "fileCount" field to DataQualityMetricsInputDatasetFacet and OutputStatisticsOutputDatasetFacet specification

Checklist

You've signed-off your work
Your pull request title follows our guidelines
Your changes are accompanied by tests (if relevant)
Your change contains a small diff and is self-contained
You've updated any relevant documentation (if relevant)
Your comment includes a one-liner for the changelog about the specific purpose of the change (if necessary)
You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2023 contributors to the OpenLineage project

proposals/168/making_spark_visitors_extensible.md

spec/facets/DataQualityMetricsInputDatasetFacet.json

spec/facets/OutputStatisticsOutputDatasetFacet.json

Signed-off-by: Martynov Maxim <martinov_m_s_@mail.ru>

dolfinus · 2024-04-03T11:50:40Z

Changing the number of fields in OutputStatisticsOutputDatasetFacet from 2 to 3 led to the method .newOutputStatisticsOutputDatasetFacet(rowCount, size) being replaced with .newOutputStatisticsOutputDatasetFacetBuilder().rowCount(...).size(...).
This could introduce compatibility issues for users of the Java client.

JDarDagran · 2024-04-03T12:05:24Z

Changing the number of fields in OutputStatisticsOutputDatasetFacet from 2 to 3 led to the method .newOutputStatisticsOutputDatasetFacet(rowCount, size) being replaced with .newOutputStatisticsOutputDatasetFacetBuilder().rowCount(...).size(...). This could introduce compatibility issues for users of the Java client.

I don't understand why? I just checked and

public OutputStatisticsOutputDatasetFacet newOutputStatisticsOutputDatasetFacet(Long rowCount,
      Long size, Long fileCount) {
    return new OutputStatisticsOutputDatasetFacet(this.producer, rowCount, size, fileCount);
  }

gets generated as well.

dolfinus · 2024-04-03T12:10:32Z

Ah, I see: the method now accepts 3 arguments instead of 2. Without a default value for fileCount, users have to update their code to pass the new argument to the method.

mobuchowski · 2024-04-03T12:34:25Z

Having the exact number of arguments is by design. If you don't care about a field, you can use the builder instead.
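The trade-off discussed above can be sketched with a small stand-in. The classes below are hypothetical and much simpler than the generated OpenLineage Java client; the names `FacetFactorySketch`, `OutputStatistics`, and `Builder` are assumptions made for illustration only. The sketch shows why a positional factory method changes its signature when a field is added, while builder call sites that never mention the new field keep compiling.

```java
// Hypothetical stand-in for the generated client code; NOT the real OpenLineage classes.
public class FacetFactorySketch {

    /** Simplified facet holding three nullable statistics. */
    static final class OutputStatistics {
        final Long rowCount;
        final Long size;
        final Long fileCount;

        OutputStatistics(Long rowCount, Long size, Long fileCount) {
            this.rowCount = rowCount;
            this.size = size;
            this.fileCount = fileCount;
        }
    }

    /** Positional factory: every field must be passed, so a new field changes the signature. */
    static OutputStatistics newOutputStatistics(Long rowCount, Long size, Long fileCount) {
        return new OutputStatistics(rowCount, size, fileCount);
    }

    /** Builder: fields that are never set stay null, so older call sites are unaffected. */
    static final class Builder {
        private Long rowCount;
        private Long size;
        private Long fileCount;

        Builder rowCount(Long value) { this.rowCount = value; return this; }
        Builder size(Long value) { this.size = value; return this; }
        Builder fileCount(Long value) { this.fileCount = value; return this; }

        OutputStatistics build() {
            return new OutputStatistics(rowCount, size, fileCount);
        }
    }

    public static void main(String[] args) {
        // Call site written before fileCount existed: still valid with the builder.
        OutputStatistics viaBuilder = new Builder().rowCount(10L).size(2048L).build();
        // Positional call: the third argument is now mandatory, even if it is null.
        OutputStatistics viaFactory = newOutputStatistics(10L, 2048L, null);
        System.out.println(viaBuilder.fileCount + " " + viaFactory.fileCount);
    }
}
```

In this sketch both objects end up with a null `fileCount`; the difference is only in which call sites need to be touched when the field is introduced.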

JDarDagran · 2024-04-03T20:43:21Z

LGTM 👍
@dolfinus, may I ask how you came to need this field?
Would it be worth/possible to add the statistic somewhere in any of the existing integrations?

dolfinus · 2024-04-04T08:04:47Z

I'm developing an ETL tool that lets users manipulate data in DBMSs and file systems using PySpark, and also provides a way to download, upload, and move raw files between file systems. Using the fileCount field, I can record that one process, for example, downloaded 100 files from SFTP to S3, but the next ETL process in the chain read only 98 files from S3, so 2 files are missing for some reason.
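The check described above can be sketched as a comparison of two fileCount values. This is only an illustration: the class and method names are hypothetical, and the counts are passed in as plain values instead of being read from real OpenLineage events. Because fileCount is optional, the sketch returns an empty result when either side did not report it.

```java
import java.util.OptionalLong;

// Hypothetical helper; how the two counts are extracted from lineage events is out of scope.
public class FileCountCheck {

    /**
     * Returns how many files the upstream job wrote that the downstream job did not read.
     * Empty if either job did not report a fileCount. A negative value means the
     * downstream job read more files than the upstream job reported writing.
     */
    static OptionalLong missingFiles(Long writtenUpstream, Long readDownstream) {
        if (writtenUpstream == null || readDownstream == null) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(writtenUpstream - readDownstream);
    }

    public static void main(String[] args) {
        // The example from the comment above: 100 files written to S3, 98 read back.
        OptionalLong missing = missingFiles(100L, 98L);
        if (missing.isPresent() && missing.getAsLong() > 0) {
            System.out.println(missing.getAsLong() + " file(s) missing");
        }
    }
}
```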

Would it be worth/possible to add the statistic somewhere in any of the existing integrations?

Yes, but in a separate PR. This may be tricky: for example, Spark does not provide metrics for the number of files read or written. In general, the number of DataFrame partitions equals the number of files, so we could count the successful tasks and use that as the file count. But Spark's Catalyst can merge small partitions into a larger one, or read one file in many partitions if the file format is splittable, so this can produce wrong results.
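A toy model may help show why a task or partition count is an unreliable proxy for the file count. The sketch below is a deliberately simplified assumption and not Spark's actual planning algorithm: it cuts each file into chunks of at most `maxSplitBytes` and packs consecutive chunks into one partition while they fit. The class name, method name, and parameter are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model for illustration only; NOT how Spark really plans file scans.
public class PartitionEstimateSketch {

    /** Estimates the number of partitions produced for the given file sizes (in bytes). */
    static int estimatePartitions(List<Long> fileSizes, long maxSplitBytes) {
        int partitions = 0;
        long currentBytes = 0; // bytes already packed into the partition being filled
        for (long fileSize : fileSizes) {
            long remaining = fileSize;
            while (remaining > 0) {
                // A splittable file is cut into chunks no larger than maxSplitBytes.
                long chunk = Math.min(remaining, maxSplitBytes);
                // Close the current partition if this chunk would overflow it.
                if (currentBytes > 0 && currentBytes + chunk > maxSplitBytes) {
                    partitions++;
                    currentBytes = 0;
                }
                currentBytes += chunk;
                remaining -= chunk;
            }
        }
        if (currentBytes > 0) {
            partitions++; // count the last, partially filled partition
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Four small files are packed into a single partition.
        System.out.println(estimatePartitions(Arrays.asList(10L, 10L, 10L, 10L), 128L));
        // One large splittable file is spread over several partitions.
        System.out.println(estimatePartitions(Arrays.asList(300L), 128L));
    }
}
```

Under this model, four small files yield one partition and one large file yields three, so in neither case does the partition count match the number of files.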

mobuchowski · 2024-04-04T12:48:49Z

Thanks for the contribution, @dolfinus.


dolfinus force-pushed the feature/dataset-stats-file-count branch 2 times, most recently from 4a488ae to dfbf2b8 Compare April 3, 2024 10:32

JDarDagran requested changes Apr 3, 2024

proposals/168/making_spark_visitors_extensible.md (outdated, resolved)

spec/facets/DataQualityMetricsInputDatasetFacet.json (outdated, resolved)

spec/facets/OutputStatisticsOutputDatasetFacet.json (outdated, resolved)

dolfinus force-pushed the feature/dataset-stats-file-count branch 4 times, most recently from 5e203dd to 35f9b65 Compare April 3, 2024 11:25

spec: Add fileCount to dataset input & output stats

abeda68

Signed-off-by: Martynov Maxim <martinov_m_s_@mail.ru>

dolfinus force-pushed the feature/dataset-stats-file-count branch from 35f9b65 to abeda68 Compare April 3, 2024 11:44

dolfinus changed the title ~~spec: Add fileCount to dataset input & output stats~~ spec: Add fileCount to dataset stats facets Apr 3, 2024

dolfinus changed the title ~~spec: Add fileCount to dataset stats facets~~ Spec: Add fileCount to dataset stats facets Apr 3, 2024

dolfinus marked this pull request as ready for review April 3, 2024 11:47

dolfinus requested a review from JDarDagran April 3, 2024 11:47

dolfinus mentioned this pull request Apr 3, 2024

[PROPOSAL] Add fileCount to dataset stats facet #2550

Closed

dolfinus changed the title ~~Spec: Add fileCount to dataset stats facets~~ [SPEC] Add fileCount to dataset stats facets Apr 3, 2024

JDarDagran approved these changes Apr 3, 2024


JDarDagran requested review from pawel-big-lebowski and mobuchowski April 3, 2024 20:38

mobuchowski approved these changes Apr 4, 2024


mobuchowski merged commit ba870d4 into OpenLineage:main Apr 4, 2024
72 checks passed

dolfinus deleted the feature/dataset-stats-file-count branch April 4, 2024 12:54

blacklight pushed a commit to blacklight/OpenLineage that referenced this pull request Apr 4, 2024

spec: Add fileCount to dataset input & output stats (OpenLineage#2562)

223dd77

Signed-off-by: Martynov Maxim <martinov_m_s_@mail.ru> Signed-off-by: Fabio Manganiello <fabio@manganiello.tech>
