[SPEC] Add fileCount to dataset stats facets #2562
Signed-off-by: Martynov Maxim <martinov_m_s_@mail.ru>
Changing the number of fields in OutputStatisticsOutputDatasetFacet from 2 to 3 led to replacing method
I don't understand why. I just checked, and
public OutputStatisticsOutputDatasetFacet newOutputStatisticsOutputDatasetFacet(Long rowCount, Long size, Long fileCount) {
    return new OutputStatisticsOutputDatasetFacet(this.producer, rowCount, size, fileCount);
}
gets generated as well.
Ah, I see, the method now accepts 3 arguments instead of 2. Without a default value for fileCount, users have to update their code to pass the new argument to the method.
Having an exact number of arguments is by design. If you don't care, you can use the builder as a solution here.
LGTM 👍
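To illustrate the trade-off discussed above, here is a minimal, self-contained sketch of the two construction styles. The class `FacetSketch` and its nested types are hypothetical stand-ins, not the generated OpenLineage client code; only the field names `rowCount`, `size`, and `fileCount` come from this PR. The factory method requires every argument, so adding a field changes its signature, while the builder lets callers set only the fields they care about.

```java
// Hypothetical stand-in for a generated facet class; NOT the real OpenLineage client.
public class FacetSketch {

    public static final class OutputStatistics {
        private final Long rowCount;
        private final Long size;
        private final Long fileCount;

        public OutputStatistics(Long rowCount, Long size, Long fileCount) {
            this.rowCount = rowCount;
            this.size = size;
            this.fileCount = fileCount;
        }

        public Long getRowCount() { return rowCount; }
        public Long getSize() { return size; }
        public Long getFileCount() { return fileCount; }
    }

    // Factory style: exact arity. Adding a field to the facet changes this
    // signature, so existing two-argument callers stop compiling.
    public static OutputStatistics newOutputStatistics(Long rowCount, Long size, Long fileCount) {
        return new OutputStatistics(rowCount, size, fileCount);
    }

    // Builder style: fields that are never set stay null, so callers written
    // before fileCount existed keep compiling unchanged.
    public static final class OutputStatisticsBuilder {
        private Long rowCount;
        private Long size;
        private Long fileCount;

        public OutputStatisticsBuilder rowCount(Long value) { this.rowCount = value; return this; }
        public OutputStatisticsBuilder size(Long value) { this.size = value; return this; }
        public OutputStatisticsBuilder fileCount(Long value) { this.fileCount = value; return this; }

        public OutputStatistics build() {
            return new OutputStatistics(rowCount, size, fileCount);
        }
    }

    public static void main(String[] args) {
        OutputStatistics viaFactory = newOutputStatistics(100L, 2048L, 4L);
        OutputStatistics viaBuilder = new OutputStatisticsBuilder()
                .rowCount(100L)
                .size(2048L)
                .build(); // fileCount intentionally left unset

        System.out.println(viaFactory.getFileCount()); // prints 4
        System.out.println(viaBuilder.getFileCount()); // prints null
    }
}
```

With the builder, code that only sets `rowCount` and `size` is unaffected when a new optional field is added, which appears to be the workaround the reviewer has in mind.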
I'm developing an ETL tool which allows manipulating data in DBMS and file systems using PySpark, and also provides a way to download/upload/move raw files in file systems. Using
Yes, but in a separate PR. This may be tricky: for example, Spark does not provide metrics for the number of files read or written. In general, the number of dataframe partitions is equal to the number of files, so we can count the number of successful tasks and use it as a file count. But Spark's Catalyst can merge small partitions into a larger one, or read one file as many partitions if the file format is splittable, so this can produce wrong results.
Thanks for the contribution, @dolfinus.
Signed-off-by: Martynov Maxim <martinov_m_s_@mail.ru> Signed-off-by: Fabio Manganiello <fabio@manganiello.tech>
Problem
👋 Thanks for opening a pull request! Please include a brief summary of the problem your change is trying to solve, or bug fix. If your change fixes a bug or you'd like to provide context on why you're making the change, please link the issue as follows:
Closes: #2550
Solution
Please describe your change as it relates to the problem, or bug fix, as well as any dependencies. If your change requires a schema change, please describe the schema modification(s) and whether it's a backwards-incompatible or backwards-compatible change, then select one of the following:
If you're contributing a new integration, please specify the scope of the integration and how/where it has been tested (e.g., Apache Spark integration supports S3 and GCS filesystem operations, tested with AWS EMR).
One-line summary:
Adds a "fileCount" field to the DataQualityMetricsInputDatasetFacet and OutputStatisticsOutputDatasetFacet specifications
Checklist
SPDX-License-Identifier: Apache-2.0
Copyright 2018-2023 contributors to the OpenLineage project