[PROPOSAL] Add fileCount to dataset stats facet #2550

Closed
dolfinus opened this issue Mar 30, 2024 · 2 comments · Fixed by #2562
Labels
kind:proposal A formal proposal for a spec-related or significant change

Comments

@dolfinus
Contributor

Purpose:
This section gives the context of the proposal. It explains why this is needed.
Please describe the corresponding use cases.

Consider adding a "fileCount" field to DataQualityMetricsInputDatasetFacet and OutputStatisticsOutputDatasetFacet:

{
  "outputStatistics": {
    "rowCount": 1000,
    "fileCount": 5,
    "size": 10240
  }
}

For example, this makes it possible to track Spark jobs that create many small files in S3 or HDFS. There is no need to store file names, only the count.
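To illustrate the small-files use case, here is a minimal sketch of how a consumer might use the proposed field. It assumes that "size" is expressed in bytes and that the facet arrives as a plain dict; the helper names and the 1 MiB threshold are hypothetical and not part of any OpenLineage client.

```python
from typing import Optional

# Hypothetical threshold: flag outputs whose average file is under 1 MiB.
SMALL_FILE_BYTES = 1024 * 1024


def average_file_size(stats: dict) -> Optional[float]:
    """Return bytes per file, or None when fileCount or size is missing or zero."""
    file_count = stats.get("fileCount")
    size = stats.get("size")
    if not file_count or size is None:
        return None
    return size / file_count


def has_small_files(stats: dict, threshold: int = SMALL_FILE_BYTES) -> bool:
    """True when the average file size is known and below the threshold."""
    avg = average_file_size(stats)
    return avg is not None and avg < threshold


# The example facet body from this proposal.
facet = {"rowCount": 1000, "fileCount": 5, "size": 10240}
print(average_file_size(facet))  # 2048.0
print(has_small_files(facet))    # True
```

Only the count and total size are needed for this check, which matches the point that file names do not have to be stored.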

Proposed implementation
This section describes how you propose to model it.
If you are proposing a new facet, please mention its name and schema.

@dolfinus added the kind:proposal label Mar 30, 2024
@dolfinus changed the title from "[PROPOSAL]" to "[PROPOSAL] Add fileCount to dataset stats" Mar 30, 2024
@dolfinus changed the title from "[PROPOSAL] Add fileCount to dataset stats" to "[PROPOSAL] Add fileCount to dataset stats facet" Mar 31, 2024
@mobuchowski
Member

I think that proposal makes sense.

@dolfinus
Contributor Author

dolfinus commented Apr 3, 2024

Implementation: #2562

I've made the rowCount field in OutputStatisticsOutputDatasetFacet optional to cover the case when there is no information about the row count, only the file count.
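As a rough sketch of what an optional rowCount means for producers, the following models the facet body with every metric optional and omits unknown values when serializing. The `OutputStatistics` class here is a hypothetical stand-in written for illustration; it is not the class from the OpenLineage client, and the actual change lives in #2562.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class OutputStatistics:
    # Hypothetical stand-in for the facet body; all metrics are optional.
    rowCount: Optional[int] = None
    fileCount: Optional[int] = None
    size: Optional[int] = None

    def to_dict(self) -> dict:
        # Drop fields with no value so unknown metrics are omitted, not null.
        return {k: v for k, v in asdict(self).items() if v is not None}


# A writer that knows how many files it produced but not how many rows.
files_only = OutputStatistics(fileCount=5, size=10240)
print(files_only.to_dict())  # {'fileCount': 5, 'size': 10240}
```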
