feat(governance): add Gebru-style datasheets for training datasets #52
Open
Hopelynconsult wants to merge 2 commits into develop
Companion to the Mitchell-style model card generator (#37): where a model card describes the model, a datasheet describes the dataset that trained it (Gebru et al., 2018, "Datasheets for Datasets"). Both artifacts now ship with every release.

The public API mirrors model_card.py (build / render / write / generate) so contributors only learn one pattern and the release CI pipeline calls them in sequence.

Sections covered: motivation, composition, collection process, preprocessing, uses (intended + inappropriate), distribution, maintenance.

A REQUIRED_QUESTIONS schema enforces the bare minimum a release datasheet must answer.

- Datasheet dataclass with JSON serialisation
- build_datasheet() / write_datasheet() / generate() / render_markdown()
- scripts/generate_datasheet.py CLI wired the same way as the model card CLI for the release CI pipeline
- 12 tests covering build, validation, defaults, markdown rendering, JSON round-trip, and YAML manifest loading
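To make the build and validation flow described above concrete, here is a minimal sketch, not the code in governance/datasheet.py. The question names come from this PR's description; the field layout of Datasheet, the DEFAULTS mapping and its placeholder text, and the exact build_datasheet signature are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field

# Minimum questions a release datasheet must answer (names from the PR text).
REQUIRED_QUESTIONS = (
    "purpose", "creators", "instances", "labels", "splits",
    "source", "timeframe", "intended_uses", "inappropriate_uses",
)

# Hypothetical defaults so older manifests need not be exhaustive.
DEFAULTS = {
    "inappropriate_uses": "Not yet documented.",
    "maintenance": "Not yet documented.",
}


@dataclass
class Datasheet:
    # Assumed layout: a dataset name plus a free-form question -> answer map.
    name: str
    answers: dict = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "Datasheet":
        return cls(**json.loads(text))


def build_datasheet(name: str, manifest: dict) -> Datasheet:
    """Merge defaults under the manifest, then enforce the required questions."""
    answers = {**DEFAULTS, **manifest}
    missing = [q for q in REQUIRED_QUESTIONS if not answers.get(q)]
    if missing:
        raise ValueError(f"datasheet {name!r} is missing required answers: {missing}")
    return Datasheet(name=name, answers=answers)
```

In this sketch any key outside REQUIRED_QUESTIONS passes through untouched, which is one way to keep the schema extensible without code changes.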
Summary
- governance/datasheet.py: datasheets per Gebru et al. 2018, "Datasheets for Datasets". Companion to the Mitchell-style model cards from feat(governance): add automated model card generator #37.
- scripts/generate_datasheet.py, mirroring scripts/generate_model_card.py so the release CI pipeline can run them in sequence.
- Same public API shape as the model card module (build_datasheet / render_markdown / write_datasheet / generate) so contributors only learn one pattern.
Why we need this
A model card describes the model. A datasheet describes the dataset that trained the model. The two artifacts answer different questions, and the responsible-AI literature is clear that releases need both. For ClimateVision specifically, the dataset choices (which biomes, which years, which label sources) drive the geographic-bias risk that #30 and the bias-audit framework are trying to surface, so we need a structured place to document those choices alongside every release.
What's in the PR
Sections covered: motivation, composition, collection process, preprocessing, uses (intended + inappropriate), distribution, maintenance.
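As a rough illustration of how the seven sections could be rendered in a fixed order, the sketch below walks them and emits Markdown. It is not the PR's render_markdown: the signature (a name plus a section-to-answers mapping), the SECTIONS tuple, and the placeholder text for empty sections are all assumptions.

```python
# Section order taken from the list above; the tuple name is hypothetical.
SECTIONS = (
    "motivation", "composition", "collection process", "preprocessing",
    "uses", "distribution", "maintenance",
)


def render_markdown(name: str, sections: dict) -> str:
    """Render every section in a stable order, marking undocumented ones."""
    lines = [f"# Datasheet: {name}", ""]
    for title in SECTIONS:
        lines.append(f"## {title.capitalize()}")
        answers = sections.get(title) or {}
        if not answers:
            # Assumed placeholder so gaps stay visible in the rendered file.
            lines.append("_Not documented._")
        for question, answer in answers.items():
            lines.append(f"- **{question}**: {answer}")
        lines.append("")
    return "\n".join(lines)
```

Rendering all sections even when empty keeps the output shape identical across datasets, which makes diffs between releases easier to read.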
- Datasheet dataclass with JSON serialisation, mirrors ModelCard
- REQUIRED_QUESTIONS schema enforces minimum answers (purpose, creators, instances, labels, splits, source, timeframe, intended_uses, inappropriate_uses); anything else is free-form so the schema can grow without code changes
- Defaults for inappropriate_uses and maintenance so existing dataset manifests don't have to be exhaustive on day one
Follow-ups (out of scope here)
- Wire scripts/generate_datasheet.py into the release CI alongside generate_model_card.py.
- Update governance/model_card.py so each card includes a pointer to its training datasheet.
Test plan
- pytest tests/test_datasheet.py -q → 12 passed
- Reviewer confirms REQUIRED_QUESTIONS is the right minimum bar; happy to relax/tighten before we author the first real datasheet.
🤖 Generated with Claude Code