feat(governance): add Gebru-style datasheets for training datasets#52

Open
Hopelynconsult wants to merge 2 commits into develop from feature/governance-datasheet
Conversation

@Hopelynconsult
Collaborator

Summary

  • Adds governance/datasheet.py, implementing Gebru et al. 2018, "Datasheets for Datasets". Companion to the Mitchell-style model cards from #37 (feat(governance): add automated model card generator).
  • Adds scripts/generate_datasheet.py mirroring scripts/generate_model_card.py so the release CI pipeline can run them in sequence.
  • Public API matches the model_card module (build_datasheet / render_markdown / write_datasheet / generate) so contributors only learn one pattern.
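
As a rough sketch of that shared surface (function names come from the PR; the signatures and bodies here are assumptions, not the actual implementation):

```python
from dataclasses import dataclass, field
from pathlib import Path

# Illustrative sketch of the four-function pattern shared with model_card.
# Names are from the PR; bodies and signatures are assumed.

@dataclass
class Datasheet:
    answers: dict = field(default_factory=dict)

def build_datasheet(manifest: dict) -> Datasheet:
    # The real module would validate REQUIRED_QUESTIONS and apply defaults here.
    return Datasheet(answers=dict(manifest))

def render_markdown(datasheet: Datasheet) -> str:
    # One H2 section per answered question.
    lines = ["# Datasheet"]
    for question, answer in datasheet.answers.items():
        lines.append(f"## {question}\n\n{answer}")
    return "\n".join(lines) + "\n"

def write_datasheet(datasheet: Datasheet, path: Path) -> None:
    Path(path).write_text(render_markdown(datasheet))

def generate(manifest: dict, path: Path) -> None:
    # One-call entry point, mirroring the model_card module.
    write_datasheet(build_datasheet(manifest), path)
```

The payoff of the mirrored surface is that a contributor who has written one model card knows the datasheet workflow already, and CI can treat both generators identically.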

Why we need this

A model card describes the model; a datasheet describes the dataset that trained the model. The two artifacts answer different questions, and the responsible-AI literature is clear that releases need both. For ClimateVision specifically, the dataset choices (which biomes, which years, which label sources) drive the geographic-bias risk that #30 and the bias-audit framework are trying to surface, so we need a structured place to document those choices alongside every release.

What's in the PR

Sections covered: motivation, composition, collection process, preprocessing, uses (intended + inappropriate), distribution, maintenance.

  • Datasheet dataclass with JSON serialisation, mirroring ModelCard
  • REQUIRED_QUESTIONS schema enforces minimum answers (purpose, creators, instances, labels, splits, source, timeframe, intended_uses, inappropriate_uses) — anything else is free-form so the schema can grow without code changes
  • Sensible defaults for inappropriate_uses and maintenance so existing dataset manifests don't have to be exhaustive on day one
  • 12 unit tests covering: build, override behaviour, all required-field validations, markdown rendering shape, JSON round-trip, and YAML manifest loading
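
The validation and round-trip behaviour might look roughly like this (the REQUIRED_QUESTIONS list is taken from this PR description; the enforcement and serialisation logic is an illustrative assumption):

```python
import json
from dataclasses import dataclass, field

# Minimum questions every release datasheet must answer (list from the PR).
REQUIRED_QUESTIONS = (
    "purpose", "creators", "instances", "labels", "splits",
    "source", "timeframe", "intended_uses", "inappropriate_uses",
)

@dataclass
class Datasheet:
    answers: dict = field(default_factory=dict)

    def validate(self) -> None:
        # Anything beyond REQUIRED_QUESTIONS is free-form, so the schema
        # can grow without code changes.
        missing = [q for q in REQUIRED_QUESTIONS if not self.answers.get(q)]
        if missing:
            raise ValueError(f"datasheet missing required answers: {missing}")

    def to_json(self) -> str:
        return json.dumps(self.answers, indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, blob: str) -> "Datasheet":
        return cls(answers=json.loads(blob))
```

Treating the answers as an open dict with a fixed required subset is what lets new questions be added per-dataset without touching the dataclass.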

Follow-ups (out of scope here)

  • Author the first concrete datasheets for the Sentinel-2 deforestation, ice-melt, and flood training datasets (separate doc-PR per dataset).
  • Wire scripts/generate_datasheet.py into the release CI alongside generate_model_card.py.
  • Cross-link datasheet → model card in governance/model_card.py so each card includes a pointer to its training datasheet.

Test plan

  • pytest tests/test_datasheet.py -q → 12 passed
  • Reviewer: confirm REQUIRED_QUESTIONS is the right minimum bar — happy to relax/tighten before we author the first real datasheet.
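
For review purposes, the required-field validations are the interesting tests; a sketch of their shape (the missing_answers() helper is illustrative, not the real test file's API):

```python
# Sketch of a required-field validation test; missing_answers() is an
# assumed helper standing in for the module's actual validation path.
REQUIRED_QUESTIONS = (
    "purpose", "creators", "instances", "labels", "splits",
    "source", "timeframe", "intended_uses", "inappropriate_uses",
)

def missing_answers(manifest: dict) -> list:
    """Return the required questions a manifest fails to answer."""
    return [q for q in REQUIRED_QUESTIONS if not manifest.get(q)]

def test_manifest_missing_splits_is_rejected():
    manifest = {q: "answered" for q in REQUIRED_QUESTIONS if q != "splits"}
    assert missing_answers(manifest) == ["splits"]

test_manifest_missing_splits_is_rejected()
```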

🤖 Generated with Claude Code

Companion to the Mitchell-style model card generator (#37): where a model
card describes the model, a datasheet describes the dataset that trained
it (Gebru et al., 2018, "Datasheets for Datasets"). Both artifacts now
ship with every release.

The public API mirrors model_card.py (build / render / write / generate)
so contributors only learn one pattern and the release CI pipeline calls
them in sequence.

Sections covered: motivation, composition, collection process,
preprocessing, uses (intended + inappropriate), distribution, maintenance.
A REQUIRED_QUESTIONS schema enforces the bare minimum a release datasheet
must answer.

- Datasheet dataclass with JSON serialisation
- build_datasheet() / write_datasheet() / generate() / render_markdown()
- scripts/generate_datasheet.py CLI wired the same way as the model card
  CLI for the release CI pipeline
- 12 tests covering build, validation, defaults, markdown rendering,
  JSON round-trip, and YAML manifest loading