feat(governance): add Gebru-style datasheets for training datasets#52

Open
Hopelynconsult wants to merge 2 commits into develop from feature/governance-datasheet
Conversation

@Hopelynconsult
Collaborator

Summary

  • Adds governance/datasheet.py, implementing Gebru et al. 2018, "Datasheets for Datasets". Companion to the Mitchell-style model cards from #37 (feat(governance): add automated model card generator).
  • Adds scripts/generate_datasheet.py mirroring scripts/generate_model_card.py so the release CI pipeline can run them in sequence.
  • Public API matches the model_card module (build_datasheet / render_markdown / write_datasheet / generate) so contributors only learn one pattern.
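
As a rough sketch of that shared surface (function names come from the PR; the signatures and bodies here are assumptions, not the actual implementation):

```python
from dataclasses import dataclass, field
from pathlib import Path

# Illustrative sketch of the four-function pattern shared with model_card.
# Names are from the PR; bodies and signatures are assumed.

@dataclass
class Datasheet:
    answers: dict = field(default_factory=dict)

def build_datasheet(manifest: dict) -> Datasheet:
    # The real module would validate REQUIRED_QUESTIONS and apply defaults here.
    return Datasheet(answers=dict(manifest))

def render_markdown(datasheet: Datasheet) -> str:
    # One H2 section per answered question.
    lines = ["# Datasheet"]
    for question, answer in datasheet.answers.items():
        lines.append(f"## {question}\n\n{answer}")
    return "\n".join(lines) + "\n"

def write_datasheet(datasheet: Datasheet, path: Path) -> None:
    Path(path).write_text(render_markdown(datasheet))

def generate(manifest: dict, path: Path) -> None:
    # One-call entry point, mirroring the model_card module.
    write_datasheet(build_datasheet(manifest), path)
```

The payoff of the mirrored surface is that a contributor who has written one model card knows the datasheet workflow already, and CI can treat both generators identically.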

Why we need this

A model card describes the model; a datasheet describes the dataset that trained the model. The two artifacts answer different questions, and the responsible-AI literature is clear that releases need both. For ClimateVision specifically, the dataset choices (which biomes, which years, which label sources) drive the geographic-bias risk that #30 and the bias-audit framework are trying to surface, so we need a structured place to document those choices alongside every release.

What's in the PR

Sections covered: motivation, composition, collection process, preprocessing, uses (intended + inappropriate), distribution, maintenance.

  • Datasheet dataclass with JSON serialisation, mirroring ModelCard
  • REQUIRED_QUESTIONS schema enforces minimum answers (purpose, creators, instances, labels, splits, source, timeframe, intended_uses, inappropriate_uses) — anything else is free-form so the schema can grow without code changes
  • Sensible defaults for inappropriate_uses and maintenance so existing dataset manifests don't have to be exhaustive on day one
  • 12 unit tests covering: build, override behaviour, all required-field validations, markdown rendering shape, JSON round-trip, and YAML manifest loading
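
The validation and round-trip behaviour might look roughly like this (the REQUIRED_QUESTIONS list is taken from this PR description; the enforcement and serialisation logic is an illustrative assumption):

```python
import json
from dataclasses import dataclass, field

# Minimum questions every release datasheet must answer (list from the PR).
REQUIRED_QUESTIONS = (
    "purpose", "creators", "instances", "labels", "splits",
    "source", "timeframe", "intended_uses", "inappropriate_uses",
)

@dataclass
class Datasheet:
    answers: dict = field(default_factory=dict)

    def validate(self) -> None:
        # Anything beyond REQUIRED_QUESTIONS is free-form, so the schema
        # can grow without code changes.
        missing = [q for q in REQUIRED_QUESTIONS if not self.answers.get(q)]
        if missing:
            raise ValueError(f"datasheet missing required answers: {missing}")

    def to_json(self) -> str:
        return json.dumps(self.answers, indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, blob: str) -> "Datasheet":
        return cls(answers=json.loads(blob))
```

Treating the answers as an open dict with a fixed required subset is what lets new questions be added per-dataset without touching the dataclass.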

Follow-ups (out of scope here)

  • Author the first concrete datasheets for the Sentinel-2 deforestation, ice-melt, and flood training datasets (separate doc-PR per dataset).
  • Wire scripts/generate_datasheet.py into the release CI alongside generate_model_card.py.
  • Cross-link datasheet → model card in governance/model_card.py so each card includes a pointer to its training datasheet.

Test plan

  • pytest tests/test_datasheet.py -q → 12 passed
  • Reviewer: confirm REQUIRED_QUESTIONS is the right minimum bar — happy to relax/tighten before we author the first real datasheet.
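
For review purposes, the required-field validations are the interesting tests; a sketch of their shape (the missing_answers() helper is illustrative, not the real test file's API):

```python
# Sketch of a required-field validation test; missing_answers() is an
# assumed helper standing in for the module's actual validation path.
REQUIRED_QUESTIONS = (
    "purpose", "creators", "instances", "labels", "splits",
    "source", "timeframe", "intended_uses", "inappropriate_uses",
)

def missing_answers(manifest: dict) -> list:
    """Return the required questions a manifest fails to answer."""
    return [q for q in REQUIRED_QUESTIONS if not manifest.get(q)]

def test_manifest_missing_splits_is_rejected():
    manifest = {q: "answered" for q in REQUIRED_QUESTIONS if q != "splits"}
    assert missing_answers(manifest) == ["splits"]

test_manifest_missing_splits_is_rejected()
```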

🤖 Generated with Claude Code

Companion to the Mitchell-style model card generator (#37): where a model
card describes the model, a datasheet describes the dataset that trained
it (Gebru et al., 2018, "Datasheets for Datasets"). Both artifacts now
ship with every release.

The public API mirrors model_card.py (build / render / write / generate)
so contributors only learn one pattern and the release CI pipeline calls
them in sequence.

Sections covered: motivation, composition, collection process,
preprocessing, uses (intended + inappropriate), distribution, maintenance.
A REQUIRED_QUESTIONS schema enforces the bare minimum a release datasheet
must answer.

- Datasheet dataclass with JSON serialisation
- build_datasheet() / write_datasheet() / generate() / render_markdown()
- scripts/generate_datasheet.py CLI wired the same way as the model card
  CLI for the release CI pipeline
- 12 tests covering build, validation, defaults, markdown rendering,
  JSON round-trip, and YAML manifest loading