TechPolicyLab/Data-Statements

Data Statements for Natural Language Processing

This repository holds the templates and guides for Version 2 and Version 3 of data statements for natural language processing. The templates are provided as a starting point for creating new data statements. The guides provide specific background information and best practices in addition to the instructions given for each schema element, which are also provided in the templates.

For more information, see the data statements website.

About Data Statements

Data statements provide essential information about the characteristics of datasets, including but not limited to the curation rationale and data sources. Data statements are intended to help with (1) the conceptualization of and planning for datasets, in order to create datasets that reflect community needs, (2) the mitigation of harms caused by bias in the dataset (such as a mismatch between training datasets and the contexts where systems are deployed), and (3) the creation of a more inclusive data catalog, through identifying gaps. While first developed for language data, data statements could be produced for a wide range of data types, with adjustments to the schema to account for the unique characteristics of each data type.

Research Context

Data statements were first conceptualized in 2017 by Emily M. Bender and Batya Friedman at the University of Washington. The first version of data statements was published in 2018 in Transactions of the Association for Computational Linguistics and presented at the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). The next two years saw significant interest and uptake. With the goals of supporting broader uptake and learning how to make data statements a suitable practice across different research and institutional contexts, in 2020 Emily M. Bender, Batya Friedman, and Angelina McMillan-Major organized a workshop at the 12th Language Resources and Evaluation Conference. The results of this workshop led to an updated schema (Version 2), a set of best practices, and A Guide for Writing Data Statements, all released in 2021. Data statements schema Version 2 and Bender, Friedman, and McMillan-Major’s reflections on the documentation development process were published in 2023 in the first issue of the Association for Computing Machinery (ACM) Journal of Responsible Computing. McMillan-Major continued to develop data statements in her dissertation work by shifting the perspective of data statements to include prospective dataset documentation and by incorporating language communities as collaborative partners in, and direct stakeholders of, the dataset curation and documentation process. McMillan-Major and Bender then refined McMillan-Major’s dissertation work into the current version of the schema, Version 3.

Data Statements Lineage

Schema Version 1

Dataset documentation

Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science (Bender & Friedman, 2018)

Schema Version 2

Dataset documentation refined by scientific community engagement

A Guide for Writing Data Statements for Natural Language Processing (Bender, Friedman, & McMillan-Major, 2021)

Data Statements: From Technical Concept to Community Practice (McMillan-Major, Bender, & Friedman, 2024)

Schema Version 3

Dataset documentation and creation with best practices for language community dataset development

Language Dataset Documentation Design: Learning from Deaf and Indigenous Communities (McMillan-Major, 2023)

A Guide for Creating and Documenting Language Datasets with Data Statements (McMillan-Major & Bender, 2024)
