# **Data Readiness For AI Checklist**

 * Creator(s): John Pill & Lewis Lee
 * Affiliation: UK Met Office
 * History: 1.0
 * Last update: 27 August 2024.


---

## **Overview**
This checklist builds on the 2019 draft readiness matrix developed by the Office of Science and Technology Policy Subcommittee on Open Science, and has been improved based on further research and user feedback. Definitions for some concepts are listed at the end of this document. The checklist was developed through a collaboration of ESIP Data Readiness Cluster members, including representatives from NOAA, NASA, USGS, and other organizations. It will be updated periodically to reflect community feedback.

ESIP Data Readiness Cluster (2023): Checklist to Examine AI-readiness for Open Environmental Datasets v.1.0. ESIP. Online resource. https://doi.org/10.6084/m9.figshare.19983722.v1

Readiness Matrix (2020): What is AI-Ready Open Data? NOAA. Online resource. https://www.star.nesdis.noaa.gov/star/documents/meetings/2020AI/presentations/202010/20201022_Christensen.pdf

### Prerequisites
Ideally, for an AI-readiness assessment, a dataset should be defined as the minimum measurable bundle (i.e., a physical parameter/variable of observational datasets or model simulations). Assessment at this scale enables better integration of data from different sources for research and development. However, manual assessment at this scale is labor-intensive without automation, so we recommend that current assessments be done at the data file level. If the dataset has different versions, the checklist should be applied to each dataset type (e.g. raw, derived).

### Learning Outcomes
* Know how to check a range of dataset features. 
* Assess a wide range of dataset features, which will impact the dataset's 'readiness' for machine learning.  


---

## **Tutorial Material TODO**


It is useful to **bold** your answers so they are more visible. Add two asterisks ```**around your text**``` to do so.


### Data section, optional
Scripts for pulling the data into the notebook assuming


---

## **Dataset General Info**


### Basic details

1. Dataset Name: 

2. Dataset Version:

3. Dataset Location/Link:

4. Assessor Name:

5. Assessor Email:


### Dataset details

6. Is this raw data or a derived/processed data product? Raw/Derived

7. Is this observational data, simulation/model output, or synthetic data? Observed/Modeled/Synthetic

8. Is the data single-source or aggregated from several sources? Single-source/Aggregated

---

## **Data Quality**

### Data timeliness

1. Will the dataset be updated: Yes / No

    * If the data will be updated, how often (e.g., whenever new data are added, or at a fixed frequency such as weekly or monthly)?
       
    * Will there be different stages of the update (e.g., updated with preliminary data first and replaced by a later update of the full record)?
        
    * If yes, what is the delay between different stages?
    
    * Should the new version of the dataset supersede the current version?
    

### Data completeness

2. Is there any documentation about the completeness of the dataset? Yes / No - (If yes, link to the report/document)

3. How complete is the dataset compared to the expected spatial coverage? Complete / Partial / Unknown / Not applicable

4. How complete is the dataset compared to the expected temporal coverage? Complete / Partial / Unknown / Not applicable
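
For the temporal coverage question, one rough way to put a number on "Complete / Partial" is to compare the timestamps present against the timestamps expected at the nominal sampling interval. The following is a minimal sketch under stated assumptions: the function name `temporal_completeness`, the daily interval, and the sample record are all hypothetical and not part of the checklist.

```python
from datetime import datetime, timedelta


def temporal_completeness(timestamps, start, end, step):
    """Return the fraction of expected time steps in [start, end] that are present."""
    if step <= timedelta(0) or end < start:
        raise ValueError("step must be positive and end must not precede start")
    present = set(timestamps)
    expected = 0
    found = 0
    current = start
    while current <= end:
        expected += 1
        if current in present:
            found += 1
        current += step
    return found / expected


# Hypothetical daily record for 1-10 January 2024 with two days missing.
start = datetime(2024, 1, 1)
end = datetime(2024, 1, 10)
observed = [start + timedelta(days=d) for d in range(10) if d not in (3, 7)]

fraction = temporal_completeness(observed, start, end, timedelta(days=1))
print(f"Temporal completeness: {fraction:.0%}")  # prints "Temporal completeness: 80%"
```

A fraction below 1 suggests answering "Partial"; whether a given gap matters depends on the intended use of the data.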



### Data consistency

5. Is this dataset self-consistent in that its units, data types, and parameter names do not change over time and space? Yes / No / Not applicable

6. Are this dataset’s units, data types, and parameter names consistent with similar data collections? Yes / No / Not applicable

7. Are there processes to monitor the consistency of units, data types, and parameter names? Yes / No / Not applicable
    * If yes, what measures are taken? Manual review / Automated review
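
An automated review of self-consistency could be as simple as collecting the units recorded for each parameter across files and flagging any parameter that appears with more than one unit. The sketch below assumes that per-file metadata has already been extracted into plain dictionaries; the record layout, file names, and the function `find_inconsistent_units` are hypothetical.

```python
from collections import defaultdict


def find_inconsistent_units(records):
    """Return {parameter: set of units} for parameters recorded with more than one unit."""
    units_seen = defaultdict(set)
    for record in records:
        units_seen[record["parameter"]].add(record["units"])
    return {name: units for name, units in units_seen.items() if len(units) > 1}


# Hypothetical metadata extracted from two yearly files.
records = [
    {"file": "2023.csv", "parameter": "air_temperature", "units": "K"},
    {"file": "2024.csv", "parameter": "air_temperature", "units": "degC"},
    {"file": "2023.csv", "parameter": "wind_speed", "units": "m s-1"},
    {"file": "2024.csv", "parameter": "wind_speed", "units": "m s-1"},
]

inconsistent = find_inconsistent_units(records)
for name, units in inconsistent.items():
    print(f"{name}: units change between files ({sorted(units)})")
```

The same pattern could be applied to data types or parameter names by swapping the key that is collected.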


### Data bias

1. Is there known bias in the dataset? Yes / No - (If yes, provide information)

2. Have measures been taken to examine bias? Yes / No
    * If yes, what measures were used?
    * Is the bias metrologically traceable?


3. Is there reported bias in the data? No known bias / Bias found and reported / No information available
    * (optional) Link to the report/document on the bias
    * (optional) Link to tools available to reduce bias
    * (optional) Link to a bias-corrected or bias-reduced version of the dataset


4. Is there quantitative information about data resolution in space and time? Yes / No / Not applicable

5. Are there published data quality procedures or reports? Yes / No
    * If there is published quality information, please provide the link to the information.


6. Is the provenance of the dataset tracked and documented? Yes / No / Not applicable

7. Are there checksums / other checks for data integrity? Yes / No / Not applicable

8. What is the size of the dataset? Depending on the resource, this might be total data volume, dimensionality, number of images, data files, table rows, image size, etc. [Short Answer]
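
For the data integrity question in this section, a common approach is to compare a cryptographic hash of each downloaded file against a checksum published by the data provider. The sketch below uses SHA-256 from Python's standard `hashlib` module; the choice of SHA-256, the helper names, and the temporary demo file are assumptions for illustration, since providers may publish other checksum types.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_hex):
    """Return True if the file's SHA-256 digest matches the expected value."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Demonstration on a temporary file standing in for a real data file.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"abc")
    demo_path = handle.name

# Stands in for a checksum published alongside the dataset.
published = hashlib.sha256(b"abc").hexdigest()
matches = verify_checksum(demo_path, published)
tampered = verify_checksum(demo_path, "0" * 64)
os.remove(demo_path)

print("Checksum matches:", matches)
print("Wrong checksum accepted:", tampered)
```

Reading in chunks keeps memory use bounded for large data files.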




### Data Quality Assessment Matrix

<img src="Images/data_quality_matrix.png" width=800 height="auto" />

---

## **Data Documentation**

### Community standard or convention

1. Does the dataset metadata follow a community/domain standard or convention? Yes / No / Not applicable
      * If the metadata follows a community/domain standard, which standard is it? (CF, TBD, etc.)
      * Is the dataset metadata machine-readable? Yes / No / Not applicable
      * Does it include details on the spatial and temporal extent? Yes / No / Not applicable


### Data dictionary

2. Is there a comprehensive data dictionary/codebook that describes what each element (parameter) of the dataset means? Yes / No / Not applicable
    * Is the data dictionary standardized? Yes / No / Not applicable
    * Is the data dictionary machine-readable? Yes / No / Not applicable
    * Do the parameters follow a defined standard? Yes / No / Not applicable
    * If the parameters follow a defined standard, which standard is it?
    * Are parameters crosswalked in an ontology or common vocabulary (e.g. NIEM)? Yes / No / Not applicable
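
When the data dictionary is machine-readable, its coverage can be checked automatically. The sketch below assumes a JSON dictionary keyed by parameter name, with `units`, `type`, and `null_value` fields mirroring the appendix definition of a data dictionary; the JSON layout, field names, and column names are hypothetical, not a standard.

```python
import json

# Hypothetical machine-readable data dictionary; the layout is illustrative only.
DATA_DICTIONARY_JSON = """
{
  "air_temperature": {"units": "K", "type": "float", "null_value": -9999.0},
  "station_id": {"units": null, "type": "string", "null_value": ""}
}
"""

REQUIRED_FIELDS = ("units", "type", "null_value")


def check_data_dictionary(columns, dictionary):
    """Return (columns with no dictionary entry, entries missing a required field)."""
    undocumented = [name for name in columns if name not in dictionary]
    incomplete = [
        name for name, entry in dictionary.items()
        if any(field not in entry for field in REQUIRED_FIELDS)
    ]
    return undocumented, incomplete


dictionary = json.loads(DATA_DICTIONARY_JSON)
columns = ["station_id", "air_temperature", "wind_speed"]  # hypothetical dataset columns
undocumented, incomplete = check_data_dictionary(columns, dictionary)

print("Undocumented columns:", undocumented)
print("Incomplete entries:", incomplete)
```

An empty result for both lists would support answering "Yes" to the comprehensiveness question for the columns checked.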

### Unique persistent identifier

3. Does the dataset have a unique persistent identifier, e.g. DOI? Yes, [supply identifier] / No / Not applicable


### Contact information and feedback

4. Is there contact information for subject-matter experts? Yes / No / Not applicable

5. Is there a mechanism for user feedback and suggestions? Yes / No / Not applicable

### Example code / notebooks / toolkits

6. Are example code, notebooks, or toolkits available showing how the data can be used? Yes / No / Not applicable


### Licenses

7. What is the license for the data?

    * Is the license standardized and machine-readable (e.g. Creative Commons)? Yes / No / Not applicable


### Dataset usage

8. Has this dataset already been used in AI or ML activities? Link to publications/reports.

9. Are there recommendations on the intended use of the data, and uses that are not recommended? Yes / No / Not applicable


### Data Documentation Assessment Matrix

<img src="Images/data_documentation_matrix.png" width=800 height="auto" />

---

## **Data Access**

### File formats

1. What are the major file formats? (CSV, netCDF, etc.)

    * Is this format machine-readable? Yes / No / Not applicable
    * Is the data available in at least one open, non-proprietary format? Yes / No / Not applicable
    * Are there tools/services to support data format conversion? Yes / No
    * If so, provide the link to the tools/services

### Data delivery
2. Does data access require authentication (e.g., a registered user account)? Yes / No / Not applicable

3. Can the file be accessed via direct file downloading or ordering? Yes / No / Not applicable

4. Is there an Application Programming Interface (API) or web service to access the data? Yes / No / Not applicable

    * If there is an API, does the API follow an open standard protocol (e.g., OGC)? Yes / No
    * If there is an API, is there documentation for the API? Yes / No
    * If “Yes”, please provide a URL to the documentation.

### Privacy and security

5. For restricted data, have measures been taken to provide some access while still applying appropriate protection for privacy and security? Yes / No / Not applicable

    * Has the data been aggregated to reduce granularity? Yes / No / Not applicable
    * Has the data been anonymized / de-identified? Yes / No / Not applicable
    * Is there secure access to the full dataset for authorized users? Yes / No / Not applicable

### Data Access Assessment Matrix


<img src="Images/data_access_matrix.png" width=800 height="auto" />

---

## **Data Preparation**

### Null values
1. Have null values/gaps been filled? Yes / No / Not applicable
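
Before answering, it can help to count how many values are actually missing, including sentinel fill values that stand in for nulls. The sketch below is a minimal illustration; the fill value `-9999.0`, the function name `count_missing`, and the sample values are assumptions, and the real fill value should come from the dataset's data dictionary.

```python
import math


def count_missing(values, fill_values=(-9999.0,)):
    """Count entries that are None, NaN, or equal to a sentinel fill value."""
    missing = 0
    for value in values:
        if value is None:
            missing += 1
        elif isinstance(value, float) and math.isnan(value):
            missing += 1
        elif value in fill_values:
            missing += 1
    return missing


# Hypothetical measurement series with three kinds of missing value.
series = [1.2, None, float("nan"), -9999.0, 3.4]
n_missing = count_missing(series)
print(f"{n_missing} of {len(series)} values are missing")  # prints "3 of 5 values are missing"
```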

### Outliers

2. Have outliers been identified? Yes, tagged / Yes, removed / No / Not applicable
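
As one illustration of the "Yes, tagged" option, outliers can be flagged rather than removed so that downstream users keep the choice. The sketch below uses the interquartile-range (IQR) rule with the conventional multiplier of 1.5; this rule is only one of many possible methods and is an assumption here, as are the function name and the sample values.

```python
import statistics


def tag_outliers_iqr(values, k=1.5):
    """Return a list of booleans marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [value < lower or value > upper for value in values]


# Hypothetical series with one suspicious reading.
readings = [10.1, 10.3, 9.9, 10.0, 10.2, 25.0, 10.4, 9.8]
flags = tag_outliers_iqr(readings)
tagged = [value for value, is_outlier in zip(readings, flags) if is_outlier]
print("Tagged as outliers:", tagged)  # prints "Tagged as outliers: [25.0]"
```

Keeping the flags alongside the original values, instead of dropping rows, preserves the raw record for users who need it.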


### Gridded data
3. Is the data gridded (regularly sampled in time and space)?

    * Regularly gridded in space / Constant time-frequency / Regularly gridded in space and constant time-frequency / Not gridded / Not applicable

    * If the data is gridded, was it transformed from a different original sampling? Yes, from irregular sampling / Yes, from a different regular sampling / No, this is the original sampling

    * If the data is resampled from the original sampling, is the data also available at the original sampling? Yes / No / Only available at request / Not applicable
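
Whether a coordinate axis is regularly sampled can be checked by testing that consecutive differences are equal within a tolerance. The sketch below works on a plain list of coordinate values (for example latitudes, or time offsets in seconds); the function name, the tolerance, and the sample axes are hypothetical.

```python
import math


def is_regularly_spaced(coords, rel_tol=1e-6):
    """Return True if consecutive coordinate differences are all (nearly) equal."""
    steps = [b - a for a, b in zip(coords, coords[1:])]
    return all(math.isclose(step, steps[0], rel_tol=rel_tol) for step in steps)


# Hypothetical coordinate axes.
regular_axis = [0.0, 0.25, 0.5, 0.75, 1.0]
irregular_axis = [0.0, 0.25, 0.6, 1.0]

print("Regular axis:", is_regularly_spaced(regular_axis))
print("Irregular axis:", is_regularly_spaced(irregular_axis))
```

Applying the check separately to each spatial axis and to the time axis maps onto the answer options above (regular in space, constant time-frequency, or both).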


### Targets / labels for supervised learning

4. Are there associated targets or labels for supervised learning techniques (i.e., can this be used as a training dataset for supervised learning techniques)? Yes / No / Not applicable

    * If there are associated targets/labels, are community labeling standards implemented (e.g., STAC label extension, ESA AIREO specification, etc)?


---

## **Appendix** - Definition of terms used in the checklist.

### Quality
* **Completeness**: the breadth of a dataset compared to an ideal 100% completion (spatial, temporal, demographic, etc.); important in avoiding sampling bias
* **Consistency**: uniformity within the entire dataset or compared with similar data collections; for example, no changes in units or data types over time; the item measured against itself or its counterpart in another dataset or database
* **Bias**: a systematic tilt in the dataset when compared to a reference, caused for example by instrumentation, incorrect data processing, unrepresentative sampling, or human error; the exact nature of bias and how it is measured will vary depending on the type of data and the research domain.
* **Uncertainty**: parameter, associated with the result of a measurement, that characterizes the dispersion of the values that could reasonably be attributed to the measurand.
* **Timeliness**: the speed of data release, compared to when an event occurred or measurements were made; requirements will vary depending on the timeframe of the phenomenon (e.g., severe thunderstorms vs. climate change, or disease outbreaks vs. life expectancy trends)
* **Provenance**: identification of the data sources, how the data was processed, and who released it.
* **Integrity**: verification that the data remains unchanged from the original; aka data fixity.

### Documentation
* **Dataset Metadata**: complete information about the dataset: quality, provenance, location, time period, responsible parties, purpose, etc.
* **Data Dictionary/Codebook**: complete information about the individual variables / measures / parameters within a dataset: type, units, null value, etc.
* **Identifier**: a code or number that uniquely identifies a dataset
* **Ontology**: formalized definitions of concepts within a domain of knowledge, and the nature of the inter-relationships among those concepts

### Data Access

* **Formats**: standards that govern how information is stored in a computer file (e.g., CSV, JSON, GeoTIFF, etc.); different AI user communities will have different requirements, so the best practice is to provide several format options to meet the needs of multiple high priority user communities.
* **Delivery Options**: mechanisms for publishing open data for public use (e.g., direct file download, Application Programming Interface (API), cloud services, etc.); different AI user communities will have different requirements, so the best practice is to provide several delivery options to meet the needs of multiple high priority user communities.
* **License/Usage Rights**: information on who is allowed to use the data and for what purposes, including data sharing agreements, fees, etc.; some federal data needs to have restrictions and some will be fully open, so rights should be documented in detail
* **Security/Privacy**: protection of data that is restricted in some way (privacy, proprietary/business information, national security, etc.)
