# **Data Readiness For AI Checklist**

 * Creator(s): John Pill
 * Affiliation: UK Met Office
 * History: 1.0
 * Last update: 27 August 2024.


---

## **Overview**
The checklist builds on the 2019 draft readiness matrix developed by the Office of Science and Technology Policy Subcommittee on Open Science, and has been improved based on further research and user feedback. Definitions for some concepts are listed at the end of this document. The checklist was developed through a collaboration of ESIP Data Readiness Cluster members, including representatives from NOAA, NASA, USGS, and other organizations. It will be updated periodically to reflect community feedback.

ESIP Data Readiness Cluster (2023): Checklist to Examine AI-readiness for Open Environmental Datasets v.1.0. ESIP. Online resource. https://doi.org/10.6084/m9.figshare.19983722.v1

Readiness Matrix (2020): What is AI-Ready Open Data? NOAA. Online resource. https://www.star.nesdis.noaa.gov/star/documents/meetings/2020AI/presentations/202010/20201022_Christensen.pdf

### Prerequisites
Ideally, for an AI-readiness assessment, a dataset should be defined as the minimum measurable bundle (i.e., a physical parameter/variable of observational datasets or model simulations). Assessment at this scale enables better integration of data from different sources for research and development. However, manual assessment at this scale is intensive without automation, so we recommend that current assessments be done at the data file level. If the dataset has different versions, the checklist should be applied to each dataset type (e.g. raw, derived).

### Learning Outcomes
* Know how to check a range of dataset features. 
* Assess a wide range of dataset features that affect the dataset's 'readiness' for machine learning.


---

## **Tutorial Material TODO**


Remember to save your notebook regularly as you work through it to avoid losing your answers.


Run this Jupyter notebook locally using Jupyter Lab. 
* **Add download and running instructions**
* **May need to 'run all cells' to generate checklist - need to test**.


### Data section, optional
Scripts for pulling the data into the notebook assuming


# NOTES 
### * Could switch to combobox then the default unselected value would be None if this would be more useful?
### * Consider creating a class to model a question, response, question type etc. This way it could be possible to attach different importance weightings to questions and make accessing the information easier later on.
### * Consider the best way to format the data object to store checklist results.
### * Add export functions for different formats (CSV, JSON, etc.)
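
The notes above mention modelling each question as a class, attaching importance weightings, and exporting results. One possible shape for that is sketched below. This is only an illustration: the `ChecklistItem` class, its field names, the weighting scale, and the `to_json` / `to_csv` helpers are all hypothetical and are not part of the notebook.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass


@dataclass
class ChecklistItem:
    """One checklist question and its recorded response (hypothetical model)."""
    number: str           # question number as shown in the checklist, e.g. "2.1"
    question: str
    response: str = "N/A"
    weight: float = 1.0   # relative importance; the scale is an assumption


def to_json(items):
    """Serialise a list of ChecklistItem objects to a JSON string."""
    return json.dumps([asdict(item) for item in items], indent=2)


def to_csv(items):
    """Serialise a list of ChecklistItem objects to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["number", "question", "response", "weight"]
    )
    writer.writeheader()
    for item in items:
        writer.writerow(asdict(item))
    return buffer.getvalue()


items = [
    ChecklistItem("2.1", "Will the dataset be updated?", "No"),
    ChecklistItem("2.8", "Is there known bias in the dataset?", "Unknown", weight=2.0),
]
print(to_json(items))
print(to_csv(items))
```

In a notebook, the `response` field could be filled from the corresponding widget's `value` once the assessor has finished answering.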


In [1]:
import ipywidgets as widgets
from IPython.display import display, clear_output

---

## **1. Dataset General Info**


### Basic details

In [2]:

dataset_name = widgets.Text(
    value='',
    placeholder='1.1 Dataset name',
    disabled=False   
)

dataset_version = widgets.Text(
    value='',
    placeholder='1.2. Dataset version',
    disabled=False   
)

dataset_link = widgets.Text(
    value='',
    placeholder='1.3. Location / url link',
    disabled=False   
)

dataset_assessor_name = widgets.Text(
    value='',
    placeholder='1.4. Assessor name',
    disabled=False   
)

dataset_assessor_email = widgets.Text(
    value='',
    placeholder='1.5. Assessor email address',
    disabled=False   
)


display(dataset_name, dataset_version, dataset_link, dataset_assessor_name, dataset_assessor_email)

Text(value='', placeholder='1.1 Dataset name')

Text(value='', placeholder='1.2. Dataset version')

Text(value='', placeholder='1.3. Location / url link')

Text(value='', placeholder='1.4. Assessor name')

Text(value='', placeholder='1.5. Assessor email address')

### Dataset details

In [3]:

raw_derived = widgets.ToggleButtons(
            options=['Raw', 'Derived', 'Unknown'],
            value='Raw',
            description='1.6 Is this raw data or a derived/processed data product?',
            )

observe_model_synthetic = widgets.ToggleButtons(
            options=['Observed', 'Modeled', 'Synthetic'],
            value='Observed',
            description='1.7 Is this observational data, simulation/model output, or synthetic data?',
            )

data_sources = widgets.ToggleButtons(
            options=['Single-source', 'Aggregated'],
            value='Single-source',
            description='1.8 Is the data single-source or aggregated from several sources? ',
            )

display(raw_derived, observe_model_synthetic, data_sources)

ToggleButtons(description='1.6 Is this raw data or a derived/processed data product?', options=('Raw', 'Derived…

ToggleButtons(description='1.7 Is this observational data, simulation/model output, or synthetic data?', option…

ToggleButtons(description='1.8 Is the data single-source or aggregated from several sources? ', options=('Singl…

---

## **2. Data Quality**

### Data timeliness    

In [4]:

data_update = widgets.ToggleButtons(
            options=['Yes', 'No'],
            value='No',
            description='2.1 Will the dataset be updated?',
            disabled=False
            )

data_update_frequency = widgets.ToggleButtons(
            options=['When data updated', 'Hourly', 'Daily', 'Weekly', 'Monthly', 'Annually', 'Other', "N/A"],
            value='N/A',
            description='If the data will be updated, how often will it be updated?',
            disabled=False,
            layout=widgets.Layout(display='none'),
            )

data_update_stages = widgets.ToggleButtons(
            options=['Preliminary data first, then updated later', 'Full record', "N/A"],
            value='N/A',
            description='Will there be different stages of the update?',
            disabled=False,
            layout=widgets.Layout(display='none'),
            )

data_update_delay = widgets.Text(
            value='',
            placeholder='If yes, what is the delay between different stages?',
            disabled=False,
            layout=widgets.Layout(display='none', width='500px'),
            )

data_update_supersede = widgets.ToggleButtons(
            options=['Yes', 'No', "N/A"],
            value='N/A',
            description='Should the new version of the dataset supersede the current version?',
            disabled=False,
            layout=widgets.Layout(display='none'),
            )

# Function to change the display setting of the following UI components. 
def on_click_handler(change): 
    if change["new"] == "Yes":
        data_update_frequency.layout.display = ''
        data_update_stages.layout.display = ''
        data_update_delay.layout.display = ''
        data_update_supersede.layout.display = ''
    else:
        data_update_frequency.layout.display = 'none'
        data_update_stages.layout.display = 'none'
        data_update_delay.layout.display = 'none'
        data_update_supersede.layout.display = 'none'
        
        # Return the values back to default state if 1st option changed back.
        data_update_frequency.value = "N/A"
        data_update_stages.value = "N/A"
        data_update_delay.value = ""
        data_update_supersede.value = "N/A"
        

# Show UI components based on their display settings. 
display(data_update, data_update_frequency, data_update_stages, data_update_delay, data_update_supersede)

# Observe the first UI component for changes and call the on_click_handler function if value property changed. 
data_update.observe(on_click_handler, names="value")

ToggleButtons(description='2.1 Will the dataset be updated?', index=1, options=('Yes', 'No'), value='No')

ToggleButtons(description='If the data will be updated, how often will it be updated?', index=7, layout=Layout…

ToggleButtons(description='Will there be different stages of the update?', index=2, layout=Layout(display='non…

Text(value='', layout=Layout(display='none', width='500px'), placeholder='If yes, what is the delay between di…

ToggleButtons(description='Should the new version of the dataset supersede the current version?', index=2, lay…
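
The show/hide pattern in the cell above recurs throughout this checklist: a controlling question reveals follow-up widgets when answered "Yes" and hides and resets them otherwise. A reusable helper could reduce that repetition. The sketch below is an assumption about how this might be factored, not the notebook's own code; `link_visibility` is a hypothetical name, and it relies only on each widget exposing `value`, `layout.display`, and `observe`, as ipywidgets widgets do.

```python
def link_visibility(controller, dependents, trigger="Yes"):
    """Show dependents when the controller's new value equals `trigger`.

    `dependents` is a list of (widget, default_value) pairs. When the
    controller changes to anything other than `trigger`, each dependent
    is hidden and its value is reset to the given default.
    """
    def handler(change):
        visible = change["new"] == trigger
        for widget, default in dependents:
            widget.layout.display = "" if visible else "none"
            if not visible:
                widget.value = default

    # Register the handler so it runs whenever the controller's value changes.
    controller.observe(handler, names="value")
    return handler


# Hypothetical usage with the widgets defined in the cell above:
# link_visibility(data_update, [
#     (data_update_frequency, "N/A"),
#     (data_update_stages, "N/A"),
#     (data_update_delay, ""),
#     (data_update_supersede, "N/A"),
# ])
```

Because each call builds its own handler, this approach would also avoid redefining a function with the same name in every cell.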

### Data completeness

In [5]:

completeness_docs = widgets.ToggleButtons(
            options=['Yes', 'No'],
            value='No',
            description='2.2 Is there any documentation about the completeness of the dataset?',
            disabled=False
            )

completeness_docs_link = widgets.Text(
            value='',
            placeholder='Please provide a link to the document',
            disabled=False,
            layout=widgets.Layout(display='none', width="500px"),
            )

expected_spatial_coverage = widgets.ToggleButtons(
            options=['Complete', 'Partial', 'Unknown', 'N/A'],
            value='Unknown',
            description='2.3 How complete is the dataset compared to the expected spatial coverage?',
            disabled=False
            )

expected_temporal_coverage = widgets.ToggleButtons(
            options=['Complete', 'Partial', 'Unknown', 'N/A'],
            value='Unknown',
            description='2.4 How complete is the dataset compared to the expected temporal coverage?',
            disabled=False
            )

# Function to change the display setting of the following UI components. 
def on_click_handler(change): 
    if change["new"] == "Yes":
        completeness_docs_link.layout.display = ''
    else:
        completeness_docs_link.layout.display = 'none'
        
        # Return the values back to default state if 1st option changed back.
        completeness_docs_link.value = ""

# Show UI components based on their display settings. 
display(completeness_docs, completeness_docs_link, expected_spatial_coverage, expected_temporal_coverage)

# Observe the first UI component for changes and call the on_click_handler function if value property changed. 
completeness_docs.observe(on_click_handler, names="value")


ToggleButtons(description='2.2 Is there any documentation about the completeness of the dataset?', index=1, op…

Text(value='', layout=Layout(display='none', width='500px'), placeholder='Please provide a link to the documen…

ToggleButtons(description='2.3 How complete is the dataset compared to the expected spatial coverage?', index=…

ToggleButtons(description='2.4 How complete is the dataset compared to the expected temporal coverage?', index…
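
To help answer the temporal coverage question, one rough measure is the fraction of expected time steps that actually appear in the data. The sketch below assumes a regular sampling interval and uses made-up timestamps; `temporal_completeness` is a hypothetical helper, and real datasets may need a different definition of "expected".

```python
from datetime import datetime, timedelta


def temporal_completeness(observed, start, end, step):
    """Return the fraction of expected timestamps in [start, end] found in `observed`."""
    expected = []
    current = start
    while current <= end:
        expected.append(current)
        current += step
    if not expected:
        return 0.0
    present = set(observed)
    return sum(1 for t in expected if t in present) / len(expected)


# Illustrative data: ten expected daily records with two days missing.
start = datetime(2024, 1, 1)
end = datetime(2024, 1, 10)
observed = [start + timedelta(days=d) for d in range(10) if d not in (3, 7)]

fraction = temporal_completeness(observed, start, end, timedelta(days=1))
print(f"{fraction:.0%} of expected daily timestamps are present")
```

A value below 100% would suggest answering 'Partial' for question 2.4, although what counts as acceptable coverage depends on the application.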

### Data consistency

In [6]:

self_consistent_units = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='2.5 Is this dataset self-consistent in that its units, data types, and parameter names do not change over time and space?',
            disabled=False
            )

consistent_units = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='2.6 Is this dataset’s units, data types, and parameter names consistent with similar data collections?',
            disabled=False
            )

consistent_unit_monitoring = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='2.7 Are there processes to monitor for units, data types, and parameter consistency?',
            disabled=False
            )

consistent_unit_review = widgets.Text(
            value='',
            placeholder='If yes, what measures are taken? Manual review / Automated review etc.',
            disabled=False,
            layout=widgets.Layout(display='none', width="500px"),
            )

# Function to change the display setting of the following UI components. 
def on_click_handler(change): 
    if change["new"] == "Yes":
        consistent_unit_review.layout.display = ''
    else:
        consistent_unit_review.layout.display = 'none'
        
        # Return the values back to default state if 1st option changed back.
        consistent_unit_review.value = ""

# Show UI components based on their display settings. 
display(self_consistent_units, consistent_units, consistent_unit_monitoring, consistent_unit_review)

# Observe the first UI component for changes and call the on_click_handler function if value property changed. 
consistent_unit_monitoring.observe(on_click_handler, names="value")


ToggleButtons(description='2.5 Is this dataset self-consistent in that its units, data types, and parameter na…

ToggleButtons(description='2.6 Is this dataset’s units, data types, and parameter names consistent with simila…

ToggleButtons(description='2.7 Are there processes to monitor for units, data types, and parameter consistency…

Text(value='', layout=Layout(display='none', width='500px'), placeholder='If yes, what measures are taken? Man…
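
An automated review for self-consistency (questions 2.5 and 2.7) could be as simple as checking that each parameter is always recorded with the same units. The sketch below assumes the metadata has already been read into a list of dictionaries with `parameter` and `units` keys; that record layout and the `inconsistent_units` helper are illustrative assumptions.

```python
from collections import defaultdict


def inconsistent_units(records):
    """Return {parameter: sorted units} for parameters recorded with more than one unit."""
    seen = defaultdict(set)
    for record in records:
        seen[record["parameter"]].add(record["units"])
    return {name: sorted(units) for name, units in seen.items() if len(units) > 1}


# Illustrative metadata records, e.g. one per file or per time slice.
records = [
    {"parameter": "air_temperature", "units": "K"},
    {"parameter": "air_temperature", "units": "degC"},
    {"parameter": "wind_speed", "units": "m s-1"},
]
print(inconsistent_units(records))
```

An empty result would support answering 'Yes' to question 2.5 for units; data types and parameter names could be checked in the same way.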

### Data bias

In [7]:

dataset_bias = widgets.ToggleButtons(
            options=['Yes', 'No', 'Unknown'],
            value='Unknown',
            description='2.8 Is there known bias in the dataset?',
            )

dataset_bias_measures = widgets.ToggleButtons(
            options=['Yes', 'No', 'Unknown', 'N/A'],
            value='N/A',
            description='Have measures been taken to examine bias?',
            layout=widgets.Layout(display="none")
            )

dataset_bias_measures_detail = widgets.Textarea(
            value='',
            placeholder='If yes, what measures were used?',
            layout=widgets.Layout(display="none", width="500px")
            )

dataset_bias_metrological_traceable = widgets.Textarea(
            value='',
            placeholder='Is the bias metrologically traceable?',
            layout=widgets.Layout(display="none", width="500px")
            )

dataset_bias_report = widgets.ToggleButtons(
            options=['No known bias', 'Found and reported', 'No info available', 'N/A'],
            value='N/A',
            description='Is there reported bias in the data?',
            layout=widgets.Layout(display="none")
            )

dataset_bias_report_link = widgets.Text(
            value='',
            placeholder='(optional) Link to the report/document on the bias',
            layout=widgets.Layout(display="none", width="500px")
            )

dataset_bias_corrected_link = widgets.Text(
            value='',
            placeholder='(optional) Link to a bias-corrected or bias-reduced version of the dataset',
            layout=widgets.Layout(display="none", width="500px")
            )

dataset_bias_tools_link = widgets.Text(
            value='',
            placeholder='(optional) Link to tools available to reduce bias',
            layout=widgets.Layout(display="none", width="500px")
            )

# Function to change the display setting of the following UI components. 
def on_click_handler(change):    

    # Show / hide main trunk of questions. 
    if dataset_bias.value == "Yes":
        dataset_bias_measures.layout.display = ''
        dataset_bias_report.layout.display = ''
        dataset_bias_report_link.layout.display = ''
        dataset_bias_corrected_link.layout.display = ''
        dataset_bias_tools_link.layout.display = ''

    else:   
        dataset_bias_measures.layout.display = 'none'
        dataset_bias_report.layout.display = 'none'
        dataset_bias_report_link.layout.display = 'none'
        dataset_bias_corrected_link.layout.display = 'none'
        dataset_bias_tools_link.layout.display = 'none'
        dataset_bias_measures.value = 'N/A'
        dataset_bias_report.value = 'N/A'
        dataset_bias_report_link.value = ''
        dataset_bias_corrected_link.value = ''
        dataset_bias_tools_link.value = ''

    # Show / hide 2nd trunk of questions.
    if dataset_bias_measures.value == "Yes":
        dataset_bias_measures_detail.layout.display = ''
        dataset_bias_metrological_traceable.layout.display = ''
    else:
        dataset_bias_measures_detail.layout.display = 'none'
        dataset_bias_metrological_traceable.layout.display = 'none'
        dataset_bias_measures_detail.value = ''
        dataset_bias_metrological_traceable.value = ''
        
            
# Display the UI components
display(dataset_bias, dataset_bias_measures, dataset_bias_measures_detail, dataset_bias_metrological_traceable, dataset_bias_report, dataset_bias_report_link, dataset_bias_corrected_link, dataset_bias_tools_link)

# Observe UI components for changes and call the on_click_handler function if value property changed. 
dataset_bias.observe(on_click_handler, names="value")
dataset_bias_measures.observe(on_click_handler, names="value")



ToggleButtons(description='2.8 Is there known bias in the dataset?', index=2, options=('Yes', 'No', 'Unknown')…

ToggleButtons(description='Have measures been taken to examine bias?', index=3, layout=Layout(display='none'),…

Textarea(value='', layout=Layout(display='none', width='500px'), placeholder='If yes, what measures were used?…

Textarea(value='', layout=Layout(display='none', width='500px'), placeholder='Is the bias metrological traceab…

ToggleButtons(description='Is there reported bias in the data?', index=3, layout=Layout(display='none'), optio…

Text(value='', layout=Layout(display='none', width='500px'), placeholder='(optional) Link to the report/docume…

Text(value='', layout=Layout(display='none', width='500px'), placeholder='(optional) Link to a bias-corrected …

Text(value='', layout=Layout(display='none', width='500px'), placeholder='(optional) Link to tools available t…

In [8]:

data_resolution_info = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='2.9 Is there quantitative information about data resolution in space and time?',
            )

data_quality_report = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='2.10 Are there published data quality procedures or reports?',
            )

data_quality_report_link = widgets.Text(
            value='',
            placeholder='If there is published quality information, please provide the link.',
            layout=widgets.Layout(width="500px")
            )

dataset_provenance = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='2.11 Is the provenance of the dataset tracked and documented?',
            )

data_integrity = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='2.12 Are there checksums / other checks for data integrity? ',
            )

display(data_resolution_info, data_quality_report, data_quality_report_link, dataset_provenance, data_integrity)

ToggleButtons(description='2.9 Is there quantitative information about data resolution in space and time?', in…

ToggleButtons(description='2.10 Are there published data quality procedures or reports?', index=2, options=('Y…

Text(value='', layout=Layout(width='500px'), placeholder='If there is published quality information, please pr…

ToggleButtons(description='2.11 Is the provenance of the dataset tracked and documented?', index=2, options=('…

ToggleButtons(description='2.12 Are there checksums / other checks for data integrity? ', index=2, options=('Y…
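
For question 2.12, a common integrity check is to compare a file's cryptographic hash with a value published by the data provider. The sketch below uses SHA-256 from Python's standard library. The helper names are hypothetical, the demonstration runs on a throwaway temporary file, and the hash algorithm a given provider publishes may differ.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_hex):
    """Return True if the file's SHA-256 digest matches the expected value."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Demonstration on a temporary file; a real check would use a data file and
# the checksum published alongside it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example data")
    demo_path = tmp.name

published = hashlib.sha256(b"example data").hexdigest()
checksum_ok = verify_checksum(demo_path, published)
checksum_bad = verify_checksum(demo_path, "0" * 64)
os.remove(demo_path)

print(checksum_ok, checksum_bad)
```

Reading in chunks keeps memory use bounded, which matters for the large files typical of environmental datasets.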

In [9]:

data_size_question = widgets.Label(
    value = '2.13 What is the size of the dataset? Depending on the resource, this might be:'
)

spacer = widgets.Box(layout=widgets.Layout(width='20px'))

total_data_volume = widgets.Text(
    value = '',
    placeholder='Total data volume:'
)

num_data_dimensions_label = widgets.Label(
    value = "Number of data dimensions:"
)

num_data_dimensions = widgets.IntText(
    value = 0,
    layout = widgets.Layout(width="100px")
)

dimensions = widgets.HBox([num_data_dimensions_label, num_data_dimensions])


num_data_files_label = widgets.Label(
    value = "Number of data files:"
)

num_data_files = widgets.IntText(
    value = 0,
    layout = widgets.Layout(width="100px")
)

data_files = widgets.HBox([num_data_files_label, num_data_files])

num_data_rows_label = widgets.Label(
    value = "Number of data table rows:"
)

num_data_rows = widgets.IntText(
    value = 0,
    layout = widgets.Layout(width="100px")
)

data_rows = widgets.HBox([num_data_rows_label, num_data_rows])

num_data_images_label = widgets.Label(
    value = "Number of images:"
)

num_data_images = widgets.IntText(
    value = 0,
    layout = widgets.Layout(width="100px")
)

num_data_images_size_label = widgets.Label(
    value = "Size of images:"
)

num_data_images_size = widgets.Text(
    value = '',
    placeholder='228 x 228'
)

images = widgets.HBox([num_data_images_label, num_data_images, spacer, num_data_images_size_label, num_data_images_size])


display(data_size_question, total_data_volume, data_files, data_rows, dimensions, images)

Label(value='2.13 What is the size of the dataset? Depending on the resource, this might be:')

Text(value='', placeholder='Total data volume:')

HBox(children=(Label(value='Number of data files:'), IntText(value=0, layout=Layout(width='100px'))))

HBox(children=(Label(value='Number of data table rows:'), IntText(value=0, layout=Layout(width='100px'))))

HBox(children=(Label(value='Number of data dimensions:'), IntText(value=0, layout=Layout(width='100px'))))

HBox(children=(Label(value='Number of images:'), IntText(value=0, layout=Layout(width='100px')), Box(layout=La…
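
When the dataset is available as files on disk, the total data volume and number of files for question 2.13 can be measured rather than estimated. The sketch below is one way to do that with the standard library; `summarise_directory` and `human_readable` are hypothetical helpers, and the demonstration uses a temporary directory in place of a real dataset.

```python
import tempfile
from pathlib import Path


def summarise_directory(root):
    """Return (number_of_files, total_bytes) for all regular files under `root`."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


def human_readable(num_bytes):
    """Format a byte count using binary prefixes (1 KiB = 1024 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"


# Demonstration with two small files in a temporary directory tree.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "a.csv").write_bytes(b"x" * 1000)
    (Path(demo_dir) / "sub").mkdir()
    (Path(demo_dir) / "sub" / "b.csv").write_bytes(b"y" * 536)
    file_count, total_bytes = summarise_directory(demo_dir)

print(file_count, human_readable(total_bytes))
```

The results could then be entered in the 'Number of data files' and 'Total data volume' fields above.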

### Data Quality Assessment Matrix

<img src="Images/data_quality_matrix.png" width=800 height="auto" />

---

## **3. Data Documentation**

### Community standard or convention


In [10]:

metadata_standard = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='3.1 Does the dataset metadata follow a community/domain standard or convention?',
            )

metadata_standard_detail = widgets.Text(
            value='',
            placeholder='Which standard is it? (CF, TBD, etc.)',
            layout = widgets.Layout(display="none")
            )

metadata_machine_readable = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='Is the dataset metadata machine-readable?',
            layout = widgets.Layout(display="none")
            )

metadata_spatial_temporal = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='Does it include details on the spatial and temporal extent?',
            layout = widgets.Layout(display="none")
            )


# Function to change the display setting of the following UI components. 
def on_click_handler(change):    

    # Show / hide main trunk of questions. 
    if metadata_standard.value == "Yes":
        metadata_standard_detail.layout.display = ''
        metadata_machine_readable.layout.display = ''
        metadata_spatial_temporal.layout.display = ''
    else: 
        metadata_standard_detail.layout.display = 'none'
        metadata_machine_readable.layout.display = 'none'
        metadata_spatial_temporal.layout.display = 'none'
        metadata_standard_detail.value = ''
        metadata_machine_readable.value = 'N/A'
        metadata_spatial_temporal.value = 'N/A'
        

display(metadata_standard, metadata_standard_detail, metadata_machine_readable, metadata_spatial_temporal)

# Observe UI components for changes and call the on_click_handler function if value property changed. 
metadata_standard.observe(on_click_handler, names="value")

ToggleButtons(description='3.1 Does the dataset metadata follow a community/domain standard or convention?', i…

Text(value='', layout=Layout(display='none'), placeholder='Which standard is it? (CF, TBD, etc.)')

ToggleButtons(description='Is the dataset metadata machine-readable?', index=2, layout=Layout(display='none'),…

ToggleButtons(description='Does it include details on the spatial and temporal extent?', index=2, layout=Layou…

### Data dictionary

In [18]:

data_dictionary = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='3.2 Is there a comprehensive data dictionary/codebook that describes what each element of the dataset means?',
            )

data_dictionary_standardized = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='Is the data dictionary standardized?',
            layout=widgets.Layout(display="none")
            )

data_dictionary_machine_readable = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='Is the data dictionary machine-readable?',
            layout=widgets.Layout(display="none")
            )

parameters_defined_standard = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='Do the parameters follow a defined standard?',
            layout=widgets.Layout(display="none")
            )

parameters_defined_standard_detail = widgets.Text(
            value='',
            placeholder='If the parameters follow a defined standard, which standard is it?',
            layout=widgets.Layout(display="none", width="500px")
            )

parameters_common_vocabulary = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='Are parameters crosswalked in an ontology or common vocabulary (e.g. NIEM)?',
            layout=widgets.Layout(display="none")
            )

# Function to change the display setting of the following UI components. 
def on_click_handler(change):    

    # Show / hide main trunk of questions. 
    if data_dictionary.value == "Yes":
        data_dictionary_standardized.layout.display = ''
        data_dictionary_machine_readable.layout.display = ''
        parameters_defined_standard.layout.display = ''
        parameters_defined_standard_detail.layout.display = ''
        parameters_common_vocabulary.layout.display = ''

    else:   
        data_dictionary_standardized.layout.display = 'none'
        data_dictionary_machine_readable.layout.display = 'none'
        parameters_defined_standard.layout.display = 'none'
        parameters_defined_standard_detail.layout.display = 'none'
        parameters_common_vocabulary.layout.display = 'none'
        data_dictionary_standardized.value = 'N/A'
        data_dictionary_machine_readable.value = 'N/A'
        parameters_defined_standard.value = 'N/A'
        parameters_defined_standard_detail.value = ''
        parameters_common_vocabulary.value = 'N/A'

            
# Display the UI components
display(data_dictionary, data_dictionary_standardized, data_dictionary_machine_readable, parameters_defined_standard, parameters_defined_standard_detail, parameters_common_vocabulary)

# Observe UI components for changes and call the on_click_handler function if value property changed. 
data_dictionary.observe(on_click_handler, names="value")



ToggleButtons(description='3.2 Is there a comprehensive data dictionary/codebook that describes what each elem…

ToggleButtons(description='Is the data dictionary standardized?', index=2, layout=Layout(display='none'), opti…

ToggleButtons(description='Is the data dictionary machine-readable?', index=2, layout=Layout(display='none'), …

ToggleButtons(description='Do the parameters follow a defined standard?', index=2, layout=Layout(display='none…

Text(value='', layout=Layout(display='none', width='500px'), placeholder='If the parameters follow a defined s…

ToggleButtons(description='Are parameters crosswalked in an ontology or common vocabulary (e.g. NIEM)?', index…

### Unique persistent identifier

3.3 Does the dataset have a unique persistent identifier, e.g. DOI? Yes, [supply identifier] / No / Not applicable


In [28]:

unique_persistent_identifier = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='3.3 Does the dataset have a unique persistent identifier, e.g. DOI?',
            )

unique_persistent_identifier_link = widgets.Text(
            value='',
            placeholder='If yes, please supply identifier',
            layout=widgets.Layout(width="500px")
            )

display(unique_persistent_identifier, unique_persistent_identifier_link)

ToggleButtons(description='3.3 Does the dataset have a unique persistent identifier, e.g. DOI?', index=2, opti…

Text(value='', layout=Layout(width='500px'), placeholder='If yes, please supply identifier')
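
If the identifier supplied is a DOI, a light syntax check can catch obvious typing errors before the answer is stored. The sketch below is an assumption about how such a check might look: `normalise_doi` is a hypothetical helper and the regular expression is a commonly used approximation of DOI syntax. It does not confirm that the DOI resolves.

```python
import re

# Approximate pattern for DOIs: "10.", a 4 to 9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def normalise_doi(identifier):
    """Strip a 'doi:' or doi.org URL prefix and return the bare DOI, or None if it does not look like one."""
    text = identifier.strip()
    text = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", text, flags=re.IGNORECASE)
    return text if DOI_PATTERN.match(text) else None


print(normalise_doi("https://doi.org/10.6084/m9.figshare.19983722.v1"))
print(normalise_doi("not-a-doi"))
```

The first call uses the DOI of the ESIP checklist cited in the Overview; the second returns `None` because the text does not match the pattern.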

### Contact information and feedback

In [30]:

contact_info_available = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='3.4 Is there contact information for subject-matter experts?',
            )

feedback_mechanism_available = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='3.5 Is there a mechanism for user feedback and suggestions?',
            )

display(contact_info_available, feedback_mechanism_available)


ToggleButtons(description='3.4 Is there contact information for subject-matter experts?', index=2, options=('Y…

ToggleButtons(description='3.5 Is there a mechanism for user feedback and suggestions?', index=2, options=('Ye…

### Example code / notebooks / toolkits


In [33]:

example_code_available = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='3.6 Are there example codes / notebooks / toolkits available showing how the data can be used?',
            )

display(example_code_available)

ToggleButtons(description='3.6 Are there example codes / notebooks / toolkits available showing how the data c…

### Licenses

In [37]:

dataset_licence = widgets.Text(
            value='',
            placeholder='3.7 What is the license for the data?',
            layout=widgets.Layout(width="500px")
            )

dataset_licence_machine_readable = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='Is the license standardized and machine-readable (e.g. Creative Commons)?',
            )

display(dataset_licence, dataset_licence_machine_readable)

Text(value='', layout=Layout(width='500px'), placeholder='3.7 What is the license for the data?')

ToggleButtons(description='Is the license standardized and machine-readable (e.g. Creative Commons)?', index=2…

### Dataset usage

In [40]:

ai_ml_existing_useage_links = widgets.Textarea(
            value='',
            placeholder='3.8 Has this dataset already been used in AI or ML activities? Link to publications/reports',
            layout=widgets.Layout(width="500px")
            )

usage_recomendations = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='3.9 Are there recommendations on the intended use of the data, and uses that are not recommended?',
            )

display(ai_ml_existing_useage_links, usage_recomendations)

Textarea(value='', layout=Layout(width='500px'), placeholder='3.8 Has this dataset already been used in AI or …

ToggleButtons(description='3.9 Are there recommendations on the intended use of the data, and uses that are no…

### Data Documentation Assessment Matrix

<img src="Images/data_documentation_matrix.png" width=800 height="auto" />

---

## **4. Data Access**

### File formats

In [73]:

dataset_file_formats_label = widgets.Label(
    value = "4.1 What is/are the major file formats? (Use shift / Ctrl / CMD to select multiple)"
)

dataset_file_format_options = ['CSV', 'netCDF', 'geoJSON', 'Shapefile', 'GRIB', 'HDF', 'GeoTIFF', 'KML', 'GINI', 'Zarr', 'Other']

dataset_file_formats = widgets.SelectMultiple(
    options=dataset_major_file_format_options,
    value=(),
    rows=len(dataset_major_file_format_options),
    disabled=False
)

dataset_file_formats_machine_readable = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='Are the main formats machine-readable?',
            )

dataset_file_formats_non_proprietary = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='Is the data available in at least one open, non-proprietary format?',
            )

dataset_file_formats_conversion_tools = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='Are there tools/services to support data format conversion?',
            )

dataset_file_formats_conversion_tools_link = widgets.Text(
            value='',
            placeholder='If yes, provide the link to the tools/services',
            layout=widgets.Layout(width="500px")
            )

display(dataset_file_formats_label, dataset_file_formats, dataset_file_formats_machine_readable, dataset_file_formats_non_proprietary, dataset_file_formats_conversion_tools, dataset_file_formats_conversion_tools_link)

Label(value='4.1 What is/are the major file formats? (Use shift / Ctrl / CMD to select multiple)')

SelectMultiple(index=(0, 1, 2, 3), options=('CSV', 'netCDF', 'geoJSON', 'Shapefile', 'GRIB', 'HDF', 'GeoTIFF',…

ToggleButtons(description='Are the main formats machine-readable?', index=2, options=('Yes', 'No', 'N/A'), val…

ToggleButtons(description='Is the data available in at least one open, non-proprietary format?', index=2, opti…

ToggleButtons(description='Are there tools/services to support data format conversion?', index=2, options=('Ye…

Text(value='', layout=Layout(width='500px'), placeholder='If yes, provide the link to the tools/services')

### Data delivery

In [84]:

dataset_authentication = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='4.2 Does data access require authentication (e.g., a registered user account)?',
            )

dataset_direct_access = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='4.3 Can the file be accessed via direct file downloading or ordering?',
            )

dataset_api_available = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='4.4 Is there an Application Programming Interface (API) or web service to access the data?',
            )

dataset_api_standard_protocol = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='If there is an API, does the API follow an open standard protocol (e.g., OGC)?',
            layout=widgets.Layout(display="none")
            )

dataset_api_documentation_available = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='If there is an API, is there documentation for the API?',
            layout=widgets.Layout(display="none")
            )

dataset_api_documentation_link = widgets.Text(
            value='',
            placeholder='If “Yes”, please provide a URL to the documentation.',
            layout=widgets.Layout(display="none", width="500px")
            )



# Function to change the display setting of the following UI components. 
def on_click_handler(change):    

    # Show / hide main trunk of questions. 
    if dataset_api_available.value == "Yes":
        dataset_api_standard_protocol.layout.display = ''
        dataset_api_documentation_available.layout.display = ''
        dataset_api_documentation_link.layout.display = ''

    else:   
        dataset_api_standard_protocol.layout.display = 'none'
        dataset_api_documentation_available.layout.display = 'none'
        dataset_api_documentation_link.layout.display = 'none'

        dataset_api_standard_protocol.value = 'N/A'
        dataset_api_documentation_available.value = 'N/A'
        dataset_api_documentation_link.value = ''


  
            
# Display the UI components
display(dataset_authentication, dataset_direct_access, dataset_api_available, dataset_api_standard_protocol, dataset_api_documentation_available, dataset_api_documentation_link)


# Observe UI components for changes and call the on_click_handler function if value property changed. 
dataset_api_available.observe(on_click_handler, names="value")






ToggleButtons(description='4.2 Does data access require authentication (e.g., a registered user account)?', in…

ToggleButtons(description='4.3 Can the file be accessed via direct file downloading or ordering?', index=2, op…

ToggleButtons(description='4.4 Is there an Application Programming Interface (API) or web service to access th…

ToggleButtons(description='If there is an API, does the API follow an open standard protocol (e.g., OGC)?', in…

ToggleButtons(description='If there is an API, is there documentation for the API?', index=2, layout=Layout(di…

Text(value='', layout=Layout(display='none', width='500px'), placeholder='If “Yes”, please provide a URL to th…

### Privacy and security


In [87]:

dataset_restricted_protection = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='4.5 For restricted data, have measures been taken to provide some access while still applying appropriate protection for privacy and security?',
            )

dataset_aggregation = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='4.6 Has the data been aggregated to reduce granularity?',
            )

dataset_anonymization = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='4.7 Has the data been anonymized / de-identified?',
            )

dataset_secure_access = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='4.8 Is there secure access to the full dataset for authorized users? ',
            )

display(dataset_restricted_protection, dataset_aggregation, dataset_anonymization, dataset_secure_access)


ToggleButtons(description='4.5 For restricted data, have measures been taken to provide some access while stil…

ToggleButtons(description='4.6 Has the data been aggregated to reduce granularity?', index=2, options=('Yes', …

ToggleButtons(description='4.7 Has the data been anonymized / de-identified?', index=2, options=('Yes', 'No', …

ToggleButtons(description='4.8 Is there secure access to the full dataset for authorized users? ', index=2, op…
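
Question 4.6 asks whether the data has been aggregated to reduce granularity. As a rough illustration of what such aggregation can look like, the sketch below averages daily values into monthly means. The function name `aggregate_daily_to_monthly` and the sample records are hypothetical and are not part of the checklist; real datasets may call for a different aggregation (for example spatial binning).

```python
from collections import defaultdict
from statistics import mean

def aggregate_daily_to_monthly(records):
    """Average (ISO date string, value) pairs into one mean value per YYYY-MM month."""
    by_month = defaultdict(list)
    for date_str, value in records:
        # The first seven characters of an ISO date are the year and month.
        by_month[date_str[:7]].append(value)
    return {month: mean(values) for month, values in sorted(by_month.items())}

# Hypothetical daily observations.
daily = [("2024-01-03", 4.0), ("2024-01-17", 6.0), ("2024-02-09", 3.0)]
monthly = aggregate_daily_to_monthly(daily)
print(monthly)
```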

### Data Access Assessment Matrix


<img src="Images/data_access_matrix.png" width=800 height="auto" />

---

## **5. Data Preparation**

### Null values

In [93]:

dataset_null_values = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description='5.1 Have null values/gaps been filled?',
            )

display(dataset_null_values)

ToggleButtons(description='5.1 Have null values/gaps been filled?', index=2, options=('Yes', 'No', 'N/A'), val…
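
Before answering 5.1 it can help to measure how many gaps a variable actually contains. The following is a minimal sketch, assuming the values are available as a plain Python list and that gaps appear as `None` or NaN. The helper `missing_fraction` and the sample readings are hypothetical; datasets that encode gaps with a fill value (e.g. -9999) would need an extra check.

```python
import math

def missing_fraction(values):
    """Return the fraction of entries that are None or NaN."""
    if not values:
        return 0.0
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return missing / len(values)

# Hypothetical temperature readings with two gaps.
sample = [12.1, None, 13.4, float("nan"), 12.9]
print(f"Missing fraction: {missing_fraction(sample):.0%}")
```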

### Outliers

In [91]:

dataset_outliers = widgets.ToggleButtons(
            options=['Yes, tagged ', 'Yes, removed', 'No', 'N/A'],
            value='N/A',
            description='5.2 Have outliers been identified?',
            )

display(dataset_outliers)


ToggleButtons(description='5.2 Have outliers been identified?', index=3, options=('Yes, tagged ', 'Yes, remove…
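
The 'Yes, tagged' option in 5.2 implies that outliers are flagged rather than dropped. One common way to do this is the interquartile range (IQR) rule, sketched below under the assumption that the values fit in a Python list. This is only one possible technique, not a method prescribed by the checklist, and the function name `tag_outliers_iqr`, the multiplier `k` and the sample values are illustrative.

```python
import statistics

def tag_outliers_iqr(values, k=1.5):
    """Return a list of booleans, True where a value lies outside the IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]

# Hypothetical measurements with one suspicious value.
sample = [10.2, 10.5, 9.8, 10.1, 10.4, 55.0, 10.0, 9.9]
flags = tag_outliers_iqr(sample)
print([v for v, is_outlier in zip(sample, flags) if is_outlier])
```

Keeping the flags alongside the original values lets downstream users decide whether to exclude the tagged points.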

### Gridded data


In [120]:

dataset_gridded_label = widgets.Label(
    value = "5.3 Is the data gridded (regularly sampled in time and space)?"
)

dataset_gridded = widgets.Select(
            options=['Regularly gridded in space', 'Constant time-frequency', 'Regularly gridded in space and constant time-frequency', 'Not gridded', 'N/A'],
            value='N/A',
            layout = widgets.Layout(width='500px')
            )

dataset_gridded_transformed_label = widgets.Label(
    value = "If the data is gridded, was it transformed from a different original sampling?"
)

dataset_gridded_transformed = widgets.Select(
            options=['Yes, from irregular sampling', 'Yes, from a different regular sampling', 'No, this is the original sampling', 'N/A'],
            value='N/A',
            layout = widgets.Layout(width='500px')
            )


dataset_gridded_original_sample = widgets.ToggleButtons(
            options=['Yes', 'No', 'Only by request', 'N/A'],
            value='N/A',
            description = 'If the data is resampled from the original sampling, is the data also available at the original sampling?'
            )

display(dataset_gridded_label, dataset_gridded, dataset_gridded_transformed_label, dataset_gridded_transformed, dataset_gridded_original_sample)

Label(value='5.3 Is the data gridded (regularly sampled in time and space)?')

Select(index=4, layout=Layout(width='500px'), options=('Regularly gridded in space', 'Constant time-frequency'…

Label(value='If the data is gridded, was it transformed from a different original sampling?')

Select(index=3, layout=Layout(width='500px'), options=('Yes, from irregular sampling', 'Yes, from a different …

ToggleButtons(description='If the data is resampled from the original sampling, is the data also available at …
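
To support the answer to 5.3, a coordinate axis can be checked for constant spacing. The sketch below assumes the coordinates (latitudes, longitudes or time offsets) are available as an ordered list of numbers. The helper `is_regularly_spaced`, its tolerance and the sample axes are hypothetical.

```python
import math

def is_regularly_spaced(coords, rel_tol=1e-6):
    """Return True if consecutive coordinates differ by a constant step."""
    if len(coords) < 3:
        return True  # Too few points to show irregular spacing.
    steps = [b - a for a, b in zip(coords, coords[1:])]
    return all(math.isclose(s, steps[0], rel_tol=rel_tol) for s in steps)

latitudes = [50.0, 50.25, 50.5, 50.75]  # hypothetical regular spatial axis
hours = [0, 6, 12, 20]                  # hypothetical irregular time axis
print(is_regularly_spaced(latitudes), is_regularly_spaced(hours))
```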

### Targets / labels for supervised learning

In [134]:

dataset_targets_or_labels = widgets.ToggleButtons(
            options=['Yes', 'No', 'N/A'],
            value='N/A',
            description = '5.4 Are there associated targets or labels for supervised learning techniques (i.e., can this be used as a training dataset for supervised learning techniques)?'
            )

dataset_targets_or_labels_standards_label = widgets.Label(
    value = "If there are associated targets/labels, are community labeling standards implemented?"
)

dataset_targets_or_labels_standards = widgets.Text(
            value = '',
            placeholder = 'e.g., STAC label extension, ESA AIREO specification, etc.',
            layout = widgets.Layout(width="500px")
)

display(dataset_targets_or_labels, dataset_targets_or_labels_standards_label, dataset_targets_or_labels_standards )

ToggleButtons(description='5.4 Are there associated targets or labels for supervised learning techniques (i.e.…

Label(value='If there are associated targets/labels, are community labeling standards implemented?')

Text(value='', layout=Layout(width='500px'), placeholder='e.g., STAC label extension, ESA AIREO specification,…

## Finished

In [133]:

button_finished = widgets.Button(description="Print checklist",  button_style='info')
output = widgets.Output()
display(button_finished, output)


results = {}

def generate_results(b):
    results["1.1 \tDataset name"] = dataset_name.value
    results["1.2 \tDataset version"] = dataset_version.value
    results["1.3 \tDataset link"] = dataset_link.value
    results["1.4 \tAssessor name"] = dataset_assessor_name.value
    results["1.5 \tAssessor email address"] = dataset_assessor_email.value
    results["1.6 \tData product"] = raw_derived.value
    results["1.7 \tData origin"] = observe_model_synthetic.value
    results["1.8 \tData source"] = data_sources.value
    results["2.1 \tWill dataset be updated"] = data_update.value
    results["2.1.1 \tUpdate frequency"] = data_update_frequency.value
    results["2.1.2 \tUpdate stages"] = data_update_stages.value
    results["2.1.3 \tUpdate delay reason"] = data_update_delay.value
    results["2.1.4 \tShould new version supersede"] = data_update_supersede.value
    results["2.2 \tDataset completeness documentation"] = completeness_docs.value
    results["2.2.1 \tDataset completeness doc link"] = completeness_docs_link.value
    results["2.3 \tDataset completion vs expected spatial coverage"] = expected_spatial_coverage.value
    results["2.4 \tDataset completion vs expected temporal coverage"] = expected_temporal_coverage.value
    results["2.5 \tSelf-consistent units, dtypes, parameters"] = self_consistent_units.value
    results["2.6 \tConsistent units, dtypes, parameter with similar datasets"] = consistent_units.value
    results["2.7 \tConsistent units, dtypes, parameters monitoring"] = consistent_unit_monitoring.value
    results["2.7.1 \tConsistent units, dtypes, parameter monitoring measures"] = consistent_unit_review.value
    results["2.8 \tIs there known bias in the dataset"] = dataset_bias.value
    results["2.8.1 \tHas bias been examined"] = dataset_bias_measures.value
    results["2.8.2 \tBias measures used"] = dataset_bias_measures_detail.value
    results["2.8.3 \tIs bias metrologically traceable"] = dataset_bias_metrological_traceable.value
    results["2.8.3 \tReported bias in data"] = dataset_bias_report.value
    results["2.8.4 \tBias report link"] = dataset_bias_report_link.value
    results["2.8.5 \tBias corrected dataset version link"] = dataset_bias_report.value
    results["2.8.6 \tTools to reduce bias link"] = dataset_bias_tools_link.value    
    results["2.9 \tSpace and time data resolution info"] = data_resolution_info.value
    results["2.10 \tPublished data quality procedures or reports"] = data_quality_report.value
    results["2.10.1 \tIf there is published quality information, please provide the link"] = data_quality_report_link.value
    results["2.11 \tProvenance of the dataset tracked and documented"] = dataset_provenance.value
    results["2.12 \tChecksums / other checks for data integrity"] = data_integrity.value
    results["2.13.1 \tTotal data volume"] = total_data_volume.value
    results["2.13.2 \tNumber of data dimensions"] = num_data_dimensions.value
    results["2.13.3 \tNumber of data files"] = num_data_files.value
    results["2.13.4 \tNumber of data table rows"] = num_data_rows.value
    results["2.13.5 \tNumber of images"] = num_data_images.value
    results["2.13.6 \tSize of images"] = num_data_images_size.value
    results["3.2 \tData dictionary for dataset / parameters"] = data_dictionary.value
    results["3.2.1 \tData dictionary standardized"] = data_dictionary_standardized.value
    results["3.2.2 \tData dictionary machine-readable"] = data_dictionary_machine_readable.value
    results["3.2.3 \tParameters follow a defined standard"] = parameters_defined_standard.value
    results["3.2.4 \tWhich standard do the parameters follow"] = parameters_defined_standard_detail.value
    results["3.2.5 \tAre parameters crosswalked in an ontology or common vocabulary"] = parameters_common_vocabulary.value
    results["3.3 \tHas a unique persistent identifier"] = unique_persistent_identifier.value
    results["3.3.1 \tUnique persistent identifier link"] = unique_persistent_identifier_link.value
    results["3.4 \tContact info available"] = contact_info_available.value
    results["3.5 \tFeedback mechanism available"] = feedback_mechanism_available.value
    results["3.6 \tExample code / notebooks / toolkits"] = example_code_available.value
    results["3.7 \tLicence"] = dataset_licence.value
    results["3.7.1 \tLicence machine-readable"] = dataset_licence_machine_readable.value
    results["3.8 \tAI / ML existing usage links"] = ai_ml_existing_useage_links.value
    results["3.9 \tDataset usage recommendations"] = usage_recomendations.value
    results["4.1 \tWhat is/are the major file formats"] = dataset_file_formats.value
    results["4.1.1 \tMain data formats machine-readable"] = dataset_file_formats_machine_readable.value
    results["4.1.2 \tData available in at least one open, non-proprietary format"] = dataset_file_formats_non_proprietary.value
    results["4.1.3 \tData format conversion tools / services"] = dataset_file_formats_conversion_tools.value
    results["4.1.4 \tData format conversion tools / services link"] = dataset_file_formats_conversion_tools_link.value
    results["4.2 \tDataset requires authentication"] = dataset_authentication.value
    results["4.3 \tDirect file download / order access"] = dataset_direct_access.value
    results["4.4 \tAPI / webservice available"] = dataset_api_available.value
    results["4.4.1 \tAPI follow an open standard protocol"] = dataset_api_standard_protocol.value
    results["4.4.2 \tAPI documentation available"] = dataset_api_documentation_available.value
    results["4.4.3 \tAPI documentation link"] = dataset_api_documentation_link.value
    results["4.5 \tRestricted data with some access but appropriate protections"] = dataset_restricted_protection.value
    results["4.6 \tData aggregated to reduce granularity"] = dataset_aggregation.value
    results["4.7 \tData has been anonymized / de-identified"] = dataset_anonymization.value
    results["4.8 \tSecure access to full dataset for authorized users"] = dataset_secure_access.value
    results["5.1 \tNull values / gaps been filled"] = dataset_null_values.value
    results["5.2 \tOutlier values been identified"] = dataset_outliers.value
    results["5.3 \tIs the data gridded"] = dataset_gridded.value
    results["5.3.1 \tIs gridded data transformed from original sampling"] = dataset_gridded_transformed.value
    results["5.3.2 \tIf data is resampled, is the original sampling available"] = dataset_gridded_original_sample.value
    results["5.4 \tAssociated targets or labels for supervised learning techniques"] = dataset_targets_or_labels.value
    results["5.4.1 \tAre targets or labels community standards implemented"] = dataset_targets_or_labels_standards.value
    
    # Print checklist results.      
    with output:
        clear_output()
        print("\n------ CHECKLIST RESULTS ------\n")
        for key, value in results.items():
            #if value != "":
            print(f"{key}: {value}")
        
button_finished.on_click(generate_results)

Button(button_style='info', description='Print checklist', style=ButtonStyle())

Output()
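
The printed checklist is lost when the notebook is closed, so it may be useful to keep a machine-readable copy. The following is a minimal sketch, assuming a dictionary shaped like `results` above. The helper `results_to_json` and the example entries are hypothetical; the tab characters in the keys are collapsed to single spaces so the JSON is easier to read.

```python
import json

def results_to_json(results):
    """Serialise checklist results to a JSON string, collapsing whitespace in the keys."""
    cleaned = {" ".join(key.split()): value for key, value in results.items()}
    return json.dumps(cleaned, indent=2)

# Hypothetical subset of the results dictionary.
example_results = {
    "1.1 \tDataset name": "Example dataset",
    "4.1 \tWhat is/are the major file formats": ("CSV", "netCDF"),
    "5.1 \tNull values / gaps been filled": "N/A",
}
print(results_to_json(example_results))
```

Tuple values, such as the selection from a `SelectMultiple` widget, are written as JSON arrays.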

---

## **Appendix** - Definition of terms used in the checklist.

### Quality
* **Completeness**: the breadth of a dataset compared to an ideal 100% completion (spatial, temporal, demographic, etc.); important in avoiding sampling bias
* **Consistency**: uniformity within the entire dataset or compared with similar data collections; for example, no changes in units or data types over time; the item measured against itself or its counterpart in another dataset or database
* **Bias**: a systematic tilt in the dataset when compared to a reference, caused for example by instrumentation, incorrect data processing, unrepresentative sampling, or human error; the exact nature of bias and how it is measured will vary depending on the type of data and the research domain.
* **Uncertainty**: a parameter, associated with the result of a measurement, that characterizes the dispersion of the values that could reasonably be attributed to the measurand.
* **Timeliness**: the speed of data release, compared to when an event occurred or measurements were made; requirements will vary depending on the timeframe of the phenomenon (e.g., severe thunderstorms vs. climate change, or disease outbreaks vs. life expectancy trends)
* **Provenance**: identification of the data sources, how the data was processed, and who released it.
* **Integrity**: verification that the data remains unchanged from the original; aka data fixity.
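
The integrity definition above is usually checked in practice by comparing checksums. The sketch below computes a SHA-256 digest with Python's standard `hashlib` module, assuming the data is stored in a local file. The helper `sha256_of_file` and the temporary example file are hypothetical; a data provider may publish checksums using a different algorithm.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Write a small temporary file so the example runs without external data.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example data\n")
checksum = sha256_of_file(tmp.name)
os.remove(tmp.name)
print(checksum)
```

If the computed digest matches the one published with the dataset, the file is unchanged from the original.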

### Documentation
* **Dataset Metadata**: complete information about the dataset: quality, provenance, location, time period, responsible parties, purpose, etc.
* **Data Dictionary/Codebook**: complete information about the individual variables / measures / parameters within a dataset: type, units, null value, etc.
* **Identifier**: a code or number that uniquely identifies a dataset
* **Ontology**: formalized definitions of concepts within a domain of knowledge, and the nature of the inter-relationships among those concepts

### Data Access

* **Formats**: standards that govern how information is stored in a computer file (e.g., CSV, JSON, GeoTIFF, etc.); different AI user communities will have different requirements, so the best practice is to provide several format options to meet the needs of multiple high priority user communities.
* **Delivery Options**: mechanisms for publishing open data for public use (e.g., direct file download, Application Programming Interface (API), cloud services, etc.); different AI user communities will have different requirements, so the best practice is to provide several delivery options to meet the needs of multiple high priority user communities.
* **License/Usage Rights**: information on who is allowed to use the data and for what purposes, including data sharing agreements, fees, etc.; some federal data needs to have restrictions and some will be fully open, so rights should be documented in detail
* **Security/Privacy**: protection of data that is restricted in some way (privacy, proprietary/business information, national security, etc.)
