# **Data Readiness For AI Checklist - Part 4**

 * Creator(s): John Pill
 * Affiliation: UK Met Office
 * History: 1.0
 * Last update: 27 August 2024.


---

## **Tutorial Material**

* **Run this Jupyter notebook locally using Jupyter Lab.**
* **Select 'Run All Cells' from the 'Run' menu to generate the checklist.**
* **Remember to save your notebook regularly as you work through it.**


## **Data section, optional**
Scripts for pulling the data into the notebook assuming

---

## **Setup Notebook**

In [1]:
import ipywidgets as widgets
from IPython.display import display, clear_output
import json
import sys
import os
from aidatareadiness import utils
from aidatareadiness.utils import WIDGET_WIDTH, DESCRIPTION_STYLE, PLACEHOLDER  
from aidatareadiness.checklist_auto import gridded 

## **Load Data**

In [2]:
# Use the following function to load your dataset and check that its file format is compatible.
# Add the filename / file path of your gridded dataset below:

gridded_file_path = "/home/coder/ai_data_readiness/new_data/conus_HUMID_20180101.nc"

# The lines below check compatibility and load your dataset.
dataset = gridded.detect_gridded_format_and_open(gridded_file_path)
dataset
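
If the loader rejects a file, it can help to check what the file actually is, independently of its extension. The sketch below is not how `gridded.detect_gridded_format_and_open` works (its internals are not shown here); it is a rough first check that reads the leading "magic" bytes of a file. The function name `sniff_gridded_format` and the labels it returns are hypothetical. It only looks at offset 0, so files with extra leading bytes (for example an HDF5 user block, or a GRIB message wrapped in a bulletin header) will come back as unrecognised.

```python
# Leading "magic" bytes for a few common gridded formats.
# netCDF classic files start with b"CDF"; netCDF-4 files are HDF5 containers,
# so they carry the HDF5 signature instead.
MAGIC_SIGNATURES = [
    (b"\x89HDF\r\n\x1a\n", "HDF5 (includes netCDF-4)"),
    (b"CDF", "netCDF classic"),
    (b"GRIB", "GRIB"),
]


def sniff_gridded_format(path):
    """Return a best-guess format label for `path`, or None if unrecognised.

    Hypothetical helper for illustration only; checks offset 0 of the file.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    for signature, label in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return label
    return None
```

Calling `sniff_gridded_format(gridded_file_path)` before loading gives a quick indication of whether a `.nc` file is netCDF classic or netCDF-4/HDF5.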

In [3]:
# Load checklist from JSON file:
checklist = utils.load_checklist()

#### Reset stored answers to start again:

In [4]:
# Reset all checklist answers back to original blank answers for all sections.
# Any completed information will be lost. 

# To reset the stored answers uncomment and run these lines of code below. Re-comment the lines afterwards to avoid them running again. 
# utils.reset_checklist()
# checklist = utils.load_checklist()

# You can then re-run each section to reload it on the reset data. 

In [5]:

print("Dataset:", checklist["GeneralInformation"]["DatasetName"])
print("Dataset link:", checklist["GeneralInformation"]["DatasetLink"])
print("Assessor:", checklist["GeneralInformation"]["AssessorName"])
print("Assessor email:", checklist["GeneralInformation"]["AssessorEmailAddress"])

Dataset: HUMID
Dataset link: 
Assessor: 
Assessor email: 


---

## **4. Data Access**

### File formats

In [6]:

dataset_file_formats_label = widgets.Label(
    value = "4.1 What is/are the major file formats? (Use shift / Ctrl / CMD to select multiple)"
)

dataset_file_format_options = ['CSV', 'netCDF', 'geoJSON', 'Shapefile', 'GRIB', 'HDF', 'GeoTIFF', 'KML', 'GINI', 'Zarr', 'Other']

dataset_file_formats = widgets.SelectMultiple(
            value=checklist['DataAccess']['FileFormats'],
            options=dataset_file_format_options,
            rows=len(dataset_file_format_options),
)

dataset_file_formats_machine_readable = widgets.Combobox(
            value=checklist['DataAccess']['FileFormatsMachineReadable'],
            options=['Yes', 'No', 'N/A'],
            description='Are the main formats machine-readable?',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

dataset_file_formats_non_proprietary = widgets.Combobox(
            value=checklist['DataAccess']['OpenFormatAvailable'],
            options=['Yes', 'No', 'N/A'],
            description='Is the data available in at least one open, non-proprietary format?',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

dataset_file_formats_conversion_tools = widgets.Combobox(
            value=checklist['DataAccess']['FormatConversionTools'],
            options=['Yes', 'No', 'N/A'],
            description='Are there tools/services to support data format conversion?',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

dataset_file_formats_conversion_tools_link = widgets.Text(
            value=checklist['DataAccess']['ConversionToolsLink'],
            description='Tools / services link:',
            placeholder='If yes, provide the link to the tools/services',
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

def on_click_handler_sel_format(change):
    """
    Completes various fields based on the file format selection.
    """
    
    selected_formats = change['new']
    dataset_file_formats_conversion_tools_link_dict = {'CSV':['https://pandas.pydata.org/', 'https://www.qgis.org/', 'https://www.arcgis.com/home/index.html', ],
                                                        'netCDF':['https://unidata.github.io/netcdf4-python/','https://scitools-iris.readthedocs.io/en/stable/index.html', 'https://docs.xarray.dev/en/stable/', 'https://www.giss.nasa.gov/tools/panoply/', 'https://www.unidata.ucar.edu/software/tds/', 'https://www.mathworks.com/products/matlab.html'],
                                                        'geoJSON':['https://www.qgis.org/', 'https://www.arcgis.com/home/index.html', 'https://geopandas.org/en/stable/', 'https://shapely.readthedocs.io/en/stable/manual.html'],
                                                        'Shapefile':['https://www.qgis.org/', 'https://www.arcgis.com/home/index.html', 'https://gdal.org/en/latest/', 'https://pypi.org/project/pyshp/', 'https://geopandas.org/en/stable/',], 
                                                        'GRIB':['https://scitools-iris.readthedocs.io/en/stable/index.html', 'https://www.giss.nasa.gov/tools/panoply/', 'https://www.qgis.org/', 'https://www.cpc.ncep.noaa.gov/products/wesley/wgrib2/', 'https://github.com/ecmwf/cfgrib'],
                                                        'HDF':['https://www.h5py.org/', 'https://www.giss.nasa.gov/tools/panoply/','https://earth.esa.int/eogateway/tools/hdfview'],
                                                        'GeoTIFF':['https://www.qgis.org/', 'https://www.arcgis.com/home/index.html','https://gdal.org/en/latest/','https://rasterio.readthedocs.io/en/stable/'],
                                                        'KML':['https://www.qgis.org/', 'https://www.arcgis.com/home/index.html','https://fastkml.readthedocs.io/en/latest/','https://simplekml.readthedocs.io/en/latest/'],
                                                        'GINI':['https://gdal.org/en/latest/', 'https://www.unidata.ucar.edu/software/metpy/'],
                                                        'Zarr':['https://zarr.readthedocs.io/en/stable/', 'https://docs.xarray.dev/en/stable/','https://www.dask.org']}

    if set(selected_formats) & {'CSV', 'netCDF', 'geoJSON', 'Shapefile', 'GRIB', 'HDF', 'GeoTIFF', 'KML', 'GINI', 'Zarr'}:
        dataset_file_formats_machine_readable.value = 'Yes'
    elif set(selected_formats) & {'Other'}:
        dataset_file_formats_machine_readable.value = ''
    else:
        dataset_file_formats_machine_readable.value = 'No'
    
    if set(selected_formats) & {'CSV', 'netCDF', 'geoJSON', 'Shapefile', 'GRIB', 'HDF','GeoTIFF', 'Zarr'}:
        dataset_file_formats_non_proprietary.value = 'Yes'
    elif set(selected_formats) & {'Other'}:
        dataset_file_formats_non_proprietary.value = ''
    else:
        dataset_file_formats_non_proprietary.value = 'No'
    
    # Determine if conversion tools are available
    if set(selected_formats) & {'CSV', 'netCDF', 'geoJSON', 'Shapefile', 'GRIB', 'HDF', 'GeoTIFF', 'KML', 'GINI', 'Zarr'}:
        dataset_file_formats_conversion_tools.value = 'Yes'
        # Collect the links for every selected format. 'Other' has no entry in the
        # dictionary, so .get() avoids a KeyError when it is selected alongside a known format.
        conversion_tool_links = set()
        for selected_format in selected_formats:
            conversion_tool_links.update(dataset_file_formats_conversion_tools_link_dict.get(selected_format, []))
        # Sort the de-duplicated links so their order is stable between runs.
        dataset_file_formats_conversion_tools_link.value = ' '.join(sorted(conversion_tool_links))
    elif set(selected_formats) & {'Other'}:
        dataset_file_formats_conversion_tools.value = ''
        dataset_file_formats_conversion_tools_link.value = ''
    else:
        dataset_file_formats_conversion_tools.value = 'No'
        dataset_file_formats_conversion_tools_link.value = ''

dataset_file_formats.observe(on_click_handler_sel_format, names='value')

display(dataset_file_formats_label, dataset_file_formats, dataset_file_formats_machine_readable, dataset_file_formats_non_proprietary, dataset_file_formats_conversion_tools, dataset_file_formats_conversion_tools_link)

Label(value='4.1 What is/are the major file formats? (Use shift / Ctrl / CMD to select multiple)')

SelectMultiple(options=('CSV', 'netCDF', 'geoJSON', 'Shapefile', 'GRIB', 'HDF', 'GeoTIFF', 'KML', 'GINI', 'Zar…

Combobox(value='', description='Are the main formats machine-readable?', layout=Layout(width='900px'), options…

Combobox(value='', description='Is the data available in at least one open, non-proprietary format?', layout=L…

Combobox(value='', description='Are there tools/services to support data format conversion?', layout=Layout(wi…

Text(value='', description='Tools / services link:', layout=Layout(width='900px'), placeholder='If yes, provid…
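
The pre-filled answers above all follow one rule: "Yes" if any selected format is in a qualifying set, blank if only 'Other' applies (so you answer manually), and "No" otherwise. The sketch below restates that rule as a plain function that can be checked without any widgets. The names `suggest_answer`, `KNOWN_FORMATS` and `OPEN_FORMATS` are hypothetical; the two sets are copied from the handler in the cell above.

```python
# Format sets copied from the handler above.
KNOWN_FORMATS = {'CSV', 'netCDF', 'geoJSON', 'Shapefile', 'GRIB', 'HDF', 'GeoTIFF', 'KML', 'GINI', 'Zarr'}
OPEN_FORMATS = {'CSV', 'netCDF', 'geoJSON', 'Shapefile', 'GRIB', 'HDF', 'GeoTIFF', 'Zarr'}


def suggest_answer(selected_formats, qualifying_formats):
    """Return the suggested 'Yes' / '' / 'No' answer for a format question.

    Hypothetical helper mirroring the if / elif / else chain in the handler.
    """
    selected = set(selected_formats)
    if selected & qualifying_formats:
        return 'Yes'   # at least one selected format qualifies
    if 'Other' in selected:
        return ''      # unknown format: leave blank for a manual answer
    return 'No'


# Example: KML counts as machine-readable but is not in the open-format set above.
print(suggest_answer(['KML'], KNOWN_FORMATS))  # Yes
print(suggest_answer(['KML'], OPEN_FORMATS))   # No
```

These are suggestions only; the comboboxes stay editable, so you can override any value before saving.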

### Data delivery

In [7]:

dataset_authentication = widgets.Combobox(
            value=checklist['DataAccess']['AuthenticationRequired'],
            options=['Yes', 'No', 'N/A'],
            description='4.2 Does data access require authentication (e.g., a registered user account)?',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

dataset_direct_access = widgets.Combobox(
            value=checklist['DataAccess']['DirectDownloadAvailable'],
            options=['Yes', 'No', 'N/A'],
            description='4.3 Can the file be accessed via direct file downloading or ordering?',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

dataset_api_available = widgets.Combobox(
            value=checklist['DataAccess']['APIorWebAvailable'],
            options=['Yes', 'No', 'N/A'],
            description='4.4 Is there an Application Programming Interface (API) or web service to access the data?',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

dataset_api_standard_protocol = widgets.Combobox(
            value=checklist['DataAccess']['APIOpenStandard'],
            options=['Yes', 'No', 'N/A'],
            description='If there is an API, does the API follow an open standard protocol (e.g., OGC)?',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(display="none", width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

dataset_api_documentation_available = widgets.Combobox(
            value=checklist['DataAccess']['APIDocumentation'],
            options=['Yes', 'No', 'N/A'],
            description='If there is an API, is there documentation for the API?',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(display="none", width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

dataset_api_documentation_link = widgets.Text(
            value=checklist['DataAccess']['APIDocumentationLink'],
            placeholder='If “Yes”, please provide a URL to the documentation.',
            layout=widgets.Layout(display="none", width=WIDGET_WIDTH)
            )



# Show or hide the API follow-up questions depending on the answer to question 4.4.
def on_click_handler(change):

    # Show the follow-up questions when an API or web service is available.
    if dataset_api_available.value == "Yes":
        dataset_api_standard_protocol.layout.display = ''
        dataset_api_documentation_available.layout.display = ''
        dataset_api_documentation_link.layout.display = ''

    # Otherwise hide the follow-up questions and reset their answers.
    else:
        dataset_api_standard_protocol.layout.display = 'none'
        dataset_api_documentation_available.layout.display = 'none'
        dataset_api_documentation_link.layout.display = 'none'

        dataset_api_standard_protocol.value = 'N/A'
        dataset_api_documentation_available.value = 'N/A'
        dataset_api_documentation_link.value = ''


# Display the UI components
display(dataset_authentication, dataset_direct_access, dataset_api_available, dataset_api_standard_protocol, dataset_api_documentation_available, dataset_api_documentation_link)


# Observe the 4.4 widget and call on_click_handler whenever its value changes.
dataset_api_available.observe(on_click_handler, names="value")

# The handler only runs on changes, so apply it once if a saved answer of "Yes" was loaded.
if dataset_api_available.value == "Yes":
    on_click_handler(None)



Combobox(value='', description='4.2 Does data access require authentication (e.g., a registered user account)?…

Combobox(value='', description='4.3 Can the file be accessed via direct file downloading or ordering?', layout…

Combobox(value='', description='4.4 Is there an Application Programming Interface (API) or web service to acce…

Combobox(value='', description='If there is an API, does the API follow an open standard protocol (e.g., OGC)?…

Combobox(value='', description='If there is an API, is there documentation for the API?', layout=Layout(displa…

Text(value='', layout=Layout(display='none', width='900px'), placeholder='If “Yes”, please provide a URL to th…

### Privacy and security


In [8]:

dataset_restricted_protection = widgets.Combobox(
            value=checklist['DataAccess']['SecurityMeasuresTaken'],
            options=['Yes', 'No', 'N/A'],
            description='4.5 For restricted data, have measures been taken to provide some access while still applying appropriate protection for privacy and security?',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            
            )

dataset_aggregation = widgets.Combobox(
            value=checklist['DataAccess']['DataAggregated'],
            options=['Yes', 'No', 'N/A'],
            description='4.6 Has the data been aggregated to reduce granularity?',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

dataset_anonymization = widgets.Combobox(
            value=checklist['DataAccess']['DataAnonymized'],
            options=['Yes', 'No', 'N/A'],
            description='4.7 Has the data been anonymized / de-identified?',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

dataset_secure_access = widgets.Combobox(
            value=checklist['DataAccess']['SecureAccessForAuthorizedUsers'],
            options=['Yes', 'No', 'N/A'],
            description='4.8 Is there secure access to the full dataset for authorized users? ',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

display(dataset_restricted_protection, dataset_aggregation, dataset_anonymization, dataset_secure_access)


Combobox(value='', description='4.5 For restricted data, have measures been taken to provide some access while…

Combobox(value='', description='4.6 Has the data been aggregated to reduce granularity?', layout=Layout(width=…

Combobox(value='', description='4.7 Has the data been anonymized / de-identified?', layout=Layout(width='900px…

Combobox(value='', description='4.8 Is there secure access to the full dataset for authorized users? ', layout…
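
For gridded data, question 4.6 usually comes down to spatial or temporal coarsening. As a minimal sketch of what "aggregated to reduce granularity" can mean, the function below averages non-overlapping `factor` x `factor` blocks of a 2D grid held as a list of rows. The name `block_mean` is hypothetical and the example values are made up for illustration; nothing here is taken from the HUMID dataset.

```python
def block_mean(grid, factor):
    """Coarsen a 2D grid (list of equal-length rows) by averaging
    non-overlapping factor x factor blocks.

    Hypothetical helper; assumes both dimensions are exact multiples of `factor`.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    if n_rows % factor or n_cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    coarse = []
    for r in range(0, n_rows, factor):
        coarse_row = []
        for c in range(0, n_cols, factor):
            # Gather every cell in the current block and average it.
            block = [grid[i][j]
                     for i in range(r, r + factor)
                     for j in range(c, c + factor)]
            coarse_row.append(sum(block) / len(block))
        coarse.append(coarse_row)
    return coarse


fine = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(block_mean(fine, 2))  # [[3.5, 5.5], [11.5, 13.5]]
```

With xarray, `dataset.coarsen(lat=2, lon=2).mean()` performs a comparable operation, assuming the dataset has dimensions named `lat` and `lon`.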

In [9]:

# Save button
save_button = widgets.Button(description="Save Data Access Answers to json file",  button_style="primary",  layout=widgets.Layout(flex='1 1 auto', width='auto'))

def generate_updates_access():

    updates = {
        "DataAccess": {
            #File Formats
            "FileFormats": dataset_file_formats.value,
            "FileFormatsMachineReadable": dataset_file_formats_machine_readable.value,
            "OpenFormatAvailable": dataset_file_formats_non_proprietary.value,
            "FormatConversionTools": dataset_file_formats_conversion_tools.value,
            "ConversionToolsLink": dataset_file_formats_conversion_tools_link.value, 

            # Data Delivery
            "AuthenticationRequired" : dataset_authentication.value,
            "DirectDownloadAvailable" : dataset_direct_access.value,
            "APIorWebAvailable" : dataset_api_available.value,
            "APIOpenStandard" : dataset_api_standard_protocol.value,
            "APIDocumentation" : dataset_api_documentation_available.value,
            "APIDocumentationLink" : dataset_api_documentation_link.value,

            # Privacy and Security
            "SecurityMeasuresTaken" : dataset_restricted_protection.value,
            "DataAggregated" : dataset_aggregation.value,
            "DataAnonymized" : dataset_anonymization.value,
            "SecureAccessForAuthorizedUsers" : dataset_secure_access.value,
        }
    }
    return updates

save_button.on_click(lambda b: utils.update_checklist(b, generate_updates_access()))

display(save_button)

Button(button_style='primary', description='Save Data Access Answers to json file', layout=Layout(flex='1 1 au…
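
The save step passes a nested dictionary of answers to `utils.update_checklist`, whose implementation is not shown in this notebook. The sketch below is one plausible way such a save could work, not the library's actual code: it merges the updates into the stored checklist section by section, so answers from other parts of the checklist are preserved, then writes the result back as JSON. The names `merge_updates` and `save_updates` and the `path` argument are hypothetical.

```python
import json


def merge_updates(checklist, updates):
    """Recursively copy values from `updates` into `checklist` and return it.

    Hypothetical helper: nested dictionaries are merged key by key, so
    sections that are not mentioned in `updates` are left untouched.
    """
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(checklist.get(key), dict):
            merge_updates(checklist[key], value)
        else:
            checklist[key] = value
    return checklist


def save_updates(path, updates):
    """Load the checklist at `path`, merge in `updates`, and write it back."""
    with open(path, "r", encoding="utf-8") as f:
        checklist = json.load(f)
    merge_updates(checklist, updates)
    with open(path, "w", encoding="utf-8") as f:
        # Tuples (such as a SelectMultiple value) are written as JSON arrays.
        json.dump(checklist, f, indent=2)
    return checklist
```

Note that `dataset_file_formats.value` is a tuple, which JSON stores as an array; it is read back as a list when the checklist is reloaded.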

## Finished

1. Make sure you have saved your answers to the external JSON file using the button above.
2. If you would like to view these saved answers, use the button below.
3. Move on to the notebook Template_Checklist_Part_5.ipynb, which covers Data Preparation.

In [10]:

button_print_json = widgets.Button(description="Print json results",  button_style='info', layout=widgets.Layout(flex='1 1 auto', width='auto'))
output = widgets.Output()

display(button_print_json, output)

def print_json_info(b):
    """
    Loads a copy of the json file to checklist variable. 
    Then prints the json file contents to Jupyter notebook cell output.

    Arguments: b - represents the button calling the function. 
    """
    checklist = utils.load_checklist()
    with output:
        clear_output()
        for key, value in checklist.items():
            print(f"{key}:")
            if isinstance(value, dict):
                for sub_key, sub_value in value.items():
                    print(f"  {sub_key}: {sub_value}")
            else:
                print(f"  {value}")

button_print_json.on_click(print_json_info)


Button(button_style='info', description='Print json results', layout=Layout(flex='1 1 auto', width='auto'), st…

Output()

---

## **Appendix** - Definition of terms used in the checklist.

### Data Access

* **Formats**: standards that govern how information is stored in a computer file (e.g., CSV, JSON, GeoTIFF, etc.); different AI user communities will have different requirements, so the best practice is to provide several format options to meet the needs of multiple high priority user communities.
* **Delivery Options**: mechanisms for publishing open data for public use (e.g., direct file download, Application Programming Interface (API), cloud services, etc.); different AI user communities will have different requirements, so the best practice is to provide several delivery options to meet the needs of multiple high priority user communities.
* **License/Usage Rights**: information on who is allowed to use the data and for what purposes, including data sharing agreements, fees, etc.; some federal data needs to have restrictions and some will be fully open, so rights should be documented in detail.
* **Security/Privacy**: protection of data that is restricted in some way (privacy, proprietary/business information, national security, etc.).
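
To illustrate the **Formats** entry, offering a second format can be as simple as a small conversion step. The sketch below turns a CSV of point observations into a GeoJSON FeatureCollection using only the standard library. The function name `csv_points_to_geojson`, the column names `lat` and `lon`, and the sample row are assumptions for illustration, not taken from any dataset in this checklist. GeoJSON orders coordinates as longitude, then latitude.

```python
import csv
import io


def csv_points_to_geojson(csv_text, lon_field="lon", lat_field="lat"):
    """Convert CSV text with one point per row into a GeoJSON FeatureCollection.

    Hypothetical helper: the coordinate columns become the geometry and all
    remaining columns are kept as string properties.
    """
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lon = float(row.pop(lon_field))
        lat = float(row.pop(lat_field))
        features.append({
            "type": "Feature",
            # GeoJSON positions are [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": dict(row),
        })
    return {"type": "FeatureCollection", "features": features}


sample = "station,lat,lon,humidity\nA,51.5,-0.1,82\n"
collection = csv_points_to_geojson(sample)
print(collection["features"][0]["geometry"]["coordinates"])  # [-0.1, 51.5]
```

For real datasets, the tools listed in section 4.1 (for example GeoPandas or GDAL) handle coordinate reference systems and data types that this sketch ignores.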
