# **Data Readiness For AI Checklist - Part 5**

 * Creator(s): John Pill
 * Affiliation: UK Met Office
 * History: 1.0
 * Last update: 27 August 2024.


---

## **Tutorial Material**

* **Run this Jupyter notebook locally using Jupyter Lab**
* **Select 'Run All Cells' from the 'Run' menu to generate the checklist.**
* **Remember to save your notebook regularly as you work through it.**


## **Data section, optional**
Scripts for pulling the data into the notebook assuming

---

## **Setup Notebook**

In [28]:
import ipywidgets as widgets
from IPython.display import display, clear_output
import json
import sys
import os
import numpy as np
import traceback
from aidatareadiness import utils
from aidatareadiness.utils import WIDGET_WIDTH, DESCRIPTION_STYLE, PLACEHOLDER  
from aidatareadiness.checklist_auto import gridded 

## **Load Data**

In [29]:
# Use the following function to load your dataset and check that its file format is compatible. 
# Add the filename / file path of your gridded dataset below:

gridded_file_path = "/home/coder/ai_data_readiness/new_data/conus_HUMID_20180101.nc"

# Uncomment the lines below to check compatibility and load your dataset. 
dataset = gridded.detect_gridded_format_and_open(gridded_file_path)
dataset

In [30]:
# Convert temporal coord name to 'time'

# The temporal coord sometimes has a different name. The function below renames it to 'time',
# which is the name the later functions expect.

# Uncomment the line below to print the current dimension names and sizes and identify the temporal coord. 
print(dataset.sizes)

# Uncomment the function call below and update the 2nd argument with the temporal coord name.
dataset = gridded.temporal_check(dataset, "Time")

Frozen({'Time': 1, 'lat': 2901, 'lon': 4608})
No time coordinate found.
Time coordinate name changed to time
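
The temporal dimension above is called `Time` rather than `time`, which is why it had to be named explicitly. As a minimal sketch, a case-insensitive lookup over the printed sizes could suggest the name to pass as the second argument. The helper `find_time_dimension` and its list of candidate names are assumptions made for illustration; they are not part of `aidatareadiness`.

```python
def find_time_dimension(sizes, candidates=("time", "t", "valid_time")):
    """Hypothetical helper: return the first dimension name that looks temporal, else None."""
    for name in sizes:
        # Compare in lower case so 'Time' and 'TIME' both match 'time'.
        if name.lower() in candidates:
            return name
    return None

# Dimension sizes copied from the output above.
sizes_demo = {"Time": 1, "lat": 2901, "lon": 4608}
time_name = find_time_dimension(sizes_demo)
print("Suggested temporal dimension:", time_name)
```

The result is only a suggestion; the dataset documentation remains the reference for which dimension is temporal.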


In [31]:
# Load checklist from JSON file:
checklist = utils.load_checklist()

#### Reset stored answers to start again:

In [32]:
# Reset all checklist answers back to original blank answers for all sections.
# Any completed information will be lost. 

# To reset the stored answers uncomment and run these lines of code below. Re-comment the lines afterwards to avoid them running again. 
# utils.reset_checklist()
# checklist = utils.load_checklist()

# You can then re-run each section to reload it on the reset data. 

In [33]:

print("Dataset:", checklist["GeneralInformation"]["DatasetName"])
print("Dataset link:", checklist["GeneralInformation"]["DatasetLink"])
print("Assessor:", checklist["GeneralInformation"]["AssessorName"])
print("Assessor email:", checklist["GeneralInformation"]["AssessorEmailAddress"])

Dataset: HUMID
Dataset link: 
Assessor: 
Assessor email: 


---

## **5. Data Preparation**

### Null values

In [34]:

dataset_null_values = widgets.Combobox(
            value=checklist['DataPreparation']['NullValuesFilled'],
            options=['Yes', 'No', 'N/A'],
            description='5.1 Have null values/gaps been filled?',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

display(dataset_null_values)

Combobox(value='', description='5.1 Have null values/gaps been filled?', layout=Layout(width='900px'), options…

In [35]:
# This function helps you check whether the dataset has missing or filled values. 
# It is still worth reading the dataset documentation, which should explain this in detail. 

# Uncomment the line below to run the find_missing_values function.
missing_value_stats = utils.find_missing_values(dataset)

# Uncomment the lines below to print out the results:
print("MISSING / FILLED VALUE REPORT:")
for stats in missing_value_stats:
    for key, value in stats.items():
        print(f"    > {key}: {value}")
    print("-" * 50)

MISSING / FILLED VALUE REPORT:
    > variable_name: FRC_URB2D
    > missing_values_count: 0
    > percentage_missing: 0.0
    > has_fill_value: False
    > fill_value: None
    > filled_values_count: 0
    > percentage_filled: 0.0
--------------------------------------------------
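
The report above lists, for each variable, how many values are missing (NaN) and how many equal a declared fill value. A minimal sketch of how such counts could be computed with NumPy follows. The function name `summarise_missing` and its `fill_value` argument are assumptions for illustration; this is not the implementation of `utils.find_missing_values`.

```python
import numpy as np

def summarise_missing(values, fill_value=None):
    """Hypothetical helper: count NaNs and fill values in a single array."""
    arr = np.asarray(values, dtype=float)
    total = arr.size
    # NaN marks values that are missing outright.
    missing = int(np.isnan(arr).sum())
    # A fill value is a sentinel number written in place of missing data.
    filled = int((arr == fill_value).sum()) if fill_value is not None else 0
    return {
        "missing_values_count": missing,
        "percentage_missing": 100.0 * missing / total if total else 0.0,
        "has_fill_value": fill_value is not None,
        "fill_value": fill_value,
        "filled_values_count": filled,
        "percentage_filled": 100.0 * filled / total if total else 0.0,
    }

# Made-up sample: one NaN and two sentinel values out of six points.
sample = np.array([[0.1, np.nan, -9999.0], [0.4, 0.5, -9999.0]])
report = summarise_missing(sample, fill_value=-9999.0)
for key, value in report.items():
    print(f"    > {key}: {value}")
```

Counting NaNs and fill values separately matters for question 5.1: a variable with no NaNs may still contain gaps that were encoded with a sentinel value.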


### Outliers

In [36]:

dataset_outliers = widgets.Combobox(
            value=checklist['DataPreparation']['OutliersIdentified'],
            options=['Yes, tagged', 'Yes, removed', 'No', 'N/A'],
            description='5.2 Have outliers been identified?',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

display(dataset_outliers)


Combobox(value='', description='5.2 Have outliers been identified?', layout=Layout(width='900px'), options=('Y…

In [37]:
# This function counts the values whose z-score exceeds a set threshold, which can indicate outliers. 
# The dataset documentation should still be read, as it should describe how outliers were handled. 

# Uncomment the line below to run the count_z_score_outliers_for_dataset function:
try:
    outlier_stats = utils.count_z_score_outliers_for_dataset(dataset, threshold=3)
    
    # Uncomment the lines below to print the results for each variable
    print("OUTLIER VALUE REPORT:")
    for stats in outlier_stats:
        for key, value in stats.items():
            print(f"    > {key}: {value}")
        print("-" * 50)
except Exception as e:
    print('NO OUTLIER VALUE REPORT FROM DATA')
    print(e)

NO OUTLIER VALUE REPORT FROM DATA
ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''


  z_scores = (data_clean - mean) / std_dev
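
NumPy raises the `isnan` error shown above when `np.isnan` is applied to an array whose dtype is not numeric, such as strings or objects. One possible cause, which is an assumption here and not confirmed by the output, is a non-numeric variable in the dataset. The sketch below shows one way to skip such arrays before computing z-scores. `count_z_score_outliers` is a hypothetical helper, not the `utils.count_z_score_outliers_for_dataset` function.

```python
import numpy as np

def count_z_score_outliers(values, threshold=3.0):
    """Hypothetical helper: count points with |z| > threshold, or None if not applicable."""
    arr = np.asarray(values)
    # np.isnan fails on strings and objects, so skip non-numeric arrays.
    if not np.issubdtype(arr.dtype, np.number):
        return None
    data = arr.astype(float).ravel()
    data_clean = data[~np.isnan(data)]
    if data_clean.size == 0:
        return None
    std_dev = data_clean.std()
    # A constant field has zero spread; avoid dividing by zero.
    if std_dev == 0:
        return 0
    z_scores = (data_clean - data_clean.mean()) / std_dev
    return int((np.abs(z_scores) > threshold).sum())

# Made-up data: 99 zeros plus one large value, and a string-typed array.
numeric = np.concatenate([np.zeros(99), [100.0]])
labels = np.array(["2018-01-01_00:00:00"])
numeric_outliers = count_z_score_outliers(numeric)
label_outliers = count_z_score_outliers(labels)
print("Numeric outliers:", numeric_outliers)
print("String variable:", label_outliers)
```

Returning `None` for non-numeric input keeps "not applicable" distinct from "no outliers found", which helps when choosing between 'No' and 'N/A' for question 5.2.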


### Gridded data


In [38]:

dataset_gridded = widgets.Combobox(
            value=checklist['DataPreparation']['Gridded'],
            options=['Regularly gridded in space', 'Constant time-frequency', 'Regularly gridded in space and constant time-frequency', 'Not gridded', 'N/A'],
            description='5.3 Is the data gridded (regularly sampled in time and space)?',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

dataset_gridded_transformed = widgets.Combobox(
            value=checklist['DataPreparation']['TransformedFromOriginal'],
            options=['Yes, from irregular sampling', 'Yes, from a different regular sampling', 'No, this is the original sampling', 'N/A'],
            description='If the data is gridded, was it transformed from a different original sampling?',            
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )


dataset_gridded_original_sample = widgets.Combobox(
            value=checklist['DataPreparation']['OriginalSamplingAvailable'],
            options=['Yes', 'No', 'Only by request', 'N/A'],
            description = 'If the data is resampled from the original sampling, is the data also available at the original sampling?',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

display(dataset_gridded, dataset_gridded_transformed, dataset_gridded_original_sample)

Combobox(value='', description='5.3 Is the data gridded (regularly sampled in time and space)?', layout=Layout…

Combobox(value='', description='If the data is gridded, was it transformed from a different original sampling?…

Combobox(value='', description='If the data is resampled from the original sampling, is the data also availabl…

In [40]:
# The functions and information found in notebook 1 (part 2) might prove useful to answer some of the questions above. 

# Uncomment the 4 lines below to run various functions to check spatial and temporal information:
resolution, coverage = gridded.get_spatial_resolution_and_coverage(dataset)
temporal_resolution, temporal_coverage = gridded.get_temporal_resolution_and_coverage(dataset)
spatial_consistency = gridded.check_spatial_consistency(dataset)
temporal_consistency = gridded.check_temporal_consistency(dataset)

# Uncomment the lines below to print the results from the 4 functions above. 
print(f"Spatial resolution: Latitude {resolution[0]} deg, Longitude {resolution[1]} deg")
print(f"Spatial coverage: Latitude {coverage['latitude'][0]} to {coverage['latitude'][1]}, Longitude {coverage['longitude'][0]} to {coverage['longitude'][1]}")

try:
    print("Temporal resolution:", temporal_resolution)
    print("Min time:", temporal_coverage['time'][0])
    print("Max time:", temporal_coverage['time'][1])
except Exception as e:
    print('SINGLE TIME FILE')
    print(e)
    traceback.print_exc()

print("Spatial resolution is consistent" if spatial_consistency == True 
      else "Spatial resolution not consistent" if spatial_consistency == False
      else "Unable to determine spatial resolution consistency")

print("Temporal resolution is consistent" if temporal_consistency == True 
      else "Temporal resolution not consistent" if temporal_consistency == False
      else "Unable to determine temporal resolution consistency")

Time coordinate not found in the dataset.
Spatial resolution is not consistent.
Time coordinate not found in the dataset.


Spatial resolution: Latitude [0.00832367 0.0083313  0.00833893 ... 0.00833893 0.0083313  0.00832367] deg, Longitude [0.0025177  0.00253296 0.00250244 ... 0.00250244 0.00253296 0.00253296] deg
Spatial coverage: Latitude 22.558914184570312 to 51.905601501464844, Longitude 230.37026977539062 to 295.6297302246094
Temporal resolution: None
SINGLE TIME FILE
'NoneType' object is not subscriptable
Spatial resolution not consistent
Unable to determine temporal resolution consistency


Traceback (most recent call last):
  File "/tmp/ipykernel_19748/607420800.py", line 15, in <module>
    print("Min time:", temporal_coverage['time'][0])
TypeError: 'NoneType' object is not subscriptable
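
Whether a grid counts as regular depends on the tolerance used when comparing the steps between neighbouring coordinates. The latitude steps visible in the output above differ by roughly 0.2%, so a strict comparison reports them as inconsistent while a looser one might not. A minimal sketch of such a check follows. The helper `is_regularly_spaced` and its default `rtol` are assumptions for illustration, not the logic of `gridded.check_spatial_consistency`.

```python
import numpy as np

def is_regularly_spaced(coord, rtol=1e-3):
    """Hypothetical helper: True/False if spacing is uniform within rtol, None if undetermined."""
    coord = np.asarray(coord, dtype=float)
    # A single point (e.g. a one-time-step file) has no spacing to compare.
    if coord.size < 2:
        return None
    steps = np.diff(coord)
    # Compare every step with the first one, using a relative tolerance only.
    return bool(np.allclose(steps, steps[0], rtol=rtol, atol=0.0))

# Made-up coordinates: a uniform axis, a stretched axis and a single time step.
regular_lon = np.linspace(230.0, 295.0, 14)
stretched_lat = np.array([22.0, 22.5, 23.1, 23.8])
single_time = np.array([0.0])
results = (
    is_regularly_spaced(regular_lon),
    is_regularly_spaced(stretched_lat),
    is_regularly_spaced(single_time),
)
print(results)
```

The three possible results mirror the consistent, not consistent and unable-to-determine messages printed by the cell above, with `None` covering files that hold a single time step.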


### Targets / labels for supervised learning

In [41]:

dataset_targets_or_labels = widgets.Combobox(
            value=checklist['DataPreparation']['SupervisedLearningLabels'],
            options=['Yes', 'No', 'N/A'],
            description = '5.4 Are there associated targets or labels for supervised learning techniques?',
            placeholder='Click to select option - (Can this be used as a training dataset)?',
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

dataset_targets_or_labels_standards_label = widgets.Label(
    value = "If there are associated targets/labels, are community labeling standards implemented?"
)

dataset_targets_or_labels_standards = widgets.Text(
            value=checklist['DataPreparation']['SupervisedLearningLabelStandards'],
            placeholder = 'e.g., STAC label extension, ESA AIREO specification, etc.',
            layout = widgets.Layout(width=WIDGET_WIDTH)
)

display(dataset_targets_or_labels, dataset_targets_or_labels_standards_label, dataset_targets_or_labels_standards)

Combobox(value='', description='5.4 Are there associated targets or labels for supervised learning techniques?…

Label(value='If there are associated targets/labels, are community labeling standards implemented?')

Text(value='', layout=Layout(width='900px'), placeholder='e.g., STAC label extension, ESA AIREO specification,…

In [42]:

# Save button
save_button = widgets.Button(description="Save Data Preparation Answers to json file",  button_style="primary",  layout=widgets.Layout(flex='1 1 auto', width='auto'))

def generate_updates_preparation():
    """Collect the current widget values for the Data Preparation section."""
    updates = {
        "DataPreparation": {
            "NullValuesFilled": dataset_null_values.value,
            "OutliersIdentified": dataset_outliers.value,
            "Gridded": dataset_gridded.value,
            "TransformedFromOriginal": dataset_gridded_transformed.value,
            "OriginalSamplingAvailable": dataset_gridded_original_sample.value,
            "SupervisedLearningLabels": dataset_targets_or_labels.value,
            "SupervisedLearningLabelStandards": dataset_targets_or_labels_standards.value,
        }
    }
    return updates

save_button.on_click(lambda b: utils.update_checklist(b, generate_updates_preparation()))

display(save_button)

Button(button_style='primary', description='Save Data Preparation Answers to json file', layout=Layout(flex='1…
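
The save button passes a nested dictionary of answers to `utils.update_checklist`, which stores them in the checklist JSON file. A minimal sketch of that kind of merge-and-save step follows. The helpers `merge_updates` and `save_checklist`, the demo variable names and the temporary file path are all assumptions for illustration; the actual behaviour of `utils.update_checklist` is not shown in this notebook.

```python
import json
import os
import tempfile

def merge_updates(checklist, updates):
    """Hypothetical helper: copy each updated answer into its checklist section."""
    for section, answers in updates.items():
        # Only the supplied keys change; other answers in the section are kept.
        checklist.setdefault(section, {}).update(answers)
    return checklist

def save_checklist(checklist, path):
    """Hypothetical helper: write the checklist to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(checklist, handle, indent=2)

# Made-up checklist with two blank answers, and one update to apply.
checklist_demo = {"DataPreparation": {"NullValuesFilled": "", "Gridded": ""}}
updates_demo = {"DataPreparation": {"NullValuesFilled": "Yes"}}
merge_updates(checklist_demo, updates_demo)

# Write to a temporary directory so no real checklist file is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "checklist_demo.json")
    save_checklist(checklist_demo, demo_path)
    with open(demo_path, encoding="utf-8") as handle:
        reloaded = json.load(handle)

print(reloaded)
```

Merging section by section means that saving this part of the checklist leaves the answers from the other parts untouched.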

## **Finished**

In [43]:

button_print_json = widgets.Button(description="Print json results",  button_style='info', layout=widgets.Layout(flex='1 1 auto', width='auto'))
output = widgets.Output()

display(button_print_json, output)

def print_json_info(b):
    """
    Loads a copy of the json file to checklist variable. 
    Then prints the json file contents to Jupyter notebook cell output.

    Arguments: b - represents the button calling the function. 
    """
    checklist = utils.load_checklist()
    with output:
        clear_output()
        for key, value in checklist.items():
            print(f"{key}:")
            if isinstance(value, dict):
                for sub_key, sub_value in value.items():
                    print(f"  {sub_key}: {sub_value}")
            else:
                print(f"  {value}")

button_print_json.on_click(print_json_info)


Button(button_style='info', description='Print json results', layout=Layout(flex='1 1 auto', width='auto'), st…

Output()

---

## **Appendix** - Definition of terms used in the checklist.