
<link rel="stylesheet" href="https://unpkg.com/thebe@latest/lib/index.css">
<script src="https://unpkg.com/thebe@latest/lib/index.js"></script>

<script type="text/javascript">
  document.addEventListener("DOMContentLoaded", function() {
    thebelab.bootstrap({
      requestKernel: true,
      binderOptions: {
        repo: "your-repo/your-project",
        ref: "main",
      },
      codeMirrorConfig: {
        theme: "abcdef",
      },
    });
  });
</script>


# **Data Readiness For AI Tabular Checklist - Part 5**

 * Creator(s): John Pill
 * Affiliation: UK Met Office
 * History: 1.0
 * Last update: 27 August 2024.


---

## **Tutorial Material**

* **Run this Jupyter notebook locally using Jupyter Lab.**
* **Select 'Run All Cells' from the 'Run' menu to generate the checklist.**
* **Remember to save your notebook regularly as you work through it.**


---

## **Setup Notebook**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
import ipywidgets as widgets
from IPython.display import display, clear_output
import json
import sys
import os
sys.path.append(os.path.abspath('..')) # Add the parent directory to the system path
from aidatareadiness import utils
from aidatareadiness.utils import WIDGET_WIDTH, DESCRIPTION_STYLE, PLACEHOLDER  
from aidatareadiness.checklist_auto import tabular

In [None]:
# Load checklist from JSON file:
checklist = utils.load_checklist()

#### Reset stored answers to start again:

In [None]:
# Reset all checklist answers back to original blank answers for all sections.
# Any completed information will be lost. 

# To reset the stored answers, uncomment and run the two lines of code below. Re-comment them afterwards so they do not run again.
# utils.reset_checklist()
# checklist = utils.load_checklist()

# You can then re-run each section to reload it with the reset data.

In [None]:

print("Dataset:", checklist["GeneralInformation"]["DatasetName"])
print("Dataset link:", checklist["GeneralInformation"]["DatasetLink"])
print("Assessor:", checklist["GeneralInformation"]["AssessorName"])
print("Assessor email:", checklist["GeneralInformation"]["AssessorEmailAddress"])

## **Load Data**

In [None]:
# Replace add_your_file_path_here with the path to your data file (csv, txt etc.). 
# file_path = "add_your_file_path_here.csv"

# Uncomment the lines below after replacing your file path above. 

# df = tabular.read_file(file_path)
# df

---

## **5. Data Preparation**

### Null values

In [None]:

dataset_null_values = widgets.Combobox(
            value=checklist['DataPreparation']['NullValuesFilled'],
            options=['Yes', 'No', 'N/A'],
            description='5.1 Have null values/gaps been filled?',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

display(dataset_null_values)

In [None]:
# Uncomment the next line to call the null_percent function to analyse the null count and percentage. 
# tabular.null_percent(df)

In [None]:
# The dataset documentation may reveal that null values have been masked with one or more particular values.
# For instance, the NOAA Global Surface Summary of the Day (GSOD) dataset uses values made up entirely of 9s, such as 9999.9, 999.9 and 99.99, to represent null.

# Uncomment the next line and list the values to replace - these should be specified in the dataset documentation.
# values_to_mask = []

# Uncomment to call the mask_values function, passing the dataframe, the list of values to mask and a new value of NaN.
# df_masked = tabular.mask_values(df, values_to_mask, np.nan)

# Uncomment to use the null_percent function to re-assess the dataset for null values. 
# tabular.null_percent(df_masked)
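
The masking step above can also be sketched with plain pandas. This is a minimal illustration, not the implementation of `tabular.mask_values` or `tabular.null_percent`. The column names, the sentinel values and the helper name `null_summary` are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 9999.9 and 999.9 stand in for missing readings.
sample = pd.DataFrame({
    "TEMP": [12.1, 9999.9, 13.4, 11.8],
    "WDSP": [3.2, 4.1, 999.9, 999.9],
})

# Sentinel values, assumed here; in practice take them from the dataset documentation.
values_to_mask = [9999.9, 999.9]

# Replace every sentinel with NaN so pandas treats it as missing.
sample_masked = sample.replace(values_to_mask, np.nan)


def null_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the null count and null percentage for each column (hypothetical helper)."""
    null_count = frame.isna().sum()
    null_percent = 100 * null_count / len(frame)
    return pd.DataFrame({"null_count": null_count, "null_percent": null_percent})


summary = null_summary(sample_masked)
print(summary)
```

Comparing the summary before and after masking shows how many apparent readings were really placeholders for missing data.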

### Outliers
Outliers can be hard to detect. Below are three starting points for exploring the dataset: Describe, Visualise and Z-Score.

In [None]:

dataset_outliers = widgets.Combobox(
            value=checklist['DataPreparation']['OutliersIdentified'],
            options=['Yes, tagged', 'Yes, removed', 'No', 'N/A'],
            description='5.2 Have outliers been identified?',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

display(dataset_outliers)


**Describe**

First, we could check the key statistics for our dataset. <br>
Review the min and max values to assess whether they might include outliers, particularly relative to the mean.

In [None]:
# Uncomment the next line to call the describe method on the masked dataframe.
# df_masked.describe()
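
As a small illustration of the Describe step, the sketch below runs pandas `describe()` on made-up data and pulls out the rows that are most useful for spotting suspicious extremes. The column name and values are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical temperature readings; 95.0 is a deliberately implausible value.
sample = pd.DataFrame({"TEMP": [12.1, 13.4, 11.8, 12.9, 95.0]})

# describe() returns count, mean, std, min, quartiles and max for each numeric column.
stats = sample.describe()

# Compare the extremes with the median (50%) and the mean.
print(stats.loc[["min", "50%", "mean", "max"]])
```

A maximum far above the median, together with a mean pulled well away from the median, suggests that a column is worth a closer look.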

**Visualise**

Next, we could visualise the values in each column to see their rough distribution and to spot values that lie outside the expected range. <br>
You may want to analyse the distribution of each of the features in more detail if you suspect outliers. 

In [None]:
# Uncomment the next line to create a list of feature column names by dropping columns with unhelpful data. 
# column_feature_names = df.drop(['FEATURE_1', 'FEATURE_2'], axis=1).columns    # Change FEATURE_1 and FEATURE_2 to the columns you want to drop. 

# Uncomment the next line to replace the masked values with 0. You may want to refine this choice to get more accurate results. 
# df_masked_zero = tabular.mask_values(df, values_to_mask, 0)

# Uncomment the next line to call the plot_violin_graphs function, passing the dataframe and columns to plot. 
# tabular.plot_violin_graphs(df_masked_zero, column_feature_names)
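
If the `tabular` helper is not available, a similar view can be sketched directly with matplotlib. This is an assumption-laden illustration rather than the implementation of `tabular.plot_violin_graphs`; the column names and values are invented.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical feature columns; 40.0 in TEMP is a deliberately implausible value.
sample = pd.DataFrame({
    "TEMP": [12.0, 12.5, 11.8, 12.2, 12.9, 11.5, 12.4, 40.0],
    "WDSP": [3.2, 4.1, 3.8, 2.9, 3.5, 4.4, 3.1, 3.6],
})

# One subplot per column so each feature keeps its own y-axis scale.
n_columns = len(sample.columns)
fig, axes = plt.subplots(1, n_columns, figsize=(4 * n_columns, 4))
for ax, column in zip(axes, sample.columns):
    # Drop missing values before the density is estimated.
    ax.violinplot(sample[column].dropna().to_numpy(), showmedians=True)
    ax.set_title(column)
    ax.set_xticks([])
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell. A long, thin tail on one violin, as in the `TEMP` panel here, marks values worth checking.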

**Z-Score**

Calculating the z-score can also help to identify outliers. <br>
The z-score indicates how many standard deviations a data point lies from the mean. <br>
A data point with an absolute z-score of more than 2 could be an outlier. An absolute z-score of more than 3 makes it more likely to be an outlier. 

In [None]:
# Uncomment the next line to call the print_z_scores function and pass the masked dataframe with the column_feature_names selected. 
# z_score_info = tabular.print_z_scores(df_masked[column_feature_names])
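
The z-score calculation described above can be sketched in plain pandas as follows. This is not the implementation of `tabular.print_z_scores`. The data, the threshold of 3 and the choice of the population standard deviation (`ddof=0`) are assumptions made for the example.

```python
import pandas as pd

# Hypothetical readings: twenty plausible values and one implausible one (40.0).
plausible = [12.0, 12.5, 11.8, 12.2, 12.9, 11.5, 12.4, 12.1, 11.9, 12.6]
sample = pd.DataFrame({"TEMP": plausible * 2 + [40.0]})

# z = (x - mean) / standard deviation, computed column by column.
# ddof=0 selects the population standard deviation.
z_scores = (sample - sample.mean()) / sample.std(ddof=0)

# Keep the rows where any column has an absolute z-score above the threshold.
threshold = 3
outliers = sample[(z_scores.abs() > threshold).any(axis=1)]
print(outliers)
```

Note that a single extreme value inflates both the mean and the standard deviation, so on small samples a genuine outlier may still score below the threshold.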

### Gridded data


In [None]:

dataset_gridded = widgets.Combobox(
            value=checklist['DataPreparation']['Gridded'],
            options=['Regularly gridded in space', 'Constant time-frequency', 'Regularly gridded in space and constant time-frequency', 'Not gridded', 'N/A'],
            description='5.3 Is the data gridded (regularly sampled in time and space)?',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

dataset_gridded_transformed = widgets.Combobox(
            value=checklist['DataPreparation']['TransformedFromOriginal'],
            options=['Yes, from irregular sampling', 'Yes, from a different regular sampling', 'No, this is the original sampling', 'N/A'],
            description='If the data is gridded, was it transformed from a different original sampling?',            
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )


dataset_gridded_original_sample = widgets.Combobox(
            value=checklist['DataPreparation']['OriginalSamplingAvailable'],
            options=['Yes', 'No', 'Only by request', 'N/A'],
            description = 'If the data is resampled from the original sampling, is the data also available at the original sampling?',
            placeholder=PLACEHOLDER,
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

display(dataset_gridded, dataset_gridded_transformed, dataset_gridded_original_sample)
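
To help answer 5.3, one way to check for a constant time-frequency is to look at the gaps between consecutive timestamps. The sketch below uses made-up daily timestamps and assumes the data has a single time column.

```python
import pandas as pd

# Hypothetical observation times: daily, with 4 January missing.
times = pd.Series(pd.to_datetime([
    "2024-01-01", "2024-01-02", "2024-01-03", "2024-01-05", "2024-01-06",
]))

# Differences between consecutive timestamps (the first entry is NaT, so drop it).
steps = times.sort_values().diff().dropna()

# A constant time-frequency means every step has the same length.
is_constant_frequency = steps.nunique() == 1
print("Constant time-frequency:", is_constant_frequency)
print(steps.value_counts())
```

The same idea applies to spatial coordinates: a regular grid has a single repeated spacing along each axis.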

### Targets / labels for supervised learning

In [None]:

dataset_targets_or_labels = widgets.Combobox(
            value=checklist['DataPreparation']['SupervisedLearningLabels'],
            options=['Yes', 'No', 'N/A'],
            description = '5.4 Are there associated targets or labels for supervised learning techniques?',
            placeholder='Click to select option - (Can this be used as a training dataset)?',
            layout=widgets.Layout(width=WIDGET_WIDTH),
            style = DESCRIPTION_STYLE
            )

dataset_targets_or_labels_standards_label = widgets.Label(
    value = "If there are associated targets/labels, are community labeling standards implemented?"
)

dataset_targets_or_labels_standards = widgets.Text(
            value=checklist['DataPreparation']['SupervisedLearningLabelStandards'],
            placeholder = 'e.g., STAC label extension, ESA AIREO specification, etc.',
            layout = widgets.Layout(width=WIDGET_WIDTH)
)

display(dataset_targets_or_labels, dataset_targets_or_labels_standards_label, dataset_targets_or_labels_standards)

In [None]:

# Save button
save_button = widgets.Button(description="Save Data Preparation Answers to json file",  button_style="primary",  layout=widgets.Layout(flex='1 1 auto', width='auto'))

def generate_updates_preparation():

    updates = {
        "DataPreparation": {
            "NullValuesFilled": dataset_null_values.value,
            "OutliersIdentified": dataset_outliers.value,
            "Gridded": dataset_gridded.value,
            "TransformedFromOriginal": dataset_gridded_transformed.value,
            "OriginalSamplingAvailable": dataset_gridded_original_sample.value, 
            "SupervisedLearningLabels" : dataset_targets_or_labels.value,
            "SupervisedLearningLabelStandards" : dataset_targets_or_labels_standards.value,
          
        }
    }
    return updates

save_button.on_click(lambda b: utils.update_checklist(b, generate_updates_preparation()))

display(save_button)
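
The save step merges a nested dictionary of answers into the stored checklist. How `utils.update_checklist` does this is not shown in this notebook, so the sketch below is only one plausible approach. The function name `merge_updates`, the file name and the sample contents are hypothetical.

```python
import json
import os
import tempfile


def merge_updates(path, updates):
    """Merge a {section: {key: value}} dict into the JSON file at path (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        stored = json.load(f)
    for section, answers in updates.items():
        # Update only the supplied keys; other sections and keys are left untouched.
        stored.setdefault(section, {}).update(answers)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(stored, f, indent=2)
    return stored


# Hypothetical checklist file with blank answers, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "checklist_demo.json")
with open(demo_path, "w", encoding="utf-8") as f:
    json.dump({
        "GeneralInformation": {"DatasetName": "Demo"},
        "DataPreparation": {"NullValuesFilled": "", "Gridded": ""},
    }, f)

result = merge_updates(demo_path, {"DataPreparation": {"NullValuesFilled": "Yes"}})
print(result["DataPreparation"])
```

Merging section by section means that saving one part of the checklist does not overwrite answers stored by the other parts.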

## Finished

In [None]:

button_print_json = widgets.Button(description="Print json results",  button_style='info', layout=widgets.Layout(flex='1 1 auto', width='auto'))
output = widgets.Output()

display(button_print_json, output)

def print_json_info(b):
    """
    Loads a copy of the json file to checklist variable. 
    Then prints the json file contents to Jupyter notebook cell output.

    Arguments: b - represents the button calling the function. 
    """
    checklist = utils.load_checklist()
    with output:
        clear_output()
        for key, value in checklist.items():
            print(f"{key}:")
            if isinstance(value, dict):
                for sub_key, sub_value in value.items():
                    print(f"  {sub_key}: {sub_value}")
            else:
                print(f"  {value}")

button_print_json.on_click(print_json_info)


---

## **Appendix** - Definition of terms used in the checklist.