# Conversion of Data to a Machine Learning Friendly Format

This notebook demonstrates taking a single NetCDF file and converting the file into analysis ready numpy arrays stored in a zarr file for later use in neural network training.

Specifically, after having loaded multiple NetCDF files from UM model data into a single Iris CubeList, and saving this CubeList to disk, this notebook will:<ul>

<li>Load the single NetCDF file back from disk.</li>
<li>Extract the desired cubes: cloud volume fraction, specific humidity, air pressure, and air temperature.</li>
<li>Combine cubes of the same feature where metadata differences have prevented concatenation.</li>
<li>Convert the cubes to numpy arrays.</li>
<li>Format the arrays into a desirable dimension: (Sample Number, Height Level, Feature).</li>
<li>Generate data for the desired target we want to make a prediction on (cloud base height at a level in a sample).</li>
<li>Normalize data where necessary.</li>
<li>Save the data to disk for later loading to perform ML tasks.</li></ul>

### Motivation

Machine learning ready data makes the inputs and outputs of a model easy to access and manipulate. In our context the data exists as many .pp files, separated along a number of dimensions. To train a model on the raw data, each separate file would have to be loaded to retrieve the data for a single training step. Because it is desirable to randomize training batches, loading from individual files would require a search mechanism and/or involve inefficient loading. To avoid this, we concatenate the data into a format suited to the task, so that it can be accessed as a single unit without searching for the correct data points across multiple files.

It is possible to build infrastructure that searches the individual files, for example as a data reading "driver" at each location where the search is required. A better option is to give the data the desired qualities from the start, so that it can be reused for other use cases or additional tasks.

### Intended Outcomes

After exploring this notebook, converting raw data into files better suited to machine learning tasks should be an easier and better-informed process.

### Environment

This notebook is intended to be run in the py-lightning-cbh conda environment defined in requirements_torch.yml.

Define imports:

In [1]:
import glob
import os
import pathlib
import re

import dask
import iris
import numpy as np
import zarr
from dask.diagnostics import ProgressBar, ResourceProfiler

import cbh_data_definitions  # used for testing the load back in of the data

### The original files are split on time; to make processing easier, concatenate the individual files in iris and save back to disk:

In [2]:
CONCATENATE_INITIAL_FILES = False
if CONCATENATE_INITIAL_FILES:
    dev_indiv_files = pathlib.Path(os.environ["SCRATCH"]) / "cbh_data/dev_indiv" / "*.pp"
    dev_indiv_paths = glob.glob(str(dev_indiv_files))
    dev_cubes = iris.load(dev_indiv_paths)
    dev_save_large_path = pathlib.Path(os.environ["SCRATCH"]) / "cbh_data/dev/dev_large.nc"
    iris.save(dev_cubes, str(dev_save_large_path))

    train_indiv_files = (
        pathlib.Path(os.environ["SCRATCH"]) / "cbh_data/train_individual_files" / "*.pp"
    )
    train_indiv_paths = glob.glob(str(train_indiv_files))
    train_cubes = iris.load(train_indiv_paths)
    train_save_large_path = pathlib.Path(os.environ["SCRATCH"]) / "cbh_data/train/train_large.nc"
    # pass the path as a string, consistent with the dev save above
    iris.save(train_cubes, str(train_save_large_path))

In [7]:
CONCAT_TEST_FILES = True
SAVE_TEST_SCRATCH = False

if CONCAT_TEST_FILES:
    test_indiv_dir = str(
        pathlib.Path(os.environ["DATADIR"]) / "cbh_data" / "test" / "*.pp"
    )
    test_indiv_files = glob.glob(test_indiv_dir)
    test_cube = iris.load(test_indiv_files)
    if SAVE_TEST_SCRATCH:
        test_save_large_path_SCRATCH = str(
            pathlib.Path(os.environ["SCRATCH"]) / "cbh_data/test/test_large.nc"
        )
        iris.save(test_cube, test_save_large_path_SCRATCH)

    test_save_large_path_DATADIR = str(
        pathlib.Path(os.environ["DATADIR"]) / "cbh_data/test/test_large.nc"
    )
    iris.save(test_cube, test_save_large_path_DATADIR)




### Define file paths:

In [8]:
root_data_directory = pathlib.Path(os.environ["DATADIR"]) / "cbh_data"

paths_to_load = (
    root_data_directory / "test" / "test_large.nc"
)  # one large nc file of iris' concatenation of all small nc files
path_to_save_result = (
    root_data_directory / "analysis_ready" / "test.npz"
)  # output for numpy arrays
path_to_save_zarr = (
    root_data_directory / "analysis_ready" / "test.zarr"
)  # output for zarr files

#### Settings for the notebook, with a comment above each constant describing its purpose:

This notebook converts data, and the conversion must be applied to the test data, the training data, and the validation data. Because the notebook has to be run once per dataset to produce all of the analysis ready data, the settings below let the user keep the desired functionality across runs.

In [9]:
# generates a positional encoding array for a feature of each height layer in the sample
GENERATE_POSITIONAL_ENCODING_ARRAYS = False
# adds the height layer number to every feature vector in the input array
if GENERATE_POSITIONAL_ENCODING_ARRAYS:
    CONCATENATE_POSITIONAL_ENCONDING_TO_FEATURE_VECTOR = False

# realises the input array computation in two halves to avoid memory constraints of large computation
COMPUTE_INPUT_ARRAY_IN_HALVES = False

FREE_UP_MEMORY_AFTER_TARGET_COMPUTATION = False

# show all samples where clouds exist in the final layer (none)
# the final layer is used as the desired classification when predicting that no cloud base exists
VERIFY_NO_FINAL_LAYER_CLOUDS = False

# do extra compute to find the number of samples with cloud bases in the dataset
COMPUTE_CLOUD_BASE_SAMPLE_NUMBER = True

# save npz or not
SAVE_NPZ = False

# perform some computations that may take a long while, but give more information for general understanding
PERFORM_LONG_COMPUTATIONS_FOR_EXTRA_INFO = True

# save the cloud base position as a class label, or a onehot vector
SAVE_ONEHOT_INSTEAD_OF_CLASS_LABEL = False
# one can easily be converted to the other, so only saving one is necessary

# Normalize the input data to a range [0,1]
NORMALIZE_INPUT_DATA = False
# defaults to False as models should be designed to normalize the data, which avoids issues like an unknown global maximum
# e.g. the max value of temperature in the loaded data can be different to the max temp of all possible input temps for your model

# Disable warnings
DISABLE_WARNINGS = True

if DISABLE_WARNINGS:
    import warnings
    warnings.filterwarnings('ignore')
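
The comment on SAVE_ONEHOT_INSTEAD_OF_CLASS_LABEL notes that a class label and a onehot vector can each be converted to the other. A minimal sketch of that round trip with numpy is shown below. The helper names `labels_to_onehot` and `onehot_to_labels` and the example label values are assumptions made for illustration and do not come from the notebook.

```python
import numpy as np


def labels_to_onehot(labels, num_classes):
    """Return a (n_samples, num_classes) one-hot matrix for integer labels."""
    onehot = np.zeros((labels.shape[0], num_classes), dtype=np.float32)
    # set a single 1 per row, at the column given by the label
    onehot[np.arange(labels.shape[0]), labels] = 1.0
    return onehot


def onehot_to_labels(onehot):
    """Recover integer labels as the position of the 1 in each row."""
    return np.argmax(onehot, axis=1)


# hypothetical cloud base layer indices for four samples, 70 height layers
labels = np.array([0, 3, 69, 12])
onehot = labels_to_onehot(labels, num_classes=70)
recovered = onehot_to_labels(onehot)

print(onehot.shape)
print(recovered)
```

Because the conversion is lossless in both directions, storing only the more compact class label is usually enough.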

## Loading in the Cloud Base Height Data

In [10]:
cubes = iris.load(str(paths_to_load))

print("Find files complete, list of paths:", paths_to_load)

Find files complete, list of paths: /data/users/hsouth/cbh_data/test/test_large.nc


Show cube names:

In [11]:
print("Cube names:\n", [str(cube.name()) for cube in cubes])

print("\n" + "Example of cube metadata:", cubes[2].summary())

Cube names:
 ['m01s05i250_0', 'cloud_volume_fraction_in_atmosphere_layer', 'cloud_volume_fraction_in_atmosphere_layer', 'm01s05i250', 'air_pressure', 'air_pressure', 'air_temperature', 'air_temperature', 'convective_rainfall_flux', 'convective_rainfall_flux', 'convective_snowfall_flux', 'convective_snowfall_flux', 'specific_humidity', 'specific_humidity', 'stratiform_rainfall_flux', 'stratiform_rainfall_flux', 'stratiform_snowfall_flux', 'stratiform_snowfall_flux', 'upward_air_velocity', 'upward_air_velocity']

Example of cube metadata: cloud_volume_fraction_in_atmosphere_layer / (1) (model_level_number: 70; latitude: 480; longitude: 640)
    Dimension coordinates:
        model_level_number                                         x             -               -
        latitude                                                   -             x               -
        longitude                                                  -             -               x
    Auxiliary coordinates:
  

## Preprocess the data

### Extract the desired cubes: cloud volume fraction, specific humidity, air pressure, and air temperature

Cloud volume fraction will be used as our target for the problem, and the rest of the cubes are used as input.

In [12]:
def create_dataset(cubes):
    list_of_input_cubes = ["air_temperature", "air_pressure", "specific_humidity"]
    target_cube_name = ["cloud_volume_fraction_in_atmosphere_layer"]

    target_cube = iris.cube.CubeList(
        [cube for cube in cubes if (cube.long_name in target_cube_name)]
    )
    inp_cube = iris.cube.CubeList(
        [cube for cube in cubes if (cube.standard_name in list_of_input_cubes)]
    )

    return inp_cube, target_cube

Call the function defined above and verify success:

In [13]:
inp_cube, tar_cube = create_dataset(cubes)

print("input cube:\n", inp_cube, "\n")
print("target cubes:\n", tar_cube)

input cube:
 0: air_pressure / (Pa)                 (model_level_number: 70; latitude: 480; longitude: 640)
1: air_pressure / (Pa)                 (model_level_number: 70; latitude: 480; longitude: 640)
2: air_temperature / (K)               (model_level_number: 70; latitude: 480; longitude: 640)
3: air_temperature / (K)               (model_level_number: 70; latitude: 480; longitude: 640)
4: specific_humidity / (kg kg-1)       (model_level_number: 70; latitude: 480; longitude: 640)
5: specific_humidity / (kg kg-1)       (model_level_number: 70; latitude: 480; longitude: 640) 

target cubes:
 0: cloud_volume_fraction_in_atmosphere_layer / (1) (model_level_number: 70; latitude: 480; longitude: 640)
1: cloud_volume_fraction_in_atmosphere_layer / (1) (model_level_number: 70; latitude: 480; longitude: 640)


### Combine cubes of the same feature where metadata differences have prevented concatenation, while also extracting the numpy array of each cube

If duplicate cubes exist, concatenate them using numpy to avoid metadata matching issues:

In [14]:
def order_two_objects_by_len_ascending(obj1, obj2):
    len1 = len(obj1[1])
    len2 = len(obj2[1])
    if len1 >= len2:
        return obj2, obj1
    else:
        return obj1, obj2


def concatenate_same_cubes(cube_list):

    cube_name_dictionary = {}

    for cube in cube_list:
        # print('start cube load')
        cube_np_array = cube.core_data()
        # print('end load')

        cube_name = cube.name()

        try:
            # concat along the differing axis, forecast reference time
            # MUST CONCATENATE IN THE SAME ORDER FOR EACH ARRAY (Since dim len is different each array, we can have the
            short_arr, long_arr = order_two_objects_by_len_ascending(
                cube_np_array, cube_name_dictionary[cube_name]
            )
            cube_name_dictionary[cube_name] = np.concatenate(
                (short_arr, long_arr), axis=1
            )

            # print(cube_name_dictionary[cube_name].shape)

        except KeyError:
            cube_name_dictionary[cube_name] = cube_np_array

    return cube_name_dictionary

Call the function defined above and verify success:

In [15]:
inp_dict = concatenate_same_cubes(inp_cube)
tar_dict = concatenate_same_cubes(tar_cube)

print("Air Pressure array shape:", inp_dict["air_pressure"].shape)
print(
    "Cloud Volume array shape:",
    tar_dict["cloud_volume_fraction_in_atmosphere_layer"].shape,
)
print("Array types:", type(inp_dict["air_pressure"]))
print("Input cube arrays found:", inp_dict.keys())
print("Target cube arrays found:", tar_dict.keys())

Air Pressure array shape: (70, 960, 640)
Cloud Volume array shape: (70, 960, 640)
Array types: <class 'dask.array.core.Array'>
Input cube arrays found: dict_keys(['air_pressure', 'air_temperature', 'specific_humidity'])
Target cube arrays found: dict_keys(['cloud_volume_fraction_in_atmosphere_layer'])
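
To make the merging step concrete, here is a simplified sketch that uses plain numpy arrays in place of cube data. The names and the tiny shapes are hypothetical stand-ins, and unlike the function above this sketch appends arrays in the order they are encountered rather than ordering them by length.

```python
import numpy as np

# hypothetical (name, data) pairs standing in for cubes and their arrays;
# shape is (height, lat, lon) with deliberately small sizes
named_arrays = [
    ("air_pressure", np.zeros((5, 4, 6), dtype=np.float32)),
    ("air_temperature", np.ones((5, 4, 6), dtype=np.float32)),
    ("air_pressure", np.zeros((5, 4, 6), dtype=np.float32)),
    ("air_temperature", np.ones((5, 4, 6), dtype=np.float32)),
]

merged = {}
for name, data in named_arrays:
    if name in merged:
        # same feature seen before: join along axis 1, bypassing cube metadata
        merged[name] = np.concatenate((merged[name], data), axis=1)
    else:
        merged[name] = data

shapes = {name: arr.shape for name, arr in merged.items()}
print(shapes)
```

Axis 1 doubles from 4 to 8 for each feature, which mirrors how two (70, 480, 640) arrays become one (70, 960, 640) array in the output above.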


Combine dictionary elements to one array:

In [16]:
def combine_feats(dict_of_feats):

    add_dim_for_feature = [np.expand_dims(x, axis=0) for x in dict_of_feats.values()]
    feat_concat_array = np.concatenate(add_dim_for_feature, axis=0)
    return feat_concat_array

In [17]:
inp_array = combine_feats(inp_dict)
tar_array = combine_feats(tar_dict)

# verify and check dims
print("Dimensions to standardize for processing:")
print("Current Input Shape:", inp_array.shape)
print("Current Target Shape:", tar_array.shape)

Dimensions to standardize for processing:
Current Input Shape: (3, 70, 960, 640)
Current Target Shape: (1, 70, 960, 640)
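
As a small check of what combining the dictionary elements does, the sketch below shows on toy numpy arrays that adding a leading axis to each feature and concatenating is equivalent to `np.stack`. The toy shapes and fill values are assumptions for illustration only.

```python
import numpy as np

# toy feature arrays with shape (height, lat, lon)
feats = {
    "air_pressure": np.full((5, 4, 6), 1.0, dtype=np.float32),
    "air_temperature": np.full((5, 4, 6), 2.0, dtype=np.float32),
    "specific_humidity": np.full((5, 4, 6), 3.0, dtype=np.float32),
}

# approach used in the notebook: add a feature axis, then concatenate on it
via_expand = np.concatenate(
    [np.expand_dims(x, axis=0) for x in feats.values()], axis=0
)
# equivalent single call
via_stack = np.stack(list(feats.values()), axis=0)

print(via_expand.shape)
print(np.array_equal(via_expand, via_stack))
```

The order of features along the new axis follows the insertion order of the dictionary, so it is worth recording that order alongside the saved arrays.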


Expand the dimensions of 'short' arrays so that flattening works (in practice this applies to the smaller dev set of the data):

In [18]:
if len(inp_array.shape) == 4:
    time_time2_dims_to_add = [1, 2]
    inp_array = np.expand_dims(inp_array, time_time2_dims_to_add)
    tar_array = np.expand_dims(tar_array, time_time2_dims_to_add)
    print("New and correct shapes (should be 6 dims):")
    print(inp_array.shape)
    print(tar_array.shape)
elif len(inp_array.shape) == 5:
    time_time2_dims_to_add = [1]
    inp_array = np.expand_dims(inp_array, time_time2_dims_to_add)
    tar_array = np.expand_dims(tar_array, time_time2_dims_to_add)
    print("New and correct shapes (should be 6 dims):")
    print(inp_array.shape)
    print(tar_array.shape)

New and correct shapes (should be 6 dims):
(3, 1, 1, 70, 960, 640)
(1, 1, 1, 70, 960, 640)


The expanded dimensions ensure that when we process smaller data, such as the dev set instead of the larger training data, the dimensions match up, so the same function can be applied to both datasets from this point in the notebook.

These dimensions are: atmospheric variable, forecast reference time, time, height, lat, lon
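
The padding logic can be sketched on small numpy arrays as follows. The helper name `pad_to_six_dims` and the toy shapes are hypothetical, and passing a tuple of axes to `np.expand_dims` assumes NumPy 1.18 or newer.

```python
import numpy as np


def pad_to_six_dims(arr):
    """Insert singleton time axes after the feature axis until arr is 6-D."""
    if arr.ndim == 4:
        # neither time axis is present: add both at positions 1 and 2
        return np.expand_dims(arr, axis=(1, 2))
    if arr.ndim == 5:
        # one time axis is present: add the missing one at position 1
        return np.expand_dims(arr, axis=1)
    return arr


four_d = np.zeros((3, 5, 4, 6))     # feature, height, lat, lon
five_d = np.zeros((3, 2, 5, 4, 6))  # feature, time, height, lat, lon

print(pad_to_six_dims(four_d).shape)
print(pad_to_six_dims(five_d).shape)
```

Singleton axes add no data, so the padded array occupies the same memory while matching the six dimensions expected by the flattening step.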

In [19]:
print("Show array storage metadata:")
inp_array

Show array storage metadata:


Unnamed: 0,Array,Chunk
Bytes,492.19 MiB,82.03 MiB
Shape,"(3, 1, 1, 70, 960, 640)","(1, 1, 1, 70, 480, 640)"
Dask graph,6 chunks in 20 graph layers,6 chunks in 20 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray


In [20]:
print("Show array storage metadata:")
tar_array

Show array storage metadata:


Unnamed: 0,Array,Chunk
Bytes,164.06 MiB,82.03 MiB
Shape,"(1, 1, 1, 70, 960, 640)","(1, 1, 1, 70, 480, 640)"
Dask graph,2 chunks in 7 graph layers,2 chunks in 7 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray


### Flatten the arrays

Flatten time and lat/long down to a single dimension, sample number <br>
The function expects a 6-d array whose dimensions are named in the function: cube_num, time, time2, height, lat, long

In [21]:
def flatten_cubes_with_numpy(np_array):

    # print('input dimensions:', np_array.shape)

    cube_num, time, time2, height, lat, long = np_array.shape

    # # verify shape
    # print(np_array.shape)

    # swap axis of time and height to ensure flattening preserves height
    cube_array = np_array.transpose(0, 3, 1, 2, 4, 5)
    cubes_flattened = np.reshape(
        cube_array, (cube_num, height, (time * time2 * lat * long))
    )

    # print('new dimensions', cubes_flattened.shape)

    cube_to_return = cubes_flattened.T
    # remove unnecessary dimensions
    cube_to_return = cube_to_return.squeeze()
    return cube_to_return

In [22]:
dask.config.set(
    {"array.slicing.split_large_chunks": False}
)  # allow the potentially large chunk of data

inp_array = flatten_cubes_with_numpy(inp_array)
tar_array = flatten_cubes_with_numpy(tar_array)

# print('verify squeeze')
print("Shapes of flattened and transposed arrays:")
print("Input:", inp_array.shape)
print("Target:", tar_array.shape)

Shapes of flattened and transposed arrays:
Input: (614400, 70, 3)
Target: (614400, 70)
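
To check that flattening keeps each vertical column intact, the sketch below applies the same transpose, reshape, and transpose sequence to a small random numpy array and compares one flattened sample with the column it came from. The function name `flatten_to_samples` and the toy shape are assumptions for illustration, and the final squeeze from the notebook function is left out.

```python
import numpy as np


def flatten_to_samples(arr):
    """Map (feature, time, time2, height, lat, lon) to (sample, height, feature)."""
    feat, t, t2, height, lat, lon = arr.shape
    # move height next to the feature axis so the reshape cannot mix levels
    moved = arr.transpose(0, 3, 1, 2, 4, 5)
    flat = moved.reshape(feat, height, t * t2 * lat * lon)
    return flat.T


rng = np.random.default_rng(0)
toy = rng.random((3, 2, 1, 5, 4, 6))
flat = flatten_to_samples(toy)

# sample index of (time=1, time2=0, lat=2, lon=3) in row-major order
sample = ((1 * 1 + 0) * 4 + 2) * 6 + 3
column = toy[:, 1, 0, :, 2, 3].T  # (height, feature) column at that location

print(flat.shape)
print(np.array_equal(flat[sample], column))
```

Each row of the flattened array is therefore one (height, feature) column from a single time and grid point, which is the sample layout described at the top of the notebook.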


Rechunk large data to ensure large chunks are reduced for easier handling in dask:

In [23]:
tar_array = dask.array.rechunk(tar_array, chunks="auto")
print("Rechunked array storage metadata for target:")
tar_array

Rechunked array storage metadata for target:


Unnamed: 0,Array,Chunk
Bytes,164.06 MiB,82.03 MiB
Shape,"(614400, 70)","(307200, 70)"
Dask graph,2 chunks in 11 graph layers,2 chunks in 11 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray


In [24]:
inp_array = dask.array.rechunk(inp_array, chunks="auto")
print("Rechunked array storage metadata for input:")
inp_array

Rechunked array storage metadata for input:


Unnamed: 0,Array,Chunk
Bytes,492.19 MiB,82.03 MiB
Shape,"(614400, 70, 3)","(307200, 70, 1)"
Dask graph,6 chunks in 23 graph layers,6 chunks in 23 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray


In [25]:
# rechunk enforcing samples are kept together
inp_arr_chunks = inp_array.chunksize
inp_array = inp_array.rechunk((inp_arr_chunks[0] // 10, 70, 3))  # integer division keeps the chunk size an int
inp_array

Unnamed: 0,Array,Chunk
Bytes,492.19 MiB,24.61 MiB
Shape,"(614400, 70, 3)","(30720, 70, 3)"
Dask graph,20 chunks in 24 graph layers,20 chunks in 24 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
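
The chunk counts and sizes reported above can be reproduced with a little arithmetic. The sketch below uses only the standard library; the helper name `chunk_summary` is hypothetical, and the 4 byte item size is taken from the float32 data type shown in the output.

```python
import math


def chunk_summary(shape, chunk, itemsize=4):
    """Return (number of chunks, size of one full chunk in MiB)."""
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunk))
    chunk_mib = math.prod(chunk) * itemsize / 2**20
    return n_chunks, round(chunk_mib, 2)


shape = (614400, 70, 3)
before = chunk_summary(shape, (307200, 70, 1))
# integer division keeps the new chunk length a whole number of samples
after = chunk_summary(shape, (307200 // 10, 70, 3))

print(before)
print(after)
```

After rechunking, every chunk holds all 70 layers and all 3 features for a block of samples, so a single sample never straddles two chunks.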


For more information about how dask will compute the array, display the dask graph of the input array:

In [26]:
inp_array.dask

layer_type,is_materialized,number of outputs,shape,chunksize,depends on
MaterializedLayer,True,1,,,
Blockwise,False,1,"(70, 480, 640)","(70, 480, 640)",original-array-c9ef66fcc9acaffcbd1d0a9baabc7623
MaterializedLayer,True,1,,,
Blockwise,False,1,"(70, 480, 640)","(70, 480, 640)",original-array-5f7466fcd729c89413f627f775da8552
MaterializedLayer,True,2,"(70, 960, 640)","(70, 480, 640)","array-5f7466fcd729c89413f627f775da8552; array-c9ef66fcc9acaffcbd1d0a9baabc7623"
MaterializedLayer,True,2,"(1, 70, 960, 640)","(1, 70, 480, 640)",concatenate-81772518f6030ee388e5d1a4de8a93d5
MaterializedLayer,True,1,,,
Blockwise,False,1,"(70, 480, 640)","(70, 480, 640)",original-array-c06cfd41f2b73653d9eb04a736ac9af6
MaterializedLayer,True,1,,,
Blockwise,False,1,"(70, 480, 640)","(70, 480, 640)",original-array-018bc915a3d32a72a06cb9e53457e7b8
MaterializedLayer,True,2,"(70, 960, 640)","(70, 480, 640)","array-c06cfd41f2b73653d9eb04a736ac9af6; array-018bc915a3d32a72a06cb9e53457e7b8"
MaterializedLayer,True,2,"(1, 70, 960, 640)","(1, 70, 480, 640)",concatenate-8d137bec7753e056dc24a08c5c0c58f4
MaterializedLayer,True,1,,,
Blockwise,False,1,"(70, 480, 640)","(70, 480, 640)",original-array-aaa99d0a37fc1d37d7cb9f7e246ded3b
MaterializedLayer,True,1,,,
Blockwise,False,1,"(70, 480, 640)","(70, 480, 640)",original-array-4d58093f17db89961567daf46ad9eeea
MaterializedLayer,True,2,"(70, 960, 640)","(70, 480, 640)","array-aaa99d0a37fc1d37d7cb9f7e246ded3b; array-4d58093f17db89961567daf46ad9eeea"
MaterializedLayer,True,2,"(1, 70, 960, 640)","(1, 70, 480, 640)",concatenate-e81c20253cd306499e3eb2e9a8f23215
MaterializedLayer,True,6,"(3, 70, 960, 640)","(1, 70, 480, 640)","reshape-146a2340e57cfe5f8ed4ec1d12a47d2f; reshape-dfc0b577f34a25ca600ff723360f195f; reshape-4149fffe36f66e9a2aab54595ecfaa34"
MaterializedLayer,True,6,"(3, 1, 1, 70, 960, 640)","(1, 1, 1, 70, 480, 640)",concatenate-abbb77dd98c68e5e5aec3c1b97e799a1
Blockwise,False,6,"(3, 70, 1, 1, 960, 640)","(1, 70, 1, 1, 480, 640)",reshape-cb7878dd786cca044cee9f5e1d989e55
MaterializedLayer,True,6,"(3, 70, 614400)","(1, 70, 307200)",transpose-4614e185e792a64175fb4425be6b0bef
Blockwise,False,6,"(614400, 70, 3)","(307200, 70, 1)",reshape-ebe3b1c50b4d0dd9804b82991da759f4
MaterializedLayer,True,80,"(614400, 70, 3)","(30720, 70, 3)",transpose-8496d39dfe69068a4638fa83689c4318

Every layer that reports a shape has dtype float32, type dask.array.core.Array, and chunk_type numpy.ndarray.


In [27]:
tar_arr_chunks = tar_array.chunksize
tar_array = tar_array.rechunk((tar_arr_chunks[0] // 10, 70))  # integer division keeps the chunk size an int
tar_array

Unnamed: 0,Array,Chunk
Bytes,164.06 MiB,8.20 MiB
Shape,"(614400, 70)","(30720, 70)"
Dask graph,20 chunks in 12 graph layers,20 chunks in 12 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray


## Preprocess the data toward ML algorithm input

### Generate data for the target of cloud base at certain height

Preprocess the target. <br>
For the target, we define a cloud as existing in a height layer if the cloud volume fraction is greater than 2 out of a possible 8 oktas. <br>
The cells below find the first occurrence in each sample where the cloud volume is greater than the threshold, marking a 1 in that array location and storing 0 otherwise. <br>
Later, the final height layer will be used as the marker for samples without a cloud base.

In [28]:
# See an example of target array values
print(
    "Example of cloud volume samples (first 10 samples, first 30 layers):\n",
    tar_array[0:10, 0:30].compute(),
)
if PERFORM_LONG_COMPUTATIONS_FOR_EXTRA_INFO:
    print("Maximum value in data:", np.max(tar_array).compute())

Example of cloud volume samples (first 10 samples, first 30 layers):
 [[1.       1.       1.       1.       1.       1.       1.       1.
  1.       1.       1.       1.       1.       1.       1.       1.
  1.       1.       1.       1.       1.       1.       1.       1.
  1.       1.       1.       1.       1.       1.      ]
 [0.984375 1.       1.       1.       1.       1.       1.       1.
  1.       1.       1.       1.       1.       1.       1.       1.
  1.       1.       1.       1.       1.       1.       1.       1.
  1.       1.       1.       1.       1.       1.      ]
 [0.984375 1.       1.       1.       1.       1.       1.       1.
  1.       1.       1.       1.       1.       1.       1.       1.
  1.       1.       1.       1.       1.       1.       1.       1.
  1.       1.       1.       1.       1.       1.      ]
 [0.984375 1.       1.       1.       1.       1.       1.       1.
  1.       1.       1.       1.       1.       1.       1.       1.
  1.       

## Data Size Consideration
The size of the data is a major difficulty in this task. The data is large enough that performing ordinary in-memory operations on it can crash the kernel. To avoid these memory issues, computations are done with dask, although at times even dask cannot handle the amount of data. For this reason, in addition to using dask, the computations are broken down into smaller parts and variables that are no longer needed are removed.
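To get a feel for the sizes involved, the following back-of-the-envelope sketch estimates the in-memory footprint of dense arrays with the shapes used in this notebook. The helper `array_mib` is a hypothetical name introduced for illustration.

```python
import math


def array_mib(shape, bytes_per_value):
    """Size of a dense array in MiB."""
    return math.prod(shape) * bytes_per_value / 2**20


target_shape = (614400, 70)
input_shape = (614400, 70, 3)

print(f"target, float32: {array_mib(target_shape, 4):.2f} MiB")
print(f"input, float32: {array_mib(input_shape, 4):.2f} MiB")
# np.zeros defaults to float64, so an array of the target shape built that way is twice as large
print(f"target shape, float64: {array_mib(target_shape, 8):.2f} MiB")
```

The float32 figures should agree with the sizes dask reports for the same shapes. Several intermediate copies of arrays this size can add up quickly, which is one reason for deleting variables once they are no longer needed.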

In [29]:
cloud_threshold = 2.0 / 8.0
cloud_over_threshold = dask.array.where(tar_array > cloud_threshold)

In [30]:
%%time
# realize the values for the where condition (dask array to numpy array)
# computing both index arrays in a single call avoids evaluating the graph repeatedly
print("Start compute")
sample_with_cloud, index_on_sample = dask.compute(
    cloud_over_threshold[0], cloud_over_threshold[1]
)

Start compute
CPU times: user 2.16 s, sys: 1.54 s, total: 3.7 s
Wall time: 1.4 s


Remove repeated indices: where multiple layers in a sample are above the cloud threshold, we only want the first occurrence in that sample (the base):

In [31]:
%%time
# the where result lists indices in row-major order, so the first appearance of
# each sample index is that sample's lowest layer above the threshold
_, first_duplicate_indicies = np.unique(sample_with_cloud, return_index=True)

if COMPUTE_CLOUD_BASE_SAMPLE_NUMBER:
    print("Start duplicate indicies compute")
    print("Number of cloud bases found:", first_duplicate_indicies.shape)
    print("Out of samples:", tar_array.shape[0])

Start duplicate indicies compute
Number of cloud bases found: (492414,)
Out of samples: 614400
CPU times: user 77.9 ms, sys: 82 ms, total: 160 ms
Wall time: 158 ms


For samples where no cloud base was found (no cloud volume over the threshold appears in any layer), add a marker at the final height layer.

In [32]:
%%time

# encode the cloud base of each sample as a one-hot vector
one_hot_encoded_bases = np.zeros(tar_array.shape)
one_hot_encoded_bases[
    sample_with_cloud[first_duplicate_indicies],
    index_on_sample[first_duplicate_indicies],
] = 1
# mark the end (final layer) if no cloud base detected
no_base_found = ~np.any(one_hot_encoded_bases, axis=1)
one_hot_encoded_bases[no_base_found, -1] = 1

# Reduce each vector to a class label, treating every height layer as a class
# the model will predict: one-hot -> class label, e.g. 0,0,1,0 -> 2
class_label_encoded_bases = np.argmax(one_hot_encoded_bases, axis=1)

CPU times: user 214 ms, sys: 144 ms, total: 359 ms
Wall time: 357 ms
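An alternative that skips the intermediate index arrays is sketched below. It is not the approach used in this notebook. The function name `first_layer_over_threshold` and the toy values are assumptions. It relies on `argmax` over a boolean array returning the position of the first `True`.

```python
import numpy as np


def first_layer_over_threshold(cloud_fraction, threshold):
    """Index of the first layer above threshold per sample; final layer if none."""
    over = cloud_fraction > threshold
    # argmax on booleans returns the first True along the axis (0 if there is none)
    labels = over.argmax(axis=1)
    # rows with no True at all fall back to the final layer
    labels[~over.any(axis=1)] = cloud_fraction.shape[1] - 1
    return labels


toy = np.array([[0.0, 0.5, 0.9], [0.0, 0.0, 0.0], [0.3, 0.0, 0.6]])
print(first_layer_over_threshold(toy, 2.0 / 8.0))
```

This yields class labels directly. Whether it is faster or lighter on memory than the `where`/`unique` route for the full dataset would need to be measured.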


In [33]:
print("Target as class label:", class_label_encoded_bases.shape)
print("Output dim:", one_hot_encoded_bases.shape)

Target as class label: (614400,)
Output dim: (614400, 70)


Compute and unmask target array (cloud volume):

In [34]:
%%time
print("Current type of target array:", type(tar_array))
print("Target shape:", tar_array.shape)
tar_array = tar_array.compute()
print("Finished compute of target array")

num_of_masked = np.ma.count_masked(tar_array)
print("Number of masked values after computation:", num_of_masked)
assert num_of_masked == 0

# unmask
tar_array = np.ma.filled(tar_array, np.nan)

Current type of target array: <class 'dask.array.core.Array'>
Target shape: (614400, 70)
Finished compute of target array
Number of masked values after computation: 0
CPU times: user 472 ms, sys: 469 ms, total: 941 ms
Wall time: 1.94 s
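For readers unfamiliar with numpy masked arrays, here is a small sketch of the two calls used above, run on made-up values. `np.ma.count_masked` counts hidden entries, and `np.ma.filled` returns a plain array with masked entries replaced by the fill value.

```python
import numpy as np

# Toy masked array with one masked entry (values are made up)
values = np.ma.masked_array(
    data=[[0.5, 1.0], [0.25, 0.0]],
    mask=[[False, True], [False, False]],
)
print(np.ma.count_masked(values))  # number of masked entries

# filled() drops the mask and writes the fill value into the masked positions
filled = np.ma.filled(values.astype(np.float32), np.nan)
print(type(filled), int(np.isnan(filled).sum()))
```

When nothing is masked, as asserted in the cell above, filling changes no values and only converts the result to a plain `ndarray`.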


In [35]:
if VERIFY_NO_FINAL_LAYER_CLOUDS:
    # verify the claim that no cloud bases appear in the final layer;
    # this can be strengthened to: no clouds exist in the final layer
    # (the index array printed below is then empty)
    print(
        "list of clouds at final height level:",
        np.where(tar_array[:, -1] > cloud_threshold),
    )

### Show some samples of what has been produced

In [36]:
one_hot_encoded_bases

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]])

In [37]:
class_label_encoded_bases

array([0, 0, 0, ..., 0, 0, 0])

In [38]:
tar_array

array([[1.      , 1.      , 1.      , ..., 0.      , 0.      , 0.      ],
       [0.984375, 1.      , 1.      , ..., 0.      , 0.      , 0.      ],
       [0.984375, 1.      , 1.      , ..., 0.      , 0.      , 0.      ],
       ...,
       [0.875   , 0.828125, 0.578125, ..., 0.      , 0.      , 0.      ],
       [0.875   , 0.8125  , 0.578125, ..., 0.      , 0.      , 0.      ],
       [0.859375, 0.8125  , 0.578125, ..., 0.      , 0.      , 0.      ]],
      dtype=float32)

In [39]:
# optionally, free up some memory
if FREE_UP_MEMORY_AFTER_TARGET_COMPUTATION:
    del sample_with_cloud
    del cloud_over_threshold
    del first_duplicate_indicies
    del index_on_sample
    del tar_dict
    del tar_cube
    del cubes

In [40]:
if SAVE_ONEHOT_INSTEAD_OF_CLASS_LABEL:
    del class_label_encoded_bases
else:
    del one_hot_encoded_bases

In [41]:
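# get information for correct chunking (only one of the label arrays still exists)
if SAVE_ONEHOT_INSTEAD_OF_CLASS_LABEL:
    print(one_hot_encoded_bases.shape)
else:
    print(class_label_encoded_bases.shape)
print(tar_array.shape)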

(614400,)
(614400, 70)


### Normalize Input Data

To normalize the input data, we first transpose the input array so that the feature dimension is the leading axis, which makes it straightforward to reduce over all values of the same feature. Each feature is then min-max scaled into the range \[0,1\].

(To do: the minimum and range (ptp) are computed from the local dataset rather than from global values; this mistake must be investigated and corrected.)
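One possible way to make the choice of statistics explicit is sketched below. The helper `min_max_scale` is a hypothetical name, and the toy data is made up. The per-feature minimum and maximum are passed in as arguments, so values computed once over the whole dataset could be reused for every file. Broadcasting over the trailing feature axis removes the need for a transpose.

```python
import numpy as np


def min_max_scale(x, feature_min, feature_max):
    """Scale each feature on the last axis to [0, 1] using the supplied statistics."""
    feature_min = np.asarray(feature_min, dtype=x.dtype)
    feature_range = np.asarray(feature_max, dtype=x.dtype) - feature_min
    # broadcasting aligns the statistics with the trailing feature axis
    return (x - feature_min) / feature_range


# toy data: 2 samples x 2 layers x 3 features
x = np.array(
    [
        [[0.0, 10.0, 100.0], [5.0, 20.0, 200.0]],
        [[10.0, 30.0, 300.0], [2.5, 15.0, 150.0]],
    ],
    dtype=np.float32,
)
# here the statistics come from the toy data itself; in practice they could be global values
stats_min = x.min(axis=(0, 1))
stats_max = x.max(axis=(0, 1))
scaled = min_max_scale(x, stats_min, stats_max)
print(scaled.min(axis=(0, 1)), scaled.max(axis=(0, 1)))
```

If the supplied statistics are global, values in a single file need not reach exactly 0 or 1, which is expected.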

In [42]:
print("Current input array type:", type(inp_array))

Current input array type: <class 'dask.array.core.Array'>


In [43]:
%%time
if NORMALIZE_INPUT_DATA:
    inp_array = inp_array.T
    inp_array = (inp_array - np.min(inp_array, axis=(1, 2)).reshape((3, 1, 1))) / (
        np.ptp(inp_array, axis=(1, 2)).reshape((3, 1, 1))
    )
    inp_array = inp_array.T

CPU times: user 6 µs, sys: 2 µs, total: 8 µs
Wall time: 14.5 µs


In [44]:
# show 5 samples, first 5 height layers only, displaying all features in the layer (features not indexed)
# (automatic numpy array display reduction is quite large for this array)
inp_array[0:5, 0:5, :].compute()

masked_array(
  data=[[[6.86636250e+04, 2.31500000e+02, 9.08970833e-05],
         [6.84328750e+04, 2.31375000e+02, 9.15527344e-05],
         [6.81108750e+04, 2.31500000e+02, 9.50694084e-05],
         [6.76995000e+04, 2.32750000e+02, 1.07586384e-04],
         [6.72033750e+04, 2.35875000e+02, 1.52766705e-04]],

        [[6.86712500e+04, 2.31250000e+02, 8.91089439e-05],
         [6.84405000e+04, 2.31125000e+02, 9.11355019e-05],
         [6.81185000e+04, 2.31125000e+02, 9.41157341e-05],
         [6.77070000e+04, 2.32000000e+02, 1.05500221e-04],
         [6.72105000e+04, 2.34875000e+02, 1.46985054e-04]],

        [[6.86773750e+04, 2.31375000e+02, 9.06586647e-05],
         [6.84466250e+04, 2.31250000e+02, 9.19103622e-05],
         [6.81246250e+04, 2.31125000e+02, 9.40561295e-05],
         [6.77128750e+04, 2.32000000e+02, 1.04963779e-04],
         [6.72162500e+04, 2.34875000e+02, 1.46746635e-04]],

        [[6.86817500e+04, 2.31250000e+02, 8.94665718e-05],
         [6.84510000e+04, 2.31125000

#### Save a selection of wanted arrays (inp_array, tar_array, one_hot_encoded_bases)

Now save the computed arrays.<br>
Only one of the class label output and the one-hot output is saved, as conversion between the two is easy; `SAVE_ONEHOT_INSTEAD_OF_CLASS_LABEL` selects which.<br>
(Saving one-hot emulates the data produced and used by the base solution.)
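The conversion between the two encodings can be sketched as follows, using made-up labels and an assumed layer count of 5 for brevity.

```python
import numpy as np

num_layers = 5  # assumed small value for illustration
labels = np.array([2, 0, 4])

# class labels -> one-hot: select rows of the identity matrix
one_hot = np.eye(num_layers, dtype=np.float32)[labels]

# one-hot -> class labels: position of the single 1 in each row
recovered = one_hot.argmax(axis=1)
print(one_hot)
print(recovered)
```

Because the round trip is lossless, storing the class labels is the more compact option, and the one-hot form can be rebuilt when loading.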

In [45]:
# verify input and output shapes
print("Input dim:", inp_array.shape)

Input dim: (614400, 70, 3)


The following cell creates a positional encoding for the height layers in the data, e.g. the data at height layer 0 would have the positional encoding 0 as part of the input feature. It is optional (gated by the `GENERATE_POSITIONAL_ENCODING_ARRAYS` flag), as PyTorch DataLoaders were found to be capable of producing this information at load time, which seems a better option than redundantly creating a potentially huge array that repeats the same positions for every sample.

In [46]:
# create an extra positional encoding optionally for input use
if GENERATE_POSITIONAL_ENCODING_ARRAYS:
    sample_num, height_dim, _ = inp_array.shape
    # generate height values
    height_position_vector = np.arange(height_dim)
    # extend dimensions out to match input feats
    height_position_vector = np.repeat([height_position_vector], sample_num, axis=0)

    # verify
    print("shape of encoding vector:", height_position_vector.shape)

    x, y = height_position_vector.shape
    # add a dimension for height to act as a feature
    height_position_vector = height_position_vector.reshape(x, y, 1)

    # fit the dtype of the feature to match the dtype of other feats
    height_position_vector = height_position_vector.astype(inp_array.dtype)

    # combine height feature into input array
    if CONCATENATE_POSITIONAL_ENCONDING_TO_FEATURE_VECTOR:
        inp_array = np.concatenate(
            (height_position_vector, inp_array), axis=2, dtype=np.float32
        )  # leave the concat for within the model after producing embedding

    # verify datatypes
    print("input dtype", inp_array.dtype)
    print("height encoding dtype", height_position_vector.dtype)
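The load-time alternative mentioned above can be sketched without PyTorch. The hypothetical helper `add_height_position` below attaches the height index to one batch at a time, so the full-size encoding array is never stored. The batch shape and dtype are assumptions that mirror the input array.

```python
import numpy as np


def add_height_position(batch):
    """Prepend the height-layer index as a feature to a (samples, layers, features) batch."""
    samples, layers, _ = batch.shape
    positions = np.arange(layers, dtype=batch.dtype).reshape(1, layers, 1)
    # broadcast_to creates a read-only view, so nothing is copied per sample
    # until the concatenation below
    positions = np.broadcast_to(positions, (samples, layers, 1))
    return np.concatenate((positions, batch), axis=2)


batch = np.zeros((4, 70, 3), dtype=np.float32)
with_position = add_height_position(batch)
print(with_position.shape, with_position.dtype)
```

A function like this could be called from a dataset's item or batch loading step. Where exactly it belongs depends on the training code.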

In [47]:
%%time
if SAVE_NPZ:
    print("Saving numpy arrays")

    with open(path_to_save_result, "w+b") as f:
        # variable assignment that name the arrays for the saved file
        if SAVE_ONEHOT_INSTEAD_OF_CLASS_LABEL:
            output_labels = one_hot_encoded_bases
        else:
            output_labels = class_label_encoded_bases
        output_cloud_volume = tar_array

        if GENERATE_POSITIONAL_ENCODING_ARRAYS:
            if CONCATENATE_POSITIONAL_ENCONDING_TO_FEATURE_VECTOR:
                np.savez(
                    f,
                    input_x=inp_array,
                    output_cloud_volume=output_cloud_volume,
                    output_labels=output_labels,
                )
            else:
                np.savez(
                    f,
                    input_x=inp_array,
                    output_cloud_volume=output_cloud_volume,
                    output_labels=output_labels,
                    height_position_vector=height_position_vector,
                )
        else:
            input_x = inp_array.compute()

            np.savez(
                f,
                input_x=input_x,
                output_cloud_volume=output_cloud_volume,
                output_labels=output_labels,
            )

CPU times: user 8 µs, sys: 0 ns, total: 8 µs
Wall time: 15.5 µs


## Convert numpy arrays to zarr files
Zarr is a format for the storage of chunked, compressed, N-dimensional arrays. It is a desirable format for machine learning ready data because the size of the data means computations have to be broken down into chunks. Dask is the selected technology for chunk-based computation, so zarr's chunk-based storage fits well with it. Zarr is additionally useful because the metadata is stored separately from the data itself, allowing users to inspect what the data contains before choosing to load it into their environment.

Store the numpy arrays into zarr, which will chunk and compress each array:

In [48]:
# the ncdf files are stored in chunks of 307200 samples
# we want our zarr chunks to be reasonably small along the sample dimension
sample_chunking = 102400  # hardcoded: 1/3 of a day's data
height_sample = int(tar_array.shape[1])  # want to keep height layers together
feat_num_for_chunks = 3  # keeping features together on input
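As a quick sanity check on a chunking choice, the sketch below computes how many chunks result and how large each full chunk is. `chunk_summary` is a hypothetical helper, and the 4 bytes per value assumes float32 data.

```python
import math


def chunk_summary(shape, chunks, bytes_per_value):
    """Return (number of chunks, MiB per full chunk) for a regular chunk grid."""
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    chunk_mib = math.prod(chunks) * bytes_per_value / 2**20
    return n_chunks, chunk_mib


n_chunks, chunk_mib = chunk_summary((614400, 70, 3), (102400, 70, 3), 4)
print(n_chunks, round(chunk_mib, 2))
```

These figures are for uncompressed chunks. On disk, zarr compresses each chunk, so stored sizes will generally be smaller.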

In [49]:
tar_array

array([[1.      , 1.      , 1.      , ..., 0.      , 0.      , 0.      ],
       [0.984375, 1.      , 1.      , ..., 0.      , 0.      , 0.      ],
       [0.984375, 1.      , 1.      , ..., 0.      , 0.      , 0.      ],
       ...,
       [0.875   , 0.828125, 0.578125, ..., 0.      , 0.      , 0.      ],
       [0.875   , 0.8125  , 0.578125, ..., 0.      , 0.      , 0.      ],
       [0.859375, 0.8125  , 0.578125, ..., 0.      , 0.      , 0.      ]],
      dtype=float32)

In [50]:
print(class_label_encoded_bases.shape)
class_label_encoded_bases

(614400,)


array([0, 0, 0, ..., 0, 0, 0])

In [51]:
inp_array = inp_array.rechunk((sample_chunking, height_sample, feat_num_for_chunks))
inp_array

| | Array | Chunk |
|---|---|---|
| Bytes | 492.19 MiB | 82.03 MiB |
| Shape | (614400, 70, 3) | (102400, 70, 3) |
| Dask graph | 6 chunks in 25 graph layers | |
| Data type | float32 numpy.ndarray | |


In [52]:
%%time
store = zarr.DirectoryStore(path_to_save_zarr)
# define the group object for the arrays to be stored under

zarr_grouping = zarr.group(store=store, overwrite=True)

# initialize and then write on zarr arrays for all desired arrays to be saved

cloud_volume_fraction_y = zarr_grouping.zeros(
    shape=tar_array.shape,
    dtype=tar_array.dtype,
    name="cloud_volume_fraction_y.zarr",
    chunks=(sample_chunking, height_sample),
)
print("Start cloud volume save")
cloud_volume_fraction_y[:] = tar_array

if SAVE_ONEHOT_INSTEAD_OF_CLASS_LABEL:
    cloud_base_onehot_y = zarr_grouping.zeros(
        shape=one_hot_encoded_bases.shape,
        dtype=one_hot_encoded_bases.dtype,
        name="cloud_base_onehot_y.zarr",
        chunks=(sample_chunking, height_sample),
    )
    print("Start base label save")
    cloud_base_onehot_y[:] = one_hot_encoded_bases
else:
    cloud_base_label_y = zarr_grouping.zeros(
        shape=class_label_encoded_bases.shape,
        dtype=class_label_encoded_bases.dtype,
        name="cloud_base_label_y.zarr",
        chunks=(sample_chunking,),
    )
    print("Start base label save")
    cloud_base_label_y[:] = class_label_encoded_bases

Start cloud volume save
Start base label save
CPU times: user 333 ms, sys: 30.4 ms, total: 363 ms
Wall time: 169 ms


In [53]:
if isinstance(inp_array, np.ndarray):
    humidity_temp_pressure_x = zarr_grouping.zeros(
        shape=inp_array.shape,
        dtype=inp_array.dtype,
        name="humidity_temp_pressure_x.zarr",
        chunks=(sample_chunking, height_sample, feat_num_for_chunks),
    )
    print("Start input save")
    humidity_temp_pressure_x[:] = inp_array
else:
    print("Start input save")
    with ProgressBar(), ResourceProfiler(5):
        inp_array.to_zarr(
            path_to_save_zarr,
            "humidity_temp_pressure_x.zarr",
            overwrite=True,
            compute=True,
            return_stored=False,
        )

Start input save
[########################################] | 100% Completed | 2.87 s


In [54]:
if GENERATE_POSITIONAL_ENCODING_ARRAYS:
    if not CONCATENATE_POSITIONAL_ENCONDING_TO_FEATURE_VECTOR:
        height_position_x = zarr_grouping.zeros(
            shape=height_position_vector.shape,
            dtype=height_position_vector.dtype,
            name="height_position_x.zarr",
            chunks=(sample_chunking, height_sample, 1),  # encoding has a single feature
        )

        print("Start encoded height save")
        height_position_x[:] = height_position_vector

Now that the zarr group has been saved, print some information about it to understand what has been created:

In [55]:
# output some summary for zarr
# view group values
print("Elements of zarr group:")
zarr_grouping.visitvalues(print)
# view group tree
print("\nTree of zarr group:\n", zarr_grouping.tree())
# see chunk size
print("\nShape array example:", cloud_volume_fraction_y.shape)
print("\nZarr chunking shape of an array:", cloud_base_label_y.chunks)

Elements of zarr group:
<zarr.core.Array '/cloud_base_label_y.zarr' (614400,) int64>
<zarr.core.Array '/cloud_volume_fraction_y.zarr' (614400, 70) float32>
<zarr.core.Array '/humidity_temp_pressure_x.zarr' (614400, 70, 3) float32>

Tree of zarr group:
 /
 ├── cloud_base_label_y.zarr (614400,) int64
 ├── cloud_volume_fraction_y.zarr (614400, 70) float32
 └── humidity_temp_pressure_x.zarr (614400, 70, 3) float32

Shape array example: (614400, 70)

Zarr chunking shape of an array: (102400,)


## Discussion

In this notebook, data is transformed from its raw form into something that is hopefully much closer to the form it will take during machine learning training. The main challenges are handling the size of the data, knowing enough about the desired ML task to select which parts of the raw data are necessary, and finding a data format that fits the problem.

As for next steps, some information about the data was provided to assist in pre-processing, but further exploration is required to make correct modelling choices and avoid pitfalls.

#### Links/Resources

1. https://zarr.readthedocs.io/en/stable/
1. https://www.dea.ga.gov.au/about/analysis-ready-data
1. https://www.dask.org/
