# Conversion of Data to a Machine Learning Friendly Format

This notebook demonstrates how to take a single NetCDF file and convert it into analysis-ready numpy arrays, stored in a zarr file for later use in neural network training.

Specifically, it assumes that multiple NetCDF files of UM model data have already been loaded into a single Iris CubeList and that this CubeList has been saved to disk. This notebook will then:<ul>

<li>Load the single NetCDF file back from disk.</li>
<li>Extract the desired cubes: cloud volume fraction, specific humidity, air pressure, and air temperature.</li>
<li>Combine cubes of the same feature where metadata differences have prevented concatenation.</li>
<li>Convert the cubes to numpy arrays.</li>
<li>Reshape the arrays to the desired dimensions: (Sample Number, Height Level, Feature).</li>
<li>Generate the prediction target (the height level of the cloud base in each sample).</li>
<li>Normalize data where necessary.</li>
<li>Save the data to disk for later loading to perform ML tasks.</li></ul> 

Define imports:

In [1]:
import os
import pathlib
import re

import dask
import iris
import numpy as np

import cbh_data_definitions #  used for testing the load back in of the data

import zarr

Define file paths:

In [2]:
root_data_directory = pathlib.Path(os.environ["SCRATCH"]) / "cbh_data"

paths_to_load = (
    root_data_directory / "train" / "train_large.nc"
)  # one large nc file of iris' concatenation of all small nc files
path_to_save_result = (
    root_data_directory / "analysis_ready" / "train.npz"
)  # output for numpy arrays
path_to_save_zarr = (
    root_data_directory / "analysis_ready" / "train.zarr"
)  # output for zarr files

Settings for the notebook; the comment above each constant describes its purpose:

In [3]:
# generates a positional encoding array for a feature of each height layer in the sample
GENERATE_POSITIONAL_ENCODING_ARRAYS = False
# adds the height layer number to every feature vector in the input array
if GENERATE_POSITIONAL_ENCODING_ARRAYS:
    CONCATENATE_POSITIONAL_ENCONDING_TO_FEATURE_VECTOR = False

# realises the input array computation in two halves to avoid memory constraints of large computation
COMPUTE_INPUT_ARRAY_IN_HALVES = False

FREE_UP_MEMORY_AFTER_TARGET_COMPUTATION = True

# show all samples where clouds exist in the final layer (none)
# the final layer is used as the class label when a sample has no cloud base
VERIFY_NO_FINAL_LAYER_CLOUDS = False

# do extra compute to find the number of samples with cloud bases in the dataset
COMPUTE_CLOUD_BASE_SAMPLE_NUMBER = False

# save npz or not
SAVE_NPZ = False

# perform some computations that may take a long while, but give more information for general understanding
PERFORM_LONG_COMPUTATIONS_FOR_EXTRA_INFO = False

# save the cloud base position as a class label, or a onehot vector
SAVE_ONEHOT_INSTEAD_OF_CLASS_LABEL = False
# one can easily be converted to the other, so only saving one is necessary

# Normalize the input data to a range [0,1]
NORMALIZE_INPUT_DATA = False
# defaults to False, as models should be designed to normalize the data themselves; this avoids issues like an unknown global maximum
# e.g. the max temperature in the loaded data can differ from the max of all temperatures your model may receive as input

## Loading in the Cloud Base Height Data

In [4]:
cubes = iris.load(str(paths_to_load))

print("Find files complete, list of paths:", paths_to_load)

Find files complete, list of paths: /scratch/hsouth/cbh_data/train/train_large.nc


Show cube names:

In [5]:
print("Cube names:\n", [str(cube.name()) for cube in cubes])

print("\n" + "Example of cube metadata:", cubes[2].summary())

Cube names:
 ['cloud_volume_fraction_in_atmosphere_layer', 'm01s05i250', 'cloud_volume_fraction_in_atmosphere_layer', 'm01s05i250_0', 'air_pressure', 'air_pressure', 'air_temperature', 'air_temperature', 'convective_rainfall_flux', 'convective_rainfall_flux', 'convective_snowfall_flux', 'convective_snowfall_flux', 'specific_humidity', 'specific_humidity', 'stratiform_rainfall_flux', 'stratiform_rainfall_flux', 'stratiform_snowfall_flux', 'stratiform_snowfall_flux', 'upward_air_velocity', 'upward_air_velocity']

Example of cube metadata: cloud_volume_fraction_in_atmosphere_layer / (1) (forecast_period: 4; forecast_reference_time: 60; model_level_number: 70; latitude: 480; longitude: 640)
    Dimension coordinates:
        forecast_period                                         x                           -                       -             -               -
        forecast_reference_time                                 -                           x                       -            

## Preprocess the data

### Extract the desired cubes: cloud volume fraction, specific humidity, air pressure, and air temperature

Cloud volume fraction will be used as our target for the problem, and the rest of the cubes are used as input.

In [6]:
def create_dataset(cubes):
    list_of_input_cubes = ["air_temperature", "air_pressure", "specific_humidity"]
    target_cube_name = ["cloud_volume_fraction_in_atmosphere_layer"]

    target_cube = iris.cube.CubeList(
        [cube for cube in cubes if (cube.long_name in target_cube_name)]
    )
    inp_cube = iris.cube.CubeList(
        [cube for cube in cubes if (cube.standard_name in list_of_input_cubes)]
    )

    return inp_cube, target_cube

Call the function defined above and verify success:

In [7]:
inp_cube, tar_cube = create_dataset(cubes)

print("input cube:\n", inp_cube, "\n")
print("target cubes:\n", tar_cube)

input cube:
 0: air_pressure / (Pa)                 (forecast_period: 4; forecast_reference_time: 60; model_level_number: 70; latitude: 480; longitude: 640)
1: air_pressure / (Pa)                 (forecast_period: 4; forecast_reference_time: 31; model_level_number: 70; latitude: 480; longitude: 640)
2: air_temperature / (K)               (forecast_period: 4; forecast_reference_time: 31; model_level_number: 70; latitude: 480; longitude: 640)
3: air_temperature / (K)               (forecast_period: 4; forecast_reference_time: 60; model_level_number: 70; latitude: 480; longitude: 640)
4: specific_humidity / (kg kg-1)       (forecast_period: 4; forecast_reference_time: 60; model_level_number: 70; latitude: 480; longitude: 640)
5: specific_humidity / (kg kg-1)       (forecast_period: 4; forecast_reference_time: 31; model_level_number: 70; latitude: 480; longitude: 640) 

target cubes:
 0: cloud_volume_fraction_in_atmosphere_layer / (1) (forecast_period: 4; forecast_reference_time: 31; model

### Combine cubes of the same feature where metadata differences have prevented concatenation, while also extracting the numpy array of each cube

If duplicate cubes exist, concatenate them using numpy to avoid metadata matching issues:

In [8]:
def order_two_objects_by_len_ascending(obj1, obj2):
    len1 = len(obj1[1])
    len2 = len(obj2[1])
    if len1 >= len2:
        return obj2, obj1 
    else:
        return obj1, obj2
    
def concatenate_same_cubes(cube_list):

    cube_name_dictionary = {}

    for cube in cube_list:
        # print('start cube load')
        cube_np_array = cube.core_data()
        # print('end load')
        
        cube_name = cube.name()

        try:
            # concatenate along the differing axis, forecast reference time
            # MUST CONCATENATE IN THE SAME ORDER FOR EACH ARRAY (the dimension length differs
            # between the two arrays, so ordering them by length keeps the order consistent)
            short_arr, long_arr = order_two_objects_by_len_ascending(cube_np_array, cube_name_dictionary[cube_name])
            cube_name_dictionary[cube_name] = np.concatenate(
                (short_arr, long_arr), axis=1
            )

            # print(cube_name_dictionary[cube_name].shape)

        except KeyError:
            cube_name_dictionary[cube_name] = cube_np_array

    return cube_name_dictionary

Call the function defined above and verify success:

In [9]:
inp_dict = concatenate_same_cubes(inp_cube)
tar_dict = concatenate_same_cubes(tar_cube)

print("Air Pressure array shape:", inp_dict["air_pressure"].shape)
print(
    "Cloud Volume array shape:",
    tar_dict["cloud_volume_fraction_in_atmosphere_layer"].shape,
)
print("Array types:", type(inp_dict["air_pressure"]))
print("Input cube arrays found:", inp_dict.keys())
print("Target cube arrays found:", tar_dict.keys())

Air Pressure array shape: (4, 91, 70, 480, 640)
Cloud Volume array shape: (4, 91, 70, 480, 640)
Array types: <class 'dask.array.core.Array'>
Input cube arrays found: dict_keys(['air_pressure', 'air_temperature', 'specific_humidity'])
Target cube arrays found: dict_keys(['cloud_volume_fraction_in_atmosphere_layer'])
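
To illustrate the concatenation step on a small scale, the sketch below joins two arrays that differ only in the length of axis 1 (standing in for `forecast_reference_time`), always placing the shorter array first so that every feature would be joined in the same order. The toy shapes and the helper name `join_shortest_first` are invented for this illustration and are not part of the notebook.

```python
import numpy as np


def join_shortest_first(arr_a, arr_b, axis=1):
    """Concatenate two arrays along `axis`, placing the shorter one first."""
    first, second = sorted((arr_a, arr_b), key=lambda arr: arr.shape[axis])
    return np.concatenate((first, second), axis=axis)


# toy stand-ins for two cubes of the same feature:
# (forecast_period, forecast_reference_time, model_level_number)
long_part = np.ones((4, 6, 7), dtype=np.float32)
short_part = np.zeros((4, 3, 7), dtype=np.float32)

joined = join_shortest_first(long_part, short_part)
print(joined.shape)
```

Because the ordering depends only on the axis length, the result is the same whichever of the two arrays is encountered first.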


Combine dictionary elements to one array:

In [10]:
def combine_feats(dict_of_feats):

    add_dim_for_feature = [np.expand_dims(x, axis=0) for x in dict_of_feats.values()]
    feat_concat_array = np.concatenate(add_dim_for_feature, axis=0)
    return feat_concat_array

In [11]:
inp_array = combine_feats(inp_dict)
tar_array = combine_feats(tar_dict)

# verify and check dims
print("Dimensions to standardize for processing:")
print("Current Input Shape:", inp_array.shape)
print("Current Target Shape:", tar_array.shape)

Dimensions to standardize for processing:
Current Input Shape: (3, 4, 91, 70, 480, 640)
Current Target Shape: (1, 4, 91, 70, 480, 640)
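
As a small-scale check of what `combine_feats` does, the sketch below applies the same expand-then-concatenate pattern to toy arrays and compares it with `np.stack`, which should give the same result in one call. The toy values are invented; note that the order along the new feature axis follows the insertion order of the dictionary.

```python
import numpy as np

feats = {
    "air_pressure": np.full((2, 3), 1.0),
    "air_temperature": np.full((2, 3), 2.0),
    "specific_humidity": np.full((2, 3), 3.0),
}

# give each array a new leading axis, then join along that axis
expanded = [np.expand_dims(arr, axis=0) for arr in feats.values()]
combined = np.concatenate(expanded, axis=0)

# np.stack performs both steps in a single call
stacked = np.stack(list(feats.values()), axis=0)

print(combined.shape, np.array_equal(combined, stacked))
```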


Expand the dimensions of 'short' arrays so that the flattening step works (in practice this applies to the smaller dev set of the data):

In [12]:
if len(inp_array.shape) == 4:
    time_time2_dims_to_add = [1, 2]
    inp_array = np.expand_dims(inp_array, time_time2_dims_to_add)
    tar_array = np.expand_dims(tar_array, time_time2_dims_to_add)
    print("New and correct shapes (should be 6 dims):")
    print(inp_array.shape)
    print(tar_array.shape)
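
A minimal sketch of the dimension expansion, on a toy numpy array with invented sizes: a 4-d array laid out as (feature, height, lat, long) gains two length-1 axes at positions 1 and 2, giving the 6-d layout that the flattening step expects.

```python
import numpy as np

# toy array laid out as (feature, height, lat, long), missing both time axes
short_array = np.zeros((3, 7, 4, 5), dtype=np.float32)

# insert two length-1 axes at positions 1 and 2 of the result
expanded = np.expand_dims(short_array, axis=(1, 2))
print(expanded.shape)
```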

In [13]:
print("Show array storage metadata:")
inp_array

Show array storage metadata:


| | Array | Chunk |
|---|---|---|
| Bytes | 87.48 GiB | 82.03 MiB |
| Shape | (3, 4, 91, 70, 480, 640) | (1, 1, 1, 70, 480, 640) |
| Count | 19 Graph Layers | 1092 Chunks |
| Type | float32 | numpy.ndarray |


In [14]:
print("Show array storage metadata:")
tar_array

Show array storage metadata:


| | Array | Chunk |
|---|---|---|
| Bytes | 29.16 GiB | 82.03 MiB |
| Shape | (1, 4, 91, 70, 480, 640) | (1, 1, 1, 70, 480, 640) |
| Count | 6 Graph Layers | 364 Chunks |
| Type | float32 | numpy.ndarray |


### Flatten the arrays

Flatten the time and lat/long dimensions down to a single dimension, the sample number.<br>
The function expects a 6-d array whose dimensions are, in order: cube_num, time, time2, height, lat, long.

In [15]:
def flatten_cubes_with_numpy(np_array):

    # print('input dimensions:', np_array.shape)

    cube_num, time, time2, height, lat, long = np_array.shape

    # # verify shape
    # print(np_array.shape)

    # swap axis of time and height to ensure flattening preserves height
    cube_array = np_array.transpose(0, 3, 1, 2, 4, 5)
    cubes_flattened = np.reshape(
        cube_array, (cube_num, height, (time * time2 * lat * long))
    )

    # print('new dimensions', cubes_flattened.shape)

    cube_to_return = cubes_flattened.T
    # remove unnecessary dimensions
    cube_to_return = cube_to_return.squeeze()
    return cube_to_return

In [16]:
dask.config.set(
    {"array.slicing.split_large_chunks": False}
)  # allow the potentially large chunk of data

inp_array = flatten_cubes_with_numpy(inp_array)
tar_array = flatten_cubes_with_numpy(tar_array)

# print('verify squeeze')
print("Shapes of flattened and transposed arrays:")
print("Input:", inp_array.shape)
print("Target:", tar_array.shape)

Shapes of flattened and transposed arrays:
Input: (111820800, 70, 3)
Target: (111820800, 70)
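
To see why the height axis is moved before reshaping, the sketch below runs the same transpose, reshape and `.T` sequence on a toy array in which every value equals its own height index. After flattening, each sample should still read 0, 1, 2, ... along the height axis. The toy sizes are invented.

```python
import numpy as np

cube_num, time, time2, height, lat, long = 3, 2, 2, 4, 2, 3
n_samples = time * time2 * lat * long

# every value stores its own height index, so the layout can be checked afterwards
level_index = np.arange(height).reshape(1, 1, 1, height, 1, 1)
toy = np.broadcast_to(
    level_index, (cube_num, time, time2, height, lat, long)
).astype(np.float32)

# move height next to the feature axis, merge the remaining axes into samples,
# then reverse the axis order to get (sample, height, feature)
moved = toy.transpose(0, 3, 1, 2, 4, 5)
flat = moved.reshape(cube_num, height, n_samples).T

print(flat.shape)
print(flat[0, :, 0])
```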


Rechunk the flattened arrays so that large chunks are split into smaller ones that dask can handle more easily:

In [17]:
tar_array = dask.array.rechunk(tar_array, chunks="auto")
print("Rechunked array storage metadata for target:")
tar_array

Rechunked array storage metadata for target:


| | Array | Chunk |
|---|---|---|
| Bytes | 29.16 GiB | 106.64 MiB |
| Shape | (111820800, 70) | (3993600, 7) |
| Count | 12 Graph Layers | 280 Chunks |
| Type | float32 | numpy.ndarray |


In [18]:
inp_array = dask.array.rechunk(inp_array, chunks="auto")
print("Rechunked array storage metadata for input:")
inp_array

Rechunked array storage metadata for input:


| | Array | Chunk |
|---|---|---|
| Bytes | 87.48 GiB | 124.41 MiB |
| Shape | (111820800, 70, 3) | (4659200, 7, 1) |
| Count | 24 Graph Layers | 720 Chunks |
| Type | float32 | numpy.ndarray |


In [19]:
# rechunk so that every sample (all 70 height levels, all 3 features) sits in a single chunk
inp_arr_chunks = inp_array.chunksize
inp_array = inp_array.rechunk((inp_arr_chunks[0] // 10, 70, 3))
inp_array

| | Array | Chunk |
|---|---|---|
| Bytes | 87.48 GiB | 373.24 MiB |
| Shape | (111820800, 70, 3) | (465920, 70, 3) |
| Count | 26 Graph Layers | 240 Chunks |
| Type | float32 | numpy.ndarray |
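
The chunk figures reported above can be reproduced with a little arithmetic. The sketch below recomputes the size of one (465920, 70, 3) float32 chunk and the number of such chunks needed to cover all samples, using only the shapes shown in the output.

```python
import math

import numpy as np

n_samples, n_levels, n_features = 111_820_800, 70, 3
chunk_shape = (465_920, n_levels, n_features)
itemsize = np.dtype(np.float32).itemsize  # 4 bytes per value

chunk_mib = math.prod(chunk_shape) * itemsize / 2**20
n_chunks = n_samples // chunk_shape[0]

print(f"{chunk_mib:.2f} MiB per chunk, {n_chunks} chunks")
# -> 373.24 MiB per chunk, 240 chunks
```

Because the last two chunk dimensions span all 70 levels and all 3 features, each chunk holds only complete samples.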


For more information about how dask built the input array, display its task graph:

In [20]:
inp_array.dask

The graph contains 26 layers, listed in display order. Every layer that reports a shape has dtype float32, type dask.array.core.Array and chunk_type numpy.ndarray.

| # | layer_type | is_materialized | number of outputs | shape | chunksize | depends on |
|---|---|---|---|---|---|---|
| 1 | MaterializedLayer | True | 1 | | | |
| 2 | Blockwise | False | 240 | (4, 60, 70, 480, 640) | (1, 1, 70, 480, 640) | original-array-1f8f4c1913867a21841dc21a103f3dde |
| 3 | MaterializedLayer | True | 1 | | | |
| 4 | Blockwise | False | 124 | (4, 31, 70, 480, 640) | (1, 1, 70, 480, 640) | original-array-f6b2b95d176d379dcf11c60b940e4bb7 |
| 5 | MaterializedLayer | True | 364 | (4, 91, 70, 480, 640) | (1, 1, 70, 480, 640) | array-f6b2b95d176d379dcf11c60b940e4bb7, array-1f8f4c1913867a21841dc21a103f3dde |
| 6 | MaterializedLayer | True | 364 | (1, 4, 91, 70, 480, 640) | (1, 1, 1, 70, 480, 640) | concatenate-7f81c9c93de83e90b86316b056db1469 |
| 7 | MaterializedLayer | True | 1 | | | |
| 8 | Blockwise | False | 240 | (4, 60, 70, 480, 640) | (1, 1, 70, 480, 640) | original-array-c5bd0746cb701b7d82a5e4ee409383fa |
| 9 | MaterializedLayer | True | 1 | | | |
| 10 | Blockwise | False | 124 | (4, 31, 70, 480, 640) | (1, 1, 70, 480, 640) | original-array-44faa26f00772efe0bb574652a1e42fb |
| 11 | MaterializedLayer | True | 364 | (4, 91, 70, 480, 640) | (1, 1, 70, 480, 640) | array-c5bd0746cb701b7d82a5e4ee409383fa, array-44faa26f00772efe0bb574652a1e42fb |
| 12 | MaterializedLayer | True | 364 | (1, 4, 91, 70, 480, 640) | (1, 1, 1, 70, 480, 640) | concatenate-e18b706bd7c976a3c743fa1effdd6d53 |
| 13 | MaterializedLayer | True | 1 | | | |
| 14 | Blockwise | False | 240 | (4, 60, 70, 480, 640) | (1, 1, 70, 480, 640) | original-array-a105c0bdb54bcd2e036d0eb59ad851ae |
| 15 | MaterializedLayer | True | 1 | | | |
| 16 | Blockwise | False | 124 | (4, 31, 70, 480, 640) | (1, 1, 70, 480, 640) | original-array-a42a5d46817c21b9471b8e279b85945a |
| 17 | MaterializedLayer | True | 364 | (4, 91, 70, 480, 640) | (1, 1, 70, 480, 640) | array-a105c0bdb54bcd2e036d0eb59ad851ae, array-a42a5d46817c21b9471b8e279b85945a |
| 18 | MaterializedLayer | True | 364 | (1, 4, 91, 70, 480, 640) | (1, 1, 1, 70, 480, 640) | concatenate-76eab8f246bb07254835615db97dca8f |
| 19 | MaterializedLayer | True | 1092 | (3, 4, 91, 70, 480, 640) | (1, 1, 1, 70, 480, 640) | reshape-b8c105765a9177f8d360ae63336c1659, reshape-2789e27f75ee666f6fb2660a1db678ba, reshape-6c8a3fbbff1c44e8d561554b08b1206c |
| 20 | Blockwise | False | 1092 | (3, 70, 4, 91, 480, 640) | (1, 70, 1, 1, 480, 640) | concatenate-77d9c780bae8376b290e6ef023d97478 |
| 21 | MaterializedLayer | True | 12 | (3, 70, 4, 91, 480, 640) | (1, 70, 1, 91, 480, 640) | transpose-40296edd73032de78ee14243f212d5d8 |
| 22 | MaterializedLayer | True | 12 | (3, 70, 111820800) | (1, 70, 27955200) | rechunk-merge-07ca054a6bfc5768131096b57f808d92 |
| 23 | Blockwise | False | 12 | (111820800, 70, 3) | (27955200, 70, 1) | reshape-8de36ee96e8634b6280777beb05a366e |
| 24 | MaterializedLayer | True | 1440 | (111820800, 70, 3) | (4659200, 7, 1) | transpose-427c942ac42a75b7a47be4b401d989f6 |
| 25 | MaterializedLayer | True | 240 | (111820800, 70, 3) | (4659200, 7, 3) | rechunk-merge-b895343c48059fa0b437a9a3317934a9 |
| 26 | MaterializedLayer | True | 2640 | (111820800, 70, 3) | (465920, 70, 3) | rechunk-merge-d061782a6db49c4da9cdc61b37cffe36 |


In [21]:
tar_arr_chunks = tar_array.chunksize
tar_array = tar_array.rechunk((tar_arr_chunks[0] // 10, 70))
tar_array

| | Array | Chunk |
|---|---|---|
| Bytes | 29.16 GiB | 106.64 MiB |
| Shape | (111820800, 70) | (399360, 70) |
| Count | 15 Graph Layers | 280 Chunks |
| Type | float32 | numpy.ndarray |


## Preprocess the data toward ML algorithm input

### Generate data for the target of cloud base at certain height

Preprocess the target. A cloud is defined to exist in a height layer if the cloud volume fraction there is greater than 2 oktas out of a possible 8 (a fraction of 0.25).<br>
The cells below find, for each sample, the first layer where the cloud volume fraction exceeds this threshold, marking that location with a 1 and storing 0 elsewhere.<br>
Later, the final height layer is marked for samples without a cloud base.

In [22]:
# See an example of target array values
print(
    "Example of cloud volume samples (first 10 samples, first 30 layers):\n",
    tar_array[0:10,0:30].compute()
)
if PERFORM_LONG_COMPUTATIONS_FOR_EXTRA_INFO:
    print("Maximum value in data:", np.max(tar_array).compute())

Example of cloud volume samples (first 10 samples, first 30 layers):
 [[0.15625  0.171875 0.171875 0.171875 0.15625  0.109375 0.046875 0.
  0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.      ]
 [0.15625  0.15625  0.171875 0.171875 0.140625 0.09375  0.046875 0.
  0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.      ]
 [0.15625  0.15625  0.171875 0.15625  0.140625 0.09375  0.03125  0.
  0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.      ]
 [0.15625  0.15625  0.15625  0.15625  0.140625 0.078125 0.03125  0.
  0.       0.       0.       0.       0.       0.       0.       0.
  0.       

In [23]:
cloud_threshold = 2.0 / 8.0
cloud_over_threshold = dask.array.where(tar_array > cloud_threshold)

In [24]:
%%time
# realize the values for the where condition (dask array to numpy array)
print("Start base found sample compute")
sample_with_cloud = cloud_over_threshold[0].compute()
print("Start sample index compute")
index_on_sample = cloud_over_threshold[1].compute()

Start base found sample compute
Start sample index compute
CPU times: user 10min 12s, sys: 11min 34s, total: 21min 47s
Wall time: 7min 26s


Remove repeated indices: where multiple layers in a sample are above the cloud threshold, we only want the first occurrence in that sample (the base):

In [25]:
%%time
_, first_duplicate_indicies = np.unique(sample_with_cloud, return_index=True)

if COMPUTE_CLOUD_BASE_SAMPLE_NUMBER:
    # np.unique already returned a realised numpy array, so no dask compute is needed
    print("Number of cloud bases found:", first_duplicate_indicies.shape[0])
    print("Out of samples:", tar_array.shape[0])

CPU times: user 13.8 s, sys: 13 s, total: 26.8 s
Wall time: 26.8 s


For samples where no cloud base was found (no cloud volume over the threshold appears in the data),
add a marker at the final height layer.

In [26]:
%%time

# encode the cloud in onehot vector
one_hot_encoded_bases = np.zeros(tar_array.shape)
one_hot_encoded_bases[
    sample_with_cloud[first_duplicate_indicies],
    index_on_sample[first_duplicate_indicies],
] = 1
# mark the end (final layer) if no cloud base detected
sample_has_no_base = ~np.any(one_hot_encoded_bases, axis=1)
one_hot_encoded_bases[sample_has_no_base, -1] = 1

# reduce each onehot vector to a class label, treating every height layer as a class, e.g. 0,0,1,0 -> 2
class_label_encoded_bases = np.argmax(one_hot_encoded_bases, axis=1)

CPU times: user 36.8 s, sys: 24.7 s, total: 1min 1s
Wall time: 1min 1s
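
The labelling steps above can be traced on a toy array. The sketch below repeats them with plain numpy and, as a cross-check, derives the same labels with a boolean `argmax` shortcut. The shortcut is an alternative formulation offered for comparison, not the method used in this notebook, and the toy values are invented.

```python
import numpy as np

cloud_threshold = 2.0 / 8.0
toy_cloud = np.array(
    [
        [0.0, 0.3, 0.5, 0.0],  # base at level 1
        [0.1, 0.2, 0.1, 0.0],  # never over the threshold -> final level
        [0.6, 0.0, 0.9, 0.0],  # base at level 0
    ],
    dtype=np.float32,
)

# row and column indices of every value over the threshold, in row-major order
rows, cols = np.where(toy_cloud > cloud_threshold)
# position of the first entry for each row that has at least one hit
_, first_hit = np.unique(rows, return_index=True)

one_hot = np.zeros(toy_cloud.shape)
one_hot[rows[first_hit], cols[first_hit]] = 1
one_hot[~one_hot.any(axis=1), -1] = 1  # no base found: mark the final level
labels = np.argmax(one_hot, axis=1)

# alternative: argmax of a boolean row returns the position of the first True
over = toy_cloud > cloud_threshold
shortcut = np.where(over.any(axis=1), over.argmax(axis=1), toy_cloud.shape[1] - 1)

print(labels, shortcut)
```

Both routes rely on the indices being in row-major order, so the first hit within a sample is always its lowest layer over the threshold.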


In [27]:
print("Target as class label:", class_label_encoded_bases.shape)
print("Output dim:", one_hot_encoded_bases.shape)

Target as class label: (111820800,)
Output dim: (111820800, 70)


Compute and unmask target array (cloud volume):

In [28]:
%%time
print("Current type of target array:", type(tar_array))
print("Target shape:", tar_array.shape)
tar_array = tar_array.compute()
print("Finished compute of target array")

num_of_masked = np.ma.count_masked(tar_array)
print("Number of masked values after computation:", num_of_masked)
assert num_of_masked == 0

# unmask
tar_array = np.ma.filled(tar_array, np.nan)

Current type of target array: <class 'dask.array.core.Array'>
Target shape: (111820800, 70)
Finished compute of target array
Number of masked values after computation: 0
CPU times: user 5min 10s, sys: 7min 26s, total: 12min 36s
Wall time: 4min 55s
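
The unmasking step can be seen on a toy example. This is a minimal sketch with made-up values, showing what `np.ma.count_masked` and `np.ma.filled` do to a masked array and why the assertion above makes the fill value irrelevant when nothing is masked.

```python
import numpy as np

# a masked array with one masked entry (toy values)
with_mask = np.ma.masked_array([0.1, 0.2, 0.3], mask=[False, True, False])
masked_count = np.ma.count_masked(with_mask)
filled_with_nan = np.ma.filled(with_mask, np.nan)  # masked entry becomes NaN

# a masked array with nothing masked, as asserted in the cell above
without_mask = np.ma.masked_array([0.1, 0.2, 0.3])
plain = np.ma.filled(without_mask, np.nan)  # plain ndarray, values unchanged

print(masked_count, filled_with_nan, type(plain))
```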


In [29]:
if VERIFY_NO_FINAL_LAYER_CLOUDS:
    # verify the claim that no cloud bases appear in the final layer;
    # this can be strengthened to: no clouds exist in the final layer at all
    # (in which case the index list printed below is empty)
    print(
        "list of clouds at final height level:",
        np.where(tar_array[:, -1] > cloud_threshold),
    )

### Show some samples of what has been produced

In [30]:
one_hot_encoded_bases

array([[0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [31]:
class_label_encoded_bases

array([69, 69, 69, ..., 69, 69, 69])

In [32]:
tar_array

array([[0.15625 , 0.171875, 0.171875, ..., 0.      , 0.      , 0.      ],
       [0.15625 , 0.15625 , 0.171875, ..., 0.      , 0.      , 0.      ],
       [0.15625 , 0.15625 , 0.171875, ..., 0.      , 0.      , 0.      ],
       ...,
       [0.03125 , 0.015625, 0.      , ..., 0.      , 0.      , 0.      ],
       [0.03125 , 0.015625, 0.      , ..., 0.      , 0.      , 0.      ],
       [0.03125 , 0.015625, 0.      , ..., 0.      , 0.      , 0.      ]],
      dtype=float32)

In [33]:
# optionally, free up some memory
if FREE_UP_MEMORY_AFTER_TARGET_COMPUTATION:
    del sample_with_cloud
    del cloud_over_threshold
    del first_duplicate_indicies
    del index_on_sample
    del tar_dict
    del tar_cube
    del cubes

In [34]:
if SAVE_ONEHOT_INSTEAD_OF_CLASS_LABEL:
    del class_label_encoded_bases
else:
    del one_hot_encoded_bases

In [35]:
# get information for correct chunking
print(class_label_encoded_bases.shape)
print(tar_array.shape)

(111820800,)
(111820800, 70)


### Normalize Input Data

To normalize the input data, we first transpose the input array so that the feature dimension is the leading axis, which lets numpy access all values of the same feature more easily. Each feature is then min-max scaled into the range \[0, 1\] by subtracting its minimum and dividing by its peak-to-peak range (maximum minus minimum), and the array is transposed back.

(TODO: investigate a mistake relating to taking the ptp of the local dataset instead of global values, and make changes.)
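
The scaling can be sketched on a small array. The example below is an illustration under assumptions, not the notebook's implementation: `min_max_scale`, `toy_input` and the toy scales are hypothetical, and the statistics are passed in explicitly so that the same values could be reused for another dataset, which is one possible way to address the note about local versus global ranges.

```python
import numpy as np


def min_max_scale(array, feature_min, feature_range):
    """Scale each feature (last axis) with the given minimum and range."""
    return (array - feature_min) / feature_range


rng = np.random.default_rng(0)
# toy input: 4 samples, 5 height layers, 3 features on very different scales
toy_input = rng.random((4, 5, 3)) * np.array([1e-3, 300.0, 1e5])

# per-feature statistics over the sample and height axes, shape (1, 1, 3);
# max - min is the peak-to-peak range that np.ptp computes
feature_min = toy_input.min(axis=(0, 1), keepdims=True)
feature_range = toy_input.max(axis=(0, 1), keepdims=True) - feature_min

toy_normalized = min_max_scale(toy_input, feature_min, feature_range)
print(toy_normalized.min(axis=(0, 1)), toy_normalized.max(axis=(0, 1)))
```

Using `keepdims=True` lets the statistics broadcast against the (sample, height, feature) layout directly, so no transpose is needed in this sketch.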

In [36]:
print("Current input array type:", type(inp_array))

Current input array type: <class 'dask.array.core.Array'>


In [37]:
%%time
if NORMALIZE_INPUT_DATA:
    inp_array = inp_array.T
    inp_array = (inp_array - np.min(inp_array, axis=(1, 2)).reshape((3, 1, 1))) / (
        np.ptp(inp_array, axis=(1, 2)).reshape((3, 1, 1))
    )
    inp_array = inp_array.T

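    # compute in two halves to stay within memory limits
    if COMPUTE_INPUT_ARRAY_IN_HALVES:
        half = len(inp_array) // 2
        inp_array_1 = inp_array[:half].compute()
        # unmask the first half, as is done for the second half in the next cell
        assert np.ma.count_masked(inp_array_1) == 0
        inp_array_1 = np.ma.filled(inp_array_1, np.nan)
        len_first_half = inp_array_1.shape[0]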

    else:
        inp_array = inp_array.compute()
        print("Finished compute of input array normalization")
        # convert to regular array, after verifying mask does not identify any values
        # (print below gives 0 masked values)
        num_of_masked = np.ma.count_masked(inp_array)
        print("Number of masked values after computation:", num_of_masked)
        assert num_of_masked == 0
        # unmask, giving all masked values NaN (but no masked values)
        inp_array = np.ma.filled(inp_array, np.nan)

        # # verify dimensions
        # print(inp_array.shape)
        # # and verify type
        # print('type of unmasked array:', type(inp_array))

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.2 µs


In [38]:
%%time
# second half of memory constraint compute, see above cell
if NORMALIZE_INPUT_DATA:
    if COMPUTE_INPUT_ARRAY_IN_HALVES:
        inp_array_2 = inp_array[half:].compute()
        print("Array type after compute:", type(inp_array_2))
        num_of_masked = np.ma.count_masked(inp_array_2)
        print("Count of masked (unfilled) values:", num_of_masked)
        assert num_of_masked == 0
        inp_array_2 = np.ma.filled(inp_array_2, np.nan)
        print("Array type after unmasking:", type(inp_array_2))
        len_second_half = inp_array_2.shape[0]

        # verify
        print(len_second_half + len_first_half)
        print(inp_array.shape)

        # combine halves
        inp_array = np.concatenate((inp_array_1, inp_array_2), axis=0)
        del inp_array_1
        del inp_array_2

CPU times: user 1e+03 ns, sys: 1e+03 ns, total: 2 µs
Wall time: 5.72 µs


In [39]:
# show 5 samples, first 5 height layers only, displaying all features in the layer (features not indexed)
# (automatic numpy array display reduction is quite large for this array)
inp_array[0:5, 0:5, :]

        Array            Chunk
Bytes   300 B            300 B
Shape   (5, 5, 3)        (5, 5, 3)
Count   27 Graph Layers  1 Chunks
Type    float32          numpy.ndarray


#### Save a selection of wanted arrays (inp_array, tar_array, one_hot_encoded_bases)

Now save the computed arrays.<br>
Only one of the class-label and one-hot outputs is saved, since converting between the two is easy.<br>
One-hot was chosen for the `.npz` output to emulate the data produced and used by the base solution.
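
The conversion between the two encodings is short in both directions. The sketch below uses toy labels and a toy layer count (both made up) to show a round trip.

```python
import numpy as np

num_layers = 4  # toy number of height layers
labels = np.array([1, 3, 0])  # toy class labels, one per sample

# class label -> one-hot: pick rows of the identity matrix
one_hot = np.eye(num_layers, dtype=np.float32)[labels]

# one-hot -> class label: position of the single 1 in each row
recovered = np.argmax(one_hot, axis=1)
print(one_hot.shape, recovered)
```

For the full dataset the one-hot form is far larger than the label form (one value per height layer instead of one per sample), which is a reason to store labels and expand them only when needed.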

In [40]:
# verify input and output shapes
print("Input dim:", inp_array.shape)

Input dim: (111820800, 70, 3)


The following cell creates a positional encoding for the height layers in the data; for example, the data at height layer 0 would carry the positional encoding 0 as part of its input features. It is disabled by default (`GENERATE_POSITIONAL_ENCODING_ARRAYS = False`) because PyTorch DataLoaders were found to be able to produce this information at load time, which seems a better option than storing a potentially huge array that redundantly repeats the same positions for every sample.
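
To illustrate the load-time alternative, here is a minimal sketch in plain numpy of adding the height index to one batch at a time. The function name `add_height_position` and the batch shape are hypothetical; in practice this logic would sit wherever batches are assembled, and it is not the notebook's or PyTorch's own mechanism.

```python
import numpy as np


def add_height_position(batch):
    """Prepend the height layer index as an extra feature to one batch.

    batch has shape (samples, height layers, features).
    """
    num_samples, num_layers, _ = batch.shape
    positions = np.arange(num_layers, dtype=batch.dtype).reshape(1, num_layers, 1)
    # broadcast_to returns a read-only view, so the positions are not copied per sample here
    positions = np.broadcast_to(positions, (num_samples, num_layers, 1))
    return np.concatenate((positions, batch), axis=2)


toy_batch = np.zeros((2, 4, 3), dtype=np.float32)
with_position = add_height_position(toy_batch)
print(with_position.shape)
```

Only the batch-sized result is materialized, so the positional feature never has to be stored for every sample on disk.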

In [41]:
# create an extra positional encoding optionally for input use
if GENERATE_POSITIONAL_ENCODING_ARRAYS:
    sample_num, height_dim, _ = inp_array.shape
    # generate height values
    height_position_vector = np.arange(height_dim)
    # extend dimensions out to match input feats
    height_position_vector = np.repeat([height_position_vector], sample_num, axis=0)

    # verify
    print("shape of encoding vector:", height_position_vector.shape)

    x, y = height_position_vector.shape
    # add a dimension for height to act as a feature
    height_position_vector = height_position_vector.reshape(x, y, 1)

    # fit the dtype of the feature to match the dtype of other feats
    height_position_vector = height_position_vector.astype(inp_array.dtype)

    # combine height feature into input array
    if CONCATENATE_POSITIONAL_ENCONDING_TO_FEATURE_VECTOR:
        inp_array = np.concatenate(
            (height_position_vector, inp_array), axis=2, dtype=np.float32
        )  # leave the concat for within the model after producing embedding

    # verify datatypes
    print("input dtype", inp_array.dtype)
    print("height encoding dtype", height_position_vector.dtype)

In [42]:
%%time
if SAVE_NPZ:
    print("Saving numpy arrays")

    with open(path_to_save_result, "w+b") as f:
        # variable assignment that name the arrays for the saved file
        input_x = inp_array
        output_onehot = one_hot_encoded_bases
        output_cloud_volume = tar_array

        np.savez(
            f,
            input_x=input_x,
            output_cloud_volume=output_cloud_volume,
            output_onehot=output_onehot,
        )

CPU times: user 1e+03 ns, sys: 2 µs, total: 3 µs
Wall time: 5.25 µs


In [43]:
print(type(inp_array))

<class 'dask.array.core.Array'>


## Convert numpy arrays to zarr files

Store the numpy arrays into zarr, which will chunk and compress each array:

In [44]:
# the NetCDF files are stored in chunks of 307200
# we want our zarr to chunk across dimensions reasonably small
sample_chunking = 102400  # hardcoded, 1/3 of a day's data
height_sample = int(tar_array.shape[1])  # want to keep height layers together
feat_num_for_chunks = 3  # keeping features together on input

In [45]:
tar_array

array([[0.15625 , 0.171875, 0.171875, ..., 0.      , 0.      , 0.      ],
       [0.15625 , 0.15625 , 0.171875, ..., 0.      , 0.      , 0.      ],
       [0.15625 , 0.15625 , 0.171875, ..., 0.      , 0.      , 0.      ],
       ...,
       [0.03125 , 0.015625, 0.      , ..., 0.      , 0.      , 0.      ],
       [0.03125 , 0.015625, 0.      , ..., 0.      , 0.      , 0.      ],
       [0.03125 , 0.015625, 0.      , ..., 0.      , 0.      , 0.      ]],
      dtype=float32)

In [46]:
print(class_label_encoded_bases.shape)
class_label_encoded_bases

(111820800,)


array([69, 69, 69, ..., 69, 69, 69])

In [47]:
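# rechunk takes the chunk shape as one tuple and returns a new array, so assign it back
inp_array = inp_array.rechunk((sample_chunking, height_sample, feat_num_for_chunks))
inp_array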

        Array               Chunk
Bytes   87.48 GiB           82.03 MiB
Shape   (111820800, 70, 3)  (102400, 70, 3)
Count   27 Graph Layers     1092 Chunks
Type    float32             numpy.ndarray
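
The sizes reported above follow directly from the array shape, the chunk shape and the 4-byte float32 item size. The short check below reproduces them with plain arithmetic; the shapes are copied from the outputs in this notebook.

```python
import math

shape = (111820800, 70, 3)  # input array shape printed earlier
chunks = (102400, 70, 3)  # sample_chunking, height_sample, feat_num_for_chunks
itemsize = 4  # bytes per float32 value

# number of chunks along each axis, multiplied together
num_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
chunk_mib = math.prod(chunks) * itemsize / 2**20
total_gib = math.prod(shape) * itemsize / 2**30

print(num_chunks, round(chunk_mib, 2), round(total_gib, 2))
```

Because 111820800 is an exact multiple of 102400, every chunk along the sample axis is full and there is no smaller trailing chunk.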


In [48]:
%%time
store = zarr.DirectoryStore(path_to_save_zarr)
# define the group object for the arrays to be stored under

zarr_grouping = zarr.group(store=store, overwrite=True)

# initialize and then write on zarr arrays for all desired arrays to be saved

cloud_volume_fraction_y = zarr_grouping.zeros(
    shape=tar_array.shape, 
    dtype=tar_array.dtype, 
    name="cloud_volume_fraction_y.zarr", 
    chunks=(sample_chunking, height_sample)
)
print("Start cloud volume save")
cloud_volume_fraction_y[:] = tar_array

cloud_base_label_y = zarr_grouping.zeros(
    shape=class_label_encoded_bases.shape,
    dtype=class_label_encoded_bases.dtype,
    name="cloud_base_label_y.zarr",
    chunks=(sample_chunking,),
)
print("Start base label save")
cloud_base_label_y[:] = class_label_encoded_bases

Start cloud volume save
Start base label save
CPU times: user 53.4 s, sys: 1min 12s, total: 2min 6s
Wall time: 37.8 s


In [49]:
import sys
def sizeof_fmt(num, suffix='B'):
    ''' by Fred Cirera,  https://stackoverflow.com/a/1094933/1870254, modified'''
    for unit in ['','Ki','Mi','Gi','Ti','Pi','Ei','Zi']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f %s%s" % (num, 'Yi', suffix)

for name, size in sorted(((name, sys.getsizeof(value)) for name, value in locals().items()),
                         key= lambda x: -x[1])[:10]:
    print("{:>30}: {:>8}".format(name, sizeof_fmt(size)))

                           _30: 58.3 GiB
     class_label_encoded_bases: 853.1 MiB
                           _31: 853.1 MiB
                           _46: 853.1 MiB
                             _: 663.7 MiB
                           _i3:  1.5 KiB
                          _i41:  1.2 KiB
                          _i37:  1.2 KiB
                           _i8:  1.1 KiB
                          _i38:  866.0 B


In [50]:
del _30

In [51]:
from dask.diagnostics import ProgressBar, ResourceProfiler
print("Start input save")
with ProgressBar(), ResourceProfiler(5):
    inp_array.to_zarr(
        path_to_save_zarr, 
        'humidity_temp_pressure_x.zarr', 
        overwrite=True, 
        compute=True, 
        return_stored=False,
    )

Start input save
[########################################] | 100% Completed | 13m 26s


In [52]:
# output some summary for zarr
# view group values
print("Elements of zarr group:")
zarr_grouping.visitvalues(print)
# view group tree
print("\nTree of zarr group:\n", zarr_grouping.tree())
# see chunk size
print("\nShape array example:", cloud_volume_fraction_y.shape)
print("\nZarr chunking shape of an array:", cloud_base_label_y.chunks)

Elements of zarr group:
<zarr.core.Array '/cloud_base_label_y.zarr' (111820800,) int64>
<zarr.core.Array '/cloud_volume_fraction_y.zarr' (111820800, 70) float32>
<zarr.core.Array '/humidity_temp_pressure_x.zarr' (111820800, 70, 3) float32>

Tree of zarr group:
 /
 ├── cloud_base_label_y.zarr (111820800,) int64
 ├── cloud_volume_fraction_y.zarr (111820800, 70) float32
 └── humidity_temp_pressure_x.zarr (111820800, 70, 3) float32

Shape array example: (111820800, 70)

Zarr chunking shape of an array: (102400,)


In [55]:
def load_data_from_zarr(path):
    """Load the saved input, class label and cloud volume arrays as dask arrays."""
    store = zarr.DirectoryStore(path)
    zarr_group = zarr.group(store=store)
    print('Loaded zarr, file information:\n', zarr_group.info, '\n')

    # rechunk returns a new array, so the result has to be assigned back
    x = dask.array.from_zarr(zarr_group['humidity_temp_pressure_x.zarr'])
    x = x.rechunk(zarr_group['humidity_temp_pressure_x.zarr'].chunks)

    y_lab = dask.array.from_zarr(zarr_group['cloud_base_label_y.zarr'])
    y_lab = y_lab.rechunk(zarr_group['cloud_base_label_y.zarr'].chunks)

    y_cont = dask.array.from_zarr(zarr_group['cloud_volume_fraction_y.zarr'])
    y_cont = y_cont.rechunk(zarr_group['cloud_volume_fraction_y.zarr'].chunks)

    return x, y_lab, y_cont

In [56]:
# Now, verify that a load back in of the data preserves desired qualities
x, lab, y = load_data_from_zarr('/scratch/hsouth/cbh_data/analysis_ready/train.zarr')

Loaded zarr, file information:
 Name        : /
Type        : zarr.hierarchy.Group
Read-only   : False
Store type  : zarr.storage.DirectoryStore
No. members : 3
No. arrays  : 3
No. groups  : 0
Arrays      : cloud_base_label_y.zarr, cloud_volume_fraction_y.zarr,
            : humidity_temp_pressure_x.zarr
 



In [57]:
# Do the samples match up across groups?
threshold = 2.0 / 8.0

# same sample number
assert len(x) == len(y) == len(lab)
# preserved order (checked by comparing the labels against the cloud volumes)
one_percent_selection = int(0.01 * len(x))
indices_to_test = np.random.choice(np.arange(len(x)), size=one_percent_selection)
print("First 20 random indices:", indices_to_test[0:20])
for index in indices_to_test:
    vol = y[index].compute()
    base_label_position = lab[index].compute()
    # height layers at which the cloud volume exceeds the threshold
    thresh_overcome = np.where(vol > threshold)[0]
    # the base is the first such layer, or the final layer if there is none
    vol_base = thresh_overcome[0] if len(thresh_overcome) > 0 else len(vol) - 1
    assert vol_base == base_label_position, (
        'base mismatch', vol_base, 'vs', base_label_position, "vol=", vol
    )
print("Pass")

First 20 random indices: [ 83303136  16552347  72419691  56640977  19667221  52040922  96584297
  24842529  56168398   9927204  27437198   5261262  85083366  31712296
  96827855  49859288 105470772  67758162  29065438  87465766]


KeyboardInterrupt: 
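
The loop above triggers one `compute()` per sampled index, which may be why it was interrupted. A possible alternative, sketched here on toy numpy data and not taken from the notebook, is to recompute the base for a whole batch at once and compare it with the stored labels. The helper name `bases_from_volume` and the toy arrays are hypothetical.

```python
import numpy as np


def bases_from_volume(volume, threshold):
    """First layer where volume exceeds threshold, else the final layer."""
    over = volume > threshold
    # argmax over booleans gives the first True in each row (0 if there is none)
    first_over = np.argmax(over, axis=1)
    return np.where(over.any(axis=1), first_over, volume.shape[1] - 1)


# toy cloud volumes: 3 samples x 4 height layers, with matching toy labels
toy_volume = np.array(
    [
        [0.0, 0.5, 0.9, 0.0],
        [0.1, 0.1, 0.1, 0.0],
        [0.7, 0.0, 0.8, 0.0],
    ]
)
toy_stored_labels = np.array([1, 3, 0])

recomputed = bases_from_volume(toy_volume, 2.0 / 8.0)
print(np.array_equal(recomputed, toy_stored_labels))
```

With the dask arrays, the same function could be applied to a contiguous slice loaded in one go, for example `y[start:stop].compute()` and `lab[start:stop].compute()` for some chosen `start` and `stop`.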

In [58]:
cubes = iris.load(str(paths_to_load))
print(cubes) # shorter comes first
inp_cube_humid = cubes[13]
tar_cube = cubes[1]
inp_cube_wrong = cubes[12]

0: cloud_volume_fraction_in_atmosphere_layer / (1) (forecast_period: 4; forecast_reference_time: 31; model_level_number: 70; latitude: 480; longitude: 640)
1: m01s05i250 / (unknown)              (forecast_period: 4; forecast_reference_time: 31; model_level_number: 70; latitude: 480; longitude: 640)
2: cloud_volume_fraction_in_atmosphere_layer / (1) (forecast_period: 4; forecast_reference_time: 60; model_level_number: 70; latitude: 480; longitude: 640)
3: m01s05i250_0 / (unknown)            (forecast_period: 4; forecast_reference_time: 60; model_level_number: 70; latitude: 480; longitude: 640)
4: air_pressure / (Pa)                 (forecast_period: 4; forecast_reference_time: 60; model_level_number: 70; latitude: 480; longitude: 640)
5: air_pressure / (Pa)                 (forecast_period: 4; forecast_reference_time: 31; model_level_number: 70; latitude: 480; longitude: 640)
6: air_temperature / (K)               (forecast_period: 4; forecast_reference_time: 31; model_level_number: 70;

In [59]:
print(inp_cube_wrong[0][0][0][0][0].data)

0.00026512146


In [None]:
for i in range(30):
    print(
        "CUBE INP:", inp_cube_humid[0][0][0][0][i].data,
        "SAVED INP:", x[i, 0, 2].compute(),
        "CUBE OUT:", tar_cube[0][0][0][0][i].data,
        "SAVED OUT:", y[i, 0].compute(),
    )

CUBE INP: 0.00040107965 SAVED INP: 0.00040107965 CUBE OUT: 0.0 SAVED OUT: 0.15625
CUBE INP: 0.00040102005 SAVED INP: 0.00040102005 CUBE OUT: 0.0 SAVED OUT: 0.15625
CUBE INP: 0.00040096045 SAVED INP: 0.00040096045 CUBE OUT: 0.0 SAVED OUT: 0.15625
CUBE INP: 0.00040096045 SAVED INP: 0.00040096045 CUBE OUT: 0.0 SAVED OUT: 0.15625
CUBE INP: 0.00040090084 SAVED INP: 0.00040090084 CUBE OUT: 0.0 SAVED OUT: 0.140625
CUBE INP: 0.00040090084 SAVED INP: 0.00040090084 CUBE OUT: 0.0 SAVED OUT: 0.140625
CUBE INP: 0.00040084124 SAVED INP: 0.00040084124 CUBE OUT: 0.0 SAVED OUT: 0.140625
CUBE INP: 0.00040078163 SAVED INP: 0.00040078163 CUBE OUT: 0.0 SAVED OUT: 0.140625
CUBE INP: 0.00040072203 SAVED INP: 0.00040072203 CUBE OUT: 0.0 SAVED OUT: 0.140625
CUBE INP: 0.00040072203 SAVED INP: 0.00040072203 CUBE OUT: 0.0 SAVED OUT: 0.125
CUBE INP: 0.00040066242 SAVED INP: 0.00040066242 CUBE OUT: 0.0 SAVED OUT: 0.125
CUBE INP: 0.00040060282 SAVED INP: 0.00040060282 CUBE OUT: 0.0 SAVED OUT: 0.125
CUBE INP: 0.00040