## Preprocessing NARVAL

**This data was not used to train the NNs**

Converting the data into npy lets us work with it efficiently. Working with the original data requires 500GB of RAM, which is difficult to guarantee. For the same reason, we preprocess QUBICC in a separate ipynb notebook.
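
As an illustration of why npy files are convenient when RAM is limited, the sketch below memory-maps a small array instead of loading it fully. It is not necessarily how this project reads its files; `demo_path` and the toy array are hypothetical stand-ins.

```python
import os
import tempfile
import numpy as np

# Hypothetical stand-in for one of the saved .npy files (tiny on purpose)
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo_input.npy')
np.save(demo_path, np.arange(12, dtype=np.float32).reshape(4, 3))

# mmap_mode='r' maps the file read-only instead of loading it into RAM;
# only the rows that are actually indexed are read from disk
demo_input = np.load(demo_path, mmap_mode='r')
first_rows = np.array(demo_input[:2])  # copy just two rows into memory
```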

1) Read the data
2) Reshape the variables so that they have equal dimensionality
3) Reshape into data samples fit for the NN and convert into a DataFrame
4) Downsample the data: remove data above 21 km, remove condensate-free clouds, combat class imbalance
5) Split into input and output
6) Save as npy

Note: We neither scale the data nor split it into training/validation/test sets. <br>
The reasons are: i) Scaling requires the entire dataset, so it can only be done in conjunction with the QUBICC dataset. In addition, cross-validation requires different scalings based on different subsets of the data. ii) The split into subsets is done by the cross-validation procedure, or not at all when training the final model.
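
To illustrate why scaling is deferred, here is a minimal sketch of per-fold scaling in which the statistics come from the training subset only. Standardization (zero mean, unit variance) is an assumption, since the note does not say which scaling is used; `standardize_per_fold` and the toy matrix `X` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy input matrix: 8 samples, 3 features
X = rng.normal(loc=5.0, scale=2.0, size=(8, 3))

def standardize_per_fold(X, n_folds):
    """Yield (train_scaled, valid_scaled), scaled with training-subset statistics only."""
    folds = np.array_split(np.arange(len(X)), n_folds)
    for valid_idx in folds:
        train_mask = np.ones(len(X), dtype=bool)
        train_mask[valid_idx] = False
        # Mean and standard deviation differ from fold to fold
        mean = X[train_mask].mean(axis=0)
        std = X[train_mask].std(axis=0)
        yield (X[train_mask] - mean) / std, (X[valid_idx] - mean) / std

scaled_folds = list(standardize_per_fold(X, n_folds=4))
```

Each fold gets its own mean and standard deviation, which is why a single scaling cannot be fixed at preprocessing time.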

In [1]:
import sys
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import time
# import importlib
# importlib.reload(my_classes)

base_path = '/pf/b/b309170'
output_path = base_path + '/my_work/icon-ml_data/cloud_cover_parameterization/grid_cell_based_QUBICC_R02B05/based_on_var_interpolated_data'

# Add path with my_classes to sys.path
sys.path.insert(0, base_path + '/workspace_icon-ml/cloud_cover_parameterization/')

# Which days to load
days_narval = 'all'

from my_classes import load_data

VERT_LAYERS = 31

#Set a numpy seed for the permutation later on!
np.random.seed(10)

# Set output_var to one of {'clc', 'cl_area'}
output_var = 'cl_area'

## 1) Reading the data
### Input:
- fr_land: Fraction of land
- coriolis: Coriolis parameter
- zg: Geometric height at full levels (3D)
- qv: Specific water vapor content (3D)
- qc: Specific cloud water content (3D)
- qi: Specific cloud ice content (3D)
- temp: Temperature (3D)
- pres: Pressure (3D)
- u: Zonal wind (3D)
- v: Meridional wind (3D)

$10$ input nodes

### Output:
- clc: Cloud Cover

$1$ output node

Data above 21 km is discarded.

In [2]:
# For cl_area I only need the output as I already have the input
# I still need 'qc', 'qi', 'clc' for condensate-free clouds
# If I were to use 'cl_area' for condensate-free clouds I would get an estimate 
# which is slightly different due to coarse-graining

order_of_vars_narval = ['qv', 'qc', 'qi', 'temp', 'pres', 'u', 'v', 'zg', 'coriolis', 'fr_land', output_var]

In [None]:
# Load NARVAL data
data_dict = load_data(source='narval', days=days_narval, resolution='R02B05',
                      order_of_vars=order_of_vars_narval)

In [None]:
for key in data_dict.keys():
    print(key, data_dict[key].shape)

In [5]:
(TIME_STEPS, VERT_LAYERS, HORIZ_FIELDS) = data_dict[output_var].shape

In [9]:
# Reshape into nd-arrays of equal shape (don't reshape in the vertical)
data_dict['zg'] = np.repeat(np.expand_dims(data_dict['zg'], 0), TIME_STEPS, axis=0)
try: 
    data_dict['coriolis'] = np.repeat(np.expand_dims(data_dict['coriolis'], 0), TIME_STEPS, axis=0)
    data_dict['coriolis'] = np.repeat(np.expand_dims(data_dict['coriolis'], 1), VERT_LAYERS, axis=1)
    data_dict['fr_land'] = np.repeat(np.expand_dims(data_dict['fr_land'], 0), TIME_STEPS, axis=0)
    data_dict['fr_land'] = np.repeat(np.expand_dims(data_dict['fr_land'], 1), VERT_LAYERS, axis=1)
except KeyError:
    # Skip if 'coriolis' or 'fr_land' is not part of data_dict
    pass
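
The repeat/expand_dims pattern used here can be checked on toy arrays. The sketch below assumes, as the reshaping code implies, that `zg` has shape (layers, cells) and `coriolis` has shape (cells,). The toy sizes and the `*_toy` names are hypothetical. It compares the result with `np.broadcast_to`, which produces the same values as a read-only view.

```python
import numpy as np

TIME_STEPS, VERT_LAYERS, HORIZ_FIELDS = 2, 3, 4  # toy sizes
target_shape = (TIME_STEPS, VERT_LAYERS, HORIZ_FIELDS)

zg_toy = np.arange(VERT_LAYERS * HORIZ_FIELDS, dtype=float).reshape(VERT_LAYERS, HORIZ_FIELDS)
coriolis_toy = np.linspace(-1.0, 1.0, HORIZ_FIELDS)

# Same pattern as in the cell above: add the missing axes, then repeat along them
zg_rep = np.repeat(np.expand_dims(zg_toy, 0), TIME_STEPS, axis=0)
cor_rep = np.repeat(np.expand_dims(coriolis_toy, 0), TIME_STEPS, axis=0)
cor_rep = np.repeat(np.expand_dims(cor_rep, 1), VERT_LAYERS, axis=1)

# Broadcasting yields the same values without copying
zg_bc = np.broadcast_to(zg_toy, target_shape)
cor_bc = np.broadcast_to(coriolis_toy, target_shape)
```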

In [10]:
# Carry along information about the vertical layer of a grid cell. int16 is sufficient for < 1000.
vert_layers = np.int16(np.repeat(np.expand_dims(np.arange(1, VERT_LAYERS+1), 0), TIME_STEPS, axis=0))
vert_layers = np.repeat(np.expand_dims(vert_layers, 2), HORIZ_FIELDS, axis=2)
vert_layers.shape

(1721, 31, 4450)

In [11]:
# Reshape into 1D-arrays and convert the dict into a DataFrame object (the following is based on Aurelien Geron)
for key in order_of_vars_narval:
    data_dict[key] = np.reshape(data_dict[key], -1) 

# Flatten vert_layers once (it does not depend on the loop variable)
vert_layers = np.reshape(vert_layers, -1)

df = pd.DataFrame.from_dict(data_dict)

# Number of samples/rows
len(df)

237411950
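
Because the arrays are flattened in NumPy's default row-major order, a row label of the DataFrame can be mapped back to its (time step, layer, cell) position. The following sketch shows this on toy sizes; `flat_idx` and the toy dimensions are hypothetical.

```python
import numpy as np

TIME_STEPS, VERT_LAYERS, HORIZ_FIELDS = 2, 3, 4  # toy sizes
shape = (TIME_STEPS, VERT_LAYERS, HORIZ_FIELDS)

# Same construction as vert_layers in the notebook, then flattened
layers = np.repeat(np.expand_dims(np.arange(1, VERT_LAYERS + 1), 0), TIME_STEPS, axis=0)
layers = np.repeat(np.expand_dims(layers, 2), HORIZ_FIELDS, axis=2).reshape(-1)

# Map flat row indices back to (time step, layer, cell) positions
flat_idx = np.array([0, 5, 23])
t_idx, layer_idx, cell_idx = np.unravel_index(flat_idx, shape)
```

The 1-based layer number stored in `layers` equals `layer_idx + 1` for every row.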

**Downsampling the data (majority class: clc = 0)**

In [12]:
# Remove data above 21 km
df = df.loc[df['zg'] < 21000]

In [13]:
# Check that there are no NaNs left
assert not df.isna().any().any()

In [14]:
# Some quick sanity checks regarding the input data
if output_var == 'clc':
    assert np.all(df['temp'] > 150) and np.all(df['pres'] > 150)

In [15]:
# Remove condensate-free clouds (7.3% of clouds)
# Here we have to use 'clc' to keep the size of the output consistent!
df = df.loc[~((df['clc'] > 0) & (df['qc'] == 0) & (df['qi'] == 0))]
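
The filter can be checked on a toy DataFrame. The values below are made up for illustration and only the second row is a condensate-free cloud.

```python
import pandas as pd

# Hypothetical toy samples: cloud cover plus cloud water and cloud ice content
toy = pd.DataFrame({
    'clc': [0.0, 40.0, 10.0, 5.0],
    'qc':  [0.0, 0.0, 1e-5, 0.0],
    'qi':  [0.0, 0.0, 0.0, 2e-6],
})

# A condensate-free cloud has cloud cover but neither cloud water nor cloud ice
condensate_free = (toy['clc'] > 0) & (toy['qc'] == 0) & (toy['qi'] == 0)
filtered = toy.loc[~condensate_free]
```

Cloud-free rows (clc = 0) pass the filter unchanged, since the mask only targets rows with clc > 0.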

In [16]:
# We downsample clc = 0 so that it is as large as clc != 0 (roughly 63 Mio samples each) and keep the original order intact
df_noclc = df.loc[df['clc']==0]
print(len(df_noclc))

# len(downsample_indices) will be the number of noclc samples that remain,
# i.e. the number of samples with clc != 0 (computed with integers to avoid float truncation)
size_noclc = len(df) - len(df_noclc)
shuffled_indices = np.random.permutation(df_noclc.index)
downsample_indices = shuffled_indices[:size_noclc] 

# Concatenate df.loc[df['clc']!=0].index and downsample_indices
final_indices = np.concatenate((downsample_indices, df.loc[df['clc']!=0].index))

# Sort final_indices so that we can more or less recover the timesteps
final_indices = np.sort(final_indices)

# Label-based (loc) not positional-based
df = df.loc[final_indices]

138399774
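
The balancing step can be sketched as a small function on toy data. `balance_by_downsampling` is a hypothetical helper, and it uses `np.random.default_rng` rather than the global seed set in the notebook, so it illustrates the idea rather than reproducing the exact sample selection.

```python
import numpy as np
import pandas as pd

def balance_by_downsampling(df, column, rng):
    """Downsample rows with df[column] == 0 to the number of remaining rows."""
    zero_idx = df.index[df[column] == 0].to_numpy()
    nonzero_idx = df.index[df[column] != 0].to_numpy()
    n_keep = min(len(zero_idx), len(nonzero_idx))
    kept_zero = rng.permutation(zero_idx)[:n_keep]
    # Sorting the labels restores the original row order
    return df.loc[np.sort(np.concatenate((kept_zero, nonzero_idx)))]

toy = pd.DataFrame({'clc': [0, 0, 0, 0, 0, 0, 30, 0, 100, 5]})
balanced = balance_by_downsampling(toy, 'clc', np.random.default_rng(10))
```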


In [17]:
# Number of samples after downsampling
len(df)

126853676

In [18]:
# Modifies df as well
def split_input_output(dataset):
    # pop removes the output column from dataset in place and returns it
    return dataset.pop(output_var)

In [19]:
output_df = split_input_output(df)

In [20]:
# Save the data
if output_var == 'clc':
    np.save(output_path + '/cloud_cover_input_narval.npy', np.float32(df))
    np.save(output_path + '/cloud_cover_output_narval.npy', np.float32(output_df))
elif output_var == 'cl_area':
    np.save(output_path + '/cloud_area_output_narval.npy', np.float32(output_df))

# Save the corresponding vertical layers (int16 is sufficient for layers < 1000)
if output_var == 'clc':
    np.save(output_path + '/samples_vertical_layers_narval.npy', vert_layers[df.index])

Test whether qi from the saved data coincides with the qi here

In [24]:
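# If this yields True then we're done
# Assumption: old_input is the input array saved earlier (file name taken from the saving cell above)
old_input = np.load(output_path + '/cloud_cover_input_narval.npy')
np.all(old_input[:,2] == df['qi'])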

True
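
A caveat on this kind of check: the arrays are saved as float32, so an exact comparison against a higher-precision column can fail even when the data agree up to rounding. Whether that applies here depends on the dtype returned by `load_data`, which is not shown. The sketch below uses made-up values to show the effect and a comparison after casting.

```python
import numpy as np

# Hypothetical float64 values and their float32 copy, as np.float32(...) would produce
x64 = np.array([0.1, 0.25, 1e-7])
x32 = np.float32(x64)

# Direct comparison promotes x32 to float64, so rounding differences show up
exact = bool(np.all(x32 == x64))

# Comparing after casting both sides to float32 ignores the rounding
cast_match = np.array_equal(x32, x64.astype(np.float32))
```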