## Preprocessing NARVAL

**This data was not used to train the NNs**

Converting the data to npy lets us work with it efficiently; in its original form it requires 500GB of RAM, which is difficult to guarantee. For the same reason, we preprocess QUBICC in a separate ipynb notebook.
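
As a rough illustration of why float32 npy files help, the sketch below saves a small synthetic array and reopens it memory-mapped, so that only the slices actually accessed are read from disk. The array sizes, the temporary file name, and the use of `mmap_mode` are assumptions for illustration; the notebook itself only calls `np.save`.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for a preprocessed array (samples x features); sizes are made up
rng = np.random.default_rng(0)
data64 = rng.random((1000, 8))          # float64 by default
data32 = data64.astype(np.float32)      # half the memory footprint

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'example_input.npy')  # hypothetical file name
np.save(path, data32)

# mmap_mode='r' maps the file instead of loading it fully into RAM
loaded = np.load(path, mmap_mode='r')
first_rows = np.asarray(loaded[:10])    # only this slice is materialized

print(data64.nbytes, data32.nbytes)
print(loaded.dtype, loaded.shape)
```

Whether memory mapping is worthwhile depends on the access pattern; for training that shuffles over all samples, the full array may still need to fit in memory.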

1) Read the data
2) Reshape variables so that they have equal dimensionality
3) Remove data above 21 km
4) Reshape into data samples fit for the NN
5) Split into input and output
6) Save as npy in float32

Note: We neither scale the data nor split it into training/validation/test sets. <br>
The reasons are: i) Scaling requires the entire dataset, so it can only be done in conjunction with the QUBICC dataset. In addition, cross-validation requires different scalings based on different subsets of the data. ii) The split into subsets is done by the cross-validation procedure, or not at all when training the final model.
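
To illustrate why scaling is deferred, here is a minimal sketch of standardizing inside a cross-validation loop, where the mean and standard deviation are recomputed from each training subset. The synthetic array, the three contiguous folds, and the helper name `standardize` are assumptions for illustration, not the setup used for the actual models.

```python
import numpy as np

def standardize(train, other):
    """Scale both arrays with the mean and std of the training subset only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (train - mean) / std, (other - mean) / std

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(90, 4)).astype(np.float32)

n_folds = 3
folds = np.array_split(np.arange(X.shape[0]), n_folds)

fold_means = []
for k in range(n_folds):
    valid_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
    X_train, X_valid = standardize(X[train_idx], X[valid_idx])
    fold_means.append(X[train_idx].mean(axis=0))
    # The training part is centered; the validation part generally is not
    print(k, np.abs(X_train.mean(axis=0)).max())
```

Each fold ends up with its own scaling parameters, which is why a single scaling fixed at preprocessing time would not fit the cross-validation procedure.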

In [1]:
import sys
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import time
import gc
# import importlib
# importlib.reload(my_classes)

base_path = '/pf/b/b309170'
output_path = base_path + '/my_work/icon-ml_data/cloud_cover_parameterization/grid_column_based_QUBICC_R02B05/based_on_var_interpolated_data'

# Add path with my_classes to sys.path
sys.path.insert(0, base_path + '/workspace_icon-ml/cloud_cover_parameterization/')

# Which days to load
days_narval = 'all'

from my_classes import load_data

VERT_LAYERS = 31

# Set output_var to one of {'clc', 'cl_area'}
output_var = 'cl_area'

## 1) Reading the data
### Input:
- fr_land: Fraction of land
- zg: Geometric height at full levels (3D)
- qv: Specific water vapor content (3D)
- qc: Specific cloud water content (3D)
- qi: Specific cloud ice content (3D)
- temp: Temperature (3D)
- pres: Pressure (3D)

$186$ ($= 1 + 24\,[zf] + 26\,[q_c] + 27 \cdot 5$) input nodes

### Output:
- clc: Cloud Cover

$27$ output nodes

Data above 21 km is removed.
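
The layer bookkeeping behind the 27 output nodes can be checked with a few lines. The sketch below assumes, as in the reshaping cell of this notebook, that the 31 loaded layers are indexed 0 to 30, that the first four indices are the ones above 21 km, and that index `i` is stored under layer label `i + 17`. The names `FIRST_KEPT` and `OFFSET` are introduced here for illustration.

```python
VERT_LAYERS = 31   # number of loaded vertical layers
FIRST_KEPT = 4     # indices 0..3 lie above 21 km and are dropped
OFFSET = 17        # index i is stored under layer label i + 17

kept_indices = list(range(FIRST_KEPT, VERT_LAYERS))
layer_labels = [i + OFFSET for i in kept_indices]

print(len(kept_indices))                   # number of retained layers per variable
print(layer_labels[0], layer_labels[-1])   # first and last layer label
```

With these values, 27 layers are kept per variable, labeled 21 to 47.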

In [2]:
# For cl_area I only need the output as I already have the input
# I still need 'qc', 'qi', 'clc' for condensate-free clouds
# If I were to use 'cl_area' for condensate-free clouds I would get an estimate 
# which is slightly different due to coarse-graining

order_of_vars_narval = ['qv', 'qc', 'qi', 'temp', 'pres', 'zg', 'fr_land', output_var]

In [None]:
# Load NARVAL data
data_dict = load_data(source='narval', days=days_narval, resolution='R02B05', 
                             order_of_vars=order_of_vars_narval)

In [4]:
if output_var == 'clc':
    # Are there any bad data points
    ta_is_0 = np.where(data_dict['temp'] == 0)
    for i in range(3):
        assert ta_is_0[i].size == 0

    del ta_is_0
    gc.collect()

Counting the fraction of condensate-free clouds <br>
I leave them in the training data for the column-based and region-based models, because removing them would mean discarding a lot of data around the affected grid cell. They can be removed for the grid-cell-based models.

In [5]:
# clouds = 0
# count_cond_free_clouds = 0
# for i in range(data_dict['clc'].shape[0]):
#     for j in range(data_dict['clc'].shape[1]):
#         for k in range(data_dict['clc'].shape[2]):
#             if (data_dict['clc'][i,j,k] > 0 and data_dict['qc'][i,j,k] + data_dict['qi'][i,j,k] == 0):
#                 count_cond_free_clouds += 1
#             if (data_dict['clc'][i,j,k] > 0):
#                 clouds += 1

In [6]:
# count_cond_free_clouds/clouds
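
The commented-out triple loop above can be expressed with boolean masks, which avoids Python-level iteration over every grid cell. This is a sketch on small synthetic arrays; the helper name `condensate_free_fraction` and the toy data are assumptions, and in the notebook the arguments would be `data_dict['clc']`, `data_dict['qc']` and `data_dict['qi']`.

```python
import numpy as np

def condensate_free_fraction(clc, qc, qi):
    """Fraction of cloudy cells (clc > 0) that contain no condensate (qc + qi == 0)."""
    cloudy = clc > 0
    cond_free = cloudy & ((qc + qi) == 0)
    n_cloudy = np.count_nonzero(cloudy)
    if n_cloudy == 0:
        return float('nan')  # no clouds at all, so the fraction is undefined
    return np.count_nonzero(cond_free) / n_cloudy

# Toy arrays with shape (time, vertical, horizontal)
clc = np.array([[[0.0, 0.5], [1.0, 0.2]]])
qc = np.array([[[0.0, 1e-5], [0.0, 0.0]]])
qi = np.array([[[0.0, 0.0], [2e-6, 0.0]]])

fraction = condensate_free_fraction(clc, qc, qi)
print(fraction)
```

In the toy data, three cells are cloudy and one of them has neither cloud water nor cloud ice, so the fraction is one third.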

In [None]:
for key in data_dict.keys():
    print(key, data_dict[key].shape)

In [8]:
# Use the selected output variable, which is loaded for both 'clc' and 'cl_area'
(TIME_STEPS, VERT_LAYERS, HORIZ_FIELDS) = data_dict[output_var].shape

In [9]:
try:
    # Repeat the time-invariant fields along a new time axis so that all arrays
    # have equal shapes (the vertical axis is not reshaped)
    data_dict['zg'] = np.repeat(np.expand_dims(data_dict['zg'], 0), TIME_STEPS, axis=0)
    data_dict['fr_land'] = np.repeat(np.expand_dims(data_dict['fr_land'], 0), TIME_STEPS, axis=0)
except Exception:
    # Skip this step if the fields cannot be expanded
    pass
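
As a small illustration of this step, the sketch below gives a time-invariant toy field a leading time axis. It also shows `np.broadcast_to`, which returns a read-only view without copying; that alternative is a suggestion and not what the notebook uses.

```python
import numpy as np

TIME_STEPS = 3
zg = np.arange(8, dtype=np.float32).reshape(4, 2)   # toy (vertical, horizontal) field

# Approach used in the notebook: materialize TIME_STEPS copies
zg_repeated = np.repeat(np.expand_dims(zg, 0), TIME_STEPS, axis=0)

# Alternative: a read-only view that shares memory with zg
zg_view = np.broadcast_to(zg, (TIME_STEPS,) + zg.shape)

print(zg_repeated.shape, zg_view.shape)
print(np.array_equal(zg_repeated, zg_view))
```

Flattening such a view later still creates a full copy, so any memory saving only lasts until the reshape.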

In [10]:
# One sample should contain a column of information
data_dict_reshaped = {}

for key in data_dict.keys():
    if data_dict[key].shape[1] == VERT_LAYERS:  
        # Remove data above 21 km (the first four layers)
        for i in range(4, VERT_LAYERS):
            new_key = '{}_{:d}'.format(key, i + 17)  # Layer labels start at 21
            data_dict_reshaped[new_key] = np.reshape(data_dict[key][:,i,:], -1)
    else:
        data_dict_reshaped[key] = np.reshape(data_dict[key], -1)
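
To make the resulting layout concrete, the sketch below applies the same per-layer flattening to a tiny synthetic variable. The sizes and the variable name `var` are made up; only the reshaping logic mirrors the cell above.

```python
import numpy as np

TIME_STEPS, VERT_LAYERS, HORIZ_FIELDS = 2, 6, 3
FIRST_KEPT, OFFSET = 4, 17   # same constants as in the cell above

# Encode (time, layer, column) in the value so the layout is easy to read
t, v, h = np.meshgrid(np.arange(TIME_STEPS), np.arange(VERT_LAYERS),
                      np.arange(HORIZ_FIELDS), indexing='ij')
toy = {'var': 100 * t + 10 * v + h}

reshaped = {}
for key, arr in toy.items():
    for i in range(FIRST_KEPT, VERT_LAYERS):
        reshaped[f'{key}_{i + OFFSET}'] = np.reshape(arr[:, i, :], -1)

for name, column in reshaped.items():
    print(name, column)
```

Each entry of a flattened column corresponds to one (time step, horizontal cell) pair, at position `t * HORIZ_FIELDS + h`, so one row of the later DataFrame holds one vertical column of the atmosphere.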

In [11]:
#Converting dict into a DataFrame-object 
df = pd.DataFrame.from_dict(data_dict_reshaped)
df.head()

Unnamed: 0,qc_21,qc_22,qc_23,qc_24,qc_25,qc_26,qc_27,qc_28,qc_29,qc_30,...,cl_area_38,cl_area_39,cl_area_40,cl_area_41,cl_area_42,cl_area_43,cl_area_44,cl_area_45,cl_area_46,cl_area_47
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.003891,0.003891,0.160102,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.832987,0.273194,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,6.467355e-08,1.200798e-08,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,8.338172e-08,1.548238e-08,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
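# Note: this also modifies the dataset passed in (the output columns are removed from it)
def split_input_output(dataset):
    output_df = pd.DataFrame()
    for i in range(21, 48):
        # Use the selected output variable so that both 'clc' and 'cl_area' work
        col = '%s_%d' % (output_var, i)  # Layer labels start at 21
        output_df[col] = dataset[col]
        del dataset[col]
    return output_df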
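
A variant that leaves the original DataFrame untouched can be sketched as follows. The function name `split_columns` and its `prefix` parameter are hypothetical, and the toy frame only stands in for `df`.

```python
import pandas as pd

def split_columns(dataset, prefix, first=21, last=47):
    """Return (input_df, output_df) without modifying `dataset`."""
    output_cols = [f'{prefix}_{i}' for i in range(first, last + 1)]
    output_df = dataset[output_cols].copy()
    input_df = dataset.drop(columns=output_cols)
    return input_df, output_df

# Toy frame with two layers per variable
toy_df = pd.DataFrame({
    'qi_21': [0.0, 1.0], 'qi_22': [0.0, 2.0],
    'cl_area_21': [0.0, 0.5], 'cl_area_22': [0.1, 1.0],
})
input_df, output_df = split_columns(toy_df, 'cl_area', first=21, last=22)

print(list(input_df.columns))
print(list(output_df.columns))
```

The trade-off is memory: this variant temporarily holds copies of the data, whereas the in-place version in the notebook avoids that, which matters for a dataset of this size.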

In [13]:
output_df = split_input_output(df)

In [14]:
# Save the data
if output_var == 'clc':
    np.save(output_path + '/cloud_cover_input_narval.npy', np.float32(df))
    np.save(output_path + '/cloud_cover_output_narval.npy', np.float32(output_df))
elif output_var == 'cl_area':
    np.save(output_path + '/cloud_area_output_narval.npy', np.float32(output_df))

In [16]:
# Test: the previously saved input should contain the columns of df
if output_var == 'cl_area':
    old_input = np.load(output_path + '/cloud_cover_input_narval.npy')
    # If at least one column of the old input equals qi_25, then we're done
    matches = [i for i in range(old_input.shape[1])
               if np.all(old_input[:, i] == df['qi_25'])]
    print(matches)

[58]
