tgb - 12/20/2021 - The goal is to derive the climate-invariant dataset directly with the custom generator, which avoids inconsistencies in how the relative humidity, plume buoyancy, and scaled latent heat flux rescalings are formulated. This dataset can then be used for the causal discovery project led by Nando Iglesias.

# Imports

In [1]:
import os
# Select the GPU before TensorFlow is imported so that the setting takes effect
os.environ["CUDA_VISIBLE_DEVICES"]="2"

from cbrain.climate_invariant import *
from cbrain.climate_invariant_utils import *

import tensorflow as tf
# Enable memory growth on every GPU that is visible to TensorFlow
physical_devices = tf.config.experimental.list_physical_devices('GPU')
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, True)

/nfspool-0/home/tbeucler/CBRAIN-CAM/notebooks/tbeucler_devlog


In [2]:
path_data = '/DFS-L/DATA/pritchard/tbeucler/SPCAM/SPCAM_PHYS/'

# Define data generators

## Below is how we would build a standard or "brute-force" data generator `train_gen_BF`.

1. We would first specify the input variables `in_vars`, the output variables `out_vars`, and the path of the training set `path_train`. 

In [10]:
#in_vars = ['QBP','TBP','PS','SOLIN','SHFLX','LHFLX'] # We take the large-scale climate state as inputs
in_vars = ['QBP','TBP','PS','SOLIN','SHFLX','LHFLX']
#out_vars = ['PHQ','TPHYSTND','FSNT','FSNS','FLNT','FLNS', 'PRECT'] # and we output the response of clouds/storms to these climate conditions
out_vars = ['TPHYSTND500']
#path_train = path_data + 'Aqua_0K_withVBP/2021_09_02_TRAIN_For_Nando.nc'
#path_train = path_data + '2022_01_10_TRAIN_For_Nando_t-dt.nc'
path_train = path_data + '2022_01_19_TRAIN_M4K_TPHYSTND500.nc'
path_valid = path_data + '2022_01_19_VALID_M4K_TPHYSTND500.nc'
path_norm = path_data + '2022_01_20_NORM_TPHYSTND500.nc'

2. To make sure all outputs have the same units (in our case W/m$^2$), we multiply the raw outputs by the right physical constants, stored in a dictionary called `scale_dict`.

In [4]:
import pickle
scale_dict = pickle.load(open(path_data+'CIML_Zenodo/009_Wm2_scaling.pkl','rb'))
scale_dict['TPHYSTND500'] = scale_dict['TPHYSTND'][18]
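
As an illustration of why such a scaling yields W/m$^2$, a heating rate in K/s can be multiplied by the heat capacity of air and by the mass per unit area of the layer (pressure thickness divided by gravity). The sketch below is only a dimensional example: the constants are approximate, the layer thicknesses are hypothetical, and the actual factors stored in `009_Wm2_scaling.pkl` are not reproduced here.

```python
import numpy as np

C_P = 1004.0  # approximate specific heat of dry air at constant pressure [J/kg/K]
G = 9.81      # gravitational acceleration [m/s^2]

def heating_to_wm2(dT_dt, dp):
    """Convert heating rates [K/s] in layers of pressure thickness dp [Pa] to W/m^2."""
    return C_P * dT_dt * dp / G

dT_dt = np.array([1.0e-5, 2.0e-5])  # hypothetical heating rates [K/s]
dp = np.array([5000.0, 4000.0])     # hypothetical layer thicknesses [Pa]
flux = heating_to_wm2(dT_dt, dp)    # one value per layer, in W/m^2
```

Picking a single level, as done for `TPHYSTND500` above, then amounts to keeping one entry of the per-level factor.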

3. We scale the inputs to [-1,1] by subtracting their mean and then dividing by their range (maximum minus minimum). The means and ranges are read from the normalization file at `path_input_norm`.

In [5]:
path_input_norm = path_data + '2022_01_10_Norm_Outputs_t-dt.nc'
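
A minimal NumPy sketch of this normalization and its inverse, assuming the statistics are computed per column. The function names and the synthetic data are hypothetical; the generator's own `('mean', 'maxrs')` transform reads its statistics from the normalization file instead.

```python
import numpy as np

def normalize(x, mean, x_min, x_max):
    """Subtract the mean, then divide by the range (max minus min)."""
    return (x - mean) / (x_max - x_min)

def denormalize(x_norm, mean, x_min, x_max):
    """Invert normalize(): multiply by the range, then add the mean back."""
    return x_norm * (x_max - x_min) + mean

rng = np.random.default_rng(0)
x = rng.normal(loc=280.0, scale=15.0, size=(1000, 3))  # synthetic inputs
mean, x_min, x_max = x.mean(axis=0), x.min(axis=0), x.max(axis=0)

x_norm = normalize(x, mean, x_min, x_max)
x_back = denormalize(x_norm, mean, x_min, x_max)
```

Because the mean lies between the minimum and the maximum, every normalized value falls inside [-1,1], and the inverse recovers the original values.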

4. We are now ready to build our first data generator!

In [6]:
N_batch = 8192

In [13]:
train_gen_BF = DataGeneratorCI(
    data_fn = path_train,
    input_vars = in_vars,
    output_vars = out_vars,
    norm_fn = path_norm,
    input_transform = ('mean', 'maxrs'),
    output_transform = scale_dict,
    shuffle = False,
    batch_size=N_batch
)

In [14]:
train_gen_BF[50][0].shape

(8192, 64)

In [15]:
train_gen_BF[50][1].shape

(8192, 1)
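
The shapes above follow from the variable layout: 30 levels of `QBP`, 30 levels of `TBP`, and 4 scalars give 64 inputs, and `TPHYSTND500` is the single output. The toy class below sketches how indexing a generator can return one `(inputs, outputs)` batch; it is a hypothetical stand-in and not the `DataGeneratorCI` implementation, which also applies the input and output transforms.

```python
import numpy as np

class ToyBatchGenerator:
    """Hypothetical stand-in: slices rows into batches and splits the columns."""

    def __init__(self, data, n_inputs, batch_size):
        self.data = data
        self.n_inputs = n_inputs
        self.batch_size = batch_size

    def __len__(self):
        # Number of complete batches
        return self.data.shape[0] // self.batch_size

    def __getitem__(self, index):
        start = index * self.batch_size
        batch = self.data[start:start + self.batch_size]
        return batch[:, :self.n_inputs], batch[:, self.n_inputs:]

data = np.arange(40 * 65, dtype=float).reshape(40, 65)  # 40 samples, 64 inputs + 1 output
gen = ToyBatchGenerator(data, n_inputs=64, batch_size=8)
X, Y = gen[2]  # third batch: rows 16 to 23
```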

## Now, we would like to build a "climate-invariant" data generator `train_gen_CI`, which requires a few more steps

### First, we have to create one standard generator per input rescaling. This will help us renormalize the inputs to [-1,1] every time we feed them to the neural network. 

1. First, let's define the path to the three normalization files for the three input rescalings:
Relative humidity `RH`, plume buoyancy `BMSE`, and normalized latent heat flux `LHF_nsDELQ` 

In [16]:
path_norm_RH = path_data + '2021_02_01_NORM_O3_RH_small.nc'
path_norm_BMSE = path_data + '2021_06_16_NORM_BMSE_small.nc'
path_norm_LHF_nsDELQ = path_data + '2021_02_01_NORM_O3_LHF_nsDELQ_small.nc'
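
To give a sense of what the first rescaling does, the sketch below converts specific humidity to relative humidity with the Bolton (1980) approximation for saturation vapor pressure over liquid water. This is an assumption made for illustration: the formulation used inside `cbrain` may differ, for example in how it treats ice.

```python
import numpy as np

EPS = 0.622  # approximate ratio of the gas constants of dry air and water vapor

def saturation_vapor_pressure(T):
    """Bolton (1980) approximation over liquid water; T in K, result in Pa."""
    T_c = T - 273.15
    return 611.2 * np.exp(17.67 * T_c / (T_c + 243.5))

def relative_humidity(q, T, p):
    """Relative humidity from specific humidity q [kg/kg], T [K], and p [Pa]."""
    e = q * p / (EPS + (1.0 - EPS) * q)  # vapor pressure implied by q
    return e / saturation_vapor_pressure(T)

T, p = 290.0, 90000.0
e_sat = saturation_vapor_pressure(T)
q_sat = EPS * e_sat / (p - (1.0 - EPS) * e_sat)  # specific humidity at saturation
rh_half = relative_humidity(0.5 * q_sat, T, p)   # a sub-saturated sample
```

Unlike specific humidity, which grows quickly with temperature, this quantity stays of order one across climates, which is the motivation for the rescaling.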

2. We can now define one data generator per input rescaling

In [17]:
def train_gen_rescaling(input_rescaling):
    return DataGeneratorCI(
        data_fn = path_train,
        input_vars = input_rescaling,
        output_vars = out_vars,
        norm_fn = path_input_norm,
        input_transform = ('mean', 'maxrs'),
        output_transform = scale_dict)

In [18]:
train_gen_RH = train_gen_rescaling(in_vars)
train_gen_BMSE = train_gen_rescaling(in_vars)
train_gen_LHF_nsDELQ = train_gen_rescaling(in_vars)

### Then, the normalization factors of these generators can be combined to form a "climate-invariant" data generator `train_gen_CI`

In [29]:
train_gen_CI = DataGeneratorCI(
    data_fn = path_train,
    input_vars = in_vars,
    output_vars = out_vars,
    norm_fn = path_norm,
    input_transform = ('mean','maxrs'),
    output_transform = scale_dict,
    shuffle = False,
    batch_size=N_batch,
    Qscaling = 'RH',
    Tscaling = 'BMSE',
    LHFscaling = 'LHF_nsDELQ',
    hyam=hyam, hybm=hybm, # Arrays to define mid-levels of hybrid vertical coordinate
    inp_sub_Qscaling=train_gen_RH.input_transform.sub, # What to subtract from RH inputs
    inp_div_Qscaling=train_gen_RH.input_transform.div, # What to divide RH inputs by
    inp_sub_Tscaling=train_gen_BMSE.input_transform.sub,
    inp_div_Tscaling=train_gen_BMSE.input_transform.div,
    inp_sub_LHFscaling=train_gen_LHF_nsDELQ.input_transform.sub,
    inp_div_LHFscaling=train_gen_LHF_nsDELQ.input_transform.div
)
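
The `hyam` and `hybm` arrays are needed because the rescalings depend on pressure. In a hybrid sigma-pressure coordinate, the mid-level pressure is `p = hyam*P0 + hybm*PS`. The sketch below uses three made-up levels and assumes a reference pressure `P0` of 1000 hPa; the real arrays have 30 levels.

```python
import numpy as np

P0 = 1.0e5  # assumed reference pressure [Pa]

def midlevel_pressure(hyam, hybm, PS):
    """Pressure [Pa] at each mid-level for each surface pressure PS [Pa]."""
    PS = np.asarray(PS, dtype=float)
    return P0 * hyam[None, :] + PS[:, None] * hybm[None, :]

hyam = np.array([0.05, 0.03, 0.0])  # hypothetical pure-pressure coefficients
hybm = np.array([0.0, 0.5, 0.99])   # hypothetical terrain-following coefficients
PS = [1.0e5, 9.8e4]                 # two samples with different surface pressure

p_mid = midlevel_pressure(hyam, hybm, PS)  # shape (samples, levels)
```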

# Regenerate the scaled dataset

## Create new training file

In [30]:
#path_train = path_data + 'Aqua_0K_withVBP/2021_09_02_TRAIN_For_Nando.nc'
#path_train = path_data + '2022_01_10_TRAIN_For_Nando_t-dt.nc'
#path_train = path_data + '2022_01_17_RG_VALID_M4K_PRECTt-dt.nc'

In [62]:
path_train = path_data + '2022_01_19_VALID_M4K_TPHYSTND500.nc'

In [63]:
train_raw = xr.open_dataset(path_train)

### var_names

In [64]:
train_raw['var_names'].values

array(['QBP', 'QBP', 'QBP', 'QBP', 'QBP', 'QBP', 'QBP', 'QBP', 'QBP',
       'QBP', 'QBP', 'QBP', 'QBP', 'QBP', 'QBP', 'QBP', 'QBP', 'QBP',
       'QBP', 'QBP', 'QBP', 'QBP', 'QBP', 'QBP', 'QBP', 'QBP', 'QBP',
       'QBP', 'QBP', 'QBP', 'TBP', 'TBP', 'TBP', 'TBP', 'TBP', 'TBP',
       'TBP', 'TBP', 'TBP', 'TBP', 'TBP', 'TBP', 'TBP', 'TBP', 'TBP',
       'TBP', 'TBP', 'TBP', 'TBP', 'TBP', 'TBP', 'TBP', 'TBP', 'TBP',
       'TBP', 'TBP', 'TBP', 'TBP', 'TBP', 'TBP', 'PS', 'SOLIN', 'SHFLX',
       'LHFLX', 'TPHYSTND500'], dtype=object)

In [65]:
train_raw_CI = train_raw.copy()

In [66]:
var_names_CI = train_raw_CI['var_names'].values
for i in range(60):
    if i<30: var_names_CI[i] = 'RH'
    else: var_names_CI[i] = 'BMSE'

In [67]:
var_names_CI[63] = 'LHF_nsDELQ'

In [68]:
# assign_coords returns a new Dataset, so keep the result
train_raw_CI = train_raw_CI.assign_coords({'var_names':var_names_CI})
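
The relabeling can also be written as a name mapping, which avoids hard-coded indices and leaves the original array untouched. This is a sketch on a plain NumPy array rebuilt from the `var_names` listing above; `RENAME` is a hypothetical helper.

```python
import numpy as np

var_names = np.array(
    ['QBP'] * 30 + ['TBP'] * 30 + ['PS', 'SOLIN', 'SHFLX', 'LHFLX', 'TPHYSTND500'],
    dtype=object,
)

# Hypothetical mapping from brute-force names to climate-invariant names
RENAME = {'QBP': 'RH', 'TBP': 'BMSE', 'LHFLX': 'LHF_nsDELQ'}

# Build a new array instead of modifying var_names in place
var_names_CI = np.array([RENAME.get(name, name) for name in var_names], dtype=object)
```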

### vars

In [69]:
train_gen_CI.output_transform.scale.shape

(1,)

In [70]:
train_gen_CI[0][0].shape

(8192, 64)

In [71]:
train_raw_CI['vars'][0].values.shape

(65,)

In [72]:
new_values = np.zeros(train_raw_CI['vars'].shape)

In [73]:
new_values.shape

(48357376, 65)

In [74]:
train_raw['var_names'].values[64:]

array(['TPHYSTND500'], dtype=object)

In [75]:
n_batches = train_gen_CI.n_samples//N_batch
for ibatch in range(n_batches):
    if ibatch % 10==0: print('progress=','%2.2f' % (100*ibatch/n_batches),
                              '%','               ',end='\r')
    # Undo the [-1,1] normalization to recover the rescaled inputs in physical units
    train_gen_CI_pu = (train_gen_CI[ibatch][0]*train_gen_CI.input_transform.div+
                       train_gen_CI.input_transform.sub)
    # Columns 0-63: climate-invariant inputs; columns 64+: outputs copied from the raw file
    new_values[ibatch*N_batch:((1+ibatch)*N_batch),:] = np.concatenate(
        (train_gen_CI_pu[:,:64],
        train_raw_CI['vars'][ibatch*N_batch:((1+ibatch)*N_batch),64:]),
        axis=1
    )

progress= 99.84 %                
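
The loop above can be checked on synthetic data: normalized batches multiplied by `div` and shifted by `sub` should reproduce the physical values, with the raw output columns appended unchanged. The sketch below is a hedged, self-contained analogue; `fill_physical_units` and all arrays are hypothetical.

```python
import numpy as np

def fill_physical_units(norm_batches, sub, div, raw_outputs, batch_size):
    """Denormalize each input batch and append the matching raw output rows."""
    n_in = sub.size
    out = np.zeros((len(norm_batches) * batch_size, n_in + raw_outputs.shape[1]))
    for ib, x_norm in enumerate(norm_batches):
        rows = slice(ib * batch_size, (ib + 1) * batch_size)
        out[rows, :n_in] = x_norm * div + sub  # back to physical units
        out[rows, n_in:] = raw_outputs[rows]   # outputs copied as-is
    return out

rng = np.random.default_rng(1)
x_phys = rng.uniform(0.0, 100.0, size=(12, 4))  # synthetic inputs
sub = x_phys.mean(axis=0)
div = x_phys.max(axis=0) - x_phys.min(axis=0)
x_norm = (x_phys - sub) / div
raw_outputs = rng.normal(size=(12, 1))

batches = [x_norm[i:i + 4] for i in range(0, 12, 4)]
restored = fill_physical_units(batches, sub, div, raw_outputs, batch_size=4)
```

Note that a loop over `n_samples // N_batch` only covers complete batches. Here 48357376 is an exact multiple of 8192, so no rows are left at zero.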

In [76]:
new_values.shape

(48357376, 65)

In [77]:
train_raw_CI['vars'].values = new_values

### Save new training dataset

In [44]:
path_save_dir = '/DFS-L/DATA/pritchard/tbeucler/SPCAM/SPCAM_PHYS/'

In [78]:
train_raw_CI.to_netcdf(path_save_dir+'2022_01_20_VALID_CI_TPHYSTND500.nc',mode='w')

In [None]:
train_raw_CI

In [None]:
new_values = {}

In [None]:
import sys
def sizeof_fmt(num, suffix='B'):
    ''' by Fred Cirera,  https://stackoverflow.com/a/1094933/1870254, modified'''
    for unit in ['','Ki','Mi','Gi','Ti','Pi','Ei','Zi']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f %s%s" % (num, 'Yi', suffix)

for name, size in sorted(((name, sys.getsizeof(value)) for name, value in locals().items()),
                         key= lambda x: -x[1])[:10]:
    print("{:>30}: {:>8}".format(name, sizeof_fmt(size)))

In [None]:
train_raw_CI.to_netcdf(path_save_dir+'2022_01_18_VALID_CI_PRECTt-dt.nc',mode='w')

In [None]:
train_raw_CI['var_names'].values

## Create new normalization file

In [None]:
norm_RH_dataset = xr.open_dataset(path_norm_RH)
norm_BMSE_dataset = xr.open_dataset(path_norm_BMSE)
norm_LHF_nsDELQ_dataset = xr.open_dataset(path_norm_LHF_nsDELQ) 

In [None]:
norm_dataset = xr.open_dataset(path_input_norm)

In [None]:
new_norm_dataset = norm_dataset.copy()

### Coordinates

In [None]:
var_names_full = norm_dataset['var_names'].values
var_names_full_single = norm_dataset['var_names_single'].values

In [None]:
var_names_full = np.append(var_names_full,'TPHYSTND500')
var_names_full_single = np.append(var_names_full_single,'TPHYSTND500')

In [None]:
for i in range(30): var_names_full = np.append(var_names_full,'RH')
for i in range(30): var_names_full = np.append(var_names_full,'BMSE')
var_names_full = np.append(var_names_full,'LHF_nsDELQ')

var_names_full_single = np.append(var_names_full_single,'RH')
var_names_full_single = np.append(var_names_full_single,'BMSE')
var_names_full_single = np.append(var_names_full_single,'LHF_nsDELQ')

In [None]:
var_names_full.shape

In [None]:
var_names_full_single[12]

In [None]:
var_names_full_single.shape

In [None]:
new_coor = {}
new_coor['var_names'] = var_names_full
new_coor['var_names_single'] = var_names_full_single

### Data

#### Full profiles

In [None]:
KEY = ['mean','std','min','max']

In [None]:
norm_data = {}

In [None]:
for key in KEY:
    norm_data[key] = norm_dataset[key].values

In [None]:
norm_data[key][244+18]

In [None]:
for key in KEY:
    norm_data[key] = np.append(norm_data[key],norm_data[key][244+18])

In [None]:
norm_RH_dataset['var_names'][:30].values

In [None]:
for key in KEY:
    norm_data[key] = np.append(norm_data[key],norm_RH_dataset[key][:30].values)

In [None]:
norm_BMSE_dataset['var_names'][30:60].values

In [None]:
for key in KEY:
    norm_data[key] = np.append(norm_data[key],norm_BMSE_dataset[key][30:60].values)

In [None]:
norm_LHF_nsDELQ_dataset['var_names'][93].values

In [None]:
for key in KEY:
    norm_data[key] = np.append(norm_data[key],norm_LHF_nsDELQ_dataset[key][93].values)

#### One std per variable

In [None]:
key0 = 'std_by_var'

In [None]:
norm_data[key0] = norm_dataset[key0].values

In [None]:
norm_dataset[key0].values.shape

In [None]:
norm_data[key0] = np.append(norm_data[key0],
                            norm_data[key0][12])

In [None]:
norm_RH_dataset['var_names_single'][0].values

In [None]:
norm_data[key0] = np.append(norm_data[key0],
                            norm_RH_dataset[key0][0].values)

In [None]:
norm_BMSE_dataset['var_names_single'][1].values

In [None]:
norm_data[key0] = np.append(norm_data[key0],
                            norm_BMSE_dataset[key0][1].values)

In [None]:
norm_LHF_nsDELQ_dataset['var_names_single'][6].values

In [None]:
norm_data[key0] = np.append(norm_data[key0],
                            norm_LHF_nsDELQ_dataset[key0][6].values)

In [None]:
for key in norm_data.keys():
    print(key+str(norm_data[key].shape))

In [None]:
norm_data_dict = {}

In [None]:
for key in KEY:
    norm_data_dict[key] = (['var_names'],norm_data[key])
norm_data_dict[key0] = (['var_names_single'],norm_data[key0])

### Combine coordinates and data into a new xarray dataset

In [None]:
new_norm = xr.Dataset(
    data_vars = norm_data_dict,
    coords = new_coor
)
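
Since the arrays were assembled with repeated `np.append` calls, a length check against the coordinate lists can catch a missed or duplicated append before the file is written. The helper below is a hypothetical sketch that works on the same `name -> (dims, values)` layout used for `data_vars`, shown here on toy inputs.

```python
import numpy as np

def check_lengths(data_vars, coords):
    """Return a message for every variable whose length differs from its coordinate."""
    problems = []
    for name, (dims, values) in data_vars.items():
        expected = len(coords[dims[0]])
        if len(values) != expected:
            problems.append(
                f"{name}: {len(values)} values for {expected} names in '{dims[0]}'"
            )
    return problems

# Toy inputs standing in for new_coor and norm_data_dict
coords = {'var_names': ['QBP', 'TBP', 'RH'], 'var_names_single': ['QBP', 'TBP']}
good = {'mean': (['var_names'], np.zeros(3)),
        'std_by_var': (['var_names_single'], np.ones(2))}
bad = {'mean': (['var_names'], np.zeros(4))}

problems_good = check_lengths(good, coords)
problems_bad = check_lengths(bad, coords)
```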

### Check that new normalization file was created correctly

#### Full profiles

In [None]:
new_norm['var_names'][-31:-1].values

In [None]:
new_norm['min'][-31:-1].values

In [None]:
norm_BMSE_dataset['var_names'][30:60].values

In [None]:
norm_BMSE_dataset['min'][30:60].values

#### One std per variable

In [None]:
new_norm['var_names_single'][-3].values

In [None]:
new_norm['std_by_var'][-3].values

In [None]:
norm_RH_dataset['var_names_single'][0].values

In [None]:
norm_RH_dataset['std_by_var'][0].values
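
The visual comparisons above can be turned into explicit checks. Because the last 61 entries of the full-profile arrays were appended as 30 `RH`, 30 `BMSE`, and 1 `LHF_nsDELQ` value, the slice `[-31:-1]` should equal the `BMSE` block. A sketch on synthetic arrays, with `tail_matches` as a hypothetical helper:

```python
import numpy as np

def tail_matches(combined, source_block, n_after):
    """True if source_block sits in combined just before its last n_after entries."""
    stop = len(combined) - n_after
    start = stop - len(source_block)
    return np.array_equal(combined[start:stop], source_block)

base = np.arange(5.0)             # stands in for the pre-existing entries
rh = np.full(30, 0.1)             # synthetic RH statistics
bmse = np.linspace(-1.0, 1.0, 30) # synthetic BMSE statistics
lhf = np.array([7.0])             # synthetic LHF_nsDELQ statistic
combined = np.concatenate([base, rh, bmse, lhf])

bmse_ok = tail_matches(combined, bmse, n_after=1)  # BMSE precedes the last entry
rh_ok = tail_matches(combined, rh, n_after=31)     # RH precedes BMSE and LHF
```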

## Save new norm file

In [None]:
new_norm

In [None]:
norm_dataset

In [None]:
new_norm.to_netcdf(path_save_dir+'2022_01_20_NORM_TPHYSTND500.nc',mode='w')

# Check that training is now stable

## Climate invariant, without outputs [t-dt]

In [None]:
path_save_dir = '/DFS-L/DATA/pritchard/tbeucler/SPCAM/SPCAM_PHYS/Aqua_0K_ClimInv_withVBP/'

In [None]:
path_train = '2021_12_22_TRAIN_For_Nando_CI.nc'
path_newnorm = '2021_12_22_NORM_For_Nando_CI.nc'

In [None]:
test_train = xr.open_dataset(path_save_dir+path_train)

In [None]:
test_train['var_names'][90:95]

In [None]:
in_vars = ['RH','BMSE','PS', 'SOLIN', 'SHFLX', 'LHF_nsDELQ']
out_vars = ['PHQ','TPHYSTND','FSNT','FSNS','FLNT','FLNS','PRECT']

In [None]:
train_gen_Nando = DataGeneratorCI(
    data_fn = path_save_dir+path_train,
    input_vars = in_vars,
    output_vars = out_vars,
    norm_fn = path_save_dir+path_newnorm,
    input_transform = ('mean', 'maxrs'),
    output_transform = scale_dict,
    batch_size=N_batch
)

In [None]:
train_gen_Nando[0][0]

In [None]:
inp = Input(shape=(64,)) ## input after rh and tns transformation
densout = Dense(128, activation='linear')(inp)
densout = LeakyReLU(alpha=0.3)(densout)
for i in range (6):
    densout = Dense(128, activation='linear')(densout)
    densout = LeakyReLU(alpha=0.3)(densout)
dense_out = Dense(65, activation='linear')(densout)
model = tf.keras.models.Model(inp, dense_out)

In [None]:
model.summary()
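
As a cross-check on the summary, the number of trainable parameters of a fully connected network can be computed by hand: each dense layer has `n_in*n_out` weights plus `n_out` biases, and the LeakyReLU activations add none. The sketch assumes the layer widths defined above (64 inputs, seven hidden layers of 128 units, 65 outputs).

```python
def dense_param_count(layer_widths):
    """Weights plus biases summed over consecutive pairs of layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:])
    )

widths = [64] + [128] * 7 + [65]
n_params = dense_param_count(widths)  # 8320 + 6*16512 + 8385 = 115777
```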

In [None]:
model.compile(tf.keras.optimizers.Adam(), loss=mse)

In [None]:
# Where to save the model
path_HDF5 = '/DFS-L/DATA/pritchard/tbeucler/SPCAM/HDF5_DATA/'
save_name = '2022_01_17_CI_Rasp_config_without_t-dt'

In [None]:
earlyStopping = EarlyStopping(monitor='val_loss', patience=10, verbose=0, mode='min')
mcp_save_pos = ModelCheckpoint(path_HDF5+save_name+'.hdf5',save_best_only=True, monitor='val_loss', mode='min')

In [None]:
Nep = 20
model.fit_generator(train_gen_Nando, epochs=Nep, validation_data=train_gen_Nando,\
                    callbacks=[earlyStopping, mcp_save_pos])
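
With `patience=10` and `Nep = 20`, training stops early only if the validation loss fails to improve for 10 epochs in a row. The function below is a simplified pure-Python sketch of that rule, not the Keras `EarlyStopping` implementation, and the loss values are made up.

```python
def epochs_run(val_losses, patience):
    """Number of epochs completed before a patience-based stop (simplified)."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss  # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1    # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)

losses = [1.0, 0.9, 0.95, 0.96, 0.97, 0.8]  # hypothetical validation losses
stopped_at = epochs_run(losses, patience=3)  # stops after three stalled epochs
full_run = epochs_run(losses, patience=10)   # patience never exhausted
```

Note that the call above passes the training generator as `validation_data`, so the monitored `val_loss` is evaluated on the training file.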

## Climate invariant, with tendencies [t-dt]

In [None]:
path_save_dir = '/DFS-L/DATA/pritchard/tbeucler/SPCAM/SPCAM_PHYS/'

In [None]:
path_train = '2022_01_13_TRAIN_For_Nando_CI_t-dt.nc'
path_valid = '2022_01_13_VALID_For_Nando_CI_t-dt.nc'
path_newnorm = '2022_01_13_NORM_For_Nando_CI_t-dt.nc'

In [None]:
in_vars = ['RH','BMSE','PS', 'SOLIN', 'SHFLX', 'LHF_nsDELQ',
          'PHQt-dt','TPHYSTNDt-dt','FSNTt-dt','FSNSt-dt',
           'FLNTt-dt','FLNSt-dt','PRECTt-dt']
out_vars = ['PHQ','TPHYSTND','FSNT','FSNS','FLNT','FLNS','PRECT']

In [None]:
N_batch = 8192

In [None]:
train_gen_Nando = DataGeneratorCI(
    data_fn = path_save_dir+path_train,
    input_vars = in_vars,
    output_vars = out_vars,
    norm_fn = path_save_dir+path_newnorm,
    input_transform = ('mean', 'maxrs'),
    output_transform = scale_dict,
    batch_size=N_batch
)

In [None]:
valid_gen_Nando = DataGeneratorCI(
    data_fn = path_save_dir+path_valid,
    input_vars = in_vars,
    output_vars = out_vars,
    norm_fn = path_save_dir+path_newnorm,
    input_transform = ('mean', 'maxrs'),
    output_transform = scale_dict,
    batch_size=N_batch
)

In [None]:
train_gen_Nando[0][0].shape

In [None]:
train_gen_Nando[0][1]

In [None]:
inp = Input(shape=(129,)) ## input after rh and tns transformation
densout = Dense(128, activation='linear')(inp)
densout = LeakyReLU(alpha=0.3)(densout)
for i in range (6):
    densout = Dense(128, activation='linear')(densout)
    densout = LeakyReLU(alpha=0.3)(densout)
dense_out = Dense(65, activation='linear')(densout)
model = tf.keras.models.Model(inp, dense_out)
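
The input width of 129 can be recovered from the variable lists, assuming that profile variables span 30 levels and that each `t-dt` variable has the same shape as its current-time counterpart. `N_LEV`, `PROFILE_VARS`, and `vector_width` are hypothetical names used only for this arithmetic.

```python
N_LEV = 30  # assumed number of vertical levels per profile variable
PROFILE_VARS = {'RH', 'BMSE', 'PHQ', 'TPHYSTND', 'PHQt-dt', 'TPHYSTNDt-dt'}

def vector_width(var_list):
    """Total vector length: N_LEV per profile variable, 1 per scalar."""
    return sum(N_LEV if name in PROFILE_VARS else 1 for name in var_list)

in_vars = ['RH', 'BMSE', 'PS', 'SOLIN', 'SHFLX', 'LHF_nsDELQ',
           'PHQt-dt', 'TPHYSTNDt-dt', 'FSNTt-dt', 'FSNSt-dt',
           'FLNTt-dt', 'FLNSt-dt', 'PRECTt-dt']
out_vars = ['PHQ', 'TPHYSTND', 'FSNT', 'FSNS', 'FLNT', 'FLNS', 'PRECT']

n_in = vector_width(in_vars)    # 64 state inputs + 65 previous-step outputs
n_out = vector_width(out_vars)  # 2 profiles + 5 scalars
```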

In [None]:
model.summary()

In [None]:
model.compile(tf.keras.optimizers.Adam(), loss=mse)

In [None]:
# Where to save the model
path_HDF5 = '/DFS-L/DATA/pritchard/tbeucler/SPCAM/HDF5_DATA/'
save_name = '2022_01_14_Test_Nando_CI_t-dt'

In [None]:
earlyStopping = EarlyStopping(monitor='val_loss', patience=10, verbose=0, mode='min')
mcp_save_pos = ModelCheckpoint(path_HDF5+save_name+'.hdf5',save_best_only=True, monitor='val_loss', mode='min')

In [None]:
Nep = 20
model.fit_generator(train_gen_Nando, epochs=Nep, validation_data=train_gen_Nando,\
                    callbacks=[earlyStopping, mcp_save_pos])

In [None]:
Nep = 20
model.fit_generator(train_gen_Nando, epochs=Nep, validation_data=valid_gen_Nando,\
                    callbacks=[earlyStopping, mcp_save_pos])

## Brute force, with tendencies [t-dt]

Redefine generator

In [None]:
path_data = '/DFS-L/DATA/pritchard/tbeucler/SPCAM/SPCAM_PHYS/'

In [None]:
path_train = path_data + '2022_01_10_TRAIN_For_Nando_t-dt.nc' 
path_valid = path_data + '2022_01_10_VALID_For_Nando_t-dt.nc'
path_norm = path_data + '2022_01_10_Norm_Outputs_t-dt.nc'

In [None]:
in_vars = ['QBP','TBP','PS', 'SOLIN', 'SHFLX', 'LHFLX',
          'PHQt-dt','TPHYSTNDt-dt','FSNTt-dt','FSNSt-dt',
           'FLNTt-dt','FLNSt-dt','PRECTt-dt']
out_vars = ['PHQ','TPHYSTND','FSNT','FSNS','FLNT','FLNS','PRECT']

In [None]:
train_gen_BF = DataGeneratorCI(
    data_fn = path_train,
    input_vars = in_vars,
    output_vars = out_vars,
    norm_fn = path_norm,
    input_transform = ('mean', 'maxrs'),
    output_transform = scale_dict,
    shuffle = False,
    batch_size=N_batch
)

In [None]:
valid_gen_BF = DataGeneratorCI(
    data_fn = path_valid,
    input_vars = in_vars,
    output_vars = out_vars,
    norm_fn = path_norm,
    input_transform = ('mean', 'maxrs'),
    output_transform = scale_dict,
    shuffle = False,
    batch_size=N_batch
)

Load already trained model

In [None]:
# Where to load the model from
path_HDF5 = '/DFS-L/DATA/pritchard/tbeucler/SPCAM/HDF5_DATA/'
load_name = '2022_01_10_Test_check_with_answer'

In [None]:
model_BF = tf.keras.models.load_model(path_HDF5+load_name+'.hdf5')

In [None]:
model_BF.summary()

In [None]:
# Where to save the model
path_HDF5 = '/DFS-L/DATA/pritchard/tbeucler/SPCAM/HDF5_DATA/'
save_name = '2022_01_14_Test_Nando_BF_t-dt'

In [None]:
earlyStopping = EarlyStopping(monitor='val_loss', patience=10, verbose=0, mode='min')
mcp_save_pos = ModelCheckpoint(path_HDF5+save_name+'.hdf5',save_best_only=True, monitor='val_loss', mode='min')

In [None]:
Nep = 20
model_BF.fit_generator(train_gen_BF, epochs=Nep, validation_data=train_gen_BF,\
                    callbacks=[earlyStopping, mcp_save_pos])

In [None]:
Nep = 20
model_BF.fit_generator(train_gen_BF, epochs=Nep, validation_data=valid_gen_BF,\
                    callbacks=[earlyStopping, mcp_save_pos])