### Predict Sahel rainfall with LSTM/fc models

In this project we work with the **C**limate **I**ndex **C**ollection based on **Mo**del **D**ata (CICMoD) data set (https://github.com/MarcoLandtHayen/climate_index_collection).

Here, we will try to **predict future** Sahel rainfall (lead times 1 / 3 / 6 months) from current and past information (t<=0) of all input features (including PREC_SAHEL) with **LSTM/fc** models:

- Prepare inputs and targets.
- Set up model.
- Evaluate model performance.

**Note:** We start with predicting future Sahel rainfall from its own history alone, hence with **univariate** inputs. Then, we add further input features to have **multivariate** inputs. And ultimately, we add **months as additional input features**.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from json import dump, load
from pathlib import Path

### Import additional functions:
from predict_sahel_rainfall.plot import bar_color
from predict_sahel_rainfall.preprocessing import prepare_inputs_and_target
from predict_sahel_rainfall.models import set_LSTM_fc

### Prepare inputs and targets: Univariate

Load the collection of climate indices directly from the GitHub release.
Use the complete preprocessing pipeline function `prepare_inputs_and_target`.
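
To illustrate what "inputs from current and past information" and "lead time" mean, here is a minimal sketch of how sliding input windows and shifted targets could be built. `make_windows` is a hypothetical helper for illustration only; it is an assumption and not the actual code inside `prepare_inputs_and_target`.

```python
import numpy as np

def make_windows(features, target, input_length, lead_time):
    """Build inputs (n_samples, input_length, n_features) and targets (n_samples, 1).

    Hypothetical helper: each input window ends at time t (t<=0 in the
    notation above), and its target is the target series at t + lead_time.
    """
    n_steps = features.shape[0]
    # Number of windows for which the target index is still in range:
    n_samples = n_steps - input_length - lead_time + 1
    inputs = np.stack([features[i : i + input_length] for i in range(n_samples)])
    # Window i covers steps i .. i+input_length-1; its target lies lead_time steps later:
    first_target = input_length - 1 + lead_time
    targets = target[first_target : first_target + n_samples]
    return inputs, targets.reshape(-1, 1)

# Toy example: 100 monthly time steps, 3 features, target is feature 0.
features = np.arange(300, dtype=float).reshape(100, 3)
target = features[:, 0]
inputs, targets = make_windows(features, target, input_length=24, lead_time=3)
print(inputs.shape, targets.shape)
```

With `input_length=24` and `lead_time=3`, the first window covers time steps 0 to 23 and is paired with the target at time step 26.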

In [42]:
## Set common parameters (except ESM and lead time) for data preprocessing:

# Set url to csv file containing CICMoD indices from desired release:
data_url = (
    "https://github.com/MarcoLandtHayen/climate_index_collection/"
    "releases/download/v2023.03.29.1/climate_indices.csv"
)

# Select target index:
target_index = 'PREC_SAHEL'

# Select all input features:
input_features = [
    'AMO', 'ENSO_12', 'ENSO_3', 'ENSO_34', 'ENSO_4', 'NAO_PC', 'NAO_ST', 
    'NP', 'PDO_PC', 'PREC_SAHEL', 'SAM_PC', 'SAM_ZM', 'SAT_N_ALL', 'SAT_N_LAND',
    'SAT_N_OCEAN', 'SAT_S_ALL', 'SAT_S_LAND', 'SAT_S_OCEAN', 'SOI',
    'SSS_ENA', 'SSS_NA', 'SSS_SA', 'SSS_WNA', 'SST_ESIO', 'SST_HMDR',
    'SST_MED', 'SST_TNA', 'SST_TSA', 'SST_WSIO'
]

# # Select subset of input features:
# input_features = [
#     'PREC_SAHEL',
# ]

# Choose, whether to add months as one-hot encoded features:
add_months = True

# Choose, whether to normalize target index:
norm_target = True

# Specify input length:
input_length = 24

# Specify amount of combined training and validation data relative to test data:
train_test_split = 0.9

# Specify relative amount of combined training and validation used for training:
train_val_split = 0.8

## Optionally choose to scale or normalize input features according to statistics from training data:
# 'no': Keep raw input features.
# 'scale_01': Scale input features with min/max scaling to [0,1].
# 'scale_11': Scale input features with min/max scaling to [-1,1].
# 'norm': Normalize input features, hence subtract mean and divide by std dev.
scale_norm = 'scale_11'
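
The scaling options listed above can be sketched as follows. This is a minimal illustration, assuming that statistics are computed per feature from training data only; `scale_or_normalize` is a hypothetical helper and not the actual implementation inside `prepare_inputs_and_target`.

```python
import numpy as np

def scale_or_normalize(train, other, scale_norm):
    """Scale `train` and `other` per feature, using statistics of `train` only."""
    if scale_norm == "no":
        return train, other
    if scale_norm == "norm":
        mean, std = train.mean(axis=0), train.std(axis=0)
        return (train - mean) / std, (other - mean) / std
    lo, hi = train.min(axis=0), train.max(axis=0)
    if scale_norm == "scale_01":
        return (train - lo) / (hi - lo), (other - lo) / (hi - lo)
    if scale_norm == "scale_11":
        return 2 * (train - lo) / (hi - lo) - 1, 2 * (other - lo) / (hi - lo) - 1
    raise ValueError(f"Unknown scale_norm option: {scale_norm}")

# Toy data: three training samples and one held-out sample with two features.
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
other = np.array([[15.0, 10.0]])
train_scaled, other_scaled = scale_or_normalize(train, other, "scale_11")
print(train_scaled)
print(other_scaled)
```

Because the minimum and maximum come from training data, held-out samples can fall outside [-1,1], as the first feature of `other` does in this toy example.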

In [43]:
# Set parameters for LSTM/fc model:
LSTM_units = [10,20]
fc_units = [20,10]
fc_activation = 'sigmoid'
output_activation = 'linear'
LSTM_weight_init = 'glorot_uniform'
LSTM_recurrent_init = 'orthogonal'
LSTM_bias_init = 'zeros'
fc_weight_init = 'glorot_uniform'
fc_bias_init = 'zeros'
LSTM_weight_reg = None
LSTM_recurrent_reg = None
LSTM_bias_reg = None
fc_weight_reg = None
fc_bias_reg = None
learning_rate = 0.005
loss_function = 'mse'
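
To get a feeling for the model size implied by these settings, here is a rough parameter count. It is a sketch under a strong assumption: that `set_LSTM_fc` stacks the LSTM layers, followed by the fc layers and a single output unit. The helper names below are hypothetical, and the formulas are the standard ones for LSTM and dense layers with bias.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias vector.
    return input_dim * units + units

def count_params(n_features, lstm_units, fc_units):
    """Count trainable parameters of an assumed LSTM stack + fc stack + 1 output unit."""
    total, dim = 0, n_features
    for units in lstm_units:
        total += lstm_params(dim, units)
        dim = units
    for units in fc_units:
        total += dense_params(dim, units)
        dim = units
    # Single output unit (assumption):
    total += dense_params(dim, 1)
    return total

# 41 input features: 29 climate indices plus 12 one-hot encoded months.
print(count_params(41, [10, 20], [20, 10]))  # 5201
```

If the assumed architecture matches, the actual number can be checked with `model.summary()`.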

In [44]:
# Set choice of ESMs:
ESMs = ['CESM', 'FOCI']

# Set choice of lead times:
lead_times = [1,3,6]

# Set number of runs per setting:
n_runs = 3

# Set number of training epochs:
n_epochs = 20

# Set batch size:
batch_size = 20

# Get number of input features, depending on whether or not months are added as additional features:
if add_months:
    n_features = len(input_features) + 12
else:
    n_features = len(input_features)
    
# Check number of input channels:
print('Number of input features:',n_features)

Number of input features: 41
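
The 41 input features are the 29 climate indices plus 12 one-hot encoded months. A minimal sketch of such an encoding, assuming consecutive monthly time steps that start in January (`add_one_hot_months` is a hypothetical helper, not the function used in the preprocessing pipeline):

```python
import numpy as np

def add_one_hot_months(features, first_month=1):
    """Append 12 one-hot month columns to features of shape (n_steps, n_features)."""
    n_steps = features.shape[0]
    # Month index 0..11 for each time step, assuming consecutive monthly data:
    month_idx = (np.arange(n_steps) + first_month - 1) % 12
    one_hot = np.eye(12)[month_idx]
    return np.concatenate([features, one_hot], axis=1)

# Toy example: three years of monthly data with 29 features.
features = np.zeros((36, 29))
with_months = add_one_hot_months(features)
print(with_months.shape)  # (36, 41)
```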


In [45]:
## Initialize storage for loss curves and correlation, dimension (#ESMs, #lead times, #runs, #epochs+1).
## Need #epochs+1, since we want to store results for untrained model plus after each epoch.
train_loss_all = np.zeros((len(ESMs),len(lead_times),n_runs,n_epochs+1))
val_loss_all = np.zeros((len(ESMs),len(lead_times),n_runs,n_epochs+1))
test_loss_all = np.zeros((len(ESMs),len(lead_times),n_runs,n_epochs+1))
train_correl_all = np.zeros((len(ESMs),len(lead_times),n_runs,n_epochs+1))
val_correl_all = np.zeros((len(ESMs),len(lead_times),n_runs,n_epochs+1))
test_correl_all = np.zeros((len(ESMs),len(lead_times),n_runs,n_epochs+1))

## Loop over ESMs:
for m in range(len(ESMs)):
    
    # Get current ESM:
    ESM = ESMs[m]
    
    # Print status:
    print('ESM:',m+1,'of',len(ESMs))

    ## Loop over lead times:
    for l in range(len(lead_times)):
        
        # Get current lead time:
        lead_time = lead_times[l]
        
        # Print status:
        print('  lead time:',l+1,'of',len(lead_times))

        # Prepare inputs and target for current ESM and lead time:
        (
            train_input,
            train_target,
            val_input,
            val_target,
            test_input,
            test_target,
            train_mean,
            train_std,
            train_min,
            train_max,
        ) = prepare_inputs_and_target(    
            data_url=data_url,
            ESM=ESM,
            target_index=target_index,
            input_features=input_features,
            add_months=add_months,
            norm_target=norm_target,
            lead_time=lead_time,
            input_length=input_length,
            train_test_split=train_test_split,
            train_val_split=train_val_split,
            scale_norm=scale_norm,
        )
        
        # Loop over desired number of training runs:
        for r in range(n_runs):
            
            # Print status:
            print('    run:',r+1,'of',n_runs)
            
            # Set up LSTM/fc model:
            model = set_LSTM_fc(
                input_length=input_length, 
                n_features=n_features, 
                LSTM_units=LSTM_units,
                fc_units=fc_units, 
                fc_activation=fc_activation,
                output_activation=output_activation,
                LSTM_weight_init=LSTM_weight_init,
                LSTM_recurrent_init=LSTM_recurrent_init,
                LSTM_bias_init=LSTM_bias_init,
                fc_weight_init=fc_weight_init,
                fc_bias_init=fc_bias_init,
                LSTM_weight_reg=LSTM_weight_reg,
                LSTM_recurrent_reg=LSTM_recurrent_reg,
                LSTM_bias_reg=LSTM_bias_reg,
                fc_weight_reg=fc_weight_reg,
                fc_bias_reg=fc_bias_reg,
                learning_rate=learning_rate, 
                loss_function=loss_function
            )
            
            ### Train model: Epoch-by-epoch
            
            ## Store results for untrained model:
            
            # Get model predictions on training, validation and test data:
            train_pred = model.predict(train_input)
            val_pred = model.predict(val_input)
            test_pred = model.predict(test_input)

            # Compute mse of model predictions vs. true targets:
            train_loss = np.mean((train_target-train_pred)**2)
            val_loss = np.mean((val_target-val_pred)**2)
            test_loss = np.mean((test_target-test_pred)**2)

            # Compute correlation coefficient of model predictions vs. true targets:
            train_correl = np.corrcoef(np.stack([train_target[:,0],train_pred[:,0]]))[0,1]
            val_correl = np.corrcoef(np.stack([val_target[:,0],val_pred[:,0]]))[0,1]
            test_correl = np.corrcoef(np.stack([test_target[:,0],test_pred[:,0]]))[0,1]
            
            # Store results:
            train_loss_all[m,l,r,0] = train_loss
            val_loss_all[m,l,r,0] = val_loss
            test_loss_all[m,l,r,0] = test_loss
            train_correl_all[m,l,r,0] = train_correl
            val_correl_all[m,l,r,0] = val_correl
            test_correl_all[m,l,r,0] = test_correl          
            
            # Loop over epochs:
            for e in range(n_epochs):
                
                # Train model for single epoch:
                history = model.fit(train_input, train_target, epochs=1, batch_size=batch_size, shuffle=True, verbose=0)

                ## Store results after current epoch:
            
                # Get model predictions on training, validation and test data:
                train_pred = model.predict(train_input)
                val_pred = model.predict(val_input)
                test_pred = model.predict(test_input)

                # Compute mse of model predictions vs. true targets:
                train_loss = np.mean((train_target-train_pred)**2)
                val_loss = np.mean((val_target-val_pred)**2)
                test_loss = np.mean((test_target-test_pred)**2)

                # Compute correlation coefficient of model predictions vs. true targets:
                train_correl = np.corrcoef(np.stack([train_target[:,0],train_pred[:,0]]))[0,1]
                val_correl = np.corrcoef(np.stack([val_target[:,0],val_pred[:,0]]))[0,1]
                test_correl = np.corrcoef(np.stack([test_target[:,0],test_pred[:,0]]))[0,1]

                # Store results:
                train_loss_all[m,l,r,e+1] = train_loss
                val_loss_all[m,l,r,e+1] = val_loss
                test_loss_all[m,l,r,e+1] = test_loss
                train_correl_all[m,l,r,e+1] = train_correl
                val_correl_all[m,l,r,e+1] = val_correl
                test_correl_all[m,l,r,e+1] = test_correl          

ESM: 1 of 2
  lead time: 1 of 3
    run: 1 of 3
    run: 2 of 3
    run: 3 of 3
  lead time: 2 of 3
    run: 1 of 3
    run: 2 of 3
    run: 3 of 3
  lead time: 3 of 3
    run: 1 of 3
    run: 2 of 3
    run: 3 of 3
ESM: 2 of 2
  lead time: 1 of 3
    run: 1 of 3
    run: 2 of 3
    run: 3 of 3
  lead time: 2 of 3
    run: 1 of 3
    run: 2 of 3
    run: 3 of 3
  lead time: 3 of 3
    run: 1 of 3
    run: 2 of 3
    run: 3 of 3
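
The training loop above repeats the same evaluation code for the untrained model and after each epoch. A small helper could remove this duplication. This is a sketch only; `mse_and_correlation` is a hypothetical name and not part of `predict_sahel_rainfall`.

```python
import numpy as np

def mse_and_correlation(target, pred):
    """Return mean squared error and Pearson correlation for (n_samples, 1) arrays."""
    mse = np.mean((target - pred) ** 2)
    correl = np.corrcoef(target[:, 0], pred[:, 0])[0, 1]
    return mse, correl

# Toy check with a perfectly correlated but biased prediction:
target = np.array([[0.0], [1.0], [2.0], [3.0]])
pred = target + 0.5
mse, correl = mse_and_correlation(target, pred)
print(mse, correl)
```

The toy example also shows why both metrics are tracked: a constant offset leaves the correlation at 1 but still produces a nonzero mse.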


In [46]:
# ### Store results:

# ## LSTM/fc - univariate:

# # Specify model setup:
# setup = 'LSTM/fc - univariate'

# # Save loss and correlation results:
# np.save('../results/quickrun_LSTM_fc_univariate_train_loss_all.npy', train_loss_all)
# np.save('../results/quickrun_LSTM_fc_univariate_val_loss_all.npy', val_loss_all)
# np.save('../results/quickrun_LSTM_fc_univariate_test_loss_all.npy', test_loss_all)
# np.save('../results/quickrun_LSTM_fc_univariate_train_correl_all.npy', train_correl_all)
# np.save('../results/quickrun_LSTM_fc_univariate_val_correl_all.npy', val_correl_all)
# np.save('../results/quickrun_LSTM_fc_univariate_test_correl_all.npy', test_correl_all)

# # Store parameters:
# parameters = {
#     "setup": setup,
#     "data_url": data_url,
#     "target_index": target_index,
#     "input_features": input_features,
#     "add_months": add_months,
#     "norm_target": norm_target,
#     "input_length": input_length,
#     "train_test_split": train_test_split,
#     "train_val_split": train_val_split,
#     "scale_norm": scale_norm,    
#     "LSTM_units": LSTM_units,
#     "fc_units": fc_units,
#     "fc_activation": fc_activation,
#     "output_activation": output_activation,
#     "LSTM_weight_init": LSTM_weight_init,
#     "LSTM_recurrent_init": LSTM_recurrent_init,
#     "LSTM_bias_init": LSTM_bias_init,
#     "fc_weight_init": fc_weight_init,
#     "fc_bias_init": fc_bias_init,
#     "LSTM_weight_reg": LSTM_weight_reg,
#     "LSTM_recurrent_reg": LSTM_recurrent_reg,
#     "LSTM_bias_reg": LSTM_bias_reg,
#     "fc_weight_reg": fc_weight_reg,
#     "fc_bias_reg": fc_bias_reg,
#     "learning_rate": learning_rate,
#     "loss_function": loss_function,
#     "ESMs": ESMs,
#     "lead_times": lead_times,
#     "n_runs": n_runs,
#     "n_epochs": n_epochs,
#     "batch_size": batch_size,    
# }

# path_to_store_results = Path('../results')
# with open(path_to_store_results / "quickrun_LSTM_fc_univariate_parameters.json", "w") as f:
#     dump(parameters, f)

# #######################################
    
# ## LSTM/fc - multivariate:

# # Specify model setup:
# setup = 'LSTM/fc - multivariate'

# # Save loss and correlation results:
# np.save('../results/quickrun_LSTM_fc_multivariate_train_loss_all.npy', train_loss_all)
# np.save('../results/quickrun_LSTM_fc_multivariate_val_loss_all.npy', val_loss_all)
# np.save('../results/quickrun_LSTM_fc_multivariate_test_loss_all.npy', test_loss_all)
# np.save('../results/quickrun_LSTM_fc_multivariate_train_correl_all.npy', train_correl_all)
# np.save('../results/quickrun_LSTM_fc_multivariate_val_correl_all.npy', val_correl_all)
# np.save('../results/quickrun_LSTM_fc_multivariate_test_correl_all.npy', test_correl_all)

# # Store parameters:
# parameters = {
#     "setup": setup,
#     "data_url": data_url,
#     "target_index": target_index,
#     "input_features": input_features,
#     "add_months": add_months,
#     "norm_target": norm_target,
#     "input_length": input_length,
#     "train_test_split": train_test_split,
#     "train_val_split": train_val_split,
#     "scale_norm": scale_norm,    
#     "LSTM_units": LSTM_units,
#     "fc_units": fc_units,
#     "fc_activation": fc_activation,
#     "output_activation": output_activation,
#     "LSTM_weight_init": LSTM_weight_init,
#     "LSTM_recurrent_init": LSTM_recurrent_init,
#     "LSTM_bias_init": LSTM_bias_init,
#     "fc_weight_init": fc_weight_init,
#     "fc_bias_init": fc_bias_init,
#     "LSTM_weight_reg": LSTM_weight_reg,
#     "LSTM_recurrent_reg": LSTM_recurrent_reg,
#     "LSTM_bias_reg": LSTM_bias_reg,
#     "fc_weight_reg": fc_weight_reg,
#     "fc_bias_reg": fc_bias_reg,
#     "learning_rate": learning_rate,
#     "loss_function": loss_function,
#     "ESMs": ESMs,
#     "lead_times": lead_times,
#     "n_runs": n_runs,
#     "n_epochs": n_epochs,
#     "batch_size": batch_size,    
# }

# path_to_store_results = Path('../results')
# with open(path_to_store_results / "quickrun_LSTM_fc_multivariate_parameters.json", "w") as f:
#     dump(parameters, f)

# #######################################
    
# ## LSTM/fc - multivariate - months as additional input features:

# # Specify model setup:
# setup = 'LSTM/fc - multivariate - with months'

# # Save loss and correlation results:
# np.save('../results/quickrun_LSTM_fc_multivariate_with_months_train_loss_all.npy', train_loss_all)
# np.save('../results/quickrun_LSTM_fc_multivariate_with_months_val_loss_all.npy', val_loss_all)
# np.save('../results/quickrun_LSTM_fc_multivariate_with_months_test_loss_all.npy', test_loss_all)
# np.save('../results/quickrun_LSTM_fc_multivariate_with_months_train_correl_all.npy', train_correl_all)
# np.save('../results/quickrun_LSTM_fc_multivariate_with_months_val_correl_all.npy', val_correl_all)
# np.save('../results/quickrun_LSTM_fc_multivariate_with_months_test_correl_all.npy', test_correl_all)

# # Store parameters:
# parameters = {
#     "setup": setup,
#     "data_url": data_url,
#     "target_index": target_index,
#     "input_features": input_features,
#     "add_months": add_months,
#     "norm_target": norm_target,
#     "input_length": input_length,
#     "train_test_split": train_test_split,
#     "train_val_split": train_val_split,
#     "scale_norm": scale_norm,    
#     "LSTM_units": LSTM_units,
#     "fc_units": fc_units,
#     "fc_activation": fc_activation,
#     "output_activation": output_activation,
#     "LSTM_weight_init": LSTM_weight_init,
#     "LSTM_recurrent_init": LSTM_recurrent_init,
#     "LSTM_bias_init": LSTM_bias_init,
#     "fc_weight_init": fc_weight_init,
#     "fc_bias_init": fc_bias_init,
#     "LSTM_weight_reg": LSTM_weight_reg,
#     "LSTM_recurrent_reg": LSTM_recurrent_reg,
#     "LSTM_bias_reg": LSTM_bias_reg,
#     "fc_weight_reg": fc_weight_reg,
#     "fc_bias_reg": fc_bias_reg,
#     "learning_rate": learning_rate,
#     "loss_function": loss_function,
#     "ESMs": ESMs,
#     "lead_times": lead_times,
#     "n_runs": n_runs,
#     "n_epochs": n_epochs,
#     "batch_size": batch_size,    
# }

# path_to_store_results = Path('../results')
# with open(path_to_store_results / "quickrun_LSTM_fc_multivariate_with_months_parameters.json", "w") as f:
#     dump(parameters, f)

In [7]:
# ### Reload results:

# ## LSTM/fc - univariate:

# # Load loss and correlation results:
# train_loss_all = np.load('../results/quickrun_LSTM_fc_univariate_train_loss_all.npy')
# val_loss_all = np.load('../results/quickrun_LSTM_fc_univariate_val_loss_all.npy')
# test_loss_all = np.load('../results/quickrun_LSTM_fc_univariate_test_loss_all.npy')
# train_correl_all = np.load('../results/quickrun_LSTM_fc_univariate_train_correl_all.npy')
# val_correl_all = np.load('../results/quickrun_LSTM_fc_univariate_val_correl_all.npy')
# test_correl_all = np.load('../results/quickrun_LSTM_fc_univariate_test_correl_all.npy')

# # Load parameters:
# path_to_store_results = Path('../results')
# with open(path_to_store_results / 'quickrun_LSTM_fc_univariate_parameters.json', 'r') as f:
#     parameters=load(f)

# ESMs = parameters['ESMs']
# lead_times = parameters['lead_times']
# n_runs = parameters['n_runs']
# n_epochs = parameters['n_epochs']
        

# #######################################
    
# ## LSTM/fc - multivariate:

# # Load loss and correlation results:
# train_loss_all = np.load('../results/quickrun_LSTM_fc_multivariate_train_loss_all.npy')
# val_loss_all = np.load('../results/quickrun_LSTM_fc_multivariate_val_loss_all.npy')
# test_loss_all = np.load('../results/quickrun_LSTM_fc_multivariate_test_loss_all.npy')
# train_correl_all = np.load('../results/quickrun_LSTM_fc_multivariate_train_correl_all.npy')
# val_correl_all = np.load('../results/quickrun_LSTM_fc_multivariate_val_correl_all.npy')
# test_correl_all = np.load('../results/quickrun_LSTM_fc_multivariate_test_correl_all.npy')

# # Load parameters:
# path_to_store_results = Path('../results')
# with open(path_to_store_results / 'quickrun_LSTM_fc_multivariate_parameters.json', 'r') as f:
#     parameters=load(f)

# ESMs = parameters['ESMs']
# lead_times = parameters['lead_times']
# n_runs = parameters['n_runs']
# n_epochs = parameters['n_epochs']

# #######################################
    
# ## LSTM/fc - multivariate - with months:

# # Load loss and correlation results:
# train_loss_all = np.load('../results/quickrun_LSTM_fc_multivariate_with_months_train_loss_all.npy')
# val_loss_all = np.load('../results/quickrun_LSTM_fc_multivariate_with_months_val_loss_all.npy')
# test_loss_all = np.load('../results/quickrun_LSTM_fc_multivariate_with_months_test_loss_all.npy')
# train_correl_all = np.load('../results/quickrun_LSTM_fc_multivariate_with_months_train_correl_all.npy')
# val_correl_all = np.load('../results/quickrun_LSTM_fc_multivariate_with_months_val_correl_all.npy')
# test_correl_all = np.load('../results/quickrun_LSTM_fc_multivariate_with_months_test_correl_all.npy')

# # Load parameters:
# path_to_store_results = Path('../results')
# with open(path_to_store_results / 'quickrun_LSTM_fc_multivariate_with_months_parameters.json', 'r') as f:
#     parameters=load(f)

# ESMs = parameters['ESMs']
# lead_times = parameters['lead_times']
# n_runs = parameters['n_runs']
# n_epochs = parameters['n_epochs']

### Postprocessing

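We now have loss ('mse') and correlation on the complete training, validation and test data after each epoch (starting with the untrained model as epoch 0).

Next, we look up the **loss and correlation on test data** at the **epoch with minimum validation loss**. Note that these are not necessarily the minimum test loss and maximum test correlation over all epochs, although the variable names `test_loss_min` and `test_correl_max` may suggest so.
This lookup is done separately for each ESM, lead time and model run.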

In a second step, we compute the **mean loss and correlation on test data over all runs**, separately for each ESM and lead time.

In [47]:
## Initialize storage for test loss and correlation at the epoch where val loss takes its minimum,
## dimension (#ESMs, #lead times, #runs).
test_loss_min = np.zeros((len(ESMs),len(lead_times),n_runs))
test_correl_max = np.zeros((len(ESMs),len(lead_times),n_runs))

## Initialize storage for mean test loss and correlation, averaged over all training runs,
## dimension (#ESMs, #lead times).
test_loss_min_mean = np.zeros((len(ESMs),len(lead_times)))
test_correl_max_mean = np.zeros((len(ESMs),len(lead_times)))

## Loop over ESMs:
for m in range(len(ESMs)):
    
    ## Loop over lead times:
    for l in range(len(lead_times)):
        
        # Loop over desired number of training runs:
        for r in range(n_runs):
            
            # Get epoch with minimum validation loss for current ESM, lead time and training run:
            e_min = np.argmin(val_loss_all[m,l,r])
            
            # Store corresponding test loss and correlation:
            test_loss_min[m,l,r] = test_loss_all[m,l,r,e_min]
            test_correl_max[m,l,r] = test_correl_all[m,l,r,e_min]
            
        # Get mean test loss and correlation over all training runs, for current ESM and lead time:
        test_loss_min_mean[m,l] = np.mean(test_loss_min[m,l])
        test_correl_max_mean[m,l] = np.mean(test_correl_max[m,l])
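
The nested loops above can also be expressed without explicit iteration. A minimal sketch with random toy arrays, using `np.argmin` and `np.take_along_axis` (the `*_toy` names are placeholders and not part of the notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy arrays with dimension (#ESMs, #lead times, #runs, #epochs+1):
val_loss_toy = rng.random((2, 3, 3, 21))
test_loss_toy = rng.random((2, 3, 3, 21))

# Epoch with minimum validation loss, per ESM, lead time and run:
e_min_toy = np.argmin(val_loss_toy, axis=-1)

# Pick the test loss at that epoch, then average over runs:
test_loss_at_min_toy = np.take_along_axis(
    test_loss_toy, e_min_toy[..., None], axis=-1
)[..., 0]
test_loss_mean_toy = test_loss_at_min_toy.mean(axis=-1)
print(test_loss_mean_toy.shape)  # (2, 3)
```

The same indexing applies to the correlation arrays, since the epoch is selected from the validation loss in both cases.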

### Results: Univariate LSTM/fc

In [33]:
test_loss_min_mean

array([[1.06639628, 1.09783796, 1.09870807],
       [0.7963384 , 0.82496788, 0.82492727]])

In [32]:
test_correl_max_mean

array([[ 0.17008899,  0.00900148,  0.00200208],
       [ 0.19373106,  0.00629848, -0.00032851]])

### Results: Multivariate LSTM/fc (without months as additional input features)

In [40]:
test_loss_min_mean

array([[0.94346567, 0.98352461, 1.01630537],
       [0.79506946, 0.78965815, 0.83073399]])

In [41]:
test_correl_max_mean

array([[0.38275446, 0.32599457, 0.27692552],
       [0.24301202, 0.18573714, 0.0812276 ]])

### Results: Multivariate LSTM/fc (with months as additional input features)

In [48]:
test_loss_min_mean

array([[0.91628381, 0.93798229, 0.98725102],
       [0.75946334, 0.77863034, 0.84218491]])

In [49]:
test_correl_max_mean

array([[0.41818234, 0.39294707, 0.32146911],
       [0.30634719, 0.2251764 , 0.09865145]])

### Discussion: LSTM / fc models - univariate/multivariate (with/without months as additional input features)

Here, we tried to **predict future** Sahel rainfall (lead times 1 / 3 / 6 months) from current and past information (t<=0) of all input features (including PREC_SAHEL) with **LSTM/fc** models.

We started with predicting future Sahel rainfall from its own history alone, hence with **univariate** inputs. This gives only poor results, with high loss ('mse') and low correlation between model predictions and true targets: correlation stays below 0.2 for lead time 1 and is close to zero for lead times 3 and 6.

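Adding further input features to have **multivariate** inputs clearly improves correlation: for CESM, it rises from 0.17 / 0.01 / 0.00 to 0.38 / 0.33 / 0.28 for lead times 1 / 3 / 6. The reduction in loss is smaller.
And ultimately, we added **months as additional input features**, which gives us the highest correlation for both ESMs and all lead times.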

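However, with multivariate inputs we find higher correlation for models trained on **CESM** data, compared to **FOCI**, although the test loss is lower for FOCI in all settings.
And we see unreasonable behaviour for models trained on FOCI data with lead time 6: The lowest loss (0.825) is found for models trained on univariate inputs and the highest loss (0.842) for models trained on all inputs including months, even though correlation increases from 0.00 to 0.10. We would have expected the reverse order for the loss.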
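
The comparison between setups can be made explicit by collecting the mean test losses reported above in one array and ranking the setups per ESM and lead time. This is a small sketch using only the numbers printed in the results sections; the array layout (#setups, #ESMs, #lead times) is a choice made here for illustration.

```python
import numpy as np

setups = ["univariate", "multivariate", "multivariate with months"]
ESMs = ["CESM", "FOCI"]
lead_times = [1, 3, 6]

# Mean test loss as reported above, dimension (#setups, #ESMs, #lead times):
test_loss = np.array([
    [[1.06639628, 1.09783796, 1.09870807], [0.7963384, 0.82496788, 0.82492727]],
    [[0.94346567, 0.98352461, 1.01630537], [0.79506946, 0.78965815, 0.83073399]],
    [[0.91628381, 0.93798229, 0.98725102], [0.75946334, 0.77863034, 0.84218491]],
])

# Index of the setup with lowest loss, per ESM and lead time:
best = np.argmin(test_loss, axis=0)
for m, ESM in enumerate(ESMs):
    for l, lead_time in enumerate(lead_times):
        print(f"{ESM}, lead time {lead_time}: lowest loss for {setups[best[m, l]]}")
```

Only FOCI with lead time 6 deviates from the expected order. Since each setting is based on just three runs, the spread over runs (for example `np.std(test_loss_min, axis=-1)`) may help to judge whether a difference of this size is meaningful.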