### Simple heuristics to predict Sahel rainfall

As a simple baseline, we try **two heuristics**:

- Use the value of the **previous month** as prediction for the current month. This works only for **lead time = 1**.
- Use the value of the **same month in the previous year** as prediction for the current month, i.e. a shift of 12 months.
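
Both heuristics amount to shifting the target series by a fixed number of months and comparing it with itself. A minimal sketch, assuming a one-dimensional NumPy array of monthly values; the helper name `evaluate_lag_baseline` and the toy series are illustrative only and not part of `predict_sahel_rainfall`:

```python
import numpy as np


def evaluate_lag_baseline(series, lag):
    """Score the heuristic 'predict the value observed `lag` months earlier'.

    Hypothetical helper for illustration; returns (mse, correlation).
    """
    series = np.asarray(series, dtype=float)
    if not 0 < lag < len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    truth = series[lag:]        # values to be predicted
    prediction = series[:-lag]  # values observed `lag` months earlier
    mse = np.mean((truth - prediction) ** 2)
    correl = np.corrcoef(truth, prediction)[0, 1]
    return mse, correl


# Toy series with a perfect 12-month cycle (made-up data, not CICMoD):
toy = np.tile(np.arange(12.0), 5)

mse_month, correl_month = evaluate_lag_baseline(toy, lag=1)   # previous month
mse_year, correl_year = evaluate_lag_baseline(toy, lag=12)    # previous year
```

On this toy series the previous-year heuristic is perfect by construction, while the previous-month heuristic is penalized at every wrap from 11 back to 0. With the arrays prepared later in this notebook, a call such as `evaluate_lag_baseline(test_target[:, 0], lag=1)` would correspond to the previous-month heuristic.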

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

from predict_sahel_rainfall.preprocessing import prepare_inputs_and_target

### Prepare inputs and targets

Load the collection of climate indices directly from the GitHub release.
We run the complete preprocessing pipeline function, although only the target series is needed here.

In [3]:
## Set parameters:

# Set url to csv file containing CICMoD indices from desired release:
data_url = (
    "https://github.com/MarcoLandtHayen/climate_index_collection/"
    "releases/download/v2023.03.29.1/climate_indices.csv"
)

# Choose ESM ('CESM' or 'FOCI'):
ESM = 'FOCI'

# Select target index:
target_index = 'PREC_SAHEL'

# Select all input features:
input_features = [
    'AMO', 'ENSO_12', 'ENSO_3', 'ENSO_34', 'ENSO_4', 'NAO_PC', 'NAO_ST', 
    'NP', 'PDO_PC', 'PREC_SAHEL', 'SAM_PC', 'SAM_ZM', 'SAT_N_ALL', 'SAT_N_LAND',
    'SAT_N_OCEAN', 'SAT_S_ALL', 'SAT_S_LAND', 'SAT_S_OCEAN', 'SOI',
    'SSS_ENA', 'SSS_NA', 'SSS_SA', 'SSS_WNA', 'SST_ESIO', 'SST_HMDR',
    'SST_MED', 'SST_TNA', 'SST_TSA', 'SST_WSIO'
]

# # Select subset of input features:
# input_features = ['PREC_SAHEL', 'SAM_ZM']

# Choose whether to add months as one-hot encoded features:
add_months = False

# Choose whether to normalize target index:
norm_target = True

# Set lead time for target index:
lead_time = 1

# Specify input length:
input_length = 1

# Specify the fraction of all samples used for training and validation combined (the remainder is test data):
train_test_split = 0.9

# Specify the fraction of the combined training and validation data used for training:
train_val_split = 0.8

## Optionally choose to scale or normalize input features according to statistics from training data:
# 'no': Keep raw input features.
# 'scale_01': Scale input features with min/max scaling to [0,1].
# 'scale_11': Scale input features with min/max scaling to [-1,1].
# 'norm': Normalize input features, hence subtract mean and divide by std dev.
scale_norm = 'norm'
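
The four `scale_norm` options can be illustrated with a short sketch. It assumes that all statistics are taken from the training data only and then applied unchanged to the other splits; the helper `scale_features` and the random demo arrays are hypothetical and do not reproduce the actual code in `predict_sahel_rainfall`:

```python
import numpy as np


def scale_features(train, other, scale_norm="norm"):
    """Scale `train` and `other` using statistics from `train` only.

    Hypothetical helper; arrays have shape (samples, features).
    """
    train = np.asarray(train, dtype=float)
    other = np.asarray(other, dtype=float)
    if scale_norm == "no":
        return train, other
    if scale_norm == "norm":
        mean, std = train.mean(axis=0), train.std(axis=0)
        return (train - mean) / std, (other - mean) / std
    tmin, tmax = train.min(axis=0), train.max(axis=0)
    if scale_norm == "scale_01":
        return (train - tmin) / (tmax - tmin), (other - tmin) / (tmax - tmin)
    if scale_norm == "scale_11":
        return (
            2 * (train - tmin) / (tmax - tmin) - 1,
            2 * (other - tmin) / (tmax - tmin) - 1,
        )
    raise ValueError(f"unknown scale_norm: {scale_norm!r}")


rng = np.random.default_rng(0)
train_demo = rng.normal(loc=3.0, scale=2.0, size=(100, 4))  # made-up data
test_demo = rng.normal(loc=3.0, scale=2.0, size=(20, 4))

train_norm, test_norm = scale_features(train_demo, test_demo, "norm")
train_01, test_01 = scale_features(train_demo, test_demo, "scale_01")
train_11, test_11 = scale_features(train_demo, test_demo, "scale_11")
```

Because the statistics come from the training split alone, scaled validation or test values can fall slightly outside [0,1] or [-1,1].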

In [4]:
# Prepare inputs and target:
(
    train_input,
    train_target,
    val_input,
    val_target,
    test_input,
    test_target,
    train_mean,
    train_std,
    train_min,
    train_max,
) = prepare_inputs_and_target(    
    data_url=data_url,
    ESM=ESM,
    target_index=target_index,
    input_features=input_features,
    add_months=add_months,
    norm_target=norm_target,
    lead_time=lead_time,
    input_length=input_length,
    train_test_split=train_test_split,
    train_val_split=train_val_split,
    scale_norm=scale_norm,
)

In [12]:
np.std(np.concatenate([train_target, val_target, test_target]))

1.0000416615970107

In [25]:
# Check dimensions:
print("train_input shape (samples, time steps, features): ", train_input.shape)
print("val_input shape (samples, time steps, features): ", val_input.shape)
print("test_input shape (samples, time steps, features): ", test_input.shape)

print("\ntrain_target shape (samples, 1): ", train_target.shape)
print("val_target shape (samples, 1): ", val_target.shape)
print("test_target shape (samples, 1): ", test_target.shape)

train_input shape (samples, time steps, features):  (8639, 1, 29)
val_input shape (samples, time steps, features):  (2160, 1, 29)
test_input shape (samples, time steps, features):  (1200, 1, 29)

train_target shape (samples, 1):  (8639, 1)
val_target shape (samples, 1):  (2160, 1)
test_target shape (samples, 1):  (1200, 1)
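
The sample counts above are consistent with a chronological split in which `train_test_split` and `train_val_split` are applied as fractions and truncated to integers. The sketch below reproduces the counts under that assumption. The total of 11999 samples is simply the sum of the three reported counts, and the helper `split_sizes` is hypothetical, not the package's actual code:

```python
def split_sizes(n_samples, train_test_split, train_val_split):
    """Return (n_train, n_val, n_test) for a chronological split.

    Hypothetical helper: assumes both fractions are truncated to integers.
    """
    n_train_val = int(n_samples * train_test_split)
    n_train = int(n_train_val * train_val_split)
    n_val = n_train_val - n_train
    n_test = n_samples - n_train_val
    return n_train, n_val, n_test


n_total = 8639 + 2160 + 1200  # sum of the reported sample counts
sizes = split_sizes(n_total, train_test_split=0.9, train_val_split=0.8)
print(sizes)
```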


In [21]:
## Use the previous month as prediction for the current month. This works only for lead time = 1.

## CESM (output from a run with ESM = 'CESM'):

# MSE and correlation on test data:
print('test mse: ', np.round(np.mean((test_target[1:,0]-test_target[:-1,0])**2),3))
print('test correl: ', np.round(np.corrcoef(np.stack([test_target[1:,0],test_target[:-1,0]]))[0,1],3))

test mse:  1.739
test correl:  0.207


In [26]:
## Use the previous month as prediction for the current month. This works only for lead time = 1.

## FOCI:

# MSE and correlation on test data:
print('test mse: ', np.round(np.mean((test_target[1:,0]-test_target[:-1,0])**2),3))
print('test correl: ', np.round(np.corrcoef(np.stack([test_target[1:,0],test_target[:-1,0]]))[0,1],3))

test mse:  1.324
test correl:  0.187


In [22]:
## Use the value of the same month in the previous year as prediction for the current month.

## CESM (output from a run with ESM = 'CESM'):

# MSE and correlation on test data:
print('test mse: ', np.round(np.mean((test_target[12:,0]-test_target[:-12,0])**2),3))
print('test correl: ', np.round(np.corrcoef(np.stack([test_target[12:,0],test_target[:-12,0]]))[0,1],3))

test mse:  2.328
test correl:  -0.058


In [27]:
## Use the value of the same month in the previous year as prediction for the current month.

## FOCI:

# MSE and correlation on test data:
print('test mse: ', np.round(np.mean((test_target[12:,0]-test_target[:-12,0])**2),3))
print('test correl: ', np.round(np.corrcoef(np.stack([test_target[12:,0],test_target[:-12,0]]))[0,1],3))

test mse:  1.47
test correl:  0.089


### Discussion on using simple heuristics to predict Sahel rainfall

We find the autocorrelation of the Sahel precipitation index to be rather low. For a time shift of only **one month**, the correlation is already as low as 0.207 and 0.187 for CESM and FOCI test data, respectively. Increasing the time shift to **one year**, the correlation reads -0.058 and 0.089 for CESM and FOCI test data, respectively.
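
The same shift-and-correlate computation can be extended to a range of lags to inspect the full autocorrelation function. A minimal sketch on a synthetic AR(1) series, which only stands in for the target; the helper `autocorrelation` and the coefficient `phi = 0.2` are assumptions chosen for illustration:

```python
import numpy as np


def autocorrelation(series, max_lag):
    """Pearson correlation between a series and its lagged copy, for lags 1..max_lag."""
    series = np.asarray(series, dtype=float)
    return np.array(
        [np.corrcoef(series[lag:], series[:-lag])[0, 1] for lag in range(1, max_lag + 1)]
    )


# Synthetic AR(1) series as a stand-in for the target (made-up data):
rng = np.random.default_rng(42)
phi = 0.2  # assumed lag-1 coefficient, chosen near the correlations reported above
noise = rng.normal(size=1200)
ar1 = np.zeros(1200)
for t in range(1, 1200):
    ar1[t] = phi * ar1[t - 1] + noise[t]

acf = autocorrelation(ar1, max_lag=12)
print(np.round(acf, 3))
```

Replacing `ar1` with `test_target[:, 0]` would give the lag-1 and lag-12 correlations as the first and last entries of `acf`.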

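The observed MSE (between 1.324 and 2.328 on the normalized target) is considerably higher than in former experiments with simple CNN/fc models and linear regression. Since the target is normalized to roughly unit variance, a constant prediction of the mean would be expected to yield an MSE close to 1, so both heuristics do worse than this trivial reference. We therefore find simple heuristics to be of no use as predictors here.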
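
The link between low correlation and high MSE can be made explicit. For truth `y` and prediction `p`, the MSE decomposes exactly into `var(y) + var(p) - 2 * r * std(y) * std(p) + (mean(y) - mean(p))**2`. For a shifted copy of a series with roughly unit variance this is about `2 * (1 - r)`, which exceeds 1 whenever `r` is below 0.5. The sketch below checks the decomposition on made-up white noise, not on the notebook's data:

```python
import numpy as np

rng = np.random.default_rng(1)
series = rng.normal(size=1000)  # made-up stand-in for a normalized target

truth, prediction = series[1:], series[:-1]  # previous-month heuristic

mse = np.mean((truth - prediction) ** 2)
r = np.corrcoef(truth, prediction)[0, 1]

# Exact decomposition of the MSE (population variances, ddof=0):
decomposed = (
    truth.var()
    + prediction.var()
    - 2 * r * truth.std() * prediction.std()
    + (truth.mean() - prediction.mean()) ** 2
)

# Reference: always predict the mean of the series being scored.
mse_mean_prediction = np.mean((truth - truth.mean()) ** 2)

print("persistence mse:", round(mse, 3))
print("decomposition:", round(decomposed, 3))
print("mean-prediction mse:", round(mse_mean_prediction, 3))
```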