This notebook uses KIM to predict heat fluxes from selected predictors, based on eddy covariance data.

In [1]:
# Libraries
from pathlib import Path
import pandas as pd
import numpy as np

from kim.map import KIM
from kim.data import Data
from kim.mapping_model import MLP

import jax

%load_ext autoreload
%autoreload 2


# Read the data
- The `Output_fluxes_daily.csv` file includes the observations of heat fluxes to be predicted, represented by $\mathbf{Y}$.
- The `Input_forcings_daily.csv` file includes the observations of forcing data, represented by $\mathbf{X}$.

In [2]:
# File and folder paths
f_y = Path("./data/Output_fluxes_daily.csv")
f_x = Path("./data/Input_forcings_daily.csv")


In [4]:
# Load the forcings (X) and the fluxes (Y), using the first column as the index
df_x = pd.read_csv(f_x, index_col=0)
df_y = pd.read_csv(f_y, index_col=0)
x_keys, y_keys = df_x.keys(), df_y.keys()
x, y = df_x.values, df_y.values
x.shape, y.shape


((1279, 10), (1279, 3))
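
Because `x` and `y` are extracted as plain arrays, the row alignment between the two files is taken for granted. A minimal sketch of a sanity check follows, using small synthetic frames in place of the real CSV files. The column names, dates, and the `check_alignment` helper are made up for illustration and are not part of KIM.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the forcing and flux tables (hypothetical columns).
dates = pd.date_range("2020-01-01", periods=5, freq="D")
df_x_demo = pd.DataFrame(
    {"forcing_a": np.arange(5.0), "forcing_b": np.ones(5)}, index=dates
)
df_y_demo = pd.DataFrame({"flux_a": np.linspace(0.0, 1.0, 5)}, index=dates)


def check_alignment(df_in: pd.DataFrame, df_out: pd.DataFrame) -> bool:
    """Return True if both frames share the same index and contain no NaNs."""
    same_index = df_in.index.equals(df_out.index)
    no_missing = not (df_in.isna().any().any() or df_out.isna().any().any())
    return bool(same_index and no_missing)


aligned = check_alignment(df_x_demo, df_y_demo)
print(aligned)
```

Running the same check on `df_x` and `df_y` before calling `.values` would catch a reordered or incomplete file early.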

# Configurations

## Preliminary analysis configuration

In [5]:
# The random seed used in the statistical significance test
seed_shuffle = 1234

# The folder where the data analysis results will be saved
f_data_save = Path("./results/data_daily")


In [7]:
# Data configuration
data_params = {
    "xscaler_type": "minmax",
    "yscaler_type": "minmax",
}

# Sensitivity analysis configuration
sensitivity_params = {
    "method": "pc", "metric": "it-knn",
    "sst": True, "ntest": 100, "alpha": 0.05, "k": 3,
    "n_jobs": 20, "seed_shuffle": seed_shuffle,
    "verbose": 1
}
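
The `sst`, `ntest`, `alpha`, and `seed_shuffle` entries point to a shuffle-based statistical significance test. The sketch below illustrates the general idea only: it is not KIM's implementation, and it uses the absolute Pearson correlation as a stand-in for the kNN-based information-theoretic metric. The function names `shuffle_test` and `abs_corr` are hypothetical.

```python
import numpy as np


def shuffle_test(x, y, metric, ntest=100, alpha=0.05, seed=1234):
    """Shuffle-based significance test for a dependence metric.

    Returns the observed metric, the (1 - alpha) quantile of the metric
    under shuffling, and whether the observed value exceeds that quantile.
    """
    rng = np.random.default_rng(seed)
    observed = metric(x, y)
    # Shuffling x breaks its dependence on y but keeps its marginal distribution.
    null = np.array([metric(rng.permutation(x), y) for _ in range(ntest)])
    threshold = np.quantile(null, 1.0 - alpha)
    return observed, threshold, bool(observed > threshold)


def abs_corr(a, b):
    # Stand-in metric (assumption): absolute Pearson correlation.
    return abs(np.corrcoef(a, b)[0, 1])


rng = np.random.default_rng(0)
x_dep = rng.normal(size=500)
y_demo = 2.0 * x_dep + 0.1 * rng.normal(size=500)  # strongly depends on x_dep
x_ind = rng.normal(size=500)                       # generated independently of y_demo

obs_dep, thr_dep, sig_dep = shuffle_test(x_dep, y_demo, abs_corr)
obs_ind, thr_ind, sig_ind = shuffle_test(x_ind, y_demo, abs_corr)
print(sig_dep, sig_ind)
```

An input is kept only when its observed metric exceeds the shuffled threshold. With `alpha = 0.05`, roughly one in twenty unrelated inputs can still pass by chance.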


## Ensemble learning configuration

In [8]:
# Basic ensemble learning configuration
Ns_train = 365
Ns_val = 365
hidden_activation = 'sigmoid'
final_activation = 'leaky_relu'
seed_ens = 1024
seed_predict = 3636
seed_dl = 10
seed_model = 100
training_verbose = 1
n_models = 100
n_jobs = 20

# Locations where the ensemble learning results will be saved
f_kim_save1 = Path("./results/map_many2many_daily")
f_kim_save2 = Path("./results/map_many2one_daily")
f_kim_save3 = Path("./results/map_many2one_cond_daily")
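
With 1279 daily samples, `Ns_train = 365` and `Ns_val = 365` correspond to about one year of data each and leave 549 samples unused by training and validation. How KIM actually selects the samples is not shown in this notebook; the sketch below assumes a simple sequential split purely to make the arithmetic concrete.

```python
import numpy as np

n_total = 1279            # number of daily samples reported above
n_train, n_val = 365, 365

# Assumption: a sequential split; KIM's own selection may differ.
idx = np.arange(n_total)
idx_train = idx[:n_train]
idx_val = idx[n_train:n_train + n_val]
idx_rest = idx[n_train + n_val:]   # samples left for held-out evaluation

print(len(idx_train), len(idx_val), len(idx_rest))
```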


In [9]:
# Mapping parameters for each test below
map_configs = {
    "model_type": MLP,
    'n_model': n_models,
    'ensemble_type': 'ens_random',
    'model_hp_choices': {
        "depth": [1,3,5,6],
        "width_size": [3,6,10]
    },
    'model_hp_fixed': {
        "hidden_activation": hidden_activation,
        "final_activation": final_activation,
        "model_seed": seed_model
    },
    'optax_hp_choices': {
        'learning_rate': [0.01, 0.005, 0.003],
    },
    'optax_hp_fixed': {
        'nsteps': 300,
        'optimizer_type': 'adam',
    },
    'dl_hp_choices': {
    },
    'dl_hp_fixed': {
        'dl_seed': seed_dl,
        'num_train_sample': Ns_train,
        'num_val_sample': Ns_val,
        'batch_size': 64
    },
    'ens_seed': seed_ens,
    'training_parallel': True,
    'parallel_config': {
        'n_jobs': n_jobs, 
        'backend': 'loky',
        'verbose': 1
    },
    'device': None,
}
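
The `*_hp_choices` entries define the values from which each ensemble member's hyperparameters are drawn. The sketch below shows one plausible reading of `'ens_random'`, namely independent uniform sampling of each hyperparameter; this is an assumption, and `sample_configs` is a hypothetical helper rather than a KIM function.

```python
import numpy as np

hp_choices = {
    "depth": [1, 3, 5, 6],
    "width_size": [3, 6, 10],
    "learning_rate": [0.01, 0.005, 0.003],
}


def sample_configs(choices, n_model, seed):
    """Draw n_model configurations by sampling each hyperparameter independently."""
    rng = np.random.default_rng(seed)
    configs = []
    for _ in range(n_model):
        configs.append(
            {name: values[rng.integers(len(values))] for name, values in choices.items()}
        )
    return configs


configs = sample_configs(hp_choices, n_model=100, seed=1024)

# Size of the full grid versus the number of distinct sampled combinations.
n_grid = int(np.prod([len(v) for v in hp_choices.values()]))
n_unique = len({tuple(c.items()) for c in configs})
print(n_grid, n_unique)
```

The grid contains only 4 × 3 × 3 = 36 distinct combinations, so an ensemble of 100 models necessarily repeats some of them. Whether repeated combinations differ in other respects, such as initialization, is not shown in this notebook.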

# Perform preliminary data analysis
The analysis includes both a sensitivity analysis and a redundancy filtering check.

In [10]:
data = Data(x, y, **data_params)
data.calculate_sensitivity(**sensitivity_params)


Using the kNN-based information theoretic metrics ...
Performing pairwise analysis to remove insensitive inputs ...


100%|██████████| 10/10 [07:35<00:00, 45.58s/it]


Performing conditional independence testing to remove redundant inputs ...


# Train the mappings

Now, let's train the mappings via ensemble learning. Consistent with the data setup above, the forcings $\mathbf{X}$ are the inputs and the heat fluxes $\mathbf{Y}$ are the outputs. We train three types of mappings:
- `kim1`: The naive mapping from all $\mathbf{X}$ to all $\mathbf{Y}$
- `kim2`: The knowledge-informed mapping from the sensitive $\mathbf{X}$ to each of $\mathbf{Y}$, using global sensitivity analysis
- `kim3`: The knowledge-informed mapping from the sensitive $\mathbf{X}$ to each of $\mathbf{Y}$, using global sensitivity analysis plus the redundancy filtering check
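
The difference between the `many2many` and `many2one` options can be sketched with a boolean mask that selects, for each output, the inputs retained by the preliminary analysis. The mask values and array names below are hypothetical, and KIM's internal representation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(8, 4))   # 8 samples, 4 candidate inputs

# Hypothetical mask: mask[i, j] is True if input i is relevant to output j.
mask = np.array([
    [True,  False],
    [True,  True],
    [False, False],
    [False, True],
])

# many2many: a single model sees every input and predicts every output.
x_many2many = x_demo

# many2one: one model per output, fed only the inputs flagged for that output.
x_many2one = [x_demo[:, mask[:, j]] for j in range(mask.shape[1])]

print(x_many2many.shape, [xj.shape for xj in x_many2one])
```

In this toy mask the third input is dropped for both outputs, which is the kind of reduction that the sensitivity and redundancy checks aim for.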


In [11]:
# Initialize three different KIMs
kim1 = KIM(data, map_configs, map_option='many2many')
kim2 = KIM(data, map_configs, mask_option="sensitivity", map_option='many2one')
kim3 = KIM(data, map_configs, mask_option="cond_sensitivity", map_option='many2one')

# Train the mappings
kim1.train()
kim2.train()
kim3.train()



 Performing ensemble training in parallel with 100 model configurations...

[Parallel(n_jobs=20)]: Using backend LokyBackend with 20 concurrent workers.
100%|██████████| 300/300 [00:03<00:00, 78.69it/s]
[Parallel(n_jobs=20)]: Done  10 tasks      | elapsed:    5.6s
[Parallel(n_jobs=20)]: Done 100 out of 100 | elapsed:   58.8s finished

Training completes.

[Training log condensed: this sequence of messages repeats for each of the seven ensemble runs. The other recorded totals are 51.6s and 32.3s, and the truncated per-model progress bars are omitted.]


# Save the results to disk

In [12]:
# Preliminary analysis results
data.save(f_data_save)

# Mapping results
kim1.save(f_kim_save1)
kim2.save(f_kim_save2)
kim3.save(f_kim_save3)
