This notebook uses KIM to predict annual soil respiration from selected predictors in the Soil Respiration Database (SRDB).

In [1]:
# Libraries
from pathlib import Path
import pandas as pd

from kim.map import KIM
from kim.data import Data
from kim.mapping_model import MLP


# Read the data

In [2]:
# File and folder paths
f_data = Path('./selected_SRDB.csv')

In [3]:
df = pd.read_csv(f_data)
df.head()

Unnamed: 0,Site_ID,Latitude,Longitude,MAT,MAP,Annual_coverage,Soil_BD,Rs_annual
0,US-FIFG-600PPM,38.73,-120.8,18.0,1000.0,0.25,1.14,647.0
1,CA-CAR-CHMB3,49.85,-125.32,8.6,1452.0,1.0,1.35,2200.0
2,IE-OPR-PASTURE,52.85,-6.9,9.4,824.0,1.0,1.42,1110.0
3,CN-DHS-BF,23.16,112.51625,21.825,1746.75,1.0,0.85,1047.9125
4,US-NC-LFREQ,35.78,-75.9,16.9,1270.0,1.0,0.08,1077.0


In [4]:
# Predictors
x_keys = [
    "Latitude", "Longitude", "MAT", "MAP", "Annual_coverage", "Soil_BD"
]

# Predictands
y_keys = [
    "Rs_annual"
    # , "Ra_annual", "Rh_annual", "GPP"
]


In [5]:
x, y = df[x_keys].values, df[y_keys].values


In [6]:
x.shape, y.shape

((823, 6), (823, 1))

# Configurations

## Preliminary analysis configuration

In [7]:
seed_shuffle = 1234
f_data_save = Path("./results/data")


In [8]:
# Data configuration
data_params = {
    "xscaler_type": "minmax",
    "yscaler_type": "minmax",
}

# Sensitivity analysis configuration
sensitivity_params = {
    "method": "pc", "metric": "it-knn",
    "sst": True, "ntest": 100, "alpha": 0.05, "k": 3,
    "n_jobs": 50, "seed_shuffle": seed_shuffle,
    "verbose": 1
}
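
The "minmax" scaler type above rescales each column to the range [0, 1] before the sensitivity analysis and training. The sketch below shows the idea in plain Python; the helper names (`fit_minmax`, `transform_minmax`, `inverse_minmax`) are hypothetical and are not part of the KIM API, whose internal scaler may differ in detail.

```python
def fit_minmax(rows):
    """Return the per-column (min, max) pairs of a list of equal-length rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def transform_minmax(rows, bounds):
    """Scale every column to [0, 1]; a constant column maps to 0.0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, bounds)]
        for row in rows
    ]


def inverse_minmax(rows, bounds):
    """Map scaled values back to the original units."""
    return [[v * (hi - lo) + lo for v, (lo, hi) in zip(row, bounds)] for row in rows]


# MAT and MAP values taken from the first three rows of df.head() above
sample = [[18.0, 1000.0], [8.6, 1452.0], [9.4, 824.0]]
bounds = fit_minmax(sample)
scaled = transform_minmax(sample, bounds)
restored = inverse_minmax(scaled, bounds)
```

Scaling both predictors and predictands keeps variables with very different units (for example MAP in mm and Soil_BD in g/cm³) on a comparable footing.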


## Ensemble learning configuration

In [9]:
Ns_train = 600
Ns_val = 100
hidden_activation = 'sigmoid'
final_activation = 'leaky_relu'
seed_ens = 1024
seed_predict = 3636
seed_dl = 10
seed_model = 100
training_verbose = 1
n_models = 100
n_jobs = 50

f_kim_save1 = Path("./results/map_many2many")
f_kim_save2 = Path("./results/map_many2one")
f_kim_save3 = Path("./results/map_many2one_cond")


In [10]:
# Mapping parameters for each test below
map_configs = {
    "model_type": MLP,
    'n_model': n_models,
    'ensemble_type': 'ens_random',
    'model_hp_choices': {
        "depth": [1,3,5,6],
        "width_size": [3,6,10]
    },
    'model_hp_fixed': {
        "hidden_activation": hidden_activation,
        "final_activation": final_activation,
        "model_seed": seed_model
    },
    'optax_hp_choices': {
        'learning_rate': [0.01, 0.005, 0.003],
    },
    'optax_hp_fixed': {
        'nsteps': 300,
        'optimizer_type': 'adam',
    },
    'dl_hp_choices': {
    },
    'dl_hp_fixed': {
        'dl_seed': seed_dl,
        'num_train_sample': Ns_train,
        'num_val_sample': Ns_val,
        'batch_size': 64
    },
    'ens_seed': seed_ens,
    'training_parallel': True,
    'parallel_config': {
        'n_jobs': n_jobs, 
        'backend': 'loky',
        'verbose': 1
    },
    'device': None,
}
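
The configuration above defines a small search space: 4 depths × 3 widths × 3 learning rates = 36 combinations, from which 100 ensemble members are built. The sketch below shows one plausible reading of `'ens_random'`, namely that each member draws its hyperparameters independently and uniformly from the listed choices. This is an assumption about KIM's behavior, and `sample_configs` is a hypothetical helper, not a KIM function.

```python
import random

model_hp_choices = {"depth": [1, 3, 5, 6], "width_size": [3, 6, 10]}
optax_hp_choices = {"learning_rate": [0.01, 0.005, 0.003]}


def sample_configs(choices, n_model, seed):
    """Draw one value per hyperparameter, independently, for each ensemble member."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in choices.items()} for _ in range(n_model)]


all_choices = {**model_hp_choices, **optax_hp_choices}
configs = sample_configs(all_choices, n_model=100, seed=1024)

# Size of the full grid and number of distinct configurations actually drawn
n_combos = 1
for values in all_choices.values():
    n_combos *= len(values)
n_unique = len({tuple(sorted(c.items())) for c in configs})
print(n_combos)  # 36
```

Because 100 members are drawn from only 36 possible combinations, some configurations necessarily repeat under this reading; repeated members would then differ only through other sources of randomness, such as initialization or batching.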

# Exploratory data analysis

In [11]:
data = Data(x, y, **data_params)
data.calculate_sensitivity(**sensitivity_params)
# Save the sensitivity analysis to disk
data.save(f_data_save)


Using the kNN-based information theoretic metrics ...
Performing pairwise analysis to remove insensitive inputs ...


100%|██████████| 6/6 [05:59<00:00, 59.91s/it] 


Performing conditional independence testing to remove redundant inputs ...
Thu Nov 28 19:16:28 2024:    ERROR: auth.munge: munge_decode() failed. Socket communication error
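
The log above reports a pairwise analysis followed by conditional independence testing. The settings `"sst": True`, `"ntest": 100` and `"alpha": 0.05` suggest a shuffle-based significance test, although that interpretation is an assumption here. The sketch below illustrates the general idea for the pairwise step only: the dependence between one input and the output is compared against values obtained after shuffling the output. It uses the absolute Pearson correlation as a simple stand-in for the kNN-based information-theoretic metric, and `shuffle_test` is a hypothetical helper, not the KIM implementation.

```python
import random


def abs_pearson(a, b):
    """Absolute Pearson correlation; returns 0.0 when either series is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    if va == 0 or vb == 0:
        return 0.0
    return abs(cov) / (va * vb) ** 0.5


def shuffle_test(x, y, metric, ntest=100, alpha=0.05, seed=1234):
    """Permutation test: is metric(x, y) larger than expected under shuffling?"""
    rng = random.Random(seed)
    observed = metric(x, y)
    y_perm = list(y)
    n_exceed = 0
    for _ in range(ntest):
        rng.shuffle(y_perm)          # break any dependence between x and y
        if metric(x, y_perm) >= observed:
            n_exceed += 1
    p_value = (n_exceed + 1) / (ntest + 1)
    return p_value < alpha, p_value


# Synthetic, strongly dependent pair (not SRDB data)
x_demo = [float(i) for i in range(50)]
y_demo = [2.0 * v + (i % 3) for i, v in enumerate(x_demo)]
keep, p_value = shuffle_test(x_demo, y_demo, abs_pearson)
```

An input whose observed metric does not stand out from the shuffled values would be flagged as insensitive under this scheme.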


In [12]:
data.sensitivity_mask

array([[ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True]])

In [13]:
data.cond_sensitivity_mask

array([[ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True]])
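
Both masks have shape (6, 1) and contain only `True`, so neither step removed any of the six predictors for `Rs_annual`. The sketch below shows how such a mask could be read, assuming rows index inputs and columns index outputs (consistent with six predictors and one predictand). The helper `inputs_for_output` and the second mask are hypothetical, for illustration only.

```python
x_keys = ["Latitude", "Longitude", "MAT", "MAP", "Annual_coverage", "Soil_BD"]
y_keys = ["Rs_annual"]

# Mask as printed above: one row per input, one column per output
mask = [[True], [True], [True], [True], [True], [True]]


def inputs_for_output(mask, x_keys, j):
    """Return the names of the inputs retained for output column j."""
    return [key for key, row in zip(x_keys, mask) if row[j]]


kept = inputs_for_output(mask, x_keys, 0)

# Hypothetical mask in which Longitude had been flagged as insensitive
example_mask = [[True], [False], [True], [True], [True], [True]]
kept_example = inputs_for_output(example_mask, x_keys, 0)
```

With the masks obtained here, a mapping restricted by either mask would presumably see the same six inputs as an unmasked one.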

# Train the inverse mapping

In [14]:
# Initialize three different KIMs
kim1 = KIM(data, map_configs, map_option='many2many')
kim2 = KIM(data, map_configs, mask_option="sensitivity", map_option='many2one')
kim3 = KIM(data, map_configs, mask_option="cond_sensitivity", map_option='many2one')

# Train the mappings
kim1.train()
kim2.train()
kim3.train()



 Performing ensemble training in parallel with 100 model configurations...



[Parallel(n_jobs=50)]: Using backend LokyBackend with 50 concurrent workers.
100%|██████████| 300/300 [00:04<00:00, 72.94it/s]

Training completes.

 Performing ensemble training in parallel with 100 model configurations...



100%|██████████| 300/300 [00:03<00:00, 78.36it/s]


Training completes.

 Performing ensemble training in parallel with 100 model configurations...



100%|██████████| 300/300 [00:05<00:00, 57.07it/s]
100%|██████████| 300/300 [00:04<00:00, 60.19it/s]
[Parallel(n_jobs=50)]: Done 100 out of 100 | elapsed:  1.6min finished
[Parallel(n_jobs=50)]: Using backend LokyBackend with 50 concurrent workers.
  pid = os.fork()

Training completes.


100%|██████████| 300/300 [00:05<00:00, 51.51it/s]
[Parallel(n_jobs=50)]: Done 100 out of 100 | elapsed:  1.5min finished


In [15]:
# Save the trained mappings to disk
kim1.save(f_kim_save1)
kim2.save(f_kim_save2)
kim3.save(f_kim_save3)
