# 2.0.2: Area of Applicability (AoA)

While spatial K-fold cross-validation (SKCV) reduces the influence of spatial dependence on model error estimates, it may not be sufficient to characterize how well a model transfers to areas where the predictor data differ significantly from the reference data (Meyer & Pebesma, 2021). To address this, we can apply the methods of Meyer & Pebesma (2021) to compute the dissimilarity index (DI) between unseen predictor data (“new”) and the predictor data on which models were trained (“train”), and from it derive the area of applicability (AoA) of the final trait maps.

Briefly, dissimilarity in predictor space is calculated in two steps. First, for each observation in “train”, we compute the minimum distance to the observations in all other cross-validation folds, and then average these minimum distances. Second, we compute the minimum distance from each observation in “new” to the training data. The DI is then each “new” distance divided by the mean “train” distance. The 95th percentile of the “train” DI values can then be set as the threshold for the AoA: values below the threshold lie within the AoA, and values above it lie outside.
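
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the brief description given here, not the reference implementation: the function names, the synthetic data, and the fold assignment are all hypothetical, distances are plain Euclidean, and the predictor standardization and importance weighting used in the published method are omitted.

```python
import numpy as np


def min_dist_to_other_folds(X, folds):
    """For each row of X, minimum Euclidean distance to any row in a different fold."""
    out = np.empty(len(X))
    for i in range(len(X)):
        others = X[folds != folds[i]]
        out[i] = np.sqrt(((others - X[i]) ** 2).sum(axis=1)).min()
    return out


def min_dist_to_train(X_new, X_train):
    """For each row of X_new, minimum Euclidean distance to any row of X_train."""
    diffs = X_new[:, None, :] - X_train[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


# Synthetic stand-ins for "train" and "new" (assumed values, for illustration only)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
folds = rng.integers(0, 5, size=200)  # hypothetical CV fold labels
X_new = np.vstack([
    rng.normal(size=(5, 3)),            # drawn from the training distribution
    rng.normal(loc=10.0, size=(5, 3)),  # far from anything seen in training
])

# Step 1: mean of the cross-validated minimum distances within "train"
train_min = min_dist_to_other_folds(X_train, folds)
mean_train_dist = train_min.mean()

# Step 2: DI of "new" = minimum distance to "train" / mean "train" distance
di_train = train_min / mean_train_dist
di_new = min_dist_to_train(X_new, X_train) / mean_train_dist

# Threshold: 95th percentile of the "train" DI values
threshold = np.quantile(di_train, 0.95)
inside_aoa = di_new <= threshold
```

The pairwise distance arrays here fit in memory only because the toy data are small; for a dataset of the size used in this notebook, a nearest-neighbor index or chunked computation would be needed.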

## Imports and config

In [1]:
import numpy as np
import pandas as pd
from verstack import NaNImputer

from src.conf.conf import get_config
from src.conf.environment import log
from src.utils.dataset_utils import get_train_fn, get_cv_splits, add_cv_splits_to_column


cfg = get_config()



## Load training data

In [2]:
train_fn = get_train_fn(cfg)

train_df = pd.read_parquet(train_fn)

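# Separate coordinate columns, trait (response) columns, and predictor columns
coord_cols = ["x", "y"]
# Trait columns are identified by their "X" prefix (e.g. "X4_mean")
y_cols = [col for col in train_df.columns if col.startswith("X")]
# Everything that is neither a trait nor a coordinate is a predictor
x_cols = train_df.columns.difference(y_cols + coord_cols).to_list()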

Let's just focus on a single trait for the purposes of this exploration.

In [3]:
y_col = y_cols[0]
df = train_df[x_cols + [y_col]].copy()
xy = train_df[coord_cols]

cv_splits = get_cv_splits(cfg, y_col)
df = add_cv_splits_to_column(df, cv_splits)

df.head()

| | ETH_GlobalCanopyHeightSD_2020_v1 | ETH_GlobalCanopyHeight_2020_v1 | bdod_0-5cm_mean | bdod_100-200cm_mean | bdod_15-30cm_mean | bdod_30-60cm_mean | bdod_5-15cm_mean | bdod_60-100cm_mean | cec_0-5cm_mean | cec_100-200cm_mean | ... | vodca_x-band_p5 | vodca_x-band_p95 | wc2.1_30s_bio_1 | wc2.1_30s_bio_12 | wc2.1_30s_bio_13-14 | wc2.1_30s_bio_15 | wc2.1_30s_bio_4 | wc2.1_30s_bio_7 | X4_mean | cv_split |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | 514.0 | 257.0 | ... | 9998.0 | 10628.0 | -11.3625 | 0.0 | 0.0 | 0.0 | 1278.113159 | 40.300003 | 0.465956 | 0.0 |
| 1 | 2.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | 500.0 | 243.0 | ... | 15506.0 | 14737.0 | -11.970834 | 364.0 | 50.0 | 54.235405 | 1224.654785 | 38.400002 | 0.504636 | 0.0 |
| 2 | 10.0 | 25.0 | 105.0 | 119.0 | 111.0 | 114.0 | 109.0 | 118.0 | 234.0 | 205.0 | ... | NaN | NaN | 25.6 | 2662.0 | 251.0 | 38.029896 | 119.125633 | 7.6 | 0.526075 | 1.0 |
| 3 | 3.0 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | 549.0 | 265.0 | ... | 7909.0 | 9810.0 | -11.2125 | 273.0 | 40.0 | 55.311668 | 1290.799805 | 40.299999 | 0.354417 | 0.0 |
| 4 | 10.0 | 25.0 | 112.0 | 124.0 | 116.0 | 119.0 | 115.0 | 123.0 | 209.0 | 187.0 | ... | NaN | NaN | 25.825001 | 2476.0 | 231.0 | 41.319893 | 119.914726 | 7.599998 | 0.52777 | 1.0 |


## Impute missing values

The dataframe output above shows that our training data contain missing values (`NaN`s). AutoGluon `TabularPredictor` models tolerate missing values, so we did not impute any data prior to training. The method described by Meyer & Pebesma (2021), however, does not address the handling of missing values, which leaves us with a few options when calculating the AoA:

a) drop all observations in “new” and “train” that contain any missing features. This misrepresents the actual spatial coverage of the training data and reduces the feature-space variance in the DI calculation relative to the reference data actually used in training, possibly resulting in an overly pessimistic AoA;

b) drop all features in “new” and “train” that contain any missing values, likely resulting in an overly optimistic AoA because feature-space complexity is reduced; or

c) a “middle ground” approach of imputing missing values and assuming that the resulting predictor space is still a robust representation of the true reference data.

To retain as much of the original signature of the true reference data as possible when calculating dissimilarity, and to ensure that the final AoA maps match the geographic extent of the predictions, we will choose option “c”. Note, however, that given the novelty of this method, there is likely room for improvement.
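
To make the trade-off between the three options concrete, the sketch below applies each of them to a small toy table. The column names and values are hypothetical, and the per-column median fill in option (c) is only a simple stand-in for an imputation step, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy predictor table (hypothetical values) in which every row has at least
# one gap, similar to the pattern visible in the first rows of the training data.
toy = pd.DataFrame(
    {
        "canopy_height": [4.0, 1.0, 25.0, 3.0, 25.0],
        "bulk_density": [np.nan, np.nan, 105.0, np.nan, 112.0],
        "vod": [9998.0, 15506.0, np.nan, 7909.0, np.nan],
    }
)

# Option (a): drop every observation with at least one missing feature.
opt_a = toy.dropna(axis="index")

# Option (b): drop every feature with at least one missing value.
opt_b = toy.dropna(axis="columns")

# Option (c): impute; a per-column median is used here only as a stand-in.
opt_c = toy.fillna(toy.median())

print(opt_a.shape, opt_b.shape, opt_c.shape)
```

In this toy case option (a) leaves no observations at all and option (b) leaves a single feature, while option (c) keeps the full table, which is the motivation for imputing.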

### Imputation method

For a model-based imputation of missing values we can use the `NaNImputer` class from the Python `verstack` library, which fits a gradient-boosted tree model for each feature to fill its missing values (Zherebtsov, 2020/2023).
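
The general idea of per-feature, model-based imputation can be illustrated without `verstack`. The sketch below is not how `NaNImputer` works internally: it swaps in an ordinary least-squares model for the gradient-boosted trees, and the function name `impute_with_regression` and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def impute_with_regression(X):
    """Fill NaNs column by column using a linear model fitted on the other columns.

    Gaps in the *other* columns are temporarily filled with column means so
    that a complete design matrix is available for fitting and predicting.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)  # temporary fill for predictors
    out = X.copy()
    for j in range(X.shape[1]):
        rows = missing[:, j]
        if not rows.any():
            continue  # nothing to impute in this column
        others = np.delete(filled, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])  # intercept + predictors
        # Fit on rows where column j is observed, predict where it is missing
        coef, *_ = np.linalg.lstsq(A[~rows], X[~rows, j], rcond=None)
        out[rows, j] = A[rows] @ coef
    return out


# Synthetic example: b is an exact linear function of a
rng = np.random.default_rng(42)
a = rng.normal(size=100)
b = 2.0 * a + 1.0
X = np.column_stack([a, b])

X_missing = X.copy()
X_missing[:10, 1] = np.nan  # hide the first ten values of b

X_imputed = impute_with_regression(X_missing)
```

Because `b` depends exactly on `a` in this toy example, the hidden values are recovered almost perfectly. Real predictors are noisier and non-linearly related, which is one reason to prefer a tree-based model for the actual imputation.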


In [4]:
imputer = NaNImputer()

In [5]:
df_imputed = imputer.impute(df)


 * Initiating NaNImputer.impute
     . Dataset dimensions:
     .. rows:         5351830
     .. columns:      152
     .. mb in memory: 3144.0
     .. NaN cols num: 150

   - Drop hopeless NaN cols

   - Processing whole data for imputation

   - Imputing single core 150 cols
     . Imputed (regression) - 3608     NaN in ETH_GlobalCanopyHeightSD_2020_v1
     . Imputed (regression) - 3608     NaN in ETH_GlobalCanopyHeight_2020_v1
     . Imputed (regression) - 85073    NaN in bdod_0-5cm_mean
     . Imputed (regression) - 85153    NaN in bdod_100-200cm_mean
     . Imputed (regression) - 85073    NaN in bdod_15-30cm_mean
     . Imputed (regression) - 85073    NaN in bdod_30-60cm_mean
     . Imputed (regression) - 85073    NaN in bdod_5-15cm_mean
     . Imputed (regression) - 85073    NaN in bdod_60-100cm_mean
     . Imputed (regression) - 82608    NaN in cec_0-5cm_mean
     . Imputed (regression) - 82608    NaN in cec_100-200cm_mean
     . Imputed (regression) - 82608    NaN in cec_15-30

In [6]:
df_imputed.head()

| | ETH_GlobalCanopyHeightSD_2020_v1 | ETH_GlobalCanopyHeight_2020_v1 | bdod_0-5cm_mean | bdod_100-200cm_mean | bdod_15-30cm_mean | bdod_30-60cm_mean | bdod_5-15cm_mean | bdod_60-100cm_mean | cec_0-5cm_mean | cec_100-200cm_mean | ... | vodca_x-band_p5 | vodca_x-band_p95 | wc2.1_30s_bio_1 | wc2.1_30s_bio_12 | wc2.1_30s_bio_13-14 | wc2.1_30s_bio_15 | wc2.1_30s_bio_4 | wc2.1_30s_bio_7 | X4_mean | cv_split |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 4.0 | 72.869946 | 75.747543 | 69.1737 | 65.633842 | 73.474933 | 77.171645 | 514.0 | 257.0 | ... | 9998.0 | 10628.0 | -11.3625 | 0.0 | 0.0 | 0.0 | 1278.113159 | 40.300003 | 0.465956 | 0.0 |
| 1 | 2.0 | 1.0 | 72.869946 | 75.747543 | 69.1737 | 65.633842 | 73.474933 | 77.171645 | 500.0 | 243.0 | ... | 15506.0 | 14737.0 | -11.970834 | 364.0 | 50.0 | 54.235405 | 1224.654785 | 38.400002 | 0.504636 | 0.0 |
| 2 | 10.0 | 25.0 | 105.0 | 119.0 | 111.0 | 114.0 | 109.0 | 118.0 | 234.0 | 205.0 | ... | 2630.500423 | 6022.519499 | 25.6 | 2662.0 | 251.0 | 38.029896 | 119.125633 | 7.6 | 0.526075 | 1.0 |
| 3 | 3.0 | 3.0 | 72.869946 | 75.747543 | 69.1737 | 65.633842 | 73.549443 | 76.922851 | 549.0 | 265.0 | ... | 7909.0 | 9810.0 | -11.2125 | 273.0 | 40.0 | 55.311668 | 1290.799805 | 40.299999 | 0.354417 | 0.0 |
| 4 | 10.0 | 25.0 | 112.0 | 124.0 | 116.0 | 119.0 | 115.0 | 123.0 | 209.0 | 187.0 | ... | 2630.500423 | 6022.519499 | 25.825001 | 2476.0 | 231.0 | 41.319893 | 119.914726 | 7.599998 | 0.52777 | 1.0 |
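
Before using the imputed data for the DI calculation, it may be worth confirming that no gaps remain and that observed values were left untouched. The helper below, `check_imputation`, is a hypothetical name introduced for this sketch; it is shown on a three-row excerpt of two columns taken from the outputs above, and could be applied to `df` and `df_imputed` in the same way, assuming both share the same shape and column order.

```python
import numpy as np
import pandas as pd


def check_imputation(before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Summarize an imputation: gaps filled, gaps left, observed values preserved."""
    observed = before.notna().to_numpy()
    unchanged = np.allclose(before.to_numpy()[observed], after.to_numpy()[observed])
    return {
        "n_imputed": int((~observed).sum()),
        "remaining_nan": int(after.isna().to_numpy().sum()),
        "observed_unchanged": bool(unchanged),
    }


# Rows 0-2 of two columns, before and after imputation
before = pd.DataFrame({
    "bdod_0-5cm_mean": [np.nan, np.nan, 105.0],
    "vodca_x-band_p5": [9998.0, 15506.0, np.nan],
})
after = pd.DataFrame({
    "bdod_0-5cm_mean": [72.869946, 72.869946, 105.0],
    "vodca_x-band_p5": [9998.0, 15506.0, 2630.500423],
})

report = check_imputation(before, after)
print(report)
```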
