# 2.0.2: Area of Applicability (AoA)

While spatial K-fold cross-validation (SKCV) is useful for reducing the effect of spatial dependence on model error estimates, it may not be sufficient for assessing the transferability of model extrapolations to areas where predictor data differ substantially from the reference data (Meyer & Pebesma, 2021). To address this, we can apply the methods developed by Meyer & Pebesma (2021) to determine the dissimilarity index (DI) between unseen predictor data (“new”) and the predictor data on which the models were trained (“train”), and thus the area of applicability (AoA) of the final trait maps.

Briefly, dissimilarity in the predictor space is calculated in two steps. First, the average minimum distance between observations in “train” is computed across cross-validation folds (i.e. for each point, the minimum distance to points in all other folds). Second, the minimum distance from each observation in “new” to the training data is calculated. The DI is then the “new” distance divided by the mean “train” distance. The 95th percentile of the DI values can then be set as the threshold for the AoA: values below the threshold are within the AoA and values above it are outside.
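The steps described here can be sketched on synthetic data. This is a minimal illustration, not the pipeline used in this notebook: the helper names (`min_dist_to_other_folds`, `dissimilarity_index`) are hypothetical, the features are assumed to be already scaled and weighted, and the threshold is assumed to be taken from the cross-validated DI values of the training data.

```python
import numpy as np
from scipy.spatial import cKDTree


def min_dist_to_other_folds(X, folds):
    """For each row of X, distance to the nearest row in a different fold."""
    out = np.empty(len(X))
    for f in np.unique(folds):
        in_fold = folds == f
        # Build the tree from out-of-fold points only
        tree = cKDTree(X[~in_fold])
        out[in_fold], _ = tree.query(X[in_fold], k=1)
    return out


def dissimilarity_index(X_train, folds, X_new):
    """Return (DI of new points, DI of train points, AoA threshold)."""
    train_min = min_dist_to_other_folds(X_train, folds)
    d_bar = train_min.mean()  # mean "train" distance
    new_min, _ = cKDTree(X_train).query(X_new, k=1)
    di_train = train_min / d_bar
    di_new = new_min / d_bar
    # Assumption: threshold is the 95th percentile of the training DI
    threshold = np.percentile(di_train, 95)
    return di_new, di_train, threshold


# Synthetic example: 200 training points in 5 folds, 2 new points
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
folds = np.repeat(np.arange(5), 40)
X_new = np.vstack([X_train[0], np.full(3, 50.0)])  # one familiar, one far away

di_new, di_train, threshold = dissimilarity_index(X_train, folds, X_new)
inside_aoa = di_new <= threshold
```

In this toy case the first new point coincides with a training point and falls inside the AoA, while the second lies far outside the training feature space and is flagged as outside.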

## Imports and config

In [4]:
import dask.dataframe as dd
import pandas as pd
from src.utils.dataset_utils import get_predict_imputed_fn, get_y_fn
from src.utils.training_utils import assign_splits, filter_trait_set, set_yx_index
from src.conf.conf import get_config
from src.conf.environment import log

cfg = get_config()

## 1. Determine average minimum distance in training feature space

The first step of calculating the dissimilarity index (DI) is to calculate the minimum distances between all observations in the feature space of the training data, and to then determine the average minimum distance from that. This provides a sort of normalization coefficient that will be used to determine how similar observations in the predict feature space are to those in train.

However, to avoid comparing observations (or points) to other points within their spatial autocorrelation range, we need to ensure that distances are calculated only to points outside of each point's respective spatial cross-validation fold.

Additionally, because not all features are equally influential, we will first need to normalize the feature data and then weight each feature by its importance as determined during model training with cross-validation.
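The normalization and weighting step might look like the following sketch. It assumes that standardization uses the training mean and standard deviation and that importances are available as a `pandas.Series` indexed by feature name; the function name `scale_and_weight` and the toy column names are hypothetical.

```python
import pandas as pd


def scale_and_weight(train, new, importance):
    """Standardize with train statistics, then multiply each column by its weight."""
    mean = train.mean()
    # Guard against constant columns (std of 0)
    std = train.std(ddof=0).replace(0, 1.0)
    weights = importance.reindex(train.columns)
    train_w = (train - mean) / std * weights
    new_w = (new[train.columns] - mean) / std * weights
    return train_w, new_w


# Toy data: feature "a" is four times as important as feature "b"
train = pd.DataFrame({"a": [0.0, 2.0, 4.0], "b": [10.0, 10.0, 40.0]})
new = pd.DataFrame({"a": [2.0], "b": [20.0]})
importance = pd.Series({"a": 2.0, "b": 0.5})

train_w, new_w = scale_and_weight(train, new, importance)
```

After this transformation, distances along important features contribute more to the DI than distances along unimportant ones, because each standardized column is stretched or shrunk by its weight.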

### Load training data

Let's load the training data for the combined sPlot and GBIF trait dataset along with their fold assignments (AKA CV splits).

In [7]:
Y_COL: str = "X11_mean"

splot_gbif = (
    pd.read_parquet(get_y_fn(), columns=["x", "y", Y_COL, "source"])
    .pipe(set_yx_index)
    .pipe(assign_splits, label_col=Y_COL)
    .pipe(filter_trait_set, trait_set="splot_gbif")
    .drop(columns=[Y_COL, "source"])
    .merge(
        # Merge using inner join with the imputed predict data (described below)
        dd.read_parquet(get_predict_imputed_fn()).compute().pipe(set_yx_index),
        how="inner",
        left_index=True,
        right_index=True,
        validate="1:1"
    )
    .reset_index(drop=True) # We no longer need the x and y indices
)
splot_gbif.head()

### Normalize features and weight by feature importance

In [6]:
len(dd.read_parquet(get_y_fn()))

6437463

### Calculate minimum distances between points (same-fold excluded)

Example script provided by GPT:

In [None]:
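import numpy as np
from sklearn.metrics import pairwise_distances


def calculate_min_distances(block):
    """Minimum distance from each row to any row in a different fold.

    Note: distances are only computed within a single block (partition),
    not across the whole dataset.
    """
    # The last column holds the fold ID; all other columns are features
    features, fold_ids = block[:, :-1], block[:, -1]
    min_distances = np.full(len(block), np.nan)

    for fold_id in np.unique(fold_ids):
        in_fold = fold_ids == fold_id
        if in_fold.all():
            continue  # No out-of-fold rows in this block; leave as NaN

        # Pairwise distances from in-fold rows to out-of-fold rows
        distances = pairwise_distances(features[in_fold], features[~in_fold])

        # Write results back in the original row order
        min_distances[in_fold] = distances.min(axis=1)

    return min_distances


# `ddf` is assumed to be a dask DataFrame of features plus a "fold_id" column
feature_cols = [c for c in ddf.columns if c != "fold_id"]
dask_array = ddf[feature_cols + ["fold_id"]].to_dask_array(lengths=True)

# Each block maps from 2-D (rows x columns) to 1-D (rows), so drop axis 1
min_distances = dask_array.map_blocks(calculate_min_distances, drop_axis=1, dtype=float)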

## Load imputed predict data

In order to compare our "new" and "train" feature spaces, we will need to compute the distance between them. Let's get started by loading our "new" data (AKA `predict`).

Although our models can tolerate missing values in the data (an important attribute, as not all of the data variables are available simultaneously at all locations), computing pairwise distances or building K-d trees requires a dense matrix (i.e. no missing values).

This leaves us with a few options:

a) drop all observations in “new” and “train” that contain any missing features, which would misrepresent the actual spatial coverage of the training data and reduce the feature-space variance in the DI calculation relative to the reference data actually used in training, possibly resulting in an overly pessimistic AoA;

b) drop features in both the “new” and “train” data that contain any missing values, likely resulting in an overly optimistic AoA as feature-space complexity is reduced; or 

c) a “middle ground” approach of imputing missing values and assuming that the resulting predictor space is still a robust representation of the true reference data.

To retain as much of the original signature of the true reference data as possible when calculating dissimilarity, and to ensure that the final AoA maps match the geographic extent of the predictions, we will choose option “c”. It should be noted, however, that given the novelty of this method, room for improvement likely exists.
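The trade-off between the three options can be seen on a toy table. This is only an illustration: the column names and values are made up, and the median fill used for option “c” is a simple stand-in, not the imputation method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy feature table with two missing values (hypothetical columns)
toy = pd.DataFrame(
    {
        "canopy_height": [5.0, 9.0, np.nan, 8.0],
        "ndvi": [-226.0, -215.0, 994.0, -636.0],
        "bio_1": [-4.9, np.nan, -2.4, -1.0],
    }
)

option_a = toy.dropna(axis=0)        # a) drop observations with any missing feature
option_b = toy.dropna(axis=1)        # b) drop features with any missing value
option_c = toy.fillna(toy.median())  # c) impute (median used here as a stand-in)

print(option_a.shape, option_b.shape, option_c.shape)
# (2, 3) (4, 1) (4, 3)
```

Option “a” loses half of the observations and option “b” loses two of the three features, while option “c” keeps the full table at the cost of relying on imputed values.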

### Imputation method

To achieve the best possible missing-value imputation, we can use the `NaNImputer` from the Python `verstack` library, which fits gradient-boosted tree regression models (LightGBM) to each feature to fill missing values (Zherebtsov, 2020/2023; https://verstack.readthedocs.io/en/latest/index.html#nanimputer). However, since we're working at high resolution with global extent, even small patches of missing data mean a large number of missing values, so `NaNImputer` requires substantial computation time.

In [2]:
from src.utils.dataset_utils import get_predict_imputed_fn


pred_imputed = dd.read_parquet(get_predict_imputed_fn())
pred_imputed.head()

| | ETH_GlobalCanopyHeight_2020_v1 | ETH_GlobalCanopyHeightSD_2020_v1 | sur_refl_b03_2001-2024_m3_mean | sur_refl_b04_2001-2024_m11_mean | sur_refl_b05_2001-2024_m8_mean | sur_refl_b02_2001-2024_m5_mean | sur_refl_b01_2001-2024_m10_mean | sur_refl_ndvi_2001-2024_m1_mean | sur_refl_b03_2001-2024_m1_mean | sur_refl_b05_2001-2024_m12_mean | ... | vodca_k-band_p5 | vodca_x-band_mean | wc2.1_30s_bio_15 | wc2.1_30s_bio_4 | wc2.1_30s_bio_13-14 | wc2.1_30s_bio_7 | wc2.1_30s_bio_12 | wc2.1_30s_bio_1 | y | x |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 7 | 8733 | 3559 | 3653 | 3295 | 2156 | -226 | 2707 | 74 | ... | 21161 | 18950 | 67.323479 | 1262.418335 | 70.0 | 43.699997 | 365.0 | -4.9 | 65.615 | -161.725 |
| 1 | 9 | 15 | 6549 | 4598 | 3370 | 2333 | 1561 | -215 | 6088 | 2841 | ... | 16250 | 12988 | 51.977913 | 1035.702393 | 73.0 | 35.800003 | 497.0 | -0.220833 | 61.115 | -160.315 |
| 2 | 8 | 10 | 1696 | 1322 | 2056 | 1782 | 686 | 994 | 928 | 71 | ... | 27081 | 19413 | 53.933914 | 1369.168823 | 46.0 | 47.700001 | 330.0 | -2.4375 | 64.925 | -147.365 |
| 3 | 8 | 12 | 9558 | 7117 | 4395 | 6202 | 2140 | -636 | 8650 | 3315 | ... | 12126 | 9466 | 59.336483 | 898.583069 | 113.0 | 30.099998 | 682.0 | -1.033333 | 60.635 | -164.635 |
| 4 | 11 | 15 | 8091 | 5478 | 3082 | 3082 | 1331 | -587 | 7383 | 2045 | ... | 9968 | 8306 | 65.759773 | 1007.976074 | 98.0 | 33.5 | 539.0 | -1.304167 | 61.935 | -164.065 |
