# 2.0.2: Area of Applicability (AoA)

While spatial K-fold cross-validation (SKCV) is useful for reducing the effect of spatial dependence on model error estimates, it may not be sufficient for assessing the transferability of model predictions to areas where the predictor data differ significantly from the reference data (Meyer & Pebesma, 2021). To address this, we can apply the methods developed by Meyer & Pebesma (2021) to determine the dissimilarity index (DI) between unseen predictor data (“new”) and the predictor data on which the models were trained (“train”), and thus the area of applicability (AoA) of the final trait maps.

Briefly, dissimilarity in the predictor space is calculated by first computing the average minimum distance between observations in each cross-validation fold in “train” (i.e. the minimum distances between points in one fold and points in all other folds), and then calculating the minimum distance from each observation in the “new” data to the training data. The DI is then each “new” distance divided by the mean “train” distance. The 95th percentile of the DI values can then be set as the threshold for determining the AoA, with values below the threshold falling within the AoA and values above it falling outside.
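
The calculation described above can be sketched on toy data as follows. This is a minimal illustration and not the implementation used for the trait maps: the helper names `min_dist` and `dissimilarity_index` are hypothetical, distances are brute-force Euclidean, the features are assumed to already be on comparable scales, and the threshold is assumed to be taken from the 95th percentile of the cross-fold DI values of the training data.

```python
import numpy as np


def min_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Minimum Euclidean distance from each row of `a` to any row of `b`."""
    diffs = a[:, None, :] - b[None, :, :]
    return np.sqrt((diffs**2).sum(axis=-1)).min(axis=1)


def dissimilarity_index(train: np.ndarray, folds: np.ndarray, new: np.ndarray):
    """Return the DI of `new` and an AoA threshold (hypothetical helper)."""
    # Minimum distance of each training point to points in all *other* folds
    train_d = np.empty(len(train))
    for fold in np.unique(folds):
        in_fold = folds == fold
        train_d[in_fold] = min_dist(train[in_fold], train[~in_fold])

    mean_train_d = train_d.mean()
    di_train = train_d / mean_train_d
    # Minimum distance of each new point to the training data, scaled the same way
    di_new = min_dist(new, train) / mean_train_d

    # ASSUMPTION: threshold = 95th percentile of the training DI values
    threshold = np.percentile(di_train, 95)
    return di_new, threshold


rng = np.random.default_rng(0)
train = rng.random((50, 2))           # toy "train" predictors in the unit square
folds = np.arange(len(train)) % 5     # toy fold assignment (5 folds)
new = np.array([train[0], [100.0, 100.0]])  # one familiar point, one far away

di_new, threshold = dissimilarity_index(train, folds, new)
in_aoa = di_new <= threshold
print(in_aoa)
```

The brute-force distance matrix is only practical for small arrays; for data at the scale of the `predict` set, a spatial index or chunked computation would be needed.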

## Imports and config

In [1]:
import numpy as np
import pandas as pd
import dask.dataframe as dd
import dask.array as da
from sklearn.impute import SimpleImputer

from src.conf.conf import get_config
from src.conf.environment import log
from src.utils.dataset_utils import (
    get_predict_fn,
    get_cv_splits,
    add_cv_splits_to_column,
)


cfg = get_config()

## Load predict data

In order to compare our "new" and "train" feature spaces, we will need to compute the distance between them. Let's get started by loading our "new" data (AKA `predict`). 

In [2]:
predict_fn = get_predict_fn(cfg)
predict_df = pd.read_parquet(predict_fn)
predict_df.head()

Unnamed: 0,x,y,ETH_GlobalCanopyHeight_2020_v1,ETH_GlobalCanopyHeightSD_2020_v1,sur_refl_b03_2001-2024_m3_mean,sur_refl_b04_2001-2024_m11_mean,sur_refl_b05_2001-2024_m8_mean,sur_refl_b02_2001-2024_m5_mean,sur_refl_b01_2001-2024_m10_mean,sur_refl_ndvi_2001-2024_m1_mean,...,vodca_k-band_p95,vodca_c-band_mean,vodca_k-band_p5,vodca_x-band_mean,wc2.1_30s_bio_15,wc2.1_30s_bio_4,wc2.1_30s_bio_13-14,wc2.1_30s_bio_7,wc2.1_30s_bio_12,wc2.1_30s_bio_1
0,-179.385,71.305,5.0,6.0,8794.0,133.0,2982.0,7129.0,5386.0,791.0,...,,,,,56.240726,1209.005005,22.0,36.799999,139.0,-12.120833
1,-169.985,63.365,2.0,5.0,9225.0,3930.0,3265.0,5740.0,1705.0,-804.0,...,,,,,60.474316,934.597961,66.0,31.0,408.0,-3.6875
2,-165.895,68.685,3.0,4.0,9088.0,2606.0,3313.0,6565.0,5982.0,1990.0,...,,,,,39.175591,1108.526489,28.0,34.599998,273.0,-8.141666
3,-173.645,65.455,3.0,4.0,9256.0,5494.0,3883.0,7404.0,4481.0,-102.0,...,25060.0,13676.0,15569.0,14873.0,34.449146,1099.378906,47.0,35.200001,578.0,-6.645833
4,-175.135,66.905,4.0,6.0,9011.0,3579.0,3551.0,6904.0,4221.0,-39.0,...,17490.0,9460.0,11338.0,9931.0,38.611019,1160.903564,42.0,36.799999,418.0,-8.016666


In [7]:
print(predict_df.isnull().sum())

x                                         0
y                                         0
ETH_GlobalCanopyHeight_2020_v1      6743792
ETH_GlobalCanopyHeightSD_2020_v1    6743792
sur_refl_b03_2001-2024_m3_mean        34554
                                     ...   
wc2.1_30s_bio_4                      339694
wc2.1_30s_bio_13-14                  339694
wc2.1_30s_bio_7                      339694
wc2.1_30s_bio_12                     339694
wc2.1_30s_bio_1                      339694
Length: 152, dtype: int64


## Impute missing values

Looking at the output above, we can see that there are missing values in our `predict` data in the form of `NaN`s. This is no problem for our models, but the method described by Meyer & Pebesma does not address the handling of missing values. Because AutoGluon `TabularPredictor` models can tolerate missing values and we did not do any data imputation prior to training, we are faced with a few options when calculating AoA:

a) drop all observations in “new” and “train” that contain any missing features, which misrepresents the actual spatial coverage of the training data and reduces the feature-space variance in the DI calculation relative to the reference data actually used in training, possibly resulting in an overly pessimistic AoA;

b) drop features in both the “new” and “train” data that contain any missing values, likely resulting in an overly optimistic AoA as feature-space complexity is reduced; or 

c) a “middle ground” approach of imputing missing values and assuming that the resulting predictor space is still a robust representation of the true reference data.

To retain as much of the original signature of the true reference data when calculating dissimilarity, as well as to ensure that final AoA maps match the geographic extent of the predictions, we will choose option “c”. It should be noted, however, that, given the novelty of this method, room for improvement likely exists. 
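
To make the trade-off between the three options concrete, here is a small sketch on a made-up DataFrame (the `toy` data and column names are purely illustrative and are not part of the project data). It shows how much of the data survives under each option.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two features with gaps, one complete feature
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 4.0],
        "b": [10.0, 20.0, 30.0, 40.0],
        "c": [np.nan, 0.5, 0.7, 0.9],
    }
)

option_a = toy.dropna(axis=0)      # (a) drop observations with any missing feature
option_b = toy.dropna(axis=1)      # (b) drop features with any missing value
option_c = toy.fillna(toy.mean())  # (c) impute, here with the column mean

print(option_a.shape)  # (2, 3): half of the observations are lost
print(option_b.shape)  # (4, 1): two of three features are lost
print(option_c.shape)  # (4, 3): extent and feature space are both retained
```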

### Imputation method

To achieve the best possible missing value imputation we can use the `NaNImputer` method from the Python `verstack` library, which fits gradient boosted tree regression models for each feature to fill missing values (Zherebtsov, 2020/2023). However, since we're working at high resolution with global extent, even small patches of missing data mean lots of missing values, which will require some pretty intensive computation time for `NaNImputer` as it fits XGBoost regressors to each feature for imputation.

As this notebook is for demonstration purposes and we don't have 500GB of RAM or several days available for processing, we'll use a simple imputer instead.

In [3]:
def assign_tiles_vectorized_dask(
    ddf: dd.DataFrame, tile_size: int | float = 10
) -> dd.DataFrame:
    """Assign tile IDs to a Dask DataFrame of points."""
    x_tiles = da.floor(ddf["x"] / tile_size) * tile_size
    y_tiles = da.floor(ddf["y"] / tile_size) * tile_size
    ddf["tile"] = (
        x_tiles.astype(int).astype(str) + "_" + y_tiles.astype(int).astype(str)
    )
    return ddf


def assign_tiles_vectorized_pandas(
    df: pd.DataFrame, tile_size: int | float = 10
) -> pd.DataFrame:
    """Assign tile IDs to a pandas DataFrame of points."""
    x_tiles = (df["x"] // tile_size) * tile_size
    y_tiles = (df["y"] // tile_size) * tile_size
    df["tile"] = x_tiles.astype(int).astype(str) + "_" + y_tiles.astype(int).astype(str)
    return df


# Assign tile IDs to the pandas `predict` DataFrame
predict_df = assign_tiles_vectorized_pandas(predict_df)
predict_df.head()

Unnamed: 0,x,y,ETH_GlobalCanopyHeight_2020_v1,ETH_GlobalCanopyHeightSD_2020_v1,sur_refl_b03_2001-2024_m3_mean,sur_refl_b04_2001-2024_m11_mean,sur_refl_b05_2001-2024_m8_mean,sur_refl_b02_2001-2024_m5_mean,sur_refl_b01_2001-2024_m10_mean,sur_refl_ndvi_2001-2024_m1_mean,...,vodca_c-band_mean,vodca_k-band_p5,vodca_x-band_mean,wc2.1_30s_bio_15,wc2.1_30s_bio_4,wc2.1_30s_bio_13-14,wc2.1_30s_bio_7,wc2.1_30s_bio_12,wc2.1_30s_bio_1,tile
0,-179.385,71.305,5.0,6.0,8794.0,133.0,2982.0,7129.0,5386.0,791.0,...,,,,56.240726,1209.005005,22.0,36.799999,139.0,-12.120833,-180_70
1,-169.985,63.365,2.0,5.0,9225.0,3930.0,3265.0,5740.0,1705.0,-804.0,...,,,,60.474316,934.597961,66.0,31.0,408.0,-3.6875,-170_60
2,-165.895,68.685,3.0,4.0,9088.0,2606.0,3313.0,6565.0,5982.0,1990.0,...,,,,39.175591,1108.526489,28.0,34.599998,273.0,-8.141666,-170_60
3,-173.645,65.455,3.0,4.0,9256.0,5494.0,3883.0,7404.0,4481.0,-102.0,...,13676.0,15569.0,14873.0,34.449146,1099.378906,47.0,35.200001,578.0,-6.645833,-180_60
4,-175.135,66.905,4.0,6.0,9011.0,3579.0,3551.0,6904.0,4221.0,-39.0,...,9460.0,11338.0,9931.0,38.611019,1160.903564,42.0,36.799999,418.0,-8.016666,-180_60
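
As a quick sanity check on the tile IDs in the output above, the same floor-based arithmetic can be reproduced for individual points. The `tile_id` helper below is a hypothetical scalar stand-in for the vectorized functions and uses only the standard library.

```python
import math


def tile_id(x: float, y: float, tile_size: float = 10) -> str:
    """Return the ID of the tile containing (x, y): the tile's lower-left corner."""
    x_tile = math.floor(x / tile_size) * tile_size
    y_tile = math.floor(y / tile_size) * tile_size
    return f"{int(x_tile)}_{int(y_tile)}"


# Flooring (rather than truncating) keeps negative coordinates in the correct tile
print(tile_id(-179.385, 71.305))  # -180_70
print(tile_id(-165.895, 68.685))  # -170_60
```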


Lastly, we can impute missing values tile by tile by grouping the DataFrame by `tile` and filling each column's missing values with that column's mean within the tile.

In [4]:
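# Group the DataFrame by tile and fill NaNs with the per-tile mean of each numeric column.
# numeric_only=True skips the string-valued "tile" column when computing the means.
predict_df_imputed = predict_df.groupby("tile", group_keys=False).apply(
    lambda g: g.fillna(g.mean(numeric_only=True))
)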

In [3]:
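# Fill missing values in each numeric column with the global mean of that column.
# The string-valued "tile" column is dropped first, since mean imputation only supports numeric data.
imputer = SimpleImputer(strategy="mean").set_output(transform="pandas")
predict_df_imputed = imputer.fit_transform(predict_df.drop(columns="tile", errors="ignore"))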
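
One caveat of per-tile mean imputation is that a column with no valid values anywhere in a tile stays `NaN`, because the tile mean itself is undefined. A possible way to combine the two approaches, sketched here on made-up data (the `toy` frame and feature names are hypothetical), is to fill with the per-tile mean first and fall back to the global column mean for whatever remains.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: feature "f2" is entirely missing within tile "A"
toy = pd.DataFrame(
    {
        "tile": ["A", "A", "B", "B"],
        "f1": [1.0, np.nan, 10.0, 30.0],
        "f2": [np.nan, np.nan, 5.0, 7.0],
    }
)
features = ["f1", "f2"]

# Per-tile means, broadcast back to the shape of the original rows
tile_means = toy.groupby("tile")[features].transform("mean")

imputed = toy.copy()
imputed[features] = (
    toy[features]
    .fillna(tile_means)           # first choice: mean within the same tile
    .fillna(toy[features].mean()) # fallback: global column mean
)
print(imputed)
```

Here `f1` in tile "A" is filled with the tile mean (1.0), while `f2` in tile "A" falls back to the global mean (6.0).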

## Load training data

Let's just focus on a single trait for the purposes of this exploration.

In [3]:
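# NOTE: `train_df`, `x_cols`, `y_cols`, and `coord_cols` are not defined in the cells shown here;
# they are assumed to have been created in an earlier cell.
y_col = y_cols[0]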
df = train_df[x_cols + [y_col]].copy()
xy = train_df[coord_cols]

cv_splits = get_cv_splits(cfg, y_col)
df = add_cv_splits_to_column(df, cv_splits)

df.head()

Unnamed: 0,ETH_GlobalCanopyHeightSD_2020_v1,ETH_GlobalCanopyHeight_2020_v1,bdod_0-5cm_mean,bdod_100-200cm_mean,bdod_15-30cm_mean,bdod_30-60cm_mean,bdod_5-15cm_mean,bdod_60-100cm_mean,cec_0-5cm_mean,cec_100-200cm_mean,...,vodca_x-band_p5,vodca_x-band_p95,wc2.1_30s_bio_1,wc2.1_30s_bio_12,wc2.1_30s_bio_13-14,wc2.1_30s_bio_15,wc2.1_30s_bio_4,wc2.1_30s_bio_7,X4_mean,cv_split
0,3.0,4.0,,,,,,,514.0,257.0,...,9998.0,10628.0,-11.3625,0.0,0.0,0.0,1278.113159,40.300003,0.465956,0.0
1,2.0,1.0,,,,,,,500.0,243.0,...,15506.0,14737.0,-11.970834,364.0,50.0,54.235405,1224.654785,38.400002,0.504636,0.0
2,10.0,25.0,105.0,119.0,111.0,114.0,109.0,118.0,234.0,205.0,...,,,25.6,2662.0,251.0,38.029896,119.125633,7.6,0.526075,1.0
3,3.0,3.0,,,,,,,549.0,265.0,...,7909.0,9810.0,-11.2125,273.0,40.0,55.311668,1290.799805,40.299999,0.354417,0.0
4,10.0,25.0,112.0,124.0,116.0,119.0,115.0,123.0,209.0,187.0,...,,,25.825001,2476.0,231.0,41.319893,119.914726,7.599998,0.52777,1.0
