# 2.0.1: Area of Applicability (AoA)

While spatial K-fold cross-validation (SKCV) is useful for reducing the influence of spatial dependence on model error estimates, it may not be sufficient for assessing how well models transfer to areas where the predictor data differ significantly from the reference data (Meyer & Pebesma, 2021). To address this, we can apply the methods developed by Meyer & Pebesma (2021) to determine the dissimilarity index (DI) between unseen predictor data (“new”) and the predictor data on which the models were trained (“train”), and from that the area of applicability (AoA) of the final trait maps.

Briefly, dissimilarity in the predictor space is calculated by first computing the average minimum distance between observations in “train”, where each point's minimum distance is taken to points in all other cross-validation folds rather than its own, and then calculating the minimum distance from each observation in the “new” data to the training data. The DI is then the “new” distance divided by the mean “train” distance. The 95th percentile of the DI values can then be set as the threshold for the AoA: values at or below the threshold lie within the AoA and values above it lie outside.
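
The steps above can be sketched on a toy dataset. This is a minimal NumPy illustration, not the GPU implementation used later in this notebook. The function name `dissimilarity_index` and the toy arrays are hypothetical, and the sketch assumes Euclidean distance and a normalization constant equal to the mean distance over distinct training pairs (as the code cells below do, up to how self-pairs are handled).

```python
import numpy as np


def dissimilarity_index(train, folds, new):
    """Toy DI computation on small dense arrays (Euclidean distance).

    train: (n, p) scaled and weighted training features
    folds: (n,) cross-validation fold label of each training row
    new:   (m, p) features of unseen locations, scaled the same way
    """
    # All pairwise training distances, shape (n, n)
    d_train = np.linalg.norm(train[:, None, :] - train[None, :, :], axis=-1)

    # Normalization constant: mean distance over distinct training pairs
    upper = np.triu_indices(len(train), k=1)
    mean_dist = d_train[upper].mean()

    # Training DI: nearest neighbor outside each point's own fold
    same_fold = folds[:, None] == folds[None, :]
    di_train = np.where(same_fold, np.inf, d_train).min(axis=1) / mean_dist

    # New-data DI: nearest neighbor among all training points
    d_new = np.linalg.norm(new[:, None, :] - train[None, :, :], axis=-1)
    di_new = d_new.min(axis=1) / mean_dist
    return di_train, di_new, mean_dist


# Four training points on a unit square, split into two folds
train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
folds = np.array([0, 0, 1, 1])
# One new point that coincides with a training point, one far away
new = np.array([[0.0, 0.0], [3.0, 3.0]])

di_train, di_new, mean_dist = dissimilarity_index(train, folds, new)
print(di_train, di_new)
```

The new point that coincides with a training point gets a DI of zero, while the distant point gets a DI well above every training DI and would fall outside the AoA under any threshold derived from `di_train`.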

## Imports and config

In [7]:
import cupy as cp
import dask.dataframe as dd
import dask_cudf
import numpy as np
from dask import compute, delayed
import pandas as pd
from dask_cuda import LocalCUDACluster
from distributed import Client, LocalCluster
from sklearn.preprocessing import MinMaxScaler

from src.conf.conf import get_config
from src.conf.environment import log
from src.utils.dataset_utils import get_predict_imputed_fn, get_y_fn
from src.utils.df_utils import pipe_log
from src.utils.log_utils import suppress_dask_logging
from src.utils.training_utils import assign_splits, filter_trait_set, set_yx_index

cfg = get_config()

## 1. Determine average minimum distance in training feature space

The first step of calculating the dissimilarity index (DI) is to calculate the minimum distances between all observations in the feature space of the training data, and to then determine the average minimum distance from that. This provides a sort of normalization coefficient that will be used to determine how similar observations in the predict feature space are to those in train.

However, to avoid comparing observations (or points) to other points within their spatial autocorrelation range, we need to ensure that distances are calculated only to points outside of each point's respective spatial cross-validation fold.

Additionally, because features differ in how strongly they influence the model, we first standardize the feature data and then weight each feature by its importance as determined during cross-validated model training.
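
As a rough illustration of the scaling and weighting step, here is a small pandas sketch. The column names and importance values are made up, and the real notebook reads importances from a file rather than defining them inline.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two features and made-up importance scores
train = pd.DataFrame({"feat_a": [1.0, 2.0, 3.0], "feat_b": [10.0, 20.0, 60.0]})
importance = pd.Series({"feat_b": 0.25, "feat_a": 0.75})  # deliberately unordered

# Standardize with statistics taken from the training data only
means, stds = train.mean(), train.std()
scaled = (train - means) / stds

# Align the importances to the column order, then weight each column
weights = importance.loc[scaled.columns]
weighted = scaled * weights

print(weighted)
```

Because Euclidean distance sums squared differences per feature, multiplying a column by a small weight shrinks that feature's contribution to every distance, so unimportant predictors matter less when judging dissimilarity.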

### Load training data

Let's load the training data for the combined sPlot and GBIF trait dataset along with their fold assignments (AKA CV splits).

In [14]:
Y_COL: str = "X11_mean"

client = Client(n_workers=20)
suppress_dask_logging()

splot_gbif = (
    pd.read_parquet(get_y_fn(), columns=["x", "y", Y_COL, "source"])
    .pipe(pipe_log, "Setting yx index and assigning splits...")
    .pipe(set_yx_index)
    .pipe(assign_splits, label_col=Y_COL)
    .groupby("fold", group_keys=False)
    .sample(frac=0.5, random_state=cfg.random_seed)
    .reset_index()
    .pipe(pipe_log, "Filtering trait set...")
    .pipe(filter_trait_set, trait_set="splot_gbif")  # TODO: Rename df from "splot_gbif" to "splot"
    .drop(columns=[Y_COL, "source"])
    .pipe(pipe_log, "Converting to dask dataframe...")
    .pipe(lambda _df: dd.from_pandas(_df, npartitions=50))
    .pipe(pipe_log, "Merging with imputed predict data...")
    .merge(
        # Merge using inner join with the imputed predict data (described below)
        dd.read_parquet(get_predict_imputed_fn()).repartition(npartitions=200),
        how="inner",
        on=["x", "y"],
    )
    .drop(columns=["x", "y"])
    .reset_index(drop=True)
    .compute()
    .reset_index(drop=True)
)

print(splot_gbif.shape)
splot_gbif.head()

2024-09-20 12:46:30 CEST - src.utils.df_utils - INFO - Setting yx index and assigning splits...
2024-09-20 12:46:47 CEST - src.utils.df_utils - INFO - Filtering trait set...
2024-09-20 12:46:48 CEST - src.utils.df_utils - INFO - Converting to dask dataframe...
2024-09-20 12:46:49 CEST - src.utils.df_utils - INFO - Merging with imputed predict data...
This may cause some slowdown.
Consider scattering data ahead of time and using futures.


(2818279, 151)


Unnamed: 0,fold,ETH_GlobalCanopyHeight_2020_v1,ETH_GlobalCanopyHeightSD_2020_v1,sur_refl_b03_2001-2024_m3_mean,sur_refl_b04_2001-2024_m11_mean,sur_refl_b05_2001-2024_m8_mean,sur_refl_b02_2001-2024_m5_mean,sur_refl_b01_2001-2024_m10_mean,sur_refl_ndvi_2001-2024_m1_mean,sur_refl_b03_2001-2024_m1_mean,...,vodca_k-band_p95,vodca_c-band_mean,vodca_k-band_p5,vodca_x-band_mean,wc2.1_30s_bio_15,wc2.1_30s_bio_4,wc2.1_30s_bio_13-14,wc2.1_30s_bio_7,wc2.1_30s_bio_12,wc2.1_30s_bio_1
0,0,20,7,622,663,2923,3142,388,3590,1289,...,32766,17510,24545,20376,16.814999,592.012512,47.0,23.9,1104.0,7.233333
1,0,0,0,706,938,3025,2628,1168,2628,659,...,15009,4285,10136,6884,72.014816,744.499023,61.0,34.900002,320.0,14.583333
2,0,19,7,511,625,2967,3103,640,3287,595,...,32766,19277,25608,21548,14.560808,700.306763,34.0,29.9,938.0,11.575
3,0,7,5,429,824,3070,2445,994,5385,486,...,32595,17382,27179,19745,88.186821,179.942627,290.0,16.299999,1440.0,18.370832
4,0,5,5,372,820,3258,2993,685,5487,474,...,23998,11486,16108,13328,7.570564,458.587036,25.0,22.6,1155.0,16.420834


### Scale features and weight by feature importance

In [3]:
# Standardize the features by subtracting the mean and dividing by the standard deviation.
# Store the mean and standard deviation of each feature so that we can scale the predict data in the same way.

fold = splot_gbif[["fold"]]
sg_scaled = splot_gbif.drop(columns=["fold"]).copy()
means = sg_scaled.mean()
stds = sg_scaled.std()

sg_scaled = (sg_scaled - means) / stds
sg_scaled

Unnamed: 0,ETH_GlobalCanopyHeight_2020_v1,ETH_GlobalCanopyHeightSD_2020_v1,sur_refl_b03_2001-2024_m3_mean,sur_refl_b04_2001-2024_m11_mean,sur_refl_b05_2001-2024_m8_mean,sur_refl_b02_2001-2024_m5_mean,sur_refl_b01_2001-2024_m10_mean,sur_refl_ndvi_2001-2024_m1_mean,sur_refl_b03_2001-2024_m1_mean,sur_refl_b05_2001-2024_m12_mean,...,vodca_k-band_p95,vodca_c-band_mean,vodca_k-band_p5,vodca_x-band_mean,wc2.1_30s_bio_15,wc2.1_30s_bio_4,wc2.1_30s_bio_13-14,wc2.1_30s_bio_7,wc2.1_30s_bio_12,wc2.1_30s_bio_1
0,-0.785311,-0.391536,-0.493182,-0.217694,1.987673,2.124178,-0.393713,0.782339,-0.333559,1.108293,...,-0.031811,-0.098556,-0.445730,-0.141520,-0.854978,-0.566807,-0.651716,-0.776554,-0.306828,0.012713
1,1.326528,1.907760,1.470920,0.070001,-1.319267,-0.792220,-0.526524,-1.628681,0.480740,-2.043951,...,0.879881,1.289406,0.953352,0.781659,-0.859533,-0.373631,-0.240744,-0.662429,0.873255,-1.188864
2,1.643304,0.593876,-0.065338,-0.582298,-1.025484,-0.943451,-0.657517,-0.405929,0.004047,-1.130493,...,0.931291,1.398374,-0.054254,1.343636,-0.064783,1.144442,-0.209130,0.550149,-0.312661,-0.951093
3,-0.574128,-0.391536,-0.494509,-0.431329,0.547007,0.956054,-0.442835,0.362484,-0.464680,-0.089261,...,0.734731,0.784635,0.805861,0.834243,-0.991933,-0.121738,-0.588490,-0.305788,0.429995,0.167781
4,0.059424,0.265405,0.032834,-0.120846,0.869038,-0.086914,-0.108076,-1.079748,0.815282,0.092887,...,0.547951,-0.155584,-0.957199,-0.078419,-0.435342,1.238692,-0.493650,0.635743,-0.542067,-0.427838
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
228332,-0.468536,0.265405,-0.537625,-0.321663,-2.611158,-1.789558,-0.390074,1.060068,-0.776553,-1.260987,...,-3.040143,-2.758516,-2.657263,-2.750246,-0.746775,-1.324046,0.470555,-1.817944,2.465492,2.627397
228333,1.220936,0.922347,-0.635797,-0.757478,-0.874826,-0.375033,-0.990457,2.098289,-0.900934,-0.547349,...,0.082465,-0.979414,0.022241,-0.051076,1.599155,0.592749,6.113527,0.122181,3.387007,0.405552
228334,1.220936,0.593876,-0.055388,-0.317390,-0.791965,-0.867835,-0.557453,-0.636593,0.583063,-0.480742,...,0.734312,0.621395,0.002409,0.585092,-0.317474,0.665110,-0.335584,0.136446,-0.405979,-0.668790
228335,-0.257352,-0.063065,-0.486549,-0.526752,-1.019835,-1.145526,-0.331855,0.704985,-0.743466,-0.703670,...,-0.006385,-0.523615,0.463048,-0.122590,0.497914,0.292973,-0.003644,0.921055,-0.602335,0.830200


Load feature importances.

In [4]:
from src.utils.dataset_utils import get_latest_run, get_trait_models_dir


fi = pd.read_csv(
    get_latest_run(get_trait_models_dir(Y_COL))
    / "splot"
    / cfg.train.feature_importance,
    index_col=0,
    header=[0, 1],
).sort_values(by=("importance", "mean"), ascending=False)

fi.head()

Unnamed: 0_level_0,importance,importance,stddev,stddev,p_value,p_value,n,n,p99_high,p99_high,p99_low,p99_low
Unnamed: 0_level_1,mean,std,mean,std,mean,std,mean,std,mean,std,mean,std
index,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
sur_refl_b02_2001-2024_m6_mean,0.644542,0.091474,0.029778,0.010544,1.989283e-13,3.133852e-13,10.0,0.0,0.675144,0.101128,0.61394,0.082116
wc2.1_30s_bio_1,0.082667,0.050667,0.010721,0.003443,6.268857e-07,1.344188e-06,10.0,0.0,0.093685,0.052837,0.071649,0.048658
sur_refl_b02_2001-2024_m5_mean,0.056415,0.016348,0.007031,0.000706,3.90936e-09,5.576501e-09,10.0,0.0,0.06364,0.016579,0.049189,0.016148
wc2.1_30s_bio_12,0.056015,0.055764,0.008995,0.004616,3.012817e-06,5.192768e-06,10.0,0.0,0.065259,0.059765,0.046771,0.051887
sur_refl_ndvi_2001-2024_m4_mean,0.056006,0.033391,0.010964,0.006438,1.389498e-07,2.730383e-07,10.0,0.0,0.067274,0.039778,0.044738,0.027115


Multiply the feature importances by the normalized features to weight their influence in the feature space according to their importance.

In [5]:
# Ensure the feature importance indices are in the same order as the sg_scaled columns
fi_mean = fi["importance"]["mean"].to_frame().loc[sg_scaled.columns]

sg_scaled_wt = pd.concat([sg_scaled * fi_mean.T.values, fold], axis=1)
sg_scaled_wt

Unnamed: 0,ETH_GlobalCanopyHeight_2020_v1,ETH_GlobalCanopyHeightSD_2020_v1,sur_refl_b03_2001-2024_m3_mean,sur_refl_b04_2001-2024_m11_mean,sur_refl_b05_2001-2024_m8_mean,sur_refl_b02_2001-2024_m5_mean,sur_refl_b01_2001-2024_m10_mean,sur_refl_ndvi_2001-2024_m1_mean,sur_refl_b03_2001-2024_m1_mean,sur_refl_b05_2001-2024_m12_mean,...,vodca_c-band_mean,vodca_k-band_p5,vodca_x-band_mean,wc2.1_30s_bio_15,wc2.1_30s_bio_4,wc2.1_30s_bio_13-14,wc2.1_30s_bio_7,wc2.1_30s_bio_12,wc2.1_30s_bio_1,fold
0,-0.013093,-0.004403,-0.008205,-0.002630,0.013529,0.119835,-0.014794,0.030798,-0.005673,0.017810,...,-0.003552,-0.011175,-0.003840,-0.002504,-0.024427,-0.025142,-0.039684,-0.017187,0.001051,0
1,0.022117,0.021456,0.024473,0.000846,-0.008980,-0.044693,-0.019784,-0.064116,0.008176,-0.032846,...,0.046465,0.023901,0.021212,-0.002517,-0.016102,-0.009287,-0.033852,0.048915,-0.098280,0
2,0.027398,0.006679,-0.001087,-0.007034,-0.006980,-0.053224,-0.024706,-0.015980,0.000069,-0.018167,...,0.050392,-0.001360,0.036462,-0.000190,0.049321,-0.008068,0.028114,-0.017514,-0.078624,0
3,-0.009572,-0.004403,-0.008227,-0.005210,0.003723,0.053935,-0.016639,0.014270,-0.007903,-0.001434,...,0.028275,0.020203,0.022639,-0.002905,-0.005246,-0.022703,-0.015626,0.024086,0.013870,0
4,0.000991,0.002985,0.000546,-0.001460,0.005915,-0.004903,-0.004061,-0.042506,0.013866,0.001493,...,-0.005607,-0.023997,-0.002128,-0.001275,0.053383,-0.019044,0.032488,-0.030364,-0.035368,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
228332,-0.007812,0.002985,-0.008945,-0.003886,-0.017773,-0.100957,-0.014657,0.041731,-0.013207,-0.020264,...,-0.099406,-0.066619,-0.074634,-0.002187,-0.057061,0.018153,-0.092901,0.138104,0.217199,4
228333,0.020356,0.010373,-0.010578,-0.009150,-0.005954,-0.021157,-0.037216,0.082602,-0.015322,-0.008796,...,-0.035294,0.000558,-0.001386,0.004683,0.025545,0.235849,0.006244,0.189723,0.033526,4
228334,0.020356,0.006679,-0.000922,-0.003834,-0.005390,-0.048959,-0.020946,-0.025060,0.009916,-0.007725,...,0.022393,0.000060,0.015878,-0.000930,0.028664,-0.012946,0.006973,-0.022741,-0.055287,4
228335,-0.004291,-0.000709,-0.008095,-0.006363,-0.006941,-0.064624,-0.012469,0.027753,-0.012644,-0.011308,...,-0.018869,0.011609,-0.003327,0.001458,0.012626,-0.000141,0.047068,-0.033740,0.068630,4


### Calculate average distances between all points

Convert the data from a `pandas.DataFrame` to a `cudf.DataFrame` to enable GPU acceleration.

In [6]:
import cudf
import cupy as cp

device_ids = np.arange(2, 4)

with cp.cuda.Device(device_ids[0]):
    sg_scaled_wt_gpu = cudf.DataFrame.from_pandas(sg_scaled_wt).drop(columns=["fold"]).to_cupy()

Calculate average distances between all points in batches to avoid running out of GPU memory.

In [7]:
import dask.array as da
from dask import compute, delayed
from cuml.metrics import pairwise_distances
from dask_cuda import LocalCUDACluster

# Specify the device IDs you want to use
device_ids = np.arange(2, 4)

# Initialize Dask CUDA cluster for multi-GPU processing
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=device_ids)
client = Client(cluster)


# Function to process the data in batches
def process_batch(data, start_idx, end_idx):
    batch_data = data[start_idx:end_idx]
    distances = pairwise_distances(data, batch_data)
    avg_distances = cp.mean(distances, axis=1)
    return avg_distances


# Convert data to Dask array for parallel processing
dask_data = da.from_array(
    sg_scaled_wt_gpu, chunks=(sg_scaled_wt_gpu.shape[0] // len(device_ids), sg_scaled_wt_gpu.shape[1])
)

# Define number of batches
batch_size = 10000  # Adjust based on your GPU memory capacity
n_samples = sg_scaled_wt_gpu.shape[0]
num_batches = (n_samples + batch_size - 1) // batch_size

# Create Dask delayed tasks for each batch
tasks = []
for batch_idx in range(num_batches):
    start_idx = batch_idx * batch_size
    end_idx = min((batch_idx + 1) * batch_size, n_samples)
    task = delayed(process_batch)(dask_data, start_idx, end_idx)
    tasks.append(task)

# Compute the results in parallel
batch_results = compute(*tasks)

# Average the per-batch means. Each batch gets equal weight, so a smaller
# final batch is slightly over-weighted relative to the full-size batches.
avg_distances = cp.sum(cp.array(batch_results), axis=0) / num_batches

# Close the Dask cluster
cluster.close()
client.close()

# Compute the overall mean of the average distances
mean_distance = cp.mean(avg_distances).item()  # Convert from CuPy to Python float

print(f"Average Pairwise Distance for training data: {mean_distance}")

del sg_scaled_wt_gpu

Perhaps you already have a cluster running?
Hosting the HTTP server on port 37629 instead
This may cause some slowdown.
Consider scattering data ahead of time and using futures.


Average Pairwise Distance for training data: 0.8552790578822509
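
The cell above averages per-batch means with equal weight per batch. A possible alternative, sketched here with NumPy as a CPU stand-in for CuPy/cuML, is to accumulate distance sums and divide once at the end, which makes the result independent of the batch size. The function name `mean_pairwise_distance` and the random toy data are assumptions for illustration. Like the cell above, the sketch includes self-pairs (zero distances) in the mean.

```python
import numpy as np


def mean_pairwise_distance(data, batch_size):
    """Mean Euclidean distance over all ordered pairs, self-pairs included."""
    n = data.shape[0]
    total = 0.0
    for start in range(0, n, batch_size):
        batch = data[start:start + batch_size]
        # (n, b) block of distances between every row and the current batch
        block = np.linalg.norm(data[:, None, :] - batch[None, :, :], axis=-1)
        total += block.sum()
    return total / (n * n)


rng = np.random.default_rng(0)
toy = rng.normal(size=(25, 3))

# Reference value from the full (25, 25) distance matrix
full = np.linalg.norm(toy[:, None, :] - toy[None, :, :], axis=-1).mean()
# Batches of 10, 10 and 5 rows
batched = mean_pairwise_distance(toy, batch_size=10)

print(full, batched)
```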


### Calculate DI for all training data (same-fold excluded)

Convert pandas DataFrame to `cudf.DataFrame` to enable GPU compatibility.

In [12]:
import cudf
import cupy as cp
from cuml.neighbors import NearestNeighbors

device_ids = np.arange(2,4)

with cp.cuda.Device(device_ids[0]):
    df_gpu = cudf.DataFrame.from_pandas(sg_scaled_wt)
    folds = [
        df_gpu[df_gpu["fold"] == i].drop(columns=["fold"]).to_cupy() for i in range(5)
    ]
    del df_gpu

Find the nearest neighbor for each observation, excluding neighbors within the current point's own fold.

In [14]:
from cuml.neighbors import NearestNeighbors

cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=device_ids)
client = Client(cluster)


# Function to find, for each point in a fold, the distance to its nearest
# neighbor in any other fold
def process_fold_in_batches(fold_index: int, fold_data, folds):
    nn_model = NearestNeighbors(n_neighbors=1, algorithm="brute")
    fold_min_distances = cp.full(fold_data.shape[0], cp.inf)

    for other_index, other_fold_data in enumerate(folds):
        if fold_index == other_index:
            continue  # Skip the current fold

        # Nearest neighbor of each point within this other fold
        nn_model.fit(other_fold_data)
        distances, _ = nn_model.kneighbors(fold_data)
        # Keep the running minimum across all other folds
        fold_min_distances = cp.minimum(fold_min_distances, distances.flatten())

    return fold_min_distances


# Convert folds to Dask arrays for parallel processing
dask_folds = [da.from_array(fold) for fold in folds]

# Define batch size
# For splot-only, a batch size of 10000 works, but for splot_gbif, a batch size of 1000
# is more appropriate due to the larger number of samples
# batch_size = 1000

# Create Dask delayed tasks for each fold
tasks = [
    delayed(process_fold_in_batches)(fold_index, fold_data, dask_folds)
    for fold_index, fold_data in enumerate(dask_folds)
]

# Compute the results in parallel
min_distances = compute(*tasks)

# Convert the results back to a single CuPy array
min_distances = cp.concatenate([cp.array(dist) for dist in min_distances])
di_train = min_distances / mean_distance

# Find the upper whisker threshold for the DI values (75th percentile + 1.5 * IQR)
di_threshold = cp.percentile(di_train, 75) + 1.5 * cp.subtract(
    *cp.percentile(di_train, [75, 25])
)

print(f"DI threshold: {di_threshold}")

del folds, dask_folds
# Close the Dask cluster
cluster.close()
client.close()

Perhaps you already have a cluster running?
Hosting the HTTP server on port 45533 instead


DI threshold: 0.19293052655834353
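
The cell above derives the threshold from the upper whisker of the training DI values rather than from a fixed percentile. The following NumPy sketch, with made-up DI values and a hypothetical helper `upper_whisker`, shows how the two choices can differ when the training DI contains an extreme value.

```python
import numpy as np


def upper_whisker(values):
    """75th percentile + 1.5 * IQR (the upper fence of a box plot)."""
    q25, q75 = np.percentile(values, [25, 75])
    return q75 + 1.5 * (q75 - q25)


# Hypothetical training DI values with one extreme point
di_toy = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 5.0])

whisker = upper_whisker(di_toy)
p95 = np.percentile(di_toy, 95)

print(whisker, p95)
```

In this toy case the single extreme value pulls the 95th percentile far above the bulk of the data, while the whisker stays close to it, so the whisker gives the stricter threshold.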


## 2. Load imputed predict data and calculate minimum distances

In order to compare our "new" and "train" feature spaces, we will need to compute the distance between them. Let's get started by loading our "new" data (AKA `predict`).

Although our models can tolerate missing values in the data (an important attribute, as not all of the data variables are available simultaneously at all locations), calculating pairwise distances or building k-d trees requires a dense matrix (i.e. no missing values).

This leaves us with a few options:

a) drop all observations in “new” and “train” that contain any missing features. This misrepresents the actual spatial coverage of the training data and reduces the feature-space variance in the DI calculation relative to the reference data actually used in training, possibly resulting in an overly pessimistic AoA;

b) drop features in both the “new” and “train” data that contain any missing values, likely resulting in an overly optimistic AoA as feature-space complexity is reduced; or

c) a “middle ground” approach of imputing missing values and assuming that the resulting predictor space is still a robust representation of the true reference data.

To retain as much of the original signature of the true reference data as possible when calculating dissimilarity, and to ensure that the final AoA maps match the geographic extent of the predictions, we choose option “c”. Note, however, that this method is new and likely leaves room for improvement.
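
To make the trade-off between the three options concrete, here is a small pandas sketch on a made-up predictor table. The column names and values are hypothetical, and the median fill in option “c” is only a simple stand-in, not the imputation method actually used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical predictor table with scattered gaps
df = pd.DataFrame(
    {
        "canopy_height": [20.0, np.nan, 19.0, 7.0],
        "ndvi": [0.35, 0.26, np.nan, 0.53],
        "bio_1": [7.2, 14.6, 11.6, 18.4],
    }
)

option_a = df.dropna(axis=0)  # a) drop rows with any gap: loses locations
option_b = df.dropna(axis=1)  # b) drop columns with any gap: loses predictors
option_c = df.fillna(df.median())  # c) impute: keeps rows and columns

print(option_a.shape, option_b.shape, option_c.shape)
```

Even with only two missing cells, option “a” discards half of the locations and option “b” discards two of three predictors, whereas option “c” keeps the full table at the cost of relying on imputed values.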

### Imputation method

To impute missing values, we used `NaNImputer` from the Python `verstack` library, which fits gradient boosted tree regression models for each feature to fill missing values (Zherebtsov, 2020/2023; https://verstack.readthedocs.io/en/latest/index.html#nanimputer).

### Calculating distances between large dataframes using `dask_cudf`

Load predict data, normalize, and apply feature weights.

In [16]:
def scale(df: pd.DataFrame) -> pd.DataFrame:
    # We already have the means and standard deviations from the training data
    # Make sure that the columns are in the same order as the training data
    return (df - means) / stds


cluster = LocalCluster()
client = Client(cluster)

pred = dd.read_parquet(get_predict_imputed_fn(), npartitions=50).sample(
    frac=0.2, random_state=cfg.random_seed
)
xy = pred[["x", "y"]]
pred = pred.drop(columns=["x", "y"])

# min_values = pred.min().compute()
# max_values = pred.max().compute()

# Use map_partitions to apply normalization across all partitions
pred_scaled = pred.map_partitions(scale)

fi = (
    pd.read_csv(
        get_latest_run(get_trait_models_dir(Y_COL))
        / "splot"
        / cfg.train.feature_importance,
        index_col=0,
        header=[0, 1],
    )
    .sort_values(by=("importance", "mean"), ascending=False)["importance"]["mean"]
    .to_frame()
    .loc[pred_scaled.columns]
)

pred_scaled_wt = dd.concat([xy, pred_scaled * fi.T.values], axis=1)

cluster.close()
client.close()

Perhaps you already have a cluster running?
Hosting the HTTP server on port 40001 instead


Initialize dask CUDA cluster and convert predict and train dataframes to dask_cudf dataframes.

In [17]:
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=device_ids)
client = Client(cluster)

pred_scaled_wt_gpu = pred_scaled_wt.to_backend("cudf")
sg_scaled_wt_gpu = dd.from_pandas(sg_scaled_wt).to_backend("cudf")

Perhaps you already have a cluster running?
Hosting the HTTP server on port 38043 instead


In [18]:
# Function to compute nearest neighbors for a chunk
def compute_nearest_neighbors(
    chunk: cudf.DataFrame, smaller_df: cudf.DataFrame
) -> cudf.DataFrame:
    nn_model = NearestNeighbors(n_neighbors=1, algorithm="brute")
    nn_model.fit(smaller_df.drop(columns=["fold"]))
    chunk_xy = chunk[["x", "y"]]
    distances, _ = nn_model.kneighbors(chunk.drop(columns=["x", "y"]))
    result = cudf.DataFrame(
        {"x": chunk_xy["x"], "y": chunk_xy["y"], "distance": distances},
        index=chunk.index,
    )
    return result


# Apply the function to each chunk of the large dataframe
distances = pred_scaled_wt_gpu.map_partitions(
    compute_nearest_neighbors, sg_scaled_wt_gpu
)

# Compute the results
distances = distances.compute()

This may cause some slowdown.
Consider scattering data ahead of time and using futures.
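
The per-partition nearest-neighbor search above can be pictured with a CPU-only sketch. This NumPy version uses brute-force Euclidean distances and processes the “new” data in chunks so that only one distance block is in memory at a time. The function name `nearest_train_distance` and the toy arrays are hypothetical.

```python
import numpy as np


def nearest_train_distance(new, train, chunk_size=1000):
    """Distance from each row of `new` to its nearest row in `train`."""
    out = np.empty(new.shape[0])
    for start in range(0, new.shape[0], chunk_size):
        chunk = new[start:start + chunk_size]
        # (chunk, n_train) block of Euclidean distances
        block = np.linalg.norm(chunk[:, None, :] - train[None, :, :], axis=-1)
        out[start:start + chunk_size] = block.min(axis=1)
    return out


train_toy = np.array([[0.0, 0.0], [1.0, 1.0]])
new_toy = np.array([[0.0, 0.0], [1.0, 2.0], [4.0, 5.0]])

# chunk_size=2 forces two passes over the three new points
nn_dist = nearest_train_distance(new_toy, train_toy, chunk_size=2)
print(nn_dist)
```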


In [13]:
distances.head()

Unnamed: 0,x,y,distance
746506,-159.055,64.985,0.221444
132789,-154.055,69.925,0.107131
368375,-160.575,62.855,0.199781
191580,-156.595,65.245,0.181087
729142,-160.055,59.345,0.243463


In [25]:
# Normalize the nearest-neighbor distances by the mean pairwise training
# distance to get the DI
distances["di"] = distances["distance"] / mean_distance
# NOTE: "aoa" is True where the DI exceeds the threshold, i.e. OUTSIDE the AoA
distances["aoa"] = distances["di"] > di_threshold.item()

In [28]:
distances["aoa"].value_counts(normalize=True)

aoa
False    0.776161
True     0.223839
Name: proportion, dtype: float64
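
Note that the `aoa` column above is `True` where the DI exceeds the threshold, so the `False` share (about 77.6% of the sampled cells) is the part inside the AoA. A small pandas sketch with made-up DI values and hypothetical column names makes the polarity explicit:

```python
import pandas as pd

# Hypothetical DI values and threshold
toy = pd.DataFrame({"di": [0.05, 0.15, 0.19, 0.30, 0.80]})
threshold = 0.19

toy["outside_aoa"] = toy["di"] > threshold  # same test as the cell above
toy["inside_aoa"] = ~toy["outside_aoa"]  # the AoA proper

share_inside = toy["inside_aoa"].mean()
print(share_inside)
```

Naming the two masks separately avoids having to remember which way the comparison was written when the flag is later mapped or summarized.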

In [28]:
distances.to_parquet("X11_mean_DI_splot_gbif.parquet", compression="zstd")

## Compare AoA of sPlot-only models with combined sPlot-GBIF models

In [6]:
import pandas as pd

from src.utils.df_utils import grid_df_to_raster

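def aoa(df: pd.DataFrame, thresh: float = 0.95) -> pd.DataFrame:
    """Flag rows inside the AoA (True where DI <= thresh).

    Note that this is the opposite polarity of the "aoa" column written above.
    """
    return df.assign(aoa=df["di"] <= thresh)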

splot = (
    pd.read_parquet("X11_mean_DI_splot-only.parquet")
    .pipe(set_yx_index)
    .pipe(aoa)
)
splot_gbif = (
    pd.read_parquet("X11_mean_DI_splot-gbif.parquet")
    .pipe(set_yx_index)
    .pipe(aoa)
)

In [19]:
splot_gbif.describe()

Unnamed: 0,distance,di
count,70831990.0,70831990.0
mean,0.226786,16.75145
std,0.0382724,2.826971
min,0.1236836,9.135824
25%,0.1994979,14.73581
50%,0.2220517,16.40174
75%,0.2475636,18.28616
max,0.5391432,39.82353


In [9]:
splot_gbif.shape

(70831986, 3)

## Testing ground

In [2]:
import pandas as pd
from src.conf.environment import log


splot_gbif_aoa = pd.read_parquet("X11_mean_DI_splot_gbif.parquet")
splot_gbif_aoa.head()

Unnamed: 0,x,y,distance,di,aoa
746506,-159.055,64.985,0.139548,0.161644,False
132789,-154.055,69.925,0.096417,0.111684,False
368375,-160.575,62.855,0.151536,0.175531,False
191580,-156.595,65.245,0.110599,0.128111,False
729142,-160.055,59.345,0.133722,0.154896,False


In [15]:
splot_aoa = pd.read_parquet("X11_mean_DI_splot.parquet")
splot_aoa.head()

Unnamed: 0,x,y,distance,di,aoa
746506,-159.055,64.985,0.221444,0.258961,False
132789,-154.055,69.925,0.107131,0.125281,False
368375,-160.575,62.855,0.199781,0.233627,False
191580,-156.595,65.245,0.181087,0.211766,False
729142,-160.055,59.345,0.243463,0.28471,True


In [3]:
splot_gbif_aoa["aoa"].value_counts(normalize=True)

aoa
False    0.993865
True     0.006135
Name: proportion, dtype: float64

In [16]:
splot_aoa["aoa"].value_counts(normalize=True)

aoa
False    0.917558
True     0.082442
Name: proportion, dtype: float64
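
Beyond comparing the overall proportions, the two flag columns can be cross-tabulated cell by cell. The sketch below uses two tiny made-up tables in place of the saved parquet files, and assumes, as in the tables above, that `aoa` is `True` where a cell's DI exceeds the threshold (i.e. outside the AoA).

```python
import pandas as pd

# Hypothetical flag tables for two models on the same three grid cells
model_a = pd.DataFrame(
    {"x": [0.0, 0.1, 0.2], "y": [1.0, 1.0, 1.0], "aoa": [False, False, True]}
)
model_b = pd.DataFrame(
    {"x": [0.0, 0.1, 0.2], "y": [1.0, 1.0, 1.0], "aoa": [False, True, True]}
)

# Join on the grid coordinates and count agreement and disagreement
merged = model_a.merge(model_b, on=["x", "y"], suffixes=("_a", "_b"))
crosstab = pd.crosstab(merged["aoa_a"], merged["aoa_b"])

# Share of cells that both models place inside the AoA
both_inside = (~merged["aoa_a"] & ~merged["aoa_b"]).mean()
print(crosstab)
```

The off-diagonal counts show where one model extrapolates and the other does not, which the two overall proportions alone cannot reveal.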

In [19]:
from pathlib import Path
from src.utils.df_utils import grid_df_to_raster

grid_df_to_raster(
    splot_gbif_aoa[["x", "y", "di", "aoa"]].set_index(["y", "x"]),
    0.01,
    Path("aoa_splot_gbif.tif"),
)
grid_df_to_raster(
    splot_aoa[["x", "y", "di", "aoa"]].set_index(["y", "x"]),
    0.01,
    Path("aoa_splot.tif"),
)