# 1.0.1: Calculate spatial lag of trait data (Y)

A common issue when training models on geospatial data is spatial autocorrelation. If the labels for your data are spatially autocorrelated, then traditional cross-validation (CV) approaches such as randomized K-Fold CV may indicate overly optimistic model performance. This is because, when points are randomly assigned to folds, it is very likely that points that are spatially correlated with each other end up in different folds. As a result, spatially correlated (i.e. dependently similar) points may exist in both the train and the test folds, confounding the generalizability assessment of the model (see [Meyer and Pebesma, 2022](https://www.nature.com/articles/s41467-022-29838-9) and [Kattenborn *et al*., 2022](https://www.sciencedirect.com/science/article/pii/S2667393222000072)).

To overcome this, we can use **Spatial K-Fold cross-validation** (SKCV) ([Pohjankukka *et al*., 2017](http://arxiv.org/abs/2005.14263)). SKCV first requires us to calculate the spatial lag, or spatial autocorrelation range, of the response variable. Next, we create a hex grid with side length equal to that range and bin the data by hex cell. Finally, we assign fold IDs to *entire hex cells* instead of to individual points. This, in turn, gives us more confidence that we are not testing a model against data that is spatially correlated with its training data.
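The hex-binning and fold-assignment steps can be sketched as follows. This is a minimal illustration and not the implementation used in this project: the function names `hex_cell_ids` and `assign_spatial_folds` are hypothetical, coordinates are assumed to be planar (e.g. meters), and the hexagon math is the standard axial-coordinate rounding for pointy-top hexagons.

```python
import numpy as np


def hex_cell_ids(x, y, side):
    """Return integer axial (q, r) hex-cell coordinates for each point."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Fractional axial coordinates for pointy-top hexagons of the given side length
    q = (np.sqrt(3) / 3 * x - y / 3) / side
    r = (2 / 3 * y) / side
    s = -q - r

    # Cube rounding: round all three, then repair the one with the largest error
    rq, rr, rs = np.round(q), np.round(r), np.round(s)
    dq, dr, ds = np.abs(rq - q), np.abs(rr - r), np.abs(rs - s)
    fix_q = (dq > dr) & (dq > ds)
    fix_r = ~fix_q & (dr > ds)
    rq = np.where(fix_q, -rr - rs, rq)
    rr = np.where(fix_r, -rq - rs, rr)

    return rq.astype(int), rr.astype(int)


def assign_spatial_folds(x, y, side, n_folds=5, seed=0):
    """Assign a fold ID to every point so that all points in a hex cell share it."""
    q, r = hex_cell_ids(x, y, side)
    cells = np.stack([q, r], axis=1)

    # Map each point to the index of its (unique) hex cell
    _, cell_index = np.unique(cells, axis=0, return_inverse=True)
    cell_index = cell_index.ravel()
    n_cells = cell_index.max() + 1

    # Shuffle the cells, then deal them out to folds round-robin
    rng = np.random.default_rng(seed)
    cell_folds = rng.permutation(n_cells) % n_folds

    return cell_folds[cell_index]
```

Because folds are dealt out per cell rather than per point, two points closer together than the cell size usually land in the same fold, although points on either side of a cell boundary can still be split.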

## Imports and config

In [1]:
from pathlib import Path

from dask import compute, delayed
from dask.distributed import Client
import numpy as np
import pandas as pd
from pykrige.ok import OrdinaryKriging
import utm

from src.conf.conf import get_config
from src.conf.environment import log

pd.set_option('display.max_columns', None)

%load_ext autoreload
%autoreload 2

cfg = get_config()

## Calculate variograms of each trait

Since we just featurized the EO data together with the combined sPlot and GBIF data, we can simply load the newly generated features and fit a variogram model to them. However, we don't know in advance which type of variogram is best (e.g. isotropic vs. anisotropic). To decide, we can fit both types of model (one assuming directional independence, the other directional dependence), calculate the Akaike Information Criterion (AIC) of each, and use the range of the model with the lowest AIC.
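As a rough illustration of AIC-based selection, the sketch below scores candidate variogram curves against a set of semivariances using the least-squares form of the AIC, `n * ln(RSS / n) + 2k`, which assumes Gaussian errors. The helper names and the toy data are hypothetical, and for simplicity it compares two candidate ranges rather than an isotropic and an anisotropic fit; the selection principle is the same.

```python
import math


def spherical(h, psill, rng, nugget):
    """Spherical variogram: rises to nugget + psill at distance `rng`, then stays flat."""
    if h >= rng:
        return nugget + psill
    ratio = h / rng
    return nugget + psill * (1.5 * ratio - 0.5 * ratio**3)


def aic_least_squares(observed, predicted, n_params):
    """AIC for a least-squares fit with Gaussian errors: n * ln(RSS / n) + 2k."""
    n = len(observed)
    rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return n * math.log(rss / n) + 2 * n_params


# Toy "empirical" semivariances: a spherical curve plus a small deterministic wiggle
lags = [10.0 * i for i in range(1, 21)]
observed = [spherical(h, 2.0, 80.0, 0.5) + 0.01 * math.sin(h) for h in lags]

# Two candidate parameter sets (psill, range, nugget), each with 3 parameters
candidates = {"range=80": (2.0, 80.0, 0.5), "range=150": (2.0, 150.0, 0.5)}
scores = {
    name: aic_least_squares(observed, [spherical(h, *p) for h in lags], n_params=3)
    for name, p in candidates.items()
}

# The candidate with the lowest AIC is preferred
best = min(scores, key=scores.get)
```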

Keep in mind that this is not at all an exhaustive approach to selecting the best variogram model, but since we have so many variables to compare, it's at least a step in the right direction while remaining relatively automated.

### Convert geographic coordinates to Cartesian coordinates using UTM zones

Before we can fit variograms, we need to be able to calculate Euclidean distances between points. The problem is that our coordinates are geographic (longitude and latitude), and there is no single Euclidean space in which distances between all points on the globe are preserved. To work around this, we can calculate the UTM coordinates and zone for each point. We can then group our dataframe by UTM zone and calculate the spatial autocorrelation range for each zone. Afterward, we can take the mean autocorrelation range across all zones.
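To make the zone labels concrete, here is a simplified sketch of how a UTM zone number and latitude band letter follow from a longitude/latitude pair. The function name `utm_zone_label` is hypothetical, and the sketch deliberately ignores the special-case zones around Norway and Svalbard, so it is for intuition only and not a replacement for a proper UTM library.

```python
import math

# Latitude band letters, 8 degrees each from 80S; "I" and "O" are skipped and
# the final band "X" spans 12 degrees (72N to 84N), hence the repeated "X".
_BANDS = "CDEFGHJKLMNPQRSTUVWXX"


def utm_zone_label(lon, lat):
    """Return a label such as '1W' for a longitude/latitude pair in degrees."""
    if not -80 <= lat <= 84:
        raise ValueError("UTM is only defined between 80S and 84N")

    # Zones are 6 degrees wide, numbered 1 to 60 eastward from 180W
    zone = int(math.floor((lon + 180) / 6)) % 60 + 1
    letter = _BANDS[int((lat + 80) // 8)]
    return f"{zone}{letter}"


print(utm_zone_label(-179.895, 68.395))
```

Within a single zone, easting and northing are planar coordinates in meters, which is what makes Euclidean distance calculations reasonable there.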

Load the features, select the x, y, and one trait (Y) column for example purposes, and downcast the types slightly to help with memory management.

In [4]:
feats = (
    pd.read_parquet(
        Path(cfg.train.dir)
        / cfg.PFT
        / cfg.model_res
        / cfg.datasets.Y.use
        / cfg.train.features
    )[["x", "y", "X50_mean"]]
    .dropna()
    .astype({"x": np.float32, "y": np.float32, "X50_mean": np.float32})
)

feats.head(2)

| | x | y | X50_mean |
|---|---|---|---|
| 0 | -179.895004 | 68.394997 | 1.447954 |
| 1 | -179.895004 | 68.254997 | 1.063078 |


Let's set up some helper functions to add easting, northing, and UTM zone information to our `DataFrame`. Due to the size of the data, we're going to add some parallelization with Dask.

In [5]:
@delayed
def get_utm_zones(x: np.ndarray, y: np.ndarray) -> tuple[list, list, list]:
    eastings, northings, zones = [], [], []

    for x_, y_ in zip(x, y):
        # utm.from_latlon expects (latitude, longitude), i.e. (y, x)
        easting, northing, zone, letter = utm.from_latlon(y_, x_)
        eastings.append(easting)
        northings.append(northing)
        zones.append(f"{zone}{letter}")

    return eastings, northings, zones


def add_utm(df: pd.DataFrame, chunksize: int = 10000) -> pd.DataFrame:
    x = df.x.to_numpy()
    y = df.y.to_numpy()

    # Split x and y into chunks
    x_chunks = [x[i : i + chunksize] for i in range(0, len(x), chunksize)]
    y_chunks = [y[i : i + chunksize] for i in range(0, len(y), chunksize)]

    # Compute the UTM zones for each chunk in parallel
    results = [
        get_utm_zones(x_chunk, y_chunk) for x_chunk, y_chunk in zip(x_chunks, y_chunks)
    ]

    results = compute(*results)

    # Assign the results to new columns in df
    df["easting"] = [e for result in results for e in result[0]]
    df["northing"] = [n for result in results for n in result[1]]
    df["zone"] = [z for result in results for z in result[2]]

    return df

Now we can add the UTM info to the `feats` `DataFrame`.

In [6]:
with Client(dashboard_address=cfg.dask_dashboard, n_workers=60) as client:
    # The context manager closes the client on exit
    feats = add_utm(feats)
feats.head()



| | x | y | X50_mean | easting | northing | zone |
|---|---|---|---|---|---|---|
| 0 | -179.895004 | 68.394997 | 1.447954 | 381078.108646 | 7589698.0 | 1W |
| 1 | -179.895004 | 68.254997 | 1.063078 | 380345.240365 | 7574103.0 | 1W |
| 2 | -179.884995 | 67.364998 | 1.369326 | 376132.402265 | 7474949.0 | 1W |
| 3 | -179.875 | 68.705002 | 1.630068 | 383513.280606 | 7624193.0 | 1W |
| 4 | -179.854996 | 68.735001 | 2.958496 | 384478.378837 | 7627497.0 | 1W |


Lastly, we can group `feats` by UTM zone and then calculate the autocorrelation range for each zone. While we are not able to compare points separated by zones, this approach does have the double benefit of ensuring our points are always in Euclidean space and making the code nicely parallelizable without exceeding our memory constraints.

In [34]:
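@delayed
def calculate_variogram(
    group: pd.DataFrame, data_col: str, **kwargs
) -> float:
    # Skip zones with too few points for a stable fit; 0 marks "no estimate"
    # and is filtered out downstream
    if not isinstance(group, pd.DataFrame) or len(group) < 200:
        return 0.0

    # Cap the number of points per zone to keep memory use bounded
    n_max = kwargs.pop("n_max", 20_000)

    if len(group) > n_max:
        group = group.sample(n_max)

    # The remaining kwargs are passed straight to PyKrige
    OK = OrdinaryKriging(group["easting"], group["northing"], group[data_col], **kwargs)

    # For the spherical model, PyKrige stores the fitted parameters as
    # [partial sill, range, nugget], so index 1 is the range (in meters here)
    return OK.variogram_model_parameters[1]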
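For intuition about what the fitted range represents: a variogram model is fitted to an empirical semivariogram, i.e. the average of half the squared difference between pairs of values, binned by the distance between them. The range is the distance at which that curve levels off. The sketch below computes such an empirical semivariogram with NumPy; `empirical_semivariogram` is a hypothetical helper written for illustration and is not PyKrige's internal code.

```python
import numpy as np


def empirical_semivariogram(x, y, z, n_lags=10):
    """Return (mean pair distance, mean semivariance) for each non-empty lag bin."""
    coords = np.column_stack([x, y]).astype(float)
    z = np.asarray(z, dtype=float)

    # Pairwise distances and semivariances (upper triangle only, no self-pairs)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    semivar = 0.5 * (z[:, None] - z[None, :]) ** 2
    upper = np.triu_indices(len(z), k=1)
    d, g = dist[upper], semivar[upper]

    # Equal-width lag bins from 0 to the largest pair distance
    edges = np.linspace(0, d.max(), n_lags + 1)
    centers, gamma = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (d > lo) & (d <= hi)
        if mask.any():
            centers.append(d[mask].mean())
            gamma.append(g[mask].mean())

    return np.array(centers), np.array(gamma)
```

Note that this builds full pairwise matrices, so memory grows with the square of the number of points, which is one reason to subsample each zone before fitting.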

In [38]:
with Client(dashboard_address=cfg.dask_dashboard, n_workers=3) as client:
    grouped = feats[["X50_mean", "easting", "northing", "zone"]].groupby("zone")

    kwargs = {
        "n_max": 20000,
        "variogram_model": "spherical",
        "nlags": 30,
        "anisotropy_scaling": 1,
        "anisotropy_angle": 0,
    }

    results = [
        calculate_variogram(group, "X50_mean", **kwargs)
        for _, group in grouped
    ]

    # Run all delayed variogram fits in parallel
    autocorr_ranges = list(compute(*results))

    # Drop zones that were skipped (calculate_variogram returned 0)
    filt_ranges = [r for r in autocorr_ranges if r != 0]

    # Ranges are in meters (UTM); dividing by ~111,325 m per degree gives a
    # rough equivalent in degrees
    print(kwargs)
    print(f"Mean range: {np.mean(filt_ranges) / 111325} deg")
    print(f"Median range: {np.median(filt_ranges) / 111325} deg")
    print(f"5th percentile range: {np.quantile(filt_ranges, 0.05) / 111325} deg")
    print(f"95th percentile range: {np.quantile(filt_ranges, 0.95) / 111325} deg")



{'n_max': 20000, 'variogram_model': 'spherical', 'nlags': 30, 'anisotropy_scaling': 1, 'anisotropy_angle': 0}
Mean range: 3.535415466870557 deg
Median range: 2.300591841183184 deg
5th percentile range: 0.07223058805334966 deg
95th percentile range: 9.045320496182633 deg
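The constant `111325` used above is roughly the number of meters in one degree of arc at the equator. That approximation holds for latitude everywhere, but a degree of longitude shrinks with the cosine of the latitude. The sketch below shows the conversion with an optional latitude correction; `meters_to_degrees` is a hypothetical helper, and the correction is not applied in the cell above.

```python
import math

M_PER_DEG = 111_325  # same constant as in the cell above


def meters_to_degrees(range_m, lat_deg=0.0):
    """Convert a distance in meters to degrees of longitude at a given latitude.

    With the default lat_deg=0 this reduces to range_m / M_PER_DEG.
    """
    return range_m / (M_PER_DEG * math.cos(math.radians(lat_deg)))
```

For example, a range of 111,325 m is about 1 degree at the equator but about 2 degrees of longitude at 60° N, so the degree values reported here are best read as a rough guide.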


```python
{
    "n_max": 20000,
    "variogram_model": "spherical",
    "nlags": 30,
    "anisotropy_scaling": 1,
    "anisotropy_angle": 0,
}
# Mean range: 3.535415466870557 deg
# Median range: 2.300591841183184 deg
# 5th percentile range: 0.07223058805334966 deg
# 95th percentile range: 9.045320496182633 deg
```

```python
{
    "n_max": 20000,
    "variogram_model": "spherical",
    "nlags": 20,
    "anisotropy_scaling": 1,
    "anisotropy_angle": 0,
}
# Mean range: 3.5283237814004558 deg
# Median range: 2.511821835038483 deg
# 5th percentile range: 0.0710157956836979 deg
# 95th percentile range: 8.96835642780179 deg
```

```python
{
    "n_max": 20000,
    "variogram_model": "spherical",
    "nlags": 10,
    "anisotropy_scaling": 1,
    "anisotropy_angle": 0,
}
# Mean range: 3.3299486856787843 deg
# Median range: 2.2360914720180007 deg
# 5th percentile range: 0.07918879354605478 deg
# 95th percentile range: 8.604451120028585 deg
```

```python
n_max = 20_000
model = "spherical"
anisotropy_angle = 90
anisotropy_scaling = 1.5

# Mean range: 3.5017888367085375 deg
# Median range: 2.056769300783017 deg
# 5th percentile range: 0.13736201609716384 deg
# 95th percentile range: 9.450227873471638 deg
```

```python
n_max = 20_000
model = "spherical"
anisotropy_angle = 0
anisotropy_scaling = 1

# Mean range: 2.9805928835492264 deg
# Median range: 1.8259634833926948 deg
# 5th percentile range: 0.09173615524985357 deg
# 95th percentile range: 7.991990058407153 deg
```

```python
n_max = 10_000
model = "spherical"
anisotropy_angle = 0
anisotropy_scaling = 1

# Mean range: 2.90236212377645 deg
# Median range: 1.7994449513764208 deg
# 5th percentile range: 0.09173615524985357 deg
# 95th percentile range: 7.980099791529539 deg
```

```python
n_max = 20_000
model = "linear"
anisotropy_angle = 0
anisotropy_scaling = 1

# Mean range: 2.170098969135144e-06 deg
# Median range: 1.8805479230810616e-06 deg
# 5th percentile range: 1.0384935296796446e-06 deg
# 95th percentile range: 4.388690590407629e-06 deg
```

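After looking through a few different runs, a spherical variogram with no anisotropy and around 15 lags seems robust enough to handle the many different distributions of trait data. Note that the linear run is not comparable with the others: a linear variogram has no range parameter, and for that model index 1 of PyKrige's `variogram_model_parameters` is the nugget, which explains the near-zero values.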

The last thing to do is to calculate these mean ranges for all traits and save them in a file.
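One possible shape for that final step is sketched below, using only the standard library. It assumes the per-zone ranges for each trait have already been computed as in the cells above and collected into a dictionary. The function name `save_trait_ranges`, the output column names, the second trait name, and all numeric values are hypothetical.

```python
import csv
import statistics
import tempfile
from pathlib import Path


def save_trait_ranges(ranges_by_trait, out_path):
    """Write one row per trait with the mean and median of its per-zone ranges."""
    rows = []
    for trait, ranges in ranges_by_trait.items():
        valid = [r for r in ranges if r != 0]  # 0 marks skipped zones
        if not valid:
            continue  # no usable estimate for this trait
        rows.append(
            {
                "trait": trait,
                "mean_range_m": statistics.fmean(valid),
                "median_range_m": statistics.median(valid),
            }
        )

    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["trait", "mean_range_m", "median_range_m"]
        )
        writer.writeheader()
        writer.writerows(rows)

    return rows


# Toy per-zone ranges in meters; the values are made up for illustration
example = {"X50_mean": [250_000.0, 0, 350_000.0], "X11_mean": [0, 0]}
out_file = Path(tempfile.mkdtemp()) / "trait_ranges.csv"
rows = save_trait_ranges(example, out_file)
```

Storing the ranges in meters and converting to degrees only when building the hex grid keeps the saved values independent of the approximation used for the conversion.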