# Geospatial data splitting

When doing machine learning on geospatial data, special care must be taken to prevent data leakage due to spatiotemporal correlation (for example, when a certain pixel is in the training set, it could influence the prediction of a neighbouring pixel in the test set).

To prevent this, it is better not to split pixels at random. There are a few ways to handle this, including:
* Temporal splitting: train on one time range in the data and evaluate on a different time range.
* Spatial splitting: train on one region of the data and evaluate on a different region.
* In either case, you can pick a single splitting point (e.g. everything above a certain x coordinate, everything before a certain timestamp, ...) or stratify the split (every 5th month goes to the test set, the data is separated into strips that are each assigned to the train or test set, the data is divided into a grid, ...).
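
The two flavours of temporal splitting listed above can be sketched as follows. This is a minimal illustration on synthetic data: the `obs` frame, its `time` and `value` columns and the cutoff date are made up for the example and are not part of the satellite dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical daily observations over two years (synthetic, for illustration only).
times = pd.date_range("2020-01-01", "2021-12-31", freq="D")
obs = pd.DataFrame({"time": times, "value": np.arange(len(times))})

# Option 1: a single splitting point in time.
cutoff = pd.Timestamp("2021-07-01")  # assumed cutoff, chosen arbitrarily
train_cut = obs[obs["time"] < cutoff]
test_cut = obs[obs["time"] >= cutoff]

# Option 2: stratified in time, every 5th calendar month goes to the test set.
# month_index counts months from the first year in the data (0, 1, 2, ...).
years_since_start = obs["time"].dt.year - obs["time"].dt.year.min()
month_index = years_since_start * 12 + obs["time"].dt.month - 1
is_test = month_index % 5 == 4
train_strat = obs[~is_test]
test_strat = obs[is_test]

print(f"Single cut: {len(train_cut)} train, {len(test_cut)} test")
print(f"Stratified: {len(train_strat)} train, {len(test_strat)} test")
```

The single cut is the simplest option, but the test period may then cover only one season. The stratified variant spreads the test months over the whole time range, at the cost of more train/test boundaries where neighbouring days are still correlated.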

## Strip-based splitting

An example is shown below, where we divide the satellite image into strips based on latitude. The goal is to have a train set and a test set that each consist of different horizontal strips, stacked from north to south, to limit data leakage between the two sets.

In [43]:
import xarray as xr
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from lightgbm import LGBMRegressor

We load in the satellite data as in the previous notebook, but we also need to keep the coordinate values this time, as they are used for train / test splitting.

In [None]:
ds = xr.open_dataset('data/s3_20200420T101527.nc')
df = ds.to_dataframe()

features = ['Rrs400_a', 'Rrs412_a', 'Rrs443_a', 'Rrs490_a', 'Rrs510_a',
            'Rrs560_a', 'Rrs620_a', 'Rrs665_a', 'Rrs674_a', 'Rrs682_a',
            'Rrs709_a', 'Rrs754_a', 'Rrs768_a', 'Rrs779_a',
            'Rrs865_a', 'Rrs884_a']

target = 'CHL'

df = df.reset_index()
df = df[features + [target] + ['lat', 'lon']].dropna()

Here we choose to divide the data up into 50 horizontal strips. A new column is added indicating which strip each data point belongs to.

In [36]:
n_strips = 50
strips = range(n_strips)
df['lat_strip'] = pd.cut(df['lat'], n_strips, labels=False)
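
To make the behaviour of `pd.cut` with `labels=False` concrete, here is a small sketch on a handful of made-up latitude values (the numbers are hypothetical and only serve the illustration). With an integer number of bins, `pd.cut` creates equal-width intervals over the range of the data and returns the integer index of the interval each value falls into.

```python
import pandas as pd

# Five made-up latitudes spanning 50.0 to 52.0 degrees.
lat = pd.Series([50.0, 50.4, 50.9, 51.6, 52.0])

# Four equal-width strips of 0.5 degrees each; labels=False returns
# the integer strip index instead of an Interval object.
strip = pd.cut(lat, 4, labels=False)
print(strip.tolist())  # [0, 0, 1, 3, 3]
```

Note that the strips have equal width, not an equal number of points: strip 2 is empty in this toy example. On real data this means that train and test sizes follow the spatial distribution of valid pixels rather than the exact `test_size` fraction.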

In [45]:
train_strips, test_strips = train_test_split(strips, test_size=0.2, random_state=42)

df_train = df[df['lat_strip'].isin(train_strips)]
df_test = df[df['lat_strip'].isin(test_strips)]

print(f'Number of train samples: {len(df_train)}')
print(f'Number of test samples: {len(df_test)}')

Number of train samples: 721744
Number of test samples: 193005


In [40]:
X_train = df_train[features]
y_train = df_train[target]
X_test = df_test[features]
y_test = df_test[target]

model = LGBMRegressor()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean squared error: {mse}')

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.010390 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 4080
[LightGBM] [Info] Number of data points in the train set: 725273, number of used features: 16
[LightGBM] [Info] Start training from score 4.743175
Mean squared error: 4.8316257658220225


We find that the error is significantly higher than in the previous notebook. This does not necessarily mean that the model performed worse. A more likely explanation is that this method avoided data leakage, and that this error is more representative of how the model performs on data it has not seen.

## Grid-based splitting

In [None]:
ds = xr.open_dataset('data/s3_20200420T101527.nc')
df = ds.to_dataframe()

features = ['Rrs400_a', 'Rrs412_a', 'Rrs443_a', 'Rrs490_a', 'Rrs510_a',
            'Rrs560_a', 'Rrs620_a', 'Rrs665_a', 'Rrs674_a', 'Rrs682_a',
            'Rrs709_a', 'Rrs754_a', 'Rrs768_a', 'Rrs779_a',
            'Rrs865_a', 'Rrs884_a']

target = 'CHL'

df = df.reset_index()
df = df[features + [target] + ['lat', 'lon']].dropna()

We define a number of strips in the latitude and longitude direction which will make up the grid.

In [46]:
n_strips_lat = 20
n_strips_lon = 20
df['grid_lat'] = pd.cut(df['lat'], n_strips_lat, labels=False)
df['grid_lon'] = pd.cut(df['lon'], n_strips_lon, labels=False)

# Get the grid cells that actually contain data
unique_grid_cells = df[['grid_lon', 'grid_lat']].drop_duplicates()
print(f"Number of grid cells that contain data: {len(unique_grid_cells)}")

Number of grid cells that contain data: 257


Separate into train and test grid cells.

In [None]:
train_cells, test_cells = train_test_split(unique_grid_cells.index, test_size=0.2, random_state=42)
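
One way to turn a selection of grid cells into train and test rows is sketched below on synthetic data. The `pixels` frame and the `cell_id` column are hypothetical names introduced for this illustration. The sketch also swaps in NumPy's `Generator.choice` for `train_test_split` to pick the held-out cells, purely to keep the example small. Passing the cell ids to `train_test_split` would serve the same purpose.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the satellite pixels (hypothetical coordinates).
rng = np.random.default_rng(0)
pixels = pd.DataFrame({
    "lat": rng.uniform(50.0, 52.0, size=1000),
    "lon": rng.uniform(2.0, 4.0, size=1000),
})

n_strips_lat = 20
n_strips_lon = 20
pixels["grid_lat"] = pd.cut(pixels["lat"], n_strips_lat, labels=False)
pixels["grid_lon"] = pd.cut(pixels["lon"], n_strips_lon, labels=False)

# One integer id per grid cell, so membership can be tested with isin().
pixels["cell_id"] = pixels["grid_lat"] * n_strips_lon + pixels["grid_lon"]

# Randomly hold out 20% of the cells that actually contain data.
cell_ids = pixels["cell_id"].unique()
n_test = int(round(0.2 * len(cell_ids)))
test_cell_ids = rng.choice(cell_ids, size=n_test, replace=False)

# Every pixel follows its cell, so no cell is shared between the two sets.
is_test = pixels["cell_id"].isin(test_cell_ids)
pixels_train = pixels[~is_test]
pixels_test = pixels[is_test]

print(f"Train pixels: {len(pixels_train)}, test pixels: {len(pixels_test)}")
```

Because whole cells are assigned to one set, the fraction of test pixels only approximates 20% and depends on how many pixels each selected cell contains.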