# Satellite data preparation

While the previous notebooks focused on machine learning techniques for tabular data, in practice we'll be working with satellite imagery. This notebook shows how to prepare that imagery so that machine learning can be applied to it.

In [15]:
import xarray as xr
import hvplot.xarray
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from lightgbm import LGBMRegressor

We can load the satellite imagery, stored here as a NetCDF (`.nc`) file, with xarray.

In [3]:
ds = xr.open_dataset('data/s3_20200420T101527.nc')
ds

We can visualize this data with the hvPlot package, which adds an `.hvplot` accessor to xarray objects once `hvplot.xarray` is imported.

In [None]:
ds['chl_merged'].hvplot.quadmesh(
    x='lon',
    y='lat',
    title='Chlorophyll Merged',
    tiles='ESRI',
    cmap='jet',
    clim=(0, 30),  # fix the colour scale to the 0-30 range
)

## Converting the data

For machine learning, it is convenient to convert the xarray data to a format we are more familiar with, such as a pandas DataFrame. (Some packages work directly on xarray data, but we'll leave those out of scope for now.)

Luckily, xarray has this built in: `to_dataframe()` turns an xarray dataset into a pandas DataFrame with one row per pixel, indexed by the dataset's dimensions (here `lat` and `lon`).

In [5]:
df = ds.to_dataframe()
df

lat,lon,Rrs400_c,Rrs412_c,Rrs443_c,Rrs490_c,Rrs510_c,Rrs560_c,Rrs620_c,Rrs665_c,Rrs674_c,Rrs682_c,...,Rrs768_ca,Rrs779_ca,Rrs865_ca,Rrs884_ca,Rrs_walgo_ca,chl_merged,CHL,collocationFlags_CHL_20200420,bathymetry,collocationFlags_bathymetry_feature
55.838232,1.293941,,,,,,,,,,,...,,,,,,,0.585224,1,-81.022911,1
55.838232,1.296666,,,,,,,,,,,...,,,,,,,0.585224,1,-81.070915,1
55.838232,1.299391,,,,,,,,,,,...,,,,,,,0.585224,1,-80.585670,1
55.838232,1.302115,,,,,,,,,,,...,,,,,,,0.632866,1,-79.879280,1
55.838232,1.304840,,,,,,,,,,,...,,,,,,,0.632866,1,-79.964081,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51.217272,6.705042,,,,,,,,,,,...,,,,,,,,1,,1
51.217272,6.707766,,,,,,,,,,,...,,,,,,,,1,,1
51.217272,6.710491,,,,,,,,,,,...,,,,,,,,1,,1
51.217272,6.713216,,,,,,,,,,,...,,,,,,,,1,,1
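
To make the conversion concrete, here is a minimal sketch on a tiny, made-up dataset (the name `toy` and all values are purely illustrative, not taken from the satellite file). It shows that `to_dataframe()` produces one row per (`lat`, `lon`) combination, that pixels without data show up as `NaN`, and that `reset_index()` turns the index levels into ordinary columns.

```python
import numpy as np
import xarray as xr

# Hypothetical 2 x 3 grid with two missing pixels (illustrative values only).
toy = xr.Dataset(
    {"CHL": (("lat", "lon"), np.array([[0.5, np.nan, 0.7],
                                       [1.1, 1.2, np.nan]]))},
    coords={"lat": [55.0, 54.0], "lon": [1.0, 2.0, 3.0]},
)

# One row per (lat, lon) pair; the dimensions become a MultiIndex.
toy_df = toy.to_dataframe()
print(toy_df)

# reset_index() moves lat and lon from the index into regular columns.
flat = toy_df.reset_index()
print(flat.columns.tolist())
```

The many `NaN` cells in the output above arise the same way: every grid cell becomes a row, whether or not the sensor recorded a value there.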


As before, we select which columns to keep as features and which as the target. Rows with a missing value in any of those columns are dropped.

In [18]:
features = ['Rrs400_a', 'Rrs412_a', 'Rrs443_a', 'Rrs490_a', 'Rrs510_a',
            'Rrs560_a', 'Rrs620_a', 'Rrs665_a', 'Rrs674_a', 'Rrs682_a',
            'Rrs709_a', 'Rrs754_a', 'Rrs768_a', 'Rrs779_a',
            'Rrs865_a', 'Rrs884_a']
target = 'CHL'

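# Turn the (lat, lon) index into regular columns, then keep only the
# feature and target columns and drop rows with any missing value.
df = df.reset_index()
df = df[features + [target]].dropna()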

df

,Rrs400_a,Rrs412_a,Rrs443_a,Rrs490_a,Rrs510_a,Rrs560_a,Rrs620_a,Rrs665_a,Rrs674_a,Rrs682_a,Rrs709_a,Rrs754_a,Rrs768_a,Rrs779_a,Rrs865_a,Rrs884_a,CHL
677631,0.006249,0.005987,0.006647,0.007026,0.005889,0.003823,0.001276,0.000887,0.000855,0.000848,0.000522,0.000465,0.000439,0.000334,0.000252,0.000184,0.330278
677632,0.006249,0.005987,0.006647,0.007026,0.005889,0.003823,0.001276,0.000887,0.000855,0.000848,0.000522,0.000465,0.000439,0.000334,0.000252,0.000184,0.332297
677633,0.006249,0.005987,0.006647,0.007026,0.005889,0.003823,0.001276,0.000887,0.000855,0.000848,0.000522,0.000465,0.000439,0.000334,0.000252,0.000184,0.332297
679622,0.006249,0.005987,0.006647,0.007026,0.005889,0.003823,0.001276,0.000887,0.000855,0.000848,0.000522,0.000465,0.000439,0.000334,0.000252,0.000184,0.330278
679623,0.006249,0.005987,0.006647,0.007026,0.005889,0.003823,0.001276,0.000887,0.000855,0.000848,0.000522,0.000465,0.000439,0.000334,0.000252,0.000184,0.332297
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3299759,0.008406,0.007396,0.010466,0.014414,0.017267,0.026062,0.023503,0.017732,0.015753,0.016833,0.019384,0.007616,0.006467,0.007314,0.003409,0.002293,36.008640
3299760,0.008758,0.008161,0.011136,0.015394,0.018448,0.027389,0.025846,0.020024,0.018091,0.019322,0.022643,0.009821,0.008154,0.009487,0.004752,0.003343,36.109863
3299761,0.008758,0.008161,0.011136,0.015394,0.018448,0.027389,0.025846,0.020024,0.018091,0.019322,0.022643,0.009821,0.008154,0.009487,0.004752,0.003343,36.109863
3301737,0.008262,0.007384,0.009744,0.013407,0.016278,0.024983,0.022507,0.016303,0.014593,0.015779,0.018498,0.007263,0.006055,0.006982,0.003814,0.003009,39.085072
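
The row labels in the table above jump (677631, 677632, ..., 679622) because `dropna()` removes every row with at least one missing value and keeps the original labels of the surviving rows. A small sketch on made-up data (the frame `toy` and its values are hypothetical) illustrates this behaviour:

```python
import numpy as np
import pandas as pd

# Hypothetical pixels; NaN marks a missing measurement (illustrative values only).
toy = pd.DataFrame({
    "Rrs400_a": [0.0062, np.nan, 0.0081, 0.0088, 0.0090],
    "Rrs412_a": [0.0060, 0.0061, np.nan, 0.0082, 0.0085],
    "CHL":      [0.33,   0.35,   0.40,   36.1,   np.nan],
})

toy_features = ["Rrs400_a", "Rrs412_a"]
toy_target = "CHL"

# Keep only rows where every selected column has a value.
clean = toy[toy_features + [toy_target]].dropna()

print(clean)
print(f"Kept {len(clean)} of {len(toy)} rows")
```

Only rows 0 and 3 survive, and they keep their original labels. If consecutive labels are preferred, `reset_index(drop=True)` can be chained after `dropna()`.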


## Splitting the data and training a model

As before, we can now split the data into a training set and a test set, and train a machine learning model on the training set.

In [19]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

X_train = df_train[features]
y_train = df_train[target]

X_test = df_test[features]
y_test = df_test[target]

In [20]:
model = LGBMRegressor()
model.fit(X_train, y_train)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.033476 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 4080
[LightGBM] [Info] Number of data points in the train set: 731799, number of used features: 16
[LightGBM] [Info] Start training from score 5.247112


In [21]:
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean squared error: {mse}')

Mean squared error: 1.9679613366296864
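
The MSE is expressed in squared units of the target, which makes it hard to interpret on its own. Taking the square root gives the root mean squared error (RMSE), which is in the same units as `CHL`. As a rough reference, the sketch below compares it with the start score from the LightGBM log above (5.247112), which, assuming LightGBM's default behaviour for squared-error regression, is the mean of the training target. The variable names are illustrative.

```python
import math

# Values copied from the outputs above.
mse = 1.9679613366296864
start_score = 5.247112  # assumed to be the mean of y_train (LightGBM default)

# RMSE is in the same units as the target.
rmse = math.sqrt(mse)

# Error relative to the typical target value, as a rough sense of scale.
relative_rmse = rmse / start_score

print(f"RMSE: {rmse:.3f}")
print(f"RMSE relative to start score: {relative_rmse:.2f}")
```

This gives an RMSE of about 1.40, or roughly 27% of the start score.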


This result looks good at first sight, but it may be too good to be true. In machine learning, and especially with geospatial data, we need to be wary of data leakage: information that the model sees during training also appears in the test set, so the test score overestimates how well the model generalizes. A random split makes this likely here, because neighbouring pixels tend to have very similar (or even identical) values, so a pixel in the training set can reveal the answer for a nearby pixel in the test set.
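
One way to get a feel for the problem is to count how many test rows have an exact copy of their feature vector in the training set. The sketch below does this on synthetic data in which every feature vector is repeated for three "neighbouring pixels". All names (`toy`, `f1`, `f2`, `block`) are hypothetical, and the random split uses pandas' `DataFrame.sample` instead of `train_test_split` to keep the example dependency-light; it is a diagnostic sketch under these assumptions, not a measurement on the satellite data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 100 blocks of 3 neighbouring pixels that share identical feature values.
n_blocks, block_size = 100, 3
block_values = rng.random((n_blocks, 2))
toy = pd.DataFrame(np.repeat(block_values, block_size, axis=0),
                   columns=["f1", "f2"])
toy["block"] = np.repeat(np.arange(n_blocks), block_size)

# Random 80/20 split, ignoring which block a pixel belongs to.
toy_test = toy.sample(frac=0.2, random_state=42)
toy_train = toy.drop(toy_test.index)

# A test row counts as "leaked" if the same feature vector occurs in training.
train_vectors = set(map(tuple, toy_train[["f1", "f2"]].to_numpy()))
is_leaked = [tuple(row) in train_vectors
             for row in toy_test[["f1", "f2"]].to_numpy()]
leaked_fraction = float(np.mean(is_leaked))

print(f"Test rows with an identical training row: {leaked_fraction:.0%}")
```

With a random split, most test pixels have a twin in the training set, so a model can score well simply by memorizing those twins.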

How to deal with this is explored in the next notebook.