# Train a binary classifier on a cutout

In this notebook, we will train a binary classifier on a cutout of a Sentinel-2 RGB image. The trained model can later be used to classify the full image.

## Data used in this notebook

The data used in this notebook has been preprocessed and is available in the `data` directory:

- `sentinel2_rgb_res_20_size_8000_cog.tif`: Sentinel-2 image with the RGB bands of the area of interest, at 20 m resolution and a size of 8000x8000 pixels
- `sentinel2_rgb_res_20_cutout.tif`: a cutout of the above image
- `waterbody_labels.gpkg`: manually created waterbody polygons in the cutout
- `none_waterbody_labels.gpkg`: manually created non-waterbody polygons in the cutout

In this notebook, we will use the two `.gpkg` files as positive and negative labels to train the binary classifier on `sentinel2_rgb_res_20_cutout.tif`. In step 2, we will apply the trained model to the full image `sentinel2_rgb_res_20_size_8000_cog.tif`.

In [None]:
from pathlib import Path
import rioxarray
import geopandas as gpd
import xarray as xr
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.patches import Patch

from sklearn.ensemble import RandomForestClassifier
import joblib  # For saving and loading the model

## Inspect the AoI

To begin, we inspect the full image. It is a Cloud Optimized GeoTIFF (COG) file, which can be opened at an overview level instead of loading the full-resolution image into memory. Here we use `overview_level=1` to load a 4x4 downsampled image.

In [None]:
# Load the RGB image at overview level 1
# This is lazy loading: no data is persisted in memory yet
# Overview levels are zero-based, so level 1 is the second overview, a lower-resolution version of the image
path_rgb_full = Path("data/sentinel2_rgb_res_20_size_8000_cog.tif")
rgb_full = rioxarray.open_rasterio(
    path_rgb_full, overview_level=1)
rgb_full

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
rgb_full.plot.imshow(ax=ax, robust=True)

As we can see, the area of interest (AoI) covers the central-east of the Netherlands and the western part of Germany. Our goal is to segment waterbodies and non-waterbodies in this region.

## Load and investigate training data

Now we load a cutout of this image, and the labels for waterbodies and non-waterbodies.

In [None]:
# Assume data is in the 'data' directory
path_rgb_cutout = Path("data/sentinel2_rgb_res_20_cutout.tif")
path_positive_label = Path("data/waterbody_labels.gpkg")
path_negative_label = Path("data/none_waterbody_labels.gpkg")
rgb_cutout = rioxarray.open_rasterio(path_rgb_cutout)
gpd_positive = gpd.read_file(path_positive_label)
gpd_negative = gpd.read_file(path_negative_label)

In [None]:
# Plot the polygon labels over the RGB cutout
fig, ax = plt.subplots(figsize=(8, 8))
rgb_cutout.plot.imshow(ax=ax, robust=True)
gpd_positive.plot(ax=ax, color='cyan', alpha=0.6, edgecolor='k')
gpd_negative.plot(ax=ax, color='red', alpha=0.6, edgecolor='k')

# Legends
legend_elements = [
    Patch(facecolor='cyan', edgecolor='k', alpha=0.6, label='Waterbody Labels'),
    Patch(facecolor='red', edgecolor='k', alpha=0.6, label='Non-waterbody Labels')
]
ax.legend(handles=legend_elements, loc='upper right', frameon=True, fancybox=True, shadow=True)

In the cutout, the polygon labels are manually created, marking examples of waterbodies and non-waterbodies.

## Convert labels from vector to raster

We will use scikit-learn's `RandomForestClassifier` to train the model. First, we need to convert the vector labels (polygons) to raster labels, since the classifier only accepts array-like data. We write a custom function, `generate_label_array`, to do this conversion.

With this function, we create a single-band array with the same spatial size as the RGB cutout, where the pixel values are 1 for waterbodies, 0 for non-waterbodies, and -1 for no label:

In [None]:
def generate_label_array(raster, gdf_positive, gdf_negative):
    """
    Generate a label array from the raster and positive/negative GeoDataFrames.

    Parameters:
    raster (xarray.DataArray): The input raster data.
    gdf_positive (geopandas.GeoDataFrame): GeoDataFrame containing positive labels.
    gdf_negative (geopandas.GeoDataFrame): GeoDataFrame containing negative labels.

    Returns:
    xarray.DataArray: A label array where positive labels are 1, negative labels are 0, 
    and areas without labels are -1.
    """
    # Make positive labels
    pos_mask = xr.full_like(
        raster.isel(band=0).drop_vars("band"), fill_value=1, dtype=np.int32
    )
    pos_mask = pos_mask.rio.write_nodata(-1)
    pos_label_array = pos_mask.rio.clip(gdf_positive["geometry"], drop=False)

    # Make negative labels
    neg_mask = xr.full_like(
        raster.isel(band=0).drop_vars("band"), fill_value=0, dtype=np.int32
    )
    neg_mask = neg_mask.rio.write_nodata(-1)
    neg_label_array = neg_mask.rio.clip(gdf_negative["geometry"], drop=False)

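    # Combine positive and negative labels. Each mask is -1 outside its own
    # polygons, so the negated product is 1 (positive), 0 (negative), or -1 (no label).
    label_array = -(pos_label_array * neg_label_array)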

    return label_array
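
The sign arithmetic that combines the two masks can be checked by hand. Below is a minimal sketch that uses made-up one-dimensional arrays in place of the clipped masks; the names `pos`, `neg`, and `combined` are illustrative and not part of the notebook.

```python
import numpy as np

# Toy stand-ins for the clipped masks (values are illustrative only)
pos = np.array([1, -1, -1, 1])  # 1 inside a waterbody polygon, -1 (nodata) elsewhere
neg = np.array([-1, 0, -1, 0])  # 0 inside a non-waterbody polygon, -1 (nodata) elsewhere

# Same expression as in generate_label_array
combined = -(pos * neg)
print(combined.tolist())  # [1, 0, -1, 0]
```

The last element shows what happens when a pixel falls inside both a positive and a negative polygon: the product is 0, so the pixel ends up labeled as non-waterbody.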

Then we can convert the labels to arrays, and visualize them once again over the RGB cutout:

In [None]:
# Convert labels from vector to raster
label_array = generate_label_array(rgb_cutout, gpd_positive, gpd_negative)

# Plot labels over the RGB image
fig, ax = plt.subplots(figsize=(8, 8))
rgb_cutout.plot.imshow(ax=ax, robust=True)
label_array.plot.imshow(ax=ax, vmin=-1, vmax=1, alpha=0.5)

## Prepare training data

Now the labels are ready. For training data, we can extract the RGB values at the pixel locations where labels are 1 (waterbodies) or 0 (non-waterbodies). We use each band of the RGB cutout, i.e., red, green, and blue, as features for the classifier.

We write a function, `prepare_training_data`, to prepare the training data.

In [None]:
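def prepare_training_data(image, labels):
    """Return (train_data, train_labels) for the pixels labeled 1 or 0.

    image has shape (n_bands, height, width); labels has shape (height, width).
    """
    # Reshape input data to [n_instances, n_features]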
    flattened = labels.flatten()
    positive_data = image.reshape((image.shape[0], -1))[:, flattened == 1].transpose()
    negative_data = image.reshape((image.shape[0], -1))[:, flattened == 0].transpose()
    positive_labels = np.full_like(positive_data[:,0], 1)
    negative_labels = np.full_like(negative_data[:,0], 0)
    train_data = np.concatenate((positive_data, negative_data))
    train_labels = np.concatenate((positive_labels, negative_labels))

    # Shuffle the training data
    indices = np.arange(train_data.shape[0])
    indices_shuffled = np.random.permutation(indices)
    train_data = train_data[indices_shuffled]
    train_labels = train_labels[indices_shuffled]

    return train_data, train_labels

In [None]:
train_data, train_labels = prepare_training_data(rgb_cutout.data, label_array.data)

print(f"dimensions of training data: {train_data.shape}")
print(f"dimensions of training labels: {train_labels.shape}")
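
To see what the reshaping and masking in `prepare_training_data` do, it can help to run the same indexing on a tiny array. The sketch below assumes a made-up 3-band, 2x2 image and a 2x2 label array; all values and the names `flat` and `selected` are illustrative only.

```python
import numpy as np

# Toy image with shape (bands, rows, cols) = (3, 2, 2)
image = np.arange(12).reshape(3, 2, 2)
# Toy labels: 1 = waterbody, 0 = non-waterbody, -1 = no label
labels = np.array([[1, -1],
                   [0, 1]])

# Flatten the spatial dimensions: one column per pixel, one row per band
flat = image.reshape((image.shape[0], -1))        # shape (3, 4)

# Keep only the pixels labeled 1, then transpose to [n_instances, n_features]
selected = flat[:, labels.flatten() == 1].transpose()

print(selected.shape)     # (2, 3)
print(selected.tolist())  # [[0, 4, 8], [3, 7, 11]]
```

Each row of `selected` holds the three band values of one labeled pixel, which is the layout scikit-learn expects for its feature matrix.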

## Train the RandomForestClassifier

Now we can train the `RandomForestClassifier` with the training data and labels, and export the trained model for later use.


In [None]:
# Train a random forest with 100 trees; random_state fixes the seed used by the forest itself
classifier = RandomForestClassifier(n_estimators=100, random_state=42)
classifier.fit(train_data, train_labels)

In [None]:
# Export the trained classifier
joblib.dump(classifier, 'binary_classifier_waterbody.pkl')

## Prediction on the cutout

Before applying the model to the full image, we can also test its performance on the cutout itself. The predictions will be a two-band array, where the first band is the probability of being a non-waterbody (label 0) and the second band is the probability of being a waterbody (label 1).

In [None]:
# Reshape the RGB cutout to [n_pixels, n_features] as input for prediction
# (named input_data to avoid shadowing the built-in input())
input_data = rgb_cutout.data.reshape((rgb_cutout.data.shape[0], -1)).transpose()

# Load the trained classifier
classifier = joblib.load('binary_classifier_waterbody.pkl')

# Perform predictions
predictions = classifier.predict_proba(input_data)
predictions
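
The column order of `predict_proba` follows the classifier's `classes_` attribute, which is why column 0 corresponds to label 0 and column 1 to label 1. The sketch below illustrates this on synthetic data; the random features, the small forest, and the names `X`, `y`, `clf`, and `proba` are assumptions made for the example and have nothing to do with the Sentinel-2 cutout.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic, well-separated two-class data with 3 features (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
proba = clf.predict_proba(X[:5])

print(clf.classes_)       # column order of predict_proba
print(proba.shape)        # one row per sample, one column per class
print(proba.sum(axis=1))  # each row sums to 1
```

Checking `classifier.classes_` on the trained model is a cheap way to confirm which band of the prediction map is the waterbody probability.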

In [None]:
# Reshape predictions back to image dimensions for visualization
prediction_map = predictions.transpose().reshape((2, rgb_cutout.data.shape[1], rgb_cutout.data.shape[2]))
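
Reshaping the predictions back to image dimensions is the inverse of the flattening applied before prediction. A minimal sketch on a made-up array (the shapes and the names `flat` and `restored` are illustrative) shows that the round trip recovers the original layout:

```python
import numpy as np

n_bands, n_rows, n_cols = 3, 2, 4
image = np.arange(n_bands * n_rows * n_cols).reshape(n_bands, n_rows, n_cols)

# Forward: (bands, rows, cols) -> (n_pixels, bands), as done before prediction
flat = image.reshape((image.shape[0], -1)).transpose()

# Backward: (n_pixels, bands) -> (bands, rows, cols), as done for the prediction map
restored = flat.transpose().reshape((n_bands, n_rows, n_cols))

print(flat.shape)                       # (8, 3)
print(np.array_equal(image, restored))  # True
```

The same reasoning applies to the prediction map, with two probability bands in place of the three RGB bands.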

In [None]:
# Visualize the predictions
img_extent = (rgb_cutout.x.min(), rgb_cutout.x.max(), rgb_cutout.y.min(), rgb_cutout.y.max())
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
rgb_cutout.plot.imshow(ax=axes[0], alpha=0.6, robust=True)
axes[0].imshow(prediction_map[0], cmap='Reds', alpha=0.7, extent=img_extent)
axes[0].set_title('Non-waterbody Probability')
axes[0].axis('off')
plt.colorbar(axes[0].images[1], ax=axes[0], shrink=0.7)
rgb_cutout.plot.imshow(ax=axes[1], alpha=0.6, robust=True)
axes[1].imshow(prediction_map[1], cmap='Blues', alpha=0.7, extent=img_extent)
axes[1].set_title('Waterbody Probability')
axes[1].axis('off')
plt.colorbar(axes[1].images[1], ax=axes[1], shrink=0.7)
plt.tight_layout()
