# Forest Classification in Mozambique

This notebook demonstrates how to classify forest cover in Mozambique using remote sensing data. The classification is based on data from Sentinel-2 and Landsat. The data has been prepared in advance and is available as yearly mosaics, split by value quantile.

In [None]:
from pathlib import Path
from typing import Any

import geopandas as gpd
import hvplot.pandas
import hvplot.xarray  # noqa
import numpy as np
import xarray as xr
from rasterio import Affine
from rasterio.features import rasterize
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

Let's start by loading the necessary image data, as well as the GeoJSON file that contains the regions of interest (ROIs) for the classification.
These ROIs are essentially polygons covering areas that are either forest or non-forest and are labeled accordingly.

In [None]:
# Setup Paths to the data
assets: Path = Path("assets").resolve(strict=True)
filepath: Path = next(assets.glob("*2024*.nc"))

# Load Data
ds = xr.open_dataset(filepath)
slivers: gpd.GeoDataFrame = gpd.read_file(assets / "forest-features.geojson")

# Prepare the crs and affine transformation
spatial_ref: dict[str, Any] = ds["spatial_ref"].attrs
crs = spatial_ref["crs_wkt"]
geotransform: list[float] = [float(x) for x in spatial_ref["GeoTransform"].split()]
affine = Affine.from_gdal(*geotransform)
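
For reference, a GDAL-style `GeoTransform` string holds six coefficients that map pixel indices to map coordinates. The sketch below reproduces that mapping in plain Python. The coefficient values are made up for illustration and are not taken from the dataset, and the helper name `pixel_to_map` is hypothetical.

```python
# Illustrative GeoTransform string (made-up values, not from the dataset)
geotransform_str = "500000.0 10.0 0.0 8000000.0 0.0 -10.0"

# GDAL order: x origin, pixel width, row rotation, y origin, column rotation, pixel height
x0, dx, rx, y0, ry, dy = (float(v) for v in geotransform_str.split())


def pixel_to_map(col: float, row: float) -> tuple[float, float]:
    """Map a pixel position (col, row) to map coordinates (x, y)."""
    x = x0 + col * dx + row * rx
    y = y0 + col * ry + row * dy
    return x, y


upper_left = pixel_to_map(0, 0)  # outer corner of the first pixel
center = pixel_to_map(0.5, 0.5)  # center of the first pixel
print(upper_left, center)
```

The negative pixel height reflects the usual north-up layout, in which row indices increase while the y coordinate decreases.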

Next, we plot the image data and the ROIs to get an overview of the data we are working with.

In [None]:
# Plot RGB Image
ds[["Red", "Green", "Blue"]].sel(quantile=0.5).to_dataarray().plot.imshow(robust=True)

In [None]:
# Combine the labels into a single class_id column: 1 = forest, 2 = non-forest
slivers["class_id"] = slivers["forest"].notna() + (slivers["non-forest"].notna() * 2)
slivers.hvplot.polygons(
    tiles=True,
)

## Prepare Data for Classification

Before we can start the classification, we need to prepare the data, since scikit-learn expects a two-dimensional table with one row per sample and one column per feature. This involves rasterizing the polygons, creating dataframes from the image data and the polygons, and preparing the features and target variable for classification.

In [None]:
# https://rasterio.readthedocs.io/en/stable/api/rasterio.features.html#rasterio.features.rasterize
# Rasterize the polygons
rst = rasterize(
    shapes=slivers.to_crs(crs)[["geometry", "class_id"]]
    .to_records(index=False)
    .tolist(),
    out_shape=ds["Red"].shape[1:],
    transform=affine,
)

# Create a DataArray from the rasterized polygons (0 marks unlabeled pixels)
label_cube = xr.DataArray(
    rst,
    dims=("y", "x"),
    coords={"y": ds.y, "x": ds.x},
    name="class_id",
)

# Create a DataFrame from the image data
feature_frame = (
    ds.drop_vars("spatial_ref").to_dataframe().dropna().unstack(level="quantile")
)

# Rename the columns to include quantile information
feature_frame.columns = [
    f"{c[0]}_P{np.round(100 * c[1], 0).astype(int):03}" for c in feature_frame.columns
]
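
The naming scheme can be illustrated without the dataset. The sketch below uses hypothetical band names and quantile levels, and assumes the unstacked columns keep the variable as the outer level and the quantile as the inner level, which is how pandas orders them after `unstack`.

```python
from itertools import product

# Hypothetical band names and quantile levels, for illustration only
bands = ["Red", "Green"]
quantiles = [0.1, 0.5, 0.9]

# Band-major order: every quantile of the first band, then the next band
columns = [f"{band}_P{round(100 * q):03}" for band, q in product(bands, quantiles)]
print(columns)
```

Zero-padding the percentile to three digits keeps the column names aligned and sortable.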

In [None]:
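# Create a DataFrame from the label data (originally the polygons)
label_frame = label_cube.to_dataframe()

# Keep labeled pixels only (class_id 0 is unlabeled); the inner join drops
# labeled pixels without valid features
train_frame = label_frame[label_frame["class_id"] != 0].join(
    feature_frame, how="inner"
)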

# Prepare the features and target variable for classification
X_features = train_frame.drop(columns=["class_id"])
y_target = train_frame["class_id"]

## Classify Forest Cover with Scikit-learn

In the next step, we will use the prepared data to train a Random Forest classifier. The data will first be split into training and testing sets, and then the classifier will be trained on the training set. After training, we can evaluate the classifier's performance on the test set and visualize the results.

In [None]:
# Prepare the training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(
    X_features,
    y_target,
    test_size=0.2,
    random_state=42,
)

# Train the Random Forest classifier (fixed seed for reproducible results)
randforest = RandomForestClassifier(random_state=42)
randforest.fit(X_train, y_train)

# Predict the classes of the held-out test pixels
randforest_predict = randforest.predict(X_test)
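
The evaluation on the test set could look roughly like the following sketch. It runs on synthetic data so that it is self-contained; the names `X_demo`, `y_demo`, and `clf` are placeholders, and the choice of accuracy plus a confusion matrix is an assumption, not necessarily the metric intended for this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pixel features: two well-separated classes
rng = np.random.default_rng(42)
X_demo = np.vstack(
    [rng.normal(0.0, 1.0, (100, 4)), rng.normal(5.0, 1.0, (100, 4))]
)
y_demo = np.repeat([1, 2], 100)  # 1 = forest, 2 = non-forest

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Share of correctly classified test samples
accuracy = accuracy_score(y_te, y_pred)
# Rows are the true classes, columns the predicted classes
matrix = confusion_matrix(y_te, y_pred, labels=[1, 2])
print(f"accuracy: {accuracy:.3f}")
print(matrix)
```

In the notebook itself, the same two calls would take `y_test` and `randforest_predict` as arguments.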

Now that we have trained our model, we can use it to predict the forest cover for each pixel in the image data. As before, we need to prepare the image data first, so that each pixel becomes one row of a feature matrix.

In [None]:
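# Prepare the image data for classification.
# "quantile" must be the last axis so that the flattened band order matches
# the training columns (all quantiles of one variable, then the next variable).
img = (
    ds.drop_vars("spatial_ref")
    .to_dataarray()
    .transpose("x", "y", "variable", "quantile")
)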

# Reshape the image data
num_of_pixels = img.sizes["x"] * img.sizes["y"]
num_of_bands = img.sizes["quantile"] * img.sizes["variable"]
X_image_data = img.values.reshape(num_of_pixels, num_of_bands)
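
The reshape relies on NumPy's default row-major (C) order, in which the last axis varies fastest. The resulting column order therefore has to match the order of the feature columns used for training. A minimal sketch with a toy array of arbitrary size shows where a given pixel and band end up:

```python
import numpy as np

# Toy cube with axes (x, y, variable, quantile); the sizes are arbitrary
nx, ny, nvar, nq = 3, 2, 2, 3
cube = np.arange(nx * ny * nvar * nq).reshape(nx, ny, nvar, nq)

# One row per pixel, one column per (variable, quantile) pair
flat = cube.reshape(nx * ny, nvar * nq)

# Row index of pixel (ix, iy) and column index of (iv, iq) in C order
ix, iy, iv, iq = 2, 1, 1, 0
row, col = ix * ny + iy, iv * nq + iq
print(flat.shape, flat[row, col] == cube[ix, iy, iv, iq])
```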

In [None]:
# Predict using the trained model
randforest_predict_img = randforest.predict(X_image_data)
randforest_predict_img = randforest_predict_img.reshape(
    img.sizes["x"],
    img.sizes["y"],
).transpose()

# Recreate the DataArray for the predicted forest cover
predicted_forest = xr.DataArray(
    randforest_predict_img,
    dims=("y", "x"),
    coords={
        "x": ds["x"],
        "y": ds["y"],
    },
)
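
As a quick sanity check of the reshape-and-transpose step, the following sketch uses toy values rather than real predictions. It encodes each pixel's position in its value and confirms that the result can be indexed as `(y, x)`.

```python
import numpy as np

nx, ny = 4, 3

# Stand-in for per-pixel predictions, flattened with x as the slower axis;
# each value encodes its own position as 10 * ix + iy
ix_grid, iy_grid = np.meshgrid(np.arange(nx), np.arange(ny), indexing="ij")
flat_pred = (10 * ix_grid + iy_grid).reshape(nx * ny)

# Back to a grid: first (x, y), then transpose to (y, x) for plotting
grid_yx = flat_pred.reshape(nx, ny).transpose()
print(grid_yx.shape, grid_yx[2, 3])
```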

In [None]:
# Map the predicted forest classification
predicted_forest.where(predicted_forest == 1).hvplot.image(
    x="x",
    y="y",
    tiles=True,
    cmap="greens",
    crs=crs,
)