# Forest Classification in Mozambique

This notebook presents a workflow for classifying forest cover in Mozambique from remote sensing data. The classification uses annual mosaics derived from Sentinel-2 and Landsat imagery, in which each band is summarized by a set of value quantiles stored along a `quantile` dimension.

## Workflow Setup

We begin by importing the necessary libraries for geospatial analysis, data handling, visualization, and machine learning. These tools form the foundation for the classification workflow that follows.

In [None]:
from enum import Enum
from pathlib import Path
from typing import Any

import geopandas as gpd
import hvplot.pandas  # noqa
import hvplot.xarray  # noqa
import matplotlib.pyplot as plt
import numpy as np
import rioxarray  # noqa  (registers the `.rio` accessor used when georeferencing the result)
import xarray as xr
from rasterio import Affine
from rasterio.features import rasterize
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

Next, we load the remote sensing image data and the GeoJSON file containing the regions of interest (ROIs). The ROIs are polygons labeled as forest or non-forest. They were created in QGIS and provide the ground truth required for supervised classification.

In [None]:
# Set up paths to the data
assets: Path = Path("assets").resolve(strict=True)
year: int = 2024
filepath: Path = next(assets.glob(f"*{year}*.nc"))

# Load Data
feature_cube = xr.open_dataset(filepath)
feature_cube = feature_cube.sel(quantile=slice(0.1, 0.9))  # remove outliers
slivers: gpd.GeoDataFrame = gpd.read_file(assets / "label_features_i02.geojson")

# Prepare the crs and affine transformation
spatial_ref: dict[str, Any] = feature_cube["spatial_ref"].attrs
crs = spatial_ref["crs_wkt"]
slivers.to_crs(crs, inplace=True)
geotransform: list[float] = [float(x) for x in spatial_ref["GeoTransform"].split()]
affine = Affine.from_gdal(*geotransform)
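
The `GeoTransform` attribute holds six coefficients in GDAL order: x origin, pixel width, row rotation, y origin, column rotation, pixel height. The sketch below illustrates what these coefficients encode, using plain Python instead of `rasterio`. The helper names and the example grid are hypothetical and not part of the notebook's data.

```python
def parse_geotransform(text: str) -> tuple[float, ...]:
    """Parse a whitespace-separated GDAL GeoTransform string into six floats."""
    values = tuple(float(v) for v in text.split())
    if len(values) != 6:
        raise ValueError(f"expected 6 coefficients, got {len(values)}")
    return values


def pixel_to_world(gt: tuple[float, ...], col: int, row: int) -> tuple[float, float]:
    """Map a (col, row) pixel index to the (x, y) of that pixel's upper-left corner."""
    x_origin, x_res, row_rot, y_origin, col_rot, y_res = gt
    x = x_origin + col * x_res + row * row_rot
    y = y_origin + col * col_rot + row * y_res
    return x, y


# Hypothetical north-up grid: 10 m pixels, upper-left corner at (500000, 8000000)
gt = parse_geotransform("500000 10 0 8000000 0 -10")
corner = pixel_to_world(gt, col=3, row=2)
```

For a north-up raster the two rotation terms are zero and the pixel height is negative, because row indices increase downward while y coordinates increase upward.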

Defining land cover types as an enumeration provides a clear mapping between each class and its numeric value. This improves code readability and ensures consistency when assigning and interpreting classification labels throughout the workflow.

In [None]:
# Give the labels a meaningful name


class LandCoverType(Enum):
    """Binding land cover types to their numeric values."""

    TREE_COVER = 10
    SHRUBLAND = 20
    GRASSLAND = 30
    CROPLAND = 40
    SETTLEMENT = 50
    BARE_SPARSE = 60
    SNOW_ICE_ABSENT = 70
    PERMANENT_WATER = 80
    INTERMITTENT_WATER = 86
    WETLAND = 90
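
As a small illustration of why the enumeration helps, the sketch below translates numeric labels back into readable names. It uses a trimmed, hypothetical copy of the enumeration so that it runs on its own; `describe` is an illustrative helper, not part of the notebook.

```python
from enum import Enum


class LandCover(Enum):
    """Trimmed, hypothetical copy of the notebook's enumeration."""

    TREE_COVER = 10
    CROPLAND = 40


def describe(value: int) -> str:
    """Return the class name for a numeric label, or 'UNKNOWN' if unmapped."""
    try:
        return LandCover(value).name
    except ValueError:
        # Enum lookup by value raises ValueError for values without a member
        return "UNKNOWN"


names = [describe(v) for v in (10, 40, 99)]
```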

We now plot the median RGB composite together with the regions of interest (ROIs). The overlay shows how the labeled areas are distributed across the scene and lets us verify that the polygons align with the underlying imagery.

In [None]:
# Plot RGB Image
fig, ax = plt.subplots()
feature_cube[["Red", "Green", "Blue"]].sel(quantile=0.5).to_dataarray().plot.imshow(
    robust=True,
    ax=ax,
)
slivers.plot(ax=ax, column="label", cmap="RdYlBu")
plt.show()

## Prepare Data for Classification

Before training the classifier, the data must be brought into a format suitable for machine learning. The preparation involves three steps:
1. Rasterization of the polygons to create a mask for the areas of interest.
2. Extraction of pixel values from the image data using the mask.
3. Creation of a feature matrix and target vector for classification from the image data.
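
The three steps can be sketched on toy arrays before applying them to the real data. This is a minimal NumPy illustration under assumed shapes. The arrays `labels` and `features` are made up, and the value `0` is assumed to mean "no polygon".

```python
import numpy as np

# Toy stand-ins (assumptions): a 3x4 label raster where 0 means "no polygon",
# and a feature cube with 2 features per pixel, shaped (feature, y, x).
labels = np.array(
    [
        [0, 1, 1, 0],
        [0, 0, 2, 2],
        [0, 0, 0, 2],
    ]
)
features = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)

mask = labels > 0        # step 1: pixels covered by any polygon
X = features[:, mask].T  # step 2: one row per labeled pixel, one column per feature
y = labels[mask]         # step 3: target vector aligned row by row with X
```

Because `X` and `y` are built from the same boolean mask, row `i` of `X` always belongs to entry `i` of `y`.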

In [None]:
# https://rasterio.readthedocs.io/en/stable/api/rasterio.features.html#rasterio.features.rasterize
# Rasterize the polygons
rst = rasterize(
    shapes=slivers.to_crs(crs)[["geometry", "poly_id"]]
    .to_records(index=False)
    .tolist(),
    out_shape=(feature_cube.sizes["y"], feature_cube.sizes["x"]),  # (rows, cols)
    transform=affine,
)

# Create a DataArray from the rasterized polygons
label_cube = xr.DataArray(
    rst,
    dims=("y", "x"),  # state the dimension order explicitly instead of inferring it
    coords={"y": feature_cube.y, "x": feature_cube.x},
    name="poly_id",
)

# Create a DataFrame from the image data
feature_frame = (
    feature_cube.drop_vars("spatial_ref")
    .to_dataframe()
    .dropna()
    .unstack(level="quantile")
)

# Rename the columns to include quantile information
feature_frame.columns = [
    f"{c[0]}_P{np.round(100 * c[1], 0).astype(int):03}" for c in feature_frame.columns
]
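
To make the naming scheme concrete, the sketch below applies the same formatting to a few hypothetical band and quantile pairs. `column_name` is an illustrative helper, not part of the notebook.

```python
import numpy as np


def column_name(variable: str, quantile: float) -> str:
    """Build a flat column name such as 'Red_P010' from a (variable, quantile) pair."""
    # Round before casting: 100 * 0.29 is 28.999999999999996 in floating point,
    # so a plain int() cast would truncate it to 28.
    pct = np.round(100 * quantile, 0).astype(int)
    return f"{variable}_P{pct:03}"


names = [column_name(v, q) for v in ("Red", "Green") for q in (0.1, 0.5, 0.9)]
```

The zero padding keeps the columns in quantile order when they are sorted alphabetically.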

In [None]:
# Create a DataFrame from the label data (originally the polygons)
label_frame = (
    label_cube.to_dataframe()
    .reset_index()
    .merge(slivers[["poly_id", "label"]])
    .set_index(["y", "x"])
)
# Inner join: keep only labeled pixels that also have complete feature values
train_frame = label_frame.join(feature_frame, how="inner")

# Prepare the features and target variable for classification
X_features = train_frame.drop(columns=["label", "poly_id"])
y_target = train_frame["label"]

## Classify Forest Cover with Scikit-learn

We now train a Random Forest classifier on the prepared data. The process involves the following steps:
1. Data Splitting into Training and Testing Sets to evaluate the model's performance.
2. Model Training on the Training Set.
3. Model Evaluation on the Testing Set.
4. Visualization of Results.

In [None]:
# Prepare the training and testing datasets
SEED = 42
X_train, X_test, y_train, y_test = train_test_split(
    X_features,
    y_target,
    test_size=0.2,
    random_state=SEED,
)

# Train the Random Forest classifier
randforest = RandomForestClassifier(
    random_state=SEED,
    class_weight="balanced",
    n_jobs=-1,
)
randforest.fit(X_train, y_train)

# Predict on the held-out test set
randforest_predict = randforest.predict(X_test)

With the trained model, we can now predict the forest cover for every pixel in the image.
The image data must be arranged in the same way as the training data: one row per pixel, with the features in the same column order that the model saw during training.
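
The reshaping logic can be checked on a toy cube. The sketch below assumes the training columns are ordered by variable first and quantile second, which is the order produced by unstacking the `quantile` level of a data frame whose columns are the variables. The array sizes and the stand-in "prediction" are made up for illustration.

```python
import numpy as np

n_var, n_q, n_y, n_x = 2, 3, 4, 5
# Toy cube (assumption) with dims (variable, quantile, y, x); every value is unique
cube = np.arange(n_var * n_q * n_y * n_x).reshape(n_var, n_q, n_y, n_x)

# Move the pixel axes to the front; keep variable as the outer and
# quantile as the inner feature axis
pixels_first = cube.transpose(3, 2, 0, 1)         # (x, y, variable, quantile)
X = pixels_first.reshape(n_x * n_y, n_var * n_q)  # one row per pixel

# The row for pixel (x=1, y=2) lists var0 q0..q2 followed by var1 q0..q2
row = X[1 * n_y + 2]
expected = cube[:, :, 2, 1].reshape(-1)

# Stand-in for per-pixel predictions: take the first feature, then restore (y, x)
pred = X[:, 0].reshape(n_x, n_y).transpose()
```

If the two feature axes were swapped in the transpose, the rows would still have the right length, but each column would hold a different band and quantile than the model was trained on, and the predictions would be silently wrong.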

In [None]:
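# Prepare the image data for classification
# Feature axes are ordered (variable, quantile) to match the training columns,
# which unstack(level="quantile") arranges by variable first, then quantile
img = (
    feature_cube.drop_vars("spatial_ref")
    .to_dataarray()
    .transpose("x", "y", "variable", "quantile")
)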

# Reshape the image data
num_of_pixels = img.sizes["x"] * img.sizes["y"]
num_of_bands = img.sizes["quantile"] * img.sizes["variable"]
X_image_data = img.values.reshape(num_of_pixels, num_of_bands)

In [None]:
# Predict using the trained model
randforest_predict_img = randforest.predict(X_image_data)
randforest_predict_img = randforest_predict_img.reshape(
    img.sizes["x"],
    img.sizes["y"],
).transpose()

# Recreate the DataArray for the predicted forest cover
predicted_forest = (
    xr.DataArray(
        randforest_predict_img,
        dims=("y", "x"),
        coords={
            "x": feature_cube["x"],
            "y": feature_cube["y"],
        },
    )
    .rio.write_crs(crs)
    .rio.write_transform(affine)
)

In [None]:
# Map the predicted forest classification
predicted_forest.where(predicted_forest == LandCoverType.TREE_COVER.value).hvplot.image(
    x="x",
    y="y",
    tiles=True,
    cmap="greens",
    crs=crs,
    hover=False,
)

In [None]:
# Display the classification report
print(classification_report(y_test, randforest_predict))

In [None]:
# Save the results for future analysis
savepath = Path(f"forest_classification_{year}.tif")
predicted_forest.rio.to_raster(savepath, driver="GTiff", compress="lzw")