# Image classification with a Convolutional Neural Network for Sentinel-2 Satellite Imagery

Michael Mommert, Stuttgart University of Applied Sciences, 2025

This Notebook showcases how a Convolutional Neural Network can be used to perform image classification. The model will be trained on image labels corresponding to the most prevalent land-use/land-cover class in Sentinel-2 satellite images from the [*ben-ge-800* dataset](https://zenodo.org/records/12941231). This Notebook builds on top of the [Pixel-wise Classification with a Multilayer Perceptron](https://github.com/Hochschule-fuer-Technik-Stuttgart/teaching-mommert/blob/main/classification/pixel-wise/mlp/sentinel-2/classification_pixel-wise_mlp_sentinel-2.ipynb) Notebook.

In [None]:
%pip install numpy \
    scipy \
    pandas \
    matplotlib \
    rasterio \
    scikit-learn \
    torch \
    torchmetrics \
    tqdm

## Setup and Data Download

We're setting up our Python environment for this tutorial by installing and importing the necessary modules and packages:

In [None]:
# system level modules for handling files and file structures
import os
import tarfile
import copy

# scipy ecosystem imports for numerics, data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

# pytorch and helper modules
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchmetrics import Accuracy

# utils
from tqdm.notebook import tqdm
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# rasterio for reading in satellite image data
import rasterio as rio


We download the *ben-ge-800* dataset and unpack it:

In [None]:
#!wget https://zenodo.org/records/12941231/files/ben-ge-800.tar.gz?download=1 -O ben-ge-800.tar.gz
  
#with tarfile.open('ben-ge-800.tar.gz') as tar:  # open ben-ge-800 tarball (without shadowing the tarfile module)
#    tar.extractall('./', filter='data')  # extract tarball

#data_base_path = os.path.join(os.path.abspath('.'), 'ben-ge-800')

**ben-ge-800** contains samples for 800 locations with co-located Sentinel-1 SAR data, Sentinel-2 multispectral data, elevation data, land-use/land-cover data, as well as environmental data. **ben-ge-800** is a subset of the much larger **ben-ge** dataset (see [https://github.com/HSG-AIML/ben-ge](https://github.com/HSG-AIML/ben-ge) for details). We deliberately use a very small subset of **ben-ge** to enable reasonable runtimes for the examples shown in this tutorial.

The environment is now set up and the data is in place. Before we define the dataset classes and dataloaders to access the data efficiently, we fix some random seeds to obtain reproducible results:

In [None]:
data_base_path = os.path.join(os.path.abspath('.'), 'ben-ge-800')

np.random.seed(42)     # sets the seed value in Numpy
torch.manual_seed(42)  # sets the seed value in Pytorch

## Data Handling

Before we start implementing our model, let's have a look at the data. In this notebook, we need two different data products that are available for every single sample in the dataset:
* Sentinel-2 multispectral data: 12-band Level-2A images of size 120x120; we will restrict ourselves to the 4 bands that carry 10m-resolution imaging data (bands 2, 3, 4 and 8)
* [ESAWorldCover](https://esa-worldcover.org/en) land-use/land-cover image labels: for each image, this label consists of the most prevalent (based on area covered in the image) land-use/land-cover class in the image.

We will train an image classification model to predict this land-use/land-cover label for each image.
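
To make the label definition concrete, here is a minimal sketch with made-up area fractions. The small table and its values are illustrative assumptions and are not taken from ben-ge-800. The image label is simply the index of the class that covers the largest share of the image:

```python
import numpy as np
import pandas as pd

# hypothetical area fractions for three patches and three classes (illustrative values only)
fractions = pd.DataFrame({
    "tree_cover": [0.70, 0.10, 0.25],
    "grassland":  [0.20, 0.30, 0.25],
    "cropland":   [0.10, 0.60, 0.50],
})

# the image label is the index of the class covering the largest area
labels = np.argmax(fractions.values, axis=1)
label_names = [fractions.columns[i] for i in labels]

print(labels, label_names)
```

Note that the third patch is labeled as cropland although half of its area belongs to other classes; this loss of information comes with reducing each image to a single label.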

For this purpose, we modify our dataset class from the Notebook [Supervised Classification with Machine Learning Methods for Sentinel-2 Satellite Imagery](https://github.com/Hochschule-fuer-Technik-Stuttgart/teaching-mommert/blob/main/remote_sensing/classification/lulc_ml/lulc_ml.ipynb):

In [None]:
# define the names of the different lulc classes
ewc_label_names = ["tree_cover", "shrubland", "grassland", "cropland", "built-up",
                   "bare/sparse_vegetation", "snow_and_ice","permanent_water_bodies",
                   "herbaceous_wetland", "mangroves","moss_and_lichen"]

class BENGE(Dataset):
    """A dataset class implementing the Sentinel-2 and ESAWorldCover data modalities."""
    def __init__(self, 
                 data_dir=data_base_path, 
                 split='train',
                 s2_bands=[2, 3, 4, 8]):
        """Dataset class constructor

        keyword arguments:
        data_dir -- string containing the path to the base directory of ben-ge dataset, default: ben-ge-800 directory
        split    -- string, describes the split to be instantiated, either `train`, `val` or `test`
        s2_bands -- list of Sentinel-2 bands to be extracted, default: the four 10m-resolution bands (2, 3, 4 and 8)

        returns:
        BENGE object
        """
        super(BENGE, self).__init__()

        # store some definitions
        self.s2_bands = s2_bands
        self.data_dir = data_dir

        # read in relevant data files and definitions
        self.name = self.data_dir.split("/")[-1]
        self.split = split
        self.meta = pd.read_csv(f"{self.data_dir}/{self.name}_meta.csv")

        # extract prevalent lulc label for each sample
        ewc = pd.read_csv(f"{self.data_dir}/{self.name}_esaworldcover.csv")
        self.meta.loc[:, 'lulc'] = np.argmax(ewc.loc[:, 'tree_cover':'moss_and_lichen'].values, axis=1)
              
        # the samples are taken in the order of the meta file (no shuffling is applied here):
        # the first 500 samples are used for training, the next 150 for validation and the last 150 for testing
        if split == 'train':
            self.meta = self.meta.iloc[0:500]
        if split == 'val':
            self.meta = self.meta.iloc[500:650]
        if split == 'test':
            self.meta = self.meta.iloc[650:800]
        
        #self.meta = self.meta.loc[self.meta.split == split, :]  # filter by split

    def __getitem__(self, idx):
        """Return sample `idx` as dictionary from the dataset."""
        sample_info = self.meta.iloc[idx]
        patch_id = sample_info.patch_id  # extract Sentinel-2 patch id

        # retrieve Sentinel-2 data
        s2 = np.empty((len(self.s2_bands), 120, 120))  # one 120x120 layer per requested band
        for i, band in enumerate(self.s2_bands):
            with rio.open(f"{self.data_dir}/sentinel-2/{patch_id}/{patch_id}_B0{band}.tif") as dataset:
                data = dataset.read(1)
            s2[i,:,:] = data
        s2 = np.clip(s2.astype(float) / 10000, 0, 1)  # normalize Sentinel-2 data

        # create sample dictionary containing all the data
        sample = {
            "patch_id": patch_id,  # Sentinel-2 id of this patch
            "s2": torch.from_numpy(s2).float(),  # Sentinel-2 data [number of bands, 120, 120]
            "lulc": torch.tensor(sample_info.lulc).long(),  # most prevalent ESA WorldCover lulc class
            }

        return sample

    def __len__(self):
        """Return length of this dataset."""
        return self.meta.shape[0]

    def display(self, idx):
        """Method to display a data sample, consisting of the Sentinel-2 image.
        
        positional arguments:
        idx -- sample index
        """

        # retrieve sample
        sample = self[idx]

        f, ax = plt.subplots(1, 1, figsize=(4, 4))
        
        # display Sentinel-2 image
        img_rgb = np.dstack(sample['s2'][0:3].numpy()[::-1])  # extract BGR bands, reorder to RGB, and stack along the depth axis (shape: 120, 120, 3)
        ax.imshow((img_rgb-np.min(img_rgb))/(np.max(img_rgb)-np.min(img_rgb)))
        ax.set_title(ewc_label_names[sample['lulc'].item()])  # convert the label tensor to a Python int for indexing
        ax.axis('off')


We can now instantiate the different splits for this dataset:

In [None]:
train_data = BENGE(split='train')
val_data = BENGE(split='val')
test_data = BENGE(split='test')

len(train_data), len(val_data), len(test_data)

We can retrieve a single sample simply by indexing:

In [None]:
train_data[3]

Let's display this sample:

In [None]:
train_data.display(3)
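
The two steps behind this display, band reordering and min-max scaling, can be sketched on a random array. The array below is only a stand-in for a real patch, assumed to hold the bands in the order blue, green, red, near-infrared:

```python
import numpy as np

rng = np.random.default_rng(0)
s2 = rng.random((4, 120, 120))  # stand-in for a [B02, B03, B04, B08] reflectance stack

# bands 0, 1, 2 are blue, green, red; for display we need red, green, blue as the last axis
rgb = np.dstack([s2[2], s2[1], s2[0]])  # shape (120, 120, 3)

# min-max scaling stretches the values to the full [0, 1] display range
rgb_scaled = (rgb - rgb.min()) / (rgb.max() - rgb.min())

print(rgb_scaled.shape, rgb_scaled.min(), rgb_scaled.max())
```

Because the scaling is computed per image, brightness is not comparable between different displayed samples.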

For Neural Network training we have to define data loaders. When we do so, we have to define the batch size, which is typically limited by the GPU RAM during training. For evaluation purposes, we can typically pick a larger batch size, since we need less memory.

In [None]:
train_batchsize = 8
eval_batchsize = 16

train_dataloader = DataLoader(train_data, batch_size=train_batchsize, shuffle=True, num_workers=4, pin_memory=True)  # shuffle the training samples in every epoch
val_dataloader = DataLoader(val_data, batch_size=eval_batchsize, num_workers=4, pin_memory=True)
test_dataloader = DataLoader(test_data, batch_size=eval_batchsize, num_workers=4, pin_memory=True)
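
To see what a data loader hands to the model, the following sketch batches a toy dataset. `ToyPatches` is a hypothetical stand-in that returns random data in the same dictionary layout as our samples; it is not part of the notebook's pipeline:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyPatches(Dataset):
    """Hypothetical stand-in dataset: random 4-band patches with arbitrary labels."""
    def __init__(self, n=20):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return {"s2": torch.rand(4, 120, 120), "lulc": torch.tensor(idx % 11)}

# num_workers=0 loads the data in the main process, which is sufficient for this toy example
loader = DataLoader(ToyPatches(), batch_size=8, num_workers=0)

# the default collate function stacks each dictionary entry along a new batch dimension
batch_shapes = [(tuple(b["s2"].shape), tuple(b["lulc"].shape)) for b in loader]
print(batch_shapes)
```

With 20 samples and a batch size of 8, the last batch contains only 4 samples; the same happens with our validation and test splits, since 150 is not a multiple of 16.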

## Model Implementation

We build a very simple Convolutional Neural Network that consists of two convolutional layers and two linear layers to learn our image classification task.

The **convolutional layers** will be set up as follows:
* The first convolutional layer will take in our input images (120 x 120 pixels) with 4 channels. Aiming for 8 output channels and using a kernel size of 5 and a stride of 2 (to decrease the size of the resulting feature maps), this will result in a feature map of size `[8, 58, 58]`. This layer will be followed by ReLU for non-linear activations and an additional Maxpooling layer that reduces the size of the output feature map by a factor of 2, resulting in an output of size `[8, 29, 29]`.
* The second convolutional layer will result in 16 output channels and also use a kernel size of 5 and a stride of 2. Followed by ReLU and Maxpooling, the output feature map will be of size `[16, 6, 6]`.

At the intersection between the convolutional layers and the linear layers we have to transform the feature maps of size `[16, 6, 6]` into a vector. This vector will have a length of $16 \times 6 \times 6 = 576$.

This vector will serve as input for our **linear layers**:
* The first linear layer will take an input of length 576 and output a vector of length 100. This layer will be followed by ReLU.
* The second linear layer will take the vector of length 100 and result in a vector of length 11, which represents the number of land-use/land-cover labels in our dataset. This layer will not be followed by ReLU, since we don't want to use non-linear activations here. Instead, we will apply a log-softmax function, which outputs log-probabilities; exponentiating them yields values that we can interpret as classification probabilities.
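
The feature map sizes quoted above can be verified with a short calculation. For an input of size $n$, kernel size $k$, stride $s$ and padding $p$, the output size along one dimension is $\lfloor (n + 2p - k)/s \rfloor + 1$, assuming a dilation of 1. The helper `conv_out` below is introduced only for this illustration:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution or pooling layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

sizes = [120]                                          # input image size
sizes.append(conv_out(sizes[-1], kernel=5, stride=2))  # first convolution
sizes.append(conv_out(sizes[-1], kernel=2, stride=2))  # first max pooling
sizes.append(conv_out(sizes[-1], kernel=5, stride=2))  # second convolution
sizes.append(conv_out(sizes[-1], kernel=2, stride=2))  # second max pooling

flattened = 16 * sizes[-1] * sizes[-1]  # 16 channels times the final feature map area
print(sizes, flattened)
```

Odd intermediate sizes are rounded down by the integer division, which is why the second pooling step turns a feature map of size 13 into one of size 6.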

In [None]:
class BENGENet(nn.Module):
    
    def __init__(self):
        super(BENGENet, self).__init__()
        
        self.conv1 = nn.Conv2d(4, 8, 5, stride=2) # 4 channels to 8 channels with a kernel size of 5 and a stride of 2
        self.conv2 = nn.Conv2d(8, 16, 5, stride=2) # 8 channels to 16 channels with a kernel size of 5 and a stride of 2
        
        self.linear1 = nn.Linear(16*6*6, 100, bias=True)
        self.linear2 = nn.Linear(100, 11, bias=True)

        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(2)  # maxpooling by a factor of 2
        self.logsoftmax = nn.LogSoftmax(dim=1)

    def forward(self, x):

        # convolutional layers
        x = self.maxpool(self.relu(self.conv1(x)))
        x = self.maxpool(self.relu(self.conv2(x)))

        # reshape feature maps for linear layers
        x = x.view(-1, 16*6*6)

        # linear layers
        x = self.relu(self.linear1(x))
        x = self.logsoftmax(self.linear2(x))
        
        return x

Now we instantiate the model and we're ready for training.

In [None]:
model = BENGENet()
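
Before training, it can be useful to check the output shape and the number of trainable parameters. The sketch below rebuilds the same layer layout with `nn.Sequential` so that it runs on its own; this stand-in is an assumption made for illustration and is not the model object used in the rest of the notebook:

```python
import torch
import torch.nn as nn

# stand-in with the same layer layout as described above
net = nn.Sequential(
    nn.Conv2d(4, 8, 5, stride=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 5, stride=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 6 * 6, 100), nn.ReLU(),
    nn.Linear(100, 11), nn.LogSoftmax(dim=1),
)

dummy = torch.rand(2, 4, 120, 120)  # batch of 2 random 4-band patches
with torch.no_grad():
    out = net(dummy)

# exponentiating the log-softmax output gives probabilities that sum to 1 per sample
prob_sums = out.exp().sum(dim=1)

# count all trainable parameters (weights and biases)
n_params = sum(p.numel() for p in net.parameters())
print(out.shape, prob_sums, n_params)
```

Most of the parameters sit in the first linear layer, which is typical for small CNNs with a flattening step.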

## Training and Validation

First of all, let's verify if a GPU is available on our compute machine. If not, the CPU will be used instead.

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

print('Device used: {}'.format(device))

Before we can implement the training pipeline we have to define two more things: a Loss function and an optimizer that will update our model weights during training. We also define our evaluation metric, for which we use the accuracy score.

In [None]:
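# we will use the negative log-likelihood loss; applied to the log-softmax output of our model, it is equivalent to the cross entropy loss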
loss = nn.NLLLoss()

# we will use the Adam optimizer
learning_rate = 0.0001
opt = optim.Adam(params=model.parameters(), lr=learning_rate)

# we instantiate the accuracy metric
accuracy = Accuracy(task="multiclass", num_classes=11)
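
The relation between the negative log-likelihood loss and the cross entropy loss can be checked numerically. The sketch below uses random scores and targets, which are purely illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(5, 11)           # raw scores for 5 samples and 11 classes
targets = torch.randint(0, 11, (5,))  # random class indices

# negative log-likelihood loss on log-softmax outputs ...
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

# ... matches the cross entropy loss on the raw scores
ce = nn.CrossEntropyLoss()(logits, targets)

print(nll.item(), ce.item())
```

This is why a model that ends in a log-softmax layer is paired with `nn.NLLLoss`, while a model that outputs raw scores would be paired with `nn.CrossEntropyLoss`.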

Now, we have to move the model, the loss function and the accuracy metric to the device selected above (the GPU, if one is available), since the computationally heavy work will be conducted there.

In [None]:
model.to(device)
loss.to(device)
accuracy.to(device)

Finally, we can implement our training pipeline.


In [None]:
epochs = 30  # training for 30 epochs

train_losses_epochs = []
val_losses_epochs = []
train_accs_epochs = []
val_accs_epochs = []

for ep in range(epochs):

    train_losses = []
    val_losses = []
    train_accs = []
    val_accs = []

    # we perform training for one epoch
    model.train()   # it is very important to put your model into training mode!
    for samples in tqdm(train_dataloader):
        # we extract the input data (Sentinel-2)
        x = samples['s2'].to(device)

        # now we extract the target (lulc class) and move it to the gpu
        y = samples['lulc'].to(device)

        # we make a prediction with our model
        output = model(x)

        # we reset the graph gradients
        model.zero_grad()

        # we determine the classification loss
        loss_train = loss(output, y)

        # we run a backward pass to compute the gradients
        loss_train.backward()

        # we update the network parameters
        opt.step()

        # we write the mini-batch loss and accuracy into the corresponding lists
        train_losses.append(loss_train.detach().cpu())
        train_accs.append(accuracy(torch.argmax(output, dim=1), y).detach().cpu())

    # we evaluate the current state of the model on the validation dataset
    model.eval()   # it is very important to put your model into evaluation mode!
    with torch.no_grad():
        for samples in tqdm(val_dataloader):
            # we extract the input data (Sentinel-2)
            x = samples['s2'].to(device)

            # now we extract the target (lulc class) and move it to the gpu
            y = samples['lulc'].to(device)

            # we make a prediction with our model
            output = model(x)

            # we determine the classification loss
            loss_val = loss(output, y)

            # we write the mini-batch loss and accuracy into the corresponding lists
            val_losses.append(loss_val.detach().cpu())
            val_accs.append(accuracy(torch.argmax(output, dim=1), y).detach().cpu())

    train_losses_epochs.append(np.mean(train_losses))
    train_accs_epochs.append(np.mean(train_accs))
    val_losses_epochs.append(np.mean(val_losses))
    val_accs_epochs.append(np.mean(val_accs))

    print("epoch {}: train: loss={}, acc={}; val: loss={}, acc={}".format(
        ep, train_losses_epochs[-1], train_accs_epochs[-1], 
        val_losses_epochs[-1], val_accs_epochs[-1]))

Training progress looks good: train and validation losses are decreasing, accuracies are increasing.

Let's plot the available metrics as a function of the number of training epochs:

In [None]:
f, ax = plt.subplots(1, 2, sharex=True, figsize=(10,5))

ax[0].plot(np.arange(1, len(train_losses_epochs)+1), train_losses_epochs, label='Train', color='blue')
ax[0].plot(np.arange(1, len(val_losses_epochs)+1), val_losses_epochs, label='Val', color='red')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss')
ax[0].legend()

ax[1].plot(np.arange(1, len(train_accs_epochs)+1), train_accs_epochs, label='Train', color='blue')
ax[1].plot(np.arange(1, len(val_accs_epochs)+1), val_accs_epochs, label='Val', color='red')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Accuracy')
ax[1].legend()
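
In curves like these, overfitting would show up as a validation loss that starts to rise while the training loss keeps falling. One possible safeguard, which is not part of the training loop above, is to keep a copy of the weights from the epoch with the lowest validation loss. The sketch below uses a toy model and made-up validation losses to illustrate the idea:

```python
import copy
import torch.nn as nn

toy_model = nn.Linear(4, 2)              # stand-in for the CNN
val_losses = [0.9, 0.7, 0.6, 0.65, 0.7]  # made-up validation losses, one per epoch

best_loss, best_epoch, best_state = float("inf"), None, None
for ep, val_loss in enumerate(val_losses):
    # (one epoch of training and validation would happen here)
    if val_loss < best_loss:
        best_loss, best_epoch = val_loss, ep
        # deepcopy is needed, since state_dict() returns references to the live weights
        best_state = copy.deepcopy(toy_model.state_dict())

# restore the snapshot from the best epoch
toy_model.load_state_dict(best_state)
print(best_epoch, best_loss)
```

In this toy example, the weights from the third epoch (index 2) would be restored, since the validation loss increases afterwards.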

The model learns well and we stopped the learning process before overfitting sets in. Let's evaluate the model again on the test dataset.

In [None]:
test_accs = []
predictions = []
groundtruths = []

model.eval()   # it is very important to put your model into evaluation mode!
with torch.no_grad():
    for samples in tqdm(test_dataloader):
        x = samples['s2'].to(device)

        # now we extract the target (lulc class) and move it to the gpu
        y = samples['lulc'].to(device)
        groundtruths.append(y.cpu())

        # we make a prediction with our model
        output = model(x)

        predictions.append(np.argmax(output.cpu().numpy(), axis=1))

        # we determine the classification loss
        loss_val = loss(output, y)

        # we write the mini-batch accuracy into the corresponding list
        test_accs.append(accuracy(torch.argmax(output, dim=1), y).cpu().numpy())

print('test dataset accuracy:', np.mean(test_accs))

# flatten predictions and groundtruths
predictions = np.concatenate(predictions).ravel()
groundtruths = np.concatenate(groundtruths).ravel()
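
Note that the accuracy printed above is the mean over mini-batch accuracies. With 150 test samples and a batch size of 16, the last batch holds only 6 samples, but it enters this mean with the same weight as a full batch. The following sketch with made-up numbers of correct predictions shows how this can differ from the accuracy over all samples:

```python
import numpy as np

# 150 test samples in batches of 16 leave a last batch of only 6 samples
batch_sizes = np.array([16] * 9 + [6])
correct = np.array([12] * 9 + [6])  # hypothetical number of correct predictions per batch

mean_of_batch_accs = np.mean(correct / batch_sizes)  # every batch has the same weight
overall_acc = correct.sum() / batch_sizes.sum()      # every sample has the same weight

print(mean_of_batch_accs, overall_acc)
```

For the flattened arrays from the cell above, the sample-weighted accuracy could be computed as `np.mean(predictions == groundtruths)`.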


The test dataset performance is very close to the validation dataset performance, which is a good sign.

Finally, let's have a look at the confusion matrix.

In [None]:
f, ax = plt.subplots(1, 1, figsize=(8,6))

# plot the confusion matrix
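labels = np.unique(np.concatenate([groundtruths, predictions]))  # classes present in the ground truth or the predictions
disp = ConfusionMatrixDisplay.from_predictions(
    groundtruths, predictions,
    labels=labels,
    display_labels=[ewc_label_names[i] for i in labels],
    normalize='true',
    ax=ax)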

# rotate x labels for better readability
ax.set_xticks(ax.get_xticks(), ax.get_xticklabels(), rotation=90)
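
To read the normalized confusion matrix correctly, it helps to look at a tiny example. The labels below are made up for illustration. With `normalize='true'`, every row (true class) sums to 1, so the diagonal holds the fraction of correctly classified samples per class:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# made-up ground truth and predictions for the classes 0, 2 and 3
y_true = np.array([0, 0, 0, 0, 2, 2, 3, 3])
y_pred = np.array([0, 0, 0, 2, 2, 0, 0, 3])

cm = confusion_matrix(y_true, y_pred)                         # raw counts: rows = true, columns = predicted
cm_norm = confusion_matrix(y_true, y_pred, normalize="true")  # each row divided by its sum

print(cm)
print(cm_norm)
```

In this toy example, half of the samples of the classes 2 and 3 are misclassified as class 0, which mirrors the kind of confusion with a dominant class that we are looking for in the plot.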

As you can see, the model is far from performing perfectly: a significant number of samples from other classes are misclassified as tree cover or grassland. This confusion is to be expected, since the problem is ill-posed: naturally, most images will include more than a single class. Therefore, the model will tend to predict only the most common classes.

**Exercise**: Modify the design of the CNN. Will the results improve if you add more convolutional/linear layers?

In [None]:
# use this cell for the exercise