<a href="https://colab.research.google.com/github/RicoStaedeli/ML-Eurosat/blob/main/Generate_Training_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Exploring and Preparing the EuroSAT Dataset for Model Training**  

In this notebook, we analyze images from the **EuroSAT dataset**, which is derived from **Sentinel-2 satellite imagery**. Additionally, we examine the images from the **evaluation dataset**, stored as `.npy` files containing **12 spectral bands**.  

The primary objective is to process and standardize these images into a consistent shape, ensuring they are suitable for model training.

## Creation of Training Dataset
The training dataset is built from 12 of the 13 bands of each original image. Band 10 of the original image is removed on the assumption that this band was not used, which leaves 12 bands to match the evaluation `.npy` files.

In [None]:
!pip install rasterio

Collecting rasterio
  Downloading rasterio-1.4.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.1 kB)
Collecting affine (from rasterio)
  Downloading affine-2.4.0-py3-none-any.whl.metadata (4.0 kB)
Collecting cligj>=0.5 (from rasterio)
  Downloading cligj-0.7.2-py3-none-any.whl.metadata (5.0 kB)
Collecting click-plugins (from rasterio)
  Downloading click_plugins-1.1.1-py2.py3-none-any.whl.metadata (6.4 kB)
Downloading rasterio-1.4.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (22.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22.2/22.2 MB 70.2 MB/s eta 0:00:00
Downloading cligj-0.7.2-py3-none-any.whl (7.1 kB)
Downloading affine-2.4.0-py3-none-any.whl (15 kB)
Downloading click_plugins-1.1.1-py2.py3-none-any.whl (7.5 kB)
Installing collected packages: cligj, click-plugins, affine, rasterio
Successfully installed affine-2.4.0 click-plugins-1.1.1 cligj-0.7.2 rasterio-1.4.3


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Copy the dataset from Google Drive to the Colab instance's local storage for faster file access.

In [None]:
import os
import numpy as np
import rasterio
import matplotlib.pyplot as plt
import glob
from tqdm import tqdm

In [None]:
!cp /content/drive/MyDrive/HSG/ML/Project/Datasets/EuroSATallBands.zip /content/

!unzip /content/EuroSATallBands.zip -d /content/dataset

In [None]:
path_dataset_train_tif = '/content/dataset/ds/images/remote_sensing/otherDatasets/sentinel_2/tif'
path_dataset_test_npy = '/content/drive/MyDrive/HSG/ML/Project/Datasets/Challenge Testdata/testset/testset'
path_dataset_processed_tif_to_npy = '/content/processed_dataset'



## Convert


*   Convert images from GeoTIFF format to NumPy arrays
*   Remove band 10
*   Move band 13 (Band 8A) to the 9th position
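
The three steps above amount to one fixed permutation of the band axis. The following is a minimal sketch, assuming the input array has its 13 bands on the last axis in the same order that the conversion code in this notebook expects. `BAND_ORDER` and `reorder_bands` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
import numpy as np

# Hypothetical constant: original band indices listed in the desired output order.
# Index 9 (band 10) is left out, and index 12 (band 13, i.e. Band 8A) moves to
# index 8.
BAND_ORDER = [0, 1, 2, 3, 4, 5, 6, 7, 12, 8, 10, 11]


def reorder_bands(image):
    """Return a 12-band copy of a (..., 13) array with the bands rearranged."""
    if image.shape[-1] != 13:
        raise ValueError(f"Expected 13 bands on the last axis, got {image.shape[-1]}")
    return image[..., BAND_ORDER]


# Toy image in which every pixel of band i holds the value i
demo = np.broadcast_to(np.arange(13), (2, 2, 13)).copy()
reordered = reorder_bands(demo)
print(reordered.shape)            # (2, 2, 12)
print(reordered[0, 0].tolist())   # [0, 1, 2, 3, 4, 5, 6, 7, 12, 8, 10, 11]
```

Indexing with a single list makes the final band order visible at a glance, whereas a delete/insert sequence requires tracking how the indices shift after each step.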



In [None]:
import numpy as np
import tifffile as tiff
import os

# Define input and output paths
input_dir = path_dataset_train_tif
output_dir = path_dataset_processed_tif_to_npy

# Create output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

def convert_tif_to_npy(tif_path, npy_output_path):
    # Load the .tif image
    tif_image = tiff.imread(tif_path)

    # The hard-coded indices below assume exactly 13 bands on the last axis
    if tif_image.shape[-1] != 13:
        raise ValueError(f"Image at {tif_path} does not have exactly 13 bands.")

    # Remove Band 10 (index 9)
    tif_image = np.delete(tif_image, 9, axis=-1)

    # Band 13 (Band 8A) was at index 12; after the deletion it is at index 11
    band_13 = tif_image[..., 11]

    # Remove Band 13 from the end of the band axis
    tif_image = np.delete(tif_image, 11, axis=-1)

    # Insert Band 13 as the 9th band (index 8)
    tif_image = np.insert(tif_image, 8, band_13, axis=-1)

    # Save the converted image as .npy
    np.save(npy_output_path, tif_image)
    print(f"Saved converted image to {npy_output_path}")

def process_tif_folder(input_dir, output_dir):
    os.makedirs(output_dir, exist_ok=True)

    # Iterate over each class directory
    for class_folder in os.listdir(input_dir):
        class_path = os.path.join(input_dir, class_folder)

        # Ensure it's a directory (skip files)
        if os.path.isdir(class_path):
            # Process all .tif images inside this class folder
            for filename in os.listdir(class_path):
                if filename.endswith(".tif"):
                    input_path = os.path.join(class_path, filename)

                    # Flatten output into one folder; this relies on filenames
                    # being unique across class folders
                    output_filename = os.path.splitext(filename)[0] + ".npy"
                    output_path = os.path.join(output_dir, output_filename)

                    convert_tif_to_npy(input_path, output_path)

In [None]:
process_tif_folder(input_dir, output_dir)
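
After the conversion, it can be useful to confirm that every saved array has the expected number of bands. The sketch below assumes the files are `.npy` arrays with the bands on the last axis. `find_unexpected_shapes` is a hypothetical helper, and the demo writes two small arrays to a temporary folder rather than touching the real dataset.

```python
import glob
import os
import tempfile

import numpy as np


def find_unexpected_shapes(folder, expected_bands=12):
    """Return {filename: shape} for .npy files whose last axis is not expected_bands."""
    mismatches = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.npy"))):
        shape = np.load(path).shape
        if len(shape) == 0 or shape[-1] != expected_bands:
            mismatches[os.path.basename(path)] = shape
    return mismatches


# Demo on a temporary folder: one array with 12 bands, one with 13
with tempfile.TemporaryDirectory() as demo_dir:
    np.save(os.path.join(demo_dir, "good.npy"), np.zeros((4, 4, 12), dtype=np.uint16))
    np.save(os.path.join(demo_dir, "bad.npy"), np.zeros((4, 4, 13), dtype=np.uint16))
    mismatches = find_unexpected_shapes(demo_dir)

print(mismatches)  # {'bad.npy': (4, 4, 13)}
```

On the real data this could be called as `find_unexpected_shapes(path_dataset_processed_tif_to_npy)`; an empty dictionary would mean every file has 12 bands.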

In [None]:
# Get list of .npy files
image_files = glob.glob(os.path.join(path_dataset_processed_tif_to_npy, "*.npy"))
# Count .npy files
file_count = len(image_files)

print(f"Number of .npy files: {file_count}")


Number of .npy files: 27000


In [None]:
# Store the dataset on Drive for later use
import os
import shutil

# Define paths
drive_dataset_path = "/content/drive/MyDrive/HSG/ML/Project/Datasets/Eurosat_train_dataset/"

# Ensure the target directory exists
os.makedirs(drive_dataset_path, exist_ok=True)

# Copy the entire folder
shutil.copytree(path_dataset_processed_tif_to_npy, drive_dataset_path, dirs_exist_ok=True)

print("Dataset successfully copied to Google Drive!")


Dataset successfully copied to Google Drive!


In [None]:
# Get list of .npy files
image_files = glob.glob(os.path.join(drive_dataset_path, "*.npy"))
# Count .npy files
file_count = len(image_files)

print(f"Number of .npy files: {file_count}")


Number of .npy files: 27000


In [None]:
import shutil

# Path to the folder you want to zip
path_dataset_processed_tif_to_npy = '/content/processed_dataset'

# Output zip file path (without .zip extension)
output_zip_path = '/content/EuroSAT_training_dataset_numpy'

# Create the zip file
shutil.make_archive(output_zip_path, 'zip', path_dataset_processed_tif_to_npy)

print(f"Folder zipped successfully at {output_zip_path}.zip")

Folder zipped successfully at /content/EuroSAT_training_dataset_numpy.zip
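
Before copying the archive to Drive, the number of entries in the zip can be compared with the number of files in the source folder. This is a minimal sketch using the standard-library `zipfile` module. `count_npy_in_zip` is a hypothetical helper, and the demo archives a temporary folder with placeholder files instead of the real dataset.

```python
import os
import shutil
import tempfile
import zipfile


def count_npy_in_zip(zip_path):
    """Count the archive entries whose names end in .npy."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(name.endswith(".npy") for name in archive.namelist())


with tempfile.TemporaryDirectory() as work_dir:
    source_dir = os.path.join(work_dir, "source")
    os.makedirs(source_dir)
    for i in range(3):
        with open(os.path.join(source_dir, f"image_{i}.npy"), "wb") as handle:
            handle.write(b"placeholder")  # stand-in bytes, not a real array

    # make_archive appends ".zip" and returns the full archive path
    archive_path = shutil.make_archive(os.path.join(work_dir, "demo"), "zip", source_dir)
    npy_count = count_npy_in_zip(archive_path)

print(npy_count)  # 3
```

For the archive created above, the count would be expected to match the 27000 files reported earlier.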


In [None]:
drive_dataset_path_zip = '/content/drive/MyDrive/HSG/ML/Project/Datasets/'
output_zip_path = '/content/EuroSAT_training_dataset_numpy.zip'

shutil.copy(output_zip_path, drive_dataset_path_zip)

print(f"Zipped folder copied to: {drive_dataset_path_zip}")

Zipped folder copied to: /content/drive/MyDrive/HSG/ML/Project/Datasets/
