# Dataset Preparation - Stage 1
This Jupyter notebook is the first in a series that uses a Convolutional Neural Network (CNN) to perform land classification. It outlines the steps involved in preparing the EuroSAT Sentinel-2 dataset for training the CNN model. The preparation process includes:
1. **Importing** the necessary libraries, including os, shutil, random and gdal.
2. **Reducing** the number of images in the dataset to cut load and processing times.
3. **Organising** the dataset into the **Train, Test and Valid** sub-directories within the dataset directory.
4. **Converting and enhancing** the JPG images, resulting in more vivid images with higher contrast and improved color differentiation due to the emphasis on the RGB bands.
5. **Shuffling and splitting** the files into training (80%), validation (10%) and test (10%) sets based on predefined indices.
6. **Moving** the files to their respective directories based on the split.
7. Repeating the conversion and enhancement (step 4) for custom JPG images from Malta, which are used at a later stage to test the CNNs.

### Importing Necessary Libraries

In [1]:
# Import the os module for interacting with the operating system's file system
import os
# Import the shutil module for high-level file operations such as copying and moving files
import shutil
# Import the random module for generating random numbers and making random selections
import random
# Import the subprocess module for running external commands
import subprocess
# Import tqdm for displaying progress bars
from tqdm import tqdm  
# Import the gdal module from the osgeo package for working with geospatial data formats
from osgeo import gdal

### Setting the Base Path and Random Seed

In [2]:
# Set the path to the datasets directory
path = "C:/Users/isaac/datasets/eurosat-dataset"

# Set the random seed for reproducibility, so that the random operations produce the same result every time the code is run
# random.seed() returns None, so the seed value is stored separately before seeding
SEED = 123
random.seed(SEED)
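
As a quick check of what seeding provides, the sketch below shows that re-seeding with the same value reproduces the same shuffle, and that `random.seed()` itself returns `None`. The `shuffled_copy` helper and the tile names are made up for illustration and are not part of this notebook.

```python
import random

def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a fixed seed (hypothetical helper)."""
    rng = random.Random(seed)  # Independent generator; the global random state is untouched
    result = list(items)
    rng.shuffle(result)
    return result

names = [f"tile_{i}.jpg" for i in range(10)]  # Made-up file names for illustration

first = shuffled_copy(names, seed=123)
second = shuffled_copy(names, seed=123)
print(first == second)  # True: same seed, same order

seed_return = random.seed(123)  # Seeds the global generator in place
print(seed_return)  # None
```

Using a local `random.Random` instance is one way to keep a selection repeatable even if other cells consume random numbers in between.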

### Collecting Category Names and File Paths

In [3]:
# Initialize empty lists to store category names and file paths
categories = []  # List to hold the names of the categories (subdirectories)
jpg_files = []   # List to hold the paths of all .jpg files

# Walk through the directory tree starting from the specified path
# os.walk generates the file names in a directory tree by walking the tree either top-down or bottom-up
for dirpath, dirnames, filenames in os.walk(path):
    # Record only the immediate subdirectories of the dataset root as categories;
    # folders found deeper in the tree are not categories
    if dirpath == path:
        categories.extend(dirnames)
    # Iterate over all filenames in the current directory
    for filename in filenames:
        # Check if the file has a .jpg extension
        if filename.endswith('.jpg'):
            # Add the full path of the .jpg file to the jpg_files list
            jpg_files.append(os.path.join(dirpath, filename))

# Print the list of categories (subdirectory names)
print(categories)

['AnnualCrop', 'Forest', 'HerbaceousVegetation', 'Highway', 'Industrial', 'Pasture', 'PermanentCrop', 'Residential', 'River', 'SeaLake']
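
To see how many images each category holds before anything is deleted, a small sketch along these lines can help. It assumes the layout used above (one sub-directory per category directly under the dataset root); the `count_jpgs_per_category` helper and the temporary demo tree are illustrative only.

```python
import os
import tempfile

def count_jpgs_per_category(root):
    """Map each immediate sub-directory of root to its number of .jpg files."""
    counts = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir():
                counts[entry.name] = sum(
                    1 for name in os.listdir(entry.path)
                    if name.lower().endswith('.jpg')
                )
    return counts

# Build a tiny throwaway tree so the sketch runs without the real dataset
with tempfile.TemporaryDirectory() as demo_root:
    for category, n_files in [('Forest', 3), ('River', 2)]:
        os.makedirs(os.path.join(demo_root, category))
        for i in range(n_files):
            open(os.path.join(demo_root, category, f'{category}_{i}.jpg'), 'w').close()
    demo_counts = count_jpgs_per_category(demo_root)

print(demo_counts)
```

Pointing the helper at the real dataset root instead of the temporary directory would give the per-category totals.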


### Reducing the Dataset by Half

In [4]:
# Reduce the number of images in each category by 50% to reduce load times and process times
for category in categories:
    # Construct the path to the current category
    category_path = os.path.join(path, category)
    
    # Filter jpg_files to get only the files in the current category
    category_files = [file for file in jpg_files if file.startswith(category_path)]

    # If there is more than one file in the category, reduce the number by 50%
    if len(category_files) > 1:
        # Calculate the number of files to keep (50% of the total)
        num_files_to_keep = len(category_files) // 2
        
        # Randomly select a subset of files to keep
        files_to_keep = random.sample(category_files, num_files_to_keep)
        
        # Determine the files to remove (those not in the files_to_keep list)
        files_to_remove = set(category_files) - set(files_to_keep)
        
        # Iterate over the files to remove and delete them
        for file_to_remove in files_to_remove:
            os.remove(file_to_remove)  # Delete the file

# Print a completion message
print("Reduction of images in each category by 50% completed.")

Reduction of images in each category by 50% completed.
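
Because `os.remove` is irreversible, it can be worth previewing the selection first. The following is a minimal sketch rather than the method used in this notebook: `select_half` is a hypothetical helper that only returns which files would be kept and which would be removed, leaving the deletion to the caller.

```python
import random

def select_half(files, seed=123):
    """Return (keep, remove) lists covering files, keeping len(files) // 2 of them."""
    rng = random.Random(seed)   # Local generator for a repeatable selection
    ordered = sorted(files)     # Sort first so the result does not depend on input order
    keep = set(rng.sample(ordered, len(ordered) // 2))
    remove = [f for f in ordered if f not in keep]
    return sorted(keep), remove

demo_files = [f"Forest_{i}.jpg" for i in range(7)]  # Made-up names
keep, remove = select_half(demo_files)
print(len(keep), len(remove))  # 3 4
```

With an odd number of files, the integer division keeps the smaller half, which matches the `// 2` used in the cell above.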


### Creating and Organising Directories

In [5]:
# List of split names
split_names = ['train', 'test', 'valid']

# Create new folders for train, test, and valid sets with subfolders for all categories
for sp_name in split_names:
    directory = os.path.join(path, 'dataset_splits', sp_name)  # Path to the split folder
    if not os.path.exists(directory):  # Check if the split folder doesn't exist
        os.makedirs(directory)  # Create the split folder if it doesn't exist
        print(f"Created folder: {directory}")
        print()
    # Create category folders within each split
    for category in categories:
        dir_cat = os.path.join(directory, category)  # Path to the category folder within the split
        if not os.path.exists(dir_cat):  # Check if the category folder doesn't exist
            os.makedirs(dir_cat)  # Create the category folder if it doesn't exist
            print(f"Created category folder: {dir_cat}")

print("All folders created successfully.")

Created folder: C:/Users/isaac/datasets/eurosat-dataset\dataset_splits\train

Created category folder: C:/Users/isaac/datasets/eurosat-dataset\dataset_splits\train\AnnualCrop
Created category folder: C:/Users/isaac/datasets/eurosat-dataset\dataset_splits\train\Forest
Created category folder: C:/Users/isaac/datasets/eurosat-dataset\dataset_splits\train\HerbaceousVegetation
Created category folder: C:/Users/isaac/datasets/eurosat-dataset\dataset_splits\train\Highway
Created category folder: C:/Users/isaac/datasets/eurosat-dataset\dataset_splits\train\Industrial
Created category folder: C:/Users/isaac/datasets/eurosat-dataset\dataset_splits\train\Pasture
Created category folder: C:/Users/isaac/datasets/eurosat-dataset\dataset_splits\train\PermanentCrop
Created category folder: C:/Users/isaac/datasets/eurosat-dataset\dataset_splits\train\Residential
Created category folder: C:/Users/isaac/datasets/eurosat-dataset\dataset_splits\train\River
Created category folder: C:/Users/isaac/datasets/e
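
The existence checks in the cell above can also be folded into `os.makedirs(..., exist_ok=True)`, which the later cells already use. A sketch of that variant, run against a temporary directory with a shortened, illustrative category list (`create_split_dirs` is a hypothetical helper):

```python
import os
import tempfile

def create_split_dirs(root, split_names, categories):
    """Create root/dataset_splits/<split>/<category> for every combination."""
    created = []
    for sp_name in split_names:
        for category in categories:
            dir_cat = os.path.join(root, 'dataset_splits', sp_name, category)
            os.makedirs(dir_cat, exist_ok=True)  # No error if the folder already exists
            created.append(dir_cat)
    return created

with tempfile.TemporaryDirectory() as demo_root:
    dirs = create_split_dirs(demo_root, ['train', 'test', 'valid'], ['Forest', 'River'])
    # Calling it a second time is safe because of exist_ok=True
    create_split_dirs(demo_root, ['train', 'test', 'valid'], ['Forest', 'River'])
    all_exist = all(os.path.isdir(d) for d in dirs)

print(len(dirs), all_exist)  # 6 True
```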

### Enhancing Colors and Contrast of JPG Images Using GDAL

In [6]:
# Function to enhance the colors and contrast of the JPGs using GDAL
def process_image(input_file, output_file, bands=(1, 2, 3)):
    try:
        # Construct the gdal_translate command
        command = ['gdal_translate', '-of', 'JPEG']  # Set output format to JPEG
        for band in bands:
            # Add each specified band to the command
            command.extend(['-b', str(band)])
        # Add the scaling operation and specify input and output file paths
        command.extend(['-scale', input_file, output_file])
        
        # Run the constructed command using subprocess
        subprocess.run(command, check=True)
    except subprocess.CalledProcessError as e:
        # Catch and print an error message if the command fails
        print(f"Error converting {input_file} to {output_file}: {e}")
        return False
    
    # Return True to indicate successful conversion
    return True
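
For reference, `-scale` without explicit ranges asks `gdal_translate` to compute the input range from the source data and stretch it to 0-255, which is where the contrast gain comes from. The sketch below only builds and prints the argument list that `process_image` passes to `subprocess.run`; `build_translate_command` and the file names are hypothetical, and nothing is executed.

```python
def build_translate_command(input_file, output_file, bands=(1, 2, 3)):
    """Build the gdal_translate argument list without running it."""
    command = ['gdal_translate', '-of', 'JPEG']  # Output format: JPEG
    for band in bands:
        command.extend(['-b', str(band)])        # Select each band in order
    command.extend(['-scale', input_file, output_file])
    return command

cmd = build_translate_command('in.jpg', 'out.jpg')
print(' '.join(cmd))
# gdal_translate -of JPEG -b 1 -b 2 -b 3 -scale in.jpg out.jpg
```

Printing the command this way is a cheap way to inspect what would be run before processing thousands of files.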

### Processing and Splitting Enhanced JPG Images by Category

In [7]:
# Iterate through each category
for category in categories:
    # Define directories for input JPG files and output JPG files
    directory = os.path.join(path, category)  # Input directory
    directory_train = os.path.join(path, 'dataset_splits', 'train', category)  # Output JPG directory for training set
    directory_valid = os.path.join(path, 'dataset_splits', 'valid', category)  # Output JPG directory for validation set
    directory_test = os.path.join(path, 'dataset_splits', 'test', category)  # Output JPG directory for test set
    
    # Create output directories if they do not exist
    os.makedirs(directory_train, exist_ok=True)
    os.makedirs(directory_valid, exist_ok=True)
    os.makedirs(directory_test, exist_ok=True)
    
    # List all files in the input directory
    try:
        cat_files = os.listdir(directory)
    except FileNotFoundError:
        continue  # Skip the category if the directory does not exist

    # Filter out unwanted files
    cat_files = [file for file in cat_files if file.lower().endswith(('.jpg', '.jpeg'))]
    
    # Convert JPG files 
    with tqdm(total=len(cat_files), desc=f'Converting {category} JPGs') as pbar:  # Initialize tqdm progress bar
        for file in cat_files:
            file_no_ext = os.path.splitext(file)[0]  # Remove file extension
            img_in = os.path.join(directory, file)  # Input file path
            img_out = os.path.join(directory_train, file_no_ext + '.jpg')  # Output JPG file path
                
            # Check if the converted JPEG file already exists in the training directory
            if not os.path.exists(img_out):
                success = process_image(img_in, img_out, bands=[1, 2, 3])
                if not success:
                    print(f"Conversion failed for {img_in}")
            
            pbar.update(1)  # Update progress bar
    
    # Shuffle the converted files and split them into training, validation and test sets
    filenames = os.listdir(directory_train)  # List JPG files in the training directory
    filenames.sort()  # Sort filenames alphabetically
    if '.DS_Store' in filenames:
        filenames.remove('.DS_Store')
        
    random.shuffle(filenames)  # Shuffle filenames randomly
    split_1 = int(0.8 * len(filenames))  # Split index for training-validation split
    split_2 = int(0.9 * len(filenames))  # Split index for validation-test split
    train_filenames = filenames[:split_1]  # Filenames for training set
    valid_filenames = filenames[split_1:split_2]  # Filenames for validation set
    test_filenames = filenames[split_2:]  # Filenames for test set
        
    for file in os.listdir(directory_train):
        if file in valid_filenames:
            shutil.move(os.path.join(directory_train, file), os.path.join(directory_valid, file))  # Move to validation directory
        elif file in test_filenames:
            shutil.move(os.path.join(directory_train, file), os.path.join(directory_test, file))  # Move to test directory

Converting AnnualCrop JPGs: 100%|██████████| 1500/1500 [01:00<00:00, 24.79it/s]
Converting Forest JPGs: 100%|██████████| 1500/1500 [00:57<00:00, 25.91it/s]
Converting HerbaceousVegetation JPGs: 100%|██████████| 1500/1500 [00:55<00:00, 26.81it/s]
Converting Highway JPGs: 100%|██████████| 1250/1250 [00:45<00:00, 27.18it/s]
Converting Industrial JPGs: 100%|██████████| 1250/1250 [00:49<00:00, 25.51it/s]
Converting Pasture JPGs: 100%|██████████| 1000/1000 [00:39<00:00, 25.48it/s]
Converting PermanentCrop JPGs: 100%|██████████| 1250/1250 [00:49<00:00, 25.39it/s]
Converting Residential JPGs: 100%|██████████| 1500/1500 [01:00<00:00, 24.65it/s]
Converting River JPGs: 100%|██████████| 1250/1250 [00:50<00:00, 24.62it/s]
Converting SeaLake JPGs: 100%|██████████| 1500/1500 [00:56<00:00, 26.39it/s]
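
The split indices can be checked in isolation. The sketch below wraps the same `int(0.8 * n)` and `int(0.9 * n)` arithmetic in a hypothetical `split_filenames` helper and applies it to 1500 made-up names, the size of the larger categories in the log above.

```python
import random

def split_filenames(filenames, seed=123):
    """Shuffle a copy of filenames and cut it 80/10/10 into train, valid, test."""
    rng = random.Random(seed)
    shuffled = sorted(filenames)        # Sort first for a repeatable starting order
    rng.shuffle(shuffled)
    split_1 = int(0.8 * len(shuffled))  # End of the training slice
    split_2 = int(0.9 * len(shuffled))  # End of the validation slice
    return shuffled[:split_1], shuffled[split_1:split_2], shuffled[split_2:]

demo_names = [f"tile_{i}.jpg" for i in range(1500)]
train_set, valid_set, test_set = split_filenames(demo_names)
print(len(train_set), len(valid_set), len(test_set))  # 1200 150 150
```

The three slices are disjoint and together cover every file, so no image can end up in more than one split.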


### Converting Test Images to JPEG Format

In [8]:
# Define directories for input JPG files and output JPG files
directory = 'C:/Users/isaac/datasets/test-jpg-images' # Input JPGs
directory_converted = 'C:/Users/isaac/datasets/converted-jpg-images' # Output JPGs
    
# Create output directories if they do not exist
os.makedirs(directory_converted, exist_ok=True)
    
# List all files in the input directory
try:
    cat_files = os.listdir(directory)
except FileNotFoundError:
    cat_files = [] 

# Filter out unwanted files
cat_files = [file for file in cat_files if file.lower().endswith(('.jpg', '.jpeg'))]

# Enhance JPG files and write them to the converted-images directory
with tqdm(total=len(cat_files), desc='Converting JPGs') as pbar:  # Initialize tqdm progress bar
    for file in cat_files:
        file_no_ext = os.path.splitext(file)[0]  # Remove file extension
        img_in = os.path.join(directory, file)  # Input JPG file path
        img_out = os.path.join(directory_converted, file_no_ext + '.jpg')  # Output JPG file path
                
        # Check if the converted JPEG file already exists in the output directory
        if not os.path.exists(img_out):
            success = process_image(img_in, img_out, bands=[1, 2, 3])
            # Report failures only for files that were actually processed
            if not success:
                print(f"Conversion failed for {img_in}")
            
        pbar.update(1)  # Update progress bar        

Converting JPGs: 100%|██████████| 5/5 [00:00<00:00, 12.66it/s]
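
When stripping extensions, `str.split('.')[0]` and `os.path.splitext` behave differently for names that contain more than one dot: the former keeps only the text before the first dot, the latter removes just the final extension. A small comparison with a made-up file name:

```python
import os

name = 'malta.valletta.01.jpg'  # Made-up file name with extra dots

stem_split = name.split('.')[0]            # Everything before the first dot
stem_splitext = os.path.splitext(name)[0]  # Everything before the final extension

print(stem_split)     # malta
print(stem_splitext)  # malta.valletta.01
```

For custom images with free-form names, the `os.path.splitext` form avoids two different inputs collapsing onto the same output name.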


By employing a structured approach, I ensure that the dataset is well-organised and ready for the training, validation and testing phases of the CNN's development.