# Preparing and preprocessing the EuroSAT data
This Jupyter Notebook presents an example of how to prepare and preprocess an image dataset. The EuroSAT satellite images are used as the example because they are needed for the quantum K-means implementation in the Jupyter Notebook called "QKmeans".

The steps executed in this notebook will be the following:
1. Transforming the image dataset into a CSV file

2. Reading the images and performing feature extraction with VGG16

3. Reducing the dimensions of the dataset with PCA

In [None]:
# Begin with importing all the necessary libraries

import random
import os
import pandas as pd
from PIL import Image
import torch
from torchvision import models
from torchvision import transforms
import numpy as np
from sklearn.decomposition import PCA

## 1. Transforming the image dataset into a CSV file
The first step is to collect the image paths and their labels into a CSV file. For this step the EuroSAT dataset must be downloaded to your computer; it is available at https://zenodo.org/records/7711810#.ZAm3k-zMKEA . This example uses the RGB version of the dataset, which contains JPG images, but a multi-spectral version is also available.

In [None]:
# Defining the path to the directory where the data is stored
root_dir = '/Path/to/your/EuroSAT_RGB'
#root_dir2 = '/Users/mmakital/Document/EuroSAT_MS' # the other version of the dataset, RGB is easier to handle so we use that

# Initializing the lists which will contain the paths to the images and their labels
image_paths = []
labels = []

# We loop over each subdirectory in the root directory to save every image
for class_label in os.listdir(root_dir):
    class_dir = os.path.join(root_dir, class_label)

    if os.path.isdir(class_dir):  # check that it is a directory
        # loop over the images in that directory and save them into the lists
        for image_file in os.listdir(class_dir):
            if image_file.endswith('.jpg'):
                image_paths.append(os.path.join(class_dir, image_file))
                labels.append(class_label)

# creating the DataFrame
df = pd.DataFrame({
    'image_path' : image_paths,
    'labels' : labels
})

# saving to a CSV
csv_path = '/Path/to/your/EuroSAT_RGB.csv'
df.to_csv(csv_path, index=False)

# Check the result
print(df.head())

In [None]:
# We can also select only a few images from each subdirectory. This way the dataset is smaller and it will be faster to cluster it.

# Initialize lists to store image paths and labels
image_paths2 = []
labels2 = []

# Traverse the main folder
for subdir, dirs, files in os.walk(root_dir):
    if subdir == root_dir:
        continue  # Skip the main folder itself, we only want subfolders
    
    # Filter and collect image files (add more extensions to the tuple if needed)
    image_files = [f for f in files if f.lower().endswith(('.jpg',))]
    
    # Randomly select a number of images from the subfolder
    n = 50
    selected_images = random.sample(image_files, min(len(image_files), n))
    
    # Collect full paths and labels
    for image in selected_images:
        image_paths2.append(os.path.join(subdir, image))
        labels2.append(os.path.basename(subdir))  # Label is the name of the subfolder

# Create a pandas DataFrame
df2 = pd.DataFrame({
    'image_path': image_paths2,
    'label': labels2
})

# Save the DataFrame to a CSV file
csv_path2 = '/Path/to/your/EuroSAT_RGB_small.csv'
df2.to_csv(csv_path2, index=False)  # index=False avoids writing an extra unnamed index column

# You can also check that the file is saved correctly and check the shape of it
print(f"CSV file saved to {csv_path2}")
print(df2.head())
print(df2.shape)

## 2. Reading the images and performing feature extraction with VGG16

In this section we read the images listed in our CSV file and use the pretrained VGG16 model from PyTorch, https://pytorch.org/vision/main/models/generated/torchvision.models.vgg16.html, to extract features from them. VGG16 is a convolutional neural network developed by the Visual Geometry Group at the University of Oxford, and it is widely used because of its simplicity. Here the pretrained model lets us extract features from the images quickly and easily. The features it captures include edges, textures, patterns, shapes, and objects. The model produces a set of feature maps for each image, where each position in a feature map represents how strongly a feature is present in that part of the image. After running the model and flattening its output, we get an array with one row of feature values per image.
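To get a feeling for how large the flattened feature vector is, the shape of the output of VGG16's convolutional part can be worked out by hand. The sketch below is only an illustration: `vgg16_feature_shape` is a hypothetical helper written for this notebook, not a torchvision function, and it assumes the standard VGG16 layout of five convolutional blocks that each end in a 2x2 max pooling step.

```python
# Hypothetical helper (not part of torchvision): follows the tensor shape
# through the convolutional part of a standard VGG16.
def vgg16_feature_shape(height, width):
    # Number of output channels at the end of each of the five blocks
    block_channels = [64, 128, 256, 512, 512]
    channels = 3  # RGB input
    for out_channels in block_channels:
        channels = out_channels  # 3x3 convolutions with padding 1 keep height and width
        height //= 2             # each block ends with a 2x2 max pool with stride 2
        width //= 2
    return channels, height, width

# A 224x224 crop is what the preprocessing in this notebook feeds to the model
c, h, w = vgg16_feature_shape(224, 224)
flat_length = c * h * w
print((c, h, w), flat_length)  # (512, 7, 7) 25088
```

Under these assumptions each image ends up as a vector of 25088 values, which is why the dimensionality reduction in section 3 is needed before clustering.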

In [None]:
# Initializing a list for the Image objects
all_image_objects = []

# Helper that opens every image in a batch of paths and returns the Image objects
def process_batch(image_paths):
    image_objects = []
    for image_path in image_paths:
        try:
            if os.path.exists(image_path):
                with Image.open(image_path) as img:
                    image_objects.append(img.copy())
            else:
                print(f"Image not found: {image_path}")
        except Exception as e:
            print(f"Error opening image {image_path}: {e}")
    return image_objects
    
# Reading the CSV file in chunks and loading the images of each chunk
batch_size = 100  # This can be modified based on how many images you are iterating through
for batch in pd.read_csv(csv_path, chunksize=batch_size):
    image_paths = batch['image_path'].tolist()
    batch_image_objects = process_batch(image_paths)
    all_image_objects.extend(batch_image_objects)

# Printing a few images to see that it works
for img in all_image_objects[:5]:
    print(img)
    img.show()

In [None]:
# Loading the pretrained model and its weights
Vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_FEATURES)
Vgg16.eval()

In [None]:
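# Defining the preprocessing transformations so that the images are in the form VGG16 expects.
# Note: the mean and std below are the commonly used ImageNet statistics. The preprocessing bundled
# with the chosen weights is available via models.VGG16_Weights.IMAGENET1K_FEATURES.transforms()
# and may use different normalization values, so it is worth comparing the two.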
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Defining a function to do the feature extraction for each image
def extract_features(image):
    image = preprocess(image).unsqueeze(0)
    with torch.no_grad():
        features = Vgg16.features(image)
        features = features.view(features.size(0), -1)
    return features.numpy()

# Iterating through the image objects and saving the features into an array
feature_list = []

for image in all_image_objects:
    features = extract_features(image)
    feature_list.append(features[0])

features_array = np.array(feature_list)

# Checking that the array is correct
print(features_array.shape)
print(features_array.dtype)

In [None]:
# Creating a DataFrame to save the features in
df3 = pd.DataFrame(features_array)
csv_path3 = '/Path/to/your/EuroSAT_RGB_Preprocessed.csv'
df3.to_csv(csv_path3, index=False)

# Check the result
print(df3.head())

## 3. Reducing the dimensions of the dataset with PCA

To finish the preprocessing, PCA is used to reduce the dimensions of the feature array. PCA (principal component analysis) is an unsupervised machine learning algorithm whose main goal is to reduce the dimensionality of a dataset while preserving the most important patterns and relations between the data points. This technique is used because a large number of dimensions can lead to overfitting, increased computation time, and reduced accuracy in many machine learning tasks.
This applies, for example, to the k-means algorithm, so we use PCA to make the clustering task easier. In addition, in quantum k-means the number of qubits needed grows with the number of features in the data. We therefore want a simple dataset with only a few features, so that the clustering can be performed with only a few qubits.
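To make the idea behind PCA more concrete, here is a minimal sketch that computes it by hand with NumPy on synthetic data. It is not the implementation used by Scikit-learn; the function `pca_sketch` and the toy array `X_toy` are made up for this illustration, and the approach shown (centering the data and taking a singular value decomposition) is one common way to obtain the principal components.

```python
import numpy as np

def pca_sketch(X, n_components):
    """Project X (n_samples x n_features) onto its first principal components."""
    # PCA works on centered data, so subtract the mean of every feature
    X_centered = X - X.mean(axis=0)
    # The rows of Vt are the principal directions, ordered by decreasing variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Variance along each direction and its share of the total variance
    explained_variance = S**2 / (X.shape[0] - 1)
    explained_variance_ratio = explained_variance / explained_variance.sum()
    # Keep only the coordinates along the first n_components directions
    X_reduced = X_centered @ Vt[:n_components].T
    return X_reduced, explained_variance_ratio[:n_components]

# Toy data (an assumption for this example): 200 samples with 5 features,
# where two features share a common signal and three are pure noise
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_toy = np.hstack([
    base * 10.0,
    base * 5.0 + rng.normal(size=(200, 1)),
    rng.normal(size=(200, 3)),
])

X_reduced, ratio = pca_sketch(X_toy, n_components=2)
print(X_reduced.shape)  # two features per sample remain
print(ratio.cumsum())   # cumulative share of the variance that is kept
```

The cumulative ratio printed at the end plays the same role as `explained_variance_ratio_.cumsum()` in the Scikit-learn cells of this section: it indicates how much of the variance in the original features survives the reduction.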

In [None]:
# Using the PCA from Scikit-learn to reduce the dimensions
data = pd.read_csv(csv_path3)
pca = PCA(n_components=2) # Choose the n_components based on how many features you want in your data
pca.fit(data)
features_reduced = pca.transform(data)

In [None]:
# Checking the results of the feature reduction

print(features_reduced.shape)
print(features_reduced.dtype)
print(pca.explained_variance_ratio_.cumsum())  # Cumulative share of the variance that the kept components retain
print(features_reduced[:4])

In [None]:
# Sometimes the PCA can take a lot of time so the results can be saved into a CSV file
# We also need to use this reduced data in the "QKmeans" notebook so it is good to have it saved
pca_df = pd.DataFrame(features_reduced)
pca_path = '/Path/to/your/EuroSAT_RGB_PCA.csv'
pca_df.to_csv(pca_path, index=False)

Now the data has been prepared and preprocessed into a CSV file that is easy to use in later computations. In the next Jupyter Notebook, we will use this data for the quantum and classical implementations of the K-means clustering algorithm.