<a href="https://colab.research.google.com/github/PavanDaniele/drone-person-detection/blob/main/dataset_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Set up: mount drive + import libraries

In [1]:
# Run this Every time you start a new session
from google.colab import drive
drive.mount('/content/drive') # to mount google drive (to see/access it)

Mounted at /content/drive


In [2]:
# Run this snippet Just one time, to install packages
!pip install imagehash
!pip install pillow

Collecting imagehash
  Downloading ImageHash-4.3.2-py2.py3-none-any.whl.metadata (8.4 kB)
Downloading ImageHash-4.3.2-py2.py3-none-any.whl (296 kB)
Installing collected packages: imagehash
Successfully installed imagehash-4.3.2


In [15]:
from PIL import Image
import imagehash
import os
from itertools import combinations # to generate all possible combinations of a number of elements from a set

from collections import defaultdict, deque
# defaultdict is a special type of dictionary that automatically creates a default value if you access a nonexistent key
# deque is a list-like structure, but optimized for quick additions and removals at both ends.

import pandas as pd
import matplotlib.pyplot as plt
import random
from collections import Counter
import shutil
from sklearn.model_selection import train_test_split # to partition the dataset with stratification

import numpy as np
import cv2 # OpenCV for image manipulation
import xml.etree.ElementTree as ET # For parsing and editing XML files (annotations)

# Dataset Preparation

In this notebook, I'm going to prepare the dataset for fine-tuning multiple deep learning models (e.g. YOLO, EfficientDet, SSD + MobileNetV2).
The steps include similarity check, dataset splitting (train/val/test), optional image resizing, and bounding box adaptation.
The goal is to generate separate, clean and model-ready datasets for each architecture to enable fair training and evaluation.

In [4]:
image_folder_path = "/content/drive/MyDrive/projectUPV/datasets/AERALIS"

There are various metrics to calculate image similarity, such as **SSIM** (Structural Similarity Index), **PSNR** (Peak Signal-to-Noise Ratio), and **Cosine Similarity**. \
In our case, I chose to use **Perceptual Hashing** for an initial check because it is fast, robust to small variations, and does not require the images to be resized to a common resolution beforehand (which is very important, since my AERALIS dataset is composed of images from two different datasets).

This technique reduces the image to a binary signature, and then the *Hamming Distance* is computed to compare the resulting binary hashes.
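
To make the idea concrete, here is a minimal pure-Python sketch of an average-hash style signature and of the Hamming distance between two signatures. It is only an illustration: the helpers `toy_average_hash` and `hamming_distance` and the 4x4 pixel grids are hypothetical, and the notebook itself relies on the `imagehash` library (whose pHash works on DCT coefficients rather than on raw pixel averages).

```python
def toy_average_hash(pixels):
    """Return a list of bits: 1 where a pixel is brighter than the mean, else 0."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(bits_a, bits_b):
    """Count the positions at which two equal-length bit lists differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("Hashes must have the same length")
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Hypothetical 4x4 grayscale images (values 0-255).
image_a = [
    [200, 200, 10, 10],
    [200, 200, 10, 10],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
# Same picture, slightly brighter: the signature does not change.
image_b = [[min(255, value + 5) for value in row] for row in image_a]
# Same picture with the top-left pixel darkened: exactly one bit flips.
image_c = [row[:] for row in image_a]
image_c[0][0] = 10

hash_a = toy_average_hash(image_a)
print(hamming_distance(hash_a, toy_average_hash(image_b)))  # 0
print(hamming_distance(hash_a, toy_average_hash(image_c)))  # 1
```

A small distance means the two signatures, and therefore the two images, are nearly identical.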

In [5]:
def get_image_paths(folder_path): # To extract the image files (.jpg) and ignore the .xml and .csv files
  """
  Args:
    folder_path: path to folder containing images
  Returns:
    list of paths to images
  """
  # use the folder_path argument (not the global variable) so the function works for any folder
  return [os.path.join(folder_path, f) for f in os.listdir(folder_path)
          if f.lower().endswith('.jpg')]

In [6]:
def compute_hash(img_path, method):
  """
  Args:
    img_path: path to image
    method: hash method to use
  Returns:
    hash of image
  """
  img = Image.open(img_path).convert("L")  # Grayscale (the hash algorithms work on brightness values, not on color)

  if method == 'phash':
    return imagehash.phash(img)
  elif method == 'ahash':
    return imagehash.average_hash(img)
  elif method == 'dhash':
    return imagehash.dhash(img)
  else:
    raise ValueError(f"Hash method not supported: {method}")

In [7]:
def compute_all_hashes(image_paths, methods): # Hash calculation for each image
  """
  Args:
    image_paths: list of paths to images
    methods: list of hash methods to use
  Returns:
    dictionary of hashes
  """
  hashes = {method: {} for method in methods} # creates a dictionary with an empty sub-dictionary for each method

  for method in methods:
    print(f"\nCalculation {method} for all images")

    for path in image_paths: # iterates over each image path in the image_paths list
      try:
        h = compute_hash(path, method)
        hashes[method][path] = h # saves the calculated hash in the dictionary structure
      except Exception as e:
        print(f"Error with {path}: {e}")

  return hashes

In [8]:
def compare_hashes(hashes, threshold): # Pairwise comparison of the images
  """
  Args:
    hashes: dictionary of hashes
    threshold: distance threshold to consider images as similar
  Returns:
    list of dictionaries, one per similar pair (method, image1, image2, distance)
  """
  similar_images = []

  for method in hashes:
    print(f"\nRisultats with {method.upper()}:") # .upper() is used to convert the characters to 'uppercase'
    pairs = combinations(hashes[method].items(), 2) # combinations() is used to generate all the possible pairs without repetitions

    for (path1, hash1), (path2, hash2) in pairs:
      dist = hash1 - hash2
      if dist <= threshold:
        similar_images.append({
          'method': method,
          'image1': os.path.basename(path1),
          'image2': os.path.basename(path2),
          'distance': dist
        })
  return similar_images

The Hamming distance between the hashes of two images tells us how visually similar they are.
The result depends on the threshold:

- 1-2 (Very strict) → Only nearly identical images are detected
- 3-5 (Good compromise) → Balances well between false positives and false negatives
- 6-10 (More permissive) → More images are considered similar, but false positives increase
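
The effect of the threshold can be seen on a handful of toy hashes. The sketch below is only illustrative: the file names and the 16-bit integer hashes are hypothetical (real pHash values are 64-bit), and `similar_pairs` is a stand-in for the comparison function defined above.

```python
from itertools import combinations

# Hypothetical 16-bit hashes stored as integers.
toy_hashes = {
    "a.jpg": 0b1111000011110000,
    "b.jpg": 0b1111000011110001,  # 1 bit away from a.jpg
    "c.jpg": 0b1111000011001111,  # 6 bits away from a.jpg, 5 from b.jpg
    "d.jpg": 0b0000111100001111,  # 16 bits away from a.jpg, 10 from c.jpg
}


def similar_pairs(hashes, threshold):
    """Return (name1, name2, distance) for every pair within the threshold."""
    pairs = []
    for (name1, h1), (name2, h2) in combinations(hashes.items(), 2):
        distance = bin(h1 ^ h2).count("1")  # Hamming distance via XOR
        if distance <= threshold:
            pairs.append((name1, name2, distance))
    return pairs


for threshold in (2, 5, 10):
    print(threshold, similar_pairs(toy_hashes, threshold))
```

With these values the strict threshold keeps one pair, the intermediate one keeps two, and the permissive one keeps four, including `c.jpg`/`d.jpg`, which differ in 10 of their 16 bits.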

Let's calculate one hash at a time:

In [12]:
HASH_METHODS = ['phash']
HAMMING_THRESHOLD = 5

image_paths = get_image_paths(image_folder_path)
hashes_phash = compute_all_hashes(image_paths, HASH_METHODS)
similar_images_phash = compare_hashes(hashes_phash, HAMMING_THRESHOLD)

# to see how many distinct images are considered similar:
img_set = set()
for entry in similar_images_phash:
    img_set.add(entry['image1'])
    img_set.add(entry['image2'])

print(f"Method: {HASH_METHODS} \n")
print(f"Number of similar distinct images: {len(img_set)}")
print(f"Number of All images: {len(image_paths)}")


Calculation phash for all images

Risultats with PHASH:
Method: ['phash'] 

Number of similar distinct images: 1176
Number of All images: 3426


In [None]:
HASH_METHODS = ['ahash']
HAMMING_THRESHOLD = 5

hashes_ahash = compute_all_hashes(image_paths, HASH_METHODS)
similar_images_ahash = compare_hashes(hashes_ahash, HAMMING_THRESHOLD)

img_set = set()
for entry in similar_images_ahash:
    img_set.add(entry['image1'])
    img_set.add(entry['image2'])

print(f"Method: {HASH_METHODS} \n")
print(f"Number of similar distinct images: {len(img_set)}")
print(f"Number of All images: {len(image_paths)}")


Calculation ahash for all images

Risultats with AHASH:
Method: ['ahash'] 

Number of similar distinct images: 2096
Number of All images: 3426


In [None]:
HASH_METHODS = ['dhash']
HAMMING_THRESHOLD = 5

hashes_dhash = compute_all_hashes(image_paths, HASH_METHODS)
similar_images_dhash = compare_hashes(hashes_dhash, HAMMING_THRESHOLD)

img_set = set()
for entry in similar_images_dhash:
    img_set.add(entry['image1'])
    img_set.add(entry['image2'])

print(f"Method: {HASH_METHODS} \n")
print(f"Number of similar distinct images: {len(img_set)}")
print(f"Number of All images: {len(image_paths)}")


Calculation dhash for all images

Risultats with DHASH:
Method: ['dhash'] 

Number of similar distinct images: 1368
Number of All images: 3426


We observed that the number of similar images is quite high.
Instead of simply removing them (which would unnecessarily reduce the dataset size), we adopt a more conservative strategy: we will distribute these similar images carefully across the training, validation, and test sets, in order to prevent potential overfitting or data leakage.

We chose to use only perceptual hashing (pHash), as it is more robust to minor variations in images and less prone to false positives compared to other variants like aHash and dHash.

But now I want to formulate a hypothesis:
Are we sure that all the 1176 images identified as "similar" by pHash are truly similar to each other? \
It could be that these images do not all resemble each other directly, but instead form subgroups (clusters) of mutually similar images, while being different from those in other groups. \
Let’s try to test this assumption.

To model this relationship, I built a data structure based on an undirected graph, where:

 - each node represents an image
 - an edge connects two images if they are considered similar

We then extracted the connected components from this graph, which effectively represent the actual clusters of similar images. These groups will be used to perform a controlled split of the dataset.
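
A toy example helps to see why connected components are the right notion of "cluster" here. In the sketch below (hypothetical image names, and a compact `connected_components` helper written only for this illustration), `img1` and `img3` are never flagged as a similar pair, yet they end up in the same group because both are similar to `img2`.

```python
from collections import defaultdict, deque


def connected_components(pairs):
    """Group node names that are linked, directly or indirectly, by the pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)  # undirected edge: store both directions
        graph[b].add(a)

    visited, components = set(), []
    for start in sorted(graph):
        if start in visited:
            continue
        queue, component = deque([start]), []
        visited.add(start)
        while queue:  # breadth-first search from the start node
            node = queue.popleft()
            component.append(node)
            for neighbor in sorted(graph[node]):
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Hypothetical similar pairs: img1~img2, img2~img3, img7~img8.
toy_pairs = [("img1", "img2"), ("img2", "img3"), ("img7", "img8")]
print(connected_components(toy_pairs))
# [['img1', 'img2', 'img3'], ['img7', 'img8']]
```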

In [9]:
# Builds an undirected graph in which each image is a node and each similar pair is an edge:
def build_similarity_graph(similar_images):
  """
  Args:
    similar_images: list of similar images
  Returns:
    graph: dictionary of graph
  """
  graph = defaultdict(set) # creates a dictionary (key: image name, value: set of similar images)

  for pair in similar_images: # iterates over each element (a dictionary) of similar_images
    img1 = pair['image1']
    img2 = pair['image2']
    graph[img1].add(img2) # builds the connection in both directions (undirected edge)
    graph[img2].add(img1)

  return graph

In [48]:
# finds the groups (connected components) in the graph:
def find_connected_components(graph):
  """
  Args:
    graph: dictionary of graph
  Returns:
    groups: list of groups
  """
  visited = set() # keeps track of images already visited
  groups = [] # will contain the final groups

  for node in sorted(graph): # iterates over each node (= image) in the graph (sorting the nodes keeps the result deterministic)
    if node not in visited: # if the image has not yet been visited, start a new group
      group = []

      # Start a BFS (Breadth-First Search) with a queue:
      # add the initial node to the queue and mark it as visited
      queue = deque([node])
      visited.add(node)

      while queue: # as long as there are nodes in the queue
        current = queue.popleft() # removes the node from the head of the queue and adds it to the group
        group.append(current)

        # for each neighbor (similar image), if not already visited
        for neighbor in sorted(graph[current]): # (sorting the neighbors keeps the result deterministic)
          if neighbor not in visited:
            visited.add(neighbor) # marks it as visited
            queue.append(neighbor) # puts it in the queue to be processed

      # Once the queue is exhausted, the group is complete and is added to the list of groups
      groups.append(group)
  return groups

In [49]:
graph = build_similarity_graph(similar_images_phash)
groups = find_connected_components(graph)

print(f"\nFound {len(groups)} groups of similar images.\n")


Found 347 groups of similar images.



Perfect! We were right! \
Now let's do a brief analysis

In [50]:
group_sizes = [len(group) for group in groups]

size_counts = Counter(group_sizes) # count how many groups have size X

# Sort and save to a DataFrame for display
group_distribution = pd.DataFrame(sorted(size_counts.items()), columns=["Group Size", "Number of Groups"])
display(group_distribution)

# Other useful statistics
total_similar_images = sum(group_sizes)
largest_group = max(group_sizes)
average_group_size = total_similar_images / len(groups)

print(f"Total grups: {len(groups)}")
print(f"Total similar images: {total_similar_images}")
print(f"Average group size: {average_group_size:.2f}")
print(f"Largest group: {largest_group} images")

|   | Group Size | Number of Groups |
|---|---|---|
| 0 | 2 | 221 |
| 1 | 3 | 49 |
| 2 | 4 | 24 |
| 3 | 5 | 18 |
| 4 | 6 | 10 |
| 5 | 7 | 5 |
| 6 | 8 | 5 |
| 7 | 9 | 2 |
| 8 | 10 | 2 |
| 9 | 11 | 1 |


Total grups: 347
Total similar images: 1176
Average group size: 3.39
Largest group: 45 images


In [51]:
percentage = total_similar_images / len(image_paths) * 100
print(f"Percentage of similar images in the dataset: {percentage:.2f}%")

Percentage of similar images in the dataset: 34.33%


As we can see, 34.33% of all images in our AERALIS dataset are similar. This is not ideal. \
But don't worry! We can keep all the images and still avoid overfitting or data leakage by using another technique: *Group-Aware Splitting*.

Unlike the more classical Stratified Sampling, which only balances class proportions, this method keeps every group of similar images inside a single split, which makes it more suitable for our case. \
So let’s start using this technique to properly create the Training, Validation, and Test sets.

But before that, I think it could be interesting to see how the pHash results change when we adjust the similarity threshold.

In [52]:
def filter_similar_images(hashes, threshold): # filters similar images by threshold
  """
  Args:
    hashes: dictionary of hashes
    threshold: distance threshold to consider images as similar
  Returns:
    similar_images: list of similar images
  """
  similar_images = []
  pairs = combinations(hashes.items(), 2)

  for (path1, hash1), (path2, hash2) in pairs:
    dist = hash1 - hash2
    if dist <= threshold:
      similar_images.append({'image1': path1, 'image2': path2, 'distance': dist})

  return similar_images

In [53]:
# Let's see Just the pHash case
HASH_METHODS = ['phash']
thresholds_to_try = [3, 5, 7, 10]

hashes_phash_all = compute_all_hashes(image_paths, HASH_METHODS)['phash'] # Extracts only 'phash' from the returned dictionary

results = []
for thresh in thresholds_to_try: # analyzes each threshold
  similar_images = filter_similar_images(hashes_phash_all, thresh) # for each threshold, calculate similar images with that threshold

  # Extracts all the images that appear at least once as similar (without duplicates):
  img_set = set()
  for pair in similar_images:
    img_set.add(pair['image1'])
    img_set.add(pair['image2'])

  results.append({
    "Threshold": thresh,
    "Num Similar Pairs": len(similar_images),
    "Num Similar Distinct Images": len(img_set),
    "Total Images": len(image_paths),
    "Percent Similar (%)": round(len(img_set) / len(image_paths) * 100, 2)
  })


# To see the results, let's convert the list of results to a pandas DataFrame
df_results = pd.DataFrame(results)
display(df_results)


Calculation phash for all images


|   | Threshold | Num Similar Pairs | Num Similar Distinct Images | Total Images | Percent Similar (%) |
|---|---|---|---|---|---|
| 0 | 3 | 1450 | 839 | 3426 | 24.49 |
| 1 | 5 | 2307 | 1176 | 3426 | 34.33 |
| 2 | 7 | 3161 | 1494 | 3426 | 43.61 |
| 3 | 10 | 4960 | 1971 | 3426 | 57.53 |


We observe that as the threshold increases, the number of pairs considered similar also grows, and consequently, so does the percentage of images involved.

Observations:
- At lower thresholds (e.g., 3), only strongly similar images are identified, but many less obvious duplicates may be missed.

- At higher thresholds (e.g., 10), there's a risk of including different images that only share generic visual elements (false positives).

- Threshold 5 proves to be a good compromise, balancing precision and coverage.

Let us now create a function that uses the group_aware splitting technique:

In [54]:
# Divides the groups of similar images into training, validation, and test sets, keeping each group together
#  (no similar images end up in different sets).
def group_aware_split(groups, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15, seed=42):
  """
  Args:
    groups: list of groups
    train_ratio: ratio of images to be assigned to the training set
    val_ratio: ratio of images to be assigned to the validation set
    test_ratio: ratio of images to be assigned to the test set
    seed: seed for the random number generator

  Returns:
    assignments: dictionary with the split of images
  """
  # checks whether the ratios add up to 1 (with a small tolerance for floating-point rounding)
  total_ratio = train_ratio + val_ratio + test_ratio
  if not 0.99 <= total_ratio <= 1.01:
    print("ERROR: The proportions do not add up to 1! You must correct the values.")
    return None

  split_ratios = {'train': train_ratio, 'val': val_ratio, 'test': test_ratio}

  random.seed(seed) # fixes the seed to make the randomization reproducible
  groups = list(groups) # works on a copy, so the caller's list is not reordered
  random.shuffle(groups) # shuffles the order in which the groups are assigned

  total_images = sum(len(g) for g in groups) # total number of images across all groups
  # dictionary comprehension: how many images should ideally be assigned to each split
  target_counts = {k: int(v * total_images) for k, v in split_ratios.items()}
  current_counts = defaultdict(int) # number of images already assigned to each split
  assignments = defaultdict(list) # names of the images assigned to each split (final output)

  for group in groups:
    # Find the split with the lowest saturation ratio
    best_split = min(
      target_counts.items(),
      key=lambda item: current_counts[item[0]] / item[1] if item[1] > 0 else float('inf')
    )[0] # this line is used to take the key (‘train’, ‘val’, ‘test’) of the best split

    assignments[best_split].extend(group) # adds all images in the group to the selected split
    current_counts[best_split] += len(group) # update the counter to know how many images are now in that split

  return assignments

This function returns an *assignments* dictionary whose keys are the splits (train, val, test) and whose values are the lists of image names assigned to each split, with similar images kept together.
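
Since the whole point of the technique is that no group is spread over two splits, it can be worth checking this property explicitly. The following is a minimal sketch of such a check; `find_leaking_groups` and the toy data are hypothetical and not part of the original pipeline.

```python
def find_leaking_groups(groups, assignments):
    """Return the groups whose images are spread over more than one split."""
    # Map every image name to the split it was assigned to.
    split_of = {name: split for split, names in assignments.items() for name in names}
    leaking = []
    for group in groups:
        splits_used = {split_of.get(name) for name in group}
        if len(splits_used) > 1:  # the group touches two or more splits
            leaking.append(group)
    return leaking


# Hypothetical groups of similar images and two candidate assignments.
toy_groups = [["a.jpg", "b.jpg"], ["c.jpg", "d.jpg", "e.jpg"]]
good = {"train": ["c.jpg", "d.jpg", "e.jpg"], "val": ["a.jpg", "b.jpg"], "test": []}
bad = {"train": ["a.jpg", "c.jpg", "d.jpg", "e.jpg"], "val": ["b.jpg"], "test": []}

print(find_leaking_groups(toy_groups, good))  # []
print(find_leaking_groups(toy_groups, bad))   # [['a.jpg', 'b.jpg']]
```

An empty result means that every group of similar images lives entirely inside one split.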

In [74]:
# invokes the function to divide the groups of similar images into the different sets
assignments = group_aware_split(groups, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15, seed=42)

# check the counts
print({k: len(v) for k, v in assignments.items()})


{'train': 820, 'val': 178, 'test': 178}


With a target partition of:
- Training set = 70%,
- Validation set = 15%,
- Test set = 15%

the similar images are distributed as follows:
- 820 images out of 1176 for the Training set
- 178 images out of 1176 for the Validation set
- 178 images out of 1176 for the Test set
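
These counts are close to the requested ratios but not identical to them, because whole groups are assigned at once. A quick arithmetic check, using only the numbers reported above (the variable names are chosen for this sketch):

```python
total_similar = 1176  # images that belong to a similarity group
ratios = {"train": 0.7, "val": 0.15, "test": 0.15}
obtained = {"train": 820, "val": 178, "test": 178}  # counts reported above

# Same rounding rule as in group_aware_split: int() truncates the ideal count.
targets = {split: int(ratio * total_similar) for split, ratio in ratios.items()}

for split in ratios:
    share = obtained[split] / total_similar * 100
    print(f"{split}: target {targets[split]}, obtained {obtained[split]} ({share:.2f}%)")
```

The obtained shares are roughly 69.7%, 15.1% and 15.1%, so the deviation from the ideal 70/15/15 partition is only a few images per split.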

\
Now we need to use a *Stratified Split* that ensures that the distribution of classes in the dataset is proportionally balanced across the divisions of the three sets. \
We prefer to use a **Stratified Sampling** technique because we already know that our dataset is somewhat unbalanced, as it contains more images with people than images without.
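
As an illustration of what stratification guarantees, here is a pure-Python sketch that splits a deliberately unbalanced toy dataset while preserving the class proportions. Everything in it is hypothetical (the `toy_stratified_split` helper, the file names, and the `person` / `no_person` labels); the notebook itself uses scikit-learn's `train_test_split` with the `stratify` argument.

```python
import random
from collections import Counter, defaultdict


def toy_stratified_split(labels, train_ratio=0.7, seed=42):
    """Split item names into two lists while preserving class proportions."""
    by_class = defaultdict(list)
    for name, label in labels.items():
        by_class[label].append(name)

    rng = random.Random(seed)  # local generator, so the split is reproducible
    first, second = [], []
    for label in sorted(by_class):
        names = sorted(by_class[label])
        rng.shuffle(names)
        cut = round(len(names) * train_ratio)  # same ratio applied to every class
        first.extend(names[:cut])
        second.extend(names[cut:])
    return first, second


# Hypothetical labels: 80 images with a person, 20 without.
labels = {f"img_{i:03d}.jpg": ("person" if i < 80 else "no_person") for i in range(100)}
train, rest = toy_stratified_split(labels)

print(Counter(labels[name] for name in train))
print(Counter(labels[name] for name in rest))
```

Both parts keep the original 80/20 class ratio, which a purely random split does not guarantee on small or unbalanced datasets.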


But, of course, we want to maintain the partitioning we just did for similar images:

In [75]:
# Splits the AERALIS dataset in a stratified manner and copies images/.xml to train/val/test.
#   - Keeps the similar-image assignments (from group_aware_split).
#   - Stratifies the split of the remaining images based on the CSV 'class' column.
#   - Saves images and annotations, and generates a CSV for each set with all original columns.
def stratified_split(assignments, image_folder_path, output_base_path, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15, seed=42):
  """
  Args:
    assignments: dictionary with the split of images
    image_folder_path: path to the folder containing images (and CSV file)
    output_base_path: path to the output folder
    train_ratio, val_ratio, test_ratio: desired proportions of images to be assigned to the training, validation and test set
    seed: seed for the random number generator
  """

  # 1. Create train/val/test folders with subfolders images/ and annotations/
  for split in ['train', 'val', 'test']:
    os.makedirs(os.path.join(output_base_path, split, "images"), exist_ok=True)
    os.makedirs(os.path.join(output_base_path, split, "annotations"), exist_ok=True)

  # 2. Load the full CSV
  csv_path = os.path.join(image_folder_path, "aeralis_person_labels.csv")
  df_full = pd.read_csv(csv_path)
  df_full = df_full[df_full['filename'].str.lower().str.endswith(('.jpg', '.jpeg', '.png'))] # we really only need '.jpg'
  # one row per image for the split (df_full keeps all rows for the per-split CSVs)
  df = df_full.drop_duplicates(subset='filename', keep='first').copy()

  # 3. Removes images already assigned (similar)
  already_assigned = set(sum(assignments.values(), []))
  df_unassigned = df[~df['filename'].isin(already_assigned)].copy()

  # 4. Split is stratified according to y labels, class balance is maintained
  X = df_unassigned['filename'].values
  y = df_unassigned['class'].values

  X_train, X_temp, y_train, y_temp = train_test_split(
    X, y,
    train_size = train_ratio,
    stratify = y,
    random_state = seed
  )
  # X_train, y_train: part that will go into the training set
  # X_temp, y_temp: remaining images to be split still in validation and testing

  # After removing the train part, we need to divide X_temp into val and test:
  val_ratio_adjusted = val_ratio / (val_ratio + test_ratio) # we calculate the new proportion of validation to the remaining total

  # We divide X_temp and y_temp into validation and test:
  X_val, X_test = train_test_split(
    X_temp, train_size = val_ratio_adjusted,
    stratify = y_temp,
    random_state = seed
  )

  # 5. Adds assignments to the dictionary, avoiding duplicates
  # assignments is the dictionary initially created with the similar groups assigned via Group-Aware Splitting
  # X_train, X_val, X_test are the non-similar, stratified image assignments
  for split, split_X in zip(['train', 'val', 'test'], [X_train, X_val, X_test]):
    new_imgs = [x for x in split_X if x not in already_assigned]
    assignments[split].extend(new_imgs)
    already_assigned.update(new_imgs)



  # 6. Copy file and generate final CSV
  for split, file_list in assignments.items():
    # split_df = df[df['filename'].isin(file_list)].copy()
    split_df = df_full[df_full['filename'].isin(file_list)].copy()

    for fname in file_list:
      img_path = os.path.join(image_folder_path, fname)
      xml_name = os.path.splitext(fname)[0] + ".xml"
      xml_path = os.path.join(image_folder_path, xml_name)

      dst_img = os.path.join(output_base_path, split, "images", fname)
      dst_xml = os.path.join(output_base_path, split, "annotations", xml_name)

      if os.path.exists(img_path):
        shutil.copy2(img_path, dst_img)
      if os.path.exists(xml_path):
        shutil.copy2(xml_path, dst_xml)

    # save detailed CSV for split
    split_df.to_csv(os.path.join(output_base_path, f"{split}_set.csv"), index=False)

  # 7. Report
  print("Stratified split completed and files copied.")
  print(f"Train: {len(assignments['train'])} images")
  print(f"Val:   {len(assignments['val'])} images")
  print(f"Test:  {len(assignments['test'])} images")

With this function we are going to create a new folder containing 3 subfolders for the Training, Validation and Test phases, each of which will contain a folder with images and one with their respective annotations. At the same time a CSV file describing the set will be created.

In [76]:
image_folder_path = "/content/drive/MyDrive/projectUPV/datasets/AERALIS"

output_base_path = "/content/drive/MyDrive/projectUPV/datasets/AERALIS_SPLITTED"

# If assignments is not defined yet, create it first with: assignments = group_aware_split(groups)

stratified_split(assignments, image_folder_path, output_base_path, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15,seed=42)

Stratified split completed and files copied.
Train: 2395 images
Val:   515 images
Test:  516 images


In [78]:
# Let's see if the numbers match
total_all = len(assignments['train']) + len(assignments['val']) + len(assignments['test'])
unique_total = len(set(assignments['train'] + assignments['val'] + assignments['test']))

print(f"Total images in split (sum): {total_all}")
print(f"Total unique images: {unique_total}")

Total images in split (sum): 3426
Total unique images: 3426


We now perform a quick check to verify that the split did not introduce inconsistencies in the data:

In [79]:
splits = ['train', 'val', 'test']
base_path = output_base_path  # already defined

for split in splits:
  img_dir = os.path.join(base_path, split, 'images')
  ann_dir = os.path.join(base_path, split, 'annotations')

  images = [f for f in os.listdir(img_dir) if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
  missing_xml = []

  for img in images:
    xml_name = os.path.splitext(img)[0] + ".xml"
    if not os.path.exists(os.path.join(ann_dir, xml_name)):
      missing_xml.append(xml_name)

  print(f"{split.upper()} Images: {len(images)}, Missing XML: {len(missing_xml)}")
  if missing_xml:
    print("\nMissing XML files:")
    for x in missing_xml:
      print("   ", x)

TRAIN Images: 2395, Missing XML: 0
VAL Images: 515, Missing XML: 0
TEST Images: 516, Missing XML: 0


In [80]:
for split in splits:
  csv_path = os.path.join(base_path, f"{split}_set.csv")
  img_dir = os.path.join(base_path, split, "images")

  df_split = pd.read_csv(csv_path)
  csv_filenames = set(df_split['filename'].str.lower())
  actual_images = set(f.lower() for f in os.listdir(img_dir) if f.lower().endswith(('.jpg', '.jpeg', '.png')))

  missing_in_csv = actual_images - csv_filenames # files present in folder but NOT in CSV
  missing_in_dir = csv_filenames - actual_images # files present in CSV but NOT in folder

  print(f"{split.upper()} — CSV: {len(csv_filenames)}, IMG DIR: {len(actual_images)}")
  if missing_in_csv:
    print(f"   {len(missing_in_csv)} images in folder not listed in CSV.")
  if missing_in_dir:
    print(f"   {len(missing_in_dir)} images in CSV not found in folder.")

TRAIN — CSV: 2395, IMG DIR: 2395
VAL — CSV: 515, IMG DIR: 515
TEST — CSV: 516, IMG DIR: 516


Perfect! There are no inconsistencies resulting from the split. \
We now create one copy of the *AERALIS_SPLITTED* folder for each model, because we want to ensure that the future comparison of the models' performance is not affected by different splits. Every model will therefore use exactly the same Train, Val, and Test sets we have just created.

In [81]:
# remember: base_path = "/content/drive/MyDrive/projectUPV/datasets/AERALIS_SPLITTED"

model_versions = ["YOLOv8n", "YOLOv11n", "EfficientDet_D0", "EfficientDet_D1", "EfficientDet_D2", "MobileNetV2_SSD"]

for v in model_versions:
    dst = f"/content/drive/MyDrive/projectUPV/datasets/AERALIS_{v}" # constructs the destination path
    shutil.copytree(base_path, dst)


# Resizing


In this section we are going to resize all the images to the correct size for the model version. \
It will also be necessary to edit the Bounding Boxes so that there are no inconsistencies, as well as the values in the csv files.

For the detection models I selected, the images will be resized as follows:

- YOLOv8n and YOLOv11n require image dimensions of 640×640
- EfficientDet D0 requires 512×512
- EfficientDet D1 requires 640×640
- EfficientDet D2 requires 768×768
- MobileNetV2 + SSD requires 300×300

This is done to ensure optimal performance during fine-tuning and evaluation.
Although some models are more flexible and certain implementations allow dynamic resizing, I preferred to manually resize the data to maintain greater control over the process.
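
Resizing changes the pixel coordinates, so every bounding box has to follow the same transformation as its image. The sketch below works through the arithmetic for one case, assuming an aspect-ratio-preserving (letterbox) resize with centered padding, as implemented later in this section. The helper names, the 1280x720 image size and the box coordinates are hypothetical.

```python
def letterbox_params(orig_w, orig_h, target_w, target_h):
    """Return the scale factor and the left/top padding of a letterbox resize."""
    scale = min(target_w / orig_w, target_h / orig_h)  # keep the aspect ratio
    new_w, new_h = int(orig_w * scale), int(orig_h * scale)
    pad_x = (target_w - new_w) // 2  # left padding
    pad_y = (target_h - new_h) // 2  # top padding
    return scale, pad_x, pad_y


def transform_box(box, scale, pad_x, pad_y):
    """Apply the same scaling and padding to an (xmin, ymin, xmax, ymax) box."""
    xmin, ymin, xmax, ymax = box
    return (
        int(xmin * scale + pad_x),
        int(ymin * scale + pad_y),
        int(xmax * scale + pad_x),
        int(ymax * scale + pad_y),
    )


# Hypothetical 1280x720 image resized to 640x640.
scale, pad_x, pad_y = letterbox_params(1280, 720, 640, 640)
print(scale, pad_x, pad_y)  # 0.5 0 140

# A box at (100, 200)-(300, 400) in the original image.
print(transform_box((100, 200, 300, 400), scale, pad_x, pad_y))  # (50, 240, 150, 340)
```

The image shrinks to 640x360 and receives 140 pixels of padding above and below, so the box is halved in size and shifted down by 140 pixels.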


In [None]:
# Function to perform letterbox resize: resize with retained aspect ratio and padding
def letterbox_resize(image, target_size=(640, 640), color=(114, 114, 114)):
  orig_h, orig_w = image.shape[:2] # original height and width
  target_w, target_h = target_size # desired width and height

  scale = min(target_w / orig_w, target_h / orig_h) # maintaining aspect ratio
  new_w, new_h = int(orig_w * scale), int(orig_h * scale)

  resized = cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_LINEAR) # resize image
  pad_x = (target_w - new_w) // 2 # horizontal padding
  pad_y = (target_h - new_h) // 2 # vertical padding

  # Adding padding to get target size image
  padded = cv2.copyMakeBorder(
    resized,
    pad_y, target_h - new_h - pad_y,
    pad_x, target_w - new_w - pad_x,
    borderType = cv2.BORDER_CONSTANT, value=color # grey padding
  )

  return padded, scale, (pad_x, pad_y)

In [None]:
# Function to process a single split of the dataset
def process_one_folder(img_dir, xml_dir, out_img_dir, out_xml_dir, target_size=(640, 640)):
  os.makedirs(out_img_dir, exist_ok=True) # folder output images
  os.makedirs(out_xml_dir, exist_ok=True) # folder output xml files

  for fname in os.listdir(img_dir):
    is_image = fname.lower().endswith(('.jpg', '.jpeg', '.png'))
    img_path = os.path.join(img_dir, fname)
    xml_filename = os.path.splitext(fname)[0] + ".xml"
    xml_path = os.path.join(xml_dir, xml_filename)
    xml_exists = os.path.exists(xml_path)
    image = cv2.imread(img_path) if is_image else None # read the file only if it is an image

    if is_image and xml_exists and image is not None:
      resized_img, scale, (pad_x, pad_y) = letterbox_resize(image, target_size)
      cv2.imwrite(os.path.join(out_img_dir, fname), resized_img) # save resized image

      # Update XML
      tree = ET.parse(xml_path)
      root = tree.getroot()

      for obj in root.findall('object'): # for each object (bbox) in the XML file
        bbox = obj.find('bndbox')
        xmin = int(float(bbox.find('xmin').text))
        ymin = int(float(bbox.find('ymin').text))
        xmax = int(float(bbox.find('xmax').text))
        ymax = int(float(bbox.find('ymax').text))

        # apply scaling + padding to the bbox
        xmin = int(xmin * scale + pad_x)
        xmax = int(xmax * scale + pad_x)
        ymin = int(ymin * scale + pad_y)
        ymax = int(ymax * scale + pad_y)

        # clamp of values (avoids out-of-picture bbox)
        bbox.find('xmin').text = str(max(0, min(xmin, target_size[0])))
        bbox.find('ymin').text = str(max(0, min(ymin, target_size[1])))
        bbox.find('xmax').text = str(max(0, min(xmax, target_size[0])))
        bbox.find('ymax').text = str(max(0, min(ymax, target_size[1])))

      # update size in <size> tag
      size_tag = root.find('size')
      size_tag.find('width').text = str(target_size[0])
      size_tag.find('height').text = str(target_size[1])

      # Save the new XML file
      tree.write(os.path.join(out_xml_dir, xml_filename))

    # Error handling: skip invalid files and report the reason.
    else:
      if not is_image: # It is not an image file
        print(f"Ignored file (not image): {img_path}")
      elif not xml_exists:
        print(f"Missing file XML for: {fname}")
      elif image is None:
        print(f"Reading error: {img_path}")

In [None]:
# Function to process the entire dataset (Training, Validation, Testing)
def process_entire_dataset(base_input_dir, base_output_dir, splits=('train', 'val', 'test'), target_size=(640, 640)):
  for split in splits:
    print(f"\n Processing split: {split}")
    # Input/output paths for images and annotations.
    img_dir = os.path.join(base_input_dir, split, 'images')
    xml_dir = os.path.join(base_input_dir, split, 'annotations')
    out_img_dir = os.path.join(base_output_dir, split, 'images')
    out_xml_dir = os.path.join(base_output_dir, split, 'annotations')

    # Process a single split
    process_one_folder(img_dir, xml_dir, out_img_dir, out_xml_dir, target_size)

  print("\nAll images and annotations have been processed.")

In [None]:
# USAGE EXAMPLE

process_entire_dataset(
    base_input_dir="/content/drive/MyDrive/dataset_originale",
    base_output_dir="/content/drive/MyDrive/dataset_resized",
    target_size=(640, 640)
)
