<a href="https://colab.research.google.com/github/PavanDaniele/drone-person-detection/blob/main/dataset_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Set up: mount drive + import libraries

In [1]:
# Run this Every time you start a new session
from google.colab import drive
drive.mount('/content/drive') # to mount google drive (to see/access it)

Mounted at /content/drive


In [2]:
# Run this snippet Just one time, to install packages
!pip install imagehash
!pip install pillow

Collecting imagehash
  Downloading ImageHash-4.3.2-py2.py3-none-any.whl.metadata (8.4 kB)
Downloading ImageHash-4.3.2-py2.py3-none-any.whl (296 kB)
Installing collected packages: imagehash
Successfully installed imagehash-4.3.2


In [3]:
from PIL import Image
import imagehash
import os
from itertools import combinations # to generate all possible combinations of a number of elements from a set

from collections import defaultdict, deque
# defaultdict is a special type of dictionary that automatically creates a default value if you access a nonexistent key
# deque is a list-like structure, but optimized for quick additions and removals at both ends.

import pandas as pd
import matplotlib.pyplot as plt
import random
from collections import Counter
import shutil
from sklearn.model_selection import train_test_split # to partition the dataset with stratification

import numpy as np
import cv2 # OpenCV for image manipulation
import xml.etree.ElementTree as ET # For parsing and editing XML files (annotations)

from pathlib import Path # to manage file paths more robustly
import json # to create the final .json file in COCO format
from typing import Dict, List # types to improve readability and autocomplete
from tqdm import tqdm
import re # regular expressions, used to extract numbers from the image name


# Dataset Preparation

In this notebook, I'm going to prepare the dataset for fine-tuning multiple deep learning models (e.g. YOLO, EfficientDet, SSD + MobileNetV2).
The steps include similarity check, dataset splitting (train/val/test), optional image resizing, and bounding box adaptation.
The goal is to generate separate, clean and model-ready datasets for each architecture to enable fair training and evaluation.

In [9]:
image_folder_path = "/content/drive/MyDrive/projectUPV/datasets/AERALIS"

There are various metrics to calculate image similarity, such as **SSIM** (Structural Similarity Index), **PSNR** (Peak Signal-to-Noise Ratio), and **Cosine Similarity**. \
In our case, I chose to use **Perceptual Hashing** for an initial check because it is fast, robust, and does not require me to bring the images to a common resolution beforehand (which is very important since my AERALIS dataset is composed of images from two different datasets).

This technique reduces the image to a binary signature, and then the *Hamming Distance* is computed to compare the resulting binary hashes.
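
As a rough illustration of the idea, here is a toy average hash written from scratch on a 2×2 grid of grayscale values. It is only a sketch and not the `imagehash` implementation: the pixel values and the helper names `average_hash_bits` and `hamming_distance` are made up for the example. Each pixel becomes one bit, and two hashes are compared bit by bit.

```python
def average_hash_bits(pixels):
    """Toy average hash: 1 where a pixel is brighter than the mean, else 0."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def hamming_distance(bits_a, bits_b):
    """Number of positions at which two equal-length bit lists differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Two hypothetical 2x2 grayscale "images"; only the bottom-left pixel changes a lot
img_a = [[10, 200], [220, 30]]
img_b = [[12, 198], [40, 35]]

hash_a = average_hash_bits(img_a)
hash_b = average_hash_bits(img_b)
distance = hamming_distance(hash_a, hash_b)
print(hash_a, hash_b, distance)
```

Small brightness changes (10 vs 12, 200 vs 198) leave the bits untouched, while the one pixel that really changed flips a single bit.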

In [10]:
def get_image_paths(folder_path): # To extract the image files (.jpg) and ignore the .xml and .csv files
  """
  Args:
    folder_path: path to folder containing images

  Returns:
    list of paths to images
  """
  # use the folder_path argument (not the global image_folder_path)
  return [os.path.join(folder_path, f) for f in os.listdir(folder_path)
               if f.lower().endswith('.jpg')]

In [11]:
def compute_hash(img_path, method):
  """
  Args:
    img_path: path to image
    method: hash method to use

  Returns:
    hash of image
  """
  img = Image.open(img_path).convert("L")  # Grayscale (because the hash algorithms work best when the image is in grayscale)

  if method == 'phash':
    return imagehash.phash(img)
  elif method == 'ahash':
    return imagehash.average_hash(img)
  elif method == 'dhash':
    return imagehash.dhash(img)
  else:
    raise ValueError(f"Hash method not supported: {method}")

In [12]:
def compute_all_hashes(image_paths, methods): # Hash calculation for each image
  """
  Args:
    image_paths: list of paths to images
    methods: list of hash methods to use

  Returns:
    dictionary of hashes
  """
  hashes = {method: {} for method in methods} # to create a dictionary and for each method creates an empty sub-dictionary

  for method in methods:
    print(f"\nCalculation {method} for all images")

    for path in image_paths: # cycles over each image path in the image_paths list
      try:
        h = compute_hash(path, method)
        hashes[method][path] = h # saves the calculated hash in the dictionary structure
      except Exception as e:
        print(f"Error with {path}: {e}")

  return hashes

In [13]:
def compare_hashes(hashes, threshold): # Comparison of images in pairs
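  """
  Args:
    hashes: dictionary of hashes
    threshold: distance threshold to consider images as similar

  Returns:
    similar_images: list of dictionaries (method, image1, image2, distance), one per similar pair
  """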
  similar_images = []

  for method in hashes:
    print(f"\nRisultats with {method.upper()}:") # .upper() is used to convert the characters to 'uppercase'
    pairs = combinations(hashes[method].items(), 2) # combinations() is used to generate all the possible pairs without repetitions

    for (path1, hash1), (path2, hash2) in pairs:
      dist = hash1 - hash2
      if dist <= threshold:
        similar_images.append({
          'method': method,
          'image1': os.path.basename(path1),
          'image2': os.path.basename(path2),
          'distance': dist
        })
  return similar_images

The Hamming distance between the hashes of two images tells us how visually similar they are.
The result depends on the threshold:

- 1-2 (Very strict) → Only nearly identical images are detected
- 3-5 (Good compromise) → Balances well between false positives and false negatives
- 6-10 (More permissive) → More images are considered similar, but false positives increase
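
To make the effect of the threshold concrete, here is a minimal sketch with hand-picked 16-bit integers standing in for hashes. The values and names are hypothetical, and the hashes computed in this notebook are longer, so the numbers are only indicative. The Hamming distance is the number of set bits in the XOR of the two values.

```python
def hamming(h1: int, h2: int) -> int:
    """Hamming distance between two integer hashes: count the differing bits."""
    return bin(h1 ^ h2).count("1")


reference = 0b1011_0110_1100_0011
candidates = {
    "near_duplicate":  0b1011_0110_1100_0111,  # 1 bit differs from the reference
    "same_scene":      0b1011_0010_1000_0110,  # 4 bits differ
    "loosely_similar": 0b1011_1001_1100_1100,  # 8 bits differ
}

# For each threshold, list the candidates that would be flagged as similar
matches = {
    threshold: [name for name, h in candidates.items() if hamming(reference, h) <= threshold]
    for threshold in (2, 5, 10)
}
for threshold, names in matches.items():
    print(threshold, names)
```

Raising the threshold never removes a match; it only adds more distant candidates, which is where false positives come from.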

Let's compute one hash at a time:

In [None]:
HASH_METHODS = ['phash']
HAMMING_THRESHOLD = 5

image_paths = get_image_paths(image_folder_path)
hashes_phash = compute_all_hashes(image_paths, HASH_METHODS)
similar_images_phash = compare_hashes(hashes_phash, HAMMING_THRESHOLD)

# to see how many distinct images are considered similar:
img_set = set()
for entry in similar_images_phash:
    img_set.add(entry['image1'])
    img_set.add(entry['image2'])

print(f"Method: {HASH_METHODS} \n")
print(f"Number of similar distinct images: {len(img_set)}")
print(f"Number of All images: {len(image_paths)}")


Calculation phash for all images

Risultats with PHASH:
Method: ['phash'] 

Number of similar distinct images: 1176
Number of All images: 3426


In [None]:
HASH_METHODS = ['ahash']
HAMMING_THRESHOLD = 5

hashes_ahash = compute_all_hashes(image_paths, HASH_METHODS)
similar_images_ahash = compare_hashes(hashes_ahash, HAMMING_THRESHOLD)

img_set = set()
for entry in similar_images_ahash:
    img_set.add(entry['image1'])
    img_set.add(entry['image2'])

print(f"Method: {HASH_METHODS} \n")
print(f"Number of similar distinct images: {len(img_set)}")
print(f"Number of All images: {len(image_paths)}")


Calculation ahash for all images

Risultats with AHASH:
Method: ['ahash'] 

Number of similar distinct images: 2096
Number of All images: 3426


In [None]:
HASH_METHODS = ['dhash']
HAMMING_THRESHOLD = 5

hashes_dhash = compute_all_hashes(image_paths, HASH_METHODS)
similar_images_dhash = compare_hashes(hashes_dhash, HAMMING_THRESHOLD)

img_set = set()
for entry in similar_images_dhash:
    img_set.add(entry['image1'])
    img_set.add(entry['image2'])

print(f"Method: {HASH_METHODS} \n")
print(f"Number of similar distinct images: {len(img_set)}")
print(f"Number of All images: {len(image_paths)}")


Calculation dhash for all images

Risultats with DHASH:
Method: ['dhash'] 

Number of similar distinct images: 1368
Number of All images: 3426


We observed that the number of similar images is quite high.
Instead of simply removing them (which would unnecessarily reduce the dataset size), we adopt a more conservative strategy: we will distribute these similar images carefully across the training, validation, and test sets, in order to prevent potential overfitting or data leakage.

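We chose to continue with pHash only, as it is generally more robust to minor variations in images and less prone to false positives than variants like aHash and dHash. In the runs above it also flags the fewest images at the same threshold (1176, versus 2096 for aHash and 1368 for dHash).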

But now I want to formulate a hypothesis:
Are we sure that all the 1176 images identified as "similar" by pHash are truly similar to each other? \
It could be that these images do not all resemble each other directly, but instead form subgroups (clusters) of mutually similar images, while being different from those in other groups. \
Let’s try to test this assumption.

To model this relationship, I built a data structure based on an undirected graph, where:

 - each node represents an image
 - an edge connects two images if they are considered similar

We then extracted the connected components from this graph, which effectively represent the actual clusters of similar images. These groups will be used to perform a controlled split of the dataset.

*Remember: a connected component is a maximal set of nodes in which every node can be reached from every other node by following edges.*
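
As a side illustration, the same grouping can be obtained with a union-find structure. This is only a sketch of an alternative technique on hypothetical file names, not the approach used in the cells that follow, which rely on BFS. Note how `a.jpg` and `c.jpg` end up in the same group even though only `a`-`b` and `b`-`c` are listed as similar pairs.

```python
def connected_components(pairs):
    """Group nodes into connected components using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical similar pairs: a-b and b-c are similar, a-c is not listed
pairs = [("a.jpg", "b.jpg"), ("b.jpg", "c.jpg"), ("x.jpg", "y.jpg")]
components = connected_components(pairs)
print(components)
```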

In [14]:
# Constructs an undirected graph in which each image is a node and each similar pair is an edge:
def build_similarity_graph(similar_images):
  """
  Args:
    similar_images: list of similar images

  Returns:
    graph: dictionary of graph
  """
  graph = defaultdict(set) # creates a dictionary (key: image name, value: set of similar images)

  for pair in similar_images: # iterates over each pair (a dictionary) in similar_images
    img1 = pair['image1']
    img2 = pair['image2']
    graph[img1].add(img2) # builds the connection in both directions (undirected edge)
    graph[img2].add(img1)

  return graph

In [15]:
# finds the groups (connected components) in the graph:
def find_connected_components(graph):
  """
  Args:
    graph: dictionary of graph

  Returns:
    groups: list of groups
  """
  visited = set() # keeps track of images already visited
  groups = [] # will contain the final groups

  for node in sorted(graph): # iterates over each node (=image) in the graph (sorting the nodes keeps the result deterministic)
    if node not in visited: # If the image has not yet been visited, then start a new group
      group = []

      # Start a BFS (Breadth-First Search) with a queue
      # adds the initial node to the queue and marks it as visited
      queue = deque([node])
      visited.add(node)

      while queue: # as long as there are nodes in the queue
        current = queue.popleft() # removes the node from the head of the queue and adds it to the group
        group.append(current)

        # for each neighbor (similar image), if not already visited
        for neighbor in sorted(graph[current]): # (sorting the neighbors keeps the result deterministic)
          if neighbor not in visited:
            visited.add(neighbor) # marks it as visited
            queue.append(neighbor) # puts it in the queue to be processed

      # Once the queue is exhausted, the group is complete and it is added to the groups
      groups.append(group)
  return groups

In [None]:
graph = build_similarity_graph(similar_images_phash)
groups = find_connected_components(graph)

print(f"\nFound {len(groups)} groups of similar images.\n")


Found 347 groups of similar images.



Perfect! We were right! \
Now let's do a brief analysis

In [None]:
group_sizes = [len(group) for group in groups]

size_counts = Counter(group_sizes) # count how many groups have size X

# Sort and save to a DataFrame for display
group_distribution = pd.DataFrame(sorted(size_counts.items()), columns=["Group Size", "Number of Groups"])
display(group_distribution)

# Other useful statistics
total_similar_images = sum(group_sizes)
largest_group = max(group_sizes)
average_group_size = total_similar_images / len(groups)

print(f"Total grups: {len(groups)}")
print(f"Total similar images: {total_similar_images}")
print(f"Average group size: {average_group_size:.2f}")
print(f"Largest group: {largest_group} images")

|   | Group Size | Number of Groups |
|---|---|---|
| 0 | 2 | 221 |
| 1 | 3 | 49 |
| 2 | 4 | 24 |
| 3 | 5 | 18 |
| 4 | 6 | 10 |
| 5 | 7 | 5 |
| 6 | 8 | 5 |
| 7 | 9 | 2 |
| 8 | 10 | 2 |
| 9 | 11 | 1 |


Total grups: 347
Total similar images: 1176
Average group size: 3.39
Largest group: 45 images


In [None]:
percentage = total_similar_images / len(image_paths) * 100
print(f"Percentage of similar images in the dataset: {percentage:.2f}%")

Percentage of similar images in the dataset: 34.33%


As we can see, 34.33% of all images in our AERALIS dataset are similar. This is not ideal. \
But don't worry! We can keep all the images and still avoid overfitting or data leakage by using another technique: *Group-Aware Splitting*.

This method is related to the more classical Stratified Sampling, but it is more suitable for our case, because it keeps every group of similar images inside a single split. \
So let’s use this technique to properly create the Training, Validation, and Test sets.

But before that, I think it could be interesting to see how the pHash results change when we adjust the similarity threshold.

In [16]:
def filter_similar_images(hashes, threshold): # filters similar images by threshold
  """
  Args:
    hashes: dictionary of hashes
    threshold: distance threshold to consider images as similar

  Returns:
    similar_images: list of similar images
  """
  similar_images = []
  pairs = combinations(hashes.items(), 2)

  for (path1, hash1), (path2, hash2) in pairs:
    dist = hash1 - hash2
    if dist <= threshold:
      similar_images.append({'image1': path1, 'image2': path2, 'distance': dist})

  return similar_images

In [None]:
# Let's see Just the pHash case
HASH_METHODS = ['phash']
thresholds_to_try = [3, 5, 7, 10]

hashes_phash_all = compute_all_hashes(image_paths, HASH_METHODS)['phash'] # Extracts only 'phash' from the returned dictionary

results = []
for thresh in thresholds_to_try: # analyzes each threshold
  similar_images = filter_similar_images(hashes_phash_all, thresh) # for each threshold, calculate similar images with that threshold

  # Extracts all the images that appear at least once as similar (without duplicates):
  img_set = set()
  for pair in similar_images:
    img_set.add(pair['image1'])
    img_set.add(pair['image2'])

  results.append({
    "Threshold": thresh,
    "Num Similar Pairs": len(similar_images),
    "Num Similar Distinct Images": len(img_set),
    "Total Images": len(image_paths),
    "Percent Similar (%)": round(len(img_set) / len(image_paths) * 100, 2)
  })


# To see the results, let's convert the list of results to a pandas DataFrame
df_results = pd.DataFrame(results)
display(df_results)


Calculation phash for all images


|   | Threshold | Num Similar Pairs | Num Similar Distinct Images | Total Images | Percent Similar (%) |
|---|---|---|---|---|---|
| 0 | 3 | 1450 | 839 | 3426 | 24.49 |
| 1 | 5 | 2307 | 1176 | 3426 | 34.33 |
| 2 | 7 | 3161 | 1494 | 3426 | 43.61 |
| 3 | 10 | 4960 | 1971 | 3426 | 57.53 |


We observe that as the threshold increases, the number of pairs considered similar also grows, and consequently, so does the percentage of images involved.

Observations:
- At lower thresholds (e.g., 3), only strongly similar images are identified, but many less obvious duplicates may be missed.

- At higher thresholds (e.g., 10), there's a risk of including different images that only share generic visual elements (false positives).

- Threshold 5 proves to be a good compromise, balancing precision and coverage.

Let us now create a function that implements the group-aware splitting technique:

In [17]:
# Divides the groups of similar images into training, validation, and test sets, keeping each group together
#  (no similar images end up in different sets).
def group_aware_split(groups, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15, seed=42):
  """
  Args:
    groups: list of groups
    train_ratio: ratio of images to be assigned to the training set
    val_ratio: ratio of images to be assigned to the validation set
    test_ratio: ratio of images to be assigned to the test set
    seed: seed for the random number generator

  Returns:
    assignments: dictionary with the split of images
  """
  # checks whether the ratios add up to 1 (with a small tolerance for floating-point error)
  total_ratio = train_ratio + val_ratio + test_ratio
  if not 0.99 <= total_ratio <= 1.01:
    raise ValueError("The proportions do not add up to 1! You must correct the values.")

  split_ratios = {'train': train_ratio, 'val': val_ratio, 'test': test_ratio}

  groups = list(groups) # shallow copy, so the caller's list is not shuffled in place
  random.seed(seed) # fixed seed, to make the shuffle reproducible
  random.shuffle(groups) # randomizes the order in which the groups are assigned

  total_images = sum(len(g) for g in groups) # total number of images across all groups
  # dictionary comprehension: how many images each split should receive
  target_counts = {k: int(v * total_images) for k, v in split_ratios.items()}
  current_counts = defaultdict(int) # number of images already assigned to each split
  assignments = defaultdict(list) # images assigned to each split (final output)

  for group in groups:
    # Find the split with the lowest saturation ratio
    best_split = min(
      target_counts.items(),
      key=lambda item: current_counts[item[0]] / item[1] if item[1] > 0 else float('inf')
    )[0] # this line is used to take the key (‘train’, ‘val’, ‘test’) of the best split

    assignments[best_split].extend(group) # adds all images in the group to the selected split
    current_counts[best_split] += len(group) # update the counter to know how many images are now in that split

  return assignments

With this function we get an *assignments* dictionary with one list per split (train, val, test). Each list contains the names of the images assigned to that split, and similar images are always kept together.
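
A quick way to confirm that no group has been torn apart is to look up the split of every image in each group. The sketch below uses made-up file names, and the helper name `find_split_leaks` is hypothetical. It returns the groups whose images ended up in more than one split.

```python
def find_split_leaks(groups, assignments):
    """Return (group, splits) for every group spread over more than one split."""
    split_of = {img: split for split, imgs in assignments.items() for img in imgs}
    leaks = []
    for group in groups:
        splits = {split_of[img] for img in group if img in split_of}
        if len(splits) > 1:
            leaks.append((group, sorted(splits)))
    return leaks


# Hypothetical groups of similar images
toy_groups = [["a.jpg", "b.jpg"], ["c.jpg", "d.jpg", "e.jpg"]]

clean = {"train": ["a.jpg", "b.jpg"], "val": ["c.jpg", "d.jpg", "e.jpg"], "test": []}
leaky = {"train": ["a.jpg", "b.jpg", "c.jpg"], "val": ["d.jpg", "e.jpg"], "test": []}

print(find_split_leaks(toy_groups, clean))  # no group crosses a split boundary
print(find_split_leaks(toy_groups, leaky))  # the second group is split between train and val
```

Running the same check on the real `groups` and `assignments` should return an empty list if the split is group-aware.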

In [None]:
# invokes the function to divide the groups of similar images into the different sets
assignments = group_aware_split(groups, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15, seed=42)

# check the counts
print({k: len(v) for k, v in assignments.items()})


{'train': 820, 'val': 178, 'test': 178}


With target proportions of:
- Training set = 70%
- Validation set = 15%
- Test set = 15%

the 1176 similar images are distributed as follows:
- 820 images out of 1176 (69.7%) for the Training set
- 178 images out of 1176 (15.1%) for the Validation set
- 178 images out of 1176 (15.1%) for the Test set
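
The greedy rule inside `group_aware_split` (always give the next group to the split that is proportionally the emptiest) can be traced by hand on a toy case. The sketch below is a simplified re-implementation that works on group sizes only, with made-up sizes and targets and no shuffling. It also assumes all targets are positive.

```python
def greedy_fill(group_sizes, targets):
    """Assign each group to the split with the lowest current/target ratio."""
    counts = {split: 0 for split in targets}
    for size in group_sizes:
        # ties are resolved in favour of the first split in dictionary order
        best = min(targets, key=lambda split: counts[split] / targets[split])
        counts[best] += size
    return counts


# Hypothetical example: 4 groups (10 images in total) and target counts of 7 / 2 / 1
counts = greedy_fill([4, 3, 2, 1], {"train": 7, "val": 2, "test": 1})
print(counts)
```

Because whole groups are assigned at once, a split can overshoot its target (here `val` and `test`), which is why the final counts only approximate the requested proportions.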

Now we need a *Stratified Split*, which ensures that the class distribution of the dataset is preserved proportionally in each of the three sets. \
We prefer to use a **Stratified Sampling** technique because we already know that our dataset is somewhat unbalanced, as it contains more images with people than images without.
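
To show what stratification buys us, here is a small hand-rolled sketch. It is not what `train_test_split(..., stratify=y)` does internally, and the labels `person` / `no_person` and the file names are assumptions made for the example. Each class is shuffled and cut separately, so the class proportions are the same on both sides of the split.

```python
import random
from collections import Counter, defaultdict


def stratified_train_split(filenames, labels, train_ratio, seed=42):
    """Split (filename, label) pairs so that each class keeps the same train share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for name, label in zip(filenames, labels):
        by_class[label].append(name)

    train, rest = [], []
    for label in sorted(by_class):
        names = by_class[label]
        rng.shuffle(names)
        cut = round(len(names) * train_ratio)  # per-class cut point
        train += [(name, label) for name in names[:cut]]
        rest += [(name, label) for name in names[cut:]]
    return train, rest


# Hypothetical unbalanced dataset: 8 images with people, 4 without
filenames = [f"img_{i:02d}.jpg" for i in range(12)]
labels = ["person"] * 8 + ["no_person"] * 4

train, rest = stratified_train_split(filenames, labels, train_ratio=0.75)
train_counts = Counter(label for _, label in train)
rest_counts = Counter(label for _, label in rest)
print(train_counts, rest_counts)
```

Both parts keep the original 2:1 ratio between the two classes, which a purely random split would not guarantee on a small dataset.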


But, of course, we want to maintain the partitioning we just did for similar images:

In [18]:
# Divides the AERALIS dataset in a stratified manner and copies images/.xml to train/val/test.
#   - Maintains similar image assignments (from group_aware_split).
#   - Stratifies the split of the remaining images based on the CSV 'class' column.
#   - Saves images, annotations and generates a CSV for each set with all original columns.
def stratified_split(assignments, image_folder_path, output_base_path, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15, seed=42):
  """
  Args:
    assignments: dictionary with the split of images
    image_folder_path: path to the folder containing images (and CSV file)
    output_base_path: path to the output folder
    train_ratio, val_ratio, test_ratio: desired proportions of images to be assigned to the training, validation and test set
    seed: seed for the random number generator
  """

  # 1. Create train/val/test folders with subfolders images/ and annotations/
  for split in ['train', 'val', 'test']:
    os.makedirs(os.path.join(output_base_path, split, "images"), exist_ok=True)
    os.makedirs(os.path.join(output_base_path, split, "annotations"), exist_ok=True)

  # 2. Load the full CSV
  csv_path = os.path.join(image_folder_path, "aeralis_person_labels.csv")
  df_full = pd.read_csv(csv_path)
  df_full = df_full[df_full['filename'].str.lower().str.endswith(('.jpg', '.jpeg', '.png'))] # we really only need '.jpg'
  df = df_full.drop_duplicates(subset='filename', keep='first').copy()

  # df = pd.read_csv(csv_path)
  # df = df[df['filename'].str.lower().str.endswith(('.jpg', '.jpeg', '.png'))]

  # 3. Removes images already assigned (similar)
  already_assigned = set(sum(assignments.values(), []))
  df_unassigned = df[~df['filename'].isin(already_assigned)].copy()

  # 4. Split is stratified according to y labels, class balance is maintained
  X = df_unassigned['filename'].values
  y = df_unassigned['class'].values

  X_train, X_temp, y_train, y_temp = train_test_split(
    X, y,
    train_size = train_ratio,
    stratify = y,
    random_state = seed
  )
  # X_train, y_train: part that will go into the training set
  # X_temp, y_temp: remaining images to be split still in validation and testing

  # After removing the train part, we need to divide X_temp into val and test:
  val_ratio_adjusted = val_ratio / (val_ratio + test_ratio) # we calculate the new proportion of validation to the remaining total

  # We divide X_temp and y_temp into validation and test:
  X_val, X_test = train_test_split(
    X_temp, train_size = val_ratio_adjusted,
    stratify = y_temp,
    random_state = seed
  )

  # 5. Adds assignments to the dictionary, avoiding duplicates
  # assignments is the dictionary initially created with the similar groups assigned via Group-Aware Splitting
  # X_train, X_val, X_test are the non-similar, stratified image assignments
  for split, split_X in zip(['train', 'val', 'test'], [X_train, X_val, X_test]):
    new_imgs = [x for x in split_X if x not in already_assigned]
    assignments[split].extend(new_imgs)
    already_assigned.update(new_imgs)



  # 6. Copy file and generate final CSV
  for split, file_list in assignments.items():
    # split_df = df[df['filename'].isin(file_list)].copy()
    split_df = df_full[df_full['filename'].isin(file_list)].copy()

    for fname in file_list:
      img_path = os.path.join(image_folder_path, fname)
      xml_name = os.path.splitext(fname)[0] + ".xml"
      xml_path = os.path.join(image_folder_path, xml_name)

      dst_img = os.path.join(output_base_path, split, "images", fname)
      dst_xml = os.path.join(output_base_path, split, "annotations", xml_name)

      if os.path.exists(img_path):
        shutil.copy2(img_path, dst_img)
      if os.path.exists(xml_path):
        shutil.copy2(xml_path, dst_xml)

    # save detailed CSV for split
    split_df.to_csv(os.path.join(output_base_path, f"{split}_set.csv"), index=False)

  # 7. Report
  print("Stratified split completed and files copied.")
  print(f"Train: {len(assignments['train'])} images")
  print(f"Val:   {len(assignments['val'])} images")
  print(f"Test:  {len(assignments['test'])} images")

With this function we are going to create a new folder containing 3 subfolders for the Training, Validation and Test phases, each of which will contain a folder with images and one with their respective annotations. At the same time a CSV file describing the set will be created.

In [None]:
image_folder_path = "/content/drive/MyDrive/projectUPV/datasets/AERALIS"

output_base_path = "/content/drive/MyDrive/projectUPV/datasets/AERALIS_SPLITTED"

# Initialize assignments first (only if it is not already initialized) with: assignments = group_aware_split(groups)

stratified_split(assignments, image_folder_path, output_base_path, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15,seed=42)

Stratified split completed and files copied.
Train: 2395 images
Val:   515 images
Test:  516 images


In [None]:
# Let's see if the numbers match
total_all = len(assignments['train']) + len(assignments['val']) + len(assignments['test'])
unique_total = len(set(assignments['train'] + assignments['val'] + assignments['test']))

print(f"Total images in split (sum): {total_all}")
print(f"Total unique images: {unique_total}")

Total images in split (sum): 3426
Total unique images: 3426


We now perform a quick check to make sure the split did not introduce inconsistencies in the data:

In [None]:
splits = ['train', 'val', 'test']
base_path = output_base_path  # already defined

for split in splits:
  img_dir = os.path.join(base_path, split, 'images')
  ann_dir = os.path.join(base_path, split, 'annotations')

  images = [f for f in os.listdir(img_dir) if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
  missing_xml = []

  for img in images:
    xml_name = os.path.splitext(img)[0] + ".xml"
    if not os.path.exists(os.path.join(ann_dir, xml_name)):
      missing_xml.append(xml_name)

  print(f"{split.upper()} Images: {len(images)}, Missing XML: {len(missing_xml)}")
  if missing_xml:
    print("\nMissing XML files:")
    for x in missing_xml:
      print("   ", x)

TRAIN Images: 2395, Missing XML: 0
VAL Images: 515, Missing XML: 0
TEST Images: 516, Missing XML: 0


In [None]:
for split in splits:
  csv_path = os.path.join(base_path, f"{split}_set.csv")
  img_dir = os.path.join(base_path, split, "images")

  df_split = pd.read_csv(csv_path)
  csv_filenames = set(df_split['filename'].str.lower())
  actual_images = set(f.lower() for f in os.listdir(img_dir) if f.lower().endswith(('.jpg', '.jpeg', '.png')))

  missing_in_csv = actual_images - csv_filenames # files present in folder but NOT in CSV
  missing_in_dir = csv_filenames - actual_images # files present in CSV but NOT in folder

  print(f"{split.upper()} — CSV: {len(csv_filenames)}, IMG DIR: {len(actual_images)}")
  if missing_in_csv:
    print(f"   {len(missing_in_csv)} images in folder not listed in CSV.")
  if missing_in_dir:
    print(f"   {len(missing_in_dir)} images in CSV not found in folder.")

TRAIN — CSV: 2395, IMG DIR: 2395
VAL — CSV: 515, IMG DIR: 515
TEST — CSV: 516, IMG DIR: 516


Perfect! There are no inconsistencies resulting from the split. \
We now create copies of the folder we just created, *AERALIS_SPLITTED*, one per model, so that the future comparison of the models' performance is not affected by different splits: every model will use exactly the same Train, Val, and Test sets.

In [21]:
# remember: base_path = "/content/drive/MyDrive/projectUPV/datasets/AERALIS_SPLITTED"

model_YOLO_versions = ["YOLOv8n", "YOLOv11n"]

for v in model_YOLO_versions:
    dst = f"/content/drive/MyDrive/projectUPV/datasets/AERALIS_{v}" # constructs the destination path
    shutil.copytree(base_path, dst) # raises FileExistsError if dst already exists (e.g. when the cell is re-run)

FileExistsError: [Errno 17] File exists: '/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv8n'
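
The `FileExistsError` above appears because `shutil.copytree` refuses to write into a destination that already exists, which happens when the cell is run a second time. A minimal sketch of a re-runnable copy, assuming Python 3.8 or later (where the `dirs_exist_ok` parameter is available) and using a temporary folder with a fake file instead of the real dataset:

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in for the split folder (hypothetical content)
    src = Path(tmp) / "AERALIS_SPLITTED"
    (src / "train" / "images").mkdir(parents=True)
    (src / "train" / "images" / "img_0001.jpg").write_bytes(b"fake image bytes")

    dst = Path(tmp) / "AERALIS_YOLOv8n"
    shutil.copytree(src, dst)                      # first run: dst is created
    shutil.copytree(src, dst, dirs_exist_ok=True)  # second run: no FileExistsError

    copied = sorted(p.relative_to(dst).as_posix() for p in dst.rglob("*") if p.is_file())
    print(copied)
```

With `dirs_exist_ok=True`, existing files are overwritten but files that are no longer in the source are not removed, so deleting the destination first may still be preferable when the split itself has changed.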

In [None]:
model_EfficientDet_versions = ["EfficientDet_D0", "EfficientDet_D1", "EfficientDet_D2"]

for v in model_EfficientDet_versions:
    dst = f"/content/drive/MyDrive/projectUPV/datasets/AERALIS_{v}" # constructs the destination path
    shutil.copytree(base_path, dst)

In [None]:
model_SSD_versions = ["MobileNetV2_SSD", "MobileNetV3_SSD"]

for v in model_SSD_versions:
    dst = f"/content/drive/MyDrive/projectUPV/datasets/AERALIS_{v}" # constructs the destination path
    shutil.copytree(base_path, dst)

In the next section, we will resize the images and their corresponding annotations for each set. This will eliminate any variability caused by the automatic resizing performed by the different models. \
Each model is designed for inputs of specific dimensions and, for this initial phase of “offline” fine-tuning (i.e., in my development environment), I prefer to test each model in optimal conditions, making the most of its capabilities.

# Resizing


In this section, I discuss the preprocessing strategies for adapting the dataset to the input requirements of various detection models.

Originally, I planned to resize all images to the specific size required by each model version (see the list of model input sizes below), and to update the bounding boxes and annotation files accordingly. \
With letterbox resize, images are adapted while maintaining their original aspect ratio, which is an essential requirement for YOLO models, since they are trained on letterboxed data.

During this phase, it is very important not to resize the images by *stretching* them.
We therefore use **Letterbox Resize** to avoid distortions. \
Letterbox resize is a preprocessing technique that:

 - Allows resizing the image while maintaining the original aspect ratio (i.e., the ratio between the width and height of an image).

 - Adds padding (black border or any other color) on the remaining sides to exactly match the target size.

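This ensures that object proportions are preserved. The bounding boxes keep their shape as well: their coordinates only need to be multiplied by the same scale factor and shifted by the padding offset.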
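
As a worked example of the arithmetic, consider a hypothetical 1280×720 image letterboxed to 640×640, with a made-up bounding box. The sketch below uses the same formulas as the `letterbox_resize` and `process_one_folder` cells, but without OpenCV, so only the numbers are computed.

```python
def letterbox_params(orig_w, orig_h, target_w, target_h):
    """Scale factor and left/top padding for a letterbox resize."""
    scale = min(target_w / orig_w, target_h / orig_h)  # keeps the aspect ratio
    new_w, new_h = int(orig_w * scale), int(orig_h * scale)
    pad_x = (target_w - new_w) // 2  # padding added on the left
    pad_y = (target_h - new_h) // 2  # padding added on the top
    return scale, pad_x, pad_y


def map_box(box, scale, pad_x, pad_y):
    """Apply the same scale and offset to an (xmin, ymin, xmax, ymax) box."""
    xmin, ymin, xmax, ymax = box
    return (int(xmin * scale + pad_x), int(ymin * scale + pad_y),
            int(xmax * scale + pad_x), int(ymax * scale + pad_y))


scale, pad_x, pad_y = letterbox_params(1280, 720, 640, 640)
new_box = map_box((100, 200, 300, 400), scale, pad_x, pad_y)
print(scale, pad_x, pad_y, new_box)
```

The image is scaled by 0.5 to 640×360 and centred vertically with 140 pixels of padding above and below, so the box is halved and shifted down by 140 pixels while keeping its 1:1 shape.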

Model input sizes:

- YOLOv8n and YOLOv11n require image dimensions of 640×640
- EfficientDet D0 requires 512×512
- EfficientDet D1 requires 640×640
- EfficientDet D2 requires 768×768
- MobileNetV2 + SSD requires 300×300
- MobileNetV3 + SSDLite requires 320×320

In [None]:
# Function to perform letterbox resize: resize with retained aspect ratio and padding
def letterbox_resize(image, target_size=(640, 640), color=(114, 114, 114)):
  """
  Args:
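    image: image to be resized (NumPy array, as returned by cv2.imread)
    target_size: tuple = (width, height)
    color: padding color as a 3-tuple (OpenCV uses BGR order; for grey the order does not matter)

  Returns:
    padded: resized and padded image of size target_size
    scale: scale factor
    (pad_x, pad_y): left and top padding values
  """
  orig_h, orig_w = image.shape[:2] # original height and width
  target_w, target_h = target_size # desired width and height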

  scale = min(target_w / orig_w, target_h / orig_h) # maintaining aspect ratio
  new_w, new_h = int(orig_w * scale), int(orig_h * scale)

  resized = cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_LINEAR) # resize image
  pad_x = (target_w - new_w) // 2 # horizontal padding
  pad_y = (target_h - new_h) // 2 # vertical padding

  # Adding padding to get target size image
  padded = cv2.copyMakeBorder(
    resized,
    pad_y, target_h - new_h - pad_y,
    pad_x, target_w - new_w - pad_x,
    borderType = cv2.BORDER_CONSTANT, value=color # grey padding
  )

  return padded, scale, (pad_x, pad_y)

In [None]:
# Function to process a single split of the dataset
def process_one_folder(img_dir, xml_dir, out_img_dir, out_xml_dir, target_size=(640, 640)):
  """
  Args:
    img_dir: path to input image folder
    xml_dir: path to input xml folder
    out_img_dir: path to output image folder
    out_xml_dir: path to output xml folder
    target_size: tuple = (width, height)
  """
  os.makedirs(out_img_dir, exist_ok=True) # folder output images
  os.makedirs(out_xml_dir, exist_ok=True) # folder output xml files

  for fname in os.listdir(img_dir):
    is_image = fname.lower().endswith(('.jpg', '.jpeg', '.png'))
    img_path = os.path.join(img_dir, fname)
    xml_filename = os.path.splitext(fname)[0] + ".xml"
    xml_path = os.path.join(xml_dir, xml_filename)
    xml_exists = os.path.exists(xml_path)
    image = cv2.imread(img_path) if is_image else None # read the file only if it is an image (cv2.imread returns None on failure)

    if is_image and xml_exists and image is not None:
      resized_img, scale, (pad_x, pad_y) = letterbox_resize(image, target_size)
      cv2.imwrite(os.path.join(out_img_dir, fname), resized_img) # save resized image

      # Update XML
      tree = ET.parse(xml_path)
      root = tree.getroot()

      for obj in root.findall('object'): # for each object (bbox) in the XML file
        bbox = obj.find('bndbox')
        xmin = int(float(bbox.find('xmin').text))
        ymin = int(float(bbox.find('ymin').text))
        xmax = int(float(bbox.find('xmax').text))
        ymax = int(float(bbox.find('ymax').text))

        # apply scaling + padding to the bbox
        xmin = int(xmin * scale + pad_x)
        xmax = int(xmax * scale + pad_x)
        ymin = int(ymin * scale + pad_y)
        ymax = int(ymax * scale + pad_y)

        # clamp of values (avoids out-of-picture bbox)
        bbox.find('xmin').text = str(max(0, min(xmin, target_size[0])))
        bbox.find('ymin').text = str(max(0, min(ymin, target_size[1])))
        bbox.find('xmax').text = str(max(0, min(xmax, target_size[0])))
        bbox.find('ymax').text = str(max(0, min(ymax, target_size[1])))

      # update size in <size> tag
      size_tag = root.find('size')
      size_tag.find('width').text = str(target_size[0])
      size_tag.find('height').text = str(target_size[1])

      # Save the new XML file
      tree.write(os.path.join(out_xml_dir, xml_filename))

    # Error handling: skip invalid files and report why
    else:
      if not is_image: # It is not an image file
        print(f"Ignored file (not image): {img_path}")
      elif not xml_exists:
        print(f"Missing file XML for: {fname}")
      elif image is None:
        print(f"Reading error: {img_path}")

In [None]:
# Function to process the entire dataset (Training, Validation, Testing)
def process_entire_dataset(base_input_dir, base_output_dir, splits=('train', 'val', 'test'), target_size=(640, 640)):
  """
  Args:
    base_input_dir: path to input folder
    base_output_dir: path to output folder
    splits: tuple = ('train', 'val', 'test')
    target_size: tuple = (width, height)
  """
  for split in splits:
    print(f"\n Processing split: {split}")
    # Input/output paths for images and annotations.
    img_dir = os.path.join(base_input_dir, split, 'images')
    xml_dir = os.path.join(base_input_dir, split, 'annotations')
    out_img_dir = os.path.join(base_output_dir, split, 'images')
    out_xml_dir = os.path.join(base_output_dir, split, 'annotations')

    # Process a single split
    process_one_folder(img_dir, xml_dir, out_img_dir, out_xml_dir, target_size)

  print("\nAll images and annotations have been processed.")

However, after a closer analysis, I realized that not all models handle resizing in the same way:

- YOLO requires and expects letterbox resize, as this is the format used during its training.

- EfficientDet and MobileNetV2/V3+SSDLite do not use letterbox: they expect a standard ("stretch") resize, potentially with cropping, but no padding. If letterbox is applied, it introduces artificial borders the models have never seen in training, potentially reducing detection accuracy, especially for objects at the image edges.

Although I have implemented functions for letterbox resizing and verified their correctness, I have decided not to apply any resizing in advance at this stage.
Instead, I will perform the fine-tuning directly on the original dataset, letting each official library handle the resizing "on the fly" according to the correct pipeline for each model. \
This strategy reduces the risk of inconsistencies, ensures full compatibility with each model's expectations, and simplifies future updates or changes in input size requirements. \
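
To make the difference between the two strategies concrete, here is a minimal sketch of how the same bounding box is transformed by each of them. It assumes a centered letterbox (equal padding on both sides) and, for the stretch case, an independent scale factor per axis. The names `letterbox_box` and `stretch_box` are illustrative and do not belong to any library.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax) in pixels
Size = Tuple[int, int]                   # (width, height)


def letterbox_box(box: Box, src: Size, dst: Size) -> Box:
  """Aspect-preserving resize with centered padding (assumed layout)."""
  scale = min(dst[0] / src[0], dst[1] / src[1])
  pad_x = (dst[0] - src[0] * scale) / 2
  pad_y = (dst[1] - src[1] * scale) / 2
  xmin, ymin, xmax, ymax = box
  return (xmin * scale + pad_x, ymin * scale + pad_y,
          xmax * scale + pad_x, ymax * scale + pad_y)


def stretch_box(box: Box, src: Size, dst: Size) -> Box:
  """Plain resize: each axis is scaled independently, no padding."""
  sx, sy = dst[0] / src[0], dst[1] / src[1]
  xmin, ymin, xmax, ymax = box
  return (xmin * sx, ymin * sy, xmax * sx, ymax * sy)


src, dst = (1920, 1080), (640, 640)
square = (960, 540, 1160, 740)  # a 200x200 px box in a 1920x1080 frame

for name, fn in (("letterbox", letterbox_box), ("stretch", stretch_box)):
  xmin, ymin, xmax, ymax = fn(square, src, dst)
  print(f"{name}: width={xmax - xmin:.1f}, height={ymax - ymin:.1f}")
```

With these numbers the letterbox keeps the box square (about 66.7 px per side), while the stretch turns it into a box of roughly 66.7 x 118.5 px. Neither result is wrong in itself: what matters is that the annotations follow the same transform the model's pipeline applies to the image.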

In [5]:
# Let's test the resizing
# base_path = "/content/drive/MyDrive/projectUPV/datasets/AERALIS_SPLITTED"

# AERALIS_YOLOv8n (640 x 640)
# process_entire_dataset(
#  base_input_dir = "/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv8n", # input
#  base_output_dir = "/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv8n_resized", # output
#  splits=('train', 'val', 'test'), # default (can be omitted)
#  target_size=(640, 640)
#)

If, during the experiments, training turns out to be too slow or resource-intensive because of the large original images, I will reconsider resizing in advance, always using the method that each model expects.

# Conversion

Now it's time to convert the annotations. \
To ensure consistency, we need to convert all the .xml files (which contain the bounding box annotations for each image) into the appropriate format required by each model.

The choice of annotation format mostly depends on the *implementation* of the model we plan to use, not on intrinsic format limitations. \

**Documentation vs Implementation**
 - The official documentation of each training framework describes the required annotation format.

 - In practice, however, it is the training scripts (or the APIs of libraries such as Ultralytics YOLO, TensorFlow Object Detection API, PyTorch Lightning, etc.) that enforce that format.

**YOLO TXT Format** \
According to the official Ultralytics documentation for YOLO:
  - *“One text file per image: Each image in the dataset has a corresponding text file with the same name as the image file and the '.txt' extension.”* \
  (link: https://docs.ultralytics.com/datasets/segment/#supported-dataset-formats)

  - *“Convert these annotations into the YOLO .txt file format which Ultralytics supports.”* \
  (link: https://docs.ultralytics.com/datasets/#contribute-new-datasets)

This means that if we use Ultralytics YOLO (as we plan to do), the .txt format is mandatory, and each file must contain annotations in normalized coordinates: `<class_id> <x_center> <y_center> <width> <height>`
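
As a quick illustration of what the normalized values mean, the sketch below reads one label line and maps it back to pixel coordinates. The helper name `yolo_line_to_pixel_box` and the example image size are hypothetical, chosen only for this example.

```python
from typing import Tuple


def yolo_line_to_pixel_box(line: str, img_w: int, img_h: int) -> Tuple[int, float, float, float, float]:
  """Parse '<class_id> <x_center> <y_center> <width> <height>' (normalized)
  and return (class_id, xmin, ymin, xmax, ymax) in pixels."""
  class_id, xc, yc, w, h = line.split()
  xc, w = float(xc) * img_w, float(w) * img_w   # x values scale with the image width
  yc, h = float(yc) * img_h, float(h) * img_h   # y values scale with the image height
  return int(class_id), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2


# A box centered at 10% of the width and 40% of the height of a 1000x500 image,
# covering 10% of the width and 40% of the height.
print(yolo_line_to_pixel_box("0 0.100000 0.400000 0.100000 0.400000", 1000, 500))
```

Because all four values are fractions of the image size, the same label file remains valid if the image is later resized without padding or cropping.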

**EfficientDet / SSD – COCO JSON Format** \

In PyTorch, the standard loader for object detection is *torchvision.datasets.CocoDetection*, which requires a COCO JSON file. \

Therefore, when using models like EfficientDet or MobileNetV2 + SSD, annotations must be converted to a single .json file in the COCO format. \
This is not optional, as the loader won’t work with .txt files or raw XMLs.
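
For reference, here is a minimal sketch of the three top-level lists a COCO detection file contains, built with the standard library only. The file name, image size, and boxes are invented for illustration; the grouping step at the end mimics the per-image lookup that a COCO-style loader needs.

```python
import json
from collections import defaultdict

# Hypothetical single-image dataset with two "person" boxes.
coco = {
  "images": [
    {"id": 1, "file_name": "frame_0001.jpg", "width": 1920, "height": 1080},
  ],
  "annotations": [
    # bbox is [x_min, y_min, width, height] in absolute pixels
    {"id": 1, "image_id": 1, "category_id": 1, "bbox": [100, 200, 50, 120], "area": 50 * 120, "iscrowd": 0},
    {"id": 2, "image_id": 1, "category_id": 1, "bbox": [400, 300, 40, 100], "area": 40 * 100, "iscrowd": 0},
  ],
  "categories": [
    {"id": 1, "name": "person", "supercategory": "none"},
  ],
}

# Group annotations by the image they belong to.
anns_by_image = defaultdict(list)
for ann in coco["annotations"]:
  anns_by_image[ann["image_id"]].append(ann)

coco_text = json.dumps(coco, indent=2)  # what would be written to the .json file
print(len(anns_by_image[1]), "annotations for image 1")
```

Unlike the YOLO format, there is one file per split rather than one per image, and boxes are stored in absolute pixels as top-left corner plus size.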

*What is a Loader?* \
A loader is a component that:
 - Reads images and annotations from disk
 - Applies transforms (resize, padding, augmentation, normalization)
 - Batches them into tensors for training/inference

In PyTorch, reading and transforming samples is usually implemented in a subclass of torch.utils.data.Dataset, while batching is handled by torch.utils.data.DataLoader.

**Note on EfficientDet** \
For fine-tuning EfficientDet, we will likely use the *rwightman/efficientdet-pytorch* implementation, a faithful and stable PyTorch port of Google's original model. \

I decided not to use the official TensorFlow version, as this project is based on PyTorch workflows. \

A more detailed discussion of model selection and fine-tuning strategy will be provided in a future notebook.

Let's start by converting the annotations for the YOLO model:

In [None]:
CLASS_MAP = {"person": 0}  # maps the classes with a numeric ID

In [None]:
def convert_bbox(size, box): # Function to convert bounding box from absolute to normalized coordinates in YOLO format
  """
  Args:
    size: tuple = (width, height)
    box: tuple = (xmin, ymin, xmax, ymax)

  Returns:
    tuple = (x_center, y_center, w, h)
  """
  dw, dh = 1.0/size[0], 1.0/size[1] # size = (width, height)
  x_center = (box[0] + box[2]) / 2.0 * dw # box = (xmin, ymin, xmax, ymax)
  y_center = (box[1] + box[3]) / 2.0 * dh
  w = (box[2] - box[0]) * dw
  h = (box[3] - box[1]) * dh

  return x_center, y_center, w, h

In [None]:
def xml_to_yolo(xml_path: Path, txt_path: Path): # converts a single .xml file to a .txt file in the YOLO style
  """
  Args:
    xml_path: Path to input XML annotation file
    txt_path: Path to output YOLO .txt annotation file
  """
  tree = ET.parse(xml_path)
  root = tree.getroot()

  w = int(root.find('size/width').text)
  h = int(root.find('size/height').text)

  lines = []

  for obj in root.findall('object'):
    cls = obj.find('name').text
    if cls in CLASS_MAP:  # only if the class is mapped
      xmin = float(obj.find('bndbox/xmin').text)
      ymin = float(obj.find('bndbox/ymin').text)
      xmax = float(obj.find('bndbox/xmax').text)
      ymax = float(obj.find('bndbox/ymax').text)
      bbox = convert_bbox((w, h), (xmin, ymin, xmax, ymax))
      line = f"{CLASS_MAP[cls]} " + " ".join(f"{v:.6f}" for v in bbox)
      lines.append(line)

  txt_path.write_text("\n".join(lines) + ("\n" if lines else ""))

In [None]:
def batch_convert(xml_dir: Path, txt_dir: Path): # converts all .xml files found under xml_dir (recursively) and writes the .txt files into txt_dir
  """
  Args:
    xml_dir: Path to input directory containing .xml annotations
    txt_dir: Path to output directory for .txt annotations
  """
  txt_dir.mkdir(parents=True, exist_ok=True)
  for xml in xml_dir.rglob("*.xml"):
    xml_to_yolo(xml, txt_dir / f"{xml.stem}.txt")

Let's try for YOLOv8n:

In [None]:
# Converts for TRAIN
batch_convert(
  Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv8n_resized/train/annotations"), # input
  Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv8n_resized/train/labels") # output
)

# Converts for VAL
batch_convert(
  Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv8n_resized/val/annotations"), # input
  Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv8n_resized/val/labels") # output
)

# Converts for TEST
batch_convert(
  Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv8n_resized/test/annotations"), # input
  Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv8n_resized/test/labels") # output
)

And now for YOLOv11n as well:

In [None]:
# Converts for TRAIN
batch_convert(
  Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv11n_resized/train/annotations"), # input
  Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv11n_resized/train/labels") # output
)

# Converts for VAL
batch_convert(
  Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv11n_resized/val/annotations"), # input
  Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv11n_resized/val/labels") # output
)

# Converts for TEST
batch_convert(
  Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv11n_resized/test/annotations"), # input
  Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv11n_resized/test/labels") # output
)

Let's check everything:

In [None]:
# to verify that for each .xml file in a directory there is a corresponding .txt file in the directory of converted YOLO labels:
def check_yolo_conversion(xml_dir: Path, txt_dir: Path):
  """
  Args:
    xml_dir: Path to input directory containing .xml annotations
    txt_dir: Path to input directory containing .txt annotations
  """
  xml_files = sorted([f.stem for f in xml_dir.glob("*.xml")])
  txt_files = sorted([f.stem for f in txt_dir.glob("*.txt")])

  missing_txt = [f for f in xml_files if f not in txt_files]
  extra_txt = [f for f in txt_files if f not in xml_files]

  if not missing_txt and not extra_txt:
    print("All .xml files have the corresponding .txt file.")

  else:
    print("Some matches are not correct:")

    if missing_txt:
      print(f"Missing .txt files for: {missing_txt}")
    if extra_txt:
      print(f"Excess .txt file (no corresponding .xml): {extra_txt}")

In [None]:
# Let's check

# YOLOv8n

# TRAIN
check_yolo_conversion(
  xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv8n_resized/train/annotations"),
  txt_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv8n_resized/train/labels")
)

# VAL
check_yolo_conversion(
  xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv8n_resized/val/annotations"),
  txt_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv8n_resized/val/labels")
)

# TEST
check_yolo_conversion(
  xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv8n_resized/test/annotations"),
  txt_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv8n_resized/test/labels")
)

All .xml files have the corresponding .txt file.
All .xml files have the corresponding .txt file.
All .xml files have the corresponding .txt file.


In [None]:
# YOLOv11n

# TRAIN
check_yolo_conversion(
  xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv11n_resized/train/annotations"),
  txt_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv11n_resized/train/labels")
)

# VAL
check_yolo_conversion(
  xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv11n_resized/val/annotations"),
  txt_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv11n_resized/val/labels")
)

# TEST
check_yolo_conversion(
  xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv11n_resized/test/annotations"),
  txt_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_YOLOv11n_resized/test/labels")
)

All .xml files have the corresponding .txt file.
All .xml files have the corresponding .txt file.
All .xml files have the corresponding .txt file.


The convert_bbox, xml_to_yolo, and batch_convert functions follow the standard approach for converting Pascal VOC annotations to the YOLO format, as described, for example, in tutorials on Medium and in open GitHub repositories.
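
The check above only confirms that every .xml file has a matching .txt file. As a complementary sketch, the function below inspects the content of a label file: five fields per line, a known class ID, and coordinates within [0, 1]. The name `find_invalid_yolo_lines` and the `num_classes` parameter are illustrative assumptions, not part of Ultralytics.

```python
import tempfile
from pathlib import Path
from typing import List


def find_invalid_yolo_lines(txt_path: Path, num_classes: int = 1) -> List[str]:
  """Return a human-readable message for each malformed line in a YOLO label file."""
  problems = []
  for line_no, line in enumerate(txt_path.read_text().splitlines(), start=1):
    if not line.strip():
      continue  # empty lines are harmless
    where = f"{txt_path.name}:{line_no}"
    parts = line.split()
    if len(parts) != 5:
      problems.append(f"{where}: expected 5 fields, got {len(parts)}")
      continue
    try:
      class_id = int(parts[0])
      coords = [float(v) for v in parts[1:]]
    except ValueError:
      problems.append(f"{where}: non-numeric value")
      continue
    if not 0 <= class_id < num_classes:
      problems.append(f"{where}: class id {class_id} out of range")
    if any(not 0.0 <= v <= 1.0 for v in coords):
      problems.append(f"{where}: coordinate outside [0, 1]")
  return problems


# Usage on a throwaway file: the second line has an x_center greater than 1.
with tempfile.TemporaryDirectory() as tmp:
  sample = Path(tmp) / "sample.txt"
  sample.write_text("0 0.500000 0.500000 0.200000 0.200000\n0 1.200000 0.500000 0.200000 0.200000\n")
  for message in find_invalid_yolo_lines(sample):
    print(message)
```

Running it over each labels folder (for example with `Path.glob("*.txt")`) would surface boxes that ended up outside the image after conversion.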

Now we can also convert the annotations for the EfficientDet and MobileNet + SSD models:

In [None]:
# Returns a Python dictionary that associates each class name with a numeric ID
def get_label2id(labels_path: str) -> Dict[str, int]:
  """
  Args:
    labels_path: path to file containing class names

  Returns:
    dict: dictionary that maps class names to class IDs
  """
  with open(labels_path, 'r') as f:
    labels_str = f.read().split()
  labels_ids = list(range(1, len(labels_str)+1))

  return dict(zip(labels_str, labels_ids))

In [None]:
# Builds a list of complete paths to the annotation files (XML)
def get_annpaths(ann_dir_path: str, ann_ids_path: str, ext: str = '') -> List[str]:
  """
  Args:
    ann_dir_path: path to directory containing annotation files
    ann_ids_path: path to file containing annotation IDs
    ext: extension of annotation files

  Returns:
    list: list of complete paths to annotation files
  """
  ext_with_dot = '.' + ext if ext else ''
  with open(ann_ids_path, 'r') as f:
    ann_ids = f.read().split()

  return [os.path.join(ann_dir_path, aid + ext_with_dot) for aid in ann_ids]

In [None]:
# Extracts image information (name, size, ID) from an XML tree
def get_image_info(annotation_root, extract_num_from_imgid=True):
  """
  Args:
    annotation_root: XML tree
    extract_num_from_imgid: boolean

  Returns:
    dict: image information
  """
  path = annotation_root.findtext('path')
  filename = os.path.basename(path) if path else annotation_root.findtext('filename')
  img_name = os.path.basename(filename)
  img_id = os.path.splitext(img_name)[0]
  if extract_num_from_imgid and isinstance(img_id, str):
    img_id = int(re.findall(r'\d+', img_id)[0])

  size = annotation_root.find('size')
  width = int(size.findtext('width'))
  height = int(size.findtext('height'))

  return {'file_name': filename, 'height': height, 'width': width, 'id': img_id}
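
One detail worth checking: the ID extraction above keeps only the first run of digits in the file name. If a file name contains more than one number, different images could end up with the same ID. The sketch below shows the effect on some hypothetical file names (not taken from the dataset) and one way to detect collisions.

```python
import re
from collections import Counter


def first_number(stem: str) -> int:
  """Same rule as above: take the first run of digits in the name."""
  return int(re.findall(r'\d+', stem)[0])


stems = ["frame_0012", "video2_frame_0012", "video2_frame_0013"]  # hypothetical names
ids = [first_number(s) for s in stems]
duplicates = {i: n for i, n in Counter(ids).items() if n > 1}

print(ids)         # [12, 2, 2]
print(duplicates)  # {2: 2}
```

If `duplicates` is not empty for the real file names, a safer option is to assign sequential IDs while iterating over the annotation files.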

In [None]:
# Takes an XML object (an <object>) and returns the annotation in COCO format
def get_coco_annotation_from_obj(obj, label2id, found_labels=None):
  """
  Args:
    obj: XML object
    label2id: dictionary that maps class names to class IDs
    found_labels: optional set that collects every class name encountered

  Returns:
    dict: COCO annotation for the object
  """
  label = obj.findtext('name')
  if found_labels is not None:
    found_labels.add(label)

  assert label in label2id, f"Label '{label}' not in label2id"
  category_id = label2id[label]
  bndbox = obj.find('bndbox')
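  # Coordinates are treated as 1-based (Pascal VOC convention), so xmin/ymin are shifted to 0-based.
  # int(float(...)) also accepts values stored as decimals, e.g. "12.0".
  xmin = int(float(bndbox.findtext('xmin'))) - 1
  ymin = int(float(bndbox.findtext('ymin'))) - 1
  xmax = int(float(bndbox.findtext('xmax')))
  ymax = int(float(bndbox.findtext('ymax')))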
  assert xmax > xmin and ymax > ymin
  o_width = xmax - xmin
  o_height = ymax - ymin

  return {
    'area': o_width * o_height,
    'iscrowd': 0,
    'bbox': [xmin, ymin, o_width, o_height],
    'category_id': category_id,
    'ignore': 0,
    'segmentation': []
  }
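
The box arithmetic used above can be isolated in a small sketch. Pascal VOC stores corner coordinates (xmin, ymin, xmax, ymax), while COCO stores [x, y, width, height]. The helper `voc_to_coco_bbox` and its `one_based` flag are illustrative names introduced here to make the effect of the "- 1" offset visible.

```python
from typing import List


def voc_to_coco_bbox(xmin: int, ymin: int, xmax: int, ymax: int, one_based: bool = True) -> List[int]:
  """Convert VOC corners to a COCO [x, y, width, height] box.

  With one_based=True the top-left corner is shifted by one pixel,
  which reproduces the behavior of the function above.
  """
  offset = 1 if one_based else 0
  x, y = xmin - offset, ymin - offset
  return [x, y, xmax - x, ymax - y]


print(voc_to_coco_bbox(11, 21, 60, 100))                  # [10, 20, 50, 80]
print(voc_to_coco_bbox(10, 20, 60, 100, one_based=False)) # [10, 20, 50, 80]
print(voc_to_coco_bbox(0, 0, 50, 50))                     # [-1, -1, 51, 51]
```

The last call shows the one case to watch for: if the annotation tool already wrote 0-based coordinates, a box touching the top-left border ends up with negative values after the shift.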

In [None]:
# Function that converts annotations from Pascal VOC (.xml) format to COCO (.json) format
def convert_xmls_to_cocojson(annotation_paths: List[str], label2id: Dict[str, int], output_jsonpath: str, extract_num_from_imgid: bool = True):
  """
  Args:
    annotation_paths: list of paths to annotation files
    label2id: dictionary that maps class names to class IDs
    output_jsonpath: path to output JSON file
    extract_num_from_imgid: boolean
  """
  output_json_dict = {
    "images": [],
    "type": "instances",
    "annotations": [],
    "categories": []
  }
  bnd_id = 1
  found_labels = set()
  print('Start converting...')

  for a_path in tqdm(annotation_paths):
    ann_tree = ET.parse(a_path)
    ann_root = ann_tree.getroot()

    img_info = get_image_info(ann_root, extract_num_from_imgid)
    img_id = img_info['id']
    output_json_dict['images'].append(img_info)

    for obj in ann_root.findall('object'):
      ann = get_coco_annotation_from_obj(obj, label2id, found_labels)
      ann.update({'image_id': img_id, 'id': bnd_id})
      output_json_dict['annotations'].append(ann)
      bnd_id += 1

  missing_in_labels = found_labels - set(label2id.keys())
  if missing_in_labels:
    print(f"Classes found in XMLs but not present in labels.txt: {missing_in_labels}")
  else:
    print("All the classes found in the XMLs are present in labels.txt.")

  for label, label_id in label2id.items():
    output_json_dict['categories'].append({'supercategory': 'none', 'id': label_id, 'name': label})

  with open(output_jsonpath, 'w') as f:
    json.dump(output_json_dict, f, indent=2)


In [None]:
# Reusable function that performs all steps on a base directory
def convert_pascalvoc_to_coco(base_dir_path: str):
  """
  Args:
    base_dir_path: path to base directory containing the dataset
  """
  base_dir = Path(base_dir_path)
  splits = ["train", "val", "test"]
  labels_path = base_dir / "labels.txt"

  # Step 1: create labels.txt (for this project: only "person")
  with open(labels_path, "w") as f:
    f.write("person\n")

  # Step 2: generate image_list.txt for each split
  for split in splits:
    xml_dir = base_dir / split / "annotations"
    output_list = base_dir / split / "image_list.txt"

    xml_files = sorted(xml_dir.glob("*.xml"))
    with open(output_list, "w") as f:
      for xml_file in xml_files:
        f.write(f"{xml_file.stem}\n")
    print(f"{output_list.name} generated with {len(xml_files)} entries.")

  # Step 3: perform conversion from .xml to COCO JSON
  label2id = get_label2id(str(labels_path))

  for split in splits:
    print(f"\nConverting {split.upper()} split to COCO JSON...")
    xml_dir = base_dir / split / "annotations"
    ids_file = base_dir / split / "image_list.txt"
    output_json = base_dir / f"annotations_{split}.json"

    annotation_paths = get_annpaths(str(xml_dir), str(ids_file), ext="xml")
    convert_xmls_to_cocojson(annotation_paths, label2id, str(output_json))

Let's try it:

In [None]:
# Apply to each EfficientDet and MobileNetV2/V3+SSD dataset

paths = [
  "/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D0",
  "/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D1",
  "/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D2",
  "/content/drive/MyDrive/projectUPV/datasets/AERALIS_MobileNetV2_SSD",
  "/content/drive/MyDrive/projectUPV/datasets/AERALIS_MobileNetV3_SSD"
]

for path in paths:
  print(f"\nProcessing dataset: {path}")
  convert_pascalvoc_to_coco(path)



Processing dataset: /content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D0_resized
image_list.txt generated with 2395 entries.
image_list.txt generated with 515 entries.
image_list.txt generated with 516 entries.

Converting TRAIN split to COCO JSON...
Start converting...


100%|██████████| 2395/2395 [01:38<00:00, 24.25it/s] 



Converting VAL split to COCO JSON...
Start converting...


100%|██████████| 515/515 [02:35<00:00,  3.31it/s]



Converting TEST split to COCO JSON...
Start converting...


100%|██████████| 516/516 [02:42<00:00,  3.18it/s]



Processing dataset: /content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D1_resized
image_list.txt generated with 2395 entries.
image_list.txt generated with 515 entries.
image_list.txt generated with 516 entries.

Converting TRAIN split to COCO JSON...
Start converting...


100%|██████████| 2395/2395 [01:32<00:00, 25.88it/s] 



Converting VAL split to COCO JSON...
Start converting...


100%|██████████| 515/515 [02:42<00:00,  3.17it/s]



Converting TEST split to COCO JSON...
Start converting...


100%|██████████| 516/516 [02:19<00:00,  3.70it/s]



Processing dataset: /content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D2_resized
image_list.txt generated with 2395 entries.
image_list.txt generated with 515 entries.
image_list.txt generated with 516 entries.

Converting TRAIN split to COCO JSON...
Start converting...


100%|██████████| 2395/2395 [01:05<00:00, 36.76it/s] 



Converting VAL split to COCO JSON...
Start converting...


100%|██████████| 515/515 [02:56<00:00,  2.92it/s]



Converting TEST split to COCO JSON...
Start converting...


100%|██████████| 516/516 [02:07<00:00,  4.05it/s]



Processing dataset: /content/drive/MyDrive/projectUPV/datasets/AERALIS_MobileNetV2_SSD_resized
image_list.txt generated with 2395 entries.
image_list.txt generated with 515 entries.
image_list.txt generated with 516 entries.

Converting TRAIN split to COCO JSON...
Start converting...


100%|██████████| 2395/2395 [01:01<00:00, 39.17it/s] 



Converting VAL split to COCO JSON...
Start converting...


100%|██████████| 515/515 [00:02<00:00, 180.51it/s]



Converting TEST split to COCO JSON...
Start converting...


100%|██████████| 516/516 [00:02<00:00, 209.28it/s]


Now let's verify that every annotation file made it into the COCO JSON:

In [None]:
def check_coco_conversion(xml_dir: Path, json_path: Path):
  """
  Args:
    xml_dir: Path to the folder containing .xml annotations
    json_path: Path to the generated COCO JSON file
  """
  # Read .xml files
  xml_files = sorted([f.stem for f in xml_dir.glob("*.xml")])

  # Load the COCO JSON file
  with open(json_path, 'r') as f:
    coco_data = json.load(f)

  # Takes the names (without extension) of the annotated images
  json_files = sorted([Path(img['file_name']).stem for img in coco_data['images']])

  missing_in_json = [f for f in xml_files if f not in json_files]
  extra_in_json = [f for f in json_files if f not in xml_files]

  if not missing_in_json and not extra_in_json:
    print("All .xml files have a match in the COCO JSON.")
  else:
    print("Some matches are not correct:")
    if missing_in_json:
      print(f"Missing from the JSON: {missing_in_json}")
    if extra_in_json:
      print(f"Present in JSON but not in XMLs: {extra_in_json}")
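
The function above compares file names only. A complementary sketch, under the assumption that the JSON has the `images`, `annotations`, and `categories` lists produced earlier, checks the internal consistency of the file: unique image IDs, annotations that point to an existing image and category, and boxes that stay inside the image. The name `find_coco_issues` is hypothetical.

```python
from typing import Dict, List


def find_coco_issues(coco: Dict) -> List[str]:
  """Return a list of consistency problems found in a COCO-style dictionary."""
  issues = []

  # Image IDs must be unique, otherwise annotations become ambiguous.
  images = {}
  for img in coco["images"]:
    if img["id"] in images:
      issues.append(f"duplicate image id {img['id']}")
    images[img["id"]] = img

  category_ids = {cat["id"] for cat in coco["categories"]}

  for ann in coco["annotations"]:
    img = images.get(ann["image_id"])
    if img is None:
      issues.append(f"annotation {ann['id']} references unknown image {ann['image_id']}")
      continue
    if ann["category_id"] not in category_ids:
      issues.append(f"annotation {ann['id']} has unknown category {ann['category_id']}")
    x, y, w, h = ann["bbox"]
    if x < 0 or y < 0 or x + w > img["width"] or y + h > img["height"]:
      issues.append(f"annotation {ann['id']} has a bbox outside the image bounds")

  return issues


# Usage with a tiny in-memory example (one valid box, one that exceeds the image).
example = {
  "images": [{"id": 1, "file_name": "a.jpg", "width": 100, "height": 100}],
  "annotations": [
    {"id": 1, "image_id": 1, "category_id": 1, "bbox": [10, 10, 20, 20]},
    {"id": 2, "image_id": 1, "category_id": 1, "bbox": [90, 90, 20, 20]},
  ],
  "categories": [{"id": 1, "name": "person"}],
}
print(find_coco_issues(example))
```

For a generated file, the dictionary would come from `json.load` on the corresponding `annotations_<split>.json`.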


In [None]:
# Let's check:

# EfficientDet D0

# TRAIN
check_coco_conversion(
    xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D0/train/annotations"),
    json_path=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D0/annotations_train.json")
)

# VAL
check_coco_conversion(
    xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D0/val/annotations"),
    json_path=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D0/annotations_val.json")
)

# TEST
check_coco_conversion(
    xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D0/test/annotations"),
    json_path=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D0/annotations_test.json")
)

All .xml files have a match in the COCO JSON.
All .xml files have a match in the COCO JSON.
All .xml files have a match in the COCO JSON.


In [None]:
# EfficientDet D1

# TRAIN
check_coco_conversion(
    xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D1/train/annotations"),
    json_path=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D1/annotations_train.json")
)

# VAL
check_coco_conversion(
    xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D1/val/annotations"),
    json_path=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D1/annotations_val.json")
)

# TEST
check_coco_conversion(
    xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D1/test/annotations"),
    json_path=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D1/annotations_test.json")
)

All .xml files have a match in the COCO JSON.
All .xml files have a match in the COCO JSON.
All .xml files have a match in the COCO JSON.


In [None]:
# EfficientDet D2

# TRAIN
check_coco_conversion(
    xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D2/train/annotations"),
    json_path=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D2/annotations_train.json")
)

# VAL
check_coco_conversion(
    xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D2/val/annotations"),
    json_path=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D2/annotations_val.json")
)

# TEST
check_coco_conversion(
    xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D2/test/annotations"),
    json_path=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_EfficientDet_D2/annotations_test.json")
)

All .xml files have a match in the COCO JSON.
All .xml files have a match in the COCO JSON.
All .xml files have a match in the COCO JSON.


In [None]:
# MobileNetV2 + SSD

# TRAIN
check_coco_conversion(
    xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_MobileNetV2_SSD/train/annotations"),
    json_path=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_MobileNetV2_SSD/annotations_train.json")
)

# VAL
check_coco_conversion(
    xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_MobileNetV2_SSD/val/annotations"),
    json_path=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_MobileNetV2_SSD/annotations_val.json")
)

# TEST
check_coco_conversion(
    xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_MobileNetV2_SSD/test/annotations"),
    json_path=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_MobileNetV2_SSD/annotations_test.json")
)

All .xml files have a match in the COCO JSON.
All .xml files have a match in the COCO JSON.
All .xml files have a match in the COCO JSON.


In [None]:
# MobileNetV3 + SSD

# TRAIN
check_coco_conversion(
    xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_MobileNetV3_SSD/train/annotations"),
    json_path=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_MobileNetV3_SSD/annotations_train.json")
)

# VAL
check_coco_conversion(
    xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_MobileNetV3_SSD/val/annotations"),
    json_path=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_MobileNetV3_SSD/annotations_val.json")
)

# TEST
check_coco_conversion(
    xml_dir=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_MobileNetV3_SSD/test/annotations"),
    json_path=Path("/content/drive/MyDrive/projectUPV/datasets/AERALIS_MobileNetV3_SSD/annotations_test.json")
)

**Acknowledgements:** Part of the code for converting annotations from Pascal VOC (.xml) to COCO JSON is based on an open implementation by Roboflow / yukkyo (https://blog.roboflow.com/how-to-convert-annotations-from-voc-xml-to-coco-json/).