<a href="https://colab.research.google.com/github/PavanDaniele/drone-person-detection/blob/main/data_exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Run this every time you start a new session
from google.colab import drive
drive.mount('/content/drive')  # Mount Google Drive so the dataset files are accessible

ValueError: mount failed

In [None]:
import os
import xml.etree.ElementTree as ET
import pandas as pd
import re  # Regular expressions
import shutil  # File and folder operations

# Data Exploration
In this notebook, I explore the dataset used for the drone-based person detection project. My goal is to analyze its content, understand its characteristics, and apply the necessary transformations (such as resizing, normalization, and splitting into training and test sets). This step is crucial to prepare the data for training deep learning models.

The chosen dataset contains images of people taken by drones at different altitudes. It must be analyzed to identify the labels, the classes, and any anomalies in the data.

### Useful Functions

In [None]:
def read_file_csv(folder_path, filename): # Read a CSV file located in a folder
  """
  Read a specific CSV file from a specified folder.

  Args:
    folder_path (str): Path to the folder containing the CSV file.
    filename (str): Name of the CSV file.

  Returns:
    df (pd.DataFrame): DataFrame containing the data from the CSV file.
  """

  filename_path = os.path.join(folder_path, filename)
  print("Attempting to read:", filename_path)
  df_filename = pd.read_csv(filename_path)
  print("File read successfully.")

  return df_filename

In [None]:
# Check whether there are .jpg images without an associated .xml file (or vice versa); such files should be deleted.
def check_image_xml_consistency(folder_path): # Check that .xml and .jpg files are paired
  """
  Check that every .jpg image has an .xml file with the same name, and vice versa.

  Args:
    folder_path (str): Path of the folder containing images and XML annotations.

  Returns:
    dict: Dictionary listing the images without an XML file and the XML files without an associated image.
  """
  jpg_files = {os.path.splitext(f)[0] for f in os.listdir(folder_path) if f.lower().endswith('.jpg')}
  xml_files = {os.path.splitext(f)[0] for f in os.listdir(folder_path) if f.lower().endswith('.xml')}

  images_without_xml = sorted(jpg_files - xml_files)
  xml_without_images = sorted(xml_files - jpg_files)

  print(f"Total images: {len(jpg_files)}")
  print(f"Total XML files: {len(xml_files)}")
  print(f"Images without matching XML: {len(images_without_xml)}")
  print(f"XML files without matching image: {len(xml_without_images)}")

  return {
    "images_without_xml": images_without_xml,
    "xml_without_images": xml_without_images
  }

# HOW TO BEHAVE:
#### If the .jpg image does not have the associated .xml file:
# -> if it is from the SARD dataset (delete it).
# -> if it is from the Herida dataset keep it valid as a negative image (without objects), include it in the CSV file with class = no_person or similar
#
#### If the .xml file does not have the associated .jpg image:
# -> the .xml file is useless and is likely to generate errors in the parser (delete it)
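
The policy described in the comments above can be sketched as a small decision helper. This is only an illustration and not part of the cleaning pipeline: the function name `decide_orphan_action`, the dataset tags `'sard'` and `'herida'`, and the returned action strings are all assumptions introduced here.

```python
def decide_orphan_action(filename, dataset):
    """Return the action for a file whose image/annotation partner is missing.

    Hypothetical helper: the dataset tags and action names are assumed labels.
    """
    name = filename.lower()
    if name.endswith('.xml'):
        # An XML file without its image is useless and may break the parser
        return 'delete'
    if name.endswith('.jpg'):
        if dataset == 'sard':
            return 'delete'
        if dataset == 'herida':
            # Keep the image as a negative example (no objects)
            return 'keep_as_negative'
    # Anything else is left for manual inspection
    return 'review_manually'


print(decide_orphan_action('gss1128.jpg', 'sard'))
print(decide_orphan_action('frame_01.jpg', 'herida'))
```

Returning an action instead of deleting immediately keeps the decision separate from the destructive step, so the list of actions can be inspected first.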

In [None]:
def check_image_dimensions_consistency(file_csv): # Check if all the images have the same dimensions
  """
  Check if all images in the CSV have the same dimensions (width, height).

  Args:
    file_csv (str): path to csv which contains columns 'filename', 'width' and 'height'.

  Returns:
    pd.DataFrame: different dimensions found with count.
  """
  df = pd.read_csv(file_csv)

  for col in ['width', 'height']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

  # dedup by filename to avoid multiple bboxes
  dfu = df[['filename','width','height']].drop_duplicates()
  dimension_counts = dfu.groupby(['width', 'height'])['filename'].nunique().reset_index(name='image_count')

  if len(dimension_counts) == 1:
    print(f"All images have the same size: {dimension_counts.iloc[0]['width']}x{dimension_counts.iloc[0]['height']}")
  else:
    print(f"Found {len(dimension_counts)} different image sizes:")
    print(dimension_counts)

  return dimension_counts

In [None]:
# Note on the "not_defined" label:
# In the SARD dataset, the "not_defined" label is used when a person is clearly present in the image, but their activity or posture
#  cannot be reliably classified (e.g., due to occlusion or ambiguity).
# Since my task is person detection (i.e., detecting the presence of a person, regardless of their behavior), I decided to treat
#  all "not_defined" annotations as valid person instances.
# This avoids undercounting people in cases where the action is unclear but their presence is certain.

def classify_image_annotation_quality(file_csv, dataset_folder, min_pixels=8, min_ratio=0.01): # To find suspicious images
  """
    Classify images into 3 categories using a size-based heuristic:

    1) case1_all_suspicious:
       Every bounding box in the image is considered "small".
       A bbox is "small" if:
        - its width < min_pixels OR its height < min_pixels, OR
        - (only if 'width' and 'height' image columns are available in the CSV)
          its relative size is small: (bbox_w / img_width < min_ratio) OR (bbox_h / img_height < min_ratio).

    2) case2_some_suspicious:
       At least one bbox in the image is "small", but not all.

    3) case3_no_annotations:
       The image file exists in 'dataset_folder' but does not appear in the CSV (i.e., no annotation rows).

  Args:
    file_csv (str): Path to the CSV with columns ['filename','xmin','ymin','xmax','ymax'] and optionally ['width','height'] (image size).
    dataset_folder (str): Folder containing the image files.
    min_pixels (int): Minimum bbox size in pixels (width or height) to be considered valid.
    min_ratio (float): Minimum relative bbox size (vs image size) to be considered valid.
  """
  df = pd.read_csv(file_csv)

  # Conversion of numerical coordinates
  for c in ['xmin','ymin','xmax','ymax','width','height']:
    if c in df: df[c] = pd.to_numeric(df[c], errors='coerce')

  # If width/height are not in the CSV, skip the ratio check
  has_wh = {'width','height'}.issubset(df.columns)

  df['bbox_w'] = df['xmax'] - df['xmin']
  df['bbox_h'] = df['ymax'] - df['ymin']
  small_pixel = (df['bbox_w'] < min_pixels) | (df['bbox_h'] < min_pixels)
  small_ratio = False
  if has_wh:
    small_ratio = ((df['bbox_w'] / df['width'] < min_ratio) | (df['bbox_h'] / df['height'] < min_ratio))
  df['suspicious'] = small_pixel | small_ratio

  # Assemble images
  grouped = df.groupby('filename')['suspicious']

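  all_sus = grouped.all()  # True if every box of the image is suspicious
  any_sus = grouped.any()  # True if at least one box of the image is suspicious

  case1_all_suspicious = set(all_sus[all_sus].index)  # All boxes are suspicious
  case2_some_suspicious = set(any_sus[any_sus & ~all_sus].index)  # Only some boxes are suspicious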

  # All images annotated
  annotated_images = set(df['filename'].unique())

  # All images in the folder
  all_images = {f for f in os.listdir(dataset_folder) if f.endswith('.jpg')}

  # Images not annotated (no row in the CSV)
  case3_no_annotations = all_images - annotated_images

  return {
    "case1_all_suspicious": case1_all_suspicious, # I'll treat these as if they were empty images
    "case2_some_suspicious": case2_some_suspicious,
    "case3_no_annotations": case3_no_annotations # these are empty images
  }
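
To make the size heuristic concrete, here is a minimal sketch that applies the same "small box" rule to a toy DataFrame. The annotation values are invented for illustration; only the thresholds (`min_pixels=8`, `min_ratio=0.01`) and the 1920x1080 image size come from this notebook.

```python
import pandas as pd

# Toy annotations (values invented for illustration only)
df = pd.DataFrame({
    'filename': ['a.jpg', 'a.jpg', 'b.jpg', 'b.jpg'],
    'xmin': [10, 100, 5, 50],
    'ymin': [10, 100, 5, 50],
    'xmax': [14, 160, 9, 54],
    'ymax': [15, 180, 11, 56],
    'width': [1920] * 4,
    'height': [1080] * 4,
})

min_pixels, min_ratio = 8, 0.01
bbox_w = df['xmax'] - df['xmin']
bbox_h = df['ymax'] - df['ymin']

# A box is "small" if it fails the absolute OR the relative size test
small = (
    (bbox_w < min_pixels) | (bbox_h < min_pixels)
    | (bbox_w / df['width'] < min_ratio) | (bbox_h / df['height'] < min_ratio)
)

grouped = small.groupby(df['filename'])
all_small = grouped.all()
any_small = grouped.any()

case1 = set(all_small[all_small].index)               # every box is small
case2 = set(any_small[any_small & ~all_small].index)  # only some boxes are small
print(case1, case2)
```

Note that for 1920x1080 images, `min_ratio=0.01` corresponds to about 19 pixels horizontally and about 11 pixels vertically, so with the default values the ratio test is stricter than `min_pixels=8`.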

In [None]:
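def infer_indexing_from_csv_verbose(csv_path, thresh=0.2):
  """
  Heuristically infer whether the bbox coordinates in a CSV are 0-based or 1-based (Pascal VOC style).

  Args:
    csv_path (str): Path to CSV with columns xmin, ymin, xmax, ymax and, optionally, width and height.
    thresh (float): Minimum fraction of rows that must show a signal for it to count.

  Returns:
    dict: The inferred 'indexing' ('0-based', '1-based' or 'ambiguous') and the fractions used to decide.
  """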
  cols = ['xmin','ymin','xmax','ymax','width','height']
  df = pd.read_csv(csv_path)  # load everything

  # robustness: guarantee numerical
  for c in cols:
    if c in df: df[c] = pd.to_numeric(df[c], errors='coerce')

  # indicative fractions (booleans -> 1/0 -> .mean() = % True)
  frac_min0 = ((df['xmin'] == 0) | (df['ymin'] == 0)).mean()
  frac_min1 = ((df['xmin'] == 1) | (df['ymin'] == 1)).mean()

  if {'xmax','width'}.issubset(df.columns):
    frac_max_eq_w = (df['xmax'] == df['width']).mean()
  else:
    frac_max_eq_w = 0.0

  if {'ymax','height'}.issubset(df.columns):
    frac_max_eq_h = (df['ymax'] == df['height']).mean()
  else:
    frac_max_eq_h = 0.0

  likely_voc_1based = (frac_min1 > thresh) or (max(frac_max_eq_w, frac_max_eq_h) > thresh)
  likely_0based     = (frac_min0 > thresh)

  if likely_voc_1based and not likely_0based:
    idx = "1-based"
  elif likely_0based and not likely_voc_1based:
    idx = "0-based"
  else:
    idx = "ambiguous"

  return {
    "indexing": idx,
    "frac_min0": float(frac_min0),
    "frac_min1": float(frac_min1),
    "frac_max_eq_w": float(frac_max_eq_w),
    "frac_max_eq_h": float(frac_max_eq_h),
    "threshold": float(thresh),
    "rows_used": int(len(df))
  }
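
The idea behind this heuristic is that boxes touching the top-left border reveal the convention: in a 0-based file they have a coordinate equal to 0, in a 1-based file a coordinate equal to 1. The sketch below shows only this signal on invented data; `touch_fractions` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

def touch_fractions(df):
    """Fraction of boxes whose top-left corner has a coordinate equal to 0, and equal to 1."""
    frac0 = ((df['xmin'] == 0) | (df['ymin'] == 0)).mean()
    frac1 = ((df['xmin'] == 1) | (df['ymin'] == 1)).mean()
    return float(frac0), float(frac1)

# Invented coordinates: three of the four boxes touch the image border
zero_based = pd.DataFrame({'xmin': [0, 15, 0, 200], 'ymin': [30, 0, 7, 90]})
one_based = zero_based + 1  # the same boxes shifted to 1-based coordinates

print(touch_fractions(zero_based))
print(touch_fractions(one_based))
```

If few boxes touch the border, both fractions stay below the threshold and the result is "ambiguous", so the output is best read as a hint rather than a proof.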

In [None]:
# To check if bounding boxes in CSV are out of bounds for images
def check_bboxes_out_of_bounds_from_csv(file_csv, voc_one_indexed=True): # When converting to 0-based, you can call with voc_one_indexed=False
  """
  Check if bounding boxes in CSV are out of bounds for images.

  Args:
    file_csv (str): Path to CSV with columns: xmin, ymin, xmax, ymax, width, height.
    voc_one_indexed (bool): Whether the CSV is 1-based (VOC format) or 0-based.

  Returns:
    invalid_rows (pd.DataFrame): Annotations with bbox out of bounds.
  """
  df = pd.read_csv(file_csv)

  # Make sure the fields are numeric
  for col in ['xmin', 'ymin', 'xmax', 'ymax', 'width', 'height']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

  """ PREVIOUS VERSION (kept for reference)
    # Create mask for bbox out of bounds
    out_of_bounds_mask = ~(
      (df['xmin'] >= 0) &
      (df['ymin'] >= 0) &
      (df['xmax'] <= df['width']) &
      (df['ymax'] <= df['height']) &
      (df['xmax'] > df['xmin']) &
      (df['ymax'] > df['ymin'])
    )

    invalid_rows = df[out_of_bounds_mask].copy()

    print(f"Checked all entries. Found {len(invalid_rows)} invalid bounding boxes.")
    return invalid_rows
  """
  # to avoid false “out-of-bounds” on the edges (common with drones from above)
  min_ok = 1 if voc_one_indexed else 0
  lb_ok = (df['xmin'] >= min_ok) & (df['ymin'] >= min_ok)
  ub_ok = (df['xmax'] <= df['width']) & (df['ymax'] <= df['height'])
  shape_ok = (df['xmax'] > df['xmin']) & (df['ymax'] > df['ymin'])

  invalid_rows = df[~(lb_ok & ub_ok & shape_ok)].copy()
  print(f"Checked all entries. Found {len(invalid_rows)} invalid bounding boxes.")

  return invalid_rows
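
As a quick illustration of how the `voc_one_indexed` flag changes the result, the sketch below applies the same three conditions to a few invented rows. The helper name `invalid_mask` and the data are assumptions made for this example.

```python
import pandas as pd

# Invented rows: valid box, box starting at 0, box past the right edge, inverted box
df = pd.DataFrame({
    'xmin': [1, 0, 100, 300],
    'ymin': [1, 5, 100, 300],
    'xmax': [50, 40, 1921, 250],
    'ymax': [60, 45, 200, 400],
    'width': [1920] * 4,
    'height': [1080] * 4,
})

def invalid_mask(df, voc_one_indexed=True):
    """Boolean mask of rows that violate the bounds or shape conditions."""
    min_ok = 1 if voc_one_indexed else 0
    lb_ok = (df['xmin'] >= min_ok) & (df['ymin'] >= min_ok)
    ub_ok = (df['xmax'] <= df['width']) & (df['ymax'] <= df['height'])
    shape_ok = (df['xmax'] > df['xmin']) & (df['ymax'] > df['ymin'])
    return ~(lb_ok & ub_ok & shape_ok)

print(invalid_mask(df, voc_one_indexed=True).tolist())
print(invalid_mask(df, voc_one_indexed=False).tolist())
```

Only the second row changes status: a coordinate equal to 0 is out of bounds under the 1-based convention but valid under the 0-based one.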

In [None]:
def is_image_present(folder_path, image_name): # To check if an image is present or not
  """
  Checks whether an image file is present in the folder.

  Args:
    folder_path (str): Path to the folder.
    image_name (str): Name of the file, e.g. 'img001.jpg'.

  Returns:
    bool: True if the file is present, False otherwise.
  """
  return os.path.isfile(os.path.join(folder_path, image_name))

In [None]:
def parse_xml_annotations(xml_folder): # Read all the XML files in a folder and return a pandas DataFrame with the annotations in them
  """
  Parses annotation data from all XML files in a given folder and returns a DataFrame containing object labels and bounding boxes.

  Args:
    xml_folder (str): Path to the folder containing XML annotation files.

  Returns:
    pandas.DataFrame: A DataFrame with columns ['filename', 'class', 'xmin', 'ymin', 'xmax', 'ymax']
    containing all annotations extracted from the XML files.
  """
  annotations = []

  for file in os.listdir(xml_folder):
    if file.endswith('.xml'):
      xml_path = os.path.join(xml_folder, file)
      try:
        tree = ET.parse(xml_path)
        root = tree.getroot()

        filename = root.find('filename').text.strip()
        for obj in root.findall('object'):
          label = obj.find('name').text.strip().lower()
          bbox = obj.find('bndbox')
          xmin = int(bbox.find('xmin').text)
          ymin = int(bbox.find('ymin').text)
          xmax = int(bbox.find('xmax').text)
          ymax = int(bbox.find('ymax').text)

          annotations.append({
            'filename': filename,
            'class': label,
            'xmin': xmin,
            'ymin': ymin,
            'xmax': xmax,
            'ymax': ymax
          })

      except ET.ParseError:
        print(f"Warning: Failed to parse {xml_path}")

  return pd.DataFrame(annotations)




def check_csv_vs_xml_annotations(folder_path, csv_filename): # Check if the csv annotations match the xml files
  """
  Compares object annotations in a CSV file against annotations found in XML files.

  Args:
    folder_path (str): Path to the folder containing both the CSV and XML files.
    csv_filename (str): Name of the CSV file to validate.

  Returns:
    dict: Summary with:
      - 'matched': Number of matched annotation rows.
      - 'only_in_csv': Number of unmatched rows only in CSV.
      - 'only_in_xml': Number of unmatched rows only in XML.
      - 'files_only_in_csv': Sorted list of filenames with rows found only in the CSV.
      - 'files_only_in_xml': Sorted list of filenames with rows found only in the XML.
  """
  csv_path = os.path.join(folder_path, csv_filename)
  df_csv = pd.read_csv(csv_path)
  df_xml = parse_xml_annotations(folder_path)

  # Normalize filename and class
  df_csv['filename'] = df_csv['filename'].astype(str).str.strip()
  df_csv['class'] = df_csv['class'].astype(str).str.strip().str.lower()

  merged = pd.merge(
    df_csv,
    df_xml,
    how='outer',
    on=['filename', 'class', 'xmin', 'ymin', 'xmax', 'ymax'],
    indicator=True
  )

  # Identify the filenames with mismatches
  only_csv_filenames = set(merged[merged['_merge'] == 'left_only']['filename'])
  only_xml_filenames = set(merged[merged['_merge'] == 'right_only']['filename'])

  return {
    'matched': merged['_merge'].value_counts().get('both', 0),
    'only_in_csv': merged['_merge'].value_counts().get('left_only', 0),
    'only_in_xml': merged['_merge'].value_counts().get('right_only', 0),
    'files_only_in_csv': sorted(only_csv_filenames),
    'files_only_in_xml': sorted(only_xml_filenames)
  }
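
The comparison relies on an outer merge with `indicator=True`, which adds a `_merge` column telling where each row was found. A minimal sketch on invented rows:

```python
import pandas as pd

keys = ['filename', 'class', 'xmin', 'ymin', 'xmax', 'ymax']

# Invented annotations: one shared row, one only in the CSV, one only in the XML
csv_rows = pd.DataFrame([
    ['a.jpg', 'person', 10, 10, 50, 80],
    ['b.jpg', 'person', 5, 5, 40, 60],
], columns=keys)
xml_rows = pd.DataFrame([
    ['a.jpg', 'person', 10, 10, 50, 80],
    ['c.jpg', 'person', 7, 9, 30, 44],
], columns=keys)

merged = csv_rows.merge(xml_rows, how='outer', on=keys, indicator=True)
counts = merged['_merge'].value_counts()
print(counts)
print(sorted(merged.loc[merged['_merge'] == 'left_only', 'filename']))
```

One caveat: if exactly the same row is duplicated on both sides, the merge pairs every copy with every copy, so the 'both' count can be larger than the number of rows in either source.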

In [None]:
def preview_dataset_cleaning(csv_files, images_to_remove):
  """
  Shows which rows would be removed from CSVs without editing them.

  Args:
    csv_files (list of str): CSV Paths
    images_to_remove (list or set): Names of files to be removed
  """
  results = {}
  for csv_file in csv_files:
    df = pd.read_csv(csv_file)
    hits = df[df['filename'].isin(images_to_remove)]
    print(f"\n {csv_file}: {len(hits)} rows to remove")
    print("Filenames involved:", hits['filename'].unique())
    results[csv_file] = {'rows': len(hits), 'filenames': hits['filename'].unique()}

  return results

In [None]:
def clean_dataset(folder, csv_files, images_to_remove): # To delete images and xml file (also in the csv)
  """
  Delete images and XML files, and directly remove related rows from original CSV files.

  Args:
    folder (str): Path to folder containing .jpg and xml files.
    csv_files (list of str): List of paths to CSV files to clean.
    images_to_remove (list or set): Filenames (e.g., 'gss1128.jpg') to remove.
  """
  deleted_images = []
  deleted_xmls = []
  removed_csv_rows = {}

  # 1. Delete image and XML files
  for image in images_to_remove:
    image_path = os.path.join(folder, image)
    xml_path = os.path.join(folder, os.path.splitext(image)[0] + '.xml')

    if os.path.exists(image_path):
      os.remove(image_path)
      deleted_images.append(image)
    else:
      print(f"Image not found: {image}")

    if os.path.exists(xml_path):
      os.remove(xml_path)
      deleted_xmls.append(os.path.basename(xml_path))
    else:
      print(f"XML not found: {os.path.basename(xml_path)}")

  # 2. Remove related rows directly from original CSV files
  for csv_file in csv_files:
    df = pd.read_csv(csv_file)
    initial_len = len(df)

    # Print the rows that will be deleted
    hits = df[df['filename'].isin(images_to_remove)]
    removed_csv_rows[csv_file] = list(hits['filename'].unique())
    print(f"\nCheck on {csv_file}: {len(hits)} rows to delete")
    print("Filenames:", hits['filename'].unique())

    # Drop rows with matching filenames
    df = df[~df['filename'].isin(images_to_remove)] # ~ is the boolean negation operator
    # .isin() checks whether each element of a column is present in a list, set, etc.

    # Overwrite the original CSV file
    df.to_csv(csv_file, index=False)

    print(f"Updated {csv_file}: {initial_len - len(df)} rows removed")

  return {
    "deleted_images": deleted_images,
    "deleted_xmls": deleted_xmls,
    "removed_csv_rows": removed_csv_rows
  }

In [None]:
def analyze_dataset_annotations(file_csv, dataset_folder):  # Basic Analysis of the dataset
  """
  Analyze the CSV file and check whether there are images (.jpg) without annotations.
  """
  df = pd.read_csv(file_csv)

  # Count the number of annotations for each class
  count_labels = df['class'].value_counts()

  # Number of bounding box per image
  bboxes_per_image = df.groupby('filename').size()
  number_bboxes_per_image = bboxes_per_image.value_counts().sort_index()

  # All the images annotated in the csv
  annotated_images = set(df['filename'].unique())

  # All images in the folder
  all_images = {f for f in os.listdir(dataset_folder) if f.endswith('.jpg')}

  # Images without any annotation in csv file
  images_without_annotations = all_images - annotated_images

  return {
    "label_count": count_labels,
    "distribution_of_bbox": number_bboxes_per_image,
    "number_images_annotated": len(annotated_images),
    "number_all_images": len(all_images),
    "number_images_without_annotations": len(images_without_annotations),
    "images_without_annotations": images_without_annotations # Useful for debug or to remove them
  }
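
The two main quantities computed here (boxes per image and images without annotations) can be illustrated on a toy example. The filenames below are invented, and the folder listing is replaced by a plain set so that the sketch runs without any files.

```python
import pandas as pd

# Invented annotation rows: a.jpg has 3 boxes, b.jpg has 1, c.jpg has 2
df = pd.DataFrame({
    'filename': ['a.jpg', 'a.jpg', 'a.jpg', 'b.jpg', 'c.jpg', 'c.jpg'],
    'class': ['person'] * 6,
})

boxes_per_image = df.groupby('filename').size()
# How many images have 1 box, 2 boxes, 3 boxes, ...
distribution = boxes_per_image.value_counts().sort_index()
print(distribution.to_dict())

# Stand-in for the folder listing (assumed content)
folder_images = {'a.jpg', 'b.jpg', 'c.jpg', 'd.jpg'}
unannotated = folder_images - set(df['filename'])
print(unannotated)
```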

## Dataset SARD:

This dataset was built for detecting casualties and persons in search and rescue scenarios in drone images and videos. \
The actors in the footage simulate exhausted and injured persons as well as "classic" types of movement of people in nature, such as running, walking, standing, sitting, or lying down. \
The shots include persons on macadam roads, in quarries, low and high grass, forest shade, and the like.

In [None]:
# os.chdir('../') # to change the directory (../ is the start)
os.chdir('/content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD') # the directory with the dataset
print(os.getcwd()) # to see in which directory we are
# print(os.listdir()) # to see what there is in the current directory (add the path inside the () to see the content of another directory)

/content/drive/.shortcut-targets-by-id/1LQbD7p_iS5KLqGNdfrYEvsAx0i_bgB0h/projectUPV/datasets/SARD_dataset/SARD


In [None]:
folder_path_sard = '/content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD'

We start the cleaning process by checking if there are images without an associated XML file or vice versa:

In [None]:
sard_dataset_path = '/content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD'
inconsistencies_sard = check_image_xml_consistency(sard_dataset_path)

# Let's see which files aren't matching
total_unmatched_sard = len(inconsistencies_sard["images_without_xml"]) + len(inconsistencies_sard["xml_without_images"])
print("Total unmatched files:", total_unmatched_sard)

print("\nImages without XML:")
for f in inconsistencies_sard['images_without_xml']:
  print(f"  - {f}.jpg")

print("\nXML files without image:")
for f in inconsistencies_sard['xml_without_images']:
  print(f"  - {f}.xml")


Total images: 1983
Total XML files: 1981
Images without matching XML: 2
XML files without matching image: 0
Total unmatched files: 2

Images without XML:
  - gss1128.jpg
  - gss806.jpg

XML files without image:


We notice that 2 images do not have an associated XML file. These images are:
- gss1128.jpg
- gss806.jpg

There are no XML files without an associated image.

In [None]:
is_image_present(sard_dataset_path, "gss806.jpg")

True

In [None]:
is_image_present(sard_dataset_path, "gss1128.jpg")

True

Now we delete these images, since they cannot be used without their annotations:

In [None]:
print("Images without XML:", len(inconsistencies_sard['images_without_xml']), "   are: ", inconsistencies_sard['images_without_xml'])

def delete_orphan_files(folder_path, orphan_jpg_basenames): # Delete the images that have no associated XML file
  """
  Delete JPG files that are not associated with any XML file.

  Args:
    folder_path (str): Path to the folder containing the files.
    orphan_jpg_basenames (list or set): Base filenames (without extension) of the orphan JPGs.
  """
  for img_base in orphan_jpg_basenames:
    img_path = os.path.join(folder_path, img_base + '.jpg')
    if os.path.exists(img_path):
      os.remove(img_path)
      print(f"Deleted: {img_base}.jpg")


delete_orphan_files(sard_dataset_path, inconsistencies_sard['images_without_xml'])

Images without XML: 2    are:  ['gss1128', 'gss806']
Deleted: gss1128.jpg
Deleted: gss806.jpg


In [None]:
is_image_present(sard_dataset_path, "gss806.jpg")

False

In [None]:
is_image_present(sard_dataset_path, "gss1128.jpg")

False

Now we delete the other 3 files, the ones referenced in the commented-out checks below (`gss974 (1)`, `gss98 (1)` and `gss99 (1)`):

In [None]:
# is_image_present(sard_dataset_path, "gss974 (1).jpg")

In [None]:
# is_image_present(sard_dataset_path, "gss98 (1).jpg")

In [None]:
# is_image_present(sard_dataset_path, "gss99 (1).jpg")

In [None]:
print("XML files without image:", len(inconsistencies_sard['xml_without_images']), "   are: ", inconsistencies_sard['xml_without_images'])

def delete_orphan_xml_files(folder_path, orphan_xml_basenames):  # Delete the XML files that have no associated image
  """
  Delete XML files that don't have a corresponding .jpg image.

  Args:
    folder_path (str): Path to the dataset folder.
    orphan_xml_basenames (list or set): List of base filenames (without extension) for orphan XMLs.
  """
  for xml_base in orphan_xml_basenames:
    xml_path = os.path.join(folder_path, xml_base + '.xml')
    if os.path.exists(xml_path):
      os.remove(xml_path)
      print(f"Deleted: {xml_base}.xml")
    else:
      print(f"File not found (already deleted?): {xml_base}.xml")


# delete_orphan_xml_files(sard_dataset_path, inconsistencies_sard['xml_without_images'])

XML files without image: 0    are:  []


In [None]:
# is_image_present(sard_dataset_path, "gss974 (1).jpg")

In [None]:
# is_image_present(sard_dataset_path, "gss98 (1).jpg")

In [None]:
# is_image_present(sard_dataset_path, "gss99 (1).jpg")

Let's check whether the CSV files still contain rows that refer to the deleted files. If so, we need to delete those rows too:

In [None]:
def check_removed_images_in_csv(csv_paths, removed_images): # Check whether the CSV files still reference the images we removed
  """
  Check if the deleted images are still present in the CSV files.

  Args:
    csv_paths (list of str): CSV paths to check.
    removed_images (list or set): Names of image files to look for.
  """
  for csv_file in csv_paths:
    print(f"\nCheck in: {os.path.basename(csv_file)}")
    df = pd.read_csv(csv_file)

    matches = df[df['filename'].isin(removed_images)]
    if not matches.empty:
      print(f"Found {len(matches)} lines related to deleted images:")
      print(matches['filename'].unique())
    else:
      print("No lines to delete found.")




folder_path_sard = '/content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD'

csv_sard_labels_path = os.path.join(folder_path_sard, 'sard_labels.csv')
csv_sard_person_labels_path = os.path.join(folder_path_sard, 'sard_person_labels.csv')
csv_files = [csv_sard_labels_path, csv_sard_person_labels_path]

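# The CSV 'filename' column stores names with the '.jpg' extension, while the consistency check returns bare basenames
removed_images_without_xml = [f"{name}.jpg" for name in inconsistencies_sard['images_without_xml']]
removed_xml_without_images = [f"{name}.jpg" for name in inconsistencies_sard['xml_without_images']]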

check_removed_images_in_csv(csv_files, removed_images_without_xml)
check_removed_images_in_csv(csv_files, removed_xml_without_images)


Check in: sard_labels.csv
No lines to delete found.

Check in: sard_person_labels.csv
No lines to delete found.

Check in: sard_labels.csv
No lines to delete found.

Check in: sard_person_labels.csv
No lines to delete found.


In [None]:
def find_really_empty_images_strict(dataset_path): # To find all the empty images
  """
  Find images in which every annotated bounding box has at least one coordinate that is very small (<= 3).

  Args:
    dataset_path (str): Path to dataset containing .jpg and .xml files.

  Returns:
    really_empty_images (set): Set of image filenames considered truly empty.
  """

  really_empty_images = set()

  for filename in os.listdir(dataset_path):
    if filename.endswith('.xml'):
      xml_path = os.path.join(dataset_path, filename)

      try:
        tree = ET.parse(xml_path)
        root = tree.getroot()

        labels = []
        suspicious_bboxes = []

        for obj in root.findall('object'):
          label = obj.find('name').text.strip()
          bndbox = obj.find('bndbox')
          xmin = int(bndbox.find('xmin').text.strip())
          ymin = int(bndbox.find('ymin').text.strip())
          xmax = int(bndbox.find('xmax').text.strip())
          ymax = int(bndbox.find('ymax').text.strip())

          labels.append(label)
          suspicious = (xmin <= 3 or ymin <= 3 or xmax <= 3 or ymax <= 3)
          suspicious_bboxes.append(suspicious)

        if labels and all(suspicious_bboxes):
          image_filename = root.find('filename').text.strip()
          really_empty_images.add(image_filename)

      except ET.ParseError:
        print(f"Warning: Failed to parse {xml_path}")

  return really_empty_images

sard_dataset_path = '/content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD'
really_empty_images = find_really_empty_images_strict(sard_dataset_path)

print("There are ", len(really_empty_images), " empty images")
print("Really empty images:", really_empty_images)

There are  4  empty images
Really empty images: {'gss1129.jpg', 'gss327.jpg', 'gss326.jpg', 'gss1138.jpg'}
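
The coordinate test in `find_really_empty_images_strict` uses `or`, so a box is flagged as soon as any single coordinate is <= 3. The sketch below contrasts this with a stricter "all coordinates" test on an invented VOC-style snippet; the XML content and the helper name `box_flags` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Invented annotation: a degenerate box and a real box close to the left border
VOC_SNIPPET = """
<annotation>
  <filename>example.jpg</filename>
  <object><name>person</name>
    <bndbox><xmin>0</xmin><ymin>0</ymin><xmax>1</xmax><ymax>1</ymax></bndbox>
  </object>
  <object><name>person</name>
    <bndbox><xmin>2</xmin><ymin>400</ymin><xmax>60</xmax><ymax>480</ymax></bndbox>
  </object>
</annotation>
""".strip()

def box_flags(xml_text, limit=3):
    """For each box, report whether any / all of its coordinates are <= limit."""
    root = ET.fromstring(xml_text)
    flags = []
    for obj in root.findall('object'):
        coords = [int(obj.find(f'bndbox/{tag}').text)
                  for tag in ('xmin', 'ymin', 'xmax', 'ymax')]
        flags.append({
            'any': any(c <= limit for c in coords),
            'all': all(c <= limit for c in coords),
        })
    return flags

flags = box_flags(VOC_SNIPPET)
print(flags)
```

The second box is a plausible person near the left edge, yet the `any` test flags it. Since an image is reported only when all of its boxes are flagged, the images found this way are worth a visual check before being treated as empty.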


We read the CSV file containing annotation information from the XML files associated with the images:

In [None]:
# To read a specific file csv
folder_path_sard = '/content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD'

# print("Files in directory:", os.listdir(folder_path_sard))
csv_files = [f for f in os.listdir(folder_path_sard) if f.lower().endswith('.csv')]
print("CSV files:", csv_files)

read_file_csv(folder_path_sard, 'sard_labels.csv')

CSV files: ['sard_labels.csv', 'sard_person_labels.csv']
Attempting to read: /content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD/sard_labels.csv


Unnamed: 0,filename,width,height,class,xmin,ymin,xmax,ymax
0,gss1299.jpg,1920,1080,Sitting,473,667,519,733
1,gss1604.jpg,1920,1080,Not-defined,927,317,953,347
2,gss1604.jpg,1920,1080,Not-defined,999,589,1033,636
3,gss801.jpg,1920,1080,Not-defined,315,394,391,456
4,gss501.jpg,1920,1080,Not-defined,1412,337,1453,376
...,...,...,...,...,...,...,...,...
6527,gss955.jpg,1920,1080,Standing,164,777,220,857
6528,gss955.jpg,1920,1080,Lying,1094,664,1199,702
6529,gss955.jpg,1920,1080,Sitting,1491,460,1516,489
6530,gss955.jpg,1920,1080,Lying,1668,595,1720,625


In [None]:
read_file_csv(folder_path_sard, 'sard_person_labels.csv')

Attempting to read: /content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD/sard_person_labels.csv


Unnamed: 0,filename,width,height,class,xmin,ymin,xmax,ymax
0,gss1299.jpg,1920,1080,person,473,667,519,733
1,gss1604.jpg,1920,1080,person,927,317,953,347
2,gss1604.jpg,1920,1080,person,999,589,1033,636
3,gss801.jpg,1920,1080,person,315,394,391,456
4,gss501.jpg,1920,1080,person,1412,337,1453,376
...,...,...,...,...,...,...,...,...
6527,gss955.jpg,1920,1080,person,164,777,220,857
6528,gss955.jpg,1920,1080,person,1094,664,1199,702
6529,gss955.jpg,1920,1080,person,1491,460,1516,489
6530,gss955.jpg,1920,1080,person,1668,595,1720,625


We immediately notice that the CSV is unordered, making it hard to read. \
We proceed by renaming the .jpg files (images) and the corresponding .xml files:

In [None]:
# As we see, the file is not sorted. We want to order it but first we should rename also the images and .xml files

def rename_images_and_xml_files(folder_path, prefix="gss"): # Rename .jpg images and .xml files to a zero-padded format
  """
  Rename .jpg and .xml files to the zero-padded format gssNNNN.jpg/xml, so that image and annotation names stay consistent.

  Args:
    folder_path (str): Path to the dataset folder containing the images and XML files.
    prefix (str): Filename prefix that precedes the number (default: 'gss').
  """
  for filename in os.listdir(folder_path):
    if filename.endswith('.jpg') or filename.endswith('.xml'):
      match = re.match(rf'{prefix}(\d+)\.(jpg|xml)', filename)
      if match:
        number = int(match.group(1))
        extension = match.group(2)
        new_name = f"{prefix}{str(number).zfill(4)}.{extension}"

        old_path = os.path.join(folder_path, filename)
        new_path = os.path.join(folder_path, new_name)

        if old_path != new_path:
          os.rename(old_path, new_path)
          # print(f"Renamed: {filename} → {new_name}")

  print("Files renamed successfully!")

rename_images_and_xml_files(folder_path_sard)

Files renamed successfully!


Now that we have correctly renamed the .jpg and .xml files, we continue by updating the filenames in the CSV file. \
Then, we sort the rows in the CSV file based on the filename.
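
The reason for zero-padding is that filenames are sorted as text, not as numbers. A minimal sketch with invented filenames shows the difference; the helper `pad_name` is introduced only for this example and mirrors the gssNNNN format used in this notebook.

```python
import re

names = ['gss10.jpg', 'gss2.jpg', 'gss1.jpg', 'gss100.jpg']

def pad_name(name, prefix='gss'):
    """Rewrite the first number in the name as a 4-digit, zero-padded value."""
    match = re.search(r'\d+', name)
    return f"{prefix}{int(match.group()):04d}.jpg" if match else name

plain_order = sorted(names)
padded_order = sorted(pad_name(n) for n in names)

print(plain_order)   # text order: 'gss100.jpg' comes before 'gss2.jpg'
print(padded_order)  # text order now matches numeric order
```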

In [None]:
# Modify the names in gssNNNN.jpg format
def format_filename(old_name, prefix="gss"):
  num = re.findall(r'\d+', old_name)
  if num:
    return f"{prefix}{int(num[0]):04d}.jpg"
  return old_name  # In case there are no numbers

In [None]:
def update_csv_filenames(folder_path, csv_file, prefix="gss"): # to change the filename column in a csv file
  """
  Update the 'filename' column in a CSV file with a new prefix in the format gssNNNN.jpg.

  Args:
    folder_path (str): Path to dataset containing csv
    csv_file (str): Name of the CSV file
    prefix (str): Prefix to add to the new filenames
  """
  # After renaming, update the filenames in the CSV
  csv_path = os.path.join(folder_path, csv_file)
  df = pd.read_csv(csv_path)

  df['filename'] = df['filename'].apply(lambda x: format_filename(x, prefix=prefix))

  # Sort the DataFrame by filename
  df = df.sort_values(by='filename').reset_index(drop=True)

  # Save the updated and sorted DataFrame
  df.to_csv(csv_path, index=False)

  # Print the updated DataFrame to verify the changes
  print(f"Updated CSV '{csv_file}':")
  print(df.head())  # Print first few rows to verify

In [None]:
update_csv_filenames(folder_path_sard, 'sard_labels.csv')
read_file_csv(folder_path_sard, 'sard_labels.csv')

Updated CSV 'sard_labels.csv':
      filename  width  height     class  xmin  ymin  xmax  ymax
0  gss0001.jpg   1920    1080   Walking  1110   358  1134   424
1  gss0001.jpg   1920    1080  Standing  1077   367  1100   428
2  gss0001.jpg   1920    1080   Sitting  1041   144  1061   173
3  gss0002.jpg   1920    1080   Sitting  1041   142  1062   173
4  gss0002.jpg   1920    1080  Standing  1079   365  1101   428
Attempting to read: /content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD/sard_labels.csv


Unnamed: 0,filename,width,height,class,xmin,ymin,xmax,ymax
0,gss0001.jpg,1920,1080,Walking,1110,358,1134,424
1,gss0001.jpg,1920,1080,Standing,1077,367,1100,428
2,gss0001.jpg,1920,1080,Sitting,1041,144,1061,173
3,gss0002.jpg,1920,1080,Sitting,1041,142,1062,173
4,gss0002.jpg,1920,1080,Standing,1079,365,1101,428
...,...,...,...,...,...,...,...,...
6527,gss2104.jpg,1920,1080,Standing,1174,604,1226,672
6528,gss2104.jpg,1920,1080,Standing,1226,1040,1266,1080
6529,gss2104.jpg,1920,1080,Standing,1391,931,1458,982
6530,gss2104.jpg,1920,1080,Standing,1548,1032,1608,1071


In [None]:
update_csv_filenames(folder_path_sard, 'sard_person_labels.csv')
read_file_csv(folder_path_sard, 'sard_person_labels.csv')

Updated CSV 'sard_person_labels.csv':
      filename  width  height   class  xmin  ymin  xmax  ymax
0  gss0001.jpg   1920    1080  person  1110   358  1134   424
1  gss0001.jpg   1920    1080  person  1077   367  1100   428
2  gss0001.jpg   1920    1080  person  1041   144  1061   173
3  gss0002.jpg   1920    1080  person  1041   142  1062   173
4  gss0002.jpg   1920    1080  person  1079   365  1101   428
Attempting to read: /content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD/sard_person_labels.csv


Unnamed: 0,filename,width,height,class,xmin,ymin,xmax,ymax
0,gss0001.jpg,1920,1080,person,1110,358,1134,424
1,gss0001.jpg,1920,1080,person,1077,367,1100,428
2,gss0001.jpg,1920,1080,person,1041,144,1061,173
3,gss0002.jpg,1920,1080,person,1041,142,1062,173
4,gss0002.jpg,1920,1080,person,1079,365,1101,428
...,...,...,...,...,...,...,...,...
6527,gss2104.jpg,1920,1080,person,1174,604,1226,672
6528,gss2104.jpg,1920,1080,person,1226,1040,1266,1080
6529,gss2104.jpg,1920,1080,person,1391,931,1458,982
6530,gss2104.jpg,1920,1080,person,1548,1032,1608,1071


Now we need to update the filenames in the XML files as well, to avoid inconsistencies:

In [None]:
def update_xml_filenames_sard(folder_path, prefix="gss"): # To update the filename tag in each xml file in a folder
  """
  Update the <filename> tag in each XML file in the folder to follow gssNNNN.jpg format.

  Args:
    folder_path (str): Path to the folder containing XML files.
    prefix (str): Prefix for the filenames (default: 'gss').

  Returns:
    list: List of XML filenames that were updated.
  """
  updated_files = []
  updated = 0

  for file in os.listdir(folder_path):
    if file.endswith('.xml'):
      xml_path = os.path.join(folder_path, file)
      try:
        tree = ET.parse(xml_path)
        root = tree.getroot()

        filename_tag = root.find('filename')
        if filename_tag is not None:
          old_name = filename_tag.text
          new_name = format_filename(old_name, prefix)

          if old_name != new_name:
            filename_tag.text = new_name
            tree.write(xml_path)
            updated += 1
            updated_files.append(file)

      except ET.ParseError:
        print(f"Failed to parse {file}")

  print(f"Updated <filename> in {updated} XML files.")
  return updated_files

# This function changes the <filename> tag only if 'old_name' differs from the expected value ('new_name')

update_xml_filenames_sard(folder_path_sard)

Updated <filename> in 939 XML files.


['gss0500.xml',
 'gss0050.xml',
 'gss0501.xml',
 'gss0504.xml',
 'gss0506.xml',
 'gss0051.xml',
 'gss0508.xml',
 'gss0507.xml',
 'gss0509.xml',
 'gss0510.xml',
 'gss0505.xml',
 'gss0511.xml',
 'gss0512.xml',
 'gss0518.xml',
 'gss0052.xml',
 'gss0516.xml',
 'gss0513.xml',
 'gss0515.xml',
 'gss0517.xml',
 'gss0514.xml',
 'gss0519.xml',
 'gss0521.xml',
 'gss0523.xml',
 'gss0527.xml',
 'gss0526.xml',
 'gss0525.xml',
 'gss0520.xml',
 'gss0524.xml',
 'gss0522.xml',
 'gss0530.xml',
 'gss0529.xml',
 'gss0528.xml',
 'gss0532.xml',
 'gss0531.xml',
 'gss0534.xml',
 'gss0053.xml',
 'gss0533.xml',
 'gss0054.xml',
 'gss0541.xml',
 'gss0542.xml',
 'gss0537.xml',
 'gss0540.xml',
 'gss0538.xml',
 'gss0539.xml',
 'gss0535.xml',
 'gss0536.xml',
 'gss0548.xml',
 'gss0543.xml',
 'gss0546.xml',
 'gss0549.xml',
 'gss0544.xml',
 'gss0547.xml',
 'gss0545.xml',
 'gss0055.xml',
 'gss0555.xml',
 'gss0550.xml',
 'gss0557.xml',
 'gss0553.xml',
 'gss0551.xml',
 'gss0552.xml',
 'gss0556.xml',
 'gss0554.xml',
 'gss055

The number of updated files does not match the total number of XML files. \
We then check for inconsistencies:

In [None]:
def find_unmodified_xml_files(folder_path):
  """
  Check which .xml files have not updated the <filename> field consistently with their names.
  Example: if the file is named 'gss0123.xml', the <filename> field should be 'gss0123.jpg'.

  Args:
    folder_path (str): path to the folder with .xml files

  Returns:
    list: xml file names that have not been edited correctly
  """
  not_updated = []

  for file in os.listdir(folder_path):
    if file.endswith('.xml'):
      expected_filename = os.path.splitext(file)[0] + '.jpg'
      xml_path = os.path.join(folder_path, file)

      try:
        tree = ET.parse(xml_path)
        root = tree.getroot()

        filename_tag = root.find('filename')
        if filename_tag is not None:
          current_value = (filename_tag.text or '').strip()  # an empty <filename/> tag has text None
          if current_value != expected_filename:
            not_updated.append(file)

      except ET.ParseError:
        print(f"Error in the parsing of: {file}")

  print(f"Found {len(not_updated)} outdated XML files.")
  return not_updated


non_modificated = find_unmodified_xml_files(folder_path_sard)

print(non_modificated)

Found 0 outdated XML files.
[]


Perfect, all XML files are fully updated internally.

\
As a cross-check, we rerun the update and look for any files that remain inconsistent:

In [None]:
folder_path_sard = '/content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD'

updated = set(update_xml_filenames_sard(folder_path_sard))
not_updated = set(find_unmodified_xml_files(folder_path_sard))

# Check overlap or discrepancies
print("Files that should have been updated but have not:")
print(updated & not_updated)

print("\nFiles that have been updated correctly:")
print(updated - not_updated)

Updated <filename> in 0 XML files.
Found 0 outdated XML files.
Files that should have been updated but have not:
set()

Files that have been updated correctly:
set()
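
Both sets are empty because a second run finds nothing left to change: the update only rewrites a file when its `<filename>` value differs from the expected one. The sketch below illustrates that idempotent behaviour on a temporary folder. `pad_name` is a hypothetical stand-in for the notebook's `format_filename` (whose real rules are defined earlier and may differ), and `update_filenames` is a simplified variant written only for this illustration.

```python
import os
import re
import tempfile
import xml.etree.ElementTree as ET

def pad_name(name, prefix="gss", digits=4):
  """Hypothetical stand-in for format_filename: 'gss7.jpg' -> 'gss0007.jpg'."""
  match = re.fullmatch(rf"{prefix}(\d+)(\.\w+)", name)
  if match is None:
    return name  # leave unexpected names untouched
  number, ext = match.groups()
  return f"{prefix}{int(number):0{digits}d}{ext}"

def update_filenames(folder):
  """Rewrite <filename> only when it differs from the padded form."""
  changed = []
  for file in sorted(os.listdir(folder)):
    if not file.endswith(".xml"):
      continue
    path = os.path.join(folder, file)
    tree = ET.parse(path)
    tag = tree.getroot().find("filename")
    if tag is None or tag.text is None:
      continue  # nothing to normalize
    new_name = pad_name(tag.text)
    if tag.text != new_name:
      tag.text = new_name
      tree.write(path)
      changed.append(file)
  return changed

with tempfile.TemporaryDirectory() as tmp:
  # One file with an outdated tag, one that is already correct
  for xml_name, inner in [("gss0007.xml", "gss7.jpg"), ("gss0008.xml", "gss0008.jpg")]:
    with open(os.path.join(tmp, xml_name), "w") as f:
      f.write(f"<annotation><filename>{inner}</filename></annotation>")

  first_run = update_filenames(tmp)
  second_run = update_filenames(tmp)

print(first_run)   # only the outdated file is rewritten
print(second_run)  # nothing left to change
```

This is also why the number of updated files can be lower than the total number of XML files without indicating a problem: files whose tag was already in the expected form are skipped.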


There are no inconsistencies. \
We normalize the class names in the _sard_labels.csv_ file to match those in the XML annotations:

In [None]:
def map_class_name(label):
  """
  Map a class name from CSV to its corresponding XML label.
  """
  class_mapping = {
    "Standing": "stands",
    "Not-defined": "not_defined",
    "Lying": "laying_down",
    "Sitting": "seated",
    "Walking": "Walking",
    "Running": "Running"
  }
  return class_mapping.get(label, label)

def normalize_csv_annotations(folder_path, csv_filename): # To normalize class name in a csv file to match XML annotation format
  """
  Normalize class names in a CSV file to match XML annotation format.

  Args:
    folder_path (str): Path to the folder containing the CSV file.
    csv_filename (str): Name of the CSV file.
  """
  csv_path = os.path.join(folder_path, csv_filename)
  df = pd.read_csv(csv_path)

  # Normalize class names
  df['class'] = df['class'].apply(map_class_name).str.lower().str.strip()

  # Sort by filename
  df = df.sort_values(by='filename').reset_index(drop=True)

  # Save the modified CSV
  df.to_csv(csv_path, index=False)
  print(f"CSV '{csv_filename}' normalized and saved.")
  return df

normalize_csv_annotations(folder_path_sard, 'sard_labels.csv')

CSV 'sard_labels.csv' normalized and saved.


Unnamed: 0,filename,width,height,class,xmin,ymin,xmax,ymax
0,gss0001.jpg,1920,1080,walking,1110,358,1134,424
1,gss0001.jpg,1920,1080,stands,1077,367,1100,428
2,gss0001.jpg,1920,1080,seated,1041,144,1061,173
3,gss0002.jpg,1920,1080,seated,1041,142,1062,173
4,gss0002.jpg,1920,1080,stands,1079,365,1101,428
...,...,...,...,...,...,...,...,...
6527,gss2104.jpg,1920,1080,stands,1391,931,1458,982
6528,gss2104.jpg,1920,1080,stands,1548,1032,1608,1071
6529,gss2104.jpg,1920,1080,stands,1174,604,1226,672
6530,gss2104.jpg,1920,1080,stands,1226,1040,1266,1080
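
Note that `class_mapping.get(label, label)` passes through any label that is not in the mapping, so an unexpected spelling in the CSV would survive normalization unnoticed. Below is a minimal, dependency-free sketch of a check for that. It assumes the six class names that appear in this notebook's outputs as the target vocabulary; `EXPECTED_XML_CLASSES`, `normalize_label` and `unmapped_labels` are illustrative names, not part of the notebook. Unlike the cell above, the sketch strips whitespace before applying the mapping.

```python
CLASS_MAPPING = {
  "Standing": "stands",
  "Not-defined": "not_defined",
  "Lying": "laying_down",
  "Sitting": "seated",
  "Walking": "Walking",
  "Running": "Running",
}

# Assumed target vocabulary, taken from the class names printed in this notebook
EXPECTED_XML_CLASSES = {"stands", "not_defined", "laying_down", "seated", "walking", "running"}

def normalize_label(label):
  """Strip whitespace, apply the mapping, then lowercase."""
  cleaned = label.strip()
  return CLASS_MAPPING.get(cleaned, cleaned).lower()

def unmapped_labels(labels):
  """Return the normalized labels that fall outside the expected vocabulary."""
  return {normalize_label(label) for label in labels} - EXPECTED_XML_CLASSES

sample = ["Standing", " Lying", "Walking", "Crouching"]  # 'Crouching' is a made-up outlier
normalized = [normalize_label(label) for label in sample]
leftovers = unmapped_labels(sample)

print(normalized)
print(leftovers)
```

An empty result from `unmapped_labels` on the real `class` column would indicate that every label was covered by the mapping.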


Let's check whether the annotations in the CSV are correct by comparing them with the corresponding XML files:

In [None]:
result = check_csv_vs_xml_annotations(folder_path_sard, 'sard_labels.csv')
print("Result for sard_labels.csv:\n")
print("Correct Annotations:", result['matched'])
print("Annotations only in CSV:", result['only_in_csv'])
print("Annotations only in XML:", result['only_in_xml'])


result_person = check_csv_vs_xml_annotations(folder_path_sard, 'sard_person_labels.csv')
print("\nResult for sard_person_labels.csv:\n")
print("Correct Annotations:", result_person['matched'])
print("Annotations only in CSV:", result_person['only_in_csv'])
print("Annotations only in XML:", result_person['only_in_xml'])

# if result['files_only_in_csv']:
#   print("Images with annotations only in CSV:")
#   for f in result['files_only_in_csv']:
#     print("  -", f)

# if result['files_only_in_xml']:
#   print("Images with annotations only in XML:")
#   for f in result['files_only_in_xml']:
#     print("  -", f)

Result for sard_labels.csv:

Correct Annotations: 6532
Annotations only in CSV: 0
Annotations only in XML: 0

Result for sard_person_labels.csv:

Correct Annotations: 0
Annotations only in CSV: 6532
Annotations only in XML: 6532


In [None]:
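# Compare the two CSV files: they should be identical once the class labels of the full CSV are replaced with "person".
# (The 0 matches against the XML files above are presumably due to the XML annotations still using the original posture classes.)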
def compare_personified_csvs(folder_path, full_csv='sard_labels.csv', person_csv='sard_person_labels.csv'):
  """
  Compares two CSV files where one contains full class labels and the other has all labels replaced with 'person'.

  Args:
    folder_path (str): Path to the folder containing the CSV files.
    full_csv (str): CSV file with full class names.
    person_csv (str): CSV file with all classes replaced by 'person'.

  Returns:
    bool: True if the two CSV files match after class normalization, False otherwise.
  """
  path_full = os.path.join(folder_path, full_csv)
  path_person = os.path.join(folder_path, person_csv)

  df_full = pd.read_csv(path_full)
  df_person = pd.read_csv(path_person)

  # Normalize the full CSV to have 'person' as the class
  df_full_personified = df_full.copy()
  df_full_personified['class'] = 'person'

  # Sort both for consistent comparison
  df_full_personified = df_full_personified.sort_values(by=['filename', 'xmin', 'ymin', 'xmax', 'ymax']).reset_index(drop=True)
  df_person = df_person.sort_values(by=['filename', 'xmin', 'ymin', 'xmax', 'ymax']).reset_index(drop=True)

  # Compare relevant columns
  cols_to_compare = ['filename', 'xmin', 'ymin', 'xmax', 'ymax', 'class']
  comparison = df_full_personified[cols_to_compare].equals(df_person[cols_to_compare])

  if comparison:
    print("The CSV files match after normalizing class names to 'person'.")
  else:
    mismatches = df_full_personified[cols_to_compare].compare(df_person[cols_to_compare])
    print("Differences found between the two CSVs after normalizing classes.")
    print("Sample differences:")
    print(mismatches.head(10))

  return comparison


compare_personified_csvs(folder_path_sard)

The CSV files match after normalizing class names to 'person'.


True
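
This result fits with the earlier check, where `sard_person_labels.csv` had 0 matches against the XML files: a comparison that includes the class name fails on every row as long as one side says `person` and the other uses posture classes, even when the boxes agree. The sketch below shows the effect with plain sets of tuples. It is an illustration with a made-up helper name, not the implementation of `check_csv_vs_xml_annotations`.

```python
def compare_annotations(csv_rows, xml_rows, use_class=True):
  """Compare two annotation lists as sets of (filename, box[, class]) tuples."""
  def key(row):
    box = (row["filename"], row["xmin"], row["ymin"], row["xmax"], row["ymax"])
    return box + ((row["class"],) if use_class else ())

  csv_set = {key(r) for r in csv_rows}
  xml_set = {key(r) for r in xml_rows}
  return {
    "matched": len(csv_set & xml_set),
    "only_in_csv": len(csv_set - xml_set),
    "only_in_xml": len(xml_set - csv_set),
  }

# Two boxes of gss0001.jpg as they appear in the tables above
xml_rows = [
  {"filename": "gss0001.jpg", "class": "walking", "xmin": 1110, "ymin": 358, "xmax": 1134, "ymax": 424},
  {"filename": "gss0001.jpg", "class": "stands", "xmin": 1077, "ymin": 367, "xmax": 1100, "ymax": 428},
]
person_rows = [{**row, "class": "person"} for row in xml_rows]

with_class = compare_annotations(person_rows, xml_rows)
boxes_only = compare_annotations(person_rows, xml_rows, use_class=False)

print(with_class)   # class included in the key
print(boxes_only)   # boxes only
```

Because sets collapse duplicates, this sketch would count two identical annotations as one; a real check may need to handle that case separately.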

We start with some general analysis of the cleaned dataset:

In [None]:
folder_path_sard = '/content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD'
csv_path_sard = os.path.join(folder_path_sard, 'sard_labels.csv')

risultati = analyze_dataset_annotations(csv_path_sard, folder_path_sard)

print("How many images for each label:")
print(risultati["label_count"])

print("\nDistribution of bbox for image:")
for n_bbox, n_images in risultati["distribution_of_bbox"].items():
  print(f"There are {n_images} images with {n_bbox} bbox")

print("\nImages with at least one label (bounding box):", risultati["number_images_annotated"])
print("Total images:", risultati["number_all_images"])
print("Images without any label:", risultati["number_images_without_annotations"])
print("The images without any label are:", risultati["images_without_annotations"])


How many images for each label:
class
stands         1896
laying_down    1591
walking        1303
not_defined    1039
seated          624
running          79
Name: count, dtype: int64

Distribution of bbox for image:
There are 305 images with 1 bbox
There are 649 images with 2 bbox
There are 355 images with 3 bbox
There are 197 images with 4 bbox
There are 141 images with 5 bbox
There are 136 images with 6 bbox
There are 95 images with 7 bbox
There are 37 images with 8 bbox
There are 66 images with 9 bbox

Images with at least one label (bounding box): 1981
Total images: 1981
Images without any label: 0
The images without any label are: set()


However, we are more interested in analyzing _sard_person_labels.csv_:

In [None]:
csv_path_sard = os.path.join(folder_path_sard, 'sard_person_labels.csv')

risultati = analyze_dataset_annotations(csv_path_sard, folder_path_sard)

print("How many images for each label:")
print(risultati["label_count"])

print("\nDistribution of bbox for image:")
for n_bbox, n_images in risultati["distribution_of_bbox"].items():
  print(f"There are {n_images} images with {n_bbox} bbox")

print("\nImages with at least one label (bounding box):", risultati["number_images_annotated"])
print("Total images:", risultati["number_all_images"])
print("Images without any label:", risultati["number_images_without_annotations"])
print("The images without any label are:", risultati["images_without_annotations"])

How many images for each label:
class
person    6532
Name: count, dtype: int64

Distribution of bbox for image:
There are 305 images with 1 bbox
There are 649 images with 2 bbox
There are 355 images with 3 bbox
There are 197 images with 4 bbox
There are 141 images with 5 bbox
There are 136 images with 6 bbox
There are 95 images with 7 bbox
There are 37 images with 8 bbox
There are 66 images with 9 bbox

Images with at least one label (bounding box): 1981
Total images: 1981
Images without any label: 0
The images without any label are: set()
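
The two reports can be cross-checked against each other: summing the per-image distribution should give back the number of annotated images, and weighting each bucket by its box count should give back the total number of annotations. A quick check using only the figures printed above:

```python
# Number of images (value) that contain a given number of bounding boxes (key), as printed above
distribution_of_bbox = {1: 305, 2: 649, 3: 355, 4: 197, 5: 141, 6: 136, 7: 95, 8: 37, 9: 66}

# Per-class annotation counts reported for sard_labels.csv, as printed above
label_count = {"stands": 1896, "laying_down": 1591, "walking": 1303,
               "not_defined": 1039, "seated": 624, "running": 79}

total_images = sum(distribution_of_bbox.values())
total_boxes = sum(n_bbox * n_images for n_bbox, n_images in distribution_of_bbox.items())
total_labels = sum(label_count.values())

print(total_images)  # 1981
print(total_boxes)   # 6532
print(total_labels)  # 6532
print(f"Average boxes per image: {total_boxes / total_images:.2f}")
```

All three totals agree with the reports, so the posture-level CSV and the person-level CSV describe the same 6532 boxes on the same 1981 images.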


We check that all images have the same dimensions:

In [None]:
print("\nAbout 'sard_person_labels.csv': ")
csv_sard_person_labels_path = os.path.join(folder_path_sard, 'sard_person_labels.csv')

check_image_dimensions_consistency(csv_sard_person_labels_path)


About 'sard_person_labels.csv': 
All images have the same size: 1920x1080


Unnamed: 0,width,height,image_count
0,1920,1080,1981
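
`check_image_dimensions_consistency` is defined earlier in the notebook. As a reminder of the idea, here is a small sketch (with a hypothetical helper name) that counts distinct `(width, height)` pairs per image rather than per row, so that images with many boxes are not over-counted.

```python
from collections import Counter

def count_image_sizes(rows):
  """Count images per (width, height), using one entry per unique filename."""
  size_by_image = {}
  for row in rows:
    size_by_image[row["filename"]] = (row["width"], row["height"])
  return Counter(size_by_image.values())

rows = [
  {"filename": "gss0001.jpg", "width": 1920, "height": 1080},
  {"filename": "gss0001.jpg", "width": 1920, "height": 1080},  # second box, same image
  {"filename": "gss0002.jpg", "width": 1920, "height": 1080},
]

sizes = count_image_sizes(rows)
print(sizes)
print("consistent" if len(sizes) == 1 else "mixed sizes")
```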


Now we verify that all bounding boxes are within the image boundaries:

In [None]:
csv_sard_person_labels_path = os.path.join(folder_path_sard, 'sard_person_labels.csv')

# 1) Estimate the coordinate indexing convention (0-based or 1-based)
info_sard = infer_indexing_from_csv_verbose(csv_sard_person_labels_path, thresh=0.2)
print("SARD indexing:", info_sard)

# 2) flag for checking (fallback: if ambiguous, assume VOC 1-based)
voc_flag_sard = (info_sard["indexing"] == "1-based") or (info_sard["indexing"] == "ambiguous")

# 3) OOB with correct flag
invalid_bboxes_sard_person_labels = check_bboxes_out_of_bounds_from_csv(
  csv_sard_person_labels_path, voc_one_indexed=voc_flag_sard
)
print(invalid_bboxes_sard_person_labels[['filename','xmin','ymin','xmax','ymax','width','height']])

Checked all entries. Found 0 invalid bounding boxes.
Empty DataFrame
Columns: [filename, xmin, ymin, xmax, ymax, width, height]
Index: []
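
Whether a box with `ymax = 1080` in a 1080-pixel-high image counts as out of bounds depends on the indexing convention, which is why the check takes a `voc_one_indexed` flag. The sketch below spells out one possible rule. The exact limits used by `check_bboxes_out_of_bounds_from_csv` are defined earlier in the notebook and may differ; `bbox_in_bounds` is a hypothetical helper.

```python
def bbox_in_bounds(xmin, ymin, xmax, ymax, width, height, one_indexed=True):
  """
  Assumed rule: with 1-based (VOC-style) pixel coordinates the valid range is
  [1, width] x [1, height]; with 0-based coordinates it is [0, width-1] x [0, height-1].
  """
  low = 1 if one_indexed else 0
  max_x = width if one_indexed else width - 1
  max_y = height if one_indexed else height - 1
  well_formed = xmin < xmax and ymin < ymax
  inside = low <= xmin and low <= ymin and xmax <= max_x and ymax <= max_y
  return well_formed and inside

# A box of gss2104.jpg from the tables above that touches the bottom edge
box = dict(xmin=1226, ymin=1040, xmax=1266, ymax=1080, width=1920, height=1080)

valid_one_based = bbox_in_bounds(**box, one_indexed=True)
valid_zero_based = bbox_in_bounds(**box, one_indexed=False)

print(valid_one_based)
print(valid_zero_based)
```

Under this rule the same box is valid with 1-based coordinates and invalid with 0-based ones, which is the kind of difference the indexing estimate is meant to resolve before any box is flagged.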


_Note on the "not_defined" label:_ \
In the SARD dataset, the "not_defined" label is used when a person is clearly present in the image, but their activity or posture cannot be reliably classified (e.g., due to occlusion or ambiguity). \
Since my task is person detection (i.e., detecting the presence of a person, regardless of their behavior), I decided to treat all "not_defined" annotations as valid person instances. \
This avoids undercounting people in cases where the action is unclear but their presence is certain.

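We now categorize the images into three groups to identify potential annotation errors: images whose bounding boxes are all suspicious (case 1), images with a mix of suspicious and regular bounding boxes (case 2), and images with no annotations (case 3). \
This check allows us to avoid working with misannotated or unannotated images, improving dataset quality.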

In [None]:
folder_path_sard = '/content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD'

csv_sard_person_labels_path = os.path.join(folder_path_sard, 'sard_person_labels.csv')

result = classify_image_annotation_quality(csv_sard_person_labels_path, folder_path_sard)

print(f"Case 1 (only bbox suspicious): {len(result['case1_all_suspicious'])} and they are: {result['case1_all_suspicious']}")
print(f"Case 2 (bbox mix): {len(result['case2_some_suspicious'])}")
print(f"Case 3 (no annotation): {len(result['case3_no_annotations'])} and they are: {result['case3_no_annotations']}")

# The images in cases 1 and 3 are of little use in this unbalanced dataset, so I am going to delete them.

Case 1 (only bbox suspicious): 4 and they are: {'gss0326.jpg', 'gss1129.jpg', 'gss0327.jpg', 'gss1138.jpg'}
Case 2 (bbox mix): 107
Case 3 (no annotation): 0 and they are: set()
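
For readers who want to see how such a three-way split can be computed, here is a minimal sketch. It is not the code of `classify_image_annotation_quality`: the rule for "suspicious" (a box side shorter than `MIN_SIDE` pixels) and the threshold value are assumptions made for illustration, and the filenames are invented.

```python
from collections import defaultdict

MIN_SIDE = 10  # assumed threshold in pixels, for illustration only

def is_suspicious(row, min_side=MIN_SIDE):
  """Treat a box as suspicious when either side is shorter than min_side."""
  return (row["xmax"] - row["xmin"]) < min_side or (row["ymax"] - row["ymin"]) < min_side

def classify_images(rows, all_images):
  """Split images into the three cases based on the flags of their boxes."""
  flags = defaultdict(list)
  for row in rows:
    flags[row["filename"]].append(is_suspicious(row))

  result = {"case1_all_suspicious": set(), "case2_some_suspicious": set(), "case3_no_annotations": set()}
  for image in all_images:
    image_flags = flags.get(image, [])
    if not image_flags:
      result["case3_no_annotations"].add(image)
    elif all(image_flags):
      result["case1_all_suspicious"].add(image)
    elif any(image_flags):
      result["case2_some_suspicious"].add(image)
  return result

rows = [
  {"filename": "a.jpg", "xmin": 10, "ymin": 10, "xmax": 14, "ymax": 15},    # tiny box
  {"filename": "b.jpg", "xmin": 10, "ymin": 10, "xmax": 14, "ymax": 15},    # tiny box
  {"filename": "b.jpg", "xmin": 100, "ymin": 100, "xmax": 160, "ymax": 220},
  {"filename": "d.jpg", "xmin": 100, "ymin": 100, "xmax": 160, "ymax": 220},
]
cases = classify_images(rows, all_images=["a.jpg", "b.jpg", "c.jpg", "d.jpg"])
print(cases)
```

Images such as `d.jpg`, whose boxes are all regular, fall into none of the three cases.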


We notice that the repeating images are always the same. \
We now delete the images identified as suspicious, specifically: images with very small bounding boxes (case 1), and images with no annotations (case 3):

In [None]:
# First, let's preview which rows would be removed from the CSVs, without editing them.

csv_sard_labels_path = os.path.join(folder_path_sard, 'sard_labels.csv')
csv_sard_person_labels_path = os.path.join(folder_path_sard, 'sard_person_labels.csv')

csv_files = [csv_sard_labels_path, csv_sard_person_labels_path]

print("case1: ", result['case1_all_suspicious'])
print("case3: ", result['case3_no_annotations'])

images_to_remove = result['case1_all_suspicious'].union(result['case3_no_annotations'])

# Show what would be eliminated
preview_dataset_cleaning(csv_files, images_to_remove)

case1:  {'gss0326.jpg', 'gss1129.jpg', 'gss0327.jpg', 'gss1138.jpg'}
case3:  set()

 /content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD/sard_labels.csv: 4 rows to remove
Filenames involved: ['gss0326.jpg' 'gss0327.jpg' 'gss1129.jpg' 'gss1138.jpg']

 /content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD/sard_person_labels.csv: 4 rows to remove
Filenames involved: ['gss0326.jpg' 'gss0327.jpg' 'gss1129.jpg' 'gss1138.jpg']


In [None]:
# Ok now we can really delete the images

items_deleted = clean_dataset(folder_path_sard, csv_files, images_to_remove)
print("Number of images deleted: ", len(items_deleted["deleted_images"]))
print("Number of xml files deleted: ", len(items_deleted["deleted_xmls"]))
print("Rows deleted: ", len(items_deleted["removed_csv_rows"]))


Check on /content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD/sard_labels.csv: 4 rows to delete
Filenames: ['gss0326.jpg' 'gss0327.jpg' 'gss1129.jpg' 'gss1138.jpg']
Updated /content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD/sard_labels.csv: 4 rows removed
Number of images deleted:  4
Number of xml files deleted:  4
Rows deleted:  1


In [None]:
is_image_present(sard_dataset_path, "gss0329.jpg")

False

Now let's rerun the basic analysis:

In [None]:
csv_path_sard = os.path.join(folder_path_sard, 'sard_person_labels.csv')

results = analyze_dataset_annotations(csv_path_sard, folder_path_sard)

print("How many images for each label:")
print(results["label_count"])

print("\nDistribution of bbox for image:")
for n_bbox, n_images in results["distribution_of_bbox"].items():
  print(f"There are {n_images} images with {n_bbox} bbox")

print("\nImages with at least one label (bounding box):", results["number_images_annotated"])
print("Total images:", results["number_all_images"])
print("Images without any label:", results["number_images_without_annotations"])
print("The images without any label are:", results["images_without_annotations"])

How many images for each label:
class
person    6532
Name: count, dtype: int64

Distribution of bbox for image:
There are 305 images with 1 bbox
There are 649 images with 2 bbox
There are 355 images with 3 bbox
There are 197 images with 4 bbox
There are 141 images with 5 bbox
There are 136 images with 6 bbox
There are 95 images with 7 bbox
There are 37 images with 8 bbox
There are 66 images with 9 bbox

Images with at least one label (bounding box): 1981
Total images: 1977
Images without any label: 0
The images without any label are: set()
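
After the cleanup, the report lists 1981 annotated images but only 1977 image files, and the class count is still 6532. One possible explanation is that this CSV still contains rows for the four deleted images (the deletion log above mentions only `sard_labels.csv`). The sketch below shows one way to check for such leftovers; `filenames_missing_on_disk` is a hypothetical helper, and the toy folder stands in for the real dataset.

```python
import csv
import os
import tempfile

def filenames_missing_on_disk(csv_path, image_folder):
  """Return the filenames listed in the CSV that have no file in image_folder."""
  with open(csv_path, newline="") as f:
    listed = {row["filename"] for row in csv.DictReader(f)}
  present = set(os.listdir(image_folder))
  return listed - present

with tempfile.TemporaryDirectory() as tmp:
  # Toy folder: one image exists, one was "deleted" but is still listed in the CSV
  open(os.path.join(tmp, "gss0001.jpg"), "wb").close()
  csv_path = os.path.join(tmp, "labels.csv")
  with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "class"])
    writer.writerow(["gss0001.jpg", "person"])
    writer.writerow(["gss0326.jpg", "person"])

  missing = filenames_missing_on_disk(csv_path, tmp)

print(missing)
```

On the real data, a call such as `filenames_missing_on_disk(csv_path_sard, folder_path_sard)` would be expected to return an empty set once the CSV and the image folder are in sync.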


Finally, for the future merge, we want to unify all classes to 'person':

In [None]:
def unify_sard_classes_to_person(folder_path, new_class="person"):
  """
  Update all classes (<name>) in SARD XML files to 'person'.

  Args:
    folder_path (str): Folder containing XML files.
    new_class (str): Class name to be set (default: 'person').

  Returns:
    int: Number of updated XML files.
  """
  updated_files = 0

  for file in os.listdir(folder_path):
    if file.endswith('.xml'):
      xml_path = os.path.join(folder_path, file)

      try:
        tree = ET.parse(xml_path)
        root = tree.getroot()

        modified = False
        for obj in root.findall('object'):
          name_tag = obj.find('name')
          if name_tag is not None and name_tag.text != new_class:
            name_tag.text = new_class
            modified = True

        if modified:
          tree.write(xml_path)
          updated_files += 1

      except ET.ParseError:
        print(f"Error in the parsing of: {file}")

  print(f"Updated {updated_files} XML files.")
  return updated_files # number of updated files

In [None]:
folder_path_sard = "/content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD"
unify_sard_classes_to_person(folder_path_sard)

Updated 1977 XML files.


1977

Now let's verify that the update worked:

In [None]:
def count_xml_with_non_person_classes(folder_path):
  """
  Count how many XML files contain at least one class other than 'person'.

  Args:
    folder_path (str): path to folder with XML files

  Returns:
    int: Number of XML files that contain at least one non 'person' class
    list: list of the names of these files
  """
  non_person_files = []

  for file in os.listdir(folder_path):
    if file.endswith(".xml"):
      file_path = os.path.join(folder_path, file)
      try:
        tree = ET.parse(file_path)
        root = tree.getroot()

        # Checks all objects in the file
        found_non_person = False
        for obj in root.findall("object"):
          name_tag = obj.find("name")
          if name_tag is not None and (name_tag.text or "").strip().lower() != "person":  # an empty <name/> counts as non-person
            found_non_person = True

        if found_non_person:
          non_person_files.append(file)

      except ET.ParseError:
        print(f"Error in the parsing of: {file}")

  print(f"XML files with at least one class other than 'person': {len(non_person_files)}")
  return len(non_person_files), non_person_files

In [None]:
count, files = count_xml_with_non_person_classes(folder_path_sard) # Expected 0 files without person as class

XML files with at least one class other than 'person': 0


_Note:_ \
The SARD dataset is imbalanced: there are very few (almost none) empty images (i.e., without people). A model trained only on such data may learn to expect a person in every image, so I believe it's better to include a certain number of images without people. \
To address this issue, I decided to use another dataset, 'Herida', which contains both images with and without people. \
We now proceed with cleaning that dataset, and later we will merge it with SARD, creating a dataset suitable for my objective.

#### Images with effects

In [None]:
# To rename images with effects
folder_path_fog = '/content/drive/MyDrive/projectUPV/datasets/SARD_dataset/Corr/fog'
folder_path_snow = '/content/drive/MyDrive/projectUPV/datasets/SARD_dataset/Corr/snow'
folder_path_motion_blur = '/content/drive/MyDrive/projectUPV/datasets/SARD_dataset/Corr/motion_blur'

prefix_fog = 'fog'
prefix_snow = 'snow'
prefix_motion_blur = 'motion_blur'

def rename_files(folder_path, prefix):
  """
  Rename images with effects (fog, snow, motion blur)

  Args:
    folder_path (str): Path to dataset containing images
    prefix (str): Prefix to add to the new filenames
  """

  files = sorted([f for f in os.listdir(folder_path) if f.lower().endswith((".jpg", ".jpeg", ".png"))]) # to order files by name
  for i, filename in enumerate(files, start=1):
    ext = os.path.splitext(filename)[1] # to separate "namefile.jpg" in ("namefile", ".jpg"), and [1] to take just the extension
    new_name = f"{prefix}_{i:04d}{ext}" # to build the new name
    old_path = os.path.join(folder_path, filename)
    new_path = os.path.join(folder_path, new_name)
    os.rename(old_path, new_path)

rename_files(folder_path_fog, prefix_fog)
rename_files(folder_path_snow, prefix_snow)
rename_files(folder_path_motion_blur, prefix_motion_blur)
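
Since `os.rename` works in place, it can be worth previewing the plan before running it: on POSIX systems, renaming onto an existing file name silently replaces that file, which could happen if a folder already contains names in the `prefix_NNNN` form. Below is a sketch of a dry run. `plan_renames` and `find_collisions` are hypothetical helpers that mirror the numbering logic of `rename_files` without touching the disk.

```python
import os

def plan_renames(filenames, prefix):
  """Build the old -> new mapping of a sequential rename, without touching the disk."""
  images = sorted(f for f in filenames if f.lower().endswith((".jpg", ".jpeg", ".png")))
  plan = {}
  for i, old_name in enumerate(images, start=1):
    ext = os.path.splitext(old_name)[1]
    plan[old_name] = f"{prefix}_{i:04d}{ext}"
  return plan

def find_collisions(plan):
  """Targets that coincide with the name of another source file (a conservative check)."""
  sources = set(plan)
  return {new for old, new in plan.items() if new in sources and new != old}

plan = plan_renames(["gss7.jpg", "gss2.jpg", "notes.txt"], "fog")
collisions = find_collisions(plan)
risky = find_collisions(plan_renames(["fog_0002.jpg", "fog_0003.jpg"], "fog"))

print(plan)
print(collisions)
print(risky)
```

Keep in mind that `sorted` orders names lexicographically, so unpadded names such as `gss12.jpg` sort before `gss2.jpg`; the numbering follows the original order only if the source names are zero-padded or otherwise sort as intended.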

In [None]:

fog_mapping = {
  'fog_0001.jpg': 'gss2.jpg', 'fog_0002.jpg': 'gss4.jpg', 'fog_0003.jpg': 'gss7.jpg', 'fog_0004.jpg': 'gss9.jpg', 'fog_0005.jpg': 'gss12.jpg', 'fog_0006.jpg': 'gss14.jpg', 'fog_0007.jpg': 'gss17.jpg', 'fog_0008.jpg': 'gss19.jpg', 'fog_0009.jpg': 'gss22.jpg', 'fog_0010.jpg': 'gss24.jpg', 'fog_0011.jpg': 'gss27.jpg', 'fog_0012.jpg': 'gss29.jpg', 'fog_0013.jpg': 'gss32.jpg', 'fog_0014.jpg': 'gss34.jpg', 'fog_0015.jpg': 'gss37.jpg', 'fog_0016.jpg': 'gss39.jpg', 'fog_0017.jpg': 'gss42.jpg', 'fog_0018.jpg': 'gss44.jpg', 'fog_0019.jpg': 'gss47.jpg', 'fog_0020.jpg': 'gss49.jpg',
  'fog_0021.jpg': 'gss52.jpg', 'fog_0022.jpg': 'gss54.jpg', 'fog_0023.jpg': 'gss57.jpg', 'fog_0024.jpg': 'gss59.jpg', 'fog_0025.jpg': 'gss66.jpg', 'fog_0026.jpg': 'gss83.jpg', 'fog_0027.jpg': 'gss87.jpg', 'fog_0028.jpg': 'gss89.jpg', 'fog_0029.jpg': 'gss92.jpg', 'fog_0030.jpg': 'gss94.jpg', 'fog_0031.jpg': 'gss97.jpg', 'fog_0032.jpg': 'gss99.jpg', 'fog_0033.jpg': 'gss102.jpg', 'fog_0034.jpg': 'gss104.jpg', 'fog_0035.jpg': 'gss107.jpg', 'fog_0036.jpg': 'gss109.jpg', 'fog_0037.jpg': 'gss112.jpg', 'fog_0038.jpg': 'gss114.jpg', 'fog_0039.jpg': 'gss117.jpg', 'fog_0040.jpg': 'gss119.jpg',
  'fog_0041.jpg': 'gss122.jpg', 'fog_0042.jpg': 'gss124.jpg', 'fog_0043.jpg': 'gss127.jpg', 'fog_0044.jpg': 'gss129.jpg', 'fog_0045.jpg': 'gss132.jpg', 'fog_0046.jpg': 'gss134.jpg', 'fog_0047.jpg': 'gss137.jpg', 'fog_0048.jpg': 'gss139.jpg', 'fog_0049.jpg': 'gss142.jpg', 'fog_0050.jpg': 'gss144.jpg', 'fog_0051.jpg': 'gss147.jpg', 'fog_0052.jpg': 'gss149.jpg', 'fog_0053.jpg': 'gss152.jpg', 'fog_0054.jpg': 'gss154.jpg', 'fog_0055.jpg': 'gss157.jpg', 'fog_0056.jpg': 'gss159.jpg', 'fog_0057.jpg': 'gss162.jpg', 'fog_0058.jpg': 'gss164.jpg', 'fog_0059.jpg': 'gss167.jpg', 'fog_0060.jpg': 'gss169.jpg',
  'fog_0061.jpg': 'gss172.jpg', 'fog_0062.jpg': 'gss174.jpg', 'fog_0063.jpg': 'gss177.jpg', 'fog_0064.jpg': 'gss179.jpg', 'fog_0065.jpg': 'gss182.jpg', 'fog_0066.jpg': 'gss184.jpg', 'fog_0067.jpg': 'gss187.jpg', 'fog_0068.jpg': 'gss189.jpg', 'fog_0069.jpg': 'gss192.jpg', 'fog_0070.jpg': 'gss194.jpg', 'fog_0071.jpg': 'gss197.jpg', 'fog_0072.jpg': 'gss199.jpg', 'fog_0073.jpg': 'gss202.jpg', 'fog_0074.jpg': 'gss204.jpg', 'fog_0075.jpg': 'gss207.jpg', 'fog_0076.jpg': 'gss209.jpg', 'fog_0077.jpg': 'gss212.jpg', 'fog_0078.jpg': 'gss214.jpg', 'fog_0079.jpg': 'gss217.jpg', 'fog_0080.jpg': 'gss219.jpg',
  'fog_0081.jpg': 'gss222.jpg', 'fog_0082.jpg': 'gss224.jpg', 'fog_0083.jpg': 'gss227.jpg', 'fog_0084.jpg': 'gss229.jpg', 'fog_0085.jpg': 'gss232.jpg', 'fog_0086.jpg': 'gss234.jpg', 'fog_0087.jpg': 'gss237.jpg', 'fog_0088.jpg': 'gss239.jpg', 'fog_0089.jpg': 'gss242.jpg', 'fog_0090.jpg': 'gss244.jpg', 'fog_0091.jpg': 'gss247.jpg', 'fog_0092.jpg': 'gss249.jpg', 'fog_0093.jpg': 'gss252.jpg', 'fog_0094.jpg': 'gss254.jpg', 'fog_0095.jpg': 'gss257.jpg', 'fog_0096.jpg': 'gss259.jpg', 'fog_0097.jpg': 'gss262.jpg', 'fog_0098.jpg': 'gss264.jpg', 'fog_0099.jpg': 'gss267.jpg', 'fog_0100.jpg': 'gss269.jpg',
  'fog_0101.jpg': 'gss272.jpg', 'fog_0102.jpg': 'gss274.jpg', 'fog_0103.jpg': 'gss277.jpg', 'fog_0104.jpg': 'gss279.jpg', 'fog_0105.jpg': 'gss282.jpg', 'fog_0106.jpg': 'gss288.jpg', 'fog_0107.jpg': 'gss291.jpg', 'fog_0108.jpg': 'gss293.jpg', 'fog_0109.jpg': 'gss296.jpg', 'fog_0110.jpg': 'gss298.jpg', 'fog_0111.jpg': 'gss301.jpg', 'fog_0112.jpg': 'gss303.jpg', 'fog_0113.jpg': 'gss306.jpg', 'fog_0114.jpg': 'gss309.jpg', 'fog_0115.jpg': 'gss312.jpg', 'fog_0116.jpg': 'gss314.jpg', 'fog_0117.jpg': 'gss317.jpg', 'fog_0118.jpg': 'gss319.jpg', 'fog_0119.jpg': 'gss322.jpg', 'fog_0120.jpg': 'gss324.jpg',
  'fog_0121.jpg': 'gss327.jpg', 'fog_0122.jpg': 'gss331.jpg', 'fog_0123.jpg': 'gss334.jpg', 'fog_0124.jpg': 'gss336.jpg', 'fog_0125.jpg': 'gss339.jpg', 'fog_0126.jpg': 'gss341.jpg', 'fog_0127.jpg': 'gss344.jpg', 'fog_0128.jpg': 'gss346.jpg', 'fog_0129.jpg': 'gss349.jpg', 'fog_0130.jpg': 'gss351.jpg', 'fog_0131.jpg': 'gss354.jpg', 'fog_0132.jpg': 'gss356.jpg', 'fog_0133.jpg': 'gss359.jpg', 'fog_0134.jpg': 'gss361.jpg', 'fog_0135.jpg': 'gss364.jpg', 'fog_0136.jpg': 'gss366.jpg', 'fog_0137.jpg': 'gss369.jpg', 'fog_0207.jpg': 'gss544.jpg', 'fog_0208.jpg': 'gss546.jpg', 'fog_0209.jpg': 'gss549.jpg',
  'fog_0210.jpg': 'gss551.jpg', 'fog_0211.jpg': 'gss554.jpg', 'fog_0212.jpg': 'gss556.jpg', 'fog_0213.jpg': 'gss559.jpg', 'fog_0214.jpg': 'gss561.jpg', 'fog_0215.jpg': 'gss564.jpg', 'fog_0216.jpg': 'gss566.jpg', 'fog_0217.jpg': 'gss569.jpg', 'fog_0218.jpg': 'gss571.jpg', 'fog_0219.jpg': 'gss574.jpg', 'fog_0220.jpg': 'gss576.jpg', 'fog_0221.jpg': 'gss579.jpg', 'fog_0222.jpg': 'gss581.jpg', 'fog_0223.jpg': 'gss584.jpg', 'fog_0224.jpg': 'gss586.jpg', 'fog_0225.jpg': 'gss589.jpg', 'fog_0226.jpg': 'gss591.jpg', 'fog_0227.jpg': 'gss594.jpg', 'fog_0237.jpg': 'gss619.jpg', 'fog_0238.jpg': 'gss621.jpg',
  'fog_0239.jpg': 'gss624.jpg', 'fog_0240.jpg': 'gss626.jpg', 'fog_0241.jpg': 'gss629.jpg', 'fog_0242.jpg': 'gss631.jpg', 'fog_0243.jpg': 'gss634.jpg', 'fog_0244.jpg': 'gss636.jpg', 'fog_0245.jpg': 'gss639.jpg', 'fog_0246.jpg': 'gss641.jpg', 'fog_0247.jpg': 'gss644.jpg', 'fog_0248.jpg': 'gss646.jpg', 'fog_0249.jpg': 'gss649.jpg', 'fog_0250.jpg': 'gss651.jpg', 'fog_0251.jpg': 'gss654.jpg', 'fog_0252.jpg': 'gss656.jpg', 'fog_0253.jpg': 'gss659.jpg', 'fog_0254.jpg': 'gss661.jpg', 'fog_0255.jpg': 'gss664.jpg', 'fog_0256.jpg': 'gss666.jpg', 'fog_0257.jpg': 'gss669.jpg', 'fog_0258.jpg': 'gss671.jpg',
  'fog_0259.jpg': 'gss674.jpg', 'fog_0260.jpg': 'gss676.jpg', 'fog_0261.jpg': 'gss679.jpg', 'fog_0262.jpg': 'gss681.jpg', 'fog_0263.jpg': 'gss684.jpg', 'fog_0264.jpg': 'gss686.jpg', 'fog_0265.jpg': 'gss689.jpg', 'fog_0266.jpg': 'gss691.jpg', 'fog_0267.jpg': 'gss694.jpg', 'fog_0268.jpg': 'gss696.jpg', 'fog_0269.jpg': 'gss699.jpg', 'fog_0270.jpg': 'gss701.jpg', 'fog_0271.jpg': 'gss704.jpg', 'fog_0272.jpg': 'gss706.jpg', 'fog_0273.jpg': 'gss709.jpg', 'fog_0274.jpg': 'gss711.jpg', 'fog_0275.jpg': 'gss714.jpg', 'fog_0276.jpg': 'gss716.jpg', 'fog_0277.jpg': 'gss719.jpg', 'fog_0278.jpg': 'gss721.jpg',
  'fog_0279.jpg': 'gss724.jpg', 'fog_0280.jpg': 'gss726.jpg', 'fog_0281.jpg': 'gss729.jpg', 'fog_0282.jpg': 'gss731.jpg', 'fog_0283.jpg': 'gss734.jpg', 'fog_0284.jpg': 'gss736.jpg', 'fog_0285.jpg': 'gss739.jpg', 'fog_0286.jpg': 'gss741.jpg', 'fog_0287.jpg': 'gss748.jpg', 'fog_0288.jpg': 'gss750.jpg', 'fog_0289.jpg': 'gss753.jpg', 'fog_0290.jpg': 'gss755.jpg', 'fog_0291.jpg': 'gss758.jpg', 'fog_0292.jpg': 'gss760.jpg', 'fog_0293.jpg': 'gss763.jpg', 'fog_0294.jpg': 'gss765.jpg', 'fog_0295.jpg': 'gss768.jpg', 'fog_0296.jpg': 'gss770.jpg', 'fog_0297.jpg': 'gss773.jpg', 'fog_0298.jpg': 'gss775.jpg',
  'fog_0299.jpg': 'gss778.jpg', 'fog_0300.jpg': 'gss780.jpg', 'fog_0301.jpg': 'gss783.jpg', 'fog_0302.jpg': 'gss785.jpg', 'fog_0303.jpg': 'gss788.jpg', 'fog_0304.jpg': 'gss790.jpg', 'fog_0305.jpg': 'gss793.jpg', 'fog_0306.jpg': 'gss795.jpg', 'fog_0307.jpg': 'gss798.jpg', 'fog_0308.jpg': 'gss800.jpg', 'fog_0309.jpg': 'gss803.jpg', 'fog_0310.jpg': 'gss805.jpg', 'fog_0311.jpg': 'gss809.jpg', 'fog_0312.jpg': 'gss811.jpg', 'fog_0313.jpg': 'gss814.jpg', 'fog_0314.jpg': 'gss816.jpg', 'fog_0315.jpg': 'gss819.jpg', 'fog_0316.jpg': 'gss821.jpg', 'fog_0317.jpg': 'gss824.jpg', 'fog_0318.jpg': 'gss826.jpg',
  'fog_0319.jpg': 'gss829.jpg', 'fog_0320.jpg': 'gss831.jpg', 'fog_0321.jpg': 'gss834.jpg', 'fog_0322.jpg': 'gss837.jpg', 'fog_0323.jpg': 'gss840.jpg', 'fog_0324.jpg': 'gss842.jpg', 'fog_0325.jpg': 'gss845.jpg', 'fog_0326.jpg': 'gss847.jpg', 'fog_0327.jpg': 'gss850.jpg', 'fog_0328.jpg': 'gss852.jpg', 'fog_0329.jpg': 'gss855.jpg', 'fog_0330.jpg': 'gss857.jpg', 'fog_0331.jpg': 'gss860.jpg', 'fog_0332.jpg': 'gss862.jpg', 'fog_0333.jpg': 'gss865.jpg', 'fog_0334.jpg': 'gss867.jpg', 'fog_0335.jpg': 'gss870.jpg', 'fog_0336.jpg': 'gss872.jpg', 'fog_0337.jpg': 'gss875.jpg', 'fog_0338.jpg': 'gss877.jpg',
  'fog_0339.jpg': 'gss880.jpg', 'fog_0340.jpg': 'gss882.jpg', 'fog_0341.jpg': 'gss885.jpg', 'fog_0342.jpg': 'gss887.jpg', 'fog_0343.jpg': 'gss892.jpg', 'fog_0344.jpg': 'gss894.jpg', 'fog_0345.jpg': 'gss897.jpg', 'fog_0346.jpg': 'gss899.jpg', 'fog_0347.jpg': 'gss902.jpg', 'fog_0348.jpg': 'gss904.jpg', 'fog_0349.jpg': 'gss907.jpg', 'fog_0350.jpg': 'gss909.jpg', 'fog_0351.jpg': 'gss912.jpg', 'fog_0352.jpg': 'gss914.jpg', 'fog_0353.jpg': 'gss917.jpg', 'fog_0354.jpg': 'gss919.jpg', 'fog_0355.jpg': 'gss922.jpg', 'fog_0356.jpg': 'gss924.jpg', 'fog_0357.jpg': 'gss927.jpg', 'fog_0358.jpg': 'gss929.jpg',
  'fog_0359.jpg': 'gss932.jpg', 'fog_0360.jpg': 'gss934.jpg', 'fog_0361.jpg': 'gss937.jpg', 'fog_0362.jpg': 'gss939.jpg', 'fog_0363.jpg': 'gss942.jpg', 'fog_0364.jpg': 'gss944.jpg', 'fog_0365.jpg': 'gss947.jpg', 'fog_0366.jpg': 'gss949.jpg', 'fog_0367.jpg': 'gss952.jpg', 'fog_0368.jpg': 'gss954.jpg', 'fog_0369.jpg': 'gss957.jpg', 'fog_0370.jpg': 'gss959.jpg', 'fog_0371.jpg': 'gss962.jpg', 'fog_0372.jpg': 'gss964.jpg', 'fog_0373.jpg': 'gss967.jpg', 'fog_0374.jpg': 'gss969.jpg', 'fog_0375.jpg': 'gss972.jpg', 'fog_0376.jpg': 'gss974.jpg', 'fog_0377.jpg': 'gss1008.jpg', 'fog_0378.jpg': 'gss1010.jpg',
  'fog_0379.jpg': 'gss1013.jpg', 'fog_0380.jpg': 'gss1015.jpg', 'fog_0381.jpg': 'gss1018.jpg', 'fog_0382.jpg': 'gss1020.jpg', 'fog_0383.jpg': 'gss1023.jpg', 'fog_0384.jpg': 'gss1025.jpg', 'fog_0385.jpg': 'gss1028.jpg', 'fog_0386.jpg': 'gss1030.jpg', 'fog_0387.jpg': 'gss1033.jpg', 'fog_0388.jpg': 'gss1035.jpg', 'fog_0389.jpg': 'gss1038.jpg', 'fog_0390.jpg': 'gss1040.jpg', 'fog_0391.jpg': 'gss1043.jpg', 'fog_0392.jpg': 'gss1045.jpg', 'fog_0393.jpg': 'gss1048.jpg', 'fog_0394.jpg': 'gss1050.jpg', 'fog_0395.jpg': 'gss1053.jpg', 'fog_0396.jpg': 'gss1055.jpg', 'fog_0397.jpg': 'gss1058.jpg', 'fog_0398.jpg': 'gss1060.jpg',
  'fog_0399.jpg': 'gss1063.jpg', 'fog_0400.jpg': 'gss1065.jpg', 'fog_0401.jpg': 'gss1068.jpg', 'fog_0402.jpg': 'gss1070.jpg', 'fog_0403.jpg': 'gss1073.jpg', 'fog_0404.jpg': 'gss1075.jpg', 'fog_0405.jpg': 'gss1078.jpg', 'fog_0406.jpg': 'gss1080.jpg', 'fog_0407.jpg': 'gss1083.jpg', 'fog_0408.jpg': 'gss1085.jpg', 'fog_0409.jpg': 'gss1088.jpg', 'fog_0410.jpg': 'gss1090.jpg', 'fog_0411.jpg': 'gss1093.jpg', 'fog_0412.jpg': 'gss1095.jpg', 'fog_0413.jpg': 'gss1098.jpg', 'fog_0414.jpg': 'gss1100.jpg', 'fog_0415.jpg': 'gss1103.jpg', 'fog_0416.jpg': 'gss1105.jpg', 'fog_0417.jpg': 'gss1108.jpg', 'fog_0418.jpg': 'gss1110.jpg',
  'fog_0419.jpg': 'gss1113.jpg', 'fog_0420.jpg': 'gss1115.jpg', 'fog_0421.jpg': 'gss1118.jpg', 'fog_0422.jpg': 'gss1120.jpg', 'fog_0423.jpg': 'gss1123.jpg', 'fog_0424.jpg': 'gss1125.jpg', 'fog_0425.jpg': 'gss1129.jpg', 'fog_0426.jpg': 'gss1139.jpg', 'fog_0427.jpg': 'gss1142.jpg', 'fog_0428.jpg': 'gss1144.jpg', 'fog_0429.jpg': 'gss1147.jpg', 'fog_0430.jpg': 'gss1149.jpg', 'fog_0431.jpg': 'gss1152.jpg', 'fog_0432.jpg': 'gss1154.jpg', 'fog_0433.jpg': 'gss1157.jpg', 'fog_0434.jpg': 'gss1159.jpg', 'fog_0435.jpg': 'gss1162.jpg', 'fog_0436.jpg': 'gss1164.jpg', 'fog_0437.jpg': 'gss1167.jpg', 'fog_0438.jpg': 'gss1169.jpg',
  'fog_0439.jpg': 'gss1172.jpg', 'fog_0440.jpg': 'gss1174.jpg', 'fog_0441.jpg': 'gss1177.jpg', 'fog_0442.jpg': 'gss1179.jpg', 'fog_0443.jpg': 'gss1182.jpg', 'fog_0444.jpg': 'gss1184.jpg', 'fog_0445.jpg': 'gss1187.jpg', 'fog_0446.jpg': 'gss1189.jpg', 'fog_0447.jpg': 'gss1197.jpg', 'fog_0448.jpg': 'gss1199.jpg', 'fog_0449.jpg': 'gss1202.jpg', 'fog_0450.jpg': 'gss1204.jpg', 'fog_0451.jpg': 'gss1207.jpg', 'fog_0452.jpg': 'gss1209.jpg', 'fog_0453.jpg': 'gss1212.jpg', 'fog_0454.jpg': 'gss1214.jpg', 'fog_0455.jpg': 'gss1217.jpg', 'fog_0456.jpg': 'gss1219.jpg', 'fog_0457.jpg': 'gss1222.jpg', 'fog_0458.jpg': 'gss1224.jpg',
  'fog_0459.jpg': 'gss1227.jpg', 'fog_0460.jpg': 'gss1229.jpg', 'fog_0461.jpg': 'gss1232.jpg', 'fog_0462.jpg': 'gss1234.jpg', 'fog_0463.jpg': 'gss1259.jpg', 'fog_0464.jpg': 'gss1261.jpg', 'fog_0465.jpg': 'gss1264.jpg', 'fog_0466.jpg': 'gss1266.jpg', 'fog_0467.jpg': 'gss1269.jpg', 'fog_0468.jpg': 'gss1271.jpg', 'fog_0469.jpg': 'gss1291.jpg', 'fog_0470.jpg': 'gss1293.jpg', 'fog_0471.jpg': 'gss1296.jpg', 'fog_0472.jpg': 'gss1298.jpg', 'fog_0473.jpg': 'gss1301.jpg', 'fog_0474.jpg': 'gss1303.jpg', 'fog_0475.jpg': 'gss1306.jpg', 'fog_0476.jpg': 'gss1308.jpg', 'fog_0477.jpg': 'gss1311.jpg', 'fog_0478.jpg': 'gss1313.jpg',
  'fog_0479.jpg': 'gss1316.jpg', 'fog_0480.jpg': 'gss1318.jpg', 'fog_0481.jpg': 'gss1321.jpg', 'fog_0482.jpg': 'gss1323.jpg', 'fog_0483.jpg': 'gss1326.jpg', 'fog_0484.jpg': 'gss1328.jpg', 'fog_0485.jpg': 'gss1331.jpg', 'fog_0486.jpg': 'gss1333.jpg', 'fog_0487.jpg': 'gss1336.jpg', 'fog_0488.jpg': 'gss1338.jpg', 'fog_0489.jpg': 'gss1341.jpg', 'fog_0490.jpg': 'gss1343.jpg', 'fog_0491.jpg': 'gss1346.jpg', 'fog_0492.jpg': 'gss1348.jpg', 'fog_0493.jpg': 'gss1351.jpg', 'fog_0494.jpg': 'gss1353.jpg', 'fog_0495.jpg': 'gss1356.jpg', 'fog_0496.jpg': 'gss1358.jpg', 'fog_0497.jpg': 'gss1361.jpg', 'fog_0498.jpg': 'gss1363.jpg',
  'fog_0499.jpg': 'gss1366.jpg', 'fog_0500.jpg': 'gss1368.jpg', 'fog_0501.jpg': 'gss1371.jpg', 'fog_0502.jpg': 'gss1373.jpg', 'fog_0503.jpg': 'gss1376.jpg', 'fog_0504.jpg': 'gss1378.jpg', 'fog_0505.jpg': 'gss1381.jpg', 'fog_0506.jpg': 'gss1383.jpg', 'fog_0507.jpg': 'gss1386.jpg', 'fog_0508.jpg': 'gss1388.jpg', 'fog_0509.jpg': 'gss1391.jpg', 'fog_0510.jpg': 'gss1393.jpg', 'fog_0511.jpg': 'gss1396.jpg', 'fog_0512.jpg': 'gss1398.jpg', 'fog_0513.jpg': 'gss1401.jpg', 'fog_0514.jpg': 'gss1403.jpg', 'fog_0515.jpg': 'gss1406.jpg', 'fog_0516.jpg': 'gss1408.jpg', 'fog_0517.jpg': 'gss1411.jpg', 'fog_0518.jpg': 'gss1413.jpg',
  'fog_0519.jpg': 'gss1416.jpg', 'fog_0520.jpg': 'gss1418.jpg', 'fog_0521.jpg': 'gss1421.jpg', 'fog_0522.jpg': 'gss1423.jpg', 'fog_0523.jpg': 'gss1426.jpg', 'fog_0524.jpg': 'gss1428.jpg', 'fog_0525.jpg': 'gss1431.jpg', 'fog_0526.jpg': 'gss1433.jpg', 'fog_0527.jpg': 'gss1436.jpg', 'fog_0528.jpg': 'gss1438.jpg', 'fog_0529.jpg': 'gss1441.jpg', 'fog_0530.jpg': 'gss1443.jpg', 'fog_0531.jpg': 'gss1446.jpg', 'fog_0532.jpg': 'gss1448.jpg', 'fog_0533.jpg': 'gss1451.jpg', 'fog_0534.jpg': 'gss1453.jpg', 'fog_0535.jpg': 'gss1456.jpg', 'fog_0536.jpg': 'gss1458.jpg', 'fog_0537.jpg': 'gss1461.jpg', 'fog_0538.jpg': 'gss1463.jpg',
  'fog_0539.jpg': 'gss1466.jpg', 'fog_0540.jpg': 'gss1468.jpg', 'fog_0541.jpg': 'gss1471.jpg', 'fog_0542.jpg': 'gss1473.jpg', 'fog_0543.jpg': 'gss1476.jpg', 'fog_0544.jpg': 'gss1478.jpg', 'fog_0545.jpg': 'gss1481.jpg', 'fog_0546.jpg': 'gss1483.jpg', 'fog_0547.jpg': 'gss1486.jpg', 'fog_0548.jpg': 'gss1488.jpg', 'fog_0549.jpg': 'gss1491.jpg', 'fog_0550.jpg': 'gss1493.jpg', 'fog_0551.jpg': 'gss1496.jpg', 'fog_0552.jpg': 'gss1498.jpg', 'fog_0553.jpg': 'gss1501.jpg', 'fog_0554.jpg': 'gss1503.jpg', 'fog_0555.jpg': 'gss1506.jpg', 'fog_0556.jpg': 'gss1508.jpg', 'fog_0557.jpg': 'gss1511.jpg', 'fog_0558.jpg': 'gss1513.jpg',
  'fog_0559.jpg': 'gss1516.jpg', 'fog_0560.jpg': 'gss1518.jpg', 'fog_0561.jpg': 'gss1521.jpg', 'fog_0562.jpg': 'gss1523.jpg', 'fog_0563.jpg': 'gss1526.jpg', 'fog_0564.jpg': 'gss1528.jpg', 'fog_0565.jpg': 'gss1531.jpg', 'fog_0566.jpg': 'gss1533.jpg', 'fog_0567.jpg': 'gss1536.jpg', 'fog_0568.jpg': 'gss1538.jpg', 'fog_0569.jpg': 'gss1541.jpg', 'fog_0570.jpg': 'gss1543.jpg', 'fog_0571.jpg': 'gss1546.jpg', 'fog_0572.jpg': 'gss1548.jpg', 'fog_0573.jpg': 'gss1551.jpg', 'fog_0574.jpg': 'gss1553.jpg', 'fog_0575.jpg': 'gss1556.jpg', 'fog_0576.jpg': 'gss1558.jpg', 'fog_0577.jpg': 'gss1561.jpg', 'fog_0578.jpg': 'gss1563.jpg',
  'fog_0579.jpg': 'gss1566.jpg', 'fog_0580.jpg': 'gss1568.jpg', 'fog_0581.jpg': 'gss1571.jpg', 'fog_0582.jpg': 'gss1573.jpg', 'fog_0583.jpg': 'gss1576.jpg', 'fog_0584.jpg': 'gss1578.jpg', 'fog_0585.jpg': 'gss1581.jpg', 'fog_0586.jpg': 'gss1583.jpg', 'fog_0587.jpg': 'gss1586.jpg', 'fog_0588.jpg': 'gss1588.jpg', 'fog_0589.jpg': 'gss1591.jpg', 'fog_0590.jpg': 'gss1593.jpg', 'fog_0591.jpg': 'gss1596.jpg', 'fog_0592.jpg': 'gss1598.jpg', 'fog_0593.jpg': 'gss1601.jpg', 'fog_0594.jpg': 'gss1603.jpg', 'fog_0595.jpg': 'gss1606.jpg', 'fog_0596.jpg': 'gss1608.jpg', 'fog_0597.jpg': 'gss1611.jpg', 'fog_0598.jpg': 'gss1613.jpg',
  'fog_0599.jpg': 'gss1616.jpg', 'fog_0600.jpg': 'gss1618.jpg', 'fog_0601.jpg': 'gss1621.jpg', 'fog_0602.jpg': 'gss1623.jpg', 'fog_0603.jpg': 'gss1626.jpg', 'fog_0604.jpg': 'gss1628.jpg', 'fog_0605.jpg': 'gss1631.jpg', 'fog_0606.jpg': 'gss1633.jpg', 'fog_0607.jpg': 'gss1636.jpg', 'fog_0608.jpg': 'gss1638.jpg', 'fog_0609.jpg': 'gss1641.jpg', 'fog_0610.jpg': 'gss1643.jpg', 'fog_0611.jpg': 'gss1646.jpg', 'fog_0612.jpg': 'gss1648.jpg', 'fog_0613.jpg': 'gss1651.jpg', 'fog_0614.jpg': 'gss1653.jpg', 'fog_0615.jpg': 'gss1656.jpg', 'fog_0616.jpg': 'gss1658.jpg', 'fog_0617.jpg': 'gss1661.jpg', 'fog_0618.jpg': 'gss1663.jpg',
  'fog_0619.jpg': 'gss1666.jpg', 'fog_0620.jpg': 'gss1668.jpg', 'fog_0621.jpg': 'gss1671.jpg', 'fog_0622.jpg': 'gss1673.jpg', 'fog_0623.jpg': 'gss1676.jpg', 'fog_0624.jpg': 'gss1678.jpg', 'fog_0625.jpg': 'gss1681.jpg', 'fog_0626.jpg': 'gss1683.jpg', 'fog_0627.jpg': 'gss1686.jpg', 'fog_0628.jpg': 'gss1688.jpg', 'fog_0629.jpg': 'gss1691.jpg', 'fog_0630.jpg': 'gss1693.jpg', 'fog_0631.jpg': 'gss1696.jpg', 'fog_0632.jpg': 'gss1699.jpg', 'fog_0633.jpg': 'gss1701.jpg', 'fog_0634.jpg': 'gss1703.jpg', 'fog_0635.jpg': 'gss1706.jpg', 'fog_0636.jpg': 'gss1708.jpg', 'fog_0637.jpg': 'gss1711.jpg', 'fog_0638.jpg': 'gss1713.jpg',
  'fog_0639.jpg': 'gss1716.jpg', 'fog_0640.jpg': 'gss1718.jpg', 'fog_0641.jpg': 'gss1721.jpg', 'fog_0642.jpg': 'gss1723.jpg', 'fog_0643.jpg': 'gss1726.jpg', 'fog_0644.jpg': 'gss1728.jpg', 'fog_0645.jpg': 'gss1731.jpg', 'fog_0646.jpg': 'gss1733.jpg', 'fog_0647.jpg': 'gss1736.jpg', 'fog_0648.jpg': 'gss1738.jpg', 'fog_0649.jpg': 'gss1741.jpg', 'fog_0650.jpg': 'gss1743.jpg', 'fog_0651.jpg': 'gss1746.jpg', 'fog_0652.jpg': 'gss1748.jpg', 'fog_0653.jpg': 'gss1751.jpg', 'fog_0654.jpg': 'gss1753.jpg', 'fog_0655.jpg': 'gss1756.jpg', 'fog_0656.jpg': 'gss1758.jpg', 'fog_0657.jpg': 'gss1761.jpg', 'fog_0658.jpg': 'gss1763.jpg',
  'fog_0659.jpg': 'gss1766.jpg', 'fog_0660.jpg': 'gss1768.jpg', 'fog_0661.jpg': 'gss1771.jpg', 'fog_0662.jpg': 'gss1773.jpg', 'fog_0663.jpg': 'gss1776.jpg', 'fog_0664.jpg': 'gss1778.jpg', 'fog_0665.jpg': 'gss1781.jpg', 'fog_0666.jpg': 'gss1783.jpg', 'fog_0667.jpg': 'gss1786.jpg', 'fog_0668.jpg': 'gss1788.jpg', 'fog_0669.jpg': 'gss1791.jpg', 'fog_0670.jpg': 'gss1793.jpg', 'fog_0671.jpg': 'gss1796.jpg', 'fog_0672.jpg': 'gss1798.jpg', 'fog_0673.jpg': 'gss1801.jpg', 'fog_0674.jpg': 'gss1803.jpg', 'fog_0675.jpg': 'gss1806.jpg', 'fog_0676.jpg': 'gss1808.jpg', 'fog_0677.jpg': 'gss1811.jpg', 'fog_0678.jpg': 'gss1813.jpg',
  'fog_0679.jpg': 'gss1816.jpg', 'fog_0680.jpg': 'gss1818.jpg', 'fog_0681.jpg': 'gss1821.jpg', 'fog_0682.jpg': 'gss1823.jpg', 'fog_0683.jpg': 'gss1826.jpg', 'fog_0684.jpg': 'gss1828.jpg', 'fog_0685.jpg': 'gss1831.jpg', 'fog_0686.jpg': 'gss1833.jpg', 'fog_0687.jpg': 'gss1836.jpg', 'fog_0688.jpg': 'gss1838.jpg', 'fog_0689.jpg': 'gss1841.jpg', 'fog_0690.jpg': 'gss1843.jpg', 'fog_0691.jpg': 'gss1846.jpg', 'fog_0692.jpg': 'gss1848.jpg', 'fog_0693.jpg': 'gss1851.jpg', 'fog_0694.jpg': 'gss1853.jpg', 'fog_0695.jpg': 'gss1856.jpg', 'fog_0696.jpg': 'gss1858.jpg', 'fog_0697.jpg': 'gss1861.jpg', 'fog_0698.jpg': 'gss1863.jpg',
  'fog_0699.jpg': 'gss1866.jpg', 'fog_0700.jpg': 'gss1868.jpg', 'fog_0701.jpg': 'gss1871.jpg', 'fog_0702.jpg': 'gss1873.jpg', 'fog_0703.jpg': 'gss1876.jpg', 'fog_0704.jpg': 'gss1878.jpg', 'fog_0705.jpg': 'gss1881.jpg', 'fog_0706.jpg': 'gss1883.jpg', 'fog_0707.jpg': 'gss1886.jpg', 'fog_0708.jpg': 'gss1888.jpg', 'fog_0709.jpg': 'gss1891.jpg', 'fog_0710.jpg': 'gss1893.jpg', 'fog_0711.jpg': 'gss1896.jpg', 'fog_0712.jpg': 'gss1898.jpg', 'fog_0713.jpg': 'gss1901.jpg', 'fog_0714.jpg': 'gss1903.jpg', 'fog_0715.jpg': 'gss1906.jpg', 'fog_0716.jpg': 'gss1908.jpg', 'fog_0717.jpg': 'gss1911.jpg', 'fog_0718.jpg': 'gss1913.jpg',
  'fog_0719.jpg': 'gss1916.jpg', 'fog_0720.jpg': 'gss1918.jpg', 'fog_0721.jpg': 'gss1921.jpg', 'fog_0722.jpg': 'gss1923.jpg', 'fog_0723.jpg': 'gss1926.jpg', 'fog_0724.jpg': 'gss1928.jpg', 'fog_0725.jpg': 'gss1931.jpg', 'fog_0726.jpg': 'gss1933.jpg', 'fog_0727.jpg': 'gss1936.jpg', 'fog_0728.jpg': 'gss1938.jpg', 'fog_0729.jpg': 'gss1941.jpg', 'fog_0730.jpg': 'gss1943.jpg', 'fog_0731.jpg': 'gss1946.jpg', 'fog_0732.jpg': 'gss1948.jpg', 'fog_0733.jpg': 'gss1951.jpg', 'fog_0734.jpg': 'gss1953.jpg', 'fog_0735.jpg': 'gss1956.jpg', 'fog_0736.jpg': 'gss1958.jpg', 'fog_0737.jpg': 'gss1961.jpg', 'fog_0738.jpg': 'gss1963.jpg',
  'fog_0739.jpg': 'gss1966.jpg', 'fog_0740.jpg': 'gss1968.jpg', 'fog_0741.jpg': 'gss1971.jpg', 'fog_0742.jpg': 'gss1973.jpg', 'fog_0743.jpg': 'gss1980.jpg', 'fog_0744.jpg': 'gss1982.jpg', 'fog_0745.jpg': 'gss1985.jpg', 'fog_0746.jpg': 'gss1987.jpg', 'fog_0747.jpg': 'gss1990.jpg', 'fog_0748.jpg': 'gss1992.jpg', 'fog_0749.jpg': 'gss1995.jpg', 'fog_0750.jpg': 'gss1997.jpg', 'fog_0751.jpg': 'gss2000.jpg', 'fog_0752.jpg': 'gss2002.jpg', 'fog_0753.jpg': 'gss2005.jpg', 'fog_0754.jpg': 'gss2007.jpg', 'fog_0755.jpg': 'gss2010.jpg', 'fog_0756.jpg': 'gss2012.jpg', 'fog_0757.jpg': 'gss2015.jpg', 'fog_0758.jpg': 'gss2017.jpg',
  'fog_0759.jpg': 'gss2020.jpg', 'fog_0760.jpg': 'gss2022.jpg', 'fog_0761.jpg': 'gss2025.jpg', 'fog_0762.jpg': 'gss2027.jpg', 'fog_0763.jpg': 'gss2030.jpg', 'fog_0764.jpg': 'gss2032.jpg', 'fog_0765.jpg': 'gss2035.jpg', 'fog_0766.jpg': 'gss2037.jpg', 'fog_0767.jpg': 'gss2040.jpg', 'fog_0768.jpg': 'gss2042.jpg', 'fog_0769.jpg': 'gss2045.jpg', 'fog_0770.jpg': 'gss2047.jpg', 'fog_0771.jpg': 'gss2050.jpg', 'fog_0772.jpg': 'gss2052.jpg', 'fog_0773.jpg': 'gss2055.jpg', 'fog_0774.jpg': 'gss2057.jpg', 'fog_0775.jpg': 'gss2060.jpg', 'fog_0776.jpg': 'gss2062.jpg', 'fog_0777.jpg': 'gss2065.jpg', 'fog_0778.jpg': 'gss2067.jpg',
  'fog_0779.jpg': 'gss2070.jpg', 'fog_0780.jpg': 'gss2072.jpg', 'fog_0781.jpg': 'gss2075.jpg', 'fog_0782.jpg': 'gss2077.jpg', 'fog_0783.jpg': 'gss2080.jpg', 'fog_0784.jpg': 'gss2082.jpg', 'fog_0785.jpg': 'gss2085.jpg', 'fog_0786.jpg': 'gss2087.jpg', 'fog_0787.jpg': 'gss2090.jpg', 'fog_0788.jpg': 'gss2092.jpg', 'fog_0789.jpg': 'gss2095.jpg', 'fog_0790.jpg': 'gss2097.jpg', 'fog_0791.jpg': 'gss2100.jpg', 'fog_0792.jpg': 'gss2102.jpg'
  }

This dataset was built for detecting casualties and persons in search and rescue scenarios in drone images and videos. The actors in the footage simulate exhausted and injured persons as well as "classic" types of movement of people in nature, such as running, walking, standing, sitting, or lying down. The shots include persons on macadam roads, in quarries, low and high grass, forest shade, and the like.

### Reflection on the "Not-Defined" label:

The image annotation consists of the position of the bounding box around each object of interest, the size of the bounding box in terms of width and height, and the corresponding class designation (Standing, Walking, Running, Sitting, Lying, Not Defined) for the person.

We want to implement a person detection system and are not interested in understanding the movement that the person is making. Therefore, the first step we will take is to unify all the labels into a single class, 'person'. But what should we do with the 'not-defined' label? What does it represent?

After analyzing the dataset, I noticed that many images contain the 'not-defined' label, so removing them is not an option. In all of these images, the subject is clearly a person. The 'not-defined' class therefore does not indicate whether a person is present; it indicates that the person's movement could not, for various reasons, be classified.

We have therefore decided to include the 'not-defined' class in the unified 'person' class.
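
A minimal sketch of this unification on a single annotation, assuming Pascal VOC-style XML with one `<name>` per `<object>`. The helper names and the exact spellings in `MOVEMENT_LABELS` are assumptions for illustration; labels are normalized so that 'Not Defined', 'not-defined', and 'not_defined' are treated alike, and any unexpected label is reported instead of being overwritten.

```python
import xml.etree.ElementTree as ET

# Assumed spellings of the six SARD movement classes, in normalized form.
MOVEMENT_LABELS = {"standing", "walking", "running", "sitting", "lying", "not defined"}


def normalize_label(label):
    """Lowercase a label and treat '-' and '_' as spaces."""
    cleaned = label.strip().lower().replace("-", " ").replace("_", " ")
    return " ".join(cleaned.split())


def unify_person_labels(root, new_class="person"):
    """Rename known movement labels to new_class; return the unexpected labels."""
    unexpected = set()
    for obj in root.findall("object"):
        name_tag = obj.find("name")
        if name_tag is None or name_tag.text is None:
            continue
        label = normalize_label(name_tag.text)
        if label in MOVEMENT_LABELS or label == new_class:
            name_tag.text = new_class
        else:
            unexpected.add(name_tag.text)  # left untouched for manual review
    return unexpected


# Hypothetical annotation used only to exercise the helper.
sample = ET.fromstring(
    "<annotation>"
    "<object><name>Running</name></object>"
    "<object><name>Not-Defined</name></object>"
    "<object><name>car</name></object>"
    "</annotation>"
)
unexpected = unify_person_labels(sample)
print([o.find("name").text for o in sample.findall("object")])
print(unexpected)
```

Reporting unknown labels rather than overwriting them makes it harder to relabel a non-person object by accident.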

To increase the robustness of the SARD data, an extension of the SARD set, called Corr, was created. It includes images that further simulate weather conditions that may occur in actual search and rescue situations, such as fog, snow, and ice. The Corr set also includes blurred images, which occur in real conditions as a result of camera movement and aerial shooting in motion.

## Dataset Herida

In [None]:
# os.chdir('../') # to change the directory (../ is the start)
os.chdir('/content/drive/MyDrive/projectUPV/datasets/Herida_dataset/train') # the directory with the dataset
print(os.getcwd()) # to see in which directory we are
# print(os.listdir()) # to see what there is in the current directory (add the path inside the () to see the content of another directory)

/content/drive/.shortcut-targets-by-id/1LQbD7p_iS5KLqGNdfrYEvsAx0i_bgB0h/projectUPV/datasets/Herida_dataset/train


In [None]:
folder_path_herida = "/content/drive/MyDrive/projectUPV/datasets/Herida_dataset/train"

We start by checking if there are any jpg images without an associated XML file or vice versa:

In [None]:
inconsistencies_herida = check_image_xml_consistency(folder_path_herida)

# Let's see which files aren't matching
total_unmatched_herida = len(inconsistencies_herida["images_without_xml"]) + len(inconsistencies_herida["xml_without_images"])
print("Total unmatched files:", total_unmatched_herida)

print("\nImages without XML:")
for f in inconsistencies_herida['images_without_xml']:
  print(f"  - {f}.jpg")

print("\nXML files without image:")
for f in inconsistencies_herida['xml_without_images']:
  print(f"  - {f}.xml")


Total images: 1546
Total XML files: 1546
Images without matching XML: 0
XML files without matching image: 0
Total unmatched files: 0

Images without XML:

XML files without image:


Fortunately, all images have an associated XML file. \
We now rename all images (jpg) and XML files using the prefix _hda_:

In [None]:
def rename_paired_images_and_xml(folder_path, prefix="hda"): # To rename all images (jpg) and xml files using hda as prefix
  """
  Rename paired .jpg and .xml files to the 'prefixNNNN.jpg/xml' format.
  Orphan files are skipped (none are expected, since consistency was already checked).
  """
  # Take name without extension
  jpg_basenames = {os.path.splitext(f)[0] for f in os.listdir(folder_path) if f.lower().endswith('.jpg')}
  xml_basenames = {os.path.splitext(f)[0] for f in os.listdir(folder_path) if f.lower().endswith('.xml')}

  # Find the pairs (basenames that have both a .jpg and an .xml file)
  paired = sorted(jpg_basenames & xml_basenames)

  print(f"Found {len(paired)} pairs to rename \n")

  for i, base_name in enumerate(paired, start=1):
    new_base = f"{prefix}{str(i).zfill(4)}"

    old_jpg_path = os.path.join(folder_path, base_name + '.jpg')
    old_xml_path = os.path.join(folder_path, base_name + '.xml')

    new_jpg_path = os.path.join(folder_path, new_base + '.jpg')
    new_xml_path = os.path.join(folder_path, new_base + '.xml')

    # Rename only if the file exists
    if os.path.exists(old_jpg_path):
      os.rename(old_jpg_path, new_jpg_path)
    if os.path.exists(old_xml_path):
      os.rename(old_xml_path, new_xml_path)

    #print(f"{base_name} → {new_base}.jpg / {new_base}.xml")

  print("\n All file renamed with success")

# Let's try
rename_paired_images_and_xml(folder_path_herida, prefix="hda")



Found 1546 pairs to rename 


 All file renamed with success
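
One caveat with renaming in place: if the folder already contains a file whose name equals one of the new names (for example after a partial earlier run), `os.rename` can silently replace it on Linux. A sketch of a two-phase variant that avoids this by passing through temporary names; `rename_pairs_two_phase` is a hypothetical helper, not part of the notebook, and it assumes lowercase `.jpg`/`.xml` extensions.

```python
import os
import tempfile
import uuid


def rename_pairs_two_phase(folder_path, prefix="hda", width=4):
    """Rename .jpg/.xml pairs to prefixNNNN via temporary names; return old->new."""
    names = os.listdir(folder_path)
    jpgs = {os.path.splitext(n)[0] for n in names if n.endswith(".jpg")}
    xmls = {os.path.splitext(n)[0] for n in names if n.endswith(".xml")}
    paired = sorted(jpgs & xmls)
    token = uuid.uuid4().hex  # makes the temporary names unique

    # Phase 1: move every pair out of the way, so no final name is occupied by a pair.
    staged = []
    for i, old_base in enumerate(paired, start=1):
        tmp_base = f"__tmp_{token}_{i}"
        for ext in (".jpg", ".xml"):
            os.rename(os.path.join(folder_path, old_base + ext),
                      os.path.join(folder_path, tmp_base + ext))
        staged.append((old_base, tmp_base, f"{prefix}{i:0{width}d}"))

    # Phase 2: move from the temporary names to the final names.
    mapping = {}
    for old_base, tmp_base, new_base in staged:
        for ext in (".jpg", ".xml"):
            target = os.path.join(folder_path, new_base + ext)
            if os.path.exists(target):  # e.g. an orphan file with that name
                raise FileExistsError(target)
            os.rename(os.path.join(folder_path, tmp_base + ext), target)
        mapping[old_base] = new_base
    return mapping


# Demo on a throwaway folder in which 'hda0001' already exists.
with tempfile.TemporaryDirectory() as demo_dir:
    for base in ("a", "hda0001"):
        for ext in (".jpg", ".xml"):
            with open(os.path.join(demo_dir, base + ext), "w") as fh:
                fh.write(base)  # file content records the original basename
    demo_mapping = rename_pairs_two_phase(demo_dir)
    demo_contents = {}
    for name in sorted(os.listdir(demo_dir)):
        with open(os.path.join(demo_dir, name)) as fh:
            demo_contents[name] = fh.read()

print(demo_mapping)
print(demo_contents)
```

Returning the old-to-new mapping also leaves a record that can be saved if the original names are needed later.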


Rename all filenames inside the XML files accordingly:

In [None]:
def update_xml_filenames_herida(xml_folder): # To rename all the 'filename' in the xml files in a specific directory
  """
  Update <filename> and <path> tags in XML files to match renamed .jpg files.
  Assumes each XML file has a matching .jpg with the same basename.

  Args:
      xml_folder (str): Path to the folder containing renamed XML and JPG files.
  """
  updated = 0

  for file in os.listdir(xml_folder):
    if file.endswith('.xml'):
      xml_path = os.path.join(xml_folder, file)

      try:
        tree = ET.parse(xml_path)
        root = tree.getroot()

        new_filename = os.path.splitext(file)[0] + '.jpg'

        # Update <filename> and <path> tags
        filename_tag = root.find('filename')
        if filename_tag is not None:
          filename_tag.text = new_filename

        path_tag = root.find('path')
        if path_tag is not None:
          path_tag.text = new_filename  # or full path if needed

        # Save the updated XML
        tree.write(xml_path)
        updated += 1

      except ET.ParseError:
        print(f"Error parsing {file}, skipping.")

  print(f"Updated {updated} XML files with correct filename/path.")

# Let's try
update_xml_filenames_herida(folder_path_herida)

Updated 1546 XML files with correct filename/path.
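
The update can be checked afterwards by comparing each `<filename>` tag with the name of the XML file that contains it. A sketch of such a check, assuming the same folder layout; `find_filename_mismatches` is a hypothetical helper and the demo files are made up.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def find_filename_mismatches(xml_folder):
    """Return {xml_file: filename_text} where <filename> differs from '<basename>.jpg'."""
    mismatches = {}
    for file in sorted(os.listdir(xml_folder)):
        if not file.endswith(".xml"):
            continue
        expected = os.path.splitext(file)[0] + ".jpg"
        root = ET.parse(os.path.join(xml_folder, file)).getroot()
        found = root.findtext("filename")  # None when the tag is missing
        if found is None or found.strip() != expected:
            mismatches[file] = found
    return mismatches


# Demo on a throwaway folder with one correct and one stale annotation.
with tempfile.TemporaryDirectory() as demo_dir:
    samples = {
        "hda0001.xml": "<annotation><filename>hda0001.jpg</filename></annotation>",
        "hda0002.xml": "<annotation><filename>old_name.jpg</filename></annotation>",
    }
    for name, content in samples.items():
        with open(os.path.join(demo_dir, name), "w") as fh:
            fh.write(content)
    mismatches = find_filename_mismatches(demo_dir)

print(mismatches)
```

An empty result means every annotation points at its own image.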


We standardize the labels, renaming _"human"_ to _"person"_:

In [None]:
def update_xml_class(folder_path, new_class="person"): # To change all class from 'human' to 'person' in the Herida dataset
  """
  Update all <name> tags in each <object> of XML files to a new class name.

  Args:
    folder_path (str): Path to folder with .xml files.
    new_class (str): New class name to replace current ones.
  """
  updated = 0

  for file in os.listdir(folder_path):
    if file.endswith('.xml'):
      xml_path = os.path.join(folder_path, file)

      try:
        tree = ET.parse(xml_path)
        root = tree.getroot()

        modified = False
        for obj in root.findall('object'):
          name_tag = obj.find('name')
          if name_tag is not None and name_tag.text != new_class:
            name_tag.text = new_class
            modified = True

        if modified:
          tree.write(xml_path)
          updated += 1

      except ET.ParseError:
        print(f"Failed to parse {file}")

  print(f"Updated <name> in {updated} XML files.")



update_xml_class(folder_path_herida)
#update_xml_class(folder_path_herida, new_class="person")

Updated <name> in 985 XML files.
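
Because `update_xml_class` overwrites every `<name>` it finds, it can help to count the labels beforehand and confirm that only the expected ones are present. A sketch, with `count_object_labels` as a hypothetical helper and made-up demo annotations:

```python
import os
import tempfile
from collections import Counter
import xml.etree.ElementTree as ET


def count_object_labels(folder_path):
    """Return (Counter of <name> values, number of XML files without any <object>)."""
    counts = Counter()
    files_without_objects = 0
    for file in os.listdir(folder_path):
        if not file.endswith(".xml"):
            continue
        root = ET.parse(os.path.join(folder_path, file)).getroot()
        objects = root.findall("object")
        if not objects:
            files_without_objects += 1
        for obj in objects:
            counts[obj.findtext("name", default="<missing>").strip()] += 1
    return counts, files_without_objects


# Demo on a throwaway folder.
with tempfile.TemporaryDirectory() as demo_dir:
    samples = {
        "a.xml": "<annotation><object><name>human</name></object>"
                 "<object><name>human</name></object></annotation>",
        "b.xml": "<annotation><object><name>person</name></object></annotation>",
        "c.xml": "<annotation></annotation>",
    }
    for name, content in samples.items():
        with open(os.path.join(demo_dir, name), "w") as fh:
            fh.write(content)
    label_counts, empty_files = count_object_labels(demo_dir)

print(label_counts, empty_files)
```

The second return value also shows how many files the relabeling leaves untouched because they contain no objects.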


I noticed that some pictures were taken in urban settings, which we do not want in this dataset. \
So let's create a function that removes those images and their associated XML files:

In [None]:
def remove_images_by_number_range(folder, start, end, prefix="hda"):
  """
  Remove all images and XML files with a numerical index in the given range.

  Args:
    folder (str): Path to folder containing .jpg and .xml files.
    start (int): Starting index (inclusive).
    end (int): Ending index (inclusive).
    prefix (str): Filename prefix (e.g., 'hda' for 'hda1234.jpg').
  """
  deleted_images = []
  deleted_xmls = []

  for i in range(start, end + 1):
    base_name = f"{prefix}{str(i).zfill(4)}"
    img_path = os.path.join(folder, base_name + ".jpg")
    xml_path = os.path.join(folder, base_name + ".xml")

    if os.path.exists(img_path):
      os.remove(img_path)
      deleted_images.append(base_name + ".jpg")
    if os.path.exists(xml_path):
      os.remove(xml_path)
      deleted_xmls.append(base_name + ".xml")

  print(f"Deleted {len(deleted_images)} images and {len(deleted_xmls)} XML files.") # Expected 87 images and 87 xml files eliminated

In [None]:
folder_path_herida = "/content/drive/MyDrive/projectUPV/datasets/Herida_dataset/train"

remove_images_by_number_range(folder_path_herida, 1425, 1511, prefix="hda")

Deleted 87 images and 87 XML files.
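
Before deleting, a dry run that only lists what a range would match is a cheap safeguard. The sketch below mirrors the naming scheme of `remove_images_by_number_range` but removes nothing; `list_files_in_number_range` is a hypothetical name and the demo files are made up.

```python
import os
import tempfile


def list_files_in_number_range(folder, start, end, prefix="hda"):
    """Return the .jpg/.xml files whose index lies in [start, end], without deleting."""
    existing = set(os.listdir(folder))
    matches = []
    for i in range(start, end + 1):
        base_name = f"{prefix}{i:04d}"
        for ext in (".jpg", ".xml"):
            if base_name + ext in existing:
                matches.append(base_name + ext)
    return matches


# Demo on a throwaway folder: hda0003 has an image but no XML.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("hda0001.jpg", "hda0001.xml", "hda0002.jpg", "hda0002.xml", "hda0003.jpg"):
        open(os.path.join(demo_dir, name), "w").close()
    would_delete = list_files_in_number_range(demo_dir, 2, 3)

print(would_delete)
```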


We generate a CSV file containing the annotations from the XML files:

In [None]:
# To create csv file with (filename, width, height, class, xmin, ymin, xmax, ymax) based on info from various xml files
def create_csv_from_xml(xml_folder, output_csv_path):
  """
  Create a CSV from XML files in the folder and save it inside the same folder.
  The CSV contains the following columns: filename, width, height, class, xmin, ymin, xmax, ymax.

  Args:
    xml_folder (str): Path to folder with .xml files.
    output_csv_path (str): Name or path of the output CSV file. A relative path is saved inside xml_folder.

  Returns:
    str: Path to the created CSV file.
  """
  herida_person_labels = []

  # if output_csv_path is relative, save it in xml_folder
  if not os.path.isabs(output_csv_path):
    output_csv_path = os.path.join(xml_folder, output_csv_path)

  for file in os.listdir(xml_folder):
    if file.endswith('.xml'):
      xml_path = os.path.join(xml_folder, file)

      try:
        tree = ET.parse(xml_path)
        root = tree.getroot()

        filename = root.find('filename').text.strip()
        size = root.find('size')
        width = int(size.find('width').text)
        height = int(size.find('height').text)

        objects = root.findall('object')

        if not objects:
          herida_person_labels.append({
            'filename': filename,
            'width': width,
            'height': height,
            'class': 'no_person',
            'xmin': None,
            'ymin': None,
            'xmax': None,
            'ymax': None
          })
        else:
          for obj in objects:
            label = obj.find('name').text.strip()
            bbox = obj.find('bndbox')
            xmin = int(bbox.find('xmin').text)
            ymin = int(bbox.find('ymin').text)
            xmax = int(bbox.find('xmax').text)
            ymax = int(bbox.find('ymax').text)

            herida_person_labels.append({
              'filename': filename,
              'width': width,
              'height': height,
              'class': label,
              'xmin': xmin,
              'ymin': ymin,
              'xmax': xmax,
              'ymax': ymax
            })

      except Exception as e:
        print(f"Error parsing {file}: {e}")

  df = pd.DataFrame(herida_person_labels)

  # Save the CSV in the same folder
  df.sort_values(by='filename', inplace=True)
  df.to_csv(output_csv_path, index=False)

  print(f"CSV created: {output_csv_path} ({len(df)} rows)")
  return output_csv_path



# Let's try:
folder_path_herida = '/content/drive/MyDrive/projectUPV/datasets/Herida_dataset/train'

csv_herida = create_csv_from_xml(folder_path_herida, "herida_person_labels.csv")

df = pd.read_csv(csv_herida)
print(df.head()) # Print first few rows to verify

CSV created: herida_person_labels.csv (2685 rows)
      filename  width  height   class    xmin    ymin    xmax    ymax
0  hda0001.jpg   4000    3000  person  3471.0  1195.0  3540.0  1275.0
1  hda0001.jpg   4000    3000  person  2645.0  2820.0  2709.0  2887.0
2  hda0001.jpg   4000    3000  person   355.0  1751.0   435.0  1791.0
3  hda0002.jpg   4000    3000  person  2656.0  2810.0  2706.0  2884.0
4  hda0002.jpg   4000    3000  person  3478.0  1180.0  3536.0  1269.0


Let's check if this new csv file is correct:

In [None]:
result = check_csv_vs_xml_annotations(folder_path_herida, 'herida_person_labels.csv')

print("Result for herida_person_labels.csv:\n")
print("Correct Annotations:", result['matched'])
print("Annotations only in CSV:", result['only_in_csv'])
print("Annotations only in XML:", result['only_in_xml'])

Result for herida_person_labels.csv:

Correct Annotations: 2129
Annotations only in CSV: 556
Annotations only in XML: 0


The 556 annotations found only in the CSV file correspond to images without bounding boxes: `create_csv_from_xml` writes one `no_person` row for each XML file that has no `<object>` element. \
\
Now let's verify that all images have the same dimensions:

In [None]:
folder_path_herida = '/content/drive/MyDrive/projectUPV/datasets/Herida_dataset/train'

csv_herida_person_labels_path = os.path.join(folder_path_herida, 'herida_person_labels.csv')

check_image_dimensions_consistency(csv_herida_person_labels_path)

All images have the same size: 4000x3000


Unnamed: 0,width,height,image_count
0,4000,3000,1459


We now verify that all bounding boxes are within the image boundaries:

In [None]:
csv_herida_person_labels_path = os.path.join(folder_path_herida, 'herida_person_labels.csv')

info_herida = infer_indexing_from_csv_verbose(csv_herida_person_labels_path, thresh=0.2)
print("HERIDA indexing:", info_herida)

voc_flag_herida = (info_herida["indexing"] == "1-based") or (info_herida["indexing"] == "ambiguous")

invalid_bboxes = check_bboxes_out_of_bounds_from_csv(
  csv_herida_person_labels_path, voc_one_indexed=voc_flag_herida
)
print(invalid_bboxes[['filename','xmin','ymin','xmax','ymax','width','height']])

Checked all entries. Found 556 invalid bounding boxes.
         filename  xmin  ymin  xmax  ymax  width  height
39    hda0016.jpg   NaN   NaN   NaN   NaN   4000    3000
40    hda0017.jpg   NaN   NaN   NaN   NaN   4000    3000
41    hda0018.jpg   NaN   NaN   NaN   NaN   4000    3000
42    hda0019.jpg   NaN   NaN   NaN   NaN   4000    3000
43    hda0020.jpg   NaN   NaN   NaN   NaN   4000    3000
...           ...   ...   ...   ...   ...    ...     ...
2173  hda1420.jpg   NaN   NaN   NaN   NaN   4000    3000
2174  hda1421.jpg   NaN   NaN   NaN   NaN   4000    3000
2176  hda1423.jpg   NaN   NaN   NaN   NaN   4000    3000
2553  hda1539.jpg   NaN   NaN   NaN   NaN   4000    3000
2554  hda1540.jpg   NaN   NaN   NaN   NaN   4000    3000

[556 rows x 7 columns]
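
Every row shown in this report has empty coordinates, and the count of 556 equals the number of `no_person` placeholder rows. This suggests the rows are flagged because they carry no box, not because a box leaves the image. A sketch that separates the two cases, working on plain dictionaries such as those produced by `csv.DictReader`. The helper `split_bbox_rows` is hypothetical, and the accepted range (`1..width` when 1-based, `0..width-1` otherwise) is an assumption, not the behaviour of the notebook's `check_bboxes_out_of_bounds_from_csv`.

```python
import csv
import io
import math

COORDS = ("xmin", "ymin", "xmax", "ymax")


def parse_coord(value):
    """Return a float, or None for empty strings and NaN."""
    if value is None or str(value).strip() == "":
        return None
    number = float(value)
    return None if math.isnan(number) else number


def split_bbox_rows(rows, one_indexed=True):
    """Split rows into (without_box, out_of_bounds, valid)."""
    low = 1 if one_indexed else 0
    shrink = 0 if one_indexed else 1
    without_box, out_of_bounds, valid = [], [], []
    for row in rows:
        box = [parse_coord(row.get(c)) for c in COORDS]
        if any(v is None for v in box):
            without_box.append(row)  # placeholder row, nothing to check
            continue
        xmin, ymin, xmax, ymax = box
        high_x = int(row["width"]) - shrink
        high_y = int(row["height"]) - shrink
        inside = low <= xmin < xmax <= high_x and low <= ymin < ymax <= high_y
        (valid if inside else out_of_bounds).append(row)
    return without_box, out_of_bounds, valid


# The third row is invented to show a box that really leaves the image.
sample_csv = io.StringIO(
    "filename,width,height,class,xmin,ymin,xmax,ymax\n"
    "hda0001.jpg,4000,3000,person,3471.0,1195.0,3540.0,1275.0\n"
    "hda0016.jpg,4000,3000,no_person,,,,\n"
    "demo.jpg,4000,3000,person,3990.0,10.0,4005.0,50.0\n"
)
without_box, out_of_bounds, valid = split_bbox_rows(list(csv.DictReader(sample_csv)))
print(len(without_box), len(out_of_bounds), len(valid))
```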


We detect “suspicious” bounding boxes using the same function previously applied to the SARD dataset:

In [None]:
folder_path_herida = '/content/drive/MyDrive/projectUPV/datasets/Herida_dataset/train'

csv_herida_person_labels_path = os.path.join(folder_path_herida, 'herida_person_labels.csv')

result = classify_image_annotation_quality(csv_herida_person_labels_path, folder_path_herida)

print(f"Case 1 (only bbox suspicious): {len(result['case1_all_suspicious'])} and they are: {result['case1_all_suspicious']}")
print(f"Case 2 (bbox mix): {len(result['case2_some_suspicious'])}")
print(f"Case 3 (no annotation): {len(result['case3_no_annotations'])} and they are: {result['case3_no_annotations']}")

# I also think that the images in cases 1 and 3 are useless, so I'm going to delete them.

Case 1 (only bbox suspicious): 10 and they are: {'hda0857.jpg', 'hda0925.jpg', 'hda1178.jpg', 'hda1182.jpg', 'hda1183.jpg', 'hda1316.jpg', 'hda0498.jpg', 'hda0255.jpg', 'hda1269.jpg', 'hda0646.jpg'}
Case 2 (bbox mix): 13
Case 3 (no annotation): 0 and they are: set()


In [None]:
# First, let's preview which rows would be removed from the CSVs, without editing them.

csv_files = [csv_herida_person_labels_path]

print("case1: ", result['case1_all_suspicious'])
print("case3: ", result['case3_no_annotations'])

images_to_remove = result['case1_all_suspicious'].union(result['case3_no_annotations'])

# Show what would be eliminated
preview_dataset_cleaning(csv_files, images_to_remove)

case1:  {'hda0857.jpg', 'hda0925.jpg', 'hda1178.jpg', 'hda1182.jpg', 'hda1183.jpg', 'hda1316.jpg', 'hda0498.jpg', 'hda0255.jpg', 'hda1269.jpg', 'hda0646.jpg'}
case3:  set()

 /content/drive/MyDrive/projectUPV/datasets/Herida_dataset/train/herida_person_labels.csv: 10 rows to remove
Filenames involved: ['hda0255.jpg' 'hda0498.jpg' 'hda0646.jpg' 'hda0857.jpg' 'hda0925.jpg'
 'hda1178.jpg' 'hda1182.jpg' 'hda1183.jpg' 'hda1269.jpg' 'hda1316.jpg']


In [None]:
# Ok now we can really delete the images

items_deleted = clean_dataset(folder_path_herida, csv_files, images_to_remove)
print("Number of images deleted: ", len(items_deleted["deleted_images"]))
print("Number of xml files deleted: ", len(items_deleted["deleted_xmls"]))
print("Rows deleted: ", len(items_deleted["removed_csv_rows"]))


Check on /content/drive/MyDrive/projectUPV/datasets/Herida_dataset/train/herida_person_labels.csv: 10 rows to delete
Filenames: ['hda0255.jpg' 'hda0498.jpg' 'hda0646.jpg' 'hda0857.jpg' 'hda0925.jpg'
 'hda1178.jpg' 'hda1182.jpg' 'hda1183.jpg' 'hda1269.jpg' 'hda1316.jpg']
Updated /content/drive/MyDrive/projectUPV/datasets/Herida_dataset/train/herida_person_labels.csv: 10 rows removed
Number of images deleted:  10
Number of xml files deleted:  10
Rows deleted:  1


We perform a general analysis of the dataset:

In [None]:
folder_path_herida = '/content/drive/MyDrive/projectUPV/datasets/Herida_dataset/train'
csv_herida_person_labels_path = os.path.join(folder_path_herida, 'herida_person_labels.csv')
results = analyze_dataset_annotations(csv_herida_person_labels_path, folder_path_herida)

print("How many images for each label:")
print(results["label_count"])

print("\nDistribution of bbox for image:")
for n_bbox, n_images in results["distribution_of_bbox"].items():
  print(f"There are {n_images} images with {n_bbox} bbox")

print("\nImages with at least one label (bounding box):", results["number_images_annotated"])
print("Total images:", results["number_all_images"])
print("Images without any label:", results["number_images_without_annotations"])
print("The images without any label are:", results["images_without_annotations"])


How many images for each label:
class
person       2119
no_person     556
Name: count, dtype: int64

Distribution of bbox for image:
There are 932 images with 1 bbox
There are 350 images with 2 bbox
There are 81 images with 3 bbox
There are 23 images with 4 bbox
There are 9 images with 5 bbox
There are 9 images with 6 bbox
There are 9 images with 7 bbox
There are 7 images with 8 bbox
There are 2 images with 9 bbox
There are 2 images with 10 bbox
There are 2 images with 11 bbox
There are 2 images with 14 bbox
There are 4 images with 15 bbox
There are 4 images with 16 bbox
There are 1 images with 17 bbox
There are 1 images with 18 bbox
There are 3 images with 19 bbox
There are 3 images with 21 bbox
There are 3 images with 22 bbox
There are 1 images with 28 bbox
There are 1 images with 29 bbox

Images with at least one label (bounding box): 1449
Total images: 1449
Images without any label: 0
The images without any label are: set()


Since the two datasets contain images of different sizes, the images need to be resized and normalized. \
I chose to resize them after the merge, since the target model (and therefore its input size) has not been decided yet.
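
Resizing the images later also means rescaling the bounding boxes by the same factors. Below is a minimal sketch of that coordinate transformation, assuming a plain stretch to a fixed target size (no padding or cropping). The helper `scale_bbox` and the 640x640 target are hypothetical and not part of this notebook:

```python
def scale_bbox(bbox, orig_size, new_size):
  """
  Rescale an (xmin, ymin, xmax, ymax) box from orig_size to new_size.

  Sizes are (width, height) tuples. Assumes the image is stretched to
  new_size without padding or cropping.
  """
  xmin, ymin, xmax, ymax = bbox
  orig_w, orig_h = orig_size
  new_w, new_h = new_size
  return (xmin * new_w / orig_w, ymin * new_h / orig_h,
          xmax * new_w / orig_w, ymax * new_h / orig_h)

# Example with a 1920x1080 image stretched to a hypothetical 640x640 input
print(scale_bbox((1110, 358, 1134, 424), (1920, 1080), (640, 640)))
```

If the chosen model expects letterboxed inputs instead, the padding offsets would have to be added on top of this scaling.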

## Merge

To help the model generalize better, I decided to use both the SARD and Herida datasets. We merge the two into a new, more balanced dataset, which we will call _AERALIS_.

We perform a merge of the two datasets:

In [None]:
# STEP 1: Copy images and XML to a single folder

def merge_folders(source_folders, target_folder):
  """
  Merge the contents of multiple folders into a single folder.

  Args:
    source_folders: list of folders to copy from
    target_folder: folder to copy to
  """
  os.makedirs(target_folder, exist_ok=True)

  for folder in source_folders:
    for file in os.listdir(folder):
      if file.lower().endswith(('.jpg', '.xml')):
        src_path = os.path.join(folder, file)
        dst_path = os.path.join(target_folder, file)

        if not os.path.exists(dst_path):  # Avoid overwrites
          shutil.copy2(src_path, dst_path)
        else:
          print(f"File already exists, skipping: {file}")
    print(f"Copied files from {folder} to {target_folder}")

In [None]:
merged_folder_path = "/content/drive/MyDrive/projectUPV/datasets/AERALIS"

merge_folders([folder_path_sard, folder_path_herida], merged_folder_path)

Copied files from /content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD to /content/drive/MyDrive/projectUPV/datasets/AERALIS
Copied files from /content/drive/MyDrive/projectUPV/datasets/Herida_dataset/train to /content/drive/MyDrive/projectUPV/datasets/AERALIS


Let us now verify that the newly created dataset really does contain all the files:

In [None]:
def verify_merged_files(source_folders, merged_folder):
  """
  Verify that all files in the source folders are present in the merged folder.

  Args:
    source_folders: list of folders the files were copied from
    merged_folder: folder the files were copied to
  """
  from collections import Counter

  def collect_files(folder, extensions):
    files = []
    for file in os.listdir(folder):
      if file.lower().endswith(extensions):
        files.append(file)
    return files

  # 1. Collect all .jpg and .xml files from the sources
  all_source_files = []
  for folder in source_folders:
    files = collect_files(folder, ('.jpg', '.xml'))
    all_source_files.extend(files)

  # Find file names that appear more than once across the sources (possible duplicates)
  duplicates = [item for item, count in Counter(all_source_files).items() if count > 1]

  # 2. Collect all .jpg and .xml files from the merge folder
  merged_files = set(collect_files(merged_folder, ('.jpg', '.xml')))
  all_source_set = set(all_source_files)

  # 3. Compare
  missing_in_merge = all_source_set - merged_files
  unexpected_in_merge = merged_files - all_source_set

  print(f"Total source files (jpg + xml): {len(all_source_files)}")
  print(f"Unique files: {len(all_source_set)}")
  print(f"Files present in the merge folder: {len(merged_files)}")
  print(f"Duplicate files in the sources: {duplicates if duplicates else 'None'}")
  print(f"\nMissing files in the merge: {len(missing_in_merge)}")
  for f in sorted(missing_in_merge):
    print(f"  - {f}")

  print(f"\nUnexpected files in the merge folder: {len(unexpected_in_merge)}")
  for f in sorted(unexpected_in_merge):
    print(f"  - {f}")


In [None]:
merged_folder_path = "/content/drive/MyDrive/projectUPV/datasets/AERALIS"
folder_path_sard = '/content/drive/MyDrive/projectUPV/datasets/SARD_dataset/SARD'
folder_path_herida = "/content/drive/MyDrive/projectUPV/datasets/Herida_dataset/train"

source_folders = [folder_path_sard, folder_path_herida]

verify_merged_files(source_folders, merged_folder_path)

Total source files (jpg + xml): 6852
Unique files: 6852
Files present in the merge folder: 6852
Duplicate files in the sources: None

Missing files in the merge: 0

Unexpected files in the merge folder: 0
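
The check above compares file names only. If we also want to be sure the copies are byte-for-byte identical, we could compare content hashes. A minimal sketch, to be run before the files are renamed because it matches files by name; `file_sha256` and `find_content_mismatches` are hypothetical helpers, not functions defined in this notebook:

```python
import hashlib
import os

def file_sha256(path, chunk_size=1 << 20):
  """Return the SHA-256 hex digest of a file, reading it in chunks."""
  digest = hashlib.sha256()
  with open(path, "rb") as fh:
    for chunk in iter(lambda: fh.read(chunk_size), b""):
      digest.update(chunk)
  return digest.hexdigest()

def find_content_mismatches(source_folder, merged_folder, extensions=('.jpg', '.xml')):
  """List files whose copy in merged_folder is missing or has different content."""
  mismatches = []
  for name in sorted(os.listdir(source_folder)):
    if not name.lower().endswith(extensions):
      continue
    src = os.path.join(source_folder, name)
    dst = os.path.join(merged_folder, name)
    if not os.path.isfile(dst) or file_sha256(src) != file_sha256(dst):
      mismatches.append(name)
  return mismatches
```

Calling `find_content_mismatches(folder_path_sard, merged_folder_path)` and the same for `folder_path_herida` should return empty lists if every copy is intact.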


Let's check that every jpg file still has its matching xml file:

In [None]:
inconsistencies_aeralis = check_image_xml_consistency(merged_folder_path)

# Let's see which files aren't matching
total_unmatched_aeralis = len(inconsistencies_aeralis["images_without_xml"]) + len(inconsistencies_aeralis["xml_without_images"])
print("Total unmatched files:", total_unmatched_aeralis)

print("\nImages without XML:")
for f in inconsistencies_aeralis['images_without_xml']:
  print(f"  - {f}.jpg")

print("\nXML files without image:")
for f in inconsistencies_aeralis['xml_without_images']:
  print(f"  - {f}.xml")

Total images: 3426
Total XML files: 3426
Images without matching XML: 0
XML files without matching image: 0
Total unmatched files: 0

Images without XML:

XML files without image:


Perfect, all pairs are intact.

\
Let's rename the files:

In [None]:
# STEP 2: Rename the files (jpg and xml) and update the xmls

# Let's rename all images (jpg) and xml files using aer as prefix
rename_paired_images_and_xml(merged_folder_path, prefix="aer") # This function was written in the "Dataset Herida" Section

Found 3426 pairs to rename 


 All file renamed with success


Finally, we update the filenames stored inside the XML files accordingly:

In [None]:
# Update the 'filename' (and 'path') tags in every XML file of the directory
update_xml_filenames_herida(merged_folder_path) # This function was written in the "Dataset Herida" Section

Updated 3426 XML files with correct filename/path.
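
To double-check the update, we could verify that the `filename` tag of each XML now matches the name of the XML file itself. A minimal sketch, assuming a Pascal VOC style layout with a top-level `<filename>` tag; `find_filename_tag_mismatches` is a hypothetical helper, not part of this notebook:

```python
import os
import xml.etree.ElementTree as ET

def find_filename_tag_mismatches(folder):
  """Return (xml_name, tag_value) pairs where <filename> is not '<xml stem>.jpg'."""
  mismatches = []
  for name in sorted(os.listdir(folder)):
    if not name.lower().endswith('.xml'):
      continue
    root = ET.parse(os.path.join(folder, name)).getroot()
    tag_value = root.findtext('filename')  # None if the tag is missing
    expected = os.path.splitext(name)[0] + '.jpg'
    if tag_value != expected:
      mismatches.append((name, tag_value))
  return mismatches
```

An empty list from `find_filename_tag_mismatches(merged_folder_path)` would confirm that the tags and the file names agree.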


Now that the AERALIS directory has been created, we set it as the working directory:

In [None]:
# os.chdir('../') # to move up to the parent directory
os.chdir('/content/drive/MyDrive/projectUPV/datasets/AERALIS') # the directory with the dataset
print(os.getcwd()) # to see which directory we are in
# print(os.listdir()) # to list the current directory (pass a path to list another directory)

/content/drive/.shortcut-targets-by-id/1LQbD7p_iS5KLqGNdfrYEvsAx0i_bgB0h/projectUPV/datasets/AERALIS


We generate a CSV file containing the annotations from the XML files:

In [None]:
# STEP 3: Create the new unified CSV from the merged XML files

merged_folder_path = "/content/drive/MyDrive/projectUPV/datasets/AERALIS"

# This function was written in the "Dataset Herida" Section
csv_aeralis = create_csv_from_xml(merged_folder_path, "aeralis_person_labels.csv")

df = pd.read_csv(csv_aeralis)
print(df.head()) # Print first few rows to verify

CSV created: aeralis_person_labels.csv (9203 rows)
      filename  width  height   class    xmin   ymin    xmax   ymax
0  aer0001.jpg   1920    1080  person  1110.0  358.0  1134.0  424.0
1  aer0001.jpg   1920    1080  person  1077.0  367.0  1100.0  428.0
2  aer0001.jpg   1920    1080  person  1041.0  144.0  1061.0  173.0
3  aer0002.jpg   1920    1080  person  1104.0  347.0  1126.0  414.0
4  aer0002.jpg   1920    1080  person  1079.0  365.0  1101.0  428.0


Let's check if this new csv file is correct:

In [None]:
result = check_csv_vs_xml_annotations(merged_folder_path, 'aeralis_person_labels.csv')

print("Result for aeralis_person_labels.csv:\n")
print("Correct Annotations:", result['matched'])
print("Annotations only in CSV:", result['only_in_csv'])
print("Annotations only in XML:", result['only_in_xml'])

Result for aeralis_person_labels.csv:

Correct Annotations: 8647
Annotations only in CSV: 556
Annotations only in XML: 0


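All good: every XML annotation is in the CSV. The 556 annotations found only in the CSV match the number of `no_person` rows, which have no bounding box coordinates.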

\

Finally, let's run the same basic analysis on the new dataset:

In [None]:
merged_folder_path = "/content/drive/MyDrive/projectUPV/datasets/AERALIS"
csv_aeralis_person_labels_path = os.path.join(merged_folder_path, 'aeralis_person_labels.csv')
results = analyze_dataset_annotations(csv_aeralis_person_labels_path, merged_folder_path)

print("How many images for each label:")
print(results["label_count"])

print("\nDistribution of bbox for image:")
for n_bbox, n_images in results["distribution_of_bbox"].items():
  print(f"There are {n_images} images with {n_bbox} bbox")

print("\nImages with at least one label (bounding box):", results["number_images_annotated"])
print("Total images:", results["number_all_images"])
print("Images without any label:", results["number_images_without_annotations"])
print("The images without any label are:", results["images_without_annotations"])

How many images for each label:
class
person       8647
no_person     556
Name: count, dtype: int64

Distribution of bbox for image:
There are 1233 images with 1 bbox
There are 999 images with 2 bbox
There are 436 images with 3 bbox
There are 220 images with 4 bbox
There are 150 images with 5 bbox
There are 145 images with 6 bbox
There are 104 images with 7 bbox
There are 44 images with 8 bbox
There are 68 images with 9 bbox
There are 2 images with 10 bbox
There are 2 images with 11 bbox
There are 2 images with 14 bbox
There are 4 images with 15 bbox
There are 4 images with 16 bbox
There are 1 images with 17 bbox
There are 1 images with 18 bbox
There are 3 images with 19 bbox
There are 3 images with 21 bbox
There are 3 images with 22 bbox
There are 1 images with 28 bbox
There are 1 images with 29 bbox

Images with at least one label (bounding box): 3426
Total images: 3426
Images without any label: 0
The images without any label are: set()
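
Note that the per-image counts above add up to 9203, the total number of CSV rows (8647 + 556), so each `no_person` row appears to be counted as one box. If we want the distribution of person boxes only, a minimal sketch could look like this; `person_bbox_distribution` is a hypothetical helper, not part of `analyze_dataset_annotations`:

```python
from collections import Counter

def person_bbox_distribution(rows):
  """
  Map 'boxes per image' -> 'number of images', counting only person rows.

  rows is an iterable of (filename, class_name) pairs, one per CSV row.
  Images that only have no_person rows are reported under 0 boxes.
  """
  boxes_per_image = Counter()
  for filename, class_name in rows:
    # Adding 0 still registers the image, so it shows up in the 0-box bucket
    boxes_per_image[filename] += 1 if class_name == 'person' else 0
  return dict(sorted(Counter(boxes_per_image.values()).items()))
```

With the DataFrame loaded from the CSV, it could be called as `person_bbox_distribution(zip(df['filename'], df['class']))`.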


To conclude, a few final checks:

In [None]:
# 1. Verify the classes in the merged CSV:
df = pd.read_csv(csv_aeralis_person_labels_path)

print(df['class'].value_counts())

class
person       8647
no_person     556
Name: count, dtype: int64


In [None]:
# 2. Check for unexpected NaN values:
print(df.isnull().sum())

filename      0
width         0
height        0
class         0
xmin        556
ymin        556
xmax        556
ymax        556
dtype: int64
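
The 556 missing values per coordinate column match the number of `no_person` rows, so they are probably expected. To confirm that no `person` row is missing its coordinates, we could group the incomplete rows by class. A minimal sketch using only the standard library, assuming missing values are stored as empty fields in the CSV (the pandas default when writing NaN); `rows_missing_bbox_by_class` is a hypothetical helper:

```python
import csv

BBOX_COLUMNS = ('xmin', 'ymin', 'xmax', 'ymax')

def rows_missing_bbox_by_class(csv_file):
  """Count rows with at least one empty bbox field, grouped by class."""
  counts = {}
  for row in csv.DictReader(csv_file):
    if any(row[col] == '' for col in BBOX_COLUMNS):
      counts[row['class']] = counts.get(row['class'], 0) + 1
  return counts

# Usage on the real file (path variable taken from the cells above):
# with open(csv_aeralis_person_labels_path, newline='') as fh:
#   print(rows_missing_bbox_by_class(fh))
```

If the assumption holds, only the `no_person` class should appear in the result.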


In [None]:
# 3. Check for degenerate bounding boxes or boxes outside the image:
invalid_boxes = df[
    (df['xmin'] >= df['xmax']) |
    (df['ymin'] >= df['ymax']) |
    (df['xmin'] < 0) |
    (df['ymin'] < 0) |
    (df['xmax'] > df['width']) |
    (df['ymax'] > df['height'])
]
print(f"Invalid bbox rows: {len(invalid_boxes)}")

Invalid bbox rows: 0
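
Keep in mind that any comparison involving NaN evaluates to False, so rows with missing coordinates can never be flagged by the filter above. A minimal sketch of a per-box check that reports missing coordinates explicitly; `bbox_problem` is a hypothetical helper, not part of this notebook:

```python
import math

def bbox_problem(xmin, ymin, xmax, ymax, width, height):
  """Return a short description of what is wrong with a box, or None if it is valid."""
  if any(math.isnan(c) for c in (xmin, ymin, xmax, ymax)):
    return "missing coordinates"
  if xmin >= xmax or ymin >= ymax:
    return "empty or inverted box"
  if xmin < 0 or ymin < 0 or xmax > width or ymax > height:
    return "outside image"
  return None

print(float("nan") >= 0)  # prints False: NaN never satisfies a comparison
print(bbox_problem(1110.0, 358.0, 1134.0, 424.0, 1920, 1080))  # prints None
```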
