# Data Preprocessing for All Datasets

## Preprocessing Goals


- **COCO** dataset:
    1. Transform the folder structure from [images, annotations] to [train, validation]
    2. Transform the instances_train2017.json and instances_val2017.json to text files.
        - Bounding Boxes 
        - Class code
        - Segmentation data (mask data with format [(x_i, y_i), (x_(i+1), y_(i+1)), ...])
    3. Resize images and annotations
    4. Compute mean and standard deviation for the data using 2500 images as samples


- **Vis-Drone** dataset:
    1. Convert the bounding boxes from format [x_min, y_min, width, height] to [x_min, y_min, x_max, y_max]
    2. Resize images and annotations
    3. Produce the segmentation data from the bounding box data: [(x_min, y_min), (x_min, y_max), (x_max, y_min), (x_max, y_max)]
    4. Compute mean and standard deviation for the data


- **UAV-SOD Drone** dataset:
    1. Extract the bounding box data and the class codes from XML files and create the text file equivalent
    2. Resize images and annotations
    3. Produce the segmentation data from the bounding box data: [(x_min, y_min), (x_min, y_max), (x_max, y_min), (x_max, y_max)]
    4. Compute mean and standard deviation for the data


- **CityScapes** dataset:
    1. Transform the folder structure from [images, annotations] to [train, validation]
    2. Resize images and annotations
    3. Produce the segmentation data from the bounding box data: [(x_min, y_min), (x_min, y_max), (x_max, y_min), (x_max, y_max)]
    4. Compute mean and standard deviation for the data
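Every dataset above goes through a "resize images and annotations" step. Resizing an image changes its pixel grid, so the box coordinates have to be scaled by the same factors. The sketch below illustrates that idea only. `resize_box` is a hypothetical helper, not the function behind `preprocessing.resize_data`, and it assumes a plain stretch to the target size with no letterboxing or padding.

```python
def resize_box(box_xyxy, original_size, target_size):
    """Scale an [x_min, y_min, x_max, y_max] box to match a resized image.

    Hypothetical helper: assumes the image is stretched from
    original_size (width, height) to target_size (width, height).
    """
    original_width, original_height = original_size
    target_width, target_height = target_size

    # Independent scale factors for the horizontal and vertical axes
    scale_x = target_width / original_width
    scale_y = target_height / original_height

    x_min, y_min, x_max, y_max = box_xyxy
    return [x_min * scale_x, y_min * scale_y, x_max * scale_x, y_max * scale_y]


# Halving a 1000x500 image halves every coordinate of the box
scaled = resize_box([100, 50, 200, 150], (1000, 500), (500, 250))
```

The same two scale factors apply to every (x, y) point of the segmentation data.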

### Import Libraries and data paths

In [1]:
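# Import libraries
import os
import sys
import warnings

from matplotlib import pyplot as plt
from tqdm import tqdm

# Make the parent directory importable so that `src` resolves
# (assumes the notebook sits one level below the project root)
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

import src.data_preprocessing as preprocessing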

 
warnings.filterwarnings("ignore")


# Define the base data paths
COCO_DATA_PATH = "data/coco2017/"
VIS_DATA_PATH = "data/vis_drone_data/"
SOD_DATA_PATH = "data/uav_sod_data/"
CITY_DATA_PATH = "data/city_scapes/"

### COCO2017 Data

In [None]:
# Re-organize the folder structure into train -> [images, annotations]
# and validation -> [images, annotations]
preprocessing.reorganize_coco_structure(COCO_DATA_PATH)

# Define the train and validation paths
train_path      = os.path.join(COCO_DATA_PATH, "train")
validation_path = os.path.join(COCO_DATA_PATH, "validation")

coco_paths = [train_path, validation_path]

# Fix the annotations format, resize the images
for path in coco_paths:
    images_path      = os.path.join(path, "images")
    annotations_path = os.path.join(path, "annotations")
    
    # Annotations and image transformations
    preprocessing.convert_coco_annotations(os.path.join(annotations_path, "annotations.json"), "annotations")
    preprocessing.resize_data(path)
    
# Get the mean and standard deviation for the COCO training set 
preprocessing.compute_mean_std(os.path.join(train_path, "images") , "coco_data")
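To make the annotation conversion more concrete, here is a minimal sketch of how the contents of a COCO `instances_*.json` file could be grouped per image before being written to text files. It relies on the standard COCO fields (`images`, `annotations`, `image_id`, `bbox`, `category_id`, `segmentation`). `group_coco_annotations` and the layout of the returned records are assumptions for illustration and may differ from what `preprocessing.convert_coco_annotations` does.

```python
from collections import defaultdict


def group_coco_annotations(coco):
    """Group COCO annotations by image file name.

    Hypothetical helper: `coco` is the dictionary obtained from
    json.load() on an instances_*.json file.
    """
    file_names = {image["id"]: image["file_name"] for image in coco["images"]}
    grouped = defaultdict(list)

    for annotation in coco["annotations"]:
        polygons = []
        segmentation = annotation.get("segmentation", [])
        # Polygon masks are lists of flat [x1, y1, x2, y2, ...] lists;
        # RLE masks are dictionaries and are skipped in this sketch
        if isinstance(segmentation, list):
            for flat in segmentation:
                polygons.append(list(zip(flat[0::2], flat[1::2])))

        grouped[file_names[annotation["image_id"]]].append({
            "bbox": annotation["bbox"],            # [x_min, y_min, width, height]
            "class_code": annotation["category_id"],
            "polygons": polygons,                  # [(x_i, y_i), ...] per polygon
        })

    return dict(grouped)
```

Each record then holds the three pieces of information listed in the goals: bounding box, class code and segmentation points.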

### Vis-Drone Data

In [None]:
# Define the train and validation paths
train_path      = os.path.join(VIS_DATA_PATH, "train")
validation_path = os.path.join(VIS_DATA_PATH, "validation")

vis_paths = [train_path, validation_path]


# Fix the annotations format, resize the images
for path in vis_paths:
    images_path      = os.path.join(path, "images")
    annotations_path = os.path.join(path, "annotations")
    
    # Annotations and image transformations
    preprocessing.extract_annotation_values(annotations_path)
    preprocessing.resize_data(path)
    
    
# Get the mean and standard deviation for the Vis-Drone training set 
preprocessing.compute_mean_std(os.path.join(train_path, "images") , "vis_data")
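The two Vis-Drone annotation steps from the goals (box format conversion and segmentation points derived from the box) can be sketched as follows. Both function names are hypothetical and are not taken from `src.data_preprocessing`. The corner order mirrors the one listed in the goals.

```python
def xywh_to_xyxy(box):
    """Convert [x_min, y_min, width, height] to [x_min, y_min, x_max, y_max]."""
    x_min, y_min, width, height = box
    return [x_min, y_min, x_min + width, y_min + height]


def box_to_segmentation(box_xyxy):
    """Return the four box corners in the order listed in the goals."""
    x_min, y_min, x_max, y_max = box_xyxy
    return [(x_min, y_min), (x_min, y_max), (x_max, y_min), (x_max, y_max)]


corners = box_to_segmentation(xywh_to_xyxy([10, 20, 30, 40]))
```

Note that this corner order does not trace the box perimeter, so tools that expect a polygon outline may need the last two points swapped.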

### UAV-SOD Data

In [None]:
# Define the train, test and validation paths
train_path      = os.path.join(SOD_DATA_PATH, "train")
test_path       = os.path.join(SOD_DATA_PATH, "test"  )
validation_path = os.path.join(SOD_DATA_PATH, "validation")

uav_paths = [train_path, test_path, validation_path]


# Fix the annotations format, resize the images
for path in uav_paths:
    images_path      = os.path.join(path, "images")
    annotations_path = os.path.join(path, "annotations")
    
    # Annotation and image transformations
    preprocessing.xml_to_txt(annotations_path)
    preprocessing.resize_data(path)
    

# Get the mean and standard deviation for the UAV training set 
preprocessing.compute_mean_std(os.path.join(train_path, "images") , "uav_data")
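As an illustration of the XML to text conversion, the sketch below parses one annotation with Python's `xml.etree.ElementTree`. It assumes a Pascal VOC style layout (`object`, `name`, `bndbox` with `xmin`, `ymin`, `xmax`, `ymax`), which may not match the UAV-SOD files exactly. `voc_xml_to_lines`, the `class_codes` mapping and the output line format are likewise assumptions, not the behavior of `preprocessing.xml_to_txt`.

```python
import xml.etree.ElementTree as ET


def voc_xml_to_lines(xml_text, class_codes):
    """Turn one VOC-style XML annotation into 'class x_min y_min x_max y_max' lines.

    Hypothetical helper: `class_codes` maps class names to integer codes.
    """
    root = ET.fromstring(xml_text)
    lines = []

    for obj in root.iter("object"):
        name = obj.findtext("name")
        box = obj.find("bndbox")
        # Skip objects with an unknown class or without a bounding box
        if name not in class_codes or box is None:
            continue

        coords = [int(float(box.findtext(tag))) for tag in ("xmin", "ymin", "xmax", "ymax")]
        lines.append(" ".join(str(value) for value in [class_codes[name], *coords]))

    return lines
```

The returned lines can be written to a `.txt` file with the same base name as the XML file.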

### CityScapes Data

In [3]:
# Start by re-organizing the folder structure for the annotations
preprocessing.reorganize_cityscapes_annotations(CITY_DATA_PATH)

# Then re-organize the folder structure for the images
# NOTE: CityScapes ships its images and annotations in two completely separate folders
image_folder = "city_scapes"
preprocessing.reorganize_cityscapes_annotations(image_folder)


# Manual action: move the images next to the annotation files

Converting JSON to Text files: 100%|██████████| 1/1 [00:00<00:00, 1041.80it/s]


## Important Note

Run the final cell only after a manual step: take the images from the CityScapes images folder and move them into the folders that hold the annotations.

In [None]:
# Define the train, test and validation paths
train_path      = os.path.join(CITY_DATA_PATH, "train")
test_path       = os.path.join(CITY_DATA_PATH, "test"  )
validation_path = os.path.join(CITY_DATA_PATH, "validation")

city_paths = [train_path, test_path, validation_path]

# Fix the annotations format, resize the images
for path in city_paths:
    images_path      = os.path.join(path, "images")
    annotations_path = os.path.join(path, "annotations")
    
    # Annotation and image transformations
    preprocessing.json_to_text(annotations_path, annotations_path)
    preprocessing.resize_data(path)


# Rename the image and annotation files to make training easier
preprocessing.rename_files(CITY_DATA_PATH)

# Get the mean and standard deviation for the CityScapes training set
preprocessing.compute_mean_std(os.path.join(train_path, "images"), "city_data")
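
Each dataset section ends with a mean and standard deviation computation, and the COCO goals mention using 2500 images as samples. A minimal, dependency-free sketch of that idea is shown below. `channel_mean_std` is a hypothetical function. It assumes every image is already loaded as a sequence of (r, g, b) pixels in the range 0 to 255 and that the statistics are taken over all pixels of the sampled images, which may differ from how `preprocessing.compute_mean_std` works.

```python
import math
import random


def channel_mean_std(images, sample_size=2500, seed=0):
    """Per-channel mean and standard deviation over a random sample of images.

    Hypothetical helper: `images` is a list of images, each a sequence of
    (r, g, b) pixels in 0..255. Values are scaled to 0..1 first.
    """
    rng = random.Random(seed)
    sample = images if len(images) <= sample_size else rng.sample(images, sample_size)

    count = 0
    sums = [0.0, 0.0, 0.0]
    squared_sums = [0.0, 0.0, 0.0]
    for image in sample:
        for pixel in image:
            count += 1
            for channel in range(3):
                value = pixel[channel] / 255.0
                sums[channel] += value
                squared_sums[channel] += value * value

    if count == 0:
        raise ValueError("no pixels to compute statistics from")

    means = [total / count for total in sums]
    # Var[x] = E[x^2] - E[x]^2, clamped at zero against rounding error
    stds = [math.sqrt(max(sq / count - mean * mean, 0.0))
            for sq, mean in zip(squared_sums, means)]
    return means, stds
```

Accumulating sums and squared sums keeps memory use flat, since no image has to stay loaded after it has been counted.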