# Data preprocessing for YOLO-NAS-based detection models

The following notebook contains the code necessary for: 
* Rescaling the images to 1080 x 1080 pixels without distortion by zero-padding them to a square (side equal to the longest side) before resizing
* Computing the Region Of Interest coordinates for the new size
* Generating a training dataset with labels and normalized coordinates in .txt format [label centerX centerY width height].
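
The coordinate update in the second step can be sketched as follows. This is a minimal illustration, not the code in utils.py: the function name `rescale_roi` and its parameters are hypothetical. It assumes the ROI is given as a top-left corner plus width and height in pixels, and that the zero padding is appended to the right or bottom edge so the image origin does not move. If the padding were centered instead, an offset would have to be added before scaling.

```python
def rescale_roi(x_min, y_min, box_w, box_h, img_w, img_h, target=1080):
    """Map a pixel ROI from the original image to the padded, resized image.

    Hypothetical helper: assumes padding is added on the right/bottom only.
    """
    # After padding, the square side equals the longest original side
    side = max(img_w, img_h)
    # One uniform scale factor, so the aspect ratio is preserved
    scale = target / side
    return (x_min * scale, y_min * scale, box_w * scale, box_h * scale)


# A 2000 x 4000 (width x height) image is padded to 4000 x 4000,
# then resized to 1080 x 1080, so every coordinate is scaled by 0.27.
print(rescale_roi(500, 1000, 200, 400, img_w=2000, img_h=4000))
```

Because a single scale factor is applied to both axes, the ROI keeps its proportions, which is the point of padding before resizing.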

The abnormality classes addressed in this experiment are: 1 Architectural distortion, 2 Mass and 3 Calcification.

The public datasets used are: 

* MIAS
* CBIS-DDSM
* CDD-CESM
* INbreast
* BMCD
* VinDr

Links to the documentation of each dataset are provided in the README.md. This repository also includes the .xlsx file containing the coordinates of every ROI. Note that the images from the MIAS database have been rotated, squared and converted to DICOM, so the coordinates shown here will not match the original PGM images.

Don't forget to modify the image paths in the 'AbsPath' column according to your own storage.
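
One possible way to update the paths in bulk is to swap the leading folder of each entry. The sketch below is only an illustration: `rebase_path` is a hypothetical helper, and both root folders are made-up examples that you would replace with your own.

```python
def rebase_path(path, old_root, new_root):
    """Replace a leading old_root with new_root; leave other paths untouched."""
    if path.startswith(old_root):
        return new_root + path[len(old_root):]
    return path


# Example roots (assumptions, not the paths used in this repository)
old_root = "D:\\data"
new_root = "E:\\datasets"

print(rebase_path("D:\\data\\MIAS\\img1.dcm", old_root, new_root))
```

With the dataframe loaded, the same function could be applied to the whole column, for example `df['AbsPath'] = df['AbsPath'].apply(lambda p: rebase_path(p, old_root, new_root))`.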

In [42]:
import os
import torch
from tqdm import tqdm
import pandas as pd
import numpy as np
import utils
import processing

In [43]:
this_device = 'cuda' if torch.cuda.is_available() else 'cpu'
this_device

'cuda'

## Coordinates file 

This code computes the data needed to produce both COCO- and YOLO-format labels. Larger dataframes take a long time to process; progress can be tracked with the tqdm module.
This cell has already been run, so you can save time by using the .csv as it is.
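
For reference, the two label formats differ in how the box is anchored and scaled: COCO stores `[x_min, y_min, width, height]` in pixels, while YOLO stores the box center and size divided by the image dimensions. The conversion below is a minimal sketch with a hypothetical function name; it is not the implementation in utils.py.

```python
def coco_to_yolo(box, img_w, img_h):
    """Convert a COCO box (x_min, y_min, w, h in pixels) to YOLO
    (center_x, center_y, w, h), each normalized to the range [0, 1]."""
    x_min, y_min, w, h = box
    cx = (x_min + w / 2) / img_w
    cy = (y_min + h / 2) / img_h
    return (cx, cy, w / img_w, h / img_h)


# A 108 x 216 pixel box with its top-left corner at (270, 540)
# in a 1080 x 1080 image
print(coco_to_yolo((270, 540, 108, 216), 1080, 1080))
```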

In [46]:
csv_file = '..\\DetectorDatasetMG\\Templatedetfile.csv'
df = pd.read_csv(csv_file)

In [47]:
class ImageNumbering:
    
    def __init__(self):
        self.prev_row = None
        self.current_num = 1
        self.current_idx = 1

    def update_row(self, row):
        row, self.prev_row, self.current_num, self.current_idx = utils.assign_image_numbers(row, self.prev_row, self.current_num, self.current_idx)
        row = utils.assign_labels_and_names(row)
        row = utils.resized_normalized_coordinates(row)
        return row


tqdm.pandas() # Initialize tqdm for progress bar
img = ImageNumbering()
df_progress = df.progress_apply(img.update_row, axis=1)

df_progress.to_csv(csv_file, index=False)


100%|██████████| 6678/6678 [56:53<00:00,  1.96it/s]  


#### Run this cell to get a different dataset split [train, valid, test]

In [50]:
def assign_dl_folder(csv_file, ratio):
    """Randomly assign each row to 'train', 'valid' or 'test' and save the CSV."""
    df = pd.read_csv(csv_file)

    # Sanity checks: np.random.choice requires probabilities that sum to 1
    assert len(ratio) == 3, "Ratio must be a tuple of three elements"
    assert np.isclose(sum(ratio), 1), "Ratio elements must sum to 1"

    # ratio is given in this order: (train, valid, test)
    categories = ['train', 'valid', 'test']

    df['set'] = np.random.choice(categories, size=len(df), p=ratio)

    df.to_csv(csv_file, index=False)

assign_dl_folder(csv_file, ratio=(0.8, 0.15, 0.05))

The CSV file now contains all the information needed to generate the dataset. See the comments in utils.py and processing.py for further details.
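
Because the split is drawn independently for every row, two ROIs from the same image can end up in different sets if the CSV holds one row per ROI. A possible alternative, sketched below, draws the set once per image and reuses it for all of that image's rows. The function `split_by_image` is hypothetical, and it assumes that the 'ImageName' column identifies an image uniquely.

```python
import random


def split_by_image(image_names, ratio=(0.8, 0.15, 0.05), seed=0):
    """Return a dict mapping each unique image name to 'train', 'valid' or 'test'.

    The seed makes the split reproducible between runs.
    """
    categories = ["train", "valid", "test"]
    rng = random.Random(seed)
    # Sort the unique names so the result does not depend on input order
    unique_names = sorted(set(image_names))
    return {
        name: rng.choices(categories, weights=ratio, k=1)[0]
        for name in unique_names
    }


# Three images, two of them with several ROIs
names = ["img_a", "img_a", "img_b", "img_c", "img_c", "img_c"]
mapping = split_by_image(names)
print([mapping[name] for name in names])
```

Applied to the dataframe, this could look like `df['set'] = df['ImageName'].map(split_by_image(df['ImageName']))`.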

## Image processing
#### WARNING: This algorithm is not GPU accelerated yet and takes a long time to process.

In [None]:
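import cv2

# dicom_preprocessing, reduce_poisson_noise and add_zeros_for_square are not
# defined in this notebook; they are assumed to come from processing.py.
def data_processing(row, root):
    # Match DICOM extensions regardless of case
    if row['AbsPath'].lower().endswith(('.dcm', '.dicom')):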
        image = dicom_preprocessing(row['AbsPath'], row['pixy'], row['pixx'])
    else:
        image = cv2.imread(row['AbsPath'])
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    image = reduce_poisson_noise(image)
    squared = add_zeros_for_square(image, row['pixy'], row['pixx'])
    det_size = (1080, 1080)
    resized_squared = cv2.resize(squared, det_size, interpolation=cv2.INTER_LINEAR)

    # Build storing path
    image_folder = os.path.join(root, row['set'], 'images')
    os.makedirs(image_folder, exist_ok=True)
    image_name = f"{row['ImageName']}.png"
    image_path = os.path.join(image_folder, image_name)
    cv2.imwrite(image_path, resized_squared)

    # Write label information to text file
    label_folder = os.path.join(root, row['set'], 'labels')
    os.makedirs(label_folder, exist_ok=True)
    label_file_path = os.path.join(label_folder, f"{row['ImageName']}_{row['Index']}.txt")
    with open(label_file_path, 'w') as f:
        f.write(f"{row['Label']} {row['cx']} {row['cy']} {row['nw']} {row['nh']}")

    return row
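
The padding step performed by `add_zeros_for_square` is defined outside this notebook. As a rough idea of what such a helper might do, here is a minimal NumPy sketch. It is not the author's implementation: the name `pad_to_square` is hypothetical, and it assumes the zeros are appended to the bottom or right edge.

```python
import numpy as np


def pad_to_square(image):
    """Zero-pad a 2D or 3D image array so that height equals width."""
    h, w = image.shape[:2]
    side = max(h, w)
    # Pad only after the existing pixels (bottom/right); never pad channels
    pad_spec = [(0, side - h), (0, side - w)] + [(0, 0)] * (image.ndim - 2)
    return np.pad(image, pad_spec, mode="constant", constant_values=0)


# A tall 4 x 2 grayscale image becomes 4 x 4, with zeros on the right
tall = np.ones((4, 2), dtype=np.uint8)
print(pad_to_square(tall).shape)
```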

In [None]:
root = "..\\DetectorDatasetMG\\data" 
df = pd.read_csv(csv_file)
tqdm.pandas() 
df_process_progress = df.progress_apply(lambda row: data_processing(row, root), axis=1)
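
Once the label files are written, a quick sanity check can catch coordinates that were not normalized correctly. The sketch below parses one label line in the `[label centerX centerY width height]` layout and converts it back to pixel corners on the 1080 x 1080 image. `check_label_line` is a hypothetical helper, not part of utils.py or processing.py.

```python
def check_label_line(line, img_size=1080):
    """Validate one YOLO label line and return (label, pixel corner box)."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    # The label may have been written as a float such as '2.0'
    label = int(float(fields[0]))
    cx, cy, nw, nh = (float(v) for v in fields[1:])
    for value in (cx, cy, nw, nh):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"value {value} is outside [0, 1]")
    # Convert the normalized center/size back to pixel corners
    x_min = (cx - nw / 2) * img_size
    y_min = (cy - nh / 2) * img_size
    x_max = (cx + nw / 2) * img_size
    y_max = (cy + nh / 2) * img_size
    return label, (x_min, y_min, x_max, y_max)


print(check_label_line("2 0.5 0.5 0.2 0.1"))
```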


In [40]:
import shutil

def copy_images_with_names(df, source_column, name_column, destination_folder):
    # Ensure the destination folder exists
    os.makedirs(destination_folder, exist_ok=True)
    
    # Iterate over the DataFrame rows and copy each image
    for idx, row in tqdm(df.iterrows(), total=len(df), desc="Copying images"):
        source_path = row[source_column]
        image_name = row[name_column]
        destination_path = os.path.join(destination_folder, image_name)
        shutil.copy(source_path, destination_path)

        
df = pd.read_csv(csv_file)

destination_folder = '../DetectorDatasetMG/data'


In [41]:
copy_images_with_names(df, 'AbsPath', 'ImageName', destination_folder)

Copying images: 100%|██████████| 269/269 [00:34<00:00,  7.74it/s] 


###### 1% reached at 5 minutes 