# 3-class mammography image dataset with abnormalities for YOLO-NAS-based detection models

The following notebook contains the code necessary for: 
* Rescaling the images to 1080 x 1080 pixels without distortion by padding the shorter side with zeros until the image is square
* Computing the region of interest (ROI) coordinates for the new size
* Generating a training dataset with labels and normalized coordinates in .txt format [label centerx centery width height].
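
The coordinate step above can be sketched as follows. This is a minimal illustration, not the code in utils.py or processing.py: `roi_to_yolo` and its arguments are hypothetical names, and it assumes the zeros are appended on the right or bottom edge, so the ROI keeps its pixel position inside the padded square.

```python
def roi_to_yolo(x_min, y_min, box_w, box_h, img_w, img_h, target=1080):
    """Map a pixel-space ROI to normalized YOLO coordinates after pad + resize.

    Assumption: padding is added on the right/bottom edge only, so the ROI
    origin is unchanged in the padded square.
    """
    side = max(img_w, img_h)   # side length of the zero-padded square
    scale = target / side      # uniform scale factor, so no distortion

    # ROI in the resized target x target image, in pixels
    rx, ry = x_min * scale, y_min * scale
    rw, rh = box_w * scale, box_h * scale

    # YOLO format: box center and size as fractions of the image side
    cx = (rx + rw / 2) / target
    cy = (ry + rh / 2) / target
    return cx, cy, rw / target, rh / target


# Example: a 200 x 400 px ROI at (500, 1000) in a 2000 x 4000 px mammogram
label = 2
cx, cy, nw, nh = roi_to_yolo(500, 1000, 200, 400, img_w=2000, img_h=4000)
print(f"{label} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}")
```

Because the scale factor is the same on both axes, the normalized values depend only on the padded side length, not on the final 1080-pixel size.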

The abnormality classes addressed in this experiment are: 1 Architectural distortion, 2 Mass and 3 Calcification.

The public datasets used are: 

* MIAS
* CBIS-DDSM
* CDD-CESM
* INbreast
* BMCD
* VinDr

Links to the documentation of each dataset are provided in README.md. This repository also provides the .xlsx file containing the coordinates of every ROI. In total, there are 5191 images and 6678 abnormalities.

Don't forget to modify the image paths in the 'AbsPath' column according to your own storage.

In [None]:
import os

import cv2
import numpy as np
import pandas as pd
import torch
import cupy as cp  # not used by the CPU version below
from utils import *
from processing import *
# import tqdm

In [None]:
this_device = 'cuda' if torch.cuda.is_available() else 'cpu'
this_device

### Generating the Excel file with the most relevant information about the images

This code computes the data needed to produce labels in both COCO and YOLO format.
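
For reference, the two formats differ only in convention: YOLO stores the box center and size as fractions of the image side, while COCO stores the top-left corner and size in pixels. The sketch below converts one into the other, assuming the 1080 x 1080 output size used in this notebook. `yolo_to_coco` is a hypothetical helper for illustration, not a function from utils.py or processing.py.

```python
def yolo_to_coco(cx, cy, nw, nh, img_size=1080):
    """Convert normalized YOLO [cx, cy, w, h] to COCO pixel [x_min, y_min, w, h]."""
    w = nw * img_size
    h = nh * img_size
    x_min = cx * img_size - w / 2   # center minus half width gives the left edge
    y_min = cy * img_size - h / 2   # center minus half height gives the top edge
    return [x_min, y_min, w, h]


# Example with a box centered at (0.15, 0.3) covering 5% x 10% of the image
coco_box = yolo_to_coco(0.15, 0.3, 0.05, 0.1)
print(coco_box)
```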

In [None]:
excel_file = '..\\DetectorDatasetMG\\Templatedetfile.xlsx'


In [None]:
df = assign_image_number(excel_file)
df = assign_labels_and_names(df, excel_file)
assign_finding_index(df, excel_file)
assign_dl_folder(excel_file, ratio=(0.7,0.2,0.1))
resized_normalized_coordinates(excel_file)

Now the Excel file contains all the information needed to generate the dataset. Please review the comments in utils.py and processing.py for further information.
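
As an illustration of what a 0.7/0.2/0.1 split involves, here is a minimal sketch. It is not the implementation of `assign_dl_folder`, which lives in the project files and is not shown here. The function `assign_split`, the set names `train`, `val` and `test`, and the fixed seed are all assumptions. The split is made per image rather than per finding, so that several abnormalities from the same image cannot end up in different sets.

```python
import random


def assign_split(image_ids, ratio=(0.7, 0.2, 0.1), seed=0):
    """Randomly assign each unique image id to 'train', 'val' or 'test'."""
    ids = sorted(set(image_ids))          # one entry per image, stable order
    random.Random(seed).shuffle(ids)      # reproducible shuffle
    n_train = round(len(ids) * ratio[0])
    n_val = round(len(ids) * ratio[1])

    split = {}
    for k, image_id in enumerate(ids):
        if k < n_train:
            split[image_id] = "train"
        elif k < n_train + n_val:
            split[image_id] = "val"
        else:
            split[image_id] = "test"      # remainder goes to the test set
    return split


# Example with 100 hypothetical image ids
splits = assign_split(range(100))
counts = {name: list(splits.values()).count(name) for name in ("train", "val", "test")}
print(counts)
```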

#### Run the version below if you don't have a GPU. Processing will take a long time.

The result is a dataset of normalized and denoised mammography images, intended to improve model performance.

In [None]:
def dataset_for_detector(file, root):
    """Build the detector dataset: preprocessed 1080 x 1080 PNG images plus YOLO label files."""
    df = pd.read_excel(file)
    det_size = (1080, 1080)
    dicom_extensions = ('.dcm', '.dicom')

    for _, row in df.iterrows():
        # DICOM files get dedicated preprocessing; other formats are min-max normalized to 8 bits
        if row['AbsPath'].lower().endswith(dicom_extensions):
            image = dicom_preprocessing(row['AbsPath'], row['pixy'], row['pixx'])
        else:
            image = cv2.imread(row['AbsPath'])
            if image is None:
                raise FileNotFoundError(f"Could not read image: {row['AbsPath']}")
            image = cv2.normalize(image.astype(float), None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

        image = reduce_poisson_noise(image)

        # Pad with zeros to a square, then resize to the detector input size
        squared = add_zeros_for_square(image, row['pixy'], row['pixx'])
        resized_squared = cv2.resize(squared, det_size, interpolation=cv2.INTER_LINEAR)

        # Build the destination folders for this row's set
        dest_folder = os.path.join(root, row['set'], 'images')
        label_folder = os.path.join(root, row['set'], 'labels')
        os.makedirs(dest_folder, exist_ok=True)
        os.makedirs(label_folder, exist_ok=True)

        # Save the image
        base_name = f"{row['ImageName']}_{row['Index']}"
        cv2.imwrite(os.path.join(dest_folder, f"{base_name}.png"), resized_squared)

        # Write the label line: [label centerx centery width height]
        with open(os.path.join(label_folder, f"{base_name}.txt"), 'w') as f:
            f.write(f"{row['Label']} {row['cx']} {row['cy']} {row['nw']} {row['nh']}")

        # Report progress
        print(row['id'])


In [None]:
dataset_for_detector(file= "C:/Users/sandra/Downloads/yaporfavor.xlsx", root="F:/Bases/Cancer/Mama/Mamografias/BRAHMA_DETECCION/Ver_14062024_NORM")

### Run the version below if you have a GPU

In [None]:
#This is still under construction