Start date (yyyy/mm/dd): 2024/09/04
Author: Alessandro Ulivi (ale.ulivi@gmail.com)
Description: the notebook was written to split time-lapse files (each file is a 3D array with dimensions TYX) of pip2-enriched domain dynamics and corresponding manually annotated binary segmentation masks (each mask is a 3D array with dimensions TYX). The notebook:
1) splits each individual time-lapse file and the corresponding segmentation mask into individual time-points (2D arrays of dimensions YX obtained by splitting the initial time-lapse along the T axis).
2) Individual time-points and the corresponding segmentation masks are further chunked into arrays of 256x256 pixels. Chunks of the same array do not overlap. 4 chunks are obtained for each individual time-point (4 chunks for the raw time-point and 4 for the corresponding segmentation mask). Together, these 4 chunks form a 512x512 array covering the central part of each time-point (raw time-point and corresponding segmentation mask). The pixels at the border of the time-points, framing the 512x512 central chunked array, are discarded (the frame of discarded pixels is 208 pixels wide at each of the top and bottom edges and 86 pixels wide at each of the left and right edges).
3) chunks of individual time-points, both for the raw time-point and for the corresponding binary segmentation time-point, are saved only if the binary segmentation chunk contains some pixels which are labelled.
4) When saving, the names of the raw chunks match the names of the corresponding chunks of the segmentation mask. Raw time-points and segmentation masks are saved in separate folders. The structure of the saving name is "{original name of the sample}_s{progressive numbering}_y{y coordinate of the pixel on the top left corner of the chunk within the original time-point array}_x{x coordinate of the pixel on the top left corner of the chunk within the original time-point array}.tif"
5) Data are divided into 3 fractions, each saved in a different directory: a train fraction, a validation fraction and a test fraction.
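
Steps 1) to 3) can be sketched as follows. This is a minimal illustration and not the code of utils.chunk_center, which is not shown in this notebook: the helper name center_chunks is hypothetical, and the 928x684 frame size is an assumption inferred from the 208 and 86 pixel margins around the 512x512 central array.

```python
import numpy as np

def center_chunks(frame, chunk_y=256, chunk_x=256, n_y=2, n_x=2):
    """Cut n_y * n_x non-overlapping chunks from the center of a 2D (YX) array.

    Hypothetical helper. Returns (chunks, coords), where coords holds the (y, x)
    position of the top-left pixel of each chunk within the original frame.
    """
    size_y, size_x = frame.shape
    # width of the discarded frame on each side of the central block
    off_y = (size_y - n_y * chunk_y) // 2
    off_x = (size_x - n_x * chunk_x) // 2
    if off_y < 0 or off_x < 0:
        raise ValueError("frame is smaller than the central block to chunk")
    chunks, coords = [], []
    for i in range(n_y):
        for j in range(n_x):
            y0 = off_y + i * chunk_y
            x0 = off_x + j * chunk_x
            chunks.append(frame[y0:y0 + chunk_y, x0:x0 + chunk_x])
            coords.append((y0, x0))
    return np.stack(chunks), coords

# toy time-lapse with 3 time-points and dimensions TYX (assumed frame size)
movie = np.arange(3 * 928 * 684).reshape(3, 928, 684)
for frame in movie:  # iterating along axis 0 yields the individual YX time-points
    chunks, coords = center_chunks(frame)
```

With these assumed dimensions, each time-point yields 4 chunks whose top-left corners sit at y 208 or 464 and x 86 or 342.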

Expected structure of input and output folders.
- input_folder
    - input_raw_folder
        - sample_1 -> raw time-lapse file. Dimensions TYX. The position of T dimension can vary. .tif file. The name contains the string target_structure. No other file in the sample_ folder contains the string
        - sample_2 -> raw time-lapse file. Dimensions TYX. The position of T dimension can vary. .tif file. The name contains the string target_structure. No other file in the sample_ folder contains the string
        ...
    - input_masks_folder
        - sample_1 -> manually labelled binary mask for the raw time-lapse file. Dimensions TYX. The position of T dimension can vary but it must match that of the corresponding raw time-lapse file. .tif file. The name contains the string target_structure. No other file in the sample_ folder contains the string
        - sample_2 -> manually labelled binary mask for the raw time-lapse file. Dimensions TYX. The position of T dimension can vary but it must match that of the corresponding raw time-lapse file. .tif file. The name contains the string target_structure. No other file in the sample_ folder contains the string
        ...

- output_folder
    - output_train_folder
    - output_validation_folder
    - output_test_folder

Input_folder, input_masks_folder, input_raw_folder, output_folder, output_train_folder, output_validation_folder and output_test_folder can have any name. Their names are specified below.

Output structure.
- output_folder
    - output_train_folder
        - raw -> chunks of raw time-points. .tif files. Dimensions YX (256x256). Time-points are obtained by splitting raw time-lapse files along the T axis. Time-lapse files from different sample_ folders are pooled.
        - label -> chunks of manually labelled binary masks of individual time-points. .tif files. Dimensions YX (256x256). Time-points are obtained by splitting manually labelled binary masks along the T axis. Masks from different sample_ folders are pooled.
    - output_validation_folder
        - raw -> chunks of raw time-points. .tif files. Dimensions YX (256x256). Time-points are obtained by splitting raw time-lapse files along the T axis. Time-lapse files from different sample_ folders are pooled.
        - label -> chunks of manually labelled binary masks of individual time-points. .tif files. Dimensions YX (256x256). Time-points are obtained by splitting manually labelled binary masks along the T axis. Masks from different sample_ folders are pooled.
    - output_test_folder
        - raw -> chunks of raw time-points. .tif files. Dimensions YX (256x256). Time-points are obtained by splitting raw time-lapse files along the T axis. Time-lapse files from different sample_ folders are pooled.
        - label -> chunks of manually labelled binary masks of individual time-points. .tif files. Dimensions YX (256x256). Time-points are obtained by splitting manually labelled binary masks along the T axis. Masks from different sample_ folders are pooled.

The subsets of files used as validation and test datasets are saved in, respectively, the "output_validation_folder" and the "output_test_folder" sub-folders of the output_folder.

As files within "sample_X" folders are pooled in the output folders, the names of the raw time-lapse files and corresponding labelled binary masks are expected to be different between different "sample_" folders.
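
The naming scheme of the saved chunks, and why distinct sample names matter once files are pooled, can be illustrated with a small sketch. The function names chunk_name and parse_chunk_name and the sample name "pip2_sample_1" are hypothetical and only serve the example.

```python
import re

def chunk_name(sample_name, timepoint, y, x):
    """Build the saving name of a chunk following the scheme described above."""
    return f"{sample_name}_s{timepoint}_y{y}_x{x}.tif"

# the sample name may itself contain underscores, so the numeric fields are anchored at the end
_NAME_PATTERN = re.compile(r"^(?P<sample>.+)_s(?P<t>\d+)_y(?P<y>\d+)_x(?P<x>\d+)\.tif$")

def parse_chunk_name(file_name):
    """Recover (sample name, time-point index, y, x) from a chunk file name."""
    match = _NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"unexpected chunk name: {file_name}")
    return match["sample"], int(match["t"]), int(match["y"]), int(match["x"])

name = chunk_name("pip2_sample_1", 12, 208, 342)
fields = parse_chunk_name(name)
```

Two samples sharing the same name would produce identical file names for the same time-point and chunk position, so the second file would overwrite the first in the pooled folder.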

NOTES:
1) Time-lapse files were obtained by live imaging of the C. elegans early embryo (1-6 cell stage) on a Nikon CSU-X1 spinning disk microscope, using a 100x, 1.4NA, oil immersion objective (xy pixel size 0.11 um). pip2-enriched domains are labelled using the ACR074 C. elegans strain, which expresses mCherry fused to a PIP2-binding domain.
2) Raw time-lapse files in input_folder are not the actual raw images. They went through a pre-processing step that reorganized the actual raw files into individual time-lapse files. During this reorganization metadata were lost.
3) (note from 2024/09/04): the split between train, validation and test datasets is done, for the moment, by randomly selecting a certain number of samples (set by the variables validation_fraction and test_fraction) and using them as validation and test datasets. An alternative method could have been to randomly select a fraction of time-points from each sample. However, I fear that this could cause data leakage, as time-points are not independent: a high correlation is expected (e.g. the embryo is usually positioned in the center of the field of view and does not move), and the closer the time-points, the higher their correlation.
4) (note from 2024/10/09): the chunking method leads to a potential bias: because embryos are mostly centered in each time-point, chunking a 512x512 array centered on the time-point into 4 non-overlapping chunks of 256x256 pixels yields chunks that are roughly half embryo and half background. For this reason labelled structures are mostly on the border of the chunk and not randomly distributed across the train arrays. This is an aspect which I am not considering at the moment, but which might have to be considered in the future.
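
The sample-level split of note 3 can be sketched as below. The function name split_samples and the sample names are hypothetical. Unlike the notebook code, which seeds the global random module, this sketch uses a local random.Random generator; as in the notebook, the sizes of both the test and the validation sets are computed on the total number of samples.

```python
import random

def split_samples(samples, validation_fraction, test_fraction, seed=None):
    """Assign whole samples (not single time-points) to train, validation and test."""
    rng = random.Random(seed)  # local generator: the global random state is left untouched
    n_total = len(samples)
    test = rng.sample(samples, k=round(test_fraction * n_total))
    remaining = [s for s in samples if s not in test]
    # the validation size is also computed on the total number of samples
    validation = rng.sample(remaining, k=round(validation_fraction * n_total))
    train = [s for s in remaining if s not in validation]
    return train, validation, test

samples = [f"sample_{i}" for i in range(1, 11)]
train, validation, test = split_samples(samples, validation_fraction=0.3, test_fraction=0.2, seed=33)
```

Because every time-point of a sample follows its sample, correlated time-points of the same embryo never end up in two different datasets.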

In [1]:
#Import packages
import os
import numpy as np
import tifffile
import random
from utils import listdirNHF, check_folder_files_else_make_folder, chunk_center, measure_labelled_pixels_fraction
from data_preparation import normalize

In [2]:
#Indicate the directories of the input folder and the output folder
input_folder = r"C:\Users\aless\OneDrive\Desktop\Ale\personal\others\courses_certificates\EMBL_deeplearning_2024\dataset"
output_folder = r"C:\Users\aless\OneDrive\Desktop\Ale\personal\projects\pip2_segmentation\data"

#Indicate folders names
input_masks_folder_name = 'dl_training'
input_raw_folder_name = 'raw'
output_train_folder_name = 'train'
output_validation_folder_name = 'validation'
output_test_folder_name = 'test'

#indicate the string of the target structure
target_structure = "pip2"

#indicate the axis along which to split files
split_axis = 0

#set the minimum fraction of labelled pixels that a chunk must contain in order to be saved.
#the chunk (both raw and label mask) is saved if the fraction of labelled pixels in the label mask chunk is > threshold_label_px
threshold_label_px = 0.005

#indicate the fraction of data to be saved as validation dataset
validation_fraction = 0.3

#indicate the fraction of data to be saved as test dataset
test_fraction = 0.2

#indicate if a random seed should be used when selecting the samples for the train, validation and test datasets - by default a seed is used
use_random_seed = True
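
The effect of threshold_label_px can be illustrated with a small sketch. The function labelled_fraction is a hypothetical stand-in for utils.measure_labelled_pixels_fraction, whose code is not shown here; it assumes that any non-zero pixel of the mask counts as labelled.

```python
import numpy as np

def labelled_fraction(mask):
    """Fraction of labelled pixels in a mask (assumption: any non-zero value is a label)."""
    return np.count_nonzero(mask) / mask.size

threshold = 0.005  # same value as threshold_label_px

# toy 256x256 label chunk with 2 fully labelled rows (512 labelled pixels)
mask = np.zeros((256, 256), dtype=np.uint8)
mask[:2, :] = 255
fraction = labelled_fraction(mask)
keep = fraction > threshold  # the chunk is saved only when this is True
```

For a 256x256 chunk (65536 pixels), a threshold of 0.005 means that at least 328 pixels must be labelled for the chunk to be saved.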


In [14]:
#Create the directory of the input_mask_folder, the input_raw_folder, the output_train_folder, the output_validation_folder and the output_test_folder
input_masks_folder = os.path.join(input_folder, input_masks_folder_name)
input_raw_folder = os.path.join(input_folder, input_raw_folder_name)
output_train_folder = os.path.join(output_folder, output_train_folder_name)
output_validation_folder = os.path.join(output_folder, output_validation_folder_name)
output_test_folder = os.path.join(output_folder, output_test_folder_name)

#Initialize a variable to keep track of whether files have already been generated in the output folders
files_presence_output = []

#Create the "raw" and "label" sub-folders in output_train_folder, output_validation_folder and output_test_folder, if they don't exist
# NOTE: track the fact that they have already been created or not and whether they contain or not files inside
#--- train
output_train_folder_raw = os.path.join(output_train_folder, 'raw')
output_train_folder_raw_presence = check_folder_files_else_make_folder(output_train_folder_raw)
files_presence_output.append(output_train_folder_raw_presence)

output_train_folder_label = os.path.join(output_train_folder, 'label')
output_train_folder_label_presence = check_folder_files_else_make_folder(output_train_folder_label)
files_presence_output.append(output_train_folder_label_presence)

#--- validation
output_validation_folder_raw = os.path.join(output_validation_folder, 'raw')
output_validation_folder_raw_presence = check_folder_files_else_make_folder(output_validation_folder_raw)
files_presence_output.append(output_validation_folder_raw_presence)

output_validation_folder_label = os.path.join(output_validation_folder, 'label')
output_validation_folder_label_presence = check_folder_files_else_make_folder(output_validation_folder_label)
files_presence_output.append(output_validation_folder_label_presence)

#--- test
output_test_folder_raw = os.path.join(output_test_folder, 'raw')
output_test_folder_raw_presence = check_folder_files_else_make_folder(output_test_folder_raw)
files_presence_output.append(output_test_folder_raw_presence)

output_test_folder_label = os.path.join(output_test_folder, 'label')
output_test_folder_label_presence = check_folder_files_else_make_folder(output_test_folder_label)
files_presence_output.append(output_test_folder_label_presence)

# if no file is already present in any of the raw and label output folders,
# split time-lapse files and the corresponding binary masks,
# and randomly assign the samples to the train, validation and test datasets
# NOTE: skipping the procedure if files have already been created is an extra precaution to avoid mixing the train, validation and test datasets
# when re-running or changing the code (e.g. if the split is done randomly and without seed, a sample could be assigned to train in a first run and
# to test in a second run of the script).
if any(files_presence_output):
    print("WARNING! objects are found in some raw or label output folder. The processing is not continued. Eliminate any object present and re-run the script")
else:
    #Create a list of files in input_masks_folder - note that these files correspond to the samples to analyse
    input_masks_folder_samples_list = listdirNHF(input_masks_folder)

    # if use_random_seed is set to True (default), seed the random generator so that the split is reproducible
    # else, the split changes at every run
    if use_random_seed:
        random.seed(33)
    #get the total number of samples - it is used to compute the size of both the test and the validation datasets
    n_samples = len(input_masks_folder_samples_list)
    #randomly pick the samples used for the test dataset
    test_samples = random.sample(input_masks_folder_samples_list, k=round(test_fraction*n_samples))
    #remove the test samples from the list of samples
    train_validation_samples = [s for s in input_masks_folder_samples_list if s not in test_samples]
    #randomly pick the samples used for the validation dataset - NOTE: k is computed on the number of ALL the samples
    validation_samples = random.sample(train_validation_samples, k=round(validation_fraction*n_samples))
    #use the remaining samples for the train dataset
    train_samples = [s1 for s1 in train_validation_samples if s1 not in validation_samples]

    #Iterate through the folders (samples) of the labelled masks in input_folder sub-directory input_masks_folder
    for i_f in input_masks_folder_samples_list:
        #create the directory of the folder containing labelled masks
        label_mask_sample_folder_dir = os.path.join(input_masks_folder, i_f)
        #get the directory of the raw data for the sample inside input_raw_folder
        raw_sample_folder_dir = os.path.join(input_raw_folder, i_f)
        #get the name of the file containing the target_structure string within the sample folder inside the raw data
        target_raw_file = [r_f for r_f in listdirNHF(raw_sample_folder_dir) if target_structure in r_f][0] #note: 1 single file with the target name is expected
        #get the radix of the saving file name
        target_file_radix = f"{target_raw_file[:target_raw_file.index(target_structure)+len(target_structure)]}_{i_f}"
        #get the directories of the files corresponding to the labelled mask in the corresponding raw data for the target structure
        targstruct_labelled_mask_dir = os.path.join(label_mask_sample_folder_dir, [l_m for l_m in listdirNHF(label_mask_sample_folder_dir) if target_structure in l_m][0]) #note: 1 single file with the target name is expected
        targstruct_raw_dir = os.path.join(raw_sample_folder_dir, target_raw_file)
        #open the raw file and the labelled mask
        targstruct_labelled_mask = tifffile.imread(targstruct_labelled_mask_dir)
        targstruct_raw = tifffile.imread(targstruct_raw_dir)
        #split targstruct_labelled_mask and targstruct_raw along split_axis
        targstruct_labelled_mask = [np.squeeze(trg) for trg in np.split(targstruct_labelled_mask, indices_or_sections=targstruct_labelled_mask.shape[split_axis], axis=split_axis)]
        targstruct_raw_split = [np.squeeze(trg1) for trg1 in np.split(targstruct_raw, indices_or_sections=targstruct_raw.shape[split_axis], axis=split_axis)]
        #iterate through the split arrays
        for c_ount, split_array_raw in enumerate(targstruct_raw_split):
            #get the label mask corresponding to the raw array
            split_array_label = targstruct_labelled_mask[c_ount]
            #divide split_array_raw and split_array_label into chunks
            chunks_split_array_raw, coords_split_array_raw = chunk_center(split_array_raw, chunk_x=256,chunk_y=256)
            chunks_split_array_label, coords_split_array_label = chunk_center(split_array_label, chunk_x=256,chunk_y=256)
            #iterate through the chunks of chunks_split_array_label
            for chk_n, chunk_array_label in enumerate(list(chunks_split_array_label)):
                #calculate the fraction of labelled pixels in chunk_array_label
                label_px_frac = measure_labelled_pixels_fraction(chunk_array_label)
                #save the label chunk and the corresponding raw image in the respective train, validation or test folders if the fraction of label pixels
                #is higher than threshold_label_px
                if label_px_frac>threshold_label_px:
                    #get the corresponding raw image chunk
                    chunk_array_raw = chunks_split_array_raw[chk_n, ...]
                    #get the coordinates of the top left pixel of the chunk, within the initial split_array_label
                    chunk_y_coord, chunk_x_coord = coords_split_array_label[chk_n][0], coords_split_array_label[chk_n][1]
                    #form the saving name
                    saving_name = f"{target_file_radix}_s{c_ount}_y{chunk_y_coord}_x{chunk_x_coord}.tif"
                    #form the saving directory based on whether sample should be used for train, validation or test and save the results
                    if i_f in train_samples:
                        raw_full_saving_directory_train = os.path.join(output_train_folder_raw, saving_name)
                        label_full_saving_directory_train = os.path.join(output_train_folder_label, saving_name)
                        tifffile.imwrite(raw_full_saving_directory_train, chunk_array_raw, photometric='minisblack')
                        tifffile.imwrite(label_full_saving_directory_train, chunk_array_label, photometric='minisblack')
                    #validation sample: save the chunks in the validation folders
                    elif i_f in validation_samples:
                        raw_full_saving_directory_validation = os.path.join(output_validation_folder_raw, saving_name)
                        label_full_saving_directory_validation = os.path.join(output_validation_folder_label, saving_name)
                        tifffile.imwrite(raw_full_saving_directory_validation, chunk_array_raw, photometric='minisblack')
                        tifffile.imwrite(label_full_saving_directory_validation, chunk_array_label, photometric='minisblack')
                    #test sample: save the chunks in the test folders
                    elif i_f in test_samples:
                        raw_full_saving_directory_test = os.path.join(output_test_folder_raw, saving_name)
                        label_full_saving_directory_test = os.path.join(output_test_folder_label, saving_name)
                        tifffile.imwrite(raw_full_saving_directory_test, chunk_array_raw, photometric='minisblack')
                        tifffile.imwrite(label_full_saving_directory_test, chunk_array_label, photometric='minisblack')
                    else:
                        print(f"sample {i_f} was not found in the train, validation or test samples")
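
Once the chunks are saved, it can be useful to verify that every raw chunk has a label chunk with the same name, since raw and label files are matched by name only. The sketch below is not part of the original notebook; the function name unpaired_files is hypothetical, and the demo uses empty placeholder files in a temporary folder.

```python
import os
import tempfile

def unpaired_files(raw_dir, label_dir):
    """Return the sorted file names found in only one of the two folders."""
    raw_names = set(os.listdir(raw_dir))
    label_names = set(os.listdir(label_dir))
    return sorted(raw_names ^ label_names)

# demo on a temporary folder with one label file missing on purpose
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = os.path.join(tmp, "raw")
    label_dir = os.path.join(tmp, "label")
    os.makedirs(raw_dir)
    os.makedirs(label_dir)
    for file_name in ("a_s0_y208_x86.tif", "a_s0_y208_x342.tif"):
        open(os.path.join(raw_dir, file_name), "w").close()
    open(os.path.join(label_dir, "a_s0_y208_x86.tif"), "w").close()
    missing = unpaired_files(raw_dir, label_dir)
```

An empty list means that the raw and label folders contain exactly the same file names.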

