Start date (yyyy/mm/dd): 2024/09/04
Author: Alessandro Ulivi (ale.ulivi@gmail.com)
Description: this notebook splits time-lapse images of PIP2-enriched domain dynamics in early C. elegans development into individual time-points. Each time-lapse is saved as an individual file with a corresponding manually annotated binary segmentation mask, and the saved time-points have matching names between raw files and the corresponding segmentation ground truths. Raw files and segmentation masks are saved in separate folders. In addition, a fraction of the samples is saved in a different directory, to be used as validation dataset.
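
The splitting step can be illustrated on synthetic data. This is a minimal sketch, not the notebook's own code: the array shapes, the helper name split_into_timepoints and the file-name radix are hypothetical, and np.squeeze is restricted to the split axis (an assumption) so that other size-1 axes are preserved.

```python
import numpy as np

def split_into_timepoints(stack, axis=0):
    """Return a list with one array per index along `axis`."""
    # np.split keeps the split axis with length 1; squeeze only that axis
    pieces = np.split(stack, indices_or_sections=stack.shape[axis], axis=axis)
    return [np.squeeze(piece, axis=axis) for piece in pieces]

# Synthetic time-lapse: 5 time-points of 4x6 pixels (hypothetical data)
raw_stack = np.arange(5 * 4 * 6).reshape(5, 4, 6)
mask_stack = (raw_stack % 2 == 0).astype(np.uint8)

raw_frames = split_into_timepoints(raw_stack)
mask_frames = split_into_timepoints(mask_stack)

# The same name is used for a raw time-point and its mask (hypothetical radix)
names = [f"embryo_pip2_sample_1_s{i}.tif" for i in range(len(raw_frames))]
```

Because raw and mask stacks are split with the same axis and the same counter, a raw time-point and its mask end up with the same file name.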

Expected structure of input and output folders.
- input_folder
    - input_masks_folder
        - sample_1 -> manually labelled binary mask. .tif file. The name contains the string target_structure. No other files in the sample_ folder contain the string
        - sample_2 -> manually labelled binary mask. .tif file. The name contains the string target_structure. No other files in the sample_ folder contain the string
        ...
    - input_raw_folder
        - sample_1 -> raw file. .tif file. The name contains the string target_structure. No other files in the sample_ folder contain the string
        - sample_2 -> raw file. .tif file. The name contains the string target_structure. No other files in the sample_ folder contain the string
        ...

- output_folder
    - output_train_folder
    - output_test_test

Input_folder, input_masks_folder, input_raw_folder, output_folder, output_train_folder, output_test_test can have any name. Their names are specified below.
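
Before running the split, the expected input structure can be checked. The following is a minimal sketch under stated assumptions: the helper names find_target_file and check_input_structure are hypothetical, names starting with "." are skipped (an assumption about which entries should be ignored), and empty placeholder files stand in for the .tif stacks.

```python
import os
import tempfile

def find_target_file(sample_dir, target_structure):
    """Return the single file name in sample_dir that contains target_structure."""
    matches = [f for f in sorted(os.listdir(sample_dir))
               if target_structure in f and not f.startswith(".")]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly 1 file containing '{target_structure}' "
            f"in {sample_dir}, found {len(matches)}")
    return matches[0]

def check_input_structure(masks_folder, raw_folder, target_structure):
    """Map each sample folder name to its (mask file, raw file) pair."""
    pairs = {}
    for sample in sorted(os.listdir(masks_folder)):
        if sample.startswith("."):
            continue
        raw_dir = os.path.join(raw_folder, sample)
        if not os.path.isdir(raw_dir):
            raise FileNotFoundError(f"no raw folder for sample {sample}")
        mask_dir = os.path.join(masks_folder, sample)
        pairs[sample] = (find_target_file(mask_dir, target_structure),
                         find_target_file(raw_dir, target_structure))
    return pairs

# Build a throw-away folder tree that follows the structure described above
with tempfile.TemporaryDirectory() as root:
    for folder in ("dl_training", "raw"):
        for sample in ("sample_1", "sample_2"):
            sample_dir = os.path.join(root, folder, sample)
            os.makedirs(sample_dir)
            # empty placeholder files stand in for the .tif stacks
            open(os.path.join(sample_dir, f"{sample}_pip2.tif"), "w").close()
            open(os.path.join(sample_dir, f"{sample}_other.tif"), "w").close()
    pairs = check_input_structure(os.path.join(root, "dl_training"),
                                  os.path.join(root, "raw"), "pip2")
```

A check of this kind fails early, with a message naming the offending folder, when a sample has no file or more than one file containing the target string.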

Output structure of output folder.
- output_folder
    - output_train_folder
        - raw -> raw file. .tif file.
        - label -> manually labelled binary mask. .tif file.
    - output_test_test
        - raw -> raw file. .tif file.
        - label -> manually labelled binary mask. .tif file.

The subset of files used as validation dataset is saved in the "output_test_test" sub-folder of the output_folder.

As files within "sample_X" folders are pooled in the output folders, the names of the raw_files and labelled binary masks are expected to be different between different "sample_X" folders.
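
A possible way to verify this before writing any file is sketched below. The radix is built with the same slicing expression used in the processing cell of this notebook; the helper names and the sample-to-file mapping are hypothetical.

```python
from collections import Counter

def output_radix(raw_file_name, sample_name, target_structure):
    """Keep the file name up to and including target_structure, then add the sample name."""
    end = raw_file_name.index(target_structure) + len(target_structure)
    return f"{raw_file_name[:end]}_{sample_name}"

def find_duplicates(names):
    """Return the names that occur more than once, sorted."""
    return sorted(name for name, count in Counter(names).items() if count > 1)

# Hypothetical mapping: sample folder name -> raw file name
samples = {"sample_1": "embryo_pip2_raw.tif", "sample_2": "embryo_pip2_raw.tif"}
radices = [output_radix(f, s, "pip2") for s, f in samples.items()]
duplicates = find_duplicates(radices)
```

If duplicates is not empty, pooled files would overwrite each other in the output folders, so the offending samples should be renamed first.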

NOTE: raw_files in input_folder are not the actual raw images. They went through a pre-processing step that reorganized the actual raw files into individual time-lapse images. During this reorganization the metadata were lost.
NOTE 2 (2024/09/04): for the moment, the split between train and validation datasets is done by randomly selecting a certain number of samples (set by the variable test_fraction) and using them as validation dataset. An alternative would have been to randomly select a fraction of the time-points of each sample. However, I fear that this could cause data leakage, because time-points are not independent: a high correlation is expected (e.g. the embryo is usually positioned in the center of the field of view and does not move), and the closer the time-points, the higher their correlation.
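
The sample-level split described in NOTE 2 can be sketched as follows. This is an illustration under assumptions, not the code used below: the helper name split_samples and the sample names are hypothetical, and a local random.Random instance is used instead of the global seed.

```python
import random

def split_samples(samples, test_fraction, seed=None):
    """Split whole samples into train and test lists, so that all
    time-points of one sample end up in the same dataset."""
    rng = random.Random(seed)  # local generator; seed=None gives a different split per run
    n_test = round(test_fraction * len(samples))
    test = set(rng.sample(samples, k=n_test))
    train = [s for s in samples if s not in test]
    return train, sorted(test)

# Hypothetical sample names
all_samples = [f"sample_{i}" for i in range(1, 11)]
train_samples, test_samples = split_samples(all_samples, test_fraction=0.3, seed=33)
```

Since the assignment is made per sample, no time-point of a validation sample can appear in the training dataset.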

In [1]:
#Import modules
import os
import numpy as np
import tifffile
import random
from utils import listdirNHF, check_files_else_make_folders

In [12]:
#Indicate the directories of the input folder and the output folder
input_folder = r""
output_folder = r""

#Indicate folders names
input_masks_folder_name = 'dl_training'
input_raw_folder_name = 'raw'
output_train_folder_name = 'train'
output_test_test_name = 'test'

#indicate the string of the target structure
target_structure = "pip2"

#indicate the axis along which to split files
split_axis = 0

#indicate the fraction of data to be saved as test dataset
test_fraction = 0.3

#indicate if a random.seed should be used when selecting the samples for training and test - by default a seed is used
use_random_seed = True


In [25]:
#Create the directories of the input_masks_folder, the input_raw_folder, the output_train_folder and the output_test_folder
input_masks_folder = os.path.join(input_folder, input_masks_folder_name)
input_raw_folder = os.path.join(input_folder, input_raw_folder_name)
output_train_folder = os.path.join(output_folder, output_train_folder_name)
output_test_folder = os.path.join(output_folder, output_test_test_name)

#Initialize a variable to keep track of whether files have already been generated
files_presence_output = []

#Create the "raw" and "label" sub-folders in output_train_folder and output_test_folder, if they don't exist - track whether they already contain files
output_train_folder_raw = os.path.join(output_train_folder, 'raw')
output_train_folder_raw_presence = check_files_else_make_folders(output_train_folder_raw)
files_presence_output.append(output_train_folder_raw_presence)

output_train_folder_label = os.path.join(output_train_folder, 'label')
output_train_folder_label_presence = check_files_else_make_folders(output_train_folder_label)
files_presence_output.append(output_train_folder_label_presence)

output_test_folder_raw = os.path.join(output_test_folder, 'raw')
output_test_folder_raw_presence = check_files_else_make_folders(output_test_folder_raw)
files_presence_output.append(output_test_folder_raw_presence)

output_test_folder_label = os.path.join(output_test_folder, 'label')
output_test_folder_label_presence = check_files_else_make_folders(output_test_folder_label)
files_presence_output.append(output_test_folder_label_presence)

#split files and randomly assign them to train and test datasets only if no file is present in any of the raw and label output folders.
#NOTE: this is an extra precaution to prevent the train and test datasets from getting mixed, for any reason, when re-running or changing the code.
if any(files_presence_output):
    print("WARNING! Objects were found in at least one raw or label output folder. Processing is stopped. Remove any object present and re-run the script")
else:
    #Create a list of files in input_masks_folder - note that these files correspond to the samples to analyse
    input_masks_folder_samples_list = listdirNHF(input_masks_folder)

    # if use_random_seed is set to True (default), set a fixed random seed before splitting samples into train and test samples
    # else, the split differs between runs
    if use_random_seed:
        random.seed(33)
    #randomly pick the samples used for validation - the remaining samples are used for training
    test_samples = random.sample(input_masks_folder_samples_list, k=round(test_fraction*len(input_masks_folder_samples_list)))
    train_samples = [s for s in input_masks_folder_samples_list if s not in test_samples]

    #Iterate through the folders (samples) of the labelled masks in input_folder sub-directory input_masks_folder
    for i_f in input_masks_folder_samples_list:
        #create the directory of the folder containing labelled masks
        label_mask_sample_folder_dir = os.path.join(input_masks_folder, i_f)
        #get the directory of the raw data for the sample inside input_raw_folder
        raw_sample_folder_dir = os.path.join(input_raw_folder, i_f)
        #get the name of the file containing the target_structure string within the sample folder inside the raw data
        target_raw_file = [r_f for r_f in listdirNHF(raw_sample_folder_dir) if target_structure in r_f][0] #note: 1 single file with the target name is expected
        #get the radix of the saving file name
        target_file_radix = f"{target_raw_file[:target_raw_file.index(target_structure)+len(target_structure)]}_{i_f}"
        #get the directories of the files corresponding to the labelled mask and to the corresponding raw data for the target structure
        targstruct_labelled_mask_dir = os.path.join(label_mask_sample_folder_dir, [l_m for l_m in listdirNHF(label_mask_sample_folder_dir) if target_structure in l_m][0])
        targstruct_raw_dir = os.path.join(raw_sample_folder_dir, target_raw_file)
        #open the raw file and the labelled mask
        targstruct_labelled_mask = tifffile.imread(targstruct_labelled_mask_dir)
        targstruct_raw = tifffile.imread(targstruct_raw_dir)
        #split targstruct_labelled_mask and targstruct_raw along split_axis
        targstruct_labelled_mask_split = [np.squeeze(trg) for trg in np.split(targstruct_labelled_mask, indices_or_sections=targstruct_labelled_mask.shape[split_axis], axis=split_axis)]
        targstruct_raw_split = [np.squeeze(trg1) for trg1 in np.split(targstruct_raw, indices_or_sections=targstruct_raw.shape[split_axis], axis=split_axis)]
        #iterate through the split arrays
        for c_ount, split_array_raw in enumerate(targstruct_raw_split):
            #get the label mask corresponding to the raw array
            split_array_label = targstruct_labelled_mask_split[c_ount]
            #form the saving name
            saving_name = f"{target_file_radix}_s{c_ount}.tif"
            #form the saving directory for the train dataset and save the results
            if i_f in train_samples:
                raw_full_saving_directory_train = os.path.join(output_train_folder_raw, saving_name)
                label_full_saving_directory_train = os.path.join(output_train_folder_label, saving_name)
                tifffile.imwrite(raw_full_saving_directory_train, split_array_raw, photometric='minisblack')
                tifffile.imwrite(label_full_saving_directory_train, split_array_label, photometric='minisblack')
            #form the saving directory for the test dataset and save the results
            elif i_f in test_samples:
                raw_full_saving_directory_test = os.path.join(output_test_folder_raw, saving_name)
                label_full_saving_directory_test = os.path.join(output_test_folder_label, saving_name)
                tifffile.imwrite(raw_full_saving_directory_test, split_array_raw, photometric='minisblack')
                tifffile.imwrite(label_full_saving_directory_test, split_array_label, photometric='minisblack')
            else:
                print(f"sample {i_f} was neither in the train samples nor in the test samples")

