Data Loader Support #5

Closed
mathpluscode opened this issue Jun 10, 2020 · 2 comments · Fixed by #62 or #64

@mathpluscode
Member

mathpluscode commented Jun 10, 2020

Data Loader Support

To improve the user experience, we plan to provide default data loaders for different use scenarios. Currently, Nifti and H5 formats are supported. Other use cases and image formats need a customised data loader (add a link to the tutorial).

Data Format

There are some prerequisites on the data:

  • Data must be split into train / val / test beforehand and stored in separate directories, although val and test data are optional.
  • Each image or label is 3D. An image has shape (width, height, depth); a label has shape (width, height, depth) or (width, height, depth, num_labels).
  • The data do not have to share the same shape: all will be resized to the same shape before being fed in. To prevent unexpected effects, we recommend pre-processing all images to the desired shape.
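The shape prerequisites above can be checked with a small helper. This is only a sketch: `check_shapes` is a hypothetical name, not part of the planned loaders, and it assumes that a label shares the spatial shape of its image.

```python
def check_shapes(image_shape, label_shape):
    """Check one image/label pair and return the number of label types.

    Hypothetical helper. Assumes the label has the same
    (width, height, depth) as its image.
    """
    if len(image_shape) != 3:
        raise ValueError(f"image must be 3D, got shape {image_shape}")
    if len(label_shape) not in (3, 4):
        raise ValueError(f"label must be 3D or 4D, got shape {label_shape}")
    if tuple(label_shape[:3]) != tuple(image_shape):
        raise ValueError(
            f"label spatial shape {tuple(label_shape[:3])} "
            f"differs from image shape {tuple(image_shape)}"
        )
    # A 3D label holds a single label type; a 4D label holds num_labels of them.
    return label_shape[3] if len(label_shape) == 4 else 1
```

For example, `check_shapes((64, 64, 32), (64, 64, 32, 4))` reports four label types, while a 2D image shape raises `ValueError`.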

Supported scenarios

Unpaired images (e.g. single-modality inter-subject registration)

  • Case 1-1 multiple independent images.
  • Case 1-2 multiple independent images and corresponding labels.

Grouped unpaired images (e.g. single-modality intra-subject registration)

  • Case 2-1 multiple subjects each with multiple images.
  • Case 2-2 multiple subjects each with multiple images and corresponding labels.

Paired images (e.g. two-modality intra-subject registration)

  • Case 3-1 multiple paired images.
  • Case 3-2 multiple paired images and corresponding labels.

Sampling during training

Sampling for multiple labels

Whenever corresponding labels are available and there are multiple types of labels, e.g. segmentations of different organs in a CT image, two options are available:

  1. During one epoch, each image is sampled only once; when there are multiple labels, one label is randomly sampled each time. (Default)
  2. During one epoch, each image is paired with each available label. So, if an image has four types of labels, it is sampled four times, each time with a different label.
    When using multiple labels, it is the user's responsibility to ensure the labels are ordered consistently, such that the same label_idx in (width, height, depth, label_idx) refers to the same type of landmark or ROI across all labels.
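The two options can be sketched as follows. The function name `epoch_samples` and the mode names "sample" and "all" are made up for illustration; the actual configuration keys are not decided in this issue.

```python
import random


def epoch_samples(num_labels_per_image, mode="sample", seed=None):
    """Return the (image_idx, label_idx) pairs visited in one epoch.

    Hypothetical sketch:
    mode="sample": each image once, with one randomly chosen label (option 1).
    mode="all":    each image once per available label (option 2).
    """
    rng = random.Random(seed)
    samples = []
    for image_idx, num_labels in enumerate(num_labels_per_image):
        if mode == "sample":
            samples.append((image_idx, rng.randrange(num_labels)))
        elif mode == "all":
            samples.extend((image_idx, label_idx) for label_idx in range(num_labels))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return samples
```

With `num_labels_per_image=[4, 2]`, option 1 gives two samples per epoch and option 2 gives six.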

Sampling for multiple subjects each with multiple images

When multiple subjects each with multiple images are available, several sampling methods are supported:

  1. Inter-subject: one image is sampled from subject A as the moving image, and another image is sampled from a different subject B as the fixed image.
  2. Intra-subject: two images are sampled from the same subject. In this case, we can specify:
    a) the moving image always has a smaller index, e.g. at an earlier time;
    b) the moving image always has a larger index, e.g. at a later time; or
    c) no constraint on the order.

For options a) and b), the intra-subject images are sorted by name in ascending order to represent ordered sequential images, such as time-series data.
*Multiple label sampling is also supported once an image pair is sampled. In case no consistent label types are defined between subjects, an option is available to turn off the label contribution to the loss for those inter-subject image pairs.
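A minimal sketch of the inter- and intra-subject sampling described above. `sample_pair` and the option names "inter", "intra", "forward", "backward" and "any" are hypothetical. Names are sorted as plain strings, so "obs10" sorts before "obs2"; zero-padded names avoid this.

```python
import random


def sample_pair(groups, mode="intra", order="any", seed=None):
    """Sample a (moving, fixed) pair, each returned as (subject, image_name).

    Hypothetical sketch. groups maps a subject to its list of image names.
    order: "forward"  -> moving has the smaller index (option a),
           "backward" -> moving has the larger index (option b),
           "any"      -> no constraint (option c).
    """
    rng = random.Random(seed)
    if mode == "inter":
        # Two different subjects, one image from each.
        subject_a, subject_b = rng.sample(sorted(groups), 2)
        moving = (subject_a, rng.choice(sorted(groups[subject_a])))
        fixed = (subject_b, rng.choice(sorted(groups[subject_b])))
        return moving, fixed
    if mode != "intra":
        raise ValueError(f"unknown mode: {mode}")
    # Only subjects with at least two images can provide an intra-subject pair.
    eligible = [s for s in sorted(groups) if len(groups[s]) >= 2]
    subject = rng.choice(eligible)
    names = sorted(groups[subject])  # ascending by name, e.g. time order
    first, second = sorted(rng.sample(range(len(names)), 2))
    if order == "forward":
        moving_idx, fixed_idx = first, second
    elif order == "backward":
        moving_idx, fixed_idx = second, first
    elif order == "any":
        moving_idx, fixed_idx = rng.sample([first, second], 2)
    else:
        raise ValueError(f"unknown order: {order}")
    return (subject, names[moving_idx]), (subject, names[fixed_idx])
```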

Examples (folder structure and filename requirement)

In the following, we take the train directory as an example to show how the files should be stored.

Nifti Data Format

We assume each .nii.gz file contains only one tensor, which is either an image or a label.

Unpaired data

This is the simplest case. Data are assumed to be stored under train/images and train/labels directories.

Nifti Case 1-1 Images only

We only have images without any labels and all images are considered to be independent samples. So all data should be stored under train/images, e.g.:

  • train
    • images
      • subject1.nii.gz
      • subject2.nii.gz
      • ...

(It is also ok if the data are further grouped into different directories under images as we will directly scan all nifti files under train/images.)
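Scanning all nifti files under train/images, including nested directories, could look like the sketch below. `list_nifti_files` is a hypothetical helper; the demo builds a throwaway directory tree so the snippet runs on its own.

```python
import tempfile
from pathlib import Path


def list_nifti_files(image_dir):
    """Return the sorted paths of all .nii.gz files under image_dir, relative to it."""
    root = Path(image_dir)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*.nii.gz") if p.is_file()
    )


# Demo on a temporary directory with one nested sub-directory.
with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp) / "train" / "images"
    (images / "site_a").mkdir(parents=True)
    (images / "subject1.nii.gz").touch()
    (images / "site_a" / "subject2.nii.gz").touch()
    (images / "notes.txt").touch()  # ignored: not a nifti file
    found = list_nifti_files(images)

print(found)
```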

Nifti Case 1-2 Images with labels

In this case, we have both images and labels. So all images should be stored under train/images and all labels should be stored under train/labels. The corresponding image file name and label file name should be exactly the same, e.g.:

  • train
    • images
      • subject1.nii.gz
      • subject2.nii.gz
      • ...
    • labels
      • subject1.nii.gz
      • subject2.nii.gz
      • ...
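The requirement that image and label file names match exactly can be verified before training. The sketch below assumes a one-to-one correspondence, i.e. every image has a label and vice versa; `match_images_and_labels` is a hypothetical name.

```python
def match_images_and_labels(image_names, label_names):
    """Return the sorted common names, or raise if the two sets differ.

    Hypothetical helper. Names are paths relative to train/images and
    train/labels, e.g. "subject1.nii.gz".
    """
    images, labels = set(image_names), set(label_names)
    unmatched = sorted(images ^ labels)  # names present on one side only
    if unmatched:
        raise ValueError(f"images and labels do not match: {unmatched}")
    return sorted(images)
```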

Grouped unpaired images

Nifti Case 2-1 Images only

We have images without any labels, but images are grouped under different subjects/groups, e.g. time-series observations for each subject/group. For instance, the data set can be the CT scans of multiple patients (subjects/groups) where each patient has multiple scans acquired at different time points. So all data should be stored under train/images and the leaf directories (directories that do not have sub-directories) must represent different subjects/groups, e.g.:

  • train
    • images
      • subject1
        • obs1.nii.gz
        • obs2.nii.gz
        • ...
      • subject2
        • obs1.nii.gz
        • obs2.nii.gz
        • ...
      • ...

(It is also ok if the data are grouped into different directories, but the leaf directories will be considered as different subjects/groups.)
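Grouping files by their leaf directory can be sketched as below. `group_by_leaf_dir` is a hypothetical helper that takes paths relative to train/images and assumes files only live in leaf directories.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_leaf_dir(relative_paths):
    """Map each parent directory (one subject/group) to its sorted file names."""
    groups = defaultdict(list)
    for rel in relative_paths:
        path = PurePosixPath(rel)
        groups[path.parent.as_posix()].append(path.name)
    return {group: sorted(names) for group, names in sorted(groups.items())}
```

For instance, "subject1/obs1.nii.gz" and "subject1/obs2.nii.gz" end up in the group "subject1", and a deeper path such as "site_a/subject3/obs1.nii.gz" forms the group "site_a/subject3".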

Nifti Case 2-2 Images with labels

We have both images and labels. So all images should be stored under train/images and all labels should be stored under train/labels. The leaf directories will be considered as different subjects/groups and the corresponding image file name and label file name should be exactly the same, e.g.:

  • train
    • images
      • subject1
        • obs1.nii.gz
        • obs2.nii.gz
        • ...
      • ...
    • labels
      • subject1
        • obs1.nii.gz
        • obs2.nii.gz
        • ...
      • ...

Paired images

In this case, images are paired, for example, to represent multimodal moving and fixed image pairs to be registered. Data are assumed to be stored under train/moving_images, train/fixed_images, train/moving_labels, and train/fixed_labels directories.

Nifti Case 3-1 Images only

We only have paired images without any labels. So all data should be stored under train/moving_images and train/fixed_images, and the images corresponding to the same subject should have exactly the same name, e.g.:

  • train
    • moving_images
      • subject1.nii.gz
      • subject2.nii.gz
      • ...
    • fixed_images
      • subject1.nii.gz
      • subject2.nii.gz
      • ...

(It is ok if the data are further grouped into different directories under train/moving_images and train/fixed_images as we will directly scan all nifti files under them.)

Nifti Case 3-2 Images with labels

We have both images and labels. So all data should be stored under train/moving_images, train/fixed_images, train/moving_labels, and train/fixed_labels. The images and labels corresponding to the same subjects/groups should have exactly the same names, e.g.:

  • train
    • moving_images
      • subject1.nii.gz
      • subject2.nii.gz
      • ...
    • fixed_images
      • subject1.nii.gz
      • subject2.nii.gz
      • ...
    • moving_labels
      • subject1.nii.gz
      • subject2.nii.gz
      • ...
    • fixed_labels
      • subject1.nii.gz
      • subject2.nii.gz
      • ...
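For paired data, the same file names are expected in every directory. The hypothetical `check_paired_names` below sketches that check on lists of names; it assumes each directory must contain exactly the same set of names.

```python
def check_paired_names(names_by_dir):
    """Return the sorted shared names, or raise if any directory differs.

    Hypothetical helper. names_by_dir maps a directory such as
    "moving_images" to the list of file names found in it.
    """
    items = list(names_by_dir.items())
    reference_dir, reference_names = items[0]
    reference = set(reference_names)
    for directory, names in items[1:]:
        unmatched = sorted(reference ^ set(names))
        if unmatched:
            raise ValueError(
                f"{directory} and {reference_dir} differ on: {unmatched}"
            )
    return sorted(reference)
```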

H5 Data Format

Each .h5 file is similar to a dictionary, having multiple key-value pairs. Hierarchical multi-level h5 indexing is not used. Each value is either an image or a label.

Unpaired images

H5 Case 1-1 Images only

Each key corresponds to one image, e.g. {"subject1": data1, "subject2": data2, ...}. All data should be stored under train/images, in either a single h5 file or multiple h5 files, e.g.:

  • train
    • images
      • part1.h5
      • part2.h5
      • ...

H5 Case 1-2 Images with labels

Each key corresponds to one subject. Data can be stored in two h5 files (one for images and one for labels); the keys in the two files should be the same.

  • train
    • images
      • data.h5 (keys = ["subject1", "subject2", ...])
    • labels
      • data.h5 (keys = ["subject1", "subject2", ...])

Grouped unpaired images

H5 Case 2-1 Images only

Similar to case 1-1 above, but the keys must share a common format such as subject%d-%d, where each %d represents a number. For instance, subject3-2 corresponds to the second observation of subject 3. Otherwise, the file structure is the same as in case 1-1, e.g.:

  • train
    • images
      • part1.h5 (keys = ["subject1-1", "subject1-2", "subject2-1", ...])
      • part2.h5
      • ...
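Parsing keys of the form subject%d-%d and grouping them by subject can be sketched with a regular expression. The literal prefix "subject" and the helper names `parse_key` and `group_keys` are assumptions made for illustration; the final key format may differ.

```python
import re

# Assumed key format: "subject<subject id>-<observation id>".
KEY_PATTERN = re.compile(r"^subject(\d+)-(\d+)$")


def parse_key(key):
    """Split a key such as "subject3-2" into (subject_id, observation_id)."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"key does not follow subject%d-%d: {key!r}")
    return int(match.group(1)), int(match.group(2))


def group_keys(keys):
    """Group keys by subject id, with observations in ascending numeric order."""
    groups = {}
    for key in keys:
        subject_id, observation_id = parse_key(key)
        groups.setdefault(subject_id, []).append((observation_id, key))
    return {
        subject_id: [key for _, key in sorted(items)]
        for subject_id, items in sorted(groups.items())
    }
```

Sorting on the parsed numbers rather than on the raw strings keeps "subject1-10" after "subject1-2".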

H5 Case 2-2 Images with labels

Similar to cases 1-2 and 2-1 above: the keys must share a common format such as subject%d-%d, and the keys for images and labels should be consistent.

  • train
    • images
      • part1.h5 (keys = ["subject1-1", "subject1-2", ...])
      • part2.h5 (keys = ["subject101-1", "subject101-2", ...])
      • ...
    • labels
      • part1.h5 (keys = ["subject1-1", "subject1-2", ...])
      • part2.h5 (keys = ["subject101-1", "subject101-2", ...])
      • ...

Paired images

In this case, data are paired. Data are assumed to be stored under train/moving_images, train/fixed_images, train/moving_labels, and train/fixed_labels directories.

H5 Case 3-1 Images only

We only have paired images without any labels. So all data should be stored under train/moving_images and train/fixed_images, and the keys corresponding to the same subject should be the same, e.g.:

  • train
    • moving_images
      • part1.h5 (keys = ["subject1", "subject2", ...])
      • part2.h5
      • ...
    • fixed_images
      • part1.h5 (keys = ["subject1", "subject2", ...])
      • part2.h5
      • ...

H5 Case 3-2 Images with labels

We have both images and labels. So all data should be stored under train/moving_images, train/fixed_images, train/moving_labels, and train/fixed_labels. The keys corresponding to the same subject should be the same, e.g.:

  • train
    • moving_images
      • data.h5 (keys = ["subject1", "subject2", ...])
    • fixed_images
      • data.h5 (keys = ["subject1", "subject2", ...])
    • moving_labels
      • data.h5 (keys = ["subject1", "subject2", ...])
    • fixed_labels
      • data.h5 (keys = ["subject1", "subject2", ...])
@ucl-candi
Collaborator

ucl-candi commented Jun 12, 2020

@tvercaut can you review this please?

@ucl-candi
Collaborator

@QianyeYang could you please also review this and comment if any? Thanks!

@YipengHu YipengHu self-assigned this Jun 14, 2020
mathpluscode added a commit that referenced this issue Jun 15, 2020
@ucl-candi ucl-candi added this to the Pre-alpha-0-loader milestone Jun 16, 2020
@ucl-candi ucl-candi added help wanted Extra attention is needed question Further information is requested labels Jun 16, 2020
mathpluscode added a commit that referenced this issue Jun 16, 2020
mathpluscode added a commit that referenced this issue Jun 19, 2020
mathpluscode added a commit that referenced this issue Jun 21, 2020
@mathpluscode mathpluscode mentioned this issue Jun 21, 2020
8 tasks
mathpluscode added a commit that referenced this issue Jun 22, 2020
s-sd pushed a commit that referenced this issue Jul 2, 2020