Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add batch dataloading [GSK-2347] #10

Merged
merged 15 commits into from
Dec 22, 2023
Merged

Conversation

pierlj
Copy link
Collaborator

@pierlj pierlj commented Dec 13, 2023

Add batch_size and shuffle arguments to DataLoaderBase.

Mainly changes the behavior of the DataIteratorBase.__next__ when batch_size > 1 so it retrieves multiple items from the dataset and batched them using a _collate function.

Copy link

linear bot commented Dec 13, 2023

GSK-2347 Image Generator

We should have a mechanism in place to load images on the fly from disk.

a la: https://www.kaggle.com/code/abhmul/python-image-generator-tutorial

loreal_poc/dataloaders/base.py Outdated Show resolved Hide resolved
loreal_poc/dataloaders/base.py Outdated Show resolved Hide resolved
loreal_poc/dataloaders/base.py Outdated Show resolved Hide resolved
loreal_poc/dataloaders/base.py Outdated Show resolved Hide resolved
Comment on lines 70 to 74
if elements[0][1] is not None:
batched_elements[1] = np.stack(batched_elements[1], axis=0)
if elements[0][2] is not None:
batched_elements[2] = {key: [meta[key] for meta in batched_elements[2]] for key in batched_elements[2][0]}
return batched_elements
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • could you please create an Enum to represent the images, marks and meta indices?
  • although not intended by design to have a dataloader that contains marked and unmarked images, but I can think of a case where a dataloader might have images with meta and images without. No need to treat this case for now, but let's definitely do a sanity check:
if all(elt is not None for elt in batched_elements[1]):

instead of

if elements[0][1] is not None:

(same for meta)

raise an exception in case the if fails, saying that we only support image loading that either have marks or don't (similar for meta)... I'll let you find a better wording.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will add the sanity checks, however the design of the collate function is specific to the dataset we currently have. If we have a different dataset with a different format (e.g. marks not stored as a numpy array), the collate function should change accordingly. I will add it as an optionnal argument.

Regarding the enum, what do you want exactly, I am not sure to understand ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the modifications, but I can't raised an exception when the test is false as it will be false when all elements' meta are None as well. It would involve complex to check it. Instead with the current behavior, if a dataset has marks only for some images, it will return a list like this [m1, None, None, m2, ...] simply not stacked as a unique array. Same for meta data.

loreal_poc/dataloaders/base.py Outdated Show resolved Hide resolved
loreal_poc/dataloaders/base.py Outdated Show resolved Hide resolved
loreal_poc/dataloaders/wrappers.py Outdated Show resolved Hide resolved
loreal_poc/dataloaders/base.py Outdated Show resolved Hide resolved
loreal_poc/dataloaders/base.py Show resolved Hide resolved
loreal_poc/dataloaders/base.py Outdated Show resolved Hide resolved
) -> Tuple[np.ndarray, Optional[np.ndarray], Optional[Dict[Any, Any]]]:
batched_elements = list(zip(*elements))
# TO DO: create image stack but require same size images and therefore automatic padding or resize.
if all(elt is not None for elt in batched_elements[1]): # check if all marks are not None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I would be against silently "hiding" some elts. Maybe raise some exception if partial part of the data only have None ? or use some configurable func to create a 'default elt"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not hiding elts when partial part of the data has None, it will simply not batch it and return a list of marks/metadata and None.

I think it is better to do this than raising an issue when some images have no marks/metadata.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's crucial to have the output of our dataloader as standard as possible. Both cases are not optimal.

  • the exception (which I also had in mind at the beginning) will be bypassed for batch_size=1 or accidental alignments of batches having all Nones, vs batches having all not Nones.
  • the freedom of returning of stacks, and Nones based on the batch would lead to undefined behaviours (expecting a np.array vs list from the same loader for different batches).

I agree with @Hartorn that the best option here would be to have a default elt:

  • for marks, a nan array (of shape (batch_size, n_landmarks, n_dimensions)
  • for meta, an empty dict with list of Nones (of length batch_size)
    for the particular case of batch_size=1, the default value for marks and meta can be both None.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, for the marks it makes sense, I will replace each None in the batch by an array of np.nan of size (68, 2), then I can stack everything so the output will be a (batch_size, 68, 2) array with potentially nan values when marks are not available.

For the meta, I don't understand what you mean. What I can image is a dict of list with None when values are not available. However, we still need a default when all meta are None, it could be simply an empty dict, or a dict with list of None of length batch_size. In the latter case, we must assume the structure (the keys) of the dict in advance because there is no way to retrieve it from the meta since they are None. What do you prefer ?

Copy link
Collaborator Author

@pierlj pierlj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the change requested by @Hartorn, except for the return of partially annotated dataset, I think it is better the way it is now.

@pierlj pierlj requested a review from Hartorn December 20, 2023 13:46
Copy link
Contributor

@rabah-khalek rabah-khalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment re partial outputs. Could you also please add unit-tests that covers the batching and sampling? Thanks

) -> Tuple[np.ndarray, Optional[np.ndarray], Optional[Dict[Any, Any]]]:
batched_elements = list(zip(*elements))
# TO DO: create image stack but require same size images and therefore automatic padding or resize.
if all(elt is not None for elt in batched_elements[1]): # check if all marks are not None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's crucial to have the output of our dataloader as standard as possible. Both cases are not optimal.

  • the exception (which I also had in mind at the beginning) will be bypassed for batch_size=1 or accidental alignments of batches having all Nones, vs batches having all not Nones.
  • the freedom of returning of stacks, and Nones based on the batch would lead to undefined behaviours (expecting a np.array vs list from the same loader for different batches).

I agree with @Hartorn that the best option here would be to have a default elt:

  • for marks, a nan array (of shape (batch_size, n_landmarks, n_dimensions)
  • for meta, an empty dict with list of Nones (of length batch_size)
    for the particular case of batch_size=1, the default value for marks and meta can be both None.

index_sampler: Sequence[int]
batch_size: int

def __init__(self, name: str, batch_size: int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's set the default value of batch_size to 1

@rabah-khalek rabah-khalek merged commit eb56a68 into main Dec 22, 2023
1 check passed
@rabah-khalek rabah-khalek deleted the GSK-2347-add-batch-dataloader branch December 22, 2023 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants