Add batch dataloading [GSK-2347] #10
Conversation
GSK-2347 Image Generator
We should have a mechanism in place to load images on the fly from disk, à la https://www.kaggle.com/code/abhmul/python-image-generator-tutorial
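The idea from the linked tutorial can be sketched as a minimal on-the-fly batch generator. This is only an illustration, not the project's implementation; `load_fn` is a hypothetical injected callable so the sketch stays agnostic of the decoding library (PIL, cv2, ...):

```python
import numpy as np

def image_batch_generator(paths, load_fn, batch_size=4, shuffle=True, seed=None):
    """Yield image batches loaded on the fly instead of holding the dataset in memory.

    `load_fn` maps a path to a numpy array; it is injected so the generator
    stays agnostic of the actual decoding library (PIL, cv2, ...).
    """
    rng = np.random.default_rng(seed)
    indices = np.arange(len(paths))
    if shuffle:
        rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch_idx = indices[start:start + batch_size]
        # Images are only read from disk here, one batch at a time.
        yield np.stack([load_fn(paths[i]) for i in batch_idx], axis=0)
```

Each `next()` call decodes exactly one batch, which is what keeps memory usage bounded for large datasets.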
loreal_poc/dataloaders/base.py (outdated)

```python
if elements[0][1] is not None:
    batched_elements[1] = np.stack(batched_elements[1], axis=0)
if elements[0][2] is not None:
    batched_elements[2] = {key: [meta[key] for meta in batched_elements[2]] for key in batched_elements[2][0]}
return batched_elements
```
- Could you please create an `Enum` to represent the image, marks and meta indices?
- Although a dataloader containing both marked and unmarked images is not intended by design, I can think of a case where a dataloader might have images with meta and images without. No need to handle this case for now, but let's definitely do a sanity check: `if all(elt is not None for elt in batched_elements[1]):` instead of `if elements[0][1] is not None:` (same for meta). Raise an exception if the check fails, saying that we only support loading images that either all have marks or all don't (similarly for meta)... I'll let you find a better wording.
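The two suggestions above could look like this sketch (the enum member names and the `check_batch_consistency` helper are assumptions, not code from the PR):

```python
from enum import IntEnum

class ElementIdx(IntEnum):
    """Positions inside a dataloader element tuple (names are assumptions)."""
    IMAGE = 0
    MARKS = 1
    META = 2

def check_batch_consistency(batched_elements):
    """Raise if marks (or meta) are present for some elements of a batch but not all."""
    for idx in (ElementIdx.MARKS, ElementIdx.META):
        presence = [elt is not None for elt in batched_elements[idx]]
        if any(presence) and not all(presence):
            raise ValueError(
                f"Only batches whose images either all have {idx.name.lower()} "
                "or all lack them are supported."
            )
```

Using an `IntEnum` keeps the existing integer indexing (`batched_elements[ElementIdx.MARKS]`) working while making the positions self-documenting.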
Sure, I will add the sanity checks. However, the design of the collate function is specific to the dataset we currently have. If we have a different dataset with a different format (e.g. marks not stored as a numpy array), the collate function should change accordingly. I will add it as an optional argument.
Regarding the enum, what do you want exactly? I'm not sure I understand.
I made the modifications, but I can't raise an exception when the check is false, because it is also false when all elements' meta are None; distinguishing the two cases would add complexity. Instead, with the current behavior, if a dataset has marks only for some images, it will return a list like `[m1, None, None, m2, ...]`, simply not stacked into a single array. Same for meta data.
Co-authored-by: Rabah Abdul Khalek <rabah.khalek@gmail.com>
loreal_poc/dataloaders/base.py (outdated)

```python
) -> Tuple[np.ndarray, Optional[np.ndarray], Optional[Dict[Any, Any]]]:
    batched_elements = list(zip(*elements))
    # TODO: create image stack, but it requires same-size images and therefore automatic padding or resizing.
    if all(elt is not None for elt in batched_elements[1]):  # check if all marks are not None
```
Hmm, I would be against silently "hiding" some elts. Maybe raise an exception if only part of the data has None? Or use some configurable func to create a "default elt".
It is not hiding elts when part of the data has None; it will simply not batch them and return a list of marks/metadata and Nones.
I think this is better than raising an exception when some images have no marks/metadata.
I think it's crucial to have the output of our dataloader as standard as possible. Both cases are not optimal:

- the exception (which I also had in mind at the beginning) will be bypassed for `batch_size=1`, or by accidental alignments of batches having all Nones vs batches having all not-Nones.
- the freedom of returning stacks or Nones depending on the batch would lead to undefined behaviours (expecting a `np.array` vs a `list` from the same loader for different batches).

I agree with @Hartorn that the best option here would be to have a default elt:

- for marks, a nan array of shape `(batch_size, n_landmarks, n_dimensions)`
- for meta, a dict with lists of Nones (of length `batch_size`)

For the particular case of `batch_size=1`, the default value for marks and meta can both be `None`.
Ok, for the marks it makes sense: I will replace each None in the batch by an array of `np.nan` of size `(68, 2)`, then I can stack everything so the output will be a `(batch_size, 68, 2)` array with potentially nan values when marks are not available.
For the meta, I don't understand what you mean. What I can imagine is a dict of lists with None when values are not available. However, we still need a default when all meta are None; it could simply be an empty dict, or a dict with lists of None of length `batch_size`. In the latter case, we must assume the structure (the keys) of the dict in advance, because there is no way to retrieve it from the meta since they are None. Which do you prefer?
I made the change requested by @Hartorn, except for the return of partially annotated datasets; I think it is better the way it is now.
I added a comment re partial outputs. Could you also please add unit tests that cover the batching and sampling? Thanks
loreal_poc/dataloaders/base.py (outdated)

```python
index_sampler: Sequence[int]
batch_size: int

def __init__(self, name: str, batch_size: int):
```
Let's set the default value of `batch_size` to `1`.
Add `batch_size` and `shuffle` arguments to `DataLoaderBase`.

Mainly changes the behavior of `DataIteratorBase.__next__` when `batch_size > 1`, so that it retrieves multiple items from the dataset and batches them using a `_collate` function.