Add batch dataloading [GSK-2347] #10

pierlj · 2023-12-13T17:16:03Z

Add batch_size and shuffle arguments to DataLoaderBase.

This mainly changes the behavior of DataIteratorBase.__next__ when batch_size > 1, so that it retrieves multiple items from the dataset and batches them using a _collate function.
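
For illustration, a minimal sketch of the batching behavior described above. It is not the code from this PR: the class name BatchIterator, the default_collate helper, the use of the standard-library random module for shuffling, and returning the bare item when batch_size is 1 are all assumptions.

```python
import random
from typing import Any, Callable, List, Optional, Sequence


def default_collate(items: List[Any]) -> List[Any]:
    """Hypothetical stand-in for _collate: keep the batch as a plain list."""
    return list(items)


class BatchIterator:
    """Illustrative iterator only; not the actual DataIteratorBase."""

    def __init__(
        self,
        dataset: Sequence[Any],
        batch_size: int = 1,
        shuffle: bool = False,
        seed: Optional[int] = None,
        collate_fn: Callable[[List[Any]], Any] = default_collate,
    ) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be a positive integer")
        self.dataset = dataset
        self.batch_size = batch_size
        self.collate_fn = collate_fn
        # The index sampler decides the order in which items are visited.
        self.index_sampler = list(range(len(dataset)))
        if shuffle:
            random.Random(seed).shuffle(self.index_sampler)
        self.position = 0

    def __iter__(self) -> "BatchIterator":
        self.position = 0
        return self

    def __next__(self) -> Any:
        if self.position >= len(self.index_sampler):
            raise StopIteration
        stop = self.position + self.batch_size
        batch_indices = self.index_sampler[self.position:stop]
        self.position = stop
        items = [self.dataset[i] for i in batch_indices]
        # Assumption: batch_size == 1 keeps the old one-item-at-a-time output.
        if self.batch_size == 1:
            return items[0]
        return self.collate_fn(items)
```

In this sketch the last batch may be shorter than batch_size when the dataset length is not a multiple of it.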

linear · 2023-12-13T17:16:06Z

GSK-2347 Image Generator

We should have a mechanism in place to load images on the fly from disk.

a la: https://www.kaggle.com/code/abhmul/python-image-generator-tutorial

rabah-khalek · 2023-12-14T13:47:08Z

loreal_poc/dataloaders/base.py

+        if elements[0][1] is not None:
+            batched_elements[1] = np.stack(batched_elements[1], axis=0)
+        if elements[0][2] is not None:
+            batched_elements[2] = {key: [meta[key] for meta in batched_elements[2]] for key in batched_elements[2][0]}
+        return batched_elements


Could you please create an Enum to represent the images, marks and meta indices?

Although a dataloader is not designed to contain both marked and unmarked images, I can think of a case where a dataloader might have some images with meta and some without. No need to handle this case for now, but let's definitely do a sanity check:

if all(elt is not None for elt in batched_elements[1]):

instead of

if elements[0][1] is not None:

(same for meta)

Raise an exception if the check fails, saying that we only support loading images that either all have marks or all don't (similarly for meta)... I'll let you find a better wording.
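
One possible reading of the two requests above, as a sketch rather than the code that was merged. The names ElementIndex and check_all_or_none are hypothetical, as is the wording of the error message.

```python
from enum import IntEnum
from typing import Any, Sequence


class ElementIndex(IntEnum):
    """Hypothetical names for the positions in an (image, marks, meta) tuple."""

    IMAGE = 0
    MARKS = 1
    META = 2


def check_all_or_none(values: Sequence[Any], field: str) -> bool:
    """Return True if every value is present and False if every value is None.

    A mix of present and missing values raises ValueError.
    """
    missing = [value is None for value in values]
    if all(missing):
        return False
    if any(missing):
        raise ValueError(
            f"Only batches where all elements have {field}, or none do, are supported."
        )
    return True


# Transpose [(image, marks, meta), ...] into [images, marks, metas].
elements = [("img0", [1.0, 2.0], None), ("img1", [3.0, 4.0], None)]
batched_elements = list(zip(*elements))
has_marks = check_all_or_none(batched_elements[ElementIndex.MARKS], "marks")
has_meta = check_all_or_none(batched_elements[ElementIndex.META], "meta")
```

Because IntEnum members are integers, batched_elements[ElementIndex.MARKS] indexes exactly like batched_elements[1].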

Sure, I will add the sanity checks. However, the design of the collate function is specific to the dataset we currently have. If we have a different dataset with a different format (e.g. marks not stored as a numpy array), the collate function should change accordingly. I will add it as an optional argument.

Regarding the enum, what exactly do you want? I'm not sure I understand.

I made the modifications, but I can't raise an exception when the test is false, as it will also be false when all elements' meta are None. Telling the two cases apart would require a more complex check. Instead, with the current behavior, if a dataset has marks only for some images, it returns a list like [m1, None, None, m2, ...], simply not stacked into a single array. The same goes for metadata.

Hartorn · 2023-12-18T14:39:48Z

loreal_poc/dataloaders/base.py

+    ) -> Tuple[np.ndarray, Optional[np.ndarray], Optional[Dict[Any, Any]]]:
+        batched_elements = list(zip(*elements))
+        # TO DO: create image stack but require same size images and therefore automatic padding or resize.
+        if all(elt is not None for elt in batched_elements[1]):  # check if all marks are not None


Hmm, I would be against silently "hiding" some elts. Maybe raise an exception if only part of the data is None? Or use some configurable func to create a "default elt".

It does not hide elts when only part of the data is None; it simply does not batch them and returns a list of marks/metadata and None values.

I think this is better than raising an error when some images have no marks/metadata.

I think it's crucial to keep the output of our dataloader as standard as possible. Neither option is optimal:

the exception (which I also had in mind at the beginning) would be bypassed for batch_size=1, or whenever batches happen to align so that each one contains either only Nones or no Nones at all.

the freedom to return either a stack or a list containing Nones, depending on the batch, would lead to inconsistent behaviour (the same loader yielding an np.array for some batches and a list for others).

I agree with @Hartorn that the best option here would be to have a default elt:

for marks, a NaN array (of shape (batch_size, n_landmarks, n_dimensions))

for meta, an empty dict with list of Nones (of length batch_size)

for the particular case of batch_size=1, the default value for marks and meta can both be None.
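
To make the inconsistency concrete, here is a small sketch that mirrors the condition in the reviewed diff. The helper name conditional_stack is hypothetical, and the (68, 2) shape is only an example.

```python
import numpy as np


def conditional_stack(marks):
    """Stack only when every element has marks; otherwise return a list."""
    if all(m is not None for m in marks):
        return np.stack(marks, axis=0)
    return list(marks)


full_batch = conditional_stack([np.zeros((68, 2)), np.ones((68, 2))])
partial_batch = conditional_stack([np.zeros((68, 2)), None])

# The same function yields two different container types.
print(type(full_batch).__name__, type(partial_batch).__name__)  # ndarray list
```

Downstream code would have to branch on the type of every batch, which is the behaviour the comment above argues against.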

Ok, for the marks it makes sense, I will replace each None in the batch by an array of np.nan of size (68, 2), then I can stack everything so the output will be a (batch_size, 68, 2) array with potentially nan values when marks are not available.
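
A minimal sketch of the NaN-filling idea for marks, assuming each available set of marks is a (68, 2) array as stated above. The function name stack_marks and its shape parameter are hypothetical.

```python
from typing import Optional, Sequence, Tuple

import numpy as np


def stack_marks(
    marks: Sequence[Optional[np.ndarray]],
    shape: Tuple[int, int] = (68, 2),
) -> np.ndarray:
    """Replace missing marks with NaN arrays, then stack along a new batch axis."""
    filler = np.full(shape, np.nan)
    filled = [filler if m is None else np.asarray(m, dtype=float) for m in marks]
    return np.stack(filled, axis=0)


landmarks = np.zeros((68, 2))
batch = stack_marks([landmarks, None, landmarks])
print(batch.shape)  # (3, 68, 2)
```

The output is always a float array of shape (batch_size, 68, 2), so callers can detect missing marks with np.isnan instead of checking for None.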

For the meta, I don't understand what you mean. What I can imagine is a dict of lists, with None where values are not available. However, we still need a default for when all meta are None: it could simply be an empty dict, or a dict with lists of None of length batch_size. In the latter case, we must assume the structure (the keys) of the dict in advance, because there is no way to retrieve it from the meta when they are all None. Which do you prefer?
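
The two options in this comment can be sketched as follows. The function collate_meta and its optional keys argument are hypothetical; they only illustrate the trade-off and do not describe the merged implementation.

```python
from typing import Any, Dict, List, Optional, Sequence


def collate_meta(
    metas: Sequence[Optional[Dict[str, Any]]],
    keys: Optional[Sequence[str]] = None,
) -> Dict[str, List[Any]]:
    """Turn a batch of per-image meta dicts into one dict of lists.

    Missing dicts or missing keys are represented by None. If keys is not
    given, the keys are inferred from whichever dicts are present.
    """
    if keys is None:
        inferred: List[str] = []
        for meta in metas:
            if meta is not None:
                for key in meta:
                    if key not in inferred:
                        inferred.append(key)
        keys = inferred
    return {
        key: [None if meta is None else meta.get(key) for meta in metas]
        for key in keys
    }


mixed = collate_meta([{"age": 31}, None, {"age": 45, "pose": "left"}])
all_missing = collate_meta([None, None])
with_known_keys = collate_meta([None, None], keys=["age"])
```

When every meta is None, the keys cannot be inferred, so the result is an empty dict unless the caller supplies the keys up front.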

pierlj

I made the changes requested by @Hartorn, except for what is returned for a partially annotated dataset; I think it is better the way it is now.

rabah-khalek

I added a comment re partial outputs. Could you also please add unit tests that cover the batching and sampling? Thanks

rabah-khalek · 2023-12-20T17:31:34Z

loreal_poc/dataloaders/base.py

+    index_sampler: Sequence[int]
+    batch_size: int
+
+    def __init__(self, name: str, batch_size: int):


let's set the default value of batch_size to 1

Add batch dataloading

3a03110

pierlj requested a review from rabah-khalek December 13, 2023 17:16

rabah-khalek suggested changes Dec 14, 2023

pierlj and others added 2 commits December 14, 2023 15:37

Apply suggestions from Rabah's review

9a52a53

Co-authored-by: Rabah Abdul Khalek <rabah.khalek@gmail.com>

Add collate_fn as argument to DataLoaderBase and improve validation

ee1ab6e

pierlj requested a review from rabah-khalek December 14, 2023 16:42

rabah-khalek and others added 2 commits December 14, 2023 17:42

Merge branch 'main' into GSK-2347-add-batch-dataloader

1a808cb

Add seed argument in DataLoaderBase

7d8e4e4

Hartorn requested changes Dec 18, 2023

pierlj commented Dec 20, 2023

Add batch_size validation and change rng handling

204e730

pierlj requested a review from Hartorn December 20, 2023 13:46

rabah-khalek added 2 commits December 20, 2023 18:01

Delete examples/test.ipynb

f0a5063

Merge branch 'main' into GSK-2347-add-batch-dataloader

9713f12

rabah-khalek suggested changes Dec 20, 2023

rabah-khalek reviewed Dec 20, 2023

rabah-khalek and others added 5 commits December 20, 2023 18:32

Update loreal_poc/dataloaders/base.py

f97c3ce

Change handling of missing meta and marks and add tests

6bd016e

Merge branch 'main' into GSK-2347-add-batch-dataloader

a6655c2

refactoring of marks and meta default values, change of collate_fn

05737d6

refactoring

69c6e56

rabah-khalek approved these changes Dec 22, 2023

rabah-khalek added 2 commits December 22, 2023 17:45

temp fix of FFHQ dataloader

db8f99b

Merge branch 'main' into GSK-2347-add-batch-dataloader

3b3ddd6

rabah-khalek merged commit eb56a68 into main Dec 22, 2023
1 check passed

rabah-khalek deleted the GSK-2347-add-batch-dataloader branch December 22, 2023 16:55
