optimized train time for a use case of small samples and large batch #268
Conversation
has_error = True
has_missing_values = True
if self._raise_error_key_missing:
    raise Exception(f"Error: key {key} does not exist in sample {index}: {sample}")
else:
    value = None
else:
Removing this to optimize the running time.
Detecting NaNs and similar checks could move to an optional op (in the data pipeline).
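A minimal sketch of what such an optional pipeline op could look like, so the check is only paid for when a user opts in. The class name `OpDetectNaN` and the sample-dict interface are illustrative assumptions, not FuseMedML's actual op API:

```python
import numpy as np

# Hypothetical sketch: NaN detection as an opt-in pipeline op rather than a
# check inside the collate hot path. OpDetectNaN and its call signature are
# assumptions for illustration only.
class OpDetectNaN:
    def __init__(self, keys):
        self._keys = keys  # keys to validate in each sample

    def __call__(self, sample: dict) -> dict:
        for key in self._keys:
            value = sample.get(key)
            if isinstance(value, np.ndarray) and np.isnan(value).any():
                raise ValueError(f"NaN detected in key '{key}'")
        return sample  # pass the sample through unchanged when clean

op = OpDetectNaN(keys=["data.input"])
clean = op({"data.input": np.array([1.0, 2.0])})  # no NaNs: returned as-is
```

Users who don't add the op to their pipeline pay nothing, which matches the goal of keeping the collate path fast.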
fuse/utils/ndict.py
Outdated
@@ -228,7 +246,7 @@ def pop(self, key: str) -> Any:
         del self[key]
         return res

-    def indices(self, indices: Union[torch.Tensor, numpy.ndarray]) -> dict:
+    def indices(self, indices: Optional[numpy.ndarray]) -> dict:
Self review: I will remove the Optional here.
Thanks!!
I added a few comments inline to consider :)
@@ -68,9 +68,11 @@ def __call__(self, samples: List[Dict]) -> Dict:
         batch_dict = NDict()

         # collect all keys
         keys = self._collect_all_keys(samples)
         if self._keep_keys:
In the description we say: "missing keep_keys are skipped." I think that now we won't do that.
Could it be an issue if a user specifies keeping a key that doesn't exist?
Good catch! Now we will throw an error in such a case. I will update the comment.
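To illustrate the behavior agreed on here, a small sketch of a key-collection step that raises when a requested `keep_keys` entry is absent from every sample, instead of silently skipping it. The function name `collect_keys` and its signature are assumptions, not the PR's actual helper:

```python
# Hypothetical sketch of the discussed behavior: a keep_keys entry that no
# sample contains raises instead of being silently skipped.
def collect_keys(samples, keep_keys=None):
    all_keys = set()
    for sample in samples:
        all_keys.update(sample.keys())
    if keep_keys is not None:
        # fail loudly on keys the user asked for but no sample provides
        missing = [k for k in keep_keys if k not in all_keys]
        if missing:
            raise KeyError(f"keep_keys not found in any sample: {missing}")
        return list(keep_keys)
    return sorted(all_keys)

samples = [{"a": 1, "b": 2}, {"a": 3}]
collect_keys(samples, keep_keys=["a"])  # -> ["a"]
```

Raising early surfaces a likely typo in the user's configuration instead of producing a batch that quietly lacks the key.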
            batch_size = len(batch["data.sample_id"])
        else:
            batch_size = None

        if batch_size is None:
            keys = batch.keys()

            for key in keys:
                if isinstance(batch[key], torch.Tensor):
Why are there two different loops, one for each case: torch.Tensor and (np.ndarray, list)?
If I'm not missing something, we can check for the two cases in the same loop.
Because first I want to look for a tensor (I trust it more), and only if I can't find one is my second choice (np.ndarray, list).
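The two-pass preference described here can be sketched as follows. To keep the sketch dependency-light, NumPy arrays stand in for `torch.Tensor` as the "trusted" type; the structure mirrors the reviewer's explanation, not the PR's actual implementation:

```python
import numpy as np

# Sketch of two-pass batch-size inference: scan once for the most trusted
# value type, and only fall back to weaker candidates if none is found.
# np.ndarray plays the role of torch.Tensor in this illustration.
def infer_batch_size(batch: dict):
    # first pass: trusted type
    for value in batch.values():
        if isinstance(value, np.ndarray):
            return len(value)
    # second pass: fallback types, checked only when no array was found
    for value in batch.values():
        if isinstance(value, (list, tuple)):
            return len(value)
    return None

batch = {"meta": "x", "ids": [1, 2, 3], "data": np.zeros((4, 2))}
infer_batch_size(batch)  # -> 4: the array wins even though the list comes first
```

A single loop checking both types at once would return whichever matching value it encounters first, which is exactly the ordering guarantee the two-pass version avoids giving up.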
        return all_keys

    @staticmethod
    def _flatten_static(item: Union[dict, Any], prefix: str, flat_dict: dict) -> None:
COOL!
     def keypaths(self) -> List[str]:
         """
         returns a list of keypaths (i.e. "a.b.c.d") to all values in the nested dict
         """
-        return list(self.flatten().keys())
+        return NDict._keypaths_static(self._stored, None)
Why not use the same paradigm as before, just calling flatten()?
The two static functions have a lot in common.
Mostly because I don't want the overhead of creating a dictionary and then extracting the keys.
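The idea being defended here, collecting keypaths recursively without materializing the intermediate flattened dict, might look roughly like this. The function name and the `"."` separator follow the docstring's `"a.b.c.d"` convention; the body is an assumption, not the PR's `_keypaths_static`:

```python
# Sketch of recursive keypath collection that appends paths directly to a
# list, skipping the flatten()-then-keys() round trip. Illustrative only.
def keypaths_static(item, prefix=None):
    paths = []
    for key, value in item.items():
        path = key if prefix is None else f"{prefix}.{key}"
        if isinstance(value, dict):
            # recurse into nested dicts, extending the dotted prefix
            paths.extend(keypaths_static(value, path))
        else:
            paths.append(path)
    return paths

nested = {"a": {"b": {"c": 1}, "d": 2}, "e": 3}
keypaths_static(nested)  # -> ["a.b.c", "a.d", "e"]
```

The saving is modest per call, but for large batches assembled key by key it avoids allocating one throwaway dict per lookup.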
Thanks again!
I added comments in a different review 😄
(I used the VS Code interface, and because of the changes in the middle it didn't allow me to approve.)
This nicely optimizes the train running time in such a use case (small samples and large batch size).
Leaving ndict optimization for the future.