Fixed batch sizes #68

Closed · kylebgorman wants to merge 6 commits
Conversation

@kylebgorman (Contributor) commented on May 29, 2023:

This is a draft. Important simplifications while I get the basics right:

  • I assume that we are always padding to the max (for source and target); I can make it optional later, though the performance penalty is quite small in my experiments.
  • I am ignoring feature models for now; that can be incorporated later.
  • Prediction isn't tested yet.

I set the actual source_max_length to min(max(longest source string in train, longest source string in dev), --max_source_length), and similarly for the target length. I then lightly modify the LSTM (which needs to be told the max source length) and the transformer (which needs to make the positional embedding as large as the max of the longest source and target strings). Everything else is just plumbing.
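In code, the length selection described above amounts to something like the following sketch (the function and variable names here are illustrative, not the PR's actual identifiers):

```python
def effective_max_length(train_max: int, dev_max: int, flag_max: int) -> int:
    """Caps the observed max length by the --max_*_length flag.

    train_max/dev_max stand in for the longest source (or target) string seen
    in the train and dev sets, already including +2 for start/end tags.
    """
    return min(max(train_max, dev_max), flag_max)


# E.g., if the longest training source is 37, the longest dev source is 41,
# and --max_source_length is 128, the effective length is 41.
print(effective_max_length(37, 41, 128))  # 41
```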

Closes #50.

It is not plugged into anything yet.

Working on issue CUNY-CL#50.
This is not strictly related to the issue, but it came up, so I did it. Currying is implemented at a very low level in CPython and eliminates 3N dictionary lookups.
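Presumably this refers to binding fixed arguments once with functools.partial, which is implemented in C in CPython, rather than re-fetching them (e.g., from self or a config dict) on every call. A generic illustration, not the PR's actual code:

```python
import functools


def pad_to_length(sequence, length, pad_idx):
    # Hypothetical hot-path helper: pad a list out to a fixed length.
    return sequence + [pad_idx] * (length - len(sequence))


# Bind length and pad_idx once; later calls skip those lookups entirely.
pad = functools.partial(pad_to_length, length=8, pad_idx=0)
print(pad([3, 1, 4]))  # [3, 1, 4, 0, 0, 0, 0, 0]
```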
@kylebgorman requested a review from @Adamits on May 29, 2023 at 22:50.
@Adamits (Collaborator) left a comment:

This looks good! I left a couple small nits. Regarding future things:

> I assume that we are always padding to the max (for source and target); I can make it optional later, though the performance penalty is quite small in my experiments.

So we would add an option to dynamically pad to the max of a given batch? Yeah, since we already have the code to do this, we might as well add that option back later. But true, this is less of a concern these days.
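If that option does come back, a collate function along these lines would do it (a sketch only, not the project's actual collator):

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate_dynamic(batch, pad_idx=0):
    """Pads a list of 1-D LongTensors to the longest sequence in this batch,
    rather than to a corpus-wide (or flag-specified) maximum."""
    return pad_sequence(batch, batch_first=True, padding_value=pad_idx)


batch = [torch.tensor([5, 6, 7, 8]), torch.tensor([9, 10])]
print(collate_dynamic(batch).shape)  # torch.Size([2, 4])
```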

> I am ignoring feature models for now; that can be incorporated later.

Sounds good. I guess we would just want another argument for max feature size in models that have separate features? The API for this could be a little confusing; probably worth thinking through a bit.
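One possible shape for that API, purely hypothetical (the max_features_length name and its default do not exist in the codebase), just to make the question concrete:

```python
MAX_SOURCE_LENGTH = 128   # stand-ins for the values in defaults
MAX_FEATURES_LENGTH = 32  # assumed; no such default exists yet
MAX_TARGET_LENGTH = 128


class FeatureAwareCollator:
    """Toy stand-in: a collator that also caps the feature sequence length."""

    def __init__(
        self,
        pad_idx: int,
        max_source_length: int = MAX_SOURCE_LENGTH,
        max_features_length: int = MAX_FEATURES_LENGTH,
        max_target_length: int = MAX_TARGET_LENGTH,
    ):
        self.pad_idx = pad_idx
        self.max_source_length = max_source_length
        self.max_features_length = max_features_length
        self.max_target_length = max_target_length
```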

@Adamits (Collaborator) commented on the diff:

```diff
@@ -27,27 +27,32 @@ def __init__(
         pad_idx,
         config: dataconfig.DataConfig,
         arch: str,
-        max_source_length: int = defaults.MAX_SOURCE_LENGTH,
-        max_target_length: int = defaults.MAX_TARGET_LENGTH,
+        max_source_length=defaults.MAX_SOURCE_LENGTH,
```

Should this have an annotation Optional[int]?

@Adamits (Collaborator) also commented on the diff:

```diff
-        max_source_length: int = defaults.MAX_SOURCE_LENGTH,
-        max_target_length: int = defaults.MAX_TARGET_LENGTH,
+        max_source_length=defaults.MAX_SOURCE_LENGTH,
+        max_target_length=defaults.MAX_TARGET_LENGTH,
```

Should this have an annotation Optional[int]?

@kylebgorman (author) replied:

Not if it's specified as a kwarg with a default value that isn't None.
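In other words, the usual typing convention: Optional[int] is only warranted when None itself is an allowed value. A generic example of the distinction:

```python
from typing import Optional


def f(max_source_length: int = 128) -> int:
    # Default is a real int, so a plain int annotation is the right one.
    return max_source_length


def g(max_source_length: Optional[int] = None) -> int:
    # Optional[int] belongs here, where None means "fall back to a default."
    return max_source_length if max_source_length is not None else 128
```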

```python
@functools.cached_property
def max_source_length(self) -> int:
    # " + 2" for start and end tag.
    return max(len(source) + 2 for source, _, *_ in self.samples)
```
@Adamits (Collaborator) commented on May 30, 2023:

Tiny thing, but is max(...) + 2 preferable?

@kylebgorman (author) replied:

Yes!
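For the record, the agreed-upon form would read roughly as follows (a sketch of the discussed change, shown with a toy class so it runs standalone; not the merged code):

```python
import functools


class Index:
    """Toy stand-in for the dataset object; only `samples` matters here."""

    def __init__(self, samples):
        # samples: list of (source, target, ...) tuples.
        self.samples = samples

    @functools.cached_property
    def max_source_length(self) -> int:
        # " + 2" for start and end tag, added once after taking the max.
        return max(len(source) for source, _, *_ in self.samples) + 2


index = Index([("abc", "abd"), ("defgh", "defg")])
print(index.max_source_length)  # 5 + 2 = 7
```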

@Adamits (Collaborator) commented on the diff:

```diff
@@ -81,7 +81,7 @@ def encode(
             packed_outs,
             batch_first=True,
             padding_value=self.pad_idx,
-            total_length=None,
+            total_length=self.max_source_length,
```

Good catch on this!
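For context, total_length is the argument that forces pad_packed_sequence to pad the output back out to the fixed length rather than only to the longest sequence in the batch; a minimal, self-contained sketch (toy tensors, not the model's code):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

pad_idx = 0
max_source_length = 10  # stands in for self.max_source_length

# A batch of 2 sequences, already padded to the batch max of 7 timesteps.
x = torch.randn(2, 7, 4)
lengths = torch.tensor([7, 5])
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)

# Without total_length, the unpacked tensor would have 7 timesteps (the batch
# max); with it, every batch comes back with the same fixed time dimension.
out, _ = pad_packed_sequence(
    packed,
    batch_first=True,
    padding_value=pad_idx,
    total_length=max_source_length,
)
print(out.shape)  # torch.Size([2, 10, 4])
```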

@Adamits (Collaborator) left a review:

LGTM

@kylebgorman (author) commented:

This is really stale, so I'm going to close it and reopen it if and when I tackle this in the future.

@kylebgorman closed this on Jul 3, 2024.
Linked issue: #50 (Consistent batch size)