Fix/sequence length power of 2 by le1nux · Pull Request #158 · Modalities/modalities

le1nux · 2024-06-19T16:57:56Z

What does this PR do?

Previously, the block_size in the dataset was set to a power of two, which made the sequence length block_size - 1. This is not best practice and can hurt model training, e.g., in terms of throughput.

As a fix, we now specify sequence_length in the config instead of block_size. During Dataset instantiation, we choose the block_size to be sequence_length + 1.
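To illustrate why block_size is sequence_length + 1, here is a minimal sketch of the usual next-token shift. It is not code from this PR; `split_block` is a hypothetical helper and the token values are made up.

```python
def split_block(block: list[int]) -> tuple[list[int], list[int]]:
    """Split one block into model inputs and next-token targets.

    A block of sequence_length + 1 tokens yields exactly
    sequence_length inputs and sequence_length targets.
    """
    if len(block) < 2:
        raise ValueError("a block needs at least two tokens")
    inputs = block[:-1]   # all tokens except the last
    targets = block[1:]   # the same tokens shifted left by one
    return inputs, targets


sequence_length = 8               # a power of two, as set in the config
block_size = sequence_length + 1  # derived when the dataset is instantiated
block = list(range(block_size))   # dummy token ids 0..8

inputs, targets = split_block(block)
print(len(inputs), len(targets))
```

With this choice, the model sees inputs of length sequence_length (a power of two), rather than block_size - 1 as before.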

Previously, we also chunked the dataset into chunks of length block_size, and each chunk was used for training individually. As a result, the last token of a block was only ever used as a target, never as an input. We changed this so that the last token of a block is reused as the first token of the subsequent block.
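The overlap described above can be sketched as follows. This is an illustrative sketch rather than the dataset implementation from this PR; `chunk_with_overlap` is a hypothetical name, and trailing tokens that do not fill a whole block are simply dropped here.

```python
def chunk_with_overlap(tokens: list[int], sequence_length: int) -> list[list[int]]:
    """Chunk a token stream into blocks of sequence_length + 1 tokens.

    Consecutive blocks overlap by one token: the last token of a block
    (its last target) is the first token (first input) of the next block.
    """
    block_size = sequence_length + 1
    blocks = []
    # Stepping by sequence_length instead of block_size creates the overlap.
    for start in range(0, len(tokens) - block_size + 1, sequence_length):
        blocks.append(tokens[start:start + block_size])
    return blocks


tokens = list(range(10))  # dummy token ids 0..9
blocks = chunk_with_overlap(tokens, sequence_length=3)
for block in blocks:
    print(block)
```

In this toy example the blocks are [0, 1, 2, 3], [3, 4, 5, 6] and [6, 7, 8, 9], so tokens 3 and 6 serve once as a target and once as an input.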

General changes

Nothing apart from the points mentioned above.

Breaking Changes

Replaced block_size with sequence_length in Dataset, Model and NumberConversion.

Checklist before submitting final PR

My PR is minimal and addresses one issue / enhancement in isolation
I have merged main into this feature branch
I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
I have run a sample config for model training
I have fixed all failing tests (python tests/tests.py)

flxst

Great work! Added some minor comments.

le1nux added 3 commits June 19, 2024 17:40

refactor: block_size in datasets is now sequence_length +1

b25b8e0

fix: failing end2end tests due to sequence_length / block_size changes

c36bc20

refactor: replaced context_size and block_size with sequence_length when applicable

bfad9fc

le1nux changed the base branch from main to dev_experiments June 19, 2024 16:58

flxst linked an issue Jun 20, 2024 that may be closed by this pull request

Revision of block size and sequence length #156

Closed

flxst assigned le1nux Jun 20, 2024

refactor: the last token from a block (i.e., last target token) is used as the first token (i.e., first input token) of the subsequent block

2ded83d

le1nux added the enhancement New feature or request label Jun 20, 2024

refactor: renamed all model_sequence_length with sequence_length

4b88bfd

le1nux requested review from flxst, fromm-m and lhahn-iis and removed request for flxst and lhahn-iis June 20, 2024 12:00

le1nux marked this pull request as ready for review June 20, 2024 12:01

flxst requested a review from mali-git June 24, 2024 07:57

le1nux removed the request for review from lhahn-iis June 24, 2024 10:18

flxst approved these changes Jun 25, 2024

Comment thread src/modalities/config/config.py Outdated

Comment thread src/modalities/models/gpt2/gpt2_model.py Outdated

Comment thread src/modalities/models/gpt2/gpt2_model.py Outdated

Comment thread tests/checkpointing/test_fsdp_to_disc_checkpointing.py Outdated

le1nux and others added 10 commits June 25, 2024 14:58

fix: we use the correct byte-based indexation now

1772f47

test: added test test_original_samples_in_packed_dataset for testing the indexation of the original samples

3c6bfbc

chore: updated getting started documentation regarding the byte-based indexation

3a45e52

fix: fixed index in dummy_packed_data_path of conftest

0f28492

chore: updated readme inaccuracy

378c59c

Update src/modalities/config/config.py

cc7af6a

Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>

Update src/modalities/models/gpt2/gpt2_model.py

e845c10

Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>

Update src/modalities/models/gpt2/gpt2_model.py

a9f4166

Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>

Update tests/checkpointing/test_fsdp_to_disc_checkpointing.py

7185f7e

Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>

Update src/modalities/dataloader/create_packed_data.py

138fa85

Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>

le1nux and others added 3 commits June 28, 2024 10:30

chore: renamed offset to offset_in_bytes for consistency

455c26a

chore: Merge branch 'fix/dataset_index' of github.com:Modalities/modalities into fix/dataset_index

28c9c88

Update src/modalities/dataloader/create_packed_data.py

969f11a

Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>

mali-git approved these changes Jun 29, 2024

mali-git and others added 3 commits June 29, 2024 16:37

chore: add comments

a8a0a1d

Merge pull request #164 from Modalities/fix/dataset_index

8ddf17c

Fix/dataset index: Index values were faulty when indexing the original samples instead of blocks.

Merge branch 'dev_experiments' into fix/sequence_length_power_of_2

96980d3

le1nux merged commit 563d864 into dev_experiments Jun 30, 2024

le1nux deleted the fix/sequence_length_power_of_2 branch June 30, 2024 10:51

le1nux mentioned this pull request Jun 30, 2024

Towards stable modalities version #141

Merged

5 tasks

flxst mentioned this pull request Jul 2, 2024

Fix getting started example #171

Merged

le1nux merged 21 commits into dev_experiments from fix/sequence_length_power_of_2

3 participants
