Fix/dataset index: Index values were faulty when indexing the original samples instead of blocks. by le1nux · Pull Request #164 · Modalities/modalities

le1nux · 2024-06-25T13:51:01Z

What does this PR do?

The index values in the pbin files were wrong. The index already started at an offset of the header size, and we then applied the header-size offset a second time when reading from the file buffer.
See here for the initial offset during pbin index creation:

modalities/src/modalities/dataloader/create_packed_data.py

Line 145 in 4aa2e88

curr_offset = EmbeddedStreamData.HEADER_SIZE_IN_BYTES

and the additional offset that is used when reading from the memmap during training:

modalities/src/modalities/dataloader/create_packed_data.py

Line 262 in 4aa2e88

    
self.data = np.memmap(self._data_path, mode="r", offset=self.HEADER_SIZE_IN_BYTES, shape=(self.data_len,))

This PR fixes this issue and makes the index always start at byte 0, only applying the offset once when reading from the memmap file.
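The double offset can be reproduced with a toy file. This is a minimal sketch, not the Modalities implementation: the 8-byte header, the file name `toy.pbin`, and the helpers `write_file`, `build_index` and `read_samples` are all assumptions made for illustration, and the real pbin layout contains more than a blank header followed by data.

```python
import os
import tempfile

import numpy as np

HEADER_SIZE_IN_BYTES = 8  # hypothetical header size for this sketch


def write_file(path, samples):
    """Write a dummy header followed by the concatenated sample bytes."""
    data = b"".join(samples)
    with open(path, "wb") as f:
        f.write(b"\x00" * HEADER_SIZE_IN_BYTES)
        f.write(data)
    return len(data)


def build_index(samples, start_offset):
    """Build (offset_in_bytes, length_in_bytes) tuples starting at start_offset."""
    index = []
    curr_offset = start_offset
    for sample in samples:
        index.append((curr_offset, len(sample)))
        curr_offset += len(sample)
    return index


def read_samples(path, index, data_len):
    """Read samples through a memmap that already skips the header."""
    data = np.memmap(path, mode="r", offset=HEADER_SIZE_IN_BYTES, shape=(data_len,))
    result = [data[offset : offset + length].tobytes() for offset, length in index]
    del data  # release the mapping before the temporary file is removed
    return result


samples = [b"abc", b"defgh", b"ij"]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pbin")
    data_len = write_file(path, samples)

    # Old behaviour: the index starts at the header size AND the memmap skips the header.
    faulty = read_samples(path, build_index(samples, HEADER_SIZE_IN_BYTES), data_len)
    # Fixed behaviour: the index starts at byte 0, the header is skipped only once.
    fixed = read_samples(path, build_index(samples, 0), data_len)

print("faulty:", faulty)
print("fixed: ", fixed)
```

With the index starting at byte 0, every sample is read back unchanged. With the index starting at the header size, each read is shifted by a further header length, so it returns the wrong bytes or runs past the end of the data section.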

General changes

index tuples are now always in bytes, and the first sample in the data section starts at byte 0 (previously there was a wrong offset)
added test for indexing the original samples
the documentation was updated accordingly
removed block_size from abstract classes that don't need to see the block_size concept
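To illustrate the byte-based index convention from the list above, here is a small sketch of how `(offset_in_bytes, length_in_bytes)` tuples map back to token sequences. The 2-byte little-endian token type, the in-memory `data` buffer and the `get_sample` helper are assumptions chosen for the example; they are not taken from the Modalities code.

```python
import numpy as np

TOKEN_SIZE_IN_BYTES = 2  # assumption: each token is stored as 2 bytes
token_dtype = np.dtype("<u2")  # little-endian unsigned 16-bit, matching the size above

# Three original samples of different lengths (token ids).
samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Data section: the samples concatenated as raw bytes, without any header.
data = b"".join(np.asarray(s, dtype=token_dtype).tobytes() for s in samples)

# Index: one (offset_in_bytes, length_in_bytes) tuple per sample, starting at byte 0.
index = []
offset_in_bytes = 0
for sample in samples:
    length_in_bytes = len(sample) * TOKEN_SIZE_IN_BYTES
    index.append((offset_in_bytes, length_in_bytes))
    offset_in_bytes += length_in_bytes


def get_sample(i):
    """Decode the i-th original sample from the data section via the byte index."""
    offset, length = index[i]
    tokens = np.frombuffer(
        data, dtype=token_dtype, count=length // TOKEN_SIZE_IN_BYTES, offset=offset
    )
    return tokens.tolist()


print(index)
print([get_sample(i) for i in range(len(samples))])
```

Because both tuple entries are in bytes, the number of tokens in a sample is the length divided by the token size, and the header offset does not appear anywhere in the index.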

Breaking Changes

None

Checklist before submitting final PR

My PR is minimal and addresses one issue / enhancement in isolation
I have merged main into this feature branch
I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
I have run a sample config for model training
I have fixed all failing tests (python tests/tests.py)


le1nux · 2024-06-25T14:02:29Z

fixes #163

flxst

Nice work! I have two general remarks:

I think that the class inheritance structure, PackedMemMapDatasetBase -> PackedMemMapDatasetContinuous, PackedMemMapDatasetMegatron, is suboptimal now that the parent class PackedMemMapDatasetBase is no longer an abstract base class but can be instantiated and used by itself. It might be cleaner to restore an abstract base class with an abstract method _generate_packing_index and derive three classes from it, one for each implementation of the method.
The names of the classes are hard to understand and maybe a bit misleading. For instance, the class PackedMemMapDatasetBase does not actually do packing, does it? Also, why Continuous? Maybe we could try to find better names.

I will approve the PR now since I don't believe that my above suggestions for improvement are essential, but I am happy to take another look if you decide to implement them!
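The suggested structure could look roughly like the following sketch. All class names here are hypothetical placeholders, and the non-overlapping block index is a deliberate simplification rather than the logic of PackedMemMapDatasetContinuous or PackedMemMapDatasetMegatron; a third subclass would follow the same pattern.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class PackedDatasetSketch(ABC):
    """Hypothetical abstract base: subclasses decide how the index is built."""

    def __init__(self, index_base: List[Tuple[int, int]]):
        # index_base holds (offset_in_bytes, length_in_bytes) per original sample
        self._index_base = index_base
        self._index = self._generate_packing_index()

    @abstractmethod
    def _generate_packing_index(self) -> List[Tuple[int, int]]:
        ...

    def __len__(self) -> int:
        return len(self._index)


class OriginalSamplesSketch(PackedDatasetSketch):
    """Indexes the original samples; no packing happens here."""

    def _generate_packing_index(self) -> List[Tuple[int, int]]:
        return list(self._index_base)


class FixedBlocksSketch(PackedDatasetSketch):
    """Indexes fixed-size, non-overlapping blocks (a simplification)."""

    def __init__(self, index_base: List[Tuple[int, int]], block_size_in_bytes: int):
        # Only this subclass needs to know about the block size.
        self._block_size_in_bytes = block_size_in_bytes
        super().__init__(index_base)

    def _generate_packing_index(self) -> List[Tuple[int, int]]:
        last_offset, last_length = self._index_base[-1]
        total_bytes = last_offset + last_length
        num_blocks = total_bytes // self._block_size_in_bytes
        return [
            (i * self._block_size_in_bytes, self._block_size_in_bytes)
            for i in range(num_blocks)
        ]


index_base = [(0, 6), (6, 4), (10, 8)]
original = OriginalSamplesSketch(index_base)
blocks = FixedBlocksSketch(index_base, block_size_in_bytes=4)
print(len(original), len(blocks))
```

In this layout the base class cannot be instantiated on its own, and the block size is confined to the subclass that actually uses it.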

flxst · 2024-06-27T20:52:43Z

    def _generate_packing_index(self) -> List[Tuple[int, int]]:
-        raise NotImplementedError
+        # index is a tuple of offset and length in bytes
+        return self._embedded_stream_data.index_base


I do not like that this method is overwritten in the inherited classes PackedMemMapDatasetContinuous and PackedMemMapDatasetMegatron, see my general comment.

This will be addressed as part of a new PR and issue #167.


le1nux · 2024-06-28T08:52:17Z

Yes, the inheritance structure can be improved. I suggest we do this in a separate PR together with improving the "packing" terms in those cases when there is no actual packing happening.

I added the issue #167 for addressing this.

le1nux added 5 commits June 25, 2024 14:58

fix: we use the correct byte-based indexation now

1772f47

test: added test test_original_samples_in_packed_dataset for testing …

3c6bfbc

…the indexation of the original samples

chore: updated getting started documentation regarding the byte-based…

3a45e52

… indexation

fix: fixed index in dummy_packed_data_path of conftest

0f28492

chore: updated readme inaccuracy

378c59c

le1nux added the bug and enhancement labels Jun 25, 2024

le1nux requested review from flxst and mali-git June 25, 2024 13:51

le1nux self-assigned this Jun 25, 2024

flxst approved these changes Jun 27, 2024


le1nux and others added 4 commits June 28, 2024 10:24

Update src/modalities/dataloader/create_packed_data.py

138fa85

Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>

chore: renamed offset to offset_in_bytes for consistency

455c26a

chore: Merge branch 'fix/dataset_index' of github.com:Modalities/moda…

28c9c88

…lities into fix/dataset_index

Update src/modalities/dataloader/create_packed_data.py

969f11a

Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>

le1nux mentioned this pull request Jun 28, 2024

Improve the dataset inheritance and class naming #167

Open

chore: add comments

a8a0a1d

mali-git approved these changes Jun 30, 2024


Comment thread examples/getting_started/README.md

le1nux force-pushed the fix/dataset_index branch from 464c6bd to a8a0a1d Compare June 30, 2024 10:44

le1nux merged commit 8ddf17c into fix/sequence_length_power_of_2 Jun 30, 2024

le1nux deleted the fix/dataset_index branch June 30, 2024 10:47

le1nux mentioned this pull request Jun 30, 2024

Towards stable modalities version #141

Merged

5 tasks

le1nux merged 10 commits into fix/sequence_length_power_of_2 from fix/dataset_index
