Skip to content

Fix/dataset index: Index values were faulty when indexing the original samples instead of blocks.#164

Merged
le1nux merged 10 commits intofix/sequence_length_power_of_2from
fix/dataset_index
Jun 30, 2024
Merged

Fix/dataset index: Index values were faulty when indexing the original samples instead of blocks.#164
le1nux merged 10 commits intofix/sequence_length_power_of_2from
fix/dataset_index

Conversation

@le1nux
Copy link
Copy Markdown
Member

@le1nux le1nux commented Jun 25, 2024

What does this PR do?

The index values in the pbin files had the wrong values. They did start with an offset and additionally, we added another offset of HEADER size when reading from the file buffer.
See here for the initial offset during pbin index creation:

curr_offset = EmbeddedStreamData.HEADER_SIZE_IN_BYTES

and the additional offset that is used when reading from the memmap during training:

self.data = np.memmap(self._data_path, mode="r", offset=self.HEADER_SIZE_IN_BYTES, shape=(self.data_len,))

This PR fixes this issue and makes the index always start at byte 0, only applying the offset once when reading from the memmap file.

General changes

  • index tuples are now always in bytes and the start of the first sample in the data section starts at byte 0 (before the was a wrong offset)
  • added test for indexing the original samples
  • the documentation was updated accordingly
  • removed block_size from abstract classes that don't need to see the block_size concept

Breaking Changes

  • None

Checklist before submitting final PR

  • My PR is minimal and addresses one issue / enhancement in isolation
  • I have merged main into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have fixed all failing tests (python tests/tests.py)

@le1nux le1nux added bug Something isn't working enhancement New feature or request labels Jun 25, 2024
@le1nux le1nux requested review from flxst and mali-git June 25, 2024 13:51
@le1nux le1nux self-assigned this Jun 25, 2024
@le1nux
Copy link
Copy Markdown
Member Author

le1nux commented Jun 25, 2024

fixes #163

Copy link
Copy Markdown
Member

@flxst flxst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I have two general remarks:

  1. I think that the class inheritance structure, PackedMemMapDatasetBase -> PackedMemMapDatasetBase, PackedMemMapDatasetContinuous is suboptimal now that the parent class PackedMemMapDatasetBase is not an abstract base class anymore, but actually can be instantiated and used by itself. It might be cleaner to restore an abstract base class with an abstract method _generate_packing_index, and inherit three classes from it with the three different implementations of the method.
  2. The names of the classes are hard to understand and maybe a bit misleading. For instance, the class PackedMemMapDatasetBase does not actually do packing, does it? Also, why Continuous? Maybe we could try to find better names.

I will approve the PR now since I don't believe that my above suggestions for improvement are essential, but I am happy to take another look if you decide to implement them!

Comment thread src/modalities/dataloader/create_packed_data.py Outdated
Comment thread src/modalities/dataloader/create_packed_data.py Outdated
Comment thread src/modalities/dataloader/dataset.py Outdated
def _generate_packing_index(self) -> List[Tuple[int, int]]:
raise NotImplementedError
# index is a tuple of offset and length in bytes
return self._embedded_stream_data.index_base
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not like that this method is overwritten in the inherited classes PackedMemMapDatasetContinuous and PackedMemMapDatasetMegatron, see my general comment.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will be adressed as part of a new PR and issue #167

le1nux and others added 4 commits June 28, 2024 10:24
@le1nux
Copy link
Copy Markdown
Member Author

le1nux commented Jun 28, 2024

Yes, the inheritance structure can be improved. I suggest we do this in a separate PR together with improving the "packing" terms in those cases when there is no actual packing happening.

I added the issue #167 for addressing this.

Comment thread examples/getting_started/README.md
@le1nux le1nux force-pushed the fix/dataset_index branch from 464c6bd to a8a0a1d Compare June 30, 2024 10:44
@le1nux le1nux merged commit 8ddf17c into fix/sequence_length_power_of_2 Jun 30, 2024
@le1nux le1nux deleted the fix/dataset_index branch June 30, 2024 10:47
@le1nux le1nux mentioned this pull request Jun 30, 2024
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants