Fix/dataset index: Index values were faulty when indexing the original samples instead of blocks. #164
…the indexation of the original samples
fixes #163
flxst
left a comment
Nice work! I have two general remarks:
- I think that the class inheritance structure, PackedMemMapDatasetBase -> PackedMemMapDatasetBase, PackedMemMapDatasetContinuous, is suboptimal now that the parent class PackedMemMapDatasetBase is not an abstract base class anymore, but actually can be instantiated and used by itself. It might be cleaner to restore an abstract base class with an abstract method _generate_packing_index, and inherit three classes from it with the three different implementations of the method.
- The names of the classes are hard to understand and maybe a bit misleading. For instance, the class PackedMemMapDatasetBase does not actually do packing, does it? Also, why Continuous? Maybe we could try to find better names.
I will approve the PR now since I don't believe that my above suggestions for improvement are essential, but I am happy to take another look if you decide to implement them!
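The restructuring suggested in the first remark could look roughly like the sketch below. It is only an illustration: the class names, the constructor arguments and the byte-based block size are hypothetical and not taken from the modalities code base, and only two of the three variants are shown (the third would follow the same pattern). The block variant additionally assumes that the samples are stored contiguously starting at byte 0.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class AbstractMemMapDataset(ABC):
    """Hypothetical abstract parent; it cannot be instantiated directly."""

    def __init__(self, index_base: List[Tuple[int, int]]):
        # index_base holds one (offset, length) pair in bytes per original sample
        self._index_base = index_base
        self._index = self._generate_packing_index()

    @abstractmethod
    def _generate_packing_index(self) -> List[Tuple[int, int]]:
        """Return the (offset, length) pairs that __getitem__ would serve."""

    def __len__(self) -> int:
        return len(self._index)


class SampleDataset(AbstractMemMapDataset):
    """Hypothetical variant that indexes the original samples, without packing."""

    def _generate_packing_index(self) -> List[Tuple[int, int]]:
        return list(self._index_base)


class ContinuousBlockDataset(AbstractMemMapDataset):
    """Hypothetical variant that cuts the byte stream into fixed-size blocks."""

    def __init__(self, index_base: List[Tuple[int, int]], block_size_bytes: int):
        # must be set before the parent constructor builds the index
        self._block_size_bytes = block_size_bytes
        super().__init__(index_base)

    def _generate_packing_index(self) -> List[Tuple[int, int]]:
        if not self._index_base:
            return []
        last_offset, last_length = self._index_base[-1]
        total_bytes = last_offset + last_length
        num_blocks = total_bytes // self._block_size_bytes  # drop the incomplete tail
        return [(i * self._block_size_bytes, self._block_size_bytes) for i in range(num_blocks)]
```

With this layout no subclass overrides a concrete parent method; each one supplies its own implementation of the single abstract hook.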
def _generate_packing_index(self) -> List[Tuple[int, int]]:
    raise NotImplementedError
    # index is a tuple of offset and length in bytes
    return self._embedded_stream_data.index_base
I do not like that this method is overwritten in the inherited classes PackedMemMapDatasetContinuous and PackedMemMapDatasetMegatron, see my general comment.
Will be addressed as part of a new PR and issue #167.
Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>
…lities into fix/dataset_index
Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>
Yes, the inheritance structure can be improved. I suggest we do this in a separate PR together with improving the "packing" terms in those cases when there is no actual packing happening. I added the issue #167 for addressing this.
464c6bd to a8a0a1d
What does this PR do?
The index values in the pbin files were wrong: they already started at an offset, and on top of that we added another offset of HEADER size when reading from the file buffer.
See here for the initial offset during pbin index creation:
modalities/src/modalities/dataloader/create_packed_data.py, line 145 in 4aa2e88
and the additional offset that is used when reading from the memmap during training:
modalities/src/modalities/dataloader/create_packed_data.py, line 262 in 4aa2e88
This PR fixes this issue and makes the index always start at byte 0, only applying the offset once when reading from the memmap file.
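The effect of the double offset can be reproduced with a small, self-contained sketch. Everything in it is hypothetical: the header size of 8 bytes, the helper names and the toy byte buffer are chosen for illustration only and do not reflect the actual pbin layout, and the initial index offset is assumed to equal the header size.

```python
from typing import List, Tuple

HEADER_SIZE_IN_BYTES = 8  # hypothetical header size, for illustration only


def build_index(sample_lengths: List[int], start: int = 0) -> List[Tuple[int, int]]:
    """Build (offset, length) pairs in bytes, with the first sample at `start`."""
    index, offset = [], start
    for length in sample_lengths:
        index.append((offset, length))
        offset += length
    return index


def read_sample(buffer: bytes, entry: Tuple[int, int]) -> bytes:
    """Read one sample, applying the header offset exactly once."""
    offset, length = entry
    begin = HEADER_SIZE_IN_BYTES + offset
    return buffer[begin:begin + length]


samples = [b"abc", b"defgh", b"ij"]
lengths = [len(s) for s in samples]
# toy file content: a dummy header followed by the concatenated samples
buffer = b"\x00" * HEADER_SIZE_IN_BYTES + b"".join(samples)

# fixed behaviour: the index starts at byte 0, the reader adds the header offset
fixed_index = build_index(lengths)
fixed = [read_sample(buffer, entry) for entry in fixed_index]

# faulty behaviour: the index already contains an offset, the reader adds another
faulty_index = build_index(lengths, start=HEADER_SIZE_IN_BYTES)
faulty = [read_sample(buffer, entry) for entry in faulty_index]
```

In this toy setup `fixed` reproduces the original samples, while `faulty` reads from shifted positions and returns the wrong bytes.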
General changes
- block_size from abstract classes that don't need to see the block_size concept
Breaking Changes
Checklist before submitting final PR
python tests/tests.py)