Bug: Dataset implementation does not #163

* The index does not correctly address the original samples. We want the first sample in the index to start at byte 0. During iteration, the HEADER size in bytes is then added to each index entry so that we read from the correct offset from the start of the file.
Right now we apply the header offset twice, which leads to wrong indexing of individual samples.
https://github.com/Modalities/modalities/blob/4aa2e88efe13c3eaab4c6b425fdb82caf0d2a443/src/modalities/dataloader/dataset.py#L165
https://github.com/Modalities/modalities/blob/4aa2e88efe13c3eaab4c6b425fdb82caf0d2a443/src/modalities/dataloader/create_packed_data.py#L262
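
To illustrate the double offset described above, here is a minimal, self-contained sketch. It does not reproduce the actual Modalities file layout: `HEADER_SIZE`, `write_file` and `read_sample` are hypothetical names, and the 8-byte header is an assumption chosen only for the example.

```python
HEADER_SIZE = 8  # hypothetical header length in bytes, not the real file layout


def write_file(samples, index_starts_at_zero=True):
    """Return (raw bytes, index); index entries are (offset, length) in bytes."""
    payload = b"".join(samples)
    header = len(payload).to_bytes(HEADER_SIZE, "little")
    index = []
    # Intended behaviour: the first sample starts at byte 0 of the payload.
    # Buggy behaviour: the header size is already baked into the index.
    offset = 0 if index_starts_at_zero else HEADER_SIZE
    for sample in samples:
        index.append((offset, len(sample)))
        offset += len(sample)
    return header + payload, index


def read_sample(raw, index, i):
    """Read sample i; the header offset is added here, at read time."""
    offset, length = index[i]
    start = HEADER_SIZE + offset
    return raw[start:start + length]


samples = [b"abc", b"defgh", b"ij"]

# Index relative to byte 0: the header is applied exactly once.
raw, index = write_file(samples)
recovered = [read_sample(raw, index, i) for i in range(len(samples))]

# Index that already contains the header: the offset is applied twice.
raw_bad, index_bad = write_file(samples, index_starts_at_zero=False)
shifted = [read_sample(raw_bad, index_bad, i) for i in range(len(samples))]
```

With the zero-based index, `recovered` equals `samples`. With the header already included in the index, every read is shifted by `HEADER_SIZE` bytes, so `shifted` no longer matches the original samples.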

Since we have only been using packing in practice, this issue has not been observed so far.


While fixing this, we should also use byte-wise indices globally. When reading from the memmap file into the buffer, the byte position should then be converted to the corresponding token position in the buffer.

https://github.com/Modalities/modalities/blob/4aa2e88efe13c3eaab4c6b425fdb82caf0d2a443/src/modalities/dataloader/dataset.py#L161C1-L169C60
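
The byte-to-token conversion proposed above could look roughly like the following sketch. It assumes a fixed token width in bytes and little-endian encoding; `TOKEN_SIZE`, `byte_span_to_token_span` and `decode_tokens` are hypothetical names for illustration, not existing Modalities functions.

```python
TOKEN_SIZE = 2  # hypothetical: number of bytes used to store one token


def byte_span_to_token_span(byte_offset, byte_length, buffer_start=0,
                            token_size=TOKEN_SIZE):
    """Convert a byte-wise (offset, length) entry into token positions.

    buffer_start is the byte position at which the buffer begins,
    measured in the same coordinates as byte_offset.
    """
    relative = byte_offset - buffer_start
    if relative < 0 or relative % token_size or byte_length % token_size:
        raise ValueError("span is not aligned to token boundaries")
    first = relative // token_size
    return first, first + byte_length // token_size


def decode_tokens(buffer, token_size=TOKEN_SIZE):
    """Decode a bytes buffer into a list of integer token ids."""
    return [
        int.from_bytes(buffer[i:i + token_size], "little")
        for i in range(0, len(buffer), token_size)
    ]


tokens = [5, 300, 7, 42, 9]
buffer = b"".join(t.to_bytes(TOKEN_SIZE, "little") for t in tokens)

# Byte-wise index: two samples of 2 and 3 tokens respectively.
byte_index = [(0, 4), (4, 6)]

decoded = decode_tokens(buffer)
token_samples = [
    decoded[slice(*byte_span_to_token_span(offset, length))]
    for offset, length in byte_index
]
```

Keeping the index in bytes means it stays valid regardless of the token width; only the reader needs to know `token_size` when it slices the buffer.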

The documentation also needs the respective updates.
