Bug: Dataset implementation does not  #163

@le1nux

Description

  • The index does not correctly index the original samples. We want the first sample's index to start at byte 0; during iteration, the header size in bytes is then added to the index so that reads happen at the correct distance from the start of the file.
    Right now we apply the offset twice, leading to wrong indexing of individual samples:
    ```python
    self._embedded_stream_data.data, dtype=self._token_dtype_on_disk, count=length, offset=offset
    ```

    ```python
    self.data = np.memmap(self._data_path, mode="r", offset=self.HEADER_SIZE_IN_BYTES, shape=(self.data_len,))
    ```
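To illustrate, here is a minimal sketch of the double-offset effect. The 8-byte header and `uint16` token dtype are assumptions chosen for the example, not values taken from the actual implementation:

```python
import numpy as np
import os
import tempfile

HEADER_SIZE_IN_BYTES = 8  # assumed header size, for illustration only
token_dtype = np.uint16

# Write a tiny file: header bytes followed by tokens 0..9.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
tokens = np.arange(10, dtype=token_dtype)
with open(path, "wb") as f:
    f.write(b"\x00" * HEADER_SIZE_IN_BYTES)
    f.write(tokens.tobytes())

# The memmap is already opened past the header ...
data = np.memmap(path, mode="r", dtype=np.uint8, offset=HEADER_SIZE_IN_BYTES)

# ... so a sample offset measured from byte 0 of the payload must be used
# as-is. Adding the header size a second time (the current bug) reads from
# the wrong position in the file.
sample_offset, sample_len = 0, 4
correct = np.frombuffer(data, dtype=token_dtype, count=sample_len, offset=sample_offset)
buggy = np.frombuffer(data, dtype=token_dtype, count=sample_len,
                      offset=sample_offset + HEADER_SIZE_IN_BYTES)
print(correct.tolist())  # [0, 1, 2, 3]
print(buggy.tolist())    # [4, 5, 6, 7] -- shifted forward by the header size
```

With a 2-byte token dtype, the extra 8-byte offset shifts every sample by four tokens, which is exactly the kind of silent corruption this issue describes.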

Since, in practice, we have only been using packing, this issue has not been observed so far.

While fixing this, we should also use byte-wise indices globally and, when reading from the memmap file into a buffer, recompute the byte index into a token position within that buffer.
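A possible shape for that byte-to-token conversion (the function name, header size, and dtype here are assumptions for illustration, not the actual modalities API):

```python
import numpy as np

HEADER_SIZE_IN_BYTES = 8            # hypothetical header size
token_dtype = np.dtype(np.uint16)   # hypothetical on-disk token dtype


def byte_index_to_token_pos(global_byte_offset: int, buffer_start_byte: int) -> int:
    """Map a global byte-wise sample offset to a token position inside a
    buffer that was read starting at `buffer_start_byte`. Both offsets are
    measured from the start of the payload, i.e. after the header."""
    delta = global_byte_offset - buffer_start_byte
    assert delta >= 0, "sample starts before the buffer"
    assert delta % token_dtype.itemsize == 0, "offset is not token-aligned"
    return delta // token_dtype.itemsize


# A sample starting at payload byte 12, inside a buffer read from byte 4,
# begins four 2-byte tokens into that buffer.
print(byte_index_to_token_pos(12, 4))  # 4
```

Keeping all global indices byte-wise and converting to token positions only at read time avoids mixing the two units, which is what made the double offset easy to miss.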

https://github.com/Modalities/modalities/blob/4aa2e88efe13c3eaab4c6b425fdb82caf0d2a443/src/modalities/dataloader/dataset.py#L161C1-L169C60

The documentation also needs the respective updates.

Metadata

Labels

bug (Something isn't working)
