- The index does not correctly address the original samples. The first entry in the index should start at byte 0. During iteration, the header size in bytes is then added to the index offset so that each read starts at the correct distance from the beginning of the file.
Right now we apply the offset twice, which leads to incorrect indexing of individual samples.

`self._embedded_stream_data.data, dtype=self._token_dtype_on_disk, count=length, offset=offset`

`self.data = np.memmap(self._data_path, mode="r", offset=self.HEADER_SIZE_IN_BYTES, shape=(self.data_len,))`

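The effect of a doubled offset can be reproduced in isolation. The following is a minimal sketch, not the Modalities implementation: the 8-byte header, the 2-byte token type, the file layout, and the helper names `read_sample` and `demo` are all assumptions made for illustration. It opens the file with a memmap that already skips the header, then reads one sample with an offset relative to the data section and once more with an offset that also includes the header.

```python
import os
import tempfile

import numpy as np

HEADER_SIZE_IN_BYTES = 8       # hypothetical header size for this sketch
TOKEN_DTYPE = np.dtype("<u2")  # assumed on-disk token type (2 bytes per token)


def read_sample(path, byte_offset, num_tokens):
    """Read num_tokens tokens starting byte_offset bytes into the memmap."""
    # The memmap already skips the header, like the np.memmap call quoted above.
    data = np.memmap(path, mode="r", offset=HEADER_SIZE_IN_BYTES)
    sample = np.frombuffer(
        data, dtype=TOKEN_DTYPE, count=num_tokens, offset=byte_offset
    ).copy()
    del data  # release the mapping before the temporary file is removed
    return sample


def demo():
    tokens = np.arange(10, dtype=TOKEN_DTYPE)  # token ids 0..9
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "packed.bin")
        with open(path, "wb") as f:
            f.write(b"\x00" * HEADER_SIZE_IN_BYTES)  # dummy header
            f.write(tokens.tobytes())
        # The sample consists of tokens 2, 3, 4.
        start = 2 * TOKEN_DTYPE.itemsize
        # Offset relative to the data section: header applied once.
        correct = read_sample(path, start, 3)
        # Offset that already contains the header: header applied twice.
        shifted = read_sample(path, start + HEADER_SIZE_IN_BYTES, 3)
    return correct.tolist(), shifted.tolist()


correct, shifted = demo()
print(correct)  # [2, 3, 4]
print(shifted)  # [6, 7, 8]
```

In this sketch the second read is shifted by exactly the header size, so it returns tokens that belong to a later position in the file.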
Since we have only used packing in practice, this issue has not been observed so far.
While fixing this, we should also use byte-wise indices globally and, when reading from the memmap file as part of the buffer, convert the byte-wise index position into the corresponding token position in the buffer.
https://github.com/Modalities/modalities/blob/4aa2e88efe13c3eaab4c6b425fdb82caf0d2a443/src/modalities/dataloader/dataset.py#L161C1-L169C60
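One possible shape for the byte-to-token conversion is sketched below. It is an illustration under stated assumptions rather than the proposed patch: the helper `byte_span_to_token_span`, the 4-byte token type, and the example index entries are hypothetical, and the in-memory buffer stands in for the header-free data section of the file.

```python
import numpy as np


def byte_span_to_token_span(offset_in_bytes, length_in_bytes, token_size_in_bytes):
    """Convert a byte-wise (offset, length) pair into (first token, token count)."""
    if offset_in_bytes % token_size_in_bytes or length_in_bytes % token_size_in_bytes:
        raise ValueError("byte span is not aligned to the token size")
    return offset_in_bytes // token_size_in_bytes, length_in_bytes // token_size_in_bytes


token_dtype = np.dtype("<u4")  # assumed 4-byte tokens for this sketch

# Stand-in for the data section (header already skipped): token ids 0..7.
buffer = np.arange(8, dtype=token_dtype).tobytes()
tokens = np.frombuffer(buffer, dtype=token_dtype)

# Hypothetical byte-wise index: (offset, length) pairs, first entry at byte 0.
index = [(0, 12), (12, 8), (20, 12)]

samples = []
for offset, length in index:
    first, count = byte_span_to_token_span(offset, length, token_dtype.itemsize)
    samples.append(tokens[first:first + count].tolist())

print(samples)  # [[0, 1, 2], [3, 4], [5, 6, 7]]
```

Keeping the index in bytes means it does not depend on the token size, while the alignment check catches entries that would split a token.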
The documentation also needs the corresponding updates.
Referenced lines:
- modalities/src/modalities/dataloader/dataset.py, line 165 in 4aa2e88
- modalities/src/modalities/dataloader/create_packed_data.py, line 262 in 4aa2e88