
Conversation

@le1nux le1nux (Member) commented Dec 13, 2024

What does this PR do?

Previously, the number of bytes per token was calculated as math.ceil(log2(vocab_size) / 8), yielding values between 1 and 4 bytes.
However, the dataset implementation only supports 1, 2 and 4 bytes per token, as defined here:

np_dtype_of_tokens_on_disk_from_bytes = {
1: np.dtype(np.uint8).newbyteorder("<"),
2: np.dtype(np.uint16).newbyteorder("<"),
4: np.dtype(np.uint32).newbyteorder("<"),
}

and

self._token_dtype_on_disk = self.np_dtype_of_tokens_on_disk_from_bytes[self._token_size_in_bytes]
self._token_dtype_in_ram = self.type_converter_for_torch[self._token_size_in_bytes]
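
For example (the vocabulary size below is chosen purely for illustration), the old computation can yield 3 bytes per token, which is not covered by the mapping above and raises a KeyError:

import math
import numpy as np

np_dtype_of_tokens_on_disk_from_bytes = {
    1: np.dtype(np.uint8).newbyteorder("<"),
    2: np.dtype(np.uint16).newbyteorder("<"),
    4: np.dtype(np.uint32).newbyteorder("<"),
}

vocab_size = 100_000  # hypothetical tokenizer vocabulary, token ids need 17 bits
token_size_in_bytes = math.ceil(math.log2(vocab_size) / 8)  # -> 3
np_dtype_of_tokens_on_disk_from_bytes[token_size_in_bytes]   # KeyError: 3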

I added a switch case that maps to the respective supported byte sizes when packing the data.
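
A minimal sketch of that mapping (the function name and structure are illustrative, not the actual implementation; the match statement requires Python >= 3.10):

import math

def get_token_size_in_bytes(vocab_size: int) -> int:
    # Minimum number of bytes needed to represent every token id.
    raw_num_bytes = math.ceil(math.log2(vocab_size) / 8)
    # Round up to a byte size supported by the dataset implementation (1, 2 or 4).
    match raw_num_bytes:
        case 1:
            return 1
        case 2:
            return 2
        case 3 | 4:
            return 4
        case _:
            raise ValueError(f"Unsupported token size of {raw_num_bytes} bytes.")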

This adds some inefficiency, as a vocabulary size > 65536 already requires 4 bytes per token, effectively doubling the storage requirements.

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • [ ] I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@le1nux le1nux changed the title Bugfix for using the expected number of bytes per token bug fix: Enforce power of 2 number of bytes per token Dec 13, 2024
@le1nux le1nux self-assigned this Dec 13, 2024
@le1nux le1nux requested review from flxst and mali-git December 13, 2024 20:15
@le1nux le1nux added the bug Something isn't working label Dec 13, 2024
@le1nux le1nux added this to the v0.3.2 milestone Dec 13, 2024
@fromm-m fromm-m (Member) left a comment

LGTM

@flxst flxst (Member) left a comment

LGTM

Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>
@le1nux le1nux merged commit 7cd60e2 into main Dec 16, 2024
1 check passed
@le1nux le1nux deleted the fix_num_bytes_per_token_power_of_2 branch December 16, 2024 14:44
