
Conversation

@le1nux le1nux (Member) commented Dec 13, 2024

What does this PR do?

Previously, the number of bytes per token was calculated as math.ceil(log2(vocab_size) / 8), yielding values between 1 and 4 bytes.
However, the dataset implementation only supports 1, 2 and 4 bytes per token, as defined here:

np_dtype_of_tokens_on_disk_from_bytes = {
1: np.dtype(np.uint8).newbyteorder("<"),
2: np.dtype(np.uint16).newbyteorder("<"),
4: np.dtype(np.uint32).newbyteorder("<"),
}

and

self._token_dtype_on_disk = self.np_dtype_of_tokens_on_disk_from_bytes[self._token_size_in_bytes]
self._token_dtype_in_ram = self.type_converter_for_torch[self._token_size_in_bytes]
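
For example (the vocabulary size below is chosen purely for illustration), the old computation can yield 3 bytes per token, which is not covered by the mapping above and raises a KeyError:

import math
import numpy as np

np_dtype_of_tokens_on_disk_from_bytes = {
    1: np.dtype(np.uint8).newbyteorder("<"),
    2: np.dtype(np.uint16).newbyteorder("<"),
    4: np.dtype(np.uint32).newbyteorder("<"),
}

vocab_size = 100_000  # hypothetical tokenizer vocabulary, token ids need 17 bits
token_size_in_bytes = math.ceil(math.log2(vocab_size) / 8)  # -> 3
np_dtype_of_tokens_on_disk_from_bytes[token_size_in_bytes]   # KeyError: 3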

I added a switch case that maps to the respective supported byte sizes when packing the data.
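
A minimal sketch of that mapping (the function name and structure are illustrative, not the actual implementation; the match statement requires Python >= 3.10):

import math

def get_token_size_in_bytes(vocab_size: int) -> int:
    # Minimum number of bytes needed to represent every token id.
    raw_num_bytes = math.ceil(math.log2(vocab_size) / 8)
    # Round up to a byte size supported by the dataset implementation (1, 2 or 4).
    match raw_num_bytes:
        case 1:
            return 1
        case 2:
            return 2
        case 3 | 4:
            return 4
        case _:
            raise ValueError(f"Unsupported token size of {raw_num_bytes} bytes.")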

This adds some inefficiency, as a vocabulary size > 65536 already requires 4 bytes per token, effectively doubling the storage requirements.

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • [ ] I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@le1nux le1nux changed the title Bugfix for using the expected number of bytes per token bug fix: Enforce power of 2 number of bytes per token Dec 13, 2024
@le1nux le1nux self-assigned this Dec 13, 2024
@le1nux le1nux requested review from flxst and mali-git December 13, 2024 20:15
@le1nux le1nux added the bug Something isn't working label Dec 13, 2024
@le1nux le1nux added this to the v0.3.2 milestone Dec 13, 2024
@fromm-m fromm-m (Member) left a comment

LGTM

@flxst flxst (Member) left a comment

LGTM

Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>
@le1nux le1nux merged commit 7cd60e2 into main Dec 16, 2024
1 check passed
@le1nux le1nux deleted the fix_num_bytes_per_token_power_of_2 branch December 16, 2024 14:44
