In 19d85e3, we introduced num_embeddings = block_size - 1. This corresponds to the case where an original token sequence of length block_size is collated to generate input-output sequence pairs of length num_embeddings (the last token appears only in the output).
However, this means that if we choose e.g. block_size = 2048, the sequences in a training batch will have length num_embeddings = 2047, which is problematic.
We should consider restoring num_embeddings = block_size and instead collating data from original token sequences of length block_size + 1.
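A minimal sketch of the proposed collation, assuming a flat token stream and a hypothetical collate helper (names are illustrative, not from the codebase): a window of block_size + 1 tokens is split into an input sequence (all but the last token) and a target sequence (all but the first), so both have length block_size.

```python
import numpy as np

block_size = 4  # small value for illustration; the issue discusses 2048

# Hypothetical token stream; in practice this would be the tokenized dataset.
tokens = np.arange(10)

def collate(tokens, start, block_size):
    # Take a window of block_size + 1 tokens ...
    window = tokens[start : start + block_size + 1]
    # ... and split it into inputs and shifted targets, each of length block_size.
    x = window[:-1]  # inputs
    y = window[1:]   # targets (inputs shifted by one token)
    return x, y

x, y = collate(tokens, 0, block_size)
assert len(x) == block_size and len(y) == block_size
```

With this scheme the model's sequence length (and num_embeddings for learned positional embeddings) stays equal to block_size, while only the data loader reads one extra token per window.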