In 19d85e3, we introduced num_embeddings = block_size - 1. This corresponds to the case where an original token sequence of length block_size is collated to generate input-output sequence pairs of length num_embeddings (the last token appears only in the output).
However, this means that if we choose e.g. block_size = 2048, the sequences in a training batch will have length num_embeddings = 2047, which is problematic.
We should consider restoring num_embeddings = block_size and instead collating data from original token sequences of length block_size + 1.
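A minimal sketch of the proposed collation, assuming a flat token stream and a hypothetical collate helper (names are illustrative, not from the codebase): a window of block_size + 1 tokens is split into an input sequence (all but the last token) and a target sequence (all but the first), so both have length block_size.

```python
import numpy as np

block_size = 4  # small value for illustration; the issue discusses 2048

# Hypothetical token stream; in practice this would be the tokenized dataset.
tokens = np.arange(10)

def collate(tokens, start, block_size):
    # Take a window of block_size + 1 tokens ...
    window = tokens[start : start + block_size + 1]
    # ... and split it into inputs and shifted targets, each of length block_size.
    x = window[:-1]  # inputs
    y = window[1:]   # targets (inputs shifted by one token)
    return x, y

x, y = collate(tokens, 0, block_size)
assert len(x) == block_size and len(y) == block_size
```

With this scheme the model's sequence length (and num_embeddings for learned positional embeddings) stays equal to block_size, while only the data loader reads one extra token per window.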