Description
Great work on your paper! I’m currently trying to reproduce the MoBA/full hybrid experiment and was wondering if you could share details about the dataset you used.
I noticed that the experiment involves training on 30B tokens with a context length of 32K. Does this mean that all training samples must be at least 32K tokens long? Or is it acceptable to construct training samples by concatenating shorter sequences?
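To make the question concrete, here is roughly how I would pack shorter documents into fixed 32K-token samples. This is only a minimal sketch assuming a simple greedy packing scheme (not necessarily what you used); `pack_sequences`, `CONTEXT_LEN`, and `doc_ids` are names I made up for illustration:

```python
from typing import Iterable, List, Tuple

CONTEXT_LEN = 32 * 1024  # 32K-token training context


def pack_sequences(docs: Iterable[List[int]]) -> List[Tuple[List[int], List[int]]]:
    """Greedily concatenate tokenized docs into fixed-length 32K samples.

    Returns (tokens, doc_ids) pairs, where doc_ids records which source
    document each position came from, so per-position losses can later be
    inspected per packed segment.
    """
    samples = []
    buf_tokens: List[int] = []
    buf_doc_ids: List[int] = []
    for doc_idx, tokens in enumerate(docs):
        buf_tokens.extend(tokens)
        buf_doc_ids.extend([doc_idx] * len(tokens))
        # Emit full 32K samples whenever the buffer is long enough.
        while len(buf_tokens) >= CONTEXT_LEN:
            samples.append((buf_tokens[:CONTEXT_LEN], buf_doc_ids[:CONTEXT_LEN]))
            buf_tokens = buf_tokens[CONTEXT_LEN:]
            buf_doc_ids = buf_doc_ids[CONTEXT_LEN:]
    return samples
```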
I’m particularly concerned about how concatenation might affect the final loss curve. Since the position-wise loss would be computed over concatenated sequences, wouldn’t this introduce discontinuities in the loss curve, making it difficult to match the trend shown in Figure 5a? For example, if I only have 10K-token sequences, the loss curve for a single 32K sample would essentially consist of three distinct segments, each corresponding to a different concatenated document. Would this affect the results?
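To illustrate the loss-curve concern, here is a minimal PyTorch sketch (the function name is hypothetical) of how I would compute the per-position loss for one packed sample; positions just after a document boundary see almost no relevant context, which is why I would expect visible jumps at those boundaries when averaging over many packed samples:

```python
import torch
import torch.nn.functional as F


def position_wise_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Per-position cross-entropy for one packed sample.

    logits:  (seq_len, vocab_size) model outputs for the packed sample
    targets: (seq_len,) next-token labels for the same sample
    Returns a (seq_len,) tensor of per-position losses.
    """
    return F.cross_entropy(logits, targets, reduction="none")
```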
It would be great if you could share some information about the dataset.