Description
Great work on your paper! I’m currently trying to reproduce the MoBA/full hybrid experiment and was wondering if you could share details about the dataset you used.
I noticed that the experiment involves training on 30B tokens with a context length of 32K. Does this mean that all training samples must be at least 32K tokens long? Or is it acceptable to construct training samples by concatenating shorter sequences?
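To make the question concrete, here is roughly how I would pack shorter documents into fixed 32K-token samples. This is only a minimal sketch assuming a simple greedy packing scheme (not necessarily what you used); `pack_sequences`, `CONTEXT_LEN`, and `doc_ids` are names I made up for illustration:

```python
from typing import Iterable, List, Tuple

CONTEXT_LEN = 32 * 1024  # 32K-token training context


def pack_sequences(docs: Iterable[List[int]]) -> List[Tuple[List[int], List[int]]]:
    """Greedily concatenate tokenized docs into fixed-length 32K samples.

    Returns (tokens, doc_ids) pairs, where doc_ids records which source
    document each position came from, so per-position losses can later be
    inspected per packed segment.
    """
    samples = []
    buf_tokens: List[int] = []
    buf_doc_ids: List[int] = []
    for doc_idx, tokens in enumerate(docs):
        buf_tokens.extend(tokens)
        buf_doc_ids.extend([doc_idx] * len(tokens))
        # Emit full 32K samples whenever the buffer is long enough.
        while len(buf_tokens) >= CONTEXT_LEN:
            samples.append((buf_tokens[:CONTEXT_LEN], buf_doc_ids[:CONTEXT_LEN]))
            buf_tokens = buf_tokens[CONTEXT_LEN:]
            buf_doc_ids = buf_doc_ids[CONTEXT_LEN:]
    return samples
```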
I’m particularly concerned about how concatenation might affect the final loss curve. Since the position-wise loss would be computed over concatenated sequences, wouldn’t this introduce discontinuities in the loss curve, making it difficult to match the trend shown in Figure 5a? For example, if I only have 10K-token sequences, the loss curve for a single 32K sample would essentially consist of three distinct segments, each corresponding to a different concatenated document. Would this affect the results?
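To illustrate the loss-curve concern, here is a minimal PyTorch sketch (the function name is hypothetical) of how I would compute the per-position loss for one packed sample; positions just after a document boundary see almost no relevant context, which is why I would expect visible jumps at those boundaries when averaging over many packed samples:

```python
import torch
import torch.nn.functional as F


def position_wise_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Per-position cross-entropy for one packed sample.

    logits:  (seq_len, vocab_size) model outputs for the packed sample
    targets: (seq_len,) next-token labels for the same sample
    Returns a (seq_len,) tensor of per-position losses.
    """
    return F.cross_entropy(logits, targets, reduction="none")
```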
It would be great if you could share some information about the dataset.