
Inquiry About Dataset and Training Details for MoBA/Full Hybrid Experiment #20

@yukiy927

Description


Great work on your paper! I’m currently trying to reproduce the MoBA/full hybrid experiment and was wondering if you could share details about the dataset you used.

I noticed that the experiment involves training on 30B tokens with a context length of 32K. Does this mean that all training samples must be at least 32K tokens long? Or is it acceptable to construct training samples by concatenating shorter sequences?
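For concreteness, here is a minimal sketch of the concatenation scheme I have in mind (the function name, the `EOS_ID` value, and the separator handling are my own assumptions, not something taken from your code or the paper):

```python
# A minimal sketch: shorter tokenized documents are concatenated (with an
# EOS separator, a hypothetical token id here) and split into fixed
# 32K-token training samples.
from typing import Iterable, Iterator

SEQ_LEN = 32 * 1024  # 32K-token context length
EOS_ID = 2           # hypothetical end-of-document token id

def pack_documents(docs: Iterable[list[int]]) -> Iterator[list[int]]:
    """Concatenate tokenized documents and yield fixed-length 32K samples."""
    buffer: list[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(EOS_ID)  # mark the document boundary
        while len(buffer) >= SEQ_LEN:
            yield buffer[:SEQ_LEN]
            buffer = buffer[SEQ_LEN:]  # carry the remainder into the next sample
```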

I’m particularly concerned about how concatenation might affect the final loss curve. Since the position-wise loss would be computed over concatenated sequences, wouldn’t this introduce discontinuities in the curve, making it difficult to match the trend shown in Figure 5a? For example, if I only have 10K-token sequences, then the position-wise loss over a single 32K sample would be composed of roughly three or four distinct segments, each corresponding to a different concatenated document, with the effective in-document context length resetting at every boundary. Would this affect the results?
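To make the concern concrete, here is roughly how I would compute the position-wise loss curve (a sketch under my own assumptions; the function name and tensor shapes are mine). With 10K-token documents packed into a 32K window, I would expect jumps in this curve near positions ~10K, ~20K, and ~30K:

```python
# Sketch of a position-wise loss curve (as in Figure 5a) over packed samples.
import torch
import torch.nn.functional as F

def positionwise_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Per-position cross-entropy averaged over the batch.

    logits:  (batch, seq_len, vocab)
    targets: (batch, seq_len)
    returns: (seq_len,) mean loss at each position.
    """
    loss = F.cross_entropy(
        logits.transpose(1, 2),  # (batch, vocab, seq_len), as cross_entropy expects
        targets,
        reduction="none",        # keep the full (batch, seq_len) loss map
    )
    return loss.mean(dim=0)      # average over samples, not over positions
```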

It would be great if you could share more information about the dataset.
