Dataset subsets #151

Merged: philipp-fischer merged 19 commits into develop from feature/sample_range on Aug 21, 2025
@voegtlel commented Jul 4, 2025

Fixes #13

Implements the sample range for metadatasets. Example from the tests:

ds1: Values [0, 55]
ds2: Values [100, 155]
ds3: Values [200, 255]

metadataset_ratio2.yaml:

__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    # take [10, 30] from ds1, [20, 40] from ds2 and then only [20%, 80%]
    # I.e. sample range: [14, 26], 2 * [124, 136]
    subset: {range: [20%, 80%]}
    blend_epochized:
      - path: ds1
        subset: {range: [10, 30]}
      - repetitions: 2
        subset: {range: [20, 40]}
        path: ds2

A second metadataset that nests metadataset_ratio2.yaml:

__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    subset: {range: [0%, 50%]}
    blend_epochized:
      - path: ds3
        # take [30, 50] from ds3, then first 50%, resulting in samples [230, 240]
        subset: {range: [30, 50]}
      - repetitions: 2
        # Inner sample range: [14, 26], 2 * [124, 136], total=12*3=36
        # Applying subset ratio 25%-75%: [17, 23], 2*[127, 133], total=3*6=18
        # Applying outer 50%: [17, 20], 2*[127, 130], total=3*3=9
        # Applying repetition: 2*[17, 20], 4*[127, 130], total=2*9=18
        subset: {range: [25%, 75%]}
        path: metadataset_ratio2.yaml
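
The arithmetic in the YAML comments above can be checked with a short sketch. It assumes ranges are half-open index intervals and that percentages are resolved with integer floor division; `percent_slice` is a hypothetical helper written for illustration, not part of the energon API, and the real rounding behavior may differ.

```python
def percent_slice(start, end, lo_pct, hi_pct):
    """Narrow the index range [start, end) to [lo_pct%, hi_pct%) of its length.

    Hypothetical helper: floor division is an assumption about rounding.
    """
    length = end - start
    return start + length * lo_pct // 100, start + length * hi_pct // 100


# metadataset_ratio2.yaml: absolute leaf ranges, then the split-level [20%, 80%]
inner_ds1 = percent_slice(10, 30, 20, 80)  # indices into ds1 (values 0 + index)
inner_ds2 = percent_slice(20, 40, 20, 80)  # indices into ds2 (values 100 + index)

# Outer file, nested entry: first [25%, 75%], then the split-level [0%, 50%]
mid_ds1 = percent_slice(*inner_ds1, 25, 75)
mid_ds2 = percent_slice(*inner_ds2, 25, 75)
outer_ds1 = percent_slice(*mid_ds1, 0, 50)
outer_ds2 = percent_slice(*mid_ds2, 0, 50)

# Outer file, ds3 entry: absolute [30, 50], then the split-level [0%, 50%]
outer_ds3 = percent_slice(30, 50, 0, 50)  # indices into ds3 (values 200 + index)


def count(rng):
    """Number of samples in a half-open range."""
    return rng[1] - rng[0]


# ds2 is repeated twice inside the inner file, and the nested entry is repeated twice again
nested_total = 2 * (count(outer_ds1) + 2 * count(outer_ds2))

print(inner_ds1, inner_ds2)          # inner sample ranges
print(outer_ds1, outer_ds2, outer_ds3)
print(nested_total)
```

Adding each dataset's value offset (0, 100, 200) to the resulting indices gives the sample values quoted in the comments.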

Subsets are applied recursively, innermost first. A `subset` with absolute values can only be applied to a "leaf" dataset (i.e. not to a nested metadataset).
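
A minimal sketch of that rule, under stated assumptions: the classes `Abs`, `Rel`, `Leaf`, `Blend` and the function `resolve` are hypothetical names invented for this illustration and do not mirror the classes in `metadataset_v2.py`. Repetitions are left out, leaf names are assumed unique, and the leaf sizes are illustrative.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Abs:
    """Absolute half-open sample range (hypothetical model)."""
    start: int
    end: int


@dataclass
class Rel:
    """Relative range in whole percent (hypothetical model)."""
    lo_pct: int
    hi_pct: int


@dataclass
class Leaf:
    name: str
    size: int
    subset: Optional[Union[Abs, Rel]] = None


@dataclass
class Blend:
    children: list
    subset: Optional[Union[Abs, Rel]] = None


def narrow(rng, rel):
    """Apply a relative range to an already resolved (start, end) range."""
    start, end = rng
    length = end - start
    return start + length * rel.lo_pct // 100, start + length * rel.hi_pct // 100


def resolve(node):
    """Return {leaf name: (start, end)} with subsets applied innermost first."""
    if isinstance(node, Leaf):
        rng = (0, node.size)
        if isinstance(node.subset, Abs):
            rng = (node.subset.start, node.subset.end)
        elif isinstance(node.subset, Rel):
            rng = narrow(rng, node.subset)
        return {node.name: rng}
    if isinstance(node.subset, Abs):
        # Absolute indices have no single meaning across several blended datasets
        raise ValueError("absolute subset ranges are only allowed on leaf datasets")
    ranges = {}
    for child in node.children:
        ranges.update(resolve(child))  # inner subsets are resolved first
    if node.subset is not None:
        ranges = {name: narrow(r, node.subset) for name, r in ranges.items()}
    return ranges


# Mirrors the example above; size=55 is an assumed, illustrative dataset size
inner = Blend([Leaf("ds1", 55, Abs(10, 30)), Leaf("ds2", 55, Abs(20, 40))], Rel(20, 80))
outer = Blend([Leaf("ds3", 55, Abs(30, 50)), Blend([inner], Rel(25, 75))], Rel(0, 50))
resolved = resolve(outer)
print(resolved)
```

Rejecting absolute ranges on non-leaf nodes keeps the semantics unambiguous: a percentage can be applied to each child's range independently, whereas an absolute index would need a defined ordering across the blended datasets.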

This is

Creating this as a draft, up for discussion.

@voegtlel requested a review from philipp-fischer on July 4, 2025 12:40
@voegtlel changed the title from "First draft for sample range" to "First draft for dataset subsets" on Jul 8, 2025
@voegtlel marked this pull request as ready for review on July 8, 2025 12:05
@voegtlel changed the title from "First draft for dataset subsets" to "Dataset subsets" on Aug 14, 2025
@philipp-fischer merged commit 5acd962 into develop on Aug 21, 2025
8 checks passed

Development

Successfully merging this pull request may close these issues.

Sample Range for Metadataset