[Improve] Use map_fn and collate_fn to manage dataset and dataloader #8

Merged
LZHgrla merged 7 commits into InternLM:main from LZHgrla:lzh/data
Jul 21, 2023

LZHgrla commented Jul 21, 2023

This PR is based on open-mmlab/mmengine#1262

LZHgrla merged commit c9e59bb into InternLM:main Jul 21, 2023
LZHgrla deleted the lzh/data branch July 21, 2023 07:10
llkn-2 pushed a commit to llkn-2/xtuner that referenced this pull request Jul 31, 2024
…nternLM#8)

* use global constants

* refactor dataset map_fn

* refactor collate_fn

* fix bugs

* add mmlu collator

* add default pad_token_id for tokenizer

* use print_log
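
The commit list above names a dataset `map_fn`, a `collate_fn`, global constants, and a default `pad_token_id`, but the page does not show any of the code. As a rough illustration of how such pieces usually fit together, here is a minimal pure-Python sketch. Everything in it is an assumption made for illustration, not the PR's implementation: the names `example_map_fn`, `example_collate_fn`, `toy_tokenize`, `IGNORE_INDEX`, and `DEFAULT_PAD_TOKEN_ID`, the record fields `prompt` and `answer`, and the choice to mask prompt tokens in the labels.

```python
from typing import Dict, List

# Hypothetical global constants; the PR mentions "global constants"
# but this page does not show their names or values.
IGNORE_INDEX = -100
DEFAULT_PAD_TOKEN_ID = 0


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): one id per word, equal to the word length."""
    return [len(word) for word in text.split()]


def example_map_fn(record: Dict[str, str]) -> Dict[str, List[int]]:
    """Per-sample step: turn one raw record into token ids and labels.

    Prompt positions are masked with IGNORE_INDEX so that only the
    answer tokens would contribute to a loss.
    """
    prompt_ids = toy_tokenize(record["prompt"])
    answer_ids = toy_tokenize(record["answer"])
    return {
        "input_ids": prompt_ids + answer_ids,
        "labels": [IGNORE_INDEX] * len(prompt_ids) + answer_ids,
    }


def example_collate_fn(
    batch: List[Dict[str, List[int]]],
    pad_token_id: int = DEFAULT_PAD_TOKEN_ID,
) -> Dict[str, List[List[int]]]:
    """Per-batch step: right-pad every sample to the longest one in the batch."""
    max_len = max(len(item["input_ids"]) for item in batch)
    input_ids, labels, attention_mask = [], [], []
    for item in batch:
        n_tokens = len(item["input_ids"])
        n_pad = max_len - n_tokens
        input_ids.append(item["input_ids"] + [pad_token_id] * n_pad)
        labels.append(item["labels"] + [IGNORE_INDEX] * n_pad)
        attention_mask.append([1] * n_tokens + [0] * n_pad)
    return {
        "input_ids": input_ids,
        "labels": labels,
        "attention_mask": attention_mask,
    }


records = [
    {"prompt": "hi there", "answer": "ok"},
    {"prompt": "a", "answer": "b c d e"},
]
batch = example_collate_fn([example_map_fn(r) for r in records])
```

In a real training setup the per-sample function would typically be applied once over the dataset and the per-batch function passed to the dataloader, with the padded lists converted to tensors. Keeping the two steps separate means padding decisions stay out of the dataset itself.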
HIT-cwh added a commit to HIT-cwh/xtuner that referenced this pull request Aug 5, 2024
* support llama3.1

* fix load jsonl

* fix build_llm_model: set attn_implementation and torch_dtype
jayhenry added a commit to jayhenry/xtuner that referenced this pull request Mar 23, 2026
…mparison

- Add generate_stress_pack_config: greedy packing with uniform [200,16000] token lengths
- Add _MockDataset: satisfies JsonlDataset interface without file I/O
- Add TestStress with 3 tests:
  - test_generate_stress_pack_config: validates NPY directory output
  - test_multiprocess_getitem: 8 fork'd processes with random index sampling,
    reports init time, RSS/PSS deltas, and __getitem__ latency per rank
  - test_mmap_memory_saving: two subprocesses compare load_config RSS/PSS/elapsed
    for mmap=True (0.2MB, 0.7ms) vs mmap=False (24MB, 7.5ms)
- Updated feature_list.json: marked feature InternLM#8 as passing (8/8 complete)

Made-with: Cursor