
feat: add multiprocessing support for dataset loading and processing #48

Merged
Ki-Seki merged 1 commit into main from feat/gim-sft-dataset on Sep 24, 2025



@Ki-Seki Ki-Seki commented Sep 24, 2025

No description provided.

Copilot AI review requested due to automatic review settings September 24, 2025 09:05

codecov bot commented Sep 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.



Copilot AI left a comment


Pull Request Overview

This PR adds multiprocessing support to dataset loading and processing operations to improve performance. The changes enable parallel processing across multiple CPU cores when loading datasets from Hugging Face and applying transformations.

  • Adds num_proc=os.cpu_count() parameter to load_dataset, map, and filter operations
  • Updates dataset output structure to organize files by dataset name and subset
  • Removes old upload script and adds new one for the restructured dataset format
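
As a rough illustration of the first bullet, the pattern could look like the sketch below. It is not taken from the PR diff: `default_num_proc`, `has_text`, `add_length`, `load_and_process`, and the `"text"` column are hypothetical names chosen for the example. The only library calls assumed are `load_dataset`, `Dataset.filter`, and `Dataset.map` from Hugging Face `datasets`, each of which accepts a `num_proc` argument. Because `os.cpu_count()` can return `None`, the sketch falls back to a single worker in that case.

```python
import os
from typing import Optional


def default_num_proc(limit: Optional[int] = None) -> int:
    """Hypothetical helper: pick a worker count that is safe to pass as num_proc."""
    n = os.cpu_count() or 1  # os.cpu_count() may return None
    if limit is not None:
        n = min(n, limit)
    return max(1, n)


# Module-level functions (rather than lambdas) keep the callables easy to
# serialize when work is sent to worker processes.
def has_text(example):
    # Assumes a "text" column; the real scripts may use other fields.
    return len(example["text"]) > 0


def add_length(example):
    return {"n_chars": len(example["text"])}


def load_and_process(path, split="train", num_proc=None):
    """Load a dataset and run filter/map across several processes."""
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    num_proc = num_proc or default_num_proc()
    ds = load_dataset(path, split=split, num_proc=num_proc)
    ds = ds.filter(has_text, num_proc=num_proc)
    ds = ds.map(add_length, num_proc=num_proc)
    return ds
```

On shared machines or inside containers, `os.cpu_count()` reports the host's logical cores rather than the cores available to the process, so a cap such as `default_num_proc(limit=8)` may be a reasonable safeguard.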

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| dataset/utils.py | Updates save_dataset function to support nested directory structure and configurable dataset names |
| dataset/upload_masked_io.py | Removes old dataset upload script |
| dataset/upload_gim_sft.py | Adds new upload script for restructured GIM-SFT dataset format |
| dataset/mask_*.py (13 files) | Adds os import and multiprocessing support to dataset operations |
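
The nested output layout described for `dataset/utils.py` might be built along the lines of the sketch below. The actual signature of `save_dataset` is not shown in this conversation, so `build_output_path`, its parameters, the `.jsonl` suffix, and the `root/<dataset>/<subset>/<split>` ordering are all assumptions made for illustration.

```python
from pathlib import Path


def build_output_path(root, dataset_name, subset, split, suffix=".jsonl"):
    """Hypothetical helper: return root/<dataset_name>/<subset>/<split><suffix>."""
    # Hub ids such as "openai/gsm8k" contain a slash; flatten it so each
    # dataset occupies exactly one directory level (an assumed convention).
    safe_name = dataset_name.replace("/", "__")
    return Path(root) / safe_name / subset / f"{split}{suffix}"


def ensure_parent(path):
    """Create the parent directories of `path` if they do not exist yet."""
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

Calling `ensure_parent(build_output_path("out", "openai/gsm8k", "main", "train"))` would create `out/openai__gsm8k/main/` and return the file path to write to.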


@Ki-Seki Ki-Seki merged commit ad2b347 into main Sep 24, 2025
6 checks passed
@Ki-Seki Ki-Seki deleted the feat/gim-sft-dataset branch September 24, 2025 09:09
