Open
Description
Is your feature request related to a problem? Please describe.
After sending PR #677, I think the data module should be refactored. It is currently hard to use with user-defined datasets.
Decouple data_type and data_path
In line 94 at commit 51d8006:
    data_cls = data_config["dataset_name"]
dataset_name actually acts as a data type: it selects the dataset class, which then loads the data from Hugging Face (HF).
In practice, users' data is often not open source, so we have to manually change the source code here to load it.
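A minimal sketch of what the decoupling could look like. The config keys data_type and data_path, the DATASET_REGISTRY mapping, and the LocalJsonDataset class are all hypothetical names used for illustration, not the project's current API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class LocalJsonDataset:
    """Hypothetical stand-in for a dataset class such as OpenAIFormatDataset."""
    data_path: str


# Hypothetical registry: data_type selects only the format/loader class.
DATASET_REGISTRY: Dict[str, Callable[..., object]] = {
    "openai_format": LocalJsonDataset,
}


def build_dataset(data_config: dict):
    """Pick the loader from data_type and pass data_path through unchanged."""
    data_type = data_config["data_type"]
    if data_type not in DATASET_REGISTRY:
        raise ValueError(f"Unknown data_type: {data_type!r}")
    return DATASET_REGISTRY[data_type](data_path=data_config["data_path"])


# The same data_type can now point at any private file without code changes.
dataset = build_dataset(
    {"data_type": "openai_format", "data_path": "/private/train.jsonl"}
)
```

With this split, pointing the same format at a different private file is a config change only.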
Support multiple data paths
In:
    train_original_dataset = load_dataset("json", data_files=train_ds_path)["train"]
OpenAIFormatDataset.__init__ only supports a single path, not a list of data paths. This does not match real scenarios, where training data comes from different domains or teams.
A LLaMA-Factory-style dataset registry (https://github.com/hiyouga/LLaMA-Factory/blob/main/data/dataset_info.json) may be a better fit.
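As a rough sketch of multi-path support, the constructor could accept either one path or a list and normalize it first. The helper names normalize_data_paths and load_jsonl_records are hypothetical, and the loader below uses only the standard library and assumes JSON Lines files (one JSON object per line).

```python
import json
from pathlib import Path
from typing import Iterable, List, Union

PathLike = Union[str, Path]


def normalize_data_paths(paths: Union[PathLike, Iterable[PathLike]]) -> List[str]:
    """Accept a single path or an iterable of paths; always return a list of str."""
    if isinstance(paths, (str, Path)):
        return [str(paths)]
    return [str(p) for p in paths]


def load_jsonl_records(paths: Union[PathLike, Iterable[PathLike]]) -> List[dict]:
    """Concatenate records from every file, in the order the paths are given."""
    records: List[dict] = []
    for path in normalize_data_paths(paths):
        with open(path, encoding="utf-8") as f:
            # Skip blank lines; every other line is parsed as one JSON object.
            records.extend(json.loads(line) for line in f if line.strip())
    return records
```

Since data_files in load_dataset("json", ...) also accepts a list of files, the normalized list could likely be passed there directly instead of using the standard-library loader.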
Dataset constructor definition
In line 116 at commit 51d8006:
    data_config["chat_key"],
If system_key is not set for OpenAIFormatDataset, it crashes, so I had to change the code to:
data = hf_datasets.OpenAIFormatDataset(
    data_config["train_data_path"],
    data_config["val_data_path"],
    **{k: data_config[k] for k in ("chat_key", "system_key", "system_prompt") if k in data_config}
)
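The workaround above can be illustrated in isolation. This is a sketch only: OpenAIFormatDatasetStub is a hypothetical stand-in, and its parameter names and defaults (including chat_key="messages") are assumptions, since the real constructor signature may differ.

```python
from typing import Optional


class OpenAIFormatDatasetStub:
    """Hypothetical stand-in for OpenAIFormatDataset with assumed defaults."""

    def __init__(
        self,
        train_ds_path: str,
        val_ds_path: str,
        chat_key: str = "messages",           # assumed default
        system_key: Optional[str] = None,     # assumed default
        system_prompt: Optional[str] = None,  # assumed default
    ):
        self.train_ds_path = train_ds_path
        self.val_ds_path = val_ds_path
        self.chat_key = chat_key
        self.system_key = system_key
        self.system_prompt = system_prompt


OPTIONAL_KEYS = ("chat_key", "system_key", "system_prompt")


def optional_kwargs(data_config: dict, keys=OPTIONAL_KEYS) -> dict:
    """Forward only the optional keys that the user actually set."""
    return {k: data_config[k] for k in keys if k in data_config}


# system_key and system_prompt are absent, so the constructor defaults apply
# instead of a KeyError being raised on data_config["system_key"].
data_config = {
    "train_data_path": "train.jsonl",
    "val_data_path": "val.jsonl",
    "chat_key": "conversations",
}
data = OpenAIFormatDatasetStub(
    data_config["train_data_path"],
    data_config["val_data_path"],
    **optional_kwargs(data_config),
)
```

Filtering the keys this way leaves default handling to the constructor, so new optional arguments only need to be added to OPTIONAL_KEYS.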