[suggest] please refactor data module #688

@tpoisonooo

Is your feature request related to a problem? Please describe.

After sending PR #677, I think the data module should be refactored. It is currently hard to use with user-defined datasets.

Decouple data_type and data_path

In the line

data_cls = data_config["dataset_name"]

dataset_name is actually used as the data type, and the data itself is then loaded from HF.

In practice, users' data is usually not open-source, so we have to manually change the source code here.
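To make the suggestion concrete, here is a minimal sketch of what a decoupled config could look like. The keys `data_type` and `data_path`, the `DATASET_TYPES` registry, and the `build_openai_format` stand-in are all hypothetical names for illustration; none of them exist in the current code.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a data_type string to a dataset builder.
DATASET_TYPES: Dict[str, Callable[..., dict]] = {}


def register(data_type: str):
    """Decorator that registers a builder under the given data_type."""
    def wrap(fn):
        DATASET_TYPES[data_type] = fn
        return fn
    return wrap


@register("openai_format")
def build_openai_format(data_path, **options):
    # Stand-in for a real dataset class; it only records what it was given.
    return {"type": "openai_format", "path": data_path, "options": options}


def build_dataset(data_config: dict):
    """Pick the builder from data_type and pass data_path through unchanged."""
    cfg = dict(data_config)  # copy so the caller's config is not mutated
    data_type = cfg.pop("data_type")
    data_path = cfg.pop("data_path")
    if data_type not in DATASET_TYPES:
        raise ValueError(
            f"unknown data_type {data_type!r}; known: {sorted(DATASET_TYPES)}"
        )
    return DATASET_TYPES[data_type](data_path, **cfg)
```

With something like this, a private dataset would only need a config entry such as `{"data_type": "openai_format", "data_path": "/data/private/train.jsonl"}` and no source change.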

Support multiple data path

In

train_original_dataset = load_dataset("json", data_files=train_ds_path)["train"]

OpenAIFormatDataset.__init__ only supports a single path, not a list of data paths. This does not match real scenarios, where training data comes from different domains or teams.

A LLaMA-Factory-style dataset registry (https://github.com/hiyouga/LLaMA-Factory/blob/main/data/dataset_info.json) may be a better approach.
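As a rough sketch of the idea, the helpers below accept either a single path or a list of paths and resolve dataset names through a small registry dict. The function names and the registry layout (a `file_name` field per entry) are assumptions loosely modelled on the linked file, not an existing API in this project.

```python
from typing import Dict, List, Sequence, Union

PathSpec = Union[str, Sequence[str]]


def normalize_paths(spec: PathSpec) -> List[str]:
    """Turn a single path or a sequence of paths into a non-empty list."""
    if isinstance(spec, str):
        return [spec]
    paths = list(spec)
    if not paths:
        raise ValueError("at least one data path is required")
    return paths


def resolve_paths(names: Sequence[str], dataset_info: Dict[str, dict]) -> List[str]:
    """Look up each dataset name in the registry and collect its files in order."""
    paths: List[str] = []
    for name in names:
        paths.extend(normalize_paths(dataset_info[name]["file_name"]))
    return paths
```

The resulting list could then be passed as `data_files=` to `load_dataset("json", ...)`, which accepts a list of files as well as a single path.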

Dataset constructor definition

In

data_config["chat_key"],

the list of keys differs from dataset to dataset.

If you do not set system_key for OpenAIFormatDataset, it crashes, so I had to change the code to pass only the keys that are present in the config:

        data = hf_datasets.OpenAIFormatDataset(
            data_config["train_data_path"],
            data_config["val_data_path"],
            **{k: data_config[k] for k in ("chat_key", "system_key", "system_prompt") if k in data_config}
        )
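An alternative to filtering keys at the call site would be to give the constructor defaults for the optional keys. The class below is a hypothetical stand-in, not the real OpenAIFormatDataset; the default values (`chat_key="messages"`, `system_key=None`) are assumptions for illustration only.

```python
from typing import Optional


class OpenAIFormatDatasetSketch:
    """Hypothetical stand-in showing optional keys with defaults."""

    def __init__(
        self,
        train_data_path,
        val_data_path=None,
        chat_key: str = "messages",          # assumed default
        system_key: Optional[str] = None,    # optional: no per-example system field
        system_prompt: Optional[str] = None, # optional: global fallback prompt
    ):
        self.train_data_path = train_data_path
        self.val_data_path = val_data_path
        self.chat_key = chat_key
        self.system_key = system_key
        self.system_prompt = system_prompt

    def system_text(self, example: dict) -> Optional[str]:
        """Prefer the per-example system field, else fall back to system_prompt."""
        if self.system_key is not None and self.system_key in example:
            return example[self.system_key]
        return self.system_prompt
```

With defaults like these, a config that omits system_key would construct the dataset without raising, and the call site would not need the dict comprehension.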
