[suggest] please refactor data module

**Is your feature request related to a problem? Please describe.**

After submitting PR https://github.com/NVIDIA-NeMo/RL/pull/677, I think the `data` module should be refactored: it is currently hard to use with user-defined datasets.

## Decouple `data_type` and `data_path`

In https://github.com/NVIDIA-NeMo/RL/blob/51d8006ea1605ad705c5454c66d6987548fb4518/examples/run_sft.py#L94, `dataset_name` is effectively a data type rather than a name, and the data is then loaded from HF.

In practice, users' data is often not open source, so we have to modify the source code here by hand.
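To make the request concrete, here is a minimal sketch of what decoupling could look like. Every name in it (`DataSpec`, `LOADERS`, `register`, `load_dataset_from_spec`) is hypothetical and does not exist in NeMo RL. The idea is only that `data_type` picks a loader while the paths are free-form and can point at private local files.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical registry mapping a data_type string to a loader callable.
# None of these names exist in NeMo RL; they only illustrate the proposal.
LOADERS: Dict[str, Callable[..., dict]] = {}


def register(data_type: str):
    """Decorator that registers a loader under the given data_type."""
    def decorator(fn):
        LOADERS[data_type] = fn
        return fn
    return decorator


@dataclass
class DataSpec:
    data_type: str                      # format of the data, e.g. "openai_format"
    train_data_path: str                # where the data lives (can be a private file)
    val_data_path: Optional[str] = None


@register("openai_format")
def load_openai_format(train_data_path, val_data_path=None):
    # Stand-in loader: a real one would read and parse the files.
    return {"format": "openai_format", "train": train_data_path, "val": val_data_path}


def load_dataset_from_spec(spec: DataSpec) -> dict:
    """Dispatch on data_type only; the paths are passed through untouched."""
    if spec.data_type not in LOADERS:
        raise ValueError(
            f"Unknown data_type {spec.data_type!r}; known types: {sorted(LOADERS)}"
        )
    return LOADERS[spec.data_type](spec.train_data_path, spec.val_data_path)
```

With something like this, a private dataset would only need a config entry (`data_type` plus paths), with no edit to `run_sft.py`.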

## Support multiple data paths

In https://github.com/NVIDIA-NeMo/RL/blob/51d8006ea1605ad705c5454c66d6987548fb4518/nemo_rl/data/hf_datasets/oai_format_dataset.py#L51, `OpenAIFormatDataset.__init__` only supports a single path, not a list of data paths. This does not match real-world usage, where training data comes from different domains or teams.

A LLaMA-Factory-style dataset registry (https://github.com/hiyouga/LLaMA-Factory/blob/main/data/dataset_info.json) may be a better fit.
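As a rough illustration of the multi-path behavior, the sketch below accepts either a single path or a list of paths and concatenates the records. It uses only the standard library and assumes JSONL input (one JSON object per line). The helper names `as_path_list` and `load_jsonl` are made up for this example and are not part of NeMo RL.

```python
import json
from pathlib import Path
from typing import List, Sequence, Union

PathLike = Union[str, Path]


def as_path_list(paths: Union[PathLike, Sequence[PathLike]]) -> List[Path]:
    """Accept a single path or a sequence of paths; always return a list."""
    if isinstance(paths, (str, Path)):
        return [Path(paths)]
    return [Path(p) for p in paths]


def load_jsonl(paths: Union[PathLike, Sequence[PathLike]]) -> List[dict]:
    """Read every JSONL file in the given order and concatenate the records."""
    records: List[dict] = []
    for path in as_path_list(paths):
        with path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    records.append(json.loads(line))
    return records
```

Existing single-path configs would keep working, because a plain string is wrapped into a one-element list.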

## Dataset constructor definition

In https://github.com/NVIDIA-NeMo/RL/blob/51d8006ea1605ad705c5454c66d6987548fb4518/examples/run_sft.py#L116, the list of config keys passed to the constructor differs from dataset to dataset.

If you do not set `system_key` for `OpenAIFormatDataset`, it crashes, so I had to change the code:

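```python
# Pass only the optional keys that are actually present in the config,
# so a missing key (e.g. system_key) no longer causes a crash.
data = hf_datasets.OpenAIFormatDataset(
    data_config["train_data_path"],
    data_config["val_data_path"],
    **{k: data_config[k] for k in ("chat_key", "system_key", "system_prompt") if k in data_config}
)
```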
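A more general variant of the same workaround, as a sketch only: instead of hard-coding the key tuple per dataset, filter the config against the constructor's signature with `inspect.signature`. The helper `accepted_kwargs` and the stand-in class `FakeOpenAIFormatDataset` (including its parameter names and defaults) are assumptions made for illustration and do not reflect the real NeMo RL class.

```python
import inspect
from typing import Any, Callable, Dict, Mapping


def accepted_kwargs(fn: Callable[..., Any], config: Mapping[str, Any]) -> Dict[str, Any]:
    """Return the entries of config whose keys fn accepts as keyword arguments."""
    params = inspect.signature(fn).parameters
    # If fn declares **kwargs, every key is acceptable.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(config)
    allowed = {
        name
        for name, p in params.items()
        if p.kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        )
    }
    return {k: v for k, v in config.items() if k in allowed}


class FakeOpenAIFormatDataset:
    """Stand-in with hypothetical parameter names and defaults."""

    def __init__(self, train_path, val_path, chat_key="messages",
                 system_key=None, system_prompt=None):
        self.train_path = train_path
        self.val_path = val_path
        self.chat_key = chat_key
        self.system_key = system_key
        self.system_prompt = system_prompt


# Example config that omits system_key and system_prompt.
data_config = {
    "dataset_name": "openai_format",
    "train_data_path": "train.jsonl",
    "val_data_path": "val.jsonl",
    "chat_key": "conversations",
}
extra = accepted_kwargs(FakeOpenAIFormatDataset, data_config)
data = FakeOpenAIFormatDataset(
    data_config["train_data_path"], data_config["val_data_path"], **extra
)
```

Here only `chat_key` is forwarded, and the omitted keys fall back to the constructor defaults. The trade-off is that typos in config keys are silently dropped, so an explicit per-dataset schema may still be preferable.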




