diff --git a/data/datasets/README.md b/data/datasets/README.md
index 5d7100e556..3f8c0ff0a2 100644
--- a/data/datasets/README.md
+++ b/data/datasets/README.md
@@ -19,13 +19,14 @@ To see the datasets people are currently working on, please refer to
 datasets
 - The final version of each dataset is pushed to the
   [OpenAssisstant Hugging Face](https://huggingface.co/OpenAssistant)
+- All data **must** be `UTF-8` encoded to simplify training!
 
 ## **Dataset Formats**
 
-To simplify the training process, all datasets must be stored in one of the two
-formats:
+To simplify the training process, all datasets must be `UTF-8` encoded and
+stored in one of these two formats:
 
-- parquet with the option `row_group_size=100`
+- parquet with the option `row_group_size=100` and `index=False`
 - jsonl or jsonl.gz
 
 ## **Dataset Types**
@@ -183,6 +184,8 @@ df = pd.read_json(...)  # or any other way
 df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow", index=False)
 ```
 
+Make sure the text data in the dataframe is properly encoded as `UTF-8`!
+
 #### 2. Install Hugging Face Hub
 
 ```bash
diff --git a/docs/docs/data/datasets.md b/docs/docs/data/datasets.md
index 2c01b5b6f3..64347f33d6 100644
--- a/docs/docs/data/datasets.md
+++ b/docs/docs/data/datasets.md
@@ -7,6 +7,8 @@ github repository aims to provide a diverse and accessible collection of
 datasets that can be used to train OpenAssistant models.
 Our goal is to cover a wide range of topics, languages and tasks.
 
+To simplify the training process, all data must be `UTF-8` encoded.
+
 ### **Current Progress**
 
 To see the datasets people are currently working on, please refer to
@@ -26,8 +28,8 @@
 
 ## **Dataset Formats**
 
 To simplify the training process, all datasets must be stored as Parquet files
-with the option `row_group_size=100`.
There are two types of datasets
-accepted: instruction and text-only.
+with the option `row_group_size=100` and `index=False`.
There are two types
+of datasets accepted: instruction and text-only.
 
 ### **Instruction format**
@@ -92,6 +94,8 @@ df = pd.read_json(...)  # or any other way
 df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow", index=False)
 ```
 
+Make sure the text data in the dataframe is properly encoded as `UTF-8`!
+
 #### 2. Install Hugging Face Hub
 
 ```bash
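As a companion to the requirement this diff adds, here is a minimal sketch of how a contributor might verify that a dataframe's text columns are valid `UTF-8` before writing the Parquet file with the mandated options. The column names (`INSTRUCTION`, `RESPONSE`) and the sample rows are illustrative assumptions, not taken from the diff; only the `to_parquet` call mirrors the documented one.

```python
# Minimal sketch, not part of the diff: check UTF-8 encodability, then write
# Parquet with the options required above (row_group_size=100, index=False).
# Assumes pandas and pyarrow are installed; column names are hypothetical.
import pandas as pd

df = pd.DataFrame(
    {
        "INSTRUCTION": ["What is the capital of France?"],
        "RESPONSE": ["The capital of France is Paris."],
    }
)

# Python strings are Unicode; encoding with errors="strict" raises
# UnicodeEncodeError for any value that cannot be serialized as UTF-8
# (e.g. lone surrogates that occasionally survive web scraping).
for column in df.select_dtypes(include="object").columns:
    df[column].astype(str).str.encode("utf-8", errors="strict")

df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow", index=False)
```

Since Parquet stores strings as UTF-8 internally, pyarrow should reject the same values anyway; running the check up front simply surfaces the offending column before the dataset is pushed to the Hugging Face hub.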