diff --git a/data/datasets/README.md b/data/datasets/README.md
index 5d7100e556..3f8c0ff0a2 100644
--- a/data/datasets/README.md
+++ b/data/datasets/README.md
@@ -19,13 +19,14 @@ To see the datasets people are currently working on, please refer to
datasets
- The final version of each dataset is pushed to the
[OpenAssistant Hugging Face](https://huggingface.co/OpenAssistant)
+- All data **must** be `UTF-8` encoded to simplify training!
## **Dataset Formats**
-To simplify the training process, all datasets must be stored in one of the two
-formats:
+To simplify the training process, all datasets must be `UTF-8` encoded and
+stored in one of these two formats:
-- parquet with the option `row_group_size=100`
+- parquet with the options `row_group_size=100` and `index=False`
- jsonl or jsonl.gz
## **Dataset Types**
@@ -183,6 +184,8 @@ df = pd.read_json(...) # or any other way
df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow", index=False)
```
+Make sure the text data in the dataframe is properly encoded as `UTF-8`!
+
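+For example, you can sanity-check the encoding before writing the file (a
+minimal sketch; the column name `text` is only an assumption and may differ
+in your dataset):
+
+```python
+# Raises UnicodeEncodeError if any value in the hypothetical `text`
+# column cannot be encoded as strict UTF-8
+df["text"].str.encode("utf-8", errors="strict")
+```
+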
#### 2. Install Hugging Face Hub
```bash
diff --git a/docs/docs/data/datasets.md b/docs/docs/data/datasets.md
index 2c01b5b6f3..64347f33d6 100644
--- a/docs/docs/data/datasets.md
+++ b/docs/docs/data/datasets.md
@@ -7,6 +7,8 @@ github repository aims to provide a diverse and accessible collection of
datasets that can be used to train OpenAssistant models.
Our goal is to
cover a wide range of topics, languages and tasks.
+To simplify the training process, all data must be `UTF-8` encoded.
+
### **Current Progress**
To see the datasets people are currently working on, please refer to
@@ -26,8 +28,8 @@ To see the datasets people are currently working on, please refer to
## **Dataset Formats**
To simplify the training process, all datasets must be stored as Parquet files
-with the option `row_group_size=100`. There are two types of datasets
-accepted: instruction and text-only.
+with the options `row_group_size=100` and `index=False`. There are two types
+of datasets accepted: instruction and text-only.
### **Instruction format**
@@ -92,6 +94,8 @@ df = pd.read_json(...) # or any other way
df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow", index=False)
```
+Make sure the text data in the dataframe is properly encoded as `UTF-8`!
+
#### 2. Install Hugging Face Hub
```bash