Datasets README improvements #1741

Merged 1 commit on Feb 19, 2023
93 changes: 72 additions & 21 deletions openassistant/datasets/README.md
@@ -1,10 +1,63 @@
# **Datasets**
## **Overview**

This repository aims to provide a diverse and accessible collection of datasets
that can be used to train OpenAssistant models.<br/> Our goal is to cover a wide
range of topics, languages and tasks.

### **Current Progress**

To see the datasets people are currently working on, please refer to
**[the spreadsheet](https://docs.google.com/spreadsheets/d/1NYYa6vHiRnk5kwnyYaCT0cBO62--Tm3w4ihdBtp4ISk)**.

### **Repository Structure**

- Each dataset is organized into its own folder, which may include notebooks,
processing scripts, markdown files and other materials that explain the
dataset creation process
- The dataset files themselves are stored on Hugging Face
- The root `__init__.py` lists the dataset names and corresponding Hugging Face
datasets
- The final version of each dataset is pushed to the
  [OpenAssistant Hugging Face](https://huggingface.co/OpenAssistant)

## **Dataset Formats**

To simplify the training process, all datasets must be stored as Parquet files
with the option `row_group_size=100`.<br/> Two types of datasets are
accepted: instruction and text-only.

### **Instruction format**

Instruction datasets are designed to align language models with human
interactions. These can take the form of question-answer, request-response,
task-solution pairs, and so on. The instruction dataset must include the
following columns:

1. **INSTRUCTION** (string): Instruction text
2. **RESPONSE** (string): Expected response to the instruction
3. **SOURCE** (string): Original data source short name, e.g. "wikipedia"
4. **METADATA** (JSON string, optional): Any other useful information stored in
JSON<br/> For example, NSFW content can be marked as `{"nsfw": true}`
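
For illustration, a hypothetical row in this format might look like the
following (all values are invented examples, not taken from a real dataset):

```python
# a hypothetical instruction-format row; all values are invented examples
row = {
    "INSTRUCTION": "What is the capital of France?",
    "RESPONSE": "The capital of France is Paris.",
    "SOURCE": "wikipedia",
    "METADATA": '{"nsfw": false}',  # optional JSON string
}
```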

### **Text-only format**

For datasets that do not fit the instruction format, the text-only format is
available. The text-only dataset must include the following columns:

1. **TEXT** (string)
2. **SOURCE** (string)
3. **METADATA** (JSON string, optional)
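
Similarly, a hypothetical text-only row (values invented for illustration):

```python
# a hypothetical text-only row; all values are invented examples
row = {
    "TEXT": "Paris is the capital and most populous city of France.",
    "SOURCE": "wikipedia",
    "METADATA": '{"language": "en"}',  # optional JSON string
}
```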

## **Dataset Requirements**

The dataset must adhere to the following requirements:

- Must have a permissive license
- Must not contain child sexual abuse materials
- Must not contain materials with private individuals' personal information
(e.g. name, address, phone number, government ID, or medical information)

## **How to Contribute**

To add a new dataset to OpenAssistant, follow these steps:

@@ -20,11 +73,11 @@
link the issue in the pull request description. For more information, see
[below](#making-a-pull-request).

### **Creating a Dataset on Hugging Face**

To create a new dataset on Hugging Face, follow these steps:

#### 1. Convert your dataset file(s) to the Parquet format using the [pandas](https://pandas.pydata.org/) and [pyarrow](https://pypi.org/project/pyarrow/) libraries:

```python
import pandas as pd

# hypothetical input file; adjust the reader to your source format
df = pd.read_json("dataset.json")
df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow")
```

@@ -53,7 +106,7 @@ login:
```
huggingface-cli login
```

- in Jupyter notebook (currently does not work in
- in Jupyter notebook (currently does not work in
[Visual Studio Code](https://github.com/huggingface/huggingface_hub/issues/752))

```python
# minimal sketch, assuming huggingface_hub's notebook_login helper
from huggingface_hub import notebook_login

notebook_login()
```

@@ -69,13 +122,13 @@

```python
from datasets import Dataset  # import assumed; collapsed in the diff

ds = Dataset.from_parquet("dataset.parquet")
ds.push_to_hub("your_huggingface_name/dataset_name")
```

#### 5. Update the Hugging Face `README.md` file

Update the `README.md` file of your dataset by visiting
https://huggingface.co/datasets/your_huggingface_name/dataset_name/edit/main/README.md
(substituting your Hugging Face name and dataset name in the URL)

### **Making a Pull Request**

#### 1. Fork this repository

@@ -84,18 +137,16 @@
#### 3. Add your dataset to the repository

- Create a folder with the name of your dataset.
- Add files that describe your dataset and its creation, such as a README,
notebooks, scrapers, etc.
- Add your dataset to the parent `__init__.py`, for example:

```python
INSTRUCTION_DATASETS = {
    ...,
    "dataset_name": "your_huggingface_name/dataset_name"
}
```
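
As a rough sketch of how these mappings can be consumed (assuming the Hugging
Face `datasets` library; the loop is illustrative, not part of the repository):

```python
from datasets import load_dataset

from openassistant.datasets import INSTRUCTION_DATASETS

# illustrative only: load each registered instruction dataset from Hugging Face
for name, hf_path in INSTRUCTION_DATASETS.items():
    ds = load_dataset(hf_path)
    print(name, ds)
```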

#### 4. Stage your changes and run the pre-commit hook

2 changes: 2 additions & 0 deletions openassistant/datasets/__init__.py
@@ -1 +1,3 @@
TEXT_DATASETS = {}

INSTRUCTION_DATASETS = {"grade-school-math-instructions": "qwedsacf/grade-school-math-instructions"}