Datasets README improvements #1741

Merged 1 commit on Feb 19, 2023
93 changes: 72 additions & 21 deletions openassistant/datasets/README.md
@@ -1,10 +1,63 @@
# **Datasets**
## **Overview**

This repository aims to provide a diverse and accessible collection of datasets
that can be used to train OpenAssistant models.<br/> Our goal is to cover a wide
range of topics, languages and tasks.

### **Current Progress**

To see the datasets people are currently working on, please refer to
**[the spreadsheet](https://docs.google.com/spreadsheets/d/1NYYa6vHiRnk5kwnyYaCT0cBO62--Tm3w4ihdBtp4ISk)**.

### **Repository Structure**

- Each dataset is organized into its own folder, which may include notebooks,
processing scripts, markdown files and other materials that explain the
dataset creation process
- The dataset files themselves are stored on Hugging Face
- The root `__init__.py` lists the dataset names and corresponding Hugging Face
datasets
- The final version of each dataset is pushed to the
  [OpenAssistant Hugging Face](https://huggingface.co/OpenAssistant)

## **Dataset Formats**

To simplify the training process, all datasets must be stored as Parquet files
with the option `row_group_size=100`.<br/> Two types of datasets are
accepted: instruction and text-only.

### **Instruction format**

Instruction datasets are designed to align language models with human
interactions. These can take the form of question-answer, request-response,
task-solution pairs, and so on. The instruction dataset must include the
following columns:

1. **INSTRUCTION** (string): Instruction text
2. **RESPONSE** (string): Expected response to the instruction
3. **SOURCE** (string): Original data source short name, e.g. "wikipedia"
4. **METADATA** (JSON string, optional): Any other useful information stored in
JSON<br/> For example, NSFW content can be marked as `{"nsfw": true}`
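
For illustration, a hypothetical row in this format might look like the
following (all values are invented examples, not taken from a real dataset):

```python
# a hypothetical instruction-format row; all values are invented examples
row = {
    "INSTRUCTION": "What is the capital of France?",
    "RESPONSE": "The capital of France is Paris.",
    "SOURCE": "wikipedia",
    "METADATA": '{"nsfw": false}',  # optional JSON string
}
```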

### **Text-only format**

For datasets that do not fit the instruction format, the text-only format is
available. The text-only dataset must include the following columns:

1. **TEXT** (string)
2. **SOURCE** (string)
3. **METADATA** (JSON string, optional)
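
Similarly, a hypothetical text-only row (values invented for illustration):

```python
# a hypothetical text-only row; all values are invented examples
row = {
    "TEXT": "Paris is the capital and most populous city of France.",
    "SOURCE": "wikipedia",
    "METADATA": '{"language": "en"}',  # optional JSON string
}
```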

## **Dataset Requirements**

The dataset must adhere to the following requirements:

- Must have a permissive license
- Must not contain child sexual abuse materials
- Must not contain materials with private individuals' personal information
(e.g. name, address, phone number, government ID, or medical information)

## **How to Contribute**

To add a new dataset to OpenAssistant, follow these steps:

@@ -20,11 +73,11 @@
link the issue in the pull request description. For more information, see
[below](#making-a-pull-request).

### **Creating a Dataset on Hugging Face**

To create a new dataset on Hugging Face, follow these steps:

#### 1. Convert your dataset file(s) to the Parquet format using the [pandas](https://pandas.pydata.org/) and [pyarrow](https://pypi.org/project/pyarrow/) libraries:

```python
import pandas as pd

# hypothetical input file; adjust the reader to your source format
df = pd.read_json("dataset.json")
df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow")
```

@@ -53,7 +106,7 @@ login:
```
huggingface-cli login
```

- in Jupyter notebook (currently does not work in
- in Jupyter notebook (currently does not work in
[Visual Studio Code](https://github.com/huggingface/huggingface_hub/issues/752))

```python
# minimal sketch, assuming huggingface_hub's notebook_login helper
from huggingface_hub import notebook_login

notebook_login()
```

@@ -69,13 +122,13 @@

```python
from datasets import Dataset  # import assumed; collapsed in the diff

ds = Dataset.from_parquet("dataset.parquet")
ds.push_to_hub("your_huggingface_name/dataset_name")
```

#### 5. Update the Hugging Face `README.md` file

Update the `README.md` file of your dataset by visiting
https://huggingface.co/datasets/your_huggingface_name/dataset_name/edit/main/README.md
(substituting your Hugging Face name and dataset name in the URL)

### **Making a Pull Request**

#### 1. Fork this repository

@@ -84,18 +137,16 @@
#### 3. Add your dataset to the repository

- Create a folder with the name of your dataset.
- Add files that describe your dataset and its creation, such as a README,
notebooks, scrapers, etc.
- Add your dataset to the parent `__init__.py`, for example:

```python
INSTRUCTION_DATASETS = {
    ...,
    "dataset_name": "your_huggingface_name/dataset_name"
}
```
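
As a rough sketch of how these mappings can be consumed (assuming the Hugging
Face `datasets` library; the loop is illustrative, not part of the repository):

```python
from datasets import load_dataset

from openassistant.datasets import INSTRUCTION_DATASETS

# illustrative only: load each registered instruction dataset from Hugging Face
for name, hf_path in INSTRUCTION_DATASETS.items():
    ds = load_dataset(hf_path)
    print(name, ds)
```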

#### 4. Stage your changes and run the pre-commit hook

2 changes: 2 additions & 0 deletions openassistant/datasets/__init__.py
@@ -1 +1,3 @@
TEXT_DATASETS = {}

INSTRUCTION_DATASETS = {"grade-school-math-instructions": "qwedsacf/grade-school-math-instructions"}