Skip to content

9. Training datasets

Willza edited this page Jun 10, 2023 · 11 revisions

OpenAssistant conversations dataset (OASST1)

Resource link: https://huggingface.co/datasets/OpenAssistant/oasst1

# Example code
import pandas as pd
from datasets import load_dataset
ds = load_dataset("OpenAssistant/oasst1")
df_train = ds['train'].to_pandas()     # len(train)=84437 (95%)
df_validation = ds['validation'].to_pandas()    # len(val)=4401 (5%)

Wikitext

Resource link: https://huggingface.co/datasets/wikitext

# Example code
from datasets import load_dataset
ds = load_dataset('wikitext', 'wikitext-2-v1')
df = ds['train'].to_pandas()

Base model training dataset - "SlimPajama"

Resource link: https://huggingface.co/datasets/cerebras/SlimPajama-627B
Note: this dataset is large

from datasets import load_dataset
ds = load_dataset("cerebras/SlimPajama-627B")