---
dataset_info:
  features:
    - name: input_ids
      sequence: int32
  splits:
    - name: train
      num_bytes: 22274051772
      num_examples: 43166767
  download_size: 12187746609
  dataset_size: 22274051772
annotations_creators:
  - no-annotation
language_creators:
  - found
language:
  - en
license:
  - other
multilinguality:
  - monolingual
pretty_name: pretokenized,filtered,sorted subset of the Pile
size_categories:
  - 10B<n<100B
source_datasets:
  - the-pile
task_categories:
  - text-generation
  - fill-mask
task_ids:
  - language-modeling
  - masked-language-modeling
paperswithcode_id: the-pile-cramming
---

# Dataset Card for "the_pile_WordPiecex32768_97b8e776baafb99c3892e6572a9f51b3"

## Dataset Description

### Dataset Summary

This is a preprocessed, tokenized dataset for the cramming project.

Use only with the tokenizer uploaded here. This version is 97b8e776baafb99c3892e6572a9f51b3, which corresponds to a specific dataset construction setup, described below. The raw data source is the Pile, an 825 GiB diverse, open-source language modelling dataset that consists of 22 smaller, high-quality datasets combined together.
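
As a minimal usage sketch (the Hub identifier below is inferred from the dataset name and the tokenizer location is a placeholder; neither is stated in this card, so adjust both to where the data and tokenizer actually live):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed identifiers for illustration only; replace with the actual locations.
DATASET_REPO = "JonasGeiping/the_pile_WordPiecex32768_97b8e776baafb99c3892e6572a9f51b3"
TOKENIZER_PATH = "path/to/the/matching/WordPiece/tokenizer"

dataset = load_dataset(DATASET_REPO, split="train", streaming=True)  # only a train split exists
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)            # must be the matching tokenizer

example = next(iter(dataset))
print(len(example["input_ids"]))               # fixed-length sequences of 128 token ids
print(tokenizer.decode(example["input_ids"]))  # text is lowercased and accent-stripped by construction
```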

### Languages

This dataset is in English (EN).

### Data Splits

This preprocessed subset contains only a train split.
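
The split layout can be checked without downloading the data by inspecting the builder metadata (same assumed Hub identifier as in the sketch above):

```python
from datasets import load_dataset_builder

builder = load_dataset_builder(
    "JonasGeiping/the_pile_WordPiecex32768_97b8e776baafb99c3892e6572a9f51b3"  # assumed Hub ID
)
print(builder.info.splits)  # expect a single "train" split with roughly 43.2M examples
```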

## Dataset Creation

The configuration used to create this dataset with the cramming project code (https://github.com/JonasGeiping/cramming) is:

```yaml
# This is a slice of the pile, loaded from a local source
name: the_pile
defaults:
  - sources:
      - the_pile

#
# Preprocessing
normalizer:
  force_lowercase: True
  strip_accents: True
  force_english_keyboard: True
  whitespace_escape: False
tokenizer: WordPiece
vocab_size: 32768

# Dataset Formation
seq_length: 128
include_cls_token_in_corpus: False
include_sep_token_in_corpus: True
use_type_ids: False
max_entries_in_raw_dataset: 16e6 # About 40 mio seqs of length 128
max_seq_in_tokenized_dataset: 85e6 # Select only this many tokenized sequences.
# max_seq_in_tokenized_dataset should be just slightly more than budget * 60 * 60 * expected tokens/sec for the single epoch of training

# Data Cleaning:
named_entity_simplification: False
remove_whitespaces: False
remove_trash: True
trash_cutoff: 0.25
deduplicate_entries: False
deduplication_threshold: 75

# Data Order:
ordering: sentence-length-curriculum # could be a curriculum
```
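
For intuition only, the following is a rough sketch of what `seq_length: 128` combined with `include_sep_token_in_corpus: True` and `include_cls_token_in_corpus: False` implies for how documents are packed into sequences; it is an illustrative assumption, not the cramming implementation:

```python
# Conceptual packing sketch: concatenate tokenized documents, each followed by a SEP
# token (no CLS in the corpus), and cut the stream into non-overlapping 128-token chunks.
from typing import Iterable, List

def pack_into_sequences(
    token_streams: Iterable[List[int]],
    sep_id: int,
    seq_length: int = 128,
) -> List[List[int]]:
    """Return fixed-length sequences; a trailing remainder shorter than seq_length is dropped."""
    buffer: List[int] = []
    sequences: List[List[int]] = []
    for tokens in token_streams:
        buffer.extend(tokens + [sep_id])   # one SEP marks each document boundary
        while len(buffer) >= seq_length:
            sequences.append(buffer[:seq_length])
            buffer = buffer[seq_length:]
    return sequences
```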

## Considerations for Using the Data

**Limitations and bias:** This training data was filtered and sorted beyond the standard Pile preprocessing. These modifications were not tested for unintended consequences.

## Additional Information

### Dataset Curators

This dataset is a filtered, sorted, and preprocessed subset of the Pile made by Jonas Geiping. The original dataset was primarily curated by Leo Gao and Stella Biderman, with assistance from other authors of the Pile paper.

### Licensing Information

Please refer to the specific license, which depends on the subset you use, at https://huggingface.co/datasets/EleutherAI/pile.

### Citation Information

```bibtex
@article{gao2020pile,
  title={The {P}ile: An 800{GB} dataset of diverse text for language modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}

@article{biderman2022datasheet,
  title={Datasheet for the pile},
  author={Biderman, Stella and Bicheno, Kieran and Gao, Leo},
  journal={arXiv preprint arXiv:2201.07311},
  year={2022}
}
```