Fix pretraining_preparation.py - previously it would die with "BuilderConfig 'train' not found. Available: ['default']" #43

Merged (1 commit) on Feb 18, 2024

Conversation

euclaise
Contributor

pretraining_preparation.py attempts to load the "train" configuration of the Hugging Face dataset, but for pile-readymade "train" is a split rather than a configuration, so load_dataset complains:

Traceback (most recent call last):
  File "/workspace/cramming/pretrain.py", line 196, in launch
    cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
  File "/workspace/cramming/cramming/utils.py", line 54, in main_launcher
    metrics = main_fn(cfg, setup)
  File "/workspace/cramming/pretrain.py", line 21, in main_training_process
    dataset, tokenizer = cramming.load_pretraining_corpus(cfg.data, cfg.impl)
  File "/workspace/cramming/cramming/data/pretraining_preparation.py", line 40, in load_pretraining_corpus
    return _load_from_hub(cfg_data, data_path)
  File "/workspace/cramming/cramming/data/pretraining_preparation.py", line 460, in _load_from_hub
    tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, "train", streaming=cfg_data.streaming, cache_dir=data_path)["train"]
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 2548, in load_dataset
    builder_instance = load_dataset_builder(
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 2257, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 371, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 592, in _create_builder_config
    raise ValueError(
ValueError: BuilderConfig 'train' not found. Available: ['default']

I've changed it to request the train split instead of a train configuration.
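For reference, a minimal sketch of the distinction, not the exact code in the repository. The second positional argument of datasets.load_dataset is the configuration name, while the split is chosen with the split keyword. The helper name load_train_split and its load_fn parameter are hypothetical and exist only so the call can be exercised without network access.

```python
from typing import Any, Callable, Optional


def load_train_split(
    hf_location: str,
    streaming: bool,
    cache_dir: Optional[str],
    load_fn: Optional[Callable[..., Any]] = None,  # hypothetical hook for testing
) -> Any:
    """Request the "train" split of a Hub dataset that only has a "default" config."""
    if load_fn is None:
        # Imported lazily so the sketch can be run without the library installed.
        import datasets

        load_fn = datasets.load_dataset

    # load_fn(hf_location, "train", ...) would pass "train" as the config `name`,
    # which fails with "BuilderConfig 'train' not found" when only 'default' exists.
    # Passing it as `split=` selects the split instead.
    return load_fn(hf_location, split="train", streaming=streaming, cache_dir=cache_dir)
```

With split given, load_dataset returns that single split directly rather than a dictionary of splits.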

@JonasGeiping
Owner

Thanks!

@JonasGeiping merged commit 1b88ce4 into JonasGeiping:main Feb 18, 2024
@shiwenqin
Contributor

Hi,
Thanks for the fix. However, when I ran the pretraining script with the updated command, the following error was raised:

Resolving data files: 100%|███████████████████| 88/88 [00:02<00:00, 43.91it/s]
Error executing job with overrides: ['name=cram_24h', 'arch=crammed-bert', 'train=bert-o4', 'data=pile-readymade', 'budget=24']
Traceback (most recent call last):
  File "/localdisk/home/Work/Repositories/cramming/pretrain.py", line 196, in launch
    cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
  File "/localdisk/home/Work/Repositories/cramming/cramming/utils.py", line 54, in main_launcher
    metrics = main_fn(cfg, setup)
  File "/localdisk/home/Work/Repositories/cramming/pretrain.py", line 21, in main_training_process
    dataset, tokenizer = cramming.load_pretraining_corpus(cfg.data, cfg.impl)
  File "/localdisk/home/Work/Repositories/cramming/cramming/data/pretraining_preparation.py", line 40, in load_pretraining_corpus
    return _load_from_hub(cfg_data, data_path)
  File "/localdisk/home/Work/Repositories/cramming/cramming/data/pretraining_preparation.py", line 461, in _load_from_hub
    tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, split="train", streaming=cfg_data.streaming, cache_dir=data_path)["train"]
  File "/home/.local/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 60, in __getitem__
    raise NotImplementedError("Subclasses of Dataset should implement __getitem__.")
NotImplementedError: Subclasses of Dataset should implement __getitem__.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Have you encountered similar issues?

Thank you
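
A likely explanation, not confirmed in this thread: when split="train" is passed, load_dataset returns the single split itself rather than a dictionary of splits, so the trailing ["train"] on the line shown in the traceback indexes the dataset object with a string. In streaming mode that object does not support such indexing, which would produce the NotImplementedError above. The sketch below shows one defensive way to handle both return shapes; select_train is a hypothetical helper, not part of the repository.

```python
from collections.abc import Mapping
from typing import Any


def select_train(loaded: Any) -> Any:
    """Return the train split from either a dict of splits or a single split."""
    # Without `split=`, load_dataset returns a dict-like object keyed by split name.
    if isinstance(loaded, Mapping):
        return loaded["train"]
    # With `split="train"`, the returned object already is the train split.
    return loaded
```

Under this reading, the simpler change would be to drop the trailing ["train"] wherever split="train" is already passed to load_dataset.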

@JonasGeiping mentioned this pull request Mar 1, 2024