From PR 43 #44

Closed
JonasGeiping opened this issue Mar 1, 2024 · 5 comments

Comments

@JonasGeiping
Owner

JonasGeiping commented Mar 1, 2024

Thanks for the fix. However, when I run the pretraining script with the updated command, the following error is raised:

Resolving data files: 100%|███████████████████| 88/88 [00:02<00:00, 43.91it/s]
Error executing job with overrides: ['name=cram_24h', 'arch=crammed-bert', 'train=bert-o4', 'data=pile-readymade', 'budget=24']
Traceback (most recent call last):
  File "/localdisk/home/Work/Repositories/cramming/pretrain.py", line 196, in launch
    cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
  File "/localdisk/home/Work/Repositories/cramming/cramming/utils.py", line 54, in main_launcher
    metrics = main_fn(cfg, setup)
  File "/localdisk/home/Work/Repositories/cramming/pretrain.py", line 21, in main_training_process
    dataset, tokenizer = cramming.load_pretraining_corpus(cfg.data, cfg.impl)
  File "/localdisk/home/Work/Repositories/cramming/cramming/data/pretraining_preparation.py", line 40, in load_pretraining_corpus
    return _load_from_hub(cfg_data, data_path)
  File "/localdisk/home/Work/Repositories/cramming/cramming/data/pretraining_preparation.py", line 461, in _load_from_hub
    tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, split="train", streaming=cfg_data.streaming, cache_dir=data_path)["train"]
  File "/home/.local/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 60, in __getitem__
    raise NotImplementedError("Subclasses of Dataset should implement __getitem__.")
NotImplementedError: Subclasses of Dataset should implement __getitem__.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Have you encountered similar issues?

Thank you

Originally posted by @shiwenqin in #43 (comment)

@JonasGeiping
Owner Author

Is this still an issue?

@JonasGeiping
Owner Author

The problem might be related to version differences in the datasets package. The fix from the PR is only necessary for newer releases, and it causes a problem on older ones.
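Since the behaviour seems to depend on the installed release, it can help to state the exact versions when reporting back here. A minimal sketch, assuming only the standard library; the helper name installed_versions and the default package list are illustrative and not part of the cramming repository:

```python
from importlib import metadata


def installed_versions(packages=("datasets", "torch")):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in the current environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running it in the same environment as pretrain.py prints one line per package, which can be pasted into a comment.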

@shiwenqin
Contributor

I originally hit the same error message as @euclaise, and after applying the proposed change I get this error message instead.

My workaround was to change the line from

tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, "train", streaming=cfg_data.streaming, cache_dir=data_path)["train"]

to

tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, "default", streaming=cfg_data.streaming, cache_dir=data_path)["train"]

This solves the problem. However, I'm not familiar with the datasets package, so I'm not sure whether it is the right fix.
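For context on why the call is sensitive to its arguments: the second positional argument of datasets.load_dataset is the configuration name, not the split, and passing split="train" makes the call return a single dataset instead of a dictionary of splits, so a trailing ["train"] then indexes into the dataset itself. The sketch below is one possible way to tolerate both return shapes. It is not the change that was merged; select_split and FakeStreamingDataset are hypothetical names, and it assumes that split containers such as DatasetDict are dict subclasses.

```python
def select_split(loaded, split="train"):
    """Return the requested split from whatever the loader handed back.

    Assumption: containers of splits are dict subclasses, while a single
    dataset object is not.
    """
    if isinstance(loaded, dict):
        return loaded[split]
    # Already a single split; indexing it with a split name would fail.
    return loaded


# Stand-in so the sketch runs without the datasets package or network access.
class FakeStreamingDataset:
    def __getitem__(self, key):
        raise NotImplementedError("stand-in: no lookup by split name")


container = {"train": FakeStreamingDataset()}  # shape when no split is requested
single = FakeStreamingDataset()                # shape when split="train" is passed

assert select_split(container) is container["train"]
assert select_split(single) is single
```

In the script, such a helper would wrap the result of the load_dataset call in place of the hard-coded ["train"] lookup.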

@keeeeenw

This worked for me! For anyone interested, I am running Python 3.10, datasets 2.18.0, Ubuntu 22.0.

JonasGeiping added a commit that referenced this issue Mar 19, 2024
@JonasGeiping
Owner Author

This fix is now included in commit 2875a3b, thanks!
