From PR 43 #44

JonasGeiping · 2024-03-01T14:58:30Z

Thanks for the fix. However, when I run the pretraining script with the updated command, the following error is raised:

Resolving data files: 100%|███████████████████| 88/88 [00:02<00:00, 43.91it/s]
Error executing job with overrides: ['name=cram_24h', 'arch=crammed-bert', 'train=bert-o4', 'data=pile-readymade', 'budget=24']
Traceback (most recent call last):
  File "/localdisk/home/Work/Repositories/cramming/pretrain.py", line 196, in launch
    cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
  File "/localdisk/home/Work/Repositories/cramming/cramming/utils.py", line 54, in main_launcher
    metrics = main_fn(cfg, setup)
  File "/localdisk/home/Work/Repositories/cramming/pretrain.py", line 21, in main_training_process
    dataset, tokenizer = cramming.load_pretraining_corpus(cfg.data, cfg.impl)
  File "/localdisk/home/Work/Repositories/cramming/cramming/data/pretraining_preparation.py", line 40, in load_pretraining_corpus
    return _load_from_hub(cfg_data, data_path)
  File "/localdisk/home/Work/Repositories/cramming/cramming/data/pretraining_preparation.py", line 461, in _load_from_hub
    tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, split="train", streaming=cfg_data.streaming, cache_dir=data_path)["train"]
  File "/home/.local/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 60, in __getitem__
    raise NotImplementedError("Subclasses of Dataset should implement __getitem__.")
NotImplementedError: Subclasses of Dataset should implement __getitem__.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Have you encountered similar issues?

Thank you

Originally posted by @shiwenqin in #43 (comment)

JonasGeiping · 2024-03-01T14:59:17Z

Is this still an issue?

JonasGeiping · 2024-03-01T15:00:25Z

The problem might be related to version differences in the datasets package. The fix from the PR is only necessary for newer releases, and it causes problems on older ones.
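Since the behavior seems to depend on the installed release, it may help to attach version information when reporting. A minimal sketch, assuming `datasets` and `torch` are the packages of interest; the helper name `report_versions` is hypothetical and not part of the cramming repository:

```python
import platform
from importlib import metadata


def report_versions(packages=("datasets", "torch")):
    """Collect interpreter, platform, and package versions for a bug report.

    Hypothetical helper; the default package list is an assumption.
    """
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for pkg in packages:
        try:
            # Reads the version from the installed distribution's metadata.
            info[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            info[pkg] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in report_versions().items():
        print(f"{key}: {value}")
```

Pasting this output next to a traceback makes it easier to tell whether two reports come from comparable setups.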

shiwenqin · 2024-03-01T18:29:28Z

I originally faced the same error message as @euclaise, and after applying the proposed change, I instead get this error message.

The work-around I made to this problem is to change the line from

tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, "train", streaming=cfg_data.streaming, cache_dir=data_path)["train"]

to

tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, "default", streaming=cfg_data.streaming, cache_dir=data_path)["train"]

This solves the problem. However, I'm not familiar with the datasets package, so I'm not sure if it is the right fix.
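For context, the second positional argument of `datasets.load_dataset` is the configuration name, not the split, which is presumably why "default" is accepted where "train" is not. Likewise, passing `split="train"` already returns a single split, so indexing the result with `["train"]` again would fail, which matches the traceback above. A minimal sketch of a loader that tries several configuration names in turn is below. The helper `load_train_split`, the candidate order, and the set of caught exception types are assumptions for illustration, not the repository's code:

```python
from typing import Any, Callable, Optional, Sequence


def load_train_split(
    hf_location: str,
    streaming: bool = False,
    cache_dir: Optional[str] = None,
    config_names: Sequence[str] = ("default", "train"),
    load_fn: Optional[Callable[..., Any]] = None,
) -> Any:
    """Return the "train" split, trying each candidate config name in order.

    Hypothetical helper, not part of the cramming repository.
    """
    if load_fn is None:
        # Imported lazily so the helper can be exercised with a stand-in loader.
        import datasets

        load_fn = datasets.load_dataset

    errors = []
    for name in config_names:
        try:
            # Without split=..., the loader returns a mapping from split name
            # to dataset, so indexing with "train" is valid here.
            loaded = load_fn(hf_location, name, streaming=streaming, cache_dir=cache_dir)
            return loaded["train"]
        except (ValueError, KeyError, NotImplementedError) as err:
            # Assumption: a rejected config name or a missing split surfaces
            # as one of these exception types.
            errors.append(f"{name!r}: {type(err).__name__}: {err}")
    raise RuntimeError("no candidate config name worked:\n" + "\n".join(errors))
```

In `_load_from_hub`, the call would then look like `load_train_split(cfg_data.hf_location, streaming=cfg_data.streaming, cache_dir=data_path)`. Trying "default" first keeps the behavior of the workaround, while the second candidate covers setups where the older call still works.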

keeeeenw · 2024-03-15T04:51:11Z

> I originally faced the same error message as @euclaise, and after applying the proposed change, I instead get this error message.
>
> The work-around I made to this problem is to change the line from
>
> tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, "train", streaming=cfg_data.streaming, cache_dir=data_path)["train"]
>
> to
>
> tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, "default", streaming=cfg_data.streaming, cache_dir=data_path)["train"]
>
> This solves the problem. However, I'm not familiar with the datasets package, so I'm not sure if it is the right fix.

This worked for me! For anyone interested, I am running Python 3.10, datasets 2.18.0, Ubuntu 22.0.

JonasGeiping · 2024-03-19T21:38:48Z

This fix is now included in commit 2875a3b, thanks!

shiwenqin mentioned this issue Mar 1, 2024

Unable to replicate the results using the default command #45

Closed

JonasGeiping added a commit that referenced this issue Mar 19, 2024

Fix based on #44

2875a3b

JonasGeiping closed this as completed Mar 19, 2024
