Discussion on preprocessing of LAION data #32

Closed
bokyeong1015 opened this issue Sep 3, 2023 · 3 comments
@bokyeong1015

[Question]

I have another question.

I split the LAION-aesthetic V2 5+ dataset into several subsets (e.g., 5M, 10M, and 89M), and I made a metadata.csv for each subset.
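For reference, each metadata.csv was assembled roughly like the sketch below (the directory layout and paths are illustrative, not my exact script; the file_name/text header follows the Hugging Face datasets imagefolder convention):

```python
import csv
from pathlib import Path

# Illustrative layout: each subset folder holds <id>.jpg / <id>.txt pairs.
subset_dir = Path("data/laion_aes_5m")  # placeholder path

with open(subset_dir / "metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["file_name", "text"])  # imagefolder-style header
    for img_path in sorted(subset_dir.glob("*.jpg")):
        caption = img_path.with_suffix(".txt").read_text(encoding="utf-8").strip()
        writer.writerow([img_path.name, caption])
```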

Then, when I tried to train with multiple GPUs using a subset, I hit the error below.

I guess that the problem was caused by the data itself.

FYI, I didn't pre-process the data except for resizing to 512x512 when I downloaded it.
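The download step looked roughly like this (a sketch of an img2dataset call; the parquet file name, column names, and output folder are placeholders, not my exact command):

```python
from img2dataset import download  # https://github.com/rom1504/img2dataset

download(
    url_list="laion_aes_subset.parquet",  # placeholder URL/caption list
    input_format="parquet",
    url_col="URL",
    caption_col="TEXT",
    image_size=512,             # resize to 512x512 at download time
    resize_mode="center_crop",
    output_format="files",      # writes <id>.jpg / <id>.txt pairs
    output_folder="data/laion_aes_5m",
)
```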

Did you also face this problem?

Or did you conduct any pre-processing of the LAION data?

Steps: 0%| | 283/400000 [35:52<813:24:06, 7.33s/it, kd_feat_loss=58.6, kd_output_loss=0.0447, lr=5e-5, sd_loss=0.185, step_loss=58.9]
Traceback (most recent call last):
  File "/home/user01/bk-sdm/src/kd_train_text_to_image.py", line 1171, in <module>
    main()
  File "/home/user01/bk-sdm/src/kd_train_text_to_image.py", line 961, in main
    for step, batch in enumerate(train_dataloader):
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/accelerate/data_loader.py", line 388, in __iter__
    next_batch = next(dataloader_iter)
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 56, in fetch
    data = self.dataset.__getitems__(possibly_batched_index)
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in __getitems__
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in <listcomp>
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in <dictcomp>
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
IndexError: index 63 is out of bounds for dimension 0 with size 63


bokyeong1015 commented Sep 3, 2023

@youngwanLEE, please find our response below:

Did you also face this problem?

We haven’t encountered such an error (IndexError: index 63 is out of bounds for dimension 0 with size 63).

Did you conduct any pre-processing of the LAION data?

We removed some problematic image-text pairs (empty text files and PIL-unreadable images); however, this led to error messages different from yours.
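As a rough sketch of that kind of filtering (paths are placeholders; this is not our exact script, and PIL's verify() does not catch every corrupt file):

```python
from pathlib import Path
from PIL import Image

data_dir = Path("data/laion_aes_5m")  # placeholder path

for img_path in sorted(data_dir.glob("*.jpg")):
    txt_path = img_path.with_suffix(".txt")
    # Drop pairs whose caption file is missing or empty.
    if not txt_path.exists() or not txt_path.read_text(encoding="utf-8").strip():
        img_path.unlink()
        txt_path.unlink(missing_ok=True)
        continue
    # Drop pairs whose image PIL cannot open and decode.
    try:
        with Image.open(img_path) as im:
            im.verify()
    except Exception:
        img_path.unlink()
        txt_path.unlink()
```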


We’ve tried to reproduce this error (by changing batch sizes under a multi-GPU setting, adding empty lines to metadata.csv, and using very long or multi-line text prompts), but failed to do so: either no error or a different one occurred.

Could you provide more context about this error? We would greatly appreciate it if you could share any updates and/or your solution to this issue.


youngwanLEE commented Sep 5, 2023

@bokyeong1015 thanks for your effort :)

I finally solved this problem.

The problem was caused by empty text files in the dataset.

When I filtered the empty text pairs, the problem was solved.
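For anyone hitting the same IndexError, the fix at the metadata.csv level can be as simple as the sketch below (the path and the file_name/text column names are assumptions based on the imagefolder convention, not my exact script):

```python
import pandas as pd

df = pd.read_csv("data/laion_aes_5m/metadata.csv")  # placeholder path
# Keep only rows with a non-empty caption; empty text entries triggered the crash.
mask = df["text"].notna() & df["text"].str.strip().astype(bool)
df[mask].to_csv("data/laion_aes_5m/metadata.csv", index=False)
```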

I have now started training models on larger datasets of over 10M image-text pairs.

Thanks again :)

It would be ok to close this issue.

@bokyeong1015

Great, thanks for sharing! Hope your training goes well :)
