@aireenmei
Collaborator

No description provided.

@aireenmei aireenmei force-pushed the aireen/data_input_hf branch 13 times, most recently from 017fdc1 to cc51757 Compare October 4, 2024 06:41
@aireenmei aireenmei force-pushed the aireen/data_input_hf branch 2 times, most recently from 5102f59 to e9ce1da Compare October 17, 2024 05:37
@aireenmei aireenmei force-pushed the aireen/data_input_hf branch from e9ce1da to aa0618a Compare October 17, 2024 06:10
@aireenmei
Collaborator Author

aireenmei commented Oct 17, 2024

Perf testing

  • No v5p-128 capacity was available, so I tested on v4-128 with per_device_batch_size=1 (larger batch sizes cause OOM).
  • HF streaming applies caption encoding and the VAE on the fly, so the baseline is the original "make_pokemon_iterator" with those operations also applied on the fly. On v4-128 with a small batch size, HF streaming a larger dataset has a step time comparable to the baseline. I use BleachNick/UltraEdit_500k as the example here and as the quickstart example in the doc, because other datasets with better resolution need custom preprocessing to clean or reformat their captions.
  • baseline config: dataset_type=tf cache_latents_text_encoder_outputs=False dataset_name=diffusers/pokemon-gpt4-captions train_split=split
  • baseline cloud logging: https://cloudlogging.app.goo.gl/gqvkgydmkak7UfJW6
  • HF streaming config: dataset_type=hf dataset_name=BleachNick/UltraEdit_500k image_column=source_image caption_column=source_caption train_split=FreeForm
  • HF streaming cloud logging: https://cloudlogging.app.goo.gl/xjihHuCPAAbQmu3s6
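
To make the difference between the two runs easier to see, here is a minimal Python sketch that parses the two override strings quoted above and lists the keys that differ. The helper names (`parse_overrides`, `diff_overrides`, `open_hf_stream`) are hypothetical and not part of this PR. `open_hf_stream` assumes the streaming path is equivalent to calling Hugging Face `datasets.load_dataset` with `streaming=True`; the PR's actual input pipeline may differ.

```python
import shlex

# Override strings copied from the two configs listed above.
BASELINE = (
    "dataset_type=tf cache_latents_text_encoder_outputs=False "
    "dataset_name=diffusers/pokemon-gpt4-captions train_split=split"
)
HF_STREAMING = (
    "dataset_type=hf dataset_name=BleachNick/UltraEdit_500k "
    "image_column=source_image caption_column=source_caption "
    "train_split=FreeForm"
)


def parse_overrides(text):
    """Turn 'key=value key=value' overrides into a dict of strings."""
    return dict(token.split("=", 1) for token in shlex.split(text))


def diff_overrides(a, b):
    """Return {key: (a_value, b_value)} for keys that differ or are missing."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


def open_hf_stream(cfg):
    """Hypothetical helper: open the dataset lazily instead of downloading it.

    Assumption: the `datasets` package is installed and network access is
    available. The import is local so the rest of the sketch runs without it.
    """
    from datasets import load_dataset

    return load_dataset(
        cfg["dataset_name"], split=cfg["train_split"], streaming=True
    )


if __name__ == "__main__":
    base = parse_overrides(BASELINE)
    hf = parse_overrides(HF_STREAMING)
    for key, (old, new) in diff_overrides(base, hf).items():
        print(f"{key}: {old} -> {new}")
```

Calling `open_hf_stream(parse_overrides(HF_STREAMING))` would return an iterable dataset whose examples are fetched as they are consumed. Under the same assumption, caption encoding and the VAE step would be applied per example as the stream is read.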

@aireenmei aireenmei marked this pull request as ready for review October 17, 2024 06:32
@aireenmei aireenmei requested a review from entrpn October 17, 2024 06:32
@entrpn entrpn merged commit 6271ab7 into main Oct 17, 2024
3 checks passed
@entrpn entrpn deleted the aireen/data_input_hf branch October 17, 2024 18:16
3 participants