The README mentions:

> The SFTTrainer version has to run with a lower batch size (4 vs 8) so we only do 2 gradient accumulation steps vs 4 in the QLoRA+FSDP version.
Is this reversed? If the batch size is smaller with SFTTrainer, wouldn't you use more gradient accumulation steps to keep the effective batch size the same?
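For concreteness, here's the arithmetic I'm doing (assuming both configurations are meant to see the same effective batch size per device, which the README doesn't state explicitly):

```python
# Effective batch size per device = micro-batch size * gradient accumulation steps.
# Micro-batch and accumulation numbers are the ones quoted from the README;
# that the two runs should match is my assumption.

def effective_batch(micro_batch: int, accum_steps: int) -> int:
    return micro_batch * accum_steps

qlora_fsdp = effective_batch(micro_batch=8, accum_steps=4)   # 32
sft_trainer = effective_batch(micro_batch=4, accum_steps=2)  # 8

print(qlora_fsdp, sft_trainer)  # 32 8 -- not matched

# To match the QLoRA+FSDP effective batch of 32 with a micro-batch of 4,
# SFTTrainer would need 8 accumulation steps, not 2:
assert effective_batch(micro_batch=4, accum_steps=8) == qlora_fsdp
```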
Separately, I note that the SFTTrainer and FSDP training runs take the same time on the graph shown. I assume SFTTrainer is using DDP, so shouldn't it be quite a bit slower? Perhaps even close to 2x slower, since the smaller batch size means more forward passes are required.
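To put a number on the "close to 2x" intuition (the dataset size below is hypothetical, just for illustration): at half the micro-batch size, the same data requires twice as many forward/backward passes per epoch, so comparable per-step costs would imply roughly double the wall-clock time.

```python
# Back-of-the-envelope step counts per GPU (num_samples is hypothetical):
num_samples = 10_000
steps_at_bs8 = num_samples // 8  # 1250 forward/backward passes per epoch
steps_at_bs4 = num_samples // 4  # 2500 -- twice as many passes
print(steps_at_bs4 / steps_at_bs8)  # 2.0
```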